
Topic-based Classifier Cookbook

Last Updated: Nov 13, 2017



Introduction

This document describes how to create and manage Topic-based Classifiers in Compass. Please refer to the Compass API reference for details on logging in and creating a project.

Once you have created a Compass project, you can create one or more topic-based classifiers in it. A topic-based classifier is semi-supervised: it relies on topics defined by the user and does not require training per se.
 

Creating a Classifier

Creating a Classifier via the API is a three-step process:
  1. Upload reference documents
  2. Create topics
  3. Build the classifier
 

Uploading Reference Documents

The topic-based classifier relies on the semantic space constructed from a reference set of documents. Use the project’s /documents endpoint to POST documents. Posting them in batches of about 1,000 is most efficient. Upload speed is around 880 KiB per second.

Each document must be a JSON-encoded dictionary with the following fields:
  • text (string; required): the text content of the document, in UTF-8 encoding
  • label (string; optional): the class the document belongs to 
  • dataset (string; optional): the collection the document belongs to

The text field is mandatory. The dataset attribute is optional and defaults to the empty string; it is useful if you intend to build more than one classifier, or if you want to control which subset of the full dataset is used. The label attribute does not apply to the topic-based classifier and may be left blank.
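
As a minimal sketch of the batching advice above, the helpers below build document dictionaries and split them into batches of 1,000. The function names are illustrative, and sending each batch as a single JSON list in the POST body is an assumption; check the Compass API reference for the exact request format.

```python
import json
from typing import Dict, Iterable, Iterator, List, Optional

BATCH_SIZE = 1000  # batch size suggested in the text above


def make_document(text: str, dataset: str = "", label: Optional[str] = None) -> Dict[str, str]:
    """Build one document dictionary; only "text" is required."""
    if not text:
        raise ValueError("the text field is mandatory")
    doc = {"text": text, "dataset": dataset}
    if label is not None:
        doc["label"] = label
    return doc


def batches(docs: Iterable[Dict[str, str]], size: int = BATCH_SIZE) -> Iterator[List[Dict[str, str]]]:
    """Yield successive lists of at most `size` documents."""
    batch: List[Dict[str, str]] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def encode_batch(batch: List[Dict[str, str]]) -> bytes:
    """JSON-encode a batch as UTF-8 bytes (assumed POST body format)."""
    return json.dumps(batch, ensure_ascii=False).encode("utf-8")
```

Each encoded batch would then be POSTed to the project's /documents endpoint with whatever HTTP client and authentication your integration already uses.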
 

Creating Topics

The user must define one or more topics for the topic-based classifier. Creating useful topics is both art and science. We typically see two approaches:
1. Users can leverage their knowledge of the domain and their professional intuition to define topics directly. For example, at a bank, "open checking account", "withdraw money", and "apply for credit card" are all typical candidates.
2. Users can rely on Luminoso's AI-aided "Topic Discovery" solutions, which find and generate a set of potential topics (or intents) present in the corpus of data.

Whichever method is used, the resulting topics must be specified as text, following a specific syntax.

The two methods are not mutually exclusive; in fact, starting with a known list and using the AI-aided topic discovery to optimize it is a best practice.

A topic specification is a string composed of terms (operands) linked by Boolean operators. There are three operators: AND, OR, and NOT, with their usual meaning; they must appear in all-uppercase. Operands may be single words or multi-word phrases. Apart from the parentheses used to group clauses, the string must not contain any punctuation (including quotation marks).

Specifications may range from the very simple (two terms linked by a single operator) to the arbitrarily complex (with parentheses demarcating embedded clauses). Here are some examples:
 

coffee OR tea
chips AND dip
card AND (debit OR credit)
(bagel AND (schmear OR cream cheese)) OR oatmeal
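
The syntax rules above can be checked on the client side before a topic is submitted. The following is only a sketch of such a check, not the parser Compass uses: it verifies balanced parentheses, flags punctuation, and (as an assumption about likely mistakes) warns when an operator word appears in lowercase. The function name check_clause is hypothetical.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}
# A token is either a single parenthesis or a run of non-space, non-parenthesis characters.
TOKEN_RE = re.compile(r"[()]|[^\s()]+")


def check_clause(clause: str) -> list:
    """Return a list of problems found in a topic clause (empty if none)."""
    problems = []
    depth = 0
    for token in TOKEN_RE.findall(clause):
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched closing parenthesis")
                depth = 0
        elif token.upper() in OPERATORS and token not in OPERATORS:
            # Assumption: a lowercase operator word is probably a typo.
            problems.append(f"operator must be uppercase: {token!r}")
        elif not token.isalnum():
            problems.append(f"punctuation is not allowed: {token!r}")
    if depth > 0:
        problems.append("unmatched opening parenthesis")
    return problems
```

For instance, check_clause("card and (debit OR credit)") reports the lowercase operator, while each of the four example clauses above passes with an empty list.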


There are two ways to create topics: in bulk or individually.

To create topics in bulk, specify the topics list in the info dictionary of the classifier specification, as in the following example:
 

"topics": [
      {"title": "breakfast", "clause": "bagel AND (schmear OR cream cheese)"},
      {"title": "coffee", "clause": "drink AND coffee AND NOT latte"},
      {"title": "lunch", "clause": "falafel OR chickpea fritter sandwich"},
      {"title": "dessert", "clause": "cake OR pie OR ice cream"}
    ]


To create topics individually, after the classifier has been created, use the classifier's /topics endpoint to POST topics.
 

Building the Classifier

When the documents needed to create the semantic space have been uploaded, use the project’s /classifiers endpoint to POST a classifier specification with the following content: 

  • name: a string that uniquely identifies the classifier for the project
  • type: "topic_based"
  • status: "active"
  • info: {}

The info dictionary may be empty. The following attributes may also be added to it:
  • threshold: a floating-point number between 0 and 1 to indicate the minimum confidence score a document must achieve to be assigned a label (default value: 0.4)
  • num_topics: the number of topics (labels) to assign to a document (default value: 1), or the string "ALL" if you want to see the scores for all labels. If a number X greater than 1 is provided, the system returns the top X labels above the threshold
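
A classifier specification with these fields could be assembled as in the sketch below. The helper name classifier_spec and its validation are illustrative assumptions; only the field names and value ranges come from the description above.

```python
import json


def classifier_spec(name, topics=None, threshold=None, num_topics=None):
    """Build the dictionary to POST to the project's /classifiers endpoint."""
    if threshold is not None and not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    if num_topics is not None and num_topics != "ALL":
        if not isinstance(num_topics, int) or num_topics < 1:
            raise ValueError('num_topics must be a positive integer or "ALL"')

    info = {}  # an empty info dictionary is valid; server defaults then apply
    if threshold is not None:
        info["threshold"] = threshold
    if num_topics is not None:
        info["num_topics"] = num_topics
    if topics:
        info["topics"] = list(topics)  # bulk topic creation

    return {"name": name, "type": "topic_based", "status": "active", "info": info}


spec = classifier_spec(
    "menu",
    topics=[{"title": "dessert", "clause": "cake OR pie OR ice cream"}],
    threshold=0.5,
    num_topics=3,
)
body = json.dumps(spec)  # request body for the POST
```

Leaving threshold and num_topics out of the call keeps them out of the info dictionary, so the documented defaults (0.4 and 1) are used.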

The response will contain the URL for the newly created classifier, together with its training_state:
{
    "url": "https://.../api/projects/<proj_id>/p/classifiers/1/",
    "name": "unique name for classifier",
    "type": "topic_based",
    "status": "inactive",
    "training_state": "in_training",
    "topics": [],
    "info": {
        "num_topics": 1,
        "threshold": 0.4
    },
    "created": "2017-06-28T18:03:50.256085+00:00",
    "last_update": "2017-06-28T18:03:50.256085+00:00"
}

While the classifier is under construction, its training_state is set to "in_training" and its status to "inactive". Use the URL provided to check periodically until the training_state value has changed to "ready" and the status to "active".
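
The periodic check described above can be written as a small polling loop. In this sketch, fetch is a caller-supplied function (hypothetical) that performs the GET on the classifier URL and returns the decoded JSON; the interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Callable, Dict


def wait_until_ready(
    fetch: Callable[[], Dict],
    interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until training_state is "ready" and status is "active"."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch()
        if state.get("training_state") == "ready" and state.get("status") == "active":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("classifier was not ready before the timeout")
        sleep(interval)
```

Passing the HTTP call in as a function keeps the loop independent of any particular client library and makes it easy to exercise with canned responses.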

 

Updating a Classifier

After a classifier has been built, the following attributes may be changed by means of a PUT request to the classifier’s URL:
  • name: unique name for the classifier
  • status: either active or inactive; the value must be active for any classification to happen
  • num_topics
  • threshold

The following attributes are read-only and cannot be modified:
  • dataset
  • training_state
  • created
  • updated

You can also create new topics, update existing topics or delete topics. Changes to topics are instantaneous and do not require re-building or re-training a classifier.

To add topics, use POST to the classifier's /topics endpoint. To update a topic, use PUT on the classifier's /topics endpoint. To delete a topic, use DELETE on the classifier's /topics endpoint.

 

Multiple Classifiers

A Compass project can have more than one classifier. In the case of multiple classifiers, each message is classified by all the classifiers present. No classification order is enforced; all classifiers are treated equally. It is possible to mix and match classifier types: for example, you can have a voting classifier and a topic-based one in the same project.
 

Classifying Documents

Compass classifies documents by presenting them to each active classifier in the project. If no classifiers are active, no classification is attempted and no topics are returned.

Documents can be POSTed for classification in two ways. One uses the project’s /messages endpoint, and saves each message in the project database. The other uses the /classify endpoint and does not save anything, which makes it significantly faster than the first.

Both endpoints return, in near real time, a response object for each message or document. Each response object contains a topics element with a list of the topics predicted for the message. If just one classifier is active and its num_topics attribute is set to 1, the list will contain one entry. If multiple classifiers are active, or the num_topics attribute is greater than 1, the list will contain multiple entries, one for each of the topics predicted. Each entry in the list has these four attributes:
  • source: the name of the Classifier that assigned the topic
  • name: one of the labels (class names) from the Classifier training set, or UNCLASSIFIED if the document did not match any of them with a high enough confidence
  • score: a value between 0 and 1 that indicates the strength of the match between the text and the topic; for UNCLASSIFIED topics the score will be 0 
  • id: the internal identifier of the topic (the last digit of the corresponding topic URL in the classifier definition)
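
To illustrate how the topics list might be consumed, the sketch below groups predictions by classifier and drops UNCLASSIFIED entries. The function name and the sample values used in the usage note are invented; only the four attribute names come from the list above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def predicted_topics(response: Dict, min_score: float = 0.0) -> Dict[str, List[Tuple[str, float]]]:
    """Map each classifier name to its (topic name, score) pairs, best first."""
    by_source = defaultdict(list)
    for entry in response.get("topics", []):
        # UNCLASSIFIED means no topic matched with enough confidence.
        if entry["name"] == "UNCLASSIFIED" or entry["score"] < min_score:
            continue
        by_source[entry["source"]].append((entry["name"], entry["score"]))
    for matches in by_source.values():
        matches.sort(key=lambda match: match[1], reverse=True)
    return dict(by_source)
```

Given a hypothetical response whose topics list holds "coffee" (score 0.62) and "breakfast" (score 0.81) from a classifier named "menu", plus an UNCLASSIFIED entry from another classifier, the function returns only the "menu" key, with "breakfast" listed first.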
 

/classify

Documents POSTed to the /classify endpoint are not saved in the database, and therefore get classified more quickly. The documents must be JSON-encoded dictionaries with the following fields:
  • text: the content of the document
  • source_id: a unique ID string of arbitrary length
 

/messages

Messages POSTed to the /messages endpoint are saved in the database, and may participate in Compass’s default clustering mechanism, which is described in the support documentation. Messages must be JSON-encoded dictionaries with the following fields:
  • text: the content of the message
  • timestamp: date and time expressed in UTC (Coordinated Universal Time) format

Note: the labels_for_clustering configuration parameter on the project can be set to specify a list of topics (by label or ID) from one or more active classifiers, to filter the messages participating in the default unsupervised clustering. A good use of this feature is to designate UNCLASSIFIED messages for further clustering, in order to find the themes within messages that have not been classified into known "buckets".
 

Processing Times


Building a Classifier

Building a classifier is a relatively time-consuming process: the duration is directly proportional to the size of the training-document collection. It consists of two stages:
  1. Preparing the input text
  2. Building the semantic space

Here are two illustrative examples for two sets of reference documents. The first set contains about 65,000 tweet-sized messages (total size: 8.4 MiB) and was trained for 5 labels. The second contains about 71,000 short newspaper articles (approximately 2000 characters per article, for a total of 135 MiB). Times are in seconds.

 
stage             tweet-sized   news articles
stage 1                   112            1613
stage 2                    70            1214
total (seconds)           182            2827
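
The proportionality between build time and collection size can be checked against the two examples above. This short calculation uses only the figures from the table and the stated collection sizes; the dictionary layout is just for illustration.

```python
# Figures copied from the table and the collection sizes quoted above.
BENCHMARKS = {
    "tweet-sized": {"size_mib": 8.4, "stage_1": 112, "stage_2": 70},
    "news articles": {"size_mib": 135, "stage_1": 1613, "stage_2": 1214},
}


def seconds_per_mib(entry: dict) -> float:
    """Total build time divided by collection size."""
    return (entry["stage_1"] + entry["stage_2"]) / entry["size_mib"]


rates = {name: round(seconds_per_mib(entry), 1) for name, entry in BENCHMARKS.items()}
print(rates)
```

Both collections come out at roughly 21 to 22 seconds per MiB, which is consistent with build time scaling linearly with input size. Two data points are not enough to treat this as a reliable estimate for other corpora.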
 

Classifying Documents

Activating a new classifier takes about 3 to 30 seconds depending on its size, and that cost will be borne by the first few messages presented to it. Once a classifier has “warmed up”, its performance should be constant if the number of projects and classifiers remains constant.
 

We define performance as the time it takes to classify a message. Three factors impact performance:

  • whether you save the document classified: the /classify endpoint is two to three times faster than /messages, because it saves nothing to the project database
  • the size of the message: larger messages take longer to prepare than smaller ones
  • the complexity of the classifier: classifiers with a large number of classes (topics) take longer to process documents
 

Throughput is the number of messages classified per unit of time. To increase the number of messages processed, two techniques may be used:

  • increase batch size: submitting more than one message at a time creates economies of scale, so the average processing time per document decreases. A batch size of 10 to 30 is optimal.
  • use multiple clients (threads): by making requests in parallel you can process more messages per unit of time. We recommend using no more than 5 threads.

Here are some performance numbers for illustrative purposes:
  • Short messages (10-15 words), few topics (4): classifying one message at a time, performance is 53 ms per message and throughput is 19 messages per second.
  • Longer messages (200 words), many topics (300): classifying one message at a time, performance is 95 ms per message and throughput is 11 messages per second.
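
The two throughput techniques can be combined, as in the following sketch. The classify_batch argument is a caller-supplied function (hypothetical) that POSTs one batch to the /classify endpoint and returns one result per message; the batch size of 20 is simply a value inside the recommended range.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

BATCH_SIZE = 20   # within the recommended range of 10 to 30
MAX_WORKERS = 5   # recommended upper limit on parallel clients


def classify_all(
    messages: Sequence,
    classify_batch: Callable[[Sequence], List],
    batch_size: int = BATCH_SIZE,
    workers: int = MAX_WORKERS,
) -> List:
    """Classify messages in batches across a small thread pool, keeping input order."""
    chunks = [messages[i:i + batch_size] for i in range(0, len(messages), batch_size)]
    with ThreadPoolExecutor(max_workers=min(workers, MAX_WORKERS)) as pool:
        # pool.map returns results in the same order as the submitted chunks.
        return [result for chunk in pool.map(classify_batch, chunks) for result in chunk]
```

Capping the pool at five workers follows the recommendation above even if a caller asks for more. Any exception raised inside classify_batch propagates to the caller when its result is read.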
