Voting Classifier Cookbook

Last Updated: Sep 01, 2017


Introduction

This document describes how to create and manage Voting Classifiers in Compass. Please refer to the Compass API reference for details on logging in and creating a project.

Once you have created a Compass project, you can create one or more voting classifiers in it. A Voting Classifier is a supervised classifier, so you must supply labeled data to train it.
 

Creating a Classifier

Creating a Classifier via the API is a two-step process:
  1. Upload your (labeled) training documents to the project
  2. Build the classifier
 

Uploading Training Documents

Use the project’s /documents endpoint to POST either a single document or a list of them; posting the documents in batches of about 1,000 is the most efficient. Upload speed is around 880 KiB per second.
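As a rough planning aid, the quoted upload speed can be turned into an estimate of how long a collection takes to upload. The sketch below is only arithmetic on the 880 KiB per second figure; the function name is hypothetical, and real transfer times will vary with network conditions.

```python
def upload_seconds(total_mib, kib_per_second=880.0):
    """Estimate upload time in seconds for a collection of `total_mib` MiB."""
    return total_mib * 1024.0 / kib_per_second


# An 8.4 MiB collection takes roughly 10 seconds;
# a 135 MiB collection takes a little over two and a half minutes.
small = upload_seconds(8.4)
large = upload_seconds(135)
```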

Each document must be a JSON-encoded dictionary with the following fields:
  • text (string; required): the text content of the document, in UTF-8 encoding
  • label (string; optional): the class the document belongs to 
  • dataset (string; optional): the collection the document belongs to

The text field is mandatory. The dataset attribute is useful if you intend to train more than one classifier, but it is not required; it defaults to the empty string. The label attribute is also optional, although at least some documents must have it for the system to be able to train a classifier from them. Documents without labels are used to “enrich” the semantic associations over the entire document collection and may thereby improve classifier accuracy.
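The upload step might be sketched as follows, using only the Python standard library. The helper names, the trailing slash on the endpoint, and the `extra_headers` parameter are assumptions made for illustration; authentication is covered by the Compass API reference and is deliberately left to the caller. The request is prepared but not sent.

```python
import json
import urllib.request


def make_document(text, label=None, dataset=None):
    """Build one training document; only `text` is required."""
    if not isinstance(text, str) or not text:
        raise ValueError("text is required and must be a non-empty string")
    doc = {"text": text}
    if label is not None:
        doc["label"] = label
    if dataset is not None:
        doc["dataset"] = dataset
    return doc


def batches(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_request(project_url, batch, extra_headers=None):
    """Prepare (but do not send) a POST of one batch to the /documents endpoint."""
    headers = {"Content-Type": "application/json"}
    headers.update(extra_headers or {})  # e.g. authentication (assumed)
    return urllib.request.Request(
        project_url.rstrip("/") + "/documents/",  # trailing slash is an assumption
        data=json.dumps(batch).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Each prepared request could then be passed to `urllib.request.urlopen`, one batch at a time.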

How much training data is needed to create a good classifier is outside the scope of this document; feel free to reach out to your Luminoso representative to discuss it further.
 

Building a Classifier

When all documents needed to train the classifier have been uploaded, use the project’s /classifiers endpoint to POST a classifier specification with the following content: 
  • name: a string that uniquely identifies the classifier for the project
  • type: "voting"
  • status: "active"
  • info: {}

The info dictionary may be empty, in which case all the documents with a non-empty label value will be used to train the classifier. Alternatively, the info dictionary may contain a dataset directive that names the subset of documents to use; the documents in that dataset must have non-empty label fields, or no classifier can be trained. The following attributes may also be added to the info dictionary:
  • threshold: a floating-point number between 0 and 1 to indicate the minimum confidence score a document must achieve to be assigned a label (default value: 0.4)
  • num_topics: the number of topics (labels) to assign to a document (default value: 1) or the string "ALL" if you want to see the scores for all labels. If a number greater than 1 is provided, the system will return the “top X” labels above the threshold
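A minimal sketch of assembling such a specification in Python is shown below. The function name and its validation rules are illustrative assumptions; only the field names and value ranges come from the description above.

```python
def classifier_spec(name, dataset=None, threshold=None, num_topics=None):
    """Build the JSON-serializable body for a POST to the /classifiers endpoint."""
    if not name:
        raise ValueError("name must be a non-empty string")
    info = {}
    if dataset is not None:
        info["dataset"] = dataset
    if threshold is not None:
        if not 0 <= threshold <= 1:
            raise ValueError("threshold must be between 0 and 1")
        info["threshold"] = threshold
    if num_topics is not None:
        is_count = isinstance(num_topics, int) and num_topics >= 1
        if num_topics != "ALL" and not is_count:
            raise ValueError('num_topics must be a positive integer or "ALL"')
        info["num_topics"] = num_topics
    # An empty info dictionary means: train on every labeled document.
    return {"name": name, "type": "voting", "status": "active", "info": info}
```

Leaving `threshold` and `num_topics` unset keeps them out of the request, so the documented defaults (0.4 and 1) apply.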

The response will contain the URL for the newly created classifier, together with its training_state:
{
    "url": "https://.../api/projects/<proj_id>/p/classifiers/1/",
    "name": "unique name for classifier",
    "type": "voting",
    "status": "inactive",
    "training_state": "in_training",
    "topics": [],
    "info": {
        "num_topics": 1,
        "threshold": 0.4
    },
    "created": "2017-06-28T18:03:50.256085+00:00",
    "last_update": "2017-06-28T18:03:50.256085+00:00"
}

While the classifier is under construction, its training_state is set to in_training and its status to inactive. Use the URL provided to check periodically until the training_state value has changed to ready and the status to active. Depending on the number and size of your training documents, this may take a few minutes to several hours.
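The periodic check could look something like the sketch below. It assumes a caller-supplied `fetch_classifier` function (hypothetical) that performs an authenticated GET on the classifier URL and returns the decoded JSON dictionary; the polling interval and timeout are arbitrary defaults, not recommended values.

```python
import time


def wait_until_ready(fetch_classifier, poll_seconds=30.0,
                     timeout_seconds=6 * 3600, sleep=time.sleep,
                     clock=time.monotonic):
    """Poll until training_state is "ready" and status is "active"."""
    deadline = clock() + timeout_seconds
    while True:
        state = fetch_classifier()
        if (state.get("training_state") == "ready"
                and state.get("status") == "active"):
            return state
        if clock() >= deadline:
            raise TimeoutError("classifier was not ready before the timeout")
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.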

When the classifier is ready to use, the topics list in the response will contain one entry for every label that has at least one representative document in the training data, together with an UNCLASSIFIED entry that will be used if none of the other labels matches an incoming message well enough. Each topics entry has the following elements (only the first three are important):
  • url: full URL to the topic
  • title: the label name
  • status: active or inactive (same as the classifier status) 
  • blocking: false (internal use) 
  • created: creation date and time (UTC format)
  • ended: null or date and time the topic stopped being active (UTC format)
  • info: {} (internal use) 
  • measurements: {} (overall statistics about volume, velocity, and acceleration) 
  • time_buckets: {} (statistics over equal time intervals) 
  • messages_url: full URL to the messages that were (or will be) classified into the topic
 

Updating a Classifier

After a classifier has been built, the following attributes may be changed by means of a PUT request to the classifier’s URL:
  • name: unique name for the classifier
  • status: either active or inactive; the value must be active for any classification to happen
  • num_topics
  • threshold

The following attributes are read-only and cannot be modified:
  • dataset
  • training_state
  • created
  • updated
  • topics
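Before sending a PUT request, a client might guard against accidentally including read-only attributes. The sketch below only validates attribute names against the two lists above; the function name is hypothetical, and where num_topics and threshold belong in the request body (top level or inside info) is not specified here, so the result is returned as a flat dictionary for the caller to arrange.

```python
MUTABLE = {"name", "status", "num_topics", "threshold"}
READ_ONLY = {"dataset", "training_state", "created", "updated", "topics"}


def checked_update(**changes):
    """Return `changes` if every attribute may be modified, else raise ValueError."""
    for key in changes:
        if key in READ_ONLY:
            raise ValueError("%s is read-only" % key)
        if key not in MUTABLE:
            raise ValueError("%s is not a known classifier attribute" % key)
    if "status" in changes and changes["status"] not in ("active", "inactive"):
        raise ValueError("status must be active or inactive")
    return dict(changes)
```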
 

Multiple Classifiers

A Compass project can have more than one classifier. In the case of multiple classifiers, each message is classified by all the classifiers present. No classification order is enforced, i.e. all classifiers are created equal.
 

Classifying Documents

Compass classifies documents by presenting them to each active classifier in the project. If no classifiers are active, no classification is attempted and no topics are returned.

Documents can be POSTed for classification in two ways. One uses the project’s /messages endpoint, and saves each message in the project database. The other uses the /classify endpoint and does not save anything. Because of this, it is significantly faster than the first.

Both endpoints respond in near real time with one response object per message or document. Each response object contains a topics element listing the topics predicted for the message. If just one classifier is active and its num_topics attribute is set to 1, the list will contain one entry. If multiple classifiers are active, or the num_topics attribute is greater than 1, the list will contain multiple entries, one for each of the topics predicted. Each entry in the list has these four attributes:
  • source: the name of the Voting Classifier that assigned the topic
  • name: one of the labels (class names) from the Classifier training set, or UNCLASSIFIED if the document did not match any of them with a high enough confidence
  • score: a value between 0 and 1 that indicates the strength of the match between the text and the topic; for UNCLASSIFIED topics the score will be 0 
  • id: the internal identifier of the topic (the final number in the corresponding topic URL in the classifier definition)
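When several classifiers are active, it can help to group the returned topics by the classifier that produced them. The sketch below assumes a decoded response object shaped as described above; the function name and the sample values in the usage note are made up for illustration.

```python
from collections import defaultdict


def topics_by_classifier(response):
    """Map each classifier name to its (label, score) pairs, best score first."""
    grouped = defaultdict(list)
    for topic in response.get("topics", []):
        grouped[topic["source"]].append((topic["name"], topic["score"]))
    return {
        source: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for source, pairs in grouped.items()
    }
```

For example, a response whose topics list holds two entries from a classifier named "billing" and one UNCLASSIFIED entry from a classifier named "urgency" yields a dictionary with those two keys.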
 

/classify

Documents POSTed to the /classify endpoint are not saved in the database, and therefore get classified more quickly. The documents must be JSON-encoded dictionaries with the following fields:
  • text: the content of the document
  • source_id: a unique ID string of arbitrary length
 

/messages

Messages POSTed to the /messages endpoint are saved in the database, and may participate in Compass’s default clustering mechanism, which is described in the support documentation. Messages must be JSON-encoded dictionaries with the following fields:
  • text: the content of the message
  • timestamp: date and time expressed in UTC (Coordinated Universal Time) format

Note: the labels_for_clustering configuration parameter on the project can be set to specify a list of topics (by label or ID) from one or more active Voting Classifiers to filter the messages participating in the default unsupervised clustering. A good use of this feature is to designate UNCLASSIFIED messages for further clustering, to find the themes within messages that have not been classified into known "buckets".
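The two payload shapes might be built as in the sketch below. Two assumptions are worth flagging: the timestamp is rendered as an ISO 8601 string in UTC, mirroring the created field shown in the classifier response, and a random hexadecimal string is generated when no source_id is given. Neither choice is confirmed by this document, and the function names are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def classify_document(text, source_id=None):
    """Build one document for the /classify endpoint."""
    return {"text": text, "source_id": source_id or uuid.uuid4().hex}


def message_document(text, when=None):
    """Build one message for the /messages endpoint with a UTC timestamp."""
    when = when or datetime.now(timezone.utc)
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # ISO 8601 rendering is an assumption about the accepted format.
    return {"text": text, "timestamp": when.astimezone(timezone.utc).isoformat()}
```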
 

Processing Times


Building a Classifier

Building a classifier is a time-consuming process: the duration is directly proportional to the size of the training-document collection. It consists of three stages.
  1. Preparing the input text
  2. Building the semantic space
  3. Training the voting classifiers


Here are two illustrative examples for two sets of training documents. The first set contains about 65,000 tweet-sized messages (total size: 8.4 MiB) and was trained for 5 labels. The second contains about 71,000 short newspaper articles (approximately 2000 characters per article, for a total of 135 MiB) and was trained for 20 labels. Times are in seconds.
stage      tweets   news articles
stage 1       112            1613
stage 2        70            1214
stage 3       202            2411
total         384            5258
 

Classifying Documents

Activating a new classifier takes time (up to 10 seconds depending on its size) and that cost will be borne by the first few documents presented to it. Once a classifier has “warmed up”, its performance should be constant if the number of projects and classifiers remains constant.
 

We define performance as the time it takes to classify a document. Three factors impact performance:

  • whether you save the document classified: the /classify endpoint is two to three times faster than /messages, because it saves nothing to the project database
  • the size of the document: larger documents take longer to prepare than smaller ones
  • the complexity of the classifier: classifiers with a large feature space and/or a large number of classes (labels) take slightly longer to process documents
 

Throughput is the number of documents classified per unit of time. Two techniques increase it:

  • increase the batch size: submitting several documents per request spreads the per-request overhead, so the average processing time per document drops. A batch size of 10 to 30 is optimal.
  • use multiple clients (threads): making requests in parallel processes more documents per unit of time. We recommend using no more than 5 threads.
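Both techniques can be combined in a small client-side helper, sketched below with Python's standard thread pool. It assumes a caller-supplied `send_batch` function (hypothetical) that POSTs one batch to /classify or /messages and returns one result per document; the defaults of 20 documents per batch and 5 threads simply follow the guidance above.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_all(documents, send_batch, batch_size=20, threads=5):
    """Send documents in batches over several threads, preserving input order."""
    chunks = [documents[i:i + batch_size]
              for i in range(0, len(documents), batch_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields batch results in submission order, even though
        # the batches themselves are processed in parallel.
        batch_results = list(pool.map(send_batch, chunks))
    return [result for batch in batch_results for result in batch]
```

Because results come back in input order, each one can be matched to its document by position.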
 

Graphs

The following graphs illustrate performance and throughput:

  • Graph 1: Vary Message Size
  • Graph 2: Vary Number of Classification Topics
  • Graph 3: Vary Batch Size
  • Graph 4: Vary Number of Threads
  • Graph 5: Vary Batch Size - Save vs. No-Save Option
 
