Using a confusion matrix in Compass

Why use a confusion matrix?

  • To compare the labels Luminoso’s classifiers predicted against the true labels (the labels on the training data)
  • To see which labels overlap and may be causing ‘confusion’ in the classifier
  • To fix those areas of confusion so that the classifier is in better shape to push into production

Reading a confusion matrix

Here is an example of how to read a column in the confusion matrix screenshot above:

  • In cell D2 you see the label ‘local host installs’.
  • Column D illustrates how classification for the label ‘local host installs’ performed.
  • You can see that out of 304 testing documents (B8), 268 (B3) were correctly classified.
  • Four documents were classified with the label ‘installation’, which differed from the training data tag of ‘local host installs’ (a code sketch of this column reading follows this list).
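
To make that column reading concrete, here is a minimal Python sketch that builds a confusion matrix from paired true and predicted labels and prints the breakdown for a single label’s column. The label lists and counts are illustrative only; it assumes you can export each testing document’s true tag and the classifier’s prediction from Compass, which may differ from your actual export format.

```python
from collections import Counter

# Hypothetical paired labels: in practice, export each testing document's
# true tag (from the training data) and the classifier's predicted label.
true_labels = ["local host installs", "local host installs",
               "local host installs", "installation"]
pred_labels = ["local host installs", "installation",
               "local host installs", "installation"]

# Count (predicted, true) pairs; each true label forms one column,
# mirroring the layout of the matrix in the screenshot.
counts = Counter(zip(pred_labels, true_labels))
labels = sorted(set(true_labels) | set(pred_labels))

# Read one column: how documents truly tagged `target` were classified.
target = "local host installs"
total = sum(counts[(pred, target)] for pred in labels)
print(f"Column for {target!r}: {total} testing documents")
for pred in labels:
    n = counts[(pred, target)]
    if n:
        note = "correct" if pred == target else "confused"
        print(f"  classified as {pred!r}: {n} ({note})")
```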

Determining correction needs

  • Generally, we recommend inspecting and potentially correcting a label if its accuracy is below 80%.
    • This accuracy is calculated by dividing the number of correct classifications by the total number; in the above screenshot, for example, B3/B8 = 268/304 ≈ 88% accuracy.
  • Also, if a single incorrect cell accounts for a large portion of the total incorrect classifications, investigate why that overlap/confusion exists. A ‘large portion’ is not an exact threshold, but 10-20% of the total classifications is a good rule of thumb (the sketch after this list automates both checks).
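
Both checks can be automated once the matrix is in hand. The sketch below assumes a NumPy array with one column per true label and one row per predicted label, as in the screenshot; the matrix values and label names are made up for illustration, and the 80% and 10% thresholds follow the rules of thumb above.

```python
import numpy as np

# Hypothetical confusion matrix: rows = predicted label, columns = true label.
labels = ["installation", "local host installs", "page setup"]
matrix = np.array([
    [150,   4,  2],   # predicted "installation"
    [ 30, 268,  5],   # predicted "local host installs"
    [ 10,  32, 90],   # predicted "page setup"
])

for j, label in enumerate(labels):
    column = matrix[:, j]
    total = column.sum()                # e.g. B8 in the screenshot
    correct = matrix[j, j]              # the diagonal cell, e.g. B3
    accuracy = correct / total          # B3 / B8
    if accuracy < 0.80:
        print(f"{label!r}: accuracy {accuracy:.0%} -- inspect and correct")
    # Flag any single wrong cell taking a large share of the column
    # (10-20% of the total classifications is the rule of thumb above).
    for i, n in enumerate(column):
        if i != j and n / total >= 0.10:
            print(f"  {n} docs confused with {labels[i]!r} "
                  f"({n / total:.0%} of the column)")
```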
