Last Updated: Feb 22, 2016

Two of the fields on documents - "terms" and "fragments" - are lists of tuples that look like ["fly|en", "VERB/T", [10, 14]]. The tuples correspond to the words in the document. The first of the three parts is the root form of the word, followed by a pipe ("|") and its (abbreviated) language. The second part is the part-of-speech tag, which consists of a grammatical part of speech followed by a slash and a single character representing its broader category ("T" for regular terms, "S" for stopwords, "P" for punctuation, etc.). The third part is the start and end indices of the word in the document's text (inclusive of the start index and exclusive of the end index; indices start at zero). In this example, if the sentence is "Catherine flew to San Francisco.", the term "fly" appears in the document as "flew", which starts at index 10 in the text and ends just before index 14.
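As a rough illustration, the three parts of a tuple can be pulled apart as follows. This is a minimal Python sketch, not part of any official client: the helper name parse_term and the dictionary keys are made up for this example, and only the tuple and sentence come from the description above.

```python
text = "Catherine flew to San Francisco."
term = ["fly|en", "VERB/T", [10, 14]]


def parse_term(term):
    """Split one [root|lang, POS/category, [start, end]] tuple into named parts.

    parse_term is a hypothetical helper written for this example.
    """
    root_and_language, tag, (start, end) = term
    # Split on the last pipe so the language code is always the final piece.
    root, _, language = root_and_language.rpartition("|")
    # Split on the last slash: grammatical part of speech, then category letter.
    part_of_speech, _, category = tag.rpartition("/")
    return {
        "root": root,
        "language": language,
        "part_of_speech": part_of_speech,
        "category": category,
        "start": start,
        "end": end,
    }


parsed = parse_term(term)

# The start index is inclusive and the end index is exclusive, which matches
# Python slicing, so the indices can be used directly on the document text.
surface_form = text[parsed["start"]:parsed["end"]]
print(parsed["root"], "appears in the text as", surface_form)
```

Because the indices follow the same inclusive-start, exclusive-end convention as Python slices, no adjustment is needed to recover the word as it was written.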

The "terms" list has one item per word in the document, but excludes punctuation and words that are considered "stopwords", which do not contribute much meaning (such as "the" and "is"). It also combines words into phrases if they occur together often - for example, "San Francisco" instead of "San" and "Francisco".

The "fragments" field contains only the words that were combined into multi-word terms (called "collocations") - it would list "San" and "Francisco" when the "terms" list contains "San Francisco".
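To make the relationship between the two fields concrete, here is a small Python sketch that groups fragments under the multi-word term that contains them, using the start and end indices. The field values below are invented for illustration: the root forms, part-of-speech tags, and the helper fragments_by_term are assumptions, and a real document may differ in all of them. Only the sentence and the index convention come from the description above.

```python
text = "Catherine flew to San Francisco."

# Hypothetical "terms" and "fragments" values for the sentence above.
# The exact root forms and tags are assumptions made for this example.
terms = [
    ["catherine|en", "NOUN/T", [0, 9]],
    ["fly|en", "VERB/T", [10, 14]],
    ["san francisco|en", "NOUN/T", [18, 31]],
]
fragments = [
    ["san|en", "NOUN/T", [18, 21]],
    ["francisco|en", "NOUN/T", [22, 31]],
]


def fragments_by_term(terms, fragments):
    """Map each term to the fragments whose span lies inside the term's span."""
    grouped = {}
    for term_id, _tag, (t_start, t_end) in terms:
        inside = [
            frag_id
            for frag_id, _frag_tag, (f_start, f_end) in fragments
            if t_start <= f_start and f_end <= t_end
        ]
        # Single-word terms have no fragments, so they are left out.
        if inside:
            grouped[term_id] = inside
    return grouped


grouped = fragments_by_term(terms, fragments)
for term_id, parts in grouped.items():
    print(term_id, "is made up of", parts)
```

Matching on index ranges rather than on the text of the root forms avoids mistakes when the same word appears more than once in a document.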

See also: Which words and terms get analyzed, and which do not?

