Document Fields

Last Updated: Dec 22, 2015

A document corresponds to one unit of text, along with associated metadata such as title, date, etc. Documents can be long or short, but a novel is probably too long, and a single word is generally too short (it's okay to have some documents in a project be just a single word, but if all of them are, you won't get useful information from them).

A document is represented as a dictionary with several fields. In the UI, only some of the fields are allowed on uploaded documents and only some of the fields are displayed, but in the API you can access all of the fields.

Currently, most of the possible fields on a document are listed in the following sections. Not all documents have all fields. Any other fields on uploaded documents will be removed, as well as any fields whose value is null.
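As a rough illustration of the cleanup rule above, the following Python sketch drops null-valued fields and any field outside a known set. The names KNOWN_FIELDS and clean_document are hypothetical, and the set lists only the fields described on this page, so the real service may well recognize others.

```python
# Hypothetical sketch: mimic the documented cleanup of uploaded documents.
# KNOWN_FIELDS covers only the fields described on this page (an assumption;
# the service may recognize more).
KNOWN_FIELDS = {
    "_id", "title", "text", "date", "language", "subsets",
    "terms", "fragments", "source", "assoc_version", "vector", "predict",
}


def clean_document(doc):
    """Return a copy of doc without unknown fields or null (None) values."""
    return {
        key: value
        for key, value in doc.items()
        if key in KNOWN_FIELDS and value is not None
    }


uploaded = {
    "title": "My Document",
    "text": "Catherine flew to San Francisco.",
    "date": None,              # null value: dropped
    "favorite_color": "blue",  # not a field described on this page: dropped
}
cleaned = clean_document(uploaded)
print(cleaned)
```

The sketch returns a new dictionary rather than modifying the input, so the original upload data stays available for comparison.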

_id

A unique ID assigned to the document.

Not user-specified (if a non-Luminoso _id is included on an uploaded document, it will be removed and put into the "source" dictionary). All documents have this field.

Example: "uuid-186ef54305dc481297076795efa2aa4c"

title

A title for the document. Not necessarily unique. Note: Luminoso document-search does not search in titles.

Optional; if unspecified, it will default to the empty string.

Example: "My Document"

text

The text of the document. May contain any characters you want. The maximum length currently allowed is 300,000 characters (roughly the length of a short novel, or about 30 long Wikipedia articles).

Technically optional, but there's no point in having documents with no text. If unspecified, it will default to the empty string.

Example: "Catherine flew to San Francisco."
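If you want to catch over-long documents before uploading, a client-side check along these lines may help. This is only a sketch: MAX_TEXT_LENGTH and check_text are hypothetical names, and it assumes the limit is counted the way Python's len() counts characters, which this page does not confirm.

```python
# Hypothetical client-side check; not part of the Luminoso API.
MAX_TEXT_LENGTH = 300000  # the limit stated in this section


def check_text(doc):
    """Return the document's text, raising ValueError if it is too long."""
    text = doc.get("text", "")  # documented default: the empty string
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(
            "text has %d characters; the limit is %d"
            % (len(text), MAX_TEXT_LENGTH)
        )
    return text


print(check_text({"text": "Catherine flew to San Francisco."}))
```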

date

Date of the document, in UNIX time (seconds since January 1, 1970, UTC).

Optional; no default will be set.

Example: 1366318864
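To produce or read such a value in Python, something like the following should work. The helper names are hypothetical, and the sketch assumes whole seconds interpreted in UTC, which is consistent with the example value above.

```python
from datetime import datetime, timezone


def to_unix_time(dt):
    """Convert a timezone-aware datetime to whole seconds since the UNIX epoch."""
    return int(dt.timestamp())


def from_unix_time(seconds):
    """Convert seconds since the UNIX epoch to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


example = from_unix_time(1366318864)
print(example.isoformat())    # 2013-04-18T21:01:04+00:00
print(to_unix_time(example))  # 1366318864
```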

language

Two-letter ISO language abbreviation. For a list of supported languages, check the /supported_languages/ API endpoint (or the dropdown menu on the upload page in the UI).

Optional; defaults to "en" (English).

Example: "en"

subsets

List of subsets the document belongs to. (If a string is specified instead of a list, that will be considered a single subset and put into a list.) Subset names must be strings and cannot be the empty string. Whitespace will be stripped from the beginning and end of subset names (for example, " my subset " will turn into "my subset").

Optional (the "__all__" subset will be added to all documents).

Example: ["survey responses", "2013"]
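The subset rules above can be sketched in Python roughly as follows. The function normalize_subsets is a hypothetical helper, not part of the API. Rejecting names that are empty after stripping, and appending "__all__" at the end of the list, are assumptions about details this page does not specify.

```python
def normalize_subsets(subsets=None):
    """Approximate the documented handling of the "subsets" field."""
    if subsets is None:
        subsets = []
    elif isinstance(subsets, str):
        # A bare string is treated as a single subset.
        subsets = [subsets]

    cleaned = []
    for name in subsets:
        if not isinstance(name, str):
            raise TypeError("subset names must be strings")
        name = name.strip()  # leading and trailing whitespace is removed
        if not name:
            # Assumption: a name that is empty after stripping is rejected.
            raise ValueError("subset names cannot be the empty string")
        cleaned.append(name)

    # Every document belongs to "__all__" (its position is an assumption).
    if "__all__" not in cleaned:
        cleaned.append("__all__")
    return cleaned


print(normalize_subsets(" my subset "))
print(normalize_subsets(["survey responses", "2013"]))
```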

terms

List of the terms in the document. (See the Terms and Fragments page for details.)

Not user-specified. All processed documents have this field.

Example: [["catherine|en", "NOUN/T", [0, 9]], ["fly|en", "VERB/T", [10, 14]], ["san|en francisco|en", "NOUN/T NOUN/T", [18, 31]]]

fragments

List of the term fragments in the document. (See the Terms and Fragments page for details.)

Not user-specified. All processed documents have this field (though it could be an empty list).

Example: [["san|en", "NOUN/T", [18, 21]], ["francisco|en", "NOUN/T", [22, 31]]]
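In the terms and fragments examples, the last element of each entry appears to be a pair of character offsets into the document's text, with the end offset exclusive. Under that assumption (the Terms and Fragments page is the authority on the format), a short Python sketch can recover the original wording behind each entry. The helper surface_forms is a hypothetical name.

```python
text = "Catherine flew to San Francisco."

terms = [
    ["catherine|en", "NOUN/T", [0, 9]],
    ["fly|en", "VERB/T", [10, 14]],
    ["san|en francisco|en", "NOUN/T NOUN/T", [18, 31]],
]
fragments = [
    ["san|en", "NOUN/T", [18, 21]],
    ["francisco|en", "NOUN/T", [22, 31]],
]


def surface_forms(text, entries):
    """Pair each term with the slice of text its offsets point at.

    Assumes each entry is [term, tag, [start, end]] with end exclusive.
    """
    return [(term, text[start:end]) for term, _tag, (start, end) in entries]


print(surface_forms(text, terms))
print(surface_forms(text, fragments))
```

Note that the normalized term "fly|en" maps back to the word "flew" in the text, which is why the offsets are useful for highlighting.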

source

A dictionary of information about the document's source. It can contain any information that you would like to keep track of.

Optional; a "type" field will automatically be added, which defaults to "Unknown".

Example: {"type": "survey responses", "creation_date": 1359382566, "url": "http://example.com/123"}
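The following Python sketch approximates how the "source" dictionary might be filled in on upload, combining the default "type" described here with the relocation of a user-supplied _id described in the _id section. The helper normalize_source is hypothetical, it treats any _id on an uploaded document as non-Luminoso, and the key under which the old _id is stored inside "source" is an assumption.

```python
def normalize_source(doc):
    """Return a copy of doc with a populated "source" dictionary."""
    doc = dict(doc)
    source = dict(doc.get("source") or {})

    # Documented default: "type" is added and defaults to "Unknown".
    source.setdefault("type", "Unknown")

    # A user-supplied _id is removed and kept in "source".
    # Assumption: it is stored under the key "_id".
    if "_id" in doc:
        source["_id"] = doc.pop("_id")

    doc["source"] = source
    return doc


result = normalize_source({"_id": "abc-123", "text": "Catherine flew to San Francisco."})
print(result)
```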

assoc_version

An integer indicating what version of the project this document's vector comes from.

Not user-specified. All processed documents have this field.

Example: 0

vector

The document's vector (see the How It Works page for details), in pack64 format.

Not user-specified. All processed documents have this field.

Example: "WDP97uWM2l_cU-T3ANRAuUAwCAyI9KH_4a_JzAPTDtJ_Lr-vI9YkAdO-_UE45An0Clw7S0_w39AzA9S9LpMY3_Tm_xmCiI_5j_MS7spBlY8xBE3IHbiBp2LQuHcH1yN-za9OlBv8yG47H98hX4-w9BV73gEso6iy-ZF7uWAg9EPp-791DHAHn_3C3Pr8Li-KRDD4HgJEAv-f86Jx-RW_ySuHuJqK4sxEbCI8YGWj8LwDlu5WHDcP3LN_l_9ZCAru-LPEpD0rM9QF_EA92vDfO5dL1307Jq0rS4wGDYUACPQMeJ442GOEBvC0Gz0WICuG1bEwQ8yxAvq_1uE8T96XCba_qq_tR8JO8qmAix8k7_K__AfDJh_VMHPVF_3_e96s4BHOC0AEe394I_7dBgOASc_ftAGCIcCG-s76h9knAqz9_x_Cd_sGAQt-dfAiDDDr_jc"

predict

A dictionary mapping predictors to their values for this document, to be used for predicting values for other documents (see the Prediction page for details).

Optional; defaults to empty dictionary.

Example: {"rating": {"value": 3.2}, "happiness": {"value": 22}}
