extends ./outline.jade mixin columns(...names) tr each name in names th= name mixin row(...cells) tr each cell in cells td= cell block body_block article(class="page docs-page") p. This document describes the target annotations spaCy is trained to predict. This is currently a work in progress. Please ask questions on the issue tracker, so that the answers can be integrated here to improve the documentation. h2 Tokenization p Tokenization standards are based on the OntoNotes 5 corpus. p. The tokenizer differs from most by including tokens for significant whitespace. Any sequence of whitespace characters beyond a single space (' ') is included as a token. For instance: pre.language-python code | from spacy.en import English | nlp = English(parse=False) | tokens = nlp('Some\nspaces and\ttab characters') | print([t.orth_ for t in tokens]) p Which produces: pre.language-python code | ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters'] p. The whitespace tokens are useful for much the same reason punctuation is – it's often an important delimiter in the text. By preserving it in the token output, we are able to maintain a simple alignment between the tokens and the original string, and we ensure that no information is lost during processing. h3 Sentence boundary detection p. Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining the sentence boundaries. Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text. h3 Part-of-speech Tagging p. The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS Tag set. Details here: https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124 h3 Lemmatization p. A "lemma" is the uninflected form of a word. In English, this means: ul li Adjectives: The form like "happy", not "happier" or "happiest" li Adverbs: The form like "badly", not "worse" or "worst" li Nouns: The form like "dog", not "dogs"; like "child", not "children" li Verbs: The form like "write", not "writes", "writing", "wrote" or "written" p. The lemmatization data is taken from WordNet. However, we also add a special case for pronouns: all pronouns are lemmatized to the special token -PRON-. h3 Syntactic Dependency Parsing p. The parser is trained on data produced by the ClearNLP converter. Details of the annotation scheme can be found here: http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf h3 Named Entity Recognition table thead +columns("Entity Type", "Description") tbody +row("PERSON", "People, including fictional.") +row("NORP", "Nationalities or religious or political groups.") +row("FACILITY", "Buildings, airports, highways, bridges, etc.") +row("ORG", "Companies, agencies, institutions, etc.") +row("GPE", "Countries, cities, states.") +row("LOC", "Non-GPE locations, mountain ranges, bodies of water.") +row("PRODUCT", "Vehicles, weapons, foods, etc. (Not services") +row("EVENT", "Named hurricanes, battles, wars, sports events, etc.") +row("WORK_OF_ART", "Titles of books, songs, etc.") +row("LAW", "Named documents made into laws") +row("LANGUAGE", "Any named language") p The following values are also annotated in a style similar to names: table thead +columns("Entity Type", "Description") tbody +row("DATE", "Absolute or relative dates or periods") +row("TIME", "Times smaller than a day") +row("PERCENT", 'Percentage (including “%”)') +row("MONEY", "Monetary values, including unit") +row("QUANTITY", "Measurements, as of weight or distance") +row("ORDINAL", 'first", "second"') +row("CARDINAL", "Numerals that do not fall under another type")