p This document describes the target annotations spaCy is trained to predict. It is currently a work in progress. Please ask questions on the issue tracker so that the answers can be integrated here.
details
summary: h4 Tokenization
p Tokenization standards are based on the OntoNotes 5 corpus.
p The tokenizer differs from most by including tokens for significant whitespace. Any sequence of whitespace characters beyond a single space (' ') is included as a token.
p Whitespace tokens are useful for much the same reason punctuation tokens are: whitespace is often an important delimiter in the text. By preserving it in the token output, we maintain a simple alignment between the tokens and the original string, and we ensure that no information is lost during processing.
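p The alignment property can be illustrated with a small standalone sketch. This is not spaCy's tokenizer: the function names are hypothetical, punctuation is deliberately not split off, and the handling of edge cases is an assumption. It only shows how keeping extra whitespace as tokens lets the original string be rebuilt exactly.

```python
import re


def tokenize_keep_whitespace(text):
    """Toy tokenizer (hypothetical, not spaCy's implementation).

    Returns (token_text, has_trailing_space) pairs. A single space after a
    word is stored as a flag on that word; any further whitespace becomes
    a token of its own.
    """
    tokens = []
    # Alternate between runs of non-whitespace and runs of whitespace.
    for match in re.finditer(r"\S+|\s+", text):
        chunk = match.group()
        if not chunk.isspace():
            tokens.append([chunk, False])
            continue
        if tokens and chunk.startswith(" "):
            # One ordinary space is recorded on the preceding token.
            tokens[-1][1] = True
            chunk = chunk[1:]
        if chunk:
            # Whatever whitespace is left over is kept as a token.
            tokens.append([chunk, False])
    return [tuple(token) for token in tokens]


def detokenize(tokens):
    """Rebuild the original string from the token pairs."""
    return "".join(text + (" " if space else "") for text, space in tokens)


text = "Hello  world,\n\nbye"
tokens = tokenize_keep_whitespace(text)
print(tokens)
assert detokenize(tokens) == text
```

p Because every character of the input ends up either in a token or in a trailing-space flag, joining the tokens back together reproduces the input without loss.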
details
summary: h4 Sentence boundary detection
p Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining the sentence boundaries. Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text.
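p As a rough illustration of deriving sentence boundaries from a parse, the sketch below groups tokens by the root they attach to. The words and head indices are hand-written for the example rather than produced by a parser, the function name is hypothetical, and the approach assumes each sentence is a contiguous span.

```python
def sentences_from_heads(words, heads):
    """Group tokens into sentences by the root each token attaches to.

    heads[i] is the index of token i's syntactic head; a root token is
    its own head. This is an illustrative sketch, not spaCy's code.
    """
    def root_of(index):
        # Follow head links until reaching a token that heads itself.
        while heads[index] != index:
            index = heads[index]
        return index

    sentences = []
    previous_root = None
    for index, word in enumerate(words):
        root = root_of(index)
        if root != previous_root:
            # A change of root starts a new sentence.
            sentences.append([])
            previous_root = root
        sentences[-1].append(word)
    return sentences


# Unpunctuated, uncapitalised text with two clauses (hand-annotated heads).
words = ["i", "went", "home", "she", "stayed"]
heads = [1, 1, 1, 4, 4]
print(sentences_from_heads(words, heads))
```

p Even with no punctuation or capitalisation, the two clauses are separated because their tokens attach to different roots.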
details
summary: h4 Part-of-speech Tagging
p The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS Tag set.
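p The mapping from fine-grained to coarse-grained tags can be pictured as a lookup table. The excerpt below is a small hand-picked subset for illustration, not the full table spaCy uses, and the helper name is hypothetical.

```python
# A few Penn Treebank tags and their Universal POS equivalents.
# Illustrative excerpt only; the real mapping covers the whole tag set.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN",    # singular noun
    "NNS": "NOUN",   # plural noun
    "VB": "VERB",    # base-form verb
    "VBD": "VERB",   # past-tense verb
    "JJ": "ADJ",     # adjective
    "JJR": "ADJ",    # comparative adjective
    "RB": "ADV",     # adverb
    "PRP": "PRON",   # personal pronoun
    "DT": "DET",     # determiner
    "IN": "ADP",     # preposition
}


def to_universal(ptb_tag):
    """Return the coarse tag, falling back to "X" (other) when unknown."""
    return PTB_TO_UNIVERSAL.get(ptb_tag, "X")


for tag in ["NNS", "VBD", "JJR"]:
    print(tag, "->", to_universal(tag))
```

p Several fine-grained tags collapse onto one coarse tag, which is what makes the universal set simpler to work with.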
details
summary: h4 Lemmatization
p A "lemma" is the uninflected form of a word. In English, this means:
ul
li Adjectives: The form like "happy", not "happier" or "happiest"
li Adverbs: The form like "badly", not "worse" or "worst"
li Nouns: The form like "dog", not "dogs"; like "child", not "children"
li Verbs: The form like "write", not "writes", "writing", "wrote" or "written"
p.
The lemmatization data is taken from WordNet. However, we also add a
special case for pronouns: all pronouns are lemmatized to the special
token #[code -PRON-].
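p The rules above can be sketched as a table lookup with a special case for pronouns. The table here is a tiny hand-written stand-in for the WordNet data, and the function name and coarse tag labels are assumptions made for the example.

```python
PRON_LEMMA = "-PRON-"

# Tiny hand-written stand-in for the WordNet-derived lemmatization data.
LEMMA_TABLE = {
    ("happier", "ADJ"): "happy",
    ("happiest", "ADJ"): "happy",
    ("worse", "ADV"): "badly",
    ("dogs", "NOUN"): "dog",
    ("children", "NOUN"): "child",
    ("wrote", "VERB"): "write",
    ("written", "VERB"): "write",
}


def lemmatize(word, pos):
    """Look up the lemma of a word given its coarse part-of-speech tag."""
    if pos == "PRON":
        # All pronouns share one special lemma.
        return PRON_LEMMA
    # Unknown forms fall back to the lowercased word itself.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(lemmatize("children", "NOUN"))
print(lemmatize("wrote", "VERB"))
print(lemmatize("She", "PRON"))
```

p Keying the table on both the word and its part of speech matters because the same surface form can have different lemmas depending on how it is used.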
details
summary: h4 Syntactic Dependency Parsing
p The parser is trained on data produced by the ClearNLP converter. Details of the annotation scheme can be found #[a(href="http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") here].
details
summary: h4 Named Entity Recognition
table
thead
+columns("Entity Type", "Description")
tbody
+row("PERSON", "People, including fictional.")
+row("NORP", "Nationalities or religious or political groups.")