//- 💫 DOCS > API > ANNOTATION SPECS include ../../_includes/_mixins p This document describes the target annotations spaCy is trained to predict. +h(2, "tokenization") Tokenization p | Tokenization standards are based on the | #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus. | The tokenizer differs from most by including tokens for significant | whitespace. Any sequence of whitespace characters beyond a single space | (#[code ' ']) is included as a token. +aside-code("Example"). from spacy.en import English nlp = English(parser=False) tokens = nlp('Some\nspaces and\ttab characters') print([t.orth_ for t in tokens]) # ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters'] p | The whitespace tokens are useful for much the same reason punctuation is | – it's often an important delimiter in the text. By preserving it in the | token output, we are able to maintain a simple alignment between the | tokens and the original string, and we ensure that no information is | lost during processing. +h(2, "sentence-boundary") Sentence boundary detection p | Sentence boundaries are calculated from the syntactic parse tree, so | features such as punctuation and capitalisation play an important but | non-decisive role in determining the sentence boundaries. Usually this | means that the sentence boundaries will at least coincide with clause | boundaries, even given poorly punctuated text. +h(2, "pos-tagging") Part-of-speech Tagging +infobox("Tip: Understanding tags") | In spaCy v1.9+, you can also use #[code spacy.explain()] to get the | description for the string representation of a tag. For example, | #[code spacy.explain("RB")] will return "adverb". include _annotation/_pos-tags +h(2, "lemmatization") Lemmatization p A "lemma" is the uninflected form of a word. In English, this means: +list +item #[strong Adjectives]: The form like "happy", not "happier" or "happiest" +item #[strong Adverbs]: The form like "badly", not "worse" or "worst" +item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children" +item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written" +aside("About spaCy's custom pronoun lemma") | Unlike verbs and common nouns, there's no clear base form of a personal | pronoun. Should the lemma of "me" be "I", or should we normalize person | as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a | novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for | all personal pronouns. p | The lemmatization data is taken from | #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a | special case for pronouns: all pronouns are lemmatized to the special | token #[code -PRON-]. +h(2, "dependency-parsing") Syntactic Dependency Parsing +infobox("Tip: Understanding labels") | In spaCy v1.9+, you can also use #[code spacy.explain()] to get the | description for the string representation of a label. For example, | #[code spacy.explain("prt")] will return "particle". include _annotation/_dep-labels +h(2, "named-entities") Named Entity Recognition +infobox("Tip: Understanding entity types") | In spaCy v1.9+, you can also use #[code spacy.explain()] to get the | description for the string representation of an entity label. For example, | #[code spacy.explain("LANGUAGE")] will return "any named language". include _annotation/_named-entities +h(3, "biluo") BILUO Scheme p | spaCy translates the character offsets into this scheme, in order to | decide the cost of each action given the current state of the entity | recogniser. The costs are then used to calculate the gradient of the | loss, to train the model. The exact algorithm is a pastiche of | well-known methods, and is not currently described in any single | publication. The model is a greedy transition-based parser guided by a | linear model whose weights are learned using the averaged perceptron | loss, via the #[+a("http://www.aclweb.org/anthology/C12-1059") dynamic oracle] | imitation learning strategy. The transition system is equivalent to the | BILOU tagging scheme. +aside("Why BILUO, not IOB?") | There are several coding schemes for encoding entity annotations as | token tags. These coding schemes are equally expressive, but not | necessarily equally learnable. | #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth] | showed that the minimal #[strong Begin], #[strong In], #[strong Out] | scheme was more difficult to learn than the #[strong BILUO] scheme that | we use, which explicitly marks boundary tokens. +table(["Tag", "Description"]) +row +cell #[code #[span.u-color-theme B] EGIN] +cell The first token of a multi-token entity. +row +cell #[code #[span.u-color-theme I] N] +cell An inner token of a multi-token entity. +row +cell #[code #[span.u-color-theme L] AST] +cell The final token of a multi-token entity. +row +cell #[code #[span.u-color-theme U] NIT] +cell A single-token entity. +row +cell #[code #[span.u-color-theme O] UT] +cell A non-entity token. +h(2, "json-input") JSON input format for training p | spaCy takes training data in JSON format. The built-in | #[+a("/docs/usage/cli#convert") #[code convert] command] helps you | convert the #[code .conllu] format used by the | #[+a("https://github.com/UniversalDependencies") Universal Dependencies corpora] | to spaCy's training format. +aside("Annotating entities") | Named entities are provided in the #[+a("#biluo") BILUO] | notation. Tokens outside an entity are set to #[code "O"] and tokens | that are part of an entity are set to the entity label, prefixed by the | BILUO marker. For example #[code "B-ORG"] describes the first token of | a multi-token #[code ORG] entity and #[code "U-PERSON"] a single | token representing a #[code PERSON] entity +code("Example structure"). [{ "id": int, # ID of the document within the corpus "paragraphs": [{ # list of paragraphs in the corpus "raw": string, # raw text of the paragraph "sentences": [{ # list of sentences in the paragraph "tokens": [{ # list of tokens in the sentence "id": int, # index of the token in the document "dep": string, # dependency label "head": int, # offset of token head relative to token index "tag": string, # part-of-speech tag "orth": string, # verbatim text of the token "ner": string # BILUO label, e.g. "O" or "B-ORG" }], "brackets": [{ # phrase structure (NOT USED by current models) "first": int, # index of first token "last": int, # index of last token "label": string # phrase label }] }] }] }]