spaCy/website/docs/api/annotation.jade

172 lines
7.3 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

//- 💫 DOCS > API > ANNOTATION SPECS
include ../../_includes/_mixins
p This document describes the target annotations spaCy is trained to predict.
+h(2, "tokenization") Tokenization
p
| Tokenization standards are based on the
| #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus.
| The tokenizer differs from most by including tokens for significant
| whitespace. Any sequence of whitespace characters beyond a single space
| (#[code ' ']) is included as a token.
+aside-code("Example").
from spacy.en import English
nlp = English(parser=False)
tokens = nlp('Some\nspaces and\ttab characters')
print([t.orth_ for t in tokens])
# ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']
p
| The whitespace tokens are useful for much the same reason punctuation is
| it's often an important delimiter in the text. By preserving it in the
| token output, we are able to maintain a simple alignment between the
| tokens and the original string, and we ensure that no information is
| lost during processing.
+h(2, "sentence-boundary") Sentence boundary detection
p
| Sentence boundaries are calculated from the syntactic parse tree, so
| features such as punctuation and capitalisation play an important but
| non-decisive role in determining the sentence boundaries. Usually this
| means that the sentence boundaries will at least coincide with clause
| boundaries, even given poorly punctuated text.
+h(2, "pos-tagging") Part-of-speech Tagging
+infobox("Tip: Understanding tags")
| In spaCy v1.9+, you can also use #[code spacy.explain()] to get the
| description for the string representation of a tag. For example,
| #[code spacy.explain("RB")] will return "adverb".
include _annotation/_pos-tags
+h(2, "lemmatization") Lemmatization
p A "lemma" is the uninflected form of a word. In English, this means:
+list
+item #[strong Adjectives]: The form like "happy", not "happier" or "happiest"
+item #[strong Adverbs]: The form like "badly", not "worse" or "worst"
+item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
+item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"
+aside("About spaCy's custom pronoun lemma")
| Unlike verbs and common nouns, there's no clear base form of a personal
| pronoun. Should the lemma of "me" be "I", or should we normalize person
| as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
| novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
| all personal pronouns.
p
| The lemmatization data is taken from
| #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
| special case for pronouns: all pronouns are lemmatized to the special
| token #[code -PRON-].
+h(2, "dependency-parsing") Syntactic Dependency Parsing
+infobox("Tip: Understanding labels")
| In spaCy v1.9+, you can also use #[code spacy.explain()] to get the
| description for the string representation of a label. For example,
| #[code spacy.explain("prt")] will return "particle".
include _annotation/_dep-labels
+h(2, "named-entities") Named Entity Recognition
+infobox("Tip: Understanding entity types")
| In spaCy v1.9+, you can also use #[code spacy.explain()] to get the
| description for the string representation of an entity label. For example,
| #[code spacy.explain("LANGUAGE")] will return "any named language".
include _annotation/_named-entities
+h(3, "biluo") BILUO Scheme
p
| spaCy translates the character offsets into this scheme, in order to
| decide the cost of each action given the current state of the entity
| recogniser. The costs are then used to calculate the gradient of the
| loss, to train the model. The exact algorithm is a pastiche of
| well-known methods, and is not currently described in any single
| publication. The model is a greedy transition-based parser guided by a
| linear model whose weights are learned using the averaged perceptron
| loss, via the #[+a("http://www.aclweb.org/anthology/C12-1059") dynamic oracle]
| imitation learning strategy. The transition system is equivalent to the
| BILOU tagging scheme.
+aside("Why BILUO, not IOB?")
| There are several coding schemes for encoding entity annotations as
| token tags. These coding schemes are equally expressive, but not
| necessarily equally learnable.
| #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
| showed that the minimal #[strong Begin], #[strong In], #[strong Out]
| scheme was more difficult to learn than the #[strong BILUO] scheme that
| we use, which explicitly marks boundary tokens.
+table(["Tag", "Description"])
+row
+cell #[code #[span.u-color-theme B] EGIN]
+cell The first token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme I] N]
+cell An inner token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme L] AST]
+cell The final token of a multi-token entity.
+row
+cell #[code #[span.u-color-theme U] NIT]
+cell A single-token entity.
+row
+cell #[code #[span.u-color-theme O] UT]
+cell A non-entity token.
+h(2, "json-input") JSON input format for training
p
| spaCy takes training data in JSON format. The built-in
| #[+a("/docs/usage/cli#convert") #[code convert] command] helps you
| convert the #[code .conllu] format used by the
| #[+a("https://github.com/UniversalDependencies") Universal Dependencies corpora]
| to spaCy's training format.
+aside("Annotating entities")
| Named entities are provided in the #[+a("#biluo") BILUO]
| notation. Tokens outside an entity are set to #[code "O"] and tokens
| that are part of an entity are set to the entity label, prefixed by the
| BILUO marker. For example #[code "B-ORG"] describes the first token of
| a multi-token #[code ORG] entity and #[code "U-PERSON"] a single
| token representing a #[code PERSON] entity
+code("Example structure").
[{
"id": int, # ID of the document within the corpus
"paragraphs": [{ # list of paragraphs in the corpus
"raw": string, # raw text of the paragraph
"sentences": [{ # list of sentences in the paragraph
"tokens": [{ # list of tokens in the sentence
"id": int, # index of the token in the document
"dep": string, # dependency label
"head": int, # offset of token head relative to token index
"tag": string, # part-of-speech tag
"orth": string, # verbatim text of the token
"ner": string # BILUO label, e.g. "O" or "B-ORG"
}],
"brackets": [{ # phrase structure (NOT USED by current models)
"first": int, # index of first token
"last": int, # index of last token
"label": string # phrase label
}]
}]
}]
}]