mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-14 13:47:13 +03:00
---
title: Annotation Specifications
teaser: Schemes used for labels, tags and training data
menu:
  - ['Text Processing', 'text-processing']
  - ['POS Tagging', 'pos-tagging']
  - ['Dependencies', 'dependency-parsing']
  - ['Named Entities', 'named-entities']
  - ['Models & Training', 'training']
---

## Text processing {#text-processing}

> #### Example
>
> ```python
> from spacy.lang.en import English
> nlp = English()
> tokens = nlp("Some\nspaces  and\ttab characters")
> tokens_text = [t.text for t in tokens]
> assert tokens_text == ["Some", "\n", "spaces", " ", "and", "\t", "tab", "characters"]
> ```

Tokenization standards are based on the
[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus. The tokenizer
differs from most by including **tokens for significant whitespace**. Any
sequence of whitespace characters beyond a single space (`' '`) is included as a
token. Whitespace tokens are useful for much the same reason punctuation is:
whitespace is often an important delimiter in the text. By preserving it in the
token output, we are able to maintain a simple alignment between the tokens and
the original string, and we ensure that **no information is lost** during
processing.

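The whitespace rule above can be illustrated with a minimal sketch in plain
Python. This is not spaCy's tokenizer: the helper names
`tokenize_keep_whitespace` and `detokenize` are hypothetical, punctuation
splitting is left out, and a single space after a word is stored as a flag on
that word rather than as a token.

```python
import re


def tokenize_keep_whitespace(text):
    """Split text into (token, trailing_space) pairs without dropping characters.

    Hypothetical helper for illustration only, not spaCy's implementation.
    """
    tokens = []
    for run in re.findall(r"\S+|\s+", text):
        if not run.isspace():
            tokens.append([run, False])
            continue
        # A single space directly after a word is stored as a flag on that word.
        if run[0] == " " and tokens and not tokens[-1][0].isspace():
            tokens[-1][1] = True
            run = run[1:]
        # Any whitespace beyond that single space becomes a token of its own.
        if run:
            tokens.append([run, False])
    return [(tok, space) for tok, space in tokens]


def detokenize(tokens):
    """Rebuild the original string from (token, trailing_space) pairs."""
    return "".join(tok + (" " if space else "") for tok, space in tokens)


text = "Some\nspaces  and\ttab characters"
tokens = tokenize_keep_whitespace(text)
assert [tok for tok, _ in tokens] == ["Some", "\n", "spaces", " ", "and", "\t", "tab", "characters"]
assert detokenize(tokens) == text  # nothing was lost
```

Because every character ends up either in a token or in a trailing-space flag,
joining the pieces always reproduces the input string.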
### Lemmatization {#lemmatization}

> #### Examples
>
> In English, reducing a word to its lemma means:
>
> - **Adjectives**: happier, happiest → happy
> - **Adverbs**: worse, worst → badly
> - **Nouns**: dogs, children → dog, child
> - **Verbs**: writes, writing, wrote, written → write

As of v2.2, lemmatization data is stored in a separate package,
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
can be installed if needed via `pip install spacy[lookups]`. Some languages
provide full lemmatization rules and exceptions, while other languages currently
rely only on simple lookup tables.

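As a rough illustration of what "simple lookup tables" means, the sketch below
maps inflected forms to lemmas with a plain dictionary. The table contents and
the name `lookup_lemma` are assumptions made for this example; the real tables
live in `spacy-lookups-data` and are not reproduced here. Note that a plain
lookup cannot use the part of speech, so it cannot tell the adverb "worse"
(lemma "badly") from the adjective.

```python
# Hypothetical lookup table built from the examples in this section.
LEMMA_LOOKUP = {
    "happier": "happy",
    "happiest": "happy",
    "dogs": "dog",
    "children": "child",
    "writes": "write",
    "writing": "write",
    "wrote": "write",
    "written": "write",
}


def lookup_lemma(word, table=LEMMA_LOOKUP):
    """Return the table entry for a word, or the word itself if it is unknown."""
    return table.get(word.lower(), word)


assert lookup_lemma("wrote") == "write"
assert lookup_lemma("Children") == "child"
assert lookup_lemma("spaCy") == "spaCy"  # unknown words are returned unchanged
```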
<Infobox title="About spaCy's custom pronoun lemma for English" variant="warning">

spaCy adds a **special case for English pronouns**: all English pronouns are
lemmatized to the special token `-PRON-`. Unlike verbs and common nouns, there's
no clear base form of a personal pronoun. Should the lemma of "me" be "I", or
should we normalize person as well, giving "it", or maybe "he"? spaCy's
solution is to introduce a novel symbol, `-PRON-`, which is used as the lemma
for all personal pronouns.

</Infobox>

### Sentence boundary detection {#sentence-boundary}

Sentence boundaries are calculated from the syntactic parse tree, so features
such as punctuation and capitalization play an important but non-decisive role
in determining the sentence boundaries. Usually this means that the sentence
boundaries will at least coincide with clause boundaries, even given poorly
punctuated text.

## Part-of-speech tagging {#pos-tagging}

> #### Tip: Understanding tags
>
> You can also use `spacy.explain` to get the description for the string
> representation of a tag. For example, `spacy.explain("RB")` will return
> "adverb".

This section lists the fine-grained and coarse-grained part-of-speech tags
assigned by spaCy's [models](/models). The mapping from fine-grained to
coarse-grained tags is specific to the training corpus and can be defined in the
respective language data's [`tag_map.py`](/usage/adding-languages#tag-map).

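To make the idea of a tag map concrete, here is a minimal sketch in plain
Python. The rows are copied from the English table in this section, but the
dictionary layout, the names `TAG_MAP`, `parse_morph` and `describe`, and the
fallback to `X` for unknown tags are assumptions for illustration, not the
format of spaCy's own `tag_map.py`.

```python
# Hypothetical excerpt: fine-grained tag -> coarse-grained POS and morphology.
TAG_MAP = {
    "NN": {"pos": "NOUN", "morph": "Number=sing"},
    "NNS": {"pos": "NOUN", "morph": "Number=plur"},
    "VBZ": {"pos": "VERB", "morph": "VerbForm=fin Tense=pres Number=sing Person=three"},
    "RB": {"pos": "ADV", "morph": "Degree=pos"},
    "DT": {"pos": "DET", "morph": ""},
}


def parse_morph(morph):
    """Turn 'Feature=value Feature=value' into a dictionary."""
    return dict(pair.split("=", 1) for pair in morph.split())


def describe(tag):
    """Return (coarse POS, morphology dict); unknown tags fall back to X."""
    entry = TAG_MAP.get(tag, {"pos": "X", "morph": ""})
    return entry["pos"], parse_morph(entry["morph"])


assert describe("NNS") == ("NOUN", {"Number": "plur"})
assert describe("DT") == ("DET", {})
```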
<Accordion title="Universal Part-of-speech Tags" id="pos-universal">

spaCy maps all language-specific part-of-speech tags to a small, fixed set of
word type tags following the
[Universal Dependencies scheme](http://universaldependencies.org/u/pos/). The
universal tags don't code for any morphological features and only cover the word
type. They're available as the [`Token.pos`](/api/token#attributes) and
[`Token.pos_`](/api/token#attributes) attributes.

| POS | Description | Examples |
| ------- | ------------------------- | --------------------------------------------- |
| `ADJ` | adjective | big, old, green, incomprehensible, first |
| `ADP` | adposition | in, to, during |
| `ADV` | adverb | very, tomorrow, down, where, there |
| `AUX` | auxiliary | is, has (done), will (do), should (do) |
| `CONJ` | conjunction | and, or, but |
| `CCONJ` | coordinating conjunction | and, or, but |
| `DET` | determiner | a, an, the |
| `INTJ` | interjection | psst, ouch, bravo, hello |
| `NOUN` | noun | girl, cat, tree, air, beauty |
| `NUM` | numeral | 1, 2017, one, seventy-seven, IV, MMXIV |
| `PART` | particle | 's, not |
| `PRON` | pronoun | I, you, he, she, myself, themselves, somebody |
| `PROPN` | proper noun | Mary, John, London, NATO, HBO |
| `PUNCT` | punctuation | ., (, ), ? |
| `SCONJ` | subordinating conjunction | if, while, that |
| `SYM` | symbol | \$, %, §, ©, +, −, ×, ÷, =, :), 😝 |
| `VERB` | verb | run, runs, running, eat, ate, eating |
| `X` | other | sfpksdpsxmsa |
| `SPACE` | space | |

</Accordion>

<Accordion title="English" id="pos-en">

The English part-of-speech tagger uses the
[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn
Treebank tag set. We also map the tags to the simpler Universal Dependencies v2
POS tag set.

| Tag | POS | Morphology | Description |
| ----------------------------------- | ------- | -------------------------------------------------- | ----------------------------------------- |
| `$` | `SYM` | | symbol, currency |
| <InlineCode>``</InlineCode> | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark |
| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark |
| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma |
| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket |
| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket |
| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer |
| `:` | `PUNCT` | | punctuation mark, colon or ellipsis |
| `ADD` | `X` | | email |
| `AFX` | `ADJ` | `Hyph=yes` | affix |
| `CC` | `CCONJ` | `ConjType=comp` | conjunction, coordinating |
| `CD` | `NUM` | `NumType=card` | cardinal number |
| `DT` | `DET` | | determiner |
| `EX` | `PRON` | `AdvType=ex` | existential there |
| `FW` | `X` | `Foreign=yes` | foreign word |
| `GW` | `X` | | additional word in multi-word expression |
| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen |
| `IN` | `ADP` | | conjunction, subordinating or preposition |
| `JJ` | `ADJ` | `Degree=pos` | adjective |
| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative |
| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative |
| `LS` | `X` | `NumType=ord` | list item marker |
| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary |
| `NFP` | `PUNCT` | | superfluous punctuation |
| `NIL` | `X` | | missing tag |
| `NN` | `NOUN` | `Number=sing` | noun, singular or mass |
| `NNP` | `PROPN` | `NounType=prop Number=sing` | noun, proper singular |
| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural |
| `NNS` | `NOUN` | `Number=plur` | noun, plural |
| `PDT` | `DET` | | predeterminer |
| `POS` | `PART` | `Poss=yes` | possessive ending |
| `PRP` | `PRON` | `PronType=prs` | pronoun, personal |
| `PRP$` | `DET` | `PronType=prs Poss=yes` | pronoun, possessive |
| `RB` | `ADV` | `Degree=pos` | adverb |
| `RBR` | `ADV` | `Degree=comp` | adverb, comparative |
| `RBS` | `ADV` | `Degree=sup` | adverb, superlative |
| `RP` | `ADP` | | adverb, particle |
| `SP` | `SPACE` | | space |
| `SYM` | `SYM` | | symbol |
| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" |
| `UH` | `INTJ` | | interjection |
| `VB` | `VERB` | `VerbForm=inf` | verb, base form |
| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense |
| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle |
| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle |
| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present |
| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=three` | verb, 3rd person singular present |
| `WDT` | `DET` | | wh-determiner |
| `WP` | `PRON` | | wh-pronoun, personal |
| `WP$` | `DET` | `Poss=yes` | wh-pronoun, possessive |
| `WRB` | `ADV` | | wh-adverb |
| `XX` | `X` | | unknown |
| `_SP` | `SPACE` | | |

</Accordion>

<Accordion title="German" id="pos-de">

The German part-of-speech tagger uses the
[TIGER Treebank](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/)
annotation scheme. We also map the tags to the simpler Universal Dependencies v2
POS tag set.

| Tag | POS | Morphology | Description |
| --------- | ------- | ---------------------------------------- | ------------------------------------------------- |
| `$(` | `PUNCT` | `PunctType=brck` | other sentence-internal punctuation mark |
| `$,` | `PUNCT` | `PunctType=comm` | comma |
| `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark |
| `ADJA` | `ADJ` | | adjective, attributive |
| `ADJD` | `ADJ` | | adjective, adverbial or predicative |
| `ADV` | `ADV` | | adverb |
| `APPO` | `ADP` | `AdpType=post` | postposition |
| `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left |
| `APPRART` | `ADP` | `AdpType=prep PronType=art` | preposition with article |
| `APZR` | `ADP` | `AdpType=circ` | circumposition right |
| `ART` | `DET` | `PronType=art` | definite or indefinite article |
| `CARD` | `NUM` | `NumType=card` | cardinal number |
| `FM` | `X` | `Foreign=yes` | foreign language material |
| `ITJ` | `INTJ` | | interjection |
| `KOKOM` | `CCONJ` | `ConjType=comp` | comparative conjunction |
| `KON` | `CCONJ` | | coordinate conjunction |
| `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive |
| `KOUS` | `SCONJ` | | subordinate conjunction with sentence |
| `NE` | `PROPN` | | proper noun |
| `NN` | `NOUN` | | noun, singular or mass |
| `NNE` | `PROPN` | | proper noun |
| `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun |
| `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun |
| `PIAT` | `DET` | `PronType=ind\|neg\|tot` | attributive indefinite pronoun without determiner |
| `PIS` | `PRON` | `PronType=ind\|neg\|tot` | substituting indefinite pronoun |
| `PPER` | `PRON` | `PronType=prs` | replaceable personal pronoun |
| `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun |
| `PPOSS` | `PRON` | `Poss=yes PronType=prs` | substituting possessive pronoun |
| `PRELAT` | `DET` | `PronType=rel` | attributive relative pronoun |
| `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun |
| `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun |
| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb |
| `PTKA` | `PART` | | particle with adjective or adverb |
| `PTKANT` | `PART` | `PartType=res` | answer particle |
| `PTKNEG` | `PART` | `Polarity=neg` | negative particle |
| `PTKVZ` | `ADP` | `PartType=vbp` | separable verbal particle |
| `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive |
| `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun |
| `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun |
| `PWS` | `PRON` | `PronType=int` | substituting interrogative pronoun |
| `TRUNC` | `X` | `Hyph=yes` | word remnant |
| `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary |
| `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary |
| `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary |
| `VAPP` | `AUX` | `Aspect=perf VerbForm=part` | perfect participle, auxiliary |
| `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal |
| `VMINF` | `VERB` | `VerbForm=inf VerbType=mod` | infinitive, modal |
| `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal |
| `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full |
| `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full |
| `VVINF` | `VERB` | `VerbForm=inf` | infinitive, full |
| `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full |
| `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full |
| `XY` | `X` | | non-word containing non-letter |
| `_SP` | `SPACE` | | |

</Accordion>

---

<Infobox title="Annotation schemes for other models">

For the label schemes used by the other models, see the respective `tag_map.py`
in [`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang).

</Infobox>

## Syntactic Dependency Parsing {#dependency-parsing}

> #### Tip: Understanding labels
>
> You can also use `spacy.explain` to get the description for the string
> representation of a label. For example, `spacy.explain("prt")` will return
> "particle".

This section lists the syntactic dependency labels assigned by spaCy's
[models](/models). The individual labels are language-specific and depend on the
training corpus.

<Accordion title="Universal Dependency Labels" id="dependency-parsing-universal">

The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is
used for all languages whose models are trained on Universal Dependencies
corpora.

| Label | Description |
| ------------ | -------------------------------------------- |
| `acl` | clausal modifier of noun (adjectival clause) |
| `advcl` | adverbial clause modifier |
| `advmod` | adverbial modifier |
| `amod` | adjectival modifier |
| `appos` | appositional modifier |
| `aux` | auxiliary |
| `case` | case marking |
| `cc` | coordinating conjunction |
| `ccomp` | clausal complement |
| `clf` | classifier |
| `compound` | compound |
| `conj` | conjunct |
| `cop` | copula |
| `csubj` | clausal subject |
| `dep` | unspecified dependency |
| `det` | determiner |
| `discourse` | discourse element |
| `dislocated` | dislocated elements |
| `expl` | expletive |
| `fixed` | fixed multiword expression |
| `flat` | flat multiword expression |
| `goeswith` | goes with |
| `iobj` | indirect object |
| `list` | list |
| `mark` | marker |
| `nmod` | nominal modifier |
| `nsubj` | nominal subject |
| `nummod` | numeric modifier |
| `obj` | object |
| `obl` | oblique nominal |
| `orphan` | orphan |
| `parataxis` | parataxis |
| `punct` | punctuation |
| `reparandum` | overridden disfluency |
| `root` | root |
| `vocative` | vocative |
| `xcomp` | open clausal complement |

</Accordion>

<Accordion title="English" id="dependency-parsing-english">

The English dependency labels use the
[CLEAR Style](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md)
by [ClearNLP](http://www.clearnlp.com).

| Label | Description |
| ----------- | -------------------------------------------- |
| `acl` | clausal modifier of noun (adjectival clause) |
| `acomp` | adjectival complement |
| `advcl` | adverbial clause modifier |
| `advmod` | adverbial modifier |
| `agent` | agent |
| `amod` | adjectival modifier |
| `appos` | appositional modifier |
| `attr` | attribute |
| `aux` | auxiliary |
| `auxpass` | auxiliary (passive) |
| `case` | case marking |
| `cc` | coordinating conjunction |
| `ccomp` | clausal complement |
| `compound` | compound |
| `conj` | conjunct |
| `cop` | copula |
| `csubj` | clausal subject |
| `csubjpass` | clausal subject (passive) |
| `dative` | dative |
| `dep` | unclassified dependent |
| `det` | determiner |
| `dobj` | direct object |
| `expl` | expletive |
| `intj` | interjection |
| `mark` | marker |
| `meta` | meta modifier |
| `neg` | negation modifier |
| `nn` | noun compound modifier |
| `nounmod` | modifier of nominal |
| `npmod` | noun phrase as adverbial modifier |
| `nsubj` | nominal subject |
| `nsubjpass` | nominal subject (passive) |
| `nummod` | numeric modifier |
| `oprd` | object predicate |
| `obj` | object |
| `obl` | oblique nominal |
| `parataxis` | parataxis |
| `pcomp` | complement of preposition |
| `pobj` | object of preposition |
| `poss` | possession modifier |
| `preconj` | pre-correlative conjunction |
| `prep` | prepositional modifier |
| `prt` | particle |
| `punct` | punctuation |
| `quantmod` | modifier of quantifier |
| `relcl` | relative clause modifier |
| `root` | root |
| `xcomp` | open clausal complement |

</Accordion>

<Accordion title="German" id="dependency-parsing-german">

The German dependency labels use the
[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html)
annotation scheme.

| Label | Description |
| ------- | ------------------------------- |
| `ac` | adpositional case marker |
| `adc` | adjective component |
| `ag` | genitive attribute |
| `ams` | measure argument of adjective |
| `app` | apposition |
| `avc` | adverbial phrase component |
| `cc` | comparative complement |
| `cd` | coordinating conjunction |
| `cj` | conjunct |
| `cm` | comparative conjunction |
| `cp` | complementizer |
| `cvc` | collocational verb construction |
| `da` | dative |
| `dm` | discourse marker |
| `ep` | expletive es |
| `ju` | junctor |
| `mnr` | postnominal modifier |
| `mo` | modifier |
| `ng` | negation |
| `nk` | noun kernel element |
| `nmc` | numerical component |
| `oa` | accusative object |
| `oa2` | second accusative object |
| `oc` | clausal object |
| `og` | genitive object |
| `op` | prepositional object |
| `par` | parenthetical element |
| `pd` | predicate |
| `pg` | phrasal genitive |
| `ph` | placeholder |
| `pm` | morphological particle |
| `pnc` | proper noun component |
| `punct` | punctuation |
| `rc` | relative clause |
| `re` | repeated element |
| `rs` | reported speech |
| `sb` | subject |
| `sbp` | passivized subject (PP) |
| `sp` | subject or predicate |
| `svp` | separable verb prefix |
| `uc` | unit component |
| `vo` | vocative |
| `ROOT` | root |

</Accordion>

## Named Entity Recognition {#named-entities}

> #### Tip: Understanding entity types
>
> You can also use `spacy.explain` to get the description for the string
> representation of an entity label. For example, `spacy.explain("LANGUAGE")`
> will return "any named language".

Models trained on the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19)
corpus support the following entity types:

| Type | Description |
| ------------- | ---------------------------------------------------- |
| `PERSON` | People, including fictional. |
| `NORP` | Nationalities or religious or political groups. |
| `FAC` | Buildings, airports, highways, bridges, etc. |
| `ORG` | Companies, agencies, institutions, etc. |
| `GPE` | Countries, cities, states. |
| `LOC` | Non-GPE locations, mountain ranges, bodies of water. |
| `PRODUCT` | Objects, vehicles, foods, etc. (Not services.) |
| `EVENT` | Named hurricanes, battles, wars, sports events, etc. |
| `WORK_OF_ART` | Titles of books, songs, etc. |
| `LAW` | Named documents made into laws. |
| `LANGUAGE` | Any named language. |
| `DATE` | Absolute or relative dates or periods. |
| `TIME` | Times smaller than a day. |
| `PERCENT` | Percentage, including "%". |
| `MONEY` | Monetary values, including unit. |
| `QUANTITY` | Measurements, as of weight or distance. |
| `ORDINAL` | "first", "second", etc. |
| `CARDINAL` | Numerals that do not fall under another type. |

### Wikipedia scheme {#ner-wikipedia-scheme}

Models trained on the Wikipedia corpus
([Nothman et al., 2013](http://www.sciencedirect.com/science/article/pii/S0004370212000276))
use a less fine-grained NER annotation scheme and recognise the following
entities:

| Type | Description |
| ------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `PER` | Named person or family. |
| `LOC` | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains). |
| `ORG` | Named corporate, governmental, or other organizational entity. |
| `MISC` | Miscellaneous entities, e.g. events, nationalities, products or works of art. |

### IOB Scheme {#iob}

| Tag | ID | Description |
| ----- | --- | ------------------------------------- |
| `"I"` | `1` | Token is inside an entity. |
| `"O"` | `2` | Token is outside an entity. |
| `"B"` | `3` | Token begins an entity. |
| `""` | `0` | No entity tag is set (missing value). |

### BILUO Scheme {#biluo}

| Tag | Description |
| ----------- | ---------------------------------------- |
| **`B`**EGIN | The first token of a multi-token entity. |
| **`I`**N | An inner token of a multi-token entity. |
| **`L`**AST | The final token of a multi-token entity. |
| **`U`**NIT | A single-token entity. |
| **`O`**UT | A non-entity token. |

> #### Why BILUO, not IOB?
>
> There are several coding schemes for encoding entity annotations as token
> tags. These coding schemes are equally expressive, but not necessarily equally
> learnable. [Ratinov and Roth](http://www.aclweb.org/anthology/W09-1119) showed
> that the minimal **Begin**, **In**, **Out** scheme was more difficult to learn
> than the **BILUO** scheme that we use, which explicitly marks boundary tokens.

spaCy translates the character offsets of the annotated entities into this
scheme in order to decide the cost of each action given the current state of the
entity recognizer. The costs are then used to calculate the gradient of the
loss, which is used to train the model. The exact algorithm is a pastiche of
well-known methods, and is not currently described in any single publication.
The model is a greedy transition-based parser guided by a linear model whose
weights are learned using the averaged perceptron loss, via the
[dynamic oracle](http://www.aclweb.org/anthology/C12-1059) imitation learning
strategy. The transition system is equivalent to the BILUO tagging scheme.

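The translation from character offsets to BILUO tags can be sketched in a few
lines of plain Python. This is an illustration under stated assumptions, not
spaCy's implementation: `biluo_from_offsets` is a hypothetical helper, entities
are assumed not to overlap, the demo uses whitespace tokenization, and entities
that do not line up with token boundaries are simply left as `"O"`.

```python
import re


def biluo_from_offsets(token_spans, entities):
    """Tag each token with a BILUO label.

    token_spans: (start_char, end_char) for every token, in order.
    entities: (start_char, end_char, label) triples, assumed not to overlap.
    """
    tags = ["O"] * len(token_spans)
    start_to_token = {start: i for i, (start, _) in enumerate(token_spans)}
    end_to_token = {end: i for i, (_, end) in enumerate(token_spans)}
    for start_char, end_char, label in entities:
        first = start_to_token.get(start_char)
        last = end_to_token.get(end_char)
        if first is None or last is None or last < first:
            continue  # misaligned entity: its tokens stay "O" in this sketch
        if first == last:
            tags[first] = f"U-{label}"
        else:
            tags[first] = f"B-{label}"
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"
            tags[last] = f"L-{label}"
    return tags


text = "I like London and New York City ."
# Whitespace tokenization keeps the demo short; it is not spaCy's tokenizer.
token_spans = [match.span() for match in re.finditer(r"\S+", text)]
entities = [(7, 13, "GPE"), (18, 31, "GPE")]
tags = biluo_from_offsets(token_spans, entities)
assert tags == ["O", "O", "U-GPE", "O", "B-GPE", "I-GPE", "L-GPE", "O"]
```

The single-token entity "London" receives `U-GPE`, while "New York City" is
marked with explicit `B-`, `I-` and `L-` boundary tags.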
## Models and training data {#training}

### JSON input format for training {#json-input}

spaCy takes training data in JSON format. The built-in
[`convert`](/api/cli#convert) command helps you convert the `.conllu` format
used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies) to
spaCy's training format. To convert one or more existing `Doc` objects to
spaCy's JSON format, you can use the
[`gold.docs_to_json`](/api/goldparse#docs_to_json) helper.

> #### Annotating entities
>
> Named entities are provided in the [BILUO](#biluo) notation. Tokens outside an
> entity are set to `"O"` and tokens that are part of an entity are set to the
> entity label, prefixed by the BILUO marker. For example, `"B-ORG"` describes
> the first token of a multi-token `ORG` entity and `"U-PERSON"` a single token
> representing a `PERSON` entity. The
> [`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) function
> can help you convert entity offsets to the right format.

```python
### Example structure
[{
    "id": int,                      # ID of the document within the corpus
    "paragraphs": [{                # list of paragraphs in the corpus
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "tokens": [{            # list of tokens in the sentence
                "id": int,          # index of the token in the document
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string       # BILUO label, e.g. "O" or "B-ORG"
            }],
            "brackets": [{          # phrase structure (NOT USED by current models)
                "first": int,       # index of first token
                "last": int,        # index of last token
                "label": string     # phrase label
            }]
        }],
        "cats": [{                  # new in v2.2: categories for text classifier
            "label": string,        # text category label
            "value": float / bool   # label applies (1.0/true) or not (0.0/false)
        }]
    }]
}]
```

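As a small worked instance of the structure above, the sketch below builds one
document by hand and resolves the relative `head` offsets to absolute token
indices. The sentence, its labels and the helper `absolute_heads` are made up
for illustration; only the key names come from the structure shown above.

```python
import json

# Hypothetical two-token document following the structure above.
doc = {
    "id": 0,
    "paragraphs": [{
        "raw": "She writes",
        "sentences": [{
            "tokens": [
                {"id": 0, "dep": "nsubj", "head": 1, "tag": "PRP", "orth": "She", "ner": "O"},
                {"id": 1, "dep": "ROOT", "head": 0, "tag": "VBZ", "orth": "writes", "ner": "O"},
            ],
            "brackets": [],
        }],
        "cats": [],
    }],
}


def absolute_heads(sentence):
    """Resolve each relative 'head' offset to the index of the head token."""
    return [token["id"] + token["head"] for token in sentence["tokens"]]


sentence = doc["paragraphs"][0]["sentences"][0]
# "She" points one token to the right; an offset of 0 points at the token itself.
assert absolute_heads(sentence) == [1, 1]

# The top level of the structure is a list of documents.
training_data = json.dumps([doc], indent=2)
```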
Here's an example of dependencies, part-of-speech tags and named entities, taken
from the English Wall Street Journal portion of the Penn Treebank:

```json
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
```

### Lexical data for vocabulary {#vocab-jsonl new="2"}

To populate a model's vocabulary, you can use the
[`spacy init-model`](/api/cli#init-model) command and load in a
[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one
lexical entry per line via the `--jsonl-loc` option. The first line defines the
language and vocabulary settings. All other lines are expected to be JSON
objects describing an individual lexeme. The lexical attributes will then be set
as attributes on spaCy's [`Lexeme`](/api/lexeme#attributes) object. The
`init-model` command outputs a ready-to-use spaCy model with a `Vocab`
containing the lexical data.

```python
### First line
{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
```

```python
### Entry structure
{
    "orth": string,     # the word text
    "id": int,          # can correspond to row in vectors table
    "lower": string,
    "norm": string,
    "shape": string,
    "prefix": string,
    "suffix": string,
    "length": int,
    "cluster": string,
    "prob": float,
    "is_alpha": bool,
    "is_ascii": bool,
    "is_digit": bool,
    "is_lower": bool,
    "is_punct": bool,
    "is_space": bool,
    "is_title": bool,
    "is_upper": bool,
    "like_url": bool,
    "like_num": bool,
    "like_email": bool,
    "is_stop": bool,
    "is_oov": bool,
    "is_quote": bool,
    "is_left_punct": bool,
    "is_right_punct": bool
}
```

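The file layout described above (one settings line, then one lexeme per line)
can be produced with the standard library alone. The sketch below is only an
illustration: `make_entry` and `write_vocab_jsonl` are hypothetical helpers, the
probabilities are placeholder values, and only a subset of the attributes is
filled in. Whether `init-model` accepts entries with missing attributes is not
stated here, so treat that as an assumption.

```python
import io
import json


def make_entry(word, row, prob):
    """Build a partial lexeme entry (hypothetical helper, subset of attributes)."""
    return {
        "orth": word,
        "id": row,
        "lower": word.lower(),
        "length": len(word),
        "prob": prob,
        "is_alpha": word.isalpha(),
        "is_digit": word.isdigit(),
        "is_lower": word.islower(),
        "is_upper": word.isupper(),
    }


def write_vocab_jsonl(stream, lang, oov_prob, lexemes):
    """Write the settings line, then one JSON object per lexeme."""
    settings = {"lang": lang, "settings": {"oov_prob": oov_prob}}
    stream.write(json.dumps(settings) + "\n")
    for lexeme in lexemes:
        stream.write(json.dumps(lexeme) + "\n")


# Placeholder probabilities, not taken from any real model.
entries = [make_entry("the", 1, -3.5), make_entry("2019", 2, -9.0)]
buffer = io.StringIO()
write_vocab_jsonl(buffer, "en", -20.5, entries)
lines = buffer.getvalue().splitlines()
assert json.loads(lines[0])["lang"] == "en"
assert json.loads(lines[2])["is_digit"] is True
```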
Here's an example of the 20 most frequent lexemes in the English training data:

```json
https://github.com/explosion/spaCy/tree/master/examples/training/vocab-data.jsonl
```