Update POS tagging workflow

ines 2017-05-23 23:18:08 +02:00
parent 43258d6b0a
commit b6209e2427


@@ -7,22 +7,12 @@ p
| assigned to each token in the document. They're useful in rule-based
| processes. They can also be useful features in some statistical models.
p
| To use spaCy's tagger, you need to have a data pack installed that
| includes a tagging model. Tagging models are included in the data
| downloads for English and German. After you load the model, the tagger
| is applied automatically, as part of the default pipeline. You can then
| access the tags using the #[+api("token") #[code Token.tag]] and
| #[+api("token") #[code token.pos]] attributes. For English, the tagger
| also triggers some simple rule-based morphological processing, which
| gives you the lemma as well.
+h(2, "101") Part-of-speech tagging 101
+tag-model("dependency parse")
+code("Usage").
import spacy
nlp = spacy.load('en')
doc = nlp(u'They told us to duck.')
for word in doc:
print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
include _spacy-101/_pos-deps
+aside("Help spaCy's output is wrong!")
+h(2, "rule-based-morphology") Rule-based morphology
@@ -63,7 +53,8 @@ p
+list("numbers")
+item
| The tokenizer consults a #[strong mapping table]
| The tokenizer consults a
| #[+a("/docs/usage/adding-languages#tokenizer-exceptions") mapping table]
| #[code TOKENIZER_EXCEPTIONS], which allows sequences of characters
| to be mapped to multiple tokens. Each token may be assigned a part
| of speech and one or more morphological features.
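p
    | To make this step concrete, a special case of this kind can also be
    | registered at runtime via #[code Tokenizer.add_special_case]. The
    | entry below is only a sketch and assumes a spaCy version whose
    | exception entries may carry #[code LEMMA] and #[code POS] in addition
    | to #[code ORTH].

+code("Tokenizer exception (sketch)").
    import spacy
    from spacy.symbols import ORTH, LEMMA, POS

    nlp = spacy.load('en')
    # map the single string "gimme" to two tokens, setting lemma and POS
    nlp.tokenizer.add_special_case(u'gimme',
        [{ORTH: u'gim', LEMMA: u'give', POS: u'VERB'},
         {ORTH: u'me'}])
    print([w.text for w in nlp(u'gimme that')])  # ['gim', 'me', 'that']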
@@ -77,8 +68,9 @@ p
+item
| For words whose POS is not set by a prior process, a
| #[strong mapping table] #[code TAG_MAP] maps the tags to a
| part-of-speech and a set of morphological features.
| #[+a("/docs/usage/adding-languages#tag-map") mapping table]
| #[code TAG_MAP] maps the tags to a part-of-speech and a set of
| morphological features (see the sketch after this list).
+item
| Finally, a #[strong rule-based deterministic lemmatizer] maps the
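p
    | To give a sense of what the #[code TAG_MAP] described above looks
    | like, here's a small illustrative excerpt. The entries are a sketch
    | modelled on the English tag map; the full tables live in each
    | language's data files.

+code("TAG_MAP excerpt (sketch)").
    from spacy.symbols import POS, NOUN, VERB

    # each fine-grained tag maps to a coarse part-of-speech plus features
    TAG_MAP = {
        'NN':  {POS: NOUN, 'Number': 'sing'},
        'NNS': {POS: NOUN, 'Number': 'plur'},
        'VBD': {POS: VERB, 'VerbForm': 'fin', 'Tense': 'past'}
    }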