Mirror of https://github.com/explosion/spaCy.git (synced 2025-05-02 23:03:41 +03:00)
Update POS tagging workflow

parent 43258d6b0a
commit b6209e2427
@@ -7,22 +7,12 @@ p
     | assigned to each token in the document. They're useful in rule-based
     | processes. They can also be useful features in some statistical models.
 
-p
-    | To use spaCy's tagger, you need to have a data pack installed that
-    | includes a tagging model. Tagging models are included in the data
-    | downloads for English and German. After you load the model, the tagger
-    | is applied automatically, as part of the default pipeline. You can then
-    | access the tags using the #[+api("token") #[code Token.tag]] and
-    | #[+api("token") #[code token.pos]] attributes. For English, the tagger
-    | also triggers some simple rule-based morphological processing, which
-    | gives you the lemma as well.
++h(2, "101") Part-of-speech tagging 101
 
-+code("Usage").
-    import spacy
++tag-model("dependency parse")
 
-    nlp = spacy.load('en')
-    doc = nlp(u'They told us to duck.')
-    for word in doc:
-        print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
+include _spacy-101/_pos-deps
+
++aside("Help – spaCy's output is wrong!")
 
 +h(2, "rule-based-morphology") Rule-based morphology
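The removed usage snippet prints both the integer and the string form of each attribute (`word.tag` vs. `word.tag_`, `word.pos` vs. `word.pos_`). A minimal pure-Python sketch of that naming convention, using a hypothetical `StringStore` stand-in rather than spaCy's real string table:

```python
# Toy illustration of spaCy's paired attribute convention: `tag` is an
# integer ID, `tag_` is the readable string.  spaCy interns each string
# once and passes integers around; `StringStore` here is a hypothetical
# stand-in for that table, not spaCy's real class.

class StringStore:
    """Maps strings to integer IDs and back."""
    def __init__(self):
        self._to_id = {}
        self._to_str = []

    def add(self, string):
        if string not in self._to_id:
            self._to_id[string] = len(self._to_str)
            self._to_str.append(string)
        return self._to_id[string]

    def __getitem__(self, i):
        return self._to_str[i]

store = StringStore()

class Token:
    def __init__(self, text, tag, pos):
        self.text = text
        self.tag = store.add(tag)   # integer ID, cheap to store and compare
        self.pos = store.add(pos)

    @property
    def tag_(self):                 # string view, resolved on demand
        return store[self.tag]

    @property
    def pos_(self):
        return store[self.pos]

duck = Token('duck', 'VB', 'VERB')
print(duck.text, duck.tag, duck.tag_, duck.pos, duck.pos_)
# -> duck 0 VB 1 VERB
```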
@@ -63,7 +53,8 @@ p
 
 +list("numbers")
     +item
-        | The tokenizer consults a #[strong mapping table]
+        | The tokenizer consults a
+        | #[+a("/docs/usage/adding-languages#tokenizer-exceptions") mapping table]
         | #[code TOKENIZER_EXCEPTIONS], which allows sequences of characters
         | to be mapped to multiple tokens. Each token may be assigned a part
         | of speech and one or more morphological features.
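The list item above describes the `TOKENIZER_EXCEPTIONS` mapping table. A toy sketch of the idea, assuming a whitespace pre-split and plain string keys (spaCy's real table keys on attribute ID constants):

```python
# Toy version of a tokenizer exceptions table: one character sequence
# maps to a list of token dicts, each of which may carry a part of
# speech and other attributes.  Data here is illustrative only.

TOKENIZER_EXCEPTIONS = {
    "don't": [
        {"orth": "do", "pos": "VERB"},
        {"orth": "n't", "pos": "ADV", "lemma": "not"},
    ],
}

def tokenize(text):
    """Whitespace pre-split, then expand any exception matches."""
    tokens = []
    for chunk in text.split():
        if chunk in TOKENIZER_EXCEPTIONS:
            # One character sequence becomes several tokens.
            tokens.extend(TOKENIZER_EXCEPTIONS[chunk])
        else:
            tokens.append({"orth": chunk})
    return tokens

print([t["orth"] for t in tokenize("I don't")])  # -> ['I', 'do', "n't"]
```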
@@ -77,8 +68,9 @@ p
 
     +item
         | For words whose POS is not set by a prior process, a
-        | #[strong mapping table] #[code TAG_MAP] maps the tags to a
-        | part-of-speech and a set of morphological features.
+        | #[+a("/docs/usage/adding-languages#tag-map") mapping table]
+        | #[code TAG_MAP] maps the tags to a part-of-speech and a set of
+        | morphological features.
 
     +item
         | Finally, a #[strong rule-based deterministic lemmatizer] maps the
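The `TAG_MAP` lookup and the rule-based deterministic lemmatizer described in these items can be sketched together; the table contents and suffix rules below are illustrative, not spaCy's actual data:

```python
# Toy TAG_MAP: fine-grained tag -> coarse part of speech plus
# morphological features.  Contents are illustrative only.
TAG_MAP = {
    "NNS": {"pos": "NOUN", "Number": "plur"},
    "VBZ": {"pos": "VERB", "Tense": "pres", "Person": 3},
}

# Toy rule-based, deterministic lemmatizer: an exception table is
# consulted first, then ordered suffix-rewrite rules.
LEMMA_EXCEPTIONS = {"was": "be"}
LEMMA_RULES = [("ies", "y"), ("s", "")]

def lemmatize(word):
    if word in LEMMA_EXCEPTIONS:
        return LEMMA_EXCEPTIONS[word]
    for suffix, replacement in LEMMA_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word                     # no rule applied: word is its own lemma

print(TAG_MAP["NNS"]["pos"], lemmatize("ponies"), lemmatize("was"))
# -> NOUN pony be
```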