From b6209e24271bcc141c21168e4592a5063e8bc2f2 Mon Sep 17 00:00:00 2001 From: ines Date: Tue, 23 May 2017 23:18:08 +0200 Subject: [PATCH] Update POS tagging workflow --- website/docs/usage/pos-tagging.jade | 28 ++++++++++------------------ 1 file changed, 10 insertions(+), 18 deletions(-) diff --git a/website/docs/usage/pos-tagging.jade b/website/docs/usage/pos-tagging.jade index cded00b6c..245156b77 100644 --- a/website/docs/usage/pos-tagging.jade +++ b/website/docs/usage/pos-tagging.jade @@ -7,22 +7,12 @@ p | assigned to each token in the document. They're useful in rule-based | processes. They can also be useful features in some statistical models. -p - | To use spaCy's tagger, you need to have a data pack installed that - | includes a tagging model. Tagging models are included in the data - | downloads for English and German. After you load the model, the tagger - | is applied automatically, as part of the default pipeline. You can then - | access the tags using the #[+api("token") #[code Token.tag]] and - | #[+api("token") #[code token.pos]] attributes. For English, the tagger - | also triggers some simple rule-based morphological processing, which - | gives you the lemma as well. ++h(2, "101") Part-of-speech tagging 101 + +tag-model("dependency parse") -+code("Usage"). - import spacy - nlp = spacy.load('en') - doc = nlp(u'They told us to duck.') - for word in doc: - print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_) +include _spacy-101/_pos-deps + ++aside("Help – spaCy's output is wrong!") +h(2, "rule-based-morphology") Rule-based morphology @@ -63,7 +53,8 @@ p +list("numbers") +item - | The tokenizer consults a #[strong mapping table] + | The tokenizer consults a + | #[+a("/docs/usage/adding-languages#tokenizer-exceptions") mapping table] | #[code TOKENIZER_EXCEPTIONS], which allows sequences of characters | to be mapped to multiple tokens. Each token may be assigned a part | of speech and one or more morphological features. @@ -77,8 +68,9 @@ p +item | For words whose POS is not set by a prior process, a - | #[strong mapping table] #[code TAG_MAP] maps the tags to a - | part-of-speech and a set of morphological features. + | #[+a("/docs/usage/adding-languages#tag-map") mapping table] + | #[code TAG_MAP] maps the tags to a part-of-speech and a set of + | morphological features. +item | Finally, a #[strong rule-based deterministic lemmatizer] maps the