Add "Part-of-speech tagging" workflow (closes #581)

2026-03-07 05:11:27 +03:00 · 2016-12-18 23:50:49 +01:00 · 2016-12-18 23:50:49 +01:00 · 1cddb7da36
commit 1cddb7da36
parent 89398ca57b
2 changed files with 99 additions and 0 deletions
--- a/website/docs/usage/_data.json
+++ b/website/docs/usage/_data.json
@ -8,6 +8,7 @@
            "Loading the pipeline": "language-processing-pipeline",
            "Processing text": "processing-text",
            "spaCy's data model": "data-model",
+            "POS tagging": "pos-tagging",
            "Using the parse": "dependency-parse",
            "Entity recognition": "entity-recognition",
            "Custom pipelines": "customizing-pipeline",
@ -82,6 +83,11 @@
        "title": "Training the tagger, parser and entity recognizer"
    },

+    "pos-tagging": {
+        "title": "Part-of-speech tagging",
+        "next": "dependency-parse"
+    },
+
    "showcase": {
        "title": "Showcase",

--- a/website/docs/usage/pos-tagging.jade
+++ b/website/docs/usage/pos-tagging.jade
@ -0,0 +1,93 @@
+//- 💫 DOCS > USAGE > PART-OF-SPEECH TAGGING
+
+include ../../_includes/_mixins
+
+p
+    |  Part-of-speech tags are labels like noun, verb, adjective etc that are
+    |  assigned to each token in the document. They're useful in rule-based
+    |  processes. They can also be useful features in some statistical models.
+
+p
+    |  To use spaCy's tagger, you need to have a data pack installed that
+    |  includes a tagging model. Tagging models are included in the data
+    |  downloads for English and German. After you load the model, the tagger
+    |  is applied automatically, as part of the default pipeline. You can then
+    |  access the tags using the #[+api("token") #[code Token.tag]] and
+    |  #[+api("token") #[code token.pos]] attributes. For English, the tagger
+    |  also triggers some simple rule-based morphological processing, which
+    |  gives you the lemma as well.
+
+code("Usage").
+    import spacy
+    nlp = spacy.load('en')
+    doc = nlp(u'They told us to duck.')
+    for word in doc:
+        print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
+
+h(2, "rule-based-morphology") Rule-based morphology
+
+p
+    |  Inflectional morphology is the process by which a root form of a word is
+    |  modified by adding prefixes or suffixes that specify its grammatical
+    |  function but do not changes its part-of-speech. We say that a
+    |  #[strong lemma] (root form) is #[strong inflected] (modified/combined)
+    |  with one or more #[strong morphological features] to create a surface
+    |  form. Here are some examples:
+
+table(["Context", "Surface", "Lemma", "POS", "Morphological Features"])
+    +row
+        +cell I was reading the paper
+        +cell reading
+        +cell read
+        +cell verb
+        +cell #[code VerbForm=Ger]
+
+    +row
+        +cell I don't watch the news, I read the paper.
+        +cell read
+        +cell read
+        +cell verb
+        +cell #[code VerbForm=Fin], #[code Mood=Ind], #[code Tense=Pres]
+
+    +row
+        +cell I read the paper yesteday
+        +cell read
+        +cell read
+        +cell verb
+        +cell #[code VerbForm=Fin], #[code Mood=Ind], #[code Tense=Past]
+
+p
+    |  English has a relatively simple morphological system, which spaCy
+    |  handles using rules that can be keyed by the token, the part-of-speech
+    |  tag, or the combination of the two. The system works as follows:
+
+list("numbers")
+    +item
+        |  The tokenizer consults a #[strong mapping table]
+        |  #[code TOKENIZER_EXCEPTIONS], which allows sequences of characters
+        |  to be mapped to multiple tokens. Each token may be assigned a part
+        |  of speech and one or more morphological features.
+
+    +item
+        |  The part-of-speech tagger then assigns each token an
+        |  #[strong extended POS tag]. In the API, these tags are known as
+        |  #[code Token.tag]. They express the part-of-speech (e.g.
+        |  #[code VERB]) and some amount of morphological information, e.g.
+        |  that the verb is past tense.
+
+    +item
+        |  For words whose POS is not set by a prior process, a
+        |  #[strong mapping table] #[code TAG_MAP] maps the tags to a
+        |  part-of-speech and a set of morphological features.
+
+    +item
+        |  Finally, a #[strong rule-based deterministic lemmatizer] maps the
+        |  surface form, to a lemma in light of the previously assigned
+        |  extended part-of-speech and morphological information, without
+        |  consulting the context of the token. The lemmatizer also accepts
+        |  list-based exception files, acquired from
+        |  #[+a("https://wordnet.princeton.edu/") WordNet].
+
+h(2, "pos-schemes") Part-of-speech tag schemes
+
+include ../api/_annotation/_pos-tags