Add "Part-of-speech tagging" workflow (closes #581)

2025-09-14 08:02:40 +03:00 · 2016-12-18 23:50:49 +01:00 · 2016-12-18 23:50:49 +01:00 · 1cddb7da36
commit 1cddb7da36
parent 89398ca57b
2 changed files with 99 additions and 0 deletions
--- a/website/docs/usage/_data.json
+++ b/website/docs/usage/_data.json
@@ -8,6 +8,7 @@
            "Loading the pipeline": "language-processing-pipeline",
            "Processing text": "processing-text",
            "spaCy's data model": "data-model",
            "POS tagging": "pos-tagging",
            "Using the parse": "dependency-parse",
            "Entity recognition": "entity-recognition",
            "Custom pipelines": "customizing-pipeline",
@@ -82,6 +83,11 @@
        "title": "Training the tagger, parser and entity recognizer"
    },
    "pos-tagging": {
        "title": "Part-of-speech tagging",
        "next": "dependency-parse"
    },
    "showcase": {
        "title": "Showcase",
--- a/website/docs/usage/pos-tagging.jade
+++ b/website/docs/usage/pos-tagging.jade
@@ -0,0 +1,93 @@
 //- 💫 DOCS > USAGE > PART-OF-SPEECH TAGGING
 include ../../_includes/_mixins
 p
    |  Part-of-speech tags are labels like noun, verb, adjective, etc. that are
    |  assigned to each token in the document. They're useful in rule-based
    |  processes. They can also be useful features in some statistical models.
 p
    |  To use spaCy's tagger, you need to have a data pack installed that
    |  includes a tagging model. Tagging models are included in the data
    |  downloads for English and German. After you load the model, the tagger
    |  is applied automatically, as part of the default pipeline. You can then
    |  access the tags using the #[+api("token") #[code Token.tag]] and
    |  #[+api("token") #[code Token.pos]] attributes. For English, the tagger
    |  also triggers some simple rule-based morphological processing, which
    |  gives you the lemma as well.
 +code("Usage").
    import spacy
    nlp = spacy.load('en')
    doc = nlp(u'They told us to duck.')
    for word in doc:
        print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
 +h(2, "rule-based-morphology") Rule-based morphology
 p
    |  Inflectional morphology is the process by which a root form of a word is
    |  modified by adding prefixes or suffixes that specify its grammatical
    |  function but do not change its part-of-speech. We say that a
    |  #[strong lemma] (root form) is #[strong inflected] (modified/combined)
    |  with one or more #[strong morphological features] to create a surface
    |  form. Here are some examples:
 +table(["Context", "Surface", "Lemma", "POS", "Morphological Features"])
    +row
        +cell I was reading the paper
        +cell reading
        +cell read
        +cell verb
        +cell #[code VerbForm=Ger]
    +row
        +cell I don't watch the news, I read the paper.
        +cell read
        +cell read
        +cell verb
        +cell #[code VerbForm=Fin], #[code Mood=Ind], #[code Tense=Pres]
    +row
        +cell I read the paper yesterday
        +cell read
        +cell read
        +cell verb
        +cell #[code VerbForm=Fin], #[code Mood=Ind], #[code Tense=Past]
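 p
    |  As a rough illustration of the table above, the sketch below stores
    |  each analysis as a surface form, a lemma and a feature string, and
    |  parses the features into a dictionary. The helper names
    |  #[code parse_features] and #[code format_features] are hypothetical,
    |  and the #[code |] separator is an assumption borrowed from the
    |  Universal Dependencies convention. None of this is part of spaCy's API.

```python
# Hypothetical helpers for illustration only; not part of spaCy.

def parse_features(feature_string):
    """Turn 'VerbForm=Fin|Mood=Ind' into {'VerbForm': 'Fin', 'Mood': 'Ind'}."""
    features = {}
    if not feature_string:
        return features
    for pair in feature_string.split("|"):
        name, value = pair.split("=", 1)
        features[name] = value
    return features


def format_features(features):
    """Serialize a feature dict back to a string, sorted by feature name."""
    return "|".join("{}={}".format(k, v) for k, v in sorted(features.items()))


# The three rows of the table: (surface form, lemma, features).
ANALYSES = [
    ("reading", "read", "VerbForm=Ger"),
    ("read", "read", "VerbForm=Fin|Mood=Ind|Tense=Pres"),
    ("read", "read", "VerbForm=Fin|Mood=Ind|Tense=Past"),
]

for surface, lemma, feats in ANALYSES:
    print(surface, lemma, parse_features(feats))
```

 p
    |  The second and third rows share a surface form and a lemma, so only
    |  the #[code Tense] feature tells them apart.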
 p
    |  English has a relatively simple morphological system, which spaCy
    |  handles using rules that can be keyed by the token, the part-of-speech
    |  tag, or the combination of the two. The system works as follows:
 +list("numbers")
    +item
        |  The tokenizer consults a #[strong mapping table]
        |  #[code TOKENIZER_EXCEPTIONS], which allows sequences of characters
        |  to be mapped to multiple tokens. Each token may be assigned a part
        |  of speech and one or more morphological features.
    +item
        |  The part-of-speech tagger then assigns each token an
        |  #[strong extended POS tag]. In the API, these tags are known as
        |  #[code Token.tag]. They express the part-of-speech (e.g.
        |  #[code VERB]) and some amount of morphological information, e.g.
        |  that the verb is past tense.
    +item
        |  For words whose POS is not set by a prior process, a
        |  #[strong mapping table] #[code TAG_MAP] maps the tags to a
        |  part-of-speech and a set of morphological features.
    +item
        |  Finally, a #[strong rule-based deterministic lemmatizer] maps the
        |  surface form to a lemma in light of the previously assigned
        |  extended part-of-speech and morphological information, without
        |  consulting the context of the token. The lemmatizer also accepts
        |  list-based exception files, acquired from
        |  #[+a("https://wordnet.princeton.edu/") WordNet].
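 p
    |  The four steps above can be sketched in plain Python. This is a toy
    |  model under stated assumptions, not spaCy's implementation: the table
    |  contents, the suffix rules and the functions #[code tokenize],
    |  #[code apply_tag], #[code lemmatize] and #[code analyze] are all
    |  hypothetical, and the statistical tagger is replaced by a list of tags
    |  supplied by the caller.

```python
# Toy stand-ins for the tables named above. The entries are made up for
# illustration and are NOT spaCy's real data.
TOKENIZER_EXCEPTIONS = {
    "don't": [
        {"text": "do", "lemma": "do", "pos": "VERB"},
        {"text": "n't", "lemma": "not", "pos": "ADV"},
    ],
}

# Extended tag -> (coarse part-of-speech, morphological features).
TAG_MAP = {
    "PRP": ("PRON", {}),
    "NN": ("NOUN", {"Number": "Sing"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
    "RB": ("ADV", {}),
    "VBD": ("VERB", {"VerbForm": "Fin", "Tense": "Past"}),
    "VBG": ("VERB", {"VerbForm": "Ger"}),
    "VBP": ("VERB", {"VerbForm": "Fin", "Tense": "Pres"}),
}

# Irregular forms are looked up first; suffix rules are the fallback.
LEMMA_EXCEPTIONS = {("VERB", "told"): "tell"}
SUFFIX_RULES = {"VERB": [("ing", ""), ("ed", "")], "NOUN": [("s", "")]}


def tokenize(text):
    """Step 1: split on whitespace, expanding tokenizer exceptions."""
    tokens = []
    for chunk in text.split():
        if chunk in TOKENIZER_EXCEPTIONS:
            tokens.extend(dict(t) for t in TOKENIZER_EXCEPTIONS[chunk])
        else:
            tokens.append({"text": chunk})
    return tokens


def apply_tag(token, tag):
    """Steps 2 and 3: store the extended tag, then consult TAG_MAP
    only if no earlier step has already set the part-of-speech."""
    token["tag"] = tag
    if "pos" not in token:
        pos, morph = TAG_MAP[tag]
        token["pos"] = pos
        token["morph"] = dict(morph)
    token.setdefault("morph", {})
    return token


def lemmatize(token):
    """Step 4: deterministic lemma from surface form and POS, no context."""
    if "lemma" in token:  # already set by a tokenizer exception
        return token["lemma"]
    word = token["text"].lower()
    pos = token["pos"]
    if (pos, word) in LEMMA_EXCEPTIONS:
        return LEMMA_EXCEPTIONS[(pos, word)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)] + replacement
    return word


def analyze(text, tags):
    """Run all steps; `tags` stands in for the statistical tagger's output."""
    tokens = tokenize(text)
    if len(tokens) != len(tags):
        raise ValueError("expected one tag per token")
    for token, tag in zip(tokens, tags):
        apply_tag(token, tag)
        token["lemma"] = lemmatize(token)
    return tokens


for token in analyze("She told jokes", ["PRP", "VBD", "NNS"]):
    print(token["text"], token["tag"], token["pos"], token["lemma"])
```

 p
    |  Note that the lemmatizer in this sketch never looks at neighbouring
    |  tokens: everything it needs has been attached to the token by the
    |  earlier steps.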
 +h(2, "pos-schemes") Part-of-speech tag schemes
 include ../api/_annotation/_pos-tags