---
title: Linguistic Features
next: /usage/rule-based-matching
menu:
  - ['POS Tagging', 'pos-tagging']
  - ['Morphology', 'morphology']
  - ['Lemmatization', 'lemmatization']
  - ['Dependency Parse', 'dependency-parse']
  - ['Named Entities', 'named-entities']
  - ['Entity Linking', 'entity-linking']
  - ['Tokenization', 'tokenization']
  - ['Merging & Splitting', 'retokenization']
  - ['Sentence Segmentation', 'sbd']
  - ['Mappings & Exceptions', 'mappings-exceptions']
  - ['Vectors & Similarity', 'vectors-similarity']
  - ['Language Data', 'language-data']
---

Processing raw text intelligently is difficult: most words are rare, and it's
common for words that look completely different to mean almost the same thing.
The same words in a different order can mean something completely different.
Even splitting text into useful word-like units can be difficult in many
languages. While it's possible to solve some problems starting from only the raw
characters, it's usually better to use linguistic knowledge to add useful
information. That's exactly what spaCy is designed to do: you put in raw text,
and get back a [`Doc`](/api/doc) object that comes with a variety of
annotations.

## Part-of-speech tagging {id="pos-tagging",model="tagger, parser"}

For a list of the fine-grained and coarse-grained part-of-speech tags assigned
by spaCy's models across different languages, see the label schemes documented
in the [models directory](/models).

## Morphology {id="morphology"}

Inflectional morphology is the process by which a root form of a word is
modified by adding prefixes or suffixes that specify its grammatical function
but do not change its part-of-speech. We say that a **lemma** (root form) is
**inflected** (modified/combined) with one or more **morphological features** to
create a surface form.
Here are some examples:

| Context                                  | Surface | Lemma | POS    | Morphological Features                   |
| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- |
| I was reading the paper                  | reading | read  | `VERB` | `VerbForm=Ger`                           |
| I don't watch the news, I read the paper | read    | read  | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
| I read the paper yesterday               | read    | read  | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |

Morphological features are stored in the
[`MorphAnalysis`](/api/morphology#morphanalysis) under `Token.morph`, which
allows you to access individual morphological features.

> #### 📝 Things to try
>
> 1. Change "I" to "She". You should see that the morphological features change
>    and express that it's a pronoun in the third person.
> 2. Inspect `token.morph` for the other tokens.

```python {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was reading the paper.")
token = doc[0]  # 'I'
print(token.morph)  # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
print(token.morph.get("PronType"))  # ['Prs']
```

### Statistical morphology {id="morphologizer",version="3",model="morphologizer"}

spaCy's statistical [`Morphologizer`](/api/morphologizer) component assigns the
morphological features and coarse-grained part-of-speech tags as `Token.morph`
and `Token.pos`.

```python {executable="true"}
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?")  # English: 'Where are you?'
print(doc[2].morph)  # 'Case=Nom|Number=Sing|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
```

### Rule-based morphology {id="rule-based-morphology"}

For languages with relatively simple morphological systems like English, spaCy
can assign morphological features through a rule-based approach, which uses the
**token text** and **fine-grained part-of-speech tags** to produce
coarse-grained part-of-speech tags and morphological features.

1.
   The part-of-speech tagger assigns each token a **fine-grained part-of-speech
   tag**. In the API, these tags are known as `Token.tag`. They express the
   part-of-speech (e.g. verb) and some amount of morphological information, for
   example that the verb is past tense (`VBD` in the Penn Treebank tag set).
2. For words whose coarse-grained POS is not set by a prior process, a
   [mapping table](#mappings-exceptions) maps the fine-grained tags to
   coarse-grained POS tags and morphological features.

```python {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
print(doc[2].morph)  # 'Case=Nom|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
```

## Lemmatization {id="lemmatization",version="3"}

spaCy provides two pipeline components for lemmatization:

1. The [`Lemmatizer`](/api/lemmatizer) component provides lookup and rule-based
   lemmatization methods in a configurable component. An individual language can
   extend the `Lemmatizer` as part of its [language data](#language-data).
2. The [`EditTreeLemmatizer`](/api/edittreelemmatizer) component (new in v3.3)
   provides a trainable lemmatizer.

```python {executable="true"}
import spacy

# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
```

Unlike spaCy v2, spaCy v3 models do _not_ provide lemmas by default or switch
automatically between lookup and rule-based lemmas depending on whether a tagger
is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to include a
[`Lemmatizer`](/api/lemmatizer) component. The lemmatizer component is
configured to use a single mode such as `"lookup"` or `"rule"` on
initialization. The `"rule"` mode requires `Token.pos` to be set by a previous
component.
The data for spaCy's lemmatizers is distributed in the repository
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
provided trained pipelines already include all the required tables, but if you
are creating new pipelines, you can load data from the repository in the
lemmatizer initialization.

### Lookup lemmatizer {id="lemmatizer-lookup"}

For pipelines without a tagger or morphologizer, a lookup lemmatizer can be
added to the pipeline as long as a lookup table is provided, typically through
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
lookup lemmatizer looks up the token surface form in the lookup table without
reference to the token's part-of-speech or context.

```python
# pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS
import spacy

nlp = spacy.blank("sv")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
```

### Rule-based lemmatizer {id="lemmatizer-rule",model="morphologizer"}

When training pipelines that include a component that assigns part-of-speech
tags (a morphologizer or a tagger with a [POS mapping](#mappings-exceptions)), a
rule-based lemmatizer can be added using rule tables from
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data):

```python
# pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS
import spacy

nlp = spacy.blank("de")
# Morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# Rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
```

The rule-based deterministic lemmatizer maps the surface form to a lemma in
light of the previously assigned coarse-grained part-of-speech and morphological
information, without consulting the context of the token. The rule-based
lemmatizer also accepts list-based exception files. For English, these are
acquired from [WordNet](https://wordnet.princeton.edu/).
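To make the difference between the two modes concrete, here is a minimal
pure-Python sketch. It is not spaCy's implementation: the tables
(`LOOKUP_TABLE`, `EXCEPTIONS`, `SUFFIX_RULES`, `INDEX`) and the functions
`lookup_lemma` and `rule_lemma` are hypothetical and tiny compared to the data
in `spacy-lookups-data`. It only illustrates that a lookup returns one lemma
per surface form, while a rule-based approach can return different lemmas
depending on the part-of-speech.

```python
# Toy illustration only: all tables and names below are hypothetical and are
# not taken from spaCy or spacy-lookups-data.
LOOKUP_TABLE = {"was": "be", "reading": "read", "papers": "paper"}

# Rule-based data, keyed by lowercase coarse-grained part-of-speech.
EXCEPTIONS = {"verb": {"was": "be"}}
SUFFIX_RULES = {
    "verb": [("ing", ""), ("ed", ""), ("s", "")],
    "noun": [("s", "")],
}
INDEX = {"verb": {"read", "be", "walk"}, "noun": {"paper", "reading"}}


def lookup_lemma(text):
    """Look up the surface form only; part-of-speech and context are ignored."""
    return LOOKUP_TABLE.get(text.lower(), text)


def rule_lemma(text, pos):
    """Lemmatize using the part-of-speech: exceptions first, then suffix rules."""
    form = text.lower()
    pos = pos.lower()
    # 1. Irregular forms are listed explicitly.
    if form in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][form]
    # 2. A form that is already a known base form is returned unchanged.
    if form in INDEX.get(pos, set()):
        return form
    # 3. Otherwise try suffix rules and keep the first known candidate.
    for old, new in SUFFIX_RULES.get(pos, []):
        if form.endswith(old):
            candidate = form[: len(form) - len(old)] + new
            if candidate in INDEX.get(pos, set()):
                return candidate
    return form


print(lookup_lemma("reading"))  # read
for pos in ("VERB", "NOUN"):
    print(pos, rule_lemma("reading", pos))
# VERB read
# NOUN reading
```

In this sketch the lookup always maps "reading" to "read", whereas the
rule-based function keeps "reading" when it is tagged as a noun. That is why
the `"rule"` mode depends on a previous component setting `Token.pos`.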
### Trainable lemmatizer {id="lemmatizer-train",model="trainable_lemmatizer"}

The [`EditTreeLemmatizer`](/api/edittreelemmatizer) can learn form-to-lemma
transformations from a training corpus that includes lemma annotations. This
removes the need to write language-specific rules and can (in many cases)
provide higher accuracies than lookup and rule-based lemmatizers.

```python
import spacy

nlp = spacy.blank("de")
nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
```

## Dependency Parsing {id="dependency-parse",model="parser"}

spaCy features a fast and accurate syntactic dependency parser, and has a rich
API for navigating the tree. The parser also powers the sentence boundary
detection, and lets you iterate over base noun phrases, or "chunks". You can
check whether a [`Doc`](/api/doc) object has been parsed by calling
`doc.has_annotation("DEP")`, which checks whether the attribute `Token.dep` has
been set and returns a boolean value. If the result is `False`, the default
sentence iterator will raise an exception.

For a list of the syntactic dependency labels assigned by spaCy's models across
different languages, see the label schemes documented in the
[models directory](/models).

### Noun chunks {id="noun-chunks"}

Noun chunks are "base noun phrases" – flat phrases that have a noun as their
head. You can think of noun chunks as a noun plus the words describing the noun
– for example, "the lavish green grass" or "the world’s largest tech fund". To
get the noun chunks in a document, simply iterate over
[`Doc.noun_chunks`](/api/doc#noun_chunks).

```python {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)
```

> - **Text:** The original noun chunk text.
> - **Root text:** The original text of the word connecting the noun chunk to
>   the rest of the parse.
> - **Root dep:** Dependency relation connecting the root to its head.
> - **Root head text:** The text of the root token's head.

| Text                | root.text     | root.dep\_ | root.head.text |
| ------------------- | ------------- | ---------- | -------------- |
| Autonomous cars     | cars          | `nsubj`    | shift          |
| insurance liability | liability     | `dobj`     | shift          |
| manufacturers       | manufacturers | `pobj`     | toward         |

### Navigating the parse tree {id="navigating"}

spaCy uses the terms **head** and **child** to describe the words **connected by
a single arc** in the dependency tree. The term **dep** is used for the arc
label, which describes the type of syntactic relation that connects the child to
the head. As with other attributes, the value of `.dep` is a hash value. You can
get the string value with `.dep_`.

```python {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])
```

> - **Text:** The original token text.
> - **Dep:** The syntactic relation connecting child to head.
> - **Head text:** The original text of the token head.
> - **Head POS:** The part-of-speech tag of the token head.
> - **Children:** The immediate syntactic dependents of the token.

| Text          | Dep        | Head text | Head POS | Children                |
| ------------- | ---------- | --------- | -------- | ----------------------- |
| Autonomous    | `amod`     | cars      | `NOUN`   |                         |
| cars          | `nsubj`    | shift     | `VERB`   | Autonomous              |
| shift         | `ROOT`     | shift     | `VERB`   | cars, liability, toward |
| insurance     | `compound` | liability | `NOUN`   |                         |
| liability     | `dobj`     | shift     | `VERB`   | insurance               |
| toward        | `prep`     | shift     | `VERB`   | manufacturers           |
| manufacturers | `pobj`     | toward    | `ADP`    |                         |
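
The head/child relation in the table above can also be reproduced without
loading a pipeline. The following is a minimal pure-Python sketch, not spaCy's
API: the `words` and `heads` lists are copied by hand from the table, and the
helpers `children_of` and `subtree_of` are hypothetical names used only for
illustration. It shows that storing one head index per token is enough to
recover each token's children and the full subtree below it.

```python
# Words and head indices copied by hand from the table above. Each entry in
# `heads` is the index of that token's head; the root points to itself.
words = ["Autonomous", "cars", "shift", "insurance", "liability",
         "toward", "manufacturers"]
heads = [1, 2, 2, 4, 2, 2, 5]


def children_of(index):
    """Immediate dependents of a token, in document order."""
    return [i for i, head in enumerate(heads) if head == index and i != index]


def subtree_of(index):
    """The token plus all of its descendants, in document order."""
    found = [index]
    for child in children_of(index):
        found.extend(subtree_of(child))
    return sorted(found)


# The root is the only token that is its own head.
root = next(i for i, head in enumerate(heads) if head == i)
print(words[root])  # shift
print([words[i] for i in children_of(root)])  # ['cars', 'liability', 'toward']
print([words[i] for i in subtree_of(4)])  # ['insurance', 'liability']
```

The children computed for "shift" match the **Children** column of the table,
and the subtree of "liability" corresponds to the noun chunk "insurance
liability".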