title | next |
---|---|
Linguistic Features | /usage/rule-based-matching |
Processing raw text intelligently is difficult: most words are rare, and it's
common for words that look completely different to mean almost the same thing.
The same words in a different order can mean something completely different.
Even splitting text into useful word-like units can be difficult in many
languages. While it's possible to solve some problems starting from only the raw
characters, it's usually better to use linguistic knowledge to add useful
information. That's exactly what spaCy is designed to do: you put in raw text,
and get back a Doc object that comes with a variety of annotations.
Part-of-speech tagging
For a list of the fine-grained and coarse-grained part-of-speech tags assigned by spaCy's models across different languages, see the label schemes documented in the models directory.
Rule-based morphology
Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:
Context | Surface | Lemma | POS | Morphological Features |
---|---|---|---|---|
I was reading the paper | reading | read | verb | VerbForm=Ger |
I don't watch the news, I read the paper | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Pres |
I read the paper yesterday | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Past |
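As a small illustration of how the feature column can be read, the sketch below splits the `Key=Value` strings from the table into dictionaries. The helper name `parse_features` and the `ROWS` list are hypothetical and exist only for this example; they are not part of spaCy's API.

```python
def parse_features(feature_string):
    """Turn 'VerbForm=Fin, Mood=Ind' into {'VerbForm': 'Fin', 'Mood': 'Ind'}."""
    features = {}
    for item in feature_string.split(","):
        key, value = item.strip().split("=")
        features[key] = value
    return features


# Rows copied from the table above: (context, surface, lemma, features).
ROWS = [
    ("I was reading the paper", "reading", "read", "VerbForm=Ger"),
    ("I don't watch the news, I read the paper", "read", "read",
     "VerbForm=Fin, Mood=Ind, Tense=Pres"),
    ("I read the paper yesterday", "read", "read",
     "VerbForm=Fin, Mood=Ind, Tense=Past"),
]

for context, surface, lemma, feats in ROWS:
    # The same surface form "read" carries different features in rows 2 and 3.
    print(surface, lemma, parse_features(feats))
```

The last two rows share a surface form and a lemma and differ only in the `Tense` feature, which is why the surface string alone is not enough to recover the analysis.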
English has a relatively simple morphological system, which spaCy handles using rules that can be keyed by the token, the part-of-speech tag, or the combination of the two. The system works as follows:
- The tokenizer consults a mapping table TOKENIZER_EXCEPTIONS, which allows sequences of characters to be mapped to multiple tokens. Each token may be assigned a part of speech and one or more morphological features.
- The part-of-speech tagger then assigns each token an extended POS tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. VERB) and some amount of morphological information, e.g. that the verb is past tense.
- For words whose POS is not set by a prior process, a mapping table TAG_MAP maps the tags to a part-of-speech and a set of morphological features.
- Finally, a rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned extended part-of-speech and morphological information, without consulting the context of the token. The lemmatizer also accepts list-based exception files, acquired from WordNet.
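The steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not spaCy's implementation: the table names `TOKENIZER_EXCEPTIONS` and `TAG_MAP` come from the text, but their contents here, the `LEMMA_RULES` and `LEMMA_EXCEPTIONS` tables, and the helper functions are all hypothetical, and the tagger is replaced by a hand-written list of tags.

```python
# Toy stand-ins for the tables named above; the contents are illustrative only.
TOKENIZER_EXCEPTIONS = {
    "don't": [{"text": "do", "lemma": "do"}, {"text": "n't", "lemma": "not"}],
}
TAG_MAP = {
    "VBG": {"pos": "VERB", "VerbForm": "Ger"},
    "VBD": {"pos": "VERB", "VerbForm": "Fin", "Tense": "Past"},
    "NNS": {"pos": "NOUN", "Number": "Plur"},
}
# Hypothetical suffix rules per POS: (suffix to strip, replacement).
LEMMA_RULES = {"VERB": [("ing", ""), ("ed", "")], "NOUN": [("s", "")]}
# Hypothetical list-based exceptions, checked before the rules.
LEMMA_EXCEPTIONS = {("VERB", "was"): "be", ("NOUN", "news"): "news"}


def tokenize(text):
    """Split on whitespace, expanding entries found in TOKENIZER_EXCEPTIONS."""
    tokens = []
    for chunk in text.split():
        if chunk in TOKENIZER_EXCEPTIONS:
            tokens.extend(dict(t) for t in TOKENIZER_EXCEPTIONS[chunk])
        else:
            tokens.append({"text": chunk})
    return tokens


def lemmatize(text, pos):
    """Exceptions first, then suffix rules; the context is never consulted."""
    word = text.lower()
    if (pos, word) in LEMMA_EXCEPTIONS:
        return LEMMA_EXCEPTIONS[(pos, word)]
    for suffix, replacement in LEMMA_RULES.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


def annotate(tokens, tags):
    """Apply the tag map and lemmatizer; `tags` stands in for the tagger output."""
    for token, tag in zip(tokens, tags):
        token["tag"] = tag
        entry = TAG_MAP.get(tag, {"pos": "X"})
        token.setdefault("pos", entry["pos"])  # keep a POS set by a prior step
        token["morph"] = {k: v for k, v in entry.items() if k != "pos"}
        token.setdefault("lemma", lemmatize(token["text"], token["pos"]))
    return tokens


annotated = annotate(tokenize("I was reading the papers"),
                     ["PRP", "VBD", "VBG", "DT", "NNS"])
for token in annotated:
    print(token["text"], token["tag"], token["pos"], token["morph"], token["lemma"])
```

Each stage only fills in what an earlier stage left unset, which mirrors the ordering described in the list: tokenizer exceptions take precedence, then the tag map, then the lemmatizer.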
Dependency Parsing
spaCy features a fast and accurate syntactic dependency parser, and has a rich
API for navigating the tree. The parser also powers the sentence boundary
detection, and lets you iterate over base noun phrases, or "chunks". You can
check whether a Doc object has been parsed with the doc.is_parsed attribute,
which returns a boolean value. If this attribute is False, the default sentence
iterator will raise an exception.
Noun chunks
Noun chunks are "base noun phrases" – flat phrases that have a noun as their
head. You can think of noun chunks as a noun plus the words describing the noun
– for example, "the lavish green grass" or "the world’s largest tech fund". To
get the noun chunks in a document, simply iterate over Doc.noun_chunks.
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
- Text: The original noun chunk text.
- Root text: The original text of the word connecting the noun chunk to the rest of the parse.
- Root dep: Dependency relation connecting the root to its head.
- Root head text: The text of the root token's head.
Text | root.text | root.dep_ | root.head.text |
---|---|---|---|
Autonomous cars | cars | nsubj | shift |
insurance liability | liability | dobj | shift |
manufacturers | manufacturers | pobj | toward |
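To make the relationship between the parse and the chunks concrete, here is a toy sketch that rebuilds the rows above from a hand-written parse of the example sentence. It is not how spaCy computes noun chunks: the `TOKENS` data is copied from the tables on this page, while `CHUNK_ROOT_DEPS` and `toy_noun_chunks` are hypothetical names, and treating every `nsubj`, `dobj` or `pobj` token as a chunk root is a simplifying assumption that only holds for this sentence.

```python
# (text, dependency label, index of head) per token, copied from the tables on
# this page. The root token ("shift") is its own head.
TOKENS = [
    ("Autonomous", "amod", 1),
    ("cars", "nsubj", 2),
    ("shift", "ROOT", 2),
    ("insurance", "compound", 4),
    ("liability", "dobj", 2),
    ("toward", "prep", 2),
    ("manufacturers", "pobj", 5),
]

# Assumption for this toy: a chunk root is any token with one of these labels.
CHUNK_ROOT_DEPS = {"nsubj", "dobj", "pobj"}


def toy_noun_chunks(tokens):
    chunks = []
    for i, (text, dep, head) in enumerate(tokens):
        if dep not in CHUNK_ROOT_DEPS:
            continue
        start = i
        # Extend leftwards while the previous token attaches inside the span.
        while start > 0 and start <= tokens[start - 1][2] <= i:
            start -= 1
        chunk_text = " ".join(t[0] for t in tokens[start:i + 1])
        chunks.append((chunk_text, text, dep, tokens[head][0]))
    return chunks


chunks = toy_noun_chunks(TOKENS)
for row in chunks:
    print(*row)
```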
Navigating the parse tree
spaCy uses the terms head and child to describe the words connected by
a single arc in the dependency tree. The term dep is used for the arc
label, which describes the type of syntactic relation that connects the child to
the head. As with other attributes, the value of .dep is a hash value. You can
get the string value with .dep_.
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
- Text: The original token text.
- Dep: The syntactic relation connecting child to head.
- Head text: The original text of the token head.
- Head POS: The part-of-speech tag of the token head.
- Children: The immediate syntactic dependents of the token.
Text | Dep | Head text | Head POS | Children |
---|---|---|---|---|
Autonomous | amod | cars | NOUN | |
cars | nsubj | shift | VERB | Autonomous |
shift | ROOT | shift | VERB | cars, liability, toward |
insurance | compound | liability | NOUN | |
liability | dobj | shift | VERB | insurance |
toward | prep | shift | VERB | manufacturers |
manufacturers | pobj | toward | ADP | |
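The head and child relation in the table can be derived from a single list of head indices. The following is a minimal plain-Python sketch, assuming the parse shown above; `WORDS`, `DEPS` and `HEADS` are hand-copied from the table rather than produced by spaCy, and the variable names are hypothetical.

```python
from collections import defaultdict

# One entry per token of the example sentence, copied from the table above.
WORDS = ["Autonomous", "cars", "shift", "insurance", "liability",
         "toward", "manufacturers"]
DEPS = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]
HEADS = [1, 2, 2, 4, 2, 2, 5]  # index of each token's head

# Invert the head list: every arc (child -> head) adds the child to the
# head's list of immediate dependents.
children = defaultdict(list)
for child, head in enumerate(HEADS):
    if child != head:  # the root is its own head, so it is not its own child
        children[head].append(child)

rows = [
    (WORDS[i], DEPS[i], WORDS[HEADS[i]], [WORDS[c] for c in children[i]])
    for i in range(len(WORDS))
]
for row in rows:
    print(*row)
```

Because every token has exactly one head, the head list alone is enough to recover the whole tree; the children lists are just the same arcs read in the other direction.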