* Expose tokenizer rules as a property
Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)
Add tests and update Tokenizer API docs.
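A minimal sketch of the new property, using a blank English pipeline for illustration:
```python
from spacy.lang.en import English

nlp = English()
# Read the special-case rules, keyed by the matched string
rules = nlp.tokenizer.rules
# Assigning replaces the special cases and resets the tokenizer cache
nlp.tokenizer.rules = rules
```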
* Update Hungarian punctuation to remove empty string
Update Hungarian punctuation definitions so that `_units` does not match
an empty string.
* Use _load_special_tokenization consistently
Use `_load_special_tokenization()` and have it handle `None` checks.
* Fix precedence of `token_match` vs. special cases
Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.
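A hedged sketch of the new precedence, assuming `token_match` can be reassigned directly on the tokenizer; the pattern and strings are made up for illustration:
```python
import re
from spacy.lang.en import English
from spacy.symbols import ORTH

nlp = English()
# A token_match that would otherwise keep "x=y" as one token
nlp.tokenizer.token_match = re.compile(r"^x=y$").match
# With this change, a special case for the same string takes precedence
nlp.tokenizer.add_special_case("x=y", [{ORTH: "x"}, {ORTH: "="}, {ORTH: "y"}])
print([t.text for t in nlp("x=y")])  # expected: ['x', '=', 'y']
```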
* Add `make_debug_doc()` to the Tokenizer
Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.
Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.
* Update tokenization usage docs
Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.
Add more examples for customizing tokenizers while preserving the
existing defaults.
Minor edits / clarifications.
* Revert "Update Hungarian punctuation to remove empty string"
This reverts commit f0a577f7a5.
* Rework `make_debug_doc()` as `explain()`
Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.
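A short sketch of the new return value, using a blank English pipeline:
```python
from spacy.lang.en import English

nlp = English()
# Each tuple pairs the matched pattern name with the produced token text
for pattern_string, token_string in nlp.tokenizer.explain("(don't)"):
    print(token_string, "\t", pattern_string)
# Expected output along these lines:
# (     PREFIX
# do    SPECIAL-1
# n't   SPECIAL-2
# )     SUFFIX
```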
* Handle cases with bad tokenizer patterns
Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.
* Remove unused displacy image
* Add tokenizer.explain() to usage docs
---
title: Linguistic Features
next: /usage/rule-based-matching
---
Processing raw text intelligently is difficult: most words are rare, and it's
common for words that look completely different to mean almost the same thing.
The same words in a different order can mean something completely different.
Even splitting text into useful word-like units can be difficult in many
languages. While it's possible to solve some problems starting from only the raw
characters, it's usually better to use linguistic knowledge to add useful
information. That's exactly what spaCy is designed to do: you put in raw text
and get back a `Doc` object that comes with a variety of annotations.
## Part-of-speech tagging
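A minimal example of inspecting the assigned tags, assuming the `en_core_web_sm` model is installed:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    # pos_ is the coarse-grained tag, tag_ the fine-grained tag
    print(token.text, token.pos_, token.tag_)
```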
For a list of the fine-grained and coarse-grained part-of-speech tags assigned by spaCy's models across different languages, see the POS tag scheme documentation.
## Rule-based morphology

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part of speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:
| Context | Surface | Lemma | POS | Morphological Features |
|---|---|---|---|---|
| I was reading the paper | reading | read | verb | VerbForm=Ger |
| I don't watch the news, I read the paper | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Pres |
| I read the paper yesterday | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Past |
English has a relatively simple morphological system, which spaCy handles using rules that can be keyed by the token, the part-of-speech tag, or the combination of the two. The system works as follows:

1. The tokenizer consults a mapping table `TOKENIZER_EXCEPTIONS`, which allows sequences of characters to be mapped to multiple tokens (see the sketch after this list). Each token may be assigned a part of speech and one or more morphological features.
2. The part-of-speech tagger then assigns each token an extended POS tag. In the API, these tags are known as `Token.tag`. They express the part-of-speech (e.g. `VERB`) and some amount of morphological information, e.g. that the verb is past tense.
3. For words whose POS is not set by a prior process, a mapping table `TAG_MAP` maps the tags to a part-of-speech and a set of morphological features.
4. Finally, a rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned extended part-of-speech and morphological information, without consulting the context of the token. The lemmatizer also accepts list-based exception files, acquired from WordNet.
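A hedged sketch of step 1: registering an exception like those in `TOKENIZER_EXCEPTIONS` via `add_special_case`, assuming spaCy v2-style attributes; the example string is illustrative:
```python
from spacy.lang.en import English
from spacy.symbols import ORTH, LEMMA, POS

nlp = English()
# One string maps to two tokens; each may carry its own lemma and POS
special_case = [{ORTH: "gim", LEMMA: "give", POS: "VERB"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']
```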
## Dependency Parsing

spaCy features a fast and accurate syntactic dependency parser, and has a rich
API for navigating the tree. The parser also powers the sentence boundary
detection, and lets you iterate over base noun phrases, or "chunks". You can
check whether a `Doc` object has been parsed with the `doc.is_parsed`
attribute, which returns a boolean value. If this attribute is `False`, the
default sentence iterator will raise an exception.
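A brief sketch of that check, assuming the `en_core_web_sm` model is installed:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
# Sentence iteration relies on the dependency parse
if doc.is_parsed:
    for sent in doc.sents:
        print(sent.text)
```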
### Noun chunks

Noun chunks are "base noun phrases" – flat phrases that have a noun as their
head. You can think of noun chunks as a noun plus the words describing the noun
– for example, "the lavish green grass" or "the world’s largest tech fund". To
get the noun chunks in a document, simply iterate over `Doc.noun_chunks`:
```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
```
- Text: The original noun chunk text.
- Root text: The original text of the word connecting the noun chunk to the rest of the parse.
- Root dep: Dependency relation connecting the root to its head.
- Root head text: The text of the root token's head.
| Text | root.text | root.dep_ | root.head.text |
|---|---|---|---|
| Autonomous cars | cars | nsubj | shift |
| insurance liability | liability | dobj | shift |
| manufacturers | manufacturers | pobj | toward |
### Navigating the parse tree

spaCy uses the terms head and child to describe the words connected by
a single arc in the dependency tree. The term dep is used for the arc
label, which describes the type of syntactic relation that connects the child to
the head. As with other attributes, the value of `.dep` is a hash value. You can
get the string value with `.dep_`.
```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
```
- Text: The original token text.
- Dep: The syntactic relation connecting child to head.
- Head text: The original text of the token head.
- Head POS: The part-of-speech tag of the token head.
- Children: The immediate syntactic dependents of the token.
| Text | Dep | Head text | Head POS | Children |
|---|---|---|---|---|
| Autonomous | amod | cars | NOUN | |
| cars | nsubj | shift | VERB | Autonomous |
| shift | ROOT | shift | VERB | cars, liability, toward |
| insurance | compound | liability | NOUN | |
| liability | dobj | shift | VERB | insurance |
| toward | prep | shift | VERB | manufacturers |
| manufacturers | pobj | toward | ADP | |