spaCy/website/docs/api/pipeline-functions.md
Adriane Boyd bf0cdae8d4
Add token_splitter component (#6726)
* Add long_token_splitter component

Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.

The `long_token_splitter` splits tokens that are at least
`long_token_length` tokens long into smaller tokens of `split_length`
size.

Notes:

* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.

* Adjust API, add test

* Fix name in factory
2021-01-17 19:54:41 +08:00

5.0 KiB

title teaser source menu
Pipeline Functions Other built-in pipeline components and helpers spacy/pipeline/functions.py
merge_noun_chunks
merge_noun_chunks
merge_entities
merge_entities
merge_subtokens
merge_subtokens
token_splitter
token_splitter

merge_noun_chunks

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Example

texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a", "blue", "car"]

nlp.add_pipe("merge_noun_chunks")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a blue car"]

Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe will add components to the end of the pipeline and after all other components.

Name Description
doc The Doc object to process, e.g. the Doc in the pipeline. Doc
RETURNS The modified Doc with merged noun chunks. Doc

merge_entities

Merge named entities into a single token. Also available via the string name "merge_entities".

Example

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

nlp.add_pipe("merge_entities")

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]

Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe will add components to the end of the pipeline and after all other components.

Name Description
doc The Doc object to process, e.g. the Doc in the pipeline. Doc
RETURNS The modified Doc with merged entities. Doc

merge_subtokens

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict "subtokens" that should be merged into one single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.

Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.

doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]

nlp.add_pipe("merge_subtokens")
doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']

Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe will add components to the end of the pipeline and after all other components.

Name Description
doc The Doc object to process, e.g. the Doc in the pipeline. Doc
label The subtoken dependency label. Defaults to "subtok". str
RETURNS The modified Doc with merged subtokens. Doc

token_splitter

Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines where long spaCy tokens lead to input text that exceed the transformer model max length. See managing transformer model max length limitations.

Example

config={"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
doc = nlp("aaaaabbbbbcccccdddddee")
print([token.text for token in doc])
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
Setting Description
min_length The minimum length for a token to be split. Defaults to 25. int
split_length The length of the split tokens. Defaults to 5. int