spaCy/website/docs/api/pipeline-functions.md
Adriane Boyd 9ac6d4991e
Add doc_cleaner component (#9659)
* Add doc_cleaner component

* Fix types

* Fix loop

* Rephrase method description
2021-11-23 15:33:33 +01:00

title: Pipeline Functions
teaser: Other built-in pipeline components and helpers
source: spacy/pipeline/functions.py
menu: merge_noun_chunks, merge_entities, merge_subtokens, token_splitter

merge_noun_chunks

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Example

texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a", "blue", "car"]

nlp.add_pipe("merge_noun_chunks")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a blue car"]

Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| RETURNS | The modified Doc with merged noun chunks. | Doc |

merge_entities

Merge named entities into a single token. Also available via the string name "merge_entities".

Example

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

nlp.add_pipe("merge_entities")

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]

Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| RETURNS | The modified Doc with merged entities. | Doc |
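
To make the merging step concrete, the following is a minimal pure-Python sketch that collapses token ranges into single strings. It is an illustration only and not spaCy's implementation, which operates on Doc objects rather than lists of strings. The helper name `merge_spans` and the half-open `(start, end)` span format are hypothetical, and joining the pieces with single spaces is a simplifying assumption.

```python
from typing import List, Tuple


def merge_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Collapse each half-open (start, end) token range into one string.

    Hypothetical helper for illustration; spans must not overlap.
    """
    ends = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        # Only merge non-empty ranges so the loop always advances.
        if i in ends and ends[i] > i:
            end = ends[i]
            merged.append(" ".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["I", "like", "David", "Bowie"]
print(merge_spans(tokens, [(2, 4)]))
# ['I', 'like', 'David Bowie']
```

In a real pipeline the spans come from the entity recognizer (or, for merge_noun_chunks, from the noun chunk iterator), which is why the merging component has to run after the component that sets them.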

merge_subtokens

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict "subtokens" that should later be merged into a single token. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.

Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.

doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]

nlp.add_pipe("merge_subtokens")
doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']

Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| label | The subtoken dependency label. Defaults to "subtok". | str |
| RETURNS | The modified Doc with merged subtokens. | Doc |

token_splitter

Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines where long spaCy tokens lead to input text that exceeds the transformer model's maximum length.

Example

config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
doc = nlp("aaaaabbbbbcccccdddddee")
print([token.text for token in doc])
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']

| Setting | Description | Type |
| --- | --- | --- |
| min_length | The minimum length for a token to be split. Defaults to 25. | int |
| split_length | The length of the split tokens. Defaults to 5. | int |
| RETURNS | The modified Doc with the split tokens. | Doc |
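
How the two settings interact can be sketched on plain strings. This is a hedged illustration rather than spaCy's code: the helper name `split_token` is hypothetical, and treating a token as splittable once its length reaches `min_length` (a `>=` comparison) is an assumption about the threshold.

```python
from typing import List


def split_token(text: str, min_length: int = 25, split_length: int = 5) -> List[str]:
    """Split text into split_length-sized pieces once it reaches min_length.

    Hypothetical helper; the ">=" threshold is an assumption.
    """
    if min_length <= 0 or split_length <= 0 or len(text) < min_length:
        # Too short (or splitting disabled): keep the token as it is.
        return [text]
    return [text[i : i + split_length] for i in range(0, len(text), split_length)]


print(split_token("aaaaabbbbbcccccdddddee", min_length=20, split_length=5))
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
print(split_token("short", min_length=20, split_length=5))
# ['short']
```

The last piece may be shorter than `split_length` when the token length is not an exact multiple of it, as with the trailing "ee" above.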

doc_cleaner

Clean up Doc attributes. Intended for use at the end of pipelines with tok2vec or transformer pipeline components that store tensors and other values that can require a lot of memory and frequently aren't needed after the whole pipeline has run.

Example

config = {"attrs": {"tensor": None}}
nlp.add_pipe("doc_cleaner", config=config)
doc = nlp("text")
assert doc.tensor is None

| Setting | Description | Type |
| --- | --- | --- |
| attrs | A dict of the Doc attributes and the values to set them to. Defaults to {"tensor": None, "_.trf_data": None} to clean up after tok2vec and transformer components. | dict |
| silent | If False, show warnings if attributes aren't found or can't be set. Defaults to True. | bool |
| RETURNS | The modified Doc with the modified attributes. | Doc |
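
The `attrs` keys can be dotted paths such as "_.trf_data". The following sketch shows one way such a path could be resolved and reset, using a `SimpleNamespace` as a stand-in for a Doc. It is not spaCy's implementation: the helper name `clean_attrs`, the warning text, and the choice to skip missing attributes are all assumptions made for illustration.

```python
import warnings
from types import SimpleNamespace
from typing import Any, Dict


def clean_attrs(obj: Any, attrs: Dict[str, Any], silent: bool = True) -> Any:
    """Set each dotted attribute path in attrs to its value, if it exists.

    Hypothetical helper; missing attributes are skipped, with a warning
    when silent is False.
    """
    for path, value in attrs.items():
        target = obj
        *parents, last = path.split(".")
        found = True
        # Walk down to the object that owns the final attribute.
        for part in parents:
            if not hasattr(target, part):
                found = False
                break
            target = getattr(target, part)
        if found and hasattr(target, last):
            setattr(target, last, value)
        elif not silent:
            warnings.warn(f"Attribute {path!r} not found; skipping")
    return obj


# Stand-in for a processed Doc that still holds large values.
fake_doc = SimpleNamespace(tensor=[0.1, 0.2], _=SimpleNamespace(trf_data="big"))
clean_attrs(fake_doc, {"tensor": None, "_.trf_data": None})
print(fake_doc.tensor, fake_doc._.trf_data)
# None None
```

Setting the values to None drops the references to the stored tensors, so they no longer have to be kept in memory along with the processed document.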