| title | teaser | source |
|---|---|---|
| Pipeline Functions | Other built-in pipeline components and helpers | spacy/pipeline/functions.py |
merge_noun_chunks
Merge noun chunks into a single token. Also available via the string name
"merge_noun_chunks".
Example

    texts = [t.text for t in nlp("I have a blue car")]
    assert texts == ["I", "have", "a", "blue", "car"]
    nlp.add_pipe("merge_noun_chunks")
    texts = [t.text for t in nlp("I have a blue car")]
    assert texts == ["I", "have", "a blue car"]

Since noun chunks require part-of-speech tags and the dependency parse, make
sure to add this component after the "tagger" and "parser" components. By
default, nlp.add_pipe will add components to the end of the pipeline and after
all other components.
| Name | Description | 
|---|---|
| doc | The Doc object to process, e.g. the Doc in the pipeline. |
| RETURNS | The modified Doc with merged noun chunks. |
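To make the merging step concrete, here is a minimal pure-Python sketch of collapsing token spans into single tokens. It is not spaCy's implementation: `merge_spans` is a hypothetical helper, the spans are supplied by hand rather than predicted by a parser, and merged tokens are joined with a single space as an assumption about whitespace-separated English text.

```python
from typing import List, Tuple


def merge_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Merge each half-open (start, end) token span into a single string.

    Assumes spans are non-empty and non-overlapping. Tokens outside any
    span are kept unchanged.
    """
    ends_by_start = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        if i in ends_by_start:
            end = ends_by_start[i]
            # Join the covered tokens with a space (assumption for English).
            merged.append(" ".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["I", "have", "a", "blue", "car"]
# Hand-written spans standing in for noun chunks: "I" and "a blue car".
print(merge_spans(tokens, [(0, 1), (2, 5)]))
# ['I', 'have', 'a blue car']
```

The same idea applies to any set of non-overlapping spans, which is why the component depends on the tagger and parser having already identified the chunks.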
merge_entities
Merge named entities into a single token. Also available via the string name
"merge_entities".
Example

    texts = [t.text for t in nlp("I like David Bowie")]
    assert texts == ["I", "like", "David", "Bowie"]
    nlp.add_pipe("merge_entities")
    texts = [t.text for t in nlp("I like David Bowie")]
    assert texts == ["I", "like", "David Bowie"]

Since named entities are set by the entity recognizer, make sure to add this
component after the "ner" component. By default, nlp.add_pipe will add
components to the end of the pipeline and after all other components.
| Name | Description | 
|---|---|
| doc | The Doc object to process, e.g. the Doc in the pipeline. |
| RETURNS | The modified Doc with merged entities. |
merge_subtokens
Merge subtokens into a single token. Also available via the string name
"merge_subtokens". As of v2.1, the parser is able to predict "subtokens" that
should be merged into one single token later on. This is especially relevant for
languages like Chinese, Japanese or Korean, where a "word" isn't defined as a
whitespace-delimited sequence of characters. Under the hood, this component uses
the Matcher to find sequences of tokens with the dependency
label "subtok" and then merges them into a single token.
Example
Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.

    doc = nlp("拜托")
    print([(token.text, token.dep_) for token in doc])
    # [('拜', 'subtok'), ('托', 'subtok')]
    nlp.add_pipe("merge_subtokens")
    doc = nlp("拜托")
    print([token.text for token in doc])
    # ['拜托']

Since subtokens are set by the parser, make sure to add this component after
the "parser" component. By default, nlp.add_pipe will add components to the
end of the pipeline and after all other components.
| Name | Description | 
|---|---|
| doc | The Doc object to process, e.g. the Doc in the pipeline. |
| label | The subtoken dependency label. Defaults to "subtok". |
| RETURNS | The modified Doc with merged subtokens. |
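The matching step described above, finding consecutive tokens that carry the subtoken label and collapsing each run, can be sketched in plain Python. This is an illustration only: `merge_subtok_runs` is a hypothetical helper that works on (text, dependency label) pairs instead of a Doc, it uses `itertools.groupby` in place of the Matcher, and it joins the pieces without whitespace, which is an assumption suited to the Chinese example.

```python
from itertools import groupby
from typing import List, Tuple


def merge_subtok_runs(
    tokens: List[Tuple[str, str]], label: str = "subtok"
) -> List[str]:
    """Collapse each run of consecutive tokens labelled `label` into one token."""
    merged = []
    # groupby yields maximal runs of tokens that share the same key value.
    for is_subtok, group in groupby(tokens, key=lambda pair: pair[1] == label):
        texts = [text for text, _ in group]
        if is_subtok:
            # Join without whitespace (assumption for unsegmented scripts).
            merged.append("".join(texts))
        else:
            merged.extend(texts)
    return merged


print(merge_subtok_runs([("拜", "subtok"), ("托", "subtok")]))
# ['拜托']
```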
token_splitter
Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines, where long spaCy tokens can produce input text that exceeds the transformer model's maximum length.
Example

    config = {"min_length": 20, "split_length": 5}
    nlp.add_pipe("token_splitter", config=config, first=True)
    doc = nlp("aaaaabbbbbcccccdddddee")
    print([token.text for token in doc])
    # ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']

| Setting | Description | 
|---|---|
| min_length | The minimum length for a token to be split. Defaults to 25. | 
| split_length | The length of the split tokens. Defaults to 5. | 
| RETURNS | The modified Doc with the split tokens. |
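How the two settings interact can be illustrated with a short sketch on plain strings. `split_token` is a hypothetical helper, not the component's code. It assumes that a token whose length is at least min_length is split (whether the boundary is inclusive is an assumption here) and that non-positive settings disable splitting.

```python
from typing import List


def split_token(text: str, min_length: int = 25, split_length: int = 5) -> List[str]:
    """Split `text` into pieces of `split_length` if it reaches `min_length`."""
    if min_length <= 0 or split_length <= 0 or len(text) < min_length:
        # Too short, or splitting disabled: keep the token as it is.
        return [text]
    # Slice into consecutive chunks; the last chunk may be shorter.
    return [text[i:i + split_length] for i in range(0, len(text), split_length)]


print(split_token("aaaaabbbbbcccccdddddee", min_length=20, split_length=5))
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
print(split_token("short", min_length=20, split_length=5))
# ['short']
```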
doc_cleaner
Clean up Doc attributes. Intended for use at the end of pipelines with
tok2vec or transformer pipeline components that store tensors and other
values that can require a lot of memory and frequently aren't needed after the
whole pipeline has run.
Example

    config = {"attrs": {"tensor": None}}
    nlp.add_pipe("doc_cleaner", config=config)
    doc = nlp("text")
    assert doc.tensor is None

| Setting | Description | 
|---|---|
| attrs | A dict of the Doc attributes and the values to set them to. Defaults to {"tensor": None, "_.trf_data": None} to clean up after tok2vec and transformer components. |
| silent | If False, show warnings if attributes aren't found or can't be set. Defaults to True. |
| RETURNS | The modified Doc with the modified attributes. |
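The attrs and silent settings can be illustrated with a small stand-in that does not require spaCy. Everything here is hypothetical: `clean_attrs` is not the component's code, the document is a `SimpleNamespace` rather than a Doc, and treating names that start with "_." as extension attributes under `doc._` is an assumption based on the default value shown in the table.

```python
import warnings
from types import SimpleNamespace
from typing import Any, Dict


def clean_attrs(doc: Any, attrs: Dict[str, Any], silent: bool = True) -> Any:
    """Overwrite the named attributes on `doc` with the given values."""
    for name, value in attrs.items():
        # Assumption: a "_." prefix addresses the extension namespace doc._
        if name.startswith("_."):
            target, attr = doc._, name[2:]
        else:
            target, attr = doc, name
        if hasattr(target, attr):
            setattr(target, attr, value)
        elif not silent:
            warnings.warn(f"Attribute {name!r} not found")
    return doc


# Stand-in object with a tensor and one extension attribute.
fake_doc = SimpleNamespace(
    tensor=[0.1, 0.2],
    _=SimpleNamespace(trf_data="cached output"),
)
clean_attrs(fake_doc, {"tensor": None, "_.trf_data": None})
print(fake_doc.tensor, fake_doc._.trf_data)
# None None
```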