spaCy/website/docs/api/pipeline-functions.md
Paul O'Leary McCann a44b7d4622
Add experimental coref docs (#11291)
2022-09-27 18:11:23 +09:00

title: Pipeline Functions
teaser: Other built-in pipeline components and helpers
source: spacy/pipeline/functions.py
menu: merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner

merge_noun_chunks

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Example

texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a", "blue", "car"]

nlp.add_pipe("merge_noun_chunks")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a blue car"]

Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe adds a component to the end of the pipeline, after all other components.

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| RETURNS | The modified Doc with merged noun chunks. | Doc |
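
The component can also be called directly as a function on a Doc that already carries the required annotations. The following is a minimal sketch under stated assumptions: instead of running a trained pipeline, the part-of-speech tags, heads and dependency labels are written by hand (they are illustrative, not model output), and a blank English pipeline is used only to obtain a vocab that knows the English noun chunk rules.

```python
import spacy
from spacy.pipeline.functions import merge_noun_chunks
from spacy.tokens import Doc

# Assumption: a blank English pipeline, used only for its vocab.
nlp = spacy.blank("en")

# Hand-written annotations standing in for the tagger and parser output.
words = ["I", "have", "a", "blue", "car"]
pos = ["PRON", "VERB", "DET", "ADJ", "NOUN"]
heads = [1, 1, 4, 4, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# Noun chunks are available because POS tags and a parse are set.
chunks_before = [chunk.text for chunk in doc.noun_chunks]

# Call the pipeline function directly; it returns the modified Doc.
doc = merge_noun_chunks(doc)
texts = [t.text for t in doc]
```

Calling the function on a hand-built Doc like this can be convenient in unit tests, where loading a trained pipeline would be slow or unavailable.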

merge_entities

Merge named entities into a single token. Also available via the string name "merge_entities".

Example

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

nlp.add_pipe("merge_entities")

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]

Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe adds a component to the end of the pipeline, after all other components.

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| RETURNS | The modified Doc with merged entities. | Doc |
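
The ordering requirement applies to any component that sets doc.ents, not only a trained "ner". As a minimal sketch that runs without a trained pipeline, the example below swaps in a rule-based "entity_ruler" with a single hand-written pattern (an assumption made for illustration) and adds "merge_entities" after it.

```python
import spacy

# Assumption: a blank English pipeline with a rule-based entity_ruler
# standing in for the trained "ner" component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "David Bowie"}])

# Added last, so it runs after the component that sets doc.ents.
nlp.add_pipe("merge_entities")

doc = nlp("I like David Bowie")
texts = [t.text for t in doc]
ent_types = [t.ent_type_ for t in doc]
pipe_names = nlp.pipe_names
```

The merged token keeps the entity label, so downstream code can still read it from token.ent_type_.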

merge_subtokens

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict "subtokens" that should be merged into one single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.

Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.

doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]

nlp.add_pipe("merge_subtokens")
doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']

Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe adds a component to the end of the pipeline, after all other components.

| Name | Description | Type |
| --- | --- | --- |
| doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
| label | The subtoken dependency label. Defaults to "subtok". | str |
| RETURNS | The modified Doc with merged subtokens. | Doc |
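
Because a parser trained to predict subtokens may not be at hand, the merging step can be tried on a hand-annotated Doc. This is a minimal sketch: the "subtok" labels and heads below are written manually to mimic the output shown in the example above, and an empty Vocab is used instead of a Chinese pipeline.

```python
from spacy.pipeline.functions import merge_subtokens
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-written annotations mimicking a parser that oversegments.
words = ["拜", "托"]
doc = Doc(
    Vocab(),
    words=words,
    spaces=[False, False],  # no whitespace between the characters
    heads=[1, 1],
    deps=["subtok", "subtok"],
)
before = [(t.text, t.dep_) for t in doc]

# The label argument is passed explicitly here; "subtok" is the default.
doc = merge_subtokens(doc, label="subtok")
after = [t.text for t in doc]
```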

token_splitter

Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines where long spaCy tokens lead to input text that exceeds the transformer model's maximum length.

Example

config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
doc = nlp("aaaaabbbbbcccccdddddee")
print([token.text for token in doc])
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']

| Setting | Description | Type |
| --- | --- | --- |
| min_length | The minimum length for a token to be split. Defaults to 25. | int |
| split_length | The length of the split tokens. Defaults to 5. | int |
| RETURNS | The modified Doc with the split tokens. | Doc |
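
To see how the two settings interact, the sketch below runs the splitter on a text that mixes a short and a long token. It assumes a blank English pipeline, and the thresholds are chosen only for illustration.

```python
import spacy

# Assumption: a blank English pipeline, so only the tokenizer and the
# splitter run.
nlp = spacy.blank("en")
config = {"min_length": 10, "split_length": 4}
nlp.add_pipe("token_splitter", config=config, first=True)

# "short" is below min_length and is left alone; the 20-character
# token is cut into pieces of split_length characters.
doc = nlp("short supercalifragilistic")
texts = [t.text for t in doc]
```

Tokens below min_length pass through unchanged, so the component only affects the unusually long tokens that tend to cause problems for transformer inputs.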

doc_cleaner

Clean up Doc attributes. Intended for use at the end of pipelines with tok2vec or transformer components, which store tensors and other values that can require a lot of memory and are frequently not needed after the whole pipeline has run.

Example

config = {"attrs": {"tensor": None}}
nlp.add_pipe("doc_cleaner", config=config)
doc = nlp("text")
assert doc.tensor is None

| Setting | Description | Type |
| --- | --- | --- |
| attrs | A dict of the Doc attributes and the values to set them to. Defaults to {"tensor": None, "_.trf_data": None} to clean up after tok2vec and transformer components. | dict |
| silent | If False, show warnings if attributes aren't found or can't be set. Defaults to True. | bool |
| RETURNS | The modified Doc with the modified attributes. | Doc |
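
The effect is easier to observe when an earlier component has actually stored something on the Doc. In the sketch below, "fake_tensor_setter" is a hypothetical component written for this example; it stands in for a tok2vec or transformer component by assigning a dummy array to doc.tensor.

```python
import numpy
import spacy
from spacy.language import Language


# Hypothetical stand-in for a component that stores a tensor on the Doc.
@Language.component("fake_tensor_setter")
def fake_tensor_setter(doc):
    doc.tensor = numpy.zeros((len(doc), 4), dtype="float32")
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("fake_tensor_setter")

# Without the cleaner, the tensor is still attached to the result.
shape_before = nlp("some text").tensor.shape

# With the cleaner at the end of the pipeline, the attribute is reset.
nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})
doc = nlp("some text")
```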

span_cleaner

Remove SpanGroups from doc.spans based on a key prefix. This is used to clean up after the CoreferenceResolver when it's paired with a SpanResolver.

This pipeline function is not yet integrated into spaCy core, and is available via the extension package spacy-experimental starting in version 0.6.0. It exposes the component via entry points, so if you have the package installed, using factory = "span_cleaner" in your training config or nlp.add_pipe("span_cleaner") will work out of the box.

Example

config = {"prefix": "coref_head_clusters"}
nlp.add_pipe("span_cleaner", config=config)
doc = nlp("text")
assert "coref_head_clusters_1" not in doc.spans

| Setting | Description | Type |
| --- | --- | --- |
| prefix | A prefix to check SpanGroup keys for. Any matching groups will be removed. Defaults to "coref_head_clusters". | str |
| RETURNS | The modified Doc with any matching spans removed. | Doc |
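
For readers without spacy-experimental installed, the prefix-based cleanup described above can be approximated in plain spaCy. This is only an illustrative sketch, not the code from spacy-experimental: remove_span_groups is a hypothetical helper, and the span groups are created by hand rather than by a CoreferenceResolver.

```python
import spacy


def remove_span_groups(doc, prefix="coref_head_clusters"):
    """Hypothetical helper: delete every span group whose key starts with prefix."""
    # Copy the keys first, since the mapping is modified while looping.
    for key in list(doc.spans.keys()):
        if key.startswith(prefix):
            del doc.spans[key]
    return doc


nlp = spacy.blank("en")
doc = nlp("Sarah said she was tired")

# Hand-made span groups standing in for coreference output.
doc.spans["coref_head_clusters_1"] = [doc[0:1], doc[2:3]]
doc.spans["coref_clusters_1"] = [doc[0:1], doc[2:3]]

doc = remove_span_groups(doc)
remaining = sorted(doc.spans.keys())
```

Only the groups matching the prefix are dropped; groups stored under other keys are left in doc.spans.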