spaCy/pipeline-functions.md at 3033babe9859ba6483e0b97e18342c9ecf4c6c24

mirror of https://github.com/explosion/spaCy.git synced 2024-09-22 03:49:17 +03:00

Paul O'Leary McCann a44b7d4622


2022-09-27 18:11:23 +09:00

title: Pipeline Functions
teaser: Other built-in pipeline components and helpers
source: spacy/pipeline/functions.py

merge_noun_chunks

merge_entities

merge_subtokens

token_splitter

doc_cleaner

merge_noun_chunks

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Example

texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a", "blue", "car"]

nlp.add_pipe("merge_noun_chunks")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a blue car"]

Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
RETURNS	The modified `Doc` with merged noun chunks. ~~Doc~~

merge_entities

Merge named entities into a single token. Also available via the string name "merge_entities".

Example

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

nlp.add_pipe("merge_entities")

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]

Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
RETURNS	The modified `Doc` with merged entities. ~~Doc~~
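
The ordering requirement can be tried without a trained model. The following is a minimal sketch under stated assumptions: it uses a blank English pipeline and the built-in `entity_ruler` as a stand-in for the statistical "ner" component, with a single illustrative pattern for "David Bowie" that is not part of the original example. The merger is placed explicitly with the `after` argument of `nlp.add_pipe` rather than relying on the default position.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Stand-in for "ner": a rule-based component that sets doc.ents.
# The single pattern below is illustrative only.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "David Bowie"}])

# Place the merger explicitly after the component that sets the entities.
nlp.add_pipe("merge_entities", after="entity_ruler")

doc = nlp("I like David Bowie")
pipe_names = nlp.pipe_names
texts = [t.text for t in doc]
labels = [t.ent_type_ for t in doc]

print(pipe_names)
print(texts)
print(labels)
```

If the merger ran before the component that sets `doc.ents`, there would be no entities to merge yet and the tokens would stay separate.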

merge_subtokens

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict "subtokens" that should be merged into one single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.

Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.
doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]

nlp.add_pipe("merge_subtokens")
doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']

Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
`label`	The subtoken dependency label. Defaults to `"subtok"`. ~~str~~
RETURNS	The modified `Doc` with merged subtokens. ~~Doc~~

token_splitter

Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines where long spaCy tokens lead to input text that exceeds the transformer model's maximum length.

Example

config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
doc = nlp("aaaaabbbbbcccccdddddee")
print([token.text for token in doc])
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']

Setting	Description
`min_length`	The minimum length for a token to be split. Defaults to `25`. ~~int~~
`split_length`	The length of the split tokens. Defaults to `5`. ~~int~~
RETURNS	The modified `Doc` with the split tokens. ~~Doc~~
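
How the two settings interact can be sketched in plain Python, without spaCy. This is not spaCy's implementation: `split_text` is a hypothetical helper that works on strings instead of tokens, and it assumes a token is split once its length reaches `min_length`. The exact boundary behavior should be checked against the source in spacy/pipeline/functions.py.

```python
from typing import List


def split_text(text: str, min_length: int = 25, split_length: int = 5) -> List[str]:
    """Split `text` into pieces of at most `split_length` characters.

    Assumption: a token is split once its length reaches `min_length`;
    shorter tokens are returned unchanged.
    """
    if min_length <= 0 or split_length <= 0 or len(text) < min_length:
        return [text]
    return [text[i : i + split_length] for i in range(0, len(text), split_length)]


# Same input and settings as the example above.
pieces = split_text("aaaaabbbbbcccccdddddee", min_length=20, split_length=5)
print(pieces)  # ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']

# A token below the threshold is left alone.
short = split_text("short")
print(short)  # ['short']
```

The last piece may be shorter than `split_length` when the token length is not an exact multiple of it, as with the trailing 'ee' here.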

doc_cleaner

Clean up Doc attributes. Intended for use at the end of pipelines with tok2vec or transformer pipeline components that store tensors and other values that can require a lot of memory and frequently aren't needed after the whole pipeline has run.

Example

config = {"attrs": {"tensor": None}}
nlp.add_pipe("doc_cleaner", config=config)
doc = nlp("text")
assert doc.tensor is None

Setting	Description
`attrs`	A dict of the `Doc` attributes and the values to set them to. Defaults to `{"tensor": None, "_.trf_data": None}` to clean up after `tok2vec` and `transformer` components. ~~dict~~
`silent`	If `False`, show warnings if attributes aren't found or can't be set. Defaults to `True`. ~~bool~~
RETURNS	The modified `Doc` with the modified attributes. ~~Doc~~
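
The `attrs` setting mixes plain attribute names such as `tensor` with dotted names such as `_.trf_data`. The sketch below illustrates one way such keys could be resolved, using only the standard library. It is an assumption about the general idea, not spaCy's implementation: `clean_attrs` is a hypothetical helper and a `SimpleNamespace` stands in for a `Doc`.

```python
import warnings
from types import SimpleNamespace
from typing import Any, Dict


def clean_attrs(obj: Any, attrs: Dict[str, Any], silent: bool = True) -> Any:
    """Set each dotted attribute path in `attrs` to its value, if it exists."""
    for path, value in attrs.items():
        *parents, leaf = path.split(".")
        target = obj
        # Walk down to the object that owns the final attribute.
        for name in parents:
            target = getattr(target, name, None)
            if target is None:
                break
        if target is not None and hasattr(target, leaf):
            setattr(target, leaf, value)
        elif not silent:
            # Mirrors the documented `silent` setting: warn only on request.
            warnings.warn(f"Attribute '{path}' not found; skipping")
    return obj


# Stand-in for a Doc: a tensor plus an extension namespace `_`.
fake_doc = SimpleNamespace(tensor=[0.1, 0.2], _=SimpleNamespace(trf_data="big"))
clean_attrs(fake_doc, {"tensor": None, "_.trf_data": None, "_.missing": None})
print(fake_doc.tensor, fake_doc._.trf_data)  # None None
```

Attributes that do not exist, like `_.missing` here, are skipped rather than created, which matches the documented behavior of warning only when `silent` is `False`.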

span_cleaner

Remove SpanGroups from doc.spans based on a key prefix. This is used to clean up after the CoreferenceResolver when it's paired with a SpanResolver.

This pipeline function is not yet integrated into spaCy core, and is available via the extension package spacy-experimental starting in version 0.6.0. It exposes the component via entry points, so if you have the package installed, using factory = "span_cleaner" in your training config or nlp.add_pipe("span_cleaner") will work out-of-the-box.

Example

config = {"prefix": "coref_head_clusters"}
nlp.add_pipe("span_cleaner", config=config)
doc = nlp("text")
assert "coref_head_clusters_1" not in doc.spans

Setting	Description
`prefix`	A prefix to check `SpanGroup` keys for. Any matching groups will be removed. Defaults to `"coref_head_clusters"`. ~~str~~
RETURNS	The modified `Doc` with any matching spans removed. ~~Doc~~
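
Prefix-based removal can be illustrated with a plain dictionary standing in for doc.spans. This is a sketch of the idea only, not the code in spacy-experimental: `remove_by_prefix` and the sample keys and values are hypothetical.

```python
from typing import Dict, List


def remove_by_prefix(
    spans: Dict[str, List[str]], prefix: str = "coref_head_clusters"
) -> Dict[str, List[str]]:
    """Delete every entry whose key starts with `prefix`, in place."""
    # Copy the keys first: deleting while iterating over a dict raises an error.
    for key in list(spans.keys()):
        if key.startswith(prefix):
            del spans[key]
    return spans


# Hypothetical span groups: two intermediate head clusters and one final cluster.
fake_spans = {
    "coref_head_clusters_1": ["she"],
    "coref_head_clusters_2": ["it"],
    "coref_clusters_1": ["the old woman", "she"],
}
remove_by_prefix(fake_spans)
remaining = sorted(fake_spans)
print(remaining)  # ['coref_clusters_1']
```

Only keys that begin with the full prefix are removed, so groups under a different prefix, such as `coref_clusters_1` in this sketch, are kept.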
