spaCy/pipeline-functions.md at f4d547b73c31e7578dcc676b9276dab3dce851ad

mirror of https://github.com/explosion/spaCy.git synced 2025-02-27 00:50:42 +03:00

* Add long_token_splitter component

Add a `long_token_splitter` component for use with transformer
pipelines. This component splits up long tokens like URLs into smaller
tokens. This is particularly relevant for pretrained pipelines with
`strided_spans`, since the user can't change the length of the span
`window` and may not wish to preprocess the input texts.

The `long_token_splitter` splits tokens that are at least
`long_token_length` characters long into smaller tokens of `split_length`
size.

Notes:

* Since this is intended for use as the first component in a pipeline,
the token splitter does not try to preserve any token annotation.
* API docs to come when the API is stable.

* Adjust API, add test

* Fix name in factory

2021-01-17 19:54:41 +08:00

5.0 KiB


title	teaser	source
Pipeline Functions	Other built-in pipeline components and helpers	spacy/pipeline/functions.py

merge_noun_chunks

merge_entities

merge_subtokens

token_splitter

merge_noun_chunks

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Example

texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a", "blue", "car"]

nlp.add_pipe("merge_noun_chunks")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a blue car"]

Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
RETURNS	The modified `Doc` with merged noun chunks. ~~Doc~~
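
The effect of merging can be illustrated without spaCy. The following is a minimal pure-Python sketch, assuming tokens are plain strings and chunks are given as `(start, end)` index pairs with an exclusive end. The helper name `merge_spans` and the space-joined output are assumptions made for illustration; the real component modifies the `Doc` in place and is not implemented this way.

```python
from typing import List, Tuple


def merge_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Merge each (start, end) span of tokens into one space-joined token.

    Hypothetical helper: spans are assumed to be non-overlapping,
    with an exclusive end index.
    """
    ends = {}
    for start, end in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span: ({start}, {end})")
        ends[start] = end

    merged = []
    i = 0
    while i < len(tokens):
        if i in ends:
            # Collapse the whole span into a single token and skip past it.
            merged.append(" ".join(tokens[i:ends[i]]))
            i = ends[i]
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["I", "have", "a", "blue", "car"]
# (0, 1) and (2, 5) stand in for the noun chunks "I" and "a blue car".
print(merge_spans(tokens, [(0, 1), (2, 5)]))
# ['I', 'have', 'a blue car']
```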

merge_entities

Merge named entities into a single token. Also available via the string name "merge_entities".

Example

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

nlp.add_pipe("merge_entities")

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]

Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
RETURNS	The modified `Doc` with merged entities. ~~Doc~~
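
As a rough illustration of what merging entities means, the sketch below groups tokens by IOB tags ("B" begins an entity, "I" continues it, "O" is outside), similar in spirit to the values of `Token.ent_iob_`. The helper `merge_by_iob` and the tagged input are assumptions for this example; it is a pure-Python sketch and not how the component is implemented.

```python
from typing import List, Tuple


def merge_by_iob(tagged: List[Tuple[str, str]]) -> List[str]:
    """Join each entity (a "B" token plus the "I" tokens after it) into one token.

    Hypothetical helper: an "I" tag that does not follow an entity start
    is kept as a separate token.
    """
    merged: List[str] = []
    inside_entity = False
    for text, iob in tagged:
        if iob == "I" and inside_entity:
            # Continue the current entity by extending the last merged token.
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
            inside_entity = iob == "B"
    return merged


tagged = [("I", "O"), ("like", "O"), ("David", "B"), ("Bowie", "I")]
print(merge_by_iob(tagged))
# ['I', 'like', 'David Bowie']
```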

merge_subtokens

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser can predict "subtokens" that should later be merged into a single token. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges each sequence into a single token.

Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.
doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]

nlp.add_pipe("merge_subtokens")
doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']

Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe adds components to the end of the pipeline, after all other components.

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~
`label`	The subtoken dependency label. Defaults to `"subtok"`. ~~str~~
RETURNS	The modified `Doc` with merged subtokens. ~~Doc~~
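
The idea of merging runs of tokens that share a dependency label can be sketched in plain Python. This is a simplified illustration under stated assumptions: tokens are `(text, dep)` pairs, runs are joined without spaces (as suits the Chinese example), and the helper name `merge_labelled_runs` is hypothetical. The actual component relies on the Matcher and works on the `Doc`, so its behavior at run boundaries may differ.

```python
from itertools import groupby
from typing import List, Tuple


def merge_labelled_runs(
    tagged: List[Tuple[str, str]], label: str = "subtok"
) -> List[str]:
    """Merge each maximal run of consecutive tokens carrying `label`."""
    merged: List[str] = []
    # groupby yields consecutive runs that agree on "has the label or not".
    for has_label, group in groupby(tagged, key=lambda pair: pair[1] == label):
        texts = [text for text, _ in group]
        if has_label:
            merged.append("".join(texts))
        else:
            merged.extend(texts)
    return merged


print(merge_labelled_runs([("拜", "subtok"), ("托", "subtok")]))
# ['拜托']
```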

token_splitter

Split tokens of at least a minimum length into shorter tokens. Intended for use with transformer pipelines, where long spaCy tokens can produce input text that exceeds the transformer model's max length. See managing transformer model max length limitations.

Example

config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
doc = nlp("aaaaabbbbbcccccdddddee")
print([token.text for token in doc])
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']

Setting	Description
`min_length`	The minimum length for a token to be split. Defaults to `25`. ~~int~~
`split_length`	The length of the split tokens. Defaults to `5`. ~~int~~
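
The splitting rule itself is simple enough to sketch in plain Python. The function below is an illustrative assumption, not spaCy's code: `split_long_token` is a hypothetical name, it operates on a single string rather than a `Doc`, and it treats the threshold as "at least `min_length` characters", following the commit description. The defaults mirror the table above.

```python
from typing import List


def split_long_token(
    text: str, min_length: int = 25, split_length: int = 5
) -> List[str]:
    """Split `text` into pieces of `split_length` if it has at least `min_length` characters."""
    if min_length <= 0 or split_length <= 0 or len(text) < min_length:
        # Short tokens (or a disabled configuration) are returned unchanged.
        return [text]
    return [text[i:i + split_length] for i in range(0, len(text), split_length)]


print(split_long_token("aaaaabbbbbcccccdddddee", min_length=20, split_length=5))
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
print(split_long_token("short", min_length=20, split_length=5))
# ['short']
```

The last piece may be shorter than `split_length` when the token length is not an exact multiple of it, as with `'ee'` above.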
