spaCy/pipeline-functions.md at 25cb764e64814aa1ad61b8a854cb6404b38f9753

mirror of https://github.com/explosion/spaCy.git synced 2025-07-04 03:43:09 +03:00

Tidy up and improve docs and docstrings (#3370)

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

2019-03-08 11:42:26 +01:00

4.5 KiB

title: Pipeline Functions
teaser: Other built-in pipeline components and helpers
source: spacy/pipeline/functions.py

merge_noun_chunks

merge_entities

merge_subtokens

merge_noun_chunks

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks". After initialization, the component is typically added to the processing pipeline using nlp.add_pipe.

Example

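# Assumes `nlp` is a loaded English pipeline that includes a tagger and parser.
texts = [t.text for t in nlp(u"I have a blue car")]
assert texts == ["I", "have", "a", "blue", "car"]

# Create the component by its string name and add it to the end of the pipeline.
merge_nps = nlp.create_pipe("merge_noun_chunks")
nlp.add_pipe(merge_nps)

# The noun chunk "a blue car" is now a single token.
texts = [t.text for t in nlp(u"I have a blue car")]
assert texts == ["I", "have", "a blue car"]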

Since noun chunks require part-of-speech tags and the dependency parse, make sure to add this component after the "tagger" and "parser" components. By default, nlp.add_pipe adds a component to the end of the pipeline, after all other components.

Name	Type	Description
`doc`	`Doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.
RETURNS	`Doc`	The modified `Doc` with merged noun chunks.
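To make the effect of merging concrete, here is a minimal pure-Python sketch that works on plain lists of strings instead of a `Doc`. The function name `merge_spans` and the hand-written `(start, end)` spans are hypothetical and only illustrate the idea; this is not how spaCy implements the component.

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) span of tokens into one space-joined token.

    Toy illustration only: `spans` must be non-overlapping and `end` is exclusive.
    """
    merged = []
    position = 0
    for start, end in sorted(spans):
        merged.extend(tokens[position:start])       # tokens before the span stay as-is
        merged.append(" ".join(tokens[start:end]))  # the span becomes one token
        position = end
    merged.extend(tokens[position:])                # tokens after the last span
    return merged


tokens = ["I", "have", "a", "blue", "car"]
noun_chunks = [(0, 1), (2, 5)]  # hypothetical spans for "I" and "a blue car"
result = merge_spans(tokens, noun_chunks)
print(result)
```

In the real component the spans come from the parser's noun chunks, which is why the tagger and parser have to run first.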

merge_entities

Merge named entities into a single token. Also available via the string name "merge_entities". After initialization, the component is typically added to the processing pipeline using nlp.add_pipe.

Example

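# Assumes `nlp` is a loaded English pipeline that includes an entity recognizer.
texts = [t.text for t in nlp(u"I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

# Create the component by its string name and add it to the end of the pipeline.
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)

# The entity "David Bowie" is now a single token.
texts = [t.text for t in nlp(u"I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]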

Since named entities are set by the entity recognizer, make sure to add this component after the "ner" component. By default, nlp.add_pipe adds a component to the end of the pipeline, after all other components.

Name	Type	Description
`doc`	`Doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.
RETURNS	`Doc`	The modified `Doc` with merged entities.
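The ordering requirement can be illustrated with a toy pipeline in plain Python. Everything in this sketch is hypothetical (`toy_ner`, `toy_merge_entities`, `run`, and the dict standing in for a `Doc`); it only models the idea that components run in order and that the merge step can only act on entities that an earlier component has already set.

```python
def toy_ner(doc):
    # Stand-in for an entity recognizer: marks tokens 2..4 as one entity.
    doc["ents"] = [(2, 4)]
    return doc


def toy_merge_entities(doc):
    # Merge every recorded entity span into a single space-joined token.
    tokens, merged, position = doc["tokens"], [], 0
    for start, end in sorted(doc["ents"]):
        merged.extend(tokens[position:start])
        merged.append(" ".join(tokens[start:end]))
        position = end
    merged.extend(tokens[position:])
    doc["tokens"] = merged
    doc["ents"] = []  # toy simplification: the old indices are no longer valid
    return doc


def run(pipeline, words):
    # Apply each component in order to a fresh toy document.
    doc = {"tokens": list(words), "ents": []}
    for component in pipeline:
        doc = component(doc)
    return doc["tokens"]


words = ["I", "like", "David", "Bowie"]
merged_after_ner = run([toy_ner, toy_merge_entities], words)
merged_before_ner = run([toy_merge_entities, toy_ner], words)
print(merged_after_ner)
print(merged_before_ner)
```

When the merge step runs first, it finds no entities and leaves the tokens unchanged, which mirrors why the component should come after "ner".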

merge_subtokens

Merge subtokens into a single token. Also available via the string name "merge_subtokens". After initialization, the component is typically added to the processing pipeline using nlp.add_pipe.

As of v2.1, the parser is able to predict "subtokens" that should be merged into one single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.

Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.

doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]

merge_subtok = nlp.create_pipe("merge_subtokens")
nlp.add_pipe(merge_subtok)

doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']

Since subtokens are set by the parser, make sure to add this component after the "parser" component. By default, nlp.add_pipe adds a component to the end of the pipeline, after all other components.

Name	Type	Description
`doc`	`Doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.
`label`	`unicode`	The subtoken dependency label. Defaults to `"subtok"`.
RETURNS	`Doc`	The modified `Doc` with merged subtokens.
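As a rough sketch of the idea described above (finding runs of consecutive tokens that carry the subtoken label and merging each run), the following pure-Python example works on `(text, dep)` pairs. The function name `merge_subtoken_runs` is hypothetical, and joining the texts without a separator is an assumption made for this example; the actual component uses the Matcher on a `Doc`.

```python
from itertools import groupby


def merge_subtoken_runs(tokens, label="subtok"):
    """Merge each run of consecutive (text, dep) pairs whose dep equals `label`.

    Toy illustration only: merged texts are concatenated without whitespace.
    """
    merged = []
    # groupby splits the sequence into consecutive runs sharing the same key.
    for is_subtok, run in groupby(tokens, key=lambda pair: pair[1] == label):
        texts = [text for text, _ in run]
        if is_subtok:
            merged.append("".join(texts))  # one token per run of subtokens
        else:
            merged.extend(texts)           # other tokens are kept unchanged
    return merged


tokens = [("拜", "subtok"), ("托", "subtok")]
result = merge_subtoken_runs(tokens)
print(result)
```

The `label` argument mirrors the component's `label` parameter, so a pipeline trained with a different subtoken label could be modeled by passing that label instead.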
