---
title: Pipeline Functions
teaser: Other built-in pipeline components and helpers
source: spacy/pipeline/functions.py
menu:
  - ['merge_noun_chunks', 'merge_noun_chunks']
  - ['merge_entities', 'merge_entities']
  - ['merge_subtokens', 'merge_subtokens']
---

## merge_noun_chunks {#merge_noun_chunks tag="function"}

Merge noun chunks into a single token. Also available via the string name
`"merge_noun_chunks"`. After initialization, the component is typically added to
the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe).

> #### Example
>
> ```python
> texts = [t.text for t in nlp(u"I have a blue car")]
> assert texts == ["I", "have", "a", "blue", "car"]
>
> merge_nps = nlp.create_pipe("merge_noun_chunks")
> nlp.add_pipe(merge_nps)
>
> texts = [t.text for t in nlp(u"I have a blue car")]
> assert texts == ["I", "have", "a blue car"]
> ```

<Infobox variant="warning">

Since noun chunks require part-of-speech tags and the dependency parse, make
sure to add this component _after_ the `"tagger"` and `"parser"` components. By
default, `nlp.add_pipe` will add components to the end of the pipeline and after
all other components.

</Infobox>

| Name        | Type  | Description                                                  |
| ----------- | ----- | ------------------------------------------------------------ |
| `doc`       | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with merged noun chunks.                  |

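Internally, this component is a plain function that modifies the `Doc` in place
and returns it. The following is a minimal sketch of roughly equivalent
behavior, built on [`Doc.retokenize`](/api/doc#retokenize); the attributes the
built-in component copies onto the merged token may differ:

```python
def merge_noun_chunks_sketch(doc):
    """Merge each noun chunk in the doc into a single token."""
    if not doc.is_parsed:  # noun chunks require the dependency parse
        return doc
    with doc.retokenize() as retokenizer:
        for np in doc.noun_chunks:
            # keep the tag and dependency of the chunk's root token
            retokenizer.merge(np, attrs={"tag": np.root.tag, "dep": np.root.dep})
    return doc
```

Because `nlp.add_pipe` accepts any callable that takes and returns a `Doc`, a
sketch like this can be added to the pipeline the same way as the built-in
component.
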
## merge_entities {#merge_entities tag="function"}

Merge named entities into a single token. Also available via the string name
`"merge_entities"`. After initialization, the component is typically added to
the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe).

> #### Example
>
> ```python
> texts = [t.text for t in nlp(u"I like David Bowie")]
> assert texts == ["I", "like", "David", "Bowie"]
>
> merge_ents = nlp.create_pipe("merge_entities")
> nlp.add_pipe(merge_ents)
>
> texts = [t.text for t in nlp(u"I like David Bowie")]
> assert texts == ["I", "like", "David Bowie"]
> ```

<Infobox variant="warning">

Since named entities are set by the entity recognizer, make sure to add this
component _after_ the `"ner"` component. By default, `nlp.add_pipe` will add
components to the end of the pipeline and after all other components.

</Infobox>

| Name        | Type  | Description                                                  |
| ----------- | ----- | ------------------------------------------------------------ |
| `doc`       | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with merged entities.                     |

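Since the component is a plain function, it can also be applied to a `Doc`
directly, without being added to the pipeline. The import below assumes the
functions are exposed from `spacy.pipeline`, matching the `source` path noted
above:

```python
import spacy
from spacy.pipeline import merge_entities  # assumed import path

nlp = spacy.load("en_core_web_sm")
doc = merge_entities(nlp(u"I like David Bowie"))  # returns the modified Doc
print([t.text for t in doc])  # ['I', 'like', 'David Bowie']
```
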
## merge_subtokens {#merge_subtokens tag="function" new="2.1"}

Merge subtokens into a single token. Also available via the string name
`"merge_subtokens"`. After initialization, the component is typically added to
the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe).

As of v2.1, the parser is able to predict "subtokens" that should be merged
into a single token later on. This is especially relevant for languages like
Chinese, Japanese or Korean, where a "word" isn't defined as a
whitespace-delimited sequence of characters. Under the hood, this component
uses the [`Matcher`](/api/matcher) to find sequences of tokens with the
dependency label `"subtok"` and then merges them into a single token.

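A minimal sketch of that mechanism is shown below. It is an illustration, not
the component's exact source: the pattern, the overlap handling and the
attributes set on the merged tokens may differ in the built-in implementation.

```python
from spacy.matcher import Matcher

def merge_subtokens_sketch(doc, label="subtok"):
    matcher = Matcher(doc.vocab)
    # one or more consecutive tokens carrying the subtoken dependency label
    matcher.add("SUBTOK", None, [{"DEP": label, "OP": "+"}])
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # "OP": "+" also returns sub-matches, so keep only the longest
    # non-overlapping spans before merging
    filtered, seen = [], set()
    for span in sorted(spans, key=len, reverse=True):
        if not any(i in seen for i in range(span.start, span.end)):
            filtered.append(span)
            seen.update(range(span.start, span.end))
    with doc.retokenize() as retokenizer:
        for span in filtered:
            retokenizer.merge(span)
    return doc
```
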
> #### Example
>
> Note that this example assumes a custom Chinese model that oversegments and
> was trained to predict subtokens.
>
> ```python
> doc = nlp("拜托")
> print([(token.text, token.dep_) for token in doc])
> # [('拜', 'subtok'), ('托', 'subtok')]
>
> merge_subtok = nlp.create_pipe("merge_subtokens")
> nlp.add_pipe(merge_subtok)
>
> doc = nlp("拜托")
> print([token.text for token in doc])
> # ['拜托']
> ```

<Infobox variant="warning">

Since subtokens are set by the parser, make sure to add this component _after_
the `"parser"` component. By default, `nlp.add_pipe` will add components to the
end of the pipeline and after all other components.

</Infobox>

| Name        | Type    | Description                                                  |
| ----------- | ------- | ------------------------------------------------------------ |
| `doc`       | `Doc`   | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| `label`     | unicode | The subtoken dependency label. Defaults to `"subtok"`.       |
| **RETURNS** | `Doc`   | The modified `Doc` with merged subtokens.                    |
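
Since the string-name factory doesn't obviously expose the `label` setting, one
way to use a non-default label is to call or wrap the function directly. A
hypothetical wrapper for a parser that predicts a custom label (`"my_subtok"`
is an invented name, and the import path is an assumption):

```python
from spacy.pipeline import merge_subtokens  # assumed import path

def merge_my_subtokens(doc):
    # pass a non-default dependency label (hypothetical)
    return merge_subtokens(doc, label="my_subtok")

nlp.add_pipe(merge_my_subtokens, after="parser")
```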