Mirror of https://github.com/explosion/spaCy.git, synced 2025-10-26 13:41:21 +03:00

---
title: Pipeline Functions
teaser: Other built-in pipeline components and helpers
source: spacy/pipeline/functions.py
menu:
  - ['merge_noun_chunks', 'merge_noun_chunks']
  - ['merge_entities', 'merge_entities']
  - ['merge_subtokens', 'merge_subtokens']
---

## merge_noun_chunks {#merge_noun_chunks tag="function"}

Merge each noun chunk into a single token. Also available via the string name
`"merge_noun_chunks"`. After initialization, the component is typically added to
the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe).

> #### Example
>
> ```python
> import spacy
>
> nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
> # Before: every word is its own token
> texts = [t.text for t in nlp("I have a blue car")]
> assert texts == ["I", "have", "a", "blue", "car"]
>
> # Add the component to the end of the pipeline
> merge_nps = nlp.create_pipe("merge_noun_chunks")
> nlp.add_pipe(merge_nps)
>
> # After: the noun chunk "a blue car" is a single token
> texts = [t.text for t in nlp("I have a blue car")]
> assert texts == ["I", "have", "a blue car"]
> ```

<Infobox variant="warning">

Since noun chunks require part-of-speech tags and the dependency parse, make
sure to add this component _after_ the `"tagger"` and `"parser"` components. By
default, `nlp.add_pipe` adds a component to the end of the pipeline, after all
other components.

</Infobox>

| Name        | Type  | Description                                                  |
| ----------- | ----- | ------------------------------------------------------------ |
| `doc`       | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with merged noun chunks.                  |

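To make the effect of merging concrete, here is a minimal plain-Python sketch
that collapses token spans into single strings. It only illustrates the idea and
is not spaCy's implementation, which modifies the `Doc` itself. The helper name
`merge_spans` and the `(start, end)` span format are assumptions made for this
example.

```python
from typing import List, Tuple


def merge_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Collapse each half-open (start, end) span of tokens into one string.

    Hypothetical helper: spans are assumed to be sorted and non-overlapping.
    """
    merged = []
    position = 0
    for start, end in spans:
        # Copy the tokens that sit before this span unchanged.
        merged.extend(tokens[position:start])
        # Join the span's tokens with single spaces to form one token.
        merged.append(" ".join(tokens[start:end]))
        position = end
    # Copy whatever is left after the last span.
    merged.extend(tokens[position:])
    return merged


tokens = ["I", "have", "a", "blue", "car"]
# Assumed noun chunks for this sentence: "I" (0-1) and "a blue car" (2-5).
noun_chunks = [(0, 1), (2, 5)]
print(merge_spans(tokens, noun_chunks))  # ['I', 'have', 'a blue car']
```

Single-token spans such as `"I"` pass through unchanged, so only multi-token
chunks alter the token sequence.
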
## merge_entities {#merge_entities tag="function"}

Merge each named entity into a single token. Also available via the string name
`"merge_entities"`. After initialization, the component is typically added to
the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe).

> #### Example
>
> ```python
> import spacy
>
> nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
> # Before: the entity spans two tokens
> texts = [t.text for t in nlp("I like David Bowie")]
> assert texts == ["I", "like", "David", "Bowie"]
>
> # Add the component to the end of the pipeline
> merge_ents = nlp.create_pipe("merge_entities")
> nlp.add_pipe(merge_ents)
>
> # After: the entity "David Bowie" is a single token
> texts = [t.text for t in nlp("I like David Bowie")]
> assert texts == ["I", "like", "David Bowie"]
> ```

<Infobox variant="warning">

Since named entities are set by the entity recognizer, make sure to add this
component _after_ the `"ner"` component. By default, `nlp.add_pipe` adds a
component to the end of the pipeline, after all other components.

</Infobox>

| Name        | Type  | Description                                                  |
| ----------- | ----- | ------------------------------------------------------------ |
| `doc`       | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| **RETURNS** | `Doc` | The modified `Doc` with merged entities.                     |

## merge_subtokens {#merge_subtokens tag="function" new="2.1"}

Merge subtokens into a single token. Also available via the string name
`"merge_subtokens"`. After initialization, the component is typically added to
the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe).

As of v2.1, the parser can predict "subtokens" that should later be merged into
a single token. This is especially relevant for languages like Chinese,
Japanese or Korean, where a "word" isn't defined as a whitespace-delimited
sequence of characters. Under the hood, this component uses the
[`Matcher`](/api/matcher) to find sequences of tokens with the dependency label
`"subtok"` and then merges each sequence into a single token.

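The matching step described above can be approximated in plain Python. The
sketch below joins every maximal run of consecutive tokens labelled `"subtok"`
into one token. It is a simplified illustration rather than the component's
actual code: it works on `(text, dep)` pairs instead of a `Doc`, does not use the
`Matcher`, and the function name `merge_labelled_runs` is made up for this
example.

```python
from itertools import groupby
from typing import List, Tuple


def merge_labelled_runs(
    tokens: List[Tuple[str, str]], label: str = "subtok"
) -> List[str]:
    """Join each run of consecutive tokens whose dependency label equals `label`."""
    merged = []
    # groupby yields consecutive tokens that share the same "is a subtoken" flag.
    for is_subtoken, run in groupby(tokens, key=lambda tok: tok[1] == label):
        texts = [text for text, _ in run]
        if is_subtoken:
            # Concatenate without whitespace, as in the oversegmented case.
            merged.append("".join(texts))
        else:
            # Tokens with any other label are kept as they are.
            merged.extend(texts)
    return merged


# (text, dep) pairs as a hypothetical oversegmenting model might produce them.
tokens = [("拜", "subtok"), ("托", "subtok")]
print(merge_labelled_runs(tokens))  # ['拜托']
```
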
> #### Example
>
> Note that this example assumes a custom Chinese model that oversegments and
> was trained to predict subtokens.
>
> ```python
> # `nlp` is assumed to be the custom Chinese model described above
> doc = nlp("拜托")
> print([(token.text, token.dep_) for token in doc])
> # [('拜', 'subtok'), ('托', 'subtok')]
>
> # Add the component to the end of the pipeline
> merge_subtok = nlp.create_pipe("merge_subtokens")
> nlp.add_pipe(merge_subtok)
>
> doc = nlp("拜托")
> print([token.text for token in doc])
> # ['拜托']
> ```

<Infobox variant="warning">

Since subtokens are set by the parser, make sure to add this component _after_
the `"parser"` component. By default, `nlp.add_pipe` adds a component to the
end of the pipeline, after all other components.

</Infobox>

| Name        | Type    | Description                                                  |
| ----------- | ------- | ------------------------------------------------------------ |
| `doc`       | `Doc`   | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| `label`     | unicode | The subtoken dependency label. Defaults to `"subtok"`.       |
| **RETURNS** | `Doc`   | The modified `Doc` with merged subtokens.                    |