When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc`
object. The `Doc` is then processed in several different steps – this is also
referred to as the **processing pipeline**. The pipeline used by the
[default models](/models) consists of a tagger, a parser and an entity
recognizer. Each pipeline component returns the processed `Doc`, which is then
passed on to the next component.
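
For example, loading one of the default English models runs all of these
components in order on each call to `nlp`. A minimal sketch, assuming the small
English model `en_core_web_sm` is installed:

```python
import spacy

# Loading a model also loads the pipeline defined in its meta data
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# The tokenizer runs first but isn't listed as a pipeline component
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner']
```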

> - **Name:** ID of the pipeline component.
> - **Component:** spaCy's implementation of the component.
> - **Creates:** Objects, attributes and properties modified and set by the
>   component.

| Name              | Component                                                          | Creates                                                     | Description                                       |
| ----------------- | ------------------------------------------------------------------ | ----------------------------------------------------------- | ------------------------------------------------- |
| **tokenizer**     | [`Tokenizer`](/api/tokenizer)                                      | `Doc`                                                       | Segment text into tokens.                         |
| **tagger**        | [`Tagger`](/api/tagger)                                            | `Doc[i].tag`                                                | Assign part-of-speech tags.                       |
| **parser**        | [`DependencyParser`](/api/dependencyparser)                        | `Doc[i].head`, `Doc[i].dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels.                         |
| **ner**           | [`EntityRecognizer`](/api/entityrecognizer)                        | `Doc.ents`, `Doc[i].ent_iob`, `Doc[i].ent_type`             | Detect and label named entities.                  |
| **textcat**       | [`TextCategorizer`](/api/textcategorizer)                          | `Doc.cats`                                                  | Assign document labels.                           |
| ...               | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx`                    | Assign custom attributes, methods or properties.  |
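
Continuing the sketch above, the annotations each component creates can be read
off the processed `Doc`. The exact values depend on the model, so the outputs
shown are illustrative:

```python
# Set by the tagger, parser and entity recognizer respectively
print(doc[0].tag_)                    # e.g. 'NNP'
print(doc[0].dep_, doc[0].head.text)  # e.g. 'nsubj looking'
print([(ent.text, ent.label_) for ent in doc.ents])
```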

The processing pipeline always **depends on the statistical model** and its
capabilities. For example, a pipeline can only include an entity recognizer
component if the model includes data to make predictions of entity labels. This
is why each model will specify the pipeline to use in its meta data, as a simple
list containing the component names:

```json
"pipeline": ["tagger", "parser", "ner"]
```
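
At runtime, you can compare the pipeline declared in the model's meta data with
the components that are actually loaded. A short sketch, assuming the `nlp`
object from above:

```python
# The meta data of a packaged model includes its pipeline definition
print(nlp.meta["pipeline"])  # ['tagger', 'parser', 'ner']
# nlp.pipe_names reflects the components currently in the pipeline
print(nlp.pipe_names)        # ['tagger', 'parser', 'ner']
```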

import Accordion from 'components/accordion.js'

<Accordion title="Does the order of pipeline components matter?" id="pipeline-components-order">

In spaCy v2.x, the statistical components like the tagger or parser are
independent and don't share any data between themselves. For example, the named
entity recognizer doesn't use any features set by the tagger and parser, and so
on. This means that you can swap them, or remove single components from the
pipeline without affecting the others.
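
For example, a minimal sketch of removing a single component, assuming the
small English model is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Removing the parser leaves the tagger and entity recognizer intact
nlp.remove_pipe("parser")
print(nlp.pipe_names)  # ['tagger', 'ner']
```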

However, custom components may depend on annotations set by other components.
For example, a custom lemmatizer may need the part-of-speech tags assigned, so
it'll only work if it's added after the tagger. The parser will respect
pre-defined sentence boundaries, so if a previous component in the pipeline sets
them, its dependency predictions may be different. Similarly, it matters if you
add the [`EntityRuler`](/api/entityruler) before or after the statistical entity
recognizer: if it's added before, the entity recognizer will take the existing
entities into account when making predictions.

The [`EntityLinker`](/api/entitylinker), which resolves named entities to
knowledge base IDs, should be preceded by a pipeline component that recognizes
entities, such as the [`EntityRecognizer`](/api/entityrecognizer).
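
A short sketch of adding an `EntityRuler` before the entity recognizer; the
`"ORG"` pattern is purely illustrative:

```python
from spacy.pipeline import EntityRuler

# Assumes `nlp` is a loaded model with an "ner" component
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])
# Inserting the ruler before the NER lets the statistical model take
# the ruler's entities into account when making predictions
nlp.add_pipe(ruler, before="ner")
```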

</Accordion>

<Accordion title="Why is the tokenizer special?" id="pipeline-components-tokenizer">

The tokenizer is a "special" component and isn't part of the regular pipeline.
It also doesn't show up in `nlp.pipe_names`. The reason is that there can only
really be one tokenizer, and while all other pipeline components take a `Doc`
and return it, the tokenizer takes a **string of text** and turns it into a
`Doc`. You can still customize the tokenizer, though. `nlp.tokenizer` is
writable, so you can either create your own
[`Tokenizer` class from scratch](/usage/linguistic-features#native-tokenizers),
or even replace it with an
[entirely custom function](/usage/linguistic-features#custom-tokenizer).
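
For instance, a bare-bones whitespace tokenizer can be assigned directly to
`nlp.tokenizer`. A minimal sketch, assuming `nlp` is an existing pipeline:

```python
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    """Toy tokenizer that only splits on single spaces."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        # A custom tokenizer needs to return a Doc
        return Doc(self.vocab, words=words)

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
```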

</Accordion>

---