The central data structures in spaCy are the [`Language`](/api/language) class,
the [`Vocab`](/api/vocab) and the [`Doc`](/api/doc) object. The `Language` class
is used to process a text and turn it into a `Doc` object. It's typically stored
as a variable called `nlp`. The `Doc` object owns the **sequence of tokens** and
all their annotations. By centralizing strings, word vectors and lexical
attributes in the `Vocab`, we avoid storing multiple copies of this data. This
saves memory, and ensures there's a **single source of truth**.
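
As a minimal sketch of how these three objects relate, the example below uses a blank English pipeline (tokenizer only, so no trained package is needed). The example text is an arbitrary choice for illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup")

# The Doc holds a reference to the pipeline's Vocab instead of copying it.
shares_vocab = doc.vocab is nlp.vocab

# Strings are stored once in the vocab's string store and referenced by hash.
apple_hash = nlp.vocab.strings["Apple"]
apple_text = nlp.vocab.strings[apple_hash]

# Token attributes without a trailing underscore hold the hash,
# the underscore variants resolve it back to the string.
first = doc[0]
print(shares_vocab, first.orth == apple_hash, first.text, apple_text)
```

Because the token only stores the hash, every occurrence of the same word in any `Doc` created by this `nlp` object resolves to the same entry in the shared `Vocab`.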

Text annotations are also designed to allow a single source of truth: the `Doc`
object owns the data, and [`Span`](/api/span) and [`Token`](/api/token) are
**views that point into it**. The `Doc` object is constructed by the
[`Tokenizer`](/api/tokenizer), and then **modified in place** by the components
of the pipeline. The `Language` object coordinates these components. It takes
raw text and sends it through the pipeline, returning an **annotated document**.
It also orchestrates training and serialization.
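
The view relationship described above can be checked directly. This is a small sketch with a blank pipeline; the lemma value written here is an arbitrary illustration, not something a real lemmatizer would necessarily produce.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Berlin is a city in Germany")

span = doc[0:3]  # Span: a slice of the Doc
token = span[0]  # Token: a single position in the Doc

# Both views keep a reference to the Doc they point into.
same_doc = span.doc is doc and token.doc is doc

# Writing through a view changes the data owned by the Doc,
# so the change is visible through any other view of the same position.
token.lemma_ = "berlin"
print(same_doc, span.text, doc[0].lemma_)
```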

![Library architecture](../../images/architecture.svg)

### Container objects {#architecture-containers}

| Name                        | Description                                                                                                                                             |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`Doc`](/api/doc)           | A container for accessing linguistic annotations.                                                                                                       |
| [`DocBin`](/api/docbin)     | A collection of `Doc` objects for efficient binary serialization. Also used for [training data](/api/data-formats#binary-training).                     |
| [`Example`](/api/example)   | A collection of training annotations, containing two `Doc` objects: the reference data and the predictions.                                             |
| [`Language`](/api/language) | Processing class that turns text into `Doc` objects. Different languages implement their own subclasses of it. The variable is typically called `nlp`.  |
| [`Lexeme`](/api/lexeme)     | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
| [`Span`](/api/span)         | A slice from a `Doc` object.                                                                                                                            |
| [`Token`](/api/token)       | An individual token — i.e. a word, punctuation symbol, whitespace, etc.                                                                                 |
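
To make two of the less obvious containers more concrete, here is a hedged sketch of a `DocBin` round trip and of an `Example` built from a dictionary of annotations. The texts and the `GPE` label are made up for illustration.

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.blank("en")
docs = [nlp("This is a sentence."), nlp("Here is another one.")]

# Serialize a collection of Doc objects to bytes and load them back.
data = DocBin(docs=docs).to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
restored_texts = [d.text for d in restored]

# An Example pairs a predicted Doc with a reference Doc holding the gold annotations.
predicted = nlp.make_doc("Berlin is nice")
example = Example.from_dict(predicted, {"entities": [(0, 6, "GPE")]})
gold_ents = [(ent.text, ent.label_) for ent in example.reference.ents]

print(restored_texts, gold_ents)
```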

### Processing pipeline {#architecture-pipeline}

The processing pipeline consists of one or more **pipeline components** that are
called on the `Doc` in order. The tokenizer runs before the components. Pipeline
components can be added using [`Language.add_pipe`](/api/language#add_pipe).
They can contain a statistical model and trained weights, or only make
rule-based modifications to the `Doc`. spaCy provides a range of built-in
components for different language processing tasks and also allows adding
[custom components](/usage/processing-pipelines#custom-components).
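
As a rough illustration, the sketch below adds one built-in component and one custom component to a blank pipeline. The component name `token_counter` and the `n_tokens` key are hypothetical names chosen for this example.

```python
import spacy
from spacy.language import Language


@Language.component("token_counter")  # hypothetical component name
def token_counter(doc):
    # A rule-based component: receives the Doc, modifies it in place, returns it.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")               # built-in, rule-based
nlp.add_pipe("token_counter", last=True)  # custom, runs after the sentencizer

doc = nlp("This is one sentence. This is another.")
n_sents = len(list(doc.sents))
print(nlp.pipe_names, n_sents, doc.user_data["n_tokens"])
```

The tokenizer does not appear in `nlp.pipe_names` because it runs before the components and is not part of the component list.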

![The processing pipeline](../../images/pipeline.svg)

| Name                                            | Description                                                                                 |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------- |
| [`AttributeRuler`](/api/attributeruler)         | Set token attributes using matcher rules.                                                   |
| [`DependencyParser`](/api/dependencyparser)     | Predict syntactic dependencies.                                                             |
| [`EntityLinker`](/api/entitylinker)             | Disambiguate named entities to nodes in a knowledge base.                                   |
| [`EntityRecognizer`](/api/entityrecognizer)     | Predict named entities, e.g. persons or products.                                           |
| [`EntityRuler`](/api/entityruler)               | Add entity spans to the `Doc` using token-based rules or exact phrase matches.              |
| [`Lemmatizer`](/api/lemmatizer)                 | Determine the base forms of words.                                                          |
| [`Morphologizer`](/api/morphologizer)           | Predict morphological features and coarse-grained part-of-speech tags.                      |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Predict sentence boundaries.                                                                |
| [`Sentencizer`](/api/sentencizer)               | Implement rule-based sentence boundary detection that doesn't require the dependency parse. |
| [`Tagger`](/api/tagger)                         | Predict part-of-speech tags.                                                                |
| [`TextCategorizer`](/api/textcategorizer)       | Predict categories or labels over the whole document.                                       |
| [`Tok2Vec`](/api/tok2vec)                       | Apply a "token-to-vector" model and set its outputs.                                        |
| [`Tokenizer`](/api/tokenizer)                   | Segment raw text and create `Doc` objects from the words.                                   |
| [`TrainablePipe`](/api/pipe)                    | Class that all trainable pipeline components inherit from.                                  |
| [`Transformer`](/api/transformer)               | Use a transformer model and set its outputs.                                                |
| [Other functions](/api/pipeline-functions)      | Apply a modification to the `Doc` automatically, e.g. to merge spans of tokens.             |
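
Not every component in the table needs trained weights. As a minimal sketch, the `EntityRuler` can be added to a blank pipeline and configured with patterns; the labels and patterns below are arbitrary examples.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        # Exact phrase match
        {"label": "ORG", "pattern": "Explosion"},
        # Token-based rule
        {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    ]
)

doc = nlp("Explosion has fans in San Francisco")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```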

### Matchers {#architecture-matchers}

Matchers help you find and extract information from [`Doc`](/api/doc) objects
based on match patterns describing the sequences you're looking for. A matcher
operates on a `Doc` and gives you access to the matched tokens **in context**.

| Name                                          | Description                                                                                                                                                                        |
| --------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`DependencyMatcher`](/api/dependencymatcher) | Match sequences of tokens based on dependency trees using [Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). |
| [`Matcher`](/api/matcher)                     | Match sequences of tokens, based on pattern rules, similar to regular expressions.                                                                                                 |
| [`PhraseMatcher`](/api/phrasematcher)         | Match sequences of tokens based on phrases.                                                                                                                                        |
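
The following sketch shows the `Matcher` and `PhraseMatcher` side by side on the same `Doc`. The rule names `HELLO_WORLD` and `CITY` and the example text are assumptions made for illustration.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")
doc = nlp("Hello, world! Hello there, New York.")

# Token-based patterns: one dict of attributes per token.
matcher = Matcher(nlp.vocab)
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]])

# Phrase patterns: Doc objects describing the phrases to find.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("CITY", [nlp.make_doc("New York")])

# Each match is (match_id, start, end); slicing the Doc gives the Span in context.
token_matches = [doc[start:end].text for _, start, end in matcher(doc)]
phrase_matches = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
print(token_matches, phrase_matches)
```

Since the matches are token offsets into the `Doc`, you can move from a matched span to its surrounding tokens or to any annotations the pipeline has set.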

### Other classes {#architecture-other}

| Name                                             | Description                                                                                        |
| ------------------------------------------------ | -------------------------------------------------------------------------------------------------- |
| [`Corpus`](/api/corpus)                          | Class for managing annotated corpora used for training and evaluation.                             |
| [`KnowledgeBase`](/api/kb)                       | Storage for entities and aliases of a knowledge base for entity linking.                           |
| [`Lookups`](/api/lookups)                        | Container for convenient access to large lookup tables and dictionaries.                           |
| [`MorphAnalysis`](/api/morphology#morphanalysis) | A morphological analysis.                                                                          |
| [`Morphology`](/api/morphology)                  | Store morphological analyses and map them to and from hash values.                                 |
| [`Scorer`](/api/scorer)                          | Compute evaluation scores.                                                                         |
| [`StringStore`](/api/stringstore)                | Map strings to and from hash values.                                                               |
| [`Vectors`](/api/vectors)                        | Container class for vector data keyed by string.                                                   |
| [`Vocab`](/api/vocab)                            | The shared vocabulary that stores strings and gives you access to [`Lexeme`](/api/lexeme) objects. |
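
Two of these classes can be tried without a pipeline at all. The sketch below maps a string to its hash and back with a `StringStore`, and stores a small table in `Lookups`; the table name `lemma_exceptions` and its contents are hypothetical.

```python
from spacy.lookups import Lookups
from spacy.strings import StringStore

# Map a string to its hash value and back.
strings = StringStore(["apple"])
apple_hash = strings["apple"]
round_trip = strings[apple_hash]

# Store a named lookup table and read a value from it.
lookups = Lookups()
lookups.add_table("lemma_exceptions", {"mice": "mouse"})  # hypothetical table
table = lookups.get_table("lemma_exceptions")

print(round_trip, lookups.has_table("lemma_exceptions"), table["mice"])
```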