
The central data structures in spaCy are the [`Language`](/api/language) class,
the [`Vocab`](/api/vocab) and the [`Doc`](/api/doc) object. The `Language` class
is used to process a text and turn it into a `Doc` object. It's typically stored
as a variable called `nlp`. The `Doc` object owns the **sequence of tokens** and
all their annotations. By centralizing strings, word vectors and lexical
attributes in the `Vocab`, we avoid storing multiple copies of this data. This
saves memory, and ensures there's a **single source of truth**.
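
As a rough illustration of how these three objects relate, the sketch below assumes a blank English pipeline created with `spacy.blank("en")` (tokenizer only, no trained components), so it runs without downloading a model. The example text and variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline, so no trained model is required.
nlp = spacy.blank("en")

# The Language object turns raw text into a Doc.
doc = nlp("I love coffee")

# The Doc owns the sequence of tokens.
tokens = [token.text for token in doc]

# The Doc and the Language object refer to the same shared Vocab.
shares_vocab = doc.vocab is nlp.vocab

# Strings seen during processing are stored once, in the Vocab's StringStore.
knows_coffee = "coffee" in nlp.vocab.strings
```

Because the `Doc` only holds a reference to the `Vocab`, processing more texts with the same `nlp` object reuses the existing string entries instead of copying them.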

Text annotations are also designed to allow a single source of truth: the `Doc`
object owns the data, and [`Span`](/api/span) and [`Token`](/api/token) are
**views that point into it**. The `Doc` object is constructed by the
[`Tokenizer`](/api/tokenizer), and then **modified in place** by the components
of the pipeline. The `Language` object coordinates these components. It takes
raw text and sends it through the pipeline, returning an **annotated document**.
It also orchestrates training and serialization.
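
The view relationship described above can be sketched as follows. This is a minimal example assuming a blank English pipeline. The lemma value `"IT"` is arbitrary and only there to make the write visible.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")

span = doc[0:3]   # a Span: a slice of the Doc
token = doc[1]    # a Token: a single position in the Doc

# Both views keep a reference to the Doc that owns the data.
same_doc = span.doc is doc and token.doc is doc

# Writing through one view modifies the Doc in place ...
token.lemma_ = "IT"

# ... so the change is visible through any other view of the same position.
lemma_via_span = span[1].lemma_
```

Pipeline components rely on the same mechanism: they set attributes on the `Doc` they receive rather than building a new document.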

![Library architecture](../../images/architecture.svg)

### Container objects {#architecture-containers}

| Name                          | Description                                                                                                                                             |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`Doc`](/api/doc)             | A container for accessing linguistic annotations.                                                                                                       |
| [`DocBin`](/api/docbin)       | A collection of `Doc` objects for efficient binary serialization. Also used for [training data](/api/data-formats#binary-training).                     |
| [`Example`](/api/example)     | A collection of training annotations, containing two `Doc` objects: the reference data and the predictions.                                             |
| [`Language`](/api/language)   | Processing class that turns text into `Doc` objects. Different languages implement their own subclasses of it. The variable is typically called `nlp`.  |
| [`Lexeme`](/api/lexeme)       | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
| [`Span`](/api/span)           | A slice from a `Doc` object.                                                                                                                            |
| [`SpanGroup`](/api/spangroup) | A named collection of spans belonging to a `Doc`.                                                                                                       |
| [`Token`](/api/token)         | An individual token — i.e. a word, punctuation symbol, whitespace, etc.                                                                                 |
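
To make the role of `DocBin` more concrete, here is a small sketch of a serialization round trip. It assumes a blank English pipeline and two made-up example texts. Real training data would usually carry annotations as well.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Collect several Doc objects in one DocBin.
doc_bin = DocBin()
for text in ["Hello world", "I love coffee"]:
    doc_bin.add(nlp(text))

# Serialize the whole collection to a single bytes object.
data = doc_bin.to_bytes()

# Restoring the docs requires a Vocab, since the strings live there.
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
texts = [doc.text for doc in restored]
```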

### Processing pipeline {#architecture-pipeline}

The processing pipeline consists of one or more **pipeline components** that are
called on the `Doc` in order. The tokenizer runs before the components. Pipeline
components can be added using [`Language.add_pipe`](/api/language#add_pipe).
They can contain a statistical model and trained weights, or only make
rule-based modifications to the `Doc`. spaCy provides a range of built-in
components for different language processing tasks and also allows adding
[custom components](/usage/processing-pipelines#custom-components).

![The processing pipeline](../../images/pipeline.svg)
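
The following sketch shows one way to assemble such a pipeline. It assumes a blank English pipeline, adds the built-in `sentencizer`, and registers a custom rule-based component. The component name `token_counter` and the `n_tokens` key are hypothetical and chosen only for this example.

```python
import spacy
from spacy.language import Language


@Language.component("token_counter")  # hypothetical component name
def token_counter(doc):
    # A component receives the Doc, modifies it in place and returns it.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")    # built-in, rule-based component
nlp.add_pipe("token_counter")  # custom component registered above

# The tokenizer runs first, then the components in the order they were added.
doc = nlp("This is a sentence. This is another one.")
sentences = [sent.text for sent in doc.sents]
```

`nlp.pipe_names` lists the components in the order they will be called. The tokenizer is not part of that list, since it creates the `Doc` before any component runs.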

| Name                                            | Description                                                                                 |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------- |
| [`AttributeRuler`](/api/attributeruler)         | Set token attributes using matcher rules.                                                   |
| [`DependencyParser`](/api/dependencyparser)     | Predict syntactic dependencies.                                                             |
| [`EntityLinker`](/api/entitylinker)             | Disambiguate named entities to nodes in a knowledge base.                                   |
| [`EntityRecognizer`](/api/entityrecognizer)     | Predict named entities, e.g. persons or products.                                           |
| [`EntityRuler`](/api/entityruler)               | Add entity spans to the `Doc` using token-based rules or exact phrase matches.              |
| [`Lemmatizer`](/api/lemmatizer)                 | Determine the base forms of words.                                                          |
| [`Morphologizer`](/api/morphologizer)           | Predict morphological features and coarse-grained part-of-speech tags.                      |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Predict sentence boundaries.                                                                |
| [`Sentencizer`](/api/sentencizer)               | Implement rule-based sentence boundary detection that doesn't require the dependency parse. |
| [`Tagger`](/api/tagger)                         | Predict part-of-speech tags.                                                                |
| [`TextCategorizer`](/api/textcategorizer)       | Predict categories or labels over the whole document.                                       |
| [`Tok2Vec`](/api/tok2vec)                       | Apply a "token-to-vector" model and set its outputs.                                        |
| [`Tokenizer`](/api/tokenizer)                   | Segment raw text and create `Doc` objects from the words.                                   |
| [`TrainablePipe`](/api/pipe)                    | Class that all trainable pipeline components inherit from.                                  |
| [`Transformer`](/api/transformer)               | Use a transformer model and set its outputs.                                                |
| [Other functions](/api/pipeline-functions)      | Automatically apply something to the `Doc`, e.g. to merge spans of tokens.                  |
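
As an example of a component from the table that needs no trained weights, the sketch below configures an `EntityRuler` on a blank English pipeline. The patterns and the example sentence are illustrative assumptions.

```python
import spacy

nlp = spacy.blank("en")

# add_pipe returns the component instance, which can then be configured.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        # Exact phrase match
        {"label": "ORG", "pattern": "Apple"},
        # Token-based rule
        {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    ]
)

doc = nlp("Apple is opening its first big office in San Francisco.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
```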

### Matchers {#architecture-matchers}

Matchers help you find and extract information from [`Doc`](/api/doc) objects
based on match patterns describing the sequences you're looking for. A matcher
operates on a `Doc` and gives you access to the matched tokens **in context**.

| Name                                          | Description                                                                                                                                                                        |
| --------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`DependencyMatcher`](/api/dependencymatcher) | Match sequences of tokens based on dependency trees using [Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). |
| [`Matcher`](/api/matcher)                     | Match sequences of tokens, based on pattern rules, similar to regular expressions.                                                                                                 |
| [`PhraseMatcher`](/api/phrasematcher)         | Match sequences of tokens based on phrases.                                                                                                                                        |

### Other classes {#architecture-other}

| Name                                             | Description                                                                                        |
| ------------------------------------------------ | -------------------------------------------------------------------------------------------------- |
| [`Corpus`](/api/corpus)                          | Class for managing annotated corpora for training and evaluation data.                             |
| [`KnowledgeBase`](/api/kb)                       | Storage for entities and aliases of a knowledge base for entity linking.                           |
| [`Lookups`](/api/lookups)                        | Container for convenient access to large lookup tables and dictionaries.                           |
| [`MorphAnalysis`](/api/morphology#morphanalysis) | A morphological analysis.                                                                          |
| [`Morphology`](/api/morphology)                  | Store morphological analyses and map them to and from hash values.                                 |
| [`Scorer`](/api/scorer)                          | Compute evaluation scores.                                                                         |
| [`StringStore`](/api/stringstore)                | Map strings to and from hash values.                                                               |
| [`Vectors`](/api/vectors)                        | Container class for vector data keyed by string.                                                   |
| [`Vocab`](/api/vocab)                            | The shared vocabulary that stores strings and gives you access to [`Lexeme`](/api/lexeme) objects. |
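
To show how `Vocab`, `StringStore` and `Lexeme` fit together, here is a minimal sketch on a blank English pipeline. The word `coffee` is just an example.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")

# The StringStore maps strings to hash values and back.
coffee_hash = nlp.vocab.strings["coffee"]
round_trip = nlp.vocab.strings[coffee_hash]

# Looking up a string in the Vocab returns a context-independent Lexeme.
lexeme = nlp.vocab["coffee"]
is_alpha = lexeme.is_alpha

# The token in the Doc and the Lexeme refer to the same vocabulary entry.
same_entry = doc[2].orth == lexeme.orth == coffee_hash
```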