spaCy/website/docs/usage/101/_architecture.md

The central data structures in spaCy are the `Doc` and the `Vocab`. The `Doc`
object owns the **sequence of tokens** and all their annotations. The `Vocab`
object owns a set of **look-up tables** that make common information available
across documents. By centralizing strings, word vectors and lexical attributes,
we avoid storing multiple copies of this data. This saves memory, and ensures
there's a **single source of truth**.
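
To make the memory argument concrete, here is a minimal toy sketch of the idea, not spaCy's actual implementation. The names `ToyStringStore` and `ToyDoc` are hypothetical, and `hashlib.blake2b` stands in for whatever hash function the library uses internally. Each distinct string is stored once in a shared table, and documents hold only integer IDs.

```python
import hashlib


class ToyStringStore:
    """Toy stand-in for a shared string table: maps strings to integer IDs and back."""

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Derive a stable 64-bit ID from the text. The choice of blake2b is an
        # assumption made for this sketch only.
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_id[key] = text  # each distinct string is stored exactly once
        return key

    def __getitem__(self, key):
        return self._by_id[key]

    def __len__(self):
        return len(self._by_id)


class ToyDoc:
    """Holds only integer IDs; the text itself lives in the shared store."""

    def __init__(self, store, words):
        self.store = store
        self.word_ids = [store.add(word) for word in words]

    def words(self):
        return [self.store[word_id] for word_id in self.word_ids]


store = ToyStringStore()
doc1 = ToyDoc(store, ["I", "love", "coffee"])
doc2 = ToyDoc(store, ["I", "love", "tea"])

print(len(store))                              # 4 distinct strings, not 6
print(doc1.word_ids[:2] == doc2.word_ids[:2])  # True: shared words get the same IDs
print(doc2.words())                            # ['I', 'love', 'tea']
```

Because both documents resolve their IDs through the same store, the strings exist in one place only.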

Text annotations are also designed to allow a single source of truth: the `Doc`
object owns the data, and `Span` and `Token` are **views that point into it**.
The `Doc` object is constructed by the `Tokenizer`, and then **modified in
place** by the components of the pipeline. The `Language` object coordinates
these components. It takes raw text and sends it through the pipeline, returning
an **annotated document**. It also orchestrates training and serialization.
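
The "views that point into it" design can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's real classes: `ToyDoc`, `ToyToken` and `ToySpan` are hypothetical names, and the document's data is modelled as two parallel lists.

```python
class ToyDoc:
    """Owns the token data as parallel lists, one entry per token."""

    def __init__(self, words):
        self.words = list(words)
        self.tags = [""] * len(self.words)  # filled in later by a component

    def __len__(self):
        return len(self.words)

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, _ = key.indices(len(self))
            return ToySpan(self, start, stop)
        if key < 0:
            key += len(self)
        return ToyToken(self, key)


class ToyToken:
    """A view: stores only a reference to the doc and a position."""

    def __init__(self, doc, i):
        self.doc = doc
        self.i = i

    @property
    def text(self):
        return self.doc.words[self.i]

    @property
    def tag(self):
        return self.doc.tags[self.i]


class ToySpan:
    """A view over a contiguous range of tokens in the doc."""

    def __init__(self, doc, start, end):
        self.doc, self.start, self.end = doc, start, end

    def __iter__(self):
        return (ToyToken(self.doc, i) for i in range(self.start, self.end))

    @property
    def text(self):
        return " ".join(self.doc.words[self.start:self.end])


doc = ToyDoc(["Apple", "buys", "startup"])
token = doc[0]   # view created before any annotation exists
span = doc[1:3]

doc.tags[0] = "PROPN"  # in-place modification, as a pipeline component would do

print(token.tag)  # PROPN: the existing view sees the new annotation
print(span.text)  # buys startup
```

Since the views copy nothing, an annotation written to the document is immediately visible through every token and span that already refers to it.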

![Library architecture](../../images/architecture.svg)

### Container objects {#architecture-containers}

| Name                    | Description                                                                                                                                             |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`Doc`](/api/doc)       | A container for accessing linguistic annotations.                                                                                                       |
| [`Span`](/api/span)     | A slice from a `Doc` object.                                                                                                                            |
| [`Token`](/api/token)   | An individual token — i.e. a word, punctuation symbol, whitespace, etc.                                                                                 |
| [`Lexeme`](/api/lexeme) | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |

### Processing pipeline {#architecture-pipeline}

| Name                                        | Description                                                                                                                   |
| ------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| [`Language`](/api/language)                 | A text-processing pipeline. Usually you'll load this once per process as `nlp` and pass the instance around your application. |
| [`Tokenizer`](/api/tokenizer)               | Segment text, and create `Doc` objects with the discovered segment boundaries.                                                |
| [`Lemmatizer`](/api/lemmatizer)             | Determine the base forms of words.                                                                                            |
| `Morphology`                                | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag.              |
| [`Tagger`](/api/tagger)                     | Annotate part-of-speech tags on `Doc` objects.                                                                                |
| [`DependencyParser`](/api/dependencyparser) | Annotate syntactic dependencies on `Doc` objects.                                                                             |
| [`EntityRecognizer`](/api/entityrecognizer) | Annotate named entities, e.g. persons or products, on `Doc` objects.                                                          |
| [`TextCategorizer`](/api/textcategorizer)   | Assign categories or labels to `Doc` objects.                                                                                 |
| [`Matcher`](/api/matcher)                   | Match sequences of tokens, based on pattern rules, similar to regular expressions.                                            |
| [`PhraseMatcher`](/api/phrasematcher)       | Match sequences of tokens based on phrases.                                                                                   |
| [`EntityRuler`](/api/entityruler)           | Add entity spans to the `Doc` using token-based rules or exact phrase matches.                                                |
| [`Sentencizer`](/api/sentencizer)           | Implement custom sentence boundary detection logic that doesn't require the dependency parse.                                 |
| [Other functions](/api/pipeline-functions)  | Automatically apply something to the `Doc`, e.g. to merge spans of tokens.                                                    |
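
How these pieces fit together can be shown with a toy sketch. The assumption here is that the pipeline is an ordered list of named callables, each of which receives a document and returns it. All names below (`ToyLanguage`, `toy_tokenizer`, `toy_sentencizer`) are hypothetical stand-ins, and the whitespace tokenizer and punctuation rule are deliberately naive.

```python
class ToyDoc:
    """Minimal document: words plus one sentence-start flag per word."""

    def __init__(self, words):
        self.words = words
        self.sent_starts = [False] * len(words)


def toy_tokenizer(text):
    # Whitespace splitting only; a real tokenizer handles far more cases.
    return ToyDoc(text.split())


def toy_sentencizer(doc):
    """Mark sentence starts from punctuation alone, without a dependency parse."""
    start_next = True
    for i, word in enumerate(doc.words):
        doc.sent_starts[i] = start_next
        start_next = word in {".", "!", "?"}
    return doc


class ToyLanguage:
    """Coordinates the tokenizer and an ordered list of (name, component) pairs."""

    def __init__(self, tokenizer, components):
        self.tokenizer = tokenizer
        self.pipeline = list(components)

    def __call__(self, text):
        doc = self.tokenizer(text)       # the tokenizer constructs the doc
        for _name, component in self.pipeline:
            doc = component(doc)         # each component annotates it in place
        return doc


nlp = ToyLanguage(toy_tokenizer, [("sentencizer", toy_sentencizer)])
doc = nlp("Hello world . How are you ?")

print([w for w, is_start in zip(doc.words, doc.sent_starts) if is_start])
# ['Hello', 'How']
```

Keeping every component to the same "document in, document out" shape is what lets components be added, removed or reordered without changing the object that runs them.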

### Other classes {#architecture-other}

| Name                              | Description                                                                                                   |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| [`Vocab`](/api/vocab)             | A lookup table for the vocabulary that allows you to access `Lexeme` objects.                                 |
| [`StringStore`](/api/stringstore) | Map strings to and from hash values.                                                                          |
| [`Vectors`](/api/vectors)         | Container class for vector data keyed by string.                                                              |
| [`GoldParse`](/api/goldparse)     | Collection for training annotations.                                                                          |
| [`GoldCorpus`](/api/goldcorpus)   | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. |