The central data structures in spaCy are the [`Language`](/api/language) class,
the [`Vocab`](/api/vocab) and the [`Doc`](/api/doc) object. The `Language` class
is used to process a text and turn it into a `Doc` object. It's typically stored
as a variable called `nlp`. The `Doc` object owns the **sequence of tokens** and
all their annotations. By centralizing strings, word vectors and lexical
attributes in the `Vocab`, we avoid storing multiple copies of this data. This
saves memory, and ensures there's a **single source of truth**.
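The relationship between the three objects can be illustrated with a minimal sketch. It assumes spaCy v3 or later is installed and uses a blank English pipeline, so no trained model needs to be downloaded. The example text is arbitrary.

```python
import spacy

# A blank pipeline contains only the tokenizer, which is enough
# to turn text into a Doc.
nlp = spacy.blank("en")
doc = nlp("Give it back!")

# The Doc owns the sequence of tokens.
tokens = [token.text for token in doc]
print(tokens)

# The Doc does not keep its own copy of the vocabulary: it holds a
# reference to the same Vocab object as the Language instance.
shares_vocab = doc.vocab is nlp.vocab
print(shares_vocab)
```

Because every `Doc` created by the same `nlp` object refers to one `Vocab`, strings and lexical attributes are stored once rather than once per document.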
Text annotations are also designed to allow a single source of truth: the `Doc`
object owns the data, and [`Span`](/api/span) and [`Token`](/api/token) are
**views that point into it**. The `Doc` object is constructed by the
[`Tokenizer`](/api/tokenizer), and then **modified in place** by the components
of the pipeline. The `Language` object coordinates these components. It takes
raw text and sends it through the pipeline, returning an **annotated document**.
It also orchestrates training and serialization.
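As a rough illustration of "views" and "modified in place", the sketch below calls the tokenizer and the pipeline components by hand instead of calling `nlp(text)`. It assumes spaCy v3 or later and uses the rule-based `sentencizer` so that no trained weights are needed. Running the components in a manual loop is only for demonstration and skips things that `nlp(text)` also takes care of, such as error handling.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based, needs no trained weights

text = "Hello world. This is a sentence."
doc = nlp.tokenizer(text)  # the tokenizer constructs the Doc

span = doc[0:2]   # Span: a view onto the first two tokens
token = doc[1]    # Token: a view onto a single position

# Apply each component in order, roughly what nlp(text) does internally.
processed = doc
for name, component in nlp.pipeline:
    processed = component(processed)

# Span and Token point back to the Doc that owns the data.
print(span.doc is doc, token.doc is doc)

# The component annotated the same object rather than creating a copy.
print(processed is doc)
sentences = [sent.text for sent in doc.sents]
print(sentences)
```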
![Library architecture](../../images/architecture.svg)
### Container objects {#architecture-containers}
| Name                          | Description |
| ----------------------------- | ----------- |
| [`Doc`](/api/doc)             | A container for accessing linguistic annotations. |
| [`DocBin`](/api/docbin)       | A collection of `Doc` objects for efficient binary serialization. Also used for [training data](/api/data-formats#binary-training). |
| [`Example`](/api/example)     | A collection of training annotations, containing two `Doc` objects: the reference data and the predictions. |
| [`Language`](/api/language)   | Processing class that turns text into `Doc` objects. Different languages implement their own subclasses of it. The variable is typically called `nlp`. |
| [`Lexeme`](/api/lexeme)       | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
| [`Span`](/api/span)           | A slice from a `Doc` object. |
| [`SpanGroup`](/api/spangroup) | A named collection of spans belonging to a `Doc`. |
| [`Token`](/api/token)         | An individual token, i.e. a word, punctuation symbol, whitespace, etc. |
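A few of the containers in the table can be tried out together in a short sketch. It assumes spaCy v3 or later. The example sentence and the span group key `"fruit"` are made up for illustration.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("I like green apples")

# Span: a slice of the Doc. Token: a single position in the Doc.
span = doc[2:4]
token = doc[3]

# SpanGroup: a named collection of spans stored on the Doc.
# "fruit" is a hypothetical key chosen for this example.
doc.spans["fruit"] = [span]

# Lexeme: the context-free vocabulary entry behind a token.
lexeme = nlp.vocab["apples"]
same_word_type = lexeme.orth == token.orth

# DocBin: serialize one or more Doc objects to bytes and restore them.
doc_bin = DocBin()
doc_bin.add(doc)
data = doc_bin.to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))

print(span.text, len(doc.spans["fruit"]), same_word_type)
print(restored[0].text)
```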
### Processing pipeline {#architecture-pipeline}
The processing pipeline consists of one or more **pipeline components** that are
called on the `Doc` in order. The tokenizer runs before the components. Pipeline
components can be added using [`Language.add_pipe`](/api/language#add_pipe).
They can contain a statistical model and trained weights, or only make
rule-based modifications to the `Doc`. spaCy provides a range of built-in
components for different language processing tasks and also allows adding
[custom components](/usage/processing-pipelines#custom-components).

![The processing pipeline](../../images/pipeline.svg)

| Component name         | Component class                                      | Description |
| ---------------------- | ---------------------------------------------------- | ----------- |
| `attribute_ruler`      | [`AttributeRuler`](/api/attributeruler)              | Set token attributes using matcher rules. |
| `entity_linker`        | [`EntityLinker`](/api/entitylinker)                  | Disambiguate named entities to nodes in a knowledge base. |
| `entity_ruler`         | [`SpanRuler`](/api/spanruler)                        | Add entity spans to the `Doc` using token-based rules or exact phrase matches. |
| `lemmatizer`           | [`Lemmatizer`](/api/lemmatizer)                      | Determine the base forms of words using rules and lookups. |
| `morphologizer`        | [`Morphologizer`](/api/morphologizer)                | Predict morphological features and coarse-grained part-of-speech tags. |
| `ner`                  | [`EntityRecognizer`](/api/entityrecognizer)          | Predict named entities, e.g. persons or products. |
| `parser`               | [`DependencyParser`](/api/dependencyparser)          | Predict syntactic dependencies. |
| `senter`               | [`SentenceRecognizer`](/api/sentencerecognizer)      | Predict sentence boundaries. |
| `sentencizer`          | [`Sentencizer`](/api/sentencizer)                    | Implement rule-based sentence boundary detection that doesn't require the dependency parse. |
| `span_ruler`           | [`SpanRuler`](/api/spanruler)                        | Add spans to the `Doc` using token-based rules or exact phrase matches. |
| `tagger`               | [`Tagger`](/api/tagger)                              | Predict part-of-speech tags. |
| `textcat`              | [`TextCategorizer`](/api/textcategorizer)            | Predict exactly one category or label over a whole document. |
| `textcat_multilabel`   | [`MultiLabel_TextCategorizer`](/api/textcategorizer) | Predict 0, 1 or more categories or labels over a whole document. |
| `tok2vec`              | [`Tok2Vec`](/api/tok2vec)                            | Apply a "token-to-vector" model and set its outputs. |
| `tokenizer`            | [`Tokenizer`](/api/tokenizer)                        | Segment raw text and create `Doc` objects from the words. |
| `trainable_lemmatizer` | [`EditTreeLemmatizer`](/api/edittreelemmatizer)      | Predict base forms of words. |
| `transformer`          | [`Transformer`](/api/transformer)                    | Use a transformer model and set its outputs. |
| -                      | [`TrainablePipe`](/api/pipe)                         | Class that all trainable pipeline components inherit from. |
| -                      | [Other functions](/api/pipeline-functions)           | Automatically apply something to the `Doc`, e.g. to merge spans of tokens. |
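The following sketch shows one way to assemble a small pipeline from a built-in component and a custom one. It assumes spaCy v3 or later. The component name `token_counter` and the `user_data` key `n_tokens` are hypothetical and chosen only for this example.

```python
import spacy
from spacy.language import Language


# A custom, rule-based component: it receives the Doc, modifies it
# and returns it. "token_counter" is a hypothetical name.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")               # built-in, rule-based
nlp.add_pipe("token_counter", last=True)  # custom component, added by name

doc = nlp("This is one sentence. This is another.")

print(nlp.pipe_names)
print([sent.text for sent in doc.sents])
print(doc.user_data["n_tokens"])
```

Components are referred to by their string names in `add_pipe`, which is why the custom function has to be registered with the `@Language.component` decorator first.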
### Matchers {#architecture-matchers}
Matchers help you find and extract information from [`Doc`](/api/doc) objects
based on match patterns describing the sequences you're looking for. A matcher
operates on a `Doc` and gives you access to the matched tokens **in context**.

| Name                                          | Description |
| --------------------------------------------- | ----------- |
| [`DependencyMatcher`](/api/dependencymatcher) | Match sequences of tokens based on dependency trees using [Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). |
| [`Matcher`](/api/matcher)                     | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
| [`PhraseMatcher`](/api/phrasematcher)         | Match sequences of tokens based on phrases. |
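A token-based `Matcher` can be sketched as follows, assuming spaCy v3 or later. The rule name `HelloWorld`, the pattern and the example text are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "hello", then any punctuation, then "world".
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])  # "HelloWorld" is a hypothetical rule name

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)

# Each match is (match_id, start, end). Slicing the Doc with the
# token offsets gives the matched tokens in context as a Span.
matched = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matches
]
print(matched)
```

The second "Hello world" is not matched because the pattern requires a punctuation token between the two words.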
### Other classes {#architecture-other}
| Name                                             | Description |
| ------------------------------------------------ | ----------- |
| [`Corpus`](/api/corpus)                          | Class for managing annotated corpora for training and evaluation data. |
| [`KnowledgeBase`](/api/kb)                       | Abstract base class for storage and retrieval of data for entity linking. |
| [`InMemoryLookupKB`](/api/kb_in_memory)          | Implementation of `KnowledgeBase` storing all data in memory. |
| [`Candidate`](/api/kb#candidate)                 | Object associating a textual mention with a specific entity contained in a `KnowledgeBase`. |
| [`Lookups`](/api/lookups)                        | Container for convenient access to large lookup tables and dictionaries. |
| [`MorphAnalysis`](/api/morphology#morphanalysis) | A morphological analysis. |
| [`Morphology`](/api/morphology)                  | Store morphological analyses and map them to and from hash values. |
| [`Scorer`](/api/scorer)                          | Compute evaluation scores. |
| [`StringStore`](/api/stringstore)                | Map strings to and from hash values. |
| [`Vectors`](/api/vectors)                        | Container class for vector data keyed by string. |
| [`Vocab`](/api/vocab)                            | The shared vocabulary that stores strings and gives you access to [`Lexeme`](/api/lexeme) objects. |
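Two of the simpler classes in this table, `StringStore` and `Lookups`, can be used on their own. The sketch below assumes spaCy v3 or later. The table name `lemma_exc_demo` and its contents are made up for illustration.

```python
from spacy.lookups import Lookups
from spacy.strings import StringStore

# StringStore: map strings to hash values and back.
store = StringStore(["apple", "orange"])
apple_hash = store["apple"]         # string -> hash (an integer)
apple_text = store[apple_hash]      # hash -> string
print(isinstance(apple_hash, int), apple_text)

# Lookups: a container for named lookup tables.
# "lemma_exc_demo" is a hypothetical table name.
lookups = Lookups()
lookups.add_table("lemma_exc_demo", {"mice": "mouse"})
table = lookups.get_table("lemma_exc_demo")
print(lookups.has_table("lemma_exc_demo"), table["mice"])
```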