---
title: Language Processing Pipelines
next: /usage/embeddings-transformers
menu:
  - ['Processing Text', 'processing']
  - ['Pipelines & Components', 'pipelines']
  - ['Custom Components', 'custom-components']
  - ['Component Data', 'component-data']
  - ['Type Hints & Validation', 'type-hints']
  - ['Trainable Components', 'trainable-components']
  - ['Extension Attributes', 'custom-components-attributes']
  - ['Plugins & Wrappers', 'plugins']
---

import Pipelines101 from 'usage/101/\_pipelines.md'

<Pipelines101 />

## Processing text {#processing}

When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. It then returns the processed `Doc` that you
can work with.

```python
doc = nlp("This is a text")
```

When processing large volumes of text, the statistical models are usually more
efficient if you let them work on batches of texts. spaCy's
[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
processed `Doc` objects. The batching is done internally.

```diff
texts = ["This is a text", "These are lots of texts", "..."]
- docs = [nlp(text) for text in texts]
+ docs = list(nlp.pipe(texts))
```
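
As a quick sanity check, the following sketch compares both call styles on a blank English pipeline. It assumes only that spaCy is installed: `spacy.blank("en")` creates a tokenizer-only pipeline, so no trained package is needed, and the `batch_size` value is picked arbitrarily for illustration.

```python
import spacy

# A blank pipeline only tokenizes, which is enough to compare the call styles
nlp = spacy.blank("en")
texts = ["This is a text", "These are lots of texts", "..."]

docs_one_by_one = [nlp(text) for text in texts]
docs_streamed = list(nlp.pipe(texts, batch_size=2))

# Both approaches produce one Doc per text, in input order
for single, streamed in zip(docs_one_by_one, docs_streamed):
    assert single.text == streamed.text
print(len(docs_streamed))
```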

<Infobox title="Tips for efficient processing" emoji="💡">

- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
  buffer them in batches, instead of one-by-one. This is usually much more
  efficient.
- Only apply the **pipeline components you need**. Getting predictions from the
  model that you don't actually need adds up and becomes very inefficient at
  scale. To prevent this, use the `disable` keyword argument to disable
  components you don't need – either when loading a pipeline, or during
  processing with `nlp.pipe`. See the section on
  [disabling pipeline components](#disabling) for more details and examples.

</Infobox>

In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
(potentially very large) iterable of texts as a stream. Because we're only
accessing the named entities in `doc.ents` (set by the `ner` component), we'll
disable all other components during processing. `nlp.pipe` yields `Doc` objects,
so we can iterate over them and access the named entity predictions:

> #### ✏️ Things to try
>
> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
>    empty, because the entity recognizer didn't run.

```python
### {executable="true"}
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
```

<Infobox title="Important note" variant="warning">

When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
[generator](https://realpython.com/introduction-to-python-generators/) that
yields `Doc` objects – not a list. So if you want to use it like a list, you'll
have to call `list()` on it first:

```diff
- docs = nlp.pipe(texts)[0]         # will raise an error
+ docs = list(nlp.pipe(texts))[0]   # works as expected
```

</Infobox>
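
Because the result is a generator, you can also consume only part of the stream. The sketch below uses a blank pipeline and made-up texts, and takes the first two docs with `itertools.islice` from the standard library before reading the rest.

```python
import itertools
import types

import spacy

nlp = spacy.blank("en")
texts = ["First text.", "Second text.", "Third text."]

docs = nlp.pipe(texts)
print(isinstance(docs, types.GeneratorType))

# Consume the first two docs from the stream, then whatever is left
first_two = list(itertools.islice(docs, 2))
remaining = list(docs)
print([doc.text for doc in first_two], [doc.text for doc in remaining])
```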

You can use the `as_tuples` option to pass additional context along with each
doc when using [`nlp.pipe`](/api/language#pipe). If `as_tuples` is `True`, then
the input should be a sequence of `(text, context)` tuples and the output will
be a sequence of `(doc, context)` tuples. For example, you can pass metadata in
the context and save it in a [custom attribute](#custom-components-attributes):

```python
### {executable="true"}
import spacy
from spacy.tokens import Doc

if not Doc.has_extension("text_id"):
    Doc.set_extension("text_id", default=None)

text_tuples = [
    ("This is the first text.", {"text_id": "text1"}),
    ("This is the second text.", {"text_id": "text2"})
]

nlp = spacy.load("en_core_web_sm")
doc_tuples = nlp.pipe(text_tuples, as_tuples=True)

docs = []
for doc, context in doc_tuples:
    doc._.text_id = context["text_id"]
    docs.append(doc)

for doc in docs:
    print(f"{doc._.text_id}: {doc.text}")
```

### Multiprocessing {#multiprocessing}

spaCy includes built-in support for multiprocessing with
[`nlp.pipe`](/api/language#pipe) using the `n_process` option:

```python
# Multiprocessing with 4 processes
docs = nlp.pipe(texts, n_process=4)

# With as many processes as CPUs (use with caution!)
docs = nlp.pipe(texts, n_process=-1)
```

Depending on your platform, starting many processes with multiprocessing can add
a lot of overhead. In particular, the default start method `spawn` used in
macOS/OS X (as of Python 3.8) and in Windows can be slow for larger models
because the model data is copied in memory for each new process. See the
[Python docs on multiprocessing](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods)
for further details.

For shorter tasks and in particular with `spawn`, it can be faster to use a
smaller number of processes with a larger batch size. The optimal `batch_size`
setting will depend on the pipeline components, the length of your documents,
the number of processes and how much memory is available.

```python
# Default batch size is `nlp.batch_size` (typically 1000)
docs = nlp.pipe(texts, n_process=2, batch_size=2000)
```

<Infobox title="Multiprocessing on GPU" variant="warning">

Multiprocessing is not generally recommended on GPU because RAM is too limited.
If you want to try it out, be aware that it is only possible using `spawn` due
to limitations in CUDA.

</Infobox>

<Infobox title="Multiprocessing with transformer models" variant="warning">

In Linux, transformer models may hang or deadlock with multiprocessing due to an
[issue in PyTorch](https://github.com/pytorch/pytorch/issues/17199). One
suggested workaround is to use `spawn` instead of `fork` and another is to limit
the number of threads before loading any models using
`torch.set_num_threads(1)`.

</Infobox>

## Pipelines and built-in components {#pipelines}

spaCy makes it very easy to create your own pipelines consisting of reusable
components – this includes spaCy's default tagger, parser and entity recognizer,
but also your own custom processing functions. A pipeline component can be added
to an already existing `nlp` object, specified when initializing a
[`Language`](/api/language) class, or defined within a
[pipeline package](/usage/saving-loading#models).

> #### config.cfg (excerpt)
>
> ```ini
> [nlp]
> lang = "en"
> pipeline = ["tok2vec", "parser"]
>
> [components]
>
> [components.tok2vec]
> factory = "tok2vec"
> # Settings for the tok2vec component
>
> [components.parser]
> factory = "parser"
> # Settings for the parser component
> ```

When you load a pipeline, spaCy first consults the
[`meta.json`](/usage/saving-loading#models) and
[`config.cfg`](/usage/training#config). The config tells spaCy what language
class to use, which components are in the pipeline, and how those components
should be created. spaCy will then do the following:

1. Load the **language class and data** for the given ID via
   [`get_lang_class`](/api/top-level#util.get_lang_class) and initialize it. The
   `Language` class contains the shared vocabulary, tokenization rules and the
   language-specific settings.
2. Iterate over the **pipeline names** and look up each component name in the
   `[components]` block. The `factory` tells spaCy which
   [component factory](#custom-components-factories) to use for adding the
   component with [`add_pipe`](/api/language#add_pipe). The settings are passed
   into the factory.
3. Make the **model data** available to the `Language` class by calling
   [`from_disk`](/api/language#from_disk) with the path to the data directory.

So when you call this...

```python
nlp = spacy.load("en_core_web_sm")
```

... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the
pipeline
`["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]`. spaCy
will then initialize `spacy.lang.en.English`, create each pipeline component and
add it to the processing pipeline. It'll then load in the model data from the
data directory and return the resulting `Language` object for you to use as
`nlp`.

<Infobox title="Changed in v3.0" variant="warning">

spaCy v3.0 introduces a `config.cfg`, which includes more detailed settings for
the pipeline, its components and the [training process](/usage/training#config).
You can export the config of your current `nlp` object by calling
[`nlp.config.to_disk`](/api/language#config).

</Infobox>
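
Here's a minimal sketch of exporting the config, assuming a blank pipeline with a single rule-based component. The temporary directory and the file name `config.cfg` are chosen for illustration only.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.cfg"
    nlp.config.to_disk(config_path)
    config_text = config_path.read_text(encoding="utf8")

# The exported config lists the pipeline and has one block per component
print(nlp.config["nlp"]["pipeline"])
print("[components.sentencizer]" in config_text)
```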

Fundamentally, a [spaCy pipeline package](/models) consists of three components:
**the weights**, i.e. binary data loaded in from a directory, a **pipeline** of
functions called in order, and **language data** like the tokenization rules and
language-specific settings. For example, a Spanish NER pipeline requires
different weights, language data and components than an English parsing and
tagging pipeline. This is also why the pipeline state is always held by the
`Language` class. [`spacy.load`](/api/top-level#spacy.load) puts this all
together and returns an instance of `Language` with a pipeline set and access to
the binary data:

```python
### spacy.load under the hood
lang = "en"
pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-3.0.0"

cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
nlp = cls()                            # 2. Initialize it
for name in pipeline:
    nlp.add_pipe(name)                 # 3. Add the component to the pipeline
nlp.from_disk(data_path)               # 4. Load in the binary data
```

When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. Since the model data is loaded, the
components can access it to assign annotations to the `Doc` object, and
subsequently to the `Token` and `Span` which are only views of the `Doc`, and
don't own any data themselves. All components return the modified document,
which is then processed by the next component in the pipeline.

```python
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence")  # Create a Doc from raw text
for name, proc in nlp.pipeline:           # Iterate over components in order
    doc = proc(doc)                       # Apply each component
```
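
To make this loop concrete, the sketch below runs it on a blank pipeline with only the rule-based `sentencizer` and compares the result to calling `nlp` directly. The example text is made up, and for this simple setup both ways should produce the same sentences.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
text = "This is a sentence. This is another one."

# Manual version: tokenize, then apply each component in order
doc_manual = nlp.make_doc(text)
for name, proc in nlp.pipeline:
    doc_manual = proc(doc_manual)

# Calling nlp does the same in one step
doc_auto = nlp(text)

manual_sents = [sent.text for sent in doc_manual.sents]
auto_sents = [sent.text for sent in doc_auto.sents]
print(manual_sents == auto_sents)
```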

The current processing pipeline is available as `nlp.pipeline`, which returns a
list of `(name, component)` tuples, or `nlp.pipe_names`, which only returns a
list of human-readable component names.

```python
print(nlp.pipeline)
# [('tok2vec', <spacy.pipeline.Tok2Vec>), ('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>), ('attribute_ruler', <spacy.pipeline.AttributeRuler>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer>)]
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
```

### Built-in pipeline components {#built-in}

spaCy ships with several built-in pipeline components that are registered with
string names. This means that you can initialize them by calling
[`nlp.add_pipe`](/api/language#add_pipe) with their names and spaCy will know
how to create them. See the [API documentation](/api) for a full list of
available pipeline components and component functions.

> #### Usage
>
> ```python
> nlp = spacy.blank("en")
> nlp.add_pipe("sentencizer")
> # add_pipe returns the added component
> ruler = nlp.add_pipe("entity_ruler")
> ```

| String name          | Component                                            | Description                                                                               |
| -------------------- | ---------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `tagger`             | [`Tagger`](/api/tagger)                              | Assign part-of-speech-tags.                                                               |
| `parser`             | [`DependencyParser`](/api/dependencyparser)          | Assign dependency labels.                                                                 |
| `ner`                | [`EntityRecognizer`](/api/entityrecognizer)          | Assign named entities.                                                                    |
| `entity_linker`      | [`EntityLinker`](/api/entitylinker)                  | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
| `entity_ruler`       | [`EntityRuler`](/api/entityruler)                    | Assign named entities based on pattern rules and dictionaries.                            |
| `textcat`            | [`TextCategorizer`](/api/textcategorizer)            | Assign text categories: exactly one category is predicted per document.                   |
| `textcat_multilabel` | [`MultiLabel_TextCategorizer`](/api/textcategorizer) | Assign text categories in a multi-label setting: zero, one or more labels per document.   |
| `lemmatizer`         | [`Lemmatizer`](/api/lemmatizer)                      | Assign base forms to words.                                                               |
| `morphologizer`      | [`Morphologizer`](/api/morphologizer)                | Assign morphological features and coarse-grained POS tags.                                |
| `attribute_ruler`    | [`AttributeRuler`](/api/attributeruler)              | Assign token attribute mappings and rule-based exceptions.                                |
| `senter`             | [`SentenceRecognizer`](/api/sentencerecognizer)      | Assign sentence boundaries.                                                               |
| `sentencizer`        | [`Sentencizer`](/api/sentencizer)                    | Add rule-based sentence segmentation without the dependency parse.                        |
| `tok2vec`            | [`Tok2Vec`](/api/tok2vec)                            | Assign token-to-vector embeddings.                                                        |
| `transformer`        | [`Transformer`](/api/transformer)                    | Assign the tokens and outputs of a transformer model.                                     |
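
Two of the rule-based components in the table work without any trained weights, so they're handy for trying things out. The sketch below combines them in a blank pipeline. The pattern and the example text are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# add_pipe returns the component, so patterns can be added right away
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])

doc = nlp("Explosion makes spaCy. It is based in Berlin.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
sentences = [sent.text for sent in doc.sents]
print(entities)
print(sentences)
```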

### Disabling, excluding and modifying components {#disabling}

If you don't need a particular component of the pipeline – for example, the
tagger or the parser – you can **disable or exclude** it. This can sometimes make
a big difference and improve loading and inference speed. There are two
different mechanisms you can use:

1. **Disable:** The component and its data will be loaded with the pipeline, but
   it will be disabled by default and not run as part of the processing
   pipeline. To run it, you can explicitly enable it by calling
   [`nlp.enable_pipe`](/api/language#enable_pipe). When you save out the `nlp`
   object, the disabled component will be included but disabled by default.
2. **Exclude:** Don't load the component and its data with the pipeline. Once
   the pipeline is loaded, there will be no reference to the excluded component.

Disabled and excluded component names can be provided to
[`spacy.load`](/api/top-level#spacy.load) as a list.

> #### 💡 Optional pipeline components
>
> The `disable` mechanism makes it easy to distribute pipeline packages with
> optional components that you can enable or disable at runtime. For instance,
> your pipeline may include a statistical _and_ a rule-based component for
> sentence segmentation, and you can choose which one to run depending on your
> use case.
>
> For example, spaCy's [trained pipelines](/models) like
> [`en_core_web_sm`](/models/en#en_core_web_sm) contain both a `parser` and
> `senter` that perform sentence segmentation, but the `senter` is disabled by
> default.

```python
# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])

# Load the tagger and parser but don't enable them
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
# Explicitly enable the tagger later on
nlp.enable_pipe("tagger")
```
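
The difference between the two mechanisms is easiest to see by inspecting the loaded pipelines. The following sketch assumes no trained package is installed: it first saves a small rule-based pipeline to a temporary directory and then loads it back from that path with `disable` and `exclude`. The pattern and text are made up for illustration.

```python
import tempfile

import spacy

# Build and save a small rule-based pipeline so the example needs no download
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    nlp_disabled = spacy.load(tmp_dir, disable=["entity_ruler"])
    nlp_excluded = spacy.load(tmp_dir, exclude=["entity_ruler"])

# Disabled: still part of the pipeline object, but not run
disabled_active = list(nlp_disabled.pipe_names)
disabled_all = list(nlp_disabled.component_names)
# Excluded: not loaded at all
excluded_all = list(nlp_excluded.component_names)
print(disabled_active, disabled_all, excluded_all)

# The disabled component kept its data and can be switched back on
nlp_disabled.enable_pipe("entity_ruler")
doc = nlp_disabled("Explosion makes spaCy.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```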

<Infobox variant="warning" title="Changed in v3.0">

As of v3.0, the `disable` keyword argument specifies components to load but
disable, instead of components to not load at all. Those components can now be
specified separately using the new `exclude` keyword argument.

</Infobox>

As a shortcut, you can use the [`nlp.select_pipes`](/api/language#select_pipes)
context manager to temporarily disable certain components for a given block. At
the end of the `with` block, the disabled pipeline components will be restored
automatically. Alternatively, `select_pipes` returns an object that lets you
call its `restore()` method to restore the disabled components when needed. This
can be useful if you want to prevent unnecessary code indentation of large
blocks.

```python
### Disable for block
# 1. Use as a context manager
with nlp.select_pipes(disable=["tagger", "parser", "lemmatizer"]):
    doc = nlp("I won't be tagged and parsed")
doc = nlp("I will be tagged and parsed")

# 2. Restore manually
disabled = nlp.select_pipes(disable="ner")
doc = nlp("I won't have named entities")
disabled.restore()
```

If you want to disable all pipes except for one or a few, you can use the
`enable` keyword. Just like the `disable` keyword, it takes a list of pipe
names, or a string defining just one pipe.

```python
# Enable only the parser
with nlp.select_pipes(enable="parser"):
    doc = nlp("I will only be parsed")
```
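
If you want to check what `select_pipes` does without loading a trained pipeline, a sketch along these lines may help. It assumes a blank pipeline with two rule-based components and records `nlp.pipe_names` inside and outside the blocks.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

with nlp.select_pipes(disable=["entity_ruler"]):
    names_with_disable = list(nlp.pipe_names)

with nlp.select_pipes(enable="sentencizer"):
    names_with_enable = list(nlp.pipe_names)

# Outside the blocks, all components are active again
names_restored = list(nlp.pipe_names)
print(names_with_disable, names_with_enable, names_restored)
```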

The [`nlp.pipe`](/api/language#pipe) method also supports a `disable` keyword
argument if you only want to disable components during processing:

```python
for doc in nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"]):
    # Do something with the doc here
    pass
```

Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
to remove pipeline components from an existing pipeline, the
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
[`replace_pipe`](/api/language#replace_pipe) method to replace them with a
custom component entirely (more details on this in the section on
[custom components](#custom-components)).

```python
nlp.remove_pipe("parser")
nlp.rename_pipe("ner", "entityrecognizer")
nlp.replace_pipe("tagger", "my_custom_tagger")
```
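
The snippet above assumes a pipeline that already contains these components and a registered `"my_custom_tagger"`. As a self-contained sketch, the example below does the same on a blank pipeline. The component name `"demo_noop"` and the instance name `"extra"` are hypothetical and only used for illustration.

```python
import spacy
from spacy.language import Language


@Language.component("demo_noop")  # hypothetical component name
def demo_noop(doc):
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")
nlp.add_pipe("demo_noop", name="extra")

# remove_pipe returns the removed (name, component) tuple
removed_name, removed_component = nlp.remove_pipe("extra")
nlp.rename_pipe("entity_ruler", "rules")
# In spaCy v3, the replacement is added under the name of the old component
nlp.replace_pipe("sentencizer", "demo_noop")

print(nlp.pipe_names)
```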

The `Language` object exposes different [attributes](/api/language#attributes)
that let you inspect all available components and the components that currently
run as part of the pipeline.

> #### Example
>
> ```python
> nlp = spacy.blank("en")
> nlp.add_pipe("ner")
> nlp.add_pipe("textcat")
> assert nlp.pipe_names == ["ner", "textcat"]
> nlp.disable_pipe("ner")
> assert nlp.pipe_names == ["textcat"]
> assert nlp.component_names == ["ner", "textcat"]
> assert nlp.disabled == ["ner"]
> ```

| Name                  | Description                                                      |
| --------------------- | ---------------------------------------------------------------- |
| `nlp.pipeline`        | `(name, component)` tuples of the processing pipeline, in order. |
| `nlp.pipe_names`      | Pipeline component names, in order.                              |
| `nlp.components`      | All `(name, component)` tuples, including disabled components.   |
| `nlp.component_names` | All component names, including disabled components.              |
| `nlp.disabled`        | Names of components that are currently disabled.                 |

### Sourcing components from existing pipelines {#sourced-components new="3"}

Pipeline components that are independent can also be reused across pipelines.
Instead of adding a new blank component, you can copy an existing component
from a trained pipeline by setting the `source` argument on
[`nlp.add_pipe`](/api/language#add_pipe). The first argument will then be
interpreted as the name of the component in the source pipeline – for instance,
`"ner"`. This is especially useful for
[training a pipeline](/usage/training#config-components) because it lets you mix
and match components and create fully custom pipeline packages with updated
trained components and new components trained on your data.

<Infobox variant="warning" title="Important note for trained components">

When reusing components across pipelines, keep in mind that the **vocabulary**,
**vectors** and model settings **must match**. If a trained pipeline includes
[word vectors](/usage/linguistic-features#vectors-similarity) and the component
uses them as features, the pipeline you copy it to needs to have the _same_
vectors available – otherwise, it won't be able to make the same predictions.

</Infobox>

> #### In training config
>
> Instead of providing a `factory`, component blocks in the training
> [config](/usage/training#config) can also define a `source`. The string needs
> to be a loadable spaCy pipeline package or path.
>
> ```ini
> [components.ner]
> source = "en_core_web_sm"
> component = "ner"
> ```
>
> By default, sourced components will be updated with your data during training.
> If you want to preserve the component as-is, you can "freeze" it if the
> pipeline is not using a shared `Tok2Vec` layer:
>
> ```ini
> [training]
> frozen_components = ["ner"]
> ```

```python
### {executable="true"}
import spacy

# The source pipeline with different components
source_nlp = spacy.load("en_core_web_sm")
print(source_nlp.pipe_names)

# Add only the entity recognizer to the new blank pipeline
nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
print(nlp.pipe_names)
```
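
The same mechanism can be tried without a trained package. The sketch below sources the rule-based `sentencizer` from one blank pipeline into another. Both pipelines are blank English pipelines here, so the vocabulary and vectors trivially match.

```python
import spacy

# Source pipeline with a rule-based component
source_nlp = spacy.blank("en")
source_nlp.add_pipe("sentencizer")

# Copy the component into a new blank pipeline
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", source=source_nlp)

doc = nlp("This is a sentence. This is another one.")
num_sents = len(list(doc.sents))
print(nlp.pipe_names, num_sents)
```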

### Analyzing pipeline components {#analysis new="3"}

The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the
components in the current pipeline and outputs information about them like the
attributes they set on the [`Doc`](/api/doc) and [`Token`](/api/token), whether
they retokenize the `Doc` and which scores they produce during training. It will
also show warnings if components require values that aren't set by a previous
component – for instance, if the entity linker is used but no component that
runs before it sets named entities. Setting `pretty=True` will pretty-print a
table instead of only returning the structured data.

> #### ✏️ Things to try
>
> 1. Add the components `"ner"` and `"sentencizer"` _before_ the
>    `"entity_linker"`. The analysis should now show no problems, because
>    requirements are met.

```python
### {executable="true"}
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# This is a problem because it needs entities and sentence boundaries
nlp.add_pipe("entity_linker")
analysis = nlp.analyze_pipes(pretty=True)
```

<Accordion title="Example output">

```json
### Structured
{
  "summary": {
    "tagger": {
      "assigns": ["token.tag"],
      "requires": [],
      "scores": ["tag_acc", "pos_acc", "lemma_acc"],
      "retokenizes": false
    },
    "entity_linker": {
      "assigns": ["token.ent_kb_id"],
      "requires": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"],
      "scores": [],
      "retokenizes": false
    }
  },
  "problems": {
    "tagger": [],
    "entity_linker": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"]
  },
  "attrs": {
    "token.ent_iob": { "assigns": [], "requires": ["entity_linker"] },
    "doc.ents": { "assigns": [], "requires": ["entity_linker"] },
    "token.ent_kb_id": { "assigns": ["entity_linker"], "requires": [] },
    "doc.sents": { "assigns": [], "requires": ["entity_linker"] },
    "token.tag": { "assigns": ["tagger"], "requires": [] },
    "token.ent_type": { "assigns": [], "requires": ["entity_linker"] }
  }
}
```

```
### Pretty
============================= Pipeline Overview =============================

#   Component       Assigns           Requires         Scores        Retokenizes
-   -------------   ---------------   --------------   -----------   -----------
0   tagger          token.tag                          tag_acc       False

1   entity_linker   token.ent_kb_id   doc.ents         nel_micro_f   False
                                      doc.sents        nel_micro_r
                                      token.ent_iob    nel_micro_p
                                      token.ent_type


================================ Problems (4) ================================
⚠ 'entity_linker' requirements not met: doc.ents, doc.sents,
token.ent_iob, token.ent_type
```

</Accordion>

<Infobox variant="warning" title="Important note">

The pipeline analysis is static and does **not actually run the components**.
This means that it relies on the information provided by the components
themselves. If a custom component declares that it assigns an attribute but it
doesn't, the pipeline analysis won't catch that.

</Infobox>
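
To illustrate the static nature of the analysis, the sketch below registers two hypothetical components. One declares that it requires `doc.ents`, the other declares that it assigns `doc.ents` but never actually sets it. The component names are made up, and the `assigns` and `requires` settings are passed to the `@Language.component` decorator.

```python
import spacy
from spacy.language import Language


@Language.component("needs_entities_demo", requires=["doc.ents"])
def needs_entities(doc):
    return doc


@Language.component("claims_entities_demo", assigns=["doc.ents"])
def claims_entities(doc):
    # Declares that it assigns doc.ents, but never sets it
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("needs_entities_demo")
problems_before = nlp.analyze_pipes()["problems"]

# Adding a component that only *declares* doc.ents satisfies the analysis
nlp.add_pipe("claims_entities_demo", first=True)
problems_after = nlp.analyze_pipes()["problems"]

print(problems_before)
print(problems_after)
```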

## Creating custom pipeline components {#custom-components}

A pipeline component is a function that receives a `Doc` object, modifies it and
returns it – for example, by using the current weights to make a prediction and
set some annotation on the document. By adding a component to the pipeline,
you'll get access to the `Doc` at any point **during processing** – instead of
only being able to modify it afterwards.

> #### Example
>
> ```python
> from spacy.language import Language
>
> @Language.component("my_component")
> def my_component(doc):
>     # Do something to the doc here
>     return doc
> ```

| Argument    | Type              | Description                                            |
| ----------- | ----------------- | ------------------------------------------------------ |
| `doc`       | [`Doc`](/api/doc) | The `Doc` object processed by the previous component.  |
| **RETURNS** | [`Doc`](/api/doc) | The `Doc` object processed by this pipeline component. |

The [`@Language.component`](/api/language#component) decorator lets you turn a
simple function into a pipeline component. It takes at least one argument, the
**name** of the component factory. You can use this name to add an instance of
your component to the pipeline. It can also be listed in your pipeline config,
so you can save, load and train pipelines using your component.

Custom components can be added to the pipeline using the
[`add_pipe`](/api/language#add_pipe) method. Optionally, you can either specify
a component to add it **before or after**, tell spaCy to add it **first or
last** in the pipeline, or define a **custom name**. If no name is set and no
`name` attribute is present on your component, the function name is used.

> #### Example
>
> ```python
> nlp.add_pipe("my_component")
> nlp.add_pipe("my_component", first=True)
> nlp.add_pipe("my_component", before="parser")
> ```

| Argument | Description                                                                       |
| -------- | --------------------------------------------------------------------------------- |
| `last`   | If set to `True`, component is added **last** in the pipeline (default). ~~bool~~ |
| `first`  | If set to `True`, component is added **first** in the pipeline. ~~bool~~          |
| `before` | String name or index to add the new component **before**. ~~Union[str, int]~~     |
| `after`  | String name or index to add the new component **after**. ~~Union[str, int]~~      |

<Infobox title="Changed in v3.0" variant="warning">

As of v3.0, components need to be registered using the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator so spaCy knows that a
function is a component. [`nlp.add_pipe`](/api/language#add_pipe) now takes the
**string name** of the component factory instead of the component function. This
doesn't only save you lines of code, it also allows spaCy to validate and track
your custom components, and make sure they can be saved and loaded.

```diff
- ruler = nlp.create_pipe("entity_ruler")
- nlp.add_pipe(ruler)
+ ruler = nlp.add_pipe("entity_ruler")
```

</Infobox>

### Examples: Simple stateless pipeline components {#custom-components-simple}

The following component receives the `Doc` in the pipeline and prints some
information about it: the number of tokens, the part-of-speech tags of the
tokens and a conditional message based on the document length. The
[`@Language.component`](/api/language#component) decorator lets you register the
component under the name `"info_component"`.

> #### ✏️ Things to try
>
> 1. Add the component first in the pipeline by setting `first=True`. You'll see
>    that the part-of-speech tags are empty, because the component now runs
>    before the tagger and the tags aren't available yet.
> 2. Change the component `name` or remove the `name` argument. You should see
>    this change reflected in `nlp.pipe_names`.
> 3. Print `nlp.pipeline`. You'll see a list of tuples describing the component
>    name and the function that's called on the `Doc` object in the pipeline.
> 4. Change the first argument to `@Language.component`, the name, to something
>    else. spaCy should now complain that it doesn't know a component of the
>    name `"info_component"`.

```python
### {executable="true"}
import spacy
from spacy.language import Language

@Language.component("info_component")
def my_component(doc):
    print(f"After tokenization, this doc has {len(doc)} tokens.")
    print("The part-of-speech tags are:", [token.pos_ for token in doc])
    if len(doc) < 10:
        print("This is a pretty short document.")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("info_component", name="print_info", last=True)
print(nlp.pipe_names)  # ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'print_info']
doc = nlp("This is a sentence.")
```

Here's another example of a pipeline component that implements custom logic to
improve the sentence boundaries set by the dependency parser. The custom logic
should therefore be applied **after** tokenization, but _before_ the dependency
parsing – this way, the parser can also take advantage of the sentence
boundaries.

> #### ✏️ Things to try
>
> 1. Print `[token.dep_ for token in doc]` with and without the custom pipeline
>    component. You'll see that the predicted dependency parse changes to match
>    the sentence boundaries.
> 2. Remove the `else` block. All other tokens will now have `is_sent_start` set
>    to `None` (missing value), so the parser will assign sentence boundaries in
>    between.

```python
### {executable="true"}
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i + 1].is_title:
            doc[i + 1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i + 1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentencizer", before="parser")  # Insert before the parser
doc = nlp("This is. A sentence. | This is. Another sentence.")
for sent in doc.sents:
    print(sent.text)
```

### Component factories and stateful components {#custom-components-factories}

Component factories are callables that take settings and return a **pipeline
component function**. This is useful if your component is stateful, if you need
to customize its creation, or if you need access to the current `nlp` object or
the shared vocab. Component factories can be registered using the
[`@Language.factory`](/api/language#factory) decorator and they need at least
**two named arguments** that are filled in automatically when the component is
added to the pipeline:

> #### Example
>
> ```python
> from spacy.language import Language
>
> @Language.factory("my_component")
> def my_component(nlp, name):
>     return MyComponent()
> ```

| Argument | Description                                                                                                                       |
| -------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `nlp`    | The current `nlp` object. Can be used to access the shared vocab. ~~Language~~                                                    |
| `name`   | The **instance name** of the component in the pipeline. This lets you identify different instances of the same component. ~~str~~ |

All other settings can be passed in by the user via the `config` argument on
[`nlp.add_pipe`](/api/language#add_pipe). The
[`@Language.factory`](/api/language#factory) decorator also lets you define a
`default_config` that's used as a fallback.

```python
### With config {highlight="4,9"}
import spacy
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": True})
def my_component(nlp, name, some_setting: bool):
    return MyComponent(some_setting=some_setting)

nlp = spacy.blank("en")
nlp.add_pipe("my_component", config={"some_setting": False})
```
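
To see the fallback in action, here's a minimal sketch. The `Greeter` class, the
factory name `"demo_greeter"` and the `greeting` setting are made up for
illustration. The sketch relies on `nlp.add_pipe` returning the created
component, so its settings can be inspected directly.

```python
import spacy
from spacy.language import Language

class Greeter:
    """Hypothetical component that only stores its settings."""

    def __init__(self, name: str, greeting: str):
        self.name = name
        self.greeting = greeting

    def __call__(self, doc):
        return doc

@Language.factory("demo_greeter", default_config={"greeting": "hello"})
def create_greeter(nlp, name, greeting: str):
    return Greeter(name, greeting)

nlp = spacy.blank("en")
# No config provided: the value from default_config is used
default_greeter = nlp.add_pipe("demo_greeter")
# Config provided: the user's value overrides the default
custom_greeter = nlp.add_pipe(
    "demo_greeter", name="custom_greeter", config={"greeting": "hi"}
)
print(default_greeter.greeting, custom_greeter.greeting)
print(nlp.pipe_names)
```

Because each instance gets its own `name`, the same factory can be added to the
pipeline more than once with different settings.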

<Accordion title="How is @Language.factory different from @Language.component?" id="factories-decorator-component">

The [`@Language.component`](/api/language#component) decorator is essentially a
**shortcut** for stateless pipeline components that don't need any settings.
This means you don't have to always write a function that returns your function
if there's no state to be passed through – spaCy can just take care of this for
you. The following two code examples are equivalent:

```python
# Stateless component with @Language.factory
@Language.factory("my_component")
def create_my_component(nlp, name):
    def my_component(doc):
        # Do something to the doc
        return doc

    return my_component

# Stateless component with @Language.component
@Language.component("my_component")
def my_component(doc):
    # Do something to the doc
    return doc
```

</Accordion>

<Accordion title="Can I add the @Language.factory decorator to a class?" id="factories-class-decorator" spaced>

Yes, the [`@Language.factory`](/api/language#factory) decorator can be added to
a function or a class. If it's added to a class, it expects the `__init__`
method to take the arguments `nlp` and `name`, and will populate all other
arguments from the config. That said, it's often cleaner and more intuitive to
make your factory a separate function. That's also how spaCy does it internally.

</Accordion>

### Language-specific factories {#factories-language new="3"}

There are many use cases where you might want your pipeline components to be
language-specific. Sometimes this requires an entirely different implementation
per language, sometimes the only difference is in the settings or data. spaCy
allows you to register factories of the **same name** on both the `Language`
base class and its **subclasses** like `English` or `German`. Factories are
resolved starting with the specific subclass. If the subclass doesn't define a
component of that name, spaCy will check the `Language` base class.

Here's an example of a pipeline component that overwrites the normalized form of
a token, the `Token.norm_`, with an entry from a language-specific lookup table.
It's registered twice under the name `"token_normalizer"` – once using
`@English.factory` and once using `@German.factory`:

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.lang.de import German

class TokenNormalizer:
    def __init__(self, norm_table):
        self.norm_table = norm_table

    def __call__(self, doc):
        for token in doc:
            # Overwrite the token.norm_ if there's an entry in the data
            token.norm_ = self.norm_table.get(token.text, token.norm_)
        return doc

@English.factory("token_normalizer")
def create_en_normalizer(nlp, name):
    return TokenNormalizer({"realise": "realize", "colour": "color"})

@German.factory("token_normalizer")
def create_de_normalizer(nlp, name):
    return TokenNormalizer({"daß": "dass", "wußte": "wusste"})

nlp_en = English()
nlp_en.add_pipe("token_normalizer")  # uses the English factory
print([token.norm_ for token in nlp_en("realise colour daß wußte")])

nlp_de = German()
nlp_de.add_pipe("token_normalizer")  # uses the German factory
print([token.norm_ for token in nlp_de("realise colour daß wußte")])
```

<Infobox title="Implementation details">

Under the hood, language-specific factories are added to the
[`factories` registry](/api/top-level#registry) prefixed with the language code,
e.g. `"en.token_normalizer"`. When resolving the factory in
[`nlp.add_pipe`](/api/language#add_pipe), spaCy first checks for a
language-specific version of the factory using `nlp.lang` and if none is
available, falls back to looking up the regular factory name.

</Infobox>
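
The fallback to the base class can be sketched as follows. The `OriginMarker`
class and the factory name `"demo_origin"` are hypothetical and only exist to
show which factory was picked: one generic version is registered on `Language`
and one specific version on `English`, while `German` has none of its own.

```python
from spacy.language import Language
from spacy.lang.en import English
from spacy.lang.de import German

class OriginMarker:
    """Hypothetical component that remembers which factory created it."""

    def __init__(self, source: str):
        self.source = source

    def __call__(self, doc):
        return doc

@Language.factory("demo_origin")
def create_generic_marker(nlp, name):
    return OriginMarker("generic")

@English.factory("demo_origin")
def create_english_marker(nlp, name):
    return OriginMarker("english")

# English defines its own factory, so the specific version is used
en_marker = English().add_pipe("demo_origin")
# German doesn't, so spaCy falls back to the factory on the base class
de_marker = German().add_pipe("demo_origin")
print(en_marker.source, de_marker.source)
```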

### Example: Stateful component with settings {#example-stateful-components}

This example shows a **stateful** pipeline component for handling acronyms:
based on a dictionary, it will detect acronyms and their expanded forms in both
directions and add them to a list as the custom `doc._.acronyms`
[extension attribute](#custom-components-attributes). Under the hood, it uses
the [`PhraseMatcher`](/api/phrasematcher) to find instances of the phrases.

The factory function takes three arguments: the shared `nlp` object and
component instance `name`, which are passed in automatically by spaCy, and a
`case_sensitive` config setting that makes the matching and acronym detection
case-sensitive.

> #### ✏️ Things to try
>
> 1. Change the `config` passed to `nlp.add_pipe` and set `"case_sensitive"` to
>    `True`. You should see that the expanded acronym for "LOL" isn't detected
>    anymore.
> 2. Add some more terms to the `DICTIONARY` and update the processed text so
>    they're detected.
> 3. Add a `name` argument to `nlp.add_pipe` to change the component name. Print
>    `nlp.pipe_names` to see the change reflected in the pipeline.
> 4. Print the config of the current `nlp` object with
>    `print(nlp.config.to_str())` and inspect the `[components]` block. You
>    should see an entry for the acronyms component, referencing the factory
>    `acronyms` and the config settings.

```python
### {executable="true"}
from spacy.language import Language
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
import spacy

DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
DICTIONARY.update({value: key for key, value in DICTIONARY.items()})

@Language.factory("acronyms", default_config={"case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, case_sensitive: bool):
    return AcronymComponent(nlp, case_sensitive)

class AcronymComponent:
    def __init__(self, nlp: Language, case_sensitive: bool):
        # Create the matcher and match on Token.lower if case-insensitive
        matcher_attr = "TEXT" if case_sensitive else "LOWER"
        self.matcher = PhraseMatcher(nlp.vocab, attr=matcher_attr)
        self.matcher.add("ACRONYMS", [nlp.make_doc(term) for term in DICTIONARY])
        self.case_sensitive = case_sensitive
        # Register custom extension on the Doc
        if not Doc.has_extension("acronyms"):
            Doc.set_extension("acronyms", default=[])

    def __call__(self, doc: Doc) -> Doc:
        # Add the matched spans when doc is processed
        for _, start, end in self.matcher(doc):
            span = doc[start:end]
            acronym = DICTIONARY.get(span.text if self.case_sensitive else span.text.lower())
            doc._.acronyms.append((span, acronym))
        return doc

# Add the component to the pipeline and configure it
nlp = spacy.blank("en")
nlp.add_pipe("acronyms", config={"case_sensitive": False})

# Process a doc and see the results
doc = nlp("LOL, be right back")
print(doc._.acronyms)
```

## Initializing and serializing component data {#component-data}

Many stateful components depend on **data resources** like dictionaries and
lookup tables that should ideally be **configurable**. For example, it makes
sense to make the `DICTIONARY` in the above example an argument of the
registered function, so the `AcronymComponent` can be re-used with different
data. One logical solution would be to make it an argument of the component
factory, and allow it to be initialized with different dictionaries.

> #### config.cfg
>
> ```ini
> [components.acronyms.data]
> # 🚨 Problem: you don't want the data in the config
> lol = "laughing out loud"
> brb = "be right back"
> ```

```python
@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
    # 🚨 Problem: data ends up in the config file
    return AcronymComponent(nlp, data, case_sensitive)
```

However, passing in the dictionary directly is problematic, because it means
that if a component saves out its config and settings, the
[`config.cfg`](/usage/training#config) will include a dump of the entire data,
since that's the config the component was created with. It will also fail if the
data is not JSON-serializable.

### Option 1: Using a registered function {#component-data-function}

<Infobox>

- ✅ **Pros:** can load anything in Python, easy to add to and configure via
  config
- ❌ **Cons:** requires the function and its dependencies to be available at
  runtime

</Infobox>

If what you're passing in isn't JSON-serializable – e.g. a custom object like a
[model](#trainable-components) – saving out the component config becomes
impossible because there's no way for spaCy to know _how_ that object was
created, and what to do to create it again. This makes it much harder to save,
load and train custom pipelines with custom components. A simple solution is to
**register a function** that returns your resources. The
[registry](/api/top-level#registry) lets you **map string names to functions**
that create objects, so given a name and optional arguments, spaCy will know how
to recreate the object. To register a function that returns your custom
dictionary, you can use the `@spacy.registry.misc` decorator with a single
argument, the name:

> #### What's the misc registry?
>
> The [`registry`](/api/top-level#registry) provides different categories for
> different types of functions – for example, model architectures, tokenizers or
> batchers. `misc` is intended for miscellaneous functions that don't fit
> anywhere else.

```python
### Registered function for assets {highlight="1"}
@spacy.registry.misc("acronyms.slang_dict.v1")
def create_acronyms_slang_dict():
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary
```

In your `default_config` (and later in your
[training config](/usage/training#config)), you can now refer to the function
registered under the name `"acronyms.slang_dict.v1"` using the `@misc` key. This
tells spaCy how to create the value, and when your component is created, the
result of the registered function is passed in as the key `"dictionary"`.

> #### config.cfg
>
> ```ini
> [components.acronyms]
> factory = "acronyms"
>
> [components.acronyms.dictionary]
> @misc = "acronyms.slang_dict.v1"
> ```

```diff
- default_config = {"dictionary": DICTIONARY}
+ default_config = {"dictionary": {"@misc": "acronyms.slang_dict.v1"}}
```

Using a registered function also means that you can easily include your custom
components in pipelines that you [train](/usage/training). To make sure spaCy
knows where to find your custom `@misc` function, you can pass in a Python file
via the argument `--code`. If someone else is using your component, all they
have to do to customize the data is to register their own function and swap out
the name. Registered functions can also take **arguments**, by the way, that can
be defined in the config as well – you can read more about this in the docs on
[training with custom code](/usage/training#custom-code).
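
Putting the pieces together, a minimal end-to-end sketch could look like this.
The names `"demo_acronyms.slang_dict.v1"` and `"demo_acronym_lookup"` and the
`AcronymLookup` class are placeholders chosen for this example. The point to
notice is that the component receives the resolved dictionary, while the
pipeline config should only store the reference to the registered function.

```python
from typing import Dict

import spacy
from spacy.language import Language

@spacy.registry.misc("demo_acronyms.slang_dict.v1")
def create_slang_dict() -> Dict[str, str]:
    return {"lol": "laughing out loud", "brb": "be right back"}

class AcronymLookup:
    """Hypothetical component that only stores the dictionary it was given."""

    def __init__(self, dictionary: Dict[str, str]):
        self.dictionary = dictionary

    def __call__(self, doc):
        return doc

@Language.factory(
    "demo_acronym_lookup",
    default_config={"dictionary": {"@misc": "demo_acronyms.slang_dict.v1"}},
)
def create_acronym_lookup(nlp, name, dictionary: Dict[str, str]):
    return AcronymLookup(dictionary)

nlp = spacy.blank("en")
lookup = nlp.add_pipe("demo_acronym_lookup")
# The component sees the return value of the registered function...
print(lookup.dictionary)
# ...but the config only contains the reference to it
component_config = nlp.config["components"]["demo_acronym_lookup"]
print(component_config)
```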

### Option 2: Save data with the pipeline and load it in once on initialization {#component-data-initialization}

<Infobox>

- ✅ **Pros:** lets components save and load their own data and reflect user
  changes, load in data assets before training without depending on them at
  runtime
- ❌ **Cons:** requires more component methods, more complex config and data
  flow

</Infobox>

Just like models save out their binary weights when you call
[`nlp.to_disk`](/api/language#to_disk), components can also **serialize** any
other data assets – for instance, an acronym dictionary. If a pipeline component
implements its own `to_disk` and `from_disk` methods, those will be called
automatically by `nlp.to_disk` and will receive the path to the directory to
save to or load from. The component can then perform any custom saving or
loading. If a user makes changes to the component data, they will be reflected
when the `nlp` object is saved. For more examples of this, see the usage guide
on [serialization methods](/usage/saving-loading/#serialization-methods).

> #### About the data path
>
> The `path` argument spaCy passes to the serialization methods consists of the
> path provided by the user, plus a directory of the component name. This means
> that when you call `nlp.to_disk("/path")`, the `acronyms` component will
> receive the directory path `/path/acronyms` and can then create files in this
> directory.

```python
### Custom serialization methods {highlight="6-7,9-11"}
import srsly

class AcronymComponent:
    # other methods here...

    def to_disk(self, path, exclude=tuple()):
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        self.data = srsly.read_json(path / "data.json")
        return self
```
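
As a rough sketch of the full round trip, the following example saves a
pipeline with a custom component to a temporary directory and loads it back.
The factory name `"demo_acronym_store"` is made up, and creating the component
directory inside `to_disk` is a defensive assumption in case it doesn't exist
yet when the method is called.

```python
import tempfile
from pathlib import Path

import spacy
import srsly
from spacy.language import Language

@Language.factory("demo_acronym_store")
class AcronymStore:
    def __init__(self, nlp, name):
        self.data = {}

    def __call__(self, doc):
        return doc

    def to_disk(self, path, exclude=tuple()):
        path = Path(path)
        # Make sure the component directory exists before writing to it
        path.mkdir(parents=True, exist_ok=True)
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        self.data = srsly.read_json(Path(path) / "data.json")
        return self

nlp = spacy.blank("en")
store = nlp.add_pipe("demo_acronym_store")
store.data["lol"] = "laughing out loud"  # change made by the user

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    # The data ends up in a subdirectory named after the component
    data_file = Path(tmp_dir) / "demo_acronym_store" / "data.json"
    file_was_written = data_file.exists()
    reloaded_nlp = spacy.load(tmp_dir)

restored_data = reloaded_nlp.get_pipe("demo_acronym_store").data
print(file_was_written, restored_data)
```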

Now the component can save to and load from a directory. The only remaining
question: How do you **load in the initial data**? In Python, you could just
call the pipe's `from_disk` method yourself. But if you're adding the component
to your [training config](/usage/training#config), spaCy will need to know how
to set it up, from start to finish, including the data to initialize it with.

While you could use a registered function or a file loader like
[`srsly.read_json.v1`](/api/top-level#file_readers) as an argument of the
component factory, this approach is problematic: the component factory runs
**every time the component is created**. This means it will run when creating
the `nlp` object before training, but also every time a user loads your
pipeline. So your runtime pipeline would either depend on a local path on your
file system, or it's loaded twice: once when the component is created, and then
again when the data is loaded by `from_disk`.

> ```ini
> ### config.cfg
> [components.acronyms.data]
> # 🚨 Problem: Runtime pipeline depends on local path
> @readers = "srsly.read_json.v1"
> path = "/path/to/slang_dict.json"
> ```
>
> ```ini
> ### config.cfg
> [components.acronyms.data]
> # 🚨 Problem: this always runs
> @misc = "acronyms.slang_dict.v1"
> ```

```python
@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
    # 🚨 Problem: data will be loaded every time component is created
    return AcronymComponent(nlp, data, case_sensitive)
```

To solve this, your component can implement a separate method, `initialize`,
which will be called by [`nlp.initialize`](/api/language#initialize) if
available. This typically happens before training, but not at runtime when the
pipeline is loaded. For more background on this, see the usage guides on the
[config lifecycle](/usage/training#config-lifecycle) and
[custom initialization](/usage/training#initialization).

![Illustration of pipeline lifecycle](../images/lifecycle.svg)

A component's `initialize` method needs to take at least **two named
arguments**: a `get_examples` callback that gives it access to the training
examples, and the current `nlp` object. This is mostly used by trainable
components so they can initialize their models and label schemes from the data,
so we can ignore those arguments here. All **other arguments** on the method can
be defined via the config – in this case a dictionary `data`.

> #### config.cfg
>
> ```ini
> [initialize.components.my_component]
>
> [initialize.components.my_component.data]
> # ✅ This only runs on initialization
> @readers = "srsly.read_json.v1"
> path = "/path/to/slang_dict.json"
> ```

```python
### Custom initialize method {highlight="5-6"}
class AcronymComponent:
    def __init__(self):
        self.data = {}

    def initialize(self, get_examples=None, nlp=None, data={}):
        self.data = data
```

When [`nlp.initialize`](/api/language#initialize) runs before training (or when
you call it in your own code), the
[`[initialize]`](/api/data-formats#config-initialize) block of the config is
loaded and used to initialize the `nlp` object and its components. The custom
acronym component will then be passed the data loaded from the JSON file. After
training, the `nlp`
object is saved to disk, which will run the component's `to_disk` method. When
the pipeline is loaded back into spaCy later to use it, the `from_disk` method
will load the data back in.
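
If you want to try this flow without a training run, you can approximate it in
Python. The sketch below is an illustration under a few assumptions: the
factory name `"demo_init_acronyms"` is hypothetical, and the `[initialize]`
settings are written to `nlp.config` directly instead of being read from a
`config.cfg`, with an inline dictionary standing in for the JSON file.

```python
from typing import Dict

import spacy
from spacy.language import Language

@Language.factory("demo_init_acronyms")
class InitAcronyms:
    def __init__(self, nlp, name):
        self.data = {}

    def __call__(self, doc):
        return doc

    def initialize(self, get_examples=None, nlp=None, data: Dict[str, str] = {}):
        self.data = data

nlp = spacy.blank("en")
component = nlp.add_pipe("demo_init_acronyms")
# Creating the component doesn't load any data
data_before = dict(component.data)

# Equivalent of the [initialize.components.demo_init_acronyms] config block
nlp.config["initialize"]["components"]["demo_init_acronyms"] = {
    "data": {"lol": "laughing out loud"}
}
nlp.initialize()
# The data is only passed in once nlp.initialize has run
data_after = dict(component.data)
print(data_before, data_after)
```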

## Python type hints and validation {#type-hints new="3"}

spaCy's configs are powered by our machine learning library Thinc's
[configuration system](https://thinc.ai/docs/usage-config), which supports
[type hints](https://docs.python.org/3/library/typing.html) and even
[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your component
factory provides type hints, the values that are passed in will be **checked
against the expected types**. If a value can't be cast to the expected type,
spaCy will raise an error. `pydantic` also provides strict types like
`StrictInt`, which will force the value to be an integer and raise an error if
it's not – for instance, if your config defines a float.

<Infobox variant="warning">

If you're not using
[strict types](https://pydantic-docs.helpmanual.io/usage/types/#strict-types),
values that can be **cast to** the given type will still be accepted. For
example, `1` can be cast to a `float` or a `bool` type, but not to a
`List[str]`. However, if the type is
[`StrictFloat`](https://pydantic-docs.helpmanual.io/usage/types/#strict-types),
only a float will be accepted.

</Infobox>
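
Here's a small sketch of what the validation looks like in practice. The
`Thresholder` class and the factory name `"demo_thresholder"` are hypothetical.
The example assumes that invalid settings surface as Thinc's
`ConfigValidationError` when the component is added.

```python
import spacy
from spacy.language import Language
from thinc.api import ConfigValidationError

class Thresholder:
    """Hypothetical component that only stores a numeric setting."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def __call__(self, doc):
        return doc

@Language.factory("demo_thresholder", default_config={"threshold": 0.5})
def create_thresholder(nlp, name, threshold: float):
    return Thresholder(threshold)

nlp = spacy.blank("en")
# A float matches the type hint and is accepted
valid = nlp.add_pipe("demo_thresholder", config={"threshold": 0.75})

# The string "high" can't be cast to a float, so validation fails
try:
    nlp.add_pipe("demo_thresholder", name="invalid", config={"threshold": "high"})
    validation_failed = False
except ConfigValidationError as err:
    validation_failed = True
    print(err)
```

Since the error is raised while the component is being created, the invalid
instance never makes it into the pipeline.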

The following example shows a custom pipeline component for debugging. It can be
added anywhere in the pipeline and logs information about the `nlp` object and
the `Doc` that passes through. The `log_level` config setting lets the user
customize what log statements are shown – for instance, `"INFO"` will show info
logs and more critical logging statements, whereas `"DEBUG"` will show
everything. The value is annotated as a `StrictStr`, so it will only accept a
string value.

> #### ✏️ Things to try
>
> 1. Change the `config` passed to `nlp.add_pipe` to use the log level `"INFO"`.
>    You should see that only the statement logged with `logger.info` is shown.
> 2. Change the `config` passed to `nlp.add_pipe` so that it contains unexpected
>    values – for example, a boolean instead of a string: `"log_level": False`.
>    You should see a validation error.
> 3. Check out the docs on `pydantic`'s
>    [constrained types](https://pydantic-docs.helpmanual.io/usage/types/#constrained-types)
>    and write a type hint for `log_level` that only accepts the exact string
>    values `"DEBUG"`, `"INFO"` or `"CRITICAL"`.

```python
### {executable="true"}
import spacy
from spacy.language import Language
from spacy.tokens import Doc
from pydantic import StrictStr
import logging

@Language.factory("debug", default_config={"log_level": "DEBUG"})
class DebugComponent:
    def __init__(self, nlp: Language, name: str, log_level: StrictStr):
        self.logger = logging.getLogger(f"spacy.{name}")
        self.logger.setLevel(log_level)
        self.logger.info(f"Pipeline: {nlp.pipe_names}")

    def __call__(self, doc: Doc) -> Doc:
        is_tagged = doc.has_annotation("TAG")
        self.logger.debug(f"Doc: {len(doc)} tokens, is tagged: {is_tagged}")
        return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("debug", config={"log_level": "DEBUG"})
doc = nlp("This is a text...")
```

## Trainable components {#trainable-components new="3"}

spaCy's [`TrainablePipe`](/api/pipe) class helps you implement your own
trainable components that have their own model instance, make predictions over
`Doc` objects and can be updated using [`spacy train`](/api/cli#train). This
lets you plug fully custom machine learning components into your pipeline.

![Illustration of Pipe methods](../images/trainable_component.svg)

You'll need the following:

1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
   can be a model implemented in [Thinc](/usage/layers-architectures#thinc), or
   a [wrapped model](/usage/layers-architectures#frameworks) implemented in
   PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a
   list of [`Doc`](/api/doc) objects as input and can have any type of output.
2. **TrainablePipe subclass:** A subclass of [`TrainablePipe`](/api/pipe) that
   implements at least two methods: [`TrainablePipe.predict`](/api/pipe#predict)
   and [`TrainablePipe.set_annotations`](/api/pipe#set_annotations).
3. **Component factory:** A component factory registered with
   [`@Language.factory`](/api/language#factory) that takes the `nlp` object and
   component `name` and optional settings provided by the config and returns an
   instance of your trainable component.

> #### Example
>
> ```python
> from spacy.pipeline import TrainablePipe
> from spacy.language import Language
>
> class TrainableComponent(TrainablePipe):
>     def predict(self, docs):
>         ...
>
>     def set_annotations(self, docs, scores):
>         ...
>
> @Language.factory("my_trainable_component")
> def make_component(nlp, name, model):
>     return TrainableComponent(nlp.vocab, model, name=name)
> ```

| Name                                           | Description                                                                                                         |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| [`predict`](/api/pipe#predict)                 | Apply the component's model to a batch of [`Doc`](/api/doc) objects (without modifying them) and return the scores. |
| [`set_annotations`](/api/pipe#set_annotations) | Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores generated by `predict`.                      |

By default, [`TrainablePipe.__init__`](/api/pipe#init) takes the shared vocab,
the [`Model`](https://thinc.ai/docs/api-model) and the name of the component
instance in the pipeline, which you can use as a key in the losses. All other
keyword arguments will become available as [`TrainablePipe.cfg`](/api/pipe#cfg)
and will also be serialized with the component.

<Accordion title="Why components should be passed a Model instance, not create it" spaced>

spaCy's [config system](/usage/training#config) resolves the config describing
the pipeline components and models **bottom-up**. This means that it will
_first_ create a `Model` from a [registered architecture](/api/architectures),
validate its arguments and _then_ pass the object forward to the component. This
means that the config can express very complex, nested trees of objects – but
the objects don't have to pass the model settings all the way down to the
components. It also makes the components more **modular** and lets you
[swap](/usage/layers-architectures#swap-architectures) different architectures
in your config, and re-use model definitions.

```ini
### config.cfg (excerpt)
[components]

[components.textcat]
factory = "textcat"
labels = []

# This function is created and then passed to the "textcat" component as
# the argument "model"
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

[components.other_textcat]
factory = "textcat"
# This references the [components.textcat.model] block above
model = ${components.textcat.model}
labels = []
```

Your trainable pipeline component factories should therefore always take a
`model` argument instead of instantiating the
[`Model`](https://thinc.ai/docs/api-model) inside the component. To register
custom architectures, you can use the
[`@spacy.registry.architectures`](/api/top-level#registry) decorator. Also see
the [training guide](/usage/training#config) for details.

</Accordion>

For some use cases, it makes sense to also overwrite additional methods to
customize how the model is updated from examples, how it's initialized, how the
loss is calculated and to add evaluation scores to the training output.

| Name                                 | Description                                                                                                                                                                                                                                                                                                                                   |
| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`update`](/api/pipe#update)         | Learn from a batch of [`Example`](/api/example) objects containing the predictions and gold-standard annotations, and update the component's model.                                                                                                                                                                                           |
| [`initialize`](/api/pipe#initialize) | Initialize the model. Typically calls into [`Model.initialize`](https://thinc.ai/docs/api-model#initialize) and can be passed custom arguments via the [`[initialize]`](/api/data-formats#config-initialize) config block that are only loaded during training or when you call [`nlp.initialize`](/api/language#initialize), not at runtime. |
| [`get_loss`](/api/pipe#get_loss)     | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects.                                                                                                                                                                                                                                                 |
| [`score`](/api/pipe#score)           | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_score_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score.                            |

<Infobox title="Custom trainable components and models" emoji="📖">

For more details on how to implement your own trainable components and model
architectures, and plug existing models implemented in PyTorch or TensorFlow
into your spaCy pipeline, see the usage guide on
[layers and model architectures](/usage/layers-architectures#components).

</Infobox>

## Extension attributes {#custom-components-attributes new="2"}

spaCy allows you to set any custom attributes and methods on the `Doc`, `Span`
and `Token`, which become available as `Doc._`, `Span._` and `Token._` – for
example, `Token._.my_attr`. This lets you store additional information relevant
to your application, add new features and functionality to spaCy, and implement
your own models trained with other machine learning libraries. It also lets you
take advantage of spaCy's data structures and the `Doc` object as the "single
source of truth".

<Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">

Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
separation and makes it easier to ensure backwards compatibility. For example,
if you've implemented your own `.coref` property and spaCy claims it one day,
it'll break your code. Similarly, just by looking at the code, you'll
immediately know what's built-in and what's custom – for example,
`doc.sentiment` is spaCy, while `doc._.sent_score` isn't.

</Accordion>

<Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">

Extension definitions – the defaults, methods, getters and setters you pass in
to `set_extension` – are stored in class attributes on the `Underscore` class.
If you write to an extension attribute, e.g. `doc._.hello = True`, the data is
stored within the [`Doc.user_data`](/api/doc#attributes) dictionary. To keep the
underscore data separate from your other dictionary entries, the string `"._."`
is placed before the name, in a tuple.

</Accordion>
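
You can observe this by inspecting `Doc.user_data` before and after writing to
an extension. The attribute name `demo_hello` is made up for this sketch, and
`force=True` is only used so the snippet can be re-run without raising an
error. The exact layout of the key is an implementation detail, so the sketch
only checks its first two entries.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

Doc.set_extension("demo_hello", default=False, force=True)

doc = Doc(Vocab(), words=["Hello", "world"])
# Registering the extension doesn't store anything on the Doc itself
keys_before = list(doc.user_data.keys())

doc._.demo_hello = True
# Writing a value adds a tuple key that starts with "._." and the name
keys_after = list(doc.user_data.keys())
print(keys_before, keys_after)
```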

---

There are three main types of extensions, which can be defined using the
[`Doc.set_extension`](/api/doc#set_extension),
[`Span.set_extension`](/api/span#set_extension) and
[`Token.set_extension`](/api/token#set_extension) methods.

1. **Attribute extensions.** Set a default value for an attribute, which can be
   overwritten manually at any time. Attribute extensions work like "normal"
   variables and are the quickest way to store arbitrary information on a `Doc`,
   `Span` or `Token`.

   ```python
   Doc.set_extension("hello", default=True)
   assert doc._.hello
   doc._.hello = False
   ```

2. **Property extensions.** Define a getter and an optional setter function. If
   no setter is provided, the extension is immutable. Since the getter and
   setter functions are only called when you _retrieve_ the attribute, you can
   also access values of previously added attribute extensions. For example, a
   `Doc` getter can average over `Token` attributes. For `Span` extensions,
   you'll almost always want to use a property – otherwise, you'd have to write
   to _every possible_ `Span` in the `Doc` to set up the values correctly.

   ```python
   Doc.set_extension("hello", getter=get_hello_value, setter=set_hello_value)
   assert doc._.hello
   doc._.hello = "Hi!"
   ```

3. **Method extensions.** Assign a function that becomes available as an object
   method. Method extensions are always immutable. For more details and
   implementation ideas, see
   [these examples](/usage/examples#custom-components-attr-methods).

   ```python
   Doc.set_extension("hello", method=lambda doc, name: f"Hi {name}!")
   assert doc._.hello("Bob") == "Hi Bob!"
   ```

Before you can access a custom extension, you need to register it using the
`set_extension` method on the object you want to add it to, e.g. the `Doc`. Keep
in mind that extensions are always **added globally** and not just on a
particular instance. If an extension of the same name already exists,
`set_extension` will raise a `ValueError` unless you set `force=True`. If you're
trying to access an attribute that hasn't been registered, spaCy will raise an
`AttributeError`.

```python
### Example
from spacy.tokens import Doc, Span, Token

fruits = ["apple", "pear", "banana", "orange", "strawberry"]
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])

Token.set_extension("is_fruit", getter=is_fruit_getter)
Doc.set_extension("has_fruit", getter=has_fruit_getter)
Span.set_extension("has_fruit", getter=has_fruit_getter)
```

> #### Usage example
>
> ```python
> doc = nlp("I have an apple and a melon")
> assert doc[3]._.is_fruit      # get Token attributes
> assert not doc[0]._.is_fruit
> assert doc._.has_fruit        # get Doc attributes
> assert doc[1:4]._.has_fruit   # get Span attributes
> ```
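
The global registration and the error cases described above can be sketched as
follows. This is a minimal illustration that assumes a blank English pipeline;
`myapp_flag` and `myapp_unregistered` are hypothetical attribute names, and the
exact exception type for a duplicate registration may depend on your spaCy
version, so the sketch catches both candidates.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("This is a text")

# "myapp_flag" is a hypothetical attribute name used only for this sketch
Doc.set_extension("myapp_flag", default=False)
# The Doc was created before registration, but extensions are global
assert doc._.myapp_flag is False

try:
    # The name is already taken, so this registration is rejected
    Doc.set_extension("myapp_flag", default=True)
except (ValueError, AttributeError) as err:
    print("Registration failed:", err)

# force=True deliberately overwrites the existing extension
Doc.set_extension("myapp_flag", default=True, force=True)
assert doc._.myapp_flag is True

try:
    doc._.myapp_unregistered  # never registered
except AttributeError as err:
    print("Access failed:", err)
```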

Once you've registered your custom attribute, you can also use the built-in
`set`, `get` and `has` methods to modify and retrieve the attributes. This is
especially useful if you want to pass in a string instead of calling
`doc._.my_attr`.
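
As a short sketch of these methods, assuming a blank English pipeline and an
extension registered under the placeholder name `my_attr`:

```python
import spacy
from spacy.tokens import Doc

# "my_attr" is a placeholder name; the guard keeps the snippet re-runnable
if not Doc.has_extension("my_attr"):
    Doc.set_extension("my_attr", default=None)

nlp = spacy.blank("en")
doc = nlp("This is a text")

attr_name = "my_attr"  # the name could also come from a config or CLI option
assert doc._.has(attr_name)
doc._.set(attr_name, "some value")
assert doc._.get(attr_name) == "some value"
assert doc._.my_attr == "some value"  # same value via attribute access
assert not doc._.has("some_other_attr")
```

Passing the name as a string is handy when the attribute name is configurable
and only known at runtime.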

### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}

This example shows the implementation of a pipeline component that fetches
country meta data via the [REST Countries API](https://restcountries.eu), sets
entity annotations for countries and sets custom attributes on the `Doc` and
`Span` – for example, the capital, latitude/longitude coordinates and even the
country flag.

```python
### {executable="true"}
import requests
from spacy.lang.en import English
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token

@Language.factory("rest_countries")
class RESTCountriesComponent:
    def __init__(self, nlp, name, label="GPE"):
        r = requests.get("https://restcountries.eu/rest/v2/all")
        r.raise_for_status()  # make sure requests raises an error if it fails
        countries = r.json()
        # Convert API response to dict keyed by country name for easy lookup
        self.countries = {c["name"]: c for c in countries}
        self.label = label
        # Set up the PhraseMatcher with Doc patterns for each country name
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("COUNTRIES", [nlp.make_doc(c) for c in self.countries.keys()])
        # Register attributes on the Span. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Span.set_extension("is_country", default=None)
        Span.set_extension("country_capital", default=None)
        Span.set_extension("country_latlng", default=None)
        Span.set_extension("country_flag", default=None)
        # Register attribute on Doc via a getter that checks if the Doc
        # contains a country entity
        Doc.set_extension("has_country", getter=self.has_country)

    def __call__(self, doc):
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in self.matcher(doc):
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            # Set custom attributes on entity. Can be extended with other data
            # returned by the API, like currencies, country code, calling code etc.
            entity._.set("is_country", True)
            entity._.set("country_capital", self.countries[entity.text]["capital"])
            entity._.set("country_latlng", self.countries[entity.text]["latlng"])
            entity._.set("country_flag", self.countries[entity.text]["flag"])
            spans.append(entity)
        # Overwrite doc.ents and add entity – be careful not to replace!
        doc.ents = list(doc.ents) + spans
        return doc  # don't forget to return the Doc!

    def has_country(self, doc):
        """Getter for Doc attributes. Since the getter is only called
        when we access the attribute, we can refer to the Span's 'is_country'
        attribute here, which is already set in the processing step."""
        return any([entity._.get("is_country") for entity in doc.ents])

nlp = English()
nlp.add_pipe("rest_countries", config={"label": "GPE"})
doc = nlp("Some text about Colombia and the Czech Republic")
print("Pipeline", nlp.pipe_names)  # pipeline contains component name
print("Doc has countries", doc._.has_country)  # Doc contains countries
for ent in doc.ents:
    if ent._.is_country:
        print(ent.text, ent.label_, ent._.country_capital, ent._.country_latlng, ent._.country_flag)
```

In this case, all data can be fetched on initialization in one request. However,
if you're working with text that contains incomplete country names, spelling
mistakes or foreign-language versions, you could also implement a
`like_country`-style getter function that makes a request to the search API
endpoint and returns the best-matching result.
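
To give a rough idea of what such a getter could look like, here is a sketch
that swaps the API request for local fuzzy matching with Python's built-in
`difflib`, so it runs offline. The function name `like_country`, the hard-coded
`COUNTRY_NAMES` list and the `cutoff` value are assumptions for illustration. A
real implementation would query the search endpoint instead.

```python
from difflib import get_close_matches

# Hypothetical stand-in for the country names returned by the API
COUNTRY_NAMES = ["Colombia", "Czech Republic", "Germany"]

def like_country(text, names=COUNTRY_NAMES, cutoff=0.8):
    """Return the best-matching country name for the text, or None."""
    lowered = {name.lower(): name for name in names}
    matches = get_close_matches(text.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

print(like_country("Columbia"))
print(like_country("banana"))

# With spaCy, this could be registered as a getter, for example:
# Span.set_extension("like_country", getter=lambda span: like_country(span.text))
```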

### User hooks {#custom-components-user-hooks}

While it's generally recommended to use the `Doc._`, `Span._` and `Token._`
proxies to add your own custom attributes, spaCy offers a few exceptions to
allow **customizing the built-in methods** like
[`Doc.similarity`](/api/doc#similarity) or [`Doc.vector`](/api/doc#vector) with
your own hooks, which can rely on components you train yourself. For instance,
you can provide your own on-the-fly sentence segmentation algorithm or document
similarity method.

Hooks let you customize some of the behaviors of the `Doc`, `Span` or `Token`
objects by adding a component to the pipeline. For instance, to customize the
[`Doc.similarity`](/api/doc#similarity) method, you can add a component that
sets a custom function to `doc.user_hooks["similarity"]`. The built-in
`Doc.similarity` method will check the `user_hooks` dict, and delegate to your
function if you've set one. Similar results can be achieved by setting functions
to `Doc.user_span_hooks` and `Doc.user_token_hooks`.

> #### Implementation note
>
> The hooks live on the `Doc` object because the `Span` and `Token` objects are
> created lazily, and don't own any data. They just proxy to their parent `Doc`.
> This turns out to be convenient here – we only have to worry about installing
> hooks in one place.

| Name               | Customizes                                                                                                                                                                                                              |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `user_hooks`       | [`Doc.similarity`](/api/doc#similarity), [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents)                             |
| `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) |
| `user_span_hooks`  | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root)                     |

```python
### Add custom similarity hooks
from spacy.language import Language


class SimilarityModel:
    def __init__(self, name: str, index: int):
        self.name = name
        self.index = index

    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.similarity
        doc.user_span_hooks["similarity"] = self.similarity
        doc.user_token_hooks["similarity"] = self.similarity
        return doc

    def similarity(self, obj1, obj2):
        return obj1.vector[self.index] + obj2.vector[self.index]


@Language.factory("similarity_component", default_config={"index": 0})
def create_similarity_component(nlp, name, index: int):
    return SimilarityModel(name, index)
```
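
The following sketch shows a hook in action end to end. It assumes a blank
English pipeline without word vectors, so instead of the vector-based function
above it installs a simple word-overlap (Jaccard) score. The component name
`jaccard_similarity` and the scoring function are illustrative choices, not
part of spaCy.

```python
import spacy
from spacy.language import Language

def jaccard(obj1, obj2):
    """Share of lowercased token texts that both objects have in common."""
    words1 = {token.lower_ for token in obj1}
    words2 = {token.lower_ for token in obj2}
    union = words1 | words2
    return len(words1 & words2) / len(union) if union else 0.0

@Language.component("jaccard_similarity")
def jaccard_similarity(doc):
    # Doc.similarity delegates to this function once the hook is set
    doc.user_hooks["similarity"] = jaccard
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("jaccard_similarity")
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")
score = doc1.similarity(doc2)
print(score)  # 2 shared words out of 4 distinct words
```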

## Developing plugins and wrappers {#plugins}

We're very excited about all the new possibilities for community extensions and
plugins in spaCy, and we can't wait to see what you build with it! To get you
started, here are a few tips, tricks and best
practices. [See here](/universe/?category=pipeline) for examples of other spaCy
extensions.

### Usage ideas {#custom-components-usage-ideas}

- **Adding new features and hooking in models.** For example, a sentiment
  analysis model or your preferred solution for lemmatization. spaCy's built-in
  tagger, parser and entity recognizer respect annotations that were already set
  on the `Doc` in a previous step of the pipeline.
- **Integrating other libraries and APIs.** For example, your pipeline component
  can write additional information and data directly to the `Doc` or `Token` as
  custom attributes, while making sure no information is lost in the process.
  This can be output generated by other libraries and models, or an external
  service with a REST API.
- **Debugging and logging.** For example, a component that stores and/or
  exports relevant information about the current state of the processed
  document, which you can insert at any point of your pipeline.

### Best practices {#custom-components-best-practices}

Extensions can claim their own `._` namespace and exist as standalone packages.
If you're developing a tool or library and want to make it easy for others to
use it with spaCy and add it to their pipeline, all you have to do is expose a
function that takes a `Doc`, modifies it and returns it.

- Make sure to choose a **descriptive and specific name** for your pipeline
  component class, and set it as its `name` attribute. Avoid names that are too
  common or likely to clash with built-in or a user's other custom components.
  While it's fine to call your package `"spacy_my_extension"`, avoid component
  names including `"spacy"`, since this can easily lead to confusion.

  ```diff
  + name = "myapp_lemmatizer"
  - name = "lemmatizer"
  ```

- When writing to `Doc`, `Token` or `Span` objects, **use getter functions**
  wherever possible, and avoid setting values explicitly. Tokens and spans don't
  own any data themselves, and they're implemented as C extension classes – so
  you can't usually add new attributes to them like you could with most pure
  Python objects.

  ```diff
  + is_fruit = lambda token: token.text in ("apple", "orange")
  + Token.set_extension("is_fruit", getter=is_fruit)

  - Token.set_extension("is_fruit", default=False)
  - if token.text in ("apple", "orange"):
  -     token._.set("is_fruit", True)
  ```

- Always add your custom attributes to the **global** `Doc`, `Token` or `Span`
  objects, not a particular instance of them. Add the attributes **as early as
  possible**, e.g. in your extension's `__init__` method or in the global scope
  of your module. This means that in the case of namespace collisions, the user
  will see an error immediately, not just when they run their pipeline.

  ```diff
  + from spacy.tokens import Doc
  + def __init__(self, attr="my_attr"):
  +     Doc.set_extension(attr, getter=self.get_doc_attr)

  - def __call__(self, doc):
  -     doc.set_extension("my_attr", getter=self.get_doc_attr)
  ```

- If your extension is setting properties on the `Doc`, `Token` or `Span`,
  include an option to **let the user change those attribute names**. This
  makes it easier to avoid namespace collisions and accommodate users with
  different naming preferences. We recommend adding an `attrs` argument to the
  `__init__` method of your class so you can write the names to class attributes
  and reuse them across your component.

  ```diff
  + Doc.set_extension(self.doc_attr, default="some value")
  - Doc.set_extension("my_doc_attr", default="some value")
  ```

- Ideally, extensions should be **standalone packages** with spaCy and
  optionally, other packages specified as a dependency. They can freely assign
  to their own `._` namespace, but should stick to that. If your extension's
  only job is to provide a better `.similarity` implementation, and your docs
  state this explicitly, there's no problem with writing to the
  [`user_hooks`](#custom-components-user-hooks) and overwriting spaCy's built-in
  method. However, a third-party extension should **never silently overwrite
  built-ins**, or attributes set by other extensions.

- If you're looking to publish a pipeline package that depends on a custom
  pipeline component, you can either **require it** in the package's
  dependencies, or – if the component is specific and lightweight – choose to
  **ship it with your pipeline package**. Just make sure the
  [`@Language.component`](/api/language#component) or
  [`@Language.factory`](/api/language#factory) decorator that registers the
  custom component runs in your package's `__init__.py` or is exposed via an
  [entry point](/usage/saving-loading#entry-points).

- Once you're ready to share your extension with others, make sure to **add docs
  and installation instructions** (you can always link to this page for more
  info). Make it easy for others to install and use your extension, for example
  by uploading it to [PyPI](https://pypi.python.org). If you're sharing your
  code on GitHub, don't forget to tag it with
  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
  [`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
  to help people find it. If you post it on Twitter, feel free to tag
  [@spacy_io](https://twitter.com/spacy_io) so we can check it out.
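
Several of these practices can be combined in one small component. The sketch
below is only an illustration: the class `TokenCounter`, the factory name
`myapp_token_counter` and the default attribute name `myapp_n_tokens` are
made-up names. It registers its extension in `__init__`, uses a getter instead
of writing values, and lets the user rename the attribute via the config.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

class TokenCounter:
    def __init__(self, attr: str = "myapp_n_tokens"):
        self.attr = attr
        # Register early and globally, so a name collision raises an error
        # when the component is created, not when the pipeline runs
        Doc.set_extension(self.attr, getter=len)

    def __call__(self, doc):
        # Nothing to write here, the getter computes the value on access
        return doc

@Language.factory("myapp_token_counter", default_config={"attr": "myapp_n_tokens"})
def create_token_counter(nlp, name, attr: str):
    return TokenCounter(attr)

nlp = spacy.blank("en")
# The user picks a different attribute name via the config
nlp.add_pipe("myapp_token_counter", config={"attr": "n_tokens"})
doc = nlp("This is a text")
print(doc._.get("n_tokens"))
```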

### Wrapping other models and libraries {#wrapping-models-libraries}

Let's say you have a custom entity recognizer that takes a list of strings and
returns their [BILUO tags](/usage/linguistic-features#accessing-ner). Given an
input like `["A", "text", "about", "Facebook"]`, it will predict and return
`["O", "O", "O", "U-ORG"]`. To integrate it into your spaCy pipeline and make it
add those entities to the `doc.ents`, you can wrap it in a custom pipeline
component function and pass it the token texts from the `Doc` object received by
the component.

The [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans)
function is very helpful here, because it takes a `Doc` object and token-based
BILUO tags and returns a sequence of `Span` objects in the `Doc` with added
labels. So all your wrapper has to do is compute the entity spans and overwrite
the `doc.ents`.

> #### How the doc.ents work
>
> When you add spans to the `doc.ents`, spaCy will automatically resolve them
> back to the underlying tokens and set the `Token.ent_type` and `Token.ent_iob`
> attributes. By definition, each token can only be part of one entity, so
> overlapping entity spans are not allowed.

```python
### {highlight="1,8-9"}
import your_custom_entity_recognizer
from spacy.training import biluo_tags_to_spans
from spacy.language import Language

@Language.component("custom_ner_wrapper")
def custom_ner_wrapper(doc):
    words = [token.text for token in doc]
    custom_entities = your_custom_entity_recognizer(words)
    doc.ents = biluo_tags_to_spans(doc, custom_entities)
    return doc
```

The `custom_ner_wrapper` can then be added to a blank pipeline using
[`nlp.add_pipe`](/api/language#add_pipe). You can also replace the existing
entity recognizer of a trained pipeline with
[`nlp.replace_pipe`](/api/language#replace_pipe).
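
To try the wrapper pattern without an external model, the sketch below plugs in
a toy recognizer. `toy_entity_recognizer` and the component name
`toy_ner_wrapper` are hypothetical stand-ins for your own model and wrapper. The
toy function only tags the literal token "Facebook".

```python
import spacy
from spacy.language import Language
from spacy.training import biluo_tags_to_spans

def toy_entity_recognizer(words):
    # Hypothetical stand-in for a real model: one BILUO tag per token
    return ["U-ORG" if word == "Facebook" else "O" for word in words]

@Language.component("toy_ner_wrapper")
def toy_ner_wrapper(doc):
    words = [token.text for token in doc]
    tags = toy_entity_recognizer(words)
    # Convert the token-based tags to labelled Span objects
    doc.ents = biluo_tags_to_spans(doc, tags)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("toy_ner_wrapper")
doc = nlp("A text about Facebook")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```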

Here's another example of a custom model, `your_custom_model`, that takes a list
of tokens and returns lists of fine-grained part-of-speech tags, coarse-grained
part-of-speech tags, dependency labels and head token indices. Here, we can use
the [`Doc.from_array`](/api/doc#from_array) to create a new `Doc` object using
those values. To create a numpy array we need integers, so we can look up the
string labels in the [`StringStore`](/api/stringstore). The
[`doc.vocab.strings.add`](/api/stringstore#add) method comes in handy here,
because it returns the integer ID of the string _and_ makes sure it's added to
the vocab. This is especially important if the custom model uses a different
label scheme than spaCy's default models.

> #### Example: spacy-stanza
>
> For an example of an end-to-end wrapper for statistical tokenization, tagging
> and parsing, check out
> [`spacy-stanza`](https://github.com/explosion/spacy-stanza). It uses a very
> similar approach to the example in this section – the only difference is that
> it fully replaces the `nlp` object instead of providing a pipeline component,
> since it also needs to handle tokenization.

```python
### {highlight="1,11,17-19"}
import your_custom_model
from spacy.language import Language
from spacy.symbols import POS, TAG, DEP, HEAD
from spacy.tokens import Doc
import numpy

@Language.component("custom_model_wrapper")
def custom_model_wrapper(doc):
    words = [token.text for token in doc]
    spaces = [bool(token.whitespace_) for token in doc]
    pos, tags, deps, heads = your_custom_model(words)
    # Convert the strings to integers and add them to the string store
    pos = [doc.vocab.strings.add(label) for label in pos]
    tags = [doc.vocab.strings.add(label) for label in tags]
    deps = [doc.vocab.strings.add(label) for label in deps]
    # Create a new Doc from a numpy array
    attrs = [POS, TAG, DEP, HEAD]
    arr = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
    new_doc = Doc(doc.vocab, words=words, spaces=spaces).from_array(attrs, arr)
    return new_doc
```

<Infobox title="Sentence boundaries and heads" variant="warning">

If you create a `Doc` object with dependencies and heads, spaCy is able to
resolve the sentence boundaries automatically. However, note that the `HEAD`
value used to construct a `Doc` is the token index **relative** to the current
token – e.g. `-1` for the previous token. The CoNLL format typically annotates
heads as `1`-indexed absolute indices with `0` indicating the root. If that's
the case in your annotations, you need to convert them first:

```python
heads = [2, 0, 4, 2, 2]
new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]
```

</Infobox>
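
As a quick sanity check of the head conversion, the sketch below wraps it in a
function and converts the relative offsets back to absolute positions. The
helper names `conll_heads_to_relative` and `relative_to_absolute` are made up
for this illustration.

```python
def conll_heads_to_relative(heads):
    """Convert 1-indexed absolute heads (0 = root) to relative offsets."""
    return [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]

def relative_to_absolute(rel_heads):
    """Recover 0-indexed head positions; a root token points at itself."""
    return [i + offset for i, offset in enumerate(rel_heads)]

heads = [2, 0, 4, 2, 2]
rel_heads = conll_heads_to_relative(heads)
print(rel_heads)  # [1, 0, 1, -2, -3]
print(relative_to_absolute(rel_heads))  # [1, 1, 3, 1, 1]
```

The second token is the root, so its offset is `0` and it resolves to its own
index. Every other token resolves to the 0-indexed position of its CoNLL head.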

<Infobox title="Advanced usage, serialization and entry points" emoji="📖">

For more details on how to write and package custom components, make them
available to spaCy via entry points and implement your own serialization
methods, check out the usage guide on
[saving and loading](/usage/saving-loading).

</Infobox>
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								---
 								title: Language Processing Pipelines
-												Update docs [ci skip]

											
										
										
											2020-08-18 01:49:19 +03:00
+								next: /usage/embeddings-transformers
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								menu:
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 18:38:03 +03:00
+								  - ['Processing Text', 'processing']
-												Update docs [ci skip]

											
										
										
											2020-10-03 15:47:02 +03:00
+								  - ['Pipelines & Components', 'pipelines']
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								  - ['Custom Components', 'custom-components']
-												Update docs [ci skip]

											
										
										
											2020-10-03 15:47:02 +03:00
+								  - ['Component Data', 'component-data']
 								  - ['Type Hints & Validation', 'type-hints']
 								  - ['Trainable Components', 'trainable-components']
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								  - ['Extension Attributes', 'custom-components-attributes']
 								  - ['Plugins & Wrappers', 'plugins']
 								---
 								import Pipelines101 from 'usage/101/\_pipelines.md'
 								<Pipelines101 />
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 18:38:03 +03:00
+								## Processing text {#processing}
 								When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
 								component** on the `Doc`, in order. It then returns the processed `Doc` that you
 								can work with.
 								```python
-												Remove u-strings and fix formatting [ci skip]

											
										
										
											2019-09-12 17:11:15 +03:00
+								doc = nlp("This is a text")
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 18:38:03 +03:00
+								```
 								When processing large volumes of text, the statistical models are usually more
 								efficient if you let them work on batches of texts. spaCy's
 								[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
 								processed `Doc` objects. The batching is done internally.
 								```diff
-												Remove u-strings and fix formatting [ci skip]

											
										
										
											2019-09-12 17:11:15 +03:00
+								texts = ["This is a text", "These are lots of texts", "..."]
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 18:38:03 +03:00
+								- docs = [nlp(text) for text in texts]
 								+ docs = list(nlp.pipe(texts))
 								```
-												Update WIP

											
										
										
											2020-07-06 23:22:37 +03:00
+								<Infobox title="Tips for efficient processing" emoji="💡">
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 18:38:03 +03:00
 								- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
 								  buffer them in batches, instead of one-by-one. This is usually much more
 								  efficient.
 								- Only apply the **pipeline components you need**. Getting predictions from the
 								  model that you don't actually need adds up and becomes very inefficient at
 								  scale. To prevent this, use the `disable` keyword argument to disable
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								  components you don't need – either when loading a pipeline, or during
 								  processing with `nlp.pipe`. See the section on
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 18:38:03 +03:00
+								  [disabling pipeline components](#disabling) for more details and examples.
 								</Infobox>
 								In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
 								(potentially very large) iterable of texts as a stream. Because we're only
 								accessing the named entities in `doc.ents` (set by the `ner` component), we'll
-												Reformat processing pipelines

											
										
										
											2021-03-18 15:29:51 +03:00
+								disable all other components during processing. `nlp.pipe` yields `Doc` objects,
 								so we can iterate over them and access the named entity predictions:
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 18:38:03 +03:00
 								> #### ✏️ Things to try
 								>
 								> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
 								>    empty, because the entity recognizer didn't run.
 								```python
 								### {executable="true"}
 								import spacy
 								texts = [
 								    "Net income was $9.4 million compared to the prior year of $2.7 million.",
 								    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
 								]
 								nlp = spacy.load("en_core_web_sm")
-												Include all en_core_web_sm components in examples

											
										
										
											2021-03-17 17:05:22 +03:00
+								for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
-												Add "Processing text" section [ci skip]

											
										
										
											2019-07-25 18:38:03 +03:00
+								    # Do something with the doc here
 								    print([(ent.text, ent.label_) for ent in doc.ents])
 								```
 								<Infobox title="Important note" variant="warning">
 								When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
 								[generator](https://realpython.com/introduction-to-python-generators/) that
 								yields `Doc` objects – not a list. So if you want to use it like a list, you'll
 								have to call `list()` on it first:
 								```diff
 								- docs = nlp.pipe(texts)[0]         # will raise an error
 								+ docs = list(nlp.pipe(texts))[0]   # works as expected
 								```
 								</Infobox>
-												Update processing-pipelines.md to mention method for doc metadata (#7480)

* Update processing-pipelines.md

Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)

Link to a new example on the attributes page detailing the following:

> ```
> data = [
>   ("Some text to process", {"meta": "foo"}),
>   ("And more text...", {"meta": "bar"})
> ]
> 
> for doc, context in nlp.pipe(data, as_tuples=True):
>     # Let's assume you have a "meta" extension registered on the Doc
>     doc._.meta = context["meta"]
> ```

from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as

* Updating the attributes section

Update the attributes section with example of how extensions can be used to store metadata.

* Update processing-pipelines.md

* Update processing-pipelines.md

Made as_tuples example executable and relocated to the end of the "Processing Text" section.

* Update processing-pipelines.md

* Update processing-pipelines.md

Removed extra line

* Reformat and rephrase

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
											
										
										
											2021-04-19 12:58:12 +03:00
+								You can use the `as_tuples` option to pass additional context along with each
 								doc when using [`nlp.pipe`](/api/language#pipe). If `as_tuples` is `True`, then
 								the input should be a sequence of `(text, context)` tuples and the output will
 								be a sequence of `(doc, context)` tuples. For example, you can pass metadata in
 								the context and save it in a [custom attribute](#custom-components-attributes):
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.tokens import Doc
 								if not Doc.has_extension("text_id"):
 								    Doc.set_extension("text_id", default=None)
 								text_tuples = [
 								    ("This is the first text.", {"text_id": "text1"}),
 								    ("This is the second text.", {"text_id": "text2"})
 								]
 								nlp = spacy.load("en_core_web_sm")
 								doc_tuples = nlp.pipe(text_tuples, as_tuples=True)
 								docs = []
 								for doc, context in doc_tuples:
 								    doc._.text_id = context["text_id"]
 								    docs.append(doc)
 								for doc in docs:
 								    print(f"{doc._.text_id}: {doc.text}")
 								```

### Multiprocessing {#multiprocessing}

 								spaCy includes built-in support for multiprocessing with
 								[`nlp.pipe`](/api/language#pipe) using the `n_process` option:
 								```python
 								# Multiprocessing with 4 processes
 								docs = nlp.pipe(texts, n_process=4)
 								# With as many processes as CPUs (use with caution!)
 								docs = nlp.pipe(texts, n_process=-1)
 								```

Depending on your platform, starting many processes with multiprocessing can add
a lot of overhead. In particular, the default start method `spawn` used in
macOS/OS X (as of Python 3.8) and in Windows can be slow for larger models
because the model data is copied in memory for each new process. See the
[Python docs on multiprocessing](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods)
for further details.

 								For shorter tasks and in particular with `spawn`, it can be faster to use a
 								smaller number of processes with a larger batch size. The optimal `batch_size`
 								setting will depend on the pipeline components, the length of your documents,
 								the number of processes and how much memory is available.
 								```python
 								# Default batch size is `nlp.batch_size` (typically 1000)
 								docs = nlp.pipe(texts, n_process=2, batch_size=2000)
 								```
 								<Infobox title="Multiprocessing on GPU" variant="warning">
 								Multiprocessing is not generally recommended on GPU because RAM is too limited.
 								If you want to try it out, be aware that it is only possible using `spawn` due
 								to limitations in CUDA.
 								</Infobox>
 								<Infobox title="Multiprocessing with transformer models" variant="warning">
 								In Linux, transformer models may hang or deadlock with multiprocessing due to an
 								[issue in PyTorch](https://github.com/pytorch/pytorch/issues/17199). One
suggested workaround is to use `spawn` instead of `fork` and another is to limit
the number of threads before loading any models using
`torch.set_num_threads(1)`.
 								</Infobox>
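
As a minimal sketch of the first workaround, the start method can be selected with Python's standard `multiprocessing` module before any worker processes are created. The helper `pick_start_method` is a hypothetical name introduced here for illustration and is not part of spaCy. The second workaround appears only as a comment, since it requires PyTorch to be installed.

```python
import multiprocessing as mp


def pick_start_method(preferred="spawn"):
    """Return `preferred` if this platform supports it, else the platform default.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    available = mp.get_all_start_methods()  # the platform default is listed first
    return preferred if preferred in available else available[0]


if __name__ == "__main__":
    # Select the start method once, before any worker processes are created.
    mp.set_start_method(pick_start_method("spawn"), force=True)
    print(mp.get_start_method())

    # The other workaround from the note above, if PyTorch is installed:
    # import torch
    # torch.set_num_threads(1)  # call this before loading any models
```

With the start method set this way, later calls to `nlp.pipe(texts, n_process=...)` in the same script should create their worker processes with `spawn`, assuming nothing else changes the start method afterwards.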

## Pipelines and built-in components {#pipelines}

 								spaCy makes it very easy to create your own pipelines consisting of reusable
 								components – this includes spaCy's default tagger, parser and entity recognizer,
 								but also your own custom processing functions. A pipeline component can be added
to an already existing `nlp` object, specified when initializing a
 								[`Language`](/api/language) class, or defined within a
 								[pipeline package](/usage/saving-loading#models).

> #### config.cfg (excerpt)
>
> ```ini
> [nlp]
> lang = "en"
> pipeline = ["tok2vec", "parser"]
>
> [components]
>
> [components.tok2vec]
> factory = "tok2vec"
> # Settings for the tok2vec component
>
> [components.parser]
> factory = "parser"
> # Settings for the parser component
> ```

When you load a pipeline, spaCy first consults the
[`meta.json`](/usage/saving-loading#models) and
 								[`config.cfg`](/usage/training#config). The config tells spaCy what language
 								class to use, which components are in the pipeline, and how those components
 								should be created. spaCy will then do the following:

1. Load the **language class and data** for the given ID via
   [`get_lang_class`](/api/top-level#util.get_lang_class) and initialize it. The
   `Language` class contains the shared vocabulary, tokenization rules and the
   language-specific settings.
2. Iterate over the **pipeline names** and look up each component name in the
   `[components]` block. The `factory` tells spaCy which
   [component factory](#custom-components-factories) to use for adding the
   component with [`add_pipe`](/api/language#add_pipe). The settings are passed
   into the factory.
3. Make the **model data** available to the `Language` class by calling
   [`from_disk`](/api/language#from_disk) with the path to the data directory.
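
To make the lookup in step 2 more concrete, here is a toy sketch that uses only the Python standard library. It is not how spaCy implements loading, since spaCy has its own config system and factory registry. The names `FACTORIES`, `make_dummy_component` and `build_pipeline` are hypothetical, and the "doc" is just a list that records which components ran.

```python
import configparser
import json

CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "parser"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.parser]
factory = "parser"
"""


def make_dummy_component(name, **settings):
    """Hypothetical factory: builds a component that records that it ran."""
    def component(doc):
        doc.append(name)
        return doc
    return component


# Hypothetical registry mapping factory names to factory functions.
FACTORIES = {"tok2vec": make_dummy_component, "parser": make_dummy_component}


def build_pipeline(config_text):
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Values in the excerpt are JSON-formatted, so decode them with json.loads.
    names = json.loads(parser["nlp"]["pipeline"])
    pipeline = []
    for name in names:
        block = {k: json.loads(v) for k, v in parser[f"components.{name}"].items()}
        factory_name = block.pop("factory")
        # Any remaining settings in the block are passed into the factory.
        pipeline.append((name, FACTORIES[factory_name](name, **block)))
    return pipeline


pipeline = build_pipeline(CONFIG_TEXT)
doc = []
for name, proc in pipeline:
    doc = proc(doc)
print(doc)  # ['tok2vec', 'parser']
```

The point of the sketch is the order of operations: the pipeline names determine which blocks are read, each block's `factory` selects the function that builds the component, and the remaining keys are treated as settings for that factory.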

So when you call this...

```python
nlp = spacy.load("en_core_web_sm")
```

... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the
pipeline
 								`["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]`. spaCy
 								will then initialize `spacy.lang.en.English`, and create each pipeline component
 								and add it to the processing pipeline. It'll then load in the model data from
 								the data directory and return the modified `Language` class for you to use as
 								the `nlp` object.

<Infobox title="Changed in v3.0" variant="warning">

spaCy v3.0 introduces a `config.cfg`, which includes more detailed settings for
the pipeline, its components and the [training process](/usage/training#config).
You can export the config of your current `nlp` object by calling
[`nlp.config.to_disk`](/api/language#config).

</Infobox>

Fundamentally, a [spaCy pipeline package](/models) consists of three components:
**the weights**, i.e. binary data loaded in from a directory, a **pipeline** of
functions called in order, and **language data** like the tokenization rules and
language-specific settings. For example, a Spanish NER pipeline requires
different weights, language data and components than an English parsing and
tagging pipeline. This is also why the pipeline state is always held by the
`Language` class. [`spacy.load`](/api/top-level#spacy.load) puts this all
together and returns an instance of `Language` with a pipeline set and access to
the binary data:

```python
### spacy.load under the hood
lang = "en"
pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-3.0.0"

cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
nlp = cls()                            # 2. Initialize it
for name in pipeline:
    nlp.add_pipe(name)                 # 3. Add the component to the pipeline
nlp.from_disk(data_path)               # 4. Load in the binary data
```

 								When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
 								component** on the `Doc`, in order. Since the model data is loaded, the
 								components can access it to assign annotations to the `Doc` object, and
 								subsequently to the `Token` and `Span` which are only views of the `Doc`, and
 								don't own any data themselves. All components return the modified document,
-												Include all en_core_web_sm components in examples

											
										
										
											2021-03-17 17:05:22 +03:00
+								which is then processed by the next component in the pipeline.

```python
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence")  # Create a Doc from raw text
for name, proc in nlp.pipeline:           # Iterate over components in order
    doc = proc(doc)                       # Apply each component
```

The current processing pipeline is available as `nlp.pipeline`, which returns a
list of `(name, component)` tuples, or `nlp.pipe_names`, which only returns a
list of human-readable component names.

```python
print(nlp.pipeline)
# [('tok2vec', <spacy.pipeline.Tok2Vec>), ('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>), ('attribute_ruler', <spacy.pipeline.AttributeRuler>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer>)]
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
```
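
The loop above can be tried without downloading a trained pipeline. The following is a minimal sketch that assumes a blank English pipeline with only the rule-based `sentencizer` added; the example text and variable names are illustrative. It suggests that applying the components by hand gives the same sentence boundaries as calling `nlp` directly.

```python
import spacy

# Assumption: a blank English pipeline, so no trained package is required
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# Tokenize only, then apply each component in order
doc = nlp.make_doc(text)
for name, proc in nlp.pipeline:
    doc = proc(doc)
manual_sents = [sent.text for sent in doc.sents]

# Calling nlp directly performs the same steps
direct_sents = [sent.text for sent in nlp(text).sents]

print(nlp.pipe_names)
print(manual_sents == direct_sents)
```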

### Built-in pipeline components {#built-in}

spaCy ships with several built-in pipeline components that are registered with
string names. This means that you can initialize them by calling
[`nlp.add_pipe`](/api/language#add_pipe) with their names and spaCy will know
how to create them. See the [API documentation](/api) for a full list of
available pipeline components and component functions.

> #### Usage
>
> ```python
> nlp = spacy.blank("en")
> nlp.add_pipe("sentencizer")
> # add_pipe returns the added component
> ruler = nlp.add_pipe("entity_ruler")
> ```

| String name          | Component                                            | Description                                                                               |
| -------------------- | ---------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `tagger`             | [`Tagger`](/api/tagger)                              | Assign part-of-speech tags.                                                               |
| `parser`             | [`DependencyParser`](/api/dependencyparser)          | Assign dependency labels.                                                                 |
| `ner`                | [`EntityRecognizer`](/api/entityrecognizer)          | Assign named entities.                                                                    |
| `entity_linker`      | [`EntityLinker`](/api/entitylinker)                  | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
| `entity_ruler`       | [`EntityRuler`](/api/entityruler)                    | Assign named entities based on pattern rules and dictionaries.                            |
| `textcat`            | [`TextCategorizer`](/api/textcategorizer)            | Assign text categories: exactly one category is predicted per document.                   |
| `textcat_multilabel` | [`MultiLabel_TextCategorizer`](/api/textcategorizer) | Assign text categories in a multi-label setting: zero, one or more labels per document.   |
| `lemmatizer`         | [`Lemmatizer`](/api/lemmatizer)                      | Assign base forms to words.                                                               |
| `morphologizer`      | [`Morphologizer`](/api/morphologizer)                | Assign morphological features and coarse-grained POS tags.                                |
| `attribute_ruler`    | [`AttributeRuler`](/api/attributeruler)              | Assign token attribute mappings and rule-based exceptions.                                |
| `senter`             | [`SentenceRecognizer`](/api/sentencerecognizer)      | Assign sentence boundaries.                                                               |
| `sentencizer`        | [`Sentencizer`](/api/sentencizer)                    | Add rule-based sentence segmentation without the dependency parse.                        |
| `tok2vec`            | [`Tok2Vec`](/api/tok2vec)                            | Assign token-to-vector embeddings.                                                        |
| `transformer`        | [`Transformer`](/api/transformer)                    | Assign the tokens and outputs of a transformer model.                                     |
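
As a small illustration of adding built-in components by their string names, the sketch below assumes a blank English pipeline and uses two rule-based entries from the table. The pattern and the example text are made up for this illustration.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# add_pipe returns the component, so it can be configured right away
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])  # illustrative pattern

doc = nlp("Explosion develops spaCy. The library is open source.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
sents = [sent.text for sent in doc.sents]

print(nlp.pipe_names)
print(ents)
```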

### Disabling, excluding and modifying components {#disabling}

If you don't need a particular component of the pipeline, for example the
tagger or the parser, you can **disable or exclude** it. This can sometimes make
a big difference and improve loading and inference speed. There are two
different mechanisms you can use:

1. **Disable:** The component and its data will be loaded with the pipeline, but
   it will be disabled by default and not run as part of the processing
   pipeline. To run it, you can explicitly enable it by calling
   [`nlp.enable_pipe`](/api/language#enable_pipe). When you save out the `nlp`
   object, the disabled component will be included but disabled by default.
2. **Exclude:** Don't load the component and its data with the pipeline. Once
   the pipeline is loaded, there will be no reference to the excluded component.

Disabled and excluded component names can be provided to
[`spacy.load`](/api/top-level#spacy.load) as a list.

> #### 💡 Optional pipeline components
>
> The `disable` mechanism makes it easy to distribute pipeline packages with
> optional components that you can enable or disable at runtime. For instance,
> your pipeline may include a statistical _and_ a rule-based component for
> sentence segmentation, and you can choose which one to run depending on your
> use case.
>
> For example, spaCy's [trained pipelines](/models) like
> [`en_core_web_sm`](/models/en#en_core_web_sm) contain both a `parser` and
> `senter` that perform sentence segmentation, but the `senter` is disabled by
> default.

```python
import spacy

# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])

# Load the tagger and parser but don't enable them
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
# Explicitly enable the tagger later on
nlp.enable_pipe("tagger")
```
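
To see the difference between the two mechanisms without a trained package, here is a rough sketch. It assumes a small rule-based pipeline that is saved to a temporary directory and loaded back with `disable` and with `exclude`. The directory, the pattern and the variable names are illustrative; `pipe_names` lists the active components, while `component_names` and `disabled` also account for the disabled ones.

```python
import tempfile
from pathlib import Path

import spacy

# Assumption: a small rule-based pipeline stands in for a trained package
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    # Loaded but switched off: still listed in component_names
    nlp_disabled = spacy.load(Path(tmp_dir), disable=["entity_ruler"])
    # Never loaded: no reference to the component remains
    nlp_excluded = spacy.load(Path(tmp_dir), exclude=["entity_ruler"])

active_before = list(nlp_disabled.pipe_names)
disabled_before = list(nlp_disabled.disabled)
print(active_before, list(nlp_disabled.component_names), disabled_before)
print(list(nlp_excluded.pipe_names), list(nlp_excluded.component_names))

# A disabled component can be switched back on later
nlp_disabled.enable_pipe("entity_ruler")
print(nlp_disabled.pipe_names)
```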

<Infobox variant="warning" title="Changed in v3.0">

As of v3.0, the `disable` keyword argument specifies components to load but
disable, instead of components to not load at all. Those components can now be
specified separately using the new `exclude` keyword argument.

</Infobox>

As a shortcut, you can use the [`nlp.select_pipes`](/api/language#select_pipes)
context manager to temporarily disable certain components for a given block. At
the end of the `with` block, the disabled pipeline components will be restored
automatically. Alternatively, `select_pipes` returns an object that lets you
call its `restore()` method to restore the disabled components when needed. This
can be useful if you want to prevent unnecessary code indentation of large
blocks.

```python
### Disable for block
# 1. Use as a context manager
with nlp.select_pipes(disable=["tagger", "parser", "lemmatizer"]):
    doc = nlp("I won't be tagged and parsed")
doc = nlp("I will be tagged and parsed")

# 2. Restore manually
disabled = nlp.select_pipes(disable="ner")
doc = nlp("I won't have named entities")
disabled.restore()
```

If you want to disable all pipes except for one or a few, you can use the
`enable` keyword. Just like the `disable` keyword, it takes a list of pipe
names, or a string defining just one pipe.

```python
# Enable only the parser
with nlp.select_pipes(enable="parser"):
    doc = nlp("I will only be parsed")
```
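
The effect of `select_pipes` on the active components can be checked through `nlp.pipe_names`. The sketch below assumes a blank English pipeline with two rule-based components standing in for a trained pipeline, and the variable names are illustrative. It covers both the context manager and the manual `restore()` call.

```python
import spacy

# Assumption: two rule-based components stand in for a trained pipeline
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# 1. Context manager: the component is restored when the block ends
with nlp.select_pipes(disable=["entity_ruler"]):
    names_inside = list(nlp.pipe_names)
names_after = list(nlp.pipe_names)

# 2. Manual restore, here combined with the enable keyword
disabled = nlp.select_pipes(enable="sentencizer")
names_manual = list(nlp.pipe_names)
disabled.restore()
names_restored = list(nlp.pipe_names)

print(names_inside, names_after)
print(names_manual, names_restored)
```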

The [`nlp.pipe`](/api/language#pipe) method also supports a `disable` keyword
argument if you only want to disable components during processing:

```python
for doc in nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"]):
    # Do something with the doc here
    pass
```
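
As a rough check that `disable` on `nlp.pipe` only affects that call, the following sketch assumes a blank English pipeline with an `entity_ruler` and one made-up pattern. It compares the number of entities with and without the component, then prints the pipeline's component names afterwards.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion"}])  # illustrative pattern

texts = ["Explosion develops spaCy.", "Explosion is based in Berlin."]

# Entity counts per text with the full pipeline
with_ents = [len(doc.ents) for doc in nlp.pipe(texts)]
# Entity counts when the entity ruler is skipped for this call only
without_ents = [len(doc.ents) for doc in nlp.pipe(texts, disable=["entity_ruler"])]

print(with_ents, without_ents)
print(nlp.pipe_names)
```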
+								Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
 								to remove pipeline components from an existing pipeline, the
 								[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
 								[`replace_pipe`](/api/language#replace_pipe) method to replace them with a
 								custom component entirely (more details on this in the section on
 								[custom components](#custom-components)).
 								```python
 								nlp.remove_pipe("parser")
 								nlp.rename_pipe("ner", "entityrecognizer")
 								nlp.replace_pipe("tagger", "my_custom_tagger")
 								```
 								The `Language` object exposes different [attributes](/api/language#attributes)
 								that let you inspect all available components and the components that currently
 								run as part of the pipeline.
 								> #### Example
 								>
 								> ```python
 								> nlp = spacy.blank("en")
 								> nlp.add_pipe("ner")
 								> nlp.add_pipe("textcat")
 								> assert nlp.pipe_names == ["ner", "textcat"]
 								> nlp.disable_pipe("ner")
 								> assert nlp.pipe_names == ["textcat"]
 								> assert nlp.component_names == ["ner", "textcat"]
 								> assert nlp.disabled == ["ner"]
 								> ```
 								| Name                  | Description                                                      |
 								| --------------------- | ---------------------------------------------------------------- |
 								| `nlp.pipeline`        | `(name, component)` tuples of the processing pipeline, in order. |
 								| `nlp.pipe_names`      | Pipeline component names, in order.                              |
 								| `nlp.components`      | All `(name, component)` tuples, including disabled components.   |
 								| `nlp.component_names` | All component names, including disabled components.              |
 								| `nlp.disabled`        | Names of components that are currently disabled.                 |
 								### Sourcing components from existing pipelines {#sourced-components new="3"}
 								Pipeline components that are independent can also be reused across pipelines.
 								Instead of adding a new blank component, you can also copy an existing component
 								from a trained pipeline by setting the `source` argument on
 								[`nlp.add_pipe`](/api/language#add_pipe). The first argument will then be
 								interpreted as the name of the component in the source pipeline – for instance,
 								`"ner"`. This is especially useful for
 								[training a pipeline](/usage/training#config-components) because it lets you mix
 								and match components and create fully custom pipeline packages with updated
 								trained components and new components trained on your data.
 								<Infobox variant="warning" title="Important note for trained components">
 								When reusing components across pipelines, keep in mind that the **vocabulary**,
 								**vectors** and model settings **must match**. If a trained pipeline includes
 								[word vectors](/usage/linguistic-features#vectors-similarity) and the component
 								uses them as features, the pipeline you copy it to needs to have the _same_
 								vectors available – otherwise, it won't be able to make the same predictions.
 								</Infobox>
 								> #### In training config
 								>
 								> Instead of providing a `factory`, component blocks in the training
 								> [config](/usage/training#config) can also define a `source`. The string needs
 								> to be a loadable spaCy pipeline package or path.
 								>
 								> ```ini
 								> [components.ner]
 								> source = "en_core_web_sm"
 								> component = "ner"
 								> ```
 								>
 								> By default, sourced components will be updated with your data during training.
 								> If you want to preserve the component as-is, you can "freeze" it if the
 								> pipeline is not using a shared `Tok2Vec` layer:
 								>
 								> ```ini
 								> [training]
 								> frozen_components = ["ner"]
 								> ```
 								```python
 								### {executable="true"}
 								import spacy
 								# The source pipeline with different components
 								source_nlp = spacy.load("en_core_web_sm")
 								print(source_nlp.pipe_names)
 								# Add only the entity recognizer to the new blank pipeline
 								nlp = spacy.blank("en")
 								nlp.add_pipe("ner", source=source_nlp)
 								print(nlp.pipe_names)
 								```
 								### Analyzing pipeline components {#analysis new="3"}
 								The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the
 								components in the current pipeline and outputs information about them like the
 								attributes they set on the [`Doc`](/api/doc) and [`Token`](/api/token), whether
 								they retokenize the `Doc` and which scores they produce during training. It will
 								also show warnings if components require values that aren't set by a previous
 								component – for instance, if the entity linker is used but no component that
 								runs before it sets named entities. Setting `pretty=True` will pretty-print a
 								table instead of only returning the structured data.
 								> #### ✏️ Things to try
 								>
 								> 1. Add the components `"ner"` and `"sentencizer"` _before_ the
 								>    `"entity_linker"`. The analysis should now show no problems, because
 								>    requirements are met.
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.blank("en")
 								nlp.add_pipe("tagger")
 								# This is a problem because it needs entities and sentence boundaries
 								nlp.add_pipe("entity_linker")
 								analysis = nlp.analyze_pipes(pretty=True)
 								```
 								<Accordion title="Example output">
 								```json
 								### Structured
 								{
 								  "summary": {
 								    "tagger": {
 								      "assigns": ["token.tag"],
 								      "requires": [],
 								      "scores": ["tag_acc", "pos_acc", "lemma_acc"],
 								      "retokenizes": false
 								    },
 								    "entity_linker": {
 								      "assigns": ["token.ent_kb_id"],
 								      "requires": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"],
 								      "scores": [],
 								      "retokenizes": false
 								    }
 								  },
 								  "problems": {
 								    "tagger": [],
 								    "entity_linker": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"]
 								  },
 								  "attrs": {
 								    "token.ent_iob": { "assigns": [], "requires": ["entity_linker"] },
 								    "doc.ents": { "assigns": [], "requires": ["entity_linker"] },
 								    "token.ent_kb_id": { "assigns": ["entity_linker"], "requires": [] },
 								    "doc.sents": { "assigns": [], "requires": ["entity_linker"] },
 								    "token.tag": { "assigns": ["tagger"], "requires": [] },
 								    "token.ent_type": { "assigns": [], "requires": ["entity_linker"] }
 								  }
 								}
 								```
 								```
 								### Pretty
 								============================= Pipeline Overview =============================
 								#   Component       Assigns           Requires         Scores        Retokenizes
 								-   -------------   ---------------   --------------   -----------   -----------
 								0   tagger          token.tag                          tag_acc       False
 								1   entity_linker   token.ent_kb_id   doc.ents         nel_micro_f   False
 								                                      doc.sents        nel_micro_r
 								                                      token.ent_iob    nel_micro_p
 								                                      token.ent_type
 								================================ Problems (4) ================================
 								⚠ 'entity_linker' requirements not met: doc.ents, doc.sents,
 								token.ent_iob, token.ent_type
 								```
 								</Accordion>
 								<Infobox variant="warning" title="Important note">
 								The pipeline analysis is static and does **not actually run the components**.
 								This means that it relies on the information provided by the components
 								themselves. If a custom component declares that it assigns an attribute but it
 								doesn't, the pipeline analysis won't catch that.
 								</Infobox>
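
To make the idea of a static check concrete, here is a minimal sketch in plain Python. It is not spaCy's implementation: the `find_problems` helper and the `(name, assigns, requires)` tuples are hypothetical stand-ins for the metadata that components declare about themselves. The check only compares declarations and never runs a component, which is why a wrong declaration would go unnoticed.

```python
from typing import Dict, List, Tuple

# Hypothetical metadata: (component name, attributes it assigns, attributes it requires)
ComponentMeta = Tuple[str, List[str], List[str]]


def find_problems(pipeline: List[ComponentMeta]) -> Dict[str, List[str]]:
    """For each component, list required attributes that no earlier component declares."""
    available = set()
    problems = {}
    for name, assigns, requires in pipeline:
        # Only declarations are compared; no component is ever called
        problems[name] = [attr for attr in requires if attr not in available]
        available.update(assigns)
    return problems


pipeline = [
    ("tagger", ["token.tag"], []),
    ("entity_linker", ["token.ent_kb_id"], ["doc.ents", "doc.sents"]),
]
problems = find_problems(pipeline)
print(problems)
```

In this toy version, placing components that declare `doc.ents` and `doc.sents` before the entity linker empties its list of problems, mirroring the "Things to try" exercise above.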
 								## Creating custom pipeline components {#custom-components}
 								A pipeline component is a function that receives a `Doc` object, modifies it and
 								returns it – for example, by using the current weights to make a prediction and
 								set some annotation on the document. By adding a component to the pipeline,
 								you'll get access to the `Doc` at any point **during processing** – instead of
 								only being able to modify it afterwards.
 								> #### Example
 								>
 								> ```python
 								> from spacy.language import Language
 								>
 								> @Language.component("my_component")
 								> def my_component(doc):
 								>    # Do something to the doc here
 								>    return doc
 								> ```
 								| Argument    | Type              | Description                                            |
 								| ----------- | ----------------- | ------------------------------------------------------ |
 								| `doc`       | [`Doc`](/api/doc) | The `Doc` object processed by the previous component.  |
 								| **RETURNS** | [`Doc`](/api/doc) | The `Doc` object processed by this pipeline component. |
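
The contract in the table (take the document, modify it, return it) can be illustrated without spaCy at all. The sketch below uses a plain dictionary as a hypothetical stand-in for the `Doc`, and the helpers `tokenize`, `count_tokens`, `flag_long` and `run_pipeline` are invented for illustration only. It shows why each component has to return the document: the return value is what the next component receives.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a Doc: a plain dict, not spaCy's Doc class
ToyDoc = Dict[str, object]


def tokenize(text: str) -> ToyDoc:
    """Create the initial document by splitting on whitespace."""
    return {"text": text, "tokens": text.split()}


def count_tokens(doc: ToyDoc) -> ToyDoc:
    """Component 1: set an annotation and return the same document."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc


def flag_long(doc: ToyDoc) -> ToyDoc:
    """Component 2: relies on the annotation set by the previous component."""
    doc["is_long"] = doc["n_tokens"] > 3
    return doc


def run_pipeline(text: str, components: List[Callable[[ToyDoc], ToyDoc]]) -> ToyDoc:
    """Tokenize, then call each component on the document, in order."""
    doc = tokenize(text)
    for component in components:
        doc = component(doc)
    return doc


toy_doc = run_pipeline("This is a text", [count_tokens, flag_long])
print(toy_doc["n_tokens"], toy_doc["is_long"])
```

Swapping the order of the two components would fail in this toy version, because `flag_long` reads a value that `count_tokens` has not set yet.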
 								The [`@Language.component`](/api/language#component) decorator lets you turn a
 								simple function into a pipeline component. It takes at least one argument, the
 								**name** of the component factory. You can use this name to add an instance of
 								your component to the pipeline. It can also be listed in your pipeline config,
 								so you can save, load and train pipelines using your component.
 								Custom components can be added to the pipeline using the
 								[`add_pipe`](/api/language#add_pipe) method. Optionally, you can either specify
 								a component to add it **before or after**, tell spaCy to add it **first or
 								last** in the pipeline, or define a **custom name**. If no name is set and no
 								`name` attribute is present on your component, the function name is used.
 								> #### Example
 								>
 								> ```python
 								> nlp.add_pipe("my_component")
 								> nlp.add_pipe("my_component", first=True)
 								> nlp.add_pipe("my_component", before="parser")
 								> ```
 								| Argument | Description                                                                       |
 								| -------- | --------------------------------------------------------------------------------- |
 								| `last`   | If set to `True`, component is added **last** in the pipeline (default). ~~bool~~ |
 								| `first`  | If set to `True`, component is added **first** in the pipeline. ~~bool~~          |
 								| `before` | String name or index to add the new component **before**. ~~Union[str, int]~~     |
 								| `after`  | String name or index to add the new component **after**. ~~Union[str, int]~~      |
-												Update docs [ci skip]

											
										
										
											2020-07-27 01:29:45 +03:00
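As a minimal sketch of how these arguments combine, the snippet below adds the same stateless component several times to a blank English pipeline. The component name `"demo_component"` and the instance names are hypothetical and only serve to make the resulting order visible.

```python
import spacy
from spacy.language import Language


# Hypothetical no-op component, registered only for this illustration
@Language.component("demo_component")
def demo_component(doc):
    return doc


nlp = spacy.blank("en")  # blank pipeline, so no trained model is required
nlp.add_pipe("demo_component", name="middle")  # default: added last
nlp.add_pipe("demo_component", name="start", first=True)
nlp.add_pipe("demo_component", name="before_middle", before="middle")
nlp.add_pipe("demo_component", name="end", after="middle")
print(nlp.pipe_names)  # ['start', 'before_middle', 'middle', 'end']
```

Each instance needs its own `name`, because component names have to be unique within a pipeline.
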
<Infobox title="Changed in v3.0" variant="warning">

As of v3.0, components need to be registered using the
[`@Language.component`](/api/language#component) or
[`@Language.factory`](/api/language#factory) decorator so spaCy knows that a
function is a component. [`nlp.add_pipe`](/api/language#add_pipe) now takes the
**string name** of the component factory instead of the component function. This
not only saves you lines of code, it also allows spaCy to validate and track
your custom components, and to make sure they can be saved and loaded.

```diff
- ruler = nlp.create_pipe("entity_ruler")
- nlp.add_pipe(ruler)
+ ruler = nlp.add_pipe("entity_ruler")
```

</Infobox>

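The change can be tried out with a small sketch like the one below, assuming spaCy v3 and a blank English pipeline. The component name `"noop_component"` is hypothetical. Passing the function itself, as in v2, is expected to be rejected with a `ValueError`, while the registered string name is accepted.

```python
import spacy
from spacy.language import Language


# Hypothetical component that returns the Doc unchanged
@Language.component("noop_component")
def noop_component(doc):
    return doc


nlp = spacy.blank("en")

try:
    # v2-style call: passing the function instead of its registered name
    nlp.add_pipe(noop_component)
except ValueError as err:
    print("Rejected:", type(err).__name__)

# v3-style call: pass the string name the component was registered under
nlp.add_pipe("noop_component")
print(nlp.pipe_names)  # ['noop_component']
```
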
### Examples: Simple stateless pipeline components {#custom-components-simple}

The following component receives the `Doc` in the pipeline and prints some
information about it: the number of tokens, the part-of-speech tags of the
tokens and a conditional message based on the document length. The
[`@Language.component`](/api/language#component) decorator lets you register the
component under the name `"info_component"`.

> #### ✏️ Things to try
>
> 1. Add the component first in the pipeline by setting `first=True`. You'll see
>    that the part-of-speech tags are empty, because the component now runs
>    before the tagger and the tags aren't available yet.
> 2. Change the component `name` or remove the `name` argument. You should see
>    this change reflected in `nlp.pipe_names`.
> 3. Print `nlp.pipeline`. You'll see a list of tuples describing the component
>    name and the function that's called on the `Doc` object in the pipeline.
> 4. Change the first argument to `@Language.component`, the name, to something
>    else. spaCy should now complain that it doesn't know a component of the
>    name `"info_component"`.

```python
### {executable="true"}
import spacy
from spacy.language import Language

@Language.component("info_component")
def my_component(doc):
    print(f"After tokenization, this doc has {len(doc)} tokens.")
    print("The part-of-speech tags are:", [token.pos_ for token in doc])
    if len(doc) < 10:
        print("This is a pretty short document.")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("info_component", name="print_info", last=True)
print(nlp.pipe_names)  # 'print_info' is now the last name in the list
doc = nlp("This is a sentence.")
```

Here's another example of a pipeline component that implements custom logic to
improve the sentence boundaries set by the dependency parser. The custom logic
should therefore be applied **after** tokenization, but **before** the
dependency parsing. This way, the parser can also take advantage of the sentence
boundaries.

> #### ✏️ Things to try
>
> 1. Print `[token.dep_ for token in doc]` with and without the custom pipeline
>    component. You'll see that the predicted dependency parse changes to match
>    the sentence boundaries.
> 2. Remove the `else` block. All other tokens will now have `is_sent_start` set
>    to `None` (missing value), so the parser will assign sentence boundaries in
>    between.

```python
### {executable="true"}
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i + 1].is_title:
            doc[i + 1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i + 1].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentencizer", before="parser")  # Insert before the parser
doc = nlp("This is. A sentence. | This is. Another sentence.")
for sent in doc.sents:
    print(sent.text)
```

### Component factories and stateful components {#custom-components-factories}

Component factories are callables that take settings and return a **pipeline
component function**. This is useful if your component is stateful and you need
to customize its creation, or if you need access to the current `nlp` object or
the shared vocab. Component factories can be registered using the
[`@Language.factory`](/api/language#factory) decorator and they need at least
**two named arguments** that are filled in automatically when the component is
added to the pipeline:

> #### Example
>
> ```python
> from spacy.language import Language
>
> @Language.factory("my_component")
> def my_component(nlp, name):
>     return MyComponent()
> ```

| Argument | Description                                                                                                                       |
| -------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `nlp`    | The current `nlp` object. Can be used to access the shared vocab. ~~Language~~                                                    |
| `name`   | The **instance name** of the component in the pipeline. This lets you identify different instances of the same component. ~~str~~ |

-												Update docs [ci skip]

											
										
										
											2020-07-27 01:29:45 +03:00
 								All other settings can be passed in by the user via the `config` argument on
 								[`nlp.add_pipe`](/api/language). The
 								[`@Language.factory`](/api/language#factory) decorator also lets you define a
 								`default_config` that's used as a fallback.
```python
### With config {highlight="4,9"}
import spacy
from spacy.language import Language

@Language.factory("my_component", default_config={"some_setting": True})
def my_component(nlp, name, some_setting: bool):
    return MyComponent(some_setting=some_setting)

nlp = spacy.blank("en")
nlp.add_pipe("my_component", config={"some_setting": False})
```

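
To see how `name`, `config` and `default_config` interact, here's a minimal
runnable sketch. The factory name `"greeter_demo"`, the `Greeter` class and the
`greeting` setting are made up for illustration, and the result is written to
`doc.user_data` only to keep the example short.

```python
import spacy
from spacy.language import Language


class Greeter:
    # Hypothetical stateful component: stores the settings it was created with
    def __init__(self, name: str, greeting: str):
        self.name = name
        self.greeting = greeting

    def __call__(self, doc):
        # Record which component instance processed the doc
        doc.user_data[self.name] = self.greeting
        return doc


@Language.factory("greeter_demo", default_config={"greeting": "hello"})
def create_greeter(nlp, name, greeting: str):
    # nlp and name are filled in by spaCy, greeting comes from the config
    return Greeter(name, greeting)


nlp = spacy.blank("en")
# No config passed in: the component falls back to default_config
nlp.add_pipe("greeter_demo", name="default_greeter")
# Same factory, different instance name and an overridden setting
nlp.add_pipe("greeter_demo", name="custom_greeter", config={"greeting": "hi"})

doc = nlp("This is a text")
print(nlp.pipe_names)
print(doc.user_data)
```

Both instances come from the same factory, but each one receives its own
instance `name` and its own resolved settings.
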
<Accordion title="How is @Language.factory different from @Language.component?" id="factories-decorator-component">

The [`@Language.component`](/api/language#component) decorator is essentially a
**shortcut** for stateless pipeline components that don't need any settings.
This means you don't always have to write a function that returns your function
if there's no state to be passed through – spaCy can just take care of this for
you. The following two code examples are equivalent:

```python
# Stateless component with @Language.factory
@Language.factory("my_component")
def create_my_component(nlp, name):
    def my_component(doc):
        # Do something to the doc
        return doc

    return my_component

# Stateless component with @Language.component
@Language.component("my_component")
def my_component(doc):
    # Do something to the doc
    return doc
```

</Accordion>

<Accordion title="Can I add the @Language.factory decorator to a class?" id="factories-class-decorator" spaced>

Yes, the [`@Language.factory`](/api/language#factory) decorator can be added to
a function or a class. If it's added to a class, it expects the `__init__`
method to take the arguments `nlp` and `name`, and will populate all other
arguments from the config. That said, it's often cleaner and more intuitive to
make your factory a separate function. That's also how spaCy does it internally.

</Accordion>
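
As a rough sketch of the class-based variant described above, the decorator can
sit directly on the component class. The factory name `"prefixer_demo"`, the
`Prefixer` class and its `prefix` setting are hypothetical and only serve to
show how the `__init__` arguments are filled in.

```python
import spacy
from spacy.language import Language


@Language.factory("prefixer_demo", default_config={"prefix": ">> "})
class Prefixer:
    # __init__ receives nlp and name from spaCy and prefix from the config
    def __init__(self, nlp, name, prefix: str):
        self.name = name
        self.prefix = prefix

    def __call__(self, doc):
        # Store the prefixed text so the effect of the setting is visible
        doc.user_data["prefixed"] = self.prefix + doc.text
        return doc


nlp = spacy.blank("en")
nlp.add_pipe("prefixer_demo", config={"prefix": "# "})
doc = nlp("This is a text")
print(doc.user_data["prefixed"])
```
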

### Language-specific factories {#factories-language new="3"}

There are many use cases where you might want your pipeline components to be
language-specific. Sometimes this requires an entirely different implementation
per language, sometimes the only difference is in the settings or data. spaCy
allows you to register factories of the **same name** on both the `Language`
base class and its **subclasses** like `English` or `German`. Factories are
resolved starting with the specific subclass. If the subclass doesn't define a
component of that name, spaCy will check the `Language` base class.

Here's an example of a pipeline component that overwrites the normalized form of
a token, `Token.norm_`, with an entry from a language-specific lookup table.
It's registered twice under the name `"token_normalizer"` – once using
`@English.factory` and once using `@German.factory`:

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.lang.de import German

class TokenNormalizer:
    def __init__(self, norm_table):
        self.norm_table = norm_table

    def __call__(self, doc):
        for token in doc:
            # Overwrite the token.norm_ if there's an entry in the data
            token.norm_ = self.norm_table.get(token.text, token.norm_)
        return doc

@English.factory("token_normalizer")
def create_en_normalizer(nlp, name):
    return TokenNormalizer({"realise": "realize", "colour": "color"})

@German.factory("token_normalizer")
def create_de_normalizer(nlp, name):
    return TokenNormalizer({"daß": "dass", "wußte": "wusste"})

nlp_en = English()
nlp_en.add_pipe("token_normalizer")  # uses the English factory
print([token.norm_ for token in nlp_en("realise colour daß wußte")])

nlp_de = German()
nlp_de.add_pipe("token_normalizer")  # uses the German factory
print([token.norm_ for token in nlp_de("realise colour daß wußte")])
```

<Infobox title="Implementation details">

Under the hood, language-specific factories are added to the
[`factories` registry](/api/top-level#registry) prefixed with the language code,
e.g. `"en.token_normalizer"`. When resolving the factory in
[`nlp.add_pipe`](/api/language#add_pipe), spaCy first checks for a
language-specific version of the factory using `nlp.lang` and if none is
available, falls back to looking up the regular factory name.

</Infobox>
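
The fallback to the `Language` base class can be illustrated with a small
sketch. The factory name `"source_tagger_demo"` and the `make_tagger` helper are
invented for this example. The component only records which factory created it,
so the resolution order is easy to inspect.

```python
from spacy.language import Language
from spacy.lang.en import English
from spacy.lang.de import German


def make_tagger(label: str):
    # Hypothetical helper: builds a component that records which factory made it
    def tag_source(doc):
        doc.user_data["factory_used"] = label
        return doc

    return tag_source


@Language.factory("source_tagger_demo")
def create_generic(nlp, name):
    return make_tagger("generic")


@English.factory("source_tagger_demo")
def create_english(nlp, name):
    return make_tagger("english")


nlp_en = English()
nlp_en.add_pipe("source_tagger_demo")  # resolves the English-specific factory
nlp_de = German()
nlp_de.add_pipe("source_tagger_demo")  # no German factory, falls back to Language

en_result = nlp_en("text").user_data["factory_used"]
de_result = nlp_de("text").user_data["factory_used"]
print(en_result, de_result)
```
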

### Example: Stateful component with settings {#example-stateful-components}

This example shows a **stateful** pipeline component for handling acronyms:
based on a dictionary, it will detect acronyms and their expanded forms in both
directions and add them to a list as the custom `doc._.acronyms`
[extension attribute](#custom-components-attributes). Under the hood, it uses
the [`PhraseMatcher`](/api/phrasematcher) to find instances of the phrases.

The factory function takes three arguments: the shared `nlp` object and
component instance `name`, which are passed in automatically by spaCy, and a
`case_sensitive` config setting that makes the matching and acronym detection
case-sensitive.

> #### ✏️ Things to try
>
> 1. Change the `config` passed to `nlp.add_pipe` and set `"case_sensitive"` to
>    `True`. You should see that the expanded acronym for "LOL" isn't detected
>    anymore.
> 2. Add some more terms to the `DICTIONARY` and update the processed text so
>    they're detected.
> 3. Add a `name` argument to `nlp.add_pipe` to change the component name. Print
>    `nlp.pipe_names` to see the change reflected in the pipeline.
> 4. Print the config of the current `nlp` object with
>    `print(nlp.config.to_str())` and inspect the `[components]` block. You
>    should see an entry for the acronyms component, referencing the factory
>    `acronyms` and the config settings.

```python
### {executable="true"}
from spacy.language import Language
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
import spacy

DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
DICTIONARY.update({value: key for key, value in DICTIONARY.items()})

@Language.factory("acronyms", default_config={"case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, case_sensitive: bool):
    return AcronymComponent(nlp, case_sensitive)

class AcronymComponent:
    def __init__(self, nlp: Language, case_sensitive: bool):
        # Create the matcher and match on Token.lower if case-insensitive
        matcher_attr = "TEXT" if case_sensitive else "LOWER"
        self.matcher = PhraseMatcher(nlp.vocab, attr=matcher_attr)
        self.matcher.add("ACRONYMS", [nlp.make_doc(term) for term in DICTIONARY])
        self.case_sensitive = case_sensitive
        # Register custom extension on the Doc
        if not Doc.has_extension("acronyms"):
            Doc.set_extension("acronyms", default=[])

    def __call__(self, doc: Doc) -> Doc:
        # Add the matched spans when doc is processed
        for _, start, end in self.matcher(doc):
            span = doc[start:end]
            acronym = DICTIONARY.get(span.text if self.case_sensitive else span.text.lower())
            doc._.acronyms.append((span, acronym))
        return doc

# Add the component to the pipeline and configure it
nlp = spacy.blank("en")
nlp.add_pipe("acronyms", config={"case_sensitive": False})

# Process a doc and see the results
doc = nlp("LOL, be right back")
print(doc._.acronyms)
```

## Initializing and serializing component data {#component-data}

Many stateful components depend on **data resources** like dictionaries and
lookup tables that should ideally be **configurable**. For example, it makes
sense to make the `DICTIONARY` in the above example an argument of the
registered function, so the `AcronymComponent` can be re-used with different
data. One logical solution would be to make it an argument of the component
factory, and allow it to be initialized with different dictionaries.

> #### config.cfg
>
> ```ini
> [components.acronyms.data]
> # 🚨 Problem: you don't want the data in the config
> lol = "laughing out loud"
> brb = "be right back"
> ```

```python
from typing import Dict
from spacy.language import Language

@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
    # 🚨 Problem: data ends up in the config file
    return AcronymComponent(nlp, data, case_sensitive)
```

However, passing in the dictionary directly is problematic, because it means
that if a component saves out its config and settings, the
[`config.cfg`](/usage/training#config) will include a dump of the entire data,
since that's the config the component was created with. It will also fail if the
data is not JSON-serializable.
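
The problem is easy to reproduce. In the following sketch, the factory name
`"acronyms_inline_demo"` and the no-op component are hypothetical stand-ins for
the acronym component. The point is only to show that whatever is passed in via
`config` ends up in `nlp.config`.

```python
from typing import Dict

import spacy
from spacy.language import Language


@Language.factory("acronyms_inline_demo", default_config={"data": {}})
def create_inline_acronyms(nlp: Language, name: str, data: Dict[str, str]):
    # Hypothetical no-op component: the data is accepted but not used here
    def acronym_component(doc):
        return doc

    return acronym_component


nlp = spacy.blank("en")
nlp.add_pipe(
    "acronyms_inline_demo",
    config={"data": {"lol": "laughing out loud", "brb": "be right back"}},
)
# The full dictionary is now part of the pipeline config
component_config = nlp.config["components"]["acronyms_inline_demo"]
print(component_config)
```
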

### Option 1: Using a registered function {#component-data-function}

<Infobox>

- ✅ **Pros:** can load anything in Python, easy to add to and configure via
  config
- ❌ **Cons:** requires the function and its dependencies to be available at
  runtime

</Infobox>

If what you're passing in isn't JSON-serializable – e.g. a custom object like a
[model](#trainable-components) – saving out the component config becomes
impossible because there's no way for spaCy to know _how_ that object was
created, and what to do to create it again. This makes it much harder to save,
load and train custom pipelines with custom components. A simple solution is to
**register a function** that returns your resources. The
[registry](/api/top-level#registry) lets you **map string names to functions**
that create objects, so given a name and optional arguments, spaCy will know how
to recreate the object. To register a function that returns your custom
dictionary, you can use the `@spacy.registry.misc` decorator with a single
argument, the name:

> #### What's the misc registry?
>
> The [`registry`](/api/top-level#registry) provides different categories for
> different types of functions – for example, model architectures, tokenizers or
> batchers. `misc` is intended for miscellaneous functions that don't fit
> anywhere else.

```python
### Registered function for assets {highlight="1"}
@spacy.registry.misc("acronyms.slang_dict.v1")
def create_acronyms_slang_dict():
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary
```

In your `default_config` (and later in your
[training config](/usage/training#config)), you can now refer to the function
registered under the name `"acronyms.slang_dict.v1"` using the `@misc` key. This
tells spaCy how to create the value, and when your component is created, the
result of the registered function is passed in as the key `"data"`.

> #### config.cfg
>
> ```ini
> [components.acronyms]
> factory = "acronyms"
>
> [components.acronyms.data]
> @misc = "acronyms.slang_dict.v1"
> ```

```diff
- default_config = {"data": DICTIONARY}
+ default_config = {"data": {"@misc": "acronyms.slang_dict.v1"}}
```

Using a registered function also means that you can easily include your custom
components in pipelines that you [train](/usage/training). To make sure spaCy
knows where to find your custom `@misc` function, you can pass in a Python file
via the argument `--code`. If someone else is using your component, all they
have to do to customize the data is to register their own function and swap out
the name. Registered functions can also take **arguments**, by the way, that can
be defined in the config as well – you can read more about this in the docs on
[training with custom code](/usage/training#custom-code).
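
Putting the pieces of this option together, a self-contained sketch could look
like this. The names `"acronyms_demo.slang_dict.v1"` and
`"acronyms_registered_demo"` and the simplified `AcronymLookup` class are
assumptions made for the example. It does plain token lookups instead of using
the `PhraseMatcher`.

```python
from typing import Dict

import spacy
from spacy.language import Language


@spacy.registry.misc("acronyms_demo.slang_dict.v1")
def create_slang_dict() -> Dict[str, str]:
    return {"lol": "laughing out loud", "brb": "be right back"}


class AcronymLookup:
    # Simplified stand-in for the AcronymComponent: single-token lookups only
    def __init__(self, data: Dict[str, str]):
        self.data = data

    def __call__(self, doc):
        doc.user_data["acronyms"] = [
            (token.text, self.data[token.lower_])
            for token in doc
            if token.lower_ in self.data
        ]
        return doc


@Language.factory(
    "acronyms_registered_demo",
    default_config={"data": {"@misc": "acronyms_demo.slang_dict.v1"}},
)
def create_acronym_lookup(nlp: Language, name: str, data: Dict[str, str]):
    # spaCy resolves the @misc reference and passes in the resulting dict
    return AcronymLookup(data)


nlp = spacy.blank("en")
component = nlp.add_pipe("acronyms_registered_demo")
doc = nlp("LOL, brb")
print(doc.user_data["acronyms"])
# The config only stores the reference to the function, not the data itself
component_config = nlp.config["components"]["acronyms_registered_demo"]
print(component_config)
```
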
-												Update docs [ci skip]

											
										
										
											2020-08-19 17:04:21 +03:00
-												Update docs [ci skip]

											
										
										
											2020-10-03 15:47:02 +03:00
+								### Option 2: Save data with the pipeline and load it in once on initialization {#component-data-initialization}
 								<Infobox>
 								- ✅ **Pros:** lets components save and load their own data and reflect user
 								  changes, load in data assets before training without depending on them at
 								  runtime
 								- ❌ **Cons:** requires more component methods, more complex config and data
 								  flow
-												Update docs [ci skip]

											
										
										
											2020-10-02 14:24:33 +03:00
-												Update docs [ci skip]

											
										
										
											2020-10-03 15:47:02 +03:00
+								</Infobox>
 								Just like models save out their binary weights when you call
 								[`nlp.to_disk`](/api/language#to_disk), components can also **serialize** any
 								other data assets – for instance, an acronym dictionary. If a pipeline component
 								implements its own `to_disk` and `from_disk` methods, those will be called
 								automatically by `nlp.to_disk` and will receive the path to the directory to
 								save to or load from. The component can then perform any custom saving or
 								loading. If a user makes changes to the component data, they will be reflected
 								when the `nlp` object is saved. For more examples of this, see the usage guide
 								on [serialization methods](/usage/saving-loading/#serialization-methods).
 								> #### About the data path
 								>
 								> The `path` argument spaCy passes to the serialization methods consists of the
 								> path provided by the user, plus a directory of the component name. This means
 								> that when you call `nlp.to_disk("/path")`, the `acronyms` component will
 								> receive the directory path `/path/acronyms` and can then create files in this
 								> directory.
 								```python
 								### Custom serialization methods {highlight="6-7,9-11"}
 								import srsly
 								class AcronymComponent:
 								    # other methods here...
 								    def to_disk(self, path, exclude=tuple()):
 								        srsly.write_json(path / "data.json", self.data)
 								    def from_disk(self, path, exclude=tuple()):
 								        self.data = srsly.read_json(path / "data.json")
 								        return self
 								```
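One way to sanity-check serialization methods like these is a round trip through a temporary directory. The following is a minimal sketch that runs without spaCy: it uses the standard library's `json` module where the example above uses `srsly`, and the sample data and the explicit `mkdir` call are assumptions made for illustration. The directory name `acronyms` mirrors the path convention described in the sidebar.

```python
import json
import tempfile
from pathlib import Path


class AcronymComponent:
    """Stand-in component: only the serialization methods are sketched."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def to_disk(self, path, exclude=tuple()):
        path = Path(path)
        # Assumption: create the component directory if it doesn't exist yet
        path.mkdir(parents=True, exist_ok=True)
        (path / "data.json").write_text(json.dumps(self.data), encoding="utf8")

    def from_disk(self, path, exclude=tuple()):
        path = Path(path)
        self.data = json.loads((path / "data.json").read_text(encoding="utf8"))
        return self


with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirrors the "/path/acronyms" directory the component would receive
    component_dir = Path(tmp_dir) / "acronyms"
    AcronymComponent({"LOL": "laughing out loud"}).to_disk(component_dir)
    restored = AcronymComponent().from_disk(component_dir)

print(restored.data)
```

If the restored data matches what was saved, the two methods are consistent with each other.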
 								Now the component can save to and load from a directory. The only remaining
 								question: How do you **load in the initial data**? In Python, you could just
 								call the pipe's `from_disk` method yourself. But if you're adding the component
 								to your [training config](/usage/training#config), spaCy will need to know how
 								to set it up, from start to finish, including the data to initialize it with.
 								While you could use a registered function or a file loader like
 								[`srsly.read_json.v1`](/api/top-level#file_readers) as an argument of the
 								component factory, this approach is problematic: the component factory runs
 								**every time the component is created**. This means it will run when creating
 								the `nlp` object before training, but also every time a user loads your
 								pipeline. So your runtime pipeline would either depend on a local path on your
 								file system, or it's loaded twice: once when the component is created, and then
 								again when the data is loaded by `from_disk`.
 								> ```ini
 								> ### config.cfg
 								> [components.acronyms.data]
 								> # 🚨 Problem: Runtime pipeline depends on local path
 								> @readers = "srsly.read_json.v1"
 								> path = "/path/to/slang_dict.json"
 								> ```
 								>
 								> ```ini
 								> ### config.cfg
 								> [components.acronyms.data]
 								> # 🚨 Problem: this always runs
 								> @misc = "acronyms.slang_dict.v1"
 								> ```
 								```python
 								@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
 								def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
 								    # 🚨 Problem: data will be loaded every time component is created
 								    return AcronymComponent(nlp, data, case_sensitive)
 								```
 								To solve this, your component can implement a separate method, `initialize`,
 								which will be called by [`nlp.initialize`](/api/language#initialize) if
 								available. This typically happens before training, but not at runtime when the
 								pipeline is loaded. For more background on this, see the usage guides on the
 								[config lifecycle](/usage/training#config-lifecycle) and
 								[custom initialization](/usage/training#initialization).
 								![Illustration of pipeline lifecycle](../images/lifecycle.svg)
 								A component's `initialize` method needs to take at least **two named
 								arguments**: a `get_examples` callback that gives it access to the training
 								examples, and the current `nlp` object. This is mostly used by trainable
 								components so they can initialize their models and label schemes from the data,
 								so we can ignore those arguments here. All **other arguments** on the method can
 								be defined via the config – in this case a dictionary `data`.
 								> #### config.cfg
 								>
 								> ```ini
 								> [initialize.components.my_component]
 								>
 								> [initialize.components.my_component.data]
 								> # ✅ This only runs on initialization
 								> @readers = "srsly.read_json.v1"
 								> path = "/path/to/slang_dict.json"
 								> ```
 								```python
 								### Custom initialize method {highlight="5-6"}
 								class AcronymComponent:
 								    def __init__(self):
 								        self.data = {}
 								    def initialize(self, get_examples=None, nlp=None, data={}):
 								        self.data = data
 								```
 								When [`nlp.initialize`](/api/language#initialize) runs before training (or when
 								you call it in your own code), the
 								[`[initialize]`](/api/data-formats#config-initialize) block of the config is
 								loaded and used to construct the `nlp` object. The custom acronym component will
 								then be passed the data loaded from the JSON file. After training, the `nlp`
 								object is saved to disk, which will run the component's `to_disk` method. When
 								the pipeline is loaded back into spaCy later to use it, the `from_disk` method
 								will load the data back in.
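The lifecycle described above can be sketched without spaCy. In this simplified stand-in, `initialize` is called by hand where `nlp.initialize` would call it before training, and a second, empty instance plays the role of the component at runtime. The variable names `trained` and `runtime`, the sample data and the use of `json` instead of `srsly` are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


class AcronymComponent:
    def __init__(self):
        self.data = {}

    def initialize(self, get_examples=None, nlp=None, data=None):
        # Receives the data resolved from the [initialize] block
        self.data = dict(data or {})

    def to_disk(self, path, exclude=tuple()):
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        (path / "data.json").write_text(json.dumps(self.data), encoding="utf8")

    def from_disk(self, path, exclude=tuple()):
        path = Path(path)
        self.data = json.loads((path / "data.json").read_text(encoding="utf8"))
        return self


with tempfile.TemporaryDirectory() as tmp_dir:
    component_dir = Path(tmp_dir) / "acronyms"

    # Before training: initialize once with the external data, then save
    trained = AcronymComponent()
    trained.initialize(data={"BRB": "be right back"})
    trained.to_disk(component_dir)

    # At runtime: a fresh component starts out empty and is restored from
    # disk; initialize is not called and the original JSON file isn't needed
    runtime = AcronymComponent()
    runtime.from_disk(component_dir)

print(runtime.data)
```

The runtime instance ends up with the same data as the trained one, even though it never saw the original file.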
 								## Python type hints and validation {#type-hints new="3"}
 								spaCy's configs are powered by our machine learning library Thinc's
 								[configuration system](https://thinc.ai/docs/usage-config), which supports
 								[type hints](https://docs.python.org/3/library/typing.html) and even
 								[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
 								using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your component
 								factory provides type hints, the values that are passed in will be **checked
 								against the expected types**. For example, if an argument is annotated as `int`
 								and the value can't be cast to an integer, spaCy will raise an error. `pydantic`
 								also provides strict types like `StrictFloat`, which will force the value to be
 								a float and raise an error if it's not – for instance, if your config defines an
 								integer.
 								<Infobox variant="warning">
 								If you're not using
 								[strict types](https://pydantic-docs.helpmanual.io/usage/types/#strict-types),
 								values that can be **cast to** the given type will still be accepted. For
 								example, `1` can be cast to a `float` or a `bool` type, but not to a
 								`List[str]`. However, if the type is
 								[`StrictFloat`](https://pydantic-docs.helpmanual.io/usage/types/#strict-types),
 								only a float will be accepted.
 								</Infobox>
 								The following example shows a custom pipeline component for debugging. It can be
 								added anywhere in the pipeline and logs information about the `nlp` object and
 								the `Doc` that passes through. The `log_level` config setting lets the user
 								customize what log statements are shown – for instance, `"INFO"` will show info
 								logs and more critical logging statements, whereas `"DEBUG"` will show
 								everything. The value is annotated as a `StrictStr`, so it will only accept a
 								string value.
 								> #### ✏️ Things to try
 								>
 								> 1. Change the `config` passed to `nlp.add_pipe` to use the log level `"INFO"`.
 								>    You should see that only the statement logged with `logger.info` is shown.
 								> 2. Change the `config` passed to `nlp.add_pipe` so that it contains unexpected
 								>    values – for example, a boolean instead of a string: `"log_level": False`.
 								>    You should see a validation error.
 								> 3. Check out the docs on `pydantic`'s
 								>    [constrained types](https://pydantic-docs.helpmanual.io/usage/types/#constrained-types)
 								>    and write a type hint for `log_level` that only accepts the exact string
 								>    values `"DEBUG"`, `"INFO"` or `"CRITICAL"`.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.language import Language
 								from spacy.tokens import Doc
 								from pydantic import StrictStr
 								import logging
 								@Language.factory("debug", default_config={"log_level": "DEBUG"})
 								class DebugComponent:
 								    def __init__(self, nlp: Language, name: str, log_level: StrictStr):
 								        self.logger = logging.getLogger(f"spacy.{name}")
 								        self.logger.setLevel(log_level)
 								        self.logger.info(f"Pipeline: {nlp.pipe_names}")
 								    def __call__(self, doc: Doc) -> Doc:
 								        is_tagged = doc.has_annotation("TAG")
 								        self.logger.debug(f"Doc: {len(doc)} tokens, is tagged: {is_tagged}")
 								        return doc
 								nlp = spacy.load("en_core_web_sm")
 								nlp.add_pipe("debug", config={"log_level": "DEBUG"})
 								doc = nlp("This is a text...")
 								```
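To see how the `log_level` setting filters statements, it can help to look at the standard library's `logging` module on its own. The sketch below doesn't involve spaCy or `pydantic`; the logger name `spacy.debug_sketch` is hypothetical. It shows that a level name passed as a string enables that level and anything more critical, and that `setLevel` rejects a level name it doesn't know.

```python
import logging

logger = logging.getLogger("spacy.debug_sketch")  # hypothetical logger name
logger.setLevel("INFO")  # setLevel accepts level names as strings

# INFO and more critical levels are enabled, DEBUG is filtered out
info_enabled = logger.isEnabledFor(logging.INFO)
debug_enabled = logger.isEnabledFor(logging.DEBUG)
critical_enabled = logger.isEnabledFor(logging.CRITICAL)
print(info_enabled, debug_enabled, critical_enabled)  # True False True

# An unknown level name is rejected by the logging module itself
try:
    logger.setLevel("VERBOSE")
    unknown_rejected = False
except ValueError:
    unknown_rejected = True
print(unknown_rejected)  # True
```

Validating the setting when the component is created reports a bad value as a config validation error, rather than leaving it to surface from a later call into `logging`.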
 								## Trainable components {#trainable-components new="3"}
 								spaCy's [`TrainablePipe`](/api/pipe) class helps you implement your own
 								trainable components that have their own model instance, make predictions over
 								`Doc` objects and can be updated using [`spacy train`](/api/cli#train). This
 								lets you plug fully custom machine learning components into your pipeline.
 								![Illustration of Pipe methods](../images/trainable_component.svg)
 								You'll need the following:
 								1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
 								   can be a model implemented in [Thinc](/usage/layers-architectures#thinc), or
 								   a [wrapped model](/usage/layers-architectures#frameworks) implemented in
 								   PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a
 								   list of [`Doc`](/api/doc) objects as input and can have any type of output.
 								2. **TrainablePipe subclass:** A subclass of [`TrainablePipe`](/api/pipe) that
 								   implements at least two methods: [`TrainablePipe.predict`](/api/pipe#predict)
 								   and [`TrainablePipe.set_annotations`](/api/pipe#set_annotations).
 								3. **Component factory:** A component factory registered with
 								   [`@Language.factory`](/api/language#factory) that takes the `nlp` object and
 								   component `name` and optional settings provided by the config and returns an
 								   instance of your trainable component.
 								> #### Example
 								>
 								> ```python
 								> from spacy.pipeline import TrainablePipe
 								> from spacy.language import Language
 								>
 								> class TrainableComponent(TrainablePipe):
 								>     def predict(self, docs):
 								>         ...
 								>
 								>     def set_annotations(self, docs, scores):
 								>         ...
 								>
 								> @Language.factory("my_trainable_component")
 								> def make_component(nlp, name, model):
 								>     return TrainableComponent(nlp.vocab, model, name=name)
 								> ```
 								| Name                                           | Description                                                                                                         |
 								| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
 								| [`predict`](/api/pipe#predict)                 | Apply the component's model to a batch of [`Doc`](/api/doc) objects (without modifying them) and return the scores. |
 								| [`set_annotations`](/api/pipe#set_annotations) | Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores generated by `predict`.                      |
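The division of labor between the two methods can be illustrated with plain Python objects. Everything in this sketch is a hypothetical stand-in: `SketchDoc` only mimics a `Doc` with a `cats` dictionary, `SketchComponent` does not subclass `TrainablePipe`, and the "model" is a trivial rule rather than a Thinc `Model`. The point is only that `predict` computes scores without touching the documents and `set_annotations` writes them afterwards.

```python
from typing import Dict, List


class SketchDoc:
    """Hypothetical stand-in for a Doc: some text plus a `cats` dict."""

    def __init__(self, text: str):
        self.text = text
        self.cats: Dict[str, float] = {}


class SketchComponent:
    """Hypothetical component mimicking the predict / set_annotations split."""

    def predict(self, docs: List[SketchDoc]) -> List[float]:
        # Compute one score per doc without modifying the docs
        return [1.0 if "!" in doc.text else 0.0 for doc in docs]

    def set_annotations(self, docs: List[SketchDoc], scores: List[float]) -> None:
        # Write the pre-computed scores onto the docs
        for doc, score in zip(docs, scores):
            doc.cats["EXCITED"] = score

    def __call__(self, doc: SketchDoc) -> SketchDoc:
        # Processing a single doc chains the two steps
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc


component = SketchComponent()
docs = [SketchDoc("Hello!"), SketchDoc("Hello.")]
scores = component.predict(docs)
cats_before = [dict(doc.cats) for doc in docs]  # still empty after predict
component.set_annotations(docs, scores)
cats_after = [doc.cats for doc in docs]
print(scores, cats_before, cats_after)
```

Keeping the two steps separate means the scores for a whole batch can be computed once and then applied, or inspected, independently.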
 								By default, [`TrainablePipe.__init__`](/api/pipe#init) takes the shared vocab,
 								the [`Model`](https://thinc.ai/docs/api-model) and the name of the component
 								instance in the pipeline, which you can use as a key in the losses. All other
 								keyword arguments will become available as [`TrainablePipe.cfg`](/api/pipe#cfg)
 								and will also be serialized with the component.
 								<Accordion title="Why components should be passed a Model instance, not create it" spaced>
 								spaCy's [config system](/usage/training#config) resolves the config describing
 								the pipeline components and models **bottom-up**. This means that it will
 								_first_ create a `Model` from a [registered architecture](/api/architectures),
 								validate its arguments and _then_ pass the object forward to the component. This
 								means that the config can express very complex, nested trees of objects – but
 								the objects don't have to pass the model settings all the way down to the
 								components. It also makes the components more **modular** and lets you
 								[swap](/usage/layers-architectures#swap-architectures) different architectures
 								in your config, and re-use model definitions.
 								```ini
 								### config.cfg (excerpt)
 								[components]
 								[components.textcat]
 								factory = "textcat"
 								labels = []
 								# The model defined here is created first and then passed to the "textcat" component as
 								# the argument "model"
 								[components.textcat.model]
 								@architectures = "spacy.TextCatBOW.v2"
 								exclusive_classes = true
 								ngram_size = 1
 								no_output_layer = false
 								[components.other_textcat]
 								factory = "textcat"
 								# This references the [components.textcat.model] block above
 								model = ${components.textcat.model}
 								labels = []
 								```
 								Your trainable pipeline component factories should therefore always take a
 								`model` argument instead of instantiating the
 								[`Model`](https://thinc.ai/docs/api-model) inside the component. To register
 								custom architectures, you can use the
 								[`@spacy.registry.architectures`](/api/top-level#registry) decorator. Also see
 								the [training guide](/usage/training#config) for details.
 								</Accordion>
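To make the idea of bottom-up resolution more concrete, here is a heavily simplified sketch in plain Python. It is not how Thinc's config system is implemented: the `ARCHITECTURES` dictionary, the `register` decorator, the `resolve` function and the `sketch.BagOfWords.v1` name are all hypothetical, and the "model" is just a dictionary. It only illustrates the order of operations: nested blocks are resolved first, and the component factory receives the finished object.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping architecture names to functions
ARCHITECTURES: Dict[str, Callable[..., Any]] = {}


def register(name: str):
    def decorator(func):
        ARCHITECTURES[name] = func
        return func

    return decorator


@register("sketch.BagOfWords.v1")
def build_bow(ngram_size: int, exclusive_classes: bool) -> Dict[str, Any]:
    # Stands in for a function that would return a Thinc Model
    return {
        "kind": "bow",
        "ngram_size": ngram_size,
        "exclusive_classes": exclusive_classes,
    }


def make_textcat(model: Any, labels: list) -> Dict[str, Any]:
    # Stands in for a component factory: it receives a ready-made model
    return {"component": "textcat", "model": model, "labels": labels}


def resolve(block: Any) -> Any:
    """Resolve nested blocks first, then call the registered function."""
    if not isinstance(block, dict):
        return block
    resolved = {key: resolve(value) for key, value in block.items()}
    if "@architectures" in resolved:
        func = ARCHITECTURES[resolved.pop("@architectures")]
        return func(**resolved)
    return resolved


config = {
    "labels": [],
    "model": {
        "@architectures": "sketch.BagOfWords.v1",
        "ngram_size": 1,
        "exclusive_classes": True,
    },
}
textcat = make_textcat(**resolve(config))
print(textcat["model"])
```

Because the factory only ever sees the finished `model` object, a different architecture can be swapped in by editing the config alone.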
 								For some use cases, it makes sense to also overwrite additional methods to
 								customize how the model is updated from examples, how it's initialized, how the
 								loss is calculated and to add evaluation scores to the training output.
+								| Name                                 | Description                                                                                                                                                                                                                                                                                                                                   |
 								| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 								| [`update`](/api/pipe#update)         | Learn from a batch of [`Example`](/api/example) objects containing the predictions and gold-standard annotations, and update the component's model.                                                                                                                                                                                           |
 								| [`initialize`](/api/pipe#initialize) | Initialize the model. Typically calls into [`Model.initialize`](https://thinc.ai/docs/api-model#initialize) and can be passed custom arguments via the [`[initialize]`](/api/data-formats#config-initialize) config block that are only loaded during training or when you call [`nlp.initialize`](/api/language#initialize), not at runtime. |
 								| [`get_loss`](/api/pipe#get_loss)     | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects.                                                                                                                                                                                                                                                 |
| [`score`](/api/pipe#score)           | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_score_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score.                            |

<Infobox title="Custom trainable components and models" emoji="📖">

For more details on how to implement your own trainable components and model
architectures, and plug existing models implemented in PyTorch or TensorFlow
into your spaCy pipeline, see the usage guide on
[layers and model architectures](/usage/layers-architectures#components).

</Infobox>

## Extension attributes {#custom-components-attributes new="2"}

spaCy allows you to set any custom attributes and methods on the `Doc`, `Span`
and `Token`, which become available as `Doc._`, `Span._` and `Token._` – for
example, `Token._.my_attr`. This lets you store additional information relevant
to your application, add new features and functionality to spaCy, and implement
your own models trained with other machine learning libraries. It also lets you
take advantage of spaCy's data structures and the `Doc` object as the "single
source of truth".

<Accordion title="Why ._ and not just a top-level attribute?" id="why-dot-underscore">

Writing to a `._` attribute instead of to the `Doc` directly keeps a clearer
separation and makes it easier to ensure backwards compatibility. For example,
if you've implemented your own `.coref` property and spaCy claims it one day,
it'll break your code. Similarly, just by looking at the code, you'll
immediately know what's built-in and what's custom – for example,
`doc.sentiment` is spaCy, while `doc._.sent_score` isn't.

</Accordion>

<Accordion title="How is the ._ implemented?" id="dot-underscore-implementation">

Extension definitions – the defaults, methods, getters and setters you pass in
to `set_extension` – are stored in class attributes on the `Underscore` class.
If you write to an extension attribute, e.g. `doc._.hello = True`, the data is
stored within the [`Doc.user_data`](/api/doc#attributes) dictionary. To keep the
underscore data separate from your other dictionary entries, the string `"._."`
is placed before the name, in a tuple.

</Accordion>
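
The storage described above can be inspected directly. The following is a small sketch that assumes a blank English pipeline. It only relies on the key being a tuple that starts with `"._."` followed by the extension name, and makes no assumption about the remaining tuple entries.

```python
import spacy
from spacy.tokens import Doc

# force=True so the sketch can be re-run in the same session
Doc.set_extension("hello", default=True, force=True)

nlp = spacy.blank("en")
doc = nlp("This is a text")
doc._.hello = False  # writing stores the value in doc.user_data

# Collect the keys that were written through the underscore
underscore_keys = [
    key for key in doc.user_data if isinstance(key, tuple) and key[0] == "._."
]
print(underscore_keys)
```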

---

There are three main types of extensions, which can be defined using the
[`Doc.set_extension`](/api/doc#set_extension),
[`Span.set_extension`](/api/span#set_extension) and
[`Token.set_extension`](/api/token#set_extension) methods.

1. **Attribute extensions.** Set a default value for an attribute, which can be
   overwritten manually at any time. Attribute extensions work like "normal"
   variables and are the quickest way to store arbitrary information on a `Doc`,
   `Span` or `Token`.

   ```python
   Doc.set_extension("hello", default=True)
   assert doc._.hello
   doc._.hello = False
   ```

2. **Property extensions.** Define a getter and an optional setter function. If
   no setter is provided, the extension is immutable. Since the getter and
   setter functions are only called when you _retrieve_ the attribute, you can
   also access values of previously added attribute extensions. For example, a
   `Doc` getter can average over `Token` attributes. For `Span` extensions,
   you'll almost always want to use a property – otherwise, you'd have to write
   to _every possible_ `Span` in the `Doc` to set up the values correctly.

   ```python
   Doc.set_extension("hello", getter=get_hello_value, setter=set_hello_value)
   assert doc._.hello
   doc._.hello = "Hi!"
   ```

3. **Method extensions.** Assign a function that becomes available as an object
   method. Method extensions are always immutable. For more details and
   implementation ideas, see
   [these examples](/usage/examples#custom-components-attr-methods).

   ```python
   Doc.set_extension("hello", method=lambda doc, name: f"Hi {name}!")
   assert doc._.hello("Bob") == "Hi Bob!"
   ```
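
To make the property case more concrete, here is a minimal sketch of a `Doc` getter that averages over a `Token` extension. The extension names `n_chars` and `mean_token_length` are made up for this example, and a blank English pipeline is assumed.

```python
import spacy
from spacy.tokens import Doc, Token

# Hypothetical Token extension: the number of characters in the token text
Token.set_extension("n_chars", getter=lambda token: len(token.text), force=True)


def get_mean_token_length(doc):
    """Average the Token extension over all tokens in the Doc."""
    if len(doc) == 0:
        return 0.0
    return sum(token._.n_chars for token in doc) / len(doc)


# The getter is only called when the attribute is retrieved
Doc.set_extension("mean_token_length", getter=get_mean_token_length, force=True)

nlp = spacy.blank("en")
doc = nlp("I like green apples")
print(doc._.mean_token_length)
```

Since nothing is stored on the `Doc`, the value always reflects the current tokens.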

Before you can access a custom extension, you need to register it using the
`set_extension` method on the object you want to add it to, e.g. the `Doc`. Keep
in mind that extensions are always **added globally** and not just on a
particular instance. If an attribute of the same name already exists, or if
you're trying to access an attribute that hasn't been registered, spaCy will
raise an `AttributeError`.
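
The following sketch illustrates the global registration and the error on unregistered access. The extension name `reviewed` is hypothetical, and `force=True` is passed only so the snippet can be re-run without clashing with an earlier registration.

```python
import spacy
from spacy.tokens import Doc

# Registered once on the class, so it's available on every Doc
Doc.set_extension("reviewed", default=False, force=True)

nlp = spacy.blank("en")
doc_a = nlp("First text")
doc_b = nlp("Second text")

doc_a._.reviewed = True  # the value itself is stored per Doc
print(doc_a._.reviewed, doc_b._.reviewed)

# Accessing an extension that was never registered fails
try:
    doc_a._.not_registered
    raised = False
except AttributeError:
    raised = True
print("AttributeError raised:", raised)
```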

```python
### Example
from spacy.tokens import Doc, Span, Token

fruits = ["apple", "pear", "banana", "orange", "strawberry"]
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any([t.text in fruits for t in obj])

Token.set_extension("is_fruit", getter=is_fruit_getter)
Doc.set_extension("has_fruit", getter=has_fruit_getter)
Span.set_extension("has_fruit", getter=has_fruit_getter)
```

> #### Usage example
>
> ```python
> doc = nlp("I have an apple and a melon")
> assert doc[3]._.is_fruit      # get Token attributes
> assert not doc[0]._.is_fruit
> assert doc._.has_fruit        # get Doc attributes
> assert doc[1:4]._.has_fruit   # get Span attributes
> ```

Once you've registered your custom attribute, you can also use the built-in
`set`, `get` and `has` methods to modify and retrieve the attributes. This is
especially useful if you want to pass in the attribute name as a string instead
of calling `doc._.my_attr`.
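
As a short sketch of these methods, the snippet below looks up an extension by a name held in a variable. The extension name `source` and its value are invented for this example.

```python
import spacy
from spacy.tokens import Doc

# Hypothetical extension, registered with force=True so the sketch is re-runnable
Doc.set_extension("source", default=None, force=True)

nlp = spacy.blank("en")
doc = nlp("Some text to process")

attr_name = "source"  # e.g. read from a config file or passed in as an argument
if doc._.has(attr_name):
    doc._.set(attr_name, "newsletter")

print(doc._.get(attr_name))
print(doc._.has("unknown_attr"))
```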

### Example: Pipeline component for GPE entities and country meta data via a REST API {#component-example3}

This example shows the implementation of a pipeline component that fetches
country meta data via the [REST Countries API](https://restcountries.eu), sets
entity annotations for countries and sets custom attributes on the `Doc` and
`Span` – for example, the capital, latitude/longitude coordinates and even the
country flag.

```python
### {executable="true"}
import requests
from spacy.lang.en import English
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token

@Language.factory("rest_countries")
class RESTCountriesComponent:
    def __init__(self, nlp, name, label="GPE"):
        r = requests.get("https://restcountries.eu/rest/v2/all")
        r.raise_for_status()  # make sure requests raises an error if it fails
        countries = r.json()
        # Convert API response to dict keyed by country name for easy lookup
        self.countries = {c["name"]: c for c in countries}
        self.label = label
        # Set up the PhraseMatcher with Doc patterns for each country name
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("COUNTRIES", [nlp.make_doc(c) for c in self.countries.keys()])
        # Register attributes on the Span. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Span.set_extension("is_country", default=None)
        Span.set_extension("country_capital", default=None)
        Span.set_extension("country_latlng", default=None)
        Span.set_extension("country_flag", default=None)
        # Register attribute on Doc via a getter that checks if the Doc
        # contains a country entity
        Doc.set_extension("has_country", getter=self.has_country)

    def __call__(self, doc):
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in self.matcher(doc):
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            # Set custom attributes on entity. Can be extended with other data
            # returned by the API, like currencies, country code, calling code etc.
            entity._.set("is_country", True)
            entity._.set("country_capital", self.countries[entity.text]["capital"])
            entity._.set("country_latlng", self.countries[entity.text]["latlng"])
            entity._.set("country_flag", self.countries[entity.text]["flag"])
            spans.append(entity)
        # Overwrite doc.ents and add entity – be careful not to replace!
        doc.ents = list(doc.ents) + spans
        return doc  # don't forget to return the Doc!

    def has_country(self, doc):
        """Getter for Doc attributes. Since the getter is only called
        when we access the attribute, we can refer to the Span's 'is_country'
        attribute here, which is already set in the processing step."""
        return any([entity._.get("is_country") for entity in doc.ents])

 								nlp = English()
 								nlp.add_pipe("rest_countries", config={"label": "GPE"})
 								doc = nlp("Some text about Colombia and the Czech Republic")
 								print("Pipeline", nlp.pipe_names)  # pipeline contains component name
 								print("Doc has countries", doc._.has_country)  # Doc contains countries
+								for ent in doc.ents:
 								    if ent._.is_country:
 								        print(ent.text, ent.label_, ent._.country_capital, ent._.country_latlng, ent._.country_flag)
+								```
 								In this case, all data can be fetched on initialization in one request. However,
 								if you're working with text that contains incomplete country names, spelling
 								mistakes or foreign-language versions, you could also implement a
 								`like_country`-style getter function that makes a request to the search API
 								endpoint and returns the best-matching result.
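
As a rough illustration of the `like_country` idea, the sketch below swaps the API request for local fuzzy matching with Python's `difflib`, so it runs offline. The `COUNTRIES` list, the `like_country` attribute name and the `0.8` cutoff are assumptions made for this example and are not part of the component above.

```python
import difflib

import spacy
from spacy.tokens import Span

# Hypothetical stand-in for the country names a real API would return
COUNTRIES = ["Colombia", "Czech Republic", "Germany"]


def like_country(span):
    """Return the closest known country name for the span text, or None."""
    matches = difflib.get_close_matches(span.text, COUNTRIES, n=1, cutoff=0.8)
    return matches[0] if matches else None


# The getter is only evaluated when the attribute is accessed
Span.set_extension("like_country", getter=like_country)

nlp = spacy.blank("en")
doc = nlp("We flew to Columbia last year")
misspelled = doc[3:4]  # the span covering "Columbia"
print(misspelled.text, "->", misspelled._.like_country)
```

Because the lookup lives in a getter, no work is done for spans whose attribute is never accessed. A version backed by a real search endpoint would probably want to cache its results.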
 								### User hooks {#custom-components-user-hooks}
 								While it's generally recommended to use the `Doc._`, `Span._` and `Token._`
 								proxies to add your own custom attributes, spaCy offers a few exceptions to
 								allow **customizing the built-in methods** like
 								[`Doc.similarity`](/api/doc#similarity) or [`Doc.vector`](/api/doc#vector) with
+								your own hooks, which can rely on components you train yourself. For instance,
 								you can provide your own on-the-fly sentence segmentation algorithm or document
 								similarity method.
 								Hooks let you customize some of the behaviors of the `Doc`, `Span` or `Token`
 								objects by adding a component to the pipeline. For instance, to customize the
 								[`Doc.similarity`](/api/doc#similarity) method, you can add a component that
+								sets a custom function to `doc.user_hooks["similarity"]`. The built-in
+								`Doc.similarity` method will check the `user_hooks` dict, and delegate to your
 								function if you've set one. Similar results can be achieved by setting functions
 								to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
 								> #### Implementation note
 								>
 								> The hooks live on the `Doc` object because the `Span` and `Token` objects are
 								> created lazily, and don't own any data. They just proxy to their parent `Doc`.
+								> This turns out to be convenient here – we only have to worry about installing
+								> hooks in one place.
 								| Name               | Customizes                                                                                                                                                                                                              |
 								| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+								| `user_hooks`       | [`Doc.similarity`](/api/doc#similarity), [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents)                             |
+								| `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) |
 								| `user_span_hooks`  | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root)                     |
 								```python
 								### Add custom similarity hooks
+								from spacy.language import Language
+								class SimilarityModel:
+								    def __init__(self, name: str, index: int):
 								        self.name = name
 								        self.index = index
 								    def __call__(self, doc):
 								        doc.user_hooks["similarity"] = self.similarity
 								        doc.user_span_hooks["similarity"] = self.similarity
 								        doc.user_token_hooks["similarity"] = self.similarity
+								        return doc
 								    def similarity(self, obj1, obj2):
+								        return obj1.vector[self.index] + obj2.vector[self.index]
 								@Language.factory("similarity_component", default_config={"index": 0})
 								def create_similarity_component(nlp, name, index: int):
 								    return SimilarityModel(name, index)
+								```
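
To see a hook take effect end to end, here is a minimal sketch that does not depend on word vectors, so it also runs on a blank pipeline. The token-overlap (Jaccard) measure and the names `jaccard_similarity` and `myapp_jaccard_similarity` are assumptions chosen for illustration; any callable that takes two objects and returns a number could be installed the same way.

```python
import spacy
from spacy.language import Language


def jaccard_similarity(obj1, obj2):
    """Share of lowercased token texts that two Doc or Span objects have in common."""
    words1 = {token.lower_ for token in obj1}
    words2 = {token.lower_ for token in obj2}
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)


@Language.component("myapp_jaccard_similarity")
def install_similarity_hooks(doc):
    # Both hooks are stored on the Doc; Span.similarity looks up user_span_hooks
    doc.user_hooks["similarity"] = jaccard_similarity
    doc.user_span_hooks["similarity"] = jaccard_similarity
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("myapp_jaccard_similarity")
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")
print(doc1.similarity(doc2))  # 2 shared words out of 4 distinct words
```

The token hook is left out here because a single `Token` is not iterable, so this particular function would not work for `Token.similarity`.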
 								## Developing plugins and wrappers {#plugins}
 								We're very excited about all the new possibilities for community extensions and
+								plugins in spaCy, and we can't wait to see what you build with it! To get you
 								started, here are a few tips, tricks and best
+								practices. [See here](/universe/?category=pipeline) for examples of other spaCy
 								extensions.
 								### Usage ideas {#custom-components-usage-ideas}
 								- **Adding new features and hooking in models.** For example, a sentiment
 								  analysis model, or your preferred solution for lemmatization. spaCy's
 								  built-in tagger, parser and entity recognizer respect
 								  annotations that were already set on the `Doc` in a previous step of the
 								  pipeline.
 								- **Integrating other libraries and APIs.** For example, your pipeline component
 								  can write additional information and data directly to the `Doc` or `Token` as
 								  custom attributes, while making sure no information is lost in the process.
 								  This can be output generated by other libraries and models, or an external
 								  service with a REST API.
 								- **Debugging and logging.** For example, a component that stores and/or
 								  exports relevant information about the current state of the processed
 								  document, which you can insert at any point of your pipeline.
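
The debugging idea can be sketched as a small stateless component. The `snapshots` list and the component name `myapp_debug_snapshot` are hypothetical, and a real component would more likely write to a logger or a file than to a module-level list.

```python
import spacy
from spacy.language import Language

# Hypothetical in-memory log, used here only to keep the example self-contained
snapshots = []


@Language.component("myapp_debug_snapshot")
def debug_snapshot(doc):
    # Record what the Doc looks like at this point of the pipeline
    snapshots.append(
        {"n_tokens": len(doc), "ents": [(ent.text, ent.label_) for ent in doc.ents]}
    )
    return doc


nlp = spacy.blank("en")
# Use before=... or after=... instead of last=True to inspect an earlier stage
nlp.add_pipe("myapp_debug_snapshot", last=True)
doc = nlp("This is a sentence.")
print(snapshots)
```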
 								### Best practices {#custom-components-best-practices}
 								Extensions can claim their own `._` namespace and exist as standalone packages.
 								If you're developing a tool or library and want to make it easy for others to
 								use it with spaCy and add it to their pipeline, all you have to do is expose a
 								function that takes a `Doc`, modifies it and returns it.
 								- Make sure to choose a **descriptive and specific name** for your pipeline
 								  component class, and set it as its `name` attribute. Avoid names that are too
 								  common or likely to clash with built-in or a user's other custom components.
 								  While it's fine to call your package `"spacy_my_extension"`, avoid component
 								  names including `"spacy"`, since this can easily lead to confusion.
 								  ```diff
 								  + name = "myapp_lemmatizer"
 								  - name = "lemmatizer"
 								  ```
 								- When writing to `Doc`, `Token` or `Span` objects, **use getter functions**
 								  wherever possible, and avoid setting values explicitly. Tokens and spans don't
 								  own any data themselves, and they're implemented as C extension classes – so
 								  you can't usually add new attributes to them like you could with most pure
 								  Python objects.
 								  ```diff
 								  + is_fruit = lambda token: token.text in ("apple", "orange")
 								  + Token.set_extension("is_fruit", getter=is_fruit)
 								  - token._.set_extension("is_fruit", default=False)
 								  - if token.text in ("apple", "orange"):
 								  -     token._.set("is_fruit", True)
 								  ```
 								- Always add your custom attributes to the **global** `Doc`, `Token` or `Span`
 								  objects, not a particular instance of them. Add the attributes **as early as
 								  possible**, e.g. in your extension's `__init__` method or in the global scope
 								  of your module. This means that in the case of namespace collisions, the user
 								  will see an error immediately, not just when they run their pipeline.
 								  ```diff
 								  + from spacy.tokens import Doc
 								  + def __init__(self, attr="my_attr"):
 								  +     Doc.set_extension(attr, getter=self.get_doc_attr)
 								  - def __call__(self, doc):
 								  -     doc.set_extension("my_attr", getter=self.get_doc_attr)
 								  ```
 								- If your extension is setting properties on the `Doc`, `Token` or `Span`,
 								  include an option to **let the user change those attribute names**. This
 								  makes it easier to avoid namespace collisions and accommodate users with
 								  different naming preferences. We recommend adding an `attrs` argument to the
 								  `__init__` method of your class so you can write the names to class attributes
 								  and reuse them across your component.
 								  ```diff
 								  + Doc.set_extension(self.doc_attr, default="some value")
 								  - Doc.set_extension("my_doc_attr", default="some value")
 								  ```
 								- Ideally, extensions should be **standalone packages** with spaCy and,
 								  optionally, other packages specified as dependencies. They can freely assign
 								  to their own `._` namespace, but should stick to that. If your extension's
 								  only job is to provide a better `.similarity` implementation, and your docs
 								  state this explicitly, there's no problem with writing to the
 								  [`user_hooks`](#custom-components-user-hooks) and overwriting spaCy's built-in
 								  method. However, a third-party extension should **never silently overwrite
 								  built-ins**, or attributes set by other extensions.
+								- If you're looking to publish a pipeline package that depends on a custom
 								  pipeline component, you can either **require it** in the package's
 								  dependencies, or – if the component is specific and lightweight – choose to
 								  **ship it with your pipeline package**. Just make sure the
+								  [`@Language.component`](/api/language#component) or
 								  [`@Language.factory`](/api/language#factory) decorator that registers the
+								  custom component runs in your package's `__init__.py` or is exposed via an
+								  [entry point](/usage/saving-loading#entry-points).
 								- Once you're ready to share your extension with others, make sure to **add docs
 								  and installation instructions** (you can always link to this page for more
 								  info). Make it easy for others to install and use your extension, for example
 								  by uploading it to [PyPI](https://pypi.python.org). If you're sharing your
 								  code on GitHub, don't forget to tag it with
 								  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
 								  [`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
 								  to help people find it. If you post it on Twitter, feel free to tag
 								  [@spacy_io](https://twitter.com/spacy_io) so we can check it out.
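
Several of the practices above can be combined in one small component. The following is a sketch under assumptions, not a reference implementation: the class `FruitDetector`, the factory name `myapp_fruit_detector` and the attribute names are all made up for illustration. It registers getter-based extensions on the global `Token` and `Doc` classes in `__init__`, and lets the user rename them via the config.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token


class FruitDetector:
    def __init__(self, token_attr: str, doc_attr: str):
        self.fruits = {"apple", "orange"}
        self.token_attr = token_attr
        self.doc_attr = doc_attr
        # Register on the global classes as early as possible. Without
        # force=True, a name collision raises an error right here.
        Token.set_extension(token_attr, getter=self.is_fruit)
        Doc.set_extension(doc_attr, getter=self.has_fruit)

    def is_fruit(self, token):
        return token.lower_ in self.fruits

    def has_fruit(self, doc):
        return any(token._.get(self.token_attr) for token in doc)

    def __call__(self, doc):
        # Nothing to write: both attributes are computed lazily by the getters
        return doc


@Language.factory(
    "myapp_fruit_detector",
    default_config={"token_attr": "is_fruit", "doc_attr": "has_fruit"},
)
def create_fruit_detector(nlp, name, token_attr: str, doc_attr: str):
    return FruitDetector(token_attr, doc_attr)


nlp = spacy.blank("en")
# The user overrides one attribute name and keeps the default for the other
nlp.add_pipe("myapp_fruit_detector", config={"doc_attr": "contains_fruit"})
doc = nlp("I ate an apple")
print(doc._.contains_fruit, [token.text for token in doc if token._.is_fruit])
```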
 								### Wrapping other models and libraries {#wrapping-models-libraries}
 								Let's say you have a custom entity recognizer that takes a list of strings and
+								returns their [BILUO tags](/usage/linguistic-features#accessing-ner). Given an
 								input like `["A", "text", "about", "Facebook"]`, it will predict and return
+								`["O", "O", "O", "U-ORG"]`. To integrate it into your spaCy pipeline and make it
 								add those entities to the `doc.ents`, you can wrap it in a custom pipeline
 								component function and pass it the token texts from the `Doc` object received by
 								the component.
 								The [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans) function is very
+								helpful here, because it takes a `Doc` object and token-based BILUO tags and
 								returns a sequence of `Span` objects in the `Doc` with added labels. So all your
 								wrapper has to do is compute the entity spans and overwrite the `doc.ents`.
 								> #### How the doc.ents work
 								>
 								> When you add spans to the `doc.ents`, spaCy will automatically resolve them
 								> back to the underlying tokens and set the `Token.ent_type` and `Token.ent_iob`
 								> attributes. By definition, each token can only be part of one entity, so
 								> overlapping entity spans are not allowed.
```python
### {highlight="1,8-9"}
import your_custom_entity_recognizer
from spacy.training import biluo_tags_to_spans
from spacy.language import Language

@Language.component("custom_ner_wrapper")
def custom_ner_wrapper(doc):
    words = [token.text for token in doc]
    custom_entities = your_custom_entity_recognizer(words)
    doc.ents = biluo_tags_to_spans(doc, custom_entities)
    return doc
```

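To make the tag scheme more concrete, here's a minimal plain-Python sketch of
how token-based BILUO tags map to entity spans. The helper
`biluo_to_offsets` is hypothetical and only for illustration: it returns
`(start, end, label)` token offsets rather than `Span` objects, assumes the tag
sequence is well-formed, and is not how spaCy implements
`biluo_tags_to_spans`.

```python
def biluo_to_offsets(tags):
    """Convert token-level BILUO tags to (start, end, label) token offsets.

    Hypothetical helper for illustration. `end` is exclusive, like a slice.
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":  # token is outside of any entity
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":  # single-token entity
            spans.append((i, i + 1, label))
        elif prefix == "B":  # first token of a multi-token entity
            start = i
        elif prefix == "L":  # last token closes the open entity
            spans.append((start, i + 1, label))
            start = None
        # "I" tokens are inside an open entity and need no action
    return spans


tags = ["O", "O", "O", "U-ORG"]
print(biluo_to_offsets(tags))  # [(3, 4, 'ORG')]
print(biluo_to_offsets(["B-PERSON", "L-PERSON", "O", "U-GPE"]))
# [(0, 2, 'PERSON'), (3, 4, 'GPE')]
```
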
The `custom_ner_wrapper` can then be added to a blank pipeline using
[`nlp.add_pipe`](/api/language#add_pipe). You can also replace the existing
entity recognizer of a trained pipeline with
[`nlp.replace_pipe`](/api/language#replace_pipe).

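As a rough sketch of what adding such a wrapper to a blank pipeline could look
like, the example below swaps the external recognizer for a toy function. The
names `toy_entity_recognizer` and `toy_ner_wrapper` are made up for this
illustration, and the "model" simply tags the word "Explosion" as an `ORG`.

```python
import spacy
from spacy.language import Language
from spacy.training import biluo_tags_to_spans


def toy_entity_recognizer(words):
    # Hypothetical stand-in for a real model: tags "Explosion" as a
    # single-token ORG entity and every other token as outside ("O")
    return ["U-ORG" if word == "Explosion" else "O" for word in words]


@Language.component("toy_ner_wrapper")
def toy_ner_wrapper(doc):
    words = [token.text for token in doc]
    doc.ents = biluo_tags_to_spans(doc, toy_entity_recognizer(words))
    return doc


nlp = spacy.blank("en")  # tokenizer only, no trained components
nlp.add_pipe("toy_ner_wrapper")  # components are added by their registered name
doc = nlp("I work at Explosion")
print([(ent.text, ent.label_) for ent in doc.ents])
```

With a trained pipeline that already has an `ner` component, calling
`nlp.replace_pipe("ner", "toy_ner_wrapper")` instead of `nlp.add_pipe` should
swap the existing entity recognizer for the wrapper.
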
Here's another example of a custom model, `your_custom_model`, that takes a list
of tokens and returns lists of coarse-grained part-of-speech tags, fine-grained
part-of-speech tags, dependency labels and head token indices. Here, we can use
the [`Doc.from_array`](/api/doc#from_array) method to create a new `Doc` object
using those values. To create a numpy array we need integers, so we can look up
the string labels in the [`StringStore`](/api/stringstore). The
[`doc.vocab.strings.add`](/api/stringstore#add) method comes in handy here,
because it returns the integer ID of the string _and_ makes sure it's added to
the vocab. This is especially important if the custom model uses a different
label scheme than spaCy's default models.

> #### Example: spacy-stanza
>
> For an example of an end-to-end wrapper for statistical tokenization, tagging
> and parsing, check out
> [`spacy-stanza`](https://github.com/explosion/spacy-stanza). It uses a very
> similar approach to the example in this section – the only difference is that
> it fully replaces the `nlp` object instead of providing a pipeline component,
> since it also needs to handle tokenization.

```python
### {highlight="1,11,17-19"}
import your_custom_model
from spacy.language import Language
from spacy.symbols import POS, TAG, DEP, HEAD
from spacy.tokens import Doc
import numpy

@Language.component("custom_model_wrapper")
def custom_model_wrapper(doc):
    words = [token.text for token in doc]
    # Doc expects one boolean per token: is it followed by whitespace?
    spaces = [bool(token.whitespace_) for token in doc]
    pos, tags, deps, heads = your_custom_model(words)
    # Convert the strings to integers and add them to the string store
    pos = [doc.vocab.strings.add(label) for label in pos]
    tags = [doc.vocab.strings.add(label) for label in tags]
    deps = [doc.vocab.strings.add(label) for label in deps]
    # Create a new Doc from a numpy array
    attrs = [POS, TAG, DEP, HEAD]
    arr = numpy.array(list(zip(pos, tags, deps, heads)), dtype="uint64")
    new_doc = Doc(doc.vocab, words=words, spaces=spaces).from_array(attrs, arr)
    return new_doc
```

<Infobox title="Sentence boundaries and heads" variant="warning">

If you create a `Doc` object with dependencies and heads, spaCy is able to
resolve the sentence boundaries automatically. However, note that the `HEAD`
value used to construct a `Doc` is the token index **relative** to the current
token – e.g. `-1` for the previous token. The CoNLL format typically annotates
heads as `1`-indexed absolute indices with `0` indicating the root. If that's
the case in your annotations, you need to convert them first:

```python
heads = [2, 0, 4, 2, 2]
new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)]
```

</Infobox>

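To see what this conversion produces, here's the same logic written out as a
small function with a sanity check. The function name
`conll_heads_to_relative` is made up for this sketch, and the input is assumed
to use `1`-indexed absolute heads with `0` for the root, as described above.

```python
def conll_heads_to_relative(heads):
    """Convert 1-indexed absolute heads (0 = root) to relative offsets."""
    relative = []
    for i, head in enumerate(heads):
        if head == 0:
            relative.append(0)  # the root points to itself
        else:
            relative.append(head - (i + 1))  # i + 1 is the token's CoNLL index
    return relative


heads = [2, 0, 4, 2, 2]
new_heads = conll_heads_to_relative(heads)
print(new_heads)  # [1, 0, 1, -2, -3]

# Sanity check: adding each offset to the token's own 0-based index gives the
# 0-based index of its head, with the root pointing to itself
absolute = [i + offset for i, offset in enumerate(new_heads)]
print(absolute)  # [1, 1, 3, 1, 1]
```
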
<Infobox title="Advanced usage, serialization and entry points" emoji="📖">

For more details on how to write and package custom components, make them
available to spaCy via entry points and implement your own serialization
methods, check out the usage guide on
[saving and loading](/usage/saving-loading).

</Infobox>