spaCy/website/docs/usage/linguistic-features.md

---
title: Linguistic Features
next: /usage/rule-based-matching
menu:
  - ['POS Tagging', 'pos-tagging']
  - ['Morphology', 'morphology']
  - ['Lemmatization', 'lemmatization']
  - ['Dependency Parse', 'dependency-parse']
  - ['Named Entities', 'named-entities']
  - ['Entity Linking', 'entity-linking']
  - ['Tokenization', 'tokenization']
  - ['Merging & Splitting', 'retokenization']
  - ['Sentence Segmentation', 'sbd']
  - ['Vectors & Similarity', 'vectors-similarity']
  - ['Mappings & Exceptions', 'mappings-exceptions']
  - ['Language Data', 'language-data']
---

Processing raw text intelligently is difficult: most words are rare, and it's
common for words that look completely different to mean almost the same thing.
The same words in a different order can mean something completely different.
Even splitting text into useful word-like units can be difficult in many
languages. While it's possible to solve some problems starting from only the raw
characters, it's usually better to use linguistic knowledge to add useful
information. That's exactly what spaCy is designed to do: you put in raw text,
and get back a [`Doc`](/api/doc) object that comes with a variety of
annotations.

## Part-of-speech tagging {#pos-tagging model="tagger, parser"}

import PosDeps101 from 'usage/101/\_pos-deps.md'

<PosDeps101 />

<Infobox title="Part-of-speech tag scheme" emoji="📖">

For a list of the fine-grained and coarse-grained part-of-speech tags assigned
by spaCy's models across different languages, see the label schemes documented
in the [models directory](/models).

</Infobox>

## Morphology {#morphology}

Inflectional morphology is the process by which a root form of a word is
modified by adding prefixes or suffixes that specify its grammatical function
but do not change its part-of-speech. We say that a **lemma** (root form) is
**inflected** (modified/combined) with one or more **morphological features** to
create a surface form. Here are some examples:

| Context                                  | Surface | Lemma | POS    |  Morphological Features                  |
| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- |
| I was reading the paper                  | reading | read  | `VERB` | `VerbForm=Ger`                           |
| I don't watch the news, I read the paper | read    | read  | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
| I read the paper yesterday               | read    | read  | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |

Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis)
under `Token.morph`, which allows you to access individual morphological
features.

> #### 📝 Things to try
>
> 1. Change "I" to "She". You should see that the morphological features change
>    and express that it's a pronoun in the third person.
> 2. Inspect `token.morph` for the other tokens.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was reading the paper.")
token = doc[0]  # 'I'
print(token.morph)  # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
print(token.morph.get("PronType"))  # ['Prs']
```

### Statistical morphology {#morphologizer new="3" model="morphologizer"}

spaCy's statistical [`Morphologizer`](/api/morphologizer) component assigns the
morphological features and coarse-grained part-of-speech tags as `Token.morph`
and `Token.pos`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?") # English: 'Where are you?'
print(doc[2].morph)  # 'Case=Nom|Number=Sing|Person=2|PronType=Prs'
print(doc[2].pos_) # 'PRON'
```

### Rule-based morphology {#rule-based-morphology}

For languages with relatively simple morphological systems like English, spaCy
can assign morphological features through a rule-based approach, which uses the
**token text** and **fine-grained part-of-speech tags** to produce
coarse-grained part-of-speech tags and morphological features.

1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech
   tag**. In the API, these tags are known as `Token.tag`. They express the
   part-of-speech (e.g. verb) and some amount of morphological information, e.g.
   that the verb is past tense (e.g. `VBD` for a past tense verb in the Penn
   Treebank).
2. For words whose coarse-grained POS is not set by a prior process, a
   [mapping table](#mappings-exceptions) maps the fine-grained tags to
   coarse-grained POS tags and morphological features.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
print(doc[2].morph)  # 'Case=Nom|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
```
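
As a rough illustration of the second step, a mapping table can be pictured as a dictionary keyed by fine-grained tag. The sketch below is plain Python rather than spaCy's implementation, and the `TAG_MAP` entries and the `map_tag` helper are hypothetical, heavily abridged examples:

```python
# Hypothetical, abridged mapping table: fine-grained tag -> coarse POS + features.
TAG_MAP = {
    "VBD": {"pos": "VERB", "morph": "Tense=Past|VerbForm=Fin"},
    "VBZ": {"pos": "VERB", "morph": "Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"},
    "NNS": {"pos": "NOUN", "morph": "Number=Plur"},
}


def map_tag(tag):
    """Return (coarse POS, morphological features) for a fine-grained tag."""
    # Unknown tags fall back to the catch-all "X" with no features.
    entry = TAG_MAP.get(tag, {"pos": "X", "morph": ""})
    return entry["pos"], entry["morph"]


print(map_tag("VBD"))  # ('VERB', 'Tense=Past|VerbForm=Fin')
print(map_tag("NNS"))  # ('NOUN', 'Number=Plur')
```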

## Lemmatization {#lemmatization model="lemmatizer" new="3"}

The [`Lemmatizer`](/api/lemmatizer) is a pipeline component that provides lookup
and rule-based lemmatization methods in a configurable component. An individual
language can extend the `Lemmatizer` as part of its
[language data](#language-data).

```python
### {executable="true"}
import spacy

# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
```

<Infobox title="Changed in v3.0" variant="warning">

Unlike spaCy v2, spaCy v3 models do _not_ provide lemmas by default or switch
automatically between lookup and rule-based lemmas depending on whether a tagger
is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to include a
[`Lemmatizer`](/api/lemmatizer) component. The lemmatizer component is
configured to use a single mode such as `"lookup"` or `"rule"` on
initialization. The `"rule"` mode requires `Token.pos` to be set by a previous
component.

</Infobox>

The data for spaCy's lemmatizers is distributed in the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
provided trained pipelines already include all the required tables, but if you
are creating new pipelines, you'll probably want to install `spacy-lookups-data`
to provide the data when the lemmatizer is initialized.

### Lookup lemmatizer {#lemmatizer-lookup}

For pipelines without a tagger or morphologizer, a lookup lemmatizer can be
added to the pipeline as long as a lookup table is provided, typically through
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
lookup lemmatizer looks up the token surface form in the lookup table without
reference to the token's part-of-speech or context.

```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.blank("sv")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
```
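
Conceptually, lookup lemmatization is a dictionary access with the surface form as the fallback. The following standalone sketch assumes a tiny hand-written table (`LEMMA_LOOKUP` and `lookup_lemma` are hypothetical names, not part of spaCy) and shows that the result never depends on part-of-speech or context:

```python
# Hypothetical miniature lookup table; real tables contain many thousands of entries.
LEMMA_LOOKUP = {"was": "be", "reading": "read", "papers": "paper"}


def lookup_lemma(word):
    """Return the table entry for the surface form, or the form itself."""
    return LEMMA_LOOKUP.get(word, word)


# "reading" always maps to "read", whether it is used as a verb or a noun.
print([lookup_lemma(w) for w in ["I", "was", "reading", "papers"]])
# ['I', 'be', 'read', 'paper']
```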

### Rule-based lemmatizer {#lemmatizer-rule}

When training pipelines that include a component that assigns part-of-speech
tags (a morphologizer or a tagger with a [POS mapping](#mappings-exceptions)), a
rule-based lemmatizer can be added using rule tables from
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data):

```python
# pip install spacy-lookups-data
import spacy

nlp = spacy.blank("de")
# Morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# Rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
```

The rule-based deterministic lemmatizer maps the surface form to a lemma in
light of the previously assigned coarse-grained part-of-speech and morphological
information, without consulting the context of the token. The rule-based
lemmatizer also accepts list-based exception files. For English, these are
acquired from [WordNet](https://wordnet.princeton.edu/).
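
To make the interplay of part-of-speech, exceptions and suffix rules more concrete, here is a simplified plain-Python sketch. It is not spaCy's implementation: the `RULES` and `EXCEPTIONS` tables and the `rule_lemma` helper are made-up toy examples, and unlike a real lemmatizer this version does not validate candidates against a list of known word forms.

```python
# Toy tables for illustration only.
RULES = {
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}
EXCEPTIONS = {"VERB": {"was": "be", "went": "go"}}


def rule_lemma(word, pos):
    """Lemmatize using the coarse POS: exceptions first, then suffix rules."""
    form = word.lower()
    if form in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][form]
    for suffix, replacement in RULES.get(pos, []):
        # Apply the first matching rule that leaves a non-empty stem.
        if form.endswith(suffix) and len(form) > len(suffix):
            return form[: -len(suffix)] + replacement
    return form


print(rule_lemma("reading", "VERB"))  # read
print(rule_lemma("reading", "NOUN"))  # reading
print(rule_lemma("was", "VERB"))  # be
```

The same surface form gets different lemmas depending on the part-of-speech, which is why this mode needs a tagger or morphologizer earlier in the pipeline.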

## Dependency Parsing {#dependency-parse model="parser"}

spaCy features a fast and accurate syntactic dependency parser, and has a rich
API for navigating the tree. The parser also powers the sentence boundary
detection, and lets you iterate over base noun phrases, or "chunks". You can
check whether a [`Doc`](/api/doc) object has been parsed by calling
`doc.has_annotation("DEP")`, which checks whether the attribute `Token.dep` has
been set and returns a boolean value. If the result is `False`, the default sentence
iterator will raise an exception.

<Infobox title="Dependency label scheme" emoji="📖">

For a list of the syntactic dependency labels assigned by spaCy's models across
different languages, see the label schemes documented in the
[models directory](/models).

</Infobox>

### Noun chunks {#noun-chunks}

Noun chunks are "base noun phrases" – flat phrases that have a noun as their
head. You can think of noun chunks as a noun plus the words describing the noun
– for example, "the lavish green grass" or "the world’s largest tech fund". To
get the noun chunks in a document, simply iterate over
[`Doc.noun_chunks`](/api/doc#noun_chunks).

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)
```

> - **Text:** The original noun chunk text.
> - **Root text:** The original text of the word connecting the noun chunk to
>   the rest of the parse.
> - **Root dep:** Dependency relation connecting the root to its head.
> - **Root head text:** The text of the root token's head.

| Text                | root.text     | root.dep\_ | root.head.text |
| ------------------- | ------------- | ---------- | -------------- |
| Autonomous cars     | cars          | `nsubj`    | shift          |
| insurance liability | liability     | `dobj`     | shift          |
| manufacturers       | manufacturers | `pobj`     | toward         |

### Navigating the parse tree {#navigating}

spaCy uses the terms **head** and **child** to describe the words **connected by
a single arc** in the dependency tree. The term **dep** is used for the arc
label, which describes the type of syntactic relation that connects the child to
the head. As with other attributes, the value of `.dep` is a hash value. You can
get the string value with `.dep_`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])
```

> - **Text:** The original token text.
> - **Dep:** The syntactic relation connecting child to head.
> - **Head text:** The original text of the token head.
> - **Head POS:** The part-of-speech tag of the token head.
> - **Children:** The immediate syntactic dependents of the token.

| Text          | Dep        | Head text | Head POS | Children                |
| ------------- | ---------- | --------- | -------- | ----------------------- |
| Autonomous    | `amod`     | cars      | `NOUN`   |                         |
| cars          | `nsubj`    | shift     | `VERB`   | Autonomous              |
| shift         | `ROOT`     | shift     | `VERB`   | cars, liability, toward |
| insurance     | `compound` | liability | `NOUN`   |                         |
| liability     | `dobj`     | shift     | `VERB`   | insurance               |
| toward        | `prep`     | shift     | `VERB`   | manufacturers           |
| manufacturers | `pobj`     | toward    | `ADP`    |                         |

import DisplaCyLong2Html from 'images/displacy-long2.html'

<Iframe title="displaCy visualization of dependencies and entities 2" html={DisplaCyLong2Html} height={450} />

Because the syntactic relations form a tree, every word has **exactly one
head**. You can therefore iterate over the arcs in the tree by iterating over
the words in the sentence. This is usually the best way to match an arc of
interest — from below:

```python
### {executable="true"}
import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)
```

If you try to match from above, you'll have to iterate twice, once for the head
and then again through the children:

```python
# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break
```

To iterate through the children, use the `token.children` attribute, which
provides a sequence of [`Token`](/api/token) objects.
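
To see why a single pass over the words is enough, it can help to picture the parse as plain lists instead of `Token` objects. In the standalone sketch below, the `heads` and `deps` values are copied by hand from the table above and are assumptions for illustration, not output produced by the parser:

```python
words = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
# Index of each word's head; the root points to itself.
heads = [1, 2, 2, 4, 2, 2, 5]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]

# Every word has exactly one head, so one pass over the words visits every arc.
children = {i: [] for i in range(len(words))}
for i, head in enumerate(heads):
    if head != i:  # skip the root's self-reference
        children[head].append(i)

# Matching "from below": look at each word's own arc and follow it to the head.
verbs_with_subject = {words[heads[i]] for i, dep in enumerate(deps) if dep == "nsubj"}
print(verbs_with_subject)  # {'shift'}
print([words[c] for c in children[2]])  # ['cars', 'liability', 'toward']
```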

#### Iterating around the local tree {#navigating-around}

A few more convenience attributes are provided for iterating around the local
tree from the token. [`Token.lefts`](/api/token#lefts) and
[`Token.rights`](/api/token#rights) attributes provide sequences of syntactic
children that occur before and after the token. Both sequences are in sentence
order. There are also two integer-typed attributes,
[`Token.n_lefts`](/api/token#n_lefts) and
[`Token.n_rights`](/api/token#n_rights) that give the number of left and right
children.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("bright red apples on the tree")
print([token.text for token in doc[2].lefts])  # ['bright', 'red']
print([token.text for token in doc[2].rights])  # ['on']
print(doc[2].n_lefts)  # 2
print(doc[2].n_rights)  # 1
```

```python
### {executable="true"}
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("schöne rote Äpfel auf dem Baum")
print([token.text for token in doc[2].lefts])  # ['schöne', 'rote']
print([token.text for token in doc[2].rights])  # ['auf']
```

You can get a whole phrase by its syntactic head using the
[`Token.subtree`](/api/token#subtree) attribute. This returns an ordered
sequence of tokens. You can walk up the tree with the
[`Token.ancestors`](/api/token#ancestors) attribute, and check dominance with
[`Token.is_ancestor`](/api/token#is_ancestor).

> #### Projective vs. non-projective
>
> For the [default English pipelines](/models/en), the parse tree is
> **projective**, which means that there are no crossing brackets. The tokens
> returned by `.subtree` are therefore guaranteed to be contiguous. This is not
> true for the German pipelines, which have many
> [non-projective dependencies](https://explosion.ai/blog/german-model#word-order).

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")

root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
            descendant.n_rights,
            [ancestor.text for ancestor in descendant.ancestors])
```

| Text     | Dep        | n_lefts | n_rights | ancestors                        |
| -------- | ---------- | ------- | -------- | -------------------------------- |
| Credit   | `nmod`     | `0`     | `2`      | holders, submit                  |
| and      | `cc`       | `0`     | `0`      | holders, submit                  |
| mortgage | `compound` | `0`     | `0`      | account, Credit, holders, submit |
| account  | `conj`     | `1`     | `0`      | Credit, holders, submit          |
| holders  | `nsubj`    | `1`     | `0`      | submit                           |
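
Projectivity ("no crossing brackets") can be checked mechanically from the head indices alone. The following is a minimal plain-Python sketch: `is_projective` is a hypothetical helper, not a spaCy function, and the first list of heads is transcribed by hand from the example sentence above.

```python
def is_projective(heads):
    """Return True if no two arcs cross.

    heads[i] is the index of token i's head; the root points to itself.
    """
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for a_start, a_end in arcs:
        for b_start, b_end in arcs:
            # Two arcs cross if exactly one endpoint of b lies strictly inside a.
            if a_start < b_start < a_end < b_end:
                return False
    return True


# "Credit and mortgage account holders must submit their requests"
print(is_projective([4, 0, 3, 0, 6, 6, 6, 8, 6]))  # True
# A made-up tree in which the arcs 0-2 and 1-3 cross.
print(is_projective([2, 3, 2, 2]))  # False
```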

Finally, the `.left_edge` and `.right_edge` attributes can be especially useful,
because they give you the first and last token of the subtree. This is the
easiest way to create a `Span` object for a syntactic phrase. Note that
`.right_edge` gives a token **within** the subtree — so if you use it as the
end-point of a range, don't forget to `+1`!

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```

| Text                                |  POS   | Dep     | Head text |
| ----------------------------------- | ------ | ------- | --------- |
| Credit and mortgage account holders | `NOUN` | `nsubj` | submit    |
| must                                | `VERB` | `aux`   | submit    |
| submit                              | `VERB` | `ROOT`  | submit    |
| their                               | `ADJ`  | `poss`  | requests  |
| requests                            | `NOUN` | `dobj`  | submit    |

The dependency parse can be a useful tool for **information extraction**,
especially when combined with other predictions like
[named entities](#named-entities). The following example extracts money and
currency values, i.e. entities labeled as `MONEY`, and then uses the dependency
parse to find the noun phrase they are referring to – for example `"Net income"`
&rarr; `"$9.4 million"`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
# Merge noun phrases and entities for easier analysis
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]
for doc in nlp.pipe(TEXTS):
    for token in doc:
        if token.ent_type_ == "MONEY":
            # We have an attribute or direct object, so check for subject
            if token.dep_ in ("attr", "dobj"):
                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subj:
                    print(subj[0], "-->", token)
            # We have a prepositional object with a preposition
            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
                print(token.head.head, "-->", token)
```

<Infobox title="Combining models and rules" emoji="📖">

For more examples of how to write rule-based information extraction logic that
takes advantage of the model's predictions produced by the different components,
see the usage guide on
[combining models and rules](/usage/rule-based-matching#models-rules).

</Infobox>

### Visualizing dependencies {#displacy}

The best way to understand spaCy's dependency parser is interactively. To make
this easier, spaCy comes with a visualization module. You can pass a `Doc` or a
list of `Doc` objects to displaCy and run
[`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or
[`displacy.render`](/api/top-level#displacy.render) to generate the raw markup.
If you want to know how to write rules that hook into some type of syntactic
construction, just plug the sentence into the visualizer and see how spaCy
annotates it.

```python
### {executable="true"}
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, style='dep')
```

<Infobox>

For more details and examples, see the
[usage guide on visualizing spaCy](/usage/visualizers). You can also test
displaCy in our [online demo](https://explosion.ai/demos/displacy).

</Infobox>

### Disabling the parser {#disabling}

In the [trained pipelines](/models) provided by spaCy, the parser is loaded and
enabled by default as part of the
[standard processing pipeline](/usage/processing-pipelines). If you don't need
any of the syntactic information, you should disable the parser. Disabling the
parser will make spaCy load and run much faster. If you want to load the parser,
but need to disable it for specific documents, you can also control its use on
the `nlp` object. For more details, see the usage guide on
[disabling pipeline components](/usage/processing-pipelines/#disabling).

```python
nlp = spacy.load("en_core_web_sm", disable=["parser"])
```

## Named Entity Recognition {#named-entities}

spaCy features an extremely fast statistical entity recognition system that
assigns labels to contiguous spans of tokens. The default
[trained pipelines](/models) can identify a variety of named and numeric
entities, including companies, locations, organizations and products. You can
add arbitrary classes to the entity recognition system, and update the model
with new examples.

### Named Entity Recognition 101 {#named-entities-101}

import NER101 from 'usage/101/\_named-entities.md'

<NER101 />

### Accessing entity annotations and labels {#accessing-ner}

The standard way to access entity annotations is the [`doc.ents`](/api/doc#ents)
property, which produces a sequence of [`Span`](/api/span) objects. The entity
type is accessible either as a hash value or as a string, using the attributes
`ent.label` and `ent.label_`. The `Span` object acts as a sequence of tokens, so
you can iterate over the entity or index into it. You can also get the text form
of the whole entity, as though it were a single token.

You can also access token entity annotations using the
[`token.ent_iob`](/api/token#attributes) and
[`token.ent_type`](/api/token#attributes) attributes. `token.ent_iob` indicates
whether an entity starts, continues or ends on the token. If no entity type is
set on a token, `token.ent_type_` will return an empty string.

> #### IOB Scheme
>
> - `I` – Token is **inside** an entity.
> - `O` – Token is **outside** an entity.
> - `B` – Token is the **beginning** of an entity.
>
> #### BILUO Scheme
>
> - `B` – Token is the **beginning** of a multi-token entity.
> - `I` – Token is **inside** a multi-token entity.
> - `L` – Token is the **last** token of a multi-token entity.
> - `U` – Token is a single-token **unit** entity.
> - `O` – Token is **outside** an entity.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # ['San', 'B', 'GPE']
print(ent_francisco)  # ['Francisco', 'I', 'GPE']
```

| Text      | ent_iob | ent_iob\_ | ent_type\_ | Description            |
| --------- | ------- | --------- | ---------- | ---------------------- |
| San       | `3`     | `B`       | `"GPE"`    | beginning of an entity |
| Francisco | `1`     | `I`       | `"GPE"`    | inside an entity       |
| considers | `2`     | `O`       | `""`       | outside an entity      |
| banning   | `2`     | `O`       | `""`       | outside an entity      |
| sidewalk  | `2`     | `O`       | `""`       | outside an entity      |
| delivery  | `2`     | `O`       | `""`       | outside an entity      |
| robots    | `2`     | `O`       | `""`       | outside an entity      |
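
The two schemes carry the same information, so one can be derived from the other. As a rough sketch in plain Python (the `iob_to_biluo` function below is written for this example and assumes well-formed input tags of the form `B-LABEL`, `I-LABEL` or `O`):

```python
def iob_to_biluo(tags):
    """Convert a list of IOB tags such as "B-GPE" into BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is "I" with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            biluo.append(f"B-{label}" if continues else f"U-{label}")
        else:  # prefix == "I"
            biluo.append(f"I-{label}" if continues else f"L-{label}")
    return biluo


# "San Francisco considers banning sidewalk delivery robots"
print(iob_to_biluo(["B-GPE", "I-GPE", "O", "O", "O", "O", "O"]))
# ['B-GPE', 'L-GPE', 'O', 'O', 'O', 'O', 'O']
```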

### Setting entity annotations {#setting-entities}

To ensure that the sequence of token annotations remains consistent, you have to
set entity annotations **at the document level**. However, you can't write
directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest
way to set entities is to assign to the [`doc.ents`](/api/doc#ents) attribute
and create the new entity as a [`Span`](/api/span).

```python
### {executable="true"}
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# The model didn't recognize "fb" as an entity :(

fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 2, 'ORG')] 🎉
```

Keep in mind that you need to create a `Span` with the start and end index of
the **token**, not the start and end index of the entity in the document. In
this case, "fb" is token `(0, 1)` – but at the document level, the entity will
have the start and end indices `(0, 2)`.
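
The relationship between token indices and character offsets can be sketched without spaCy. The example below assumes the sentence is split on single spaces, and `char_offsets` and `span_to_chars` are hypothetical helpers introduced only for illustration:

```python
words = ["fb", "is", "hiring", "a", "new", "vice", "president", "of", "global", "policy"]
# Whether each token is followed by a space in the original text.
spaces = [True] * 9 + [False]


def char_offsets(words, spaces):
    """Return (start_char, end_char) for every token."""
    offsets, pos = [], 0
    for word, space in zip(words, spaces):
        offsets.append((pos, pos + len(word)))
        pos += len(word) + (1 if space else 0)
    return offsets


def span_to_chars(offsets, start, end):
    """Map a token slice [start, end) to character offsets."""
    return offsets[start][0], offsets[end - 1][1]


offsets = char_offsets(words, spaces)
print(span_to_chars(offsets, 0, 1))  # (0, 2)  -> "fb"
print(span_to_chars(offsets, 5, 7))  # (19, 33) -> "vice president"
```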

#### Setting entity annotations from array {#setting-from-array}

You can also assign entity annotations using the
[`doc.from_array`](/api/doc#from_array) method. To do this, you should include
both the `ENT_TYPE` and the `ENT_IOB` attributes in the array you're importing
from.

```python
### {executable="true"}
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents)  # []

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # [London]
```

#### Setting entity annotations in Cython {#setting-cython}

Finally, you can always write to the underlying struct, if you compile a
[Cython](http://cython.org/) function. This is easy to do, and allows you to
write efficient native code.

```python
# cython: infer_types=True
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, int ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3  # B: the first token begins the entity
    for i in range(start+1, end):
        doc.c[i].ent_iob = 1  # I: the remaining tokens are inside the entity
```

Obviously, if you write directly to the array of `TokenC*` structs, you'll have
responsibility for ensuring that the data is left in a consistent state.

### Built-in entity types {#entity-types}

> #### Tip: Understanding entity types
>
> You can also use `spacy.explain()` to get the description for the string
> representation of an entity label. For example, `spacy.explain("LANGUAGE")`
> will return "any named language".

<Infobox title="Annotation scheme">

For details on the entity types available in spaCy's trained pipelines, see the
"label scheme" sections of the individual models in the
[models directory](/models).

</Infobox>

### Visualizing named entities {#displacy}

The
[displaCy <sup>ENT</sup> visualizer](https://explosion.ai/demos/displacy-ent)
lets you explore an entity recognition model's behavior interactively. If you're
training a model, it's very useful to run the visualization yourself. To help
you do that, spaCy comes with a visualization module. You can pass a `Doc` or a
list of `Doc` objects to displaCy and run
[`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or
[`displacy.render`](/api/top-level#displacy.render) to generate the raw markup.

For more details and examples, see the
[usage guide on visualizing spaCy](/usage/visualizers).

```python
### Named Entity example
import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")
```

import DisplacyEntHtml from 'images/displacy-ent2.html'

<Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={180} />

## Entity Linking {#entity-linking}

To ground the named entities into the "real world", spaCy provides functionality
to perform entity linking, which resolves a textual entity to a unique
identifier from a knowledge base (KB). You can create your own
[`KnowledgeBase`](/api/kb) and [train](/usage/training) a new
[`EntityLinker`](/api/entitylinker) using that custom knowledge base.

### Accessing entity identifiers {#entity-linking-accessing model="entity linking"}

The annotated KB identifier is accessible as either a hash value or as a string,
using the attributes `ent.kb_id` and `ent.kb_id_` of a [`Span`](/api/span)
object, or the `ent_kb_id` and `ent_kb_id_` attributes of a
[`Token`](/api/token) object.

```python
import spacy

nlp = spacy.load("my_custom_el_pipeline")
doc = nlp("Ada Lovelace was born in London")

# Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents)  # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]

# Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
print(ent_ada_0)  # ['Ada', 'PERSON', 'Q7259']
print(ent_ada_1)  # ['Lovelace', 'PERSON', 'Q7259']
print(ent_london_5)  # ['London', 'GPE', 'Q84']
```

## Tokenization {#tokenization}

Tokenization is the task of splitting a text into meaningful segments, called
_tokens_. The input to the tokenizer is a unicode text, and the output is a
[`Doc`](/api/doc) object. To construct a `Doc` object, you need a
[`Vocab`](/api/vocab) instance, a sequence of `word` strings, and optionally a
sequence of `spaces` booleans, which allow you to maintain alignment of the
tokens into the original string.

<Infobox title="Important note" variant="warning">

spaCy's tokenization is **non-destructive**, which means that you'll always be
able to reconstruct the original input from the tokenized output. Whitespace
information is preserved in the tokens and no information is added or removed
during tokenization. This is kind of a core principle of spaCy's `Doc` object:
`doc.text == input_text` should always hold true.

</Infobox>
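
The role of the `spaces` flags is easiest to see in a small round trip. This plain-Python sketch does not use spaCy; the `align_spaces` helper is hypothetical and assumes that the words occur in order in the text and are separated by at most one space:

```python
def align_spaces(text, words):
    """For each word, record whether it is followed by a space in the text."""
    spaces, pos = [], 0
    for word in words:
        pos = text.index(word, pos) + len(word)
        has_space = text[pos : pos + 1] == " "
        spaces.append(has_space)
        if has_space:
            pos += 1
    return spaces


text = "Let's go to N.Y.!"
words = ["Let", "'s", "go", "to", "N.Y.", "!"]
spaces = align_spaces(text, words)
print(spaces)  # [False, True, True, True, False, False]

# Words plus spaces are enough to rebuild the input exactly.
rebuilt = "".join(word + (" " if space else "") for word, space in zip(words, spaces))
print(rebuilt == text)  # True
```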

import Tokenization101 from 'usage/101/\_tokenization.md'

<Tokenization101 />

<Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>

spaCy introduces a novel tokenization algorithm that gives a better balance
between performance, ease of definition, and ease of alignment into the original
string.

After consuming a prefix or suffix, we consult the special cases again. We want
the special cases to handle things like "don't" in English, and we want the same
rule to work for "(don't)!". We do this by splitting off the open bracket, then
the exclamation, then the close bracket, and finally matching the special case.
Here's an implementation of the algorithm in Python, optimized for readability
rather than performance:

```python
def tokenizer_pseudo_code(
    text,
    special_cases,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match,
    url_match
):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            while prefix_search(substring) or suffix_search(substring):
                if token_match(substring):
                    tokens.append(substring)
                    substring = ""
                    break
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ""
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            if token_match(substring):
                tokens.append(substring)
                substring = ""
            elif url_match(substring):
                tokens.append(substring)
                substring = ""
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            elif substring:
                tokens.append(substring)
                substring = ""
        tokens.extend(reversed(suffixes))
    return tokens
```

The algorithm can be summarized as follows:

1. Iterate over whitespace-separated substrings.
2. Look for a token match. If there is a match, stop processing and keep this
   token.
3. Check whether we have an explicitly defined special case for this substring.
   If we do, use it.
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
   so that the token match and special cases always get priority.
5. If we didn't consume a prefix, try to consume a suffix and then go back to
   #2.
6. If we can't consume a prefix or a suffix, look for a URL match.
7. If there's no URL match, then look for a special case.
8. Look for "infixes" (hyphens, for example) and split the substring into
   tokens on all infixes.
9. Once we can't consume any more of the string, handle it as a single token.
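
To make the interaction between punctuation splitting and special cases
concrete, here's a stripped-down sketch of the loop that only needs the standard
library. It handles prefixes, suffixes and special cases only (no `token_match`,
`url_match` or infixes). The function name `simple_tokenize` and the tiny
regular expressions are made up for illustration and are not spaCy's actual
rules:

```python
import re

def simple_tokenize(text, special_cases, prefix_search, suffix_search):
    """Toy version of the loop above: prefixes, suffixes and special cases only."""
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Special cases are checked again after every split
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                break
            prefix = prefix_search(substring)
            if prefix:
                tokens.append(substring[: prefix.end()])
                substring = substring[prefix.end() :]
                continue
            suffix = suffix_search(substring)
            if suffix:
                suffixes.append(substring[suffix.start() :])
                substring = substring[: suffix.start()]
                continue
            # Nothing left to split off: keep the rest as a single token
            tokens.append(substring)
            break
        # Suffixes were collected from the outside in, so reverse them
        tokens.extend(reversed(suffixes))
    return tokens

# Hypothetical rules, just enough for the example
special_cases = {"don't": ["do", "n't"]}
prefix_search = re.compile(r"""^[\[\("']""").search
suffix_search = re.compile(r"""[\]\)"'!?.,]$""").search

print(simple_tokenize("(don't)!", special_cases, prefix_search, suffix_search))
# ['(', 'do', "n't", ')', '!']
```

The open bracket is split off first, then the exclamation mark and the close
bracket, and only then does the remaining `"don't"` match the special case.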

</Accordion>

**Global** and **language-specific** tokenizer data is supplied via the language
data in [`spacy/lang`](%%GITHUB_SPACY/spacy/lang). The tokenizer exceptions
define special cases like "don't" in English, which needs to be split into two
tokens: `{ORTH: "do"}` and `{ORTH: "n't", NORM: "not"}`. The prefixes, suffixes
and infixes mostly define punctuation rules – for example, when to split off
periods (at the end of a sentence), and when to leave tokens containing periods
intact (abbreviations like "U.S.").

<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">

Tokenization rules that are specific to one language, but can be **generalized
across that language**, should ideally live in the language data in
[`spacy/lang`](%%GITHUB_SPACY/spacy/lang) – we always appreciate pull requests!
Anything that's specific to a domain or text type – like financial trading
abbreviations, or Bavarian youth slang – should be added as a special case rule
to your tokenizer instance. If you're dealing with a lot of customizations, it
might make sense to create an entirely custom subclass.

</Accordion>

---

### Adding special case tokenization rules {#special-cases}

Most domains have at least some idiosyncrasies that require custom tokenization
rules. These could be very specific expressions, or abbreviations only used in
this particular field. Here's how to add a special case rule to an existing
[`Tokenizer`](/api/tokenizer) instance:

```python
### {executable="true"}
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("gimme that")  # phrase to tokenize
print([w.text for w in doc])  # ['gimme', 'that']

# Add special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

# Check new tokenization
print([w.text for w in nlp("gimme that")])  # ['gim', 'me', 'that']
```

The special case doesn't have to match an entire whitespace-delimited substring.
The tokenizer will incrementally split off punctuation, and keep looking up the
remaining substring. The special case rules also have precedence over the
punctuation splitting.

```python
assert "gimme" not in [w.text for w in nlp("gimme!")]
assert "gimme" not in [w.text for w in nlp('("...gimme...?")')]

nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
assert len(nlp("...gimme...?")) == 1
```

#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}

A working implementation of the pseudo-code above is available for debugging as
[`nlp.tokenizer.explain(text)`](/api/tokenizer#explain). It returns a list of
tuples showing which tokenizer rule or pattern was matched for each token. The
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:

> #### Expected output
>
> ```
> "      PREFIX
> Let    SPECIAL-1
> 's     SPECIAL-2
> go     TOKEN
> !      SUFFIX
> "      SUFFIX
> ```

```python
### {executable="true"}
from spacy.lang.en import English

nlp = English()
text = '''"Let's go!"'''
doc = nlp(text)
tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\\t", t[0])
```

### Customizing spaCy's Tokenizer class {#native-tokenizers}

Let's imagine you wanted to create a tokenizer for a new language or specific
domain. There are six things you may need to define:

1. A dictionary of **special cases**. This handles things like contractions,
   units of measurement, emoticons, certain abbreviations, etc.
2. A function `prefix_search`, to handle **preceding punctuation**, such as open
   quotes, open brackets, etc.
3. A function `suffix_search`, to handle **succeeding punctuation**, such as
   commas, periods, close quotes, etc.
4. A function `infix_finditer`, to handle non-whitespace separators, such as
   hyphens etc.
5. An optional boolean function `token_match` matching strings that should never
   be split, overriding the infix rules. Useful for things like numbers.
6. An optional boolean function `url_match`, which is similar to `token_match`
   except that prefixes and suffixes are removed before applying the match.

You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
to use `re.compile()` to build a regular expression object, and pass its
`.search()` and `.finditer()` methods:

```python
### {executable="true"}
import re
import spacy
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, rules=special_cases,
                                prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                url_match=simple_url_re.match)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc]) # ['hello', '-', 'world.', ':)']
```

If you need to subclass the tokenizer instead, the relevant methods to
specialize are `find_prefix`, `find_suffix` and `find_infix`.

<Infobox title="Important note" variant="warning">

When customizing the prefix, suffix and infix handling, remember that you're
passing in **functions** for spaCy to execute, e.g. `prefix_re.search` – not
just the regular expressions. This means that your functions also need to define
how the rules should be applied. For example, if you're adding your own prefix
rules, you need to make sure they're only applied to characters at the
**beginning of a token**, e.g. by adding `^`. Similarly, suffix rules should
only be applied at the **end of a token**, so your expression should end with a
`$`.

</Infobox>
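
The effect of the anchors can be checked with the standard library alone. The
following is a small sketch with made-up patterns (not spaCy's defaults) that
compares an unanchored expression with an anchored one when called via
`.search`:

```python
import re

# Without "^", .search() also finds a quote in the middle of the string
unanchored = re.compile(r"""["']""")
# With "^", the pattern can only match at the start of the substring
anchored = re.compile(r"""^["']""")

substring = "rock'n'roll"
print(unanchored.search(substring))  # a match for the apostrophe at index 4
print(anchored.search(substring))    # None

# Suffix patterns are anchored at the end with "$"
suffix = re.compile(r"""["']$""")
print(suffix.search('hello"').start())  # 5
```

If the unanchored expression were used as `prefix_search`, the tokenizer would
be told that a prefix was found even though the substring doesn't start with
one.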

#### Modifying existing rule sets {#native-tokenizer-additions}

In many situations, you don't necessarily need entirely custom rules. Sometimes
you just want to add another character to the prefixes, suffixes or infixes. The
default prefix, suffix and infix rules are available via the `nlp` object's
`Defaults` and the `Tokenizer` attributes such as
[`Tokenizer.suffix_search`](/api/tokenizer#attributes) are writable, so you can
overwrite them with compiled regular expression objects using modified default
rules. spaCy ships with utility functions to help you compile the regular
expressions – for example,
[`compile_suffix_regex`](/api/top-level#util.compile_suffix_regex):

```python
suffixes = nlp.Defaults.suffixes + [r'''-+$''',]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```

Similarly, you can remove a character from the default suffixes:

```python
suffixes = list(nlp.Defaults.suffixes)
suffixes.remove("\\\\[")
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```

The `Tokenizer.suffix_search` attribute should be a function which takes a
unicode string and returns a **regex match object** or `None`. Usually we use
the `.search` attribute of a compiled regex object, but you can use some other
function that behaves the same way.
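
As a rough illustration of "some other function that behaves the same way", the
sketch below wraps a compiled regex in a plain function that refuses to split
one particular string. The pattern and the `"etc."` rule are assumptions chosen
for this example and are not part of spaCy's defaults:

```python
import re

_suffix_re = re.compile(r"[.!?]+$")

def suffix_search(string):
    """Return a regex match for trailing punctuation, or None."""
    # Hypothetical rule: never split the period off "etc."
    if string.lower() == "etc.":
        return None
    return _suffix_re.search(string)

print(suffix_search("etc."))            # None
print(suffix_search("done!!").group())  # !!
```

Because it takes a string and returns either a match object or `None`, a
function like this follows the same contract as `suffix_regex.search`.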

<Infobox title="Important note" variant="warning">

If you've loaded a trained pipeline, writing to the
[`nlp.Defaults`](/api/language#defaults) or `English.Defaults` directly won't
work, since the regular expressions are read from the pipeline data and will be
compiled when you load it. If you modify `nlp.Defaults`, you'll only see the
effect if you call [`spacy.blank`](/api/top-level#spacy.blank). If you want to
modify the tokenizer loaded from a trained pipeline, you should modify
`nlp.tokenizer` directly. If you're training your own pipeline, you can register
[callbacks](/usage/training/#custom-code-nlp-callbacks) to modify the `nlp`
object before training.

</Infobox>

The prefix, infix and suffix rule sets include not only individual characters
but also detailed regular expressions that take the surrounding context into
account. For example, there is a regular expression that treats a hyphen between
letters as an infix. If you do not want the tokenizer to split on hyphens
between letters, you can modify the existing infix definition from
[`lang/punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py):

```python
### {executable="true"}
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

# Default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']

# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\\-\\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother-in-law']
```

For an overview of the default regular expressions, see
[`lang/punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) and
language-specific definitions such as
[`lang/de/punctuation.py`](%%GITHUB_SPACY/spacy/lang/de/punctuation.py) for
German.
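
To see why the surrounding context matters, here's a minimal sketch using only
the standard library. It assumes a simplified, ASCII-only stand-in for `ALPHA`
(spaCy's character classes cover far more), and the helper `split_on_infixes`
is a hypothetical name that mirrors the infix step of the pseudo-code:

```python
import re

# Simplified stand-in for spaCy's ALPHA character class (assumption)
ALPHA = "a-zA-Z"
# A hyphen only counts as an infix if it has a letter on both sides
hyphen_between_letters = re.compile(rf"(?<=[{ALPHA}])-(?=[{ALPHA}])")

def split_on_infixes(substring, infix_finditer):
    """Split a substring on every infix match, keeping the infixes as tokens."""
    pieces, offset = [], 0
    for match in infix_finditer(substring):
        if substring[offset : match.start()]:
            pieces.append(substring[offset : match.start()])
        pieces.append(match.group())
        offset = match.end()
    if substring[offset:]:
        pieces.append(substring[offset:])
    return pieces

print(split_on_infixes("mother-in-law", hyphen_between_letters.finditer))
# ['mother', '-', 'in', '-', 'law']
print(split_on_infixes("10-20", hyphen_between_letters.finditer))
# ['10-20']
```

The lookbehind and lookahead don't consume any characters, so only the hyphen
itself becomes a token, and a hyphen between digits is left alone.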

### Hooking a custom tokenizer into the pipeline {#custom-tokenizer}

The tokenizer is the first component of the processing pipeline and the only one
that can't be replaced by writing to `nlp.pipeline`. This is because it has a
different signature from all the other components: it takes a text and returns a
[`Doc`](/api/doc), whereas all other components expect to already receive a
tokenized `Doc`.

![The processing pipeline](../images/pipeline.svg)

To overwrite the existing tokenizer, you need to replace `nlp.tokenizer` with a
custom function that takes a text, and returns a [`Doc`](/api/doc).

> #### Creating a Doc
>
> Constructing a [`Doc`](/api/doc) object manually requires at least two
> arguments: the shared `Vocab` and a list of words. Optionally, you can pass in
> a list of `spaces` values indicating whether the token at this position is
> followed by a space (default `True`). See the section on
> [pre-tokenized text](#own-annotations) for more info.
>
> ```python
> words = ["Let", "'s", "go", "!"]
> spaces = [False, True, False, False]
> doc = Doc(nlp.vocab, words=words, spaces=spaces)
> ```

```python
nlp = spacy.blank("en")
nlp.tokenizer = my_tokenizer
```

| Argument    | Type              | Description               |
| ----------- | ----------------- | ------------------------- |
| `text`      | `str`             | The raw text to tokenize. |
| **RETURNS** | [`Doc`](/api/doc) | The tokenized document.   |

#### Example 1: Basic whitespace tokenizer {#custom-tokenizer-example}

Here's an example of the most basic whitespace tokenizer. It takes the shared
vocab, so it can construct `Doc` objects. When it's called on a text, it returns
a `Doc` object consisting of the text split on single space characters. We can
then overwrite the `nlp.tokenizer` attribute with an instance of our custom
tokenizer.

```python
### {executable="true"}
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([token.text for token in doc])
```

#### Example 2: Third-party tokenizers (BERT word pieces) {#custom-tokenizer-example2}

You can use the same approach to plug in any other third-party tokenizers. Your
custom callable just needs to return a `Doc` object with the tokens produced by
your tokenizer. In this example, the wrapper uses the **BERT word piece
tokenizer**, provided by the
[`tokenizers`](https://github.com/huggingface/tokenizers) library. The tokens
available in the `Doc` object returned by spaCy now match the exact word pieces
produced by the tokenizer.

> #### 💡 Tip: spacy-transformers
>
> If you're working with transformer models like BERT, check out the
> [`spacy-transformers`](https://github.com/explosion/spacy-transformers)
> extension package and [documentation](/usage/embeddings-transformers). It
> includes a pipeline component for using pretrained transformer weights and
> **training transformer models** in spaCy, as well as helpful utilities for
> aligning word pieces to linguistic tokenization.

```python
### Custom BERT word piece tokenizer
from tokenizers import BertWordPieceTokenizer
from spacy.tokens import Doc
import spacy

class BertTokenizer:
    def __init__(self, vocab, vocab_file, lowercase=True):
        self.vocab = vocab
        self._tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=lowercase)

    def __call__(self, text):
        tokens = self._tokenizer.encode(text)
        words = []
        spaces = []
        for i, (text, (start, end)) in enumerate(zip(tokens.tokens, tokens.offsets)):
            words.append(text)
            if i < len(tokens.tokens) - 1:
                # If next start != current end we assume a space in between
                next_start, next_end = tokens.offsets[i + 1]
                spaces.append(next_start > end)
            else:
                spaces.append(True)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = BertTokenizer(nlp.vocab, "bert-base-uncased-vocab.txt")
doc = nlp("Justin Drew Bieber is a Canadian singer, songwriter, and actor.")
print(doc.text, [token.text for token in doc])
# [CLS]justin drew bi##eber is a canadian singer, songwriter, and actor.[SEP]
# ['[CLS]', 'justin', 'drew', 'bi', '##eber', 'is', 'a', 'canadian', 'singer',
#  ',', 'songwriter', ',', 'and', 'actor', '.', '[SEP]']
```

<Infobox title="Important note on tokenization and models" variant="warning">

Keep in mind that your models' results may be less accurate if the tokenization
during training differs from the tokenization at runtime. So if you modify a
trained pipeline's tokenization afterwards, it may produce very different
predictions. You should therefore train your pipeline with the **same
tokenizer** it will be using at runtime. See the docs on
[training with custom tokenization](#custom-tokenizer-training) for details.

</Infobox>

#### Training with custom tokenization {#custom-tokenizer-training new="3"}

spaCy's [training config](/usage/training#config) describes the settings,
hyperparameters, pipeline and tokenizer used for constructing and training the
pipeline. The `[nlp.tokenizer]` block refers to a **registered function** that
takes the `nlp` object and returns a tokenizer. Here, we're registering a
function called `whitespace_tokenizer` in the
[`@tokenizers` registry](/api/registry). To make sure spaCy knows how to
construct your tokenizer during training, you can pass in your Python file by
setting `--code functions.py` when you run [`spacy train`](/api/cli#train).

> #### config.cfg
>
> ```ini
> [nlp.tokenizer]
> @tokenizers = "whitespace_tokenizer"
> ```

```python
### functions.py {highlight="1"}
@spacy.registry.tokenizers("whitespace_tokenizer")
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)

    return create_tokenizer
```

Registered functions can also take arguments that are then passed in from the
config. This allows you to quickly change and keep track of different settings.
Here, the registered function called `bert_word_piece_tokenizer` takes two
arguments: the path to a vocabulary file and whether to lowercase the text. The
Python type hints `str` and `bool` ensure that the received values have the
correct type.

> #### config.cfg
>
> ```ini
> [nlp.tokenizer]
> @tokenizers = "bert_word_piece_tokenizer"
> vocab_file = "bert-base-uncased-vocab.txt"
> lowercase = true
> ```

```python
### functions.py {highlight="1"}
@spacy.registry.tokenizers("bert_word_piece_tokenizer")
def create_bert_tokenizer(vocab_file: str, lowercase: bool):
    def create_tokenizer(nlp):
        # BertTokenizer is the custom wrapper class defined in Example 2
        return BertTokenizer(nlp.vocab, vocab_file, lowercase)

    return create_tokenizer
```

To avoid hard-coding local paths into your config file, you can also set the
vocab path on the CLI by using the `--nlp.tokenizer.vocab_file`
[override](/usage/training#config-overrides) when you run
[`spacy train`](/api/cli#train). For more details on using registered functions,
see the docs on [training with custom code](/usage/training#custom-code).

<Infobox variant="warning">

Remember that a registered function should always be a function that spaCy
**calls to create something**, not the "something" itself. In this case, it
**creates a function** that takes the `nlp` object and returns a callable that
takes a text and returns a `Doc`.

</Infobox>
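
The nesting can be easier to follow in plain Python. The sketch below leaves out
the registry decorator and the `Doc` entirely: the innermost function returns a
list of strings, and `nlp=None` is only a placeholder. It's meant to show the
three levels of calls, not a working spaCy tokenizer:

```python
def create_whitespace_tokenizer(lowercase: bool = False):
    # Level 1: called once, with the settings from the config block
    def create_tokenizer(nlp):
        # Level 2: called once, with the nlp object (a placeholder here)
        def tokenizer(text):
            # Level 3: called for every text; a real one would return a Doc
            words = text.split(" ")
            return [w.lower() for w in words] if lowercase else words

        return tokenizer

    return create_tokenizer

create_tokenizer = create_whitespace_tokenizer(lowercase=True)
tokenizer = create_tokenizer(nlp=None)
print(tokenizer("Hello World"))  # ['hello', 'world']
```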

#### Using pre-tokenized text {#own-annotations}

spaCy generally assumes by default that your data is **raw text**. However,
sometimes your data is partially annotated, e.g. with pre-existing tokenization,
part-of-speech tags, etc. The most common situation is that you have
**pre-defined tokenization**. If you have a list of strings, you can create a
[`Doc`](/api/doc) object directly. Optionally, you can also specify a list of
boolean values, indicating whether each word is followed by a space.

> #### ✏️ Things to try
>
> 1. Change a boolean value in the list of `spaces`. You should see it reflected
>    in the `doc.text` and whether the token is followed by a space.
> 2. Remove `spaces=spaces` from the `Doc`. You should see that every token is
>    now followed by a space.
> 3. Copy-paste a random sentence from the internet and manually construct a
>    `Doc` with `words` and `spaces` so that the `doc.text` matches the original
>    input text.

```python
### {executable="true"}
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])
```

If provided, the spaces list must be the **same length** as the words list. The
spaces list affects the `doc.text`, `span.text`, `token.idx`, `span.start_char`
and `span.end_char` attributes. If you don't provide a `spaces` sequence, spaCy
will assume that all words are followed by a space. Once you have a
[`Doc`](/api/doc) object, you can write to its attributes to set the
part-of-speech tags, syntactic dependencies, named entities and other
attributes.
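
The relationship between `words`, `spaces` and the character-based attributes
can be sketched without spaCy. The loop below is an illustration of the idea,
not spaCy's implementation: it rebuilds the text and records the offset at which
each word starts, which corresponds to what `token.idx` reports:

```python
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]

text = ""
offsets = []
for word, space in zip(words, spaces):
    offsets.append(len(text))  # character offset of this word in the text
    text += word + (" " if space else "")

print(repr(text))  # 'Hello, world!'
print(offsets)     # [0, 5, 7, 12]
```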

#### Aligning tokenization {#aligning-tokenization}

spaCy's tokenization is non-destructive and uses language-specific rules
optimized for compatibility with treebank annotations. Other tools and resources
can sometimes tokenize things differently – for example, `"I'm"` →
`["I", "'", "m"]` instead of `["I", "'m"]`.

In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
apply them to spaCy tokens. spaCy's [`Alignment`](/api/example#alignment-object)
object provides one-to-one mappings of token indices in both directions, and
also accounts for indices where multiple tokens align to a single token.

> #### ✏️ Things to try
>
> 1. Change the capitalization in one of the token lists – for example,
>    `"obama"` to `"Obama"`. You'll see that the alignment is case-insensitive.
> 2. Change `"podcasts"` in `other_tokens` to `"pod", "casts"`. You should see
>    that there are now two tokens of length 2 in `y2x`, one corresponding to
>    "'s", and one to "podcasts".
> 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that all
>    tokens now correspond 1-to-1.

```python
### {executable="true"}
from spacy.training import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(f"a -> b, lengths: {align.x2y.lengths}")  # array([1, 1, 1, 1, 1, 1, 1, 1])
print(f"a -> b, mapping: {align.x2y.dataXd}")  # array([0, 1, 2, 3, 4, 4, 5, 6]) : two tokens both refer to "'s"
print(f"b -> a, lengths: {align.y2x.lengths}")  # array([1, 1, 1, 1, 2, 1, 1])   : the token "'s" refers to two tokens
print(f"b -> a, mappings: {align.y2x.dataXd}")  # array([0, 1, 2, 3, 4, 5, 6, 7])
```

Here are some insights from the alignment information generated in the example
above:

- The one-to-one mappings for the first four tokens are identical, which means
  they map to each other. This makes sense because they're also identical in the
  input: `"i"`, `"listened"`, `"to"` and `"obama"`.
- The value of `x2y.dataXd[6]` is `5`, which means that `other_tokens[6]`
  (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
- `x2y.dataXd[4]` and `x2y.dataXd[5]` are both `4`, which means that both tokens
  4 and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens`
  (`"'s"`).
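
The same mappings can be reproduced by hand with character offsets. This is a
simplified sketch for intuition only and is not how `Alignment` is implemented.
It assumes both token lists spell out exactly the same string, and the helper
names `char_spans` and `align` are made up for the example:

```python
def char_spans(tokens):
    """Return (start, end) character offsets, ignoring whitespace between tokens."""
    spans, pos = [], 0
    for token in tokens:
        spans.append((pos, pos + len(token)))
        pos += len(token)
    return spans

def align(a, b):
    """For each token in a, list the indices of the tokens in b it overlaps."""
    spans_a, spans_b = char_spans(a), char_spans(b)
    return [
        [
            j
            for j, (b_start, b_end) in enumerate(spans_b)
            if b_start < a_end and a_start < b_end
        ]
        for a_start, a_end in spans_a
    ]

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]

print(align(other_tokens, spacy_tokens))
# [[0], [1], [2], [3], [4], [4], [5], [6]]
print(align(spacy_tokens, other_tokens))
# [[0], [1], [2], [3], [4, 5], [6], [7]]
```

Flattening the first result gives the values shown for `x2y` above, and the
lengths of the inner lists in the second result match the lengths shown for
`y2x`.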

<Infobox title="Important note" variant="warning">

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.

</Infobox>
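
If you're not sure whether two tokenizations meet this requirement, a quick
pre-check along these lines may help. It's a hypothetical helper, not part of
spaCy, and it lowercases both sides on the assumption that case differences
don't matter for the alignment:

```python
def can_align(tokens_a, tokens_b):
    """Check that both token lists add up to the same string, ignoring case."""
    return "".join(tokens_a).lower() == "".join(tokens_b).lower()

print(can_align(["I", "'", "m"], ["I", "'m"]))  # True
print(can_align(["I", "'m"], ["I", "am"]))      # False
```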

## Merging and splitting {#retokenization new="2.1"}

The [`Doc.retokenize`](/api/doc#retokenize) context manager lets you merge and
split tokens. Modifications to the tokenization are stored and performed all at
once when the context manager exits. To merge several tokens into one single
token, pass a `Span` to [`retokenizer.merge`](/api/doc#retokenizer.merge). An
optional dictionary of `attrs` lets you set attributes that will be assigned to
the merged token – for example, the lemma, part-of-speech tag or entity type. By
default, the merged token will receive the same attributes as the merged span's
root.

> #### ✏️ Things to try
>
> 1. Inspect the `token.lemma_` attribute with and without setting the `attrs`.
>    You'll see that the lemma defaults to "New", the lemma of the span's root.
> 2. Overwrite other attributes like the `"ENT_TYPE"`. Since "New York" is also
>    recognized as a named entity, this change will also be reflected in the
>    `doc.ents`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
print("After:", [token.text for token in doc])
```

> #### Tip: merging entities and noun phrases
>
> If you need to merge named entities or noun chunks, check out the built-in
> [`merge_entities`](/api/pipeline-functions#merge_entities) and
> [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
> components. When added to your pipeline using `nlp.add_pipe`, they'll take
> care of merging the spans automatically.

If an attribute in the `attrs` is a context-dependent token attribute, it will
be applied to the underlying [`Token`](/api/token). For example `LEMMA`, `POS`
or `DEP` only apply to a word in context, so they're token attributes. If an
attribute is a context-independent lexical attribute, it will be applied to the
underlying [`Lexeme`](/api/lexeme), the entry in the vocabulary. For example,
`LOWER` or `IS_STOP` apply to all words of the same spelling, regardless of the
context.

<Infobox variant="warning" title="Note on merging overlapping spans">

If you're trying to merge spans that overlap, spaCy will raise an error because
it's unclear how the result should look. Depending on the application, you may
want to match the shortest or longest possible span, so it's up to you to filter
them. If you're looking for the longest non-overlapping span, you can use the
[`util.filter_spans`](/api/top-level#util.filter_spans) helper:

```python
from spacy.util import filter_spans

doc = nlp("I live in Berlin Kreuzberg")
spans = [doc[3:5], doc[3:4], doc[4:5]]
filtered_spans = filter_spans(spans)
```

</Infobox>
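
The "longest span wins" idea can be sketched on plain `(start, end)` token
offsets. This is only an illustration of the filtering strategy, not spaCy's
code: `filter_spans` works on `Span` objects, and the function name
`filter_longest` and the tie-breaking by earliest start are assumptions made for
this example:

```python
def filter_longest(spans):
    """Keep the longest non-overlapping (start, end) spans, end exclusive."""
    # Longest spans first; ties are broken by the earliest start
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    result, seen = [], set()
    for start, end in ordered:
        # Only keep a span if none of its tokens are already covered
        if not any(i in seen for i in range(start, end)):
            result.append((start, end))
            seen.update(range(start, end))
    return sorted(result)

# Same offsets as doc[3:5], doc[3:4] and doc[4:5]
print(filter_longest([(3, 5), (3, 4), (4, 5)]))  # [(3, 5)]
```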

### Splitting tokens

The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
one token into two or more tokens. This can be useful for cases where
tokenization rules alone aren't sufficient. For example, you might want to split
"its" into the tokens "it" and "s" where it's a misspelling of "it's", but not
where it's the possessive pronoun "its". You can write rule-based logic that
finds only the correct "its" to split, but by that time, the `Doc` will already
be tokenized.

This process of splitting a token requires more settings, because you need to
specify the text of the individual tokens, optional per-token attributes and how
they should be attached to the existing syntax tree. This can be done by
supplying a list of `heads` – either the token to attach the newly split token
to, or a `(token, subtoken)` tuple if the newly split token should be attached
to another subtoken. In this case, "New" should be attached to "York" (the
second split subtoken) and "York" should be attached to "in".

> #### ✏️ Things to try
>
> 1. Assign different attributes to the subtokens and compare the result.
> 2. Change the heads so that "New" is attached to "in" and "York" is attached
>    to "New".
> 3. Split the token into three tokens instead of two – for example,
>    `["New", "Yo", "rk"]`.

```python
### {executable="true"}
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in NewYork")
print("Before:", [token.text for token in doc])
displacy.render(doc)  # displacy.serve if you're not in a Jupyter environment

with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]
    attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["compound", "pobj"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
print("After:", [token.text for token in doc])
displacy.render(doc)  # displacy.serve if you're not in a Jupyter environment
```

Specifying the heads as a list of `token` or `(token, subtoken)` tuples allows
attaching split subtokens to other subtokens, without having to keep track of
the token indices after splitting.

| Token    | Head          | Description                                                                                         |
| -------- | ------------- | --------------------------------------------------------------------------------------------------- |
| `"New"`  | `(doc[3], 1)` | Attach this token to the second subtoken (index `1`) that `doc[3]` will be split into, i.e. "York". |
| `"York"` | `doc[2]`      | Attach this token to `doc[2]` in the original `Doc`, i.e. "in".                                     |

If you don't care about the heads (for example, if you're only running the
tokenizer and not the parser), you can attach each subtoken to itself:

```python
### {highlight="3"}
doc = nlp("I live in NewYorkCity")
with doc.retokenize() as retokenizer:
    heads = [(doc[3], 0), (doc[3], 1), (doc[3], 2)]
    retokenizer.split(doc[3], ["New", "York", "City"], heads=heads)
```

<Infobox title="Important note" variant="warning">

When splitting tokens, the subtoken texts always have to match the original
token text – or, put differently, `"".join(subtokens) == token.text` always
needs to hold true. If this weren't the case, splitting tokens could easily end
up producing confusing and unexpected results that would contradict spaCy's
non-destructive tokenization policy.

```diff
doc = nlp("I live in L.A.")
with doc.retokenize() as retokenizer:
-    retokenizer.split(doc[3], ["Los", "Angeles"], heads=[(doc[3], 1), doc[2]])
+    retokenizer.split(doc[3], ["L.", "A."], heads=[(doc[3], 1), doc[2]])
```

</Infobox>
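
If you're generating the subtoken texts programmatically, you can test this
condition before calling `retokenizer.split`. The helper below is a hypothetical
convenience function, not part of spaCy:

```python
def valid_split(token_text, subtokens):
    """Check that the subtoken texts add up to the original token text."""
    return "".join(subtokens) == token_text

print(valid_split("L.A.", ["L.", "A."]))         # True
print(valid_split("L.A.", ["Los", "Angeles"]))   # False
```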

### Overwriting custom extension attributes {#retokenization-extensions}

If you've registered custom
[extension attributes](/usage/processing-pipelines#custom-components-attributes),
you can overwrite them during retokenization by providing a dictionary of
attribute names mapped to new values as the `"_"` key in the `attrs`. For
merging, you need to provide one dictionary of attributes for the resulting
merged token. For splitting, you need to provide a list of dictionaries with
custom attributes, one per split subtoken.

<Infobox title="Important note" variant="warning">

To set extension attributes during retokenization, the attributes need to be
**registered** using the [`Token.set_extension`](/api/token#set_extension)
method and they need to be **writable**. This means that they should either have
a default value that can be overwritten, or a getter _and_ setter. Method
extensions or extensions with only a getter are computed dynamically, so their
values can't be overwritten. For more details, see the
[extension attribute docs](/usage/processing-pipelines/#custom-components-attributes).

</Infobox>

> #### ✏️ Things to try
>
> 1. Add another custom extension – maybe `"music_style"`? – and overwrite it.
> 2. Change the extension attribute to use only a `getter` function. You should
>    see that spaCy raises an error, because the attribute is not writable
>    anymore.
> 3. Rewrite the code to split a token with `retokenizer.split`. Remember that
>    you need to provide a list of extension attribute values as the `"_"`
>    property, one for each split subtoken.

```python
### {executable="true"}
import spacy
from spacy.tokens import Token

# Register a custom token attribute, token._.is_musician
Token.set_extension("is_musician", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like David Bowie")
print("Before:", [(token.text, token._.is_musician) for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4], attrs={"_": {"is_musician": True}})
print("After:", [(token.text, token._.is_musician) for token in doc])
```
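For splitting, the same idea can be sketched as follows. This is a minimal
illustration rather than a canonical recipe: it uses a blank English pipeline
(so no trained package is needed), the input text `"DavidBowie"` is made up for
the example, and `force=True` is only there so the snippet can be re-run after
the extension has already been registered.

```python
import spacy
from spacy.tokens import Token

# force=True avoids an error if the extension was registered earlier
Token.set_extension("is_musician", default=False, force=True)

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("I like DavidBowie")

with doc.retokenize() as retokenizer:
    # "David" attaches to the second subtoken, "Bowie" attaches to "like"
    heads = [(doc[2], 1), doc[1]]
    # One dictionary of extension values per split subtoken
    attrs = {"_": [{"is_musician": True}, {"is_musician": True}]}
    retokenizer.split(doc[2], ["David", "Bowie"], heads=heads, attrs=attrs)

result = [(token.text, token._.is_musician) for token in doc]
print(result)
```

The subtoken texts have to join up to the original token text, which is why
the unsplit token is written without a space here.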

## Sentence Segmentation {#sbd}

A [`Doc`](/api/doc) object's sentences are available via the `Doc.sents`
property. To view a `Doc`'s sentences, you can iterate over `Doc.sents`, a
generator that yields [`Span`](/api/span) objects. You can check whether a `Doc`
has sentence boundaries by calling
[`Doc.has_annotation`](/api/doc#has_annotation) with the attribute name
`"SENT_START"`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)
```
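The check matters because a pipeline without any segmenting component leaves
sentence boundaries unset. The sketch below assumes a blank English pipeline
and uses the rule-based `sentencizer` only as a convenient way to add
boundaries. It compares `Doc.has_annotation("SENT_START")` before and after
the component is added.

```python
import spacy

text = "This is a sentence. This is another sentence."

nlp = spacy.blank("en")  # tokenizer only: nothing sets sentence boundaries
unsegmented = nlp(text)
has_boundaries_before = unsegmented.has_annotation("SENT_START")

nlp.add_pipe("sentencizer")  # rule-based segmenter, no trained model needed
segmented = nlp(text)
has_boundaries_after = segmented.has_annotation("SENT_START")

print(has_boundaries_before, has_boundaries_after)
print([sent.text for sent in segmented.sents])
```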

spaCy provides four alternatives for sentence segmentation:

1. [Dependency parser](#sbd-parser): the statistical
   [`DependencyParser`](/api/dependencyparser) provides the most accurate
   sentence boundaries based on full dependency parses.
2. [Statistical sentence segmenter](#sbd-senter): the statistical
   [`SentenceRecognizer`](/api/sentencerecognizer) is a simpler and faster
   alternative to the parser that only sets sentence boundaries.
3. [Rule-based pipeline component](#sbd-component): the rule-based
   [`Sentencizer`](/api/sentencizer) sets sentence boundaries using a
   customizable list of sentence-final punctuation.
4. [Custom function](#sbd-custom): your own custom function added to the
   processing pipeline can set sentence boundaries by writing to
   `Token.is_sent_start`.

### Default: Using the dependency parse {#sbd-parser model="parser"}

Unlike other libraries, spaCy uses the dependency parse to determine sentence
boundaries. This is usually the most accurate approach, but it requires a
**trained pipeline** that provides accurate predictions. If your texts are
closer to general-purpose news or web text, this should work well out-of-the-box
with spaCy's provided trained pipelines. For social media or conversational text
that doesn't follow the same rules, your application may benefit from a custom
trained or rule-based component.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```

spaCy's dependency parser respects already set boundaries, so you can preprocess
your `Doc` using custom components _before_ it's parsed. Depending on your text,
this may also improve parse accuracy, since the parser is constrained to predict
parses consistent with the sentence boundaries.

### Statistical sentence segmenter {#sbd-senter model="senter" new="3"}

The [`SentenceRecognizer`](/api/sentencerecognizer) is a simple statistical
component that only provides sentence boundaries. Along with being faster and
smaller than the parser, its primary advantage is that it's easier to train
because it only requires annotated sentence boundaries rather than full
dependency parses. spaCy's [trained pipelines](/models) include both a parser
and a trained sentence segmenter, which is
[disabled](/usage/processing-pipelines#disabling) by default. If you only need
sentence boundaries and no parser, you can use the `enable` and `disable`
arguments on [`spacy.load`](/api/top-level#spacy.load) to enable the senter and
disable the parser.

> #### senter vs. parser
>
> The recall for the `senter` is typically slightly lower than for the parser,
> which is better at predicting sentence boundaries when punctuation is not
> present.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm", enable=["senter"], disable=["parser"])
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```

### Rule-based pipeline component {#sbd-component}

The [`Sentencizer`](/api/sentencizer) component is a
[pipeline component](/usage/processing-pipelines) that splits sentences on
punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you only
need sentence boundaries without dependency parses.

```python
### {executable="true"}
import spacy
from spacy.lang.en import English

nlp = English()  # just the language with no pipeline
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```
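The list of sentence-final punctuation can be customized through the
component's `punct_chars` setting. As a rough sketch, the following treats `;`
as a sentence boundary in addition to `.`, `!` and `?`. The example text and
the choice of characters are illustrative assumptions.

```python
from spacy.lang.en import English

nlp = English()
# Override the default punctuation list; ";" now also ends a sentence
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", ";"]})

doc = nlp("Check the logs; restart the service. Done!")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```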

### Custom rule-based strategy {#sbd-custom}

If you want to implement your own strategy that differs from the default
rule-based approach of splitting on sentence-final punctuation, you can also
create a
[custom pipeline component](/usage/processing-pipelines#custom-components) that
takes a `Doc` object and sets the `Token.is_sent_start` attribute on each
individual token. If set to `True`, the token is marked as the start of a
sentence. If set to `False`, the token is explicitly marked as _not_ the start
of a sentence. If set to `None` (default), it's treated as a missing value and
can still be overwritten by the parser.

<Infobox title="Important note" variant="warning">

To prevent inconsistent state, you can only set boundaries **before** a document
is parsed (and `doc.has_annotation("DEP")` is `False`). To ensure that your
component is added in the right place, you can set `before='parser'` or
`first=True` when adding it to the pipeline using
[`nlp.add_pipe`](/api/language#add_pipe).

</Infobox>

Here's an example of a component that implements a pre-processing rule for
splitting on `"..."` tokens. The component is added before the parser, which is
then used to further segment the text. That's possible, because `is_sent_start`
is only set to `True` for some of the tokens – all others still specify `None`
for unset sentence boundaries. This approach can be useful if you want to
implement **additional** rules specific to your data, while still being able to
take advantage of dependency-based sentence segmentation.

```python
### {executable="true"}
from spacy.language import Language
import spacy

text = "this is a sentence...hello...and another sentence."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Before:", [sent.text for sent in doc.sents])

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe("set_custom_boundaries", before="parser")
doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])
```
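If no parser is available, a custom component can also act as the only
segmenter by assigning `True` or `False` to every token. The following is a
minimal sketch under that assumption: it uses a blank English pipeline, and the
component name `split_on_ellipsis` is hypothetical.

```python
import spacy
from spacy.language import Language

@Language.component("split_on_ellipsis")  # hypothetical component name
def split_on_ellipsis(doc):
    for token in doc:
        # The first token always starts a sentence; any other token starts
        # one only if the previous token is "..."
        token.is_sent_start = token.i == 0 or doc[token.i - 1].text == "..."
    return doc

nlp = spacy.blank("en")  # no parser, so the component decides everything
nlp.add_pipe("split_on_ellipsis")

doc = nlp("this is a sentence ... hello ... and another sentence.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Because every token receives an explicit value, nothing is left for a later
component to fill in, which is the main difference from the parser-based
example.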

## Mappings & Exceptions {#mappings-exceptions new="3"}

The [`AttributeRuler`](/api/attributeruler) manages **rule-based mappings and
exceptions** for all token-level attributes. As the number of
[pipeline components](/api/#architecture-pipeline) has grown from spaCy v2 to
v3, handling rules and exceptions in each component individually has become
impractical, so the `AttributeRuler` provides a single component with a unified
pattern format for all token attribute mappings and exceptions.

The `AttributeRuler` uses
[`Matcher` patterns](/usage/rule-based-matching#adding-patterns) to identify
tokens and then assigns them the provided attributes. If needed, the
[`Matcher`](/api/matcher) patterns can include context around the target token.
For example, the attribute ruler can:

- provide exceptions for any **token attributes**
- map **fine-grained tags** to **coarse-grained tags** for languages without
  statistical morphologizers (replacing the v2.x `tag_map` in the
  [language data](#language-data))
- map token **surface form + fine-grained tags** to **morphological features**
  (replacing the v2.x `morph_rules` in the [language data](#language-data))
- specify the **tags for space tokens** (replacing hard-coded behavior in the
  tagger)

The following example shows how the tag and POS `NNP`/`PROPN` can be specified
for the phrase `"The Who"`, overriding the tags provided by the statistical
tagger and the POS tag map.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"
doc1 = nlp(text)
print(doc1[2].tag_, doc1[2].pos_)  # DT DET
print(doc1[3].tag_, doc1[3].pos_)  # WP PRON

# Add attribute ruler with exception for "The Who" as NNP/PROPN NNP/PROPN
ruler = nlp.get_pipe("attribute_ruler")
# Pattern to match "The Who"
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
# The attributes to assign to the matched token
attrs = {"TAG": "NNP", "POS": "PROPN"}
# Add rules to the attribute ruler
ruler.add(patterns=patterns, attrs=attrs, index=0)  # "The" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=1)  # "Who" in "The Who"

doc2 = nlp(text)
print(doc2[2].tag_, doc2[2].pos_)  # NNP PROPN
print(doc2[3].tag_, doc2[3].pos_)  # NNP PROPN
# The second "Who" remains unmodified
print(doc2[6].tag_, doc2[6].pos_)  # WP PRON
```
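To see the effect of `index` and of pattern context in isolation, here is a
small sketch that adds an `attribute_ruler` to a blank English pipeline, so no
tagger is involved and unmatched tokens simply keep empty values. Only the
second token of the match is modified; the preceding `"The"` serves purely as
context.

```python
import spacy

nlp = spacy.blank("en")  # no tagger: attributes start out unset
ruler = nlp.add_pipe("attribute_ruler")

# Match "Who" only when it follows "the"; index=1 targets the second token
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
ruler.add(patterns=patterns, attrs={"TAG": "NNP", "POS": "PROPN"}, index=1)

doc = nlp("I saw The Who perform. Who did you see?")
band = doc[3]      # "Who" in "The Who"
context = doc[2]   # "The", matched as context but not modified
question = doc[6]  # sentence-initial "Who", not matched at all
print(band.tag_, band.pos_)
print(repr(context.pos_), repr(question.pos_))
```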

<Infobox variant="warning" title="Migrating from spaCy v2.x">

The [`AttributeRuler`](/api/attributeruler) can import a **tag map and morph rules** in the v2.x format via its built-in methods or when the component is initialized before training. See the [migration guide](/usage/v3#migrating-training-mappings-exceptions) for details.

</Infobox>

## Word vectors and semantic similarity {#vectors-similarity}

import Vectors101 from 'usage/101/\_vectors-similarity.md'

<Vectors101 />

### Adding word vectors {#adding-vectors}

Custom word vectors can be trained using a number of open-source libraries, such
as [Gensim](https://radimrehurek.com/gensim), [FastText](https://fasttext.cc),
or Tomas Mikolov's original
[Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most
word vector libraries output an easy-to-read text-based format, where each line
consists of the word followed by its vector. For everyday use, we want to
convert the vectors into a binary format that loads faster and takes up less
space on disk. The easiest way to do this is the
[`init vectors`](/api/cli#init-vectors) command-line utility. With the arguments
used in this section's example, it outputs a blank spaCy pipeline in the
directory `/tmp/la_vectors_wiki_lg`, giving you access to some nice Latin
vectors. You can then pass the directory path to
[`spacy.load`](/api/top-level#spacy.load) or use it in the
[`[initialize]`](/api/data-formats#config-initialize) block of your config when
you [train](/usage/training) a model.

> #### Usage example
>
> ```python
> nlp_latin = spacy.load("/tmp/la_vectors_wiki_lg")
> doc1 = nlp_latin("Caecilius est in horto")
> doc2 = nlp_latin("servus est in atrio")
> doc1.similarity(doc2)
> ```

```cli
$ wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
$ python -m spacy init vectors en cc.la.300.vec.gz /tmp/la_vectors_wiki_lg
```

<Accordion title="How to optimize vector coverage" id="custom-vectors-coverage" spaced>

To help you strike a good balance between coverage and memory usage, spaCy's
[`Vectors`](/api/vectors) class lets you map **multiple keys** to the **same
row** of the table. If you're using the
[`spacy init vectors`](/api/cli#init-vectors) command to create a vocabulary,
pruning the vectors will be taken care of automatically if you set the `--prune`
flag. You can also do it manually in the following steps:

1. Start with a **word vectors package** that covers a huge vocabulary. For
   instance, the [`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg)
   starter provides 300-dimensional GloVe vectors for over 1 million terms of
   English.
2. If your vocabulary has values set for the `Lexeme.prob` attribute, the
   lexemes will be sorted by descending probability to determine which vectors
   to prune. Otherwise, lexemes will be sorted by their order in the `Vocab`.
3. Call [`Vocab.prune_vectors`](/api/vocab#prune_vectors) with the number of
   vectors you want to keep.

```python
import spacy

nlp = spacy.load("en_vectors_web_lg")
n_vectors = 105000  # number of vectors to keep
removed_words = nlp.vocab.prune_vectors(n_vectors)

assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries
```

[`Vocab.prune_vectors`](/api/vocab#prune_vectors) reduces the current vector
table to a given number of unique entries, and returns a dictionary containing
the removed words, mapped to `(string, score)` tuples, where `string` is the
entry the removed word was mapped to, and `score` the similarity score between
the two words.

```python
### Removed words
{
    "Shore": ("coast", 0.732257),
    "Precautionary": ("caution", 0.490973),
    "hopelessness": ("sadness", 0.742366),
    "Continous": ("continuous", 0.732549),
    "Disemboweled": ("corpse", 0.499432),
    "biostatistician": ("scientist", 0.339724),
    "somewheres": ("somewheres", 0.402736),
    "observing": ("observe", 0.823096),
    "Leaving": ("leaving", 1.0),
}
```

In the example above, the vector for "Shore" was removed and remapped to the
vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to
the vector of "leaving", which is identical. If you're using the
[`init vectors`](/api/cli#init-vectors) command, you can set the `--prune`
option to easily reduce the size of the vectors as you add them to a spaCy
pipeline:

```cli
$ python -m spacy init vectors en la.300d.vec.tgz /tmp/la_vectors_web_md --prune 10000
```

This will create a blank spaCy pipeline with vectors for the first 10,000 words
in the vectors. All other words in the vectors are mapped to the closest vector
among those retained.
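To experiment with pruning without downloading a vectors package, a toy table
is enough. The sketch below is only an illustration: the five words, the
8-dimensional random vectors and the seed are arbitrary assumptions, so the
similarity scores it prints carry no meaning.

```python
import numpy
from spacy.vocab import Vocab

rng = numpy.random.default_rng(0)
words = ["dog", "cat", "orange", "apple", "pear"]

vocab = Vocab()
for word in words:
    # Random toy vectors, one per word
    vocab.set_vector(word, rng.uniform(-1, 1, (8,)).astype("float32"))

# Keep 3 unique vectors; the other words are remapped to a retained row
removed_words = vocab.prune_vectors(3)
for word, (kept_word, score) in removed_words.items():
    print(word, "->", kept_word, round(float(score), 3))

n_rows = len(vocab.vectors)     # unique vectors left in the table
n_keys = vocab.vectors.n_keys   # keys that still resolve to a vector
```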

</Accordion>

### Adding vectors individually {#adding-individual-vectors}

The `vector` attribute is a **read-only** numpy or cupy array (depending on
whether you've configured spaCy to use GPU memory), with dtype `float32`. The
array is read-only so that spaCy can avoid unnecessary copy operations where
possible. You can modify the vectors via the [`Vocab`](/api/vocab) or
[`Vectors`](/api/vectors) table. Using the
[`Vocab.set_vector`](/api/vocab#set_vector) method is often the easiest approach
if you have vectors in an arbitrary format, as you can read in the vectors with
your own logic, and just set them with a simple loop. This method is likely to
be slower than approaches that work with the whole vectors table at once, but
it's a great approach for once-off conversions before you save out your `nlp`
object to disk.

```python
### Adding vectors
import numpy
from spacy.vocab import Vocab

vector_data = {
    "dog": numpy.random.uniform(-1, 1, (300,)),
    "cat": numpy.random.uniform(-1, 1, (300,)),
    "orange": numpy.random.uniform(-1, 1, (300,))
}
vocab = Vocab()
for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
```
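The same loop works for text-based formats where each line holds a word
followed by its vector values. The lines below are made-up sample data standing
in for a real vectors file. The sketch parses them with plain Python and checks
the result with `Vocab.has_vector` and `Vocab.get_vector`.

```python
import numpy
from spacy.vocab import Vocab

# Hypothetical sample data in "word v1 v2 v3" format
lines = [
    "dog 0.1 0.2 0.3",
    "cat 0.4 0.5 0.6",
]

vocab = Vocab()
for line in lines:
    word, *values = line.split()
    vector = numpy.asarray([float(v) for v in values], dtype="float32")
    vocab.set_vector(word, vector)

print(vocab.has_vector("dog"), vocab.get_vector("cat"))
```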

## Language Data {#language-data}

import LanguageData101 from 'usage/101/\_language-data.md'

<LanguageData101 />

### Creating a custom language subclass {#language-subclass}

If you want to customize multiple components of the language data or add support
for a custom language or domain-specific "dialect", you can also implement your
own language subclass. The subclass should define two attributes: the `lang`
(unique language code) and the `Defaults` defining the language data. For an
overview of the available attributes that can be overwritten, see the
[`Language.Defaults`](/api/language#defaults) documentation.

```python
### {executable="true"}
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

nlp1 = English()
nlp2 = CustomEnglish()

print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
```

The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you
register a custom language class and assign it a string name. This means that
you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom
language name, and even train pipelines with it and refer to it in your
[training config](/usage/training#config).

> #### Config usage
>
> After registering your custom language class using the `languages` registry,
> you can refer to it in your [training config](/usage/training#config). This
> means spaCy will train your pipeline using the custom subclass.
>
> ```ini
> [nlp]
> lang = "custom_en"
> ```
>
> In order to resolve `"custom_en"` to your subclass, the registered function
> needs to be available during training. You can load a Python file containing
> the code using the `--code` argument:
>
> ```cli
> python -m spacy train config.cfg --code code.py
> ```

```python
### Registering a custom language {highlight="7,12-13"}
import spacy
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

# This now works! 🎉
nlp = spacy.blank("custom_en")
```
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								---
 								title: Linguistic Features
 								next: /usage/rule-based-matching
 								menu:
 								  - ['POS Tagging', 'pos-tagging']
-												Update usage docs for lemmatization and morphology

											
										
										
											2020-08-29 16:56:50 +03:00
+								  - ['Morphology', 'morphology']
 								  - ['Lemmatization', 'lemmatization']
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								  - ['Dependency Parse', 'dependency-parse']
 								  - ['Named Entities', 'named-entities']
-												Add Entity Linking to menu (#4489)


											
										
										
											2019-10-21 13:17:30 +03:00
+								  - ['Entity Linking', 'entity-linking']
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								  - ['Tokenization', 'tokenization']
 								  - ['Merging & Splitting', 'retokenization']
 								  - ['Sentence Segmentation', 'sbd']
-												Update docs [ci skip]

											
										
										
											2020-08-18 01:49:19 +03:00
+								  - ['Vectors & Similarity', 'vectors-similarity']
-												Update usage docs for lemmatization and morphology

											
										
										
											2020-08-29 16:56:50 +03:00
+								  - ['Mappings & Exceptions', 'mappings-exceptions']
 								  - ['Language Data', 'language-data']
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								---
 								Processing raw text intelligently is difficult: most words are rare, and it's
 								common for words that look completely different to mean almost the same thing.
 								The same words in a different order can mean something completely different.
 								Even splitting text into useful word-like units can be difficult in many
 								languages. While it's possible to solve some problems starting from only the raw
 								characters, it's usually better to use linguistic knowledge to add useful
 								information. That's exactly what spaCy is designed to do: you put in raw text,
 								and get back a [`Doc`](/api/doc) object, that comes with a variety of
 								annotations.
 								## Part-of-speech tagging {#pos-tagging model="tagger, parser"}
 								import PosDeps101 from 'usage/101/\_pos-deps.md'
 								<PosDeps101 />
-												Update WIP

											
										
										
											2020-07-06 23:22:37 +03:00
+								<Infobox title="Part-of-speech tag scheme" emoji="📖">
-												Make pos/tag distinction more clear in docs (#4246)

* make distinction between tag and pos more prominent in docs

* out of the 101

											
										
										
											2019-09-06 11:31:21 +03:00
 								For a list of the fine-grained and coarse-grained part-of-speech tags assigned
-												Update v3 docs [ci skip]

											
										
										
											2020-07-05 17:11:16 +03:00
+								by spaCy's models across different languages, see the label schemes documented
 								in the [models directory](/models).
-												Make pos/tag distinction more clear in docs (#4246)

* make distinction between tag and pos more prominent in docs

* out of the 101

											
										
										
											2019-09-06 11:31:21 +03:00
 								</Infobox>
-												Update usage docs for lemmatization and morphology

											
										
										
											2020-08-29 16:56:50 +03:00
+								## Morphology {#morphology}
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
 								Inflectional morphology is the process by which a root form of a word is
 								modified by adding prefixes or suffixes that specify its grammatical function
 								but do not changes its part-of-speech. We say that a **lemma** (root form) is
 								**inflected** (modified/combined) with one or more **morphological features** to
 								create a surface form. Here are some examples:
-												Update usage docs for lemmatization and morphology

											
										
										
											2020-08-29 16:56:50 +03:00
+								| Context                                  | Surface | Lemma | POS    |  Morphological Features                  |
 								| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- |
 								| I was reading the paper                  | reading | read  | `VERB` | `VerbForm=Ger`                           |
 								| I don't watch the news, I read the paper | read    | read  | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
 								| I read the paper yesterday               | read    | read  | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` |
 								Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis)
 								under `Token.morph`, which allows you to access individual morphological
-												Update docs for Token.morph / Token.set_morph

											
										
										
											2020-10-02 09:48:28 +03:00
+								features.
-												Update docs [ci skip]

											
										
										
											2020-08-29 19:43:19 +03:00
 								> #### 📝 Things to try
 								>
 								> 1. Change "I" to "She". You should see that the morphological features change
 								>    and express that it's a pronoun in the third person.
-												Update docs for Token.morph / Token.set_morph

											
										
										
											2020-10-02 09:48:28 +03:00
+								> 2. Inspect `token.morph` for the other tokens.
-												Update usage docs for lemmatization and morphology

											
										
										
											2020-08-29 16:56:50 +03:00
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
-												Update docs [ci skip]

											
										
										
											2020-08-29 19:43:19 +03:00
+								print("Pipeline:", nlp.pipe_names)
-												Update usage docs for lemmatization and morphology

											
										
										
											2020-08-29 16:56:50 +03:00
+								doc = nlp("I was reading the paper.")
-												Update docs [ci skip]

											
										
										
											2020-08-29 19:43:19 +03:00
+								token = doc[0]  # 'I'
-												Update docs for Token.morph / Token.set_morph

											
										
										
											2020-10-02 09:48:28 +03:00
+								print(token.morph)  # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
-												Update docs [ci skip]

											
										
										
											2020-08-29 19:43:19 +03:00
+								print(token.morph.get("PronType"))  # ['Prs']
-												Update usage docs for lemmatization and morphology

											
										
										
											2020-08-29 16:56:50 +03:00
+								```
 								### Statistical morphology {#morphologizer new="3" model="morphologizer"}
-												Update docs [ci skip]

											
										
										
											2020-08-29 19:43:19 +03:00
+								spaCy's statistical [`Morphologizer`](/api/morphologizer) component assigns the
 								morphological features and coarse-grained part-of-speech tags as `Token.morph`
 								and `Token.pos`.
-												Update usage docs for lemmatization and morphology

											
										
										
											2020-08-29 16:56:50 +03:00
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("de_core_news_sm")
 								doc = nlp("Wo bist du?") # English: 'Where are you?'
 								print(doc[2].morph)  # 'Case=Nom|Number=Sing|Person=2|PronType=Prs'
 								print(doc[2].pos_) # 'PRON'
 								```
 								### Rule-based morphology {#rule-based-morphology}
 								For languages with relatively simple morphological systems like English, spaCy
 								can assign morphological features through a rule-based approach, which uses the
 								**token text** and **fine-grained part-of-speech tags** to produce
 								coarse-grained part-of-speech tags and morphological features.
 								1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech
 								   tag**. In the API, these tags are known as `Token.tag`. They express the
 								   part-of-speech (e.g. verb) and some amount of morphological information, e.g.
 								   that the verb is past tense (e.g. `VBD` for a past tense verb in the Penn
 								   Treebank).
 								2. For words whose coarse-grained POS is not set by a prior process, a
 								   [mapping table](#mappings-exceptions) maps the fine-grained tags to
 								   coarse-grained POS tags and morphological features.
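As a minimal sketch of the two steps above, the mapping table can be pictured as a dictionary keyed by fine-grained tag. The entries below are a small hand-written subset chosen for illustration, the `"X"` fallback is an assumption, and `apply_tag_map` is a hypothetical helper. spaCy's actual tables and internals differ.

```python
# Illustrative subset of a fine-grained tag -> (coarse POS, features) table.
# These entries are assumptions for demonstration, not spaCy's real mapping data.
TAG_MAP = {
    "VBD": ("VERB", {"Tense": "Past", "VerbForm": "Fin"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
    "NN": ("NOUN", {"Number": "Sing"}),
}


def apply_tag_map(tag, pos=None):
    """Return (pos, feats) for a fine-grained tag, respecting an already-set POS."""
    if pos is not None:
        # A prior process already set the coarse-grained POS: leave it alone
        return pos, {}
    # Unknown tags fall back to "X" with no features (an assumption of this sketch)
    return TAG_MAP.get(tag, ("X", {}))


print(apply_tag_map("VBD"))  # ('VERB', {'Tense': 'Past', 'VerbForm': 'Fin'})
print(apply_tag_map("NNS"))  # ('NOUN', {'Number': 'Plur'})
print(apply_tag_map("VBD", pos="AUX"))  # ('AUX', {})
```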
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Where are you?")
 								print(doc[2].morph)  # 'Case=Nom|Person=2|PronType=Prs'
 								print(doc[2].pos_)  # 'PRON'
 								```
 								## Lemmatization {#lemmatization model="lemmatizer" new="3"}
 								The [`Lemmatizer`](/api/lemmatizer) is a pipeline component that provides lookup
 								and rule-based lemmatization methods in a configurable component. An individual
 								language can extend the `Lemmatizer` as part of its
 								[language data](#language-data).
 								```python
 								### {executable="true"}
 								import spacy
 								# English pipelines include a rule-based lemmatizer
 								nlp = spacy.load("en_core_web_sm")
 								lemmatizer = nlp.get_pipe("lemmatizer")
 								print(lemmatizer.mode)  # 'rule'
 								doc = nlp("I was reading the paper.")
 								print([token.lemma_ for token in doc])
 								# ['I', 'be', 'read', 'the', 'paper', '.']
 								```
 								<Infobox title="Changed in v3.0" variant="warning">
 								Unlike spaCy v2, spaCy v3 models do _not_ provide lemmas by default or switch
 								automatically between lookup and rule-based lemmas depending on whether a tagger
 								is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to include a
 								[`Lemmatizer`](/api/lemmatizer) component. The lemmatizer component is
 								configured to use a single mode such as `"lookup"` or `"rule"` on
 								initialization. The `"rule"` mode requires `Token.pos` to be set by a previous
 								component.
 								</Infobox>
 								The data for spaCy's lemmatizers is distributed in the package
 								[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
 								provided trained pipelines already include all the required tables, but if you
 								are creating new pipelines, you'll probably want to install `spacy-lookups-data`
 								to provide the data when the lemmatizer is initialized.
 								### Lookup lemmatizer {#lemmatizer-lookup}
 								For pipelines without a tagger or morphologizer, a lookup lemmatizer can be
 								added to the pipeline as long as a lookup table is provided, typically through
 								[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The
 								lookup lemmatizer looks up the token surface form in the lookup table without
 								reference to the token's part-of-speech or context.
 								```python
 								# pip install spacy-lookups-data
 								import spacy
 								nlp = spacy.blank("sv")
 								nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
 								```
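Conceptually, lookup lemmatization is a dictionary access with a fallback to the surface form. The following plain-Python sketch illustrates the idea only: the table entries and the `lookup_lemma` helper are made up for this example and are not taken from `spacy-lookups-data`.

```python
# Toy lookup table: surface form -> lemma (hypothetical entries)
LOOKUP_TABLE = {
    "was": "be",
    "reading": "read",
    "papers": "paper",
}


def lookup_lemma(text):
    """Return the lemma for a surface form, ignoring part-of-speech and context."""
    # Fall back to the surface form itself when the table has no entry
    return LOOKUP_TABLE.get(text, text)


words = ["I", "was", "reading", "papers"]
print([lookup_lemma(word) for word in words])  # ['I', 'be', 'read', 'paper']
```

Because the part-of-speech is ignored, a form that is ambiguous between, say, a noun and a verb can only ever map to a single lemma in this mode.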
 								### Rule-based lemmatizer {#lemmatizer-rule}
 								When training pipelines that include a component that assigns part-of-speech
 								tags (a morphologizer or a tagger with a [POS mapping](#mappings-exceptions)), a
 								rule-based lemmatizer can be added using rule tables from
 								[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data):
 								```python
 								# pip install spacy-lookups-data
 								import spacy
 								nlp = spacy.blank("de")
 								# Morphologizer (note: model is not yet trained!)
 								nlp.add_pipe("morphologizer")
 								# Rule-based lemmatizer
 								nlp.add_pipe("lemmatizer", config={"mode": "rule"})
 								```
 								The rule-based deterministic lemmatizer maps the surface form to a lemma in
 								light of the previously assigned coarse-grained part-of-speech and morphological
 								information, without consulting the context of the token. The rule-based
 								lemmatizer also accepts list-based exception files. For English, these are
 								acquired from [WordNet](https://wordnet.princeton.edu/).
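The interplay between exception lists and suffix rules can be sketched in plain Python. Everything here is a simplified assumption for illustration: the tables are tiny hand-written samples, `rule_lemma` is a hypothetical helper, and spaCy's actual lemmatizer is more involved.

```python
# Hand-written sample tables keyed by coarse-grained POS (not spaCy's real data)
EXCEPTIONS = {"VERB": {"was": "be", "went": "go"}, "NOUN": {"mice": "mouse"}}
SUFFIX_RULES = {
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}


def rule_lemma(text, pos):
    """Lemmatize a surface form using its POS, checking exceptions before rules."""
    form = text.lower()
    if form in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][form]
    for old, new in SUFFIX_RULES.get(pos, []):
        if form.endswith(old) and len(form) > len(old):
            # Apply the first matching suffix rewrite
            return form[: len(form) - len(old)] + new
    return form


print(rule_lemma("reading", "VERB"))  # read
print(rule_lemma("was", "VERB"))  # be
print(rule_lemma("meeting", "NOUN"))  # meeting
print(rule_lemma("meeting", "VERB"))  # meet
```

The last two calls show why the part-of-speech matters: the same surface form yields different lemmas depending on the tag assigned earlier in the pipeline.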
 								## Dependency Parsing {#dependency-parse model="parser"}
 								spaCy features a fast and accurate syntactic dependency parser, and has a rich
 								API for navigating the tree. The parser also powers the sentence boundary
 								detection, and lets you iterate over base noun phrases, or "chunks". You can
 								check whether a [`Doc`](/api/doc) object has been parsed by calling
 								`doc.has_annotation("DEP")`, which checks whether the attribute `Token.dep` has
 								been set and returns a boolean value. If the result is `False`, the default sentence
 								iterator will raise an exception.
 								<Infobox title="Dependency label scheme" emoji="📖">
 								For a list of the syntactic dependency labels assigned by spaCy's models across
 								different languages, see the label schemes documented in the
 								[models directory](/models).
 								</Infobox>
 								### Noun chunks {#noun-chunks}
 								Noun chunks are "base noun phrases" – flat phrases that have a noun as their
 								head. You can think of noun chunks as a noun plus the words describing the noun
 								– for example, "the lavish green grass" or "the world’s largest tech fund". To
 								get the noun chunks in a document, simply iterate over
 								[`Doc.noun_chunks`](/api/doc#noun_chunks):
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
 								for chunk in doc.noun_chunks:
 								    print(chunk.text, chunk.root.text, chunk.root.dep_,
 								            chunk.root.head.text)
 								```
 								> - **Text:** The original noun chunk text.
 								> - **Root text:** The original text of the word connecting the noun chunk to
 								>   the rest of the parse.
 								> - **Root dep:** Dependency relation connecting the root to its head.
 								> - **Root head text:** The text of the root token's head.
 								| Text                | root.text     | root.dep\_ | root.head.text |
 								| ------------------- | ------------- | ---------- | -------------- |
 								| Autonomous cars     | cars          | `nsubj`    | shift          |
 								| insurance liability | liability     | `dobj`     | shift          |
 								| manufacturers       | manufacturers | `pobj`     | toward         |
 								### Navigating the parse tree {#navigating}
 								spaCy uses the terms **head** and **child** to describe the words **connected by
 								a single arc** in the dependency tree. The term **dep** is used for the arc
 								label, which describes the type of syntactic relation that connects the child to
 								the head. As with other attributes, the value of `.dep` is a hash value. You can
 								get the string value with `.dep_`.
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
 								for token in doc:
 								    print(token.text, token.dep_, token.head.text, token.head.pos_,
 								            [child for child in token.children])
 								```
 								> - **Text:** The original token text.
 								> - **Dep:** The syntactic relation connecting child to head.
 								> - **Head text:** The original text of the token head.
 								> - **Head POS:** The part-of-speech tag of the token head.
 								> - **Children:** The immediate syntactic dependents of the token.
 								| Text          | Dep        | Head text | Head POS | Children                |
 								| ------------- | ---------- | --------- | -------- | ----------------------- |
 								| Autonomous    | `amod`     | cars      | `NOUN`   |                         |
 								| cars          | `nsubj`    | shift     | `VERB`   | Autonomous              |
 								| shift         | `ROOT`     | shift     | `VERB`   | cars, liability, toward |
 								| insurance     | `compound` | liability | `NOUN`   |                         |
 								| liability     | `dobj`     | shift     | `VERB`   | insurance               |
 								| toward        | `prep`     | shift     | `VERB`   | manufacturers           |
 								| manufacturers | `pobj`     | toward    | `ADP`    |                         |
 								import DisplaCyLong2Html from 'images/displacy-long2.html'
 								<Iframe title="displaCy visualization of dependencies and entities 2" html={DisplaCyLong2Html} height={450} />
 								Because the syntactic relations form a tree, every word has **exactly one
 								head**. You can therefore iterate over the arcs in the tree by iterating over
 								the words in the sentence. This is usually the best way to match an arc of
 								interest — from below:
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.symbols import nsubj, VERB
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
 								# Finding a verb with a subject from below — good
 								verbs = set()
 								for possible_subject in doc:
 								    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
 								        verbs.add(possible_subject.head)
 								print(verbs)
 								```
 								If you try to match from above, you'll have to iterate twice. Once for the head,
 								and then again through the children:
 								```python
 								# Finding a verb with a subject from above — less good
 								verbs = []
 								for possible_verb in doc:
 								    if possible_verb.pos == VERB:
 								        for possible_subject in possible_verb.children:
 								            if possible_subject.dep == nsubj:
 								                verbs.append(possible_verb)
 								                break
 								```
 								To iterate through the children, use the `token.children` attribute, which
 								provides a sequence of [`Token`](/api/token) objects.
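To see why matching from below needs only one pass, it can help to picture the parse as a list of head indices. The sketch below is plain Python rather than spaCy API: the `words`, `heads` and `deps` lists are transcribed by hand from the table above, the `pos` values for words that never appear as heads are assumed, and `children_of` is a hypothetical helper.

```python
words = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
heads = [1, 2, 2, 4, 2, 2, 5]  # index of each word's head; the root points to itself
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]
pos = ["ADJ", "NOUN", "VERB", "NOUN", "NOUN", "ADP", "NOUN"]  # partly assumed

# From below: a single pass over the tokens, following each arc to its head
verbs = set()
for i, dep in enumerate(deps):
    if dep == "nsubj" and pos[heads[i]] == "VERB":
        verbs.add(words[heads[i]])
print(verbs)  # {'shift'}


def children_of(index):
    """Collect the immediate dependents of a token by scanning all head indices."""
    return [i for i, head in enumerate(heads) if head == index and i != index]


# From above: the children have to be collected for every candidate head
print([words[i] for i in children_of(2)])  # ['cars', 'liability', 'toward']
```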
 								#### Iterating around the local tree {#navigating-around}
 								A few more convenience attributes are provided for iterating around the local
 								tree from the token. [`Token.lefts`](/api/token#lefts) and
 								[`Token.rights`](/api/token#rights) attributes provide sequences of syntactic
 								children that occur before and after the token. Both sequences are in sentence
 								order. There are also two integer-typed attributes,
 								[`Token.n_lefts`](/api/token#n_lefts) and
 								[`Token.n_rights`](/api/token#n_rights) that give the number of left and right
 								children.
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("bright red apples on the tree")
 								print([token.text for token in doc[2].lefts])  # ['bright', 'red']
 								print([token.text for token in doc[2].rights])  # ['on']
 								print(doc[2].n_lefts)  # 2
 								print(doc[2].n_rights)  # 1
 								```
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("de_core_news_sm")
 								doc = nlp("schöne rote Äpfel auf dem Baum")
 								print([token.text for token in doc[2].lefts])  # ['schöne', 'rote']
 								print([token.text for token in doc[2].rights])  # ['auf']
 								```
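In terms of head indices, the left and right children are simply the dependents that occur before or after the token. The following is a plain-Python sketch, not spaCy's implementation: the `heads` list is an assumed parse of the English example, chosen to be consistent with the output shown above, and `lefts` and `rights` are hypothetical helpers.

```python
words = ["bright", "red", "apples", "on", "the", "tree"]
# Assumed head index per word; the root ("apples") points to itself
heads = [2, 2, 2, 2, 5, 3]


def lefts(index):
    """Dependents of the token that precede it, in sentence order."""
    return [i for i, head in enumerate(heads) if head == index and i < index]


def rights(index):
    """Dependents of the token that follow it, in sentence order."""
    return [i for i, head in enumerate(heads) if head == index and i > index]


print([words[i] for i in lefts(2)])  # ['bright', 'red']
print([words[i] for i in rights(2)])  # ['on']
print(len(lefts(2)), len(rights(2)))  # 2 1
```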
 								You can get a whole phrase by its syntactic head using the
 								[`Token.subtree`](/api/token#subtree) attribute. This returns an ordered
 								sequence of tokens. You can walk up the tree with the
 								[`Token.ancestors`](/api/token#ancestors) attribute, and check dominance with
 								[`Token.is_ancestor`](/api/token#is_ancestor).
 								> #### Projective vs. non-projective
 								>
 								> For the [default English pipelines](/models/en), the parse tree is
 								> **projective**, which means that there are no crossing brackets. The tokens
 								> returned by `.subtree` are therefore guaranteed to be contiguous. This is not
 								> true for the German pipelines, which have many
 								> [non-projective dependencies](https://explosion.ai/blog/german-model#word-order).
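One way to make the projectivity note concrete is to check that every arc only spans descendants of its head, which is what keeps subtrees contiguous. The check below is a plain-Python sketch over head indices (with the root pointing to itself). `is_projective` is a hypothetical helper for illustration, not spaCy's implementation, and the second `heads` list is an artificial example constructed to contain a crossing arc.

```python
def is_projective(heads):
    """Return True if every arc only spans descendants of its head.

    Assumes `heads` describes a valid tree whose root points to itself.
    """

    def is_descendant(index, ancestor):
        # Walk up the tree until we reach the ancestor or the root
        while heads[index] != index:
            index = heads[index]
            if index == ancestor:
                return True
        return False

    for dep, head in enumerate(heads):
        if dep == head:
            continue  # the root has no incoming arc
        for between in range(min(dep, head) + 1, max(dep, head)):
            if not is_descendant(between, head):
                return False
    return True


# "Autonomous cars shift insurance liability toward manufacturers"
print(is_projective([1, 2, 2, 4, 2, 2, 5]))  # True
# Artificial tree in which the arcs 0 -> 2 and 1 -> 3 cross
print(is_projective([2, 3, 2, 2]))  # False
```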
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Credit and mortgage account holders must submit their requests")
 								root = [token for token in doc if token.head == token][0]
 								subject = list(root.lefts)[0]
 								for descendant in subject.subtree:
 								    assert subject is descendant or subject.is_ancestor(descendant)
 								    print(descendant.text, descendant.dep_, descendant.n_lefts,
 								            descendant.n_rights,
 								            [ancestor.text for ancestor in descendant.ancestors])
 								```
 								| Text     | Dep        | n_lefts | n_rights | ancestors                        |
 								| -------- | ---------- | ------- | -------- | -------------------------------- |
 								| Credit   | `nmod`     | `0`     | `2`      | holders, submit                  |
 								| and      | `cc`       | `0`     | `0`      | holders, submit                  |
 								| mortgage | `compound` | `0`     | `0`      | account, Credit, holders, submit |
 								| account  | `conj`     | `1`     | `0`      | Credit, holders, submit          |
 								| holders  | `nsubj`    | `1`     | `0`      | submit                           |
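The relationships printed above can be derived from nothing more than a list of head indices. The following is a minimal plain-Python sketch, not spaCy's implementation: the `heads` lists are hand-written, hypothetical parses in which the root points to itself (mirroring `token.head == token`), and the helper names `ancestors`, `subtree` and `is_projective` are made up for illustration. It also shows why a subtree is contiguous when no arcs cross, and why that can fail otherwise.

```python
def ancestors(heads, i):
    """Indices of all ancestors of token i, nearest first."""
    result = []
    while heads[i] != i:  # the root is its own head
        i = heads[i]
        result.append(i)
    return result


def subtree(heads, i):
    """Indices of token i and all of its descendants, in document order."""
    return [j for j in range(len(heads)) if j == i or i in ancestors(heads, j)]


def is_projective(heads):
    """True if no two dependency arcs cross each other."""
    arcs = [(min(d, h), max(d, h)) for d, h in enumerate(heads) if d != h]
    return not any(a < c < b < d for a, b in arcs for c, d in arcs)


# Hypothetical parse of a four-word sentence: word 2 is the root
words = ["account", "holders", "submit", "requests"]
heads = [1, 2, 2, 2]
print(ancestors(heads, 0))   # [1, 2]
print(subtree(heads, 1))     # [0, 1]
print(is_projective(heads))  # True

# Hypothetical tree with crossing arcs (0 -> 2 and 1 -> 3)
crossing = [2, 3, 2, 2]
print(is_projective(crossing))  # False
print(subtree(crossing, 3))     # [1, 3], which is not contiguous
```

In spaCy itself, the same information is available directly via `token.ancestors`, `token.subtree` and `token.is_ancestor`, as in the example above.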
 								Finally, the `.left_edge` and `.right_edge` attributes can be especially useful,
 								because they give you the first and last token of the subtree. This is the
 								easiest way to create a `Span` object for a syntactic phrase. Note that
 								`.right_edge` gives a token **within** the subtree, so if you use it as the
 								end-point of a range, don't forget to `+1`!
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Credit and mortgage account holders must submit their requests")
 								span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
 								with doc.retokenize() as retokenizer:
 								    retokenizer.merge(span)
 								for token in doc:
 								    print(token.text, token.pos_, token.dep_, token.head.text)
 								```
 								| Text                                |  POS   | Dep     | Head text |
 								| ----------------------------------- | ------ | ------- | --------- |
 								| Credit and mortgage account holders | `NOUN` | `nsubj` | submit    |
 								| must                                | `VERB` | `aux`   | submit    |
 								| submit                              | `VERB` | `ROOT`  | submit    |
 								| their                               | `ADJ`  | `poss`  | requests  |
 								| requests                            | `NOUN` | `dobj`  | submit    |
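The off-by-one in the span above follows from Python's slicing rules: the end of a slice is exclusive, while the right edge is the index of a token that belongs to the phrase. Here is a small sketch of that idea in plain Python, assuming a hand-written, hypothetical `heads` list (the root points to itself) and a made-up helper `subtree_edges`; it is not how spaCy computes `.left_edge` and `.right_edge` internally.

```python
def subtree_edges(heads, i):
    """Return (left_edge, right_edge) indices of the subtree rooted at token i."""
    def is_descendant(j):
        # Climb towards the root until we reach token i or run out of heads
        while j != i and heads[j] != j:
            j = heads[j]
        return j == i

    members = [j for j in range(len(heads)) if is_descendant(j)]
    return members[0], members[-1]


# Hypothetical parse: "holders" (index 4) heads the subject, "submit" (6) is the root
words = ["Credit", "and", "mortgage", "account", "holders",
         "must", "submit", "their", "requests"]
heads = [4, 0, 3, 0, 6, 6, 6, 8, 6]

left, right = subtree_edges(heads, 4)
print(left, right)            # 0 4
print(words[left:right])      # drops "holders": the slice end is exclusive
print(words[left:right + 1])  # the full phrase, including "holders"
```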
 								The dependency parse can be a useful tool for **information extraction**,
 								especially when combined with other predictions like
 								[named entities](#named-entities). The following example extracts money and
 								currency values, i.e. entities labeled as `MONEY`, and then uses the dependency
 								parse to find the noun phrase they are referring to – for example `"Net income"`
 								&rarr; `"$9.4 million"`.
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								# Merge noun phrases and entities for easier analysis
 								nlp.add_pipe("merge_entities")
 								nlp.add_pipe("merge_noun_chunks")
 								TEXTS = [
 								    "Net income was $9.4 million compared to the prior year of $2.7 million.",
 								    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
 								]
 								for doc in nlp.pipe(TEXTS):
 								    for token in doc:
 								        if token.ent_type_ == "MONEY":
 								            # We have an attribute or direct object, so check for subject
 								            if token.dep_ in ("attr", "dobj"):
 								                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
 								                if subj:
 								                    print(subj[0], "-->", token)
 								            # We have a prepositional object with a preposition
 								            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
 								                print(token.head.head, "-->", token)
 								```
 								<Infobox title="Combining models and rules" emoji="📖">
 								For more examples of how to write rule-based information extraction logic that
 								takes advantage of the model's predictions produced by the different components,
 								see the usage guide on
 								[combining models and rules](/usage/rule-based-matching#models-rules).
 								</Infobox>
 								### Visualizing dependencies {#displacy}
 								The best way to understand spaCy's dependency parser is interactively. To make
 								this easier, spaCy comes with a visualization module. You can pass a `Doc` or a
 								list of `Doc` objects to displaCy and run
 								[`displacy.serve`](/api/top-level#displacy.serve) to start the web server, or
 								[`displacy.render`](/api/top-level#displacy.render) to generate the raw markup.
 								If you want to know how to write rules that hook into some type of syntactic
 								construction, just plug the sentence into the visualizer and see how spaCy
 								annotates it.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy import displacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
 								# Since this is an interactive Jupyter environment, we can use displacy.render here
 								displacy.render(doc, style='dep')
 								```
 								<Infobox>
 								For more details and examples, see the
 								[usage guide on visualizing spaCy](/usage/visualizers). You can also test
 								displaCy in our [online demo](https://explosion.ai/demos/displacy).
 								</Infobox>
 								### Disabling the parser {#disabling}
 								In the [trained pipelines](/models) provided by spaCy, the parser is loaded and
 								enabled by default as part of the
 								[standard processing pipeline](/usage/processing-pipelines). If you don't need
 								any of the syntactic information, you should disable the parser. Disabling the
 								parser will make spaCy load and run much faster. If you want to load the parser,
 								but need to disable it for specific documents, you can also control its use on
 								the `nlp` object. For more details, see the usage guide on
 								[disabling pipeline components](/usage/processing-pipelines/#disabling).
 								```python
 								nlp = spacy.load("en_core_web_sm", disable=["parser"])
 								```
 								## Named Entity Recognition {#named-entities}
 								spaCy features an extremely fast statistical entity recognition system that
 								assigns labels to contiguous spans of tokens. The default
 								[trained pipelines](/models) can identify a variety of named and numeric
 								entities, including companies, locations, organizations and products. You can
 								add arbitrary classes to the entity recognition system, and update the model
 								with new examples.
 								### Named Entity Recognition 101 {#named-entities-101}
 								import NER101 from 'usage/101/\_named-entities.md'
 								<NER101 />
 								### Accessing entity annotations and labels {#accessing-ner}
 								The standard way to access entity annotations is the [`doc.ents`](/api/doc#ents)
 								property, which produces a sequence of [`Span`](/api/span) objects. The entity
 								type is accessible either as a hash value or as a string, using the attributes
 								`ent.label` and `ent.label_`. The `Span` object acts as a sequence of tokens, so
 								you can iterate over the entity or index into it. You can also get the text form
 								of the whole entity, as though it were a single token.
 								You can also access token entity annotations using the
 								[`token.ent_iob`](/api/token#attributes) and
 								[`token.ent_type`](/api/token#attributes) attributes. `token.ent_iob` indicates
 								whether the token begins an entity, is inside one, or is outside any entity. If
 								no entity type is set on a token, `token.ent_type_` will return an empty string.
 								> #### IOB Scheme
 								>
 								> - `I` – Token is **inside** an entity.
 								> - `O` – Token is **outside** an entity.
 								> - `B` – Token is the **beginning** of an entity.
 								>
 								> #### BILUO Scheme
 								>
 								> - `B` – Token is the **beginning** of a multi-token entity.
 								> - `I` – Token is **inside** a multi-token entity.
 								> - `L` – Token is the **last** token of a multi-token entity.
 								> - `U` – Token is a single-token **unit** entity.
 								> - `O` – Token is **outside** an entity.
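
The two schemes carry the same information, and one can be derived from the other. As a rough illustration, here is a plain-Python sketch that rewrites IOB tags as BILUO tags. The helper name `to_biluo` and the `"B-GPE"`-style tag strings are assumptions made for this example, and the input is assumed to be a valid IOB sequence.

```python
def to_biluo(tags):
    """Convert a list of IOB tags such as "B-GPE" to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append("O")
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is an "I" of the same type
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


tags = ["B-GPE", "I-GPE", "O", "B-ORG", "O"]
print(to_biluo(tags))  # ['B-GPE', 'L-GPE', 'O', 'U-ORG', 'O']
```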
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("San Francisco considers banning sidewalk delivery robots")
 								# document level
 								ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
 								print(ents)
 								# token level
 								ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
 								ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
 								print(ent_san)  # ['San', 'B', 'GPE']
 								print(ent_francisco)  # ['Francisco', 'I', 'GPE']
 								```
 								| Text      | ent_iob | ent_iob\_ | ent_type\_ | Description            |
 								| --------- | ------- | --------- | ---------- | ---------------------- |
 								| San       | `3`     | `B`       | `"GPE"`    | beginning of an entity |
 								| Francisco | `1`     | `I`       | `"GPE"`    | inside an entity       |
 								| considers | `2`     | `O`       | `""`       | outside an entity      |
 								| banning   | `2`     | `O`       | `""`       | outside an entity      |
 								| sidewalk  | `2`     | `O`       | `""`       | outside an entity      |
 								| delivery  | `2`     | `O`       | `""`       | outside an entity      |
 								| robots    | `2`     | `O`       | `""`       | outside an entity      |
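To see how the token-level columns relate to the document-level view, the rows of the table can be grouped back into entities. This is only a sketch in plain Python: the helper `collect_entities` is hypothetical, the rows are copied by hand from the table, and words are joined with a single space rather than the original whitespace. The code `0` (no entity tag set) does not appear in the table and is taken from the `Token` API documentation.

```python
# Integer codes from the `ent_iob` column; 0 means no entity tag is set
IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}


def collect_entities(rows):
    """Group (text, ent_iob, ent_type_) rows into (entity text, label) pairs."""
    entities = []
    words, label = [], ""
    for text, iob, ent_type in rows:
        code = IOB_CODES[iob]
        if code == "I" and words:
            words.append(text)  # continue the open entity
            continue
        if words:  # any other code closes the open entity
            entities.append((" ".join(words), label))
            words, label = [], ""
        if code == "B":
            words, label = [text], ent_type  # open a new entity
    if words:  # entity that runs to the end of the text
        entities.append((" ".join(words), label))
    return entities


rows = [
    ("San", 3, "GPE"),
    ("Francisco", 1, "GPE"),
    ("considers", 2, ""),
    ("banning", 2, ""),
    ("sidewalk", 2, ""),
    ("delivery", 2, ""),
    ("robots", 2, ""),
]
print(collect_entities(rows))  # [('San Francisco', 'GPE')]
```

In practice, `doc.ents` already gives you this grouping as `Span` objects, so a helper like this is only useful for understanding the encoding.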

### Setting entity annotations {#setting-entities}

To ensure that the sequence of token annotations remains consistent, you have to
set entity annotations **at the document level**. However, you can't write
directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest
way to set entities is to assign to the [`doc.ents`](/api/doc#ents) attribute
and create the new entity as a [`Span`](/api/span).

```python
### {executable="true"}
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# The model didn't recognize "fb" as an entity :(

fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 2, 'ORG')] 🎉
```

Keep in mind that a `Span` is created with the start and end index of the
**tokens**, not the character offsets of the entity in the document. In this
case, "fb" spans tokens `(0, 1)`, but at the document level the entity has the
start and end character offsets `(0, 2)`.
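
The difference between the two kinds of indices can be checked directly. The sketch below assumes a blank English pipeline created with `spacy.blank("en")`, which the original example does not use, so that no trained model has to be installed. It also uses `Doc.char_span` to go from character offsets back to a `Span`.

```python
import spacy
from spacy.tokens import Span

# A blank English pipeline only tokenizes, so no trained model is required.
nlp = spacy.blank("en")
doc = nlp("fb is hiring a new vice president of global policy")

# Token indices: "fb" is the first token, so the span covers tokens 0 to 1.
fb_ent = Span(doc, 0, 1, label="ORG")
token_indices = (fb_ent.start, fb_ent.end)
char_offsets = (fb_ent.start_char, fb_ent.end_char)
print(token_indices, char_offsets)

# Character offsets can be converted to a Span with Doc.char_span, which
# returns None if the offsets don't align with token boundaries.
same_ent = doc.char_span(0, 2, label="ORG")
doc.ents = [same_ent]
print([(e.text, e.start, e.end, e.label_) for e in doc.ents])
```

If your annotations come as character offsets, converting them with `char_span` first avoids mixing up the two index types.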

#### Setting entity annotations from array {#setting-from-array}

You can also assign entity annotations using the
[`doc.from_array`](/api/doc#from_array) method. To do this, you should include
both the `ENT_TYPE` and the `ENT_IOB` attributes in the array you're importing
from.

```python
### {executable="true"}
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents)  # []

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # [London]
```
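
The same approach extends to entities that span several tokens: the first token gets `3` (`B`) and each following token gets `1` (`I`). The sketch below is an illustration under a few assumptions. It uses a blank pipeline from `spacy.blank("en")` instead of a trained one, and therefore adds the `"GPE"` label to the string store explicitly.

```python
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
doc = nlp.make_doc("London is a big city in the United Kingdom.")

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
# A blank vocab doesn't know the label yet, so add it to the StringStore.
gpe = doc.vocab.strings.add("GPE")

attr_array[0, 0] = 3  # "London": B
attr_array[0, 1] = gpe
attr_array[7, 0] = 3  # "United": B
attr_array[7, 1] = gpe
attr_array[8, 0] = 1  # "Kingdom": I
attr_array[8, 1] = gpe

doc.from_array(header, attr_array)
entities = [(e.text, e.label_) for e in doc.ents]
print(entities)
```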

#### Setting entity annotations in Cython {#setting-cython}

Finally, you can always write to the underlying struct, if you compile a
[Cython](http://cython.org/) function. This is easy to do, and allows you to
write efficient native code.

```python
# cython: infer_types=True
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, int ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3  # B: the first token begins the entity
    for i in range(start+1, end):
        doc.c[i].ent_iob = 1  # I: the remaining tokens are inside the entity
```

If you write directly to the array of `TokenC*` structs, you are responsible
for ensuring that the data is left in a consistent state.

### Built-in entity types {#entity-types}

> #### Tip: Understanding entity types
>
> You can also use `spacy.explain()` to get the description for the string
> representation of an entity label. For example, `spacy.explain("LANGUAGE")`
> will return "any named language".

<Infobox title="Annotation scheme">

For details on the entity types available in spaCy's trained pipelines, see the
"label scheme" sections of the individual models in the
[models directory](/models).

</Infobox>
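
As a quick illustration of the tip above, the following sketch looks up a few labels with `spacy.explain()`. The labels other than `"LANGUAGE"` are picked here only as examples, and the exact wording of the descriptions depends on the glossary shipped with your spaCy version.

```python
import spacy

# spacy.explain looks up a label in spaCy's built-in glossary.
descriptions = {label: spacy.explain(label) for label in ("LANGUAGE", "GPE", "ORG")}
for label, description in descriptions.items():
    print(label, "->", description)
```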

### Visualizing named entities {#displacy}

The
[displaCy <sup>ENT</sup> visualizer](https://explosion.ai/demos/displacy-ent)
lets you explore an entity recognition model's behavior interactively. If you're
training a model, it's very useful to run the visualization yourself. To help
you do that, spaCy comes with a visualization module. You can pass a `Doc` or a
list of `Doc` objects to displaCy and run
[`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or
[`displacy.render`](/api/top-level#displacy.render) to generate the raw markup.
For more details and examples, see the
[usage guide on visualizing spaCy](/usage/visualizers).

```python
### Named Entity example
import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")
```

import DisplacyEntHtml from 'images/displacy-ent2.html'

<Iframe title="displaCy visualizer for entities" html={DisplacyEntHtml} height={180} />
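
If you want the markup as a string instead of starting a web server, `displacy.render` can be used in place of `displacy.serve`. The sketch below is a minimal illustration that assumes a blank pipeline with hand-set entities, so the shortened sentence and the entity spans are made up for the example and are not model predictions.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; entities are set by hand below
doc = nlp("Sebastian Thrun worked at Google in 2007.")
doc.ents = [
    Span(doc, 0, 2, label="PERSON"),
    Span(doc, 4, 5, label="ORG"),
    Span(doc, 6, 7, label="DATE"),
]

# With jupyter=False, render returns the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(html)
```

The returned string can be written to a file or embedded in a page of your own.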

## Entity Linking {#entity-linking}

To ground the named entities into the "real world", spaCy provides functionality
to perform entity linking, which resolves a textual entity to a unique
identifier from a knowledge base (KB). You can create your own
[`KnowledgeBase`](/api/kb) and [train](/usage/training) a new
[`EntityLinker`](/api/entitylinker) using that custom knowledge base.

### Accessing entity identifiers {#entity-linking-accessing model="entity linking"}

The annotated KB identifier is accessible as either a hash value or as a string,
using the attributes `ent.kb_id` and `ent.kb_id_` of a [`Span`](/api/span)
object, or the `ent_kb_id` and `ent_kb_id_` attributes of a
[`Token`](/api/token) object.

```python
import spacy

nlp = spacy.load("my_custom_el_pipeline")
doc = nlp("Ada Lovelace was born in London")

# Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents)  # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]

# Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
print(ent_ada_0)  # ['Ada', 'PERSON', 'Q7259']
print(ent_ada_1)  # ['Lovelace', 'PERSON', 'Q7259']
print(ent_london_5)  # ['London', 'GPE', 'Q84']
```

## Tokenization {#tokenization}

Tokenization is the task of splitting a text into meaningful segments, called
_tokens_. The input to the tokenizer is a Unicode text, and the output is a
[`Doc`](/api/doc) object. To construct a `Doc` object, you need a
[`Vocab`](/api/vocab) instance, a sequence of `word` strings, and optionally a
sequence of `spaces` booleans, which allow you to maintain alignment of the
tokens with the original string.
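
To see why a sequence of `spaces` booleans is enough to keep tokens aligned
with the original string, the sketch below rebuilds a text from parallel lists
of words and flags in plain Python. The `rebuild_text` helper is hypothetical
and only mimics the idea; it does not use spaCy's `Doc` class.

```python
from typing import List, Optional


def rebuild_text(words: List[str], spaces: Optional[List[bool]] = None) -> str:
    """Join words, adding one space after each word whose flag is True."""
    if spaces is None:
        # Assumption for this sketch: with no flags, every word is followed by a space.
        spaces = [True] * len(words)
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
print(rebuild_text(words, spaces))  # Hello, world!
```

Because each flag records whether a token was followed by whitespace, the
original string can be recovered by simple concatenation.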

<Infobox title="Important note" variant="warning">

spaCy's tokenization is **non-destructive**, which means that you'll always be
able to reconstruct the original input from the tokenized output. Whitespace
information is preserved in the tokens and no information is added or removed
during tokenization. This is a core principle of spaCy's `Doc` object:
`doc.text == input_text` should always hold true.

</Infobox>

import Tokenization101 from 'usage/101/\_tokenization.md'

<Tokenization101 />

<Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>

spaCy introduces a novel tokenization algorithm that gives a better balance
between performance, ease of definition, and ease of alignment into the original
string.

After consuming a prefix or suffix, we consult the special cases again. We want
the special cases to handle things like "don't" in English, and we want the same
rule to work for "(don't)!". We do this by splitting off the open bracket, then
the exclamation, then the close bracket, and finally matching the special case.

Here's an implementation of the algorithm in Python, optimized for readability
rather than performance:

```python
def tokenizer_pseudo_code(
    text,
    special_cases,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match,
    url_match
):
    tokens = []
    # Split on whitespace, then process each chunk on its own
    for substring in text.split():
        suffixes = []
        while substring:
            # Peel off prefixes and suffixes until none are left
            while prefix_search(substring) or suffix_search(substring):
                if token_match(substring):
                    tokens.append(substring)
                    substring = ""
                    break
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ""
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            # Handle whatever remains of the chunk
            if token_match(substring):
                tokens.append(substring)
                substring = ""
            elif url_match(substring):
                tokens.append(substring)
                substring = ""
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            elif substring:

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								                tokens.append(substring)
-												Update docs [ci skip]

											
										
										
											2020-07-25 19:51:12 +03:00
+								                substring = ""
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								        tokens.extend(reversed(suffixes))
 								    return tokens
 								```
The algorithm can be summarized as follows:

1. Iterate over whitespace-separated substrings.
2. Look for a token match. If there is a match, stop processing and keep this
   token.
3. Check whether we have an explicitly defined special case for this substring.
   If we do, use it.
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
   so that the token match and special cases always get priority.
5. If we didn't consume a prefix, try to consume a suffix and then go back to
   #2.
6. If we can't consume a prefix or a suffix, look for a URL match.
7. If there's no URL match, then look for a special case.
8. Look for "infixes", such as hyphens, and split the substring into tokens on
   all infixes.
9. Once we can't consume any more of the string, handle it as a single token.

</Accordion>

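The steps summarized above can be tried out with a small, self-contained sketch. It is a simplified illustration rather than spaCy's implementation: the function name `toy_tokenize`, the toy regular expressions and the single special case are all assumptions made for this example, and the `token_match` and `url_match` checks are left out to keep it short.

```python
import re

# Toy rules for illustration only; spaCy's real patterns are much richer.
special_cases = {"Let's": ["Let", "'s"]}
prefix_search = re.compile(r'^[(\["]').search
suffix_search = re.compile(r'[)\]"!?.,]$').search
infix_finditer = re.compile(r"[-~]").finditer


def toy_tokenize(text):
    tokens = []
    for substring in text.split():
        suffixes = []
        # Peel off one prefix or suffix at a time; special cases are
        # re-checked on every pass so that they take priority.
        while substring and substring not in special_cases:
            prefix = prefix_search(substring)
            suffix = suffix_search(substring)
            if prefix:
                tokens.append(substring[: prefix.end()])
                substring = substring[prefix.end() :]
            elif suffix:
                suffixes.append(substring[suffix.start() :])
                substring = substring[: suffix.start()]
            else:
                break
        if substring in special_cases:
            tokens.extend(special_cases[substring])
        elif substring:
            # Split what is left on every infix match.
            offset = 0
            for match in infix_finditer(substring):
                if match.start() > offset:
                    tokens.append(substring[offset : match.start()])
                tokens.append(match.group())
                offset = match.end()
            if substring[offset:]:
                tokens.append(substring[offset:])
        # Suffixes were collected outside-in, so emit them in reverse.
        tokens.extend(reversed(suffixes))
    return tokens


print(toy_tokenize('"Let\'s go, non-stop!"'))
# ['"', 'Let', "'s", 'go', ',', 'non', '-', 'stop', '!', '"']
```

Because the special-case lookup happens again after every prefix or suffix is removed, `Let's` is still recognized once the opening quote has been split off.
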
**Global** and **language-specific** tokenizer data is supplied via the language
data in [`spacy/lang`](%%GITHUB_SPACY/spacy/lang). The tokenizer exceptions
define special cases like "don't" in English, which needs to be split into two
tokens: `{ORTH: "do"}` and `{ORTH: "n't", NORM: "not"}`. The prefixes, suffixes
and infixes mostly define punctuation rules – for example, when to split off
periods (at the end of a sentence), and when to leave tokens containing periods
intact (abbreviations like "U.S.").

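As a rough illustration of the shape of such exceptions, the sketch below writes two entries as a plain dictionary, using the string keys `"ORTH"` and `"NORM"` in place of spaCy's symbols. The entries and the helper `orth_matches` are hypothetical and only meant to show one useful sanity check: the `ORTH` pieces of a special case should spell out the original string exactly.

```python
# Illustrative entries only; plain-string keys stand in for ORTH and NORM.
special_cases = {
    "don't": [{"ORTH": "do"}, {"ORTH": "n't", "NORM": "not"}],
    "U.S.": [{"ORTH": "U.S."}],
}


def orth_matches(string, pieces):
    """Return True if the ORTH values concatenate back to the original string."""
    return "".join(piece["ORTH"] for piece in pieces) == string


for string, pieces in special_cases.items():
    assert orth_matches(string, pieces)
    # Fall back to the surface form when no NORM is given.
    print(string, "->", [piece.get("NORM", piece["ORTH"]) for piece in pieces])
```
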
<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">

Tokenization rules that are specific to one language, but can be **generalized
across that language** should ideally live in the language data in
[`spacy/lang`](%%GITHUB_SPACY/spacy/lang) – we always appreciate pull requests!
Anything that's specific to a domain or text type – like financial trading
abbreviations, or Bavarian youth slang – should be added as a special case rule
to your tokenizer instance. If you're dealing with a lot of customizations, it
might make sense to create an entirely custom subclass.

</Accordion>

---

### Adding special case tokenization rules {#special-cases}

Most domains have at least some idiosyncrasies that require custom tokenization
rules. These could be particular expressions, or abbreviations only used in
this specific field. Here's how to add a special case rule to an existing
[`Tokenizer`](/api/tokenizer) instance:

```python
### {executable="true"}
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("gimme that")  # phrase to tokenize
print([w.text for w in doc])  # ['gimme', 'that']

# Add special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

# Check new tokenization
print([w.text for w in nlp("gimme that")])  # ['gim', 'me', 'that']
```

The special case doesn't have to match an entire whitespace-delimited substring.
The tokenizer will incrementally split off punctuation, and keep looking up the
remaining substring. The special case rules also have precedence over the
punctuation splitting.

```python
assert "gimme" not in [w.text for w in nlp("gimme!")]
assert "gimme" not in [w.text for w in nlp('("...gimme...?")')]

nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
assert len(nlp("...gimme...?")) == 1
```

#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}

A working implementation of the pseudo-code above is available for debugging as
[`nlp.tokenizer.explain(text)`](/api/tokenizer#explain). It returns a list of
tuples showing which tokenizer rule or pattern was matched for each token. The
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:

> #### Expected output
>
> ```
> "      PREFIX
> Let    SPECIAL-1
> 's     SPECIAL-2
> go     TOKEN
> !      SUFFIX
> "      SUFFIX
> ```

```python
### {executable="true"}
from spacy.lang.en import English

nlp = English()
text = '''"Let's go!"'''
doc = nlp(text)
tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\t", t[0])
```

### Customizing spaCy's Tokenizer class {#native-tokenizers}

Let's imagine you wanted to create a tokenizer for a new language or specific
domain. There are six things you may need to define:

1. A dictionary of **special cases**. This handles things like contractions,
   units of measurement, emoticons, certain abbreviations, etc.
2. A function `prefix_search`, to handle **preceding punctuation**, such as open
   quotes, open brackets, etc.
3. A function `suffix_search`, to handle **succeeding punctuation**, such as
   commas, periods, close quotes, etc.
4. A function `infix_finditer`, to handle non-whitespace separators, such as
   hyphens etc.
5. An optional boolean function `token_match` matching strings that should never
   be split, overriding the infix rules. Useful for things like numbers.
6. An optional boolean function `url_match`, which is similar to `token_match`
   except that prefixes and suffixes are removed before applying the match.

You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
to use `re.compile()` to build a regular expression object, and pass its
`.search()` and `.finditer()` methods:

```python
### {executable="true"}
import re
import spacy
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, rules=special_cases,
                                prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                url_match=simple_url_re.match)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc]) # ['hello', '-', 'world.', ':)']
```

If you need to subclass the tokenizer instead, the relevant methods to
specialize are `find_prefix`, `find_suffix` and `find_infix`.

<Infobox title="Important note" variant="warning">

When customizing the prefix, suffix and infix handling, remember that you're
passing in **functions** for spaCy to execute, e.g. `prefix_re.search` – not
just the regular expressions. This means that your functions also need to define
how the rules should be applied. For example, if you're adding your own prefix
rules, you need to make sure they're only applied to characters at the
**beginning of a token**, e.g. by adding `^`. Similarly, suffix rules should
only be applied at the **end of a token**, so your expression should end with a
`$`.

</Infobox>

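To see why the anchors matter, here is a small sketch using only Python's `re`
module. The names `unanchored` and `anchored` and the sample token are
illustrative assumptions, not part of spaCy. The point is that `.search` scans
the whole string, so a prefix pattern without `^` can match in the middle of a
token.

```python
import re

# Same character class as the prefix rule above, with and without the anchor
unanchored = re.compile(r'''[\[\("']''')
anchored = re.compile(r'''^[\[\("']''')

token = 'say("hi")'

# Without ^, search() finds the bracket in the middle of the string
middle_match = unanchored.search(token)
print(middle_match.start())  # 3

# With ^, only a match at the very start counts, so this token has no prefix
print(anchored.search(token))  # None

# A string that really starts with a prefix character still matches
print(anchored.search('("hi")').group())  # (
```

The same reasoning applies to suffix patterns and `$`.
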
#### Modifying existing rule sets {#native-tokenizer-additions}

In many situations, you don't necessarily need entirely custom rules. Sometimes
you just want to add another character to the prefixes, suffixes or infixes. The
default prefix, suffix and infix rules are available via the `nlp` object's
`Defaults`, and the `Tokenizer` attributes such as
[`Tokenizer.suffix_search`](/api/tokenizer#attributes) are writable, so you can
overwrite them with compiled regular expression objects using modified default
rules. spaCy ships with utility functions to help you compile the regular
expressions – for example,
[`compile_suffix_regex`](/api/top-level#util.compile_suffix_regex):

```python
suffixes = nlp.Defaults.suffixes + [r'''-+$''',]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```

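As a rough illustration of what compiling a suffix list amounts to, the sketch
below joins the patterns into one alternation and anchors each alternative at
the end of the string. `compile_suffixes` is a hypothetical stand-in written
with the standard library, not spaCy's implementation, and the three patterns
are a made-up subset rather than the real defaults.

```python
import re

def compile_suffixes(entries):
    # Hypothetical helper: one alternation, each alternative anchored with $
    expression = "|".join(piece + "$" for piece in entries if piece.strip())
    return re.compile(expression)

suffix_re = compile_suffixes([r"\)", r"\]", r"-+"])

# A run of trailing hyphens is matched as a single suffix
print(suffix_re.search("hello---").group())  # ---

# A closing bracket at the end is matched, the opening one is ignored
print(suffix_re.search("(hi)").group())  # )

# No suffix character at the end means no match
print(suffix_re.search("hello"))  # None
```
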
Similarly, you can remove a character from the default suffixes:

```python
suffixes = list(nlp.Defaults.suffixes)
suffixes.remove("\\[")
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```

The `Tokenizer.suffix_search` attribute should be a function which takes a
string and returns a **regex match object** or `None`. Usually we use
the `.search` method of a compiled regex object, but you can use some other
function that behaves the same way.
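
For instance, a hand-written function can wrap a compiled pattern and add its
own conditions, as long as it still returns a match object or `None`. The
sketch below uses only the standard library. The rule that keeps `:)` intact is
a hypothetical example, and the resulting function could then be assigned to
`nlp.tokenizer.suffix_search` in the same way as `suffix_regex.search`.

```python
import re

suffix_re = re.compile(r'''[\]\)"']$''')

def suffix_search(text):
    # Hypothetical rule: treat a trailing ":)" as part of the token
    if text.endswith(":)"):
        return None
    # Otherwise defer to the regular expression, which returns
    # a match object or None
    return suffix_re.search(text)

print(suffix_search("world)").group())  # )
print(suffix_search("smile:)"))  # None
```
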
<Infobox title="Important note" variant="warning">

If you've loaded a trained pipeline, writing to
[`nlp.Defaults`](/api/language#defaults) or `English.Defaults` directly won't
work, since the regular expressions are read from the pipeline data and will be
compiled when you load it. If you modify `nlp.Defaults`, you'll only see the
effect if you call [`spacy.blank`](/api/top-level#spacy.blank). If you want to
modify the tokenizer loaded from a trained pipeline, you should modify
`nlp.tokenizer` directly. If you're training your own pipeline, you can register
[callbacks](/usage/training/#custom-code-nlp-callbacks) to modify the `nlp`
object before training.

</Infobox>

The prefix, infix and suffix rule sets include not only individual characters
but also detailed regular expressions that take the surrounding context into
account. For example, there is a regular expression that treats a hyphen between
letters as an infix. If you do not want the tokenizer to split on hyphens
between letters, you can modify the existing infix definition from
[`lang/punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py):

```python
### {executable="true"}
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

# Default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']

# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother-in-law']
```
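
To make the role of the lookbehind and lookahead concrete, here is a sketch of
the hyphen-between-letters rule using only the standard library. It assumes an
ASCII-only letter class in place of spaCy's `ALPHA` and a plain `-` in place of
`HYPHENS`, so it is a simplification rather than the actual default pattern.

```python
import re

# Simplified stand-in for the hyphen-between-letters infix rule
hyphen_infix_re = re.compile(r"(?<=[a-zA-Z])-(?=[a-zA-Z])")

def infix_spans(text):
    # Collect the (start, end) offsets of every match that finditer reports
    return [match.span() for match in hyphen_infix_re.finditer(text)]

print(infix_spans("mother-in-law"))  # [(6, 7), (9, 10)]
print(infix_spans("-law"))           # []
print(infix_spans("2-3"))            # []
```

The tokenizer splits a string at each reported span, which is why dropping this
rule keeps `mother-in-law` together.
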

For an overview of the default regular expressions, see
[`lang/punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) and
language-specific definitions such as
[`lang/de/punctuation.py`](%%GITHUB_SPACY/spacy/lang/de/punctuation.py) for
German.

### Hooking a custom tokenizer into the pipeline {#custom-tokenizer}

The tokenizer is the first component of the processing pipeline and the only one
that can't be replaced by writing to `nlp.pipeline`. This is because it has a
different signature from all the other components: it takes a text and returns a
[`Doc`](/api/doc), whereas all other components expect to already receive a
tokenized `Doc`.

![The processing pipeline](../images/pipeline.svg)

To overwrite the existing tokenizer, you need to replace `nlp.tokenizer` with a
custom function that takes a text and returns a [`Doc`](/api/doc).

> #### Creating a Doc
>
> Constructing a [`Doc`](/api/doc) object manually requires at least two
> arguments: the shared `Vocab` and a list of words. Optionally, you can pass in
> a list of `spaces` values indicating whether the token at this position is
> followed by a space (default `True`). See the section on
> [pre-tokenized text](#own-annotations) for more info.
>
> ```python
> words = ["Let", "'s", "go", "!"]
> spaces = [False, True, False, False]
> doc = Doc(nlp.vocab, words=words, spaces=spaces)
> ```

```python
nlp = spacy.blank("en")
nlp.tokenizer = my_tokenizer
```

| Argument    | Type              | Description               |
| ----------- | ----------------- | ------------------------- |
| `text`      | `str`             | The raw text to tokenize. |
| **RETURNS** | [`Doc`](/api/doc) | The tokenized document.   |

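A `Doc` is built from words plus per-token whitespace flags, so a custom
tokenizer has to decide both. As a small standard-library sketch of how those
two lists relate to the raw text, the hypothetical helper `join_tokens` below
rebuilds the string from `words` and `spaces`. It only illustrates the
bookkeeping and does not use spaCy.

```python
def join_tokens(words, spaces):
    # Append a single space after each word whose flag is True
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )

words = ["Let", "'s", "go", "!"]
spaces = [False, True, False, False]

text = join_tokens(words, spaces)
print(text)  # Let's go!
```

If the flags are wrong, the text rebuilt this way no longer matches the
original input, which is worth checking when writing your own tokenizer.
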
-												Update tokenizer docs and add test

											
										
										
											2020-08-09 16:24:01 +03:00
+								#### Example 1: Basic whitespace tokenizer {#custom-tokenizer-example}
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
-												Update tokenizer docs and add test

											
										
										
											2020-08-09 16:24:01 +03:00
+								Here's an example of the most basic whitespace tokenizer. It takes the shared
 								vocab, so it can construct `Doc` objects. When it's called on a text, it returns
 								a `Doc` object consisting of the text split on single space characters. We can
 								then overwrite the `nlp.tokenizer` attribute with an instance of our custom
 								tokenizer.
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
```python
### {executable="true"}
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([token.text for token in doc])
```

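To see what splitting on single spaces does before spaCy is involved, here's a
minimal plain-Python sketch of the same `text.split(" ")` call. It's only an
illustration and doesn't use spaCy at all: punctuation stays attached to the
neighbouring word, and two spaces in a row produce an empty string.

```python
text = "What's happened to me? he thought. It wasn't a dream."

# The same call the whitespace tokenizer above makes internally
words = text.split(" ")
print(words)
# ["What's", 'happened', 'to', 'me?', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.']

# Consecutive spaces yield an empty string between them
double_space = "a  b".split(" ")
print(double_space)  # ['a', '', 'b']
```

If empty strings are a problem for your data, you may want to filter them out or
normalize whitespace before constructing the `Doc`.
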
#### Example 2: Third-party tokenizers (BERT word pieces) {#custom-tokenizer-example2}

You can use the same approach to plug in any other third-party tokenizers. Your
custom callable just needs to return a `Doc` object with the tokens produced by
your tokenizer. In this example, the wrapper uses the **BERT word piece
tokenizer**, provided by the
[`tokenizers`](https://github.com/huggingface/tokenizers) library. The tokens
available in the `Doc` object returned by spaCy now match the exact word pieces
produced by the tokenizer.

> #### 💡 Tip: spacy-transformers
>
> If you're working with transformer models like BERT, check out the
> [`spacy-transformers`](https://github.com/explosion/spacy-transformers)
> extension package and [documentation](/usage/embeddings-transformers). It
> includes a pipeline component for using pretrained transformer weights and
> **training transformer models** in spaCy, as well as helpful utilities for
> aligning word pieces to linguistic tokenization.

```python
### Custom BERT word piece tokenizer
from tokenizers import BertWordPieceTokenizer
from spacy.tokens import Doc
import spacy

class BertTokenizer:
    def __init__(self, vocab, vocab_file, lowercase=True):
        self.vocab = vocab
        self._tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=lowercase)

    def __call__(self, text):
        tokens = self._tokenizer.encode(text)
        words = []
        spaces = []
        for i, (word, (start, end)) in enumerate(zip(tokens.tokens, tokens.offsets)):
            words.append(word)
            if i < len(tokens.tokens) - 1:
                # If the next token starts after the current one ends, we
                # assume a space in between
                next_start, next_end = tokens.offsets[i + 1]
                spaces.append(next_start > end)
            else:
                spaces.append(True)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = BertTokenizer(nlp.vocab, "bert-base-uncased-vocab.txt")
doc = nlp("Justin Drew Bieber is a Canadian singer, songwriter, and actor.")
print(doc.text, [token.text for token in doc])
# [CLS]justin drew bi##eber is a canadian singer, songwriter, and actor.[SEP]
# ['[CLS]', 'justin', 'drew', 'bi', '##eber', 'is', 'a', 'canadian', 'singer',
#  ',', 'songwriter', ',', 'and', 'actor', '.', '[SEP]']
```

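The `spaces` logic in the wrapper can be tried in isolation. The following is a
minimal sketch that assumes hand-written word pieces and character offsets
instead of calling the `tokenizers` library, so `spaces_from_offsets` and the
sample data are hypothetical and only there to illustrate the rule: a token is
followed by a space if the next token starts after it ends.

```python
def spaces_from_offsets(offsets):
    """Return one boolean per token: True if whitespace follows the token."""
    spaces = []
    for i, (start, end) in enumerate(offsets):
        if i < len(offsets) - 1:
            next_start, _ = offsets[i + 1]
            # A gap between this token's end and the next token's start
            # is treated as a single space
            spaces.append(next_start > end)
        else:
            # Same convention as the wrapper above: trailing space on the last token
            spaces.append(True)
    return spaces

# Hypothetical word pieces and (start, end) offsets for the text "drew bieber."
words = ["drew", "bi", "##eber", "."]
offsets = [(0, 4), (5, 7), (7, 11), (11, 12)]

spaces = spaces_from_offsets(offsets)
print(spaces)  # [True, False, False, True]

# Rebuild the text the way a Doc would from words and spaces
rebuilt = "".join(word + (" " if space else "") for word, space in zip(words, spaces))
print(repr(rebuilt))  # 'drew bi##eber. '
```
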
<Infobox title="Important note on tokenization and models" variant="warning">

Keep in mind that your models' results may be less accurate if the tokenization
during training differs from the tokenization at runtime. So if you modify a
trained pipeline's tokenization afterwards, it may produce very different
predictions. You should therefore train your pipeline with the **same
tokenizer** it will be using at runtime. See the docs on
[training with custom tokenization](#custom-tokenizer-training) for details.

</Infobox>

 								#### Training with custom tokenization {#custom-tokenizer-training new="3"}
 								spaCy's [training config](/usage/training#config) describe the settings,
 								hyperparameters, pipeline and tokenizer used for constructing and training the
-												"model" terminology consistency in docs

											
										
										
											2020-09-03 14:13:03 +03:00
+								pipeline. The `[nlp.tokenizer]` block refers to a **registered function** that
-												Update tokenizer docs and add test

											
										
										
											2020-08-09 16:24:01 +03:00
+								takes the `nlp` object and returns a tokenizer. Here, we're registering a
 								function called `whitespace_tokenizer` in the
 								[`@tokenizers` registry](/api/registry). To make sure spaCy knows how to
 								construct your tokenizer during training, you can pass in your Python file by
 								setting `--code functions.py` when you run [`spacy train`](/api/cli#train).
 								> #### config.cfg
 								>
 								> ```ini
 								> [nlp.tokenizer]
 								> @tokenizers = "whitespace_tokenizer"
 								> ```
```python
### functions.py {highlight="1"}
@spacy.registry.tokenizers("whitespace_tokenizer")
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)

    return create_tokenizer
```

Registered functions can also take arguments that are then passed in from the
config. This allows you to quickly change and keep track of different settings.
Here, the registered function called `bert_word_piece_tokenizer` takes two
arguments: the path to a vocabulary file and whether to lowercase the text. The
Python type hints `str` and `bool` ensure that the received values have the
correct type.

> #### config.cfg
>
> ```ini
> [nlp.tokenizer]
> @tokenizers = "bert_word_piece_tokenizer"
> vocab_file = "bert-base-uncased-vocab.txt"
> lowercase = true
> ```

```python
### functions.py {highlight="1"}
@spacy.registry.tokenizers("bert_word_piece_tokenizer")
def create_bert_word_piece_tokenizer(vocab_file: str, lowercase: bool):
    def create_tokenizer(nlp):
        return BertWordPieceTokenizer(nlp.vocab, vocab_file, lowercase)

    return create_tokenizer
```

To avoid hard-coding local paths into your config file, you can also set the
vocab path on the CLI by using the `--nlp.tokenizer.vocab_file`
[override](/usage/training#config-overrides) when you run
[`spacy train`](/api/cli#train). For more details on using registered functions,
see the docs in [training with custom code](/usage/training#custom-code).

<Infobox variant="warning">

Remember that a registered function should always be a function that spaCy
**calls to create something**, not the "something" itself. In this case, it
**creates a function** that takes the `nlp` object and returns a callable that
takes a text and returns a `Doc`.

</Infobox>

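The two-step pattern described above can be illustrated without spaCy. The following is a minimal plain-Python sketch, not spaCy's actual registry: `TOKENIZERS`, `register_tokenizer`, `resolve_tokenizer` and `ToyTokenizer` are hypothetical names, and the `nlp` argument is a placeholder string. It shows settings from a config block being passed to the registered function, which then returns a second function that receives `nlp` and builds the tokenizer.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a registry such as @tokenizers (not spaCy's implementation)
TOKENIZERS: Dict[str, Callable] = {}


def register_tokenizer(name: str) -> Callable:
    """Decorator that stores a factory function under a string name."""
    def decorator(func: Callable) -> Callable:
        TOKENIZERS[name] = func
        return func
    return decorator


class ToyTokenizer:
    """Illustrative tokenizer: splits on whitespace, optionally lowercasing."""
    def __init__(self, nlp, lowercase: bool):
        self.nlp = nlp
        self.lowercase = lowercase

    def __call__(self, text: str):
        if self.lowercase:
            text = text.lower()
        return text.split()


@register_tokenizer("toy_tokenizer")
def create_toy_tokenizer(lowercase: bool):
    # Step 1: called once with the settings from the config block
    def create_tokenizer(nlp):
        # Step 2: called later with the nlp object to build the tokenizer
        return ToyTokenizer(nlp, lowercase)
    return create_tokenizer


def resolve_tokenizer(block: dict) -> Callable:
    """Look up the registered function and pass it the remaining settings."""
    settings = dict(block)
    factory = TOKENIZERS[settings.pop("@tokenizers")]
    return factory(**settings)


# Dictionary equivalent of a [nlp.tokenizer] block with one extra setting
config_block = {"@tokenizers": "toy_tokenizer", "lowercase": True}
create_tokenizer = resolve_tokenizer(config_block)
tokenizer = create_tokenizer("nlp-placeholder")
print(tokenizer("Hello World"))  # ['hello', 'world']
```

Note that the registered function itself never sees `nlp`: it only receives the config values, which is why it has to return a function rather than a finished tokenizer.
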
#### Using pre-tokenized text {#own-annotations}

spaCy generally assumes by default that your data is **raw text**. However,
sometimes your data is partially annotated, e.g. with pre-existing tokenization,
part-of-speech tags, etc. The most common situation is that you have
**pre-defined tokenization**. If you have a list of strings, you can create a
[`Doc`](/api/doc) object directly. Optionally, you can also specify a list of
boolean values, indicating whether each word is followed by a space.

> #### ✏️ Things to try
>
> 1. Change a boolean value in the list of `spaces`. You should see it reflected
>    in the `doc.text` and whether the token is followed by a space.
> 2. Remove `spaces=spaces` from the `Doc`. You should see that every token is
>    now followed by a space.
> 3. Copy-paste a random sentence from the internet and manually construct a
>    `Doc` with `words` and `spaces` so that the `doc.text` matches the original
>    input text.

```python
### {executable="true"}
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])
```

If provided, the spaces list must be the **same length** as the words list. The
spaces list affects the `doc.text`, `span.text`, `token.idx`, `span.start_char`
and `span.end_char` attributes. If you don't provide a `spaces` sequence, spaCy
will assume that all words are followed by a space. Once you have a
[`Doc`](/api/doc) object, you can write to its attributes to set the
part-of-speech tags, syntactic dependencies, named entities and other
attributes.

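To see why the `spaces` list influences the text and the character offsets, here is a plain-Python sketch of the bookkeeping involved. It is only an illustration of the behavior described above, not spaCy's implementation, and `text_and_offsets` is a hypothetical helper name. The returned offsets play the role of `token.idx`.

```python
from typing import List, Optional, Tuple


def text_and_offsets(
    words: List[str], spaces: Optional[List[bool]] = None
) -> Tuple[str, List[int]]:
    """Rebuild the text and each word's start character from words and spaces."""
    if spaces is None:
        # Default described above: every word is followed by a space
        spaces = [True] * len(words)
    if len(spaces) != len(words):
        raise ValueError("spaces must be the same length as words")
    text = ""
    offsets = []
    for word, has_space in zip(words, spaces):
        offsets.append(len(text))  # start character, analogous to token.idx
        text += word + (" " if has_space else "")
    return text, offsets


words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
text, offsets = text_and_offsets(words, spaces)
print(repr(text))  # 'Hello, world!'
print(offsets)  # [0, 5, 7, 12]
print(repr(text_and_offsets(words)[0]))  # 'Hello , world ! '
```
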
#### Aligning tokenization {#aligning-tokenization}

spaCy's tokenization is non-destructive and uses language-specific rules
optimized for compatibility with treebank annotations. Other tools and resources
can sometimes tokenize things differently – for example, `"I'm"` →
`["I", "'", "m"]` instead of `["I", "'m"]`.

In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
apply them to spaCy tokens. spaCy's [`Alignment`](/api/example#alignment-object)
object provides the one-to-one mappings of token indices in both directions, and
also accounts for indices where multiple tokens align to one single token.

> #### ✏️ Things to try
>
> 1. Change the capitalization in one of the token lists – for example,
>    `"obama"` to `"Obama"`. You'll see that the alignment is case-insensitive.
> 2. Change `"podcasts"` in `other_tokens` to `"pod", "casts"`. You should see
>    that there are now two tokens of length 2 in `y2x`, one corresponding to
>    "'s", and one to "podcasts".
> 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that all
>    tokens now correspond 1-to-1.

```python
### {executable="true"}
from spacy.training import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(f"a -> b, lengths: {align.x2y.lengths}")  # array([1, 1, 1, 1, 1, 1, 1, 1])
print(f"a -> b, mapping: {align.x2y.dataXd}")  # array([0, 1, 2, 3, 4, 4, 5, 6]) : two tokens both refer to "'s"
print(f"b -> a, lengths: {align.y2x.lengths}")  # array([1, 1, 1, 1, 2, 1, 1])   : the token "'s" refers to two tokens
print(f"b -> a, mappings: {align.y2x.dataXd}")  # array([0, 1, 2, 3, 4, 5, 6, 7])
```

Here are some insights from the alignment information generated in the example
above:

- The one-to-one mappings for the first four tokens are identical, which means
  they map to each other. This makes sense because they're also identical in the
  input: `"i"`, `"listened"`, `"to"` and `"obama"`.
- The value of `x2y.dataXd[6]` is `5`, which means that `other_tokens[6]`
  (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
- `x2y.dataXd[4]` and `x2y.dataXd[5]` are both `4`, which means that both tokens
  4 and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens`
  (`"'s"`).

<Infobox title="Important note" variant="warning">

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.

</Infobox>

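The "same string" requirement becomes clearer when the alignment is computed from character positions. The sketch below is a plain-Python illustration under that assumption and is not the algorithm spaCy uses internally; `char_to_token`, `project` and `align_tokens` are hypothetical helper names. It returns the same kind of `lengths` and flat index data that the example above prints.

```python
from typing import List, Tuple


def char_to_token(tokens: List[str]) -> List[int]:
    """For every character of the joined string, store the index of its token."""
    mapping = []
    for i, token in enumerate(tokens):
        mapping.extend([i] * len(token))
    return mapping


def project(src: List[str], tgt_chars: List[int]) -> Tuple[List[int], List[int]]:
    """Return (lengths, data): how many target tokens each source token
    touches, and the flat list of those target token indices."""
    lengths, data = [], []
    pos = 0
    for token in src:
        targets = []
        for t in tgt_chars[pos : pos + len(token)]:
            if not targets or targets[-1] != t:
                targets.append(t)
        lengths.append(len(targets))
        data.extend(targets)
        pos += len(token)
    return lengths, data


def align_tokens(x: List[str], y: List[str]):
    # Lowercase first so that the comparison ignores case
    x = [t.lower() for t in x]
    y = [t.lower() for t in y]
    if "".join(x) != "".join(y):
        raise ValueError("Both tokenizations must add up to the same string")
    x2y = project(x, char_to_token(y))
    y2x = project(y, char_to_token(x))
    return x2y, y2x


other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
x2y, y2x = align_tokens(other_tokens, spacy_tokens)
print(x2y)  # ([1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 4, 5, 6])
print(y2x)  # ([1, 1, 1, 1, 2, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7])
```

Because every character has to belong to exactly one token on each side, a pair like `["I", "'m"]` and `["I", "am"]` cannot be handled by this approach and raises an error.
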
## Merging and splitting {#retokenization new="2.1"}

The [`Doc.retokenize`](/api/doc#retokenize) context manager lets you merge and
split tokens. Modifications to the tokenization are stored and performed all at
once when the context manager exits. To merge several tokens into one single
token, pass a `Span` to [`retokenizer.merge`](/api/doc#retokenizer.merge). An
optional dictionary of `attrs` lets you set attributes that will be assigned to
the merged token – for example, the lemma, part-of-speech tag or entity type. By
default, the merged token will receive the same attributes as the merged span's
root.

> #### ✏️ Things to try
>
> 1. Inspect the `token.lemma_` attribute with and without setting the `attrs`.
>    You'll see that the lemma defaults to "New", the lemma of the span's root.
> 2. Overwrite other attributes like the `"ENT_TYPE"`. Since "New York" is also
>    recognized as a named entity, this change will also be reflected in the
>    `doc.ents`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
print("After:", [token.text for token in doc])
```

> #### Tip: merging entities and noun phrases
>
> If you need to merge named entities or noun chunks, check out the built-in
> [`merge_entities`](/api/pipeline-functions#merge_entities) and
> [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) pipeline
> components. When added to your pipeline using `nlp.add_pipe`, they'll take
> care of merging the spans automatically.

If an attribute in the `attrs` is a context-dependent token attribute, it will
be applied to the underlying [`Token`](/api/token). For example, `LEMMA`, `POS`
or `DEP` only apply to a word in context, so they're token attributes. If an
attribute is a context-independent lexical attribute, it will be applied to the
underlying [`Lexeme`](/api/lexeme), the entry in the vocabulary. For example,
`LOWER` or `IS_STOP` apply to all words of the same spelling, regardless of the
context.

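The difference between the two kinds of attributes can be pictured with a toy model. This is a plain-Python sketch of the idea only and does not reflect how spaCy stores tokens and lexemes; `ToyLexeme`, `ToyToken` and `ToyVocab` are hypothetical classes. Every occurrence of a spelling shares one vocabulary entry, while attributes such as the lemma live on the individual occurrence.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyLexeme:
    """Context-independent entry, shared by every occurrence of a spelling."""
    text: str
    is_stop: bool = False


@dataclass
class ToyToken:
    """One occurrence of a word in context; points to the shared lexeme."""
    lexeme: ToyLexeme
    lemma: str = ""


class ToyVocab:
    def __init__(self) -> None:
        self.entries: Dict[str, ToyLexeme] = {}

    def get(self, text: str) -> ToyLexeme:
        # Create the entry on first use, then always return the same object
        if text not in self.entries:
            self.entries[text] = ToyLexeme(text)
        return self.entries[text]


vocab = ToyVocab()
tokens = [ToyToken(vocab.get(word)) for word in ["saw", "a", "saw"]]

tokens[0].lemma = "see"  # token attribute: only this occurrence changes
tokens[0].lexeme.is_stop = True  # lexical attribute: every "saw" changes

print([t.lemma for t in tokens])  # ['see', '', '']
print([t.lexeme.is_stop for t in tokens])  # [True, False, True]
```
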
<Infobox variant="warning" title="Note on merging overlapping spans">

If you're trying to merge spans that overlap, spaCy will raise an error because
it's unclear how the result should look. Depending on the application, you may
want to match the shortest or longest possible span, so it's up to you to filter
them. If you're looking for the longest non-overlapping span, you can use the
[`util.filter_spans`](/api/top-level#util.filter_spans) helper:

```python
from spacy.util import filter_spans

# nlp is a loaded pipeline, as in the example above
doc = nlp("I live in Berlin Kreuzberg")
spans = [doc[3:5], doc[3:4], doc[4:5]]
filtered_spans = filter_spans(spans)
```

</Infobox>

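To make "longest non-overlapping span" concrete, here is a plain-Python sketch of one greedy way to do the filtering on `(start, end)` token offsets. It is an illustration of the idea rather than spaCy's source code, and `filter_longest` is a hypothetical name: longer spans are preferred, ties go to the earlier span, and any span sharing a token with an already kept span is dropped.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def filter_longest(spans: List[Span]) -> List[Span]:
    """Greedy selection: prefer longer spans, then earlier ones, and skip
    any span that shares a token with a span that was already kept."""
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen = set()
    for start, end in ordered:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


# Offsets for "Berlin Kreuzberg", "Berlin" and "Kreuzberg" in the sentence above
spans = [(3, 5), (3, 4), (4, 5)]
print(filter_longest(spans))  # [(3, 5)]
print(filter_longest([(0, 2), (1, 3), (3, 4)]))  # [(0, 2), (3, 4)]
```
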
### Splitting tokens

The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting
one token into two or more tokens. This can be useful for cases where
tokenization rules alone aren't sufficient. For example, you might want to split
"its" into the tokens "it" and "is" – but not the possessive pronoun "its". You
can write rule-based logic that can find only the correct "its" to split, but by
that time, the `Doc` will already be tokenized.

This process of splitting a token requires more settings, because you need to
specify the text of the individual tokens, optional per-token attributes and how
they should be attached to the existing syntax tree. This can be done by
supplying a list of `heads` – either the token to attach the newly split token
to, or a `(token, subtoken)` tuple if the newly split token should be attached
to another subtoken. In this case, "New" should be attached to "York" (the
second split subtoken) and "York" should be attached to "in".

 								> #### ✏️ Things to try
 								>
 								> 1. Assign different attributes to the subtokens and compare the result.
 								> 2. Change the heads so that "New" is attached to "in" and "York" is attached
 								>    to "New".
 								> 3. Split the token into three tokens instead of two – for example,
 								>    `["New", "Yo", "rk"]`.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy import displacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("I live in NewYork")
 								print("Before:", [token.text for token in doc])
 								displacy.render(doc)  # displacy.serve if you're not in a Jupyter environment
 								with doc.retokenize() as retokenizer:
 								    heads = [(doc[3], 1), doc[2]]
 								    attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
 								    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
 								print("After:", [token.text for token in doc])
 								displacy.render(doc)  # displacy.serve if you're not in a Jupyter environment
 								```
 								Specifying the heads as a list of `token` or `(token, subtoken)` tuples allows
 								attaching split subtokens to other subtokens, without having to keep track of
 								the token indices after splitting.
 								| Token    | Head          | Description                                                                                         |
 								| -------- | ------------- | --------------------------------------------------------------------------------------------------- |
 								| `"New"`  | `(doc[3], 1)` | Attach this token to the second subtoken (index `1`) that `doc[3]` will be split into, i.e. "York". |
 								| `"York"` | `doc[2]`      | Attach this token to `doc[1]` in the original `Doc`, i.e. "in".                                     |
If you don't care about the heads (for example, if you're only running the
tokenizer and not the parser), you can attach each subtoken to itself:

```python
### {highlight="3"}
doc = nlp("I live in NewYorkCity")
with doc.retokenize() as retokenizer:
    heads = [(doc[3], 0), (doc[3], 1), (doc[3], 2)]
    retokenizer.split(doc[3], ["New", "York", "City"], heads=heads)
```

<Infobox title="Important note" variant="warning">

When splitting tokens, the subtoken texts always have to match the original
token text – or, put differently, `"".join(subtokens) == token.text` always needs
to hold true. If this wasn't the case, splitting tokens could easily end up
producing confusing and unexpected results that would contradict spaCy's
non-destructive tokenization policy.

```diff
doc = nlp("I live in L.A.")
with doc.retokenize() as retokenizer:
-    retokenizer.split(doc[3], ["Los", "Angeles"], heads=[(doc[3], 1), doc[2]])
+    retokenizer.split(doc[3], ["L.", "A."], heads=[(doc[3], 1), doc[2]])
```

</Infobox>
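
If the subtoken texts come from your own rules, it can help to check this
constraint before calling `retokenizer.split`. The following is a minimal sketch
in plain Python. The helper name `is_valid_split` is hypothetical and not part
of spaCy; it only mirrors the condition stated above.

```python
def is_valid_split(token_text, subtokens):
    """Return True if the subtoken texts concatenate to the original text."""
    return "".join(subtokens) == token_text


# Candidate splits for the examples used in this section
print(is_valid_split("L.A.", ["L.", "A."]))
print(is_valid_split("L.A.", ["Los", "Angeles"]))
print(is_valid_split("NewYork", ["New", "York"]))
```

Only candidates that pass this check preserve the original text of the `Doc`.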
### Overwriting custom extension attributes {#retokenization-extensions}

If you've registered custom
[extension attributes](/usage/processing-pipelines#custom-components-attributes),
you can overwrite them during retokenization by providing a dictionary of
attribute names mapped to new values as the `"_"` key in the `attrs`. For
merging, you need to provide one dictionary of attributes for the resulting
merged token. For splitting, you need to provide a list of dictionaries with
custom attributes, one per split subtoken.

<Infobox title="Important note" variant="warning">

To set extension attributes during retokenization, the attributes need to be
**registered** using the [`Token.set_extension`](/api/token#set_extension)
method and they need to be **writable**. This means that they should either have
a default value that can be overwritten, or a getter _and_ setter. Method
extensions or extensions with only a getter are computed dynamically, so their
values can't be overwritten. For more details, see the
[extension attribute docs](/usage/processing-pipelines/#custom-components-attributes).

</Infobox>

> #### ✏️ Things to try
>
> 1. Add another custom extension – maybe `"music_style"`? – and overwrite it.
> 2. Change the extension attribute to use only a `getter` function. You should
>    see that spaCy raises an error, because the attribute is not writable
>    anymore.
> 3. Rewrite the code to split a token with `retokenizer.split`. Remember that
>    you need to provide a list of extension attribute values as the `"_"`
>    property, one for each split subtoken.

```python
### {executable="true"}
import spacy
from spacy.tokens import Token

# Register a custom token attribute, token._.is_musician
Token.set_extension("is_musician", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like David Bowie")
print("Before:", [(token.text, token._.is_musician) for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4], attrs={"_": {"is_musician": True}})
print("After:", [(token.text, token._.is_musician) for token in doc])
```
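
For splitting, the `"_"` entry takes a list with one dictionary per subtoken
instead of a single dictionary. The following is a minimal sketch, assuming
spaCy v3 is installed. The extension name `is_city_part` is a hypothetical
example, and a blank English pipeline is used so that no trained pipeline needs
to be downloaded.

```python
import spacy
from spacy.tokens import Token

# Hypothetical writable extension; force=True lets the snippet be re-run
Token.set_extension("is_city_part", default=False, force=True)

nlp = spacy.blank("en")
doc = nlp("I live in NewYork")

with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]
    # One dictionary of extension values per split subtoken
    attrs = {"_": [{"is_city_part": True}, {"is_city_part": True}]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)

flags = [(token.text, token._.is_city_part) for token in doc]
print(flags)
```

Tokens that were not part of the split keep the registered default value.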
## Sentence Segmentation {#sbd}

A [`Doc`](/api/doc) object's sentences are available via the `Doc.sents`
property. To view a `Doc`'s sentences, you can iterate over `Doc.sents`, a
generator that yields [`Span`](/api/span) objects. You can check whether a `Doc`
has sentence boundaries by calling
[`Doc.has_annotation`](/api/doc#has_annotation) with the attribute name
`"SENT_START"`.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)
```
spaCy provides four alternatives for sentence segmentation:

1. [Dependency parser](#sbd-parser): the statistical
   [`DependencyParser`](/api/dependencyparser) provides the most accurate
   sentence boundaries based on full dependency parses.
2. [Statistical sentence segmenter](#sbd-senter): the statistical
   [`SentenceRecognizer`](/api/sentencerecognizer) is a simpler and faster
   alternative to the parser that only sets sentence boundaries.
3. [Rule-based pipeline component](#sbd-component): the rule-based
   [`Sentencizer`](/api/sentencizer) sets sentence boundaries using a
   customizable list of sentence-final punctuation.
4. [Custom function](#sbd-custom): your own custom function added to the
   processing pipeline can set sentence boundaries by writing to
   `Token.is_sent_start`.
### Default: Using the dependency parse {#sbd-parser model="parser"}

Unlike other libraries, spaCy uses the dependency parse to determine sentence
boundaries. This is usually the most accurate approach, but it requires a
**trained pipeline** that provides accurate predictions. If your texts are
closer to general-purpose news or web text, this should work well out-of-the-box
with spaCy's provided trained pipelines. For social media or conversational text
that doesn't follow the same rules, your application may benefit from a custom
trained or rule-based component.
```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```
spaCy's dependency parser respects already set boundaries, so you can preprocess
your `Doc` using custom components _before_ it's parsed. Depending on your text,
this may also improve parse accuracy, since the parser is constrained to predict
parses consistent with the sentence boundaries.

### Statistical sentence segmenter {#sbd-senter model="senter" new="3"}

The [`SentenceRecognizer`](/api/sentencerecognizer) is a simple statistical
component that only provides sentence boundaries. Along with being faster and
smaller than the parser, its primary advantage is that it's easier to train
because it only requires annotated sentence boundaries rather than full
dependency parses. spaCy's [trained pipelines](/models) include both a parser
and a trained sentence segmenter, which is
[disabled](/usage/processing-pipelines#disabling) by default. If you only need
sentence boundaries and no parser, you can use the `exclude` or `disable`
argument on [`spacy.load`](/api/top-level#spacy.load) to load the pipeline
without the parser and then enable the sentence recognizer explicitly with
[`nlp.enable_pipe`](/api/language#enable_pipe).

> #### senter vs. parser
>
> The recall for the `senter` is typically slightly lower than for the parser,
> which is better at predicting sentence boundaries when punctuation is not
> present.

```python
### {executable="true"}
import spacy

# Load the pipeline without the parser, then switch on the senter
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```
### Rule-based pipeline component {#sbd-component}

The [`Sentencizer`](/api/sentencizer) component is a
[pipeline component](/usage/processing-pipelines) that splits sentences on
punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you only
need sentence boundaries without dependency parses.

```python
### {executable="true"}
import spacy
from spacy.lang.en import English

nlp = English()  # just the language with no pipeline
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```
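
The list of sentence-final punctuation can be customized when the component is
added. The following is a minimal sketch, assuming spaCy v3 and the
`punct_chars` setting of the `sentencizer` factory. The chosen character set,
which adds `;` to `.`, `!` and `?`, is only an illustrative assumption.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
# Treat ";" as sentence-final in addition to ".", "!" and "?"
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", ";"]})

doc = nlp("It rained; we stayed in. Nobody minded!")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

With this configuration, the token following each listed punctuation mark
starts a new sentence.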
### Custom rule-based strategy {#sbd-custom}

If you want to implement your own strategy that differs from the default
rule-based approach of splitting on sentence-final punctuation, you can also
create a
[custom pipeline component](/usage/processing-pipelines#custom-components) that
takes a `Doc` object and sets the `Token.is_sent_start` attribute on each
individual token. If set to `False`, the token is explicitly marked as _not_ the
start of a sentence. If set to `None` (default), it's treated as a missing value
and can still be overwritten by the parser.
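
As a minimal sketch of these values, the component below marks the token after
each `|` character as a sentence start and leaves all other tokens unset. It
assumes spaCy v3 and a blank English pipeline without a parser. The component
name `pipe_char_boundaries` and the `|` delimiter are hypothetical choices made
for illustration.

```python
import spacy
from spacy.language import Language


@Language.component("pipe_char_boundaries")  # hypothetical component name
def pipe_char_boundaries(doc):
    # Mark the token that follows each "|" as the start of a sentence.
    # All other tokens keep the default value None (unset).
    for token in doc[:-1]:
        if token.text == "|":
            doc[token.i + 1].is_sent_start = True
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("pipe_char_boundaries")
doc = nlp("alpha beta | gamma delta | epsilon")

sentences = [sent.text for sent in doc.sents]
print(sentences)
print(doc[1].is_sent_start, doc[3].is_sent_start)
```

Because the unset tokens stay `None` rather than `False`, a parser added after
this component would still be free to predict further boundaries.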
<Infobox title="Important note" variant="warning">

To prevent inconsistent state, you can only set boundaries **before** a document
is parsed (that is, while `doc.has_annotation("DEP")` is `False`). To ensure
that your component is added in the right place, you can set `before="parser"`
or `first=True` when adding it to the pipeline using
[`nlp.add_pipe`](/api/language#add_pipe).

</Infobox>

Here's an example of a component that implements a pre-processing rule for
splitting on `"..."` tokens. The component is added before the parser, which is
then used to further segment the text. That's possible because `is_sent_start`
is only set to `True` for some of the tokens – all others still specify `None`
for unset sentence boundaries. This approach can be useful if you want to
implement **additional** rules specific to your data, while still being able to
take advantage of dependency-based sentence segmentation.
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00

```python
### {executable="true"}
from spacy.language import Language
import spacy

text = "this is a sentence...hello...and another sentence."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Before:", [sent.text for sent in doc.sents])

# The registered name must match the name passed to nlp.add_pipe below
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "...":
            # Mark the token after "..." as the start of a new sentence
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe("set_custom_boundaries", before="parser")
doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])
```
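
To see why setting only some boundaries is enough, here is a plain-Python
sketch of the idea. It does not use spaCy and is not how the parser works
internally: the token list and the helpers `preset_boundaries`, `fill_unset`
and `split_sentences` are hypothetical names chosen for illustration. Preset
boundaries are `True`, everything else stays `None`, and a later step only
decides the values that are still unset.

```python
from typing import List, Optional

tokens = ["this", "is", "a", "sentence", "...", "hello", "...",
          "and", "another", "sentence", "."]

def preset_boundaries(tokens: List[str]) -> List[Optional[bool]]:
    """Mark the token after each "..." as a sentence start, leave the rest unset."""
    starts: List[Optional[bool]] = [None] * len(tokens)
    for i, token in enumerate(tokens[:-1]):
        if token == "...":
            starts[i + 1] = True
    return starts

def fill_unset(starts: List[Optional[bool]]) -> List[bool]:
    """Stand-in for a later component: decide only the values that are still None."""
    # Assumption for this sketch: the first token starts a sentence and no
    # further boundaries are predicted.
    return [(i == 0) if value is None else value for i, value in enumerate(starts)]

def split_sentences(tokens: List[str], starts: List[bool]) -> List[List[str]]:
    """Group tokens into sentences, opening a new one at every True flag."""
    sentences: List[List[str]] = []
    for token, is_start in zip(tokens, starts):
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(token)
    return sentences

starts = preset_boundaries(tokens)
sentences = split_sentences(tokens, fill_unset(starts))
print([" ".join(sentence) for sentence in sentences])
```

The preset `True` values survive the second step untouched, which is the
property the custom component relies on when it runs before the parser.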

## Mappings & Exceptions {#mappings-exceptions new="3"}

The [`AttributeRuler`](/api/attributeruler) manages **rule-based mappings and
exceptions** for all token-level attributes. As the number of
[pipeline components](/api/#architecture-pipeline) has grown from spaCy v2 to
v3, handling rules and exceptions in each component individually has become
impractical, so the `AttributeRuler` provides a single component with a unified
pattern format for all token attribute mappings and exceptions.

The `AttributeRuler` uses
[`Matcher` patterns](/usage/rule-based-matching#adding-patterns) to identify
tokens and then assigns them the provided attributes. If needed, the
[`Matcher`](/api/matcher) patterns can include context around the target token.
For example, the attribute ruler can:

- provide exceptions for any **token attributes**
- map **fine-grained tags** to **coarse-grained tags** for languages without
  statistical morphologizers (replacing the v2.x `tag_map` in the
  [language data](#language-data))
- map token **surface form + fine-grained tags** to **morphological features**
  (replacing the v2.x `morph_rules` in the [language data](#language-data))
- specify the **tags for space tokens** (replacing hard-coded behavior in the
  tagger)
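
As a rough picture of the tag-mapping case, a v2.x-style tag map can be thought
of as one single-token pattern per fine-grained tag. The sketch below is plain
Python and only builds the data. The `tag_map` contents and the helper name
`tag_map_to_rules` are made up for illustration, and morphological features,
which a real tag map may also carry, are left out.

```python
from typing import Dict, List

# Hypothetical excerpt of a v2.x-style tag map: fine-grained tag -> attributes
tag_map: Dict[str, Dict[str, str]] = {
    "NNP": {"POS": "PROPN"},
    "DT": {"POS": "DET"},
    "WP": {"POS": "PRON"},
}

def tag_map_to_rules(tag_map: Dict[str, Dict[str, str]]) -> List[dict]:
    """Turn each tag into a one-token Matcher-style pattern plus the attrs to set."""
    rules = []
    for tag, attrs in tag_map.items():
        rules.append({"patterns": [[{"TAG": tag}]], "attrs": dict(attrs)})
    return rules

rules = tag_map_to_rules(tag_map)
for rule in rules:
    print(rule["patterns"], "->", rule["attrs"])
```

Each resulting dictionary uses the same `patterns` and `attrs` names that
`ruler.add` takes as arguments elsewhere in this section.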

The following example shows how the tag and POS `NNP`/`PROPN` can be specified
for the phrase `"The Who"`, overriding the tags provided by the statistical
tagger and the POS tag map.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"
doc1 = nlp(text)
print(doc1[2].tag_, doc1[2].pos_)  # DT DET
print(doc1[3].tag_, doc1[3].pos_)  # WP PRON

# Add attribute ruler exceptions for "The Who" as NNP/PROPN NNP/PROPN
ruler = nlp.get_pipe("attribute_ruler")
# Pattern to match "The Who"
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
# The attributes to assign to the matched token
attrs = {"TAG": "NNP", "POS": "PROPN"}
# Add rules to the attribute ruler
ruler.add(patterns=patterns, attrs=attrs, index=0)  # "The" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=1)  # "Who" in "The Who"

doc2 = nlp(text)
print(doc2[2].tag_, doc2[2].pos_)  # NNP PROPN
print(doc2[3].tag_, doc2[3].pos_)  # NNP PROPN
# The second "Who" (token 6, after the period) remains unmodified
print(doc2[6].tag_, doc2[6].pos_)  # WP PRON
```
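
The role of `index` is easier to see in a stripped-down model. The following
plain-Python sketch is not spaCy's `Matcher`: it only supports the `LOWER` and
`TEXT` keys on a list of strings, and `token_matches`, `find_matches` and
`apply_rule` are hypothetical helpers. It shows that the pattern selects a span
and `index` picks the token inside that span that receives the attributes.

```python
from typing import Dict, List

def token_matches(text: str, spec: Dict[str, str]) -> bool:
    """Check one token against one pattern entry (only LOWER and TEXT are modelled)."""
    checks = {"LOWER": text.lower(), "TEXT": text}
    return all(checks[key] == value for key, value in spec.items())

def find_matches(words: List[str], pattern: List[Dict[str, str]]) -> List[int]:
    """Return the start offsets of every span that matches the whole pattern."""
    size = len(pattern)
    return [
        start
        for start in range(len(words) - size + 1)
        if all(token_matches(words[start + i], spec) for i, spec in enumerate(pattern))
    ]

def apply_rule(words, annotations, pattern, attrs, index=0):
    """Assign attrs to the token at position `index` inside each matched span."""
    for start in find_matches(words, pattern):
        annotations[start + index].update(attrs)

words = ["I", "saw", "The", "Who", "perform", ".", "Who", "did", "you", "see", "?"]
annotations = [{} for _ in words]
pattern = [{"LOWER": "the"}, {"TEXT": "Who"}]
attrs = {"TAG": "NNP", "POS": "PROPN"}

apply_rule(words, annotations, pattern, attrs, index=0)  # "The" in "The Who"
apply_rule(words, annotations, pattern, attrs, index=1)  # "Who" in "The Who"
print(annotations[2], annotations[3], annotations[6])
```

Because the pattern requires "the" directly before "Who", the standalone "Who"
at position 6 is never part of a match and keeps its empty annotation.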

<Infobox variant="warning" title="Migrating from spaCy v2.x">

The [`AttributeRuler`](/api/attributeruler) can import a **tag map and morph
rules** in the v2.x format via its built-in methods or when the component is
initialized before training. See the
[migration guide](/usage/v3#migrating-training-mappings-exceptions) for details.

</Infobox>

## Word vectors and semantic similarity {#vectors-similarity}

import Vectors101 from 'usage/101/\_vectors-similarity.md'

<Vectors101 />

### Adding word vectors {#adding-vectors}

Custom word vectors can be trained using a number of open-source libraries, such
as [Gensim](https://radimrehurek.com/gensim), [FastText](https://fasttext.cc),
or Tomas Mikolov's original
[Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most
word vector libraries output an easy-to-read text-based format, where each line
consists of the word followed by its vector. For everyday use, we want to
convert the vectors into a binary format that loads faster and takes up less
space on disk. The easiest way to do this is the
[`init vectors`](/api/cli#init-vectors) command-line utility. The example below
outputs a blank spaCy pipeline in the directory `/tmp/la_vectors_wiki_lg`,
giving you access to some nice Latin vectors. You can then pass the directory
path to [`spacy.load`](/api/top-level#spacy.load) or use it in the
[`[initialize]`](/api/data-formats#config-initialize) block of your config when
you [train](/usage/training) a model.

> #### Usage example
>
> ```python
> import spacy
>
> nlp_latin = spacy.load("/tmp/la_vectors_wiki_lg")
> doc1 = nlp_latin("Caecilius est in horto")
> doc2 = nlp_latin("servus est in atrio")
> doc1.similarity(doc2)
> ```

```cli
$ wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
$ python -m spacy init vectors en cc.la.300.vec.gz /tmp/la_vectors_wiki_lg
```
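
For reference, the text-based format mentioned above is simple enough to read
by hand. The sketch below parses it with the standard library only. It assumes
space-separated values and an optional header line holding the number of rows
and the vector width, as used by word2vec-style `.vec` files.
`read_text_vectors` is a hypothetical helper, not part of spaCy, and the sample
values are made up.

```python
import io
from typing import Dict, Iterable, List

def read_text_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse "word v1 v2 ..." lines into a dict, skipping an optional header."""
    vectors: Dict[str, List[float]] = {}
    for line_number, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # header line: number of rows and vector width
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = [float(value) for value in values]
    return vectors

# Tiny made-up sample standing in for a real vectors file
sample = io.StringIO("2 3\nhorto 0.1 0.2 0.3\natrio -0.4 0.5 0.6\n")
vectors = read_text_vectors(sample)
print(vectors["horto"])
```

For a real file you would pass an open file handle instead of the `StringIO`
sample, for example one returned by `gzip.open(path, "rt", encoding="utf8")`.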

<Accordion title="How to optimize vector coverage" id="custom-vectors-coverage" spaced>

To help you strike a good balance between coverage and memory usage, spaCy's
[`Vectors`](/api/vectors) class lets you map **multiple keys** to the **same
row** of the table. If you're using the
[`spacy init vectors`](/api/cli#init-vectors) command to create a vocabulary,
pruning the vectors will be taken care of automatically if you set the `--prune`
flag. You can also do it manually in the following steps:

1. Start with a **word vectors package** that covers a huge vocabulary. For
   instance, the [`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg)
   starter provides 300-dimensional GloVe vectors for over 1 million terms of
   English.
2. If your vocabulary has values set for the `Lexeme.prob` attribute, the
   lexemes will be sorted by descending probability to determine which vectors
   to prune. Otherwise, lexemes will be sorted by their order in the `Vocab`.
3. Call [`Vocab.prune_vectors`](/api/vocab#prune_vectors) with the number of
   vectors you want to keep.

```python
import spacy

nlp = spacy.load("en_vectors_web_lg")
n_vectors = 105000  # number of vectors to keep
removed_words = nlp.vocab.prune_vectors(n_vectors)
assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries
```

[`Vocab.prune_vectors`](/api/vocab#prune_vectors) reduces the current vector
table to a given number of unique entries, and returns a dictionary containing
the removed words, mapped to `(string, score)` tuples, where `string` is the
entry the removed word was mapped to, and `score` the similarity score between
the two words.

```python
### Removed words
{
    "Shore": ("coast", 0.732257),
    "Precautionary": ("caution", 0.490973),
    "hopelessness": ("sadness", 0.742366),
    "Continous": ("continuous", 0.732549),
    "Disemboweled": ("corpse", 0.499432),
    "biostatistician": ("scientist", 0.339724),
    "somewheres": ("somewheres", 0.402736),
    "observing": ("observe", 0.823096),
    "Leaving": ("leaving", 1.0),
}
```

In the example above, the vector for "Shore" was removed and remapped to the
vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to
the vector of "leaving", which is identical. If you're using the
[`init vectors`](/api/cli#init-vectors) command, you can set the `--prune`
option to easily reduce the size of the vectors as you add them to a spaCy
pipeline:

```cli
$ python -m spacy init vectors en la.300d.vec.tgz /tmp/la_vectors_web_md --prune 10000
```

This will create a blank spaCy pipeline with vectors for the first 10,000 words
in the vectors file. All other words are mapped to the closest vector among
those retained.

</Accordion>
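
The remapping that pruning performs can be sketched in a few lines of plain
Python. This is an illustration of the idea under simplifying assumptions, not
spaCy's implementation: the three-dimensional vectors are made up, `prune` is a
hypothetical helper, and the first `n_keep` entries are kept in insertion
order. Each removed word is mapped to the retained word with the highest cosine
similarity, mirroring the `(string, score)` pairs described above.

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prune(vectors: Dict[str, List[float]], n_keep: int) -> Dict[str, Tuple[str, float]]:
    """Keep the first n_keep entries and remap the rest to their closest kept word."""
    words = list(vectors)
    kept, removed = words[:n_keep], words[n_keep:]
    remapped = {}
    for word in removed:
        best = max(kept, key=lambda other: cosine(vectors[word], vectors[other]))
        remapped[word] = (best, cosine(vectors[word], vectors[best]))
    return remapped

# Made-up toy vectors; real tables have hundreds of dimensions
toy_vectors = {
    "coast": [1.0, 0.0, 0.0],
    "sadness": [0.0, 1.0, 0.0],
    "shore": [0.9, 0.1, 0.0],
    "hopelessness": [0.1, 0.8, 0.1],
}
removed_words = prune(toy_vectors, n_keep=2)
print(removed_words)
```

In this toy table, "shore" ends up sharing the vector of "coast" and
"hopelessness" the vector of "sadness", so only two rows need to be stored for
four keys.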

### Adding vectors individually {#adding-individual-vectors}

The `vector` attribute is a **read-only** numpy or cupy array (depending on
whether you've configured spaCy to use GPU memory), with dtype `float32`. The
array is read-only so that spaCy can avoid unnecessary copy operations where
possible. You can modify the vectors via the [`Vocab`](/api/vocab) or
[`Vectors`](/api/vectors) table. Using the
[`Vocab.set_vector`](/api/vocab#set_vector) method is often the easiest approach
if you have vectors in an arbitrary format, as you can read in the vectors with
your own logic, and just set them with a simple loop. This method is likely to
be slower than approaches that work with the whole vectors table at once, but
it's a great approach for once-off conversions before you save out your `nlp`
object to disk.

```python
### Adding vectors
import numpy
from spacy.vocab import Vocab

vector_data = {
    "dog": numpy.random.uniform(-1, 1, (300,)),
    "cat": numpy.random.uniform(-1, 1, (300,)),
    "orange": numpy.random.uniform(-1, 1, (300,))
}
vocab = Vocab()
for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
```

## Language Data {#language-data}

import LanguageData101 from 'usage/101/\_language-data.md'

<LanguageData101 />
 								### Creating a custom language subclass {#language-subclass}
 								If you want to customize multiple components of the language data or add support
 								for a custom language or domain-specific "dialect", you can also implement your
 								own language subclass. The subclass should define two attributes: `lang` (a
 								unique language code) and `Defaults` (the class holding the language data). For
 								an overview of the available attributes that can be overridden, see the
 								[`Language.Defaults`](/api/language#defaults) documentation.
 								```python
 								### {executable="true"}
 								from spacy.lang.en import English

 								class CustomEnglishDefaults(English.Defaults):
 								    stop_words = set(["custom", "stop"])

 								class CustomEnglish(English):
 								    lang = "custom_en"
 								    Defaults = CustomEnglishDefaults

 								nlp1 = English()
 								nlp2 = CustomEnglish()

 								print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
 								print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
 								```
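Assigning `stop_words` in the subclass replaces the inherited set rather than adding to it, so the default English stop words no longer apply. If you want to keep them, one option is to build the new set from the parent's. The following is a minimal sketch under that assumption; the names `ExtendedEnglishDefaults`, `ExtendedEnglish` and the language code `"custom_en_extended"` are hypothetical and only used for illustration.

```python
from spacy.lang.en import English

class ExtendedEnglishDefaults(English.Defaults):
    # Union with the inherited set so the default English stop words are kept
    stop_words = English.Defaults.stop_words | {"custom", "stop"}

class ExtendedEnglish(English):
    lang = "custom_en_extended"  # hypothetical language code
    Defaults = ExtendedEnglishDefaults

nlp = ExtendedEnglish()
# "the" comes from the English defaults, "custom" and "stop" from the extension
flags = [token.is_stop for token in nlp("the custom stop")]
print(nlp.lang, flags)
```

Whether to replace or extend depends on the use case: a domain-specific "dialect" may want a completely different list, while a small customization usually only needs a few additions.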
 								The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you
 								register a custom language class and assign it a string name. This means that
 								you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom
 								language name, and even train pipelines with it and refer to it in your
 								[training config](/usage/training#config).
 								> #### Config usage
 								>
 								> After registering your custom language class using the `languages` registry,
 								> you can refer to it in your [training config](/usage/training#config). This
 								> means spaCy will train your pipeline using the custom subclass.
 								>
 								> ```ini
 								> [nlp]
 								> lang = "custom_en"
 								> ```
 								>
 								> In order to resolve `"custom_en"` to your subclass, the registered function
 								> needs to be available during training. You can load a Python file containing
 								> the code using the `--code` argument:
 								>
 								> ```cli
 								> python -m spacy train config.cfg --code code.py
 								> ```
 								```python
 								### Registering a custom language {highlight="7,12-13"}
 								import spacy
 								from spacy.lang.en import English

 								class CustomEnglishDefaults(English.Defaults):
 								    stop_words = set(["custom", "stop"])

 								@spacy.registry.languages("custom_en")
 								class CustomEnglish(English):
 								    lang = "custom_en"
 								    Defaults = CustomEnglishDefaults

 								# This now works! 🎉
 								nlp = spacy.blank("custom_en")
 								```
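To check that the registration took effect, you can look the class up by its string name. The sketch below assumes spaCy v3 and uses `spacy.registry.languages.get` and `spacy.util.get_lang_class` for the lookup; treat it as an illustration of the mechanism rather than a required step.

```python
import spacy
from spacy.lang.en import English
from spacy.util import get_lang_class

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

# Both lookups are expected to resolve the string name to the registered class
registered = spacy.registry.languages.get("custom_en")
resolved = get_lang_class("custom_en")

nlp = spacy.blank("custom_en")
print(type(nlp).__name__, nlp.lang)
```

If the file containing the decorated class is not imported (or not passed via `--code` during training), the name is never registered and the lookup by name will fail.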