Update syntax for code block meta data

Marcus Blättermann 2022-11-29 18:09:19 +01:00
parent 47e4c62741
commit b305510227
No known key found for this signature in database
GPG Key ID: A1E1F04008AC450D
26 changed files with 253 additions and 478 deletions
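Most hunks below apply the same mechanical change: the `### Title {key="value"}` label that used to sit on the first line inside a code block is deleted, and its contents move into the code fence's meta string. A minimal before/after sketch, taken from one of the blocks changed below (the snippet body is truncated here for illustration):

````markdown
<!-- Before: label as the first line inside the block -->
```python
### Using the matcher {highlight="5-7"}
import spacy
from spacy.matcher import Matcher
```

<!-- After: label moved into the fence meta data -->
```python {title="Using the matcher",highlight="5-7"}
import spacy
from spacy.matcher import Matcher
```
````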

View File

@@ -120,8 +120,7 @@ file.
 > srsly.write_jsonl("/path/to/text.jsonl", data)
 > ```
-```json
-### Example
+```json {title="Example"}
 {"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
 {"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
 {"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."}

View File

@@ -317,8 +317,7 @@ $ python -m spacy convert ./data.json .
 > [`offsets_to_biluo_tags`](/api/top-level#offsets_to_biluo_tags) function can
 > help you convert entity offsets to the right format.
-```python
-### Example structure
+```python {title="Example structure"}
 [{
 "id": int, # ID of the document within the corpus
 "paragraphs": [{ # list of paragraphs in the corpus
@@ -436,8 +435,7 @@ file to keep track of your settings and hyperparameters and your own
 </Infobox>
-```python
-### Examples
+```python {title="Examples"}
 # Training data for a part-of-speech tagger
 doc = Doc(vocab, words=["I", "like", "stuff"])
 gold_dict = {"tags": ["NOUN", "VERB", "NOUN"]}
@@ -483,13 +481,11 @@ spaCy's [`Lexeme`](/api/lexeme#attributes) object.
 > vocab_data = "/path/to/vocab-data.jsonl"
 > ```
-```python
-### First line
+```python {title="First line"}
 {"lang": "en", "settings": {"oov_prob": -20.502029418945312}}
 ```
-```python
-### Entry structure
+```python {title="Entry structure"}
 {
 "orth": string, # the word text
 "id": int, # can correspond to row in vectors table

View File

@@ -15,8 +15,7 @@ notable downside to this format is that you can't easily extract just one
 document from the `DocBin`. The serialization format is gzipped msgpack, where
 the msgpack object has the following structure:
-```python
-### msgpack object structure
+```python {title="msgpack object structure"}
 {
 "version": str, # DocBin version number
 "attrs": List[uint64], # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE]

View File

@@ -780,8 +780,7 @@ doesn't, the pipeline analysis won't catch that.
 <Accordion title="Example output" spaced>
-```json
-### Structured
+```json {title="Structured"}
 {
 "summary": {
 "tagger": {

View File

@@ -69,8 +69,7 @@ Initialize the sentencizer.
 | `overwrite` <Tag variant="new">3.2</Tag> | Whether existing annotation is overwritten. Defaults to `False`. ~~bool~~ |
 | `scorer` <Tag variant="new">3.2</Tag> | The scoring method. Defaults to [`Scorer.score_spans`](/api/scorer#score_spans) for the attribute `"sents"` ~~Optional[Callable]~~ |
-```python
-### punct_chars defaults
+```python {title="punct_chars defaults"}
 ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።',
 '፧', '፨', '', '', '᜶', '', '', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫',
 '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉',

View File

@@ -61,8 +61,7 @@ Essentially, `spacy.load()` is a convenience wrapper that reads the pipeline's
 information to construct a `Language` object, loads in the model data and
 weights, and returns it.
-```python
-### Abstract example
+```python {title="Abstract example"}
 cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English
 nlp = cls() # 2. Initialize it
 for name in pipeline:

View File

@@ -384,8 +384,7 @@ to raw text with no highlighting. An optional label can be added as the first
 line with the prefix `####` (Python-like) and `///` (JavaScript-like). the
 indented block as plain text and preserve whitespace.
-```python
-### Using spaCy
+```python {title="Using spaCy"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
 doc = nlp("This is a sentence.")
@@ -407,8 +406,7 @@ adding `{highlight="..."}` to the headline. Acceptable ranges are spans like
 > ```
 > ````
-```python
-### Using the matcher {highlight="5-7"}
+```python {title="Using the matcher",highlight="5-7"}
 import spacy
 from spacy.matcher import Matcher
@@ -435,8 +433,7 @@ interactive widget defaults to a regular code block.
 > ```
 > ````
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
 doc = nlp("This is a sentence.")
@@ -554,8 +551,7 @@ This is a regular paragraph with a [link](https://spacy.io) and **bold text**.
 | -------- | -------- |
 | Column 1 | Column 2 |
-```python
-### Code block title {highlight="2-3"}
+```python {title="Code block title",highlight="2-3"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
 doc = nlp("Hello world")

View File

@@ -7,8 +7,7 @@ later, depending on your use case.
 Named entities are available as the `ents` property of a `Doc`:
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")

View File

@@ -11,8 +11,7 @@ Linguistic annotations are available as
 efficiency. So to get the readable string representation of an attribute, we
 need to add an underscore `_` to its name:
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")

View File

@@ -4,8 +4,7 @@ language. For example, punctuation at the end of a sentence should be split off
 whereas "U.K." should remain one token. Each `Doc` consists of individual
 tokens, and we can iterate over them:
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")

View File

@@ -3,8 +3,7 @@ multi-dimensional meaning representations of a word. Word vectors can be
 generated using an algorithm like
 [word2vec](https://en.wikipedia.org/wiki/Word2vec) and usually look like this:
-```python
-### banana.vector
+```python {title="banana.vector"}
 array([2.02280000e-01, -7.66180009e-02, 3.70319992e-01,
 3.28450017e-02, -4.19569999e-01, 7.20689967e-02,
 -3.74760002e-01, 5.74599989e-02, -1.24009997e-02,
@@ -42,8 +41,7 @@ the [`Token.vector`](/api/token#vector) attribute.
 default to an average of their token vectors. You can also check if a token has
 a vector assigned, and get the L2 norm, which can be used to normalize vectors.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_md")
@@ -93,8 +91,7 @@ similarity.
 > You should see that the similarity results are identical to the token
 > similarity.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_md") # make sure to use larger package!

View File

@@ -123,8 +123,7 @@ the entity recognizer, use a
 [Tok2VecListener](/api/architectures#Tok2VecListener) layer as their model's
 `tok2vec` argument, which connects to the `tok2vec` component model.
-```ini
-### Shared {highlight="1-2,4-5,19-20"}
+```ini {title="Shared",highlight="1-2,4-5,19-20"}
 [components.tok2vec]
 factory = "tok2vec"
@@ -152,8 +151,7 @@ In the independent setup, the entity recognizer component defines its own
 same. This makes them fully independent and doesn't require an upstream
 [`Tok2Vec`](/api/tok2vec) component to be present in the pipeline.
-```ini
-### Independent {highlight="7-8"}
+```ini {title="Independent", highlight="7-8"}
 [components.ner]
 factory = "ner"
@@ -210,8 +208,7 @@ your package manager and CUDA version. If you skip this step, pip will install
 PyTorch as a dependency below, but it may not find the best version for your
 setup.
-```bash
-### Example: Install PyTorch 1.11.0 for CUDA 11.3 with pip
+```bash {title="Example: Install PyTorch 1.11.0 for CUDA 11.3 with pip"}
 # See: https://pytorch.org/get-started/locally/
 $ pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
 ```
@@ -224,8 +221,7 @@ environment variable if your CUDA runtime is installed in a non-standard
 location. Putting it all together, if you had installed CUDA 11.3 in
 `/opt/nvidia/cuda`, you would run:
-```bash
-### Installation with CUDA
+```bash {title="Installation with CUDA"}
 $ export CUDA_PATH="/opt/nvidia/cuda"
 $ pip install -U %%SPACY_PKG_NAME[cuda113,transformers]%%SPACY_PKG_FLAGS
 ```
@@ -235,8 +231,7 @@ that require [`SentencePiece`](https://github.com/google/sentencepiece) (e.g.,
 ALBERT, CamemBERT, XLNet, Marian, and T5), install the additional dependencies
 with:
-```bash
-### Install sentencepiece
+```bash {title="Install sentencepiece"}
 $ pip install transformers[sentencepiece]
 ```
@@ -261,8 +256,7 @@ transformer-based [pipelines](/models) provided by spaCy end on `_trf`, e.g.
 $ python -m spacy download en_core_web_trf
 ```
-```python
-### Example
+```python {title="Example"}
 import spacy
 from thinc.api import set_gpu_allocator, require_gpu
@@ -343,8 +337,7 @@ component:
 > )
 > ```
-```ini
-### config.cfg (excerpt)
+```ini {title="config.cfg",excerpt="true"}
 [components.transformer]
 factory = "transformer"
 max_batch_items = 4096
@@ -424,8 +417,7 @@ subsentences of at most `max_length` tokens are returned.
 > max_length = 25
 > ```
-```python
-### code.py
+```python {title="code.py"}
 import spacy_transformers
 @spacy_transformers.registry.span_getters("custom_sent_spans")
@@ -475,8 +467,7 @@ is where we'll plug in our transformer model, using the
 [TransformerListener](/api/architectures#TransformerListener) layer, which
 sneakily delegates to the `Transformer` pipeline component.
-```ini
-### config.cfg (excerpt) {highlight="12"}
+```ini {title="config.cfg (excerpt)",highlight="12"}
 [components.ner]
 factory = "ner"
@@ -735,8 +726,7 @@ whole model), or a
 spaCy's built-in model architectures have a reference named `"tok2vec"` that
 will refer to the right layer.
-```ini
-### config.cfg
+```ini {title="config.cfg"}
 # 1. Use the whole model of the "tok2vec" component
 [pretraining]
 component = "tok2vec"
@@ -759,8 +749,7 @@ A pretraining step that runs for 5 epochs with an output path of `pretrain/`, as
 an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`. To
 make use of the final output, you could fill in this value in your config file:
-```ini
-### config.cfg
+```ini {title="config.cfg"}
 [paths]
 init_tok2vec = "pretrain/model4.bin"

View File

@@ -40,8 +40,7 @@ this config, you won't be able to change it anymore. The architecture is like a
 recipe for the network, and you can't change the recipe once the dish has
 already been prepared. You have to make a new one.
-```ini
-### config.cfg (excerpt)
+```ini {title="config.cfg (excerpt)"}
 [components.tagger]
 factory = "tagger"
@@ -126,8 +125,7 @@ default. This architecture combines a simple bag-of-words model with a neural
 network, usually resulting in the most accurate results, but at the cost of
 speed. The config file for this model would look something like this:
-```ini
-### config.cfg (excerpt)
+```ini {title="config.cfg (excerpt)"}
 [components.textcat]
 factory = "textcat"
 labels = []
@@ -165,8 +163,7 @@ use those by swapping out the definition of the textcat's model. For instance,
 to use the simple and fast bag-of-words model
 [TextCatBOW](/api/architectures#TextCatBOW), you can change the config to:
-```ini
-### config.cfg (excerpt) {highlight="6-10"}
+```ini {title="config.cfg (excerpt)",highlight="6-10"}
 [components.textcat]
 factory = "textcat"
 labels = []
@@ -198,8 +195,7 @@ These steps together compute dense, context-sensitive representations of the
 tokens, and their combination forms a typical
 [`Tok2Vec`](/api/architectures#Tok2Vec) layer:
-```ini
-### config.cfg (excerpt)
+```ini {title="config.cfg (excerpt)"}
 [components.tok2vec]
 factory = "tok2vec"
@@ -220,8 +216,7 @@ a sublayer for another one, for instance changing the first sublayer to a
 character embedding with the [CharacterEmbed](/api/architectures#CharacterEmbed)
 architecture:
-```ini
-### config.cfg (excerpt)
+```ini {title="config.cfg (excerpt)"}
 [components.tok2vec.model.embed]
 @architectures = "spacy.CharacterEmbed.v2"
 # ...
@@ -260,8 +255,7 @@ Let's use PyTorch to define a very simple neural network consisting of two
 hidden `Linear` layers with `ReLU` activation and dropout, and a
 softmax-activated output layer:
-```python
-### PyTorch model
+```python {title="PyTorch model"}
 from torch import nn
 torch_model = nn.Sequential(
@@ -312,8 +306,7 @@ architecture a name so spaCy knows how to find it, and allows passing in
 arguments like hyperparameters via the [config](/usage/training#config). The
 full example then becomes:
-```python
-### Registering the architecture {highlight="9"}
+```python {title="Registering the architecture",highlight="9"}
 from typing import List
 from thinc.types import Floats2d
 from thinc.api import Model, PyTorchWrapper, chain, with_array
@@ -352,8 +345,7 @@ by specifying it in the config file. In this configuration, all required
 parameters for the various subcomponents of the custom architecture are passed
 in as settings via the config.
-```ini
-### config.cfg (excerpt) {highlight="5-5"}
+```ini {title="config.cfg (excerpt)",highlight="5-5"}
 [components.tagger]
 factory = "tagger"
@@ -381,8 +373,7 @@ GPU memory allocator accordingly. When `gpu_allocator` is set to "pytorch" or
 respective libraries, preventing OOM errors when there's available memory
 sitting in the other library's pool.
-```ini
-### config.cfg (excerpt)
+```ini {title="config.cfg (excerpt)"}
 [training]
 gpu_allocator = "pytorch"
 ```
@@ -461,8 +452,7 @@ you have to call
 [`Model.initialize`](https://thinc.ai/docs/api-model#initialize) with an **input
 sample** `X` and an **output sample** `Y` with the correct dimensions:
-```python
-### Shape inference with initialization {highlight="3,7,10"}
+```python {title="Shape inference with initialization",highlight="3,7,10"}
 with Model.define_operators({">>": chain}):
 layers = (
 Relu(hidden_width)
@@ -564,8 +554,7 @@ matrix** (~~Floats2d~~) of predictions:
 > type checks and validation. See the section on [type signatures](#type-sigs)
 > for details.
-```python
-### The model architecture
+```python {title="The model architecture"}
 @spacy.registry.architectures("rel_model.v1")
 def create_relation_model(...) -> Model[List[Doc], Floats2d]:
 model = ... # 👈 model will go here
@@ -590,8 +579,7 @@ transforms the instance tensor into a final tensor holding the predictions:
 > # ...
 > ```
-```python
-### The model architecture {highlight="6"}
+```python {title="The model architecture",highlight="6"}
 @spacy.registry.architectures("rel_model.v1")
 def create_relation_model(
 create_instance_tensor: Model[List[Doc], Floats2d],
@@ -614,8 +602,7 @@ The `classification_layer` could be something like a
 > nO = null
 > ```
-```python
-### The classification layer
+```python {title="The classification layer"}
 @spacy.registry.architectures("rel_classification_layer.v1")
 def create_classification_layer(
 nO: int = None, nI: int = None
@@ -651,8 +638,7 @@ that has the full implementation.
 > # ...
 > ```
-```python
-### The layer that creates the instance tensor
+```python {title="The layer that creates the instance tensor"}
 @spacy.registry.architectures("rel_instance_tensor.v1")
 def create_tensors(
 tok2vec: Model[List[Doc], List[Floats2d]],
@@ -732,8 +718,7 @@ are within a **maximum distance** (in number of tokens) of each other:
 > max_length = 100
 > ```
-```python
-### Candidate generation
+```python {title="Candidate generation"}
 @spacy.registry.misc("rel_instance_generator.v1")
 def create_instances(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]:
 def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]:
@@ -776,8 +761,7 @@ above 0.5 to be a `True` relation. The ~~Example~~ instances that we'll use as
 training data, will include their gold-standard relation annotations in
 `example.reference._.rel`.
-```python
-### Registering the extension attribute
+```python {title="Registering the extension attribute"}
 from spacy.tokens import Doc
 Doc.set_extension("rel", default={})
 ```
@@ -790,8 +774,7 @@ create a subclass of [`TrainablePipe`](/api/pipe) that holds the model.
 ![Illustration of Pipe methods](/images/trainable_component.svg)
-```python
-### Pipeline component skeleton
+```python {title="Pipeline component skeleton"}
 from spacy.pipeline import TrainablePipe
 class RelationExtractor(TrainablePipe):
@@ -828,8 +811,7 @@ and the name of this component. Additionally, this component, just like the
 will predict scores for each label. We add convenience methods to easily
 retrieve and add to them.
-```python
-### The constructor (continued)
+```python {title="The constructor (continued)"}
 def __init__(self, vocab, model, name="rel"):
 """Create a component instance."""
 # ...
@@ -858,8 +840,7 @@ will be used to do
 layers of the neural network. This is triggered by calling
 [`Model.initialize`](https://thinc.ai/api/model#initialize).
-```python
-### The initialize method {highlight="12,15,18,22"}
+```python {title="The initialize method",highlight="12,15,18,22"}
 from itertools import islice
 def initialize(
@@ -899,8 +880,7 @@ update the weights of the model layers. Thinc provides several
 [loss functions](https://thinc.ai/docs/api-loss) that can be used for the
 implementation of the `get_loss` function.
-```python
-### The update method {highlight="12-14"}
+```python {title="The update method",highlight="12-14"}
 def update(
 self,
 examples: Iterable[Example],
@@ -926,8 +906,7 @@ delegate to the internal model's
 [predict](https://thinc.ai/docs/api-model#predict) function that takes a batch
 of `Doc` objects and returns a ~~Floats2d~~ array:
-```python
-### The predict method
+```python {title="The predict method"}
 def predict(self, docs: Iterable[Doc]) -> Floats2d:
 predictions = self.model.predict(docs)
 return self.model.ops.asarray(predictions)
@@ -944,8 +923,7 @@ need to refer to the model's `get_instances` function that defined which pairs
 of entities were relevant candidates, so that the predictions can be linked to
 those exact entities:
-```python
-### The set_annotations method {highlight="5-6,10"}
+```python {title="The set_annotations method",highlight="5-6,10"}
 def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d):
 c = 0
 get_instances = self.model.attrs["get_instances"]
@@ -962,8 +940,7 @@ def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d):
 Under the hood, when the pipe is applied to a document, it delegates to the
 `predict` and `set_annotations` methods:
-```python
-### The __call__ method
+```python {title="The __call__ method"}
 def __call__(self, doc: Doc):
 predictions = self.predict([doc])
 self.set_annotations([doc], predictions)
@@ -974,8 +951,7 @@ There is one more optional method to implement: [`score`](/api/pipe#score)
 calculates the performance of your component on a set of examples, and returns
 the results as a dictionary:
-```python
-### The score method
+```python {title="The score method"}
 def score(self, examples: Iterable[Example]) -> Dict[str, Any]:
 prf = PRFScore()
 for example in examples:
@@ -1014,8 +990,7 @@ assigns it a name and lets you create the component with
 > rel_micro_f = 1.0
 > ```
-```python
-### Registering the pipeline component
+```python {title="Registering the pipeline component"}
 from spacy.language import Language
 @Language.factory("relation_extractor")
@@ -1027,8 +1002,7 @@ You can extend the decorator to include information such as the type of
 annotations that are required for this component to run, the type of annotations
 it produces, and the scores that can be calculated:
-```python
-### Factory annotations {highlight="5-11"}
+```python {title="Factory annotations",highlight="5-11"}
 from spacy.language import Language
 @Language.factory(

View File

@@ -62,8 +62,7 @@ allows you to access individual morphological features.
 > and express that it's a pronoun in the third person.
 > 2. Inspect `token.morph` for the other tokens.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -80,8 +79,7 @@ spaCy's statistical [`Morphologizer`](/api/morphologizer) component assigns the
 morphological features and coarse-grained part-of-speech tags as `Token.morph`
 and `Token.pos`.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("de_core_news_sm")
@@ -106,8 +104,7 @@ coarse-grained part-of-speech tags and morphological features.
 [mapping table](#mappings-exceptions) maps the fine-grained tags to a
 coarse-grained POS tags and morphological features.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -126,8 +123,7 @@ spaCy provides two pipeline components for lemmatization:
 2. The [`EditTreeLemmatizer`](/api/edittreelemmatizer)
 <Tag variant="new">3.3</Tag> component provides a trainable lemmatizer.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 # English pipelines include a rule-based lemmatizer
@@ -238,8 +234,7 @@ head. You can think of noun chunks as a noun plus the words describing the noun
 get the noun chunks in a document, simply iterate over
 [`Doc.noun_chunks`](/api/doc#noun_chunks).
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -269,8 +264,7 @@ label, which describes the type of syntactic relation that connects the child to
 the head. As with other attributes, the value of `.dep` is a hash value. You can
 get the string value with `.dep_`.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -307,8 +301,7 @@ head**. You can therefore iterate over the arcs in the tree by iterating over
 the words in the sentence. This is usually the best way to match an arc of
 interest from below:
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy.symbols import nsubj, VERB
@@ -351,8 +344,7 @@ order. There are also two integer-typed attributes,
 [`Token.n_rights`](/api/token#n_rights) that give the number of left and right
 children.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -363,8 +355,7 @@ print(doc[2].n_lefts) # 2
 print(doc[2].n_rights) # 1
 ```
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("de_core_news_sm")
@@ -387,8 +378,7 @@ sequence of tokens. You can walk up the tree with the
 > true for the German pipelines, which have many
 > [non-projective dependencies](https://explosion.ai/blog/german-model#word-order).
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -417,8 +407,7 @@ easiest way to create a `Span` object for a syntactic phrase. Note that
 `.right_edge` gives a token **within** the subtree so if you use it as the
 end-point of a range, don't forget to `+1`!
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -445,8 +434,7 @@ currency values, i.e. entities labeled as `MONEY`, and then uses the dependency
 parse to find the noun phrase they are referring to for example `"Net income"`
 &rarr; `"$9.4 million"`.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -491,8 +479,7 @@ If you want to know how to write rules that hook into some type of syntactic
 construction, just plug the sentence into the visualizer and see how spaCy
 annotates it.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy import displacy
@@ -567,8 +554,7 @@ on a token, it will return an empty string.
 > - `U` Token is a single-token **unit** entity.
 > - `O` Token is **outside** an entity.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -603,8 +589,7 @@ directly to the `token.ent_iob` or `token.ent_type` attributes, so the easiest
 way to set entities is to use the [`doc.set_ents`](/api/doc#set_ents) function
 and create the new entity as a [`Span`](/api/span).
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy.tokens import Span
@@ -644,8 +629,7 @@ You can also assign entity annotations using the
 both the `ENT_TYPE` and the `ENT_IOB` attributes in the array you're importing
 from.
-```python
-### {executable="true"}
+```python {executable="true"}
 import numpy
 import spacy
 from spacy.attrs import ENT_IOB, ENT_TYPE
@@ -714,8 +698,7 @@ list of `Doc` objects to displaCy and run
 For more details and examples, see the
 [usage guide on visualizing spaCy](/usage/visualizers).
-```python
-### Named Entity example
+```python {title="Named Entity example"}
 import spacy
 from spacy import displacy
@@ -919,8 +902,7 @@ rules. This could be very certain expressions, or abbreviations only used in
 this specific field. Here's how to add a special case rule to an existing
 [`Tokenizer`](/api/tokenizer) instance:
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy.symbols import ORTH
@@ -967,8 +949,7 @@ tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:
 > " SUFFIX
 > ```
-```python
-### {executable="true"}
+```python {executable="true"}
 from spacy.lang.en import English
 nlp = English()
@@ -1002,8 +983,7 @@ You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
 to use `re.compile()` to build a regular expression object, and pass its
 `.search()` and `.finditer()` methods:
-```python
-### {executable="true"}
+```python {executable="true"}
 import re
 import spacy
 from spacy.tokenizer import Tokenizer
@@ -1096,8 +1076,7 @@ letters as an infix. If you do not want the tokenizer to split on hyphens
 between letters, you can modify the existing infix definition from
 [`lang/punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py):
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
 from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
@@ -1181,8 +1160,7 @@ a `Doc` object consisting of the text split on single space characters. We can
 then overwrite the `nlp.tokenizer` attribute with an instance of our custom
 tokenizer.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy.tokens import Doc
@@ -1232,8 +1210,7 @@ produced by the tokenizer.
 > **training transformer models** in spaCy, as well as helpful utilities for
 > aligning word pieces to linguistic tokenization.
-```python
-### Custom BERT word piece tokenizer
+```python {title="Custom BERT word piece tokenizer"}
 from tokenizers import BertWordPieceTokenizer
 from spacy.tokens import Doc
 import spacy
@@ -1295,8 +1272,7 @@ setting `--code functions.py` when you run [`spacy train`](/api/cli#train).
 > @tokenizers = "whitespace_tokenizer"
 > ```
-```python
-### functions.py {highlight="1"}
+```python {title="functions.py",highlight="1"}
 @spacy.registry.tokenizers("whitespace_tokenizer")
 def create_whitespace_tokenizer():
 def create_tokenizer(nlp):
@@ -1321,8 +1297,7 @@ correct type.
 > lowercase = true
 > ```
-```python
-### functions.py {highlight="1"}
+```python {title="functions.py",highlight="1"}
 @spacy.registry.tokenizers("bert_word_piece_tokenizer")
 def create_whitespace_tokenizer(vocab_file: str, lowercase: bool):
 def create_tokenizer(nlp):
@@ -1365,8 +1340,7 @@ boolean values, indicating whether each word is followed by a space.
 > `Doc` with `words` and `spaces` so that the `doc.text` matches the original
 > input text.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy.tokens import Doc
@@ -1412,8 +1386,7 @@ token.
 > 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that all
 > tokens now correspond 1-to-1.
-```python
-### {executable="true"}
+```python {executable="true"}
 from spacy.training import Alignment
 other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
@@ -1465,8 +1438,7 @@ root.
 > recognized as a named entity, this change will also be reflected in the
 > `doc.ents`.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -1535,8 +1507,7 @@ second split subtoken) and "York" should be attached to "in".
 > 3. Split the token into three tokens instead of two for example,
 > `["New", "Yo", "rk"]`.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy import displacy
@@ -1565,8 +1536,7 @@ the token indices after splitting.
 If you don't care about the heads (for example, if you're only running the
 tokenizer and not the parser), you can attach each subtoken to itself:
-```python
-### {highlight="3"}
+```python {highlight="3"}
 doc = nlp("I live in NewYorkCity")
 with doc.retokenize() as retokenizer:
 heads = [(doc[3], 0), (doc[3], 1), (doc[3], 2)]
@@ -1622,8 +1592,7 @@ values can't be overwritten. For more details, see the
 > you need to provide a list of extension attribute values as the `"_"`
 > property, one for each split subtoken.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy.tokens import Token
@@ -1648,8 +1617,7 @@ has sentence boundaries by calling
 [`Doc.has_annotation`](/api/doc#has_annotation) with the attribute name
 `"SENT_START"`.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -1684,8 +1652,7 @@ with spaCy's provided trained pipelines. For social media or conversational text
 that doesn't follow the same rules, your application may benefit from a custom
 trained or rule-based component.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -1719,8 +1686,7 @@ without the parser and then enable the sentence recognizer explicitly with
 > which is better at predicting sentence boundaries when punctuation is not
 > present.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm", exclude=["parser"])
@@ -1737,8 +1703,7 @@ The [`Sentencizer`](/api/sentencizer) component is a
 punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you only
 need sentence boundaries without dependency parses.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 from spacy.lang.en import English
@@ -1777,8 +1742,7 @@ for unset sentence boundaries. This approach can be useful if you want to
 implement **additional** rules specific to your data, while still being able to
 take advantage of dependency-based sentence segmentation.
-```python
-### {executable="true"}
+```python {executable="true"}
 from spacy.language import Language
 import spacy
@@ -1828,8 +1792,7 @@ The following example shows how the tag and POS `NNP`/`PROPN` can be specified
 for the phrase `"The Who"`, overriding the tags provided by the statistical
 tagger and the POS tag map.
-```python
-### {executable="true"}
+```python {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
@@ -1932,8 +1895,7 @@ the removed words, mapped to `(string, score)` tuples, where `string` is the
 entry the removed word was mapped to and `score` the similarity score between
 the two words.
-```python
-### Removed words
+```python {title="Removed words"}
 {
 "Shore": ("coast", 0.732257),
 "Precautionary": ("caution", 0.490973),
@@ -1978,8 +1940,7 @@ be slower than approaches that work with the whole vectors table at once, but
 it's a great approach for once-off conversions before you save out your `nlp`
 object to disk.
-```python
-### Adding vectors
+```python {title="Adding vectors"}
 from spacy.vocab import Vocab
 vector_data = {
@@ -2005,8 +1966,7 @@ own language subclass. The subclass should define two attributes: the `lang`
 overview of the available attributes that can be overwritten, see the
 [`Language.Defaults`](/api/language#defaults) documentation.
-```python
-### {executable="true"}
+```python {executable="true"}
 from spacy.lang.en import English
 class CustomEnglishDefaults(English.Defaults):
@@ -2048,8 +2008,7 @@ language name, and even train pipelines with it and refer to it in your
 > python -m spacy train config.cfg --code code.py
 > ```
-```python
-### Registering a custom language {highlight="7,12-13"}
+```python {title="Registering a custom language",highlight="7,12-13"}
 import spacy
 from spacy.lang.en import English

View File

@@ -113,8 +113,7 @@ The Chinese language class supports three word segmentation options, `char`,
 > nlp.tokenizer.initialize(pkuseg_model="mixed")
 > ```
-```ini
-### config.cfg
+```ini {title="config.cfg"}
 [nlp.tokenizer]
 @tokenizers = "spacy.zh.ChineseTokenizer"
 segmenter = "char"
@@ -155,8 +154,7 @@ local path at runtime. See the usage guide on the
 [config lifecycle](/usage/training#config-lifecycle) for more background on
 this.
-```ini
-### config.cfg
+```ini {title="config.cfg"}
 [initialize]
 [initialize.tokenizer]
@@ -167,8 +165,7 @@ pkuseg_user_dict = "default"
 You can also initialize the tokenizer for a blank language class by calling its
 `initialize` method:
-```python
-### Examples
+```python {title="Examples"}
 # Initialize the pkuseg tokenizer
 cfg = {"segmenter": "pkuseg"}
 nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
@@ -247,8 +244,7 @@ segmentation and part-of-speech tagging. The default Japanese language class and
 the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
 config can be used to configure the split mode to `A`, `B` or `C`.
-```ini
-### config.cfg
+```ini {title="config.cfg"}
 [nlp.tokenizer]
 @tokenizers = "spacy.ja.JapaneseTokenizer"
 split_mode = "A"
@@ -291,8 +287,7 @@ than MeCab. To configure a Korean pipeline with the rule-based tokenizer:
 > nlp = spacy.blank("ko", config=config)
 > ```
-```ini
-### config.cfg
+```ini {title="config.cfg"}
 [nlp]
 lang = "ko"
 tokenizer = {"@tokenizers" = "spacy.Tokenizer.v1"}
@@ -416,8 +411,7 @@ or configure your own download script using the URL of the archive file. The
 archive consists of a package directory that contains another directory with the
 pipeline data.
-```yaml
-### Directory structure {highlight="6"}
+```yaml {title="Directory structure",highlight="6"}
 └── en_core_web_md-3.0.0.tar.gz # downloaded archive
 ├── setup.py # setup file for pip installation
 ├── meta.json # copy of pipeline meta
@@ -493,8 +487,7 @@ If you've installed a trained pipeline via [`spacy download`](/api/cli#download)
 or directly via pip, you can also `import` it and then call its `load()` method
 with no arguments:
-```python
-### {executable="true"}
+```python {executable="true"}
 import en_core_web_sm
 nlp = en_core_web_sm.load()
@@ -535,8 +528,7 @@ installation, you can upload the pipeline packages there. pip's
 supports both package names to download via a PyPi server, as well as
 [direct URLs](#pipeline-urls).
-```text
-### requirements.txt
+```text {title="requirements.txt"}
 spacy>=3.0.0,<4.0.0
 en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
 ```

View File

@ -60,8 +60,7 @@ so we can iterate over them and access the named entity predictions:
> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now > 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
> empty, because the entity recognizer didn't run. > empty, because the entity recognizer didn't run.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
texts = [ texts = [
@ -95,8 +94,7 @@ the input should be a sequence of `(text, context)` tuples and the output will
be a sequence of `(doc, context)` tuples. For example, you can pass metadata in be a sequence of `(doc, context)` tuples. For example, you can pass metadata in
the context and save it in a [custom attribute](#custom-components-attributes): the context and save it in a [custom attribute](#custom-components-attributes):
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.tokens import Doc from spacy.tokens import Doc
@ -246,8 +244,7 @@ tagging pipeline. This is also why the pipeline state is always held by the
together and returns an instance of `Language` with a pipeline set and access to together and returns an instance of `Language` with a pipeline set and access to
the binary data: the binary data:
```python ```python {title="spacy.load under the hood"}
### spacy.load under the hood
lang = "en" lang = "en"
pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"] pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-3.0.0" data_path = "path/to/en_core_web_sm/en_core_web_sm-3.0.0"
@ -266,8 +263,7 @@ subsequently to the `Token` and `Span` which are only views of the `Doc`, and
don't own any data themselves. All components return the modified document, don't own any data themselves. All components return the modified document,
which is then processed by the next component in the pipeline. which is then processed by the next component in the pipeline.
```python ```python {title="The pipeline under the hood"}
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence") # Create a Doc from raw text doc = nlp.make_doc("This is a sentence") # Create a Doc from raw text
for name, proc in nlp.pipeline: # Iterate over components in order for name, proc in nlp.pipeline: # Iterate over components in order
doc = proc(doc) # Apply each component doc = proc(doc) # Apply each component
@ -390,8 +386,7 @@ call its `restore()` method to restore the disabled components when needed. This
can be useful if you want to prevent unnecessary code indentation of large can be useful if you want to prevent unnecessary code indentation of large
blocks. blocks.
```python ```python {title="Disable for block"}
### Disable for block
# 1. Use as a context manager # 1. Use as a context manager
with nlp.select_pipes(disable=["tagger", "parser", "lemmatizer"]): with nlp.select_pipes(disable=["tagger", "parser", "lemmatizer"]):
doc = nlp("I won't be tagged and parsed") doc = nlp("I won't be tagged and parsed")
@ -502,8 +497,7 @@ vectors available otherwise, it won't be able to make the same predictions.
> frozen_components = ["ner"] > frozen_components = ["ner"]
> ``` > ```
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
# The source pipeline with different components # The source pipeline with different components
@ -533,8 +527,7 @@ table instead of only returning the structured data.
> `"entity_linker"`. The analysis should now show no problems, because > `"entity_linker"`. The analysis should now show no problems, because
> requirements are met. > requirements are met.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
nlp = spacy.blank("en") nlp = spacy.blank("en")
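A minimal sketch of the analysis call this excerpt builds towards; the choice of components is illustrative:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# The entity linker declares that it needs doc.ents and sentence boundaries,
# so analyze_pipes will report them as unmet requirements in this pipeline
nlp.add_pipe("entity_linker")
analysis = nlp.analyze_pipes(pretty=True)
```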
@ -546,8 +539,7 @@ analysis = nlp.analyze_pipes(pretty=True)
<Accordion title="Example output"> <Accordion title="Example output">
```json ```json {title="Structured"}
### Structured
{ {
"summary": { "summary": {
"tagger": { "tagger": {
@ -565,7 +557,12 @@ analysis = nlp.analyze_pipes(pretty=True)
}, },
"problems": { "problems": {
"tagger": [], "tagger": [],
"entity_linker": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"] "entity_linker": [
"doc.ents",
"doc.sents",
"token.ent_iob",
"token.ent_type"
]
}, },
"attrs": { "attrs": {
"token.ent_iob": { "assigns": [], "requires": ["entity_linker"] }, "token.ent_iob": { "assigns": [], "requires": ["entity_linker"] },
@ -698,8 +695,7 @@ component under the name `"info_component"`.
> else. spaCy should now complain that it doesn't know a component of the > else. spaCy should now complain that it doesn't know a component of the
> name `"info_component"`. > name `"info_component"`.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.language import Language from spacy.language import Language
@ -732,8 +728,7 @@ boundaries.
> to `None` (missing value), the parser will assign sentence boundaries in > to `None` (missing value), the parser will assign sentence boundaries in
> between. > between.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.language import Language from spacy.language import Language
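A sketch of the kind of boundary-setting component described here; the `"..."` trigger token is just an example:

```python
import spacy
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Mark the token after every "..." as a sentence start
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
# Boundaries must be set before the parser runs
nlp.add_pipe("set_custom_boundaries", before="parser")
doc = nlp("this is a sentence...hello...and another sentence.")
print([sent.text for sent in doc.sents])
```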
@ -786,8 +781,7 @@ All other settings can be passed in by the user via the `config` argument on
[`@Language.factory`](/api/language#factory) decorator also lets you define a [`@Language.factory`](/api/language#factory) decorator also lets you define a
`default_config` that's used as a fallback. `default_config` that's used as a fallback.
```python ```python {title="With config",highlight="4,9"}
### With config {highlight="4,9"}
import spacy import spacy
from spacy.language import Language from spacy.language import Language
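The factory body is not visible here; its general shape is roughly as follows (component name and setting are illustrative):

```python
import spacy
from spacy.language import Language

class MyComponent:
    def __init__(self, some_setting: bool):
        self.some_setting = some_setting

    def __call__(self, doc):
        return doc

@Language.factory("my_component", default_config={"some_setting": True})
def create_my_component(nlp, name, some_setting: bool):
    return MyComponent(some_setting=some_setting)

nlp = spacy.blank("en")
# The default_config is used unless overridden via the config argument
nlp.add_pipe("my_component", config={"some_setting": False})
```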
@ -851,8 +845,7 @@ a token, the `Token.norm_` with an entry from a language-specific lookup table.
It's registered twice under the name `"token_normalizer"` – once using It's registered twice under the name `"token_normalizer"` – once using
`@English.factory` and once using `@German.factory`: `@English.factory` and once using `@German.factory`:
```python ```python {executable="true"}
### {executable="true"}
from spacy.lang.en import English from spacy.lang.en import English
from spacy.lang.de import German from spacy.lang.de import German
@ -921,8 +914,7 @@ case-sensitive.
> should see an entry for the acronyms component, referencing the factory > should see an entry for the acronyms component, referencing the factory
> `acronyms` and the config settings. > `acronyms` and the config settings.
```python ```python {executable="true"}
### {executable="true"}
from spacy.language import Language from spacy.language import Language
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher from spacy.matcher import PhraseMatcher
@ -1024,8 +1016,7 @@ argument, the name:
> batchers. `misc` is intended for miscellaneous functions that don't fit > batchers. `misc` is intended for miscellaneous functions that don't fit
> anywhere else. > anywhere else.
```python ```python {title="Registered function for assets",highlight="1"}
### Registered function for assets {highlight="1"}
@spacy.registry.misc("acronyms.slang_dict.v1") @spacy.registry.misc("acronyms.slang_dict.v1")
def create_acronyms_slang_dict(): def create_acronyms_slang_dict():
dictionary = {"lol": "laughing out loud", "brb": "be right back"} dictionary = {"lol": "laughing out loud", "brb": "be right back"}
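Once registered, such a function can be referenced from the component's config; a sketch of what that block might look like, reusing the `acronyms` name from the surrounding example:

```ini
[components.acronyms]
factory = "acronyms"

[components.acronyms.data]
@misc = "acronyms.slang_dict.v1"
```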
@ -1093,8 +1084,7 @@ on [serialization methods](/usage/saving-loading/#serialization-methods).
> receive the directory path `/path/acronyms` and can then create files in this > receive the directory path `/path/acronyms` and can then create files in this
> directory. > directory.
```python ```python {title="Custom serialization methods",highlight="7-11,13-15"}
### Custom serialization methods {highlight="7-11,13-15"}
import srsly import srsly
from spacy.util import ensure_path from spacy.util import ensure_path
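The method bodies are cut off; the usual pattern, sketched here with `self.data` standing in for the component's own state, writes and reads a JSON file inside the directory spaCy passes in:

```python
import srsly
from spacy.util import ensure_path

class AcronymComponent:
    # __init__ and __call__ omitted in this sketch

    def to_disk(self, path, exclude=tuple()):
        # Create the component's data directory and save its state there
        path = ensure_path(path)
        if not path.exists():
            path.mkdir()
        srsly.write_json(path / "data.json", self.data)

    def from_disk(self, path, exclude=tuple()):
        # Load the state back from the file written by to_disk
        self.data = srsly.read_json(path / "data.json")
        return self
```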
@ -1176,8 +1166,7 @@ be defined via the config in this case a dictionary `data`.
> path = "/path/to/slang_dict.json" > path = "/path/to/slang_dict.json"
> ``` > ```
```python ```python {title="Custom initialize method",highlight="5-6"}
### Custom initialize method {highlight="5-6"}
class AcronymComponent: class AcronymComponent:
def __init__(self): def __init__(self):
self.data = {} self.data = {}
@ -1240,8 +1229,7 @@ string value.
> and write a type hint for `log_level` that only accepts the exact string > and write a type hint for `log_level` that only accepts the exact string
> values `"DEBUG"`, `"INFO"` or `"CRITICAL"`. > values `"DEBUG"`, `"INFO"` or `"CRITICAL"`.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.language import Language from spacy.language import Language
from spacy.tokens import Doc from spacy.tokens import Doc
@ -1330,8 +1318,7 @@ components. It also makes the components more **modular** and lets you
[swap](/usage/layers-architectures#swap-architectures) different architectures [swap](/usage/layers-architectures#swap-architectures) different architectures
in your config, and re-use model definitions. in your config, and re-use model definitions.
```ini ```ini {title="config.cfg (excerpt)"}
### config.cfg (excerpt)
[components] [components]
[components.textcat] [components.textcat]
@ -1465,8 +1452,7 @@ particular instance. If an attribute of the same name already exists, or if
you're trying to access an attribute that hasn't been registered, spaCy will you're trying to access an attribute that hasn't been registered, spaCy will
raise an `AttributeError`. raise an `AttributeError`.
```python ```python {title="Example"}
### Example
from spacy.tokens import Doc, Span, Token from spacy.tokens import Doc, Span, Token
fruits = ["apple", "pear", "banana", "orange", "strawberry"] fruits = ["apple", "pear", "banana", "orange", "strawberry"]
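The getter definitions are not visible in this excerpt; a minimal sketch continuing the `fruits` example:

```python
from spacy.tokens import Doc, Span, Token

fruits = ["apple", "pear", "banana", "orange", "strawberry"]
is_fruit_getter = lambda token: token.text in fruits
has_fruit_getter = lambda obj: any(t.text in fruits for t in obj)

# Register the attributes once; accessing them later uses the ._ namespace
Token.set_extension("is_fruit", getter=is_fruit_getter)
Doc.set_extension("has_fruit", getter=has_fruit_getter)
Span.set_extension("has_fruit", getter=has_fruit_getter)
```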
@ -1501,8 +1487,7 @@ entity annotations for countries and sets custom attributes on the `Doc` and
`Span` – for example, the capital, latitude/longitude coordinates and even the `Span` – for example, the capital, latitude/longitude coordinates and even the
country flag. country flag.
```python ```python {executable="true"}
### {executable="true"}
import requests import requests
from spacy.lang.en import English from spacy.lang.en import English
from spacy.language import Language from spacy.language import Language
@ -1600,8 +1585,7 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
| `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) | | `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) |
| `user_span_hooks` | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root) | | `user_span_hooks` | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root) |
```python ```python {title="Add custom similarity hooks"}
### Add custom similarity hooks
from spacy.language import Language from spacy.language import Language
@ -1759,8 +1743,7 @@ wrapper has to do is compute the entity spans and overwrite the `doc.ents`.
> attributes. By definition, each token can only be part of one entity, so > attributes. By definition, each token can only be part of one entity, so
> overlapping entity spans are not allowed. > overlapping entity spans are not allowed.
```python ```python {highlight="1,8-9"}
### {highlight="1,8-9"}
import your_custom_entity_recognizer import your_custom_entity_recognizer
from spacy.training import biluo_tags_to_spans from spacy.training import biluo_tags_to_spans
from spacy.language import Language from spacy.language import Language
@ -1798,8 +1781,7 @@ label scheme than spaCy's default models.
> it fully replaces the `nlp` object instead of providing a pipeline component, > it fully replaces the `nlp` object instead of providing a pipeline component,
> since it also needs to handle tokenization. > since it also needs to handle tokenization.
```python ```python {highlight="1,11,17-19"}
### {highlight="1,11,17-19"}
import your_custom_model import your_custom_model
from spacy.language import Language from spacy.language import Language
from spacy.symbols import POS, TAG, DEP, HEAD from spacy.symbols import POS, TAG, DEP, HEAD
@ -393,8 +393,7 @@ it will export a directory `model-best`, which you can then re-use in other
commands. commands.
{/* prettier-ignore */} {/* prettier-ignore */}
```yaml ```yaml {title="project.yml"}
### project.yml
commands: commands:
- name: train - name: train
help: 'Train a spaCy pipeline using the specified corpus and config' help: 'Train a spaCy pipeline using the specified corpus and config'
@ -450,8 +449,7 @@ directory:
> directories: ['assets', 'configs', 'corpus', 'metas', 'metrics', 'notebooks', 'packages', 'scripts', 'training'] > directories: ['assets', 'configs', 'corpus', 'metas', 'metrics', 'notebooks', 'packages', 'scripts', 'training']
> ``` > ```
```yaml ```yaml {title="Example project directory"}
### Example project directory
├── project.yml # the project settings ├── project.yml # the project settings
├── project.lock # lockfile that tracks inputs/outputs ├── project.lock # lockfile that tracks inputs/outputs
├── assets/ # downloaded data assets ├── assets/ # downloaded data assets
@ -485,8 +483,7 @@ calls into [`pytest`](https://docs.pytest.org/en/latest/), runs your tests and
uses [`pytest-html`](https://github.com/pytest-dev/pytest-html) to export a test uses [`pytest-html`](https://github.com/pytest-dev/pytest-html) to export a test
report: report:
```yaml ```yaml {title="project.yml"}
### project.yml
commands: commands:
- name: test - name: test
help: 'Test the trained pipeline' help: 'Test the trained pipeline'
@ -522,8 +519,7 @@ that you can define via your `project.yml`:
> types. For instance, `batch_size: int` means that the value provided via the > types. For instance, `batch_size: int` means that the value provided via the
> command line is converted to an integer. > command line is converted to an integer.
```python ```python {title="scripts/custom_evaluation.py"}
### scripts/custom_evaluation.py
import typer import typer
def custom_evaluation(batch_size: int = 128, model_path: str, data_path: str): def custom_evaluation(batch_size: int = 128, model_path: str, data_path: str):
def custom_evaluation(model_path: str, data_path: str, batch_size: int = 128): def custom_evaluation(model_path: str, data_path: str, batch_size: int = 128):
override settings on the command line – for example using `--vars.batch_size`. override settings on the command line – for example using `--vars.batch_size`.
> system). It also normalizes references to `python3`, `pip3` and `pip`. > system). It also normalizes references to `python3`, `pip3` and `pip`.
{/* prettier-ignore */} {/* prettier-ignore */}
```yaml ```yaml {title="project.yml"}
### project.yml
vars: vars:
batch_size: 128 batch_size: 128
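A sketch of how such a variable might be referenced from a command; the script path and arguments are hypothetical:

```yaml
vars:
  batch_size: 128

commands:
  - name: evaluate
    help: 'Run the custom evaluation script with the configured batch size'
    script:
      - 'python scripts/custom_evaluation.py ./training/model-best ./corpus/eval.json ${vars.batch_size}'
```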
@ -575,8 +570,7 @@ settings on the command line and passing through system-level settings.
> BATCH_SIZE=128 python -m spacy project run evaluate > BATCH_SIZE=128 python -m spacy project run evaluate
> ``` > ```
```yaml ```yaml {title="project.yml"}
### project.yml
env: env:
batch_size: BATCH_SIZE batch_size: BATCH_SIZE
gpu_id: GPU_ID gpu_id: GPU_ID
@ -675,8 +669,7 @@ protocols.
> $ python -m spacy project pull local > $ python -m spacy project pull local
> ``` > ```
```yaml ```yaml {title="project.yml"}
### project.yml
remotes: remotes:
default: 's3://my-spacy-bucket' default: 's3://my-spacy-bucket'
local: '/mnt/scratch/cache' local: '/mnt/scratch/cache'
@ -707,8 +700,7 @@ their contents.
For instance, let's say you had the following command in your `project.yml`: For instance, let's say you had the following command in your `project.yml`:
```yaml ```yaml {title="project.yml"}
### project.yml
- name: train - name: train
help: 'Train a spaCy pipeline using the specified corpus and config' help: 'Train a spaCy pipeline using the specified corpus and config'
script: script:
@ -858,8 +850,7 @@ collected with Prodigy and training a spaCy pipeline:
> ``` > ```
{/* prettier-ignore */} {/* prettier-ignore */}
```yaml ```yaml {title="project.yml"}
### project.yml
vars: vars:
prodigy: prodigy:
train_dataset: "fashion_brands_training" train_dataset: "fashion_brands_training"
@ -904,8 +895,7 @@ if accuracy increases in the last segment, this could indicate that collecting
more annotations of the same type might improve the model further. more annotations of the same type might improve the model further.
{/* prettier-ignore */} {/* prettier-ignore */}
```yaml ```yaml {title="project.yml (excerpt)"}
### project.yml (excerpt)
- name: "train_curve" - name: "train_curve"
help: "Train the model with Prodigy by using different portions of training examples to evaluate if more annotations can potentially improve the performance" help: "Train the model with Prodigy by using different portions of training examples to evaluate if more annotations can potentially improve the performance"
script: script:
@ -975,8 +965,7 @@ and explore your own custom trained pipelines.
> ``` > ```
{/* prettier-ignore */} {/* prettier-ignore */}
```yaml ```yaml {title="project.yml"}
### project.yml
commands: commands:
- name: visualize - name: visualize
help: "Visualize the pipeline's output interactively using Streamlit" help: "Visualize the pipeline's output interactively using Streamlit"
@ -1023,8 +1012,7 @@ query your API from Python and JavaScript (Vanilla JS and React).
> ``` > ```
{/* prettier-ignore */} {/* prettier-ignore */}
```yaml ```yaml {title="project.yml"}
### project.yml
- name: "serve" - name: "serve"
help: "Serve the models via a FastAPI REST API using the given host and port" help: "Serve the models via a FastAPI REST API using the given host and port"
script: script:
@ -1135,8 +1123,7 @@ automatically as part of a workflow. Make sure to set `--build wheel` when
running `spacy package` to build a wheel file for your pipeline package. running `spacy package` to build a wheel file for your pipeline package.
{/* prettier-ignore */} {/* prettier-ignore */}
```yaml ```yaml {title="project.yml"}
### project.yml
- name: "push_to_hub" - name: "push_to_hub"
help: "Upload the trained model to the Hugging Face Hub" help: "Upload the trained model to the Hugging Face Hub"
script: script:
@ -102,8 +102,7 @@ First, we initialize the `Matcher` with a vocab. The matcher must always share
the same vocab with the documents it will operate on. We can now call the same vocab with the documents it will operate on. We can now call
[`matcher.add()`](/api/matcher#add) with an ID and a list of patterns. [`matcher.add()`](/api/matcher#add) with an ID and a list of patterns.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.matcher import Matcher from spacy.matcher import Matcher
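The full example is longer; the canonical shape of a first `Matcher` call is roughly:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Match "hello" + punctuation + "world", case-insensitively
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
for match_id, start, end in matcher(doc):
    string_id = nlp.vocab.strings[match_id]  # Look up the string name of the match ID
    span = doc[start:end]                    # The matched span of tokens
    print(string_id, start, end, span.text)
```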
@ -305,8 +304,7 @@ more valid tokens, `Doc.char_span` returns `None`.
> `"USA"` is a single token and `Span` objects are **sequences of tokens**. So > `"USA"` is a single token and `Span` objects are **sequences of tokens**. So
> `"US"` cannot be its own span, because it does not end on a token boundary. > `"US"` cannot be its own span, because it does not end on a token boundary.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
import re import re
@ -353,8 +351,7 @@ the (white)space tokens are split on. That hopefully shouldn't happen, though,
because it'd mean your regex is producing matches with leading or trailing because it'd mean your regex is producing matches with leading or trailing
whitespace. whitespace.
```python ```python {highlight="5-8"}
### {highlight="5-8"}
span = doc.char_span(start, end) span = doc.char_span(start, end)
if span is not None: if span is not None:
print("Found match:", span.text) print("Found match:", span.text)
@ -427,8 +424,7 @@ The `Matcher` can validate patterns against a JSON schema with the option
`validate=True`. This is useful for debugging patterns during development, in `validate=True`. This is useful for debugging patterns during development, in
particular for catching unsupported attributes. particular for catching unsupported attributes.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.matcher import Matcher from spacy.matcher import Matcher
@ -452,8 +448,7 @@ corpus of blog articles, and you want to match all mentions of "Google I/O"
match on the uppercase versions, avoiding matches with phrases such as "Google match on the uppercase versions, avoiding matches with phrases such as "Google
i/o". i/o".
```python ```python {executable="true"}
### {executable="true"}
from spacy.lang.en import English from spacy.lang.en import English
from spacy.matcher import Matcher from spacy.matcher import Matcher
from spacy.tokens import Span from spacy.tokens import Span
@ -525,8 +520,7 @@ label. As of spaCy v3.0, you can also set `as_spans=True` when calling the
matcher on a `Doc`, which will return a list of [`Span`](/api/span) objects matcher on a `Doc`, which will return a list of [`Span`](/api/span) objects
using the `match_id` as the span label. using the `match_id` as the span label.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.matcher import Matcher from spacy.matcher import Matcher
from spacy.tokens import Span from spacy.tokens import Span
@ -560,8 +554,7 @@ process the text. You can achieve this by adding a
that's called on each `Doc` object, merges the leftover HTML spans and sets an that's called on each `Doc` object, merges the leftover HTML spans and sets an
attribute `bad_html` on the token. attribute `bad_html` on the token.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.language import Language from spacy.language import Language
from spacy.matcher import Matcher from spacy.matcher import Matcher
@ -656,8 +649,7 @@ calculate the start and end of the matched span within the sentence. Using
displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
list of dictionaries containing the text and entities to render. list of dictionaries containing the text and entities to render.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy import displacy from spacy import displacy
from spacy.matcher import Matcher from spacy.matcher import Matcher
@ -728,8 +720,7 @@ set of rules like this is often better than training a model. It'll produce more
predictable results, is much easier to modify and extend, and doesn't require predictable results, is much easier to modify and extend, and doesn't require
any training data – only a set of test cases. any training data – only a set of test cases.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.matcher import Matcher from spacy.matcher import Matcher
@ -777,8 +768,7 @@ that you can create a pattern for one or more emoji tokens. Valid hashtags
usually consist of a `#`, plus a sequence of ASCII characters with no usually consist of a `#`, plus a sequence of ASCII characters with no
whitespace, making them easy to match as well. whitespace, making them easy to match as well.
```python ```python {executable="true"}
### {executable="true"}
from spacy.lang.en import English from spacy.lang.en import English
from spacy.matcher import Matcher from spacy.matcher import Matcher
@ -848,8 +838,7 @@ To label the hashtags, we can use a
[custom attribute](/usage/processing-pipelines#custom-components-attributes) set [custom attribute](/usage/processing-pipelines#custom-components-attributes) set
on the respective token: on the respective token:
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.matcher import Matcher from spacy.matcher import Matcher
from spacy.tokens import Token from spacy.tokens import Token
@ -888,8 +877,7 @@ patterns can contain single or multiple tokens.
### Adding phrase patterns {id="adding-phrase-patterns"} ### Adding phrase patterns {id="adding-phrase-patterns"}
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.matcher import PhraseMatcher from spacy.matcher import PhraseMatcher
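A sketch of the phrase-pattern workflow described here; the terminology list and text are illustrative:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up: no pipeline components are needed here
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```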
@ -940,8 +928,7 @@ By default, the `PhraseMatcher` will match on the verbatim token text, e.g.
pattern to the matched `Doc`. For example, using the attribute `LOWER` lets you pattern to the matched `Doc`. For example, using the attribute `LOWER` lets you
match on `Token.lower` and create case-insensitive match patterns: match on `Token.lower` and create case-insensitive match patterns:
```python ```python {executable="true"}
### {executable="true"}
from spacy.lang.en import English from spacy.lang.en import English
from spacy.matcher import PhraseMatcher from spacy.matcher import PhraseMatcher
@ -975,8 +962,7 @@ will be tokenized and you'll be able to find tokens and combinations of tokens
based on a few examples. Here, we're matching on the shapes `ddd.d.d.d` and based on a few examples. Here, we're matching on the shapes `ddd.d.d.d` and
`ddd.ddd.d.d`: `ddd.ddd.d.d`:
```python ```python {executable="true"}
### {executable="true"}
from spacy.lang.en import English from spacy.lang.en import English
from spacy.matcher import PhraseMatcher from spacy.matcher import PhraseMatcher
@ -1132,8 +1118,7 @@ head to an immediate dependent as `head > child`.
The simplest dependency matcher pattern will identify and name a single token in The simplest dependency matcher pattern will identify and name a single token in
the tree: the tree:
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.matcher import DependencyMatcher from spacy.matcher import DependencyMatcher
@ -1154,8 +1139,7 @@ print(matches) # [(4851363122962674176, [1])]
Now that we have a named anchor token (`anchor_founded`), we can add the founder Now that we have a named anchor token (`anchor_founded`), we can add the founder
as the immediate dependent (`>`) of `founded` with the dependency label `nsubj`: as the immediate dependent (`>`) of `founded` with the dependency label `nsubj`:
```python ```python {title="Step 1",highlight="8,10"}
### Step 1 {highlight="8,10"}
pattern = [ pattern = [
{ {
"RIGHT_ID": "anchor_founded", "RIGHT_ID": "anchor_founded",
@ -1173,8 +1157,7 @@ pattern = [
The direct object (`dobj`) is added in the same way: The direct object (`dobj`) is added in the same way:
```python ```python {title="Step 2"}
### Step 2 {highlight=""}
pattern = [ pattern = [
#... #...
{ {
@ -1194,8 +1177,7 @@ tokens into the pattern**. For the final part of our pattern, we'll specify that
the token `founded_object` should have a modifier with the dependency relation the token `founded_object` should have a modifier with the dependency relation
`amod` or `compound`: `amod` or `compound`:
```python ```python {title="Step 3",highlight="7"}
### Step 3 {highlight="7"}
pattern = [ pattern = [
# ... # ...
{ {
@ -1218,8 +1200,7 @@ right:
The full pattern comes together as shown in the example below: The full pattern comes together as shown in the example below:
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.matcher import DependencyMatcher from spacy.matcher import DependencyMatcher
@ -1310,8 +1291,7 @@ matches were to overlap, the pattern matching most tokens takes priority. If
they also happen to be equally long, then the match occurring first in the `Doc` they also happen to be equally long, then the match occurring first in the `Doc`
is chosen. is chosen.
```python ```python {executable="true"}
### {executable="true"}
from spacy.lang.en import English from spacy.lang.en import English
nlp = English() nlp = English()
@ -1333,8 +1313,7 @@ entity ruler will only add spans to the `doc.ents` if they don't overlap with
existing entities predicted by the model. To overwrite overlapping entities, you existing entities predicted by the model. To overwrite overlapping entities, you
can set `overwrite_ents=True` on initialization. can set `overwrite_ents=True` on initialization.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
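A sketch of combining the ruler with the statistical model; the `MyCorp Inc.` pattern is illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Added after "ner", so it only fills in entities the model didn't predict
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "MyCorp Inc."}])

doc = nlp("MyCorp Inc. is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```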
@ -1362,8 +1341,7 @@ The [`EntityRuler`](/api/entityruler) can also accept an `id` attribute for each
pattern. Using the `id` attribute allows multiple patterns to be associated with pattern. Using the `id` attribute allows multiple patterns to be associated with
the same entity. the same entity.
```python ```python {executable="true"}
### {executable="true"}
from spacy.lang.en import English from spacy.lang.en import English
nlp = English() nlp = English()
@ -1392,8 +1370,7 @@ The [`to_disk`](/api/entityruler#to_disk) and
from JSONL (newline-delimited JSON) files, containing one pattern object per from JSONL (newline-delimited JSON) files, containing one pattern object per
line. line.
```json ```json {title="patterns.jsonl"}
### patterns.jsonl
{"label": "ORG", "pattern": "Apple"} {"label": "ORG", "pattern": "Apple"}
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]} {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
``` ```
@ -1491,8 +1468,7 @@ Unlike in `doc.ents`, overlapping matches are allowed in `doc.spans`, so no
filtering is required, but optional filtering and sorting can be applied to the filtering is required, but optional filtering and sorting can be applied to the
spans before they're saved. spans before they're saved.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
nlp = spacy.blank("en") nlp = spacy.blank("en")
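The rest of the example is cut off; a sketch of the `span_ruler` usage it describes:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("span_ruler")
patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening its first big office in San Francisco.")
# By default, the span ruler writes to the "ruler" key of doc.spans
print([(span.text, span.label_) for span in doc.spans["ruler"]])
```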
@ -1514,8 +1490,7 @@ always filtered, using [`util.filter_spans`](/api/top-level#util.filter_spans)
by default. See the [`SpanRuler` API docs](/api/spanruler) for more information by default. See the [`SpanRuler` API docs](/api/spanruler) for more information
about how to customize the sorting and filtering of matched spans. about how to customize the sorting and filtering of matched spans.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
@ -1535,8 +1510,7 @@ You can save patterns in a JSONL file (newline-delimited JSON) to load with
[`SpanRuler.initialize`](/api/spanruler#initialize) or [`SpanRuler.initialize`](/api/spanruler#initialize) or
[`SpanRuler.add_patterns`](/api/spanruler#add_patterns). [`SpanRuler.add_patterns`](/api/spanruler#add_patterns).
```json ```json {title="patterns.jsonl"}
### patterns.jsonl
{"label": "ORG", "pattern": "Apple"} {"label": "ORG", "pattern": "Apple"}
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]} {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
``` ```
@ -1597,8 +1571,7 @@ or "Dr.". This makes sense, because it makes it easier to resolve the entity
type back to a knowledge base. But what if your application needs the full type back to a knowledge base. But what if your application needs the full
names, _including_ the titles? names, _including_ the titles?
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
@ -1617,8 +1590,7 @@ expands the entity span by one token. After all, what all titles in this example
have in common is that _if_ they occur, they occur in the **previous token** have in common is that _if_ they occur, they occur in the **previous token**
right before the person entity. right before the person entity.
```python ```python {highlight="9-13"}
### {highlight="9-13"}
from spacy.language import Language from spacy.language import Language
from spacy.tokens import Span from spacy.tokens import Span
@ -1646,8 +1618,7 @@ register it as a [pipeline component](/usage/processing-pipelines) so it can run
automatically when processing a text. We can use automatically when processing a text. We can use
[`nlp.add_pipe`](/api/language#add_pipe) to add it to the current pipeline. [`nlp.add_pipe`](/api/language#add_pipe) to add it to the current pipeline.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.language import Language from spacy.language import Language
from spacy.tokens import Span from spacy.tokens import Span
@ -1696,8 +1667,7 @@ We can now use the [`Span.set_extension`](/api/span#set_extension) method to add
the custom extension attribute `"person_title"`, using `get_person_title` as the the custom extension attribute `"person_title"`, using `get_person_title` as the
getter function. getter function.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.tokens import Span from spacy.tokens import Span
@ -1739,8 +1709,7 @@ tense**, whether company names are attached to it and whether the person is the
subject. All of this information is available in the part-of-speech tags and the subject. All of this information is available in the part-of-speech tags and the
dependency parse. dependency parse.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
@ -1804,8 +1773,7 @@ the entity `Span` for example `._.orgs` or `._.prev_orgs` and
> retokenizer.merge(ent) > retokenizer.merge(ent)
> ``` > ```
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.language import Language from spacy.language import Language
from spacy import displacy from spacy import displacy
@ -1842,8 +1810,7 @@ information is in the attached auxiliary "was":
To solve this, we can adjust the rules to also check for the above construction: To solve this, we can adjust the rules to also check for the above construction:
```python ```python {highlight="10-12"}
### {highlight="10-12"}
@Language.component("extract_person_orgs") @Language.component("extract_person_orgs")
def extract_person_orgs(doc): def extract_person_orgs(doc):
person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"] person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
@ -28,14 +28,12 @@ contains the pipeline configuration and all the relevant settings.
> dictionary containing the training configuration, pipeline component factories > dictionary containing the training configuration, pipeline component factories
> and other settings. It is saved out with a pipeline as the `config.cfg`. > and other settings. It is saved out with a pipeline as the `config.cfg`.
```python ```python {title="Serialize"}
### Serialize
config = nlp.config config = nlp.config
bytes_data = nlp.to_bytes() bytes_data = nlp.to_bytes()
``` ```
```python ```python {title="Deserialize"}
### Deserialize
lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"]) lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
nlp = lang_cls.from_config(config) nlp = lang_cls.from_config(config)
nlp.from_bytes(bytes_data) nlp.from_bytes(bytes_data)
@ -62,8 +60,7 @@ collection of `Doc` objects together, and is much more efficient than calling
also control what data gets saved, and you can merge pallets together for easy also control what data gets saved, and you can merge pallets together for easy
map/reduce-style processing. map/reduce-style processing.
```python ```python {highlight="4,8,9,13,14"}
### {highlight="4,8,9,13,14"}
import spacy import spacy
from spacy.tokens import DocBin from spacy.tokens import DocBin
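The example body falls outside this excerpt; the typical round trip looks roughly like this, with illustrative attributes and texts:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
for doc in nlp.pipe(["Some text", "Lots of texts...", "..."]):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Deserialize later, e.g. in a new process
nlp = spacy.blank("en")
doc_bin = DocBin().from_bytes(bytes_data)
docs = list(doc_bin.get_docs(nlp.vocab))
```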
@ -122,8 +119,7 @@ the entire pipeline once. And instead of pickling several `Doc` objects
separately, pickle a list of `Doc` objects. Since they all share a reference to separately, pickle a list of `Doc` objects. Since they all share a reference to
the _same_ `Vocab` object, it will only be included once. the _same_ `Vocab` object, it will only be included once.
```python ```python {title="Pickling objects with shared data",highlight="8-9"}
### Pickling objects with shared data {highlight="8-9"}
doc1 = nlp("Hello world") doc1 = nlp("Hello world")
doc2 = nlp("This is a test") doc2 = nlp("This is a test")
@ -199,8 +195,7 @@ the data to and from a JSON file.
> saving out a pipeline with a rule-based entity recognizer and including all > saving out a pipeline with a rule-based entity recognizer and including all
> rules _with_ the component data. > rules _with_ the component data.
```python ```python {highlight="16-23,25-30"}
### {highlight="16-23,25-30"}
import json import json
from spacy import Language from spacy import Language
from spacy.util import ensure_path from spacy.util import ensure_path
@ -240,8 +235,7 @@ After adding the component to the pipeline and adding some data to it, we can
serialize the `nlp` object to a directory, which will call the custom serialize the `nlp` object to a directory, which will call the custom
component's `to_disk` method. component's `to_disk` method.
```python ```python {highlight="2-4"}
### {highlight="2-4"}
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
my_component = nlp.add_pipe("my_component") my_component = nlp.add_pipe("my_component")
my_component.add({"hello": "world"}) my_component.add({"hello": "world"})
@ -252,8 +246,7 @@ The contents of the directory would then look like this.
`CustomComponent.to_disk` converted the data to a JSON string and saved it to a `CustomComponent.to_disk` converted the data to a JSON string and saved it to a
file `data.json` in its subdirectory: file `data.json` in its subdirectory:
```yaml ```yaml {title="Directory structure",highlight="2-3"}
### Directory structure {highlight="2-3"}
└── /path/to/pipeline └── /path/to/pipeline
├── my_component # data serialized by "my_component" ├── my_component # data serialized by "my_component"
│ └── data.json │ └── data.json
@ -341,8 +334,7 @@ snake when it's called:
> └── setup.py # setup file for pip installation > └── setup.py # setup file for pip installation
> ``` > ```
```python ```python {title="snek.py"}
### snek.py
from spacy.language import Language from spacy.language import Language
snek = """ snek = """
@ -375,8 +367,7 @@ entry to the factories, you can now expose it in your `setup.py` via the
> the created entry point is named `snek` and points to the function > the created entry point is named `snek` and points to the function
> `snek_component` in the module `snek`, i.e. `snek.py`. > `snek_component` in the module `snek`, i.e. `snek.py`.
```python ```python {title="setup.py",highlight="5-7"}
### setup.py {highlight="5-7"}
from setuptools import setup from setuptools import setup
setup( setup(
@ -457,8 +448,7 @@ class SnekFactory:
return doc return doc
``` ```
```diff ```diff {title="setup.py"}
### setup.py
entry_points={ entry_points={
- "spacy_factories": ["snek = snek:snek_component"] - "spacy_factories": ["snek = snek:snek_component"]
+ "spacy_factories": ["snek = snek:SnekFactory"] + "spacy_factories": ["snek = snek:SnekFactory"]
@ -503,8 +493,7 @@ custom pipeline  but you don't necessarily want to modify spaCy's code to ad
language. In your package, you could then implement the following language. In your package, you could then implement the following
[custom language subclass](/usage/linguistic-features#language-subclass): [custom language subclass](/usage/linguistic-features#language-subclass):
```python ```python {title="snek.py"}
### snek.py
from spacy.language import Language from spacy.language import Language
class SnekDefaults(Language.Defaults): class SnekDefaults(Language.Defaults):
@ -519,8 +508,7 @@ Alongside the `spacy_factories`, there's also an entry point option for
`spacy_languages`, which maps language codes to language-specific `Language` `spacy_languages`, which maps language codes to language-specific `Language`
subclasses: subclasses:
```diff ```diff {title="setup.py"}
### setup.py
from setuptools import setup from setuptools import setup
setup( setup(
@ -553,8 +541,7 @@ values.
> [scispaCy](/universe/project/scispacy) and > [scispaCy](/universe/project/scispacy) and
> [Blackstone](/universe/project/blackstone). > [Blackstone](/universe/project/blackstone).
```python ```python {title="snek.py"}
### snek.py
displacy_colors = {"SNEK": "#3dff74", "HUMAN": "#cfc5ff"} displacy_colors = {"SNEK": "#3dff74", "HUMAN": "#cfc5ff"}
``` ```
@ -562,8 +549,7 @@ Given the above colors, the entry point can be defined as follows. Entry points
need to have a name, so we use the key `colors`. However, the name doesn't need to have a name, so we use the key `colors`. However, the name doesn't
matter and whatever is defined in the entry point group will be used. matter and whatever is defined in the entry point group will be used.
```diff ```diff {title="setup.py"}
### setup.py
from setuptools import setup from setuptools import setup
setup( setup(
@ -671,8 +657,7 @@ This command will create a pipeline package directory and will run
`.tar.gz` archive of your package that can be installed using `pip install`. `.tar.gz` archive of your package that can be installed using `pip install`.
Installing the binary wheel is usually more efficient. Installing the binary wheel is usually more efficient.
```yaml ```yaml {title="Directory structure"}
### Directory structure
└── / └── /
├── MANIFEST.in # to include meta.json ├── MANIFEST.in # to include meta.json
├── meta.json # pipeline meta data ├── meta.json # pipeline meta data
@ -180,8 +180,7 @@ can load it via [`spacy.load`](/api/top-level#spacy.load). This will return a
usually call it `nlp`. Calling the `nlp` object on a string of text will return usually call it `nlp`. Calling the `nlp` object on a string of text will return
a processed `Doc`: a processed `Doc`:
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
@ -293,8 +292,7 @@ way too much space. So instead, spaCy hashes the string and stores it in the
**lookup table that works in both directions** – you can look up a string to get **lookup table that works in both directions** – you can look up a string to get
its hash, or a hash to get its string: its hash, or a hash to get its string:
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
@ -311,8 +309,7 @@ a word. For example, no matter if "love" is used as a verb or a noun in some
context, its spelling and whether it consists of alphabetic characters won't context, its spelling and whether it consists of alphabetic characters won't
ever change. Its hash value will also always be the same. ever change. Its hash value will also always be the same.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
@ -350,8 +347,7 @@ vocabulary. That's why you always need to make sure all objects you create have
access to the same vocabulary. If they don't, spaCy might not be able to find access to the same vocabulary. If they don't, spaCy might not be able to find
the strings it needs. the strings it needs.
```python ```python {executable="true"}
### {executable="true"}
import spacy import spacy
from spacy.tokens import Doc from spacy.tokens import Doc
from spacy.vocab import Vocab from spacy.vocab import Vocab
@ -310,8 +310,7 @@ define a setting once and can reference it across your config using the
the `[training]` block, and the whole block of `[training.optimizer]` is reused the `[training]` block, and the whole block of `[training.optimizer]` is reused
in `[pretraining]` and will become `pretraining.optimizer`. in `[pretraining]` and will become `pretraining.optimizer`.
```ini ```ini {title="config.cfg (excerpt)",highlight="5,18"}
### config.cfg (excerpt) {highlight="5,18"}
[system] [system]
seed = 0 seed = 0
@ -372,8 +371,7 @@ are handled automatically.
Here's an example of creating a `.spacy` file from some NER annotations. Here's an example of creating a `.spacy` file from some NER annotations.
```python ```python {title="preprocess.py"}
### preprocess.py
import spacy import spacy
from spacy.tokens import DocBin from spacy.tokens import DocBin
@ -438,8 +436,7 @@ existing weights. This lets you include an already trained component in your
pipeline, or update a trained component with more data specific to your use pipeline, or update a trained component with more data specific to your use
case. case.
```ini ```ini {title="config.cfg (excerpt)"}
### config.cfg (excerpt)
[components] [components]
# "parser" and "ner" are sourced from a trained pipeline # "parser" and "ner" are sourced from a trained pipeline
@ -532,8 +529,7 @@ list of components. For example, the feature `DEP` from the parser could be used
as a tagger feature by including `DEP` in the tok2vec `attrs` and including as a tagger feature by including `DEP` in the tok2vec `attrs` and including
`parser` in `annotating_components`: `parser` in `annotating_components`:
```ini ```ini {title="config.cfg (excerpt)",highlight="7,12"}
### config.cfg (excerpt) {highlight="7,12"}
[nlp] [nlp]
pipeline = ["parser", "tagger"] pipeline = ["parser", "tagger"]
@ -555,8 +551,7 @@ pipeline is run. The config excerpt below shows how a frozen `ner` component and
a `sentencizer` can provide the required `doc.sents` and `doc.ents` for the a `sentencizer` can provide the required `doc.sents` and `doc.ents` for the
entity linker during training: entity linker during training:
```ini ```ini {title="config.cfg (excerpt)"}
### config.cfg (excerpt)
[nlp] [nlp]
pipeline = ["sentencizer", "ner", "entity_linker"] pipeline = ["sentencizer", "ner", "entity_linker"]
@ -588,8 +583,7 @@ the `batch_size` can be a number that doesn't change, or a schedule, like a
sequence of compounding values, which has been shown to be an effective trick (see sequence of compounding values, which has been shown to be an effective trick (see
[Smith et al., 2017](https://arxiv.org/abs/1711.00489)). [Smith et al., 2017](https://arxiv.org/abs/1711.00489)).
```ini ```ini {title="With static value"}
### With static value
[training] [training]
batch_size = 128 batch_size = 128
``` ```
@ -612,8 +606,7 @@ from your configs.
> instance, a learning rate schedule can be provided as an argument of an > instance, a learning rate schedule can be provided as an argument of an
> optimizer. > optimizer.
```ini ```ini {title="With registered function"}
### With registered function
[training.batch_size] [training.batch_size]
@schedules = "compounding.v1" @schedules = "compounding.v1"
start = 100 start = 100
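The remaining keys of that block are not visible here; the documented `compounding.v1` schedule also takes a stop value and a compounding factor, roughly:

```ini
[training.batch_size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
```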
@ -799,8 +792,7 @@ stop word to the defaults:
> @callbacks = "customize_language_data" > @callbacks = "customize_language_data"
> ``` > ```
```python ```python {title="functions.py",highlight="3,6"}
### functions.py {highlight="3,6"}
import spacy import spacy
@spacy.registry.callbacks("customize_language_data") @spacy.registry.callbacks("customize_language_data")
@ -836,8 +828,7 @@ we're adding the arguments `extra_stop_words` (a list of strings) and `debug`
> debug = true > debug = true
> ``` > ```
```python ```python {title="functions.py",highlight="5,7-9"}
### functions.py {highlight="5,7-9"}
from typing import List from typing import List
import spacy import spacy
@ -892,8 +883,7 @@ settings and specify this callback in your config:
> @callbacks = "customize_tokenizer" > @callbacks = "customize_tokenizer"
> ``` > ```
```python ```python {title="functions.py"}
### functions.py
from spacy.util import registry, compile_suffix_regex from spacy.util import registry, compile_suffix_regex
@registry.callbacks("customize_tokenizer") @registry.callbacks("customize_tokenizer")
@ -966,8 +956,7 @@ tabular results to a file:
> log_path = "my_file.tab" > log_path = "my_file.tab"
> ``` > ```
```python ```python {title="functions.py"}
### functions.py
import sys import sys
from typing import IO, Tuple, Callable, Dict, Any, Optional from typing import IO, Tuple, Callable, Dict, Any, Optional
import spacy import spacy
@ -1019,8 +1008,7 @@ You can also implement your own batch size schedule to use during training. The
> know that a config referencing `v1` means a different function than a config > know that a config referencing `v1` means a different function than a config
> referencing `v2`. > referencing `v2`.
```python ```python {title="functions.py"}
### functions.py
import spacy import spacy
@spacy.registry.schedules("my_custom_schedule.v1") @spacy.registry.schedules("my_custom_schedule.v1")
@ -1037,8 +1025,7 @@ settings in the block will be passed to the function as keyword arguments. Keep
in mind that the config shouldn't have any hidden defaults and all arguments on in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config. the functions need to be represented in the config.
```ini ```ini {title="config.cfg (excerpt)"}
### config.cfg (excerpt)
[training.batch_size] [training.batch_size]
@schedules = "my_custom_schedule.v1" @schedules = "my_custom_schedule.v1"
start = 2 start = 2
@ -1064,8 +1051,7 @@ for more details.
> output_width = 512 > output_width = 512
> ``` > ```
```python ```python {title="functions.py"}
### functions.py
from typing import List from typing import List
from thinc.types import Floats2d from thinc.types import Floats2d
from thinc.api import Model from thinc.api import Model
@ -1215,8 +1201,7 @@ especially when packing multiple documents together. You can also create `Doc`
objects manually, so you can write your own custom logic to convert and store objects manually, so you can write your own custom logic to convert and store
existing annotations for use in spaCy. existing annotations for use in spaCy.
```python ```python {title="Training data from Doc objects",highlight="6-9"}
### Training data from Doc objects {highlight="6-9"}
import spacy import spacy
from spacy.tokens import Doc, DocBin from spacy.tokens import Doc, DocBin
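A minimal sketch of building a `.spacy` file from manually constructed `Doc` objects; the sentence and entity labels are illustrative:

```python
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")
docbin = DocBin()
words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "."]
spaces = [True, True, True, True, True, True, False, True]
ents = ["B-ORG", "O", "O", "O", "O", "B-GPE", "O", "O"]
doc = Doc(nlp.vocab, words=words, spaces=spaces, ents=ents)
docbin.add(doc)
docbin.to_disk("./train.spacy")
```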
@ -1296,8 +1281,7 @@ as **config settings** in this case, `source`.
> source = "s3://your_bucket/path/data.csv" > source = "s3://your_bucket/path/data.csv"
> ``` > ```
```python ```python {title="functions.py",highlight="7-8"}
### functions.py {highlight="7-8"}
from typing import Callable, Iterator, List from typing import Callable, Iterator, List
import spacy import spacy
from spacy.training import Example from spacy.training import Example
@ -1354,8 +1338,7 @@ training should stop.
> max_steps = 2000 > max_steps = 2000
> ``` > ```
```python ```python {title="functions.py"}
### functions.py
from typing import Callable, Iterable, Iterator from typing import Callable, Iterable, Iterator
from spacy import util from spacy import util
import random import random
@ -1418,8 +1401,7 @@ annotations are the same.
> size = 150 > size = 150
> ``` > ```
```python ```python {title="functions.py"}
### functions.py
from typing import Callable, Iterable, Iterator, List from typing import Callable, Iterable, Iterator, List
import spacy import spacy
from spacy.training import Example from spacy.training import Example
@ -1455,8 +1437,7 @@ your config. The built-in [`orth_variants`](/api/top-level#orth_variants)
augmenter creates a data augmentation callback that uses orth-variant augmenter creates a data augmentation callback that uses orth-variant
replacement. replacement.
```ini ```ini {title="config.cfg (excerpt)",highlight="8,14"}
### config.cfg (excerpt) {highlight="8,14"}
[corpora.train] [corpora.train]
@readers = "spacy.Corpus.v1" @readers = "spacy.Corpus.v1"
path = ${paths.train} path = ${paths.train}
@ -1480,11 +1461,18 @@ typically loaded from a JSON file. There are two types of orth variant rules:
`"single"` for single tokens that should be replaced (e.g. hyphens) and `"single"` for single tokens that should be replaced (e.g. hyphens) and
`"paired"` for pairs of tokens (e.g. quotes). `"paired"` for pairs of tokens (e.g. quotes).
```json ```json {title="orth_variants.json"}
### orth_variants.json
{ {
"single": [{ "tags": ["NFP"], "variants": ["…", "..."] }], "single": [{ "tags": ["NFP"], "variants": ["…", "..."] }],
"paired": [{ "tags": ["``", "''"], "variants": [["'", "'"], ["", ""]] }] "paired": [
{
"tags": ["``", "''"],
"variants": [
["'", "'"],
["“", "”"]
]
}
]
} }
``` ```
@ -1758,8 +1746,7 @@ of being dropped.
> - [`nlp.to_disk`](/api/language#to_disk): Save the updated pipeline to a > - [`nlp.to_disk`](/api/language#to_disk): Save the updated pipeline to a
> directory. > directory.
```python ```python {title="Example training loop"}
### Example training loop
optimizer = nlp.initialize() optimizer = nlp.initialize()
for itn in range(100): for itn in range(100):
random.shuffle(train_data) random.shuffle(train_data)
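The rest of the loop is not visible here; a sketch of the usual continuation, assuming `train_data` holds `(text, entity_offsets)` pairs and adding the minimal setup needed to make the sketch self-contained:

```python
import random
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
# Minimal illustrative setup: one NER label and one training example
nlp.add_pipe("ner").add_label("PERSON")
train_data = [("Who is Shaka Khan?", [(7, 17, "PERSON")])]

optimizer = nlp.initialize()
for itn in range(100):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        example = Example.from_dict(doc, {"entities": entity_offsets})
        nlp.update([example], sgd=optimizer)
nlp.to_disk("./output")
```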
@ -389,8 +389,7 @@ with only tokenizers, you now need to install that data explicitly via
setup is required – the package just needs to be installed in the same setup is required – the package just needs to be installed in the same
environment as spaCy. environment as spaCy.
```python ```python {highlight="3-4"}
### {highlight="3-4"}
nlp = Turkish() nlp = Turkish()
doc = nlp("Bu bir cümledir.") doc = nlp("Bu bir cümledir.")
# 🚨 This now requires the lookups data to be installed explicitly # 🚨 This now requires the lookups data to be installed explicitly
@ -29,8 +29,7 @@ on the predicted docs during training. This makes it easy to use the predictions
of a previous component in the pipeline as features for a subsequent component, of a previous component in the pipeline as features for a subsequent component,
e.g. the dependency labels in the tagger: e.g. the dependency labels in the tagger:
```ini ```ini {title="config.cfg (excerpt)",highlight="7,12"}
### config.cfg (excerpt) {highlight="7,12"}
[nlp] [nlp]
pipeline = ["parser", "tagger"] pipeline = ["parser", "tagger"]
@ -293,8 +292,7 @@ spaCy v3.0, a bug allowed vectors to be loaded implicitly through `source`,
however in v3.1 this setting must be provided explicitly as however in v3.1 this setting must be provided explicitly as
`[initialize.vectors]`: `[initialize.vectors]`:
```ini ```ini {title="config.cfg (excerpt)"}
### config.cfg (excerpt)
[components.ner] [components.ner]
source = "en_core_web_md" source = "en_core_web_md"
@ -31,8 +31,7 @@ $ pip install spacy[apple]
To customize the scoring, you can specify a scoring function for each component To customize the scoring, you can specify a scoring function for each component
in your config from the new [`scorers` registry](/api/top-level#registry): in your config from the new [`scorers` registry](/api/top-level#registry):
```ini ```ini {title="config.cfg (excerpt)",highlight="3"}
### config.cfg (excerpt) {highlight="3"}
[components.tagger] [components.tagger]
factory = "tagger" factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"} scorer = {"@scorers":"spacy.tagger_scorer.v1"}
@ -43,8 +42,7 @@ scorer = {"@scorers":"spacy.tagger_scorer.v1"}
Most pipeline components now include an `overwrite` setting in the config that Most pipeline components now include an `overwrite` setting in the config that
determines whether existing annotation in the `Doc` is preserved or overwritten: determines whether existing annotation in the `Doc` is preserved or overwritten:
```ini ```ini {title="config.cfg (excerpt)",highlight="3"}
### config.cfg (excerpt) {highlight="3"}
[components.tagger] [components.tagger]
factory = "tagger" factory = "tagger"
overwrite = false overwrite = false
@ -647,8 +647,7 @@ Custom pipeline components now have to be registered explicitly using the
that take a `Doc` and return it, all you have to do is add the that take a `Doc` and return it, all you have to do is add the
`@Language.component` decorator to it and assign it a name: `@Language.component` decorator to it and assign it a name:
```diff ```diff {title="Stateless function components"}
### Stateless function components
+ from spacy.language import Language + from spacy.language import Language
+ @Language.component("my_component") + @Language.component("my_component")
@ -662,8 +661,7 @@ the method used to initialize the factory has **two named arguments**: `nlp`
(the current `nlp` object) and `name` (the string name of the component (the current `nlp` object) and `name` (the string name of the component
instance). instance).
```diff ```diff {title="Stateful class components"}
### Stateful class components
+ from spacy.language import Language + from spacy.language import Language
+ @Language.factory("my_component") + @Language.factory("my_component")
@ -679,8 +677,7 @@ class MyComponent:
Instead of decorating your class, you could also add a factory function that Instead of decorating your class, you could also add a factory function that
takes the arguments `nlp` and `name` and returns an instance of your component: takes the arguments `nlp` and `name` and returns an instance of your component:
```diff ```diff {title="Stateful class components with factory function"}
### Stateful class components with factory function
+ from spacy.language import Language + from spacy.language import Language
+ @Language.factory("my_component") + @Language.factory("my_component")
@ -863,8 +860,7 @@ tables = ["lexeme_norm"]
> details see the [config lifecycle](/usage/training/#config-lifecycle) and > details see the [config lifecycle](/usage/training/#config-lifecycle) and
> [initialization](/usage/training/#initialization) docs. > [initialization](/usage/training/#initialization) docs.
```ini ```ini {title="config.cfg (excerpt)"}
### config.cfg (excerpt)
[initialize.components.attribute_ruler] [initialize.components.attribute_ruler]
[initialize.components.attribute_ruler.tag_map] [initialize.components.attribute_ruler.tag_map]
@ -981,8 +977,7 @@ this callback in your config:
> @callbacks = "customize_tokenizer" > @callbacks = "customize_tokenizer"
> ``` > ```
```python ```python {title="functions.py"}
### functions.py
from spacy.util import registry, compile_suffix_regex from spacy.util import registry, compile_suffix_regex
@registry.callbacks("customize_tokenizer") @registry.callbacks("customize_tokenizer")
@ -1028,16 +1023,14 @@ classmethod takes a reference `Doc` and a
[dictionary of annotations](/api/data-formats#dict-input), similar to the [dictionary of annotations](/api/data-formats#dict-input), similar to the
"simple training style" in spaCy v2.x: "simple training style" in spaCy v2.x:
```diff ```diff {title="Migrating Doc and GoldParse"}
### Migrating Doc and GoldParse
doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook") doc = nlp.make_doc("Mark Zuckerberg is the CEO of Facebook")
entities = [(0, 15, "PERSON"), (30, 38, "ORG")] entities = [(0, 15, "PERSON"), (30, 38, "ORG")]
- gold = GoldParse(doc, entities=entities) - gold = GoldParse(doc, entities=entities)
+ example = Example.from_dict(doc, {"entities": entities}) + example = Example.from_dict(doc, {"entities": entities})
``` ```
```diff ```diff {title="Migrating simple training style"}
### Migrating simple training style
text = "Mark Zuckerberg is the CEO of Facebook" text = "Mark Zuckerberg is the CEO of Facebook"
annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]} annotations = {"entities": [(0, 15, "PERSON"), (30, 38, "ORG")]}
+ doc = nlp.make_doc(text) + doc = nlp.make_doc(text)
@@ -1050,8 +1043,7 @@ The [`Language.update`](/api/language#update),
 [`Example`](/api/example) objects instead of `Doc` and `GoldParse` objects, or
 raw text and a dictionary of annotations.
-```python
-### Training loop {highlight="5-8,12"}
+```python {title="Training loop",highlight="5-8,12"}
 TRAIN_DATA = [
     ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
     ("I like London.", {"entities": [(7, 13, "LOC")]}),
@@ -1130,8 +1122,7 @@ the following:
 - Update all references to [`nlp.add_pipe`](/api/language#add_pipe) in your docs
   to use **string names** instead of the component functions.
-```python
-### {highlight="1-5"}
+```python {highlight="1-5"}
 from spacy.language import Language
 @Language.factory("my_component", default_config={"some_setting": False})

View File

@@ -34,8 +34,7 @@ package that helps you integrate spaCy visualizations into your apps!
 The dependency visualizer, `dep`, shows part-of-speech tags and syntactic
 dependencies.
-```python
-### Dependency example
+```python {title="Dependency example"}
 import spacy
 from spacy import displacy
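For reference, a minimal dependency-visualizer call, assuming the small English pipeline is installed; the sentence is arbitrary.

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
# Starts a local web server and renders the dependency parse.
displacy.serve(doc, style="dep")
```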
@@ -104,8 +103,7 @@ displacy.serve(sentence_spans, style="dep")
 The entity visualizer, `ent`, highlights named entities and their labels in a
 text.
-```python
-### Named Entity example
+```python {title="Named Entity example"}
 import spacy
 from spacy import displacy
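A small sketch of the entity style with display options; the label filter and the gradient color are illustrative values, not taken from the diff.

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("But Google is starting from behind.")
# Restrict rendering to ORG entities and override their color.
colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"ents": ["ORG"], "colors": colors}
displacy.serve(doc, style="ent", options=options)
```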
@@ -176,8 +174,7 @@ title for a brief description of the text example and the number of iterations.
 The span visualizer, `span`, highlights overlapping spans in a text.
-```python
-### Span example
+```python {title="Span example"}
 import spacy
 from spacy import displacy
 from spacy.tokens import Span
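A sketch of the kind of input the span style expects: overlapping spans stored in a span group on `doc.spans`. The `"sc"` key and the example spans are illustrative.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")
# Overlapping spans in the same group are rendered together.
doc.spans["sc"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
displacy.serve(doc, style="span")
```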
@@ -234,8 +231,7 @@ displaCy is able to detect whether you're working in a
 rendered in a cell straight away. When you export your notebook, the
 visualizations will be included as HTML.
-```python
-### Jupyter example
+```python {title="Jupyter example"}
 # Don't forget to install a trained pipeline, e.g.: python -m spacy download en
 # In[1]:
@@ -279,8 +275,7 @@ example, to export it to a file or serve it in a custom way you can use
 [`displacy.render`](/api/top-level#displacy.render). It works the same way, but
 returns a string containing the markup.
-```python
-### Example
+```python {title="Example"}
 import spacy
 from spacy import displacy
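A minimal sketch of rendering to a string instead of serving; `page=True` wraps the markup in a full HTML page, and the sentence is arbitrary.

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("But Google is starting from behind.")
# render() returns the markup as a string instead of starting a server.
html = displacy.render(doc, style="ent", page=True)
```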
@@ -323,8 +318,7 @@ rendering all `Doc`s at once, loop over them and export them separately.
 ### Example: Export SVG graphics of dependency parses {id="examples-export-svg"}
-```python
-### Example
+```python {title="Example"}
 import spacy
 from spacy import displacy
 from pathlib import Path
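Roughly, exporting a single parse to an SVG file could look like this; the output filename is arbitrary, and `jupyter=False` forces `render()` to return the markup even inside a notebook.

```python
import spacy
from spacy import displacy
from pathlib import Path

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
svg = displacy.render(doc, style="dep", jupyter=False)
# Write the SVG markup to disk.
Path("sentence.svg").write_text(svg, encoding="utf-8")
```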
@@ -372,8 +366,7 @@ helper functions for converting `Doc` objects to displaCy's format for use with
 > html = displacy.render(ex, style="ent", manual=True)
 > ```
-```python
-### DEP input
+```python {title="DEP input"}
 {
     "words": [
         {"text": "This", "tag": "DT"},
@@ -389,8 +382,7 @@ helper functions for converting `Doc` objects to displaCy's format for use with
 }
 ```
-```python
-### ENT input
+```python {title="ENT input"}
 {
     "text": "But Google is starting from behind.",
     "ents": [{"start": 4, "end": 10, "label": "ORG"}],
@@ -398,8 +390,7 @@ helper functions for converting `Doc` objects to displaCy's format for use with
 }
 ```
-```python
-### ENT input with knowledge base links
+```python {title="ENT input with knowledge base links"}
 {
     "text": "But Google is starting from behind.",
     "ents": [{"start": 4, "end": 10, "label": "ORG", "kb_id": "Q95", "kb_url": "https://www.wikidata.org/entity/Q95"}],
@@ -407,8 +398,7 @@ helper functions for converting `Doc` objects to displaCy's format for use with
 }
 ```
-```python
-### SPANS input
+```python {title="SPANS input"}
 {
     "text": "Welcome to the Bank of China.",
     "spans": [