Merge pull request #7483 from adrianeboyd/docs/various-v3-4 [ci skip]

Ines Montani 2021-03-22 12:37:06 +01:00 committed by GitHub
commit 3ee2fcfba0
5 changed files with 118 additions and 55 deletions

View File

@ -77,7 +77,7 @@ $ python -m spacy info [model] [--markdown] [--silent] [--exclude]
| Name | Description |
| ------------------------------------------------ | --------------------------------------------------------------------------------------------- |
| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ |
| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(option)~~ |
| `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ |
| `--silent`, `-s` <Tag variant="new">2.0.12</Tag> | Don't print anything, just return the values. ~~bool (flag)~~ |
| `--exclude`, `-e` | Comma-separated keys to exclude from the print-out. Defaults to `"labels"`. ~~Optional[str]~~ |
@ -259,7 +259,7 @@ $ python -m spacy convert [input_file] [output_dir] [--converter] [--file-type]
| Name | Description |
| ------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `input_file` | Input file. ~~Path (positional)~~ |
| `output_dir` | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(positional)~~ |
| `output_dir` | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(option)~~ |
| `--converter`, `-c` <Tag variant="new">2</Tag> | Name of converter to use (see below). ~~str (option)~~ |
| `--file-type`, `-t` <Tag variant="new">2.1</Tag> | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. ~~str (option)~~ |
| `--n-sents`, `-n` | Number of sentences per document. Supported for: `conll`, `conllu`, `iob`, `ner` ~~int (option)~~ |
@ -642,7 +642,7 @@ $ python -m spacy debug profile [model] [inputs] [--n-texts]
| Name | Description |
| ----------------- | ---------------------------------------------------------------------------------- |
| `model` | A loadable spaCy pipeline (package name or path). ~~str (positional)~~ |
| `inputs` | Optional path to input file, or `-` for standard input. ~~Path (positional)~~ |
| `inputs` | Path to input file, or `-` for standard input. ~~Path (positional)~~ |
| `--n-texts`, `-n` | Maximum number of texts to use if available. Defaults to `10000`. ~~int (option)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **PRINTS** | Profiling information for the pipeline. |
@ -1192,9 +1192,9 @@ $ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose]
> ```
| Name | Description |
| ----------------- | ----------------------------------------------------------------------------------------------------------------- |
| ----------------- | ------------------------------------------------------------------------------------------------------------- |
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(positional)~~ |
| `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(option)~~ |
| `--force`, `-F` | Force-updating config file. ~~bool (flag)~~ |
| `--verbose`, `-V` | Print more output generated by DVC. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
@ -1236,7 +1236,7 @@ $ python -m spacy ray train [config_path] [--code] [--output] [--n-workers] [--a
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ |
| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ |
| `--output`, `-o` | Directory or remote storage URL for saving trained pipeline. The directory will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ |
| `--output`, `-o` | Directory or remote storage URL for saving trained pipeline. The directory will be created if it doesn't exist. ~~Optional[Path] \(option)~~ |
| `--n-workers`, `-n` | The number of workers. Defaults to `1`. ~~int (option)~~ |
| `--address`, `-a` | Optional address of the Ray cluster. If not set (default), Ray will run locally. ~~Optional[str] \(option)~~ |
| `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ |

View File

@ -198,7 +198,6 @@ more efficient than processing texts one-by-one.
| `as_tuples` | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. ~~bool~~ |
| `batch_size` | The number of texts to buffer. ~~Optional[int]~~ |
| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ |
| `cleanup` | If `True`, unneeded strings are freed to control memory use. Experimental. ~~bool~~ |
| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ |
| `n_process` <Tag variant="new">2.2.2</Tag> | Number of processors to use. Defaults to `1`. ~~int~~ |
| **YIELDS** | Documents in the order of the original text. ~~Doc~~ |
@ -873,7 +872,7 @@ when loading a config with
> ```
| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tok2vec_name` | Name of the token-to-vector component, typically `"tok2vec"` or `"transformer"`. ~~str~~ |
| `pipe_name` | Name of pipeline component to replace listeners for. ~~str~~ |
| `listeners` | The paths to the listeners, relative to the component config, e.g. `["model.tok2vec"]`. Typically, implementations will only connect to one tok2vec component, `model.tok2vec`, but in theory, custom models can use multiple listeners. The value here can either be an empty list to not replace any listeners, or a _complete_ list of the paths to all listener layers used by the model that should be replaced. ~~Iterable[str]~~ |

View File

@ -599,18 +599,27 @@ ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# The model didn't recognize "fb" as an entity :(
fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
# Create a span for the new entity
fb_ent = Span(doc, 0, 1, label="ORG")
# Option 1: Modify the provided entity spans, leaving the rest unmodified
doc.set_ents([fb_ent], default="unmodified")
# Option 2: Assign a complete list of ents to doc.ents
doc.ents = list(doc.ents) + [fb_ent]
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
ents = [(e.text, e.start, e.end, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 2, 'ORG')] 🎉
# [('fb', 0, 1, 'ORG')] 🎉
```
Keep in mind that you need to create a `Span` with the start and end index of
the **token**, not the start and end index of the entity in the document. In
this case, "fb" is token `(0, 1)` but at the document level, the entity will
have the start and end indices `(0, 2)`.
Keep in mind that `Span` is initialized with the start and end **token**
indices, not the character offsets. To create a span from character offsets, use
[`Doc.char_span`](/api/doc#char_span):
```python
fb_ent = doc.char_span(0, 2, label="ORG")
```
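The difference between token indices and character offsets can be checked directly. The following is a minimal sketch, assuming spaCy v3 is installed; it uses a blank English pipeline (tokenizer only), so no trained package is required. The example text and variable names are illustrative only.

```python
import spacy

# Blank pipeline: tokenizer only, no trained components needed.
nlp = spacy.blank("en")
doc = nlp("fb is hiring a new vice president of global policy")

# Characters 0-2 cover exactly the first token, "fb".
span = doc.char_span(0, 2, label="ORG")
print(span.text, span.start, span.end, span.start_char, span.end_char)

# Offsets that do not line up with token boundaries give None
# under the default strict alignment.
misaligned = doc.char_span(0, 1, label="ORG")
print(misaligned)
```

Checking for `None` before assigning the span to `doc.ents` avoids errors when annotated offsets do not match the tokenization.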
#### Setting entity annotations from array {#setting-from-array}
@ -645,9 +654,10 @@ write efficient native code.
```python
# cython: infer_types=True
from spacy.typedefs cimport attr_t
from spacy.tokens.doc cimport Doc
cpdef set_entity(Doc doc, int start, int end, int ent_type):
cpdef set_entity(Doc doc, int start, int end, attr_t ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3

View File

@ -54,9 +54,8 @@ texts = ["This is a text", "These are lots of texts", "..."]
In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
(potentially very large) iterable of texts as a stream. Because we're only
accessing the named entities in `doc.ents` (set by the `ner` component), we'll
disable all other statistical components (the `tagger` and `parser`) during
processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
access the named entity predictions:
disable all other components during processing. `nlp.pipe` yields `Doc` objects,
so we can iterate over them and access the named entity predictions:
> #### ✏️ Things to try
>
@ -73,7 +72,7 @@ texts = [
]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
```
@ -92,6 +91,54 @@ have to call `list()` on it first:
</Infobox>
### Multiprocessing {#multiprocessing}
spaCy includes built-in support for multiprocessing with
[`nlp.pipe`](/api/language#pipe) using the `n_process` option:
```python
# Multiprocessing with 4 processes
docs = nlp.pipe(texts, n_process=4)
# With as many processes as CPUs (use with caution!)
docs = nlp.pipe(texts, n_process=-1)
```
Depending on your platform, starting many processes with multiprocessing can add
a lot of overhead. In particular, the default start method `spawn` used in
macOS/OS X (as of Python 3.8) and in Windows can be slow for larger models
because the model data is copied in memory for each new process. See the
[Python docs on multiprocessing](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods)
for further details.
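To see which start methods are available on a given platform, a short sketch using only the Python standard library may help. It is independent of spaCy; the variable names are illustrative.

```python
import multiprocessing

# All start methods supported on this platform; the platform default comes first.
available = multiprocessing.get_all_start_methods()

# A context is bound to one start method without changing the global default.
spawn_ctx = multiprocessing.get_context("spawn")

print("available:", available)
print("context start method:", spawn_ctx.get_start_method())
```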
For shorter tasks and in particular with `spawn`, it can be faster to use a
smaller number of processes with a larger batch size. The optimal `batch_size`
setting will depend on the pipeline components, the length of your documents,
the number of processes and how much memory is available.
```python
# Default batch size is `nlp.batch_size` (typically 1000)
docs = nlp.pipe(texts, n_process=2, batch_size=2000)
```
<Infobox title="Multiprocessing on GPU" variant="warning">
Multiprocessing is not generally recommended on GPU because GPU memory is too limited.
If you want to try it out, be aware that it is only possible using `spawn` due
to limitations in CUDA.
</Infobox>
<Infobox title="Multiprocessing with transformer models" variant="warning">
In Linux, transformer models may hang or deadlock with multiprocessing due to an
[issue in PyTorch](https://github.com/pytorch/pytorch/issues/17199). One
suggested workaround is to use `spawn` instead of `fork` and another is to limit
the number of threads before loading any models using
`torch.set_num_threads(1)`.
</Infobox>
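The `spawn` workaround can be sketched with the standard library alone. This is a minimal illustration, not a tested fix for the PyTorch issue; the helper name `use_spawn` is hypothetical.

```python
import multiprocessing


def use_spawn() -> str:
    """Select the `spawn` start method for the current program."""
    # force=True allows the call even if a start method was already fixed;
    # without it, a second call raises RuntimeError.
    multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    print(use_spawn())
    # With PyTorch installed, the second workaround would be a call to
    # torch.set_num_threads(1) here, before any pipeline is loaded.
```

The start method should be set once, near the program's entry point, before any worker processes are created.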
## Pipelines and built-in components {#pipelines}
spaCy makes it very easy to create your own pipelines consisting of reusable
@ -144,10 +191,12 @@ nlp = spacy.load("en_core_web_sm")
```
... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the
pipeline `["tok2vec", "tagger", "parser", "ner"]`. spaCy will then initialize
`spacy.lang.en.English`, and create each pipeline component and add it to the
processing pipeline. It'll then load in the model data from the data directory
and return the modified `Language` class for you to use as the `nlp` object.
pipeline
`["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]`. spaCy
will then initialize `spacy.lang.en.English`, and create each pipeline component
and add it to the processing pipeline. It'll then load in the model data from
the data directory and return the modified `Language` class for you to use as
the `nlp` object.
<Infobox title="Changed in v3.0" variant="warning">
@ -171,7 +220,7 @@ the binary data:
```python
### spacy.load under the hood
lang = "en"
pipeline = ["tok2vec", "tagger", "parser", "ner"]
pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-3.0.0"
cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English
@ -186,7 +235,7 @@ component** on the `Doc`, in order. Since the model data is loaded, the
components can access it to assign annotations to the `Doc` object, and
subsequently to the `Token` and `Span` which are only views of the `Doc`, and
don't own any data themselves. All components return the modified document,
which is then processed by the component next in the pipeline.
which is then processed by the next component in the pipeline.
```python
### The pipeline under the hood
@ -201,9 +250,9 @@ list of human-readable component names.
```python
print(nlp.pipeline)
# [('tok2vec', <spacy.pipeline.Tok2Vec>), ('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>)]
# [('tok2vec', <spacy.pipeline.Tok2Vec>), ('tagger', <spacy.pipeline.Tagger>), ('parser', <spacy.pipeline.DependencyParser>), ('ner', <spacy.pipeline.EntityRecognizer>), ('attribute_ruler', <spacy.pipeline.AttributeRuler>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer>)]
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'ner']
# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
```
### Built-in pipeline components {#built-in}
@ -300,7 +349,7 @@ blocks.
```python
### Disable for block
# 1. Use as a context manager
with nlp.select_pipes(disable=["tagger", "parser"]):
with nlp.select_pipes(disable=["tagger", "parser", "lemmatizer"]):
    doc = nlp("I won't be tagged and parsed")
doc = nlp("I will be tagged and parsed")
@ -324,7 +373,7 @@ The [`nlp.pipe`](/api/language#pipe) method also supports a `disable` keyword
argument if you only want to disable components during processing:
```python
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
for doc in nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"]):
    # Do something with the doc here
```
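To check which components are active inside and outside a `select_pipes` block, here is a minimal sketch, assuming spaCy v3 is installed. It uses a blank English pipeline with the built-in `sentencizer`, so no trained package is needed; the variable names are illustrative.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Inside the block, the disabled component is not part of the active pipeline.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)

# After the block, the component is restored automatically.
names_after = list(nlp.pipe_names)

print(names_inside, names_after)
```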
@ -1497,24 +1546,33 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`.
| Name | Customizes |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `user_hooks` | [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents) |
| `user_hooks` | [`Doc.similarity`](/api/doc#similarity), [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents) |
| `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) |
| `user_span_hooks` | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root) |
```python
### Add custom similarity hooks
from spacy.language import Language
class SimilarityModel:
    def __init__(self, model):
        self._model = model
    def __init__(self, name: str, index: int):
        self.name = name
        self.index = index
    def __call__(self, doc):
        doc.user_hooks["similarity"] = self.similarity
        doc.user_span_hooks["similarity"] = self.similarity
        doc.user_token_hooks["similarity"] = self.similarity
        return doc
    def similarity(self, obj1, obj2):
        y = self._model([obj1.vector, obj2.vector])
        return float(y[0])
        return obj1.vector[self.index] + obj2.vector[self.index]
@Language.factory("similarity_component", default_config={"index": 0})
def create_similarity_component(nlp, name, index: int):
    return SimilarityModel(name, index)
```
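The hook mechanism can be tried without word vectors. The sketch below assumes spaCy v3 is installed and swaps in a deliberately simple, hypothetical similarity function (token overlap) in place of a vector-based one. The component name `overlap_similarity` and the function `token_overlap` are made up for illustration.

```python
import spacy
from spacy.language import Language


def token_overlap(obj1, obj2) -> float:
    """Hypothetical similarity: Jaccard overlap of lowercased token texts."""
    a = {token.lower_ for token in obj1}
    b = {token.lower_ for token in obj2}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@Language.component("overlap_similarity")
def overlap_similarity(doc):
    # Doc.similarity delegates to this hook when it is set.
    doc.user_hooks["similarity"] = token_overlap
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("overlap_similarity")
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")
score = doc1.similarity(doc2)
print(score)
```

Because the hook is stored on each `Doc`, only documents processed by a pipeline containing the component use the custom function.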
## Developing plugins and wrappers {#plugins}

View File

@ -19,9 +19,8 @@ import Serialization101 from 'usage/101/\_serialization.md'
When serializing the pipeline, keep in mind that this will only save out the
**binary data for the individual components** to allow spaCy to restore them,
not the entire objects. This is a good thing, because it makes serialization
safe. But it also means that you have to take care of storing the language name
and pipeline component names as well, and restoring them separately before you
can load in the data.
safe. But it also means that you have to take care of storing the config, which
contains the pipeline configuration and all the relevant settings.
> #### Saving the meta and config
>
@ -33,24 +32,21 @@ can load in the data.
```python
### Serialize
config = nlp.config
bytes_data = nlp.to_bytes()
lang = nlp.config["nlp"]["lang"] # "en"
pipeline = nlp.config["nlp"]["pipeline"] # ["tagger", "parser", "ner"]
```
```python
### Deserialize
nlp = spacy.blank(lang)
for pipe_name in pipeline:
    nlp.add_pipe(pipe_name)
lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
nlp = lang_cls.from_config(config)
nlp.from_bytes(bytes_data)
```
This is also how spaCy does it under the hood when loading a pipeline: it loads
the `config.cfg` containing the language and pipeline information, initializes
the language class, creates and adds the pipeline components based on the
defined [factories](/usage/processing-pipeline#custom-components-factories) and
_then_ loads in the binary data. You can read more about this process
the language class, creates and adds the pipeline components based on the config
and _then_ loads in the binary data. You can read more about this process
[here](/usage/processing-pipelines#pipelines).
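
The config-based round trip described above can be run end to end on a small pipeline. This is a minimal sketch, assuming spaCy v3 is installed; it uses a blank English pipeline with the built-in `sentencizer`, and the names `restored` and `doc` are illustrative.

```python
import spacy

# Build a small pipeline that needs no trained package.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Serialize: keep the config alongside the binary data.
config = nlp.config
bytes_data = nlp.to_bytes()

# Deserialize: recreate the pipeline from the config, then load the data.
lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
restored = lang_cls.from_config(config)
restored.from_bytes(bytes_data)

doc = restored("Hello there. How are you?")
print(restored.pipe_names, [sent.text for sent in doc.sents])
```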
## Serializing Doc objects efficiently {#docs new="2.2"}