From 9fd41d674296bcfffc064cb7bcae8f0b5dcb6880 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Wed, 17 Mar 2021 14:54:04 +0100 Subject: [PATCH 1/9] Remove Language.pipe cleanup arg --- website/docs/api/language.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/website/docs/api/language.md b/website/docs/api/language.md index a90476dab..ca87cbb16 100644 --- a/website/docs/api/language.md +++ b/website/docs/api/language.md @@ -198,7 +198,6 @@ more efficient than processing texts one-by-one. | `as_tuples` | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. ~~bool~~ | | `batch_size` | The number of texts to buffer. ~~Optional[int]~~ | | `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ | -| `cleanup` | If `True`, unneeded strings are freed to control memory use. Experimental. ~~bool~~ | | `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ | | `n_process` 2.2.2 | Number of processors to use. Defaults to `1`. ~~int~~ | | **YIELDS** | Documents in the order of the original text. ~~Doc~~ | @@ -872,10 +871,10 @@ when loading a config with > replace_listeners = ["model.tok2vec"] > ``` -| Name | Description | -| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `tok2vec_name` | Name of the token-to-vector component, typically `"tok2vec"` or `"transformer"`.~~str~~ | -| `pipe_name` | Name of pipeline component to replace listeners for. ~~str~~ | +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `tok2vec_name` | Name of the token-to-vector component, typically `"tok2vec"` or `"transformer"`.~~str~~ | +| `pipe_name` | Name of pipeline component to replace listeners for. ~~str~~ | | `listeners` | The paths to the listeners, relative to the component config, e.g. `["model.tok2vec"]`. Typically, implementations will only connect to one tok2vec component, `model.tok2vec`, but in theory, custom models can use multiple listeners. 
The value here can either be an empty list to not replace any listeners, or a _complete_ list of the paths to all listener layers used by the model that should be replaced.~~Iterable[str]~~ | ## Language.meta {#meta tag="property"} From 83c1b919a7f35452a23a1016fd862e6034107cfb Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Wed, 17 Mar 2021 14:54:40 +0100 Subject: [PATCH 2/9] Fix positional/option in CLI types --- website/docs/api/cli.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index 16e84e53f..73a03cba8 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -77,7 +77,7 @@ $ python -m spacy info [model] [--markdown] [--silent] [--exclude] | Name | Description | | ------------------------------------------------ | --------------------------------------------------------------------------------------------- | -| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ | +| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(option)~~ | | `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ | | `--silent`, `-s` 2.0.12 | Don't print anything, just return the values. ~~bool (flag)~~ | | `--exclude`, `-e` | Comma-separated keys to exclude from the print-out. Defaults to `"labels"`. ~~Optional[str]~~ | @@ -259,7 +259,7 @@ $ python -m spacy convert [input_file] [output_dir] [--converter] [--file-type] | Name | Description | | ------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- | | `input_file` | Input file. ~~Path (positional)~~ | -| `output_dir` | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(positional)~~ | +| `output_dir` | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(option)~~ | | `--converter`, `-c` 2 | Name of converter to use (see below). ~~str (option)~~ | | `--file-type`, `-t` 2.1 | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. ~~str (option)~~ | | `--n-sents`, `-n` | Number of sentences per document. Supported for: `conll`, `conllu`, `iob`, `ner` ~~int (option)~~ | @@ -642,7 +642,7 @@ $ python -m spacy debug profile [model] [inputs] [--n-texts] | Name | Description | | ----------------- | ---------------------------------------------------------------------------------- | | `model` | A loadable spaCy pipeline (package name or path). ~~str (positional)~~ | -| `inputs` | Optional path to input file, or `-` for standard input. ~~Path (positional)~~ | +| `inputs` | Path to input file, or `-` for standard input. ~~Path (positional)~~ | | `--n-texts`, `-n` | Maximum number of texts to use if available. Defaults to `10000`. ~~int (option)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | **PRINTS** | Profiling information for the pipeline. | @@ -1191,14 +1191,14 @@ $ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose] > $ python -m spacy project dvc all > ``` -| Name | Description | -| ----------------- | ----------------------------------------------------------------------------------------------------------------- | -| `project_dir` | Path to project directory. 
Defaults to current working directory. ~~Path (positional)~~ | -| `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(positional)~~ | -| `--force`, `-F` | Force-updating config file. ~~bool (flag)~~ | -| `--verbose`, `-V` |  Print more output generated by DVC. ~~bool (flag)~~ | -| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | -| **CREATES** | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow. | +| Name | Description | +| ----------------- | ------------------------------------------------------------------------------------------------------------- | +| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | +| `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(option)~~ | +| `--force`, `-F` | Force-updating config file. ~~bool (flag)~~ | +| `--verbose`, `-V` |  Print more output generated by DVC. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **CREATES** | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow. | ## ray {#ray new="3"} @@ -1236,7 +1236,7 @@ $ python -m spacy ray train [config_path] [--code] [--output] [--n-workers] [--a | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | | `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | -| `--output`, `-o` | Directory or remote storage URL for saving trained pipeline. The directory will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ | +| `--output`, `-o` | Directory or remote storage URL for saving trained pipeline. The directory will be created if it doesn't exist. ~~Optional[Path] \(option)~~ | | `--n-workers`, `-n` | The number of workers. Defaults to `1`. ~~int (option)~~ | | `--address`, `-a` | Optional address of the Ray cluster. If not set (default), Ray will run locally. ~~Optional[str] \(option)~~ | | `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ | From 9a254d39956ecee8dd124c6223711732324a35e4 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Wed, 17 Mar 2021 15:05:22 +0100 Subject: [PATCH 3/9] Include all en_core_web_sm components in examples --- website/docs/usage/processing-pipelines.md | 29 +++++++++++----------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index 909a9c7de..25eaf6558 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -54,9 +54,8 @@ texts = ["This is a text", "These are lots of texts", "..."] In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a (potentially very large) iterable of texts as a stream. Because we're only accessing the named entities in `doc.ents` (set by the `ner` component), we'll -disable all other statistical components (the `tagger` and `parser`) during -processing. 
`nlp.pipe` yields `Doc` objects, so we can iterate over them and -access the named entity predictions: +disable all other components during processing. `nlp.pipe` yields `Doc` +objects, so we can iterate over them and access the named entity predictions: > #### ✏️ Things to try > @@ -73,7 +72,7 @@ texts = [ ] nlp = spacy.load("en_core_web_sm") -for doc in nlp.pipe(texts, disable=["tagger", "parser"]): +for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]): # Do something with the doc here print([(ent.text, ent.label_) for ent in doc.ents]) ``` @@ -144,10 +143,12 @@ nlp = spacy.load("en_core_web_sm") ``` ... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the -pipeline `["tok2vec", "tagger", "parser", "ner"]`. spaCy will then initialize -`spacy.lang.en.English`, and create each pipeline component and add it to the -processing pipeline. It'll then load in the model data from the data directory -and return the modified `Language` class for you to use as the `nlp` object. +pipeline +`["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]`. spaCy +will then initialize `spacy.lang.en.English`, and create each pipeline component +and add it to the processing pipeline. It'll then load in the model data from +the data directory and return the modified `Language` class for you to use as +the `nlp` object. @@ -171,7 +172,7 @@ the binary data: ```python ### spacy.load under the hood lang = "en" -pipeline = ["tok2vec", "tagger", "parser", "ner"] +pipeline = ["tok2vec", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"] data_path = "path/to/en_core_web_sm/en_core_web_sm-3.0.0" cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English @@ -186,7 +187,7 @@ component** on the `Doc`, in order. Since the model data is loaded, the components can access it to assign annotations to the `Doc` object, and subsequently to the `Token` and `Span` which are only views of the `Doc`, and don't own any data themselves. All components return the modified document, -which is then processed by the component next in the pipeline. +which is then processed by the next component in the pipeline. ```python ### The pipeline under the hood @@ -201,9 +202,9 @@ list of human-readable component names. ```python print(nlp.pipeline) -# [('tok2vec', ), ('tagger', ), ('parser', ), ('ner', )] +# [('tok2vec', ), ('tagger', ), ('parser', ), ('ner', ), ('attribute_ruler', ), ('lemmatizer', )] print(nlp.pipe_names) -# ['tok2vec', 'tagger', 'parser', 'ner'] +# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'] ``` ### Built-in pipeline components {#built-in} @@ -300,7 +301,7 @@ blocks. ```python ### Disable for block # 1. 
Use as a context manager -with nlp.select_pipes(disable=["tagger", "parser"]): +with nlp.select_pipes(disable=["tagger", "parser", "lemmatizer"]): doc = nlp("I won't be tagged and parsed") doc = nlp("I will be tagged and parsed") @@ -324,7 +325,7 @@ The [`nlp.pipe`](/api/language#pipe) method also supports a `disable` keyword argument if you only want to disable components during processing: ```python -for doc in nlp.pipe(texts, disable=["tagger", "parser"]): +for doc in nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"]): # Do something with the doc here ``` From c9e1a9ac174abe4c8113518955e56af6ea2c5a8d Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Wed, 17 Mar 2021 21:28:04 +0100 Subject: [PATCH 4/9] Add multiprocessing section --- website/docs/usage/processing-pipelines.md | 49 ++++++++++++++++++++++ 1 file changed, 49 insertions(+) diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index 25eaf6558..9e8e87239 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -91,6 +91,55 @@ have to call `list()` on it first: +### Multiprocessing + +spaCy includes built-in support for multiprocessing with +[`nlp.pipe`](/api/language#pipe) using the `n_process` option: + +```python +# Multiprocessing with 4 processes +docs = nlp.pipe(texts, n_process=4) + +# With as many processes as CPUs (use with caution!) +docs = nlp.pipe(texts, n_process=-1) +``` + +Depending on your platform, starting many processes with multiprocessing can +add a lot of overhead. In particular, the default start method `spawn` used in +macOS/OS X (as of Python 3.8) and in Windows can be slow for larger models +because the model data is copied in memory for each new process. See the +[Python docs on +multiprocessing](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) +for further details. + +For shorter tasks and in particular with `spawn`, it can be faster to use a +smaller number of processes with a larger batch size. The optimal `batch_size` +setting will depend on the pipeline components, the length of your documents, +the number of processes and how much memory is available. + +```python +# Default batch size is `nlp.batch_size` (typically 1000) +docs = nlp.pipe(texts, n_process=2, batch_size=2000) +``` + + + +Multiprocessing is not generally recommended on GPU because RAM is too limited. +If you want to try it out, be aware that it is only possible using `spawn` due +to limitations in CUDA. + + + + + +In Linux, transformer models may hang or deadlock with multiprocessing due to an +[issue in PyTorch](https://github.com/pytorch/pytorch/issues/17199). One +suggested workaround is to use `spawn` instead of `fork` and another is to +limit the number of threads before loading any models using +`torch.set_num_threads(1)`. 
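 A minimal sketch of this workaround, assuming a transformer pipeline such as
`en_core_web_trf` is installed and that the texts are processed on CPU, could
look like this:

```python
import multiprocessing

import torch
import spacy

if __name__ == "__main__":
    # Use the "spawn" start method and limit torch to a single thread
    # *before* loading any pipelines (force=True overrides a previously
    # configured start method, e.g. in interactive sessions).
    multiprocessing.set_start_method("spawn", force=True)
    torch.set_num_threads(1)

    # "en_core_web_trf" is only an example pipeline here
    nlp = spacy.load("en_core_web_trf")
    texts = ["This is a text", "These are lots of texts", "..."]
    for doc in nlp.pipe(texts, n_process=2):
        print([(ent.text, ent.label_) for ent in doc.ents])
```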
+ + + ## Pipelines and built-in components {#pipelines} spaCy makes it very easy to create your own pipelines consisting of reusable From acc58719da2f0b7584eedc913fd691a8ab0c750f Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Thu, 18 Mar 2021 12:49:20 +0100 Subject: [PATCH 5/9] Update custom similarity hooks example --- website/docs/usage/processing-pipelines.md | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index 9e8e87239..836bdac67 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -1547,24 +1547,33 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`. | Name | Customizes | | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `user_hooks` | [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents) | +| `user_hooks` | [`Doc.similarity`](/api/doc#similarity), [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents) | | `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) | | `user_span_hooks` | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root) | ```python ### Add custom similarity hooks +from spacy.language import Language + + class SimilarityModel: - def __init__(self, model): - self._model = model + def __init__(self, name: str, index: int): + self.name = name + self.index = index def __call__(self, doc): doc.user_hooks["similarity"] = self.similarity doc.user_span_hooks["similarity"] = self.similarity doc.user_token_hooks["similarity"] = self.similarity + return doc def similarity(self, obj1, obj2): - y = self._model([obj1.vector, obj2.vector]) - return float(y[0]) + return obj1.vector[self.index] + obj2.vector[self.index] + + +@Language.factory("similarity_component", default_config={"index": 0}) +def create_similarity_component(nlp, name, index: int): + return SimilarityModel(name, index) ``` ## Developing plugins and wrappers {#plugins} From 0fb1881f36f68b42b6b096915c153ef189b21ff2 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Thu, 18 Mar 2021 13:29:51 +0100 Subject: [PATCH 6/9] Reformat processing pipelines --- website/docs/usage/processing-pipelines.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index 836bdac67..a669bda7d 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -54,8 +54,8 @@ texts = ["This is a text", "These are lots of texts", "..."] In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a (potentially very large) iterable of texts as a stream. Because we're only accessing the named entities in `doc.ents` (set by the `ner` component), we'll -disable all other components during processing. 
`nlp.pipe` yields `Doc` -objects, so we can iterate over them and access the named entity predictions: +disable all other components during processing. `nlp.pipe` yields `Doc` objects, +so we can iterate over them and access the named entity predictions: > #### ✏️ Things to try > @@ -104,12 +104,11 @@ docs = nlp.pipe(texts, n_process=4) docs = nlp.pipe(texts, n_process=-1) ``` -Depending on your platform, starting many processes with multiprocessing can -add a lot of overhead. In particular, the default start method `spawn` used in +Depending on your platform, starting many processes with multiprocessing can add +a lot of overhead. In particular, the default start method `spawn` used in macOS/OS X (as of Python 3.8) and in Windows can be slow for larger models because the model data is copied in memory for each new process. See the -[Python docs on -multiprocessing](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) +[Python docs on multiprocessing](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) for further details. For shorter tasks and in particular with `spawn`, it can be faster to use a @@ -134,8 +133,8 @@ to limitations in CUDA. In Linux, transformer models may hang or deadlock with multiprocessing due to an [issue in PyTorch](https://github.com/pytorch/pytorch/issues/17199). One -suggested workaround is to use `spawn` instead of `fork` and another is to -limit the number of threads before loading any models using +suggested workaround is to use `spawn` instead of `fork` and another is to limit +the number of threads before loading any models using `torch.set_num_threads(1)`. @@ -1547,7 +1546,7 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`. | Name | Customizes | | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `user_hooks` | [`Doc.similarity`](/api/doc#similarity), [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents) | +| `user_hooks` | [`Doc.similarity`](/api/doc#similarity), [`Doc.vector`](/api/doc#vector), [`Doc.has_vector`](/api/doc#has_vector), [`Doc.vector_norm`](/api/doc#vector_norm), [`Doc.sents`](/api/doc#sents) | | `user_token_hooks` | [`Token.similarity`](/api/token#similarity), [`Token.vector`](/api/token#vector), [`Token.has_vector`](/api/token#has_vector), [`Token.vector_norm`](/api/token#vector_norm), [`Token.conjuncts`](/api/token#conjuncts) | | `user_span_hooks` | [`Span.similarity`](/api/span#similarity), [`Span.vector`](/api/span#vector), [`Span.has_vector`](/api/span#has_vector), [`Span.vector_norm`](/api/span#vector_norm), [`Span.root`](/api/span#root) | From 40e5d3a980886548dd0c692654f00dd26bac519a Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Thu, 18 Mar 2021 16:56:10 +0100 Subject: [PATCH 7/9] Update saving/loading example --- website/docs/usage/saving-loading.md | 18 +++++++----------- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md index f15493fd7..9dad077e7 100644 --- a/website/docs/usage/saving-loading.md +++ b/website/docs/usage/saving-loading.md @@ -19,9 +19,8 @@ import Serialization101 from 'usage/101/\_serialization.md' When serializing the pipeline, keep in mind that this will only save out the 
**binary data for the individual components** to allow spaCy to restore them – not the entire objects. This is a good thing, because it makes serialization -safe. But it also means that you have to take care of storing the language name -and pipeline component names as well, and restoring them separately before you -can load in the data. +safe. But it also means that you have to take care of storing the config, which +contains the pipeline configuration and all the relevant settings. > #### Saving the meta and config > @@ -33,24 +32,21 @@ can load in the data. ```python ### Serialize +config = nlp.config bytes_data = nlp.to_bytes() -lang = nlp.config["nlp"]["lang"] # "en" -pipeline = nlp.config["nlp"]["pipeline"] # ["tagger", "parser", "ner"] ``` ```python ### Deserialize -nlp = spacy.blank(lang) -for pipe_name in pipeline: - nlp.add_pipe(pipe_name) +lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"]) +nlp = lang_cls.from_config(config) nlp.from_bytes(bytes_data) ``` This is also how spaCy does it under the hood when loading a pipeline: it loads the `config.cfg` containing the language and pipeline information, initializes -the language class, creates and adds the pipeline components based on the -defined [factories](/usage/processing-pipeline#custom-components-factories) and -_then_ loads in the binary data. You can read more about this process +the language class, creates and adds the pipeline components based on the config +and _then_ loads in the binary data. You can read more about this process [here](/usage/processing-pipelines#pipelines). ## Serializing Doc objects efficiently {#docs new="2.2"} From 6a9a46776661c32ae9a95d9abddf81f7b905a118 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Fri, 19 Mar 2021 08:12:49 +0100 Subject: [PATCH 8/9] Update website/docs/usage/processing-pipelines.md Co-authored-by: Ines Montani --- website/docs/usage/processing-pipelines.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index a669bda7d..52568658d 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -91,7 +91,7 @@ have to call `list()` on it first: -### Multiprocessing +### Multiprocessing {#multiprocessing} spaCy includes built-in support for multiprocessing with [`nlp.pipe`](/api/language#pipe) using the `n_process` option: From 0d2b723e8d1ae02dcdf06500188f06172b098420 Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Sat, 20 Mar 2021 11:38:55 +0100 Subject: [PATCH 9/9] Update entity setting section --- website/docs/usage/linguistic-features.md | 26 ++++++++++++++++------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index fd76c6e4d..40ea2bf9c 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -599,18 +599,27 @@ ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] print('Before', ents) # The model didn't recognize "fb" as an entity :( -fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity +# Create a span for the new entity +fb_ent = Span(doc, 0, 1, label="ORG") + +# Option 1: Modify the provided entity spans, leaving the rest unmodified +doc.set_ents([fb_ent], default="unmodified") + +# Option 2: Assign a complete list of ents to doc.ents doc.ents = list(doc.ents) + [fb_ent] -ents = [(e.text, e.start_char, e.end_char, e.label_) for 
e in doc.ents] +ents = [(e.text, e.start, e.end, e.label_) for e in doc.ents] print('After', ents) -# [('fb', 0, 2, 'ORG')] 🎉 +# [('fb', 0, 1, 'ORG')] 🎉 ``` -Keep in mind that you need to create a `Span` with the start and end index of -the **token**, not the start and end index of the entity in the document. In -this case, "fb" is token `(0, 1)` – but at the document level, the entity will -have the start and end indices `(0, 2)`. +Keep in mind that `Span` is initialized with the start and end **token** +indices, not the character offsets. To create a span from character offsets, use +[`Doc.char_span`](/api/doc#char_span): + +```python +fb_ent = doc.char_span(0, 2, label="ORG") +``` #### Setting entity annotations from array {#setting-from-array} @@ -645,9 +654,10 @@ write efficient native code. ```python # cython: infer_types=True +from spacy.typedefs cimport attr_t from spacy.tokens.doc cimport Doc -cpdef set_entity(Doc doc, int start, int end, int ent_type): +cpdef set_entity(Doc doc, int start, int end, attr_t ent_type): for i in range(start, end): doc.c[i].ent_type = ent_type doc.c[start].ent_iob = 3