From 1fa6d6ba55e8d4c84db8d74a284fec1d60dc32c5 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Thu, 25 Jul 2019 14:24:56 +0200
Subject: [PATCH 1/4] Improve consistency of docs examples [ci skip]

---
 website/docs/api/doc.md                    | 10 +++++-----
 website/docs/api/top-level.md              |  2 +-
 website/docs/usage/processing-pipelines.md |  4 ++--
 website/docs/usage/saving-loading.md       |  4 ++--
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md
index bf9801564..b1306ef91 100644
--- a/website/docs/api/doc.md
+++ b/website/docs/api/doc.md
@@ -11,6 +11,11 @@ compressed binary strings. The `Doc` object holds an array
 of `TokenC` structs. The Python-level `Token` and [`Span`](/api/span) objects
 are views of this array, i.e. they don't own the data themselves.
 
+## Doc.\_\_init\_\_ {#init tag="method"}
+
+Construct a `Doc` object. The most common way to get a `Doc` object is via the
+`nlp` object.
+
 > #### Example
 >
 > ```python
 > # Construction 1
 > doc = nlp(u"Some text")
 >
 > # Construction 2
 > from spacy.tokens import Doc
 > words = [u"hello", u"world", u"!"]
 > spaces = [True, False, False]
@@ -24,11 +29,6 @@ array, i.e. they don't own the data themselves.
 > doc = Doc(nlp.vocab, words=words, spaces=spaces)
 > ```
 
-## Doc.\_\_init\_\_ {#init tag="method"}
-
-Construct a `Doc` object. The most common way to get a `Doc` object is via the
-`nlp` object.
-
 | Name    | Type    | Description                            |
 | ------- | ------- | -------------------------------------- |
 | `vocab` | `Vocab` | A storage container for lexical types. |

diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index 9d5bdc527..2990a0969 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -29,7 +29,7 @@ class. The data will be loaded in via
 > nlp = spacy.load("/path/to/en")       # unicode path
 > nlp = spacy.load(Path("/path/to/en")) # pathlib Path
 >
-> nlp = spacy.load("en", disable=["parser", "tagger"])
+> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
 > ```
 
 | Name | Type | Description |

diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 0fa243501..6934374ec 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -182,8 +182,8 @@ initializing a Language class via [`from_disk`](/api/language#from_disk).
 ```diff
 - nlp = spacy.load('en', tagger=False, entity=False)
 - doc = nlp(u"I don't want parsed", parse=False)
-+ nlp = spacy.load('en', disable=['ner'])
-+ nlp.remove_pipe('parser')
++ nlp = spacy.load("en", disable=["ner"])
++ nlp.remove_pipe("parser")
 + doc = nlp(u"I don't want parsed")
 ```

diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md
index 3c1e51603..81e90dcc7 100644
--- a/website/docs/usage/saving-loading.md
+++ b/website/docs/usage/saving-loading.md
@@ -623,8 +623,8 @@ solves this with a clear distinction between setting up the
 instance and loading the data.
 
 ```diff
-- nlp = spacy.load('en', path='/path/to/data')
-+ nlp = spacy.blank('en').from_disk('/path/to/data')
+- nlp = spacy.load("en", path="/path/to/data")
++ nlp = spacy.blank("en").from_disk("/path/to/data")
 ```

From 02e444ec7ca5ba32979e42990d0f75084d0ae679 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Thu, 25 Jul 2019 14:25:03 +0200
Subject: [PATCH 2/4] Add section on special tokenizer component [ci skip]

---
 website/docs/usage/101/_pipelines.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/website/docs/usage/101/_pipelines.md b/website/docs/usage/101/_pipelines.md
index 64c2f6c98..68308a381 100644
--- a/website/docs/usage/101/_pipelines.md
+++ b/website/docs/usage/101/_pipelines.md
@@ -52,4 +52,18 @@ entities into account when making predictions.
+
+
+The tokenizer is a "special" component and isn't part of the regular pipeline.
+It also doesn't show up in `nlp.pipe_names`. The reason is that there can only
+really be one tokenizer, and while all other pipeline components take a `Doc`
+and return it, the tokenizer takes a **string of text** and turns it into a
+`Doc`. You can still customize the tokenizer, though. `nlp.tokenizer` is
+writable, so you can either create your own
+[`Tokenizer` class from scratch](/usage/linguistic-features#native-tokenizers),
+or even replace it with an
+[entirely custom function](/usage/linguistic-features#custom-tokenizer).
+
+
 ---

From a5e3d2f3180d45b6e35a0a5aadf42e2f2c6acced Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Thu, 25 Jul 2019 14:25:34 +0200
Subject: [PATCH 3/4] Improve section on disabling pipes [ci skip]

---
 website/docs/usage/processing-pipelines.md | 49 +++++++++++++++++++---
 1 file changed, 43 insertions(+), 6 deletions(-)

diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 6934374ec..13da76560 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -146,19 +146,56 @@ require them in the pipeline settings in your model's `meta.json`.
 ### Disabling and modifying pipeline components {#disabling}
 
 If you don't need a particular component of the pipeline – for example, the
-tagger or the parser, you can disable loading it. This can sometimes make a big
-difference and improve loading speed. Disabled component names can be provided
-to [`spacy.load`](/api/top-level#spacy.load),
+tagger or the parser – you can **disable loading** it. This can sometimes make
+a big difference and improve loading speed. Disabled component names can be
+provided to [`spacy.load`](/api/top-level#spacy.load),
 [`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a
 list:
 
 ```python
-nlp = spacy.load("en", disable=["parser", "tagger"])
+### Disable loading
+nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
 nlp = English().from_disk("/model", disable=["ner"])
 ```
 
+In some cases, you do want to load all pipeline components and their weights,
+because you need them at different points in your application. However, if you
+only need a `Doc` object with named entities, there's no need to run all
+pipeline components on it – that can potentially make processing much slower.
+Instead, you can use the `disable` keyword argument on
+[`nlp.pipe`](/api/language#pipe) to temporarily disable the components **during
+processing**:
+
+```python
+### Disable for processing
+for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
+    # Do something with the doc here, e.g. access the entity predictions
+    print(doc.ents)
+```
+
+If you need to **execute more code** with components disabled – e.g. to reset
+the weights or update only some components during training – you can use the
+[`nlp.disable_pipes`](/api/language#disable_pipes) context manager. At the end
+of the `with` block, the disabled pipeline components will be restored
+automatically. Alternatively, `disable_pipes` returns an object that lets you
+call its `restore()` method to restore the disabled components when needed.
+This can be useful if you want to prevent unnecessary code indentation of
+large blocks.
+
+```python
+### Disable for block
+# 1. Use as a context manager
+with nlp.disable_pipes("tagger", "parser"):
+    doc = nlp(u"I won't be tagged and parsed")
+doc = nlp(u"I will be tagged and parsed")
+
+# 2. Restore manually
+disabled = nlp.disable_pipes("ner")
+doc = nlp(u"I won't have named entities")
+disabled.restore()
+```
+
+Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
+to remove pipeline components from an existing pipeline, the
+[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
+[`replace_pipe`](/api/language#replace_pipe) method to replace them with a
+custom component entirely (more details on this in the section on

From bd39e5e6304410af812034230241dfc55f2a4927 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Thu, 25 Jul 2019 17:38:03 +0200
Subject: [PATCH 4/4] Add "Processing text" section [ci skip]

---
 website/docs/usage/processing-pipelines.md | 77 ++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 13da76560..f3c59da7b 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -2,6 +2,7 @@
 title: Language Processing Pipelines
 next: vectors-similarity
 menu:
+  - ['Processing Text', 'processing']
   - ['How Pipelines Work', 'pipelines']
   - ['Custom Components', 'custom-components']
   - ['Extension Attributes', 'custom-components-attributes']
@@ -12,6 +13,82 @@ import Pipelines101 from 'usage/101/\_pipelines.md'
 
+## Processing text {#processing}
+
+When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
+component** on the `Doc`, in order. It then returns the processed `Doc` that
+you can work with.
+
+```python
+doc = nlp(u"This is a text")
+```
+
+When processing large volumes of text, the statistical models are usually more
+efficient if you let them work on batches of texts. spaCy's
+[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
+processed `Doc` objects. The batching is done internally.
+
+```diff
+texts = [u"This is a text", u"These are lots of texts", u"..."]
+- docs = [nlp(text) for text in texts]
++ docs = list(nlp.pipe(texts))
+```
+
+
+- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
+  buffer them in batches, instead of one-by-one. This is usually much more
+  efficient.
+- Only apply the **pipeline components you need**. Getting predictions from
+  the model that you don't actually need adds up and becomes very inefficient
+  at scale.
+  To prevent this, use the `disable` keyword argument to disable components
+  you don't need – either when loading a model, or during processing with
+  `nlp.pipe`. See the section on [disabling pipeline components](#disabling)
+  for more details and examples.
+
+
+In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
+(potentially very large) iterable of texts as a stream. Because we're only
+accessing the named entities in `doc.ents` (set by the `ner` component), we'll
+disable all other statistical components (the `tagger` and `parser`) during
+processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
+access the named entity predictions:
+
+> #### ✏️ Things to try
+>
+> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are
+>    now empty, because the entity recognizer didn't run.
+
+```python
+### {executable="true"}
+import spacy
+
+texts = [
+    "Net income was $9.4 million compared to the prior year of $2.7 million.",
+    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
+]
+
+nlp = spacy.load("en_core_web_sm")
+for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
+    # Do something with the doc here
+    print([(ent.text, ent.label_) for ent in doc.ents])
+```
+
+
+When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
+[generator](https://realpython.com/introduction-to-python-generators/) that
+yields `Doc` objects – not a list. So if you want to use it like a list, you'll
+have to call `list()` on it first:
+
+```diff
+- docs = nlp.pipe(texts)[0]        # will raise an error
++ docs = list(nlp.pipe(texts))[0]  # works as expected
+```
+
+
 ## How pipelines work {#pipelines}
 
 spaCy makes it very easy to create your own pipelines consisting of reusable
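
Two APIs that these patches document in prose only are worth illustrating. First, PATCH 2/4 notes that `nlp.tokenizer` is writable and can be replaced with a custom callable, but carries no code. Below is a minimal sketch of what that can look like with the spaCy v2 API, assuming a whitespace-only tokenizer; the `WhitespaceTokenizer` name and the example sentence are our own illustration, not part of the patches:

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    """Toy tokenizer: splits on single spaces only, so punctuation
    stays attached to the neighboring word (illustrative, not robust)."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # The tokenizer's contract: take a string, return a Doc
        words = text.split(" ")
        # Every token is followed by a space, except the last one
        spaces = [True] * len(words)
        spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(u"What's happened to me? he thought.")
print([token.text for token in doc])
# ["What's", 'happened', 'to', 'me?', 'he', 'thought.']
```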
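Similarly, PATCH 3/4 mentions `remove_pipe`, `rename_pipe` and `replace_pipe` without showing them in action. A quick sketch against the v2 `Language` API; `my_custom_ner` is a hypothetical placeholder component, not a working entity recognizer:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pipeline: ['tagger', 'parser', 'ner']

# remove_pipe returns a (name, component) tuple, so you can re-add it later
name, parser = nlp.remove_pipe("parser")

# rename_pipe changes the name a component is registered under
nlp.rename_pipe("ner", "entity_recognizer")

# replace_pipe swaps in any callable that takes a Doc and returns it
def my_custom_ner(doc):
    # Hypothetical placeholder – a real replacement would set doc.ents
    return doc

nlp.replace_pipe("entity_recognizer", my_custom_ner)
print(nlp.pipe_names)  # ['tagger', 'entity_recognizer']
```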