Merge branch 'master' into spacy.io

2025-11-07 19:37:38 +03:00 · 2019-07-25 17:38:19 +02:00 · 2019-07-25 17:38:19 +02:00 · d9eeae5c69
commit d9eeae5c69
parent 4361da2bba bd39e5e630
5 changed files with 144 additions and 16 deletions
--- a/website/docs/api/doc.md
+++ b/website/docs/api/doc.md
@@ -11,6 +11,11 @@ compressed binary strings. The `Doc` object holds an array of `TokenC` structs.
 The Python-level `Token` and [`Span`](/api/span) objects are views of this
 array, i.e. they don't own the data themselves.

+## Doc.\_\_init\_\_ {#init tag="method"}
+
+Construct a `Doc` object. The most common way to get a `Doc` object is via the
+`nlp` object.
+
 > #### Example
 >
 > ```python
@@ -24,11 +29,6 @@ array, i.e. they don't own the data themselves.
 > doc = Doc(nlp.vocab, words=words, spaces=spaces)
 > ```

-## Doc.\_\_init\_\_ {#init tag="method"}
-
-Construct a `Doc` object. The most common way to get a `Doc` object is via the
-`nlp` object.
-
 | Name        | Type     | Description                                                                                                                                                         |
 | ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `vocab`     | `Vocab`  | A storage container for lexical types.                                                                                                                              |
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -29,7 +29,7 @@ class. The data will be loaded in via
 > nlp = spacy.load("/path/to/en") # unicode path
 > nlp = spacy.load(Path("/path/to/en")) # pathlib Path
 >
-> nlp = spacy.load("en", disable=["parser", "tagger"])
+> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
 > ```

 | Name        | Type             | Description                                                                       |
--- a/website/docs/usage/101/_pipelines.md
+++ b/website/docs/usage/101/_pipelines.md
@@ -52,4 +52,18 @@ entities into account when making predictions.

 </Accordion>

+<Accordion title="Why is the tokenizer special?" id="pipeline-components-tokenizer">
+
+The tokenizer is a "special" component and isn't part of the regular pipeline.
+It also doesn't show up in `nlp.pipe_names`. The reason is that there can only
+really be one tokenizer, and while all other pipeline components take a `Doc`
+and return it, the tokenizer takes a **string of text** and turns it into a
+`Doc`. You can still customize the tokenizer, though. `nlp.tokenizer` is
+writable, so you can either create your own
+[`Tokenizer` class from scratch](/usage/linguistic-features#native-tokenizers),
+or even replace it with an
+[entirely custom function](/usage/linguistic-features#custom-tokenizer).
+
+</Accordion>
+
 ---
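
Not part of the diff: a minimal sketch of why the tokenizer sits outside the regular pipeline, as the new accordion describes. Everything here (`ToyDoc`, `ToyPipeline`, the component functions) is hypothetical and stands in for spaCy's real classes; it only models the string-to-`Doc` versus `Doc`-to-`Doc` distinction and the fact that the tokenizer attribute is writable.

```python
# Toy model only: these names are hypothetical and are not spaCy's code.

class ToyDoc:
    """Minimal stand-in for a Doc: token strings plus a dict of annotations."""

    def __init__(self, words):
        self.words = words
        self.annotations = {}


def whitespace_tokenizer(text):
    # The tokenizer is the only step that takes a string and creates the doc.
    return ToyDoc(text.split())


def count_tokens(doc):
    # A regular component takes a doc, annotates it and returns it.
    doc.annotations["n_tokens"] = len(doc.words)
    return doc


def uppercase_flags(doc):
    doc.annotations["is_upper"] = [w.isupper() for w in doc.words]
    return doc


class ToyPipeline:
    def __init__(self, tokenizer, components):
        self.tokenizer = tokenizer      # writable, like nlp.tokenizer
        self.components = components    # list of (name, callable) pairs

    @property
    def pipe_names(self):
        # The tokenizer is deliberately not listed among the components.
        return [name for name, _ in self.components]

    def __call__(self, text):
        doc = self.tokenizer(text)      # str -> doc
        for _, component in self.components:
            doc = component(doc)        # doc -> doc
        return doc


nlp = ToyPipeline(
    whitespace_tokenizer,
    [("counter", count_tokens), ("upper", uppercase_flags)],
)
doc = nlp("This is a TEXT")
print(nlp.pipe_names)
print(doc.words, doc.annotations)

# Swapping in an entirely custom tokenizer function leaves the components untouched.
nlp.tokenizer = lambda text: ToyDoc(text.split("-"))
hyphen_doc = nlp("one-two-three")
print(hyphen_doc.words)
```

Because there is exactly one string-to-doc step, replacing it is an attribute assignment rather than an insertion into the ordered component list.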
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -2,6 +2,7 @@
 title: Language Processing Pipelines
 next: vectors-similarity
 menu:
+  - ['Processing Text', 'processing']
  - ['How Pipelines Work', 'pipelines']
  - ['Custom Components', 'custom-components']
  - ['Extension Attributes', 'custom-components-attributes']
@@ -12,6 +13,82 @@ import Pipelines101 from 'usage/101/\_pipelines.md'

 <Pipelines101 />

+## Processing text {#processing}
+
+When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
+component** on the `Doc`, in order. It then returns the processed `Doc` that you
+can work with.
+
+```python
+doc = nlp(u"This is a text")
+```
+
+When processing large volumes of text, the statistical models are usually more
+efficient if you let them work on batches of texts. spaCy's
+[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
+processed `Doc` objects. The batching is done internally.
+
+```diff
+texts = [u"This is a text", u"These are lots of texts", u"..."]
+- docs = [nlp(text) for text in texts]
++ docs = list(nlp.pipe(texts))
+```
+
+<Infobox title="Tips for efficient processing">
+
+- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
+  buffer them in batches, instead of one-by-one. This is usually much more
+  efficient.
+- Only apply the **pipeline components you need**. Getting predictions from the
+  model that you don't actually need adds up and becomes very inefficient at
+  scale. To prevent this, use the `disable` keyword argument to disable
+  components you don't need – either when loading a model, or during processing
+  with `nlp.pipe`. See the section on
+  [disabling pipeline components](#disabling) for more details and examples.
+
+</Infobox>
+
+In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
+(potentially very large) iterable of texts as a stream. Because we're only
+accessing the named entities in `doc.ents` (set by the `ner` component), we'll
+disable all other statistical components (the `tagger` and `parser`) during
+processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
+access the named entity predictions:
+
+> #### ✏️ Things to try
+>
+> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
+>    empty, because the entity recognizer didn't run.
+
+```python
+### {executable="true"}
+import spacy
+
+texts = [
+    "Net income was $9.4 million compared to the prior year of $2.7 million.",
+    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
+]
+
+nlp = spacy.load("en_core_web_sm")
+for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
+    # Do something with the doc here
+    print([(ent.text, ent.label_) for ent in doc.ents])
+```
+
+<Infobox title="Important note" variant="warning">
+
+When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
+[generator](https://realpython.com/introduction-to-python-generators/) that
+yields `Doc` objects – not a list. So if you want to use it like a list, you'll
+have to call `list()` on it first:
+
+```diff
+- docs = nlp.pipe(texts)[0]         # will raise an error
++ docs = list(nlp.pipe(texts))[0]   # works as expected
+```
+
+</Infobox>
+
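
Not part of the diff: a small sketch of the generator behaviour the warning describes, using only the standard library. `toy_pipe` is a hypothetical stand-in for `nlp.pipe` (it splits on whitespace instead of producing `Doc` objects); the point is only that a generator cannot be indexed until it is turned into a list.

```python
def toy_pipe(texts):
    # Hypothetical stand-in for nlp.pipe: lazily yields one result per text.
    for text in texts:
        yield text.split()


texts = ["This is a text", "These are lots of texts"]

# Indexing the generator directly fails, because generators are not subscriptable.
stream = toy_pipe(texts)
try:
    stream[0]
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)

# Materializing the stream as a list makes indexing possible.
first = list(toy_pipe(texts))[0]
print(first)

# If only the first item is needed, next() avoids building the whole list.
first_lazy = next(toy_pipe(texts))
print(first_lazy)
```

For a very large stream, iterating or calling `next()` keeps memory use flat, whereas `list()` holds every result at once.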
 ## How pipelines work {#pipelines}

 spaCy makes it very easy to create your own pipelines consisting of reusable
@@ -146,19 +223,56 @@ require them in the pipeline settings in your model's `meta.json`.
 ### Disabling and modifying pipeline components {#disabling}

 If you don't need a particular component of the pipeline – for example, the
-tagger or the parser, you can disable loading it. This can sometimes make a big
-difference and improve loading speed. Disabled component names can be provided
-to [`spacy.load`](/api/top-level#spacy.load),
+tagger or the parser, you can **disable loading** it. This can sometimes make a
+big difference and improve loading speed. Disabled component names can be
+provided to [`spacy.load`](/api/top-level#spacy.load),
 [`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a
 list:

 ```python
-nlp = spacy.load("en", disable=["parser", "tagger"])
+### Disable loading
+nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
 nlp = English().from_disk("/model", disable=["ner"])
 ```

-You can also use the [`remove_pipe`](/api/language#remove_pipe) method to remove
-pipeline components from an existing pipeline, the
+In some cases, you do want to load all pipeline components and their weights,
+because you need them at different points in your application. However, if you
+only need a `Doc` object with named entities, there's no need to run all
+pipeline components on it – that can potentially make processing much slower.
+Instead, you can use the `disable` keyword argument on
+[`nlp.pipe`](/api/language#pipe) to temporarily disable the components **during
+processing**:
+
+```python
+### Disable for processing
+for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
+    # Do something with the doc here
+```
+
+If you need to **execute more code** with components disabled – e.g. to reset
+the weights or update only some components during training – you can use the
+[`nlp.disable_pipes`](/api/language#disable_pipes) contextmanager. At the end of
+the `with` block, the disabled pipeline components will be restored
+automatically. Alternatively, `disable_pipes` returns an object that lets you
+call its `restore()` method to restore the disabled components when needed. This
+can be useful if you want to prevent unnecessary code indentation of large
+blocks.
+
+```python
+### Disable for block
+# 1. Use as a contextmanager
+with nlp.disable_pipes("tagger", "parser"):
+    doc = nlp(u"I won't be tagged and parsed")
+doc = nlp(u"I will be tagged and parsed")
+
+# 2. Restore manually
+disabled = nlp.disable_pipes("ner")
+doc = nlp(u"I won't have named entities")
+disabled.restore()
+```
+
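
Not part of the diff: a toy sketch of the two usage patterns described for `nlp.disable_pipes`, written with the standard library only. `ToyDisabled` and the `pipeline` list are hypothetical and do not reflect how spaCy implements this; the sketch assumes a pipeline is a plain list of `(name, callable)` pairs and shows one object serving both as a context manager and as a handle with a `restore()` method.

```python
class ToyDisabled:
    """Hypothetical helper: removes named components and can put them back."""

    def __init__(self, pipeline, *names):
        self.pipeline = pipeline
        # Remember the full original order so restore() can reinstate it.
        self.original = list(pipeline)
        pipeline[:] = [(name, func) for name, func in pipeline if name not in names]

    def restore(self):
        self.pipeline[:] = self.original

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Leaving the with block restores the components automatically.
        self.restore()


# Placeholder callables; only the component names matter in this sketch.
pipeline = [("tagger", str.lower), ("parser", str.strip), ("ner", str.title)]

# 1. Use as a context manager
with ToyDisabled(pipeline, "tagger", "parser"):
    inside = [name for name, _ in pipeline]
after_block = [name for name, _ in pipeline]

# 2. Restore manually
disabled = ToyDisabled(pipeline, "ner")
while_disabled = [name for name, _ in pipeline]
disabled.restore()
after_restore = [name for name, _ in pipeline]

print(inside, after_block, while_disabled, after_restore)
```

The manual form trades the automatic cleanup of the `with` block for flatter code, so the `restore()` call has to be reached on every code path.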
+Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
+to remove pipeline components from an existing pipeline, the
 [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
 [`replace_pipe`](/api/language#replace_pipe) method to replace them with a
 custom component entirely (more details on this in the section on
@@ -182,8 +296,8 @@ initializing a Language class via [`from_disk`](/api/language#from_disk).
 - nlp = spacy.load('en', tagger=False, entity=False)
 - doc = nlp(u"I don't want parsed", parse=False)

-+ nlp = spacy.load('en', disable=['ner'])
-+ nlp.remove_pipe('parser')
++ nlp = spacy.load("en", disable=["ner"])
++ nlp.remove_pipe("parser")
 + doc = nlp(u"I don't want parsed")
 ```

--- a/website/docs/usage/saving-loading.md
+++ b/website/docs/usage/saving-loading.md
@@ -623,8 +623,8 @@ solves this with a clear distinction between setting up the instance and loading
 the data.

 ```diff
-- nlp = spacy.load("en", path="/path/to/data")
-+ nlp = spacy.blank("en").from_disk("/path/to/data")
+- nlp = spacy.load("en_core_web_sm", path="/path/to/data")
++ nlp = spacy.blank("en_core_web_sm").from_disk("/path/to/data")
 ```

 </Infobox>