Merge branch 'master' into spacy.io

Ines Montani 2019-07-25 17:38:19 +02:00
commit d9eeae5c69
5 changed files with 144 additions and 16 deletions

@@ -11,6 +11,11 @@ compressed binary strings. The `Doc` object holds an array of `TokenC` structs.
The Python-level `Token` and [`Span`](/api/span) objects are views of this
array, i.e. they don't own the data themselves.
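A minimal sketch of this view relationship, using a blank English pipeline for illustration:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")

token = doc[0]   # Token: a view into the Doc's token array
span = doc[1:4]  # Span: a slice view, no data is copied

# Both views point back at the same owning Doc
print(token.doc is doc, span.doc is doc)  # True True
```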
## Doc.\_\_init\_\_ {#init tag="method"}
Construct a `Doc` object. The most common way to get a `Doc` object is via the
`nlp` object.
> #### Example
>
> ```python
@@ -24,11 +29,6 @@ array, i.e. they don't own the data themselves.
> doc = Doc(nlp.vocab, words=words, spaces=spaces)
> ```
| Name | Type | Description |
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | A storage container for lexical types. |

@@ -29,7 +29,7 @@ class. The data will be loaded in via
> nlp = spacy.load("/path/to/en") # unicode path
> nlp = spacy.load(Path("/path/to/en")) # pathlib Path
>
> nlp = spacy.load("en", disable=["parser", "tagger"])
> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
> ```
| Name | Type | Description |

@@ -52,4 +52,18 @@ entities into account when making predictions.
</Accordion>
<Accordion title="Why is the tokenizer special?" id="pipeline-components-tokenizer">
The tokenizer is a "special" component and isn't part of the regular pipeline.
It also doesn't show up in `nlp.pipe_names`. The reason is that there can only
really be one tokenizer, and while all other pipeline components take a `Doc`
and return it, the tokenizer takes a **string of text** and turns it into a
`Doc`. You can still customize the tokenizer, though. `nlp.tokenizer` is
writable, so you can either create your own
[`Tokenizer` class from scratch](/usage/linguistic-features#native-tokenizers),
or even replace it with an
[entirely custom function](/usage/linguistic-features#custom-tokenizer).
</Accordion>
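For instance, here's a minimal sketch of swapping in a custom function, assuming a naive whitespace-only tokenizer is acceptable for the use case:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

def whitespace_tokenizer(text):
    # A toy tokenizer that only splits on single spaces and
    # builds the Doc directly from the resulting words
    words = text.split(" ")
    return Doc(nlp.vocab, words=words)

# nlp.tokenizer is writable, so the default tokenizer can be replaced
nlp.tokenizer = whitespace_tokenizer
doc = nlp("What's happened to me? he thought.")
print([token.text for token in doc])
```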
---

@@ -2,6 +2,7 @@
title: Language Processing Pipelines
next: vectors-similarity
menu:
- ['Processing Text', 'processing']
- ['How Pipelines Work', 'pipelines']
- ['Custom Components', 'custom-components']
- ['Extension Attributes', 'custom-components-attributes']
@@ -12,6 +13,82 @@ import Pipelines101 from 'usage/101/\_pipelines.md'
<Pipelines101 />
## Processing text {#processing}
When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. It then returns the processed `Doc` that you
can work with.
```python
doc = nlp(u"This is a text")
```
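Under the hood, this is roughly equivalent to the following sketch: the tokenizer creates the `Doc`, and each component in `nlp.pipeline` is then applied to it in order:

```python
doc = nlp.make_doc(u"This is a text")  # runs only the tokenizer
for name, proc in nlp.pipeline:        # each (name, component) pair, in order
    doc = proc(doc)
```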
When processing large volumes of text, the statistical models are usually more
efficient if you let them work on batches of texts. spaCy's
[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
processed `Doc` objects. The batching is done internally.
```diff
texts = [u"This is a text", u"These are lots of texts", u"..."]
- docs = [nlp(text) for text in texts]
+ docs = list(nlp.pipe(texts))
```
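As a rough sketch, `nlp.pipe` also accepts a `batch_size` keyword argument that controls how many texts are buffered per batch; the value of 50 here is arbitrary:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
texts = [u"This is a text", u"These are lots of texts", u"..."]

# Texts are buffered and processed in batches internally
for doc in nlp.pipe(texts, batch_size=50):
    print(len(doc))
```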
<Infobox title="Tips for efficient processing">
- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
buffer them in batches, instead of one-by-one. This is usually much more
efficient.
- Only apply the **pipeline components you need**. Getting predictions from the
model that you don't actually need adds up and becomes very inefficient at
scale. To prevent this, use the `disable` keyword argument to disable
components you don't need either when loading a model, or during processing
with `nlp.pipe`. See the section on
[disabling pipeline components](#disabling) for more details and examples.
</Infobox>
In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
(potentially very large) iterable of texts as a stream. Because we're only
accessing the named entities in `doc.ents` (set by the `ner` component), we'll
disable all other statistical components (the `tagger` and `parser`) during
processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
access the named entity predictions:
> #### ✏️ Things to try
>
> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
> empty, because the entity recognizer didn't run.
```python
### {executable="true"}
import spacy
texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
```
<Infobox title="Important note" variant="warning">
When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
[generator](https://realpython.com/introduction-to-python-generators/) that
yields `Doc` objects, not a list. So if you want to use it like a list, you'll
have to call `list()` on it first:
```diff
- docs = nlp.pipe(texts)[0] # will raise an error
+ docs = list(nlp.pipe(texts))[0] # works as expected
```
</Infobox>
## How pipelines work {#pipelines}
spaCy makes it very easy to create your own pipelines consisting of reusable
@@ -146,19 +223,56 @@ require them in the pipeline settings in your model's `meta.json`.
### Disabling and modifying pipeline components {#disabling}
If you don't need a particular component of the pipeline, for example the
tagger or the parser, you can **disable loading** it. This can sometimes make a
big difference and improve loading speed. Disabled component names can be
provided to [`spacy.load`](/api/top-level#spacy.load),
[`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a
list:
```python
nlp = spacy.load("en", disable=["parser", "tagger"])
### Disable loading
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
nlp = English().from_disk("/model", disable=["ner"])
```
In some cases, you do want to load all pipeline components and their weights,
because you need them at different points in your application. However, if you
only need a `Doc` object with named entities, there's no need to run all
pipeline components on it; that can potentially make processing much slower.
Instead, you can use the `disable` keyword argument on
[`nlp.pipe`](/api/language#pipe) to temporarily disable the components **during
processing**:
```python
### Disable for processing
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    print(doc.ents)
```
If you need to **execute more code** with components disabled, e.g. to reset
the weights or update only some components during training, you can use the
[`nlp.disable_pipes`](/api/language#disable_pipes) contextmanager. At the end of
the `with` block, the disabled pipeline components will be restored
automatically. Alternatively, `disable_pipes` returns an object that lets you
call its `restore()` method to restore the disabled components when needed. This
can be useful if you want to prevent unnecessary code indentation of large
blocks.
```python
### Disable for block
# 1. Use as a contextmanager
with nlp.disable_pipes("tagger", "parser"):
    doc = nlp(u"I won't be tagged and parsed")
doc = nlp(u"I will be tagged and parsed")
# 2. Restore manually
disabled = nlp.disable_pipes("ner")
doc = nlp(u"I won't have named entities")
disabled.restore()
```
Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
to remove pipeline components from an existing pipeline, the
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
[`replace_pipe`](/api/language#replace_pipe) method to replace them with a
custom component entirely (more details on this in the section on
@@ -182,8 +296,8 @@ initializing a Language class via [`from_disk`](/api/language#from_disk).
- nlp = spacy.load('en', tagger=False, entity=False)
- doc = nlp(u"I don't want parsed", parse=False)
+ nlp = spacy.load("en", disable=["ner"])
+ nlp.remove_pipe("parser")
+ doc = nlp(u"I don't want parsed")
```
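A rough sketch of the `remove_pipe`, `rename_pipe` and `replace_pipe` methods mentioned above; the custom component here is a hypothetical no-op:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Remove a component; returns a (name, component) tuple
name, component = nlp.remove_pipe("ner")

# Rename a component in place
nlp.rename_pipe("parser", "my_parser")

# Replace a component with a custom one
def my_component(doc):
    # Pipeline components take a Doc and return it
    return doc

nlp.replace_pipe("tagger", my_component)
print(nlp.pipe_names)  # ['tagger', 'my_parser']
```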

@@ -623,8 +623,8 @@ solves this with a clear distinction between setting up the instance and loading
the data.
```diff
- nlp = spacy.load("en", path="/path/to/data")
+ nlp = spacy.blank("en").from_disk("/path/to/data")
- nlp = spacy.load("en_core_web_sm", path="/path/to/data")
+ nlp = spacy.blank("en_core_web_sm").from_disk("/path/to/data")
```
</Infobox>
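A minimal sketch of this two-step pattern, with the model directory as a placeholder path:

```python
from spacy.lang.en import English

# Step 1: set up the Language instance; no model data is loaded yet
nlp = English()

# Step 2: load the model data from a directory
nlp = nlp.from_disk("/path/to/data")
```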