Add "Processing text" section [ci skip]

Ines Montani 2019-07-25 17:38:03 +02:00
parent a5e3d2f318
commit bd39e5e630


@ -2,6 +2,7 @@
title: Language Processing Pipelines
next: vectors-similarity
menu:
- ['Processing Text', 'processing']
- ['How Pipelines Work', 'pipelines']
- ['Custom Components', 'custom-components']
- ['Extension Attributes', 'custom-components-attributes']
@ -12,6 +13,82 @@ import Pipelines101 from 'usage/101/\_pipelines.md'
<Pipelines101 />
## Processing text {#processing}
When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
component** on the `Doc`, in order. It then returns the processed `Doc` that you
can work with.
```python
doc = nlp(u"This is a text")
```
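The order of operations described above can be pictured with a small, spaCy-free sketch. All names here (`toy_tokenize`, `toy_nlp`, `lowercaser`, `counter`) are hypothetical and only mimic the shape of the process. spaCy's real `Doc` and component objects are much richer than a dictionary.

```python
# Toy illustration only: none of these names are part of spaCy.

def toy_tokenize(text):
    # Stand-in for the tokenizer: creates the "doc" from raw text.
    return {"tokens": text.split(), "log": ["tokenizer"]}

def lowercaser(doc):
    # A component receives the doc, modifies it, and returns it.
    doc["lower"] = [token.lower() for token in doc["tokens"]]
    doc["log"].append("lowercaser")
    return doc

def counter(doc):
    doc["n_tokens"] = len(doc["tokens"])
    doc["log"].append("counter")
    return doc

pipeline = [("lowercaser", lowercaser), ("counter", counter)]

def toy_nlp(text):
    doc = toy_tokenize(text)          # 1. tokenize
    for name, component in pipeline:  # 2. call each component, in order
        doc = component(doc)
    return doc                        # 3. return the processed doc

doc = toy_nlp("This is a text")
print(doc["log"])
print(doc["n_tokens"])
```

Each component gets the object returned by the previous one, which is why the order of the pipeline matters.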
When processing large volumes of text, the statistical models are usually more
efficient if you let them work on batches of texts. spaCy's
[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
processed `Doc` objects. The batching is done internally.
```diff
texts = [u"This is a text", u"These are lots of texts", u"..."]
- docs = [nlp(text) for text in texts]
+ docs = list(nlp.pipe(texts))
```
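To make "the batching is done internally" more concrete, here is a rough sketch of how a stream can be consumed in fixed-size batches while still yielding results one at a time. `toy_pipe`, `process_batch` and the default `batch_size=2` are hypothetical illustrations and are not how spaCy implements `nlp.pipe`.

```python
from itertools import islice

def process_batch(batch):
    # Stand-in for a model that is more efficient on several texts at once.
    return [text.upper() for text in batch]

def toy_pipe(texts, batch_size=2):
    # Accept any iterable: a list, a generator, a file handle, ...
    stream = iter(texts)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            break
        # Results are yielded one by one, so the caller never sees the batches.
        yield from process_batch(batch)

texts = ["This is a text", "These are lots of texts", "..."]
results = list(toy_pipe(texts))
print(results)
```

Because the input is only read one batch at a time, the same pattern works for iterables that are too large to hold in memory.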
<Infobox title="Tips for efficient processing">
- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
  buffer them in batches, instead of one-by-one. This is usually much more
  efficient.
- Only apply the **pipeline components you need**. Computing predictions that
  you never use adds up and becomes very inefficient at scale. To prevent
  this, use the `disable` keyword argument to turn off components you don't
  need, either when loading a model or during processing with `nlp.pipe`. See
  the section on [disabling pipeline components](#disabling) for more details
  and examples.
</Infobox>
In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
(potentially very large) iterable of texts as a stream. Because we're only
accessing the named entities in `doc.ents` (set by the `ner` component), we'll
disable all other statistical components (the `tagger` and `parser`) during
processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
access the named entity predictions:
> #### ✏️ Things to try
>
> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
> empty, because the entity recognizer didn't run.
```python
### {executable="true"}
import spacy
texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
```
<Infobox title="Important note" variant="warning">
When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
[generator](https://realpython.com/introduction-to-python-generators/) that
yields `Doc` objects, not a list. So if you want to use it like a list, you'll
have to call `list()` on it first:
```diff
- doc = nlp.pipe(texts)[0] # will raise an error
+ doc = list(nlp.pipe(texts))[0] # works as expected
```
</Infobox>
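The same behavior can be reproduced with any plain Python generator. In the sketch below, `fake_pipe` is a hypothetical stand-in for `nlp.pipe`. It also shows `next()` as one way to take just the first item without building the whole list.

```python
def fake_pipe(texts):
    # Hypothetical stand-in for nlp.pipe: a generator, not a list.
    for text in texts:
        yield text.split()

texts = ["This is a text", "These are lots of texts"]

try:
    fake_pipe(texts)[0]
    error = None
except TypeError as exc:
    # Generators don't support indexing.
    error = exc

first_via_list = list(fake_pipe(texts))[0]  # materializes every item
first_via_next = next(fake_pipe(texts))     # consumes only the first item
print(type(error).__name__, first_via_list == first_via_next)
```

Calling `list()` is the simplest fix, but it processes the entire stream up front, so for very large inputs it is usually better to keep iterating over the generator.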
## How pipelines work {#pipelines}
spaCy makes it very easy to create your own pipelines consisting of reusable