mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-14 13:47:13 +03:00
Merge branch 'master' into spacy.io
commit
d9eeae5c69
@@ -11,6 +11,11 @@ compressed binary strings. The `Doc` object holds an array of `TokenC` structs.
 The Python-level `Token` and [`Span`](/api/span) objects are views of this
 array, i.e. they don't own the data themselves.
 
+## Doc.\_\_init\_\_ {#init tag="method"}
+
+Construct a `Doc` object. The most common way to get a `Doc` object is via the
+`nlp` object.
+
 > #### Example
 >
 > ```python
@@ -24,11 +29,6 @@ array, i.e. they don't own the data themselves.
 > doc = Doc(nlp.vocab, words=words, spaces=spaces)
 > ```
 
-## Doc.\_\_init\_\_ {#init tag="method"}
-
-Construct a `Doc` object. The most common way to get a `Doc` object is via the
-`nlp` object.
-
 | Name | Type | Description |
 | ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `vocab` | `Vocab` | A storage container for lexical types. |

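
The `words` and `spaces` arguments passed to `Doc` in the example describe the text token by token: each entry in `spaces` states whether the corresponding word is followed by a space. As a rough illustration in plain Python (this is not spaCy's implementation, and `join_words` is a hypothetical helper introduced only for this sketch), the original text can be rebuilt from the two lists like this:

```python
def join_words(words, spaces):
    """Rebuild a text from tokens and trailing-space flags (hypothetical helper)."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    # Append a single space after each word whose flag is True.
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


words = ["Hello", "world", "!"]
spaces = [True, False, False]
text = join_words(words, spaces)
print(text)  # Hello world!
```

Keeping the whitespace flags separate from the words is what allows the text to be reconstructed without loss.
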
@@ -29,7 +29,7 @@ class. The data will be loaded in via
 > nlp = spacy.load("/path/to/en") # unicode path
 > nlp = spacy.load(Path("/path/to/en")) # pathlib Path
 >
-> nlp = spacy.load("en", disable=["parser", "tagger"])
+> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
 > ```
 
 | Name | Type | Description |

@@ -52,4 +52,18 @@ entities into account when making predictions.
 
 </Accordion>
 
+<Accordion title="Why is the tokenizer special?" id="pipeline-components-tokenizer">
+
+The tokenizer is a "special" component and isn't part of the regular pipeline.
+It also doesn't show up in `nlp.pipe_names`. The reason is that there can only
+really be one tokenizer, and while all other pipeline components take a `Doc`
+and return it, the tokenizer takes a **string of text** and turns it into a
+`Doc`. You can still customize the tokenizer, though. `nlp.tokenizer` is
+writable, so you can either create your own
+[`Tokenizer` class from scratch](/usage/linguistic-features#native-tokenizers),
+or even replace it with an
+[entirely custom function](/usage/linguistic-features#custom-tokenizer).
+
+</Accordion>
+
 ---

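
The difference between the tokenizer and the other components can be made concrete with a toy model. The sketch below is plain Python and only mimics the shape of the idea; `ToyDoc`, `ToyLanguage` and the `toy_*` functions are hypothetical names, not spaCy classes. The tokenizer is the only step that accepts a string, every other component maps a doc to a doc, and the tokenizer is stored separately from the named pipeline:

```python
class ToyDoc:
    """Stand-in for a Doc: holds the words plus whatever components attach."""

    def __init__(self, words):
        self.words = words
        self.annotations = {}


def toy_tokenizer(text):
    # The tokenizer is the only step that receives a raw string.
    return ToyDoc(text.split())


def toy_tagger(doc):
    # Regular components take a doc, modify it and return it.
    doc.annotations["is_title"] = [word.istitle() for word in doc.words]
    return doc


def toy_counter(doc):
    doc.annotations["n_words"] = len(doc.words)
    return doc


class ToyLanguage:
    def __init__(self, tokenizer, pipeline):
        self.tokenizer = tokenizer  # writable attribute, kept outside the pipeline
        self.pipeline = pipeline    # list of (name, component) pairs

    @property
    def pipe_names(self):
        return [name for name, _ in self.pipeline]

    def __call__(self, text):
        doc = self.tokenizer(text)
        for _, component in self.pipeline:
            doc = component(doc)
        return doc


nlp = ToyLanguage(toy_tokenizer, [("tagger", toy_tagger), ("counter", toy_counter)])
doc = nlp("This is a text")
print(nlp.pipe_names)               # ['tagger', 'counter']
print(doc.annotations["n_words"])   # 4

# Replacing the tokenizer with an entirely custom function
nlp.tokenizer = lambda text: ToyDoc(text.split("-"))
print(nlp("a-b-c").words)           # ['a', 'b', 'c']
```

Because the tokenizer sits outside the list of named components, it does not appear in `pipe_names`, which mirrors the behaviour described in the accordion text.
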
@@ -2,6 +2,7 @@
 title: Language Processing Pipelines
 next: vectors-similarity
 menu:
+- ['Processing Text', 'processing']
 - ['How Pipelines Work', 'pipelines']
 - ['Custom Components', 'custom-components']
 - ['Extension Attributes', 'custom-components-attributes']
@@ -12,6 +13,82 @@ import Pipelines101 from 'usage/101/\_pipelines.md'
 
 <Pipelines101 />
 
+## Processing text {#processing}
+
+When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
+component** on the `Doc`, in order. It then returns the processed `Doc` that you
+can work with.
+
+```python
+doc = nlp(u"This is a text")
+```
+
+When processing large volumes of text, the statistical models are usually more
+efficient if you let them work on batches of texts. spaCy's
+[`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields
+processed `Doc` objects. The batching is done internally.
+
+```diff
+texts = [u"This is a text", u"These are lots of texts", u"..."]
+- docs = [nlp(text) for text in texts]
++ docs = list(nlp.pipe(texts))
+```
+
+<Infobox title="Tips for efficient processing">
+
+- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
+  buffer them in batches, instead of one-by-one. This is usually much more
+  efficient.
+- Only apply the **pipeline components you need**. Getting predictions from the
+  model that you don't actually need adds up and becomes very inefficient at
+  scale. To prevent this, use the `disable` keyword argument to disable
+  components you don't need – either when loading a model, or during processing
+  with `nlp.pipe`. See the section on
+  [disabling pipeline components](#disabling) for more details and examples.
+
+</Infobox>
+
+In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a
+(potentially very large) iterable of texts as a stream. Because we're only
+accessing the named entities in `doc.ents` (set by the `ner` component), we'll
+disable all other statistical components (the `tagger` and `parser`) during
+processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and
+access the named entity predictions:
+
+> #### ✏️ Things to try
+>
+> 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now
+>    empty, because the entity recognizer didn't run.
+
+```python
+### {executable="true"}
+import spacy
+
+texts = [
+    "Net income was $9.4 million compared to the prior year of $2.7 million.",
+    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
+]
+
+nlp = spacy.load("en_core_web_sm")
+for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
+    # Do something with the doc here
+    print([(ent.text, ent.label_) for ent in doc.ents])
+```
+
+<Infobox title="Important note" variant="warning">
+
+When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a
+[generator](https://realpython.com/introduction-to-python-generators/) that
+yields `Doc` objects – not a list. So if you want to use it like a list, you'll
+have to call `list()` on it first:
+
+```diff
+- docs = nlp.pipe(texts)[0] # will raise an error
++ docs = list(nlp.pipe(texts))[0] # works as expected
+```
+
+</Infobox>
+
 ## How pipelines work {#pipelines}
 
 spaCy makes it very easy to create your own pipelines consisting of reusable
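
The two points made about `nlp.pipe` in this hunk, that batching happens internally and that the result is a generator instead of a list, can be illustrated without spaCy. The following is a minimal sketch in plain Python under stated assumptions: `toy_pipe`, its `process` callback and the `batch_size` default are hypothetical and do not reflect how spaCy implements batching.

```python
from itertools import islice


def toy_pipe(texts, process, batch_size=2):
    """Lazily yield processed items, reading the input in batches (hypothetical)."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # A statistical model would typically handle the whole batch in one call here.
        for text in batch:
            yield process(text)


texts = ["This is a text", "These are lots of texts", "..."]
results = toy_pipe(texts, str.upper)

# A generator cannot be indexed like a list.
try:
    results[0]
except TypeError as err:
    print("indexing failed:", type(err).__name__)

# Materialize it first if list behaviour is needed.
docs = list(results)
print(docs[0])  # THIS IS A TEXT
```

Because the generator only pulls from its input as items are requested, the same pattern works for iterables that are too large to hold in memory at once.
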
@@ -146,19 +223,56 @@ require them in the pipeline settings in your model's `meta.json`.
 ### Disabling and modifying pipeline components {#disabling}
 
 If you don't need a particular component of the pipeline – for example, the
-tagger or the parser, you can disable loading it. This can sometimes make a big
-difference and improve loading speed. Disabled component names can be provided
-to [`spacy.load`](/api/top-level#spacy.load),
+tagger or the parser, you can **disable loading** it. This can sometimes make a
+big difference and improve loading speed. Disabled component names can be
+provided to [`spacy.load`](/api/top-level#spacy.load),
 [`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a
 list:
 
 ```python
-nlp = spacy.load("en", disable=["parser", "tagger"])
+### Disable loading
+nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
 nlp = English().from_disk("/model", disable=["ner"])
 ```
 
-You can also use the [`remove_pipe`](/api/language#remove_pipe) method to remove
-pipeline components from an existing pipeline, the
+In some cases, you do want to load all pipeline components and their weights,
+because you need them at different points in your application. However, if you
+only need a `Doc` object with named entities, there's no need to run all
+pipeline components on it – that can potentially make processing much slower.
+Instead, you can use the `disable` keyword argument on
+[`nlp.pipe`](/api/language#pipe) to temporarily disable the components **during
+processing**:
+
+```python
+### Disable for processing
+for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
+    # Do something with the doc here
+```
+
+If you need to **execute more code** with components disabled – e.g. to reset
+the weights or update only some components during training – you can use the
+[`nlp.disable_pipes`](/api/language#disable_pipes) contextmanager. At the end of
+the `with` block, the disabled pipeline components will be restored
+automatically. Alternatively, `disable_pipes` returns an object that lets you
+call its `restore()` method to restore the disabled components when needed. This
+can be useful if you want to prevent unnecessary code indentation of large
+blocks.
+
+```python
+### Disable for block
+# 1. Use as a contextmanager
+with nlp.disable_pipes("tagger", "parser"):
+    doc = nlp(u"I won't be tagged and parsed")
+doc = nlp(u"I will be tagged and parsed")
+
+# 2. Restore manually
+disabled = nlp.disable_pipes("ner")
+doc = nlp(u"I won't have named entities")
+disabled.restore()
+```
+
+Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
+to remove pipeline components from an existing pipeline, the
 [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
 [`replace_pipe`](/api/language#replace_pipe) method to replace them with a
 custom component entirely (more details on this in the section on
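
The text in this hunk describes an object that both works in a `with` block and exposes a `restore()` method. As a rough sketch of how one object can offer both, here is a plain-Python toy; `DisabledPipes` below is a hypothetical class written for illustration, operates on a simple list of `(name, component)` pairs, and is not spaCy's implementation:

```python
class DisabledPipes:
    """Remove named components and remember how to put them back (toy sketch)."""

    def __init__(self, pipeline, *names):
        self.pipeline = pipeline
        self.original = list(pipeline)
        # Drop the named components in place so every holder of the list sees it.
        pipeline[:] = [(name, comp) for name, comp in pipeline if name not in names]

    def restore(self):
        # Put all components back in their original order.
        self.pipeline[:] = self.original

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.restore()


pipeline = [("tagger", len), ("parser", len), ("ner", len)]

# 1. As a context manager: restored automatically at the end of the block
with DisabledPipes(pipeline, "tagger", "parser"):
    print([name for name, _ in pipeline])  # ['ner']
print([name for name, _ in pipeline])      # ['tagger', 'parser', 'ner']

# 2. Restored manually, which avoids indenting a large block
disabled = DisabledPipes(pipeline, "ner")
print([name for name, _ in pipeline])      # ['tagger', 'parser']
disabled.restore()
print([name for name, _ in pipeline])      # ['tagger', 'parser', 'ner']
```

The disabling happens in the constructor and `__exit__` simply calls `restore()`, so the same object serves both usage styles.
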
@@ -182,8 +296,8 @@ initializing a Language class via [`from_disk`](/api/language#from_disk).
 - nlp = spacy.load('en', tagger=False, entity=False)
 - doc = nlp(u"I don't want parsed", parse=False)
 
-+ nlp = spacy.load('en', disable=['ner'])
-+ nlp.remove_pipe('parser')
++ nlp = spacy.load("en", disable=["ner"])
++ nlp.remove_pipe("parser")
 + doc = nlp(u"I don't want parsed")
 ```
 

@@ -623,8 +623,8 @@ solves this with a clear distinction between setting up the instance and loading
 the data.
 
 ```diff
-- nlp = spacy.load("en", path="/path/to/data")
-+ nlp = spacy.blank("en").from_disk("/path/to/data")
+- nlp = spacy.load("en_core_web_sm", path="/path/to/data")
++ nlp = spacy.blank("en_core_web_sm").from_disk("/path/to/data")
 ```
 
 </Infobox>