mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-25 05:01:02 +03:00 
			
		
		
		
	Merge branch 'master' into spacy.io
						commit
						d9eeae5c69
					
				|  | @ -11,6 +11,11 @@ compressed binary strings. The `Doc` object holds an array of `TokenC` structs. | |||
| The Python-level `Token` and [`Span`](/api/span) objects are views of this | ||||
| array, i.e. they don't own the data themselves. | ||||
| 
 | ||||
| ## Doc.\_\_init\_\_ {#init tag="method"} | ||||
| 
 | ||||
| Construct a `Doc` object. The most common way to get a `Doc` object is via the | ||||
| `nlp` object. | ||||
| 
 | ||||
| > #### Example | ||||
| > | ||||
| > ```python | ||||
|  | @ -24,11 +29,6 @@ array, i.e. they don't own the data themselves. | |||
| > doc = Doc(nlp.vocab, words=words, spaces=spaces) | ||||
| > ``` | ||||
| 
 | ||||
| ## Doc.\_\_init\_\_ {#init tag="method"} | ||||
| 
 | ||||
| Construct a `Doc` object. The most common way to get a `Doc` object is via the | ||||
| `nlp` object. | ||||
| 
 | ||||
| | Name        | Type     | Description                                                                                                                                                         | | ||||
| | ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||||
| | `vocab`     | `Vocab`  | A storage container for lexical types.                                                                                                                              | | ||||
|  |  | |||
|  | @ -29,7 +29,7 @@ class. The data will be loaded in via | |||
| > nlp = spacy.load("/path/to/en") # unicode path | ||||
| > nlp = spacy.load(Path("/path/to/en")) # pathlib Path | ||||
| > | ||||
| > nlp = spacy.load("en", disable=["parser", "tagger"]) | ||||
| > nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"]) | ||||
| > ``` | ||||
| 
 | ||||
| | Name        | Type             | Description                                                                       | | ||||
|  |  | |||
|  | @ -52,4 +52,18 @@ entities into account when making predictions. | |||
| 
 | ||||
| </Accordion> | ||||
| 
 | ||||
| <Accordion title="Why is the tokenizer special?" id="pipeline-components-tokenizer"> | ||||
| 
 | ||||
| The tokenizer is a "special" component and isn't part of the regular pipeline. | ||||
| It also doesn't show up in `nlp.pipe_names`. The reason is that there can only | ||||
| really be one tokenizer, and while all other pipeline components take a `Doc` | ||||
| and return it, the tokenizer takes a **string of text** and turns it into a | ||||
| `Doc`. You can still customize the tokenizer, though. `nlp.tokenizer` is | ||||
| writable, so you can either create your own | ||||
| [`Tokenizer` class from scratch](/usage/linguistic-features#native-tokenizers), | ||||
| or even replace it with an | ||||
| [entirely custom function](/usage/linguistic-features#custom-tokenizer). | ||||
| 
 | ||||
| </Accordion> | ||||
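
The difference between the tokenizer and the other components can be illustrated with a toy pipeline. This is a minimal sketch in plain Python, not spaCy's implementation: `ToyDoc`, `whitespace_tokenizer`, `length_tagger` and `run_pipeline` are hypothetical names invented for this example. It only shows the shape of the contract described above: one step turns a string into a document, and every later step takes a document and returns it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc; not the real spaCy class."""
    words: list
    annotations: dict = field(default_factory=dict)


def whitespace_tokenizer(text: str) -> ToyDoc:
    # The tokenizer is the only step that takes a string and creates the doc.
    return ToyDoc(words=text.split())


def length_tagger(doc: ToyDoc) -> ToyDoc:
    # A regular component: takes a doc, annotates it, returns the same doc.
    doc.annotations["lengths"] = [len(word) for word in doc.words]
    return doc


def run_pipeline(text, tokenizer, components):
    doc = tokenizer(text)               # str -> doc, happens exactly once
    for _name, component in components:
        doc = component(doc)            # doc -> doc, applied in order
    return doc


components = [("length_tagger", length_tagger)]
doc = run_pipeline("This is a text", whitespace_tokenizer, components)
print(doc.words)
print(doc.annotations["lengths"])
# The tokenizer is passed separately, so it is not in the component names.
print([name for name, _ in components])
```

Because the tokenizer is a separate argument in this sketch, swapping it out means passing a different callable with the same string-in, document-out signature, which mirrors the idea of assigning a replacement to `nlp.tokenizer`.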
| 
 | ||||
| --- | ||||
|  |  | |||
|  | @ -2,6 +2,7 @@ | |||
| title: Language Processing Pipelines | ||||
| next: vectors-similarity | ||||
| menu: | ||||
|   - ['Processing Text', 'processing'] | ||||
|   - ['How Pipelines Work', 'pipelines'] | ||||
|   - ['Custom Components', 'custom-components'] | ||||
|   - ['Extension Attributes', 'custom-components-attributes'] | ||||
|  | @ -12,6 +13,82 @@ import Pipelines101 from 'usage/101/\_pipelines.md' | |||
| 
 | ||||
| <Pipelines101 /> | ||||
| 
 | ||||
| ## Processing text {#processing} | ||||
| 
 | ||||
| When you call `nlp` on a text, spaCy will **tokenize** it and then **call each | ||||
| component** on the `Doc`, in order. It then returns the processed `Doc` that you | ||||
| can work with. | ||||
| 
 | ||||
| ```python | ||||
| doc = nlp(u"This is a text") | ||||
| ``` | ||||
| 
 | ||||
| When processing large volumes of text, the statistical models are usually more | ||||
| efficient if you let them work on batches of texts. spaCy's | ||||
| [`nlp.pipe`](/api/language#pipe) method takes an iterable of texts and yields | ||||
| processed `Doc` objects. The batching is done internally. | ||||
| 
 | ||||
| ```diff | ||||
| texts = [u"This is a text", u"These are lots of texts", u"..."] | ||||
| - docs = [nlp(text) for text in texts] | ||||
| + docs = list(nlp.pipe(texts)) | ||||
| ``` | ||||
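
To make "the batching is done internally" more concrete, here is a rough sketch of the general pattern in plain Python. It is an illustration under assumptions, not how spaCy implements `nlp.pipe`: `minibatch`, `toy_pipe` and `process_batch` are hypothetical names, and the "model" is just an uppercase transform. The point is that the caller passes in an iterable and receives results one at a time, while the expensive step is called once per batch.

```python
from itertools import islice


def minibatch(items, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def toy_pipe(texts, process_batch, batch_size=2):
    """Hypothetical stand-in for a pipe method: batch internally, yield singly."""
    for batch in minibatch(texts, batch_size):
        # One call per batch instead of one call per text.
        for result in process_batch(batch):
            yield result


calls = []  # records the size of each batch that was processed


def process_batch(batch):
    calls.append(len(batch))
    return [text.upper() for text in batch]


texts = ["This is a text", "These are lots of texts", "..."]
results = list(toy_pipe(texts, process_batch))
print(results)
print(calls)
```

With three texts and a batch size of 2, the processing function runs twice rather than three times; with real statistical models, fewer and larger calls are where the efficiency gain usually comes from.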
| 
 | ||||
| <Infobox title="Tips for efficient processing"> | ||||
| 
 | ||||
| - Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and | ||||
|   buffer them in batches, instead of one-by-one. This is usually much more | ||||
|   efficient. | ||||
| - Only apply the **pipeline components you need**. Getting predictions from the | ||||
|   model that you don't actually need adds up and becomes very inefficient at | ||||
|   scale. To prevent this, use the `disable` keyword argument to disable | ||||
|   components you don't need – either when loading a model, or during processing | ||||
|   with `nlp.pipe`. See the section on | ||||
|   [disabling pipeline components](#disabling) for more details and examples. | ||||
| 
 | ||||
| </Infobox> | ||||
| 
 | ||||
| In this example, we're using [`nlp.pipe`](/api/language#pipe) to process a | ||||
| (potentially very large) iterable of texts as a stream. Because we're only | ||||
| accessing the named entities in `doc.ents` (set by the `ner` component), we'll | ||||
| disable all other statistical components (the `tagger` and `parser`) during | ||||
| processing. `nlp.pipe` yields `Doc` objects, so we can iterate over them and | ||||
| access the named entity predictions: | ||||
| 
 | ||||
| > #### ✏️ Things to try | ||||
| > | ||||
| > 1. Also disable the `"ner"` component. You'll see that the `doc.ents` are now | ||||
| >    empty, because the entity recognizer didn't run. | ||||
| 
 | ||||
| ```python | ||||
| ### {executable="true"} | ||||
| import spacy | ||||
| 
 | ||||
| texts = [ | ||||
|     "Net income was $9.4 million compared to the prior year of $2.7 million.", | ||||
|     "Revenue exceeded twelve billion dollars, with a loss of $1b.", | ||||
| ] | ||||
| 
 | ||||
| nlp = spacy.load("en_core_web_sm") | ||||
| for doc in nlp.pipe(texts, disable=["tagger", "parser"]): | ||||
|     # Do something with the doc here | ||||
|     print([(ent.text, ent.label_) for ent in doc.ents]) | ||||
| ``` | ||||
| 
 | ||||
| <Infobox title="Important note" variant="warning"> | ||||
| 
 | ||||
| When using [`nlp.pipe`](/api/language#pipe), keep in mind that it returns a | ||||
| [generator](https://realpython.com/introduction-to-python-generators/) that | ||||
| yields `Doc` objects – not a list. So if you want to use it like a list, you'll | ||||
| have to call `list()` on it first: | ||||
| 
 | ||||
| ```diff | ||||
| - doc = nlp.pipe(texts)[0]         # will raise an error | ||||
| + doc = list(nlp.pipe(texts))[0]   # works as expected | ||||
| ``` | ||||
| 
 | ||||
| </Infobox> | ||||
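
The generator behaviour is a general Python property and can be reproduced without spaCy. The sketch below uses a hypothetical `toy_pipe` generator function as a stand-in for `nlp.pipe`; it is only meant to show why indexing fails and what the alternatives look like.

```python
from itertools import islice


def toy_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: a generator, not a list."""
    for text in texts:
        yield text.split()


texts = ["This is a text", "These are lots of texts"]

try:
    toy_pipe(texts)[0]
    error_name = None
except TypeError as err:
    # Generators do not support indexing.
    error_name = type(err).__name__

first_from_list = list(toy_pipe(texts))[0]    # materializes every result first
first_lazy = next(toy_pipe(texts))            # computes only the first result
first_two = list(islice(toy_pipe(texts), 2))  # takes a bounded prefix

print(error_name)
print(first_from_list == first_lazy)
```

Calling `list()` is the simplest fix, but it holds all results in memory. For a very large stream, `next()` or `itertools.islice` keeps the processing lazy.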
| 
 | ||||
| ## How pipelines work {#pipelines} | ||||
| 
 | ||||
| spaCy makes it very easy to create your own pipelines consisting of reusable | ||||
|  | @ -146,19 +223,56 @@ require them in the pipeline settings in your model's `meta.json`. | |||
| ### Disabling and modifying pipeline components {#disabling} | ||||
| 
 | ||||
| If you don't need a particular component of the pipeline – for example, the | ||||
| tagger or the parser, you can disable loading it. This can sometimes make a big | ||||
| difference and improve loading speed. Disabled component names can be provided | ||||
| to [`spacy.load`](/api/top-level#spacy.load), | ||||
| tagger or the parser, you can **disable loading** it. This can sometimes make a | ||||
| big difference and improve loading speed. Disabled component names can be | ||||
| provided to [`spacy.load`](/api/top-level#spacy.load), | ||||
| [`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a | ||||
| list: | ||||
| 
 | ||||
| ```python | ||||
| nlp = spacy.load("en", disable=["parser", "tagger"]) | ||||
| ### Disable loading | ||||
| nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"]) | ||||
| nlp = English().from_disk("/model", disable=["ner"]) | ||||
| ``` | ||||
| 
 | ||||
| You can also use the [`remove_pipe`](/api/language#remove_pipe) method to remove | ||||
| pipeline components from an existing pipeline, the | ||||
| In some cases, you do want to load all pipeline components and their weights, | ||||
| because you need them at different points in your application. However, if you | ||||
| only need a `Doc` object with named entities, there's no need to run all | ||||
| pipeline components on it – that can potentially make processing much slower. | ||||
| Instead, you can use the `disable` keyword argument on | ||||
| [`nlp.pipe`](/api/language#pipe) to temporarily disable the components **during | ||||
| processing**: | ||||
| 
 | ||||
| ```python | ||||
| ### Disable for processing | ||||
| for doc in nlp.pipe(texts, disable=["tagger", "parser"]): | ||||
|     # Do something with the doc here | ||||
|     print([(ent.text, ent.label_) for ent in doc.ents]) | ||||
| ``` | ||||
| 
 | ||||
| If you need to **execute more code** with components disabled – e.g. to reset | ||||
| the weights or update only some components during training – you can use the | ||||
| [`nlp.disable_pipes`](/api/language#disable_pipes) context manager. At the end of | ||||
| the `with` block, the disabled pipeline components will be restored | ||||
| automatically. Alternatively, `disable_pipes` returns an object that lets you | ||||
| call its `restore()` method to restore the disabled components when needed. This | ||||
| can be useful if you want to prevent unnecessary code indentation of large | ||||
| blocks. | ||||
| 
 | ||||
| ```python | ||||
| ### Disable for block | ||||
| # 1. Use as a context manager | ||||
| with nlp.disable_pipes("tagger", "parser"): | ||||
|     doc = nlp(u"I won't be tagged and parsed") | ||||
| doc = nlp(u"I will be tagged and parsed") | ||||
| 
 | ||||
| # 2. Restore manually | ||||
| disabled = nlp.disable_pipes("ner") | ||||
| doc = nlp(u"I won't have named entities") | ||||
| disabled.restore() | ||||
| ``` | ||||
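
One way an object can support both usages, as a `with` block and via a manual `restore()` call, is sketched below in plain Python. This is a toy model under assumptions, not spaCy's actual code: `ToyDisabledPipes` and `pipe_names` are hypothetical names, and the pipeline is a plain list of `(name, callable)` pairs.

```python
class ToyDisabledPipes:
    """Hypothetical helper; the real spaCy implementation may differ."""

    def __init__(self, pipeline, *names):
        self.pipeline = pipeline
        self.original = list(pipeline)  # remember the full pipeline, in order
        # Disable immediately, so the object also works without `with`.
        pipeline[:] = [(n, c) for n, c in pipeline if n not in names]

    def restore(self):
        # Put every component back in its original position.
        self.pipeline[:] = self.original

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Restore even if the block raised; the exception still propagates.
        self.restore()


def pipe_names(pipeline):
    return [name for name, _ in pipeline]


pipeline = [("tagger", str.lower), ("parser", str.strip), ("ner", str.title)]

# 1. Use as a context manager
with ToyDisabledPipes(pipeline, "tagger", "parser"):
    inside = pipe_names(pipeline)
after_block = pipe_names(pipeline)

# 2. Restore manually
disabled = ToyDisabledPipes(pipeline, "ner")
while_disabled = pipe_names(pipeline)
disabled.restore()
after_restore = pipe_names(pipeline)

print(inside, after_block, while_disabled, after_restore)
```

Doing the disabling in `__init__` rather than `__enter__` is what makes the manual variant possible in this sketch: the components are already removed when the object is created, and `restore()` is the single place where they come back.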
| 
 | ||||
| Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method | ||||
| to remove pipeline components from an existing pipeline, the | ||||
| [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the | ||||
| [`replace_pipe`](/api/language#replace_pipe) method to replace them with a | ||||
| custom component entirely (more details on this in the section on | ||||
|  | @ -182,8 +296,8 @@ initializing a Language class via [`from_disk`](/api/language#from_disk). | |||
| - nlp = spacy.load('en', tagger=False, entity=False) | ||||
| - doc = nlp(u"I don't want parsed", parse=False) | ||||
| 
 | ||||
| + nlp = spacy.load('en', disable=['ner']) | ||||
| + nlp.remove_pipe('parser') | ||||
| + nlp = spacy.load("en", disable=["ner"]) | ||||
| + nlp.remove_pipe("parser") | ||||
| + doc = nlp(u"I don't want parsed") | ||||
| ``` | ||||
| 
 | ||||
|  |  | |||
|  | @ -623,8 +623,8 @@ solves this with a clear distinction between setting up the instance and loading | |||
| the data. | ||||
| 
 | ||||
| ```diff | ||||
| - nlp = spacy.load("en", path="/path/to/data") | ||||
| + nlp = spacy.blank("en").from_disk("/path/to/data") | ||||
| - nlp = spacy.load("en_core_web_sm", path="/path/to/data") | ||||
| + nlp = spacy.blank("en").from_disk("/path/to/data") | ||||
| ``` | ||||
| 
 | ||||
| </Infobox> | ||||
|  |  | |||