Update processing-pipelines.md to mention method for doc metadata (#7480)

* Update processing-pipelines.md

Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)

Link to a new example on the attributes page detailing the following:

> ```
> data = [
>   ("Some text to process", {"meta": "foo"}),
>   ("And more text...", {"meta": "bar"})
> ]
> 
> for doc, context in nlp.pipe(data, as_tuples=True):
>     # Let's assume you have a "meta" extension registered on the Doc
>     doc._.meta = context["meta"]
> ```

from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as

* Updating the attributes section

Update the attributes section with example of how extensions can be used to store metadata.

* Update processing-pipelines.md

* Update processing-pipelines.md

Made as_tuples example executable and relocated to the end of the "Processing Text" section.

* Update processing-pipelines.md

* Update processing-pipelines.md

Removed extra line

* Reformat and rephrase

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
langdonholmes 2021-04-19 02:58:12 -07:00 committed by svlandeg
parent fd6eebbfdc
commit cef9f25ec0

@@ -91,6 +91,37 @@ have to call `list()` on it first:
</Infobox>
You can use the `as_tuples` option to pass additional context along with each
doc when using [`nlp.pipe`](/api/language#pipe). If `as_tuples` is `True`, then
the input should be a sequence of `(text, context)` tuples and the output will
be a sequence of `(doc, context)` tuples. For example, you can pass metadata in
the context and save it in a [custom attribute](#custom-components-attributes):
```python
### {executable="true"}
import spacy
from spacy.tokens import Doc

# Register the custom attribute only if it does not exist yet
if not Doc.has_extension("text_id"):
    Doc.set_extension("text_id", default=None)

text_tuples = [
    ("This is the first text.", {"text_id": "text1"}),
    ("This is the second text.", {"text_id": "text2"})
]

nlp = spacy.load("en_core_web_sm")
doc_tuples = nlp.pipe(text_tuples, as_tuples=True)

# Copy each context value onto the matching doc
docs = []
for doc, context in doc_tuples:
    doc._.text_id = context["text_id"]
    docs.append(doc)

for doc in docs:
    print(f"{doc._.text_id}: {doc.text}")
```
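To make the `(text, context)` to `(doc, context)` pairing more concrete, here is a minimal pure-Python sketch of the idea. It is not spaCy's implementation: `pipe_as_tuples` and `fake_pipe` are hypothetical names used only for illustration, and `fake_pipe` stands in for `nlp.pipe` so the snippet runs without a model. The sketch assumes the processing step yields results in input order.

```python
import itertools
from typing import Any, Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def pipe_as_tuples(
    pairs: Iterable[Tuple[str, Any]],
    process: Callable[[Iterable[str]], Iterable[T]],
) -> Iterator[Tuple[T, Any]]:
    """Hypothetical helper: process the texts and re-attach each context."""
    # Duplicate the input stream so texts and contexts can be read separately
    pairs_a, pairs_b = itertools.tee(pairs)
    texts = (text for text, _ in pairs_a)
    contexts = (context for _, context in pairs_b)
    # Results come back in input order, so zip restores the pairing
    for result, context in zip(process(texts), contexts):
        yield result, context


def fake_pipe(texts: Iterable[str]) -> Iterator[str]:
    """Stand-in for nlp.pipe (an assumption): upper-cases each text."""
    for text in texts:
        yield text.upper()


text_tuples = [
    ("This is the first text.", {"text_id": "text1"}),
    ("This is the second text.", {"text_id": "text2"}),
]

results = list(pipe_as_tuples(text_tuples, fake_pipe))
for processed, context in results:
    print(f"{context['text_id']}: {processed}")
```

The context object is passed through untouched, so it can be any Python value, not only a dict.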
### Multiprocessing {#multiprocessing}
spaCy includes built-in support for multiprocessing with
@@ -1373,6 +1404,8 @@ There are three main types of extensions, which can be defined using the
[`Span.set_extension`](/api/span#set_extension) and
[`Token.set_extension`](/api/token#set_extension) methods.
## Description
1. **Attribute extensions.** Set a default value for an attribute, which can be
   overwritten manually at any time. Attribute extensions work like "normal"
   variables and are the quickest way to store arbitrary information on a `Doc`,