diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index 52568658d..bde3ab84f 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -91,6 +91,37 @@ have to call `list()` on it first: +You can use the `as_tuples` option to pass additional context along with each +doc when using [`nlp.pipe`](/api/language#pipe). If `as_tuples` is `True`, then +the input should be a sequence of `(text, context)` tuples and the output will +be a sequence of `(doc, context)` tuples. For example, you can pass metadata in +the context and save it in a [custom attribute](#custom-components-attributes): + +```python +### {executable="true"} +import spacy +from spacy.tokens import Doc + +if not Doc.has_extension("text_id"): + Doc.set_extension("text_id", default=None) + +text_tuples = [ + ("This is the first text.", {"text_id": "text1"}), + ("This is the second text.", {"text_id": "text2"}) +] + +nlp = spacy.load("en_core_web_sm") +doc_tuples = nlp.pipe(text_tuples, as_tuples=True) + +docs = [] +for doc, context in doc_tuples: + doc._.text_id = context["text_id"] + docs.append(doc) + +for doc in docs: + print(f"{doc._.text_id}: {doc.text}") +``` + ### Multiprocessing {#multiprocessing} spaCy includes built-in support for multiprocessing with @@ -1373,6 +1404,8 @@ There are three main types of extensions, which can be defined using the [`Span.set_extension`](/api/span#set_extension) and [`Token.set_extension`](/api/token#set_extension) methods. +## Description + 1. **Attribute extensions.** Set a default value for an attribute, which can be overwritten manually at any time. Attribute extensions work like "normal" variables and are the quickest way to store arbitrary information on a `Doc`,