Update processing-pipelines.md to mention method for doc metadata (#7480)

* Update processing-pipelines.md

  Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)

  Link to a new example on the attributes page detailing the following:

  > ```
  > data = [
  >     ("Some text to process", {"meta": "foo"}),
  >     ("And more text...", {"meta": "bar"})
  > ]
  >
  > for doc, context in nlp.pipe(data, as_tuples=True):
  >     # Let's assume you have a "meta" extension registered on the Doc
  >     doc._.meta = context["meta"]
  > ```

  from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as

* Updating the attributes section

  Update the attributes section with example of how extensions can be used to store metadata.

* Update processing-pipelines.md

* Update processing-pipelines.md

  Made as_tuples example executable and relocated to the end of the "Processing Text" section.

* Update processing-pipelines.md

* Update processing-pipelines.md

  Removed extra line

* Reformat and rephrase

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
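
The `(text, context)` to `(doc, context)` pairing this commit documents can be illustrated without spaCy installed. The following is a minimal pure-Python sketch under stated assumptions, not spaCy's implementation: `pipe_as_tuples` and `fake_pipeline` are hypothetical names invented here, and the sketch assumes the underlying pipeline yields results in the same order as its input.

```python
from itertools import tee
from typing import Any, Callable, Iterable, Iterator, Tuple


def pipe_as_tuples(
    process: Callable[[Iterable[str]], Iterable[Any]],
    text_tuples: Iterable[Tuple[str, Any]],
) -> Iterator[Tuple[Any, Any]]:
    """Yield (result, context) pairs for an iterable of (text, context) pairs.

    `process` is a hypothetical stand-in for a pipeline that maps a stream
    of texts to a stream of results in the same order.
    """
    # Duplicate the input stream so texts and contexts can be read separately.
    for_texts, for_contexts = tee(text_tuples)
    texts = (text for text, _ in for_texts)
    contexts = (context for _, context in for_contexts)
    # Re-pair each result with its context; this relies on order preservation.
    yield from zip(process(texts), contexts)


def fake_pipeline(texts: Iterable[str]) -> Iterator[str]:
    # Toy stand-in for a real pipeline: upper-cases each text.
    return (text.upper() for text in texts)


text_tuples = [
    ("This is the first text.", {"text_id": "text1"}),
    ("This is the second text.", {"text_id": "text2"}),
]

results = list(pipe_as_tuples(fake_pipeline, text_tuples))
for result, context in results:
    print(f"{context['text_id']}: {result}")
```

The context objects pass through untouched, which is why the documented example can copy `context["text_id"]` onto each `Doc` after processing.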
2026-01-09 10:11:24 +03:00 · 2021-04-19 02:58:12 -07:00 · 2021-04-19 02:58:12 -07:00 · cef9f25ec0
commit cef9f25ec0
parent fd6eebbfdc
1 changed file with 33 additions and 0 deletions
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -91,6 +91,37 @@ have to call `list()` on it first:

 </Infobox>

+You can use the `as_tuples` option to pass additional context along with each
+doc when using [`nlp.pipe`](/api/language#pipe). If `as_tuples` is `True`, then
+the input should be a sequence of `(text, context)` tuples and the output will
+be a sequence of `(doc, context)` tuples. For example, you can pass metadata in
+the context and save it in a [custom attribute](#custom-components-attributes):
+
+```python
+### {executable="true"}
+import spacy
+from spacy.tokens import Doc
+
+if not Doc.has_extension("text_id"):
+    Doc.set_extension("text_id", default=None)
+
+text_tuples = [
+    ("This is the first text.", {"text_id": "text1"}),
+    ("This is the second text.", {"text_id": "text2"})
+]
+
+nlp = spacy.load("en_core_web_sm")
+doc_tuples = nlp.pipe(text_tuples, as_tuples=True)
+
+docs = []
+for doc, context in doc_tuples:
+    doc._.text_id = context["text_id"]
+    docs.append(doc)
+
+for doc in docs:
+    print(f"{doc._.text_id}: {doc.text}")
+```
+
 ### Multiprocessing {#multiprocessing}

 spaCy includes built-in support for multiprocessing with
@@ -1373,6 +1404,8 @@ There are three main types of extensions, which can be defined using the
 [`Span.set_extension`](/api/span#set_extension) and
 [`Token.set_extension`](/api/token#set_extension) methods.

+## Description
+
 1. **Attribute extensions.** Set a default value for an attribute, which can be
   overwritten manually at any time. Attribute extensions work like "normal"
   variables and are the quickest way to store arbitrary information on a `Doc`,