simplify Python example

svlandeg 2023-09-07 14:55:54 +02:00
parent ecc017bbbc
commit 5daa3a2123
2 changed files with 185 additions and 37 deletions

title: Large Language Models
teaser: Integrating LLMs into structured NLP pipelines
menu:
  - ['Config and implementation', 'config']
  - ['Tasks', 'tasks']
  - ['Models', 'models']
  - ['Cache', 'cache']
Language Models (LLMs) into spaCy, featuring a modular system for **fast
prototyping** and **prompting**, and turning unstructured responses into
**robust outputs** for various NLP tasks, **no training data** required.

## Config and implementation {id="config"}

An LLM component is implemented through the `LLMWrapper` class. It is
accessible through a generic `llm`
[component factory](https://spacy.io/usage/processing-pipelines#custom-components-factories)
as well as through task-specific component factories:

- `llm_ner`
- `llm_spancat`
- `llm_rel`
- `llm_textcat`
- `llm_sentiment`
- `llm_summarization`
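
As a concrete illustration, here is the generic `llm` factory configured
entirely in Python. This is an editor's sketch rather than an official
snippet: the blank pipeline, the example text and the model name
`spacy.GPT-3-5.v2` are assumptions.

```python
# Sketch: the generic "llm" factory with task and model defined in the config.
# Assumes spacy-llm is installed and OPENAI_API_KEY is set in the environment.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "llm",
    config={
        "task": {
            "@llm_tasks": "spacy.NER.v3",
            "labels": ["PERSON", "ORGANISATION", "LOCATION"],
        },
        "model": {"@llm_models": "spacy.GPT-3-5.v2"},
    },
)
doc = nlp("Ingrid visited Paris.")
print([(ent.text, ent.label_) for ent in doc.ents])
```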

### LLMWrapper.\_\_init\_\_ {id="init",tag="method"}

> #### Example
>
> ```python
> # Construction via add_pipe with the default GPT-3.5 model and an explicitly defined NER task
> config = {"task": {"@llm_tasks": "spacy.NER.v3", "labels": ["PERSON", "ORGANISATION", "LOCATION"]}}
> llm = nlp.add_pipe("llm", config=config)
>
> # Construction via add_pipe with a task-specific factory and the default GPT-3.5 model
> parser = nlp.add_pipe("llm_ner")
>
> # Construction from class
> from spacy_llm.pipeline import LLMWrapper
> llm = LLMWrapper(vocab=nlp.vocab, task=task, model=model, cache=cache, save_io=True)
> ```

Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).

| Name           | Description                                                                                             |
| -------------- | ------------------------------------------------------------------------------------------------------- |
| `name`         | String name of the component instance. `llm` by default. ~~str~~                                         |
| _keyword-only_ |                                                                                                           |
| `vocab`        | The shared vocabulary. ~~Vocab~~                                                                          |
| `task`         | An [LLM Task](#tasks) can generate prompts and parse LLM responses. ~~LLMTask~~                           |
| `model`        | The [LLM Model](#models) queries a specific LLM API. ~~Callable[[Iterable[Any]], Iterable[Any]]~~         |
| `cache`        | [Cache](#cache) to use for caching prompts and responses per doc. ~~Cache~~                               |
| `save_io`      | Whether to save LLM I/O (prompts and responses) in the `Doc.user_data["llm_io"]` attribute. ~~bool~~      |
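
Setting `save_io=True` keeps a stringified copy of the LLM I/O on each
processed doc: `Doc.user_data["llm_io"]` contains one entry for every LLM
component in the pipeline, each itself a dictionary with `prompt` and
`response` keys. A minimal sketch follows; keying the entries by component
name and the example text are assumptions.

```python
# Sketch: inspect the raw prompt/response saved by an LLM component.
# Assumes spacy-llm is installed and OPENAI_API_KEY is set; the "llm_ner"
# dictionary key assumes entries are stored under the component name.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("llm_ner", config={"save_io": True})
doc = nlp("Ingrid visited Paris.")

io_data = doc.user_data["llm_io"]["llm_ner"]
print(io_data["prompt"])
print(io_data["response"])
```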

### LLMWrapper.\_\_call\_\_ {id="call",tag="method"}

Apply the pipe to one document. The document is modified in place and
returned. This usually happens under the hood when the `nlp` object is called
on a text and all pipeline components are applied to the `Doc` in order.

> #### Example
>
> ```python
> doc = nlp("Ingrid visited Paris.")
> llm_ner = nlp.add_pipe("llm_ner")
> # This usually happens under the hood
> processed = llm_ner(doc)
> ```

| Name        | Description                      |
| ----------- | -------------------------------- |
| `doc`       | The document to process. ~~Doc~~ |
| **RETURNS** | The processed document. ~~Doc~~  |

### LLMWrapper.pipe {id="pipe",tag="method"}

Apply the pipe to a stream of documents. This usually happens under the hood
when the `nlp` object is called on a text and all pipeline components are
applied to the `Doc` in order.

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> for doc in llm_ner.pipe(docs, batch_size=50):
>     pass
> ```

| Name           | Description                                                   |
| -------------- | ------------------------------------------------------------- |
| `docs`         | A stream of documents. ~~Iterable[Doc]~~                      |
| _keyword-only_ |                                                               |
| `batch_size`   | The number of documents to buffer. Defaults to `128`. ~~int~~ |
| **YIELDS**     | The processed documents in order. ~~Doc~~                     |
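
In a full pipeline you would typically batch through
[`nlp.pipe`](/api/language#pipe) rather than calling the component directly. A
short sketch, assuming an `nlp` pipeline with an LLM NER component as in the
examples above; the texts are illustrative:

```python
# Sketch: batch processing at the pipeline level rather than per component.
texts = ["Ingrid visited Paris.", "Jack and Jill visited France."]
for doc in nlp.pipe(texts, batch_size=50):
    print([(ent.text, ent.label_) for ent in doc.ents])
```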

### LLMWrapper.add_label {id="add_label",tag="method"}

Add a new label to the pipe's task. Alternatively, provide the labels when
defining the [task](#tasks), or through the `[initialize]` block of the
[config](#config).

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> llm_ner.add_label("MY_LABEL")
> ```

| Name        | Description                                                  |
| ----------- | ------------------------------------------------------------ |
| `label`     | The label to add. ~~str~~                                    |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~  |

### LLMWrapper.to_disk {id="to_disk",tag="method"}

Serialize the pipe to disk.

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> llm_ner.to_disk("/path/to/llm_ner")
> ```

| Name           | Description                                                                                                                                |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path`         | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~  |
| _keyword-only_ |                                                                                                                                              |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~                                                  |

### LLMWrapper.from_disk {id="from_disk",tag="method"}

Load the pipe from disk. Modifies the object in place and returns it.

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> llm_ner.from_disk("/path/to/llm_ner")
> ```

| Name           | Description                                                                                      |
| -------------- | ------------------------------------------------------------------------------------------------ |
| `path`         | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~   |
| _keyword-only_ |                                                                                                    |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~        |
| **RETURNS**    | The modified `LLMWrapper` object. ~~LLMWrapper~~                                                   |

### LLMWrapper.to_bytes {id="to_bytes",tag="method"}

Serialize the pipe to a bytestring.

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> ner_bytes = llm_ner.to_bytes()
> ```

| Name           | Description                                                                                  |
| -------------- | --------------------------------------------------------------------------------------------- |
| _keyword-only_ |                                                                                                |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~    |
| **RETURNS**    | The serialized form of the `LLMWrapper` object. ~~bytes~~                                      |

### LLMWrapper.from_bytes {id="from_bytes",tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

> #### Example
>
> ```python
> ner_bytes = llm_ner.to_bytes()
> llm_ner = nlp.add_pipe("llm_ner")
> llm_ner.from_bytes(ner_bytes)
> ```

| Name           | Description                                                                                  |
| -------------- | --------------------------------------------------------------------------------------------- |
| `bytes_data`   | The data to load from. ~~bytes~~                                                               |
| _keyword-only_ |                                                                                                |
| `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~    |
| **RETURNS**    | The `LLMWrapper` object. ~~LLMWrapper~~                                                        |

### LLMWrapper.labels {id="labels",tag="property"}

The labels currently added to the component. Empty tuple if the LLM's task
does not require labels.

> #### Example
>
> ```python
> llm_ner.add_label("MY_LABEL")
> assert "MY_LABEL" in llm_ner.labels
> ```

| Name        | Description                                            |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

A note on `validate_types`: by default, `spacy-llm` checks whether the
signatures of the configured `model` and `task` are consistent with each
other, and emits a warning if they don't match. Set `validate_types` to
`False` to disable this check.
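
A sketch of turning the check off through the factory config, assuming an
existing `nlp` pipeline; the task settings shown are illustrative:

```python
# Sketch: disable the model/task signature check on the generic "llm" factory.
nlp.add_pipe(
    "llm",
    config={
        "task": {"@llm_tasks": "spacy.NER.v3", "labels": ["PERSON"]},
        "validate_types": False,
    },
)
```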

## Tasks {id="tasks"}

### Example 3: Create the component directly in Python {id="example-3"}

The `llm` component behaves as any other component does, and there are
[task-specific components](/api/large-language-models#config) defined to help
you hit the ground running with a reasonable built-in task implementation.

```python