Apply suggestions from review

Victoria Slocum 2023-07-18 10:42:27 +02:00
parent acf63e55ac
commit 0b97fff92d
3 changed files with 129 additions and 176 deletions

View File

@ -60,52 +60,6 @@ prompt. Further, the task defines how to parse the LLM's responses back into
structured information. All tasks are registered in spaCy's `llm_tasks`
registry.
Practically speaking, a task should adhere to the `Protocol` `LLMTask` defined
in [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). It needs to define a
`generate_prompts` function and a `parse_responses` function.
Moreover, the task may define an optional
[`scorer` method](https://spacy.io/api/scorer#score). It should accept an
iterable of `Example`s as input and return a score dictionary. If the `scorer`
method is defined, `spacy-llm` will call it to evaluate the component.
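As a rough sketch (the registry handle, class name and prompt below are hypothetical and deliberately minimal), a custom task adhering to this protocol could look like this:
```python
from typing import Iterable
from spacy.tokens import Doc
from spacy_llm.registry import registry

class SimpleNERTask:
    def __init__(self, labels: str):
        self._labels = labels.split(",")

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        # One prompt per doc, asking for entities with the configured labels.
        for doc in docs:
            yield f"Find entities of types {', '.join(self._labels)} in: {doc.text}"

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[str]
    ) -> Iterable[Doc]:
        # Turn the raw LLM responses back into annotations on the docs.
        for doc, response in zip(docs, responses):
            # ... parse `response` and set doc.ents accordingly ...
            yield doc

@registry.llm_tasks("my.SimpleNER.v1")
def make_simple_ner_task(labels: str) -> SimpleNERTask:
    return SimpleNERTask(labels=labels)
```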
#### Providing examples for few-shot prompts {id="few-shot-prompts"}
All built-in tasks support few-shot prompts, i. e. including examples in a
prompt. Examples can be supplied in two ways: (1) as a separate file containing
only examples or (2) by initializing `llm` with a `get_examples()` callback
(like any other spaCy pipeline component).
##### (1) Few-shot example file
A file containing examples for few-shot prompting can be configured like this:
```ini
[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = PERSON,ORGANISATION,LOCATION
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "ner_examples.yml"
```
The supplied file has to conform to the format expected by the required task
(see the task documentation further down).
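For the NER task configured above, such a file could look roughly like this (a sketch only; see the task-specific sections below for the exact schema each task expects):
```yaml
- text: Jack and Jill went up the hill.
  entities:
    PERSON:
      - Jack
      - Jill
    LOCATION:
      - hill
```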
##### (2) Initializing the `llm` component with a `get_examples()` callback
Alternatively, you can initialize your `nlp` pipeline by providing a
`get_examples` callback for
[`nlp.initialize`](https://spacy.io/api/language#initialize) and setting
`n_prompt_examples` to a positive number to automatically fetch a few examples
for few-shot learning. Set `n_prompt_examples` to `-1` to use all examples as
part of the few-shot learning prompt.
```ini
[initialize.components.llm]
n_prompt_examples = 3
```
#### task.generate_prompts {id="task-generate-prompts"}
Takes a collection of documents, and returns a collection of "prompts", which
@ -161,9 +115,10 @@ note that this requirement will be included in the prompt, but the task doesn't
perform a hard cut-off. It's hence possible that your summary exceeds
`max_n_words`.
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: >
@ -239,9 +194,10 @@ the following parameters:
expand the span to the next token boundaries, e.g. expanding `"New Y"` out to
`"New York"`.
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: Jack and Jill went up the hill.
@ -328,9 +284,10 @@ the following parameters:
expand the span to the next token boundaries, e.g. expanding `"New Y"` out to
`"New York"`.
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: Jack and Jill went up the hill.
@ -442,9 +399,10 @@ definitions are included in the prompt.
| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```json
[
@ -496,9 +454,10 @@ prompting and includes an improved prompt template.
| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```json
[
@ -545,9 +504,10 @@ prompting.
| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```json
[
@ -593,9 +553,10 @@ on an upstream NER component for entity extraction.
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```json
{"text": "Laura bought a house in Boston with her husband Mark.", "ents": [{"start_char": 0, "end_char": 5, "label": "PERSON"}, {"start_char": 24, "end_char": 30, "label": "GPE"}, {"start_char": 48, "end_char": 52, "label": "PERSON"}], "relations": [{"dep": 0, "dest": 1, "relation": "LivesIn"}, {"dep": 2, "dest": 1, "relation": "LivesIn"}]}
@ -654,9 +615,10 @@ doesn't match the number of tokens recognized by spaCy, no lemmas are stored in
the corresponding doc's tokens. Otherwise, each token's `.lemma_` property is
updated with the lemma suggested by the LLM.
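For instance, assuming a config file that wires up an `llm` component with this lemmatization task (the file name here is hypothetical), the results can be read off the doc as usual:
```python
from spacy_llm.util import assemble

# Assemble the pipeline from a config using the lemma task (hypothetical file).
nlp = assemble("lemma_config.cfg")
doc = nlp("I'm buying ice cream.")
for token in doc:
    print(token.text, token.lemma_)
```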
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: I'm buying ice cream.
@ -706,9 +668,10 @@ issues (e. g. in case of unexpected LLM responses) the value might be `None`.
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `field` | Name of extension attribute to store the sentiment score in (i. e. the score will be available in `doc._.{field}`). Defaults to `sentiment`. ~~str~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: 'This is horrifying.'
@ -751,43 +714,7 @@ it's a function of type `Callable[[Iterable[Any]], Iterable[Any]]`, but specific
implementations can have other signatures, like
`Callable[[Iterable[str]], Iterable[str]]`.
All built-in models are registered in `llm_models`. If no model is specified,
`spacy-llm` currently connects to the OpenAI API by default using REST and
accesses the `"gpt-3.5-turbo"` model.
Currently, three different approaches to using LLMs are supported:
1. `spacy-llm`'s native REST interface. This is the default for all hosted
   models (e. g. OpenAI, Cohere, Anthropic, ...).
2. A HuggingFace integration that lets you run a limited set of HF models
   locally.
3. A LangChain integration that lets you run any model supported by LangChain
   (hosted or local).
Approaches 1 and 2 are the defaults for hosted and local models, respectively.
Alternatively, you can use LangChain to access hosted or local models by
specifying one of the models registered with the `langchain.` prefix.
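To use a specific model instead, you can override the `model` section of the config. The following is a minimal sketch; the registered handle shown is illustrative, so check the models documented further down for the exact names available:
```ini
[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
name = "gpt-3.5-turbo"
```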
<Infobox>
_Why use LangChain if there are also a native REST and a HuggingFace interface? When should I use what?_
Third-party libraries like `langchain` focus on prompt management, integration
of many different LLM APIs, and other related features such as conversational
memory or agents. `spacy-llm`, on the other hand, emphasizes features we
consider useful in the context of NLP pipelines utilizing LLMs to process
documents (mostly) independently of each other. It makes sense that the feature
sets of such third-party libraries and `spacy-llm` aren't identical - and users
might want to take advantage of features not available in `spacy-llm`.
The advantage of implementing our own REST and HuggingFace integrations is that
we can ensure a larger degree of stability and robustness, as we can guarantee
backwards compatibility and more smoothly integrated error handling.
If, however, you need features or APIs not natively covered by `spacy-llm`,
it's trivial to use LangChain to cover this - and easy to customize the
prompting mechanism, if so required.
</Infobox>
#### API Keys {id="api-keys"}
Note that when using hosted services, you have to ensure that the proper API
keys are set as environment variables as described by the corresponding
@ -1377,7 +1304,7 @@ python -m pip install "accelerate>=0.16.0,<1.0"
> ```
| Argument | Description |
| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | The name of an OpenLLaMA model that is supported. ~~Literal["open_llama_3b", "open_llama_7b", "open_llama_7b_v2", "open_llama_13b"]~~ |
| `config_init` | Further configuration passed on to the construction of the model with `transformers.AutoModelForCausalLM.from_pretrained()`. Defaults to `{}`. ~~Dict[str, Any]~~ |
| `config_run` | Further configuration used during model inference. Defaults to `{}`. ~~Dict[str, Any]~~ |
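As a sketch of how `config_init` can be used - for example, to let `accelerate` place the model on the available device(s) - the following is illustrative rather than prescriptive:
```ini
[components.llm.model]
@llm_models = "spacy.OpenLLaMA.v1"
name = "open_llama_3b"

[components.llm.model.config_init]
device_map = "auto"
```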

View File

@ -9,8 +9,6 @@ menu:
- ['API', 'api']
- ['Tasks', 'tasks']
- ['Models', 'models']
- ['Ongoing work', 'ongoing-work']
- ['Issues', 'issues']
---
[The spacy-llm package](https://github.com/explosion/spacy-llm) integrates Large
@ -359,15 +357,18 @@ Practically speaking, a task should adhere to the `Protocol` `LLMTask` defined
in [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). It needs to define a
`generate_prompts` function and a `parse_responses` function.
Moreover, the task may define an optional
[`scorer` method](https://spacy.io/api/scorer#score). It should accept an
iterable of `Example`s as input and return a score dictionary. If the `scorer`
method is defined, `spacy-llm` will call it to evaluate the component.
| Component | Description |
| --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`task.generate_prompts`](/api/large-language-models#task-generate-prompts) | Takes a collection of documents, and returns a collection of "prompts", which can be of type `Any`. |
| [`task.parse_responses`](/api/large-language-models#task-parse-responses) | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. |
The following built-in tasks are available:
| Task | Description |
| ----------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`spacy.Summarization.v1`](/api/large-language-models#summarization-v1) | The summarization task prompts the model for a concise summary of the provided text. |
| [`spacy.NER.v2`](/api/large-language-models#ner-v2) | The built-in NER task supports both zero-shot and few-shot prompting. This version also supports explicitly defining the provided labels with custom descriptions. |
| [`spacy.NER.v1`](/api/large-language-models#ner-v1) | The original version of the built-in NER task supports both zero-shot and few-shot prompting. |
@ -381,6 +382,43 @@ method is defined, `spacy-llm` will call it to evaluate the component.
| [`spacy.Sentiment.v1`](/api/large-language-models#sentiment-v1) | Performs sentiment analysis on provided texts. |
| [`spacy.NoOp.v1`](/api/large-language-models#noop-v1) | This task is only useful for testing - it tells the LLM to do nothing, and does not set any fields on the `docs`. |
#### Providing examples for few-shot prompts {id="few-shot-prompts"}
All built-in tasks support few-shot prompts, i. e. including examples in a
prompt. Examples can be supplied in two ways: (1) as a separate file containing
only examples or (2) by initializing `llm` with a `get_examples()` callback
(like any other spaCy pipeline component).
##### (1) Few-shot example file
A file containing examples for few-shot prompting can be configured like this:
```ini
[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = PERSON,ORGANISATION,LOCATION
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "ner_examples.yml"
```
The supplied file has to conform to the format expected by the required task
(see the task documentation further down).
##### (2) Initializing the `llm` component with a `get_examples()` callback
Alternatively, you can initialize your `nlp` pipeline by providing a
`get_examples` callback for
[`nlp.initialize`](https://spacy.io/api/language#initialize) and setting
`n_prompt_examples` to a positive number to automatically fetch a few examples
for few-shot learning. Set `n_prompt_examples` to `-1` to use all examples as
part of the few-shot learning prompt.
```ini
[initialize.components.llm]
n_prompt_examples = 3
```
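As a minimal sketch of this mechanism in code (the task config is abbreviated and the example data is illustrative), the callback passed to `nlp.initialize` should yield `Example` objects, from which `spacy-llm` then samples `n_prompt_examples` when it is set in the config as shown above:
```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe(
    "llm",
    config={"task": {"@llm_tasks": "spacy.NER.v2", "labels": "PERSON,ORGANISATION,LOCATION"}},
)

def get_examples():
    # Yield gold-standard examples; spacy-llm samples from these for the
    # few-shot prompt.
    doc = nlp.make_doc("Jack and Jill went up the hill.")
    yield Example.from_dict(doc, {"entities": [(0, 4, "PERSON"), (9, 13, "PERSON")]})

nlp.initialize(get_examples=get_examples)
```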
### Model {id="models"}
A _model_ defines which LLM to query, and how to query it. It can be a
@ -408,29 +446,33 @@ Approaches 1 and 2 are the defaults for hosted and local models,
respectively. Alternatively, you can use LangChain to access hosted or local
models by specifying one of the models registered with the `langchain.` prefix.
<Infobox>
_Why use LangChain if there are also a native REST and a HuggingFace interface? When should I use what?_
Third-party libraries like `langchain` focus on prompt management, integration
of many different LLM APIs, and other related features such as conversational
memory or agents. `spacy-llm`, on the other hand, emphasizes features we
consider useful in the context of NLP pipelines utilizing LLMs to process
documents (mostly) independently of each other. It makes sense that the feature
sets of such third-party libraries and `spacy-llm` aren't identical - and users
might want to take advantage of features not available in `spacy-llm`.
The advantage of implementing our own REST and HuggingFace integrations is that
we can ensure a larger degree of stability and robustness, as we can guarantee
backwards compatibility and more smoothly integrated error handling.
If, however, you need features or APIs not natively covered by `spacy-llm`,
it's trivial to use LangChain to cover this - and easy to customize the
prompting mechanism, if so required.
</Infobox>
<Infobox variant="warning">
Note that when using hosted services, you have to ensure that the [proper API
keys](/api/large-language-models#api-keys) are set as environment variables as described by the corresponding
provider's documentation.
E. g. when using OpenAI, you have to get an API key from openai.com, and ensure
that the keys are set as environment variables:
```shell
export OPENAI_API_KEY="sk-..."
export OPENAI_API_ORG="org-..."
```
For Cohere it's
```shell
export CO_API_KEY="..."
```
and for Anthropic
```shell
export ANTHROPIC_API_KEY="..."
```
</Infobox>
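A quick, purely illustrative way to check from Python that a key is actually visible to your process:
```python
import os

# Fail early if the key isn't set; hosted OpenAI models won't be reachable otherwise.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set.")
```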
| Component | Description |
| ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------ |
@ -473,23 +515,3 @@ documents at each run that keeps batches of documents stored on disk.
| [`spacy.FewShotReader.v1`](/api/large-language-models#fewshotreader-v1) | This function is registered in spaCy's `misc` registry, and reads in examples from a `.yml`, `.yaml`, `.json` or `.jsonl` file. It uses [`srsly`](https://github.com/explosion/srsly) to read in these files and parses them depending on the file extension. |
| [`spacy.FileReader.v1`](/api/large-language-models#filereader-v1) | This function is registered in spaCy's `misc` registry, and reads a file provided to the `path` to return a `str` representation of its contents. This function is typically used to read [Jinja](https://jinja.palletsprojects.com/en/3.1.x/) files containing the prompt template. |
| [Normalizer functions](/api/large-language-models#normalizer-functions) | These functions provide simple normalizations for string comparisons, e.g. between a list of specified labels and a label given in the raw text of the LLM response. |
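For example, `spacy.FileReader.v1` can be used to plug a custom Jinja prompt template into a task that accepts a `template` argument; the template path below is hypothetical:
```ini
[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = PERSON,ORGANISATION,LOCATION

[components.llm.task.template]
@misc = "spacy.FileReader.v1"
path = "ner_template.jinja"
```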
## Ongoing work {id="ongoing-work"}
In the near future, we will
- Add more example tasks
- Support a broader range of models
- Provide more example use-cases and tutorials
- Make the built-in tasks easier to customize via Jinja templates to define the
instructions & examples
PRs are always welcome!
## Reporting issues {id="issues"}
If you have questions regarding the usage of `spacy-llm`, or want to give us
feedback after giving it a spin, please use the
[discussion board](https://github.com/explosion/spaCy/discussions). Bug reports
can be filed on the
[spaCy issue tracker](https://github.com/explosion/spaCy/issues). Thank you!

View File

@ -37,7 +37,11 @@
{ "text": "spaCy Projects", "url": "/usage/projects", "tag": "new" },
{ "text": "Saving & Loading", "url": "/usage/saving-loading" },
{ "text": "Visualizers", "url": "/usage/visualizers" },
{ "text": "Large Language Models", "url": "/usage/large-language-models", "tag": "new" }
{
"text": "Large Language Models",
"url": "/usage/large-language-models",
"tag": "new"
}
]
},
{
@ -102,6 +106,7 @@
{ "text": "EntityLinker", "url": "/api/entitylinker" },
{ "text": "EntityRecognizer", "url": "/api/entityrecognizer" },
{ "text": "EntityRuler", "url": "/api/entityruler" },
{ "text": "Large Language Models", "url": "/api/large-language-models" },
{ "text": "Lemmatizer", "url": "/api/lemmatizer" },
{ "text": "Morphologizer", "url": "/api/morphologizer" },
{ "text": "SentenceRecognizer", "url": "/api/sentencerecognizer" },
@ -134,7 +139,6 @@
{ "text": "Corpus", "url": "/api/corpus" },
{ "text": "InMemoryLookupKB", "url": "/api/inmemorylookupkb" },
{ "text": "KnowledgeBase", "url": "/api/kb" },
{ "text": "Large Language Models", "url": "/api/large-language-models" },
{ "text": "Lookups", "url": "/api/lookups" },
{ "text": "MorphAnalysis", "url": "/api/morphology#morphanalysis" },
{ "text": "Morphology", "url": "/api/morphology" },