---
title: Large Language Models
teaser: Integrating LLMs into structured NLP pipelines
menu:
  - ['Config and implementation', 'config']
  - ['Tasks', 'tasks']
  - ['Models', 'models']
  - ['Cache', 'cache']
  - ['Various Functions', 'various-functions']
---

[The spacy-llm package](https://github.com/explosion/spacy-llm) integrates Large
Language Models (LLMs) into spaCy, featuring a modular system for **fast
prototyping** and **prompting**, and turning unstructured responses into
**robust outputs** for various NLP tasks, **no training data** required.

## Config and implementation {id="config"}

An LLM component is implemented through the `LLMWrapper` class. It is accessible
through a generic `llm`
[component factory](https://spacy.io/usage/processing-pipelines#custom-components-factories)
as well as through task-specific component factories: `llm_ner`, `llm_spancat`, `llm_rel`,
`llm_textcat`, `llm_sentiment` and `llm_summarization`.

### LLMWrapper.\_\_init\_\_ {id="init",tag="method"}

> #### Example
>
> ```python
> # Construction via add_pipe with the default GPT-3.5 model and an explicitly defined task
> config = {"task": {"@llm_tasks": "spacy.NER.v3", "labels": ["PERSON", "ORGANISATION", "LOCATION"]}}
> llm = nlp.add_pipe("llm", config=config)
>
> # Construction via add_pipe with a task-specific factory and the default GPT-3.5 model
> llm = nlp.add_pipe("llm_ner")
>
> # Construction from class
> from spacy_llm.pipeline import LLMWrapper
> llm = LLMWrapper(vocab=nlp.vocab, task=task, model=model, cache=cache, save_io=True)
> ```

Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).

| Name | Description |
| -------------- | -------------------------------------------------------------------------------------------------- |
| `name` | String name of the component instance. `llm` by default. ~~str~~ |
| _keyword-only_ | |
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `task` | An [LLM Task](#tasks) can generate prompts and parse LLM responses. ~~LLMTask~~ |
| `model` | The [LLM Model](#models) queries a specific LLM API. ~~Callable[[Iterable[Any]], Iterable[Any]]~~ |
| `cache` | [Cache](#cache) to use for caching prompts and responses per doc. ~~Cache~~ |
| `save_io` | Whether to save LLM I/O (prompts and responses) in the `Doc._.llm_io` custom attribute. ~~bool~~ |

### LLMWrapper.\_\_call\_\_ {id="call",tag="method"}

Apply the pipe to one document. The document is modified in place and returned.
This usually happens under the hood when the `nlp` object is called on a text
and all pipeline components are applied to the `Doc` in order.

> #### Example
>
> ```python
> doc = nlp("Ingrid visited Paris.")
> llm_ner = nlp.add_pipe("llm_ner")
> # This usually happens under the hood
> processed = llm_ner(doc)
> ```

| Name | Description |
| ----------- | -------------------------------- |
| `doc` | The document to process. ~~Doc~~ |
| **RETURNS** | The processed document. ~~Doc~~ |

### LLMWrapper.pipe {id="pipe",tag="method"}

Apply the pipe to a stream of documents. This usually happens under the hood
when the `nlp` object is called on a text and all pipeline components are
applied to the `Doc` in order.

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> for doc in llm_ner.pipe(docs, batch_size=50):
>     pass
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------- |
| `docs` | A stream of documents. ~~Iterable[Doc]~~ |
| _keyword-only_ | |
| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
| **YIELDS** | The processed documents in order. ~~Doc~~ |

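
The buffering controlled by `batch_size` can be pictured with a small sketch
that runs without spaCy. The helper name `batched` is hypothetical and not part
of spacy-llm. It only illustrates how a stream can be consumed in chunks while
the original order is preserved.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Five placeholder "documents" buffered two at a time.
batches = list(batched(["d1", "d2", "d3", "d4", "d5"], batch_size=2))
```

The last batch may be smaller than `batch_size`, which is why the sketch stops
only when `islice` returns nothing at all.
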
### LLMWrapper.add_label {id="add_label",tag="method"}

Add a new label to the pipe's task. Alternatively, provide the labels in the
[task](#tasks) definition, or through the `[initialize]` block of the
[config](#config).

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> llm_ner.add_label("MY_LABEL")
> ```

| Name | Description |
| ----------- | ----------------------------------------------------------- |
| `label` | The label to add. ~~str~~ |
| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |

### LLMWrapper.to_disk {id="to_disk",tag="method"}

Serialize the pipe to disk.

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> llm_ner.to_disk("/path/to/llm_ner")
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |

### LLMWrapper.from_disk {id="from_disk",tag="method"}

Load the pipe from disk. Modifies the object in place and returns it.

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> llm_ner.from_disk("/path/to/llm_ner")
> ```

| Name | Description |
| -------------- | ----------------------------------------------------------------------------------------------- |
| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The modified `LLMWrapper` object. ~~LLMWrapper~~ |

### LLMWrapper.to_bytes {id="to_bytes",tag="method"}

> #### Example
>
> ```python
> llm_ner = nlp.add_pipe("llm_ner")
> ner_bytes = llm_ner.to_bytes()
> ```

Serialize the pipe to a bytestring.

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The serialized form of the `LLMWrapper` object. ~~bytes~~ |

### LLMWrapper.from_bytes {id="from_bytes",tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

> #### Example
>
> ```python
> ner_bytes = llm_ner.to_bytes()
> llm_ner = nlp.add_pipe("llm_ner")
> llm_ner.from_bytes(ner_bytes)
> ```

| Name | Description |
| -------------- | ------------------------------------------------------------------------------------------- |
| `bytes_data` | The data to load from. ~~bytes~~ |
| _keyword-only_ | |
| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
| **RETURNS** | The `LLMWrapper` object. ~~LLMWrapper~~ |

### LLMWrapper.labels {id="labels",tag="property"}

The labels currently added to the component. Empty tuple if the LLM's task does
not require labels.

> #### Example
>
> ```python
> llm_ner.add_label("MY_LABEL")
> assert "MY_LABEL" in llm_ner.labels
> ```

| Name | Description |
| ----------- | ------------------------------------------------------ |
| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |

## Tasks {id="tasks"}

### Task implementation {id="task-implementation"}

A _task_ defines an NLP problem or question that will be sent to the LLM via a
prompt. Further, the task defines how to parse the LLM's responses back into
structured information. All tasks are registered in the `llm_tasks` registry.

#### task.generate_prompts {id="task-generate-prompts"}

Takes a collection of documents, and returns a collection of "prompts", which
can be of type `Any`. Often, prompts are of type `str`, but this is not
enforced, to allow for maximum flexibility in the framework.

| Argument | Description |
| ----------- | ---------------------------------------- |
| `docs` | The input documents. ~~Iterable[Doc]~~ |
| **RETURNS** | The generated prompts. ~~Iterable[Any]~~ |

#### task.parse_responses {id="task-parse-responses"}

Takes a collection of LLM responses and the original documents, parses the
responses into structured information, and sets the annotations on the
documents. The `parse_responses` function is free to set the annotations in any
way, including `Doc` fields like `ents`, `spans` or `cats`, or using custom
defined fields.

The `responses` are of type `Iterable[Any]`, though they will often be `str`
objects. This depends on the return type of the [model](#models).

| Argument | Description |
| ----------- | ------------------------------------------ |
| `docs` | The input documents. ~~Iterable[Doc]~~ |
| `responses` | The responses received from the LLM. ~~Iterable[Any]~~ |
| **RETURNS** | The annotated documents. ~~Iterable[Doc]~~ |

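
To make the two methods concrete, here is a minimal sketch of an object with
`generate_prompts` and `parse_responses`. It is an illustration only: `FakeDoc`
is a hypothetical stand-in for spaCy's `Doc` so that the snippet runs without
spaCy installed, `YesNoQuestionTask` is an invented toy task, and a canned list
plays the role of the model's responses.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class FakeDoc:
    """Hypothetical stand-in for spaCy's Doc, so the sketch runs without spaCy."""

    text: str
    cats: Dict[str, float] = field(default_factory=dict)


class YesNoQuestionTask:
    """Toy task: ask whether a text is a question and store the answer in `cats`."""

    def generate_prompts(self, docs: Iterable[FakeDoc]) -> Iterable[str]:
        # One prompt per document; prompts happen to be strings here.
        for doc in docs:
            yield f"Is the following text a question? Answer yes or no.\n\n{doc.text}"

    def parse_responses(
        self, docs: Iterable[FakeDoc], responses: Iterable[str]
    ) -> Iterable[FakeDoc]:
        # Responses are expected in the same order as the documents.
        for doc, response in zip(docs, responses):
            answer = response.strip().lower()
            doc.cats["QUESTION"] = 1.0 if answer.startswith("yes") else 0.0
            yield doc


task = YesNoQuestionTask()
docs = [FakeDoc("Where is Paris?"), FakeDoc("Paris is in France.")]
prompts = list(task.generate_prompts(docs))
# A canned list stands in for what a model would return.
annotated = list(task.parse_responses(docs, ["Yes", " no."]))
```

A real task would additionally be registered in the `llm_tasks` registry so
that it can be referenced from the config. That step is left out here.
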
### Summarization {id="summarization"}

A summarization task takes a document as input and generates a summary that is
stored in an extension attribute.

#### spacy.Summarization.v1 {id="summarization-v1"}

The `spacy.Summarization.v1` task supports both zero-shot and few-shot
prompting.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.Summarization.v1"
> examples = null
> max_n_words = null
> ```

| Argument | Description |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `template` | Custom prompt template to send to LLM model. Defaults to [summarization.v1.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/summarization.v1.jinja). ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `max_n_words` | Maximum number of words to be used in summary. Note that this should not be expected to work exactly. Defaults to `None`. ~~Optional[int]~~ |
| `field` | Name of extension attribute to store summary in (i.e. the summary will be available in `doc._.{field}`). Defaults to `summary`. ~~str~~ |

The summarization task prompts the model for a concise summary of the provided
text. It optionally lets you limit the response to a certain number of words.
This requirement is included in the prompt, but the task doesn't perform a
hard cut-off, so your summary may exceed `max_n_words`.

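
If a strict limit matters for your application, one option is to shorten the
summary yourself after the pipeline has run. The helper below is a hypothetical
post-processing step, not something the task does for you. It counts
whitespace-separated words, which is only one possible definition of a "word".

```python
def truncate_to_max_words(summary: str, max_n_words: int) -> str:
    """Keep at most `max_n_words` whitespace-separated words of `summary`."""
    if max_n_words < 0:
        raise ValueError("max_n_words must not be negative")
    words = summary.split()
    return " ".join(words[:max_n_words])


summary = "The UN promotes global peace, cooperation, and harmony among nations."
shortened = truncate_to_max_words(summary, 5)
```

Note that a plain cut like this can end mid-sentence, so it trades readability
for a guaranteed upper bound.
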
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.

```yaml
- text: >
    The United Nations, referred to informally as the UN, is an
    intergovernmental organization whose stated purposes are to maintain
    international peace and security, develop friendly relations among nations,
    achieve international cooperation, and serve as a centre for harmonizing the
    actions of nations. It is the world's largest international organization.
    The UN is headquartered on international territory in New York City, and the
    organization has other offices in Geneva, Nairobi, Vienna, and The Hague,
    where the International Court of Justice is headquartered.\n\n The UN was
    established after World War II with the aim of preventing future world wars,
    and succeeded the League of Nations, which was characterized as
    ineffective.
  summary:
    'The UN is an international organization that promotes global peace,
    cooperation, and harmony. Established after WWII, its purpose is to prevent
    future world wars.'
```

```ini
[components.llm.task]
@llm_tasks = "spacy.Summarization.v1"
max_n_words = 20
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "summarization_examples.yml"
```

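
As a rough idea of what reading such an example file involves, the sketch
below loads examples from `.json` and `.jsonl` files using only the standard
library. `read_examples` is a hypothetical helper and not the implementation of
`spacy.FewShotReader.v1`. YAML is left out because it needs a third-party
parser.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def read_examples(path: Path) -> List[Dict[str, Any]]:
    """Load examples from a .json file (one list) or a .jsonl file (one object per line)."""
    suffix = path.suffix.lower()
    text = path.read_text(encoding="utf8")
    if suffix == ".json":
        data = json.loads(text)
        if not isinstance(data, list):
            raise ValueError("expected a JSON list of examples")
        return data
    if suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported file type: {suffix}")


# Write a tiny .jsonl file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "summarization_examples.jsonl"
    example_path.write_text(
        json.dumps({"text": "A long text.", "summary": "Short."}) + "\n",
        encoding="utf8",
    )
    examples = read_examples(example_path)
```
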
### NER {id="ner"}

The NER task identifies non-overlapping entities in text.

#### spacy.NER.v3 {id="ner-v3"}

Version 3 is fundamentally different to v1 and v2, as it implements
Chain-of-Thought prompting, based on the
[PromptNER paper](https://arxiv.org/pdf/2305.15444.pdf) by Ashok and Lipton
(2023). On an internal use-case, we have found this implementation to obtain
significantly better accuracy, with an increase in F-score of up to 15
percentage points.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.NER.v3"
> labels = ["PERSON", "ORGANISATION", "LOCATION"]
> ```

When no examples are [specified](/usage/large-language-models#few-shot-prompts),
the v3 implementation will use a dummy example in the prompt. Technically this
means that the task will always perform few-shot prompting under the hood.

| Argument | Description |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ |
| `label_definitions` | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
| `template` | Custom prompt template to send to LLM model. Defaults to [ner.v3.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/ner.v3.jinja). ~~str~~ |
| `description` (NEW) | A description of what to recognize or not recognize as entities. ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ |
| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ |
| `case_sensitive_matching` | Whether to match entities case-sensitively. Defaults to `False`. ~~bool~~ |

Note that the `single_match` parameter, used in v1 and v2, is not supported
anymore, as the CoT parsing algorithm takes care of this automatically.

New to v3 is the fact that you can provide an explicit description of what
entities should look like. You can use this feature in addition to
`label_definitions`.

```ini
[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["DISH", "INGREDIENT", "EQUIPMENT"]
description = Entities are the names of food dishes,
    ingredients, and any kind of cooking equipment.
    Adjectives, verbs, adverbs are not entities.
    Pronouns are not entities.

[components.llm.task.label_definitions]
DISH = "Known food dishes, e.g. Lobster Ravioli, garlic bread"
INGREDIENT = "Individual parts of a food dish, including herbs and spices."
EQUIPMENT = "Any kind of cooking equipment. e.g. oven, cooking pot, grill"
```

To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.

While not required, this task works best when both positive and negative
examples are provided. The format differs from the files required for v1 and
v2, as additional fields such as `is_entity` and `reason` should now be
provided.

```json
[
  {
    "text": "You can't get a great chocolate flavor with carob.",
    "spans": [
      {
        "text": "chocolate",
        "is_entity": false,
        "label": "==NONE==",
        "reason": "is a flavor in this context, not an ingredient"
      },
      {
        "text": "carob",
        "is_entity": true,
        "label": "INGREDIENT",
        "reason": "is an ingredient to add chocolate flavor"
      }
    ]
  },
  ...
]
```

```ini
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "${paths.examples}"
```

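
Because the v3 example format carries more fields than before, it can be handy
to sanity-check an example file before pointing the config at it. The checker
below is a hypothetical helper written for this page, not part of spacy-llm. It
assumes that every span needs the four fields shown in the JSON example above
and that the span text should occur in the example text.

```python
import json
from typing import Any, Dict, List

REQUIRED_SPAN_FIELDS = {"text", "is_entity", "label", "reason"}


def check_v3_examples(examples: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for i, example in enumerate(examples):
        if "text" not in example:
            problems.append(f"example {i}: missing 'text'")
        for j, span in enumerate(example.get("spans", [])):
            missing = REQUIRED_SPAN_FIELDS - set(span)
            if missing:
                problems.append(f"example {i}, span {j}: missing {sorted(missing)}")
            elif span["text"] not in example.get("text", ""):
                problems.append(f"example {i}, span {j}: span text not found in example text")
    return problems


raw = """
[
  {
    "text": "You can't get a great chocolate flavor with carob.",
    "spans": [
      {"text": "chocolate", "is_entity": false, "label": "==NONE==",
       "reason": "is a flavor in this context, not an ingredient"},
      {"text": "carob", "is_entity": true, "label": "INGREDIENT",
       "reason": "is an ingredient to add chocolate flavor"}
    ]
  }
]
"""
problems = check_v3_examples(json.loads(raw))
```
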
For a fully working example, see this
[usage example](https://github.com/explosion/spacy-llm/tree/main/usage_examples/ner_v3_openai).

#### spacy.NER.v2 {id="ner-v2"}

This version supports explicitly defining the provided labels with custom
descriptions, and further supports zero-shot and few-shot prompting just like
v1.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.NER.v2"
> labels = ["PERSON", "ORGANISATION", "LOCATION"]
> examples = null
> ```

| Argument | Description |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ |
| `label_definitions` (NEW) | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
| `template` (NEW) | Custom prompt template to send to LLM model. Defaults to [ner.v2.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/ner.v2.jinja). ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ |
| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ |
| `case_sensitive_matching` | Whether to match entities case-sensitively. Defaults to `False`. ~~bool~~ |
| `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ |

The parameters `alignment_mode`, `case_sensitive_matching` and `single_match`
are identical to the [v1](#ner-v1) implementation. The format of few-shot
examples is also the same.

> Label descriptions can also be used with explicit examples to give as much
> info to the LLM model as possible.

New to v2 is the fact that you can write definitions for each label and provide
them via the `label_definitions` argument. This lets you tell the LLM exactly
what you're looking for rather than relying on the LLM to interpret its task
given just the label name. Label descriptions are freeform so you can write
whatever you want here, but a brief description along with some examples and
counterexamples seems to work quite well.

```ini
[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = PERSON,SPORTS_TEAM

[components.llm.task.label_definitions]
PERSON = "Extract any named individual in the text."
SPORTS_TEAM = "Extract the names of any professional sports team. e.g. Golden State Warriors, LA Lakers, Man City, Real Madrid"
```

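
To get a feel for how label definitions can end up in a prompt, here is a small
sketch that renders labels and their definitions as plain text. The function
`render_label_section` and its output layout are invented for illustration. The
actual wording is determined by the task's Jinja template, which this sketch
does not reproduce.

```python
from typing import Dict, List, Optional


def render_label_section(
    labels: List[str], label_definitions: Optional[Dict[str, str]] = None
) -> str:
    """Build a prompt fragment that lists labels, with definitions where available."""
    lines = [f"Labels: {', '.join(labels)}"]
    if label_definitions:
        lines.append("Label definitions:")
        for label in labels:
            if label in label_definitions:
                lines.append(f"{label}: {label_definitions[label]}")
    return "\n".join(lines)


section = render_label_section(
    ["PERSON", "SPORTS_TEAM"],
    {
        "PERSON": "Extract any named individual in the text.",
        "SPORTS_TEAM": "Extract the names of any professional sports team.",
    },
)
```
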
For a fully working example, see this
[usage example](https://github.com/explosion/spacy-llm/tree/main/usage_examples/ner_dolly).

#### spacy.NER.v1 {id="ner-v1"}

The original version of the built-in NER task supports both zero-shot and
few-shot prompting.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.NER.v1"
> labels = PERSON,ORGANISATION,LOCATION
> examples = null
> ```

| Argument | Description |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `labels` | Comma-separated list of labels. ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ |
| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ |
| `case_sensitive_matching` | Whether to match entities case-sensitively. Defaults to `False`. ~~bool~~ |
| `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ |

The NER task implementation doesn't currently ask the LLM for specific offsets,
but simply expects a list of strings that represent the entities in the document.
This means that a form of string matching is required. This can be configured by
the following parameters:

- The `single_match` parameter is typically set to `False` to allow for multiple
  matches. For instance, the response from the LLM might only mention the entity
  "Paris" once, but you'd still want to mark it every time it occurs in the
  document.
- The `case_sensitive_matching` parameter is typically set to `False` to be
  robust against case variances in the LLM's output.
- The `alignment_mode` argument is used to match entities as returned by the LLM
  to the tokens from the original `Doc` - specifically it's used as argument in
  the call to [`doc.char_span()`](/api/doc#char_span). The `"strict"` mode will
  only keep spans that strictly adhere to the given token boundaries.
  `"contract"` will only keep those tokens that are fully within the given
  range, e.g. reducing `"New Y"` to `"New"`. Finally, `"expand"` will expand the
  span to the next token boundaries, e.g. expanding `"New Y"` out to
  `"New York"`.

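
The three parameters above can be illustrated with a sketch that works on plain
strings and character offsets. It is not the library's code: `find_matches` and
`align` are hypothetical helpers, and a simple regular expression stands in for
a real tokenizer. In spaCy itself the alignment step is handled by
`doc.char_span()`.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def find_matches(
    text: str, entity: str, case_sensitive: bool = False, single_match: bool = False
) -> List[Span]:
    """Return the character offsets at which `entity` occurs in `text`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    spans = [m.span() for m in re.finditer(re.escape(entity), text, flags)]
    return spans[:1] if single_match else spans


def align(span: Span, tokens: List[Span], mode: str = "contract") -> Optional[Span]:
    """Snap a character span to token boundaries, mimicking the three modes."""
    start, end = span
    if mode == "strict":
        starts = {s for s, _ in tokens}
        ends = {e for _, e in tokens}
        return span if start in starts and end in ends else None
    if mode == "contract":
        # Keep only tokens that lie completely inside the span.
        covered = [(s, e) for s, e in tokens if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that the span touches at least partially.
        covered = [(s, e) for s, e in tokens if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode}")
    return (covered[0][0], covered[-1][1]) if covered else None


text = "Paris is lovely. I flew from New York to paris."
# Words and punctuation marks as a stand-in for a real tokenizer.
tokens = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]

all_paris = find_matches(text, "Paris")
first_paris = find_matches(text, "Paris", single_match=True)
exact_paris = find_matches(text, "Paris", case_sensitive=True)
partial = find_matches(text, "New Y")[0]
```

With the defaults, both "Paris" and "paris" are found. `single_match=True` or
`case_sensitive=True` keeps only the first occurrence in this text. Aligning
the partial match `"New Y"` yields nothing in `"strict"` mode, `"New"` in
`"contract"` mode and `"New York"` in `"expand"` mode.
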
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.

```yaml
- text: Jack and Jill went up the hill.
  entities:
    PERSON:
      - Jack
      - Jill
    LOCATION:
      - hill
- text: Jack fell down and broke his crown.
  entities:
    PERSON:
      - Jack
```

```ini
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "ner_examples.yml"
```

### SpanCat {id="spancat"}

The SpanCat task identifies potentially overlapping entities in text.

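
The practical difference from NER is that spans may overlap. The sketch below
shows what enforcing the non-overlapping constraint would look like on plain
`(start, end, label)` tuples, keeping the longest span in case of a conflict.
`filter_overlaps` is a hypothetical helper for illustration. For real `Span`
objects, spaCy provides `spacy.util.filter_spans`.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def filter_overlaps(spans: List[Span]) -> List[Span]:
    """Prefer longer spans and drop any span that overlaps one already kept."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


# The first two spans overlap, which a span categorizer may keep side by side.
spans = [(0, 3, "ORGANISATION"), (1, 3, "LOCATION"), (5, 6, "PERSON")]
non_overlapping = filter_overlaps(spans)
```

A SpanCat-style task can store all three spans, whereas an NER-style result
would have to settle for the filtered list.
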
#### spacy.SpanCat.v3 {id="spancat-v3"}

The built-in SpanCat v3 task is a simple adaptation of the NER v3 task to
support overlapping entities and store its annotations in `doc.spans`.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.SpanCat.v3"
> labels = ["PERSON", "ORGANISATION", "LOCATION"]
> examples = null
> ```

| Argument | Description |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ |
| `label_definitions` | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
| `template` | Custom prompt template to send to LLM model. Defaults to [`spancat.v3.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/spancat.v3.jinja). ~~str~~ |
| `description` (NEW) | A description of what to recognize or not recognize as entities. ~~str~~ |
| `spans_key` | Key of the `Doc.spans` dict to save the spans under. Defaults to `"sc"`. ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ |
| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ |
| `case_sensitive_matching` | Whether to match entities case-sensitively. Defaults to `False`. ~~bool~~ |

Note that the `single_match` parameter, used in v1 and v2, is not supported
anymore, as the CoT parsing algorithm takes care of this automatically.

#### spacy.SpanCat.v2 {id="spancat-v2"}

The built-in SpanCat v2 task is a simple adaptation of the NER v2 task to
support overlapping entities and store its annotations in `doc.spans`.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.SpanCat.v2"
> labels = ["PERSON", "ORGANISATION", "LOCATION"]
> examples = null
> ```

| Argument | Description |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ |
| `label_definitions` (NEW) | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
| `template` (NEW) | Custom prompt template to send to LLM model. Defaults to [`spancat.v2.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/spancat.v2.jinja). ~~str~~ |
| `spans_key` | Key of the `Doc.spans` dict to save the spans under. Defaults to `"sc"`. ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ |
| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ |
| `case_sensitive_matching` | Whether to match entities case-sensitively. Defaults to `False`. ~~bool~~ |
| `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ |

Except for the `spans_key` parameter, the SpanCat v2 task reuses the
configuration from the NER v2 task. Refer to [its documentation](#ner-v2) for
more insight.

#### spacy.SpanCat.v1 {id="spancat-v1"}

The original version of the built-in SpanCat task is a simple adaptation of the
v1 NER task to support overlapping entities and store its annotations in
`doc.spans`.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.SpanCat.v1"
> labels = PERSON,ORGANISATION,LOCATION
> examples = null
> ```

| Argument | Description |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `labels` | Comma-separated list of labels. ~~str~~ |
| `spans_key` | Key of the `Doc.spans` dict to save the spans under. Defaults to `"sc"`. ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ |
| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ |
| `case_sensitive_matching` | Whether to match entities case-sensitively. Defaults to `False`. ~~bool~~ |
| `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ |

Except for the `spans_key` parameter, the SpanCat v1 task reuses the
configuration from the NER v1 task. Refer to [its documentation](#ner-v1) for
more insight.

### TextCat {id="textcat"}

The TextCat task labels documents with relevant categories.

#### spacy.TextCat.v3 {id="textcat-v3"}

On top of the functionality from v2, version 3 of the built-in TextCat task
allows setting definitions of labels. Those definitions are included in the
prompt.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.TextCat.v3"
> labels = ["COMPLIMENT", "INSULT"]
> examples = null
>
> [components.llm.task.label_definitions]
> COMPLIMENT = "a polite expression of praise or admiration."
> INSULT = "a disrespectful or scornfully abusive remark or act."
> ```

| Argument | Description |
| ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ |
| `label_definitions` (NEW) | Dictionary of label definitions. Included in the prompt, if set. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
| `template` | Custom prompt template to send to LLM model. Defaults to [`textcat.v3.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/textcat.v3.jinja). ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ |
| `exclusive_classes` | If set to `True`, only one label per document should be valid. If set to `False`, one document can have multiple labels. Defaults to `False`. ~~bool~~ |
| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |

The formatting of few-shot examples is the same as those for the
|
|
[v1](#textcat-v1) implementation.
|
|
|
|
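
To make the interplay of `normalizer`, `exclusive_classes` and `allow_none` more
concrete, here is a minimal, self-contained sketch of how a raw LLM answer could
be mapped onto a `doc.cats`-style dictionary. The helper names
(`lowercase_normalizer`, `response_to_cats`) and the comma-separated answer
format are assumptions made for illustration; this is not spacy-llm's actual
implementation.

```python
from typing import Callable, Dict, List


def lowercase_normalizer(text: str) -> str:
    # Same normalization as documented for spacy.LowercaseNormalizer.v1.
    return text.strip().lower()


def response_to_cats(
    response: str,
    labels: List[str],
    normalizer: Callable[[str], str] = lowercase_normalizer,
    exclusive_classes: bool = False,
) -> Dict[str, float]:
    """Map a comma-separated LLM answer onto one score per configured label."""
    # Compare normalized strings so that "Compliment" matches "COMPLIMENT".
    lookup = {normalizer(label): label for label in labels}
    cats = {label: 0.0 for label in labels}
    for raw in response.split(","):
        label = lookup.get(normalizer(raw))
        if label is None:
            continue  # unknown label: ignored in this sketch
        cats[label] = 1.0
        if exclusive_classes:
            break  # keep only the first valid label
    return cats


labels = ["COMPLIMENT", "INSULT"]
print(response_to_cats(" Compliment ", labels))
print(response_to_cats("none of the above", labels))
```

The second call shows the situation described for `allow_none`: no configured
label is returned, so every label keeps a score of `0.0`.
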
#### spacy.TextCat.v2 {id="textcat-v2"}

V2 includes all v1 functionality, with an improved prompt template.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.TextCat.v2"
> labels = ["COMPLIMENT", "INSULT"]
> examples = null
> ```

| Argument | Description |
| ------------------- | ----------------------------------------------------------- |
| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ |
| `template` (NEW) | Custom prompt template to send to the LLM. Defaults to [`textcat.v2.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/textcat.v2.jinja). ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ |
| `exclusive_classes` | If set to `True`, only one label per document should be valid. If set to `False`, one document can have multiple labels. Defaults to `False`. ~~bool~~ |
| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |

The formatting of few-shot examples is the same as for the
[v1](#textcat-v1) implementation.

#### spacy.TextCat.v1 {id="textcat-v1"}

Version 1 of the built-in TextCat task supports both zero-shot and few-shot
prompting.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.TextCat.v1"
> labels = COMPLIMENT,INSULT
> examples = null
> ```

| Argument | Description |
| ------------------- | ----------------------------------------------------------- |
| `labels` | Comma-separated list of labels. ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. ~~Optional[Callable[[str], str]]~~ |
| `exclusive_classes` | If set to `True`, only one label per document should be valid. If set to `False`, one document can have multiple labels. Defaults to `False`. ~~bool~~ |
| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |

To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.

```json
[
  {
    "text": "You look great!",
    "answer": "Compliment"
  },
  {
    "text": "You are not very clever at all.",
    "answer": "Insult"
  }
]
```

```ini
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "textcat_examples.json"
```

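
Before pointing the config at an examples file, it can help to check that the
file parses and that every answer maps to a configured label. The sketch below
uses only the standard library: it writes the examples shown above to a
temporary `textcat_examples.json`, reads them back based on the file extension
and collects answers that match no label. `read_examples` is a hypothetical
stand-in for illustration, not the implementation of `spacy.FewShotReader.v1`
(which also handles YAML).

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def read_examples(path: Path) -> List[Dict[str, Any]]:
    """Read few-shot examples from a .json or .jsonl file (hypothetical helper)."""
    text = path.read_text(encoding="utf8")
    if path.suffix == ".json":
        return json.loads(text)
    if path.suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"Unsupported file type: {path.suffix}")


examples = [
    {"text": "You look great!", "answer": "Compliment"},
    {"text": "You are not very clever at all.", "answer": "Insult"},
]
labels = {"compliment", "insult"}  # configured labels after lowercasing

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "textcat_examples.json"
    path.write_text(json.dumps(examples, indent=2), encoding="utf8")
    loaded = read_examples(path)

# Answers that do not match any configured label once normalized.
unknown = [ex["answer"] for ex in loaded if ex["answer"].strip().lower() not in labels]
print(len(loaded), unknown)
```
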
### REL {id="rel"}

The REL task extracts relations between named entities.

#### spacy.REL.v1 {id="rel-v1"}

The built-in REL task supports both zero-shot and few-shot prompting. It relies
on an upstream NER component for entity extraction.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.REL.v1"
> labels = ["LivesIn", "Visits"]
> ```

| Argument | Description |
| ------------------- | ----------------------------------------------------------- |
| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ |
| `template` | Custom prompt template to send to the LLM. Defaults to [`rel.v1.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/rel.v1.jinja). ~~str~~ |
| `label_definitions` | Dictionary providing a description for each relation label. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |

To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.

```jsonl
{"text": "Laura bought a house in Boston with her husband Mark.", "ents": [{"start_char": 0, "end_char": 5, "label": "PERSON"}, {"start_char": 24, "end_char": 30, "label": "GPE"}, {"start_char": 48, "end_char": 52, "label": "PERSON"}], "relations": [{"dep": 0, "dest": 1, "relation": "LivesIn"}, {"dep": 2, "dest": 1, "relation": "LivesIn"}]}
{"text": "Michael travelled through South America by bike.", "ents": [{"start_char": 0, "end_char": 7, "label": "PERSON"}, {"start_char": 26, "end_char": 39, "label": "LOC"}], "relations": [{"dep": 0, "dest": 1, "relation": "Visits"}]}
```

```ini
[components.llm.task]
@llm_tasks = "spacy.REL.v1"
labels = ["LivesIn", "Visits"]

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "rel_examples.jsonl"
```

Note: the REL task relies on pre-extracted entities to make its prediction.
Hence, you'll need to add a component that populates `doc.ents` with recognized
spans to your spaCy pipeline and put it _before_ the REL component.

For a fully working example, see this
[usage example](https://github.com/explosion/spacy-llm/tree/main/usage_examples/rel_openai).

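
The few-shot format for REL is easy to get wrong, because entities are given as
character offsets and relations refer to entities by their position in the
`ents` list. The standard-library sketch below resolves the two examples shown
above into readable triples, which is a quick way to sanity-check offsets in
your own file. The variable names are illustrative only.

```python
import json

# The two few-shot examples from above, one JSON object per line.
REL_EXAMPLES = """
{"text": "Laura bought a house in Boston with her husband Mark.", "ents": [{"start_char": 0, "end_char": 5, "label": "PERSON"}, {"start_char": 24, "end_char": 30, "label": "GPE"}, {"start_char": 48, "end_char": 52, "label": "PERSON"}], "relations": [{"dep": 0, "dest": 1, "relation": "LivesIn"}, {"dep": 2, "dest": 1, "relation": "LivesIn"}]}
{"text": "Michael travelled through South America by bike.", "ents": [{"start_char": 0, "end_char": 7, "label": "PERSON"}, {"start_char": 26, "end_char": 39, "label": "LOC"}], "relations": [{"dep": 0, "dest": 1, "relation": "Visits"}]}
"""

all_triples = []
for line in REL_EXAMPLES.strip().splitlines():
    example = json.loads(line)
    text = example["text"]
    # Character offsets select each entity's surface form from the text.
    ents = [text[e["start_char"]:e["end_char"]] for e in example["ents"]]
    # "dep" and "dest" are indices into the "ents" list.
    triples = [
        (ents[r["dep"]], r["relation"], ents[r["dest"]])
        for r in example["relations"]
    ]
    all_triples.append(triples)
    print(triples)
```
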
### Lemma {id="lemma"}

The Lemma task lemmatizes the provided text and updates the `lemma_` attribute
in the doc's tokens accordingly.

#### spacy.Lemma.v1 {id="lemma-v1"}

This task supports both zero-shot and few-shot prompting.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.Lemma.v1"
> examples = null
> ```

| Argument | Description |
| ---------- | ----------------------------------------------------------- |
| `template` | Custom prompt template to send to the LLM. Defaults to [`lemma.v1.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/lemma.v1.jinja). ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |

The task prompts the LLM to lemmatize the passed text and return the lemmatized
version as a list of tokens and their corresponding lemmas. For example, the
text `I'm buying ice cream for my friends.` should invoke the response:

```
I: I
'm: be
buying: buy
ice: ice
cream: cream
for: for
my: my
friends: friend
.: .
```

If for any given text/doc instance the number of lemmas returned by the LLM
doesn't match the number of tokens from the pipeline's tokenizer, no lemmas are
stored in the corresponding doc's tokens. Otherwise, each token's `.lemma_`
property is updated with the lemma suggested by the LLM.

To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.

```yaml
- text: I'm buying ice cream.
  lemmas:
    - 'I': 'I'
    - "'m": 'be'
    - 'buying': 'buy'
    - 'ice': 'ice'
    - 'cream': 'cream'
    - '.': '.'

- text: I've watered the plants.
  lemmas:
    - 'I': 'I'
    - "'ve": 'have'
    - 'watered': 'water'
    - 'the': 'the'
    - 'plants': 'plant'
    - '.': '.'
```

```ini
[components.llm.task]
@llm_tasks = "spacy.Lemma.v1"
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "lemma_examples.yml"
```

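
The rule that lemmas are only stored when their number matches the number of
tokens can be sketched as follows. This is a simplified illustration that works
on plain strings rather than `Doc` objects; `parse_lemmas` is a hypothetical
helper and not the parsing code used by `spacy.Lemma.v1`.

```python
from typing import List, Optional, Tuple


def parse_lemmas(response: str, tokens: List[str]) -> Optional[List[Tuple[str, str]]]:
    """Parse "token: lemma" lines; return None if the counts do not match."""
    pairs = []
    for line in response.strip().splitlines():
        # Split on the last ": " so that tokens containing a colon stay intact.
        token, sep, lemma = line.rpartition(": ")
        if sep:
            pairs.append((token.strip(), lemma.strip()))
    if len(pairs) != len(tokens):
        return None  # mismatch: the doc's lemmas would be left untouched
    return pairs


tokens = ["I", "'m", "buying", "ice", "cream", "."]
response = "I: I\n'm: be\nbuying: buy\nice: ice\ncream: cream\n.: ."
print(parse_lemmas(response, tokens))
print(parse_lemmas("I: I\n'm: be", tokens))
```
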
### Sentiment {id="sentiment"}

The Sentiment task performs sentiment analysis on the provided texts. Scores
between 0 and 1 are stored in `Doc._.sentiment`: the higher the score, the more
positive the sentiment. Note that if parsing fails (e.g. because of an
unexpected LLM response), the value may be `None`.

#### spacy.Sentiment.v1 {id="sentiment-v1"}

This task supports both zero-shot and few-shot prompting.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.Sentiment.v1"
> examples = null
> ```

| Argument | Description |
| ---------- | ----------------------------------------------------------- |
| `template` | Custom prompt template to send to the LLM. Defaults to [`sentiment.v1.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/sentiment.v1.jinja). ~~str~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `field` | Name of the extension attribute to store the sentiment score in (i.e. the score will be available in `doc._.{field}`). Defaults to `sentiment`. ~~str~~ |

To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.

```yaml
- text: 'This is horrifying.'
  score: 0
- text: 'This is underwhelming.'
  score: 0.25
- text: 'This is ok.'
  score: 0.5
- text: "I'm looking forward to this!"
  score: 1.0
```

```ini
[components.llm.task]
@llm_tasks = "spacy.Sentiment.v1"
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "sentiment_examples.yml"
```

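
Since the stored value can be `None` when a response cannot be parsed,
downstream code should be prepared for both cases. The sketch below shows one
way such parsing could behave; `parse_sentiment` is a hypothetical helper, and
rejecting scores outside the range 0 to 1 is an assumption of this sketch rather
than documented behaviour of `spacy.Sentiment.v1`.

```python
from typing import Optional


def parse_sentiment(response: str) -> Optional[float]:
    """Return a score between 0 and 1, or None if the response is unusable."""
    try:
        score = float(response.strip())
    except ValueError:
        return None  # e.g. the LLM answered with free text
    # Assumption: values outside [0, 1] are treated as parsing failures.
    return score if 0.0 <= score <= 1.0 else None


for raw in [" 0.75 ", "quite positive", "3"]:
    score = parse_sentiment(raw)
    print(repr(raw), "->", "no score" if score is None else score)
```
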
### NoOp {id="noop"}

This task is only useful for testing - it tells the LLM to do nothing, and does
not set any fields on the `docs`.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.NoOp.v1"
> ```

#### spacy.NoOp.v1 {id="noop-v1"}

This task needs no further configuration.

## Models {id="models"}

A _model_ defines which LLM to query, and how to query it. It can be a
simple function taking a collection of prompts (consistent with the output type
of `task.generate_prompts()`) and returning a collection of responses
(consistent with the expected input of `parse_responses`). Generally speaking,
it's a function of type `Callable[[Iterable[Any]], Iterable[Any]]`, but specific
implementations can have other signatures, like
`Callable[[Iterable[str]], Iterable[str]]`.

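
As a minimal illustration of the `Callable[[Iterable[str]], Iterable[str]]`
shape, the toy function below returns one response per prompt without calling
any LLM. `echo_model` is a made-up name; how such a function would be
registered with spacy-llm is not shown here.

```python
from typing import Iterable


def echo_model(prompts: Iterable[str]) -> Iterable[str]:
    """A toy 'model': yields one response per prompt, in the same order."""
    for prompt in prompts:
        yield f"Echo: {prompt}"


responses = list(echo_model(["Is this a compliment?", "Is this an insult?"]))
print(responses)
```
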
### Models via REST API {id="models-rest"}

These models all take the same parameters, but note that the `config` should
contain provider-specific keys and values, as it will be passed onwards to the
provider's API.

| Argument | Description |
| ------------------ | ----------------------------------------------------------- |
| `name` | Model name, i.e. any supported variant for this particular model. Default depends on the specific model (cf. below). ~~str~~ |
| `config` | Further configuration passed on to the model. Default depends on the specific model (cf. below). ~~Dict[Any, Any]~~ |
| `strict` | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, returns the error responses as is. Defaults to `True`. ~~bool~~ |
| `max_tries` | Max. number of tries for API request. Defaults to `5`. ~~int~~ |
| `max_request_time` | Max. time (in seconds) to wait for request to terminate before raising an exception. Defaults to `30.0`. ~~float~~ |
| `interval` | Time interval (in seconds) between API retries. Defaults to `1.0`. ~~float~~ |

> #### Example config
>
> ```ini
> [components.llm.model]
> @llm_models = "spacy.GPT-4.v1"
> name = "gpt-4"
> config = {"temperature": 0.0}
> ```

Currently, these models are provided as part of the core library:

| Model | Provider | Supported names | Default name | Default config |
| ----------------------------- | --------- | --------------- | ------------ | -------------- |
| `spacy.GPT-4.v1` | OpenAI | `["gpt-4", "gpt-4-0314", "gpt-4-32k", "gpt-4-32k-0314"]` | `"gpt-4"` | `{}` |
| `spacy.GPT-4.v2` | OpenAI | `["gpt-4", "gpt-4-0314", "gpt-4-32k", "gpt-4-32k-0314"]` | `"gpt-4"` | `{temperature=0.0}` |
| `spacy.GPT-3-5.v1` | OpenAI | `["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-0613-16k"]` | `"gpt-3.5-turbo"` | `{}` |
| `spacy.GPT-3-5.v2` | OpenAI | `["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-0613-16k"]` | `"gpt-3.5-turbo"` | `{temperature=0.0}` |
| `spacy.Davinci.v1` | OpenAI | `["davinci"]` | `"davinci"` | `{}` |
| `spacy.Davinci.v2` | OpenAI | `["davinci"]` | `"davinci"` | `{temperature=0.0, max_tokens=500}` |
| `spacy.Text-Davinci.v1` | OpenAI | `["text-davinci-003", "text-davinci-002"]` | `"text-davinci-003"` | `{}` |
| `spacy.Text-Davinci.v2` | OpenAI | `["text-davinci-003", "text-davinci-002"]` | `"text-davinci-003"` | `{temperature=0.0, max_tokens=1000}` |
| `spacy.Code-Davinci.v1` | OpenAI | `["code-davinci-002"]` | `"code-davinci-002"` | `{}` |
| `spacy.Code-Davinci.v2` | OpenAI | `["code-davinci-002"]` | `"code-davinci-002"` | `{temperature=0.0, max_tokens=500}` |
| `spacy.Curie.v1` | OpenAI | `["curie"]` | `"curie"` | `{}` |
| `spacy.Curie.v2` | OpenAI | `["curie"]` | `"curie"` | `{temperature=0.0, max_tokens=500}` |
| `spacy.Text-Curie.v1` | OpenAI | `["text-curie-001"]` | `"text-curie-001"` | `{}` |
| `spacy.Text-Curie.v2` | OpenAI | `["text-curie-001"]` | `"text-curie-001"` | `{temperature=0.0, max_tokens=500}` |
| `spacy.Babbage.v1` | OpenAI | `["babbage"]` | `"babbage"` | `{}` |
| `spacy.Babbage.v2` | OpenAI | `["babbage"]` | `"babbage"` | `{temperature=0.0, max_tokens=500}` |
| `spacy.Text-Babbage.v1` | OpenAI | `["text-babbage-001"]` | `"text-babbage-001"` | `{}` |
| `spacy.Text-Babbage.v2` | OpenAI | `["text-babbage-001"]` | `"text-babbage-001"` | `{temperature=0.0, max_tokens=500}` |
| `spacy.Ada.v1` | OpenAI | `["ada"]` | `"ada"` | `{}` |
| `spacy.Ada.v2` | OpenAI | `["ada"]` | `"ada"` | `{temperature=0.0, max_tokens=500}` |
| `spacy.Text-Ada.v1` | OpenAI | `["text-ada-001"]` | `"text-ada-001"` | `{}` |
| `spacy.Text-Ada.v2` | OpenAI | `["text-ada-001"]` | `"text-ada-001"` | `{temperature=0.0, max_tokens=500}` |
| `spacy.Command.v1` | Cohere | `["command", "command-light", "command-light-nightly", "command-nightly"]` | `"command"` | `{}` |
| `spacy.Claude-2.v1` | Anthropic | `["claude-2", "claude-2-100k"]` | `"claude-2"` | `{}` |
| `spacy.Claude-1.v1` | Anthropic | `["claude-1", "claude-1-100k"]` | `"claude-1"` | `{}` |
| `spacy.Claude-1-0.v1` | Anthropic | `["claude-1.0"]` | `"claude-1.0"` | `{}` |
| `spacy.Claude-1-2.v1` | Anthropic | `["claude-1.2"]` | `"claude-1.2"` | `{}` |
| `spacy.Claude-1-3.v1` | Anthropic | `["claude-1.3", "claude-1.3-100k"]` | `"claude-1.3"` | `{}` |
| `spacy.Claude-instant-1.v1` | Anthropic | `["claude-instant-1", "claude-instant-1-100k"]` | `"claude-instant-1"` | `{}` |
| `spacy.Claude-instant-1-1.v1` | Anthropic | `["claude-instant-1.1", "claude-instant-1.1-100k"]` | `"claude-instant-1.1"` | `{}` |

To use these models, make sure that you've
[set the relevant API keys](#api-keys) as environment variables.

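
The `max_tries` and `interval` arguments describe a retry loop around the API
request. The sketch below shows the general idea with a fixed waiting time
between attempts. It is an assumption-laden illustration: `call_with_retries`
and `flaky_request` are made-up names, and the actual retry and error handling
in spacy-llm may differ (for instance in how the waiting time evolves between
attempts).

```python
import time
from typing import Callable


def call_with_retries(
    request: Callable[[], str],
    max_tries: int = 5,
    interval: float = 1.0,
) -> str:
    """Call `request` until it succeeds or `max_tries` attempts have failed."""
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return request()
        except ConnectionError as err:
            last_error = err
            if attempt < max_tries:
                time.sleep(interval)  # wait before the next attempt
    raise TimeoutError(f"Request failed after {max_tries} tries") from last_error


attempts = {"count": 0}


def flaky_request() -> str:
    # Simulated API call that fails twice before succeeding.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky_request, max_tries=5, interval=0.01)
print(result, attempts["count"])
```
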
#### API Keys {id="api-keys"}

Note that when using hosted services, you have to ensure that the proper API
keys are set as environment variables as described by the corresponding
provider's documentation.

For example, when using OpenAI, you have to get an API key from openai.com, and
ensure that the keys are set as environment variables:

```shell
export OPENAI_API_KEY="sk-..."
export OPENAI_API_ORG="org-..."
```

For Cohere:

```shell
export CO_API_KEY="..."
```

For Anthropic:

```shell
export ANTHROPIC_API_KEY="..."
```

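
A small pre-flight check can make a missing key visible before the pipeline
sends its first request. The sketch below only inspects environment variables
named in this section; the `REQUIRED_KEYS` mapping and the `missing_keys` helper
are illustrative assumptions and not part of spacy-llm.

```python
import os
from typing import List, Mapping

# Environment variables per provider, as named in this section.
# OPENAI_API_ORG is left out here on the assumption that it is optional.
REQUIRED_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "cohere": ["CO_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}


def missing_keys(provider: str, env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_KEYS[provider] if not env.get(name)]


for provider in REQUIRED_KEYS:
    missing = missing_keys(provider, os.environ)
    status = "ready" if not missing else "missing " + ", ".join(missing)
    print(f"{provider}: {status}")
```
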
### Models via HuggingFace {id="models-hf"}

These models all take the same parameters:

| Argument | Description |
| ------------- | ----------------------------------------------------------- |
| `name` | Model name, i.e. any supported variant for this particular model. ~~str~~ |
| `config_init` | Further configuration passed on to the construction of the model with `transformers.pipeline()`. Defaults to `{}`. ~~Dict[str, Any]~~ |
| `config_run` | Further configuration used during model inference. Defaults to `{}`. ~~Dict[str, Any]~~ |

> #### Example config
>
> ```ini
> [components.llm.model]
> @llm_models = "spacy.Llama2.v1"
> name = "Llama-2-7b-hf"
> ```

Currently, these models are provided as part of the core library:

| Model | Provider | Supported names | HF directory |
| -------------------- | --------------- | --------------- | -------------------------------------- |
| `spacy.Dolly.v1` | Databricks | `["dolly-v2-3b", "dolly-v2-7b", "dolly-v2-12b"]` | https://huggingface.co/databricks |
| `spacy.Llama2.v1` | Meta AI | `["Llama-2-7b-hf", "Llama-2-13b-hf", "Llama-2-70b-hf"]` | https://huggingface.co/meta-llama |
| `spacy.Falcon.v1` | TII | `["falcon-rw-1b", "falcon-7b", "falcon-7b-instruct", "falcon-40b-instruct"]` | https://huggingface.co/tiiuae |
| `spacy.StableLM.v1` | Stability AI | `["stablelm-base-alpha-3b", "stablelm-base-alpha-7b", "stablelm-tuned-alpha-3b", "stablelm-tuned-alpha-7b"]` | https://huggingface.co/stabilityai |
| `spacy.OpenLLaMA.v1` | OpenLM Research | `["open_llama_3b", "open_llama_7b", "open_llama_7b_v2", "open_llama_13b"]` | https://huggingface.co/openlm-research |

Note that Hugging Face will download the model the first time you use it. You
can
[define the cache directory](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache)
by setting the environment variable `HF_HOME`.

#### Installation with HuggingFace {id="install-hf"}

To use models from HuggingFace, ideally you have a GPU enabled and have
installed `transformers`, `torch` and CUDA in your virtual environment. This
allows you to have the setting `device=cuda:0` in your config, which ensures
that the model is loaded entirely on the GPU (and fails otherwise).

You can install the required packages with:

```shell
python -m pip install "spacy-llm[transformers]" "transformers[sentencepiece]"
```

If you don't have access to a GPU, you can install `accelerate` and
set `device_map=auto` instead, but be aware that this may result in some layers
getting distributed to the CPU or even the hard drive, which may ultimately
result in extremely slow queries.

```shell
python -m pip install "accelerate>=0.16.0,<1.0"
```

### LangChain models {id="langchain-models"}

To use [LangChain](https://github.com/hwchase17/langchain) for the API retrieval
part, make sure you have installed it first:

```shell
python -m pip install "langchain==0.0.191"
# Or install with spacy-llm directly
python -m pip install "spacy-llm[extras]"
```

Note that LangChain currently only supports Python 3.9 and beyond.

LangChain models in `spacy-llm` work slightly differently. `langchain`'s models
are parsed automatically: each LLM class in `langchain` has one entry in
`spacy-llm`'s registry. As `langchain`'s design has one class per API and not
per model, this results in registry entries like `langchain.OpenAI.v1` - i.e.
there is one registry entry per API and not per model (family), as for the REST-
and HuggingFace-based entries.

The name of the model to be used has to be passed in via the `name` attribute.

> #### Example config
>
> ```ini
> [components.llm.model]
> @llm_models = "langchain.OpenAI.v1"
> name = "gpt-3.5-turbo"
> query = {"@llm_queries": "spacy.CallLangChain.v1"}
> config = {"temperature": 0.0}
> ```

| Argument | Description |
| -------- | ----------------------------------------------------------- |
| `name` | The name of a model supported by LangChain for this API. ~~str~~ |
| `config` | Configuration passed on to the LangChain model. Defaults to `{}`. ~~Dict[Any, Any]~~ |
| `query` | Function that executes the prompts. If `None`, defaults to `spacy.CallLangChain.v1`. ~~Optional[Callable[["langchain.llms.BaseLLM", Iterable[Any]], Iterable[Any]]]~~ |

The default `query` (`spacy.CallLangChain.v1`) executes the prompts by running
`model(text)` for each given textual prompt.

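
The behaviour described for the default `query` amounts to calling the model
once per prompt. The sketch below mirrors that description with a plain Python
callable standing in for a LangChain LLM object; `call_model` and `fake_llm`
are hypothetical names, and no LangChain code is involved.

```python
from typing import Callable, Iterable, List


def call_model(model: Callable[[str], str], prompts: Iterable[str]) -> List[str]:
    """Run the model once per prompt and collect the responses in order."""
    return [model(prompt) for prompt in prompts]


def fake_llm(prompt: str) -> str:
    # Stand-in for a LangChain LLM, which would call a remote API instead.
    return prompt.upper()


outputs = call_model(fake_llm, ["hello", "world"])
print(outputs)
```

A custom `query` function could follow the same shape, for example to batch
prompts or to post-process the raw responses.
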
## Cache {id="cache"}

Interacting with LLMs, either through an external API or a local instance, is
costly. Since developing an NLP pipeline generally means a lot of exploration
and prototyping, `spacy-llm` implements a built-in cache that keeps batches of
documents stored on disk, so that the same documents are not reprocessed at
each run.

> #### Example config
>
> ```ini
> [components.llm.cache]
> @llm_misc = "spacy.BatchCache.v1"
> path = "path/to/cache"
> batch_size = 64
> max_batches_in_mem = 4
> ```

| Argument | Description |
| -------------------- | ----------------------------------------------------------- |
| `path` | Cache directory. If `None`, no caching is performed, and this component will act as a NoOp. Defaults to `None`. ~~Optional[Union[str, Path]]~~ |
| `batch_size` | Number of docs in one batch (file). Once a batch is full, it will be persisted to disk. Defaults to 64. ~~int~~ |
| `max_batches_in_mem` | Max. number of batches to hold in memory. Allows you to limit the effect on your memory if you're handling a lot of docs. Defaults to 4. ~~int~~ |

When retrieving a document, the `BatchCache` will first figure out what batch
the document belongs to. If the batch isn't in memory, it will try to load the
batch from disk and then move it into memory.

Note that since the cache is generated by a registered function, you can also
provide your own registered function returning your own cache implementation. If
you wish to do so, ensure that your cache object adheres to the `Protocol`
defined in `spacy_llm.ty.Cache`.

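
To illustrate how `batch_size` and `max_batches_in_mem` interact, here is a toy
cache that follows the behaviour described above: full batches are written to
disk, and a batch is loaded back into memory on demand. It is a deliberately
simplified sketch, not the `BatchCache` implementation: it stores plain strings
in JSON files instead of `Doc` objects, and `ToyBatchCache` and its methods are
made-up names that do not follow the `spacy_llm.ty.Cache` protocol.

```python
import hashlib
import json
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Dict, Optional


class ToyBatchCache:
    """Toy cache that stores responses in batch files on disk."""

    def __init__(self, path: Path, batch_size: int = 64, max_batches_in_mem: int = 4):
        self.path = path
        self.batch_size = batch_size
        self.max_batches_in_mem = max_batches_in_mem
        self.index: Dict[str, int] = {}  # doc key -> batch number
        self.current: Dict[str, str] = {}  # batch that is still being filled
        self.loaded = OrderedDict()  # batch number -> batch loaded from disk
        self.n_batches = 0

    @staticmethod
    def key(text: str) -> str:
        return hashlib.sha256(text.encode("utf8")).hexdigest()

    def add(self, text: str, response: str) -> None:
        self.current[self.key(text)] = response
        if len(self.current) >= self.batch_size:
            # The batch is full: persist it to disk and start a new one.
            batch_file = self.path / f"{self.n_batches}.json"
            batch_file.write_text(json.dumps(self.current), encoding="utf8")
            for doc_key in self.current:
                self.index[doc_key] = self.n_batches
            self.n_batches += 1
            self.current = {}

    def get(self, text: str) -> Optional[str]:
        doc_key = self.key(text)
        if doc_key in self.current:
            return self.current[doc_key]
        batch_id = self.index.get(doc_key)
        if batch_id is None:
            return None  # cache miss
        if batch_id not in self.loaded:
            # Load the batch from disk, evicting the oldest one if needed.
            if len(self.loaded) >= self.max_batches_in_mem:
                self.loaded.popitem(last=False)
            batch_file = self.path / f"{batch_id}.json"
            self.loaded[batch_id] = json.loads(batch_file.read_text(encoding="utf8"))
        return self.loaded[batch_id][doc_key]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = ToyBatchCache(Path(tmp_dir), batch_size=2, max_batches_in_mem=1)
    for i in range(5):
        cache.add(f"text {i}", f"response {i}")
    n_files = len(list(Path(tmp_dir).glob("*.json")))
    hit = cache.get("text 0")
    miss = cache.get("unseen text")

print(n_files, hit, miss)
```

With five entries and a batch size of two, two batch files are written and the
fifth entry stays in the batch that is still being filled.
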
## Various functions {id="various-functions"}

### spacy.FewShotReader.v1 {id="fewshotreader-v1"}

This function is registered in spaCy's `misc` registry, and reads in examples
from a `.yml`, `.yaml`, `.json` or `.jsonl` file. It uses
[`srsly`](https://github.com/explosion/srsly) to read in these files and parses
them depending on the file extension.

> #### Example config
>
> ```ini
> [components.llm.task.examples]
> @misc = "spacy.FewShotReader.v1"
> path = "ner_examples.yml"
> ```

| Argument | Description |
| -------- | ----------------------------------------------------------- |
| `path` | Path to an examples file with suffix `.yml`, `.yaml`, `.json` or `.jsonl`. ~~Union[str, Path]~~ |

### spacy.FileReader.v1 {id="filereader-v1"}

This function is registered in spaCy's `misc` registry. It reads the file at
the given `path` and returns a `str` representation of its contents. This
function is typically used to read
[Jinja](https://jinja.palletsprojects.com/en/3.1.x/) files containing the prompt
template.

> #### Example config
>
> ```ini
> [components.llm.task.template]
> @misc = "spacy.FileReader.v1"
> path = "ner_template.jinja2"
> ```

| Argument | Description |
| -------- | ------------------------------------------------- |
| `path` | Path to the file to be read. ~~Union[str, Path]~~ |

### Normalizer functions {id="normalizer-functions"}

These functions provide simple normalizations for string comparisons, e.g.
between a list of specified labels and a label given in the raw text of the LLM
response. They are registered in spaCy's `misc` registry and have the signature
`Callable[[str], str]`.

- `spacy.StripNormalizer.v1`: applies only `text.strip()`.
- `spacy.LowercaseNormalizer.v1`: applies `text.strip().lower()` to compare
  strings in a case-insensitive way.