Merge pull request #13253 from explosion/chore/sync-master-with-llm_main

Sync `master` with `docs/llm_main`
menu:
- ['Various Functions', 'various-functions']
---
[The `spacy-llm` package](https://github.com/explosion/spacy-llm) integrates
Large Language Models (LLMs) into spaCy, featuring a modular system for **fast
prototyping** and **prompting**, and turning unstructured responses into
**robust outputs** for various NLP tasks, **no training data** required.
## Config and implementation {id="config"}

An LLM component is implemented through the `LLMWrapper` class. It is accessible
through a generic `llm`
[component factory](https://spacy.io/usage/processing-pipelines#custom-components-factories)
as well as through task-specific component factories: `llm_ner`, `llm_spancat`,
`llm_rel`, `llm_textcat`, `llm_sentiment`, `llm_summarization`,
`llm_entity_linker`, `llm_raw` and `llm_translation`. For these factories, the
GPT-3-5 model from OpenAI is used by default, but this can be customized.
> #### Example
>
> ```python
> from spacy_llm.pipeline import LLMWrapper
>
> # Construction from class. `task`, `model` and `cache` are assumed to have
> # been instantiated beforehand.
> llm = LLMWrapper(vocab=nlp.vocab, task=task, model=model, cache=cache, save_io=True)
> ```
### LLMWrapper.\_\_init\_\_ {id="init",tag="method"}
Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).

## Tasks {id="tasks"}
In `spacy-llm`, a _task_ defines an NLP problem or question and its solution
using an LLM. It does so by implementing the following responsibilities:

1. Loading a prompt template and injecting documents' data into the prompt.
   Optionally, including fewshot examples in the prompt.
2. Splitting the prompt into several pieces following a map-reduce paradigm,
   _if_ the prompt is too long to fit into the model's context and the task
   supports sharding prompts.
3. Parsing the LLM's responses back into structured information and validating
   the parsed output.

Two different task interfaces are supported: `ShardingLLMTask` and
`NonShardingLLMTask`. Only the former supports the sharding of documents, i.e.
splitting up prompts if they are too long.

All tasks are registered in the `llm_tasks` registry.
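To illustrate, a minimal custom task satisfying the non-sharding interface could
look like the sketch below. The `llm_tasks` registry decorator and the two
method signatures mirror the interface documented in the following sections; the
task name, prompt wording and parsing logic are invented for this example.

```python
from typing import Iterable

from spacy.tokens import Doc
from spacy_llm.registry import registry


class SimpleSentimentTask:
    """Toy non-sharding task: renders one prompt per doc and stores the raw
    model reply in a custom Doc extension."""

    def __init__(self):
        if not Doc.has_extension("simple_sentiment"):
            Doc.set_extension("simple_sentiment", default=None)

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        for doc in docs:
            yield f'Estimate the sentiment of this text:\n"{doc.text}"\nEstimated sentiment:'

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[str]
    ) -> Iterable[Doc]:
        for doc, response in zip(docs, responses):
            doc._.simple_sentiment = response.strip()
            yield doc


@registry.llm_tasks("my_namespace.SimpleSentiment.v1")
def make_simple_sentiment_task() -> SimpleSentimentTask:
    return SimpleSentimentTask()
```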
### On Sharding {id="task-sharding"}

"Sharding" describes, generally speaking, the process of distributing parts of a
dataset across multiple storage units for easier processing and lookups. In
`spacy-llm` we use this term (synonymously: "mapping") to describe the splitting
up of prompts if they are too long for a model to handle, and "fusing"
(synonymously: "reducing") to describe how the model responses for several
shards are merged back together into a single document.
Prompts are broken up in a manner that _always_ keeps the prompt in the template
intact, meaning that the instructions to the LLM will always stay complete. The
document content, however, will be split if the length of the fully rendered
prompt exceeds the model's context length.
A toy example: let's assume a model has a context window of 25 tokens and the
prompt template for our fictional, sharding-supporting task looks like this:

```
Estimate the sentiment of this text:
"{text}"
Estimated sentiment:
```
Depending on how tokens are counted exactly (this is a config setting), we might
come up with `n = 12` tokens for the prompt instructions. Furthermore, let's
assume that our `text` is "This has been amazing - I can't remember the last
time I left the cinema so impressed." - which has roughly 19 tokens.
Considering we only have 13 tokens to add to our prompt before we hit the
context limit, we'll have to split our prompt into two parts. Thus `spacy-llm`,
assuming the task used supports sharding, will split the prompt into two (the
default splitting strategy splits by tokens, but alternative splitting
strategies, e.g. splitting by sentences, can be configured):
_(Prompt 1/2)_

```
Estimate the sentiment of this text:
"This has been amazing - I can't remember "
Estimated sentiment:
```

_(Prompt 2/2)_

```
Estimate the sentiment of this text:
"the last time I left the cinema so impressed."
Estimated sentiment:
```
The reduction step is task-specific - a sentiment estimation task might, e.g.,
compute a weighted average of the per-shard sentiment scores. Note that prompt
sharding introduces potential inaccuracies, as the LLM won't have access to the
entire document at once. Depending on your use case this might or might not be
problematic.
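To make the reduce step concrete: a sketch of a length-weighted fusing function
follows. This is purely illustrative - the function name and the weighting
scheme are assumptions for this toy sentiment task, not `spacy-llm` API.

```python
from typing import Sequence


def fuse_sentiment_scores(scores: Sequence[float], shard_lengths: Sequence[int]) -> float:
    """Average per-shard sentiment scores, weighting each shard by its share
    of the document's tokens."""
    total = sum(shard_lengths)
    return sum(score * length / total for score, length in zip(scores, shard_lengths))


# E.g. two shards of 9 and 10 tokens with sentiment scores 0.9 and 0.7:
# fuse_sentiment_scores([0.9, 0.7], [9, 10]) ≈ 0.79
```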
### `NonShardingLLMTask` {id="task-nonsharding"}

#### task.generate_prompts {id="task-nonsharding-generate-prompts"}
Takes a collection of documents, and returns a collection of "prompts", which
can be of type `Any`. Often, prompts are of type `str` - but this is not
enforced to allow for maximum flexibility in the framework.

| Argument    | Description                              |
| ----------- | ---------------------------------------- |
| `docs`      | The input documents. ~~Iterable[Doc]~~   |
| **RETURNS** | The generated prompts. ~~Iterable[Any]~~ |
#### task.parse_responses {id="task-nonsharding-parse-responses"}

Takes a collection of LLM responses and the original documents, parses the
responses into structured information, and sets the annotations on the
documents. The `parse_responses` function is free to set the annotations in any
way, including `Doc` fields like `ents`, `spans` or `cats`, or using custom
defined fields.

The `responses` are of type `Iterable[Any]`, though they will often be `str`
objects. This depends on the return type of the [model](#models).
| Argument    | Description                                             |
| ----------- | ------------------------------------------------------- |
| `docs`      | The input documents. ~~Iterable[Doc]~~                  |
| `responses` | The responses received from the LLM. ~~Iterable[Any]~~  |
| **RETURNS** | The annotated documents. ~~Iterable[Doc]~~              |
### `ShardingLLMTask` {id="task-sharding"}

#### task.generate_prompts {id="task-sharding-generate-prompts"}
Takes a collection of documents, breaks them up into shards if necessary to fit
all content into the model's context, and returns a collection of collections of
"prompts" (i.e. each doc can have multiple shards, each of which has exactly
one prompt), which can be of type `Any`. Often, prompts are of type `str` - but
this is not enforced to allow for maximum flexibility in the framework.
| Argument    | Description                                        |
| ----------- | -------------------------------------------------- |
| `docs`      | The input documents. ~~Iterable[Doc]~~             |
| **RETURNS** | The generated prompts. ~~Iterable[Iterable[Any]]~~ |
#### task.parse_responses {id="task-sharding-parse-responses"}
Receives a collection of collections of LLM responses (i.e. each doc can have
multiple shards, each of which has exactly one prompt / prompt response) and
the original shards, parses the responses into structured information, sets the
annotations on the shards, and merges the doc shards back into single docs. The
`parse_responses` function is free to set the annotations in any way, including
`Doc` fields like `ents`, `spans` or `cats`, or using custom defined fields.

The `responses` are of type `Iterable[Iterable[Any]]`, though they will often be
`str` objects. This depends on the return type of the [model](#models).
| Argument    | Description                                                       |
| ----------- | ----------------------------------------------------------------- |
| `shards`    | The input document shards. ~~Iterable[Iterable[Doc]]~~            |
| `responses` | The responses received from the LLM. ~~Iterable[Iterable[Any]]~~  |
| **RETURNS** | The annotated documents. ~~Iterable[Doc]~~                        |
### Translation {id="translation"}

The translation task translates texts from a defined or inferred source
language into a defined target language.
#### spacy.Translation.v1 {id="translation-v1"}

`spacy.Translation.v1` supports both zero-shot and few-shot prompting.

> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.Translation.v1"
> examples = null
> target_lang = "Spanish"
> ```
| Argument                    | Description                                                                                                                                                                                |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `template`                  | Custom prompt template to send to LLM model. Defaults to [translation.v1.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/translation.v1.jinja). ~~str~~ |
| `examples`                  | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~                                                            |
| `parse_responses` (NEW)     | Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. ~~Optional[TaskResponseParser[TranslationTask]]~~                                |
| `prompt_example_type` (NEW) | Type to use for fewshot examples. Defaults to `TranslationExample`. ~~Optional[Type[FewshotExample]]~~                                                                                    |
| `source_lang`               | Language to translate from. Doesn't have to be set. ~~Optional[str]~~                                                                                                                     |
| `target_lang`               | Language to translate to. No default value, has to be set. ~~str~~                                                                                                                        |
| `field`                     | Name of extension attribute to store translation in (i.e. the translation will be available in `doc._.{field}`). Defaults to `translation`. ~~str~~                                      |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.

```yaml
- text: 'Top of the morning to you!'
  translation: '¡Muy buenos días!'
- text: 'The weather is great today.'
  translation: 'El clima está fantástico hoy.'
- text: 'Do you know what will happen tomorrow?'
  translation: '¿Sabes qué pasará mañana?'
```
```ini
[components.llm.task]
@llm_tasks = "spacy.Translation.v1"
target_lang = "Spanish"

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "translation_examples.yml"
```
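Assembled from such a config, the pipeline stores its output in the extension
attribute named by `field`. A small usage sketch (the config filename is
hypothetical):

```python
from spacy_llm.util import assemble

nlp = assemble("translation_config.cfg")  # hypothetical file containing the config above
doc = nlp("Top of the morning to you!")
print(doc._.translation)  # stored under the default field name "translation"
```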
### Raw prompting {id="raw"}

Unlike the other tasks, `spacy.Raw.vX` doesn't wrap the doc content in a
task-specific prompt. Instead it instructs the model to reply to the doc content
directly. This is handy for use cases like question answering (where each doc
contains one question) or if you want to include customized prompts for each
doc.
#### spacy.Raw.v1 {id="raw-v1"}

Note that since this task may request arbitrary information, it doesn't do any
parsing per se - the model response is stored in a custom `Doc` attribute (i.e.
it can be accessed via `doc._.{field}`).

It supports both zero-shot and few-shot prompting.
> #### Example config
>
> ```ini
> [components.llm.task]
> @llm_tasks = "spacy.Raw.v1"
> examples = null
> ```
| Argument              | Description                                                                                                                                                                |
| --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `template`            | Custom prompt template to send to LLM model. Defaults to [raw.v1.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/raw.v1.jinja). ~~str~~ |
| `examples`            | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~                                            |
| `parse_responses`     | Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. ~~Optional[TaskResponseParser[RawTask]]~~                        |
| `prompt_example_type` | Type to use for fewshot examples. Defaults to `RawExample`. ~~Optional[Type[FewshotExample]]~~                                                                            |
| `field`               | Name of extension attribute to store model reply in (i.e. the reply will be available in `doc._.{field}`). Defaults to `reply`. ~~str~~                                  |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
# Each example can follow an arbitrary pattern. It might help the prompt performance though if the examples resemble
# the actual docs' content.
- text: "3 + 5 = x. What's x?"
  reply: '8'

- text: 'Write me a limerick.'
  reply:
    "There was an Old Man with a beard, Who said, 'It is just as I feared! Two
    Owls and a Hen, Four Larks and a Wren, Have all built their nests in my
    beard!'"

- text: "Analyse the sentiment of the text 'This is great'."
  reply: "'This is great' expresses a very positive sentiment."
```
```ini
[components.llm.task]
@llm_tasks = "spacy.Raw.v1"
field = "llm_reply"

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "raw_examples.yml"
```
### Summarization {id="summarization"}

```ini
[components.llm.task]
@llm_tasks = "spacy.Summarization.v1"
max_n_words = 20

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "summarization_examples.yml"
```
### EL (Entity Linking) {id="nel"}

The EL task links recognized entities (see [NER](#ner)) to those in a knowledge
base (KB). It prompts the LLM to select the most likely candidate from the KB,
whose structure can be arbitrary.
Note that the documents processed by the entity linking task are expected to
have recognized entities in their `.ents` attribute. This can be achieved by
running the [NER task](#ner), by using a trained spaCy NER model, or by setting
the entities manually prior to running the EL task.
In order to be able to pull data from the KB, an object implementing the
`CandidateSelector` protocol has to be provided. This requires two functions:
(1) `__call__()` to fetch candidate entities for entity mentions in the text
(assumed to be available in `Doc.ents`) and (2) `get_entity_description()` to
fetch descriptions for any given entity ID. Descriptions can be empty, but
ideally provide more context for entities stored in the KB.
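For illustration, a custom selector satisfying this protocol could be sketched
as follows. This is a simplified sketch - the exact types expected by
`spacy-llm` are defined in its `ty.py`, and a plain in-memory dict stands in for
a real KB here.

```python
from typing import Dict, Iterable, List, Tuple

from spacy.tokens import Span


class DictCandidateSelector:
    """Illustrative CandidateSelector: resolves entity mentions to (entity ID,
    prior probability) candidates via a dict, with descriptions in a second dict."""

    def __init__(
        self,
        aliases: Dict[str, List[Tuple[str, float]]],
        descriptions: Dict[str, str],
    ):
        self._aliases = aliases            # mention text -> [(entity ID, prior), ...]
        self._descriptions = descriptions  # entity ID -> description

    def __call__(self, ents: Iterable[Span]) -> Iterable[List[Tuple[str, float]]]:
        # Fetch candidate entities for each mention found in the doc.
        return [self._aliases.get(ent.text, []) for ent in ents]

    def get_entity_description(self, entity_id: str) -> str:
        # Descriptions may be empty, but ideally provide context for the LLM.
        return self._descriptions.get(entity_id, "")
```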
`spacy-llm` provides a `CandidateSelector` implementation
(`spacy.CandidateSelector.v1`) that leverages a spaCy knowledge base - as used
in an `entity_linking` component - to select candidates. This knowledge base can
be loaded from an existing spaCy pipeline (note that the pipeline's EL component
doesn't have to be trained) or from a separate .yaml file.
#### spacy.EntityLinker.v1 {id="el-v1"}

Supports zero- and few-shot prompting. Relies on a configurable component
suggesting viable entities before letting the LLM pick the most likely
candidate.
> #### Example config for spacy.EntityLinker.v1
>
> ```ini
> [paths]
> el_nlp = null
>
> ...
>
> [components.llm.task]
> @llm_tasks = "spacy.EntityLinker.v1"
>
> [initialize]
> [initialize.components]
> [initialize.components.llm]
> [initialize.components.llm.candidate_selector]
> @llm_misc = "spacy.CandidateSelector.v1"
>
> # Load a KB from a KB file. For loading KBs from spaCy pipelines see spacy.KBObjectLoader.v1.
> [initialize.components.llm.candidate_selector.kb_loader]
> @llm_misc = "spacy.KBFileLoader.v1"
> # Path to knowledge base .yaml file.
> path = ${paths.el_kb}
> ```
| Argument              | Description                                                                                                                                                                                    |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `template`            | Custom prompt template to send to LLM model. Defaults to [entity_linker.v1.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/entity_linker.v1.jinja). ~~str~~ |
| `parse_responses`     | Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. ~~Optional[TaskResponseParser[EntityLinkerTask]]~~                                   |
| `prompt_example_type` | Type to use for fewshot examples. Defaults to `ELExample`. ~~Optional[Type[FewshotExample]]~~                                                                                                 |
| `examples`            | Optional callable that reads a file containing task examples for few-shot learning. If `None` is passed, zero-shot learning will be used. Defaults to `None`. ~~ExamplesConfigType~~          |
| `scorer`              | Scorer function. Defaults to the metric used by spaCy to evaluate entity linking performance. ~~Optional[Scorer]~~                                                                            |
##### spacy.CandidateSelector.v1 {id="candidate-selector-v1"}

`spacy.CandidateSelector.v1` is an implementation of the `CandidateSelector`
protocol required by [`spacy.EntityLinker.v1`](#el-v1). The built-in candidate
selector method allows loading existing knowledge bases in several ways, e.g.
loading from a spaCy pipeline with a (not necessarily trained) entity linking
component, or loading from a .yaml file describing the knowledge base. Either
way the loaded data will be converted to a spaCy `InMemoryLookupKB` instance.
The KB's selection capabilities are used to select the most likely entity
candidates for the specified mentions.
> #### Example config for spacy.CandidateSelector.v1
>
> ```ini
> [initialize]
> [initialize.components]
> [initialize.components.llm]
> [initialize.components.llm.candidate_selector]
> @llm_misc = "spacy.CandidateSelector.v1"
>
> # Load a KB from a KB file. For loading KBs from spaCy pipelines see spacy.KBObjectLoader.v1.
> [initialize.components.llm.candidate_selector.kb_loader]
> @llm_misc = "spacy.KBFileLoader.v1"
> # Path to knowledge base .yaml file.
> path = ${paths.el_kb}
> ```
| Argument    | Description                                                        |
| ----------- | ------------------------------------------------------------------ |
| `kb_loader` | KB loader object. ~~InMemoryLookupKBLoader~~                       |
| `top_n`     | Top-n candidates to include in the prompt. Defaults to 5. ~~int~~  |
##### spacy.KBObjectLoader.v1 {id="kb-object-loader-v1"}

Adheres to the `InMemoryLookupKBLoader` interface required by
[`spacy.CandidateSelector.v1`](#candidate-selector-v1). Loads a knowledge base
from an existing spaCy pipeline.
> #### Example config for spacy.KBObjectLoader.v1
>
> ```ini
> [initialize.components.llm.candidate_selector.kb_loader]
> @llm_misc = "spacy.KBObjectLoader.v1"
> # Path to knowledge base directory in serialized spaCy pipeline.
> path = ${paths.el_kb}
> # Path to spaCy pipeline. If this is not specified, spacy-llm tries to determine this automatically (but may fail).
> nlp_path = ${paths.el_nlp}
> # Path to file with descriptions for entities.
> desc_path = ${paths.el_desc}
> ```
| Argument          | Description                                                                                                                                                                                                                           |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`            | Path to KB file. ~~Union[str, Path]~~                                                                                                                                                                                                 |
| `nlp_path`        | Path to serialized NLP pipeline. If `None`, the path will be guessed. ~~Optional[Union[Path, str]]~~                                                                                                                                  |
| `desc_path`       | Path to file with descriptions for entities. ~~Optional[Union[Path, str]]~~                                                                                                                                                           |
| `ent_desc_reader` | Entity description reader. Defaults to an internal method expecting a CSV file without header row, with ";" as delimiters, and with two columns - one for the entities' IDs, one for their descriptions. ~~Optional[EntDescReader]~~ |
##### spacy.KBFileLoader.v1 {id="kb-file-loader-v1"}

Adheres to the `InMemoryLookupKBLoader` interface required by
[`spacy.CandidateSelector.v1`](#candidate-selector-v1). Loads a knowledge base
from a knowledge base file. The KB .yaml file has to stick to the following
format:
```yaml
entities:
  # The key should be whatever ID identifies this entity uniquely in your knowledge base.
  ID1:
    name: "..."
    desc: "..."
  ID2:
    ...
# Data on aliases in your knowledge base - e.g. "Apple" for the entity "Apple Inc.".
aliases:
  - alias: "..."
    # List of all entities that this alias refers to.
    entities: ["ID1", "ID2", ...]
    # Optional: prior probabilities that this alias refers to the n-th entity in the "entities" attribute.
    probabilities: [0.5, 0.2, ...]
  - alias: "..."
    entities: [...]
    probabilities: [...]
  ...
```
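To make the mapping concrete, the entries in such a file correspond roughly to
the following calls on spaCy's `InMemoryLookupKB` (a sketch of what a file
loader does internally; the function name, frequency and entity-vector values
are placeholders):

```python
import yaml  # PyYAML, assumed to be installed

from spacy.kb import InMemoryLookupKB
from spacy.vocab import Vocab


def load_kb_file(path: str) -> InMemoryLookupKB:
    """Illustrative loader converting the .yaml format above into a spaCy KB."""
    with open(path, encoding="utf-8") as f:
        data = yaml.safe_load(f)
    kb = InMemoryLookupKB(vocab=Vocab(), entity_vector_length=1)
    for ent_id in data["entities"]:
        # Frequencies and entity vectors are placeholders in this sketch.
        kb.add_entity(entity=ent_id, freq=1, entity_vector=[0.0])
    for alias in data["aliases"]:
        n = len(alias["entities"])
        probs = alias.get("probabilities", [1.0 / n] * n)
        kb.add_alias(alias=alias["alias"], entities=alias["entities"], probabilities=probs)
    return kb
```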
See
[here](https://github.com/explosion/spacy-llm/blob/main/usage_examples/el_openai/el_kb_data.yml)
for a toy example of what such a KB file might look like.
> #### Example config for spacy.KBFileLoader.v1
>
> ```ini
> [initialize.components.llm.candidate_selector.kb_loader]
> @llm_misc = "spacy.KBFileLoader.v1"
> # Path to knowledge base file.
> path = ${paths.el_kb}
> ```
| Argument | Description                           |
| -------- | ------------------------------------- |
| `path`   | Path to KB file. ~~Union[str, Path]~~ |
### NER {id="ner"}

The NER task identifies non-overlapping entities in text.
## Models {id="models"}

A _model_ defines which LLM model to query, and how to query it. It can be a
simple function taking a collection of prompts (consistent with the output type
of `task.generate_prompts()`) and returning a collection of responses
(consistent with the expected input of `parse_responses`). Generally speaking,
it's a function of type
`Callable[[Iterable[Iterable[Any]]], Iterable[Iterable[Any]]]`, but specific
implementations can have other signatures, like
`Callable[[Iterable[Iterable[str]]], Iterable[Iterable[str]]]`.
Note: the model signature expects a nested iterable so it's able to deal with
sharded docs. Unsharded docs (i.e. those produced by
[nonsharding tasks](/api/large-language-models#task-nonsharding)) are reshaped
to fit the expected data structure.
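For illustration, a custom model satisfying this nested signature can be
registered via `registry.llm_models`. The sketch below just fabricates responses
- the registry handle and the canned output are made up, but the shape (one
response list per doc, one response per prompt shard) is the contract described
above:

```python
import random
from typing import Any, Callable, Iterable

from spacy_llm.registry import registry


@registry.llm_models("my_namespace.RandomSentiment.v1")
def random_sentiment() -> Callable[[Iterable[Iterable[Any]]], Iterable[Iterable[Any]]]:
    def _generate(prompts: Iterable[Iterable[Any]]) -> Iterable[Iterable[Any]]:
        # One list of responses per doc, one response per prompt shard.
        for prompts_for_doc in prompts:
            yield [f"{random.uniform(-1.0, 1.0):.2f}" for _ in prompts_for_doc]

    return _generate
```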
### Models via REST API {id="models-rest"}

These models all take the same parameters, but note that the `config` should
contain provider-specific keys and values, as it will be passed onwards to the
provider's API.
| Argument           | Description                                                                                                                                                                     |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `name`             | Model name, i.e. any supported variant for this particular model. Default depends on the specific model (cf. below). ~~str~~                                                   |
| `config`           | Further configuration passed on to the model. Default depends on the specific model (cf. below). ~~Dict[Any, Any]~~                                                            |
| `strict`           | If `True`, raises an error if the LLM API returns a malformed response. Otherwise, returns the error responses as-is. Defaults to `True`. ~~bool~~                             |
| `max_tries`        | Max. number of tries for API request. Defaults to `5`. ~~int~~                                                                                                                 |
| `max_request_time` | Max. time (in seconds) to wait for request to terminate before raising an exception. Defaults to `30.0`. ~~float~~                                                             |
| `interval`         | Time interval (in seconds) between API retries. Defaults to `1.0`. ~~float~~                                                                                                   |
| `endpoint`         | Endpoint URL. Defaults to the provider's standard URL, if available (which is not the case for providers with exclusively custom deployments, such as Azure). ~~Optional[str]~~ |
> #### Example config:
>
> ```ini
> [components.llm.model]
> @llm_models = "spacy.GPT-4.v2"
> name = "gpt-4"
> config = {"temperature": 0.0}
> ```

Currently, these models are provided as part of the core library:
| Model                   | Provider          | Supported names                                                                                                     | Default name         | Default config                      |
| ----------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------ | -------------------- | ------------------------------------ |
| `spacy.GPT-4.v1`        | OpenAI            | `["gpt-4", "gpt-4-0314", "gpt-4-32k", "gpt-4-32k-0314"]`                                                            | `"gpt-4"`            | `{}`                                 |
| `spacy.GPT-4.v2`        | OpenAI            | `["gpt-4", "gpt-4-0314", "gpt-4-32k", "gpt-4-32k-0314"]`                                                            | `"gpt-4"`            | `{temperature=0.0}`                  |
| `spacy.GPT-4.v3`        | OpenAI            | All names of [GPT-4 models](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo) offered by OpenAI        | `"gpt-4"`            | `{temperature=0.0}`                  |
| `spacy.GPT-3-5.v1`      | OpenAI            | `["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-0613-16k", "gpt-3.5-turbo-instruct"]`  | `"gpt-3.5-turbo"`    | `{}`                                 |
| `spacy.GPT-3-5.v2`      | OpenAI            | `["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-0613-16k", "gpt-3.5-turbo-instruct"]`  | `"gpt-3.5-turbo"`    | `{temperature=0.0}`                  |
| `spacy.GPT-3-5.v3`      | OpenAI            | All names of [GPT-3.5 models](https://platform.openai.com/docs/models/gpt-3-5) offered by OpenAI                     | `"gpt-3.5-turbo"`    | `{temperature=0.0}`                  |
| `spacy.Davinci.v1`      | OpenAI            | `["davinci"]`                                                                                                        | `"davinci"`          | `{}`                                 |
| `spacy.Davinci.v2`      | OpenAI            | `["davinci"]`                                                                                                        | `"davinci"`          | `{temperature=0.0, max_tokens=500}`  |
| `spacy.Text-Davinci.v1` | OpenAI            | `["text-davinci-003", "text-davinci-002"]`                                                                           | `"text-davinci-003"` | `{}`                                 |
| `spacy.Text-Ada.v2`     | OpenAI            | `["text-ada-001"]`                                                                                                   | `"text-ada-001"`     | `{temperature=0.0, max_tokens=500}`  |
| `spacy.Azure.v1`        | Microsoft, OpenAI | Arbitrary values                                                                                                     | No default           | `{temperature=0.0}`                  |
| `spacy.Command.v1`      | Cohere            | `["command", "command-light", "command-light-nightly", "command-nightly"]`                                           | `"command"`          | `{}`                                 |
| `spacy.Claude-2-1.v1`   | Anthropic         | `["claude-2-1"]`                                                                                                     | `"claude-2-1"`       | `{}`                                 |
| `spacy.Claude-2.v1`     | Anthropic         | `["claude-2", "claude-2-100k"]`                                                                                      | `"claude-2"`         | `{}`                                 |
| `spacy.Claude-1.v1`     | Anthropic         | `["claude-1", "claude-1-100k"]`                                                                                      | `"claude-1"`         | `{}`                                 |
| `spacy.Claude-1-0.v1`   | Anthropic         | `["claude-1.0"]`                                                                                                     | `"claude-1.0"`       | `{}`                                 |
A _task_ defines an NLP problem or question that will be sent to the LLM via a
prompt. Further, the task defines how to parse the LLM's responses back into
structured information. All tasks are registered in the `llm_tasks` registry.
Practically speaking, a task should adhere to the `Protocol` named `LLMTask`
defined in
[`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). It
needs to define a `generate_prompts` function and a `parse_responses` function.

Tasks may support prompt sharding (for more info see the API docs on
[sharding](/api/large-language-models#task-sharding) and
[non-sharding](/api/large-language-models#task-nonsharding) tasks). The function
signatures for `generate_prompts` and `parse_responses` depend on whether the
task supports sharding.

For tasks **not supporting** sharding:

| Task                                                                                     | Description                                                                                                                                                   |
| ---------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`task.generate_prompts`](/api/large-language-models#task-nonsharding-generate-prompts) | Takes a collection of documents, and returns a collection of prompts, which can be of type `Any`.                                                            |
| [`task.parse_responses`](/api/large-language-models#task-nonsharding-parse-responses)   | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. |

For tasks **supporting** sharding:

| Task                                                                                   | Description                                                                                                                                                                                                                                                    |
| --------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`task.generate_prompts`](/api/large-language-models#task-sharding-generate-prompts)  | Takes a collection of documents, and returns a collection of collections of prompt shards, which can be of type `Any`.                                                                                                                                        |
| [`task.parse_responses`](/api/large-language-models#task-sharding-parse-responses)    | Takes a collection of collections of LLM responses (one per prompt shard) and the original documents, parses the responses into structured information, sets the annotations on the doc shards, and merges those doc shards back into a single doc instance. |
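Rendered as a `typing.Protocol`, the non-sharding variant of this contract looks
roughly as follows (a simplified sketch; the authoritative definitions live in
`spacy-llm`'s `ty.py`):

```python
from typing import Any, Iterable, Protocol

from spacy.tokens import Doc


class NonShardingLLMTask(Protocol):
    """Simplified rendering of the task contract for tasks without sharding support."""

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[Any]:
        """Render one prompt per input doc."""
        ...

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[Any]
    ) -> Iterable[Doc]:
        """Parse the LLM responses and set annotations on the docs."""
        ...
```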
Moreover, the task may define an optional [`scorer` method](/api/scorer#score).
It should accept an iterable of `Example` objects as input and return a score
dictionary. If the scorer method is defined, `spacy-llm` will call it to
evaluate the component.
| Component                                                                | Description                                                                                                         |
| ------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------- |
| [`spacy.EntityLinker.v1`](/api/large-language-models#el-v1)              | The entity linking task prompts the model to link all entities in a given text to entries in a knowledge base.     |
| [`spacy.Summarization.v1`](/api/large-language-models#summarization-v1)  | The summarization task prompts the model for a concise summary of the provided text.                               |
| [`spacy.NER.v3`](/api/large-language-models#ner-v3)                      | Implements Chain-of-Thought reasoning for NER extraction - obtains higher accuracy than v1 or v2.                  |
| [`spacy.NER.v2`](/api/large-language-models#ner-v2)                      | Builds on v1 and additionally supports defining the provided labels with explicit descriptions.                    |
| [`spacy.TextCat.v2`](/api/large-language-models#textcat-v2)              | Version 2 builds on v1 and includes an improved prompt template.                                                   |
| [`spacy.TextCat.v1`](/api/large-language-models#textcat-v1)              | Version 1 of the built-in TextCat task supports both zero-shot and few-shot prompting.                             |
| [`spacy.Lemma.v1`](/api/large-language-models#lemma-v1)                  | Lemmatizes the provided text and updates the `lemma_` attribute of the tokens accordingly.                         |
| [`spacy.Raw.v1`](/api/large-language-models#raw-v1)                      | Executes raw doc content as prompt to LLM.                                                                         |
| [`spacy.Sentiment.v1`](/api/large-language-models#sentiment-v1)          | Performs sentiment analysis on provided texts.                                                                     |
| [`spacy.Translation.v1`](/api/large-language-models#translation-v1)      | Translates doc content into the specified target language.                                                         |
| [`spacy.NoOp.v1`](/api/large-language-models#noop-v1)                    | This task is only useful for testing - it tells the LLM to do nothing, and does not set any fields on the `docs`.  |

#### Providing examples for few-shot prompts {id="few-shot-prompts"}