Apply suggestions from review

Victoria Slocum 2023-07-18 10:42:27 +02:00
parent acf63e55ac
commit 0b97fff92d
3 changed files with 129 additions and 176 deletions

View File

@ -60,52 +60,6 @@ prompt. Further, the task defines how to parse the LLM's responses back into
structured information. All tasks are registered in spaCy's `llm_tasks`
registry.
Practically speaking, a task should adhere to the `Protocol` `LLMTask` defined
in [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). It needs to define a
`generate_prompts` function and a `parse_responses` function.
Moreover, the task may define an optional
[`scorer` method](https://spacy.io/api/scorer#score). It should accept an
iterable of `Example`s as input and return a score dictionary. If the `scorer`
method is defined, `spacy-llm` will call it to evaluate the component.
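As a rough sketch (the registry handle, class name and prompt below are hypothetical and deliberately minimal), a custom task adhering to this protocol could look like this:
```python
from typing import Iterable
from spacy.tokens import Doc
from spacy_llm.registry import registry

class SimpleNERTask:
    def __init__(self, labels: str):
        self._labels = labels.split(",")

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        # One prompt per doc, asking for entities with the configured labels.
        for doc in docs:
            yield f"Find entities of types {', '.join(self._labels)} in: {doc.text}"

    def parse_responses(
        self, docs: Iterable[Doc], responses: Iterable[str]
    ) -> Iterable[Doc]:
        # Turn the raw LLM responses back into annotations on the docs.
        for doc, response in zip(docs, responses):
            # ... parse `response` and set doc.ents accordingly ...
            yield doc

@registry.llm_tasks("my.SimpleNER.v1")
def make_simple_ner_task(labels: str) -> SimpleNERTask:
    return SimpleNERTask(labels=labels)
```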
#### Providing examples for few-shot prompts {id="few-shot-prompts"}
All built-in tasks support few-shot prompts, i. e. including examples in a
prompt. Examples can be supplied in two ways: (1) as a separate file containing
only examples or (2) by initializing `llm` with a `get_examples()` callback
(like any other spaCy pipeline component).
##### (1) Few-shot example file
A file containing examples for few-shot prompting can be configured like this:
```ini
[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = PERSON,ORGANISATION,LOCATION
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "ner_examples.yml"
```
The supplied file has to conform to the format expected by the required task
(see the task documentation further down).
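For the NER task configured above, such a file could look roughly like this (a sketch only; see the task-specific sections below for the exact schema each task expects):
```yaml
- text: Jack and Jill went up the hill.
  entities:
    PERSON:
      - Jack
      - Jill
    LOCATION:
      - hill
```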
##### (2) Initializing the `llm` component with a `get_examples()` callback
Alternatively, you can initialize your `nlp` pipeline by providing a
`get_examples` callback for
[`nlp.initialize`](https://spacy.io/api/language#initialize) and setting
`n_prompt_examples` to a positive number to automatically fetch a few examples
for few-shot learning. Set `n_prompt_examples` to `-1` to use all examples as
part of the few-shot learning prompt.
```ini
[initialize.components.llm]
n_prompt_examples = 3
```
#### task.generate_prompts {id="task-generate-prompts"}
Takes a collection of documents, and returns a collection of "prompts", which
@ -161,9 +115,10 @@ note that this requirement will be included in the prompt, but the task doesn't
perform a hard cut-off. It's hence possible that your summary exceeds
`max_n_words`.
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: >
@ -239,9 +194,10 @@ the following parameters:
expand the span to the next token boundaries, e.g. expanding `"New Y"` out to
`"New York"`.
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: Jack and Jill went up the hill.
@ -328,9 +284,10 @@ the following parameters:
expand the span to the next token boundaries, e.g. expanding `"New Y"` out to
`"New York"`.
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: Jack and Jill went up the hill.
@ -442,9 +399,10 @@ definitions are included in the prompt.
| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```json
[
@ -496,9 +454,10 @@ prompting and includes an improved prompt template.
| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```json
[
@ -545,9 +504,10 @@ prompting.
| `allow_none` | When set to `True`, allows the LLM to not return any of the given labels. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```json
[
@ -593,9 +553,10 @@ on an upstream NER component for entity extraction.
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ |
| `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```json
{"text": "Laura bought a house in Boston with her husband Mark.", "ents": [{"start_char": 0, "end_char": 5, "label": "PERSON"}, {"start_char": 24, "end_char": 30, "label": "GPE"}, {"start_char": 48, "end_char": 52, "label": "PERSON"}], "relations": [{"dep": 0, "dest": 1, "relation": "LivesIn"}, {"dep": 2, "dest": 1, "relation": "LivesIn"}]}
@ -654,9 +615,10 @@ doesn't match the number of tokens recognized by spaCy, no lemmas are stored in
the corresponding doc's tokens. Otherwise, each token's `.lemma_` property is
updated with the lemma suggested by the LLM.
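For instance, assuming a config file that wires up an `llm` component with this lemmatization task (the file name here is hypothetical), the results can be read off the doc as usual:
```python
from spacy_llm.util import assemble

# Assemble the pipeline from a config using the lemma task (hypothetical file).
nlp = assemble("lemma_config.cfg")
doc = nlp("I'm buying ice cream.")
for token in doc:
    print(token.text, token.lemma_)
```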
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: I'm buying ice cream.
@ -706,9 +668,10 @@ issues (e. g. in case of unexpected LLM responses) the value might be `None`.
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `field` | Name of extension attribute to store the sentiment score in (i. e. the score will be available in `doc._.{field}`). Defaults to `sentiment`. ~~str~~ |
To perform [few-shot learning](/usage/large-language-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: 'This is horrifying.'
@ -751,43 +714,7 @@ it's a function of type `Callable[[Iterable[Any]], Iterable[Any]]`, but specific
implementations can have other signatures, like
`Callable[[Iterable[str]], Iterable[str]]`.
All built-in models are registered in `llm_models`. If no model is specified,
`spacy-llm` currently connects to the OpenAI API by default using REST and
accesses the `"gpt-3.5-turbo"` model.
Currently, three different approaches to using LLMs are supported:
1. `spacy-llm`'s native REST interface. This is the default for all hosted
   models (e. g. OpenAI, Cohere, Anthropic, ...).
2. A HuggingFace integration that lets you run a limited set of HF models
   locally.
3. A LangChain integration that lets you run any model supported by LangChain
   (hosted or local).
Approaches 1 and 2 are the defaults for hosted and local models, respectively.
Alternatively, you can use LangChain to access hosted or local models by
specifying one of the models registered with the `langchain.` prefix.
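To use a specific model instead, you can override the `model` section of the config. The following is a minimal sketch; the registered handle shown is illustrative, so check the models documented further down for the exact names available:
```ini
[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
name = "gpt-3.5-turbo"
```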
<Infobox>
_Why use LangChain if there are also a native REST and a HuggingFace interface? When should I use what?_
Third-party libraries like `langchain` focus on prompt management, integration
of many different LLM APIs, and other related features such as conversational
memory or agents. `spacy-llm`, on the other hand, emphasizes features we
consider useful in the context of NLP pipelines utilizing LLMs to process
documents (mostly) independently of each other. It makes sense that the feature
sets of such third-party libraries and `spacy-llm` aren't identical - and users
might want to take advantage of features not available in `spacy-llm`.
The advantage of implementing our own REST and HuggingFace integrations is that
we can ensure a larger degree of stability and robustness, as we can guarantee
backwards compatibility and more smoothly integrated error handling.
If, however, you need features or APIs not natively covered by `spacy-llm`,
it's trivial to use LangChain to cover this - and easy to customize the
prompting mechanism, if so required.
</Infobox>
#### API Keys {id="api-keys"}
Note that when using hosted services, you have to ensure that the proper API
keys are set as environment variables as described by the corresponding
@ -1377,7 +1304,7 @@ python -m pip install "accelerate>=0.16.0,<1.0"
> ```
| Argument | Description |
| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | The name of an OpenLLaMA model that is supported. ~~Literal["open_llama_3b", "open_llama_7b", "open_llama_7b_v2", "open_llama_13b"]~~ |
| `config_init` | Further configuration passed on to the construction of the model with `transformers.AutoModelForCausalLM.from_pretrained()`. Defaults to `{}`. ~~Dict[str, Any]~~ |
| `config_run` | Further configuration used during model inference. Defaults to `{}`. ~~Dict[str, Any]~~ |
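As a sketch of how `config_init` can be used - for example, to let `accelerate` place the model on the available device(s) - the following is illustrative rather than prescriptive:
```ini
[components.llm.model]
@llm_models = "spacy.OpenLLaMA.v1"
name = "open_llama_3b"

[components.llm.model.config_init]
device_map = "auto"
```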

View File

@ -9,8 +9,6 @@ menu:
- ['API', 'api']
- ['Tasks', 'tasks']
- ['Models', 'models']
- ['Ongoing work', 'ongoing-work']
- ['Issues', 'issues']
---
[The spacy-llm package](https://github.com/explosion/spacy-llm) integrates Large
@ -359,15 +357,18 @@ Practically speaking, a task should adhere to the `Protocol` `LLMTask` defined
in [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py). It needs to define a
`generate_prompts` function and a `parse_responses` function.
Moreover, the task may define an optional
[`scorer` method](https://spacy.io/api/scorer#score). It should accept an
iterable of `Example`s as input and return a score dictionary. If the `scorer`
method is defined, `spacy-llm` will call it to evaluate the component.
| Component | Description |
| --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`task.generate_prompts`](/api/large-language-models#task-generate-prompts) | Takes a collection of documents, and returns a collection of "prompts", which can be of type `Any`. |
| [`task.parse_responses`](/api/large-language-models#task-parse-responses) | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. |
The following built-in tasks are available:
| Task | Description |
| ----------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`spacy.Summarization.v1`](/api/large-language-models#summarization-v1) | The summarization task prompts the model for a concise summary of the provided text. |
| [`spacy.NER.v2`](/api/large-language-models#ner-v2) | The built-in NER task supports both zero-shot and few-shot prompting. This version also supports explicitly defining the provided labels with custom descriptions. |
| [`spacy.NER.v1`](/api/large-language-models#ner-v1) | The original version of the built-in NER task supports both zero-shot and few-shot prompting. |
@ -381,6 +382,43 @@ method is defined, `spacy-llm` will call it to evaluate the component.
| [`spacy.Sentiment.v1`](/api/large-language-models#sentiment-v1) | Performs sentiment analysis on provided texts. |
| [`spacy.NoOp.v1`](/api/large-language-models#noop-v1) | This task is only useful for testing - it tells the LLM to do nothing, and does not set any fields on the `docs`. |
#### Providing examples for few-shot prompts {id="few-shot-prompts"}
All built-in tasks support few-shot prompts, i. e. including examples in a
prompt. Examples can be supplied in two ways: (1) as a separate file containing
only examples or (2) by initializing `llm` with a `get_examples()` callback
(like any other spaCy pipeline component).
##### (1) Few-shot example file
A file containing examples for few-shot prompting can be configured like this:
```ini
[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = PERSON,ORGANISATION,LOCATION
[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "ner_examples.yml"
```
The supplied file has to conform to the format expected by the required task
(see the task documentation further down).
##### (2) Initializing the `llm` component with a `get_examples()` callback
Alternatively, you can initialize your `nlp` pipeline by providing a
`get_examples` callback for
[`nlp.initialize`](https://spacy.io/api/language#initialize) and setting
`n_prompt_examples` to a positive number to automatically fetch a few examples
for few-shot learning. Set `n_prompt_examples` to `-1` to use all examples as
part of the few-shot learning prompt.
```ini
[initialize.components.llm]
n_prompt_examples = 3
```
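As a minimal sketch of this mechanism in code (the task config is abbreviated and the example data is illustrative), the callback passed to `nlp.initialize` should yield `Example` objects, from which `spacy-llm` then samples `n_prompt_examples` when it is set in the config as shown above:
```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe(
    "llm",
    config={"task": {"@llm_tasks": "spacy.NER.v2", "labels": "PERSON,ORGANISATION,LOCATION"}},
)

def get_examples():
    # Yield gold-standard examples; spacy-llm samples from these for the
    # few-shot prompt.
    doc = nlp.make_doc("Jack and Jill went up the hill.")
    yield Example.from_dict(doc, {"entities": [(0, 4, "PERSON"), (9, 13, "PERSON")]})

nlp.initialize(get_examples=get_examples)
```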
### Model {id="models"}
A _model_ defines which LLM to query, and how to query it. It can be a
@ -408,29 +446,33 @@ Approaches 1 and 2 are the defaults for hosted and local models,
respectively. Alternatively, you can use LangChain to access hosted or local
models by specifying one of the models registered with the `langchain.` prefix.
<Infobox>
_Why use LangChain if there are also a native REST and a HuggingFace interface? When should I use what?_
Third-party libraries like `langchain` focus on prompt management, integration
of many different LLM APIs, and other related features such as conversational
memory or agents. `spacy-llm`, on the other hand, emphasizes features we
consider useful in the context of NLP pipelines utilizing LLMs to process
documents (mostly) independently of each other. It makes sense that the feature
sets of such third-party libraries and `spacy-llm` aren't identical - and users
might want to take advantage of features not available in `spacy-llm`.
The advantage of implementing our own REST and HuggingFace integrations is that
we can ensure a larger degree of stability and robustness, as we can guarantee
backwards compatibility and more smoothly integrated error handling.
If, however, you need features or APIs not natively covered by `spacy-llm`,
it's trivial to use LangChain to cover this - and easy to customize the
prompting mechanism, if so required.
</Infobox>
<Infobox variant="warning">
Note that when using hosted services, you have to ensure that the [proper API
keys](/api/large-language-models#api-keys) are set as environment variables as described by the corresponding
provider's documentation.
E. g. when using OpenAI, you have to get an API key from openai.com, and ensure
that the keys are set as environment variables:
```shell
export OPENAI_API_KEY="sk-..."
export OPENAI_API_ORG="org-..."
```
For Cohere it's
```shell
export CO_API_KEY="..."
```
and for Anthropic
```shell
export ANTHROPIC_API_KEY="..."
```
</Infobox>
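A quick, purely illustrative way to check from Python that a key is actually visible to your process:
```python
import os

# Fail early if the key isn't set; hosted OpenAI models won't be reachable otherwise.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set.")
```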
| Component | Description |
| ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------ |
@ -473,23 +515,3 @@ documents at each run that keeps batches of documents stored on disk.
| [`spacy.FewShotReader.v1`](/api/large-language-models#fewshotreader-v1) | This function is registered in spaCy's `misc` registry, and reads in examples from a `.yml`, `.yaml`, `.json` or `.jsonl` file. It uses [`srsly`](https://github.com/explosion/srsly) to read in these files and parses them depending on the file extension. |
| [`spacy.FileReader.v1`](/api/large-language-models#filereader-v1) | This function is registered in spaCy's `misc` registry, and reads a file provided to the `path` to return a `str` representation of its contents. This function is typically used to read [Jinja](https://jinja.palletsprojects.com/en/3.1.x/) files containing the prompt template. |
| [Normalizer functions](/api/large-language-models#normalizer-functions) | These functions provide simple normalizations for string comparisons, e.g. between a list of specified labels and a label given in the raw text of the LLM response. |
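For example, `spacy.FileReader.v1` can be used to plug a custom Jinja prompt template into a task that accepts a `template` argument; the template path below is hypothetical:
```ini
[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = PERSON,ORGANISATION,LOCATION

[components.llm.task.template]
@misc = "spacy.FileReader.v1"
path = "ner_template.jinja"
```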
## Ongoing work {id="ongoing-work"}
In the near future, we will
- Add more example tasks
- Support a broader range of models
- Provide more example use-cases and tutorials
- Make the built-in tasks easier to customize via Jinja templates to define the
instructions & examples
PRs are always welcome!
## Reporting issues {id="issues"}
If you have questions regarding the usage of `spacy-llm`, or want to give us
feedback after giving it a spin, please use the
[discussion board](https://github.com/explosion/spaCy/discussions). Bug reports
can be filed on the
[spaCy issue tracker](https://github.com/explosion/spaCy/issues). Thank you!

View File

@ -37,7 +37,11 @@
{ "text": "spaCy Projects", "url": "/usage/projects", "tag": "new" },
{ "text": "Saving & Loading", "url": "/usage/saving-loading" },
{ "text": "Visualizers", "url": "/usage/visualizers" },
{ "text": "Large Language Models", "url": "/usage/large-language-models", "tag": "new" }
{
"text": "Large Language Models",
"url": "/usage/large-language-models",
"tag": "new"
}
]
},
{
@ -102,6 +106,7 @@
{ "text": "EntityLinker", "url": "/api/entitylinker" },
{ "text": "EntityRecognizer", "url": "/api/entityrecognizer" },
{ "text": "EntityRuler", "url": "/api/entityruler" },
{ "text": "Large Language Models", "url": "/api/large-language-models" },
{ "text": "Lemmatizer", "url": "/api/lemmatizer" },
{ "text": "Morphologizer", "url": "/api/morphologizer" },
{ "text": "SentenceRecognizer", "url": "/api/sentencerecognizer" },
@ -134,7 +139,6 @@
{ "text": "Corpus", "url": "/api/corpus" },
{ "text": "InMemoryLookupKB", "url": "/api/inmemorylookupkb" },
{ "text": "KnowledgeBase", "url": "/api/kb" },
{ "text": "Large Language Models", "url": "/api/large-language-models" },
{ "text": "Lookups", "url": "/api/lookups" },
{ "text": "MorphAnalysis", "url": "/api/morphology#morphanalysis" },
{ "text": "Morphology", "url": "/api/morphology" },