Apply suggestions from review

Victoria Slocum 2023-07-19 15:22:11 +02:00
parent ae61351cdb
commit b8a6a25953
2 changed files with 34 additions and 40 deletions

View File

@@ -32,7 +32,7 @@ An `llm` component is defined by two main settings:
 - A [**task**](#tasks), defining the prompt to send to the LLM as well as the
   functionality to parse the resulting response back into structured fields on
-  the [Doc](https://spacy.io/api/doc) objects.
+  the [Doc](/api/doc) objects.
 - A [**model**](#models) defining the model and how to connect to it. Note that
   `spacy-llm` supports both access to external APIs (such as OpenAI) as well as
   access to self-hosted open-source LLMs (such as using Dolly through Hugging
@@ -187,7 +187,7 @@ the following parameters:
   case variances in the LLM's output.
 - The `alignment_mode` argument is used to match entities as returned by the LLM
   to the tokens from the original `Doc` - specifically it's used as argument in
-  the call to [`doc.char_span()`](https://spacy.io/api/doc#char_span). The
+  the call to [`doc.char_span()`](/api/doc#char_span). The
   `"strict"` mode will only keep spans that strictly adhere to the given token
   boundaries. `"contract"` will only keep those tokens that are fully within the
   given range, e.g. reducing `"New Y"` to `"New"`. Finally, `"expand"` will
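The three modes can be tried directly on spaCy's `Doc.char_span`; a minimal, self-contained sketch using the `"New Y"` example from the passage above:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like New York")

# Characters 7-12 cover "New Y", which cuts through the token "York".
print(doc.char_span(7, 12, alignment_mode="strict"))    # None: misaligned span is dropped
print(doc.char_span(7, 12, alignment_mode="contract"))  # "New": the partial token is removed
print(doc.char_span(7, 12, alignment_mode="expand"))    # "New York": the partial token is completed
```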
@@ -277,7 +277,7 @@ the following parameters:
   case variances in the LLM's output.
 - The `alignment_mode` argument is used to match entities as returned by the LLM
   to the tokens from the original `Doc` - specifically it's used as argument in
-  the call to [`doc.char_span()`](https://spacy.io/api/doc#char_span). The
+  the call to [`doc.char_span()`](/api/doc#char_span). The
   `"strict"` mode will only keep spans that strictly adhere to the given token
   boundaries. `"contract"` will only keep those tokens that are fully within the
   given range, e.g. reducing `"New Y"` to `"New"`. Finally, `"expand"` will

View File

@@ -12,10 +12,9 @@ menu:
 ---

 [The spacy-llm package](https://github.com/explosion/spacy-llm) integrates Large
-Language Models (LLMs) into spaCy pipelines, featuring a modular
-system for **fast prototyping** and **prompting**, and turning unstructured
-responses into **robust outputs** for various NLP tasks, **no training data**
-required.
+Language Models (LLMs) into spaCy pipelines, featuring a modular system for
+**fast prototyping** and **prompting**, and turning unstructured responses into
+**robust outputs** for various NLP tasks, **no training data** required.

 - Serializable `llm` **component** to integrate prompts into your pipeline
 - **Modular functions** to define the [**task**](#tasks) (prompting and parsing)
@@ -25,11 +24,13 @@ required.
 - Access to
   **[OpenAI API](https://platform.openai.com/docs/api-reference/introduction)**,
   including GPT-4 and various GPT-3 models
-- Built-in support for various **open-source** models hosted on [Hugging Face](https://huggingface.co/)
-- Usage examples for standard NLP tasks such as **Named Entity Recognition** and **Text Classification**
-- Easy implementation of **your own functions** via
-  the [registry](https://spacy.io/api/top-level#registry) for custom
-  prompting, parsing and model integrations
+- Built-in support for various **open-source** models hosted on
+  [Hugging Face](https://huggingface.co/)
+- Usage examples for standard NLP tasks such as **Named Entity Recognition** and
+  **Text Classification**
+- Easy implementation of **your own functions** via the
+  [registry](/api/top-level#registry) for custom prompting, parsing and model
+  integrations

 ## Motivation {id="motivation"}
@@ -38,10 +39,6 @@ capabilities. With only a few (and sometimes no) examples, an LLM can be
 prompted to perform custom NLP tasks such as text categorization, named entity
 recognition, coreference resolution, information extraction and more.

-[spaCy](https://spacy.io) is a well-established library for building systems
-that need to work with language in various ways. spaCy's built-in components are
-generally powered by supervised learning or rule-based approaches.
-
 Supervised learning is much worse than LLM prompting for prototyping, but for
 many tasks it's much better for production. A transformer model that runs
 comfortably on a single GPU is extremely powerful, and it's likely to be a
@@ -70,7 +67,7 @@ well-thought-out library, which is exactly what spaCy provides.
 `spacy-llm` will be installed automatically in future spaCy versions. For now,
 you can run the following in the same virtual environment where you already have
-`spacy` [installed](https://spacy.io/usage).
+`spacy` [installed](/usage).

 > ⚠️ This package is still experimental and it is possible that changes made to
 > the interface will be breaking in minor version updates.
@@ -82,9 +79,8 @@ python -m pip install spacy-llm
 ## Usage {id="usage"}

 The task and the model have to be supplied to the `llm` pipeline component using
-the [config system](https://spacy.io/api/data-formats#config). This package
-provides various built-in functionality, as detailed in the [API](#-api)
-documentation.
+the [config system](/api/data-formats#config). This package provides various
+built-in functionality, as detailed in the [API](#-api) documentation.

 ### Example 1: Add a text classifier using a GPT-3 model from OpenAI {id="example-1"}
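To give a sense of the shape of such a config, here is a minimal sketch; the exact registry names (`spacy.TextCat.v2`, `spacy.GPT-3-5.v1`) and available options depend on the `spacy-llm` version installed:

```ini
[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.TextCat.v2"
labels = COMPLIMENT,INSULT

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
```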
@@ -173,8 +169,8 @@ to be `"databricks/dolly-v2-12b"` for better performance.
 ### Example 3: Create the component directly in Python {id="example-3"}

-The `llm` component behaves as any other component does, so adding it to
-an existing pipeline follows the same pattern:
+The `llm` component behaves as any other component does, so adding it to an
+existing pipeline follows the same pattern:

 ```python
 import spacy
@@ -198,16 +194,16 @@ print([(ent.text, ent.label_) for ent in doc.ents])
 ```

 Note that for efficient usage of resources, typically you would use
-[`nlp.pipe(docs)`](https://spacy.io/api/language#pipe) with a batch, instead of
-calling `nlp(doc)` with a single document.
+[`nlp.pipe(docs)`](/api/language#pipe) with a batch, instead of calling
+`nlp(doc)` with a single document.

 ### Example 4: Implement your own custom task {id="example-4"}

 To write a [`task`](#tasks), you need to implement two functions:
-`generate_prompts` that takes a list of [`Doc`](https://spacy.io/api/doc)
-objects and transforms them into a list of prompts, and `parse_responses` that
-transforms the LLM outputs into annotations on the
-[`Doc`](https://spacy.io/api/doc), e.g. entity spans, text categories and more.
+`generate_prompts` that takes a list of [`Doc`](/api/doc) objects and transforms
+them into a list of prompts, and `parse_responses` that transforms the LLM
+outputs into annotations on the [`Doc`](/api/doc), e.g. entity spans, text
+categories and more.

 To register your custom task, decorate a factory function using the
 `spacy_llm.registry.llm_tasks` decorator with a custom name that you can refer
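A simplified sketch of that pattern, with `MyTask` and the prompt wording as placeholders:

```python
from typing import Iterable, List

from spacy.tokens import Doc
from spacy_llm.registry import registry


@registry.llm_tasks("my_namespace.MyTask.v1")
def make_my_task(labels: str) -> "MyTask":
    return MyTask(labels=labels.split(","))


class MyTask:
    def __init__(self, labels: List[str]):
        self._labels = labels

    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        # Turn each Doc into one prompt string.
        for doc in docs:
            yield f"Pick one label from {self._labels} for this text: {doc.text}"

    def parse_responses(self, docs: Iterable[Doc], responses: Iterable[str]) -> Iterable[Doc]:
        # Write the LLM's answer back onto the Doc, here as text categories.
        for doc, response in zip(docs, responses):
            doc.cats = {label: 1.0 if label in response else 0.0 for label in self._labels}
            yield doc
```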
@@ -325,7 +321,7 @@ An `llm` component is defined by two main settings:
 - A [**task**](#tasks), defining the prompt to send to the LLM as well as the
   functionality to parse the resulting response back into structured fields on
-  the [Doc](https://spacy.io/api/doc) objects.
+  the [Doc](/api/doc) objects.
 - A [**model**](#models) defining the model to use and how to connect to it.
   Note that `spacy-llm` supports both access to external APIs (such as OpenAI)
   as well as access to self-hosted open-source LLMs (such as using Dolly through
@@ -350,8 +346,7 @@ want to disable this behavior.
 A _task_ defines an NLP problem or question, that will be sent to the LLM via a
 prompt. Further, the task defines how to parse the LLM's responses back into
-structured information. All tasks are registered in the `llm_tasks`
-registry.
+structured information. All tasks are registered in the `llm_tasks` registry.

 Practically speaking, a task should adhere to the `Protocol` `LLMTask` defined
 in [`ty.py`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/ty.py).
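Conceptually, that protocol boils down to something like the following simplified sketch; `ty.py` holds the authoritative definition:

```python
from typing import Any, Iterable, Protocol

from spacy.tokens import Doc


class LLMTask(Protocol):
    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[Any]:
        ...

    def parse_responses(self, docs: Iterable[Doc], responses: Iterable[Any]) -> Iterable[Doc]:
        ...
```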
@@ -363,10 +358,10 @@ function.
 | [`task.generate_prompts`](/api/large-language-models#task-generate-prompts) | Takes a collection of documents, and returns a collection of "prompts", which can be of type `Any`. |
 | [`task.parse_responses`](/api/large-language-models#task-parse-responses) | Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. |

-Moreover, the task may define an optional
-[`scorer` method](https://spacy.io/api/scorer#score). It should accept an
-iterable of `Example`s as input and return a score dictionary. If the `scorer`
-method is defined, `spacy-llm` will call it to evaluate the component.
+Moreover, the task may define an optional [`scorer` method](/api/scorer#score).
+It should accept an iterable of `Example`s as input and return a score
+dictionary. If the `scorer` method is defined, `spacy-llm` will call it to
+evaluate the component.
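One possible shape for such a method, assuming an entity-style task that reuses spaCy's built-in span metrics (a hedged sketch, not the library's actual scorer):

```python
from typing import Any, Dict, Iterable

from spacy.scorer import Scorer
from spacy.training import Example


def score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
    # Reuse spaCy's span scorer to compute P/R/F over predicted entities.
    return Scorer.score_spans(examples, attr="ents", **kwargs)
```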

 | Component | Description |
 | ------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
@@ -409,11 +404,10 @@ The supplied file has to conform to the format expected by the required task
 ##### (2) Initializing the `llm` component with a `get_examples()` callback

 Alternatively, you can initialize your `nlp` pipeline by providing a
-`get_examples` callback for
-[`nlp.initialize`](https://spacy.io/api/language#initialize) and setting
-`n_prompt_examples` to a positive number to automatically fetch a few examples
-for few-shot learning. Set `n_prompt_examples` to `-1` to use all examples as
-part of the few-shot learning prompt.
+`get_examples` callback for [`nlp.initialize`](/api/language#initialize) and
+setting `n_prompt_examples` to a positive number to automatically fetch a few
+examples for few-shot learning. Set `n_prompt_examples` to `-1` to use all
+examples as part of the few-shot learning prompt.

 ```ini
 [initialize.components.llm]
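On the Python side, a rough sketch of what the `get_examples` callback and the `initialize` call can look like; the path and data source here are illustrative:

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.blank("en")  # assume the `llm` component has been added as above


def get_examples():
    # Hypothetical source: gold annotations stored in a DocBin on disk.
    doc_bin = DocBin().from_disk("./examples.spacy")
    docs = doc_bin.get_docs(nlp.vocab)
    return [Example(nlp.make_doc(doc.text), doc) for doc in docs]


nlp.initialize(get_examples=get_examples)
```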