Shorten NER section

svlandeg 2023-08-31 14:42:30 +02:00
parent 3e4264899c
commit 3266898466


@ -52,7 +52,9 @@ signatures of the `model` and `task` callables are consistent with each other
and emits a warning if they don't. `validate_types` can be set to `False` if you and emits a warning if they don't. `validate_types` can be set to `False` if you
want to disable this behavior. want to disable this behavior.
### Tasks {id="tasks"} ## Tasks {id="tasks"}
### Task implementation {id="task-implementation"}
A _task_ defines an NLP problem or question, that will be sent to the LLM via a A _task_ defines an NLP problem or question, that will be sent to the LLM via a
prompt. Further, the task defines how to parse the LLM's responses back into prompt. Further, the task defines how to parse the LLM's responses back into
@ -146,6 +148,10 @@ max_n_words = 20
path = "summarization_examples.yml" path = "summarization_examples.yml"
``` ```
### NER {id="ner"}
The NER task identifies non-overlapping entities in text.
#### spacy.NER.v2 {id="ner-v2"} #### spacy.NER.v2 {id="ner-v2"}
The built-in NER task supports both zero-shot and few-shot prompting. This The built-in NER task supports both zero-shot and few-shot prompting. This
@ -164,52 +170,17 @@ descriptions.
| Argument | Description | | Argument | Description |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ | | `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ |
| `template` | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [ner.v2.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/ner.v2.jinja). ~~str~~ | | `template` (NEW) | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [ner.v2.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/ner.v2.jinja). ~~str~~ |
| `label_definitions` | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ | | `label_definitions` (NEW) | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
| `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | | `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
| `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ | | `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ |
| `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ | | `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ |
| `case_sensitive_matching` | Whether to search without case sensitivity. Defaults to `False`. ~~bool~~ | | `case_sensitive_matching` | Whether to search without case sensitivity. Defaults to `False`. ~~bool~~ |
| `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ | | `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ |
The NER task implementation doesn't currently ask the LLM for specific offsets, The parameters `alignment_mode`, `case_sensitive_matching` and `single_match`
but simply expects a list of strings that represent the entities in the document. are identical to the [v1](#ner-v1) implementation. The format of few-shot
This means that a form of string matching is required. This can be configured by examples is also the same.
the following parameters:
- The `single_match` parameter is typically set to `False` to allow for multiple
matches. For instance, the response from the LLM might only mention the entity
"Paris" once, but you'd still want to mark it every time it occurs in the
document.
- The case-sensitive matching is typically set to `False` to be robust against
case variances in the LLM's output.
- The `alignment_mode` argument is used to match entities as returned by the LLM
to the tokens from the original `Doc` - specifically it's used as argument in
the call to [`doc.char_span()`](/api/doc#char_span). The `"strict"` mode will
only keep spans that strictly adhere to the given token boundaries.
`"contract"` will only keep those tokens that are fully within the given
range, e.g. reducing `"New Y"` to `"New"`. Finally, `"expand"` will expand the
span to the next token boundaries, e.g. expanding `"New Y"` out to
`"New York"`.
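
The matching behaviour described in the list above can be illustrated with a small, self-contained sketch. This is not the spacy-llm implementation: `find_matches` and `align` are hypothetical helpers, and tokens are assumed to be whitespace-separated, whereas the real task delegates alignment to `doc.char_span()` on a tokenized `Doc`.

```python
import re
from typing import List, Optional, Tuple


def find_matches(
    text: str,
    entity: str,
    case_sensitive: bool = False,
    single_match: bool = False,
) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of `entity` in `text`.

    Hypothetical helper: mirrors the meaning of the `case_sensitive_matching`
    and `single_match` settings, not the library's actual code.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    offsets = [m.span() for m in re.finditer(re.escape(entity), text, flags)]
    # With single_match, keep only the first hit.
    return offsets[:1] if single_match else offsets


def align(
    text: str, start: int, end: int, mode: str = "contract"
) -> Optional[Tuple[int, int]]:
    """Snap a character range to token boundaries.

    Assumption: tokens are runs of non-whitespace characters.
    """
    tokens = [m.span() for m in re.finditer(r"\S+", text)]
    if mode == "strict":
        # Keep the range only if both ends already sit on token boundaries.
        starts = {s for s, _ in tokens}
        ends = {e for _, e in tokens}
        return (start, end) if start in starts and end in ends else None
    if mode == "contract":
        # Keep tokens that lie fully inside the range.
        kept = [(s, e) for s, e in tokens if s >= start and e <= end]
    elif mode == "expand":
        # Keep tokens that overlap the range at all.
        kept = [(s, e) for s, e in tokens if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode}")
    return (kept[0][0], kept[-1][1]) if kept else None


text = "Visit New York today"
start, end = find_matches(text, "new y")[0]  # the unaligned range "New Y"
contracted = align(text, start, end, "contract")
expanded = align(text, start, end, "expand")
```

Here `contracted` covers `"New"` and `expanded` covers `"New York"`, while `"strict"` would return `None` because the range ends in the middle of a token.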
To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts),
you can write down a few examples in a separate file, and provide these to be
injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
supports `.yml`, `.yaml`, `.json` and `.jsonl`.
```yaml
- text: Jack and Jill went up the hill.
entities:
PERSON:
- Jack
- Jill
LOCATION:
- hill
- text: Jack fell down and broke his crown.
entities:
PERSON:
- Jack
```
```ini ```ini
[components.llm.task] [components.llm.task]
@ -223,12 +194,12 @@ path = "ner_examples.yml"
> Label descriptions can also be used with explicit examples to give as much > Label descriptions can also be used with explicit examples to give as much
> info to the LLM model as possible. > info to the LLM model as possible.
You can also write definitions for each label and provide them via the New in v2, you can write definitions for each label and provide
`label_definitions` argument. This lets you tell the LLM exactly what you're them via the `label_definitions` argument. This lets you tell the LLM exactly
looking for rather than relying on the LLM to interpret its task given just the what you're looking for rather than relying on the LLM to interpret its task
label name. Label descriptions are freeform so you can write whatever you want given just the label name. Label descriptions are freeform so you can write
here, but through some experiments a brief description along with some examples whatever you want here, but a brief description along with some examples and
and counter examples seems to work quite well. counter examples seems to work quite well.
```ini ```ini
[components.llm.task] [components.llm.task]