diff --git a/website/docs/api/large-language-models.mdx b/website/docs/api/large-language-models.mdx
index 94b426cc8..d6e73788d 100644
--- a/website/docs/api/large-language-models.mdx
+++ b/website/docs/api/large-language-models.mdx
@@ -52,7 +52,9 @@ signatures of the `model` and `task` callables are consistent with each other
 and emits a warning if they don't. `validate_types` can be set to `False` if you
 want to disable this behavior.
 
-### Tasks {id="tasks"}
+## Tasks {id="tasks"}
+
+### Task implementation {id="task-implementation"}
 
 A _task_ defines an NLP problem or question, that will be sent to the LLM via a
 prompt. Further, the task defines how to parse the LLM's responses back into
@@ -146,6 +148,10 @@ max_n_words = 20
 path = "summarization_examples.yml"
 ```
 
+### NER {id="ner"}
+
+The NER task identifies non-overlapping entities in text.
+
 #### spacy.NER.v2 {id="ner-v2"}
 
 The built-in NER task supports both zero-shot and few-shot prompting. This
@@ -164,52 +170,17 @@ descriptions.
 | Argument | Description |
 | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ |
-| `template` | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [ner.v2.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/ner.v2.jinja). ~~str~~ |
-| `label_definitions` | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
+| `template` (NEW) | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [ner.v2.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/ner.v2.jinja). ~~str~~ |
+| `label_definitions` (NEW) | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~ |
 | `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ |
 | `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ |
 | `alignment_mode` | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~ |
 | `case_sensitive_matching` | Whether to search without case sensitivity. Defaults to `False`. ~~bool~~ |
 | `single_match` | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~ |
 
-The NER task implementation doesn't currently ask the LLM for specific offsets,
-but simply expects a list of strings that represent the enties in the document.
-This means that a form of string matching is required. This can be configured by
-the following parameters:
-
-- The `single_match` parameter is typically set to `False` to allow for multiple
-  matches. For instance, the response from the LLM might only mention the entity
-  "Paris" once, but you'd still want to mark it every time it occurs in the
-  document.
-- The case-sensitive matching is typically set to `False` to be robust against
-  case variances in the LLM's output.
-- The `alignment_mode` argument is used to match entities as returned by the LLM
-  to the tokens from the original `Doc` - specifically it's used as argument in
-  the call to [`doc.char_span()`](/api/doc#char_span). The `"strict"` mode will
-  only keep spans that strictly adhere to the given token boundaries.
-  `"contract"` will only keep those tokens that are fully within the given
-  range, e.g. reducing `"New Y"` to `"New"`. Finally, `"expand"` will expand the
-  span to the next token boundaries, e.g. expanding `"New Y"` out to
-  `"New York"`.
-
-To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts),
-you can write down a few examples in a separate file, and provide these to be
-injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
-supports `.yml`, `.yaml`, `.json` and `.jsonl`.
-
-```yaml
-- text: Jack and Jill went up the hill.
-  entities:
-    PERSON:
-      - Jack
-      - Jill
-    LOCATION:
-      - hill
-- text: Jack fell down and broke his crown.
-  entities:
-    PERSON:
-      - Jack
-```
+The parameters `alignment_mode`, `case_sensitive_matching` and `single_match`
+are identical to the [v1](#ner-v1) implementation. The format of few-shot
+examples are also the same.
 
 ```ini
 [components.llm.task]
@@ -223,12 +194,12 @@ path = "ner_examples.yml"
 > Label descriptions can also be used with explicit examples to give as much
 > info to the LLM model as possible.
 
-You can also write definitions for each label and provide them via the
-`label_definitions` argument. This lets you tell the LLM exactly what you're
-looking for rather than relying on the LLM to interpret its task given just the
-label name. Label descriptions are freeform so you can write whatever you want
-here, but through some experiments a brief description along with some examples
-and counter examples seems to work quite well.
+New to v2, is the fact that you can write definitions for each label and provide
+them via the `label_definitions` argument. This lets you tell the LLM exactly
+what you're looking for rather than relying on the LLM to interpret its task
+given just the label name. Label descriptions are freeform so you can write
+whatever you want here, but a brief description along with some examples and
+counter examples seems to work quite well.
 
 ```ini
 [components.llm.task]