Shorten NER section

2025-07-10 16:22:29 +03:00 · 2023-08-31 14:42:30 +02:00
commit 3266898466
parent 3e4264899c
1 changed file with 18 additions and 47 deletions
--- a/website/docs/api/large-language-models.mdx
+++ b/website/docs/api/large-language-models.mdx
@@ -52,7 +52,9 @@ signatures of the `model` and `task` callables are consistent with each other
 and emits a warning if they don't. `validate_types` can be set to `False` if you
 want to disable this behavior.
-### Tasks {id="tasks"}
+## Tasks {id="tasks"}
 ### Task implementation {id="task-implementation"}
 A _task_ defines an NLP problem or question, that will be sent to the LLM via a
 prompt. Further, the task defines how to parse the LLM's responses back into
@@ -146,6 +148,10 @@ max_n_words = 20
 path = "summarization_examples.yml"
 ```
 ### NER {id="ner"}
 The NER task identifies non-overlapping entities in text.
 #### spacy.NER.v2 {id="ner-v2"}
 The built-in NER task supports both zero-shot and few-shot prompting. This
@@ -164,52 +170,17 @@ descriptions.
 | Argument                  | Description                                                                                                                                                                                                                                                         |
 | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `labels`                  | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~                                                                                                                                                                                  |
-| `template`                | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [ner.v2.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/ner.v2.jinja). ~~str~~ |
+| `template` (NEW)          | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [ner.v2.jinja](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/ner.v2.jinja). ~~str~~ |
-| `label_definitions`       | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~                                                              |
+| `label_definitions` (NEW) | Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. ~~Optional[Dict[str, str]]~~                                                              |
 | `examples`                | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~                                                                                                                                      |
 | `normalizer`              | Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~                                                                                           |
 | `alignment_mode`          | Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. ~~str~~                                                                                      |
 | `case_sensitive_matching` | Whether to search without case sensitivity. Defaults to `False`. ~~bool~~                                                                                                                                                                                           |
 | `single_match`            | Whether to match an entity in the LLM's response only once (the first hit) or multiple times. Defaults to `False`. ~~bool~~                                                                                                                                         |
-The NER task implementation doesn't currently ask the LLM for specific offsets,
-but simply expects a list of strings that represent the enties in the document.
-This means that a form of string matching is required. This can be configured by
+The parameters `alignment_mode`, `case_sensitive_matching` and `single_match`
+are identical to the [v1](#ner-v1) implementation. The format of few-shot
+examples are also the same.
 the following parameters:
 - The `single_match` parameter is typically set to `False` to allow for multiple
  matches. For instance, the response from the LLM might only mention the entity
  "Paris" once, but you'd still want to mark it every time it occurs in the
  document.
 - The case-sensitive matching is typically set to `False` to be robust against
  case variances in the LLM's output.
 - The `alignment_mode` argument is used to match entities as returned by the LLM
  to the tokens from the original `Doc` - specifically it's used as argument in
  the call to [`doc.char_span()`](/api/doc#char_span). The `"strict"` mode will
  only keep spans that strictly adhere to the given token boundaries.
  `"contract"` will only keep those tokens that are fully within the given
  range, e.g. reducing `"New Y"` to `"New"`. Finally, `"expand"` will expand the
  span to the next token boundaries, e.g. expanding `"New Y"` out to
  `"New York"`.
 To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts),
 you can write down a few examples in a separate file, and provide these to be
 injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1`
 supports `.yml`, `.yaml`, `.json` and `.jsonl`.
 ```yaml
 - text: Jack and Jill went up the hill.
  entities:
    PERSON:
      - Jack
      - Jill
    LOCATION:
      - hill
 - text: Jack fell down and broke his crown.
  entities:
    PERSON:
      - Jack
 ```
 ```ini
 [components.llm.task]
@@ -223,12 +194,12 @@ path = "ner_examples.yml"
 > Label descriptions can also be used with explicit examples to give as much
 > info to the LLM model as possible.
-You can also write definitions for each label and provide them via the
-`label_definitions` argument. This lets you tell the LLM exactly what you're
-looking for rather than relying on the LLM to interpret its task given just the
-label name. Label descriptions are freeform so you can write whatever you want
-here, but through some experiments a brief description along with some examples
-and counter examples seems to work quite well.
+New to v2, is the fact that you can write definitions for each label and provide
+them via the `label_definitions` argument. This lets you tell the LLM exactly
+what you're looking for rather than relying on the LLM to interpret its task
+given just the label name. Label descriptions are freeform so you can write
+whatever you want here, but a brief description along with some examples and
+counter examples seems to work quite well.
 ```ini
 [components.llm.task]