From 065ead4eed2608666c95dcad6037913e53fbf424 Mon Sep 17 00:00:00 2001 From: David Berenstein Date: Fri, 1 Sep 2023 11:05:36 +0200 Subject: [PATCH 1/4] updated `add_pipe` docs (#12947) --- website/meta/universe.json | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/website/meta/universe.json b/website/meta/universe.json index ec380f847..46de8121c 100644 --- a/website/meta/universe.json +++ b/website/meta/universe.json @@ -2806,7 +2806,7 @@ "", "# see github repo for examples on sentence-transformers and Huggingface", "nlp = spacy.load('en_core_web_md')", - "nlp.add_pipe(\"text_categorizer\", ", + "nlp.add_pipe(\"classy_classification\", ", " config={", " \"data\": data,", " \"model\": \"spacy\"", @@ -3010,8 +3010,8 @@ "# Load the spaCy language model:", "nlp = spacy.load(\"en_core_web_sm\")", "", - "# Add the \"text_categorizer\" pipeline component to the spaCy model, and configure it with SetFit parameters:", - "nlp.add_pipe(\"text_categorizer\", config={", + "# Add the \"spacy_setfit\" pipeline component to the spaCy model, and configure it with SetFit parameters:", + "nlp.add_pipe(\"spacy_setfit\", config={", " \"pretrained_model_name_or_path\": \"paraphrase-MiniLM-L3-v2\",", " \"setfit_trainer_args\": {", " \"train_dataset\": train_dataset", From 5c1f9264c219cfbd7b9c266c0bb8e00238f238c7 Mon Sep 17 00:00:00 2001 From: Sofie Van Landeghem Date: Fri, 1 Sep 2023 13:47:20 +0200 Subject: [PATCH 2/4] fix typo in link (#12948) * fix typo in link * fix REL.v1 parameter --- website/docs/api/large-language-models.mdx | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/website/docs/api/large-language-models.mdx b/website/docs/api/large-language-models.mdx index 94b426cc8..e65945357 100644 --- a/website/docs/api/large-language-models.mdx +++ b/website/docs/api/large-language-models.mdx @@ -113,7 +113,7 @@ note that this requirement will be included in the prompt, but the task doesn't perform a hard cut-off. 
It's hence possible that your summary exceeds `max_n_words`. -To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), you can write down a few examples in a separate file, and provide these to be injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` supports `.yml`, `.yaml`, `.json` and `.jsonl`. @@ -192,7 +192,7 @@ the following parameters: span to the next token boundaries, e.g. expanding `"New Y"` out to `"New York"`. -To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), you can write down a few examples in a separate file, and provide these to be injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` supports `.yml`, `.yaml`, `.json` and `.jsonl`. @@ -282,7 +282,7 @@ the following parameters: span to the next token boundaries, e.g. expanding `"New Y"` out to `"New York"`. -To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), you can write down a few examples in a separate file, and provide these to be injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` supports `.yml`, `.yaml`, `.json` and `.jsonl`. @@ -397,7 +397,7 @@ definitions are included in the prompt. | `allow_none` | When set to `True`, allows the LLM to not return any of the given label. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ | | `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. 
~~bool~~ | -To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), you can write down a few examples in a separate file, and provide these to be injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` supports `.yml`, `.yaml`, `.json` and `.jsonl`. @@ -452,7 +452,7 @@ prompting and includes an improved prompt template. | `allow_none` | When set to `True`, allows the LLM to not return any of the given label. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. ~~bool~~ | | `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ | -To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), you can write down a few examples in a separate file, and provide these to be injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` supports `.yml`, `.yaml`, `.json` and `.jsonl`. @@ -502,7 +502,7 @@ prompting. | `allow_none` | When set to `True`, allows the LLM to not return any of the given label. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Deafults to `True`. ~~bool~~ | | `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Deafults to `False`. ~~bool~~ | -To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), you can write down a few examples in a separate file, and provide these to be injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` supports `.yml`, `.yaml`, `.json` and `.jsonl`. @@ -546,12 +546,12 @@ on an upstream NER component for entities extraction. 
| ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `labels` | List of labels or str of comma-separated list of labels. ~~Union[List[str], str]~~ | | `template` | Custom prompt template to send to LLM model. Default templates for each task are located in the `spacy_llm/tasks/templates` directory. Defaults to [`rel.jinja`](https://github.com/explosion/spacy-llm/blob/main/spacy_llm/tasks/templates/rel.jinja). ~~str~~ | -| `label_description` | Dictionary providing a description for each relation label. Defaults to `None`. ~~Optional[Dict[str, str]]~~ | +| `label_definitions` | Dictionary providing a description for each relation label. Defaults to `None`. ~~Optional[Dict[str, str]]~~ | | `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | | `normalizer` | Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. ~~Optional[Callable[[str], str]]~~ | | `verbose` | If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. ~~bool~~ | -To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), you can write down a few examples in a separate file, and provide these to be injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` supports `.yml`, `.yaml`, `.json` and `.jsonl`. @@ -565,6 +565,7 @@ supports `.yml`, `.yaml`, `.json` and `.jsonl`. 
[components.llm.task] @llm_tasks = "spacy.REL.v1" labels = ["LivesIn", "Visits"] + [components.llm.task.examples] @misc = "spacy.FewShotReader.v1" path = "rel_examples.jsonl" @@ -613,7 +614,7 @@ doesn't match the number of tokens from the pipeline's tokenizer, no lemmas are stored in the corresponding doc's tokens. Otherwise the tokens `.lemma_` property is updated with the lemma suggested by the LLM. -To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), you can write down a few examples in a separate file, and provide these to be injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` supports `.yml`, `.yaml`, `.json` and `.jsonl`. @@ -666,7 +667,7 @@ issues (e. g. in case of unexpected LLM responses) the value might be `None`. | `examples` | Optional function that generates examples for few-shot learning. Defaults to `None`. ~~Optional[Callable[[], Iterable[Any]]]~~ | | `field` | Name of extension attribute to store summary in (i. e. the summary will be available in `doc._.{field}`). Defaults to `sentiment`. ~~str~~ | -To perform [few-shot learning](/usage/large-langauge-models#few-shot-prompts), +To perform [few-shot learning](/usage/large-language-models#few-shot-prompts), you can write down a few examples in a separate file, and provide these to be injected into the prompt to the LLM. The default reader `spacy.FewShotReader.v1` supports `.yml`, `.yaml`, `.json` and `.jsonl`. 
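Aside: the `spacy.FewShotReader.v1` referenced throughout the docs touched by this patch conceptually just deserializes an examples file (`.yml`, `.yaml`, `.json` or `.jsonl`) into example objects that get injected into the LLM prompt. A minimal sketch of the `.jsonl` case — this is an illustration of the idea, not spacy-llm's actual implementation, and the example file shape is hypothetical:

```python
import json
from pathlib import Path
from typing import Any, Iterator


def read_jsonl_examples(path: Path) -> Iterator[Any]:
    """Sketch of a .jsonl few-shot reader: one JSON object per line,
    each object becoming one example injected into the LLM prompt."""
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)
```

A file like `rel_examples.jsonl` from the config excerpt above would then be one serialized example per line; the exact keys each task expects are documented per task in spacy-llm.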
From 6d1f6d9a23b4232e2ca67b5c2a15b62add8b5411 Mon Sep 17 00:00:00 2001 From: Sofie Van Landeghem Date: Mon, 4 Sep 2023 09:05:50 +0200 Subject: [PATCH 3/4] Fix LLM usage example (#12950) * fix usage example * revert back to v2 to allow hot fix on main --- website/docs/usage/large-language-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/usage/large-language-models.mdx b/website/docs/usage/large-language-models.mdx index 3c2c52c68..4da9a8f16 100644 --- a/website/docs/usage/large-language-models.mdx +++ b/website/docs/usage/large-language-models.mdx @@ -184,7 +184,7 @@ nlp.add_pipe( "labels": ["PERSON", "ORGANISATION", "LOCATION"] }, "model": { - "@llm_models": "spacy.gpt-3.5.v1", + "@llm_models": "spacy.GPT-3-5.v1", }, }, ) From cc788476881ca456ddcb985675dc292fd75d5f40 Mon Sep 17 00:00:00 2001 From: Magdalena Aniol <96200718+magdaaniol@users.noreply.github.com> Date: Wed, 6 Sep 2023 16:38:13 +0200 Subject: [PATCH 4/4] fix training.batch_size example (#12963) --- website/docs/usage/training.mdx | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/website/docs/usage/training.mdx b/website/docs/usage/training.mdx index 98333db72..abb1b9cfd 100644 --- a/website/docs/usage/training.mdx +++ b/website/docs/usage/training.mdx @@ -180,7 +180,7 @@ Some of the main advantages and features of spaCy's training config are: Under the hood, the config is parsed into a dictionary. It's divided into sections and subsections, indicated by the square brackets and dot notation. For -example, `[training]` is a section and `[training.batch_size]` a subsection. +example, `[training]` is a section and `[training.batcher]` a subsection. Subsections can define values, just like a dictionary, or use the `@` syntax to refer to [registered functions](#config-functions). 
This allows the config to not just define static settings, but also construct objects like architectures, @@ -254,7 +254,7 @@ For cases like this, you can set additional command-line options starting with block. ```bash -$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --training.batch_size 128 +$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --training.max_epochs 3 ``` Only existing sections and values in the config can be overwritten. At the end @@ -279,7 +279,7 @@ process. Environment variables **take precedence** over CLI overrides and values defined in the config file. ```bash -$ SPACY_CONFIG_OVERRIDES="--system.gpu_allocator pytorch --training.batch_size 128" ./your_script.sh +$ SPACY_CONFIG_OVERRIDES="--system.gpu_allocator pytorch --training.max_epochs 3" ./your_script.sh ``` ### Reading from standard input {id="config-stdin"} @@ -578,16 +578,17 @@ now-updated model to the predicted docs. The training configuration defined in the config file doesn't have to only consist of static values. Some settings can also be **functions**. For instance, -the `batch_size` can be a number that doesn't change, or a schedule, like a +the batch size can be a number that doesn't change, or a schedule, like a sequence of compounding values, which has shown to be an effective trick (see [Smith et al., 2017](https://arxiv.org/abs/1711.00489)). ```ini {title="With static value"} -[training] -batch_size = 128 +[training.batcher] +@batchers = "spacy.batch_by_words.v1" +size = 3000 ``` -To refer to a function instead, you can make `[training.batch_size]` its own +To refer to a function instead, you can make `[training.batcher.size]` its own section and use the `@` syntax to specify the function and its arguments – in this case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding) defined in the [function registry](/api/top-level#registry). 
All other values @@ -606,7 +607,7 @@ from your configs. > optimizer. ```ini {title="With registered function"} -[training.batch_size] +[training.batcher.size] @schedules = "compounding.v1" start = 100 stop = 1000 @@ -1027,14 +1028,14 @@ def my_custom_schedule(start: int = 1, factor: float = 1.001): ``` In your config, you can now reference the schedule in the -`[training.batch_size]` block via `@schedules`. If a block contains a key +`[training.batcher.size]` block via `@schedules`. If a block contains a key starting with an `@`, it's interpreted as a reference to a function. All other settings in the block will be passed to the function as keyword arguments. Keep in mind that the config shouldn't have any hidden defaults and all arguments on the functions need to be represented in the config. ```ini {title="config.cfg (excerpt)"} -[training.batch_size] +[training.batcher.size] @schedules = "my_custom_schedule.v1" start = 2 factor = 1.005
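For intuition about what the registered schedule does at runtime: its body (elided in the excerpt above) is conceptually just an infinite generator that the trainer pulls one value from per step. A minimal standalone sketch, with the `@spacy.registry.schedules` decorator omitted so it runs without spaCy — an illustration of the idea, not the exact registered implementation:

```python
from itertools import islice
from typing import Iterator


def my_custom_schedule(start: float = 1, factor: float = 1.001) -> Iterator[float]:
    """Sketch of a compounding schedule: yield the current value,
    then multiply it by `factor`, forever."""
    while True:
        yield start
        start *= factor


# With the config values from the excerpt (start = 2, factor = 1.005),
# the first few values the trainer would consume:
first = list(islice(my_custom_schedule(start=2, factor=1.005), 3))
```

The trainer only ever asks for the next value, so the generator never needs to know how many steps training will run — that is the design reason schedules are generators rather than lists.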