Fill in usage examples

2025-09-13 23:52:38 +03:00 · 2023-01-17 16:06:28 +01:00 · 2023-01-17 16:06:28 +01:00 · 82ccbdc70b
commit 82ccbdc70b
parent 3a300b0962
1 changed files with 89 additions and 13 deletions
--- a/website/docs/usage/v3-5.mdx
+++ b/website/docs/usage/v3-5.mdx
@@ -9,20 +9,96 @@ menu:
 ## New features {id="features",hidden="true"}

 spaCy v3.5 introduces three new CLI commands, `apply`, `benchmark` and
-`find-threshold`, provides improvements and extensions to our entity linking
-functionality, XXX
+`find-threshold`, provides improvements to our entity linking functionality, and
+includes a range of language updates and bug fixes.

 ### New CLI commands {id="cli"}

-TODO `apply`
+#### apply CLI

-TODO `benchmark`
+The [`apply` CLI](/api/cli#apply) can be used to apply a pipeline to one or more
+`.txt`, `.jsonl` or `.spacy` input files, saving the annotated docs in a single
+`.spacy` file.

-TODO `find-threshold`
+```shell
+spacy apply en_core_web_sm my_texts/ output.spacy
+```
+
+#### benchmark CLI
+
+The [`benchmark` CLI](/api/cli#benchmark) has been added to extend the existing
+`evaluate` functionality with a wider range of profiling subcommands.
+
+The `benchmark accuracy` CLI is introduced as an alias for `evaluate`.
+
+The new `benchmark speed` CLI performs warmup rounds before measuring the speed
+in words per second on batches of randomly shuffled documents from the provided
+data.
+
+```shell
+spacy benchmark speed my_pipeline data.spacy
+```
+
+The output is the mean performance using batches (`nlp.pipe`) with a 95%
+confidence interval, e.g., profiling `en_core_web_sm` on CPU:
+
+```none
+Outliers: 2.0%, extreme outliers: 0.0%
+Mean: 18904.1 words/s (95% CI: -256.9 +244.1)
+```
+
+#### find-threshold CLI
+
+The [`find-threshold` CLI](/api/cli#find-threshold) runs a series of trials
+across threshold values from `0.0` to `1.0` and identifies the best threshold
+for the provided score metric.
+
+The following command runs 20 trials for the `spancat` component in
+`my_pipeline`, recording the `spans_sc_f` score for each value of the threshold
+`[components.spancat.threshold]` from `0.0` to `1.0`:
+
+```shell
+spacy find-threshold my_pipeline data.spacy spancat threshold spans_sc_f --n_trials 20
+```
+
+The `find-threshold` CLI can be used with `textcat_multilabel`, `spancat` and
+custom components with thresholds that are applied while predicting or scoring.

 ### Fuzzy matching {id="fuzzy"}

-TODO
+New `FUZZY` operators support [fuzzy matching](/usage/rule-based-matching#fuzzy)
+with the `Matcher`. By default, the `FUZZY` operator allows a Levenshtein edit
+distance of at least 2 and up to 30% of the pattern string length.
+`FUZZY1`..`FUZZY9` can be used to set the maximum number of allowed edits.
+
+```python
+# Match lowercase with fuzzy matching (allows up to 3 edits for this 10-character pattern)
+pattern = [{"LOWER": {"FUZZY": "definitely"}}]
+
+# Match custom attribute values with fuzzy matching (allows up to 3 edits for this 10-character pattern)
+pattern = [{"_": {"country": {"FUZZY": "Kyrgyzstan"}}}]
+
+# Match with exact Levenshtein edit distance limits (allows up to 3 edits)
+pattern = [{"_": {"country": {"FUZZY3": "Kyrgyzstan"}}}]
+```
+
+Note that `FUZZY` uses Levenshtein edit distance rather than
+Damerau-Levenshtein edit distance, so a transposition like `teh` for `the`
+counts as two edits, one insertion and one deletion.
+
+If you'd prefer an alternative fuzzy matching algorithm, you can provide your
+own custom method to the `Matcher` or as a config option for an entity ruler or
+span ruler.
+
+### FUZZY and REGEX with lists {id="fuzzy-regex-lists"}
+
+The `FUZZY` and `REGEX` operators are also now supported for lists with `IN` and
+`NOT_IN`:
+
+```python
+pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]
+pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]
+```

 ### Entity linking generalization {id="el"}

@@ -43,7 +119,6 @@ new default implementation [`InMemoryLookupKB`](/api/kb_in_memory).
  [tracking](/api/top-level#ConsoleLogger).
 - Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` and
  `spacy.textcat_multilabel_scorer.v2`.
-
 - Updates so that downstream components can train properly on a frozen `tok2vec`
  or `transformer` layer.
 - Allow interpolation of variables in directory names in projects.
@@ -82,7 +157,7 @@ An error is now raised when unsupported values are given as input to train a
 `textcat` or `textcat_multilabel` model - ensure that values are `0.0` or `1.0`
 as explained in the [docs](/api/textcategorizer#assigned-attributes).

-### Updated default scores for tokenization and textcat {id="scores"}
+### Updated scorers for tokenization and textcat {id="scores"}

 We fixed a bug that inflated the `token_acc` scores in v3.0-v3.4. The reported
 `token_acc` will drop from v3.4 to v3.5, but if `token_p/r/f` stay the same,
@@ -91,11 +166,12 @@ your tokenization performance has not changed from v3.4.
 For new `textcat` or `textcat_multilabel` configs, the new default `v2` scorers:

 - ignore `threshold` for `textcat`, so the reported `cats_p/r/f` may increase
-  slightly in v3.5 even though underlying performance is unchanged
+  slightly in v3.5 even though the underlying predictions are unchanged
 - report the performance of only the **final** `textcat` or `textcat_multilabel`
  component in the pipeline by default
-- custom scorers can be used to score multiple `textcat` and
-  `textcat_multilabel` components with the built-in `Scorer.score_cats` scorer
+- allow custom scorers to be used to score multiple `textcat` and
+  `textcat_multilabel` components with `Scorer.score_cats` by restricting the
+  evaluation to the component's provided labels

 ### Pipeline package version compatibility {id="version-compat"}

@@ -122,8 +198,8 @@ working as expected, you can update the spaCy version requirements in the
 [`meta.json`](/api/data-formats#meta):

 ```diff
-- "spacy_version": ">=3.3.0,<3.5.0",
-+ "spacy_version": ">=3.3.0,<3.6.0",
+- "spacy_version": ">=3.4.0,<3.5.0",
++ "spacy_version": ">=3.4.0,<3.6.0",
 ```

 ### Updating v3.4 configs