Fill in non-CLI details from release notes draft

2025-09-13 23:52:38 +03:00 · 2023-01-16 13:12:17 +01:00 · b02ed00814
commit b02ed00814
parent 7aabc12d5c
1 changed files with 68 additions and 12 deletions
--- a/website/docs/usage/v3-5.md
+++ b/website/docs/usage/v3-5.md
@@ -8,34 +8,90 @@ menu:

 ## New features {#features hidden="true"}

-spaCy v3.5 introduces two new CLI commands, `find-threshold`
-and `apply`, provides improvements and extensions to our entity linking
+spaCy v3.5 introduces three new CLI commands, `apply`, `benchmark` and
+`find-threshold`, provides improvements and extensions to our entity linking
 functionality, XXX

 ### New CLI commands {#cli}

-TODO `find-threshold`
-
 TODO `apply`

-### Entity Linking generalization {#el}
+TODO `benchmark`

-XXX
+TODO `find-threshold`

-### Trained pipelines {#models}
+### Entity linking generalization {#el}

-XXX
+The knowledge base used for entity linking is now easier to customize and has a
+new default implementation [`InMemoryLookupKB`](/api/kb_in_memory).

-### Pipeline updates {#pipelines}
+### Additional features and improvements {#additional-features-and-improvements}

-XXX
+- Language updates:
+  - Extended support for Slovenian.
+  - Fixed lookup fallback for French and Catalan lemmatizers.
+  - Switch Russian and Ukrainian lemmatizers to `pymorphy3`.
+  - Support for editorial punctuation in Ancient Greek.
+  - Update to Russian tokenizer exceptions.
+  - Small fix for Dutch stop words.
+- Allow up to `typer` v0.7.x, `mypy` 0.990 and `typing_extensions` v4.4.x.
+- New `spacy.ConsoleLogger.v3` with expanded progress
+  [tracking](/api/top-level#ConsoleLogger).
+- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` and
+  `spacy.textcat_multilabel_scorer.v2`.
+
+- Updates so that downstream components can train properly on a frozen `tok2vec`
+  or `transformer` layer.
+- Allow interpolation of variables in directory names in projects.
+- Support for local file system [remotes](/usage/projects#remote) for projects.
+- Improve UX around `displacy.serve` when the default port is in use.
+- Optional `before_update` callback that is invoked at the start of each
+  [training step](/api/data-formats#config-training).
+- Improve performance of `SpanGroup` and fix typing issues for `SpanGroup` and
+  `Span` objects.
+- Patch a
+  [security vulnerability](https://github.com/advisories/GHSA-gw9q-c7gh-j9vm) in
+  extracting tar files.
+- Add equality definition for `Vectors`.
+- Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and
+  `vectors`.
+- Correctly handle missing annotations in the edit tree lemmatizer.
+
+### Trained pipeline updates {#pipelines}
+
+- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and
+  `morphologizer` components to improve tagging of non-whitespace vs. whitespace
+  tokens.
+- The transformer pipelines require `spacy-transformers` v1.2, which uses the
+  exact alignment from `tokenizers` for fast tokenizers instead of the heuristic
+  alignment from `spacy-alignments`. For all trained pipelines except
+  `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens
+  may be slightly different. More details about the `spacy-transformers` changes
+  in the
+  [v1.2.0 release notes](https://github.com/explosion/spacy-transformers/releases/tag/v1.2.0).

 ## Notes about upgrading from v3.4 {#upgrading}

-### XXX
+### Validation of textcat values {#textcat-validation}

-XXX
+An error is now raised when unsupported values are given as input to train a
+`textcat` or `textcat_multilabel` model - ensure that values are `0.0` or `1.0`
+as explained in the [docs](/api/textcategorizer#assigned-attributes).

+### Updated default scores for tokenization and textcat {#scores}
+
+We fixed a bug that inflated the `token_acc` scores in v3.0-v3.4. The reported
+`token_acc` will drop from v3.4 to v3.5, but if `token_p/r/f` stay the same,
+your tokenization performance has not changed from v3.4.
+
+For new `textcat` or `textcat_multilabel` configs, the new default `v2` scorers:
+
+- ignore `threshold` for `textcat`, so the reported `cats_p/r/f` may increase
+  slightly in v3.5 even though underlying performance is unchanged
+- report the performance of only the **final** `textcat` or `textcat_multilabel`
+  component in the pipeline by default
+- custom scorers can be used to score multiple `textcat` and
+  `textcat_multilabel` components with the built-in `Scorer.score_cats` scorer

 ### Pipeline package version compatibility {#version-compat}