diff --git a/website/docs/usage/v3-5.md b/website/docs/usage/v3-5.md
index 28c07cd7e..94062cd9e 100644
--- a/website/docs/usage/v3-5.md
+++ b/website/docs/usage/v3-5.md
@@ -8,34 +8,90 @@ menu:
 
 ## New features {#features hidden="true"}
 
-spaCy v3.5 introduces two new CLI commands, `find-threshold`
-and `apply`, provides improvements and extensions to our entity linking
+spaCy v3.5 introduces three new CLI commands, `apply`, `benchmark` and
+`find-threshold`, provides improvements and extensions to our entity linking
 functionality, XXX
 
 ### New CLI commands {#cli}
 
-TODO `find-threshold`
-
 TODO `apply`
 
-### Entity Linking generalization {#el}
+TODO `benchmark`
 
-XXX
+TODO `find-threshold`
 
-### Trained pipelines {#models}
+### Entity linking generalization {#el}
 
-XXX
+The knowledge base used for entity linking is now easier to customize and has a
+new default implementation [`InMemoryLookupKB`](/api/kb_in_memory).
 
-### Pipeline updates {#pipelines}
+### Additional features and improvements {#additional-features-and-improvements}
 
-XXX
+- Language updates:
+  - Extended support for Slovenian.
+  - Fixed lookup fallback for French and Catalan lemmatizers.
+  - Switch Russian and Ukrainian lemmatizers to `pymorphy3`.
+  - Support for editorial punctuation in Ancient Greek.
+  - Update to Russian tokenizer exceptions.
+  - Small fix for Dutch stop words.
+- Allow up to `typer` v0.7.x, `mypy` 0.990 and `typing_extensions` v4.4.x.
+- New `spacy.ConsoleLogger.v3` with expanded progress
+  [tracking](/api/top-level#ConsoleLogger).
+- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` and
+  `spacy.textcat_multilabel_scorer.v2`.
+
+- Updates so that downstream components can train properly on a frozen `tok2vec`
+  or `transformer` layer.
+- Allow interpolation of variables in directory names in projects.
+- Support for local file system [remotes](/usage/projects#remote) for projects.
+- Improve UX around `displacy.serve` when the default port is in use.
+- Optional `before_update` callback that is invoked at the start of each
+  [training step](/api/data-formats#config-training).
+- Improve performance of `SpanGroup` and fix typing issues for `SpanGroup` and
+  `Span` objects.
+- Patch a
+  [security vulnerability](https://github.com/advisories/GHSA-gw9q-c7gh-j9vm) in
+  extracting tar files.
+- Add equality definition for `Vectors`.
+- Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and
+  `vectors`.
+- Correctly handle missing annotations in the edit tree lemmatizer.
+
+### Trained pipeline updates {#pipelines}
+
+- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and
+  `morphologizer` components to improve tagging of non-whitespace vs. whitespace
+  tokens.
+- The transformer pipelines require `spacy-transformers` v1.2, which uses the
+  exact alignment from `tokenizers` for fast tokenizers instead of the heuristic
+  alignment from `spacy-alignments`. For all trained pipelines except
+  `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens
+  may be slightly different. More details about the `spacy-transformers` changes
+  in the
+  [v1.2.0 release notes](https://github.com/explosion/spacy-transformers/releases/tag/v1.2.0).
 
 ## Notes about upgrading from v3.4 {#upgrading}
 
-### XXX
+### Validation of textcat values {#textcat-validation}
 
-XXX
+An error is now raised when unsupported values are given as input to train a
+`textcat` or `textcat_multilabel` model - ensure that values are `0.0` or `1.0`
+as explained in the [docs](/api/textcategorizer#assigned-attributes).
 
+### Updated default scores for tokenization and textcat {#scores}
+
+We fixed a bug that inflated the `token_acc` scores in v3.0-v3.4. The reported
+`token_acc` will drop from v3.4 to v3.5, but if `token_p/r/f` stay the same,
+your tokenization performance has not changed from v3.4.
+
+For new `textcat` or `textcat_multilabel` configs, the new default `v2` scorers:
+
+- ignore `threshold` for `textcat`, so the reported `cats_p/r/f` may increase
+  slightly in v3.5 even though underlying performance is unchanged
+- report the performance of only the **final** `textcat` or `textcat_multilabel`
+  component in the pipeline by default
+- custom scorers can be used to score multiple `textcat` and
+  `textcat_multilabel` components with the built-in `Scorer.score_cats` scorer
 
 ### Pipeline package version compatibility {#version-compat}
 