Fill in non-CLI details from release notes draft

Adriane Boyd 2023-01-16 13:12:17 +01:00
parent 7aabc12d5c
commit b02ed00814

@@ -8,34 +8,90 @@ menu:
 
 ## New features {#features hidden="true"}
 
-spaCy v3.5 introduces two new CLI commands, `find-threshold`
-and `apply`, provides improvements and extensions to our entity linking
+spaCy v3.5 introduces three new CLI commands, `apply`, `benchmark` and
+`find-threshold`, provides improvements and extensions to our entity linking
 functionality, XXX
 
 ### New CLI commands {#cli}
 
-TODO `find-threshold`
-
 TODO `apply`
 
-### Entity Linking generalization {#el}
+TODO `benchmark`
 
-XXX
+TODO `find-threshold`
 
-### Trained pipelines {#models}
+### Entity linking generalization {#el}
 
-XXX
+The knowledge base used for entity linking is now easier to customize and has a
+new default implementation [`InMemoryLookupKB`](/api/kb_in_memory).
 
-### Pipeline updates {#pipelines}
+### Additional features and improvements {#additional-features-and-improvements}
 
-XXX
+- Language updates:
+  - Extended support for Slovenian.
+  - Fixed lookup fallback for French and Catalan lemmatizers.
+  - Switch Russian and Ukrainian lemmatizers to `pymorphy3`.
+  - Support for editorial punctuation in Ancient Greek.
+  - Update to Russian tokenizer exceptions.
+  - Small fix for Dutch stop words.
+- Allow up to `typer` v0.7.x, `mypy` 0.990 and `typing_extensions` v4.4.x.
+- New `spacy.ConsoleLogger.v3` with expanded progress
+  [tracking](/api/top-level#ConsoleLogger).
+- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` and
+  `spacy.textcat_multilabel_scorer.v2`.
+- Updates so that downstream components can train properly on a frozen `tok2vec`
+  or `transformer` layer.
+- Allow interpolation of variables in directory names in projects.
+- Support for local file system [remotes](/usage/projects#remote) for projects.
+- Improve UX around `displacy.serve` when the default port is in use.
+- Optional `before_update` callback that is invoked at the start of each
+  [training step](/api/data-formats#config-training).
+- Improve performance of `SpanGroup` and fix typing issues for `SpanGroup` and
+  `Span` objects.
+- Patch a
+  [security vulnerability](https://github.com/advisories/GHSA-gw9q-c7gh-j9vm) in
+  extracting tar files.
+- Add equality definition for `Vectors`.
+- Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and
+  `vectors`.
+- Correctly handle missing annotations in the edit tree lemmatizer.
+
+### Trained pipeline updates {#pipelines}
+
+- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and
+  `morphologizer` components to improve tagging of non-whitespace vs. whitespace
+  tokens.
+- The transformer pipelines require `spacy-transformers` v1.2, which uses the
+  exact alignment from `tokenizers` for fast tokenizers instead of the heuristic
+  alignment from `spacy-alignments`. For all trained pipelines except
+  `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens
+  may be slightly different. More details about the `spacy-transformers` changes
+  in the
+  [v1.2.0 release notes](https://github.com/explosion/spacy-transformers/releases/tag/v1.2.0).
 
 ## Notes about upgrading from v3.4 {#upgrading}
 
-### XXX
+### Validation of textcat values {#textcat-validation}
 
-XXX
+An error is now raised when unsupported values are given as input to train a
+`textcat` or `textcat_multilabel` model - ensure that values are `0.0` or `1.0`
+as explained in the [docs](/api/textcategorizer#assigned-attributes).
+
+### Updated default scores for tokenization and textcat {#scores}
+
+We fixed a bug that inflated the `token_acc` scores in v3.0-v3.4. The reported
+`token_acc` will drop from v3.4 to v3.5, but if `token_p/r/f` stay the same,
+your tokenization performance has not changed from v3.4.
+
+For new `textcat` or `textcat_multilabel` configs, the new default `v2` scorers:
+
+- ignore `threshold` for `textcat`, so the reported `cats_p/r/f` may increase
+  slightly in v3.5 even though underlying performance is unchanged
+- report the performance of only the **final** `textcat` or `textcat_multilabel`
+  component in the pipeline by default
+  - custom scorers can be used to score multiple `textcat` and
+    `textcat_multilabel` components with the built-in `Scorer.score_cats` scorer
 
 ### Pipeline package version compatibility {#version-compat}