mirror of
https://github.com/explosion/spaCy.git
synced 2025-08-03 20:00:21 +03:00
Fill in non-CLI details from release notes draft
parent
7aabc12d5c
commit
b02ed00814
@@ -8,34 +8,90 @@ menu:

## New features {#features hidden="true"}

spaCy v3.5 introduces two new CLI commands, `find-threshold`
and `apply`, provides improvements and extensions to our entity linking
spaCy v3.5 introduces three new CLI commands, `apply`, `benchmark` and
`find-threshold`, provides improvements and extensions to our entity linking
functionality, XXX

### New CLI commands {#cli}

TODO `find-threshold`

TODO `apply`

### Entity Linking generalization {#el}
TODO `benchmark`

XXX
TODO `find-threshold`

### Trained pipelines {#models}
### Entity linking generalization {#el}

XXX
The knowledge base used for entity linking is now easier to customize and has a
new default implementation [`InMemoryLookupKB`](/api/kb_in_memory).

### Pipeline updates {#pipelines}
### Additional features and improvements {#additional-features-and-improvements}

XXX
- Language updates:
  - Extended support for Slovenian.
  - Fixed lookup fallback for French and Catalan lemmatizers.
  - Switch Russian and Ukrainian lemmatizers to `pymorphy3`.
  - Support for editorial punctuation in Ancient Greek.
  - Update to Russian tokenizer exceptions.
  - Small fix for Dutch stop words.
- Allow up to `typer` v0.7.x, `mypy` 0.990 and `typing_extensions` v4.4.x.
- New `spacy.ConsoleLogger.v3` with expanded progress
  [tracking](/api/top-level#ConsoleLogger).
- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` and
  `spacy.textcat_multilabel_scorer.v2`.

- Updates so that downstream components can train properly on a frozen `tok2vec`
  or `transformer` layer.
- Allow interpolation of variables in directory names in projects.
- Support for local file system [remotes](/usage/projects#remote) for projects.
- Improve UX around `displacy.serve` when the default port is in use.
- Optional `before_update` callback that is invoked at the start of each
  [training step](/api/data-formats#config-training).
- Improve performance of `SpanGroup` and fix typing issues for `SpanGroup` and
  `Span` objects.
- Patch a
  [security vulnerability](https://github.com/advisories/GHSA-gw9q-c7gh-j9vm) in
  extracting tar files.
- Add equality definition for `Vectors`.
- Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and
  `vectors`.
- Correctly handle missing annotations in the edit tree lemmatizer.

### Trained pipeline updates {#pipelines}

- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and
  `morphologizer` components to improve tagging of non-whitespace vs. whitespace
  tokens.
- The transformer pipelines require `spacy-transformers` v1.2, which uses the
  exact alignment from `tokenizers` for fast tokenizers instead of the heuristic
  alignment from `spacy-alignments`. For all trained pipelines except
  `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens
  may be slightly different. More details about the `spacy-transformers` changes
  in the
  [v1.2.0 release notes](https://github.com/explosion/spacy-transformers/releases/tag/v1.2.0).

## Notes about upgrading from v3.4 {#upgrading}

### XXX
### Validation of textcat values {#textcat-validation}

XXX
An error is now raised when unsupported values are given as input to train a
`textcat` or `textcat_multilabel` model - ensure that values are `0.0` or `1.0`
as explained in the [docs](/api/textcategorizer#assigned-attributes).
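
The kind of check described here can be pictured with a small stand-alone sketch. It is not spaCy's implementation: `validate_cats` is a hypothetical helper, and the dictionaries only mimic the label-to-value mapping used for category annotations.

```python
def validate_cats(cats):
    """Raise ValueError unless every category value equals 0.0 or 1.0.

    Hypothetical helper for illustration only; spaCy runs its own
    validation internally when training textcat components.
    """
    for label, value in cats.items():
        # Note: in Python, 0, 1, False and True compare equal to 0.0 / 1.0
        # and therefore pass this check as well.
        if value not in (0.0, 1.0):
            raise ValueError(
                f"Unsupported value {value!r} for label {label!r}: "
                "expected 0.0 or 1.0"
            )
    return cats


# Valid gold annotations: each label is either absent (0.0) or present (1.0).
validate_cats({"POSITIVE": 1.0, "NEGATIVE": 0.0})

# Soft labels such as 0.7 / 0.3 are rejected.
try:
    validate_cats({"POSITIVE": 0.7, "NEGATIVE": 0.3})
except ValueError as err:
    print(err)
```

Running a check like this over training data before upgrading is one way to find annotations that would now trigger the error.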

### Updated default scores for tokenization and textcat {#scores}

We fixed a bug that inflated the `token_acc` scores in v3.0-v3.4. The reported
`token_acc` will drop from v3.4 to v3.5, but if `token_p/r/f` stay the same,
your tokenization performance has not changed from v3.4.
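
To make the `token_p/r/f` comparison concrete, the following stand-alone sketch computes precision, recall and F-score for tokenization from character offsets. It is an illustration under our own assumptions, not spaCy's `Scorer` code; `token_prf` and the example spans are hypothetical.

```python
def token_prf(gold_spans, pred_spans):
    """Return (precision, recall, F-score) over (start, end) token spans.

    A predicted token counts as correct only if its character offsets
    exactly match a gold token.
    """
    gold = set(gold_spans)
    pred = set(pred_spans)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Text: "don't go"
# Gold splits "do" + "n't" + "go"; the prediction keeps "don't" as one token.
gold = [(0, 2), (2, 5), (6, 8)]
pred = [(0, 5), (6, 8)]

p, r, f = token_prf(gold, pred)
print(f"token_p={p:.2f} token_r={r:.2f} token_f={f:.2f}")
```

If these three values are identical when the same pipeline is evaluated under v3.4 and v3.5, a lower `token_acc` reflects only the bug fix.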

For new `textcat` or `textcat_multilabel` configs, the new default `v2` scorers:

- ignore `threshold` for `textcat`, so the reported `cats_p/r/f` may increase
  slightly in v3.5 even though underlying performance is unchanged
- report the performance of only the **final** `textcat` or `textcat_multilabel`
  component in the pipeline by default
- custom scorers can be used to score multiple `textcat` and
  `textcat_multilabel` components with the built-in `Scorer.score_cats` scorer
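
Why ignoring `threshold` can raise the reported scores for exclusive-class `textcat` is easiest to see with a short sketch. It rests on a simplifying assumption, namely that the highest-scoring label is always taken as the prediction. It is not the actual scorer code, and the function names and scores are made up for illustration.

```python
def predict_with_threshold(scores, threshold):
    """Pick the best label, but only if its score reaches the threshold."""
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= threshold else None


def predict_argmax(scores):
    """Pick the best label regardless of any threshold."""
    return max(scores, key=scores.get)


# Three mutually exclusive labels: the winner does not need to exceed 0.5.
scores = {"SPORTS": 0.40, "POLITICS": 0.35, "TECH": 0.25}

print(predict_with_threshold(scores, threshold=0.5))  # None
print(predict_argmax(scores))  # SPORTS
```

If the gold label is `SPORTS`, the thresholded evaluation counts a miss while the threshold-free evaluation counts a hit, although the model output is the same in both cases.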

### Pipeline package version compatibility {#version-compat}