---
title: What's New in v3.5
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New features {id="features",hidden="true"}

spaCy v3.5 introduces three new CLI commands, `apply`, `benchmark` and
`find-threshold`, adds fuzzy matching, provides improvements to our entity
linking functionality, and includes a range of language updates and bug fixes.

### New CLI commands {id="cli"}

#### apply CLI

The [`apply` CLI](/api/cli#apply) can be used to apply a pipeline to one or more
`.txt`, `.jsonl` or `.spacy` input files, saving the annotated docs in a single
`.spacy` file.

```bash
$ spacy apply en_core_web_sm my_texts/ output.spacy
```
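
The resulting `.spacy` file is a serialized [`DocBin`](/api/docbin), so the
annotated docs can be loaded back in Python, for example:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")
# Load the docs produced by the `apply` command above
doc_bin = DocBin().from_disk("output.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
```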

#### benchmark CLI

The [`benchmark` CLI](/api/cli#benchmark) has been added to extend the existing
`evaluate` functionality with a wider range of profiling subcommands.

The `benchmark accuracy` CLI is introduced as an alias for `evaluate`. The new
`benchmark speed` CLI performs warmup rounds before measuring the speed in words
per second on batches of randomly shuffled documents from the provided data.

```bash
$ spacy benchmark speed my_pipeline data.spacy
```

The output is the mean performance using batches (`nlp.pipe`) with a 95%
confidence interval, e.g., profiling `en_core_web_sm` on CPU:

```none
Outliers: 2.0%, extreme outliers: 0.0%
Mean: 18904.1 words/s (95% CI: -256.9 +244.1)
```

#### find-threshold CLI

The [`find-threshold` CLI](/api/cli#find-threshold) runs a series of trials
across threshold values from `0.0` to `1.0` and identifies the best threshold
for the provided score metric.

The following command runs 20 trials for the `spancat` component in
`my_pipeline`, recording the `spans_sc_f` score for each value of the threshold
`[components.spancat.threshold]` from `0.0` to `1.0`:

```bash
$ spacy find-threshold my_pipeline data.spacy spancat threshold spans_sc_f --n_trials 20
```

The `find-threshold` CLI can be used with `textcat_multilabel`, `spancat` and
custom components with thresholds that are applied while predicting or scoring.

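Once you've identified a suitable threshold, one way to use it is to override
the component setting when loading the pipeline (a minimal sketch; the value
`0.42` is just a placeholder for whatever `find-threshold` reports):

```python
import spacy

# Apply the threshold found by `find-threshold` as a config override
nlp = spacy.load("my_pipeline", config={"components.spancat.threshold": 0.42})
```
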
### Fuzzy matching {id="fuzzy"}

New `FUZZY` operators support [fuzzy matching](/usage/rule-based-matching#fuzzy)
with the `Matcher`. By default, the `FUZZY` operator allows a Levenshtein edit
distance of at least 2 and up to 30% of the pattern string length.
`FUZZY1`..`FUZZY9` can be used to specify the exact number of allowed edits.

```python
# Match lowercase with fuzzy matching (allows up to 3 edits)
pattern = [{"LOWER": {"FUZZY": "definitely"}}]

# Match custom attribute values with fuzzy matching (allows up to 3 edits)
pattern = [{"_": {"country": {"FUZZY": "Kyrgyzstan"}}}]

# Match with exact Levenshtein edit distance limits (allows up to 4 edits)
pattern = [{"_": {"country": {"FUZZY4": "Kyrgyzstan"}}}]
```

Note that `FUZZY` uses Levenshtein edit distance rather than Damerau-Levenshtein
edit distance, so a transposition like `teh` for `the` counts as two edits, one
insertion and one deletion.

If you'd prefer an alternate fuzzy matching algorithm, you can provide your own
custom method to the `Matcher` or as a config option for an entity ruler and
span ruler.

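As a minimal sketch (assuming the `Matcher` accepts a `fuzzy_compare` callable
that receives the token text, the pattern text and the number of allowed
edits), a custom method could look like this:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

def strict_compare(input_text: str, pattern_text: str, fuzzy: int) -> bool:
    # Hypothetical comparison: only accept matches that differ in casing,
    # ignoring the number of edits the pattern would otherwise allow
    return input_text.lower() == pattern_text.lower()

nlp = English()
# Replace the default Levenshtein-based comparison with the custom one
matcher = Matcher(nlp.vocab, fuzzy_compare=strict_compare)
matcher.add("GREETING", [[{"TEXT": {"FUZZY": "hello"}}]])
```
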
### FUZZY and REGEX with lists {id="fuzzy-regex-lists"}

The `FUZZY` and `REGEX` operators are also now supported for lists with `IN` and
`NOT_IN`:

```python
pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]
pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]
```

### Entity linking generalization {id="el"}

The knowledge base used for entity linking is now easier to customize and has a
new default implementation [`InMemoryLookupKB`](/api/inmemorylookupkb).

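As a quick illustration (the entity ID, frequency, alias and vector values
below are just placeholders), the default implementation can be instantiated
and populated like this:

```python
from spacy.kb import InMemoryLookupKB
from spacy.lang.en import English

nlp = English()
# Placeholder vector length of 3 - real KBs use the pipeline's vector size
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=32, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[1.0])
```
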
### Additional features and improvements {id="additional-features-and-improvements"}

- Language updates:
  - Extended support for Slovenian
  - Fixed lookup fallback for French and Catalan lemmatizers
  - Switch Russian and Ukrainian lemmatizers to `pymorphy3`
  - Support for editorial punctuation in Ancient Greek
  - Update to Russian tokenizer exceptions
  - Small fix for Dutch stop words
- Allow up to `typer` v0.7.x, `mypy` 0.990 and `typing_extensions` v4.4.x.
- New `spacy.ConsoleLogger.v3` with expanded progress
  [tracking](/api/top-level#ConsoleLogger).
- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` and
  `spacy.textcat_multilabel_scorer.v2`.
- Updates so that downstream components can train properly on a frozen `tok2vec`
  or `transformer` layer.
- Allow interpolation of variables in directory names in projects.
- Support for local file system [remotes](/usage/projects#remote) for projects.
- Improve UX around `displacy.serve` when the default port is in use.
- Optional `before_update` callback that is invoked at the start of each
  [training step](/api/data-formats#config-training) (see the sketch after this
  list).
- Improve performance of `SpanGroup` and fix typing issues for `SpanGroup` and
  `Span` objects.
- Patch a
  [security vulnerability](https://github.com/advisories/GHSA-gw9q-c7gh-j9vm) in
  extracting tar files.
- Add equality definition for `Vectors`.
- Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and
  `vectors`.
- Correctly handle missing annotations in the edit tree lemmatizer.

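A `before_update` callback can be registered in the callbacks registry and
referenced from the `[training.before_update]` block of the config. The sketch
below assumes the documented signature, where the callback receives the `nlp`
object and an info dict with the current `step` and `epoch`; the registry name
is just a placeholder:

```python
import spacy

@spacy.registry.callbacks("my_before_update.v1")
def create_before_update():
    def before_update(nlp, info):
        # Hypothetical use: log (or adjust components) at a given training step
        if info["step"] == 500:
            print(f"Reached step 500 in epoch {info['epoch']}")
    return before_update
```
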
### Trained pipeline updates {id="pipelines"}

- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and
  `morphologizer` components to improve tagging of non-whitespace vs. whitespace
  tokens.
- The transformer pipelines require `spacy-transformers` v1.2, which uses the
  exact alignment from `tokenizers` for fast tokenizers instead of the heuristic
  alignment from `spacy-alignments`. For all trained pipelines except
  `ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens
  may be slightly different. More details about the `spacy-transformers` changes
  can be found in the
  [v1.2.0 release notes](https://github.com/explosion/spacy-transformers/releases/tag/v1.2.0).

## Notes about upgrading from v3.4 {id="upgrading"}

### Validation of textcat values {id="textcat-validation"}

An error is now raised when unsupported values are given as input to train a
`textcat` or `textcat_multilabel` model - ensure that values are `0.0` or `1.0`
as explained in the [docs](/api/textcategorizer#assigned-attributes).

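For example, category annotations in your training data should look roughly
like this (a minimal sketch; the label names are just placeholders):

```python
from spacy.lang.en import English
from spacy.training import Example

nlp = English()
doc = nlp.make_doc("This is great!")
# Category values must be 0.0 or 1.0 - soft labels such as 0.5 now raise an error
example = Example.from_dict(doc, {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
```
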
### Using the default knowledge base

As `KnowledgeBase` is now an abstract class, you should call the constructor of
the new `InMemoryLookupKB` instead when you want to use spaCy's default KB
implementation:

```diff
- kb = KnowledgeBase()
+ kb = InMemoryLookupKB()
```

If you've written a custom KB that inherits from `KnowledgeBase`, you'll need to
implement its abstract methods, or alternatively inherit from `InMemoryLookupKB`
instead.

### Updated scorers for tokenization and textcat {id="scores"}

We fixed a bug that inflated the `token_acc` scores in v3.0-v3.4. The reported
`token_acc` will drop from v3.4 to v3.5, but if `token_p/r/f` stay the same,
your tokenization performance has not changed from v3.4.

For new `textcat` or `textcat_multilabel` configs, the new default `v2` scorers:

- ignore `threshold` for `textcat`, so the reported `cats_p/r/f` may increase
  slightly in v3.5 even though the underlying predictions are unchanged
- report the performance of only the **final** `textcat` or `textcat_multilabel`
  component in the pipeline by default
- allow custom scorers to be used to score multiple `textcat` and
  `textcat_multilabel` components with `Scorer.score_cats` by restricting the
  evaluation to the component's provided labels

### Pipeline package version compatibility {id="version-compat"}

> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you're loading a pipeline package trained with an earlier version of spaCy
v3, you will see a warning telling you that the pipeline may be incompatible.
This doesn't necessarily have to be true, but we recommend running your
pipelines against your test suite or evaluation data to make sure there are no
unexpected results.

If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).

If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):

```diff
- "spacy_version": ">=3.4.0,<3.5.0",
+ "spacy_version": ">=3.4.0,<3.6.0",
```

### Updating v3.4 configs

To update a config from spaCy v3.4 with the new v3.5 settings, run
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config-v3.4.cfg config-v3.5.cfg
```

In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).