mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-24 16:24:16 +03:00
3.5 usage page (#12057)
* skeleton * Fill in non-CLI details from release notes draft * Add TODO for fuzzy matching * Website updates for v3-5 draft * Fill in usage examples * Add fuzzy matching to intro * Fix fuzzy examples * Shell example formatting * Fix typo * Format * Remove trailing periods in internal list * Update * Fix spacing for nested lists * Update InMemoryLookupKB link Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Ines Montani <ines@ines.io>
This commit is contained in:
parent
1e993d3b03
commit
0f5d8a27f2
215
website/docs/usage/v3-5.mdx
Normal file
215
website/docs/usage/v3-5.mdx
Normal file
|
@ -0,0 +1,215 @@
|
|||
---
|
||||
title: What's New in v3.5
|
||||
teaser: New features and how to upgrade
|
||||
menu:
|
||||
- ['New Features', 'features']
|
||||
- ['Upgrading Notes', 'upgrading']
|
||||
---
|
||||
|
||||
## New features {id="features",hidden="true"}
|
||||
|
||||
spaCy v3.5 introduces three new CLI commands, `apply`, `benchmark` and
|
||||
`find-threshold`, adds fuzzy matching, provides improvements to our entity
|
||||
linking functionality, and includes a range of language updates and bug fixes.
|
||||
|
||||
### New CLI commands {id="cli"}
|
||||
|
||||
#### apply CLI
|
||||
|
||||
The [`apply` CLI](/api/cli#apply) can be used to apply a pipeline to one or more
|
||||
`.txt`, `.jsonl` or `.spacy` input files, saving the annotated docs in a single
|
||||
`.spacy` file.
|
||||
|
||||
```bash
|
||||
$ spacy apply en_core_web_sm my_texts/ output.spacy
|
||||
```
|
||||
|
||||
#### benchmark CLI
|
||||
|
||||
The [`benchmark` CLI](/api/cli#benchmark) has been added to extend the existing
|
||||
`evaluate` functionality with a wider range of profiling subcommands.
|
||||
|
||||
The `benchmark accuracy` CLI is introduced as an alias for `evaluate`. The new
|
||||
`benchmark speed` CLI performs warmup rounds before measuring the speed in words
|
||||
per second on batches of randomly shuffled documents from the provided data.
|
||||
|
||||
```bash
|
||||
$ spacy benchmark speed my_pipeline data.spacy
|
||||
```
|
||||
|
||||
The output is the mean performance using batches (`nlp.pipe`) with a 95%
|
||||
confidence interval, e.g., profiling `en_core_web_sm` on CPU:
|
||||
|
||||
```none
|
||||
Outliers: 2.0%, extreme outliers: 0.0%
|
||||
Mean: 18904.1 words/s (95% CI: -256.9 +244.1)
|
||||
```
|
||||
|
||||
#### find-threshold CLI
|
||||
|
||||
The [`find-threshold` CLI](/api/cli#find-threshold) runs a series of trials
|
||||
across threshold values from `0.0` to `1.0` and identifies the best threshold
|
||||
for the provided score metric.
|
||||
|
||||
The following command runs 20 trials for the `spancat` component in
|
||||
`my_pipeline`, recording the `spans_sc_f` score for each value of the threshold
|
||||
`[components.spancat.threshold]` from `0.0` to `1.0`:
|
||||
|
||||
```bash
|
||||
$ spacy find-threshold my_pipeline data.spacy spancat threshold spans_sc_f --n_trials 20
|
||||
```
|
||||
|
||||
The `find-threshold` CLI can be used with `textcat_multilabel`, `spancat` and
|
||||
custom components with thresholds that are applied while predicting or scoring.
|
||||
|
||||
### Fuzzy matching {id="fuzzy"}
|
||||
|
||||
New `FUZZY` operators support [fuzzy matching](/usage/rule-based-matching#fuzzy)
|
||||
with the `Matcher`. By default, the `FUZZY` operator allows a Levenshtein edit
|
||||
distance of 2 and up to 30% of the pattern string length. `FUZZY1`..`FUZZY9` can
|
||||
be used to specify the exact number of allowed edits.
|
||||
|
||||
```python
|
||||
# Match lowercase with fuzzy matching (allows up to 3 edits)
|
||||
pattern = [{"LOWER": {"FUZZY": "definitely"}}]
|
||||
|
||||
# Match custom attribute values with fuzzy matching (allows up to 3 edits)
|
||||
pattern = [{"_": {"country": {"FUZZY": "Kyrgyzstan"}}}]
|
||||
|
||||
# Match with exact Levenshtein edit distance limits (allows up to 4 edits)
|
||||
pattern = [{"_": {"country": {"FUZZY4": "Kyrgyzstan"}}}]
|
||||
```
|
||||
|
||||
Note that `FUZZY` uses Levenshtein edit distance rather than Damerau-Levenshtein
|
||||
edit distance, so a transposition like `teh` for `the` counts as two edits, one
|
||||
insertion and one deletion.
|
||||
|
||||
If you'd prefer an alternate fuzzy matching algorithm, you can provide your own
|
||||
custom method to the `Matcher` or as a config option for an entity ruler and
|
||||
span ruler.
|
||||
|
||||
### FUZZY and REGEX with lists {id="fuzzy-regex-lists"}
|
||||
|
||||
The `FUZZY` and `REGEX` operators are also now supported for lists with `IN` and
|
||||
`NOT_IN`:
|
||||
|
||||
```python
|
||||
pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]
|
||||
pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]
|
||||
```
|
||||
|
||||
### Entity linking generalization {id="el"}
|
||||
|
||||
The knowledge base used for entity linking is now easier to customize and has a
|
||||
new default implementation [`InMemoryLookupKB`](/api/inmemorylookupkb).
|
||||
|
||||
### Additional features and improvements {id="additional-features-and-improvements"}
|
||||
|
||||
- Language updates:
|
||||
- Extended support for Slovenian
|
||||
- Fixed lookup fallback for French and Catalan lemmatizers
|
||||
- Switch Russian and Ukrainian lemmatizers to `pymorphy3`
|
||||
- Support for editorial punctuation in Ancient Greek
|
||||
- Update to Russian tokenizer exceptions
|
||||
- Small fix for Dutch stop words
|
||||
- Allow up to `typer` v0.7.x, `mypy` 0.990 and `typing_extensions` v4.4.x.
|
||||
- New `spacy.ConsoleLogger.v3` with expanded progress
|
||||
[tracking](/api/top-level#ConsoleLogger).
|
||||
- Improved scoring behavior for `textcat` with `spacy.textcat_scorer.v2` and
|
||||
`spacy.textcat_multilabel_scorer.v2`.
|
||||
- Updates so that downstream components can train properly on a frozen `tok2vec`
|
||||
or `transformer` layer.
|
||||
- Allow interpolation of variables in directory names in projects.
|
||||
- Support for local file system [remotes](/usage/projects#remote) for projects.
|
||||
- Improve UX around `displacy.serve` when the default port is in use.
|
||||
- Optional `before_update` callback that is invoked at the start of each
|
||||
[training step](/api/data-formats#config-training).
|
||||
- Improve performance of `SpanGroup` and fix typing issues for `SpanGroup` and
|
||||
`Span` objects.
|
||||
- Patch a
|
||||
[security vulnerability](https://github.com/advisories/GHSA-gw9q-c7gh-j9vm) in
|
||||
extracting tar files.
|
||||
- Add equality definition for `Vectors`.
|
||||
- Ensure `Vocab.to_disk` respects the exclude setting for `lookups` and
|
||||
`vectors`.
|
||||
- Correctly handle missing annotations in the edit tree lemmatizer.
|
||||
|
||||
### Trained pipeline updates {id="pipelines"}
|
||||
|
||||
- The CNN pipelines add `IS_SPACE` as a `tok2vec` feature for `tagger` and
|
||||
`morphologizer` components to improve tagging of non-whitespace vs. whitespace
|
||||
tokens.
|
||||
- The transformer pipelines require `spacy-transformers` v1.2, which uses the
|
||||
exact alignment from `tokenizers` for fast tokenizers instead of the heuristic
|
||||
alignment from `spacy-alignments`. For all trained pipelines except
|
||||
`ja_core_news_trf`, the alignments between spaCy tokens and transformer tokens
|
||||
may be slightly different. More details about the `spacy-transformers` changes
|
||||
in the
|
||||
[v1.2.0 release notes](https://github.com/explosion/spacy-transformers/releases/tag/v1.2.0).
|
||||
|
||||
## Notes about upgrading from v3.4 {id="upgrading"}
|
||||
|
||||
### Validation of textcat values {id="textcat-validation"}
|
||||
|
||||
An error is now raised when unsupported values are given as input to train a
|
||||
`textcat` or `textcat_multilabel` model - ensure that values are `0.0` or `1.0`
|
||||
as explained in the [docs](/api/textcategorizer#assigned-attributes).
|
||||
|
||||
### Updated scorers for tokenization and textcat {id="scores"}
|
||||
|
||||
We fixed a bug that inflated the `token_acc` scores in v3.0-v3.4. The reported
|
||||
`token_acc` will drop from v3.4 to v3.5, but if `token_p/r/f` stay the same,
|
||||
your tokenization performance has not changed from v3.4.
|
||||
|
||||
For new `textcat` or `textcat_multilabel` configs, the new default `v2` scorers:
|
||||
|
||||
- ignore `threshold` for `textcat`, so the reported `cats_p/r/f` may increase
|
||||
slightly in v3.5 even though the underlying predictions are unchanged
|
||||
- report the performance of only the **final** `textcat` or `textcat_multilabel`
|
||||
component in the pipeline by default
|
||||
- allow custom scorers to be used to score multiple `textcat` and
|
||||
`textcat_multilabel` components with `Scorer.score_cats` by restricting the
|
||||
evaluation to the component's provided labels
|
||||
|
||||
### Pipeline package version compatibility {id="version-compat"}
|
||||
|
||||
> #### Using legacy implementations
|
||||
>
|
||||
> In spaCy v3, you'll still be able to load and reference legacy implementations
|
||||
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
|
||||
> components or architectures change and newer versions are available in the
|
||||
> core library.
|
||||
|
||||
When you're loading a pipeline package trained with an earlier version of spaCy
|
||||
v3, you will see a warning telling you that the pipeline may be incompatible.
|
||||
This doesn't necessarily have to be true, but we recommend running your
|
||||
pipelines against your test suite or evaluation data to make sure there are no
|
||||
unexpected results.
|
||||
|
||||
If you're using one of the [trained pipelines](/models) we provide, you should
|
||||
run [`spacy download`](/api/cli#download) to update to the latest version. To
|
||||
see an overview of all installed packages and their compatibility, you can run
|
||||
[`spacy validate`](/api/cli#validate).
|
||||
|
||||
If you've trained your own custom pipeline and you've confirmed that it's still
|
||||
working as expected, you can update the spaCy version requirements in the
|
||||
[`meta.json`](/api/data-formats#meta):
|
||||
|
||||
```diff
|
||||
- "spacy_version": ">=3.4.0,<3.5.0",
|
||||
+ "spacy_version": ">=3.4.0,<3.6.0",
|
||||
```
|
||||
|
||||
### Updating v3.4 configs
|
||||
|
||||
To update a config from spaCy v3.4 with the new v3.5 settings, run
|
||||
[`init fill-config`](/api/cli#init-fill-config):
|
||||
|
||||
```cli
|
||||
$ python -m spacy init fill-config config-v3.4.cfg config-v3.5.cfg
|
||||
```
|
||||
|
||||
In many cases ([`spacy train`](/api/cli#train),
|
||||
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
|
||||
automatically, but you'll need to fill in the new settings to run
|
||||
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
|
|
@ -13,7 +13,8 @@
|
|||
{ "text": "New in v3.1", "url": "/usage/v3-1" },
|
||||
{ "text": "New in v3.2", "url": "/usage/v3-2" },
|
||||
{ "text": "New in v3.3", "url": "/usage/v3-3" },
|
||||
{ "text": "New in v3.4", "url": "/usage/v3-4" }
|
||||
{ "text": "New in v3.4", "url": "/usage/v3-4" },
|
||||
{ "text": "New in v3.5", "url": "/usage/v3-5" }
|
||||
]
|
||||
},
|
||||
{
|
||||
|
|
|
@ -20,6 +20,10 @@
|
|||
display: inline-block
|
||||
margin-bottom: var(--spacing-sm)
|
||||
|
||||
.ol, .ul
|
||||
margin-top: var(--spacing-xs)
|
||||
margin-bottom: var(--spacing-xs)
|
||||
|
||||
&:before
|
||||
content: '\25CF'
|
||||
position: relative
|
||||
|
|
|
@ -58,8 +58,8 @@ const AlertSpace = ({ nightly, legacy }) => {
|
|||
}
|
||||
|
||||
const navAlert = (
|
||||
<Link to="/usage/v3-4" hidden>
|
||||
<strong>💥 Out now:</strong> spaCy v3.4
|
||||
<Link to="/usage/v3-5" hidden>
|
||||
<strong>💥 Out now:</strong> spaCy v3.5
|
||||
</Link>
|
||||
)
|
||||
|
||||
|
|
Loading…
Reference in New Issue
Block a user