---
title: What's New in v3.3
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New features {#features hidden="true"}

spaCy v3.3 improves the speed of core pipeline components, adds a new trainable
lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.

### Speed improvements {#speed}

v3.3 includes a slew of speed improvements:

- Speed up parser and NER by using constant-time head lookups.
- Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up
  inference for tagger, morphologizer, senter and trainable lemmatizer.
- Speed up parser projectivization functions.
- Replace `Ragged` with faster `AlignmentArray` in `Example` for training.
- Improve `Matcher` speed.
- Improve serialization speed for empty `Doc.spans`.

For longer texts, prediction speed for the trained pipelines improves by **15%**
or more. We benchmarked `en_core_web_md` (same components as in v3.2) and
`de_core_news_md` (with the new trainable lemmatizer) across a range of text
sizes on Linux (Intel Xeon W-2265) and macOS (M1) to compare spaCy v3.2 vs.
v3.3:

**Intel Xeon W-2265**

| Model                                            | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec |   Diff |
| :----------------------------------------------- | -------------: | -------------: | -------------: | -----: |
| [`en_core_web_md`](/models/en#en_core_web_md)    |            100 |          17292 |          17441 |  0.86% |
| (=same components)                               |           1000 |          15408 |          16024 |  4.00% |
|                                                  |          10000 |          12798 |          15346 | 19.91% |
| [`de_core_news_md`](/models/de/#de_core_news_md) |            100 |          20221 |          19321 | -4.45% |
| (+v3.3 trainable lemmatizer)                     |           1000 |          17480 |          17345 | -0.77% |
|                                                  |          10000 |          14513 |          17036 | 17.38% |

**Apple M1**

| Model                                            | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec |   Diff |
| ------------------------------------------------ | -------------: | -------------: | -------------: | -----: |
| [`en_core_web_md`](/models/en#en_core_web_md)    |            100 |          18272 |          18408 |  0.74% |
| (=same components)                               |           1000 |          18794 |          19248 |  2.42% |
|                                                  |          10000 |          15144 |          17513 | 15.64% |
| [`de_core_news_md`](/models/de/#de_core_news_md) |            100 |          19227 |          19591 |  1.89% |
| (+v3.3 trainable lemmatizer)                     |           1000 |          20047 |          20628 |  2.90% |
|                                                  |          10000 |          15921 |          18546 | 16.49% |

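The `Diff` column appears to be the relative change in words per second from
v3.2 to v3.3. A minimal sketch of that arithmetic, using three rows copied from
the Intel Xeon W-2265 table (the helper name `percent_diff` is illustrative and
not part of spaCy):

```python
def percent_diff(old_wps: float, new_wps: float) -> float:
    """Relative change from the old to the new throughput, in percent."""
    return (new_wps - old_wps) / old_wps * 100


# (model, avg. words/doc) -> (v3.2 words/sec, v3.3 words/sec)
benchmarks = {
    ("en_core_web_md", 10000): (12798, 15346),
    ("de_core_news_md", 100): (20221, 19321),
    ("de_core_news_md", 10000): (14513, 17036),
}

for (model, words_per_doc), (v32, v33) in benchmarks.items():
    print(f"{model} @ {words_per_doc} words/doc: {percent_diff(v32, v33):+.2f}%")
```

Read this way, the gains are concentrated in the 10000-word documents, while
`de_core_news_md` is slightly slower on short documents on the Intel machine,
where it also runs the new trainable lemmatizer.
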
### Trainable lemmatizer {#trainable-lemmatizer}

The new [trainable lemmatizer](/api/edittreelemmatizer) component uses
[edit trees](https://explosion.ai/blog/edit-tree-lemmatizer) to transform tokens
into lemmas. Try out the trainable lemmatizer with the
[training quickstart](/usage/training#quickstart)!

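Outside the quickstart, the component can also be added to a pipeline in code.
A minimal sketch, assuming spaCy v3.3 or later is installed and that the factory
is registered under the name `trainable_lemmatizer` (the component name
`lemmatizer` is just a choice made here). The component still has to be trained
before it predicts useful lemmas:

```python
import spacy

# Start from a blank English pipeline; no trained package is required.
nlp = spacy.blank("en")

# Add the edit tree lemmatizer under the conventional name "lemmatizer".
nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")

print(nlp.pipe_names)
```
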
### displaCy support for overlapping spans and arcs {#displacy}

displaCy now supports overlapping spans with a new
[`span`](/usage/visualizers#span) style and multiple arcs with different labels
between the same tokens for [`dep`](/usage/visualizers#dep) visualizations.

Overlapping spans can be visualized for any spans key in `doc.spans`:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
text = "Welcome to the Bank of China."
doc = nlp(text)
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
displacy.serve(doc, style="span", options={"spans_key": "custom"})
```

import DisplacySpanHtml from 'images/displacy-span.html'

<Iframe title="displaCy visualizer for overlapping spans" html={DisplacySpanHtml} height={180} />

## Additional features and improvements

- Config comparisons with [`spacy debug diff-config`](/api/cli#debug-diff).
- Span suggester debugging with
  [`SpanCategorizer.set_candidates`](/api/spancategorizer#set_candidates).
- Big endian support with
  [`thinc-bigendian-ops`](https://github.com/andrewsi-z/thinc-bigendian-ops) and
  updates to make `floret`, `murmurhash`, Thinc and spaCy endian neutral.
- Initial support for Lower Sorbian and Upper Sorbian.
- Language updates for English, French, Italian, Japanese, Korean, Norwegian,
  Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
- New noun chunks for Finnish.

## Trained pipelines {#pipelines}

### New trained pipelines {#new-pipelines}

v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use
the new trainable lemmatizer and
[floret vectors](https://github.com/explosion/floret). Due to the use of
[Bloom embeddings](https://explosion.ai/blog/bloom-embeddings) and subwords, the
pipelines have compact vectors with no out-of-vocabulary words.

| Package                                         | Language | UPOS | Parser LAS | NER F |
| ----------------------------------------------- | -------- | ---: | ---------: | ----: |
| [`fi_core_news_sm`](/models/fi#fi_core_news_sm) | Finnish  | 92.5 |       71.9 |  75.9 |
| [`fi_core_news_md`](/models/fi#fi_core_news_md) | Finnish  | 95.9 |       78.6 |  80.6 |
| [`fi_core_news_lg`](/models/fi#fi_core_news_lg) | Finnish  | 96.2 |       79.4 |  82.4 |
| [`ko_core_news_sm`](/models/ko#ko_core_news_sm) | Korean   | 86.1 |       65.6 |  71.3 |
| [`ko_core_news_md`](/models/ko#ko_core_news_md) | Korean   | 94.7 |       80.9 |  83.1 |
| [`ko_core_news_lg`](/models/ko#ko_core_news_lg) | Korean   | 94.7 |       81.3 |  85.3 |
| [`sv_core_news_sm`](/models/sv#sv_core_news_sm) | Swedish  | 95.0 |       75.9 |  74.7 |
| [`sv_core_news_md`](/models/sv#sv_core_news_md) | Swedish  | 96.3 |       78.5 |  79.3 |
| [`sv_core_news_lg`](/models/sv#sv_core_news_lg) | Swedish  | 96.3 |       79.1 |  81.1 |

### Pipeline updates {#pipeline-updates}

The following languages switch from lookup or rule-based lemmatizers to the new
trainable lemmatizer: Danish, Dutch, German, Greek, Italian, Lithuanian,
Norwegian, Polish, Portuguese and Romanian. The overall lemmatizer accuracy
improves for all of these pipelines, but be aware that the types of errors may
look quite different from the lookup-based lemmatizers. If you'd prefer to
continue using the previous lemmatizer, you can
[switch from the trainable lemmatizer to a non-trainable lemmatizer](/models#design-modify).

<figure>

| Model                                           | v3.2 Lemma Acc | v3.3 Lemma Acc |
| ----------------------------------------------- | -------------: | -------------: |
| [`da_core_news_md`](/models/da#da_core_news_md) |           84.9 |           94.8 |
| [`de_core_news_md`](/models/de#de_core_news_md) |           73.4 |           97.7 |
| [`el_core_news_md`](/models/el#el_core_news_md) |           56.5 |           88.9 |
| [`fi_core_news_md`](/models/fi#fi_core_news_md) |              - |           86.2 |
| [`it_core_news_md`](/models/it#it_core_news_md) |           86.6 |           97.2 |
| [`ko_core_news_md`](/models/ko#ko_core_news_md) |              - |           90.0 |
| [`lt_core_news_md`](/models/lt#lt_core_news_md) |           71.1 |           84.8 |
| [`nb_core_news_md`](/models/nb#nb_core_news_md) |           76.7 |           97.1 |
| [`nl_core_news_md`](/models/nl#nl_core_news_md) |           81.5 |           94.0 |
| [`pl_core_news_md`](/models/pl#pl_core_news_md) |           87.1 |           93.7 |
| [`pt_core_news_md`](/models/pt#pt_core_news_md) |           76.7 |           96.9 |
| [`ro_core_news_md`](/models/ro#ro_core_news_md) |           81.8 |           95.5 |
| [`sv_core_news_md`](/models/sv#sv_core_news_md) |              - |           95.5 |

</figure>

In addition, the vectors in the English pipelines are deduplicated to improve
the pruned vectors in the `md` models and reduce the `lg` model size.

## Notes about upgrading from v3.2 {#upgrading}

### Span comparisons

Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span
attributes into account (start, end, label, and KB ID) so spans may be sorted in
a slightly different order.

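If your code depends on a particular span order, one option is to sort with an
explicit key instead of relying on the built-in comparison. A minimal sketch,
assuming spaCy v3 is installed; the choice of key (token offsets first, then the
label and KB ID strings) is only an illustration and not the ordering spaCy
uses internally:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Two spans share the same boundaries and differ only in their label.
spans = [
    Span(doc, 3, 6, "ORG"),
    Span(doc, 5, 6, "GPE"),
    Span(doc, 3, 6, "FAC"),
]

# Sort by an explicit, version-independent key.
ordered = sorted(spans, key=lambda s: (s.start, s.end, s.label_, s.kb_id_))

for span in ordered:
    print(span.start, span.end, span.label_, span.text)
```
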
### Whitespace annotation

During training, annotation on whitespace tokens is handled in the same way as
annotation on non-whitespace tokens in order to allow custom whitespace
annotation.

### Doc.from_docs

[`Doc.from_docs`](/api/doc#from_docs) now includes `Doc.tensor` by default and
accepts an `exclude` argument in the same format as `Doc.to_bytes`. The
supported exclude fields are `spans`, `tensor` and `user_data`.

Docs that include `Doc.tensor` may be quite a bit larger in RAM. To exclude
`Doc.tensor` as in v3.2:

```diff
-merged_doc = Doc.from_docs(docs)
+merged_doc = Doc.from_docs(docs, exclude=["tensor"])
```

### Using trained pipelines with floret vectors

If you're running a new trained pipeline for Finnish, Korean or Swedish on new
texts and working with `Doc` objects, you shouldn't notice any difference with
floret vectors vs. default vectors.

If you use vectors for similarity comparisons, there are a few differences,
mainly because a floret pipeline doesn't include any kind of frequency-based
word list similar to the list of in-vocabulary vector keys with default vectors.

- If your workflow iterates over the vector keys, you should use an external
  word list instead:

  ```diff
  - lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
  + lexemes = [nlp.vocab[word] for word in external_word_list]
  ```

- `Vectors.most_similar` is not supported because there's no fixed list of
  vectors to compare your vectors to.

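Code that has to work with both kinds of pipelines could branch on the vectors
mode. A minimal sketch, assuming spaCy v3.2 or later (where `Vectors.mode` is
either `"default"` or `"floret"`); the helper `vocabulary_words` and the sample
word list are hypothetical, and the blank pipeline below has no vectors loaded,
so it only exercises the default branch:

```python
import spacy


def vocabulary_words(nlp, external_word_list):
    """Return the words to iterate over, depending on the vectors mode."""
    vectors = nlp.vocab.vectors
    if vectors.mode == "floret":
        # floret vectors have no fixed key list, so use the external word list
        return list(external_word_list)
    # default vectors: keys are hashes that map back to strings
    return [nlp.vocab.strings[key] for key in vectors.keys()]


nlp = spacy.blank("en")  # blank pipeline: default mode, empty vectors table
words = vocabulary_words(nlp, ["kissa", "koira"])
print(nlp.vocab.vectors.mode, words)
```

The same check can guard calls to `Vectors.most_similar`, which is only
available for default vectors.
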
### Pipeline package version compatibility {#version-compat}

> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you're loading a pipeline package trained with an earlier version of spaCy
v3, you will see a warning telling you that the pipeline may be incompatible.
This doesn't necessarily have to be true, but we recommend running your
pipelines against your test suite or evaluation data to make sure there are no
unexpected results.

If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).

If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):

```diff
- "spacy_version": ">=3.2.0,<3.3.0",
+ "spacy_version": ">=3.2.0,<3.4.0",
```

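The same edit can be scripted if you maintain several packages. A minimal
sketch using only the standard library; the helper `widen_spacy_version` is
hypothetical, it assumes the requirement has the `>=X,<Y` form shown above, and
writing the result back to your package's `meta.json` is left out:

```python
import json


def widen_spacy_version(meta: dict, new_upper: str = "3.4.0") -> dict:
    """Return a copy of meta with the upper spaCy version bound raised."""
    lower_bound = meta["spacy_version"].split(",")[0]  # e.g. ">=3.2.0"
    updated = dict(meta)
    updated["spacy_version"] = f"{lower_bound},<{new_upper}"
    return updated


# Stand-in for the parsed contents of a meta.json file.
meta = json.loads('{"name": "my_pipeline", "spacy_version": ">=3.2.0,<3.3.0"}')
updated_meta = widen_spacy_version(meta)
print(json.dumps(updated_meta, indent=2))
```
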
### Updating v3.2 configs

To update a config from spaCy v3.2 with the new v3.3 settings, run
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config-v3.2.cfg config-v3.3.cfg
```

In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).

To see the speed improvements for the
[`Tagger` architecture](/api/architectures#Tagger), edit your config to switch
from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
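
The architecture switch is a one-line change per component, so it can be done
by hand or with a small script. A minimal sketch using plain string
replacement; the helper `switch_tagger_to_v2` and the config fragment are
illustrative only, and you would still run `init fill-config` on the result to
fill in the new settings:

```python
def switch_tagger_to_v2(config_text: str) -> str:
    """Point every spacy.Tagger.v1 architecture reference at spacy.Tagger.v2."""
    return config_text.replace('"spacy.Tagger.v1"', '"spacy.Tagger.v2"')


# Illustrative fragment of a v3.2 config, not a complete file.
old_config = """[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
"""

new_config = switch_tagger_to_v2(old_config)
print(new_config)
```
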