mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-26 05:31:15 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			248 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			248 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| ---
 | |
| title: What's New in v3.3
 | |
| teaser: New features and how to upgrade
 | |
| menu:
 | |
|   - ['New Features', 'features']
 | |
|   - ['Upgrading Notes', 'upgrading']
 | |
| ---
 | |
| 
 | |
| ## New features {id="features",hidden="true"}
 | |
| 
 | |
| spaCy v3.3 improves the speed of core pipeline components, adds a new trainable
 | |
| lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.
 | |
| 
 | |
| ### Speed improvements {id="speed"}
 | |
| 
 | |
| v3.3 includes a slew of speed improvements:
 | |
| 
 | |
| - Speed up parser and NER by using constant-time head lookups.
 | |
| - Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up
 | |
|   inference for tagger, morphologizer, senter and trainable lemmatizer.
 | |
| - Speed up parser projectivization functions.
 | |
| - Replace `Ragged` with faster `AlignmentArray` in `Example` for training.
 | |
| - Improve `Matcher` speed.
 | |
| - Improve serialization speed for empty `Doc.spans`.
 | |
| 
 | |
| For longer texts, the trained pipeline speeds improve **15%** or more in
 | |
| prediction. We benchmarked `en_core_web_md` (same components as in v3.2) and
 | |
| `de_core_news_md` (with the new trainable lemmatizer) across a range of text
 | |
| sizes on Linux (Intel Xeon W-2265) and OS X (M1) to compare spaCy v3.2 vs. v3.3:
 | |
| 
 | |
| **Intel Xeon W-2265**
 | |
| 
 | |
| | Model                                            | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec |   Diff |
 | |
| | :----------------------------------------------- | -------------: | -------------: | -------------: | -----: |
 | |
| | [`en_core_web_md`](/models/en#en_core_web_md)    |            100 |          17292 |          17441 |  0.86% |
 | |
| | (=same components)                               |           1000 |          15408 |          16024 |  4.00% |
 | |
| |                                                  |          10000 |          12798 |          15346 | 19.91% |
 | |
| | [`de_core_news_md`](/models/de/#de_core_news_md) |            100 |          20221 |          19321 | -4.45% |
 | |
| | (+v3.3 trainable lemmatizer)                     |           1000 |          17480 |          17345 | -0.77% |
 | |
| |                                                  |          10000 |          14513 |          17036 | 17.38% |
 | |
| 
 | |
| **Apple M1**
 | |
| 
 | |
| | Model                                            | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec |   Diff |
 | |
| | ------------------------------------------------ | -------------: | -------------: | -------------: | -----: |
 | |
| | [`en_core_web_md`](/models/en#en_core_web_md)    |            100 |          18272 |          18408 |  0.74% |
 | |
| | (=same components)                               |           1000 |          18794 |          19248 |  2.42% |
 | |
| |                                                  |          10000 |          15144 |          17513 | 15.64% |
 | |
| | [`de_core_news_md`](/models/de/#de_core_news_md) |            100 |          19227 |          19591 |  1.89% |
 | |
| | (+v3.3 trainable lemmatizer)                     |           1000 |          20047 |          20628 |  2.90% |
 | |
| |                                                  |          10000 |          15921 |          18546 | 16.49% |
 | |
| 
 | |
| ### Trainable lemmatizer {id="trainable-lemmatizer"}
 | |
| 
 | |
| The new [trainable lemmatizer](/api/edittreelemmatizer) component uses
 | |
| [edit trees](https://explosion.ai/blog/edit-tree-lemmatizer) to transform tokens
 | |
| into lemmas. Try out the trainable lemmatizer with the
 | |
| [training quickstart](/usage/training#quickstart)!
 | |
| 
 | |
| ### displaCy support for overlapping spans and arcs {id="displacy"}
 | |
| 
 | |
| displaCy now supports overlapping spans with a new
 | |
| [`span`](/usage/visualizers#span) style and multiple arcs with different labels
 | |
| between the same tokens for [`dep`](/usage/visualizers#dep) visualizations.
 | |
| 
 | |
| Overlapping spans can be visualized for any spans key in `doc.spans`:
 | |
| 
 | |
| ```python
 | |
| import spacy
 | |
| from spacy import displacy
 | |
| from spacy.tokens import Span
 | |
| 
 | |
| nlp = spacy.blank("en")
 | |
| text = "Welcome to the Bank of China."
 | |
| doc = nlp(text)
 | |
| doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
 | |
| displacy.serve(doc, style="span", options={"spans_key": "custom"})
 | |
| ```
 | |
| 
 | |
| <Standalone height={100}>
 | |
| <div style={{ lineHeight: 2.5, direction: 'ltr', fontFamily: "-apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'", fontSize: 18 }}>Welcome to the <span style={{ fontWeight: 'bold', display: 'inline-block', position: 'relative'}}>Bank<span style={{ background: '#7aecec', top: 40, height: 4, left: -1, width: 'calc(100% + 2px)', position: 'absolute' }}></span><span style={{ background: '#7aecec', top: 40, height: 4, borderTopLeftRadius: 3, borderBottomLeftRadius: 3, left: -1, width: 'calc(100% + 2px)', position: 'absolute' }}><span style={{ background: '#7aecec', color: '#000', top: '-0.5em', padding: '2px 3px', position: 'absolute', fontSize: '0.6em', fontWeight: 'bold', lineHeight: 1, borderRadius: 3 }}>ORG</span></span></span> <span style={{ fontWeight: 'bold', display: 'inline-block', position: 'relative'}}>of <span style={{ background: '#7aecec', top: 40, height: 4, left: -1, width: 'calc(100% + 2px)', position: 'absolute' }}></span></span> <span style={{ fontWeight: 'bold', display: 'inline-block', position: 'relative'}}>China<span style={{ background: '#7aecec', top: 40, height: 4, left: -1, width: 'calc(100% + 2px)', position: 'absolute' }}></span><span style={{ background: '#feca74', top: 57, height: 4, left: -1, width: 'calc(100% + 2px)', position: 'absolute' }}></span><span style={{ background: '#feca74', top: 57, height: 4, borderTopLeftRadius: 3, borderBottomLeftRadius: 3, left: -1, width: 'calc(100% + 2px)', position: 'absolute' }}><span style={{ background: '#feca74', color: '#000', top: '-0.5em', padding: '2px 3px', position: 'absolute', fontSize: '0.6em', fontWeight: 'bold', lineHeight: 1, borderRadius: 3 }}>GPE</span></span></span>.</div>
 | |
| </Standalone>
 | |
| 
 | |
| ## Additional features and improvements
 | |
| 
 | |
| - Config comparisons with [`spacy debug diff-config`](/api/cli#debug-diff).
 | |
| - Span suggester debugging with
 | |
|   [`SpanCategorizer.set_candidates`](/api/spancategorizer#set_candidates).
 | |
| - Big endian support with
 | |
|   [`thinc-bigendian-ops`](https://github.com/andrewsi-z/thinc-bigendian-ops) and
 | |
|   updates to make `floret`, `murmurhash`, Thinc and spaCy endian neutral.
 | |
| - Initial support for Lower Sorbian and Upper Sorbian.
 | |
| - Language updates for English, French, Italian, Japanese, Korean, Norwegian,
 | |
|   Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
 | |
| - New noun chunks for Finnish.
 | |
| 
 | |
| ## Trained pipelines {id="pipelines"}
 | |
| 
 | |
| ### New trained pipelines {id="new-pipelines"}
 | |
| 
 | |
| v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use
 | |
| the new trainable lemmatizer and
 | |
| [floret vectors](https://github.com/explosion/floret). Due to the use
 | |
| [Bloom embeddings](https://explosion.ai/blog/bloom-embeddings) and subwords, the
 | |
| pipelines have compact vectors with no out-of-vocabulary words.
 | |
| 
 | |
| | Package                                         | Language | UPOS | Parser LAS | NER F |
 | |
| | ----------------------------------------------- | -------- | ---: | ---------: | ----: |
 | |
| | [`fi_core_news_sm`](/models/fi#fi_core_news_sm) | Finnish  | 92.5 |       71.9 |  75.9 |
 | |
| | [`fi_core_news_md`](/models/fi#fi_core_news_md) | Finnish  | 95.9 |       78.6 |  80.6 |
 | |
| | [`fi_core_news_lg`](/models/fi#fi_core_news_lg) | Finnish  | 96.2 |       79.4 |  82.4 |
 | |
| | [`ko_core_news_sm`](/models/ko#ko_core_news_sm) | Korean   | 86.1 |       65.6 |  71.3 |
 | |
| | [`ko_core_news_md`](/models/ko#ko_core_news_md) | Korean   | 94.7 |       80.9 |  83.1 |
 | |
| | [`ko_core_news_lg`](/models/ko#ko_core_news_lg) | Korean   | 94.7 |       81.3 |  85.3 |
 | |
| | [`sv_core_news_sm`](/models/sv#sv_core_news_sm) | Swedish  | 95.0 |       75.9 |  74.7 |
 | |
| | [`sv_core_news_md`](/models/sv#sv_core_news_md) | Swedish  | 96.3 |       78.5 |  79.3 |
 | |
| | [`sv_core_news_lg`](/models/sv#sv_core_news_lg) | Swedish  | 96.3 |       79.1 |  81.1 |
 | |
| 
 | |
| ### Pipeline updates {id="pipeline-updates"}
 | |
| 
 | |
| The following languages switch from lookup or rule-based lemmatizers to the new
 | |
| trainable lemmatizer: Danish, Dutch, German, Greek, Italian, Lithuanian,
 | |
| Norwegian, Polish, Portuguese and Romanian. The overall lemmatizer accuracy
 | |
| improves for all of these pipelines, but be aware that the types of errors may
 | |
| look quite different from the lookup-based lemmatizers. If you'd prefer to
 | |
| continue using the previous lemmatizer, you can
 | |
| [switch from the trainable lemmatizer to a non-trainable lemmatizer](/models#design-modify).
 | |
| 
 | |
| <figure>
 | |
| 
 | |
| | Model                                           | v3.2 Lemma Acc | v3.3 Lemma Acc |
 | |
| | ----------------------------------------------- | -------------: | -------------: |
 | |
| | [`da_core_news_md`](/models/da#da_core_news_md) |           84.9 |           94.8 |
 | |
| | [`de_core_news_md`](/models/de#de_core_news_md) |           73.4 |           97.7 |
 | |
| | [`el_core_news_md`](/models/el#el_core_news_md) |           56.5 |           88.9 |
 | |
| | [`fi_core_news_md`](/models/fi#fi_core_news_md) |              - |           86.2 |
 | |
| | [`it_core_news_md`](/models/it#it_core_news_md) |           86.6 |           97.2 |
 | |
| | [`ko_core_news_md`](/models/ko#ko_core_news_md) |              - |           90.0 |
 | |
| | [`lt_core_news_md`](/models/lt#lt_core_news_md) |           71.1 |           84.8 |
 | |
| | [`nb_core_news_md`](/models/nb#nb_core_news_md) |           76.7 |           97.1 |
 | |
| | [`nl_core_news_md`](/models/nl#nl_core_news_md) |           81.5 |           94.0 |
 | |
| | [`pl_core_news_md`](/models/pl#pl_core_news_md) |           87.1 |           93.7 |
 | |
| | [`pt_core_news_md`](/models/pt#pt_core_news_md) |           76.7 |           96.9 |
 | |
| | [`ro_core_news_md`](/models/ro#ro_core_news_md) |           81.8 |           95.5 |
 | |
| | [`sv_core_news_md`](/models/sv#sv_core_news_md) |              - |           95.5 |
 | |
| 
 | |
| </figure>
 | |
| 
 | |
| In addition, the vectors in the English pipelines are deduplicated to improve
 | |
| the pruned vectors in the `md` models and reduce the `lg` model size.
 | |
| 
 | |
| ## Notes about upgrading from v3.2 {id="upgrading"}
 | |
| 
 | |
| ### Span comparisons
 | |
| 
 | |
| Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span
 | |
| attributes into account (start, end, label, and KB ID) so spans may be sorted in
 | |
| a slightly different order.
 | |
| 
 | |
| ### Whitespace annotation
 | |
| 
 | |
| During training, annotation on whitespace tokens is handled in the same way as
 | |
| annotation on non-whitespace tokens in order to allow custom whitespace
 | |
| annotation.
 | |
| 
 | |
| ### Doc.from_docs
 | |
| 
 | |
| [`Doc.from_docs`](/api/doc#from_docs) now includes `Doc.tensor` by default and
 | |
| supports excludes with an `exclude` argument in the same format as
 | |
| `Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and
 | |
| `user_data`.
 | |
| 
 | |
| Docs including `Doc.tensor` may be quite a bit larger in RAM, so to exclude
 | |
| `Doc.tensor` as in v3.2:
 | |
| 
 | |
| ```diff
 | |
| -merged_doc = Doc.from_docs(docs)
 | |
| +merged_doc = Doc.from_docs(docs, exclude=["tensor"])
 | |
| ```
 | |
| 
 | |
| ### Using trained pipelines with floret vectors
 | |
| 
 | |
| If you're running a new trained pipeline for Finnish, Korean or Swedish on new
 | |
| texts and working with `Doc` objects, you shouldn't notice any difference with
 | |
| floret vectors vs. default vectors.
 | |
| 
 | |
| If you use vectors for similarity comparisons, there are a few differences,
 | |
| mainly because a floret pipeline doesn't include any kind of frequency-based
 | |
| word list similar to the list of in-vocabulary vector keys with default vectors.
 | |
| 
 | |
| - If your workflow iterates over the vector keys, you should use an external
 | |
|   word list instead:
 | |
| 
 | |
|   ```diff
 | |
|   - lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
 | |
|   + lexemes = [nlp.vocab[word] for word in external_word_list]
 | |
|   ```
 | |
| 
 | |
| - `Vectors.most_similar` is not supported because there's no fixed list of
 | |
|   vectors to compare your vectors to.
 | |
| 
 | |
| ### Pipeline package version compatibility {id="version-compat"}
 | |
| 
 | |
| > #### Using legacy implementations
 | |
| >
 | |
| > In spaCy v3, you'll still be able to load and reference legacy implementations
 | |
| > via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
 | |
| > components or architectures change and newer versions are available in the
 | |
| > core library.
 | |
| 
 | |
| When you're loading a pipeline package trained with an earlier version of spaCy
 | |
| v3, you will see a warning telling you that the pipeline may be incompatible.
 | |
| This doesn't necessarily have to be true, but we recommend running your
 | |
| pipelines against your test suite or evaluation data to make sure there are no
 | |
| unexpected results.
 | |
| 
 | |
| If you're using one of the [trained pipelines](/models) we provide, you should
 | |
| run [`spacy download`](/api/cli#download) to update to the latest version. To
 | |
| see an overview of all installed packages and their compatibility, you can run
 | |
| [`spacy validate`](/api/cli#validate).
 | |
| 
 | |
| If you've trained your own custom pipeline and you've confirmed that it's still
 | |
| working as expected, you can update the spaCy version requirements in the
 | |
| [`meta.json`](/api/data-formats#meta):
 | |
| 
 | |
| ```diff
 | |
| - "spacy_version": ">=3.2.0,<3.3.0",
 | |
| + "spacy_version": ">=3.2.0,<3.4.0",
 | |
| ```
 | |
| 
 | |
| ### Updating v3.2 configs
 | |
| 
 | |
| To update a config from spaCy v3.2 with the new v3.3 settings, run
 | |
| [`init fill-config`](/api/cli#init-fill-config):
 | |
| 
 | |
| ```bash
 | |
| $ python -m spacy init fill-config config-v3.2.cfg config-v3.3.cfg
 | |
| ```
 | |
| 
 | |
| In many cases ([`spacy train`](/api/cli#train),
 | |
| [`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
 | |
| automatically, but you'll need to fill in the new settings to run
 | |
| [`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
 | |
| 
 | |
| To see the speed improvements for the
 | |
| [`Tagger` architecture](/api/architectures#Tagger), edit your config to switch
 | |
| from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
 |