---
title: What's New in v3.2
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New Features {#features hidden="true"}

spaCy v3.2 adds support for [`floret`](https://github.com/explosion/floret)
vectors, makes custom `Doc` creation and scoring easier, and includes many bug
fixes and improvements. For the trained pipelines, there's a new transformer
pipeline for Japanese and the Universal Dependencies training data has been
updated across the board to the most recent release.

<Infobox title="Improve performance for spaCy on Apple M1 with AppleOps" variant="warning" emoji="📣">

spaCy is now up to **8 × faster on M1 Macs** by calling into Apple's native
Accelerate library for matrix multiplication. For more details, see
[`thinc-apple-ops`](https://github.com/explosion/thinc-apple-ops).

```bash
$ pip install spacy[apple]
```

</Infobox>

### Registered scoring functions {#registered-scoring-functions}

To customize the scoring, you can specify a scoring function for each component
in your config from the new [`scorers` registry](/api/top-level#registry):

```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
```

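Entries in the `scorers` registry are functions that return the scoring
callback. As a rough sketch of what a custom entry can look like, where the
name `custom_tagger_scorer.v1` and the use of
[`Scorer.score_token_attr`](/api/scorer#score_token_attr) are just
illustrative:

```python
from typing import Any, Dict, Iterable

import spacy
from spacy.scorer import Scorer
from spacy.training import Example

@spacy.registry.scorers("custom_tagger_scorer.v1")
def make_custom_tagger_scorer():
    def score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        # Score the fine-grained tags on Token.tag, like the built-in scorer
        return Scorer.score_token_attr(examples, "tag", **kwargs)

    return score
```

The component can then reference it in its config via
`scorer = {"@scorers":"custom_tagger_scorer.v1"}`.
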
### Overwrite settings {#overwrite}

Most pipeline components now include an `overwrite` setting in the config that
determines whether existing annotation in the `Doc` is preserved or overwritten:

```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
overwrite = false
```

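The same setting is available when you assemble a pipeline in code, via the
`config` argument of [`nlp.add_pipe`](/api/language#add_pipe). A minimal
sketch:

```python
import spacy

nlp = spacy.blank("en")
# Equivalent to "overwrite = false" in the config excerpt above:
# existing annotation on the Doc is preserved when the tagger runs
tagger = nlp.add_pipe("tagger", config={"overwrite": False})
```
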
### Doc input for pipelines {#doc-input}

[`nlp`](/api/language#call) and [`nlp.pipe`](/api/language#pipe) accept
[`Doc`](/api/doc) input, skipping the tokenizer if a `Doc` is provided instead
of a string. This makes it easier to create a `Doc` with custom tokenization or
to set custom extensions before processing:

```python
from spacy.tokens import Doc

Doc.set_extension("text_id", default=None)  # register the extension first
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)
```

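`nlp.pipe` behaves the same way for an iterable of `Doc` objects. A short
sketch, where `texts` stands in for your own data:

```python
# Tokenization is skipped for inputs that are already Doc objects
docs = [nlp.make_doc(text) for text in texts]
for doc in nlp.pipe(docs):
    ...
```
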
### Support for floret vectors {#vectors}

We recently published [`floret`](https://github.com/explosion/floret), an
extended version of [fastText](https://fasttext.cc) that combines fastText's
subwords with Bloom embeddings for compact, full-coverage vectors. The use of
subwords means that there are no OOV words, and thanks to Bloom embeddings,
the vector table can be kept very small, at under 100K entries. Bloom
embeddings are already used by
[HashEmbed](https://thinc.ai/docs/api-layers#hashembed) in
[tok2vec](/api/architectures#tok2vec-arch) for compact spaCy models.

For easy integration, floret includes a
[Python wrapper](https://github.com/explosion/floret/blob/main/python/README.md):

```bash
$ pip install floret
```

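Trained floret vectors can be imported into a spaCy pipeline with
[`init vectors`](/api/cli#init-vectors) and its new `--mode floret` option.
The paths and names here are placeholders:

```cli
$ python -m spacy init vectors en vectors.floret ./en_vectors_floret --mode floret
```
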
A demo project shows how to train and import floret vectors:

<Project id="pipelines/floret_vectors_demo">

Train toy English floret vectors and import them into a spaCy pipeline.

</Project>

Two additional demo projects compare standard fastText vectors with floret
vectors for full spaCy pipelines. For agglutinative languages like Finnish or
Korean, there are large improvements in performance due to the use of subwords
(no OOV words!), with a vector table containing merely 50K entries.

<Project id="pipelines/floret_fi_core_demo">

Finnish UD+NER vector and pipeline training, comparing standard fastText vs.
floret vectors.

For the default project settings with 1M (2.6G) tokenized training texts and
50K 300-dim vectors, ~300K keys for the standard vectors:

| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |    NER F |
| -------------------------------------------- | -------: | -------: | -------: | -------: | -------: |
| none                                         |     93.3 |     92.3 |     79.7 |     72.8 |     61.0 |
| standard (pruned: 50K vectors for 300K keys) |     95.9 |     94.7 |     83.3 |     77.9 |     68.5 |
| standard (unpruned: 300K vectors/keys)       |     96.0 |     95.0 | **83.8** |     78.4 |     69.1 |
| floret (minn 4, maxn 5; 50K vectors, no OOV) | **96.6** | **95.5** |     83.5 | **78.5** | **70.9** |

</Project>

<Project id="pipelines/floret_ko_ud_demo">

Korean UD vector and pipeline training, comparing standard fastText vs. floret
vectors.

For the default project settings with 1M (3.3G) tokenized training texts and
50K 300-dim vectors, ~800K keys for the standard vectors:

| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |
| -------------------------------------------- | -------: | -------: | -------: | -------: |
| none                                         |     72.5 |     85.0 |     73.2 |     64.3 |
| standard (pruned: 50K vectors for 800K keys) |     77.9 |     89.4 |     78.8 |     72.8 |
| standard (unpruned: 800K vectors/keys)       |     79.0 |     90.2 |     79.2 |     73.9 |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | **82.5** | **93.8** | **83.0** | **80.1** |

</Project>

### Updates for spacy-transformers v1.1 {#spacy-transformers}

[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.1
has been refactored to improve serialization and support of inline transformer
components and replacing listeners. In addition, the transformer model output
is provided as
[`ModelOutput`](https://huggingface.co/transformers/main_classes/output.html?highlight=modeloutput#transformers.file_utils.ModelOutput)
instead of tuples in `TransformerData.model_output` and
`FullTransformerBatch.model_output`. For backwards compatibility, the tuple
format remains available under `TransformerData.tensors` and
`FullTransformerBatch.tensors`. See more details in the
[transformer API docs](/api/architectures#TransformerModel).

`spacy-transformers` v1.1 also adds support for `transformer_config` settings
such as `output_attentions`. Additional output is stored under
`TransformerData.model_output`. More details are in the
[TransformerModel docs](/api/architectures#TransformerModel). The training
speed has been improved by streamlining allocations for tokenizer output, and
there is new support for
[mixed-precision training](/api/architectures#TransformerModel).

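As a hedged config sketch of how such a setting can be passed through to the
transformer (the architecture version and model name are illustrative; check
the [TransformerModel docs](/api/architectures#TransformerModel) for the exact
settings):

```ini
### config.cfg (excerpt)
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"

[components.transformer.model.transformer_config]
output_attentions = true
```
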
### New transformer package for Japanese {#pipeline-packages}

spaCy v3.2 adds a new transformer pipeline package for Japanese
[`ja_core_news_trf`](/models/ja#ja_core_news_trf), which uses the `basic`
pretokenizer instead of `mecab` to limit the number of dependencies required
for the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community
for their contributions!

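Like the other trained pipelines, it can be installed with
[`spacy download`](/api/cli#download):

```cli
$ python -m spacy download ja_core_news_trf
```
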
### Pipeline and language updates {#pipeline-updates}

- All Universal Dependencies training data has been updated to v2.8.
- The Catalan data, tokenizer and lemmatizer have been updated, thanks to
  Carlos Rodriguez, Carme Armentano and the Barcelona Supercomputing Center!
- The transformer pipelines are trained using spacy-transformers v1.1, with
  improved IO and more options for
  [model config and output](/api/architectures#TransformerModel).
- Trailing whitespace has been added as a `tok2vec` feature, improving the
  performance for many components, especially fine-grained tagging and
  sentence segmentation.
- The English attribute ruler patterns have been overhauled to improve
  `Token.pos` and `Token.morph`.

spaCy v3.2 also features a new Irish lemmatizer, support for `noun_chunks` in
Portuguese, improved `noun_chunks` for Spanish and additional updates for
Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.

## Notes about upgrading from v3.1 {#upgrading}

### Pipeline package version compatibility {#version-compat}

> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy
> implementations via
> [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you're loading a pipeline package trained with spaCy v3.0 or v3.1, you
will see a warning telling you that the pipeline may be incompatible. This
doesn't necessarily have to be true, but we recommend running your pipelines
against your test suite or evaluation data to make sure there are no
unexpected results. If you're using one of the [trained pipelines](/models) we
provide, you should run [`spacy download`](/api/cli#download) to update to the
latest version. To see an overview of all installed packages and their
compatibility, you can run [`spacy validate`](/api/cli#validate).

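For example, to update a pipeline package and check the compatibility of
everything you have installed (the package name is just an example):

```cli
$ python -m spacy download en_core_web_sm
$ python -m spacy validate
```
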
If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):

```diff
- "spacy_version": ">=3.1.0,<3.2.0",
+ "spacy_version": ">=3.2.0,<3.3.0",
```

### Updating v3.1 configs

To update a config from spaCy v3.1 with the new v3.2 settings, run
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config-v3.1.cfg config-v3.2.cfg
```

In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).

## Notes about upgrading from spacy-transformers v1.0 {#upgrading-transformers}

When you're loading a transformer pipeline package trained with
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.0
after upgrading to `spacy-transformers` v1.1, you'll see a warning telling you
that the pipeline may be incompatible. `spacy-transformers` v1.1 should be able
to import v1.0 `transformer` components into the new internal format with no
change in performance, but here we'd also recommend running your test suite to
verify that the pipeline still performs as expected.

If you save your pipeline with [`nlp.to_disk`](/api/language#to_disk), it will
be saved in the new v1.1 format and should be fully compatible with
`spacy-transformers` v1.1. Once you've confirmed the performance, you can
update the requirements in [`meta.json`](/api/data-formats#meta):

```diff
  "requirements": [
-    "spacy-transformers>=1.0.3,<1.1.0"
+    "spacy-transformers>=1.1.2,<1.2.0"
  ]
```

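Resaving the pipeline in the new format is a regular serialization round-trip.
A minimal sketch, with a placeholder output path:

```python
import spacy

# Loading with spacy-transformers v1.1 imports the v1.0 transformer component
nlp = spacy.load("en_core_web_trf")
# Saving it back out writes the new v1.1 format
nlp.to_disk("./en_core_web_trf-v1.1")
```
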
If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).