---
title: What's New in v3.2
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New Features {id="features",hidden="true"}

spaCy v3.2 adds support for [`floret`](https://github.com/explosion/floret) vectors, makes custom `Doc` creation and scoring easier, and includes many bug fixes and improvements. For the trained pipelines, there's a new transformer pipeline for Japanese and the Universal Dependencies training data has been updated across the board to the most recent release.

spaCy is now up to **8 × faster on M1 Macs** by calling into Apple's native Accelerate library for matrix multiplication. For more details, see [`thinc-apple-ops`](https://github.com/explosion/thinc-apple-ops).

```bash
$ pip install spacy[apple]
```

### Registered scoring functions {id="registered-scoring-functions"}

To customize the scoring, you can specify a scoring function for each component in your config from the new [`scorers` registry](/api/top-level#registry):

```ini {title="config.cfg (excerpt)",highlight="3"}
[components.tagger]
factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
```

### Overwrite settings {id="overwrite"}

Most pipeline components now include an `overwrite` setting in the config that determines whether existing annotation in the `Doc` is preserved or overwritten:

```ini {title="config.cfg (excerpt)",highlight="3"}
[components.tagger]
factory = "tagger"
overwrite = false
```

### Doc input for pipelines {id="doc-input"}

[`nlp`](/api/language#call) and [`nlp.pipe`](/api/language#pipe) accept [`Doc`](/api/doc) input, skipping the tokenizer if a `Doc` is provided instead of a string.
This makes it easier to create a `Doc` with custom tokenization or to set custom extensions before processing:

```python
# Assumes `nlp` is a loaded pipeline and the custom extension is registered,
# e.g. with Doc.set_extension("text_id", default=None)
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)
```

### Support for floret vectors {id="vectors"}

We recently published [`floret`](https://github.com/explosion/floret), an extended version of [fastText](https://fasttext.cc) that combines fastText's subwords with Bloom embeddings for compact, full-coverage vectors. Because floret uses subwords, there are no out-of-vocabulary (OOV) words, and thanks to Bloom embeddings, the vector table can be kept very small at \<100K entries. Bloom embeddings are already used by [HashEmbed](https://thinc.ai/docs/api-layers#hashembed) in [tok2vec](/api/architectures#tok2vec-arch) for compact spaCy models.

For easy integration, floret includes a [Python wrapper](https://github.com/explosion/floret/blob/main/python/README.md):

```bash
$ pip install floret
```

A demo project shows how to train toy English floret vectors and import them into a spaCy pipeline.

Two additional demo projects compare standard fastText vectors with floret vectors for full spaCy pipelines. For agglutinative languages like Finnish or Korean, there are large improvements in performance due to the use of subwords (no OOV words!), with a vector table containing merely 50K entries.

Finnish UD+NER vector and pipeline training, comparing standard fastText vs. floret vectors.
For the default project settings with 1M (2.6G) tokenized training texts and 50K 300-dim vectors, ~300K keys for the standard vectors:

| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |    NER F |
| -------------------------------------------- | -------: | -------: | -------: | -------: | -------: |
| none                                         |     93.3 |     92.3 |     79.7 |     72.8 |     61.0 |
| standard (pruned: 50K vectors for 300K keys) |     95.9 |     94.7 |     83.3 |     77.9 |     68.5 |
| standard (unpruned: 300K vectors/keys)       |     96.0 |     95.0 | **83.8** |     78.4 |     69.1 |
| floret (minn 4, maxn 5; 50K vectors, no OOV) | **96.6** | **95.5** |     83.5 | **78.5** | **70.9** |

Korean UD vector and pipeline training, comparing standard fasttext vs. floret vectors.

For the default project settings with 1M (3.3G) tokenized training texts and 50K 300-dim vectors, ~800K keys for the standard vectors:

| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |
| -------------------------------------------- | -------: | -------: | -------: | -------: |
| none                                         |     72.5 |     85.0 |     73.2 |     64.3 |
| standard (pruned: 50K vectors for 800K keys) |     77.9 |     89.4 |     78.8 |     72.8 |
| standard (unpruned: 800K vectors/keys)       |     79.0 |     90.2 |     79.2 |     73.9 |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | **82.5** | **93.8** | **83.0** | **80.1** |

### Updates for spacy-transformers v1.1 {id="spacy-transformers"}

[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.1 has been refactored to improve serialization and support of inline transformer components and replacing listeners. In addition, the transformer model output is provided as [`ModelOutput`](https://huggingface.co/transformers/main_classes/output.html?highlight=modeloutput#transformers.file_utils.ModelOutput) instead of tuples in `TransformerData.model_output` and `FullTransformerBatch.model_output`. For backwards compatibility, the tuple format remains available under `TransformerData.tensors` and `FullTransformerBatch.tensors`. See more details in the [transformer API docs](/api/architectures#TransformerModel).
`spacy-transformers` v1.1 also adds support for `transformer_config` settings such as `output_attentions`. Additional output is stored under `TransformerData.model_output`. More details are in the [TransformerModel docs](/api/architectures#TransformerModel). The training speed has been improved by streamlining allocations for tokenizer output and there is new support for [mixed-precision training](/api/architectures#TransformerModel).

### New transformer package for Japanese {id="pipeline-packages"}

spaCy v3.2 adds a new transformer pipeline package for Japanese [`ja_core_news_trf`](/models/ja#ja_core_news_trf), which uses the `basic` pretokenizer instead of `mecab` to limit the number of dependencies required for the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for their contributions!

### Pipeline and language updates {id="pipeline-updates"}

- All Universal Dependencies training data has been updated to v2.8.
- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos Rodriguez, Carme Armentano and the Barcelona Supercomputing Center!
- The transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for [model config and output](/api/architectures#TransformerModel).
- Trailing whitespace has been added as a `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
- The English attribute ruler patterns have been overhauled to improve `Token.pos` and `Token.morph`.

spaCy v3.2 also features a new Irish lemmatizer, support for `noun_chunks` in Portuguese, improved `noun_chunks` for Spanish and additional updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
## Notes about upgrading from v3.1 {id="upgrading"}

### Pipeline package version compatibility {id="version-compat"}

> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you're loading a pipeline package trained with spaCy v3.0 or v3.1, you will see a warning telling you that the pipeline may be incompatible. This doesn't necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results. If you're using one of the [trained pipelines](/models) we provide, you should run [`spacy download`](/api/cli#download) to update to the latest version. To see an overview of all installed packages and their compatibility, you can run [`spacy validate`](/api/cli#validate).

If you've trained your own custom pipeline and you've confirmed that it's still working as expected, you can update the spaCy version requirements in the [`meta.json`](/api/data-formats#meta):

```diff
- "spacy_version": ">=3.1.0,<3.2.0",
+ "spacy_version": ">=3.2.0,<3.3.0",
```

### Updating v3.1 configs

To update a config from spaCy v3.1 with the new v3.2 settings, run [`init fill-config`](/api/cli#init-fill-config):

```bash
$ python -m spacy init fill-config config-v3.1.cfg config-v3.2.cfg
```

In many cases ([`spacy train`](/api/cli#train), [`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in automatically, but you'll need to fill in the new settings to run [`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
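
To check which settings `init fill-config` added, one option is to compare the setting names in the old and new config files. The following is a minimal sketch that uses only Python's standard `configparser` rather than spaCy's own config loader. The helper `setting_names` and the inline config strings are hypothetical; the strings stand in for the contents of `config-v3.1.cfg` and `config-v3.2.cfg`.

```python
import configparser


def setting_names(config_text):
    """Return the set of (section, key) pairs found in a config string."""
    # Interpolation is disabled so that values such as ${paths.train} are
    # treated as plain text; only "=" is accepted as the key/value delimiter.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(config_text)
    return {
        (section, key)
        for section in parser.sections()
        for key in parser.options(section)
    }


# Hypothetical excerpts standing in for config-v3.1.cfg and config-v3.2.cfg
old_config = """
[components.tagger]
factory = "tagger"
"""

new_config = """
[components.tagger]
factory = "tagger"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
"""

# Settings present in the new config but missing from the old one
added = sorted(setting_names(new_config) - setting_names(old_config))
for section, key in added:
    print(f"[{section}] {key}")
```

This only compares setting names, not values, so it will not flag defaults that changed between versions.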
## Notes about upgrading from spacy-transformers v1.0 {id="upgrading-transformers"}

When you're loading a transformer pipeline package trained with [`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.0 after upgrading to `spacy-transformers` v1.1, you'll see a warning telling you that the pipeline may be incompatible. `spacy-transformers` v1.1 should be able to import v1.0 `transformer` components into the new internal format with no change in performance, but here we'd also recommend running your test suite to verify that the pipeline still performs as expected.

If you save your pipeline with [`nlp.to_disk`](/api/language#to_disk), it will be saved in the new v1.1 format and should be fully compatible with `spacy-transformers` v1.1. Once you've confirmed the performance, you can update the requirements in [`meta.json`](/api/data-formats#meta):

```diff
  "requirements": [
-    "spacy-transformers>=1.0.3,<1.1.0"
+    "spacy-transformers>=1.1.2,<1.2.0"
  ]
```

If you're using one of the [trained pipelines](/models) we provide, you should run [`spacy download`](/api/cli#download) to update to the latest version. To see an overview of all installed packages and their compatibility, you can run [`spacy validate`](/api/cli#validate).
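
If you maintain several custom pipelines, the `meta.json` requirements change shown in the diff above can also be scripted. The following is a minimal sketch using only the Python standard library. The helper `update_requirement` and the example `meta` content (including the name `custom_trf`) are hypothetical and not part of spaCy.

```python
import json
import re


def update_requirement(meta, package, new_spec):
    """Return a copy of `meta` with the version pin for `package` replaced."""
    updated = dict(meta)
    requirements = []
    for requirement in meta.get("requirements", []):
        # The package name is everything before the first version operator.
        name = re.split(r"[<>=!~\s]", requirement, maxsplit=1)[0]
        if name == package:
            requirements.append(f"{package}{new_spec}")
        else:
            requirements.append(requirement)
    updated["requirements"] = requirements
    return updated


# Hypothetical meta.json content for a custom transformer pipeline
meta = json.loads("""
{
  "name": "custom_trf",
  "requirements": ["spacy-transformers>=1.0.3,<1.1.0"]
}
""")

updated = update_requirement(meta, "spacy-transformers", ">=1.1.2,<1.2.0")
print(json.dumps(updated, indent=2))
```

To apply this to a real pipeline, you would read the package's `meta.json` with `json.load`, pass the result through the helper, and write it back with `json.dump`, ideally only after your test suite has confirmed that the pipeline still performs as expected.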