---
title: What's New in v3.2
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New Features {#features hidden="true"}

spaCy v3.2 adds support for [`floret`](https://github.com/explosion/floret)
vectors, makes custom `Doc` creation and scoring easier, and includes many bug
fixes and improvements. For the trained pipelines, there's a new transformer
pipeline for Japanese and the Universal Dependencies training data has been
updated across the board to the most recent release.

<Infobox title="Improve performance for spaCy on Apple M1 with AppleOps" variant="warning" emoji="📣">

spaCy is now up to **8 × faster on M1 Macs** by calling into Apple's
native Accelerate library for matrix multiplication. For more details, see
[`thinc-apple-ops`](https://github.com/explosion/thinc-apple-ops).

```bash
$ pip install "spacy[apple]"
```

</Infobox>

### Registered scoring functions {#registered-scoring-functions}

To customize the scoring, you can specify a scoring function for each component
in your config from the new [`scorers` registry](/api/top-level#registry):

```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
```

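As a rough illustration of how a custom entry in the `scorers` registry might look, the sketch below registers a function that returns a scorer computing plain tag accuracy. The registry name `custom_tag_accuracy.v1`, the function names and the `tag_acc` score key are hypothetical, and the scorer assumes that the predicted and reference docs share the same tokenization.

```python
from typing import Any, Callable, Dict, Iterable

import spacy
from spacy.training import Example


# "custom_tag_accuracy.v1" is a hypothetical name chosen for this sketch.
@spacy.registry.scorers("custom_tag_accuracy.v1")
def make_custom_tag_scorer() -> Callable[..., Dict[str, Any]]:
    def custom_tag_score(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        correct = 0
        total = 0
        for example in examples:
            # Assumption: predicted and reference docs have identical tokens,
            # so the tokens can be compared pairwise without alignment.
            for pred_token, gold_token in zip(example.predicted, example.reference):
                total += 1
                correct += int(pred_token.tag_ == gold_token.tag_)
        return {"tag_acc": correct / total if total else None}

    return custom_tag_score
```

With the function registered (for example in a file passed to training via `--code`), the config could then reference it as `scorer = {"@scorers":"custom_tag_accuracy.v1"}`.
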
### Overwrite settings {#overwrite}

Most pipeline components now include an `overwrite` setting in the config that
determines whether existing annotation in the `Doc` is preserved or overwritten:

```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
overwrite = false
```

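The same setting can presumably also be passed when a component is added in code. The sketch below assumes a blank English pipeline as a stand-in for a real one and mirrors the `[components.tagger]` excerpt:

```python
import spacy

# Assumption: a blank English pipeline stands in for your own pipeline.
nlp = spacy.blank("en")

# Equivalent to `overwrite = false` under [components.tagger] in config.cfg:
# existing tags on an incoming Doc are kept rather than replaced.
nlp.add_pipe("tagger", config={"overwrite": False})

# Inspect the filled-in component config to confirm the setting.
print(nlp.get_pipe_config("tagger")["overwrite"])
```
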
### Doc input for pipelines {#doc-input}

[`nlp`](/api/language#call) and [`nlp.pipe`](/api/language#pipe) accept
[`Doc`](/api/doc) input, skipping the tokenizer if a `Doc` is provided instead
of a string. This makes it easier to create a `Doc` with custom tokenization or
to set custom extensions before processing:

```python
from spacy.tokens import Doc

Doc.set_extension("text_id", default=None)  # custom extensions must be registered first
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)  # the tokenizer is skipped because the input is already a Doc
```

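For the custom tokenization case, a minimal sketch could build the `Doc` directly from words and whitespace flags and then hand it to the pipeline. The blank English pipeline and the example sentence are assumptions made for illustration:

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline stands in for a trained one.
nlp = spacy.blank("en")

# Pre-tokenized input that keeps "New York" as a single token.
words = ["I", "love", "New York", "!"]
spaces = [True, True, False, False]  # whether each token is followed by a space
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# The pipeline components run on the Doc as-is; the tokenizer is not applied.
doc = nlp(doc)
tokens = [token.text for token in doc]
```

Because the tokenizer is skipped, `tokens` keeps the custom segmentation instead of splitting "New York" into two tokens.
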
### Support for floret vectors {#vectors}

We recently published [`floret`](https://github.com/explosion/floret), an
extended version of [fastText](https://fasttext.cc) that combines fastText's
subwords with Bloom embeddings for compact, full-coverage vectors. The use of
subwords means that there are no out-of-vocabulary (OOV) words, and thanks to
Bloom embeddings, the vector table can be kept very small at <100K entries.
Bloom embeddings are already used by
[HashEmbed](https://thinc.ai/docs/api-layers#hashembed) in
[tok2vec](/api/architectures#tok2vec-arch) for compact spaCy models.

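To make the idea of combining subwords with Bloom embeddings more concrete, here is a simplified pure-Python sketch. It is not floret's implementation: the hash function (`blake2b`), the averaging scheme, the random table and all function names are assumptions chosen for illustration. Each word is split into character n-grams, every n-gram is hashed into a small table with several seeded hashes, and the looked-up rows are combined into one vector, so any string gets a vector regardless of whether it was seen in training.

```python
import hashlib
import random
from typing import List


def char_ngrams(word: str, minn: int, maxn: int) -> List[str]:
    """Return the bracketed word plus its character n-grams (fastText-style)."""
    wrapped = f"<{word}>"
    grams = [wrapped]
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start : start + n])
    return grams


def bloom_rows(key: str, n_rows: int, n_hashes: int) -> List[int]:
    """Map a key to n_hashes table rows using differently seeded hashes."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.blake2b(
            key.encode("utf8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "little") % n_rows)
    return rows


def word_vector(
    word: str, table: List[List[float]], minn: int = 4, maxn: int = 5, n_hashes: int = 2
) -> List[float]:
    """Sum the hashed rows of every n-gram, then average over the n-grams."""
    dim = len(table[0])
    total = [0.0] * dim
    grams = char_ngrams(word, minn, maxn)
    for gram in grams:
        for row in bloom_rows(gram, len(table), n_hashes):
            for i in range(dim):
                total[i] += table[row][i]
    return [value / len(grams) for value in total]


# Toy table: 1000 rows of 8 dimensions with random values instead of trained ones.
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(1000)]
vector = word_vector("epäjärjestelmällisyys", table)
```

Because several hashes are combined per n-gram, two n-grams are unlikely to collide on all of their rows, which is what allows the table to stay far smaller than the number of distinct n-grams.
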
For easy integration, floret includes a
[Python wrapper](https://github.com/explosion/floret/blob/main/python/README.md):

```bash
$ pip install floret
```

A demo project shows how to train and import floret vectors:

<Project id="pipelines/floret_vectors_demo">

Train toy English floret vectors and import them into a spaCy pipeline.

</Project>

Two additional demo projects compare standard fastText vectors with floret
vectors for full spaCy pipelines. For agglutinative languages like Finnish or
Korean, there are large improvements in performance due to the use of subwords
(no OOV words!), with a vector table containing merely 50K entries.

<Project id="pipelines/floret_fi_core_demo">

Finnish UD+NER vector and pipeline training, comparing standard fasttext vs.
floret vectors.

For the default project settings with 1M (2.6G) tokenized training texts and 50K
300-dim vectors, ~300K keys for the standard vectors:

| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |    NER F |
| -------------------------------------------- | -------: | -------: | -------: | -------: | -------: |
| none                                         |     93.3 |     92.3 |     79.7 |     72.8 |     61.0 |
| standard (pruned: 50K vectors for 300K keys) |     95.9 |     94.7 |     83.3 |     77.9 |     68.5 |
| standard (unpruned: 300K vectors/keys)       |     96.0 |     95.0 | **83.8** |     78.4 |     69.1 |
| floret (minn 4, maxn 5; 50K vectors, no OOV) | **96.6** | **95.5** |     83.5 | **78.5** | **70.9** |

</Project>

<Project id="pipelines/floret_ko_ud_demo">

Korean UD vector and pipeline training, comparing standard fasttext vs. floret
vectors.

For the default project settings with 1M (3.3G) tokenized training texts and 50K
300-dim vectors, ~800K keys for the standard vectors:

| Vectors                                      |      TAG |      POS |  DEP UAS |  DEP LAS |
| -------------------------------------------- | -------: | -------: | -------: | -------: |
| none                                         |     72.5 |     85.0 |     73.2 |     64.3 |
| standard (pruned: 50K vectors for 800K keys) |     77.9 |     89.4 |     78.8 |     72.8 |
| standard (unpruned: 800K vectors/keys)       |     79.0 |     90.2 |     79.2 |     73.9 |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | **82.5** | **93.8** | **83.0** | **80.1** |

</Project>

### Updates for spacy-transformers v1.1 {#spacy-transformers}

[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.1 has
been refactored to improve serialization and to better support inline
transformer components and replacing listeners. In addition, the transformer
model output is provided as
[`ModelOutput`](https://huggingface.co/transformers/main_classes/output.html?highlight=modeloutput#transformers.file_utils.ModelOutput)
instead of tuples in `TransformerData.model_output` and
`FullTransformerBatch.model_output`. For backwards compatibility, the tuple
format remains available under `TransformerData.tensors` and
`FullTransformerBatch.tensors`. See more details in the
[transformer API docs](/api/architectures#TransformerModel).

`spacy-transformers` v1.1 also adds support for `transformer_config` settings
such as `output_attentions`. Additional output is stored under
`TransformerData.model_output`. More details are in the
[TransformerModel docs](/api/architectures#TransformerModel). Training speed
has been improved by streamlining allocations for tokenizer output, and there is
new support for [mixed-precision training](/api/architectures#TransformerModel).

### New transformer package for Japanese {#pipeline-packages}

spaCy v3.2 adds a new transformer pipeline package for Japanese
[`ja_core_news_trf`](/models/ja#ja_core_news_trf), which uses the `basic`
pretokenizer instead of `mecab` to limit the number of dependencies required for
the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for
their contributions!

### Pipeline and language updates {#pipeline-updates}

- All Universal Dependencies training data has been updated to v2.8.
- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos
  Rodriguez, Carme Armentano and the Barcelona Supercomputing Center!
- The transformer pipelines are trained using spacy-transformers v1.1, with
  improved IO and more options for
  [model config and output](/api/architectures#TransformerModel).
- Trailing whitespace has been added as a `tok2vec` feature, improving the
  performance for many components, especially fine-grained tagging and sentence
  segmentation.
- The English attribute ruler patterns have been overhauled to improve
  `Token.pos` and `Token.morph`.

spaCy v3.2 also features a new Irish lemmatizer, support for `noun_chunks` in
Portuguese, improved `noun_chunks` for Spanish and additional updates for
Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.

## Notes about upgrading from v3.1 {#upgrading}

### Pipeline package version compatibility {#version-compat}

> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you're loading a pipeline package trained with spaCy v3.0 or v3.1, you will
see a warning telling you that the pipeline may be incompatible. This doesn't
necessarily have to be true, but we recommend running your pipelines against
your test suite or evaluation data to make sure there are no unexpected results.
If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).

If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):

```diff
- "spacy_version": ">=3.1.0,<3.2.0",
+ "spacy_version": ">=3.2.0,<3.3.0",
```

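If you have several pipeline directories to update, the edit can be scripted. The helper below is a hypothetical sketch that uses only the standard library; the function name and the example path are assumptions, and it should only be run after you've verified the pipeline against your evaluation data.

```python
import json
from pathlib import Path


def update_spacy_version(meta_path: Path, new_range: str) -> dict:
    """Rewrite the "spacy_version" entry of a meta.json file in place."""
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    meta["spacy_version"] = new_range
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta


# Example call with a hypothetical path:
# update_spacy_version(Path("my_pipeline/meta.json"), ">=3.2.0,<3.3.0")
```
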
### Updating v3.1 configs

To update a config from spaCy v3.1 with the new v3.2 settings, run
[`init fill-config`](/api/cli#init-fill-config):

```cli
$ python -m spacy init fill-config config-v3.1.cfg config-v3.2.cfg
```

In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).

## Notes about upgrading from spacy-transformers v1.0 {#upgrading-transformers}

When you're loading a transformer pipeline package trained with
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.0
after upgrading to `spacy-transformers` v1.1, you'll see a warning telling you
that the pipeline may be incompatible. `spacy-transformers` v1.1 should be able
to import v1.0 `transformer` components into the new internal format with no
change in performance, but here we'd also recommend running your test suite to
verify that the pipeline still performs as expected.

If you save your pipeline with [`nlp.to_disk`](/api/language#to_disk), it will
be saved in the new v1.1 format and should be fully compatible with
`spacy-transformers` v1.1. Once you've confirmed the performance, you can update
the requirements in [`meta.json`](/api/data-formats#meta):

```diff
  "requirements": [
-    "spacy-transformers>=1.0.3,<1.1.0"
+    "spacy-transformers>=1.1.2,<1.2.0"
  ]
```

If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).