spaCy/website/docs/usage/v3-2.mdx
2022-11-29 02:33:27 +01:00


---
title: What's New in v3.2
teaser: New features and how to upgrade
menu:
- ['New Features', 'features']
- ['Upgrading Notes', 'upgrading']
---
## New Features {#features hidden="true"}
spaCy v3.2 adds support for [`floret`](https://github.com/explosion/floret)
vectors, makes custom `Doc` creation and scoring easier, and includes many bug
fixes and improvements. For the trained pipelines, there's a new transformer
pipeline for Japanese and the Universal Dependencies training data has been
updated across the board to the most recent release.

<Infobox title="Improve performance for spaCy on Apple M1 with AppleOps" variant="warning" emoji="📣">

spaCy is now up to **8 &times; faster on M1 Macs** by calling into Apple's
native Accelerate library for matrix multiplication. For more details, see
[`thinc-apple-ops`](https://github.com/explosion/thinc-apple-ops).

```bash
$ pip install 'spacy[apple]'
```

</Infobox>
### Registered scoring functions {#registered-scoring-functions}
To customize the scoring, you can specify a scoring function for each component
in your config from the new [`scorers` registry](/api/top-level#registry):
```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
```
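
To make the shape of the `scorer` setting more concrete, the sketch below parses the excerpt above with Python's standard library only. It does not go through spaCy's own config system, which resolves the reference via the registry; it merely shows that the value is a JSON-style dictionary whose `@scorers` key names a registered function. The variable names are illustrative assumptions.

```python
import configparser
import json

# The same excerpt as above, inlined so the example is self-contained.
CONFIG_EXCERPT = """
[components.tagger]
factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_EXCERPT)

# Each value in the section is written as JSON, so it can be decoded as such.
section = parser["components.tagger"]
values = {key: json.loads(value) for key, value in section.items()}

print(values["factory"])             # tagger
print(values["scorer"]["@scorers"])  # spacy.tagger_scorer.v1
```

Swapping in a different scorer means pointing `@scorers` at another name in the `scorers` registry.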
### Overwrite settings {#overwrite}
Most pipeline components now include an `overwrite` setting in the config that
determines whether existing annotation in the `Doc` is preserved or overwritten:
```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
overwrite = false
```
### Doc input for pipelines {#doc-input}
[`nlp`](/api/language#call) and [`nlp.pipe`](/api/language#pipe) accept
[`Doc`](/api/doc) input, skipping the tokenizer if a `Doc` is provided instead
of a string. This makes it easier to create a `Doc` with custom tokenization or
to set custom extensions before processing:
```python
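from spacy.tokens import Doc

Doc.set_extension("text_id", default=None)  # register the attribute first
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)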
```
### Support for floret vectors {#vectors}
We recently published [`floret`](https://github.com/explosion/floret), an
extended version of [fastText](https://fasttext.cc) that combines fastText's
subwords with Bloom embeddings for compact, full-coverage vectors. Because
vectors are built from subwords, there are no out-of-vocabulary (OOV) words, and
thanks to Bloom embeddings the vector table can be kept very small, at fewer
than 100K entries. Bloom embeddings are
already used by [HashEmbed](https://thinc.ai/docs/api-layers#hashembed) in
[tok2vec](/api/architectures#tok2vec-arch) for compact spaCy models.
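
The combination of subwords and Bloom embeddings can be sketched roughly as follows. This is an illustration of the idea under stated assumptions, not floret's actual implementation: floret uses its own hash function and settings, whereas this example uses `hashlib.md5` purely for convenience, and the helper names `char_ngrams`, `bloom_rows` and `rows_for_word` as well as the `num_hashes` parameter are hypothetical.

```python
import hashlib


def char_ngrams(word, minn, maxn):
    """Character n-grams of the word wrapped in fastText-style < > markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i : i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def bloom_rows(key, num_rows, num_hashes=2):
    """Map a string to `num_hashes` row indices in a table with `num_rows` rows."""
    rows = []
    for seed in range(num_hashes):
        # Illustrative hash only: a different seed gives a different row.
        digest = hashlib.md5(f"{seed}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % num_rows)
    return rows


def rows_for_word(word, num_rows=50_000, minn=4, maxn=5):
    """Rows for the whole word and each of its subwords in one shared table."""
    keys = [f"<{word}>"] + char_ngrams(word, minn, maxn)
    return {key: bloom_rows(key, num_rows) for key in keys}


# Any string maps to rows in the fixed-size table, so no word is ever OOV.
for key, rows in rows_for_word("kissa").items():
    print(key, rows)
```

A word's vector would then be combined from the table rows selected for the word and its subwords. Because every key is hashed into the same fixed number of rows, the table size is independent of the vocabulary size.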
For easy integration, floret includes a
[Python wrapper](https://github.com/explosion/floret/blob/main/python/README.md):
```bash
$ pip install floret
```
A demo project shows how to train and import floret vectors:

<Project id="pipelines/floret_vectors_demo">

Train toy English floret vectors and import them into a spaCy pipeline.

</Project>

Two additional demo projects compare standard fastText vectors with floret
vectors for full spaCy pipelines. For agglutinative languages like Finnish or
Korean, there are large improvements in performance due to the use of subwords
(no OOV words!), with a vector table containing merely 50K entries.

<Project id="pipelines/floret_fi_core_demo">

Finnish UD+NER vector and pipeline training, comparing standard fastText vs.
floret vectors.

For the default project settings with 1M (2.6G) tokenized training texts and 50K
300-dim vectors, ~300K keys for the standard vectors:

| Vectors | TAG | POS | DEP UAS | DEP LAS | NER F |
| -------------------------------------------- | -------: | -------: | -------: | -------: | -------: |
| none | 93.3 | 92.3 | 79.7 | 72.8 | 61.0 |
| standard (pruned: 50K vectors for 300K keys) | 95.9 | 94.7 | 83.3 | 77.9 | 68.5 |
| standard (unpruned: 300K vectors/keys) | 96.0 | 95.0 | **83.8** | 78.4 | 69.1 |
| floret (minn 4, maxn 5; 50K vectors, no OOV) | **96.6** | **95.5** | 83.5 | **78.5** | **70.9** |

</Project>

<Project id="pipelines/floret_ko_ud_demo">

Korean UD vector and pipeline training, comparing standard fastText vs. floret
vectors.

For the default project settings with 1M (3.3G) tokenized training texts and 50K
300-dim vectors, ~800K keys for the standard vectors:

| Vectors | TAG | POS | DEP UAS | DEP LAS |
| -------------------------------------------- | -------: | -------: | -------: | -------: |
| none | 72.5 | 85.0 | 73.2 | 64.3 |
| standard (pruned: 50K vectors for 800K keys) | 77.9 | 89.4 | 78.8 | 72.8 |
| standard (unpruned: 800K vectors/keys) | 79.0 | 90.2 | 79.2 | 73.9 |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | **82.5** | **93.8** | **83.0** | **80.1** |

</Project>
### Updates for spacy-transformers v1.1 {#spacy-transformers}
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.1 has
been refactored to improve serialization and support for inline transformer
components and for replacing listeners. In addition, the transformer model
output is provided as
[`ModelOutput`](https://huggingface.co/transformers/main_classes/output.html?highlight=modeloutput#transformers.file_utils.ModelOutput)
instead of tuples in
`TransformerData.model_output` and `FullTransformerBatch.model_output`. For
backwards compatibility, the tuple format remains available under
`TransformerData.tensors` and `FullTransformerBatch.tensors`. See more details
in the [transformer API docs](/api/architectures#TransformerModel).

`spacy-transformers` v1.1 also adds support for `transformer_config` settings
such as `output_attentions`. Additional output is stored under
`TransformerData.model_output`. More details are in the
[TransformerModel docs](/api/architectures#TransformerModel). The training speed
has been improved by streamlining allocations for tokenizer output and there is
new support for [mixed-precision training](/api/architectures#TransformerModel).
### New transformer package for Japanese {#pipeline-packages}
spaCy v3.2 adds a new transformer pipeline package for Japanese
[`ja_core_news_trf`](/models/ja#ja_core_news_trf), which uses the `basic`
pretokenizer instead of `mecab` to limit the number of dependencies required for
the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for
their contributions!
### Pipeline and language updates {#pipeline-updates}
- All Universal Dependencies training data has been updated to v2.8.
- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos
  Rodriguez, Carme Armentano and the Barcelona Supercomputing Center!
- The transformer pipelines are trained using spacy-transformers v1.1, with
  improved IO and more options for
  [model config and output](/api/architectures#TransformerModel).
- Trailing whitespace has been added as a `tok2vec` feature, improving the
  performance for many components, especially fine-grained tagging and sentence
  segmentation.
- The English attribute ruler patterns have been overhauled to improve
  `Token.pos` and `Token.morph`.

spaCy v3.2 also features a new Irish lemmatizer, support for `noun_chunks` in
Portuguese, improved `noun_chunks` for Spanish and additional updates for
Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
## Notes about upgrading from v3.1 {#upgrading}
### Pipeline package version compatibility {#version-compat}
> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations
> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
> components or architectures change and newer versions are available in the
> core library.

When you load a pipeline package trained with spaCy v3.0 or v3.1, you will see
a warning telling you that the pipeline may be incompatible. This doesn't
necessarily mean that it is, but we recommend running your pipelines against
your test suite or evaluation data to make sure there are no unexpected results.

If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).

If you've trained your own custom pipeline and you've confirmed that it's still
working as expected, you can update the spaCy version requirements in the
[`meta.json`](/api/data-formats#meta):
```diff
- "spacy_version": ">=3.1.0,<3.2.0",
+ "spacy_version": ">=3.2.0,<3.3.0",
```
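
The relationship between the version range and the compatibility warning can be illustrated with a simplified check. This is only a sketch under stated assumptions: it handles plain numeric versions such as `3.2.0` and the basic comparison operators, whereas spaCy's own check is more thorough. The helper names `parse_version` and `satisfies` are hypothetical.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse_version(text):
    """Turn '3.2.0' into (3, 2, 0); assumes plain numeric release versions."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Check a version against a comma-separated range like '>=3.1.0,<3.2.0'."""
    current = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in OPERATORS:
            if clause.startswith(symbol):
                if not compare(current, parse_version(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"Unsupported clause: {clause!r}")
    return True


print(satisfies("3.2.0", ">=3.1.0,<3.2.0"))  # False: outside the old range
print(satisfies("3.2.0", ">=3.2.0,<3.3.0"))  # True: inside the updated range
```

With the old range, an installed v3.2.0 falls outside the declared requirement, which is the situation the warning reports.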
### Updating v3.1 configs
To update a config from spaCy v3.1 with the new v3.2 settings, run
[`init fill-config`](/api/cli#init-fill-config):
```cli
$ python -m spacy init fill-config config-v3.1.cfg config-v3.2.cfg
```
In many cases ([`spacy train`](/api/cli#train),
[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
automatically, but you'll need to fill in the new settings to run
[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
## Notes about upgrading from spacy-transformers v1.0 {#upgrading-transformers}
When you're loading a transformer pipeline package trained with
[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.0
after upgrading to `spacy-transformers` v1.1, you'll see a warning telling you
that the pipeline may be incompatible. `spacy-transformers` v1.1 should be able
to import v1.0 `transformer` components into the new internal format with no
change in performance, but here we'd also recommend running your test suite to
verify that the pipeline still performs as expected.

If you save your pipeline with [`nlp.to_disk`](/api/language#to_disk), it will
be saved in the new v1.1 format and should be fully compatible with
`spacy-transformers` v1.1. Once you've confirmed the performance, you can update
the requirements in [`meta.json`](/api/data-formats#meta):
```diff
"requirements": [
- "spacy-transformers>=1.0.3,<1.1.0"
+ "spacy-transformers>=1.1.2,<1.2.0"
]
```
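
If you'd rather script this change than edit the file by hand, a minimal sketch using only the standard library could look like the following. The function name `update_requirement` and the path in the usage comment are hypothetical, and the sketch assumes that `meta.json` contains a `requirements` list of strings like the ones shown above.

```python
import json
import re
from pathlib import Path


def update_requirement(meta_path, package, new_spec):
    """Replace the version range of `package` in the meta.json requirements."""
    path = Path(meta_path)
    meta = json.loads(path.read_text(encoding="utf8"))
    updated = []
    for requirement in meta.get("requirements", []):
        # The package name is everything before the first version operator.
        name = re.split(r"[<>=!~]", requirement, maxsplit=1)[0].strip()
        if name == package:
            updated.append(f"{package}{new_spec}")
        else:
            updated.append(requirement)
    meta["requirements"] = updated
    path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta


# Example usage (hypothetical path):
# update_requirement("my_pipeline/meta.json", "spacy-transformers", ">=1.1.2,<1.2.0")
```

Other entries in the list are left untouched, so the function can be run safely on pipelines with additional requirements.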
If you're using one of the [trained pipelines](/models) we provide, you should
run [`spacy download`](/api/cli#download) to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
[`spacy validate`](/api/cli#validate).