
---
title: What's New in v3.2
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New Features {#features}

spaCy v3.2 adds support for floret vectors, makes custom Doc creation and scoring easier, and includes many bug fixes and improvements. For the trained pipelines, there's a new transformer pipeline for Japanese and the Universal Dependencies training data has been updated across the board to the most recent release.

spaCy is now up to 8× faster on M1 Macs by calling into Apple's native Accelerate library for matrix multiplication. For more details, see thinc-apple-ops.

```bash
$ pip install spacy[apple]
```

### Registered scoring functions

To customize the scoring, you can specify a scoring function for each component in your config via the new `scorers` registry:

```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
```
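Scoring functions are plain registered functions, so you can also plug in your own. A minimal sketch, assuming a custom function registered under the hypothetical name `my_tagger_scorer.v1` and modeled on the built-in `spacy.tagger_scorer.v1`:

```python
import spacy
from spacy.scorer import Scorer

@spacy.registry.scorers("my_tagger_scorer.v1")
def make_tagger_scorer():
    def tagger_score(examples, **kwargs):
        # compare the predicted Token.tag_ against the reference annotations
        return Scorer.score_token_attr(examples, "tag", **kwargs)
    return tagger_score
```

The component would then reference it in its config as `scorer = {"@scorers":"my_tagger_scorer.v1"}`.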

### Overwrite settings

Most pipeline components now include an overwrite setting in the config that determines whether existing annotation in the Doc is preserved or overwritten:

```ini
### config.cfg (excerpt) {highlight="3"}
[components.tagger]
factory = "tagger"
overwrite = false
```
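The same setting can be passed when adding a component in code. A minimal sketch, using the tagger from the excerpt above:

```python
import spacy

nlp = spacy.blank("en")
# keep any Token.tag_ values that are already set instead of overwriting them
nlp.add_pipe("tagger", config={"overwrite": False})
```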

### Doc input for pipelines

`nlp` and `nlp.pipe` accept `Doc` input, skipping the tokenizer if a `Doc` is provided instead of a string. This makes it easier to create a `Doc` with custom tokenization or to set custom extensions before processing:

```python
from spacy.tokens import Doc

Doc.set_extension("text_id", default=None)  # register the extension first
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)
```

### Support for floret vectors

We recently published floret, an extended version of fastText that combines fastText's subwords with Bloom embeddings for compact, full-coverage vectors. The use of subwords means there are no OOV words, and thanks to Bloom embeddings the vector table can be kept very small, at under 100K entries. Bloom embeddings are already used by `HashEmbed` in `tok2vec` for compact spaCy models.

For easy integration, floret includes a Python wrapper:

```bash
$ pip install floret
```
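As a rough sketch of training and exporting vectors with the wrapper (the corpus file and hyperparameters are illustrative, and the keyword arguments follow fastText's Python API; check the floret docs for the exact signature):

```python
import floret

# train CBOW vectors in floret mode: each entry is hashed into a
# 50K-row Bloom embedding table with 2 hashes per entry
model = floret.train_unsupervised(
    "tokenized_texts.txt",
    model="cbow",
    mode="floret",
    hashCount=2,
    bucket=50000,
    minn=4,
    maxn=5,
    dim=300,
)
model.save_vectors("vectors.floret")  # export for import into spaCy
```

The exported table can then be loaded into a pipeline with `spacy init vectors` and its `--mode floret` option.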

A demo project shows how to train toy English floret vectors and import them into a spaCy pipeline.

Two additional demo projects compare standard fastText vectors with floret vectors for full spaCy pipelines. For agglutinative languages like Finnish or Korean, there are large improvements in performance due to the use of subwords (no OOV words!), with a vector table containing merely 50K entries.

Finnish UD+NER vector and pipeline training, comparing standard fastText vs. floret vectors.

For the default project settings, with 1M (2.6 GB) tokenized training texts and 50K 300-dim vectors (~300K keys for the standard vectors):

| Vectors                                      | TAG  | POS  | DEP UAS | DEP LAS | NER F |
| -------------------------------------------- | ---- | ---- | ------- | ------- | ----- |
| none                                         | 93.3 | 92.3 | 79.7    | 72.8    | 61.0  |
| standard (pruned: 50K vectors for 300K keys) | 95.9 | 94.7 | 83.3    | 77.9    | 68.5  |
| standard (unpruned: 300K vectors/keys)       | 96.0 | 95.0 | 83.8    | 78.4    | 69.1  |
| floret (minn 4, maxn 5; 50K vectors, no OOV) | 96.6 | 95.5 | 83.5    | 78.5    | 70.9  |

Korean UD vector and pipeline training, comparing standard fastText vs. floret vectors.

For the default project settings, with 1M (3.3 GB) tokenized training texts and 50K 300-dim vectors (~800K keys for the standard vectors):

| Vectors                                      | TAG  | POS  | DEP UAS | DEP LAS |
| -------------------------------------------- | ---- | ---- | ------- | ------- |
| none                                         | 72.5 | 85.0 | 73.2    | 64.3    |
| standard (pruned: 50K vectors for 800K keys) | 77.9 | 89.4 | 78.8    | 72.8    |
| standard (unpruned: 800K vectors/keys)       | 79.0 | 90.2 | 79.2    | 73.9    |
| floret (minn 2, maxn 3; 50K vectors, no OOV) | 82.5 | 93.8 | 83.0    | 80.1    |

### Updates for spacy-transformers v1.1

spacy-transformers v1.1 has been refactored to improve serialization and to support inline transformer components and replacing listeners. In addition, the transformer model output is provided as `ModelOutput` instead of tuples in `TransformerData.model_output` and `FullTransformerBatch.model_output`. For backwards compatibility, the tuple format remains available under `TransformerData.tensors` and `FullTransformerBatch.tensors`. See the transformer API docs for more details.

spacy-transformers v1.1 also adds support for `transformer_config` settings such as `output_attentions`. Additional output is stored under `TransformerData.model_output`. More details are in the `TransformerModel` docs. Training speed has been improved by streamlining allocations for tokenizer output, and there is new support for mixed-precision training.
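As a sketch of the two access patterns, assuming a loaded transformer pipeline such as `en_core_web_trf`:

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("spaCy v3.2 is out.")

# new in v1.1: a transformers ModelOutput with named attributes
hidden = doc._.trf_data.model_output.last_hidden_state

# backwards-compatible tuple format, as in v1.0
legacy_tensors = doc._.trf_data.tensors
```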

### New transformer package for Japanese

spaCy v3.2 adds a new transformer pipeline package for Japanese ja_core_news_trf, which uses the basic pretokenizer instead of mecab to limit the number of dependencies required for the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for their contributions!
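Once downloaded with `python -m spacy download ja_core_news_trf`, the package loads like any other trained pipeline (the example sentence is illustrative):

```python
import spacy

nlp = spacy.load("ja_core_news_trf")
doc = nlp("これはテキストです。")
print([(token.text, token.pos_) for token in doc])
```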

### Pipeline and language updates

- All Universal Dependencies training data has been updated to v2.8.
- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos Rodriguez, Carme Armentano and the Barcelona Supercomputing Center!
- The transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
- Trailing whitespace has been added as a `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
- The English attribute ruler patterns have been overhauled to improve `Token.pos` and `Token.morph`.

spaCy v3.2 also features a new Irish lemmatizer, support for `noun_chunks` in Portuguese, improved `noun_chunks` for Spanish and additional updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.

## Notes about upgrading from v3.1 {#upgrading}

### Pipeline package version compatibility

> #### Using legacy implementations
>
> In spaCy v3, you'll still be able to load and reference legacy implementations via spacy-legacy, even if the components or architectures change and newer versions are available in the core library.

When you're loading a pipeline package trained with spaCy v3.0 or v3.1, you will see a warning telling you that the pipeline may be incompatible. This doesn't necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results. If you're using one of the trained pipelines we provide, you should run `spacy download` to update to the latest version. To see an overview of all installed packages and their compatibility, you can run `spacy validate`.

If you've trained your own custom pipeline and you've confirmed that it's still working as expected, you can update the spaCy version requirements in the `meta.json`:

```diff
- "spacy_version": ">=3.1.0,<3.2.0",
+ "spacy_version": ">=3.2.0,<3.3.0",
```

### Updating v3.1 configs

To update a config from spaCy v3.1 with the new v3.2 settings, run `init fill-config`:

```bash
$ python -m spacy init fill-config config-v3.1.cfg config-v3.2.cfg
```

In many cases (`spacy train`, `spacy.load`), the new defaults will be filled in automatically, but you'll need to fill in the new settings to run `debug config` and `debug data`.

### Notes about upgrading from spacy-transformers v1.0

When you're loading a transformer pipeline package trained with spacy-transformers v1.0 after upgrading to spacy-transformers v1.1, you'll see a warning telling you that the pipeline may be incompatible. spacy-transformers v1.1 should be able to import v1.0 transformer components into the new internal format with no change in performance, but we'd again recommend running your test suite to verify that the pipeline still performs as expected.

If you save your pipeline with `nlp.to_disk`, it will be saved in the new v1.1 format and should be fully compatible with spacy-transformers v1.1. Once you've confirmed the performance, you can update the requirements in `meta.json`:

  "requirements": [
-    "spacy-transformers>=1.0.3,<1.1.0"
+    "spacy-transformers>=1.1.2,<1.2.0"
  ]

If you're using one of the trained pipelines we provide, you should run `spacy download` to update to the latest version. To see an overview of all installed packages and their compatibility, you can run `spacy validate`.