From 216ed231a988640841a7e6c7d936a9b00fd9ed1a Mon Sep 17 00:00:00 2001 From: Adriane Boyd Date: Fri, 5 Nov 2021 16:31:14 +0100 Subject: [PATCH] What's new in v3.2 (#9633) * What's new in v3.2 * Fix formatting * Fix typo * Redo thanks * Formatting * Fix typo * Fix project links * Fix typo * Minimal intro, floret python module * Rephrase * Rephrase, extend * Rephrase * Update links and formatting [ci skip] * Minor correction * Fix typo Co-authored-by: Ines Montani --- website/docs/usage/v3-2.md | 244 +++++++++++++++++++++++++++++++++ website/meta/sidebars.json | 3 +- website/src/templates/index.js | 4 +- 3 files changed, 248 insertions(+), 3 deletions(-) create mode 100644 website/docs/usage/v3-2.md diff --git a/website/docs/usage/v3-2.md b/website/docs/usage/v3-2.md new file mode 100644 index 000000000..766d1c0a9 --- /dev/null +++ b/website/docs/usage/v3-2.md @@ -0,0 +1,244 @@ +--- +title: What's New in v3.2 +teaser: New features and how to upgrade +menu: + - ['New Features', 'features'] + - ['Upgrading Notes', 'upgrading'] +--- + +## New Features {#features hidden="true"} + +spaCy v3.2 adds support for [`floret`](https://github.com/explosion/floret) +vectors, makes custom `Doc` creation and scoring easier, and includes many bug +fixes and improvements. For the trained pipelines, there's a new transformer +pipeline for Japanese and the Universal Dependencies training data has been +updated across the board to the most recent release. + + + +spaCy is now up to **8 × faster on M1 Macs** by calling into Apple's +native Accelerate library for matrix multiplication. For more details, see +[`thinc-apple-ops`](https://github.com/explosion/thinc-apple-ops). 
+ +```bash +$ pip install spacy[apple] +``` + + + +### Registered scoring functions {#registered-scoring-functions} + +To customize the scoring, you can specify a scoring function for each component +in your config from the new [`scorers` registry](/api/top-level#registry): + +```ini +### config.cfg (excerpt) {highlight="3"} +[components.tagger] +factory = "tagger" +scorer = {"@scorers":"spacy.tagger_scorer.v1"} +``` + +### Overwrite settings {#overwrite} + +Most pipeline components now include an `overwrite` setting in the config that +determines whether existing annotation in the `Doc` is preserved or overwritten: + +```ini +### config.cfg (excerpt) {highlight="3"} +[components.tagger] +factory = "tagger" +overwrite = false +``` + +### Doc input for pipelines {#doc-input} + +[`nlp`](/api/language#call) and [`nlp.pipe`](/api/language#pipe) accept +[`Doc`](/api/doc) input, skipping the tokenizer if a `Doc` is provided instead +of a string. This makes it easier to create a `Doc` with custom tokenization or +to set custom extensions before processing: + +```python +doc = nlp.make_doc("This is text 500.") +doc._.text_id = 500 +doc = nlp(doc) +``` + +### Support for floret vectors {#vectors} + +We recently published [`floret`](https://github.com/explosion/floret), an +extended version of [fastText](https://fasttext.cc) that combines fastText's +subwords with Bloom embeddings for compact, full-coverage vectors. The use of +subwords means that there are no OOV words and due to Bloom embeddings, the +vector table can be kept very small at <100K entries. Bloom embeddings are +already used by [HashEmbed](https://thinc.ai/docs/api-layers#hashembed) in +[tok2vec](/api/architectures#tok2vec-arch) for compact spaCy models. 
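The combination of subwords and Bloom embeddings described above can be illustrated with a minimal sketch. It is not floret's implementation: the helper names (`char_ngrams`, `bloom_rows`, `word_vector`), the use of `zlib.crc32` as the hash function, the averaging step and the toy table are all assumptions made for illustration. The point is only that every string maps to rows of a small fixed-size table, so no word is ever out of vocabulary.

```python
import random
import zlib


def char_ngrams(word, minn=4, maxn=5):
    """Return the bracketed word followed by its character n-grams."""
    marked = f"<{word}>"  # boundary markers, as in fastText
    grams = [marked]
    for n in range(minn, maxn + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


def bloom_rows(ngram, n_rows, n_hashes=2):
    """Map one n-gram to n_hashes rows of a fixed-size table."""
    data = ngram.encode("utf8")
    # Assumption: crc32 with different start values stands in for the real hash.
    return [zlib.crc32(data, seed) % n_rows for seed in range(n_hashes)]


def word_vector(word, table, minn=4, maxn=5, n_hashes=2):
    """Average the table rows selected by every n-gram of the word."""
    width = len(table[0])
    total = [0.0] * width
    n_selected = 0
    for gram in char_ngrams(word, minn, maxn):
        for row in bloom_rows(gram, len(table), n_hashes):
            total = [t + v for t, v in zip(total, table[row])]
            n_selected += 1
    return [t / n_selected for t in total]


# Hypothetical toy table: 1000 rows of 8 dimensions with random values.
rng = random.Random(0)
toy_table = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(1000)]

# Any string gets a vector, including one never seen during training.
vector = word_vector("talossanikin", toy_table)
print(len(vector))
```

Because several n-grams can share a row, the table size is independent of the vocabulary size, which is why a table with fewer than 100K entries can still cover every word.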
+ +For easy integration, floret includes a +[Python wrapper](https://github.com/explosion/floret/blob/main/python/README.md): + +```bash +$ pip install floret +``` + +A demo project shows how to train and import floret vectors: + + + +Train toy English floret vectors and import them into a spaCy pipeline. + + + +Two additional demo projects compare standard fastText vectors with floret +vectors for full spaCy pipelines. For agglutinative languages like Finnish or +Korean, there are large improvements in performance due to the use of subwords +(no OOV words!), with a vector table containing merely 50K entries. + + + +Finnish UD+NER vector and pipeline training, comparing standard fastText vs. +floret vectors. + +For the default project settings with 1M (2.6G) tokenized training texts and 50K +300-dim vectors, ~300K keys for the standard vectors: + +| Vectors | TAG | POS | DEP UAS | DEP LAS | NER F | +| -------------------------------------------- | -------: | -------: | -------: | -------: | -------: | +| none | 93.3 | 92.3 | 79.7 | 72.8 | 61.0 | +| standard (pruned: 50K vectors for 300K keys) | 95.9 | 94.7 | 83.3 | 77.9 | 68.5 | +| standard (unpruned: 300K vectors/keys) | 96.0 | 95.0 | **83.8** | 78.4 | 69.1 | +| floret (minn 4, maxn 5; 50K vectors, no OOV) | **96.6** | **95.5** | 83.5 | **78.5** | **70.9** | + + + + + +Korean UD vector and pipeline training, comparing standard fastText vs. floret +vectors. 
+ +For the default project settings with 1M (3.3G) tokenized training texts and 50K +300-dim vectors, ~800K keys for the standard vectors: + +| Vectors | TAG | POS | DEP UAS | DEP LAS | +| -------------------------------------------- | -------: | -------: | -------: | -------: | +| none | 72.5 | 85.0 | 73.2 | 64.3 | +| standard (pruned: 50K vectors for 800K keys) | 77.9 | 89.4 | 78.8 | 72.8 | +| standard (unpruned: 800K vectors/keys) | 79.0 | 90.2 | 79.2 | 73.9 | +| floret (minn 2, maxn 3; 50K vectors, no OOV) | **82.5** | **93.8** | **83.0** | **80.1** | + + + +### Updates for spacy-transformers v1.1 {#spacy-transformers} + +[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.1 has +been refactored to improve serialization and support for inline transformer +components and replacing listeners. In addition, the transformer model output is +provided as +[`ModelOutput`](https://huggingface.co/transformers/main_classes/output.html?highlight=modeloutput#transformers.file_utils.ModelOutput) +instead of tuples in +`TransformerData.model_output` and `FullTransformerBatch.model_output`. For +backwards compatibility, the tuple format remains available under +`TransformerData.tensors` and `FullTransformerBatch.tensors`. See more details +in the [transformer API docs](/api/architectures#TransformerModel). + +`spacy-transformers` v1.1 also adds support for `transformer_config` settings +such as `output_attentions`. Additional output is stored under +`TransformerData.model_output`. More details are in the +[TransformerModel docs](/api/architectures#TransformerModel). The training speed +has been improved by streamlining allocations for tokenizer output and there is +new support for [mixed-precision training](/api/architectures#TransformerModel). 
+ +### New transformer package for Japanese {#pipeline-packages} + +spaCy v3.2 adds a new transformer pipeline package for Japanese +[`ja_core_news_trf`](/models/ja#ja_core_news_trf), which uses the `basic` +pretokenizer instead of `mecab` to limit the number of dependencies required for +the pipeline. Thanks to Hiroshi Matsuda and the spaCy Japanese community for +their contributions! + +### Pipeline and language updates {#pipeline-updates} + +- All Universal Dependencies training data has been updated to v2.8. +- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos + Rodriguez and the Barcelona Supercomputing Center! +- The transformer pipelines are trained using spacy-transformers v1.1, with + improved IO and more options for + [model config and output](/api/architectures#TransformerModel). +- Trailing whitespace has been added as a `tok2vec` feature, improving the + performance for many components, especially fine-grained tagging and sentence + segmentation. +- The English attribute ruler patterns have been overhauled to improve + `Token.pos` and `Token.morph`. + +spaCy v3.2 also features a new Irish lemmatizer, support for `noun_chunks` in +Portuguese, improved `noun_chunks` for Spanish and additional updates for +Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese. + +## Notes about upgrading from v3.1 {#upgrading} + +### Pipeline package version compatibility {#version-compat} + +> #### Using legacy implementations +> +> In spaCy v3, you'll still be able to load and reference legacy implementations +> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the +> components or architectures change and newer versions are available in the +> core library. + +When you're loading a pipeline package trained with spaCy v3.0 or v3.1, you will +see a warning telling you that the pipeline may be incompatible. 
This doesn't +necessarily have to be true, but we recommend running your pipelines against +your test suite or evaluation data to make sure there are no unexpected results. +If you're using one of the [trained pipelines](/models) we provide, you should +run [`spacy download`](/api/cli#download) to update to the latest version. To +see an overview of all installed packages and their compatibility, you can run +[`spacy validate`](/api/cli#validate). + +If you've trained your own custom pipeline and you've confirmed that it's still +working as expected, you can update the spaCy version requirements in the +[`meta.json`](/api/data-formats#meta): + +```diff +- "spacy_version": ">=3.1.0,<3.2.0", ++ "spacy_version": ">=3.2.0,<3.3.0", +``` + +### Updating v3.1 configs + +To update a config from spaCy v3.1 with the new v3.2 settings, run +[`init fill-config`](/api/cli#init-fill-config): + +```cli +$ python -m spacy init fill-config config-v3.1.cfg config-v3.2.cfg +``` + +In many cases ([`spacy train`](/api/cli#train), +[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in +automatically, but you'll need to fill in the new settings to run +[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data). + +## Notes about upgrading from spacy-transformers v1.0 {#upgrading-transformers} + +When you're loading a transformer pipeline package trained with +[`spacy-transformers`](https://github.com/explosion/spacy-transformers) v1.0 +after upgrading to `spacy-transformers` v1.1, you'll see a warning telling you +that the pipeline may be incompatible. `spacy-transformers` v1.1 should be able +to import v1.0 `transformer` components into the new internal format with no +change in performance, but here we'd also recommend running your test suite to +verify that the pipeline still performs as expected. 
+ +If you save your pipeline with [`nlp.to_disk`](/api/language#to_disk), it will +be saved in the new v1.1 format and should be fully compatible with +`spacy-transformers` v1.1. Once you've confirmed the performance, you can update +the requirements in [`meta.json`](/api/data-formats#meta): + +```diff + "requirements": [ +- "spacy-transformers>=1.0.3,<1.1.0" ++ "spacy-transformers>=1.1.2,<1.2.0" + ] +``` + +If you're using one of the [trained pipelines](/models) we provide, you should +run [`spacy download`](/api/cli#download) to update to the latest version. To +see an overview of all installed packages and their compatibility, you can run +[`spacy validate`](/api/cli#validate). diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json index 6fe09f052..1054f7626 100644 --- a/website/meta/sidebars.json +++ b/website/meta/sidebars.json @@ -10,7 +10,8 @@ { "text": "Facts & Figures", "url": "/usage/facts-figures" }, { "text": "spaCy 101", "url": "/usage/spacy-101" }, { "text": "New in v3.0", "url": "/usage/v3" }, - { "text": "New in v3.1", "url": "/usage/v3-1" } + { "text": "New in v3.1", "url": "/usage/v3-1" }, + { "text": "New in v3.2", "url": "/usage/v3-2" } ] }, { diff --git a/website/src/templates/index.js b/website/src/templates/index.js index 2c68ff056..56ac0dbed 100644 --- a/website/src/templates/index.js +++ b/website/src/templates/index.js @@ -119,8 +119,8 @@ const AlertSpace = ({ nightly, legacy }) => { } const navAlert = ( - - 💥 Out now: spaCy v3.1 + + 💥 Out now: spaCy v3.2 )