
---
title: What's New in v3.3
teaser: New features and how to upgrade
menu:
  - ['New Features', 'features']
  - ['Upgrading Notes', 'upgrading']
---

## New features {#features}

spaCy v3.3 improves the speed of core pipeline components, adds a new trainable lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.

### Speed improvements

v3.3 includes a slew of speed improvements:

- Speed up parser and NER by using constant-time head lookups.
- Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer.
- Speed up parser projectivization functions.
- Replace `Ragged` with the faster `AlignmentArray` in `Example` for training.
- Improve `Matcher` speed.
- Improve serialization speed for empty `Doc.spans`.

For longer texts, prediction speed with the trained pipelines improves by 15% or more. We benchmarked `en_core_web_md` (same components as in v3.2) and `de_core_news_md` (with the new trainable lemmatizer) across a range of text sizes on Linux (Intel Xeon W-2265) and OS X (M1) to compare spaCy v3.2 vs. v3.3:

**Intel Xeon W-2265**

| Model                                          | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec |    Diff |
| ---------------------------------------------- | -------------: | -------------: | -------------: | ------: |
| `en_core_web_md` (=same components)            |            100 |          17292 |          17441 |   0.86% |
|                                                |           1000 |          15408 |          16024 |   4.00% |
|                                                |          10000 |          12798 |          15346 |  19.91% |
| `de_core_news_md` (+v3.3 trainable lemmatizer) |            100 |          20221 |          19321 |  -4.45% |
|                                                |           1000 |          17480 |          17345 |  -0.77% |
|                                                |          10000 |          14513 |          17036 |  17.38% |

**Apple M1**

| Model                                          | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec |    Diff |
| ---------------------------------------------- | -------------: | -------------: | -------------: | ------: |
| `en_core_web_md` (=same components)            |            100 |          18272 |          18408 |   0.74% |
|                                                |           1000 |          18794 |          19248 |   2.42% |
|                                                |          10000 |          15144 |          17513 |  15.64% |
| `de_core_news_md` (+v3.3 trainable lemmatizer) |            100 |          19227 |          19591 |   1.89% |
|                                                |           1000 |          20047 |          20628 |   2.90% |
|                                                |          10000 |          15921 |          18546 |  16.49% |
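
If you want a rough sense of these numbers on your own hardware, a minimal sketch along the same lines is shown below. It assumes `en_core_web_md` is installed, and the repeated placeholder text is only a stand-in for real documents of the desired average length.

```python
import time

import spacy

# Load a trained pipeline and build some placeholder texts; swap in your
# own documents to match the document lengths you care about.
nlp = spacy.load("en_core_web_md")
texts = ["Welcome to the Bank of China. " * 150] * 50

# Time prediction with nlp.pipe and report words per second.
start = time.perf_counter()
n_words = sum(len(doc) for doc in nlp.pipe(texts))
elapsed = time.perf_counter() - start
print(f"{n_words / elapsed:.0f} words/sec")
```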

### Trainable lemmatizer

The new trainable lemmatizer component uses edit trees to transform tokens into lemmas. Try out the trainable lemmatizer with the training quickstart!
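
As a minimal sketch, the component is registered under the factory name `trainable_lemmatizer` and can be added to a blank pipeline like any other trainable component; it still needs to be trained (for example with a quickstart-generated config) before it produces lemmas.

```python
import spacy

# Add the new edit tree lemmatizer to a blank German pipeline under the
# conventional component name "lemmatizer".
nlp = spacy.blank("de")
nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
print(nlp.pipe_names)  # ['lemmatizer']
```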

### displaCy support for overlapping spans and arcs

displaCy now supports overlapping spans with a new `span` style and multiple arcs with different labels between the same tokens for `dep` visualizations.

Overlapping spans can be visualized for any spans key in `doc.spans`:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
text = "Welcome to the Bank of China."
doc = nlp(text)
# Two overlapping spans: "Bank of China" (tokens 3-5) and "China" (token 5)
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
displacy.serve(doc, style="span", options={"spans_key": "custom"})
```

import DisplacySpanHtml from 'images/displacy-span.html'
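
For the `dep` style, the snippet below is a hypothetical manual-mode sketch (the words, tags and arc labels are made up for illustration) showing two arcs with different labels between the same pair of tokens:

```python
from spacy import displacy

# Manual mode takes pre-built "words" and "arcs" data instead of a Doc.
dep_data = {
    "words": [{"text": "I", "tag": "PRON"}, {"text": "left", "tag": "VERB"}],
    "arcs": [
        # Two arcs between the same token pair, each with its own label.
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 0, "end": 1, "label": "dep", "dir": "left"},
    ],
}
displacy.serve(dep_data, style="dep", manual=True)
```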