* Temporarily disable CI tests
* Start v3.3 website updates
* Add trainable lemmatizer to pipeline design
* Fix Vectors.most_similar
* Add floret vector info to pipeline design
* Add Lower and Upper Sorbian
* Add span to sidebar
* Work on release notes
* Copy from release notes
* Update pipeline design graphic
* Upgrading note about Doc.from_docs
* Add tables and details
* Update website/docs/models/index.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix da lemma acc
* Add minimal intro, various updates
* Round lemma acc
* Add section on floret / word lists
* Add new pipelines table, minor edits
* Fix displacy spans example title
* Clarify adding non-trainable lemmatizer
* Update adding-languages URLs
* Revert "Temporarily disable CI tests"
This reverts commit 1dee505920.
* Spell out words/sec
title | teaser | menu
---|---|---
Trained Models & Pipelines | Downloadable trained pipelines and weights for spaCy |
Quickstart
📖 Installation and usage
For more details on how to use trained pipelines with spaCy, see the usage guide.
Package naming conventions
In general, spaCy expects all pipeline packages to follow the naming convention of `[lang]_[name]`. For spaCy's pipelines, we also chose to divide the name into three components:
- Type: Capabilities (e.g. `core` for a general-purpose pipeline with tagging, parsing, lemmatization and named entity recognition, or `dep` for only tagging, parsing and lemmatization).
- Genre: Type of text the pipeline is trained on, e.g. `web` or `news`.
- Size: Package size indicator, `sm`, `md`, `lg` or `trf`.
  `sm` and `trf` pipelines have no static word vectors.
  For pipelines with default vectors, `md` has a reduced word vector table with 20k unique vectors for ~500k words and `lg` has a large word vector table with ~500k entries.
  For pipelines with floret vectors, `md` vector tables have 50k entries and `lg` vector tables have 200k entries.
For example, `en_core_web_sm` is a small English pipeline trained on written web text (blogs, news, comments), that includes vocabulary, syntax and entities.
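The naming scheme above can be illustrated with a small helper. This is only a sketch: `PipelineName`, `parse_pipeline_name` and `has_static_vectors` are hypothetical names invented for this example and are not part of spaCy, and the code assumes a name with exactly four underscore-separated parts.

```python
from typing import NamedTuple


class PipelineName(NamedTuple):
    lang: str   # language code, e.g. "en"
    kind: str   # capabilities ("Type" above), e.g. "core" or "dep"
    genre: str  # text genre, e.g. "web" or "news"
    size: str   # "sm", "md", "lg" or "trf"


def parse_pipeline_name(name: str) -> PipelineName:
    """Split a package name of the form [lang]_[type]_[genre]_[size]."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated parts, got {name!r}")
    return PipelineName(*parts)


def has_static_vectors(name: str) -> bool:
    """Per the size rules above, sm and trf packages have no static word vectors."""
    return parse_pipeline_name(name).size in ("md", "lg")


print(parse_pipeline_name("en_core_web_sm"))
print(has_static_vectors("en_core_web_sm"), has_static_vectors("en_core_web_lg"))
```

Names that do not follow the four-part pattern raise a `ValueError` rather than being guessed at.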
Package versioning
Additionally, the pipeline package versioning reflects both the compatibility with spaCy and the model version. A package version `a.b.c` translates to:
- `a`: spaCy major version. For example, `2` for spaCy v2.x.
- `b`: spaCy minor version. For example, `3` for spaCy v2.3.x.
- `c`: Model version. Different model config: e.g. from being trained on different data, with different parameters, for different numbers of iterations, with different vectors, etc.
For a detailed compatibility overview, see the `compatibility.json`. This is also the source of spaCy's internal compatibility check, performed when you run the `download` command.
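As a rough illustration of the `a.b.c` scheme, the sketch below splits a package version and compares its first two parts with a spaCy version. It is not spaCy's actual compatibility check, which relies on `compatibility.json`. The names `PackageVersion`, `parse_package_version` and `targets_spacy` are hypothetical, and the code assumes plain numeric versions without suffixes such as `.dev0`.

```python
from typing import NamedTuple


class PackageVersion(NamedTuple):
    spacy_major: int    # "a" in a.b.c
    spacy_minor: int    # "b" in a.b.c
    model_version: int  # "c" in a.b.c


def parse_package_version(version: str) -> PackageVersion:
    """Parse a plain a.b.c package version into its three parts."""
    a, b, c = version.split(".")
    return PackageVersion(int(a), int(b), int(c))


def targets_spacy(package_version: str, spacy_version: str) -> bool:
    """True if the package's a.b equals the major.minor of the spaCy version."""
    pkg = parse_package_version(package_version)
    major, minor = spacy_version.split(".")[:2]
    return (pkg.spacy_major, pkg.spacy_minor) == (int(major), int(minor))


print(parse_package_version("3.3.0"))
print(targets_spacy("3.3.0", "3.3.1"), targets_spacy("3.2.0", "3.3.1"))
```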
Trained pipeline design
The spaCy v3 trained pipelines are designed to be efficient and configurable. For example, multiple components can share a common "token-to-vector" model and it's easy to swap out or disable the lemmatizer. The pipelines are optimized for speed and size and work well when run in full.
When modifying a trained pipeline, it's important to understand how the components depend on each other. Unlike spaCy v2, where the `tagger`, `parser` and `ner` components were all independent, some v3 components depend on earlier components in the pipeline. As a result, disabling or reordering components can affect the annotation quality or lead to warnings and errors.
Main changes from spaCy v2 models:
- The `Tok2Vec` component may be a separate, shared component. A component like a tagger or parser can listen to an earlier `tok2vec` or `transformer` rather than having its own separate tok2vec layer.
- Rule-based exceptions move from individual components to the `attribute_ruler`. Lemma and POS exceptions move from the tokenizer exceptions to the attribute ruler and the tag map and morph rules move from the tagger to the attribute ruler.
- The lemmatizer tables and processing move from the vocab and tagger to a separate `lemmatizer` component.
CNN/CPU pipeline design
In the `sm`/`md`/`lg` models:
- The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec` component. If the lemmatizer is trainable (v3.3+), `lemmatizer` also listens to `tok2vec`.
- The `attribute_ruler` maps `token.tag` to `token.pos` if there is no `morphologizer`. The `attribute_ruler` additionally makes sure whitespace is tagged consistently and copies `token.pos` to `token.tag` if there is no tagger. For English, the attribute ruler can improve its mapping from `token.tag` to `token.pos` if dependency parses from a `parser` are present, but the parser is not required.
- The `lemmatizer` component for many languages requires `token.pos` annotation from either `tagger`+`attribute_ruler` or `morphologizer`.
- The `ner` component is independent with its own internal tok2vec layer.
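To make the dependencies in the list above concrete, the sketch below models them as a plain dictionary and computes which components are affected when others are disabled. The `REQUIRES` table is a simplified assumption for an English CNN pipeline with a rule-based lemmatizer and is not read from spaCy. `affected_by_disabling` is a hypothetical helper, not a spaCy API.

```python
from typing import Dict, Iterable, List, Set

# Assumed, simplified dependencies based on the list above (illustration only).
REQUIRES: Dict[str, List[str]] = {
    "tok2vec": [],
    "tagger": ["tok2vec"],              # listens to tok2vec
    "parser": ["tok2vec"],              # listens to tok2vec
    "attribute_ruler": ["tagger"],      # maps token.tag to token.pos
    "lemmatizer": ["attribute_ruler"],  # needs token.pos
    "ner": [],                          # has its own internal tok2vec layer
}


def affected_by_disabling(disabled: Iterable[str]) -> Set[str]:
    """Return enabled components whose inputs are (transitively) missing."""
    disabled_set = set(disabled)
    missing: Set[str] = set(disabled_set)
    changed = True
    while changed:
        changed = False
        for name, deps in REQUIRES.items():
            if name not in missing and any(dep in missing for dep in deps):
                missing.add(name)
                changed = True
    return missing - disabled_set


print(sorted(affected_by_disabling(["tagger"])))
print(sorted(affected_by_disabling(["tok2vec"])))
```

In this model `ner` is never affected, which matches its description as independent in the CNN pipelines.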
CNN/CPU pipelines with floret vectors
The Finnish, Korean and Swedish `md` and `lg` pipelines use floret vectors instead of default vectors. If you're running a trained pipeline on texts and working with `Doc` objects, you shouldn't notice any difference with floret vectors. With floret vectors no tokens are out-of-vocabulary, so `Token.is_oov` will return `False` for all tokens.
If you access vectors directly for similarity comparisons, there are a few differences because floret vectors don't include a fixed word list like the vector keys for default vectors.
- If your workflow iterates over the vector keys, you need to use an external word list instead:
      - lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
      + lexemes = [nlp.vocab[word] for word in external_word_list]
- `Vectors.most_similar` is not supported because there's no fixed list of vectors to compare your vectors to.
Transformer pipeline design
In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present) all listen to the `transformer` component. The `attribute_ruler` and `lemmatizer` have the same configuration as in the CNN models.
Modifying the default pipeline
For faster processing, you may only want to run a subset of the components in a trained pipeline. The `disable` and `exclude` arguments to `spacy.load` let you control which components are loaded and run. Disabled components are loaded in the background so it's possible to re-enable them in the same pipeline in the future with `nlp.enable_pipe`. To skip loading a component completely, use `exclude` instead of `disable`.
Disable part-of-speech tagging and lemmatization
To disable part-of-speech tagging and lemmatization, disable the `tagger`, `morphologizer`, `attribute_ruler` and `lemmatizer` components.
import spacy
# Note: English doesn't include a morphologizer
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer"])
nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemmatizer"])
The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for a number of languages. If you disable any of these components, you'll see lemmatizer warnings unless the lemmatizer is also disabled.
v3.3: Catalan, English, French, Russian and Spanish
v3.0-v3.2: Catalan, Dutch, English, French, Greek, Italian, Macedonian, Norwegian, Polish, Russian and Spanish
Use senter rather than parser for fast sentence segmentation
If you need fast sentence segmentation without dependency parses, disable the `parser` and use the `senter` component instead:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe("parser")
nlp.enable_pipe("senter")
The `senter` component is ~10× faster than the parser and more accurate than the rule-based `sentencizer`.
Switch from trainable lemmatizer to default lemmatizer
Since v3.3, a number of pipelines use a trainable lemmatizer. You can check whether the lemmatizer is trainable:
import spacy
nlp = spacy.load("de_core_news_sm")
assert nlp.get_pipe("lemmatizer").is_trainable
If you'd like to switch to a non-trainable lemmatizer that's similar to v3.2 or earlier, you can replace the trainable lemmatizer with the default non-trainable lemmatizer:
# Requirements: pip install spacy-lookups-data
import spacy
nlp = spacy.load("de_core_news_sm")
# Remove existing lemmatizer
nlp.remove_pipe("lemmatizer")
# Add non-trainable lemmatizer from language defaults
# and load lemmatizer tables from spacy-lookups-data
nlp.add_pipe("lemmatizer").initialize()
Switch from rule-based to lookup lemmatization
For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish pipelines, you can swap out a trainable or rule-based lemmatizer for a lookup lemmatizer:
# Requirements: pip install spacy-lookups-data
import spacy
nlp = spacy.load("en_core_web_sm")
# Remove the existing lemmatizer and add one in lookup mode
nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
Disable everything except NER
For the non-transformer models, the `ner` component is independent, so you can disable everything else:
import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
In the transformer models, `ner` listens to the `transformer` component, so you can disable all components related to tagging, parsing, and lemmatization.
nlp = spacy.load("en_core_web_trf", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"])
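The rule behind the two `disable` lists can be sketched as a small pure-Python helper. `ner_only_disable` is a hypothetical function written for this illustration, and the two component lists are assumptions. On a loaded pipeline you would typically inspect `nlp.pipe_names` instead.

```python
from typing import List, Sequence


def ner_only_disable(pipe_names: Sequence[str]) -> List[str]:
    """Components to disable so that only NER (and what it listens to) runs."""
    keep = {"ner"}
    # In transformer pipelines ner listens to the transformer, so keep it.
    # In CNN pipelines ner has its own tok2vec layer, so tok2vec can be disabled.
    if "transformer" in pipe_names:
        keep.add("transformer")
    return [name for name in pipe_names if name not in keep]


# Assumed component lists, for illustration only.
cnn = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
trf = ["transformer", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
print(ner_only_disable(cnn))
print(ner_only_disable(trf))
```

The resulting lists could then be passed as the `disable` argument to `spacy.load`.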
Move NER to the end of the pipeline
As of v3.1, the NER component is at the end of the pipeline by default.
For access to `POS` and `LEMMA` features in an `entity_ruler`, move `ner` to the end of the pipeline after `attribute_ruler` and `lemmatizer`:
import spacy
# load without NER
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
# source NER from the same pipeline package as the last component
nlp.add_pipe("ner", source=spacy.load("en_core_web_sm"))
# insert the entity ruler
nlp.add_pipe("entity_ruler", before="ner")