mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-11-04 01:48:04 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			457 lines
		
	
	
		
			21 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			457 lines
		
	
	
		
			21 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
---
 | 
						||
title: What's New in v2.2
 | 
						||
teaser: New features, backwards incompatibilities and migration guide
 | 
						||
menu:
 | 
						||
  - ['New Features', 'features']
 | 
						||
  - ['Backwards Incompatibilities', 'incompat']
 | 
						||
  - ['Migrating from v2.1', 'migrating']
 | 
						||
---
 | 
						||
 | 
						||
## New Features {#features hidden="true"}
 | 
						||
 | 
						||
spaCy v2.2 features improved statistical models, new pretrained models for
 | 
						||
Norwegian and Lithuanian, better Dutch NER, as well as a new mechanism for
 | 
						||
storing language data that makes the installation about **5-10× smaller**
 | 
						||
on disk. We've also added a new class to efficiently **serialize annotations**,
 | 
						||
an improved and **10× faster** phrase matching engine, built-in scoring
 | 
						||
and **CLI training for text classification**, a new command to analyze and
 | 
						||
**debug training data**, data augmentation during training and more. For the
 | 
						||
full changelog, see the
 | 
						||
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.2.0).
 | 
						||
 | 
						||
For more details and a behind-the-scenes look at the new release,
 | 
						||
[see our blog post](https://explosion.ai/blog/spacy-v2-2).
 | 
						||
 | 
						||
### Better pretrained models and more languages {#models}
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```bash
 | 
						||
> python -m spacy download nl_core_news_sm
 | 
						||
> python -m spacy download nb_core_news_sm
 | 
						||
> python -m spacy download lt_core_news_sm
 | 
						||
> ```
 | 
						||
 | 
						||
The new version also features new and re-trained models for all languages and
 | 
						||
resolves a number of data bugs. The [Dutch model](/models/nl) has been retrained
 | 
						||
with a new and custom-labelled NER corpus using the same extended label scheme
 | 
						||
as the English models. It should now produce significantly better NER results
 | 
						||
overall. We've also added new core models for [Norwegian](/models/nb) (MIT) and
 | 
						||
[Lithuanian](/models/lt) (CC BY-SA).
 | 
						||
 | 
						||
<Infobox>
 | 
						||
 | 
						||
**Usage:** [Models directory](/models) **Benchmarks: **
 | 
						||
[Release notes](https://github.com/explosion/spaCy/releases/tag/v2.2.0)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Text classification scores and CLI training {#train-textcat-cli}
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```bash
 | 
						||
> $ python -m spacy train en /output /train /dev \\
 | 
						||
> --pipeline textcat --textcat-arch simple_cnn \\
 | 
						||
> --textcat-multilabel
 | 
						||
> ```
 | 
						||
 | 
						||
When training your models using the `spacy train` command, you can now also
 | 
						||
include text categories in the JSON-formatted training data. The `Scorer` and
 | 
						||
`nlp.evaluate` now report the text classification scores, calculated as the
 | 
						||
F-score on positive label for binary exclusive tasks, the macro-averaged F-score
 | 
						||
for 3+ exclusive labels or the macro-averaged AUC ROC score for multilabel
 | 
						||
classification.
 | 
						||
 | 
						||
<Infobox>
 | 
						||
 | 
						||
**API:** [`spacy train`](/api/cli#train), [`Scorer`](/api/scorer),
 | 
						||
[`Language.evaluate`](/api/language#evaluate)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### New DocBin class to efficiently serialize Doc collections
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```python
 | 
						||
> from spacy.tokens import DocBin
 | 
						||
> doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
 | 
						||
> for doc in nlp.pipe(texts):
 | 
						||
>     doc_bin.add(doc)
 | 
						||
> bytes_data = doc_bin.to_bytes()
 | 
						||
> # Deserialize later, e.g. in a new process
 | 
						||
> nlp = spacy.blank("en")
 | 
						||
> doc_bin = DocBin().from_bytes(bytes_data)
 | 
						||
> docs = list(doc_bin.get_docs(nlp.vocab))
 | 
						||
> ```
 | 
						||
 | 
						||
If you're working with lots of data, you'll probably need to pass analyses
 | 
						||
between machines, either to use something like [Dask](https://dask.org) or
 | 
						||
[Spark](https://spark.apache.org), or even just to save out work to disk. Often
 | 
						||
it's sufficient to use the `Doc.to_array` functionality for this, and just
 | 
						||
serialize the numpy arrays – but other times you want a more general way to save
 | 
						||
and restore `Doc` objects.
 | 
						||
 | 
						||
The new `DocBin` class makes it easy to serialize and deserialize a collection
 | 
						||
of `Doc` objects together, and is much more efficient than calling
 | 
						||
`Doc.to_bytes` on each individual `Doc` object. You can also control what data
 | 
						||
gets saved, and you can merge pallets together for easy map/reduce-style
 | 
						||
processing.
 | 
						||
 | 
						||
<Infobox>
 | 
						||
 | 
						||
**API:** [`DocBin`](/api/docbin) **Usage: **
 | 
						||
[Serializing Doc objects](/usage/saving-loading#docs)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Serializable lookup tables and smaller installation {#lookups}
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```python
 | 
						||
> data = {"foo": "bar"}
 | 
						||
> nlp.vocab.lookups.add_table("my_dict", data)
 | 
						||
>
 | 
						||
> def custom_component(doc):
 | 
						||
>    table = doc.vocab.lookups.get_table("my_dict")
 | 
						||
>    print(table.get("foo"))  # look something up
 | 
						||
>    return doc
 | 
						||
> ```
 | 
						||
 | 
						||
The new `Lookups` API lets you add large dictionaries and lookup tables to the
 | 
						||
`Vocab` and access them from the tokenizer or custom components and extension
 | 
						||
attributes. Internally, the tables use Bloom filters for efficient lookup
 | 
						||
checks. They're also fully serializable out-of-the-box. All large data resources
 | 
						||
like lemmatization tables have been moved to a separate package,
 | 
						||
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can
 | 
						||
be installed alongside the core library. This allowed us to make the spaCy
 | 
						||
installation **5-10× smaller on disk** (depending on your platform).
 | 
						||
[Pretrained models](/models) now include their data files, so you only need to
 | 
						||
install the lookups if you want to build blank models or use lemmatization with
 | 
						||
languages that don't yet ship with pretrained models.
 | 
						||
 | 
						||
<Infobox>
 | 
						||
 | 
						||
**API:** [`Lookups`](/api/lookups),
 | 
						||
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) **Usage:
 | 
						||
** [Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### CLI command to debug and validate training data {#debug-data}
 | 
						||
 | 
						||
> #### Example
 | 
						||
>
 | 
						||
> ```bash
 | 
						||
> $ python -m spacy debug-data en train.json dev.json
 | 
						||
> ```
 | 
						||
 | 
						||
The new `debug-data` command lets you analyze and validate your training and
 | 
						||
development data, get useful stats, and find problems like invalid entity
 | 
						||
annotations, cyclic dependencies, low data labels and more. If you're training a
 | 
						||
model with `spacy train` and the results seem surprising or confusing,
 | 
						||
`debug-data` may help you track down the problems and improve your training
 | 
						||
data.
 | 
						||
 | 
						||
<Accordion title="Example output">
 | 
						||
 | 
						||
```
 | 
						||
=========================== Data format validation ===========================
 | 
						||
✔ Corpus is loadable
 | 
						||
 | 
						||
=============================== Training stats ===============================
 | 
						||
Training pipeline: tagger, parser, ner
 | 
						||
Starting with blank model 'en'
 | 
						||
18127 training docs
 | 
						||
2939 evaluation docs
 | 
						||
⚠ 34 training examples also in evaluation data
 | 
						||
 | 
						||
============================== Vocab & Vectors ==============================
 | 
						||
ℹ 2083156 total words in the data (56962 unique)
 | 
						||
⚠ 13020 misaligned tokens in the training data
 | 
						||
⚠ 2423 misaligned tokens in the dev data
 | 
						||
10 most common words: 'the' (98429), ',' (91756), '.' (87073), 'to' (50058),
 | 
						||
'of' (49559), 'and' (44416), 'a' (34010), 'in' (31424), 'that' (22792), 'is'
 | 
						||
(18952)
 | 
						||
ℹ No word vectors present in the model
 | 
						||
 | 
						||
========================== Named Entity Recognition ==========================
 | 
						||
ℹ 18 new labels, 0 existing labels
 | 
						||
528978 missing values (tokens with '-' label)
 | 
						||
New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL'
 | 
						||
(10490), 'NORP' (9033), 'MONEY' (5164), 'PERCENT' (3761), 'ORDINAL' (2122),
 | 
						||
'LOC' (2113), 'TIME' (1616), 'WORK_OF_ART' (1229), 'QUANTITY' (1150), 'FAC'
 | 
						||
(1134), 'EVENT' (974), 'PRODUCT' (935), 'LAW' (444), 'LANGUAGE' (338)
 | 
						||
✔ Good amount of examples for all labels
 | 
						||
✔ Examples without occurences available for all labels
 | 
						||
✔ No entities consisting of or starting/ending with whitespace
 | 
						||
 | 
						||
=========================== Part-of-speech Tagging ===========================
 | 
						||
ℹ 49 labels in data (57 labels in tag map)
 | 
						||
'NN' (266331), 'IN' (227365), 'DT' (185600), 'NNP' (164404), 'JJ' (119830),
 | 
						||
'NNS' (110957), '.' (101482), ',' (92476), 'RB' (90090), 'PRP' (90081), 'VB'
 | 
						||
(74538), 'VBD' (68199), 'CC' (62862), 'VBZ' (50712), 'VBP' (43420), 'VBN'
 | 
						||
(42193), 'CD' (40326), 'VBG' (34764), 'TO' (31085), 'MD' (25863), 'PRP$'
 | 
						||
(23335), 'HYPH' (13833), 'POS' (13427), 'UH' (13322), 'WP' (10423), 'WDT'
 | 
						||
(9850), 'RP' (8230), 'WRB' (8201), ':' (8168), '''' (7392), '``' (6984), 'NNPS'
 | 
						||
(5817), 'JJR' (5689), '$' (3710), 'EX' (3465), 'JJS' (3118), 'RBR' (2872),
 | 
						||
'-RRB-' (2825), '-LRB-' (2788), 'PDT' (2078), 'XX' (1316), 'RBS' (1142), 'FW'
 | 
						||
(794), 'NFP' (557), 'SYM' (440), 'WP$' (294), 'LS' (293), 'ADD' (191), 'AFX'
 | 
						||
(24)
 | 
						||
✔ All labels present in tag map for language 'en'
 | 
						||
 | 
						||
============================= Dependency Parsing =============================
 | 
						||
ℹ Found 111703 sentences with an average length of 18.6 words.
 | 
						||
ℹ Found 2251 nonprojective train sentences
 | 
						||
ℹ Found 303 nonprojective dev sentences
 | 
						||
ℹ 47 labels in train data
 | 
						||
ℹ 211 labels in projectivized train data
 | 
						||
'punct' (236796), 'prep' (188853), 'pobj' (182533), 'det' (172674), 'nsubj'
 | 
						||
(169481), 'compound' (116142), 'ROOT' (111697), 'amod' (107945), 'dobj' (93540),
 | 
						||
'aux' (86802), 'advmod' (86197), 'cc' (62679), 'conj' (59575), 'poss' (36449),
 | 
						||
'ccomp' (36343), 'advcl' (29017), 'mark' (27990), 'nummod' (24582), 'relcl'
 | 
						||
(21359), 'xcomp' (21081), 'attr' (18347), 'npadvmod' (17740), 'acomp' (17204),
 | 
						||
'auxpass' (15639), 'appos' (15368), 'neg' (15266), 'nsubjpass' (13922), 'case'
 | 
						||
(13408), 'acl' (12574), 'pcomp' (10340), 'nmod' (9736), 'intj' (9285), 'prt'
 | 
						||
(8196), 'quantmod' (7403), 'dep' (4300), 'dative' (4091), 'agent' (3908), 'expl'
 | 
						||
(3456), 'parataxis' (3099), 'oprd' (2326), 'predet' (1946), 'csubj' (1494),
 | 
						||
'subtok' (1147), 'preconj' (692), 'meta' (469), 'csubjpass' (64), 'iobj' (1)
 | 
						||
⚠ Low number of examples for label 'iobj' (1)
 | 
						||
⚠ Low number of examples for 130 labels in the projectivized dependency
 | 
						||
trees used for training. You may want to projectivize labels such as punct
 | 
						||
before training in order to improve parser performance.
 | 
						||
⚠ Projectivized labels with low numbers of examples: appos||attr: 12
 | 
						||
advmod||dobj: 13 prep||ccomp: 12 nsubjpass||ccomp: 15 pcomp||prep: 14
 | 
						||
amod||dobj: 9 attr||xcomp: 14 nmod||nsubj: 17 prep||advcl: 2 prep||prep: 5
 | 
						||
nsubj||conj: 12 advcl||advmod: 18 ccomp||advmod: 11 ccomp||pcomp: 5 acl||pobj:
 | 
						||
10 npadvmod||acomp: 7 dobj||pcomp: 14 nsubjpass||pcomp: 1 nmod||pobj: 8
 | 
						||
amod||attr: 6 nmod||dobj: 12 aux||conj: 1 neg||conj: 1 dative||xcomp: 11
 | 
						||
pobj||dative: 3 xcomp||acomp: 19 advcl||pobj: 2 nsubj||advcl: 2 csubj||ccomp: 1
 | 
						||
advcl||acl: 1 relcl||nmod: 2 dobj||advcl: 10 advmod||advcl: 3 nmod||nsubjpass: 6
 | 
						||
amod||pobj: 5 cc||neg: 1 attr||ccomp: 16 advcl||xcomp: 3 nmod||attr: 4
 | 
						||
advcl||nsubjpass: 5 advcl||ccomp: 4 ccomp||conj: 1 punct||acl: 1 meta||acl: 1
 | 
						||
parataxis||acl: 1 prep||acl: 1 amod||nsubj: 7 ccomp||ccomp: 3 acomp||xcomp: 5
 | 
						||
dobj||acl: 5 prep||oprd: 6 advmod||acl: 2 dative||advcl: 1 pobj||agent: 5
 | 
						||
xcomp||amod: 1 dep||advcl: 1 prep||amod: 8 relcl||compound: 1 advcl||csubj: 3
 | 
						||
npadvmod||conj: 2 npadvmod||xcomp: 4 advmod||nsubj: 3 ccomp||amod: 7
 | 
						||
advcl||conj: 1 nmod||conj: 2 advmod||nsubjpass: 2 dep||xcomp: 2 appos||ccomp: 1
 | 
						||
advmod||dep: 1 advmod||advmod: 5 aux||xcomp: 8 dep||advmod: 1 dative||ccomp: 2
 | 
						||
prep||dep: 1 conj||conj: 1 dep||ccomp: 4 cc||ROOT: 1 prep||ROOT: 1 nsubj||pcomp:
 | 
						||
3 advmod||prep: 2 relcl||dative: 1 acl||conj: 1 advcl||attr: 4 prep||npadvmod: 1
 | 
						||
nsubjpass||xcomp: 1 neg||advmod: 1 xcomp||oprd: 1 advcl||advcl: 1 dobj||dep: 3
 | 
						||
nsubjpass||parataxis: 1 attr||pcomp: 1 ccomp||parataxis: 1 advmod||attr: 1
 | 
						||
nmod||oprd: 1 appos||nmod: 2 advmod||relcl: 1 appos||npadvmod: 1 appos||conj: 1
 | 
						||
prep||expl: 1 nsubjpass||conj: 1 punct||pobj: 1 cc||pobj: 1 conj||pobj: 1
 | 
						||
punct||conj: 1 ccomp||dep: 1 oprd||xcomp: 3 ccomp||xcomp: 1 ccomp||nsubj: 1
 | 
						||
nmod||dep: 1 xcomp||ccomp: 1 acomp||advcl: 1 intj||advmod: 1 advmod||acomp: 2
 | 
						||
relcl||oprd: 1 advmod||prt: 1 advmod||pobj: 1 appos||nummod: 1 relcl||npadvmod:
 | 
						||
3 mark||advcl: 1 aux||ccomp: 1 amod||nsubjpass: 1 npadvmod||advmod: 1 conj||dep:
 | 
						||
1 nummod||pobj: 1 amod||npadvmod: 1 intj||pobj: 1 nummod||npadvmod: 1
 | 
						||
xcomp||xcomp: 1 aux||dep: 1 advcl||relcl: 1
 | 
						||
⚠ The following labels were found only in the train data: xcomp||amod,
 | 
						||
advcl||relcl, prep||nsubjpass, acl||nsubj, nsubjpass||conj, xcomp||oprd,
 | 
						||
advmod||conj, advmod||advmod, iobj, advmod||nsubjpass, dobj||conj, ccomp||amod,
 | 
						||
meta||acl, xcomp||xcomp, prep||attr, prep||ccomp, advcl||acomp, acl||dobj,
 | 
						||
advcl||advcl, pobj||agent, prep||advcl, nsubjpass||xcomp, prep||dep,
 | 
						||
acomp||xcomp, aux||ccomp, ccomp||dep, conj||dep, relcl||compound,
 | 
						||
nsubjpass||ccomp, nmod||dobj, advmod||advcl, advmod||acl, dobj||advcl,
 | 
						||
dative||xcomp, prep||nsubj, ccomp||ccomp, nsubj||ccomp, xcomp||acomp,
 | 
						||
prep||acomp, dep||advmod, acl||pobj, appos||dobj, npadvmod||acomp, cc||ROOT,
 | 
						||
relcl||nsubj, nmod||pobj, acl||nsubjpass, ccomp||advmod, pcomp||prep,
 | 
						||
amod||dobj, advmod||attr, advcl||csubj, appos||attr, dobj||pcomp, prep||ROOT,
 | 
						||
relcl||pobj, advmod||pobj, amod||nsubj, ccomp||xcomp, prep||oprd,
 | 
						||
npadvmod||advmod, appos||nummod, advcl||pobj, neg||advmod, acl||attr,
 | 
						||
appos||nsubjpass, csubj||ccomp, amod||nsubjpass, intj||pobj, dep||advcl,
 | 
						||
cc||neg, xcomp||ccomp, dative||ccomp, nmod||oprd, pobj||dative, prep||dobj,
 | 
						||
dep||ccomp, relcl||attr, ccomp||nsubj, advcl||xcomp, nmod||dep, advcl||advmod,
 | 
						||
ccomp||conj, pobj||prep, advmod||acomp, advmod||relcl, attr||pcomp,
 | 
						||
ccomp||parataxis, oprd||xcomp, intj||advmod, nmod||nsubjpass, prep||npadvmod,
 | 
						||
parataxis||acl, prep||pobj, advcl||dobj, amod||pobj, prep||acl, conj||pobj,
 | 
						||
advmod||dep, punct||pobj, ccomp||acomp, acomp||advcl, nummod||npadvmod,
 | 
						||
dobj||dep, npadvmod||xcomp, advcl||conj, relcl||npadvmod, punct||acl,
 | 
						||
relcl||dobj, dobj||xcomp, nsubjpass||parataxis, dative||advcl, relcl||nmod,
 | 
						||
advcl||ccomp, appos||npadvmod, ccomp||pcomp, prep||amod, mark||advcl,
 | 
						||
prep||advmod, prep||xcomp, appos||nsubj, attr||ccomp, advmod||prt, dobj||ccomp,
 | 
						||
aux||conj, advcl||nsubj, conj||conj, advmod||ccomp, advcl||nsubjpass,
 | 
						||
attr||xcomp, nmod||conj, npadvmod||conj, relcl||dative, prep||expl,
 | 
						||
nsubjpass||pcomp, advmod||xcomp, advmod||dobj, appos||pobj, nsubj||conj,
 | 
						||
relcl||nsubjpass, advcl||attr, appos||ccomp, advmod||prep, prep||conj,
 | 
						||
nmod||attr, punct||conj, neg||conj, dep||xcomp, aux||xcomp, dobj||acl,
 | 
						||
nummod||pobj, amod||npadvmod, nsubj||pcomp, advcl||acl, appos||nmod,
 | 
						||
relcl||oprd, prep||prep, cc||pobj, nmod||nsubj, amod||attr, aux||dep,
 | 
						||
appos||conj, advmod||nsubj, nsubj||advcl, acl||conj
 | 
						||
To train a parser, your data should include at least 20 instances of each label.
 | 
						||
⚠ Multiple root labels (ROOT, nsubj, aux, npadvmod, prep) found in
 | 
						||
training data. spaCy's parser uses a single root label ROOT so this distinction
 | 
						||
will not be available.
 | 
						||
 | 
						||
================================== Summary ==================================
 | 
						||
✔ 5 checks passed
 | 
						||
⚠ 8 warnings
 | 
						||
```
 | 
						||
 | 
						||
</Accordion>
 | 
						||
 | 
						||
<Infobox>
 | 
						||
 | 
						||
**API:** [`spacy debug-data`](/api/cli#debug-data)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
## Backwards incompatibilities {#incompat}
 | 
						||
 | 
						||
<Infobox title="Important note on models" variant="warning">
 | 
						||
 | 
						||
If you've been training **your own models**, you'll need to **retrain** them
 | 
						||
with the new version. Also don't forget to upgrade all models to the latest
 | 
						||
versions. Models for v2.0 or v2.1 aren't compatible with models for v2.2. To
 | 
						||
check if all of your models are up to date, you can run the
 | 
						||
[`spacy validate`](/api/cli#validate) command.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
> #### Install with lookups data
 | 
						||
>
 | 
						||
> ```bash
 | 
						||
> $ pip install spacy[lookups]
 | 
						||
> ```
 | 
						||
>
 | 
						||
> You can also install
 | 
						||
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
 | 
						||
> directly.
 | 
						||
 | 
						||
- The lemmatization tables have been moved to their own package,
 | 
						||
  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
 | 
						||
  is not installed by default. If you're using pretrained models, **nothing
 | 
						||
  changes**, because the tables are now included in the model packages. If you
 | 
						||
  want to use the lemmatizer for other languages that don't yet have pretrained
 | 
						||
  models (e.g. Turkish or Croatian) or start off with a blank model that
 | 
						||
  contains lookup data (e.g. `spacy.blank("en")`), you'll need to **explicitly
 | 
						||
  install spaCy plus data** via `pip install spacy[lookups]`.
 | 
						||
- Lemmatization tables (rules, exceptions, index and lookups) are now part of
 | 
						||
  the `Vocab` and serialized with it. This means that serialized objects (`nlp`,
 | 
						||
  pipeline components, vocab) will now include additional data, and models
 | 
						||
  written to disk will include additional files.
 | 
						||
- The [`Lemmatizer`](/api/lemmatizer) class is now initialized with an instance
 | 
						||
  of [`Lookups`](/api/lookups) containing the rules and tables, instead of dicts
 | 
						||
  as separate arguments. This makes it easier to share data tables and modify
 | 
						||
  them at runtime. This is mostly internals, but if you've been implementing a
 | 
						||
  custom `Lemmatizer`, you'll need to update your code.
 | 
						||
- The [Dutch model](/models/nl) has been trained on a new NER corpus (custom
 | 
						||
  labelled UD instead of WikiNER), so their predictions may be very different
 | 
						||
  compared to the previous version. The results should be significantly better
 | 
						||
  and more generalizable, though.
 | 
						||
- The [`spacy download`](/api/cli#download) command does **not** set the
 | 
						||
  `--no-deps` pip argument anymore by default, meaning that model package
 | 
						||
  dependencies (if available) will now be also downloaded and installed. If
 | 
						||
  spaCy (which is also a model dependency) is not installed in the current
 | 
						||
  environment, e.g. if a user has built from source, `--no-deps` is added back
 | 
						||
  automatically to prevent spaCy from being downloaded and installed again from
 | 
						||
  pip.
 | 
						||
- The built-in
 | 
						||
  [`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) converter
 | 
						||
  is now stricter and will raise an error if entities are overlapping (instead
 | 
						||
  of silently skipping them). If your data contains invalid entity annotations,
 | 
						||
  make sure to clean it and resolve conflicts. You can now also use the new
 | 
						||
  `debug-data` command to find problems in your data.
 | 
						||
- Pipeline components can now overwrite IOB tags of tokens that are not yet part
 | 
						||
  of an entity. Once a token has an `ent_iob` value set, it won't be reset to an
 | 
						||
  "unset" state and will always have at least `O` assigned. `list(doc.ents)` now
 | 
						||
  actually keeps the annotations on the token level consistent, instead of
 | 
						||
  resetting `O` to an empty string.
 | 
						||
- The default punctuation in the [`Sentencizer`](/api/sentencizer) has been
 | 
						||
  extended and now includes more characters common in various languages. This
 | 
						||
  also means that the results it produces may change, depending on your text. If
 | 
						||
  you want the previous behavior with limited characters, set
 | 
						||
  `punct_chars=[".", "!", "?"]` on initialization.
 | 
						||
- The [`PhraseMatcher`](/api/phrasematcher) algorithm was rewritten from scratch
 | 
						||
  and it's now 10× faster. The rewrite also resolved a few subtle bugs
 | 
						||
  with very large terminology lists. So if you were matching large lists, you
 | 
						||
  may see slightly different results – however, the results should now be fully
 | 
						||
  correct. See [this PR](https://github.com/explosion/spaCy/pull/4309) for more
 | 
						||
  details.
 | 
						||
- The `Serbian` language class (introduced in v2.1.8) incorrectly used the
 | 
						||
  language code `rs` instead of `sr`. This has now been fixed, so `Serbian` is
 | 
						||
  now available via `spacy.lang.sr`.
 | 
						||
- The `"sources"` in the `meta.json` have changed from a list of strings to a
 | 
						||
  list of dicts. This is mostly internals, but if your code used
 | 
						||
  `nlp.meta["sources"]`, you might have to update it.
 | 
						||
 | 
						||
### Migrating from spaCy 2.1 {#migrating}
 | 
						||
 | 
						||
#### Lemmatization data and lookup tables
 | 
						||
 | 
						||
If you application needs lemmatization for [languages](/usage/models#languages)
 | 
						||
with only tokenizers, you now need to install that data explicitly via
 | 
						||
`pip install spacy[lookups]` or `pip install spacy-lookups-data`. No additional
 | 
						||
setup is required – the package just needs to be installed in the same
 | 
						||
environment as spaCy.
 | 
						||
 | 
						||
```python
 | 
						||
### {highlight="3-4"}
 | 
						||
nlp = Turkish()
 | 
						||
doc = nlp("Bu bir cümledir.")
 | 
						||
# 🚨 This now requires the lookups data to be installed explicitly
 | 
						||
print([token.lemma_ for token in doc])
 | 
						||
```
 | 
						||
 | 
						||
The same applies to blank models that you want to update and train – for
 | 
						||
instance, you might use [`spacy.blank`](/api/top-level#spacy.blank) to create a
 | 
						||
blank English model and then train your own part-of-speech tagger on top. If you
 | 
						||
don't explicitly install the lookups data, that `nlp` object won't have any
 | 
						||
lemmatization rules available. spaCy will now show you a warning when you train
 | 
						||
a new part-of-speech tagger and the vocab has no lookups available.
 | 
						||
 | 
						||
#### Lemmatizer initialization
 | 
						||
 | 
						||
This is mainly internals and should hopefully not affect your code. But if
 | 
						||
you've been creating custom [`Lemmatizers`](/api/lemmatizer), you'll need to
 | 
						||
update how they're initialized and pass in an instance of
 | 
						||
[`Lookups`](/api/lookups) with the (optional) tables `lemma_index`, `lemma_exc`,
 | 
						||
`lemma_rules` and `lemma_lookup`.
 | 
						||
 | 
						||
```diff
 | 
						||
from spacy.lemmatizer import Lemmatizer
 | 
						||
+ from spacy.lookups import Lookups
 | 
						||
 | 
						||
lemma_index = {"verb": ("cope", "cop")}
 | 
						||
lemma_exc = {"verb": {"coping": ("cope",)}}
 | 
						||
lemma_rules = {"verb": [["ing", ""]]}
 | 
						||
- lemmatizer = Lemmatizer(lemma_index, lemma_exc, lemma_rules)
 | 
						||
+ lookups = Lookups()
 | 
						||
+ lookups.add_table("lemma_index", lemma_index)
 | 
						||
+ lookups.add_table("lemma_exc", lemma_exc)
 | 
						||
+ lookups.add_table("lemma_rules", lemma_rules)
 | 
						||
+ lemmatizer = Lemmatizer(lookups)
 | 
						||
```
 | 
						||
 | 
						||
#### Converting entity offsets to BILUO tags
 | 
						||
 | 
						||
If you've been using the
 | 
						||
[`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) helper to
 | 
						||
convert character offsets into token-based BILUO tags, you may now see an error
 | 
						||
if the offsets contain overlapping tokens and make it impossible to create a
 | 
						||
valid BILUO sequence. This is helpful, because it lets you spot potential
 | 
						||
problems in your data that can lead to inconsistent results later on. But it
 | 
						||
also means that you need to adjust and clean up the offsets before converting
 | 
						||
them:
 | 
						||
 | 
						||
```diff
 | 
						||
doc = nlp("I live in Berlin Kreuzberg")
 | 
						||
- entities = [(10, 26, "LOC"), (10, 16, "GPE"), (17, 26, "LOC")]
 | 
						||
+ entities = [(10, 16, "GPE"), (17, 26, "LOC")]
 | 
						||
tags = get_biluo_tags_from_offsets(doc, entities)
 | 
						||
```
 | 
						||
 | 
						||
#### Serbian language data
 | 
						||
 | 
						||
If you've been working with `Serbian` (introduced in v2.1.8), you'll need to
 | 
						||
change the language code from `rs` to the correct `sr`:
 | 
						||
 | 
						||
```diff
 | 
						||
- from spacy.lang.rs import Serbian
 | 
						||
+ from spacy.lang.sr import Serbian
 | 
						||
```
 |