---
title: What's New in v2.3
teaser: New features, backwards incompatibilities and migration guide
menu:
  - ['New Features', 'features']
  - ['Backwards Incompatibilities', 'incompat']
  - ['Migrating from v2.2', 'migrating']
---

## New Features {#features hidden="true"}

spaCy v2.3 features new pretrained models for five languages, word vectors for
all language models, and decreased model size and loading times for models with
vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
and Romanian** and updated the training data and vectors for most languages.
Model packages with vectors are about **2×** smaller on disk and load
**2-4×** faster. For the full changelog, see the
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
For more details and a behind-the-scenes look at the new release,
[see our blog post](https://explosion.ai/blog/spacy-v2-3).

### Expanded model families with vectors {#models}

> #### Example
>
> ```bash
> python -m spacy download da_core_news_sm
> python -m spacy download ja_core_news_sm
> python -m spacy download pl_core_news_sm
> python -m spacy download ro_core_news_sm
> python -m spacy download zh_core_web_sm
> ```

With new model families for Chinese, Danish, Japanese, Polish and Romanian plus
`md` and `lg` models with word vectors for all languages, this release provides
a total of 46 model packages. For models trained using
[Universal Dependencies](https://universaldependencies.org) corpora, the
training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
and Dutch has been extended to include both UD Dutch Alpino and LassySmall.

<Infobox>

**Models:** [Models directory](/models) **Benchmarks:**
[Release notes](https://github.com/explosion/spaCy/releases/tag/v2.3.0)

</Infobox>

### Chinese {#chinese}

> #### Example
>
> ```python
> from spacy.lang.zh import Chinese
>
> # Load with "default" model provided by pkuseg
> cfg = {"pkuseg_model": "default", "require_pkuseg": True}
> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
>
> # Append words to user dict
> nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
> ```

This release adds support for
[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and
the new Chinese models ship with a custom pkuseg model trained on OntoNotes.
The Chinese tokenizer can be initialized with both `pkuseg` and custom models
and the `pkuseg` user dictionary is easy to customize. Note that
[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
pre-compiled wheels for Python 3.8. See the
[usage documentation](/usage/models#chinese) for details on how to install it
on Python 3.8.

<Infobox>

**Models:** [Chinese models](/models/zh) **Usage:**
[Chinese tokenizer usage](/usage/models#chinese)

</Infobox>

### Japanese {#japanese}

The updated Japanese language class switches to
[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
installing spaCy for Japanese, which is now possible with a single command:
`pip install spacy[ja]`.

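Once the `ja` extra is installed, a minimal sketch of segmenting Japanese text
with the blank language class looks like this (the example sentence is
arbitrary; word segmentation is handled by `SudachiPy` under the hood):

```python
import spacy

# Requires `pip install spacy[ja]`; the blank Japanese class delegates word
# segmentation to SudachiPy.
nlp = spacy.blank("ja")
doc = nlp("日本語のテキストを解析します。")
print([token.text for token in doc])
```
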
<Infobox>

**Models:** [Japanese models](/models/ja) **Usage:**
[Japanese tokenizer usage](/usage/models#japanese)

</Infobox>

### Small CLI updates

- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the
  vectors in a base model with `spacy debug-data lang train dev -b base_model`
- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
  `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
  without loading a model
- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
  the first iteration

## Backwards incompatibilities {#incompat}

<Infobox title="Important note on models" variant="warning">

If you've been training **your own models**, you'll need to **retrain** them
with the new version. Also don't forget to upgrade all models to the latest
versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
with models for v2.3. To check if all of your models are up to date, you can
run the [`spacy validate`](/api/cli#validate) command.

</Infobox>

> #### Install with lookups data
>
> ```bash
> $ pip install spacy[lookups]
> ```
>
> You can also install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> directly.

- If you're training new models, you'll want to install the package
  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data),
  which now includes both the lemmatization tables (as in v2.2) and the
  normalization tables (new in v2.3). If you're using pretrained models,
  **nothing changes**, because the relevant tables are included in the model
  packages.
- Due to the updated Universal Dependencies training data, the fine-grained
  part-of-speech tags will change for many provided language models. The
  coarse-grained part-of-speech tagset remains the same, but the mapping from
  particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
  tagsets contain new merged tags related to contracted forms, such as
  `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head
  `"à"`. This increases the accuracy of the models by improving the alignment
  between spaCy's tokenization and Universal Dependencies multi-word tokens
  used for contractions (see the sketch after this list).

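For illustration only, here's a minimal sketch of inspecting one of these
merged tags, assuming the v2.3 `fr_core_news_sm` model is installed (the exact
tag predicted depends on the model and context):

```python
import spacy

# Assumes the v2.3 French model is installed:
# python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")
doc = nlp("Il va au marché.")
token = doc[2]  # "au"
# Expect a merged fine-grained tag such as "ADP_DET" and the UPOS "ADP"
print(token.text, token.tag_, token.pos_)
```
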
### Migrating from spaCy 2.2 {#migrating}

#### Tokenizer settings

In spaCy v2.2.2-v2.2.4, there was a change to the precedence of `token_match`
that gave prefixes and suffixes priority over `token_match`, which caused
problems for many custom tokenizer configurations. This has been reverted in
v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
and earlier versions.

A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
cases like URLs where the tokenizer should remove prefixes and suffixes (e.g.,
a comma at the end of a URL) before applying the match. See the full
[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
[`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
debugging your tokenizer configuration.

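For example, a quick sketch of debugging the tokenizer with `explain()`, which
reports the pattern responsible for each resulting substring (the example text
is arbitrary):

```python
import spacy

nlp = spacy.blank("en")
# Each entry pairs the rule that matched (prefix, suffix, token_match,
# url_match, special case, ...) with the resulting substring.
for pattern, substring in nlp.tokenizer.explain("Visit https://spacy.io, it's great!"):
    print(pattern, substring)
```
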
#### Warnings configuration

spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
setting `SPACY_WARNING_IGNORE`, use the
[`warnings` filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.

```diff
import spacy
+ import warnings

- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
+ warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)
```

#### Normalization tables

The normalization tables have moved from the language data in
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to
the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If
you're adding data for a new language, the normalization table should be added
to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).

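As a quick sanity check, a sketch of listing which lookup tables ended up in
the vocab (the exact table names depend on your spaCy version and on whether
`spacy-lookups-data` is installed):

```python
import spacy

# With spacy-lookups-data installed, languages that provide normalization data
# expose it through the vocab lookups (e.g. a "lexeme_norm" table).
nlp = spacy.blank("en")
print(nlp.vocab.lookups.tables)
```
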
#### No preloaded vocab for models with vectors

To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
loaded on initialization for models with vectors. As you process texts, the
lexemes will be added to the vocab automatically, just as in small models
without vectors.

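You can see this lazy behavior directly; a minimal sketch, assuming a v2.3
model with vectors such as `en_core_web_md` is installed:

```python
import spacy

nlp = spacy.load("en_core_web_md")
print(len(nlp.vocab))  # only a small number of lexemes right after loading
doc = nlp("Lexemes are added to the vocab as texts are processed.")
print(len(nlp.vocab))  # the count grows with the words seen so far
```
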
The number of unique vectors and the number of words with vectors are listed
in `nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique
vectors and `684830` words with vectors:

```python
{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
```

If required, for instance if you are working directly with word vectors rather
than processing texts, you can load all lexemes for words with vectors at once:

```python
# Looking up each key in nlp.vocab.vectors adds the corresponding lexeme
for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
```

If your workflow previously iterated over `nlp.vocab`, a similar alternative
is to iterate over words with vectors instead:

```diff
- lexemes = [w for w in nlp.vocab]
+ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
```

Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
provided lexemes but only 685K words with vectors. The vectors have been
updated for most languages in v2.3, but the English models contain the same
vectors in both v2.2 and v2.3.

#### Lexeme.is_oov and Token.is_oov

<Infobox title="Important note" variant="warning">

Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
fixed in the next patch release v2.3.1.

</Infobox>

In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
have a word vector. This is equivalent to
`token.orth not in nlp.vocab.vectors`.

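A quick way to inspect this relationship, assuming a model with vectors such
as `en_core_web_md` (keep in mind that the values are inverted in v2.3.0
because of the bug noted above):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("apple blorptastic")
for token in doc:
    # In v2.3, is_oov should mirror whether the word has a vector
    print(token.text, token.is_oov, token.orth in nlp.vocab.vectors)
```
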
Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
probability and cluster features. The probability and cluster features are no
longer included in the provided medium and large models (see the next section).

#### Probability and cluster features

> #### Load and save extra prob lookups table
>
> ```python
> from spacy.lang.en import English
> nlp = English()
> doc = nlp("the")
> print(doc[0].prob)  # lazily loads extra prob table
> nlp.to_disk("/path/to/model")  # includes prob table
> ```

The `Token.prob` and `Token.cluster` features, which are no longer used by the
core pipeline components as of spaCy v2, are no longer provided in the
pretrained models to reduce the model size. To keep these features available
for users relying on them, the `prob` and `cluster` features for the most
frequent 1M tokens have been moved to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
`extra` features for the relevant languages (English, German, Greek and
Spanish).

The extra tables are loaded lazily, so if you have `spacy-lookups-data`
installed and your code accesses `Token.prob`, the full table is loaded into
the model vocab, which will take a few seconds on initial loading. When you
save this model after loading the `prob` table, the full `prob` table will be
saved as part of the model vocab.

To load the probability table into a provided model, first make sure you have
`spacy-lookups-data` installed. To load the table, remove the empty provided
`lexeme_prob` table and then access `Lexeme.prob` for any word to load the
table from `spacy-lookups-data`:

```diff
+ # prerequisite: pip install spacy-lookups-data
import spacy

nlp = spacy.load("en_core_web_md")

# remove the empty placeholder prob table
+ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
+     nlp.vocab.lookups_extra.remove_table("lexeme_prob")

# access any `.prob` to load the full table into the model
assert nlp.vocab["a"].prob == -3.9297883511

# if desired, save this model with the probability table included
nlp.to_disk("/path/to/model")
```

If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as
part of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
[`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
`lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
currently only used to provide a custom `oov_prob`. See examples in the
[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
in `spacy-lookups-data`.

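As a rough sketch of the second option, with made-up table values purely for
illustration:

```python
from spacy.lang.en import English
from spacy.lookups import Lookups
from spacy.vocab import Vocab

# Hypothetical values; the real tables in spacy-lookups-data cover the most
# frequent tokens per language.
lookups = Lookups()
lookups.add_table("lexeme_prob", {"the": -2.5, "apple": -10.0})
lookups.add_table("lexeme_settings", {"oov_prob": -20.0})

nlp = English(vocab=Vocab(lookups_extra=lookups))
print(nlp.vocab["apple"].prob)
```
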
#### Initializing new models without extra lookups tables

When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
the `prob` table from `spacy-lookups-data` may be loaded as part of the
initialization. If you'd like to omit this extra data as in spaCy's provided
v2.3 models, use the new flag `--omit-extra-lookups`.

#### Tag maps in provided models vs. blank models

The tag maps in the provided models may differ from the tag maps in the spaCy
library. You can access the tag map in a loaded model under
`nlp.vocab.morphology.tag_map`.

The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
initialized. If you want to provide an alternate tag map, update
`nlp.vocab.morphology.tag_map` after initializing the model, or, if you're
using the [train CLI](/api/cli#train), use the new `--tag-map-path` option to
provide the tag map as a JSON dict.

If you want to export a tag map from a provided model for use with the train
CLI, you can save it as a JSON dict. To only use string keys as required by
JSON and to make it easier to read and edit, any internal integer IDs need to
be converted back to strings:

```python
import spacy
import srsly

nlp = spacy.load("en_core_web_sm")
tag_map = {}

# convert any integer IDs to strings for JSON
for tag, morph in nlp.vocab.morphology.tag_map.items():
    tag_map[tag] = {}
    for feat, val in morph.items():
        feat = nlp.vocab.strings.as_string(feat)
        if not isinstance(val, bool):
            val = nlp.vocab.strings.as_string(val)
        tag_map[tag][feat] = val

srsly.write_json("tag_map.json", tag_map)
```