spaCy/website/docs/usage/v2-3.md
Adriane Boyd c4d0209472
Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section

* Add section on tag maps
2020-06-26 14:12:29 +02:00

346 lines
13 KiB
Markdown

---
title: What's New in v2.3
teaser: New features, backwards incompatibilities and migration guide
menu:
- ['New Features', 'features']
- ['Backwards Incompatibilities', 'incompat']
- ['Migrating from v2.2', 'migrating']
---
## New Features {#features hidden="true"}
spaCy v2.3 features new pretrained models for five languages, word vectors for
all language models, and decreased model size and loading times for models with
vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
and Romanian** and updated the training data and vectors for most languages.
Model packages with vectors are about **2&times** smaller on disk and load
**2-4×** faster. For the full changelog, see the
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
For more details and a behind-the-scenes look at the new release,
[see our blog post](https://explosion.ai/blog/spacy-v2-3).
### Expanded model families with vectors {#models}
> #### Example
>
> ```bash
> python -m spacy download da_core_news_sm
> python -m spacy download ja_core_news_sm
> python -m spacy download pl_core_news_sm
> python -m spacy download ro_core_news_sm
> python -m spacy download zh_core_web_sm
> ```
With new model families for Chinese, Danish, Polish, Romanian and Chinese plus
`md` and `lg` models with word vectors for all languages, this release provides
a total of 46 model packages. For models trained using
[Universal Dependencies](https://universaldependencies.org) corpora, the
training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
<Infobox>
**Models:** [Models directory](/models) **Benchmarks: **
[Release notes](https://github.com/explosion/spaCy/releases/tag/v2.3.0)
</Infobox>
### Chinese {#chinese}
> #### Example
>
> ```python
> from spacy.lang.zh import Chinese
>
> # Load with "default" model provided by pkuseg
> cfg = {"pkuseg_model": "default", "require_pkuseg": True}
> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
>
> # Append words to user dict
> nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
> ```
This release adds support for
[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and
the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The
Chinese tokenizer can be initialized with both `pkuseg` and custom models and
the `pkuseg` user dictionary is easy to customize. Note that
[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
pre-compiled wheels for Python 3.8. See the
[usage documentation](/usage/models#chinese) for details on how to install it on
Python 3.8.
<Infobox>
**Models:** [Chinese models](/models/zh) **Usage: **
[Chinese tokenizer usage](/usage/models#chinese)
</Infobox>
### Japanese {#japanese}
The updated Japanese language class switches to
[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
installing spaCy for Japanese, which is now possible with a single command:
`pip install spacy[ja]`.
<Infobox>
**Models:** [Japanese models](/models/ja) **Usage:**
[Japanese tokenizer usage](/usage/models#japanese)
</Infobox>
### Small CLI updates
- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors
in a base model with `spacy debug-data lang train dev -b base_model`
- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
`spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
without loading a model
- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
the first iteration
## Backwards incompatibilities {#incompat}
<Infobox title="Important note on models" variant="warning">
If you've been training **your own models**, you'll need to **retrain** them
with the new version. Also don't forget to upgrade all models to the latest
versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
with models for v2.3. To check if all of your models are up to date, you can run
the [`spacy validate`](/api/cli#validate) command.
</Infobox>
> #### Install with lookups data
>
> ```bash
> $ pip install spacy[lookups]
> ```
>
> You can also install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> directly.
- If you're training new models, you'll want to install the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
now includes both the lemmatization tables (as in v2.2) and the normalization
tables (new in v2.3). If you're using pretrained models, **nothing changes**,
because the relevant tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained
part-of-speech tags will change for many provided language models. The
coarse-grained part-of-speech tagset remains the same, but the mapping from
particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
tagsets contain new merged tags related to contracted forms, such as `ADP_DET`
for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This
increases the accuracy of the models by improving the alignment between
spaCy's tokenization and Universal Dependencies multi-word tokens used for
contractions.
### Migrating from spaCy 2.2 {#migrating}
#### Tokenizer settings
In spaCy v2.2.2-v2.2.4, there was a change to the precedence of `token_match`
that gave prefixes and suffixes priority over `token_match`, which caused
problems for many custom tokenizer configurations. This has been reverted in
v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
and earlier versions.
A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a
comma at the end of a URL) before applying the match. See the full
[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
[`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
debugging your tokenizer configuration.
#### Warnings configuration
spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
setting `SPACY_WARNING_IGNORE`, use the [`warnings`
filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.
```diff
import spacy
+ import warnings
- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
+ warnings.filterwarnings("ignore", message=r"\\[W007\\]", category=UserWarning)
```
#### Normalization tables
The normalization tables have moved from the language data in
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the
package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
If you're adding data for a new language, the normalization table should be
added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).
#### No preloaded vocab for models with vectors
To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
loaded on initialization for models with vectors. As you process texts, the
lexemes will be added to the vocab automatically, just as in small models
without vectors.
To see the number of unique vectors and number of words with vectors, see
`nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000`
unique vectors and `684830` words with vectors:
```python
{
'width': 300,
'vectors': 20000,
'keys': 684830,
'name': 'en_core_web_md.vectors'
}
```
If required, for instance if you are working directly with word vectors rather
than processing texts, you can load all lexemes for words with vectors at once:
```python
for orth in nlp.vocab.vectors:
_ = nlp.vocab[orth]
```
If your workflow previously iterated over `nlp.vocab`, a similar alternative
is to iterate over words with vectors instead:
```diff
- lexemes = [w for w in nlp.vocab]
+ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
```
Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
provided lexemes but only 685K words with vectors. The vectors have been
updated for most languages in v2.2, but the English models contain the same
vectors for both v2.2 and v2.3.
#### Lexeme.is_oov and Token.is_oov
<Infobox title="Important note" variant="warning">
Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
fixed in the next patch release v2.3.1.
</Infobox>
In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
have a word vector. This is equivalent to `token.orth not in
nlp.vocab.vectors`.
Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
probability and cluster features. The probability and cluster features are no
longer included in the provided medium and large models (see the next section).
#### Probability and cluster features
> #### Load and save extra prob lookups table
>
> ```python
> from spacy.lang.en import English
> nlp = English()
> doc = nlp("the")
> print(doc[0].prob) # lazily loads extra prob table
> nlp.to_disk("/path/to/model") # includes prob table
> ```
The `Token.prob` and `Token.cluster` features, which are no longer used by the
core pipeline components as of spaCy v2, are no longer provided in the
pretrained models to reduce the model size. To keep these features available for
users relying on them, the `prob` and `cluster` features for the most frequent
1M tokens have been moved to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
`extra` features for the relevant languages (English, German, Greek and
Spanish).
The extra tables are loaded lazily, so if you have `spacy-lookups-data`
installed and your code accesses `Token.prob`, the full table is loaded into the
model vocab, which will take a few seconds on initial loading. When you save
this model after loading the `prob` table, the full `prob` table will be saved
as part of the model vocab.
To load the probability table into a provided model, first make sure you have
`spacy-lookups-data` installed. To load the table, remove the empty provided
`lexeme_prob` table and then access `Lexeme.prob` for any word to load the
table from `spacy-lookups-data`:
```diff
+ # prerequisite: pip install spacy-lookups-data
import spacy
nlp = spacy.load("en_core_web_md")
# remove the empty placeholder prob table
+ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
+ nlp.vocab.lookups_extra.remove_table("lexeme_prob")
# access any `.prob` to load the full table into the model
assert nlp.vocab["a"].prob == -3.9297883511
# if desired, save this model with the probability table included
nlp.to_disk("/path/to/model")
```
If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
[`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
`lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
currently only used to provide a custom `oov_prob`. See examples in the
[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
in `spacy-lookups-data`.
#### Initializing new models without extra lookups tables
When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
the `prob` table from `spacy-lookups-data` may be loaded as part of the
initialization. If you'd like to omit this extra data as in spaCy's provided
v2.3 models, use the new flag `--omit-extra-lookups`.
#### Tag maps in provided models vs. blank models
The tag maps in the provided models may differ from the tag maps in the spaCy
library. You can access the tag map in a loaded model under
`nlp.vocab.morphology.tag_map`.
The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
initialized. If you want to provide an alternate tag map, update
`nlp.vocab.morphology.tag_map` after initializing the model or if you're using
the [train CLI](/api/cli#train), you can use the new `--tag-map-path` option to
provide in the tag map as a JSON dict.
If you want to export a tag map from a provided model for use with the train
CLI, you can save it as a JSON dict. To only use string keys as required by
JSON and to make it easier to read and edit, any internal integer IDs need to
be converted back to strings:
```python
import spacy
import srsly
nlp = spacy.load("en_core_web_sm")
tag_map = {}
# convert any integer IDs to strings for JSON
for tag, morph in nlp.vocab.morphology.tag_map.items():
tag_map[tag] = {}
for feat, val in morph.items():
feat = nlp.vocab.strings.as_string(feat)
if not isinstance(val, bool):
val = nlp.vocab.strings.as_string(val)
tag_map[tag][feat] = val
srsly.write_json("tag_map.json", tag_map)
```