mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 10:16:27 +03:00
Extend v2.3 migration guide (#5653)
* Extend preloaded vocab section
* Add section on tag maps
parent 90c7eb0e2f
commit c4d0209472
@@ -182,12 +182,12 @@ If you're adding data for a new language, the normalization table should be
added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).

-#### No preloaded lexemes/vocab for models with vectors
+#### No preloaded vocab for models with vectors

To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
loaded on initialization for models with vectors. As you process texts, the
-lexemes will be added to the vocab automatically, just as in models without
-vectors.
+lexemes will be added to the vocab automatically, just as in small models
+without vectors.

To see the number of unique vectors and number of words with vectors, see
`nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000`
@@ -210,6 +210,20 @@ for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
```

If your workflow previously iterated over `nlp.vocab`, a similar alternative
is to iterate over words with vectors instead:

```diff
- lexemes = [w for w in nlp.vocab]
+ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
```

Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
provided lexemes but only 685K words with vectors. The vectors have been
updated for most languages in v2.2, but the English models contain the same
vectors for both v2.2 and v2.3.

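One way to gauge this difference for your own data is to check which words from
a previously saved word list still have vectors. The following is a minimal
sketch, not part of spaCy: `split_by_vector_coverage` is a hypothetical helper
name, and it only assumes that the vocab object exposes `has_vector`, as
`nlp.vocab` does.

```python
def split_by_vector_coverage(words, vocab):
    """Partition words by whether the vocab reports a vector for them.

    `vocab` is assumed to expose `has_vector(word)`, as `nlp.vocab` does.
    """
    covered, uncovered = [], []
    for word in words:
        if vocab.has_vector(word):
            covered.append(word)
        else:
            uncovered.append(word)
    return covered, uncovered
```

With a loaded model, `split_by_vector_coverage(old_words, nlp.vocab)` would
return the words that have vectors first, followed by the words that do not.
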
#### Lexeme.is_oov and Token.is_oov

<Infobox title="Important note" variant="warning">
@@ -254,6 +268,28 @@ model vocab, which will take a few seconds on initial loading. When you save
this model after loading the `prob` table, the full `prob` table will be saved
as part of the model vocab.

To load the probability table into a provided model, first make sure
`spacy-lookups-data` is installed. Then remove the empty `lexeme_prob`
placeholder table and access `Lexeme.prob` for any word, which loads the full
table from `spacy-lookups-data`:

```diff
+ # prerequisite: pip install spacy-lookups-data
import spacy

nlp = spacy.load("en_core_web_md")

# remove the empty placeholder prob table
+ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
+     nlp.vocab.lookups_extra.remove_table("lexeme_prob")

# access any `.prob` to load the full table into the model
assert nlp.vocab["a"].prob == -3.9297883511

# if desired, save this model with the probability table included
nlp.to_disk("/path/to/model")
```

If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
@@ -271,3 +307,39 @@ When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
the `prob` table from `spacy-lookups-data` may be loaded as part of the
initialization. If you'd like to omit this extra data as in spaCy's provided
v2.3 models, use the new flag `--omit-extra-lookups`.

#### Tag maps in provided models vs. blank models

The tag maps in the provided models may differ from the tag maps in the spaCy
library. You can access the tag map in a loaded model under
`nlp.vocab.morphology.tag_map`.

The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
initialized. If you want to provide an alternate tag map, update
`nlp.vocab.morphology.tag_map` after initializing the model. If you're using
the [train CLI](/api/cli#train), you can use the new `--tag-map-path` option to
provide the tag map as a JSON dict.

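Before passing a file to `--tag-map-path`, it can help to check that it really
is a JSON dict with the expected shape. The following is a minimal sketch using
only the standard library: `load_tag_map` is a hypothetical helper, and the
shape it checks (each tag mapping to a dict of features) is an assumption based
on the tag maps described in this section, not a validation rule documented by
spaCy.

```python
import json


def load_tag_map(path):
    """Read a tag map from a JSON file and check its basic shape."""
    with open(path, encoding="utf8") as f:
        tag_map = json.load(f)
    if not isinstance(tag_map, dict):
        raise ValueError("tag map must be a JSON dict")
    for tag, morph in tag_map.items():
        # assumption: every tag maps to a dict of feature -> value
        if not isinstance(morph, dict):
            raise ValueError(f"entry for tag {tag!r} must be a dict")
    return tag_map
```

A file that passes this check could then be handed to the train CLI, or the
returned dict used when updating `nlp.vocab.morphology.tag_map`.
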
If you want to export a tag map from a provided model for use with the train
CLI, you can save it as a JSON dict. JSON requires string keys, so any internal
integer IDs need to be converted back to strings, which also makes the file
easier to read and edit:

```python
import spacy
import srsly

nlp = spacy.load("en_core_web_sm")
tag_map = {}

# convert any integer IDs to strings for JSON
for tag, morph in nlp.vocab.morphology.tag_map.items():
    tag_map[tag] = {}
    for feat, val in morph.items():
        feat = nlp.vocab.strings.as_string(feat)
        if not isinstance(val, bool):
            val = nlp.vocab.strings.as_string(val)
        tag_map[tag][feat] = val

srsly.write_json("tag_map.json", tag_map)
```