Extend what's new in v2.3 with vocab / is_oov (#5635)

Adriane Boyd 2020-06-23 16:48:59 +02:00 committed by GitHub
parent d94e961f14
commit 7ce451c211


@ -182,6 +182,51 @@ If you're adding data for a new language, the normalization table should be
added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).
#### No preloaded lexemes/vocab for models with vectors
To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
loaded on initialization for models with vectors. As you process texts, the
lexemes will be added to the vocab automatically, just as in models without
vectors.

To see the number of unique vectors and the number of words with vectors, check
`nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique vectors
and `684830` words with vectors:
```python
{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
```
If required, for instance if you are working directly with word vectors rather
than processing texts, you can load all lexemes for words with vectors at once:
```python
# Looking up a key in the vocab creates its lexeme if it doesn't exist yet
for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
```
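
As a rough illustration of the loop above, the following sketch uses a blank
English pipeline and adds two toy vectors with `Vocab.set_vector`, so it runs
without a pretrained model. The words, the 3-dimensional vectors and the helper
name `load_lexemes_with_vectors` are assumptions made for this example. With a
real model you would pass the loaded `nlp.vocab` instead.

```python
import numpy
import spacy


def load_lexemes_with_vectors(vocab):
    """Create a lexeme for every key in the vectors table (hypothetical helper)."""
    for orth in vocab.vectors:
        _ = vocab[orth]
    # Number of lexemes currently stored in the vocab
    return len(vocab)


# Toy setup (assumption): a blank pipeline with two made-up vectors
nlp = spacy.blank("en")
nlp.vocab.set_vector("apple", numpy.asarray([1.0, 0.0, 0.0], dtype="float32"))
nlp.vocab.set_vector("banana", numpy.asarray([0.0, 1.0, 0.0], dtype="float32"))

n_before = len(nlp.vocab)
n_after = load_lexemes_with_vectors(nlp.vocab)
print(n_before, n_after)
```

The exact counts depend on the spaCy version and language data, so the sketch
only prints them rather than assuming particular values.
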
#### Lexeme.is_oov and Token.is_oov
<Infobox title="Important note" variant="warning">
Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
fixed in the next patch release v2.3.1.
</Infobox>
In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
have a word vector. This is equivalent to `token.orth not in
nlp.vocab.vectors`.
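
A minimal sketch of this check, assuming a blank English pipeline with one toy
vector added via `Vocab.set_vector` (the words and the vector values are made
up for illustration). It prints `token.is_oov` next to the membership test so
the two can be compared. On v2.3.0 the two columns are expected to disagree
because of the bug mentioned in the note above.

```python
import numpy
import spacy

# Toy setup (assumption): only "apple" gets a vector
nlp = spacy.blank("en")
nlp.vocab.set_vector("apple", numpy.asarray([1.0, 0.0, 0.0], dtype="float32"))

doc = nlp("apple zzyzx")
rows = []
for token in doc:
    # True if the token's orth hash is not a key in the vectors table
    no_vector = token.orth not in nlp.vocab.vectors
    rows.append((token.text, no_vector))
    print(token.text, token.is_oov, no_vector)
```
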

Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
probability and cluster features. The probability and cluster features are no
longer included in the provided medium and large models (see the next section).
#### Probability and cluster features
> #### Load and save extra prob lookups table