Extend what's new in v2.3 with vocab / is_oov (#5635)

2026-02-18 05:00:41 +03:00 · 2020-06-23 16:48:59 +02:00 · 2020-06-23 16:48:59 +02:00 · 7ce451c211
commit 7ce451c211
parent d94e961f14
1 changed files with 45 additions and 0 deletions
--- a/website/docs/usage/v2-3.md
+++ b/website/docs/usage/v2-3.md
@@ -182,6 +182,51 @@ If you're adding data for a new language, the normalization table should be
 added to `spacy-lookups-data`. See
 [adding norm exceptions](/usage/adding-languages#norm-exceptions).

+#### No preloaded lexemes/vocab for models with vectors
+
+To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
+loaded on initialization for models with vectors. As you process texts, the
+lexemes will be added to the vocab automatically, just as in models without
+vectors.
+
+To see the number of unique vectors and the number of words with vectors, check
+`nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique vectors
+and `684830` words with vectors:
+
+```python
+{
+    'width': 300,
+    'vectors': 20000,
+    'keys': 684830,
+    'name': 'en_core_web_md.vectors'
+}
+```
+
+If required, for instance if you are working directly with word vectors rather
+than processing texts, you can load all lexemes for words with vectors at once:
+
+```python
+for orth in nlp.vocab.vectors:
+    _ = nlp.vocab[orth]
+```
+
+#### Lexeme.is_oov and Token.is_oov
+
+<Infobox title="Important note" variant="warning">
+
+Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
+fixed in the next patch release v2.3.1.
+
+</Infobox>
+
+In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
+have a word vector. This is equivalent to `token.orth not in
+nlp.vocab.vectors`.
+
+Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
+probability and cluster features. The probability and cluster features are no
+longer included in the provided medium and large models (see the next section).
+
 #### Probability and cluster features

 > #### Load and save extra prob lookups table