spaCy/spacy
Adriane Boyd eed4b785f5 Load vocab lookups tables at beginning of training
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.

The option moves from `nlp.load_vocab_data` to `training.lookups`.

Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.

The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.
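
As a rough illustration of what "strict" loading means here, the sketch below returns exactly the requested tables and fails if any is missing. It is not spaCy's implementation: `AVAILABLE_TABLES` and `load_tables_strict` are hypothetical names, and the table contents are made-up placeholders standing in for data that would normally come from `spacy-lookups-data`.

```python
from typing import Dict, Iterable

# Hypothetical in-memory stand-in for the tables a data package might ship.
# The entries are placeholders, not real lookups data.
AVAILABLE_TABLES: Dict[str, Dict[str, dict]] = {
    "en": {
        "lexeme_norm": {"cos": "because"},
        "lexeme_prob": {"the": -3.5},
    }
}


def load_tables_strict(lang: str, tables: Iterable[str]) -> Dict[str, dict]:
    """Return exactly the requested tables, raising if any is unavailable."""
    requested = list(tables)  # materialize once so generators are safe
    available = AVAILABLE_TABLES.get(lang, {})
    missing = [name for name in requested if name not in available]
    if missing:
        # Strict mode: fail loudly instead of silently skipping tables.
        raise ValueError(f"Missing tables for '{lang}': {missing}")
    return {name: available[name] for name in requested}


if __name__ == "__main__":
    # Only the requested table is loaded; larger tables stay out by default.
    print(sorted(load_tables_strict("en", ["lexeme_norm"])))
```

Because nothing is loaded implicitly, a config that omits the larger tables never pays for them, and a table name that is misspelled or unavailable surfaces as an error at the start of training.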

To load `lexeme_norm` from `spacy-lookups-data`:

```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
2020-09-18 15:59:16 +02:00
cli Load vocab lookups tables at beginning of training 2020-09-18 15:59:16 +02:00
displacy Refactor Docs.is_ flags (#6044) 2020-09-17 00:14:01 +02:00
lang Refactor Docs.is_ flags (#6044) 2020-09-17 00:14:01 +02:00
matcher Refactor Docs.is_ flags (#6044) 2020-09-17 00:14:01 +02:00
ml Add vectors option to CharacterEmbed (#6069) 2020-09-16 17:45:04 +02:00
pipeline Merge remote-tracking branch 'upstream/develop' into fix/corpus 2020-09-17 09:24:36 +02:00
tests Load vocab lookups tables at beginning of training 2020-09-18 15:59:16 +02:00
tokens Remove W106: HEAD and SENT_START in doc.from_array (#6086) 2020-09-18 03:01:29 +02:00
training Refactor Docs.is_ flags (#6044) 2020-09-17 00:14:01 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Support vocab arg in spacy.blank 2020-09-15 11:39:36 +02:00
__main__.py Tidy up 2020-06-22 00:45:40 +02:00
about.py Set version to v3.0.0a19 2020-09-17 00:18:49 +02:00
attrs.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
attrs.pyx Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
compat.py Tidy up, autoformat, add types 2020-07-25 15:01:15 +02:00
default_config_pretraining.cfg generalize corpora, dot notation for dev and train corpus 2020-09-17 11:38:59 +02:00
default_config.cfg Load vocab lookups tables at beginning of training 2020-09-18 15:59:16 +02:00
errors.py Remove W106: HEAD and SENT_START in doc.from_array (#6086) 2020-09-18 03:01:29 +02:00
glossary.py unicode -> str consistency 2020-05-24 17:20:58 +02:00
kb.pxd Define candidate generator in EL config (#5876) 2020-08-18 16:10:36 +02:00
kb.pyx Update docs links in codebase 2020-09-04 12:58:50 +02:00
language.py Load vocab lookups tables at beginning of training 2020-09-18 15:59:16 +02:00
lexeme.pxd Fix Lexeme.from_ptr 2020-08-10 16:43:37 +02:00
lexeme.pyx Update docs links in codebase 2020-09-04 12:58:50 +02:00
lookups.py Update docs links in codebase 2020-09-04 12:58:50 +02:00
morphology.pxd Add Lemmatizer and simplify related components (#5848) 2020-08-07 15:27:13 +02:00
morphology.pyx Add Lemmatizer and simplify related components (#5848) 2020-08-07 15:27:13 +02:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Drop Python 2.7 and 3.5 (#4828) 2019-12-22 01:53:56 +01:00
pipe_analysis.py Simplify pipe analysis 2020-08-01 13:40:06 +02:00
schemas.py Load vocab lookups tables at beginning of training 2020-09-18 15:59:16 +02:00
scorer.py Temporary work-around for scoring a subset of components (#6090) 2020-09-18 14:26:42 +02:00
strings.pxd Remove 'cleanup' of strings (#6007) 2020-09-01 16:12:15 +02:00
strings.pyx Update docs links in codebase 2020-09-04 12:58:50 +02:00
structs.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
symbols.pxd Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
symbols.pyx Remove PRON_LEMMA symbol (#5968) 2020-08-25 14:21:29 +02:00
tokenizer.pxd Simplify specials and cache checks (#6012) 2020-09-03 09:42:49 +02:00
tokenizer.pyx Fix token.idx for special cases with affixes (#6035) 2020-09-13 14:05:36 +02:00
typedefs.pxd Update spaCy for thinc 8.0.0 (#4920) 2020-01-29 17:06:46 +01:00
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py Load vocab lookups tables at beginning of training 2020-09-18 15:59:16 +02:00
vectors.pyx Update docs links in codebase 2020-09-04 12:58:50 +02:00
vocab.pxd Tidy up and move noun_chunks, token_match, url_match 2020-07-22 22:18:46 +02:00
vocab.pyx Load vocab lookups tables at beginning of training 2020-09-18 15:59:16 +02:00