spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-17 12:40:46 +03:00

Adriane Boyd eed4b785f5 Load vocab lookups tables at beginning of training

Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created.

The option moves from `nlp.load_vocab_data` to `training.lookups`.

Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided.

The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included.

To load `lexeme_norm` from `spacy-lookups-data`:

```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```

2020-09-18 15:59:16 +02:00
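As a rough illustration of the "strict" loading described in the commit message above, the sketch below returns only the tables that were explicitly requested and fails if any requested name is unavailable. It is not spaCy's implementation: `AVAILABLE_TABLES`, `load_tables_strict`, and the sample entries are hypothetical names and data introduced only for this example.

```python
from typing import Dict, List

# Hypothetical stand-in for the tables a package such as spacy-lookups-data
# might ship for one language; names and contents are illustrative only.
AVAILABLE_TABLES: Dict[str, dict] = {
    "lexeme_norm": {"cos": "because", "gonna": "going to"},
    "lexeme_prob": {"the": -3.5},
}


def load_tables_strict(available: Dict[str, dict], tables: List[str]) -> Dict[str, dict]:
    """Return exactly the requested tables, raising if any name is missing.

    Assumption: "strict" means a missing table is an error rather than
    being silently skipped.
    """
    missing = [name for name in tables if name not in available]
    if missing:
        raise ValueError(f"Missing lookup tables: {missing}")
    # Only requested tables are returned, so larger optional tables
    # are left out unless the caller lists them.
    return {name: available[name] for name in tables}


loaded = load_tables_strict(AVAILABLE_TABLES, ["lexeme_norm"])
print(sorted(loaded))
```

Listing the tables explicitly, as the `tables = ["lexeme_norm"]` setting does, keeps optional tables out unless they are asked for, and a misspelled table name surfaces as an error at load time.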
cli	Load vocab lookups tables at beginning of training	2020-09-18 15:59:16 +02:00
displacy	Refactor Docs.is_ flags (#6044)	2020-09-17 00:14:01 +02:00
lang	Refactor Docs.is_ flags (#6044)	2020-09-17 00:14:01 +02:00
matcher	Refactor Docs.is_ flags (#6044)	2020-09-17 00:14:01 +02:00
ml	Add vectors option to CharacterEmbed (#6069)	2020-09-16 17:45:04 +02:00
pipeline	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-17 09:24:36 +02:00
tests	Load vocab lookups tables at beginning of training	2020-09-18 15:59:16 +02:00
tokens	Remove W106: HEAD and SENT_START in doc.from_array (#6086)	2020-09-18 03:01:29 +02:00
training	Refactor Docs.is_ flags (#6044)	2020-09-17 00:14:01 +02:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Support vocab arg in spacy.blank	2020-09-15 11:39:36 +02:00
__main__.py	Tidy up	2020-06-22 00:45:40 +02:00
about.py	Set version to v3.0.0a19	2020-09-17 00:18:49 +02:00
attrs.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
attrs.pyx	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
compat.py	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
default_config_pretraining.cfg	generalize corpora, dot notation for dev and train corpus	2020-09-17 11:38:59 +02:00
default_config.cfg	Load vocab lookups tables at beginning of training	2020-09-18 15:59:16 +02:00
errors.py	Remove W106: HEAD and SENT_START in doc.from_array (#6086)	2020-09-18 03:01:29 +02:00
glossary.py	unicode -> str consistency	2020-05-24 17:20:58 +02:00
kb.pxd	Define candidate generator in EL config (#5876)	2020-08-18 16:10:36 +02:00
kb.pyx	Update docs links in codebase	2020-09-04 12:58:50 +02:00
language.py	Load vocab lookups tables at beginning of training	2020-09-18 15:59:16 +02:00
lexeme.pxd	Fix Lexeme.from_ptr	2020-08-10 16:43:37 +02:00
lexeme.pyx	Update docs links in codebase	2020-09-04 12:58:50 +02:00
lookups.py	Update docs links in codebase	2020-09-04 12:58:50 +02:00
morphology.pxd	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
morphology.pyx	Add Lemmatizer and simplify related components (#5848)	2020-08-07 15:27:13 +02:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Drop Python 2.7 and 3.5 (#4828)	2019-12-22 01:53:56 +01:00
pipe_analysis.py	Simplify pipe analysis	2020-08-01 13:40:06 +02:00
schemas.py	Load vocab lookups tables at beginning of training	2020-09-18 15:59:16 +02:00
scorer.py	Temporary work-around for scoring a subset of components (#6090)	2020-09-18 14:26:42 +02:00
strings.pxd	Remove 'cleanup' of strings (#6007)	2020-09-01 16:12:15 +02:00
strings.pyx	Update docs links in codebase	2020-09-04 12:58:50 +02:00
structs.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
symbols.pxd	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
symbols.pyx	Remove PRON_LEMMA symbol (#5968)	2020-08-25 14:21:29 +02:00
tokenizer.pxd	Simplify specials and cache checks (#6012)	2020-09-03 09:42:49 +02:00
tokenizer.pyx	Fix token.idx for special cases with affixes (#6035)	2020-09-13 14:05:36 +02:00
typedefs.pxd	Update spaCy for thinc 8.0.0 (#4920)	2020-01-29 17:06:46 +01:00
typedefs.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
util.py	Load vocab lookups tables at beginning of training	2020-09-18 15:59:16 +02:00
vectors.pyx	Update docs links in codebase	2020-09-04 12:58:50 +02:00
vocab.pxd	Tidy up and move noun_chunks, token_match, url_match	2020-07-22 22:18:46 +02:00
vocab.pyx	Load vocab lookups tables at beginning of training	2020-09-18 15:59:16 +02:00