spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 02:16:32 +03:00

Author	SHA1	Message	Date
Ines Montani	b7111da1d7	Update config and commands	2020-07-11 13:03:53 +02:00
Ines Montani	f99ce7fbfb	Make validation errors more elegant	2020-07-10 23:34:17 +02:00
Ines Montani	0389c34b81	Merge branch 'develop' into feature/refactor-config-args	2020-07-10 20:51:52 +02:00
Ines Montani	9fe1fa88ad	Fix typo	2020-07-10 20:32:37 +02:00
Ines Montani	defe1e7213	Pretty-print config validation errors	2020-07-10 20:01:20 +02:00
Sofie Van Landeghem	de6a32315c	debug-model script (#5749 ) * adding debug-model to print the internals for debugging purposes * expend debug-model script with 4 stages: before, init, train, predict * avoid enforcing to have a seed in the train script * small fixes	2020-07-10 19:47:53 +02:00
Ines Montani	a3667394b4	Integrate with latest Thinc and config overrides	2020-07-10 19:47:05 +02:00
Ines Montani	73332ddb67	Update CLI commans to use one shared util file	2020-07-10 17:57:40 +02:00
Ines Montani	240e0a62ca	Update with WIP	2020-07-10 13:31:27 +02:00
Ines Montani	05e182e421	Update CLI args and docstrings	2020-07-09 19:44:28 +02:00
Matthw Honnibal	77af0a6bb4	Offer option of padding-sensitive batching	2020-07-09 14:50:20 +02:00
Matthw Honnibal	1b20ffac38	batch_by_words by default	2020-07-08 21:37:06 +02:00
Matthw Honnibal	42e1109def	Support option to not batch by number of words	2020-07-08 11:26:54 +02:00
Ines Montani	fa261d09e8	Add alternative CLI option	2020-07-06 15:57:38 +02:00
Ines Montani	412dbb1f38	Remove dead and/or deprecated code (#5710 ) * Remove dead and/or deprecated code * Remove n_threads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-06 13:06:25 +02:00
Sofie Van Landeghem	fcbf899b08	Feature/example only (#5707 ) * remove _convert_examples * fix test_gold, raise TypeError if tuples are used instead of Example's * throwing proper errors when the wrong type of objects are passed * fix deprectated format in tests * fix deprectated format in parser tests * fix tests for NEL, morph, senter, tagger, textcat * update regression tests with new Example format * use make_doc * more fixes to nlp.update calls * few more small fixes for rehearse and evaluate * only import ml_datasets if really necessary	2020-07-06 13:02:36 +02:00
Ines Montani	37c3bb35e2	Auto-format	2020-07-04 16:25:34 +02:00
Matthw Honnibal	6a0a27e5c2	Fix max_steps	2020-07-01 18:08:14 +02:00
Matthw Honnibal	2fa56484b2	Fix eval batch size	2020-07-01 15:16:25 +02:00
Matthw Honnibal	c5d12d1a22	Allow batch size to be set for evaluation in spacy train	2020-07-01 15:04:36 +02:00
Matthw Honnibal	8c5a88e777	Fix per-epoch shuffling	2020-07-01 01:02:35 +02:00
Matthew Honnibal	8c29268749	Improve spacy.gold (no GoldParse, no json format!) (#5555 ) * Update errors * Remove beam for now (maybe) Remove beam_utils Update setup.py Remove beam * Remove GoldParse WIP on removing goldparse Get ArcEager compiling after GoldParse excise Update setup.py Get spacy.syntax compiling after removing GoldParse Rename NewExample -> Example and clean up Clean html files Start updating tests Update Morphologizer * fix error numbers * fix merge conflict * informative error when calling to_array with wrong field * fix error catching * fixing language and scoring tests * start testing get_aligned * additional tests for new get_aligned function * Draft create_gold_state for arc_eager oracle * Fix import * Fix import * Remove TokenAnnotation code from nonproj * fixing NER one-to-many alignment * Fix many-to-one IOB codes * fix test for misaligned * attempt to fix cases with weird spaces * fix spaces * test_gold_biluo_different_tokenization works * allow None as BILUO annotation * fixed some tests + WIP roundtrip unit test * add spaces to json output format * minibatch utiltiy can deal with strings, docs or examples * fix augment (needs further testing) * various fixes in scripts - needs to be further tested * fix test_cli * cleanup * correct silly typo * add support for MORPH in to/from_array, fix morphologizer overfitting test * fix tagger * fix entity linker * ensure test keeps working with non-linked entities * pipe() takes docs, not examples * small bug fix * textcat bugfix * throw informative error when running the components with the wrong type of objects * fix parser tests to work with example (most still failing) * fix BiluoPushDown parsing entities * small fixes * bugfix tok2vec * fix renames and simple_ner labels * various small fixes * prevent writing dummy values like deps because that could interfer with sent_start values * fix the fix * implement split_sent with aligned SENT_START attribute * test for split sentences with various alignment issues, works * Return ArcEagerGoldParse from ArcEager * Update parser and NER gold stuff * Draft new GoldCorpus class * add links to to_dict * clean up * fix test checking for variants * Fix oracles * Start updating converters * Move converters under spacy.gold * Move things around * Fix naming * Fix name * Update converter to produce DocBin * Update converters * Allow DocBin to take list of Doc objects. * Make spacy convert output docbin * Fix import * Fix docbin * Fix compile in ArcEager * Fix import * Serialize all attrs by default * Update converter * Remove jsonl converter * Add json2docs converter * Draft Corpus class for DocBin * Work on train script * Update Corpus * Update DocBin * Allocate Doc before starting to add words * Make doc.from_array several times faster * Update train.py * Fix Corpus * Fix parser model * Start debugging arc_eager oracle * Update header * Fix parser declaration * Xfail some tests * Skip tests that cause crashes * Skip test causing segfault * Remove GoldCorpus * Update imports * Update after removing GoldCorpus * Fix module name of corpus * Fix mimport * Work on parser oracle * Update arc_eager oracle * Restore ArcEager.get_cost function * Update transition system * Update test_arc_eager_oracle * Remove beam test * Update test * Unskip * Unskip tests * add links to to_dict * clean up * fix test checking for variants * Allow DocBin to take list of Doc objects. * Fix compile in ArcEager * Serialize all attrs by default Move converters under spacy.gold Move things around Fix naming Fix name Update converter to produce DocBin Update converters Make spacy convert output docbin Fix import Fix docbin Fix import Update converter Remove jsonl converter Add json2docs converter * Allocate Doc before starting to add words * Make doc.from_array several times faster * Start updating converters * Work on train script * Draft Corpus class for DocBin Update Corpus Fix Corpus * Update DocBin Add missing strings when serializing * Update train.py * Fix parser model * Start debugging arc_eager oracle * Update header * Fix parser declaration * Xfail some tests Skip tests that cause crashes Skip test causing segfault * Remove GoldCorpus Update imports Update after removing GoldCorpus Fix module name of corpus Fix mimport * Work on parser oracle Update arc_eager oracle Restore ArcEager.get_cost function Update transition system * Update tests Remove beam test Update test Unskip Unskip tests * Add get_aligned_parse method in Example Fix Example.get_aligned_parse * Add kwargs to Corpus.dev_dataset to match train_dataset * Update nonproj * Use get_aligned_parse in ArcEager * Add another arc-eager oracle test * Remove Example.doc property Remove Example.doc Remove Example.doc Remove Example.doc Remove Example.doc * Update ArcEager oracle Fix Break oracle * Debugging * Fix Corpus * Fix eg.doc * Format * small fixes * limit arg for Corpus * fix test_roundtrip_docs_to_docbin * fix test_make_orth_variants * fix add_label test * Update tests * avoid writing temp dir in json2docs, fixing 4402 test * Update test * Add missing costs to NER oracle * Update test * Work on Example.get_aligned_ner method * Clean up debugging * Xfail tests * Remove prints * Remove print * Xfail some tests * Replace unseen labels for parser * Update test * Update test * Xfail test * Fix Corpus * fix imports * fix docs_to_json * various small fixes * cleanup * Support gold_preproc in Corpus * Support gold_preproc * Pass gold_preproc setting into corpus * Remove debugging * Fix gold_preproc * Fix json2docs converter * Fix convert command * Fix flake8 * Fix import * fix output_dir (converted to Path by typer) * fix var * bugfix: update states after creating golds to avoid out of bounds indexing * Improve efficiency of ArEager oracle * pull merge_sent into iob2docs to avoid Doc creation for each line * fix asserts * bugfix excl Span.end in iob2docs * Support max_length in Corpus * Fix arc_eager oracle * Filter out uannotated sentences in NER * Remove debugging in parser * Simplify NER alignment * Fix conversion of NER data * Fix NER init_gold_batch * Tweak efficiency of precomputable affine * Update onto-json default * Update gold test for NER * Fix parser test * Update test * Add NER data test * Fix convert for single file * Fix test * Hack scorer to avoid evaluating non-nered data * Fix handling of NER data in Example * Output unlabelled spans from O biluo tags in iob_utils * Fix unset variable * Return kept examples from init_gold_batch * Return examples from init_gold_batch * Dont return Example from init_gold_batch * Set spaces on gold doc after conversion * Add test * Fix spaces reading * Improve NER alignment * Improve handling of missing values in NER * Restore the 'cutting' in parser training * Add assertion * Print epochs * Restore random cuts in parser/ner training * Implement Doc.copy * Implement Example.copy * Copy examples at the start of Language.update * Don't unset example docs * Tweak parser model slightly * attempt to fix _guess_spaces * _add_entities_to_doc first, so that links don't get overwritten * fixing get_aligned_ner for one-to-many * fix indexing into x_text * small fix biluo_tags_from_offsets * Add onto-ner config * Simplify NER alignment * Fix NER scoring for partially annotated documents * fix indexing into x_text * fix test_cli failing tests by ignoring spans in doc.ents with empty label * Fix limit * Improve NER alignment * Fix count_train * Remove print statement * fix tests, we're not having nothing but None * fix clumsy fingers * Fix tests * Fix doc.ents * Remove empty docs in Corpus and improve limit * Update config Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-06-26 19:34:12 +02:00
Sofie Van Landeghem	c0f4a1e43b	train is from-config by default (#5575 ) * verbose and tag_map options * adding init_tok2vec option and only changing the tok2vec that is specified * adding omit_extra_lookups and verifying textcat config * wip * pretrain bugfix * add replace and resume options * train_textcat fix * raw text functionality * improve UX when KeyError or when input data can't be parsed * avoid unnecessary access to goldparse in TextCat pipe * save performance information in nlp.meta * add noise_level to config * move nn_parser's defaults to config file * multitask in config - doesn't work yet * scorer offering both F and AUC options, need to be specified in config * add textcat verification code from old train script * small fixes to config files * clean up * set default config for ner/parser to allow create_pipe to work as before * two more test fixes * small fixes * cleanup * fix NER pickling + additional unit test * create_pipe as before	2020-06-12 02:02:07 +02:00
Ines Montani	810fce3bb1	Merge branch 'develop' into master-tmp	2020-06-03 14:36:59 +02:00
Adriane Boyd	10d938f221	Update default cfg dir in train CLI	2020-06-03 14:15:50 +02:00
Adriane Boyd	f1f9c8b417	Port train CLI updates Updates from #5362 and fix from #5387: * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-06-03 14:03:43 +02:00
Ines Montani	a7e370bcbf	Don't override spaCy version	2020-05-30 15:03:18 +02:00
Ines Montani	6e6db6afb6	Better model compatibility and validation	2020-05-22 15:42:46 +02:00
Ines Montani	24f72c669c	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
Adriane Boyd	daaa7bf451	Add option to omit extra lexeme tables in CLI	2020-05-20 15:51:44 +02:00
adrianeboyd	a5cd203284	Reduce stored lexemes data, move feats to lookups (#5238 ) * Reduce stored lexemes data, move feats to lookups * Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only * Remove serialization of lexemes data as `vocab/lexemes.bin` * Remove `SerializedLexemeC` * Remove `Lexeme.to_bytes/from_bytes` * Modify normalization exception loading: * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab` * Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data) * Skip English normalization test Skip English normalization test because the data is now in `spacy-lookups-data`. * Remove norm exceptions Moved to spacy-lookups-data. * Move norm exceptions test to spacy-lookups-data * Load extra lookups from spacy-lookups-data lazily Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`. * Skip creating lexeme cache on load To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing. * Identify numeric values in Lexeme.set_attrs() With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`. * Skip lexeme cache init in from_bytes * Unskip and update lookups tests for python3.6+ * Update vocab pickle to include lookups_extra * Update vocab serialization tests Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP". * Re-skip lookups test because of python3.5 * Skip PROB/float values in Lexeme.set_attrs * Convert is_oov from lexeme flag to lex in vectors Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:14 +02:00
Sofie Van Landeghem	0d94737857	Feature toggle_pipes (#5378 ) * make disable_pipes deprecated in favour of the new toggle_pipes * rewrite disable_pipes statements * update documentation * remove bin/wiki_entity_linking folder * one more fix * remove deprecated link to documentation * few more doc fixes * add note about name change to the docs * restore original disable_pipes * small fixes * fix typo * fix error number to W096 * rename to select_pipes * also make changes to the documentation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-18 22:27:10 +02:00
adrianeboyd	c045a9c7f6	Fix logic in train CLI timing eval on GPU (#5387 ) Run CPU timing in first iteration only	2020-05-01 12:05:33 +02:00
adrianeboyd	bdff76dede	Various updates/additions to CLI scripts (#5362 ) * `debug-data`: determine coverage of provided vectors * `evaluate`: support `blank:lg` model to make it possible to just evaluate tokenization * `init-model`: add option to truncate vectors to N most frequent vectors from word2vec file * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-04-29 12:56:46 +02:00
Umar Butler	8952effcc4	Fixed Typo in Warning (#5284 ) * Fixed typo in cli warning Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat. * Create and signed contributor form	2020-04-09 15:46:15 +02:00
adrianeboyd	b71a11ff6d	Update morphologizer (#5108 ) * Add pos and morph scoring to Scorer Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph accuracy in `spacy evaluate`. * Update morphologizer for v3 * switch to tagger-based morphologizer * use `spacy.HashCharEmbedCNN` for morphologizer defaults * add `Doc.is_morphed` flag * Add morphologizer to train CLI * Add basic morphologizer pipeline tests * Add simple morphologizer training example * Remove subword_features from CharEmbed models Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and `spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features` is always `False`. * Rename setting in morphologizer example Use `with_pos_tags` instead of `without_pos_tags`. * Fix kwargs for spacy.HashCharEmbedBiLSTM.v1 * Remove defaults for spacy.HashCharEmbedBiLSTM.v1 Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`. * Set random seed for textcat overfitting test	2020-04-02 14:46:32 +02:00
Ines Montani	46568f40a7	Merge branch 'master' into tmp/sync	2020-03-26 13:38:14 +01:00
adrianeboyd	8d3563f1c4	Minor bugfixes for train CLI (#5186 ) * Omit per_type scores from model-best calculations The addition of per_type scores to the included metrics (#4911) causes errors when they're compared while determining the best model, so omit them for this `max()` comparison. * Add default speed data for interrupted train CLI Add better speed meta defaults so that an interrupted iteration still produces a best model. Co-authored-by: Ines Montani <ines@ines.io>	2020-03-26 10:46:50 +01:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
Sofie Van Landeghem	218e1706ac	Bugfix linking vectors (#5196 ) * restore call to _load_vectors * bump to thinc 8.0.0a3 * bump to 3.0.0.dev4	2020-03-25 10:20:11 +01:00
adrianeboyd	c95ce96c44	Update sentence recognizer (#5109 ) * Update sentence recognizer * rename `sentrec` to `senter` * use `spacy.HashEmbedCNN.v1` by default * update to follow `Tagger` modifications * remove component methods that can be inherited from `Tagger` * add simple initialization and overfitting pipeline tests * Update serialization test for senter	2020-03-06 14:45:02 +01:00
adrianeboyd	8c20dae6f7	Fix model-final/model-best meta from train CLI (#5093 ) * Fix model-final/model-best meta * include speed and accuracy from final iteration * combine with speeds from base model if necessary * Include token_acc metric for all components	2020-03-03 21:43:25 +01:00
Ines Montani	5da3ad682a	Tidy up and auto-format	2020-02-28 11:57:41 +01:00
Sofie Van Landeghem	06f0a8daa0	Default settings to configurations (#4995 ) * fix grad_clip naming * cleaning up pretrained_vectors out of cfg * further refactoring Model init's * move Model building out of pipes * further refactor to require a model config when creating a pipe * small fixes * making cfg in nn_parser more consistent * fixing nr_class for parser * fixing nn_parser's nO * fix printing of loss * architectures in own file per type, consistent naming * convenience methods default_tagger_config and default_tok2vec_config * let create_pipe access default config if available for that component * default_parser_config * move defaults to separate folder * allow reading nlp from package or dir with argument 'name' * architecture spacy.VocabVectors.v1 to read static vectors from file * cleanup * default configs for nel, textcat, morphologizer, tensorizer * fix imports * fixing unit tests * fixes and clean up * fixing defaults, nO, fix unit tests * restore parser IO * fix IO * 'fix' serialization test * add .cfg to manifest fix example configs with additional arguments * replace Morpohologizer with Tagger * add IO bit when testing overfitting of tagger (currently failing) * fix IO - don't initialize when reading from disk * expand overfitting tests to also check IO goes OK * remove dropout from HashEmbed to fix Tagger performance * add defaults for sentrec * update thinc * always pass a Model instance to a Pipe * fix piped_added statement * remove obsolete W029 * remove obsolete errors * restore byte checking tests (work again) * clean up test * further test cleanup * convert from config to Model in create_pipe * bring back error when component is not initialized * cleanup * remove calls for nlp2.begin_training * use thinc.api in imports * allow setting charembed's nM and nC * fix for hardcoded nM/nC + unit test * formatting fixes * trigger build	2020-02-27 18:42:27 +01:00
adrianeboyd	ff184b7a9c	Add tag_map argument to CLI debug-data and train (#4750 ) (#5038 ) Add an argument for a path to a JSON-formatted tag map, which is used to update and extend the default language tag map.	2020-02-26 12:10:38 +01:00
svlandeg	fc6e34c3a1	fix bugs from porting master to develop	2020-02-26 08:44:22 +01:00
Ines Montani	e3f40a6a0f	Tidy up and auto-format	2020-02-18 15:38:18 +01:00
Ines Montani	de11ea753a	Merge branch 'master' into develop	2020-02-18 14:47:23 +01:00
Sofie Van Landeghem	2572460175	add tok2vec parameters to train script to facilitate init_tok2vec (#5021 )	2020-02-16 17:16:41 +01:00
Sofie Van Landeghem	a27c77ce62	add message when cli train script throws exception (#5009 ) * add message when cli train script throws exception * fix formatting	2020-02-15 15:50:17 +01:00

1 2 3 4 5

220 Commits