spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-15 22:51:58 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	7b3f0c6f1b	Questionable fix for parser training bug with misaligned sentences (#6694) * Questionable fix for parser training bug with misaligned sentences * Fix Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-15 14:18:24 +01:00
Ines Montani	e8a97a2bd6	Merge pull request #6720 from adrianeboyd/feature/improved-init-training-config-validation	2021-01-15 11:45:24 +11:00
Adriane Boyd	681a6195f7	Validate seed and gpu_allocator manually	2021-01-14 16:57:57 +01:00
svlandeg	fec9b81aa2	Merge remote-tracking branch 'upstream/develop' into feature/missing-dep	2021-01-13 17:46:12 +01:00
svlandeg	ed53bb979d	cleanup	2021-01-13 14:20:05 +01:00
svlandeg	86a4e316b8	fix sent_starts	2021-01-13 13:47:25 +01:00
svlandeg	a581d82f33	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
Adriane Boyd	5fb8b7037a	Expand initialize/training config validation Validate both `[initialize]` and `[training]` in `debug data` and `nlp.initialize()` with separate config validation error blocks that indicate which block of the config is being validated.	2021-01-12 17:17:00 +01:00
Matthew Honnibal	c04bab6bae	Fix train loop to avoid swallowing tracebacks (#6693) * Avoid swallowing tracebacks in train loop * Format * Handle first	2021-01-09 08:25:47 +08:00
svlandeg	dd12c6c8fd	allow missing information in deps and heads annotations	2021-01-07 19:10:32 +01:00
Bruno	1a77607036	spaCy v3 is not saving the best version in training loop (#6629) * Save best only if is the best and also respect the average config * Create bratao.md * Update loop.py * Remove average check * Keep before_to_disk	2021-01-06 12:51:30 +11:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
Adriane Boyd	b57be94c78	Fix memory issues in Language.evaluate (#6386) * Fix memory issues in Language.evaluate Reset annotation in predicted docs before evaluating and store all data in `examples`. * Minor refactor to docs generator init * Fix generator expression * Fix final generator check * Refactor pipeline loop * Handle examples generator in Language.evaluate * Add test with generator * Use make_doc	2020-12-31 10:45:50 +11:00
Adriane Boyd	1ddf2f39c7	Switch converters to generator functions (#6547) * Switch converters to generator functions To reduce the memory usage when converting large corpora, refactor the convert methods to be generator functions. * Update tests	2020-12-15 16:47:16 +08:00
Matthew Honnibal	8656a08777	Add beam_parser and beam_ner components for v3 (#6369 ) * Get basic beam tests working * Get basic beam tests working * Compile _beam_utils * Remove prints * Test beam density * Beam parser seems to train * Draft beam NER * Upd beam * Add hypothesis as dev dependency * Implement missing is-gold-parse method * Implement early update * Fix state hashing * Fix test * Fix test * Default to non-beam in parser constructor * Improve oracle for beam * Start refactoring beam * Update test * Refactor beam * Update nn * Refactor beam and weight by cost * Update ner beam settings * Update test * Add __init__.pxd * Upd test * Fix test * Upd test * Fix test * Remove ring buffer history from StateC * WIP change arc-eager transitions * Add state tests * Support ternary sent start values * Fix arc eager * Fix NER * Pass oracle cut size for beam * Fix ner test * Fix beam * Improve StateC.clone * Improve StateClass.borrow * Work directly with StateC, not StateClass * Remove print statements * Fix state copy * Improve state class * Refactor parser oracles * Fix arc eager oracle * Fix arc eager oracle * Use a vector to implement the stack * Refactor state data structure * Fix alignment of sent start * Add get_aligned_sent_starts method * Add test for ae oracle when bad sentence starts * Fix sentence segment handling * Avoid Reduce that inserts illegal sentence * Update preset SBD test * Fix test * Remove prints * Fix sent starts in Example * Improve python API of StateClass * Tweak comments and debug output of arc eager * Upd test * Fix state test * Fix state test	2020-12-13 09:08:32 +08:00
Sofie Van Landeghem	de108ed3e8	Add specific error when StaticVectors can't read the vectors data (#6450)	2020-12-09 06:16:07 +08:00
Ines Montani	6cfa66ed1c	Make training.loop return nlp object and path (#6520)	2020-12-08 14:55:55 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451) * define new architectures for the pretraining objective * add loss function as attr of the omdel * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Adriane Boyd	4448680750	Fix alignment for 1-to-1 tokens and lowercasing (#6476 ) * When checking for token alignments, check not only that the tokens are identical but that the character positions are both at the start of a token. It's possible for the tokens to be identical even though the two tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs. `["a", "''", "'"]`, where the middle tokens are identical but should not be aligned on the token level at character position 2 since it's the start of one token but the middle of another. * Use the lowercased version of the token texts to create the character-to-token alignment because lowercasing can change the string length (e.g., for `İ`, see the not-a-bug bug report: https://bugs.python.org/issue34723)	2020-12-08 14:25:16 +08:00
Sofie Van Landeghem	079f6ea474	avoid resolving the full config (#6465)	2020-11-30 09:34:29 +08:00
Adriane Boyd	1c4df8fd09	Replace pytokenizations with internal alignment (#6293 ) * Replace pytokenizations with internal alignment Replace pytokenizations with internal alignment algorithm that is restricted to only allow differences in whitespace and capitalization. * Rename `spacy.training.align` to `spacy.training.alignment` to contain the `Alignment` dataclass * Implement `get_alignments` in `spacy.training.align` * Refactor trailing whitespace handling * Remove unnecessary exception for empty docs Allow a non-empty whitespace-only doc to be aligned with an empty doc * Remove empty docs exceptions completely	2020-11-03 16:24:38 +01:00
Sofie Van Landeghem	2918923541	fix resolving of dot notation (#6326)	2020-10-31 12:17:06 +01:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Ines Montani	ff4267d181	Fix success message [ci skip]	2020-10-15 14:42:08 +02:00
Adriane Boyd	a93d42861d	Use null raw for has_unknown_spaces in docs_to_json	2020-10-15 09:57:54 +02:00
Ines Montani	ab890a35f9	Make console logger table more compact	2020-10-11 12:55:46 +02:00
Ines Montani	8ac5f22253	Adjust error message	2020-10-09 18:00:16 +02:00
svlandeg	18dfb27985	Add custom error when evaluation throws a KeyError	2020-10-09 12:05:33 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
Ines Montani	568e12215d	Merge pull request #6206 from svlandeg/fix/patterns-init	2020-10-06 10:27:23 +02:00
Ines Montani	126268ce50	Auto-format [ci skip]	2020-10-05 21:58:18 +02:00
Ines Montani	be99f1e4de	Remove output dirs before training (#6204) * Remove output dirs before training * Re-raise error if cleaning fails	2020-10-05 20:11:16 +02:00
svlandeg	9eb813a35d	Merge remote-tracking branch 'upstream/develop' into fix/patterns-init	2020-10-05 17:49:44 +02:00
svlandeg	4e3ace4b8c	is_trainable method	2020-10-05 17:43:42 +02:00
Matthew Honnibal	3ee3649b52	Fix augment	2020-10-05 16:59:49 +02:00
Matthew Honnibal	8deed614e9	Fix augment	2020-10-05 16:41:45 +02:00
Matthew Honnibal	4ed3e037df	Fix augment	2020-10-05 16:40:55 +02:00
svlandeg	dc06912c76	prevent loss keyerror for non-trainable components	2020-10-05 16:33:28 +02:00
Ines Montani	8171e28b20	Remove logging [ci skip] This would be fired on each example, which is wrong	2020-10-05 15:09:52 +02:00
svlandeg	251b3eb4e5	add initialize method for entity_ruler	2020-10-05 14:59:13 +02:00
Matthew Honnibal	6a9d14e35a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-10-05 14:17:41 +02:00
Matthew Honnibal	d2b9aafb8c	Fix augmenter	2020-10-05 14:14:49 +02:00
svlandeg	fd2d48556c	fix E902 and E903 numbering	2020-10-05 13:43:32 +02:00
Ines Montani	3c36a57e84	Update data augmenters (#6196) * Draft lower-case augmenter * Make warning a debug log * Update lowercase augmenter, docs and tests Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-10-04 17:46:29 +02:00
Matthew Honnibal	84ae197dd6	Fix logger	2020-10-04 14:16:53 +02:00
Ines Montani	bcd52e5486	Tidy up errors and warnings	2020-10-04 11:16:31 +02:00
Ines Montani	ff914f4e6f	Lazy-load xx	2020-10-04 11:10:26 +02:00
Matthew Honnibal	85ede32680	Format	2020-10-03 19:26:23 +02:00
Matthew Honnibal	b305f2ff5a	Fix loggers	2020-10-03 19:26:10 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00

113 Commits