spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-25 17:36:30 +03:00

Author	SHA1	Message	Date
Adriane Boyd	b57be94c78	Fix memory issues in Language.evaluate (#6386 ) * Fix memory issues in Language.evaluate Reset annotation in predicted docs before evaluating and store all data in `examples`. * Minor refactor to docs generator init * Fix generator expression * Fix final generator check * Refactor pipeline loop * Handle examples generator in Language.evaluate * Add test with generator * Use make_doc	2020-12-31 10:45:50 +11:00
svlandeg	a6a68da673	unskipping tests with python >= 3.6	2020-12-30 18:46:43 +01:00
svlandeg	d5ff0fecf8	add docs	2020-12-30 14:01:13 +01:00
svlandeg	c74ab6a313	fix imports	2020-12-30 12:40:12 +01:00
svlandeg	712a78b74a	add simple unit test	2020-12-30 12:35:26 +01:00
Adriane Boyd	5ca57d8221	Add logger warning when serializing user hooks (#6595 ) Add a warning that user hooks are lost on serialization. Add a `user_hooks` exclude to skip the warning with pickle.	2020-12-29 11:54:32 +01:00
Yosi	cf52510631	Add Amharic አማርኛ Language support (#6583 ) * Add Amharic to space * clean up * Add some PRON_LEMMA * add Tigrinya support * remove text_noun_chunks * Tigrinya Support * added some more details for ti * fix unit test * add amharic char range * changes from review * amharic and tigrinya share same unicode block * get rid of _amharic/_tigrinya in char_classes Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>	2020-12-22 16:50:34 +01:00
Adriane Boyd	cabd4ae5b1	Use logger.warning instead of logger.warn (#6596 ) Use `logger.warning` instead of deprecated `logger.warn`.	2020-12-21 08:25:10 +08:00
Sofie Van Landeghem	282a3b49ea	Fix parser resizing when there is no upper layer (#6460 ) * allow resizing of the parser model even when upper=False * update from spacy.TransitionBasedParser.v1 to v2 * bugfix	2020-12-18 18:56:57 +08:00
Sofie Van Landeghem	0a923a7915	Tagger robustness (#6580 ) * require labels in taggers * ensure tagger works with incomplete data	2020-12-18 18:51:47 +08:00
Adriane Boyd	1ddf2f39c7	Switch converters to generator functions (#6547 ) * Switch converters to generator functions To reduce the memory usage when converting large corpora, refactor the convert methods to be generator functions. * Update tests	2020-12-15 16:47:16 +08:00
Matthew Honnibal	8656a08777	Add beam_parser and beam_ner components for v3 (#6369 ) * Get basic beam tests working * Get basic beam tests working * Compile _beam_utils * Remove prints * Test beam density * Beam parser seems to train * Draft beam NER * Upd beam * Add hypothesis as dev dependency * Implement missing is-gold-parse method * Implement early update * Fix state hashing * Fix test * Fix test * Default to non-beam in parser constructor * Improve oracle for beam * Start refactoring beam * Update test * Refactor beam * Update nn * Refactor beam and weight by cost * Update ner beam settings * Update test * Add __init__.pxd * Upd test * Fix test * Upd test * Fix test * Remove ring buffer history from StateC * WIP change arc-eager transitions * Add state tests * Support ternary sent start values * Fix arc eager * Fix NER * Pass oracle cut size for beam * Fix ner test * Fix beam * Improve StateC.clone * Improve StateClass.borrow * Work directly with StateC, not StateClass * Remove print statements * Fix state copy * Improve state class * Refactor parser oracles * Fix arc eager oracle * Fix arc eager oracle * Use a vector to implement the stack * Refactor state data structure * Fix alignment of sent start * Add get_aligned_sent_starts method * Add test for ae oracle when bad sentence starts * Fix sentence segment handling * Avoid Reduce that inserts illegal sentence * Update preset SBD test * Fix test * Remove prints * Fix sent starts in Example * Improve python API of StateClass * Tweak comments and debug output of arc eager * Upd test * Fix state test * Fix state test	2020-12-13 09:08:32 +08:00
Ines Montani	9d32e839d3	Merge branch 'develop' into feature/init-config-cpu-gpu	2020-12-10 08:50:53 +11:00
Ines Montani	f2571b5ec4	Merge pull request #6444 from adrianeboyd/chore/update-develop-from-master	2020-12-09 13:09:58 +11:00
Ines Montani	90171f2031	Merge pull request #6528 from svlandeg/feature/pipe_fill_config	2020-12-09 12:01:22 +11:00
Ines Montani	dfaef27f90	Merge pull request #6503 from adrianeboyd/feature/lemmatizer-rule-warning-pos Warn on empty POS for the rule-based lemmatizer	2020-12-09 11:34:16 +11:00
Ines Montani	b85bd63eca	Fix test	2020-12-09 11:24:01 +11:00
Ines Montani	febf71af28	Fix test	2020-12-09 11:23:07 +11:00
Ines Montani	1980203229	Merge branch 'master' into pr/6444	2020-12-09 11:09:40 +11:00
Ines Montani	05a2812ae0	Merge branch 'develop' into pr/6444	2020-12-09 11:04:03 +11:00
Sofie Van Landeghem	cfc72c2995	Bugfix multi-label textcat reproducibility (#6481 ) * add test for multi-label textcat reproducibility * remove positive_label * fix lengths dtype * fix comments * remove comment that we should not have forgotten :-)	2020-12-09 06:29:15 +08:00
svlandeg	8f8a7f1733	returning config in init_config	2020-12-08 17:37:20 +01:00
Sofie Van Landeghem	2c27093c5f	require_cpu functionality (#6336 ) * add require_cpu from Thinc 8.0.0rc2 * add docs * fix test if cupy is not installed	2020-12-08 14:42:40 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451 ) * define new architectures for the pretraining objective * add loss function as attr of the omdel * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Adriane Boyd	29b058ebdc	Fix spacy when retokenizing cases with affixes (#6475 ) Preserve `token.spacy` corresponding to the span end token in the original doc rather than adjusting for the current offset. * If not modifying in place, this checks in the original document (`doc.c` rather than `tokens`). * If modifying in place, the document has not been modified past the current span start position so the value at the current span end position is valid.	2020-12-08 14:25:56 +08:00
Adriane Boyd	4448680750	Fix alignment for 1-to-1 tokens and lowercasing (#6476 ) * When checking for token alignments, check not only that the tokens are identical but that the character positions are both at the start of a token. It's possible for the tokens to be identical even though the two tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs. `["a", "''", "'"]`, where the middle tokens are identical but should not be aligned on the token level at character position 2 since it's the start of one token but the middle of another. * Use the lowercased version of the token texts to create the character-to-token alignment because lowercasing can change the string length (e.g., for `İ`, see the not-a-bug bug report: https://bugs.python.org/issue34723)	2020-12-08 14:25:16 +08:00
Adriane Boyd	d70950605c	Warn on empty POS for the rule-based lemmatizer Add a warning to the rule-based lemmatizer for any tokens without POS annotation.	2020-12-04 11:46:15 +01:00
Sofie Van Landeghem	d6c616a125	Fixes in test suite (#6457 ) * fix slow test for textcat readers * cleanup test_issue5551 * add explicit score weight * cleanup	2020-12-02 12:57:08 +01:00
Adriane Boyd	53c0fb7431	Only set NORM on Token in retokenizer (#6464 ) * Only set NORM on Token in retokenizer Instead of setting `NORM` on both the token and lexeme, set `NORM` only on the token. The retokenizer tries to set all possible attributes with `Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate which attributes are available for each. `NORM` is the only attribute that's stored on both and for most cases it doesn't make sense to set the global norms based on a individual retokenization. For lexeme-only attributes like `IS_STOP` there's no way to avoid the global side effects, but I think that `NORM` would be better only on the token. * Fix test	2020-11-30 09:35:42 +08:00
Adriane Boyd	03ae77e603	Add SPACY as a Matcher attribute (#6463 )	2020-11-30 09:34:50 +08:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Sofie Van Landeghem	2af31a8c8d	Bugfix textcat reproducibility on GPU (#6411 ) * add seed argument to ParametricAttention layer * bump thinc to 7.4.3 * set thinc version range Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-23 12:29:35 +01:00
Adriane Boyd	320a8b1481	Add ent_id_ to strings serialized with Doc (#6353 )	2020-11-10 20:16:07 +08:00
Adriane Boyd	a7e7d6c6c9	Ignore misaligned in Morphologizer.get_loss (#6363 ) Fix bug where `Morphologizer.get_loss` treated misaligned annotation as `EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH` mappings.	2020-11-10 20:15:09 +08:00
Ines Montani	363ac73c72	Update docs [ci skip]	2020-11-09 12:43:26 +08:00
Adriane Boyd	31de700b0f	Fix on_match callback and remove empty patterns (#6312 ) For the `DependencyMatcher`: * Fix on_match callback so that it is called once per matched pattern * Fix results so that patterns with empty match lists are not returned	2020-11-05 09:16:26 +01:00
Adriane Boyd	1c4df8fd09	Replace pytokenizations with internal alignment (#6293 ) * Replace pytokenizations with internal alignment Replace pytokenizations with internal alignment algorithm that is restricted to only allow differences in whitespace and capitalization. * Rename `spacy.training.align` to `spacy.training.alignment` to contain the `Alignment` dataclass * Implement `get_alignments` in `spacy.training.align` * Refactor trailing whitespace handling * Remove unnecessary exception for empty docs Allow a non-empty whitespace-only doc to be aligned with an empty doc * Remove empty docs exceptions completely	2020-11-03 16:24:38 +01:00
Adriane Boyd	a4b32b9552	Handle missing reference values in scorer (#6286 ) * Handle missing reference values in scorer Handle missing values in reference doc during scoring where it is possible to detect an unset state for the attribute. If no reference docs contain annotation, `None` is returned instead of a score. `spacy evaluate` displays `-` for missing scores and the missing scores are saved as `None`/`null` in the metrics. Attributes without unset states: * `token.head`: relies on `token.dep` to recognize unset values * `doc.cats`: unable to handle missing annotation Additional changes: * add optional `has_annotation` check to `score_scans` to replace `doc.sents` hack * update `score_token_attr_per_feat` to handle missing and empty morph representations * fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START` vs. `SENT_START` * Fix import * Update return types	2020-11-03 15:47:18 +01:00
Adriane Boyd	5d2cb86c34	Fix on_match callback for DependencyMatcher (#6313 ) Fix `DependencyMatcher` so that the callback is called only once per match.	2020-10-31 12:20:27 +01:00
Adriane Boyd	45c9a68828	Identify final Matcher pattern node by quantifier (#6317 ) Modify the internal pattern representation in `Matcher` patterns to identify the final ID state using a unique quantifier rather than a combination of other attributes. It was insufficient to identify the final ID node based on an uninitialized `quantifier` (coincidentally being the same as the `ZERO`) with `nr_attr` as 0. (In addition, it was potentially bug-prone that `nr_attr` was set to 0 even though attrs were allocated.) In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr` is 0 and the quantifier is ZERO, so the previous methods for incrementing to the ID node at the end of the pattern weren't able to distinguish the final ID node from the `{"OP": "!"}` pattern.	2020-10-31 12:18:48 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268 ) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbvrevs to another list * fiex typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Jan Margeta	1ad2213349	Fix TokenPatternSchema pattern field validation Empty pattern field should be considered invalid This is fixed by replacing minItems with min_items as described in Pydantic docs: https://pydantic-docs.helpmanual.io/usage/schema/	2020-10-16 00:41:21 +02:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230 ) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	b1d568a4df	Tidy up tests	2020-10-15 10:20:21 +02:00
Ines Montani	d165af26be	Auto-format [ci skip]	2020-10-15 10:08:53 +02:00
Ines Montani	5d62499266	Fix tests	2020-10-15 09:29:15 +02:00
Ines Montani	178760855f	Merge branch 'develop' into master-tmp	2020-10-15 09:06:03 +02:00
svlandeg	0796401c19	call NumpyOps instead of get_current_ops()	2020-10-14 16:55:00 +02:00
svlandeg	e94a21638e	adding tests for trained models to ensure predict reproducibility	2020-10-13 21:07:13 +02:00
svlandeg	ede979d42f	formattting	2020-10-13 18:53:17 +02:00
svlandeg	ff83bfae3f	naming	2020-10-13 18:52:37 +02:00
svlandeg	6ccacff54e	add tests for individual spacy layers	2020-10-13 18:50:07 +02:00
svlandeg	c23041ae60	component tests single or multiple prediction	2020-10-13 16:26:53 +02:00
svlandeg	3a505e7e14	small edit to ensure the new word was indeed new	2020-10-10 21:05:28 +02:00
svlandeg	68d79796c6	add test for vocab after serializing KB	2020-10-10 20:59:48 +02:00
Ines Montani	539b0c10da	Tidy up and auto-format	2020-10-10 19:14:48 +02:00
Ines Montani	bfa3931c9d	Revert added_strings change (#6236 )	2020-10-10 18:55:07 +02:00
Ines Montani	525f798841	Fix typo in test	2020-10-09 18:00:21 +02:00
Ines Montani	b7cb9d95e4	Merge pull request #6229 from svlandeg/bugfix/disabled	2020-10-09 16:05:11 +02:00
Ines Montani	9fb3244672	Merge pull request #6231 from adrianeboyd/feature/include-static-vectors	2020-10-09 15:54:52 +02:00
Adriane Boyd	727370c633	Remove Span._recalculate_indices Remove `Span._recalculate_indices`, which is a remnant from the deprecated `Span.merge`.	2020-10-09 14:42:51 +02:00
svlandeg	06b9d213fd	formatting	2020-10-09 12:19:47 +02:00
Ines Montani	4771a10503	Make test more explicit [ci skip]	2020-10-09 12:15:26 +02:00
Ines Montani	cc3646b06c	Add xfailing test for peculiar spans failure [ci skip]	2020-10-09 12:10:25 +02:00
svlandeg	8316bc7d4a	bugfix DisabledPipes	2020-10-09 12:06:20 +02:00
Adriane Boyd	39aabf50ab	Also rename to include_static_vectors in CharEmbed	2020-10-09 11:54:48 +02:00
Florijan Stamenković	18f5c309dc	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:14:40 +02:00
Duygu Altinok	80fb1bffc9	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-09 10:13:15 +02:00
Duygu Altinok	2fad279a44	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:10:22 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
Ines Montani	8ff73f04db	Fix morph in Doc.to_json	2020-10-08 14:44:35 +02:00
Ines Montani	064575d79d	Merge pull request #6216 from svlandeg/feature/nel-initialize	2020-10-08 11:14:12 +02:00
svlandeg	eaf5c265cb	set_kb method for entity_linker	2020-10-08 10:34:01 +02:00
Ines Montani	010956d493	Clear rule-based components on initialize	2020-10-08 09:51:31 +02:00
Sofie Van Landeghem	2998131416	Reproducibility for TextCat and Tok2Vec (#6218 ) * ensure fixed seed in HashEmbed layers * forgot about the joys of python 2	2020-10-08 00:43:46 +02:00
svlandeg	efedccea8d	fix tests	2020-10-07 15:29:52 +02:00
svlandeg	6b8bdb2d39	add init_config to nlp.create_pipe	2020-10-07 14:58:16 +02:00
Duygu Altinok	7e821c2776	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-07 11:07:52 +02:00
Duygu Altinok	b95a11dd95	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-07 10:25:37 +02:00
Rahul Gupta	1a00bff06d	Hindi: Adds tests for lexical attributes (norm and like_num) (#5829 ) * Hindi: Adds tests for lexical attributes (norm and like_num) * Signs and sdds the contributor agreement * Add ordinal numbers to be tagged as like_num * Adds alternate pronunciation for 31 and 39	2020-10-07 10:23:32 +02:00
Sofie Van Landeghem	fff3f8ccfa	Fix packaging pin (#6212 ) * pin packaging to >=20.0 * ignore spacy-pkuseg in requirements unit test	2020-10-06 14:16:05 +02:00
Matthew Honnibal	cfb9770a94	Fix empty input into StaticVectors layer (#6211 ) * Add test for empty doc(s) * Fix empty check in staticvectors * Remove xfail * Update spacy/ml/staticvectors.py	2020-10-06 14:15:41 +02:00
Florijan Stamenković	9db670b996	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-06 11:17:37 +02:00
Ines Montani	568e12215d	Merge pull request #6206 from svlandeg/fix/patterns-init	2020-10-06 10:27:23 +02:00
svlandeg	ff9ac39c88	read entity_ruler patterns with srsly.read_jsonl.v1	2020-10-05 22:50:14 +02:00
Ines Montani	126268ce50	Auto-format [ci skip]	2020-10-05 21:58:18 +02:00
Ines Montani	181039bd17	Merge pull request #6205 from explosion/feature/embed-features	2020-10-05 21:49:10 +02:00
Ines Montani	568617af58	Merge pull request #6202 from explosion/feature/project-spacy-version	2020-10-05 21:40:52 +02:00
Ines Montani	6abfc2911d	Merge pull request #6203 from adrianeboyd/feature/zh-spacy-pkuseg	2020-10-05 21:35:57 +02:00
Matthew Honnibal	91d0fbb588	Fix test	2020-10-05 21:13:53 +02:00
Matthew Honnibal	b392d48e76	Fix test	2020-10-05 20:17:07 +02:00
Matthew Honnibal	db84d175c3	Fix test	2020-10-05 19:59:30 +02:00
Matthew Honnibal	6dcc4a0ba6	Simplify MultiHashEmbed signature	2020-10-05 19:57:45 +02:00
Matthew Honnibal	7d93575f35	spacy/tests/	2020-10-05 15:28:12 +02:00
Matthew Honnibal	f4ca9a39cb	spacy/tests/	2020-10-05 15:27:06 +02:00
Matthew Honnibal	f2f1deca66	spacy/tests/	2020-10-05 15:24:33 +02:00
Matthew Honnibal	8ec79ad3fa	Allow configuration of MultiHashEmbed features Update arguments to MultiHashEmbed layer so that the attributes can be controlled. A kind of tricky scheme is used to allow optional specification of the rows. I think it's an okay balance between flexibility and convenience.	2020-10-05 15:22:00 +02:00
Adriane Boyd	5d19dfc9d3	Update Chinese tokenizer for spacy-pkuseg fork	2020-10-05 14:21:53 +02:00
Ines Montani	6958510bda	Include spaCy version check in project CLI	2020-10-05 13:53:07 +02:00
Ines Montani	20f2a17a09	Merge test_misc and test_util	2020-10-05 13:45:57 +02:00
Ines Montani	1c641e41c3	Remove unused import [ci skip]	2020-10-05 11:50:11 +02:00
Adriane Boyd	b0b93854cb	Update ru/uk lemmatizers for new nlp.initialize	2020-10-05 09:27:16 +02:00
Ines Montani	549758f67d	Adjust test for now	2020-10-04 23:16:09 +02:00
Ines Montani	3c36a57e84	Update data augmenters (#6196 ) * Draft lower-case augmenter * Make warning a debug log * Update lowercase augmenter, docs and tests Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-10-04 17:46:29 +02:00
Ines Montani	496228771d	Merge pull request #6194 from explosion/master-tmp	2020-10-04 15:25:41 +02:00
Ines Montani	0307a228c8	Merge pull request #6193 from explosion/fix/adjust-pipe-init Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 15:20:54 +02:00
Ines Montani	59deeb7da6	Merge branch 'develop' into master-tmp	2020-10-04 14:52:20 +02:00
Ines Montani	8f018e47f8	Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 14:43:45 +02:00
Ines Montani	11347f34da	Tidy up, tests and docs	2020-10-04 13:54:05 +02:00
Ines Montani	d3b3663942	Adjust error message and add test	2020-10-04 10:11:27 +02:00
Ines Montani	2110e8f86d	Auto-format	2020-10-04 10:06:49 +02:00
Matthew Honnibal	835070cedc	Upd test	2020-10-03 19:35:10 +02:00
Ines Montani	c2401fca41	Add tests for Pipe.label_data	2020-10-03 19:12:46 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00
Ines Montani	7c4ab7e82c	Fix Lemmatizer.get_lookups_config	2020-10-03 17:16:10 +02:00
Ines Montani	dd542ec6a4	Fix label initialization of textcat component (#6190 )	2020-10-03 17:07:38 +02:00
Sofie Van Landeghem	09dcb75076	small UX fix for DocBin (#6167 ) * add informative warning when messing up store_user_data DocBin flags * add informative warning when messing up store_user_data DocBin flags * cleanup test * rename to patterns_path	2020-10-02 15:43:32 +02:00
Ines Montani	f0b30aedad	Make lemmatizers use initialize logic (#6182 ) * Make lemmatizer use initialize logic and tidy up * Fix typo * Raise for uninitialized tables	2020-10-02 15:42:36 +02:00
Ines Montani	d2aa662ab2	Merge pull request #6179 from adrianeboyd/feature/token-morph-refactor-2 [ci skip]	2020-10-02 12:10:27 +02:00
Ines Montani	c41a4332e4	Add test for custom data augmentation	2020-10-02 11:37:56 +02:00
Adriane Boyd	f83dfe62da	Fix test	2020-10-02 10:17:26 +02:00
Ines Montani	01c1538c72	Integrate file readers	2020-10-02 01:36:06 +02:00
Adriane Boyd	86c3ec9c2b	Refactor Token morph setting (#6175 ) * Refactor Token morph setting * Remove `Token.morph_` * Add `Token.set_morph()` * `0` resets `token.c.morph` to unset * Any other values are passed to `Morphology.add` * Add token.morph setter to set from MorphAnalysis	2020-10-01 22:21:46 +02:00
Ines Montani	d48ddd6c9a	Remove default initialize lookups	2020-10-01 21:54:33 +02:00
Adriane Boyd	73538782a0	Switch Doc.__init__(ents=) to IOB tags (#6173 ) * Switch Doc.__init__(ents=) to IOB tags * Fix check for "-" * Allow "" or None as missing IOB tag	2020-10-01 16:22:18 +02:00
Yohei Tamura	3243ddac8f	Fix/span.sent (#6083 ) * add fail test * fix test * fix span.sent * Remove incorrect implicit check Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-01 14:01:52 +02:00
Ines Montani	381258b75b	Merge pull request #6165 from explosion/feature/update-tokenizers-initialize	2020-10-01 09:49:47 +02:00
Ines Montani	a103ab5f1a	Update augmenter lookups and docs	2020-09-30 23:03:47 +02:00
Ines Montani	23c63eefaf	Tidy up env vars [ci skip]	2020-09-30 15:15:11 +02:00
Adriane Boyd	6b7bb32834	Refactor Chinese initialization	2020-09-30 11:46:45 +02:00
Ines Montani	34f9c26c62	Add lexeme norm defaults	2020-09-30 10:20:14 +02:00
Ines Montani	1aeef3bfbb	Make corpus paths default to None and improve errors	2020-09-29 22:33:46 +02:00
Ines Montani	fa47f87924	Tidy up and auto-format	2020-09-29 21:39:28 +02:00
Ines Montani	6467a560e3	WIP: Test updating Chinese tokenizer	2020-09-29 21:10:22 +02:00
Ines Montani	78021089f9	Merge pull request #6160 from explosion/feature/prepare	2020-09-29 20:55:13 +02:00
Ines Montani	c3f8c09d7d	Merge pull request #6154 from adrianeboyd/bugfix/chinese-tokenizer-pickle	2020-09-29 20:54:59 +02:00
Ines Montani	d3c63b7965	Merge branch 'develop' into feature/prepare	2020-09-29 20:53:05 +02:00
Ines Montani	2be80379ec	Fix small issues, resolve_dot_names and debug model	2020-09-29 20:38:35 +02:00
Ines Montani	7851020653	Update tests	2020-09-29 18:14:15 +02:00
Ines Montani	f2352eb701	Test with default value	2020-09-29 17:00:40 +02:00
Ines Montani	63d1598137	Simplify config use in Language.initialize	2020-09-29 16:05:48 +02:00
Ines Montani	56f8bc73ef	Add more tests	2020-09-29 15:23:34 +02:00
Ines Montani	591038b1a4	Add test	2020-09-29 12:54:52 +02:00
Matthew Honnibal	e1fdf2b7c5	Upd tests	2020-09-29 12:05:38 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Ines Montani	2e9c9e74af	Fix config resolution and interpolation TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)	2020-09-28 15:34:00 +02:00
Ines Montani	822ea4ef61	Refactor CLI	2020-09-28 15:09:59 +02:00
Matthew Honnibal	a976da168c	Support data augmentation in Corpus (#6155 ) * Support data augmentation in Corpus * Note initial docs for data augmentation * Add augmenter to quickstart * Fix flake8 * Format * Fix test * Update spacy/tests/training/test_training.py * Improve data augmentation arguments * Update templates * Move randomization out into caller * Refactor * Update spacy/training/augment.py * Update spacy/tests/training/test_training.py * Fix augment * Fix test	2020-09-28 03:03:27 +02:00
Ines Montani	9016d23cc5	Fix exclude and add test	2020-09-27 23:34:03 +02:00
Ines Montani	7e938ed63e	Update config resolution to use new Thinc	2020-09-27 22:21:31 +02:00
Adriane Boyd	8393dbedad	Minor fixes * Put `cfg` back in serialization * Add `pickle5` to pytest conf	2020-09-27 15:15:53 +02:00
Adriane Boyd	11e195d3ed	Update ChineseTokenizer * Allow `pkuseg_model` to be set to `None` on initialization * Don't save config within tokenizer * Force convert pkuseg_model to use pickle protocol 4 by reencoding with `pickle5` on serialization * Update pkuseg serialization test	2020-09-27 14:00:18 +02:00
Ines Montani	ca3c997062	Improve CLI config validation with latest Thinc	2020-09-26 13:13:57 +02:00
Adriane Boyd	3c062b3911	Add MORPH handling to Matcher (#6107 ) * Add MORPH handling to Matcher * Add `MORPH` to `Matcher` schema * Rename `_SetMemberPredicate` to `_SetPredicate` * Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate` * Add special handling for normalization and conversion of morph values into sets * For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only matches for 0 or 1 values * Update test * Rename to IS_SUBSET and IS_SUPERSET	2020-09-24 16:55:09 +02:00
Adriane Boyd	59340606b7	Add option to disable Matcher errors (#6125 ) * Add option to disable Matcher errors * Add option to disable Matcher errors when a doc doesn't contain a particular type of annotation Minor additional change: * Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH` values * Rename suppress_errors to allow_missing Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Refactor annotation checks in Matcher and PhraseMatcher Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-24 16:54:39 +02:00
Sofie Van Landeghem	c7eedd3534	updates to NEL functionality (#6132 ) * NEL: read sentences and ents from reference * fiddling with sent_start annotations * add KB serialization test * KB write additional file with strings.json * score_links function to calculate NEL P/R/F * formatting * documentation	2020-09-24 16:53:59 +02:00
Ines Montani	d0ef4a4cf5	Prevent division by zero in score weights	2020-09-24 16:42:13 +02:00
Ines Montani	58dde293ce	Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2	2020-09-24 14:44:42 +02:00
Adriane Boyd	8eaacaae97	Refactor Doc.ents setter to use Doc.set_ents Additional changes: * Entity spans with missing labels are ignored * Fix ent_kb_id setting in `Doc.set_ents`	2020-09-24 12:36:51 +02:00
Ines Montani	c6c67b606e	Merge pull request #6133 from explosion/fix/score_weights	2020-09-24 12:00:57 +02:00
Ines Montani	4bbe41f017	Fix combined scores and update test	2020-09-24 10:42:47 +02:00
Sofie Van Landeghem	c645c4e7ce	fix micro PRF for textcat (#6130 ) * fix micro PRF for textcat * small fix	2020-09-24 10:31:17 +02:00
Ines Montani	ae51f580c1	Fix handling of score_weights	2020-09-24 10:27:33 +02:00
svlandeg	b816ace4bb	format	2020-09-23 17:33:13 +02:00
svlandeg	5a9fdbc8ad	state_type as Literal	2020-09-23 17:32:14 +02:00
svlandeg	dd2292793f	'parser' instead of 'deps' for state_type	2020-09-23 16:53:49 +02:00
svlandeg	6c85fab316	state_type and extra_state_tokens instead of nr_feature_tokens	2020-09-23 13:35:09 +02:00
Ines Montani	60a317520a	Merge pull request #6109 from svlandeg/feature/2rename	2020-09-23 09:47:12 +02:00
Sofie Van Landeghem	86a08f819d	tok2vec.update instead of predict (#6113 )	2020-09-22 21:54:52 +02:00
Adriane Boyd	e4acb28658	Fix norm in retokenizer split (#6111 ) Parallel to behavior in merge, reset norm on original token in retokenizer split.	2020-09-22 21:53:33 +02:00
Sofie Van Landeghem	e0e793be4d	fix KB IO (#6118 )	2020-09-22 21:53:06 +02:00
Sofie Van Landeghem	d53c84b6d6	avoid None callback (#6100 )	2020-09-22 13:54:44 +02:00
Adriane Boyd	535842e483	Merge branch 'develop' into feature/doc-ents-v3-2	2020-09-22 13:45:50 +02:00
Ines Montani	5e3b796b12	Validate section refs in debug config	2020-09-22 12:24:39 +02:00
svlandeg	e1b8090b9b	few more fixes	2020-09-22 12:01:06 +02:00
svlandeg	b556a10808	rename converts in_to_out	2020-09-22 11:50:19 +02:00
Ines Montani	beb766d0a0	Add test	2020-09-22 09:15:57 +02:00
Ines Montani	69f7e52c26	Update README.md	2020-09-22 09:10:06 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Ines Montani	a5f6ab4943	Merge pull request #6098 from adrianeboyd/feature/doc-init	2020-09-21 18:35:20 +02:00
Adriane Boyd	f212303729	Add sent_starts to Doc.__init__ Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start` values but also convert to and accept `sent_start` internally.	2020-09-21 17:59:09 +02:00
Adriane Boyd	177df15d89	Implement Doc.set_ents	2020-09-21 15:54:05 +02:00
Adriane Boyd	13fbf6556a	Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2	2020-09-21 14:42:04 +02:00
Adriane Boyd	ce455f30ca	Fix formatting	2020-09-21 13:53:29 +02:00
Adriane Boyd	bc02e86494	Extend Doc.__init__ with additional annotation Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to `Doc.__init__` to initialize the most common doc/token values.	2020-09-21 13:36:24 +02:00
Ines Montani	758ead8a47	Sync overrides with CLI overrides	2020-09-21 12:50:13 +02:00
Ines Montani	5497acf49a	Support config overrides via environment variables	2020-09-21 11:25:10 +02:00
Ines Montani	1114219ae3	Tidy up and auto-format	2020-09-21 10:59:07 +02:00
Adriane Boyd	eed4b785f5	Load vocab lookups tables at beginning of training Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created. The option moves from `nlp.load_vocab_data` to `training.lookups`. Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided. The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included. To load `lexeme_norm` from `spacy-lookups-data`: ``` [training.lookups] @misc = "spacy.LoadLookupsData.v1" lang = ${nlp.lang} tables = ["lexeme_norm"] ```	2020-09-18 15:59:16 +02:00
Ines Montani	a127fa475e	Merge pull request #6078 from svlandeg/fix/corpus	2020-09-18 14:44:21 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086 ) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00
Adriane Boyd	8b650f3a78	Modify setting missing and blocked entity tokens In order to make it easier to construct `Doc` objects as training data, modify how missing and blocked entity tokens are set to prioritize setting `O` and missing entity tokens for training purposes over setting blocked entity tokens. * `Doc.ents` setter sets tokens outside entity spans to `O` regardless of the current state of each token * For `Doc.ents`, setting a span with a missing label sets the `ent_iob` to missing instead of blocked * `Doc.block_ents(spans)` marks spans as hard `O` for use with the `EntityRecognizer`	2020-09-17 21:27:42 +02:00
svlandeg	427dbecdd6	cleanup and formatting	2020-09-17 11:48:04 +02:00
svlandeg	0c35885751	generalize corpora, dot notation for dev and train corpus	2020-09-17 11:38:59 +02:00
svlandeg	781fae678b	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-17 09:24:36 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044 ) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_form_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Adriane Boyd	a119667a36	Clean up spacy.tokens (#6046 ) * Clean up spacy.tokens * Update `set_children_from_heads`: * Don't check `dep` when setting lr_* or sentence starts * Set all non-sentence starts to `False` * Use `set_children_from_heads` in `Token.head` setter * Reduce similar/duplicate code (admittedly adds a bit of overhead) * Update sentence starts consistently * Remove unused `Doc.set_parse` * Minor changes: * Declare cython variables (to avoid cython warnings) * Clean up imports * Modify set_children_from_heads to set token range Modify `set_children_from_heads` so that it adjust tokens within a specified range rather then the whole document. Modify the `Token.head` setter to adjust only the tokens affected by the new head assignment.	2020-09-16 20:32:38 +02:00
Adriane Boyd	f3db3f6fe0	Add vectors option to CharacterEmbed (#6069 ) * Add vectors option to CharacterEmbed * Update spacy/pipeline/morphologizer.pyx * Adjust default morphologizer config Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-16 17:45:04 +02:00
Adriane Boyd	87c329c711	Set rule-based lemmatizers as default (#6076 ) For languages without provided models and with lemmatizer rules in `spacy-lookups-data`, make the rule-based lemmatizer the default: Bengali, Persian, Norwegian, Swedish	2020-09-16 17:37:29 +02:00

... 2 3 4 5 6 ...

2241 Commits