spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-28 02:46:35 +03:00

Author	SHA1	Message	Date
Ines Montani	05a2812ae0	Merge branch 'develop' into pr/6444	2020-12-09 11:04:03 +11:00
Sofie Van Landeghem	cfc72c2995	Bugfix multi-label textcat reproducibility (#6481 ) * add test for multi-label textcat reproducibility * remove positive_label * fix lengths dtype * fix comments * remove comment that we should not have forgotten :-)	2020-12-09 06:29:15 +08:00
svlandeg	8f8a7f1733	returning config in init_config	2020-12-08 17:37:20 +01:00
Sofie Van Landeghem	2c27093c5f	require_cpu functionality (#6336 ) * add require_cpu from Thinc 8.0.0rc2 * add docs * fix test if cupy is not installed	2020-12-08 14:42:40 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451 ) * define new architectures for the pretraining objective * add loss function as attr of the model * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Adriane Boyd	29b058ebdc	Fix spacy when retokenizing cases with affixes (#6475 ) Preserve `token.spacy` corresponding to the span end token in the original doc rather than adjusting for the current offset. * If not modifying in place, this checks in the original document (`doc.c` rather than `tokens`). * If modifying in place, the document has not been modified past the current span start position so the value at the current span end position is valid.	2020-12-08 14:25:56 +08:00
Adriane Boyd	4448680750	Fix alignment for 1-to-1 tokens and lowercasing (#6476 ) * When checking for token alignments, check not only that the tokens are identical but that the character positions are both at the start of a token. It's possible for the tokens to be identical even though the two tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs. `["a", "''", "'"]`, where the middle tokens are identical but should not be aligned on the token level at character position 2 since it's the start of one token but the middle of another. * Use the lowercased version of the token texts to create the character-to-token alignment because lowercasing can change the string length (e.g., for `İ`, see the not-a-bug bug report: https://bugs.python.org/issue34723)	2020-12-08 14:25:16 +08:00
Adriane Boyd	d70950605c	Warn on empty POS for the rule-based lemmatizer Add a warning to the rule-based lemmatizer for any tokens without POS annotation.	2020-12-04 11:46:15 +01:00
Sofie Van Landeghem	d6c616a125	Fixes in test suite (#6457 ) * fix slow test for textcat readers * cleanup test_issue5551 * add explicit score weight * cleanup	2020-12-02 12:57:08 +01:00
Adriane Boyd	53c0fb7431	Only set NORM on Token in retokenizer (#6464 ) * Only set NORM on Token in retokenizer Instead of setting `NORM` on both the token and lexeme, set `NORM` only on the token. The retokenizer tries to set all possible attributes with `Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate which attributes are available for each. `NORM` is the only attribute that's stored on both and for most cases it doesn't make sense to set the global norms based on a individual retokenization. For lexeme-only attributes like `IS_STOP` there's no way to avoid the global side effects, but I think that `NORM` would be better only on the token. * Fix test	2020-11-30 09:35:42 +08:00
Adriane Boyd	03ae77e603	Add SPACY as a Matcher attribute (#6463 )	2020-11-30 09:34:50 +08:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Sofie Van Landeghem	2af31a8c8d	Bugfix textcat reproducibility on GPU (#6411 ) * add seed argument to ParametricAttention layer * bump thinc to 7.4.3 * set thinc version range Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-23 12:29:35 +01:00
Adriane Boyd	320a8b1481	Add ent_id_ to strings serialized with Doc (#6353 )	2020-11-10 20:16:07 +08:00
Adriane Boyd	a7e7d6c6c9	Ignore misaligned in Morphologizer.get_loss (#6363 ) Fix bug where `Morphologizer.get_loss` treated misaligned annotation as `EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH` mappings.	2020-11-10 20:15:09 +08:00
Ines Montani	363ac73c72	Update docs [ci skip]	2020-11-09 12:43:26 +08:00
Adriane Boyd	31de700b0f	Fix on_match callback and remove empty patterns (#6312 ) For the `DependencyMatcher`: * Fix on_match callback so that it is called once per matched pattern * Fix results so that patterns with empty match lists are not returned	2020-11-05 09:16:26 +01:00
Adriane Boyd	1c4df8fd09	Replace pytokenizations with internal alignment (#6293 ) * Replace pytokenizations with internal alignment Replace pytokenizations with internal alignment algorithm that is restricted to only allow differences in whitespace and capitalization. * Rename `spacy.training.align` to `spacy.training.alignment` to contain the `Alignment` dataclass * Implement `get_alignments` in `spacy.training.align` * Refactor trailing whitespace handling * Remove unnecessary exception for empty docs Allow a non-empty whitespace-only doc to be aligned with an empty doc * Remove empty docs exceptions completely	2020-11-03 16:24:38 +01:00
Adriane Boyd	a4b32b9552	Handle missing reference values in scorer (#6286 ) * Handle missing reference values in scorer Handle missing values in reference doc during scoring where it is possible to detect an unset state for the attribute. If no reference docs contain annotation, `None` is returned instead of a score. `spacy evaluate` displays `-` for missing scores and the missing scores are saved as `None`/`null` in the metrics. Attributes without unset states: * `token.head`: relies on `token.dep` to recognize unset values * `doc.cats`: unable to handle missing annotation Additional changes: * add optional `has_annotation` check to `score_scans` to replace `doc.sents` hack * update `score_token_attr_per_feat` to handle missing and empty morph representations * fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START` vs. `SENT_START` * Fix import * Update return types	2020-11-03 15:47:18 +01:00
Adriane Boyd	5d2cb86c34	Fix on_match callback for DependencyMatcher (#6313 ) Fix `DependencyMatcher` so that the callback is called only once per match.	2020-10-31 12:20:27 +01:00
Adriane Boyd	45c9a68828	Identify final Matcher pattern node by quantifier (#6317 ) Modify the internal pattern representation in `Matcher` patterns to identify the final ID state using a unique quantifier rather than a combination of other attributes. It was insufficient to identify the final ID node based on an uninitialized `quantifier` (coincidentally being the same as the `ZERO`) with `nr_attr` as 0. (In addition, it was potentially bug-prone that `nr_attr` was set to 0 even though attrs were allocated.) In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr` is 0 and the quantifier is ZERO, so the previous methods for incrementing to the ID node at the end of the pattern weren't able to distinguish the final ID node from the `{"OP": "!"}` pattern.	2020-10-31 12:18:48 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268 ) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbrevs to another list * fixed typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Jan Margeta	1ad2213349	Fix TokenPatternSchema pattern field validation Empty pattern field should be considered invalid This is fixed by replacing minItems with min_items as described in Pydantic docs: https://pydantic-docs.helpmanual.io/usage/schema/	2020-10-16 00:41:21 +02:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230 ) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	b1d568a4df	Tidy up tests	2020-10-15 10:20:21 +02:00
Ines Montani	d165af26be	Auto-format [ci skip]	2020-10-15 10:08:53 +02:00
Ines Montani	5d62499266	Fix tests	2020-10-15 09:29:15 +02:00
Ines Montani	178760855f	Merge branch 'develop' into master-tmp	2020-10-15 09:06:03 +02:00
svlandeg	0796401c19	call NumpyOps instead of get_current_ops()	2020-10-14 16:55:00 +02:00
svlandeg	e94a21638e	adding tests for trained models to ensure predict reproducibility	2020-10-13 21:07:13 +02:00
svlandeg	ede979d42f	formatting	2020-10-13 18:53:17 +02:00
svlandeg	ff83bfae3f	naming	2020-10-13 18:52:37 +02:00
svlandeg	6ccacff54e	add tests for individual spacy layers	2020-10-13 18:50:07 +02:00
svlandeg	c23041ae60	component tests single or multiple prediction	2020-10-13 16:26:53 +02:00
svlandeg	3a505e7e14	small edit to ensure the new word was indeed new	2020-10-10 21:05:28 +02:00
svlandeg	68d79796c6	add test for vocab after serializing KB	2020-10-10 20:59:48 +02:00
Ines Montani	539b0c10da	Tidy up and auto-format	2020-10-10 19:14:48 +02:00
Ines Montani	bfa3931c9d	Revert added_strings change (#6236 )	2020-10-10 18:55:07 +02:00
Ines Montani	525f798841	Fix typo in test	2020-10-09 18:00:21 +02:00
Ines Montani	b7cb9d95e4	Merge pull request #6229 from svlandeg/bugfix/disabled	2020-10-09 16:05:11 +02:00
Ines Montani	9fb3244672	Merge pull request #6231 from adrianeboyd/feature/include-static-vectors	2020-10-09 15:54:52 +02:00
Adriane Boyd	727370c633	Remove Span._recalculate_indices Remove `Span._recalculate_indices`, which is a remnant from the deprecated `Span.merge`.	2020-10-09 14:42:51 +02:00
svlandeg	06b9d213fd	formatting	2020-10-09 12:19:47 +02:00
Ines Montani	4771a10503	Make test more explicit [ci skip]	2020-10-09 12:15:26 +02:00
Ines Montani	cc3646b06c	Add xfailing test for peculiar spans failure [ci skip]	2020-10-09 12:10:25 +02:00
svlandeg	8316bc7d4a	bugfix DisabledPipes	2020-10-09 12:06:20 +02:00
Adriane Boyd	39aabf50ab	Also rename to include_static_vectors in CharEmbed	2020-10-09 11:54:48 +02:00
Florijan Stamenković	18f5c309dc	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:14:40 +02:00
Duygu Altinok	80fb1bffc9	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-09 10:13:15 +02:00
Duygu Altinok	2fad279a44	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:10:22 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
Ines Montani	8ff73f04db	Fix morph in Doc.to_json	2020-10-08 14:44:35 +02:00
Ines Montani	064575d79d	Merge pull request #6216 from svlandeg/feature/nel-initialize	2020-10-08 11:14:12 +02:00
svlandeg	eaf5c265cb	set_kb method for entity_linker	2020-10-08 10:34:01 +02:00
Ines Montani	010956d493	Clear rule-based components on initialize	2020-10-08 09:51:31 +02:00
Sofie Van Landeghem	2998131416	Reproducibility for TextCat and Tok2Vec (#6218 ) * ensure fixed seed in HashEmbed layers * forgot about the joys of python 2	2020-10-08 00:43:46 +02:00
svlandeg	efedccea8d	fix tests	2020-10-07 15:29:52 +02:00
svlandeg	6b8bdb2d39	add init_config to nlp.create_pipe	2020-10-07 14:58:16 +02:00
Duygu Altinok	7e821c2776	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-07 11:07:52 +02:00
Duygu Altinok	b95a11dd95	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-07 10:25:37 +02:00
Rahul Gupta	1a00bff06d	Hindi: Adds tests for lexical attributes (norm and like_num) (#5829 ) * Hindi: Adds tests for lexical attributes (norm and like_num) * Signs and adds the contributor agreement * Add ordinal numbers to be tagged as like_num * Adds alternate pronunciation for 31 and 39	2020-10-07 10:23:32 +02:00
Sofie Van Landeghem	fff3f8ccfa	Fix packaging pin (#6212 ) * pin packaging to >=20.0 * ignore spacy-pkuseg in requirements unit test	2020-10-06 14:16:05 +02:00
Matthew Honnibal	cfb9770a94	Fix empty input into StaticVectors layer (#6211 ) * Add test for empty doc(s) * Fix empty check in staticvectors * Remove xfail * Update spacy/ml/staticvectors.py	2020-10-06 14:15:41 +02:00
Florijan Stamenković	9db670b996	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-06 11:17:37 +02:00
Ines Montani	568e12215d	Merge pull request #6206 from svlandeg/fix/patterns-init	2020-10-06 10:27:23 +02:00
svlandeg	ff9ac39c88	read entity_ruler patterns with srsly.read_jsonl.v1	2020-10-05 22:50:14 +02:00
Ines Montani	126268ce50	Auto-format [ci skip]	2020-10-05 21:58:18 +02:00
Ines Montani	181039bd17	Merge pull request #6205 from explosion/feature/embed-features	2020-10-05 21:49:10 +02:00
Ines Montani	568617af58	Merge pull request #6202 from explosion/feature/project-spacy-version	2020-10-05 21:40:52 +02:00
Ines Montani	6abfc2911d	Merge pull request #6203 from adrianeboyd/feature/zh-spacy-pkuseg	2020-10-05 21:35:57 +02:00
Matthew Honnibal	91d0fbb588	Fix test	2020-10-05 21:13:53 +02:00
Matthew Honnibal	b392d48e76	Fix test	2020-10-05 20:17:07 +02:00
Matthew Honnibal	db84d175c3	Fix test	2020-10-05 19:59:30 +02:00
Matthew Honnibal	6dcc4a0ba6	Simplify MultiHashEmbed signature	2020-10-05 19:57:45 +02:00
Matthew Honnibal	7d93575f35	spacy/tests/	2020-10-05 15:28:12 +02:00
Matthew Honnibal	f4ca9a39cb	spacy/tests/	2020-10-05 15:27:06 +02:00
Matthew Honnibal	f2f1deca66	spacy/tests/	2020-10-05 15:24:33 +02:00
Matthew Honnibal	8ec79ad3fa	Allow configuration of MultiHashEmbed features Update arguments to MultiHashEmbed layer so that the attributes can be controlled. A kind of tricky scheme is used to allow optional specification of the rows. I think it's an okay balance between flexibility and convenience.	2020-10-05 15:22:00 +02:00
Adriane Boyd	5d19dfc9d3	Update Chinese tokenizer for spacy-pkuseg fork	2020-10-05 14:21:53 +02:00
Ines Montani	6958510bda	Include spaCy version check in project CLI	2020-10-05 13:53:07 +02:00
Ines Montani	20f2a17a09	Merge test_misc and test_util	2020-10-05 13:45:57 +02:00
Ines Montani	1c641e41c3	Remove unused import [ci skip]	2020-10-05 11:50:11 +02:00
Adriane Boyd	b0b93854cb	Update ru/uk lemmatizers for new nlp.initialize	2020-10-05 09:27:16 +02:00
Ines Montani	549758f67d	Adjust test for now	2020-10-04 23:16:09 +02:00
Ines Montani	3c36a57e84	Update data augmenters (#6196 ) * Draft lower-case augmenter * Make warning a debug log * Update lowercase augmenter, docs and tests Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-10-04 17:46:29 +02:00
Ines Montani	496228771d	Merge pull request #6194 from explosion/master-tmp	2020-10-04 15:25:41 +02:00
Ines Montani	0307a228c8	Merge pull request #6193 from explosion/fix/adjust-pipe-init Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 15:20:54 +02:00
Ines Montani	59deeb7da6	Merge branch 'develop' into master-tmp	2020-10-04 14:52:20 +02:00
Ines Montani	8f018e47f8	Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 14:43:45 +02:00
Ines Montani	11347f34da	Tidy up, tests and docs	2020-10-04 13:54:05 +02:00
Ines Montani	d3b3663942	Adjust error message and add test	2020-10-04 10:11:27 +02:00
Ines Montani	2110e8f86d	Auto-format	2020-10-04 10:06:49 +02:00
Matthew Honnibal	835070cedc	Upd test	2020-10-03 19:35:10 +02:00
Ines Montani	c2401fca41	Add tests for Pipe.label_data	2020-10-03 19:12:46 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00
Ines Montani	7c4ab7e82c	Fix Lemmatizer.get_lookups_config	2020-10-03 17:16:10 +02:00
Ines Montani	dd542ec6a4	Fix label initialization of textcat component (#6190 )	2020-10-03 17:07:38 +02:00
Sofie Van Landeghem	09dcb75076	small UX fix for DocBin (#6167 ) * add informative warning when messing up store_user_data DocBin flags * add informative warning when messing up store_user_data DocBin flags * cleanup test * rename to patterns_path	2020-10-02 15:43:32 +02:00
Ines Montani	f0b30aedad	Make lemmatizers use initialize logic (#6182 ) * Make lemmatizer use initialize logic and tidy up * Fix typo * Raise for uninitialized tables	2020-10-02 15:42:36 +02:00


2122 Commits
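
The message of commit 4448680750 (#6476) makes two observations that are easy to check in plain Python: identical tokens are not necessarily aligned when their character start positions differ, and lowercasing can change a string's length. The sketch below only illustrates those two points; it is not spaCy's alignment code, and `token_starts` is a hypothetical helper introduced for this example.

```python
def token_starts(tokens):
    """Return the character offsets at which each token begins.

    Hypothetical helper for illustration only; not part of spaCy.
    Offsets are computed on the lowercased token texts.
    """
    starts, offset = set(), 0
    for tok in tokens:
        starts.add(offset)
        offset += len(tok.lower())
    return starts


# The two tokenizations quoted in the commit message.
a = ["a'", "''"]
b = ["a", "''", "'"]

starts_a = token_starts(a)
starts_b = token_starts(b)

# Character position 2 starts a token in `a` but falls inside a token in `b`,
# so the identical "''" tokens should not be aligned one-to-one there.
shared_starts = starts_a & starts_b

# Lowercasing can change string length: U+0130 lowercases to "i" plus a
# combining dot above (U+0307), as described in bugs.python.org/issue34723.
upper_len = len("İ")
lower_len = len("İ".lower())
```

Here `starts_a` and `starts_b` share only offset 0, which is why comparing token texts alone is not enough to decide that two tokens line up.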