spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 10:26:35 +03:00

Author	SHA1	Message	Date
Ines Montani	6cfa66ed1c	Make training.loop return nlp object and path (#6520 )	2020-12-08 14:55:55 +08:00
Sofie Van Landeghem	2c27093c5f	require_cpu functionality (#6336 ) * add require_cpu from Thinc 8.0.0rc2 * add docs * fix test if cupy is not installed	2020-12-08 14:42:40 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451 ) * define new architectures for the pretraining objective * add loss function as attr of the model * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Adriane Boyd	29b058ebdc	Fix spacy when retokenizing cases with affixes (#6475 ) Preserve `token.spacy` corresponding to the span end token in the original doc rather than adjusting for the current offset. * If not modifying in place, this checks in the original document (`doc.c` rather than `tokens`). * If modifying in place, the document has not been modified past the current span start position so the value at the current span end position is valid.	2020-12-08 14:25:56 +08:00
Adriane Boyd	4448680750	Fix alignment for 1-to-1 tokens and lowercasing (#6476 ) * When checking for token alignments, check not only that the tokens are identical but that the character positions are both at the start of a token. It's possible for the tokens to be identical even though the two tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs. `["a", "''", "'"]`, where the middle tokens are identical but should not be aligned on the token level at character position 2 since it's the start of one token but the middle of another. * Use the lowercased version of the token texts to create the character-to-token alignment because lowercasing can change the string length (e.g., for `İ`, see the not-a-bug bug report: https://bugs.python.org/issue34723)	2020-12-08 14:25:16 +08:00
Adriane Boyd	e931d3f72b	Move max_length to nlp.make_doc() (#6512 ) Move max_length check to `nlp.make_doc()` so that it's also checked for `nlp.pipe()`.	2020-12-08 14:24:02 +08:00
Ines Montani	ee2ec52f48	Merge pull request #6409 from svlandeg/feature/trf-docs	2020-12-08 06:32:10 +01:00
Ines Montani	82e88f0e3b	Merge pull request #6379 from svlandeg/fix/labels-constructor	2020-12-08 06:29:56 +01:00
Adriane Boyd	d70950605c	Warn on empty POS for the rule-based lemmatizer Add a warning to the rule-based lemmatizer for any tokens without POS annotation.	2020-12-04 11:46:15 +01:00
Adriane Boyd	78085fab1f	Check for spacy-nightly package in download (#6502 ) Also check for spacy-nightly in download so that `--no-deps` isn't set for normal nightly installs.	2020-12-04 09:40:03 +01:00
Ines Montani	63f83e7034	Merge pull request #6470 from adrianeboyd/feature/license-in-package	2020-12-04 03:55:54 +01:00
Sofie Van Landeghem	d6c616a125	Fixes in test suite (#6457 ) * fix slow test for textcat readers * cleanup test_issue5551 * add explicit score weight * cleanup	2020-12-02 12:57:08 +01:00
Adriane Boyd	31ec9a906e	Clean up 3rd party license info (#6478 ) Move scikit-learn license from `Scorer` to `licenses/3rd_party_licenses.txt`.	2020-12-02 10:15:23 +01:00
Adriane Boyd	591cd48aa8	Remove config.cfg from MANIFEST	2020-12-01 12:58:02 +01:00
Adriane Boyd	b0dd13e0ba	Support LICENSE in spacy package If present, include the file `input_dir/LICENSE` at the top level of the packaged model.	2020-11-30 13:43:58 +01:00
Adriane Boyd	53c0fb7431	Only set NORM on Token in retokenizer (#6464 ) * Only set NORM on Token in retokenizer Instead of setting `NORM` on both the token and lexeme, set `NORM` only on the token. The retokenizer tries to set all possible attributes with `Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate which attributes are available for each. `NORM` is the only attribute that's stored on both and for most cases it doesn't make sense to set the global norms based on an individual retokenization. For lexeme-only attributes like `IS_STOP` there's no way to avoid the global side effects, but I think that `NORM` would be better only on the token. * Fix test	2020-11-30 09:35:42 +08:00
Adriane Boyd	03ae77e603	Add SPACY as a Matcher attribute (#6463 )	2020-11-30 09:34:50 +08:00
Sofie Van Landeghem	079f6ea474	avoid resolving the full config (#6465 )	2020-11-30 09:34:29 +08:00
Ines Montani	9beba7164f	Make jinja2 top-level import No problem anymore since it's now an official dependency	2020-11-27 15:17:14 +08:00
Adriane Boyd	26296ab223	Add error message if DocBin zlib decompress fails (#6394 ) Add a better error message if DocBin zlib decompress fails, indicating that the data is not in `DocBin` format.	2020-11-27 14:39:49 +08:00
Adriane Boyd	3a5cc5f8b4	Set version to v2.3.4	2020-11-26 08:48:52 +01:00
Adriane Boyd	e0f5646a4a	Restore cleanup_beam method (#6446 )	2020-11-25 13:21:48 +01:00
Adriane Boyd	cf693f0eae	Fix token_match in tokenizer	2020-11-25 11:49:34 +01:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Adriane Boyd	573f5c863f	Fix tag map clobbering in spacy train (#6437 ) Fix bug from #5768 where the tag map is clobbered if a custom tag map isn't provided.	2020-11-24 13:13:16 +01:00
Adriane Boyd	ce18fc6588	Set version to v2.3.3	2020-11-24 10:03:45 +01:00
Adriane Boyd	cd61d264ef	Set version to v2.3.3.dev0	2020-11-23 13:51:59 +01:00
Sofie Van Landeghem	2af31a8c8d	Bugfix textcat reproducibility on GPU (#6411 ) * add seed argument to ParametricAttention layer * bump thinc to 7.4.3 * set thinc version range Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-23 12:29:35 +01:00
Adriane Boyd	3f61f5eb54	Use int8_t instead of char in Matcher (#6413 ) * Use signed char instead of char in Matcher Remove unused char* utf8_t typedef * Use int8_t instead of signed char	2020-11-23 10:26:47 +01:00
Adriane Boyd	4284605683	Remove Beam cleanup (#6414 ) Beam cleanup is handled through the Beam finalization method.	2020-11-23 10:01:46 +01:00
Adriane Boyd	a8c2dad466	Add all vectors to vocab before pruning (#6408 ) Add all vectors to the vocab before pruning to correct the selection of vectors to prioritize.	2020-11-23 10:00:59 +01:00
svlandeg	636be3c791	Merge remote-tracking branch 'upstream/develop' into feature/trf-docs	2020-11-19 14:15:35 +01:00
svlandeg	73fc1ed963	remove labels from morphologizer constructor	2020-11-11 21:48:50 +01:00
svlandeg	d5a920325f	remove labels from constructor	2020-11-11 21:34:12 +01:00
Adriane Boyd	320a8b1481	Add ent_id_ to strings serialized with Doc (#6353 )	2020-11-10 20:16:07 +08:00
Adriane Boyd	a7e7d6c6c9	Ignore misaligned in Morphologizer.get_loss (#6363 ) Fix bug where `Morphologizer.get_loss` treated misaligned annotation as `EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH` mappings.	2020-11-10 20:15:09 +08:00
Sofie Van Landeghem	a0c899a0ff	Fix textcat + transformer architecture (#6371 ) * add pooling to textcat TransformerListener * maybe_get_dim in case it's null	2020-11-10 20:14:47 +08:00
Ines Montani	de6453940e	Merge pull request #6305 from svlandeg/feature/score-docs [ci skip]	2020-11-10 02:52:11 +01:00
Ines Montani	d7950c5ada	Merge pull request #6297 from adrianeboyd/docs/nightly-conda-install [ci skip]	2020-11-10 02:45:52 +01:00
svlandeg	789fb3d124	add docs for upstream argument of TransformerListener	2020-11-09 21:42:58 +01:00
Ines Montani	363ac73c72	Update docs [ci skip]	2020-11-09 12:43:26 +08:00
Daniel Vasic	20d72de986	Added Multext-East V5 tagset for Croatian language (#6248 ) * Added Multext-East V5 tagset for Croatian language * Create danielvasic.md * Update danielvasic.md * Update danielvasic.md * Add tag map to CroatianDefaults Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-05 12:19:22 +01:00
Robert Šípek	6069efe57d	Add tag map to cs language (#6284 )	2020-11-05 10:13:11 +01:00
Vu Ha	6d465ec52c	add oprd to the list of accepted deps for noun chunking (#6302 ) * add oprd to the list of accepted deps for noun chunking * add SCA	2020-11-05 09:17:35 +01:00
Adriane Boyd	31de700b0f	Fix on_match callback and remove empty patterns (#6312 ) For the `DependencyMatcher`: * Fix on_match callback so that it is called once per matched pattern * Fix results so that patterns with empty match lists are not returned	2020-11-05 09:16:26 +01:00
Sofie Van Landeghem	8ef056cf98	fix embed_size in Entity Linker architecture (#6343 )	2020-11-04 22:20:13 +01:00
Adriane Boyd	084fc575aa	Set version to v3.0.0rc3	2020-11-03 17:29:57 +01:00
Adriane Boyd	1c4df8fd09	Replace pytokenizations with internal alignment (#6293 ) * Replace pytokenizations with internal alignment Replace pytokenizations with internal alignment algorithm that is restricted to only allow differences in whitespace and capitalization. * Rename `spacy.training.align` to `spacy.training.alignment` to contain the `Alignment` dataclass * Implement `get_alignments` in `spacy.training.align` * Refactor trailing whitespace handling * Remove unnecessary exception for empty docs Allow a non-empty whitespace-only doc to be aligned with an empty doc * Remove empty docs exceptions completely	2020-11-03 16:24:38 +01:00
Adriane Boyd	a4b32b9552	Handle missing reference values in scorer (#6286 ) * Handle missing reference values in scorer Handle missing values in reference doc during scoring where it is possible to detect an unset state for the attribute. If no reference docs contain annotation, `None` is returned instead of a score. `spacy evaluate` displays `-` for missing scores and the missing scores are saved as `None`/`null` in the metrics. Attributes without unset states: * `token.head`: relies on `token.dep` to recognize unset values * `doc.cats`: unable to handle missing annotation Additional changes: * add optional `has_annotation` check to `score_spans` to replace `doc.sents` hack * update `score_token_attr_per_feat` to handle missing and empty morph representations * fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START` vs. `SENT_START` * Fix import * Update return types	2020-11-03 15:47:18 +01:00
Adriane Boyd	5d2cb86c34	Fix on_match callback for DependencyMatcher (#6313 ) Fix `DependencyMatcher` so that the callback is called only once per match.	2020-10-31 12:20:27 +01:00
Adriane Boyd	45c9a68828	Identify final Matcher pattern node by quantifier (#6317 ) Modify the internal pattern representation in `Matcher` patterns to identify the final ID state using a unique quantifier rather than a combination of other attributes. It was insufficient to identify the final ID node based on an uninitialized `quantifier` (coincidentally being the same as the `ZERO`) with `nr_attr` as 0. (In addition, it was potentially bug-prone that `nr_attr` was set to 0 even though attrs were allocated.) In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr` is 0 and the quantifier is ZERO, so the previous methods for incrementing to the ID node at the end of the pattern weren't able to distinguish the final ID node from the `{"OP": "!"}` pattern.	2020-10-31 12:18:48 +01:00
Sofie Van Landeghem	2918923541	fix resolving of dot notation (#6326 )	2020-10-31 12:17:06 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268 ) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbrevs to another list * fixed typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
svlandeg	080066ae74	remove TODO note	2020-10-26 10:37:25 +01:00
Ines Montani	2c9804038d	Fix success message [ci skip]	2020-10-23 16:11:54 +02:00
Adriane Boyd	4299a7f654	Setup / install / quickstart updates * Add `cuda110` to setup.cfg and quickstart dropdown * Switch to `pip` for pip-only packages in conda quickstart instructions * Update zh pkuseg install message with version range and conda * Remove `zh` from `extras_require` because the default doesn't require additional packages	2020-10-23 11:27:54 +02:00
Adriane Boyd	563a21834e	Save raw scores in evaluate output	2020-10-19 15:49:09 +02:00
Adriane Boyd	dd207ca6d0	Add dep_las_per_type and more generic PRF printer	2020-10-19 15:49:02 +02:00
Adriane Boyd	4300858ecb	Include per-type/feat scores in evaluate output	2020-10-19 15:48:55 +02:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Ines Montani	5a6ed01ce0	Merge pull request #6262 from adrianeboyd/bugfix/template-en-vectors	2020-10-16 15:38:08 +02:00
Adriane Boyd	c8d04b79e2	Sort and add vectors for langs without transformers	2020-10-16 08:25:16 +02:00
Adriane Boyd	2fbd43c603	Use core lg models as vectors models in quickstart	2020-10-16 08:17:53 +02:00
Jan Margeta	1ad2213349	Fix TokenPatternSchema pattern field validation Empty pattern field should be considered invalid This is fixed by replacing minItems with min_items as described in Pydantic docs: https://pydantic-docs.helpmanual.io/usage/schema/	2020-10-16 00:41:21 +02:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230 ) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	ff4267d181	Fix success message [ci skip]	2020-10-15 14:42:08 +02:00
Ines Montani	10611bf56a	Increment version [ci skip]	2020-10-15 13:30:11 +02:00
Ines Montani	4e17ddf75e	Merge pull request #6256 from adrianeboyd/bugfix/docs-to-json-raw	2020-10-15 10:35:01 +02:00
Ines Montani	b1d568a4df	Tidy up tests	2020-10-15 10:20:21 +02:00
Ines Montani	d165af26be	Auto-format [ci skip]	2020-10-15 10:08:53 +02:00
Adriane Boyd	a93d42861d	Use null raw for has_unknown_spaces in docs_to_json	2020-10-15 09:57:54 +02:00
Ines Montani	5665a21517	Tidy up	2020-10-15 09:30:32 +02:00
Ines Montani	5d62499266	Fix tests	2020-10-15 09:29:15 +02:00
Ines Montani	178760855f	Merge branch 'develop' into master-tmp	2020-10-15 09:06:03 +02:00
Ines Montani	bc85b12e6d	Merge pull request #6249 from svlandeg/feature/batch-tests	2020-10-15 08:57:56 +02:00
svlandeg	0796401c19	call NumpyOps instead of get_current_ops()	2020-10-14 16:55:00 +02:00
svlandeg	44e14ccae8	one more losses fix	2020-10-14 15:11:34 +02:00
svlandeg	0aa8851878	always return losses	2020-10-14 15:00:49 +02:00
svlandeg	e94a21638e	adding tests for trained models to ensure predict reproducibility	2020-10-13 21:07:13 +02:00
svlandeg	ede979d42f	formatting	2020-10-13 18:53:17 +02:00
svlandeg	ff83bfae3f	naming	2020-10-13 18:52:37 +02:00
svlandeg	6ccacff54e	add tests for individual spacy layers	2020-10-13 18:50:07 +02:00
svlandeg	c23041ae60	component tests single or multiple prediction	2020-10-13 16:26:53 +02:00
Ines Montani	1f49300862	Update transformer recommendations [ci skip]	2020-10-13 15:41:17 +02:00
Sofie Van Landeghem	f8a1c1afd6	avoid dropout at runtime (#6247 )	2020-10-13 14:39:59 +02:00
Ines Montani	86d648740f	Fix morph representation in Doc.to_json	2020-10-13 11:39:03 +02:00
Ines Montani	7f92a5ee6a	Update spacy/lang/ta/examples.py	2020-10-13 11:03:35 +02:00
Ines Montani	a0e12c136b	Increment version [ci skip]	2020-10-13 10:00:53 +02:00
Ines Montani	f090f39f17	Merge pull request #6245 from svlandeg/bugfix/else bugfix in _pipe	2020-10-13 09:59:06 +02:00
svlandeg	1f465bea18	if-else	2020-10-13 09:27:19 +02:00
svlandeg	40276fd3be	update NEL docs after latest refactor	2020-10-12 11:41:27 +02:00
Ines Montani	4fa967ea84	Increment version [ci skip]	2020-10-11 13:10:58 +02:00
Ines Montani	ab890a35f9	Make console logger table more compact	2020-10-11 12:55:46 +02:00
Ines Montani	99606e46fe	Relax meta.json schema [ci skip]	2020-10-11 12:30:57 +02:00
svlandeg	3a505e7e14	small edit to ensure the new word was indeed new	2020-10-10 21:05:28 +02:00
svlandeg	68d79796c6	add test for vocab after serializing KB	2020-10-10 20:59:48 +02:00
Ines Montani	539b0c10da	Tidy up and auto-format	2020-10-10 19:14:48 +02:00
Ines Montani	bfa3931c9d	Revert added_strings change (#6236 )	2020-10-10 18:55:07 +02:00
Ines Montani	796f8b9424	Increment version	2020-10-09 18:00:27 +02:00
Ines Montani	525f798841	Fix typo in test	2020-10-09 18:00:21 +02:00
Ines Montani	8ac5f22253	Adjust error message	2020-10-09 18:00:16 +02:00
svlandeg	08cb085f6c	Merge remote-tracking branch 'upstream/develop' into fix/various	2020-10-09 17:01:27 +02:00
Ines Montani	b7cb9d95e4	Merge pull request #6229 from svlandeg/bugfix/disabled	2020-10-09 16:05:11 +02:00
svlandeg	e972ecba72	add utf8 encoding for opening file	2020-10-09 16:03:14 +02:00
Ines Montani	9fb3244672	Merge pull request #6231 from adrianeboyd/feature/include-static-vectors	2020-10-09 15:54:52 +02:00
svlandeg	040c7c0541	fix get_dim calls in build_simple_cnn_text_classifier	2020-10-09 15:40:58 +02:00
Adriane Boyd	727370c633	Remove Span._recalculate_indices Remove `Span._recalculate_indices`, which is a remnant from the deprecated `Span.merge`.	2020-10-09 14:42:51 +02:00
svlandeg	853edace37	fix MultiHashEmbed example in documentation	2020-10-09 14:11:06 +02:00
svlandeg	06b9d213fd	formatting	2020-10-09 12:19:47 +02:00
svlandeg	2cafba5f50	shorten error message for clarity	2020-10-09 12:17:35 +02:00
Ines Montani	4771a10503	Make test more explicit [ci skip]	2020-10-09 12:15:26 +02:00
Ines Montani	cc3646b06c	Add xfailing test for peculiar spans failure [ci skip]	2020-10-09 12:10:25 +02:00
svlandeg	8316bc7d4a	bugfix DisabledPipes	2020-10-09 12:06:20 +02:00
svlandeg	18dfb27985	Add custom error when evaluation throws a KeyError	2020-10-09 12:05:33 +02:00
Adriane Boyd	39aabf50ab	Also rename to include_static_vectors in CharEmbed	2020-10-09 11:54:48 +02:00
Florijan Stamenković	18f5c309dc	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:14:40 +02:00
Duygu Altinok	80fb1bffc9	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-09 10:13:15 +02:00
Duygu Altinok	2fad279a44	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:10:22 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
Ines Montani	8ff73f04db	Fix morph in Doc.to_json	2020-10-08 14:44:35 +02:00
Ines Montani	064575d79d	Merge pull request #6216 from svlandeg/feature/nel-initialize	2020-10-08 11:14:12 +02:00
svlandeg	3e2e1fd323	cleanup	2020-10-08 10:37:32 +02:00
svlandeg	eaf5c265cb	set_kb method for entity_linker	2020-10-08 10:34:01 +02:00
Ines Montani	010956d493	Clear rule-based components on initialize	2020-10-08 09:51:31 +02:00
Baranitharan	d6037c1860	added sentence	2020-10-08 08:22:58 +05:30
Baranitharan	81afe9b19d	Update examples.py	2020-10-08 08:17:25 +05:30
Sofie Van Landeghem	241cd112f5	add reenabled pipe names back to the meta before serializing (#6219 )	2020-10-08 00:44:16 +02:00
Sofie Van Landeghem	2998131416	Reproducibility for TextCat and Tok2Vec (#6218 ) * ensure fixed seed in HashEmbed layers * forgot about the joys of python 2	2020-10-08 00:43:46 +02:00
svlandeg	efedccea8d	fix tests	2020-10-07 15:29:52 +02:00
svlandeg	6b8bdb2d39	add init_config to nlp.create_pipe	2020-10-07 14:58:16 +02:00
svlandeg	33c2d4af16	move kb_loader to initialize for NEL instead of constructor	2020-10-07 14:56:00 +02:00
Wannaphong Phatthiyaphaibun	9fc8392b38	Add Thai tag map (LST20 Corpus) (#6163 ) * Add Thai tag map (LST20 Corpus) By @korakot * Update tag_map.py * Update tag_map.py * Update tag_map.py	2020-10-07 11:12:01 +02:00
Duygu Altinok	7e821c2776	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-07 11:07:52 +02:00
Duygu Altinok	2ce6fc2611	Turkish tag map and morph rules addition (#6141 ) * feat: added turkish tag map * feat: morph rules cconj and sconj * feat: more conjuncts * feat: added popular postpositions * feat: added adverbs * feat: added personal pronouns * feat: added reflexive pronouns * minor: corrected case capital * minor: fixed comma typo * feat: added indef pronouns * feat: added dict iter * fixed comma typo * updated language class with tag map and morph * use default tag map instead * removed tag map	2020-10-07 10:27:36 +02:00
Duygu Altinok	b95a11dd95	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-07 10:25:37 +02:00
Rahul Gupta	1a00bff06d	Hindi: Adds tests for lexical attributes (norm and like_num) (#5829 ) * Hindi: Adds tests for lexical attributes (norm and like_num) * Signs and adds the contributor agreement * Add ordinal numbers to be tagged as like_num * Adds alternate pronunciation for 31 and 39	2020-10-07 10:23:32 +02:00
Nuccy90	c809b2c8e7	Update morph_rules.py (#6102 ) * Update morph_rules.py Added "dig" and "dej" ("you" in accusative form) * Create Nuccy90.md * Update Nuccy90.md	2020-10-06 15:14:47 +02:00
Matthew Honnibal	1a500f9717	Set version to v3.0.0a35	2020-10-06 14:19:07 +02:00
Sofie Van Landeghem	fff3f8ccfa	Fix packaging pin (#6212 ) * pin packaging to >=20.0 * ignore spacy-pkuseg in requirements unit test	2020-10-06 14:16:05 +02:00
Matthew Honnibal	cfb9770a94	Fix empty input into StaticVectors layer (#6211 ) * Add test for empty doc(s) * Fix empty check in staticvectors * Remove xfail * Update spacy/ml/staticvectors.py	2020-10-06 14:15:41 +02:00
Florijan Stamenković	9db670b996	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-06 11:17:37 +02:00
Ines Montani	568e12215d	Merge pull request #6206 from svlandeg/fix/patterns-init	2020-10-06 10:27:23 +02:00
svlandeg	9b4cf7b0b6	update output of debug config command	2020-10-06 09:47:23 +02:00
svlandeg	ff9ac39c88	read entity_ruler patterns with srsly.read_jsonl.v1	2020-10-05 22:50:14 +02:00
Ines Montani	126268ce50	Auto-format [ci skip]	2020-10-05 21:58:18 +02:00
Ines Montani	1a554bdcb1	Update docs and docstring [ci skip]	2020-10-05 21:55:27 +02:00
Ines Montani	9614e53b02	Tidy up and auto-format	2020-10-05 21:55:18 +02:00
Ines Montani	181039bd17	Merge pull request #6205 from explosion/feature/embed-features	2020-10-05 21:49:10 +02:00
Ines Montani	5ba418b08c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-10-05 21:44:01 +02:00
Ines Montani	568617af58	Merge pull request #6202 from explosion/feature/project-spacy-version	2020-10-05 21:40:52 +02:00
Ines Montani	2d0c0134bc	Adjust message [ci skip]	2020-10-05 21:38:23 +02:00
Ines Montani	6abfc2911d	Merge pull request #6203 from adrianeboyd/feature/zh-spacy-pkuseg	2020-10-05 21:35:57 +02:00
Matthew Honnibal	b7e01d2024	Fix quickstart	2020-10-05 21:21:30 +02:00
Matthew Honnibal	ff8b980775	Upd quickstart template	2020-10-05 21:19:41 +02:00
Matthew Honnibal	91d0fbb588	Fix test	2020-10-05 21:13:53 +02:00
Ines Montani	9ca283a899	Merge branch 'develop' into feature/project-spacy-version	2020-10-05 21:06:07 +02:00
Ines Montani	0135f6ed95	Enable commit check via env var	2020-10-05 20:51:15 +02:00
Matthew Honnibal	b392d48e76	Fix test	2020-10-05 20:17:07 +02:00
Ines Montani	be99f1e4de	Remove output dirs before training (#6204 ) * Remove output dirs before training * Re-raise error if cleaning fails	2020-10-05 20:11:16 +02:00
Matthew Honnibal	e50047f1c5	Check lengths match	2020-10-05 20:02:45 +02:00
Ines Montani	582701519e	Remove __release__ flag	2020-10-05 20:00:49 +02:00
Ines Montani	d58fb42707	Add spacy_version option and validation for project.yml	2020-10-05 20:00:42 +02:00
Matthew Honnibal	db84d175c3	Fix test	2020-10-05 19:59:30 +02:00
Matthew Honnibal	cdd2b79b6d	Remove deprecated MultiHashEmbed	2020-10-05 19:58:18 +02:00
Matthew Honnibal	6dcc4a0ba6	Simplify MultiHashEmbed signature	2020-10-05 19:57:45 +02:00
svlandeg	193e0d5a98	add docs for entity_ruler.initialize	2020-10-05 18:04:08 +02:00
svlandeg	3ac3447eee	cleanup	2020-10-05 17:50:37 +02:00
svlandeg	9eb813a35d	Merge remote-tracking branch 'upstream/develop' into fix/patterns-init	2020-10-05 17:49:44 +02:00
Adriane Boyd	f102ef6b54	Read features.msgpack instead of features.pkl	2020-10-05 17:47:39 +02:00
svlandeg	4e3ace4b8c	is_trainable method	2020-10-05 17:43:42 +02:00
Ines Montani	84fedcebab	Make args keyword-only [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-10-05 17:07:35 +02:00
Matthew Honnibal	71e73ed0a6	Merge branch 'develop' into feature/embed-features	2020-10-05 17:00:05 +02:00
Matthew Honnibal	3ee3649b52	Fix augment	2020-10-05 16:59:49 +02:00
Matthew Honnibal	22937d25a9	Merge branch 'develop' into feature/embed-features	2020-10-05 16:42:17 +02:00
Matthew Honnibal	8deed614e9	Fix augment	2020-10-05 16:41:45 +02:00
Matthew Honnibal	4ed3e037df	Fix augment	2020-10-05 16:40:55 +02:00
Matthew Honnibal	9f1bc3f24c	Fix augment	2020-10-05 16:40:23 +02:00
svlandeg	dc06912c76	prevent loss keyerror for non-trainable components	2020-10-05 16:33:28 +02:00
Adriane Boyd	187234648c	Revert back to "default" as default for pkuseg_user_dict	2020-10-05 16:24:28 +02:00
svlandeg	65abd77779	add finish_update to Pipe	2020-10-05 16:23:33 +02:00
Matthew Honnibal	90040aacec	Fix merge	2020-10-05 16:12:01 +02:00
Matthew Honnibal	93a98e8c3e	Merge branch 'develop' into feature/embed-features	2020-10-05 15:51:31 +02:00
Matthew Honnibal	eb9ba61517	Format	2020-10-05 15:29:49 +02:00
Matthew Honnibal	7d93575f35	spacy/tests/	2020-10-05 15:28:12 +02:00
Matthew Honnibal	f4ca9a39cb	spacy/tests/	2020-10-05 15:27:06 +02:00
Matthew Honnibal	f2f1deca66	spacy/tests/	2020-10-05 15:24:33 +02:00
Matthew Honnibal	8ec79ad3fa	Allow configuration of MultiHashEmbed features Update arguments to MultiHashEmbed layer so that the attributes can be controlled. A kind of tricky scheme is used to allow optional specification of the rows. I think it's an okay balance between flexibility and convenience.	2020-10-05 15:22:00 +02:00
Ines Montani	7946fd84bb	Merge pull request #6200 from adrianeboyd/bugfix/vocab-disk-lookups-vectors Always serialize lookups and vectors to disk	2020-10-05 15:15:25 +02:00
Ines Montani	8171e28b20	Remove logging [ci skip] This would be fired on each example, which is wrong	2020-10-05 15:09:52 +02:00
svlandeg	251b3eb4e5	add initialize method for entity_ruler	2020-10-05 14:59:13 +02:00
Sofie Van Landeghem	f4f49f5877	update blis (#6198 ) * allow higher blis version * fix typo * bump to 3.0.0a34 * fix pins in other files	2020-10-05 14:58:56 +02:00
Adriane Boyd	5d19dfc9d3	Update Chinese tokenizer for spacy-pkuseg fork	2020-10-05 14:21:53 +02:00
Matthew Honnibal	6a9d14e35a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-10-05 14:17:41 +02:00
Matthew Honnibal	d2b9aafb8c	Fix augmenter	2020-10-05 14:14:49 +02:00
Ines Montani	6260fa3c10	Merge pull request #6201 from svlandeg/fix/error_nr	2020-10-05 14:00:57 +02:00
Ines Montani	6958510bda	Include spaCy version check in project CLI	2020-10-05 13:53:07 +02:00
Ines Montani	20f2a17a09	Merge test_misc and test_util	2020-10-05 13:45:57 +02:00
svlandeg	fd2d48556c	fix E902 and E903 numbering	2020-10-05 13:43:32 +02:00
Ines Montani	1c641e41c3	Remove unused import [ci skip]	2020-10-05 11:50:11 +02:00
Adriane Boyd	03cfb2d2f4	Always serialize lookups and vectors to disk	2020-10-05 09:40:20 +02:00
Adriane Boyd	b0b93854cb	Update ru/uk lemmatizers for new nlp.initialize	2020-10-05 09:27:16 +02:00
Ines Montani	549758f67d	Adjust test for now	2020-10-04 23:16:09 +02:00
Ines Montani	4b15ff7504	Increment version [ci skip]	2020-10-04 22:47:04 +02:00
Ines Montani	f1d1f78636	Make warning debug log [ci skip]	2020-10-04 22:44:21 +02:00
Ines Montani	3c36a57e84	Update data augmenters (#6196 ) * Draft lower-case augmenter * Make warning a debug log * Update lowercase augmenter, docs and tests Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-10-04 17:46:29 +02:00
Ines Montani	d38dc466c5	Adjust error [ci skip]	2020-10-04 15:26:01 +02:00
Ines Montani	496228771d	Merge pull request #6194 from explosion/master-tmp	2020-10-04 15:25:41 +02:00
Ines Montani	0307a228c8	Merge pull request #6193 from explosion/fix/adjust-pipe-init Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 15:20:54 +02:00
Ines Montani	59deeb7da6	Merge branch 'develop' into master-tmp	2020-10-04 14:52:20 +02:00
Ines Montani	43d7652635	Merge pull request #6192 from explosion/feature/init-attr-ruler	2020-10-04 14:46:37 +02:00
Ines Montani	8f018e47f8	Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 14:43:45 +02:00
Matthew Honnibal	84ae197dd6	Fix logger	2020-10-04 14:16:53 +02:00
Ines Montani	11347f34da	Tidy up, tests and docs	2020-10-04 13:54:05 +02:00
Matthew Honnibal	96b636c2d3	Update attribute ruler	2020-10-04 13:08:21 +02:00
Ines Montani	bcd52e5486	Tidy up errors and warnings	2020-10-04 11:16:31 +02:00
Ines Montani	ff914f4e6f	Lazy-load xx	2020-10-04 11:10:26 +02:00
Ines Montani	d3b3663942	Adjust error message and add test	2020-10-04 10:11:27 +02:00
Ines Montani	2110e8f86d	Auto-format	2020-10-04 10:06:49 +02:00
Ines Montani	cc08c88a89	Merge pull request #6187 from svlandeg/fix/begin_training_pipe	2020-10-04 10:01:02 +02:00
svlandeg	3f657ed3a1	implement warning in __init_subclass__ instead	2020-10-03 22:34:10 +02:00
Matthew Honnibal	3b2a78720c	Upd morphologizer	2020-10-03 19:35:19 +02:00
Matthew Honnibal	835070cedc	Upd test	2020-10-03 19:35:10 +02:00
Matthew Honnibal	70b9de8e58	Set version to v3.0.0a32	2020-10-03 19:26:52 +02:00
Matthew Honnibal	85ede32680	Format	2020-10-03 19:26:23 +02:00
Matthew Honnibal	b305f2ff5a	Fix loggers	2020-10-03 19:26:10 +02:00
Matthew Honnibal	4fccd2ceaf	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-10-03 19:13:55 +02:00
Matthew Honnibal	8ea8b7d940	Support loading labels in morphologizer	2020-10-03 19:13:42 +02:00
Ines Montani	c2401fca41	Add tests for Pipe.label_data	2020-10-03 19:12:46 +02:00
Ines Montani	80603f0fa5	Make SentenceRecognizer.label_data return None Overwrite the method from the base class (Tagger) but don't export anything in "init labels"	2020-10-03 18:54:09 +02:00
Ines Montani	d6c967401f	Increment version	2020-10-03 17:20:47 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00
Ines Montani	7c4ab7e82c	Fix Lemmatizer.get_lookups_config	2020-10-03 17:16:10 +02:00
Ines Montani	dd542ec6a4	Fix label initialization of textcat component (#6190)	2020-10-03 17:07:38 +02:00
Ines Montani	989a96308f	Tidy up, auto-format, types	2020-10-03 16:31:58 +02:00
Matthew Honnibal	7b127f307e	Set version to v3.0.0a30	2020-10-03 16:06:42 +02:00
Matthew Honnibal	db419f6b2f	Improve control of training progress and logging (#6184) * Make logging and progress easier to control * Update docs * Cleanup errors * Fix ConfigValidationError * Pass stdout/stderr, not wasabi.Printer * Fix type * Upd logging example * Fix logger example * Fix type	2020-10-03 14:57:46 +02:00
Ines Montani	ae15c9de79	Raise error from caught KeyError to preserve traceback	2020-10-03 11:43:56 +02:00
Ines Montani	f758804401	Save one line of code	2020-10-03 11:41:28 +02:00
Stanislav Schmidt	3589a64d44	Change type of texts argument in pipe to iterable (#6186) * Change type of texts argument in pipe to iterable * Add contributor agreement	2020-10-02 21:00:11 +02:00
svlandeg	02247cccaf	Merge remote-tracking branch 'upstream/develop' into feature/small-fixes	2020-10-02 20:48:11 +02:00
svlandeg	fb48de349c	bwd compat for pipe.begin_training	2020-10-02 20:31:14 +02:00
Matthew Honnibal	6965cdf16d	Fix comment	2020-10-02 17:26:21 +02:00
Ines Montani	3cf10a0729	Merge pull request #6183 from adrianeboyd/feature/quickstart-morphologizer Add morphologizer to quickstart template	2020-10-02 17:08:01 +02:00
Adriane Boyd	62ccd5c4df	Relax model meta performance schema (#6185) Allow more embedded per_x in `ModelMetaSchema`	2020-10-02 16:37:21 +02:00
Sofie Van Landeghem	09dcb75076	small UX fix for DocBin (#6167) * add informative warning when messing up store_user_data DocBin flags * add informative warning when messing up store_user_data DocBin flags * cleanup test * rename to patterns_path	2020-10-02 15:43:32 +02:00
Ines Montani	f0b30aedad	Make lemmatizers use initialize logic (#6182) * Make lemmatizer use initialize logic and tidy up * Fix typo * Raise for uninitialized tables	2020-10-02 15:42:36 +02:00
Adriane Boyd	22158dc24a	Add morphologizer to quickstart template	2020-10-02 15:06:16 +02:00
Ines Montani	d2aa662ab2	Merge pull request #6179 from adrianeboyd/feature/token-morph-refactor-2 [ci skip]	2020-10-02 12:10:27 +02:00
Ines Montani	c41a4332e4	Add test for custom data augmentation	2020-10-02 11:37:56 +02:00
svlandeg	acc391c2a8	remove redundant str() call	2020-10-02 11:05:59 +02:00
Ines Montani	3856048437	Merge pull request #6178 from explosion/feature/file-readers Integrate file readers via srsly, update orth_variants loading	2020-10-02 10:26:09 +02:00
Adriane Boyd	f83dfe62da	Fix test	2020-10-02 10:17:26 +02:00
Adriane Boyd	65dfaa4f4b	Also accept MorphAnalysis in set_morph	2020-10-02 08:33:43 +02:00
Adriane Boyd	77e08c398f	Switch reset value for set_morph to None	2020-10-02 08:25:15 +02:00
Ines Montani	568768643e	Increment version [ci skip]	2020-10-02 01:50:13 +02:00
Ines Montani	01c1538c72	Integrate file readers	2020-10-02 01:36:06 +02:00
Ines Montani	af282ae732	Fix import	2020-10-02 01:12:34 +02:00
Ines Montani	e59ecb12c0	Auto-format	2020-10-02 01:12:30 +02:00
Matthew Honnibal	75a1569908	Merge	2020-10-01 23:07:53 +02:00
Matthew Honnibal	300e5a9928	Avoid relying on NORM in default v3 models (#6176) * Allow CharacterEmbed to specify feature * Default to LOWER in character embed * Update tok2vec * Use LOWER, not NORM	2020-10-01 23:05:55 +02:00
Ines Montani	5762876dcc	Update default config [ci skip]	2020-10-01 22:27:37 +02:00
Adriane Boyd	86c3ec9c2b	Refactor Token morph setting (#6175) * Refactor Token morph setting * Remove `Token.morph_` * Add `Token.set_morph()` * `0` resets `token.c.morph` to unset * Any other values are passed to `Morphology.add` * Add token.morph setter to set from MorphAnalysis	2020-10-01 22:21:46 +02:00
Matthew Honnibal	b854bca15c	Default to LOWER in character embed	2020-10-01 22:17:58 +02:00
Matthew Honnibal	684a77870b	Allow CharacterEmbed to specify feature	2020-10-01 22:17:26 +02:00
Ines Montani	da30701cd1	Increment version [ci skip]	2020-10-01 21:58:11 +02:00
Ines Montani	d48ddd6c9a	Remove default initialize lookups	2020-10-01 21:54:33 +02:00
Ines Montani	1700c8541e	Increment version [ci skip]	2020-10-01 17:57:16 +02:00
Ines Montani	f2627157c8	Update docs [ci skip]	2020-10-01 17:38:17 +02:00
Ines Montani	7f68f4bd92	Hide jsonl_loc on init vectors and tidy up [ci skip]	2020-10-01 16:44:17 +02:00
Adriane Boyd	27cbffff1b	Minor edit to CoNLL-U converter (#6172) This doesn't make a difference given how the `merged_morph` values override the `morph` values for all the final docs, but could have led to unexpected bugs in the future if the converter is modified.	2020-10-01 16:23:42 +02:00
Sofie Van Landeghem	a22215f427	Add FeatureExtractor from Thinc (#6170) * move featureextractor from Thinc * Update website/docs/api/architectures.md Co-authored-by: Ines Montani <ines@ines.io> * Update website/docs/api/architectures.md Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Ines Montani <ines@ines.io>	2020-10-01 16:22:48 +02:00
Adriane Boyd	73538782a0	Switch Doc.__init__(ents=) to IOB tags (#6173) * Switch Doc.__init__(ents=) to IOB tags * Fix check for "-" * Allow "" or None as missing IOB tag	2020-10-01 16:22:18 +02:00
Adriane Boyd	df98d3ef9f	Update import from collections.abc (#6174)	2020-10-01 16:21:49 +02:00
Yohei Tamura	3243ddac8f	Fix/span.sent (#6083) * add fail test * fix test * fix span.sent * Remove incorrect implicit check Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-01 14:01:52 +02:00
Ines Montani	0a8a124a6e	Update docs [ci skip]	2020-10-01 12:15:53 +02:00
Ines Montani	44160cd52f	Tidy up [ci skip]	2020-10-01 10:41:19 +02:00
Ines Montani	381258b75b	Merge pull request #6165 from explosion/feature/update-tokenizers-initialize	2020-10-01 09:49:47 +02:00
svlandeg	6787e56315	print debugging warning before raising error if model not properly initialized	2020-10-01 09:21:00 +02:00
svlandeg	5121972930	add types of Tok2Vec embedding layers	2020-10-01 09:20:09 +02:00
Ines Montani	4b6afd3611	Remove English [initialize] default block for now to get tests to pass	2020-09-30 23:49:29 +02:00
Ines Montani	6f29f68f69	Update errors and make Tokenizer.initialize args less strict	2020-09-30 23:48:47 +02:00
Ines Montani	a103ab5f1a	Update augmenter lookups and docs	2020-09-30 23:03:47 +02:00
Matthew Honnibal	5128298964	Add missing augmenter	2020-09-30 20:18:45 +02:00
Matthew Honnibal	59294e91aa	Restore the 'jsonl' arg for init vectors The lexemes.jsonl file is still used in our English vectors, and it may be required by users as well. I think it's worth supporting the option.	2020-09-30 19:06:50 +02:00
Matthew Honnibal	c379a4274a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-30 16:52:42 +02:00
Matthew Honnibal	e58dca3028	Add read_labels	2020-09-30 16:52:27 +02:00
Ines Montani	23c63eefaf	Tidy up env vars [ci skip]	2020-09-30 15:15:11 +02:00
Elijah Rippeth	4cbb954281	reorder so tagmap is replaced only if a custom file is provided. (#6164) * reorder so tagmap is replaced only if a custom file is provided. * Remove unneeded variable initialization Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-09-30 13:26:06 +02:00
Adriane Boyd	6b7bb32834	Refactor Chinese initialization	2020-09-30 11:46:45 +02:00
Ines Montani	34f9c26c62	Add lexeme norm defaults	2020-09-30 10:20:14 +02:00
Ines Montani	a5debb356d	Tidy up and adjust logging [ci skip]	2020-09-30 01:22:08 +02:00
Ines Montani	56a2f778c4	Add logging [ci skip]	2020-09-30 01:08:55 +02:00
Ines Montani	fe3f111c37	Merge pull request #6168 from explosion/fix/default-corpus-values	2020-09-30 00:24:02 +02:00
Ines Montani	b799af16de	Don't raise in Pipe.initialize if not implemented	2020-09-30 00:05:27 +02:00
Matthew Honnibal	bc61691f6f	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-29 23:41:04 +02:00
Matthew Honnibal	f52249fe2e	Fix data augmentation	2020-09-29 23:40:54 +02:00
Matthew Honnibal	14c4da547f	Try to fix augmentation	2020-09-29 23:08:56 +02:00
Ines Montani	ae51843468	Remove augmenter from jinja template [ci skip]	2020-09-29 23:08:50 +02:00
Ines Montani	9bb958fd0a	Fix debug data [ci skip]	2020-09-29 23:07:11 +02:00
Matthew Honnibal	a2aa1f6882	Disable the OVL augmentation by default	2020-09-29 23:02:40 +02:00

8529 Commits