spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-28 02:46:35 +03:00

Author	SHA1	Message	Date
Adriane Boyd	bf0cdae8d4	Add token_splitter component (#6726 ) * Add long_token_splitter component Add a `long_token_splitter` component for use with transformer pipelines. This component splits up long tokens like URLs into smaller tokens. This is particularly relevant for pretrained pipelines with `strided_spans`, since the user can't change the length of the span `window` and may not wish to preprocess the input texts. The `long_token_splitter` splits tokens that are at least `long_token_length` tokens long into smaller tokens of `split_length` size. Notes: * Since this is intended for use as the first component in a pipeline, the token splitter does not try to preserve any token annotation. * API docs to come when the API is stable. * Adjust API, add test * Fix name in factory	2021-01-17 19:54:41 +08:00
Adriane Boyd	9328dd5625	Handle unset token.morph in Morphologizer (#6704 ) * Handle unset token.morph in Morphologizer Handle unset `token.morph` in `Morphologizer.initialize` and `Morphologizer.get_loss`. If both `token.morph` and `token.pos` are unset, treat the annotation as missing rather than empty. * Add token.has_morph()	2021-01-15 17:20:10 +01:00
Ines Montani	f9e4ac1283	Fix test	2021-01-15 12:51:02 +11:00
Ines Montani	b0b743597c	Tidy up and auto-format	2021-01-15 11:57:36 +11:00
Ines Montani	57369909c0	Merge pull request #6727 from adrianeboyd/chore/update-develop-from-master-rc3	2021-01-15 11:44:28 +11:00
Adriane Boyd	0c936004d1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-rc3	2021-01-14 11:49:58 +01:00
Matthew Honnibal	92310a5e26	Merge branch 'develop' into feature/missing-dep	2021-01-14 17:39:01 +11:00
Adriane Boyd	9957ed7897	Override language defaults for null token and URL match (#6705 ) * Override language defaults for null token and URL match When the serialized `token_match` or `url_match` is `None`, override the language defaults to preserve `None` on deserialization. * Fix fixtures in tests	2021-01-14 17:31:29 +11:00
Matthew Honnibal	f277bfdf0f	Add SpanGroup and Graph container types to represent arbitrary annotations (#6696 ) * Draft out initial Spans data structure * Initial span group commit * Basic span group support on Doc * Basic test for span group * Compile span_group.pyx * Draft addition of SpanGroup to DocBin * Add deserialization for SpanGroup * Add tests for serializing SpanGroup * Fix serialization of SpanGroup * Add EdgeC and GraphC structs * Add draft Graph data structure * Compile graph * More work on Graph * Update GraphC * Upd graph * Fix walk functions * Let Graph take nodes and edges on construction * Fix walking and getting * Add graph tests * Fix import * Add module with the SpanGroups dict thingy * Update test * Rename 'span_groups' attribute * Try to fix c++11 compilation * Fix test * Update DocBin * Try to fix compilation * Try to fix graph * Improve SpanGroup docstrings * Add doc.spans to documentation * Fix serialization * Tidy up and add docs * Update docs [ci skip] * Add SpanGroup.has_overlap * WIP updated Graph API * Start testing new Graph API * Update Graph tests * Update Graph * Add docstring Co-authored-by: Ines Montani <ines@ines.io>	2021-01-14 17:30:41 +11:00
svlandeg	fec9b81aa2	Merge remote-tracking branch 'upstream/develop' into feature/missing-dep	2021-01-13 17:46:12 +01:00
svlandeg	ed53bb979d	cleanup	2021-01-13 14:20:05 +01:00
svlandeg	86a4e316b8	fix sent_starts	2021-01-13 13:47:25 +01:00
Ines Montani	31a92b28ae	Merge pull request #6715 from adrianeboyd/feature/before-after-init-callbacks Add initialize.before_init and after_init callbacks	2021-01-13 12:17:00 +11:00
Ines Montani	97d5a7ba99	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2021-01-13 12:03:02 +11:00
Ines Montani	8d6448ccf7	Add config resolver test	2021-01-13 12:02:59 +11:00
svlandeg	232e953b14	pytest.approx with absolute eps	2021-01-12 20:32:57 +01:00
svlandeg	5b598bd1d5	formatting	2021-01-12 17:28:41 +01:00
svlandeg	a581d82f33	introduce token.has_head and refer to MISSING_DEP_ (WIP)	2021-01-12 17:17:06 +01:00
Adriane Boyd	a45d89f09a	Add initialize.before_init and after_init callbacks Add `initialize.before_init` and `initialize.after_init` callbacks to the config. The `initialize.before_init` callback is a place to implement one-time tokenizer customizations that are then saved with the model.	2021-01-12 13:07:44 +01:00
Adriane Boyd	ad43cbb042	Sync missing and misaligned values in Tagger loss (#6689 ) Use `None` for both missing and misaligned annotation in `Tagger.get_loss`, reverting to the default missing value in the loss function.	2021-01-10 11:30:37 +11:00
svlandeg	dd12c6c8fd	allow missing information in deps and heads annotations	2021-01-07 19:10:32 +01:00
Sofie Van Landeghem	75d9019343	Fix types of Tok2Vec encoding architectures (#6442 ) * fix TorchBiLSTMEncoder documentation * ensure the types of the encoding Tok2vec layers are correct * update references from v1 to v2 for the new architectures	2021-01-07 16:39:27 +11:00
ophelielacroix	e3222fdec9	Add (noun chunks) syntax iterators for Danish (#6246 ) * add syntax iterators for danish * add test noun chunks for danish syntax iterators * add contributor agreement * update da syntax iterators to remove nested chunks * add tests for da noun chunks * Fix test * add missing import * fix example * Prevent overlapping noun chunks Prevent overlapping noun chunks by tracking the end index of the previous noun chunk span. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-01-07 16:33:00 +11:00
Sofie Van Landeghem	8c1a23209f	Getting scores out of beam_parser (#6684 ) * clean up of ner tests * beam_parser tests * implement get_beam_parses and scored_parses for the dep parser * we don't have to add the parse if there are no arcs	2021-01-07 16:28:27 +11:00
Sofie Van Landeghem	402dbc5bae	Getting scores out of beam_ner (#6575 ) * small fixes and formatting * bring test_issue4313 up-to-date, currently fails * formatting * add get_beam_parses method back * add scored_ents function * delete tag map	2021-01-06 12:02:32 +01:00
Sofie Van Landeghem	6f7e7d88b9	remove cause without apostrophe from norm exceptions (#6636 )	2021-01-06 12:30:30 +08:00
Adriane Boyd	bf9096437e	Set default lemmas in retokenizer (#6667 ) Instead of unsetting lemmas on retokenized tokens, set the default lemmas to: * merge: concatenate any existing lemmas with `SPACY` preserved * split: use the new `ORTH` values if lemmas were previously set, otherwise leave unset	2021-01-06 12:29:44 +08:00
Adriane Boyd	0041dfbc7f	Use special matcher for exceptions with spaces (#6668 ) Use the special cases phrase matcher for exceptions that include space characters so that exceptions including spaces are supported.	2021-01-06 12:05:10 +08:00
Sofie Van Landeghem	afc5714d32	multi-label textcat component (#6474 ) * multi-label textcat component * formatting * fix comment * cleanup * fix from #6481 * random edit to push the tests * add explicit error when textcat is called with multi-label gold data * fix error nr * small fix	2021-01-06 13:07:14 +11:00
Ines Montani	81f018fb67	Merge pull request #6671 from explosion/chore/tidy-autoformat Tidy up and auto-format	2021-01-05 14:45:31 +11:00
Ines Montani	224a3590e9	Merge pull request #6654 from svlandeg/chore/tests-cleanup Unskipping tests	2021-01-05 13:53:40 +11:00
Ines Montani	c4993f16d0	Merge pull request #6651 from svlandeg/bugfix/cli_info	2021-01-05 13:44:26 +11:00
Ines Montani	991669c934	Tidy up and auto-format	2021-01-05 13:41:53 +11:00
Adriane Boyd	b57be94c78	Fix memory issues in Language.evaluate (#6386 ) * Fix memory issues in Language.evaluate Reset annotation in predicted docs before evaluating and store all data in `examples`. * Minor refactor to docs generator init * Fix generator expression * Fix final generator check * Refactor pipeline loop * Handle examples generator in Language.evaluate * Add test with generator * Use make_doc	2020-12-31 10:45:50 +11:00
svlandeg	a6a68da673	unskipping tests with python >= 3.6	2020-12-30 18:46:43 +01:00
svlandeg	d5ff0fecf8	add docs	2020-12-30 14:01:13 +01:00
svlandeg	c74ab6a313	fix imports	2020-12-30 12:40:12 +01:00
svlandeg	712a78b74a	add simple unit test	2020-12-30 12:35:26 +01:00
Adriane Boyd	5ca57d8221	Add logger warning when serializing user hooks (#6595 ) Add a warning that user hooks are lost on serialization. Add a `user_hooks` exclude to skip the warning with pickle.	2020-12-29 11:54:32 +01:00
Yosi	cf52510631	Add Amharic አማርኛ Language support (#6583 ) * Add Amharic to space * clean up * Add some PRON_LEMMA * add Tigrinya support * remove text_noun_chunks * Tigrinya Support * added some more details for ti * fix unit test * add amharic char range * changes from review * amharic and tigrinya share same unicode block * get rid of _amharic/_tigrinya in char_classes Co-authored-by: Josiah Solomon <jsolomon@meteorcomm.com>	2020-12-22 16:50:34 +01:00
Adriane Boyd	cabd4ae5b1	Use logger.warning instead of logger.warn (#6596 ) Use `logger.warning` instead of deprecated `logger.warn`.	2020-12-21 08:25:10 +08:00
Sofie Van Landeghem	282a3b49ea	Fix parser resizing when there is no upper layer (#6460 ) * allow resizing of the parser model even when upper=False * update from spacy.TransitionBasedParser.v1 to v2 * bugfix	2020-12-18 18:56:57 +08:00
Sofie Van Landeghem	0a923a7915	Tagger robustness (#6580 ) * require labels in taggers * ensure tagger works with incomplete data	2020-12-18 18:51:47 +08:00
Adriane Boyd	1ddf2f39c7	Switch converters to generator functions (#6547 ) * Switch converters to generator functions To reduce the memory usage when converting large corpora, refactor the convert methods to be generator functions. * Update tests	2020-12-15 16:47:16 +08:00
Matthew Honnibal	8656a08777	Add beam_parser and beam_ner components for v3 (#6369 ) * Get basic beam tests working * Get basic beam tests working * Compile _beam_utils * Remove prints * Test beam density * Beam parser seems to train * Draft beam NER * Upd beam * Add hypothesis as dev dependency * Implement missing is-gold-parse method * Implement early update * Fix state hashing * Fix test * Fix test * Default to non-beam in parser constructor * Improve oracle for beam * Start refactoring beam * Update test * Refactor beam * Update nn * Refactor beam and weight by cost * Update ner beam settings * Update test * Add __init__.pxd * Upd test * Fix test * Upd test * Fix test * Remove ring buffer history from StateC * WIP change arc-eager transitions * Add state tests * Support ternary sent start values * Fix arc eager * Fix NER * Pass oracle cut size for beam * Fix ner test * Fix beam * Improve StateC.clone * Improve StateClass.borrow * Work directly with StateC, not StateClass * Remove print statements * Fix state copy * Improve state class * Refactor parser oracles * Fix arc eager oracle * Fix arc eager oracle * Use a vector to implement the stack * Refactor state data structure * Fix alignment of sent start * Add get_aligned_sent_starts method * Add test for ae oracle when bad sentence starts * Fix sentence segment handling * Avoid Reduce that inserts illegal sentence * Update preset SBD test * Fix test * Remove prints * Fix sent starts in Example * Improve python API of StateClass * Tweak comments and debug output of arc eager * Upd test * Fix state test * Fix state test	2020-12-13 09:08:32 +08:00
Ines Montani	9d32e839d3	Merge branch 'develop' into feature/init-config-cpu-gpu	2020-12-10 08:50:53 +11:00
Ines Montani	f2571b5ec4	Merge pull request #6444 from adrianeboyd/chore/update-develop-from-master	2020-12-09 13:09:58 +11:00
Ines Montani	90171f2031	Merge pull request #6528 from svlandeg/feature/pipe_fill_config	2020-12-09 12:01:22 +11:00
Ines Montani	dfaef27f90	Merge pull request #6503 from adrianeboyd/feature/lemmatizer-rule-warning-pos Warn on empty POS for the rule-based lemmatizer	2020-12-09 11:34:16 +11:00
Ines Montani	b85bd63eca	Fix test	2020-12-09 11:24:01 +11:00
Ines Montani	febf71af28	Fix test	2020-12-09 11:23:07 +11:00
Ines Montani	1980203229	Merge branch 'master' into pr/6444	2020-12-09 11:09:40 +11:00
Ines Montani	05a2812ae0	Merge branch 'develop' into pr/6444	2020-12-09 11:04:03 +11:00
Sofie Van Landeghem	cfc72c2995	Bugfix multi-label textcat reproducibility (#6481 ) * add test for multi-label textcat reproducibility * remove positive_label * fix lengths dtype * fix comments * remove comment that we should not have forgotten :-)	2020-12-09 06:29:15 +08:00
svlandeg	8f8a7f1733	returning config in init_config	2020-12-08 17:37:20 +01:00
Sofie Van Landeghem	2c27093c5f	require_cpu functionality (#6336 ) * add require_cpu from Thinc 8.0.0rc2 * add docs * fix test if cupy is not installed	2020-12-08 14:42:40 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451 ) * define new architectures for the pretraining objective * add loss function as attr of the omdel * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Adriane Boyd	29b058ebdc	Fix spacy when retokenizing cases with affixes (#6475 ) Preserve `token.spacy` corresponding to the span end token in the original doc rather than adjusting for the current offset. * If not modifying in place, this checks in the original document (`doc.c` rather than `tokens`). * If modifying in place, the document has not been modified past the current span start position so the value at the current span end position is valid.	2020-12-08 14:25:56 +08:00
Adriane Boyd	4448680750	Fix alignment for 1-to-1 tokens and lowercasing (#6476 ) * When checking for token alignments, check not only that the tokens are identical but that the character positions are both at the start of a token. It's possible for the tokens to be identical even though the two tokens aren't aligned one-to-one in a case like `["a'", "''"]` vs. `["a", "''", "'"]`, where the middle tokens are identical but should not be aligned on the token level at character position 2 since it's the start of one token but the middle of another. * Use the lowercased version of the token texts to create the character-to-token alignment because lowercasing can change the string length (e.g., for `İ`, see the not-a-bug bug report: https://bugs.python.org/issue34723)	2020-12-08 14:25:16 +08:00
Adriane Boyd	d70950605c	Warn on empty POS for the rule-based lemmatizer Add a warning to the rule-based lemmatizer for any tokens without POS annotation.	2020-12-04 11:46:15 +01:00
Sofie Van Landeghem	d6c616a125	Fixes in test suite (#6457 ) * fix slow test for textcat readers * cleanup test_issue5551 * add explicit score weight * cleanup	2020-12-02 12:57:08 +01:00
Adriane Boyd	53c0fb7431	Only set NORM on Token in retokenizer (#6464 ) * Only set NORM on Token in retokenizer Instead of setting `NORM` on both the token and lexeme, set `NORM` only on the token. The retokenizer tries to set all possible attributes with `Token/Lexeme.set_struct_attr` so that it doesn't have to enumerate which attributes are available for each. `NORM` is the only attribute that's stored on both and for most cases it doesn't make sense to set the global norms based on a individual retokenization. For lexeme-only attributes like `IS_STOP` there's no way to avoid the global side effects, but I think that `NORM` would be better only on the token. * Fix test	2020-11-30 09:35:42 +08:00
Adriane Boyd	03ae77e603	Add SPACY as a Matcher attribute (#6463 )	2020-11-30 09:34:50 +08:00
Adriane Boyd	724831b066	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master * Update Macedonian for v3 * Update Turkish for v3	2020-11-25 11:49:34 +01:00
Sofie Van Landeghem	2af31a8c8d	Bugfix textcat reproducibility on GPU (#6411 ) * add seed argument to ParametricAttention layer * bump thinc to 7.4.3 * set thinc version range Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-11-23 12:29:35 +01:00
Adriane Boyd	320a8b1481	Add ent_id_ to strings serialized with Doc (#6353 )	2020-11-10 20:16:07 +08:00
Adriane Boyd	a7e7d6c6c9	Ignore misaligned in Morphologizer.get_loss (#6363 ) Fix bug where `Morphologizer.get_loss` treated misaligned annotation as `EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH` mappings.	2020-11-10 20:15:09 +08:00
Ines Montani	363ac73c72	Update docs [ci skip]	2020-11-09 12:43:26 +08:00
Adriane Boyd	31de700b0f	Fix on_match callback and remove empty patterns (#6312 ) For the `DependencyMatcher`: * Fix on_match callback so that it is called once per matched pattern * Fix results so that patterns with empty match lists are not returned	2020-11-05 09:16:26 +01:00
Adriane Boyd	1c4df8fd09	Replace pytokenizations with internal alignment (#6293 ) * Replace pytokenizations with internal alignment Replace pytokenizations with internal alignment algorithm that is restricted to only allow differences in whitespace and capitalization. * Rename `spacy.training.align` to `spacy.training.alignment` to contain the `Alignment` dataclass * Implement `get_alignments` in `spacy.training.align` * Refactor trailing whitespace handling * Remove unnecessary exception for empty docs Allow a non-empty whitespace-only doc to be aligned with an empty doc * Remove empty docs exceptions completely	2020-11-03 16:24:38 +01:00
Adriane Boyd	a4b32b9552	Handle missing reference values in scorer (#6286 ) * Handle missing reference values in scorer Handle missing values in reference doc during scoring where it is possible to detect an unset state for the attribute. If no reference docs contain annotation, `None` is returned instead of a score. `spacy evaluate` displays `-` for missing scores and the missing scores are saved as `None`/`null` in the metrics. Attributes without unset states: * `token.head`: relies on `token.dep` to recognize unset values * `doc.cats`: unable to handle missing annotation Additional changes: * add optional `has_annotation` check to `score_scans` to replace `doc.sents` hack * update `score_token_attr_per_feat` to handle missing and empty morph representations * fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START` vs. `SENT_START` * Fix import * Update return types	2020-11-03 15:47:18 +01:00
Adriane Boyd	5d2cb86c34	Fix on_match callback for DependencyMatcher (#6313 ) Fix `DependencyMatcher` so that the callback is called only once per match.	2020-10-31 12:20:27 +01:00
Adriane Boyd	45c9a68828	Identify final Matcher pattern node by quantifier (#6317 ) Modify the internal pattern representation in `Matcher` patterns to identify the final ID state using a unique quantifier rather than a combination of other attributes. It was insufficient to identify the final ID node based on an uninitialized `quantifier` (coincidentally being the same as the `ZERO`) with `nr_attr` as 0. (In addition, it was potentially bug-prone that `nr_attr` was set to 0 even though attrs were allocated.) In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr` is 0 and the quantifier is ZERO, so the previous methods for incrementing to the ID node at the end of the pattern weren't able to distinguish the final ID node from the `{"OP": "!"}` pattern.	2020-10-31 12:18:48 +01:00
Duygu Altinok	0e55f806dd	Turkish tokenization improvements (#6268 ) * added single and paired orth variants * added token match * added long text tokenization test * inverted init * normalized lemmas to lowercase * more abbrevs * tests for ordinals and abbrevs * separated period abbvrevs to another list * fiex typo * added ordinal and abbrev tests * added number tests for dates * minor refinement * added inflected abbrevs regex * added percentage and inflection * cosmetics * added token match * added url inflection tests * excluded url tokens from custom pattern * removed url match import	2020-10-29 09:43:17 +01:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Jan Margeta	1ad2213349	Fix TokenPatternSchema pattern field validation Empty pattern field should be considered invalid This is fixed by replacing minItems with min_items as described in Pydantic docs: https://pydantic-docs.helpmanual.io/usage/schema/	2020-10-16 00:41:21 +02:00
Borijan Georgievski	2311192ba1	Include Macedonian language (#6230 ) * Include Macedonian language * Fix indentation at char_classes.py * Fix indentation at char_classes.py * Add Macedonian tests, update lex_attrs and char_classes * Import unicode literals for python 2	2020-10-15 15:55:01 +02:00
Ines Montani	b1d568a4df	Tidy up tests	2020-10-15 10:20:21 +02:00
Ines Montani	d165af26be	Auto-format [ci skip]	2020-10-15 10:08:53 +02:00
Ines Montani	5d62499266	Fix tests	2020-10-15 09:29:15 +02:00
Ines Montani	178760855f	Merge branch 'develop' into master-tmp	2020-10-15 09:06:03 +02:00
svlandeg	0796401c19	call NumpyOps instead of get_current_ops()	2020-10-14 16:55:00 +02:00
svlandeg	e94a21638e	adding tests for trained models to ensure predict reproducibility	2020-10-13 21:07:13 +02:00
svlandeg	ede979d42f	formattting	2020-10-13 18:53:17 +02:00
svlandeg	ff83bfae3f	naming	2020-10-13 18:52:37 +02:00
svlandeg	6ccacff54e	add tests for individual spacy layers	2020-10-13 18:50:07 +02:00
svlandeg	c23041ae60	component tests single or multiple prediction	2020-10-13 16:26:53 +02:00
svlandeg	3a505e7e14	small edit to ensure the new word was indeed new	2020-10-10 21:05:28 +02:00
svlandeg	68d79796c6	add test for vocab after serializing KB	2020-10-10 20:59:48 +02:00
Ines Montani	539b0c10da	Tidy up and auto-format	2020-10-10 19:14:48 +02:00
Ines Montani	bfa3931c9d	Revert added_strings change (#6236 )	2020-10-10 18:55:07 +02:00
Ines Montani	525f798841	Fix typo in test	2020-10-09 18:00:21 +02:00
Ines Montani	b7cb9d95e4	Merge pull request #6229 from svlandeg/bugfix/disabled	2020-10-09 16:05:11 +02:00
Ines Montani	9fb3244672	Merge pull request #6231 from adrianeboyd/feature/include-static-vectors	2020-10-09 15:54:52 +02:00
Adriane Boyd	727370c633	Remove Span._recalculate_indices Remove `Span._recalculate_indices`, which is a remnant from the deprecated `Span.merge`.	2020-10-09 14:42:51 +02:00
svlandeg	06b9d213fd	formatting	2020-10-09 12:19:47 +02:00
Ines Montani	4771a10503	Make test more explicit [ci skip]	2020-10-09 12:15:26 +02:00
Ines Montani	cc3646b06c	Add xfailing test for peculiar spans failure [ci skip]	2020-10-09 12:10:25 +02:00
svlandeg	8316bc7d4a	bugfix DisabledPipes	2020-10-09 12:06:20 +02:00
Adriane Boyd	39aabf50ab	Also rename to include_static_vectors in CharEmbed	2020-10-09 11:54:48 +02:00
Florijan Stamenković	18f5c309dc	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:14:40 +02:00
Duygu Altinok	80fb1bffc9	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-09 10:13:15 +02:00
Duygu Altinok	2fad279a44	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-09 10:10:22 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
Ines Montani	8ff73f04db	Fix morph in Doc.to_json	2020-10-08 14:44:35 +02:00
Ines Montani	064575d79d	Merge pull request #6216 from svlandeg/feature/nel-initialize	2020-10-08 11:14:12 +02:00
svlandeg	eaf5c265cb	set_kb method for entity_linker	2020-10-08 10:34:01 +02:00
Ines Montani	010956d493	Clear rule-based components on initialize	2020-10-08 09:51:31 +02:00
Sofie Van Landeghem	2998131416	Reproducibility for TextCat and Tok2Vec (#6218 ) * ensure fixed seed in HashEmbed layers * forgot about the joys of python 2	2020-10-08 00:43:46 +02:00
svlandeg	efedccea8d	fix tests	2020-10-07 15:29:52 +02:00
svlandeg	6b8bdb2d39	add init_config to nlp.create_pipe	2020-10-07 14:58:16 +02:00
Duygu Altinok	7e821c2776	Turkish language syntax iterators (#6191 ) * added tr_vocab to config * basic test * added syntax iterator to Turkish lang class * first version for Turkish syntax iter, without flat * added simple tests with nmod, amod, det * more tests to amod and nmod * separated noun chunks and parser test * rearrangement after nchunk parser separation * added recursive NPs * tests with complicated recursive NPs * tests with conjed NPs * additional tests for conj NP * small modification for shaving off conj from NP * added tests with flat * more tests with flat * added examples with flats conjed * added inner func for flat trick * corrected parse Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-07 11:07:52 +02:00
Duygu Altinok	b95a11dd95	Ordinal numbers for Turkish (#6142 ) * minor ordinal number addition * fixed typo * added corresponding lexical test	2020-10-07 10:25:37 +02:00
Rahul Gupta	1a00bff06d	Hindi: Adds tests for lexical attributes (norm and like_num) (#5829 ) * Hindi: Adds tests for lexical attributes (norm and like_num) * Signs and sdds the contributor agreement * Add ordinal numbers to be tagged as like_num * Adds alternate pronunciation for 31 and 39	2020-10-07 10:23:32 +02:00
Sofie Van Landeghem	fff3f8ccfa	Fix packaging pin (#6212 ) * pin packaging to >=20.0 * ignore spacy-pkuseg in requirements unit test	2020-10-06 14:16:05 +02:00
Matthew Honnibal	cfb9770a94	Fix empty input into StaticVectors layer (#6211 ) * Add test for empty doc(s) * Fix empty check in staticvectors * Remove xfail * Update spacy/ml/staticvectors.py	2020-10-06 14:15:41 +02:00
Florijan Stamenković	9db670b996	Fix Issue 6207 (#6208 ) * Regression test for issue 6207 * Fix issue 6207 * Sign contributor agreement * Minor adjustments to test Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-06 11:17:37 +02:00
Ines Montani	568e12215d	Merge pull request #6206 from svlandeg/fix/patterns-init	2020-10-06 10:27:23 +02:00
svlandeg	ff9ac39c88	read entity_ruler patterns with srsly.read_jsonl.v1	2020-10-05 22:50:14 +02:00
Ines Montani	126268ce50	Auto-format [ci skip]	2020-10-05 21:58:18 +02:00
Ines Montani	181039bd17	Merge pull request #6205 from explosion/feature/embed-features	2020-10-05 21:49:10 +02:00
Ines Montani	568617af58	Merge pull request #6202 from explosion/feature/project-spacy-version	2020-10-05 21:40:52 +02:00
Ines Montani	6abfc2911d	Merge pull request #6203 from adrianeboyd/feature/zh-spacy-pkuseg	2020-10-05 21:35:57 +02:00
Matthew Honnibal	91d0fbb588	Fix test	2020-10-05 21:13:53 +02:00
Matthew Honnibal	b392d48e76	Fix test	2020-10-05 20:17:07 +02:00
Matthew Honnibal	db84d175c3	Fix test	2020-10-05 19:59:30 +02:00
Matthew Honnibal	6dcc4a0ba6	Simplify MultiHashEmbed signature	2020-10-05 19:57:45 +02:00
Matthew Honnibal	7d93575f35	spacy/tests/	2020-10-05 15:28:12 +02:00
Matthew Honnibal	f4ca9a39cb	spacy/tests/	2020-10-05 15:27:06 +02:00
Matthew Honnibal	f2f1deca66	spacy/tests/	2020-10-05 15:24:33 +02:00
Matthew Honnibal	8ec79ad3fa	Allow configuration of MultiHashEmbed features Update arguments to MultiHashEmbed layer so that the attributes can be controlled. A kind of tricky scheme is used to allow optional specification of the rows. I think it's an okay balance between flexibility and convenience.	2020-10-05 15:22:00 +02:00
Adriane Boyd	5d19dfc9d3	Update Chinese tokenizer for spacy-pkuseg fork	2020-10-05 14:21:53 +02:00
Ines Montani	6958510bda	Include spaCy version check in project CLI	2020-10-05 13:53:07 +02:00
Ines Montani	20f2a17a09	Merge test_misc and test_util	2020-10-05 13:45:57 +02:00
Ines Montani	1c641e41c3	Remove unused import [ci skip]	2020-10-05 11:50:11 +02:00
Adriane Boyd	b0b93854cb	Update ru/uk lemmatizers for new nlp.initialize	2020-10-05 09:27:16 +02:00
Ines Montani	549758f67d	Adjust test for now	2020-10-04 23:16:09 +02:00
Ines Montani	3c36a57e84	Update data augmenters (#6196 ) * Draft lower-case augmenter * Make warning a debug log * Update lowercase augmenter, docs and tests Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-10-04 17:46:29 +02:00
Ines Montani	496228771d	Merge pull request #6194 from explosion/master-tmp	2020-10-04 15:25:41 +02:00
Ines Montani	0307a228c8	Merge pull request #6193 from explosion/fix/adjust-pipe-init Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 15:20:54 +02:00
Ines Montani	59deeb7da6	Merge branch 'develop' into master-tmp	2020-10-04 14:52:20 +02:00
Ines Montani	8f018e47f8	Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 14:43:45 +02:00
Ines Montani	11347f34da	Tidy up, tests and docs	2020-10-04 13:54:05 +02:00
Ines Montani	d3b3663942	Adjust error message and add test	2020-10-04 10:11:27 +02:00
Ines Montani	2110e8f86d	Auto-format	2020-10-04 10:06:49 +02:00
Matthew Honnibal	835070cedc	Upd test	2020-10-03 19:35:10 +02:00
Ines Montani	c2401fca41	Add tests for Pipe.label_data	2020-10-03 19:12:46 +02:00
Ines Montani	3bc3c05fcc	Tidy up and auto-format	2020-10-03 17:20:18 +02:00
Ines Montani	7c4ab7e82c	Fix Lemmatizer.get_lookups_config	2020-10-03 17:16:10 +02:00
Ines Montani	dd542ec6a4	Fix label initialization of textcat component (#6190 )	2020-10-03 17:07:38 +02:00
Sofie Van Landeghem	09dcb75076	small UX fix for DocBin (#6167 ) * add informative warning when messing up store_user_data DocBin flags * add informative warning when messing up store_user_data DocBin flags * cleanup test * rename to patterns_path	2020-10-02 15:43:32 +02:00
Ines Montani	f0b30aedad	Make lemmatizers use initialize logic (#6182 ) * Make lemmatizer use initialize logic and tidy up * Fix typo * Raise for uninitialized tables	2020-10-02 15:42:36 +02:00
Ines Montani	d2aa662ab2	Merge pull request #6179 from adrianeboyd/feature/token-morph-refactor-2 [ci skip]	2020-10-02 12:10:27 +02:00
Ines Montani	c41a4332e4	Add test for custom data augmentation	2020-10-02 11:37:56 +02:00
Adriane Boyd	f83dfe62da	Fix test	2020-10-02 10:17:26 +02:00
Ines Montani	01c1538c72	Integrate file readers	2020-10-02 01:36:06 +02:00
Adriane Boyd	86c3ec9c2b	Refactor Token morph setting (#6175 ) * Refactor Token morph setting * Remove `Token.morph_` * Add `Token.set_morph()` * `0` resets `token.c.morph` to unset * Any other values are passed to `Morphology.add` * Add token.morph setter to set from MorphAnalysis	2020-10-01 22:21:46 +02:00
Ines Montani	d48ddd6c9a	Remove default initialize lookups	2020-10-01 21:54:33 +02:00
Adriane Boyd	73538782a0	Switch Doc.__init__(ents=) to IOB tags (#6173 ) * Switch Doc.__init__(ents=) to IOB tags * Fix check for "-" * Allow "" or None as missing IOB tag	2020-10-01 16:22:18 +02:00
Yohei Tamura	3243ddac8f	Fix/span.sent (#6083 ) * add fail test * fix test * fix span.sent * Remove incorrect implicit check Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-10-01 14:01:52 +02:00
Ines Montani	381258b75b	Merge pull request #6165 from explosion/feature/update-tokenizers-initialize	2020-10-01 09:49:47 +02:00
Ines Montani	a103ab5f1a	Update augmenter lookups and docs	2020-09-30 23:03:47 +02:00
Ines Montani	23c63eefaf	Tidy up env vars [ci skip]	2020-09-30 15:15:11 +02:00
Adriane Boyd	6b7bb32834	Refactor Chinese initialization	2020-09-30 11:46:45 +02:00
Ines Montani	34f9c26c62	Add lexeme norm defaults	2020-09-30 10:20:14 +02:00
Ines Montani	1aeef3bfbb	Make corpus paths default to None and improve errors	2020-09-29 22:33:46 +02:00
Ines Montani	fa47f87924	Tidy up and auto-format	2020-09-29 21:39:28 +02:00
Ines Montani	6467a560e3	WIP: Test updating Chinese tokenizer	2020-09-29 21:10:22 +02:00
Ines Montani	78021089f9	Merge pull request #6160 from explosion/feature/prepare	2020-09-29 20:55:13 +02:00
Ines Montani	c3f8c09d7d	Merge pull request #6154 from adrianeboyd/bugfix/chinese-tokenizer-pickle	2020-09-29 20:54:59 +02:00
Ines Montani	d3c63b7965	Merge branch 'develop' into feature/prepare	2020-09-29 20:53:05 +02:00
Ines Montani	2be80379ec	Fix small issues, resolve_dot_names and debug model	2020-09-29 20:38:35 +02:00
Ines Montani	7851020653	Update tests	2020-09-29 18:14:15 +02:00
Ines Montani	f2352eb701	Test with default value	2020-09-29 17:00:40 +02:00
Ines Montani	63d1598137	Simplify config use in Language.initialize	2020-09-29 16:05:48 +02:00
Ines Montani	56f8bc73ef	Add more tests	2020-09-29 15:23:34 +02:00
Ines Montani	591038b1a4	Add test	2020-09-29 12:54:52 +02:00
Matthew Honnibal	e1fdf2b7c5	Upd tests	2020-09-29 12:05:38 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Ines Montani	2e9c9e74af	Fix config resolution and interpolation TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)	2020-09-28 15:34:00 +02:00
Ines Montani	822ea4ef61	Refactor CLI	2020-09-28 15:09:59 +02:00
Matthew Honnibal	a976da168c	Support data augmentation in Corpus (#6155 ) * Support data augmentation in Corpus * Note initial docs for data augmentation * Add augmenter to quickstart * Fix flake8 * Format * Fix test * Update spacy/tests/training/test_training.py * Improve data augmentation arguments * Update templates * Move randomization out into caller * Refactor * Update spacy/training/augment.py * Update spacy/tests/training/test_training.py * Fix augment * Fix test	2020-09-28 03:03:27 +02:00
Ines Montani	9016d23cc5	Fix exclude and add test	2020-09-27 23:34:03 +02:00
Ines Montani	7e938ed63e	Update config resolution to use new Thinc	2020-09-27 22:21:31 +02:00
Adriane Boyd	8393dbedad	Minor fixes * Put `cfg` back in serialization * Add `pickle5` to pytest conf	2020-09-27 15:15:53 +02:00
Adriane Boyd	11e195d3ed	Update ChineseTokenizer * Allow `pkuseg_model` to be set to `None` on initialization * Don't save config within tokenizer * Force convert pkuseg_model to use pickle protocol 4 by reencoding with `pickle5` on serialization * Update pkuseg serialization test	2020-09-27 14:00:18 +02:00
Ines Montani	ca3c997062	Improve CLI config validation with latest Thinc	2020-09-26 13:13:57 +02:00
Adriane Boyd	3c062b3911	Add MORPH handling to Matcher (#6107 ) * Add MORPH handling to Matcher * Add `MORPH` to `Matcher` schema * Rename `_SetMemberPredicate` to `_SetPredicate` * Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate` * Add special handling for normalization and conversion of morph values into sets * For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only matches for 0 or 1 values * Update test * Rename to IS_SUBSET and IS_SUPERSET	2020-09-24 16:55:09 +02:00
Adriane Boyd	59340606b7	Add option to disable Matcher errors (#6125 ) * Add option to disable Matcher errors * Add option to disable Matcher errors when a doc doesn't contain a particular type of annotation Minor additional change: * Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH` values * Rename suppress_errors to allow_missing Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Refactor annotation checks in Matcher and PhraseMatcher Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-24 16:54:39 +02:00
Sofie Van Landeghem	c7eedd3534	updates to NEL functionality (#6132 ) * NEL: read sentences and ents from reference * fiddling with sent_start annotations * add KB serialization test * KB write additional file with strings.json * score_links function to calculate NEL P/R/F * formatting * documentation	2020-09-24 16:53:59 +02:00
Ines Montani	d0ef4a4cf5	Prevent division by zero in score weights	2020-09-24 16:42:13 +02:00
Ines Montani	58dde293ce	Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2	2020-09-24 14:44:42 +02:00
Adriane Boyd	8eaacaae97	Refactor Doc.ents setter to use Doc.set_ents Additional changes: * Entity spans with missing labels are ignored * Fix ent_kb_id setting in `Doc.set_ents`	2020-09-24 12:36:51 +02:00
Ines Montani	c6c67b606e	Merge pull request #6133 from explosion/fix/score_weights	2020-09-24 12:00:57 +02:00
Ines Montani	4bbe41f017	Fix combined scores and update test	2020-09-24 10:42:47 +02:00
Sofie Van Landeghem	c645c4e7ce	fix micro PRF for textcat (#6130 ) * fix micro PRF for textcat * small fix	2020-09-24 10:31:17 +02:00
Ines Montani	ae51f580c1	Fix handling of score_weights	2020-09-24 10:27:33 +02:00
svlandeg	b816ace4bb	format	2020-09-23 17:33:13 +02:00
svlandeg	5a9fdbc8ad	state_type as Literal	2020-09-23 17:32:14 +02:00
svlandeg	dd2292793f	'parser' instead of 'deps' for state_type	2020-09-23 16:53:49 +02:00
svlandeg	6c85fab316	state_type and extra_state_tokens instead of nr_feature_tokens	2020-09-23 13:35:09 +02:00
Ines Montani	60a317520a	Merge pull request #6109 from svlandeg/feature/2rename	2020-09-23 09:47:12 +02:00
Sofie Van Landeghem	86a08f819d	tok2vec.update instead of predict (#6113 )	2020-09-22 21:54:52 +02:00
Adriane Boyd	e4acb28658	Fix norm in retokenizer split (#6111 ) Parallel to behavior in merge, reset norm on original token in retokenizer split.	2020-09-22 21:53:33 +02:00
Sofie Van Landeghem	e0e793be4d	fix KB IO (#6118 )	2020-09-22 21:53:06 +02:00
Sofie Van Landeghem	d53c84b6d6	avoid None callback (#6100 )	2020-09-22 13:54:44 +02:00
Adriane Boyd	535842e483	Merge branch 'develop' into feature/doc-ents-v3-2	2020-09-22 13:45:50 +02:00
Ines Montani	5e3b796b12	Validate section refs in debug config	2020-09-22 12:24:39 +02:00
svlandeg	e1b8090b9b	few more fixes	2020-09-22 12:01:06 +02:00
svlandeg	b556a10808	rename converts in_to_out	2020-09-22 11:50:19 +02:00
Ines Montani	beb766d0a0	Add test	2020-09-22 09:15:57 +02:00
Ines Montani	69f7e52c26	Update README.md	2020-09-22 09:10:06 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Ines Montani	a5f6ab4943	Merge pull request #6098 from adrianeboyd/feature/doc-init	2020-09-21 18:35:20 +02:00
Adriane Boyd	f212303729	Add sent_starts to Doc.__init__ Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start` values but also convert to and accept `sent_start` internally.	2020-09-21 17:59:09 +02:00
Adriane Boyd	177df15d89	Implement Doc.set_ents	2020-09-21 15:54:05 +02:00
Adriane Boyd	13fbf6556a	Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2	2020-09-21 14:42:04 +02:00
Adriane Boyd	ce455f30ca	Fix formatting	2020-09-21 13:53:29 +02:00
Adriane Boyd	bc02e86494	Extend Doc.__init__ with additional annotation Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to `Doc.__init__` to initialize the most common doc/token values.	2020-09-21 13:36:24 +02:00
Ines Montani	758ead8a47	Sync overrides with CLI overrides	2020-09-21 12:50:13 +02:00
Ines Montani	5497acf49a	Support config overrides via environment variables	2020-09-21 11:25:10 +02:00
Ines Montani	1114219ae3	Tidy up and auto-format	2020-09-21 10:59:07 +02:00
Adriane Boyd	eed4b785f5	Load vocab lookups tables at beginning of training Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created. The option moves from `nlp.load_vocab_data` to `training.lookups`. Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided. The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included. To load `lexeme_norm` from `spacy-lookups-data`: ``` [training.lookups] @misc = "spacy.LoadLookupsData.v1" lang = ${nlp.lang} tables = ["lexeme_norm"] ```	2020-09-18 15:59:16 +02:00
Ines Montani	a127fa475e	Merge pull request #6078 from svlandeg/fix/corpus	2020-09-18 14:44:21 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086 ) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00
Adriane Boyd	8b650f3a78	Modify setting missing and blocked entity tokens In order to make it easier to construct `Doc` objects as training data, modify how missing and blocked entity tokens are set to prioritize setting `O` and missing entity tokens for training purposes over setting blocked entity tokens. * `Doc.ents` setter sets tokens outside entity spans to `O` regardless of the current state of each token * For `Doc.ents`, setting a span with a missing label sets the `ent_iob` to missing instead of blocked * `Doc.block_ents(spans)` marks spans as hard `O` for use with the `EntityRecognizer`	2020-09-17 21:27:42 +02:00
svlandeg	427dbecdd6	cleanup and formatting	2020-09-17 11:48:04 +02:00
svlandeg	0c35885751	generalize corpora, dot notation for dev and train corpus	2020-09-17 11:38:59 +02:00
svlandeg	781fae678b	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-17 09:24:36 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044 ) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_form_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Adriane Boyd	a119667a36	Clean up spacy.tokens (#6046 ) * Clean up spacy.tokens * Update `set_children_from_heads`: * Don't check `dep` when setting lr_* or sentence starts * Set all non-sentence starts to `False` * Use `set_children_from_heads` in `Token.head` setter * Reduce similar/duplicate code (admittedly adds a bit of overhead) * Update sentence starts consistently * Remove unused `Doc.set_parse` * Minor changes: * Declare cython variables (to avoid cython warnings) * Clean up imports * Modify set_children_from_heads to set token range Modify `set_children_from_heads` so that it adjust tokens within a specified range rather then the whole document. Modify the `Token.head` setter to adjust only the tokens affected by the new head assignment.	2020-09-16 20:32:38 +02:00
Adriane Boyd	f3db3f6fe0	Add vectors option to CharacterEmbed (#6069 ) * Add vectors option to CharacterEmbed * Update spacy/pipeline/morphologizer.pyx * Adjust default morphologizer config Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-16 17:45:04 +02:00
Adriane Boyd	87c329c711	Set rule-based lemmatizers as default (#6076 ) For languages without provided models and with lemmatizer rules in `spacy-lookups-data`, make the rule-based lemmatizer the default: Bengali, Persian, Norwegian, Swedish	2020-09-16 17:37:29 +02:00
svlandeg	1040e250d8	actual commit with test for custom readers with ml_datasets >= 0.2	2020-09-16 16:41:28 +02:00
svlandeg	0d1392340f	Merge remote-tracking branch 'upstream/develop' into fix/corpus	2020-09-15 23:17:08 +02:00
svlandeg	f420aa1138	use e.value to get to the ExceptionInfo value	2020-09-15 22:30:09 +02:00
svlandeg	51fa929f47	rewrite train_corpus to corpus.train in config	2020-09-15 21:58:04 +02:00
svlandeg	bd87e8686e	move tests to correct subdir	2020-09-15 21:40:38 +02:00
Ines Montani	aaf01689a1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-15 14:24:42 +02:00
Ines Montani	91a6637f74	Remove extra pipe config values before merging	2020-09-15 14:24:17 +02:00
Ines Montani	d3d7f92f05	Fix lang check and error handling in Language.from_config	2020-09-15 14:24:06 +02:00
Ines Montani	2ed6e2a218	Auto-format	2020-09-15 14:20:04 +02:00
Ines Montani	253ba5ef14	Raise for bad Vocab values	2020-09-15 13:25:34 +02:00
Ines Montani	7dfc4bc062	Allow overriding meta from spacy.blank	2020-09-15 11:12:12 +02:00
Matthew Honnibal	e8378b57bc	Fix test	2020-09-14 21:21:13 +02:00
Sofie Van Landeghem	3216a33149	positive_label config for textcat (#6062 ) * hook up positive_label in textcat * unit tests * documentation * formatting * tests * fix typo * move verify_config to after begin_training * revert accidential commit	2020-09-14 17:08:00 +02:00
Sofie Van Landeghem	744df9814a	define threshold for scoring textcat in TextCat config (#6055 ) * define threshold for scoring textcat in TextCat config * fix unit test and documentation	2020-09-13 14:15:52 +02:00
Adriane Boyd	ab270364f1	Modify Token.morph to enable unsetting (#6043 ) Modify `Token.morph` property so that `Token.c.morph` can be reset back to an internal value of `0`. Allow setting `Token.morph` from a hash as long as the morph string is already in the `StringStore`, setting it indirectly through `Token.morph_` so that the value is added to the morphology. If the hash is not in the `StringStore`, raise an error.	2020-09-13 14:06:07 +02:00
Adriane Boyd	c7bd631b5f	Fix token.idx for special cases with affixes (#6035 )	2020-09-13 14:05:36 +02:00
Matthew Honnibal	54c40223a1	Improve v3 pretrain command (#6040 ) * Starts to run * Update pretrain script * Update corpus * Update pretrain schema * Remove outdated test * Make JsonlTexts produce Example objects.	2020-09-13 14:05:05 +02:00
Ines Montani	febb99916d	Tidy up and auto-format [ci skip]	2020-09-13 10:55:36 +02:00
svlandeg	115147804a	string_to_list to parse comma-separated string into a list	2020-09-12 14:43:22 +02:00
Sofie Van Landeghem	cb66ea7400	Remove simple_ner code (#6041 ) * remove simple_ner code * remove unused _biluo and _iob files	2020-09-09 16:11:27 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042 ) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034 ) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
Ines Montani	f174c7b1f3	Merge branch 'develop' into pr/6018	2020-09-04 15:54:49 +02:00
Ines Montani	d7cc2ee72d	Fix tests	2020-09-04 14:05:55 +02:00
Ines Montani	90043a6f9b	Tidy up and auto-format	2020-09-04 13:42:33 +02:00
Ines Montani	df0b68f60e	Remove unicode declarations and update language data	2020-09-04 13:19:16 +02:00
Ines Montani	864a697e63	Merge branch 'develop' into master-tmp	2020-09-04 13:15:36 +02:00
Adriane Boyd	b927893309	Merge branch 'develop' into feature/dependency-matcher-v3	2020-09-04 13:03:30 +02:00
Ines Montani	5afe6447cd	registry.assets -> registry.misc	2020-09-03 17:31:14 +02:00
Ines Montani	c063e55eb7	Add prefix to batchers	2020-09-03 17:30:41 +02:00
Yohei Tamura	5af432e0f2	fix for empty string (#5936 )	2020-09-03 10:09:03 +02:00
Adriane Boyd	8b5594df86	Remove near-duplicate test	2020-09-02 20:32:01 +02:00
Adriane Boyd	960d9cfadc	Officially support DependencyMatcher Add official support for the `DependencyMatcher`. Redesign the pattern specification. Fix and extend operator implementations. Update API docs and add usage docs. Patterns -------- Refactor pattern structure to: ``` { "LEFT_ID": str, "REL_OP": str, "RIGHT_ID": str, "RIGHT_ATTRS": dict, } ``` The first node contains only `RIGHT_ID` and `RIGHT_ATTRS` and all subsequent nodes contain all four keys. New operators ------------- Because of the way patterns are constructed from left to right, it's helpful to have `follows` operators along with `precedes` operators. Add operators for simple precedes / follows alongside immediate precedes / follows. * `.`: precedes `;`: immediately follows * `;`: follows Operator fixes -------------- `<` and `<<` do not include the node itself * Fix reversed order for all operators involving linear precedence (`.`, all sibling operators) * Linear precedence operators do not match nodes outside the same parse Additional fixes ---------------- * Use v3 Matcher API * Support `get` and `remove` * Support pickling	2020-09-02 17:45:29 +02:00
Sofie Van Landeghem	eb56377799	Fix overfitting test (#6011 ) * remove unused MORPH_RULES * fix textcat architecture in overfitting test	2020-09-02 13:07:41 +02:00
Sofie Van Landeghem	f7a25d69f7	Bugfix in merge_entities (#6005 ) * failing test * bugfix	2020-09-01 21:57:52 +02:00
Adriane Boyd	9130094199	Prevent Tagger model init with 0 labels (#5984 ) * Prevent Tagger model init with 0 labels Raise an error before trying to initialize a tagger model with 0 labels. * Add dummy tagger label for test * Remove tagless tagger model initializiation * Fix error number after merge * Add dummy tagger label to test * Fix formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-31 21:24:33 +02:00
Ines Montani	add9de5487	Deprecate (Phrase)Matcher.pipe	2020-08-31 17:01:24 +02:00
Ines Montani	6340d1c63d	Add as_spans to Matcher/PhraseMatcher	2020-08-31 14:53:22 +02:00
Sofie Van Landeghem	ec14744ee4	Rename Transformer listener (#6001 ) * rename to spacy-transformers.TransformerListener * add some more tok2vec tests * use select_pipes * fix docs - annotation setter was not changed in the end	2020-08-31 12:41:39 +02:00
Adriane Boyd	216efaf5f5	Restrict tokenizer exceptions to ORTH and NORM	2020-08-31 09:55:01 +02:00
Ines Montani	45f46a5c85	Merge pull request #5993 from explosion/feature/disabled-components	2020-08-29 15:58:41 +02:00
Ines Montani	34146750d4	Use frozen list with custom errors We don't want to break backwards compatibility too much but we also want to provide the best possible UX	2020-08-29 15:20:11 +02:00
Ines Montani	744f432420	Merge pull request #5994 from explosion/feature/idempotent-component-decorator	2020-08-29 13:17:13 +02:00
Ines Montani	2bc31e15c9	Tidy up and auto-format [ci skip]	2020-08-29 13:01:10 +02:00
Ines Montani	15d73f4dc3	Make user-facing Language.disabled return list More consistent with all the other properties	2020-08-29 12:08:33 +02:00
Ines Montani	0687d7148e	Rename user-facing API	2020-08-28 21:04:02 +02:00
Adriane Boyd	0104bd1600	Sort the AttributeRuler matches by rule order Sort the returned matches by rule order (the `match_id`) so that the rules are applied in the order they were added. This is necessary, for instance, if the `AttributeRuler` is used for the tag map and later rules require POS tags.	2020-08-28 21:01:06 +02:00
Adriane Boyd	8674b17651	Serialize AttributeRuler.patterns Serialize `AttributeRuler.patterns` instead of the individual lists to simplify the serialized and so that patterns are reloaded exactly as they were originally provided (preserving `_attrs_unnormed`).	2020-08-28 20:44:45 +02:00
Ines Montani	10da74382f	Raise if disabled components are removed before DisabledPipes.restore	2020-08-28 20:35:26 +02:00
Ines Montani	cad988da7f	Allow component decorators to re-run with same function	2020-08-28 16:27:22 +02:00
Ines Montani	3ce5be4b76	Allow loaded but disabled components	2020-08-28 15:20:14 +02:00
Ines Montani	9c4049b57f	Merge pull request #5986 from explosion/fix/language-config-interpolate-disk-bytes	2020-08-28 15:03:52 +02:00
Ines Montani	adc050cdc5	Fix code style in test [ci skip]	2020-08-28 15:03:21 +02:00
Ines Montani	a51b4f3a19	Merge branch 'develop' into fix/language-config-interpolate-disk-bytes	2020-08-28 13:21:17 +02:00
svlandeg	9a8255ffd5	two tests because of different exit type	2020-08-28 10:50:26 +02:00
svlandeg	73baaf330a	update error type	2020-08-28 10:46:21 +02:00
Ines Montani	daac8ebacd	Don't interpolate config on Language deserialization	2020-08-27 16:44:36 +02:00
Adriane Boyd	90d88729e0	Add AttributeRuler.score (#5963 ) * Add AttributeRuler.score Add scoring for TAG / POS / MORPH / LEMMA if these are present in the assigned token attributes. Add default score weights (that don't really make a lot of sense) so that the scores are in the default config in some form. * Update docs	2020-08-26 15:39:30 +02:00
Hiroshi Matsuda	332803eda9	fix ja leading spaces (#5969 ) * change condition for space after * add NAUGHTY_STRINGS test example	2020-08-25 14:16:24 +02:00
Ines Montani	dd84577a98	Update CLI utils, project.yml schema and add test	2020-08-25 11:54:53 +02:00
Shashank	450720aca2	Added support for Sanskrit language (#5956 ) * Added support for Sanskrit language * Added tests for lexical attribute like_num	2020-08-25 10:56:29 +02:00
Ines Montani	e12b03358b	Support removing extra values in fill-config (#5966 ) * Support removing extra values in fill-config * Fix test	2020-08-24 22:53:47 +02:00
Ines Montani	0e7f99da58	Fix handling of optional [pretraining] block (#5954 ) * Fix handling of optional [pretraining] block * Remote pretraining from default config * Fix test * Add schema option for empty pretrain block	2020-08-24 15:56:03 +02:00
idoshr	b10c7bc56e	Hebrew like num (#5952 ) * Update stop_words.py Hebrew STOP WORDS * Update stop_words.py * contributor * contributor * add some common domain extentions support human number 1K/1M.... * support human number 1K/1M.... * hebrew number tokenize 1K/1M implement in EN * test human tokenize fix * test * heb like num revert human number change * heb like num	2020-08-24 14:30:05 +02:00
Matthew Honnibal	160a855246	Format	2020-08-23 21:15:12 +02:00
Matthew Honnibal	e559867605	Allow spacy project to push and pull to/from remote storage (#5949 ) * Add utils for working with remote storage * WIP add remote_cache for project * WIP add push and pull commands * Use pathy in remote_cache * Updarte util * Update remote_cache * Update util * Update project assets * Update pull script * Update push script * Fix type annotation in util * Work on remote storage * Remove site and env hash * Fix imports * Fix type annotation * Require pathy * Require pathy * Fix import * Add a util to handle project variable substitution * Import push and pull commands * Fix pull command * Fix push command * Fix tarfile in remote_storage * Improve printing * Fiddle with status messages * Set version to v3.0.0a9 * Draft docs for spacy project remote storages * Update docs [ci skip] * Use Thinc config to simplify and unify template variables * Auto-format * Don't import Pathy globally for now Causes slow and annoying Google Cloud warning * Tidy up test * Tidy up and update tests * Update to latest Thinc * Update docs * variables -> vars * Update docs [ci skip] * Update docs [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2020-08-23 18:32:09 +02:00
Sofie Van Landeghem	56eabcb2f2	Adding num_like test for Czech (#5946 ) * Create lex_attrs.py Hello, I am missing a CZECH language in SpaCy. So I would like to help to push it a little. This file is base on others lex_attrs.py files just with translation to Czech. * Update __init__.py Updated for use with new Czech Lex_attrs file * Update stop_words.py * Create test_text.py * add like_num testing for czech Co-authored-by: holubvl3 <47881982+holubvl3@users.noreply.github.com> Co-authored-by: holubvl3 <vilemrousi@gmail.com> Co-authored-by: Vladimír Holubec <vholubec@arcdata.cz>	2020-08-21 17:06:33 +02:00
svlandeg	af36d77d01	fix typo in docstring	2020-08-21 15:56:03 +02:00
Ines Montani	e60442d83a	Adjust label casing in displaCy NER visualizer (resolves #4866 ) - Accept any case for label names in ents and colors option, even if actual predicted label uses different casing - Don't text-transform: uppercase visually, if it's important to users that the label is represented as-is in the UI	2020-08-21 11:51:31 +02:00
Ines Montani	6ad59d59fe	Merge branch 'develop' of https://github.com/explosion/spaCy into develop [ci skip]	2020-08-20 11:20:58 +02:00
Sofie Van Landeghem	071c09ff35	add coding (#5942 )	2020-08-20 11:08:38 +02:00
Ines Montani	e2f2ef3a5a	Update init config and recommendations - As much as I dislike YAML, it seemed like a better format here because it allows us to add comments if we want to explain the different recommendations - Don't include the generated JS in the repo by default and build it on the fly when running or deploying the site. This ensures it's always up to date. - Simplify jinja_to_js script and use fewer dependencies	2020-08-19 13:33:15 +02:00
Sofie Van Landeghem	358cbb21e3	Define candidate generator in EL config (#5876 ) * candidate generator as separate part of EL config * update comment * ent instead of str as input for candidate generation * Span instead of str: correct type indication * fix types * unit test to create new candidate generator * fix replace_pipe argument passing * move error message, general cleanup * add vocab back to KB constructor * provide KB as callable from Vocab arg * rename to kb_loader, fix KB serialization as part of the EL pipe * fix typo * reformatting * cleanup * fix comment * fix wrongly duplicated code from merge conflict * rename dump to to_disk * from_disk instead of load_bulk * update test after recent removal of set_morphology in tagger * remove old doc	2020-08-18 16:10:36 +02:00
Ines Montani	3ae5e02f4f	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
Ines Montani	45f13cbf64	Merge pull request #5916 from explosion/feature/new-thinc-config	2020-08-16 15:24:12 +02:00
Ines Montani	a570c304df	Update quickstart, template and docs	2020-08-15 14:50:29 +02:00
Ines Montani	3272a63430	Merge pull request #5920 from explosion/fix/logging-warning-various	2020-08-15 14:41:15 +02:00
Matthew Honnibal	9ebf39fb5f	Relax test	2020-08-14 16:31:09 +02:00
Ines Montani	8128e5eb35	Replace lexeme_norm warning with logging	2020-08-14 15:00:52 +02:00
Ines Montani	88b0a96801	Update for new Thinc and adjust config	2020-08-13 17:38:30 +02:00
Ines Montani	950832f087	Tidy up pipes (#5906 ) * Tidy up pipes * Fix init, defaults and raise custom errors * Update docs * Update docs [ci skip] * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Tidy up error handling and validation, fix consistency * Simplify get_examples check * Remove unused import [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-11 23:29:31 +02:00
Ines Montani	c099f6eece	Add Token.lex	2020-08-10 16:43:52 +02:00
Ines Montani	3eaeb73342	Tidy up and auto-format	2020-08-09 22:36:23 +02:00
Ines Montani	8d2baa153d	Update tokenizer docs and add test	2020-08-09 15:24:01 +02:00
Matthew Honnibal	8a13f510d6	Update tests	2020-08-09 15:01:16 +02:00
Ines Montani	fe29ceec9e	Merge branch 'develop' into docs/model-docstrings	2020-08-07 18:42:01 +02:00
Ines Montani	3a193eb8f1	Fix imports, types and default configs	2020-08-07 18:40:54 +02:00
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848 ) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00
Ines Montani	ef2c67cca5	Add DocBin to/from_disk methods and update docs (#5892 ) * Add DocBin to/from_disk methods and update docs * Use DocBin.from_disk in Corpus	2020-08-07 14:30:59 +02:00
Matthew Honnibal	d4525816ef	Be less choosy about reporting textcat scores (#5879 ) * Set textcat scores more consistently * Refactor textcat scores * Fixes to scorer * Add comments * Add threshold * Rename just 'f' to micro_f in textcat scorer * Fix textcat score for two-class * Fix syntax * Fix textcat score * Fix docstring	2020-08-06 16:24:13 +02:00
Adriane Boyd	5e683a6e46	Fix return values for per feat score (#5885 ) * Fix return values for per feat score Convert `PRFScore` to dict as other per type scores. * Update tests accordingly	2020-08-06 15:14:47 +02:00
Ines Montani	5cc0d89fad	Simplify config overrides in CLI and deserialization (#5880 )	2020-08-05 23:35:09 +02:00
Ines Montani	823e533dc1	Add config callbacks for modifying nlp object before and after init (#5866 ) * WIP: Concept for modifying nlp object before and after init * Make callbacks return nlp object Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Raise if callbacks don't return correct type * Rename, update types, add after_pipeline_creation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-05 19:47:54 +02:00
Ines Montani	e68459296d	Tidy up and auto-format	2020-08-05 16:00:59 +02:00
Adriane Boyd	4193402c47	Add warning when Matcher subpattern is discarded (#5873 ) * Add a warning when a subpattern is not processed and discarded * Normalize subpattern attribute/operator keys to upper case like top-level attributes	2020-08-05 14:56:14 +02:00
Adriane Boyd	af125875cf	Update SimpleNER (#5878 ) * Fix `get_loss` to use NER annotation * Add labels as part of cfg * Add simple overfitting test	2020-08-05 14:43:29 +02:00
Sofie Van Landeghem	b88c5c701a	Bugfix in nlp.replace_pipe (#5875 ) * bugfix and unit test * merge two conditions	2020-08-05 09:30:58 +02:00
Ines Montani	b795f02fbd	Allow adding pipeline components from source model (#5857 ) * Allow adding pipeline components from source model * Config: name -> component * Improve error messages * Fix error and test * Add frozen components and exclude logic * Remove exclude from Language.evaluate * Init sourced components with current vocab * Fix error codes	2020-08-04 23:39:19 +02:00
Sofie Van Landeghem	34873c4911	Example Dict format consistency (#5858 ) * consistently use upper-case IDS in token_annotation format and for get_aligned * remove ID from to_dict (not used in from_dict either) * fix test Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 22:22:26 +02:00
Adriane Boyd	fa79a0db9f	Add AttributeRuler for token attribute exceptions (#5842 ) * Add AttributeRuler for token attribute exceptions Add the `AttributeRuler` to handle exceptions for token-level attributes. The `AttributeRuler` uses `Matcher` patterns to identify target spans and applies the specified attributes to the token at the provided index in the matched span. A negative index can be used to index from the end of the matched span. The retokenizer is used to "merge" the individual tokens and assign them the provided attributes. Helper functions can import existing tag maps and morph rules to the corresponding `Matcher` patterns. There is an additional minor bug fix for `MORPH` attributes in the retokenizer to correctly normalize the values and to handle `MORPH` alongside `_` in an attrs dict. * Fix default name * Update name in error message * Extend AttributeRuler functionality * Add option to initialize with a dict of AttributeRuler patterns * Instead of silently discarding overlapping matches (the default behavior for the retokenizer if only the attrs differ), split the matches into disjoint sets and retokenize each set separately. This allows, for instance, one pattern to set the POS and another pattern to set the lemma. (If two matches modify the same attribute, it looks like the attrs are applied in the order they were added, but it may not be deterministic?) * Improve types * Sort spans before processing * Fix index boundaries in Span * Refactor retokenizer to separate attrs methods Add top-level `normalize_token_attrs` and `set_token_attrs` methods. * Update AttributeRuler to use refactored methods Update `AttributeRuler` to replace use of full retokenizer with only the relevant methods for normalizing and setting attributes for a single token. * Update spacy/pipeline/attributeruler.py Co-authored-by: Ines Montani <ines@ines.io> * Make API more similar to EntityRuler * Add `AttributeRuler.add_patterns` to add patterns from a list of dicts * Return list of dicts as property `AttributeRuler.patterns` * Make attrs_unnormed private * Add test loading patterns from assets * Revert "Fix index boundaries in Span" This reverts commit `8f8a5c3386`. * Add Span index boundary checks (#5861) * Add Span index boundary checks * Return Span-specific IndexError in all cases * Simplify and fix if/else Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 17:02:39 +02:00
Sofie Van Landeghem	492d1ec5de	Prevent alignment when texts don't match (#5867 ) * remove empty gold.pyx * add alignment unit test (to be used in docs) * ensure that Alignment is only used on equal texts * additional test using example.alignment * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-04 16:29:18 +02:00
Matthew Honnibal	ecb3c4e8f4	Create corpus iterator and batcher from registry during training (#5865 ) * Move batchers into their own module (and registry) * Update CLI * Update Corpus and batcher * Update tests * Update one config * Merge 'evaluation' block back under [training] * Import batchers in gold __init__ * Fix batchers * Update config * Update schema * Update util * Don't assume train and dev are actually paths * Update onto-joint config * Fix missing import * Format * Format * Update spacy/gold/corpus.py Co-authored-by: Ines Montani <ines@ines.io> * Fix name * Update default config * Fix get_length option in batchers * Update test * Add comment * Pass path into Corpus * Update docstring * Update schema and configs * Update config * Fix test * Fix paths * Fix print * Fix create_train_batches * [training.read_train] -> [training.train_corpus] * Update onto-joint config Co-authored-by: Ines Montani <ines@ines.io>	2020-08-04 15:09:37 +02:00
Sofie Van Landeghem	82347110f5	Default empty KB in EL component (#5872 ) * EL field documentation * documentation consistent with docs * default empty KB, initialize vocab separately * formatting * add test for changing the default entity vector length * update comment	2020-08-04 14:34:09 +02:00
Adriane Boyd	b7e3018d97	Recalculate alignment if tokenization differs (#5868 ) * Recalculate alignment if tokenization differs * Refactor cached alignment data	2020-08-04 14:31:32 +02:00
Adriane Boyd	c62fd878a3	Allow Doc.char_span to snap to token boundaries (#5849 ) * Allow Doc.char_span to snap to token boundaries Add a `mode` option to allow `Doc.char_span` to snap to token boundaries. The `mode` options: * `strict`: character offsets must match token boundaries (default, same as before) * `inside`: all tokens completely within the character span * `outside`: all tokens at least partially covered by the character span Add a new helper function `token_by_char` that returns the token corresponding to a character position in the text. Update `token_by_start` and `token_by_end` to use `token_by_char` for more efficient searching. * Remove unused import * Rename mode to alignment_mode Rename `mode` to `alignment_mode` with the options `strict`/`contract`/`expand`. Any unrecognized modes are silently converted to `strict`.	2020-08-04 13:36:32 +02:00
Adriane Boyd	b841248589	Add Span index boundary checks (#5861 ) * Add Span index boundary checks * Return Span-specific IndexError in all cases * Simplify and fix if/else	2020-08-04 13:35:25 +02:00
Ines Montani	b40f44419b	Simplify pipe analysis - remove unused code - don't print by default - integrate attrs info into analysis output	2020-08-01 13:40:06 +02:00
Ines Montani	b68c53858c	Remove global	2020-07-31 18:37:58 +02:00
Ines Montani	30a76fcf6f	Integrate and simplify pipe analysis	2020-07-31 18:34:35 +02:00
Adriane Boyd	ac14ce7c30	Prefer earlier spans in EntityRuler (#5843 ) Similar to #4414, update the sorting in EntityRuler to prefer the first span in overlapping spans.	2020-07-31 16:09:32 +02:00
Adriane Boyd	9b509aa87f	Move Language.evaluate scorer config to new arg Move `Language.evaluate` scorer config from `component_cfg` to separate argument `scorer_cfg`.	2020-07-31 11:05:16 +02:00
Sofie Van Landeghem	ca491722ad	The Parser is now a Pipe (2) (#5844 ) * moving syntax folder to _parser_internals * moving nn_parser and transition_system * move nn_parser and transition_system out of internals folder * moving nn_parser code into transition_system file * rename transition_system to transition_parser * moving parser_model and _state to ml * move _state back to internals * The Parser now inherits from Pipe! * small code fixes * removing unnecessary imports * remove link_vectors_to_models * transition_system to internals folder * little bit more cleanup * newlines	2020-07-30 23:30:54 +02:00
Rahul Gupta	f76fae0e8d	English: adds ordinal numbers (#5830 )	2020-07-29 20:22:47 +02:00
Gustavo Zadrozny Leyendecker	90b958fd01	Fix on EntityRendered to support break lines (after last entity) (closes #5838 )	2020-07-29 18:48:39 +02:00
Matthew Honnibal	f0cf4a2dca	Update tests	2020-07-29 14:01:14 +02:00
Matthew Honnibal	c7d1ece3eb	Update tests	2020-07-29 14:01:13 +02:00
Matthew Honnibal	20e9098e3f	Update tests	2020-07-29 14:01:12 +02:00
Matthew Honnibal	1784c95827	Clean up link_vectors_to_models unused stuff	2020-07-29 14:01:11 +02:00
Sofie Van Landeghem	40c995b1be	Option for returning only greedy matches (#5771 ) * add "greedy" option for match pattern * distinction between greedy FIRST or LONGEST * check for proper values, throw custom warning otherwise * unxfail one more test * add comment in docstring * add test that LONGEST also prefers first match if equal length * use c arrays for more efficient processing * rename 'greediness' to 'greedy'	2020-07-29 11:04:43 +02:00
Ines Montani	b83ead5bf5	Merge pull request #5824 from svlandeg/fix/textcat-v3	2020-07-28 15:04:25 +02:00
Ines Montani	06a97a8766	Support --opt=value format in CLI config overrides	2020-07-28 13:43:15 +02:00
Ines Montani	0094cb0d04	Remove scores list from config and document	2020-07-28 11:22:24 +02:00
graue70	b97dbab998	Fix typo in unit tests (#5823 )	2020-07-27 20:18:48 +02:00
Ines Montani	894e20c466	Merge branch 'develop' into feature/component-scores	2020-07-27 18:14:39 +02:00
svlandeg	85b2dcfd67	cleanup	2020-07-27 17:54:44 +02:00
svlandeg	61068e0fb1	util function dot_to_object and corresponding unit test	2020-07-27 17:50:12 +02:00
Adriane Boyd	8bb0507777	Add and update score methods and score weights Add and update `score` methods, provided `scores`, and default weights `default_score_weights` for pipeline components. * `scores` provides all top-level keys returned by `score` (merely informative, similar to `assigns`). * `default_score_weights` provides the default weights for a default config. * The keys from `default_score_weights` determine which values will be shown in the `spacy train` output, so keys with weight `0.0` will be displayed but not counted toward the overall score.	2020-07-27 14:44:53 +02:00
Adriane Boyd	baf19fd652	Update cats scoring to provide overall score * Provide top-level score as `attr_score` * Provide a description of the score as `attr_score_desc` * Provide all potential scores keys, setting unused keys to `None` * Update CLI evaluate accordingly	2020-07-27 12:26:10 +02:00
Ines Montani	ed61fb10fc	Rename default textcat arch to TextCatEnsemble	2020-07-26 15:11:43 +02:00
Ines Montani	4060c2d5a6	Fix test	2020-07-26 13:40:19 +02:00
Ines Montani	2470486543	Allow pipeline components to set default scores and weights	2020-07-26 13:18:43 +02:00
Ines Montani	e92df281ce	Tidy up, autoformat, add types	2020-07-25 15:01:15 +02:00
Ines Montani	cdbd6ba912	Merge pull request #5798 from explosion/feature/language-data-config	2020-07-25 13:34:49 +02:00
Adriane Boyd	2bcceb80c4	Refactor the Scorer to improve flexibility (#5731 ) * Refactor the Scorer to improve flexibility Refactor the `Scorer` to improve flexibility for arbitrary pipeline components. * Individual pipeline components provide their own `evaluate` methods that score a list of `Example`s and return a dictionary of scores * `Scorer` is initialized either: * with a provided pipeline containing components to be scored * with a default pipeline containing the built-in statistical components (senter, tagger, morphologizer, parser, ner) * `Scorer.score` evaluates a list of `Example`s and returns a dictionary of scores referring to the scores provided by the components in the pipeline Significant differences: * `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc` and the new `morph_acc`, `pos_acc`, and `lemma_acc` * Scoring is no longer cumulative: `Scorer.score` scores a list of examples rather than a single example and does not retain any state about previously scored examples * PRF values in the returned scores are no longer multiplied by 100 * Add kwargs to Morphologizer.evaluate * Create generalized scoring methods in Scorer * Generalized static scoring methods are added to `Scorer` * Methods require an attribute (either on Token or Doc) that is used to key the returned scores Naming differences: * `uas`, `las`, and `las_per_type` in the scores dict are renamed to `dep_uas`, `dep_las`, and `dep_las_per_type` Scoring differences: * `Doc.sents` is now scored as spans rather than on sentence-initial token positions so that `Doc.sents` and `Doc.ents` can be scored with the same method (this lowers scores since a single incorrect sentence start results in two incorrect spans) * Simplify / extend hasattr check for eval method * Add hasattr check to tokenizer scoring * Simplify to hasattr check for component scoring * Reset Example alignment if docs are set Reset the Example alignment if either doc is set in case the tokenization has changed. * Add PRF tokenization scoring for tokens as spans Add PRF scores for tokens as character spans. The scores are: * token_acc: # correct tokens / # gold tokens * token_p/r/f: PRF for (token.idx, token.idx + len(token)) * Add docstring to Scorer.score_tokenization * Rename component.evaluate() to component.score() * Update Scorer API docs * Update scoring for positive_label in textcat * Fix TextCategorizer.score kwargs * Update Language.evaluate docs * Update score names in default config	2020-07-25 12:53:02 +02:00
Ines Montani	b9aaa4e457	Improve vocab data integration and warning	2020-07-25 11:51:30 +02:00
Ines Montani	38f6ea7a78	Simplify language data and revert detailed configs	2020-07-24 14:50:26 +02:00
Adriane Boyd	656574a01a	Update Japanese tests (#5807 ) * Update POS tests to reflect current behavior (it is not entirely clear whether the AUX/VERB mapping is indeed the desired behavior?) * Switch to `from_config` initialization in subtoken test	2020-07-24 12:45:14 +02:00
Adriane Boyd	fdb8815ef5	Minor refactor for Morphology and MorphAnalysis (#5804 ) * `MorphAnalysis.get` returns only the field values * Move `_normalize_props` inside `Morphology` as `Morphology.normalize_attrs` and simplify * Simplify POS field detection/conversion * Convert all non-POS features to strings * `Morphology` returns an empty string for a missing morph to align with the FEATS string returned for an existing morph * Remove unused `list_to_feats`	2020-07-24 09:28:06 +02:00
Ines Montani	a624ae0675	Remove POS, TAG and LEMMA from tokenizer exceptions	2020-07-22 23:09:01 +02:00
Ines Montani	14d7d46f89	Merge branch 'develop' into feature/language-data-config	2020-07-22 22:18:53 +02:00
Ines Montani	b507f61629	Tidy up and move noun_chunks, token_match, url_match	2020-07-22 22:18:46 +02:00
Ines Montani	d0c6d1efc5	@factories -> factory (#5801 )	2020-07-22 17:29:31 +02:00
Ines Montani	945f795a3e	WIP: move more language data to config	2020-07-22 15:59:37 +02:00
Adriane Boyd	b84fd70cc3	Fix exceptions for Morphology.__reduce__ (#5792 ) Pickle exceptions in the MORPH_RULES format instead of the internal format after the recent `Morphology.__init__` changes.	2020-07-22 15:00:25 +02:00
Ines Montani	43b960c01b	Refactor pipeline components, config and language data (#5759 ) * Update with WIP * Update with WIP * Update with pipeline serialization * Update types and pipe factories * Add deep merge, tidy up and add tests * Fix pipe creation from config * Don't validate default configs on load * Update spacy/language.py Co-authored-by: Ines Montani <ines@ines.io> * Adjust factory/component meta error * Clean up factory args and remove defaults * Add test for failing empty dict defaults * Update pipeline handling and methods * provide KB as registry function instead of as object * small change in test to make functionality more clear * update example script for EL configuration * Fix typo * Simplify test * Simplify test * splitting pipes.pyx into separate files * moving default configs to each component file * fix batch_size type * removing default values from component constructors where possible (TODO: test 4725) * skip instead of xfail * Add test for config -> nlp with multiple instances * pipeline.pipes -> pipeline.pipe * Tidy up, document, remove kwargs * small cleanup/generalization for Tok2VecListener * use DEFAULT_UPSTREAM field * revert to avoid circular imports * Fix tests * Replace deprecated arg * Make model dirs require config * fix pickling of keyword-only arguments in constructor * WIP: clean up and integrate full config * Add helper to handle function args more reliably Now also includes keyword-only args * Fix config composition and serialization * Improve config debugging and add visual diff * Remove unused defaults and fix type * Remove pipeline and factories from meta * Update spacy/default_config.cfg Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/default_config.cfg * small UX edits * avoid printing stack trace for debug CLI commands * Add support for language-specific factories * specify the section of the config which holds the model to debug * WIP: add Language.from_config * Update with language data refactor WIP * Auto-format * Add backwards-compat handling for Language.factories * Update morphologizer.pyx * Fix morphologizer * Update and simplify lemmatizers * Fix Japanese tests * Port over tagger changes * Fix Chinese and tests * Update to latest Thinc * WIP: xfail first Russian lemmatizer test * Fix component-specific overrides * fix nO for output layers in debug_model * Fix default value * Fix tests and don't pass objects in config * Fix deep merging * Fix lemma lookup data registry Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed) * Add types * Add Vocab.from_config * Fix typo * Fix tests * Make config copying more elegant * Fix pipe analysis * Fix lemmatizers and is_base_form * WIP: move language defaults to config * Fix morphology type * Fix vocab * Remove comment * Update to latest Thinc * Add morph rules to config * Tidy up * Remove set_morphology option from tagger factory * Hack use_gpu * Move [pipeline] to top-level block and make [nlp.pipeline] list Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them * Fix use_gpu and resume in CLI * Auto-format * Remove resume from config * Fix formatting and error * [pipeline] -> [components] * Fix types * Fix tagger test: requires set_morphology? Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-22 13:42:59 +02:00
Ines Montani	311d0bde29	Merge pull request #5788 from explosion/master-tmp	2020-07-20 15:39:24 +02:00
Ines Montani	d51db72e46	Remove Python 2 marker	2020-07-20 15:01:36 +02:00
Ines Montani	644074b954	Merge branch 'develop' into master-tmp	2020-07-20 14:58:04 +02:00
Sofie Van Landeghem	c9da9605f7	Test suite clean up (#5781 ) * step_through tests: skip instead of xfail * test_empty_doc should be fixed with new Thinc version * remove outdated test (there are other misaligned tests now) * xfail reason * fix test according to french exceptions * clarified some skipped tests * skip ukranian test instead of xfail * skip instead of xfail * skip + reason instead of xfail * removed obsolete tests referring to removed "set_frozen" functionality * fix test 999 * remove unused AlignmentError * remove xfail where possible, skip otherwise * increment thinc release for empty_doc test	2020-07-20 14:49:54 +02:00
Sofie Van Landeghem	1b2ec94382	Hyphen infix (#5770 ) * infix split on hyphen when preceded by number * clean up * skip ukranian test instead of xfail	2020-07-20 14:48:51 +02:00
Ines Montani	cb65b36839	Merge pull request #5767 from adrianeboyd/feature/remove-tag-maps	2020-07-19 15:15:34 +02:00
Ines Montani	796f6c52d1	Merge branch 'develop' into pr/5767	2020-07-19 13:37:46 +02:00
Adriane Boyd	39ebcd9ec9	Refactor Chinese tokenizer configuration (#5736 ) * Refactor Chinese tokenizer configuration Refactor `ChineseTokenizer` configuration so that it uses a single `segmenter` setting to choose between character segmentation, jieba, and pkuseg. * replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting `segmenter` with the supported values: `char`, `jieba`, `pkuseg` * make the default segmenter plain character segmentation `char` (no additional libraries required) * Fix Chinese serialization test to use char default * Warn if attempting to customize other segmenter Add a warning if `Chinese.pkuseg_update_user_dict` is called when another segmenter is selected.	2020-07-19 13:34:37 +02:00
Adriane Boyd	9ee1c54f40	Improve tag map initialization and updating (#5764 ) * Improve tag map initialization and updating Generalize tag map initialization and updating so that the tag map can be loaded correctly prior to loading a `Corpus` with `spacy debug-data` and `spacy train`. * normalize provided tag map as necessary * use the same method for initializing and updating the tag map * Replace rather than update tag map Replace rather than update tag map when loading a custom tag map. Updating the tag map is problematic due to the sorted list of tag names and the fact that the tag map will contain lingering/unwanted tags from the default tag map. * Update CLI scripts * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 13:13:57 +02:00
Adriane Boyd	b81a89f0a9	Update morphologizer (#5766 ) * update `Morphologizer.begin_training` for use with `Example` * make init and begin_training more consistent * add `Morphology.normalize_features` to normalize outside of `Morphology.add` * make sure `get_loss` doesn't create unknown labels when the POS and morph alignments differ	2020-07-19 11:10:51 +02:00
Adriane Boyd	50db3f0cdb	Serialize morph rules with tagger Serialize `morph_rules` with the tagger alongside the `tag_map`. Use `Morphology.load_tag_map` and `Morphology.load_morph_exceptions` to load these settings rather than reinitializing the morphology each time they are changed.	2020-07-17 08:22:21 +02:00
Adriane Boyd	2f981d5af1	Remove corpus-specific tag maps Remove corpus-specific tag maps from the language data for languages without custom tokenizers. For languages with custom word segmenters that also provide tags (Japanese and Korean), the tag maps for the custom tokenizers are kept as the default. The default tag maps for languages without custom tokenizers are now the default tag map from `lang/tag_map/py`, UPOS -> UPOS.	2020-07-15 15:58:29 +02:00
Adriane Boyd	a7a7e0d2a6	Add morph to morphology in Doc.from_array (#5762 ) * Add morph to morphology in Doc.from_array Add morphological analyses to morphology table in `Doc.from_array`. * Use separate vocab in DocBin roundtrip test	2020-07-14 14:07:35 +02:00
Ines Montani	872938ec76	Merge pull request #5747 from explosion/feature/refactor-config-args	2020-07-14 00:00:22 +02:00
Ines Montani	ed55143c0d	Merge branch 'develop' into compat/remove-object-subclass	2020-07-12 14:28:52 +02:00
Ines Montani	7906ddd56c	Fix test	2020-07-12 14:28:34 +02:00
Ines Montani	5f6f4ff594	Remove object subclassing	2020-07-12 14:03:23 +02:00
Ines Montani	b7111da1d7	Update config and commands	2020-07-11 13:03:53 +02:00
Ines Montani	0389c34b81	Merge branch 'develop' into feature/refactor-config-args	2020-07-10 20:51:52 +02:00
Sofie Van Landeghem	de6a32315c	debug-model script (#5749 ) * adding debug-model to print the internals for debugging purposes * expend debug-model script with 4 stages: before, init, train, predict * avoid enforcing to have a seed in the train script * small fixes	2020-07-10 19:47:53 +02:00
Ines Montani	5cfc3edcaa	Update CLI tests	2020-07-10 18:21:01 +02:00
Ines Montani	73332ddb67	Update CLI commans to use one shared util file	2020-07-10 17:57:40 +02:00

... 6 7 8 9 10 ...

2474 Commits