spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-02 10:53:05 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	6f5e308d17	Support negative examples in partial NER annotations (#8106 ) * Support a cfg field in transition system * Make NER 'has gold' check use right alignment for span * Pass 'negative_samples_key' property into NER transition system * Add field for negative samples to NER transition system * Check neg_key in NER has_gold * Support negative examples in NER oracle * Test for negative examples in NER * Fix name of config variable in NER * Remove vestiges of old-style partial annotation * Remove obsolete tests * Add comment noting lack of support for negative samples in parser * Additions to "neg examples" PR (#8201) * add custom error and test for deprecated format * add test for unlearning an entity * add break also for Begin's cost * add negative_samples_key property on Parser * rename * extend docs & fix some older docs issues * add subclass constructors, clean up tests, fix docs * add flaky test with ValueError if gold parse was not found * remove ValueError if n_gold == 0 * fix docstring * Hack in environment variables to try out training * Remove hack * Remove NER hack, and support 'negative O' samples * Fix O oracle * Fix transition parser * Remove 'not O' from oracle * Fix NER oracle * check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation * use set instead of list in consistency check Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-06-17 17:33:00 +10:00
Adriane Boyd	5646fcbe46	Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1	2021-06-15 15:05:17 +02:00
Paul O'Leary McCann	2c105cdbce	Raise error if deps not provided with heads (#8335 ) * Fill in deps if not provided with heads Before this change, if heads were passed without deps they would be silently ignored, which could be confusing. See #8334. * Use "dep" instead of a blank string This is the customary placeholder dep. It might be better to show an error here instead though. * Throw error on heads without deps * Add a test * Fix tests * Formatting * Fix all tests * Fix a test I missed * Revise error message * Clean up whitespace Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-06-15 13:23:32 +02:00
Adriane Boyd	9dfd3c9484	Use warnings.warn instead of logger.warning	2021-06-04 17:44:08 +02:00
Sofie Van Landeghem	f0277bdeab	Show warning if entity_ruler runs without patterns (#7807 ) * Show warning if entity_ruler runs without patterns * Show warning if matcher runs without patterns * fix wording * unit test for warning once (WIP) * warn W036 only once * cleanup * create filter_warning helper	2021-06-04 17:37:38 +02:00
Sofie Van Landeghem	ff91e6dac7	Show warning if entity_ruler runs without patterns (#7807 ) * Show warning if entity_ruler runs without patterns * Show warning if matcher runs without patterns * fix wording * unit test for warning once (WIP) * warn W036 only once * cleanup * create filter_warning helper	2021-05-31 18:20:27 +10:00
Sofie Van Landeghem	0dffc5d9e2	Custom warning if the doc_bin is too large (#8069 ) * custom warning if the doc_bin is too large * cleanup * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * fix numbering * fixing numbering once more * fixing this seems to be pretty hard Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-05-17 15:48:40 +02:00
Adriane Boyd	b120fb3511	Handle errors while multiprocessing (#8004 ) * Handle errors while multiprocessing Handle errors while multiprocessing without hanging. * Return the traceback for errors raised while processing a batch, which can be handled by the top-level error handler * Allow for shortened batches due to custom error handlers that ignore errors and skip documents * Define custom components at a higher level * Also move up custom error handler * Use simpler component for test * Switch error type * Adjust test * Only call top-level error handler for exceptions * Register custom test components within tests Use global functions (so they can be pickled) but register the components only within the individual tests.	2021-05-17 13:28:39 +02:00
Adriane Boyd	82fa81d095	Make all Span attrs writable (#8062 ) Also allow `Span` string properties `label_` and `kb_id_` to be writable following #6696.	2021-05-17 18:05:45 +10:00
Adriane Boyd	bdb485cc80	Add callback to copy vocab/tokenizer from model (#7750 ) * Add callback to copy vocab/tokenizer from model Add callback `spacy.copy_from_base_model.v1` to copy the tokenizer settings and/or vocab (including vectors) from a base model. * Move spacy.copy_from_base_model.v1 to spacy.training.callbacks * Add documentation * Modify to specify model as tokenizer and vocab params	2021-04-22 12:36:50 +02:00
Adriane Boyd	1ad646cbcf	Improve checks for sourced components (#7490 ) * Improve checks for sourced components * Remove language class checks * Convert python warning to logger warning * Remove unused warning * Fix formatting	2021-04-19 18:36:32 +10:00
Bram Vanroy	ed561cf428	Terminology: deprecated vs obsolete (#7621 ) * Terminology: deprecated vs obsolete Typically, deprecated is used for functionality that is bound to become unavailable but that can still be used. Obsolete is used for features that have been removed. In E941, I think what is meant is "obsolete" since loading a model by a shortcut simply does not work anymore (and throws an error). This is different from downloading a model with a shortcut, which is deprecated but still works. In light of this, perhaps all other error codes should be checked as well. * clarify that the link command is removed and not just deprecated Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-04-12 14:37:00 +02:00
Paul O'Leary McCann	7944761ba7	Add warning if initial vectors are empty (#7641 ) See #7637, where this came up.	2021-04-04 20:20:24 +02:00
Adriane Boyd	139f655f34	Merge doc.spans in Doc.from_docs() (#7497 ) Merge data from `doc.spans` in `Doc.from_docs()`. * Fix internal character offset set when merging empty docs (only affects tokens and spans in `user_data` if an empty doc is in the list of docs)	2021-03-29 22:34:01 +11:00
Adriane Boyd	39153ef90f	Update lexeme_norm checks * Add util method for check * Add new languages to list with lexeme norm tables * Add check to all relevant components * Add config details to warning message Note that we're not actually inspecting the model config to see if `NORM` is used as an attribute, so it may warn in cases where it's not relevant.	2021-03-19 10:59:27 +01:00
Adriane Boyd	d746ea6278	Add warning about GPU selection in Jupyter notebooks (#7075 ) * Initial warning * Update check * Redo edit * Move jupyter warning to helper method * Add link with details to warnings	2021-03-09 15:35:21 +01:00
Sofie Van Landeghem	39de3602e0	return custom error in nlp.initialize (#7104 ) * return custom error in nlp.initialize * Rename error Co-authored-by: Ines Montani <ines@ines.io>	2021-03-09 23:01:31 +11:00
Sofie Van Landeghem	cd70c3cb79	Fixing pretrain (#7342 ) * initialize NLP with train corpus * add more pretraining tests * more tests * function to fetch tok2vec layer for pretraining * clarify parameter name * test different objectives * formatting * fix check for static vectors when using vectors objective * clarify docs * logger statement * fix init_tok2vec and proc.initialize order * test training after pretraining * add init_config tests for pretraining * pop pretraining block to avoid config validation errors * custom errors	2021-03-09 14:01:13 +11:00
Sofie Van Landeghem	212f0e779e	Support doc.spans in Example.from_dict (#7197 ) * add support for spans in Example.from_dict * add unit tests * update error to E879	2021-03-03 01:12:54 +11:00
svlandeg	2010219a7f	import wandb failure - UX	2021-02-26 18:00:39 +01:00
Sofie Van Landeghem	ba5a50f62b	NEL docs & UX (#7129 ) * EL set_kb docs fix * custom warning for set_kb mistake	2021-02-22 11:04:22 +11:00
Adriane Boyd	6108dabdc8	Rephrase error related to sample data initialization Now that the initialize step is fully implemented, the source of E923 is typically missing or improperly converted/formatted data rather than a bug in spaCy, so rephrase the error and message and remove the prompt to open an issue.	2021-02-08 09:21:36 +01:00
Ines Montani	d0c3775712	Replace links to nightly docs [ci skip]	2021-01-30 20:09:38 +11:00
Ines Montani	526b416118	Tidy up comments	2021-01-30 12:34:09 +11:00
Ines Montani	30765674d0	Merge branch 'master' into develop	2021-01-30 12:20:28 +11:00
Ines Montani	7694f76dd1	Update warning and mention replace_listeners	2021-01-29 23:46:01 +11:00
Ines Montani	94232aea08	Improve E889	2021-01-29 23:39:23 +11:00
Ines Montani	bbb94b37c6	Update error handling and docstring	2021-01-29 16:27:49 +11:00
Adriane Boyd	fcce3600ed	Forbid OP matching 2+ tokens in DependencyMatcher (#6824) Instead of silently using only the first token in each matched span: * Forbid `OP: ?/*/+` through `DependencyMatcher` validation As a fail-safe, add warning if a token match that's not exactly one token long is found by a token pattern.	2021-01-29 08:52:01 +08:00
Sofie Van Landeghem	24a697abb8	avoid empty aliases and improve UX and docs (#6840 )	2021-01-29 08:51:40 +08:00
Adriane Boyd	4096a79de7	Add alignment mode error and fix Doc.char_span docs (#6820 ) * Raise an error on an unrecognized alignment mode rather than defaulting to `strict` * Fix the `Doc.char_span` API doc alignment mode details	2021-01-27 23:40:42 +11:00
Ines Montani	c0926c9088	WIP: Various small training changes (#6818) * Allow output_path to be None during training * Fix cat scoring (?) * Improve error message for weighted None score * Improve messages So we can call this in other places etc. * Fix output path check * Use latest wasabi * Revert "Improve error message for weighted None score" This reverts commit `7059926763`. * Exclude None scores from final score by default It's otherwise very difficult to keep track of the score weights if we modify a config programmatically, source components etc. * Update warnings and use logger.warning	2021-01-26 14:51:52 +11:00
Ines Montani	1090d3d675	Merge branch 'develop' into feature/spacy-legacy	2021-01-18 11:43:39 +11:00
Sofie Van Landeghem	fed8f48965	raise NotImplementedError when noun_chunks iterator is not implemented (#6711 ) * raise NotImplementedError when noun_chunks iterator is not implemented * bring back, fix and document span.noun_chunks * formatting Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2021-01-17 19:56:05 +08:00
Ines Montani	a552db2819	Include available registry names in error	2021-01-16 14:35:03 +11:00
Ines Montani	a203e3dbb8	Support spacy-legacy via the registry	2021-01-15 21:42:40 +11:00
Adriane Boyd	681a6195f7	Validate seed and gpu_allocator manually	2021-01-14 16:57:57 +01:00
Sofie Van Landeghem	afc5714d32	multi-label textcat component (#6474 ) * multi-label textcat component * formatting * fix comment * cleanup * fix from #6481 * random edit to push the tests * add explicit error when textcat is called with multi-label gold data * fix error nr * small fix	2021-01-06 13:07:14 +11:00
Adriane Boyd	5ca57d8221	Add logger warning when serializing user hooks (#6595 ) Add a warning that user hooks are lost on serialization. Add a `user_hooks` exclude to skip the warning with pickle.	2020-12-29 11:54:32 +01:00
Ines Montani	dfaef27f90	Merge pull request #6503 from adrianeboyd/feature/lemmatizer-rule-warning-pos Warn on empty POS for the rule-based lemmatizer	2020-12-09 11:34:16 +11:00
Sofie Van Landeghem	de108ed3e8	Add specific error when StaticVectors can't read the vectors data (#6450 )	2020-12-09 06:16:07 +08:00
Sofie Van Landeghem	f98a04434a	pretrain architectures (#6451) * define new architectures for the pretraining objective * add loss function as attr of the model * cleanup * cleanup * shorten name * fix typo * remove unused error	2020-12-08 14:41:03 +08:00
Ines Montani	ee2ec52f48	Merge pull request #6409 from svlandeg/feature/trf-docs	2020-12-08 06:32:10 +01:00
Adriane Boyd	d70950605c	Warn on empty POS for the rule-based lemmatizer Add a warning to the rule-based lemmatizer for any tokens without POS annotation.	2020-12-04 11:46:15 +01:00
Adriane Boyd	26296ab223	Add error message if DocBin zlib decompress fails (#6394 ) Add a better error message if DocBin zlib decompress fails, indicating that the data is not in `DocBin` format.	2020-11-27 14:39:49 +08:00
svlandeg	789fb3d124	add docs for upstream argument of TransformerListener	2020-11-09 21:42:58 +01:00
Adriane Boyd	1c4df8fd09	Replace pytokenizations with internal alignment (#6293 ) * Replace pytokenizations with internal alignment Replace pytokenizations with internal alignment algorithm that is restricted to only allow differences in whitespace and capitalization. * Rename `spacy.training.align` to `spacy.training.alignment` to contain the `Alignment` dataclass * Implement `get_alignments` in `spacy.training.align` * Refactor trailing whitespace handling * Remove unnecessary exception for empty docs Allow a non-empty whitespace-only doc to be aligned with an empty doc * Remove empty docs exceptions completely	2020-11-03 16:24:38 +01:00
Sofie Van Landeghem	75a202ce65	TextCat updates and fixes (#6263 ) * small fix in example imports * throw error when train_corpus or dev_corpus is not a string * small fix in custom logger example * limit macro_auc to labels with 2 annotations * fix typo * also create parents of output_dir if need be * update documentation of textcat scores * refactor TextCatEnsemble * fix tests for new AUC definition * bump to 3.0.0a42 * update docs * rename to spacy.TextCatEnsemble.v2 * spacy.TextCatEnsemble.v1 in legacy * cleanup * small fix * update to 3.0.0rc2 * fix import that got lost in merge * cursed IDE * fix two typos	2020-10-18 14:50:41 +02:00
Ines Montani	bfa3931c9d	Revert added_strings change (#6236 )	2020-10-10 18:55:07 +02:00
Ines Montani	8ac5f22253	Adjust error message	2020-10-09 18:00:16 +02:00
svlandeg	06b9d213fd	formatting	2020-10-09 12:19:47 +02:00
svlandeg	2cafba5f50	shorten error message for clarity	2020-10-09 12:17:35 +02:00
svlandeg	18dfb27985	Add custom error when evaluation throws a KeyError	2020-10-09 12:05:33 +02:00
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213 ) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
Ines Montani	be99f1e4de	Remove output dirs before training (#6204 ) * Remove output dirs before training * Re-raise error if cleaning fails	2020-10-05 20:11:16 +02:00
svlandeg	fd2d48556c	fix E902 and E903 numbering	2020-10-05 13:43:32 +02:00
Ines Montani	d38dc466c5	Adjust error [ci skip]	2020-10-04 15:26:01 +02:00
Ines Montani	bcd52e5486	Tidy up errors and warnings	2020-10-04 11:16:31 +02:00
Ines Montani	d3b3663942	Adjust error message and add test	2020-10-04 10:11:27 +02:00
Ines Montani	cc08c88a89	Merge pull request #6187 from svlandeg/fix/begin_training_pipe	2020-10-04 10:01:02 +02:00
svlandeg	3f657ed3a1	implement warning in __init_subclass__ instead	2020-10-03 22:34:10 +02:00
Ines Montani	dd542ec6a4	Fix label initialization of textcat component (#6190 )	2020-10-03 17:07:38 +02:00
svlandeg	fb48de349c	bwd compat for pipe.begin_training	2020-10-02 20:31:14 +02:00
Sofie Van Landeghem	09dcb75076	small UX fix for DocBin (#6167 ) * add informative warning when messing up store_user_data DocBin flags * add informative warning when messing up store_user_data DocBin flags * cleanup test * rename to patterns_path	2020-10-02 15:43:32 +02:00
Ines Montani	f0b30aedad	Make lemmatizers use initialize logic (#6182 ) * Make lemmatizer use initialize logic and tidy up * Fix typo * Raise for uninitialized tables	2020-10-02 15:42:36 +02:00
Ines Montani	01c1538c72	Integrate file readers	2020-10-02 01:36:06 +02:00
Adriane Boyd	86c3ec9c2b	Refactor Token morph setting (#6175 ) * Refactor Token morph setting * Remove `Token.morph_` * Add `Token.set_morph()` * `0` resets `token.c.morph` to unset * Any other values are passed to `Morphology.add` * Add token.morph setter to set from MorphAnalysis	2020-10-01 22:21:46 +02:00
Ines Montani	381258b75b	Merge pull request #6165 from explosion/feature/update-tokenizers-initialize	2020-10-01 09:49:47 +02:00
Ines Montani	6f29f68f69	Update errors and make Tokenizer.initialize args less strict	2020-09-30 23:48:47 +02:00
Ines Montani	a103ab5f1a	Update augmenter lookups and docs	2020-09-30 23:03:47 +02:00
Adriane Boyd	6b7bb32834	Refactor Chinese initialization	2020-09-30 11:46:45 +02:00
Ines Montani	1aeef3bfbb	Make corpus paths default to None and improve errors	2020-09-29 22:33:46 +02:00
Ines Montani	78021089f9	Merge pull request #6160 from explosion/feature/prepare	2020-09-29 20:55:13 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Adriane Boyd	11e195d3ed	Update ChineseTokenizer * Allow `pkuseg_model` to be set to `None` on initialization * Don't save config within tokenizer * Force convert pkuseg_model to use pickle protocol 4 by reencoding with `pickle5` on serialization * Update pkuseg serialization test	2020-09-27 14:00:18 +02:00
Sofie Van Landeghem	009ba14aaf	Fix pretraining in train script (#6143 ) * update pretraining API in train CLI * bump thinc to 8.0.0a35 * bump to 3.0.0a26 * doc fixes * small doc fix	2020-09-25 15:47:10 +02:00
Adriane Boyd	59340606b7	Add option to disable Matcher errors (#6125 ) * Add option to disable Matcher errors * Add option to disable Matcher errors when a doc doesn't contain a particular type of annotation Minor additional change: * Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH` values * Rename suppress_errors to allow_missing Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Refactor annotation checks in Matcher and PhraseMatcher Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-24 16:54:39 +02:00
Sofie Van Landeghem	c7eedd3534	updates to NEL functionality (#6132 ) * NEL: read sentences and ents from reference * fiddling with sent_start annotations * add KB serialization test * KB write additional file with strings.json * score_links function to calculate NEL P/R/F * formatting * documentation	2020-09-24 16:53:59 +02:00
Ines Montani	58dde293ce	Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2	2020-09-24 14:44:42 +02:00
Ines Montani	92f8b6959a	Fix typo	2020-09-24 13:48:41 +02:00
Adriane Boyd	5c13e0cf1b	Remove unused error	2020-09-24 13:41:55 +02:00
Ines Montani	be56c0994b	Add [training.before_to_disk] callback	2020-09-24 12:40:25 +02:00
Ines Montani	f69fea8b25	Improve error handling around non-number scores	2020-09-24 11:29:07 +02:00
Ines Montani	4eb39b5c43	Fix logging	2020-09-24 11:04:35 +02:00
svlandeg	25b34bba94	throw custom error when state_type is invalid	2020-09-23 16:57:14 +02:00
Adriane Boyd	b1a7d6c528	Refactor seen token detection	2020-09-22 14:42:51 +02:00
Adriane Boyd	535842e483	Merge branch 'develop' into feature/doc-ents-v3-2	2020-09-22 13:45:50 +02:00
svlandeg	b556a10808	rename converts in_to_out	2020-09-22 11:50:19 +02:00
Ines Montani	49e80dbcac	Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc	2020-09-22 09:45:04 +02:00
Ines Montani	81606b29bd	Merge pull request #6104 from svlandeg/fix/debug_model [ci skip]	2020-09-22 09:31:23 +02:00
Ines Montani	67fbcb3da5	Tidy up tests and docs	2020-09-21 20:43:54 +02:00
Adriane Boyd	177df15d89	Implement Doc.set_ents	2020-09-21 15:54:05 +02:00
svlandeg	eb9b447960	Merge remote-tracking branch 'upstream/develop' into fix/debug_model # Conflicts: # spacy/cli/debug_model.py	2020-09-21 14:05:16 +02:00
Adriane Boyd	bc02e86494	Extend Doc.__init__ with additional annotation Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to `Doc.__init__` to initialize the most common doc/token values.	2020-09-21 13:36:24 +02:00
svlandeg	73ff52b9ec	hack for tok2vec listener	2020-09-18 16:43:15 +02:00
Adriane Boyd	a88106e852	Remove W106: HEAD and SENT_START in doc.from_array (#6086 ) * Remove W106: HEAD and SENT_START in doc.from_array This warning was hacky and being triggered too often. * Fix test	2020-09-18 03:01:29 +02:00
Adriane Boyd	7e4cd7575c	Refactor Docs.is_ flags (#6044) * Refactor Docs.is_ flags * Add derived `Doc.has_annotation` method * `Doc.has_annotation(attr)` returns `True` for partial annotation * `Doc.has_annotation(attr, require_complete=True)` returns `True` for complete annotation * Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced` and `is_nered` * Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The list is the `DocBin` attributes list plus `SPACY` and `LENGTH`. Notes on `Doc.has_annotation`: * `HEAD` is converted to `DEP` because heads don't have an unset state * Accept `IS_SENT_START` as a synonym of `SENT_START` Additional changes: * Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for `DocBin` * In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override `SENT_START` * In `Doc.from_array()` using `attrs` other than `Doc._get_array_attrs()` (i.e., a user's custom list rather than our default internal list) with both `HEAD` and `SENT_START` shows a warning that `HEAD` will override `SENT_START` * `set_children_from_heads` does not require dependency labels to set sentence boundaries and sets `sent_start` for all non-sentence starts to `-1` * Fix call to set_children_from_heads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-17 00:14:01 +02:00
Ines Montani	aaf01689a1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-15 14:24:42 +02:00
Ines Montani	d3d7f92f05	Fix lang check and error handling in Language.from_config	2020-09-15 14:24:06 +02:00
Ines Montani	253ba5ef14	Raise for bad Vocab values	2020-09-15 13:25:34 +02:00
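
Several commits above concern how warnings are emitted: 9dfd3c9484 switches to warnings.warn, and f0277bdeab warns about W036 only once and adds a filter_warning helper. The sketch below illustrates that general pattern using only Python's standard warnings module. It is not taken from the spaCy source: the body of filter_warning, the W036 message text and run_without_patterns are all assumptions made up for illustration.

```python
import re
import warnings


def filter_warning(action: str, error_msg: str) -> None:
    """Hypothetical helper: apply a warnings filter action to one message.

    The message is escaped because warnings.filterwarnings treats it as a
    regular expression matched against the start of the warning text.
    """
    warnings.filterwarnings(action, message=re.escape(error_msg))


# Illustrative message text, not the real wording of any spaCy warning.
W036 = "[W036] The component 'matcher' does not have any patterns defined."


def run_without_patterns() -> None:
    """Stand-in for a component call that has nothing to match against."""
    warnings.warn(W036, UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # baseline: record every warning
    filter_warning("once", W036)     # but let this message through only once
    for _ in range(3):
        run_without_patterns()

# Number of warnings that were actually recorded for the three calls.
n_shown = len(caught)
```

Because the "once" filter is installed after the "always" baseline, it takes precedence for the matching message, so the three calls should leave a single recorded warning in caught.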


369 Commits