spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-09-21 19:39:13 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	d093d6343b	TrainablePipe (#6213) * rename Pipe to TrainablePipe * split functionality between Pipe and TrainablePipe * remove unnecessary methods from certain components * cleanup * hasattr(component, "pipe") should be sufficient again * remove serialization and vocab/cfg from Pipe * unify _ensure_examples and validate_examples * small fixes * hasattr checks for self.cfg and self.vocab * make is_resizable and is_trainable properties * serialize strings.json instead of vocab * fix KB IO + tests * fix typos * more typos * _added_strings as a set * few more tests specifically for _added_strings field * bump to 3.0.0a36	2020-10-08 21:33:49 +02:00
svlandeg	3e2e1fd323	cleanup	2020-10-08 10:37:32 +02:00
svlandeg	eaf5c265cb	set_kb method for entity_linker	2020-10-08 10:34:01 +02:00
svlandeg	6b8bdb2d39	add init_config to nlp.create_pipe	2020-10-07 14:58:16 +02:00
svlandeg	ff9ac39c88	read entity_ruler patterns with srsly.read_jsonl.v1	2020-10-05 22:50:14 +02:00
svlandeg	4e3ace4b8c	is_trainable method	2020-10-05 17:43:42 +02:00
svlandeg	65abd77779	add finish_update to Pipe	2020-10-05 16:23:33 +02:00
Ines Montani	8f018e47f8	Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe	2020-10-04 14:43:45 +02:00
Ines Montani	ae15c9de79	Raise error from caught KeyError to preserve traceback	2020-10-03 11:43:56 +02:00
Stanislav Schmidt	3589a64d44	Change type of texts argument in pipe to iterable (#6186) * Change type of texts argument in pipe to iterable * Add contributor agreement	2020-10-02 21:00:11 +02:00
svlandeg	02247cccaf	Merge remote-tracking branch 'upstream/develop' into feature/small-fixes	2020-10-02 20:48:11 +02:00
svlandeg	6787e56315	print debugging warning before raising error if model not properly initialized	2020-10-01 09:21:00 +02:00
Ines Montani	fa47f87924	Tidy up and auto-format	2020-09-29 21:39:28 +02:00
Ines Montani	798040bc1d	Fix language detection	2020-09-29 21:08:13 +02:00
Matthew Honnibal	8ce9f44433	Merge branch 'feature/prepare' of https://github.com/explosion/spaCy into feature/prepare	2020-09-29 16:57:38 +02:00
Matthew Honnibal	ca72608059	Fix language	2020-09-29 16:48:33 +02:00
Ines Montani	fd594cfb9b	Tighten up format	2020-09-29 16:47:55 +02:00
Ines Montani	63d1598137	Simplify config use in Language.initialize	2020-09-29 16:05:48 +02:00
Ines Montani	adca08a12f	Pass nlp forward	2020-09-29 12:21:52 +02:00
Ines Montani	42f0e4c946	Clean up	2020-09-29 12:14:08 +02:00
Matthew Honnibal	9c8b2524fe	Upd initialize args	2020-09-29 12:08:37 +02:00
Matthew Honnibal	f2d1b7feb5	Clean up sgd	2020-09-29 12:00:08 +02:00
Ines Montani	78396d137f	Integrate initialize settings	2020-09-29 11:57:08 +02:00
Ines Montani	dec984a9c1	Update Language.initialize and support components/tokenizer settings	2020-09-29 11:52:45 +02:00
Matthew Honnibal	5276db6f3f	Remove 'device' argument from Language, clean up 'sgd' arg	2020-09-29 11:42:19 +02:00
Ines Montani	ff9a63bfbd	begin_training -> initialize	2020-09-28 21:35:09 +02:00
Ines Montani	658fad428a	Fix base schema integration	2020-09-27 22:50:36 +02:00
Ines Montani	7e938ed63e	Update config resolution to use new Thinc	2020-09-27 22:21:31 +02:00
Ines Montani	4bbe41f017	Fix combined scores and update test	2020-09-24 10:42:47 +02:00
Ines Montani	ae51f580c1	Fix handling of score_weights	2020-09-24 10:27:33 +02:00
Ines Montani	1114219ae3	Tidy up and auto-format	2020-09-21 10:59:07 +02:00
Adriane Boyd	47080fba98	Minor renaming / refactoring * Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message * Make `Vocab.lookups` a property	2020-09-18 19:43:19 +02:00
Adriane Boyd	eed4b785f5	Load vocab lookups tables at beginning of training Similar to how vectors are handled, move the vocab lookups to be loaded at the start of training rather than when the vocab is initialized, since the vocab doesn't have access to the full config when it's created. The option moves from `nlp.load_vocab_data` to `training.lookups`. Typically these tables will come from `spacy-lookups-data`, but any `Lookups` object can be provided. The loading from `spacy-lookups-data` is now strict, so configs for each language should specify the exact tables required. This also makes it easier to control whether the larger clusters and probs tables are included. To load `lexeme_norm` from `spacy-lookups-data`: ``` [training.lookups] @misc = "spacy.LoadLookupsData.v1" lang = ${nlp.lang} tables = ["lexeme_norm"] ```	2020-09-18 15:59:16 +02:00
Matthew Honnibal	c776594ab1	Fix	2020-09-16 18:15:14 +02:00
Matthew Honnibal	4a573d18b3	Add comment	2020-09-16 17:51:29 +02:00
Matthew Honnibal	d31afc8334	Fix Language.link_components when model is None	2020-09-16 17:49:48 +02:00
Ines Montani	aaf01689a1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-09-15 14:24:42 +02:00
Ines Montani	91a6637f74	Remove extra pipe config values before merging	2020-09-15 14:24:17 +02:00
Ines Montani	d3d7f92f05	Fix lang check and error handling in Language.from_config	2020-09-15 14:24:06 +02:00
Ines Montani	253ba5ef14	Raise for bad Vocab values	2020-09-15 13:25:34 +02:00
Ines Montani	7dfc4bc062	Allow overriding meta from spacy.blank	2020-09-15 11:12:12 +02:00
Matthew Honnibal	b693d2d224	Fix speed report in table	2020-09-13 17:39:31 +02:00
Ines Montani	febb99916d	Tidy up and auto-format [ci skip]	2020-09-13 10:55:36 +02:00
Sofie Van Landeghem	e92e850c72	Raise if empty examples (#6052) * raise error if no valid Example objects were found during initialization * fix max_length parameter * remove commit from other branch Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-09-12 21:01:53 +02:00
svlandeg	711166a75a	prevent overwriting score_weights	2020-09-11 15:12:05 +02:00
Sofie Van Landeghem	8e7557656f	Renaming gold & annotation_setter (#6042) * version bump to 3.0.0a16 * rename "gold" folder to "training" * rename 'annotation_setter' to 'set_extra_annotations' * formatting	2020-09-09 10:31:03 +02:00
Sofie Van Landeghem	60f22e1800	Pipe API (#6034 ) * ensure Language passes on valid examples for initialization * fix tagger model initialization * check for valid get_examples across components * assume labels were added before begin_training * fix senter initialization * fix morphologizer initialization * use methods to check arguments * test textcat init, requires thinc>=8.0.0a31 * fix tok2vec init * fix entity linker init * use islice * fix simple NER * cleanup debug model * fix assert statements * fix tests * throw error when adding a label if the output layer can't be resized anymore * fix test * add failing test for simple_ner * UX improvements * morphologizer UX * assume begin_training gets a representative set and processes the labels * remove assumptions for output of untrained NER model * restore test for original purpose	2020-09-08 22:44:25 +02:00
Ines Montani	f06eed800e	Merge pull request #6029 from explosion/master-tmp	2020-09-04 15:11:55 +02:00
Ines Montani	f9550b4493	Fix components in meta.json and website [ci skip]	2020-09-04 14:42:12 +02:00
Ines Montani	90043a6f9b	Tidy up and auto-format	2020-09-04 13:42:33 +02:00
Ines Montani	ba600f91c5	Tidy up imports	2020-09-04 13:15:44 +02:00
Ines Montani	ab1bb421ed	Update docs links in codebase	2020-09-04 12:58:50 +02:00
Ines Montani	896caf45e3	Merge pull request #6023 from explosion/ux/model-terminology-consistency [ci skip]	2020-09-03 17:13:44 +02:00
Ines Montani	b5a0657fd6	"model" terminology consistency in docs	2020-09-03 13:13:03 +02:00
Matthew Honnibal	ef0d0630a4	Let Language.use_params work with falsey inputs The Language.use_params method was failing if you passed in None, which meant we had to use awkward conditionals for the parameter averaging. This solves the problem.	2020-09-03 12:51:04 +02:00
Matthew Honnibal	046c38bd26	Remove 'cleanup' of strings (#6007) A long time ago we went to some trouble to try to clean up "unused" strings, to avoid the `StringStore` growing in long-running processes. This never really worked reliably, and I think it was a really wrong approach. It's much better to let the user reload the `nlp` object as necessary, now that the string encoding is stable (in v1, the string IDs were sequential integers, making reloading the NLP object really annoying.) The extra book-keeping does make some performance difference, and the feature is unused, so it's past time we killed it.	2020-09-01 16:12:15 +02:00
Ines Montani	45f46a5c85	Merge pull request #5993 from explosion/feature/disabled-components	2020-08-29 15:58:41 +02:00
Ines Montani	34146750d4	Use frozen list with custom errors We don't want to break backwards compatibility too much but we also want to provide the best possible UX	2020-08-29 15:20:11 +02:00
Ines Montani	6520d1a1df	Work around set order in Language.disabled	2020-08-29 12:58:22 +02:00
Ines Montani	e0b4984aa4	Make deprecated disable_pipes call into select_pipes	2020-08-29 12:08:46 +02:00
Ines Montani	15d73f4dc3	Make user-facing Language.disabled return list More consistent with all the other properties	2020-08-29 12:08:33 +02:00
Ines Montani	0687d7148e	Rename user-facing API	2020-08-28 21:04:02 +02:00
Ines Montani	6a999c9303	Remove outdated component attr check	2020-08-28 20:59:19 +02:00
Ines Montani	10da74382f	Raise if disabled components are removed before DisabledPipes.restore	2020-08-28 20:35:26 +02:00
Ines Montani	1e0363290e	Remove todos and update docstrings	2020-08-28 20:34:46 +02:00
Ines Montani	cad988da7f	Allow component decorators to re-run with same function	2020-08-28 16:27:22 +02:00
Ines Montani	3ce5be4b76	Allow loaded but disabled components	2020-08-28 15:20:14 +02:00
Ines Montani	daac8ebacd	Don't interpolate config on Language deserialization	2020-08-27 16:44:36 +02:00
Matthew Honnibal	4f82a02b70	Remove 'fix_pretrained_vectors_name' hack	2020-08-25 14:37:45 +02:00
Ines Montani	0e7f99da58	Fix handling of optional [pretraining] block (#5954) * Fix handling of optional [pretraining] block * Remove pretraining from default config * Fix test * Add schema option for empty pretrain block	2020-08-24 15:56:03 +02:00
Matthew Honnibal	160a855246	Format	2020-08-23 21:15:12 +02:00
Sofie Van Landeghem	358cbb21e3	Define candidate generator in EL config (#5876 ) * candidate generator as separate part of EL config * update comment * ent instead of str as input for candidate generation * Span instead of str: correct type indication * fix types * unit test to create new candidate generator * fix replace_pipe argument passing * move error message, general cleanup * add vocab back to KB constructor * provide KB as callable from Vocab arg * rename to kb_loader, fix KB serialization as part of the EL pipe * fix typo * reformatting * cleanup * fix comment * fix wrongly duplicated code from merge conflict * rename dump to to_disk * from_disk instead of load_bulk * update test after recent removal of set_morphology in tagger * remove old doc	2020-08-18 16:10:36 +02:00
Ines Montani	3ae5e02f4f	Update docs, types and API consistency	2020-08-17 16:45:24 +02:00
Ines Montani	67cc39af7f	Update Thinc and include section order	2020-08-14 14:06:22 +02:00
Ines Montani	88b0a96801	Update for new Thinc and adjust config	2020-08-13 17:38:30 +02:00
Ines Montani	950832f087	Tidy up pipes (#5906 ) * Tidy up pipes * Fix init, defaults and raise custom errors * Update docs * Update docs [ci skip] * Apply suggestions from code review Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Tidy up error handling and validation, fix consistency * Simplify get_examples check * Remove unused import [ci skip] Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-11 23:29:31 +02:00
Ines Montani	3eaeb73342	Tidy up and auto-format	2020-08-09 22:36:23 +02:00
Adriane Boyd	e962784531	Add Lemmatizer and simplify related components (#5848 ) * Add Lemmatizer and simplify related components * Add `Lemmatizer` pipe with `lookup` and `rule` modes using the `Lookups` tables. * Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma) * Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer, or morph rules) * Remove lemmatizer from `Vocab` * Adjust many many tests Differences: * No default lookup lemmas * No special treatment of TAG in `from_array` and similar required * Easier to modify labels in a `Tagger` * No extra strings added from morphology / tag map * Fix test * Initial fix for Lemmatizer config/serialization * Adjust init test to be more generic * Adjust init test to force empty Lookups * Add simple cache to rule-based lemmatizer * Convert language-specific lemmatizers Convert language-specific lemmatizers to component lemmatizers. Remove previous lemmatizer class. * Fix French and Polish lemmatizers * Remove outdated UPOS conversions * Update Russian lemmatizer init in tests * Add minimal init/run tests for custom lemmatizers * Add option to overwrite existing lemmas * Update mode setting, lookup loading, and caching * Make `mode` an immutable property * Only enforce strict `load_lookups` for known supported modes * Move caching into individual `_lemmatize` methods * Implement strict when lang is not found in lookups * Fix tables/lookups in make_lemmatizer * Reallow provided lookups and allow for stricter checks * Add lookups asset to all Lemmatizer pipe tests * Rename lookups in lemmatizer init test * Clean up merge * Refactor lookup table loading * Add helper from `load_lemmatizer_lookups` that loads required and optional lookups tables based on settings provided by a config. 
Additional slight refactor of lookups: * Add `Lookups.set_table` to set a table from a provided `Table` * Reorder class definitions to be able to specify type as `Table` * Move registry assets into test methods * Refactor lookups tables config Use class methods within `Lemmatizer` to provide the config for particular modes and to load the lookups from a config. * Add pipe and score to lemmatizer * Simplify Tagger.score * Add missing import * Clean up imports and auto-format * Remove unused kwarg * Tidy up and auto-format * Update docstrings for Lemmatizer Update docstrings for Lemmatizer. Additionally modify `is_base_form` API to take `Token` instead of individual features. * Update docstrings * Remove tag map values from Tagger.add_label * Update API docs * Fix relative link in Lemmatizer API docs	2020-08-07 15:27:13 +02:00
Ines Montani	56c17973aa	Use "raise ... from" in custom errors for better tracebacks	2020-08-05 23:53:21 +02:00
Ines Montani	5cc0d89fad	Simplify config overrides in CLI and deserialization (#5880)	2020-08-05 23:35:09 +02:00
Ines Montani	823e533dc1	Add config callbacks for modifying nlp object before and after init (#5866) * WIP: Concept for modifying nlp object before and after init * Make callbacks return nlp object Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com> * Raise if callbacks don't return correct type * Rename, update types, add after_pipeline_creation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-08-05 19:47:54 +02:00
Sofie Van Landeghem	b88c5c701a	Bugfix in nlp.replace_pipe (#5875) * bugfix and unit test * merge two conditions	2020-08-05 09:30:58 +02:00
Ines Montani	b795f02fbd	Allow adding pipeline components from source model (#5857) * Allow adding pipeline components from source model * Config: name -> component * Improve error messages * Fix error and test * Add frozen components and exclude logic * Remove exclude from Language.evaluate * Init sourced components with current vocab * Fix error codes	2020-08-04 23:39:19 +02:00
Ines Montani	b40f44419b	Simplify pipe analysis - remove unused code - don't print by default - integrate attrs info into analysis output	2020-08-01 13:40:06 +02:00
Ines Montani	30a76fcf6f	Integrate and simplify pipe analysis	2020-07-31 18:34:35 +02:00
Ines Montani	2d955fbf98	Fix linting [ci skip]	2020-07-31 17:05:28 +02:00
Adriane Boyd	9b509aa87f	Move Language.evaluate scorer config to new arg Move `Language.evaluate` scorer config from `component_cfg` to separate argument `scorer_cfg`.	2020-07-31 11:05:16 +02:00
Ines Montani	7a21775cd0	Merge pull request #5834 from explosion/feature/vectors	2020-07-29 18:49:26 +02:00
Ines Montani	b0f57a0cac	Update docs and consistency	2020-07-29 15:14:07 +02:00
Matthew Honnibal	97d3651574	Fix stray link_vectors_to_models call	2020-07-29 14:01:13 +02:00
Matthew Honnibal	1784c95827	Clean up link_vectors_to_models unused stuff	2020-07-29 14:01:11 +02:00
Matthew Honnibal	7299419fe4	Don't load vectors in Language.from_config	2020-07-29 14:00:12 +02:00
Matthew Honnibal	30dd96c540	Load vectors in Language.from_config	2020-07-29 14:00:12 +02:00
Ines Montani	e257e66ab9	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-29 11:36:45 +02:00
Adriane Boyd	0cddb0dbe9	Move timing into Language.evaluate (#5836) Move timing into `Language.evaluate` so that only the processing is timed, not processing + scoring. `Language.evaluate` returns `scores["speed"]` as words per second, which should be identical to how the speed was added to the scores previously. Also add the speed to the evaluate CLI output.	2020-07-29 11:02:31 +02:00
Ines Montani	e5d9eaf79c	Tidy up docstrings and arguments	2020-07-28 23:12:42 +02:00
Ines Montani	2748249217	Re-add meta["pipeline"] for now	2020-07-28 16:14:23 +02:00
Ines Montani	ae4d8a6ffd	Update docstrings, docs and pipe consistency	2020-07-28 13:37:31 +02:00
Ines Montani	0094cb0d04	Remove scores list from config and document	2020-07-28 11:22:24 +02:00
Ines Montani	894e20c466	Merge branch 'develop' into feature/component-scores	2020-07-27 18:14:39 +02:00

594 Commits
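
Two commits in this list (ae15c9de79 and 56c17973aa) switch to `raise ... from` so that a custom error keeps the traceback of the exception that caused it. The following is a minimal sketch of that Python idiom only, not the spaCy code itself: `PipelineError` and `get_component` are hypothetical names invented for illustration.

```python
class PipelineError(ValueError):
    """Hypothetical custom error, standing in for a library-specific error class."""


def get_component(components, name):
    """Look up a component by name (hypothetical helper, not a spaCy function)."""
    try:
        return components[name]
    except KeyError as err:
        # "from err" stores the original KeyError on __cause__, so the
        # traceback shows the low-level lookup failure as well as this error.
        raise PipelineError(f"Can't find component '{name}'") from err


try:
    get_component({"tagger": object()}, "parser")
except PipelineError as exc:
    caught = exc  # kept so the chained cause can be inspected
```

Here `caught.__cause__` is the original `KeyError`, which is what lets the printed traceback include the line where the lookup failed.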