spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-10 10:41:14 +03:00

Author	SHA1	Message	Date
Ines Montani	b507f61629	Tidy up and move noun_chunks, token_match, url_match	2020-07-22 22:18:46 +02:00
Ines Montani	7fc4dadd22	Fix typo	2020-07-22 20:27:22 +02:00
Ines Montani	0fcd352179	Remove omit_extra_lookups	2020-07-22 16:01:17 +02:00
Ines Montani	945f795a3e	WIP: move more language data to config	2020-07-22 15:59:37 +02:00
Adriane Boyd	b84fd70cc3	Fix exceptions for Morphology.__reduce__ (#5792) Pickle exceptions in the MORPH_RULES format instead of the internal format after the recent `Morphology.__init__` changes.	2020-07-22 15:00:25 +02:00
Ines Montani	43b960c01b	Refactor pipeline components, config and language data (#5759) * Update with WIP * Update with WIP * Update with pipeline serialization * Update types and pipe factories * Add deep merge, tidy up and add tests * Fix pipe creation from config * Don't validate default configs on load * Update spacy/language.py Co-authored-by: Ines Montani <ines@ines.io> * Adjust factory/component meta error * Clean up factory args and remove defaults * Add test for failing empty dict defaults * Update pipeline handling and methods * provide KB as registry function instead of as object * small change in test to make functionality more clear * update example script for EL configuration * Fix typo * Simplify test * Simplify test * splitting pipes.pyx into separate files * moving default configs to each component file * fix batch_size type * removing default values from component constructors where possible (TODO: test 4725) * skip instead of xfail * Add test for config -> nlp with multiple instances * pipeline.pipes -> pipeline.pipe * Tidy up, document, remove kwargs * small cleanup/generalization for Tok2VecListener * use DEFAULT_UPSTREAM field * revert to avoid circular imports * Fix tests * Replace deprecated arg * Make model dirs require config * fix pickling of keyword-only arguments in constructor * WIP: clean up and integrate full config * Add helper to handle function args more reliably Now also includes keyword-only args * Fix config composition and serialization * Improve config debugging and add visual diff * Remove unused defaults and fix type * Remove pipeline and factories from meta * Update spacy/default_config.cfg Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/default_config.cfg * small UX edits * avoid printing stack trace for debug CLI commands * Add support for language-specific factories * specify the section of the config which holds the model to debug * WIP: add Language.from_config * Update with language data refactor WIP * Auto-format * Add backwards-compat handling for Language.factories * Update morphologizer.pyx * Fix morphologizer * Update and simplify lemmatizers * Fix Japanese tests * Port over tagger changes * Fix Chinese and tests * Update to latest Thinc * WIP: xfail first Russian lemmatizer test * Fix component-specific overrides * fix nO for output layers in debug_model * Fix default value * Fix tests and don't pass objects in config * Fix deep merging * Fix lemma lookup data registry Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed) * Add types * Add Vocab.from_config * Fix typo * Fix tests * Make config copying more elegant * Fix pipe analysis * Fix lemmatizers and is_base_form * WIP: move language defaults to config * Fix morphology type * Fix vocab * Remove comment * Update to latest Thinc * Add morph rules to config * Tidy up * Remove set_morphology option from tagger factory * Hack use_gpu * Move [pipeline] to top-level block and make [nlp.pipeline] list Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them * Fix use_gpu and resume in CLI * Auto-format * Remove resume from config * Fix formatting and error * [pipeline] -> [components] * Fix types * Fix tagger test: requires set_morphology? Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-22 13:42:59 +02:00
Ines Montani	311d0bde29	Merge pull request #5788 from explosion/master-tmp	2020-07-20 15:39:24 +02:00
Ines Montani	d51db72e46	Remove Python 2 marker	2020-07-20 15:01:36 +02:00
Ines Montani	e6967ca98a	Revert cupy-cuda version update	2020-07-20 14:59:41 +02:00
Ines Montani	644074b954	Merge branch 'develop' into master-tmp	2020-07-20 14:58:04 +02:00
Sofie Van Landeghem	c9da9605f7	Test suite clean up (#5781) * step_through tests: skip instead of xfail * test_empty_doc should be fixed with new Thinc version * remove outdated test (there are other misaligned tests now) * xfail reason * fix test according to French exceptions * clarified some skipped tests * skip Ukrainian test instead of xfail * skip instead of xfail * skip + reason instead of xfail * removed obsolete tests referring to removed "set_frozen" functionality * fix test 999 * remove unused AlignmentError * remove xfail where possible, skip otherwise * increment thinc release for empty_doc test	2020-07-20 14:49:54 +02:00
Sofie Van Landeghem	1b2ec94382	Hyphen infix (#5770) * infix split on hyphen when preceded by number * clean up * skip Ukrainian test instead of xfail	2020-07-20 14:48:51 +02:00
Adriane Boyd	ec819fc311	Provide default output for evaluate in CLI (#5784)	2020-07-20 14:42:46 +02:00
Ines Montani	cb65b36839	Merge pull request #5767 from adrianeboyd/feature/remove-tag-maps	2020-07-19 15:15:34 +02:00
Ines Montani	fa3c98f8b3	Update train.py	2020-07-19 13:40:47 +02:00
Ines Montani	796f6c52d1	Merge branch 'develop' into pr/5767	2020-07-19 13:37:46 +02:00
Alec Chapman	a8978ca285	Add VA COVID-19 NLP project to spaCy Universe (#5777) * Update universe.json Add cov-bsv to "resources" * Update universe.json * add contributor agreement	2020-07-19 13:35:31 +02:00
Adriane Boyd	39ebcd9ec9	Refactor Chinese tokenizer configuration (#5736) * Refactor Chinese tokenizer configuration Refactor `ChineseTokenizer` configuration so that it uses a single `segmenter` setting to choose between character segmentation, jieba, and pkuseg. * replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting `segmenter` with the supported values: `char`, `jieba`, `pkuseg` * make the default segmenter plain character segmentation `char` (no additional libraries required) * Fix Chinese serialization test to use char default * Warn if attempting to customize other segmenter Add a warning if `Chinese.pkuseg_update_user_dict` is called when another segmenter is selected.	2020-07-19 13:34:37 +02:00
Adriane Boyd	9ee1c54f40	Improve tag map initialization and updating (#5764) * Improve tag map initialization and updating Generalize tag map initialization and updating so that the tag map can be loaded correctly prior to loading a `Corpus` with `spacy debug-data` and `spacy train`. * normalize provided tag map as necessary * use the same method for initializing and updating the tag map * Replace rather than update tag map Replace rather than update tag map when loading a custom tag map. Updating the tag map is problematic due to the sorted list of tag names and the fact that the tag map will contain lingering/unwanted tags from the default tag map. * Update CLI scripts * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 13:13:57 +02:00
Adriane Boyd	597bcc629e	Improve tag map initialization and updating (#5768) * Improve tag map initialization and updating Generalize tag map initialization and updating so that a provided tag map can be loaded correctly in the CLI. * normalize provided tag map as necessary * use the same method for initializing and overwriting the tag map * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 11:13:39 +02:00
Adriane Boyd	b81a89f0a9	Update morphologizer (#5766) * update `Morphologizer.begin_training` for use with `Example` * make init and begin_training more consistent * add `Morphology.normalize_features` to normalize outside of `Morphology.add` * make sure `get_loss` doesn't create unknown labels when the POS and morph alignments differ	2020-07-19 11:10:51 +02:00
Sofie Van Landeghem	38b59d728d	Upgrade of UD eval script (#5776) * new morph feature format * add new languages with tokenization * update with all new pretrained models	2020-07-19 11:10:31 +02:00
Adriane Boyd	7e14272096	Lower upper pin for cupy to 8.0.0 (#5773)	2020-07-19 11:10:11 +02:00
Adriane Boyd	cd5af72c9a	Update pkuseg version (#5774) * Update pkuseg version in Chinese tokenizer warnings * Update pkuseg version in `Makefile` * Remove warning about python3.8 wheels in docs	2020-07-19 11:09:49 +02:00
Ines Montani	68fade8f76	Add Plausible [ci skip]	2020-07-19 00:02:29 +02:00
Ines Montani	6f4e4aceb3	Add Plausible [ci skip]	2020-07-18 23:50:29 +02:00
Adriane Boyd	50db3f0cdb	Serialize morph rules with tagger Serialize `morph_rules` with the tagger alongside the `tag_map`. Use `Morphology.load_tag_map` and `Morphology.load_morph_exceptions` to load these settings rather than reinitializing the morphology each time they are changed.	2020-07-17 08:22:21 +02:00
Adriane Boyd	d106cf66dd	Update Morphology to load exceptions as MORPH_RULES Update `Morphology` to load exceptions in `Morphology.__init__` and `Morphology.load_morph_exceptions` from the format used in `MORPH_RULES` rather than the internal format with tuple keys. * Rename `Morphology.exc` to `Morphology._exc` for internal use with tuple keys * Add `Morphology.exc` as a property that converts the internal `_exc` back to `MORPH_RULES` format, primarily for serialization	2020-07-16 21:16:49 +02:00
Adriane Boyd	d83e3c44c5	Remove corpus-specific morph rules * Remove corpus-specific morph rules * Add options similar to tag maps to provide them in the `train` and `debug-data` CLIs	2020-07-15 19:44:18 +02:00
Adriane Boyd	2f981d5af1	Remove corpus-specific tag maps Remove corpus-specific tag maps from the language data for languages without custom tokenizers. For languages with custom word segmenters that also provide tags (Japanese and Korean), the tag maps for the custom tokenizers are kept as the default. The default tag maps for languages without custom tokenizers are now the default tag map from `lang/tag_map.py`, UPOS -> UPOS.	2020-07-15 15:58:29 +02:00
Adriane Boyd	5228920e2f	Clarify warning W030 for misaligned BILUO tags (#5761)	2020-07-14 14:09:48 +02:00
Adriane Boyd	a7a7e0d2a6	Add morph to morphology in Doc.from_array (#5762) * Add morph to morphology in Doc.from_array Add morphological analyses to morphology table in `Doc.from_array`. * Use separate vocab in DocBin roundtrip test	2020-07-14 14:07:35 +02:00
Ines Montani	872938ec76	Merge pull request #5747 from explosion/feature/refactor-config-args	2020-07-14 00:00:22 +02:00
Sofie Van Landeghem	6f3bb6f77c	fix doc.to_utf8 on GPU (#5757)	2020-07-13 23:05:33 +02:00
Adriane Boyd	7ea2cc7650	Set version to 2.3.2 (#5756)	2020-07-13 14:55:56 +02:00
Mark Neumann	27a1cd3c63	fix meta serialization in train (#5751) Co-authored-by: Mark Neumann <markng@allenai.org>	2020-07-12 22:06:46 +02:00
Ines Montani	dcfa910e4e	Merge pull request #5752 from explosion/compat/remove-object-subclass	2020-07-12 16:37:04 +02:00
Ines Montani	ed55143c0d	Merge branch 'develop' into compat/remove-object-subclass	2020-07-12 14:28:52 +02:00
Ines Montani	7906ddd56c	Fix test	2020-07-12 14:28:34 +02:00
Ines Montani	5f6f4ff594	Remove object subclassing	2020-07-12 14:03:23 +02:00
Ines Montani	c96535e338	Update command docstrings and docs	2020-07-12 13:53:49 +02:00
Ines Montani	0ab483037c	Make debug commands subcommands of spacy debug Also handle backwards-compatibility so the old commands don't break	2020-07-12 13:53:41 +02:00
Ines Montani	3f948b9c74	Update docs	2020-07-12 12:32:28 +02:00
Ines Montani	8a67ddd6f1	Remove unused import	2020-07-12 12:32:24 +02:00
Ines Montani	d1d7fd5f5d	Don't use file paths in schemas It should be possible to validate top-level config with file paths that don't exist	2020-07-12 12:32:08 +02:00
Ines Montani	79346853aa	Add debug-config command	2020-07-12 12:31:17 +02:00
Ines Montani	3a8632c3fb	Hide command from public --help for now Not sure we want this to be officially documented yet?	2020-07-11 19:21:22 +02:00
Ines Montani	5e683d03fe	Allow extra args on pretrain and debug_data	2020-07-11 19:17:59 +02:00
Ines Montani	70abcca60e	Update Thinc pin	2020-07-11 17:02:54 +02:00
Ines Montani	b7111da1d7	Update config and commands	2020-07-11 13:03:53 +02:00


12286 Commits