spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 04:08:09 +03:00

Author	SHA1	Message	Date
Ines Montani	4a0a692875	Add missing lex_attr_getters (resolves #5806 )	2020-07-25 12:55:18 +02:00
Adriane Boyd	2bcceb80c4	Refactor the Scorer to improve flexibility (#5731 ) * Refactor the Scorer to improve flexibility Refactor the `Scorer` to improve flexibility for arbitrary pipeline components. * Individual pipeline components provide their own `evaluate` methods that score a list of `Example`s and return a dictionary of scores * `Scorer` is initialized either: * with a provided pipeline containing components to be scored * with a default pipeline containing the built-in statistical components (senter, tagger, morphologizer, parser, ner) * `Scorer.score` evaluates a list of `Example`s and returns a dictionary of scores referring to the scores provided by the components in the pipeline Significant differences: * `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc` and the new `morph_acc`, `pos_acc`, and `lemma_acc` * Scoring is no longer cumulative: `Scorer.score` scores a list of examples rather than a single example and does not retain any state about previously scored examples * PRF values in the returned scores are no longer multiplied by 100 * Add kwargs to Morphologizer.evaluate * Create generalized scoring methods in Scorer * Generalized static scoring methods are added to `Scorer` * Methods require an attribute (either on Token or Doc) that is used to key the returned scores Naming differences: * `uas`, `las`, and `las_per_type` in the scores dict are renamed to `dep_uas`, `dep_las`, and `dep_las_per_type` Scoring differences: * `Doc.sents` is now scored as spans rather than on sentence-initial token positions so that `Doc.sents` and `Doc.ents` can be scored with the same method (this lowers scores since a single incorrect sentence start results in two incorrect spans) * Simplify / extend hasattr check for eval method * Add hasattr check to tokenizer scoring * Simplify to hasattr check for component scoring * Reset Example alignment if docs are set Reset the Example alignment if either doc is set in case the tokenization has changed. * Add PRF tokenization scoring for tokens as spans Add PRF scores for tokens as character spans. The scores are: * token_acc: # correct tokens / # gold tokens * token_p/r/f: PRF for (token.idx, token.idx + len(token)) * Add docstring to Scorer.score_tokenization * Rename component.evaluate() to component.score() * Update Scorer API docs * Update scoring for positive_label in textcat * Fix TextCategorizer.score kwargs * Update Language.evaluate docs * Update score names in default config	2020-07-25 12:53:02 +02:00
Ines Montani	c003d26b94	Tidy up	2020-07-25 12:21:37 +02:00
Ines Montani	a063a82c40	Tidy up __init__.py	2020-07-25 12:14:37 +02:00
Ines Montani	8d9d28eb8b	Re-add setting for vocab data and tidy up	2020-07-25 12:14:28 +02:00
Ines Montani	b9aaa4e457	Improve vocab data integration and warning	2020-07-25 11:51:30 +02:00
Ines Montani	38f6ea7a78	Simplify language data and revert detailed configs	2020-07-24 14:50:26 +02:00
Adriane Boyd	656574a01a	Update Japanese tests (#5807 ) * Update POS tests to reflect current behavior (it is not entirely clear whether the AUX/VERB mapping is indeed the desired behavior?) * Switch to `from_config` initialization in subtoken test	2020-07-24 12:45:14 +02:00
Adriane Boyd	fdb8815ef5	Minor refactor for Morphology and MorphAnalysis (#5804 ) * `MorphAnalysis.get` returns only the field values * Move `_normalize_props` inside `Morphology` as `Morphology.normalize_attrs` and simplify * Simplify POS field detection/conversion * Convert all non-POS features to strings * `Morphology` returns an empty string for a missing morph to align with the FEATS string returned for an existing morph * Remove unused `list_to_feats`	2020-07-24 09:28:06 +02:00
Ines Montani	87737a5a60	Tidy up	2020-07-23 00:16:23 +02:00
Ines Montani	a624ae0675	Remove POS, TAG and LEMMA from tokenizer exceptions	2020-07-22 23:09:01 +02:00
Ines Montani	14d7d46f89	Merge branch 'develop' into feature/language-data-config	2020-07-22 22:18:53 +02:00
Ines Montani	b507f61629	Tidy up and move noun_chunks, token_match, url_match	2020-07-22 22:18:46 +02:00
Ines Montani	7fc4dadd22	Fix typo	2020-07-22 20:27:22 +02:00
Ines Montani	d0c6d1efc5	@factories -> factory (#5801 )	2020-07-22 17:29:31 +02:00
Ines Montani	2c5bb59909	Use consistent --gpu-id option name	2020-07-22 16:53:41 +02:00
Ines Montani	0fcd352179	Remove omit_extra_lookups	2020-07-22 16:01:17 +02:00
Ines Montani	945f795a3e	WIP: move more language data to config	2020-07-22 15:59:37 +02:00
Adriane Boyd	b84fd70cc3	Fix exceptions for Morphology.__reduce__ (#5792 ) Pickle exceptions in the MORPH_RULES format instead of the internal format after the recent `Morphology.__init__` changes.	2020-07-22 15:00:25 +02:00
Ines Montani	43b960c01b	Refactor pipeline components, config and language data (#5759 ) * Update with WIP * Update with WIP * Update with pipeline serialization * Update types and pipe factories * Add deep merge, tidy up and add tests * Fix pipe creation from config * Don't validate default configs on load * Update spacy/language.py Co-authored-by: Ines Montani <ines@ines.io> * Adjust factory/component meta error * Clean up factory args and remove defaults * Add test for failing empty dict defaults * Update pipeline handling and methods * provide KB as registry function instead of as object * small change in test to make functionality more clear * update example script for EL configuration * Fix typo * Simplify test * Simplify test * splitting pipes.pyx into separate files * moving default configs to each component file * fix batch_size type * removing default values from component constructors where possible (TODO: test 4725) * skip instead of xfail * Add test for config -> nlp with multiple instances * pipeline.pipes -> pipeline.pipe * Tidy up, document, remove kwargs * small cleanup/generalization for Tok2VecListener * use DEFAULT_UPSTREAM field * revert to avoid circular imports * Fix tests * Replace deprecated arg * Make model dirs require config * fix pickling of keyword-only arguments in constructor * WIP: clean up and integrate full config * Add helper to handle function args more reliably Now also includes keyword-only args * Fix config composition and serialization * Improve config debugging and add visual diff * Remove unused defaults and fix type * Remove pipeline and factories from meta * Update spacy/default_config.cfg Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update spacy/default_config.cfg * small UX edits * avoid printing stack trace for debug CLI commands * Add support for language-specific factories * specify the section of the config which holds the model to debug * WIP: add Language.from_config * Update with language data refactor WIP * Auto-format * Add backwards-compat handling for Language.factories * Update morphologizer.pyx * Fix morphologizer * Update and simplify lemmatizers * Fix Japanese tests * Port over tagger changes * Fix Chinese and tests * Update to latest Thinc * WIP: xfail first Russian lemmatizer test * Fix component-specific overrides * fix nO for output layers in debug_model * Fix default value * Fix tests and don't pass objects in config * Fix deep merging * Fix lemma lookup data registry Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed) * Add types * Add Vocab.from_config * Fix typo * Fix tests * Make config copying more elegant * Fix pipe analysis * Fix lemmatizers and is_base_form * WIP: move language defaults to config * Fix morphology type * Fix vocab * Remove comment * Update to latest Thinc * Add morph rules to config * Tidy up * Remove set_morphology option from tagger factory * Hack use_gpu * Move [pipeline] to top-level block and make [nlp.pipeline] list Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them * Fix use_gpu and resume in CLI * Auto-format * Remove resume from config * Fix formatting and error * [pipeline] -> [components] * Fix types * Fix tagger test: requires set_morphology? Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-22 13:42:59 +02:00
Ines Montani	311d0bde29	Merge pull request #5788 from explosion/master-tmp	2020-07-20 15:39:24 +02:00
Ines Montani	d51db72e46	Remove Python 2 marker	2020-07-20 15:01:36 +02:00
Ines Montani	644074b954	Merge branch 'develop' into master-tmp	2020-07-20 14:58:04 +02:00
Sofie Van Landeghem	c9da9605f7	Test suite clean up (#5781 ) * step_through tests: skip instead of xfail * test_empty_doc should be fixed with new Thinc version * remove outdated test (there are other misaligned tests now) * xfail reason * fix test according to french exceptions * clarified some skipped tests * skip ukranian test instead of xfail * skip instead of xfail * skip + reason instead of xfail * removed obsolete tests referring to removed "set_frozen" functionality * fix test 999 * remove unused AlignmentError * remove xfail where possible, skip otherwise * increment thinc release for empty_doc test	2020-07-20 14:49:54 +02:00
Sofie Van Landeghem	1b2ec94382	Hyphen infix (#5770 ) * infix split on hyphen when preceded by number * clean up * skip ukranian test instead of xfail	2020-07-20 14:48:51 +02:00
Adriane Boyd	ec819fc311	Provide default output for evaluate in CLI (#5784 )	2020-07-20 14:42:46 +02:00
Ines Montani	cb65b36839	Merge pull request #5767 from adrianeboyd/feature/remove-tag-maps	2020-07-19 15:15:34 +02:00
Ines Montani	fa3c98f8b3	Update train.py	2020-07-19 13:40:47 +02:00
Ines Montani	796f6c52d1	Merge branch 'develop' into pr/5767	2020-07-19 13:37:46 +02:00
Adriane Boyd	39ebcd9ec9	Refactor Chinese tokenizer configuration (#5736 ) * Refactor Chinese tokenizer configuration Refactor `ChineseTokenizer` configuration so that it uses a single `segmenter` setting to choose between character segmentation, jieba, and pkuseg. * replace `use_jieba`, `use_pkuseg`, `require_pkuseg` with the setting `segmenter` with the supported values: `char`, `jieba`, `pkuseg` * make the default segmenter plain character segmentation `char` (no additional libraries required) * Fix Chinese serialization test to use char default * Warn if attempting to customize other segmenter Add a warning if `Chinese.pkuseg_update_user_dict` is called when another segmenter is selected.	2020-07-19 13:34:37 +02:00
Adriane Boyd	9ee1c54f40	Improve tag map initialization and updating (#5764 ) * Improve tag map initialization and updating Generalize tag map initialization and updating so that the tag map can be loaded correctly prior to loading a `Corpus` with `spacy debug-data` and `spacy train`. * normalize provided tag map as necessary * use the same method for initializing and updating the tag map * Replace rather than update tag map Replace rather than update tag map when loading a custom tag map. Updating the tag map is problematic due to the sorted list of tag names and the fact that the tag map will contain lingering/unwanted tags from the default tag map. * Update CLI scripts * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 13:13:57 +02:00
Adriane Boyd	597bcc629e	Improve tag map initialization and updating (#5768 ) * Improve tag map initialization and updating Generalize tag map initialization and updating so that a provided tag map can be loaded correctly in the CLI. * normalize provided tag map as necessary * use the same method for initializing and overwriting the tag map * Reinitialize cache after loading new tag map Reinitialize the cache with the right size after loading a new tag map.	2020-07-19 11:13:39 +02:00
Adriane Boyd	b81a89f0a9	Update morphologizer (#5766 ) * update `Morphologizer.begin_training` for use with `Example` * make init and begin_training more consistent * add `Morphology.normalize_features` to normalize outside of `Morphology.add` * make sure `get_loss` doesn't create unknown labels when the POS and morph alignments differ	2020-07-19 11:10:51 +02:00
Adriane Boyd	cd5af72c9a	Update pkuseg version (#5774 ) * Update pkuseg version in Chinese tokenizer warnings * Update pkuseg version in `Makefile` * Remove warning about python3.8 wheels in docs	2020-07-19 11:09:49 +02:00
Adriane Boyd	50db3f0cdb	Serialize morph rules with tagger Serialize `morph_rules` with the tagger alongside the `tag_map`. Use `Morphology.load_tag_map` and `Morphology.load_morph_exceptions` to load these settings rather than reinitializing the morphology each time they are changed.	2020-07-17 08:22:21 +02:00
Adriane Boyd	d106cf66dd	Update Morphology to load exceptions as MORPH_RULES Update `Morphology` to load exceptions in `Morphology.__init__` and `Morphology.load_morph_exceptions` from the format used in `MORPH_RULES` rather than the internal format with tuple keys. * Rename to `Morphology.exc` to `Morphology._exc` for internal use with tuple keys * Add `Morphology.exc` as a property that converts the internal `_exc` back to `MORPH_RULES` format, primarily for serialization	2020-07-16 21:16:49 +02:00
Adriane Boyd	d83e3c44c5	Remove corpus-specific morph rules * Remove corpus-specific morph rules * Add options similar to tag maps to provide them in the `train` and `debug-data` CLIs	2020-07-15 19:44:18 +02:00
Adriane Boyd	2f981d5af1	Remove corpus-specific tag maps Remove corpus-specific tag maps from the language data for languages without custom tokenizers. For languages with custom word segmenters that also provide tags (Japanese and Korean), the tag maps for the custom tokenizers are kept as the default. The default tag maps for languages without custom tokenizers are now the default tag map from `lang/tag_map/py`, UPOS -> UPOS.	2020-07-15 15:58:29 +02:00
Adriane Boyd	5228920e2f	Clarify warning W030 for misaligned BILUO tags (#5761 )	2020-07-14 14:09:48 +02:00
Adriane Boyd	a7a7e0d2a6	Add morph to morphology in Doc.from_array (#5762 ) * Add morph to morphology in Doc.from_array Add morphological analyses to morphology table in `Doc.from_array`. * Use separate vocab in DocBin roundtrip test	2020-07-14 14:07:35 +02:00
Ines Montani	872938ec76	Merge pull request #5747 from explosion/feature/refactor-config-args	2020-07-14 00:00:22 +02:00
Sofie Van Landeghem	6f3bb6f77c	fix doc.to_utf8 on GPU (#5757 )	2020-07-13 23:05:33 +02:00
Adriane Boyd	7ea2cc7650	Set version to 2.3.2 (#5756 )	2020-07-13 14:55:56 +02:00
Mark Neumann	27a1cd3c63	fix meta serialization in train (#5751 ) Co-authored-by: Mark Neumann <markng@allenai.org>	2020-07-12 22:06:46 +02:00
Ines Montani	ed55143c0d	Merge branch 'develop' into compat/remove-object-subclass	2020-07-12 14:28:52 +02:00
Ines Montani	7906ddd56c	Fix test	2020-07-12 14:28:34 +02:00
Ines Montani	5f6f4ff594	Remove object subclassing	2020-07-12 14:03:23 +02:00
Ines Montani	c96535e338	Update command docstrings and docs	2020-07-12 13:53:49 +02:00
Ines Montani	0ab483037c	Make debug commands subcommands of spacy debug Also handle backwards-compatibility so the old commands don't break	2020-07-12 13:53:41 +02:00
Ines Montani	8a67ddd6f1	Remove unused import	2020-07-12 12:32:24 +02:00
Ines Montani	d1d7fd5f5d	Don't use file paths in schemas It should be possible to validate top-level config with file paths that don't exist	2020-07-12 12:32:08 +02:00
Ines Montani	79346853aa	Add debug-config command	2020-07-12 12:31:17 +02:00
Ines Montani	3a8632c3fb	Hide command from public --help for now Not sure we want this to be officially documented yet?	2020-07-11 19:21:22 +02:00
Ines Montani	5e683d03fe	Allow extra args on pretrain and debug_data	2020-07-11 19:17:59 +02:00
Ines Montani	b7111da1d7	Update config and commands	2020-07-11 13:03:53 +02:00
Ines Montani	f99ce7fbfb	Make validation errors more elegant	2020-07-10 23:34:17 +02:00
Ines Montani	7b5717cac3	Merge branch 'develop' into feature/refactor-config-args	2020-07-10 22:50:07 +02:00
Matthew Honnibal	743f7fb73a	Set version to v3.0.0a4	2020-07-10 22:40:12 +02:00
Matthew Honnibal	b68216e263	Explicitly delete objects after parser.update to free GPU memory (#5748 ) * Try explicitly deleting objects * Refactor parser model backprop slightly * Free parser data explicitly after rehearse and update	2020-07-10 22:35:20 +02:00
Ines Montani	fb6f6f584e	Replace - with _ in command names We might as well be nice if user accidentally types --training.use-gpu	2020-07-10 22:34:22 +02:00
Ines Montani	bfa8e11ffa	Update and auto-format	2020-07-10 20:52:00 +02:00
Ines Montani	0389c34b81	Merge branch 'develop' into feature/refactor-config-args	2020-07-10 20:51:52 +02:00
Ines Montani	931250e1f5	Fix pipeline component schema	2020-07-10 20:32:53 +02:00
Ines Montani	9fe1fa88ad	Fix typo	2020-07-10 20:32:37 +02:00
Ines Montani	defe1e7213	Pretty-print config validation errors	2020-07-10 20:01:20 +02:00
Sofie Van Landeghem	de6a32315c	debug-model script (#5749 ) * adding debug-model to print the internals for debugging purposes * expend debug-model script with 4 stages: before, init, train, predict * avoid enforcing to have a seed in the train script * small fixes	2020-07-10 19:47:53 +02:00
Ines Montani	a3667394b4	Integrate with latest Thinc and config overrides	2020-07-10 19:47:05 +02:00
Ines Montani	5cfc3edcaa	Update CLI tests	2020-07-10 18:21:01 +02:00
Ines Montani	3583ea84d8	Update arg parsing	2020-07-10 18:20:52 +02:00
Ines Montani	73332ddb67	Update CLI commands to use one shared util file	2020-07-10 17:57:40 +02:00
Ines Montani	240e0a62ca	Update with WIP	2020-07-10 13:31:27 +02:00
Ines Montani	a60562f208	Update project CLI hashes, directories, skipping (#5741 ) * Update project CLI hashes, directories, skipping * Improve clone success message * Remove unused context args * Move project-specific utils to project utils The hashing/checksum functions may not end up being general-purpose functions and are more designed for the projects, so they shouldn't live in spacy.util * Improve run help and add workflows * Add note re: directory checksum speed * Fix cloning from subdirectories and output messages * Remove hard-coded dirs	2020-07-09 23:51:18 +02:00
Adriane Boyd	0a62098c5f	Fix lemmatizer is_base_form for python2.7 (#5734 ) * Fix lemmatizer init args for python2.7 * Move English is_base_form to a class method * Skip test pickling PhraseMatcher for python2	2020-07-09 22:11:24 +02:00
Adriane Boyd	923affd091	Remove is_base_form from French lemmatizer (#5733 ) Remove English-specific is_base_form from French lemmatizer.	2020-07-09 22:11:13 +02:00
Matthew Honnibal	552d1ad226	Hack at tests	2020-07-09 20:25:51 +02:00
Matthew Honnibal	eb064c59cd	Try to fix textcat test	2020-07-09 20:24:53 +02:00
Ines Montani	018319a640	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-09 19:44:41 +02:00
Ines Montani	05e182e421	Update CLI args and docstrings	2020-07-09 19:44:28 +02:00
Sofie Van Landeghem	dd207a28be	cleanup components API (#5726 ) * add keyword separator for update functions and drop unused "state" * few more Example tests and various small fixes * consistently return losses after update call * eliminate unused tensors field across pipe components * fix name * fix arg name	2020-07-09 19:43:39 +02:00
Adriane Boyd	ac4297ee39	Minor refactor to conversion of output docs (#5718 ) Minor refactor of conversion of docs to output format to avoid duplicate conversion steps.	2020-07-09 19:42:32 +02:00
Sofie Van Landeghem	c1ea55307b	Fixing reproducible training (#5735 ) * Add initial reproducibility tests * failing test for default_text_classifier (WIP) * track trouble to underlying tok2vec layer * add regression test for Issue 5551 * tests go green with https://github.com/explosion/thinc/pull/359 * update test * adding fixed seeds to HashEmbed layers, seems to fix the reproducility issue Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-09 19:39:31 +02:00
Matthew Honnibal	1827f22f56	Set version to v3.0.0a3	2020-07-09 19:38:04 +02:00
Matthw Honnibal	7010f1a2be	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-09 19:34:11 +02:00
Matthw Honnibal	77af0a6bb4	Offer option of padding-sensitive batching	2020-07-09 14:50:20 +02:00
Matthw Honnibal	3a7f275c02	Add extra batch util	2020-07-09 14:38:41 +02:00
Matthw Honnibal	eb0798c421	Add __len__ method for Example	2020-07-09 14:38:26 +02:00
Ines Montani	8f9552d9e7	Refactor project CLI (#5732 ) * Make project command a submodule * Update with WIP * Add helper for joining commands * Update docstrins, formatting and types * Update assets and add support for copying local files * Fix type * Update success messages	2020-07-09 01:42:51 +02:00
Adriane Boyd	ad15499b3b	Fix get_loss for values outside of labels in senter (#5730 ) * Fix get_loss for None alignments in senter When converting the `sent_start` values back to `SentenceRecognizer` labels, handle `None` alignments. * Handle SENT_START as -1 Handle SENT_START as -1 (or -1 converted to uint64) by treating any values other than 1 the same as 0 in `SentenceRecognizer.get_loss`.	2020-07-09 01:41:58 +02:00
Matthw Honnibal	1b20ffac38	batch_by_words by default	2020-07-08 21:37:06 +02:00
Matthw Honnibal	93e50da46a	Remove auto 'set_annotation' in training to address GPU memory	2020-07-08 21:36:51 +02:00
Matthw Honnibal	fb8a5967c1	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-08 15:27:50 +02:00
Ines Montani	0a3d41bb1d	Deprecate model shortcuts and simplify download (#5722 )	2020-07-08 14:00:07 +02:00
Adriane Boyd	c9f0f75778	Update get_loss for senter and morphologizer (#5724 ) * Update get_loss for senter Update `SentenceRecognizer.get_loss` to keep it similar to `Tagger`. * Update get_loss for morphologizer Update `Morphologizer.get_loss` to keep it similar to `Tagger`.	2020-07-08 13:59:28 +02:00
Matthw Honnibal	ca989f4cc4	Improve cutting logic in parser	2020-07-08 11:27:54 +02:00
Matthw Honnibal	42e1109def	Support option to not batch by number of words	2020-07-08 11:26:54 +02:00
Ines Montani	8cb7f9ccff	Improve assets and DVC handling (#5719 ) * Improve assets and DVC handling * Remove outdated comment [ci skip]	2020-07-07 20:51:50 +02:00
Sofie Van Landeghem	a39a110c4e	Few more Example unit tests (#5720 ) * small fixes in Example, UX * add gold tests for aligned_spans and get_aligned_parse * sentencizer unnecessary	2020-07-07 18:46:00 +02:00
Matthw Honnibal	433dc3c9c9	Simplify PrecomputableAffine slightly	2020-07-07 17:22:47 +02:00
Matthw Honnibal	a4164f67ca	Don't normalize gradients	2020-07-07 17:21:58 +02:00
Matthw Honnibal	8177f25b6c	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-07 17:21:10 +02:00
Ines Montani	fa00a85828	Merge pull request #5715 from explosion/chore/tidy-regression-tests	2020-07-07 11:22:07 +02:00
Matthw Honnibal	d1fd3438c3	Add dropout to parser hidden layer	2020-07-07 01:38:15 +02:00
Matthw Honnibal	f25761e513	Dont randomize cuts in parser	2020-07-06 17:51:25 +02:00
Matthw Honnibal	709fc5e4ad	Clarify dropout and seed in Tok2Vec	2020-07-06 17:50:21 +02:00
Matthew Honnibal	19d42f42de	Set version to v3.0.0a2	2020-07-06 17:43:12 +02:00
Matthew Honnibal	cc477be952	Improve gold-standard alignment (#5711 ) * Remove previous alignment * Implement better alignment, using ragged data structure * Use pytokenizations for alignment * Fixes * Fixes * Fix overlapping entities in alignment * Fix align split_sents * Update test * Commit align.py * Try to appease setuptools * Fix flake8 * use realistic entities for testing * Update tests for better alignment * Improve alignment heuristic Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2020-07-06 17:39:31 +02:00
Mike Izbicki	7a2ca00794	fix bug in Korean language, resulting in 100x speedup by reducing overhead of mecab (#5701 ) * speed up Korean nlp 100x by stopping mecab from reloading on each doc * add contributor agreement * rename variables to improve code readability	2020-07-06 17:03:33 +02:00
Ines Montani	b6deef80f8	Fix class to pickling works as expected	2020-07-06 16:43:45 +02:00
Ines Montani	fa261d09e8	Add alternative CLI option	2020-07-06 15:57:38 +02:00
Adriane Boyd	c67fc6aa5b	Make `docs_to_json` backwards-compatible with v2 (#5714 ) * In `spacy convert -t json` output the JSON docs wrapped in a list * Add back token-level `ner` alongside the doc-level `entities`	2020-07-06 14:15:00 +02:00
Ines Montani	5b7b2a498d	Tidy up and merge regression tests	2020-07-06 14:05:59 +02:00
Ines Montani	412dbb1f38	Remove dead and/or deprecated code (#5710 ) * Remove dead and/or deprecated code * Remove n_threads Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-07-06 13:06:25 +02:00
Sofie Van Landeghem	fcbf899b08	Feature/example only (#5707 ) * remove _convert_examples * fix test_gold, raise TypeError if tuples are used instead of Example's * throwing proper errors when the wrong type of objects are passed * fix deprectated format in tests * fix deprectated format in parser tests * fix tests for NEL, morph, senter, tagger, textcat * update regression tests with new Example format * use make_doc * more fixes to nlp.update calls * few more small fixes for rehearse and evaluate * only import ml_datasets if really necessary	2020-07-06 13:02:36 +02:00
graue70	9860b8399e	Fix typo in test function docstring (#5696 )	2020-07-05 15:49:06 +02:00
Matthew Honnibal	3e78e82a83	Experimental character-based pretraining (#5700 ) * Use cosine loss in Cloze multitask * Fix char_embed for gpu * Call resume_training for base model in train CLI * Fix bilstm_depth default in pretrain command * Implement character-based pretraining objective * Use chars loss in ClozeMultitask * Add method to decode predicted characters * Fix number characters * Rescale gradients for mlm * Fix char embed+vectors in ml * Fix pipes * Fix pretrain args * Move get_characters_loss * Fix import * Fix import * Mention characters loss option in pretrain * Remove broken 'self attention' option in pretrain * Revert "Remove broken 'self attention' option in pretrain" This reverts commit `56b820f6af`. * Document 'characters' objective of pretrain	2020-07-05 15:48:39 +02:00
Matthw Honnibal	3f6f087113	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-04 23:52:12 +02:00
Matthw Honnibal	5642507823	Fix has_unknown_spaces in Doc.copy	2020-07-04 23:52:02 +02:00
Matthw Honnibal	8870a6ded7	Specify seeds in HashEmbed	2020-07-04 23:51:49 +02:00
Ines Montani	37c3bb35e2	Auto-format	2020-07-04 16:25:34 +02:00
Ines Montani	abd173937f	Auto-format and update URL	2020-07-04 14:23:44 +02:00
Ines Montani	99aff16d60	Make argument shortcut consistent	2020-07-04 14:23:32 +02:00
Matthew Honnibal	2bd1bf81f1	Refactor pretrain and support character-based objective for v3 (#5706 ) * Start adding character-based stuff * Start adding character-based objective * Start adding character-based stuff * Start adding character-based objective * Remove outdated comment * Update pretraining models * Add/fix character-based multi-task models * Refactor pretrain and support character-based objective * Update pretrain config * Remove unused * Fix flake8 errors * Clean up imports * Format * Format * Update Thinc version * Raise error if vectors objective but no vectors	2020-07-03 17:57:28 +02:00
Ines Montani	84fb3a3fb3	Auto-format and fix tuple	2020-07-03 15:20:10 +02:00
Adriane Boyd	86d13a9fb8	Set version to 2.3.1 (#5705 )	2020-07-03 13:38:41 +02:00
Matthew Honnibal	e1b3e8ee11	Set version to v3.0.0a1	2020-07-03 13:21:08 +02:00
Matthew Honnibal	a902b5f217	Record whether Doc objects are built from known spacing (#5697 ) * Tell convert CLI to store user data for Doc * Remove assert * Add has_unknwon_spaces flag on Doc * Do not tokenize docs with unknown spaces in Corpus * Handle conversion of unknown spaces in Example * Fixes * Fixes * Draft has_known_spaces support in DocBin * Add test for serialize has_unknown_spaces * Fix DocBin serialization when has_unknown_spaces * Use serialization in test	2020-07-03 12:58:16 +02:00
Adriane Boyd	abad56db7d	Add conllu2docs converter (#5704 ) Add conllu2docs converter adapted from conllu2json converter	2020-07-03 12:54:32 +02:00
Jan Jessewitsch	e4dcac4a4b	Merging multiple docs into one (#5032 ) * Add static method to Doc to allow merging of multiple docs. * Add error description for the error that occurs if docs with different vocabs (from different languages) are merged in Doc.from_docs(). * Add test for Doc.from_docs() implementation. * Fix using numpy's concatenate in Doc.from_docs. * Replace typing's type annotations in from_docs. * Simply remove type annotations in from_docs. * Add documentation for Doc.from_docs to api. * Simplify from_docs, its test and the api doc for codebase consistency. * Fix merging of Doc objects that end with whitespaces (Achieved by simply not setting the SPACY attribute on whitespace tokens). Remove two unnecessary imports of attributes. * Add merging of user data from Doc objects in from_docs. Add user data test case to corresponding test. Add applicable warning messages. * Fix incorrect setting of tokens idx by using concatenated spaces (again). Add test case to corresponding test. * Add MORPH to attrs * Update warnings calls * Remove out-dated error from merge * Rename space_delimiter to ensure_whitespace Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-07-03 11:32:42 +02:00
Sofie Van Landeghem	41b65fd0f8	fix to pretrain script (#5699 ) * fix to pretrain script * remove unnecessary import	2020-07-02 21:48:01 +02:00
Adriane Boyd	a723fa02a1	DocBin: add version number, missing attributes and strings (#5685 ) * Add version number to DocBin Add a version number to DocBin for future use. * Add POS to all attributes in DocBin * Add morph string to strings in DocBin * Update DocBin API * Add string for ENT_KB_ID in DocBin	2020-07-02 17:41:50 +02:00
Adriane Boyd	a77c4c3465	Add strings and ENT_KB_ID to Doc serialization (#5691 ) * Add strings for all writeable Token attributes to `Doc.to/from_bytes()`. * Add ENT_KB_ID to default attributes.	2020-07-02 17:11:57 +02:00
Adriane Boyd	971826a96d	Include git commit in package and model meta (#5694 ) * Include git commit in package and model meta * Rewrite to read file in setup * Fix file handle	2020-07-02 17:10:27 +02:00
Ines Montani	d36632553a	Merge pull request #5688 from explosion/remove-deprecated Remove deprecated methods: Doc.print_tree, Doc.merge, Span.merge	2020-07-02 15:10:30 +02:00
Ines Montani	8a5b9a6d5f	Merge pull request #5693 from svlandeg/bugfix/nel-v3	2020-07-02 14:45:46 +02:00
Ines Montani	ee8a830248	Merge pull request #5687 from svlandeg/bugfix/init-model Fixing init_model	2020-07-02 14:10:28 +02:00
svlandeg	04ed4d60a8	raise error when links are not aligned to tokens	2020-07-02 13:57:35 +02:00
svlandeg	f503817623	fix parsing entity links in new gold format	2020-07-02 13:48:11 +02:00
Ines Montani	60c2695131	Remove deprecated methods	2020-07-01 22:33:39 +02:00
Ines Montani	fe4cfd0632	Start updating website for v3 [ci skip]	2020-07-01 21:26:39 +02:00
svlandeg	a30bc77415	bugfixing prune_vectors and vectors_loc	2020-07-01 21:00:47 +02:00
Matthw Honnibal	94a0cf46fd	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-01 18:45:45 +02:00
Matthw Honnibal	6a0a27e5c2	Fix max_steps	2020-07-01 18:08:14 +02:00
Ines Montani	8d90e44d74	Fix title	2020-07-01 15:38:01 +02:00
Ines Montani	8fb574900a	Update parent package and version	2020-07-01 15:35:23 +02:00
Matthew Honnibal	0ada186dda	Set version to v3.0.0.dev14	2020-07-01 15:31:04 +02:00
Matthw Honnibal	cb51bb637b	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-01 15:17:27 +02:00
Matthw Honnibal	7734cbc34d	Set batch size in begin_training	2020-07-01 15:16:59 +02:00
Matthw Honnibal	1f7709e9a6	Improve max length check in corpus	2020-07-01 15:16:43 +02:00
Matthw Honnibal	2fa56484b2	Fix eval batch size	2020-07-01 15:16:25 +02:00
Matthw Honnibal	c5d12d1a22	Allow batch size to be set for evaluation in spacy train	2020-07-01 15:04:36 +02:00
Matthw Honnibal	f5532757a3	Filter out 0-length examples in Corpus	2020-07-01 15:02:37 +02:00
Ines Montani	bc87ba97e0	Merge pull request #5681 from svlandeg/bugfix/exec-cwd	2020-07-01 14:13:19 +02:00
Matthw Honnibal	52338a07bb	Set version to v3.0.0.dev13	2020-07-01 02:49:17 +02:00
Matthw Honnibal	fa6d473390	Fix parser maxout_pieces=1	2020-07-01 02:48:58 +02:00
Matthw Honnibal	35af5819e0	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-07-01 01:03:39 +02:00
Matthw Honnibal	0d6edf5397	Clean up debug code in transition_system	2020-07-01 01:03:20 +02:00
Matthw Honnibal	a1b6add4c8	Fix parser gold cutting and gradient normalization	2020-07-01 01:02:58 +02:00
Matthw Honnibal	8c5a88e777	Fix per-epoch shuffling	2020-07-01 01:02:35 +02:00
svlandeg	a7d547c65e	small fix	2020-06-30 21:56:17 +02:00
svlandeg	8eca7e995e	add try-except to git commands to get an informative warning	2020-06-30 21:53:40 +02:00
Ines Montani	b032943c34	Fix funny printing again	2020-06-30 21:33:41 +02:00
Matthw Honnibal	d525552979	Fix efficiency of parser backprop_nonlinearity	2020-06-30 21:22:54 +02:00
Ines Montani	d64644d9d1	Adjust auto-formatting	2020-06-30 20:36:30 +02:00
Ines Montani	6da3500728	Fix command substitution	2020-06-30 20:35:51 +02:00
svlandeg	e7aff9c5fc	bugfix exec usage in dvc.yaml	2020-06-30 18:51:20 +02:00
svlandeg	60f97bc519	add custom warning when run_command fails	2020-06-30 17:28:43 +02:00
svlandeg	39953c7c60	fix print_run_help with new arg order	2020-06-30 17:28:09 +02:00
svlandeg	cd632d8ec2	move folder for exec argument one up	2020-06-30 17:19:36 +02:00
svlandeg	1ae6fa2554	move subcommand one place up as project_dir has default	2020-06-30 16:04:53 +02:00
svlandeg	a46b76f188	use current working dir as default throughout	2020-06-30 15:39:24 +02:00
svlandeg	b228111925	fix funny printing	2020-06-30 14:54:45 +02:00
Ines Montani	8e20505970	Resolve within working_dir context manager	2020-06-30 13:29:45 +02:00
Ines Montani	72175b5c60	Update project command	2020-06-30 13:17:26 +02:00
Ines Montani	c5e31acb06	Make working_dir yield absolute cwd path	2020-06-30 13:17:14 +02:00
Ines Montani	3aca404735	Make run_command take string and list	2020-06-30 13:17:00 +02:00
Ines Montani	7584fdafec	Fix typo	2020-06-30 12:59:13 +02:00
svlandeg	140c4896a0	split_command util function	2020-06-30 12:54:15 +02:00
Matthw Honnibal	57e09747dc	Improve efficiency of get_oracle_sequences	2020-06-30 11:50:48 +02:00
Matthw Honnibal	233945bfe0	Fix init for padding	2020-06-30 11:50:24 +02:00
svlandeg	d23be563eb	remove redundant setting of no_args_is_help	2020-06-30 11:23:35 +02:00
svlandeg	b311ce982f	Merge remote-tracking branch 'upstream/develop' into fix/small-edits # Conflicts: # spacy/cli/project.py	2020-06-30 11:17:31 +02:00
svlandeg	7e4cbda89a	fix project_init for relative path	2020-06-30 11:09:53 +02:00
Matthw Honnibal	85ed5730a2	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-06-30 01:14:16 +02:00
Ines Montani	e8033df81e	Also handle python3 and pip3	2020-06-29 20:30:42 +02:00
Ines Montani	c874dde66c	Show help on "spacy project"	2020-06-29 20:11:34 +02:00
Ines Montani	1d2c646e57	Fix init and remove .dvc/plots	2020-06-29 20:07:21 +02:00
Matthw Honnibal	5bed6fc431	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-06-29 19:55:24 +02:00
svlandeg	1176783310	fix one more shlex.split	2020-06-29 18:37:42 +02:00
svlandeg	ff233d5743	print details on error msg (e.g. PermissionError on specific file)	2020-06-29 18:22:33 +02:00
svlandeg	894b8e7ff6	throw warning (instead of crashing) when temp dir can't be cleaned	2020-06-29 18:16:39 +02:00
svlandeg	efe7eb71f2	create subfolder in working dir	2020-06-29 17:46:08 +02:00
svlandeg	3487214ba1	fix shlex.split for non-posix	2020-06-29 17:45:47 +02:00
Ines Montani	126050f259	Improve asset fetching Get all paths first and run dvc add once so it only shows one progress bar and one combined git command (if repo is git repo)	2020-06-29 16:55:24 +02:00
Ines Montani	7c08713baa	Improve error messages	2020-06-29 16:54:47 +02:00
Ines Montani	24664efa23	Import project_run_all function	2020-06-29 16:54:19 +02:00
svlandeg	f8dddeda27	print help msg when just calling 'project' without args	2020-06-29 16:38:15 +02:00
svlandeg	bf43ebbf61	fix typo's	2020-06-29 16:32:25 +02:00
Matthew Honnibal	67928036f2	Set version to v3.0.0.dev12	2020-06-29 14:45:43 +02:00
Matthew Honnibal	2d715451a2	Revert "Convert custom user_data to token extension format for Japanese tokenizer (#5652)" (#5665) This reverts commit `1dd38191ec`.	2020-06-29 14:34:15 +02:00
Sofie Van Landeghem	8d3c0306e1	refactor fixes (#5664) * fixes in ud_train, UX for morphs * update pyproject with new version of thinc * fixes in debug_data script * cleanup of old unused error messages * remove obsolete TempErrors * move error messages to errors.py * add ENT_KB_ID to default DocBin serialization * few fixes to simple_ner * fix tags	2020-06-29 14:33:00 +02:00

7451 Commits