spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 10:26:35 +03:00

Author	SHA1	Message	Date
Ines Montani	01c394eb23	Update to latest Typer and remove hacks	2020-06-25 12:27:19 +02:00
Ines Montani	82a03ee18e	Replace python with sys.executable	2020-06-25 12:26:53 +02:00
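The commit message above does not say where `python` was replaced, but the usual motivation for such a change is launching child processes with the interpreter that is already running. The sketch below illustrates that idea only; it is not the spaCy code, and the helper name `run_python_snippet` is hypothetical.

```python
import subprocess
import sys


def run_python_snippet(code: str) -> str:
    """Run `code` in a child process using the current interpreter (hypothetical helper)."""
    # sys.executable is the path of the interpreter running this script, so the
    # child process uses the same environment and installed packages. A bare
    # "python" is resolved through PATH and may point at a different install.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


output = run_python_snippet("print(1 + 1)")
```

`capture_output` requires Python 3.7 or later; on older versions `stdout=subprocess.PIPE` would be needed instead.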
Adriane Boyd	6fe6e761de	Skip vocab in component config overrides (#5624 )	2020-06-23 23:21:11 +02:00
Adriane Boyd	d94e961f14	Fix polarity of Token.is_oov and Lexeme.is_oov (#5634 ) Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the lexeme does not have a vector.	2020-06-23 13:29:51 +02:00
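The polarity described in the commit above can be stated in a few lines. This is a toy sketch under the assumption that vectors live in a plain dict; the function `is_oov` here is a stand-in and not the `Token.is_oov` or `Lexeme.is_oov` implementation.

```python
def is_oov(word, vectors):
    """Return True when `word` has no vector (toy stand-in, not spaCy's attribute)."""
    # Out-of-vocabulary means "no vector available", so the check is a negation.
    return word not in vectors


# Illustrative vector table; the values are made up.
vectors = {"apple": [0.1, 0.3], "pear": [0.2, 0.1]}
```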
Ines Montani	8131a65dee	Update __init__.py	2020-06-22 16:09:09 +02:00
Ines Montani	2ad7a02400	Merge branch 'develop' into feature/project-cli	2020-06-22 15:33:11 +02:00
Ines Montani	83b4aa05c9	Merge pull request #5626 from explosion/feature/typer	2020-06-22 06:29:03 -07:00
Ines Montani	0ee6d7a4d1	Remove project stuff from this branch	2020-06-22 14:54:38 +02:00
Ines Montani	a6b76440b7	Update project CLI	2020-06-22 14:53:31 +02:00
Hiroshi Matsuda	150a39ccca	Japanese model: add user_dict entries and small refactor (#5573 ) * user_dict fields: adding inflections, reading_forms, sub_tokens deleting: unidic_tags improve code readability around the token alignment procedure * add test cases, replace fugashi with sudachipy in conftest * move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer * tag is space -> both surface and tag are spaces * consider len(text)==0	2020-06-22 14:32:25 +02:00
Ines Montani	3f2f5f9cb3	Remove ml_datasets from install dependencies	2020-06-22 12:14:51 +02:00
Rameshh	c34420794a	Add Nepali Language (#5622 ) * added support for nepali lang * added examples and test files * added spacy contributor agreement	2020-06-22 10:25:46 +02:00
Karen Hambardzumyan	66a4834e56	Some changes for Armenian (#5616 ) * Fixing numericals * We need a Armenian question sign to make the sentence a question	2020-06-22 08:50:34 +02:00
Ines Montani	dc5d535659	Tidy up info	2020-06-22 01:17:11 +02:00
Ines Montani	189ed56777	Fix and simplify info	2020-06-22 01:07:48 +02:00
Ines Montani	fca3907d4e	Add correct uppercase variants for boolean flags	2020-06-22 00:57:28 +02:00
Ines Montani	79dd824906	Tidy up	2020-06-22 00:45:40 +02:00
Ines Montani	1e5b4d8524	Fix DVC check	2020-06-22 00:30:05 +02:00
Ines Montani	5ba1df5e78	Update project CLI	2020-06-22 00:15:06 +02:00
Ines Montani	ef5f548fb0	Tidy up and auto-format	2020-06-21 22:38:04 +02:00
Ines Montani	f77e0bc028	Merge branch 'develop' into master-tmp	2020-06-21 22:34:15 +02:00
Ines Montani	40bb918a4c	Remove unicode declarations and tidy up	2020-06-21 22:34:10 +02:00
Ines Montani	275bab62df	Refactor CLI	2020-06-21 21:35:01 +02:00
Ines Montani	c12713a8be	Port CLI to Typer and add project stubs	2020-06-21 13:44:00 +02:00
svlandeg	689600e17d	add additional test back in (it works now)	2020-06-20 23:23:57 +02:00
svlandeg	2f6062a8a4	add line that got removed from EntityLinker	2020-06-20 23:14:45 +02:00
svlandeg	12dc8ab208	remove redundant code from master in EntityLinker	2020-06-20 23:07:42 +02:00
svlandeg	6179774278	fix test_build_dependencies by ignoring new libs	2020-06-20 22:49:37 +02:00
svlandeg	256d4c27c8	fix tagger begin_training being called without examples	2020-06-20 22:38:00 +02:00
svlandeg	5cb812e0ab	fix NER warn empty lookups (cf PR #5588 )	2020-06-20 22:04:18 +02:00
svlandeg	c9242e9bf4	fix entity linker (cf PR #5548 )	2020-06-20 21:47:23 +02:00
svlandeg	dc069e90b3	fix token.morph_ for v.3 (cf PR #5517 )	2020-06-20 21:13:11 +02:00
Ines Montani	988d2a4eda	Add --code-path option to train CLI (#5618 )	2020-06-20 18:43:12 +02:00
Ines Montani	5424b70e51	Remove v2 test	2020-06-20 16:18:53 +02:00
Ines Montani	63c22969f4	Update test_issue5230.py	2020-06-20 16:17:48 +02:00
Ines Montani	296b5d633b	Remove references to Python 2 / is_python2	2020-06-20 16:11:13 +02:00
Ines Montani	0cdb631e6c	Fix merge errors	2020-06-20 16:02:42 +02:00
Ines Montani	52728d8fa3	Merge branch 'develop' into master-tmp	2020-06-20 15:52:00 +02:00
Ines Montani	f91e9e8c84	Remove F841 [ci skip]	2020-06-20 14:47:17 +02:00
Ines Montani	8283df80e9	Tidy up and auto-format	2020-06-20 14:15:04 +02:00
Marat M. Yavrumyan	8120b641cc	Update lex_attrs.py (#5608 )	2020-06-19 20:00:34 +02:00
Ines Montani	e9d3e177f0	Merge branch 'master' into v2.3.x	2020-06-16 16:31:38 +02:00
Matthew Honnibal	7ff447c5a0	Set version to v2.3.0	2020-06-15 18:22:25 +02:00
Adriane Boyd	0d8405aafa	Updates to docstrings (#5589 )	2020-06-15 14:58:36 +02:00
Adriane Boyd	e867e9fa8f	Fix and add warnings related to spacy-lookups-data (#5588 ) * Fix warning message for lemmatization tables * Add a warning when the `lexeme_norm` table is empty. (Given the relatively lang-specific loading for `Lookups`, it seemed like too much overhead to dynamically extract the list of languages, so for now it's hard-coded.)	2020-06-15 14:58:29 +02:00
Arvind Srinivasan	f698007907	Added Tamil Example Sentences (#5583 ) * Added Examples for Tamil Sentences #### Description This PR add example sentences for the Tamil language which were missing as per issue #1107 #### Type of Change This is an enhancement. * Accepting spaCy Contributor Agreement * Signed on my behalf as an individual	2020-06-15 14:58:21 +02:00
Adriane Boyd	c94f7d0e75	Updates to docstrings (#5589 )	2020-06-15 14:56:51 +02:00
Adriane Boyd	c482f20778	Fix and add warnings related to spacy-lookups-data (#5588 ) * Fix warning message for lemmatization tables * Add a warning when the `lexeme_norm` table is empty. (Given the relatively lang-specific loading for `Lookups`, it seemed like too much overhead to dynamically extract the list of languages, so for now it's hard-coded.)	2020-06-15 14:56:04 +02:00
Arvind Srinivasan	aa5b40fa64	Added Tamil Example Sentences (#5583 ) * Added Examples for Tamil Sentences #### Description This PR add example sentences for the Tamil language which were missing as per issue #1107 #### Type of Change This is an enhancement. * Accepting spaCy Contributor Agreement * Signed on my behalf as an individual	2020-06-13 15:56:26 +02:00
theudas	3f5e2f9d99	Added Parameter to NEL to take n sentences into account (#5548 ) * added setting for neighbour sentence in NEL * added spaCy contributor agreement * added multi sentence also for training * made the try-except block smaller	2020-06-12 15:15:03 +02:00
adrianeboyd	4724fa4cf4	Expand Japanese requirements warning (#5572 ) Include explicit install instructions in Japanese requirements warning.	2020-06-12 15:14:55 +02:00
adrianeboyd	44967a3f9c	Update pytest conf for sudachipy with Japanese (#5574 )	2020-06-12 15:14:47 +02:00
Matthew Honnibal	a1c5b694be	Small fixes to train defaults	2020-06-12 02:22:13 +02:00
theudas	fa46e0bef2	Added Parameter to NEL to take n sentences into account (#5548 ) * added setting for neighbour sentence in NEL * added spaCy contributor agreement * added multi sentence also for training * made the try-except block smaller	2020-06-12 02:03:23 +02:00
Sofie Van Landeghem	c0f4a1e43b	train is from-config by default (#5575 ) * verbose and tag_map options * adding init_tok2vec option and only changing the tok2vec that is specified * adding omit_extra_lookups and verifying textcat config * wip * pretrain bugfix * add replace and resume options * train_textcat fix * raw text functionality * improve UX when KeyError or when input data can't be parsed * avoid unnecessary access to goldparse in TextCat pipe * save performance information in nlp.meta * add noise_level to config * move nn_parser's defaults to config file * multitask in config - doesn't work yet * scorer offering both F and AUC options, need to be specified in config * add textcat verification code from old train script * small fixes to config files * clean up * set default config for ner/parser to allow create_pipe to work as before * two more test fixes * small fixes * cleanup * fix NER pickling + additional unit test * create_pipe as before	2020-06-12 02:02:07 +02:00
adrianeboyd	556895177e	Expand Japanese requirements warning (#5572 ) Include explicit install instructions in Japanese requirements warning.	2020-06-11 13:47:37 +02:00
adrianeboyd	fe167fcf7d	Update pytest conf for sudachipy with Japanese (#5574 )	2020-06-11 10:23:50 +02:00
Jones Martins	bab30e4ad2	Add "c'mon" token exception (#5570 ) * Add "c'mon" exception * Fix typo in "C'mon" exception	2020-06-10 21:54:06 +02:00
Jones Martins	28db7dd5d9	Add missing pronoums/determiners (#5569 ) * Add missing pronoums/determiners * Add test for missing pronoums * Add contributor file	2020-06-10 18:47:04 +02:00
adrianeboyd	0a70bd6281	Bump version to 2.3.0.dev1 (#5567 )	2020-06-09 15:47:31 +02:00
adrianeboyd	b7e6e1b9a7	Disable sentence segmentation in ja tokenizer (#5566 )	2020-06-09 12:00:59 +02:00
adrianeboyd	f162815f45	Handle empty and whitespace-only docs for Japanese (#5564 ) Handle empty and whitespace-only docs in the custom alignment method used by the Japanese tokenizer.	2020-06-08 21:09:23 +02:00
adrianeboyd	3bf111585d	Update Japanese tokenizer config and add serialization (#5562 ) * Use `config` dict for tokenizer settings * Add serialization of split mode setting * Add tests for tokenizer split modes and serialization of split mode setting Based on #5561	2020-06-08 16:29:05 +02:00
Hiroshi Matsuda	456bf47f51	fix a bug causing mis-alignments (#5560 )	2020-06-08 15:49:34 +02:00
Ines Montani	d93cbeb14f	Add warning for loose version constraints (#5536 ) * Add warning for loose version constraints * Update wording [ci skip] * Tweak error message Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-06-05 12:42:15 +02:00
adrianeboyd	1ac43d78f9	Avoid libc.stdint for UINT64_MAX (#5545 )	2020-06-04 20:02:05 +02:00
Paul O'Leary McCann	410fb7ee43	Add Japanese Model (#5544 ) * Add more rules to deal with Japanese UD mappings Japanese UD rules sometimes give different UD tags to tokens with the same underlying POS tag. The UD spec indicates these cases should be disambiguated using the output of a tool called "comainu", but rules are enough to get the right result. These rules are taken from Ginza at time of writing, see #3756. * Add new tags from GSD This is a few rare tags that aren't in Unidic but are in the GSD data. * Add basic Japanese sentencization This code is taken from Ginza again. * Add sentenceizer quote handling Could probably add more paired characters but this will do for now. Also includes some tests. * Replace fugashi with SudachiPy * Modify tag format to match GSD annotations Some of the tests still need to be updated, but I want to get this up for testing training. * Deal with case with closing punct without opening * refactor resolve_pos() * change tag field separator from "," to "-" * add TAG_ORTH_MAP * add TAG_BIGRAM_MAP * revise rules for 連体詞 * revise rules for 連体詞 * improve POS about 2% * add syntax_iterator.py (not mature yet) * improve syntax_iterators.py * improve syntax_iterators.py * add phrases including nouns and drop NPs consist of STOP_WORDS * First take at noun chunks This works in many situations but still has issues in others. If the start of a subtree has no noun, then nested phrases can be generated. また行きたい、そんな気持ちにさせてくれるお店です。 [そんな気持ち, また行きたい、そんな気持ちにさせてくれるお店] For some reason て gets included sometimes. Not sure why. ゲンに連れ添って円盤生物を調査するパートナーとなる。 [て円盤生物, ...] Some phrases that look like they should be split are grouped together; not entirely sure that's wrong. This whole thing becomes one chunk: 道の駅遠山郷北側からかぐら大橋南詰現道交点までの1.060kmのみ開通済み * Use new generic get_words_and_spaces The new get_words_and_spaces function is simpler than what was used in Japanese, so it's good to be able to switch to it. However, there was an issue. 
The new function works just on text, so POS info could get out of sync. Fixing this required a small change to the way dtokens (tokens with POS and lemma info) were generated. Specifically, multiple extraneous spaces now become a single token, so when generating dtokens multiple space tokens should be created in a row. * Fix noun_chunks, should be working now * Fix some tests, add naughty strings tests Some of the existing tests changed because the tokenization mode of Sudachi changed to the more fine-grained A mode. Sudachi also has issues with some strings, so this adds a test against the naughty strings. * Remove empty Sudachi tokens Not doing this creates zero-length tokens and causes errors in the internal spaCy processing. * Add yield_bunsetu back in as a separate piece of code Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com> Co-authored-by: hiroshi <hiroshi_matsuda@megagon.ai>	2020-06-04 19:15:43 +02:00
Matthew Honnibal	8411d4f4e6	Merge pull request #5543 from svlandeg/feature/pretrain-config pretrain from config	2020-06-04 19:07:12 +02:00
svlandeg	3ade455fd3	formatting	2020-06-04 16:09:55 +02:00
svlandeg	776d4f1190	cleanup	2020-06-04 16:07:30 +02:00
svlandeg	6b027d7689	remove duplicate model definition of tok2vec layer	2020-06-04 15:49:23 +02:00
svlandeg	1775f54a26	small little fixes	2020-06-03 22:17:02 +02:00
svlandeg	07886a3de3	rename init_tok2vec to resume	2020-06-03 22:00:25 +02:00
svlandeg	4ed6278663	small fixes to pretrain config, init_tok2vec TODO	2020-06-03 19:32:40 +02:00
svlandeg	ffe0451d09	pretrain from config	2020-06-03 14:45:00 +02:00
Ines Montani	a8875d4a4b	Fix typo	2020-06-03 14:42:39 +02:00
Ines Montani	4e0610d0d4	Update warning codes	2020-06-03 14:37:09 +02:00
Ines Montani	810fce3bb1	Merge branch 'develop' into master-tmp	2020-06-03 14:36:59 +02:00
Adriane Boyd	b0ee76264b	Remove debugging	2020-06-03 14:20:42 +02:00
Adriane Boyd	1d8168d1fd	Fix problems with lower and whitespace in variants Port relevant changes from #5361: * Initialize lower flag explicitly * Handle whitespace words from GoldParse correctly when creating raw text with orth variants	2020-06-03 14:15:58 +02:00
Adriane Boyd	10d938f221	Update default cfg dir in train CLI	2020-06-03 14:15:50 +02:00
Adriane Boyd	f1f9c8b417	Port train CLI updates Updates from #5362 and fix from #5387: * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-06-03 14:03:43 +02:00
Adriane Boyd	8c758ed1eb	Fix meta path	2020-06-03 12:11:57 +02:00
Adriane Boyd	a57bdeecac	Test util.get_model_meta instead of util.load_model	2020-06-03 12:10:12 +02:00
svlandeg	eac12cbb77	make dropout in embed layers configurable	2020-06-03 11:50:16 +02:00
svlandeg	e91485dfc4	add discard_oversize parameter, move optimizer to training subsection	2020-06-03 10:04:16 +02:00
svlandeg	03c58b488c	prevent infinite loop, custom warning	2020-06-03 10:00:21 +02:00
svlandeg	6504b7f161	Merge remote-tracking branch 'upstream/develop' into feature/pretrain-config	2020-06-03 08:30:16 +02:00
svlandeg	c5ac382f0a	fix name clash	2020-06-02 22:24:57 +02:00
svlandeg	2bf5111ecf	additional test with discard_oversize=False	2020-06-02 22:09:37 +02:00
svlandeg	aa6271b16c	extending algorithm to deal better with edge cases	2020-06-02 22:05:08 +02:00
svlandeg	f2e162fc60	it's only oversized if the tolerance level is also exceeded	2020-06-02 19:59:04 +02:00
svlandeg	ef834b4cd7	fix comments	2020-06-02 19:50:44 +02:00
svlandeg	6208d322d3	slightly more challenging unit test	2020-06-02 19:47:30 +02:00
svlandeg	6651fafd5c	using overflow buffer for examples within the tolerance margin	2020-06-02 19:43:39 +02:00
svlandeg	85b0597ed5	add test for minibatch util	2020-06-02 18:26:21 +02:00
svlandeg	5b350a6c99	bugfix of the bugfix	2020-06-02 17:49:33 +02:00
Adriane Boyd	75f08ad62d	Remove unnecessary check	2020-06-02 17:41:25 +02:00
Adriane Boyd	bbc1836581	Add rudimentary version checks on model load	2020-06-02 17:33:48 +02:00
svlandeg	fdfd822936	rewrite minibatch_by_words function	2020-06-02 15:22:54 +02:00
svlandeg	ec52e7f886	add oversize examples before StopIteration returns	2020-06-02 13:21:55 +02:00
svlandeg	e0f9f448f1	remove Tensorizer	2020-06-01 23:38:48 +02:00
Leo	925e938570	Spanish tokenizer exception and examples improvement (#5531 ) * Spanish tokenizer exception additions. Added Spanish question examples * erased slang tokenization examples	2020-06-01 18:18:34 +02:00
Matthew Honnibal	67af3a32b0	Merge pull request #5527 from adrianeboyd/bugfix/tagger-sp-tag-map Preserve _SP when filtering tag map in Tagger	2020-06-01 12:00:21 +02:00
Leo	c21c308ecb	corrected issue #5524 changed <U+009C> 'STRING TERMINATOR' for <U+0153> LATIN SMALL LIGATURE OE' (#5526 )	2020-05-31 22:08:12 +02:00
Adriane Boyd	a005ccd6d7	Preserve _SP when filtering tag map in Tagger To allow "SP" as a tag (for Chinese OntoNotes), preserve "_SP" if present as the reference `SPACE` POS in the tag map in `Tagger.begin_training()`.	2020-05-31 19:57:54 +02:00
Ines Montani	b5ae2edcba	Merge pull request #5516 from explosion/feature/improve-model-version-deps	2020-05-31 12:54:01 +02:00
Ines Montani	dc186afdc5	Add warning	2020-05-30 15:34:54 +02:00
Ines Montani	b7aff6020c	Make functions more general purpose and update docstrings and tests	2020-05-30 15:18:53 +02:00
Ines Montani	a7e370bcbf	Don't override spaCy version	2020-05-30 15:03:18 +02:00
Ines Montani	e47e5a4b10	Use more sophisticated version parsing logic	2020-05-30 15:01:58 +02:00
svlandeg	15134ef611	fix deserialization order	2020-05-30 12:53:32 +02:00
Matthew Honnibal	64adda3202	Revert "Remove peeking from Parser.begin_training (#5456 )" This reverts commit `9393253b66`. The model shouldn't need to see all examples, and actually in v3 there's no equivalent step. All examples are provided to the component, for the component to do stuff like figuring out the labels. The model just needs to do stuff like shape inference.	2020-05-29 23:21:55 +02:00
Matthew Honnibal	85f1acfaa0	Merge pull request #5517 from adrianeboyd/bugfix/morph-repr Remove MorphAnalysis __str__ and __repr__	2020-05-29 19:20:56 +02:00
svlandeg	291483157d	prevent loading a pretrained Tok2Vec layer AND pretrained components	2020-05-29 17:38:33 +02:00
Adriane Boyd	e1b7cbd197	Remove MorphAnalysis __str__ and __repr__	2020-05-29 14:33:47 +02:00
Ines Montani	4fd087572a	WIP: improve model version deps	2020-05-28 12:51:37 +02:00
Matthw Honnibal	58750b06f8	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-05-27 22:18:36 +02:00
Matthew Honnibal	aecd1437cc	Merge pull request #5508 from adrianeboyd/bugfix/tag-map-sp-tag Prefer _SP over SP for default tag map space attrs	2020-05-27 20:39:40 +02:00
Adriane Boyd	25de2a2191	Improve vector name loading from model meta	2020-05-27 14:48:54 +02:00
adrianeboyd	aad0610a85	Map NR to PROPN (#5512 )	2020-05-26 22:30:53 +02:00
Adriane Boyd	b6b5908f5e	Prefer _SP over SP for default tag map space attrs If `_SP` is already in the tag map, use the mapping from `_SP` instead of `SP` so that `SP` can be a valid non-space tag. (Chinese has a non-space tag `SP` which was overriding the mapping of `_SP` to `SPACE`.)	2020-05-26 14:57:13 +02:00
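The preference order described in the commit above can be sketched as follows. This is an illustration, not the spaCy code: the helper `space_tag_attrs`, the fallback default, and the example tag map values are all assumptions.

```python
def space_tag_attrs(tag_map):
    """Pick the attrs used for whitespace tokens (hypothetical helper)."""
    # Prefer the reserved "_SP" entry so that "SP" remains free to be an
    # ordinary, non-space tag.
    if "_SP" in tag_map:
        return tag_map["_SP"]
    # Fall back to "SP" only when no "_SP" entry exists; the final default
    # is an assumption made for this sketch.
    return tag_map.get("SP", {"POS": "SPACE"})


# Illustrative tag maps: one where "SP" is a real tag, one legacy map.
tag_map_with_real_sp = {"SP": {"POS": "PART"}, "_SP": {"POS": "SPACE"}}
legacy_tag_map = {"SP": {"POS": "SPACE"}}
```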
Adriane Boyd	1eed101be9	Fix Polish lemmatizer for deserialized models Restructure Polish lemmatizer not to depend on lookups data in `__init__` since the lemmatizer is initialized before the lookups data is loaded from a saved model. The lookups tables are accessed first in `__call__` instead once the data is available.	2020-05-26 09:56:12 +02:00
Ines Montani	24ef6680fa	Merge pull request #5499 from adrianeboyd/chore/bump-version-deps-v2.3.0	2020-05-25 13:25:45 +02:00
Adriane Boyd	3f727bc539	Switch to v2.3.0.dev0	2020-05-25 12:57:20 +02:00
Adriane Boyd	736f3cb5af	Bump version and deps for v2.3.0 * spacy to v2.3.0 * thinc to v7.4.1 * spacy-lookups-data to v0.3.2	2020-05-25 12:03:49 +02:00
Adriane Boyd	e06ca7ea24	Switch to new add API in PhraseMatcher unpickle	2020-05-25 11:22:47 +02:00
Ines Montani	1a15896ba9	unicode -> str consistency [ci skip]	2020-05-24 18:51:10 +02:00
Ines Montani	5d3806e059	unicode -> str consistency	2020-05-24 17:20:58 +02:00
Ines Montani	387c7aba15	Update test	2020-05-24 14:55:16 +02:00
Ines Montani	f9786d765e	Simplify is_package check	2020-05-24 14:48:56 +02:00
Matthw Honnibal	2d9de8684d	Support use_pytorch_for_gpu_memory config	2020-05-22 23:10:40 +02:00
Ines Montani	4465cad6c5	Rename spacy.analysis to spacy.pipe_analysis	2020-05-22 17:42:06 +02:00
Ines Montani	25d6ed3fb8	Merge pull request #5489 from explosion/feature/connected-components	2020-05-22 17:40:11 +02:00
Ines Montani	841c05b47b	Merge pull request #5490 from explosion/fix/remove-jsonschema	2020-05-22 17:39:54 +02:00
Ines Montani	569a65b60e	Auto-format	2020-05-22 16:55:42 +02:00
Ines Montani	d844528c5f	Add test for is_compatible_model	2020-05-22 16:55:15 +02:00
Ines Montani	12b7be1d98	Remove jsonschema from dependencies	2020-05-22 16:49:26 +02:00
Matthew Honnibal	f7f6df7275	Move to spacy.analysis	2020-05-22 16:43:18 +02:00
Matthew Honnibal	78d79d94ce	Guess set_annotations=True in nlp.update During `nlp.update`, components can be passed a boolean set_annotations to indicate whether they should assign annotations to the `Doc`. This needs to be called if downstream components expect to use the annotations during training, e.g. if we wanted to use tagger features in the parser. Components can specify their assignments and requirements, so we can figure out which components have these inter-dependencies. After figuring this out, we can guess whether to pass set_annotations=True. We could also call set_annotations=True always, or even just have this as the only behaviour. The downside of this is that it would require the `Doc` objects to be created afresh to avoid problematic modifications. One approach would be to make a fresh copy of the `Doc` objects within `nlp.update()`, so that we can write to the objects without any problems. If we do that, we can drop this logic and also drop the `set_annotations` mechanism. I would be fine with that approach, although it runs the risk of introducing some performance overhead, and we'll have to take care to copy all extension attributes etc.	2020-05-22 15:55:45 +02:00
Ines Montani	6728747f71	Merge pull request #5486 from explosion/fix/compat-py2	2020-05-22 15:47:21 +02:00
Ines Montani	6e6db6afb6	Better model compatibility and validation	2020-05-22 15:42:46 +02:00
Matthew Honnibal	f6078d866a	Merge pull request #5121 from adrianeboyd/bugfix/revert-token-match Revert token_match priority changes from #4374 and extend token match options	2020-05-22 14:42:51 +02:00
Ines Montani	c685ee734a	Fix compat for v2.x branch	2020-05-22 14:22:36 +02:00
Adriane Boyd	e4a1b5dab1	Rename to url_match Rename to `url_match` and update docs.	2020-05-22 12:41:03 +02:00
Adriane Boyd	730fa493a4	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-22 12:18:00 +02:00
Adriane Boyd	71fe61fdcd	Disallow merging 0-length spans	2020-05-22 10:14:34 +02:00
Matthew Honnibal	93c4d13588	Merge pull request #5264 from lfiedler/issue-5230 Fix ResourceWarnings during unittest	2020-05-22 00:31:07 +02:00
Matthew Honnibal	e1cb7e838b	Merge pull request #5481 from explosion/feature/blank-shortcut-v2 Add blank:{lang} shortcut support to util.load_model	2020-05-22 00:08:23 +02:00
Ines Montani	2250380816	Merge pull request #5482 from explosion/fix/backwards-compat-super	2020-05-21 21:51:46 +02:00
Ines Montani	891fa59009	Use backwards-compatible super()	2020-05-21 20:52:48 +02:00
Matthew Honnibal	5ce02c1b17	Merge pull request #5470 from svlandeg/bugfix/noun-chunks Bugfix in noun chunks	2020-05-21 20:51:31 +02:00
Matthw Honnibal	25b51f4fc8	Set version to v3.0.0.dev9	2020-05-21 20:47:52 +02:00
Matthw Honnibal	bc94fdabd0	Fix begin_training	2020-05-21 20:46:21 +02:00
Matthw Honnibal	d507ac28d8	Fix shape inference	2020-05-21 20:46:10 +02:00
Ines Montani	cb02bff0eb	Add blank:{lang} shortcut to util.load_mode	2020-05-21 20:24:07 +02:00
Matthw Honnibal	df87c32a40	Pass smaller doc sample into model initialize	2020-05-21 20:17:24 +02:00
Ines Montani	581bda9f98	Update senter test and auto-format	2020-05-21 20:17:14 +02:00
Ines Montani	0f1beb5ff2	Tidy up and avoid absolute spacy imports in core	2020-05-21 20:05:03 +02:00
svlandeg	51715b9f72	span / noun chunk has +1 because end is exclusive	2020-05-21 19:56:56 +02:00
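The `+1` mentioned in the commit above follows from spans using an exclusive end index. A minimal sketch, assuming candidate chunks arrive as sorted `(left, right)` pairs of inclusive token indices; the function name and input shape are hypothetical and this is not the syntax iterator code itself.

```python
def non_overlapping_chunks(candidates):
    """Yield (start, end) spans with exclusive end, skipping overlapping candidates."""
    prev_end = -1  # inclusive index of the last token already covered
    for left, right in candidates:
        if left <= prev_end:
            # Candidate starts inside a chunk that was already emitted.
            continue
        prev_end = right
        # The right edge is an inclusive token index, the span end is
        # exclusive, hence the +1.
        yield left, right + 1


words = ["the", "big", "dog", "chased", "cats"]
candidates = [(0, 2), (1, 2), (4, 4)]
chunks = [words[start:end] for start, end in non_overlapping_chunks(candidates)]
```

Tracking only the previous end index avoids a separate pass to check each candidate against every token already seen.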
Adriane Boyd	132b2a6898	Merge remote-tracking branch 'upstream/master-tmp' into HEAD	2020-05-21 19:50:30 +02:00
Adriane Boyd	17ee9ab53a	Fix _SP/POS=SPACE in strings serialization tests	2020-05-21 19:49:08 +02:00
Ines Montani	245f91df78	Fix merge issues	2020-05-21 19:42:13 +02:00
Matthw Honnibal	3b5cfec1fc	Tweak memory management in train_from_config	2020-05-21 19:32:04 +02:00
Matthw Honnibal	f075655deb	Fix shape inference in begin_training	2020-05-21 19:26:29 +02:00
svlandeg	84d5b7ad0a	Merge remote-tracking branch 'upstream/master' into bugfix/noun-chunks # Conflicts: # spacy/lang/el/syntax_iterators.py # spacy/lang/en/syntax_iterators.py # spacy/lang/fa/syntax_iterators.py # spacy/lang/fr/syntax_iterators.py # spacy/lang/id/syntax_iterators.py # spacy/lang/nb/syntax_iterators.py # spacy/lang/sv/syntax_iterators.py	2020-05-21 19:19:50 +02:00
svlandeg	f7d10da555	avoid unnecessary loop to check overlapping noun chunks	2020-05-21 19:15:57 +02:00
Ines Montani	631e20d0c6	Fix test and schemas	2020-05-21 19:01:02 +02:00
Ines Montani	d34fc0915e	Remove serialization getter	2020-05-21 18:48:21 +02:00
Ines Montani	f44897e4c6	Update warning IDs	2020-05-21 18:39:11 +02:00
Ines Montani	24f72c669c	Merge branch 'develop' into master-tmp	2020-05-21 18:39:06 +02:00
Ines Montani	c6ec19c844	Add missing declaration	2020-05-21 17:30:05 +02:00
Matthew Honnibal	884d9b060d	Merge pull request #5466 from adrianeboyd/feature/omit-extra-lexeme-info Add option to omit extra lexeme tables in CLI	2020-05-21 16:40:02 +02:00
Matthew Honnibal	e6c4c1a507	Merge pull request #5468 from adrianeboyd/feature/cli-conllu-misc-ner Improve handling of NER in CoNLL-U MISC	2020-05-21 16:39:46 +02:00
Matthew Honnibal	26cd6a0229	Merge pull request #5462 from adrianeboyd/feature/lemmatizer-all-upos Extend lemmatizer rules for all UPOS tags	2020-05-21 16:05:31 +02:00
Matthew Honnibal	cad9b290a2	Merge branch 'master' into feature/omit-extra-lexeme-info	2020-05-21 16:04:24 +02:00
Matthew Honnibal	1f572ce89b	Merge pull request #5473 from explosion/fix/travis-tests Fix Python 2.7 compat	2020-05-21 15:56:16 +02:00
Ines Montani	a9cb2882cb	Rename argument: doc_or_span/obj -> doclike (#5463 ) * doc_or_span -> obj * Revert "doc_or_span -> obj" This reverts commit `78bb9ff5e0`. * obj -> doclike * Refer to correct object	2020-05-21 15:17:39 +02:00
Ines Montani	bea863acd2	Fix naming conflict and formatting	2020-05-21 14:24:38 +02:00
Ines Montani	bd6353715a	Merge branch 'master' into fix/travis-tests	2020-05-21 14:23:04 +02:00
Ines Montani	d8f3190c0a	Tidy up and auto-format	2020-05-21 14:14:01 +02:00
Ines Montani	56de520afd	Try to fix tests on Travis (2.7)	2020-05-21 14:04:57 +02:00
adrianeboyd	d45602bc11	Merge branch 'master' into feature/omit-extra-lexeme-info	2020-05-21 10:26:01 +02:00
svlandeg	b221bcf1ba	fixing all languages	2020-05-21 00:17:28 +02:00
svlandeg	b509a3e7fc	fix: use actual range in 'seen' instead of subtree	2020-05-20 23:06:39 +02:00
svlandeg	36a94c409a	failing test to reproduce overlapping spans problem	2020-05-20 23:06:03 +02:00
adrianeboyd	49ef06d793	Add option for base model in init-model CLI (#5467 ) Intended for languages like Chinese with a custom tokenizer.	2020-05-20 18:49:11 +02:00
Adriane Boyd	4b229bfc22	Improve handling of NER in CoNLL-U MISC	2020-05-20 18:48:51 +02:00
Matthew Honnibal	609c0ba557	Fix accidentally quadratic runtime in Example.split_sents (#5464 ) * Tidy up train-from-config a bit * Fix accidentally quadratic perf in TokenAnnotation.brackets When we're reading in the gold data, we had a nested loop where we looped over the brackets for each token, looking for brackets that start on that word. This is accidentally quadratic, because we have one bracket per word (for the POS tags). So we had an O(N**2) behaviour here that ended up being pretty slow. To solve this I'm indexing the brackets by their starting word on the TokenAnnotations object, and having a property to provide the previous view. Fixes	2020-05-20 18:48:18 +02:00
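The fix described in the commit above replaces a per-token scan with a lookup keyed by starting word. A minimal sketch of both approaches, assuming brackets are `(start, end, label)` tuples; the function names are hypothetical and this is not the `TokenAnnotation` code.

```python
from collections import defaultdict


def brackets_per_token_scan(n_tokens, brackets):
    """Quadratic version: scan every bracket for every token."""
    return [[b for b in brackets if b[0] == i] for i in range(n_tokens)]


def brackets_per_token_indexed(n_tokens, brackets):
    """Linear version: index brackets by their starting token once."""
    by_start = defaultdict(list)
    for bracket in brackets:
        by_start[bracket[0]].append(bracket)
    # .get() avoids creating empty entries for tokens without brackets.
    return [by_start.get(i, []) for i in range(n_tokens)]


# One bracket per word (as with POS tags) plus one phrase-level bracket.
brackets = [(0, 0, "DT"), (1, 1, "NN"), (2, 2, "VBZ"), (0, 1, "NP")]
```

With roughly one bracket per token, the scan costs on the order of N squared comparisons, while the indexed version builds the dict once and then does one lookup per token.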
Adriane Boyd	daaa7bf451	Add option to omit extra lexeme tables in CLI	2020-05-20 15:51:44 +02:00
Adriane Boyd	8cba0e41d8	Return lowercase form as default except for PROPN	2020-05-20 15:35:08 +02:00
adrianeboyd	9393253b66	Remove peeking from Parser.begin_training (#5456 ) Inspect all instances in `Parser.begin_training` rather than only the first 1000.	2020-05-20 15:18:06 +02:00
Matthw Honnibal	fda7355508	Fix train-from-config	2020-05-20 12:30:21 +02:00
Matthw Honnibal	24efd54a42	Merge from develop	2020-05-20 12:27:31 +02:00
Sofie Van Landeghem	7f5715a081	Various fixes to NEL functionality, Example class etc (#5460 ) * setting KB in the EL constructor, similar to how the model is passed on * removing wikipedia example files - moved to projects * throw an error when nlp.update is called with 2 positional arguments * rewriting the config logic in create pipe to accomodate for other objects (e.g. KB) in the config * update config files with new parameters * avoid training pipeline components that don't have a model (like sentencizer) * various small fixes + UX improvements * small fixes * set thinc to 8.0.0a9 everywhere * remove outdated comment	2020-05-20 11:41:12 +02:00
Adriane Boyd	4fa9670537	Extend lemmatizer rules for all UPOS tags	2020-05-20 10:15:43 +02:00
Matthew Honnibal	664a3603b0	Set version to v3.0.0.dev8	2020-05-19 17:15:39 +02:00
adrianeboyd	40e65d6f63	Fix most_similar for vectors with unused rows (#5348 ) * Fix most_similar for vectors with unused rows Address issues related to the unused rows in the vector table and `most_similar`: * Update `most_similar()` to search only through rows that are in use according to `key2row`. * Raise an error when `most_similar(n=n)` is larger than the number of vectors in the table. * Set and restore `_unset` correctly when vectors are added or deserialized so that new vectors are added in the correct row. * Set data and keys to the same length in `Vocab.prune_vectors()` to avoid spurious entries in `key2row`. * Fix regression test using `most_similar` Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 16:41:26 +02:00
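Two of the points in the commit above, searching only rows that are in use according to `key2row` and raising an error when `n` exceeds the number of vectors, can be sketched in plain Python. This is an illustration with list-based vectors and a simplified signature, not the `Vectors.most_similar` implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def most_similar(query, data, key2row, n=1):
    """Return the n keys whose vectors are closest to `query` (simplified sketch)."""
    # Only rows referenced by key2row hold real vectors; unused rows are skipped.
    rows_in_use = sorted(set(key2row.values()))
    if n > len(rows_in_use):
        raise ValueError("n is larger than the number of vectors in the table")
    row2key = {}
    for key, row in key2row.items():
        row2key.setdefault(row, key)
    scored = [(cosine(query, data[row]), row2key[row]) for row in rows_in_use]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored[:n]]


# Four rows allocated, only two in use; the rest are zero padding.
data = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
key2row = {"a": 0, "b": 1}
```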
Sofie Van Landeghem	f00de445dd	default models defined in component decorator (#5452 ) * move defaults to pipeline and use in component decorator * black formatting * relative import	2020-05-19 16:20:03 +02:00
adrianeboyd	70da1fd2d6	Add warning for misaligned character offset spans (#5007 ) * Add warning for misaligned character offset spans * Resolve conflict * Filter warnings in example scripts Filter warnings in example scripts to show warnings once, in particular warnings about misaligned entities. Co-authored-by: Ines Montani <ines@ines.io>	2020-05-19 16:01:18 +02:00
adrianeboyd	0061992d95	Update Polish tokenizer for UD_Polish-PDB (#5432 ) Update Polish tokenizer for UD_Polish-PDB, which is a relatively major change from the existing tokenizer. Unused exceptions files and conflicting test cases removed. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:55 +02:00
adrianeboyd	a5cd203284	Reduce stored lexemes data, move feats to lookups (#5238) * Reduce stored lexemes data, move feats to lookups * Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only * Remove serialization of lexemes data as `vocab/lexemes.bin` * Remove `SerializedLexemeC` * Remove `Lexeme.to_bytes/from_bytes` * Modify normalization exception loading: * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab` * Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data) * Skip English normalization test Skip English normalization test because the data is now in `spacy-lookups-data`. * Remove norm exceptions Moved to spacy-lookups-data. * Move norm exceptions test to spacy-lookups-data * Load extra lookups from spacy-lookups-data lazily Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`. * Skip creating lexeme cache on load To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing. * Identify numeric values in Lexeme.set_attrs() With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`. * Skip lexeme cache init in from_bytes * Unskip and update lookups tests for python3.6+ * Update vocab pickle to include lookups_extra * Update vocab serialization tests Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP". * Re-skip lookups test because of python3.5 * Skip PROB/float values in Lexeme.set_attrs * Convert is_oov from lexeme flag to lex in vectors Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:14 +02:00
Sofie Van Landeghem	0d94737857	Feature toggle_pipes (#5378 ) * make disable_pipes deprecated in favour of the new toggle_pipes * rewrite disable_pipes statements * update documentation * remove bin/wiki_entity_linking folder * one more fix * remove deprecated link to documentation * few more doc fixes * add note about name change to the docs * restore original disable_pipes * small fixes * fix typo * fix error number to W096 * rename to select_pipes * also make changes to the documentation Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-18 22:27:10 +02:00
Matthew Honnibal	333b1a308b	Adapt parser and NER for transformers (#5449) * Draft layer for BILUO actions * Fixes to biluo layer * WIP on BILUO layer * Add tests for BILUO layer * Format * Fix transitions * Update test * Link in the simple_ner * Update BILUO tagger * Update __init__ * Import simple_ner * Update test * Import * Add files * Add config * Fix label passing for BILUO and tagger * Fix label handling for simple_ner component * Update simple NER test * Update config * Hack train script * Update BILUO layer * Fix SimpleNER component * Update train_from_config * Add biluo_to_iob helper * Add IOB layer * Add IOBTagger model * Update biluo layer * Update SimpleNER tagger * Update BILUO * Read random seed in train-from-config * Update use of normal_init * Fix normalization of gradient in SimpleNER * Update IOBTagger * Remove print * Tweak masking in BILUO * Add dropout in SimpleNER * Update thinc * Tidy up simple_ner * Fix biluo model * Unhack train-from-config * Update setup.cfg and requirements * Add tb_framework.py for parser model * Try to avoid memory leak in BILUO * Move ParserModel into spacy.ml, avoid need for subclass. * Use updated parser model * Remove incorrect call to model.initialize in PrecomputableAffine * Update parser model * Avoid divide by zero in tagger * Add extra dropout layer in tagger * Refine minibatch_by_words function to avoid oom * Fix parser model after refactor * Try to avoid div-by-zero in SimpleNER * Fix infinite loop in minibatch_by_words * Use SequenceCategoricalCrossentropy in Tagger * Fix parser model when hidden layer * Remove extra dropout from tagger * Add extra nan check in tagger * Fix thinc version * Update tests and imports * Fix test * Update test * Update tests * Fix tests * Fix test Co-authored-by: Ines Montani <ines@ines.io>	2020-05-18 22:23:33 +02:00
Ines Montani	a41e28ceba	Merge pull request #5436 from ilivans/fix_errors_with_codes	2020-05-18 10:45:56 +02:00
Ilkyu Ju	72a25c9cef	Very minor issues in Korean example sentences (#5446 ) * Add contributor agreement * Improve ko translation of example sentences I fixed unnatural translations and word spacing errors. * Update osori.md	2020-05-17 13:43:34 +02:00
svlandeg	6fb6a8518c	bump to 3.0.0.dev7 and thinc to 8.0.0a8	2020-05-15 13:25:54 +02:00
svlandeg	047f3d7d94	remove ops argument for Adam	2020-05-15 13:25:00 +02:00
svlandeg	e0fda2bd81	throw warning when model_cfg is None	2020-05-15 11:02:10 +02:00
adrianeboyd	908dea3939	Skip duplicate lexeme rank setting (#5401 ) Skip duplicate lexeme rank setting within `_fix_pretrained_vectors_name()`.	2020-05-14 18:26:12 +02:00
adrianeboyd	f49e2810e6	Add Polish lemmatizer (#5413 ) * Add Polish lemmatizer Contributed by @ryszardtuora * Add missing import	2020-05-14 18:23:19 +02:00
adrianeboyd	e63880e081	Use Token.sent_start for Span.sent (#5439 ) Use `Token.sent_start` for sentence boundaries in `Span.sent` so that `Doc.sents` and `Span.sent` return the same sentence boundaries.	2020-05-14 18:22:51 +02:00
adrianeboyd	780b869345	Fix syntax iterators for Persian (#5437 )	2020-05-14 16:51:03 +02:00
Ilia Ivanov	712d9d4820	fixup! Fix ErrorsWithCodes().__class__ return value	2020-05-14 15:45:58 +02:00
Ilia Ivanov	a987e9e45d	Fix ErrorsWithCodes().__class__ return value	2020-05-14 14:14:15 +02:00
Vishnu Priya VR	9ce059dd06	Limiting noun_chunks for specific languages (#5396) * Limiting noun_chunks for specific languages * Limiting noun_chunks for specific languages Contributor Agreement * Addressing review comments * Removed unused fixtures and imports * Add fa_tokenizer in test suite * Use fa_tokenizer in test * Undo extraneous reformatting Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>	2020-05-14 12:58:06 +02:00
Sofie Van Landeghem	b04738903e	prevent None in gold fields (#5425 ) * set gold fields to empty list instead of keeping them as None * add unit test	2020-05-13 22:08:50 +02:00
adrianeboyd	113e7981d0	Check that row is within bounds when adding vector (#5430 ) Check that row is within bounds for the vector data array when adding a vector. Don't add vectors with rank OOV_RANK in `init-model` (change is due to shift from OOV as 0 to OOV as OOV_RANK).	2020-05-13 22:08:28 +02:00
adrianeboyd	07639dd6ac	Remove TAG from da/sv tokenizer exceptions (#5428 ) Remove `TAG` value from Danish and Swedish tokenizer exceptions because it may not be included in a tag map (and these settings are problematic as tokenizer exceptions anyway).	2020-05-13 10:25:54 +02:00
adrianeboyd	24e7108f80	Modify array type to accommodate OOV_RANK (#5429 ) Modify indices array type in `Vocab.prune_vectors` to accommodate OOV_RANK index as max(uint64).	2020-05-13 10:25:05 +02:00
svlandeg	102c8c7e2f	fix fan_in renaming	2020-05-12 13:56:10 +02:00
adrianeboyd	440b81bddc	Improve exceptions for 'd (would/had) in English (#5379 ) Instead of treating `'d` in contractions like `I'd` as `would` in all cases in the tokenizer exceptions, leave the tagging and lemmatization up to later components.	2020-05-08 15:10:57 +02:00
adrianeboyd	c963e269ba	Add method to update / reset pkuseg user dict (#5404 )	2020-05-08 11:21:46 +02:00
Samuel Rodríguez Medina	5e55bfa821	Fixed tests for Swedish that were written in Danish. (#5395 )	2020-05-05 14:06:27 +02:00
Adriane Boyd	565e0eef73	Add tokenizer option for token match with affixes To fix the slow tokenizer URL (#4374) and allow `token_match` to take priority over prefixes and suffixes by default, introduce a new tokenizer option for a token match pattern that's applied after prefixes and suffixes but before infixes.	2020-05-05 10:35:33 +02:00
Adriane Boyd	792c8af8cf	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-05 09:25:57 +02:00
Matthew Honnibal	eb117e2fce	Add load_config_from_str helper	2020-05-02 14:09:21 +02:00
adrianeboyd	c045a9c7f6	Fix logic in train CLI timing eval on GPU (#5387 ) Run CPU timing in first iteration only	2020-05-01 12:05:33 +02:00
Samuel Rodríguez Medina	148b036e0c	Spanish like num improvement (#5381 ) * Add tests for Spanish like_num. * Add missing numbers in Spanish lexical attributes for like_num. * Modify Spanish test function name. * Add contributor agreement.	2020-04-30 11:13:23 +02:00
Samuel Rodríguez Medina	8602daba85	Swedish like_num (#5371 ) * Sign contributor agreement. * Add like_num functionality to Swedish. * Update spacy/tests/lang/sv/test_lex_attrs.py Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update contributor agreement Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-29 21:25:22 +02:00
adrianeboyd	74da669326	Fix problems with lower and whitespace in variants (#5361 ) * Initialize lower flag explicitly * Handle whitespace words from GoldParse correctly when creating raw text with orth variants * Return the text with original casing if anything goes wrong	2020-04-29 13:01:25 +02:00
adrianeboyd	3f43c73d37	Normalize TokenC.sent_start values for Matcher (#5346 ) Normalize TokenC.sent_start values to booleans for the `Matcher`.	2020-04-29 12:57:30 +02:00
adrianeboyd	bdff76dede	Various updates/additions to CLI scripts (#5362 ) * `debug-data`: determine coverage of provided vectors * `evaluate`: support `blank:lg` model to make it possible to just evaluate tokenization * `init-model`: add option to truncate vectors to N most frequent vectors from word2vec file * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-04-29 12:56:46 +02:00
Sofie Van Landeghem	cfdaf99b80	Fix passing of component configuration (#5374 ) * add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument * add fix and test for Issue 5137	2020-04-29 12:56:17 +02:00
Ines Montani	efec28ce70	Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2	2020-04-29 12:55:37 +02:00
Sofie Van Landeghem	f67343295d	Update NEL examples and documentation (#5370 ) * simplify creation of KB by skipping dim reduction * small fixes to train EL example script * add KB creation and NEL training example scripts to example section * update descriptions of example scripts in the documentation * moving wiki_entity_linking folder from bin to projects * remove test for wiki NEL functionality that is being moved	2020-04-29 12:53:53 +02:00
adrianeboyd	a6e521cd79	Add is_sent_end token property (#5375 ) Reconstruction of the original PR #4697 by @MiniLau. Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema because the Matcher is only going to be able to support `IS_SENT_START`.	2020-04-29 12:53:16 +02:00
Ines Montani	eac47971f1	Merge pull request #5258 from mirfan899/master	2020-04-29 12:51:55 +02:00
adrianeboyd	d5f18f8307	Add missing import	2020-04-28 14:01:29 +02:00
adrianeboyd	ac40a8f7a5	Add missing import	2020-04-28 14:00:11 +02:00
Adriane Boyd	3a045572ed	Add missing import	2020-04-28 13:48:37 +02:00
Adriane Boyd	bc39f97e11	Simplify warnings	2020-04-28 13:37:37 +02:00
adrianeboyd	f8ac5b9f56	bugfix in span similarity (#5155 ) (#5358 ) * bugfix in span similarity * also rewrite doc.pyx for clarity * formatting Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-27 16:51:27 +02:00
Sofie Van Landeghem	9203d821ae	Add 2 ini files in tests/lang (#5359 )	2020-04-27 13:01:54 +02:00
Punitvara	b2b7e1f37a	This PR adds Gujarati Language class along with (#5355 ) * This PR adds Gujarati Language class along with - stop words * Add test for gu tokenizer	2020-04-27 11:07:37 +02:00
sabiqueqb	fc91660aa2	Gh 5339 language class for malayalam (#5342 ) * Initialize Malayalam Language class * Add lex_attrs and examples for Malayalam * Add spaCy Contributor Agreement * Add test for ml tokenizer	2020-04-27 09:45:08 +02:00
adrianeboyd	84e06f9fb7	Improve GoldParse NER alignment (#5335 ) Improve GoldParse NER alignment by including all cases where the start and end of the NER span can be aligned, regardless of internal tokenization differences. To do this, convert BILUO tags to character offsets, check start/end alignment with `doc.char_span()`, and assign the BILUO tags for the aligned spans. Alignment for `O/-` tags is handled through the one-to-one and multi alignments.	2020-04-23 16:58:23 +02:00
adrianeboyd	521f361052	Switch to new gold.align method (#5334 ) * Switch from original `_align` to new simpler alignment algorithm from #4526 * Remove alignment normalizations beyond whitespace and lowercasing	2020-04-21 19:31:03 +02:00
Matthew Honnibal	b2ef6100af	Only run backprop once when shared tok2vec weights (#5331 ) Previously, pipelines with shared tok2vec weights would call the tok2vec backprop callback multiple times, once for each pipeline component. This caused errors for PyTorch, and was inefficient. Instead, accumulate the gradient for all but one component, and just call the callback once.	2020-04-21 19:30:41 +02:00
adrianeboyd	bf5c13d170	Modify jieba install message (#5328 ) Modify jieba install message to instruct the user to use `ChineseDefaults.use_jieba = False` so that it's possible to load pkuseg-only models without jieba installed.	2020-04-20 22:06:53 +02:00
Matthew Honnibal	6918d99b6c	Improve GPU usage for train-with-config (#5330 ) * Adjust for no ops in Optimizer * Fix gpu in train-from-config * Update train-from-config script * Fix parser * Fix GPU efficiency of padding backprop	2020-04-20 22:06:28 +02:00
adrianeboyd	f7471abd82	Add pkuseg and serialization support for Chinese (#5308 ) * Add pkuseg and serialization support for Chinese Add support for pkuseg alongside jieba * Specify model through `Language` meta: * split on characters (if no word segmentation packages are installed) ``` Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}}) ``` * jieba (remains the default tokenizer if installed) ``` Chinese() Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit ``` * pkuseg ``` Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}}) ``` * The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model: ``` nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}}) nlp = Chinese() # has `use_jieba` as `True` by default nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer ``` Add support for serialization of tokenizer settings and pkuseg model, if loaded * Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data Extend tests to cover all three tokenizer configurations and serialization * Fix from_disk and tests without jieba or pkuseg * Load cfg first and only show error if `use_pkuseg` * Fix blank/default initialization in serialization tests * Explicitly initialize jieba's cache on init * Add serialization for pkuseg pre/postprocessors * Reformat pkuseg install message	2020-04-18 17:01:53 +02:00
Jakob Jul Elben	663333c3b2	Fixes #5413 (#5315 ) * Fix 5314 * Add contributor * Resolve requested changes Co-authored-by: Jakob Jul Elben <jakob@datamaga.com>	2020-04-16 13:29:02 +02:00
Leander Fiedler	a3401b1194	issue5230 changed reference to function to anonymous function	2020-04-15 21:52:52 +02:00
Leander Fiedler	cef0c909b9	issue5230 changed reference to function to anonymous function	2020-04-15 19:28:33 +02:00
Paolo Arduin	1ca32d8f9c	Matcher support for Span as well as Doc (#5113 ) * Matcher support for Span, as well as Doc #5056 * Removes an import unused * Signed contributors agreement * Code optimization and better test * Add error message for bad Matcher call argument * Fix merging	2020-04-15 13:51:33 +02:00
adrianeboyd	98c59027ed	Use max(uint64) for OOV lexeme rank (#5303) * Use max(uint64) for OOV lexeme rank * Add test for default OOV rank * Revert back to thinc==7.4.0 Requiring the updated version of thinc was unnecessary. * Define OOV_RANK in one place Define OOV_RANK in one place in `util`. * Fix formatting [ci skip] * Switch to external definitions of max(uint64) Switch to external definitions of max(uint64) and confirm that they are equal.	2020-04-15 13:49:47 +02:00
adrianeboyd	3d2c308906	Add Doc init from list of words and text (#5251 ) * Add Doc init from list of words and text Add an option to initialize a `Doc` from a text and list of words where the words may or may not include all whitespace tokens. If the text and words are mismatched, raise an error. * Fix error code * Remove all whitespace before aligning words/text * Move words/text init to util function * Update error message * Rename to get_words_and_spaces * Fix formatting	2020-04-14 19:15:52 +02:00
Paolo Arduin	8ce408d2e1	Comparison predicate handling for `!=` (#5282 ) * Fix #5281 * Optim test	2020-04-14 19:14:15 +02:00
Leander Fiedler	6700006830	issue5230 attempted fix of pytest segfault for python3.5	2020-04-12 09:34:54 +02:00
Leander Fiedler	d60e2d3ebf	issue5230 added unit test for dumping and loading knowledgebase	2020-04-12 09:08:41 +02:00
Leander Fiedler	d2bb649227	issue5230 filter warnings in addition to filterwarnings to prevent deprecation warnings in python35(win) setup to pop up	2020-04-10 23:21:13 +02:00
Leander Fiedler	ca2a7a44db	issue5230 store string values of warnings to remotely debug failing python35(win) setup	2020-04-10 22:26:55 +02:00
Leander Fiedler	88ca40a15d	issue5230 raise warnings as errors to remotely debug failing python35(win) setup	2020-04-10 21:45:53 +02:00
Leander Fiedler	a7bdfe42e1	issue5230 added print statement to warnings filter to remotely debug failing python35(win) setup	2020-04-10 21:14:33 +02:00
Leander Fiedler	8c1d0d628f	issue5230 writer now checks instance of loc parameter before trying to operate on it	2020-04-10 20:35:52 +02:00
Umar Butler	8952effcc4	Fixed Typo in Warning (#5284 ) * Fixed typo in cli warning Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat. * Create and signed contributor form	2020-04-09 15:46:15 +02:00
Sofie Van Landeghem	42364dcd9f	Remove "pala" tokenizer exception for Spanish (#5265 )	2020-04-09 10:21:20 +02:00
adrianeboyd	cf579a398d	Add __init__.py to eu and hy tests (#5278 )	2020-04-08 20:03:06 +02:00
adrianeboyd	ae4af52ce7	Add ideographic stops to sentencizer (#5263 ) Add ideographic half- and fullwidth full stops to default sentencizer punctuation.	2020-04-08 12:58:39 +02:00
adrianeboyd	fa760010a5	Set rank for new vector in Vocab.set_vector (#5266 ) Set `Lexeme.rank` for vectors added with `Vocab.set_vector` so that the lexeme `ID` accessed by a model points the right row for the new vector.	2020-04-07 12:04:51 +02:00
lfiedler	e1e25c7e30	issue5230: added unittest test case for completion	2020-04-06 21:36:02 +02:00
Leander Fiedler	cde96f6c64	issue5230: optimized unit test a bit	2020-04-06 20:51:12 +02:00
Leander Fiedler	71cc903d65	issue5230: replaced open statements on path objects so that serialization still works and files are closed	2020-04-06 20:30:41 +02:00
Leander Fiedler	273ed452bb	issue5230: added unicode declaration at top of the file	2020-04-06 19:22:32 +02:00
Leander Fiedler	1cd975d4a5	issue5230: fixed resource warnings in language	2020-04-06 18:54:32 +02:00
Leander Fiedler	493c77462a	issue5230: test cases covering known sources of resource warnings	2020-04-06 18:46:51 +02:00
adrianeboyd	c981aa6684	Use inline flags in token_match patterns (#5257 ) * Use inline flags in token_match patterns Use inline flags in `token_match` patterns so that serializing does not lose the flag information. * Modify inline flag * Modify inline flag	2020-04-06 13:19:04 +02:00
adrianeboyd	e8be15e9b7	Improve tokenization for UD Spanish AnCora (#5253 )	2020-04-06 13:18:23 +02:00
adrianeboyd	f4ef64a526	Improve tokenization for UD Dutch corpora (#5259 ) * Improve tokenization for UD Dutch corpora Improve tokenization for UD Dutch Alpino and LassySmall. * Format Dutch tokenizer exceptions	2020-04-06 13:18:07 +02:00
Muhammad Irfan	406d5748b3	add missing Urdu tags	2020-04-05 20:55:38 +05:00
Sofie Van Landeghem	b2e93be867	Optimizer defaults (#5244 ) * set optimizer defaults to mimic thinc 7 + bump to dev6 * larger error range for senter overfitting test	2020-04-03 13:02:46 +02:00
YohannesDatasci	beef184e53	Armenian language support (#5246 ) * add Armenian language and test cases * agreement submission	2020-04-03 13:02:18 +02:00
Michael Leichtfried	2b14997b68	Remove duplicated branch in if/else-if statement (#5234 ) * Remove duplicated branch in if-elif-statement * Add contributor agreement for leicmi	2020-04-02 14:47:42 +02:00
adrianeboyd	b71a11ff6d	Update morphologizer (#5108 ) * Add pos and morph scoring to Scorer Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph accuracy in `spacy evaluate`. * Update morphologizer for v3 * switch to tagger-based morphologizer * use `spacy.HashCharEmbedCNN` for morphologizer defaults * add `Doc.is_morphed` flag * Add morphologizer to train CLI * Add basic morphologizer pipeline tests * Add simple morphologizer training example * Remove subword_features from CharEmbed models Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and `spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features` is always `False`. * Rename setting in morphologizer example Use `with_pos_tags` instead of `without_pos_tags`. * Fix kwargs for spacy.HashCharEmbedBiLSTM.v1 * Remove defaults for spacy.HashCharEmbedBiLSTM.v1 Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`. * Set random seed for textcat overfitting test	2020-04-02 14:46:32 +02:00
adrianeboyd	d107afcffb	Raise error for inplace resize with new vector dim (#5228 ) Raise an error if there is an attempt to resize the vectors in place with a different vector dimension.	2020-04-02 10:43:13 +02:00
Jacob Lauritzen	0b76212831	Extend and fix Danish examples (#5227) * Extend and fix Danish examples This PR fixes two examples, adds additional examples translated from the English version, and adds punctuation. The two changed examples are: * "fortov" changed to "fortovet", which is more [used](https://www.google.com/search?client=firefox-b-d&sxsrf=ALeKk0143gEuPe4IbIUpzBBt-oU10OMVqA%3A1585549036477&ei=7I6BXuvJHMGOrwSqi46oCQ&q=l%C3%B8behjul+p%C3%A5+fortov&oq=l%C3%B8behjul+p%C3%A5+fortov&gs_lcp=CgZwc3ktYWIQAzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQRzIECAAQR1DT8xZY0_MWYK_0FmgAcAZ4AIABAIgBAJIBAJgBAKABAaoBB2d3cy13aXo&sclient=psy-ab&ved=0ahUKEwjr7964xsHoAhVBx4sKHaqFA5UQ4dUDCAo&uact=5) and more natural. The Swedish and Norwegian examples also use this version of the word. * "stor by" changed to "storby". In Danish we have a specific noun to describe a large, metropolitan city which is different from just describing a city as "large". In this sentence it would be much more natural to describe London as a "storby". Google even corrects a search for "London stor by" to "London storby". * Sign contrib agreement	2020-04-02 10:42:35 +02:00
Sofie Van Landeghem	ab59f3124e	fix NEL overfitting test for GPU (#5236 )	2020-04-02 10:32:52 +02:00
Sofie Van Landeghem	311133e579	Train textcat with config (#5143 ) * bring back default build_text_classifier method * remove _set_dims_ hack in favor of proper dim inference * add tok2vec initialize to unit test * small fixes * add unit test for various textcat config settings * logistic output layer does not have nO * fix window_size setting * proper fix * fix W initialization * Update textcat training example * Use ml_datasets * Convert training data to `Example` format * Use `n_texts` to set proportionate dev size * fix _init renaming on latest thinc * avoid setting a non-existing dim * update to thinc==8.0.0a2 * add BOW and CNN defaults for easy testing * various experiments with train_textcat script, fix softmax activation in textcat bow * allow textcat train script to work on other datasets as well * have dataset as a parameter * train textcat from config, with example config * add config for training textcat * formatting * fix exclusive_classes * fixing BOW for GPU * bump thinc to 8.0.0a3 (not published yet so CI will fail) * add in link_vectors_to_models which got deleted Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2020-03-29 19:40:36 +02:00
adrianeboyd	ce0e538068	Check whether doc is instantiated in Example.get_gold_parses() (#5167 ) * Check whether doc is instantiated When creating docs to pair with gold parses, modify test to check whether a doc is unset rather than whether it contains tokens. * Restore test of evaluate on an empty doc * Set a minimal gold.orig for the scorer Without a minimal gold.orig the scorer can't evaluate empty docs. This is the v3 equivalent of #4925.	2020-03-29 13:57:00 +02:00
Sofie Van Landeghem	d6d95674c1	bugfix in span similarity (#5155 ) * bugfix in span similarity * also rewrite doc.pyx for clarity * formatting	2020-03-29 13:56:07 +02:00
Nikhil Saldanha	4f27a24f5b	Add kannada examples (#5162 ) * Add example sentences for Kannada * sign contributor agreement	2020-03-29 13:54:42 +02:00
adrianeboyd	d47b810ba4	Fix exclusive_classes in textcat ensemble (#5166 ) Pass the exclusive_classes setting to the bow model within the ensemble textcat model.	2020-03-29 13:52:34 +02:00
adrianeboyd	963bd890c1	Modify Vector.resize to work with cupy and improve resizing (#5216 ) * Modify Vector.resize to work with cupy Modify `Vectors.resize` to work with cupy. Modify behavior when resizing to a different vector dimension so that individual vectors are truncated or extended with zeros instead of having the original values filled into the new shape without regard for the original axes. * Update spacy/tests/vocab_vectors/test_vectors.py Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-03-29 13:51:20 +02:00
Sofie Van Landeghem	1f9852abc3	Fix parser @ GPU (#5210 ) * ensure self.bias is numpy array in parser model * 2 more little bug fixes for parser on GPU * removing testing GPU statement * remove commented code	2020-03-28 23:09:35 +01:00
Sofie Van Landeghem	9b412516e7	Fixing pickling of the parser (#5218 ) * fix __reduce__ for pickling parser * setting the move object as 'state' during pickling * unskip test_issue4725 - works again	2020-03-27 19:35:26 +01:00
Ines Montani	92b9b631ef	xfail -> skip	2020-03-27 10:51:32 +01:00
Ines Montani	ee4bb0e3b6	Fix import	2020-03-26 21:44:18 +01:00
Ines Montani	4fe2299586	xfail hanging test	2020-03-26 20:58:13 +01:00
Ines Montani	f12a46472c	Remove unicode declarations	2020-03-26 15:18:32 +01:00

7312 Commits