spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-15 22:51:58 +03:00

Author	SHA1	Message	Date
Sofie Van Landeghem	c0f4a1e43b	train is from-config by default (#5575) * verbose and tag_map options * adding init_tok2vec option and only changing the tok2vec that is specified * adding omit_extra_lookups and verifying textcat config * wip * pretrain bugfix * add replace and resume options * train_textcat fix * raw text functionality * improve UX when KeyError or when input data can't be parsed * avoid unnecessary access to goldparse in TextCat pipe * save performance information in nlp.meta * add noise_level to config * move nn_parser's defaults to config file * multitask in config - doesn't work yet * scorer offering both F and AUC options, need to be specified in config * add textcat verification code from old train script * small fixes to config files * clean up * set default config for ner/parser to allow create_pipe to work as before * two more test fixes * small fixes * cleanup * fix NER pickling + additional unit test * create_pipe as before	2020-06-12 02:02:07 +02:00
adrianeboyd	556895177e	Expand Japanese requirements warning (#5572) Include explicit install instructions in Japanese requirements warning.	2020-06-11 13:47:37 +02:00
adrianeboyd	fe167fcf7d	Update pytest conf for sudachipy with Japanese (#5574)	2020-06-11 10:23:50 +02:00
Jones Martins	bab30e4ad2	Add "c'mon" token exception (#5570) * Add "c'mon" exception * Fix typo in "C'mon" exception	2020-06-10 21:54:06 +02:00
Jones Martins	28db7dd5d9	Add missing pronoums/determiners (#5569) * Add missing pronoums/determiners * Add test for missing pronoums * Add contributor file	2020-06-10 18:47:04 +02:00
adrianeboyd	0a70bd6281	Bump version to 2.3.0.dev1 (#5567)	2020-06-09 15:47:31 +02:00
adrianeboyd	b7e6e1b9a7	Disable sentence segmentation in ja tokenizer (#5566)	2020-06-09 12:00:59 +02:00
adrianeboyd	f162815f45	Handle empty and whitespace-only docs for Japanese (#5564) Handle empty and whitespace-only docs in the custom alignment method used by the Japanese tokenizer.	2020-06-08 21:09:23 +02:00
adrianeboyd	3bf111585d	Update Japanese tokenizer config and add serialization (#5562) * Use `config` dict for tokenizer settings * Add serialization of split mode setting * Add tests for tokenizer split modes and serialization of split mode setting Based on #5561	2020-06-08 16:29:05 +02:00
Hiroshi Matsuda	456bf47f51	fix a bug causing mis-alignments (#5560)	2020-06-08 15:49:34 +02:00
Ines Montani	d93cbeb14f	Add warning for loose version constraints (#5536) * Add warning for loose version constraints * Update wording [ci skip] * Tweak error message Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-06-05 12:42:15 +02:00
adrianeboyd	1ac43d78f9	Avoid libc.stdint for UINT64_MAX (#5545)	2020-06-04 20:02:05 +02:00
Paul O'Leary McCann	410fb7ee43	Add Japanese Model (#5544) * Add more rules to deal with Japanese UD mappings Japanese UD rules sometimes give different UD tags to tokens with the same underlying POS tag. The UD spec indicates these cases should be disambiguated using the output of a tool called "comainu", but rules are enough to get the right result. These rules are taken from Ginza at time of writing, see #3756. * Add new tags from GSD This is a few rare tags that aren't in Unidic but are in the GSD data. * Add basic Japanese sentencization This code is taken from Ginza again. * Add sentenceizer quote handling Could probably add more paired characters but this will do for now. Also includes some tests. * Replace fugashi with SudachiPy * Modify tag format to match GSD annotations Some of the tests still need to be updated, but I want to get this up for testing training. * Deal with case with closing punct without opening * refactor resolve_pos() * change tag field separator from "," to "-" * add TAG_ORTH_MAP * add TAG_BIGRAM_MAP * revise rules for 連体詞 * revise rules for 連体詞 * improve POS about 2% * add syntax_iterator.py (not mature yet) * improve syntax_iterators.py * improve syntax_iterators.py * add phrases including nouns and drop NPs consist of STOP_WORDS * First take at noun chunks This works in many situations but still has issues in others. If the start of a subtree has no noun, then nested phrases can be generated. また行きたい、そんな気持ちにさせてくれるお店です。 [そんな気持ち, また行きたい、そんな気持ちにさせてくれるお店] For some reason て gets included sometimes. Not sure why. ゲンに連れ添って円盤生物を調査するパートナーとなる。 [て円盤生物, ...] Some phrases that look like they should be split are grouped together; not entirely sure that's wrong. This whole thing becomes one chunk: 道の駅遠山郷北側からかぐら大橋南詰現道交点までの1.060kmのみ開通済み * Use new generic get_words_and_spaces The new get_words_and_spaces function is simpler than what was used in Japanese, so it's good to be able to switch to it. However, there was an issue. The new function works just on text, so POS info could get out of sync. Fixing this required a small change to the way dtokens (tokens with POS and lemma info) were generated. Specifically, multiple extraneous spaces now become a single token, so when generating dtokens multiple space tokens should be created in a row. * Fix noun_chunks, should be working now * Fix some tests, add naughty strings tests Some of the existing tests changed because the tokenization mode of Sudachi changed to the more fine-grained A mode. Sudachi also has issues with some strings, so this adds a test against the naughty strings. * Remove empty Sudachi tokens Not doing this creates zero-length tokens and causes errors in the internal spaCy processing. * Add yield_bunsetu back in as a separate piece of code Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com> Co-authored-by: hiroshi <hiroshi_matsuda@megagon.ai>	2020-06-04 19:15:43 +02:00
Matthew Honnibal	8411d4f4e6	Merge pull request #5543 from svlandeg/feature/pretrain-config pretrain from config	2020-06-04 19:07:12 +02:00
svlandeg	3ade455fd3	formatting	2020-06-04 16:09:55 +02:00
svlandeg	776d4f1190	cleanup	2020-06-04 16:07:30 +02:00
svlandeg	6b027d7689	remove duplicate model definition of tok2vec layer	2020-06-04 15:49:23 +02:00
svlandeg	1775f54a26	small little fixes	2020-06-03 22:17:02 +02:00
svlandeg	07886a3de3	rename init_tok2vec to resume	2020-06-03 22:00:25 +02:00
svlandeg	4ed6278663	small fixes to pretrain config, init_tok2vec TODO	2020-06-03 19:32:40 +02:00
svlandeg	ffe0451d09	pretrain from config	2020-06-03 14:45:00 +02:00
Ines Montani	a8875d4a4b	Fix typo	2020-06-03 14:42:39 +02:00
Ines Montani	4e0610d0d4	Update warning codes	2020-06-03 14:37:09 +02:00
Ines Montani	810fce3bb1	Merge branch 'develop' into master-tmp	2020-06-03 14:36:59 +02:00
Adriane Boyd	b0ee76264b	Remove debugging	2020-06-03 14:20:42 +02:00
Adriane Boyd	1d8168d1fd	Fix problems with lower and whitespace in variants Port relevant changes from #5361: * Initialize lower flag explicitly * Handle whitespace words from GoldParse correctly when creating raw text with orth variants	2020-06-03 14:15:58 +02:00
Adriane Boyd	10d938f221	Update default cfg dir in train CLI	2020-06-03 14:15:50 +02:00
Adriane Boyd	f1f9c8b417	Port train CLI updates Updates from #5362 and fix from #5387: * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-06-03 14:03:43 +02:00
Adriane Boyd	8c758ed1eb	Fix meta path	2020-06-03 12:11:57 +02:00
Adriane Boyd	a57bdeecac	Test util.get_model_meta instead of util.load_model	2020-06-03 12:10:12 +02:00
svlandeg	eac12cbb77	make dropout in embed layers configurable	2020-06-03 11:50:16 +02:00
svlandeg	e91485dfc4	add discard_oversize parameter, move optimizer to training subsection	2020-06-03 10:04:16 +02:00
svlandeg	03c58b488c	prevent infinite loop, custom warning	2020-06-03 10:00:21 +02:00
svlandeg	6504b7f161	Merge remote-tracking branch 'upstream/develop' into feature/pretrain-config	2020-06-03 08:30:16 +02:00
svlandeg	c5ac382f0a	fix name clash	2020-06-02 22:24:57 +02:00
svlandeg	2bf5111ecf	additional test with discard_oversize=False	2020-06-02 22:09:37 +02:00
svlandeg	aa6271b16c	extending algorithm to deal better with edge cases	2020-06-02 22:05:08 +02:00
svlandeg	f2e162fc60	it's only oversized if the tolerance level is also exceeded	2020-06-02 19:59:04 +02:00
svlandeg	ef834b4cd7	fix comments	2020-06-02 19:50:44 +02:00
svlandeg	6208d322d3	slightly more challenging unit test	2020-06-02 19:47:30 +02:00
svlandeg	6651fafd5c	using overflow buffer for examples within the tolerance margin	2020-06-02 19:43:39 +02:00
svlandeg	85b0597ed5	add test for minibatch util	2020-06-02 18:26:21 +02:00
svlandeg	5b350a6c99	bugfix of the bugfix	2020-06-02 17:49:33 +02:00
Adriane Boyd	75f08ad62d	Remove unnecessary check	2020-06-02 17:41:25 +02:00
Adriane Boyd	bbc1836581	Add rudimentary version checks on model load	2020-06-02 17:33:48 +02:00
svlandeg	fdfd822936	rewrite minibatch_by_words function	2020-06-02 15:22:54 +02:00
svlandeg	ec52e7f886	add oversize examples before StopIteration returns	2020-06-02 13:21:55 +02:00
svlandeg	e0f9f448f1	remove Tensorizer	2020-06-01 23:38:48 +02:00
Leo	925e938570	Spanish tokenizer exception and examples improvement (#5531) * Spanish tokenizer exception additions. Added Spanish question examples * erased slang tokenization examples	2020-06-01 18:18:34 +02:00
Matthew Honnibal	67af3a32b0	Merge pull request #5527 from adrianeboyd/bugfix/tagger-sp-tag-map Preserve _SP when filtering tag map in Tagger	2020-06-01 12:00:21 +02:00
Leo	c21c308ecb	corrected issue #5524 changed <U+009C> 'STRING TERMINATOR' for <U+0153> LATIN SMALL LIGATURE OE' (#5526)	2020-05-31 22:08:12 +02:00
Adriane Boyd	a005ccd6d7	Preserve _SP when filtering tag map in Tagger To allow "SP" as a tag (for Chinese OntoNotes), preserve "_SP" if present as the reference `SPACE` POS in the tag map in `Tagger.begin_training()`.	2020-05-31 19:57:54 +02:00
Ines Montani	b5ae2edcba	Merge pull request #5516 from explosion/feature/improve-model-version-deps	2020-05-31 12:54:01 +02:00
Ines Montani	dc186afdc5	Add warning	2020-05-30 15:34:54 +02:00
Ines Montani	b7aff6020c	Make functions more general purpose and update docstrings and tests	2020-05-30 15:18:53 +02:00
Ines Montani	a7e370bcbf	Don't override spaCy version	2020-05-30 15:03:18 +02:00
Ines Montani	e47e5a4b10	Use more sophisticated version parsing logic	2020-05-30 15:01:58 +02:00
svlandeg	15134ef611	fix deserialization order	2020-05-30 12:53:32 +02:00
Matthew Honnibal	64adda3202	Revert "Remove peeking from Parser.begin_training (#5456)" This reverts commit `9393253b66`. The model shouldn't need to see all examples, and actually in v3 there's no equivalent step. All examples are provided to the component, for the component to do stuff like figuring out the labels. The model just needs to do stuff like shape inference.	2020-05-29 23:21:55 +02:00
Matthew Honnibal	85f1acfaa0	Merge pull request #5517 from adrianeboyd/bugfix/morph-repr Remove MorphAnalysis __str__ and __repr__	2020-05-29 19:20:56 +02:00
svlandeg	291483157d	prevent loading a pretrained Tok2Vec layer AND pretrained components	2020-05-29 17:38:33 +02:00
Adriane Boyd	e1b7cbd197	Remove MorphAnalysis __str__ and __repr__	2020-05-29 14:33:47 +02:00
Ines Montani	4fd087572a	WIP: improve model version deps	2020-05-28 12:51:37 +02:00
Matthw Honnibal	58750b06f8	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2020-05-27 22:18:36 +02:00
Matthew Honnibal	aecd1437cc	Merge pull request #5508 from adrianeboyd/bugfix/tag-map-sp-tag Prefer _SP over SP for default tag map space attrs	2020-05-27 20:39:40 +02:00
Adriane Boyd	25de2a2191	Improve vector name loading from model meta	2020-05-27 14:48:54 +02:00
adrianeboyd	aad0610a85	Map NR to PROPN (#5512)	2020-05-26 22:30:53 +02:00
Adriane Boyd	b6b5908f5e	Prefer _SP over SP for default tag map space attrs If `_SP` is already in the tag map, use the mapping from `_SP` instead of `SP` so that `SP` can be a valid non-space tag. (Chinese has a non-space tag `SP` which was overriding the mapping of `_SP` to `SPACE`.)	2020-05-26 14:57:13 +02:00
Adriane Boyd	1eed101be9	Fix Polish lemmatizer for deserialized models Restructure Polish lemmatizer not to depend on lookups data in `__init__` since the lemmatizer is initialized before the lookups data is loaded from a saved model. The lookups tables are accessed first in `__call__` instead once the data is available.	2020-05-26 09:56:12 +02:00
Ines Montani	24ef6680fa	Merge pull request #5499 from adrianeboyd/chore/bump-version-deps-v2.3.0	2020-05-25 13:25:45 +02:00
Adriane Boyd	3f727bc539	Switch to v2.3.0.dev0	2020-05-25 12:57:20 +02:00
Adriane Boyd	736f3cb5af	Bump version and deps for v2.3.0 * spacy to v2.3.0 * thinc to v7.4.1 * spacy-lookups-data to v0.3.2	2020-05-25 12:03:49 +02:00
Adriane Boyd	e06ca7ea24	Switch to new add API in PhraseMatcher unpickle	2020-05-25 11:22:47 +02:00
Ines Montani	1a15896ba9	unicode -> str consistency [ci skip]	2020-05-24 18:51:10 +02:00
Ines Montani	5d3806e059	unicode -> str consistency	2020-05-24 17:20:58 +02:00
Ines Montani	387c7aba15	Update test	2020-05-24 14:55:16 +02:00
Ines Montani	f9786d765e	Simplify is_package check	2020-05-24 14:48:56 +02:00
Matthw Honnibal	2d9de8684d	Support use_pytorch_for_gpu_memory config	2020-05-22 23:10:40 +02:00
Ines Montani	4465cad6c5	Rename spacy.analysis to spacy.pipe_analysis	2020-05-22 17:42:06 +02:00
Ines Montani	25d6ed3fb8	Merge pull request #5489 from explosion/feature/connected-components	2020-05-22 17:40:11 +02:00
Ines Montani	841c05b47b	Merge pull request #5490 from explosion/fix/remove-jsonschema	2020-05-22 17:39:54 +02:00
Ines Montani	569a65b60e	Auto-format	2020-05-22 16:55:42 +02:00
Ines Montani	d844528c5f	Add test for is_compatible_model	2020-05-22 16:55:15 +02:00
Ines Montani	12b7be1d98	Remove jsonschema from dependencies	2020-05-22 16:49:26 +02:00
Matthew Honnibal	f7f6df7275	Move to spacy.analysis	2020-05-22 16:43:18 +02:00
Matthew Honnibal	78d79d94ce	Guess set_annotations=True in nlp.update During `nlp.update`, components can be passed a boolean set_annotations to indicate whether they should assign annotations to the `Doc`. This needs to be called if downstream components expect to use the annotations during training, e.g. if we wanted to use tagger features in the parser. Components can specify their assignments and requirements, so we can figure out which components have these inter-dependencies. After figuring this out, we can guess whether to pass set_annotations=True. We could also call set_annotations=True always, or even just have this as the only behaviour. The downside of this is that it would require the `Doc` objects to be created afresh to avoid problematic modifications. One approach would be to make a fresh copy of the `Doc` objects within `nlp.update()`, so that we can write to the objects without any problems. If we do that, we can drop this logic and also drop the `set_annotations` mechanism. I would be fine with that approach, although it runs the risk of introducing some performance overhead, and we'll have to take care to copy all extension attributes etc.	2020-05-22 15:55:45 +02:00
Ines Montani	6728747f71	Merge pull request #5486 from explosion/fix/compat-py2	2020-05-22 15:47:21 +02:00
Ines Montani	6e6db6afb6	Better model compatibility and validation	2020-05-22 15:42:46 +02:00
Matthew Honnibal	f6078d866a	Merge pull request #5121 from adrianeboyd/bugfix/revert-token-match Revert token_match priority changes from #4374 and extend token match options	2020-05-22 14:42:51 +02:00
Ines Montani	c685ee734a	Fix compat for v2.x branch	2020-05-22 14:22:36 +02:00
Adriane Boyd	e4a1b5dab1	Rename to url_match Rename to `url_match` and update docs.	2020-05-22 12:41:03 +02:00
Adriane Boyd	730fa493a4	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-22 12:18:00 +02:00
Adriane Boyd	71fe61fdcd	Disallow merging 0-length spans	2020-05-22 10:14:34 +02:00
Matthew Honnibal	93c4d13588	Merge pull request #5264 from lfiedler/issue-5230 Fix ResourceWarnings during unittest	2020-05-22 00:31:07 +02:00
Matthew Honnibal	e1cb7e838b	Merge pull request #5481 from explosion/feature/blank-shortcut-v2 Add blank:{lang} shortcut support to util.load_model	2020-05-22 00:08:23 +02:00
Ines Montani	2250380816	Merge pull request #5482 from explosion/fix/backwards-compat-super	2020-05-21 21:51:46 +02:00
Ines Montani	891fa59009	Use backwards-compatible super()	2020-05-21 20:52:48 +02:00
Matthew Honnibal	5ce02c1b17	Merge pull request #5470 from svlandeg/bugfix/noun-chunks Bugfix in noun chunks	2020-05-21 20:51:31 +02:00
Matthw Honnibal	25b51f4fc8	Set version to v3.0.0.dev9	2020-05-21 20:47:52 +02:00
Matthw Honnibal	bc94fdabd0	Fix begin_training	2020-05-21 20:46:21 +02:00

7058 Commits