spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-12 23:22:38 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	f1189dc205	Draft tests for new Example class	2020-06-09 15:43:08 +02:00
Matthew Honnibal	c833ebe1ad	Start tests for new example class	2020-06-09 15:29:05 +02:00
Matthew Honnibal	453cfa14d0	Start drafting new example class	2020-06-09 15:28:42 +02:00
Matthew Honnibal	449000c234	Fix gold_io	2020-06-09 12:43:53 +02:00
Matthew Honnibal	cb08ce3936	Move alignment into Cython	2020-06-09 12:40:41 +02:00
Matthew Honnibal	20a1bdb298	Fix train	2020-06-09 12:33:29 +02:00
Matthew Honnibal	549164c31c	Fix corpus when no raw text supplied	2020-06-09 12:33:14 +02:00
adrianeboyd	b7e6e1b9a7	Disable sentence segmentation in ja tokenizer (#5566)	2020-06-09 12:00:59 +02:00
Sofie Van Landeghem	86112d2168	update issue manager's version	2020-06-09 08:57:38 +02:00
Matthew Honnibal	d9289712ba	* Make GoldCorpus return dict, not Example * Make Example require a Doc object (previously optional) Clarify methods in GoldCorpus WIP refactor Example Refactor Example.split_sents Fix test Fix augment Update test Update test Fix import Update test_scorer Update Example	2020-06-09 01:01:59 +02:00
Matthew Honnibal	084271c9e9	Remove GoldParse from public API * Move get_parses_from_example to spacy.syntax * Get GoldParse out of Example * Avoid expecting GoldParse input in parser * Add Alignment to spacy.gold.align * Update Example object * Add comment * Update pipeline * Fix imports * Simplify gold_io * WIP on GoldCorpus * Update test * Xfail some gold tests * Remove ignore_misaligned option from GoldCorpus * Fix Example constructor * Update test * Fix usage of Example * Add deprecated_get_gold method on Example * Patch scorer * Fix test * Fix test * Update tests * Xfail a test * Fix passing of make_projective * Pass make_projective by default * Hack data format in Example.from_dict * Update tests * Fix example.from_dict * Update morphologizer * Fix entity linker * Add get_field to TokenAnnotation * Fix Example.get_aligned * Update test * Fix alignment * Fix corpus * Fix GoldCorpus * Handle misaligned * Format * Fix missing import	2020-06-08 22:09:57 +02:00
adrianeboyd	f162815f45	Handle empty and whitespace-only docs for Japanese (#5564) Handle empty and whitespace-only docs in the custom alignment method used by the Japanese tokenizer.	2020-06-08 21:09:23 +02:00
Martino Mensio	de00f967ce	adding spacy-universal-sentence-encoder (#5534) * adding spacy-universal-sentence-encoder * update affiliation * updated code example	2020-06-08 20:26:30 +02:00
Sofie Van Landeghem	d1799da200	bot for answered issues (#5563) * add tiangolo's issue manager * fix formatting * spaces, tabs, who knows * formatting * I'll get this right at some point * maybe one more space ?	2020-06-08 19:47:32 +02:00
adrianeboyd	3bf111585d	Update Japanese tokenizer config and add serialization (#5562) * Use `config` dict for tokenizer settings * Add serialization of split mode setting * Add tests for tokenizer split modes and serialization of split mode setting Based on #5561	2020-06-08 16:29:05 +02:00
Hiroshi Matsuda	456bf47f51	fix a bug causing mis-alignments (#5560)	2020-06-08 15:49:34 +02:00
Matthew Honnibal	b69fa77ccc	Add missing inits	2020-06-06 15:38:46 +02:00
Matthew Honnibal	6e87ca1f45	Fix imports	2020-06-06 15:36:58 +02:00
Matthew Honnibal	53b00991fd	Fix imports	2020-06-06 15:36:46 +02:00
Matthew Honnibal	74204116a3	Rename _gold -> gold	2020-06-06 15:29:32 +02:00
Matthew Honnibal	7f135736f4	Fix imports	2020-06-06 15:28:52 +02:00
Matthew Honnibal	17533a9286	Format	2020-06-06 15:13:07 +02:00
Matthew Honnibal	0f9b4bbfea	Fix imports	2020-06-06 15:12:52 +02:00
Matthew Honnibal	866179350b	Fix import	2020-06-06 15:11:13 +02:00
Matthew Honnibal	3baa1ada03	Refactor spacy.gold	2020-06-06 15:10:33 +02:00
Matthew Honnibal	1d2e39d974	Support to_dict in Doc	2020-06-06 15:10:10 +02:00
Matthew Honnibal	7b873ce2b1	Move GoldParse under spacy.syntax	2020-06-06 15:09:43 +02:00
Matthew Honnibal	32c8fb1372	Add gold_io.pyx	2020-06-06 14:41:49 +02:00
Matthew Honnibal	156466ca69	Add iob_utils	2020-06-06 14:39:14 +02:00
Matthew Honnibal	53e6473e24	Add to/from dict helpers	2020-06-06 14:29:06 +02:00
Matthew Honnibal	a663d44b1b	Add GoldCorpus	2020-06-06 14:28:37 +02:00
Matthew Honnibal	1fb8fc6ea9	Add Example class	2020-06-06 14:24:35 +02:00
Matthew Honnibal	cce6a51a9c	Add annotation classes	2020-06-06 14:22:27 +02:00
Matthew Honnibal	6005b94e74	Add data augmentation	2020-06-06 14:19:06 +02:00
Matthew Honnibal	fcb4f7a6db	Start breaking down gold.pyx	2020-06-06 14:15:12 +02:00
adrianeboyd	009119fa66	Requirements/setup for Japanese (#5553) * Add sudachipy and sudachidict_core to Makefile * Switch ja requirements from fugashi to sudachipy	2020-06-06 00:22:18 +02:00
Ines Montani	d93cbeb14f	Add warning for loose version constraints (#5536) * Add warning for loose version constraints * Update wording [ci skip] * Tweak error message Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-06-05 12:42:15 +02:00
adrianeboyd	1ac43d78f9	Avoid libc.stdint for UINT64_MAX (#5545)	2020-06-04 20:02:05 +02:00
Sofie Van Landeghem	4d1ba6feb4	add tag variant for 2.3 (#5542)	2020-06-04 19:16:33 +02:00
Paul O'Leary McCann	410fb7ee43	Add Japanese Model (#5544) * Add more rules to deal with Japanese UD mappings Japanese UD rules sometimes give different UD tags to tokens with the same underlying POS tag. The UD spec indicates these cases should be disambiguated using the output of a tool called "comainu", but rules are enough to get the right result. These rules are taken from Ginza at time of writing, see #3756. * Add new tags from GSD This is a few rare tags that aren't in Unidic but are in the GSD data. * Add basic Japanese sentencization This code is taken from Ginza again. * Add sentenceizer quote handling Could probably add more paired characters but this will do for now. Also includes some tests. * Replace fugashi with SudachiPy * Modify tag format to match GSD annotations Some of the tests still need to be updated, but I want to get this up for testing training. * Deal with case with closing punct without opening * refactor resolve_pos() * change tag field separator from "," to "-" * add TAG_ORTH_MAP * add TAG_BIGRAM_MAP * revise rules for 連体詞 * revise rules for 連体詞 * improve POS about 2% * add syntax_iterator.py (not mature yet) * improve syntax_iterators.py * improve syntax_iterators.py * add phrases including nouns and drop NPs consist of STOP_WORDS * First take at noun chunks This works in many situations but still has issues in others. If the start of a subtree has no noun, then nested phrases can be generated. また行きたい、そんな気持ちにさせてくれるお店です。 [そんな気持ち, また行きたい、そんな気持ちにさせてくれるお店] For some reason て gets included sometimes. Not sure why. ゲンに連れ添って円盤生物を調査するパートナーとなる。 [て円盤生物, ...] Some phrases that look like they should be split are grouped together; not entirely sure that's wrong. This whole thing becomes one chunk: 道の駅遠山郷北側からかぐら大橋南詰現道交点までの1.060kmのみ開通済み * Use new generic get_words_and_spaces The new get_words_and_spaces function is simpler than what was used in Japanese, so it's good to be able to switch to it. However, there was an issue. The new function works just on text, so POS info could get out of sync. Fixing this required a small change to the way dtokens (tokens with POS and lemma info) were generated. Specifically, multiple extraneous spaces now become a single token, so when generating dtokens multiple space tokens should be created in a row. * Fix noun_chunks, should be working now * Fix some tests, add naughty strings tests Some of the existing tests changed because the tokenization mode of Sudachi changed to the more fine-grained A mode. Sudachi also has issues with some strings, so this adds a test against the naughty strings. * Remove empty Sudachi tokens Not doing this creates zero-length tokens and causes errors in the internal spaCy processing. * Add yield_bunsetu back in as a separate piece of code Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com> Co-authored-by: hiroshi <hiroshi_matsuda@megagon.ai>	2020-06-04 19:15:43 +02:00
Matthew Honnibal	8411d4f4e6	Merge pull request #5543 from svlandeg/feature/pretrain-config pretrain from config	2020-06-04 19:07:12 +02:00
svlandeg	3ade455fd3	formatting	2020-06-04 16:09:55 +02:00
svlandeg	776d4f1190	cleanup	2020-06-04 16:07:30 +02:00
svlandeg	6b027d7689	remove duplicate model definition of tok2vec layer	2020-06-04 15:49:23 +02:00
svlandeg	1775f54a26	small little fixes	2020-06-03 22:17:02 +02:00
svlandeg	07886a3de3	rename init_tok2vec to resume	2020-06-03 22:00:25 +02:00
svlandeg	4ed6278663	small fixes to pretrain config, init_tok2vec TODO	2020-06-03 19:32:40 +02:00
Ines Montani	d79964bcb1	Merge pull request #5535 from adrianeboyd/feature/model-spacy-version-check	2020-06-03 15:35:20 +02:00
Ines Montani	56a9d1b78c	Merge pull request #5479 from explosion/master-tmp	2020-06-03 15:31:27 +02:00
svlandeg	ddf8244df9	add normalize option to distance metric	2020-06-03 14:52:54 +02:00

11991 Commits