spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-02-16 03:20:34 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	0c10831b14	Start debugging arc_eager oracle	2020-06-20 21:49:46 +02:00
Matthew Honnibal	2bcb5881d7	Fix parser model	2020-06-20 21:49:31 +02:00
Matthew Honnibal	396dd60b3a	Fix Corpus	2020-06-20 21:49:15 +02:00
Matthew Honnibal	450c6fe39c	Update train.py	2020-06-20 21:49:06 +02:00
svlandeg	c9242e9bf4	fix entity linker (cf PR #5548 )	2020-06-20 21:47:23 +02:00
svlandeg	dc069e90b3	fix token.morph_ for v.3 (cf PR #5517 )	2020-06-20 21:13:11 +02:00
Matthew Honnibal	6d821b2e55	Make doc.from_array several times faster	2020-06-20 20:17:13 +02:00
Matthew Honnibal	fa86aa581d	Allocate Doc before starting to add words	2020-06-20 20:15:21 +02:00
Matthew Honnibal	652f31d3ee	Update DocBin	2020-06-20 20:12:54 +02:00
Matthew Honnibal	0a8b6631a2	Update Corpus	2020-06-20 20:12:31 +02:00
Matthew Honnibal	11fa0658f7	Work on train script	2020-06-20 20:12:19 +02:00
Ines Montani	988d2a4eda	Add --code-path option to train CLI (#5618 )	2020-06-20 18:43:12 +02:00
Matthew Honnibal	0de361cd00	Draft Corpus class for DocBin	2020-06-20 18:31:07 +02:00
Ines Montani	5424b70e51	Remove v2 test	2020-06-20 16:18:53 +02:00
Ines Montani	63c22969f4	Update test_issue5230.py	2020-06-20 16:17:48 +02:00
Ines Montani	296b5d633b	Remove references to Python 2 / is_python2	2020-06-20 16:11:13 +02:00
Matthew Honnibal	7360d3db72	Add json2docs converter	2020-06-20 16:02:53 +02:00
Ines Montani	0cdb631e6c	Fix merge errors	2020-06-20 16:02:42 +02:00
Matthew Honnibal	f1756a6a22	Remove jsonl converter	2020-06-20 16:02:40 +02:00
Matthew Honnibal	5d89b1840e	Update converter	2020-06-20 16:00:14 +02:00
Matthew Honnibal	f5780cb160	Serialize all attrs by default	2020-06-20 15:59:39 +02:00
Matthew Honnibal	3241acbe0b	Fix import	2020-06-20 15:56:28 +02:00
Matthew Honnibal	b7a366b435	Fix compile in ArcEager	2020-06-20 15:56:16 +02:00
Matthew Honnibal	91fa2f1126	Fix docbin	2020-06-20 15:56:05 +02:00
Matthew Honnibal	476bcd4c53	Fix import	2020-06-20 15:55:57 +02:00
Matthew Honnibal	7a846921a3	Make spacy convert output docbin	2020-06-20 15:55:35 +02:00
Ines Montani	52728d8fa3	Merge branch 'develop' into master-tmp	2020-06-20 15:52:00 +02:00
Ines Montani	f91e9e8c84	Remove F841 [ci skip]	2020-06-20 14:47:17 +02:00
Ines Montani	8283df80e9	Tidy up and auto-format	2020-06-20 14:15:04 +02:00
Matthew Honnibal	0d22c6e006	Allow DocBin to take list of Doc objects.	2020-06-20 03:50:36 +02:00
Matthew Honnibal	95df028758	Update converters	2020-06-20 03:50:23 +02:00
Matthew Honnibal	3a73d95dcc	Update converter to produce DocBin	2020-06-20 03:50:13 +02:00
Matthew Honnibal	d9a8fdf4b7	Fix name	2020-06-20 03:26:36 +02:00
Matthew Honnibal	e20a780867	Fix naming	2020-06-20 03:24:49 +02:00
Matthew Honnibal	f61d5e3ac3	Move things around	2020-06-20 03:23:58 +02:00
Matthew Honnibal	c630cfdb5e	Move converters under spacy.gold	2020-06-20 03:20:34 +02:00
Matthew Honnibal	161d8439fa	Start updating converters	2020-06-20 03:19:40 +02:00
Matthew Honnibal	a79f0598a6	Merge branch 'whatif/arrow' of https://github.com/explosion/spaCy into whatif/arrow	2020-06-20 02:36:40 +02:00
Matthew Honnibal	be81577719	Fix oracles	2020-06-20 02:36:12 +02:00
Marat M. Yavrumyan	8120b641cc	Update lex_attrs.py (#5608 )	2020-06-19 20:00:34 +02:00
svlandeg	e30ec9b2a8	fix test checking for variants	2020-06-19 14:05:35 +02:00
svlandeg	25b0674320	clean up	2020-06-19 11:31:01 +02:00
svlandeg	c705a28438	add links to to_dict	2020-06-19 11:22:24 +02:00
Matthew Honnibal	03db143cd0	Draft new GoldCorpus class	2020-06-19 04:15:02 +02:00
Matthew Honnibal	a389866df6	Merge branch 'whatif/arrow' of https://github.com/explosion/spaCy into whatif/arrow	2020-06-19 02:30:27 +02:00
Matthew Honnibal	bd29b7b14f	Update parser and NER gold stuff	2020-06-19 02:29:16 +02:00
Matthew Honnibal	5ae9e3480d	Return ArcEagerGoldParse from ArcEager	2020-06-19 00:11:59 +02:00
svlandeg	6ca6d7d6b4	test for split sentences with various alignment issues, works	2020-06-18 20:01:02 +02:00
svlandeg	1951921230	implement split_sent with aligned SENT_START attribute	2020-06-18 19:41:53 +02:00
svlandeg	d1d6f16776	fix the fix	2020-06-18 19:15:32 +02:00
svlandeg	e822367cf7	prevent writing dummy values like deps because that could interfere with sent_start values	2020-06-18 17:47:59 +02:00
svlandeg	0b6d45eae1	various small fixes	2020-06-18 15:55:00 +02:00
svlandeg	1c71f2310c	fix renames and simple_ner labels	2020-06-18 15:33:28 +02:00
svlandeg	64fc840a5d	bugfix tok2vec	2020-06-18 15:24:40 +02:00
svlandeg	01f9ae774c	small fixes	2020-06-18 14:01:19 +02:00
svlandeg	0c6f1f3891	fix BiluoPushDown parsing entities	2020-06-18 13:00:03 +02:00
svlandeg	cd790aaa2a	fix parser tests to work with example (most still failing)	2020-06-18 11:19:22 +02:00
svlandeg	9f43ba839a	throw informative error when running the components with the wrong type of objects	2020-06-18 10:36:05 +02:00
svlandeg	6712d0b5db	textcat bugfix	2020-06-18 10:09:56 +02:00
svlandeg	40b2b21eef	small bug fix	2020-06-17 23:33:51 +02:00
svlandeg	d6c4dd6eea	pipe() takes docs, not examples	2020-06-17 21:29:36 +02:00
svlandeg	0f123af35e	ensure test keeps working with non-linked entities	2020-06-17 21:13:38 +02:00
svlandeg	6d73e139b0	fix entity linker	2020-06-17 21:12:25 +02:00
svlandeg	be5934b827	fix tagger	2020-06-17 19:42:11 +02:00
svlandeg	10d396977e	add support for MORPH in to/from_array, fix morphologizer overfitting test	2020-06-17 17:48:07 +02:00
svlandeg	1a151b10d6	correct silly typo	2020-06-17 14:48:14 +02:00
svlandeg	f6c451b650	cleanup	2020-06-17 14:45:54 +02:00
svlandeg	2d9f406188	fix test_cli	2020-06-17 14:42:48 +02:00
svlandeg	f7ad8e8c83	various fixes in scripts - needs to be further tested	2020-06-17 12:05:58 +02:00
svlandeg	3c4f9e4cc4	fix augment (needs further testing)	2020-06-17 10:46:29 +02:00
svlandeg	4ed399c848	minibatch utility can deal with strings, docs or examples	2020-06-16 21:35:55 +02:00
svlandeg	8b66c11ff2	add spaces to json output format	2020-06-16 19:30:03 +02:00
svlandeg	ba80ad7efd	fixed some tests + WIP roundtrip unit test	2020-06-16 18:26:50 +02:00
Ines Montani	e9d3e177f0	Merge branch 'master' into v2.3.x	2020-06-16 16:31:38 +02:00
svlandeg	43d41d6bb6	allow None as BILUO annotation	2020-06-16 15:30:05 +02:00
svlandeg	44a0f9c2c8	test_gold_biluo_different_tokenization works	2020-06-16 15:21:20 +02:00
svlandeg	1c35b8efcd	fix spaces	2020-06-16 12:08:25 +02:00
svlandeg	6fea5fa4bd	attempt to fix cases with weird spaces	2020-06-16 11:52:29 +02:00
svlandeg	0702a1d3fb	fix test for misaligned	2020-06-15 23:10:47 +02:00
svlandeg	a28f8f369e	Fix many-to-one IOB codes	2020-06-15 23:06:22 +02:00
svlandeg	12886b787b	fixing NER one-to-many alignment	2020-06-15 22:44:17 +02:00
Matthew Honnibal	7ff447c5a0	Set version to v2.3.0	2020-06-15 18:22:25 +02:00
Matthew Honnibal	a0bf73a5dd	Merge branch 'whatif/arrow' of https://github.com/explosion/spaCy into whatif/arrow	2020-06-15 18:16:01 +02:00
Matthew Honnibal	c66f93299e	Remove TokenAnnotation code from nonproj	2020-06-15 18:14:47 +02:00
Matthew Honnibal	c95494739c	Fix import	2020-06-15 18:11:10 +02:00
Matthew Honnibal	8f978f2031	Fix import	2020-06-15 18:10:47 +02:00
Matthew Honnibal	95de7efaad	Draft create_gold_state for arc_eager oracle	2020-06-15 18:10:19 +02:00
svlandeg	68986a252e	additional tests for new get_aligned function	2020-06-15 17:42:40 +02:00
svlandeg	41d29983a7	start testing get_aligned	2020-06-15 17:16:01 +02:00
svlandeg	fd5f199feb	fixing language and scoring tests	2020-06-15 15:02:05 +02:00
Adriane Boyd	0d8405aafa	Updates to docstrings (#5589 )	2020-06-15 14:58:36 +02:00
Adriane Boyd	e867e9fa8f	Fix and add warnings related to spacy-lookups-data (#5588 ) * Fix warning message for lemmatization tables * Add a warning when the `lexeme_norm` table is empty. (Given the relatively lang-specific loading for `Lookups`, it seemed like too much overhead to dynamically extract the list of languages, so for now it's hard-coded.)	2020-06-15 14:58:29 +02:00
Arvind Srinivasan	f698007907	Added Tamil Example Sentences (#5583 ) * Added Examples for Tamil Sentences #### Description This PR add example sentences for the Tamil language which were missing as per issue #1107 #### Type of Change This is an enhancement. * Accepting spaCy Contributor Agreement * Signed on my behalf as an individual	2020-06-15 14:58:21 +02:00
Adriane Boyd	c94f7d0e75	Updates to docstrings (#5589 )	2020-06-15 14:56:51 +02:00
Adriane Boyd	c482f20778	Fix and add warnings related to spacy-lookups-data (#5588 ) * Fix warning message for lemmatization tables * Add a warning when the `lexeme_norm` table is empty. (Given the relatively lang-specific loading for `Lookups`, it seemed like too much overhead to dynamically extract the list of languages, so for now it's hard-coded.)	2020-06-15 14:56:04 +02:00
svlandeg	b4d914ec77	fix error catching	2020-06-15 12:56:32 +02:00
svlandeg	b9c9cbb2cd	informative error when calling to_array with wrong field	2020-06-15 11:53:31 +02:00
svlandeg	ff231e1cdd	fix merge conflict	2020-06-15 09:04:19 +02:00
svlandeg	a48553c1ed	fix error numbers	2020-06-15 08:51:31 +02:00
Matthew Honnibal	3c0fc10dc4	Remove beam for now (maybe) Remove beam_utils Update setup.py Remove beam	2020-06-14 19:53:29 +02:00
Matthew Honnibal	98ca14f577	Remove GoldParse WIP on removing goldparse Get ArcEager compiling after GoldParse excise Update setup.py Get spacy.syntax compiling after removing GoldParse Rename NewExample -> Example and clean up Clean html files Start updating tests Update Morphologizer	2020-06-14 19:53:30 +02:00
Matthew Honnibal	d53723aa4f	Merge from whatif/arrow	2020-06-14 17:43:59 +02:00
Matthew Honnibal	380cce9d8b	Update errors	2020-06-14 17:40:05 +02:00
Matthew Honnibal	706e652820	Merge from develop	2020-06-14 17:35:01 +02:00
Matthew Honnibal	9296d71a54	More GoldParse excise	2020-06-14 17:26:54 +02:00
Matthew Honnibal	60d4e5a9e0	WIP on updating transition-system	2020-06-14 17:22:14 +02:00
Matthew Honnibal	7d65615625	WIP start excising GoldParse	2020-06-14 17:11:41 +02:00
Matthew Honnibal	4362ec7084	Hack Language.evaluate	2020-06-13 23:37:42 +02:00
Matthew Honnibal	7de997c0a5	Update test	2020-06-13 23:11:45 +02:00
Matthew Honnibal	8f941ef527	Update GoldParse	2020-06-13 23:11:29 +02:00
Matthew Honnibal	3a0bbcfb4c	Add biluo_tags_from_doc function	2020-06-13 23:10:54 +02:00
Matthew Honnibal	caa7508725	Draft missing NewExample stuff	2020-06-13 23:10:21 +02:00
Matthew Honnibal	3eb8f3867e	Update test	2020-06-13 23:05:16 +02:00
Arvind Srinivasan	aa5b40fa64	Added Tamil Example Sentences (#5583 ) * Added Examples for Tamil Sentences #### Description This PR add example sentences for the Tamil language which were missing as per issue #1107 #### Type of Change This is an enhancement. * Accepting spaCy Contributor Agreement * Signed on my behalf as an individual	2020-06-13 15:56:26 +02:00
Matthew Honnibal	5564314d32	Suggest approach for GoldParse	2020-06-13 15:43:35 +02:00
Matthew Honnibal	b078b05ecd	Handle various data better in NewExample	2020-06-13 15:30:12 +02:00
svlandeg	face0de74f	fix MORPH conversion + enable unit test	2020-06-12 16:29:09 +02:00
svlandeg	a5ee082da1	cats bugfix	2020-06-12 15:49:38 +02:00
svlandeg	880dccf93e	entities on doc_annotation, parse links and check their offsets against the entities. unit test works	2020-06-12 15:47:20 +02:00
theudas	3f5e2f9d99	Added Parameter to NEL to take n sentences into account (#5548 ) * added setting for neighbour sentence in NEL * added spaCy contributor agreement * added multi sentence also for training * made the try-except block smaller	2020-06-12 15:15:03 +02:00
adrianeboyd	4724fa4cf4	Expand Japanese requirements warning (#5572 ) Include explicit install instructions in Japanese requirements warning.	2020-06-12 15:14:55 +02:00
adrianeboyd	44967a3f9c	Update pytest conf for sudachipy with Japanese (#5574 )	2020-06-12 15:14:47 +02:00
svlandeg	3aed177a35	fix ENT_IOB conversion and enable unit test	2020-06-12 11:30:24 +02:00
Matthew Honnibal	a1c5b694be	Small fixes to train defaults	2020-06-12 02:22:13 +02:00
theudas	fa46e0bef2	Added Parameter to NEL to take n sentences into account (#5548 ) * added setting for neighbour sentence in NEL * added spaCy contributor agreement * added multi sentence also for training * made the try-except block smaller	2020-06-12 02:03:23 +02:00
Sofie Van Landeghem	c0f4a1e43b	train is from-config by default (#5575 ) * verbose and tag_map options * adding init_tok2vec option and only changing the tok2vec that is specified * adding omit_extra_lookups and verifying textcat config * wip * pretrain bugfix * add replace and resume options * train_textcat fix * raw text functionality * improve UX when KeyError or when input data can't be parsed * avoid unnecessary access to goldparse in TextCat pipe * save performance information in nlp.meta * add noise_level to config * move nn_parser's defaults to config file * multitask in config - doesn't work yet * scorer offering both F and AUC options, need to be specified in config * add textcat verification code from old train script * small fixes to config files * clean up * set default config for ner/parser to allow create_pipe to work as before * two more test fixes * small fixes * cleanup * fix NER pickling + additional unit test * create_pipe as before	2020-06-12 02:02:07 +02:00
svlandeg	6a67a11682	adding tests for new example class (some still failing - WIP)	2020-06-11 17:43:40 +02:00
adrianeboyd	556895177e	Expand Japanese requirements warning (#5572 ) Include explicit install instructions in Japanese requirements warning.	2020-06-11 13:47:37 +02:00
adrianeboyd	fe167fcf7d	Update pytest conf for sudachipy with Japanese (#5574 )	2020-06-11 10:23:50 +02:00
Jones Martins	bab30e4ad2	Add "c'mon" token exception (#5570 ) * Add "c'mon" exception * Fix typo in "C'mon" exception	2020-06-10 21:54:06 +02:00
Jones Martins	28db7dd5d9	Add missing pronouns/determiners (#5569 ) * Add missing pronouns/determiners * Add test for missing pronouns * Add contributor file	2020-06-10 18:47:04 +02:00
Matthew Honnibal	488727aee0	Start updating test	2020-06-09 23:58:28 +02:00
Matthew Honnibal	337d2b5ad6	Fix sent start in NewExample	2020-06-09 23:58:16 +02:00
Matthew Honnibal	ad547a4b8f	Refactor towards new Example class	2020-06-09 23:39:46 +02:00
Matthew Honnibal	82810b9846	Update morphologizer	2020-06-09 23:32:07 +02:00
Matthew Honnibal	af1b5f129b	Use new example class in GoldCorpus	2020-06-09 23:31:19 +02:00
Matthew Honnibal	0714f1fa5c	Remove the 'pass example into __call__' thing	2020-06-09 23:30:06 +02:00
Matthew Honnibal	b3868cd1f8	Update NewExample	2020-06-09 23:06:48 +02:00
Matthew Honnibal	ccd332a9fc	Update test stubs	2020-06-09 15:49:04 +02:00
adrianeboyd	0a70bd6281	Bump version to 2.3.0.dev1 (#5567 )	2020-06-09 15:47:31 +02:00
Matthew Honnibal	04569c0b3e	Fix import	2020-06-09 15:44:08 +02:00
Matthew Honnibal	f4caaa8ad9	Update alignment	2020-06-09 15:43:57 +02:00
Matthew Honnibal	b5ef397639	Add header for align.pxd	2020-06-09 15:43:48 +02:00
Matthew Honnibal	793092d2d8	Fix renaming in GoldCorpus	2020-06-09 15:43:38 +02:00
Matthew Honnibal	36d49a0f13	Fix NewExample class	2020-06-09 15:43:19 +02:00
Matthew Honnibal	f1189dc205	Draft tests for new Example class	2020-06-09 15:43:08 +02:00
Matthew Honnibal	c833ebe1ad	Start tests for new example class	2020-06-09 15:29:05 +02:00
Matthew Honnibal	453cfa14d0	Start drafting new example class	2020-06-09 15:28:42 +02:00
Matthew Honnibal	449000c234	Fix gold_io	2020-06-09 12:43:53 +02:00
Matthew Honnibal	cb08ce3936	Move alignment into Cython	2020-06-09 12:40:41 +02:00
Matthew Honnibal	20a1bdb298	Fix train	2020-06-09 12:33:29 +02:00
Matthew Honnibal	549164c31c	Fix corpus when no raw text supplied	2020-06-09 12:33:14 +02:00
adrianeboyd	b7e6e1b9a7	Disable sentence segmentation in ja tokenizer (#5566 )	2020-06-09 12:00:59 +02:00
Matthew Honnibal	d9289712ba	* Make GoldCorpus return dict, not Example * Make Example require a Doc object (previously optional) Clarify methods in GoldCorpus WIP refactor Example Refactor Example.split_sents Fix test Fix augment Update test Update test Fix import Update test_scorer Update Example	2020-06-09 01:01:59 +02:00
Matthew Honnibal	084271c9e9	Remove GoldParse from public API * Move get_parses_from_example to spacy.syntax * Get GoldParse out of Example * Avoid expecting GoldParse input in parser * Add Alignment to spacy.gold.align * Update Example object * Add comment * Update pipeline * Fix imports * Simplify gold_io * WIP on GoldCorpus * Update test * Xfail some gold tests * Remove ignore_misaligned option from GoldCorpus * Fix Example constructor * Update test * Fix usage of Example * Add deprecated_get_gold method on Example * Patch scorer * Fix test * Fix test * Update tests * Xfail a test * Fix passing of make_projective * Pass make_projective by default * Hack data format in Example.from_dict * Update tests * Fix example.from_dict * Update morphologizer * Fix entity linker * Add get_field to TokenAnnotation * Fix Example.get_aligned * Update test * Fix alignment * Fix corpus * Fix GoldCorpus * Handle misaligned * Format * Fix missing import	2020-06-08 22:09:57 +02:00
adrianeboyd	f162815f45	Handle empty and whitespace-only docs for Japanese (#5564 ) Handle empty and whitespace-only docs in the custom alignment method used by the Japanese tokenizer.	2020-06-08 21:09:23 +02:00
adrianeboyd	3bf111585d	Update Japanese tokenizer config and add serialization (#5562 ) * Use `config` dict for tokenizer settings * Add serialization of split mode setting * Add tests for tokenizer split modes and serialization of split mode setting Based on #5561	2020-06-08 16:29:05 +02:00
Hiroshi Matsuda	456bf47f51	fix a bug causing mis-alignments (#5560 )	2020-06-08 15:49:34 +02:00
Matthew Honnibal	b69fa77ccc	Add missing inits	2020-06-06 15:38:46 +02:00
Matthew Honnibal	6e87ca1f45	Fix imports	2020-06-06 15:36:58 +02:00
Matthew Honnibal	53b00991fd	Fix imports	2020-06-06 15:36:46 +02:00
Matthew Honnibal	74204116a3	Rename _gold -> gold	2020-06-06 15:29:32 +02:00
Matthew Honnibal	7f135736f4	Fix imports	2020-06-06 15:28:52 +02:00
Matthew Honnibal	17533a9286	Format	2020-06-06 15:13:07 +02:00
Matthew Honnibal	0f9b4bbfea	Fix imports	2020-06-06 15:12:52 +02:00
Matthew Honnibal	866179350b	Fix import	2020-06-06 15:11:13 +02:00
Matthew Honnibal	3baa1ada03	Refactor spacy.gold	2020-06-06 15:10:33 +02:00
Matthew Honnibal	1d2e39d974	Support to_dict in Doc	2020-06-06 15:10:10 +02:00
Matthew Honnibal	7b873ce2b1	Move GoldParse under spacy.syntax	2020-06-06 15:09:43 +02:00
Matthew Honnibal	32c8fb1372	Add gold_io.pyx	2020-06-06 14:41:49 +02:00
Matthew Honnibal	156466ca69	Add iob_utils	2020-06-06 14:39:14 +02:00
Matthew Honnibal	53e6473e24	Add to/from dict helpers	2020-06-06 14:29:06 +02:00
Matthew Honnibal	a663d44b1b	Add GoldCorpus	2020-06-06 14:28:37 +02:00
Matthew Honnibal	1fb8fc6ea9	Add Example class	2020-06-06 14:24:35 +02:00
Matthew Honnibal	cce6a51a9c	Add annotation classes	2020-06-06 14:22:27 +02:00
Matthew Honnibal	6005b94e74	Add data augmentation	2020-06-06 14:19:06 +02:00
Matthew Honnibal	fcb4f7a6db	Start breaking down gold.pyx	2020-06-06 14:15:12 +02:00
Ines Montani	d93cbeb14f	Add warning for loose version constraints (#5536 ) * Add warning for loose version constraints * Update wording [ci skip] * Tweak error message Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-06-05 12:42:15 +02:00
adrianeboyd	1ac43d78f9	Avoid libc.stdint for UINT64_MAX (#5545 )	2020-06-04 20:02:05 +02:00
Paul O'Leary McCann	410fb7ee43	Add Japanese Model (#5544 ) * Add more rules to deal with Japanese UD mappings Japanese UD rules sometimes give different UD tags to tokens with the same underlying POS tag. The UD spec indicates these cases should be disambiguated using the output of a tool called "comainu", but rules are enough to get the right result. These rules are taken from Ginza at time of writing, see #3756. * Add new tags from GSD This is a few rare tags that aren't in Unidic but are in the GSD data. * Add basic Japanese sentencization This code is taken from Ginza again. * Add sentenceizer quote handling Could probably add more paired characters but this will do for now. Also includes some tests. * Replace fugashi with SudachiPy * Modify tag format to match GSD annotations Some of the tests still need to be updated, but I want to get this up for testing training. * Deal with case with closing punct without opening * refactor resolve_pos() * change tag field separator from "," to "-" * add TAG_ORTH_MAP * add TAG_BIGRAM_MAP * revise rules for 連体詞 * revise rules for 連体詞 * improve POS about 2% * add syntax_iterator.py (not mature yet) * improve syntax_iterators.py * improve syntax_iterators.py * add phrases including nouns and drop NPs consist of STOP_WORDS * First take at noun chunks This works in many situations but still has issues in others. If the start of a subtree has no noun, then nested phrases can be generated. また行きたい、そんな気持ちにさせてくれるお店です。 [そんな気持ち, また行きたい、そんな気持ちにさせてくれるお店] For some reason て gets included sometimes. Not sure why. ゲンに連れ添って円盤生物を調査するパートナーとなる。 [て円盤生物, ...] Some phrases that look like they should be split are grouped together; not entirely sure that's wrong. This whole thing becomes one chunk: 道の駅遠山郷北側からかぐら大橋南詰現道交点までの1.060kmのみ開通済み * Use new generic get_words_and_spaces The new get_words_and_spaces function is simpler than what was used in Japanese, so it's good to be able to switch to it. However, there was an issue. The new function works just on text, so POS info could get out of sync. Fixing this required a small change to the way dtokens (tokens with POS and lemma info) were generated. Specifically, multiple extraneous spaces now become a single token, so when generating dtokens multiple space tokens should be created in a row. * Fix noun_chunks, should be working now * Fix some tests, add naughty strings tests Some of the existing tests changed because the tokenization mode of Sudachi changed to the more fine-grained A mode. Sudachi also has issues with some strings, so this adds a test against the naughty strings. * Remove empty Sudachi tokens Not doing this creates zero-length tokens and causes errors in the internal spaCy processing. * Add yield_bunsetu back in as a separate piece of code Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com> Co-authored-by: hiroshi <hiroshi_matsuda@megagon.ai>	2020-06-04 19:15:43 +02:00
Matthew Honnibal	8411d4f4e6	Merge pull request #5543 from svlandeg/feature/pretrain-config pretrain from config	2020-06-04 19:07:12 +02:00
svlandeg	3ade455fd3	formatting	2020-06-04 16:09:55 +02:00
svlandeg	776d4f1190	cleanup	2020-06-04 16:07:30 +02:00
svlandeg	6b027d7689	remove duplicate model definition of tok2vec layer	2020-06-04 15:49:23 +02:00
svlandeg	1775f54a26	small little fixes	2020-06-03 22:17:02 +02:00
svlandeg	07886a3de3	rename init_tok2vec to resume	2020-06-03 22:00:25 +02:00
svlandeg	4ed6278663	small fixes to pretrain config, init_tok2vec TODO	2020-06-03 19:32:40 +02:00
svlandeg	ffe0451d09	pretrain from config	2020-06-03 14:45:00 +02:00
Ines Montani	a8875d4a4b	Fix typo	2020-06-03 14:42:39 +02:00
Ines Montani	4e0610d0d4	Update warning codes	2020-06-03 14:37:09 +02:00
Ines Montani	810fce3bb1	Merge branch 'develop' into master-tmp	2020-06-03 14:36:59 +02:00
Adriane Boyd	b0ee76264b	Remove debugging	2020-06-03 14:20:42 +02:00
Adriane Boyd	1d8168d1fd	Fix problems with lower and whitespace in variants Port relevant changes from #5361: * Initialize lower flag explicitly * Handle whitespace words from GoldParse correctly when creating raw text with orth variants	2020-06-03 14:15:58 +02:00
Adriane Boyd	10d938f221	Update default cfg dir in train CLI	2020-06-03 14:15:50 +02:00
Adriane Boyd	f1f9c8b417	Port train CLI updates Updates from #5362 and fix from #5387: * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-06-03 14:03:43 +02:00
Adriane Boyd	8c758ed1eb	Fix meta path	2020-06-03 12:11:57 +02:00
Adriane Boyd	a57bdeecac	Test util.get_model_meta instead of util.load_model	2020-06-03 12:10:12 +02:00
svlandeg	eac12cbb77	make dropout in embed layers configurable	2020-06-03 11:50:16 +02:00
svlandeg	e91485dfc4	add discard_oversize parameter, move optimizer to training subsection	2020-06-03 10:04:16 +02:00
svlandeg	03c58b488c	prevent infinite loop, custom warning	2020-06-03 10:00:21 +02:00


7325 Commits