Commit Graph

7325 Commits

Author SHA1 Message Date
Matthew Honnibal
0c10831b14 Start debugging arc_eager oracle 2020-06-20 21:49:46 +02:00
Matthew Honnibal
2bcb5881d7 Fix parser model 2020-06-20 21:49:31 +02:00
Matthew Honnibal
396dd60b3a Fix Corpus 2020-06-20 21:49:15 +02:00
Matthew Honnibal
450c6fe39c Update train.py 2020-06-20 21:49:06 +02:00
svlandeg
c9242e9bf4 fix entity linker (cf PR #5548) 2020-06-20 21:47:23 +02:00
svlandeg
dc069e90b3 fix token.morph_ for v.3 (cf PR #5517) 2020-06-20 21:13:11 +02:00
Matthew Honnibal
6d821b2e55 Make doc.from_array several times faster 2020-06-20 20:17:13 +02:00
Matthew Honnibal
fa86aa581d Allocate Doc before starting to add words 2020-06-20 20:15:21 +02:00
Matthew Honnibal
652f31d3ee Update DocBin 2020-06-20 20:12:54 +02:00
Matthew Honnibal
0a8b6631a2 Update Corpus 2020-06-20 20:12:31 +02:00
Matthew Honnibal
11fa0658f7 Work on train script 2020-06-20 20:12:19 +02:00
Ines Montani
988d2a4eda Add --code-path option to train CLI (#5618) 2020-06-20 18:43:12 +02:00
Matthew Honnibal
0de361cd00 Draft Corpus class for DocBin 2020-06-20 18:31:07 +02:00
Ines Montani
5424b70e51 Remove v2 test 2020-06-20 16:18:53 +02:00
Ines Montani
63c22969f4 Update test_issue5230.py 2020-06-20 16:17:48 +02:00
Ines Montani
296b5d633b Remove references to Python 2 / is_python2 2020-06-20 16:11:13 +02:00
Matthew Honnibal
7360d3db72 Add json2docs converter 2020-06-20 16:02:53 +02:00
Ines Montani
0cdb631e6c Fix merge errors 2020-06-20 16:02:42 +02:00
Matthew Honnibal
f1756a6a22 Remove jsonl converter 2020-06-20 16:02:40 +02:00
Matthew Honnibal
5d89b1840e Update converter 2020-06-20 16:00:14 +02:00
Matthew Honnibal
f5780cb160 Serialize all attrs by default 2020-06-20 15:59:39 +02:00
Matthew Honnibal
3241acbe0b Fix import 2020-06-20 15:56:28 +02:00
Matthew Honnibal
b7a366b435 Fix compile in ArcEager 2020-06-20 15:56:16 +02:00
Matthew Honnibal
91fa2f1126 Fix docbin 2020-06-20 15:56:05 +02:00
Matthew Honnibal
476bcd4c53 Fix import 2020-06-20 15:55:57 +02:00
Matthew Honnibal
7a846921a3 Make spacy convert output docbin 2020-06-20 15:55:35 +02:00
Ines Montani
52728d8fa3 Merge branch 'develop' into master-tmp 2020-06-20 15:52:00 +02:00
Ines Montani
f91e9e8c84 Remove F841 [ci skip] 2020-06-20 14:47:17 +02:00
Ines Montani
8283df80e9 Tidy up and auto-format 2020-06-20 14:15:04 +02:00
Matthew Honnibal
0d22c6e006 Allow DocBin to take list of Doc objects. 2020-06-20 03:50:36 +02:00
Matthew Honnibal
95df028758 Update converters 2020-06-20 03:50:23 +02:00
Matthew Honnibal
3a73d95dcc Update converter to produce DocBin 2020-06-20 03:50:13 +02:00
Matthew Honnibal
d9a8fdf4b7 Fix name 2020-06-20 03:26:36 +02:00
Matthew Honnibal
e20a780867 Fix naming 2020-06-20 03:24:49 +02:00
Matthew Honnibal
f61d5e3ac3 Move things around 2020-06-20 03:23:58 +02:00
Matthew Honnibal
c630cfdb5e Move converters under spacy.gold 2020-06-20 03:20:34 +02:00
Matthew Honnibal
161d8439fa Start updating converters 2020-06-20 03:19:40 +02:00
Matthew Honnibal
a79f0598a6 Merge branch 'whatif/arrow' of https://github.com/explosion/spaCy into whatif/arrow 2020-06-20 02:36:40 +02:00
Matthew Honnibal
be81577719 Fix oracles 2020-06-20 02:36:12 +02:00
Marat M. Yavrumyan
8120b641cc Update lex_attrs.py (#5608) 2020-06-19 20:00:34 +02:00
svlandeg
e30ec9b2a8 fix test checking for variants 2020-06-19 14:05:35 +02:00
svlandeg
25b0674320 clean up 2020-06-19 11:31:01 +02:00
svlandeg
c705a28438 add links to to_dict 2020-06-19 11:22:24 +02:00
Matthew Honnibal
03db143cd0 Draft new GoldCorpus class 2020-06-19 04:15:02 +02:00
Matthew Honnibal
a389866df6 Merge branch 'whatif/arrow' of https://github.com/explosion/spaCy into whatif/arrow 2020-06-19 02:30:27 +02:00
Matthew Honnibal
bd29b7b14f Update parser and NER gold stuff 2020-06-19 02:29:16 +02:00
Matthew Honnibal
5ae9e3480d Return ArcEagerGoldParse from ArcEager 2020-06-19 00:11:59 +02:00
svlandeg
6ca6d7d6b4 test for split sentences with various alignment issues, works 2020-06-18 20:01:02 +02:00
svlandeg
1951921230 implement split_sent with aligned SENT_START attribute 2020-06-18 19:41:53 +02:00
svlandeg
d1d6f16776 fix the fix 2020-06-18 19:15:32 +02:00
svlandeg
e822367cf7 prevent writing dummy values like deps because that could interfere with sent_start values 2020-06-18 17:47:59 +02:00
svlandeg
0b6d45eae1 various small fixes 2020-06-18 15:55:00 +02:00
svlandeg
1c71f2310c fix renames and simple_ner labels 2020-06-18 15:33:28 +02:00
svlandeg
64fc840a5d bugfix tok2vec 2020-06-18 15:24:40 +02:00
svlandeg
01f9ae774c small fixes 2020-06-18 14:01:19 +02:00
svlandeg
0c6f1f3891 fix BiluoPushDown parsing entities 2020-06-18 13:00:03 +02:00
svlandeg
cd790aaa2a fix parser tests to work with example (most still failing) 2020-06-18 11:19:22 +02:00
svlandeg
9f43ba839a throw informative error when running the components with the wrong type of objects 2020-06-18 10:36:05 +02:00
svlandeg
6712d0b5db textcat bugfix 2020-06-18 10:09:56 +02:00
svlandeg
40b2b21eef small bug fix 2020-06-17 23:33:51 +02:00
svlandeg
d6c4dd6eea pipe() takes docs, not examples 2020-06-17 21:29:36 +02:00
svlandeg
0f123af35e ensure test keeps working with non-linked entities 2020-06-17 21:13:38 +02:00
svlandeg
6d73e139b0 fix entity linker 2020-06-17 21:12:25 +02:00
svlandeg
be5934b827 fix tagger 2020-06-17 19:42:11 +02:00
svlandeg
10d396977e add support for MORPH in to/from_array, fix morphologizer overfitting test 2020-06-17 17:48:07 +02:00
svlandeg
1a151b10d6 correct silly typo 2020-06-17 14:48:14 +02:00
svlandeg
f6c451b650 cleanup 2020-06-17 14:45:54 +02:00
svlandeg
2d9f406188 fix test_cli 2020-06-17 14:42:48 +02:00
svlandeg
f7ad8e8c83 various fixes in scripts - needs to be further tested 2020-06-17 12:05:58 +02:00
svlandeg
3c4f9e4cc4 fix augment (needs further testing) 2020-06-17 10:46:29 +02:00
svlandeg
4ed399c848 minibatch utility can deal with strings, docs or examples 2020-06-16 21:35:55 +02:00
svlandeg
8b66c11ff2 add spaces to json output format 2020-06-16 19:30:03 +02:00
svlandeg
ba80ad7efd fixed some tests + WIP roundtrip unit test 2020-06-16 18:26:50 +02:00
Ines Montani
e9d3e177f0 Merge branch 'master' into v2.3.x 2020-06-16 16:31:38 +02:00
svlandeg
43d41d6bb6 allow None as BILUO annotation 2020-06-16 15:30:05 +02:00
svlandeg
44a0f9c2c8 test_gold_biluo_different_tokenization works 2020-06-16 15:21:20 +02:00
svlandeg
1c35b8efcd fix spaces 2020-06-16 12:08:25 +02:00
svlandeg
6fea5fa4bd attempt to fix cases with weird spaces 2020-06-16 11:52:29 +02:00
svlandeg
0702a1d3fb fix test for misaligned 2020-06-15 23:10:47 +02:00
svlandeg
a28f8f369e Fix many-to-one IOB codes 2020-06-15 23:06:22 +02:00
svlandeg
12886b787b fixing NER one-to-many alignment 2020-06-15 22:44:17 +02:00
Matthew Honnibal
7ff447c5a0 Set version to v2.3.0 2020-06-15 18:22:25 +02:00
Matthew Honnibal
a0bf73a5dd Merge branch 'whatif/arrow' of https://github.com/explosion/spaCy into whatif/arrow 2020-06-15 18:16:01 +02:00
Matthew Honnibal
c66f93299e Remove TokenAnnotation code from nonproj 2020-06-15 18:14:47 +02:00
Matthew Honnibal
c95494739c Fix import 2020-06-15 18:11:10 +02:00
Matthew Honnibal
8f978f2031 Fix import 2020-06-15 18:10:47 +02:00
Matthew Honnibal
95de7efaad Draft create_gold_state for arc_eager oracle 2020-06-15 18:10:19 +02:00
svlandeg
68986a252e additional tests for new get_aligned function 2020-06-15 17:42:40 +02:00
svlandeg
41d29983a7 start testing get_aligned 2020-06-15 17:16:01 +02:00
svlandeg
fd5f199feb fixing language and scoring tests 2020-06-15 15:02:05 +02:00
Adriane Boyd
0d8405aafa Updates to docstrings (#5589) 2020-06-15 14:58:36 +02:00
Adriane Boyd
e867e9fa8f Fix and add warnings related to spacy-lookups-data (#5588)
* Fix warning message for lemmatization tables

* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
2020-06-15 14:58:29 +02:00
Arvind Srinivasan
f698007907 Added Tamil Example Sentences (#5583)
* Added Examples for Tamil Sentences

#### Description
This PR adds example sentences for the Tamil language, which were missing as per issue #1107

#### Type of Change
This is an enhancement.

* Accepting spaCy Contributor Agreement

* Signed on my behalf as an individual
2020-06-15 14:58:21 +02:00
Adriane Boyd
c94f7d0e75 Updates to docstrings (#5589) 2020-06-15 14:56:51 +02:00
Adriane Boyd
c482f20778 Fix and add warnings related to spacy-lookups-data (#5588)
* Fix warning message for lemmatization tables

* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
2020-06-15 14:56:04 +02:00
svlandeg
b4d914ec77 fix error catching 2020-06-15 12:56:32 +02:00
svlandeg
b9c9cbb2cd informative error when calling to_array with wrong field 2020-06-15 11:53:31 +02:00
svlandeg
ff231e1cdd fix merge conflict 2020-06-15 09:04:19 +02:00
svlandeg
a48553c1ed fix error numbers 2020-06-15 08:51:31 +02:00
Matthew Honnibal
3c0fc10dc4 Remove beam for now (maybe)
Remove beam_utils

Update setup.py

Remove beam
2020-06-14 19:53:29 +02:00
Matthew Honnibal
98ca14f577 Remove GoldParse
WIP on removing goldparse

Get ArcEager compiling after GoldParse excise

Update setup.py

Get spacy.syntax compiling after removing GoldParse

Rename NewExample -> Example and clean up

Clean html files

Start updating tests

Update Morphologizer
2020-06-14 19:53:30 +02:00
Matthew Honnibal
d53723aa4f Merge from whatif/arrow 2020-06-14 17:43:59 +02:00
Matthew Honnibal
380cce9d8b Update errors 2020-06-14 17:40:05 +02:00
Matthew Honnibal
706e652820 Merge from develop 2020-06-14 17:35:01 +02:00
Matthew Honnibal
9296d71a54 More GoldParse excise 2020-06-14 17:26:54 +02:00
Matthew Honnibal
60d4e5a9e0 WIP on updating transition-system 2020-06-14 17:22:14 +02:00
Matthew Honnibal
7d65615625 WIP start excising GoldParse 2020-06-14 17:11:41 +02:00
Matthew Honnibal
4362ec7084 Hack Language.evaluate 2020-06-13 23:37:42 +02:00
Matthew Honnibal
7de997c0a5 Update test 2020-06-13 23:11:45 +02:00
Matthew Honnibal
8f941ef527 Update GoldParse 2020-06-13 23:11:29 +02:00
Matthew Honnibal
3a0bbcfb4c Add biluo_tags_from_doc function 2020-06-13 23:10:54 +02:00
Matthew Honnibal
caa7508725 Draft missing NewExample stuff 2020-06-13 23:10:21 +02:00
Matthew Honnibal
3eb8f3867e Update test 2020-06-13 23:05:16 +02:00
Arvind Srinivasan
aa5b40fa64 Added Tamil Example Sentences (#5583)
* Added Examples for Tamil Sentences

#### Description
This PR adds example sentences for the Tamil language, which were missing as per issue #1107

#### Type of Change
This is an enhancement.

* Accepting spaCy Contributor Agreement

* Signed on my behalf as an individual
2020-06-13 15:56:26 +02:00
Matthew Honnibal
5564314d32 Suggest approach for GoldParse 2020-06-13 15:43:35 +02:00
Matthew Honnibal
b078b05ecd Handle various data better in NewExample 2020-06-13 15:30:12 +02:00
svlandeg
face0de74f fix MORPH conversion + enable unit test 2020-06-12 16:29:09 +02:00
svlandeg
a5ee082da1 cats bugfix 2020-06-12 15:49:38 +02:00
svlandeg
880dccf93e entities on doc_annotation, parse links and check their offsets against the entities. unit test works 2020-06-12 15:47:20 +02:00
theudas
3f5e2f9d99 Added Parameter to NEL to take n sentences into account (#5548)
* added setting for neighbour sentence in NEL

* added spaCy contributor agreement

* added multi sentence also for training

* made the try-except block smaller
2020-06-12 15:15:03 +02:00
adrianeboyd
4724fa4cf4 Expand Japanese requirements warning (#5572)
Include explicit install instructions in Japanese requirements warning.
2020-06-12 15:14:55 +02:00
adrianeboyd
44967a3f9c Update pytest conf for sudachipy with Japanese (#5574) 2020-06-12 15:14:47 +02:00
svlandeg
3aed177a35 fix ENT_IOB conversion and enable unit test 2020-06-12 11:30:24 +02:00
Matthew Honnibal
a1c5b694be Small fixes to train defaults 2020-06-12 02:22:13 +02:00
theudas
fa46e0bef2 Added Parameter to NEL to take n sentences into account (#5548)
* added setting for neighbour sentence in NEL

* added spaCy contributor agreement

* added multi sentence also for training

* made the try-except block smaller
2020-06-12 02:03:23 +02:00
Sofie Van Landeghem
c0f4a1e43b train is from-config by default (#5575)
* verbose and tag_map options

* adding init_tok2vec option and only changing the tok2vec that is specified

* adding omit_extra_lookups and verifying textcat config

* wip

* pretrain bugfix

* add replace and resume options

* train_textcat fix

* raw text functionality

* improve UX when KeyError or when input data can't be parsed

* avoid unnecessary access to goldparse in TextCat pipe

* save performance information in nlp.meta

* add noise_level to config

* move nn_parser's defaults to config file

* multitask in config - doesn't work yet

* scorer offering both F and AUC options, need to be specified in config

* add textcat verification code from old train script

* small fixes to config files

* clean up

* set default config for ner/parser to allow create_pipe to work as before

* two more test fixes

* small fixes

* cleanup

* fix NER pickling + additional unit test

* create_pipe as before
2020-06-12 02:02:07 +02:00
svlandeg
6a67a11682 adding tests for new example class (some still failing - WIP) 2020-06-11 17:43:40 +02:00
adrianeboyd
556895177e Expand Japanese requirements warning (#5572)
Include explicit install instructions in Japanese requirements warning.
2020-06-11 13:47:37 +02:00
adrianeboyd
fe167fcf7d Update pytest conf for sudachipy with Japanese (#5574) 2020-06-11 10:23:50 +02:00
Jones Martins
bab30e4ad2 Add "c'mon" token exception (#5570)
* Add "c'mon" exception

* Fix typo in "C'mon" exception
2020-06-10 21:54:06 +02:00
Jones Martins
28db7dd5d9 Add missing pronouns/determiners (#5569)
* Add missing pronouns/determiners

* Add test for missing pronouns

* Add contributor file
2020-06-10 18:47:04 +02:00
Matthew Honnibal
488727aee0 Start updating test 2020-06-09 23:58:28 +02:00
Matthew Honnibal
337d2b5ad6 Fix sent start in NewExample 2020-06-09 23:58:16 +02:00
Matthew Honnibal
ad547a4b8f Refactor towards new Example class 2020-06-09 23:39:46 +02:00
Matthew Honnibal
82810b9846 Update morphologizer 2020-06-09 23:32:07 +02:00
Matthew Honnibal
af1b5f129b Use new example class in GoldCorpus 2020-06-09 23:31:19 +02:00
Matthew Honnibal
0714f1fa5c Remove the 'pass example into __call__' thing 2020-06-09 23:30:06 +02:00
Matthew Honnibal
b3868cd1f8 Update NewExample 2020-06-09 23:06:48 +02:00
Matthew Honnibal
ccd332a9fc Update test stubs 2020-06-09 15:49:04 +02:00
adrianeboyd
0a70bd6281 Bump version to 2.3.0.dev1 (#5567) 2020-06-09 15:47:31 +02:00
Matthew Honnibal
04569c0b3e Fix import 2020-06-09 15:44:08 +02:00
Matthew Honnibal
f4caaa8ad9 Update alignment 2020-06-09 15:43:57 +02:00
Matthew Honnibal
b5ef397639 Add header for align.pxd 2020-06-09 15:43:48 +02:00
Matthew Honnibal
793092d2d8 Fix renaming in GoldCorpus 2020-06-09 15:43:38 +02:00
Matthew Honnibal
36d49a0f13 Fix NewExample class 2020-06-09 15:43:19 +02:00
Matthew Honnibal
f1189dc205 Draft tests for new Example class 2020-06-09 15:43:08 +02:00
Matthew Honnibal
c833ebe1ad Start tests for new example class 2020-06-09 15:29:05 +02:00
Matthew Honnibal
453cfa14d0 Start drafting new example class 2020-06-09 15:28:42 +02:00
Matthew Honnibal
449000c234 Fix gold_io 2020-06-09 12:43:53 +02:00
Matthew Honnibal
cb08ce3936 Move alignment into Cython 2020-06-09 12:40:41 +02:00
Matthew Honnibal
20a1bdb298 Fix train 2020-06-09 12:33:29 +02:00
Matthew Honnibal
549164c31c Fix corpus when no raw text supplied 2020-06-09 12:33:14 +02:00
adrianeboyd
b7e6e1b9a7 Disable sentence segmentation in ja tokenizer (#5566) 2020-06-09 12:00:59 +02:00
Matthew Honnibal
d9289712ba * Make GoldCorpus return dict, not Example
* Make Example require a Doc object (previously optional)

Clarify methods in GoldCorpus

WIP refactor Example

Refactor Example.split_sents

Fix test

Fix augment

Update test

Update test

Fix import

Update test_scorer

Update Example
2020-06-09 01:01:59 +02:00
Matthew Honnibal
084271c9e9 Remove GoldParse from public API
* Move get_parses_from_example to spacy.syntax

* Get GoldParse out of Example

* Avoid expecting GoldParse input in parser

* Add Alignment to spacy.gold.align

* Update Example object

* Add comment

* Update pipeline

* Fix imports

* Simplify gold_io

* WIP on GoldCorpus

* Update test

* Xfail some gold tests

* Remove ignore_misaligned option from GoldCorpus

* Fix Example constructor

* Update test

* Fix usage of Example

* Add deprecated_get_gold method on Example

* Patch scorer

* Fix test

* Fix test

* Update tests

* Xfail a test

* Fix passing of make_projective

* Pass make_projective by default

* Hack data format in Example.from_dict

* Update tests

* Fix example.from_dict

* Update morphologizer

* Fix entity linker

* Add get_field to TokenAnnotation

* Fix Example.get_aligned

* Update test

* Fix alignment

* Fix corpus

* Fix GoldCorpus

* Handle misaligned

* Format

* Fix missing import
2020-06-08 22:09:57 +02:00
adrianeboyd
f162815f45 Handle empty and whitespace-only docs for Japanese (#5564)
Handle empty and whitespace-only docs in the custom alignment method
used by the Japanese tokenizer.
2020-06-08 21:09:23 +02:00
adrianeboyd
3bf111585d Update Japanese tokenizer config and add serialization (#5562)
* Use `config` dict for tokenizer settings
* Add serialization of split mode setting
* Add tests for tokenizer split modes and serialization of split mode
setting

Based on #5561
2020-06-08 16:29:05 +02:00
Hiroshi Matsuda
456bf47f51 fix a bug causing mis-alignments (#5560) 2020-06-08 15:49:34 +02:00
Matthew Honnibal
b69fa77ccc Add missing inits 2020-06-06 15:38:46 +02:00
Matthew Honnibal
6e87ca1f45 Fix imports 2020-06-06 15:36:58 +02:00
Matthew Honnibal
53b00991fd Fix imports 2020-06-06 15:36:46 +02:00
Matthew Honnibal
74204116a3 Rename _gold -> gold 2020-06-06 15:29:32 +02:00
Matthew Honnibal
7f135736f4 Fix imports 2020-06-06 15:28:52 +02:00
Matthew Honnibal
17533a9286 Format 2020-06-06 15:13:07 +02:00
Matthew Honnibal
0f9b4bbfea Fix imports 2020-06-06 15:12:52 +02:00
Matthew Honnibal
866179350b Fix import 2020-06-06 15:11:13 +02:00
Matthew Honnibal
3baa1ada03 Refactor spacy.gold 2020-06-06 15:10:33 +02:00
Matthew Honnibal
1d2e39d974 Support to_dict in Doc 2020-06-06 15:10:10 +02:00
Matthew Honnibal
7b873ce2b1 Move GoldParse under spacy.syntax 2020-06-06 15:09:43 +02:00
Matthew Honnibal
32c8fb1372 Add gold_io.pyx 2020-06-06 14:41:49 +02:00
Matthew Honnibal
156466ca69 Add iob_utils 2020-06-06 14:39:14 +02:00
Matthew Honnibal
53e6473e24 Add to/from dict helpers 2020-06-06 14:29:06 +02:00
Matthew Honnibal
a663d44b1b Add GoldCorpus 2020-06-06 14:28:37 +02:00
Matthew Honnibal
1fb8fc6ea9 Add Example class 2020-06-06 14:24:35 +02:00
Matthew Honnibal
cce6a51a9c Add annotation classes 2020-06-06 14:22:27 +02:00
Matthew Honnibal
6005b94e74 Add data augmentation 2020-06-06 14:19:06 +02:00
Matthew Honnibal
fcb4f7a6db Start breaking down gold.pyx 2020-06-06 14:15:12 +02:00
Ines Montani
d93cbeb14f Add warning for loose version constraints (#5536)
* Add warning for loose version constraints

* Update wording [ci skip]

* Tweak error message

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-06-05 12:42:15 +02:00
adrianeboyd
1ac43d78f9 Avoid libc.stdint for UINT64_MAX (#5545) 2020-06-04 20:02:05 +02:00
Paul O'Leary McCann
410fb7ee43 Add Japanese Model (#5544)
* Add more rules to deal with Japanese UD mappings

Japanese UD rules sometimes give different UD tags to tokens with the
same underlying POS tag. The UD spec indicates these cases should be
disambiguated using the output of a tool called "comainu", but rules are
enough to get the right result.

These rules are taken from Ginza at time of writing, see #3756.

* Add new tags from GSD

These are a few rare tags that aren't in Unidic but are in the GSD data.

* Add basic Japanese sentencization

This code is taken from Ginza again.

* Add sentencizer quote handling

Could probably add more paired characters but this will do for now. Also
includes some tests.

* Replace fugashi with SudachiPy

* Modify tag format to match GSD annotations

Some of the tests still need to be updated, but I want to get this up
for testing training.

* Deal with case with closing punct without opening

* refactor resolve_pos()

* change tag field separator from "," to "-"

* add TAG_ORTH_MAP

* add TAG_BIGRAM_MAP

* revise rules for 連体詞

* revise rules for 連体詞

* improve POS about 2%

* add syntax_iterator.py (not mature yet)

* improve syntax_iterators.py

* improve syntax_iterators.py

* add phrases including nouns and drop NPs consisting of STOP_WORDS

* First take at noun chunks

This works in many situations but still has issues in others.

If the start of a subtree has no noun, then nested phrases can be
generated.

    また行きたい、そんな気持ちにさせてくれるお店です。
    [そんな気持ち, また行きたい、そんな気持ちにさせてくれるお店]

For some reason て gets included sometimes. Not sure why.

    ゲンに連れ添って円盤生物を調査するパートナーとなる。
    [て円盤生物, ...]

Some phrases that look like they should be split are grouped together;
not entirely sure that's wrong. This whole thing becomes one chunk:

    道の駅遠山郷北側からかぐら大橋南詰現道交点までの1.060kmのみ開通済み

* Use new generic get_words_and_spaces

The new get_words_and_spaces function is simpler than what was used in
Japanese, so it's good to be able to switch to it. However, there was an
issue. The new function works just on text, so POS info could get out of
sync. Fixing this required a small change to the way dtokens (tokens
with POS and lemma info) were generated.

Specifically, multiple extraneous spaces now become a single token, so
when generating dtokens multiple space tokens should be created in a
row.

* Fix noun_chunks, should be working now

* Fix some tests, add naughty strings tests

Some of the existing tests changed because the tokenization mode of
Sudachi changed to the more fine-grained A mode.

Sudachi also has issues with some strings, so this adds a test against
the naughty strings.

* Remove empty Sudachi tokens

Not doing this creates zero-length tokens and causes errors in the
internal spaCy processing.

* Add yield_bunsetu back in as a separate piece of code

Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com>
Co-authored-by: hiroshi <hiroshi_matsuda@megagon.ai>
2020-06-04 19:15:43 +02:00
Matthew Honnibal
8411d4f4e6 Merge pull request #5543 from svlandeg/feature/pretrain-config
pretrain from config
2020-06-04 19:07:12 +02:00
svlandeg
3ade455fd3 formatting 2020-06-04 16:09:55 +02:00
svlandeg
776d4f1190 cleanup 2020-06-04 16:07:30 +02:00
svlandeg
6b027d7689 remove duplicate model definition of tok2vec layer 2020-06-04 15:49:23 +02:00
svlandeg
1775f54a26 small little fixes 2020-06-03 22:17:02 +02:00
svlandeg
07886a3de3 rename init_tok2vec to resume 2020-06-03 22:00:25 +02:00
svlandeg
4ed6278663 small fixes to pretrain config, init_tok2vec TODO 2020-06-03 19:32:40 +02:00
svlandeg
ffe0451d09 pretrain from config 2020-06-03 14:45:00 +02:00
Ines Montani
a8875d4a4b Fix typo 2020-06-03 14:42:39 +02:00
Ines Montani
4e0610d0d4 Update warning codes 2020-06-03 14:37:09 +02:00
Ines Montani
810fce3bb1 Merge branch 'develop' into master-tmp 2020-06-03 14:36:59 +02:00
Adriane Boyd
b0ee76264b Remove debugging 2020-06-03 14:20:42 +02:00
Adriane Boyd
1d8168d1fd Fix problems with lower and whitespace in variants
Port relevant changes from #5361:

* Initialize lower flag explicitly

* Handle whitespace words from GoldParse correctly when creating raw
text with orth variants
2020-06-03 14:15:58 +02:00
Adriane Boyd
10d938f221 Update default cfg dir in train CLI 2020-06-03 14:15:50 +02:00
Adriane Boyd
f1f9c8b417 Port train CLI updates
Updates from #5362 and fix from #5387:

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-06-03 14:03:43 +02:00
Adriane Boyd
8c758ed1eb Fix meta path 2020-06-03 12:11:57 +02:00
Adriane Boyd
a57bdeecac Test util.get_model_meta instead of util.load_model 2020-06-03 12:10:12 +02:00
svlandeg
eac12cbb77 make dropout in embed layers configurable 2020-06-03 11:50:16 +02:00
svlandeg
e91485dfc4 add discard_oversize parameter, move optimizer to training subsection 2020-06-03 10:04:16 +02:00
svlandeg
03c58b488c prevent infinite loop, custom warning 2020-06-03 10:00:21 +02:00