Commit Graph

1964 Commits

Author SHA1 Message Date
Adriane Boyd
73538782a0
Switch Doc.__init__(ents=) to IOB tags (#6173)
* Switch Doc.__init__(ents=) to IOB tags

* Fix check for "-"

* Allow "" or None as missing IOB tag
2020-10-01 16:22:18 +02:00
Ines Montani
381258b75b
Merge pull request #6165 from explosion/feature/update-tokenizers-initialize 2020-10-01 09:49:47 +02:00
Ines Montani
a103ab5f1a Update augmenter lookups and docs 2020-09-30 23:03:47 +02:00
Ines Montani
23c63eefaf Tidy up env vars [ci skip] 2020-09-30 15:15:11 +02:00
Adriane Boyd
6b7bb32834 Refactor Chinese initialization 2020-09-30 11:46:45 +02:00
Ines Montani
34f9c26c62 Add lexeme norm defaults 2020-09-30 10:20:14 +02:00
Ines Montani
1aeef3bfbb Make corpus paths default to None and improve errors 2020-09-29 22:33:46 +02:00
Ines Montani
fa47f87924 Tidy up and auto-format 2020-09-29 21:39:28 +02:00
Ines Montani
6467a560e3 WIP: Test updating Chinese tokenizer 2020-09-29 21:10:22 +02:00
Ines Montani
78021089f9
Merge pull request #6160 from explosion/feature/prepare 2020-09-29 20:55:13 +02:00
Ines Montani
c3f8c09d7d
Merge pull request #6154 from adrianeboyd/bugfix/chinese-tokenizer-pickle 2020-09-29 20:54:59 +02:00
Ines Montani
d3c63b7965 Merge branch 'develop' into feature/prepare 2020-09-29 20:53:05 +02:00
Ines Montani
2be80379ec Fix small issues, resolve_dot_names and debug model 2020-09-29 20:38:35 +02:00
Ines Montani
7851020653 Update tests 2020-09-29 18:14:15 +02:00
Ines Montani
f2352eb701 Test with default value 2020-09-29 17:00:40 +02:00
Ines Montani
63d1598137 Simplify config use in Language.initialize 2020-09-29 16:05:48 +02:00
Ines Montani
56f8bc73ef Add more tests 2020-09-29 15:23:34 +02:00
Ines Montani
591038b1a4 Add test 2020-09-29 12:54:52 +02:00
Matthew Honnibal
e1fdf2b7c5 Upd tests 2020-09-29 12:05:38 +02:00
Ines Montani
ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Ines Montani
2e9c9e74af Fix config resolution and interpolation
TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)
2020-09-28 15:34:00 +02:00
Ines Montani
822ea4ef61 Refactor CLI 2020-09-28 15:09:59 +02:00
Matthew Honnibal
a976da168c
Support data augmentation in Corpus (#6155)
* Support data augmentation in Corpus

* Note initial docs for data augmentation

* Add augmenter to quickstart

* Fix flake8

* Format

* Fix test

* Update spacy/tests/training/test_training.py

* Improve data augmentation arguments

* Update templates

* Move randomization out into caller

* Refactor

* Update spacy/training/augment.py

* Update spacy/tests/training/test_training.py

* Fix augment

* Fix test
2020-09-28 03:03:27 +02:00
Ines Montani
9016d23cc5 Fix exclude and add test 2020-09-27 23:34:03 +02:00
Ines Montani
7e938ed63e Update config resolution to use new Thinc 2020-09-27 22:21:31 +02:00
Adriane Boyd
8393dbedad Minor fixes
* Put `cfg` back in serialization
* Add `pickle5` to pytest conf
2020-09-27 15:15:53 +02:00
Adriane Boyd
11e195d3ed Update ChineseTokenizer
* Allow `pkuseg_model` to be set to `None` on initialization
* Don't save config within tokenizer
* Force convert pkuseg_model to use pickle protocol 4 by reencoding with
`pickle5` on serialization
* Update pkuseg serialization test
2020-09-27 14:00:18 +02:00
Ines Montani
ca3c997062 Improve CLI config validation with latest Thinc 2020-09-26 13:13:57 +02:00
Adriane Boyd
3c062b3911
Add MORPH handling to Matcher (#6107)
* Add MORPH handling to Matcher

* Add `MORPH` to `Matcher` schema
* Rename `_SetMemberPredicate` to `_SetPredicate`
* Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate`
  * Add special handling for normalization and conversion of morph
    values into sets
  * For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only
    matches for 0 or 1 values

* Update test

* Rename to IS_SUBSET and IS_SUPERSET
2020-09-24 16:55:09 +02:00
Adriane Boyd
59340606b7
Add option to disable Matcher errors (#6125)
* Add option to disable Matcher errors

* Add option to disable Matcher errors when a doc doesn't contain a
particular type of annotation

Minor additional change:

* Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH`
values

* Rename suppress_errors to allow_missing

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>

* Refactor annotation checks in Matcher and PhraseMatcher

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-24 16:54:39 +02:00
Sofie Van Landeghem
c7eedd3534
updates to NEL functionality (#6132)
* NEL: read sentences and ents from reference

* fiddling with sent_start annotations

* add KB serialization test

* KB write additional file with strings.json

* score_links function to calculate NEL P/R/F

* formatting

* documentation
2020-09-24 16:53:59 +02:00
Ines Montani
d0ef4a4cf5 Prevent division by zero in score weights 2020-09-24 16:42:13 +02:00
Ines Montani
58dde293ce
Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2 2020-09-24 14:44:42 +02:00
Adriane Boyd
8eaacaae97 Refactor Doc.ents setter to use Doc.set_ents
Additional changes:

* Entity spans with missing labels are ignored
* Fix ent_kb_id setting in `Doc.set_ents`
2020-09-24 12:36:51 +02:00
Ines Montani
c6c67b606e
Merge pull request #6133 from explosion/fix/score_weights 2020-09-24 12:00:57 +02:00
Ines Montani
4bbe41f017 Fix combined scores and update test 2020-09-24 10:42:47 +02:00
Sofie Van Landeghem
c645c4e7ce
fix micro PRF for textcat (#6130)
* fix micro PRF for textcat

* small fix
2020-09-24 10:31:17 +02:00
Ines Montani
ae51f580c1 Fix handling of score_weights 2020-09-24 10:27:33 +02:00
svlandeg
b816ace4bb format 2020-09-23 17:33:13 +02:00
svlandeg
5a9fdbc8ad state_type as Literal 2020-09-23 17:32:14 +02:00
svlandeg
dd2292793f 'parser' instead of 'deps' for state_type 2020-09-23 16:53:49 +02:00
svlandeg
6c85fab316 state_type and extra_state_tokens instead of nr_feature_tokens 2020-09-23 13:35:09 +02:00
Ines Montani
60a317520a
Merge pull request #6109 from svlandeg/feature/2rename 2020-09-23 09:47:12 +02:00
Sofie Van Landeghem
86a08f819d
tok2vec.update instead of predict (#6113) 2020-09-22 21:54:52 +02:00
Sofie Van Landeghem
e0e793be4d
fix KB IO (#6118) 2020-09-22 21:53:06 +02:00
Sofie Van Landeghem
d53c84b6d6
avoid None callback (#6100) 2020-09-22 13:54:44 +02:00
Adriane Boyd
535842e483
Merge branch 'develop' into feature/doc-ents-v3-2 2020-09-22 13:45:50 +02:00
Ines Montani
5e3b796b12 Validate section refs in debug config 2020-09-22 12:24:39 +02:00
svlandeg
e1b8090b9b few more fixes 2020-09-22 12:01:06 +02:00
svlandeg
b556a10808 rename converts in_to_out 2020-09-22 11:50:19 +02:00