Matthew Honnibal
db84d175c3
Fix test
2020-10-05 19:59:30 +02:00
Matthew Honnibal
6dcc4a0ba6
Simplify MultiHashEmbed signature
2020-10-05 19:57:45 +02:00
Matthew Honnibal
7d93575f35
spacy/tests/
2020-10-05 15:28:12 +02:00
Matthew Honnibal
f4ca9a39cb
spacy/tests/
2020-10-05 15:27:06 +02:00
Matthew Honnibal
f2f1deca66
spacy/tests/
2020-10-05 15:24:33 +02:00
Matthew Honnibal
8ec79ad3fa
Allow configuration of MultiHashEmbed features
...
Update arguments to MultiHashEmbed layer so that the attributes can be
controlled. A kind of tricky scheme is used to allow optional
specification of the rows. I think it's an okay balance between
flexibility and convenience.
2020-10-05 15:22:00 +02:00
Adriane Boyd
5d19dfc9d3
Update Chinese tokenizer for spacy-pkuseg fork
2020-10-05 14:21:53 +02:00
Ines Montani
6958510bda
Include spaCy version check in project CLI
2020-10-05 13:53:07 +02:00
Ines Montani
20f2a17a09
Merge test_misc and test_util
2020-10-05 13:45:57 +02:00
Ines Montani
1c641e41c3
Remove unused import [ci skip]
2020-10-05 11:50:11 +02:00
Adriane Boyd
b0b93854cb
Update ru/uk lemmatizers for new nlp.initialize
2020-10-05 09:27:16 +02:00
Ines Montani
549758f67d
Adjust test for now
2020-10-04 23:16:09 +02:00
Ines Montani
3c36a57e84
Update data augmenters ( #6196 )
...
* Draft lower-case augmenter
* Make warning a debug log
* Update lowercase augmenter, docs and tests
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-10-04 17:46:29 +02:00
Ines Montani
496228771d
Merge pull request #6194 from explosion/master-tmp
2020-10-04 15:25:41 +02:00
Ines Montani
0307a228c8
Merge pull request #6193 from explosion/fix/adjust-pipe-init
...
Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe
2020-10-04 15:20:54 +02:00
Ines Montani
59deeb7da6
Merge branch 'develop' into master-tmp
2020-10-04 14:52:20 +02:00
Ines Montani
8f018e47f8
Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe
2020-10-04 14:43:45 +02:00
Ines Montani
11347f34da
Tidy up, tests and docs
2020-10-04 13:54:05 +02:00
Ines Montani
d3b3663942
Adjust error message and add test
2020-10-04 10:11:27 +02:00
Ines Montani
2110e8f86d
Auto-format
2020-10-04 10:06:49 +02:00
Matthew Honnibal
835070cedc
Upd test
2020-10-03 19:35:10 +02:00
Ines Montani
c2401fca41
Add tests for Pipe.label_data
2020-10-03 19:12:46 +02:00
Ines Montani
3bc3c05fcc
Tidy up and auto-format
2020-10-03 17:20:18 +02:00
Ines Montani
7c4ab7e82c
Fix Lemmatizer.get_lookups_config
2020-10-03 17:16:10 +02:00
Ines Montani
dd542ec6a4
Fix label initialization of textcat component ( #6190 )
2020-10-03 17:07:38 +02:00
Sofie Van Landeghem
09dcb75076
small UX fix for DocBin ( #6167 )
...
* add informative warning when messing up store_user_data DocBin flags
* cleanup test
* rename to patterns_path
2020-10-02 15:43:32 +02:00
Ines Montani
f0b30aedad
Make lemmatizers use initialize logic ( #6182 )
...
* Make lemmatizer use initialize logic and tidy up
* Fix typo
* Raise for uninitialized tables
2020-10-02 15:42:36 +02:00
Ines Montani
d2aa662ab2
Merge pull request #6179 from adrianeboyd/feature/token-morph-refactor-2 [ci skip]
2020-10-02 12:10:27 +02:00
Ines Montani
c41a4332e4
Add test for custom data augmentation
2020-10-02 11:37:56 +02:00
Adriane Boyd
f83dfe62da
Fix test
2020-10-02 10:17:26 +02:00
Ines Montani
01c1538c72
Integrate file readers
2020-10-02 01:36:06 +02:00
Adriane Boyd
86c3ec9c2b
Refactor Token morph setting ( #6175 )
...
* Refactor Token morph setting
* Remove `Token.morph_`
* Add `Token.set_morph()`
* `0` resets `token.c.morph` to unset
* Any other values are passed to `Morphology.add`
* Add token.morph setter to set from MorphAnalysis
2020-10-01 22:21:46 +02:00
Ines Montani
d48ddd6c9a
Remove default initialize lookups
2020-10-01 21:54:33 +02:00
Adriane Boyd
73538782a0
Switch Doc.__init__(ents=) to IOB tags ( #6173 )
...
* Switch Doc.__init__(ents=) to IOB tags
* Fix check for "-"
* Allow "" or None as missing IOB tag
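A minimal sketch of how token-level IOB tags map onto entity spans, written in plain Python rather than against spaCy itself. The function name `iob_to_spans` and the `(start, end, label)` return shape are hypothetical and chosen for illustration only. The sketch treats `""` and `None` (missing) the same as `"O"` (outside an entity), which is a simplification; the library may keep the two apart.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tags: List[Optional[str]]) -> List[Tuple[int, int, str]]:
    """Convert per-token IOB tags into (start, end, label) token spans.

    Hypothetical helper for illustration; not part of spaCy.
    """
    spans = []
    start, label = None, None  # the currently open entity, if any
    for i, tag in enumerate(tags):
        if tag is None or tag in ("", "O"):
            # Missing or outside: no entity on this token.
            prefix, tag_label = "O", None
        else:
            # A tag must look like "B-LABEL" or "I-LABEL".
            if len(tag) < 3 or tag[1] != "-" or tag[0] not in "BI":
                raise ValueError(f"Malformed IOB tag at position {i}: {tag!r}")
            prefix, tag_label = tag[0], tag[2:]
        # Close the open span unless this token continues it.
        continues = prefix == "I" and start is not None and tag_label == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # Open a new span on "B", or on an "I" with nothing to continue.
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, tag_label
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tags = ["B-PERSON", "I-PERSON", "O", None, "B-GPE", ""]
print(iob_to_spans(tags))
```

The end index is exclusive, so the first span above covers tokens 0 and 1.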
2020-10-01 16:22:18 +02:00
Yohei Tamura
3243ddac8f
Fix/span.sent ( #6083 )
...
* add fail test
* fix test
* fix span.sent
* Remove incorrect implicit check
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-01 14:01:52 +02:00
Ines Montani
381258b75b
Merge pull request #6165 from explosion/feature/update-tokenizers-initialize
2020-10-01 09:49:47 +02:00
Ines Montani
a103ab5f1a
Update augmenter lookups and docs
2020-09-30 23:03:47 +02:00
Ines Montani
23c63eefaf
Tidy up env vars [ci skip]
2020-09-30 15:15:11 +02:00
Adriane Boyd
6b7bb32834
Refactor Chinese initialization
2020-09-30 11:46:45 +02:00
Ines Montani
34f9c26c62
Add lexeme norm defaults
2020-09-30 10:20:14 +02:00
Ines Montani
1aeef3bfbb
Make corpus paths default to None and improve errors
2020-09-29 22:33:46 +02:00
Ines Montani
fa47f87924
Tidy up and auto-format
2020-09-29 21:39:28 +02:00
Ines Montani
6467a560e3
WIP: Test updating Chinese tokenizer
2020-09-29 21:10:22 +02:00
Ines Montani
78021089f9
Merge pull request #6160 from explosion/feature/prepare
2020-09-29 20:55:13 +02:00
Ines Montani
c3f8c09d7d
Merge pull request #6154 from adrianeboyd/bugfix/chinese-tokenizer-pickle
2020-09-29 20:54:59 +02:00
Ines Montani
d3c63b7965
Merge branch 'develop' into feature/prepare
2020-09-29 20:53:05 +02:00
Ines Montani
2be80379ec
Fix small issues, resolve_dot_names and debug model
2020-09-29 20:38:35 +02:00
Ines Montani
7851020653
Update tests
2020-09-29 18:14:15 +02:00
Ines Montani
f2352eb701
Test with default value
2020-09-29 17:00:40 +02:00
Ines Montani
63d1598137
Simplify config use in Language.initialize
2020-09-29 16:05:48 +02:00
Ines Montani
56f8bc73ef
Add more tests
2020-09-29 15:23:34 +02:00
Ines Montani
591038b1a4
Add test
2020-09-29 12:54:52 +02:00
Matthew Honnibal
e1fdf2b7c5
Upd tests
2020-09-29 12:05:38 +02:00
Ines Montani
ff9a63bfbd
begin_training -> initialize
2020-09-28 21:35:09 +02:00
Ines Montani
2e9c9e74af
Fix config resolution and interpolation
...
TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)
2020-09-28 15:34:00 +02:00
Ines Montani
822ea4ef61
Refactor CLI
2020-09-28 15:09:59 +02:00
Matthew Honnibal
a976da168c
Support data augmentation in Corpus ( #6155 )
...
* Support data augmentation in Corpus
* Note initial docs for data augmentation
* Add augmenter to quickstart
* Fix flake8
* Format
* Fix test
* Update spacy/tests/training/test_training.py
* Improve data augmentation arguments
* Update templates
* Move randomization out into caller
* Refactor
* Update spacy/training/augment.py
* Update spacy/tests/training/test_training.py
* Fix augment
* Fix test
2020-09-28 03:03:27 +02:00
Ines Montani
9016d23cc5
Fix exclude and add test
2020-09-27 23:34:03 +02:00
Ines Montani
7e938ed63e
Update config resolution to use new Thinc
2020-09-27 22:21:31 +02:00
Adriane Boyd
8393dbedad
Minor fixes
...
* Put `cfg` back in serialization
* Add `pickle5` to pytest conf
2020-09-27 15:15:53 +02:00
Adriane Boyd
11e195d3ed
Update ChineseTokenizer
...
* Allow `pkuseg_model` to be set to `None` on initialization
* Don't save config within tokenizer
* Force convert pkuseg_model to use pickle protocol 4 by reencoding with
`pickle5` on serialization
* Update pkuseg serialization test
2020-09-27 14:00:18 +02:00
Ines Montani
ca3c997062
Improve CLI config validation with latest Thinc
2020-09-26 13:13:57 +02:00
Adriane Boyd
3c062b3911
Add MORPH handling to Matcher ( #6107 )
...
* Add MORPH handling to Matcher
* Add `MORPH` to `Matcher` schema
* Rename `_SetMemberPredicate` to `_SetPredicate`
* Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate`
* Add special handling for normalization and conversion of morph
values into sets
* For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only
matches for 0 or 1 values
* Update test
* Rename to IS_SUBSET and IS_SUPERSET
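The set semantics described above can be sketched in plain Python, without calling the Matcher. The helper names `morph_to_set`, `is_subset` and `is_superset` are hypothetical, and the `Feature=Value|Feature=Value` string format is assumed from the usual morphology notation.

```python
def morph_to_set(morph):
    """Split a string such as "Case=Nom|Number=Sing" into a set of features."""
    return {feat for feat in morph.split("|") if feat}


def is_subset(token_morph, pattern_values):
    """True if every feature on the token appears in the pattern's list."""
    return morph_to_set(token_morph) <= set(pattern_values)


def is_superset(token_morph, pattern_values):
    """True if the token carries every feature in the pattern's list."""
    return morph_to_set(token_morph) >= set(pattern_values)


print(is_subset("Number=Sing", ["Number=Sing", "Case=Nom"]))    # token has no extra features
print(is_superset("Case=Nom|Number=Sing", ["Number=Sing"]))     # token has all required features
print(is_superset("Number=Sing", ["Number=Sing", "Case=Nom"]))  # token lacks Case=Nom
```

The note about other attributes follows from the same set logic: a single-valued attribute behaves like a one-element set `{v}`, so `{v} <= P` is the same test as `v in P`, while `{v} >= P` can only hold when `P` is empty or equal to `{v}`.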
2020-09-24 16:55:09 +02:00
Adriane Boyd
59340606b7
Add option to disable Matcher errors ( #6125 )
...
* Add option to disable Matcher errors
* Add option to disable Matcher errors when a doc doesn't contain a
particular type of annotation
Minor additional change:
* Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH`
values
* Rename suppress_errors to allow_missing
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Refactor annotation checks in Matcher and PhraseMatcher
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-24 16:54:39 +02:00
Sofie Van Landeghem
c7eedd3534
updates to NEL functionality ( #6132 )
...
* NEL: read sentences and ents from reference
* fiddling with sent_start annotations
* add KB serialization test
* KB write additional file with strings.json
* score_links function to calculate NEL P/R/F
* formatting
* documentation
2020-09-24 16:53:59 +02:00
Ines Montani
d0ef4a4cf5
Prevent division by zero in score weights
2020-09-24 16:42:13 +02:00
Ines Montani
58dde293ce
Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2
2020-09-24 14:44:42 +02:00
Adriane Boyd
8eaacaae97
Refactor Doc.ents setter to use Doc.set_ents
...
Additional changes:
* Entity spans with missing labels are ignored
* Fix ent_kb_id setting in `Doc.set_ents`
2020-09-24 12:36:51 +02:00
Ines Montani
c6c67b606e
Merge pull request #6133 from explosion/fix/score_weights
2020-09-24 12:00:57 +02:00
Ines Montani
4bbe41f017
Fix combined scores and update test
2020-09-24 10:42:47 +02:00
Sofie Van Landeghem
c645c4e7ce
fix micro PRF for textcat ( #6130 )
...
* fix micro PRF for textcat
* small fix
2020-09-24 10:31:17 +02:00
Ines Montani
ae51f580c1
Fix handling of score_weights
2020-09-24 10:27:33 +02:00
svlandeg
b816ace4bb
format
2020-09-23 17:33:13 +02:00
svlandeg
5a9fdbc8ad
state_type as Literal
2020-09-23 17:32:14 +02:00
svlandeg
dd2292793f
'parser' instead of 'deps' for state_type
2020-09-23 16:53:49 +02:00
svlandeg
6c85fab316
state_type and extra_state_tokens instead of nr_feature_tokens
2020-09-23 13:35:09 +02:00
Ines Montani
60a317520a
Merge pull request #6109 from svlandeg/feature/2rename
2020-09-23 09:47:12 +02:00
Sofie Van Landeghem
86a08f819d
tok2vec.update instead of predict ( #6113 )
2020-09-22 21:54:52 +02:00
Adriane Boyd
e4acb28658
Fix norm in retokenizer split ( #6111 )
...
Parallel to behavior in merge, reset norm on original token in
retokenizer split.
2020-09-22 21:53:33 +02:00
Sofie Van Landeghem
e0e793be4d
fix KB IO ( #6118 )
2020-09-22 21:53:06 +02:00
Sofie Van Landeghem
d53c84b6d6
avoid None callback ( #6100 )
2020-09-22 13:54:44 +02:00
Adriane Boyd
535842e483
Merge branch 'develop' into feature/doc-ents-v3-2
2020-09-22 13:45:50 +02:00
Ines Montani
5e3b796b12
Validate section refs in debug config
2020-09-22 12:24:39 +02:00
svlandeg
e1b8090b9b
few more fixes
2020-09-22 12:01:06 +02:00
svlandeg
b556a10808
rename converts in_to_out
2020-09-22 11:50:19 +02:00
Ines Montani
beb766d0a0
Add test
2020-09-22 09:15:57 +02:00
Ines Montani
69f7e52c26
Update README.md
2020-09-22 09:10:06 +02:00
Ines Montani
67fbcb3da5
Tidy up tests and docs
2020-09-21 20:43:54 +02:00
Ines Montani
a5f6ab4943
Merge pull request #6098 from adrianeboyd/feature/doc-init
2020-09-21 18:35:20 +02:00
Adriane Boyd
f212303729
Add sent_starts to Doc.__init__
...
Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start`
values but also convert to and accept `sent_start` internally.
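A rough sketch of the conversion between the two conventions, in plain Python. It assumes the boolean-style `is_sent_start` values (`True`, `False`, `None`) correspond to integer `sent_start` values of `1`, `-1` and `0`. That mapping is an assumption about the internal encoding, and the helper name `to_sent_start` is hypothetical.

```python
def to_sent_start(value):
    """Map an is_sent_start-style value onto an integer sent_start value.

    Assumed encoding: 1 = sentence start, -1 = not a start, 0 = unknown.
    """
    # Check booleans by identity first, because True == 1 in Python.
    if value is True:
        return 1
    if value is False:
        return -1
    # Integer values already in the sent_start convention pass through.
    if value in (-1, 0, 1):
        return value
    # None (and anything unrecognised) is treated as unknown.
    return 0


print([to_sent_start(v) for v in [True, False, None, 1, -1, 0]])
```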
2020-09-21 17:59:09 +02:00
Adriane Boyd
177df15d89
Implement Doc.set_ents
2020-09-21 15:54:05 +02:00
Adriane Boyd
13fbf6556a
Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2
2020-09-21 14:42:04 +02:00
Adriane Boyd
ce455f30ca
Fix formatting
2020-09-21 13:53:29 +02:00
Adriane Boyd
bc02e86494
Extend Doc.__init__ with additional annotation
...
Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to
`Doc.__init__` to initialize the most common doc/token values.
2020-09-21 13:36:24 +02:00
Ines Montani
758ead8a47
Sync overrides with CLI overrides
2020-09-21 12:50:13 +02:00
Ines Montani
5497acf49a
Support config overrides via environment variables
2020-09-21 11:25:10 +02:00
Ines Montani
1114219ae3
Tidy up and auto-format
2020-09-21 10:59:07 +02:00
Adriane Boyd
eed4b785f5
Load vocab lookups tables at beginning of training
...
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.
The option moves from `nlp.load_vocab_data` to `training.lookups`.
Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.
The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.
To load `lexeme_norm` from `spacy-lookups-data`:
```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
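One possible reading of "strict" loading, sketched in plain Python: only the tables named in the config are loaded, and a requested table that is not available raises an error instead of being skipped. The function `load_tables`, its arguments and the sample data are all hypothetical; this does not call `spacy-lookups-data`.

```python
def load_tables(available, tables, strict=True):
    """Return only the requested tables from a mapping of name -> table.

    Hypothetical helper: with strict=True a missing table is an error.
    """
    missing = [name for name in tables if name not in available]
    if missing and strict:
        raise KeyError(f"Requested lookup tables not found: {missing}")
    return {name: available[name] for name in tables if name in available}


# Made-up stand-in for the tables a lookups package might provide.
available = {
    "lexeme_norm": {"cos": "because"},
    "lexeme_prob": {"the": -3.5},
}
loaded = load_tables(available, ["lexeme_norm"])
print(sorted(loaded))
```

Because only the named tables are returned, larger tables such as probabilities stay out unless the config lists them.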
2020-09-18 15:59:16 +02:00
Ines Montani
a127fa475e
Merge pull request #6078 from svlandeg/fix/corpus
2020-09-18 14:44:21 +02:00
Adriane Boyd
a88106e852
Remove W106: HEAD and SENT_START in doc.from_array ( #6086 )
...
* Remove W106: HEAD and SENT_START in doc.from_array
This warning was hacky and being triggered too often.
* Fix test
2020-09-18 03:01:29 +02:00