Adriane Boyd
e0f5646a4a
Restore cleanup_beam method ( #6446 )
2020-11-25 13:21:48 +01:00
Adriane Boyd
cf693f0eae
Fix token_match in tokenizer
2020-11-25 11:49:34 +01:00
Adriane Boyd
724831b066
Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master
...
* Update Macedonian for v3
* Update Turkish for v3
2020-11-25 11:49:34 +01:00
Adriane Boyd
573f5c863f
Fix tag map clobbering in spacy train ( #6437 )
...
Fix bug from #5768 where the tag map is clobbered if a custom tag map
isn't provided.
2020-11-24 13:13:16 +01:00
Adriane Boyd
ce18fc6588
Set version to v2.3.3
2020-11-24 10:03:45 +01:00
Adriane Boyd
cd61d264ef
Set version to v2.3.3.dev0
2020-11-23 13:51:59 +01:00
Sofie Van Landeghem
2af31a8c8d
Bugfix textcat reproducibility on GPU ( #6411 )
...
* add seed argument to ParametricAttention layer
* bump thinc to 7.4.3
* set thinc version range
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-11-23 12:29:35 +01:00
Adriane Boyd
3f61f5eb54
Use int8_t instead of char in Matcher ( #6413 )
...
* Use signed char instead of char in Matcher
Remove unused char* utf8_t typedef
* Use int8_t instead of signed char
2020-11-23 10:26:47 +01:00
Adriane Boyd
4284605683
Remove Beam cleanup ( #6414 )
...
Beam cleanup is handled through the Beam finalization method.
2020-11-23 10:01:46 +01:00
Adriane Boyd
a8c2dad466
Add all vectors to vocab before pruning ( #6408 )
...
Add all vectors to the vocab before pruning to correct the selection of
vectors to prioritize.
2020-11-23 10:00:59 +01:00
svlandeg
636be3c791
Merge remote-tracking branch 'upstream/develop' into feature/trf-docs
2020-11-19 14:15:35 +01:00
svlandeg
73fc1ed963
remove labels from morphologizer constructor
2020-11-11 21:48:50 +01:00
svlandeg
d5a920325f
remove labels from constructor
2020-11-11 21:34:12 +01:00
Adriane Boyd
320a8b1481
Add ent_id_ to strings serialized with Doc ( #6353 )
2020-11-10 20:16:07 +08:00
Adriane Boyd
a7e7d6c6c9
Ignore misaligned in Morphologizer.get_loss ( #6363 )
...
Fix bug where `Morphologizer.get_loss` treated misaligned annotation as
`EMPTY_MORPH` rather than ignoring it. Remove unneeded default `EMPTY_MORPH`
mappings.
2020-11-10 20:15:09 +08:00
Sofie Van Landeghem
a0c899a0ff
Fix textcat + transformer architecture ( #6371 )
...
* add pooling to textcat TransformerListener
* maybe_get_dim in case it's null
2020-11-10 20:14:47 +08:00
Ines Montani
de6453940e
Merge pull request #6305 from svlandeg/feature/score-docs [ci skip]
2020-11-10 02:52:11 +01:00
Ines Montani
d7950c5ada
Merge pull request #6297 from adrianeboyd/docs/nightly-conda-install [ci skip]
2020-11-10 02:45:52 +01:00
svlandeg
789fb3d124
add docs for upstream argument of TransformerListener
2020-11-09 21:42:58 +01:00
Ines Montani
363ac73c72
Update docs [ci skip]
2020-11-09 12:43:26 +08:00
Daniel Vasic
20d72de986
Added Multext-East V5 tagset for Croatian language ( #6248 )
...
* Added Multext-East V5 tagset for Croatian language
* Create danielvasic.md
* Update danielvasic.md
* Update danielvasic.md
* Add tag map to CroatianDefaults
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-11-05 12:19:22 +01:00
Robert Šípek
6069efe57d
Add tag map to cs language ( #6284 )
2020-11-05 10:13:11 +01:00
Vu Ha
6d465ec52c
add oprd to the list of accepted deps for noun chunking ( #6302 )
...
* add oprd to the list of accepted deps for noun chunking
* add SCA
2020-11-05 09:17:35 +01:00
Adriane Boyd
31de700b0f
Fix on_match callback and remove empty patterns ( #6312 )
...
For the `DependencyMatcher`:
* Fix on_match callback so that it is called once per matched pattern
* Fix results so that patterns with empty match lists are not returned
2020-11-05 09:16:26 +01:00
Sofie Van Landeghem
8ef056cf98
fix embed_size in Entity Linker architecture ( #6343 )
2020-11-04 22:20:13 +01:00
Adriane Boyd
084fc575aa
Set version to v3.0.0rc3
2020-11-03 17:29:57 +01:00
Adriane Boyd
1c4df8fd09
Replace pytokenizations with internal alignment ( #6293 )
...
* Replace pytokenizations with internal alignment
Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.
* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`
* Refactor trailing whitespace handling
* Remove unnecessary exception for empty docs
Allow a non-empty whitespace-only doc to be aligned with an empty doc
* Remove empty docs exceptions completely
2020-11-03 16:24:38 +01:00
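Note on the commit above: the idea of an alignment that tolerates only whitespace and capitalization differences can be sketched in plain Python. This is a minimal illustration, not spaCy's implementation. The names `_chars` and `get_alignments_sketch` are hypothetical, and the real `get_alignments` may differ in signature and return type.

```python
from typing import List, Sequence, Tuple


def _chars(tokens: Sequence[str]) -> Tuple[str, List[int]]:
    """Return the lowercased, whitespace-free text and, for each kept
    character, the index of the token it came from."""
    chars: List[str] = []
    owners: List[int] = []
    for i, tok in enumerate(tokens):
        for ch in tok.lower():
            if ch.isspace():
                continue
            chars.append(ch)
            owners.append(i)
    return "".join(chars), owners


def get_alignments_sketch(
    a: Sequence[str], b: Sequence[str]
) -> Tuple[List[List[int]], List[List[int]]]:
    """Map each token in `a` to the tokens in `b` that cover the same
    characters, and vice versa. Raises ValueError if the two tokenizations
    differ in anything other than whitespace or capitalization."""
    text_a, owners_a = _chars(a)
    text_b, owners_b = _chars(b)
    if text_a != text_b:
        raise ValueError("tokenizations differ beyond whitespace/case")
    a2b: List[List[int]] = [[] for _ in a]
    b2a: List[List[int]] = [[] for _ in b]
    # Walk both character streams in lockstep and record token overlaps.
    for ia, ib in zip(owners_a, owners_b):
        if not a2b[ia] or a2b[ia][-1] != ib:
            a2b[ia].append(ib)
        if not b2a[ib] or b2a[ib][-1] != ia:
            b2a[ib].append(ia)
    return a2b, b2a
```

A whitespace-only token ends up with an empty alignment list in this sketch, which mirrors the commit's point that a whitespace-only doc can be aligned with an empty one.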
Adriane Boyd
a4b32b9552
Handle missing reference values in scorer ( #6286 )
...
* Handle missing reference values in scorer
Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.
Attributes without unset states:
* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation
Additional changes:
* add optional `has_annotation` check to `score_spans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`
* Fix import
* Update return types
2020-11-03 15:47:18 +01:00
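Note on the commit above: the scoring behavior it describes (skip unset reference values, and report `None` instead of a score when nothing is annotated) can be sketched as follows. This is an illustration only. `token_accuracy` and `format_score` are hypothetical helpers, and using `None` as the "unset" marker is an assumption, not spaCy's `Scorer` API.

```python
from typing import Optional, Sequence


def token_accuracy(
    predicted: Sequence[Optional[str]],
    reference: Sequence[Optional[str]],
) -> Optional[float]:
    """Accuracy over tokens whose reference value is set.

    Assumption: None marks an unset reference value. Returns None rather
    than 0.0 when no reference token carries annotation.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    correct = 0
    total = 0
    for pred, gold in zip(predicted, reference):
        if gold is None:
            # Missing reference annotation: ignore, do not count as an error.
            continue
        total += 1
        if pred == gold:
            correct += 1
    if total == 0:
        return None
    return correct / total


def format_score(score: Optional[float]) -> str:
    """Render a score for display, using "-" for a missing score."""
    return "-" if score is None else f"{score:.2f}"
```

Returning `None` keeps "no annotation available" distinguishable from "everything was wrong", which a score of 0.0 would not.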
Adriane Boyd
5d2cb86c34
Fix on_match callback for DependencyMatcher ( #6313 )
...
Fix `DependencyMatcher` so that the callback is called only once per
match.
2020-10-31 12:20:27 +01:00
Adriane Boyd
45c9a68828
Identify final Matcher pattern node by quantifier ( #6317 )
...
Modify the internal pattern representation in `Matcher` patterns to
identify the final ID state using a unique quantifier rather than a
combination of other attributes.
It was insufficient to identify the final ID node based on an
uninitialized `quantifier` (which coincidentally has the same value as `ZERO`)
with `nr_attr` as 0. (In addition, it was potentially bug-prone that
`nr_attr` was set to 0 even though attrs were allocated.)
In the case of `{"OP": "!"}` (a valid, if pointless, pattern), `nr_attr`
is 0 and the quantifier is ZERO, so the previous methods for
incrementing to the ID node at the end of the pattern weren't able to
distinguish the final ID node from the `{"OP": "!"}` pattern.
2020-10-31 12:18:48 +01:00
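Note on the commit above: the ambiguity it describes can be reproduced with a small model of the pattern nodes. This is a sketch under assumptions, not the Cython internals of `Matcher`. The `Node` class, the `FINAL_ID` member name and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Quantifier(Enum):
    ZERO = 0       # {"OP": "!"}
    ZERO_ONE = 1   # {"OP": "?"}
    ZERO_PLUS = 2  # {"OP": "*"}
    ONE = 3        # default
    ONE_PLUS = 4   # {"OP": "+"}
    FINAL_ID = 5   # assumed sentinel used only by the final ID node


@dataclass
class Node:
    quantifier: Quantifier
    nr_attr: int


def is_final_old(node: Node) -> bool:
    # Old heuristic: a zero-valued quantifier plus no attributes.
    return node.quantifier is Quantifier.ZERO and node.nr_attr == 0


def is_final_new(node: Node) -> bool:
    # New check: a quantifier value that no user pattern can produce.
    return node.quantifier is Quantifier.FINAL_ID


def final_index(nodes: List[Node], is_final: Callable[[Node], bool]) -> int:
    """Index of the first node that the given check treats as final."""
    return next(i for i, node in enumerate(nodes) if is_final(node))


# Pattern [{"OP": "!"}] followed by the final ID node, in both encodings.
old_nodes = [Node(Quantifier.ZERO, 0), Node(Quantifier.ZERO, 0)]
new_nodes = [Node(Quantifier.ZERO, 0), Node(Quantifier.FINAL_ID, 0)]
```

With the old encoding the check stops at index 0, the `{"OP": "!"}` token itself, while the sentinel quantifier points at index 1 as intended.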
Sofie Van Landeghem
2918923541
fix resolving of dot notation ( #6326 )
2020-10-31 12:17:06 +01:00
Duygu Altinok
0e55f806dd
Turkish tokenization improvements ( #6268 )
...
* added single and paired orth variants
* added token match
* added long text tokenization test
* inverted init
* normalized lemmas to lowercase
* more abbrevs
* tests for ordinals and abbrevs
* separated period abbrevs to another list
* fixed typo
* added ordinal and abbrev tests
* added number tests for dates
* minor refinement
* added inflected abbrevs regex
* added percentage and inflection
* cosmetics
* added token match
* added url inflection tests
* excluded url tokens from custom pattern
* removed url match import
2020-10-29 09:43:17 +01:00
svlandeg
080066ae74
remove TODO note
2020-10-26 10:37:25 +01:00
Ines Montani
2c9804038d
Fix success message [ci skip]
2020-10-23 16:11:54 +02:00
Adriane Boyd
4299a7f654
Setup / install / quickstart updates
...
* Add `cuda110` to setup.cfg and quickstart dropdown
* Switch to `pip` for pip-only packages in conda quickstart instructions
* Update zh pkuseg install message with version range and conda
* Remove `zh` from `extras_require` because the default doesn't require
additional packages
2020-10-23 11:27:54 +02:00
Adriane Boyd
563a21834e
Save raw scores in evaluate output
2020-10-19 15:49:09 +02:00
Adriane Boyd
dd207ca6d0
Add dep_las_per_type and more generic PRF printer
2020-10-19 15:49:02 +02:00
Adriane Boyd
4300858ecb
Include per-type/feat scores in evaluate output
2020-10-19 15:48:55 +02:00
Sofie Van Landeghem
75a202ce65
TextCat updates and fixes ( #6263 )
...
* small fix in example imports
* throw error when train_corpus or dev_corpus is not a string
* small fix in custom logger example
* limit macro_auc to labels with 2 annotations
* fix typo
* also create parents of output_dir if need be
* update documentation of textcat scores
* refactor TextCatEnsemble
* fix tests for new AUC definition
* bump to 3.0.0a42
* update docs
* rename to spacy.TextCatEnsemble.v2
* spacy.TextCatEnsemble.v1 in legacy
* cleanup
* small fix
* update to 3.0.0rc2
* fix import that got lost in merge
* cursed IDE
* fix two typos
2020-10-18 14:50:41 +02:00
Ines Montani
5a6ed01ce0
Merge pull request #6262 from adrianeboyd/bugfix/template-en-vectors
2020-10-16 15:38:08 +02:00
Adriane Boyd
c8d04b79e2
Sort and add vectors for langs without transformers
2020-10-16 08:25:16 +02:00
Adriane Boyd
2fbd43c603
Use core lg models as vectors models in quickstart
2020-10-16 08:17:53 +02:00
Jan Margeta
1ad2213349
Fix TokenPatternSchema pattern field validation
...
Empty pattern field should be considered invalid
This is fixed by replacing minItems with min_items
as described in Pydantic docs:
https://pydantic-docs.helpmanual.io/usage/schema/
2020-10-16 00:41:21 +02:00
Borijan Georgievski
2311192ba1
Include Macedonian language ( #6230 )
...
* Include Macedonian language
* Fix indentation at char_classes.py
* Fix indentation at char_classes.py
* Add Macedonian tests, update lex_attrs and char_classes
* Import unicode literals for python 2
2020-10-15 15:55:01 +02:00
Ines Montani
ff4267d181
Fix success message [ci skip]
2020-10-15 14:42:08 +02:00
Ines Montani
10611bf56a
Increment version [ci skip]
2020-10-15 13:30:11 +02:00
Ines Montani
4e17ddf75e
Merge pull request #6256 from adrianeboyd/bugfix/docs-to-json-raw
2020-10-15 10:35:01 +02:00
Ines Montani
b1d568a4df
Tidy up tests
2020-10-15 10:20:21 +02:00
Ines Montani
d165af26be
Auto-format [ci skip]
2020-10-15 10:08:53 +02:00
Adriane Boyd
a93d42861d
Use null raw for has_unknown_spaces in docs_to_json
2020-10-15 09:57:54 +02:00
Ines Montani
5665a21517
Tidy up
2020-10-15 09:30:32 +02:00
Ines Montani
5d62499266
Fix tests
2020-10-15 09:29:15 +02:00
Ines Montani
178760855f
Merge branch 'develop' into master-tmp
2020-10-15 09:06:03 +02:00
Ines Montani
bc85b12e6d
Merge pull request #6249 from svlandeg/feature/batch-tests
2020-10-15 08:57:56 +02:00
svlandeg
0796401c19
call NumpyOps instead of get_current_ops()
2020-10-14 16:55:00 +02:00
svlandeg
44e14ccae8
one more losses fix
2020-10-14 15:11:34 +02:00
svlandeg
0aa8851878
always return losses
2020-10-14 15:00:49 +02:00
svlandeg
e94a21638e
adding tests for trained models to ensure predict reproducibility
2020-10-13 21:07:13 +02:00
svlandeg
ede979d42f
formatting
2020-10-13 18:53:17 +02:00
svlandeg
ff83bfae3f
naming
2020-10-13 18:52:37 +02:00
svlandeg
6ccacff54e
add tests for individual spacy layers
2020-10-13 18:50:07 +02:00
svlandeg
c23041ae60
component tests single or multiple prediction
2020-10-13 16:26:53 +02:00
Ines Montani
1f49300862
Update transformer recommendations [ci skip]
2020-10-13 15:41:17 +02:00
Sofie Van Landeghem
f8a1c1afd6
avoid dropout at runtime ( #6247 )
2020-10-13 14:39:59 +02:00
Ines Montani
86d648740f
Fix morph representation in Doc.to_json
2020-10-13 11:39:03 +02:00
Ines Montani
7f92a5ee6a
Update spacy/lang/ta/examples.py
2020-10-13 11:03:35 +02:00
Ines Montani
a0e12c136b
Increment version [ci skip]
2020-10-13 10:00:53 +02:00
Ines Montani
f090f39f17
Merge pull request #6245 from svlandeg/bugfix/else
...
bugfix in _pipe
2020-10-13 09:59:06 +02:00
svlandeg
1f465bea18
if-else
2020-10-13 09:27:19 +02:00
svlandeg
40276fd3be
update NEL docs after latest refactor
2020-10-12 11:41:27 +02:00
Ines Montani
4fa967ea84
Increment version [ci skip]
2020-10-11 13:10:58 +02:00
Ines Montani
ab890a35f9
Make console logger table more compact
2020-10-11 12:55:46 +02:00
Ines Montani
99606e46fe
Relax meta.json schema [ci skip]
2020-10-11 12:30:57 +02:00
svlandeg
3a505e7e14
small edit to ensure the new word was indeed new
2020-10-10 21:05:28 +02:00
svlandeg
68d79796c6
add test for vocab after serializing KB
2020-10-10 20:59:48 +02:00
Ines Montani
539b0c10da
Tidy up and auto-format
2020-10-10 19:14:48 +02:00
Ines Montani
bfa3931c9d
Revert added_strings change ( #6236 )
2020-10-10 18:55:07 +02:00
Ines Montani
796f8b9424
Increment version
2020-10-09 18:00:27 +02:00
Ines Montani
525f798841
Fix typo in test
2020-10-09 18:00:21 +02:00
Ines Montani
8ac5f22253
Adjust error message
2020-10-09 18:00:16 +02:00
svlandeg
08cb085f6c
Merge remote-tracking branch 'upstream/develop' into fix/various
2020-10-09 17:01:27 +02:00
Ines Montani
b7cb9d95e4
Merge pull request #6229 from svlandeg/bugfix/disabled
2020-10-09 16:05:11 +02:00
svlandeg
e972ecba72
add utf8 encoding for opening file
2020-10-09 16:03:14 +02:00
Ines Montani
9fb3244672
Merge pull request #6231 from adrianeboyd/feature/include-static-vectors
2020-10-09 15:54:52 +02:00
svlandeg
040c7c0541
fix get_dim calls in build_simple_cnn_text_classifier
2020-10-09 15:40:58 +02:00
Adriane Boyd
727370c633
Remove Span._recalculate_indices
...
Remove `Span._recalculate_indices`, which is a remnant from the
deprecated `Span.merge`.
2020-10-09 14:42:51 +02:00
svlandeg
853edace37
fix MultiHashEmbed example in documentation
2020-10-09 14:11:06 +02:00
svlandeg
06b9d213fd
formatting
2020-10-09 12:19:47 +02:00
svlandeg
2cafba5f50
shorten error message for clarity
2020-10-09 12:17:35 +02:00
Ines Montani
4771a10503
Make test more explicit [ci skip]
2020-10-09 12:15:26 +02:00
Ines Montani
cc3646b06c
Add xfailing test for peculiar spans failure [ci skip]
2020-10-09 12:10:25 +02:00
svlandeg
8316bc7d4a
bugfix DisabledPipes
2020-10-09 12:06:20 +02:00
svlandeg
18dfb27985
Add custom error when evaluation throws a KeyError
2020-10-09 12:05:33 +02:00
Adriane Boyd
39aabf50ab
Also rename to include_static_vectors in CharEmbed
2020-10-09 11:54:48 +02:00
Florijan Stamenković
18f5c309dc
Fix Issue 6207 ( #6208 )
...
* Regression test for issue 6207
* Fix issue 6207
* Sign contributor agreement
* Minor adjustments to test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:14:40 +02:00
Duygu Altinok
80fb1bffc9
Ordinal numbers for Turkish ( #6142 )
...
* minor ordinal number addition
* fixed typo
* added corresponding lexical test
2020-10-09 10:13:15 +02:00
Duygu Altinok
2fad279a44
Turkish language syntax iterators ( #6191 )
...
* added tr_vocab to config
* basic test
* added syntax iterator to Turkish lang class
* first version for Turkish syntax iter, without flat
* added simple tests with nmod, amod, det
* more tests to amod and nmod
* separated noun chunks and parser test
* rearrangement after nchunk parser separation
* added recursive NPs
* tests with complicated recursive NPs
* tests with conjed NPs
* additional tests for conj NP
* small modification for shaving off conj from NP
* added tests with flat
* more tests with flat
* added examples with flats conjed
* added inner func for flat trick
* corrected parse
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-09 10:10:22 +02:00
Sofie Van Landeghem
d093d6343b
TrainablePipe ( #6213 )
...
* rename Pipe to TrainablePipe
* split functionality between Pipe and TrainablePipe
* remove unnecessary methods from certain components
* cleanup
* hasattr(component, "pipe") should be sufficient again
* remove serialization and vocab/cfg from Pipe
* unify _ensure_examples and validate_examples
* small fixes
* hasattr checks for self.cfg and self.vocab
* make is_resizable and is_trainable properties
* serialize strings.json instead of vocab
* fix KB IO + tests
* fix typos
* more typos
* _added_strings as a set
* few more tests specifically for _added_strings field
* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Ines Montani
8ff73f04db
Fix morph in Doc.to_json
2020-10-08 14:44:35 +02:00
Ines Montani
064575d79d
Merge pull request #6216 from svlandeg/feature/nel-initialize
2020-10-08 11:14:12 +02:00
svlandeg
3e2e1fd323
cleanup
2020-10-08 10:37:32 +02:00
svlandeg
eaf5c265cb
set_kb method for entity_linker
2020-10-08 10:34:01 +02:00
Ines Montani
010956d493
Clear rule-based components on initialize
2020-10-08 09:51:31 +02:00
Baranitharan
d6037c1860
added sentence
2020-10-08 08:22:58 +05:30
Baranitharan
81afe9b19d
Update examples.py
2020-10-08 08:17:25 +05:30
Sofie Van Landeghem
241cd112f5
add reenabled pipe names back to the meta before serializing ( #6219 )
2020-10-08 00:44:16 +02:00
Sofie Van Landeghem
2998131416
Reproducibility for TextCat and Tok2Vec ( #6218 )
...
* ensure fixed seed in HashEmbed layers
* forgot about the joys of python 2
2020-10-08 00:43:46 +02:00
svlandeg
efedccea8d
fix tests
2020-10-07 15:29:52 +02:00
svlandeg
6b8bdb2d39
add init_config to nlp.create_pipe
2020-10-07 14:58:16 +02:00
svlandeg
33c2d4af16
move kb_loader to initialize for NEL instead of constructor
2020-10-07 14:56:00 +02:00
Wannaphong Phatthiyaphaibun
9fc8392b38
Add Thai tag map (LST20 Corpus) ( #6163 )
...
* Add Thai tag map (LST20 Corpus)
By @korakot
* Update tag_map.py
* Update tag_map.py
* Update tag_map.py
2020-10-07 11:12:01 +02:00
Duygu Altinok
7e821c2776
Turkish language syntax iterators ( #6191 )
...
* added tr_vocab to config
* basic test
* added syntax iterator to Turkish lang class
* first version for Turkish syntax iter, without flat
* added simple tests with nmod, amod, det
* more tests to amod and nmod
* separated noun chunks and parser test
* rearrangement after nchunk parser separation
* added recursive NPs
* tests with complicated recursive NPs
* tests with conjed NPs
* additional tests for conj NP
* small modification for shaving off conj from NP
* added tests with flat
* more tests with flat
* added examples with flats conjed
* added inner func for flat trick
* corrected parse
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-07 11:07:52 +02:00
Duygu Altinok
2ce6fc2611
Turkish tag map and morph rules addition ( #6141 )
...
* feat: added turkish tag map
* feat: morph rules cconj and sconj
* feat: more conjuncts
* feat: added popular postpositions
* feat: added adverbs
* feat: added personal pronouns
* feat: added reflexive pronouns
* minor: corrected case capital
* minor: fixed comma typo
* feat: added indef pronouns
* feat: added dict iter
* fixed comma typo
* updated language class with tag map and morph
* use default tag map instead
* removed tag map
2020-10-07 10:27:36 +02:00
Duygu Altinok
b95a11dd95
Ordinal numbers for Turkish ( #6142 )
...
* minor ordinal number addition
* fixed typo
* added corresponding lexical test
2020-10-07 10:25:37 +02:00
Rahul Gupta
1a00bff06d
Hindi: Adds tests for lexical attributes (norm and like_num) ( #5829 )
...
* Hindi: Adds tests for lexical attributes (norm and like_num)
* Signs and adds the contributor agreement
* Add ordinal numbers to be tagged as like_num
* Adds alternate pronunciation for 31 and 39
2020-10-07 10:23:32 +02:00
Nuccy90
c809b2c8e7
Update morph_rules.py ( #6102 )
...
* Update morph_rules.py
Added "dig" and "dej" ("you" in accusative form)
* Create Nuccy90.md
* Update Nuccy90.md
2020-10-06 15:14:47 +02:00
Matthew Honnibal
1a500f9717
Set version to v3.0.0a35
2020-10-06 14:19:07 +02:00
Sofie Van Landeghem
fff3f8ccfa
Fix packaging pin ( #6212 )
...
* pin packaging to >=20.0
* ignore spacy-pkuseg in requirements unit test
2020-10-06 14:16:05 +02:00
Matthew Honnibal
cfb9770a94
Fix empty input into StaticVectors layer ( #6211 )
...
* Add test for empty doc(s)
* Fix empty check in staticvectors
* Remove xfail
* Update spacy/ml/staticvectors.py
2020-10-06 14:15:41 +02:00
Florijan Stamenković
9db670b996
Fix Issue 6207 ( #6208 )
...
* Regression test for issue 6207
* Fix issue 6207
* Sign contributor agreement
* Minor adjustments to test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-06 11:17:37 +02:00
Ines Montani
568e12215d
Merge pull request #6206 from svlandeg/fix/patterns-init
2020-10-06 10:27:23 +02:00
svlandeg
9b4cf7b0b6
update output of debug config command
2020-10-06 09:47:23 +02:00
svlandeg
ff9ac39c88
read entity_ruler patterns with srsly.read_jsonl.v1
2020-10-05 22:50:14 +02:00
Ines Montani
126268ce50
Auto-format [ci skip]
2020-10-05 21:58:18 +02:00
Ines Montani
1a554bdcb1
Update docs and docstring [ci skip]
2020-10-05 21:55:27 +02:00
Ines Montani
9614e53b02
Tidy up and auto-format
2020-10-05 21:55:18 +02:00
Ines Montani
181039bd17
Merge pull request #6205 from explosion/feature/embed-features
2020-10-05 21:49:10 +02:00
Ines Montani
5ba418b08c
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-10-05 21:44:01 +02:00
Ines Montani
568617af58
Merge pull request #6202 from explosion/feature/project-spacy-version
2020-10-05 21:40:52 +02:00
Ines Montani
2d0c0134bc
Adjust message [ci skip]
2020-10-05 21:38:23 +02:00
Ines Montani
6abfc2911d
Merge pull request #6203 from adrianeboyd/feature/zh-spacy-pkuseg
2020-10-05 21:35:57 +02:00
Matthew Honnibal
b7e01d2024
Fix quickstart
2020-10-05 21:21:30 +02:00
Matthew Honnibal
ff8b980775
Upd quickstart template
2020-10-05 21:19:41 +02:00
Matthew Honnibal
91d0fbb588
Fix test
2020-10-05 21:13:53 +02:00
Ines Montani
9ca283a899
Merge branch 'develop' into feature/project-spacy-version
2020-10-05 21:06:07 +02:00
Ines Montani
0135f6ed95
Enable commit check via env var
2020-10-05 20:51:15 +02:00
Matthew Honnibal
b392d48e76
Fix test
2020-10-05 20:17:07 +02:00
Ines Montani
be99f1e4de
Remove output dirs before training ( #6204 )
...
* Remove output dirs before training
* Re-raise error if cleaning fails
2020-10-05 20:11:16 +02:00
Matthew Honnibal
e50047f1c5
Check lengths match
2020-10-05 20:02:45 +02:00
Ines Montani
582701519e
Remove __release__ flag
2020-10-05 20:00:49 +02:00
Ines Montani
d58fb42707
Add spacy_version option and validation for project.yml
2020-10-05 20:00:42 +02:00
Matthew Honnibal
db84d175c3
Fix test
2020-10-05 19:59:30 +02:00
Matthew Honnibal
cdd2b79b6d
Remove deprecated MultiHashEmbed
2020-10-05 19:58:18 +02:00
Matthew Honnibal
6dcc4a0ba6
Simplify MultiHashEmbed signature
2020-10-05 19:57:45 +02:00
svlandeg
193e0d5a98
add docs for entity_ruler.initialize
2020-10-05 18:04:08 +02:00
svlandeg
3ac3447eee
cleanup
2020-10-05 17:50:37 +02:00
svlandeg
9eb813a35d
Merge remote-tracking branch 'upstream/develop' into fix/patterns-init
2020-10-05 17:49:44 +02:00
Adriane Boyd
f102ef6b54
Read features.msgpack instead of features.pkl
2020-10-05 17:47:39 +02:00
svlandeg
4e3ace4b8c
is_trainable method
2020-10-05 17:43:42 +02:00
Ines Montani
84fedcebab
Make args keyword-only [ci skip]
...
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-10-05 17:07:35 +02:00
Matthew Honnibal
71e73ed0a6
Merge branch 'develop' into feature/embed-features
2020-10-05 17:00:05 +02:00
Matthew Honnibal
3ee3649b52
Fix augment
2020-10-05 16:59:49 +02:00
Matthew Honnibal
22937d25a9
Merge branch 'develop' into feature/embed-features
2020-10-05 16:42:17 +02:00
Matthew Honnibal
8deed614e9
Fix augment
2020-10-05 16:41:45 +02:00
Matthew Honnibal
4ed3e037df
Fix augment
2020-10-05 16:40:55 +02:00
Matthew Honnibal
9f1bc3f24c
Fix augment
2020-10-05 16:40:23 +02:00
svlandeg
dc06912c76
prevent loss keyerror for non-trainable components
2020-10-05 16:33:28 +02:00
Adriane Boyd
187234648c
Revert back to "default" as default for pkuseg_user_dict
2020-10-05 16:24:28 +02:00
svlandeg
65abd77779
add finish_update to Pipe
2020-10-05 16:23:33 +02:00
Matthew Honnibal
90040aacec
Fix merge
2020-10-05 16:12:01 +02:00
Matthew Honnibal
93a98e8c3e
Merge branch 'develop' into feature/embed-features
2020-10-05 15:51:31 +02:00
Matthew Honnibal
eb9ba61517
Format
2020-10-05 15:29:49 +02:00
Matthew Honnibal
7d93575f35
spacy/tests/
2020-10-05 15:28:12 +02:00
Matthew Honnibal
f4ca9a39cb
spacy/tests/
2020-10-05 15:27:06 +02:00
Matthew Honnibal
f2f1deca66
spacy/tests/
2020-10-05 15:24:33 +02:00
Matthew Honnibal
8ec79ad3fa
Allow configuration of MultiHashEmbed features
...
Update arguments to MultiHashEmbed layer so that the attributes can be
controlled. A kind of tricky scheme is used to allow optional
specification of the rows. I think it's an okay balance between
flexibility and convenience.
2020-10-05 15:22:00 +02:00
Ines Montani
7946fd84bb
Merge pull request #6200 from adrianeboyd/bugfix/vocab-disk-lookups-vectors
...
Always serialize lookups and vectors to disk
2020-10-05 15:15:25 +02:00
Ines Montani
8171e28b20
Remove logging [ci skip]
...
This would be fired on each example, which is wrong
2020-10-05 15:09:52 +02:00
svlandeg
251b3eb4e5
add initialize method for entity_ruler
2020-10-05 14:59:13 +02:00
Sofie Van Landeghem
f4f49f5877
update blis ( #6198 )
...
* allow higher blis version
* fix typo
* bump to 3.0.0a34
* fix pins in other files
2020-10-05 14:58:56 +02:00
Adriane Boyd
5d19dfc9d3
Update Chinese tokenizer for spacy-pkuseg fork
2020-10-05 14:21:53 +02:00
Matthew Honnibal
6a9d14e35a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-10-05 14:17:41 +02:00
Matthew Honnibal
d2b9aafb8c
Fix augmenter
2020-10-05 14:14:49 +02:00
Ines Montani
6260fa3c10
Merge pull request #6201 from svlandeg/fix/error_nr
2020-10-05 14:00:57 +02:00
Ines Montani
6958510bda
Include spaCy version check in project CLI
2020-10-05 13:53:07 +02:00
Ines Montani
20f2a17a09
Merge test_misc and test_util
2020-10-05 13:45:57 +02:00
svlandeg
fd2d48556c
fix E902 and E903 numbering
2020-10-05 13:43:32 +02:00
Ines Montani
1c641e41c3
Remove unused import [ci skip]
2020-10-05 11:50:11 +02:00
Adriane Boyd
03cfb2d2f4
Always serialize lookups and vectors to disk
2020-10-05 09:40:20 +02:00
Adriane Boyd
b0b93854cb
Update ru/uk lemmatizers for new nlp.initialize
2020-10-05 09:27:16 +02:00
Ines Montani
549758f67d
Adjust test for now
2020-10-04 23:16:09 +02:00
Ines Montani
4b15ff7504
Increment version [ci skip]
2020-10-04 22:47:04 +02:00
Ines Montani
f1d1f78636
Make warning debug log [ci skip]
2020-10-04 22:44:21 +02:00
Ines Montani
3c36a57e84
Update data augmenters ( #6196 )
...
* Draft lower-case augmenter
* Make warning a debug log
* Update lowercase augmenter, docs and tests
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-10-04 17:46:29 +02:00
Ines Montani
d38dc466c5
Adjust error [ci skip]
2020-10-04 15:26:01 +02:00
Ines Montani
496228771d
Merge pull request #6194 from explosion/master-tmp
2020-10-04 15:25:41 +02:00
Ines Montani
0307a228c8
Merge pull request #6193 from explosion/fix/adjust-pipe-init
...
Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe
2020-10-04 15:20:54 +02:00
Ines Montani
59deeb7da6
Merge branch 'develop' into master-tmp
2020-10-04 14:52:20 +02:00
Ines Montani
43d7652635
Merge pull request #6192 from explosion/feature/init-attr-ruler
2020-10-04 14:46:37 +02:00
Ines Montani
8f018e47f8
Adjust [initialize.components] on Language.remove_pipe and Language.rename_pipe
2020-10-04 14:43:45 +02:00
Matthew Honnibal
84ae197dd6
Fix logger
2020-10-04 14:16:53 +02:00
Ines Montani
11347f34da
Tidy up, tests and docs
2020-10-04 13:54:05 +02:00
Matthew Honnibal
96b636c2d3
Update attribute ruler
2020-10-04 13:08:21 +02:00
Ines Montani
bcd52e5486
Tidy up errors and warnings
2020-10-04 11:16:31 +02:00
Ines Montani
ff914f4e6f
Lazy-load xx
2020-10-04 11:10:26 +02:00
Ines Montani
d3b3663942
Adjust error message and add test
2020-10-04 10:11:27 +02:00
Ines Montani
2110e8f86d
Auto-format
2020-10-04 10:06:49 +02:00
Ines Montani
cc08c88a89
Merge pull request #6187 from svlandeg/fix/begin_training_pipe
2020-10-04 10:01:02 +02:00
svlandeg
3f657ed3a1
implement warning in __init_subclass__ instead
2020-10-03 22:34:10 +02:00
Matthew Honnibal
3b2a78720c
Upd morphologizer
2020-10-03 19:35:19 +02:00
Matthew Honnibal
835070cedc
Upd test
2020-10-03 19:35:10 +02:00
Matthew Honnibal
70b9de8e58
Set version to v3.0.0a32
2020-10-03 19:26:52 +02:00
Matthew Honnibal
85ede32680
Format
2020-10-03 19:26:23 +02:00
Matthew Honnibal
b305f2ff5a
Fix loggers
2020-10-03 19:26:10 +02:00
Matthew Honnibal
4fccd2ceaf
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-10-03 19:13:55 +02:00
Matthew Honnibal
8ea8b7d940
Support loading labels in morphologizer
2020-10-03 19:13:42 +02:00
Ines Montani
c2401fca41
Add tests for Pipe.label_data
2020-10-03 19:12:46 +02:00
Ines Montani
80603f0fa5
Make SentenceRecognizer.label_data return None
...
Overwrite the method from the base class (Tagger) but don't export anything in "init labels"
2020-10-03 18:54:09 +02:00
Ines Montani
d6c967401f
Increment version
2020-10-03 17:20:47 +02:00
Ines Montani
3bc3c05fcc
Tidy up and auto-format
2020-10-03 17:20:18 +02:00
Ines Montani
7c4ab7e82c
Fix Lemmatizer.get_lookups_config
2020-10-03 17:16:10 +02:00
Ines Montani
dd542ec6a4
Fix label initialization of textcat component ( #6190 )
2020-10-03 17:07:38 +02:00
Ines Montani
989a96308f
Tidy up, auto-format, types
2020-10-03 16:31:58 +02:00
Matthew Honnibal
7b127f307e
Set version to v3.0.0a30
2020-10-03 16:06:42 +02:00
Matthew Honnibal
db419f6b2f
Improve control of training progress and logging ( #6184 )
...
* Make logging and progress easier to control
* Update docs
* Cleanup errors
* Fix ConfigValidationError
* Pass stdout/stderr, not wasabi.Printer
* Fix type
* Upd logging example
* Fix logger example
* Fix type
2020-10-03 14:57:46 +02:00
Ines Montani
ae15c9de79
Raise error from caught KeyError to preserve traceback
2020-10-03 11:43:56 +02:00
Ines Montani
f758804401
Save one line of code
2020-10-03 11:41:28 +02:00
Stanislav Schmidt
3589a64d44
Change type of texts argument in pipe to iterable ( #6186 )
...
* Change type of texts argument in pipe to iterable
* Add contributor agreement
2020-10-02 21:00:11 +02:00
svlandeg
02247cccaf
Merge remote-tracking branch 'upstream/develop' into feature/small-fixes
2020-10-02 20:48:11 +02:00
svlandeg
fb48de349c
bwd compat for pipe.begin_training
2020-10-02 20:31:14 +02:00
Matthew Honnibal
6965cdf16d
Fix comment
2020-10-02 17:26:21 +02:00
Ines Montani
3cf10a0729
Merge pull request #6183 from adrianeboyd/feature/quickstart-morphologizer
...
Add morphologizer to quickstart template
2020-10-02 17:08:01 +02:00
Adriane Boyd
62ccd5c4df
Relax model meta performance schema ( #6185 )
...
Allow more embedded per_x in `ModelMetaSchema`
2020-10-02 16:37:21 +02:00
Sofie Van Landeghem
09dcb75076
small UX fix for DocBin ( #6167 )
...
* add informative warning when messing up store_user_data DocBin flags
* add informative warning when messing up store_user_data DocBin flags
* cleanup test
* rename to patterns_path
2020-10-02 15:43:32 +02:00
Ines Montani
f0b30aedad
Make lemmatizers use initialize logic ( #6182 )
...
* Make lemmatizer use initialize logic and tidy up
* Fix typo
* Raise for uninitialized tables
2020-10-02 15:42:36 +02:00
Adriane Boyd
22158dc24a
Add morphologizer to quickstart template
2020-10-02 15:06:16 +02:00
Ines Montani
d2aa662ab2
Merge pull request #6179 from adrianeboyd/feature/token-morph-refactor-2 [ci skip]
2020-10-02 12:10:27 +02:00
Ines Montani
c41a4332e4
Add test for custom data augmentation
2020-10-02 11:37:56 +02:00
svlandeg
acc391c2a8
remove redundant str() call
2020-10-02 11:05:59 +02:00
Ines Montani
3856048437
Merge pull request #6178 from explosion/feature/file-readers
...
Integrate file readers via srsly, update orth_variants loading
2020-10-02 10:26:09 +02:00
Adriane Boyd
f83dfe62da
Fix test
2020-10-02 10:17:26 +02:00
Adriane Boyd
65dfaa4f4b
Also accept MorphAnalysis in set_morph
2020-10-02 08:33:43 +02:00
Adriane Boyd
77e08c398f
Switch reset value for set_morph to None
2020-10-02 08:25:15 +02:00
Ines Montani
568768643e
Increment version [ci skip]
2020-10-02 01:50:13 +02:00
Ines Montani
01c1538c72
Integrate file readers
2020-10-02 01:36:06 +02:00
Ines Montani
af282ae732
Fix import
2020-10-02 01:12:34 +02:00
Ines Montani
e59ecb12c0
Auto-format
2020-10-02 01:12:30 +02:00
Matthew Honnibal
75a1569908
Merge
2020-10-01 23:07:53 +02:00
Matthew Honnibal
300e5a9928
Avoid relying on NORM in default v3 models ( #6176 )
...
* Allow CharacterEmbed to specify feature
* Default to LOWER in character embed
* Update tok2vec
* Use LOWER, not NORM
2020-10-01 23:05:55 +02:00
Ines Montani
5762876dcc
Update default config [ci skip]
2020-10-01 22:27:37 +02:00
Adriane Boyd
86c3ec9c2b
Refactor Token morph setting ( #6175 )
...
* Refactor Token morph setting
* Remove `Token.morph_`
* Add `Token.set_morph()`
* `0` resets `token.c.morph` to unset
* Any other values are passed to `Morphology.add`
* Add token.morph setter to set from MorphAnalysis
2020-10-01 22:21:46 +02:00
Matthew Honnibal
b854bca15c
Default to LOWER in character embed
2020-10-01 22:17:58 +02:00
Matthew Honnibal
684a77870b
Allow CharacterEmbed to specify feature
2020-10-01 22:17:26 +02:00
Ines Montani
da30701cd1
Increment version [ci skip]
2020-10-01 21:58:11 +02:00
Ines Montani
d48ddd6c9a
Remove default initialize lookups
2020-10-01 21:54:33 +02:00
Ines Montani
1700c8541e
Increment version [ci skip]
2020-10-01 17:57:16 +02:00
Ines Montani
f2627157c8
Update docs [ci skip]
2020-10-01 17:38:17 +02:00
Ines Montani
7f68f4bd92
Hide jsonl_loc on init vectors and tidy up [ci skip]
2020-10-01 16:44:17 +02:00
Adriane Boyd
27cbffff1b
Minor edit to CoNLL-U converter ( #6172 )
...
This doesn't make a difference given how the `merged_morph` values
override the `morph` values for all the final docs, but could have led
to unexpected bugs in the future if the converter is modified.
2020-10-01 16:23:42 +02:00
Sofie Van Landeghem
a22215f427
Add FeatureExtractor from Thinc ( #6170 )
...
* move featureextractor from Thinc
* Update website/docs/api/architectures.md
Co-authored-by: Ines Montani <ines@ines.io>
* Update website/docs/api/architectures.md
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Ines Montani <ines@ines.io>
2020-10-01 16:22:48 +02:00
Adriane Boyd
73538782a0
Switch Doc.__init__(ents=) to IOB tags ( #6173 )
...
* Switch Doc.__init__(ents=) to IOB tags
* Fix check for "-"
* Allow "" or None as missing IOB tag
2020-10-01 16:22:18 +02:00
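To illustrate the IOB convention the commit above switches to, here is a minimal stdlib-only sketch of how per-token IOB tags map to entity spans. The helper name `iob_to_spans` is hypothetical and this is not spaCy's implementation; it only assumes, as the commit notes, that `""` or `None` marks a missing tag.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tags: List[Optional[str]]) -> List[Tuple[int, int, str]]:
    """Convert per-token IOB tags into (start, end, label) spans.

    Hypothetical helper for illustration only. "" and None are treated as
    missing annotation and, like "O", close any open span. A stray "I-" tag
    without a matching open span is ignored in this sketch.
    """
    spans = []
    start = None  # index of the first token of the open span, if any
    label = None  # label of the open span, if any
    for i, tag in enumerate(tags):
        inside = bool(tag) and tag.startswith("I-") and tag[2:] == label
        if inside:
            continue  # current token extends the open span
        if start is not None:
            spans.append((start, i, label))  # close the open span
            start, label = None, None
        if tag and tag.startswith("B-"):
            start, label = i, tag[2:]  # open a new span
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


words = ["Ada", "Lovelace", "visited", "London", "today"]
tags = ["B-PERSON", "I-PERSON", "O", "B-GPE", None]
spans = iob_to_spans(tags)
print(spans)
```

The end index is exclusive, so each span can be used directly to slice `words`.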
Adriane Boyd
df98d3ef9f
Update import from collections.abc ( #6174 )
2020-10-01 16:21:49 +02:00
Yohei Tamura
3243ddac8f
Fix/span.sent ( #6083 )
...
* add fail test
* fix test
* fix span.sent
* Remove incorrect implicit check
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-01 14:01:52 +02:00
Ines Montani
0a8a124a6e
Update docs [ci skip]
2020-10-01 12:15:53 +02:00
Ines Montani
44160cd52f
Tidy up [ci skip]
2020-10-01 10:41:19 +02:00
Ines Montani
381258b75b
Merge pull request #6165 from explosion/feature/update-tokenizers-initialize
2020-10-01 09:49:47 +02:00
svlandeg
6787e56315
print debugging warning before raising error if model not properly initialized
2020-10-01 09:21:00 +02:00
svlandeg
5121972930
add types of Tok2Vec embedding layers
2020-10-01 09:20:09 +02:00
Ines Montani
4b6afd3611
Remove English [initialize] default block for now to get tests to pass
2020-09-30 23:49:29 +02:00
Ines Montani
6f29f68f69
Update errors and make Tokenizer.initialize args less strict
2020-09-30 23:48:47 +02:00
Ines Montani
a103ab5f1a
Update augmenter lookups and docs
2020-09-30 23:03:47 +02:00
Matthew Honnibal
5128298964
Add missing augmenter
2020-09-30 20:18:45 +02:00
Matthew Honnibal
59294e91aa
Restore the 'jsonl' arg for init vectors
...
The lexemes.jsonl file is still used in our English vectors, and it may
be required by users as well. I think it's worth supporting the option.
2020-09-30 19:06:50 +02:00
Matthew Honnibal
c379a4274a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-09-30 16:52:42 +02:00
Matthew Honnibal
e58dca3028
Add read_labels
2020-09-30 16:52:27 +02:00
Ines Montani
23c63eefaf
Tidy up env vars [ci skip]
2020-09-30 15:15:11 +02:00
Elijah Rippeth
4cbb954281
reorder so tagmap is replaced only if a custom file is provided. ( #6164 )
...
* reorder so tagmap is replaced only if a custom file is provided.
* Remove unneeded variable initialization
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-09-30 13:26:06 +02:00
Adriane Boyd
6b7bb32834
Refactor Chinese initialization
2020-09-30 11:46:45 +02:00
Ines Montani
34f9c26c62
Add lexeme norm defaults
2020-09-30 10:20:14 +02:00
Ines Montani
a5debb356d
Tidy up and adjust logging [ci skip]
2020-09-30 01:22:08 +02:00
Ines Montani
56a2f778c4
Add logging [ci skip]
2020-09-30 01:08:55 +02:00
Ines Montani
fe3f111c37
Merge pull request #6168 from explosion/fix/default-corpus-values
2020-09-30 00:24:02 +02:00
Ines Montani
b799af16de
Don't raise in Pipe.initialize if not implemented
2020-09-30 00:05:27 +02:00
Matthew Honnibal
bc61691f6f
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-09-29 23:41:04 +02:00
Matthew Honnibal
f52249fe2e
Fix data augmentation
2020-09-29 23:40:54 +02:00
Matthew Honnibal
14c4da547f
Try to fix augmentation
2020-09-29 23:08:56 +02:00
Ines Montani
ae51843468
Remove augmenter from jinja template [ci skip]
2020-09-29 23:08:50 +02:00
Ines Montani
9bb958fd0a
Fix debug data [ci skip]
2020-09-29 23:07:11 +02:00
Matthew Honnibal
a2aa1f6882
Disable the OVL augmentation by default
2020-09-29 23:02:40 +02:00
Ines Montani
df8dd91b6f
Merge branch 'develop' into fix/default-corpus-values
2020-09-29 22:55:39 +02:00
Ines Montani
0a1ee109db
Remove init form path
2020-09-29 22:53:18 +02:00
Ines Montani
ad6d40d028
Add logging
2020-09-29 22:53:14 +02:00
Ines Montani
c334a7d45f
Remove
2020-09-29 22:38:39 +02:00
Ines Montani
1aeef3bfbb
Make corpus paths default to None and improve errors
2020-09-29 22:33:46 +02:00
Ines Montani
0250bcf6a3
Show validation error during init
2020-09-29 22:29:09 +02:00
Ines Montani
da30bae8a6
Use __pyx_vtable__ instead of __reduce_cython__
2020-09-29 22:04:17 +02:00
Ines Montani
43c92ec8c9
Resolve dir for better output [ci skip]
2020-09-29 22:01:04 +02:00
Ines Montani
fa47f87924
Tidy up and auto-format
2020-09-29 21:39:28 +02:00
Ines Montani
604be54a5c
Support --code in evaluate CLI [ci skip]
2020-09-29 21:20:56 +02:00
Ines Montani
6467a560e3
WIP: Test updating Chinese tokenizer
2020-09-29 21:10:22 +02:00
Ines Montani
4f3102d09c
Auto-format
2020-09-29 21:09:10 +02:00
Ines Montani
798040bc1d
Fix language detection
2020-09-29 21:08:13 +02:00
Ines Montani
78021089f9
Merge pull request #6160 from explosion/feature/prepare
2020-09-29 20:55:13 +02:00
Ines Montani
c3f8c09d7d
Merge pull request #6154 from adrianeboyd/bugfix/chinese-tokenizer-pickle
2020-09-29 20:54:59 +02:00
Ines Montani
d3c63b7965
Merge branch 'develop' into feature/prepare
2020-09-29 20:53:05 +02:00
Ines Montani
2be80379ec
Fix small issues, resolve_dot_names and debug model
2020-09-29 20:38:35 +02:00
Matthew Honnibal
a4da3120b4
Fix multitasks
2020-09-29 18:33:16 +02:00
Matthew Honnibal
0b5c72fce2
Fix incorrect docstrings
2020-09-29 18:30:38 +02:00
Ines Montani
7851020653
Update tests
2020-09-29 18:14:15 +02:00
Ines Montani
71a0ee274a
Move init labels to init pipeline module
2020-09-29 18:09:33 +02:00
Ines Montani
dba26186ef
Handle None default args in Cython methods
2020-09-29 18:08:02 +02:00
Ines Montani
9353a82076
Auto-format
2020-09-29 18:07:48 +02:00
Ines Montani
534e1ef498
Fix template
2020-09-29 17:02:55 +02:00
Ines Montani
f2352eb701
Test with default value
2020-09-29 17:00:40 +02:00
Matthew Honnibal
8ce9f44433
Merge branch 'feature/prepare' of https://github.com/explosion/spaCy into feature/prepare
2020-09-29 16:57:38 +02:00
Matthew Honnibal
e4f535a964
Fix Pipe.labels
2020-09-29 16:55:07 +02:00
Matthew Honnibal
4ad26f4a2f
Move reader
2020-09-29 16:54:53 +02:00
Ines Montani
30c76dbd67
Merge branch 'feature/prepare' of https://github.com/explosion/spaCy into feature/prepare
2020-09-29 16:53:48 +02:00
Matthew Honnibal
43fc7a316d
Add registry function for reading jsonl
2020-09-29 16:49:09 +02:00
Matthew Honnibal
1fd002180e
Allow more components to use labels
2020-09-29 16:48:56 +02:00
Matthew Honnibal
99bff78617
Use labels in tagger
2020-09-29 16:48:44 +02:00
Matthew Honnibal
ca72608059
Fix language
2020-09-29 16:48:33 +02:00
Matthew Honnibal
10847c7f4e
Fix arg
2020-09-29 16:48:07 +02:00
Ines Montani
fd594cfb9b
Tighten up format
2020-09-29 16:47:55 +02:00
Matthew Honnibal
e70a00fa76
Remove unnecessary warning from train
2020-09-29 16:47:54 +02:00
Matthew Honnibal
3f0d61232d
Remove outdated arg from train
2020-09-29 16:47:44 +02:00
Matthew Honnibal
e957d66b92
Merge branch 'feature/prepare' of https://github.com/explosion/spaCy into feature/prepare
2020-09-29 16:22:53 +02:00
Ines Montani
978ab54a84
Fix logging
2020-09-29 16:22:41 +02:00
Matthew Honnibal
45daf5c9fe
Add init labels command
2020-09-29 16:22:37 +02:00
Matthew Honnibal
58c8d4b414
Add label_data property to pipeline
2020-09-29 16:22:13 +02:00
Ines Montani
aa2a6882d0
Fix logging
2020-09-29 16:08:39 +02:00
Ines Montani
63d1598137
Simplify config use in Language.initialize
2020-09-29 16:05:48 +02:00
Ines Montani
56f8bc73ef
Add more tests
2020-09-29 15:23:34 +02:00
Sofie Van Landeghem
6a04e5adea
encoding UTF8 ( #6161 )
2020-09-29 14:49:55 +02:00
Ines Montani
591038b1a4
Add test
2020-09-29 12:54:52 +02:00
Ines Montani
adca08a12f
Pass nlp forward
2020-09-29 12:21:52 +02:00
Ines Montani
f171903139
Clean up sgd and pipeline -> nlp
2020-09-29 12:20:26 +02:00
Ines Montani
612bbf85ab
Update initialize.py
2020-09-29 12:14:47 +02:00
Ines Montani
42f0e4c946
Clean up
2020-09-29 12:14:08 +02:00
Matthew Honnibal
9c8b2524fe
Upd initialize args
2020-09-29 12:08:37 +02:00
Matthew Honnibal
e1fdf2b7c5
Upd tests
2020-09-29 12:05:38 +02:00
Ines Montani
50410c17ac
Update schemas.py
2020-09-29 12:05:38 +02:00
Matthew Honnibal
f2d1b7feb5
Clean up sgd
2020-09-29 12:00:08 +02:00
Ines Montani
78396d137f
Integrate initialize settings
2020-09-29 11:57:08 +02:00
Ines Montani
dec984a9c1
Update Language.initialize and support components/tokenizer settings
2020-09-29 11:52:45 +02:00
Matthew Honnibal
b3b6868639
Remove 'sgd' arg from component initialize
2020-09-29 11:42:35 +02:00
Matthew Honnibal
5276db6f3f
Remove 'device' argument from Language, clean up 'sgd' arg
2020-09-29 11:42:19 +02:00
Ines Montani
4925ad760a
Add init vectors
2020-09-29 10:58:50 +02:00
svlandeg
64d90039a1
encoding UTF8
2020-09-29 10:54:42 +02:00
Ines Montani
ff9a63bfbd
begin_training -> initialize
2020-09-28 21:35:09 +02:00
Ines Montani
046f655d86
Fix error
2020-09-28 21:17:45 +02:00
Ines Montani
a139fe672b
Fix typos and refactor CLI logging
2020-09-28 21:17:10 +02:00
Ines Montani
2e9c9e74af
Fix config resolution and interpolation
...
TODO: auto-interpolate in Thinc if config is dict (i.e. likely subsection)
2020-09-28 15:34:00 +02:00
Ines Montani
02838a1d47
Fix resolve_dot_names
2020-09-28 15:27:10 +02:00
Ines Montani
822ea4ef61
Refactor CLI
2020-09-28 15:09:59 +02:00
Ines Montani
a89e0ff7cb
Fix typo
2020-09-28 12:55:21 +02:00
Ines Montani
a62337b3f3
Tidy up vocab init
2020-09-28 12:53:06 +02:00
Ines Montani
c22ecc66bb
Don't support init path for now
2020-09-28 12:46:28 +02:00
Ines Montani
f49288ab81
Update default_config_pretraining.cfg
2020-09-28 12:31:54 +02:00
Ines Montani
a5f2cc0509
Tidy up and remove raw text (rehearsal) for now
2020-09-28 12:30:13 +02:00
Ines Montani
1590de11b1
Update config
2020-09-28 12:05:23 +02:00
Matthew Honnibal
9f6ad06452
Upd default config
2020-09-28 12:00:23 +02:00
Ines Montani
e44a7519cd
Update CLI and add [initialize] block
2020-09-28 11:56:14 +02:00
Ines Montani
d5155376fd
Update vocab init
2020-09-28 11:30:18 +02:00
Ines Montani
8b74fd19df
init pipeline -> init nlp
2020-09-28 11:13:38 +02:00
Ines Montani
2fdb7285a0
Update CLI
2020-09-28 11:06:07 +02:00
Ines Montani
553bfea641
Fix commands
2020-09-28 10:53:17 +02:00
Matthew Honnibal
44bad1474c
Add init_pipeline file
2020-09-28 09:47:34 +02:00
Matthew Honnibal
65448b2e34
Remove schema=None until Optional
2020-09-28 03:42:58 +02:00
Matthew Honnibal
b886f53c31
init-pipeline runs (maybe doesn't work)
2020-09-28 03:42:47 +02:00
Matthew Honnibal
ed2aff2db3
Remove unused train code
2020-09-28 03:12:31 +02:00
Matthew Honnibal
3a0a3b8db6
Don't hard-code the 'corpora' name
2020-09-28 03:06:33 +02:00
Matthew Honnibal
a023cf3ecc
Add (untested) resolve_dot_names util
2020-09-28 03:06:12 +02:00
Matthew Honnibal
a976da168c
Support data augmentation in Corpus ( #6155 )
...
* Support data augmentation in Corpus
* Note initial docs for data augmentation
* Add augmenter to quickstart
* Fix flake8
* Format
* Fix test
* Update spacy/tests/training/test_training.py
* Improve data augmentation arguments
* Update templates
* Move randomization out into caller
* Refactor
* Update spacy/training/augment.py
* Update spacy/tests/training/test_training.py
* Fix augment
* Fix test
2020-09-28 03:03:27 +02:00
Matthew Honnibal
13b1605ee6
Add init script
2020-09-28 01:08:49 +02:00
Matthew Honnibal
a3e1791c9c
Upd train
2020-09-28 01:08:30 +02:00
Matthew Honnibal
b5556093e2
Start updating train script
2020-09-27 23:59:44 +02:00
Ines Montani
9016d23cc5
Fix exclude and add test
2020-09-27 23:34:03 +02:00
Ines Montani
658fad428a
Fix base schema integration
2020-09-27 22:50:36 +02:00
Ines Montani
e04bd16f7f
Merge branch 'develop' into feature/new-thinc-config-resolution
2020-09-27 22:34:46 +02:00
Ines Montani
d7ad65a9bb
Fix handling of error description [ci skip]
2020-09-27 22:31:57 +02:00
Ines Montani
7e938ed63e
Update config resolution to use new Thinc
2020-09-27 22:21:31 +02:00
Adriane Boyd
013b66de05
Add tokenizer scoring to ja / ko / zh ( #6152 )
2020-09-27 22:20:45 +02:00
Adriane Boyd
a6548ead17
Add _ as a symbol ( #6153 )
...
* Add _ to StringStore in Morphology
* Add _ as a symbol
Add `_` as a symbol instead of adding to the `StringStore`.
2020-09-27 22:20:14 +02:00
Matthew Honnibal
39b178999c
Tmp notes
2020-09-27 20:13:38 +02:00
Adriane Boyd
8393dbedad
Minor fixes
...
* Put `cfg` back in serialization
* Add `pickle5` to pytest conf
2020-09-27 15:15:53 +02:00
Adriane Boyd
54fe871935
Fix formatting, refactor pickle5 exceptions
2020-09-27 14:37:28 +02:00
Adriane Boyd
11e195d3ed
Update ChineseTokenizer
...
* Allow `pkuseg_model` to be set to `None` on initialization
* Don't save config within tokenizer
* Force convert pkuseg_model to use pickle protocol 4 by reencoding with
`pickle5` on serialization
* Update pkuseg serialization test
2020-09-27 14:00:18 +02:00
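The re-encoding step mentioned above can be sketched with the standard library alone. This is an illustration, not the tokenizer's actual code: the commit uses the `pickle5` backport, whereas this sketch assumes Python 3.8+, where protocol 5 is built into `pickle`. The function name `reencode_protocol4` and the sample payload are hypothetical.

```python
import pickle


def reencode_protocol4(data: bytes) -> bytes:
    """Load a pickle of any supported protocol and dump it again with protocol 4.

    Only use this on trusted data: unpickling can execute arbitrary code.
    """
    obj = pickle.loads(data)
    return pickle.dumps(obj, protocol=4)


# Hypothetical payload standing in for serialized model data.
payload = {"weights": [0.1, 0.2, 0.3], "name": "demo"}
original = pickle.dumps(payload, protocol=5)
converted = reencode_protocol4(original)

# Pickles written with protocol 2 or later start with the PROTO opcode (0x80)
# followed by the protocol number, so the first two bytes show the change.
print(original[:2], converted[:2])
```

Writing protocol 4 keeps the serialized data readable on Python versions that predate protocol 5.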
Ines Montani
b4486d747d
Merge branch 'develop' into fix/train-config-interpolation
2020-09-26 15:32:14 +02:00
Ines Montani
8fea06d55e
Merge pull request #6149 from adrianeboyd/feature/attributeruler-match-ids
...
Simplify string match IDs for AttributeRuler
2020-09-26 15:31:30 +02:00
Ines Montani
b2d07de786
Construct nlp from uninterpolated config before training
2020-09-26 15:16:59 +02:00
Ines Montani
ca3c997062
Improve CLI config validation with latest Thinc
2020-09-26 13:13:57 +02:00
Adriane Boyd
6c25e60089
Simplify string match IDs for AttributeRuler
2020-09-26 11:12:39 +02:00
Matthew Honnibal
702edf52a0
Fix attributeruler
2020-09-26 00:30:48 +02:00
Matthew Honnibal
821f37254c
Fix attributeruler
2020-09-26 00:19:53 +02:00
Matthew Honnibal
98327f66a9
Fix attributeruler key
2020-09-25 23:20:50 +02:00
Matthew Honnibal
092ce4648e
Make DocBin output stable data (set iteration)
2020-09-25 22:20:44 +02:00
Matthew Honnibal
26afd3bd90
Fix iteration order
2020-09-25 21:47:22 +02:00
Matthew Honnibal
3d8388969e
Sort paths for cache consistency
2020-09-25 19:07:26 +02:00
Adriane Boyd
c3b5a3cfff
Clean up MorphAnalysisC struct ( #6146 )
2020-09-25 15:56:48 +02:00
Sofie Van Landeghem
009ba14aaf
Fix pretraining in train script ( #6143 )
...
* update pretraining API in train CLI
* bump thinc to 8.0.0a35
* bump to 3.0.0a26
* doc fixes
* small doc fix
2020-09-25 15:47:10 +02:00
Adriane Boyd
50f20cf722
Revert changes to Scorer.score_spans
2020-09-25 08:21:47 +02:00
Matthew Honnibal
93d7ff309f
Remove print
2020-09-24 21:05:27 +02:00
Matthew Honnibal
16475528f7
Fix skipped documents in entity scorer ( #6137 )
...
* Fix skipped documents in entity scorer
* Add back the skipping of unannotated entities
* Update spacy/scorer.py
* Use more specific NER scorer
* Fix import
* Fix get_ner_prf
* Add scorer
* Fix scorer
Co-authored-by: Ines Montani <ines@ines.io>
2020-09-24 20:38:57 +02:00
Matthew Honnibal
2abb4ba9db
Make a pre-check to speed up alignment cache ( #6139 )
...
* Dirty trick to fast-track alignment cache
* Improve alignment cache check
* Fix header
* Fix align cache
* Fix align logic
2020-09-24 18:13:39 +02:00
Ines Montani
26e28ed413
Fix combined scores if multiple components report it
2020-09-24 17:11:13 +02:00
Ines Montani
0b52b6904c
Update entity_linker.py
2020-09-24 17:10:35 +02:00
Ines Montani
20b89a9717
Increment version [ci skip]
2020-09-24 16:57:02 +02:00
Adriane Boyd
3c062b3911
Add MORPH handling to Matcher ( #6107 )
...
* Add MORPH handling to Matcher
* Add `MORPH` to `Matcher` schema
* Rename `_SetMemberPredicate` to `_SetPredicate`
* Add `ISSUBSET` and `ISSUPERSET` operators to `_SetPredicate`
* Add special handling for normalization and conversion of morph
values into sets
* For other attrs, `ISSUBSET` acts like `IN` and `ISSUPERSET` only
matches for 0 or 1 values
* Update test
* Rename to IS_SUBSET and IS_SUPERSET
2020-09-24 16:55:09 +02:00
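As a rough illustration of the subset/superset semantics described above, the sketch below compares a token's morphological features with a pattern's list of values using plain Python sets. It is not the `Matcher` implementation; the helper names are hypothetical, and it assumes features are written as `Feature=Value` pairs joined by `|`.

```python
from typing import Iterable, Set


def morph_to_set(morph: str) -> Set[str]:
    """Split a 'Feat=Val|Feat=Val' string into a set of features."""
    return {feat for feat in morph.split("|") if feat}


def is_subset(morph: str, pattern_values: Iterable[str]) -> bool:
    """True if every feature on the token appears in the pattern's list."""
    return morph_to_set(morph) <= set(pattern_values)


def is_superset(morph: str, pattern_values: Iterable[str]) -> bool:
    """True if the token carries every feature in the pattern's list."""
    return morph_to_set(morph) >= set(pattern_values)


token_morph = "Case=Nom|Number=Sing"
print(is_subset(token_morph, ["Case=Nom", "Number=Sing", "Person=3"]))
print(is_superset(token_morph, ["Number=Sing"]))
print(is_superset(token_morph, ["Number=Plur"]))
```

Treating the morph value as a set means the order of features in the string does not affect the comparison.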
Adriane Boyd
59340606b7
Add option to disable Matcher errors ( #6125 )
...
* Add option to disable Matcher errors
* Add option to disable Matcher errors when a doc doesn't contain a
particular type of annotation
Minor additional change:
* Update `AttributeRuler.load_from_morph_rules` to allow direct `MORPH`
values
* Rename suppress_errors to allow_missing
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Refactor annotation checks in Matcher and PhraseMatcher
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-24 16:54:39 +02:00
Sofie Van Landeghem
c7eedd3534
updates to NEL functionality ( #6132 )
...
* NEL: read sentences and ents from reference
* fiddling with sent_start annotations
* add KB serialization test
* KB write additional file with strings.json
* score_links function to calculate NEL P/R/F
* formatting
* documentation
2020-09-24 16:53:59 +02:00
Ines Montani
d0ef4a4cf5
Prevent division by zero in score weights
2020-09-24 16:42:13 +02:00
Matthew Honnibal
74ee456374
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-09-24 16:11:47 +02:00
Matthew Honnibal
0bc214c102
Fix pull
2020-09-24 16:11:33 +02:00
Ines Montani
3f751e68f5
Increment version [ci skip]
2020-09-24 14:45:41 +02:00
Ines Montani
58dde293ce
Merge pull request #6089 from adrianeboyd/feature/doc-ents-v3-2
2020-09-24 14:44:42 +02:00
Ines Montani
74e1f192b4
Merge pull request #6134 from explosion/feature/training_before_to_disk
2020-09-24 14:44:11 +02:00
Ines Montani
24e7ac3f2b
Fix download CLI [ci skip]
2020-09-24 14:43:56 +02:00
Ines Montani
88e54caa12
accuracy -> performance
2020-09-24 14:32:35 +02:00
Ines Montani
92f8b6959a
Fix typo
2020-09-24 13:48:41 +02:00
Adriane Boyd
5c13e0cf1b
Remove unused error
2020-09-24 13:41:55 +02:00
Ines Montani
be56c0994b
Add [training.before_to_disk] callback
2020-09-24 12:40:25 +02:00
Adriane Boyd
8eaacaae97
Refactor Doc.ents setter to use Doc.set_ents
...
Additional changes:
* Entity spans with missing labels are ignored
* Fix ent_kb_id setting in `Doc.set_ents`
2020-09-24 12:36:51 +02:00
Ines Montani
c6c67b606e
Merge pull request #6133 from explosion/fix/score_weights
2020-09-24 12:00:57 +02:00
Ines Montani
f69fea8b25
Improve error handling around non-number scores
2020-09-24 11:29:07 +02:00
Ines Montani
4eb39b5c43
Fix logging
2020-09-24 11:04:35 +02:00
Ines Montani
4bbe41f017
Fix combined scores and update test
2020-09-24 10:42:47 +02:00
Sofie Van Landeghem
c645c4e7ce
fix micro PRF for textcat ( #6130 )
...
* fix micro PRF for textcat
* small fix
2020-09-24 10:31:17 +02:00
Matthew Honnibal
17a6b0a173
Make project pull order insensitive ( #6131 )
2020-09-24 10:30:42 +02:00
Ines Montani
ae51f580c1
Fix handling of score_weights
2020-09-24 10:27:33 +02:00
Ines Montani
f25f05c503
Adjust sort order [ci skip]
2020-09-23 20:03:04 +02:00
Ines Montani
3f77eb749c
Increment version [ci skip]
2020-09-23 19:50:15 +02:00
svlandeg
b816ace4bb
format
2020-09-23 17:33:13 +02:00
svlandeg
5a9fdbc8ad
state_type as Literal
2020-09-23 17:32:14 +02:00
svlandeg
35dbc63578
Merge remote-tracking branch 'upstream/develop' into fix/nr_features
...
# Conflicts:
# spacy/ml/models/parser.py
# spacy/tests/serialize/test_serialize_config.py
# website/docs/api/architectures.md
2020-09-23 17:01:13 +02:00
svlandeg
25b34bba94
throw custom error when state_type is invalid
2020-09-23 16:57:14 +02:00
Ines Montani
916050bf2f
Merge pull request #6127 from explosion/feature/literal-nr_feature_tokens
2020-09-23 16:56:08 +02:00
Ines Montani
3c3863654e
Increment version [ci skip]
2020-09-23 16:54:43 +02:00
svlandeg
dd2292793f
'parser' instead of 'deps' for state_type
2020-09-23 16:53:49 +02:00
Ines Montani
50a4425cda
Adjust docs
2020-09-23 16:03:32 +02:00
Ines Montani
76bbed3466
Use Literal type for nr_feature_tokens
2020-09-23 16:00:03 +02:00
Muhammad Fahmi Rasyid
7489d02dea
Update Indonesian Example Phrases ( #6124 )
...
* create contributor agreement
* Update Indonesian example. (see #1107 )
Update Indonesian examples with more appropriate phrases. The current phrases contain sensitive and violent words.
2020-09-23 14:02:26 +02:00
svlandeg
6c85fab316
state_type and extra_state_tokens instead of nr_feature_tokens
2020-09-23 13:35:09 +02:00
Ines Montani
7745d77a38
Fix whitespace in template [ci skip]
2020-09-23 13:21:42 +02:00
svlandeg
6435458d51
simplify expression
2020-09-23 12:12:38 +02:00
svlandeg
20b0ec5dcf
avoid logging performance of frozen components
2020-09-23 10:37:12 +02:00
Ines Montani
ae5dacf75f
Tidy up and add types
2020-09-23 10:14:34 +02:00
Ines Montani
6ca06cb62c
Update docs and formatting [ci skip]
2020-09-23 10:14:27 +02:00
Ines Montani
888f936a73
Merge pull request #6106 from svlandeg/feature/textcat-quickstart
2020-09-23 10:11:45 +02:00
Ines Montani
60a317520a
Merge pull request #6109 from svlandeg/feature/2rename
2020-09-23 09:47:12 +02:00
Ines Montani
f976bab710
Remove empty file [ci skip]
2020-09-23 09:30:09 +02:00
svlandeg
556f3e4652
add pooling to NEL's TransformerListener
2020-09-23 09:24:28 +02:00
svlandeg
4a56ea72b5
fallbacks for old names
2020-09-23 09:15:07 +02:00
Sofie Van Landeghem
86a08f819d
tok2vec.update instead of predict ( #6113 )
2020-09-22 21:54:52 +02:00
Adriane Boyd
e4acb28658
Fix norm in retokenizer split ( #6111 )
...
Parallel to behavior in merge, reset norm on original token in
retokenizer split.
2020-09-22 21:53:33 +02:00
Sofie Van Landeghem
e0e793be4d
fix KB IO ( #6118 )
2020-09-22 21:53:06 +02:00
Adriane Boyd
9b4979407d
Fix overlapping German noun chunks ( #6112 )
...
Add a similar fix as in #5470 to prevent the German noun chunks iterator
from producing overlapping spans.
2020-09-22 21:52:42 +02:00
Adriane Boyd
b1a7d6c528
Refactor seen token detection
2020-09-22 14:42:51 +02:00
Sofie Van Landeghem
d53c84b6d6
avoid None callback ( #6100 )
2020-09-22 13:54:44 +02:00
Adriane Boyd
535842e483
Merge branch 'develop' into feature/doc-ents-v3-2
2020-09-22 13:45:50 +02:00
Ines Montani
5e3b796b12
Validate section refs in debug config
2020-09-22 12:24:39 +02:00
svlandeg
085a1c8e2b
add no_output_layer to TextCatBOW config
2020-09-22 12:06:40 +02:00
svlandeg
e1b8090b9b
few more fixes
2020-09-22 12:01:06 +02:00
svlandeg
b556a10808
rename converts in_to_out
2020-09-22 11:50:19 +02:00
svlandeg
e931f4d757
add textcat score
2020-09-22 10:56:43 +02:00
svlandeg
396b33257f
add entity_linker to jinja template
2020-09-22 10:40:05 +02:00
Ines Montani
db7126ead9
Increment version
2020-09-22 10:31:26 +02:00
svlandeg
135de82a2d
add textcat to quickstart
2020-09-22 10:22:06 +02:00
Ines Montani
6316d5f398
Improve messages in project CLI [ci skip]
2020-09-22 09:45:34 +02:00
Ines Montani
49e80dbcac
Merge pull request #6103 from explosion/chore/tidy-up-tests-docs-get-doc
2020-09-22 09:45:04 +02:00
Ines Montani
81606b29bd
Merge pull request #6104 from svlandeg/fix/debug_model [ci skip]
2020-09-22 09:31:23 +02:00
Ines Montani
beb766d0a0
Add test
2020-09-22 09:15:57 +02:00
Ines Montani
285fa934d8
Merge branch 'chore/tidy-up-tests-docs-get-doc' of https://github.com/explosion/spaCy into chore/tidy-up-tests-docs-get-doc
2020-09-22 09:10:14 +02:00
Ines Montani
69f7e52c26
Update README.md
2020-09-22 09:10:06 +02:00
svlandeg
45b29c4a5b
cleanup
2020-09-21 23:17:23 +02:00
svlandeg
fa5c416db6
initialize through nlp object and with train_corpus
2020-09-21 23:09:22 +02:00
Matthew Honnibal
3abc4a5adb
Slightly tidy doc.ents.__set__
2020-09-21 22:58:03 +02:00
Ines Montani
67fbcb3da5
Tidy up tests and docs
2020-09-21 20:43:54 +02:00
Ines Montani
a5f6ab4943
Merge pull request #6098 from adrianeboyd/feature/doc-init
2020-09-21 18:35:20 +02:00
Adriane Boyd
f212303729
Add sent_starts to Doc.__init__
...
Add sent_starts to `Doc.__init__`. Officially specify `is_sent_start`
values but also convert to and accept `sent_start` internally.
2020-09-21 17:59:09 +02:00
svlandeg
447b3e5787
Merge remote-tracking branch 'upstream/develop' into fix/debug_model
...
# Conflicts:
# spacy/cli/debug_model.py
2020-09-21 16:58:40 +02:00
Ines Montani
b3327c1e45
Increment version [ci skip]
2020-09-21 16:04:30 +02:00
Ines Montani
e8bcaa44f1
Don't auto-decompress archives with smart_open [ci skip]
2020-09-21 16:01:46 +02:00
Adriane Boyd
6aa91c7ca0
Make user_data keyword-only
2020-09-21 16:00:06 +02:00
Adriane Boyd
177df15d89
Implement Doc.set_ents
2020-09-21 15:54:05 +02:00
Adriane Boyd
13fbf6556a
Merge remote-tracking branch 'upstream/develop' into feature/doc-ents-v3-2
2020-09-21 14:42:04 +02:00
svlandeg
eb9b447960
Merge remote-tracking branch 'upstream/develop' into fix/debug_model
...
# Conflicts:
# spacy/cli/debug_model.py
2020-09-21 14:05:16 +02:00
Adriane Boyd
ce455f30ca
Fix formatting
2020-09-21 13:53:29 +02:00
Adriane Boyd
bc02e86494
Extend Doc.__init__ with additional annotation
...
Mostly copying from `spacy.tests.util.get_doc`, add additional kwargs to
`Doc.__init__` to initialize the most common doc/token values.
2020-09-21 13:36:24 +02:00
Ines Montani
758ead8a47
Sync overrides with CLI overrides
2020-09-21 12:50:13 +02:00
Ines Montani
5497acf49a
Support config overrides via environment variables
2020-09-21 11:25:10 +02:00
Ines Montani
1114219ae3
Tidy up and auto-format
2020-09-21 10:59:07 +02:00
Ines Montani
b2302c0a1c
Improve error for missing dependency
2020-09-20 17:44:51 +02:00
Matthew Honnibal
8fb59d958c
Format
2020-09-20 16:31:48 +02:00
Matthew Honnibal
dc22771f87
Fix sparse checkout
2020-09-20 16:30:05 +02:00
Matthew Honnibal
a0fb5e50db
Use simple git clone call if not sparse
2020-09-20 16:22:04 +02:00
Matthew Honnibal
2c24d633d0
Use updated run_command
2020-09-20 16:21:43 +02:00
Matthew Honnibal
889128e5c5
Improve error handling in run_command
2020-09-20 16:20:57 +02:00
Ines Montani
554c9a2497
Update docs [ci skip]
2020-09-20 12:30:53 +02:00
svlandeg
6db1d5dc0d
trying some stuff
2020-09-19 19:11:30 +02:00
Ines Montani
e863b3dc14
Merge pull request #6092 from adrianeboyd/bugfix/load-vocab-lookups-2
2020-09-19 12:33:38 +02:00
Sofie Van Landeghem
39872de1f6
Introducing the gpu_allocator ( #6091 )
...
* rename 'use_pytorch_for_gpu_memory' to 'gpu_allocator'
* --code instead of --code-path
* update documentation
* avoid querying the "system" section directly
* add explanation of gpu_allocator to TF/PyTorch section in docs
* fix typo
* fix typo 2
* use set_gpu_allocator from thinc 8.0.0a34
* default null instead of empty string
2020-09-19 01:17:02 +02:00
Adriane Boyd
47080fba98
Minor renaming / refactoring
...
* Rename loader to `spacy.LookupsDataLoader.v1`, add debugging message
* Make `Vocab.lookups` a property
2020-09-18 19:43:19 +02:00
svlandeg
73ff52b9ec
hack for tok2vec listener
2020-09-18 16:43:15 +02:00
Adriane Boyd
eed4b785f5
Load vocab lookups tables at beginning of training
...
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.
The option moves from `nlp.load_vocab_data` to `training.lookups`.
Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.
The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.
To load `lexeme_norm` from `spacy-lookups-data`:
```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
2020-09-18 15:59:16 +02:00
Ines Montani
a127fa475e
Merge pull request #6078 from svlandeg/fix/corpus
2020-09-18 14:44:21 +02:00
Matthew Honnibal
bbdb5f62b7
Temporary work-around for scoring a subset of components ( #6090 )
...
* Try hacking the scorer to work around sentence boundaries
* Upd scorer
* Set dev version
* Upd scorer hack
* Fix version
* Improve comment on hack
2020-09-18 14:26:42 +02:00
Adriane Boyd
a88106e852
Remove W106: HEAD and SENT_START in doc.from_array ( #6086 )
...
* Remove W106: HEAD and SENT_START in doc.from_array
This warning was hacky and being triggered too often.
* Fix test
2020-09-18 03:01:29 +02:00
svlandeg
e4fc7e0222
fixing output sample to proper 2D array
2020-09-17 22:34:36 +02:00
Adriane Boyd
8b650f3a78
Modify setting missing and blocked entity tokens
...
In order to make it easier to construct `Doc` objects as training data,
modify how missing and blocked entity tokens are set to prioritize
setting `O` and missing entity tokens for training purposes over setting
blocked entity tokens.
* `Doc.ents` setter sets tokens outside entity spans to `O` regardless
of the current state of each token
* For `Doc.ents`, setting a span with a missing label sets the `ent_iob`
to missing instead of blocked
* `Doc.block_ents(spans)` marks spans as hard `O` for use with the
`EntityRecognizer`
2020-09-17 21:27:42 +02:00
Ines Montani
3865214343
Use consistent shortcut
2020-09-17 16:57:02 +02:00
svlandeg
35a3931064
fix typo
2020-09-17 16:36:27 +02:00
svlandeg
ddfc1fc146
add pretraining option to init config
2020-09-17 16:05:40 +02:00
svlandeg
427dbecdd6
cleanup and formatting
2020-09-17 11:48:04 +02:00
svlandeg
0c35885751
generalize corpora, dot notation for dev and train corpus
2020-09-17 11:38:59 +02:00
svlandeg
781fae678b
Merge remote-tracking branch 'upstream/develop' into fix/corpus
2020-09-17 09:24:36 +02:00
Matthew Honnibal
8303d101a5
Set version to v3.0.0a19
2020-09-17 00:18:49 +02:00
Adriane Boyd
7e4cd7575c
Refactor Docs.is_ flags ( #6044 )
...
* Refactor Docs.is_ flags
* Add derived `Doc.has_annotation` method
* `Doc.has_annotation(attr)` returns `True` for partial annotation
* `Doc.has_annotation(attr, require_complete=True)` returns `True` for
complete annotation
* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`
* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.
Notes on `Doc.has_annotation`:
* `HEAD` is converted to `DEP` because heads don't have an unset state
* Accept `IS_SENT_START` as a synonym of `SENT_START`
Additional changes:
* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`
* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`
* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`
* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`
* Fix call to set_children_from_heads
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 00:14:01 +02:00
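The difference between partial and complete annotation can be sketched without spaCy. The function below is a hypothetical stand-in that works on a plain list of per-token values, with `None` meaning unset. It mirrors only the two behaviours listed above and ignores the `HEAD`/`DEP` and `IS_SENT_START` special cases.

```python
from typing import Optional, Sequence


def has_annotation(values: Sequence[Optional[str]], require_complete: bool = False) -> bool:
    """Report whether a sequence of per-token values is annotated.

    By default any set value counts (partial annotation); with
    require_complete=True every token must have a value. Empty input is not
    covered by the commit message, so this sketch does not special-case it.
    """
    is_set = [value is not None for value in values]
    if require_complete:
        return all(is_set)
    return any(is_set)


partial_tags = ["NOUN", None, "VERB"]
print(has_annotation(partial_tags))
print(has_annotation(partial_tags, require_complete=True))
```

A single check with a `require_complete` switch covers both cases that the separate `is_tagged`-style flags could not distinguish.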
Adriane Boyd
a119667a36
Clean up spacy.tokens ( #6046 )
...
* Clean up spacy.tokens
* Update `set_children_from_heads`:
* Don't check `dep` when setting lr_* or sentence starts
* Set all non-sentence starts to `False`
* Use `set_children_from_heads` in `Token.head` setter
* Reduce similar/duplicate code (admittedly adds a bit of overhead)
* Update sentence starts consistently
* Remove unused `Doc.set_parse`
* Minor changes:
* Declare cython variables (to avoid cython warnings)
* Clean up imports
* Modify set_children_from_heads to set token range
Modify `set_children_from_heads` so that it adjusts tokens within a
specified range rather than the whole document.
Modify the `Token.head` setter to adjust only the tokens affected by the
new head assignment.
2020-09-16 20:32:38 +02:00
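The idea of deriving sentence starts from heads alone, without looking at dependency labels, can be sketched as follows. This is a simplified illustration and not the Cython code in spaCy: the function name `sentence_starts_sketch`, the absolute-index head encoding (a root points to itself) and the `1` / `-1` return values are assumptions of this sketch, and non-projective trees are not handled.

```python
from typing import Dict, List


def sentence_starts_sketch(heads: List[int]) -> List[int]:
    """Mark the leftmost token of each tree as a sentence start.

    heads[i] is the absolute index of token i's head; a root token has
    heads[i] == i. Returns 1 for sentence starts and -1 for other tokens.
    """

    def root_of(i: int) -> int:
        # Follow head links upward; the `seen` set guards against cycles.
        seen = set()
        while heads[i] != i and i not in seen:
            seen.add(i)
            i = heads[i]
        return i

    # Iterating left to right, the first token seen for a root is the
    # leftmost token of that root's tree.
    left_edge: Dict[int, int] = {}
    for i in range(len(heads)):
        left_edge.setdefault(root_of(i), i)

    starts = set(left_edge.values())
    return [1 if i in starts else -1 for i in range(len(heads))]


# Two trees: tokens 0-2 are rooted at token 1, tokens 3-4 at token 4.
starts = sentence_starts_sketch([1, 1, 1, 4, 4])
```

No dependency label is consulted at any point; the boundaries come from the head structure alone.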
Matthew Honnibal
c776594ab1
Fix
2020-09-16 18:15:14 +02:00
Matthew Honnibal
4a573d18b3
Add comment
2020-09-16 17:51:29 +02:00
Matthew Honnibal
d31afc8334
Fix Language.link_components when model is None
2020-09-16 17:49:48 +02:00
Adriane Boyd
f3db3f6fe0
Add vectors option to CharacterEmbed ( #6069 )
...
* Add vectors option to CharacterEmbed
* Update spacy/pipeline/morphologizer.pyx
* Adjust default morphologizer config
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-16 17:45:04 +02:00
Adriane Boyd
d722a439aa
Remove unneeded methods in senter and morphologizer ( #6074 )
...
Now that the tagger doesn't manage the tag map, the child classes senter
and morphologizer don't need to override the serialization methods.
2020-09-16 17:39:41 +02:00
Adriane Boyd
87c329c711
Set rule-based lemmatizers as default ( #6076 )
...
For languages without provided models and with lemmatizer rules in
`spacy-lookups-data`, make the rule-based lemmatizer the default:
Bengali, Persian, Norwegian, Swedish
2020-09-16 17:37:29 +02:00
svlandeg
1040e250d8
actual commit with test for custom readers with ml_datasets >= 0.2
2020-09-16 16:41:28 +02:00
svlandeg
714a5a05c6
test for custom readers with ml_datasets >= 0.2
2020-09-16 16:39:55 +02:00
svlandeg
0d1392340f
Merge remote-tracking branch 'upstream/develop' into fix/corpus
2020-09-15 23:17:08 +02:00
svlandeg
f420aa1138
use e.value to get to the ExceptionInfo value
2020-09-15 22:30:09 +02:00
svlandeg
7336657662
corpus is a Dict
2020-09-15 22:07:16 +02:00
svlandeg
51fa929f47
rewrite train_corpus to corpus.train in config
2020-09-15 21:58:04 +02:00
svlandeg
bd87e8686e
move tests to correct subdir
2020-09-15 21:40:38 +02:00
Ines Montani
aaf01689a1
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-09-15 14:24:42 +02:00
Ines Montani
91a6637f74
Remove extra pipe config values before merging
2020-09-15 14:24:17 +02:00
Ines Montani
d3d7f92f05
Fix lang check and error handling in Language.from_config
2020-09-15 14:24:06 +02:00
Ines Montani
2ed6e2a218
Auto-format
2020-09-15 14:20:04 +02:00
Ines Montani
2214d1bb7b
Merge pull request #6067 from explosion/feature/spacy-blank-from-config
2020-09-15 14:18:33 +02:00
Ines Montani
253ba5ef14
Raise for bad Vocab values
2020-09-15 13:25:34 +02:00
svlandeg
7677e5c0e2
fix wandb logger when calling multiple times from same script
2020-09-15 12:56:33 +02:00
Ines Montani
eff9406718
Support vocab arg in spacy.blank
2020-09-15 11:39:36 +02:00
Ines Montani
99549a5ace
Fix consistency and update docs
2020-09-15 11:37:37 +02:00
Ines Montani
7dfc4bc062
Allow overriding meta from spacy.blank
2020-09-15 11:12:12 +02:00
Ines Montani
0f943157af
Delegate to Language.from_config in spacy.blank
2020-09-15 11:07:55 +02:00
Ines Montani
e977086a9a
Update default pretraining config [ci skip]
2020-09-15 01:12:02 +02:00
Ines Montani
154752f9c2
Update docs and consistency [ci skip]
2020-09-15 00:32:49 +02:00
Ines Montani
9cc304c194
Merge pull request #6064 from explosion/fix/sparse-checkout-ux
...
Fix sparse checkout and error handling
2020-09-15 00:32:20 +02:00
Matthew Honnibal
475323cd36
Set version to v3.0.0a18
2020-09-14 22:05:43 +02:00
Matthew Honnibal
e8378b57bc
Fix test
2020-09-14 21:21:13 +02:00
Matthew Honnibal
adf0bab23a
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2020-09-14 21:04:49 +02:00
Matthew Honnibal
ae15fa9688
Fix iob converter
2020-09-14 21:02:18 +02:00
Sofie Van Landeghem
3216a33149
positive_label config for textcat ( #6062 )
...
* hook up positive_label in textcat
* unit tests
* documentation
* formatting
* tests
* fix typo
* move verify_config to after begin_training
* revert accidental commit
2020-09-14 17:08:00 +02:00
Ines Montani
c052017025
Fix sparse checkout and error handling
2020-09-14 14:12:58 +02:00
Matthew Honnibal
fdd2340f6c
Set version to v3.0.0a17
2020-09-13 23:52:03 +02:00
Ines Montani
416deb412f
Prevent duplicate traceback on CalledProcessError [ci skip]
2020-09-13 19:28:54 +02:00
Ines Montani
61a4ef0b46
Fix syntax error
2020-09-13 19:23:09 +02:00
Matthew Honnibal
b693d2d224
Fix speed report in table
2020-09-13 17:39:31 +02:00
Sofie Van Landeghem
744df9814a
define threshold for scoring textcat in TextCat config ( #6055 )
...
* define threshold for scoring textcat in TextCat config
* fix unit test and documentation
2020-09-13 14:15:52 +02:00
Adriane Boyd
ab270364f1
Modify Token.morph to enable unsetting ( #6043 )
...
Modify `Token.morph` property so that `Token.c.morph` can be reset back
to an internal value of `0`. Allow setting `Token.morph` from a hash as
long as the morph string is already in the `StringStore`, setting it
indirectly through `Token.morph_` so that the value is added to the
morphology. If the hash is not in the `StringStore`, raise an error.
2020-09-13 14:06:07 +02:00
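The setter behaviour described in the commit above (reset to `0`, accept a known hash, reject an unknown one) can be sketched with a toy class. `ToyToken` and the plain dictionary standing in for the `StringStore` are hypothetical and do not mirror spaCy's internals; the hash values are arbitrary integers chosen for the example.

```python
from typing import Dict


class ToyToken:
    """Hypothetical stand-in for a token with an unsettable morph value."""

    def __init__(self, strings: Dict[int, str]) -> None:
        self._strings = strings  # toy string store: hash -> morph string
        self._morph = 0          # 0 means "unset"

    @property
    def morph(self) -> int:
        return self._morph

    @morph.setter
    def morph(self, key: int) -> None:
        if key == 0:
            # Resetting to the internal "unset" value is always allowed.
            self._morph = 0
        elif key in self._strings:
            # The hash is known, so the morph string can be looked up.
            self._morph = key
        else:
            raise ValueError(f"hash {key} is not in the string store")


token = ToyToken({101: "Number=Sing"})
token.morph = 101
morph_after_set = token.morph
token.morph = 0
morph_after_reset = token.morph
```

Assigning a hash that is not in the store, such as `token.morph = 999`, raises `ValueError` in this sketch, matching the error case in the commit message.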
Adriane Boyd
c7bd631b5f
Fix token.idx for special cases with affixes ( #6035 )
2020-09-13 14:05:36 +02:00
Matthew Honnibal
54c40223a1
Improve v3 pretrain command ( #6040 )
...
* Starts to run
* Update pretrain script
* Update corpus
* Update pretrain schema
* Remove outdated test
* Make JsonlTexts produce Example objects.
2020-09-13 14:05:05 +02:00