Commit Graph

14760 Commits

Author SHA1 Message Date
Edward
944ad6b1d4
Add new parameter for saving every n epoch in pretraining (#8912)
* Add parameter for saving every n epoch

* Add new parameter in schemas

* Add new parameter in default_config

* Adjust schemas

* format code
2021-08-12 11:14:48 +02:00
Adriane Boyd
f99d6d5e39
Refactor scoring methods to use registered functions (#8766)
* Add scorer option to components

Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.

* Add registered scorers for all components

* Add `scorers` registry
* Move all scoring methods outside of components as independent
  functions and register
* Use the registered scoring methods as defaults in configs and inits

Additional:

* The scoring methods no longer have access to the full component, so
  use settings from `cfg` as default scorer options to handle settings
  such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
  patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
  score overlapping spans correctly

* Update Russian lemmatizer to use direct score method

* Check type of cfg in Pipe.score

* Fix check

* Update spacy/pipeline/sentencizer.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove validate_examples from scoring functions

* Use Pipe.labels instead of Pipe.cfg["labels"]

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-10 15:13:39 +02:00
fgaim
ee011ca963
Update Tigrinya ትግርኛ language support (#8900)
* Add missing punctuation for Tigrinya and Amharic

* Fix numeral and ordinal numbers for Tigrinya

 - Amharic was used in many cases
 - Also fixed some typos

* Update Tigrinya stop-words

* Contributor agreement for fgaim

* Fix typo in "ti" lang test

* Remove multi-word entries from numbers and ordinals
2021-08-10 13:55:08 +02:00
Dimitar Ganev
733ffe439d
Improve the stop words and the tokenizer exceptions in Bulgarian language. (#8862)
* Add more stop words and Improve the readability

* Add and categorize the tokenizer exceptions for `bg` lang

* Create syrull.md

* Add references for the additional stop words and tokenizer exc abbrs
2021-08-10 13:44:23 +02:00
Adriane Boyd
415dee587c
Merge pull request #8911 from adrianeboyd/chore/update-develop-from-master-v3.1-1
Update develop from master
2021-08-09 15:41:36 +02:00
Adriane Boyd
a79888ed67 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-1 2021-08-09 13:13:13 +02:00
Paul O'Leary McCann
cac298471f
Fix #8902 (bad link in docs)
typo fix
2021-08-08 22:04:00 +09:00
Eduard Zorita
439f30faad
Add stub files for main cython classes (#8427)
* Add stub files for main API classes

* Add contributor agreement for ezorita

* Update types for ndarray and hash()

* Fix __getitem__ and __iter__

* Add attributes of Doc and Token classes

* Overload type hints for Span.__getitem__

* Fix type hint overload for Span.__getitem__

Co-authored-by: Luca Dorigo <dorigoluca@gmail.com>
2021-08-07 12:30:03 +02:00
github-actions[bot]
56d4d87aeb
Auto-format code with black (#8895)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-06 13:38:06 +02:00
Kabir Khan
1dfffe5fb4
No output info message in train (#8885)
* Add info message that no output directory was provided in train

* Update train.py

* Fix logging
2021-08-05 09:21:22 +02:00
Adriane Boyd
fa2e7a4bbf
Fix spancat tests on GPU (#8872)
* Fix spancat tests on GPU

* Fix more spancat tests
2021-08-04 14:29:43 +02:00
Paul O'Leary McCann
77d698dcae
Fix check for RIGHT_ATTRS in dep matcher (#8807)
* Fix check for RIGHT_ATTRs in dep matcher

If a non-anchor node does not have RIGHT_ATTRS, the dep matcher throws
an E100, which says that non-anchor nodes must have LEFT_ID, REL_OP, and
RIGHT_ID. It specifically does not say RIGHT_ATTRS is required.

A blank RIGHT_ATTRS is also valid, and patterns with one will be
excepted. While not normal, sometimes a REL_OP is enough to specify a
non-anchor node - maybe you just want the head of another node
unconditionally, for example.

This change just sets RIGHT_ATTRS to {} if not present. Alternatively
changing E100 to state RIGHT_ATTRS is required could also be reasonable.

* Fix test

This test was written on the assumption that if `RIGHT_ATTRS` isn't
present an error will be raised. Since the proposed changes make it so
an error won't be raised this is no longer necessary.

* Revert test, update error message

Error message now lists missing keys, and RIGHT_ATTRS is required.

* Use list of required keys in error message

Also removes unused key param arg.
2021-08-04 09:20:41 +02:00
Adriane Boyd
941a591f3c
Pass excludes when serializing vocab (#8824)
* Pass excludes when serializing vocab

Additional minor bug fix:

* Deserialize vocab in `EntityLinker.from_disk`

* Add test for excluding strings on load

* Fix formatting
2021-08-03 14:42:44 +02:00
Adriane Boyd
175847f92c
Support list values and INTERSECTS in Matcher (#8784)
* Support list values and IS_INTERSECT in Matcher

* Support list values as token attributes for set operators, not just as
pattern values.

* Add `IS_INTERSECT` operator.

* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.

* Rename IS_INTERSECT to INTERSECTS
2021-08-02 19:39:26 +02:00
Adriane Boyd
fbbbda1954
Fix start/end chars for empty and out-of-bounds spans (#8816) 2021-08-02 19:07:19 +02:00
Adriane Boyd
9ad3b8cf8d
Only add sourced vectors hashes to meta if necessary (#8830) 2021-08-02 18:22:35 +02:00
Nick Sorros
0485cdefcc
Add logger debug for project push and pull (#8860)
* Add logger debug for project push and pull

* Sign contributor agreement
2021-08-02 18:13:53 +02:00
themrmax
de076194c4
Make ConsoleLogger flush after each logging line (#8810)
This is necessary to avoid "logging blackouts" when running training on Kubernetes pods
2021-08-02 14:33:38 +02:00
Ines Montani
30f20496d5
Merge pull request #8840 from polm/docs/evaluate-speed [ci skip] 2021-07-30 09:10:15 +10:00
Ines Montani
65d163fab5
Adjust formatting [ci skip] 2021-07-30 09:10:04 +10:00
Ines Montani
3a701d3645
Merge pull request #8841 from adrianeboyd/docs/ent-id-sep [ci skip]
Fix formatting of ent_id_sep in EntityRuler API docs
2021-07-30 09:09:25 +10:00
Ines Montani
f08be084fb
Merge pull request #8844 from thomashacker/bugfix/fix-doc-transformer-typo [ci skip]
Fix typo in Tok2VecTransformer example config
2021-07-30 09:08:59 +10:00
thomashacker
02258916c8 Fix example config typo for transformer architecture 2021-07-29 11:19:40 +02:00
Adriane Boyd
15b12f3e35 Fix formatting of ent_id_sep in EntityRuler API docs 2021-07-29 10:10:12 +02:00
Paul O'Leary McCann
a60cb13910 Update speed entry in metrics table 2021-07-29 16:35:19 +09:00
Paul O'Leary McCann
e125313a50 Revert "Add note about SPEED in output"
This reverts commit c92d268176.
2021-07-29 16:34:08 +09:00
Ines Montani
0a1e299d30
Merge pull request #8814 from polm/docs/migrate-lexeme-tables [ci skip] 2021-07-29 17:18:02 +10:00
Paul O'Leary McCann
c92d268176 Add note about SPEED in output
In #8823 it was pointed out that the `SPEED` value wasn't documented
anywhere.
2021-07-29 15:03:07 +09:00
Paul O'Leary McCann
8867e60fbb
Update website/docs/usage/v3.md
Co-authored-by: Ines Montani <ines@ines.io>
2021-07-29 14:56:56 +09:00
Adriane Boyd
8547514aa4
Remove labels from textcat component config example (#8815) 2021-07-27 13:14:38 +02:00
Paul O'Leary McCann
76ac95923a Add note to migration guide about lexeme tables (fix #7290)
This just adds the resolution from #6388 to the docs.
2021-07-27 19:19:25 +09:00
Paul O'Leary McCann
67ecdcc3ac
Update subset/superset docs (#8795)
* Update subset/superset docs

* Update website/docs/usage/rule-based-matching.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-27 12:08:46 +02:00
Adriane Boyd
81d3a1edb1
Use tokenizer URL_MATCH pattern in LIKE_URL (#8765) 2021-07-27 12:07:01 +02:00
Adriane Boyd
4f28190afe
Merge pull request #8813 from adrianeboyd/chore/develop-v3.2
Update develop for v3.2
2021-07-27 11:26:18 +02:00
Ines Montani
7f21c7dfa2
Merge pull request #8794 from explosion/autoblack
Auto-format code with black
2021-07-27 12:17:15 +10:00
Ines Montani
34c401f04f
Merge pull request #8801 from polm/fix/respect-no-skip (fixes #8796)
Respect the no_skip value
2021-07-27 12:16:47 +10:00
Ines Montani
134cb06af3
Merge pull request #8808 from kevinlu1248/master [ci skip]
Changed a CLI command in data-formats.md due to erroneous information
2021-07-27 12:15:16 +10:00
Ines Montani
9bf0d6f2fd
Merge pull request #8806 from Ledenel/master [ci skip]
fix typo
2021-07-27 12:14:22 +10:00
Kevin Lu
4a8e9e4e4e
Update data-formats.md 2021-07-25 22:58:53 -07:00
Ledenel
413f745c68 fix broken example in spaCy universe Chatterbot 2021-07-25 15:53:32 +00:00
Paul O'Leary McCann
284b530c63 Respect the no_skip value
Seems like the logic for this was just left out. See #8796.
2021-07-24 15:31:17 +09:00
explosion-bot
a58ab6ea22 Auto-format code with black 2021-07-23 08:04:09 +00:00
Adriane Boyd
6bbc2b1956
Reload train corpus in debug data after initialize (#8776) 2021-07-21 22:38:40 +02:00
Adriane Boyd
d48c01a6f7
Remove extraneous grc test file (#8768) 2021-07-20 15:51:15 +02:00
Sofie Van Landeghem
ffaead8fe0
bump to 3.1.1 2021-07-19 14:48:27 +02:00
Sofie Van Landeghem
83e27d262e
negative tag annotation (#8731)
* unit test to unlearn tag via negative annotation

* bump thinc to 8.0.8
2021-07-19 14:39:11 +02:00
Adriane Boyd
0e4b96c97e
Update lexeme ranks for loaded vectors (#8640)
Update the ranks for any lexemes that have been added to the vocab
before the vectors are added to the model.
2021-07-19 18:25:54 +10:00
Adriane Boyd
e532c69475
Update Language.replace_pipe for disabled components (#8729)
* Fix the index where the replacement in inserted to account for
disabled components
* Allow `Language.replace_pipe` to replace disabled components
2021-07-19 18:06:12 +10:00
Paul O'Leary McCann
d717593eb7
Merge pull request #8754 from KennethEnevoldsen/patch-1
[minor] removed outdated spacy version for spacymoji
2021-07-18 19:17:33 +09:00
Paul O'Leary McCann
ac67639eaf
Merge pull request #8755 from KennethEnevoldsen/patch-2
fixed GitHub link and thumbnail
2021-07-18 19:14:57 +09:00