Commit Graph

15164 Commits

Author SHA1 Message Date
Ines Montani
4f769ff913 Update Prodigy project template for v1.11 [ci skip] 2021-08-12 13:46:20 +10:00
Paul O'Leary McCann
e227d24d43
Allow passing in array vars for speedup (#8882)
* Allow passing in array vars for speedup

This fixes #8845. Not sure about the docstring changes here...

* Update docs

Types maybe need more detail? Maybe not?

* Run prettier on docs

* Update spacy/tokens/span.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-10 15:13:53 +02:00
Adriane Boyd
f99d6d5e39
Refactor scoring methods to use registered functions (#8766)
* Add scorer option to components

Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.

* Add registered scorers for all components

* Add `scorers` registry
* Move all scoring methods outside of components as independent
  functions and register
* Use the registered scoring methods as defaults in configs and inits

Additional:

* The scoring methods no longer have access to the full component, so
  use settings from `cfg` as default scorer options to handle settings
  such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
  patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
  score overlapping spans correctly

* Update Russian lemmatizer to use direct score method

* Check type of cfg in Pipe.score

* Fix check

* Update spacy/pipeline/sentencizer.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove validate_examples from scoring functions

* Use Pipe.labels instead of Pipe.cfg["labels"]

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-08-10 15:13:39 +02:00
fgaim
ee011ca963
Update Tigrinya ትግርኛ language support (#8900)
* Add missing punctuation for Tigrinya and Amharic

* Fix numeral and ordinal numbers for Tigrinya

 - Amharic was used in many cases
 - Also fixed some typos

* Update Tigrinya stop-words

* Contributor agreement for fgaim

* Fix typo in "ti" lang test

* Remove multi-word entries from numbers and ordinals
2021-08-10 13:55:08 +02:00
Paul O'Leary McCann
6029cfc391
Add scores to output in spancat (#8855)
* Add scores to output in spancat

This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.

* Add basic doc note

* Vectorize score calcs

* Add "annotation format" section

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Clean up doc section

* Ran prettier on docs

* Get arrays off the gpu before iterating over them

* Remove int() calls

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-08-10 13:47:49 +02:00
Dimitar Ganev
733ffe439d
Improve the stop words and the tokenizer exceptions in Bulgarian language. (#8862)
* Add more stop words and Improve the readability

* Add and categorize the tokenizer exceptions for `bg` lang

* Create syrull.md

* Add references for the additional stop words and tokenizer exc abbrs
2021-08-10 13:44:23 +02:00
Adriane Boyd
415dee587c
Merge pull request #8911 from adrianeboyd/chore/update-develop-from-master-v3.1-1
Update develop from master
2021-08-09 15:41:36 +02:00
Ines Montani
c581848cbb Merge pull request #8910 from DuyguA/patch-1 [ci skip]
updated unv json for new book
2021-08-09 23:13:17 +10:00
Ines Montani
a1e9f19460
Merge pull request #8910 from DuyguA/patch-1 [ci skip]
updated unv json for new book
2021-08-09 23:12:50 +10:00
Paul O'Leary McCann
35255786a1 Fix #8902 (bad link in docs)
typo fix
2021-08-09 13:59:59 +02:00
Adriane Boyd
a79888ed67 Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-1 2021-08-09 13:13:13 +02:00
Duygu Altinok
380b2817cf
updated unv json for new book 2021-08-09 12:39:22 +02:00
Paul O'Leary McCann
cac298471f
Fix #8902 (bad link in docs)
typo fix
2021-08-08 22:04:00 +09:00
Eduard Zorita
439f30faad
Add stub files for main cython classes (#8427)
* Add stub files for main API classes

* Add contributor agreement for ezorita

* Update types for ndarray and hash()

* Fix __getitem__ and __iter__

* Add attributes of Doc and Token classes

* Overload type hints for Span.__getitem__

* Fix type hint overload for Span.__getitem__

Co-authored-by: Luca Dorigo <dorigoluca@gmail.com>
2021-08-07 12:30:03 +02:00
github-actions[bot]
56d4d87aeb
Auto-format code with black (#8895)
Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>
2021-08-06 13:38:06 +02:00
Kabir Khan
1dfffe5fb4
No output info message in train (#8885)
* Add info message that no output directory was provided in train

* Update train.py

* Fix logging
2021-08-05 09:21:22 +02:00
Adriane Boyd
fa2e7a4bbf
Fix spancat tests on GPU (#8872)
* Fix spancat tests on GPU

* Fix more spancat tests
2021-08-04 14:29:43 +02:00
Paul O'Leary McCann
77d698dcae
Fix check for RIGHT_ATTRS in dep matcher (#8807)
* Fix check for RIGHT_ATTRs in dep matcher

If a non-anchor node does not have RIGHT_ATTRS, the dep matcher throws
an E100, which says that non-anchor nodes must have LEFT_ID, REL_OP, and
RIGHT_ID. It specifically does not say RIGHT_ATTRS is required.

A blank RIGHT_ATTRS is also valid, and patterns with one will be
excepted. While not normal, sometimes a REL_OP is enough to specify a
non-anchor node - maybe you just want the head of another node
unconditionally, for example.

This change just sets RIGHT_ATTRS to {} if not present. Alternatively
changing E100 to state RIGHT_ATTRS is required could also be reasonable.

* Fix test

This test was written on the assumption that if `RIGHT_ATTRS` isn't
present an error will be raised. Since the proposed changes make it so
an error won't be raised this is no longer necessary.

* Revert test, update error message

Error message now lists missing keys, and RIGHT_ATTRS is required.

* Use list of required keys in error message

Also removes unused key param arg.
2021-08-04 09:20:41 +02:00
Adriane Boyd
941a591f3c
Pass excludes when serializing vocab (#8824)
* Pass excludes when serializing vocab

Additional minor bug fix:

* Deserialize vocab in `EntityLinker.from_disk`

* Add test for excluding strings on load

* Fix formatting
2021-08-03 14:42:44 +02:00
Adriane Boyd
c1caa47aa7 Support list values and INTERSECTS in Matcher (#8784)
* Support list values and IS_INTERSECT in Matcher

* Support list values as token attributes for set operators, not just as
pattern values.

* Add `IS_INTERSECT` operator.

* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.

* Rename IS_INTERSECT to INTERSECTS
2021-08-02 19:41:10 +02:00
Adriane Boyd
175847f92c
Support list values and INTERSECTS in Matcher (#8784)
* Support list values and IS_INTERSECT in Matcher

* Support list values as token attributes for set operators, not just as
pattern values.

* Add `IS_INTERSECT` operator.

* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.

* Rename IS_INTERSECT to INTERSECTS
2021-08-02 19:39:26 +02:00
Adriane Boyd
fbbbda1954
Fix start/end chars for empty and out-of-bounds spans (#8816) 2021-08-02 19:07:19 +02:00
Adriane Boyd
9ad3b8cf8d
Only add sourced vectors hashes to meta if necessary (#8830) 2021-08-02 18:22:35 +02:00
Nick Sorros
0485cdefcc
Add logger debug for project push and pull (#8860)
* Add logger debug for project push and pull

* Sign contributor agreement
2021-08-02 18:13:53 +02:00
themrmax
de076194c4
Make ConsoleLogger flush after each logging line (#8810)
This is necessary to avoid "logging blackouts" when running training on Kubernetes pods
2021-08-02 14:33:38 +02:00
Ines Montani
d79dbd0624 Merge pull request #8844 from thomashacker/bugfix/fix-doc-transformer-typo [ci skip]
Fix typo in Tok2VecTransformer example config
2021-07-30 09:11:24 +10:00
Ines Montani
4ddee5e84c Merge pull request #8841 from adrianeboyd/docs/ent-id-sep [ci skip]
Fix formatting of ent_id_sep in EntityRuler API docs
2021-07-30 09:11:15 +10:00
Ines Montani
cf9b671566 Merge pull request #8840 from polm/docs/evaluate-speed [ci skip] 2021-07-30 09:11:05 +10:00
Ines Montani
30f20496d5
Merge pull request #8840 from polm/docs/evaluate-speed [ci skip] 2021-07-30 09:10:15 +10:00
Ines Montani
65d163fab5
Adjust formatting [ci skip] 2021-07-30 09:10:04 +10:00
Ines Montani
3a701d3645
Merge pull request #8841 from adrianeboyd/docs/ent-id-sep [ci skip]
Fix formatting of ent_id_sep in EntityRuler API docs
2021-07-30 09:09:25 +10:00
Ines Montani
f08be084fb
Merge pull request #8844 from thomashacker/bugfix/fix-doc-transformer-typo [ci skip]
Fix typo in Tok2VecTransformer example config
2021-07-30 09:08:59 +10:00
thomashacker
02258916c8 Fix example config typo for transformer architecture 2021-07-29 11:19:40 +02:00
Adriane Boyd
15b12f3e35 Fix formatting of ent_id_sep in EntityRuler API docs 2021-07-29 10:10:12 +02:00
Paul O'Leary McCann
a60cb13910 Update speed entry in metrics table 2021-07-29 16:35:19 +09:00
Paul O'Leary McCann
e125313a50 Revert "Add note about SPEED in output"
This reverts commit c92d268176.
2021-07-29 16:34:08 +09:00
Ines Montani
03a742f332 Merge pull request #8814 from polm/docs/migrate-lexeme-tables [ci skip] 2021-07-29 17:19:44 +10:00
Ines Montani
0a1e299d30
Merge pull request #8814 from polm/docs/migrate-lexeme-tables [ci skip] 2021-07-29 17:18:02 +10:00
Paul O'Leary McCann
c92d268176 Add note about SPEED in output
In #8823 it was pointed out that the `SPEED` value wasn't documented
anywhere.
2021-07-29 15:03:07 +09:00
Paul O'Leary McCann
8867e60fbb
Update website/docs/usage/v3.md
Co-authored-by: Ines Montani <ines@ines.io>
2021-07-29 14:56:56 +09:00
Adriane Boyd
9e9611233f Remove labels from textcat component config example (#8815) 2021-07-27 13:15:33 +02:00
Paul O'Leary McCann
de5bc8a0e1 Update subset/superset docs (#8795)
* Update subset/superset docs

* Update website/docs/usage/rule-based-matching.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-27 13:15:27 +02:00
Adriane Boyd
8547514aa4
Remove labels from textcat component config example (#8815) 2021-07-27 13:14:38 +02:00
Paul O'Leary McCann
76ac95923a Add note to migration guide about lexeme tables (fix #7290)
This just adds the resolution from #6388 to the docs.
2021-07-27 19:19:25 +09:00
Paul O'Leary McCann
67ecdcc3ac
Update subset/superset docs (#8795)
* Update subset/superset docs

* Update website/docs/usage/rule-based-matching.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-07-27 12:08:46 +02:00
Adriane Boyd
81d3a1edb1
Use tokenizer URL_MATCH pattern in LIKE_URL (#8765) 2021-07-27 12:07:01 +02:00
Adriane Boyd
4f28190afe
Merge pull request #8813 from adrianeboyd/chore/develop-v3.2
Update develop for v3.2
2021-07-27 11:26:18 +02:00
Ines Montani
7f21c7dfa2
Merge pull request #8794 from explosion/autoblack
Auto-format code with black
2021-07-27 12:17:15 +10:00
Ines Montani
34c401f04f
Merge pull request #8801 from polm/fix/respect-no-skip (fixes #8796)
Respect the no_skip value
2021-07-27 12:16:47 +10:00
Ines Montani
cf3855ae05 Merge pull request #8806 from Ledenel/master [ci skip]
fix typo
2021-07-27 12:15:44 +10:00