Commit Graph

14763 Commits

Author SHA1 Message Date
Paul O'Leary McCann
5aba213349 Fix skweak Github URL
Github entry should not contain url, just user/repo
2021-05-31 18:00:43 +09:00
Kristian Boda
dc8d8d15d2
Add hmrb to spaCy Universe (#8129)
* docs: add hmrb to spacy universe

* docs: add sentence on spacy versions

* docs: update description and images

* misc: add spaCy Contributor Agreement
2021-05-31 18:40:48 +10:00
Dhruv Naik
283f64a98d
Fix bug from Entityruler: ent_ids returns None for phrases (#8169)
* bugfix for explosion/spaCy#8168

* add test for explosion/spaCy#8168
2021-05-31 18:38:53 +10:00
Michael K
b0467d2972
Add project urls to package metadata (#7728)
This adds the links to PyPI. To see that in action check out
https://pypi.org/project/Django/ (source code:
b8c9e9fae1/setup.cfg (L27-L32))
2021-05-31 18:38:29 +10:00
Narayan Acharya
6b79714080
Address missing config overrides post load of models (#8208) 2021-05-31 18:36:52 +10:00
Sofie Van Landeghem
fff662e41f
Ensemble textcat with listener (#8012)
* add unit test for two listeners, with a textcat ensemble in the middle

* return zero gradients instead of None in accumulate_gradient
2021-05-31 18:21:06 +10:00
Sofie Van Landeghem
ff91e6dac7
Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-05-31 18:20:27 +10:00
Paul O'Leary McCann
d1a221a374
Add all symbols in Unicode Currency Symbols block (#8212)
* Add all symbols in Unicode Currency Symbols block

In #8102 it came up that the rupee symbol was treated different from
dollar / euro / yen symbols. This adds many symbols not already
included.

* Fix test

* Fix training test
2021-05-31 18:03:40 +10:00
Paul O'Leary McCann
04239e94c7
Use a context manager when reading model (fix #7036) (#8244) 2021-05-31 17:36:17 +10:00
Sofie Van Landeghem
fc37715cfb
ensure 'spacy ray' works (#7799)
* ensure 'spacy ray' works

* better fix by changing entry point
2021-05-28 18:15:31 +02:00
svlandeg
0aa1083ce8 avoid repetitive entities in the output 2021-05-28 16:52:51 +02:00
svlandeg
0d81bce9cc add failing test for too short a sentence 2021-05-28 15:10:35 +02:00
svlandeg
0f5c586e2f add basic tests for debugging 2021-05-28 14:19:55 +02:00
Ines Montani
5957ab74f7
Merge pull request #8112 from svlandeg/bugfix/replace-trf 2021-05-28 11:35:17 +10:00
svlandeg
391b512afd fix types of fwd functions 2021-05-27 16:36:46 +02:00
svlandeg
04b55bf054 removing unused imports 2021-05-27 16:31:38 +02:00
svlandeg
910026582d set versions to v1 instead of v0 2021-05-27 16:17:20 +02:00
svlandeg
2e3c0e2256 delete outdated tests 2021-05-27 13:54:31 +02:00
svlandeg
ba2e491cc4 Merge remote-tracking branch 'upstream/master' into feature/coref 2021-05-27 13:50:32 +02:00
Sofie Van Landeghem
3c58c0323f
fix docs (#8200) 2021-05-27 10:48:59 +02:00
Sofie Van Landeghem
290bd6ed39
ensure tolerance is properly passed on (#8158) 2021-05-27 18:10:28 +10:00
Paul O'Leary McCann
0c553ecd4e Fix docs (fix #8189) 2021-05-24 19:47:30 +09:00
Paul O'Leary McCann
a484245f35 Remove references to coref_er 2021-05-24 19:08:45 +09:00
Paul O'Leary McCann
d6389b133d Don't use a generator for no reason 2021-05-24 19:06:15 +09:00
Paul O'Leary McCann
d6fd5fe1c0 Minor cleanup 2021-05-24 14:56:43 +09:00
Paul O'Leary McCann
0942a0b51b Remove coref_er.py
The intent of this was that it would be a component pipeline that used
entities as input, but that's now covered by the get_mentions function
as a pipeline arg.
2021-05-21 18:20:25 +09:00
Paul O'Leary McCann
f6652c9252 Add new coref scoring
This is closer to the traditional evaluation method. That uses an
average of three scores, this is just using the bcubed metric for now
(nothing special about bcubed, just picked one).

The scoring implementation comes from the coval project. It relies on
scipy, which is one issue, and is rather involved, which is another.

Besides being comparable with traditional evaluations, this scoring is
relatively fast.
2021-05-21 15:56:40 +09:00
Paul O'Leary McCann
e1b4a85bb9 Fix loss
The loss was being returned as a single element array, which caused
training to die when it attempted to turn it into JSON.
2021-05-21 15:46:50 +09:00
Adriane Boyd
cd6bd91c3a
Switch default train corpus max_length to 0 in quickstart (#8142)
The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length
!= 0` that `0` is a better default for users creating a new config with
the quickstart.

If not, documents are skipped, sometimes the entire corpus is skipped,
and sometimes documents are (quite unexpectedly for your average user)
split into sentences.
2021-05-20 14:48:09 +02:00
Paul O'Leary McCann
ff3fed06cf Catch a stray reference 2021-05-20 21:30:46 +09:00
Sofie Van Landeghem
202943bc8c
KB & NEL to/from bytes (#8113)
* unit test for pickling KB

* add pickling test for NEL

* KB to_bytes and from_bytes

* NEL to_bytes and from_bytes

* xfail pickle tests for now

* fix docs

* cleanup
2021-05-20 18:11:30 +10:00
Paul O'Leary McCann
8c5df622d8 Help out python gc in coref backprop 2021-05-20 16:40:55 +09:00
Paul O'Leary McCann
fa92daf052 Break pairwise operations into pseudolayers
This makes their scope tighter and more contained, and has the nice side
effect that fewer things need to be passed around for backprop.
2021-05-20 15:59:51 +09:00
Adriane Boyd
4e69fcaa50 Disable GPU CI tests (#8143) 2021-05-19 12:00:31 +02:00
Adriane Boyd
f6128c06b0
Disable GPU CI tests (#8143) 2021-05-19 12:00:07 +02:00
Paul O'Leary McCann
d22acee4f7 Fix backprop
Training seems to actually run now!
2021-05-18 20:09:27 +09:00
Paul O'Leary McCann
2486b8ad4d Fix pipeline intialize 2021-05-18 19:56:27 +09:00
Paul O'Leary McCann
0620820857 Deal with generators in tuplify 2021-05-18 19:55:52 +09:00
Paul O'Leary McCann
a7d9c8156d Make get_sentence_map work with init
When sentences are not available, just treat the whole doc as one
sentence. A reasonable general fallback, but important due to the init
call, where upstream components aren't run.
2021-05-18 19:54:54 +09:00
Paul O'Leary McCann
883c137b26 Add basic tuplify init 2021-05-18 19:53:59 +09:00
Paul O'Leary McCann
051715506e Fiddle with get_mentions definition
Ended up not making a difference, but oh well.
2021-05-18 19:53:33 +09:00
Adriane Boyd
06324e5a5e
Update pydantic requirements (#8127)
Update pydantic requirements following
https://github.com/explosion/thinc/pull/499
2021-05-18 11:35:50 +02:00
Paul O'Leary McCann
a33d29441a Merge remote-tracking branch 'upstream/develop' into feature/coref 2021-05-18 17:00:17 +09:00
Adriane Boyd
6baab565eb
Minor updates to quickstart settings/instructions (#7965)
* Minor updates to quickstart settings/instructions

* set default value of textcat exclusive to `false` until the default
checkbox behavior is updated
* add the `morphologizer` to the list of components
* add a note that v3.0.6+ is required

* Switch to warning above quickstart

* Undo changes to textcat default in quickstart

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-05-17 16:55:22 +02:00
Adriane Boyd
2c545c4c5b
Fix offsets in Span.get_lca_matrix (#8116)
* Fix range in Span.get_lca_matrix

Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.

* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`

* Update test for v3.x
2021-05-17 16:54:23 +02:00
Sofie Van Landeghem
0dffc5d9e2
Custom warning if the doc_bin is too large (#8069)
* custom warning if the doc_bin is too large

* cleanup

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* fix numbering

* fixing numbering once more

* fixing this seems to be pretty hard

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-17 15:48:40 +02:00
Adriane Boyd
b120fb3511
Handle errors while multiprocessing (#8004)
* Handle errors while multiprocessing

Handle errors while multiprocessing without hanging.

* Return the traceback for errors raised while processing a batch, which
  can be handled by the top-level error handler
* Allow for shortened batches due to custom error handlers that ignore
  errors and skip documents

* Define custom components at a higher level

* Also move up custom error handler

* Use simpler component for test

* Switch error type

* Adjust test

* Only call top-level error handler for exceptions

* Register custom test components within tests

Use global functions (so they can be pickled) but register the
components only within the individual tests.
2021-05-17 13:28:39 +02:00
Adriane Boyd
8a2602051c
Update debug data for textcat (#8066)
* Check for unsupported cats values
* Only show labels if train/dev mismatched
* Don't show label counts (only counting positive labels seems odd)
* Use warnings for mismatched train/dev labels
2021-05-17 13:27:04 +02:00
Adriane Boyd
1d59fdbd39
Update Vietnamese tokenizer (#8099)
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese

Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
2021-05-17 18:16:20 +10:00
Adriane Boyd
fe3a4aa846
Add ENT_ID and NORM to DocBin strings (#8054)
Save strings for token attributes `ENT_ID` and `NORM` in `DocBin`
strings.
2021-05-17 18:06:11 +10:00