Commit Graph

14571 Commits

Author SHA1 Message Date
Paul O'Leary McCann
0c553ecd4e Fix docs (fix #8189) 2021-05-24 19:47:30 +09:00
Adriane Boyd
cd6bd91c3a
Switch default train corpus max_length to 0 in quickstart (#8142)
The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length
!= 0` that `0` is a better default for users creating a new config with
the quickstart.

With a nonzero `max_length`, documents are skipped, sometimes the entire
corpus is skipped, and sometimes documents are (quite unexpectedly for
the average user) split into sentences.
2021-05-20 14:48:09 +02:00
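The `max_length` behavior described in this commit message can be illustrated with a small pure-Python sketch. This is not the `spacy.Corpus.v1` implementation: the function name `make_examples`, the list-of-sentences document model, and the strict `<` comparison are all assumptions made for illustration.

```python
from typing import Iterator, List

# A document is modeled as a list of sentences, each a list of tokens.
# This data model is an assumption of the sketch, not spaCy's Doc class.
Sentence = List[str]
Document = List[Sentence]


def make_examples(docs: List[Document], max_length: int = 0) -> Iterator[List[str]]:
    """Yield token lists, applying a hypothetical max_length filter."""
    for doc in docs:
        tokens = [tok for sent in doc for tok in sent]
        if not tokens:
            continue  # empty documents are always skipped
        if max_length == 0 or len(tokens) < max_length:
            yield tokens  # 0 means "no limit": the whole document is kept
        else:
            # Too long: fall back to sentences, skipping the long ones.
            for sent in doc:
                if sent and len(sent) < max_length:
                    yield sent


docs = [
    [["A", "short", "doc", "."]],
    [["First", "sentence", "."],
     ["Second", "one", "is", "much", "longer", "here", "."]],
]
unlimited = list(make_examples(docs, max_length=0))
limited = list(make_examples(docs, max_length=5))
```

With `max_length=0` both documents come back whole. With `max_length=5` the second document is split and its long sentence is silently dropped, which is the surprise the commit message refers to.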
Sofie Van Landeghem
202943bc8c
KB & NEL to/from bytes (#8113)
* unit test for pickling KB

* add pickling test for NEL

* KB to_bytes and from_bytes

* NEL to_bytes and from_bytes

* xfail pickle tests for now

* fix docs

* cleanup
2021-05-20 18:11:30 +10:00
Adriane Boyd
4e69fcaa50 Disable GPU CI tests (#8143) 2021-05-19 12:00:31 +02:00
Adriane Boyd
f6128c06b0
Disable GPU CI tests (#8143) 2021-05-19 12:00:07 +02:00
Adriane Boyd
06324e5a5e
Update pydantic requirements (#8127)
Update pydantic requirements following
https://github.com/explosion/thinc/pull/499
2021-05-18 11:35:50 +02:00
Adriane Boyd
6baab565eb
Minor updates to quickstart settings/instructions (#7965)
* Minor updates to quickstart settings/instructions

* set default value of textcat exclusive to `false` until the default
checkbox behavior is updated
* add the `morphologizer` to the list of components
* add a note that v3.0.6+ is required

* Switch to warning above quickstart

* Undo changes to textcat default in quickstart

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-05-17 16:55:22 +02:00
Adriane Boyd
2c545c4c5b
Fix offsets in Span.get_lca_matrix (#8116)
* Fix range in Span.get_lca_matrix

Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.

* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`

* Update test for v3.x
2021-05-17 16:54:23 +02:00
Sofie Van Landeghem
0dffc5d9e2
Custom warning if the doc_bin is too large (#8069)
* custom warning if the doc_bin is too large

* cleanup

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* fix numbering

* fixing numbering once more

* fixing this seems to be pretty hard

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-17 15:48:40 +02:00
Adriane Boyd
b120fb3511
Handle errors while multiprocessing (#8004)
* Handle errors while multiprocessing

Handle errors while multiprocessing without hanging.

* Return the traceback for errors raised while processing a batch, which
  can be handled by the top-level error handler
* Allow for shortened batches due to custom error handlers that ignore
  errors and skip documents

* Define custom components at a higher level

* Also move up custom error handler

* Use simpler component for test

* Switch error type

* Adjust test

* Only call top-level error handler for exceptions

* Register custom test components within tests

Use global functions (so they can be pickled) but register the
components only within the individual tests.
2021-05-17 13:28:39 +02:00
Adriane Boyd
8a2602051c
Update debug data for textcat (#8066)
* Check for unsupported cats values
* Only show labels if train/dev mismatched
* Don't show label counts (only counting positive labels seems odd)
* Use warnings for mismatched train/dev labels
2021-05-17 13:27:04 +02:00
Adriane Boyd
1d59fdbd39
Update Vietnamese tokenizer (#8099)
* Adapt tokenization methods from `pyvi` to preserve text encoding and
whitespace
* Add serialization support similar to Chinese and Japanese

Note: as for Chinese and Japanese, some settings are duplicated in
`config.cfg` and `tokenizer/cfg`.
2021-05-17 18:16:20 +10:00
Adriane Boyd
fe3a4aa846
Add ENT_ID and NORM to DocBin strings (#8054)
Save strings for token attributes `ENT_ID` and `NORM` in `DocBin`
strings.
2021-05-17 18:06:11 +10:00
Adriane Boyd
82fa81d095
Make all Span attrs writable (#8062)
Also allow `Span` string properties `label_` and `kb_id_` to be writable
following #6696.
2021-05-17 18:05:45 +10:00
svlandeg
b403f924ee Merge remote-tracking branch 'upstream/master' into bugfix/replace-trf 2021-05-17 09:47:47 +02:00
Ines Montani
595ef03e23
Merge pull request #8096 from juliensalinas/master [ci skip] 2021-05-17 13:58:37 +10:00
Julien Salinas
c496f78245 Add NLP Cloud to Universe. 2021-05-14 11:13:44 +02:00
Julien Salinas
a176d2209a Sign contributors agreement. 2021-05-14 11:00:27 +02:00
Paul O'Leary McCann
2dc6db53fd
Merge pull request #8072 from medianeuroscience/master
Added eMFDscore to universe.json
2021-05-14 11:58:30 +09:00
Frederic R. Hopp
c5962b9fba
Update universe.json
fixed typo
2021-05-13 07:40:05 -07:00
Frederic R. Hopp
a9ca221e03
Update universe.json
Added more detailed description to eMFDscore project
2021-05-12 09:20:17 -07:00
svlandeg
235e9f5488 call replace_listener_cfg attr if it's available 2021-05-12 17:19:38 +02:00
svlandeg
44a3a58599 call replace_listener attr if it's available 2021-05-12 16:01:02 +02:00
svlandeg
ece8be4fec extend test to training with replaced tok2vec layer 2021-05-12 11:32:22 +02:00
Frederic R. Hopp
7bba9cdc14
Update universe.json 2021-05-11 19:18:19 -07:00
Adriane Boyd
d5bbd1f94f
Handle partial entities in Span.as_doc (#8055)
* Handle partial entities in Span.as_doc

In `Span.as_doc` replace partial entities at the beginning or end of the
span with missing entity annotation.

Fixes a bug where invalid entity annotation (no initial `B`) was
returned for an initial partial entity.

* Check for empty span in ents conversion

Note: `Span.as_doc()` will still fail on an empty span due to failures
in `Span.vector`.
2021-05-11 17:10:16 +02:00
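One way to picture the partial-entity handling described in this commit is a sketch over plain IOB tags. It is not the code in `Span.as_doc`: the helper name `span_iob`, the unlabeled `B`/`I`/`O` tags, and the empty string as the "missing" marker are assumptions made for illustration.

```python
from typing import List

MISSING = ""  # stand-in for "no entity annotation"; an assumption of this sketch


def span_iob(doc_iob: List[str], start: int, end: int) -> List[str]:
    """Return IOB tags for doc_iob[start:end], blanking cut-off entities."""
    tags = list(doc_iob[start:end])
    # Entity cut off at the start: the span opens with "I" tokens and has
    # no initial "B", which would be invalid annotation on its own.
    i = 0
    while i < len(tags) and tags[i] == "I":
        tags[i] = MISSING
        i += 1
    # Entity cut off at the end: the first token after the span is "I".
    if tags and end < len(doc_iob) and doc_iob[end] == "I":
        j = len(tags) - 1
        while j >= 0 and tags[j] == "I":
            tags[j] = MISSING
            j -= 1
        if j >= 0 and tags[j] == "B":
            tags[j] = MISSING
    return tags


doc_tags = ["B", "I", "I", "O", "B", "I", "I"]
both_cut = span_iob(doc_tags, 1, 6)   # cuts into both entities
clean_cut = span_iob(doc_tags, 0, 4)  # ends on an entity boundary
```

A span that ends on entity boundaries keeps its tags unchanged, while tokens belonging to an entity that extends past either edge are marked as missing rather than left as a dangling `I`.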
Ines Montani
3883d49446 Fix default transformer in quickstart generator (resolves #8018) [ci skip] 2021-05-11 11:27:08 +10:00
Paul O'Leary McCann
bdeaf3a18b
Fix/fix en ordinals (#8028)
* Fix #8019

"th" is not the only ordinal ending.

* Add some more ordinal tests
2021-05-07 10:26:42 +02:00
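The point that "th" is not the only ordinal ending can be sketched as a small check. This is an illustrative helper, not the English `like_num` code in spaCy; the name `like_ordinal` and the tuple `ORDINAL_ENDINGS` are hypothetical.

```python
ORDINAL_ENDINGS = ("st", "nd", "rd", "th")  # hypothetical constant for this sketch


def like_ordinal(text: str) -> bool:
    """Return True for digits followed by an English ordinal ending."""
    lowered = text.lower()
    for ending in ORDINAL_ENDINGS:
        if lowered.endswith(ending):
            # Everything before the ending must be digits, e.g. "23" in "23rd".
            return lowered[: -len(ending)].isdigit()
    return False


results = {word: like_ordinal(word) for word in ["1st", "2nd", "3rd", "4th", "first", "th"]}
```

The sketch does not check that the ending agrees with the number, so a string such as "1th" would also be accepted.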
Adriane Boyd
71c2a3ab47
Fix new version for match_alignments (#8021) 2021-05-07 09:55:20 +02:00
Jeno Pizarro
5cf76ab608
Update negspacy example code for spaCy 3.0 (#8022) 2021-05-07 09:33:21 +02:00
Adriane Boyd
6788d90f61
Preserve existing ENT_KB_ID annotation in NER (#7988)
* Preserve existing ENT_KB_ID annotation in NER

Preserve `ent_kb_id` annotation on existing entity spans, which is not
preserved by the transition system.

* Simplify kb_id assignment

* Simplify further
2021-05-06 18:49:55 +10:00
Sofie Van Landeghem
02a6a5fea0
Fix 'debug model' for transformers + generalize (#7973)
* add overrides to docs

* fix debug model with transformer

* assume training data is set in config
2021-05-06 18:43:32 +10:00
Adriane Boyd
cc5aeaed29
Add Chinese PTB tags to glossary (#7993) 2021-05-06 18:43:03 +10:00
Adriane Boyd
0a22fed634
Fix span offsets for Matcher(as_spans) on spans (#7992)
Fix returned span offsets for `Matcher(as_spans=True)(span)`.
2021-05-06 18:42:44 +10:00
Adriane Boyd
7d5db41ac3
Skip vector ngram backoff if minn is not set (#7925) 2021-05-06 18:34:35 +10:00
Sofie Van Landeghem
e9037d8fc0
make EntityLinker robust for nO=None (#7930) 2021-05-06 18:14:47 +10:00
Paul O'Leary McCann
66bfabd839
Fix pretraining objectives fragment (#8005)
* Fix pretraining objectives fragment

The fragment here is reused from a heading higher up, so you couldn't
link to this section.

* Fix section link to new fragment
2021-05-06 08:27:36 +02:00
Adriane Boyd
a71194362f
Fix Doc.from_docs for all empty docs (#8009) 2021-05-05 18:44:14 +02:00
meghanabhange
debaab7021
Update details in universe denomme | Multilingual Name Detection (#7982)
* Add denomme

* spaCy contributor agreement

* Update install and thumb

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-05 17:12:13 +02:00
Adriane Boyd
31528f62ed
Add / to nb infixes (#7991) 2021-05-04 11:00:10 +02:00
Santiago Castro
e99ff6f255
Fix typo in Language docstrings (#7958) 2021-05-03 14:44:09 +02:00
Ines Montani
12d3d0fedd Fix quickstart default checked of conditional fields [ci skip] 2021-05-03 11:48:12 +10:00
Adriane Boyd
2320791f6d
Fix Transformer.initialize example (#7963) 2021-04-30 12:21:31 +02:00
Adriane Boyd
cf032ec31e
Update to catalogue>=2.0.4 (#7951) 2021-04-29 19:11:28 +02:00
Adriane Boyd
7cf5bd072f
Refactor util.to_ternary_int (#7944)
* Refactor to avoid literal comparison with `is`
* Extend tests
2021-04-29 16:58:54 +02:00
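A minimal sketch of the idea behind this refactor: use `is` only for the singletons `True`, `False` and `None`, and `==` for numeric values. The function body below is an assumption written for illustration, not a copy of `util.to_ternary_int`, and the mapping (1 for true, 0 for unset, -1 for false) is assumed as well.

```python
def to_ternary_int(val) -> int:
    """Map a value to 1 (true), 0 (unset) or -1 (false)."""
    # Identity checks are safe for singletons ...
    if val is True:
        return 1
    if val is None:
        return 0
    if val is False:
        return -1
    # ... but numbers are compared by value, since `val is 1` depends on
    # interpreter caching and triggers a SyntaxWarning in recent Pythons.
    if val == 1:
        return 1
    if val == 0:
        return 0
    return -1


ternary = [to_ternary_int(v) for v in (True, 1.0, None, 0, False, "string", -1)]
```

`False` has to be tested before `val == 0`, because `False == 0` is true in Python and would otherwise be mapped to 0.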
Sevdimali
49aed683cc
Azerbaijani language added (#7911) 2021-04-28 14:42:02 +02:00
Adriane Boyd
f4080983ea
Extend to cupy 9.0.0 (#7914) 2021-04-28 10:18:24 +02:00
Paul O'Leary McCann
8007d5c814
Check if the resume path points to a directory (#7919)
This came up in #7878: if `--resume-path` is a directory, loading the
weights will fail. On Linux this gives a straightforward error
message, but on Windows it gives "Permission Denied", which is
confusing.
2021-04-28 09:17:15 +02:00
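The kind of check described here can be sketched with `pathlib`. The function name `check_resume_path`, the exception type, and the error text are assumptions for illustration and do not reproduce the actual spaCy CLI code.

```python
import tempfile
from pathlib import Path


def check_resume_path(resume_path: Path) -> None:
    """Fail early with a clear message if the path is a directory."""
    if resume_path.is_dir():
        # Without this check, opening the path raises a platform-dependent
        # error, which on Windows reads "Permission Denied".
        raise ValueError(
            f"--resume-path should point to a weights file, "
            f"but {resume_path} is a directory."
        )


with tempfile.TemporaryDirectory() as tmp:
    try:
        check_resume_path(Path(tmp))
        raised = False
    except ValueError:
        raised = True
```

Checking before opening the file gives the same message on every platform.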
Paul O'Leary McCann
de6b5ed14d
Fix percent unk display in debug data (#7886)
* Fix percent unk display

This was showing (ratio %), so 10% would show as 0.10%. Fix by
multiplying the ratio by 100.

Might want to add a warning if this is over a threshold.

* Only show whole-integer percents
2021-04-27 09:16:35 +02:00
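The display bug and its fix can be shown in a few lines. The helper name `format_percent` is hypothetical, and the actual formatting code in `debug data` may differ.

```python
def format_percent(ratio: float) -> str:
    """Format a ratio in [0, 1] as a whole-integer percentage."""
    # Multiply by 100 first; printing the raw ratio with a "%" sign is
    # what made 10% appear as 0.10%.
    return f"{round(ratio * 100)}%"


buggy = f"{0.10:.2f}%"        # the old style of output
fixed = format_percent(0.10)  # the corrected output
```

Here `buggy` is `"0.10%"` and `fixed` is `"10%"`.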
Janis Klaise
1690595e4d
Update load_lookups return type and docstring (#7907)
* Update load_lookups return type and docstring

* Add contributor agreement
2021-04-27 09:13:39 +02:00