14535 Commits

Author SHA1 Message Date
Adriane Boyd
507422149f
Various docs updates for v3.0 (#8353)
* Update cats score names in Scorer API docs

* Refer to performance in meta

* Update package naming/versions, lemmatizer details

* Minor formatting fixes

* Provide more explanation for cats_score_desc

* Provide language-specific lemmatizer defaults in API docs

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-06-14 12:19:36 +02:00
Sofie Van Landeghem
8729307e67
register extract_ngrams layer (#8358)
* register extract_ngrams layer

* fix import

* bump spacy-legacy to 3.0.6

* revert bump (wrong PR)
2021-06-14 10:30:30 +02:00
Ines Montani
3259faad42 Update YouTube embed [ci skip] 2021-06-14 10:21:01 +10:00
Ines Montani
7f0f674a1b Fix universe.json and auto-format [ci skip] 2021-06-14 10:18:06 +10:00
Adriane Boyd
f4008bdb13
Restrict pymorphy2 requirement to pymorphy2 mode (#8299)
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
2021-06-11 10:19:22 +02:00
Francisco Aranda
0a1a4c665d
update spacy-wordnet code example (#8327)
* update spacy-wordnet code example

- include spaCy 2.x and 3.x init alternatives
- upgrade recognai logo

* fix escape chars
2021-06-10 21:53:11 +02:00
Adriane Boyd
6d2789452e
Restrict cython to <3.0 (#8337) 2021-06-10 11:03:30 +02:00
Adriane Boyd
d52ab13b5f
Update CI: update ubuntu image, add download test (#8298)
* Update CI: update ubuntu image, add download test

* Switch instances to `ubuntu-18.04`
* Add model download test, currently only for one job with python 3.8

* Fix variable name

* Set variables explicitly
2021-06-07 14:46:07 +02:00
graue70
f34dd0b98f
Fix typos in comments (#8279) 2021-06-07 10:43:54 +02:00
Jean-Hugues Roy
ff5cf3606c
Improvements to French stopwords list (#7941)
* "y" etc.

Many changes described in pull request

* Update spacy/lang/fr/stop_words.py

* Update spacy/lang/fr/stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-02 11:50:49 +02:00
Vito De Tullio
3672464e25
applying suggestion to avoid mypy errors (#8265)
* applying suggestion to avoid mypy errors

* sign contributor agreement
2021-06-02 19:25:30 +10:00
Adriane Boyd
4aa1a7d5a3
Remove unsupported attrs from attrs.IDS (#8132)
The attributes `PROB`, `CLUSTER` and `SENT_END` are not supported by
`Lexeme.get_struct_attr` so should not be included through `attrs.IDS`
as supported attributes in `Doc.to_array` and other methods.
2021-06-02 19:16:57 +10:00
Paul O'Leary McCann
5aba213349 Fix skweak GitHub URL
GitHub entry should not contain url, just user/repo
2021-05-31 18:00:43 +09:00
Kristian Boda
dc8d8d15d2
Add hmrb to spaCy Universe (#8129)
* docs: add hmrb to spacy universe

* docs: add sentence on spacy versions

* docs: update description and images

* misc: add spaCy Contributor Agreement
2021-05-31 18:40:48 +10:00
Dhruv Naik
283f64a98d
Fix bug from Entityruler: ent_ids returns None for phrases (#8169)
* bugfix for explosion/spaCy#8168

* add test for explosion/spaCy#8168
2021-05-31 18:38:53 +10:00
Michael K
b0467d2972
Add project urls to package metadata (#7728)
This adds the links to PyPI. To see that in action check out
https://pypi.org/project/Django/ (source code:
b8c9e9fae1/setup.cfg (L27-L32))
2021-05-31 18:38:29 +10:00
Narayan Acharya
6b79714080
Address missing config overrides post load of models (#8208) 2021-05-31 18:36:52 +10:00
Sofie Van Landeghem
fff662e41f
Ensemble textcat with listener (#8012)
* add unit test for two listeners, with a textcat ensemble in the middle

* return zero gradients instead of None in accumulate_gradient
2021-05-31 18:21:06 +10:00
Sofie Van Landeghem
ff91e6dac7
Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-05-31 18:20:27 +10:00
Paul O'Leary McCann
d1a221a374
Add all symbols in Unicode Currency Symbols block (#8212)
* Add all symbols in Unicode Currency Symbols block

In #8102 it came up that the rupee symbol was treated differently from
dollar / euro / yen symbols. This adds many symbols not already
included.

* Fix test

* Fix training test
2021-05-31 18:03:40 +10:00
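For context on the commit above: the Unicode Currency Symbols block spans U+20A0 to U+20CF and contains the Indian rupee sign (U+20B9) and the euro sign (U+20AC), whereas the dollar sign lives in Basic Latin. The following is a minimal sketch of how the assigned symbols in that block could be enumerated with the standard library. It is not the code from the commit, and the helper name `currency_block_symbols` is hypothetical.

```python
import unicodedata

# Range of the Unicode "Currency Symbols" block.
BLOCK_START, BLOCK_END = 0x20A0, 0x20CF


def currency_block_symbols():
    """Return the assigned currency symbols in the Currency Symbols block.

    Hypothetical helper for illustration only. Unassigned code points have
    the category "Cn" and are skipped; which code points are assigned
    depends on the Unicode version bundled with the Python interpreter.
    """
    return [
        chr(cp)
        for cp in range(BLOCK_START, BLOCK_END + 1)
        if unicodedata.category(chr(cp)) == "Sc"
    ]


if __name__ == "__main__":
    symbols = currency_block_symbols()
    print(len(symbols), "".join(symbols))
```

Symbols outside the block, such as the dollar sign, would need to be listed separately.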
Sofie Van Landeghem
fc37715cfb
ensure 'spacy ray' works (#7799)
* ensure 'spacy ray' works

* better fix by changing entry point
2021-05-28 18:15:31 +02:00
Ines Montani
5957ab74f7
Merge pull request #8112 from svlandeg/bugfix/replace-trf 2021-05-28 11:35:17 +10:00
Sofie Van Landeghem
3c58c0323f
fix docs (#8200) 2021-05-27 10:48:59 +02:00
Sofie Van Landeghem
290bd6ed39
ensure tolerance is properly passed on (#8158) 2021-05-27 18:10:28 +10:00
Paul O'Leary McCann
0c553ecd4e Fix docs (fix #8189) 2021-05-24 19:47:30 +09:00
Sofie Van Landeghem
202943bc8c
KB & NEL to/from bytes (#8113)
* unit test for pickling KB

* add pickling test for NEL

* KB to_bytes and from_bytes

* NEL to_bytes and from_bytes

* xfail pickle tests for now

* fix docs

* cleanup
2021-05-20 18:11:30 +10:00
Adriane Boyd
f6128c06b0
Disable GPU CI tests (#8143) 2021-05-19 12:00:07 +02:00
Adriane Boyd
06324e5a5e
Update pydantic requirements (#8127)
Update pydantic requirements following
https://github.com/explosion/thinc/pull/499
2021-05-18 11:35:50 +02:00
Adriane Boyd
6baab565eb
Minor updates to quickstart settings/instructions (#7965)
* Minor updates to quickstart settings/instructions

* set default value of textcat exclusive to `false` until the default
checkbox behavior is updated
* add the `morphologizer` to the list of components
* add a note that v3.0.6+ is required

* Switch to warning above quickstart

* Undo changes to textcat default in quickstart

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-05-17 16:55:22 +02:00
Adriane Boyd
2c545c4c5b
Fix offsets in Span.get_lca_matrix (#8116)
* Fix range in Span.get_lca_matrix

Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.

* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`

* Update test for v3.x
2021-05-17 16:54:23 +02:00
Sofie Van Landeghem
0dffc5d9e2
Custom warning if the doc_bin is too large (#8069)
* custom warning if the doc_bin is too large

* cleanup

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* fix numbering

* fixing numbering once more

* fixing this seems to be pretty hard

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-05-17 15:48:40 +02:00
Adriane Boyd
b120fb3511
Handle errors while multiprocessing (#8004)
* Handle errors while multiprocessing

Handle errors while multiprocessing without hanging.

* Return the traceback for errors raised while processing a batch, which
  can be handled by the top-level error handler
* Allow for shortened batches due to custom error handlers that ignore
  errors and skip documents

* Define custom components at a higher level

* Also move up custom error handler

* Use simpler component for test

* Switch error type

* Adjust test

* Only call top-level error handler for exceptions

* Register custom test components within tests

Use global functions (so they can be pickled) but register the
components only within the individual tests.
2021-05-17 13:28:39 +02:00
Adriane Boyd
8a2602051c
Update debug data for textcat (#8066)
* Check for unsupported cats values
* Only show labels if train/dev mismatched
* Don't show label counts (only counting positive labels seems odd)
* Use warnings for mismatched train/dev labels
2021-05-17 13:27:04 +02:00
Adriane Boyd
fe3a4aa846
Add ENT_ID and NORM to DocBin strings (#8054)
Save strings for token attributes `ENT_ID` and `NORM` in `DocBin`
strings.
2021-05-17 18:06:11 +10:00
Adriane Boyd
82fa81d095
Make all Span attrs writable (#8062)
Also allow `Span` string properties `label_` and `kb_id_` to be writable
following #6696.
2021-05-17 18:05:45 +10:00
svlandeg
b403f924ee Merge remote-tracking branch 'upstream/master' into bugfix/replace-trf 2021-05-17 09:47:47 +02:00
Ines Montani
595ef03e23
Merge pull request #8096 from juliensalinas/master [ci skip] 2021-05-17 13:58:37 +10:00
Julien Salinas
c496f78245 Add NLP Cloud to Universe. 2021-05-14 11:13:44 +02:00
Julien Salinas
a176d2209a Sign contributors agreement. 2021-05-14 11:00:27 +02:00
Paul O'Leary McCann
2dc6db53fd
Merge pull request #8072 from medianeuroscience/master
Added eMFDscore to universe.json
2021-05-14 11:58:30 +09:00
Frederic R. Hopp
c5962b9fba
Update universe.json
fixed typo
2021-05-13 07:40:05 -07:00
Frederic R. Hopp
a9ca221e03
Update universe.json
Added more detailed description to eMFDscore project
2021-05-12 09:20:17 -07:00
svlandeg
235e9f5488 call replace_listener_cfg attr if it's available 2021-05-12 17:19:38 +02:00
svlandeg
44a3a58599 call replace_listener attr if it's available 2021-05-12 16:01:02 +02:00
svlandeg
ece8be4fec extend test to training with replaced tok2vec layer 2021-05-12 11:32:22 +02:00
Frederic R. Hopp
7bba9cdc14
Update universe.json 2021-05-11 19:18:19 -07:00
Adriane Boyd
d5bbd1f94f
Handle partial entities in Span.as_doc (#8055)
* Handle partial entities in Span.as_doc

In `Span.as_doc` replace partial entities at the beginning or end of the
span with missing entity annotation.

Fixes a bug where invalid entity annotation (no initial `B`) was
returned for an initial partial entity.

* Check for empty span in ents conversion

Note: `Span.as_doc()` will still fail on an empty span due to failures
in `Span.vector`.
2021-05-11 17:10:16 +02:00
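The behavior described in the commit above can be illustrated on plain IOB tags. This is a sketch under stated assumptions, not spaCy's implementation: tags are simple strings ("B", "I", "O"), the empty string stands in for missing annotation, and the function name `span_iob` is hypothetical.

```python
def span_iob(doc_iob, start, end):
    """Slice IOB tags for tokens [start, end) and blank out partial entities.

    Hypothetical helper: "" marks missing entity annotation.
    """
    tags = list(doc_iob[start:end])

    # Leading partial entity: the span starts inside an entity, so its
    # first tags are "I" without an initial "B".
    i = 0
    while i < len(tags) and tags[i] == "I":
        tags[i] = ""
        i += 1

    # Trailing partial entity: the entity continues past the span end.
    if tags and end < len(doc_iob) and doc_iob[end] == "I":
        j = len(tags) - 1
        while j >= 0 and tags[j] == "I":
            tags[j] = ""
            j -= 1
        if j >= 0 and tags[j] == "B":
            tags[j] = ""

    return tags


if __name__ == "__main__":
    doc = ["B", "I", "I", "O", "B", "I", "I"]
    print(span_iob(doc, 1, 6))
    print(span_iob(doc, 0, 4))
```

In the first call both entities are cut by the span boundaries, so only the "O" tag survives; in the second call the span covers a whole entity and the tags are returned unchanged.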
Ines Montani
3883d49446 Fix default transformer in quickstart generator (resolves #8018) [ci skip] 2021-05-11 11:27:08 +10:00
Paul O'Leary McCann
bdeaf3a18b
Fix/fix en ordinals (#8028)
* Fix #8019

"th" is not the only ordinal ending.

* Add some more ordinal tests
2021-05-07 10:26:42 +02:00
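To illustrate the point in the commit above that "th" is not the only ordinal ending, here is a minimal sketch of a check that accepts all four English ordinal suffixes. It is not the code from the commit; the function name `looks_like_ordinal` is hypothetical, and the check deliberately does not verify that the suffix agrees with the number (so "1th" would also pass).

```python
import re

# Digits followed by any of the four English ordinal endings.
ORDINAL_RE = re.compile(r"\d+(?:st|nd|rd|th)", re.IGNORECASE)


def looks_like_ordinal(text):
    """Return True if `text` is digits plus an English ordinal suffix.

    Hypothetical helper for illustration only.
    """
    return ORDINAL_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for token in ["1st", "22nd", "3rd", "10th", "10", "th"]:
        print(token, looks_like_ordinal(token))
```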
Adriane Boyd
71c2a3ab47
Fix new version for match_alignments (#8021) 2021-05-07 09:55:20 +02:00