14569 Commits

Author SHA1 Message Date
Adriane Boyd
f55b876326
Merge pull request #10387 from adrianeboyd/chore/v3.0.8
Set version to v3.0.8
2022-02-28 12:53:53 +01:00
Adriane Boyd
ebcc7d830f Update slow readers test to use textcat_multilabel (#9300) 2022-02-28 11:22:06 +01:00
Adriane Boyd
694c318f4f Address random results in slow readers tests (#9544)
* Set random seed for dataset shuffling
* Use more dev examples for non-zero scores
2022-02-28 11:19:43 +01:00
Ines Montani
308b1706a7 Allow conftest.py to run twice for build envs 2022-02-28 09:22:34 +01:00
Adriane Boyd
3420506954 Set version to v3.0.8 2022-02-28 09:02:03 +01:00
Adriane Boyd
f71de10405
Merge pull request #10346 from adrianeboyd/chore/v3.0-backport-10324
Fix Tok2Vec for empty batches (#10324)
2022-02-21 16:41:13 +01:00
Adriane Boyd
5caccbd19e Switch to latest CI images (#9773) 2022-02-21 15:02:52 +01:00
Daniël de Kok
6a4a00c447 Pin mypy to 0.910 until there is a compatible pydantic version 2022-02-21 15:01:36 +01:00
Adriane Boyd
749631ad28 Fix Tok2Vec for empty batches (#10324)
* Add test for tok2vec with vectors and empty docs

* Add shortcut for empty batch in Tok2Vec.predict

* Avoid types
2022-02-21 14:33:16 +01:00
Adriane Boyd
034ac0acf4
Merge pull request #8787 from adrianeboyd/chore/backport-v3.0.7
Backport bug fixes to v3.0.x
2021-07-21 16:53:50 +02:00
Adriane Boyd
02e18926c3
Revert "Backport bugfixes from v3.1.0 to v3.0 (#8739)" (#8786)
This reverts commit f94168a41e.
2021-07-21 15:32:37 +02:00
Adriane Boyd
f94168a41e
Backport bugfixes from v3.1.0 to v3.0 (#8739)
* Fix scoring normalization (#7629)

* fix scoring normalization

* score weights by total sum instead of per component

* cleanup

* more cleanup

* Use a context manager when reading model (fix #7036) (#8244)

* Fix other open calls without context managers (#8245)

* Don't add duplicate patterns all the time in EntityRuler (fix #8216) (#8246)

* Don't add duplicate patterns (fix #8216)

* Refactor EntityRuler init

This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.

* Make EntityRuler.clear reset matchers

Includes a new test for this.

* Tidy PhraseMatcher instantiation

Since the attr can be None safely now, the guard if is no longer
required here.

Also renamed the `_validate` attr. Maybe it's not needed?

* Fix NER test

* Add test to make sure patterns aren't increasing

* Move test to regression tests

* Exclude generated .cpp files from package (#8271)

* Fix non-deterministic deduplication in Greek lemmatizer (#8421)

* Fix setting empty entities in Example.from_dict (#8426)

* Filter W036 for entity ruler, etc. (#8424)

* Preserve paths.vectors/initialize.vectors setting in quickstart template

* Various fixes for spans in Docs.from_docs (#8487)

* Fix spans offsets if a doc ends in a single space and no space is
  inserted
* Also include spans key in merged doc for empty spans lists

* Fix duplicate spacy package CLI opts (#8551)

Use `-c` for `--code` and not additionally for `--create-meta`, in line
with the docs.

* Raise an error for textcat with <2 labels (#8584)

* Raise an error for textcat with <2 labels

Raise an error if initializing a `textcat` component without at least
two labels.

* Add similar note to docs

* Update positive_label description in API docs

* Add Macedonian models to website (#8637)

* Fix Azerbaijani init, extend lang init tests (#8656)

* Extend langs in initialize tests

* Fix az init

* Fix ru/uk lemmatizer mp with spawn (#8657)

Use an instance variable instead of a class variable for the morphological
analyzer so that multiprocessing with spawn is possible.

* Use 0-vector for OOV lexemes (#8639)

* Set version to v3.0.7

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-07-19 09:20:40 +02:00
Adriane Boyd
0080454140 Set version to v3.0.7 2021-07-16 16:38:15 +02:00
Adriane Boyd
6db938959d Use 0-vector for OOV lexemes (#8639) 2021-07-16 15:48:47 +02:00
Adriane Boyd
99a3f26d7f Fix ru/uk lemmatizer mp with spawn (#8657)
Use an instance variable instead of a class variable for the morphological
analyzer so that multiprocessing with spawn is possible.
2021-07-16 15:48:47 +02:00
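Note on 99a3f26d7f: the class-variable versus instance-variable distinction can be illustrated with a minimal sketch. This is not the spaCy lemmatizer code; the class names and the stand-in analyzer dict are hypothetical. With the spawn start method, worker processes receive objects by pickling, and default pickling of an instance carries its instance attributes but not class attributes assigned at runtime.

```python
class SharedAnalyzerLemmatizer:
    # Hypothetical "before": the analyzer is stored on the class at runtime.
    _analyzer = None

    def __init__(self):
        if SharedAnalyzerLemmatizer._analyzer is None:
            SharedAnalyzerLemmatizer._analyzer = {"name": "stand-in analyzer"}


class OwnAnalyzerLemmatizer:
    # Hypothetical "after": each instance owns its analyzer.
    def __init__(self):
        self._analyzer = {"name": "stand-in analyzer"}


def instance_state(obj):
    """Return the per-instance attributes, i.e. what default pickling saves."""
    return dict(vars(obj))


shared = SharedAnalyzerLemmatizer()
own = OwnAnalyzerLemmatizer()
print(instance_state(shared))  # the analyzer is absent from the instance state
print(instance_state(own))     # the analyzer is part of the instance state
```

Only the second variant keeps the analyzer in the state that travels with the object, which is presumably why the commit moves it onto the instance.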
Adriane Boyd
c62566ffce Fix Azerbaijani init, extend lang init tests (#8656)
* Extend langs in initialize tests

* Fix az init
2021-07-16 15:48:47 +02:00
Adriane Boyd
066718b1dc Add Macedonian models to website (#8637) 2021-07-16 15:48:47 +02:00
Adriane Boyd
81e71a61f8 Raise an error for textcat with <2 labels (#8584)
* Raise an error for textcat with <2 labels

Raise an error if initializing a `textcat` component without at least
two labels.

* Add similar note to docs

* Update positive_label description in API docs
2021-07-16 15:48:42 +02:00
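Note on 81e71a61f8: the check described above could look roughly like the sketch below. The function name `check_textcat_labels` and the error text are hypothetical and do not reproduce spaCy's actual implementation or error code.

```python
from typing import Sequence


def check_textcat_labels(labels: Sequence[str]) -> None:
    """Hypothetical guard: reject a label set with fewer than two labels."""
    if len(labels) < 2:
        raise ValueError(
            f"textcat needs at least two labels, got {len(labels)}: {list(labels)}"
        )


check_textcat_labels(["POSITIVE", "NEGATIVE"])  # passes silently

try:
    check_textcat_labels(["POSITIVE"])
except ValueError as err:
    print(err)
```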
Adriane Boyd
6aa3fede76 Fix duplicate spacy package CLI opts (#8551)
Use `-c` for `--code` and not additionally for `--create-meta`, in line
with the docs.
2021-07-16 15:48:19 +02:00
Adriane Boyd
71396273a5 Various fixes for spans in Docs.from_docs (#8487)
* Fix spans offsets if a doc ends in a single space and no space is
  inserted
* Also include spans key in merged doc for empty spans lists
2021-07-16 15:48:19 +02:00
Adriane Boyd
e51fff5432 Preserve paths.vectors/initialize.vectors setting in quickstart template 2021-07-16 15:48:19 +02:00
Adriane Boyd
c78eb28dfa Filter W036 for entity ruler, etc. (#8424) 2021-07-16 15:48:19 +02:00
Adriane Boyd
e3f1d4a7d0 Fix setting empty entities in Example.from_dict (#8426) 2021-07-16 15:48:19 +02:00
Adriane Boyd
81515b4690 Fix non-deterministic deduplication in Greek lemmatizer (#8421) 2021-07-16 15:48:19 +02:00
Adriane Boyd
8b9355d758 Exclude generated .cpp files from package (#8271) 2021-07-16 15:47:55 +02:00
Paul O'Leary McCann
ad026dc5fd Don't add duplicate patterns all the time in EntityRuler (fix #8216) (#8246)
* Don't add duplicate patterns (fix #8216)

* Refactor EntityRuler init

This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.

* Make EntityRuler.clear reset matchers

Includes a new test for this.

* Tidy PhraseMatcher instantiation

Since the attr can be None safely now, the guard if is no longer
required here.

Also renamed the `_validate` attr. Maybe it's not needed?

* Fix NER test

* Add test to make sure patterns aren't increasing

* Move test to regression tests
2021-07-16 15:47:55 +02:00
Paul O'Leary McCann
1db18732e0 Fix other open calls without context managers (#8245) 2021-07-16 15:47:55 +02:00
Paul O'Leary McCann
a834b03216 Use a context manager when reading model (fix #7036) (#8244) 2021-07-16 15:47:55 +02:00
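Note on a834b03216: a minimal sketch of the pattern named in the commit title, assuming a JSON file as the thing being read. The helper `read_meta` and the file name `meta.json` are hypothetical; the point is only that `with` closes the file handle even when reading or parsing raises.

```python
import json
import tempfile
from pathlib import Path


def read_meta(path: Path) -> dict:
    """Read a JSON file; the handle is closed even if parsing raises."""
    with path.open("r", encoding="utf8") as file_:
        return json.load(file_)


# Write a small stand-in file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = Path(tmp_dir) / "meta.json"
    meta_path.write_text(json.dumps({"lang": "en"}), encoding="utf8")
    meta = read_meta(meta_path)

print(meta)
```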
Sofie Van Landeghem
55e5f8ede3 Fix scoring normalization (#7629)
* fix scoring normalization

* score weights by total sum instead of per component

* cleanup

* more cleanup
2021-07-16 15:47:55 +02:00
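Note on 55e5f8ede3: one possible reading of "score weights by total sum instead of per component" is that all score weights are divided by a single overall total, so that they sum to 1 across the whole pipeline. The sketch below shows that reading only; the function name and the example weights are hypothetical and this is not the code from the commit.

```python
from typing import Dict


def normalize_by_total(weights: Dict[str, float]) -> Dict[str, float]:
    """Divide every weight by the overall sum so the weights sum to 1."""
    total = sum(weights.values())
    if total == 0:
        # Nothing to normalize; return an unchanged copy.
        return dict(weights)
    return {name: value / total for name, value in weights.items()}


weights = {"tag_acc": 1.0, "ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0}
normalized = normalize_by_total(weights)
print(normalized)
```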
Adriane Boyd
bb97e7bf8a
Update validate CLI to fix compat and ignore warnings (#8423) 2021-07-14 23:28:08 +02:00
Adriane Boyd
480a3bf3be
Make JsonlReader path optional (#8396)
To avoid config errors during training when `[corpora.pretrain.path]` is
`None` with the default `spacy.JsonlCorpus.v1` reader, make the reader
path optional, similar to `spacy.Corpus.v1`.
2021-06-15 14:55:15 +02:00
Paul O'Leary McCann
94e1346f44
Change span lemmas to use original whitespace (fix #8368) (#8391)
* Change span lemmas to use original whitespace (fix #8368)

This is a redo of #8371 based off master.

The test for this required some changes to existing tests. I don't think
the changes were significant but I'd like someone to check them.

* Remove mystery docstring

This sentence was uncompleted for years, and now we will never know how
it ends.
2021-06-15 13:24:54 +02:00
Paul O'Leary McCann
2c105cdbce
Raise error if deps not provided with heads (#8335)
* Fill in deps if not provided with heads

Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.

* Use "dep" instead of a blank string

This is the customary placeholder dep. It might be better to show an
error here instead though.

* Throw error on heads without deps

* Add a test

* Fix tests

* Formatting

* Fix all tests

* Fix a test I missed

* Revise error message

* Clean up whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-15 13:23:32 +02:00
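Note on 2c105cdbce: the final behavior described above (raise an error rather than silently ignore heads) could be sketched as follows. The function name `check_heads_and_deps` and the error text are hypothetical and do not reproduce the actual spaCy check or message.

```python
from typing import List, Optional


def check_heads_and_deps(
    heads: Optional[List[int]], deps: Optional[List[str]]
) -> None:
    """Hypothetical guard: heads without deps are an error, not ignored."""
    if heads is not None and deps is None:
        raise ValueError("heads were provided without deps; provide both")


check_heads_and_deps([1, 1], ["nsubj", "ROOT"])  # both given: fine
check_heads_and_deps(None, None)                 # neither given: fine

try:
    check_heads_and_deps([1, 1], None)
except ValueError as err:
    print(err)
```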
Sofie Van Landeghem
0fd0d949c4
fix 's typo's across code base (#8384) 2021-06-15 10:57:08 +02:00
Adriane Boyd
507422149f
Various docs updates for v3.0 (#8353)
* Update cats score names in Scorer API docs

* Refer to performance in meta

* Update package naming/versions, lemmatizer details

* Minor formatting fixes

* Provide more explanation for cats_score_desc

* Provide language-specific lemmatizer defaults in API docs

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-06-14 12:19:36 +02:00
Sofie Van Landeghem
8729307e67
register extract_ngrams layer (#8358)
* register extract_ngrams layer

* fix import

* bump spacy-legacy to 3.0.6

* revert bump (wrong PR)
2021-06-14 10:30:30 +02:00
Ines Montani
3259faad42 Update YouTube embed [ci skip] 2021-06-14 10:21:01 +10:00
Ines Montani
7f0f674a1b Fix universe.json and auto-format [ci skip] 2021-06-14 10:18:06 +10:00
Adriane Boyd
f4008bdb13
Restrict pymorphy2 requirement to pymorphy2 mode (#8299)
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
2021-06-11 10:19:22 +02:00
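Note on f4008bdb13: restricting an optional dependency to the one mode that needs it can be sketched as a conditional import. The function `load_analyzer_module` and its `module_name` parameter are hypothetical, and this is not the spaCy lemmatizer code; it only shows that other modes never attempt the import.

```python
import importlib


def load_analyzer_module(mode: str, module_name: str = "pymorphy2"):
    """Import the optional dependency only when the mode requires it."""
    if mode != "pymorphy2":
        # Lookup and other modes work without the optional package.
        return None
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"mode '{mode}' requires the '{module_name}' package"
        ) from err


print(load_analyzer_module("lookup"))  # no import attempted for this mode
```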
Francisco Aranda
0a1a4c665d
update spacy-wordnet code example (#8327)
* update spacy-wordnet code example

- include spaCy 2.x and 3.x init alternatives
- upgrade recognai logo

* fix escape chars
2021-06-10 21:53:11 +02:00
Adriane Boyd
6d2789452e
Restrict cython to <3.0 (#8337) 2021-06-10 11:03:30 +02:00
Adriane Boyd
d52ab13b5f
Update CI: update ubuntu image, add download test (#8298)
* Update CI: update ubuntu image, add download test

* Switch instances to `ubuntu-18.04`
* Add model download test, currently only for one job with python 3.8

* Fix variable name

* Set variables explicitly
2021-06-07 14:46:07 +02:00
graue70
f34dd0b98f
Fix typos in comments (#8279) 2021-06-07 10:43:54 +02:00
Jean-Hugues Roy
ff5cf3606c
Improvements to French stopwords list (#7941)
* "y" etc.

Many changes described in pull request

* Update spacy/lang/fr/stop_words.py

* Update spacy/lang/fr/stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-02 11:50:49 +02:00
Vito De Tullio
3672464e25
applying suggestion to avoid mypy errors (#8265)
* applying suggestion to avoid mypy errors

* sign contributor agreement
2021-06-02 19:25:30 +10:00
Adriane Boyd
4aa1a7d5a3
Remove unsupported attrs from attrs.IDS (#8132)
The attributes `PROB`, `CLUSTER` and `SENT_END` are not supported by
`Lexeme.get_struct_attr` so should not be included through `attrs.IDS`
as supported attributes in `Doc.to_array` and other methods.
2021-06-02 19:16:57 +10:00
Paul O'Leary McCann
5aba213349 Fix skweak Github URL
Github entry should not contain url, just user/repo
2021-05-31 18:00:43 +09:00
Kristian Boda
dc8d8d15d2
Add hmrb to spaCy Universe (#8129)
* docs: add hmrb to spacy universe

* docs: add sentence on spacy versions

* docs: update description and images

* misc: add spaCy Contributor Agreement
2021-05-31 18:40:48 +10:00
Dhruv Naik
283f64a98d
Fix bug from Entityruler: ent_ids returns None for phrases (#8169)
* bugfix for explosion/spaCy#8168

* add test for explosion/spaCy#8168
2021-05-31 18:38:53 +10:00
Michael K
b0467d2972
Add project urls to package metadata (#7728)
This adds the links to PyPI. To see that in action check out
https://pypi.org/project/Django/ (source code:
b8c9e9fae1/setup.cfg (L27-L32))
2021-05-31 18:38:29 +10:00