8691 Commits

Author SHA1 Message Date
Adriane Boyd
4af02ac9e4 Set version to v3.0.9 2022-12-13 13:56:56 +01:00
Adriane Boyd
67c6ef2b2a Increase tolerance for almost equal checks in textcat regression test 2022-12-13 13:56:56 +01:00
Adriane Boyd
c4af89f956 Clean up warnings in the test suite (#11331) 2022-12-12 17:27:00 +01:00
Adriane Boyd
d4acae856a Update flake8 version in reqs and CI
* Update some unneeded forward refs related to flake8 checks
2022-12-12 14:30:06 +01:00
Adriane Boyd
0f87720411 Rename test helper method with non-test_ name (#11701) 2022-12-12 14:02:50 +01:00
Adriane Boyd
c8009c2734 Cast to uint64 for all array-based doc representations (#11933)
* Convert all individual values explicitly to uint64 for array-based doc representations

* Temporarily test with latest numpy v1.24.0rc

* Remove unnecessary conversion from attr_t

* Reduce number of individual casts

* Convert specifically from int32 to uint64

* Revert "Temporarily test with latest numpy v1.24.0rc"

This reverts commit eb0e3c5006.

* Also use int32 in tests
2022-12-12 14:02:50 +01:00
Paul O'Leary McCann
d4d4d69cb4 Config generation fails for GPU without transformers (#11899)
If you don't have spacy-transformers installed, but try to use `init
config` with the GPU flag, you'll get an error. The issue is that the
`use_transformers` flag in the config is conflated with the GPU flag,
and then there's an attempt to access transformers config info that may
not exist.

There may be a better way to do this, but this stops the error.
2022-12-12 14:02:50 +01:00
Paul O'Leary McCann
337ebda793 Add in errors used in the beam code that were removed at some point (#11935)
I don't think there's any way to use the beam code at the moment, but as
long as it's around the errors it refers to should also be present.
2022-12-12 14:02:50 +01:00
Adriane Boyd
5c975565dc Add smart_open requirement, update deprecated options (#11864)
* Switch from deprecated `ignore_ext` to `compression`
* Add upload/download test for local files
2022-12-12 14:02:50 +01:00
Adriane Boyd
ebcc7d830f Update slow readers test to use textcat_multilabel (#9300) 2022-02-28 11:22:06 +01:00
Adriane Boyd
694c318f4f Address random results in slow readers tests (#9544)
* Set random seed for dataset shuffling
* Use more dev examples for non-zero scores
2022-02-28 11:19:43 +01:00
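The "set random seed for dataset shuffling" step in 694c318f4f can be illustrated with a minimal standard-library sketch. The helper name `shuffled` and the seed value are assumptions for illustration and are not taken from the commit.

```python
import random


def shuffled(examples, seed=0):
    """Return a shuffled copy; the same seed always yields the same order."""
    # Hypothetical helper: a local generator keeps the global random state untouched.
    rng = random.Random(seed)
    result = list(examples)
    rng.shuffle(result)
    return result


first = shuffled(range(10), seed=0)
second = shuffled(range(10), seed=0)
same_order = first == second  # identical seeds give identical orderings
```

Fixing the seed makes a test's train/dev split repeatable, which is the usual way to remove run-to-run variation in scores.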
Ines Montani
308b1706a7 Allow conftest.py to run twice for build envs 2022-02-28 09:22:34 +01:00
Adriane Boyd
3420506954 Set version to v3.0.8 2022-02-28 09:02:03 +01:00
Adriane Boyd
749631ad28 Fix Tok2Vec for empty batches (#10324)
* Add test for tok2vec with vectors and empty docs

* Add shortcut for empty batch in Tok2Vec.predict

* Avoid types
2022-02-21 14:33:16 +01:00
Adriane Boyd
0080454140 Set version to v3.0.7 2021-07-16 16:38:15 +02:00
Adriane Boyd
6db938959d Use 0-vector for OOV lexemes (#8639) 2021-07-16 15:48:47 +02:00
Adriane Boyd
99a3f26d7f Fix ru/uk lemmatizer mp with spawn (#8657)
Use an instance variable instead of a class variable for the morphological
analyzer so that multiprocessing with spawn is possible.
2021-07-16 15:48:47 +02:00
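A rough sketch of why the instance-variable change in 99a3f26d7f matters for spawn: a spawned worker starts a fresh interpreter, and by default pickling carries an object's instance `__dict__`, not values assigned to its class at runtime. The class names below are hypothetical, and whether this is the exact failure mode in the lemmatizer is an assumption.

```python
class SharedAnalyzerLemmatizer:
    # Hypothetical: class variable, filled in lazily by the first instance.
    analyzer = None

    def __init__(self):
        if SharedAnalyzerLemmatizer.analyzer is None:
            SharedAnalyzerLemmatizer.analyzer = object()  # stand-in analyzer


class OwnAnalyzerLemmatizer:
    def __init__(self):
        # Hypothetical: each instance owns its analyzer.
        self.analyzer = object()  # stand-in analyzer


# vars() shows the per-instance state, which is what default pickling saves.
shared_state = vars(SharedAnalyzerLemmatizer())
own_state = vars(OwnAnalyzerLemmatizer())
```

Here `shared_state` is empty, so nothing about the analyzer would travel with the object, while `own_state` contains the `analyzer` entry.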
Adriane Boyd
c62566ffce Fix Azerbaijani init, extend lang init tests (#8656)
* Extend langs in initialize tests

* Fix az init
2021-07-16 15:48:47 +02:00
Adriane Boyd
81e71a61f8 Raise an error for textcat with <2 labels (#8584)
* Raise an error for textcat with <2 labels

Raise an error if initializing a `textcat` component without at least
two labels.

* Add similar note to docs

* Update positive_label description in API docs
2021-07-16 15:48:42 +02:00
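The check described in 81e71a61f8 might look roughly like the following. The function name and error text are hypothetical; the commit only states that initializing `textcat` with fewer than two labels raises an error.

```python
def check_textcat_labels(labels):
    """Raise if fewer than two labels are given (hypothetical helper)."""
    if len(labels) < 2:
        raise ValueError(
            f"textcat needs at least two labels, got {len(labels)}: {list(labels)}"
        )
    return list(labels)


ok_labels = check_textcat_labels(["POSITIVE", "NEGATIVE"])

try:
    check_textcat_labels(["POSITIVE"])
    single_label_rejected = False
except ValueError:
    single_label_rejected = True
```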
Adriane Boyd
6aa3fede76 Fix duplicate spacy package CLI opts (#8551)
Use `-c` for `--code` and not additionally for `--create-meta`, in line
with the docs.
2021-07-16 15:48:19 +02:00
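The kind of clash fixed in 6aa3fede76 can be reproduced with `argparse` as a stand-in. This is not spaCy's actual CLI code, and leaving `--create-meta` without a short flag is an assumption made only to keep the sketch small.

```python
import argparse


def build_parser(duplicate_short_flag: bool) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="package")
    parser.add_argument("--code", "-c", default=None)
    if duplicate_short_flag:
        # Reusing "-c" for a second option makes argparse raise ArgumentError.
        parser.add_argument("--create-meta", "-c", action="store_true")
    else:
        parser.add_argument("--create-meta", action="store_true")
    return parser


try:
    build_parser(duplicate_short_flag=True)
    conflict = False
except argparse.ArgumentError:
    conflict = True

args = build_parser(duplicate_short_flag=False).parse_args(["-c", "functions.py"])
```

With a single owner for `-c`, the short flag resolves unambiguously to `--code`.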
Adriane Boyd
71396273a5 Various fixes for spans in Docs.from_docs (#8487)
* Fix spans offsets if a doc ends in a single space and no space is
  inserted
* Also include spans key in merged doc for empty spans lists
2021-07-16 15:48:19 +02:00
Adriane Boyd
e51fff5432 Preserve paths.vectors/initialize.vectors setting in quickstart template 2021-07-16 15:48:19 +02:00
Adriane Boyd
c78eb28dfa Filter W036 for entity ruler, etc. (#8424) 2021-07-16 15:48:19 +02:00
Adriane Boyd
e3f1d4a7d0 Fix setting empty entities in Example.from_dict (#8426) 2021-07-16 15:48:19 +02:00
Adriane Boyd
81515b4690 Fix non-deterministic deduplication in Greek lemmatizer (#8421) 2021-07-16 15:48:19 +02:00
Paul O'Leary McCann
ad026dc5fd Don't add duplicate patterns all the time in EntityRuler (fix #8216) (#8246)
* Don't add duplicate patterns (fix #8216)

* Refactor EntityRuler init

This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.

* Make EntityRuler.clear reset matchers

Includes a new test for this.

* Tidy PhraseMatcher instantiation

Since the attr can be None safely now, the guard if is no longer
required here.

Also renamed the `_validate` attr. Maybe it's not needed?

* Fix NER test

* Add test to make sure patterns aren't increasing

* Move test to regression tests
2021-07-16 15:47:55 +02:00
Paul O'Leary McCann
1db18732e0 Fix other open calls without context managers (#8245) 2021-07-16 15:47:55 +02:00
Paul O'Leary McCann
a834b03216 Use a context manager when reading model (fix #7036) (#8244) 2021-07-16 15:47:55 +02:00
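The pattern behind 1db18732e0 and a834b03216, sketched with the standard library only. The file name and contents are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "meta.json"  # hypothetical file name
    path.write_text('{"name": "demo"}', encoding="utf8")

    # The context manager closes the handle even if reading raises.
    with path.open("r", encoding="utf8") as file_:
        content = file_.read()

    closed_after_block = file_.closed
```

A bare `open(path).read()` relies on garbage collection to close the handle, which can leave files locked or trigger `ResourceWarning`.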
Sofie Van Landeghem
55e5f8ede3 Fix scoring normalization (#7629)
* fix scoring normalization

* score weights by total sum instead of per component

* cleanup

* more cleanup
2021-07-16 15:47:55 +02:00
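One reading of "score weights by total sum instead of per component" in 55e5f8ede3 is that the weights of all components are pooled and divided by a single shared total. The function below is a hypothetical sketch of that reading, not the code from the commit.

```python
from typing import Dict


def normalize_score_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale pooled weights so they sum to 1.0 (hypothetical helper)."""
    total = sum(weights.values())
    if total == 0:
        return dict(weights)  # nothing to scale
    return {name: value / total for name, value in weights.items()}


pooled = {"tag_acc": 1.0, "ents_f": 1.0, "cats_score": 2.0}
normalized = normalize_score_weights(pooled)
```

Under this scheme a component that reports several scores does not gain extra influence just because its own weights were normalized separately.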
Adriane Boyd
bb97e7bf8a Update validate CLI to fix compat and ignore warnings (#8423) 2021-07-14 23:28:08 +02:00
Adriane Boyd
480a3bf3be Make JsonlReader path optional (#8396)
To avoid config errors during training when `[corpora.pretrain.path]` is
`None` with the default `spacy.JsonlCorpus.v1` reader, make the reader
path optional, similar to `spacy.Corpus.v1`.
2021-06-15 14:55:15 +02:00
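A minimal sketch of a reader with an optional path, in the spirit of 480a3bf3be. The function `read_jsonl` is hypothetical and is not the registered `spacy.JsonlCorpus.v1` reader.

```python
import json
from pathlib import Path
from typing import Iterator, Optional, Union


def read_jsonl(path: Optional[Union[str, Path]]) -> Iterator[dict]:
    """Yield one dict per JSONL line; yield nothing if no path is set."""
    if path is None:
        return  # an unset path is treated as an empty corpus, not an error
    with Path(path).open("r", encoding="utf8") as file_:
        for line in file_:
            line = line.strip()
            if line:
                yield json.loads(line)


no_records = list(read_jsonl(None))
```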
Paul O'Leary McCann
94e1346f44 Change span lemmas to use original whitespace (fix #8368) (#8391)
* Change span lemmas to use original whitespace (fix #8368)

This is a redo of #8371 based off master.

The test for this required some changes to existing tests. I don't think
the changes were significant but I'd like someone to check them.

* Remove mystery docstring

This sentence was uncompleted for years, and now we will never know how
it ends.
2021-06-15 13:24:54 +02:00
Paul O'Leary McCann
2c105cdbce Raise error if deps not provided with heads (#8335)
* Fill in deps if not provided with heads

Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.

* Use "dep" instead of a blank string

This is the customary placeholder dep. It might be better to show an
error here instead though.

* Throw error on heads without deps

* Add a test

* Fix tests

* Formatting

* Fix all tests

* Fix a test I missed

* Revise error message

* Clean up whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-15 13:23:32 +02:00
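The final behavior of 2c105cdbce, an error when heads are passed without deps, could be sketched as below. The function name and message are hypothetical.

```python
from typing import List, Optional


def check_heads_and_deps(heads: Optional[List[int]], deps: Optional[List[str]]) -> None:
    """Reject heads that come without dependency labels (hypothetical helper)."""
    if heads is not None and deps is None:
        raise ValueError("heads were provided without deps")


check_heads_and_deps(heads=[1, 1], deps=["nsubj", "ROOT"])  # passes silently

try:
    check_heads_and_deps(heads=[1, 1], deps=None)
    heads_only_rejected = False
except ValueError:
    heads_only_rejected = True
```

Raising early replaces the earlier behavior described in the commit, where such heads were silently ignored.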
Sofie Van Landeghem
0fd0d949c4 fix 's typo's across code base (#8384) 2021-06-15 10:57:08 +02:00
Sofie Van Landeghem
8729307e67 register extract_ngrams layer (#8358)
* register extract_ngrams layer

* fix import

* bump spacy-legacy to 3.0.6

* revert bump (wrong PR)
2021-06-14 10:30:30 +02:00
Adriane Boyd
f4008bdb13 Restrict pymorphy2 requirement to pymorphy2 mode (#8299)
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
2021-06-11 10:19:22 +02:00
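The idea in f4008bdb13, requiring an optional dependency only in the mode that needs it, can be sketched as a lazy import. The function `load_analyzer` is hypothetical and the lemmatizer's real structure is not shown in this log.

```python
import importlib


def load_analyzer(mode: str):
    """Import pymorphy2 only when the mode asks for it (hypothetical helper)."""
    if mode != "pymorphy2":
        return None  # lookup and other modes need no analyzer
    try:
        pymorphy2 = importlib.import_module("pymorphy2")
    except ImportError as err:
        raise ImportError(
            "The 'pymorphy2' mode requires the pymorphy2 package"
        ) from err
    return pymorphy2.MorphAnalyzer()


lookup_analyzer = load_analyzer("lookup")  # works without pymorphy2 installed
```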
graue70
f34dd0b98f Fix typos in comments (#8279) 2021-06-07 10:43:54 +02:00
Jean-Hugues Roy
ff5cf3606c Improvements to French stopwords list (#7941)
* "y" etc.

Many changes described in pull request

* Update spacy/lang/fr/stop_words.py

* Update spacy/lang/fr/stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-02 11:50:49 +02:00
Vito De Tullio
3672464e25 applying suggestion to avoid mypy errors (#8265)
* applying suggestion to avoid mypy errors

* sign contributor agreement
2021-06-02 19:25:30 +10:00
Adriane Boyd
4aa1a7d5a3 Remove unsupported attrs from attrs.IDS (#8132)
The attributes `PROB`, `CLUSTER` and `SENT_END` are not supported by
`Lexeme.get_struct_attr` so should not be included through `attrs.IDS`
as supported attributes in `Doc.to_array` and other methods.
2021-06-02 19:16:57 +10:00
Dhruv Naik
283f64a98d Fix bug from Entityruler: ent_ids returns None for phrases (#8169)
* bugfix for explosion/spaCy#8168

* add test for explosion/spaCy#8168
2021-05-31 18:38:53 +10:00
Narayan Acharya
6b79714080 Address missing config overrides post load of models (#8208) 2021-05-31 18:36:52 +10:00
Sofie Van Landeghem
fff662e41f Ensemble textcat with listener (#8012)
* add unit test for two listeners, with a textcat ensemble in the middle

* return zero gradients instead of None in accumulate_gradient
2021-05-31 18:21:06 +10:00
Sofie Van Landeghem
ff91e6dac7 Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-05-31 18:20:27 +10:00
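The "warn only once" behavior mentioned in ff91e6dac7 can be approximated with the standard `warnings` filters. The message text and function name here are invented for the demo, and the commit's own `filter_warning` helper is not reproduced.

```python
import warnings


def warn_no_patterns():
    # Hypothetical stand-in for a component running without patterns.
    warnings.warn("[W036-demo] component has no patterns", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("once")  # show each distinct warning a single time
    for _ in range(3):
        warn_no_patterns()

count = len(caught)  # only the first of the three calls is recorded
```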
Paul O'Leary McCann
d1a221a374 Add all symbols in Unicode Currency Symbols block (#8212)
* Add all symbols in Unicode Currency Symbols block

In #8102 it came up that the rupee symbol was treated differently from
dollar / euro / yen symbols. This adds many symbols not already
included.

* Fix test

* Fix training test
2021-05-31 18:03:40 +10:00
Ines Montani
5957ab74f7 Merge pull request #8112 from svlandeg/bugfix/replace-trf 2021-05-28 11:35:17 +10:00
Sofie Van Landeghem
3c58c0323f fix docs (#8200) 2021-05-27 10:48:59 +02:00
Sofie Van Landeghem
290bd6ed39 ensure tolerance is properly passed on (#8158) 2021-05-27 18:10:28 +10:00
Sofie Van Landeghem
202943bc8c KB & NEL to/from bytes (#8113)
* unit test for pickling KB

* add pickling test for NEL

* KB to_bytes and from_bytes

* NEL to_bytes and from_bytes

* xfail pickle tests for now

* fix docs

* cleanup
2021-05-20 18:11:30 +10:00
Adriane Boyd
2c545c4c5b Fix offsets in Span.get_lca_matrix (#8116)
* Fix range in Span.get_lca_matrix

Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.

* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`

* Update test for v3.x
2021-05-17 16:54:23 +02:00
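The index adjustment described in 2c545c4c5b can be sketched in pure Python: token indices `j` and `k` run over the span in document coordinates, while the matrix is indexed from `0` at the span `start`. The functions below are hypothetical, assume a root token is its own head, and assume `-1` marks a lowest common ancestor outside the span. This is not the Cython implementation.

```python
from typing import List


def ancestors(heads: List[int], i: int) -> List[int]:
    """Return token i followed by its chain of heads up to the root."""
    chain = [i]
    while heads[chain[-1]] != chain[-1]:  # assumption: the root is its own head
        chain.append(heads[chain[-1]])
    return chain


def span_lca_matrix(heads: List[int], start: int, end: int) -> List[List[int]]:
    """LCA matrix for tokens [start, end), indexed relative to start."""
    n = end - start
    matrix = [[-1] * n for _ in range(n)]
    for j in range(start, end):
        chain_j = ancestors(heads, j)
        for k in range(start, end):
            chain_k = set(ancestors(heads, k))
            lca = next((a for a in chain_j if a in chain_k), -1)
            if start <= lca < end:
                # Shift both the cell position and the stored index by start.
                matrix[j - start][k - start] = lca - start
    return matrix


heads = [1, 1, 1, 2, 2]  # token 1 is the root; tokens 3 and 4 attach to token 2
inside = span_lca_matrix(heads, 2, 5)
outside = span_lca_matrix(heads, 3, 5)
```

In `outside`, tokens 3 and 4 share token 2 as their ancestor, which lies before the span, so the off-diagonal cells stay at `-1`.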