Commit Graph

14766 Commits

Author SHA1 Message Date
Giovanni Toffoli
19521d525b
Added Italian POS-aware lemmatizer. (#8079)
* Added Italian POS-aware lemmatizer.

Also added the code used to build the lookup tables by POS.

* Create gtoffoli.md

* Add imports and format

* Remove helper script

* Use lemma_lookup instead of lemma_lookup_legacy

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-16 11:14:45 +02:00
svlandeg
29d83dec0c adjust whitespace tokenizer to avoid sep in split() 2021-06-16 10:58:45 +02:00
Antti Ajanki
5a6125c227
[Finnish tokenizer] Handle conjunction contractions (#8105) 2021-06-16 10:56:47 +02:00
Adriane Boyd
b09be3e1cb
Merge pull request #8397 from adrianeboyd/chore/develop-into-master-v3.1
Merge develop into master for v3.1
2021-06-16 10:54:47 +02:00
Adriane Boyd
33240ed2c5 Temporarily skip model download test 2021-06-16 10:14:42 +02:00
Adriane Boyd
5646fcbe46 Merge remote-tracking branch 'upstream/develop' into chore/develop-into-master-v3.1 2021-06-15 15:05:17 +02:00
Adriane Boyd
480a3bf3be
Make JsonlReader path optional (#8396)
To avoid config errors during training when `[corpora.pretrain.path]` is
`None` with the default `spacy.JsonlCorpus.v1` reader, make the reader
path optional, similar to `spacy.Corpus.v1`.
2021-06-15 14:55:15 +02:00
Paul O'Leary McCann
94e1346f44
Change span lemmas to use original whitespace (fix #8368) (#8391)
* Change span lemmas to use original whitespace (fix #8368)

This is a redo of #8371 based off master.

The test for this required some changes to existing tests. I don't think
the changes were significant but I'd like someone to check them.

* Remove mystery docstring

This sentence was uncompleted for years, and now we will never know how
it ends.
2021-06-15 13:24:54 +02:00
Paul O'Leary McCann
2c105cdbce
Raise error if deps not provided with heads (#8335)
* Fill in deps if not provided with heads

Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.

* Use "dep" instead of a blank string

This is the customary placeholder dep. It might be better to show an
error here instead though.

* Throw error on heads without deps

* Add a test

* Fix tests

* Formatting

* Fix all tests

* Fix a test I missed

* Revise error message

* Clean up whitespace

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-06-15 13:23:32 +02:00
Sofie Van Landeghem
0fd0d949c4
fix 's typo's across code base (#8384) 2021-06-15 10:57:08 +02:00
Adriane Boyd
507422149f
Various docs updates for v3.0 (#8353)
* Update cats score names in Scorer API docs

* Refer to performance in meta

* Update package naming/versions, lemmatizer details

* Minor formatting fixes

* Provide more explanation for cats_score_desc

* Provide language-specific lemmatizer defaults in API docs

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
2021-06-14 12:19:36 +02:00
Adriane Boyd
6b69b8934b
Set version to v3.1.0.dev0 (#8379) 2021-06-14 11:17:35 +02:00
Sofie Van Landeghem
8729307e67
register extract_ngrams layer (#8358)
* register extract_ngrams layer

* fix import

* bump spacy-legacy to 3.0.6

* revert bump (wrong PR)
2021-06-14 10:30:30 +02:00
Adriane Boyd
63d748f80e
Add Catalan and Danish trf to website models (#8378) 2021-06-14 09:50:13 +02:00
Ines Montani
3259faad42 Update YouTube embed [ci skip] 2021-06-14 10:21:01 +10:00
Ines Montani
7f0f674a1b Fix universe.json and auto-format [ci skip] 2021-06-14 10:18:06 +10:00
Adriane Boyd
b98d216205
Update Catalan language data (#8308)
* Update Catalan language data

Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:

https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data

* Update tokenizer settings for UD Catalan AnCora

Update for UD Catalan AnCora v2.7 with merged multi-word tokens.

* Update test

* Move prefix patternt to more generic infix pattern

* Clean up
2021-06-11 10:21:22 +02:00
Adriane Boyd
d9be9e6cf9
Move README.md and LICENSES_SOURCES in package (#8297)
In addition to `LICENSE`, move the files `README.md` and
`LICENSES_SOURCES` to the top directory in `spacy package` if present in
the model directory.
2021-06-11 10:20:24 +02:00
Adriane Boyd
dbbeab2506
Merge pull request #8285 from adrianeboyd/feature/refactor-logger-warnings
Refactor warnings
2021-06-11 10:20:02 +02:00
Adriane Boyd
f4008bdb13
Restrict pymorphy2 requirement to pymorphy2 mode (#8299)
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
2021-06-11 10:19:22 +02:00
Francisco Aranda
0a1a4c665d
update spacy-wordnet code example (#8327)
* update spacy-wordnet code example

- include spaCy 2.x and 3.x init alternatives
- upgrade recognai logo

* fix escape chars
2021-06-10 21:53:11 +02:00
Adriane Boyd
6d2789452e
Restrict cython to <3.0 (#8337) 2021-06-10 11:03:30 +02:00
Adriane Boyd
d52ab13b5f
Update CI: update ubuntu image, add download test (#8298)
* Update CI: update ubuntu image, add download test

* Switch instances to `ubuntu-18.04`
* Add model download test, currently only for one job with python 3.8

* Fix variable name

* Set variables explicitly
2021-06-07 14:46:07 +02:00
graue70
f34dd0b98f
Fix typos in comments (#8279) 2021-06-07 10:43:54 +02:00
Adriane Boyd
9dfd3c9484 Use warnings.warn instead of logger.warning 2021-06-04 17:44:08 +02:00
Sofie Van Landeghem
f0277bdeab Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-06-04 17:37:38 +02:00
Adriane Boyd
07082c9692
Exclude generated .cpp files from package (#8271) 2021-06-04 14:56:07 +02:00
Paul O'Leary McCann
d959603d51
Don't add duplicate patterns all the time in EntityRuler (fix #8216) (#8246)
* Don't add duplicate patterns (fix #8216)

* Refactor EntityRuler init

This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.

* Make EntityRuler.clear reset matchers

Includes a new test for this.

* Tidy PhraseMatcher instantiation

Since the attr can be None safely now, the guard if is no longer
required here.

Also renamed the `_validate` attr. Maybe it's not needed?

* Fix NER test

* Add test to make sure patterns aren't increasing

* Move test to regression tests
2021-06-03 09:05:26 +02:00
Jean-Hugues Roy
ff5cf3606c
Improvements to French stopwords list (#7941)
* "y" etc.

Many changes described in pull request

* Update spacy/lang/fr/stop_words.py

* Update spacy/lang/fr/stop_words.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-06-02 11:50:49 +02:00
Vito De Tullio
3672464e25
applying suggestion to avoid mypy errors (#8265)
* applying suggestion to avoid mypy errors

* sign contributor agreement
2021-06-02 19:25:30 +10:00
Adriane Boyd
4aa1a7d5a3
Remove unsupported attrs from attrs.IDS (#8132)
The attributes `PROB`, `CLUSTER` and `SENT_END` are not supported by
`Lexeme.get_struct_attr` so should not be included through `attrs.IDS`
as supported attributes in `Doc.to_array` and other methods.
2021-06-02 19:16:57 +10:00
Paul O'Leary McCann
d54631f68b
Fix other open calls without context managers (#8245) 2021-05-31 19:04:29 +10:00
Paul O'Leary McCann
5aba213349 Fix skweak Github URL
Github entry should not contain url, just user/repo
2021-05-31 18:00:43 +09:00
Kristian Boda
dc8d8d15d2
Add hmrb to spaCy Universe (#8129)
* docs: add hmrb to spacy universe

* docs: add sentence on spacy versions

* docs: update description and images

* misc: add spaCy Contributor Agreement
2021-05-31 18:40:48 +10:00
Dhruv Naik
283f64a98d
Fix bug from Entityruler: ent_ids returns None for phrases (#8169)
* bugfix for explosion/spaCy#8168

* add test for explosion/spaCy#8168
2021-05-31 18:38:53 +10:00
Michael K
b0467d2972
Add project urls to package metadata (#7728)
This adds the links to PyPI. To see that in action check out
https://pypi.org/project/Django/ (source code:
b8c9e9fae1/setup.cfg (L27-L32))
2021-05-31 18:38:29 +10:00
Narayan Acharya
6b79714080
Address missing config overrides post load of models (#8208) 2021-05-31 18:36:52 +10:00
Sofie Van Landeghem
fff662e41f
Ensemble textcat with listener (#8012)
* add unit test for two listeners, with a textcat ensemble in the middle

* return zero gradients instead of None in accumulate_gradient
2021-05-31 18:21:06 +10:00
Sofie Van Landeghem
ff91e6dac7
Show warning if entity_ruler runs without patterns (#7807)
* Show warning if entity_ruler runs without patterns

* Show warning if matcher runs without patterns

* fix wording

* unit test for warning once (WIP)

* warn W036 only once

* cleanup

* create filter_warning helper
2021-05-31 18:20:27 +10:00
Paul O'Leary McCann
d1a221a374
Add all symbols in Unicode Currency Symbols block (#8212)
* Add all symbols in Unicode Currency Symbols block

In #8102 it came up that the rupee symbol was treated different from
dollar / euro / yen symbols. This adds many symbols not already
included.

* Fix test

* Fix training test
2021-05-31 18:03:40 +10:00
Paul O'Leary McCann
04239e94c7
Use a context manager when reading model (fix #7036) (#8244) 2021-05-31 17:36:17 +10:00
Sofie Van Landeghem
fc37715cfb
ensure 'spacy ray' works (#7799)
* ensure 'spacy ray' works

* better fix by changing entry point
2021-05-28 18:15:31 +02:00
Ines Montani
5957ab74f7
Merge pull request #8112 from svlandeg/bugfix/replace-trf 2021-05-28 11:35:17 +10:00
Sofie Van Landeghem
3c58c0323f
fix docs (#8200) 2021-05-27 10:48:59 +02:00
Sofie Van Landeghem
290bd6ed39
ensure tolerance is properly passed on (#8158) 2021-05-27 18:10:28 +10:00
Paul O'Leary McCann
0c553ecd4e Fix docs (fix #8189) 2021-05-24 19:47:30 +09:00
Adriane Boyd
cd6bd91c3a
Switch default train corpus max_length to 0 in quickstart (#8142)
The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length
!= 0` that `0` is a better default for users creating a new config with
the quickstart.

If not, documents are skipped, sometimes the entire corpus is skipped,
and sometimes documents are (quite unexpectedly for your average user)
split into sentences.
2021-05-20 14:48:09 +02:00
Sofie Van Landeghem
202943bc8c
KB & NEL to/from bytes (#8113)
* unit test for pickling KB

* add pickling test for NEL

* KB to_bytes and from_bytes

* NEL to_bytes and from_bytes

* xfail pickle tests for now

* fix docs

* cleanup
2021-05-20 18:11:30 +10:00
Adriane Boyd
4e69fcaa50 Disable GPU CI tests (#8143) 2021-05-19 12:00:31 +02:00
Adriane Boyd
f6128c06b0
Disable GPU CI tests (#8143) 2021-05-19 12:00:07 +02:00