15815 Commits

Author SHA1 Message Date
jsnfly
176a90edee
Fix textcat loss scaling (#9904) (#10002)
* add failing test for issue 9904

* remove division by batch size and summation before applying the mean

Co-authored-by: jonas <jsnfly@gmx.de>
2022-01-13 09:03:23 +01:00
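A minimal sketch of why such a scaling change matters, based only on a literal reading of the commit message above ("remove division by batch size and summation before applying the mean"). It is not the spaCy implementation; both function names are hypothetical and the numbers are made up for illustration.

```python
def squared_error_old_style(scores, truths):
    """Hypothetical 'before': divide each difference by the batch size,
    sum the squares per row, then average over rows."""
    n = len(scores)
    per_row = [
        sum(((s - t) / n) ** 2 for s, t in zip(row_s, row_t))
        for row_s, row_t in zip(scores, truths)
    ]
    return sum(per_row) / n


def squared_error_mean(scores, truths):
    """Hypothetical 'after': plain mean of the squared differences."""
    squares = [
        (s - t) ** 2
        for row_s, row_t in zip(scores, truths)
        for s, t in zip(row_s, row_t)
    ]
    return sum(squares) / len(squares)


# The same example once, and then repeated to form a batch of two.
scores_1, truths_1 = [[0.5, 0.5]], [[1.0, 0.0]]
scores_2, truths_2 = scores_1 * 2, truths_1 * 2

old_1 = squared_error_old_style(scores_1, truths_1)
old_2 = squared_error_old_style(scores_2, truths_2)
new_1 = squared_error_mean(scores_1, truths_1)
new_2 = squared_error_mean(scores_2, truths_2)
```

In this sketch the old-style value shrinks as the batch grows even though the per-example error is identical, while the plain mean stays the same.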
Sofie Van Landeghem
d8a3012539
Merge pull request #10037 from explosion/master
Update develop with master
2022-01-12 12:29:23 +01:00
Ryn Daniels
057b8c64c0
Check for assets with size of 0 bytes (#10026)
* Check for assets with size of 0 bytes

* Update spacy/cli/project/assets.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-12 10:34:23 +01:00
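A check for zero-byte assets could be sketched as follows. This is an illustration only, not the code in spacy/cli/project/assets.py; the helper name `find_empty_assets` is hypothetical.

```python
import tempfile
from pathlib import Path


def find_empty_assets(paths):
    """Return the given paths that are regular files containing 0 bytes."""
    return [p for p in map(Path, paths) if p.is_file() and p.stat().st_size == 0]


# Usage: one empty file and one non-empty file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    empty = Path(tmp) / "empty.bin"
    full = Path(tmp) / "full.bin"
    empty.write_bytes(b"")
    full.write_bytes(b"data")
    empty_names = [p.name for p in find_empty_assets([empty, full])]
```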
Sofie Van Landeghem
5ba4171b19
Update LICENSE to include 2022 [ci skip] 2022-01-07 09:24:07 +01:00
Ines Montani
005e23a525
Merge pull request #9989 from explosion/docs/update-algolia-search-api [ci skip] 2022-01-05 14:14:42 +01:00
Ines Montani
a437ca6737 Update website to use new Algolia search API 2022-01-05 13:21:06 +01:00
Sofie Van Landeghem
067a44a417
Merge pull request #9987 from explosion/master
Update develop with commits from master
2022-01-05 11:49:50 +01:00
Lj Miranda
00e7bf5ffd
Add a few docs to the default_config.cfg (#9981)
* Clarify patience hyperparameter

The current value for patience doesn't seem to indicate that it's
pointing to the number of steps. It may be useful to specify that
explicitly.

Ref: https://github.com/explosion/spaCy/discussions/7450
Ref: https://github.com/explosion/spaCy/discussions/7465

* Update docs for max_steps
2022-01-05 09:16:40 +01:00
Duygu Altinok
55cf492218
Feat/debug data warn spread ents (#9960)
* added check for crossing boundaries

* formatted blacked

* Rephrasing slightly

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-04 18:22:10 +01:00
Sofie Van Landeghem
56dcb39fb7
Fix references to config file in the docs & UX (#9961)
* doc fixes around config file

* fix typo

* clarify default
2022-01-04 14:31:26 +01:00
Sofie Van Landeghem
029a48e340
fix type of lexeme.rank (#9979) 2022-01-04 13:15:25 +01:00
Sam Edwardes
6f65e2b544
Added spacypdfreader to universe.json (#9963) 2022-01-03 16:34:36 +09:00
Richard Hudson
cc21eac88a Use \n rather than linesep for consistency with wasabi 2021-12-29 13:33:56 +01:00
Richard Hudson
85da92f041 Ignore Windows carriage return characters 2021-12-29 12:16:45 +01:00
Paul O'Leary McCann
f40e237c5a
Remove denomme from universe (#9952)
Package seems to have been deleted.
2021-12-29 11:41:29 +01:00
Richard Hudson
f7f9cc72e7 Fixed supports_ansi problem for Windows tests 2021-12-29 11:22:48 +01:00
Florian Cäsar
86e71e7b19
Fix Scorer.score_cats for missing labels (#9443)
* Fix Scorer.score_cats for missing labels

* Add test case for Scorer.score_cats missing labels

* semantic nitpick

* black formatting

* adjust test to give different results depending on multi_label setting

* fix loss function according to whether or not missing values are supported

* add note to docs

* small fixes

* make mypy happy

* Update spacy/pipeline/textcat.py

Co-authored-by: Florian Cäsar <florian.caesar@pm.me>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2021-12-29 11:04:39 +01:00
Richard Hudson
264ead3274 Removed incorrect automatically added import statement 2021-12-29 10:11:48 +01:00
Sofie Van Landeghem
b8106e0f95
Merge pull request #9951 from explosion/master
Update develop branch with master
2021-12-29 10:11:43 +01:00
Richard Hudson
8e55efcbd9 Check SUPPORTS_ANSI when rendering 2021-12-29 09:30:35 +01:00
Richard Hudson
08370604d3
Change order of imports
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-29 09:22:06 +01:00
Richard Hudson
678bc61086
Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-29 09:21:23 +01:00
Richard Hudson
e3e8495b41 Updated requirements.txt 2021-12-29 08:47:56 +01:00
Yoav Vollansky
9d63dfacfc
Update UNIVERSE.md (#9941)
typo
2021-12-27 13:46:04 +01:00
Peter Baumgartner
72abf9e102
MultiHashEmbed vector docs correction (#9918) 2021-12-27 11:18:08 +01:00
Richard Hudson
92943f8a23 Removed unused import 2021-12-23 17:47:56 +01:00
Richard Hudson
2cae470180 More type corrections 2021-12-23 17:35:47 +01:00
Richard Hudson
106fb53509 More type corrections 2021-12-23 17:24:28 +01:00
Richard Hudson
5c850b2ac3 Corrected types 2021-12-23 17:01:43 +01:00
Richard Hudson
e713aa0938 Add surrounding tokens functionality 2021-12-23 16:13:40 +01:00
Duygu Altinok
7ec1452f5f
added elided forms (#9878)
* added elided forms

* rearranged a bit

* rearranged a bit

* added stopword tests

* blacked tests file
2021-12-23 13:41:01 +01:00
Andrew Janco
3cfeb518ee
Handle "_" value for token pos in conllu data (#9903)
* change '_' to '' so that Token.pos can be set when the conllu data has no value for token pos

* Minor code style

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-12-21 15:46:33 +01:00
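In CoNLL-U, an underscore marks an unspecified field. The substitution described in the commit above can be sketched like this; the helper name `normalize_conllu_pos` is hypothetical and this is not the converter code from the pull request.

```python
def normalize_conllu_pos(value: str) -> str:
    """Map the CoNLL-U placeholder "_" to an empty string; keep real tags."""
    return "" if value == "_" else value


# Usage: a column of UPOS values where one token has no tag.
raw_tags = ["NOUN", "_", "VERB"]
clean_tags = [normalize_conllu_pos(tag) for tag in raw_tags]
```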
Adriane Boyd
837d241b68
Make floret murmurhash endian-neutral (#9735) 2021-12-20 17:11:31 +01:00
Adriane Boyd
1163073756
Remove outdated patterns MANIFEST.in (#9912) 2021-12-20 16:40:20 +01:00
Adriane Boyd
18e5638af0
Extend cupy to v10.x (#9911)
* Add extra for `cupy-cuda115`
2021-12-20 15:48:35 +01:00
Sofie Van Landeghem
7847839003
Merge pull request #9891 from explosion/master
Update develop with master
2021-12-17 14:01:27 +01:00
Daniël de Kok
93e9bf681f
Merge pull request #9873 from danieldk/temporarily-pin-mypy
Pin mypy to 0.910 until there is a compatible pydantic version
2021-12-16 10:28:31 +01:00
Daniël de Kok
b08f1ac17d Pin mypy to 0.910 until there is a compatible pydantic version 2021-12-16 09:31:45 +01:00
Adriane Boyd
94fbd88521
Use dict.copy().items() instead of list(.items()) (#9868) 2021-12-16 09:17:33 +01:00
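Both idioms named in the commit title above iterate over a snapshot, which allows the original dict to be modified inside the loop; iterating over the live `.items()` view while adding or removing keys raises a `RuntimeError`. A minimal sketch with made-up data (the commit title does not say why one form is preferred over the other):

```python
counts = {"a": 1, "b": 0, "c": 2}

# Iterate over a shallow copy so that deleting from `counts` is safe.
for key, value in counts.copy().items():
    if value == 0:
        del counts[key]
```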
Edward
018827e9fd
Add healthsea to universe (#9838)
* Add healthsea to universe

* Update website/meta/universe.json

* Add thumbnail

* Update website/meta/universe.json

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-15 17:57:19 +01:00
antonpibm
ac45ae3779
Update Tokenizer documentation to reflect token_match and url_match signatures (#9859) 2021-12-15 09:34:33 +01:00
Ines Montani
ba0fa7a64e
Support Google Sheets embeds in docs (#9861) 2021-12-15 09:27:08 +01:00
Richard Hudson
ed788c5def Add render_instances function 2021-12-08 19:24:32 +01:00
Richard Hudson
bd00611259 Add render_text 2021-12-08 17:47:29 +01:00
Richard Hudson
49f3fd39b9 Refactoring 2021-12-08 16:42:39 +01:00
Richard Hudson
183d535ef4 Add permitted values 2021-12-08 14:58:02 +01:00
Richard Hudson
9f7f234b0f Added tabular view 2021-12-08 14:30:38 +01:00
Richard Hudson
e04950ef3c Fixed problems with non-projective trees 2021-12-07 12:04:41 +01:00
Adriane Boyd
800737b416
Set version to v3.2.1 (#9823) 2021-12-07 10:51:45 +01:00
Haakon Meland Eriksen
251119455d
Remove NER words from stop words in Norwegian (#9820)
The default stop words for Norwegian Bokmål (nb) in spaCy include important entities (e.g. France, Germany, Russia, Sweden and USA), police district, important units of time (e.g. months and days of the week), and organisations.

Users do not expect these among the default stop words. Those who follow the general recommendation to filter out stop words risk unknowingly removing important entities from their data.

See explanation in https://github.com/explosion/spaCy/issues/3052#issuecomment-986756711 and comment https://github.com/explosion/spaCy/issues/3052#issuecomment-986951831
2021-12-07 09:45:10 +01:00