Daniël de Kok
28299644fc
Speed up the StateC::L feature function (#10019)
...
* Speed up the StateC::L feature function
This function gets the n-th most-recent left-arc with a particular head.
Before this change, StateC::L would construct a vector of all left-arcs
with the given head and then pick the n-th most recent from that vector.
Since the number of left-arcs strongly correlates with the doc length
and the feature is constructed for every transition, this can make
transition-parsing quadratic.
With this change StateC::L:
- Searches left-arcs backwards.
- Stops early when the n-th matching transition is found.
- Does not construct a vector (reducing memory pressure).
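The three points above can be illustrated with a small Python sketch. This is not the actual StateC::L code: the `Arc` type, the function name, and the -1 "not found" return value are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Sequence


class Arc(NamedTuple):
    """Hypothetical stand-in for a left-arc record: head and child token indices."""
    head: int
    child: int


def nth_most_recent_left_arc(left_arcs: Sequence[Arc], head: int, n: int) -> int:
    """Return the child of the n-th most recent left-arc with the given head.

    Returns -1 (an assumed sentinel) if fewer than n such arcs exist.
    """
    if n < 1:
        return -1
    seen = 0
    # Walk the arcs from newest to oldest; no intermediate list is built.
    for arc in reversed(left_arcs):
        if arc.head == head:
            seen += 1
            if seen == n:
                # Stop as soon as the n-th match is found.
                return arc.child
    # No n-th match: the whole sequence was scanned (the remaining linear case).
    return -1


arcs = [Arc(5, 1), Arc(5, 2), Arc(7, 3), Arc(5, 4)]
print(nth_most_recent_left_arc(arcs, head=5, n=1))
print(nth_most_recent_left_arc(arcs, head=5, n=2))
```

When the queried head has recent left-arcs, the loop exits after a few steps. When it has none, every arc is still visited.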
This change doesn't avoid the linear search when the transition that is
queried does not occur in the left-arcs. Regardless, performance is
improved quite a bit with very long docs:
Before:
   N   Time
 400    3.3
 800    5.4
1600   11.6
3200   30.7
After:
   N   Time
 400    3.2
 800    5.0
1600    9.5
3200   23.2
We can probably do better with more tailored data structures, but I
first wanted to make a low-impact PR.
Found while investigating #9858.
* StateC::L: simplify loop
2022-01-13 09:03:55 +01:00
jsnfly
176a90edee
Fix textcat loss scaling (#9904) (#10002)
...
* add failing test for issue 9904
* remove division by batch size and summation before applying the mean
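One plausible reading of the second bullet, sketched in plain Python: the reported loss previously divided the score differences by the batch size and summed the squares per row before averaging, so its value depended on the batch size. The function names and the list-of-lists representation are assumptions for illustration, not the actual textcat code.

```python
from typing import List

Matrix = List[List[float]]


def loss_before(scores: Matrix, truths: Matrix) -> float:
    """Assumed old behaviour: scale differences by batch size, sum squares per row, then average rows."""
    batch = len(scores)
    diffs = [[(s - t) / batch for s, t in zip(srow, trow)]
             for srow, trow in zip(scores, truths)]
    return sum(sum(d * d for d in row) for row in diffs) / batch


def loss_after(scores: Matrix, truths: Matrix) -> float:
    """Assumed new behaviour: plain mean of the squared differences over all entries."""
    diffs = [[s - t for s, t in zip(srow, trow)]
             for srow, trow in zip(scores, truths)]
    count = sum(len(row) for row in diffs)
    return sum(d * d for row in diffs for d in row) / count


# The same per-example error, repeated in a batch of 1 and a batch of 4.
one = ([[1.0, 0.0]], [[0.0, 0.0]])
four = ([[1.0, 0.0]] * 4, [[0.0, 0.0]] * 4)
print(loss_before(*one), loss_before(*four))  # value changes with batch size
print(loss_after(*one), loss_after(*four))    # value stays the same
```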
Co-authored-by: jonas <jsnfly@gmx.de>
2022-01-13 09:03:23 +01:00
Sofie Van Landeghem
d8a3012539
Merge pull request #10037 from explosion/master
...
Update develop with master
2022-01-12 12:29:23 +01:00
Ryn Daniels
057b8c64c0
Check for assets with size of 0 bytes (#10026)
...
* Check for assets with size of 0 bytes
* Update spacy/cli/project/assets.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-12 10:34:23 +01:00
Sofie Van Landeghem
5ba4171b19
Update LICENSE to include 2022 [ci skip]
2022-01-07 09:24:07 +01:00
Ines Montani
005e23a525
Merge pull request #9989 from explosion/docs/update-algolia-search-api [ci skip]
2022-01-05 14:14:42 +01:00
Ines Montani
a437ca6737
Update website to use new Algolia search API
2022-01-05 13:21:06 +01:00
Sofie Van Landeghem
067a44a417
Merge pull request #9987 from explosion/master
...
Update develop with commits from master
2022-01-05 11:49:50 +01:00
Lj Miranda
00e7bf5ffd
Add a few docs to the default_config.cfg (#9981)
...
* Clarify patience hyperparameter
The current value for patience doesn't seem to indicate that it's
pointing to the number of steps. It may be useful to specify that
explicitly.
Ref: https://github.com/explosion/spaCy/discussions/7450
Ref: https://github.com/explosion/spaCy/discussions/7465
* Update docs for max_steps
2022-01-05 09:16:40 +01:00
Duygu Altinok
55cf492218
Feat/debug data warn spread ents (#9960)
...
* added check for crossing boundaries
* formatted blacked
* Rephrasing slightly
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-01-04 18:22:10 +01:00
Sofie Van Landeghem
56dcb39fb7
Fix references to config file in the docs & UX (#9961)
...
* doc fixes around config file
* fix typo
* clarify default
2022-01-04 14:31:26 +01:00
Sofie Van Landeghem
029a48e340
fix type of lexeme.rank (#9979)
2022-01-04 13:15:25 +01:00
Sam Edwardes
6f65e2b544
Added spacypdfreader to universe.json (#9963)
2022-01-03 16:34:36 +09:00
Richard Hudson
cc21eac88a
Use \n rather than linesep for consistency with wasabi
2021-12-29 13:33:56 +01:00
Richard Hudson
85da92f041
Ignore Windows carriage return characters
2021-12-29 12:16:45 +01:00
Paul O'Leary McCann
f40e237c5a
Remove denomme from universe (#9952)
...
Package seems to have been deleted.
2021-12-29 11:41:29 +01:00
Richard Hudson
f7f9cc72e7
Fixed supports_ansi problem for Windows tests
2021-12-29 11:22:48 +01:00
Florian Cäsar
86e71e7b19
Fix Scorer.score_cats for missing labels (#9443)
...
* Fix Scorer.score_cats for missing labels
* Add test case for Scorer.score_cats missing labels
* semantic nitpick
* black formatting
* adjust test to give different results depending on multi_label setting
* fix loss function according to whether or not missing values are supported
* add note to docs
* small fixes
* make mypy happy
* Update spacy/pipeline/textcat.py
Co-authored-by: Florian Cäsar <florian.caesar@pm.me>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2021-12-29 11:04:39 +01:00
Richard Hudson
264ead3274
Removed incorrect automatically added import statement
2021-12-29 10:11:48 +01:00
Sofie Van Landeghem
b8106e0f95
Merge pull request #9951 from explosion/master
...
Update develop branch with master
2021-12-29 10:11:43 +01:00
Richard Hudson
8e55efcbd9
Check SUPPORTS_ANSI when rendering
2021-12-29 09:30:35 +01:00
Richard Hudson
08370604d3
Change order of imports
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-29 09:22:06 +01:00
Richard Hudson
678bc61086
Apply suggestions from code review
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-29 09:21:23 +01:00
Richard Hudson
e3e8495b41
Updated requirements.txt
2021-12-29 08:47:56 +01:00
Yoav Vollansky
9d63dfacfc
Update UNIVERSE.md (#9941)
...
typo
2021-12-27 13:46:04 +01:00
Peter Baumgartner
72abf9e102
MultiHashEmbed vector docs correction (#9918)
2021-12-27 11:18:08 +01:00
Richard Hudson
92943f8a23
Removed unused import
2021-12-23 17:47:56 +01:00
Richard Hudson
2cae470180
More type corrections
2021-12-23 17:35:47 +01:00
Richard Hudson
106fb53509
More type corrections
2021-12-23 17:24:28 +01:00
Richard Hudson
5c850b2ac3
Corrected types
2021-12-23 17:01:43 +01:00
Richard Hudson
e713aa0938
Add surrounding tokens functionality
2021-12-23 16:13:40 +01:00
Duygu Altinok
7ec1452f5f
added elided forms (#9878)
...
* added elided forms
* rearranged a bit
* rearranged a bit
* added stopword tests
* blacked tests file
2021-12-23 13:41:01 +01:00
Andrew Janco
3cfeb518ee
Handle "_" value for token pos in conllu data (#9903)
...
* change '_' to '' so that Token.pos can be set when the conllu data has no value for the token pos
* Minor code style
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2021-12-21 15:46:33 +01:00
Adriane Boyd
837d241b68
Make floret murmurhash endian-neutral (#9735)
2021-12-20 17:11:31 +01:00
Adriane Boyd
1163073756
Remove outdated patterns MANIFEST.in (#9912)
2021-12-20 16:40:20 +01:00
Adriane Boyd
18e5638af0
Extend cupy to v10.x (#9911)
...
* Add extra for `cupy-cuda115`
2021-12-20 15:48:35 +01:00
Sofie Van Landeghem
7847839003
Merge pull request #9891 from explosion/master
...
Update develop with master
2021-12-17 14:01:27 +01:00
Daniël de Kok
93e9bf681f
Merge pull request #9873 from danieldk/temporarily-pin-mypy
...
Pin mypy to 0.910 until there is a compatible pydantic version
2021-12-16 10:28:31 +01:00
Daniël de Kok
b08f1ac17d
Pin mypy to 0.910 until there is a compatible pydantic version
2021-12-16 09:31:45 +01:00
Adriane Boyd
94fbd88521
Use dict.copy().items() instead of list(.items()) (#9868)
2021-12-16 09:17:33 +01:00
Edward
018827e9fd
Add healthsea to universe (#9838)
...
* Add healthsea to universe
* Update website/meta/universe.json
* Add thumbnail
* Update website/meta/universe.json
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2021-12-15 17:57:19 +01:00
antonpibm
ac45ae3779
Update Tokenizer documentation to reflect token_match and url_match signatures (#9859)
2021-12-15 09:34:33 +01:00
Ines Montani
ba0fa7a64e
Support Google Sheets embeds in docs (#9861)
2021-12-15 09:27:08 +01:00
Richard Hudson
ed788c5def
Add render_instances function
2021-12-08 19:24:32 +01:00
Richard Hudson
bd00611259
Add render_text
2021-12-08 17:47:29 +01:00
Richard Hudson
49f3fd39b9
Refactoring
2021-12-08 16:42:39 +01:00
Richard Hudson
183d535ef4
Add permitted values
2021-12-08 14:58:02 +01:00
Richard Hudson
9f7f234b0f
Added tabular view
2021-12-08 14:30:38 +01:00
Richard Hudson
e04950ef3c
Fixed problems with non-projective trees
2021-12-07 12:04:41 +01:00
Adriane Boyd
800737b416
Set version to v3.2.1 (#9823)
2021-12-07 10:51:45 +01:00