Commit Graph

482 Commits

Author SHA1 Message Date
Sofie Van Landeghem
5ace559201
ensure span.text works for an empty span (#6772) 2021-01-21 23:18:46 +08:00
Sofie Van Landeghem
2af31a8c8d
Bugfix textcat reproducibility on GPU (#6411)
* add seed argument to ParametricAttention layer

* bump thinc to 7.4.3

* set thinc version range

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-11-23 12:29:35 +01:00
Sofie Van Landeghem
2998131416
Reproducibility for TextCat and Tok2Vec (#6218)
* ensure fixed seed in HashEmbed layers

* forgot about the joys of python 2
2020-10-08 00:43:46 +02:00
Florijan Stamenković
9db670b996
Fix Issue 6207 (#6208)
* Regression test for issue 6207

* Fix issue 6207

* Sign contributor agreement

* Minor adjustments to test

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2020-10-06 11:17:37 +02:00
Sofie Van Landeghem
f7a25d69f7
Bugfix in merge_entities (#6005)
* failing test

* bugfix
2020-09-01 21:57:52 +02:00
Sofie Van Landeghem
071c09ff35
add coding (#5942) 2020-08-20 11:08:38 +02:00
Gustavo Zadrozny Leyendecker
90b958fd01
Fix in EntityRenderer to support line breaks (after last entity) (closes #5838) 2020-07-29 18:48:39 +02:00
Adriane Boyd
0a62098c5f
Fix lemmatizer is_base_form for python2.7 (#5734)
* Fix lemmatizer init args for python2.7

* Move English is_base_form to a class method

* Skip test pickling PhraseMatcher for python2
2020-07-09 22:11:24 +02:00
graue70
9860b8399e
Fix typo in test function docstring (#5696) 2020-07-05 15:49:06 +02:00
Adriane Boyd
167df42cb6
Move lemmatizer is_base_form to language settings (#5663)
Move `Lemmatizer.is_base_form` to the language settings so that each
language can provide a language-specific method as
`LanguageDefaults.is_base_form`.

The existing English-specific `Lemmatizer.is_base_form` is moved to
`EnglishDefaults`.
2020-06-29 14:16:57 +02:00
Ines Montani
c685ee734a Fix compat for v2.x branch 2020-05-22 14:22:36 +02:00
Matthew Honnibal
93c4d13588
Merge pull request #5264 from lfiedler/issue-5230
Fix ResourceWarnings during unittest
2020-05-22 00:31:07 +02:00
svlandeg
36a94c409a failing test to reproduce overlapping spans problem 2020-05-20 23:06:03 +02:00
adrianeboyd
40e65d6f63
Fix most_similar for vectors with unused rows (#5348)
* Fix most_similar for vectors with unused rows

Address issues related to the unused rows in the vector table and
`most_similar`:

* Update `most_similar()` to search only through rows that are in use
according to `key2row`.

* Raise an error when `most_similar(n=n)` is larger than the number of
vectors in the table.

* Set and restore `_unset` correctly when vectors are added or
deserialized so that new vectors are added in the correct row.

* Set data and keys to the same length in `Vocab.prune_vectors()` to
avoid spurious entries in `key2row`.

* Fix regression test using `most_similar`

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-05-19 16:41:26 +02:00
Sofie Van Landeghem
cfdaf99b80
Fix passing of component configuration (#5374)
* add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument

* add fix and test for Issue 5137
2020-04-29 12:56:17 +02:00
Sofie Van Landeghem
f67343295d
Update NEL examples and documentation (#5370)
* simplify creation of KB by skipping dim reduction

* small fixes to train EL example script

* add KB creation and NEL training example scripts to example section

* update descriptions of example scripts in the documentation

* moving wiki_entity_linking folder from bin to projects

* remove test for wiki NEL functionality that is being moved
2020-04-29 12:53:53 +02:00
adrianeboyd
f8ac5b9f56
bugfix in span similarity (#5155) (#5358)
* bugfix in span similarity

* also rewrite doc.pyx for clarity

* formatting

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2020-04-27 16:51:27 +02:00
Jakob Jul Elben
663333c3b2
Fixes #5413 (#5315)
* Fix 5314

* Add contributor

* Resolve requested changes

Co-authored-by: Jakob Jul Elben <jakob@datamaga.com>
2020-04-16 13:29:02 +02:00
Leander Fiedler
d60e2d3ebf issue5230 added unit test for dumping and loading knowledgebase 2020-04-12 09:08:41 +02:00
Leander Fiedler
d2bb649227 issue5230 filter warnings in addition to filterwarnings to prevent deprecation warnings in python35(win) setup to pop up 2020-04-10 23:21:13 +02:00
Leander Fiedler
ca2a7a44db issue5230 store string values of warnings to remotely debug failing python35(win) setup 2020-04-10 22:26:55 +02:00
Leander Fiedler
88ca40a15d issue5230 raise warnings as errors to remotely debug failing python35(win) setup 2020-04-10 21:45:53 +02:00
Leander Fiedler
a7bdfe42e1 issue5230 added print statement to warnings filter to remotely debug failing python35(win) setup 2020-04-10 21:14:33 +02:00
Leander Fiedler
8c1d0d628f issue5230 writer now checks instance of loc parameter before trying to operate on it 2020-04-10 20:35:52 +02:00
lfiedler
e1e25c7e30 issue5230: added unittest test case for completion 2020-04-06 21:36:02 +02:00
Leander Fiedler
cde96f6c64 issue5230: optimized unit test a bit 2020-04-06 20:51:12 +02:00
Leander Fiedler
71cc903d65 issue5230: replaced open statements on path objects so that serialization still works and files are closed 2020-04-06 20:30:41 +02:00
Leander Fiedler
273ed452bb issue5230: added unicode declaration at top of the file 2020-04-06 19:22:32 +02:00
Leander Fiedler
1cd975d4a5 issue5230: fixed resource warnings in language 2020-04-06 18:54:32 +02:00
Leander Fiedler
493c77462a issue5230: test cases
covering known sources of resource warnings
2020-04-06 18:46:51 +02:00
Ines Montani
828acffc12 Tidy up and auto-format 2020-03-25 12:28:12 +01:00
Sofie Van Landeghem
1a2b8fc264
set vector of merged entity (#5085)
* merge_entities sets the vector in the vocab for the merged token

* add unit test

* import unicode_literals

* move code to _merge function

* only set vector if vocab has non-zero vectors
2020-03-06 14:45:28 +01:00
Sofie Van Landeghem
d307e9ca58
take care of global vectors in multiprocessing (#5081)
* restore load_nlp.VECTORS in the child process

* add unit test

* fix test

* remove unnecessary import

* add utf8 encoding

* import unicode_literals
2020-03-03 13:58:22 +01:00
Sofie Van Landeghem
c6b12ab02a
Bugfix/get doc (#5049)
* new (broken) unit test

* fixing get_doc method
2020-03-02 11:49:28 +01:00
svlandeg
b49a3afd0c use clean_underscore fixture 2020-02-23 15:49:20 +01:00
svlandeg
6e717c62ed avoid the tests interacting with each other through the global Underscore variable 2020-02-12 13:21:31 +01:00
svlandeg
7939c63886 use English instead of model 2020-02-12 12:26:27 +01:00
svlandeg
46628d8890 add some asserts 2020-02-12 12:12:52 +01:00
svlandeg
51d37033c8 remove old comment 2020-02-12 12:10:05 +01:00
svlandeg
05dedaa2cf add unit test 2020-02-12 12:00:13 +01:00
Tyler Couto
9fa9d7f2cb
Fix for Issue 4665 - conllu2json (#4953)
* Fix for Issue 4665 - conllu2json

- Allowing HEAD to be an underscore

* Added contributor agreement
2020-02-03 13:01:48 +01:00
Yohei Tamura
708a4d27eb fix nlp.evaluate (#4924) (#4925)
* new file:   test_issue4924.py

* modified:   spacy/gold.pyx

* modified:   test_issue4924.py for python2
2020-01-20 12:17:46 +01:00
Sofie Van Landeghem
a1b22e90cd serialize ENT_ID (#4852)
* expand serialization test for custom token attribute

* add failing test for issue 4849

* define ENT_ID as attr and use in doc serialization

* fix few typos
2020-01-06 14:57:34 +01:00
Ines Montani
3431ac42de Fix typo 2019-12-21 21:17:45 +01:00
Ines Montani
7c69d30de5 Tidy up and expect warning 2019-12-21 21:14:52 +01:00
Ines Montani
cb4145adc7 Tidy up and auto-format 2019-12-21 19:04:17 +01:00
Sofie Van Landeghem
f9b541f9ef More robust set entities method in KB (#4794)
* add unit test for setting entities with duplicate identifiers

* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
Ines Montani
5b36dec7eb Auto-exclude disabled when calling from_disk during load (#4708) 2019-11-25 16:01:22 +01:00
Ines Montani
5d4eede1e4 Fix test util imports 2019-11-21 16:28:29 +01:00
GuiGel
8f7ab70870 Bugfix/fix entity ruler from disk (#4670)
* fix EntityRuler from_disk bug

* add contributor file

* Test EntityRuler PhraseMatcher deserialization (#4651)

* newline at end of file

* fix copy paste error

* serializing the EntityRuler by itself

* Add unicode declarations for Python 2 and auto-format
2019-11-21 16:26:37 +01:00