Commit Graph

425 Commits

Author SHA1 Message Date
Sofie Van Landeghem
8e7414dace Match pop with append for training format (#4516)
* trying to fix script - not succesful yet

* match pop() with extend() to avoid changing the data

* few more pop-extend fixes

* reinsert deleted print statement

* fix print statement

* add last tested version

* append instead of extend

* add in few comments

* quick fix for 4402 + unit test

* fixing number of docs (not counting cats)

* more fixes

* fix len

* print tmp file instead of using data from examples dir

* print tmp file instead of using data from examples dir (2)
2019-10-27 16:01:32 +01:00
tamuhey
fcd25db033 [#4529] fix: gold pyx (#4530)
* fix: gold pyx

* remove print

* skip test in python2

* Add unicode declarations and don't skip test on Python 2
2019-10-27 13:50:07 +01:00
Ines Montani
cfffdba7b1 Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522)
* Implement new API for {Phrase}Matcher.add (backwards-compatible)

* Update docs

* Also update DependencyMatcher.add

* Update internals

* Rewrite tests to use new API

* Add basic check for common mistake

Raise error with suggestion if user likely passed in a pattern instead of a list of patterns

* Fix typo [ci skip]
2019-10-25 22:21:08 +02:00
Ines Montani
d2da117114 Also support passing list to Language.disable_pipes (#4521)
* Also support passing list to Language.disable_pipes

* Adjust internals
2019-10-25 16:19:08 +02:00
Ines Montani
cc05d9dad6 Auto-format [ci skip] 2019-10-24 16:21:08 +02:00
Sofie Van Landeghem
d5d55312b2 prevent division by zero in most_similar method (#4488) 2019-10-21 12:04:46 +02:00
Ines Montani
181c01f629 Tidy up and auto-format 2019-10-18 11:27:38 +02:00
Sofie Van Landeghem
2d249a9502 KB extensions and better parsing of WikiData (#4375)
* fix overflow error on windows

* more documentation & logging fixes

* md fix

* 3 different limit parameters to play with execution time

* bug fixes directory locations

* small fixes

* exclude dev test articles from prior probabilities stats

* small fixes

* filtering wikidata entities, removing numeric and meta items

* adding aliases from wikidata also to the KB

* fix adding WD aliases

* adding also new aliases to previously added entities

* fixing comma's

* small doc fixes

* adding subclassof filtering

* append alias functionality in KB

* prevent appending the same entity-alias pair

* fix for appending WD aliases

* remove date filter

* remove unnecessary import

* small corrections and reformatting

* remove WD aliases for now (too slow)

* removing numeric entities from training and evaluation

* small fixes

* shortcut during prediction if there is only one candidate

* add counts and fscore logging, remove FP NER from evaluation

* fix entity_linker.predict to take docs instead of single sentences

* remove enumeration sentences from the WP dataset

* entity_linker.update to process full doc instead of single sentence

* spelling corrections and dump locations in readme

* NLP IO fix

* reading KB is unnecessary at the end of the pipeline

* small logging fix

* remove empty files
2019-10-14 12:28:53 +02:00
Ines Montani
fec9433044 Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373) 2019-10-04 12:18:41 +02:00
Sofie Van Landeghem
4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Sofie Van Landeghem
9d3ce7cba2 Ensure training doesn't crash with empty batches (#4360)
* unit test for previously resolved unflatten issue

* prevent batch of empty docs to cause problems
2019-10-02 12:50:47 +02:00
Ines Montani
cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani
e0cf4796a5 Move lookup tables out of the core library (#4346)
* Add default to util.get_entry_point

* Tidy up entry points

* Read lookups from entry points

* Remove lookup tables and related tests

* Add lookups install option

* Remove lemmatizer tests

* Remove logic to process language data files

* Update setup.cfg
2019-10-01 00:01:27 +02:00
Ines Montani
0226b3bf0e Fix test imports 2019-09-29 17:34:56 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Sofie Van Landeghem
22b9e12159 Ensure the NER remains consistent after resizing (#4330)
* test and fix for second bug of issue 4042

* fix for first bug in 4042

* crashing test for Issue 4313

* forgot one instance of resize

* remove prints

* undo uncomment

* delete test for 4313 (uses third party lib)

* add fix for Issue 4313

* unit test for 4313
2019-09-27 20:57:13 +02:00
Matthew Honnibal
46c02d25b1 Merge changes to test_ner 2019-09-18 21:41:24 +02:00
Sofie Van Landeghem
de5a9ecdf3 Distinction between outside, missing and blocked NER annotations (#4307)
* remove duplicate unit test

* unit test (currently failing) for issue 4267

* bugfix: ensure doc.ents preserves kb_id annotations

* fix in setting doc.ents with empty label

* rename

* test for presetting an entity to a certain type

* allow overwriting Outside + blocking presets

* fix actions when previous label needs to be kept

* fix default ent_iob in set entities

* cleaner solution with U- action

* remove debugging print statements

* unit tests with explicit transitions and is_valid testing

* remove U- from move_names explicitly

* remove unit tests with pre-trained models that don't work

* remove (working) unit tests with pre-trained models

* clean up unit tests

* move unit tests

* small fixes

* remove two TODO's from doc.ents comments
2019-09-18 21:37:17 +02:00
tamuhey
875f3e5d8c remove redundant __call__ method in pipes.TextCategorizer (#4305)
* remove redundant __call__ method in pipes.TextCategorizer

Because the parent __call__ method behaves in the same way.

* fix: Pipe.__call__ arg

* fix: invalid arg in Pipe.__call__

* modified:   spacy/tests/regression/test_issue4278.py (#4278)

* deleted:    Pipfile
2019-09-18 21:31:27 +02:00
Ines Montani
139428c20f Set unique vector names in tests 2019-09-16 15:16:54 +02:00
Ines Montani
655b434553 Merge branch 'master' into develop 2019-09-12 11:39:18 +02:00
tamuhey
71909cdf22 Fix iss4278 (#4279)
* fix: len(tuple) == 2

* (#4278) add fail test

* add contributor's aggreement
2019-09-12 10:44:49 +02:00
Ines Montani
e82a8d0d7a Merge branch 'master' into develop 2019-09-11 11:52:38 +02:00
Ines Montani
8f9f48b04c Add GreekLemmatizer.lookup (resolves #4272) 2019-09-11 11:44:40 +02:00
Ines Montani
6279d74c65 Tidy up and auto-format 2019-09-11 11:38:22 +02:00
Matthew Honnibal
7b858ba606 Update from master 2019-09-10 20:14:08 +02:00
adrianeboyd
3780e2ff50 Flush tokenizer cache when necessary (#4258)
Flush tokenizer cache when affixes, token_match, or special cases are
modified.

Fixes #4238, same issue as in #1250.
2019-09-08 20:52:46 +02:00
Matthew Honnibal
1a65c5b7af Update develop from master 2019-09-08 18:21:41 +02:00
Adriane Boyd
0f28418446 Add regression test for #1061 back to test suite 2019-09-04 20:42:24 +02:00
Ines Montani
419ae59c79 Make flaky test test_issue_1971_4 more explicit 2019-08-31 14:08:05 +02:00
svlandeg
7bec0ebbcb failing unit test for Issue 4190 2019-08-28 14:16:34 +02:00
Matthew Honnibal
22250cf6b7 Make regression test less sensitive to tag-map stuff 2019-08-25 21:54:26 +02:00
Matthew Honnibal
bb911e5f4e Fix #3830: 'subtok' label being added even if learn_tokens=False (#4188)
* Prevent subtok label if not learning tokens

The parser introduces the subtok label to mark tokens that should be
merged during post-processing. Previously this happened even if we did
not have the --learn-tokens flag set. This patch passes the config
through to the parser, to prevent the problem.

* Make merge_subtokens a parser post-process if learn_subtokens

* Fix train script

* Add test for 3830: subtok problem

* Fix handlign of non-subtok in parser training
2019-08-23 17:54:00 +02:00
Sofie Van Landeghem
c417c380e3 Matcher ID fixes (#4179)
* allow phrasematcher to link one match to multiple original patterns

* small fix for defining ent_id in the matcher (anti-ghost prevention)

* cleanup

* formatting
2019-08-22 17:17:07 +02:00
Sofie Van Landeghem
de272f8b82 adding double match for optional operator at the end (#4166) 2019-08-21 22:46:56 +02:00
Sofie Van Landeghem
01c5980187 Serialize POS attribute when doc.is_tagged (#4092)
* fix and unit test for issue 3959

* additional unit test for manifestation of the same (resolved) bug
2019-08-21 21:59:30 +02:00
Sofie Van Landeghem
7539a4f3a8 use states[q] in while retry loop (#4162) 2019-08-21 21:58:04 +02:00
Ines Montani
f580302673 Tidy up and auto-format 2019-08-20 17:36:34 +02:00
Ines Montani
364aaf5bc2 Simplify test 2019-08-20 16:41:58 +02:00
Sofie Van Landeghem
68ee0384fd Unit test for Issue 3879 (#4153)
* failing unit test for Issue #3879

* mark test as failing
2019-08-20 16:40:25 +02:00
Ines Montani
86cd7f0efd Add regression test for #4120 2019-08-20 16:33:09 +02:00
Ines Montani
009280fbc5 Tidy up and auto-format 2019-08-18 15:09:16 +02:00
AJ Rader
2f3648700c Correction of default lemmatizer lookup in English (Issue # 4104) (#4110)
* pytest file for issue4104 established

* edited default lookup english lemmatizer for spun; fixes issue 4102

* eliminated parameterization and sorted dictionary dependnency in issue 4104 test

* added contributor agreement
2019-08-15 11:39:10 +02:00
Sofie Van Landeghem
963ea5e8d0 Update lemma and vector information after splitting a token (#4097)
* fixing vector and lemma attributes after retokenizer.split

* fixing unit test with mockup tensor

* xp instead of numpy
2019-08-08 15:09:44 +02:00
Sofie Van Landeghem
ad09b0d6f3 fetch norm from lex if necessary for matching (#4080) 2019-08-05 23:51:04 +02:00
adrianeboyd
925a852bb6 Improve NER per type scoring (#4052)
* Improve NER per type scoring

* include all gold labels in per type scoring, not only when recall > 0
* improve efficiency of per type scoring

* Create Scorer tests, initially with NER tests

* move regression test #3968 (per type NER scoring) to Scorer tests

* add new test for per type NER scoring with imperfect P/R/F and per
type P/R/F including a case where R == 0.0
2019-08-01 17:15:36 +02:00
Sofie Van Landeghem
f7d950de6d ensure the lang of vocab and nlp stay consistent (#4057)
* ensure the language of vocab and nlp stay consistent across serialization

* equality with =
2019-08-01 17:13:01 +02:00
Sofie Van Landeghem
7de3b129ab Resolve edge case when calling textcat.predict with empty doc (#4035)
* resolve edge case where no doc has tokens when calling textcat.predict

* more explicit value test
2019-07-30 14:58:01 +02:00
Sofie Van Landeghem
ba02957c80 Fix dependency copy for as_doc (#3969)
* failing unit test for issue 3962

* attempt to fix Issue #3962

* create artificial unit test example

* using length instead of self.length

* sp

* reformat with black

* find better ancestor within span and use generic 'dep'

* attach to span.root if there is no appropriate ancestor

* comment span text

* clean up ancestor code

* reconstruct dep tree to keep same number of sentences
2019-07-23 18:28:54 +02:00