Commit Graph

6145 Commits

Author SHA1 Message Date
svlandeg
b1911f7105 Errors.E146 for IO error when FP is null 2019-07-22 14:56:13 +02:00
svlandeg
5d544f89ba Errors.E145 for IO errors when reading KB 2019-07-22 14:36:07 +02:00
Ines Montani
a32b033b8c Add regression test for #4002
Test that the PhraseMatcher can match on overwritten NORM attributes.
2019-07-22 14:18:24 +02:00
svlandeg
ad65171837 Merge remote-tracking branch 'upstream/master' into feature/nel-fixes 2019-07-22 13:41:28 +02:00
svlandeg
76184374e2 test corner cases 2019-07-22 13:39:32 +02:00
svlandeg
9f8c1e71a2 fix for Issue #4000 2019-07-22 13:34:12 +02:00
svlandeg
dae8a21282 rename entity frequency 2019-07-19 17:40:28 +02:00
svlandeg
41fb5204ba output tensors as part of predict 2019-07-19 14:47:36 +02:00
svlandeg
21176517a7 have gold.links correspond exactly to doc.ents 2019-07-19 12:36:15 +02:00
BreakBB
3e370cf2ba Add 'Prof.' to English tokenizer_exceptions 2019-07-19 10:00:45 +02:00
svlandeg
e1213eaf6a use original gold object in get_loss function 2019-07-18 13:35:10 +02:00
svlandeg
ec55d2fccd filter training data beforehand (+black formatting) 2019-07-18 10:22:24 +02:00
Falak Asad
ff1e73e35c Bugfix/issue 3968 (#3982)
* Fix for issue-3968

* Added contributor agreement

* Made suggested changes
2019-07-18 00:20:32 +02:00
svlandeg
d833d4c358 fixes in kb and gold 2019-07-17 17:18:26 +02:00
Ines Montani
73565c6d9d Rename function arguments 2019-07-17 14:29:52 +02:00
Matthew Honnibal
394e4d8058 Add docstring for spacy.gold.align 2019-07-17 13:59:17 +02:00
Ines Montani
073013f129 Auto-format [ci skip] 2019-07-17 12:34:13 +02:00
svlandeg
4086c6ff60 get vector functionality + unit test 2019-07-17 12:17:02 +02:00
Ines Montani
62ff128888 Add regression test for #3951 2019-07-16 14:00:00 +02:00
Ines Montani
7f551050b1 Add regression test for #3972 2019-07-16 13:07:35 +02:00
svlandeg
a63d15a142 code cleanup 2019-07-15 17:36:43 +02:00
svlandeg
cdc589d344 small fix 2019-07-15 12:04:45 +02:00
svlandeg
60f299374f set default context width 2019-07-15 12:03:09 +02:00
svlandeg
6e809e9b8b proper error for missing cfg arguments 2019-07-15 11:42:50 +02:00
svlandeg
6026958957 tokenizer doc fix 2019-07-15 11:19:34 +02:00
Ines Montani
c0e29f7029 Merge pull request #3957 from sorenlind/danish-tokenizer-slash
Make Danish tokenizer split on forward slash
2019-07-12 18:19:22 +02:00
Matthew Honnibal
ef666656b3 Fix attrs alignment 2019-07-12 17:59:47 +02:00
Matthew Honnibal
c345c042b0 Fix symbol alignment 2019-07-12 17:48:38 +02:00
Ines Montani
7281026879 Increment version [ci skip] 2019-07-12 17:40:00 +02:00
Søren Lind Kristiansen
26aee70d95 Make Danish tokenizer split on forward slash 2019-07-12 15:20:42 +02:00
Matthew Honnibal
3bc4d618f9 Set version to v2.1.5 2019-07-12 13:26:12 +02:00
Sofie Van Landeghem
ed774cb953 Fixing ngram bug (#3953)
* minimal failing example for Issue #3661

* referenced Issue #3661 instead of Issue #3611

* cleanup
2019-07-12 10:01:35 +02:00
Matthew Honnibal
09dc01a426 Fix #3853, and add warning 2019-07-11 14:46:47 +02:00
Matthew Honnibal
7369949d2e Add warning for #3853 2019-07-11 14:46:47 +02:00
Ines Montani
673c864a06 Fix doc.count_by functionality (#3950)
Fix doc.count_by functionality
2019-07-11 13:44:00 +02:00
Ines Montani
2426f4d44c Fix default punctuation rules for splitting Hindi text (#3948)
Fix default punctuation rules for splitting Hindi text

Co-authored-by: yash <patadiayash@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-07-11 13:36:28 +02:00
svlandeg
349107daa3 cleanup 2019-07-11 13:09:22 +02:00
svlandeg
0f0f07318a counter instead of preshcounter 2019-07-11 13:05:53 +02:00
Matthew Honnibal
b40b4c2c31 💫 Fix issue #3839: Incorrect entity IDs from Matcher with operators (#3949)
* Add regression test for issue #3541

* Add comment on bugfix

* Remove incorrect test

* Un-xfail test
2019-07-11 12:55:11 +02:00
Matthew Honnibal
e19f4ee719 Add warning message re Issue #3853 2019-07-11 12:50:38 +02:00
Ines Montani
197cfd7ebc Merge branch 'master' into pr/3948 2019-07-11 12:18:31 +02:00
Ines Montani
d166756607 Fix test 2019-07-11 12:16:43 +02:00
Ines Montani
0b8406a05c Tidy up and auto-format 2019-07-11 12:02:25 +02:00
yash
6751af3e78 Merge branch 'master' of https://github.com/yash1994/spaCy 2019-07-11 15:26:57 +05:30
yash
ae2d52e323 Add default encoding utf-8 for test file 2019-07-11 15:26:27 +05:30
Ines Montani
33ca0a036a Merge branch 'master' into pr/3948 2019-07-11 11:55:54 +02:00
Matthew Honnibal
0491a8e7c8 Reformat 2019-07-11 11:49:36 +02:00
Matthew Honnibal
bd3c3f342b Fix _serialize 2019-07-11 11:48:55 +02:00
yash
815f8d13dd Fix default punctuation rules for hindi text (#3625 explosion) 2019-07-11 15:00:51 +05:30
yash
d5311b3c42 Add test file for issue (#3625) and spacy contributor agreement 2019-07-11 14:53:14 +05:30
svlandeg
e080412385 tracked the bug down to PreshCounter.inc - still unclear what goes wrong 2019-07-11 01:53:06 +02:00
svlandeg
a89fecce97 failing unit test for issue #3869 2019-07-11 00:43:55 +02:00
Matthew Honnibal
a388888074 Merge branch 'master' of https://github.com/explosion/spaCy 2019-07-10 22:54:17 +02:00
Matthew Honnibal
c6cb782758 Set version to 2.1.5.dev0 2019-07-10 22:54:09 +02:00
Sofie Van Landeghem
c4c21cb428 more friendly textcat errors (#3946)
* more friendly textcat errors with require_model and require_labels

* update thinc version with recent bugfix
2019-07-10 19:39:38 +02:00
Matthew Honnibal
b94c5443d9 Rename Binder->DocBox, and improve it. 2019-07-10 19:37:20 +02:00
Matthew Honnibal
3d18600c05 Return True from doc.is_... when no ambiguity
* Make doc.is_sentenced return True if len(doc) < 2.

* Make doc.is_nered return True if len(doc) == 0, for consistency.

Closes #3934
2019-07-10 19:21:42 +02:00
Matthew Honnibal
465456edb9 Un-xfail test #3880 2019-07-10 14:01:17 +02:00
Matthew Honnibal
87f7ec34d5 Add test for #3880 2019-07-10 13:53:55 +02:00
Ines Montani
4e04080b76 Only compare sorted patterns in test
Try to work around flaky tests on Python 3.5
2019-07-10 13:00:52 +02:00
Ines Montani
82045aac8a Merge regression tests 2019-07-10 12:49:18 +02:00
Ines Montani
40cd03fc35 Improve EntityRuler serialization 2019-07-10 12:25:45 +02:00
Ines Montani
570ab1f481 Fix handling of old entity ruler files
Expected an `entity_ruler.jsonl` file in the top-level model directory, so the path passed to from_disk by default (model path plus component name), but with the suffix ".jsonl".
2019-07-10 12:14:12 +02:00
Ines Montani
874d914a44 Tidy up test 2019-07-10 12:13:23 +02:00
Ines Montani
ea2050079b Auto-format 2019-07-10 12:03:05 +02:00
Ines Montani
6ba5ddbd5f Merge pull request #3864 from svlandeg/feature/nel-wiki
Entity linking using Wikipedia & Wikidata
2019-07-10 11:25:41 +02:00
Ines Montani
8721849423 Update Scorer.ents_per_type 2019-07-10 11:19:28 +02:00
Björn Böing
205c73a589 Update tokenizer and doc init example (#3939)
* Fix Doc.to_json hyperlink

* Update tokenizer and doc init examples

* Change "matchin rules" to "punctuation rules"

* Auto-format
2019-07-10 10:16:48 +02:00
cedar101
58f06e6180 Korean support (#3901)
* start lang/ko

* add test codes

* using natto-py

* add test_ko_tokenizer_full_tags()

* spaCy contributor agreement

* external dependency for ko

* collections.namedtuple for python version < 3.5

* case fix

* tuple unpacking

* add jongseong(final consonant)

* apply mecab option

* Remove Pipfile for now


Co-authored-by: Ines Montani <ines@ines.io>
2019-07-09 22:23:16 +02:00
Ines Montani
f2ea3e3ea2 Merge branch 'master' into feature/nel-wiki 2019-07-09 21:57:47 +02:00
Ines Montani
547464609d Remove merge_subtokens from parser postprocessing for now 2019-07-09 21:50:30 +02:00
Björn Böing
04982ccc40 Update pretrain to prevent unintended overwriting of weight fil… (#3902)
* Update pretrain to prevent unintended overwriting of weight files for #3859

* Add '--epoch-start' to pretrain docs

* Add missing pretrain arguments to bash example

* Update doc tag for v2.1.5
2019-07-09 21:48:30 +02:00
Alejandro Alcalde
6d577f0b92 Evaluation of NER model per entity type, closes #3490 (#3911)
* Evaluation of NER model per entity type, closes #3490

Now each ent score is tracked individually in order to have its own Precision, Recall and F1 Score

* Keep track of each entity individually using dicts

* Improving how to compute the scores for each entity

* Fixed bug computing scores for ents

* Formatting with black

* Added key ents_per_type to the scores function

The key `ents_per_type` contains the metrics Precision, Recall and F1-Score for each entity individually
2019-07-09 20:54:59 +02:00
Joshua Smith
2eb925bd05 Added an argument to EntityRuler constructor to pass attrs to… (#3919)
* Preserve flags in EntityRuler

The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized.  This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.

* add signed contributor agreement

* flake8 cleanup

mostly blank line issues.

* mark test from the issue as needing a model

The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.

* Adds `phrase_matcher_attr` to allow args to PhraseMatcher

This is an added arg to pass to the `PhraseMatcher`. For example,
this allows creation of a case insensitive phrase matcher when the
`EntityRuler` is created.  References explosion/spaCy#3822

* remove unneeded model loading

The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existing fixtures)

* updated docstring for new argument

* updated docs to reflect new argument to the EntityRuler constructor

* change tempdir handling to be compatible with python 2.7

* return conflicted code to entityruler

Some stuff got cut out because of merge conflicts, this
returns that code for the phrase_matcher_attr.

* fixed typo in the code added back after conflicts

* flake8 compliance

When I deconflicted the branch there were some flake8 issues
introduced. This resolves the spacing problems.

* test changes:  attempts to fix flaky test in python3.5

These tests seem to be a little flaky in 3.5 so I changed the check to avoid
the comparisons that seem to fail sometimes.
2019-07-09 20:09:17 +02:00
Joshua Smith
e8420ab2b7 Added support for serializing overwrite and ent_id_sep (#3918)
* Preserve flags in EntityRuler

The EntityRuler (explosion/spaCy#3526) does not preserve
overwrite flags (or `ent_id_sep`) when serialized.  This
commit adds support for serialization/deserialization preserving
overwrite and ent_id_sep flags.

* add signed contributor agreement

* flake8 cleanup

mostly blank line issues.

* mark test from the issue as needing a model

The test from the issue needs some language model for serialization
but the test wasn't originally marked correctly.

* remove unneeded model loading

The model didn't need to be loaded, and I replaced it with
a change that doesn't require it (using existing fixtures)

* change tempdir handling to be compatible with python 2.7

* Adds code to handle item saved before this change.

This code changes how the save files are handled and how the bytes
are stored as well.  This code adds a check to dispatch correctly
if it encounters bytes or files saved in the old format (and tests
for those cases).

* use util function for tempdir management

Updated after PR comments: this code now uses the make_tempdir function from util
instead of doing it by hand.
2019-07-08 17:28:28 +02:00
Knut O. Hellan
a54f0cfc2b Norwegian tweaks (#3894)
* Norwegian fix

Add support for alternative past tense verb form (vaska).

* Norwegian months

Add all Norwegian months to tokenizer exceptions.

* More Norwegian abbreviations

Add more Norwegian abbreviations to tokenizer_exceptions.

* Contributor agreement khellan

Add signed contributor agreement for khellan (Knut O. Hellan).
2019-07-08 10:28:47 +02:00
Rokas Ramanauskas
61ce126d4c Lithuanian language support (#3895)
* initial LT lang support

* Added more stopwords. Started setting up some basic test environment (not complete)

* Initial morph rules for LT lang

* Closes #1 Adds tokenizer exceptions for Lithuanian

* Closes #5 Punctuation rules. Closes #6 Lexical Attributes

* test: add native examples to basic tests

* feat: add tag map for lt lang

* fix: remove undefined tag attribute 'Definite'

* feat: add lemmatizer for lt lang

* refactor: add new instances to lt lang morph rules; use tags from tag map

* refactor: add morph rules to lt lang defaults

* refactor: only keep nouns, verbs, adverbs and adjectives in lt lang lemmatizer lookup

* refactor: add capitalized words to lt lang lemmatizer

* refactor: add more num words to lt lang lex attrs

* refactor: update lt lang stop word set

* refactor: add new instances to lt lang tokenizer exceptions

* refactor: remove comments from lt lang init file

* refactor: use function instead of lambda in lt lex lang getter

* refactor: remove conversion to dict in lt init when dict is already provided

* chore: rename lt 'test_basic' to 'test_text'

* feat: add more lt text tests

* feat: add lemmatizer tests

* refactor: remove unused imports, add newline to end of file

* chore: add contributor agreement

* chore: change 'en' to 'lt' in lt example description

* fix: add missing encoding info

* style: add newline to end of file

* refactor: use python2 compatible syntax

* style: reformat code using black
2019-07-08 10:25:22 +02:00
svlandeg
0ea52c86b8 remove redundancy 2019-07-03 15:02:10 +02:00
svlandeg
668b17ea4a deuglify kb deserializer 2019-07-03 15:00:42 +02:00
svlandeg
8840d4b1b3 fix for context encoder optimizer 2019-07-03 13:35:36 +02:00
svlandeg
2d2dea9924 experiment with adding NER types to the feature vector 2019-06-29 14:52:36 +02:00
svlandeg
c664f58246 adding prior probability as feature in the model 2019-06-28 16:22:58 +02:00
svlandeg
1c80b85241 fix tests 2019-06-28 08:59:23 +02:00
svlandeg
68a0662019 context encoder with Tok2Vec + linking model instead of cosine 2019-06-28 08:29:31 +02:00
Ines Montani
4f1dae1c6b Update languages and examples (see #1107) 2019-06-26 16:19:17 +02:00
svlandeg
dbc53b9870 rename to KBEntryC 2019-06-26 15:55:26 +02:00
Ines Montani
37f744ca00 Auto-format [ci skip] 2019-06-26 14:48:09 +02:00
Ines Montani
6ccdf37574 Exclude user_data when copying doc in displaCy (closes #3882) 2019-06-26 14:37:05 +02:00
svlandeg
1de61f68d6 improve speed of prediction loop 2019-06-26 13:53:10 +02:00
svlandeg
bee23cd8af try Tok2Vec instead of SpacyVectors 2019-06-25 16:09:22 +02:00
svlandeg
8608685543 ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
svlandeg
58a5b40ef6 clean up duplicate code 2019-06-24 15:19:58 +02:00
svlandeg
ddc73b11a9 fix unicode literals 2019-06-24 12:58:18 +02:00
svlandeg
f4af47ce4a Merge branch 'feature/nel-wiki' of https://github.com/svlandeg/spaCy into feature/nel-wiki 2019-06-24 10:57:07 +02:00
svlandeg
b58bace84b small fixes 2019-06-24 10:55:04 +02:00
Ines Montani
c833d9b314 Add "v.s." to English tokenizer exceptions (see #3868) 2019-06-20 17:48:45 +02:00
Ines Montani
ae2c208735 Auto-format [ci skip] 2019-06-20 10:36:38 +02:00
Ines Montani
872121955c Update error code 2019-06-20 10:35:51 +02:00
Ines Montani
e1be80e3ec Merge branch 'master' into pr/3864 2019-06-20 10:35:37 +02:00
Björn Böing
ebf5a04d6c Update pretrain docs and add unsupported loss_func error (#3860)
* Add error to `get_vectors_loss` for unsupported loss function of `pretrain`

* Add missing "--loss-func" argument to pretrain docs. Update pretrain plac annotations to match docs.

* Add missing quotation marks
2019-06-20 10:30:44 +02:00