6287 Commits

Author SHA1 Message Date
adrianeboyd
a58cb023d7 WIP: Extending debug-data (#4114)
* Extending debug-data with dependency checks, etc.

* Modify debug-data to load with GoldCorpus to iterate over .json/.jsonl
files within directories

* Add GoldCorpus iterator train_docs_without_preprocessing to load
original train docs without shuffling and projectivizing

* Report number of misaligned tokens

* Add more dependency checks and messages

* Update spacy/cli/debug_data.py

Co-Authored-By: Ines Montani <ines@ines.io>

* Fixed conflict

* Move counts to _compile_gold()

* Move all dependency nonproj/sent/head/cycle counting to
_compile_gold()

* Unclobber previous merges

* Update variable names

* Update more variable names, fix misspelling

* Don't clobber loading error messages

* Only warn about misaligned tokens if present
2019-08-16 10:52:46 +02:00
Ziming He
eea7d4f4a8 biluo_tags_from_offsets throws exception for overlapping entities (#4021)
* Check whether two entities overlap

- biluo_gold_biluo_overlap now throws an exception when the entities passed in overlap
- added unit test

* SCA agreement
2019-08-15 18:13:32 +02:00
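
The overlap check described in eea7d4f4a8 can be sketched in plain Python. This is only an illustration, not the code from the commit: the helper names `find_overlapping_entities` and `check_no_overlap` are hypothetical, entities are assumed to be `(start, end, label)` tuples with exclusive end offsets, and the error text is made up.

```python
def find_overlapping_entities(entities):
    """Return the first pair of overlapping (start, end, label) spans, or None."""
    # Sort by start offset so that only neighbouring spans need comparing.
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    for previous, current in zip(ordered, ordered[1:]):
        # End offsets are assumed exclusive, so touching spans do not overlap.
        if current[0] < previous[1]:
            return previous, current
    return None


def check_no_overlap(entities):
    """Raise ValueError if any two entity spans overlap (hypothetical helper)."""
    clash = find_overlapping_entities(entities)
    if clash is not None:
        raise ValueError("Overlapping entities: {} and {}".format(*clash))
```

Comparing neighbours after sorting is enough: if any two spans overlap, some adjacent pair in start order overlaps as well.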
adrianeboyd
2f9b28c218 Provide more info in cycle error message E069 (#4123)
Provide the tokens in the cycle and the first 50 tokens from document in
the error message so it's easier to track down the location of the cycle
in the data.

Addresses feature request in #3698.
2019-08-15 18:08:28 +02:00
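
A rough sketch of the idea in 2f9b28c218: find the tokens that form a cycle and report them together with the first 50 tokens of the document. It assumes heads are given as absolute token indices and that a root points at itself. The function names and the message wording are hypothetical and do not reproduce the actual E069 text.

```python
def find_cycle(heads):
    """Return the token indices of one cycle in `heads`, or None.

    heads[i] is the index of token i's head; a root is assumed to point at itself.
    """
    for start in range(len(heads)):
        path = []
        seen = set()
        node = start
        while node not in seen:
            seen.add(node)
            path.append(node)
            if heads[node] == node:  # reached a root: no cycle on this path
                break
            node = heads[node]
        else:
            # The loop ended because `node` was revisited: the cycle starts there.
            return path[path.index(node):]
    return None


def cycle_error_message(words, heads, max_context=50):
    """Build an illustrative message naming the cycle and the document start."""
    cycle = find_cycle(heads)
    if cycle is None:
        return None
    cycle_tokens = [words[i] for i in cycle]
    return "Found cycle among tokens {}; document starts with: {}".format(
        cycle_tokens, words[:max_context]
    )
```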
AJ Rader
2f3648700c Correction of default lemmatizer lookup in English (Issue # 4104) (#4110)
* pytest file for issue4104 established

* edited default lookup English lemmatizer for spun; fixes issue 4102

* eliminated parameterization and sorted dictionary dependency in issue 4104 test

* added contributor agreement
2019-08-15 11:39:10 +02:00
Ines Montani
1711b5eb62 💫 Support displaCy user colors via entry point (#4113) 2019-08-13 15:59:55 +02:00
Sofie Van Landeghem
0ba1b5eebc CLI scripts for entity linking (wikipedia & generic) (#4091)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bools instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustments to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* turn kb_creator into CLI script (wip)

* proper parameters for training entity vectors

* wikidata pipeline split up into two executable scripts

* remove context_width

* move wikidata scripts in bin directory, remove old dummy script

* refine KB script with logs and preprocessing options

* small edits

* small improvements to logging of EL CLI script
2019-08-13 15:38:59 +02:00
黎谢鹏
250a54414b update lang/zh (#4103)
* update lang/zh

* update lang/zh
2019-08-12 10:37:48 +02:00
Sofie Van Landeghem
963ea5e8d0 Update lemma and vector information after splitting a token (#4097)
* fixing vector and lemma attributes after retokenizer.split

* fixing unit test with mockup tensor

* xp instead of numpy
2019-08-08 15:09:44 +02:00
Matthew Honnibal
04113a844d Set version to v2.1.8 2019-08-07 13:53:58 +02:00
Ines Montani
6bec24cdd0 Require downloaded model in pkg_resources (#4090) 2019-08-07 13:18:11 +02:00
adrianeboyd
69aca7d839 Add validate option to EntityRuler (#4089)
* Add validate option to EntityRuler

* Add validate to EntityRuler, passed to Matcher and PhraseMatcher

* Add validate to usage and API docs

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <ines@ines.io>

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-07 00:40:53 +02:00
Jeno
15be09ceb0 Raise error if annotation dict in simple training style has unexpected keys #4074 (#4079)
* adding enhancement #4074.

* modified behavior to strictly require top level dictionary keys - issue #4074

* pass expected keys to error message and add links as expected top level key
2019-08-06 11:01:25 +02:00
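
The strict key check from 15be09ceb0 might look roughly like the following. The tuple of expected keys is an assumption (the commit message only confirms that `links` is among them), and `validate_annotations` and its error text are hypothetical.

```python
# Assumed set of allowed top-level keys; only "links" is confirmed by the commit message.
EXPECTED_KEYS = ("words", "tags", "heads", "deps", "entities", "cats", "links")


def validate_annotations(annotations):
    """Raise ValueError if a simple-style annotation dict has unexpected keys."""
    unexpected = sorted(set(annotations) - set(EXPECTED_KEYS))
    if unexpected:
        # Name both the offending and the expected keys, as the commit describes.
        raise ValueError(
            "Unexpected keys {} in annotations; expected keys: {}".format(
                unexpected, list(EXPECTED_KEYS)
            )
        )
    return annotations
```

A misspelled key such as `"entites"` would then fail loudly instead of being silently ignored during training.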
Sofie Van Landeghem
ad09b0d6f3 fetch norm from lex if necessary for matching (#4080) 2019-08-05 23:51:04 +02:00
Pavle Vidanović
e1a935d71c Stopwords for Serbian language. (#4078)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated
2019-08-05 10:22:27 +02:00
veer-bains
874bd8c8dd Fixed syntax error in lang/ko when using python 2 (#4082) (closes #4068)
* fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py

* fixed syntax error in declaring variables with python 2.7 in spacy/lang/ko/__init__.py

* Update __init__.py

* Create veer-bains.md

* Update __init__.py

fixed syntax errors in variable datatype assignment when calling spacy.blank("ko") with python 2.7
2019-08-05 10:19:32 +02:00
Ines Montani
87ddbdc33e Fix handling of kwargs in Language.evaluate
Makes it consistent with other methods
2019-08-04 13:44:21 +02:00
Muhammad Irfan
d1d30b0442 added missing punctuation following conventions. (#4066) 2019-08-04 13:41:18 +02:00
Anastassia
33b14724a5 Update gold corpus code to properly ingest a directory of jsonl… (#4067)
* Update gold corpus code to properly ingest a directory of jsonlines files

In response to: https://github.com/explosion/spaCy/issues/3975

* Update spacy/gold.pyx

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-02 09:58:51 +02:00
Matthew Honnibal
944a66c326 Add span.tensor and token.tensor attributes 2019-08-01 18:30:50 +02:00
Matthew Honnibal
d3071ecdbc Set version to v2.1.7 2019-08-01 18:09:19 +02:00
Matthew Honnibal
97c51ef93b Set version to v2.1.7.dev1 2019-08-01 17:29:25 +02:00
Matthew Honnibal
4632c597e7 Fix Pipe base class 2019-08-01 17:29:01 +02:00
Ines Montani
8718ca8b1f Fix init_model if there's no vocab (closes #4048) (#4049) 2019-08-01 17:26:09 +02:00
adrianeboyd
925a852bb6 Improve NER per type scoring (#4052)
* Improve NER per type scoring

* include all gold labels in per type scoring, not only when recall > 0
* improve efficiency of per type scoring

* Create Scorer tests, initially with NER tests

* move regression test #3968 (per type NER scoring) to Scorer tests

* add new test for per type NER scoring with imperfect P/R/F and per
type P/R/F including a case where R == 0.0
2019-08-01 17:15:36 +02:00
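
To illustrate the scoring change in 925a852bb6, here is a minimal sketch of per-type precision, recall and F-score that reports every label present in the gold data, including labels with a recall of 0.0. It is not the Scorer implementation: `per_type_scores` is a hypothetical function, and entities are assumed to be `(start, end, label)` tuples compared by exact match.

```python
def per_type_scores(gold, pred):
    """Compute P/R/F per entity label over sets of (start, end, label) tuples."""
    scores = {}
    # Iterate over every gold label so that labels with R == 0.0 are still reported.
    for label in sorted({ent[2] for ent in gold}):
        gold_l = {ent for ent in gold if ent[2] == label}
        pred_l = {ent for ent in pred if ent[2] == label}
        tp = len(gold_l & pred_l)
        p = tp / len(pred_l) if pred_l else 0.0
        r = tp / len(gold_l)  # gold_l is never empty for a gold label
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores
```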
Sofie Van Landeghem
f7d950de6d ensure the lang of vocab and nlp stay consistent (#4057)
* ensure the language of vocab and nlp stay consistent across serialization

* equality with =
2019-08-01 17:13:01 +02:00
Sofie Van Landeghem
7de3b129ab Resolve edge case when calling textcat.predict with empty doc (#4035)
* resolve edge case where no doc has tokens when calling textcat.predict

* more explicit value test
2019-07-30 14:58:01 +02:00
Matthew Honnibal
89c92c65fb Update version 2019-07-28 17:56:38 +02:00
Matthew Honnibal
06eb428ed1 Make pipe base class a bit less presumptuous 2019-07-28 17:56:11 +02:00
Matthew Honnibal
16b5144095 Don't raise NotImplemented in Pipe.update 2019-07-28 17:54:11 +02:00
Ines Montani
fc69da0acb 💫 Support simple training format in nlp.evaluate and add tests (#4033)
* Support simple training format in nlp.evaluate and add tests

* Update docs [ci skip]
2019-07-27 17:30:18 +02:00
Ines Montani
a3723f439c Fix formatting [ci skip] 2019-07-27 16:35:42 +02:00
Ines Montani
d5bce35fb1 Fix bug in Span.similarity when called via hook 2019-07-27 15:33:27 +02:00
Ines Montani
109b5e1798 Fix bug in Token.similarity when called via hook 2019-07-27 15:26:01 +02:00
Ines Montani
e000b5ed82 Also support "requirements" in model.json 2019-07-27 13:34:57 +02:00
Ines Montani
307ffe472d Support custom language factory setting in meta.json (#4031) 2019-07-27 13:17:43 +02:00
Bae Yong-Ju
05fbf5d976 Fix error when Korean text contains regexp special characters. (#4022) 2019-07-25 17:53:33 +02:00
Matthew Honnibal
73e095923f 💫 Improve error message when model.from_bytes() dies (#4014)
* Improve error message when model.from_bytes() dies

When Thinc's model.from_bytes() is called with a mismatched model, often
we get a particularly ungraceful error,

e.g. "AttributeError: FunctionLayer has no attribute G"

This is because we're trying to load the parameters for something like
a LayerNorm layer, and the model architecture has some other layer there
instead. This is obviously terrible, especially since the error *type*
is wrong.

I've changed it to raise a ValueError. The error message is still
probably a bit terse, but it's hard to be sure exactly what's gone
wrong.

* Update spacy/pipeline/pipes.pyx

* Update spacy/pipeline/pipes.pyx

* Update spacy/pipeline/pipes.pyx

* Update spacy/syntax/nn_parser.pyx

* Update spacy/syntax/nn_parser.pyx

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

* Update spacy/pipeline/pipes.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>


Co-authored-by: Ines Montani <ines@ines.io>
2019-07-24 11:27:34 +02:00
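
The change described in 73e095923f amounts to turning a misleading AttributeError into a ValueError. One way to do that is sketched below, assuming the conversion happens in a wrapper around `from_bytes`. `MismatchedLayer` and `load_weights` are hypothetical stand-ins, not Thinc or spaCy code, and the message text is invented.

```python
class MismatchedLayer:
    """Hypothetical stand-in for a model whose architecture does not fit the data."""

    def from_bytes(self, data):
        # Mimics the ungraceful failure quoted in the commit message.
        raise AttributeError("FunctionLayer has no attribute G")


def load_weights(model, data):
    """Call model.from_bytes and re-raise attribute errors as ValueError."""
    try:
        return model.from_bytes(data)
    except AttributeError as err:
        raise ValueError(
            "Could not load model weights; the architecture may not match "
            "the saved parameters ({})".format(err)
        )
```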
Ines Montani
87fcf3141c Merge pull request #4003 from svlandeg/feature/nel-fixes
API changes for Entity linking functionality
2019-07-23 23:17:07 +02:00
Paul O'Leary McCann
c8949ce88a Remove old comment (#4012)
Norwegian used to borrow from French but that doesn't appear to have
been true for a while now, so the comment that was here is no longer
relevant.
2019-07-23 23:10:06 +02:00
Sofie Van Landeghem
ba02957c80 Fix dependency copy for as_doc (#3969)
* failing unit test for issue 3962

* attempt to fix Issue #3962

* create artificial unit test example

* using length instead of self.length

* sp

* reformat with black

* find better ancestor within span and use generic 'dep'

* attach to span.root if there is no appropriate ancestor

* comment span text

* clean up ancestor code

* reconstruct dep tree to keep same number of sentences
2019-07-23 18:28:54 +02:00
svlandeg
4e7ec1ed31 return fix 2019-07-23 14:23:58 +02:00
svlandeg
400ff342cf replace asserts with custom error messages 2019-07-23 11:52:48 +02:00
svlandeg
20389e4553 format and bugfix 2019-07-22 15:08:17 +02:00
svlandeg
b1911f7105 Errors.E146 for IO error when FP is null 2019-07-22 14:56:13 +02:00
svlandeg
5d544f89ba Errors.E145 for IO errors when reading KB 2019-07-22 14:36:07 +02:00
Ines Montani
a32b033b8c Add regression test for #4002
Test that the PhraseMatcher can match on overwritten NORM attributes.
2019-07-22 14:18:24 +02:00
svlandeg
ad65171837 Merge remote-tracking branch 'upstream/master' into feature/nel-fixes 2019-07-22 13:41:28 +02:00
svlandeg
76184374e2 test corner cases 2019-07-22 13:39:32 +02:00
svlandeg
9f8c1e71a2 fix for Issue #4000 2019-07-22 13:34:12 +02:00
svlandeg
dae8a21282 rename entity frequency 2019-07-19 17:40:28 +02:00
svlandeg
41fb5204ba output tensors as part of predict 2019-07-19 14:47:36 +02:00
svlandeg
21176517a7 have gold.links correspond exactly to doc.ents 2019-07-19 12:36:15 +02:00
BreakBB
3e370cf2ba Add 'Prof.' to English tokenizer_exceptions 2019-07-19 10:00:45 +02:00
svlandeg
e1213eaf6a use original gold object in get_loss function 2019-07-18 13:35:10 +02:00
svlandeg
ec55d2fccd filter training data beforehand (+black formatting) 2019-07-18 10:22:24 +02:00
Falak Asad
ff1e73e35c Bugfix/issue 3968 (#3982)
* Fix for issue-3968

* Added contributor agreement

* Made suggested changes
2019-07-18 00:20:32 +02:00
svlandeg
d833d4c358 fixes in kb and gold 2019-07-17 17:18:26 +02:00
Ines Montani
73565c6d9d Rename function arguments 2019-07-17 14:29:52 +02:00
Matthew Honnibal
394e4d8058 Add docstring for spacy.gold.align 2019-07-17 13:59:17 +02:00
Ines Montani
073013f129 Auto-format [ci skip] 2019-07-17 12:34:13 +02:00
svlandeg
4086c6ff60 get vector functionality + unit test 2019-07-17 12:17:02 +02:00
Ines Montani
62ff128888 Add regression test for #3951 2019-07-16 14:00:00 +02:00
Ines Montani
7f551050b1 Add regression test for #3972 2019-07-16 13:07:35 +02:00
svlandeg
a63d15a142 code cleanup 2019-07-15 17:36:43 +02:00
svlandeg
cdc589d344 small fix 2019-07-15 12:04:45 +02:00
svlandeg
60f299374f set default context width 2019-07-15 12:03:09 +02:00
svlandeg
6e809e9b8b proper error for missing cfg arguments 2019-07-15 11:42:50 +02:00
svlandeg
6026958957 tokenizer doc fix 2019-07-15 11:19:34 +02:00
Ines Montani
c0e29f7029 Merge pull request #3957 from sorenlind/danish-tokenizer-slash
Make Danish tokenizer split on forward slash
2019-07-12 18:19:22 +02:00
Matthew Honnibal
ef666656b3 Fix attrs alignment 2019-07-12 17:59:47 +02:00
Matthew Honnibal
c345c042b0 Fix symbol alignment 2019-07-12 17:48:38 +02:00
Ines Montani
7281026879 Increment version [ci skip] 2019-07-12 17:40:00 +02:00
Søren Lind Kristiansen
26aee70d95 Make Danish tokenizer split on forward slash 2019-07-12 15:20:42 +02:00
Matthew Honnibal
3bc4d618f9 Set version to v2.1.5 2019-07-12 13:26:12 +02:00
Sofie Van Landeghem
ed774cb953 Fixing ngram bug (#3953)
* minimal failing example for Issue #3661

* referenced Issue #3661 instead of Issue #3611

* cleanup
2019-07-12 10:01:35 +02:00
Matthew Honnibal
09dc01a426 Fix #3853, and add warning 2019-07-11 14:46:47 +02:00
Matthew Honnibal
7369949d2e Add warning for #3853 2019-07-11 14:46:47 +02:00
Ines Montani
673c864a06 Fix doc.count_by functionality (#3950)
Fix doc.count_by functionality
2019-07-11 13:44:00 +02:00
Ines Montani
2426f4d44c Fix default punctuation rules for splitting Hindi text (#3948)
Fix default punctuation rules for splitting Hindi text

Co-authored-by: yash <patadiayash@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-07-11 13:36:28 +02:00
svlandeg
349107daa3 cleanup 2019-07-11 13:09:22 +02:00
svlandeg
0f0f07318a counter instead of preshcounter 2019-07-11 13:05:53 +02:00
Matthew Honnibal
b40b4c2c31 💫 Fix issue #3839: Incorrect entity IDs from Matcher with operators (#3949)
* Add regression test for issue #3541

* Add comment on bugfix

* Remove incorrect test

* Un-xfail test
2019-07-11 12:55:11 +02:00
Matthew Honnibal
e19f4ee719 Add warning message re Issue #3853 2019-07-11 12:50:38 +02:00
Ines Montani
197cfd7ebc Merge branch 'master' into pr/3948 2019-07-11 12:18:31 +02:00
Ines Montani
d166756607 Fix test 2019-07-11 12:16:43 +02:00
Ines Montani
0b8406a05c Tidy up and auto-format 2019-07-11 12:02:25 +02:00
yash
6751af3e78 Merge branch 'master' of https://github.com/yash1994/spaCy 2019-07-11 15:26:57 +05:30
yash
ae2d52e323 Add default encoding utf-8 for test file 2019-07-11 15:26:27 +05:30
Ines Montani
33ca0a036a Merge branch 'master' into pr/3948 2019-07-11 11:55:54 +02:00
Matthew Honnibal
0491a8e7c8 Reformat 2019-07-11 11:49:36 +02:00
Matthew Honnibal
bd3c3f342b Fix _serialize 2019-07-11 11:48:55 +02:00
yash
815f8d13dd Fix default punctuation rules for hindi text (#3625 explosion) 2019-07-11 15:00:51 +05:30
yash
d5311b3c42 Add test file for issue (#3625) and spacy contributor agreement 2019-07-11 14:53:14 +05:30
svlandeg
e080412385 tracked the bug down to PreshCounter.inc - still unclear what goes wrong 2019-07-11 01:53:06 +02:00
svlandeg
a89fecce97 failing unit test for issue #3869 2019-07-11 00:43:55 +02:00
Matthew Honnibal
a388888074 Merge branch 'master' of https://github.com/explosion/spaCy 2019-07-10 22:54:17 +02:00
Matthew Honnibal
c6cb782758 Set version to 2.1.5.dev0 2019-07-10 22:54:09 +02:00
Sofie Van Landeghem
c4c21cb428 more friendly textcat errors (#3946)
* more friendly textcat errors with require_model and require_labels

* update thinc version with recent bugfix
2019-07-10 19:39:38 +02:00
Matthew Honnibal
b94c5443d9 Rename Binder->DocBox, and improve it. 2019-07-10 19:37:20 +02:00
Matthew Honnibal
3d18600c05 Return True from doc.is_... when no ambiguity
* Make doc.is_sentenced return True if len(doc) < 2.

* Make doc.is_nered return True if len(doc) == 0, for consistency.

Closes #3934
2019-07-10 19:21:42 +02:00
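
The edge cases from 3d18600c05 can be modelled in plain Python as follows. This is a simplified sketch rather than the Doc implementation: it assumes per-token integer flags where 0 means "no annotation set", and both function names are stand-ins operating on lists instead of a Doc.

```python
def is_sentenced(sent_starts):
    """True if sentence boundaries are set, or if the doc is too short to need any."""
    if len(sent_starts) < 2:
        return True  # zero or one token: nothing ambiguous about sentence boundaries
    # The first token always starts a sentence, so only later tokens are checked.
    return any(flag != 0 for flag in sent_starts[1:])


def is_nered(ent_iobs):
    """True if any entity annotation is set, or if the doc is empty."""
    if len(ent_iobs) == 0:
        return True  # empty doc: consistent with the other is_... checks
    return any(flag != 0 for flag in ent_iobs)
```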