Ines Montani
5073ce63fd
Merge branch 'spacy.io' [ci skip]
2019-03-22 15:17:11 +01:00
svlandeg
9751312aff
specify unicode strings for python 2.7
2019-03-22 14:15:18 +01:00
svlandeg
5318ce88fa
'entity_linker' instead of 'el'
2019-03-22 13:55:10 +01:00
svlandeg
ec3e860b44
Merge remote-tracking branch 'upstream/master' into feature/el-framework
2019-03-22 13:47:08 +01:00
Ines Montani
c9bd0e5a96
Set version to 2.1.2
2019-03-22 13:44:47 +01:00
svlandeg
12d4caf341
Merge remote-tracking branch 'upstream/master' into feature/el-framework
2019-03-22 13:44:36 +01:00
Matthew Honnibal
e65b5bb9a0
Fix tokenizer on Python2.7 (#3460)
spaCy v2.1 switched to the built-in re module, where v2.0 had been using
the third-party regex library. When the tokenizer was deserialized on
Python2.7, the `re.compile()` function was called with expressions that
featured escaped unicode codepoints that were not in Python2.7's unicode
database.
Problems occurred when we had a range between two of these unknown
codepoints, like this:
```
'[\\uAA77-\\uAA79]'
```
On Python2.7, the unknown codepoints are not unescaped correctly,
resulting in arbitrary out-of-range characters being matched by the
expression.
This problem does not occur if we instead have a range between two
unicode literals, rather than the escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also unescape non-unicode sequences.
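A minimal sketch of that approach, for illustration only. The function name `unescape_unicode_sketch` is hypothetical and this is not necessarily the code that was merged. It assumes the pattern contains no triple single quotes and no already-escaped backslash directly before a `u`:
```python
import ast


def unescape_unicode_sketch(pattern):
    """Turn escaped \\uXXXX / \\UXXXXXXXX sequences into unicode literals.

    Hypothetical helper: other backslash sequences (e.g. \\d, \\s) are
    left untouched so the regex keeps its meaning.
    """
    if pattern is None:
        return pattern
    # Protect every backslash by doubling it.
    protected = pattern.replace("\\", "\\\\")
    # Remove that protection again, but only for unicode escapes.
    protected = protected.replace("\\\\u", "\\u")
    protected = protected.replace("\\\\U", "\\U")
    # literal_eval only parses the string literal; it cannot run code.
    return ast.literal_eval("u'''" + protected + "'''")


# The range is now between two literal characters, while \d is preserved.
unescaped = unescape_unicode_sketch("[\\uAA77-\\uAA79]\\d+")
```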
Closes #3356.
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-22 13:42:47 +01:00
Ines Montani
c81923ee30
Update wasabi pin
2019-03-22 13:31:58 +01:00
Ines Montani
188ccd5750
Fix xfail marker
2019-03-22 12:54:14 +01:00
Ines Montani
7dd5e2f564
Update v2-1.md
2019-03-22 12:43:23 +01:00
svlandeg
7cf0bc9a8c
delete sandbox folder
2019-03-22 12:25:11 +01:00
svlandeg
5b1cd49222
error msg and unit tests for setting kb_id on span
2019-03-22 12:05:35 +01:00
svlandeg
3c9ac59ea0
Merge branch 'backup_el' of https://github.com/svlandeg/spaCy into backup_el
2019-03-22 11:43:52 +01:00
svlandeg
a48241e9a2
use nlp's vocab for stringstore
2019-03-22 11:36:45 +01:00
svlandeg
1ee0e78fd7
select candidate with highest prior probability
2019-03-22 11:36:45 +01:00
svlandeg
7b708ab8a4
name per entity
2019-03-22 11:36:45 +01:00
svlandeg
c593607ce2
minimal EL pipe
2019-03-22 11:36:45 +01:00
svlandeg
c71123dd0c
ensure no candidates are returned for unknown aliases
2019-03-22 11:36:45 +01:00
svlandeg
b6c3255a9f
Entity class
2019-03-22 11:36:45 +01:00
svlandeg
1289cd6e8f
property getters and keep track of KB internally
2019-03-22 11:36:45 +01:00
svlandeg
98ae77a682
unit test on number of candidates generated
2019-03-22 11:36:45 +01:00
svlandeg
9a46c431c3
store entity hash instead of pointer
2019-03-22 11:36:45 +01:00
svlandeg
9819dca80e
create candidate object from entry pointer (not fully functional yet)
2019-03-22 11:36:45 +01:00
svlandeg
a9074e0886
check the length of entities and probabilities vector + unit test
2019-03-22 11:36:45 +01:00
svlandeg
d133ffaff9
correct size, not counting dummy elements in the vector
2019-03-22 11:36:45 +01:00
svlandeg
33f8a0fe2e
check and unit test in case prior probs exceed 1
2019-03-22 11:36:45 +01:00
svlandeg
b55baaa1dc
avoid value 0 in preshmap and helpful user warnings
2019-03-22 11:36:45 +01:00
svlandeg
20a7b7b1c0
raising error when adding alias for unknown entity + unit test
2019-03-22 11:36:45 +01:00
svlandeg
8843f9279c
use StringStore
2019-03-22 11:36:45 +01:00
svlandeg
51560bf0ed
bugfix adding aliases
2019-03-22 11:36:45 +01:00
svlandeg
c4ba942765
get candidates by alias
2019-03-22 11:36:45 +01:00
svlandeg
151b855cc8
adding and retrieving aliases
2019-03-22 11:36:45 +01:00
svlandeg
cf34113250
very minimal KB functionality working
2019-03-22 11:36:44 +01:00
svlandeg
af281c5466
adding aliases per entity in the KB
2019-03-22 11:36:44 +01:00
svlandeg
f77b99c103
fix compile errors
2019-03-22 11:36:44 +01:00
svlandeg
27483f9080
add pyx and separate method to add aliases
2019-03-22 11:36:44 +01:00
svlandeg
feb71e15fd
hash the entity name
2019-03-22 11:36:44 +01:00
svlandeg
839dafa104
documented some comments and todos
2019-03-22 11:36:44 +01:00
svlandeg
7f37737878
kb snippet, draft by Matt (wip)
2019-03-22 11:36:44 +01:00
svlandeg
735fc2a735
annotate kb_id through ents in doc
2019-03-22 11:36:44 +01:00
svlandeg
d849eb2455
adding kb_id as field to token, el as nlp pipeline component
2019-03-22 11:34:46 +01:00
Matthew Honnibal
d811c97da1
Fix test that caused pytest to choke on Python3
2019-03-22 10:28:51 +01:00
Matthew Honnibal
a2ad9832e5
Add failing test for #3356
2019-03-22 02:42:37 +01:00
svlandeg
4820b43313
use nlp's vocab for stringstore
2019-03-21 23:17:25 +01:00
Matthew Honnibal
7ec64a36fd
Merge pull request #3455 from explosion/bugfix/fix-en-tag-map
💫 Bring English tag_map in line with UD Treebank
2019-03-21 21:19:30 +01:00
svlandeg
6e2433b95e
select candidate with highest prior probability
2019-03-21 18:55:01 +01:00
svlandeg
24a0c4a8d4
name per entity
2019-03-21 18:20:57 +01:00
svlandeg
d0c763ba44
minimal EL pipe
2019-03-21 17:33:25 +01:00
svlandeg
26afa4800f
ensure no candidates are returned for unknown aliases
2019-03-21 15:24:40 +01:00
Matthew Honnibal
c66bd61e88
Fix lemmas
2019-03-21 14:22:12 +01:00