svlandeg
9751312aff
specify unicode strings for python 2.7
2019-03-22 14:15:18 +01:00
svlandeg
5318ce88fa
'entity_linker' instead of 'el'
2019-03-22 13:55:10 +01:00
svlandeg
ec3e860b44
Merge remote-tracking branch 'upstream/master' into feature/el-framework
2019-03-22 13:47:08 +01:00
Ines Montani
c9bd0e5a96
Set version to 2.1.2
2019-03-22 13:44:47 +01:00
svlandeg
12d4caf341
Merge remote-tracking branch 'upstream/master' into feature/el-framework
2019-03-22 13:44:36 +01:00
Matthew Honnibal
e65b5bb9a0
Fix tokenizer on Python2.7 (#3460)
spaCy v2.1 switched to the built-in re module, where v2.0 had been using
the third-party regex library. When the tokenizer was deserialized on
Python2.7, the `re.compile()` function was called with expressions that
featured escaped unicode codepoints that were not in Python2.7's unicode
database.
Problems occurred when we had a range between two of these unknown
codepoints, like this:
```
'[\\uAA77-\\uAA79]'
```
On Python2.7, the unknown codepoints are not unescaped correctly,
resulting in arbitrary out-of-range characters being matched by the
expression.
This problem does not occur if we instead have a range between two
unicode literals, rather than the escape sequences. To fix the bug, we
therefore add a new compat function that unescapes unicode sequences
using the `ast.literal_eval()` function. Care is taken to ensure we
do not also unescape non-unicode sequences.
Closes #3356.
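A minimal sketch of the approach described above, not the actual spaCy code: the function name `unescape_unicode` and the regular expression used to find the escapes are assumptions made for illustration. It replaces each `\uXXXX` escape with the literal character via `ast.literal_eval()` and leaves other escapes (such as `\d` or an escaped backslash) untouched.

```
import ast
import re

# Matches either an escaped backslash (kept as-is) or a \uXXXX escape.
# Scanning left to right means "\\\\uAA77" is read as an escaped backslash
# followed by plain text, not as a unicode escape.
_ESCAPE_RE = re.compile(r"\\\\|\\u[0-9a-fA-F]{4}")


def unescape_unicode(pattern):
    """Hypothetical helper: turn \\uXXXX escapes into literal characters."""

    def _replace(match):
        text = match.group(0)
        if text == "\\\\":
            # Escaped backslash: not a unicode sequence, leave it alone.
            return text
        # Evaluate a tiny unicode string literal such as u"\uAA77".
        return ast.literal_eval('u"' + text + '"')

    return _ESCAPE_RE.sub(_replace, pattern)


# The range is now expressed with literal characters, not escape sequences.
compiled = re.compile(unescape_unicode("[\\uAA77-\\uAA79]"))
```

With the escapes resolved before `re.compile()` is called, the character class is built from the literal endpoints, so it no longer depends on how the `re` module interprets the escape sequences.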
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-22 13:42:47 +01:00
Ines Montani
188ccd5750
Fix xfail marker
2019-03-22 12:54:14 +01:00
svlandeg
7cf0bc9a8c
delete sandbox folder
2019-03-22 12:25:11 +01:00
svlandeg
5b1cd49222
error msg and unit tests for setting kb_id on span
2019-03-22 12:05:35 +01:00
svlandeg
a48241e9a2
use nlp's vocab for stringstore
2019-03-22 11:36:45 +01:00
svlandeg
1ee0e78fd7
select candidate with highest prior probability
2019-03-22 11:36:45 +01:00
svlandeg
7b708ab8a4
name per entity
2019-03-22 11:36:45 +01:00
svlandeg
c593607ce2
minimal EL pipe
2019-03-22 11:36:45 +01:00
svlandeg
c71123dd0c
ensure no candidates are returned for unknown aliases
2019-03-22 11:36:45 +01:00
svlandeg
b6c3255a9f
Entity class
2019-03-22 11:36:45 +01:00
svlandeg
1289cd6e8f
property getters and keep track of KB internally
2019-03-22 11:36:45 +01:00
svlandeg
98ae77a682
unit test on number of candidates generated
2019-03-22 11:36:45 +01:00
svlandeg
9a46c431c3
store entity hash instead of pointer
2019-03-22 11:36:45 +01:00
svlandeg
9819dca80e
create candidate object from entry pointer (not fully functional yet)
2019-03-22 11:36:45 +01:00
svlandeg
a9074e0886
check the length of entities and probabilities vector + unit test
2019-03-22 11:36:45 +01:00
svlandeg
d133ffaff9
correct size, not counting dummy elements in the vector
2019-03-22 11:36:45 +01:00
svlandeg
33f8a0fe2e
check and unit test in case prior probs exceed 1
2019-03-22 11:36:45 +01:00
svlandeg
b55baaa1dc
avoid value 0 in preshmap and helpful user warnings
2019-03-22 11:36:45 +01:00
svlandeg
20a7b7b1c0
raising error when adding alias for unknown entity + unit test
2019-03-22 11:36:45 +01:00
svlandeg
8843f9279c
use StringStore
2019-03-22 11:36:45 +01:00
svlandeg
51560bf0ed
bugfix adding aliases
2019-03-22 11:36:45 +01:00
svlandeg
c4ba942765
get candidates by alias
2019-03-22 11:36:45 +01:00
svlandeg
151b855cc8
adding and retrieving aliases
2019-03-22 11:36:45 +01:00
svlandeg
cf34113250
very minimal KB functionality working
2019-03-22 11:36:44 +01:00
svlandeg
af281c5466
adding aliases per entity in the KB
2019-03-22 11:36:44 +01:00
svlandeg
f77b99c103
fix compile errors
2019-03-22 11:36:44 +01:00
svlandeg
27483f9080
add pyx and separate method to add aliases
2019-03-22 11:36:44 +01:00
svlandeg
feb71e15fd
hash the entity name
2019-03-22 11:36:44 +01:00
svlandeg
839dafa104
documented some comments and todos
2019-03-22 11:36:44 +01:00
svlandeg
7f37737878
kb snippet, draft by Matt (wip)
2019-03-22 11:36:44 +01:00
svlandeg
735fc2a735
annotate kb_id through ents in doc
2019-03-22 11:36:44 +01:00
svlandeg
d849eb2455
adding kb_id as field to token, el as nlp pipeline component
2019-03-22 11:34:46 +01:00
Matthew Honnibal
d811c97da1
Fix test that caused pytest to choke on Python3
2019-03-22 10:28:51 +01:00
Matthew Honnibal
a2ad9832e5
Add failing test for #3356
2019-03-22 02:42:37 +01:00
Matthew Honnibal
c66bd61e88
Fix lemmas
2019-03-21 14:22:12 +01:00
Matthew Honnibal
04395ffa49
Bring English tag_map in line with UD Treebank
I wrote a small script to read the UD English training data and check
that our tag map and morph rules were resulting in the best POS map.
This hadn't been done for some time, and there have been various changes
to the UD schema since it was last done. After these changes we should
see much better agreement between our POS assignments and the UD POS
tags.
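The script itself is not part of this log. As a rough sketch of the kind of check described, assuming the training data has already been reduced to (fine-grained tag, UD POS) pairs and that the tag map is a plain tag-to-POS dict (both are simplifications, and all names below are hypothetical):

```
from collections import Counter, defaultdict


def best_pos_per_tag(pairs):
    """Return the most frequent UD POS observed for each fine-grained tag."""
    counts = defaultdict(Counter)
    for tag, pos in pairs:
        counts[tag][pos] += 1
    return {tag: c.most_common(1)[0][0] for tag, c in counts.items()}


def find_disagreements(tag_map, pairs):
    """Map each tag to (current POS, best POS) where the two differ."""
    best = best_pos_per_tag(pairs)
    return {
        tag: (tag_map.get(tag), pos)
        for tag, pos in best.items()
        if tag_map.get(tag) != pos
    }


# Toy data for illustration only; not taken from the UD treebank.
toy_pairs = [
    ("NN", "NOUN"), ("NN", "NOUN"), ("NN", "PROPN"),
    ("IN", "ADP"), ("IN", "ADP"), ("IN", "SCONJ"),
]
toy_tag_map = {"NN": "NOUN", "IN": "SCONJ"}
disagreements = find_disagreements(toy_tag_map, toy_pairs)
```

Each entry in `disagreements` flags a tag whose mapped POS differs from the majority POS in the data, which is the kind of mismatch the commit sets out to reduce.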
2019-03-21 13:53:44 +01:00
Matthew Honnibal
c7f26abe5f
Merge pull request #3434 from Bharat123rox/narrow-unicode
Raise Error for a narrow unicode build of Python
2019-03-20 12:19:52 +01:00
Matthew Honnibal
1c8ff59185
Merge pull request #3441 from explosion/fix/cli-ud-scripts
💫 Move UD scripts to bin
2019-03-20 12:19:15 +01:00
Matthew Honnibal
72889a16d5
Fix similarity calculation if vectors are on GPU (#3440)
2019-03-20 12:09:59 +01:00
Matthew Honnibal
1612990e88
Implement cosine loss for spacy pretrain. Make default
2019-03-20 11:06:58 +00:00
Ines Montani
ae5b4d0e84
Fix formatting (hopefully also restarts build properly)
2019-03-20 09:55:45 +01:00
Ines Montani
6abc1ddb26
Update __main__.py
2019-03-20 09:43:26 +01:00
Bharat123Rox
f2547f02d6
Made changes suggested by @ines
2019-03-20 07:43:19 +05:30
Ines Montani
7400c7f8a7
Move UD scripts to bin
2019-03-20 01:19:34 +01:00
Ines Montani
685fff40cf
Revert "Add --always-link flag to cli.download (see #3435)"
This reverts commit 583a566843.
2019-03-20 01:03:40 +01:00