spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-14 16:12:39 +03:00

Author	SHA1	Message	Date
svlandeg	ec3e860b44	Merge remote-tracking branch 'upstream/master' into feature/el-framework	2019-03-22 13:47:08 +01:00
Ines Montani	c9bd0e5a96	Set version to 2.1.2	2019-03-22 13:44:47 +01:00
svlandeg	12d4caf341	Merge remote-tracking branch 'upstream/master' into feature/el-framework	2019-03-22 13:44:36 +01:00
Matthew Honnibal	e65b5bb9a0	Fix tokenizer on Python2.7 (#3460) spaCy v2.1 switched to the built-in re module, where v2.0 had been using the third-party regex library. When the tokenizer was deserialized on Python2.7, the `re.compile()` function was called with expressions that featured escaped unicode codepoints that were not in Python2.7's unicode database. Problems occurred when we had a range between two of these unknown codepoints, like this: ``` '[\\uAA77-\\uAA79]' ``` On Python2.7, the unknown codepoints are not unescaped correctly, resulting in arbitrary out-of-range characters being matched by the expression. This problem does not occur if we instead have a range between two unicode literals, rather than the escape sequences. To fix the bug, we therefore add a new compat function that unescapes unicode sequences using the `ast.literal_eval()` function. Care is taken to ensure we do not also escape non-unicode sequences. Closes #3356. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-22 13:42:47 +01:00
Ines Montani	c81923ee30	Update wasabi pin	2019-03-22 13:31:58 +01:00
Ines Montani	188ccd5750	Fix xfail marker	2019-03-22 12:54:14 +01:00
Ines Montani	7dd5e2f564	Update v2-1.md	2019-03-22 12:43:23 +01:00
svlandeg	7cf0bc9a8c	delete sandbox folder	2019-03-22 12:25:11 +01:00
svlandeg	5b1cd49222	error msg and unit tests for setting kb_id on span	2019-03-22 12:05:35 +01:00
svlandeg	3c9ac59ea0	Merge branch 'backup_el' of https://github.com/svlandeg/spaCy into backup_el	2019-03-22 11:43:52 +01:00
svlandeg	a48241e9a2	use nlp's vocab for stringstore	2019-03-22 11:36:45 +01:00
svlandeg	1ee0e78fd7	select candidate with highest prior probability	2019-03-22 11:36:45 +01:00
svlandeg	7b708ab8a4	name per entity	2019-03-22 11:36:45 +01:00
svlandeg	c593607ce2	minimal EL pipe	2019-03-22 11:36:45 +01:00
svlandeg	c71123dd0c	ensure no candidates are returned for unknown aliases	2019-03-22 11:36:45 +01:00
svlandeg	b6c3255a9f	Entity class	2019-03-22 11:36:45 +01:00
svlandeg	1289cd6e8f	property getters and keep track of KB internally	2019-03-22 11:36:45 +01:00
svlandeg	98ae77a682	unit test on number of candidates generated	2019-03-22 11:36:45 +01:00
svlandeg	9a46c431c3	store entity hash instead of pointer	2019-03-22 11:36:45 +01:00
svlandeg	9819dca80e	create candidate object from entry pointer (not fully functional yet)	2019-03-22 11:36:45 +01:00
svlandeg	a9074e0886	check the length of entities and probabilities vector + unit test	2019-03-22 11:36:45 +01:00
svlandeg	d133ffaff9	correct size, not counting dummy elements in the vector	2019-03-22 11:36:45 +01:00
svlandeg	33f8a0fe2e	check and unit test in case prior probs exceed 1	2019-03-22 11:36:45 +01:00
svlandeg	b55baaa1dc	avoid value 0 in preshmap and helpful user warnings	2019-03-22 11:36:45 +01:00
svlandeg	20a7b7b1c0	raising error when adding alias for unknown entity + unit test	2019-03-22 11:36:45 +01:00
svlandeg	8843f9279c	use StringStore	2019-03-22 11:36:45 +01:00
svlandeg	51560bf0ed	bugfix adding aliases	2019-03-22 11:36:45 +01:00
svlandeg	c4ba942765	get candidates by alias	2019-03-22 11:36:45 +01:00
svlandeg	151b855cc8	adding and retrieving aliases	2019-03-22 11:36:45 +01:00
svlandeg	cf34113250	very minimal KB functionality working	2019-03-22 11:36:44 +01:00
svlandeg	af281c5466	adding aliases per entity in the KB	2019-03-22 11:36:44 +01:00
svlandeg	f77b99c103	fix compile errors	2019-03-22 11:36:44 +01:00
svlandeg	27483f9080	add pyx and separate method to add aliases	2019-03-22 11:36:44 +01:00
svlandeg	feb71e15fd	hash the entity name	2019-03-22 11:36:44 +01:00
svlandeg	839dafa104	documented some comments and todos	2019-03-22 11:36:44 +01:00
svlandeg	7f37737878	kb snippet, draft by Matt (wip)	2019-03-22 11:36:44 +01:00
svlandeg	735fc2a735	annotate kb_id through ents in doc	2019-03-22 11:36:44 +01:00
svlandeg	d849eb2455	adding kb_id as field to token, el as nlp pipeline component	2019-03-22 11:34:46 +01:00
Matthew Honnibal	d811c97da1	Fix test that caused pytest to choke on Python3	2019-03-22 10:28:51 +01:00
Matthew Honnibal	a2ad9832e5	Add failing test for #3356	2019-03-22 02:42:37 +01:00
svlandeg	4820b43313	use nlp's vocab for stringstore	2019-03-21 23:17:25 +01:00
Matthew Honnibal	7ec64a36fd	Merge pull request #3455 from explosion/bugfix/fix-en-tag-map 💫 Bring English tag_map in line with UD Treebank	2019-03-21 21:19:30 +01:00
svlandeg	6e2433b95e	select candidate with highest prior probability	2019-03-21 18:55:01 +01:00
svlandeg	24a0c4a8d4	name per entity	2019-03-21 18:20:57 +01:00
svlandeg	d0c763ba44	minimal EL pipe	2019-03-21 17:33:25 +01:00
svlandeg	26afa4800f	ensure no candidates are returned for unknown aliases	2019-03-21 15:24:40 +01:00
Matthew Honnibal	c66bd61e88	Fix lemmas	2019-03-21 14:22:12 +01:00
Matthew Honnibal	04395ffa49	Bring English tag_map in line with UD Treebank I wrote a small script to read the UD English training data and check that our tag map and morph rules were resulting in the best POS map. This hadn't been done for some time, and there have been various changes to the UD schema since it has been done. After these changes we should see much better agreement between our POS assignments and the UD POS tags.	2019-03-21 13:53:44 +01:00
svlandeg	a5d5a05930	Entity class	2019-03-21 13:32:21 +01:00
svlandeg	6ba4079f7c	property getters and keep track of KB internally	2019-03-21 13:26:12 +01:00
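
The message for commit e65b5bb9a0 above describes unescaping unicode sequences in tokenizer patterns with `ast.literal_eval()` while leaving other backslash escapes alone. The following is a minimal sketch of that idea, not the code from the commit: the function name `unescape_unicode_sketch` is hypothetical, and the sketch assumes the pattern contains no `'''` and does not end in a single quote.

```python
import ast
import re


def unescape_unicode_sketch(pattern):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with literal characters.

    Hypothetical helper for illustration; other escapes such as \\d
    are kept as they are so the regex meaning does not change.
    """
    if pattern is None:
        return None
    # Double every backslash so literal_eval leaves regex escapes intact.
    protected = pattern.replace("\\", "\\\\")
    # Remove that protection again, but only for unicode escapes.
    protected = protected.replace("\\\\u", "\\u").replace("\\\\U", "\\U")
    # literal_eval only parses the string literal; it cannot run code.
    # Assumption: the pattern has no ''' and does not end with a quote.
    return ast.literal_eval("u'''" + protected + "'''")


# The range from the commit message, followed by an ordinary regex escape.
escaped = "[\\uAA77-\\uAA79]\\d+"
literal = unescape_unicode_sketch(escaped)
compiled = re.compile(literal)
```

After the call, `literal` holds a character range between two unicode literals, which is the form the commit message says compiles correctly on Python 2.7, while `\d+` is passed through unchanged.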


9969 Commits