Commit Graph

10469 Commits

Author SHA1 Message Date
Matthew Honnibal
22bd0095f5 * Map empty string to NULL_ATTR in attrs 2015-10-10 22:10:19 +11:00
Matthew Honnibal
7488821677 * Map NIL to empty string in tag map 2015-10-10 22:09:50 +11:00
Matthew Honnibal
20e909d2bb * Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore 2015-10-10 18:27:03 +11:00
Matthew Honnibal
e18fbcb604 * Allow SPACY_DATA environment variable in website tests 2015-10-10 17:59:47 +11:00
Matthew Honnibal
1cac36bf1c * Add symbols to the vocab before reading the strings, so that they line up correctly 2015-10-10 17:58:29 +11:00
Matthew Honnibal
94bafc1417 * Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS 2015-10-10 17:57:29 +11:00
Matthew Honnibal
3cea417852 * Enumerate all symbols in one file 2015-10-10 16:03:48 +11:00
Matthew Honnibal
4bbd1388bd * Whitespace 2015-10-10 16:03:48 +11:00
Matthew Honnibal
064bd69ad0 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-10 16:03:48 +11:00
Matthew Honnibal
08e29519a6 * Add test for how spaces are attached by the parser. 2015-10-10 16:03:13 +11:00
Matthew Honnibal
dfbcff2ff1 * Revert codecs/io change to strings.pyx, as it seemed to cause an error? Will investigate. 2015-10-10 15:54:55 +11:00
Matthew Honnibal
bdcb8d695c * Add non-breaking space to specials.json 2015-10-10 15:54:06 +11:00
Matthew Honnibal
9dd2f25c74 * Fix Issue #131: Force whitespace characters to attach syntactically to previous token, and ensure they cannot serve as stand-alone 'sentence' units. 2015-10-10 15:53:30 +11:00
Matthew Honnibal
8b39feefbe * Add dependency post-process rule to ensure spaces are attached to neighbouring tokens, so that they can't be sentence boundaries 2015-10-10 15:32:13 +11:00
Matthew Honnibal
1521cf25c9 * Fix merge problem in test_parse_navigate 2015-10-10 15:04:01 +11:00
Matthew Honnibal
c12d36d5f4 * Fix quote marks in lemma_rules 2015-10-10 15:03:36 +11:00
Matthew Honnibal
2153067958 * Fix use of io in strings.pyx 2015-10-10 15:03:12 +11:00
Matthew Honnibal
ec874247b5 Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-10-10 14:23:51 +11:00
Matthew Honnibal
30de4135c9 * Fix merge problem 2015-10-10 14:22:32 +11:00
Matthew Honnibal
dc393a5f1d Merge pull request #126 from tomtung/master: Improve slicing support for both Doc and Span 2015-10-10 14:14:57 +11:00
Matthew Honnibal
6ea8f99a10 Merge branch 'alvations-master' 2015-10-10 14:13:24 +11:00
Matthew Honnibal
83dccf0fd7 * Use io module instead of deprecated codecs module 2015-10-10 14:13:01 +11:00
Matthew Honnibal
55cd7008bb Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-10-10 14:07:55 +11:00
Matthew Honnibal
57b3cd4661 * Add smart-quotes to lemma rules 2015-10-10 14:06:46 +11:00
Matthew Honnibal
7e7f28e1fd * Add smart-quote possessive marker in generate_specials 2015-10-10 14:06:09 +11:00
Matthew Honnibal
41c50e509c Merge pull request #137 from henningpeters/master: push version and add spacy channel 2015-10-10 01:40:29 +11:00
Matthew Honnibal
8b8d048385 Merge pull request #135 from henningpeters/patch-1: remove compile warning noise 2015-10-10 01:40:15 +11:00
Matthew Honnibal
d31c911f83 Merge pull request #136 from henningpeters/patch-2: cleanup 2015-10-10 01:40:00 +11:00
Henning Peters
7a47c0c872 push version 2015-10-09 16:37:57 +02:00
Henning Peters
88b2f7ea5d push version and add spacy channel 2015-10-09 16:30:23 +02:00
Henning Peters
876fc99c44 cleanup: looks like this file was accidentally added 2015-10-09 16:11:56 +02:00
Matthew Honnibal
a3dfe2b901 * Increment data version 2015-10-09 13:26:17 +02:00
Matthew Honnibal
af8d0a2a09 * Increment version 2015-10-09 12:42:41 +02:00
Matthew Honnibal
3bf50ab830 * Ensure the fabfile prebuild command installs pytest 2015-10-09 20:57:47 +11:00
Matthew Honnibal
599f739ddb * Fix smart quote lemma test 2015-10-09 20:51:28 +11:00
Matthew Honnibal
5682439d1e * Remove em dash test from test_lemmatizer, as em dashes are now handled in specials.json 2015-10-09 20:24:21 +11:00
Matthew Honnibal
f35632e2e5 * Remove SBD print statement in train, after SBD evaluation was removed from Scorer 2015-10-09 11:08:58 +02:00
Matthew Honnibal
1f90502ce8 * Fix website/test_home for Python 3 2015-10-09 11:08:31 +02:00
Matthew Honnibal
caff4638c9 * Fix website/test_api.py for Python 3 2015-10-09 11:08:12 +02:00
Matthew Honnibal
a510858f5a * Pretty-print specials.json, and add the em dash 2015-10-09 11:07:45 +02:00
Matthew Honnibal
49600a44a8 * Fix trailing comma in lemma_rules.json 2015-10-09 11:06:57 +02:00
Matthew Honnibal
0e92e8574a * Fix pos tag in em-dash in specials 2015-10-09 11:06:37 +02:00
Matthew Honnibal
d341443282 * Remove em-dash from lemma rules. Handle instead in specials. 2015-10-09 10:27:13 +02:00
Matthew Honnibal
b6047afe4c * Fix punctuation lemma rules, to resolve Issue #130 2015-10-09 10:25:37 +02:00
Matthew Honnibal
393a13d1af * Add unicode em dash to specials.json, so that we can control what POS tag it gets. This way we can prevent sentence boundary detection errors, to address Issue #130. 2015-10-09 19:24:33 +11:00
Matthew Honnibal
1490feda29 * Make generate_specials pretty-print the specials.json file 2015-10-09 19:23:47 +11:00
Matthew Honnibal
1842a53e73 * Lemmatize smart quotes as plain quotes 2015-10-09 19:09:36 +11:00
Matthew Honnibal
2d9e5bf566 * Allow punctuation to be lemmatized 2015-10-09 19:02:42 +11:00
Matthew Honnibal
5332c0b697 * Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130 2015-10-09 18:54:40 +11:00
Matthew Honnibal
b71ba2eed5 * Add tests for unicode punctuation character lemmatization 2015-10-09 18:43:14 +11:00