Commit Graph

1147 Commits

Author SHA1 Message Date
Matthew Honnibal
0cee928467 * Allow StringStore to be pickled, to start addressing Issue #125 2015-10-13 13:44:41 +11:00
Matthew Honnibal
41012907a8 * Fix variable name 2015-10-13 13:44:40 +11:00
Matthew Honnibal
e70368d157 * Use lower case strings for dependency label names in symbols enum 2015-10-13 13:44:40 +11:00
Matthew Honnibal
7b4af3d1e7 * Fix parts_of_speech now that symbols list has been reformed 2015-10-13 13:44:40 +11:00
Matthew Honnibal
37b909b6b6 * Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd 2015-10-13 13:44:40 +11:00
Matthew Honnibal
ce65ec698c * Remove qualified naming in symbols 2015-10-13 13:44:40 +11:00
Matthew Honnibal
9f4be0adcd * Map NO_TAG to NIL in parts_of_speech.pxd 2015-10-13 13:44:40 +11:00
Matthew Honnibal
278e12f7e8 * Addmorphology symbols to morphology. May need to remove these as an enum. 2015-10-13 13:44:40 +11:00
Matthew Honnibal
d80067eda1 * Map empty string to NULL_ATTR in attrs 2015-10-13 13:44:40 +11:00
Matthew Honnibal
d70e8cac2c * Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore 2015-10-13 13:44:40 +11:00
Matthew Honnibal
a29c8ee23d * Add symbols to the vocab before reading the strings, so that they line up correctly 2015-10-13 13:44:39 +11:00
Matthew Honnibal
74c0853471 * Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS 2015-10-13 13:44:39 +11:00
Matthew Honnibal
10a4a843ea * Enumerate all symbols in one file 2015-10-13 13:44:39 +11:00
Matthew Honnibal
85ce36ab11 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-13 13:44:39 +11:00
Matthew Honnibal
dfbcff2ff1 * Revert codecs/io change to strings.pyx, as it seemed to cause an error? Will investigate. 2015-10-10 15:54:55 +11:00
Matthew Honnibal
9dd2f25c74 * Fix Issue #131: Force whitespace characters to attach syntactically to previous token, and ensure they cannot serve as stand-alone 'sentence' units. 2015-10-10 15:53:30 +11:00
Matthew Honnibal
8b39feefbe * Add dependency post-process rule to ensure spaces are attached to neighbouring tokens, so that they can't be sentence boundaries 2015-10-10 15:32:13 +11:00
Matthew Honnibal
2153067958 * Fix use of io in strings.pyx 2015-10-10 15:03:12 +11:00
Matthew Honnibal
ec874247b5 Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-10-10 14:23:51 +11:00
Matthew Honnibal
30de4135c9 * Fix merge problem 2015-10-10 14:22:32 +11:00
Matthew Honnibal
dc393a5f1d Merge pull request #126 from tomtung/master
Improve slicing support for both Doc and Span
2015-10-10 14:14:57 +11:00
Matthew Honnibal
83dccf0fd7 * Use io module insteads of deprecated codecs module 2015-10-10 14:13:01 +11:00
Matthew Honnibal
a3dfe2b901 * Increment data version 2015-10-09 13:26:17 +02:00
Matthew Honnibal
2d9e5bf566 * Allow punctuation to be lemmatized 2015-10-09 19:02:42 +11:00
Matthew Honnibal
5332c0b697 * Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130 2015-10-09 18:54:40 +11:00
Yubing (Tom) Dong
9a6811acc4 Merge remote-tracking branch 'upstream/master' 2015-10-08 22:53:02 -07:00
Matthew Honnibal
b125289f30 * Fix type declaration in asciied function 2015-10-09 13:46:57 +11:00
Matthew Honnibal
801d55a6d9 * Fix phrase matcher 2015-10-09 02:00:45 +11:00
Matthew Honnibal
b3a70e6375 * Clean up unnecessary try/except block 2015-10-08 14:34:11 +11:00
Yubing (Tom) Dong
0f601b8b75 Update docstring of Doc.__getitem__ 2015-10-07 01:27:28 -07:00
Yubing (Tom) Dong
3fd3bc79aa Refactor to remove duplicate slicing logic 2015-10-07 01:25:35 -07:00
Yubing (Tom) Dong
97685aecb7 Add slicing support to Span 2015-10-06 02:45:49 -07:00
Yubing (Tom) Dong
ef2af20cd3 Make Doc's slicing behavior conform to Python conventions 2015-10-06 02:41:28 -07:00
Yubing (Tom) Dong
2fc33e8024 Allow step=1 when slicing a Doc 2015-10-06 00:57:05 -07:00
Matthew Honnibal
b228a8f4a6 * Remove spacy/en/attrs 2015-10-06 16:20:46 +11:00
Matthew Honnibal
693677fd8d * Prepare to remove en/attrx file, now that moving to symbols.pyx 2015-10-06 16:20:13 +11:00
Matthew Honnibal
3d9f41c2c9 * Add LookupError for better error reporting in Vocab 2015-10-06 10:34:59 +11:00
Matthew Honnibal
ecc5281b36 * Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx 2015-10-06 10:12:08 +11:00
alvations
8caedba42a caught more codecs.open -> io.open 2015-09-30 20:20:09 +02:00
alvations
8199012d26 changing deprecated codecs.open to io.open =) 2015-09-30 20:10:15 +02:00
Matthew Honnibal
87e6186828 * Rename _seq to doc attribute in Span 2015-09-29 23:03:55 +10:00
Matthew Honnibal
ab694b0364 * Fix open-bounded slice indices. 2015-09-29 23:03:09 +10:00
Matthew Honnibal
a6ced80c0c * Fix Issue #116: Misleading handling of True value in Language.__init__. 2015-09-29 20:54:12 +10:00
Matthew Honnibal
f9d2a5b651 * Fix issue #112: Replace unidecode with text-unidecode, to avoid license problems. 2015-09-28 23:40:18 +10:00
Matthew Honnibal
2c33a96ac3 Merge pull request #99 from rw/patch-1
Force SSL for downloading English language data.
2015-09-28 17:46:26 +10:00
Matthew Honnibal
abf0d930af * Fix API for loading word vectors from a file. 2015-09-23 23:51:08 +10:00
Matthew Honnibal
f5c256745b Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-09-22 12:26:24 +10:00
Matthew Honnibal
528e26a506 * Add rule to ensure ordinals are preserved as single tokens 2015-09-22 12:26:05 +10:00
Robert
8711b64860 Force SSL for downloading English language data.
It would also be nice to have a checksum for this.
2015-09-21 17:26:01 -07:00
Matthew Honnibal
f7283a5067 * Fix vectors bugs for OOV words 2015-09-22 02:10:25 +02:00