Commit Graph

1292 Commits

Author SHA1 Message Date
Matthew Honnibal
4e16f9e435 * Move tests underneath spacy/ 2015-10-26 00:07:31 +11:00
Matthew Honnibal
3a6e48e814 Merge pull request #149 from chrisdubois/pickle-patch
Add __reduce__ to Tokenizer so that English pickles.
2015-10-25 15:30:31 +11:00
Chris DuBois
dac8fe7bdb Add __reduce__ to Tokenizer so that English pickles.
- Add tests to test_pickle and test_tokenizer that save to tempfiles.
2015-10-23 22:24:03 -07:00
Matthew Honnibal
ff4fe524ee * Fix exception for python 2 2015-10-23 01:56:13 +02:00
Matthew Honnibal
341a3e85cd * Upd downloaded data version 2015-10-23 00:56:57 +02:00
Matthew Honnibal
f18fd8c659 * Fix language.py for change in StringStore load API 2015-10-23 03:48:12 +11:00
Matthew Honnibal
23855db3ca Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop 2015-10-23 03:46:09 +11:00
Matthew Honnibal
4f13849065 Merge pull request #145 from henningpeters/master
better error reporting, cleanup
2015-10-23 03:45:47 +11:00
Matthew Honnibal
3be94be0c0 Merge pull request #148 from maxirmx/master
Utf8 encoding for lemma_rules.json
2015-10-22 21:46:28 +11:00
Matthew Honnibal
c86bda8d1a * Fix import of uget 2015-10-22 21:13:56 +11:00
Matthew Honnibal
2348a08481 * Load/dump strings with a json file, instead of the hacky strings file we were using. 2015-10-22 21:13:03 +11:00
Matthew Honnibal
9baf0abd59 * Save vocab after training. 2015-10-22 21:09:14 +11:00
maxirmx
f07e4accd7 Fixing encoding issue #4 2015-10-21 20:45:56 +03:00
maxirmx
fcbfff043f Fixing encoding issue #3 2015-10-21 15:52:34 +03:00
maxirmx
fe9d2e2c4e Fixing encode issue #2 2015-10-21 15:36:21 +03:00
maxirmx
e4a1726f77 Fixing encoding issue
UTF-8
2015-10-21 14:16:37 +03:00
Andreas Grivas
93ada458e2 added __repr__ that prints text in ipython for doc, token, and span objects 2015-10-21 14:11:46 +03:00
Henning Peters
ccffd2ef53 fixed extract directory 2015-10-21 07:59:34 +02:00
Henning Peters
da4c9cee06 assert filename match 2015-10-20 19:33:59 +02:00
Henning Peters
4f703f0cb4 better error reporting, cleanup 2015-10-20 19:11:29 +02:00
Matthew Honnibal
9cdea6e450 * Import uget correctly 2015-10-19 08:32:41 +02:00
Matthew Honnibal
6727a46bb5 * Fix Issue #118: Matcher behaves unpredictably when matches overlap. 2015-10-19 16:45:32 +11:00
Matthew Honnibal
135062d23c * Fix error with merged text when merged region did not have trailing whitespace 2015-10-19 15:47:04 +11:00
Henning Peters
bfde91fa49 add custom download tool (uget), replace wget with uget 2015-10-18 12:35:04 +02:00
Matthew Honnibal
9839cd2c0b * Fix whitespace_ calculation in Token 2015-10-18 17:21:11 +11:00
Matthew Honnibal
c99285b8b9 * Clean up C++ usage in spacy/matcher.pyx 2015-10-18 17:20:50 +11:00
Matthew Honnibal
a7e6c5ac8f * Fix Issue #122: Incorrect calculation of children after Doc.merge() 2015-10-18 17:17:27 +11:00
Matthew Honnibal
3ba66f2dc7 * Add string length cap in Tokenizer.__call__ 2015-10-16 04:54:16 +11:00
Matthew Honnibal
6e0f985afc * Fix token.conjuncts 2015-10-15 03:49:45 +11:00
Matthew Honnibal
2e0104ac81 * Fix token.conjuncts 2015-10-15 03:47:45 +11:00
Matthew Honnibal
b8f3345a82 * Fix token.conjuncts method 2015-10-15 03:36:01 +11:00
Matthew Honnibal
23818f89b8 * Fix token.conjuncts method 2015-10-15 03:34:57 +11:00
Matthew Honnibal
7a15d1b60c * Add Python 2/3 compatibility fix for copy_reg 2015-10-13 20:04:40 +11:00
Matthew Honnibal
329ae57520 * Fix whitespace attachment thing 2015-10-13 09:46:38 +02:00
Matthew Honnibal
37919eac82 * Fix whitespace attachment in simpler way. Leaves problem with setting left/right children. 2015-10-13 18:23:24 +11:00
Matthew Honnibal
c70eb776ae * Fix whitespace attachment, so that left/right children are consistent with head. 2015-10-13 15:58:22 +11:00
Matthew Honnibal
531182f937 * Fix Model.__reduce__ 2015-10-13 15:14:38 +11:00
Matthew Honnibal
6c227a6c1f * Fix Model.__reduce__ 2015-10-13 15:10:04 +11:00
Matthew Honnibal
358c82595c * Fix NAMES list in spacy/parts_of_speech.pyx 2015-10-13 14:18:45 +11:00
Matthew Honnibal
c1fdc487bc Merge branch 'attrs' 2015-10-13 14:03:41 +11:00
Matthew Honnibal
e886e6a406 * Inc version 2015-10-13 13:46:17 +11:00
Matthew Honnibal
20fd36a0f7 * Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve. 2015-10-13 13:44:41 +11:00
Matthew Honnibal
f8de403483 * Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125 2015-10-13 13:44:41 +11:00
Matthew Honnibal
85e7944572 * Start trying to pickle Vocab 2015-10-13 13:44:41 +11:00
Matthew Honnibal
5ca57bd859 * Ensure Morphology can be pickled, to address Issue #125. 2015-10-13 13:44:41 +11:00
Matthew Honnibal
0cee928467 * Allow StringStore to be pickled, to start addressing Issue #125 2015-10-13 13:44:41 +11:00
Matthew Honnibal
41012907a8 * Fix variable name 2015-10-13 13:44:40 +11:00
Matthew Honnibal
e70368d157 * Use lower case strings for dependency label names in symbols enum 2015-10-13 13:44:40 +11:00
Matthew Honnibal
7b4af3d1e7 * Fix parts_of_speech now that symbols list has been reformed 2015-10-13 13:44:40 +11:00
Matthew Honnibal
37b909b6b6 * Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd 2015-10-13 13:44:40 +11:00
Matthew Honnibal
ce65ec698c * Remove qualified naming in symbols 2015-10-13 13:44:40 +11:00
Matthew Honnibal
9f4be0adcd * Map NO_TAG to NIL in parts_of_speech.pxd 2015-10-13 13:44:40 +11:00
Matthew Honnibal
278e12f7e8 * Add morphology symbols to morphology. May need to remove these as an enum. 2015-10-13 13:44:40 +11:00
Matthew Honnibal
d80067eda1 * Map empty string to NULL_ATTR in attrs 2015-10-13 13:44:40 +11:00
Matthew Honnibal
d70e8cac2c * Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore 2015-10-13 13:44:40 +11:00
Matthew Honnibal
a29c8ee23d * Add symbols to the vocab before reading the strings, so that they line up correctly 2015-10-13 13:44:39 +11:00
Matthew Honnibal
74c0853471 * Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS 2015-10-13 13:44:39 +11:00
Matthew Honnibal
10a4a843ea * Enumerate all symbols in one file 2015-10-13 13:44:39 +11:00
Matthew Honnibal
85ce36ab11 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-13 13:44:39 +11:00
Matthew Honnibal
dfbcff2ff1 * Revert codecs/io change to strings.pyx, as it seemed to cause an error? Will investigate. 2015-10-10 15:54:55 +11:00
Matthew Honnibal
9dd2f25c74 * Fix Issue #131: Force whitespace characters to attach syntactically to previous token, and ensure they cannot serve as stand-alone 'sentence' units. 2015-10-10 15:53:30 +11:00
Matthew Honnibal
8b39feefbe * Add dependency post-process rule to ensure spaces are attached to neighbouring tokens, so that they can't be sentence boundaries 2015-10-10 15:32:13 +11:00
Matthew Honnibal
2153067958 * Fix use of io in strings.pyx 2015-10-10 15:03:12 +11:00
Matthew Honnibal
ec874247b5 Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-10-10 14:23:51 +11:00
Matthew Honnibal
30de4135c9 * Fix merge problem 2015-10-10 14:22:32 +11:00
Matthew Honnibal
dc393a5f1d Merge pull request #126 from tomtung/master
Improve slicing support for both Doc and Span
2015-10-10 14:14:57 +11:00
Matthew Honnibal
83dccf0fd7 * Use io module instead of deprecated codecs module 2015-10-10 14:13:01 +11:00
Matthew Honnibal
a3dfe2b901 * Increment data version 2015-10-09 13:26:17 +02:00
Matthew Honnibal
2d9e5bf566 * Allow punctuation to be lemmatized 2015-10-09 19:02:42 +11:00
Matthew Honnibal
5332c0b697 * Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130 2015-10-09 18:54:40 +11:00
Yubing (Tom) Dong
9a6811acc4 Merge remote-tracking branch 'upstream/master' 2015-10-08 22:53:02 -07:00
Matthew Honnibal
b125289f30 * Fix type declaration in asciied function 2015-10-09 13:46:57 +11:00
Matthew Honnibal
801d55a6d9 * Fix phrase matcher 2015-10-09 02:00:45 +11:00
Matthew Honnibal
b3a70e6375 * Clean up unnecessary try/except block 2015-10-08 14:34:11 +11:00
Yubing (Tom) Dong
0f601b8b75 Update docstring of Doc.__getitem__ 2015-10-07 01:27:28 -07:00
Yubing (Tom) Dong
3fd3bc79aa Refactor to remove duplicate slicing logic 2015-10-07 01:25:35 -07:00
Yubing (Tom) Dong
97685aecb7 Add slicing support to Span 2015-10-06 02:45:49 -07:00
Yubing (Tom) Dong
ef2af20cd3 Make Doc's slicing behavior conform to Python conventions 2015-10-06 02:41:28 -07:00
Yubing (Tom) Dong
2fc33e8024 Allow step=1 when slicing a Doc 2015-10-06 00:57:05 -07:00
Matthew Honnibal
b228a8f4a6 * Remove spacy/en/attrs 2015-10-06 16:20:46 +11:00
Matthew Honnibal
693677fd8d * Prepare to remove en/attrx file, now that moving to symbols.pyx 2015-10-06 16:20:13 +11:00
Matthew Honnibal
3d9f41c2c9 * Add LookupError for better error reporting in Vocab 2015-10-06 10:34:59 +11:00
Matthew Honnibal
ecc5281b36 * Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx 2015-10-06 10:12:08 +11:00
alvations
8caedba42a caught more codecs.open -> io.open 2015-09-30 20:20:09 +02:00
alvations
8199012d26 changing deprecated codecs.open to io.open =) 2015-09-30 20:10:15 +02:00
Matthew Honnibal
87e6186828 * Rename _seq to doc attribute in Span 2015-09-29 23:03:55 +10:00
Matthew Honnibal
ab694b0364 * Fix open-bounded slice indices. 2015-09-29 23:03:09 +10:00
Matthew Honnibal
a6ced80c0c * Fix Issue #116: Misleading handling of True value in Language.__init__. 2015-09-29 20:54:12 +10:00
Matthew Honnibal
f9d2a5b651 * Fix issue #112: Replace unidecode with text-unidecode, to avoid license problems. 2015-09-28 23:40:18 +10:00
Matthew Honnibal
2c33a96ac3 Merge pull request #99 from rw/patch-1
Force SSL for downloading English language data.
2015-09-28 17:46:26 +10:00
Matthew Honnibal
abf0d930af * Fix API for loading word vectors from a file. 2015-09-23 23:51:08 +10:00
Matthew Honnibal
f5c256745b Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-09-22 12:26:24 +10:00
Matthew Honnibal
528e26a506 * Add rule to ensure ordinals are preserved as single tokens 2015-09-22 12:26:05 +10:00
Robert
8711b64860 Force SSL for downloading English language data.
It would also be nice to have a checksum for this.
2015-09-21 17:26:01 -07:00
Matthew Honnibal
f7283a5067 * Fix vectors bugs for OOV words 2015-09-22 02:10:25 +02:00
Matthew Honnibal
44aecba701 * Fix Token.has_vector and Lexeme.has_vector 2015-09-22 01:43:16 +02:00
Matthew Honnibal
596fde8daa * Add has_vector attribute to Token and Lexeme 2015-09-21 19:52:43 +10:00
Matthew Honnibal
f32927efbf * Raise exceptions if attempt to access parse, but data is not installed. This partly but not fully addresses Issue #97. Still need exceptions on the various Token attributes that access the parse tree, e.g. token.head, token.lefts, token.rights, etc. Exceptions should be centralized, too. 2015-09-21 18:35:40 +10:00
Matthew Honnibal
388062ae01 * Fix repvec_length problem 2015-09-21 18:10:51 +10:00
Matthew Honnibal
ac459278d1 * Fix vector length error reporting, and ensure vec_len is returned 2015-09-21 18:08:32 +10:00
Matthew Honnibal
ba4e563701 * Ensure vectors are same length, and return vector length in load_vectors_bz2 2015-09-21 18:03:08 +10:00
Matthew Honnibal
d00fe2bbc6 * Don't allow Span objects to be written to, as it introduces subtle bugs because they're created afresh from Doc.sents, Doc.ents etc. 2015-09-21 17:59:39 +10:00
Matthew Honnibal
d6945bf880 * Add way to load vectors from bz2 file to vocab 2015-09-17 12:58:23 +10:00
Matthew Honnibal
77856c4fcd * Try giving Doc and Span objects vector and vector_norm attributes, and .similarity functions. Turns out to be bad idea. 2015-09-17 11:50:11 +10:00
Matthew Honnibal
191d593e03 * Fix vectors bug in lexeme 2015-09-15 19:05:11 +10:00
Matthew Honnibal
3d87519f64 * Remove vectors argument from Vocab object 2015-09-15 14:47:14 +10:00
Matthew Honnibal
362526b592 * Rename vectors_length attribute 2015-09-15 14:43:31 +10:00
Matthew Honnibal
60c26b2dfa * Fix slicing when start or stop is None 2015-09-15 14:43:10 +10:00
Matthew Honnibal
7ac6cacc26 * Remove const qualifier on LexemeC.repvec 2015-09-15 14:42:51 +10:00
Matthew Honnibal
dd4d64b235 * Support setting of word vectors on Lexeme object. 2015-09-15 14:42:27 +10:00
Matthew Honnibal
27f988b167 * Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects. 2015-09-15 14:41:48 +10:00
Matthew Honnibal
193f127f81 * Fix ugly py_check_flag and py_set_flag functions in Lexeme 2015-09-15 13:06:18 +10:00
Matthew Honnibal
9561d88529 * Add is_stop to Python API 2015-09-14 18:25:40 +10:00
Matthew Honnibal
65dc0d1dfb * Extend word vectors support, with .similarity() function, vector_norm property, and rename repvec to vector. Keep repvec name as well for now for backwards compatibility. 2015-09-14 17:49:58 +10:00
Matthew Honnibal
e13e47e9e5 * Add English stop words 2015-09-14 17:48:51 +10:00
Matthew Honnibal
24ed3fc25c * Check file existence before opening in lemmatizer 2015-09-13 10:45:21 +10:00
Matthew Honnibal
dbb48ce49e * Delete extra wordnets 2015-09-13 10:31:37 +10:00
Matthew Honnibal
e9c59693ea * Remove assertion from vocab.pyx 2015-09-13 10:30:08 +10:00
Matthew Honnibal
c08f10083c * Add test and test_with_ws attributes. 2015-09-13 10:27:42 +10:00
Matthew Honnibal
0b7d2a6c62 * Inc version 2015-09-13 01:26:29 +02:00
Matthew Honnibal
e1dfaeed8a * Check serializer freqs exist before loading 2015-09-12 23:49:38 +02:00
Matthew Honnibal
a412c66c8c * Check serializer freqs exist before loading 2015-09-12 23:40:01 +02:00
Matthew Honnibal
631c843ed1 * Don't look for index.adv in lemmatizer 2015-09-12 06:03:44 +02:00
Matthew Honnibal
dfdd4f2d60 Merge branch 'develop' of https://github.com/honnibal/spaCy into develop 2015-09-10 15:23:06 +02:00
Matthew Honnibal
e285ca7d6c * Load serializer freqs in vocab 2015-09-10 15:22:48 +02:00
Matthew Honnibal
f7fdcce1f9 Merge branch 'develop' of https://github.com/honnibal/spaCy into develop 2015-09-10 14:52:47 +02:00
Matthew Honnibal
85c3fec1d1 * Fix morphology loading 2015-09-10 14:52:23 +02:00
Matthew Honnibal
7c660c5efc * Use dict.get in lemmatizer 2015-09-10 14:51:39 +02:00
Matthew Honnibal
094440f9f5 Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop 2015-09-10 14:51:17 +02:00
Matthew Honnibal
c3f773cd63 * Fix Lexeme.check_flag 2015-09-10 14:51:05 +02:00
Matthew Honnibal
90da3a695d * Load lemmatizer from disk in Vocab.from_dir 2015-09-10 14:49:10 +02:00
Matthew Honnibal
e7e529edf4 * Fix Lexeme.check_flag 2015-09-10 14:45:43 +02:00
Matthew Honnibal
9e7bfe8449 * Fix space at end of merged token 2015-09-10 14:45:17 +02:00
Matthew Honnibal
f634191e27 * Fix vocab read/write 2015-09-10 14:44:38 +02:00
Matthew Honnibal
31ccf494e6 Merge branch 'develop' of https://github.com/honnibal/spaCy into develop 2015-09-09 14:33:38 +02:00
Matthew Honnibal
a7f4b26c8c * Tmp 2015-09-09 14:33:26 +02:00
Matthew Honnibal
07686470a9 * Don't consider a coordinated NP a base chunk 2015-09-09 14:32:28 +02:00
Matthew Honnibal
d9f1fc2112 * Add deprecation warning for unused load_vectors argument. 2015-09-09 14:31:09 +02:00
Matthew Honnibal
0b527fbdc8 * Set POS tag in morphology 2015-09-09 14:30:24 +02:00
Matthew Honnibal
07c09a0e1b * Fix attribute getters and setters in Lexeme 2015-09-09 14:29:22 +02:00
Matthew Honnibal
d6561988cf * Fix lexemes.bin 2015-09-09 11:49:51 +02:00
Matthew Honnibal
c301bebd33 Merge branch 'master' of https://github.com/honnibal/spaCy into develop 2015-09-09 10:55:39 +02:00
Matthew Honnibal
0e24d099a1 * Fix L/R edge bug, by ensuring l_edge and r_edge are preset, and fixing the way the edge update in del_arc. Bugs keep arising here because the edges are absolute positions, where everything else is relative. I'm also not 100% convinced that del_arc is handled correctly. Do we need to update the parents? 2015-09-09 03:40:44 +02:00
Matthew Honnibal
2be3620333 * Save morphological analyses in a cache 2015-09-08 15:39:24 +02:00
Matthew Honnibal
1def5a6cbe * Fix print statements in matcher 2015-09-08 15:38:19 +02:00
Matthew Honnibal
64d71f8893 * Fix lemmatizer 2015-09-08 15:38:03 +02:00
Matthew Honnibal
623329b19a Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop 2015-09-08 14:27:01 +02:00
Matthew Honnibal
62a01dd41d * Fix issue #92: lexemes.bin read error on 32-bit platforms. 2015-09-08 14:23:58 +02:00
Matthew Honnibal
ef58607a99 * Add spacy.it 2015-09-06 22:10:37 +02:00
Matthew Honnibal
2154a54f6b * Add spacy.de 2015-09-06 21:56:47 +02:00