Henning Peters
b8f63071eb
add lang registration facility
2016-03-25 18:54:45 +01:00
Matthew Honnibal
963fe5258e
* Add missing __contains__ method to vocab
2016-03-08 15:49:10 +00:00
Matthew Honnibal
478aa21cb0
* Remove broken __reduce__ method on vocab
2016-03-08 15:48:21 +00:00
Henning Peters
931c07a609
initial proposal for separate vector package
2016-03-04 11:09:06 +01:00
Matthew Honnibal
a95974ad3f
* Fix oov probability
2016-02-06 15:13:55 +01:00
Matthew Honnibal
dcb401f3e1
* Remove broken Vocab pickling
2016-02-06 14:08:47 +01:00
Matthew Honnibal
63e3d4e27f
* Add comment on Vocab.__reduce__
2016-01-19 20:11:25 +01:00
Henning Peters
235f094534
untangle data_path/via
2016-01-16 12:23:45 +01:00
Henning Peters
846fa49b2a
distinct load() and from_package() methods
2016-01-16 10:00:57 +01:00
Henning Peters
788f734513
refactored data_dir->via, add zip_safe, add spacy.load()
2016-01-15 18:01:02 +01:00
Henning Peters
bc229790ac
integrate with sputnik
2016-01-13 19:46:17 +01:00
Matthew Honnibal
eaf2ad59f1
* Fix use of mock Package object
2015-12-31 04:13:15 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
The previous Sputnik integration changed the API: Vocab, Tagger, etc.
were loaded via a from_package classmethod that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance in order to acquire a Package via
sp.pool().
Instead, I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod that accepts either
util.Package objects or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download.
2015-12-29 18:00:48 +01:00
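The shim pattern described in the commit message above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual util.Package or Vocab code: the method names file_path and has_file, the vocab/strings.txt layout, and the stand-in Vocab class are all hypothetical.

```python
import io
import os


class Package(object):
    """Hypothetical file-system shim that wraps a data directory."""

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def file_path(self, *parts):
        # Resolve a path relative to the package's data directory.
        return os.path.join(self.data_dir, *parts)

    def has_file(self, *parts):
        return os.path.isfile(self.file_path(*parts))


class Vocab(object):
    """Stand-in for a loadable class; not the real spaCy Vocab."""

    def __init__(self, strings):
        self.strings = strings

    @classmethod
    def load(cls, pkg_or_path):
        # Accept either a Package object or a plain string path,
        # so callers never have to build a Package themselves.
        if isinstance(pkg_or_path, Package):
            package = pkg_or_path
        else:
            package = Package(pkg_or_path)
        strings = []
        # Assumed file layout, chosen only for this illustration.
        if package.has_file("vocab", "strings.txt"):
            path = package.file_path("vocab", "strings.txt")
            with io.open(path, encoding="utf8") as f:
                strings = [line.strip() for line in f if line.strip()]
        return cls(strings)
```

With this shape, Vocab.load("/some/data/dir") and Vocab.load(Package("/some/data/dir")) behave the same way, and the Package internals could later be swapped for a proxy without changing callers.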
Matthew Honnibal
0e2498da00
* Replace from_package with load() classmethod in Vocab
2015-12-29 16:56:51 +01:00
Henning Peters
8359bd4d93
strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible
2015-12-18 09:52:55 +01:00
Henning Peters
9027cef3bc
access model via sputnik
2015-12-07 06:01:28 +01:00
Matthew Honnibal
6ed3aedf79
* Merge vocab changes
2015-11-06 00:48:08 +11:00
Matthew Honnibal
1e99fcd413
* Rename .repvec to .vector in C API
2015-11-03 23:47:59 +11:00
Matthew Honnibal
5887506f5d
* Don't expect lexemes.bin in Vocab
2015-11-03 13:23:39 +11:00
Matthew Honnibal
f11030aadc
* Remove out-dated TODO comment
2015-10-26 12:33:38 +11:00
Matthew Honnibal
a371a1071d
* Save and load word vectors during pickling, re Issue #125
2015-10-26 12:33:04 +11:00
Matthew Honnibal
314090cc78
* Set vectors length when unpickling vocab, re Issue #125
2015-10-26 12:05:08 +11:00
Matthew Honnibal
2348a08481
* Load/dump strings with a json file, instead of the hacky strings file we were using.
2015-10-22 21:13:03 +11:00
Matthew Honnibal
7a15d1b60c
* Add Python 2/3 compatibility fix for copy_reg
2015-10-13 20:04:40 +11:00
Matthew Honnibal
20fd36a0f7
* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.
2015-10-13 13:44:41 +11:00
Matthew Honnibal
f8de403483
* Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125
2015-10-13 13:44:41 +11:00
Matthew Honnibal
85e7944572
* Start trying to pickle Vocab
2015-10-13 13:44:41 +11:00
Matthew Honnibal
41012907a8
* Fix variable name
2015-10-13 13:44:40 +11:00
Matthew Honnibal
37b909b6b6
* Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd
2015-10-13 13:44:40 +11:00
Matthew Honnibal
d70e8cac2c
* Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore
2015-10-13 13:44:40 +11:00
Matthew Honnibal
a29c8ee23d
* Add symbols to the vocab before reading the strings, so that they line up correctly
2015-10-13 13:44:39 +11:00
Matthew Honnibal
85ce36ab11
* Refactor symbols, so that frequency rank can be derived from the orth id of a word.
2015-10-13 13:44:39 +11:00
Matthew Honnibal
83dccf0fd7
* Use io module instead of deprecated codecs module
2015-10-10 14:13:01 +11:00
Matthew Honnibal
3d9f41c2c9
* Add LookupError for better error reporting in Vocab
2015-10-06 10:34:59 +11:00
alvations
8caedba42a
caught more codecs.open -> io.open
2015-09-30 20:20:09 +02:00
Matthew Honnibal
abf0d930af
* Fix API for loading word vectors from a file.
2015-09-23 23:51:08 +10:00
Matthew Honnibal
f7283a5067
* Fix vectors bugs for OOV words
2015-09-22 02:10:25 +02:00
Matthew Honnibal
ac459278d1
* Fix vector length error reporting, and ensure vec_len is returned
2015-09-21 18:08:32 +10:00
Matthew Honnibal
ba4e563701
* Ensure vectors are same length, and return vector length in load_vectors_bz2
2015-09-21 18:03:08 +10:00
Matthew Honnibal
d6945bf880
* Add way to load vectors from bz2 file to vocab
2015-09-17 12:58:23 +10:00
Matthew Honnibal
3d87519f64
* Remove vectors argument from Vocab object
2015-09-15 14:47:14 +10:00
Matthew Honnibal
27f988b167
* Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects.
2015-09-15 14:41:48 +10:00
Matthew Honnibal
e9c59693ea
* Remove assertion from vocab.pyx
2015-09-13 10:30:08 +10:00
Matthew Honnibal
e1dfaeed8a
* Check serializer freqs exist before loading
2015-09-12 23:49:38 +02:00
Matthew Honnibal
a412c66c8c
* Check serializer freqs exist before loading
2015-09-12 23:40:01 +02:00
Matthew Honnibal
e285ca7d6c
* Load serializer freqs in vocab
2015-09-10 15:22:48 +02:00
Matthew Honnibal
094440f9f5
Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop
2015-09-10 14:51:17 +02:00
Matthew Honnibal
90da3a695d
* Load lemmatizer from disk in Vocab.from_dir
2015-09-10 14:49:10 +02:00
Matthew Honnibal
f634191e27
* Fix vocab read/write
2015-09-10 14:44:38 +02:00
Matthew Honnibal
a7f4b26c8c
* Tmp
2015-09-09 14:33:26 +02:00