110 Commits

Author SHA1 Message Date
Matthew Honnibal
872695759d Merge pull request #306 from wbwseeker/german_noun_chunks
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Henning Peters
b8f63071eb add lang registration facility 2016-03-25 18:54:45 +01:00
Wolfgang Seeker
5e2e8e951a add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks

the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker
03fb498dbe introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Matthew Honnibal
963fe5258e * Add missing __contains__ method to vocab 2016-03-08 15:49:10 +00:00
Matthew Honnibal
478aa21cb0 * Remove broken __reduce__ method on vocab 2016-03-08 15:48:21 +00:00
Henning Peters
931c07a609 initial proposal for separate vector package 2016-03-04 11:09:06 +01:00
Matthew Honnibal
a95974ad3f * Fix oov probability 2016-02-06 15:13:55 +01:00
Matthew Honnibal
dcb401f3e1 * Remove broken Vocab pickling 2016-02-06 14:08:47 +01:00
Matthew Honnibal
63e3d4e27f * Add comment on Vocab.__reduce__ 2016-01-19 20:11:25 +01:00
Henning Peters
235f094534 untangle data_path/via 2016-01-16 12:23:45 +01:00
Henning Peters
846fa49b2a distinct load() and from_package() methods 2016-01-16 10:00:57 +01:00
Henning Peters
788f734513 refactored data_dir->via, add zip_safe, add spacy.load() 2016-01-15 18:01:02 +01:00
Henning Peters
bc229790ac integrate with sputnik 2016-01-13 19:46:17 +01:00
Matthew Honnibal
eaf2ad59f1 * Fix use of mock Package object 2015-12-31 04:13:15 +01:00
Matthew Honnibal
aec130af56 Use util.Package class for io
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().

Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.

Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
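
The design described in aec130af56 (a load() classmethod that accepts either a package object or a plain string) can be pictured with a minimal sketch. This is not spaCy's code: the Package class below is a hypothetical stand-in for util.Package, file_path() and the strings.txt file name are invented for illustration, and the toy Vocab only reads a list of strings.

```python
import os
import tempfile


class Package:
    """Hypothetical stand-in for the util.Package file-system shim."""

    def __init__(self, root):
        self.root = str(root)

    def file_path(self, *parts):
        # Resolve a path relative to the package's data directory.
        return os.path.join(self.root, *parts)


class Vocab:
    """Toy class for illustration; it only holds a list of strings."""

    def __init__(self, strings):
        self.strings = strings

    @classmethod
    def load(cls, via):
        # Accept either a Package object or a plain directory string,
        # wrapping the string so the rest of the method sees one type.
        package = via if isinstance(via, Package) else Package(via)
        # "strings.txt" is an assumed file name, not the real data layout.
        with open(package.file_path("strings.txt"), encoding="utf8") as f:
            return cls([line.strip() for line in f if line.strip()])


# Build a throwaway data directory and load it both ways.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "strings.txt"), "w", encoding="utf8") as f:
    f.write("apple\nbanana\n")

from_string = Vocab.load(data_dir)
from_package = Vocab.load(Package(data_dir))
```

Normalising the argument at the top of load() keeps callers free of any Sputnik setup, which appears to be the point of the change: a plain path is enough, and a richer package object can be swapped in later without changing the call sites.
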
Matthew Honnibal
0e2498da00 * Replace from_package with load() classmethod in Vocab 2015-12-29 16:56:51 +01:00
Henning Peters
8359bd4d93 strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible 2015-12-18 09:52:55 +01:00
Henning Peters
9027cef3bc access model via sputnik 2015-12-07 06:01:28 +01:00
Matthew Honnibal
6ed3aedf79 * Merge vocab changes 2015-11-06 00:48:08 +11:00
Matthew Honnibal
1e99fcd413 * Rename .repvec to .vector in C API 2015-11-03 23:47:59 +11:00
Matthew Honnibal
5887506f5d * Don't expect lexemes.bin in Vocab 2015-11-03 13:23:39 +11:00
Matthew Honnibal
f11030aadc * Remove out-dated TODO comment 2015-10-26 12:33:38 +11:00
Matthew Honnibal
a371a1071d * Save and load word vectors during pickling, re Issue #125 2015-10-26 12:33:04 +11:00
Matthew Honnibal
314090cc78 * Set vectors length when unpickling vocab, re Issue #125 2015-10-26 12:05:08 +11:00
Matthew Honnibal
2348a08481 * Load/dump strings with a json file, instead of the hacky strings file we were using. 2015-10-22 21:13:03 +11:00
Matthew Honnibal
7a15d1b60c * Add Python 2/3 compatibility fix for copy_reg 2015-10-13 20:04:40 +11:00
Matthew Honnibal
20fd36a0f7 * Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve. 2015-10-13 13:44:41 +11:00
Matthew Honnibal
f8de403483 * Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125 2015-10-13 13:44:41 +11:00
Matthew Honnibal
85e7944572 * Start trying to pickle Vocab 2015-10-13 13:44:41 +11:00
Matthew Honnibal
41012907a8 * Fix variable name 2015-10-13 13:44:40 +11:00
Matthew Honnibal
37b909b6b6 * Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd 2015-10-13 13:44:40 +11:00
Matthew Honnibal
d70e8cac2c * Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore 2015-10-13 13:44:40 +11:00
Matthew Honnibal
a29c8ee23d * Add symbols to the vocab before reading the strings, so that they line up correctly 2015-10-13 13:44:39 +11:00
Matthew Honnibal
85ce36ab11 * Refactor symbols, so that frequency rank can be derived from the orth id of a word. 2015-10-13 13:44:39 +11:00
Matthew Honnibal
83dccf0fd7 * Use io module instead of deprecated codecs module 2015-10-10 14:13:01 +11:00
Matthew Honnibal
3d9f41c2c9 * Add LookupError for better error reporting in Vocab 2015-10-06 10:34:59 +11:00
alvations
8caedba42a caught more codecs.open -> io.open 2015-09-30 20:20:09 +02:00
Matthew Honnibal
abf0d930af * Fix API for loading word vectors from a file. 2015-09-23 23:51:08 +10:00
Matthew Honnibal
f7283a5067 * Fix vectors bugs for OOV words 2015-09-22 02:10:25 +02:00
Matthew Honnibal
ac459278d1 * Fix vector length error reporting, and ensure vec_len is returned 2015-09-21 18:08:32 +10:00
Matthew Honnibal
ba4e563701 * Ensure vectors are same length, and return vector length in load_vectors_bz2 2015-09-21 18:03:08 +10:00
Matthew Honnibal
d6945bf880 * Add way to load vectors from bz2 file to vocab 2015-09-17 12:58:23 +10:00
Matthew Honnibal
3d87519f64 * Remove vectors argument from Vocab object 2015-09-15 14:47:14 +10:00
Matthew Honnibal
27f988b167 * Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects. 2015-09-15 14:41:48 +10:00
Matthew Honnibal
e9c59693ea * Remove assertion from vocab.pyx 2015-09-13 10:30:08 +10:00
Matthew Honnibal
e1dfaeed8a * Check serializer freqs exist before loading 2015-09-12 23:49:38 +02:00
Matthew Honnibal
a412c66c8c * Check serializer freqs exist before loading 2015-09-12 23:40:01 +02:00
Matthew Honnibal
e285ca7d6c * Load serializer freqs in vocab 2015-09-10 15:22:48 +02:00
Matthew Honnibal
094440f9f5 Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop 2015-09-10 14:51:17 +02:00