Matthew Honnibal | 314090cc78 | * Set vectors length when unpickling vocab, re Issue #125 | 2015-10-26 12:05:08 +11:00
Matthew Honnibal | 2348a08481 | * Load/dump strings with a json file, instead of the hacky strings file we were using. | 2015-10-22 21:13:03 +11:00
Matthew Honnibal | 7a15d1b60c | * Add Python 2/3 compatibility fix for copy_reg | 2015-10-13 20:04:40 +11:00
Matthew Honnibal | 20fd36a0f7 | * Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve. | 2015-10-13 13:44:41 +11:00
Matthew Honnibal | f8de403483 | * Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125 | 2015-10-13 13:44:41 +11:00
Matthew Honnibal | 85e7944572 | * Start trying to pickle Vocab | 2015-10-13 13:44:41 +11:00
Matthew Honnibal | 41012907a8 | * Fix variable name | 2015-10-13 13:44:40 +11:00
Matthew Honnibal | 37b909b6b6 | * Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd | 2015-10-13 13:44:40 +11:00
Matthew Honnibal | d70e8cac2c | * Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore | 2015-10-13 13:44:40 +11:00
Matthew Honnibal | a29c8ee23d | * Add symbols to the vocab before reading the strings, so that they line up correctly | 2015-10-13 13:44:39 +11:00
Matthew Honnibal | 85ce36ab11 | * Refactor symbols, so that frequency rank can be derived from the orth id of a word. | 2015-10-13 13:44:39 +11:00
Matthew Honnibal | 83dccf0fd7 | * Use io module instead of deprecated codecs module | 2015-10-10 14:13:01 +11:00
Matthew Honnibal | 3d9f41c2c9 | * Add LookupError for better error reporting in Vocab | 2015-10-06 10:34:59 +11:00
alvations | 8caedba42a | caught more codecs.open -> io.open | 2015-09-30 20:20:09 +02:00
Matthew Honnibal | abf0d930af | * Fix API for loading word vectors from a file. | 2015-09-23 23:51:08 +10:00
Matthew Honnibal | f7283a5067 | * Fix vectors bugs for OOV words | 2015-09-22 02:10:25 +02:00
Matthew Honnibal | ac459278d1 | * Fix vector length error reporting, and ensure vec_len is returned | 2015-09-21 18:08:32 +10:00
Matthew Honnibal | ba4e563701 | * Ensure vectors are same length, and return vector length in load_vectors_bz2 | 2015-09-21 18:03:08 +10:00
Matthew Honnibal | d6945bf880 | * Add way to load vectors from bz2 file to vocab | 2015-09-17 12:58:23 +10:00
Matthew Honnibal | 3d87519f64 | * Remove vectors argument from Vocab object | 2015-09-15 14:47:14 +10:00
Matthew Honnibal | 27f988b167 | * Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects. | 2015-09-15 14:41:48 +10:00
Matthew Honnibal | e9c59693ea | * Remove assertion from vocab.pyx | 2015-09-13 10:30:08 +10:00
Matthew Honnibal | e1dfaeed8a | * Check serializer freqs exist before loading | 2015-09-12 23:49:38 +02:00
Matthew Honnibal | a412c66c8c | * Check serializer freqs exist before loading | 2015-09-12 23:40:01 +02:00
Matthew Honnibal | e285ca7d6c | * Load serializer freqs in vocab | 2015-09-10 15:22:48 +02:00
Matthew Honnibal | 094440f9f5 | Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop | 2015-09-10 14:51:17 +02:00
Matthew Honnibal | 90da3a695d | * Load lemmatizer from disk in Vocab.from_dir | 2015-09-10 14:49:10 +02:00
Matthew Honnibal | f634191e27 | * Fix vocab read/write | 2015-09-10 14:44:38 +02:00
Matthew Honnibal | a7f4b26c8c | * Tmp | 2015-09-09 14:33:26 +02:00
Matthew Honnibal | d6561988cf | * Fix lexemes.bin | 2015-09-09 11:49:51 +02:00
Matthew Honnibal | c301bebd33 | Merge branch 'master' of https://github.com/honnibal/spaCy into develop | 2015-09-09 10:55:39 +02:00
Matthew Honnibal | 623329b19a | Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop | 2015-09-08 14:27:01 +02:00
Matthew Honnibal | 62a01dd41d | * Fix issue #92: lexemes.bin read error on 32-bit platforms. | 2015-09-08 14:23:58 +02:00
Matthew Honnibal | f6ec5bf1b0 | * Use empty tag map in vocab if none supplied | 2015-09-06 20:19:27 +02:00
Matthew Honnibal | 534e3dda3c | * More work on language independent parsing | 2015-08-28 03:44:54 +02:00
Matthew Honnibal | c2307fa9ee | * More work on language-generic parsing | 2015-08-28 02:02:33 +02:00
Matthew Honnibal | 1302d35dff | * Rework interfaces in vocab | 2015-08-26 19:21:46 +02:00
Matthew Honnibal | 6f1743692a | * Work on language-independent refactoring | 2015-08-23 20:49:18 +02:00
Matthew Honnibal | cad0cca4e3 | * Tmp | 2015-08-22 22:04:34 +02:00
Matthew Honnibal | 3d43f49f69 | * Revert prev change | 2015-07-27 10:58:15 +02:00
Matthew Honnibal | 6b586cdad4 | * Change lexemes.bin format. Add a header specifying size of LexemeC and number of lexemes, and don't have the redundant orth information. | 2015-07-27 08:31:51 +02:00
Matthew Honnibal | 8e4c69ee8c | * Add is_oov property, and fix up handling of attributes | 2015-07-27 01:50:06 +02:00
Matthew Honnibal | fc268f03eb | * Assert against null pointer exceptions in vocab | 2015-07-27 01:00:10 +02:00
Matthew Honnibal | 0f093fdb30 | * Fix get_by_orth for py3 | 2015-07-26 19:26:41 +02:00
Matthew Honnibal | ceeda5a739 | * Fix get_by_orth for py3 | 2015-07-26 18:39:27 +02:00
Matthew Honnibal | 6bb96c122d | * Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects | 2015-07-26 16:37:16 +02:00
Matthew Honnibal | 7eb2446082 | * Return empty lexeme on empty string | 2015-07-26 00:18:30 +02:00
Matthew Honnibal | fd525f0675 | * Pass OOV probability around | 2015-07-25 23:29:51 +02:00
Matthew Honnibal | 22028602a9 | * Add unicode_literals declaration in vocab.pyx | 2015-07-23 13:24:20 +02:00
Matthew Honnibal | a7c4d72e83 | * Add serializer property to Vocab, and lazy-load it. Add get_by_orth method. | 2015-07-23 01:18:19 +02:00
Matthew Honnibal | 109106a949 | * Replace UniStr, using unicode objects instead | 2015-07-22 04:52:05 +02:00
Matthew Honnibal | 1f7170e0e1 | * Reinstate the fixed vocabulary --- words are only added to the lexicon in init_model, after that we create LexemeC structs with the Pool given to us. | 2015-07-20 01:37:34 +02:00
Matthew Honnibal | 317cbbc015 | * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. | 2015-07-19 15:18:17 +02:00
Matthew Honnibal | 82d84b0f2b | * Index lexemes by orth, instead of a lexemes vector. Breaks the mechanism for deciding not to own LexemeC structs during parsing. Need to reinstate this. | 2015-07-18 22:42:15 +02:00
Matthew Honnibal | c2c83120d4 | * Remove codec property from Vocab | 2015-07-17 16:40:11 +02:00
Matthew Honnibal | db9dfd2e23 | * Major refactor of serialization. Nearly complete now. | 2015-07-17 01:27:54 +02:00
Matthew Honnibal | 2a5d050134 | * Give codec loading back to Vocab. | 2015-07-16 17:45:42 +02:00
Matthew Honnibal | b59d271510 | * Move serialization functionality into Serializer class | 2015-07-16 11:23:48 +02:00
Matthew Honnibal | af5cc926a4 | * Add codec property to Vocab, to use the Huffman encoding | 2015-07-13 13:55:14 +02:00
Matthew Honnibal | abc43b852d | * Add pos_tags attr to Vocab. | 2015-07-08 12:36:38 +02:00
Matthew Honnibal | c04e6ebca6 | * Allow user to load different sized vectors. | 2015-06-05 16:26:39 +02:00
Matthew Honnibal | adeb57cb1e | * Fix long line | 2015-06-01 23:07:00 +02:00
Matthew Honnibal | eba7b34f66 | * Add flag to disable loading of word vectors | 2015-05-25 01:02:42 +02:00
Matthew Honnibal | e73eaf2d05 | * Replace some assertions with proper errors | 2015-05-08 16:52:17 +02:00
Jordan Suchow | 3a8d9b37a6 | Remove trailing whitespace | 2015-04-19 13:01:38 -07:00
Matthew Honnibal | f0e0588833 | * Fill L2 norm attribute on LexemeC struct | 2015-02-07 08:44:42 -05:00
Matthew Honnibal | 76d9394cb4 | * Fix vocab.pyx for Python3 | 2015-02-01 13:14:04 +11:00
Matthew Honnibal | ce3ae8b5d9 | * Fix platform-specific lexicon bug. | 2015-01-31 16:38:58 +11:00
Matthew Honnibal | d4a493855e | * Fix error msg | 2015-01-25 23:01:30 +11:00
Matthew Honnibal | c1c3dba4cb | * Check whether vector files are present before trying to load them. | 2015-01-25 18:16:48 +11:00
Matthew Honnibal | fda94271af | * Rename NORM1 and NORM2 attrs to lower and norm | 2015-01-24 06:17:03 +11:00
Matthew Honnibal | d460c28838 | * Rename vec to repvec | 2015-01-22 02:06:22 +11:00
Matthew Honnibal | 6c7e44140b | * Work on word vectors, and other stuff | 2015-01-17 16:21:17 +11:00
Matthew Honnibal | 7d3c40de7d | * Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme | 2015-01-15 00:33:16 +11:00
Matthew Honnibal | 0930892fc1 | * Tmp. Working on refactor. Compiles, must hook up lexical feats. | 2015-01-14 00:03:48 +11:00
Matthew Honnibal | 46da3d74d2 | * Tmp. Refactoring, introducing a Lexeme PyObject. | 2015-01-12 11:23:44 +11:00
Matthew Honnibal | ce2edd6312 | * Tmp commit. Refactoring to create a Python Lexeme class. | 2015-01-12 10:26:22 +11:00
Matthew Honnibal | a58920cc5e | * Import orth.word_shape as a C module | 2015-01-06 03:18:22 +11:00
Matthew Honnibal | f5d41028b5 | * Move around data files for test release | 2015-01-03 01:59:22 +11:00
Matthew Honnibal | bb80937544 | * Upd docstrings | 2014-12-27 18:45:16 +11:00
Matthew Honnibal | b8b65903fc | * Tmp | 2014-12-24 17:42:00 +11:00
Matthew Honnibal | 73f200436f | * Tests passing except for morphology/lemmatization stuff | 2014-12-23 11:40:32 +11:00
Matthew Honnibal | 2a89d70429 | * Add vocab.pyx to setup, and ensure we can import spacy.en.lang | 2014-12-21 06:03:53 +11:00
Matthew Honnibal | e1c1a4b868 | * Tmp | 2014-12-21 05:36:29 +11:00
Matthew Honnibal | d11c1edf8c | * Import slice_unicode from strings.pyx | 2014-12-20 07:56:26 +11:00
Matthew Honnibal | 116f7f3bc1 | * Rename Lexicon to Vocab, and move it to its own file | 2014-12-20 06:54:03 +11:00