| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Matthew Honnibal | 2348a08481 | * Load/dump strings with a json file, instead of the hacky strings file we were using. | 2015-10-22 21:13:03 +11:00 |
| Matthew Honnibal | 7a15d1b60c | * Add Python 2/3 compatibility fix for copy_reg | 2015-10-13 20:04:40 +11:00 |
| Matthew Honnibal | 20fd36a0f7 | * Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve. | 2015-10-13 13:44:41 +11:00 |
| Matthew Honnibal | f8de403483 | * Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125 | 2015-10-13 13:44:41 +11:00 |
| Matthew Honnibal | 85e7944572 | * Start trying to pickle Vocab | 2015-10-13 13:44:41 +11:00 |
| Matthew Honnibal | 41012907a8 | * Fix variable name | 2015-10-13 13:44:40 +11:00 |
| Matthew Honnibal | 37b909b6b6 | * Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd | 2015-10-13 13:44:40 +11:00 |
| Matthew Honnibal | d70e8cac2c | * Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore | 2015-10-13 13:44:40 +11:00 |
| Matthew Honnibal | a29c8ee23d | * Add symbols to the vocab before reading the strings, so that they line up correctly | 2015-10-13 13:44:39 +11:00 |
| Matthew Honnibal | 85ce36ab11 | * Refactor symbols, so that frequency rank can be derived from the orth id of a word. | 2015-10-13 13:44:39 +11:00 |
| Matthew Honnibal | 83dccf0fd7 | * Use io module insteads of deprecated codecs module | 2015-10-10 14:13:01 +11:00 |
| Matthew Honnibal | 3d9f41c2c9 | * Add LookupError for better error reporting in Vocab | 2015-10-06 10:34:59 +11:00 |
| alvations | 8caedba42a | caught more codecs.open -> io.open | 2015-09-30 20:20:09 +02:00 |
| Matthew Honnibal | abf0d930af | * Fix API for loading word vectors from a file. | 2015-09-23 23:51:08 +10:00 |
| Matthew Honnibal | f7283a5067 | * Fix vectors bugs for OOV words | 2015-09-22 02:10:25 +02:00 |
| Matthew Honnibal | ac459278d1 | * Fix vector length error reporting, and ensure vec_len is returned | 2015-09-21 18:08:32 +10:00 |
| Matthew Honnibal | ba4e563701 | * Ensure vectors are same length, and return vector length in load_vectors_bz2 | 2015-09-21 18:03:08 +10:00 |
| Matthew Honnibal | d6945bf880 | * Add way to load vectors from bz2 file to vocab | 2015-09-17 12:58:23 +10:00 |
| Matthew Honnibal | 3d87519f64 | * Remove vectors argument from Vocab object | 2015-09-15 14:47:14 +10:00 |
| Matthew Honnibal | 27f988b167 | * Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects. | 2015-09-15 14:41:48 +10:00 |
| Matthew Honnibal | e9c59693ea | * Remove assertion from vocab.pyx | 2015-09-13 10:30:08 +10:00 |
| Matthew Honnibal | e1dfaeed8a | * Check serializer freqs exist before loading | 2015-09-12 23:49:38 +02:00 |
| Matthew Honnibal | a412c66c8c | * Check serializer freqs exist before loading | 2015-09-12 23:40:01 +02:00 |
| Matthew Honnibal | e285ca7d6c | * Load serializer freqs in vocab | 2015-09-10 15:22:48 +02:00 |
| Matthew Honnibal | 094440f9f5 | Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop | 2015-09-10 14:51:17 +02:00 |
| Matthew Honnibal | 90da3a695d | * Load lemmatizer from disk in Vocab.from_dir | 2015-09-10 14:49:10 +02:00 |
| Matthew Honnibal | f634191e27 | * Fix vocab read/write | 2015-09-10 14:44:38 +02:00 |
| Matthew Honnibal | a7f4b26c8c | * Tmp | 2015-09-09 14:33:26 +02:00 |
| Matthew Honnibal | d6561988cf | * Fix lexemes.bin | 2015-09-09 11:49:51 +02:00 |
| Matthew Honnibal | c301bebd33 | Merge branch 'master' of https://github.com/honnibal/spaCy into develop | 2015-09-09 10:55:39 +02:00 |
| Matthew Honnibal | 623329b19a | Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop | 2015-09-08 14:27:01 +02:00 |
| Matthew Honnibal | 62a01dd41d | * Fix issue #92: lexemes.bin read error on 32-bit platforms. | 2015-09-08 14:23:58 +02:00 |
| Matthew Honnibal | f6ec5bf1b0 | * Use empty tag map in vocab if none supplied | 2015-09-06 20:19:27 +02:00 |
| Matthew Honnibal | 534e3dda3c | * More work on language independent parsing | 2015-08-28 03:44:54 +02:00 |
| Matthew Honnibal | c2307fa9ee | * More work on language-generic parsing | 2015-08-28 02:02:33 +02:00 |
| Matthew Honnibal | 1302d35dff | * Rework interfaces in vocab | 2015-08-26 19:21:46 +02:00 |
| Matthew Honnibal | 6f1743692a | * Work on language-independent refactoring | 2015-08-23 20:49:18 +02:00 |
| Matthew Honnibal | cad0cca4e3 | * Tmp | 2015-08-22 22:04:34 +02:00 |
| Matthew Honnibal | 3d43f49f69 | * Revert prev change | 2015-07-27 10:58:15 +02:00 |
| Matthew Honnibal | 6b586cdad4 | * Change lexemes.bin format. Add a header specifying size of LexemeC and number of lexemes, and don't have the redundant orth information. | 2015-07-27 08:31:51 +02:00 |
| Matthew Honnibal | 8e4c69ee8c | * Add is_oov property, and fix up handling of attributes | 2015-07-27 01:50:06 +02:00 |
| Matthew Honnibal | fc268f03eb | * Assert against null pointer exceptions in vocab | 2015-07-27 01:00:10 +02:00 |
| Matthew Honnibal | 0f093fdb30 | * Fix get_by_orth for py3 | 2015-07-26 19:26:41 +02:00 |
| Matthew Honnibal | ceeda5a739 | * Fix get_by_orth for py3 | 2015-07-26 18:39:27 +02:00 |
| Matthew Honnibal | 6bb96c122d | * Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects | 2015-07-26 16:37:16 +02:00 |
| Matthew Honnibal | 7eb2446082 | * Return empty lexeme on empty string | 2015-07-26 00:18:30 +02:00 |
| Matthew Honnibal | fd525f0675 | * Pass OOV probability around | 2015-07-25 23:29:51 +02:00 |
| Matthew Honnibal | 22028602a9 | * Add unicode_literals declaration in vocab.pyx | 2015-07-23 13:24:20 +02:00 |
| Matthew Honnibal | a7c4d72e83 | * Add serializer property to Vocab, and lazy-load it. Add get_by_orth method. | 2015-07-23 01:18:19 +02:00 |
| Matthew Honnibal | 109106a949 | * Replace UniStr, using unicode objects instead | 2015-07-22 04:52:05 +02:00 |
| Matthew Honnibal | 1f7170e0e1 | * Reinstate the fixed vocabulary --- words are only added to the lexicon in init_model, after that we create LexemeC structs with the Pool given to us. | 2015-07-20 01:37:34 +02:00 |
| Matthew Honnibal | 317cbbc015 | * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. | 2015-07-19 15:18:17 +02:00 |
| Matthew Honnibal | 82d84b0f2b | * Index lexemes by orth, instead of a lexemes vector. Breaks the mechanism for deciding not to own LexemeC structs during parsing. Need to reinstate this. | 2015-07-18 22:42:15 +02:00 |
| Matthew Honnibal | c2c83120d4 | * Remove codec property from Vocab | 2015-07-17 16:40:11 +02:00 |
| Matthew Honnibal | db9dfd2e23 | * Major refactor of serialization. Nearly complete now. | 2015-07-17 01:27:54 +02:00 |
| Matthew Honnibal | 2a5d050134 | * Give codec loading back to Vocab. | 2015-07-16 17:45:42 +02:00 |
| Matthew Honnibal | b59d271510 | * Move serialization functionality into Serializer class | 2015-07-16 11:23:48 +02:00 |
| Matthew Honnibal | af5cc926a4 | * Add codec property to Vocab, to use the Huffman encoding | 2015-07-13 13:55:14 +02:00 |
| Matthew Honnibal | abc43b852d | * Add pos_tags attr to Vocab. | 2015-07-08 12:36:38 +02:00 |
| Matthew Honnibal | c04e6ebca6 | * Allow user to load different sized vectors. | 2015-06-05 16:26:39 +02:00 |
| Matthew Honnibal | adeb57cb1e | * Fix long line | 2015-06-01 23:07:00 +02:00 |
| Matthew Honnibal | eba7b34f66 | * Add flag to disable loading of word vectors | 2015-05-25 01:02:42 +02:00 |
| Matthew Honnibal | e73eaf2d05 | * Replace some assertions with proper errors | 2015-05-08 16:52:17 +02:00 |
| Jordan Suchow | 3a8d9b37a6 | Remove trailing whitespace | 2015-04-19 13:01:38 -07:00 |
| Matthew Honnibal | f0e0588833 | * Fill L2 norm attribute on LexemeC struct | 2015-02-07 08:44:42 -05:00 |
| Matthew Honnibal | 76d9394cb4 | * Fix vocab.pyx for Python3 | 2015-02-01 13:14:04 +11:00 |
| Matthew Honnibal | ce3ae8b5d9 | * Fix platform-specific lexicon bug. | 2015-01-31 16:38:58 +11:00 |
| Matthew Honnibal | d4a493855e | * Fix error msg | 2015-01-25 23:01:30 +11:00 |
| Matthew Honnibal | c1c3dba4cb | * Check whether vector files are present before trying to load them. | 2015-01-25 18:16:48 +11:00 |
| Matthew Honnibal | fda94271af | * Rename NORM1 and NORM2 attrs to lower and norm | 2015-01-24 06:17:03 +11:00 |
| Matthew Honnibal | d460c28838 | * Rename vec to repvec | 2015-01-22 02:06:22 +11:00 |
| Matthew Honnibal | 6c7e44140b | * Work on word vectors, and other stuff | 2015-01-17 16:21:17 +11:00 |
| Matthew Honnibal | 7d3c40de7d | * Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme | 2015-01-15 00:33:16 +11:00 |
| Matthew Honnibal | 0930892fc1 | * Tmp. Working on refactor. Compiles, must hook up lexical feats. | 2015-01-14 00:03:48 +11:00 |
| Matthew Honnibal | 46da3d74d2 | * Tmp. Refactoring, introducing a Lexeme PyObject. | 2015-01-12 11:23:44 +11:00 |
| Matthew Honnibal | ce2edd6312 | * Tmp commit. Refactoring to create a Python Lexeme class. | 2015-01-12 10:26:22 +11:00 |
| Matthew Honnibal | a58920cc5e | * Import orth.word_shape as a C module | 2015-01-06 03:18:22 +11:00 |
| Matthew Honnibal | f5d41028b5 | * Move around data files for test release | 2015-01-03 01:59:22 +11:00 |
| Matthew Honnibal | bb80937544 | * Upd docstrings | 2014-12-27 18:45:16 +11:00 |
| Matthew Honnibal | b8b65903fc | * Tmp | 2014-12-24 17:42:00 +11:00 |
| Matthew Honnibal | 73f200436f | * Tests passing except for morphology/lemmatization stuff | 2014-12-23 11:40:32 +11:00 |
| Matthew Honnibal | 2a89d70429 | * Add vocab.pyx to setup, and ensure we can import spacy.en.lang | 2014-12-21 06:03:53 +11:00 |
| Matthew Honnibal | e1c1a4b868 | * Tmp | 2014-12-21 05:36:29 +11:00 |
| Matthew Honnibal | d11c1edf8c | * Import slice_unicode from strings.pyx | 2014-12-20 07:56:26 +11:00 |
| Matthew Honnibal | 116f7f3bc1 | * Rename Lexicon to Vocab, and move it to its own file | 2014-12-20 06:54:03 +11:00 |