| Matthew Honnibal | a7c4d72e83 | * Add serializer property to Vocab, and lazy-load it. Add get_by_orth method. | 2015-07-23 01:18:19 +02:00 |
| Matthew Honnibal | 6ab1696b15 | * Remove read_encoding_freqs from util.py | 2015-07-23 01:17:32 +02:00 |
| Matthew Honnibal | d5255aad77 | * Update freqs for missing tags in ner, for serializer | 2015-07-23 01:17:11 +02:00 |
| Matthew Honnibal | 12699a1152 | * Set initial freqs, to avoid missing values in serializer | 2015-07-23 01:16:27 +02:00 |
| Matthew Honnibal | 680bb47b55 | * Write serializer freqs to single file, vocab/serializer.json | 2015-07-23 01:15:25 +02:00 |
| Matthew Honnibal | a0e36e8efc | * Add working to/from bytes API to Doc | 2015-07-23 01:14:45 +02:00 |
| Matthew Honnibal | 1f31d96bf9 | * Fix Packer API, so that it reads and writes bytes strings, instead of BitArray. Docs are always byte aligned anyway. | 2015-07-23 01:13:02 +02:00 |
| Matthew Honnibal | 38ef986b29 | * Update spacy/en/attrs.pxd | 2015-07-23 01:10:58 +02:00 |
| Matthew Honnibal | 06eac32610 | * Add cfile.pyx | 2015-07-23 01:10:36 +02:00 |
| Matthew Honnibal | 2b7bd46508 | * Update get_freqs script | 2015-07-22 15:43:06 +02:00 |
| Matthew Honnibal | 0c507bd80a | * Fix tokenizer | 2015-07-22 14:10:30 +02:00 |
| Matthew Honnibal | c86dbe4944 | * Update English.save_models for new Packer save/load stuff | 2015-07-22 13:40:23 +02:00 |
| Matthew Honnibal | bf77bcd6b9 | * Add comment explaining hash_string | 2015-07-22 13:39:42 +02:00 |
| Matthew Honnibal | 815bda201d | * Remove UniStr struct | 2015-07-22 13:39:17 +02:00 |
| Matthew Honnibal | 2fc66e3723 | * Use Py_UNICODE in tokenizer for now, while sort out Py_UCS4 stuff | 2015-07-22 13:38:45 +02:00 |
| Matthew Honnibal | 4d61239eac | * Reorganize the serialization functions on Doc | 2015-07-22 04:53:01 +02:00 |
| Matthew Honnibal | 109106a949 | * Replace UniStr, using unicode objects instead | 2015-07-22 04:52:05 +02:00 |
| Matthew Honnibal | 386246db5b | * Update init_model, making language resources optional | 2015-07-22 00:25:14 +02:00 |
| Matthew Honnibal | 424854028f | * Fix decode_int32 | 2015-07-21 20:09:59 +00:00 |
| Matthew Honnibal | 304d0e2633 | * Use decode_int32 in _orth_decode | 2015-07-21 20:40:55 +02:00 |
| Matthew Honnibal | 9cfa59ec33 | * Optimistically try orth encoding, with char as a back-off | 2015-07-21 20:22:45 +02:00 |
| Matthew Honnibal | c8b89e37a5 | * Bug fix to faster huffman decoding | 2015-07-21 20:05:53 +02:00 |
| Matthew Honnibal | b166d1d2a2 | * Use encode32 and decode32 | 2015-07-21 19:59:06 +02:00 |
| Matthew Honnibal | c6cd0ddce8 | * Add faster encode_int32 and decode_int32 methods | 2015-07-21 19:58:45 +02:00 |
| Matthew Honnibal | dd60594f41 | * Fix double encoding error in strings.pyx | 2015-07-20 13:52:56 +02:00 |
| Matthew Honnibal | 9cae1b4cad | * Restore accidentally clobbered updates to specials.json | 2015-07-20 12:19:46 +02:00 |
| Matthew Honnibal | 14e9e6ec6c | * Fix ... tokenization, and correct orth inconsistencies in specials.json | 2015-07-20 12:10:56 +02:00 |
| Matthew Honnibal | 06639dc497 | * Add length cap to word shape feature | 2015-07-20 12:06:59 +02:00 |
| Matthew Honnibal | 128b6d9714 | * Move Utf8Str struct to strings module, as that's the only place it's relevant | 2015-07-20 12:06:41 +02:00 |
| Matthew Honnibal | 01a97b90f3 | * Fix header for string store | 2015-07-20 12:06:10 +02:00 |
| Matthew Honnibal | 1c9ea7b835 | * Add tests for short string optimization | 2015-07-20 12:05:45 +02:00 |
| Matthew Honnibal | 52d538ea42 | * Fix short string optimization in strings.pyx. StringStore tests now all pass. | 2015-07-20 12:05:23 +02:00 |
| Matthew Honnibal | 09a3055630 | * Work on short string optimization in Utf8Str | 2015-07-20 11:26:46 +02:00 |
| Matthew Honnibal | bb0ba1f0cd | * Improve serialization speed | 2015-07-20 03:27:59 +02:00 |
| Matthew Honnibal | f13d5dae91 | * Update test_packer | 2015-07-20 01:38:29 +02:00 |
| Matthew Honnibal | fb7202a173 | * Update test_codecs | 2015-07-20 01:38:15 +02:00 |
| Matthew Honnibal | 8743a8c084 | * Update Doc serialization for new Packer interface | 2015-07-20 01:38:04 +02:00 |
| Matthew Honnibal | 1f7170e0e1 | * Reinstate the fixed vocabulary --- words are only added to the lexicon in init_model, after that we create LexemeC structs with the Pool given to us. | 2015-07-20 01:37:34 +02:00 |
| Matthew Honnibal | 5a7d060d9c | * Switch between the orth and char codecs depending on which is shorter for that message. Mostly orth is shorter, except if there are OOV words. | 2015-07-20 01:36:22 +02:00 |
| Matthew Honnibal | 5a042ee0d3 | * Add function to predict number of bits needed to encode message | 2015-07-20 01:35:11 +02:00 |
| Matthew Honnibal | b89b489bb4 | * Implement both character and orth encoding in Packer, so that we can decide which to use per-text | 2015-07-19 22:39:45 +02:00 |
| Matthew Honnibal | ae78c9e3ce | * Implement character-based codec, so that we can do word/char backoff | 2015-07-19 22:03:39 +02:00 |
| Matthew Honnibal | cd1d047cb8 | * Delete out-dated HuffmanCodec comment | 2015-07-19 18:28:14 +02:00 |
| Matthew Honnibal | 879ef9fa3e | * Update tests for huffman codec | 2015-07-19 17:59:51 +02:00 |
| Matthew Honnibal | b8086067d5 | * Build Huffman codec from unsorted inputs | 2015-07-19 17:58:44 +02:00 |
| Matthew Honnibal | 317cbbc015 | * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. | 2015-07-19 15:18:17 +02:00 |
| Matthew Honnibal | 0973e2f107 | * Update serializer tests | 2015-07-18 22:46:40 +02:00 |
| Matthew Honnibal | 6b13e7227c | * Remove duplicate get_lex_attr method from doc.pyx | 2015-07-18 22:46:07 +02:00 |
| Matthew Honnibal | e49c7f1478 | * Update oov check in tokenizer | 2015-07-18 22:45:28 +02:00 |
| Matthew Honnibal | cfd842769e | * Allow infix tokens to be variable length | 2015-07-18 22:45:00 +02:00 |