Matthew Honnibal | 5a20dfc03e | * Add memory management code | 2014-09-17 20:02:06 +02:00
Matthew Honnibal | 0152831c89 | * Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. | 2014-09-16 18:01:46 +02:00
Matthew Honnibal | 143e51ec73 | * Refactor tokenization, splitting it into a clearer life-cycle. | 2014-09-16 13:16:02 +02:00
Matthew Honnibal | c396581a0b | * Fiddle with the way strings are interned in lexeme | 2014-09-15 06:34:45 +02:00
Matthew Honnibal | 0bb547ab98 | * Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance | 2014-09-15 06:34:10 +02:00
Matthew Honnibal | 7959141d36 | * Add a few abbreviations, to get tests to pass | 2014-09-15 06:32:18 +02:00
Matthew Honnibal | d235299260 | * Few nips and tucks to hash table | 2014-09-15 05:03:44 +02:00
Matthew Honnibal | e68a431e5e | * Pass only the tokens vector to _tokenize, instead of the whole python object. | 2014-09-15 04:01:38 +02:00
Matthew Honnibal | 08cef75ffd | * Switch to using a heap-allocated vector in tokens | 2014-09-15 03:46:14 +02:00
Matthew Honnibal | f77b7098c0 | * Upd Tokens to use vector, with bounds checking. | 2014-09-15 03:22:40 +02:00
Matthew Honnibal | 0f6bf2a2ee | * Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector. | 2014-09-15 02:08:39 +02:00
Matthew Honnibal | df24e3708c | * Move EnglishTokens stuff to Tokens | 2014-09-15 01:31:44 +02:00
Matthew Honnibal | bd08cb09a2 | * Remove short-circuiting of initial_size argument for PointerHash | 2014-09-15 01:30:49 +02:00
Matthew Honnibal | f3393cf57c | * Improve interface for PointerHash | 2014-09-13 17:29:58 +02:00
Matthew Honnibal | 45865be37e | * Switch hash interface, using void* instead of size_t, to avoid casts. | 2014-09-13 17:02:06 +02:00
Matthew Honnibal | 0447279c57 | * PointerHash working, efficiency is good. 6-7 mins | 2014-09-13 16:43:59 +02:00
Matthew Honnibal | 85d68e8e95 | * Replaced cache with own hash table. Similar timing | 2014-09-13 03:14:43 +02:00
Matthew Honnibal | c8db76e3e1 | * Add initial work on simple hash table | 2014-09-13 02:02:41 +02:00
Matthew Honnibal | afdc9b7ac2 | * More performance fiddling, particularly moving the specials into the cache, so that we can just lookup the cache in _tokenize | 2014-09-13 00:59:34 +02:00
Matthew Honnibal | 7d239df4c8 | * Fiddle with declarations, for small efficiency boost | 2014-09-13 00:31:53 +02:00
Matthew Honnibal | a8e7cce30f | * Efficiency tweaks | 2014-09-13 00:14:05 +02:00
Matthew Honnibal | 126a8453a5 | * Fix performance issues by implementing a better cache. Add own String struct to help | 2014-09-12 23:50:37 +02:00
Matthew Honnibal | 9298e36b36 | * Move special tokenization into its own lookup table, away from the cache. | 2014-09-12 19:43:14 +02:00
Matthew Honnibal | 985bc68327 | * Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation. | 2014-09-12 18:26:26 +02:00
Matthew Honnibal | 7eab281194 | * Fiddle with token features | 2014-09-12 15:49:55 +02:00
Matthew Honnibal | 5aa591106b | * Fiddle with token features | 2014-09-12 15:49:36 +02:00
Matthew Honnibal | 1533041885 | * Update the split_one method, so that it doesn't need to cast back to a Python object | 2014-09-12 05:10:59 +02:00
Matthew Honnibal | 4817277d66 | * Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery. | 2014-09-12 04:29:09 +02:00
Matthew Honnibal | 8b20e9ad97 | * Delete unused _split method | 2014-09-12 04:03:52 +02:00
Matthew Honnibal | a4863686ec | * Changed cache to use a linked-list data structure, to take out Python list code. Taking 6-7 mins for gigaword. | 2014-09-12 03:30:50 +02:00
Matthew Honnibal | 51e2006a65 | * Increase cache size. Processing now 6-7 mins | 2014-09-12 02:52:34 +02:00
Matthew Honnibal | e096f30161 | * Tweak signatures and refactor slightly. Processing gigaword taking 8-9 mins. Tests passing, but some sort of memory bug on exit. | 2014-09-12 02:43:36 +02:00
Matthew Honnibal | 073ee0de63 | * Restore dense_hash_map for cache dictionary. Seems to double efficiency | 2014-09-12 02:23:51 +02:00
Matthew Honnibal | 3c928fb5e0 | * Switch to 64 bit hashes, for better reliability | 2014-09-12 02:04:47 +02:00
Matthew Honnibal | 2389bd1b10 | * Improve cache mechanism by including a random element depending on the size of the cache. | 2014-09-12 00:19:16 +02:00
Matthew Honnibal | c8f7c8bfde | * Moving to storing LexemeC structs internally | 2014-09-11 21:54:34 +02:00
Matthew Honnibal | bf9c60c31c | * Moving to storing LexemeC structs internally | 2014-09-11 21:44:58 +02:00
Matthew Honnibal | 563047e90f | * Switch to returning a Tokens object | 2014-09-11 21:37:32 +02:00
Matthew Honnibal | 1a3222af4b | * Moving tokens to use an array internally, instead of a list of Lexeme objects. | 2014-09-11 16:57:08 +02:00
Matthew Honnibal | 5b1c651661 | * Only store LexemeC structs in the vocabulary, transforming them to Lexeme objects for output. Moving away from Lexeme objects for Tokens soon. | 2014-09-11 12:28:38 +02:00
Matthew Honnibal | e567713429 | * Moving back to lexeme structs | 2014-09-10 20:41:47 +02:00
Matthew Honnibal | b488224c09 | * Restoring Lexeme-as-struct | 2014-09-10 20:41:37 +02:00
Matthew Honnibal | 7c09c73a14 | * Refactor to use tokens class. | 2014-09-10 18:27:44 +02:00
Matthew Honnibal | cf412adba8 | * Refactoring to use Tokens object | 2014-09-10 18:11:13 +02:00
Matthew Honnibal | 8fbe9b6f97 | * Bug fixes to flag features | 2014-09-01 23:41:31 +02:00
Matthew Honnibal | 151aa14bba | * Add asciify string transform, and other bits. | 2014-09-01 23:25:28 +02:00
Matthew Honnibal | c4ba216642 | * Switch canon_case to get value, to avoid keyerror | 2014-09-01 17:27:36 +02:00
Matthew Honnibal | a779275a59 | * Add canon_case function | 2014-08-30 20:57:43 +02:00
Matthew Honnibal | 8bbfadfced | * Pass tests. Need to implement more feature functions. | 2014-08-30 20:36:06 +02:00
Matthew Honnibal | dcab14ede2 | * Begin testing more functionality | 2014-08-30 19:01:15 +02:00
Matthew Honnibal | 3e3ff99ca0 | * Add orth features | 2014-08-30 19:01:00 +02:00
Matthew Honnibal | 4e5b2d47e2 | * More docs | 2014-08-29 03:01:40 +02:00
Matthew Honnibal | 5233f110c4 | * Adding PTB3 tokenizer back in, so can understand how much boilerplate is in the docs for multiple tokenizers | 2014-08-29 02:30:27 +02:00
Matthew Honnibal | 45a22d6b2c | * Docs coming together | 2014-08-29 01:59:23 +02:00
Matthew Honnibal | c282e6d5fb | * Redesign proceeding | 2014-08-28 19:45:09 +02:00
Matthew Honnibal | fd4e61e58b | * Fixed contraction tests. Need to correct problem with the way case stats and tag stats are supposed to work. | 2014-08-27 20:22:33 +02:00
Matthew Honnibal | fdaf24604a | * Basic punct tests updated and passing | 2014-08-27 19:38:57 +02:00
Matthew Honnibal | 8d20617dfd | * Whitespace | 2014-08-27 17:16:16 +02:00
Matthew Honnibal | e9a62b6eba | * Refactoring with Lexeme as a class now compiles. Basic design seems to work | 2014-08-27 17:15:39 +02:00
Matthew Honnibal | 68bae2fec6 | * More refactoring | 2014-08-25 16:42:22 +02:00
Matthew Honnibal | 88095666dc | * Remove Lexeme struct, preparing to rename Word to Lexeme. | 2014-08-24 19:24:42 +02:00
Matthew Honnibal | ce59526011 | * Add Word classes | 2014-08-24 18:14:08 +02:00
Matthew Honnibal | 3b793cf4f7 | * Tests passing for new Word object version | 2014-08-24 18:13:53 +02:00
Matthew Honnibal | 9815c7649e | * Refactor around Word objects, adapting tests. Tests passing, except for string views. | 2014-08-23 19:55:06 +02:00
Matthew Honnibal | 4f01df9152 | * Moving to Word objects in place of the Lexeme struct. | 2014-08-22 17:32:16 +02:00
Matthew Honnibal | 782806df08 | * Moving to Word objects in place of the Lexeme struct. | 2014-08-22 17:28:23 +02:00
Matthew Honnibal | 47fbd0475a | * Replace the use of dense_hash_map with Python dict | 2014-08-22 17:13:09 +02:00
Matthew Honnibal | e289896603 | * Fix ptb3 module | 2014-08-22 16:36:17 +02:00
Matthew Honnibal | 89d6faa9c9 | * Move en_ptb to ptb3 | 2014-08-22 04:24:05 +02:00
Matthew Honnibal | 07ecf5d2f4 | * Fixed group_by, removed idea of general attr_of function. | 2014-08-22 00:02:37 +02:00
Matthew Honnibal | 811b7a6b91 | * Struggling with arbitrary attr access... | 2014-08-21 23:49:14 +02:00
Matthew Honnibal | 314658b31c | * Improve module docstring | 2014-08-21 18:42:47 +02:00
Matthew Honnibal | d10993f41a | * More docs work | 2014-08-21 16:37:13 +02:00
Matthew Honnibal | 248cbb6d07 | * Update doc strings | 2014-08-21 03:29:15 +02:00
Matthew Honnibal | 76afbd7d69 | * Remove compiled orthography file | 2014-08-20 17:04:07 +02:00
Matthew Honnibal | f39dcb1d89 | * Add orthography | 2014-08-20 17:03:44 +02:00
Matthew Honnibal | a78ad4152d | * Broken version being refactored for docs | 2014-08-20 13:39:39 +02:00
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5fddb8d165
							
						
					 | 
					
						
						
							
							* Working refactor, with updated data model for Lexemes
						
						
						
						
						
					 | 
					
						2014-08-19 04:21:20 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3379d7a571
							
						
					 | 
					
						
						
							
							* Reforming data model for lexemes
						
						
						
						
						
					 | 
					
						2014-08-19 02:40:37 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ab9b0daabf
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-08-18 23:21:49 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1b71cbfe28
							
						
					 | 
					
						
						
							
							* Roll back to using unicode, and never Py_UNICODE. No dependence on murmurhash either.
						
						
						
						
						
					 | 
					
						2014-08-18 20:48:48 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bbf9a2c944
							
						
					 | 
					
						
						
							
							* Working version that uses arrays for chunks, which should be more memory efficient
						
						
						
						
						
					 | 
					
						2014-08-18 20:23:54 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8d3f6082be
							
						
					 | 
					
						
						
							
							* Working version, adding improvements
						
						
						
						
						
					 | 
					
						2014-08-18 19:59:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							01469b0888
							
						
					 | 
					
						
						
							
							* Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word.
						
						
						
						
						
					 | 
					
						2014-08-18 19:14:00 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b94c9b72c9
							
						
					 | 
					
						
						
							
							* WordTree in use. Need to reform the way chunks are handled. Should be properly one Lexeme per word, with split points being the things that are cached.
						
						
						
						
						
					 | 
					
						2014-08-16 20:10:22 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							34b68a18ab
							
						
					 | 
					
						
						
							
							* Progress to getting WordTree working. Tests pass, but so far it's slower.
						
						
						
						
						
					 | 
					
						2014-08-16 19:59:38 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							865cacfaf7
							
						
					 | 
					
						
						
							
							* Remove dependence on murmurhash
						
						
						
						
						
					 | 
					
						2014-08-16 17:37:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							515d41d325
							
						
					 | 
					
						
						
							
							* Restore string saving to spacy
						
						
						
						
						
					 | 
					
						2014-08-16 16:09:24 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							36073b89fe
							
						
					 | 
					
						
						
							
							* Restore unicode, work on improving string storage.
						
						
						
						
						
					 | 
					
						2014-08-16 14:35:34 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a225ca5b0d
							
						
					 | 
					
						
						
							
							* Refactoring tokenizer
						
						
						
						
						
					 | 
					
						2014-08-16 03:22:03 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							213a440ffc
							
						
					 | 
					
						
						
							
							* Add string decode and encode helpers to string_tools
						
						
						
						
						
					 | 
					
						2014-08-15 23:57:27 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f11c8e22eb
							
						
					 | 
					
						
						
							
							* Remove happax stuff
						
						
						
						
						
					 | 
					
						2014-08-02 22:11:28 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d6e07aa922
							
						
					 | 
					
						
						
							
							* Switch to 32bit hash for strings
						
						
						
						
						
					 | 
					
						2014-08-02 21:51:52 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							365a2af756
							
						
					 | 
					
						
						
							
							* Restore happax. commit uncommitted work
						
						
						
						
						
					 | 
					
						2014-08-02 21:27:03 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6319ff0f22
							
						
					 | 
					
						
						
							
							* Add length property
						
						
						
						
						
					 | 
					
						2014-08-02 21:26:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							18fb76b2c4
							
						
					 | 
					
						
						
							
							* Removed happax. Not sure if good idea.
						
						
						
						
						
					 | 
					
						2014-08-02 20:53:35 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							edd38a84b1
							
						
					 | 
					
						
						
							
							* Removing happax stuff. Added length
						
						
						
						
						
					 | 
					
						2014-08-02 20:45:12 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fc7c10d7f8
							
						
					 | 
					
						
						
							
							* Ugly but seemingly working fix to the token memory leak
						
						
						
						
						
					 | 
					
						2014-08-01 09:43:19 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c7bb6b329c
							
						
					 | 
					
						
						
							
							* Don't free clobbered lexemes, as they might be part of a tail
						
						
						
						
						
					 | 
					
						2014-08-01 08:22:38 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c48214460e
							
						
					 | 
					
						
						
							
							* Free lexemes clobbered as happaxes
						
						
						
						
						
					 | 
					
						2014-08-01 07:40:20 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5b6457e80e
							
						
					 | 
					
						
						
							
							* Free lexemes clobbered as happaxes
						
						
						
						
						
					 | 
					
						2014-08-01 07:37:50 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d8cb2288ce
							
						
					 | 
					
						
						
							
							* Roll back to using murmurhash2 for now
						
						
						
						
						
					 | 
					
						2014-08-01 07:28:47 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f39211b2b1
							
						
					 | 
					
						
						
							
							* Add FixedTable for hashing
						
						
						
						
						
					 | 
					
						2014-08-01 07:27:21 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a44e15f623
							
						
					 | 
					
						
						
							
							* Hack around lack of distribution features for now.
						
						
						
						
						
					 | 
					
						2014-07-31 18:24:51 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4cb88c940b
							
						
					 | 
					
						
						
							
							* Fix memory leak in tokenizer, caused by having a fixed vocab.
						
						
						
						
						
					 | 
					
						2014-07-31 18:19:38 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5b81ee716f
							
						
					 | 
					
						
						
							
							* Use a sparse_hash_map to store happax vocab items, with a max size.
						
						
						
						
						
					 | 
					
						2014-07-31 17:40:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b9016c4633
							
						
					 | 
					
						
						
							
							* Switch to using sparsehash and murmurhash libraries out of pip
						
						
						
						
						
					 | 
					
						2014-07-25 15:47:27 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a895fe5ddb
							
						
					 | 
					
						
						
							
							* Upd from spacy
						
						
						
						
						
					 | 
					
						2014-07-23 17:35:18 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							87bf205b82
							
						
					 | 
					
						
						
							
							* Fix open apostrophe bug
						
						
						
						
						
					 | 
					
						2014-07-07 23:26:01 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							571808a274
							
						
					 | 
					
						
						
							
							Group-by seems to be working
						
						
						
						
						
					 | 
					
						2014-07-07 20:27:02 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							80b36f9f27
							
						
					 | 
					
						
						
							
							* 710k words per second for counts
						
						
						
						
						
					 | 
					
						2014-07-07 19:12:19 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							057c21969b
							
						
					 | 
					
						
						
							
							* Refactor for string view features. Working on setting up flags and enums.
						
						
						
						
						
					 | 
					
						2014-07-07 16:58:48 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f1bcbd4c4e
							
						
					 | 
					
						
						
							
							* Reorganized code to accommodate Tokens class. Need string views before group_by and count_by can be done well.
						
						
						
						
						
					 | 
					
						2014-07-07 12:47:21 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6668e44961
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-07-07 08:15:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0074ae2fc0
							
						
					 | 
					
						
						
							
							* Switch to dynamically allocating array, based on the document length
						
						
						
						
						
					 | 
					
						2014-07-07 08:05:29 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ff1869ff07
							
						
					 | 
					
						
						
							
							* Fixed major efficiency problem, from not quite grokking pass by reference in cython c++
						
						
						
						
						
					 | 
					
						2014-07-07 07:36:43 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0c76143b72
							
						
					 | 
					
						
						
							
							* Give value for assert
						
						
						
						
						
					 | 
					
						2014-07-07 05:10:46 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e244739dfe
							
						
					 | 
					
						
						
							
							* Fix ptb tokenization
						
						
						
						
						
					 | 
					
						2014-07-07 05:10:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							dc20500920
							
						
					 | 
					
						
						
							
							* Remove cpp files
						
						
						
						
						
					 | 
					
						2014-07-07 05:09:05 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							25849fc926
							
						
					 | 
					
						
						
							
							* Generalize tokenization rules to capitals
						
						
						
						
						
					 | 
					
						2014-07-07 05:07:21 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							df0458001d
							
						
					 | 
					
						
						
							
							* Begin work on full PTB-compatible English tokenization
						
						
						
						
						
					 | 
					
						2014-07-07 04:29:24 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d5bef02c72
							
						
					 | 
					
						
						
							
							* Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals
						
						
						
						
						
					 | 
					
						2014-07-07 04:21:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a62c38e1ef
							
						
					 | 
					
						
						
							
							* Working tokenization. en doesn't match PTB perfectly. Need to reorganize before adding more schemes.
						
						
						
						
						
					 | 
					
						2014-07-07 01:15:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4e79446dc2
							
						
					 | 
					
						
						
							
							* Reading in tokenization rules correctly. Passing tests.
						
						
						
						
						
					 | 
					
						2014-07-07 00:02:55 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							72159e7011
							
						
					 | 
					
						
						
							
							* Fixes to tokenization. Now segment sequences of the same punctuation.
						
						
						
						
						
					 | 
					
						2014-07-06 19:28:42 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e98e97d483
							
						
					 | 
					
						
						
							
							* Possessive test passing
						
						
						
						
						
					 | 
					
						2014-07-06 18:35:55 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							556f6a18ca
							
						
					 | 
					
						
						
							
							* Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc.
						
						
						
						
						
					 | 
					
						2014-07-05 20:51:42 +02:00 | 
					
					
						
						
							
							
							
						
					 |