Matthew Honnibal | 95ccea03b2 | * Work on greedy parser | 2014-12-16 22:46:55 +11:00
Matthew Honnibal | a432862fde | * Add exception type to _arg_max_among in tagger | 2014-12-16 09:44:19 +11:00
Matthew Honnibal | 9e00798820 | * Work on integrating a greedy dependency parser | 2014-12-16 08:06:04 +11:00
Matthew Honnibal | 792802b2b9 | * POS tag memoisation working, with good speed-up | 2014-12-12 14:33:51 +11:00
Matthew Honnibal | ca54d58638 | * Merge setup.py | 2014-12-10 15:21:27 +11:00
Matthew Honnibal | 9959a64f7b | * Working morphology and lemmatisation. POS tagging quite fast. | 2014-12-10 08:09:32 +11:00
Matthew Honnibal | df3be14987 | * Add pos_type features to POS tagger | 2014-12-10 08:08:55 +11:00
Matthew Honnibal | 42973c4b37 | * Improve efficiency of tagger, and improve morphological processing | 2014-12-10 01:02:04 +11:00
Matthew Honnibal | 6b34a2f34b | * Move morphological analysis into its own module, morphology.pyx | 2014-12-09 21:16:17 +11:00
Matthew Honnibal | b962fe73d7 | * Make suffixes file use full-power regex, so that we can handle periods properly | 2014-12-09 19:04:27 +11:00
Matthew Honnibal | accdbe989b | * Remove Tokens.extend method | 2014-12-09 17:09:23 +11:00
Matthew Honnibal | 495e1c7366 | * Use fused type in Tokens.push_back, simplifying the use of the cache | 2014-12-09 16:50:01 +11:00
Matthew Honnibal | 302e09018b | * Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas | 2014-12-09 14:48:01 +11:00
Matthew Honnibal | 99bbbb6feb | * Work on morphological processing | 2014-12-08 21:12:15 +11:00
Matthew Honnibal | 7b68f911cf | * Add WordNet lemmatizer | 2014-12-08 01:39:13 +11:00
Matthew Honnibal | c20dd79748 | * Fiddle with const correctness and comments | 2014-12-08 00:03:55 +11:00
Matthew Honnibal | b031c7c430 | * Remove language-general context module | 2014-12-07 23:53:01 +11:00
Matthew Honnibal | ef4398b204 | * Rearrange POS stuff, so that language-specific stuff can live in language-specific modules | 2014-12-07 23:52:41 +11:00
Matthew Honnibal | 327383e38a | * Remove unused code in tagger.pyx | 2014-12-07 22:16:17 +11:00
Matthew Honnibal | 9f17467c2e | * Fix EMPTY_TOKEN | 2014-12-07 22:07:41 +11:00
Matthew Honnibal | 3819a88e1b | * Add support for tag dictionary, and fix error-code for predict method | 2014-12-07 22:07:16 +11:00
Matthew Honnibal | f00afe12c4 | * Load POS tagger in load() function if path exists | 2014-12-07 22:05:57 +11:00
Matthew Honnibal | 5fe5e6e66b | * Move context functions to header, inlining them. | 2014-12-07 21:59:04 +11:00
Matthew Honnibal | 5caabec789 | * Link in tagger, to work on integrating POS tagging | 2014-12-07 15:29:41 +11:00
Matthew Honnibal | 0c7aeb9de7 | * Begin revising tagger, focussing on POS tagging | 2014-12-07 15:29:04 +11:00
Matthew Honnibal | f5c4f2eb52 | * Revise context, focussing on POS tagging for now | 2014-12-07 15:28:22 +11:00
Matthew Honnibal | e27b912ef9 | * Remove need for confusing _data pointer to be stored on Tokens | 2014-12-05 16:31:30 +11:00
Matthew Honnibal | 1c9253701d | * Introduce a TokenC struct, to handle token indices, pos tags and sense tags | 2014-12-05 15:56:14 +11:00
Matthew Honnibal | 187372c7f3 | * Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached | 2014-12-05 03:29:50 +11:00
Matthew Honnibal | 75b8dfb348 | * Remove upper_pc from lexeme.pyx | 2014-12-04 22:14:34 +11:00
Matthew Honnibal | 49f3780ff5 | * Fiddle with lexeme attrs | 2014-12-04 21:22:38 +11:00
Matthew Honnibal | 564082e48e | * Hack Token class to take lex.dense in place of the old lex.norm. This needs to be fixed... | 2014-12-04 20:51:29 +11:00
Matthew Honnibal | 69bb022204 | * Add as_array and count_by method | 2014-12-04 20:46:55 +11:00
Matthew Honnibal | e1b1f45cc9 | * Add STEM attribute to lexeme | 2014-12-04 20:46:20 +11:00
Matthew Honnibal | d7952634ca | * Make the string-store serve const pointers to Utf8Str | 2014-12-03 16:01:47 +11:00
Matthew Honnibal | 7e04c22f8f | * const added to Lexicon interface. Seems to work. | 2014-12-03 15:58:17 +11:00
Matthew Honnibal | d70d31aa45 | * Introduce first attempt at const-ness | 2014-12-03 15:44:25 +11:00
Matthew Honnibal | 4560ada85b | * Add typedef for attr_t. Change flag_t to flags_t | 2014-12-03 11:06:31 +11:00
Matthew Honnibal | e600f7b327 | * Move String struct stuff into the utf8string module, from spacy.lang | 2014-12-03 11:06:00 +11:00
Matthew Honnibal | e170faf5b0 | * Hack Tokens to work without tagger.pyx | 2014-12-03 11:05:15 +11:00
Matthew Honnibal | b463a7eb86 | * Make flag-setting a language-specific thing | 2014-12-03 11:04:32 +11:00
Matthew Honnibal | 71b009e323 | * Fix bug in refactored StringStore.__getitem__ | 2014-12-03 11:02:24 +11:00
Matthew Honnibal | 14097311ae | * Make StringStore.__getitem__ accept unicode-typed keys. | 2014-12-03 01:33:20 +11:00
Matthew Honnibal | 522bb0346e | * Work on get_array method of Tokens | 2014-12-02 23:48:05 +11:00
Matthew Honnibal | 8c2938fe01 | * Rename Lexicon._dict to Lexicon._map | 2014-12-02 23:46:59 +11:00
Matthew Honnibal | 33dfb4933c | * Remove taggers from Language class. Work on doc strings | 2014-11-26 19:53:55 +11:00
Matthew Honnibal | 80baa2e3db | * Work on beam parser | 2014-11-20 19:49:33 +11:00
Matthew Honnibal | 5c3016bac8 | * Tmp commit of ner code | 2014-11-14 18:27:47 +11:00
Matthew Honnibal | 33c421bcf8 | * More feature tweaks | 2014-11-12 23:59:16 +11:00
Matthew Honnibal | 41dedfb14e | * Add label features for NER parsing | 2014-11-12 23:55:10 +11:00
Matthew Honnibal | cf55b48ba6 | * Switch to predict label on shift. Big increase in accuracy. | 2014-11-12 23:50:12 +11:00
Matthew Honnibal | 8f84e8a78b | * Neaten oracle | 2014-11-12 23:38:07 +11:00
Matthew Honnibal | 7e0a9077dd | * Add context files | 2014-11-12 23:22:36 +11:00
Matthew Honnibal | 3b0b902384 | * IOB-style parsing working. Accuracy down from BILOU, from 87-88 to 85-86 | 2014-11-12 23:21:09 +11:00
Matthew Honnibal | e6bb8aa3a9 | * Move moves to bilou_moves. Refactor context, returning to the simpler giant-enum style | 2014-11-12 00:54:50 +11:00
Matthew Honnibal | c788633429 | * Add tokens_from_list method to Language | 2014-11-11 23:43:14 +11:00
Matthew Honnibal | 95282d4993 | * Use the dynamic oracle 'follow' strategy | 2014-11-11 21:11:17 +11:00
Matthew Honnibal | 5aaf7a024d | * Move ner features to ner subdir | 2014-11-11 21:09:03 +11:00
Matthew Honnibal | ff8989b63c | * Use greedy NER parser | 2014-11-11 21:08:35 +11:00
Matthew Honnibal | 0d943ab358 | * Fixed greedy NER parsing. With static oracle, replicates accuracy from tagger. | 2014-11-11 17:17:54 +11:00
Matthew Honnibal | 399239760b | * Fix moves for new State struct | 2014-11-10 22:16:05 +11:00
Matthew Honnibal | 82247169f2 | * Implement validation and oracle on pystate, for testing | 2014-11-10 22:15:32 +11:00
Matthew Honnibal | 3709ed9d6d | * Add curr field to State, to handle entity being built | 2014-11-10 22:14:36 +11:00
Matthew Honnibal | af9ed18cf1 | * Bug fixes to NER | 2014-11-10 17:39:23 +11:00
Matthew Honnibal | 9f2587f5ec | * Work on shift-reduce NER | 2014-11-10 16:28:56 +11:00
Matthew Honnibal | f307eb2e36 | * Refactor context extraction, and start breaking out gold standards into their own functions | 2014-11-09 15:43:07 +11:00
Matthew Honnibal | 602f993af9 | * Moving tagger to accept multiple correct answers | 2014-11-09 15:18:33 +11:00
Matthew Honnibal | f37d896a42 | * Upd NER feats. With adadelta learner, getting 76.9 on NER | 2014-11-07 04:43:54 +11:00
Matthew Honnibal | 68d1cdad62 | * When encoding POS/NER tags, accept '-' as a missing value | 2014-11-07 04:42:31 +11:00
Matthew Honnibal | 949a6245f9 | * Increase default number of iterations from 5 to 10 | 2014-11-07 04:42:04 +11:00
Matthew Honnibal | 3cab1d9a29 | * Refine word_shape feature, by trimming the max sequence length | 2014-11-07 04:41:29 +11:00
Matthew Honnibal | b4454cf036 | * Add extra context tokens | 2014-11-07 04:40:36 +11:00
Matthew Honnibal | 50309e6e49 | * Fix context vector, importing all features | 2014-11-05 22:11:39 +11:00
Matthew Honnibal | 07a23768de | * Play with NER feats a bit. Up to 82.00 training on MUC7. | 2014-11-05 21:47:17 +11:00
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4ecbe8c893
							
						
					 | 
					
						
						
							
							* Complete refactor of Tagger features, to use a generic list of context names.
						
						
						
						
						
					 | 
					
						2014-11-05 20:45:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0a8c84625d
							
						
					 | 
					
						
						
							
							* Moving feature context stuff to a generalized place
						
						
						
						
						
					 | 
					
						2014-11-05 19:55:10 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3733444101
							
						
					 | 
					
						
						
							
							* Generalize tagger code, in preparation for NER and supersense tagging.
						
						
						
						
						
					 | 
					
						2014-11-05 03:42:14 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							abbe3e44b0
							
						
					 | 
					
						
						
							
							* Move spacy.pos tagger to spacy.tagger, and generalize it so that it can take on other tagging tasks, given a different set of feature templates.
						
						
						
						
						
					 | 
					
						2014-11-05 00:37:59 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							954c970415
							
						
					 | 
					
						
						
							
							* Add __iter__ method to tokens
						
						
						
						
						
					 | 
					
						2014-11-04 01:07:08 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f07457a91f
							
						
					 | 
					
						
						
							
							* Remove POS alignment stuff. Now use training data based on raw text, instead of clumsy detokenization stuff
						
						
						
						
						
					 | 
					
						2014-11-04 01:06:43 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ae52f9f38c
							
						
					 | 
					
						
						
							
							* Remove vocab10k from tokens
						
						
						
						
						
					 | 
					
						2014-11-03 00:23:20 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							32fb50dc35
							
						
					 | 
					
						
						
							
							* Remove non_sparse method --- features wanting this can do it easily enough.
						
						
						
						
						
					 | 
					
						2014-11-03 00:15:47 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b5ae1471db
							
						
					 | 
					
						
						
							
							* Fiddle with POS tag features
						
						
						
						
						
					 | 
					
						2014-11-03 00:15:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							70ea862703
							
						
					 | 
					
						
						
							
							* Remove vocab10k field, and add flags for gazetteers
						
						
						
						
						
					 | 
					
						2014-11-03 00:13:51 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							711ed0f636
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-11-02 14:22:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fcd9490d56
							
						
					 | 
					
						
						
							
							* Add pos_tag method to Language
						
						
						
						
						
					 | 
					
						2014-11-02 14:21:43 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							829bb2bdbe
							
						
					 | 
					
						
						
							
							* Add mappings to Twitter POS tag corpus
						
						
						
						
						
					 | 
					
						2014-11-02 13:21:19 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							437cd2217d
							
						
					 | 
					
						
						
							
							* Fix strings i/o, removing use of ujson library in favour of plain text file. Allows better control of codecs.
						
						
						
						
						
					 | 
					
						2014-11-02 13:20:37 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3352e89e21
							
						
					 | 
					
						
						
							
							* Use LIKE_URL and LIKE_NUMBER flag features. Seems to improve accuracy on onto web
						
						
						
						
						
					 | 
					
						2014-11-02 13:19:54 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8335706321
							
						
					 | 
					
						
						
							
							* Add LIKE_URL and LIKE_NUMBER flag features
						
						
						
						
						
					 | 
					
						2014-11-02 13:19:23 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5484fbea69
							
						
					 | 
					
						
						
							
							* Implement is_number
						
						
						
						
						
					 | 
					
						2014-11-01 19:13:24 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f685218e21
							
						
					 | 
					
						
						
							
							* Add is_urlish function
						
						
						
						
						
					 | 
					
						2014-11-01 17:39:34 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							09a3e54176
							
						
					 | 
					
						
						
							
							* Delete print statements from stringstore
						
						
						
						
						
					 | 
					
						2014-10-31 17:45:26 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b186a66bae
							
						
					 | 
					
						
						
							
							* Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype
						
						
						
						
						
					 | 
					
						2014-10-31 17:44:39 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a8ca078b24
							
						
					 | 
					
						
						
							
							* Restore lexemes field to lexicon
						
						
						
						
						
					 | 
					
						2014-10-31 17:43:25 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6c807aa45f
							
						
					 | 
					
						
						
							
							* Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries
						
						
						
						
						
					 | 
					
						2014-10-31 17:43:00 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							aaf6953fe0
							
						
					 | 
					
						
						
							
							* Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.
						
						
						
						
						
					 | 
					
						2014-10-31 17:42:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f67cb9a5a3
							
						
					 | 
					
						
						
							
							* Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.
						
						
						
						
						
					 | 
					
						2014-10-31 17:42:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ea8f1e7053
							
						
					 | 
					
						
						
							
							* Tighten interfaces
						
						
						
						
						
					 | 
					
						2014-10-30 18:14:42 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ea85bf3a0a
							
						
					 | 
					
						
						
							
							* Tighten the interface to Language
						
						
						
						
						
					 | 
					
						2014-10-30 18:01:27 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c6fcd03692
							
						
					 | 
					
						
						
							
							* Small efficiency tweak to lexeme init
						
						
						
						
						
					 | 
					
						2014-10-30 17:56:11 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							87c2418a89
							
						
					 | 
					
						
						
							
							* Fiddle with data types on Lexeme, to compress them to a much smaller size.
						
						
						
						
						
					 | 
					
						2014-10-30 15:42:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ac88893232
							
						
					 | 
					
						
						
							
							* Fix Token after lexeme changes
						
						
						
						
						
					 | 
					
						2014-10-30 15:30:52 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e6b87766fe
							
						
					 | 
					
						
						
							
							* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme
						
						
						
						
						
					 | 
					
						2014-10-30 15:21:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							889b7b48b4
							
						
					 | 
					
						
						
							
							* Fix POS tagger, so that it loads correctly. Lexemes are being read in.
						
						
						
						
						
					 | 
					
						2014-10-30 13:38:55 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							67c8c8019f
							
						
					 | 
					
						
						
							
							* Update lexeme serialization, using a binary file format
						
						
						
						
						
					 | 
					
						2014-10-30 01:01:00 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							13909a2e24
							
						
					 | 
					
						
						
							
							* Rewriting Lexeme serialization.
						
						
						
						
						
					 | 
					
						2014-10-29 23:19:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							234d49bf4d
							
						
					 | 
					
						
						
							
							* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.
						
						
						
						
						
					 | 
					
						2014-10-24 02:23:42 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							08ce602243
							
						
					 | 
					
						
						
							
							* Large refactor, particularly to Python API
						
						
						
						
						
					 | 
					
						2014-10-24 00:59:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7baef5b7ff
							
						
					 | 
					
						
						
							
							* Fix padding on tokens
						
						
						
						
						
					 | 
					
						2014-10-23 04:01:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							96b835a3d4
							
						
					 | 
					
						
						
							
							* Update for refactored Tokens class. Now gets 95.74, 185ms training on swbd_wsj_ewtb, eval on onto_web, Google POS tags.
						
						
						
						
						
					 | 
					
						2014-10-23 03:20:02 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e5e951ae67
							
						
					 | 
					
						
						
							
							* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.
						
						
						
						
						
					 | 
					
						2014-10-23 01:57:59 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ea1d4a81eb
							
						
					 | 
					
						
						
							
							* Refactoring get_atoms, improving tokens API
						
						
						
						
						
					 | 
					
						2014-10-22 13:10:56 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ad49e2482e
							
						
					 | 
					
						
						
							
							* Tagger now gets 97% on wsj, parsing 19-21 in 500ms. Gets 92.7 on web text.
						
						
						
						
						
					 | 
					
						2014-10-22 12:57:06 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0a0e41f6c8
							
						
					 | 
					
						
						
							
							* Add prefix and suffix features
						
						
						
						
						
					 | 
					
						2014-10-22 12:56:09 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7018b53d3a
							
						
					 | 
					
						
						
							
							* Improve array features in tokens
						
						
						
						
						
					 | 
					
						2014-10-22 12:55:42 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							43d5964e13
							
						
					 | 
					
						
						
							
							* Add function to read detokenization rules
						
						
						
						
						
					 | 
					
						2014-10-22 12:54:59 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							224bdae996
							
						
					 | 
					
						
						
							
							* Add POS utilities
						
						
						
						
						
					 | 
					
						2014-10-22 10:17:57 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5ebe14f353
							
						
					 | 
					
						
						
							
							* Add greedy pos tagger
						
						
						
						
						
					 | 
					
						2014-10-22 10:17:26 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							12742f4f83
							
						
					 | 
					
						
						
							
							* Add detokenize method and test
						
						
						
						
						
					 | 
					
						2014-10-18 18:07:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							99f5e59286
							
						
					 | 
					
						
						
							
							* Have tokenizer emit tokens for whitespace other than single spaces
						
						
						
						
						
					 | 
					
						2014-10-14 20:25:57 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							43743a5d63
							
						
					 | 
					
						
						
							
							* Work on efficiency
						
						
						
						
						
					 | 
					
						2014-10-14 18:22:41 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6fb42c4919
							
						
					 | 
					
						
						
							
							* Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang
						
						
						
						
						
					 | 
					
						2014-10-14 16:17:45 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2805068ca8
							
						
					 | 
					
						
						
							
							* Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer
						
						
						
						
						
					 | 
					
						2014-10-14 15:21:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							65d3ead4fd
							
						
					 | 
					
						
						
							
							* Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id
						
						
						
						
						
					 | 
					
						2014-10-14 15:19:07 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							868e558037
							
						
					 | 
					
						
						
							
							* Preparations in place to handle hyphenation etc
						
						
						
						
						
					 | 
					
						2014-10-10 20:23:23 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ff79dbac2e
							
						
					 | 
					
						
						
							
							* More slight cleaning for lang.pyx
						
						
						
						
						
					 | 
					
						2014-10-10 20:11:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3d82ed1e5e
							
						
					 | 
					
						
						
							
							* More slight cleaning for lang.pyx
						
						
						
						
						
					 | 
					
						2014-10-10 19:50:07 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							02e948e7d5
							
						
					 | 
					
						
						
							
							* Remove counts stuff from Language class
						
						
						
						
						
					 | 
					
						2014-10-10 19:25:01 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							71ee921055
							
						
					 | 
					
						
						
							
							* Slight cleaning of tokenizer code
						
						
						
						
						
					 | 
					
						2014-10-10 19:17:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							59b41a9fd3
							
						
					 | 
					
						
						
							
							* Switch to new data model, tests passing
						
						
						
						
						
					 | 
					
						2014-10-10 08:11:31 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1b0e01d3d8
							
						
					 | 
					
						
						
							
							* Revising data model of lexeme. Compiles.
						
						
						
						
						
					 | 
					
						2014-10-09 19:53:30 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e40caae51f
							
						
					 | 
					
						
						
							
							* Update Lexicon class to expect a list of lexeme dict descriptions
						
						
						
						
						
					 | 
					
						2014-10-09 14:51:35 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							51d75b244b
							
						
					 | 
					
						
						
							
							* Add serialize/deserialize functions for lexeme, transport to/from python dict.
						
						
						
						
						
					 | 
					
						2014-10-09 14:10:46 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d73d89a2de
							
						
					 | 
					
						
						
							
							* Add i attribute to lexeme, giving lexemes sequential IDs.
						
						
						
						
						
					 | 
					
						2014-10-09 13:50:05 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							096ef2b199
							
						
					 | 
					
						
						
							
							* Rename external hashing lib, from trustyc to preshed
						
						
						
						
						
					 | 
					
						2014-09-26 18:40:03 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							11a346fd5e
							
						
					 | 
					
						
						
							
							* Remove hashing modules, which are now taken over by external lib
						
						
						
						
						
					 | 
					
						2014-09-26 18:39:40 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							93505276ed
							
						
					 | 
					
						
						
							
							* Add German tokenizer files
						
						
						
						
						
					 | 
					
						2014-09-25 18:29:13 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2e44fa7179
							
						
					 | 
					
						
						
							
							* Add util.py
						
						
						
						
						
					 | 
					
						2014-09-25 18:26:22 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b15619e170
							
						
					 | 
					
						
						
							
							* Use PointerHash instead of locally provided _hashing module
						
						
						
						
						
					 | 
					
						2014-09-25 18:23:35 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ed446c67ad
							
						
					 | 
					
						
						
							
							* Add typedefs file
						
						
						
						
						
					 | 
					
						2014-09-17 23:10:32 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							316a57c4be
							
						
					 | 
					
						
						
							
							* Remove own memory classes, which have now been broken out into their own package
						
						
						
						
						
					 | 
					
						2014-09-17 23:10:07 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ac522e2553
							
						
					 | 
					
						
						
							
							* Switch from own memory class to cymem, in pip
						
						
						
						
						
					 | 
					
						2014-09-17 23:09:24 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6266cac593
							
						
					 | 
					
						
						
							
							* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks
						
						
						
						
						
					 | 
					
						2014-09-17 20:02:26 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5a20dfc03e
							
						
					 | 
					
						
						
							
							* Add memory management code
						
						
						
						
						
					 | 
					
						2014-09-17 20:02:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0152831c89
							
						
					 | 
					
						
						
							
							* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token.
						
						
						
						
						
					 | 
					
						2014-09-16 18:01:46 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							143e51ec73
							
						
					 | 
					
						
						
							
							* Refactor tokenization, splitting it into a clearer life-cycle.
						
						
						
						
						
					 | 
					
						2014-09-16 13:16:02 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c396581a0b
							
						
					 | 
					
						
						
							
							* Fiddle with the way strings are interned in lexeme
						
						
						
						
						
					 | 
					
						2014-09-15 06:34:45 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0bb547ab98
							
						
					 | 
					
						
						
							
							* Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance
						
						
						
						
						
					 | 
					
						2014-09-15 06:34:10 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7959141d36
							
						
					 | 
					
						
						
							
							* Add a few abbreviations, to get tests to pass
						
						
						
						
						
					 | 
					
						2014-09-15 06:32:18 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d235299260
							
						
					 | 
					
						
						
							
							* Few nips and tucks to hash table
						
						
						
						
						
					 | 
					
						2014-09-15 05:03:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e68a431e5e
							
						
					 | 
					
						
						
							
							* Pass only the tokens vector to _tokenize, instead of the whole python object.
						
						
						
						
						
					 | 
					
						2014-09-15 04:01:38 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							08cef75ffd
							
						
					 | 
					
						
						
							
							* Switch to using a heap-allocated vector in tokens
						
						
						
						
						
					 | 
					
						2014-09-15 03:46:14 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f77b7098c0
							
						
					 | 
					
						
						
							
							* Upd Tokens to use vector, with bounds checking.
						
						
						
						
						
					 | 
					
						2014-09-15 03:22:40 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0f6bf2a2ee
							
						
					 | 
					
						
						
							
							* Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector.
						
						
						
						
						
					 | 
					
						2014-09-15 02:08:39 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							df24e3708c
							
						
					 | 
					
						
						
							
							* Move EnglishTokens stuff to Tokens
						
						
						
						
						
					 | 
					
						2014-09-15 01:31:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bd08cb09a2
							
						
					 | 
					
						
						
							
							* Remove short-circuiting of initial_size argument for PointerHash
						
						
						
						
						
					 | 
					
						2014-09-15 01:30:49 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f3393cf57c
							
						
					 | 
					
						
						
							
							* Improve interface for PointerHash
						
						
						
						
						
					 | 
					
						2014-09-13 17:29:58 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							45865be37e
							
						
					 | 
					
						
						
							
							* Switch hash interface, using void* instead of size_t, to avoid casts.
						
						
						
						
						
					 | 
					
						2014-09-13 17:02:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0447279c57
							
						
					 | 
					
						
						
							
							* PointerHash working, efficiency is good. 6-7 mins
						
						
						
						
						
					 | 
					
						2014-09-13 16:43:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							85d68e8e95
							
						
					 | 
					
						
						
							
							* Replaced cache with own hash table. Similar timing
						
						
						
						
						
					 | 
					
						2014-09-13 03:14:43 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c8db76e3e1
							
						
					 | 
					
						
						
							
							* Add initial work on simple hash table
						
						
						
						
						
					 | 
					
						2014-09-13 02:02:41 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							afdc9b7ac2
							
						
					 | 
					
						
						
							
							* More performance fiddling, particularly moving the specials into the cache, so that we can just look up the cache in _tokenize
						
						
						
						
						
					 | 
					
						2014-09-13 00:59:34 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7d239df4c8
							
						
					 | 
					
						
						
							
							* Fiddle with declarations, for small efficiency boost
						
						
						
						
						
					 | 
					
						2014-09-13 00:31:53 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a8e7cce30f
							
						
					 | 
					
						
						
							
							* Efficiency tweaks
						
						
						
						
						
					 | 
					
						2014-09-13 00:14:05 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							126a8453a5
							
						
					 | 
					
						
						
							
							* Fix performance issues by implementing a better cache. Add own String struct to help
						
						
						
						
						
					 | 
					
						2014-09-12 23:50:37 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9298e36b36
							
						
					 | 
					
						
						
							
							* Move special tokenization into its own lookup table, away from the cache.
						
						
						
						
						
					 | 
					
						2014-09-12 19:43:14 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							985bc68327
							
						
					 | 
					
						
						
							
							* Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation.
						
						
						
						
						
					 | 
					
						2014-09-12 18:26:26 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7eab281194
							
						
					 | 
					
						
						
							
							* Fiddle with token features
						
						
						
						
						
					 | 
					
						2014-09-12 15:49:55 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5aa591106b
							
						
					 | 
					
						
						
							
							* Fiddle with token features
						
						
						
						
						
					 | 
					
						2014-09-12 15:49:36 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1533041885
							
						
					 | 
					
						
						
							
							* Update the split_one method, so that it doesn't need to cast back to a Python object
						
						
						
						
						
					 | 
					
						2014-09-12 05:10:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4817277d66
							
						
					 | 
					
						
						
							
							* Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery.
						
						
						
						
						
					 | 
					
						2014-09-12 04:29:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8b20e9ad97
							
						
					 | 
					
						
						
							
							* Delete unused _split method
						
						
						
						
						
					 | 
					
						2014-09-12 04:03:52 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a4863686ec
							
						
					 | 
					
						
						
							
							* Changed cache to use a linked-list data structure, to take out Python list code. Taking 6-7 mins for gigaword.
						
						
						
						
						
					 | 
					
						2014-09-12 03:30:50 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							51e2006a65
							
						
					 | 
					
						
						
							
							* Increase cache size. Processing now 6-7 mins
						
						
						
						
						
					 | 
					
						2014-09-12 02:52:34 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e096f30161
							
						
					 | 
					
						
						
							
							* Tweak signatures and refactor slightly. Processing gigaword taking 8-9 mins. Tests passing, but some sort of memory bug on exit.
						
						
						
						
						
					 | 
					
						2014-09-12 02:43:36 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							073ee0de63
							
						
					 | 
					
						
						
							
							* Restore dense_hash_map for cache dictionary. Seems to double efficiency
						
						
						
						
						
					 | 
					
						2014-09-12 02:23:51 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3c928fb5e0
							
						
					 | 
					
						
						
							
							* Switch to 64 bit hashes, for better reliability
						
						
						
						
						
					 | 
					
						2014-09-12 02:04:47 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2389bd1b10
							
						
					 | 
					
						
						
							
							* Improve cache mechanism by including a random element depending on the size of the cache.
						
						
						
						
						
					 | 
					
						2014-09-12 00:19:16 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c8f7c8bfde
							
						
					 | 
					
						
						
							
							* Moving to storing LexemeC structs internally
						
						
						
						
						
					 | 
					
						2014-09-11 21:54:34 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bf9c60c31c
							
						
					 | 
					
						
						
							
							* Moving to storing LexemeC structs internally
						
						
						
						
						
					 | 
					
						2014-09-11 21:44:58 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							563047e90f
							
						
					 | 
					
						
						
							
							* Switch to returning a Tokens object
						
						
						
						
						
					 | 
					
						2014-09-11 21:37:32 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1a3222af4b
							
						
					 | 
					
						
						
							
							* Moving tokens to use an array internally, instead of a list of Lexeme objects.
						
						
						
						
						
					 | 
					
						2014-09-11 16:57:08 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5b1c651661
							
						
					 | 
					
						
						
							
							* Only store LexemeC structs in the vocabulary, transforming them to Lexeme objects for output. Moving away from Lexeme objects for Tokens soon.
						
						
						
						
						
					 | 
					
						2014-09-11 12:28:38 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e567713429
							
						
					 | 
					
						
						
							
							* Moving back to lexeme structs
						
						
						
						
						
					 | 
					
						2014-09-10 20:41:47 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b488224c09
							
						
					 | 
					
						
						
							
							* Restoring Lexeme-as-struct
						
						
						
						
						
					 | 
					
						2014-09-10 20:41:37 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7c09c73a14
							
						
					 | 
					
						
						
							
							* Refactor to use tokens class.
						
						
						
						
						
					 | 
					
						2014-09-10 18:27:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							cf412adba8
							
						
					 | 
					
						
						
							
							* Refactoring to use Tokens object
						
						
						
						
						
					 | 
					
						2014-09-10 18:11:13 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8fbe9b6f97
							
						
					 | 
					
						
						
							
							* Bug fixes to flag features
						
						
						
						
						
					 | 
					
						2014-09-01 23:41:31 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							151aa14bba
							
						
					 | 
					
						
						
							
							* Add asciify string transform, and other bits.
						
						
						
						
						
					 | 
					
						2014-09-01 23:25:28 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c4ba216642
							
						
					 | 
					
						
						
							
							* Switch canon_case to get value, to avoid keyerror
						
						
						
						
						
					 | 
					
						2014-09-01 17:27:36 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a779275a59
							
						
					 | 
					
						
						
							
							* Add canon_case function
						
						
						
						
						
					 | 
					
						2014-08-30 20:57:43 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8bbfadfced
							
						
					 | 
					
						
						
							
							* Pass tests. Need to implement more feature functions.
						
						
						
						
						
					 | 
					
						2014-08-30 20:36:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							dcab14ede2
							
						
					 | 
					
						
						
							
							* Begin testing more functionality
						
						
						
						
						
					 | 
					
						2014-08-30 19:01:15 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3e3ff99ca0
							
						
					 | 
					
						
						
							
							* Add orth features
						
						
						
						
						
					 | 
					
						2014-08-30 19:01:00 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4e5b2d47e2
							
						
					 | 
					
						
						
							
							* More docs
						
						
						
						
						
					 | 
					
						2014-08-29 03:01:40 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5233f110c4
							
						
					 | 
					
						
						
							
							* Adding PTB3 tokenizer back in, so can understand how much boilerplate is in the docs for multiple tokenizers
						
						
						
						
						
					 | 
					
						2014-08-29 02:30:27 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							45a22d6b2c
							
						
					 | 
					
						
						
							
							* Docs coming together
						
						
						
						
						
					 | 
					
						2014-08-29 01:59:23 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c282e6d5fb
							
						
					 | 
					
						
						
							
							* Redesign proceeding
						
						
						
						
						
					 | 
					
						2014-08-28 19:45:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fd4e61e58b
							
						
					 | 
					
						
						
							
							* Fixed contraction tests. Need to correct problem with the way case stats and tag stats are supposed to work.
						
						
						
						
						
					 | 
					
						2014-08-27 20:22:33 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fdaf24604a
							
						
					 | 
					
						
						
							
							* Basic punct tests updated and passing
						
						
						
						
						
					 | 
					
						2014-08-27 19:38:57 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8d20617dfd
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-08-27 17:16:16 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e9a62b6eba
							
						
					 | 
					
						
						
							
							* Refactoring with Lexeme as a class now compiles. Basic design seems to work
						
						
						
						
						
					 | 
					
						2014-08-27 17:15:39 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							68bae2fec6
							
						
					 | 
					
						
						
							
							* More refactoring
						
						
						
						
						
					 | 
					
						2014-08-25 16:42:22 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							88095666dc
							
						
					 | 
					
						
						
							
							* Remove Lexeme struct, preparing to rename Word to Lexeme.
						
						
						
						
						
					 | 
					
						2014-08-24 19:24:42 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ce59526011
							
						
					 | 
					
						
						
							
							* Add Word classes
						
						
						
						
						
					 | 
					
						2014-08-24 18:14:08 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3b793cf4f7
							
						
					 | 
					
						
						
							
							* Tests passing for new Word object version
						
						
						
						
						
					 | 
					
						2014-08-24 18:13:53 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9815c7649e
							
						
					 | 
					
						
						
							
							* Refactor around Word objects, adapting tests. Tests passing, except for string views.
						
						
						
						
						
					 | 
					
						2014-08-23 19:55:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4f01df9152
							
						
					 | 
					
						
						
							
							* Moving to Word objects in place of the Lexeme struct.
						
						
						
						
						
					 | 
					
						2014-08-22 17:32:16 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							782806df08
							
						
					 | 
					
						
						
							
							* Moving to Word objects in place of the Lexeme struct.
						
						
						
						
						
					 | 
					
						2014-08-22 17:28:23 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							47fbd0475a
							
						
					 | 
					
						
						
							
							* Replace the use of dense_hash_map with Python dict
						
						
						
						
						
					 | 
					
						2014-08-22 17:13:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e289896603
							
						
					 | 
					
						
						
							
							* Fix ptb3 module
						
						
						
						
						
					 | 
					
						2014-08-22 16:36:17 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							89d6faa9c9
							
						
					 | 
					
						
						
							
							* Move en_ptb to ptb3
						
						
						
						
						
					 | 
					
						2014-08-22 04:24:05 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							07ecf5d2f4
							
						
					 | 
					
						
						
							
							* Fixed group_by, removed idea of general attr_of function.
						
						
						
						
						
					 | 
					
						2014-08-22 00:02:37 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							811b7a6b91
							
						
					 | 
					
						
						
							
							* Struggling with arbitrary attr access...
						
						
						
						
						
					 | 
					
						2014-08-21 23:49:14 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							314658b31c
							
						
					 | 
					
						
						
							
							* Improve module docstring
						
						
						
						
						
					 | 
					
						2014-08-21 18:42:47 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d10993f41a
							
						
					 | 
					
						
						
							
							* More docs work
						
						
						
						
						
					 | 
					
						2014-08-21 16:37:13 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							248cbb6d07
							
						
					 | 
					
						
						
							
							* Update doc strings
						
						
						
						
						
					 | 
					
						2014-08-21 03:29:15 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							76afbd7d69
							
						
					 | 
					
						
						
							
							* Remove compiled orthography file
						
						
						
						
						
					 | 
					
						2014-08-20 17:04:07 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f39dcb1d89
							
						
					 | 
					
						
						
							
							* Add orthography
						
						
						
						
						
					 | 
					
						2014-08-20 17:03:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a78ad4152d
							
						
					 | 
					
						
						
							
							* Broken version being refactored for docs
						
						
						
						
						
					 | 
					
						2014-08-20 13:39:39 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5fddb8d165
							
						
					 | 
					
						
						
							
							* Working refactor, with updated data model for Lexemes
						
						
						
						
						
					 | 
					
						2014-08-19 04:21:20 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3379d7a571
							
						
					 | 
					
						
						
							
							* Reforming data model for lexemes
						
						
						
						
						
					 | 
					
						2014-08-19 02:40:37 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ab9b0daabf
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-08-18 23:21:49 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1b71cbfe28
							
						
					 | 
					
						
						
							
							* Roll back to using unicode, and never Py_UNICODE. No dependence on murmurhash either.
						
						
						
						
						
					 | 
					
						2014-08-18 20:48:48 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bbf9a2c944
							
						
					 | 
					
						
						
							
							* Working version that uses arrays for chunks, which should be more memory efficient
						
						
						
						
						
					 | 
					
						2014-08-18 20:23:54 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8d3f6082be
							
						
					 | 
					
						
						
							
							* Working version, adding improvements
						
						
						
						
						
					 | 
					
						2014-08-18 19:59:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							01469b0888
							
						
					 | 
					
						
						
							
							* Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word.
						
						
						
						
						
					 | 
					
						2014-08-18 19:14:00 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b94c9b72c9
							
						
					 | 
					
						
						
							
							* WordTree in use. Need to reform the way chunks are handled. Should be properly one Lexeme per word, with split points being the things that are cached.
						
						
						
						
						
					 | 
					
						2014-08-16 20:10:22 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							34b68a18ab
							
						
					 | 
					
						
						
							
							* Progress to getting WordTree working. Tests pass, but so far it's slower.
						
						
						
						
						
					 | 
					
						2014-08-16 19:59:38 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							865cacfaf7
							
						
					 | 
					
						
						
							
							* Remove dependence on murmurhash
						
						
						
						
						
					 | 
					
						2014-08-16 17:37:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							515d41d325
							
						
					 | 
					
						
						
							
							* Restore string saving to spacy
						
						
						
						
						
					 | 
					
						2014-08-16 16:09:24 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							36073b89fe
							
						
					 | 
					
						
						
							
							* Restore unicode, work on improving string storage.
						
						
						
						
						
					 | 
					
						2014-08-16 14:35:34 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a225ca5b0d
							
						
					 | 
					
						
						
							
							* Refactoring tokenizer
						
						
						
						
						
					 | 
					
						2014-08-16 03:22:03 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							213a440ffc
							
						
					 | 
					
						
						
							
							* Add string decode and encode helpers to string_tools
						
						
						
						
						
					 | 
					
						2014-08-15 23:57:27 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f11c8e22eb
							
						
					 | 
					
						
						
							
							* Remove happax stuff
						
						
						
						
						
					 | 
					
						2014-08-02 22:11:28 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d6e07aa922
							
						
					 | 
					
						
						
							
							* Switch to 32bit hash for strings
						
						
						
						
						
					 | 
					
						2014-08-02 21:51:52 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							365a2af756
							
						
					 | 
					
						
						
							
							* Restore happax. Commit uncommitted work
						
						
						
						
						
					 | 
					
						2014-08-02 21:27:03 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6319ff0f22
							
						
					 | 
					
						
						
							
							* Add length property
						
						
						
						
						
					 | 
					
						2014-08-02 21:26:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							18fb76b2c4
							
						
					 | 
					
						
						
							
							* Removed happax. Not sure if good idea.
						
						
						
						
						
					 | 
					
						2014-08-02 20:53:35 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							edd38a84b1
							
						
					 | 
					
						
						
							
							* Removing happax stuff. Added length
						
						
						
						
						
					 | 
					
						2014-08-02 20:45:12 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fc7c10d7f8
							
						
					 | 
					
						
						
							
							* Ugly but seemingly working fix to the token memory leak
						
						
						
						
						
					 | 
					
						2014-08-01 09:43:19 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c7bb6b329c
							
						
					 | 
					
						
						
							
							* Don't free clobbered lexemes, as they might be part of a tail
						
						
						
						
						
					 | 
					
						2014-08-01 08:22:38 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c48214460e
							
						
					 | 
					
						
						
							
							* Free lexemes clobbered as happaxes
						
						
						
						
						
					 | 
					
						2014-08-01 07:40:20 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5b6457e80e
							
						
					 | 
					
						
						
							
							* Free lexemes clobbered as happaxes
						
						
						
						
						
					 | 
					
						2014-08-01 07:37:50 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d8cb2288ce
							
						
					 | 
					
						
						
							
							* Roll back to using murmurhash2 for now
						
						
						
						
						
					 | 
					
						2014-08-01 07:28:47 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f39211b2b1
							
						
					 | 
					
						
						
							
							* Add FixedTable for hashing
						
						
						
						
						
					 | 
					
						2014-08-01 07:27:21 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a44e15f623
							
						
					 | 
					
						
						
							
							* Hack around lack of distribution features for now.
						
						
						
						
						
					 | 
					
						2014-07-31 18:24:51 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4cb88c940b
							
						
					 | 
					
						
						
							
							* Fix memory leak in tokenizer, caused by having a fixed vocab.
						
						
						
						
						
					 | 
					
						2014-07-31 18:19:38 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5b81ee716f
							
						
					 | 
					
						
						
							
							* Use a sparse_hash_map to store happax vocab items, with a max size.
						
						
						
						
						
					 | 
					
						2014-07-31 17:40:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b9016c4633
							
						
					 | 
					
						
						
							
							* Switch to using sparsehash and murmurhash libraries out of pip
						
						
						
						
						
					 | 
					
						2014-07-25 15:47:27 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a895fe5ddb
							
						
					 | 
					
						
						
							
							* Upd from spacy
						
						
						
						
						
					 | 
					
						2014-07-23 17:35:18 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							87bf205b82
							
						
					 | 
					
						
						
							
							* Fix open apostrophe bug
						
						
						
						
						
					 | 
					
						2014-07-07 23:26:01 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							571808a274
							
						
					 | 
					
						
						
							
							Group-by seems to be working
						
						
						
						
						
					 | 
					
						2014-07-07 20:27:02 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							80b36f9f27
							
						
					 | 
					
						
						
							
							* 710k words per second for counts
						
						
						
						
						
					 | 
					
						2014-07-07 19:12:19 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							057c21969b
							
						
					 | 
					
						
						
							
							* Refactor for string view features. Working on setting up flags and enums.
						
						
						
						
						
					 | 
					
						2014-07-07 16:58:48 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f1bcbd4c4e
							
						
					 | 
					
						
						
							
							* Reorganized code to accommodate Tokens class. Need string views before group_by and count_by can be done well.
						
						
						
						
						
					 | 
					
						2014-07-07 12:47:21 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6668e44961
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-07-07 08:15:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0074ae2fc0
							
						
					 | 
					
						
						
							
							* Switch to dynamically allocating array, based on the document length
						
						
						
						
						
					 | 
					
						2014-07-07 08:05:29 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ff1869ff07
							
						
					 | 
					
						
						
							
							* Fixed major efficiency problem, from not quite grokking pass-by-reference in Cython C++
						
						
						
						
						
					 | 
					
						2014-07-07 07:36:43 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0c76143b72
							
						
					 | 
					
						
						
							
							* Give value for assert
						
						
						
						
						
					 | 
					
						2014-07-07 05:10:46 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e244739dfe
							
						
					 | 
					
						
						
							
							* Fix ptb tokenization
						
						
						
						
						
					 | 
					
						2014-07-07 05:10:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							dc20500920
							
						
					 | 
					
						
						
							
							* Remove cpp files
						
						
						
						
						
					 | 
					
						2014-07-07 05:09:05 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							25849fc926
							
						
					 | 
					
						
						
							
							* Generalize tokenization rules to capitals
						
						
						
						
						
					 | 
					
						2014-07-07 05:07:21 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							df0458001d
							
						
					 | 
					
						
						
							
							* Begin work on full PTB-compatible English tokenization
						
						
						
						
						
					 | 
					
						2014-07-07 04:29:24 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d5bef02c72
							
						
					 | 
					
						
						
							
							* Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals
						
						
						
						
						
					 | 
					
						2014-07-07 04:21:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a62c38e1ef
							
						
					 | 
					
						
						
							
							* Working tokenization. en doesn't match PTB perfectly. Need to reorganize before adding more schemes.
						
						
						
						
						
					 | 
					
						2014-07-07 01:15:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4e79446dc2
							
						
					 | 
					
						
						
							
							* Reading in tokenization rules correctly. Passing tests.
						
						
						
						
						
					 | 
					
						2014-07-07 00:02:55 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							72159e7011
							
						
					 | 
					
						
						
							
							* Fixes to tokenization. Now segment sequences of the same punctuation.
						
						
						
						
						
					 | 
					
						2014-07-06 19:28:42 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e98e97d483
							
						
					 | 
					
						
						
							
							* Possessive test passing
						
						
						
						
						
					 | 
					
						2014-07-06 18:35:55 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							556f6a18ca
							
						
					 | 
					
						
						
							
							* Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc.
						
						
						
						
						
					 | 
					
						2014-07-05 20:51:42 +02:00 | 
					
					
						
						
							
							
							
						
					 |