Matthew Honnibal | fb8d50b3d5 | Merge branch 'master' of ssh://github.com/honnibal/spaCy | 2015-04-30 12:45:15 +02:00
Matthew Honnibal | ed8e8c3bd0 | * Whitespace | 2015-04-29 14:22:47 +02:00
Matthew Honnibal | 378c2a6435 | * Fix POS model: make it use tag instead of pos in history features | 2015-04-29 00:02:53 +02:00
Matthew Honnibal | 763ef01575 | * Fix two bugs in feature calculation | 2015-04-28 23:25:09 +02:00
Matthew Honnibal | b3fd48c97b | * Fix missing root labels bug identified in Issue #57 | 2015-04-28 20:45:51 +02:00
Jordan Suchow | 3a8d9b37a6 | Remove trailing whitespace | 2015-04-19 13:01:38 -07:00
Jordan Suchow | 5f0f940a1f | Remove unused imports | 2015-04-19 01:05:22 -07:00
Matthew Honnibal | cc4e395927 | * Add some ad hoc regexes, for multi-word location prepositions | 2015-04-17 04:44:24 +02:00
Matthew Honnibal | f7ffd94e6a | * Add Token.conjuncts property | 2015-04-17 01:40:53 +02:00
Matthew Honnibal | 684d0e5e85 | * Download updated data | 2015-04-16 04:29:15 +02:00
Matthew Honnibal | 2ef170a991 | * Fix Issue #54: Error merging multi-word token when there's a mid-token match. | 2015-04-16 04:28:06 +02:00
Matthew Honnibal | 42617548af | * Disable merge_mwes by default | 2015-04-16 04:20:31 +02:00
Matthew Honnibal | 99dbf8a38c | * Fix error type in lookup_transition | 2015-04-16 01:36:22 +02:00
Matthew Honnibal | 77d0700caf | * Add on X way regexes | 2015-04-16 01:35:46 +02:00
Matthew Honnibal | 9f16848b60 | * Add (N0w, N1w) unigram pair to NER features, prompted by failure to detect 'this weekend' | 2015-04-15 06:01:18 +02:00
Matthew Honnibal | c6707778dd | * Fix Issue #51: Handle non-ascii lemmas correctly | 2015-04-13 22:28:59 +02:00
Matthew Honnibal | bf0aff5124 | * Fix bug in Tokens.ents where entity wasn't being emitted if another started immediately after | 2015-04-13 21:34:33 +02:00
Matthew Honnibal | 2b84a90bbb | * Fix Issue #50: Python 3 compatibility of v0.80 | 2015-04-13 05:59:43 +02:00
Matthew Honnibal | fbd48c571d | * Rearrange code in tokens.pyx | 2015-04-13 05:41:25 +02:00
Matthew Honnibal | 507048dc45 | * Rename StandardError to Exception, for Python 3 compatibility | 2015-04-12 07:28:34 +02:00
Matthew Honnibal | 761a19113a | * Fix /tmp moving thing in download.py | 2015-04-12 07:04:10 +02:00
Matthew Honnibal | 248a2b4b0f | * Remove Spans class | 2015-04-12 04:07:29 +02:00
Matthew Honnibal | 1d05e6da00 | * Add ne_iob and ne_type features to NER | 2015-04-10 19:07:08 +02:00
Matthew Honnibal | 4df8a3d90f | * Add ne_iob and ne_type attributes to context vector | 2015-04-10 05:02:15 +02:00
Matthew Honnibal | 8c354c432b | * Add ValueError condition to ner_tag reading | 2015-04-10 04:59:59 +02:00
Matthew Honnibal | 435cccf098 | * Add read_conll03_file function to conll.pyx | 2015-04-10 04:59:11 +02:00
Matthew Honnibal | 99c9ecfc18 | * Fix bug in prefix, suffix and word shape features in parser and NER | 2015-04-10 03:53:33 +02:00
Matthew Honnibal | cff2b13fef | * Fix Issue #44: Broken Token.string attribute when single word sentence | 2015-04-07 06:08:25 +02:00
Matthew Honnibal | 6640386b25 | * Fix Issue #43: TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way. | 2015-04-07 06:00:57 +02:00
Matthew Honnibal | b64b2bd910 | * Fix Issue #43: TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way. | 2015-04-07 06:00:30 +02:00
Matthew Honnibal | f9e510a893 | * Whitespace | 2015-04-07 04:53:59 +02:00
Matthew Honnibal | 66c7ccf6cc | * Fix Spans.orth_ | 2015-04-07 04:53:40 +02:00
Matthew Honnibal | b8d34531c4 | * Add support for units to English.__init__, by loading and applying regular expressions | 2015-04-07 04:02:32 +02:00
Matthew Honnibal | 0ea5af88b6 | * Add multi-word expression RegexMatcher | 2015-04-07 03:45:40 +02:00
Matthew Honnibal | 2fee67cfa3 | * Add regular expressions for English multi-word expressions | 2015-04-07 03:45:18 +02:00
Matthew Honnibal | 5a075ea3fc | * Ensure NER moves are available for single-word tokens | 2015-04-05 22:30:58 +02:00
Matthew Honnibal | a60a366b2c | * Support 'punct' dep label in conll.pyx | 2015-04-05 22:30:19 +02:00
Matthew Honnibal | 021c972137 | * Print parse if verbose in scorer | 2015-04-05 22:29:30 +02:00
Matthew Honnibal | fbf19049cf | * Add ent_type_ property | 2015-03-31 02:01:29 +02:00
Matthew Honnibal | e70b87efeb | * Add merge() method to Tokens, with fairly brittle/hacky implementation, but quite easy to test. Passing minimal tests. Still need to fix left/right deps in C data | 2015-03-30 01:37:41 +02:00
Matthew Honnibal | 557856e84c | * Allow regular expressions to specify labels for merged spans | 2015-03-27 17:40:52 +01:00
Matthew Honnibal | a3af6b7c3d | * Left-Arc from Root, to allow non-monotonic reduce to compete with left-arc when the stack is not empty. | 2015-03-27 17:39:16 +01:00
Matthew Honnibal | db5a43318c | * Improve print_state debug printer | 2015-03-27 17:29:58 +01:00
Matthew Honnibal | 1705eccbbe | * Remove whitespace | 2015-03-27 15:22:39 +01:00
Matthew Honnibal | 3feb52374c | * Break apart a condition, for ease of debug printing | 2015-03-27 15:21:38 +01:00
Matthew Honnibal | b32f581acb | * Fix bug in ArcEager.get_labels | 2015-03-27 15:21:06 +01:00
Matthew Honnibal | 5f2a4ff36d | * Fix spans.lemma_ | 2015-03-26 16:45:38 +01:00
Matthew Honnibal | f4cc222ec3 | * Fix NER scoring | 2015-03-26 16:45:38 +01:00
Matthew Honnibal | 1320bd19db | * Move Span class to own file | 2015-03-26 16:45:38 +01:00
Matthew Honnibal | 6f47a667cf | * Move Span class to own file | 2015-03-26 16:45:38 +01:00
Matthew Honnibal | f02c39dfaf | * Compare to is not None, for more robustness | 2015-03-26 16:44:48 +01:00
Matthew Honnibal | 8f68b864c4 | * Move Span/Spans to separate files. Currently duplicates lots of Tokens functionality. Should probably be integrated into Tokens | 2015-03-26 16:44:48 +01:00
Matthew Honnibal | e854ba0a13 | * Remove support for force_gold flag from GreedyParser, since it's not so useful, and it's clutter | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | 6a6085f8b9 | * Clean up GreedyParser.train function a bit | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | b3157927e6 | * Clean up unused feature templates | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | 411bf377d4 | * Remove dependency on ner_util module | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | 01c892f583 | * Add comment to fill_context | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | 2741179aff | * Important bug fix: Fill token N2w, which was being unfilled, after a bad edit while writing the NER features. | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | 2b2dec95d3 | * Add comment to set_parse | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | e770fade1e | * Don't set dependency labels in set_parse, as this may be used by the Entity recogniser instead. Need to clean this method up... | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | 71648205d9 | * Add support for debug feature set. Just use unigrams for this. | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | 3b70b304b2 | * Add words to gold_tuples from gold conll file | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | 2e12dec76e | * Adjust scorer to account for tokenization mistakes | 2015-03-26 16:44:47 +01:00
Matthew Honnibal | 05d6065e2e | * Add assertion | 2015-03-26 16:44:46 +01:00
Matthew Honnibal | 377e9b29b1 | * Whitespace | 2015-03-26 16:44:46 +01:00
Matthew Honnibal | 670959f40c | * Fix iteration order on Tokens.rights | 2015-03-26 16:44:46 +01:00
Matthew Honnibal | 231ce2dae5 | * Assign ROOT label by default. May be papering over another bug. | 2015-03-26 16:44:46 +01:00
Matthew Honnibal | 9f4ad8fdfb | * Assign root words the ROOT label via the Break transition. Something is still wrong here... | 2015-03-26 16:44:46 +01:00
Matthew Honnibal | f729164c01 | * Fix bug in label assignment: ensure null-label transitions receive the label 0 | 2015-03-26 16:44:46 +01:00
Matthew Honnibal | 7237c805c7 | * Load tag for specials.json token | 2015-03-26 16:44:46 +01:00
Matthew Honnibal | 567388e38d | * Use values encoded by StringStore in POS tagging, rather than indices into a list of tags | 2015-03-26 16:44:45 +01:00
Matthew Honnibal | 3105c7f8ba | * Don't pass label_ids dict to Tokens, since we now use the StringStore to manage string-to-int mapping for labels | 2015-03-26 16:44:45 +01:00
Matthew Honnibal | 801bf14f4f | * Clean up handling of dep_strings and ent_strings, using StringStore to encode the label names. | 2015-03-26 16:44:45 +01:00
Matthew Honnibal | 31fad99518 | * Use StringStore to encode label names, instead of label_ids | 2015-03-26 16:44:45 +01:00
Matthew Honnibal | 64db61bff1 | * Add Span class to Python API | 2015-03-26 16:44:45 +01:00
Matthew Honnibal | b9b695fb1b | * Remove debug word list | 2015-03-26 16:44:45 +01:00
Matthew Honnibal | f21ab2d7fb | * Fix bug in ugly ent_strings hack on English class | 2015-03-26 16:44:45 +01:00
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1c843934be
							
						
					 | 
					
						
						
							
							* Fix oracle bug in NER. Now getting 77% F on ontonotes
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							903f196b3f
							
						
					 | 
					
						
						
							
							* Fix verbose printing for scorer
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e181c051d5
							
						
					 | 
					
						
						
							
							* Improve features for NER
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7ecb52c0ed
							
						
					 | 
					
						
						
							
							* Add scorer script
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8057a95f20
							
						
					 | 
					
						
						
							
							* NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring.
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ae235e07b9
							
						
					 | 
					
						
						
							
							* Refactoring working for parser, but now need to rig up features for NER, and then debug oracle etc.
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b3eda03c9c
							
						
					 | 
					
						
						
							
							* Tmp
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							220ce8bfed
							
						
					 | 
					
						
						
							
							* Prepare English class for NER
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f5830dc1c1
							
						
					 | 
					
						
						
							
							* Remove _transitions.pyx
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6865c2fb4d
							
						
					 | 
					
						
						
							
							* Fix assignment of dep strings in tokens.pyx
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6b6bce9e7a
							
						
					 | 
					
						
						
							
							* Fix label loading for transition system
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5278c7504b
							
						
					 | 
					
						
						
							
							* Hacks to conll.pyx. Should clean these up.
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f321b2b2eb
							
						
					 | 
					
						
						
							
							* Remove TODO comment
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fdabd93bfb
							
						
					 | 
					
						
						
							
							* Ensure high loss for invalid moves, and fix label reading for arc-eager
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							10ed738df2
							
						
					 | 
					
						
						
							
							* Tmp commit
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4f83c9b3d5
							
						
					 | 
					
						
						
							
							* Make costs label-sensitive
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							179b7eb0a7
							
						
					 | 
					
						
						
							
							* Specify parser transition system in language
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8c883cef58
							
						
					 | 
					
						
						
							
							* Refactored transition system code now compiling. Still need to hook up label oracle, and test
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f0159ab4b6
							
						
					 | 
					
						
						
							
							* Add file to hold GoldParse class
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:42 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8eadb984cb
							
						
					 | 
					
						
						
							
							* Refactor arc_eager to use new TransitionSystem base class. Need to fix oracle
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:42 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b063001596
							
						
					 | 
					
						
						
							
							* Add base TransitionSystem class. Still need to rethink how non-monotonic labelling will work for best_valid
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:42 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							01bc4d6815
							
						
					 | 
					
						
						
							
							* Add set_parse method, to assign parse to tokens in a less hacky way.
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:42 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							dc986dbc0b
							
						
					 | 
					
						
						
							
							* Work on refactored parser, where TransitionSystem can be easily subclassed
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:42 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1cc6329b18
							
						
					 | 
					
						
						
							
							* Add base class to do transitions
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:42 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							135756ac3d
							
						
					 | 
					
						
						
							
							* Tmp commit of NER refactoring
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:42 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							23c1f6fc04
							
						
					 | 
					
						
						
							
							* Merge changes from stash
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:41 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0ff078876a
							
						
					 | 
					
						
						
							
							* Commit some work on ner.pyx done on the plane
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:41 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d81b7be6a2
							
						
					 | 
					
						
						
							
							* Merge train.py
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:41 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2e3dc3dfe2
							
						
					 | 
					
						
						
							
							* Merge changes in tokens.pyx
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:41 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8cc3524dc9
							
						
					 | 
					
						
						
							
							* Ws
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:41 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3d0570685c
							
						
					 | 
					
						
						
							
							* Add NER transition system
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:41 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							043b758cf4
							
						
					 | 
					
						
						
							
							* Resurrect old NER code. This version won't be the one that runs; we want to re-use the parser code. But for now this is a useful reference.
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:41 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b139aa92ba
							
						
					 | 
					
						
						
							
							* Start setting out how NER will be implemented in the data model
						
						
						
						
						
					 | 
					
						2015-03-26 16:44:41 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0962ffc095
							
						
					 | 
					
						
						
							
							* Fix issue #37: missing check_flag attribute from Token class
						
						
						
						
						
					 | 
					
						2015-03-26 15:06:26 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2e8d0e5d45
							
						
					 | 
					
						
						
							
							* Upd download script
						
						
						
						
						
					 | 
					
						2015-03-03 05:47:16 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							dbe26f5793
							
						
					 | 
					
						
						
							
							* Add children and subtree methods to Token, which are generators to assist parse-tree navigation.
						
						
						
						
						
					 | 
					
						2015-03-03 04:18:41 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ea90d136e8
							
						
					 | 
					
						
						
							
							* Fix bug in labelled parsing, that caused an 8% drop in labelled accuracy.
						
						
						
						
						
					 | 
					
						2015-02-27 03:56:10 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							caf046b220
							
						
					 | 
					
						
						
							
							* Hastily add method to apply tags from a list of strings, instead of predicting the tags.
						
						
						
						
						
					 | 
					
						2015-02-23 15:40:17 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							cae077b583
							
						
					 | 
					
						
						
							
							* Work on fixing orphaned Token objects bug
						
						
						
						
						
					 | 
					
						2015-02-16 15:20:31 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7572e31f5e
							
						
					 | 
					
						
						
							
							* Pass ownership of C data to Token instances if Tokens object is being garbage-collected, but Token instances are staying alive.
						
						
						
						
						
					 | 
					
						2015-02-11 18:05:06 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							64645a1c2f
							
						
					 | 
					
						
						
							
							* Improve docstring on English
						
						
						
						
						
					 | 
					
						2015-02-11 15:13:20 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							594e50bd45
							
						
					 | 
					
						
						
							
							* Add option to download speech-parsing data set.
						
						
						
						
						
					 | 
					
						2015-02-11 14:20:29 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0b7e769211
							
						
					 | 
					
						
						
							
							* Add POS tags to support SWBD tag set
						
						
						
						
						
					 | 
					
						2015-02-11 14:08:28 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							312b3a45f3
							
						
					 | 
					
						
						
							
							* Fix issue #19: Allow parsing/pos tagging of empty strings
						
						
						
						
						
					 | 
					
						2015-02-10 10:15:58 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2a0615104b
							
						
					 | 
					
						
						
							
							* Upd download script
						
						
						
						
						
					 | 
					
						2015-02-09 10:22:59 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5c3513583d
							
						
					 | 
					
						
						
							
							* Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens.
						
						
						
						
						
					 | 
					
						2015-02-09 03:57:10 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							be5536d239
							
						
					 | 
					
						
						
							
							* Fix Issue #22: PRP and PRP$ were mapped to NOUN. Should be PRON.
						
						
						
						
						
					 | 
					
						2015-02-08 18:36:18 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0492cee8b4
							
						
					 | 
					
						
						
							
							* Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens
						
						
						
						
						
					 | 
					
						2015-02-08 18:30:30 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d229fbd228
							
						
					 | 
					
						
						
							
							* Give better error on out-of-bounds array access
						
						
						
						
						
					 | 
					
						2015-02-07 12:59:12 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ab8bb047d0
							
						
					 | 
					
						
						
							
							* Fix negative index for __getitem__
						
						
						
						
						
					 | 
					
						2015-02-07 12:58:46 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							44c7eafe44
							
						
					 | 
					
						
						
							
							* Fix download.py
						
						
						
						
						
					 | 
					
						2015-02-07 12:00:36 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6ca7f2eedc
							
						
					 | 
					
						
						
							
							* Upd download script
						
						
						
						
						
					 | 
					
						2015-02-07 11:32:33 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f0e0588833
							
						
					 | 
					
						
						
							
							* Fill L2 norm attribute on LexemeC struct
						
						
						
						
						
					 | 
					
						2015-02-07 08:44:42 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							75f9b7d6bf
							
						
					 | 
					
						
						
							
							* Add L2 norm field to LexemeC struct
						
						
						
						
						
					 | 
					
						2015-02-07 08:43:17 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							51b618d646
							
						
					 | 
					
						
						
							
							* Add a has_repvec property to Lexeme, and a check function to check flags
						
						
						
						
						
					 | 
					
						2015-02-07 08:42:44 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							321b402739
							
						
					 | 
					
						
						
							
							* Store the l2 norm of the word's vector
						
						
						
						
						
					 | 
					
						2015-02-07 08:42:16 -05:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c7d8644149
							
						
					 | 
					
						
						
							
							* Fix regression on 'prob' attr of Token.
						
						
						
						
						
					 | 
					
						2015-02-03 03:32:18 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c55a33d045
							
						
					 | 
					
						
						
							
							* Catch oracle errors
						
						
						
						
						
					 | 
					
						2015-02-02 23:02:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							de772088e6
							
						
					 | 
					
						
						
							
							* Use parse tree for sbd in Tokens.sents
						
						
						
						
						
					 | 
					
						2015-02-02 12:17:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							56c2ef2982
							
						
					 | 
					
						
						
							
							* Tweak POS features for web text
						
						
						
						
						
					 | 
					
						2015-02-02 11:59:36 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d68678a93e
							
						
					 | 
					
						
						
							
							* Add Exception class, OracleError
						
						
						
						
						
					 | 
					
						2015-02-02 11:57:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a20fdbd8ee
							
						
					 | 
					
						
						
							
							* Upd download script
						
						
						
						
						
					 | 
					
						2015-02-01 13:22:23 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							76d9394cb4
							
						
					 | 
					
						
						
							
							* Fix vocab.pyx for Python3
						
						
						
						
						
					 | 
					
						2015-02-01 13:14:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							63abdf154c
							
						
					 | 
					
						
						
							
							* Hastily hack download file
						
						
						
						
						
					 | 
					
						2015-01-31 22:48:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7de00c5a79
							
						
					 | 
					
						
						
							
							* Try not holding a reference to Pool, since that seems to confuse the GC
						
						
						
						
						
					 | 
					
						2015-01-31 22:10:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ce3ae8b5d9
							
						
					 | 
					
						
						
							
							* Fix platform-specific lexicon bug.
						
						
						
						
						
					 | 
					
						2015-01-31 16:38:58 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a1ed574b7b
							
						
					 | 
					
						
						
							
							* Fix default model path for English
						
						
						
						
						
					 | 
					
						2015-01-31 16:38:27 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							018e0bfa24
							
						
					 | 
					
						
						
							
							* Bug fixes to parse navigation
						
						
						
						
						
					 | 
					
						2015-01-31 16:37:13 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e013555b25
							
						
					 | 
					
						
						
							
							* Add option to download script
						
						
						
						
						
					 | 
					
						2015-01-31 13:51:56 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							08ca5c8970
							
						
					 | 
					
						
						
							
							* Add sent_end flag to TokenC struct
						
						
						
						
						
					 | 
					
						2015-01-31 13:44:16 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							024cfd485c
							
						
					 | 
					
						
						
							
							* Pass tag_strings as a tuple, to support new Tokens API
						
						
						
						
						
					 | 
					
						2015-01-31 13:43:37 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							77d62d0179
							
						
					 | 
					
						
						
							
							* Large refactor of Token objects, making them much thinner. This is to support fast parse-tree navigation.
						
						
						
						
						
					 | 
					
						2015-01-31 13:42:58 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							88170e6295
							
						
					 | 
					
						
						
							
							* Supply dep_strings as a tuple, for the changed API on Tokens
						
						
						
						
						
					 | 
					
						2015-01-31 13:42:09 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0981d68022
							
						
					 | 
					
						
						
							
							* Set a sent_end flag during parsing, for later use
						
						
						
						
						
					 | 
					
						2015-01-31 13:41:46 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							251dbf24d7
							
						
					 | 
					
						
						
							
							* Fix uninitialised variable error
						
						
						
						
						
					 | 
					
						2015-01-30 20:46:34 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							83a4df5a1a
							
						
					 | 
					
						
						
							
							* Fix download script
						
						
						
						
						
					 | 
					
						2015-01-30 20:40:42 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6f9ebc2f34
							
						
					 | 
					
						
						
							
							* Fix download script
						
						
						
						
						
					 | 
					
						2015-01-30 20:33:19 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8b85d0bb8a
							
						
					 | 
					
						
						
							
							* Only download small data if no data dir exists
						
						
						
						
						
					 | 
					
						2015-01-30 20:27:14 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1a7a1c2771
							
						
					 | 
					
						
						
							
							* Fix Issue #16: tokens recurse when printing
						
						
						
						
						
					 | 
					
						2015-01-30 19:47:50 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							cb95ef6934
							
						
					 | 
					
						
						
							
							* Fix download script
						
						
						
						
						
					 | 
					
						2015-01-30 19:28:43 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e578bd37bd
							
						
					 | 
					
						
						
							
							* Fix download script
						
						
						
						
						
					 | 
					
						2015-01-30 18:59:31 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							df52014d12
							
						
					 | 
					
						
						
							
							* Fix download script
						
						
						
						
						
					 | 
					
						2015-01-30 18:36:24 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0f95712189
							
						
					 | 
					
						
						
							
							* Improve accuracy reporting during training
						
						
						
						
						
					 | 
					
						2015-01-30 18:05:06 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b68f563c2f
							
						
					 | 
					
						
						
							
							* Fix Issue #14: Improve parsing API
						
						
						
						
						
					 | 
					
						2015-01-30 18:04:41 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							998b607f65
							
						
					 | 
					
						
						
							
							* Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source
						
						
						
						
						
					 | 
					
						2015-01-30 18:04:01 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							67d6e53a69
							
						
					 | 
					
						
						
							
							* Ensure parser and tagger function correctly when training from missing values, indicated by -1
						
						
						
						
						
					 | 
					
						2015-01-30 14:08:56 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4ff180db74
							
						
					 | 
					
						
						
							
							* Fix off-by-one error in commit 0a7fceb
						
						
						
						
						
					 | 
					
						2015-01-30 12:49:33 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0a7fcebdf7
							
						
					 | 
					
						
						
							
							* Fix Issue #12: Incorrect token.idx calculations for some punctuation, in the presence of token cache
						
						
						
						
						
					 | 
					
						2015-01-30 12:33:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ebf7d2fab1
							
						
					 | 
					
						
						
							
							* Use non-joint sbd, for more simplicity and fewer classes
						
						
						
						
						
					 | 
					
						2015-01-29 06:22:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d05c5bf141
							
						
					 | 
					
						
						
							
							* Remove comment
						
						
						
						
						
					 | 
					
						2015-01-29 05:19:27 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							320b045daa
							
						
					 | 
					
						
						
							
							* Oracle now consistent over gold standard derivation
						
						
						
						
						
					 | 
					
						2015-01-29 03:41:58 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f590382134
							
						
					 | 
					
						
						
							
							* Work on sbd
						
						
						
						
						
					 | 
					
						2015-01-29 03:18:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1884a7a0be
							
						
					 | 
					
						
						
							
							* Attach comment with paper
						
						
						
						
						
					 | 
					
						2015-01-28 03:18:43 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a2d6b195db
							
						
					 | 
					
						
						
							
							* Add messy Break transitions, carefully following the scheme of Dd Zhang et al (2013)
						
						
						
						
						
					 | 
					
						2015-01-28 03:09:45 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f9ee5d9934
							
						
					 | 
					
						
						
							
							* Build a python list of word strings, for debugging
						
						
						
						
						
					 | 
					
						2015-01-28 01:06:13 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d819101571
							
						
					 | 
					
						
						
							
							* Improve error message on oracle failure
						
						
						
						
						
					 | 
					
						2015-01-28 00:58:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e6c3d3471f
							
						
					 | 
					
						
						
							
							* Tweak documentation for Tokens, and hide constructor as __cinit__
						
						
						
						
						
					 | 
					
						2015-01-27 18:57:52 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c38c62d4a3
							
						
					 | 
					
						
						
							
							* Add docstring to English class
						
						
						
						
						
					 | 
					
						2015-01-27 02:45:21 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d4c99f7dec
							
						
					 | 
					
						
						
							
							* Add attrs.pxd
						
						
						
						
						
					 | 
					
						2015-01-26 22:22:09 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d4a493855e
							
						
					 | 
					
						
						
							
							* Fix error msg
						
						
						
						
						
					 | 
					
						2015-01-25 23:01:30 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7f87716cf7
							
						
					 | 
					
						
						
							
							* Fix download script
						
						
						
						
						
					 | 
					
						2015-01-25 23:01:10 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							92fb9257dd
							
						
					 | 
					
						
						
							
							* Add parts-of-speech file
						
						
						
						
						
					 | 
					
						2015-01-25 22:00:39 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c1c3dba4cb
							
						
					 | 
					
						
						
							
							* Check whether vector files are present before trying to load them.
						
						
						
						
						
					 | 
					
						2015-01-25 18:16:48 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5049d4c2e6
							
						
					 | 
					
						
						
							
							* Add parts_of_speech.pyx
						
						
						
						
						
					 | 
					
						2015-01-25 16:32:26 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							12b034e3ef
							
						
					 | 
					
						
						
							
							* Move POS tag definitions to parts_of_speech.pxd
						
						
						
						
						
					 | 
					
						2015-01-25 16:31:07 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7431c133d8
							
						
					 | 
					
						
						
							
							* Add error if try to access head and not is_parsed
						
						
						
						
						
					 | 
					
						2015-01-25 15:33:54 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							951d06c824
							
						
					 | 
					
						
						
							
							* Silently don't parse if data is not present
						
						
						
						
						
					 | 
					
						2015-01-25 14:47:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4e857ab7a6
							
						
					 | 
					
						
						
							
							* Fix bug in POS tagger feature
						
						
						
						
						
					 | 
					
						2015-01-25 02:20:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							dd56e298e2
							
						
					 | 
					
						
						
							
							* Ensure tagging is applied if parse=True
						
						
						
						
						
					 | 
					
						2015-01-25 02:19:44 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							94750819cd
							
						
					 | 
					
						
						
							
							* Set parse=True by default --- i.e. parse unless told not to.
						
						
						
						
						
					 | 
					
						2015-01-25 01:28:28 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							71b95202eb
							
						
					 | 
					
						
						
							
							* Add docstring to StringStore
						
						
						
						
						
					 | 
					
						2015-01-24 20:49:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6d1c08dafd
							
						
					 | 
					
						
						
							
							* Add docstring to Lexeme
						
						
						
						
						
					 | 
					
						2015-01-24 20:48:34 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a97bed9359
							
						
					 | 
					
						
						
							
							* Fix POS and dependency label tag names.  Add parse and string navigation functions.
						
						
						
						
						
					 | 
					
						2015-01-24 17:29:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							76cd024095
							
						
					 | 
					
						
						
							
							* Add whitespace property to Token
						
						
						
						
						
					 | 
					
						2015-01-24 07:41:21 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5fd72bc220
							
						
					 | 
					
						
						
							
							* Have 'string' refer to the whitespace-padded string
						
						
						
						
						
					 | 
					
						2015-01-24 07:32:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fda94271af
							
						
					 | 
					
						
						
							
							* Rename NORM1 and NORM2 attrs to lower and norm
						
						
						
						
						
					 | 
					
						2015-01-24 06:17:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5ed8b2b98f
							
						
					 | 
					
						
						
							
							* Rename sic to orth
						
						
						
						
						
					 | 
					
						2015-01-23 02:08:25 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a27b23cc8f
							
						
					 | 
					
						
						
							
							* Have SBD return start/end indices
						
						
						
						
						
					 | 
					
						2015-01-22 22:24:44 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d460c28838
							
						
					 | 
					
						
						
							
							* Rename vec to repvec
						
						
						
						
						
					 | 
					
						2015-01-22 02:06:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8b9d913d97
							
						
					 | 
					
						
						
							
							* Rename vec to repvec
						
						
						
						
						
					 | 
					
						2015-01-22 02:05:58 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9cd0b6b3e9
							
						
					 | 
					
						
						
							
							* Various tweaks to Tokens class
						
						
						
						
						
					 | 
					
						2015-01-22 02:05:37 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5928d158ce
							
						
					 | 
					
						
						
							
							* Pass the string to Tokens
						
						
						
						
						
					 | 
					
						2015-01-22 02:04:58 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							45264e356b
							
						
					 | 
					
						
						
							
							* Rename vec to repvec
						
						
						
						
						
					 | 
					
						2015-01-22 02:04:24 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5e63c606ad
							
						
					 | 
					
						
						
							
							* Rename vec to repvec
						
						
						
						
						
					 | 
					
						2015-01-22 02:03:54 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							56e6cf0672
							
						
					 | 
					
						
						
							
							* Add _string attr to Tokens object
						
						
						
						
						
					 | 
					
						2015-01-21 18:57:09 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d6ac60e91c
							
						
					 | 
					
						
						
							
							* Bug fixes to sentences method, and improved vector transport for tokens
						
						
						
						
						
					 | 
					
						2015-01-21 18:56:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f2a229136c
							
						
					 | 
					
						
						
							
							* Fix data_dir=None argument to English class
						
						
						
						
						
					 | 
					
						2015-01-21 18:27:31 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ef49b8c179
							
						
					 | 
					
						
						
							
							* Add stop-word flag
						
						
						
						
						
					 | 
					
						2015-01-21 18:22:31 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6646bfc5df
							
						
					 | 
					
						
						
							
							* Add LOWER attr
						
						
						
						
						
					 | 
					
						2015-01-21 18:19:08 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f149259bf5
							
						
					 | 
					
						
						
							
							* Fix negative indices in tokens
						
						
						
						
						
					 | 
					
						2015-01-20 01:16:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b65b0c07bf
							
						
					 | 
					
						
						
							
							* Messily hook up vector in tokens
						
						
						
						
						
					 | 
					
						2015-01-19 19:59:55 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8ff5b8bd84
							
						
					 | 
					
						
						
							
							* Add attribute for POS scheme
						
						
						
						
						
					 | 
					
						2015-01-17 17:33:16 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6c7e44140b
							
						
					 | 
					
						
						
							
							* Work on word vectors, and other stuff
						
						
						
						
						
					 | 
					
						2015-01-17 16:21:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							802867e96a
							
						
					 | 
					
						
						
							
							* Revise interface to Token. Strings now have attribute names like norm1_
						
						
						
						
						
					 | 
					
						2015-01-15 03:51:47 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7d3c40de7d
							
						
					 | 
					
						
						
							
							* Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme
						
						
						
						
						
					 | 
					
						2015-01-15 00:33:16 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0930892fc1
							
						
					 | 
					
						
						
							
							* Tmp. Working on refactor. Compiles, must hook up lexical feats.
						
						
						
						
						
					 | 
					
						2015-01-14 00:03:48 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							46da3d74d2
							
						
					 | 
					
						
						
							
							* Tmp. Refactoring, introducing a Lexeme PyObject.
						
						
						
						
						
					 | 
					
						2015-01-12 11:23:44 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ce2edd6312
							
						
					 | 
					
						
						
							
							* Tmp commit. Refactoring to create a Python Lexeme class.
						
						
						
						
						
					 | 
					
						2015-01-12 10:26:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							aacaf1a0f0
							
						
					 | 
					
						
						
							
							* Fix parser
						
						
						
						
						
					 | 
					
						2015-01-08 01:19:23 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9a21127bf7
							
						
					 | 
					
						
						
							
							* Fix parser, which was importing the wrong model
						
						
						
						
						
					 | 
					
						2015-01-08 00:10:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6a3e39cdd1
							
						
					 | 
					
						
						
							
							* Add typedefs.pyx
						
						
						
						
						
					 | 
					
						2015-01-06 04:51:40 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a58920cc5e
							
						
					 | 
					
						
						
							
							* Import orth.word_shape as a C module
						
						
						
						
						
					 | 
					
						2015-01-06 03:18:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6b68f7ef75
							
						
					 | 
					
						
						
							
							* Finally get string types right for orth function
						
						
						
						
						
					 | 
					
						2015-01-06 03:17:39 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							90c143bd85
							
						
					 | 
					
						
						
							
							* Fix orth import
						
						
						
						
						
					 | 
					
						2015-01-05 18:49:19 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7689dccd0f
							
						
					 | 
					
						
						
							
							* Remove unused import
						
						
						
						
						
					 | 
					
						2015-01-05 18:48:48 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3f1944d688
							
						
					 | 
					
						
						
							
							* Make PyPy work
						
						
						
						
						
					 | 
					
						2015-01-05 17:54:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a510d9f677
							
						
					 | 
					
						
						
							
							* Another assertion removed
						
						
						
						
						
					 | 
					
						2015-01-05 13:01:40 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2856946a66
							
						
					 | 
					
						
						
							
							* Remove assertion that doesn't work on Python 3
						
						
						
						
						
					 | 
					
						2015-01-05 12:51:16 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							94034f1112
							
						
					 | 
					
						
						
							
							* Fix encoding in lemmatization
						
						
						
						
						
					 | 
					
						2015-01-05 11:54:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b132b3caa6
							
						
					 | 
					
						
						
							
							* Fix unicode error in lemmatizer
						
						
						
						
						
					 | 
					
						2015-01-05 11:53:54 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							477e7fbffe
							
						
					 | 
					
						
						
							
							* Fix data reading for lemmatizer
						
						
						
						
						
					 | 
					
						2015-01-05 06:01:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							58f75abaca
							
						
					 | 
					
						
						
							
							* Fix unicode error in orth
						
						
						
						
						
					 | 
					
						2015-01-05 05:53:08 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4e085d5166
							
						
					 | 
					
						
						
							
							* Fix lemmatizer for Python3
						
						
						
						
						
					 | 
					
						2015-01-05 05:51:26 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ae7c811fd1
							
						
					 | 
					
						
						
							
							* Use Exception instead of StandardError
						
						
						
						
						
					 | 
					
						2015-01-04 01:22:12 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0e4c2ba036
							
						
					 | 
					
						
						
							
							* Fix loading of special morph words
						
						
						
						
						
					 | 
					
						2015-01-03 23:13:00 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f5d41028b5
							
						
					 | 
					
						
						
							
							* Move around data files for test release
						
						
						
						
						
					 | 
					
						2015-01-03 01:59:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a24321b63a
							
						
					 | 
					
						
						
							
							* Add downloader
						
						
						
						
						
					 | 
					
						2015-01-02 21:44:41 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5d9a096e2f
							
						
					 | 
					
						
						
							
							* Some minor clean-up after HastyModel
						
						
						
						
						
					 | 
					
						2014-12-31 19:46:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							aafaf58cbe
							
						
					 | 
					
						
						
							
							* Refactor _ml.Model, and finish implementing HastyModel. So far not worthwhile.
						
						
						
						
						
					 | 
					
						2014-12-31 19:40:59 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bcd038e7b6
							
						
					 | 
					
						
						
							
							* Implement HastyModel
						
						
						
						
						
					 | 
					
						2014-12-31 01:16:47 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1a075f77ff
							
						
					 | 
					
						
						
							
							* Don't over-ride pre-loaded POS tags, if set by special-cases
						
						
						
						
						
					 | 
					
						2014-12-30 23:26:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							785c7ba76a
							
						
					 | 
					
						
						
							
							* Embed signature on attrs
						
						
						
						
						
					 | 
					
						2014-12-30 23:25:31 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							30e5805656
							
						
					 | 
					
						
						
							
							* Lazy-load tagger and parser
						
						
						
						
						
					 | 
					
						2014-12-30 23:25:09 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9976aa976e
							
						
					 | 
					
						
						
							
							* Messily fix morphology and POS tags on special tokens.
						
						
						
						
						
					 | 
					
						2014-12-30 23:24:37 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c1ef3febee
							
						
					 | 
					
						
						
							
							* Embedsignature in tokens.pyx
						
						
						
						
						
					 | 
					
						2014-12-30 21:22:00 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							aac5028b6e
							
						
					 | 
					
						
						
							
							* Move tagger to _ml
						
						
						
						
						
					 | 
					
						2014-12-30 21:21:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1ffb0229ed
							
						
					 | 
					
						
						
							
							* Import tokens in parser.pxd
						
						
						
						
						
					 | 
					
						2014-12-30 21:21:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bb0b00f819
							
						
					 | 
					
						
						
							
							* Repurpose the Tagger class as a generic Model, wrapping thinc's interface
						
						
						
						
						
					 | 
					
						2014-12-30 21:20:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fe2a5e0370
							
						
					 | 
					
						
						
							
							* Work on docstrings
						
						
						
						
						
					 | 
					
						2014-12-27 21:46:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bb80937544
							
						
					 | 
					
						
						
							
							* Upd docstrings
						
						
						
						
						
					 | 
					
						2014-12-27 18:45:16 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b8b65903fc
							
						
					 | 
					
						
						
							
							* Tmp
						
						
						
						
						
					 | 
					
						2014-12-24 17:42:00 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ab61673edd
							
						
					 | 
					
						
						
							
							* Fix api of array method
						
						
						
						
						
					 | 
					
						2014-12-23 15:18:48 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7708d0e24a
							
						
					 | 
					
						
						
							
							* Move lemmatizer to en dir
						
						
						
						
						
					 | 
					
						2014-12-23 15:16:57 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							98eb4c0426
							
						
					 | 
					
						
						
							
							* Fix path to parser model
						
						
						
						
						
					 | 
					
						2014-12-23 15:09:09 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b00bc01d8c
							
						
					 | 
					
						
						
							
							* All tests now passing for reorg
						
						
						
						
						
					 | 
					
						2014-12-23 13:18:59 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							73f200436f
							
						
					 | 
					
						
						
							
							* Tests passing except for morphology/lemmatization stuff
						
						
						
						
						
					 | 
					
						2014-12-23 11:40:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							cf8d26c3d2
							
						
					 | 
					
						
						
							
							* POS tagger training working after reorg
						
						
						
						
						
					 | 
					
						2014-12-22 08:54:47 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4c4aa2c5c9
							
						
					 | 
					
						
						
							
							* Work on train
						
						
						
						
						
					 | 
					
						2014-12-22 07:25:43 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							61df50b598
							
						
					 | 
					
						
						
							
							* Add English-subclass POS tagger
						
						
						
						
						
					 | 
					
						2014-12-21 20:59:07 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9f3f07cab6
							
						
					 | 
					
						
						
							
							* Add attrs file for English
						
						
						
						
						
					 | 
					
						2014-12-21 11:29:11 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2a89d70429
							
						
					 | 
					
						
						
							
							* Add vocab.pyx to setup, and ensure we can import spacy.en.lang
						
						
						
						
						
					 | 
					
						2014-12-21 06:03:53 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b34a1325d3
							
						
					 | 
					
						
						
							
							* Everything compiling after reorg. About to start testing.
						
						
						
						
						
					 | 
					
						2014-12-21 05:42:23 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e1c1a4b868
							
						
					 | 
					
						
						
							
							* Tmp
						
						
						
						
						
					 | 
					
						2014-12-21 05:36:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d11c1edf8c
							
						
					 | 
					
						
						
							
							* Import slice_unicode from strings.pyx
						
						
						
						
						
					 | 
					
						2014-12-20 07:56:26 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							be1bdcbd85
							
						
					 | 
					
						
						
							
							* Move lang.pyx to tokenizer.pyx
						
						
						
						
						
					 | 
					
						2014-12-20 07:55:40 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							89a1cc1a48
							
						
					 | 
					
						
						
							
							* Move murmurhash to .pxd in strings file
						
						
						
						
						
					 | 
					
						2014-12-20 07:41:08 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d5a942c4a4
							
						
					 | 
					
						
						
							
							* Rename lang.pyx to tokenizer.pyx
						
						
						
						
						
					 | 
					
						2014-12-20 07:30:39 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a60ae261ae
							
						
					 | 
					
						
						
							
							* Move tokenizer to its own file, and refactor
						
						
						
						
						
					 | 
					
						2014-12-20 07:29:16 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							867a4a000c
							
						
					 | 
					
						
						
							
							* Export set_morph_from_dict function
						
						
						
						
						
					 | 
					
						2014-12-20 07:28:27 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4e30195c6d
							
						
					 | 
					
						
						
							
							* Refactor morphology.pyx
						
						
						
						
						
					 | 
					
						2014-12-20 07:27:28 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4c6ce7ee84
							
						
					 | 
					
						
						
							
							* Update tokens.pyx as part of reorg
						
						
						
						
						
					 | 
					
						2014-12-20 07:03:26 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							116f7f3bc1
							
						
					 | 
					
						
						
							
							* Rename Lexicon to Vocab, and move it to its own file
						
						
						
						
						
					 | 
					
						2014-12-20 06:54:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							780cbd68b1
							
						
					 | 
					
						
						
							
							* Move all struct definitions to structs.pxd, to avoid circular dependencies
						
						
						
						
						
					 | 
					
						2014-12-20 06:51:33 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f6556d8e5d
							
						
					 | 
					
						
						
							
							* Refactor, move Lexeme struct to structs.pxd
						
						
						
						
						
					 | 
					
						2014-12-20 06:51:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7d48bba6c4
							
						
					 | 
					
						
						
							
							* Move StringStore class to its own file
						
						
						
						
						
					 | 
					
						2014-12-20 06:42:01 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b066102d2d
							
						
					 | 
					
						
						
							
							* Remove POS cache for now
						
						
						
						
						
					 | 
					
						2014-12-20 03:49:58 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ff252dd535
							
						
					 | 
					
						
						
							
							* Clean up 'guess_cache' idea, which didn't work well enough
						
						
						
						
						
					 | 
					
						2014-12-20 03:49:11 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9d3ca13909
							
						
					 | 
					
						
						
							
							* Start work on parse-tree iteration classes
						
						
						
						
						
					 | 
					
						2014-12-20 03:48:10 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bed680c632
							
						
					 | 
					
						
						
							
							* Remove commented-out features
						
						
						
						
						
					 | 
					
						2014-12-20 03:47:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3d178c03ae
							
						
					 | 
					
						
						
							
							* Prune the features a bit
						
						
						
						
						
					 | 
					
						2014-12-20 02:46:14 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a0408e1758
							
						
					 | 
					
						
						
							
							* Working DecisionMemory class
						
						
						
						
						
					 | 
					
						2014-12-20 01:43:26 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7920ea72b4
							
						
					 | 
					
						
						
							
							* Working parser with the decision memory idea. Disabling that for now, for simplicity
						
						
						
						
						
					 | 
					
						2014-12-20 01:43:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a2f2a48da9
							
						
					 | 
					
						
						
							
							* Add some extra features
						
						
						
						
						
					 | 
					
						2014-12-20 01:42:24 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8fd9762d91
							
						
					 | 
					
						
						
							
							* Start laying out parse tree iteration methods
						
						
						
						
						
					 | 
					
						2014-12-20 01:42:09 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							53b8bc1f3c
							
						
					 | 
					
						
						
							
							* Work on implementing a trainable cache for the parser. So far, doesn't improve efficiency
						
						
						
						
						
					 | 
					
						2014-12-19 09:30:50 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							033d6c9ac2
							
						
					 | 
					
						
						
							
							* Adapt POS tagger decision-memory for use in parser
						
						
						
						
						
					 | 
					
						2014-12-19 07:23:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							809ddf7887
							
						
					 | 
					
						
						
							
							* Add index.pxd
						
						
						
						
						
					 | 
					
						2014-12-19 07:23:00 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1879abd16a
							
						
					 | 
					
						
						
							
							* Set const-correctness for tagger
						
						
						
						
						
					 | 
					
						2014-12-18 20:41:52 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f72243b156
							
						
					 | 
					
						
						
							
							* Set const-correctness for Feature* array
						
						
						
						
						
					 | 
					
						2014-12-18 20:41:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6ab7e40590
							
						
					 | 
					
						
						
							
							* Add non-monotonic parsing with cost-sensitive update. 92.26 on Y&M set
						
						
						
						
						
					 | 
					
						2014-12-18 11:33:25 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7e0c692daf
							
						
					 | 
					
						
						
							
							* Automatically push when the stack is empty
						
						
						
						
						
					 | 
					
						2014-12-18 09:16:10 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							61142a8eff
							
						
					 | 
					
						
						
							
							* Tweak features
						
						
						
						
						
					 | 
					
						2014-12-18 09:15:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8446ebfbbb
							
						
					 | 
					
						
						
							
							* Work on parser. Up to 92 UAS on YM labels
						
						
						
						
						
					 | 
					
						2014-12-18 09:05:31 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							55de747bfc
							
						
					 | 
					
						
						
							
							* Remove .cpp files
						
						
						
						
						
					 | 
					
						2014-12-18 02:43:13 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4448a840f7
							
						
					 | 
					
						
						
							
							* Work on greedy parsing. Scoring about 91.2
						
						
						
						
						
					 | 
					
						2014-12-18 02:42:55 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							87e9487d76
							
						
					 | 
					
						
						
							
							* Work on parser
						
						
						
						
						
					 | 
					
						2014-12-17 21:10:12 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9d7d97978d
							
						
					 | 
					
						
						
							
							* Work on greedy parser
						
						
						
						
						
					 | 
					
						2014-12-17 21:09:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d524dd306a
							
						
					 | 
					
						
						
							
							* Work on greedy parser
						
						
						
						
						
					 | 
					
						2014-12-17 03:19:43 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							95ccea03b2
							
						
					 | 
					
						
						
							
							* Work on greedy parser
						
						
						
						
						
					 | 
					
						2014-12-16 22:46:55 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a432862fde
							
						
					 | 
					
						
						
							
							* Add exception type to _arg_max_among in tagger
						
						
						
						
						
					 | 
					
						2014-12-16 09:44:19 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9e00798820
							
						
					 | 
					
						
						
							
							* Work on integrating a greedy dependency parser
						
						
						
						
						
					 | 
					
						2014-12-16 08:06:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							792802b2b9
							
						
					 | 
					
						
						
							
							* POS tag memoisation working, with good speed-up
						
						
						
						
						
					 | 
					
						2014-12-12 14:33:51 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ca54d58638
							
						
					 | 
					
						
						
							
							* Merge setup.py
						
						
						
						
						
					 | 
					
						2014-12-10 15:21:27 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9959a64f7b
							
						
					 | 
					
						
						
							
							* Working morphology and lemmatisation. POS tagging quite fast.
						
						
						
						
						
					 | 
					
						2014-12-10 08:09:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							df3be14987
							
						
					 | 
					
						
						
							
							* Add pos_type features to POS tagger
						
						
						
						
						
					 | 
					
						2014-12-10 08:08:55 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							42973c4b37
							
						
					 | 
					
						
						
							
							* Improve efficiency of tagger, and improve morphological processing
						
						
						
						
						
					 | 
					
						2014-12-10 01:02:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6b34a2f34b
							
						
					 | 
					
						
						
							
							* Move morphological analysis into its own module, morphology.pyx
						
						
						
						
						
					 | 
					
						2014-12-09 21:16:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b962fe73d7
							
						
					 | 
					
						
						
							
							* Make suffixes file use full-power regex, so that we can handle periods properly
						
						
						
						
						
					 | 
					
						2014-12-09 19:04:27 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							accdbe989b
							
						
					 | 
					
						
						
							
							* Remove Tokens.extend method
						
						
						
						
						
					 | 
					
						2014-12-09 17:09:23 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							495e1c7366
							
						
					 | 
					
						
						
							
							* Use fused type in Tokens.push_back, simplifying the use of the cache
						
						
						
						
						
					 | 
					
						2014-12-09 16:50:01 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							302e09018b
							
						
					 | 
					
						
						
							
							* Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas
						
						
						
						
						
					 | 
					
						2014-12-09 14:48:01 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							99bbbb6feb
							
						
					 | 
					
						
						
							
							* Work on morphological processing
						
						
						
						
						
					 | 
					
						2014-12-08 21:12:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7b68f911cf
							
						
					 | 
					
						
						
							
							* Add WordNet lemmatizer
						
						
						
						
						
					 | 
					
						2014-12-08 01:39:13 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c20dd79748
							
						
					 | 
					
						
						
							
							* Fiddle with const correctness and comments
						
						
						
						
						
					 | 
					
						2014-12-08 00:03:55 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b031c7c430
							
						
					 | 
					
						
						
							
							* Remove language-general context module
						
						
						
						
						
					 | 
					
						2014-12-07 23:53:01 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ef4398b204
							
						
					 | 
					
						
						
							
							* Rearrange POS stuff, so that language-specific stuff can live in language-specific modules
						
						
						
						
						
					 | 
					
						2014-12-07 23:52:41 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							327383e38a
							
						
					 | 
					
						
						
							
							* Remove unused code in tagger.pyx
						
						
						
						
						
					 | 
					
						2014-12-07 22:16:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9f17467c2e
							
						
					 | 
					
						
						
							
							* Fix EMPTY_TOKEN
						
						
						
						
						
					 | 
					
						2014-12-07 22:07:41 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3819a88e1b
							
						
					 | 
					
						
						
							
							* Add support for tag dictionary, and fix error-code for predict method
						
						
						
						
						
					 | 
					
						2014-12-07 22:07:16 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f00afe12c4
							
						
					 | 
					
						
						
							
							* Load POS tagger in load() function if path exists
						
						
						
						
						
					 | 
					
						2014-12-07 22:05:57 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5fe5e6e66b
							
						
					 | 
					
						
						
							
							* Move context functions to header, inlining them.
						
						
						
						
						
					 | 
					
						2014-12-07 21:59:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5caabec789
							
						
					 | 
					
						
						
							
							* Link in tagger, to work on integrating POS tagging
						
						
						
						
						
					 | 
					
						2014-12-07 15:29:41 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0c7aeb9de7
							
						
					 | 
					
						
						
							
							* Begin revising tagger, focussing on POS tagging
						
						
						
						
						
					 | 
					
						2014-12-07 15:29:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f5c4f2eb52
							
						
					 | 
					
						
						
							
							* Revise context, focussing on POS tagging for now
						
						
						
						
						
					 | 
					
						2014-12-07 15:28:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e27b912ef9
							
						
					 | 
					
						
						
							
							* Remove need for confusing _data pointer to be stored on Tokens
						
						
						
						
						
					 | 
					
						2014-12-05 16:31:30 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1c9253701d
							
						
					 | 
					
						
						
							
							* Introduce a TokenC struct, to handle token indices, pos tags and sense tags
						
						
						
						
						
					 | 
					
						2014-12-05 15:56:14 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							187372c7f3
							
						
					 | 
					
						
						
							
							* Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached
						
						
						
						
						
					 | 
					
						2014-12-05 03:29:50 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							75b8dfb348
							
						
					 | 
					
						
						
							
							* Remove upper_pc from lexeme.pyx
						
						
						
						
						
					 | 
					
						2014-12-04 22:14:34 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							49f3780ff5
							
						
					 | 
					
						
						
							
							* Fiddle with lexeme attrs
						
						
						
						
						
					 | 
					
						2014-12-04 21:22:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							564082e48e
							
						
					 | 
					
						
						
							
							* Hack Token class to take lex.dense in place of the old lex.norm. This needs to be fixed...
						
						
						
						
						
					 | 
					
						2014-12-04 20:51:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							69bb022204
							
						
					 | 
					
						
						
							
							* Add as_array and count_by method
						
						
						
						
						
					 | 
					
						2014-12-04 20:46:55 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e1b1f45cc9
							
						
					 | 
					
						
						
							
							* Add STEM attribute to lexeme
						
						
						
						
						
					 | 
					
						2014-12-04 20:46:20 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d7952634ca
							
						
					 | 
					
						
						
							
							* Make the string-store serve const pointers to Utf8Str
						
						
						
						
						
					 | 
					
						2014-12-03 16:01:47 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7e04c22f8f
							
						
					 | 
					
						
						
							
							* const added to Lexicon interface. Seems to work.
						
						
						
						
						
					 | 
					
						2014-12-03 15:58:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d70d31aa45
							
						
					 | 
					
						
						
							
							* Introduce first attempt at const-ness
						
						
						
						
						
					 | 
					
						2014-12-03 15:44:25 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4560ada85b
							
						
					 | 
					
						
						
							
							* Add typedef for attr_t. Change flag_t to flags_t
						
						
						
						
						
					 | 
					
						2014-12-03 11:06:31 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e600f7b327
							
						
					 | 
					
						
						
							
							* Move String struct stuff into the utf8string module, from spacy.lang
						
						
						
						
						
					 | 
					
						2014-12-03 11:06:00 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e170faf5b0
							
						
					 | 
					
						
						
							
							* Hack Tokens to work without tagger.pyx
						
						
						
						
						
					 | 
					
						2014-12-03 11:05:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b463a7eb86
							
						
					 | 
					
						
						
							
							* Make flag-setting a language-specific thing
						
						
						
						
						
					 | 
					
						2014-12-03 11:04:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							71b009e323
							
						
					 | 
					
						
						
							
							* Fix bug in refactored StringStore.__getitem__
						
						
						
						
						
					 | 
					
						2014-12-03 11:02:24 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							14097311ae
							
						
					 | 
					
						
						
							
							* Make StringStore.__getitem__ accept unicode-typed keys.
						
						
						
						
						
					 | 
					
						2014-12-03 01:33:20 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							522bb0346e
							
						
					 | 
					
						
						
							
							* Work on get_array method of Tokens
						
						
						
						
						
					 | 
					
						2014-12-02 23:48:05 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8c2938fe01
							
						
					 | 
					
						
						
							
							* Rename Lexicon._dict to Lexicon._map
						
						
						
						
						
					 | 
					
						2014-12-02 23:46:59 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							33dfb4933c
							
						
					 | 
					
						
						
							
							* Remove taggers from Language class. Work on doc strings
						
						
						
						
						
					 | 
					
						2014-11-26 19:53:55 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							80baa2e3db
							
						
					 | 
					
						
						
							
							* Work on beam parser
						
						
						
						
						
					 | 
					
						2014-11-20 19:49:33 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5c3016bac8
							
						
					 | 
					
						
						
							
							* Tmp commit of ner code
						
						
						
						
						
					 | 
					
						2014-11-14 18:27:47 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							33c421bcf8
							
						
					 | 
					
						
						
							
							* More feature tweaks
						
						
						
						
						
					 | 
					
						2014-11-12 23:59:16 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							41dedfb14e
							
						
					 | 
					
						
						
							
							* Add label features for NER parsing
						
						
						
						
						
					 | 
					
						2014-11-12 23:55:10 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							cf55b48ba6
							
						
					 | 
					
						
						
							
							* Switch to predict label on shift. Big increase in accuracy.
						
						
						
						
						
					 | 
					
						2014-11-12 23:50:12 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8f84e8a78b
							
						
					 | 
					
						
						
							
							* Neaten oracle
						
						
						
						
						
					 | 
					
						2014-11-12 23:38:07 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7e0a9077dd
							
						
					 | 
					
						
						
							
							* Add context files
						
						
						
						
						
					 | 
					
						2014-11-12 23:22:36 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3b0b902384
							
						
					 | 
					
						
						
							
							* IOB-style parsing working. Accuracy down from BILOU, from 87-88 to 85-86
						
						
						
						
						
					 | 
					
						2014-11-12 23:21:09 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e6bb8aa3a9
							
						
					 | 
					
						
						
							
							* Move moves to bilou_moves. Refactor context, returning to the simpler giant-enum style
						
						
						
						
						
					 | 
					
						2014-11-12 00:54:50 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c788633429
							
						
					 | 
					
						
						
							
							* Add tokens_from_list method to Language
						
						
						
						
						
					 | 
					
						2014-11-11 23:43:14 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							95282d4993
							
						
					 | 
					
						
						
							
							* Use the dynamic oracle 'follow' strategy
						
						
						
						
						
					 | 
					
						2014-11-11 21:11:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5aaf7a024d
							
						
					 | 
					
						
						
							
							* Move ner features to ner subdir
						
						
						
						
						
					 | 
					
						2014-11-11 21:09:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ff8989b63c
							
						
					 | 
					
						
						
							
							* Use greedy NER parser
						
						
						
						
						
					 | 
					
						2014-11-11 21:08:35 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0d943ab358
							
						
					 | 
					
						
						
							
							* Fixed greedy NER parsing. With static oracle, replicates accuracy from tagger.
						
						
						
						
						
					 | 
					
						2014-11-11 17:17:54 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							399239760b
							
						
					 | 
					
						
						
							
							* Fix moves for new State struct
						
						
						
						
						
					 | 
					
						2014-11-10 22:16:05 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							82247169f2
							
						
					 | 
					
						
						
							
							* Implement validation and oracle on pystate, for testing
						
						
						
						
						
					 | 
					
						2014-11-10 22:15:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3709ed9d6d
							
						
					 | 
					
						
						
							
							* Add curr field to State, to handle entity being built
						
						
						
						
						
					 | 
					
						2014-11-10 22:14:36 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							af9ed18cf1
							
						
					 | 
					
						
						
							
							* Bug fixes to NER
						
						
						
						
						
					 | 
					
						2014-11-10 17:39:23 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9f2587f5ec
							
						
					 | 
					
						
						
							
							* Work on shift-reduce NER
						
						
						
						
						
					 | 
					
						2014-11-10 16:28:56 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f307eb2e36
							
						
					 | 
					
						
						
							
							* Refactor context extraction, and start breaking out gold standards into their own functions
						
						
						
						
						
					 | 
					
						2014-11-09 15:43:07 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							602f993af9
							
						
					 | 
					
						
						
							
							* Moving tagger to accept multiple correct answers
						
						
						
						
						
					 | 
					
						2014-11-09 15:18:33 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f37d896a42
							
						
					 | 
					
						
						
							
							* Upd NER feats. With adadelta learner, getting 76.9 on NER
						
						
						
						
						
					 | 
					
						2014-11-07 04:43:54 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							68d1cdad62
							
						
					 | 
					
						
						
							
							* When encoding POS/NER tags, accept '-' as a missing value
						
						
						
						
						
					 | 
					
						2014-11-07 04:42:31 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							949a6245f9
							
						
					 | 
					
						
						
							
							* Increase default number of iterations from 5 to 10
						
						
						
						
						
					 | 
					
						2014-11-07 04:42:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3cab1d9a29
							
						
					 | 
					
						
						
							
							* Refine word_shape feature, by trimming the max sequence length
						
						
						
						
						
					 | 
					
						2014-11-07 04:41:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b4454cf036
							
						
					 | 
					
						
						
							
							* Add extra context tokens
						
						
						
						
						
					 | 
					
						2014-11-07 04:40:36 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							50309e6e49
							
						
					 | 
					
						
						
							
							* Fix context vector, importing all features
						
						
						
						
						
					 | 
					
						2014-11-05 22:11:39 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							07a23768de
							
						
					 | 
					
						
						
							
							* Play with NER feats a bit. Up to 82.00 training on MUC7.
						
						
						
						
						
					 | 
					
						2014-11-05 21:47:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4ecbe8c893
							
						
					 | 
					
						
						
							
							* Complete refactor of Tagger features, to use a generic list of context names.
						
						
						
						
						
					 | 
					
						2014-11-05 20:45:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0a8c84625d
							
						
					 | 
					
						
						
							
							* Moving feature context stuff to a generalized place
						
						
						
						
						
					 | 
					
						2014-11-05 19:55:10 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3733444101
							
						
					 | 
					
						
						
							
							* Generalize tagger code, in preparation for NER and supersense tagging.
						
						
						
						
						
					 | 
					
						2014-11-05 03:42:14 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							abbe3e44b0
							
						
					 | 
					
						
						
							
							* Move spacy.pos tagger to spacy.tagger, and generalize it so that it can take on other tagging tasks, given a different set of feature templates.
						
						
						
						
						
					 | 
					
						2014-11-05 00:37:59 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							954c970415
							
						
					 | 
					
						
						
							
							* Add __iter__ method to tokens
						
						
						
						
						
					 | 
					
						2014-11-04 01:07:08 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f07457a91f
							
						
					 | 
					
						
						
							
							* Remove POS alignment stuff. Now use training data based on raw text, instead of clumsy detokenization stuff
						
						
						
						
						
					 | 
					
						2014-11-04 01:06:43 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ae52f9f38c
							
						
					 | 
					
						
						
							
							* Remove vocab10k from tokens
						
						
						
						
						
					 | 
					
						2014-11-03 00:23:20 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							32fb50dc35
							
						
					 | 
					
						
						
							
							* Remove non_sparse method --- features wanting this can do it easily enough.
						
						
						
						
						
					 | 
					
						2014-11-03 00:15:47 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b5ae1471db
							
						
					 | 
					
						
						
							
							* Fiddle with POS tag features
						
						
						
						
						
					 | 
					
						2014-11-03 00:15:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							70ea862703
							
						
					 | 
					
						
						
							
							* Remove vocab10k field, and add flags for gazetteers
						
						
						
						
						
					 | 
					
						2014-11-03 00:13:51 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							711ed0f636
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-11-02 14:22:32 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fcd9490d56
							
						
					 | 
					
						
						
							
							* Add pos_tag method to Language
						
						
						
						
						
					 | 
					
						2014-11-02 14:21:43 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							829bb2bdbe
							
						
					 | 
					
						
						
							
							* Add mappings to Twitter POS tag corpus
						
						
						
						
						
					 | 
					
						2014-11-02 13:21:19 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							437cd2217d
							
						
					 | 
					
						
						
							
							* Fix strings i/o, removing use of ujson library in favour of plain text file. Allows better control of codecs.
						
						
						
						
						
					 | 
					
						2014-11-02 13:20:37 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3352e89e21
							
						
					 | 
					
						
						
							
							* Use LIKE_URL and LIKE_NUMBER flag features. Seems to improve accuracy on onto web
						
						
						
						
						
					 | 
					
						2014-11-02 13:19:54 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8335706321
							
						
					 | 
					
						
						
							
							* Add LIKE_URL and LIKE_NUMBER flag features
						
						
						
						
						
					 | 
					
						2014-11-02 13:19:23 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5484fbea69
							
						
					 | 
					
						
						
							
							* Implement is_number
						
						
						
						
						
					 | 
					
						2014-11-01 19:13:24 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f685218e21
							
						
					 | 
					
						
						
							
							* Add is_urlish function
						
						
						
						
						
					 | 
					
						2014-11-01 17:39:34 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							09a3e54176
							
						
					 | 
					
						
						
							
							* Delete print statements from stringstore
						
						
						
						
						
					 | 
					
						2014-10-31 17:45:26 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b186a66bae
							
						
					 | 
					
						
						
							
							* Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype
						
						
						
						
						
					 | 
					
						2014-10-31 17:44:39 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a8ca078b24
							
						
					 | 
					
						
						
							
							* Restore lexemes field to lexicon
						
						
						
						
						
					 | 
					
						2014-10-31 17:43:25 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6c807aa45f
							
						
					 | 
					
						
						
							
							* Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries
						
						
						
						
						
					 | 
					
						2014-10-31 17:43:00 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							aaf6953fe0
							
						
					 | 
					
						
						
							
							* Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.
						
						
						
						
						
					 | 
					
						2014-10-31 17:42:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f67cb9a5a3
							
						
					 | 
					
						
						
							
							* Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.
						
						
						
						
						
					 | 
					
						2014-10-31 17:42:04 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ea8f1e7053
							
						
					 | 
					
						
						
							
							* Tighten interfaces
						
						
						
						
						
					 | 
					
						2014-10-30 18:14:42 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ea85bf3a0a
							
						
					 | 
					
						
						
							
							* Tighten the interface to Language
						
						
						
						
						
					 | 
					
						2014-10-30 18:01:27 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c6fcd03692
							
						
					 | 
					
						
						
							
							* Small efficiency tweak to lexeme init
						
						
						
						
						
					 | 
					
						2014-10-30 17:56:11 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							87c2418a89
							
						
					 | 
					
						
						
							
							* Fiddle with data types on Lexeme, to compress them to a much smaller size.
						
						
						
						
						
					 | 
					
						2014-10-30 15:42:15 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ac88893232
							
						
					 | 
					
						
						
							
							* Fix Token after lexeme changes
						
						
						
						
						
					 | 
					
						2014-10-30 15:30:52 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e6b87766fe
							
						
					 | 
					
						
						
							
							* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme
						
						
						
						
						
					 | 
					
						2014-10-30 15:21:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							889b7b48b4
							
						
					 | 
					
						
						
							
							* Fix POS tagger, so that it loads correctly. Lexemes are being read in.
						
						
						
						
						
					 | 
					
						2014-10-30 13:38:55 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							67c8c8019f
							
						
					 | 
					
						
						
							
							* Update lexeme serialization, using a binary file format
						
						
						
						
						
					 | 
					
						2014-10-30 01:01:00 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							13909a2e24
							
						
					 | 
					
						
						
							
							* Rewriting Lexeme serialization.
						
						
						
						
						
					 | 
					
						2014-10-29 23:19:38 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							234d49bf4d
							
						
					 | 
					
						
						
							
							* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.
						
						
						
						
						
					 | 
					
						2014-10-24 02:23:42 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							08ce602243
							
						
					 | 
					
						
						
							
							* Large refactor, particularly to Python API
						
						
						
						
						
					 | 
					
						2014-10-24 00:59:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7baef5b7ff
							
						
					 | 
					
						
						
							
							* Fix padding on tokens
						
						
						
						
						
					 | 
					
						2014-10-23 04:01:17 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							96b835a3d4
							
						
					 | 
					
						
						
							
							* Upd for refactored Tokens class. Now gets 95.74, 185ms training on swbd_wsj_ewtb, eval on onto_web, Google POS tags.
						
						
						
						
						
					 | 
					
						2014-10-23 03:20:02 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e5e951ae67
							
						
					 | 
					
						
						
							
							* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.
						
						
						
						
						
					 | 
					
						2014-10-23 01:57:59 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ea1d4a81eb
							
						
					 | 
					
						
						
							
							* Refactoring get_atoms, improving tokens API
						
						
						
						
						
					 | 
					
						2014-10-22 13:10:56 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ad49e2482e
							
						
					 | 
					
						
						
							
							* Tagger now gets 97pc on wsj, parsing 19-21 in 500ms. Gets 92.7 on web text.
						
						
						
						
						
					 | 
					
						2014-10-22 12:57:06 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0a0e41f6c8
							
						
					 | 
					
						
						
							
							* Add prefix and suffix features
						
						
						
						
						
					 | 
					
						2014-10-22 12:56:09 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7018b53d3a
							
						
					 | 
					
						
						
							
							* Improve array features in tokens
						
						
						
						
						
					 | 
					
						2014-10-22 12:55:42 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							43d5964e13
							
						
					 | 
					
						
						
							
							* Add function to read detokenization rules
						
						
						
						
						
					 | 
					
						2014-10-22 12:54:59 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							224bdae996
							
						
					 | 
					
						
						
							
							* Add POS utilities
						
						
						
						
						
					 | 
					
						2014-10-22 10:17:57 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5ebe14f353
							
						
					 | 
					
						
						
							
							* Add greedy pos tagger
						
						
						
						
						
					 | 
					
						2014-10-22 10:17:26 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							12742f4f83
							
						
					 | 
					
						
						
							
							* Add detokenize method and test
						
						
						
						
						
					 | 
					
						2014-10-18 18:07:29 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							99f5e59286
							
						
					 | 
					
						
						
							
							* Have tokenizer emit tokens for whitespace other than single spaces
						
						
						
						
						
					 | 
					
						2014-10-14 20:25:57 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							43743a5d63
							
						
					 | 
					
						
						
							
							* Work on efficiency
						
						
						
						
						
					 | 
					
						2014-10-14 18:22:41 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6fb42c4919
							
						
					 | 
					
						
						
							
							* Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang
						
						
						
						
						
					 | 
					
						2014-10-14 16:17:45 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2805068ca8
							
						
					 | 
					
						
						
							
							* Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer
						
						
						
						
						
					 | 
					
						2014-10-14 15:21:03 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							65d3ead4fd
							
						
					 | 
					
						
						
							
							* Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id
						
						
						
						
						
					 | 
					
						2014-10-14 15:19:07 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							868e558037
							
						
					 | 
					
						
						
							
							* Preparations in place to handle hyphenation etc
						
						
						
						
						
					 | 
					
						2014-10-10 20:23:23 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ff79dbac2e
							
						
					 | 
					
						
						
							
							* More slight cleaning for lang.pyx
						
						
						
						
						
					 | 
					
						2014-10-10 20:11:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3d82ed1e5e
							
						
					 | 
					
						
						
							
							* More slight cleaning for lang.pyx
						
						
						
						
						
					 | 
					
						2014-10-10 19:50:07 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							02e948e7d5
							
						
					 | 
					
						
						
							
							* Remove counts stuff from Language class
						
						
						
						
						
					 | 
					
						2014-10-10 19:25:01 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							71ee921055
							
						
					 | 
					
						
						
							
							* Slight cleaning of tokenizer code
						
						
						
						
						
					 | 
					
						2014-10-10 19:17:22 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							59b41a9fd3
							
						
					 | 
					
						
						
							
							* Switch to new data model, tests passing
						
						
						
						
						
					 | 
					
						2014-10-10 08:11:31 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1b0e01d3d8
							
						
					 | 
					
						
						
							
							* Revising data model of lexeme. Compiles.
						
						
						
						
						
					 | 
					
						2014-10-09 19:53:30 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e40caae51f
							
						
					 | 
					
						
						
							
							* Update Lexicon class to expect a list of lexeme dict descriptions
						
						
						
						
						
					 | 
					
						2014-10-09 14:51:35 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							51d75b244b
							
						
					 | 
					
						
						
							
							* Add serialize/deserialize functions for lexeme, transport to/from python dict.
						
						
						
						
						
					 | 
					
						2014-10-09 14:10:46 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d73d89a2de
							
						
					 | 
					
						
						
							
							* Add i attribute to lexeme, giving lexemes sequential IDs.
						
						
						
						
						
					 | 
					
						2014-10-09 13:50:05 +11:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							096ef2b199
							
						
					 | 
					
						
						
							
							* Rename external hashing lib, from trustyc to preshed
						
						
						
						
						
					 | 
					
						2014-09-26 18:40:03 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							11a346fd5e
							
						
					 | 
					
						
						
							
							* Remove hashing modules, which are now taken over by external lib
						
						
						
						
						
					 | 
					
						2014-09-26 18:39:40 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							93505276ed
							
						
					 | 
					
						
						
							
							* Add German tokenizer files
						
						
						
						
						
					 | 
					
						2014-09-25 18:29:13 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2e44fa7179
							
						
					 | 
					
						
						
							
							* Add util.py
						
						
						
						
						
					 | 
					
						2014-09-25 18:26:22 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b15619e170
							
						
					 | 
					
						
						
							
							* Use PointerHash instead of locally provided _hashing module
						
						
						
						
						
					 | 
					
						2014-09-25 18:23:35 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ed446c67ad
							
						
					 | 
					
						
						
							
							* Add typedefs file
						
						
						
						
						
					 | 
					
						2014-09-17 23:10:32 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							316a57c4be
							
						
					 | 
					
						
						
							
							* Remove own memory classes, which have now been broken out into their own package
						
						
						
						
						
					 | 
					
						2014-09-17 23:10:07 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ac522e2553
							
						
					 | 
					
						
						
							
							* Switch from own memory class to cymem, in pip
						
						
						
						
						
					 | 
					
						2014-09-17 23:09:24 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6266cac593
							
						
					 | 
					
						
						
							
							* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks
						
						
						
						
						
					 | 
					
						2014-09-17 20:02:26 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5a20dfc03e
							
						
					 | 
					
						
						
							
							* Add memory management code
						
						
						
						
						
					 | 
					
						2014-09-17 20:02:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0152831c89
							
						
					 | 
					
						
						
							
							* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token.
						
						
						
						
						
					 | 
					
						2014-09-16 18:01:46 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							143e51ec73
							
						
					 | 
					
						
						
							
							* Refactor tokenization, splitting it into a clearer life-cycle.
						
						
						
						
						
					 | 
					
						2014-09-16 13:16:02 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c396581a0b
							
						
					 | 
					
						
						
							
							* Fiddle with the way strings are interned in lexeme
						
						
						
						
						
					 | 
					
						2014-09-15 06:34:45 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0bb547ab98
							
						
					 | 
					
						
						
							
							* Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance
						
						
						
						
						
					 | 
					
						2014-09-15 06:34:10 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7959141d36
							
						
					 | 
					
						
						
							
							* Add a few abbreviations, to get tests to pass
						
						
						
						
						
					 | 
					
						2014-09-15 06:32:18 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d235299260
							
						
					 | 
					
						
						
							
							* Few nips and tucks to hash table
						
						
						
						
						
					 | 
					
						2014-09-15 05:03:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e68a431e5e
							
						
					 | 
					
						
						
							
							* Pass only the tokens vector to _tokenize, instead of the whole python object.
						
						
						
						
						
					 | 
					
						2014-09-15 04:01:38 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							08cef75ffd
							
						
					 | 
					
						
						
							
							* Switch to using a heap-allocated vector in tokens
						
						
						
						
						
					 | 
					
						2014-09-15 03:46:14 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f77b7098c0
							
						
					 | 
					
						
						
							
							* Upd Tokens to use vector, with bounds checking.
						
						
						
						
						
					 | 
					
						2014-09-15 03:22:40 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0f6bf2a2ee
							
						
					 | 
					
						
						
							
							* Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector.
						
						
						
						
						
					 | 
					
						2014-09-15 02:08:39 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							df24e3708c
							
						
					 | 
					
						
						
							
							* Move EnglishTokens stuff to Tokens
						
						
						
						
						
					 | 
					
						2014-09-15 01:31:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bd08cb09a2
							
						
					 | 
					
						
						
							
							* Remove short-circuiting of initial_size argument for PointerHash
						
						
						
						
						
					 | 
					
						2014-09-15 01:30:49 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f3393cf57c
							
						
					 | 
					
						
						
							
							* Improve interface for PointerHash
						
						
						
						
						
					 | 
					
						2014-09-13 17:29:58 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							45865be37e
							
						
					 | 
					
						
						
							
							* Switch hash interface, using void* instead of size_t, to avoid casts.
						
						
						
						
						
					 | 
					
						2014-09-13 17:02:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0447279c57
							
						
					 | 
					
						
						
							
							* PointerHash working, efficiency is good. 6-7 mins
						
						
						
						
						
					 | 
					
						2014-09-13 16:43:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							85d68e8e95
							
						
					 | 
					
						
						
							
							* Replaced cache with own hash table. Similar timing
						
						
						
						
						
					 | 
					
						2014-09-13 03:14:43 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c8db76e3e1
							
						
					 | 
					
						
						
							
							* Add initial work on simple hash table
						
						
						
						
						
					 | 
					
						2014-09-13 02:02:41 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							afdc9b7ac2
							
						
					 | 
					
						
						
							
							* More performance fiddling, particularly moving the specials into the cache, so that we can just lookup the cache in _tokenize
						
						
						
						
						
					 | 
					
						2014-09-13 00:59:34 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7d239df4c8
							
						
					 | 
					
						
						
							
							* Fiddle with declarations, for small efficiency boost
						
						
						
						
						
					 | 
					
						2014-09-13 00:31:53 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a8e7cce30f
							
						
					 | 
					
						
						
							
							* Efficiency tweaks
						
						
						
						
						
					 | 
					
						2014-09-13 00:14:05 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							126a8453a5
							
						
					 | 
					
						
						
							
							* Fix performance issues by implementing a better cache. Add own String struct to help
						
						
						
						
						
					 | 
					
						2014-09-12 23:50:37 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9298e36b36
							
						
					 | 
					
						
						
							
							* Move special tokenization into its own lookup table, away from the cache.
						
						
						
						
						
					 | 
					
						2014-09-12 19:43:14 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							985bc68327
							
						
					 | 
					
						
						
							
							* Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation.
						
						
						
						
						
					 | 
					
						2014-09-12 18:26:26 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7eab281194
							
						
					 | 
					
						
						
							
							* Fiddle with token features
						
						
						
						
						
					 | 
					
						2014-09-12 15:49:55 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5aa591106b
							
						
					 | 
					
						
						
							
							* Fiddle with token features
						
						
						
						
						
					 | 
					
						2014-09-12 15:49:36 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1533041885
							
						
					 | 
					
						
						
							
							* Update the split_one method, so that it doesn't need to cast back to a Python object
						
						
						
						
						
					 | 
					
						2014-09-12 05:10:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4817277d66
							
						
					 | 
					
						
						
							
							* Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery.
						
						
						
						
						
					 | 
					
						2014-09-12 04:29:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8b20e9ad97
							
						
					 | 
					
						
						
							
							* Delete unused _split method
						
						
						
						
						
					 | 
					
						2014-09-12 04:03:52 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a4863686ec
							
						
					 | 
					
						
						
							
							* Changed cache to use a linked-list data structure, to take out Python list code. Taking 6-7 mins for gigaword.
						
						
						
						
						
					 | 
					
						2014-09-12 03:30:50 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							51e2006a65
							
						
					 | 
					
						
						
							
							* Increase cache size. Processing now 6-7 mins
						
						
						
						
						
					 | 
					
						2014-09-12 02:52:34 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e096f30161
							
						
					 | 
					
						
						
							
							* Tweak signatures and refactor slightly. Processing gigaword taking 8-9 mins. Tests passing, but some sort of memory bug on exit.
						
						
						
						
						
					 | 
					
						2014-09-12 02:43:36 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							073ee0de63
							
						
					 | 
					
						
						
							
							* Restore dense_hash_map for cache dictionary. Seems to double efficiency
						
						
						
						
						
					 | 
					
						2014-09-12 02:23:51 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3c928fb5e0
							
						
					 | 
					
						
						
							
							* Switch to 64 bit hashes, for better reliability
						
						
						
						
						
					 | 
					
						2014-09-12 02:04:47 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							2389bd1b10
							
						
					 | 
					
						
						
							
							* Improve cache mechanism by including a random element depending on the size of the cache.
						
						
						
						
						
					 | 
					
						2014-09-12 00:19:16 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c8f7c8bfde
							
						
					 | 
					
						
						
							
							* Moving to storing LexemeC structs internally
						
						
						
						
						
					 | 
					
						2014-09-11 21:54:34 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bf9c60c31c
							
						
					 | 
					
						
						
							
							* Moving to storing LexemeC structs internally
						
						
						
						
						
					 | 
					
						2014-09-11 21:44:58 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							563047e90f
							
						
					 | 
					
						
						
							
							* Switch to returning a Tokens object
						
						
						
						
						
					 | 
					
						2014-09-11 21:37:32 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1a3222af4b
							
						
					 | 
					
						
						
							
							* Moving tokens to use an array internally, instead of a list of Lexeme objects.
						
						
						
						
						
					 | 
					
						2014-09-11 16:57:08 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5b1c651661
							
						
					 | 
					
						
						
							
							* Only store LexemeC structs in the vocabulary, transforming them to Lexeme objects for output. Moving away from Lexeme objects for Tokens soon.
						
						
						
						
						
					 | 
					
						2014-09-11 12:28:38 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e567713429
							
						
					 | 
					
						
						
							
							* Moving back to lexeme structs
						
						
						
						
						
					 | 
					
						2014-09-10 20:41:47 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b488224c09
							
						
					 | 
					
						
						
							
							* Restoring Lexeme-as-struct
						
						
						
						
						
					 | 
					
						2014-09-10 20:41:37 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							7c09c73a14
							
						
					 | 
					
						
						
							
							* Refactor to use tokens class.
						
						
						
						
						
					 | 
					
						2014-09-10 18:27:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							cf412adba8
							
						
					 | 
					
						
						
							
							* Refactoring to use Tokens object
						
						
						
						
						
					 | 
					
						2014-09-10 18:11:13 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8fbe9b6f97
							
						
					 | 
					
						
						
							
							* Bug fixes to flag features
						
						
						
						
						
					 | 
					
						2014-09-01 23:41:31 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							151aa14bba
							
						
					 | 
					
						
						
							
							* Add asciify string transform, and other bits.
						
						
						
						
						
					 | 
					
						2014-09-01 23:25:28 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c4ba216642
							
						
					 | 
					
						
						
							
							* Switch canon_case to get value, to avoid keyerror
						
						
						
						
						
					 | 
					
						2014-09-01 17:27:36 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a779275a59
							
						
					 | 
					
						
						
							
							* Add canon_case function
						
						
						
						
						
					 | 
					
						2014-08-30 20:57:43 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8bbfadfced
							
						
					 | 
					
						
						
							
							* Pass tests. Need to implement more feature functions.
						
						
						
						
						
					 | 
					
						2014-08-30 20:36:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							dcab14ede2
							
						
					 | 
					
						
						
							
							* Begin testing more functionality
						
						
						
						
						
					 | 
					
						2014-08-30 19:01:15 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3e3ff99ca0
							
						
					 | 
					
						
						
							
							* Add orth features
						
						
						
						
						
					 | 
					
						2014-08-30 19:01:00 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4e5b2d47e2
							
						
					 | 
					
						
						
							
							* More docs
						
						
						
						
						
					 | 
					
						2014-08-29 03:01:40 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5233f110c4
							
						
					 | 
					
						
						
							
							* Adding PTB3 tokenizer back in, so can understand how much boilerplate is in the docs for multiple tokenizers
						
						
						
						
						
					 | 
					
						2014-08-29 02:30:27 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							45a22d6b2c
							
						
					 | 
					
						
						
							
							* Docs coming together
						
						
						
						
						
					 | 
					
						2014-08-29 01:59:23 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c282e6d5fb
							
						
					 | 
					
						
						
							
							* Redesign proceeding
						
						
						
						
						
					 | 
					
						2014-08-28 19:45:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fd4e61e58b
							
						
					 | 
					
						
						
							
							* Fixed contraction tests. Need to correct problem with the way case stats and tag stats are supposed to work.
						
						
						
						
						
					 | 
					
						2014-08-27 20:22:33 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fdaf24604a
							
						
					 | 
					
						
						
							
							* Basic punct tests updated and passing
						
						
						
						
						
					 | 
					
						2014-08-27 19:38:57 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8d20617dfd
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-08-27 17:16:16 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e9a62b6eba
							
						
					 | 
					
						
						
							
							* Refactoring with Lexeme as a class now compiles. Basic design seems to work
						
						
						
						
						
					 | 
					
						2014-08-27 17:15:39 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							68bae2fec6
							
						
					 | 
					
						
						
							
							* More refactoring
						
						
						
						
						
					 | 
					
						2014-08-25 16:42:22 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							88095666dc
							
						
					 | 
					
						
						
							
							* Remove Lexeme struct, preparing to rename Word to Lexeme.
						
						
						
						
						
					 | 
					
						2014-08-24 19:24:42 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ce59526011
							
						
					 | 
					
						
						
							
							* Add Word classes
						
						
						
						
						
					 | 
					
						2014-08-24 18:14:08 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3b793cf4f7
							
						
					 | 
					
						
						
							
							* Tests passing for new Word object version
						
						
						
						
						
					 | 
					
						2014-08-24 18:13:53 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							9815c7649e
							
						
					 | 
					
						
						
							
							* Refactor around Word objects, adapting tests. Tests passing, except for string views.
						
						
						
						
						
					 | 
					
						2014-08-23 19:55:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4f01df9152
							
						
					 | 
					
						
						
							
							* Moving to Word objects in place of the Lexeme struct.
						
						
						
						
						
					 | 
					
						2014-08-22 17:32:16 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							782806df08
							
						
					 | 
					
						
						
							
							* Moving to Word objects in place of the Lexeme struct.
						
						
						
						
						
					 | 
					
						2014-08-22 17:28:23 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							47fbd0475a
							
						
					 | 
					
						
						
							
							* Replace the use of dense_hash_map with Python dict
						
						
						
						
						
					 | 
					
						2014-08-22 17:13:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e289896603
							
						
					 | 
					
						
						
							
							* Fix ptb3 module
						
						
						
						
						
					 | 
					
						2014-08-22 16:36:17 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							89d6faa9c9
							
						
					 | 
					
						
						
							
							* Move en_ptb to ptb3
						
						
						
						
						
					 | 
					
						2014-08-22 04:24:05 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							07ecf5d2f4
							
						
					 | 
					
						
						
							
							* Fixed group_by, removed idea of general attr_of function.
						
						
						
						
						
					 | 
					
						2014-08-22 00:02:37 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							811b7a6b91
							
						
					 | 
					
						
						
							
							* Struggling with arbitrary attr access...
						
						
						
						
						
					 | 
					
						2014-08-21 23:49:14 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							314658b31c
							
						
					 | 
					
						
						
							
							* Improve module docstring
						
						
						
						
						
					 | 
					
						2014-08-21 18:42:47 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d10993f41a
							
						
					 | 
					
						
						
							
							* More docs work
						
						
						
						
						
					 | 
					
						2014-08-21 16:37:13 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							248cbb6d07
							
						
					 | 
					
						
						
							
							* Update doc strings
						
						
						
						
						
					 | 
					
						2014-08-21 03:29:15 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							76afbd7d69
							
						
					 | 
					
						
						
							
							* Remove compiled orthography file
						
						
						
						
						
					 | 
					
						2014-08-20 17:04:07 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f39dcb1d89
							
						
					 | 
					
						
						
							
							* Add orthography
						
						
						
						
						
					 | 
					
						2014-08-20 17:03:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a78ad4152d
							
						
					 | 
					
						
						
							
							* Broken version being refactored for docs
						
						
						
						
						
					 | 
					
						2014-08-20 13:39:39 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5fddb8d165
							
						
					 | 
					
						
						
							
							* Working refactor, with updated data model for Lexemes
						
						
						
						
						
					 | 
					
						2014-08-19 04:21:20 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							3379d7a571
							
						
					 | 
					
						
						
							
							* Reforming data model for lexemes
						
						
						
						
						
					 | 
					
						2014-08-19 02:40:37 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ab9b0daabf
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-08-18 23:21:49 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							1b71cbfe28
							
						
					 | 
					
						
						
							
							* Roll back to using unicode, and never Py_UNICODE. No dependence on murmurhash either.
						
						
						
						
						
					 | 
					
						2014-08-18 20:48:48 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							bbf9a2c944
							
						
					 | 
					
						
						
							
							* Working version that uses arrays for chunks, which should be more memory efficient
						
						
						
						
						
					 | 
					
						2014-08-18 20:23:54 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							8d3f6082be
							
						
					 | 
					
						
						
							
							* Working version, adding improvements
						
						
						
						
						
					 | 
					
						2014-08-18 19:59:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							01469b0888
							
						
					 | 
					
						
						
							
							* Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word.
						
						
						
						
						
					 | 
					
						2014-08-18 19:14:00 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b94c9b72c9
							
						
					 | 
					
						
						
							
							* WordTree in use. Need to reform the way chunks are handled. Should be properly one Lexeme per word, with split points being the things that are cached.
						
						
						
						
						
					 | 
					
						2014-08-16 20:10:22 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							34b68a18ab
							
						
					 | 
					
						
						
							
							* Progress to getting WordTree working. Tests pass, but so far it's slower.
						
						
						
						
						
					 | 
					
						2014-08-16 19:59:38 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							865cacfaf7
							
						
					 | 
					
						
						
							
							* Remove dependence on murmurhash
						
						
						
						
						
					 | 
					
						2014-08-16 17:37:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							515d41d325
							
						
					 | 
					
						
						
							
							* Restore string saving to spacy
						
						
						
						
						
					 | 
					
						2014-08-16 16:09:24 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							36073b89fe
							
						
					 | 
					
						
						
							
							* Restore unicode, work on improving string storage.
						
						
						
						
						
					 | 
					
						2014-08-16 14:35:34 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a225ca5b0d
							
						
					 | 
					
						
						
							
							* Refactoring tokenizer
						
						
						
						
						
					 | 
					
						2014-08-16 03:22:03 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							213a440ffc
							
						
					 | 
					
						
						
							
							* Add string decode and encode helpers to string_tools
						
						
						
						
						
					 | 
					
						2014-08-15 23:57:27 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f11c8e22eb
							
						
					 | 
					
						
						
							
							* Remove happax stuff
						
						
						
						
						
					 | 
					
						2014-08-02 22:11:28 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d6e07aa922
							
						
					 | 
					
						
						
							
							* Switch to 32bit hash for strings
						
						
						
						
						
					 | 
					
						2014-08-02 21:51:52 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							365a2af756
							
						
					 | 
					
						
						
							
							* Restore happax. Commit uncommitted work
						
						
						
						
						
					 | 
					
						2014-08-02 21:27:03 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6319ff0f22
							
						
					 | 
					
						
						
							
							* Add length property
						
						
						
						
						
					 | 
					
						2014-08-02 21:26:44 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							18fb76b2c4
							
						
					 | 
					
						
						
							
							* Removed happax. Not sure if good idea.
						
						
						
						
						
					 | 
					
						2014-08-02 20:53:35 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							edd38a84b1
							
						
					 | 
					
						
						
							
							* Removing happax stuff. Added length
						
						
						
						
						
					 | 
					
						2014-08-02 20:45:12 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							fc7c10d7f8
							
						
					 | 
					
						
						
							
							* Ugly but seemingly working fix to the token memory leak
						
						
						
						
						
					 | 
					
						2014-08-01 09:43:19 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c7bb6b329c
							
						
					 | 
					
						
						
							
							* Don't free clobbered lexemes, as they might be part of a tail
						
						
						
						
						
					 | 
					
						2014-08-01 08:22:38 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							c48214460e
							
						
					 | 
					
						
						
							
							* Free lexemes clobbered as happaxes
						
						
						
						
						
					 | 
					
						2014-08-01 07:40:20 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5b6457e80e
							
						
					 | 
					
						
						
							
							* Free lexemes clobbered as happaxes
						
						
						
						
						
					 | 
					
						2014-08-01 07:37:50 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d8cb2288ce
							
						
					 | 
					
						
						
							
							* Roll back to using murmurhash2 for now
						
						
						
						
						
					 | 
					
						2014-08-01 07:28:47 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f39211b2b1
							
						
					 | 
					
						
						
							
							* Add FixedTable for hashing
						
						
						
						
						
					 | 
					
						2014-08-01 07:27:21 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a44e15f623
							
						
					 | 
					
						
						
							
							* Hack around lack of distribution features for now.
						
						
						
						
						
					 | 
					
						2014-07-31 18:24:51 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4cb88c940b
							
						
					 | 
					
						
						
							
							* Fix memory leak in tokenizer, caused by having a fixed vocab.
						
						
						
						
						
					 | 
					
						2014-07-31 18:19:38 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							5b81ee716f
							
						
					 | 
					
						
						
							
							* Use a sparse_hash_map to store happax vocab items, with a max size.
						
						
						
						
						
					 | 
					
						2014-07-31 17:40:43 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							b9016c4633
							
						
					 | 
					
						
						
							
							* Switch to using sparsehash and murmurhash libraries out of pip
						
						
						
						
						
					 | 
					
						2014-07-25 15:47:27 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a895fe5ddb
							
						
					 | 
					
						
						
							
							* Upd from spacy
						
						
						
						
						
					 | 
					
						2014-07-23 17:35:18 +01:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							87bf205b82
							
						
					 | 
					
						
						
							
							* Fix open apostrophe bug
						
						
						
						
						
					 | 
					
						2014-07-07 23:26:01 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							571808a274
							
						
					 | 
					
						
						
							
							Group-by seems to be working
						
						
						
						
						
					 | 
					
						2014-07-07 20:27:02 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							80b36f9f27
							
						
					 | 
					
						
						
							
							* 710k words per second for counts
						
						
						
						
						
					 | 
					
						2014-07-07 19:12:19 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							057c21969b
							
						
					 | 
					
						
						
							
							* Refactor for string view features. Working on setting up flags and enums.
						
						
						
						
						
					 | 
					
						2014-07-07 16:58:48 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							f1bcbd4c4e
							
						
					 | 
					
						
						
							
							* Reorganized code to accommodate Tokens class. Need string views before group_by and count_by can be done well.
						
						
						
						
						
					 | 
					
						2014-07-07 12:47:21 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							6668e44961
							
						
					 | 
					
						
						
							
							* Whitespace
						
						
						
						
						
					 | 
					
						2014-07-07 08:15:44 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0074ae2fc0
							
						
					 | 
					
						
						
							
							* Switch to dynamically allocating array, based on the document length
						
						
						
						
						
					 | 
					
						2014-07-07 08:05:29 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							ff1869ff07
							
						
					 | 
					
						
						
							
							* Fixed major efficiency problem, from not quite grokking pass by reference in cython c++
						
						
						
						
						
					 | 
					
						2014-07-07 07:36:43 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							0c76143b72
							
						
					 | 
					
						
						
							
							* Give value for assert
						
						
						
						
						
					 | 
					
						2014-07-07 05:10:46 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e244739dfe
							
						
					 | 
					
						
						
							
							* Fix ptb tokenization
						
						
						
						
						
					 | 
					
						2014-07-07 05:10:09 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							dc20500920
							
						
					 | 
					
						
						
							
							* Remove cpp files
						
						
						
						
						
					 | 
					
						2014-07-07 05:09:05 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							25849fc926
							
						
					 | 
					
						
						
							
							* Generalize tokenization rules to capitals
						
						
						
						
						
					 | 
					
						2014-07-07 05:07:21 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							df0458001d
							
						
					 | 
					
						
						
							
							* Begin work on full PTB-compatible English tokenization
						
						
						
						
						
					 | 
					
						2014-07-07 04:29:24 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							d5bef02c72
							
						
					 | 
					
						
						
							
							* Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals
						
						
						
						
						
					 | 
					
						2014-07-07 04:21:06 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							a62c38e1ef
							
						
					 | 
					
						
						
							
							* Working tokenization. en doesn't match PTB perfectly. Need to reorganize before adding more schemes.
						
						
						
						
						
					 | 
					
						2014-07-07 01:15:59 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							4e79446dc2
							
						
					 | 
					
						
						
							
							* Reading in tokenization rules correctly. Passing tests.
						
						
						
						
						
					 | 
					
						2014-07-07 00:02:55 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							72159e7011
							
						
					 | 
					
						
						
							
							* Fixes to tokenization. Now segment sequences of the same punctuation.
						
						
						
						
						
					 | 
					
						2014-07-06 19:28:42 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							e98e97d483
							
						
					 | 
					
						
						
							
							* Possessive test passing
						
						
						
						
						
					 | 
					
						2014-07-06 18:35:55 +02:00 | 
					
					
						
						
							
							
							
						
					 | 
				
			
				
					
						
							
							
								 
								Matthew Honnibal
							
						 
					 | 
					
						
						
						
						
							
						
						
							556f6a18ca
							
						
					 | 
					
						
						
							
							* Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc.
						
						
						
						
						
					 | 
					
						2014-07-05 20:51:42 +02:00 | 
					
					
						
						
							
							
							
						
					 |