| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Matthew Honnibal | b962fe73d7 | * Make suffixes file use full-power regex, so that we can handle periods properly | 2014-12-09 19:04:27 +11:00 |
| Matthew Honnibal | 1ccabc806e | * Work on lemmatization | 2014-12-09 16:06:18 +11:00 |
| Matthew Honnibal | 677e111ee7 | * Revise tokenization rules to match PTB. Rules are pretty messy around periods, need better support for these. | 2014-12-07 22:04:47 +11:00 |
| Matthew Honnibal | da70b6bd60 | * Upd tokenization special-cases | 2014-11-11 22:10:15 +11:00 |
| Matthew Honnibal | bea762ec04 | * Update tokenization rules | 2014-11-04 01:06:00 +11:00 |
| Matthew Honnibal | 75329e9ef8 | * Add Co. abbreviation to tokenization rules | 2014-11-03 00:16:20 +11:00 |
| Matthew Honnibal | fa91506073 | * Add '' double quote to suffixes file | 2014-11-03 00:12:59 +11:00 |
| Matthew Honnibal | 11e42fd070 | * Add emoticons to tokenization | 2014-11-01 15:14:55 +11:00 |
| Matthew Honnibal | 39743323ea | * Add i'ma to tokenization rules | 2014-10-31 17:45:44 +11:00 |
| Matthew Honnibal | 849de654e7 | * Add file for infix patterns | 2014-10-14 20:26:43 +11:00 |
| Matthew Honnibal | 5abb194553 | * Add semi-colon to suffix punct | 2014-10-14 10:43:45 +11:00 |
| Matthew Honnibal | c4cd3bc57a | * Add prefix and suffix data files | 2014-09-25 18:24:52 +02:00 |
| Matthew Honnibal | 143e51ec73 | * Refactor tokenization, splitting it into a clearer life-cycle. | 2014-09-16 13:16:02 +02:00 |
| Matthew Honnibal | 6fc06bfe2f | * Hack a hard-cased unit in to get a test to pass | 2014-09-15 06:31:35 +02:00 |
| Matthew Honnibal | 3b793cf4f7 | * Tests passing for new Word object version | 2014-08-24 18:13:53 +02:00 |
| Matthew Honnibal | a22101404a | * Move en_ptb data | 2014-08-22 04:28:51 +02:00 |
| Matthew Honnibal | a2047fa5aa | * Add 's suffix to tokenization table | 2014-08-18 23:21:37 +02:00 |
| Matthew Honnibal | cc3971ce5c | * Fix error in tokenization rules | 2014-07-07 05:09:34 +02:00 |
| Matthew Honnibal | 997551241f | * Upd ptb tokenization rules | 2014-07-07 05:09:22 +02:00 |
| Matthew Honnibal | df0458001d | * Begin work on full PTB-compatible English tokenization | 2014-07-07 04:29:24 +02:00 |
| Matthew Honnibal | d5bef02c72 | * Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals | 2014-07-07 04:21:06 +02:00 |