f9e765cae7  Matthew Honnibal  2016-02-03 02:32:37 +01:00
    * Add pipe() method to tokenizer
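A pipe() method lets the tokenizer consume a stream of texts lazily instead of being called once per text at the call site. A minimal sketch of that generator pattern, assuming a callable tokenizer and an illustrative batch_size default (not the actual spaCy implementation):

```python
# Sketch of the streaming idea behind Tokenizer.pipe(); names and the
# batch_size default are assumptions for illustration only.
def pipe_texts(tokenizer, texts, batch_size=1000):
    """Yield one Doc-like object per input text, tokenizing in batches."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) >= batch_size:
            for doc in [tokenizer(t) for t in batch]:
                yield doc
            batch = []
    # Flush the final partial batch.
    for doc in [tokenizer(t) for t in batch]:
        yield doc
```

In spaCy this is exposed as a method, so usage looks roughly like `docs = nlp.tokenizer.pipe(texts)`; the function above only illustrates the buffering and yielding behaviour.
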
3e9961d2c4  Matthew Honnibal  2016-01-16 17:08:59 +01:00
    * If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154

235f094534  Henning Peters  2016-01-16 12:23:45 +01:00
    untangle data_path/via

846fa49b2a  Henning Peters  2016-01-16 10:00:57 +01:00
    distinct load() and from_package() methods

788f734513  Henning Peters  2016-01-15 18:01:02 +01:00
    refactored data_dir->via, add zip_safe, add spacy.load()

bc229790ac  Henning Peters  2016-01-13 19:46:17 +01:00
    integrate with sputnik

a6ba43ecaf  Matthew Honnibal  2015-12-29 18:37:26 +01:00
    * Fix errors in packaging revision
aec130af56  Matthew Honnibal  2015-12-29 18:00:48 +01:00
    Use util.Package class for io

    The previous Sputnik integration caused an API change: Vocab, Tagger, etc.
    were loaded via a from_package classmethod that required a sputnik.Package
    instance. This forced users to first create a sputnik.Sputnik() instance in
    order to acquire a Package via sp.pool().

    Instead I've created a small file-system shim, util.Package, which allows
    classes to have a .load() classmethod that accepts either util.Package
    objects or strings. We can later gut its internals and make it a proxy for
    Sputnik if we need more functionality that should live in the Sputnik
    library.

    Sputnik is now only used to download and install the data, in
    spacy.en.download.
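A sketch of the loading pattern this message describes: a .load() classmethod that accepts either a util.Package-style object or a plain path string and normalizes to the former. The class and method names mirror the commit's wording, but the bodies are illustrative assumptions, not the actual spaCy code:

```python
from os import path


class Package(object):
    """File-system shim standing in for a sputnik.Package (sketch only)."""
    def __init__(self, data_dir):
        self.data_dir = data_dir

    def file_path(self, *parts):
        return path.join(self.data_dir, *parts)

    def has_file(self, *parts):
        return path.exists(self.file_path(*parts))


class Vocab(object):
    @classmethod
    def load(cls, via):
        # Accept either a Package object or a plain directory string, so
        # callers never have to touch Sputnik directly.
        if isinstance(via, str):
            via = Package(via)
        # ... read strings, lexemes, vectors, etc. from via.file_path(...)
        return cls()
```

Keeping the shim's interface small means it can later be turned into a thin proxy over Sputnik without changing callers, which is the design intent stated in the commit body.
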
9027cef3bc  Henning Peters  2015-12-07 06:01:28 +01:00
    access model via sputnik

68f479e821  Matthew Honnibal  2015-11-04 00:15:14 +11:00
    * Rename Doc.data to Doc.c

dac8fe7bdb  Chris DuBois  2015-10-23 22:24:03 -07:00
    Add __reduce__ to Tokenizer so that English pickles.
    - Add tests to test_pickle and test_tokenizer that save to tempfiles.
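Extension types like the Cython Tokenizer aren't picklable by default; defining __reduce__ tells pickle how to rebuild the object from picklable pieces. A hedged sketch of that pattern, with constructor arguments assumed for illustration rather than taken from the real signature:

```python
import pickle


class Tokenizer(object):
    def __init__(self, vocab, rules, prefix_search, suffix_search, infix_finditer):
        self.vocab = vocab
        self.rules = rules
        self.prefix_search = prefix_search
        self.suffix_search = suffix_search
        self.infix_finditer = infix_finditer

    def __reduce__(self):
        # Return (callable, args): on unpickling, pickle calls Tokenizer(*args).
        args = (self.vocab, self.rules, self.prefix_search,
                self.suffix_search, self.infix_finditer)
        return (Tokenizer, args)


if __name__ == "__main__":
    tok = Tokenizer(vocab={}, rules={}, prefix_search=None,
                    suffix_search=None, infix_finditer=None)
    tok2 = pickle.loads(pickle.dumps(tok))  # reconstructed via __reduce__
```

Once the tokenizer round-trips through pickle, pickling the English pipeline that contains it becomes possible, which is what the commit title refers to.
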
3ba66f2dc7  Matthew Honnibal  2015-10-16 04:54:16 +11:00
    * Add string length cap in Tokenizer.__call__
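A length cap rejects pathologically long inputs before tokenization starts. A sketch of such a guard, with the limit, message, and function name all assumed for illustration rather than taken from the tokenizer itself:

```python
# Illustrative guard at the top of a __call__-style entry point; the exact
# limit and error text in spaCy differ, and this is not the real code.
MAX_LENGTH = 2 ** 30  # assumed cap on input string length


def tokenize(string):
    if len(string) >= MAX_LENGTH:
        raise ValueError(
            "String is too long: %d characters. Max is 2**30." % len(string)
        )
    # ... proceed with normal tokenization ...
    return string.split()
```
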
c2307fa9ee  Matthew Honnibal  2015-08-28 02:02:33 +02:00
    * More work on language-generic parsing

119c0f8c3f  Matthew Honnibal  2015-08-26 19:20:11 +02:00
    * Hack out morphology stuff from tokenizer, while morphology being reimplemented.

9c4d0aae62  Matthew Honnibal  2015-07-28 14:45:37 +02:00
    * Switch to better Python2/3 compatible unicode handling

0c507bd80a  Matthew Honnibal  2015-07-22 14:10:30 +02:00
    * Fix tokenizer

2fc66e3723  Matthew Honnibal  2015-07-22 13:38:45 +02:00
    * Use Py_UNICODE in tokenizer for now, while sort out Py_UCS4 stuff

109106a949  Matthew Honnibal  2015-07-22 04:52:05 +02:00
    * Replace UniStr, using unicode objects instead

e49c7f1478  Matthew Honnibal  2015-07-18 22:45:28 +02:00
    * Update oov check in tokenizer

cfd842769e  Matthew Honnibal  2015-07-18 22:45:00 +02:00
    * Allow infix tokens to be variable length

3b5baa660f  Matthew Honnibal  2015-07-14 00:10:51 +02:00
    * Fix tokenizer

24d6ce99ec  Matthew Honnibal  2015-07-13 22:29:13 +02:00
    * Add comment to tokenizer, explaining the spacy attr

67641f3b58  Matthew Honnibal  2015-07-13 21:46:02 +02:00
    * Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string

6eef0bf9ab  Matthew Honnibal  2015-07-13 20:20:58 +02:00
    * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx

bb522496dd  Matthew Honnibal  2015-07-08 18:53:00 +02:00
    * Rename Tokens to Doc

935bcdf3e5  Matthew Honnibal  2015-07-08 12:36:04 +02:00
    * Remove redundant tag_names argument to Tokenizer

2d0e99a096  Matthew Honnibal  2015-07-07 14:23:08 +02:00
    * Pass pos_tags into Tokenizer.from_dir

6788c86b2f  Matthew Honnibal  2015-07-07 14:00:07 +02:00
    * Begin refactor

98cfd84123  Matthew Honnibal  2015-06-06 05:57:03 +02:00
    * Remove hyphenation from main tokenizer loop: do it in infix.txt instead. This lets emoticons work

20f1d868a3  Matthew Honnibal  2015-05-24 02:49:56 +02:00
    * Tmp commit. Working on whole document parsing

3a8d9b37a6  Jordan Suchow  2015-04-19 13:01:38 -07:00
    Remove trailing whitespace

f02c39dfaf  Matthew Honnibal  2015-03-26 16:44:48 +01:00
    * Compare to is not None, for more robustness

7237c805c7  Matthew Honnibal  2015-03-26 16:44:46 +01:00
    * Load tag for specials.json token

0492cee8b4  Matthew Honnibal  2015-02-08 18:30:30 -05:00
    * Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens

4ff180db74  Matthew Honnibal  2015-01-30 12:49:33 +11:00
    * Fix off-by-one error in commit 0a7fceb

0a7fcebdf7  Matthew Honnibal  2015-01-30 12:33:38 +11:00
    * Fix Issue #12: Incorrect token.idx calculations for some punctuation, in the presence of token cache

5928d158ce  Matthew Honnibal  2015-01-22 02:04:58 +11:00
    * Pass the string to Tokens

6c7e44140b  Matthew Honnibal  2015-01-17 16:21:17 +11:00
    * Work on word vectors, and other stuff

ce2edd6312  Matthew Honnibal  2015-01-12 10:26:22 +11:00
    * Tmp commit. Refactoring to create a Python Lexeme class.

3f1944d688  Matthew Honnibal  2015-01-05 17:54:38 +11:00
    * Make PyPy work

9976aa976e  Matthew Honnibal  2014-12-30 23:24:37 +11:00
    * Messily fix morphology and POS tags on special tokens.

4c4aa2c5c9  Matthew Honnibal  2014-12-22 07:25:43 +11:00
    * Work on train

e1c1a4b868  Matthew Honnibal  2014-12-21 05:36:29 +11:00
    * Tmp

be1bdcbd85  Matthew Honnibal  2014-12-20 07:55:40 +11:00
    * Move lang.pyx to tokenizer.pyx