ines 
							
						 
					 
					
						
						
						
						
							
						
						
							7c919aeb09 
							
						 
					 
					
						
						
							
							Make sure serializers and deserializers are ordered  
						
						
						
					 
					
						2017-06-03 17:05:09 +02:00 
						 
				 
			
				
					
						
							
							
								ines 
							
						 
					 
					
						
						
						
						
							
						
						
							0153b66a86 
							
						 
					 
					
						
						
							
							Return self in Tokenizer.from_bytes  
						
						
						
					 
					
						2017-06-03 13:26:13 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							0561df2a9d 
							
						 
					 
					
						
						
							
							Fix tokenizer serialization  
						
						
						
					 
					
						2017-05-31 14:12:38 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							e9419072e7 
							
						 
					 
					
						
						
							
							Fix tokenizer serialisation  
						
						
						
					 
					
						2017-05-31 13:43:31 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							66af019d5d 
							
						 
					 
					
						
						
							
							Fix serialization of tokenizer  
						
						
						
					 
					
						2017-05-31 11:43:40 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							a318f0cae1 
							
						 
					 
					
						
						
							
							Add to/from disk/bytes methods for tokenizer  
						
						
						
					 
					
						2017-05-29 12:24:41 +02:00 
						 
				 
			
				
					
						
							
							
								ines 
							
						 
					 
					
						
						
						
						
							
						
						
							c5a653fa48 
							
						 
					 
					
						
						
							
							Update docstrings and API docs for Tokenizer  
						
						
						
					 
					
						2017-05-21 13:18:14 +02:00 
						 
				 
			
				
					
						
							
							
								ines 
							
						 
					 
					
						
						
						
						
							
						
						
							f216422ac5 
							
						 
					 
					
						
						
							
							Remove deprecated load classmethod  
						
						
						
					 
					
						2017-05-21 13:18:01 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							793430aa7a 
							
						 
					 
					
						
						
							
							Get spaCy train command working with neural network  
						
						
						
						
						* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab 
						
					 
					
						2017-05-17 12:04:50 +02:00 
						 
				 
			
				
					
						
							
							
								ines 
							
						 
					 
					
						
						
						
						
							
						
						
							e1efd589c3 
							
						 
					 
					
						
						
							
							Fix json imports and use ujson  
						
						
						
					 
					
						2017-04-15 12:13:34 +02:00 
						 
				 
			
				
					
						
							
							
								ines 
							
						 
					 
					
						
						
						
						
							
						
						
							c05ec4b89a 
							
						 
					 
					
						
						
							
							Add compat functions and remove old workarounds  
						
						
						
						
						Add ensure_path util function to handle checking instance of path 
						
					 
					
						2017-04-15 12:11:16 +02:00 
						 
				 
			
				
					
						
							
							
								ines 
							
						 
					 
					
						
						
						
						
							
						
						
							d24589aa72 
							
						 
					 
					
						
						
							
							Clean up imports, unused code, whitespace, docstrings  
						
						
						
					 
					
						2017-04-15 12:05:47 +02:00 
						 
				 
			
				
					
						
							
							
								ines 
							
						 
					 
					
						
						
						
						
							
						
						
							561f2a3eb4 
							
						 
					 
					
						
						
							
							Use consistent formatting for docstrings  
						
						
						
					 
					
						2017-04-15 11:59:21 +02:00 
						 
				 
			
				
					
						
							
							
								Raphaël Bournhonesque 
							
						 
					 
					
						
						
						
						
							
						
						
							f332bf05be 
							
						 
					 
					
						
						
							
							Remove unused import statements  
						
						
						
					 
					
						2017-03-21 21:08:54 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							0ac3d27689 
							
						 
					 
					
						
						
							
							Fix handling of trailing whitespace  
						
						
						
						
						Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
						
					 
					
						2017-03-08 15:01:40 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							0a6d7ca200 
							
						 
					 
					
						
						
							
							Fix spacing after token_match  
						
						
						
						
						The boolean flag indicating a space after the token was
being set incorrectly after the token_match regex was applied.
Fixes #859.
						
					 
					
						2017-03-08 14:33:32 +01:00 
						 
				 
			
				
					
						
							
							
								Raphaël Bournhonesque 
							
						 
					 
					
						
						
						
						
							
						
						
							dce8f5515e 
							
						 
					 
					
						
						
							
							Allow zero-width 'infix' token  
						
						
						
					 
					
						2017-01-23 18:28:01 +01:00 
						 
				 
			
				
					
						
							
							
								Ines Montani 
							
						 
					 
					
						
						
						
						
							
						
						
							aa876884f0 
							
						 
					 
					
						
						
							
							Revert "Revert "Merge remote-tracking branch 'origin/master'""  
						
						
						
						
						This reverts commit fb9d3bb022 
						
					 
					
						2017-01-09 13:28:13 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							a36353df47 
							
						 
					 
					
						
						
							
							Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.  
						
						
						
					 
					
						2016-11-04 19:18:07 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							e0c9695615 
							
						 
					 
					
						
						
							
							Fix doc strings for tokenizer  
						
						
						
					 
					
						2016-11-02 23:15:39 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							e9e6fce576 
							
						 
					 
					
						
						
							
							Handle null prefix/suffix/infix search in tokenizer  
						
						
						
					 
					
						2016-11-02 20:35:48 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							8ce8803824 
							
						 
					 
					
						
						
							
							Fix JSON in tokenizer  
						
						
						
					 
					
						2016-10-21 01:44:20 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							95aaea0d3f 
							
						 
					 
					
						
						
							
							Refactor so that the tokenizer data is read from Python data, rather than from disk  
						
						
						
					 
					
						2016-09-25 14:49:53 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							fd65cf6cbb 
							
						 
					 
					
						
						
							
							Finish refactoring data loading  
						
						
						
					 
					
						2016-09-24 20:26:17 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							83e364188c 
							
						 
					 
					
						
						
							
							Mostly finished loading refactoring. Design is in place, but doesn't work yet.  
						
						
						
					 
					
						2016-09-24 15:42:01 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							cc8bf62208 
							
						 
					 
					
						
						
							
							* Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.
						
						
						
					 
					
						2016-05-09 13:23:47 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							519366f677 
							
						 
					 
					
						
						
							
							* Fix Issue #351: Indices off when leading whitespace
						
						
						
					 
					
						2016-05-04 15:53:36 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							04d0209be9 
							
						 
					 
					
						
						
							
							* Recognise multiple infixes in a token.  
						
						
						
					 
					
						2016-04-13 18:38:26 +10:00 
						 
				 
			
				
					
						
							
							
								Henning Peters 
							
						 
					 
					
						
						
						
						
							
						
						
							b8f63071eb 
							
						 
					 
					
						
						
							
							add lang registration facility  
						
						
						
					 
					
						2016-03-25 18:54:45 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							141639ea3a 
							
						 
					 
					
						
						
							
							* Fix bug in tokenizer that caused new tokens to be added for affixes  
						
						
						
					 
					
						2016-02-21 23:17:47 +00:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							f9e765cae7 
							
						 
					 
					
						
						
							
							* Add pipe() method to tokenizer  
						
						
						
					 
					
						2016-02-03 02:32:37 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							3e9961d2c4 
							
						 
					 
					
						
						
							
							* If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154
						
						
						
					 
					
						2016-01-16 17:08:59 +01:00 
						 
				 
			
				
					
						
							
							
								Henning Peters 
							
						 
					 
					
						
						
						
						
							
						
						
							235f094534 
							
						 
					 
					
						
						
							
							untangle data_path/via  
						
						
						
					 
					
						2016-01-16 12:23:45 +01:00 
						 
				 
			
				
					
						
							
							
								Henning Peters 
							
						 
					 
					
						
						
						
						
							
						
						
							846fa49b2a 
							
						 
					 
					
						
						
							
							distinct load() and from_package() methods  
						
						
						
					 
					
						2016-01-16 10:00:57 +01:00 
						 
				 
			
				
					
						
							
							
								Henning Peters 
							
						 
					 
					
						
						
						
						
							
						
						
							788f734513 
							
						 
					 
					
						
						
							
							refactored data_dir->via, add zip_safe, add spacy.load()  
						
						
						
					 
					
						2016-01-15 18:01:02 +01:00 
						 
				 
			
				
					
						
							
							
								Henning Peters 
							
						 
					 
					
						
						
						
						
							
						
						
							bc229790ac 
							
						 
					 
					
						
						
							
							integrate with sputnik  
						
						
						
					 
					
						2016-01-13 19:46:17 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							a6ba43ecaf 
							
						 
					 
					
						
						
							
							* Fix errors in packaging revision  
						
						
						
					 
					
						2015-12-29 18:37:26 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							aec130af56 
							
						 
					 
					
						
						
							
							Use util.Package class for io  
						
						
						
						
						Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().
Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download 
						
					 
					
						2015-12-29 18:00:48 +01:00 
						 
				 
			
				
					
						
							
							
								Henning Peters 
							
						 
					 
					
						
						
						
						
							
						
						
							9027cef3bc 
							
						 
					 
					
						
						
							
							access model via sputnik  
						
						
						
					 
					
						2015-12-07 06:01:28 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							68f479e821 
							
						 
					 
					
						
						
							
							* Rename Doc.data to Doc.c  
						
						
						
					 
					
						2015-11-04 00:15:14 +11:00 
						 
				 
			
				
					
						
							
							
								Chris DuBois 
							
						 
					 
					
						
						
						
						
							
						
						
							dac8fe7bdb 
							
						 
					 
					
						
						
							
							Add __reduce__ to Tokenizer so that English pickles.  
						
						
						
						
						- Add tests to test_pickle and test_tokenizer that save to tempfiles. 
						
					 
					
						2015-10-23 22:24:03 -07:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							3ba66f2dc7 
							
						 
					 
					
						
						
							
							* Add string length cap in Tokenizer.__call__  
						
						
						
					 
					
						2015-10-16 04:54:16 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							c2307fa9ee 
							
						 
					 
					
						
						
							
							* More work on language-generic parsing  
						
						
						
					 
					
						2015-08-28 02:02:33 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							119c0f8c3f 
							
						 
					 
					
						
						
							
							* Hack out morphology stuff from tokenizer, while morphology being reimplemented.  
						
						
						
					 
					
						2015-08-26 19:20:11 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							9c4d0aae62 
							
						 
					 
					
						
						
							
							* Switch to better Python2/3 compatible unicode handling  
						
						
						
					 
					
						2015-07-28 14:45:37 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							0c507bd80a 
							
						 
					 
					
						
						
							
							* Fix tokenizer  
						
						
						
					 
					
						2015-07-22 14:10:30 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							2fc66e3723 
							
						 
					 
					
						
						
							
							* Use Py_UNICODE in tokenizer for now, while sort out Py_UCS4 stuff  
						
						
						
					 
					
						2015-07-22 13:38:45 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							109106a949 
							
						 
					 
					
						
						
							
							* Replace UniStr, using unicode objects instead  
						
						
						
					 
					
						2015-07-22 04:52:05 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							e49c7f1478 
							
						 
					 
					
						
						
							
							* Update oov check in tokenizer  
						
						
						
					 
					
						2015-07-18 22:45:28 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							cfd842769e 
							
						 
					 
					
						
						
							
							* Allow infix tokens to be variable length  
						
						
						
					 
					
						2015-07-18 22:45:00 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							3b5baa660f 
							
						 
					 
					
						
						
							
							* Fix tokenizer  
						
						
						
					 
					
						2015-07-14 00:10:51 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							24d6ce99ec 
							
						 
					 
					
						
						
							
							* Add comment to tokenizer, explaining the spacy attr  
						
						
						
					 
					
						2015-07-13 22:29:13 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							67641f3b58 
							
						 
					 
					
						
						
							
							* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string  
						
						
						
					 
					
						2015-07-13 21:46:02 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							6eef0bf9ab 
							
						 
					 
					
						
						
							
							* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx  
						
						
						
					 
					
						2015-07-13 20:20:58 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							bb522496dd 
							
						 
					 
					
						
						
							
							* Rename Tokens to Doc  
						
						
						
					 
					
						2015-07-08 18:53:00 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							935bcdf3e5 
							
						 
					 
					
						
						
							
							* Remove redundant tag_names argument to Tokenizer  
						
						
						
					 
					
						2015-07-08 12:36:04 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							2d0e99a096 
							
						 
					 
					
						
						
							
							* Pass pos_tags into Tokenizer.from_dir  
						
						
						
					 
					
						2015-07-07 14:23:08 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							6788c86b2f 
							
						 
					 
					
						
						
							
							* Begin refactor  
						
						
						
					 
					
						2015-07-07 14:00:07 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							98cfd84123 
							
						 
					 
					
						
						
							
							* Remove hyphenation from main tokenizer loop: do it in infix.txt instead. This lets emoticons work  
						
						
						
					 
					
						2015-06-06 05:57:03 +02:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							20f1d868a3 
							
						 
					 
					
						
						
							
							* Tmp commit. Working on whole document parsing  
						
						
						
					 
					
						2015-05-24 02:49:56 +02:00 
						 
				 
			
				
					
						
							
							
								Jordan Suchow 
							
						 
					 
					
						
						
						
						
							
						
						
							3a8d9b37a6 
							
						 
					 
					
						
						
							
							Remove trailing whitespace  
						
						
						
					 
					
						2015-04-19 13:01:38 -07:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							f02c39dfaf 
							
						 
					 
					
						
						
							
							* Compare to is not None, for more robustness  
						
						
						
					 
					
						2015-03-26 16:44:48 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							7237c805c7 
							
						 
					 
					
						
						
							
							* Load tag for specials.json token  
						
						
						
					 
					
						2015-03-26 16:44:46 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							0492cee8b4 
							
						 
					 
					
						
						
							
							* Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens
						
						
						
					 
					
						2015-02-08 18:30:30 -05:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							4ff180db74 
							
						 
					 
					
						
						
							
							* Fix off-by-one error in commit 0a7fceb
						
						
						
					 
					
						2015-01-30 12:49:33 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							0a7fcebdf7 
							
						 
					 
					
						
						
							
							* Fix Issue #12: Incorrect token.idx calculations for some punctuation, in the presence of token cache
						
						
						
					 
					
						2015-01-30 12:33:38 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							5928d158ce 
							
						 
					 
					
						
						
							
							* Pass the string to Tokens  
						
						
						
					 
					
						2015-01-22 02:04:58 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							6c7e44140b 
							
						 
					 
					
						
						
							
							* Work on word vectors, and other stuff  
						
						
						
					 
					
						2015-01-17 16:21:17 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							ce2edd6312 
							
						 
					 
					
						
						
							
							* Tmp commit. Refactoring to create a Python Lexeme class.  
						
						
						
					 
					
						2015-01-12 10:26:22 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							3f1944d688 
							
						 
					 
					
						
						
							
							* Make PyPy work  
						
						
						
					 
					
						2015-01-05 17:54:38 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							9976aa976e 
							
						 
					 
					
						
						
							
							* Messily fix morphology and POS tags on special tokens.  
						
						
						
					 
					
						2014-12-30 23:24:37 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							4c4aa2c5c9 
							
						 
					 
					
						
						
							
							* Work on train  
						
						
						
					 
					
						2014-12-22 07:25:43 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							e1c1a4b868 
							
						 
					 
					
						
						
							
							* Tmp  
						
						
						
					 
					
						2014-12-21 05:36:29 +11:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							be1bdcbd85 
							
						 
					 
					
						
						
							
							* Move lang.pyx to tokenizer.pyx  
						
						
						
					 
					
						2014-12-20 07:55:40 +11:00