Matthew Honnibal | 6bc0f4d29f | Merge pull request #1611 from fsonntag/master. Solving #1494 | 2017-11-29 23:11:23 +01:00 |
Felix Sonntag | 724ae7dc55 | Fixed issue of infix capturing prefixes | 2017-11-28 17:17:12 +01:00 |
Matthew Honnibal | 542e6fd4ea | Don't remove entries from specials | 2017-11-23 12:17:42 +00:00 |
Felix Sonntag | 33b0f86de3 | Changed tokenizer to add infix when infix_start is offset | 2017-11-19 16:32:10 +01:00 |
Roman Domrachev | 61d28d03e4 | Try again to do selective remove cache | 2017-11-15 19:11:12 +03:00 |
Roman Domrachev | b3311100c7 | Merge branch 'master' of github.com:explosion/spaCy | 2017-11-15 18:30:04 +03:00 |
Roman Domrachev | 505c6a2f2f | Completely clean up tokenizer cache. Tokenizer cache can have keys other than strings. That modification can slow down the tokenizer and needs to be measured | 2017-11-15 17:55:48 +03:00 |
Matthew Honnibal | fe3c42a06b | Fix caching in tokenizer | 2017-11-15 13:55:46 +01:00 |
Roman Domrachev | 91e2fa6561 | Clean all caches | 2017-11-14 21:15:04 +03:00 |
Daniel Hershcovich | d7ae54ff44 | Fix typo in message | 2017-11-08 16:06:28 +02:00 |
ines | 9659391944 | Update deprecated methods and add warnings | 2017-11-01 16:49:42 +01:00 |
ines | d96e72f656 | Tidy up rest | 2017-10-27 21:07:59 +02:00 |
ines | 72497c8cb2 | Remove comments and add TODO | 2017-10-25 12:15:43 +02:00 |
Matthew Honnibal | b0f6fd3f1d | Disable tokenizer cache for special-cases. Fixes #1250 | 2017-10-24 16:08:05 +02:00 |
Matthew Honnibal | f45973848c | Rename 'tokens' variable 'doc' in tokenizer | 2017-10-17 18:21:41 +02:00 |
ines | cd6a29dce7 | Port over changes from #1294 | 2017-10-14 13:28:46 +02:00 |
ines | 7c919aeb09 | Make sure serializers and deserializers are ordered | 2017-06-03 17:05:09 +02:00 |
ines | 0153b66a86 | Return self in Tokenizer.from_bytes | 2017-06-03 13:26:13 +02:00 |
Matthew Honnibal | 0561df2a9d | Fix tokenizer serialization | 2017-05-31 14:12:38 +02:00 |
Matthew Honnibal | e9419072e7 | Fix tokenizer serialisation | 2017-05-31 13:43:31 +02:00 |
Matthew Honnibal | 66af019d5d | Fix serialization of tokenizer | 2017-05-31 11:43:40 +02:00 |
Matthew Honnibal | a318f0cae1 | Add to/from disk/bytes methods for tokenizer | 2017-05-29 12:24:41 +02:00 |
ines | c5a653fa48 | Update docstrings and API docs for Tokenizer | 2017-05-21 13:18:14 +02:00 |
ines | f216422ac5 | Remove deprecated load classmethod | 2017-05-21 13:18:01 +02:00 |
Matthew Honnibal | 793430aa7a | Get spaCy train command working with neural network: integrate models into pipeline; add basic serialization (maybe incorrect); fix pickle on vocab | 2017-05-17 12:04:50 +02:00 |
ines | e1efd589c3 | Fix json imports and use ujson | 2017-04-15 12:13:34 +02:00 |
ines | c05ec4b89a | Add compat functions and remove old workarounds. Add ensure_path util function to handle checking instance of path | 2017-04-15 12:11:16 +02:00 |
ines | d24589aa72 | Clean up imports, unused code, whitespace, docstrings | 2017-04-15 12:05:47 +02:00 |
ines | 561f2a3eb4 | Use consistent formatting for docstrings | 2017-04-15 11:59:21 +02:00 |
Raphaël Bournhonesque | f332bf05be | Remove unused import statements | 2017-03-21 21:08:54 +01:00 |
Matthew Honnibal | 0ac3d27689 | Fix handling of trailing whitespace. Fix off-by-one error that meant trailing spaces were being dropped. Closes #792 | 2017-03-08 15:01:40 +01:00 |
Matthew Honnibal | 0a6d7ca200 | Fix spacing after token_match. The boolean flag indicating a space after the token was being set incorrectly after the token_match regex was applied. Fixes #859. | 2017-03-08 14:33:32 +01:00 |
Raphaël Bournhonesque | dce8f5515e | Allow zero-width 'infix' token | 2017-01-23 18:28:01 +01:00 |
Ines Montani | aa876884f0 | Revert "Revert "Merge remote-tracking branch 'origin/master'"". This reverts commit fb9d3bb022. | 2017-01-09 13:28:13 +01:00 |
Matthew Honnibal | a36353df47 | Temporarily put back the tokenize_from_strings method, while tests aren't updated yet. | 2016-11-04 19:18:07 +01:00 |
Matthew Honnibal | e0c9695615 | Fix doc strings for tokenizer | 2016-11-02 23:15:39 +01:00 |
Matthew Honnibal | e9e6fce576 | Handle null prefix/suffix/infix search in tokenizer | 2016-11-02 20:35:48 +01:00 |
Matthew Honnibal | 8ce8803824 | Fix JSON in tokenizer | 2016-10-21 01:44:20 +02:00 |
Matthew Honnibal | 95aaea0d3f | Refactor so that the tokenizer data is read from Python data, rather than from disk | 2016-09-25 14:49:53 +02:00 |
Matthew Honnibal | fd65cf6cbb | Finish refactoring data loading | 2016-09-24 20:26:17 +02:00 |
Matthew Honnibal | 83e364188c | Mostly finished loading refactoring. Design is in place, but doesn't work yet. | 2016-09-24 15:42:01 +02:00 |
Matthew Honnibal | cc8bf62208 | * Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens. | 2016-05-09 13:23:47 +02:00 |
Matthew Honnibal | 519366f677 | * Fix Issue #351: Indices off when leading whitespace | 2016-05-04 15:53:36 +02:00 |
Matthew Honnibal | 04d0209be9 | * Recognise multiple infixes in a token. | 2016-04-13 18:38:26 +10:00 |
Henning Peters | b8f63071eb | add lang registration facility | 2016-03-25 18:54:45 +01:00 |
Matthew Honnibal | 141639ea3a | * Fix bug in tokenizer that caused new tokens to be added for affixes | 2016-02-21 23:17:47 +00:00 |
Matthew Honnibal | f9e765cae7 | * Add pipe() method to tokenizer | 2016-02-03 02:32:37 +01:00 |
Matthew Honnibal | 3e9961d2c4 | * If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154 | 2016-01-16 17:08:59 +01:00 |
Henning Peters | 235f094534 | untangle data_path/via | 2016-01-16 12:23:45 +01:00 |
Henning Peters | 846fa49b2a | distinct load() and from_package() methods | 2016-01-16 10:00:57 +01:00 |