Matthew Honnibal
|
6bc0f4d29f
|
Merge pull request #1611 from fsonntag/master
Solving #1494
|
2017-11-29 23:11:23 +01:00 |
|
Felix Sonntag
|
724ae7dc55
|
Fixed issue of infix capturing prefixes
|
2017-11-28 17:17:12 +01:00 |
|
Matthew Honnibal
|
542e6fd4ea
|
Don't remove entries from specials
|
2017-11-23 12:17:42 +00:00 |
|
Felix Sonntag
|
33b0f86de3
|
Changed tokenizer to add infix when infix_start is offset
|
2017-11-19 16:32:10 +01:00 |
|
Roman Domrachev
|
61d28d03e4
|
Try again to selectively remove cache
|
2017-11-15 19:11:12 +03:00 |
|
Roman Domrachev
|
b3311100c7
|
Merge branch 'master' of github.com:explosion/spaCy
|
2017-11-15 18:30:04 +03:00 |
|
Roman Domrachev
|
505c6a2f2f
|
Completely cleanup tokenizer cache
Tokenizer cache can have different keys than string
That modification can slow down the tokenizer and needs to be measured
|
2017-11-15 17:55:48 +03:00 |
|
Matthew Honnibal
|
fe3c42a06b
|
Fix caching in tokenizer
|
2017-11-15 13:55:46 +01:00 |
|
Roman Domrachev
|
91e2fa6561
|
Clean all caches
|
2017-11-14 21:15:04 +03:00 |
|
Daniel Hershcovich
|
d7ae54ff44
|
Fix typo in message
|
2017-11-08 16:06:28 +02:00 |
|
ines
|
9659391944
|
Update deprecated methods and add warnings
|
2017-11-01 16:49:42 +01:00 |
|
ines
|
d96e72f656
|
Tidy up rest
|
2017-10-27 21:07:59 +02:00 |
|
ines
|
72497c8cb2
|
Remove comments and add TODO
|
2017-10-25 12:15:43 +02:00 |
|
Matthew Honnibal
|
b0f6fd3f1d
|
Disable tokenizer cache for special-cases. Fixes #1250
|
2017-10-24 16:08:05 +02:00 |
|
Matthew Honnibal
|
f45973848c
|
Rename 'tokens' variable 'doc' in tokenizer
|
2017-10-17 18:21:41 +02:00 |
|
ines
|
cd6a29dce7
|
Port over changes from #1294
|
2017-10-14 13:28:46 +02:00 |
|
ines
|
7c919aeb09
|
Make sure serializers and deserializers are ordered
|
2017-06-03 17:05:09 +02:00 |
|
ines
|
0153b66a86
|
Return self in Tokenizer.from_bytes
|
2017-06-03 13:26:13 +02:00 |
|
Matthew Honnibal
|
0561df2a9d
|
Fix tokenizer serialization
|
2017-05-31 14:12:38 +02:00 |
|
Matthew Honnibal
|
e9419072e7
|
Fix tokenizer serialisation
|
2017-05-31 13:43:31 +02:00 |
|
Matthew Honnibal
|
66af019d5d
|
Fix serialization of tokenizer
|
2017-05-31 11:43:40 +02:00 |
|
Matthew Honnibal
|
a318f0cae1
|
Add to/from disk/bytes methods for tokenizer
|
2017-05-29 12:24:41 +02:00 |
|
ines
|
c5a653fa48
|
Update docstrings and API docs for Tokenizer
|
2017-05-21 13:18:14 +02:00 |
|
ines
|
f216422ac5
|
Remove deprecated load classmethod
|
2017-05-21 13:18:01 +02:00 |
|
Matthew Honnibal
|
793430aa7a
|
Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
|
2017-05-17 12:04:50 +02:00 |
|
ines
|
e1efd589c3
|
Fix json imports and use ujson
|
2017-04-15 12:13:34 +02:00 |
|
ines
|
c05ec4b89a
|
Add compat functions and remove old workarounds
Add ensure_path util function to handle checking instance of path
|
2017-04-15 12:11:16 +02:00 |
|
ines
|
d24589aa72
|
Clean up imports, unused code, whitespace, docstrings
|
2017-04-15 12:05:47 +02:00 |
|
ines
|
561f2a3eb4
|
Use consistent formatting for docstrings
|
2017-04-15 11:59:21 +02:00 |
|
Raphaël Bournhonesque
|
f332bf05be
|
Remove unused import statements
|
2017-03-21 21:08:54 +01:00 |
|
Matthew Honnibal
|
0ac3d27689
|
Fix handling of trailing whitespace
Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
|
2017-03-08 15:01:40 +01:00 |
|
Matthew Honnibal
|
0a6d7ca200
|
Fix spacing after token_match
The boolean flag indicating a space after the token was
being set incorrectly after the token_match regex was applied.
Fixes #859.
|
2017-03-08 14:33:32 +01:00 |
|
Raphaël Bournhonesque
|
dce8f5515e
|
Allow zero-width 'infix' token
|
2017-01-23 18:28:01 +01:00 |
|
Ines Montani
|
aa876884f0
|
Revert "Revert "Merge remote-tracking branch 'origin/master'""
This reverts commit fb9d3bb022.
|
2017-01-09 13:28:13 +01:00 |
|
Matthew Honnibal
|
a36353df47
|
Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.
|
2016-11-04 19:18:07 +01:00 |
|
Matthew Honnibal
|
e0c9695615
|
Fix doc strings for tokenizer
|
2016-11-02 23:15:39 +01:00 |
|
Matthew Honnibal
|
e9e6fce576
|
Handle null prefix/suffix/infix search in tokenizer
|
2016-11-02 20:35:48 +01:00 |
|
Matthew Honnibal
|
8ce8803824
|
Fix JSON in tokenizer
|
2016-10-21 01:44:20 +02:00 |
|
Matthew Honnibal
|
95aaea0d3f
|
Refactor so that the tokenizer data is read from Python data, rather than from disk
|
2016-09-25 14:49:53 +02:00 |
|
Matthew Honnibal
|
fd65cf6cbb
|
Finish refactoring data loading
|
2016-09-24 20:26:17 +02:00 |
|
Matthew Honnibal
|
83e364188c
|
Mostly finished loading refactoring. Design is in place, but doesn't work yet.
|
2016-09-24 15:42:01 +02:00 |
|
Matthew Honnibal
|
cc8bf62208
|
* Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.
|
2016-05-09 13:23:47 +02:00 |
|
Matthew Honnibal
|
519366f677
|
* Fix Issue #351: Indices off when leading whitespace
|
2016-05-04 15:53:36 +02:00 |
|
Matthew Honnibal
|
04d0209be9
|
* Recognise multiple infixes in a token.
|
2016-04-13 18:38:26 +10:00 |
|
Henning Peters
|
b8f63071eb
|
add lang registration facility
|
2016-03-25 18:54:45 +01:00 |
|
Matthew Honnibal
|
141639ea3a
|
* Fix bug in tokenizer that caused new tokens to be added for affixes
|
2016-02-21 23:17:47 +00:00 |
|
Matthew Honnibal
|
f9e765cae7
|
* Add pipe() method to tokenizer
|
2016-02-03 02:32:37 +01:00 |
|
Matthew Honnibal
|
3e9961d2c4
|
* If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154
|
2016-01-16 17:08:59 +01:00 |
|
Henning Peters
|
235f094534
|
untangle data_path/via
|
2016-01-16 12:23:45 +01:00 |
|
Henning Peters
|
846fa49b2a
|
distinct load() and from_package() methods
|
2016-01-16 10:00:57 +01:00 |
|