Matthew Honnibal
82277f63a3
💫 Small efficiency fixes to tokenizer (#2587)
...
This patch improves tokenizer speed by about 10% and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` tables indexed different data in v1, but in v2 the orth and the hash are identical.
The patch also fixes an uninitialized variable in the tokenizer: the `has_special` flag, which records whether the chunk we're tokenizing triggers a special-case rule. If it does, we avoid caching within the chunk. Because the flag was uninitialized, some chunks were incorrectly rejected from the cache.
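The caching rule described here can be sketched in plain Python (a hypothetical simplification, not spaCy's actual Cython implementation; `tokenize_chunk` and its arguments are illustrative names):

```python
def tokenize_chunk(chunk, cache, special_cases, split):
    """Tokenize one whitespace-delimited chunk, caching only when safe."""
    if chunk in cache:
        return cache[chunk]
    has_special = False  # must be (re)initialized for every chunk
    if chunk in special_cases:
        has_special = True
        tokens = special_cases[chunk]
    else:
        tokens = split(chunk)
    if not has_special:
        # Safe to cache: no special-case rule fired for this chunk.
        cache[chunk] = tokens
    return tokens
```

If `has_special` is never reset, a value left over from a previous chunk can make this guard reject chunks that were actually safe to cache.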
With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second, up from 465,764 words per second before this patch.
Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to:
* Fix the variable-length lookarounds in the suffix, infix and `token_match` rules
* Improve the performance of the `token_match` regex
* Switch back from the `regex` library to the `re` library.
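The first two points turn on a limitation of the stdlib engine: `re` only supports fixed-width lookbehind, so any variable-length lookarounds must be rewritten before switching back from `regex`. A minimal illustration, using hypothetical patterns rather than spaCy's real rules:

```python
import re

# A variable-width lookbehind raises re.error in the stdlib engine
# ("look-behind requires fixed-width pattern"):
try:
    re.compile(r"(?<=\d{1,3})%")
except re.error:
    pass

# One common rewrite: match the context outright and keep only the
# part we care about in a capture group.
pattern = re.compile(r"\d{1,3}(%)")
match = pattern.search("discount of 25%")
assert match.group(1) == "%"
```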
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:35:54 +02:00
Matthew Honnibal
43dcaa473e
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2018-07-06 12:36:42 +02:00
Matthew Honnibal
6c8d627733
Fix tokenizer deserialization
2018-07-06 12:36:33 +02:00
ines
c001d46153
Tidy up
2018-07-06 12:33:42 +02:00
Matthew Honnibal
63f5651f8d
Fix tokenizer serialization
2018-07-06 12:32:11 +02:00
Matthew Honnibal
1a2f61725c
Fix tokenizer serialization
2018-07-06 12:23:04 +02:00
ines
63666af328
Merge branch 'master' into develop
2018-07-04 14:52:25 +02:00
Bùi Trung Chí
9af46b4f1b
Fix loading tokenizer with custom prefix search (#2495)
...
* Add contributor agreement
* Fix loading tokenizer with custom prefix search
2018-07-04 12:56:07 +02:00
Matthew Honnibal
46d8a66fef
Fix tokenizer serialization if token_match is None
2018-06-29 14:24:46 +02:00
Ines Montani
3141e04822
💫 New system for error messages and warnings (#2163)
...
* Add spacy.errors module
* Update deprecation and user warnings
* Replace errors and asserts with new error message system
* Remove redundant asserts
* Fix whitespace
* Add messages for print/util.prints statements
* Fix typo
* Fix typos
* Move CLI messages to spacy.cli._messages
* Add decorator to display error code with message
An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.
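A minimal sketch of such a retrieval-time decorator (hypothetical, simplified names; spaCy's real implementation lives in `spacy/errors.py`): the raw message strings stay untouched on the class, and the error code is prepended only when the attribute is looked up.

```python
def add_codes(err_cls):
    """Wrap a message container so lookups return '[CODE] message'."""
    class ErrorsWithCodes:
        def __getattr__(self, code):
            msg = getattr(err_cls, code)
            return f"[{code}] {msg}"
    return ErrorsWithCodes()

@add_codes
class Errors:
    E001 = "No component found in the pipeline."
    E002 = "Cannot find factory for component."
```

With this, `Errors.E001` yields `"[E001] No component found in the pipeline."` while the stored string itself is never mutated, so tracebacks and string formatting stay unaffected.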
* Remove unused link in spacy.about
* Update errors for invalid pipeline components
* Improve error for unknown factories
* Add displaCy warnings
* Update formatting consistency
* Move error message to spacy.errors
* Update errors and check if doc returned by component is None
2018-04-03 15:50:31 +02:00
Matthew Honnibal
6bc0f4d29f
Merge pull request #1611 from fsonntag/master
...
Solving #1494
2017-11-29 23:11:23 +01:00
Felix Sonntag
724ae7dc55
Fixed issue of infix capturing prefixes
2017-11-28 17:17:12 +01:00
Matthew Honnibal
542e6fd4ea
Don't remove entries from specials
2017-11-23 12:17:42 +00:00
Felix Sonntag
33b0f86de3
Changed tokenizer to add infix when infix_start is offset
2017-11-19 16:32:10 +01:00
Roman Domrachev
61d28d03e4
Try again to do selective remove cache
2017-11-15 19:11:12 +03:00
Roman Domrachev
b3311100c7
Merge branch 'master' of github.com:explosion/spaCy
2017-11-15 18:30:04 +03:00
Roman Domrachev
505c6a2f2f
Completely cleanup tokenizer cache
...
The tokenizer cache can have keys other than strings.
This modification may slow down the tokenizer, so its cost needs to be measured.
2017-11-15 17:55:48 +03:00
Matthew Honnibal
fe3c42a06b
Fix caching in tokenizer
2017-11-15 13:55:46 +01:00
Roman Domrachev
91e2fa6561
Clean all caches
2017-11-14 21:15:04 +03:00
Daniel Hershcovich
d7ae54ff44
Fix typo in message
2017-11-08 16:06:28 +02:00
ines
9659391944
Update deprecated methods and add warnings
2017-11-01 16:49:42 +01:00
ines
d96e72f656
Tidy up rest
2017-10-27 21:07:59 +02:00
ines
72497c8cb2
Remove comments and add TODO
2017-10-25 12:15:43 +02:00
Matthew Honnibal
b0f6fd3f1d
Disable tokenizer cache for special-cases. Fixes #1250
2017-10-24 16:08:05 +02:00
Matthew Honnibal
f45973848c
Rename 'tokens' variable 'doc' in tokenizer
2017-10-17 18:21:41 +02:00
ines
cd6a29dce7
Port over changes from #1294
2017-10-14 13:28:46 +02:00
ines
7c919aeb09
Make sure serializers and deserializers are ordered
2017-06-03 17:05:09 +02:00
ines
0153b66a86
Return self in Tokenizer.from_bytes
2017-06-03 13:26:13 +02:00
Matthew Honnibal
0561df2a9d
Fix tokenizer serialization
2017-05-31 14:12:38 +02:00
Matthew Honnibal
e9419072e7
Fix tokenizer serialisation
2017-05-31 13:43:31 +02:00
Matthew Honnibal
66af019d5d
Fix serialization of tokenizer
2017-05-31 11:43:40 +02:00
Matthew Honnibal
a318f0cae1
Add to/from disk/bytes methods for tokenizer
2017-05-29 12:24:41 +02:00
ines
c5a653fa48
Update docstrings and API docs for Tokenizer
2017-05-21 13:18:14 +02:00
ines
f216422ac5
Remove deprecated load classmethod
2017-05-21 13:18:01 +02:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
...
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
ines
e1efd589c3
Fix json imports and use ujson
2017-04-15 12:13:34 +02:00
ines
c05ec4b89a
Add compat functions and remove old workarounds
...
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines
d24589aa72
Clean up imports, unused code, whitespace, docstrings
2017-04-15 12:05:47 +02:00
ines
561f2a3eb4
Use consistent formatting for docstrings
2017-04-15 11:59:21 +02:00
Raphaël Bournhonesque
f332bf05be
Remove unused import statements
2017-03-21 21:08:54 +01:00
Matthew Honnibal
0ac3d27689
Fix handling of trailing whitespace
...
Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
2017-03-08 15:01:40 +01:00
Matthew Honnibal
0a6d7ca200
Fix spacing after token_match
...
The boolean flag indicating a space after the token was being set incorrectly after the token_match regex was applied.
Fixes #859.
2017-03-08 14:33:32 +01:00
Raphaël Bournhonesque
dce8f5515e
Allow zero-width 'infix' token
2017-01-23 18:28:01 +01:00
Ines Montani
aa876884f0
Revert "Revert "Merge remote-tracking branch 'origin/master'""
...
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Matthew Honnibal
a36353df47
Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.
2016-11-04 19:18:07 +01:00
Matthew Honnibal
e0c9695615
Fix doc strings for tokenizer
2016-11-02 23:15:39 +01:00
Matthew Honnibal
e9e6fce576
Handle null prefix/suffix/infix search in tokenizer
2016-11-02 20:35:48 +01:00
Matthew Honnibal
8ce8803824
Fix JSON in tokenizer
2016-10-21 01:44:20 +02:00
Matthew Honnibal
95aaea0d3f
Refactor so that the tokenizer data is read from Python data, rather than from disk
2016-09-25 14:49:53 +02:00
Matthew Honnibal
fd65cf6cbb
Finish refactoring data loading
2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c
Mostly finished loading refactoring. Design is in place, but doesn't work yet.
2016-09-24 15:42:01 +02:00
Matthew Honnibal
cc8bf62208
* Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens.
2016-05-09 13:23:47 +02:00
Matthew Honnibal
519366f677
* Fix Issue #351: Indices off when leading whitespace
2016-05-04 15:53:36 +02:00
Matthew Honnibal
04d0209be9
* Recognise multiple infixes in a token.
2016-04-13 18:38:26 +10:00
Henning Peters
b8f63071eb
add lang registration facility
2016-03-25 18:54:45 +01:00
Matthew Honnibal
141639ea3a
* Fix bug in tokenizer that caused new tokens to be added for affixes
2016-02-21 23:17:47 +00:00
Matthew Honnibal
f9e765cae7
* Add pipe() method to tokenizer
2016-02-03 02:32:37 +01:00
Matthew Honnibal
3e9961d2c4
* If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154
2016-01-16 17:08:59 +01:00
Henning Peters
235f094534
untangle data_path/via
2016-01-16 12:23:45 +01:00
Henning Peters
846fa49b2a
distinct load() and from_package() methods
2016-01-16 10:00:57 +01:00
Henning Peters
788f734513
refactored data_dir->via, add zip_safe, add spacy.load()
2016-01-15 18:01:02 +01:00
Henning Peters
bc229790ac
integrate with sputnik
2016-01-13 19:46:17 +01:00
Matthew Honnibal
a6ba43ecaf
* Fix errors in packaging revision
2015-12-29 18:37:26 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
...
The previous Sputnik integration caused an API change: Vocab, Tagger, etc.
were loaded via a from_package classmethod that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance in order to acquire a Package via
sp.pool().
Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download
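The shim described above might look roughly like this (a hedged sketch with hypothetical, simplified names, not the actual spaCy code): classes gain a `load()` classmethod that accepts either a `Package` object or a plain path string.

```python
from pathlib import Path

class Package:
    """Thin file-system shim standing in for a sputnik.Package."""
    def __init__(self, path):
        self.path = Path(path)

    def file_path(self, *parts):
        # Resolve a data file relative to the package root.
        return self.path.joinpath(*parts)

class Vocab:
    @classmethod
    def load(cls, via):
        # Accept either a Package or a string, as described above.
        package = via if isinstance(via, Package) else Package(via)
        # ... read data files via package.file_path(...) here ...
        return cls()
```

Because callers only depend on `Package`'s small surface, its internals could later be gutted and made a proxy for Sputnik without touching the loading classes.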
2015-12-29 18:00:48 +01:00
Henning Peters
9027cef3bc
access model via sputnik
2015-12-07 06:01:28 +01:00
Matthew Honnibal
68f479e821
* Rename Doc.data to Doc.c
2015-11-04 00:15:14 +11:00
Chris DuBois
dac8fe7bdb
Add __reduce__ to Tokenizer so that English pickles.
...
- Add tests to test_pickle and test_tokenizer that save to tempfiles.
2015-10-23 22:24:03 -07:00
Matthew Honnibal
3ba66f2dc7
* Add string length cap in Tokenizer.__call__
2015-10-16 04:54:16 +11:00
Matthew Honnibal
c2307fa9ee
* More work on language-generic parsing
2015-08-28 02:02:33 +02:00
Matthew Honnibal
119c0f8c3f
* Hack out morphology stuff from tokenizer, while morphology being reimplemented.
2015-08-26 19:20:11 +02:00
Matthew Honnibal
9c4d0aae62
* Switch to better Python2/3 compatible unicode handling
2015-07-28 14:45:37 +02:00
Matthew Honnibal
0c507bd80a
* Fix tokenizer
2015-07-22 14:10:30 +02:00
Matthew Honnibal
2fc66e3723
* Use Py_UNICODE in tokenizer for now, while sort out Py_UCS4 stuff
2015-07-22 13:38:45 +02:00
Matthew Honnibal
109106a949
* Replace UniStr, using unicode objects instead
2015-07-22 04:52:05 +02:00
Matthew Honnibal
e49c7f1478
* Update oov check in tokenizer
2015-07-18 22:45:28 +02:00
Matthew Honnibal
cfd842769e
* Allow infix tokens to be variable length
2015-07-18 22:45:00 +02:00
Matthew Honnibal
3b5baa660f
* Fix tokenizer
2015-07-14 00:10:51 +02:00
Matthew Honnibal
24d6ce99ec
* Add comment to tokenizer, explaining the spacy attr
2015-07-13 22:29:13 +02:00
Matthew Honnibal
67641f3b58
* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string
2015-07-13 21:46:02 +02:00
Matthew Honnibal
6eef0bf9ab
* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx
2015-07-13 20:20:58 +02:00
Matthew Honnibal
bb522496dd
* Rename Tokens to Doc
2015-07-08 18:53:00 +02:00
Matthew Honnibal
935bcdf3e5
* Remove redundant tag_names argument to Tokenizer
2015-07-08 12:36:04 +02:00
Matthew Honnibal
2d0e99a096
* Pass pos_tags into Tokenizer.from_dir
2015-07-07 14:23:08 +02:00
Matthew Honnibal
6788c86b2f
* Begin refactor
2015-07-07 14:00:07 +02:00
Matthew Honnibal
98cfd84123
* Remove hyphenation from main tokenizer loop: do it in infix.txt instead. This lets emoticons work
2015-06-06 05:57:03 +02:00
Matthew Honnibal
20f1d868a3
* Tmp commit. Working on whole document parsing
2015-05-24 02:49:56 +02:00
Jordan Suchow
3a8d9b37a6
Remove trailing whitespace
2015-04-19 13:01:38 -07:00
Matthew Honnibal
f02c39dfaf
* Compare to is not None, for more robustness
2015-03-26 16:44:48 +01:00
Matthew Honnibal
7237c805c7
* Load tag for specials.json token
2015-03-26 16:44:46 +01:00
Matthew Honnibal
0492cee8b4
* Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens
2015-02-08 18:30:30 -05:00
Matthew Honnibal
4ff180db74
* Fix off-by-one error in commit 0a7fceb
2015-01-30 12:49:33 +11:00
Matthew Honnibal
0a7fcebdf7
* Fix Issue #12 : Incorrect token.idx calculations for some punctuation, in the presence of token cache
2015-01-30 12:33:38 +11:00
Matthew Honnibal
5928d158ce
* Pass the string to Tokens
2015-01-22 02:04:58 +11:00
Matthew Honnibal
6c7e44140b
* Work on word vectors, and other stuff
2015-01-17 16:21:17 +11:00
Matthew Honnibal
ce2edd6312
* Tmp commit. Refactoring to create a Python Lexeme class.
2015-01-12 10:26:22 +11:00
Matthew Honnibal
3f1944d688
* Make PyPy work
2015-01-05 17:54:38 +11:00
Matthew Honnibal
9976aa976e
* Messily fix morphology and POS tags on special tokens.
2014-12-30 23:24:37 +11:00
Matthew Honnibal
4c4aa2c5c9
* Work on train
2014-12-22 07:25:43 +11:00
Matthew Honnibal
e1c1a4b868
* Tmp
2014-12-21 05:36:29 +11:00
Matthew Honnibal
be1bdcbd85
* Move lang.pyx to tokenizer.pyx
2014-12-20 07:55:40 +11:00