spaCy/spacy
Matthew Honnibal 82277f63a3 💫 Small efficiency fixes to tokenizer (#2587)
This patch improves tokenizer speed by about 10% and reduces memory usage in the `Vocab` by removing a redundant index. In v1, `vocab._by_orth` and `vocab._by_hash` were indexed on different data, but in v2 the orth and the hash are identical, so a single index is enough.
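
To illustrate why a second index adds nothing once the orth ID and the hash are the same value, here is a minimal pure-Python sketch. It is not spaCy's Cython code: `ToyVocab`, `toy_hash`, `_index` and the dictionary-based lexemes are hypothetical names made up for this example, and SHA-256 is only a stand-in for the real hash function.

```python
import hashlib


def toy_hash(text: str) -> int:
    """Hypothetical stand-in for the string hash (not spaCy's hash function)."""
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "little")


class ToyVocab:
    """Toy vocabulary that keeps a single index keyed by the string hash."""

    def __init__(self):
        # One table is enough when the orth ID and the hash are the same value.
        self._index = {}

    def get(self, text: str) -> dict:
        key = toy_hash(text)  # in this sketch the orth ID *is* the hash
        lexeme = self._index.get(key)
        if lexeme is None:
            lexeme = {"orth": key, "text": text}
            self._index[key] = lexeme
        return lexeme

    def get_by_orth(self, orth: int) -> dict:
        # Lookup by orth and lookup by hash hit the same table.
        return self._index[orth]


vocab = ToyVocab()
lex = vocab.get("apple")
same = vocab.get_by_orth(lex["orth"])
```

With two tables keyed by identical values, every lexeme would be stored under the same key twice, which costs memory without making any lookup possible that the single table cannot already serve.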

The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This flag records whether a chunk we're tokenizing triggers a special-case rule. If it does, we avoid caching within the chunk. Because the flag was uninitialized, this check incorrectly rejected some chunks from the cache.
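
The caching rule can be sketched in plain Python. This is a simplified illustration, not the actual `tokenizer.pyx` logic: `ToyTokenizer`, `_split_chunk`, the whitespace chunking and the one-character punctuation rule are all assumptions made for the example. The point it shows is that the flag has to start from a known value for every chunk. In C or Cython an uninitialized flag can hold an arbitrary value, which would make the tokenizer skip the cache for chunks that never touched a special case.

```python
class ToyTokenizer:
    """Toy tokenizer: caches chunk results unless a special case fired."""

    def __init__(self, special_cases):
        self.special_cases = special_cases  # e.g. {"don't": ["do", "n't"]}
        self.cache = {}

    def _split_chunk(self, chunk):
        """Return (tokens, has_special) for one whitespace-delimited chunk."""
        has_special = False  # must start from a known value for every chunk
        if chunk in self.special_cases:
            has_special = True
            return list(self.special_cases[chunk]), has_special
        # Toy rule: split off a single trailing punctuation mark.
        if len(chunk) > 1 and chunk[-1] in ".,!?":
            return [chunk[:-1], chunk[-1]], has_special
        return [chunk], has_special

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk in self.cache:
                tokens.extend(self.cache[chunk])
                continue
            pieces, has_special = self._split_chunk(chunk)
            if not has_special:
                # Only chunks that did not trigger a special case are cached.
                self.cache[chunk] = pieces
            tokens.extend(pieces)
        return tokens


tokenizer = ToyTokenizer({"don't": ["do", "n't"]})
tokens = tokenizer("I don't know.")
```

After this call the cache holds `"I"` and `"know."` but not `"don't"`, since only the last of these triggered a special-case rule.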

With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second. Prior to this patch, we had 465,764 words per second.
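
As a quick check of the relative gain, the two throughput figures can be compared directly. The snippet assumes both numbers are plain words-per-second counts (503,104 and 465,764); the variable names are made up for the example.

```python
before = 465_764  # words per second prior to the patch
after = 503_104   # words per second with the patch

speedup = after / before
percent_gain = (speedup - 1.0) * 100.0
print(f"{speedup:.3f}x, {percent_gain:.1f}% faster")
```

On these figures the measured gain comes out at roughly 8%, which is in the same range as the "about 10%" estimate.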

Before switching to the `regex` library and supporting more languages, the tokenizer ran at 1.3 million words per second. To recover the missing speed, we need to:

* Fix the variable-length lookarounds in the suffix, infix and `token_match` rules
* Improve the performance of the `token_match` regex
* Switch back from the `regex` library to the `re` library
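
The first and last items are linked: Python's standard `re` module refuses to compile lookbehinds whose alternatives differ in width, so such rules have to be rewritten before `regex` can be dropped. The sketch below uses a made-up pattern, not one of spaCy's real suffix, infix or `token_match` rules, to show the failure and one possible fixed-width rewrite.

```python
import re

# Hypothetical pattern: match "x" only when preceded by "ab" or "c".
# The two alternatives have different widths, so `re` refuses to compile it.
VARIABLE_WIDTH = r"(?<=ab|c)x"

try:
    re.compile(VARIABLE_WIDTH)
    compiles_in_re = True
except re.error:
    compiles_in_re = False

# Equivalent rewrite: one fixed-width lookbehind per alternative.
FIXED_WIDTH = re.compile(r"(?:(?<=ab)|(?<=c))x")

# Start offsets of every "x" preceded by "ab" or "c".
matches = [m.start() for m in FIXED_WIDTH.finditer("abx cx dx")]
```

Here `compiles_in_re` ends up `False`, and the rewritten pattern matches the `x` in `abx` and `cx` but not the one in `dx`. Whether a rewrite of this shape is practical for the real rules depends on how many alternatives they contain.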

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:35:54 +02:00

| Name | Last commit | Date |
| --- | --- | --- |
| `cli` | Merge branch 'master' into develop | 2018-07-18 18:57:00 +02:00 |
| `data` | Make spacy/data a package | 2017-03-18 20:04:22 +01:00 |
| `displacy` | fix issue #2452 - displacy arrow direction is always forward (#2506) (closes #2452) | 2018-07-04 14:12:08 +02:00 |
| `lang` | Merge branch 'master' into develop | 2018-07-21 15:34:18 +02:00 |
| `syntax` | Make pipeline work on empty docs | 2018-06-29 19:21:38 +02:00 |
| `tests` | Merge branch 'master' into develop | 2018-07-21 15:34:18 +02:00 |
| `tokens` | Merge branch 'master' into develop | 2018-07-21 15:34:18 +02:00 |
| `__init__.pxd` | * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. | 2014-10-24 02:23:42 +11:00 |
| `__init__.py` | Silent keyword in info function in init (#2459) | 2018-06-18 12:24:21 +02:00 |
| `__main__.py` | Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" | 2018-03-27 19:23:02 +02:00 |
| `_align.pyx` | Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" | 2018-03-27 19:23:02 +02:00 |
| `_matcher2_notes.py` | Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" | 2018-03-27 19:23:02 +02:00 |
| `_ml.py` | Only warn about unnamed vectors if non-zero sized. | 2018-05-19 18:51:55 +02:00 |
| `about.py` | Set version to 2.0.12.dev1 | 2018-07-21 13:08:01 +02:00 |
| `attrs.pxd` | Fix LANG symbol | 2018-02-17 18:10:50 +01:00 |
| `attrs.pyx` | Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" | 2018-03-27 19:23:02 +02:00 |
| `compat.py` | 💫 Rule-based NER component (#2513) | 2018-07-18 19:43:16 +02:00 |
| `errors.py` | Allow ignoring warnings and only overwrite if set explicitly | 2018-07-20 22:50:19 +02:00 |
| `glossary.py` | Fix typo in glossary (resolves #1964) | 2018-02-10 11:58:41 +01:00 |
| `gold.pxd` | Add support for sent_start to GoldParse | 2017-08-25 20:03:14 -05:00 |
| `gold.pyx` | Merge branch 'master' into develop | 2018-05-30 13:01:01 +02:00 |
| `language.py` | 💫 Rule-based NER component (#2513) | 2018-07-18 19:43:16 +02:00 |
| `lemmatizer.py` | Fix lemmatization | 2018-07-05 13:56:02 +02:00 |
| `lexeme.pxd` | WIP on stringstore change. 27 failures | 2017-05-28 14:06:40 +02:00 |
| `lexeme.pyx` | 💫 Add .similarity warnings for no vectors and option to exclude warnings (#2197) | 2018-05-21 01:22:38 +02:00 |
| `matcher.pyx` | Fix compile error in matcher | 2018-07-06 12:29:23 +02:00 |
| `morphology.pxd` | Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" | 2018-03-27 19:23:02 +02:00 |
| `morphology.pyx` | Fix lemmatization | 2018-07-05 13:56:02 +02:00 |
| `parts_of_speech.pxd` | Add support for Universal Dependencies v2.0 | 2017-03-03 13:17:34 +01:00 |
| `parts_of_speech.pyx` | Tidy up rest | 2017-10-27 21:07:59 +02:00 |
| `pipeline.pxd` | Fix names of pipeline components | 2017-10-26 12:38:23 +02:00 |
| `pipeline.pyx` | 💫 Rule-based NER component (#2513) | 2018-07-18 19:43:16 +02:00 |
| `scorer.py` | Fix scoring if tokenization changes | 2018-05-01 01:33:20 +02:00 |
| `strings.pxd` | Try to fix StringStore clean up (see #1506) | 2017-11-11 03:11:27 +03:00 |
| `strings.pyx` | 💫 New system for error messages and warnings (#2163) | 2018-04-03 15:50:31 +02:00 |
| `structs.pxd` | Make TokenC.sent_tart an int, to allow ternary value | 2017-10-08 19:58:54 +02:00 |
| `symbols.pxd` | Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" | 2018-03-27 19:23:02 +02:00 |
| `symbols.pyx` | Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" | 2018-03-27 19:23:02 +02:00 |
| `tokenizer.pxd` | Disable tokenizer cache for special-cases. Fixes #1250 | 2017-10-24 16:08:05 +02:00 |
| `tokenizer.pyx` | 💫 Small efficiency fixes to tokenizer (#2587) | 2018-07-24 23:35:54 +02:00 |
| `typedefs.pxd` | Work on changing StringStore to return hashes. | 2017-05-28 12:36:27 +02:00 |
| `typedefs.pyx` | Tidy up rest | 2017-10-27 21:07:59 +02:00 |
| `util.py` | Merge branch 'master' into develop | 2018-07-21 15:34:18 +02:00 |
| `vectors.pyx` | 💫 New system for error messages and warnings (#2163) | 2018-04-03 15:50:31 +02:00 |
| `vocab.pxd` | 💫 Small efficiency fixes to tokenizer (#2587) | 2018-07-24 23:35:54 +02:00 |
| `vocab.pyx` | 💫 Small efficiency fixes to tokenizer (#2587) | 2018-07-24 23:35:54 +02:00 |