spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-17 20:50:55 +03:00


Matthew Honnibal 82277f63a3 💫 Small efficiency fixes to tokenizer (#2587)	2018-07-24 23:35:54 +02:00

This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` indexed on different data in v1, but in v2 the orth and the hash are identical.

The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This checks whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. This check led to incorrectly rejecting some chunks from the cache.

With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104k words per second. Prior to this patch, we had 465,764k words per second.

Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to:

* Fix the variable-length lookarounds in the suffix, infix and `token_match` rules
* Improve the performance of the `token_match` regex
* Switch back from the `regex` library to the `re` library.

## Checklist

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
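The `has_special` behaviour described in the commit message can be illustrated with a toy example. The following is only a minimal sketch in pure Python, not spaCy's Cython implementation: the class `CachingTokenizer`, its `_split_chunk` helper, and the punctuation-stripping rule are all hypothetical names and simplifications introduced for illustration. It shows the two points the message makes: the flag is initialised explicitly for every chunk, and a chunk is only cached when no special-case rule fired.

```python
from typing import Dict, List, Tuple


class CachingTokenizer:
    """Toy whitespace tokenizer with a per-chunk cache (hypothetical sketch)."""

    def __init__(self, special_cases: Dict[str, List[str]]) -> None:
        self.special_cases = special_cases
        self.cache: Dict[str, Tuple[str, ...]] = {}

    def _split_chunk(self, chunk: str) -> Tuple[List[str], bool]:
        # Initialise the flag on every call, so a stale or garbage value
        # can never cause a cacheable chunk to be rejected.
        has_special = False
        suffixes: List[str] = []
        # Simplified suffix rule: peel off trailing punctuation.
        while len(chunk) > 1 and chunk[-1] in ".,!?":
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        if chunk in self.special_cases:
            has_special = True
            tokens = list(self.special_cases[chunk])
        else:
            tokens = [chunk]
        return tokens + suffixes, has_special

    def __call__(self, text: str) -> List[str]:
        output: List[str] = []
        for chunk in text.split():
            cached = self.cache.get(chunk)
            if cached is not None:
                output.extend(cached)
                continue
            tokens, has_special = self._split_chunk(chunk)
            # Only cache chunks that did not trigger a special-case rule.
            if not has_special:
                self.cache[chunk] = tuple(tokens)
            output.extend(tokens)
        return output


tokenizer = CachingTokenizer({"don't": ["do", "n't"]})
print(tokenizer("I don't know."))  # ['I', 'do', "n't", 'know', '.']
print(sorted(tokenizer.cache))     # ['I', 'know.']
```

In this sketch the chunk `don't` is re-tokenized on every call because it matches a special case, while `I` and `know.` are served from the cache on later calls.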
cli	Merge branch 'master' into develop	2018-07-18 18:57:00 +02:00
data	Make spacy/data a package	2017-03-18 20:04:22 +01:00
displacy	fix issue #2452 - displacy arrow direction is always forward (#2506) (closes #2452)	2018-07-04 14:12:08 +02:00
lang	Merge branch 'master' into develop	2018-07-21 15:34:18 +02:00
syntax	Make pipeline work on empty docs	2018-06-29 19:21:38 +02:00
tests	Merge branch 'master' into develop	2018-07-21 15:34:18 +02:00
tokens	Merge branch 'master' into develop	2018-07-21 15:34:18 +02:00
__init__.pxd	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
__init__.py	Silent keyword in info function in init (#2459)	2018-06-18 12:24:21 +02:00
__main__.py	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"	2018-03-27 19:23:02 +02:00
_align.pyx	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"	2018-03-27 19:23:02 +02:00
_matcher2_notes.py	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"	2018-03-27 19:23:02 +02:00
_ml.py	Only warn about unnamed vectors if non-zero sized.	2018-05-19 18:51:55 +02:00
about.py	Set version to 2.0.12.dev1	2018-07-21 13:08:01 +02:00
attrs.pxd	Fix LANG symbol	2018-02-17 18:10:50 +01:00
attrs.pyx	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"	2018-03-27 19:23:02 +02:00
compat.py	💫 Rule-based NER component (#2513)	2018-07-18 19:43:16 +02:00
errors.py	Allow ignoring warnings and only overwrite if set explicitly	2018-07-20 22:50:19 +02:00
glossary.py	Fix typo in glossary (resolves #1964)	2018-02-10 11:58:41 +01:00
gold.pxd	Add support for sent_start to GoldParse	2017-08-25 20:03:14 -05:00
gold.pyx	Merge branch 'master' into develop	2018-05-30 13:01:01 +02:00
language.py	💫 Rule-based NER component (#2513)	2018-07-18 19:43:16 +02:00
lemmatizer.py	Fix lemmatization	2018-07-05 13:56:02 +02:00
lexeme.pxd	WIP on stringstore change. 27 failures	2017-05-28 14:06:40 +02:00
lexeme.pyx	💫 Add .similarity warnings for no vectors and option to exclude warnings (#2197)	2018-05-21 01:22:38 +02:00
matcher.pyx	Fix compile error in matcher	2018-07-06 12:29:23 +02:00
morphology.pxd	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"	2018-03-27 19:23:02 +02:00
morphology.pyx	Fix lemmatization	2018-07-05 13:56:02 +02:00
parts_of_speech.pxd	Add support for Universal Dependencies v2.0	2017-03-03 13:17:34 +01:00
parts_of_speech.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
pipeline.pxd	Fix names of pipeline components	2017-10-26 12:38:23 +02:00
pipeline.pyx	💫 Rule-based NER component (#2513)	2018-07-18 19:43:16 +02:00
scorer.py	Fix scoring if tokenization changes	2018-05-01 01:33:20 +02:00
strings.pxd	Try to fix StringStore clean up (see #1506)	2017-11-11 03:11:27 +03:00
strings.pyx	💫 New system for error messages and warnings (#2163)	2018-04-03 15:50:31 +02:00
structs.pxd	Make TokenC.sent_start an int, to allow ternary value	2017-10-08 19:58:54 +02:00
symbols.pxd	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"	2018-03-27 19:23:02 +02:00
symbols.pyx	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"	2018-03-27 19:23:02 +02:00
tokenizer.pxd	Disable tokenizer cache for special-cases. Fixes #1250	2017-10-24 16:08:05 +02:00
tokenizer.pyx	💫 Small efficiency fixes to tokenizer (#2587)	2018-07-24 23:35:54 +02:00
typedefs.pxd	Work on changing StringStore to return hashes.	2017-05-28 12:36:27 +02:00
typedefs.pyx	Tidy up rest	2017-10-27 21:07:59 +02:00
util.py	Merge branch 'master' into develop	2018-07-21 15:34:18 +02:00
vectors.pyx	💫 New system for error messages and warnings (#2163)	2018-04-03 15:50:31 +02:00
vocab.pxd	💫 Small efficiency fixes to tokenizer (#2587)	2018-07-24 23:35:54 +02:00
vocab.pyx	💫 Small efficiency fixes to tokenizer (#2587)	2018-07-24 23:35:54 +02:00