spaCy/spacy
adrianeboyd c23edf302b Replace PhraseMatcher with trie-based search (#4309)
* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. Match IDs
can now be removed easily, and matches no longer go missing with large
keyword lists / vocabularies.

Fixes #4308.
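
As a rough illustration of the idea (not the code from this commit), token-level phrase matching over hash values can be sketched with a nested-dict trie. The names `TERMINAL`, `add_phrase` and `find_matches` and the use of Python's built-in `hash()` are assumptions made for this example. The scan below restarts at every token, so it is a plain FlashText-style trie search rather than full Aho-Corasick with failure links.

```python
# Illustrative sketch only; names and data layout are hypothetical.
TERMINAL = "_terminal_"  # assumed reserved key marking the end of a phrase


def add_phrase(trie, match_id, token_hashes):
    """Insert one phrase (a sequence of token hashes) under match_id."""
    node = trie
    for h in token_hashes:
        node = node.setdefault(h, {})
    # Several match IDs may share the same phrase, so keep a set.
    node.setdefault(TERMINAL, set()).add(match_id)


def find_matches(trie, token_hashes):
    """Return (match_id, start, end) for every stored phrase in the input."""
    matches = []
    for start in range(len(token_hashes)):
        node = trie
        for end in range(start, len(token_hashes)):
            node = node.get(token_hashes[end])
            if node is None:
                break  # no stored phrase continues with this token
            for match_id in sorted(node.get(TERMINAL, ())):
                matches.append((match_id, start, end + 1))
    return matches


trie = {}
add_phrase(trie, "CITY", [hash("new"), hash("york")])
add_phrase(trie, "PAPER", [hash("new"), hash("york"), hash("times")])
doc = [hash(w) for w in "the new york times".split()]
matches = find_matches(trie, doc)
```

Because the terminal marker here is a string and token hashes are integers, the two can never collide as dict keys; a design that reserves one hash value as the terminal key does not have that guarantee.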

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.
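
A minimal sketch of what a full remove could look like on a dict-based trie, assuming each match ID keeps the phrases stored under it. The class `PhraseTrie`, its attributes and the `TERMINAL` key are hypothetical and do not reflect the MapStruct-based code in this commit. It deletes the match ID from the terminal set, prunes trie nodes that no other phrase uses, and raises KeyError for an unknown match ID.

```python
TERMINAL = "_terminal_"  # assumed reserved end-of-phrase key


class PhraseTrie:
    """Hypothetical trie over token hashes with add() and remove()."""

    def __init__(self):
        self.trie = {}
        self.phrases = {}  # match_id -> set of token-hash tuples

    def add(self, match_id, token_hashes):
        node = self.trie
        for h in token_hashes:
            node = node.setdefault(h, {})
        node.setdefault(TERMINAL, set()).add(match_id)
        self.phrases.setdefault(match_id, set()).add(tuple(token_hashes))

    def remove(self, match_id):
        if match_id not in self.phrases:
            raise KeyError(match_id)
        for phrase in self.phrases.pop(match_id):
            # Record the nodes along the phrase so they can be pruned bottom-up.
            path = [self.trie]
            for h in phrase:
                path.append(path[-1][h])
            ids = path[-1].get(TERMINAL)
            if ids is not None:
                ids.discard(match_id)
                if not ids:
                    del path[-1][TERMINAL]
            for depth in range(len(phrase), 0, -1):
                if path[depth]:
                    break  # node is still used by another phrase or match ID
                del path[depth - 1][phrase[depth - 1]]
```

Removing the longer of two overlapping phrases leaves the shorter one intact, and removing the last phrase leaves an empty trie.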

* Store docs internally only as attr lists

* Reduces size for pickle

* Remove duplicate keywords store

Now that docs are stored as lists of attr hashes, there's no need to
have the duplicate _keywords store.
2019-09-27 16:22:34 +02:00
cli Support model name in init-model 2019-09-26 03:01:32 +02:00
data Make spacy/data a package 2017-03-18 20:04:22 +01:00
displacy Improve token pattern checking without validation (#4105) 2019-08-21 14:00:37 +02:00
lang Update stop_words.py and add name in contributors (#4325) 2019-09-27 11:57:27 +02:00
matcher Replace PhraseMatcher with trie-based search (#4309) 2019-09-27 16:22:34 +02:00
pipeline Merge changes to test_ner 2019-09-18 21:41:24 +02:00
syntax Improve Morphology errors (#4314) 2019-09-21 14:37:06 +02:00
tests Replace PhraseMatcher with trie-based search (#4309) 2019-09-27 16:22:34 +02:00
tokens Merge changes to test_ner 2019-09-18 21:41:24 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Fix formatting (hopefully also restarts build properly) 2019-03-20 09:55:45 +01:00
__main__.py Update __main__.py 2019-03-20 09:43:26 +01:00
_align.pyx Improve alignment around quotes 2018-08-16 01:04:34 +02:00
_ml.py Make "unnamed vectors" warning a real warning 2019-09-16 15:16:12 +02:00
about.py Set version to v2.2.0.dev10 2019-09-26 03:03:50 +02:00
attrs.pxd Fix attrs alignment 2019-07-12 17:59:47 +02:00
attrs.pyx Merge changes from master 2019-08-21 14:18:52 +02:00
compat.py Fix symlink creation to show error message on failure (#3589) (resolves #3307)) 2019-04-16 11:58:31 +02:00
errors.py Replace PhraseMatcher with trie-based search (#4309) 2019-09-27 16:22:34 +02:00
glossary.py Include Norwegian NER entity types in glossary [ci skip] 2019-09-15 17:16:21 +02:00
gold.pxd Merge changes from master 2019-08-21 14:18:52 +02:00
gold.pyx Fix orth replacement 2019-09-19 00:03:24 +02:00
kb.pxd rename entity frequency 2019-07-19 17:40:28 +02:00
kb.pyx Documentation for Entity Linking (#4065) 2019-09-12 11:38:34 +02:00
language.py Refactor language update (#4316) 2019-09-27 16:20:21 +02:00
lemmatizer.py 💫 Adjust Table API and add docs (#4289) 2019-09-15 22:08:13 +02:00
lexeme.pxd 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325) 2019-02-24 21:13:51 +01:00
lexeme.pyx Tidy up property code style (#3391) 2019-03-11 15:59:09 +01:00
lookups.py Simplify lookup hashing 2019-09-18 20:24:41 +02:00
morphology.pxd annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
morphology.pyx Improve Morphology errors (#4314) 2019-09-21 14:37:06 +02:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
scorer.py Make except more explicit 2019-09-18 19:57:08 +02:00
strings.pxd Try to fix StringStore clean up (see #1506) 2017-11-11 03:11:27 +03:00
strings.pyx Merge branch 'master' into feature/lemmatizer 2019-03-16 13:44:22 +01:00
structs.pxd Merge changes from master 2019-08-21 14:18:52 +02:00
symbols.pxd Fix symbol alignment 2019-07-12 17:48:38 +02:00
symbols.pyx ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
tokenizer.pxd Flush tokenizer cache when necessary (#4258) 2019-09-08 20:52:46 +02:00
tokenizer.pyx Flush tokenizer cache when necessary (#4258) 2019-09-08 20:52:46 +02:00
typedefs.pxd Work on changing StringStore to return hashes. 2017-05-28 12:36:27 +02:00
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py Improve Morphology errors (#4314) 2019-09-21 14:37:06 +02:00
vectors.pyx Update vectors name docs [ci skip] 2019-09-26 16:21:32 +02:00
vocab.pxd 💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167) 2019-08-22 14:21:32 +02:00
vocab.pyx Update vectors name docs [ci skip] 2019-09-26 16:21:32 +02:00