Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-26 18:06:29 +03:00
c23edf302b
* Replace PhraseMatcher with Aho-Corasick

  Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText.

  The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies.

  Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

  There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

  Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array.

  Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

  Remove unnecessary trie paths and free unused maps.

  Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added.

* Store docs internally only as attr lists

* Reduces size for pickle

* Remove duplicate keywords store

  Now that docs are stored as lists of attr hashes, there's no need to have the duplicate _keywords store.
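To make the trie idea from the commit message concrete, here is a minimal pure-Python sketch of token-level phrase matching with a reserved terminal key. It is an illustration under assumptions, not spaCy's implementation: the names `TERMINAL`, `add_phrase` and the Python-level `find_matches` are hypothetical, plain strings stand in for attribute hash values, and nested dicts stand in for the `MapStruct` trie. It also restarts the walk at every token position instead of following failure links, so it is a simple trie scan rather than a full Aho-Corasick automaton.

```python
from typing import List, Tuple

# Hypothetical reserved key meaning "a phrase ends at this node".
# The real code keeps a hash value for this purpose in `_terminal_hash`.
TERMINAL = "__terminal__"


def add_phrase(trie: dict, match_id: str, tokens: List[str]) -> None:
    """Insert one phrase (a list of token keys) under match_id."""
    node = trie
    for tok in tokens:
        if tok == TERMINAL:
            # Assumption: reject a collision with the reserved key outright.
            raise ValueError("token collides with the reserved terminal key")
        node = node.setdefault(tok, {})
    # The terminal entry holds the set of match IDs that end at this node.
    node.setdefault(TERMINAL, set()).add(match_id)


def find_matches(trie: dict, tokens: List[str]) -> List[Tuple[str, int, int]]:
    """Return (match_id, start, end) triples, with end exclusive."""
    matches = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            tok = tokens[end]
            if tok == TERMINAL:
                break
            node = node.get(tok)
            if node is None:
                break  # no phrase continues with this token
            # Several match IDs may share the same phrase, hence the loop.
            for match_id in sorted(node.get(TERMINAL, ())):
                matches.append((match_id, start, end + 1))
    return matches
```

Each result triple has the same three fields as the `MatchStruct` declared in this file (`match_id`, `start`, `end`).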
28 lines
568 B
Cython
from libcpp.vector cimport vector

from cymem.cymem cimport Pool
from preshed.maps cimport key_t, MapStruct

from ..attrs cimport attr_id_t
from ..tokens.doc cimport Doc
from ..vocab cimport Vocab


cdef class PhraseMatcher:
    cdef Vocab vocab
    cdef attr_id_t attr
    cdef object _callbacks
    cdef object _docs
    cdef bint _validate
    cdef MapStruct* c_map
    cdef Pool mem
    cdef key_t _terminal_hash

    cdef void find_matches(self, Doc doc, vector[MatchStruct] *matches) nogil


cdef struct MatchStruct:
    key_t match_id
    int start
    int end
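
The commit message describes a full `remove()` that prunes unused trie paths and raises `KeyError` for an unknown match ID. The pure-Python sketch below shows one way those pieces could fit together. It is not the Cython implementation: the class `TinyPhraseStore` and its attributes are hypothetical, with `trie` playing the role of `c_map`, `docs` the role of `_docs` (phrases kept only as attribute sequences), and `TERMINAL` the role of `_terminal_hash`.

```python
from typing import Dict, List, Set, Tuple

TERMINAL = "__terminal__"  # hypothetical stand-in for `_terminal_hash`


class TinyPhraseStore:
    """Illustrative only; names and layout are assumptions."""

    def __init__(self) -> None:
        self.trie: dict = {}                           # role of `c_map`
        self.docs: Dict[str, Set[Tuple[str, ...]]] = {}  # role of `_docs`

    def add(self, match_id: str, phrases: List[List[str]]) -> None:
        for phrase in phrases:
            node = self.trie
            for tok in phrase:
                node = node.setdefault(tok, {})
            node.setdefault(TERMINAL, set()).add(match_id)
            # Keep only the attribute sequence, not a full document object.
            self.docs.setdefault(match_id, set()).add(tuple(phrase))

    def remove(self, match_id: str) -> None:
        if match_id not in self.docs:
            raise KeyError(match_id)
        for phrase in self.docs.pop(match_id):
            # Record the nodes along the path so they can be pruned bottom-up.
            path = [self.trie]
            for tok in phrase:
                path.append(path[-1][tok])
            ids = path[-1].get(TERMINAL)
            if ids is not None:
                ids.discard(match_id)
                if not ids:
                    del path[-1][TERMINAL]
            for depth in range(len(phrase), 0, -1):
                if path[depth]:
                    break  # node is still used by another phrase
                del path[depth - 1][phrase[depth - 1]]
```

Storing the added phrases per match ID is what lets `remove()` find the exact trie paths to clean up without scanning the whole trie.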