spaCy/spacy/matcher/matcher.pxd
Kevin Humphreys 19650ebb52
Enable fuzzy text matching in Matcher (#11359)
* enable fuzzy matching

* add fuzzy param to EntityMatcher

* include rapidfuzz_capi

not yet used

* fix type

* add FUZZY predicate

* add fuzzy attribute list

* fix type properly

* tidying

* remove unnecessary dependency

* handle fuzzy sets

* simplify fuzzy sets

* case fix

* switch to FUZZYn predicates

use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.

* revert changes added for fuzzy param

* switch to polyleven

(Python package)

* enable fuzzy matching

* add fuzzy param to EntityMatcher

* include rapidfuzz_capi

not yet used

* fix type

* add FUZZY predicate

* add fuzzy attribute list

* fix type properly

* tidying

* remove unnecessary dependency

* handle fuzzy sets

* simplify fuzzy sets

* case fix

* switch to FUZZYn predicates

use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.

* revert changes added for fuzzy param

* switch to polyleven

(Python package)

* fuzzy match only on oov tokens

* remove polyleven

* exclude whitespace tokens

* don't allow more edits than characters

* fix min distance

* reinstate FUZZY operator

with length-based distance function

* handle sets inside regex operator

* remove is_oov check

* attempt build fix

no mypy failure locally

* re-attempt build fix

* don't overwrite fuzzy param value

* move fuzzy_match

to its own Python module to allow patching

* move fuzzy_match back inside Matcher

simplify logic and add tests

* Format tests

* Parametrize fuzzyn tests

* Parametrize and merge fuzzy+set tests

* Format

* Move fuzzy_match to a standalone method

* Change regex kwarg type to bool

* Add types for fuzzy_match

- Refactor variable names
- Add test for symmetrical behavior

* Parametrize fuzzyn+set tests

* Minor refactoring for fuzz/fuzzy

* Make fuzzy_match a Matcher kwarg

* Update type for _default_fuzzy_match

* don't overwrite function param

* Rename to fuzzy_compare

* Update fuzzy_compare default argument declarations

* allow fuzzy_compare override from EntityRuler

* define new Matcher keyword arg

* fix type definition

* Implement fuzzy_compare config option for EntityRuler and SpanRuler

* Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects

* Use simpler fuzzy_compare algorithm

* Update types

* Increase minimum to 2 in fuzzy_compare to allow one transposition

* Fix predicate keys and matching for SetPredicate with FUZZY and REGEX

* Add FUZZY6..9

* Add initial docs

* Increase default fuzzy to rounded 30% of pattern length

* Update docs for fuzzy_compare in components

* Update EntityRuler and SpanRuler API docs

* Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare

To having naming similar to `phrase_matcher_attr`, rename
`fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to
`matcher_fuzzy_compare. Organize next to `phrase_matcher_attr` in docs.

* Fix schema aliases

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typo

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add FUZZY6-9 operators and update tests

* Parameterize test over greedy

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type for fuzzy_compare to remove Optional

* Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein

* Update docs following levenshtein_compare renaming

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-10 10:36:17 +01:00

81 lines
1.5 KiB
Cython

from libc.stdint cimport int32_t
from libcpp.vector cimport vector
from cymem.cymem cimport Pool
from ..vocab cimport Vocab
from ..typedefs cimport attr_t, hash_t
from ..structs cimport TokenC
from ..lexeme cimport attr_id_t
cdef enum action_t:
REJECT = 0000
MATCH = 1000
ADVANCE = 0100
RETRY = 0010
RETRY_EXTEND = 0011
RETRY_ADVANCE = 0110
MATCH_EXTEND = 1001
MATCH_REJECT = 2000
MATCH_DOUBLE = 3000
cdef enum quantifier_t:
ZERO
ZERO_ONE
ZERO_PLUS
ONE
ONE_PLUS
FINAL_ID
cdef struct AttrValueC:
attr_id_t attr
attr_t value
cdef struct IndexValueC:
int32_t index
attr_t value
cdef struct TokenPatternC:
AttrValueC* attrs
int32_t* py_predicates
IndexValueC* extra_attrs
int32_t nr_attr
int32_t nr_extra_attr
int32_t nr_py
quantifier_t quantifier
hash_t key
int32_t token_idx
cdef struct MatchAlignmentC:
int32_t token_idx
int32_t length
cdef struct PatternStateC:
TokenPatternC* pattern
int32_t start
int32_t length
cdef struct MatchC:
attr_t pattern_id
int32_t start
int32_t length
cdef class Matcher:
cdef Pool mem
cdef vector[TokenPatternC*] patterns
cdef readonly Vocab vocab
cdef public object validate
cdef public object _patterns
cdef public object _callbacks
cdef public object _filter
cdef public object _extensions
cdef public object _extra_predicates
cdef public object _seen_attrs
cdef public object _fuzzy_compare