spaCy/website/docs/api
Kevin Humphreys 19650ebb52
Enable fuzzy text matching in Matcher (#11359)
* enable fuzzy matching

* add fuzzy param to EntityMatcher

* include rapidfuzz_capi

not yet used

* fix type

* add FUZZY predicate

* add fuzzy attribute list

* fix type properly

* tidying

* remove unnecessary dependency

* handle fuzzy sets

* simplify fuzzy sets

* case fix

* switch to FUZZYn predicates

use Levenshtein distance.
remove fuzzy param.
remove rapidfuzz_capi.

* revert changes added for fuzzy param

* switch to polyleven

(Python package)

* fuzzy match only on oov tokens

* remove polyleven

* exclude whitespace tokens

* don't allow more edits than characters

* fix min distance

* reinstate FUZZY operator

with length-based distance function

* handle sets inside regex operator

* remove is_oov check

* attempt build fix

no mypy failure locally

* re-attempt build fix

* don't overwrite fuzzy param value

* move fuzzy_match

to its own Python module to allow patching

* move fuzzy_match back inside Matcher

simplify logic and add tests

* Format tests

* Parametrize fuzzyn tests

* Parametrize and merge fuzzy+set tests

* Format

* Move fuzzy_match to a standalone method

* Change regex kwarg type to bool

* Add types for fuzzy_match

- Refactor variable names
- Add test for symmetrical behavior

* Parametrize fuzzyn+set tests

* Minor refactoring for fuzz/fuzzy

* Make fuzzy_match a Matcher kwarg

* Update type for _default_fuzzy_match

* don't overwrite function param

* Rename to fuzzy_compare

* Update fuzzy_compare default argument declarations

* allow fuzzy_compare override from EntityRuler

* define new Matcher keyword arg

* fix type definition

* Implement fuzzy_compare config option for EntityRuler and SpanRuler

* Rename _default_fuzzy_compare to fuzzy_compare, remove from reexported objects

* Use simpler fuzzy_compare algorithm

* Update types

* Increase minimum to 2 in fuzzy_compare to allow one transposition

* Fix predicate keys and matching for SetPredicate with FUZZY and REGEX

* Add FUZZY6..9

* Add initial docs

* Increase default fuzzy to rounded 30% of pattern length

* Update docs for fuzzy_compare in components

* Update EntityRuler and SpanRuler API docs

* Rename EntityRuler and SpanRuler setting to matcher_fuzzy_compare

To keep naming similar to `phrase_matcher_attr`, rename the
`fuzzy_compare` setting for `EntityRuler` and `SpanRuler` to
`matcher_fuzzy_compare`. Organize next to `phrase_matcher_attr` in docs.

* Fix schema aliases

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix typo

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add FUZZY6-9 operators and update tests

* Parameterize test over greedy

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type for fuzzy_compare to remove Optional

* Rename to spacy.levenshtein_compare.v1, move to spacy.matcher.levenshtein

* Update docs following levenshtein_compare renaming
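
The default tolerance described in the bullets above (a rounded 30% of the pattern length, with a minimum of 2 so that one transposition fits) can be sketched in plain Python. This is an illustrative sketch only, not the code from this commit: `levenshtein` and `fuzzy_compare_sketch` are hypothetical names, and treating a negative `fuzzy` value as "use the default" is an assumption made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance.

    Counts insertions, deletions and substitutions; a transposition
    of two adjacent characters therefore costs 2 edits.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def fuzzy_compare_sketch(input_text: str, pattern_text: str, fuzzy: int = -1) -> bool:
    """Hypothetical comparison following the rules named in the log.

    Assumption: fuzzy >= 0 is an explicit edit budget (as with FUZZYn),
    and a negative value selects the length-based default.
    """
    if fuzzy >= 0:
        max_edits = fuzzy
    else:
        # At least 2 edits (one transposition), up to a rounded 30% of the pattern.
        max_edits = max(2, round(0.3 * len(pattern_text)))
    return levenshtein(input_text, pattern_text) <= max_edits
```

With these assumptions, `fuzzy_compare_sketch("teh", "the")` passes under the default budget because the swap counts as two edits, while the same pair fails with an explicit budget of `fuzzy=1`.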

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-01-10 10:36:17 +01:00
architectures.md Add experimental coref docs (#11291) 2022-09-27 18:11:23 +09:00
attributeruler.md Document scorers in registry and components from #8766 (#8929) 2021-08-12 12:50:03 +02:00
attributes.md Add API docs for token attribute symbols (#10836) 2022-06-23 08:16:38 +02:00
cli.md Add apply CLI (#11376) 2022-12-20 17:11:33 +01:00
coref.md Add experimental coref docs (#11291) 2022-09-27 18:11:23 +09:00
corpus.md Remove NBSP's across tables in the docs (#10842) 2022-05-25 09:48:39 +02:00
cython-classes.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
cython-structs.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
cython.md Update docs [ci skip] 2020-09-12 17:05:10 +02:00
data-formats.md Add version tag to before_update config key (#12059) 2023-01-05 11:46:04 +01:00
dependencymatcher.md add additional REL_OP (#10371) 2022-07-27 13:16:44 +02:00
dependencyparser.md Update docs for pipeline initialize() methods (#11221) 2022-08-03 16:53:02 +02:00
doc.md remove new v2 tags (#11780) 2022-11-14 17:41:01 +09:00
docbin.md Fix point typo on docbin docs (#9097) 2021-08-31 10:55:44 +02:00
edittreelemmatizer.md Update docs for pipeline initialize() methods (#11221) 2022-08-03 16:53:02 +02:00
entitylinker.md Refactor KB for easier customization (#11268) 2022-09-08 10:38:07 +02:00
entityrecognizer.md Update docs for pipeline initialize() methods (#11221) 2022-08-03 16:53:02 +02:00
entityruler.md Enable fuzzy text matching in Matcher (#11359) 2023-01-10 10:36:17 +01:00
example.md more explicit Example constructor example (#11489) 2022-09-16 09:26:33 +02:00
index.md Update v3 docs 2020-07-03 16:48:21 +02:00
kb_in_memory.md fix docs (#11573) 2022-10-03 17:01:04 +02:00
kb.md Refactor KB for easier customization (#11268) 2022-09-08 10:38:07 +02:00
language.md remove new v2 tags (#11780) 2022-11-14 17:41:01 +09:00
legacy.md Add ConsoleLogger.v2 (#11214) 2022-08-29 10:23:05 +02:00
lemmatizer.md Switch ru and uk lemmatizers to pymorphy3 (#11345) 2022-08-22 11:27:14 +02:00
lexeme.md Update lexeme.md (#11994) 2022-12-19 10:33:38 +01:00
lookups.md Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
matcher.md Enable fuzzy text matching in Matcher (#11359) 2023-01-10 10:36:17 +01:00
morphologizer.md Update docs for pipeline initialize() methods (#11221) 2022-08-03 16:53:02 +02:00
morphology.md Document Assigned Attributes of Pipeline Components (#9041) 2021-09-01 12:09:39 +02:00
phrasematcher.md remove new v2 tags (#11780) 2022-11-14 17:41:01 +09:00
pipe.md Document scorers in registry and components from #8766 (#8929) 2021-08-12 12:50:03 +02:00
pipeline-functions.md Add experimental coref docs (#11291) 2022-09-27 18:11:23 +09:00
scorer.md Update textcat scorer threshold behavior (#11696) 2022-11-02 15:35:04 +01:00
sentencerecognizer.md Update docs for pipeline initialize() methods (#11221) 2022-08-03 16:53:02 +02:00
sentencizer.md Update overwrite and scorer in API docs (#9384) 2021-10-11 10:35:07 +02:00
span-resolver.md Add experimental coref docs (#11291) 2022-09-27 18:11:23 +09:00
span.md remove new v2 tags (#11780) 2022-11-14 17:41:01 +09:00
spancategorizer.md Update docs for pipeline initialize() methods (#11221) 2022-08-03 16:53:02 +02:00
spangroup.md Fix SpanGroup and Span typing (#12009) 2022-12-21 18:54:27 +01:00
spanruler.md Enable fuzzy text matching in Matcher (#11359) 2023-01-10 10:36:17 +01:00
stringstore.md Fix misspelt keyword in StringStore example 2022-05-29 10:49:19 +01:00
tagger.md Update docs for pipeline initialize() methods (#11221) 2022-08-03 16:53:02 +02:00
textcategorizer.md Update textcat scorer threshold behavior (#11696) 2022-11-02 15:35:04 +01:00
tok2vec.md Update docs for pipeline initialize() methods (#11221) 2022-08-03 16:53:02 +02:00
token.md remove new v2 tags (#11780) 2022-11-14 17:41:01 +09:00
tokenizer.md Add tokenizer option to allow Matcher handling for all rules (#10452) 2022-03-24 13:21:32 +01:00
top-level.md improve ux for displacy when the serve port is in use (#11948) 2023-01-10 15:52:57 +09:00
transformer.md Update docs for pipeline initialize() methods (#11221) 2022-08-03 16:53:02 +02:00
vectors.md correct ndim in docs (#11869) 2022-11-25 11:31:28 +01:00
vocab.md Fix typo in vocab.md table (#11908) 2022-12-01 13:06:28 +01:00