From 45675e1cbb5e110a626b1a404c3fd3c04f2bec85 Mon Sep 17 00:00:00 2001
From: Adriane Boyd
Date: Fri, 2 Dec 2022 08:58:20 +0100
Subject: [PATCH] Add initial docs

---
 website/docs/api/matcher.md               | 31 +++++++++++-------
 website/docs/usage/rule-based-matching.md | 40 +++++++++++++++++++++++
 2 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md
index cd7bfa070..122fab243 100644
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -86,14 +86,20 @@ it compares to another value.
 > ]
 > ```

-| Attribute | Description |
-| -------------------------- | -------------------------------------------------------------------------------------------------------- |
-| `IN` | Attribute value is member of a list. ~~Any~~ |
-| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
-| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ |
-| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ |
-| `INTERSECTS` | Attribute value (for `MORPH` or custom list attribute) has a non-empty intersection with a list. ~~Any~~ |
-| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
+| Attribute | Description |
+| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `REGEX` | Attribute value matches the regular expression at any position in the string. ~~Any~~ |
+| `FUZZY` | Attribute value matches if the `fuzzy_compare` method matches for `(value, pattern, -1)`. The default method allows a Levenshtein edit distance of at least 2 and up to 20% of the pattern string length. ~~Any~~ |
+| `FUZZY1`, `FUZZY2`, ... `FUZZY9` | Attribute value matches if the `fuzzy_compare` method matches for `(value, pattern, N)`. The default method allows a Levenshtein edit distance of at most N (1-9). ~~Any~~ |
+| `IN` | Attribute value is member of a list. ~~Any~~ |
+| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ |
+| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ |
+| `IS_SUPERSET` | Attribute value (for `MORPH` or custom list attributes) is a superset of a list. ~~Any~~ |
+| `INTERSECTS` | Attribute value (for `MORPH` or custom list attribute) has a non-empty intersection with a list. ~~Any~~ |
+| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
+
+As of spaCy v3.5, `REGEX` and `FUZZY` can be used in combination with `IN` and
+`NOT_IN`.

 ## Matcher.\_\_init\_\_ {#init tag="method"}

@@ -109,10 +115,11 @@ string where an integer is expected) or unexpected property names.
 > matcher = Matcher(nlp.vocab)
 > ```

-| Name | Description |
-| ---------- | ----------------------------------------------------------------------------------------------------- |
-| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ |
-| `validate` | Validate all patterns added to this matcher. ~~bool~~ |
+| Name | Description |
+| --------------- | ----------------------------------------------------------------------------------------------------- |
+| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ |
+| `validate` | Validate all patterns added to this matcher. ~~bool~~ |
+| `fuzzy_compare` | The comparison method used for the `FUZZY` operators. ~~Callable[[str, str, int], bool]~~ |

 ## Matcher.\_\_call\_\_ {#call tag="method"}

diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index ad8ea27f3..ce1f3672d 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -364,6 +364,46 @@ else:
+#### Fuzzy matching {#fuzzy new="3.5"}
+
+Fuzzy matching allows you to match tokens with alternate spellings, typos, etc.
+without specifying every possible variant.
+
+```python
+# Matches "favourite", "favorites", "gavorite", "theatre", "theatr", ...
+pattern = [{"TEXT": {"FUZZY": "favorite"}},
+           {"TEXT": {"FUZZY": "theater"}}]
+```
+
+The `FUZZY` attribute allows fuzzy matches for any attribute string value,
+including custom attributes. Just like `REGEX`, it always needs to be applied to
+an attribute like `TEXT` or `LOWER`. By default `FUZZY` allows a Levenshtein
+edit distance of at least 2 and up to 20% of the pattern string length. Using
+the more specific attributes `FUZZY1`..`FUZZY9` you can specify the maximum
+allowed edit distance directly.
+
+```python
+# Match lowercase with fuzzy matching (allows 2 edits)
+pattern = [{"LOWER": {"FUZZY": "definitely"}}]
+
+# Match custom attribute values with fuzzy matching (allows 2 edits)
+pattern = [{"_": {"country": {"FUZZY": "Kyrgyzstan"}}}]
+
+# Match with exact Levenshtein edit distance limits (allows 3 edits)
+pattern = [{"_": {"country": {"FUZZY3": "Kyrgyzstan"}}}]
+```
+
+#### Regex and fuzzy matching with lists {#regex-fuzzy-lists new="3.5"}
+
+Starting in spaCy v3.5, both `REGEX` and `FUZZY` can be combined with the
+attributes `IN` and `NOT_IN`:
+
+```python
+pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]
+
+pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]
+```
+
 ---

 #### Operators and quantifiers {#quantifiers}
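
Note on the `fuzzy_compare` contract documented in this patch: the tables describe a callable taking `(value, pattern, N)` and returning a bool, where `N = -1` selects the default threshold and `N` in 1-9 is an explicit maximum edit distance. The following is a minimal pure-Python sketch of that contract, not spaCy's actual implementation. The names `levenshtein` and `fuzzy_compare_sketch` are hypothetical, and rounding the 20% figure down is an assumption, since the patch does not say how the fraction is rounded.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


def fuzzy_compare_sketch(value: str, pattern: str, max_edits: int) -> bool:
    """Hypothetical stand-in with the documented (value, pattern, N) -> bool shape."""
    if max_edits < 0:
        # Default threshold as described in the patch: at least 2 edits,
        # and up to 20% of the pattern length (rounded down: an assumption).
        max_edits = max(2, int(0.2 * len(pattern)))
    return levenshtein(value, pattern) <= max_edits


if __name__ == "__main__":
    for value, pattern, n in [
        ("favourite", "favorite", -1),
        ("Kirgizstan", "Kyrgyzstan", 1),
        ("Kirgizstan", "Kyrgyzstan", -1),
    ]:
        print(value, pattern, n, fuzzy_compare_sketch(value, pattern, n))
```

Going by the `Matcher.__init__` table in this patch, a function with this signature would presumably be passed as the `fuzzy_compare` argument when constructing the matcher; that wiring is not exercised here.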