Add initial docs

Adriane Boyd 2022-12-02 08:58:20 +01:00
parent 27a4925f8d
commit 45675e1cbb
2 changed files with 59 additions and 12 deletions

@@ -87,7 +87,10 @@ it compares to another value.
> ```
| Attribute | Description |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `REGEX` | Attribute value matches the regular expression at any position in the string. ~~Any~~ |
| `FUZZY` | Attribute value matches if the `fuzzy_compare` method matches for `(value, pattern, -1)`. The default method allows a Levenshtein edit distance of up to 2, or up to 20% of the pattern string length if that is larger. ~~Any~~ |
| `FUZZY1`, `FUZZY2`, ... `FUZZY9` | Attribute value matches if the `fuzzy_compare` method matches for `(value, pattern, N)`. The default method allows a Levenshtein edit distance of at most N (1-9). ~~Any~~ |
| `IN` | Attribute value is a member of a list. ~~Any~~ |
| `NOT_IN` | Attribute value is _not_ a member of a list. ~~Any~~ |
| `IS_SUBSET` | Attribute value (for `MORPH` or custom list attributes) is a subset of a list. ~~Any~~ |
@@ -95,6 +98,9 @@ it compares to another value.
| `INTERSECTS` | Attribute value (for `MORPH` or custom list attribute) has a non-empty intersection with a list. ~~Any~~ |
| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ |
As of spaCy v3.5, `REGEX` and `FUZZY` can be used in combination with `IN` and
`NOT_IN`.
## Matcher.\_\_init\_\_ {#init tag="method"}
Create the rule-based `Matcher`. If `validate=True` is set, all patterns added
@@ -110,9 +116,10 @@ string where an integer is expected) or unexpected property names.
> ```
| Name | Description |
| --------------- | ----------------------------------------------------------------------------------------------------- |
| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ |
| `validate` | Validate all patterns added to this matcher. ~~bool~~ |
| `fuzzy_compare` | The comparison method used for the `FUZZY` operators. ~~Callable[[str, str, int], bool]~~ |
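The `fuzzy_compare` row above only fixes the callable's type, `Callable[[str, str, int], bool]`. As a minimal sketch of what a custom comparison with that signature could look like, the hypothetical function below requires the first characters to agree and then estimates the number of edits with `difflib` from the standard library. The name `first_letter_compare`, the fallback budget of 2 for `-1`, and the edit estimate are all assumptions made for illustration; this is not spaCy's default method.

```python
from difflib import SequenceMatcher
from typing import Callable


def first_letter_compare(value: str, pattern: str, fuzzy: int) -> bool:
    """Hypothetical comparison: same first letter, then a small edit budget."""
    if not value or not pattern:
        return False
    if value[0].lower() != pattern[0].lower():
        return False
    # Plain FUZZY passes -1; FUZZY1..FUZZY9 pass N. The fallback of 2 is an assumption.
    budget = fuzzy if fuzzy >= 0 else 2
    blocks = SequenceMatcher(None, value, pattern).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    # Rough estimate (not a true Levenshtein distance): characters of the
    # longer string that are not covered by any matching block.
    return max(len(value), len(pattern)) - matched <= budget


# The function satisfies the documented type of `fuzzy_compare`.
compare: Callable[[str, str, int], bool] = first_letter_compare

print(compare("favourite", "favorite", -1))
print(compare("gavorite", "favorite", -1))
```

A callable like this would be passed as the `fuzzy_compare` argument when creating the `Matcher`.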
## Matcher.\_\_call\_\_ {#call tag="method"}

@@ -364,6 +364,46 @@ else:
</Accordion>
#### Fuzzy matching {#fuzzy new="3.5"}
Fuzzy matching allows you to match tokens with alternate spellings, typos, etc.
without specifying every possible variant.
```python
# Matches "favourite", "favorites", "gavorite", "theatre", "theatr", ...
pattern = [{"TEXT": {"FUZZY": "favorite"}},
{"TEXT": {"FUZZY": "theater"}}]
```
The `FUZZY` attribute allows fuzzy matches for any attribute string value,
including custom attributes. Just like `REGEX`, it always needs to be applied to
an attribute like `TEXT` or `LOWER`. By default, `FUZZY` allows a Levenshtein
edit distance of up to 2, or up to 20% of the pattern string length if that is
larger. With the more specific attributes `FUZZY1`..`FUZZY9`, you can set the
maximum allowed edit distance directly.
```python
# Match lowercase with fuzzy matching (allows up to 2 edits)
pattern = [{"LOWER": {"FUZZY": "definitely"}}]
# Match custom attribute values with fuzzy matching (allows up to 2 edits)
pattern = [{"_": {"country": {"FUZZY": "Kyrgyzstan"}}}]
# Match with an explicit Levenshtein edit distance limit (allows up to 3 edits)
pattern = [{"_": {"country": {"FUZZY3": "Kyrgyzstan"}}}]
```
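To make the edit limits concrete, here is a minimal pure-Python sketch of the rule described above. It is not spaCy's implementation: the helper names `levenshtein`, `max_edits` and `fuzzy_match` are hypothetical, and rounding the 20% value with `round` is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def max_edits(pattern: str, fuzzy: int = -1) -> int:
    """Edit budget: N for FUZZYN, otherwise the larger of 2 and 20% of the length."""
    if fuzzy >= 0:
        return fuzzy
    # Assumption: the 20% value is rounded to the nearest integer.
    return max(2, round(0.2 * len(pattern)))


def fuzzy_match(value: str, pattern: str, fuzzy: int = -1) -> bool:
    return levenshtein(value, pattern) <= max_edits(pattern, fuzzy)


print(fuzzy_match("favourite", "favorite"))         # FUZZY
print(fuzzy_match("Kirgizstan", "Kyrgyzstan", 1))   # FUZZY1
```

Under these assumptions, a 10-character pattern such as `"definitely"` gets a budget of 2 edits with plain `FUZZY`, which agrees with the comments in the examples above.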
#### Regex and fuzzy matching with lists {#regex-fuzzy-lists new="3.5"}
Starting in spaCy v3.5, both `REGEX` and `FUZZY` can be combined with the
attributes `IN` and `NOT_IN`:
```python
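# Match text that fuzzily matches any of the listed strings
pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]
# Match text that matches none of the listed regular expressions
pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]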
```
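The list semantics can be sketched in plain Python, assuming that `IN` means "at least one entry matches" and `NOT_IN` means "no entry matches", and that `REGEX` searches at any position in the string as described in the API table. The helpers `regex_in` and `regex_not_in` are hypothetical names used only for this illustration.

```python
import re
from typing import List


def regex_in(value: str, patterns: List[str]) -> bool:
    """True if any regular expression is found anywhere in the value."""
    return any(re.search(pattern, value) for pattern in patterns)


def regex_not_in(value: str, patterns: List[str]) -> bool:
    """True if none of the regular expressions is found in the value."""
    return not regex_in(value, patterns)


patterns = ["^awe(some)?$", "^wonder(ful)?"]

for text in ["awesome", "awesomeness", "wonderfully", "cool"]:
    print(text, regex_not_in(text, patterns))
```

Note that the second expression has no `$` anchor, so under these assumptions it also excludes longer words such as "wonderfully", while "awesomeness" is not excluded by the fully anchored first expression.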
---
#### Operators and quantifiers {#quantifiers}