{"LOWER": "president"}]
|
{"LOWER": "president"}]
|
||||||
```
|
```
|
||||||
|
|
||||||
The `REGEX` operator allows defining rules for any attribute string value,
including custom attributes. It always needs to be applied to an attribute like
`TEXT`, `LOWER` or `TAG`:

```python
# Match different spellings of token texts
pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]

# Match tokens with fine-grained POS tags starting with 'V'
pattern = [{"TAG": {"REGEX": "^V"}}]

# Match custom attribute values with regular expressions
pattern = [{"_": {"country": {"REGEX": "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"}}}]
```

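The patterns above are only dictionaries; to produce matches, they still need
to be added to a [`Matcher`](/api/matcher). The snippet below is a minimal
usage sketch, assuming the spaCy v2.x `Matcher.add` signature (a match ID, an
optional callback, then one or more patterns). The `country` extension is
registered with a default value here purely for illustration:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# REGEX pattern applied to the token text
matcher.add("DEFINITELY", None, [{"TEXT": {"REGEX": "deff?in[ia]tely"}}])

# Custom attributes have to be registered before a pattern can refer to them
Token.set_extension("country", default="", force=True)
matcher.add("COUNTRY", None,
            [{"_": {"country": {"REGEX": "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"}}}])

doc = nlp("I deffinately need a holiday.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # deffinately
```
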
<Infobox title="Regular expressions in older versions" variant="warning">
|
<Infobox title="Important note" variant="warning">
|
||||||
|
|
||||||
Versions before v2.1.0 don't yet support the `REGEX` operator. A simple solution
|
When using the `REGEX` operator, keep in mind that it operates on **single
|
||||||
is to match a regular expression on the `Doc.text` with `re.finditer` and use
|
tokens**, not the whole text. Each expression you provide will be matched on a
|
||||||
the [`Doc.char_span`](/api/doc#char_span) method to create a `Span` from the
|
token. If you need to match on the whole text instead, see the details on
|
||||||
character indices of the match.
|
[regex matching on the whole text](#regex-text).
|
||||||
|
|
||||||
You can also use the regular expression by converting it to a **binary token
|
|
||||||
flag**. [`Vocab.add_flag`](/api/vocab#add_flag) returns a flag ID which you can
|
|
||||||
use as a key of a token match pattern.
|
|
||||||
|
|
||||||
```python
|
|
||||||
definitely_flag = lambda text: bool(re.compile(r"deff?in[ia]tely").match(text))
|
|
||||||
IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)
|
|
||||||
pattern = [{IS_DEFINITELY: True}]
|
|
||||||
```
|
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
|
##### Matching regular expressions on the full text {#regex-text}

If your expressions apply to multiple tokens, a simple solution is to match on
the `doc.text` with `re.finditer` and use the
[`Doc.char_span`](/api/doc#char_span) method to create a `Span` from the
character indices of the match. If the matched characters don't map to one or
more valid tokens, `Doc.char_span` returns `None`.

> #### What's a valid token sequence?
>
> In the example, the expression will also match `"US"` in `"USA"`. However,
> `"USA"` is a single token and `Span` objects are **sequences of tokens**. So
> `"US"` cannot be its own span, because it does not end on a token boundary.

```python
### {executable="true"}
import spacy
import re

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)
```

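If `Doc.char_span` returns `None`, one possible workaround is to widen the
character offsets to the nearest token boundaries yourself. The helper below is
a rough sketch (the name `expand_to_token_span` is hypothetical): it collects
all tokens overlapping the matched character range and slices the `Doc` to get
a `Span`. Note that this can pull in more text than the expression matched,
e.g. the whole token `"USA"` for a match on `"US"`.

```python
def expand_to_token_span(doc, start_char, end_char):
    # Tokens that overlap the character range [start_char, end_char)
    tokens = [t for t in doc if t.idx + len(t) > start_char and t.idx < end_char]
    if not tokens:
        return None
    # Slicing a Doc with token indices returns a Span
    return doc[tokens[0].i : tokens[-1].i + 1]
```
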
#### Operators and quantifiers {#quantifiers}

The matcher also lets you use quantifiers, specified as the `'OP'` key.

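For example, `"?"` makes a token optional, `"!"` negates it (the token must
not match), and `"+"` and `"*"` require it to match one or more, or zero or
more times. A minimal sketch of an optional token:

```python
# Matches "Hello world" as well as "Hello, world": the punctuation is optional
pattern = [{"LOWER": "hello"},
           {"IS_PUNCT": True, "OP": "?"},
           {"LOWER": "world"}]
```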