Improve regex matching docs [ci skip]

Ines Montani 2019-08-19 13:59:41 +02:00
parent 8b738a9f35
commit 66aba2d676


@@ -249,37 +249,61 @@ pattern = [{"TEXT": {"REGEX": "^[Uu](\\.?|nited)$"}},
 {"LOWER": "president"}]
 ```
-`'REGEX'` as an operator (instead of a top-level property that only matches on
-the token's text) allows defining rules for any string value, including custom
-attributes:
+The `REGEX` operator allows defining rules for any attribute string value,
+including custom attributes. It always needs to be applied to an attribute like
+`TEXT`, `LOWER` or `TAG`:
 ```python
+# Match different spellings of token texts
+pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]
 # Match tokens with fine-grained POS tags starting with 'V'
 pattern = [{"TAG": {"REGEX": "^V"}}]
 # Match custom attribute values with regular expressions
-pattern = [{"_": {"country": {"REGEX": "^[Uu](\\.?|nited) ?[Ss](\\.?|tates)$"}}}]
+pattern = [{"_": {"country": {"REGEX": "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"}}}]
 ```
-<Infobox title="Regular expressions in older versions" variant="warning">
-Versions before v2.1.0 don't yet support the `REGEX` operator. A simple solution
-is to match a regular expression on the `Doc.text` with `re.finditer` and use
-the [`Doc.char_span`](/api/doc#char_span) method to create a `Span` from the
-character indices of the match.
-You can also use the regular expression by converting it to a **binary token
-flag**. [`Vocab.add_flag`](/api/vocab#add_flag) returns a flag ID which you can
-use as a key of a token match pattern.
-```python
-definitely_flag = lambda text: bool(re.compile(r"deff?in[ia]tely").match(text))
-IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)
-pattern = [{IS_DEFINITELY: True}]
-```
+<Infobox title="Important note" variant="warning">
+When using the `REGEX` operator, keep in mind that it operates on **single
+tokens**, not the whole text. Each expression you provide will be matched on a
+token. If you need to match on the whole text instead, see the details on
+[regex matching on the whole text](#regex-text).
 </Infobox>
+##### Matching regular expressions on the full text {#regex-text}
+If your expressions apply to multiple tokens, a simple solution is to match on
+the `doc.text` with `re.finditer` and use the
+[`Doc.char_span`](/api/doc#char_span) method to create a `Span` from the
+character indices of the match. If the matched characters don't map to one or
+more valid tokens, `Doc.char_span` returns `None`.
+> #### What's a valid token sequence?
+>
+> In the example, the expression will also match `"US"` in `"USA"`. However,
+> `"USA"` is a single token and `Span` objects are **sequences of tokens**. So
+> `"US"` cannot be its own span, because it does not end on a token boundary.
+```python
+### {executable="true"}
+import spacy
+import re
+nlp = spacy.load("en_core_web_sm")
+doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")
+expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
+for match in re.finditer(expression, doc.text):
+    start, end = match.span()
+    span = doc.char_span(start, end)
+    # This is a Span object or None if match doesn't map to valid token sequence
+    if span is not None:
+        print("Found match:", span.text)
+```
 #### Operators and quantifiers {#quantifiers}
 The matcher also lets you use quantifiers, specified as the `'OP'` key.
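
The token-boundary rule behind `Doc.char_span`, described in the full-text regex section of this change, can be illustrated with the standard library alone. The sketch below is not spaCy's implementation: it assumes plain whitespace tokenization (much simpler than spaCy's tokenizer), and the helpers `token_offsets` and `aligned_span` are hypothetical names introduced only for this illustration.

```python
import re


def token_offsets(text):
    """Return (start, end) character offsets of whitespace-delimited tokens.

    Assumption: whitespace tokenization stands in for a real tokenizer.
    """
    return [m.span() for m in re.finditer(r"\S+", text)]


def aligned_span(offsets, start, end):
    """Map a character range to a (first_token, end_token_exclusive) pair.

    Returns None unless the range starts exactly where a token starts
    and ends exactly where a token ends.
    """
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i for i, (_, e) in enumerate(offsets)}
    if start in starts and end in ends and starts[start] <= ends[end]:
        return starts[start], ends[end] + 1
    return None


text = "The USA is also called the United States or US"
offsets = token_offsets(text)
expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"

results = []
for match in re.finditer(expression, text):
    start, end = match.span()
    # None means the match does not line up with token boundaries
    results.append((match.group(), aligned_span(offsets, start, end)))

for matched_text, token_range in results:
    print(repr(matched_text), "->", token_range)
```

Here the `"US"` inside `"USA"` is found by the regular expression but rejected because it ends in the middle of a token, while `"United States"` and the standalone `"US"` map cleanly onto whole tokens.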