mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-13 18:56:36 +03:00
Improve regex matching docs [ci skip]
parent
8b738a9f35
commit
66aba2d676
|
@ -249,37 +249,61 @@ pattern = [{"TEXT": {"REGEX": "^[Uu](\\.?|nited)$"}},
|
||||||
{"LOWER": "president"}]
|
{"LOWER": "president"}]
|
||||||
```
|
```
|
||||||
|
|
||||||
`'REGEX'` as an operator (instead of a top-level property that only matches on
|
The `REGEX` operator allows defining rules for any attribute string value,
|
||||||
the token's text) allows defining rules for any string value, including custom
|
including custom attributes. It always needs to be applied to an attribute like
|
||||||
attributes:
|
`TEXT`, `LOWER` or `TAG`:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
|
# Match different spellings of token texts
|
||||||
|
pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]
|
||||||
|
|
||||||
# Match tokens with fine-grained POS tags starting with 'V'
|
# Match tokens with fine-grained POS tags starting with 'V'
|
||||||
pattern = [{"TAG": {"REGEX": "^V"}}]
|
pattern = [{"TAG": {"REGEX": "^V"}}]
|
||||||
|
|
||||||
# Match custom attribute values with regular expressions
|
# Match custom attribute values with regular expressions
|
||||||
pattern = [{"_": {"country": {"REGEX": "^[Uu](\\.?|nited) ?[Ss](\\.?|tates)$"}}}]
|
pattern = [{"_": {"country": {"REGEX": "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"}}}]
|
||||||
```
|
```
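
A note on escaping in the patterns above: they are written as regular Python strings, so `\\.` is the string escape for the two-character regex `\.` (a literal dot). The sketch below uses only the standard `re` module rather than the `Matcher`, and the sample values are hypothetical. It is meant to illustrate the escaping and which values the expression accepts, not how spaCy evaluates it internally.

```python
import re

# Regular string: "\\." is a Python escape that yields the regex \.
regular = "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"
# Raw string: the backslash is written only once
raw = r"^[Uu](nited|\.?) ?[Ss](tates|\.?)$"
same_pattern = regular == raw

# Hypothetical attribute values, for illustration only
values = ["United States", "U.S.", "US", "u.s.", "USA", "United Kingdom"]
matches = [value for value in values if re.search(raw, value)]

# Doubling the backslash inside a raw string changes the meaning:
# the regex then expects a literal backslash followed by an optional character
doubled = r"^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"
doubled_hit = re.search(doubled, "U.S.") is not None

print(same_pattern, matches, doubled_hit)
```

Both spellings produce the same expression, so the choice is a matter of readability. Mixing them, that is, keeping the doubled backslash inside a raw string, silently produces a different pattern.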
|
||||||
|
|
||||||
<Infobox title="Regular expressions in older versions" variant="warning">
|
<Infobox title="Important note" variant="warning">
|
||||||
|
|
||||||
Versions before v2.1.0 don't yet support the `REGEX` operator. A simple solution
|
When using the `REGEX` operator, keep in mind that it operates on **single
|
||||||
is to match a regular expression on the `Doc.text` with `re.finditer` and use
|
tokens**, not the whole text. Each expression you provide will be matched on a
|
||||||
the [`Doc.char_span`](/api/doc#char_span) method to create a `Span` from the
|
token. If you need to match on the whole text instead, see the details on
|
||||||
character indices of the match.
|
[regex matching on the whole text](#regex-text).
|
||||||
|
|
||||||
You can also use the regular expression by converting it to a **binary token
|
|
||||||
flag**. [`Vocab.add_flag`](/api/vocab#add_flag) returns a flag ID which you can
|
|
||||||
use as a key of a token match pattern.
|
|
||||||
|
|
||||||
```python
|
|
||||||
definitely_flag = lambda text: bool(re.compile(r"deff?in[ia]tely").match(text))
|
|
||||||
IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)
|
|
||||||
pattern = [{IS_DEFINITELY: True}]
|
|
||||||
```
|
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
|
##### Matching regular expressions on the full text {#regex-text}
|
||||||
|
|
||||||
|
If your expressions apply to multiple tokens, a simple solution is to match on
|
||||||
|
the `doc.text` with `re.finditer` and use the
|
||||||
|
[`Doc.char_span`](/api/doc#char_span) method to create a `Span` from the
|
||||||
|
character indices of the match. If the matched characters don't map to one or
|
||||||
|
more valid tokens, `Doc.char_span` returns `None`.
|
||||||
|
|
||||||
|
> #### What's a valid token sequence?
|
||||||
|
>
|
||||||
|
> In the example, the expression will also match `"US"` in `"USA"`. However,
|
||||||
|
> `"USA"` is a single token and `Span` objects are **sequences of tokens**. So
|
||||||
|
> `"US"` cannot be its own span, because it does not end on a token boundary.
|
||||||
|
|
||||||
|
```python
|
||||||
|
### {executable="true"}
|
||||||
|
import spacy
|
||||||
|
import re
|
||||||
|
|
||||||
|
nlp = spacy.load("en_core_web_sm")
|
||||||
|
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")
|
||||||
|
|
||||||
|
expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
|
||||||
|
for match in re.finditer(expression, doc.text):
|
||||||
|
    start, end = match.span()
|
||||||
|
    span = doc.char_span(start, end)
|
||||||
|
    # This is a Span object or None if match doesn't map to valid token sequence
|
||||||
|
    if span is not None:
|
||||||
|
        print("Found match:", span.text)
|
||||||
|
```
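
The token-boundary rule described above can be sketched in plain Python. This is an illustration only: the whitespace tokenizer and the helper `is_token_aligned` are hypothetical stand-ins, not spaCy's tokenizer or the implementation of `Doc.char_span`, and the sample sentence is made up.

```python
import re

text = "The USA is also called the US or U.S."

# Stand-in tokenizer (assumption): every run of non-whitespace is one token
token_offsets = [m.span() for m in re.finditer(r"\S+", text)]
token_starts = {start for start, _ in token_offsets}
token_ends = {end for _, end in token_offsets}

def is_token_aligned(start, end):
    """Hypothetical check: does the character range start and end on token boundaries?"""
    return start in token_starts and end in token_ends

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
kept, skipped = [], []
for match in re.finditer(expression, text):
    start, end = match.span()
    target = kept if is_token_aligned(start, end) else skipped
    target.append((start, end, match.group()))

print("Aligned with tokens:", kept)
print("Inside a token:", skipped)
```

Under these assumptions, the `"US"` found inside `"USA"` ends in the middle of a token and is skipped, while the standalone `"US"` and `"U.S."` line up with token boundaries and are kept.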
|
||||||
|
|
||||||
#### Operators and quantifiers {#quantifiers}
|
#### Operators and quantifiers {#quantifiers}
|
||||||
|
|
||||||
The matcher also lets you use quantifiers, specified as the `'OP'` key.
|
The matcher also lets you use quantifiers, specified as the `'OP'` key.
|
||||||
|
|