* Implement new API for {Phrase}Matcher.add (backwards-compatible)
* Update docs
* Also update DependencyMatcher.add
* Update internals
* Rewrite tests to use new API
* Add basic check for common mistake
Raise error with suggestion if user likely passed in a pattern instead of a list of patterns
* Fix typo [ci skip]
| title | teaser | tag | source | new |
|---|---|---|---|---|
| PhraseMatcher | Match sequences of tokens, based on documents | class | spacy/matcher/phrasematcher.pyx | 2 |
The PhraseMatcher lets you efficiently match large terminology lists. While
the Matcher lets you match sequences based on lists of token
descriptions, the PhraseMatcher accepts match patterns in the form of Doc
objects.
PhraseMatcher.__init__
Create the rule-based PhraseMatcher. Setting a different attr to match on
will change the token attributes that will be compared to determine a match. By
default, the incoming Doc is checked for sequences of tokens with the same
ORTH value, i.e. the verbatim token text. Matching on the attribute LOWER
will result in case-insensitive matching, since only the lowercase token texts
are compared. In theory, it's also possible to match on sequences of the same
part-of-speech tags or dependency labels.
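To make the effect of the chosen attribute concrete, here is a minimal plain-Python sketch. It is a toy stand-in, not spaCy's implementation: `find_phrase` and its `key` parameter are hypothetical names, and lists of strings stand in for `Doc` objects. The identity key plays the role of ORTH and `str.lower` plays the role of LOWER.

```python
# Toy illustration only: none of these names come from spaCy.

def find_phrase(tokens, phrase, key=lambda text: text):
    """Return (start, end) pairs where `phrase` occurs in `tokens`,
    comparing tokens by the value that `key` computes for each one."""
    wanted = [key(t) for t in phrase]
    seen = [key(t) for t in tokens]
    n = len(wanted)
    if n == 0:
        return []
    return [(i, i + n) for i in range(len(seen) - n + 1) if seen[i:i + n] == wanted]


tokens = ["BARACK", "OBAMA", "met", "Barack", "Obama"]
phrase = ["Barack", "Obama"]

# Verbatim comparison, analogous to matching on ORTH
exact = find_phrase(tokens, phrase)
# Lowercased comparison, analogous to matching on LOWER
lowered = find_phrase(tokens, phrase, key=str.lower)
```

With the verbatim key only the last two tokens match, while the lowercased key also picks up the upper-case occurrence at the start.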
If validate=True is set, additional validation is performed when patterns are
added. At the moment, it will check whether a Doc has attributes assigned that
aren't necessary to produce the matches (for example, part-of-speech tags if the
PhraseMatcher matches on the token text). Since this can often lead to
significantly worse performance when creating the pattern, a UserWarning will
be shown.
Example
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
| Name | Type | Description |
|---|---|---|
| vocab | Vocab | The vocabulary object, which must be shared with the documents the matcher will operate on. |
| attr 2.1 | int / unicode | The token attribute to match on. Defaults to ORTH, i.e. the verbatim token text. |
| validate 2.1 | bool | Validate patterns added to the matcher. |
| RETURNS | PhraseMatcher | The newly constructed object. |
As of v2.1, the PhraseMatcher doesn't have a phrase length limit anymore, so
the max_length argument is now deprecated.
PhraseMatcher.__call__
Find all token sequences matching the supplied patterns on the Doc.
Example
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp("Barack Obama"))
doc = nlp("Barack Obama lifts America one last time in emotional farewell")
matches = matcher(doc)
| Name | Type | Description |
|---|---|---|
| doc | Doc | The document to match over. |
| RETURNS | list | A list of (match_id, start, end) tuples, describing the matches. A match tuple describes a span doc[start:end]. The match_id is the ID of the added match pattern. |
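As a rough illustration of how the returned tuples map back onto the document, the sketch below uses a plain list of strings in place of a `Doc` and a made-up integer as the match ID. Both are assumptions for the sake of a self-contained example, and `spans` is a hypothetical name.

```python
# Toy data standing in for a tokenized Doc and the matcher's return value.
tokens = ["Barack", "Obama", "lifts", "America", "one", "last", "time"]
matches = [(1, 0, 2)]  # hypothetical (match_id, start, end) tuple

# The slice tokens[start:end] excludes `end`, so it covers exactly the
# matched tokens, just like doc[start:end] in the description above.
spans = [(match_id, tokens[start:end]) for match_id, start, end in matches]
```

Here `spans` pairs the ID with the two matched tokens, "Barack" and "Obama".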
PhraseMatcher.pipe
Match a stream of documents, yielding them in turn.
Example
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
# matcher.pipe expects Doc objects, so process the raw texts first
for doc in matcher.pipe(nlp.pipe(texts), batch_size=50):
    pass
| Name | Type | Description |
|---|---|---|
| docs | iterable | A stream of documents. |
| batch_size | int | The number of documents to accumulate into a working set. |
| YIELDS | Doc | Documents, in order. |
PhraseMatcher.__len__
Get the number of rules added to the matcher. Note that this only returns the number of rules (identical with the number of IDs), not the number of individual patterns.
Example
matcher = PhraseMatcher(nlp.vocab)
assert len(matcher) == 0
matcher.add("OBAMA", None, nlp("Barack Obama"))
assert len(matcher) == 1
| Name | Type | Description |
|---|---|---|
| RETURNS | int | The number of rules. |
PhraseMatcher.__contains__
Check whether the matcher contains rules for a match ID.
Example
matcher = PhraseMatcher(nlp.vocab)
assert "OBAMA" not in matcher
matcher.add("OBAMA", None, nlp("Barack Obama"))
assert "OBAMA" in matcher
| Name | Type | Description |
|---|---|---|
| key | unicode | The match ID. |
| RETURNS | bool | Whether the matcher contains rules for this match ID. |
PhraseMatcher.add
Add a rule to the matcher, consisting of an ID key, one or more patterns, and a
callback function to act on the matches. The callback function will receive the
arguments matcher, doc, i and matches. If a pattern already exists for
the given ID, the patterns will be extended. An on_match callback will be
overwritten.
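The bookkeeping described here (patterns accumulate per ID, the callback is replaced) can be sketched with two dictionaries. `ToyRuleStore` is a hypothetical stand-in written for this illustration, with strings in place of `Doc` patterns. It says nothing about how spaCy stores rules internally.

```python
class ToyRuleStore:
    """Hypothetical stand-in that mimics only the bookkeeping described above."""

    def __init__(self):
        self._patterns = {}   # match ID -> list of patterns
        self._callbacks = {}  # match ID -> callback or None

    def add(self, key, on_match, *patterns):
        # Patterns for an existing ID are extended ...
        self._patterns.setdefault(key, []).extend(patterns)
        # ... while the callback is replaced by the most recent one.
        self._callbacks[key] = on_match

    def __len__(self):
        # Counts IDs (rules), not individual patterns.
        return len(self._patterns)

    def __contains__(self, key):
        return key in self._patterns


def first_callback(matcher, doc, i, matches):
    pass


def second_callback(matcher, doc, i, matches):
    pass


store = ToyRuleStore()
store.add("HEALTH", first_callback, "health care reform")
store.add("HEALTH", second_callback, "healthcare reform")
```

After both calls the store holds one rule with two patterns, and only `second_callback` remains registered for "HEALTH".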
Example
def on_match(matcher, doc, i, matches):
    print('Matched!', matches)

matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", on_match, nlp("Barack Obama"))
matcher.add("HEALTH", on_match, nlp("health care reform"), nlp("healthcare reform"))
doc = nlp("Barack Obama urges Congress to find courage to defend his healthcare reforms")
matches = matcher(doc)
| Name | Type | Description |
|---|---|---|
| match_id | unicode | An ID for the thing you're matching. |
| on_match | callable or None | Callback function to act on matches. Takes the arguments matcher, doc, i and matches. |
| *docs | Doc | Doc objects of the phrases to match. |
As of spaCy 2.2.2, PhraseMatcher.add also supports the new API, which will
become the default in the future. The Doc patterns are now the second argument
and a list (instead of a variable number of arguments). The on_match callback
becomes an optional keyword argument.
patterns = [nlp("health care reform"), nlp("healthcare reform")]
- matcher.add("HEALTH", None, *patterns)
+ matcher.add("HEALTH", patterns)
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
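One way a single method could accept both call styles, and flag the common mistake of passing one pattern instead of a list, is sketched below. This is an assumption about the general approach, not spaCy's actual code: `normalize_add_args` is a hypothetical helper, strings stand in for `Doc` patterns, and the error type and message are invented for the example.

```python
def normalize_add_args(key, second, *rest, on_match=None):
    """Return (key, patterns, on_match) for either call style (toy sketch)."""
    if second is None or callable(second):
        # Old style: add(key, on_match, *patterns)
        return key, list(rest), second
    if not isinstance(second, (list, tuple)):
        # Likely mistake: a single pattern instead of a list of patterns
        raise ValueError(
            f"Expected a list of patterns for '{key}'. "
            f"Did you mean add('{key}', [pattern])?"
        )
    # New style: add(key, patterns, on_match=None)
    return key, list(second), on_match


def on_match(matcher, doc, i, matches):
    pass


old = normalize_add_args("HEALTH", on_match, "health care reform", "healthcare reform")
new = normalize_add_args("HEALTH", ["health care reform", "healthcare reform"], on_match=on_match)
```

Both calls normalize to the same triple, so code after this point only has to deal with one shape. Passing a bare pattern as the second argument raises the error with a suggestion.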
PhraseMatcher.remove
Remove a rule from the matcher by match ID. A KeyError is raised if the key
does not exist.
Example
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp("Barack Obama"))
assert "OBAMA" in matcher
matcher.remove("OBAMA")
assert "OBAMA" not in matcher
| Name | Type | Description |
|---|---|---|
| key | unicode | The ID of the match rule. |