title | teaser | tag | source | new |
---|---|---|---|---|
PhraseMatcher | Match sequences of tokens, based on documents | class | spacy/matcher/phrasematcher.pyx | 2 |
The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects.
PhraseMatcher.__init__
Create the rule-based PhraseMatcher. Setting a different attr to match on will change the token attributes that will be compared to determine a match. By default, the incoming Doc is checked for sequences of tokens with the same ORTH value, i.e. the verbatim token text. Matching on the attribute LOWER will result in case-insensitive matching, since only the lowercase token texts are compared. In theory, it's also possible to match on sequences of the same part-of-speech tags or dependency labels.
If validate=True is set, additional validation is performed when patterns are added. At the moment, it will check whether a Doc has attributes assigned that aren't necessary to produce the matches (for example, part-of-speech tags if the PhraseMatcher matches on the token text). Since this can often lead to significantly worse performance when creating the pattern, a UserWarning will be shown.
Example

```python
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
```
Name | Type | Description |
---|---|---|
vocab | Vocab | The vocabulary object, which must be shared with the documents the matcher will operate on. |
max_length | int | Deprecated argument: the PhraseMatcher does not have a phrase length limit anymore. |
attr (new in v2.1) | int / unicode | The token attribute to match on. Defaults to ORTH, i.e. the verbatim token text. |
validate (new in v2.1) | bool | Validate patterns added to the matcher. |
RETURNS | PhraseMatcher | The newly constructed object. |
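For example, a minimal sketch of case-insensitive matching via attr="LOWER", assuming an nlp pipeline is already loaded; the pattern texts and example sentence are illustrative:

```python
from spacy.matcher import PhraseMatcher

# Sketch: compare the LOWER attribute so matching is case-insensitive.
# The terminology list and sentence below are illustrative only.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# nlp.make_doc only tokenizes, so the pattern Docs carry no unnecessary annotations.
patterns = [nlp.make_doc("Barack Obama"), nlp.make_doc("Angela Merkel")]
matcher.add("NAMES", None, *patterns)

doc = nlp("barack obama met ANGELA MERKEL in Berlin")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # matches regardless of casing
```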
As of v2.1, the PhraseMatcher doesn't have a phrase length limit anymore, so the max_length argument is now deprecated.
PhraseMatcher.__call__
Find all token sequences matching the supplied patterns on the Doc.
Example

```python
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp("Barack Obama"))
doc = nlp("Barack Obama lifts America one last time in emotional farewell")
matches = matcher(doc)
```
Name | Type | Description |
---|---|---|
doc | Doc | The document to match over. |
RETURNS | list | A list of (match_id, start, end) tuples, describing the matches. A match tuple describes a span doc[start:end]. The match_id is the ID of the added match pattern. |
Because spaCy stores all strings as integers, the match_id you get back will be an integer, too – but you can always get the string representation by looking it up in the vocabulary's StringStore, i.e. nlp.vocab.strings:

```python
match_id_string = nlp.vocab.strings[match_id]
```
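Putting this together, a small sketch that iterates over the returned matches and resolves each ID back to its string, using the matcher and doc from the example above:

```python
# Sketch: resolve each match ID back to its string and slice the matched span.
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # e.g. "OBAMA"
    span = doc[start:end]                    # the matched Span
    print(string_id, start, end, span.text)
```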
PhraseMatcher.pipe
Match a stream of documents, yielding them in turn.
Example

```python
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
for doc in matcher.pipe(docs, batch_size=50):
    pass
```
Name | Type | Description |
---|---|---|
docs | iterable | A stream of documents. |
batch_size | int | The number of documents to accumulate into a working set. |
YIELDS | Doc | Documents, in order. |
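For instance, a hedged sketch of pairing matcher.pipe with nlp.pipe to process a stream of raw texts; the texts iterable is an assumption:

```python
from spacy.matcher import PhraseMatcher

# Sketch: `texts` is assumed to be an iterable of strings and `nlp` a loaded pipeline.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp("Barack Obama"))

docs = nlp.pipe(texts)
for doc in matcher.pipe(docs, batch_size=50):
    matches = matcher(doc)  # pipe yields the docs in order; matching is done per doc
    if matches:
        print([doc[start:end].text for _, start, end in matches])
```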
PhraseMatcher.__len__
Get the number of rules added to the matcher. Note that this only returns the number of rules (identical to the number of IDs), not the number of individual patterns.
Example

```python
matcher = PhraseMatcher(nlp.vocab)
assert len(matcher) == 0
matcher.add("OBAMA", None, nlp("Barack Obama"))
assert len(matcher) == 1
```
Name | Type | Description |
---|---|---|
RETURNS | int | The number of rules. |
PhraseMatcher.__contains__
Check whether the matcher contains rules for a match ID.
Example

```python
matcher = PhraseMatcher(nlp.vocab)
assert "OBAMA" not in matcher
matcher.add("OBAMA", None, nlp("Barack Obama"))
assert "OBAMA" in matcher
```
Name | Type | Description |
---|---|---|
key | str | The match ID. |
RETURNS | bool | Whether the matcher contains rules for this match ID. |
PhraseMatcher.add
Add a rule to the matcher, consisting of an ID key, one or more patterns, and a callback function to act on the matches. The callback function will receive the arguments matcher, doc, i and matches. If a pattern already exists for the given ID, the patterns will be extended. An on_match callback will be overwritten.
Example

```python
def on_match(matcher, doc, id, matches):
    print('Matched!', matches)

matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", on_match, nlp("Barack Obama"))
matcher.add("HEALTH", on_match, nlp("health care reform"), nlp("healthcare reform"))

doc = nlp("Barack Obama urges Congress to find courage to defend his healthcare reforms")
matches = matcher(doc)
```
Name | Type | Description |
---|---|---|
match_id | str | An ID for the thing you're matching. |
on_match | callable or None | Callback function to act on matches. Takes the arguments matcher, doc, i and matches. |
*docs | Doc | Doc objects of the phrases to match. |
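As a small sketch of the behavior described above, adding more patterns under an existing ID extends that rule rather than creating a new one:

```python
# Sketch: both calls use the same ID, so the patterns are merged under "HEALTH"
# and the number of rules (IDs) stays at one.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("HEALTH", None, nlp("health care reform"))
matcher.add("HEALTH", None, nlp("healthcare reform"))
assert len(matcher) == 1  # one rule ID, two patterns
```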
As of spaCy 2.2.2, PhraseMatcher.add also supports the new API, which will become the default in the future. The Doc patterns are now the second argument and a list (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
```diff
  patterns = [nlp("health care reform"), nlp("healthcare reform")]
- matcher.add("HEALTH", None, *patterns)
+ matcher.add("HEALTH", patterns)
- matcher.add("HEALTH", on_match, *patterns)
+ matcher.add("HEALTH", patterns, on_match=on_match)
```
PhraseMatcher.remove
Remove a rule from the matcher by match ID. A KeyError is raised if the key does not exist.
Example

```python
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp("Barack Obama"))
assert "OBAMA" in matcher
matcher.remove("OBAMA")
assert "OBAMA" not in matcher
```
Name | Type | Description |
---|---|---|
key | str | The ID of the match rule. |
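For example, a small sketch of guarding against the KeyError when a rule might not have been added:

```python
# Sketch: removing a rule that was never added raises a KeyError.
matcher = PhraseMatcher(nlp.vocab)
try:
    matcher.remove("OBAMA")
except KeyError:
    print("No rule with match ID 'OBAMA'")
```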