2016-10-31 21:04:15 +03:00
|
|
|
|
//- 💫 DOCS > USAGE > RULE-BASED MATCHING
|
|
|
|
|
|
|
|
|
|
include ../../_includes/_mixins
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| spaCy features a rule-matching engine that operates over tokens, similar
|
2017-05-20 02:38:55 +03:00
|
|
|
|
| to regular expressions. The rules can refer to token annotations (e.g.
|
|
|
|
|
| the token #[code text] or #[code tag_], and flags (e.g. #[code IS_PUNCT]).
|
|
|
|
|
| The rule matcher also lets you pass in a custom callback
|
|
|
|
|
| to act on matches – for example, to merge entities and apply custom labels.
|
|
|
|
|
| You can also associate patterns with entity IDs, to allow some basic
|
|
|
|
|
| entity linking or disambiguation.
|
2016-10-31 21:04:15 +03:00
|
|
|
|
|
2017-05-20 02:38:55 +03:00
|
|
|
|
+aside("What about \"real\" regular expressions?")
|
|
|
|
|
|
|
|
|
|
+h(2, "adding-patterns") Adding patterns
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| Let's say we want to enable spaCy to find a combination of three tokens:
|
2016-10-31 21:04:15 +03:00
|
|
|
|
|
|
|
|
|
+list("numbers")
|
2017-05-20 02:38:55 +03:00
|
|
|
|
+item
|
|
|
|
|
| A token whose #[strong lower-case form matches "hello"], e.g. "Hello"
|
|
|
|
|
| or "HELLO".
|
|
|
|
|
+item
|
|
|
|
|
| A token whose #[strong #[code is_punct] flag is set to #[code True]],
|
|
|
|
|
| i.e. any punctuation.
|
|
|
|
|
+item
|
|
|
|
|
| A token whose #[strong lower-case form matches "world"], e.g. "World"
|
|
|
|
|
| or "WORLD".
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
[{LOWER: 'hello'}, {IS_PUNCT: True}, {LOWER: 'world'}]
|
2016-10-31 21:04:15 +03:00
|
|
|
|
|
|
|
|
|
p
|
2017-05-20 02:38:55 +03:00
|
|
|
|
| First, we initialise the #[code Matcher] with a vocab. The matcher must
|
|
|
|
|
| always share the same vocab with the documents it will operate on. We
|
|
|
|
|
| can now call #[+api("matcher#add") #[code matcher.add()]] with an ID and
|
|
|
|
|
| our custom pattern:
|
2016-10-31 21:04:15 +03:00
|
|
|
|
|
|
|
|
|
+code.
|
2017-05-20 02:38:55 +03:00
|
|
|
|
import spacy
|
2016-11-11 15:04:08 +03:00
|
|
|
|
from spacy.matcher import Matcher
|
2017-05-20 02:38:55 +03:00
|
|
|
|
from spacy.attrs import LOWER, IS_PUNCT # don't forget to import the attrs!
|
|
|
|
|
|
|
|
|
|
nlp = spacy.load('en')
|
2016-10-31 21:04:15 +03:00
|
|
|
|
matcher = Matcher(nlp.vocab)
|
2017-05-20 02:38:55 +03:00
|
|
|
|
matcher.add_pattern('HelloWorld', [{LOWER: 'hello'}, {IS_PUNCT: True}, {LOWER: 'world'}])
|
2016-10-31 21:04:15 +03:00
|
|
|
|
|
2017-05-20 02:38:55 +03:00
|
|
|
|
doc = nlp(u'Hello, world! Hello world!')
|
2016-10-31 21:04:15 +03:00
|
|
|
|
matches = matcher(doc)
|
|
|
|
|
|
|
|
|
|
p
|
2017-05-20 02:38:55 +03:00
|
|
|
|
| The matcher returns a list of #[code (match_id, start, end)] tuples – in
|
|
|
|
|
| this case, #[code [('HelloWorld', 0, 2)]], which maps to the span
|
|
|
|
|
| #[code doc[0:2]] of our original document. Optionally, we could also
|
|
|
|
|
| choose to add more than one pattern, for example to also match sequences
|
|
|
|
|
| without punctuation between "hello" and "world":
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
matcher.add_pattern('HelloWorld', [{LOWER: 'hello'}, {IS_PUNCT: True}, {LOWER: 'world'}],
|
|
|
|
|
[{LOWER: 'hello'}, {LOWER: 'world'}])
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| By default, the matcher will only return the matches and
|
|
|
|
|
| #[strong not do anything else], like merge entities or assign labels.
|
|
|
|
|
| This is all up to you and can be defined individually for each pattern,
|
|
|
|
|
| by passing in a callback function as the #[code on_match] argument on
|
|
|
|
|
| #[code add()]. This is useful, because it lets you write entirely custom
|
|
|
|
|
| and #[strong pattern-specific logic]. For example, you might want to
|
|
|
|
|
| merge #[em some] patterns into one token, while adding entity labels for
|
|
|
|
|
| other pattern types. You shouldn't have to create different matchers for
|
|
|
|
|
| each of those processes.
|
|
|
|
|
|
|
|
|
|
+h(2, "on_match") Adding #[code on_match] rules
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| To move on to a more realistic example, let's say you're working with a
|
|
|
|
|
| large corpus of blog articles, and you want to match all mentions of
|
|
|
|
|
| "Google I/O" (which spaCy tokenizes as #[code ['Google', 'I', '/', 'O']]).
|
|
|
|
|
| To be safe, you only match on the uppercase versions, in case someone has
|
|
|
|
|
| written it as "Google i/o". You also add a second pattern with an added
|
|
|
|
|
| #[code {IS_DIGIT: True}] token – this will make sure you also match on
|
|
|
|
|
| "Google I/O 2017". If this pattern matches, spaCy should execute your
|
|
|
|
|
| custom callback function #[code add_event_ent].
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
import spacy
|
|
|
|
|
from spacy.matcher import Matcher
|
|
|
|
|
from spacy.attrs import ORTH, UPPER, LOWER, IS_DIGIT
|
|
|
|
|
|
|
|
|
|
nlp = spacy.load('en')
|
|
|
|
|
matcher = Matcher(nlp.vocab)
|
|
|
|
|
|
|
|
|
|
matcher.add_pattern('GoogleIO', [{ORTH: 'Google'}, {UPPER: 'I'}, {ORTH: '/'}, {UPPER: 'O'}],
|
|
|
|
|
[{ORTH: 'Google'}, {UPPER: 'I'}, {ORTH: '/'}, {UPPER: 'O'}, {IS_DIGIT: True}],
|
|
|
|
|
on_match=add_event_ent)
|
|
|
|
|
|
|
|
|
|
# Get the ID of the 'EVENT' entity type. This is required to set an entity.
|
|
|
|
|
EVENT = nlp.vocab.strings['EVENT']
|
|
|
|
|
|
|
|
|
|
def add_event_ent(matcher, doc, i, matches):
|
|
|
|
|
# Get the current match and create tuple of entity label, start and end.
|
|
|
|
|
# Append entity to the doc's entity. (Don't overwrite doc.ents, in case
|
|
|
|
|
# it already has other entities!)
|
|
|
|
|
match_id, start, end = matches[i]
|
|
|
|
|
doc.ents += ((EVENT, start, end),)
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| In addition to mentions of "Google I/O", your data also contains some
|
|
|
|
|
| annoying pre-processing artefacts, like leftover HTML line breaks
|
|
|
|
|
| (e.g. #[code <br>] or #[code <BR/>]). While you're at it,
|
|
|
|
|
| you want to merge those into one token and flag them, to make sure you
|
|
|
|
|
| can easily ignore them later. So you add a second pattern and pass in a
|
|
|
|
|
| function #[code merge_and_flag]:
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
matcher.add_pattern('BAD_HTML', [{ORTH: '<'}, {LOWER: 'br'}, {ORTH: '>'}],
|
|
|
|
|
[{ORTH: '<'}, {LOWER: 'br/'}, {ORTH: '>'}]
|
|
|
|
|
on_match=merge_and_flag)
|
|
|
|
|
|
|
|
|
|
# Add a new custom flag to the vocab, which is always False by default.
|
|
|
|
|
# BAD_HTML will be the flag ID, which we can use to set it to True on the span.
|
|
|
|
|
BAD_HTML_FLAG = doc.vocab.add_flag(lambda text: False)
|
|
|
|
|
|
|
|
|
|
def merge_and_flag(matcher, doc, i, matches):
|
|
|
|
|
match_id, start, end = matches[i]
|
|
|
|
|
span = doc[start : end]
|
|
|
|
|
span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
|
|
|
|
|
span.set_flag(BAD_HTML_FLAG, True) # set BAD_HTML_FLAG
|
2016-10-31 21:04:15 +03:00
|
|
|
|
|
2017-05-20 02:38:55 +03:00
|
|
|
|
+aside("Tip: Visualizing matches")
|
2017-05-20 02:43:48 +03:00
|
|
|
|
| When working with entities, you can use #[+api("displacy") displaCy]
|
|
|
|
|
| to quickly generate a NER visualization from your updated #[code Doc],
|
|
|
|
|
| which can be exported as an HTML file:
|
2017-05-20 02:38:55 +03:00
|
|
|
|
|
|
|
|
|
+code.o-no-block.
|
|
|
|
|
from spacy import displacy
|
|
|
|
|
html = displacy.render(doc, style='ent', page=True,
|
|
|
|
|
options={'ents': ['EVENT']})
|
|
|
|
|
|
|
|
|
|
| For more info and examples, see the usage workflow on
|
|
|
|
|
| #[+a("/docs/usage/visualizers") visualizing spaCy].
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| We can now call the matcher on our documents. The patterns will be
|
|
|
|
|
| matched in the order they occur in the text.
|
2016-10-31 21:04:15 +03:00
|
|
|
|
|
|
|
|
|
+code.
|
2017-05-20 02:38:55 +03:00
|
|
|
|
doc = nlp(LOTS_OF_TEXT)
|
|
|
|
|
matcher(doc)
|
|
|
|
|
|
|
|
|
|
+h(3, "on_match-callback") The callback function
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| The matcher will first collect all matches over the document. It will
|
|
|
|
|
| then iterate over the matches, lookup the callback for the entity ID
|
|
|
|
|
| that was matched, and invoke it. When the callback is invoked, it is
|
|
|
|
|
| passed four arguments: the matcher itself, the document, the position of
|
|
|
|
|
| the current match, and the total list of matches. This allows you to
|
|
|
|
|
| write callbacks that consider the entire set of matched phrases, so that
|
|
|
|
|
| you can resolve overlaps and other conflicts in whatever way you prefer.
|
|
|
|
|
|
|
|
|
|
+table(["Argument", "Type", "Description"])
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code matcher]
|
|
|
|
|
+cell #[code Matcher]
|
|
|
|
|
+cell The matcher instance.
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code doc]
|
|
|
|
|
+cell #[code Doc]
|
|
|
|
|
+cell The document the matcher was used on.
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code i]
|
|
|
|
|
+cell int
|
|
|
|
|
+cell Index of the current match (#[code matches[i]]).
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code matches]
|
|
|
|
|
+cell list
|
|
|
|
|
+cell
|
|
|
|
|
| A list of #[code (match_id, start, end)] tuples, describing the
|
|
|
|
|
| matches. A match tuple describes a span #[code doc[start:end]].
|
|
|
|
|
| The #[code match_id] is the ID of the added match pattern.
|
2016-10-31 21:04:15 +03:00
|
|
|
|
|
|
|
|
|
+h(2, "quantifiers") Using quantifiers
|
|
|
|
|
|
|
|
|
|
+table([ "Name", "Description", "Example"])
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code !]
|
|
|
|
|
+cell match exactly 0 times
|
|
|
|
|
+cell negation
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code *]
|
|
|
|
|
+cell match 0 or more times
|
|
|
|
|
+cell optional, variable number
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code +]
|
|
|
|
|
+cell match 1 or more times
|
|
|
|
|
+cell mandatory, variable number
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code ?]
|
|
|
|
|
+cell match 0 or 1 times
|
|
|
|
|
+cell optional, max one
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| There are no nested or scoped quantifiers. You can build those
|
2017-05-20 02:38:55 +03:00
|
|
|
|
| behaviours with #[code on_match] callbacks.
|