spaCy/website/usage/_linguistic-features/_rule-based-matching.jade

562 lines
24 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

//- 💫 DOCS > USAGE > RULE-BASED MATCHING
p
| spaCy features a rule-matching engine, the #[+api("matcher") #[code Matcher]],
| that operates over tokens, similar
| to regular expressions. The rules can refer to token annotations (e.g.
| the token #[code text] or #[code tag_], and flags (e.g. #[code IS_PUNCT]).
| The rule matcher also lets you pass in a custom callback
| to act on matches for example, to merge entities and apply custom labels.
| You can also associate patterns with entity IDs, to allow some basic
| entity linking or disambiguation. To match large terminology lists,
| you can use the #[+api("phrasematcher") #[code PhraseMatcher]], which
| accepts #[code Doc] objects as match patterns.
+h(3, "adding-patterns") Adding patterns
p
| Let's say we want to enable spaCy to find a combination of three tokens:
+list("numbers")
+item
| A token whose #[strong lowercase form matches "hello"], e.g. "Hello"
| or "HELLO".
+item
| A token whose #[strong #[code is_punct] flag is set to #[code True]],
| i.e. any punctuation.
+item
| A token whose #[strong lowercase form matches "world"], e.g. "World"
| or "WORLD".
+code.
[{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
p
| First, we initialise the #[code Matcher] with a vocab. The matcher must
| always share the same vocab with the documents it will operate on. We
| can now call #[+api("matcher#add") #[code matcher.add()]] with an ID and
| our custom pattern. The second argument lets you pass in an optional
| callback function to invoke on a successful match. For now, we set it
| to #[code None].
+code.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# add match ID "HelloWorld" with no callback and one pattern
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)
doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)
p
| The matcher returns a list of #[code (match_id, start, end)] tuples in
| this case, #[code [('HelloWorld', 0, 2)]], which maps to the span
| #[code doc[0:2]] of our original document. Optionally, we could also
| choose to add more than one pattern, for example to also match sequences
| without punctuation between "hello" and "world":
+code.
matcher.add('HelloWorld', None,
[{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
[{'LOWER': 'hello'}, {'LOWER': 'world'}])
p
| By default, the matcher will only return the matches and
| #[strong not do anything else], like merge entities or assign labels.
| This is all up to you and can be defined individually for each pattern,
| by passing in a callback function as the #[code on_match] argument on
| #[code add()]. This is useful, because it lets you write entirely custom
| and #[strong pattern-specific logic]. For example, you might want to
| merge #[em some] patterns into one token, while adding entity labels for
| other pattern types. You shouldn't have to create different matchers for
| each of those processes.
+h(4, "adding-patterns-attributes") Available token attributes
p
| The available token pattern keys are uppercase versions of the
| #[+api("token#attributes") #[code Token] attributes]. The most relevant
| ones for rule-based matching are:
+table(["Attribute", "Description"])
+row
+cell #[code ORTH]
+cell The exact verbatim text of a token.
+row
+cell.u-nowrap #[code LOWER]
+cell The lowercase form of the token text.
+row
+cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT]
+cell
| Token text consists of alphanumeric characters, ASCII characters,
| digits.
+row
+cell.u-nowrap #[code IS_LOWER], #[code IS_UPPER], #[code IS_TITLE]
+cell Token text is in lowercase, uppercase, titlecase.
+row
+cell.u-nowrap #[code IS_PUNCT], #[code IS_SPACE], #[code IS_STOP]
+cell Token is punctuation, whitespace, stop word.
+row
+cell.u-nowrap #[code LIKE_NUM], #[code LIKE_URL], #[code LIKE_EMAIL]
+cell Token text resembles a number, URL, email.
+row
+cell.u-nowrap
| #[code POS], #[code TAG], #[code DEP], #[code LEMMA],
| #[code SHAPE]
+cell
| The token's simple and extended part-of-speech tag, dependency
| label, lemma, shape.
+h(4, "adding-patterns-wildcard") Using wildcard token patterns
+tag-new(2)
p
| While the token attributes offer many options to write highly specific
| patterns, you can also use an empty dictionary, #[code {}] as a wildcard
| representing #[strong any token]. This is useful if you know the context
| of what you're trying to match, but very little about the specific token
| and its characters. For example, let's say you're trying to extract
| people's user names from your data. All you know is that they are listed
| as "User name: {username}". The name itself may contain any character,
| but no whitespace so you'll know it will be handled as one token.
+code.
[{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}]
+h(4, "quantifiers") Using operators and quantifiers
p
| The matcher also lets you use quantifiers, specified as the #[code 'OP']
| key. Quantifiers let you define sequences of tokens to be mached, e.g.
| one or more punctuation marks, or specify optional tokens. Note that there
| are no nested or scoped quantifiers instead, you can build those
| behaviours with #[code on_match] callbacks.
+table([ "OP", "Description"])
+row
+cell #[code !]
+cell Negate the pattern, by requiring it to match exactly 0 times.
+row
+cell #[code ?]
+cell Make the pattern optional, by allowing it to match 0 or 1 times.
+row
+cell #[code +]
+cell Require the pattern to match 1 or more times.
+row
+cell #[code *]
+cell Allow the pattern to match zero or more times.
p
| The #[code +] and #[code *] operators are usually interpretted
| "greedily", i.e. longer matches are returned where possible.
+h(3, "adding-phrase-patterns") Adding phrase patterns
p
| If you need to match large terminology lists, you can also use the
| #[+api("phrasematcher") #[code PhraseMatcher]] and create
| #[+api("doc") #[code Doc]] objects instead of token patterns, which is
| much more efficient overall. The #[code Doc] patterns can contain single
| or multiple tokens.
+code.
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)
terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
patterns = [nlp(text) for text in terminology_list]
matcher.add('TerminologyList', None, *patterns)
doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
u"converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
p
| Since spaCy is used for processing both the patterns and the text to be
| matched, you won't have to worry about specific tokenization for
| example, you can simply pass in #[code nlp(u"Washington, D.C.")] and
| won't have to write a complex token pattern covering the exact
| tokenization of the term.
+h(3, "on_match") Adding #[code on_match] rules
p
| To move on to a more realistic example, let's say you're working with a
| large corpus of blog articles, and you want to match all mentions of
| "Google I/O" (which spaCy tokenizes as #[code ['Google', 'I', '/', 'O']]).
| To be safe, you only match on the uppercase versions, in case someone has
| written it as "Google i/o". You also add a second pattern with an added
| #[code {IS_DIGIT: True}] token this will make sure you also match on
| "Google I/O 2017". If your pattern matches, spaCy should execute your
| custom callback function #[code add_event_ent].
+code.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# Get the ID of the 'EVENT' entity type. This is required to set an entity.
EVENT = nlp.vocab.strings['EVENT']
def add_event_ent(matcher, doc, i, matches):
# Get the current match and create tuple of entity label, start and end.
# Append entity to the doc's entity. (Don't overwrite doc.ents!)
match_id, start, end = matches[i]
doc.ents += ((EVENT, start, end),)
matcher.add('GoogleIO', add_event_ent,
[{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
[{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}, {'IS_DIGIT': True}])
p
| In addition to mentions of "Google I/O", your data also contains some
| annoying pre-processing artefacts, like leftover HTML line breaks
| (e.g. #[code <br>] or #[code <BR/>]). While you're at it,
| you want to merge those into one token and flag them, to make sure you
| can easily ignore them later. So you add a second pattern and pass in a
| function #[code merge_and_flag]:
+code.
# Add a new custom flag to the vocab, which is always False by default.
# BAD_HTML_FLAG will be the flag ID, which we can use to set it to True on the span.
BAD_HTML_FLAG = nlp.vocab.add_flag(lambda text: False)
def merge_and_flag(matcher, doc, i, matches):
match_id, start, end = matches[i]
span = doc[start : end]
span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
span.set_flag(BAD_HTML_FLAG, True) # set BAD_HTML_FLAG
matcher.add('BAD_HTML', merge_and_flag,
[{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
[{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])
+aside("Tip: Visualizing matches")
| When working with entities, you can use #[+api("top-level#displacy") displaCy]
| to quickly generate a NER visualization from your updated #[code Doc],
| which can be exported as an HTML file:
+code.o-no-block.
from spacy import displacy
html = displacy.render(doc, style='ent', page=True,
options={'ents': ['EVENT']})
| For more info and examples, see the usage guide on
| #[+a("/usage/visualizers") visualizing spaCy].
p
| We can now call the matcher on our documents. The patterns will be
| matched in the order they occur in the text. The matcher will then
| iterate over the matches, look up the callback for the match ID
| that was matched, and invoke it.
+code.
doc = nlp(LOTS_OF_TEXT)
matcher(doc)
p
| When the callback is invoked, it is
| passed four arguments: the matcher itself, the document, the position of
| the current match, and the total list of matches. This allows you to
| write callbacks that consider the entire set of matched phrases, so that
| you can resolve overlaps and other conflicts in whatever way you prefer.
+table(["Argument", "Type", "Description"])
+row
+cell #[code matcher]
+cell #[code Matcher]
+cell The matcher instance.
+row
+cell #[code doc]
+cell #[code Doc]
+cell The document the matcher was used on.
+row
+cell #[code i]
+cell int
+cell Index of the current match (#[code matches[i]]).
+row
+cell #[code matches]
+cell list
+cell
| A list of #[code (match_id, start, end)] tuples, describing the
| matches. A match tuple describes a span #[code doc[start:end]].
+h(3, "regex") Using regular expressions
p
| In some cases, only matching tokens and token attributes isn't enough
| for example, you might want to match different spellings of a word,
| without having to add a new pattern for each spelling. A simple solution
| is to match a regular expression on the #[code Doc]'s #[code text] and
| use the #[+api("doc#char_span") #[code Doc.char_span]] method to
| create a #[code Span] from the character indices of the match:
+code.
import spacy
import re
nlp = spacy.load('en')
doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')
for match in re.finditer(DEFINITELY_PATTERN, doc.text):
start, end = match.span() # get matched indices
span = doc.char_span(start, end) # create Span from indices
p
| You can also use the regular expression with spaCy's #[code Matcher] by
| converting it to a token flag. To ensure efficiency, the
| #[code Matcher] can only access the C-level data. This means that it can
| either use built-in token attributes or #[strong binary flags].
| #[+api("vocab#add_flag") #[code Vocab.add_flag]] returns a flag ID which
| you can use as a key of a token match pattern. Tokens that match the
| regular expression will return #[code True] for the #[code IS_DEFINITELY]
| flag.
+code.
IS_DEFINITELY = nlp.vocab.add_flag(re.compile(r'deff?in[ia]tely').match)
matcher = Matcher(nlp.vocab)
matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])
p
| Providing the regular expressions as binary flags also lets you use them
| in combination with other token patterns for example, to match the
| word "definitely" in various spellings, followed by a case-insensitive
| "not" and and adjective:
+code.
[{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}]
+h(3, "example1") Example: Using linguistic annotations
p
| Let's say you're analysing user comments and you want to find out what
| people are saying about Facebook. You want to start off by finding
| adjectives following "Facebook is" or "Facebook was". This is obviously
| a very rudimentary solution, but it'll be fast, and a great way get an
| idea for what's in your data. Your pattern could look like this:
+code.
[{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]
p
| This translates to a token whose lowercase form matches "facebook"
| (like Facebook, facebook or FACEBOOK), followed by a token with the lemma
| "be" (for example, is, was, or 's), followed by an #[strong optional] adverb,
| followed by an adjective. Using the linguistic annotations here is
| especially useful, because you can tell spaCy to match "Facebook's
| annoying", but #[strong not] "Facebook's annoying ads". The optional
| adverb makes sure you won't miss adjectives with intensifiers, like
| "pretty awful" or "very nice".
p
| To get a quick overview of the results, you could collect all sentences
| containing a match and render them with the
| #[+a("/usage/visualizers") displaCy visualizer].
| In the callback function, you'll have access to the #[code start] and
| #[code end] of each match, as well as the parent #[code Doc]. This lets
| you determine the sentence containing the match,
| #[code doc[start : end].sent], and calculate the start and end of the
| matched span within the sentence. Using displaCy in
| #[+a("/usage/visualizers#manual-usage") "manual" mode] lets you
| pass in a list of dictionaries containing the text and entities to render.
+code.
from spacy import displacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matched_sents = [] # collect data of matched sentences to be visualized
def collect_sents(matcher, doc, i, matches):
match_id, start, end = matches[i]
span = doc[start : end] # matched span
sent = span.sent # sentence containing matched span
# append mock entity for match in displaCy style to matched_sents
# get the match span by ofsetting the start and end of the span with the
# start and end of the sentence in the doc
match_ents = [{'start': span.start_char - sent.start_char,
'end': span.end_char - sent.start_char,
'label': 'MATCH'}]
matched_sents.append({'text': sent.text, 'ents': match_ents })
pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
{'POS': 'ADJ'}]
matcher.add('FacebookIs', collect_sents, pattern) # add pattern
matches = matcher(nlp(LOTS_OF_TEXT)) # match on your text
# serve visualization of sentences containing match with displaCy
# set manual=True to make displaCy render straight from a dictionary
displacy.serve(matched_sents, style='ent', manual=True)
+h(3, "example2") Example: Phone numbers
p
| Phone numbers can have many different formats and matching them is often
| tricky. During tokenization, spaCy will leave sequences of numbers intact
| and only split on whitespace and punctuation. This means that your match
| pattern will have to look out for number sequences of a certain length,
| surrounded by specific punctuation depending on the
| #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers") national conventions].
p
| The #[code IS_DIGIT] flag is not very helpful here, because it doesn't
| tell us anything about the length. However, you can use the #[code SHAPE]
| flag, with each #[code d] representing a digit:
+code.
[{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
{'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]
p
| This will match phone numbers of the format #[strong (123) 4567 8901] or
| #[strong (123) 4567-8901]. To also match formats like #[strong (123) 456 789],
| you can add a second pattern using #[code 'ddd'] in place of #[code 'dddd'].
| By hard-coding some values, you can match only certain, country-specific
| numbers. For example, here's a pattern to match the most common formats of
| #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany") international German numbers]:
+code.
[{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
{'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]
p
| Depending on the formats your application needs to match, creating an
| extensive set of rules like this is often better than training a model.
| It'll produce more predictable results, is much easier to modify and
| extend, and doesn't require any training data only a set of
| test cases.
+h(3, "example3") Example: Hashtags and emoji on social media
p
| Social media posts, especially tweets, can be difficult to work with.
| They're very short and often contain various emoji and hashtags. By only
| looking at the plain text, you'll lose a lot of valuable semantic
| information.
p
| Let's say you've extracted a large sample of social media posts on a
| specific topic, for example posts mentioning a brand name or product.
| As the first step of your data exploration, you want to filter out posts
| containing certain emoji and use them to assign a general sentiment
| score, based on whether the expressed emotion is positive or negative,
| e.g. #[span.o-icon.o-icon--inline 😀] or #[span.o-icon.o-icon--inline 😞].
| You also want to find, merge and label hashtags like
| #[code #MondayMotivation], to be able to ignore or analyse them later.
+aside("Note on sentiment analysis")
| Ultimately, sentiment analysis is not always #[em that] easy. In
| addition to the emoji, you'll also want to take specific words into
| account and check the #[code subtree] for intensifiers like "very", to
| increase the sentiment score. At some point, you might also want to train
| a sentiment model. However, the approach described in this example is
| very useful for #[strong bootstrapping rules to collect training data].
| It's also an incredibly fast way to gather first insights into your data
| with about 1 million tweets, you'd be looking at a processing time of
| #[strong under 1 minute].
p
| By default, spaCy's tokenizer will split emoji into separate tokens. This
| means that you can create a pattern for one or more emoji tokens.
| Valid hashtags usually consist of a #[code #], plus a sequence of
| ASCII characters with no whitespace, making them easy to match as well.
+code.
from spacy.lang.en import English
from spacy.matcher import Matcher
nlp = English() # we only want the tokenizer, so no need to load a model
matcher = Matcher(nlp.vocab)
pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍'] # positive emoji
neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒'] # negative emoji
# add patterns to match one or more emoji tokens
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]
matcher.add('HAPPY', label_sentiment, *pos_patterns) # add positive pattern
matcher.add('SAD', label_sentiment, *neg_patterns) # add negative pattern
# add pattern to merge valid hashtag, i.e. '#' plus any ASCII token
matcher.add('HASHTAG', merge_hashtag, [{'ORTH': '#'}, {'IS_ASCII': True}])
p
| Because the #[code on_match] callback receives the ID of each match, you
| can use the same function to handle the sentiment assignment for both
| the positive and negative pattern. To keep it simple, we'll either add
| or subtract #[code 0.1] points this way, the score will also reflect
| combinations of emoji, even positive #[em and] negative ones.
p
| With a library like
| #[+a("https://github.com/bcongdon/python-emojipedia") Emojipedia],
| we can also retrieve a short description for each emoji for example,
| #[span.o-icon.o-icon--inline 😍]'s official title is "Smiling Face With
| Heart-Eyes". Assigning it to the merged token's norm will make it
| available as #[code token.norm_].
+code.
from emojipedia import Emojipedia # installation: pip install emojipedia
def label_sentiment(matcher, doc, i, matches):
match_id, start, end = matches[i]
if doc.vocab.strings[match_id] == 'HAPPY': # don't forget to get string!
doc.sentiment += 0.1 # add 0.1 for positive sentiment
elif doc.vocab.strings[match_id] == 'SAD':
doc.sentiment -= 0.1 # subtract 0.1 for negative sentiment
span = doc[start : end]
emoji = Emojipedia.search(span[0].text) # get data for emoji
span.merge(norm=emoji.title) # merge span and set NORM to emoji title
p
| To label the hashtags, we first need to add a new custom flag.
| #[code IS_HASHTAG] will be the flag's ID, which you can use to assign it
| to the hashtag's span, and check its value via a token's
| #[+api("token#check_flag") #[code check_flag()]] method. On each
| match, we merge the hashtag and assign the flag.
+code.
# Add a new custom flag to the vocab, which is always False by default
IS_HASHTAG = nlp.vocab.add_flag(lambda text: False)
def merge_hashtag(matcher, doc, i, matches):
match_id, start, end = matches[i]
span = doc[start : end]
span.merge() # merge hashtag
span.set_flag(IS_HASHTAG, True) # set IS_HASHTAG to True
p
| To process a stream of social media posts, we can use
| #[+api("language#pipe") #[code Language.pipe()]], which will return a
| stream of #[code Doc] objects that we can pass to
| #[+api("matcher#pipe") #[code Matcher.pipe()]].
+code.
docs = nlp.pipe(LOTS_OF_TWEETS)
matches = matcher.pipe(docs)