//- 💫 DOCS > USAGE > RULE-BASED MATCHING

include ../../_includes/_mixins

p
    | spaCy features a rule-matching engine that operates over tokens, similar
    | to regular expressions. The rules can refer to token annotations (e.g.
    | the token #[code text] or #[code tag_]), and flags (e.g. #[code IS_PUNCT]).
    | The rule matcher also lets you pass in a custom callback
    | to act on matches – for example, to merge entities and apply custom labels.
    | You can also associate patterns with entity IDs, to allow some basic
    | entity linking or disambiguation.

+aside("What about \"real\" regular expressions?")

+h(2, "adding-patterns") Adding patterns

p
    | Let's say we want to enable spaCy to find a combination of three tokens:

+list("numbers")
    +item
        | A token whose #[strong lowercase form matches "hello"], e.g. "Hello"
        | or "HELLO".
    +item
        | A token whose #[strong #[code is_punct] flag is set to #[code True]],
        | i.e. any punctuation.
    +item
        | A token whose #[strong lowercase form matches "world"], e.g. "World"
        | or "WORLD".

+code.
    [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

p
    | First, we initialise the #[code Matcher] with a vocab. The matcher must
    | always share the same vocab with the documents it will operate on. We
    | can now call #[+api("matcher#add") #[code matcher.add()]] with an ID and
    | our custom pattern. The second argument lets you pass in an optional
    | callback function to invoke on a successful match. For now, we set it
    | to #[code None].

+code.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en')
    matcher = Matcher(nlp.vocab)
    # add match ID "HelloWorld" with no callback and one pattern
    matcher.add('HelloWorld', None,
                [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}])

    doc = nlp(u'Hello, world! Hello world!')
    matches = matcher(doc)

p
    | The matcher returns a list of #[code (match_id, start, end)] tuples – in
    | this case, #[code [('HelloWorld', 0, 3)]], which maps to the span
    | #[code doc[0:3]] of our original document. Optionally, we could also
    | choose to add more than one pattern, for example to also match sequences
    | without punctuation between "hello" and "world":

+code.
    matcher.add('HelloWorld', None,
                [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
                [{'LOWER': 'hello'}, {'LOWER': 'world'}])

p
    | By default, the matcher will only return the matches and
    | #[strong not do anything else], like merge entities or assign labels.
    | This is all up to you and can be defined individually for each pattern,
    | by passing in a callback function as the #[code on_match] argument on
    | #[code add()]. This is useful, because it lets you write entirely custom
    | and #[strong pattern-specific logic]. For example, you might want to
    | merge #[em some] patterns into one token, while adding entity labels for
    | other pattern types. You shouldn't have to create different matchers for
    | each of those processes.
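p
    | Working with the returned matches is equally up to you. Since each tuple
    | describes a span of tokens, you can simply slice the #[code Doc] to get
    | at the matched text. Here's a minimal sketch, reusing the #[code doc] and
    | #[code matches] from the example above:

+code.
    for match_id, start, end in matches:
        span = doc[start:end] # the matched span, e.g. "Hello, world"
        print(match_id, span.text)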
+h(2, "on_match") Adding #[code on_match] rules

p
    | To move on to a more realistic example, let's say you're working with a
    | large corpus of blog articles, and you want to match all mentions of
    | "Google I/O" (which spaCy tokenizes as #[code ['Google', 'I', '/', 'O']]).
    | To be safe, you only match on the uppercase versions, in case someone has
    | written it as "Google i/o". You also add a second pattern with an added
    | #[code {'IS_DIGIT': True}] token – this will make sure you also match on
    | "Google I/O 2017". If your pattern matches, spaCy should execute your
    | custom callback function #[code add_event_ent].

+code.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en')
    matcher = Matcher(nlp.vocab)

    # Get the ID of the 'EVENT' entity type. This is required to set an entity.
    EVENT = nlp.vocab.strings['EVENT']

    def add_event_ent(matcher, doc, i, matches):
        # Get the current match and create tuple of entity label, start and end.
        # Append entity to the doc's entities. (Don't overwrite doc.ents!)
        match_id, start, end = matches[i]
        doc.ents += ((EVENT, start, end),)

    matcher.add('GoogleIO', add_event_ent,
                [{'ORTH': 'Google'}, {'UPPER': 'I'}, {'ORTH': '/'}, {'UPPER': 'O'}],
                [{'ORTH': 'Google'}, {'UPPER': 'I'}, {'ORTH': '/'}, {'UPPER': 'O'},
                 {'IS_DIGIT': True}])

p
    | In addition to mentions of "Google I/O", your data also contains some
    | annoying pre-processing artefacts, like leftover HTML line breaks
    | (e.g. #[code &lt;br&gt;] or #[code &lt;BR/&gt;]). While you're at it,
    | you want to merge those into one token and flag them, to make sure you
    | can easily ignore them later. So you add a second pattern and pass in a
    | function #[code merge_and_flag]:

+code.
    # Add a new custom flag to the vocab, which is always False by default.
    # BAD_HTML_FLAG will be the flag ID, which we can use to set it to True on the span.
    BAD_HTML_FLAG = nlp.vocab.add_flag(lambda text: False)

    def merge_and_flag(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        span = doc[start : end]
        span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
        span.set_flag(BAD_HTML_FLAG, True) # set BAD_HTML_FLAG

    matcher.add('BAD_HTML', merge_and_flag,
                [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
                [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])

+aside("Tip: Visualizing matches")
    | When working with entities, you can use #[+api("displacy") displaCy]
    | to quickly generate a NER visualization from your updated #[code Doc],
    | which can be exported as an HTML file:

    +code.o-no-block.
        from spacy import displacy
        html = displacy.render(doc, style='ent', page=True,
                               options={'ents': ['EVENT']})

    | For more info and examples, see the usage workflow on
    | #[+a("/docs/usage/visualizers") visualizing spaCy].

p
    | We can now call the matcher on our documents. The patterns will be
    | matched in the order they occur in the text. The matcher will then
    | iterate over the matches, look up the callback for the match ID
    | that was matched, and invoke it.

+code.
    doc = nlp(LOTS_OF_TEXT)
    matcher(doc)

p
    | When the callback is invoked, it is passed four arguments: the matcher
    | itself, the document, the position of the current match, and the total
    | list of matches. This allows you to write callbacks that consider the
    | entire set of matched phrases, so that you can resolve overlaps and
    | other conflicts in whatever way you prefer.

+table(["Argument", "Type", "Description"])
    +row
        +cell #[code matcher]
        +cell #[code Matcher]
        +cell The matcher instance.

    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The document the matcher was used on.

    +row
        +cell #[code i]
        +cell int
        +cell Index of the current match (#[code matches[i]]).

    +row
        +cell #[code matches]
        +cell list
        +cell
            | A list of #[code (match_id, start, end)] tuples, describing the
            | matches. A match tuple describes a span #[code doc[start:end]].
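p
    | For example, a callback could compare the current match against the full
    | #[code matches] list and ignore it if it's contained in a longer match.
    | The snippet below is only a sketch of that idea: the
    | #[code collect_longest] callback and the #[code matched_spans] list are
    | made-up names for illustration, not part of spaCy's API.

+code.
    matched_spans = [] # collect the surviving spans here

    def collect_longest(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        # ignore this match if another, longer match fully contains it
        for j, (_, other_start, other_end) in enumerate(matches):
            longer = (other_end - other_start) > (end - start)
            if j != i and longer and other_start <= start and other_end >= end:
                return
        matched_spans.append(doc[start:end])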
+h(2, "quantifiers") Using operators and quantifiers

p
    | The matcher also lets you use quantifiers, specified as the #[code 'OP']
    | key. Quantifiers let you define sequences of tokens to be matched, e.g.
    | one or more punctuation marks, or specify optional tokens. Note that there
    | are no nested or scoped quantifiers – instead, you can build those
    | behaviours with #[code on_match] callbacks.

+aside("Problems with quantifiers")
    | Using quantifiers may lead to unexpected results when matching
    | variable-length patterns, for example if the next token would also be
    | matched by the previous token. This problem should be resolved in a future
    | release. For more information, see
    | #[+a(gh("spaCy") + "/issues/864") this issue].

+table(["OP", "Description", "Example"])
    +row
        +cell #[code !]
        +cell match exactly 0 times
        +cell negation

    +row
        +cell #[code *]
        +cell match 0 or more times
        +cell optional, variable number

    +row
        +cell #[code +]
        +cell match 1 or more times
        +cell mandatory, variable number

    +row
        +cell #[code ?]
        +cell match 0 or 1 times
        +cell optional, max one
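p
    | For example, to make the punctuation in the "Hello, world!" pattern from
    | earlier #[strong optional], you could add #[code 'OP': '?'] to that
    | token, so a single pattern covers both "Hello, world" and "Hello world".
    | The snippet below is only a sketch reusing the #[code nlp] and
    | #[code matcher] objects from the first example, and the match ID
    | #[code 'HelloWorldOptional'] is made up:

+code.
    pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'},
               {'LOWER': 'world'}]
    matcher.add('HelloWorldOptional', None, pattern)

    doc = nlp(u'Hello, world! Hello world!')
    matches = matcher(doc) # should match both "Hello, world" and "Hello world"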
+h(3, "quantifiers-example1") Quantifiers example: Using linguistic annotations

p
    | Let's say you're analysing user comments and you want to find out what
    | people are saying about Facebook. You want to start off by finding
    | adjectives following "Facebook is" or "Facebook was". This is obviously
    | a very rudimentary solution, but it'll be fast, and a great way to get an
    | idea for what's in your data. Your pattern could look like this:

+code.
    [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]

p
    | This translates to a token whose lowercase form matches "facebook"
    | (like Facebook, facebook or FACEBOOK), followed by a token with the lemma
    | "be" (for example, is, was, or 's), followed by an #[strong optional] adverb,
    | followed by an adjective. Using the linguistic annotations here is
    | especially useful, because you can tell spaCy to match "Facebook's
    | annoying", but #[strong not] "Facebook's annoying ads". The optional
    | adverb makes sure you won't miss adjectives with intensifiers, like
    | "pretty awful" or "very nice".

p
    | To get a quick overview of the results, you could collect all sentences
    | containing a match and render them with the
    | #[+a("/docs/usage/visualizers") displaCy visualizer].
    | In the callback function, you'll have access to the #[code start] and
    | #[code end] of each match, as well as the parent #[code Doc]. This lets
    | you determine the sentence containing the match,
    | #[code doc[start : end].sent], and calculate the start and end of the
    | matched span within the sentence. Using displaCy in
    | #[+a("/docs/usage/visualizers#manual-usage") "manual" mode] lets you
    | pass in a list of dictionaries containing the text and entities to render.

+code.
    import spacy
    from spacy import displacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en')
    matcher = Matcher(nlp.vocab)
    matched_sents = [] # collect data of matched sentences to be visualized

    def collect_sents(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        span = doc[start : end] # matched span
        sent = span.sent # sentence containing matched span
        # append mock entity for match in displaCy style to matched_sents
        # get the match span by offsetting the character start and end of the
        # span with the character start of the sentence in the doc
        match_ents = [{'start': span.start_char - sent.start_char,
                       'end': span.end_char - sent.start_char,
                       'label': 'MATCH'}]
        matched_sents.append({'text': sent.text, 'ents': match_ents})

    pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
               {'POS': 'ADJ'}]
    matcher.add('FacebookIs', collect_sents, pattern) # add pattern
    matches = matcher(nlp(LOTS_OF_TEXT)) # match on your text

    # serve visualization of sentences containing match with displaCy
    # set manual=True to make displaCy render straight from a dictionary
    displacy.serve(matched_sents, style='ent', manual=True)

+h(3, "quantifiers-example2") Quantifiers example: Phone numbers

p
    | Phone numbers can have many different formats and matching them is often
    | tricky. During tokenization, spaCy will leave sequences of numbers intact
    | and only split on whitespace and punctuation. This means that your match
    | pattern will have to look out for number sequences of a certain length,
    | surrounded by specific punctuation – depending on the
    | #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers") national conventions].

p
    | The #[code IS_DIGIT] flag is not very helpful here, because it doesn't
    | tell us anything about the length. However, you can use the #[code SHAPE]
    | flag, with each #[code d] representing a digit:

+code.
    [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
     {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]

p
    | This will match phone numbers of the format #[strong (123) 4567 8901] or
    | #[strong (123) 4567-8901]. To also match formats like #[strong (123) 456 789],
    | you can add a second pattern using #[code 'ddd'] in place of #[code 'dddd'].
    | By hard-coding some values, you can match only certain, country-specific
    | numbers. For example, here's a pattern to match the most common formats of
    | #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany") international German numbers]:

+code.
    [{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
     {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]

p
    | Depending on the formats your application needs to match, creating an
    | extensive set of rules like this is often better than training a model.
    | It'll produce more predictable results, is much easier to modify and
    | extend, and doesn't require any training data – only a set of
    | test cases.
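p
    | To wrap up, here's a minimal sketch of the US-style pattern above in
    | action. The match ID #[code 'PhoneNumber'] and the example sentence are
    | made up for illustration:

+code.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en')
    matcher = Matcher(nlp.vocab)
    pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
               {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]
    matcher.add('PhoneNumber', None, pattern)

    doc = nlp(u'You can reach the office at (123) 4567 8901 until Friday.')
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text) # should print "(123) 4567 8901"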