|
|
|
|
//- 💫 DOCS > USAGE > RULE-BASED MATCHING
|
|
|
|
|
|
|
|
|
|
include ../../_includes/_mixins
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| spaCy features a rule-matching engine that operates over tokens, similar
|
|
|
|
|
| to regular expressions. The rules can refer to token annotations (e.g.
|
|
|
|
|
| the token #[code text] or #[code tag_]) and flags (e.g. #[code IS_PUNCT]).
|
|
|
|
|
| The rule matcher also lets you pass in a custom callback
|
|
|
|
|
| to act on matches – for example, to merge entities and apply custom labels.
|
|
|
|
|
| You can also associate patterns with entity IDs, to allow some basic
|
|
|
|
|
| entity linking or disambiguation.
|
|
|
|
|
|
|
|
|
|
//-+aside("What about \"real\" regular expressions?")
|
|
|
|
|
|
|
|
|
|
+h(2, "adding-patterns") Adding patterns
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| Let's say we want to enable spaCy to find a combination of three tokens:
|
|
|
|
|
|
|
|
|
|
+list("numbers")
|
|
|
|
|
+item
|
|
|
|
|
| A token whose #[strong lowercase form matches "hello"], e.g. "Hello"
|
|
|
|
|
| or "HELLO".
|
|
|
|
|
+item
|
|
|
|
|
| A token whose #[strong #[code is_punct] flag is set to #[code True]],
|
|
|
|
|
| i.e. any punctuation.
|
|
|
|
|
+item
|
|
|
|
|
| A token whose #[strong lowercase form matches "world"], e.g. "World"
|
|
|
|
|
| or "WORLD".
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
[{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| First, we initialise the #[code Matcher] with a vocab. The matcher must
|
|
|
|
|
| always share the same vocab with the documents it will operate on. We
|
|
|
|
|
| can now call #[+api("matcher#add") #[code matcher.add()]] with an ID and
|
|
|
|
|
| our custom pattern. The second argument lets you pass in an optional
|
|
|
|
|
| callback function to invoke on a successful match. For now, we set it
|
|
|
|
|
| to #[code None].
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
import spacy
|
|
|
|
|
from spacy.matcher import Matcher
|
|
|
|
|
|
|
|
|
|
nlp = spacy.load('en')
|
|
|
|
|
matcher = Matcher(nlp.vocab)
|
|
|
|
|
# add match ID "HelloWorld" with no callback and one pattern
|
|
|
|
|
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
|
|
|
|
|
matcher.add('HelloWorld', None, pattern)
|
|
|
|
|
|
|
|
|
|
doc = nlp(u'Hello, world! Hello world!')
|
|
|
|
|
matches = matcher(doc)
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| The matcher returns a list of #[code (match_id, start, end)] tuples – in
|
|
|
|
|
| this case, #[code [(match_id, 0, 3)]], where #[code match_id] is the integer ID
| of the string #[code 'HelloWorld']. The match maps to the span
| #[code doc[0:3]] of our original document. Optionally, we could also
|
|
|
|
|
| choose to add more than one pattern, for example to also match sequences
|
|
|
|
|
| without punctuation between "hello" and "world":
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
matcher.add('HelloWorld', None,
|
|
|
|
|
[{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
|
|
|
|
|
[{'LOWER': 'hello'}, {'LOWER': 'world'}])
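p
| To see how the #[code start] and #[code end] values relate to token
| positions, here is a minimal pure-Python sketch. It is not spaCy's
| implementation: #[code make_token] and #[code find_matches] are
| hypothetical helpers, tokens are plain dictionaries, the text is
| tokenized by hand, and the match key stays a string instead of an ID.

```python
# Simplified stand-in for the matcher: tokens are plain dicts of attributes.
def make_token(text):
    return {
        'ORTH': text,
        'LOWER': text.lower(),
        # Rough approximation of IS_PUNCT: no letters, digits or whitespace.
        'IS_PUNCT': all(not ch.isalnum() and not ch.isspace() for ch in text),
    }


def find_matches(patterns, tokens):
    """Return (key, start, end) for every place a pattern fits the tokens."""
    matches = []
    for start in range(len(tokens)):
        for key, pattern in patterns:
            end = start + len(pattern)
            window = tokens[start:end]
            if len(window) == len(pattern) and all(
                all(token.get(attr) == value for attr, value in spec.items())
                for spec, token in zip(pattern, window)
            ):
                matches.append((key, start, end))
    return matches


# Hand-tokenized version of 'Hello, world! Hello world!'
tokens = [make_token(t) for t in ['Hello', ',', 'world', '!', 'Hello', 'world', '!']]
patterns = [
    ('HelloWorld', [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]),
    ('HelloWorld', [{'LOWER': 'hello'}, {'LOWER': 'world'}]),
]
matches = find_matches(patterns, tokens)
print(matches)
```

p
| The three-token pattern covers tokens 0 to 2, so its span ends at index 3,
| while the two-token pattern picks up the second "Hello world".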
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| By default, the matcher will only return the matches and
|
|
|
|
|
| #[strong not do anything else], like merge entities or assign labels.
|
|
|
|
|
| This is all up to you and can be defined individually for each pattern,
|
|
|
|
|
| by passing in a callback function as the #[code on_match] argument on
|
|
|
|
|
| #[code add()]. This is useful, because it lets you write entirely custom
|
|
|
|
|
| and #[strong pattern-specific logic]. For example, you might want to
|
|
|
|
|
| merge #[em some] patterns into one token, while adding entity labels for
|
|
|
|
|
| other pattern types. You shouldn't have to create different matchers for
|
|
|
|
|
| each of those processes.
|
|
|
|
|
|
|
|
|
|
+h(2, "on_match") Adding #[code on_match] rules
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| To move on to a more realistic example, let's say you're working with a
|
|
|
|
|
| large corpus of blog articles, and you want to match all mentions of
|
|
|
|
|
| "Google I/O" (which spaCy tokenizes as #[code ['Google', 'I', '/', 'O']]).
|
|
|
|
|
| To be safe, you match on the uppercase form of the tokens, so that variants
| like "Google i/o" are covered as well. You also add a second pattern with an added
| #[code {'IS_DIGIT': True}] token – this will make sure you also match on
|
|
|
|
|
| "Google I/O 2017". If your pattern matches, spaCy should execute your
|
|
|
|
|
| custom callback function #[code add_event_ent].
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
import spacy
|
|
|
|
|
from spacy.matcher import Matcher
|
|
|
|
|
|
|
|
|
|
nlp = spacy.load('en')
|
|
|
|
|
matcher = Matcher(nlp.vocab)
|
|
|
|
|
|
|
|
|
|
# Get the ID of the 'EVENT' entity type. This is required to set an entity.
|
|
|
|
|
EVENT = nlp.vocab.strings['EVENT']
|
|
|
|
|
|
|
|
|
|
def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entities. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    doc.ents += ((EVENT, start, end),)
|
|
|
|
|
|
|
|
|
|
matcher.add('GoogleIO', add_event_ent,
|
|
|
|
|
[{'ORTH': 'Google'}, {'UPPER': 'I'}, {'ORTH': '/'}, {'UPPER': 'O'}],
|
|
|
|
|
[{'ORTH': 'Google'}, {'UPPER': 'I'}, {'ORTH': '/'}, {'UPPER': 'O'}, {'IS_DIGIT': True}])
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| In addition to mentions of "Google I/O", your data also contains some
|
|
|
|
|
| annoying pre-processing artefacts, like leftover HTML line breaks
|
|
|
|
|
| (e.g. #[code <br>] or #[code <BR/>]). While you're at it,
|
|
|
|
|
| you want to merge those into one token and flag them, to make sure you
|
|
|
|
|
| can easily ignore them later. So you add a second pattern and pass in a
|
|
|
|
|
| function #[code merge_and_flag]:
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
# Add a new custom flag to the vocab, which is always False by default.
|
|
|
|
|
# BAD_HTML_FLAG will be the flag ID, which we can use to set it to True on the span.
|
|
|
|
|
BAD_HTML_FLAG = nlp.vocab.add_flag(lambda text: False)
|
|
|
|
|
|
|
|
|
|
def merge_and_flag(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    span.merge(is_stop=True) # merge (and mark it as a stop word, just in case)
    span.set_flag(BAD_HTML_FLAG, True) # set BAD_HTML_FLAG
|
|
|
|
|
|
|
|
|
|
matcher.add('BAD_HTML', merge_and_flag,
|
|
|
|
|
[{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
|
|
|
|
|
[{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])
|
|
|
|
|
|
|
|
|
|
+aside("Tip: Visualizing matches")
|
|
|
|
|
| When working with entities, you can use #[+api("displacy") displaCy]
|
|
|
|
|
| to quickly generate a NER visualization from your updated #[code Doc],
|
|
|
|
|
| which can be exported as an HTML file:
|
|
|
|
|
|
|
|
|
|
+code.o-no-block.
|
|
|
|
|
from spacy import displacy
|
|
|
|
|
html = displacy.render(doc, style='ent', page=True,
|
|
|
|
|
options={'ents': ['EVENT']})
|
|
|
|
|
|
|
|
|
|
| For more info and examples, see the usage guide on
|
|
|
|
|
| #[+a("/docs/usage/visualizers") visualizing spaCy].
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| We can now call the matcher on our documents. The patterns will be
|
|
|
|
|
| matched in the order they occur in the text. The matcher will then
|
|
|
|
|
| iterate over the matches, look up the callback for the match ID
|
|
|
|
|
| that was matched, and invoke it.
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
doc = nlp(LOTS_OF_TEXT)
|
|
|
|
|
matcher(doc)
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| When the callback is invoked, it is
|
|
|
|
|
| passed four arguments: the matcher itself, the document, the position of
|
|
|
|
|
| the current match, and the total list of matches. This allows you to
|
|
|
|
|
| write callbacks that consider the entire set of matched phrases, so that
|
|
|
|
|
| you can resolve overlaps and other conflicts in whatever way you prefer.
|
|
|
|
|
|
|
|
|
|
+table(["Argument", "Type", "Description"])
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code matcher]
|
|
|
|
|
+cell #[code Matcher]
|
|
|
|
|
+cell The matcher instance.
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code doc]
|
|
|
|
|
+cell #[code Doc]
|
|
|
|
|
+cell The document the matcher was used on.
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code i]
|
|
|
|
|
+cell int
|
|
|
|
|
+cell Index of the current match (#[code matches[i]]).
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code matches]
|
|
|
|
|
+cell list
|
|
|
|
|
+cell
|
|
|
|
|
| A list of #[code (match_id, start, end)] tuples, describing the
|
|
|
|
|
| matches. A match tuple describes a span #[code doc[start:end]].
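p
| Because every callback sees the full list of matches, overlap handling
| can be written as an ordinary function over the match tuples. The sketch
| below shows one possible policy (keep the longest span, drop anything
| that overlaps it). #[code keep_longest] is a hypothetical helper, not
| part of spaCy, and the match IDs are shown as strings for readability.

```python
def keep_longest(matches):
    """Drop matches that overlap a longer (or equally long, earlier) match."""
    # Longest spans first; ties go to the span that starts earlier.
    ranked = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    taken = set()  # token positions already claimed by a kept match
    for match_id, start, end in ranked:
        positions = set(range(start, end))
        if positions & taken:
            continue  # overlaps something we already kept
        taken |= positions
        kept.append((match_id, start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# "Google I/O" and "Google I/O 2017" both match at token 3.
matches = [('GoogleIO', 3, 7), ('GoogleIO', 3, 8), ('BAD_HTML', 10, 13)]
resolved = keep_longest(matches)
print(resolved)
```

p
| Other policies, such as preferring the first match or a specific match
| ID, only require a different sort key.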
|
|
|
|
|
|
|
|
|
|
+h(2, "quantifiers") Using operators and quantifiers
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| The matcher also lets you use quantifiers, specified as the #[code 'OP']
|
|
|
|
|
| key. Quantifiers let you define sequences of tokens to be matched, e.g.
|
|
|
|
|
| one or more punctuation marks, or specify optional tokens. Note that there
|
|
|
|
|
| are no nested or scoped quantifiers – instead, you can build those
|
|
|
|
|
| behaviours with #[code on_match] callbacks.
|
|
|
|
|
|
|
|
|
|
+aside("Problems with quantifiers")
|
|
|
|
|
| Using quantifiers may lead to unexpected results when matching
|
|
|
|
|
| variable-length patterns, for example if the next token would also be
|
|
|
|
|
| matched by the previous token. This problem should be resolved in a future
|
|
|
|
|
| release. For more information, see
|
|
|
|
|
| #[+a(gh("spaCy") + "/issues/864") this issue].
|
|
|
|
|
|
|
|
|
|
+table([ "OP", "Description", "Example"])
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code !]
|
|
|
|
|
+cell match exactly 0 times
|
|
|
|
|
+cell negation
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code *]
|
|
|
|
|
+cell match 0 or more times
|
|
|
|
|
+cell optional, variable number
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code +]
|
|
|
|
|
+cell match 1 or more times
|
|
|
|
|
+cell mandatory, variable number
|
|
|
|
|
|
|
|
|
|
+row
|
|
|
|
|
+cell #[code ?]
|
|
|
|
|
+cell match 0 or 1 times
|
|
|
|
|
+cell optional, max one
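p
| The behaviour of #[code ?], #[code *] and #[code +] can be illustrated
| with a small greedy, backtracking matcher over plain dictionaries. This
| is only a sketch of the idea and makes no claim about how spaCy
| implements quantifiers. #[code tok], #[code token_matches] and
| #[code match_at] are hypothetical helpers, and the #[code !] operator
| is deliberately left out.

```python
def tok(text, is_punct=False):
    """Build a toy token as a dict of the attributes used below."""
    return {'LOWER': text.lower(), 'IS_PUNCT': is_punct}


def token_matches(spec, token):
    """Check a toy token against one pattern entry, ignoring the 'OP' key."""
    return all(token.get(k) == v for k, v in spec.items() if k != 'OP')


def match_at(pattern, tokens, start):
    """Return the end index of a greedy match beginning at start, or None."""
    def walk(p, t):
        if p == len(pattern):
            return t  # every pattern entry has been satisfied
        spec = pattern[p]
        op = spec.get('OP')
        here = t < len(tokens) and token_matches(spec, tokens[t])
        if op is None:  # exactly one token
            return walk(p + 1, t + 1) if here else None
        if op == '?':  # zero or one token, preferring one
            if here:
                end = walk(p + 1, t + 1)
                if end is not None:
                    return end
            return walk(p + 1, t)
        if op in ('*', '+'):  # zero/one or more tokens, longest run first
            run = 0
            while t + run < len(tokens) and token_matches(spec, tokens[t + run]):
                run += 1
            minimum = 1 if op == '+' else 0
            for used in range(run, minimum - 1, -1):
                end = walk(p + 1, t + used)
                if end is not None:
                    return end
            return None
        raise ValueError('unsupported operator: %r' % op)
    return walk(0, start)


def hello_world(op):
    return [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': op}, {'LOWER': 'world'}]


two_marks = [tok('Hello'), tok('!', True), tok('!', True), tok('world')]
no_marks = [tok('Hello'), tok('world')]

plus_end = match_at(hello_world('+'), two_marks, 0)       # both marks consumed
optional_end = match_at(hello_world('?'), two_marks, 0)   # at most one mark allowed
star_end = match_at(hello_world('*'), no_marks, 0)        # zero marks is fine
plus_missing = match_at(hello_world('+'), no_marks, 0)    # needs at least one mark
```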
|
|
|
|
|
|
|
|
|
|
+h(2, "example1") Example: Using linguistic annotations
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| Let's say you're analysing user comments and you want to find out what
|
|
|
|
|
| people are saying about Facebook. You want to start off by finding
|
|
|
|
|
| adjectives following "Facebook is" or "Facebook was". This is obviously
|
|
|
|
|
| a very rudimentary solution, but it'll be fast, and a great way to get an
| idea of what's in your data. Your pattern could look like this:
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
[{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| This translates to a token whose lowercase form matches "facebook"
|
|
|
|
|
| (like Facebook, facebook or FACEBOOK), followed by a token with the lemma
|
|
|
|
|
| "be" (for example, is, was, or 's), followed by an #[strong optional] adverb,
|
|
|
|
|
| followed by an adjective. Using the linguistic annotations here is
|
|
|
|
|
| especially useful, because you can tell spaCy to match "Facebook's
|
|
|
|
|
| annoying", but #[strong not] "Facebook's annoying ads". The optional
|
|
|
|
|
| adverb makes sure you won't miss adjectives with intensifiers, like
|
|
|
|
|
| "pretty awful" or "very nice".
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| To get a quick overview of the results, you could collect all sentences
|
|
|
|
|
| containing a match and render them with the
|
|
|
|
|
| #[+a("/docs/usage/visualizers") displaCy visualizer].
|
|
|
|
|
| In the callback function, you'll have access to the #[code start] and
|
|
|
|
|
| #[code end] of each match, as well as the parent #[code Doc]. This lets
|
|
|
|
|
| you determine the sentence containing the match,
|
|
|
|
|
| #[code doc[start : end].sent], and calculate the start and end of the
|
|
|
|
|
| matched span within the sentence. Using displaCy in
|
|
|
|
|
| #[+a("/docs/usage/visualizers#manual-usage") "manual" mode] lets you
|
|
|
|
|
| pass in a list of dictionaries containing the text and entities to render.
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
import spacy
from spacy import displacy
|
|
|
|
|
from spacy.matcher import Matcher
|
|
|
|
|
|
|
|
|
|
nlp = spacy.load('en')
|
|
|
|
|
matcher = Matcher(nlp.vocab)
|
|
|
|
|
matched_sents = [] # collect data of matched sentences to be visualized
|
|
|
|
|
|
|
|
|
|
def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]  # matched span
    sent = span.sent  # sentence containing matched span
    # append mock entity for match in displaCy style to matched_sents
    # get the match span by offsetting the start and end of the span with the
    # start of the sentence in the doc (displaCy expects character offsets)
    match_ents = [{'start': span.start_char - sent.start_char,
                   'end': span.end_char - sent.start_char,
                   'label': 'MATCH'}]
    matched_sents.append({'text': sent.text, 'ents': match_ents})
|
|
|
|
|
|
|
|
|
|
pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
|
|
|
|
|
{'POS': 'ADJ'}]
|
|
|
|
|
matcher.add('FacebookIs', collect_sents, pattern) # add pattern
|
|
|
|
|
matches = matcher(nlp(LOTS_OF_TEXT)) # match on your text
|
|
|
|
|
|
|
|
|
|
# serve visualization of sentences containing match with displaCy
|
|
|
|
|
# set manual=True to make displaCy render straight from a dictionary
|
|
|
|
|
displacy.serve(matched_sents, style='ent', manual=True)
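p
| The offset arithmetic is easy to get wrong, so it can help to check it
| on plain strings first. The sketch below assumes that displaCy's manual
| entity format expects character offsets relative to the rendered text.
| #[code sentence_relative_ent] is a hypothetical helper and the sentence
| boundaries are located by hand.

```python
def sentence_relative_ent(sent_start_char, span_start_char, span_end_char,
                          label='MATCH'):
    """Convert document-level character offsets to sentence-level ones."""
    return {'start': span_start_char - sent_start_char,
            'end': span_end_char - sent_start_char,
            'label': label}


text = 'I like it. Facebook is pretty awful.'

# Locate the second sentence and the matched phrase by hand.
sent_start = text.index('Facebook')
sent_text = text[sent_start:]
span_start = sent_start
span_end = text.index('awful') + len('awful')

ent = sentence_relative_ent(sent_start, span_start, span_end)
entry = {'text': sent_text, 'ents': [ent]}

# Slicing the sentence with the new offsets should give back the match.
highlighted = sent_text[ent['start']:ent['end']]
print(highlighted)
```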
|
|
|
|
|
|
|
|
|
|
+h(2, "example2") Example: Phone numbers
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| Phone numbers can have many different formats and matching them is often
|
|
|
|
|
| tricky. During tokenization, spaCy will leave sequences of numbers intact
|
|
|
|
|
| and only split on whitespace and punctuation. This means that your match
|
|
|
|
|
| pattern will have to look out for number sequences of a certain length,
|
|
|
|
|
| surrounded by specific punctuation – depending on the
|
|
|
|
|
| #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers") national conventions].
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| The #[code IS_DIGIT] flag is not very helpful here, because it doesn't
|
|
|
|
|
| tell us anything about the length. However, you can use the #[code SHAPE]
|
|
|
|
|
| attribute, with each #[code d] representing a digit:
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
[{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
|
|
|
|
|
{'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| This will match phone numbers of the format #[strong (123) 4567 8901] or
|
|
|
|
|
| #[strong (123) 4567-8901]. To also match formats like #[strong (123) 456 789],
|
|
|
|
|
| you can add a second pattern using #[code 'ddd'] in place of #[code 'dddd'].
|
|
|
|
|
| By hard-coding some values, you can match only certain, country-specific
|
|
|
|
|
| numbers. For example, here's a pattern to match the most common formats of
|
|
|
|
|
| #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany") international German numbers]:
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
[{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
|
|
|
|
|
{'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]
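p
| To get a feel for the #[code SHAPE] values used in the patterns above,
| here is a simplified approximation in plain Python. #[code simple_shape]
| is a hypothetical helper, not spaCy's own function. spaCy's shape
| feature may additionally shorten long runs of the same character class,
| so it is worth checking #[code token.shape_] on real tokens before
| relying on long values.

```python
def simple_shape(text):
    """Map digits to 'd', lowercase letters to 'x', uppercase to 'X'.

    All other characters (punctuation, symbols) are kept unchanged.
    """
    out = []
    for ch in text:
        if ch.isdigit():
            out.append('d')
        elif ch.isalpha():
            out.append('X' if ch.isupper() else 'x')
        else:
            out.append(ch)
    return ''.join(out)


# Hand-tokenized pieces of '(123) 4567-8901'
pieces = ['(', '123', ')', '4567', '-', '8901']
shapes = [simple_shape(piece) for piece in pieces]
print(shapes)
```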
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| Depending on the formats your application needs to match, creating an
|
|
|
|
|
| extensive set of rules like this is often better than training a model.
|
|
|
|
|
| It'll produce more predictable results, is much easier to modify and
|
|
|
|
|
| extend, and doesn't require any training data – only a set of
|
|
|
|
|
| test cases.
|
|
|
|
|
|
|
|
|
|
+h(2, "example3") Example: Hashtags and emoji on social media
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| Social media posts, especially tweets, can be difficult to work with.
|
|
|
|
|
| They're very short and often contain various emoji and hashtags. By only
|
|
|
|
|
| looking at the plain text, you'll lose a lot of valuable semantic
|
|
|
|
|
| information.
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| Let's say you've extracted a large sample of social media posts on a
|
|
|
|
|
| specific topic, for example posts mentioning a brand name or product.
|
|
|
|
|
| As the first step of your data exploration, you want to filter out posts
|
|
|
|
|
| containing certain emoji and use them to assign a general sentiment
|
|
|
|
|
| score, based on whether the expressed emotion is positive or negative,
|
|
|
|
|
| e.g. #[span.o-icon.o-icon--inline 😀] or #[span.o-icon.o-icon--inline 😞].
|
|
|
|
|
| You also want to find, merge and label hashtags like
|
|
|
|
|
| #[code #MondayMotivation], to be able to ignore or analyse them later.
|
|
|
|
|
|
|
|
|
|
+aside("Note on sentiment analysis")
|
|
|
|
|
| Ultimately, sentiment analysis is not always #[em that] easy. In
|
|
|
|
|
| addition to the emoji, you'll also want to take specific words into
|
|
|
|
|
| account and check the #[code subtree] for intensifiers like "very", to
|
|
|
|
|
| increase the sentiment score. At some point, you might also want to train
|
|
|
|
|
| a sentiment model. However, the approach described in this example is
|
|
|
|
|
| very useful for #[strong bootstrapping rules to collect training data].
|
|
|
|
|
| It's also an incredibly fast way to gather first insights into your data
|
|
|
|
|
| – with about 1 million tweets, you'd be looking at a processing time of
|
|
|
|
|
| #[strong under 1 minute].
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| By default, spaCy's tokenizer will split emoji into separate tokens. This
|
|
|
|
|
| means that you can create a pattern for one or more emoji tokens.
|
|
|
|
|
| Valid hashtags usually consist of a #[code #], plus a sequence of
|
|
|
|
|
| ASCII characters with no whitespace, making them easy to match as well.
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
from spacy.lang.en import English
|
|
|
|
|
from spacy.matcher import Matcher
|
|
|
|
|
|
|
|
|
|
nlp = English() # we only want the tokenizer, so no need to load a model
|
|
|
|
|
matcher = Matcher(nlp.vocab)
|
|
|
|
|
|
|
|
|
|
pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍'] # positive emoji
|
|
|
|
|
neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒'] # negative emoji
|
|
|
|
|
|
|
|
|
|
# add patterns to match one or more emoji tokens
|
|
|
|
|
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
|
|
|
|
|
neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]
|
|
|
|
|
|
|
|
|
|
matcher.add('HAPPY', label_sentiment, *pos_patterns) # add positive pattern
|
|
|
|
|
matcher.add('SAD', label_sentiment, *neg_patterns) # add negative pattern
|
|
|
|
|
|
|
|
|
|
# add pattern to merge valid hashtag, i.e. '#' plus any ASCII token
|
|
|
|
|
matcher.add('HASHTAG', merge_hashtag, [{'ORTH': '#'}, {'IS_ASCII': True}])
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| Because the #[code on_match] callback receives the ID of each match, you
|
|
|
|
|
| can use the same function to handle the sentiment assignment for both
|
|
|
|
|
| the positive and negative pattern. To keep it simple, we'll either add
|
|
|
|
|
| or subtract #[code 0.1] points – this way, the score will also reflect
|
|
|
|
|
| combinations of emoji, even positive #[em and] negative ones.
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| With a library like
|
|
|
|
|
| #[+a("https://github.com/bcongdon/python-emojipedia") Emojipedia],
|
|
|
|
|
| we can also retrieve a short description for each emoji – for example,
|
|
|
|
|
| #[span.o-icon.o-icon--inline 😍]'s official title is "Smiling Face With
|
|
|
|
|
| Heart-Eyes". Assigning it to the merged token's norm will make it
|
|
|
|
|
| available as #[code token.norm_].
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
from emojipedia import Emojipedia # installation: pip install emojipedia
|
|
|
|
|
|
|
|
|
|
def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == 'HAPPY': # don't forget to get string!
        doc.sentiment += 0.1 # add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == 'SAD':
        doc.sentiment -= 0.1 # subtract 0.1 for negative sentiment
    span = doc[start : end]
    emoji = Emojipedia.search(span[0].text) # get data for emoji
    span.merge(norm=emoji.title) # merge span and set NORM to emoji title
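p
| The scoring rule itself can be tried out without spaCy or Emojipedia.
| The following sketch applies the same plus or minus #[code 0.1] logic
| to a hand-tokenized post. #[code emoji_score] is a hypothetical helper
| and the emoji sets are shortened versions of the lists above.

```python
POS_EMOJI = {u'😀', u'😃', u'😂', u'😊', u'😍'}  # positive emoji
NEG_EMOJI = {u'😞', u'😠', u'😩', u'😢', u'😭'}  # negative emoji


def emoji_score(tokens, step=0.1):
    """Add `step` per positive emoji token and subtract it per negative one."""
    score = 0.0
    for token in tokens:
        if token in POS_EMOJI:
            score += step
        elif token in NEG_EMOJI:
            score -= step
    return score


# Two positive emoji and one negative emoji: the post ends up mildly positive.
mixed = emoji_score([u'Great', u'day', u'😀', u'😍', u'😞'])
neutral = emoji_score([u'No', u'emoji', u'here'])
print(round(mixed, 2), neutral)
```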
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| To label the hashtags, we first need to add a new custom flag.
|
|
|
|
|
| #[code IS_HASHTAG] will be the flag's ID, which you can use to assign it
|
|
|
|
|
| to the hashtag's span, and check its value via a token's
|
|
|
|
|
| #[+api("token#check_flag") #[code check_flag()]] method. On each
|
|
|
|
|
| match, we merge the hashtag and assign the flag.
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
# Add a new custom flag to the vocab, which is always False by default
|
|
|
|
|
IS_HASHTAG = nlp.vocab.add_flag(lambda text: False)
|
|
|
|
|
|
|
|
|
|
def merge_hashtag(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]
    span.merge() # merge hashtag
    span.set_flag(IS_HASHTAG, True) # set IS_HASHTAG to True
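p
| As a rough illustration of what merging and flagging achieves, the
| sketch below does the same on a plain list of strings: a #[code #]
| followed by an ASCII token becomes one item, paired with a boolean that
| plays the role of #[code IS_HASHTAG]. #[code is_ascii] and
| #[code merge_hashtags] are hypothetical helpers, not spaCy functions.

```python
def is_ascii(text):
    """True if every character is in the ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def merge_hashtags(tokens):
    """Return (text, is_hashtag) pairs, merging '#' with a following ASCII token."""
    merged = []
    i = 0
    while i < len(tokens):
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if tokens[i] == '#' and following is not None and is_ascii(following):
            merged.append(('#' + following, True))
            i += 2  # skip the token that was merged into the hashtag
        else:
            merged.append((tokens[i], False))
            i += 1
    return merged


tokens = ['Happy', '#', 'MondayMotivation', 'everyone', '#']
merged = merge_hashtags(tokens)
hashtags = [text for text, is_hashtag in merged if is_hashtag]
print(hashtags)
```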
|
|
|
|
|
|
|
|
|
|
p
|
|
|
|
|
| To process a stream of social media posts, we can use
|
|
|
|
|
| #[+api("language#pipe") #[code Language.pipe()]], which will return a
|
|
|
|
|
| stream of #[code Doc] objects that we can pass to
|
|
|
|
|
| #[+api("matcher#pipe") #[code Matcher.pipe()]].
|
|
|
|
|
|
|
|
|
|
+code.
|
|
|
|
|
docs = nlp.pipe(LOTS_OF_TWEETS)
|
|
|
|
|
matches = matcher.pipe(docs)
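p
| One thing to keep in mind with streamed processing is that nothing
| happens until the stream is consumed. The sketch below assumes that both
| #[code pipe()] methods behave like Python generators and demonstrates
| the effect with two hypothetical stand-ins, #[code fake_nlp_pipe] and
| #[code fake_matcher_pipe], which are not part of spaCy.

```python
seen = []  # records when each text is actually processed


def fake_nlp_pipe(texts):
    """Hypothetical stand-in for nlp.pipe(): yields one toy 'doc' per text."""
    for text in texts:
        seen.append(text)
        yield {'text': text, 'hashtags': 0}


def fake_matcher_pipe(docs):
    """Hypothetical stand-in for matcher.pipe(): annotate each doc, then yield it."""
    for doc in docs:
        doc['hashtags'] = doc['text'].count('#')
        yield doc


tweets = ['Loving #spaCy', 'No tags here']
stream = fake_matcher_pipe(fake_nlp_pipe(tweets))

processed_before = len(seen)  # still 0: building the stream does no work
results = list(stream)        # iterating runs both stages, one doc at a time
processed_after = len(seen)
print(processed_before, processed_after)
```

p
| In a real pipeline, this means you would loop over the result of
| #[code matcher.pipe(docs)] for the callbacks to run on each document.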
|