//- 💫 DOCS > USAGE > RULE-BASED MATCHING

p
    | spaCy features a rule-matching engine, the #[+api("matcher") #[code Matcher]],
    | that operates over tokens, similar to regular expressions. The rules can
    | refer to token annotations (e.g. the token #[code text] or #[code tag_])
    | and flags (e.g. #[code IS_PUNCT]). The rule matcher also lets you pass in
    | a custom callback to act on matches – for example, to merge entities and
    | apply custom labels. You can also associate patterns with entity IDs, to
    | allow some basic entity linking or disambiguation. To match large
    | terminology lists, you can use the
    | #[+api("phrasematcher") #[code PhraseMatcher]], which accepts
    | #[code Doc] objects as match patterns.

+h(3, "adding-patterns") Adding patterns

p
    | Let's say we want to enable spaCy to find a combination of three tokens:

+list("numbers")
    +item
        | A token whose #[strong lowercase form matches "hello"], e.g. "Hello"
        | or "HELLO".
    +item
        | A token whose #[strong #[code is_punct] flag is set to #[code True]],
        | i.e. any punctuation.
    +item
        | A token whose #[strong lowercase form matches "world"], e.g. "World"
        | or "WORLD".

+code.
    [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

p
    | First, we initialise the #[code Matcher] with a vocab. The matcher must
    | always share the same vocab with the documents it will operate on. We
    | can now call #[+api("matcher#add") #[code matcher.add()]] with an ID and
    | our custom pattern. The second argument lets you pass in an optional
    | callback function to invoke on a successful match. For now, we set it
    | to #[code None].

+code-exec.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)

    # add match ID "HelloWorld" with no callback and one pattern
    pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
    matcher.add('HelloWorld', None, pattern)

    doc = nlp(u'Hello, world! Hello world!')
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # get string representation
        span = doc[start:end]  # the matched span
        print(match_id, string_id, start, end, span.text)

p
    | The matcher returns a list of #[code (match_id, start, end)] tuples – in
    | this case, #[code [(15578876784678163569, 0, 2)]], which maps to the
    | span #[code doc[0:2]] of our original document. The #[code match_id]
    | is the #[+a("/usage/spacy-101#vocab") hash value] of the string ID
    | "HelloWorld". To get the string value, you can look up the ID
    | in the #[+api("stringstore") #[code StringStore]].

+code.
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
        span = doc[start:end]  # the matched span

p
    | Optionally, we could also choose to add more than one pattern, for
    | example to also match sequences without punctuation between "hello" and
    | "world":

+code.
    matcher.add('HelloWorld', None,
                [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
                [{'LOWER': 'hello'}, {'LOWER': 'world'}])

p
    | By default, the matcher will only return the matches and
    | #[strong not do anything else], like merge entities or assign labels.
    | This is all up to you and can be defined individually for each pattern,
    | by passing in a callback function as the #[code on_match] argument on
    | #[code add()]. This is useful, because it lets you write entirely custom
    | and #[strong pattern-specific logic]. For example, you might want to
    | merge #[em some] patterns into one token, while adding entity labels for
    | other pattern types. You shouldn't have to create different matchers for
    | each of those processes.
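p
    | As a minimal sketch of this idea (the callback names here are purely
    | illustrative), you can register a different callback for each pattern ID
    | on the same matcher:

+code.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()  # LOWER and ORTH are lexical, so no model is needed here
    matcher = Matcher(nlp.vocab)

    def print_greeting(matcher, doc, i, matches):  # hypothetical callback
        match_id, start, end = matches[i]
        print('Greeting:', doc[start:end].text)

    def print_question(matcher, doc, i, matches):  # hypothetical callback
        match_id, start, end = matches[i]
        print('Question mark:', doc[start:end].text)

    # one matcher, two pattern IDs, each with its own callback
    matcher.add('GREETING', print_greeting, [{'LOWER': 'hello'}, {'LOWER': 'world'}])
    matcher.add('QUESTION', print_question, [{'ORTH': '?'}])

    doc = nlp(u'Hello world, is this a question?')
    matcher(doc)  # invokes the callback registered for each matched pattern ID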
+h(4, "adding-patterns-attributes") Available token attributes

p
    | The available token pattern keys are uppercase versions of the
    | #[+api("token#attributes") #[code Token] attributes]. The most relevant
    | ones for rule-based matching are:

+table(["Attribute", "Description"])
    +row
        +cell #[code ORTH]
        +cell The exact verbatim text of a token.

    +row
        +cell.u-nowrap #[code LOWER]
        +cell The lowercase form of the token text.

    +row
        +cell #[code LENGTH]
        +cell The length of the token text.

    +row
        +cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT]
        +cell
            | Token text consists of alphabetic characters, ASCII characters,
            | digits.

    +row
        +cell.u-nowrap #[code IS_LOWER], #[code IS_UPPER], #[code IS_TITLE]
        +cell Token text is in lowercase, uppercase, titlecase.

    +row
        +cell.u-nowrap #[code IS_PUNCT], #[code IS_SPACE], #[code IS_STOP]
        +cell Token is punctuation, whitespace, stop word.

    +row
        +cell.u-nowrap #[code LIKE_NUM], #[code LIKE_URL], #[code LIKE_EMAIL]
        +cell Token text resembles a number, URL, email.

    +row
        +cell.u-nowrap
            | #[code POS], #[code TAG], #[code DEP], #[code LEMMA],
            | #[code SHAPE]
        +cell
            | The token's simple and extended part-of-speech tag, dependency
            | label, lemma, shape.

    +row
        +cell.u-nowrap #[code ENT_TYPE]
        +cell The token's entity label.

+h(4, "adding-patterns-wildcard") Using wildcard token patterns
    +tag-new(2)

p
    | While the token attributes offer many options to write highly specific
    | patterns, you can also use an empty dictionary, #[code {}], as a wildcard
    | representing #[strong any token]. This is useful if you know the context
    | of what you're trying to match, but very little about the specific token
    | and its characters. For example, let's say you're trying to extract
    | people's user names from your data. All you know is that they are listed
    | as "User name: {username}". The name itself may contain any character,
    | but no whitespace – so you'll know it will be handled as one token.

+code.
    [{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}]

+h(4, "quantifiers") Using operators and quantifiers

p
    | The matcher also lets you use quantifiers, specified as the #[code 'OP']
    | key. Quantifiers let you define sequences of tokens to be matched, e.g.
    | one or more punctuation marks, or specify optional tokens. Note that there
    | are no nested or scoped quantifiers – instead, you can build those
    | behaviours with #[code on_match] callbacks.

+table(["OP", "Description"])
    +row
        +cell #[code !]
        +cell Negate the pattern, by requiring it to match exactly 0 times.

    +row
        +cell #[code ?]
        +cell Make the pattern optional, by allowing it to match 0 or 1 times.

    +row
        +cell #[code +]
        +cell Require the pattern to match 1 or more times.

    +row
        +cell #[code *]
        +cell Allow the pattern to match zero or more times.

p
    | In versions before v2.1.0, the semantics of the #[code +] and #[code *]
    | operators behaved inconsistently. They were usually interpreted
    | "greedily", i.e. longer matches were returned where possible. However, if
    | you specified two #[code +] or #[code *] patterns in a row and their
    | matches overlapped, the first operator would behave non-greedily. This
    | quirk in the semantics is corrected in spaCy v2.1.0.
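p
    | As a minimal sketch of the #[code 'OP'] key in action, the pattern below
    | allows one or more punctuation tokens between "hello" and "world":

+code.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()
    matcher = Matcher(nlp.vocab)
    # "hello", then one or more punctuation tokens, then "world"
    pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '+'}, {'LOWER': 'world'}]
    matcher.add('HelloWorld', None, pattern)

    doc = nlp(u'Hello, world! Hello... world!')
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)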
+h(3, "adding-phrase-patterns") Adding phrase patterns

p
    | If you need to match large terminology lists, you can also use the
    | #[+api("phrasematcher") #[code PhraseMatcher]] and create
    | #[+api("doc") #[code Doc]] objects instead of token patterns, which is
    | much more efficient overall. The #[code Doc] patterns can contain single
    | or multiple tokens.

+code-exec.
    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load('en_core_web_sm')
    matcher = PhraseMatcher(nlp.vocab)
    terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
    patterns = [nlp(text) for text in terminology_list]
    matcher.add('TerminologyList', None, *patterns)

    doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
              u"converse in the Oval Office inside the White House in Washington, D.C.")
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)

p
    | Since spaCy is used for processing both the patterns and the text to be
    | matched, you won't have to worry about specific tokenization – for
    | example, you can simply pass in #[code nlp(u"Washington, D.C.")] and
    | won't have to write a complex token pattern covering the exact
    | tokenization of the term.

+h(3, "on_match") Adding #[code on_match] rules

p
    | To move on to a more realistic example, let's say you're working with a
    | large corpus of blog articles, and you want to match all mentions of
    | "Google I/O" (which spaCy tokenizes as #[code ['Google', 'I', '/', 'O']]).
    | To be safe, you only match on the uppercase versions, in case someone has
    | written it as "Google i/o". You also add a second pattern with an added
    | #[code {'IS_DIGIT': True}] token – this will make sure you also match on
    | "Google I/O 2017". If your pattern matches, spaCy should execute your
    | custom callback function #[code add_event_ent].

+code-exec.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)

    # Get the ID of the 'EVENT' entity type. This is required to set an entity.
    EVENT = nlp.vocab.strings['EVENT']

    def add_event_ent(matcher, doc, i, matches):
        # Get the current match and create a tuple of entity label, start and end.
        # Append the entity to the doc's entities. (Don't overwrite doc.ents!)
        match_id, start, end = matches[i]
        entity = (EVENT, start, end)
        doc.ents += (entity,)
        print(doc[start:end].text, entity)

    matcher.add('GoogleIO', add_event_ent,
                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'},
                 {'IS_DIGIT': True}])

    doc = nlp(u"This is a text about Google I/O 2015.")
    matches = matcher(doc)

+aside("Tip: Visualizing matches")
    | When working with entities, you can use #[+api("top-level#displacy") displaCy]
    | to quickly generate a NER visualization from your updated #[code Doc],
    | which can be exported as an HTML file:

    +code.o-no-block.
        from spacy import displacy
        html = displacy.render(doc, style='ent', page=True,
                               options={'ents': ['EVENT']})

    | For more info and examples, see the usage guide on
    | #[+a("/usage/visualizers") visualizing spaCy].

p
    | We can now call the matcher on our documents. The patterns will be
    | matched in the order they occur in the text. The matcher will then
    | iterate over the matches, look up the callback for the match ID
    | that was matched, and invoke it.

+code.
    doc = nlp(YOUR_TEXT_HERE)
    matcher(doc)

p
    | When the callback is invoked, it is passed four arguments: the matcher
    | itself, the document, the position of the current match, and the total
    | list of matches. This allows you to write callbacks that consider the
    | entire set of matched phrases, so that you can resolve overlaps and
    | other conflicts in whatever way you prefer.

+table(["Argument", "Type", "Description"])
    +row
        +cell #[code matcher]
        +cell #[code Matcher]
        +cell The matcher instance.

    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The document the matcher was used on.

    +row
        +cell #[code i]
        +cell int
        +cell Index of the current match (#[code matches[i]]).

    +row
        +cell #[code matches]
        +cell list
        +cell
            | A list of #[code (match_id, start, end)] tuples, describing the
            | matches. A match tuple describes a span #[code doc[start:end]].
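p
    | For example, here's a minimal sketch of a callback (the name
    | #[code keep_longest] is just illustrative) that uses the full list of
    | matches to ignore a match whenever a longer, overlapping match exists:

+code.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()

    def keep_longest(matcher, doc, i, matches):
        # skip this match if a longer overlapping match is in the list
        match_id, start, end = matches[i]
        for other_id, other_start, other_end in matches:
            is_longer = (other_end - other_start) > (end - start)
            if is_longer and other_start <= start and end <= other_end:
                return
        print('Longest match:', doc[start:end].text)

    matcher = Matcher(nlp.vocab)
    matcher.add('GoogleIO', keep_longest,
                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'},
                 {'IS_DIGIT': True}])

    doc = nlp(u'This is a text about Google I/O 2015.')
    matcher(doc)  # only the longer "Google I/O 2015" match is reported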
+table(["Argument", "Type", "Description"]) +row +cell #[code matcher] +cell #[code Matcher] +cell The matcher instance. +row +cell #[code doc] +cell #[code Doc] +cell The document the matcher was used on. +row +cell #[code i] +cell int +cell Index of the current match (#[code matches[i]]). +row +cell #[code matches] +cell list +cell | A list of #[code (match_id, start, end)] tuples, describing the | matches. A match tuple describes a span #[code doc[start:end]]. +h(3, "matcher-pipeline") Using custom pipeline components p | Let's say your data also contains some annoying pre-processing artefacts, | like leftover HTML line breaks (e.g. #[code <br>] or | #[code <BR/>]). To make your text easier to analyse, you want to | merge those into one token and flag them, to make sure you | can ignore them later. Ideally, this should all be done automatically | as you process the text. You can achieve this by adding a | #[+a("/usage/processing-pipelines#custom-components") custom pipeline component] | that's called on each #[code Doc] object, merges the leftover HTML spans | and sets an attribute #[code bad_html] on the token. +code-exec. import spacy from spacy.matcher import Matcher from spacy.tokens import Token # we're using a class because the component needs to be initialised with # the shared vocab via the nlp object class BadHTMLMerger(object): def __init__(self, nlp): # register a new token extension to flag bad HTML Token.set_extension('bad_html', default=False) self.matcher = Matcher(nlp.vocab) self.matcher.add('BAD_HTML', None, [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}], [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}]) def __call__(self, doc): # this method is invoked when the component is called on a Doc matches = self.matcher(doc) spans = [] # collect the matched spans here for match_id, start, end in matches: spans.append(doc[start:end]) for span in spans: span.merge() # merge for token in span: token._.bad_html = True # mark token as bad HTML doc.vocab[token.text].is_stop = True # mark lexeme as stop word return doc nlp = spacy.load('en_core_web_sm') html_merger = BadHTMLMerger(nlp) nlp.add_pipe(html_merger, last=True) # add component to the pipeline doc = nlp(u"Hello<br>world! <br/> This is a test.") for token in doc: print(token.text, token._.bad_html) p | Instead of hard-coding the patterns into the component, you could also | make it take a path to a JSON file containing the patterns. This lets | you reuse the component with different patterns, depending on your | application: +code. html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json') +infobox | For more details and examples of how to | #[strong create custom pipeline components] and | #[strong extension attributes], see the | #[+a("/usage/processing-pipelines") usage guide]. +h(3, "regex") Using regular expressions p | In some cases, only matching tokens and token attributes isn't enough – | for example, you might want to match different spellings of a word, | without having to add a new pattern for each spelling. A simple solution | is to match a regular expression on the #[code Doc]'s #[code text] and | use the #[+api("doc#char_span") #[code Doc.char_span]] method to | create a #[code Span] from the character indices of the match: +code-exec. 
+infobox
    | For more details and examples of how to
    | #[strong create custom pipeline components] and
    | #[strong extension attributes], see the
    | #[+a("/usage/processing-pipelines") usage guide].

+h(3, "regex") Using regular expressions

p
    | In some cases, only matching tokens and token attributes isn't enough –
    | for example, you might want to match different spellings of a word,
    | without having to add a new pattern for each spelling. A simple solution
    | is to match a regular expression on the #[code Doc]'s #[code text] and
    | use the #[+api("doc#char_span") #[code Doc.char_span]] method to
    | create a #[code Span] from the character indices of the match. Note that
    | #[code char_span] returns #[code None] if the character indices don't
    | map onto a valid sequence of tokens.

+code-exec.
    import spacy
    import re

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')

    DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')

    for match in re.finditer(DEFINITELY_PATTERN, doc.text):
        start, end = match.span()  # get matched indices
        span = doc.char_span(start, end)  # create Span from indices
        print(span.text)

p
    | You can also use the regular expression with spaCy's #[code Matcher] by
    | converting it to a token flag. To ensure efficiency, the
    | #[code Matcher] can only access the C-level data. This means that it can
    | either use built-in token attributes or #[strong binary flags].
    | #[+api("vocab#add_flag") #[code Vocab.add_flag]] returns a flag ID which
    | you can use as a key of a token match pattern. Tokens that match the
    | regular expression will return #[code True] for the #[code IS_DEFINITELY]
    | flag.

+code-exec.
    import spacy
    from spacy.matcher import Matcher
    import re

    nlp = spacy.load('en_core_web_sm')
    definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
    IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)

    matcher = Matcher(nlp.vocab)
    matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])

    doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)

p
    | Providing the regular expressions as binary flags also lets you use them
    | in combination with other token patterns – for example, to match the
    | word "definitely" in various spellings, followed by a case-insensitive
    | "not" and an adjective:

+code.
    [{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}]
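p
    | Putting both pieces together, a minimal sketch of that combined pattern
    | might look like this:

+code.
    import re
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
    IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)

    matcher = Matcher(nlp.vocab)
    # flag-based token, followed by "not", followed by an adjective
    matcher.add('DEFINITELY_NOT_ADJ', None,
                [{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}])

    doc = nlp(u"It's deffinitely not bad, but it's definately not great either.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)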
+h(3, "example1") Example: Using linguistic annotations

p
    | Let's say you're analysing user comments and you want to find out what
    | people are saying about Facebook. You want to start off by finding
    | adjectives following "Facebook is" or "Facebook was". This is obviously
    | a very rudimentary solution, but it'll be fast, and a great way to get an
    | idea of what's in your data. Your pattern could look like this:

+code.
    [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]

p
    | This translates to a token whose lowercase form matches "facebook"
    | (like Facebook, facebook or FACEBOOK), followed by a token with the lemma
    | "be" (for example, is, was, or 's), followed by an #[strong optional] adverb,
    | followed by an adjective. Using the linguistic annotations here is
    | especially useful, because you can tell spaCy to match "Facebook's
    | annoying", but #[strong not] "Facebook's annoying ads". The optional
    | adverb makes sure you won't miss adjectives with intensifiers, like
    | "pretty awful" or "very nice".

p
    | To get a quick overview of the results, you could collect all sentences
    | containing a match and render them with the
    | #[+a("/usage/visualizers") displaCy visualizer].
    | In the callback function, you'll have access to the #[code start] and
    | #[code end] of each match, as well as the parent #[code Doc]. This lets
    | you determine the sentence containing the match,
    | #[code doc[start:end].sent], and calculate the start and end of the
    | matched span within the sentence. Using displaCy in
    | #[+a("/usage/visualizers#manual-usage") "manual" mode] lets you
    | pass in a list of dictionaries containing the text and entities to render.

+code-exec.
    import spacy
    from spacy import displacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    matched_sents = []  # collect data of matched sentences to be visualized

    def collect_sents(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        span = doc[start:end]  # matched span
        sent = span.sent  # sentence containing matched span
        # append mock entity for match in displaCy style to matched_sents
        # get the match span by offsetting the start and end of the span with the
        # start and end of the sentence in the doc
        match_ents = [{'start': span.start_char - sent.start_char,
                       'end': span.end_char - sent.start_char,
                       'label': 'MATCH'}]
        matched_sents.append({'text': sent.text, 'ents': match_ents})

    pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
               {'POS': 'ADJ'}]
    matcher.add('FacebookIs', collect_sents, pattern)  # add pattern
    doc = nlp(u"I'd say that Facebook is evil. – Facebook is pretty cool, right?")
    matches = matcher(doc)

    # serve visualization of sentences containing match with displaCy
    # set manual=True to make displaCy render straight from a dictionary
    # (if you're not running the code within a Jupyter environment, you can
    # remove jupyter=True and use displacy.serve instead)
    displacy.render(matched_sents, style='ent', manual=True, jupyter=True)

+h(3, "example2") Example: Phone numbers

p
    | Phone numbers can have many different formats and matching them is often
    | tricky. During tokenization, spaCy will leave sequences of numbers intact
    | and only split on whitespace and punctuation. This means that your match
    | pattern will have to look out for number sequences of a certain length,
    | surrounded by specific punctuation – depending on the
    | #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers") national conventions].

p
    | The #[code IS_DIGIT] flag is not very helpful here, because it doesn't
    | tell us anything about the length. However, you can use the #[code SHAPE]
    | flag, with each #[code d] representing a digit:

+code.
    [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
     {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]

p
    | This will match phone numbers of the format #[strong (123) 4567 8901] or
    | #[strong (123) 4567-8901]. To also match formats like #[strong (123) 456 789],
    | you can add a second pattern using #[code 'ddd'] in place of #[code 'dddd'].
    | By hard-coding some values, you can match only certain, country-specific
    | numbers. For example, here's a pattern to match the most common formats of
    | #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany") international German numbers]:

+code.
    [{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
     {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]

p
    | Depending on the formats your application needs to match, creating an
    | extensive set of rules like this is often better than training a model.
    | It'll produce more predictable results, is much easier to modify and
    | extend, and doesn't require any training data – only a set of
    | test cases.

+code-exec.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'ddd'},
               {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]
    matcher.add('PHONE_NUMBER', None, pattern)

    doc = nlp(u"Call me at (123) 456 789 or (123) 456 789!")
    print([t.text for t in doc])
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)

+h(3, "example3") Example: Hashtags and emoji on social media

p
    | Social media posts, especially tweets, can be difficult to work with.
    | They're very short and often contain various emoji and hashtags. By only
    | looking at the plain text, you'll lose a lot of valuable semantic
    | information.

p
    | Let's say you've extracted a large sample of social media posts on a
    | specific topic, for example posts mentioning a brand name or product.
    | As the first step of your data exploration, you want to filter out posts
    | containing certain emoji and use them to assign a general sentiment
    | score, based on whether the expressed emotion is positive or negative,
    | e.g. #[span.o-icon.o-icon--inline 😀] or #[span.o-icon.o-icon--inline 😞].
    | You also want to find, merge and label hashtags like
    | #[code #MondayMotivation], to be able to ignore or analyse them later.

+aside("Note on sentiment analysis")
    | Ultimately, sentiment analysis is not always #[em that] easy. In
    | addition to the emoji, you'll also want to take specific words into
    | account and check the #[code subtree] for intensifiers like "very", to
    | increase the sentiment score. At some point, you might also want to train
    | a sentiment model. However, the approach described in this example is
    | very useful for #[strong bootstrapping rules to collect training data].
    | It's also an incredibly fast way to gather first insights into your data
    | – with about 1 million tweets, you'd be looking at a processing time of
    | #[strong under 1 minute].

p
    | By default, spaCy's tokenizer will split emoji into separate tokens. This
    | means that you can create a pattern for one or more emoji tokens.
    | Valid hashtags usually consist of a #[code #], plus a sequence of
    | ASCII characters with no whitespace, making them easy to match as well.

+code-exec.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()  # we only want the tokenizer, so no need to load a model
    matcher = Matcher(nlp.vocab)

    pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍']  # positive emoji
    neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒']  # negative emoji

    # add patterns to match one or more emoji tokens
    pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
    neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]

    # function to label the sentiment
    def label_sentiment(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
            doc.sentiment += 0.1  # add 0.1 for positive sentiment
        elif doc.vocab.strings[match_id] == 'SAD':
            doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment

    matcher.add('HAPPY', label_sentiment, *pos_patterns)  # add positive pattern
    matcher.add('SAD', label_sentiment, *neg_patterns)  # add negative pattern
    # add pattern for valid hashtag, i.e. '#' plus any ASCII token
    matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

    doc = nlp(u"Hello world 😀 #MondayMotivation")
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = doc.vocab.strings[match_id]  # look up string ID
        span = doc[start:end]
        print(string_id, span.text)

p
    | Because the #[code on_match] callback receives the ID of each match, you
    | can use the same function to handle the sentiment assignment for both
    | the positive and negative pattern. To keep it simple, we'll either add
    | or subtract #[code 0.1] points – this way, the score will also reflect
    | combinations of emoji, even positive #[em and] negative ones.

p
    | With a library like
    | #[+a("https://github.com/bcongdon/python-emojipedia") Emojipedia],
    | we can also retrieve a short description for each emoji – for example,
    | #[span.o-icon.o-icon--inline 😍]'s official title is "Smiling Face With
    | Heart-Eyes". Assigning it to a
    | #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
    | on the emoji span will make it available as #[code span._.emoji_desc].

+code.
    from emojipedia import Emojipedia  # installation: pip install emojipedia
    from spacy.tokens import Span  # get the global Span object

    Span.set_extension('emoji_desc', default=None)  # register the custom attribute

    def label_sentiment(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
            doc.sentiment += 0.1  # add 0.1 for positive sentiment
        elif doc.vocab.strings[match_id] == 'SAD':
            doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment
        span = doc[start:end]
        emoji = Emojipedia.search(span[0].text)  # get data for emoji
        span._.emoji_desc = emoji.title  # assign emoji description

p
    | To label the hashtags, we can use a
    | #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
    | set on the respective token:

+code-exec.
    import spacy
    from spacy.matcher import Matcher
    from spacy.tokens import Token

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)

    # add pattern for valid hashtag, i.e. '#' plus any ASCII token
    matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

    # register token extension
    Token.set_extension('is_hashtag', default=False)

    doc = nlp(u"Hello world 😀 #MondayMotivation")
    matches = matcher(doc)
    hashtags = []
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == 'HASHTAG':
            hashtags.append(doc[start:end])
    for span in hashtags:
        span.merge()
        for token in span:
            token._.is_hashtag = True

    for token in doc:
        print(token.text, token._.is_hashtag)

p
    | To process a stream of social media posts, we can use
    | #[+api("language#pipe") #[code Language.pipe()]], which will return a
    | stream of #[code Doc] objects that we can pass to
    | #[+api("matcher#pipe") #[code Matcher.pipe()]].

+code.
    docs = nlp.pipe(LOTS_OF_TWEETS)
    matches = matcher.pipe(docs)
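p
    | As a minimal end-to-end sketch – using a small in-memory list in place of
    | a real stream, and simply calling the matcher on each document – this
    | could look like the following:

+code.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()
    matcher = Matcher(nlp.vocab)
    matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

    # stand-in for a real stream of posts
    tweets = [u'Hello world 😀 #MondayMotivation', u'Just an ordinary post']

    for doc in nlp.pipe(tweets):
        matches = matcher(doc)
        spans = [doc[start:end].text for match_id, start, end in matches]
        print(doc.text, spans)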