//- 💫 DOCS > USAGE > RULE-BASED MATCHING

p
    |  spaCy features a rule-matching engine, the #[+api("matcher") #[code Matcher]],
    |  that operates over tokens, similar
    |  to regular expressions. The rules can refer to token annotations (e.g.
    |  the token #[code text] or #[code tag_]), and flags (e.g. #[code IS_PUNCT]).
    |  The rule matcher also lets you pass in a custom callback
    |  to act on matches – for example, to merge entities and apply custom labels.
    |  You can also associate patterns with entity IDs, to allow some basic
    |  entity linking or disambiguation. To match large terminology lists,
    |  you can use the #[+api("phrasematcher") #[code PhraseMatcher]], which
    |  accepts #[code Doc] objects as match patterns.

+h(3, "adding-patterns") Adding patterns

p
    |  Let's say we want to enable spaCy to find a combination of three tokens:

+list("numbers")
    +item
        |  A token whose #[strong lowercase form matches "hello"], e.g. "Hello"
        |  or "HELLO".
    +item
        |  A token whose #[strong #[code is_punct] flag is set to #[code True]],
        |  i.e. any punctuation.
    +item
        |  A token whose #[strong lowercase form matches "world"], e.g. "World"
        |  or "WORLD".

+code.
    [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

p
    |  First, we initialise the #[code Matcher] with a vocab. The matcher must
    |  always share the same vocab with the documents it will operate on. We
    |  can now call #[+api("matcher#add") #[code matcher.add()]] with an ID and
    |  our custom pattern. The second argument lets you pass in an optional
    |  callback function to invoke on a successful match. For now, we set it
    |  to #[code None].

+code-exec.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    # add match ID "HelloWorld" with no callback and one pattern
    pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
    matcher.add('HelloWorld', None, pattern)

    doc = nlp(u'Hello, world! Hello world!')
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # get string representation
        span = doc[start:end]  # the matched span
        print(match_id, string_id, start, end, span.text)

p
    |  The matcher returns a list of #[code (match_id, start, end)] tuples – in
    |  this case, #[code [(15578876784678163569, 0, 3)]], which maps to the
    |  span #[code doc[0:3]] of our original document. The #[code match_id]
    |  is the #[+a("/usage/spacy-101#vocab") hash value] of the string ID
    |  "HelloWorld". To get the string value, you can look up the ID
    |  in the #[+api("stringstore") #[code StringStore]].

+code.
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
        span = doc[start:end]                    # the matched span

p
    |  Optionally, we could also choose to add more than one pattern, for
    |  example to also match sequences without punctuation between "hello" and
    |  "world":

+code.
    matcher.add('HelloWorld', None,
                [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
                [{'LOWER': 'hello'}, {'LOWER': 'world'}])

p
    |  By default, the matcher will only return the matches and
    |  #[strong not do anything else], like merge entities or assign labels.
    |  This is all up to you and can be defined individually for each pattern,
    |  by passing in a callback function as the #[code on_match] argument on
    |  #[code add()]. This is useful, because it lets you write entirely custom
    |  and #[strong pattern-specific logic]. For example, you might want to
    |  merge #[em some] patterns into one token, while adding entity labels for
    |  other pattern types. You shouldn't have to create different matchers for
    |  each of those processes.

+h(4, "adding-patterns-attributes") Available token attributes

p
    |  The available token pattern keys are uppercase versions of the
    |  #[+api("token#attributes") #[code Token] attributes]. The most relevant
    |  ones for rule-based matching are:

+table(["Attribute", "Description"])
    +row
        +cell #[code ORTH]
        +cell The exact verbatim text of a token.

    +row
        +cell.u-nowrap #[code LOWER]
        +cell The lowercase form of the token text.

    +row
        +cell #[code LENGTH]
        +cell The length of the token text.

    +row
        +cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT]
        +cell
            |  Token text consists of alphanumeric characters, ASCII characters,
            |  digits.

    +row
        +cell.u-nowrap #[code IS_LOWER], #[code IS_UPPER], #[code IS_TITLE]
        +cell Token text is in lowercase, uppercase, titlecase.

    +row
        +cell.u-nowrap #[code IS_PUNCT], #[code IS_SPACE], #[code IS_STOP]
        +cell Token is punctuation, whitespace, stop word.

    +row
        +cell.u-nowrap #[code LIKE_NUM], #[code LIKE_URL], #[code LIKE_EMAIL]
        +cell Token text resembles a number, URL, email.

    +row
        +cell.u-nowrap
            |  #[code POS], #[code TAG], #[code DEP], #[code LEMMA],
            |  #[code SHAPE]
        +cell
            |  The token's simple and extended part-of-speech tag, dependency
            |  label, lemma, shape.

    +row
        +cell.u-nowrap #[code ENT_TYPE]
        +cell The token's entity label.

+h(4, "adding-patterns-wildcard") Using wildcard token patterns
    +tag-new(2)

p
    |  While the token attributes offer many options to write highly specific
    |  patterns, you can also use an empty dictionary, #[code {}] as a wildcard
    |  representing #[strong any token]. This is useful if you know the context
    |  of what you're trying to match, but very little about the specific token
    |  and its characters. For example, let's say you're trying to extract
    |  people's user names from your data. All you know is that they are listed
    |  as "User name: {username}". The name itself may contain any character,
    |  but no whitespace – so you'll know it will be handled as one token.

+code.
    [{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}]

+h(4, "quantifiers") Using operators and quantifiers

p
    |  The matcher also lets you use quantifiers, specified as the #[code 'OP']
    |  key. Quantifiers let you define sequences of tokens to be matched, e.g.
    |  one or more punctuation marks, or specify optional tokens. Note that there
    |  are no nested or scoped quantifiers – instead, you can build those
    |  behaviours with #[code on_match] callbacks.

+table([ "OP", "Description"])
    +row
        +cell #[code !]
        +cell Negate the pattern, by requiring it to match exactly 0 times.

    +row
        +cell #[code ?]
        +cell Make the pattern optional, by allowing it to match 0 or 1 times.

    +row
        +cell #[code +]
        +cell Require the pattern to match 1 or more times.

    +row
        +cell #[code *]
        +cell Allow the pattern to match zero or more times.

p
    |  In versions before v2.1.0, the semantics of the #[code +] and #[code *]
    |  operators were inconsistent. They were usually interpreted
    |  "greedily", i.e. longer matches were returned where possible. However, if
    |  you specified two #[code +] and #[code *] patterns in a row and their
    |  matches overlapped, the first operator behaved non-greedily. This quirk
    |  in the semantics is corrected in spaCy v2.1.0.
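
p
    |  To make the operator semantics concrete, here is a toy matcher in plain
    |  Python. It is only an illustrative sketch and not how spaCy's
    |  #[code Matcher] is implemented: the tokens are hand-written
    |  dictionaries, #[code token_matches] and #[code match_ends] are
    |  hypothetical helper names, and the #[code !] operator is left out.

```python
def token_matches(token, spec):
    """True if every attribute in spec (ignoring 'OP') equals the token's value."""
    return all(token.get(key) == value
               for key, value in spec.items() if key != 'OP')

def match_ends(tokens, pattern, start):
    """Return every end index at which `pattern` can finish when it is
    matched against `tokens` beginning at index `start`."""
    if not pattern:
        return {start}  # nothing left to match
    spec, rest = pattern[0], pattern[1:]
    op = spec.get('OP', '1')  # '1' is this sketch's marker for "exactly once"
    ends = set()
    if op in ('?', '*'):
        ends |= match_ends(tokens, rest, start)  # match zero times
    if start < len(tokens) and token_matches(tokens[start], spec):
        if op in ('+', '*'):
            # consume one token, then allow zero or more further repeats
            ends |= match_ends(tokens, [dict(spec, OP='*')] + rest, start + 1)
        else:
            ends |= match_ends(tokens, rest, start + 1)  # consume exactly one
    return ends

# hand-annotated stand-ins for the tokens of "hello!! world" (an assumption)
tokens = [{'LOWER': 'hello', 'IS_PUNCT': False},
          {'LOWER': '!', 'IS_PUNCT': True},
          {'LOWER': '!', 'IS_PUNCT': True},
          {'LOWER': 'world', 'IS_PUNCT': False}]

required = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '+'}, {'LOWER': 'world'}]
optional = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'world'}]

print(match_ends(tokens, required, 0))  # {4}: '+' absorbs both punctuation tokens
print(match_ends(tokens, optional, 0))  # set(): '?' allows at most one
```

p
    |  The sketch returns all possible end positions rather than picking one,
    |  which sidesteps the question of greedy versus non-greedy matching
    |  discussed above.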

+h(3, "adding-phrase-patterns") Adding phrase patterns

p
    |  If you need to match large terminology lists, you can also use the
    |  #[+api("phrasematcher") #[code PhraseMatcher]] and create
    |  #[+api("doc") #[code Doc]] objects instead of token patterns, which is
    |  much more efficient overall. The #[code Doc] patterns can contain single
    |  or multiple tokens.

+code-exec.
    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load('en_core_web_sm')
    matcher = PhraseMatcher(nlp.vocab)
    terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
    patterns = [nlp(text) for text in terminology_list]
    matcher.add('TerminologyList', None, *patterns)

    doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
              u"converse in the Oval Office inside the White House in Washington, D.C.")
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)

p
    |  Since spaCy is used for processing both the patterns and the text to be
    |  matched, you won't have to worry about specific tokenization – for
    |  example, you can simply pass in #[code nlp(u"Washington, D.C.")] and
    |  won't have to write a complex token pattern covering the exact
    |  tokenization of the term.
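
p
    |  The idea of matching whole token sequences can be sketched in plain
    |  Python. This is a toy illustration only, not the
    |  #[code PhraseMatcher]'s actual implementation: #[code find_phrases] is
    |  a hypothetical helper, and the token texts are written by hand rather
    |  than produced by spaCy's tokenizer.

```python
def find_phrases(tokens, phrases):
    """Return sorted (start, end) pairs where a phrase occurs in `tokens`.
    Both `tokens` and each phrase are sequences of token texts."""
    wanted = {tuple(phrase) for phrase in phrases}
    lengths = {len(phrase) for phrase in wanted}
    found = []
    for length in lengths:
        for start in range(len(tokens) - length + 1):
            # one set lookup per candidate window, however many phrases there are
            if tuple(tokens[start:start + length]) in wanted:
                found.append((start, start + length))
    return sorted(found)

# token texts written by hand; a real tokenizer may split differently
tokens = ['Angela', 'Merkel', 'met', 'Barack', 'Obama', 'in',
          'Washington', ',', 'D.C.']
phrases = [['Barack', 'Obama'], ['Angela', 'Merkel'], ['Washington', ',', 'D.C.']]

for start, end in find_phrases(tokens, phrases):
    print(start, end, tokens[start:end])
```

p
    |  The lookup only succeeds if the phrases and the text are split into
    |  tokens in the same way, which is why it helps to run both through the
    |  same #[code nlp] object.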

+h(3, "on_match") Adding #[code on_match] rules

p
    |  To move on to a more realistic example, let's say you're working with a
    |  large corpus of blog articles, and you want to match all mentions of
    |  "Google I/O" (which spaCy tokenizes as #[code ['Google', 'I', '/', 'O']]).
    |  To be safe, you only match on the uppercase versions, in case someone has
    |  written it as "Google i/o". You also add a second pattern with an added
    |  #[code {'IS_DIGIT': True}] token – this will make sure you also match on
    |  "Google I/O 2017". If your pattern matches, spaCy should execute your
    |  custom callback function #[code add_event_ent].

+code-exec.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)

    # Get the ID of the 'EVENT' entity type. This is required to set an entity.
    EVENT = nlp.vocab.strings['EVENT']

    def add_event_ent(matcher, doc, i, matches):
        # Get the current match and create tuple of entity label, start and end.
        # Append entity to the doc's entities. (Don't overwrite doc.ents!)
        match_id, start, end = matches[i]
        entity = (EVENT, start, end)
        doc.ents += (entity,)
        print(doc[start:end].text, entity)

    matcher.add('GoogleIO', add_event_ent,
                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}, {'IS_DIGIT': True}])

    doc = nlp(u"This is a text about Google I/O 2015.")
    matches = matcher(doc)

+aside("Tip: Visualizing matches")
    |  When working with entities, you can use #[+api("top-level#displacy") displaCy]
    |  to quickly generate a NER visualization from your updated #[code Doc],
    |  which can be exported as an HTML file:

    +code.o-no-block.
        from spacy import displacy
        html = displacy.render(doc, style='ent', page=True,
                               options={'ents': ['EVENT']})

    |  For more info and examples, see the usage guide on
    |  #[+a("/usage/visualizers") visualizing spaCy].

p
    |  We can now call the matcher on our documents. The patterns will be
    |  matched in the order they occur in the text. The matcher will then
    |  iterate over the matches, look up the callback for the match ID
    |  that was matched, and invoke it.

+code.
    doc = nlp(YOUR_TEXT_HERE)
    matcher(doc)

p
    |  When the callback is invoked, it is
    |  passed four arguments: the matcher itself, the document, the position of
    |  the current match, and the total list of matches. This allows you to
    |  write callbacks that consider the entire set of matched phrases, so that
    |  you can resolve overlaps and other conflicts in whatever way you prefer.

+table(["Argument", "Type", "Description"])
    +row
        +cell #[code matcher]
        +cell #[code Matcher]
        +cell The matcher instance.

    +row
        +cell #[code doc]
        +cell #[code Doc]
        +cell The document the matcher was used on.

    +row
        +cell #[code i]
        +cell int
        +cell Index of the current match (#[code matches[i]]).

    +row
        +cell #[code matches]
        +cell list
        +cell
            |  A list of #[code (match_id, start, end)] tuples, describing the
            |  matches. A match tuple describes a span #[code doc[start:end]].
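
p
    |  Because a callback sees the full list of matches, it could filter
    |  overlapping spans before acting on them. The following is a minimal
    |  sketch in plain Python, assuming you want to prefer longer spans.
    |  #[code keep_longest] is a hypothetical helper, and the match IDs are
    |  made-up integers standing in for real hash values.

```python
def keep_longest(matches):
    """Return non-overlapping (match_id, start, end) tuples, preferring
    longer spans and, among equally long ones, the earliest start."""
    by_length = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for match_id, start, end in by_length:
        if not any(i in taken for i in range(start, end)):
            kept.append((match_id, start, end))
            taken.update(range(start, end))
    return sorted(kept, key=lambda m: m[1])  # back into document order

# both "Google I/O" (tokens 5-9) and "Google I/O 2015" (tokens 5-10) matched
matches = [(1, 5, 9), (1, 5, 10), (2, 12, 13)]
print(keep_longest(matches))  # [(1, 5, 10), (2, 12, 13)]
```

p
    |  Inside an #[code on_match] callback, you would call such a helper on
    |  the #[code matches] argument and only process the spans it returns.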

+h(3, "matcher-pipeline") Using custom pipeline components

p
    |  Let's say your data also contains some annoying pre-processing artefacts,
    |  like leftover HTML line breaks (e.g. #[code &lt;br&gt;] or
    |  #[code &lt;BR/&gt;]). To make your text easier to analyse, you want to
    |  merge those into one token and flag them, to make sure you
    |  can ignore them later. Ideally, this should all be done automatically
    |  as you process the text. You can achieve this by adding a
    |  #[+a("/usage/processing-pipelines#custom-components") custom pipeline component]
    |  that's called on each #[code Doc] object, merges the leftover HTML spans
    |  and sets an attribute #[code bad_html] on the token.

+code-exec.
    import spacy
    from spacy.matcher import Matcher
    from spacy.tokens import Token

    # we're using a class because the component needs to be initialised with
    # the shared vocab via the nlp object
    class BadHTMLMerger(object):
        def __init__(self, nlp):
            # register a new token extension to flag bad HTML
            Token.set_extension('bad_html', default=False)
            self.matcher = Matcher(nlp.vocab)
            self.matcher.add('BAD_HTML', None,
                [{'ORTH': '&lt;'}, {'LOWER': 'br'}, {'ORTH': '&gt;'}],
                [{'ORTH': '&lt;'}, {'LOWER': 'br/'}, {'ORTH': '&gt;'}])

        def __call__(self, doc):
            # this method is invoked when the component is called on a Doc
            matches = self.matcher(doc)
            spans = []  # collect the matched spans here
            for match_id, start, end in matches:
                spans.append(doc[start:end])
            for span in spans:
                span.merge(is_stop=True) # merge (and mark it as a stop word)
                for token in span:
                    token._.bad_html = True  # mark token as bad HTML
            return doc

    nlp = spacy.load('en_core_web_sm')
    html_merger = BadHTMLMerger(nlp)
    nlp.add_pipe(html_merger, last=True)  # add component to the pipeline
    doc = nlp(u"Hello&lt;br&gt;world! &lt;br/&gt; This is a test.")
    for token in doc:
        print(token.text, token._.bad_html)

p
    |  Instead of hard-coding the patterns into the component, you could also
    |  make it take a path to a JSON file containing the patterns. This lets
    |  you reuse the component with different patterns, depending on your
    |  application:

+code.
    html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')
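
p
    |  How the component reads that file is up to you. As a minimal sketch,
    |  assuming the file holds a JSON array of token patterns (each pattern
    |  a list of dictionaries), a hypothetical #[code load_patterns] helper
    |  could look like this. The file layout and the helper name are
    |  assumptions made for illustration.

```python
import json
import os
import tempfile

def load_patterns(path):
    """Read a JSON file containing a list of token patterns."""
    with open(path, encoding='utf8') as file_:
        patterns = json.load(file_)
    for pattern in patterns:
        is_token_list = isinstance(pattern, list) and all(
            isinstance(token, dict) for token in pattern)
        if not is_token_list:
            raise ValueError('each pattern must be a list of token dicts')
    return patterns

# write an example file to a temporary directory, then read it back
example = [[{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
           [{'LOWER': 'hello'}, {'LOWER': 'world'}]]
path = os.path.join(tempfile.mkdtemp(), 'patterns.json')
with open(path, 'w', encoding='utf8') as file_:
    json.dump(example, file_)

patterns = load_patterns(path)
print(len(patterns), patterns[0][1])
# inside the component, the loaded patterns could then be added with:
# self.matcher.add('BAD_HTML', None, *patterns)
```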

+infobox
    |  For more details and examples of how to
    |  #[strong create custom pipeline components] and
    |  #[strong extension attributes], see the
    |  #[+a("/usage/processing-pipelines") usage guide].

+h(3, "regex") Using regular expressions

p
    |  In some cases, only matching tokens and token attributes isn't enough –
    |  for example, you might want to match different spellings of a word,
    |  without having to add a new pattern for each spelling. A simple solution
    |  is to match a regular expression on the #[code Doc]'s #[code text] and
    |  use the #[+api("doc#char_span") #[code Doc.char_span]] method to
    |  create a #[code Span] from the character indices of the match:

+code-exec.
    import spacy
    import re

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')

    DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')

    for match in re.finditer(DEFINITELY_PATTERN, doc.text):
        start, end = match.span()         # get matched indices
        span = doc.char_span(start, end)  # create Span from indices
        print(span.text)

p
    |  You can also use the regular expression with spaCy's #[code Matcher] by
    |  converting it to a token flag. To ensure efficiency, the
    |  #[code Matcher] can only access the C-level data. This means that it can
    |  either use built-in token attributes or #[strong binary flags].
    |  #[+api("vocab#add_flag") #[code Vocab.add_flag]] returns a flag ID which
    |  you can use as a key of a token match pattern. Tokens that match the
    |  regular expression will return #[code True] for the #[code IS_DEFINITELY]
    |  flag.

+code-exec.
    import spacy
    from spacy.matcher import Matcher
    import re

    nlp = spacy.load('en_core_web_sm')
    definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
    IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)

    matcher = Matcher(nlp.vocab)
    matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])

    doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)
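
p
    |  One detail worth checking when writing such a flag function:
    |  #[code match()] only anchors the expression at the start of the
    |  token text, so longer tokens that merely begin with a matching
    |  spelling are flagged as well. The plain-Python sketch below compares
    |  it with #[code fullmatch()]. The two function names are hypothetical,
    |  and which behaviour you want depends on your data.

```python
import re

DEFINITELY = re.compile(r'deff?in[ia]tely')

def starts_like_definitely(text):
    # same check as the lambda above: anchored at the start only
    return bool(DEFINITELY.match(text))

def is_definitely(text):
    # stricter variant: the whole token text has to match
    return bool(DEFINITELY.fullmatch(text))

for text in ['definitely', 'definately', 'deffinitely',
             'definitelyish', 'indefinitely']:
    print(text, starts_like_definitely(text), is_definitely(text))
```

p
    |  Here, "definitelyish" passes the first check but not the second, while
    |  "indefinitely" fails both because the expression does not match at the
    |  start of the text.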

p
    |  Providing the regular expressions as binary flags also lets you use them
    |  in combination with other token patterns – for example, to match the
    |  word "definitely" in various spellings, followed by a case-insensitive
    |  "not" and an adjective:

+code.
    [{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}]

+h(3, "example1") Example: Using linguistic annotations

p
    |  Let's say you're analysing user comments and you want to find out what
    |  people are saying about Facebook. You want to start off by finding
    |  adjectives following "Facebook is" or "Facebook was". This is obviously
    |  a very rudimentary solution, but it'll be fast, and a great way to get an
    |  idea of what's in your data. Your pattern could look like this:

+code.
    [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]

p
    |  This translates to a token whose lowercase form matches "facebook"
    |  (like Facebook, facebook or FACEBOOK), followed by a token with the lemma
    |  "be" (for example, is, was, or 's), followed by an #[strong optional] adverb,
    |  followed by an adjective. Using the linguistic annotations here is
    |  especially useful, because you can tell spaCy to match "Facebook's
    |  annoying", but #[strong not] "Facebook's annoying ads". The optional
    |  adverb makes sure you won't miss adjectives with intensifiers, like
    |  "pretty awful" or "very nice".

p
    |  To get a quick overview of the results, you could collect all sentences
    |  containing a match and render them with the
    |  #[+a("/usage/visualizers") displaCy visualizer].
    |  In the callback function, you'll have access to the #[code start] and
    |  #[code end] of each match, as well as the parent #[code Doc]. This lets
    |  you determine the sentence containing the match,
    |  #[code doc[start : end].sent], and calculate the start and end of the
    |  matched span within the sentence. Using displaCy in
    |  #[+a("/usage/visualizers#manual-usage") "manual" mode] lets you
    |  pass in a list of dictionaries containing the text and entities to render.

+code-exec.
    import spacy
    from spacy import displacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    matched_sents = [] # collect data of matched sentences to be visualized

    def collect_sents(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        span = doc[start : end]  # matched span
        sent = span.sent  # sentence containing matched span
        # append mock entity for match in displaCy style to matched_sents
        # get the match span by offsetting the start and end of the span with the
        # start and end of the sentence in the doc
        match_ents = [{'start': span.start_char - sent.start_char,
                       'end': span.end_char - sent.start_char,
                       'label': 'MATCH'}]
        matched_sents.append({'text': sent.text, 'ents': match_ents })

    pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
               {'POS': 'ADJ'}]
    matcher.add('FacebookIs', collect_sents, pattern)  # add pattern
    doc = nlp(u"I'd say that Facebook is evil. – Facebook is pretty cool, right?")
    matches = matcher(doc)

    # serve visualization of sentences containing match with displaCy
    # set manual=True to make displaCy render straight from a dictionary
    # (if you're not running the code within a Jupyter environment, you can
    # remove jupyter=True and use displacy.serve instead)
    displacy.render(matched_sents, style='ent', manual=True, jupyter=True)

+h(3, "example2") Example: Phone numbers

p
    |  Phone numbers can have many different formats and matching them is often
    |  tricky. During tokenization, spaCy will leave sequences of numbers intact
    |  and only split on whitespace and punctuation. This means that your match
    |  pattern will have to look out for number sequences of a certain length,
    |  surrounded by specific punctuation – depending on the
    |  #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers") national conventions].

p
    |  The #[code IS_DIGIT] flag is not very helpful here, because it doesn't
    |  tell us anything about the length. However, you can use the #[code SHAPE]
    |  attribute, with each #[code d] representing a digit:

+code.
    [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
     {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]
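
p
    |  To get a feel for which shape strings to expect, here is a rough
    |  plain-Python approximation of a word shape function. It is a sketch
    |  under stated assumptions, not spaCy's own code: it assumes that
    |  letters map to #[code x] or #[code X], digits to #[code d], and that
    |  runs of the same shape character are cut off after four, which is how
    |  spaCy's documentation describes the #[code shape_] attribute.

```python
def word_shape(text):
    """Approximate a token's shape: letters -> x/X, digits -> d.
    Assumption: runs of the same shape character are truncated after four."""
    shape = []
    last = ''
    run = 0  # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            shape_char = 'd'
        else:
            shape_char = char  # punctuation is kept as-is
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return ''.join(shape)

for text in ['123', '4567', '123456', 'Hello', '(']:
    print(text, word_shape(text))
```

p
    |  Under that assumption, a six-digit token gets the shape
    |  #[code 'dddd'] rather than six #[code d] characters, so it's worth
    |  printing #[code token.shape_] for your own data before relying on a
    |  long shape string.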

p
    |  This will match phone numbers of the format #[strong (123) 4567 8901] or
    |  #[strong (123) 4567-8901]. To also match formats like #[strong (123) 456 789],
    |  you can add a second pattern using #[code 'ddd'] in place of #[code 'dddd'].
    |  By hard-coding some values, you can match only certain, country-specific
    |  numbers. For example, here's a pattern to match the most common formats of
    |  #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany") international German numbers]:

+code.
    [{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
     {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]

p
    |  Depending on the formats your application needs to match, creating an
    |  extensive set of rules like this is often better than training a model.
    |  It'll produce more predictable results, is much easier to modify and
    |  extend, and doesn't require any training data – only a set of
    |  test cases.

+code-exec.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'ddd'},
               {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]
    matcher.add('PHONE_NUMBER', None, pattern)

    doc = nlp(u"Call me at (123) 456 789 or (123) 456 789!")
    print([t.text for t in doc])
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)

+h(3, "example3") Example: Hashtags and emoji on social media

p
    |  Social media posts, especially tweets, can be difficult to work with.
    |  They're very short and often contain various emoji and hashtags. By only
    |  looking at the plain text, you'll lose a lot of valuable semantic
    |  information.

p
    |  Let's say you've extracted a large sample of social media posts on a
    |  specific topic, for example posts mentioning a brand name or product.
    |  As the first step of your data exploration, you want to filter out posts
    |  containing certain emoji and use them to assign a general sentiment
    |  score, based on whether the expressed emotion is positive or negative,
    |  e.g. #[span.o-icon.o-icon--inline 😀] or #[span.o-icon.o-icon--inline 😞].
    |  You also want to find, merge and label hashtags like
    |  #[code #MondayMotivation], to be able to ignore or analyse them later.

+aside("Note on sentiment analysis")
    |  Ultimately, sentiment analysis is not always #[em that] easy. In
    |  addition to the emoji, you'll also want to take specific words into
    |  account and check the #[code subtree] for intensifiers like "very", to
    |  increase the sentiment score. At some point, you might also want to train
    |  a sentiment model. However, the approach described in this example is
    |  very useful for #[strong bootstrapping rules to collect training data].
    |  It's also an incredibly fast way to gather first insights into your data
    |  – with about 1 million tweets, you'd be looking at a processing time of
    |  #[strong under 1 minute].

p
    |  By default, spaCy's tokenizer will split emoji into separate tokens. This
    |  means that you can create a pattern for one or more emoji tokens.
    |  Valid hashtags usually consist of a #[code #], plus a sequence of
    |  ASCII characters with no whitespace, making them easy to match as well.

+code-exec.
    from spacy.lang.en import English
    from spacy.matcher import Matcher

    nlp = English()  # we only want the tokenizer, so no need to load a model
    matcher = Matcher(nlp.vocab)

    pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍']  # positive emoji
    neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒']  # negative emoji

    # add one pattern per emoji, each matching a single emoji token
    pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
    neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]

    # function to label the sentiment
    def label_sentiment(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
            doc.sentiment += 0.1  # add 0.1 for positive sentiment
        elif doc.vocab.strings[match_id] == 'SAD':
            doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment

    matcher.add('HAPPY', label_sentiment, *pos_patterns)  # add positive pattern
    matcher.add('SAD', label_sentiment, *neg_patterns)  # add negative pattern

    # add pattern for valid hashtag, i.e. '#' plus any ASCII token
    matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

    doc = nlp(u"Hello world 😀 #MondayMotivation")
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = doc.vocab.strings[match_id]  # look up string ID
        span = doc[start:end]
        print(string_id, span.text)

p
    |  Because the #[code on_match] callback receives the ID of each match, you
    |  can use the same function to handle the sentiment assignment for both
    |  the positive and negative pattern. To keep it simple, we'll either add
    |  or subtract #[code 0.1] points – this way, the score will also reflect
    |  combinations of emoji, even positive #[em and] negative ones.

p
    |  With a library like
    |  #[+a("https://github.com/bcongdon/python-emojipedia") Emojipedia],
    |  we can also retrieve a short description for each emoji – for example,
    |  #[span.o-icon.o-icon--inline 😍]'s official title is "Smiling Face With
    |  Heart-Eyes". Assigning it to a
    |  #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
    |  on the emoji span will make it available as #[code span._.emoji_desc].

+code.
    from emojipedia import Emojipedia  # installation: pip install emojipedia
    from spacy.tokens import Span  # get the global Span object

    Span.set_extension('emoji_desc', default=None)  # register the custom attribute

    def label_sentiment(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
            doc.sentiment += 0.1  # add 0.1 for positive sentiment
        elif doc.vocab.strings[match_id] == 'SAD':
            doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment
        span = doc[start : end]
        emoji = Emojipedia.search(span[0].text) # get data for emoji
        span._.emoji_desc = emoji.title  # assign emoji description

p
    |  To label the hashtags, we can use a
    |  #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
    |  set on the respective token:

+code-exec.
    import spacy
    from spacy.matcher import Matcher
    from spacy.tokens import Token

    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)

    # add pattern for valid hashtag, i.e. '#' plus any ASCII token
    matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

    # register token extension
    Token.set_extension('is_hashtag', default=False)

    doc = nlp(u"Hello world 😀 #MondayMotivation")
    matches = matcher(doc)
    hashtags = []
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == 'HASHTAG':
            hashtags.append(doc[start:end])
    for span in hashtags:
        span.merge()
        for token in span:
            token._.is_hashtag = True

    for token in doc:
        print(token.text, token._.is_hashtag)

p
    |  To process a stream of social media posts, we can use
    |  #[+api("language#pipe") #[code Language.pipe()]], which will return a
    |  stream of #[code Doc] objects that we can pass to
    |  #[+api("matcher#pipe") #[code Matcher.pipe()]].

+code.
    docs = nlp.pipe(LOTS_OF_TWEETS)
    matches = matcher.pipe(docs)
-												Update to new website

											
										
										
											2016-10-31 21:04:15 +03:00
+								//- 💫 DOCS > USAGE > RULE-BASED MATCHING
 								p
-												Update usage documentation

											
										
										
											2017-10-03 15:26:20 +03:00
+								    |  spaCy features a rule-matching engine, the #[+api("matcher") #[code Matcher]],
 								    |  that operates over tokens, similar
-												Rewrite rule-based matching workflow

											
										
										
											2017-05-20 02:38:55 +03:00
+								    |  to regular expressions. The rules can refer to token annotations (e.g.
 								    |  the token #[code text] or #[code tag_], and flags (e.g. #[code IS_PUNCT]).
 								    |  The rule matcher also lets you pass in a custom callback
 								    |  to act on matches – for example, to merge entities and apply custom labels.
 								    |  You can also associate patterns with entity IDs, to allow some basic
-												Update usage documentation

											
										
										
											2017-10-03 15:26:20 +03:00
+								    |  entity linking or disambiguation. To match large terminology lists,
 								    |  you can use the #[+api("phrasematcher") #[code PhraseMatcher]], which
 								    |  accepts #[code Doc] objects as match patterns.
-												Rewrite rule-based matching workflow

											
										
										
											2017-05-20 02:38:55 +03:00
-												Update usage documentation

											
										
										
											2017-10-03 15:26:20 +03:00
+								+h(3, "adding-patterns") Adding patterns
-												Rewrite rule-based matching workflow

											
										
										
											2017-05-20 02:38:55 +03:00
 								p
 								    |  Let's say we want to enable spaCy to find a combination of three tokens:
-												Update to new website

											
										
										
											2016-10-31 21:04:15 +03:00
 								+list("numbers")
-												Rewrite rule-based matching workflow

											
										
										
											2017-05-20 02:38:55 +03:00
+								    +item
-												Update docs on rule-based matching and add examples

											
										
										
											2017-05-22 20:04:02 +03:00
+								        |  A token whose #[strong lowercase form matches "hello"], e.g. "Hello"
 								        |  or "HELLO".
 								    +item
 								        |  A token whose #[strong #[code is_punct] flag is set to #[code True]],
 								        |  i.e. any punctuation.
 								    +item
 								        |  A token whose #[strong lowercase form matches "world"], e.g. "World"
 								        |  or "WORLD".
 								+code.
 								    [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
 								p
 								    |  First, we initialise the #[code Matcher] with a vocab. The matcher must
 								    |  always share the same vocab with the documents it will operate on. We
 								    |  can now call #[+api("matcher#add") #[code matcher.add()]] with an ID and
 								    |  our custom pattern. The second argument lets you pass in an optional
 								    |  callback function to invoke on a successful match. For now, we set it
 								    |  to #[code None].
 								+code-exec.
 								    import spacy
 								    from spacy.matcher import Matcher
 								    nlp = spacy.load('en_core_web_sm')
 								    matcher = Matcher(nlp.vocab)
 								    # add match ID "HelloWorld" with no callback and one pattern
 								    pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
 								    matcher.add('HelloWorld', None, pattern)
 								    doc = nlp(u'Hello, world! Hello world!')
 								    matches = matcher(doc)
 								    for match_id, start, end in matches:
 								        string_id = nlp.vocab.strings[match_id]  # get string representation
 								        span = doc[start:end]  # the matched span
 								        print(match_id, string_id, start, end, span.text)
 								p
 								    |  The matcher returns a list of #[code (match_id, start, end)] tuples – in
 								    |  this case, #[code [(15578876784678163569, 0, 3)]], which maps to the
 								    |  span #[code doc[0:3]] of our original document. The #[code match_id]
 								    |  is the #[+a("/usage/spacy-101#vocab") hash value] of the string ID
 								    |  "HelloWorld". To get the string value, you can look up the ID
 								    |  in the #[+api("stringstore") #[code StringStore]].
 								+code.
 								    for match_id, start, end in matches:
 								        string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
 								        span = doc[start:end]                    # the matched span
 								p
 								    |  Optionally, we could also choose to add more than one pattern, for
 								    |  example to also match sequences without punctuation between "hello" and
 								    |  "world":
 								+code.
 								    matcher.add('HelloWorld', None,
 								                [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
 								                [{'LOWER': 'hello'}, {'LOWER': 'world'}])
 								p
 								    |  By default, the matcher will only return the matches and
 								    |  #[strong not do anything else], like merge entities or assign labels.
 								    |  This is all up to you and can be defined individually for each pattern,
 								    |  by passing in a callback function as the #[code on_match] argument on
 								    |  #[code add()]. This is useful, because it lets you write entirely custom
 								    |  and #[strong pattern-specific logic]. For example, you might want to
 								    |  merge #[em some] patterns into one token, while adding entity labels for
 								    |  other pattern types. You shouldn't have to create different matchers for
 								    |  each of those processes.
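p
    |  As a rough sketch of what such a callback can look like: spaCy calls an
    |  #[code on_match] function with the matcher, the #[code Doc], the index
    |  #[code i] of the current match and the full list of matches. The names
    |  #[code collect_match] and #[code found] are hypothetical, and a plain
    |  list stands in for the #[code Doc] so that the snippet runs without a
    |  model.

```python
found = []  # hypothetical container for the spans we have seen


def collect_match(matcher, doc, i, matches):
    # matches[i] is the match that triggered this call
    match_id, start, end = matches[i]
    found.append((start, end))


# Stand-in call: a plain list replaces the Doc and 1 replaces the real
# hash value of the match ID, just to show the data flow
collect_match(None, ['Hello', ',', 'world', '!'], 0, [(1, 0, 3)])
print(found)
```

p
    |  With a real matcher, the callback would be passed as the second
    |  argument, e.g. #[code matcher.add('HelloWorld', collect_match, pattern)].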
 								+h(4, "adding-patterns-attributes") Available token attributes
 								p
 								    |  The available token pattern keys are uppercase versions of the
 								    |  #[+api("token#attributes") #[code Token] attributes]. The most relevant
 								    |  ones for rule-based matching are:
 								+table(["Attribute", "Description"])
 								    +row
 								        +cell #[code ORTH]
 								        +cell The exact verbatim text of a token.
 								    +row
 								        +cell.u-nowrap #[code LOWER]
 								        +cell The lowercase form of the token text.
 								    +row
 								        +cell #[code LENGTH]
 								        +cell The length of the token text.
 								    +row
 								        +cell.u-nowrap #[code IS_ALPHA], #[code IS_ASCII], #[code IS_DIGIT]
 								        +cell
 								            |  Token text consists of alphabetic characters, ASCII characters,
 								            |  digits.
 								    +row
 								        +cell.u-nowrap #[code IS_LOWER], #[code IS_UPPER], #[code IS_TITLE]
 								        +cell Token text is in lowercase, uppercase, titlecase.
 								    +row
 								        +cell.u-nowrap #[code IS_PUNCT], #[code IS_SPACE], #[code IS_STOP]
 								        +cell Token is punctuation, whitespace, stop word.
 								    +row
 								        +cell.u-nowrap #[code LIKE_NUM], #[code LIKE_URL], #[code LIKE_EMAIL]
 								        +cell Token text resembles a number, URL, email.
 								    +row
 								        +cell.u-nowrap
 								            |  #[code POS], #[code TAG], #[code DEP], #[code LEMMA],
 								            |  #[code SHAPE]
 								        +cell
 								            |  The token's simple and extended part-of-speech tag, dependency
 								            |  label, lemma, shape.
 								    +row
 								        +cell.u-nowrap #[code ENT_TYPE]
 								        +cell The token's entity label.
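p
    |  A few illustrative patterns built from the keys above. The variable
    |  names and the phrases they are meant to catch are assumptions made for
    |  this sketch. The patterns are plain Python lists and still need to be
    |  added to a #[code Matcher] before they do anything.

```python
# Each pattern is a list of dicts, one dict per token.

# A number-like token followed by the lowercase form "dollars"
amount_pattern = [{'LIKE_NUM': True}, {'LOWER': 'dollars'}]

# A title-cased token followed by a token tagged as a noun
# (needs a model that assigns part-of-speech tags)
title_noun_pattern = [{'IS_TITLE': True}, {'POS': 'NOUN'}]

# Several keys in one dict all have to hold for the same token:
# here, an all-uppercase token that is three characters long
short_upper_pattern = [{'IS_UPPER': True, 'LENGTH': 3}]

for pattern in (amount_pattern, title_noun_pattern, short_upper_pattern):
    print(len(pattern), pattern)
```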
 								+h(4, "adding-patterns-wildcard") Using wildcard token patterns
 								    +tag-new(2)
 								p
 								    |  While the token attributes offer many options to write highly specific
 								    |  patterns, you can also use an empty dictionary, #[code {}], as a wildcard
 								    |  representing #[strong any token]. This is useful if you know the context
 								    |  of what you're trying to match, but very little about the specific token
 								    |  and its characters. For example, let's say you're trying to extract
 								    |  people's user names from your data. All you know is that they are listed
 								    |  as "User name: {username}". The name itself may contain any character,
 								    |  but no whitespace – so you'll know it will be handled as one token.
 								+code.
 								    [{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}]
 								+h(4, "quantifiers") Using operators and quantifiers
 								p
 								    |  The matcher also lets you use quantifiers, specified as the #[code 'OP']
 								    |  key. Quantifiers let you define sequences of tokens to be matched, e.g.
 								    |  one or more punctuation marks, or specify optional tokens. Note that there
 								    |  are no nested or scoped quantifiers – instead, you can build those
 								    |  behaviours with #[code on_match] callbacks.
 								+table([ "OP", "Description"])
 								    +row
 								        +cell #[code !]
 								        +cell Negate the pattern, by requiring it to match exactly 0 times.
 								    +row
 								        +cell #[code ?]
 								        +cell Make the pattern optional, by allowing it to match 0 or 1 times.
 								    +row
 								        +cell #[code +]
 								        +cell Require the pattern to match 1 or more times.
 								    +row
 								        +cell #[code *]
 								        +cell Allow the pattern to match zero or more times.
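p
    |  To make the operators more concrete, here are a few hedged example
    |  patterns that set the #[code 'OP'] key on individual token
    |  descriptions. The variable names are hypothetical, and the patterns
    |  are only data until they are added to a #[code Matcher].

```python
# "hello", then an optional punctuation token, then "world":
# covers both "Hello, world" and "Hello world" with a single pattern
optional_punct = [{'LOWER': 'hello'},
                  {'IS_PUNCT': True, 'OP': '?'},
                  {'LOWER': 'world'}]

# "hello" followed by one or more punctuation tokens
one_or_more_punct = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '+'}]

# zero or more adjectives in front of a noun
# (needs a model that assigns part-of-speech tags)
adjectives_noun = [{'POS': 'ADJ', 'OP': '*'}, {'POS': 'NOUN'}]

# list the operator used at each token position ('-' means exactly once)
for pattern in (optional_punct, one_or_more_punct, adjectives_noun):
    print([token.get('OP', '-') for token in pattern])
```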
 								p
 								    |  In versions before v2.1.0, the #[code +] and #[code *] operators
 								    |  behave inconsistently. They are usually interpreted
 								    |  "greedily", i.e. longer matches are returned where possible. However, if
 								    |  you specify two #[code +] and #[code *] patterns in a row and their
 								    |  matches overlap, the first operator will behave non-greedily. This quirk
 								    |  in the semantics is corrected in spaCy v2.1.0.
 								+h(3, "adding-phrase-patterns") Adding phrase patterns
 								p
 								    |  If you need to match large terminology lists, you can also use the
 								    |  #[+api("phrasematcher") #[code PhraseMatcher]] and create
 								    |  #[+api("doc") #[code Doc]] objects instead of token patterns, which is
 								    |  much more efficient overall. The #[code Doc] patterns can contain single
 								    |  or multiple tokens.
 								+code-exec.
 								    import spacy
 								    from spacy.matcher import PhraseMatcher
 								    nlp = spacy.load('en_core_web_sm')
 								    matcher = PhraseMatcher(nlp.vocab)
 								    terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
 								    patterns = [nlp(text) for text in terminology_list]
 								    matcher.add('TerminologyList', None, *patterns)
 								    doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
 								              u"converse in the Oval Office inside the White House in Washington, D.C.")
 								    matches = matcher(doc)
 								    for match_id, start, end in matches:
 								        span = doc[start:end]
 								        print(span.text)
 								p
 								    |  Since spaCy is used for processing both the patterns and the text to be
 								    |  matched, you won't have to worry about specific tokenization – for
 								    |  example, you can simply pass in #[code nlp(u"Washington, D.C.")] and
 								    |  won't have to write a complex token pattern covering the exact
 								    |  tokenization of the term.
+								+h(3, "on_match") Adding #[code on_match] rules
 								p
 								    |  To move on to a more realistic example, let's say you're working with a
 								    |  large corpus of blog articles, and you want to match all mentions of
 								    |  "Google I/O" (which spaCy tokenizes as #[code ['Google', 'I', '/', 'O']]).
 								    |  To be safe, you match only the exact uppercase form, so that a mention
 								    |  written as "Google i/o" is not picked up. You also add a second pattern with an added
 								    |  #[code {'IS_DIGIT': True}] token – this will make sure you also match on
+								    |  "Google I/O 2017". If your pattern matches, spaCy should execute your
+								    |  custom callback function #[code add_event_ent].
+								+code-exec.
+								    import spacy
 								    from spacy.matcher import Matcher
+								    nlp = spacy.load('en_core_web_sm')
+								    matcher = Matcher(nlp.vocab)
 								    # Get the ID of the 'EVENT' entity type. This is required to set an entity.
 								    EVENT = nlp.vocab.strings['EVENT']
 								    def add_event_ent(matcher, doc, i, matches):
 								        # Get the current match and create tuple of entity label, start and end.
 								        # Append entity to the doc's entities. (Don't overwrite doc.ents!)
+								        match_id, start, end = matches[i]
+								        entity = (EVENT, start, end)
 								        doc.ents += (entity,)
 								        print(doc[start:end].text, entity)
+								    matcher.add('GoogleIO', add_event_ent,
+								                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
 								                [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}, {'IS_DIGIT': True}])
+								    doc = nlp(u"This is a text about Google I/O 2015.")
 								    matches = matcher(doc)
+								+aside("Tip: Visualizing matches")
+								    |  When working with entities, you can use #[+api("top-level#displacy") displaCy]
+								    |  to quickly generate a NER visualization from your updated #[code Doc],
 								    |  which can be exported as an HTML file:
 								    +code.o-no-block.
 								        from spacy import displacy
 								        html = displacy.render(doc, style='ent', page=True,
 								                               options={'ents': ['EVENT']})
+								    |  For more info and examples, see the usage guide on
+								    |  #[+a("/usage/visualizers") visualizing spaCy].
 								p
 								    |  We can now call the matcher on our documents. The patterns will be
+								    |  matched in the order they occur in the text. The matcher will then
 								    |  iterate over the matches, look up the callback for the match ID
 								    |  that was matched, and invoke it.
 								+code.
+								    doc = nlp(YOUR_TEXT_HERE)
+								    matcher(doc)
 								p
+								    |  When the callback is invoked, it is
+								    |  passed four arguments: the matcher itself, the document, the position of
 								    |  the current match, and the total list of matches. This allows you to
 								    |  write callbacks that consider the entire set of matched phrases, so that
 								    |  you can resolve overlaps and other conflicts in whatever way you prefer.
 								+table(["Argument", "Type", "Description"])
 								    +row
 								        +cell #[code matcher]
 								        +cell #[code Matcher]
 								        +cell The matcher instance.
 								    +row
 								        +cell #[code doc]
 								        +cell #[code Doc]
 								        +cell The document the matcher was used on.
 								    +row
 								        +cell #[code i]
 								        +cell int
 								        +cell Index of the current match (#[code matches[i]]).
 								    +row
 								        +cell #[code matches]
 								        +cell list
 								        +cell
 								            |  A list of #[code (match_id, start, end)] tuples, describing the
 								            |  matches. A match tuple describes a span #[code doc[start:end]].
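 								p
 								    |  As a rough illustration of resolving overlaps, the sketch below keeps
 								    |  only the longest match wherever spans overlap. The helper name
 								    |  #[code filter_longest] and the integer match IDs are made up for this
 								    |  example; in practice, the tuples would come from calling the matcher
 								    |  on a #[code Doc].

```python
def filter_longest(matches):
    """Keep the longest match wherever (match_id, start, end) tuples overlap."""
    # Sort by span length (longest first), then by start position.
    ordered = sorted(matches, key=lambda m: (m[1] - m[2], m[1]))
    kept = []
    taken = set()  # token indices already covered by a kept match
    for match_id, start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((match_id, start, end))
            taken.update(range(start, end))
    # Return the surviving matches in document order.
    return sorted(kept, key=lambda m: m[1])


# Hypothetical matches for "This is a text about Google I/O 2015.":
# tokens 5-8 are "Google I/O", tokens 5-9 are "Google I/O 2015".
# The match ID 1 is a placeholder for the hash spaCy would return.
matches = [(1, 5, 9), (1, 5, 10)]
print(filter_longest(matches))  # [(1, 5, 10)]
```

 								p
 								    |  Preferring the longest span is only one possible policy. Because the
 								    |  helper works on plain tuples, it could just as well be applied to the
 								    |  list returned by #[code matcher(doc)] before acting on the results.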
+								+h(3, "matcher-pipeline") Using custom pipeline components
 								p
 								    |  Let's say your data also contains some annoying pre-processing artefacts,
 								    |  like leftover HTML line breaks (e.g. #[code &lt;br&gt;] or
 								    |  #[code &lt;BR/&gt;]). To make your text easier to analyse, you want to
 								    |  merge those into one token and flag them, to make sure you
 								    |  can ignore them later. Ideally, this should all be done automatically
 								    |  as you process the text. You can achieve this by adding a
 								    |  #[+a("/usage/processing-pipelines#custom-components") custom pipeline component]
 								    |  that's called on each #[code Doc] object, merges the leftover HTML spans
 								    |  and sets an attribute #[code bad_html] on the token.
 								+code-exec.
 								    import spacy
 								    from spacy.matcher import Matcher
 								    from spacy.tokens import Token
 								    # we're using a class because the component needs to be initialised with
 								    # the shared vocab via the nlp object
 								    class BadHTMLMerger(object):
 								        def __init__(self, nlp):
 								            # register a new token extension to flag bad HTML
 								            Token.set_extension('bad_html', default=False)
 								            self.matcher = Matcher(nlp.vocab)
 								            self.matcher.add('BAD_HTML', None,
 								                [{'ORTH': '&lt;'}, {'LOWER': 'br'}, {'ORTH': '&gt;'}],
 								                [{'ORTH': '&lt;'}, {'LOWER': 'br/'}, {'ORTH': '&gt;'}])
 								        def __call__(self, doc):
 								            # this method is invoked when the component is called on a Doc
 								            matches = self.matcher(doc)
 								            spans = []  # collect the matched spans here
 								            for match_id, start, end in matches:
 								                spans.append(doc[start:end])
 								            for span in spans:
 								                span.merge(is_stop=True) # merge (and mark it as a stop word)
 								                for token in span:
 								                    token._.bad_html = True  # mark token as bad HTML
 								            return doc
 								    nlp = spacy.load('en_core_web_sm')
 								    html_merger = BadHTMLMerger(nlp)
 								    nlp.add_pipe(html_merger, last=True)  # add component to the pipeline
 								    doc = nlp(u"Hello&lt;br&gt;world! &lt;br/&gt; This is a test.")
 								    for token in doc:
 								        print(token.text, token._.bad_html)
 								p
 								    |  Instead of hard-coding the patterns into the component, you could also
 								    |  make it take a path to a JSON file containing the patterns. This lets
 								    |  you reuse the component with different patterns, depending on your
 								    |  application:
 								+code.
 								    html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')
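 								p
 								    |  One way this could look, as a minimal sketch: the helper below reads
 								    |  the patterns from disk and checks their overall shape. The function
 								    |  name #[code load_patterns] and the file layout (a JSON list in which
 								    |  each item is one token pattern, i.e. a list of dicts) are assumptions
 								    |  made for this example, not a format defined by spaCy.

```python
import json


def load_patterns(path):
    """Read token patterns from a JSON file (assumed layout: a list of lists)."""
    with open(path, encoding='utf8') as f:
        patterns = json.load(f)
    # Fail early if the file does not look like a list of token patterns.
    if not isinstance(patterns, list) or not all(isinstance(p, list) for p in patterns):
        raise ValueError('expected a JSON list of token patterns')
    return patterns


# Inside a hypothetical BadHTMLMerger.__init__(self, nlp, path), the
# hard-coded patterns could then be replaced with:
#     self.matcher.add('BAD_HTML', None, *load_patterns(path))
```

 								p
 								    |  Validating the file when the component is created means a malformed
 								    |  pattern file is reported at start-up rather than while processing text.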
 								+infobox
 								    |  For more details and examples of how to
 								    |  #[strong create custom pipeline components] and
 								    |  #[strong extension attributes], see the
 								    |  #[+a("/usage/processing-pipelines") usage guide].
+								+h(3, "regex") Using regular expressions
 								p
 								    |  In some cases, only matching tokens and token attributes isn't enough –
 								    |  for example, you might want to match different spellings of a word,
 								    |  without having to add a new pattern for each spelling. A simple solution
 								    |  is to match a regular expression on the #[code Doc]'s #[code text] and
 								    |  use the #[+api("doc#char_span") #[code Doc.char_span]] method to
 								    |  create a #[code Span] from the character indices of the match:
+								+code-exec.
+								    import spacy
 								    import re
+								    nlp = spacy.load('en_core_web_sm')
+								    doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
 								    DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')
 								    for match in re.finditer(DEFINITELY_PATTERN, doc.text):
 								        start, end = match.span()         # get matched indices
 								        span = doc.char_span(start, end)  # create Span from indices
+								        print(span.text)
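 								p
 								    |  Note that #[code Doc.char_span] returns #[code None] if the character
 								    |  indices don't map to a valid span, for instance when a regular
 								    |  expression match ends in the middle of a token. The sketch below
 								    |  illustrates that alignment rule in plain Python. The helper
 								    |  #[code char_to_token_span] and the hand-written offsets are
 								    |  hypothetical and are not how spaCy implements the method.

```python
def char_to_token_span(token_offsets, start_char, end_char):
    """Map character offsets to a (start, end) token slice, or None.

    token_offsets is a list of (start_char, end_char) pairs, one per token.
    """
    starts = {s: i for i, (s, _) in enumerate(token_offsets)}
    ends = {e: i for i, (_, e) in enumerate(token_offsets)}
    if start_char not in starts or end_char not in ends:
        return None  # the match begins or ends inside a token
    return starts[start_char], ends[end_char] + 1


# Hand-written offsets for the tokens of: not "definately"
# not (0-3), opening quote (4-5), definately (5-15), closing quote (15-16)
offsets = [(0, 3), (4, 5), (5, 15), (15, 16)]
print(char_to_token_span(offsets, 5, 15))  # (2, 3): exactly one whole token
print(char_to_token_span(offsets, 5, 12))  # None: the match ends mid-token
```

 								p
 								    |  In a loop like the one above, it can therefore be worth checking
 								    |  whether the returned span is #[code None] before accessing its attributes.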
p
    |  You can also use the regular expression with spaCy's #[code Matcher] by
    |  converting it to a token flag. To ensure efficiency, the
    |  #[code Matcher] can only access the C-level data. This means that it can
    |  either use built-in token attributes or #[strong binary flags].
    |  #[+api("vocab#add_flag") #[code Vocab.add_flag]] returns a flag ID which
    |  you can use as a key of a token match pattern. Tokens that match the
    |  regular expression will return #[code True] for the #[code IS_DEFINITELY]
    |  flag.
+code-exec.
    import spacy
    from spacy.matcher import Matcher
    import re
    nlp = spacy.load('en_core_web_sm')
    definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
    IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)
    matcher = Matcher(nlp.vocab)
    matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])
    doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print(span.text)
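p
    |  The flag function can also be checked on plain strings, without loading
    |  a model. The sketch below reuses the expression from the example above
    |  and adds a second, hypothetical helper, #[code definitely_flag_strict],
    |  which is not part of the original example. It uses #[code fullmatch]
    |  instead of #[code match], because #[code match] only anchors the start
    |  of the string and would also accept longer strings that merely begin
    |  with one of the spellings.

```python
import re

# Same expression as in the example above, compiled once.
DEFINITELY_RE = re.compile(r'deff?in[ia]tely')


def definitely_flag(text):
    # Mirrors the flag above: True if the text *starts with* a known spelling.
    return bool(DEFINITELY_RE.match(text))


def definitely_flag_strict(text):
    # Hypothetical stricter variant: the whole text must be one of the spellings.
    return bool(DEFINITELY_RE.fullmatch(text))


samples = ['definitely', 'definately', 'deffinitely', 'defiantly', 'definitelyish']
for text in samples:
    print(text, definitely_flag(text), definitely_flag_strict(text))
```

p
    |  Whether the stricter variant is needed depends on how the tokenizer
    |  splits your text, so treat it as an option to test against your own
    |  data rather than a required change.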
p
    |  Providing the regular expressions as binary flags also lets you use them
    |  in combination with other token patterns – for example, to match the
    |  word "definitely" in various spellings, followed by a case-insensitive
    |  "not" and an adjective:
+code.
    [{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}]
+h(3, "example1") Example: Using linguistic annotations
p
    |  Let's say you're analysing user comments and you want to find out what
    |  people are saying about Facebook. You want to start off by finding
    |  adjectives following "Facebook is" or "Facebook was". This is obviously
    |  a very rudimentary solution, but it'll be fast, and a great way to get an
    |  idea of what's in your data. Your pattern could look like this:
+code.
    [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]
p
    |  This translates to a token whose lowercase form matches "facebook"
    |  (like Facebook, facebook or FACEBOOK), followed by a token with the lemma
    |  "be" (for example, is, was, or 's), followed by any number of
    |  #[strong optional] adverbs, followed by an adjective. Using the
    |  linguistic annotations here is especially useful, because you can tell
    |  spaCy to match "Facebook's annoying", but #[strong not] "Facebook's
    |  annoying ads". The optional adverbs make sure you won't miss adjectives
    |  with intensifiers, like "pretty awful" or "very nice".
p
    |  To get a quick overview of the results, you could collect all sentences
    |  containing a match and render them with the
    |  #[+a("/usage/visualizers") displaCy visualizer].
    |  In the callback function, you'll have access to the #[code start] and
    |  #[code end] of each match, as well as the parent #[code Doc]. This lets
    |  you determine the sentence containing the match,
    |  #[code doc[start : end].sent], and calculate the start and end of the
    |  matched span within the sentence. Using displaCy in
    |  #[+a("/usage/visualizers#manual-usage") "manual" mode] lets you
    |  pass in a list of dictionaries containing the text and entities to render.
+code-exec.
    import spacy
    from spacy import displacy
    from spacy.matcher import Matcher
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    matched_sents = [] # collect data of matched sentences to be visualized
    def collect_sents(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        span = doc[start : end]  # matched span
        sent = span.sent  # sentence containing matched span
        # append mock entity for match in displaCy style to matched_sents
        # get the match span by offsetting the start and end of the span with the
        # start and end of the sentence in the doc
        match_ents = [{'start': span.start_char - sent.start_char,
                       'end': span.end_char - sent.start_char,
                       'label': 'MATCH'}]
        matched_sents.append({'text': sent.text, 'ents': match_ents})
    pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
               {'POS': 'ADJ'}]
    matcher.add('FacebookIs', collect_sents, pattern)  # add pattern
    doc = nlp(u"I'd say that Facebook is evil. – Facebook is pretty cool, right?")
    matches = matcher(doc)
    # serve visualization of sentences containing match with displaCy
    # set manual=True to make displaCy render straight from a dictionary
    # (if you're not running the code within a Jupyter environment, you can
    # remove jupyter=True and use displacy.serve instead)
    displacy.render(matched_sents, style='ent', manual=True, jupyter=True)
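p
    |  The offset arithmetic in #[code collect_sents] can be checked on plain
    |  strings, without a model. The sketch below is only an illustration: the
    |  helper #[code to_displacy_ent] is a hypothetical name, and the character
    |  positions are computed by hand as stand-ins for #[code span.start_char],
    |  #[code span.end_char] and #[code sent.start_char]. It shows why the
    |  sentence start is subtracted: the entity offsets have to be relative to
    |  the text that is rendered, which here is the sentence and not the
    |  whole document.

```python
def to_displacy_ent(span_start_char, span_end_char, sent_start_char, label='MATCH'):
    # Offsets are shifted so that they are relative to the sentence text,
    # not to the document the sentence was taken from.
    return {'start': span_start_char - sent_start_char,
            'end': span_end_char - sent_start_char,
            'label': label}


text = "I'd say that Facebook is evil. Facebook is pretty cool, right?"
sent_start = text.index('Facebook is pretty')  # stand-in for sent.start_char
sent_text = text[sent_start:]                  # stand-in for sent.text
span_start = sent_start                        # stand-in for span.start_char
span_end = span_start + len('Facebook is pretty cool')  # stand-in for span.end_char

ent = to_displacy_ent(span_start, span_end, sent_start)
entry = {'text': sent_text, 'ents': [ent]}

# Slicing the sentence text with the relative offsets recovers the match.
print(sent_text[ent['start']:ent['end']])
```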
+h(3, "example2") Example: Phone numbers
p
    |  Phone numbers can have many different formats and matching them is often
    |  tricky. During tokenization, spaCy will leave sequences of numbers intact
    |  and only split on whitespace and punctuation. This means that your match
    |  pattern will have to look out for number sequences of a certain length,
    |  surrounded by specific punctuation – depending on the
    |  #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers") national conventions].
p
    |  The #[code IS_DIGIT] flag is not very helpful here, because it doesn't
    |  tell us anything about the length. However, you can use the #[code SHAPE]
    |  attribute, with each #[code d] representing a digit:
+code.
    [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
     {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]
p
    |  This will match phone numbers of the format #[strong (123) 4567 8901] or
    |  #[strong (123) 4567-8901]. To also match formats like #[strong (123) 456 789],
    |  you can add a second pattern using #[code 'ddd'] in place of #[code 'dddd'].
    |  By hard-coding some values, you can match only certain, country-specific
    |  numbers. For example, here's a pattern to match the most common formats of
    |  #[+a("https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany") international German numbers]:
+code.
    [{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
     {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]
p
    |  Depending on the formats your application needs to match, creating an
    |  extensive set of rules like this is often better than training a model.
    |  It'll produce more predictable results, is much easier to modify and
    |  extend, and doesn't require any training data – only a set of
    |  test cases.
+								+code-exec.
 								    import spacy
 								    from spacy.matcher import Matcher
 								    nlp = spacy.load('en_core_web_sm')
 								    matcher = Matcher(nlp.vocab)
 								    pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'ddd'},
 								               {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]
 								    matcher.add('PHONE_NUMBER', None, pattern)
 								    doc = nlp(u"Call me at (123) 456 789 or (123) 456 789!")
 								    print([t.text for t in doc])
 								    matches = matcher(doc)
 								    for match_id, start, end in matches:
 								        span = doc[start:end]
 								        print(span.text)
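 								p
 								    |  The #[code {'SHAPE': 'ddd'}] entries in the pattern above match any
 								    |  token made up of exactly three digits. The following is a rough,
 								    |  library-free sketch of that idea. The helper #[code simple_shape] is
 								    |  hypothetical and only approximates the shape feature: it maps digits
 								    |  to #[code d], letters to #[code x] or #[code X] and keeps other
 								    |  characters, without any of the normalisation spaCy may apply to longer
 								    |  tokens.

```python
def simple_shape(text):
    """Map each character to a coarse class (hypothetical helper, not spaCy's API)."""
    shape = []
    for ch in text:
        if ch.isdigit():
            shape.append('d')  # any digit
        elif ch.isalpha():
            shape.append('X' if ch.isupper() else 'x')  # upper- or lowercase letter
        else:
            shape.append(ch)  # punctuation and other characters are kept as-is
    return ''.join(shape)

# check which example tokens would satisfy a 'ddd' shape constraint
for token in ['123', '456', '12', 'abc', '(']:
    print(token, simple_shape(token), simple_shape(token) == 'ddd')
```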
 								+h(3, "example3") Example: Hashtags and emoji on social media
 								p
 								    |  Social media posts, especially tweets, can be difficult to work with.
 								    |  They're very short and often contain various emoji and hashtags. By only
 								    |  looking at the plain text, you'll lose a lot of valuable semantic
 								    |  information.
 								p
 								    |  Let's say you've extracted a large sample of social media posts on a
 								    |  specific topic, for example posts mentioning a brand name or product.
 								    |  As the first step of your data exploration, you want to pick out posts
 								    |  containing certain emoji and use them to assign a general sentiment
 								    |  score, based on whether the expressed emotion is positive or negative,
 								    |  e.g. #[span.o-icon.o-icon--inline 😀] or #[span.o-icon.o-icon--inline 😞].
 								    |  You also want to find, merge and label hashtags like
 								    |  #[code #MondayMotivation], to be able to ignore or analyse them later.
 								+aside("Note on sentiment analysis")
 								    |  Ultimately, sentiment analysis is not always #[em that] easy. In
 								    |  addition to the emoji, you'll also want to take specific words into
 								    |  account and check the #[code subtree] for intensifiers like "very", to
 								    |  increase the sentiment score. At some point, you might also want to train
 								    |  a sentiment model. However, the approach described in this example is
 								    |  very useful for #[strong bootstrapping rules to collect training data].
 								    |  It's also an incredibly fast way to gather first insights into your data
 								    |  – with about 1 million tweets, you'd be looking at a processing time of
 								    |  #[strong under 1 minute].
 								p
 								    |  By default, spaCy's tokenizer will split emoji into separate tokens. This
 								    |  means that you can create a pattern for one or more emoji tokens.
 								    |  Valid hashtags usually consist of a #[code #], plus a sequence of
 								    |  ASCII characters with no whitespace, making them easy to match as well.
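 								p
 								    |  To make the hashtag rule concrete, here is a minimal sketch in plain
 								    |  Python that does not use spaCy at all. It assumes the text has already
 								    |  been split into token strings, with #[code #] as its own token. The
 								    |  function name #[code find_hashtags] and the token list are hypothetical
 								    |  and only illustrate the rule "a #[code #] followed by an ASCII token".

```python
def find_hashtags(tokens):
    """Return (start, end) index pairs where '#' is followed by an ASCII-only token."""
    spans = []
    for i in range(len(tokens) - 1):
        following = tokens[i + 1]
        is_ascii = all(ord(ch) < 128 for ch in following)
        if tokens[i] == '#' and is_ascii:
            spans.append((i, i + 2))  # end index is exclusive, like a slice
    return spans

# assumed tokenization of "Hello world 😀 #MondayMotivation"
tokens = ['Hello', 'world', '😀', '#', 'MondayMotivation']
for start, end in find_hashtags(tokens):
    print(''.join(tokens[start:end]))
```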
 								+code-exec.
 								    from spacy.lang.en import English
 								    from spacy.matcher import Matcher
 								    nlp = English()  # we only want the tokenizer, so no need to load a model
 								    matcher = Matcher(nlp.vocab)
 								    pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍']  # positive emoji
 								    neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒']  # negative emoji
 								    # add patterns to match one or more emoji tokens
 								    pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
 								    neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]
 								    # function to label the sentiment
 								    def label_sentiment(matcher, doc, i, matches):
 								        match_id, start, end = matches[i]
 								        if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
 								            doc.sentiment += 0.1  # add 0.1 for positive sentiment
 								        elif doc.vocab.strings[match_id] == 'SAD':
 								            doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment
 								    matcher.add('HAPPY', label_sentiment, *pos_patterns)  # add positive pattern
 								    matcher.add('SAD', label_sentiment, *neg_patterns)  # add negative pattern
 								    # add pattern for valid hashtag, i.e. '#' plus any ASCII token
 								    matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])
 								    doc = nlp(u"Hello world 😀 #MondayMotivation")
 								    matches = matcher(doc)
 								    for match_id, start, end in matches:
 								        string_id = doc.vocab.strings[match_id]  # look up string ID
 								        span = doc[start:end]
 								        print(string_id, span.text)
 								p
 								    |  Because the #[code on_match] callback receives the ID of each match, you
 								    |  can use the same function to handle the sentiment assignment for both
 								    |  the positive and negative patterns. To keep it simple, we'll either add
 								    |  or subtract #[code 0.1] points – this way, the score will also reflect
 								    |  combinations of emoji, even positive #[em and] negative ones.
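 								p
 								    |  The scoring idea can be sketched independently of spaCy. The snippet
 								    |  below assumes the match IDs have already been resolved to the strings
 								    |  #[code 'HAPPY'] and #[code 'SAD']. The names #[code SENTIMENT_DELTAS]
 								    |  and #[code score_matches] are hypothetical and not part of any API.
 								    |  It shows how mixed positive and negative matches offset each other.

```python
# per-label adjustment, mirroring the +0.1 / -0.1 used in the callback
SENTIMENT_DELTAS = {'HAPPY': 0.1, 'SAD': -0.1}

def score_matches(match_labels):
    """Sum the sentiment adjustments for a sequence of match labels."""
    score = 0.0
    for label in match_labels:
        # labels without a sentiment (e.g. 'HASHTAG') leave the score unchanged
        score += SENTIMENT_DELTAS.get(label, 0.0)
    return score

# two positive emoji, one negative emoji and one hashtag in the same post
labels = ['HAPPY', 'HAPPY', 'SAD', 'HASHTAG']
print(round(score_matches(labels), 1))
```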
 								p
 								    |  With a library like
 								    |  #[+a("https://github.com/bcongdon/python-emojipedia") Emojipedia],
 								    |  we can also retrieve a short description for each emoji – for example,
 								    |  #[span.o-icon.o-icon--inline 😍]'s official title is "Smiling Face With
 								    |  Heart-Eyes". Assigning it to a
 								    |  #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
 								    |  on the emoji span will make it available as #[code span._.emoji_desc].
 								+code.
 								    from emojipedia import Emojipedia  # installation: pip install emojipedia
 								    from spacy.tokens import Span  # get the global Span object
 								    Span.set_extension('emoji_desc', default=None)  # register the custom attribute
 								    def label_sentiment(matcher, doc, i, matches):
 								        match_id, start, end = matches[i]
 								        if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
 								            doc.sentiment += 0.1  # add 0.1 for positive sentiment
 								        elif doc.vocab.strings[match_id] == 'SAD':
 								            doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment
 								        span = doc[start : end]
 								        emoji = Emojipedia.search(span[0].text) # get data for emoji
 								        span._.emoji_desc = emoji.title  # assign emoji description
 								p
 								    |  To label the hashtags, we can use a
 								    |  #[+a("/usage/processing-pipelines#custom-components-attributes") custom attribute]
 								    |  set on the respective token:
 								+code-exec.
 								    import spacy
 								    from spacy.matcher import Matcher
 								    from spacy.tokens import Token
 								    nlp = spacy.load('en_core_web_sm')
 								    matcher = Matcher(nlp.vocab)
 								    # add pattern for valid hashtag, i.e. '#' plus any ASCII token
 								    matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])
 								    # register token extension
 								    Token.set_extension('is_hashtag', default=False)
 								    doc = nlp(u"Hello world 😀 #MondayMotivation")
 								    matches = matcher(doc)
    hashtags = []
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == 'HASHTAG':
            hashtags.append(doc[start:end])
    for span in hashtags:
        span.merge()
        for token in span:
            token._.is_hashtag = True
    for token in doc:
        print(token.text, token._.is_hashtag)

p
    |  To process a stream of social media posts, we can use
    |  #[+api("language#pipe") #[code Language.pipe()]], which will return a
    |  stream of #[code Doc] objects that we can pass to
    |  #[+api("matcher#pipe") #[code Matcher.pipe()]].

+code.
    docs = nlp.pipe(LOTS_OF_TWEETS)
    matches = matcher.pipe(docs)