---
title: Rule-based matching
teaser: Find phrases and tokens, and match entities
menu:
  - ['Token Matcher', 'matcher']
  - ['Phrase Matcher', 'phrasematcher']
  - ['Entity Ruler', 'entityruler']
  - ['Models & Rules', 'models-rules']
---

Compared to using regular expressions on raw text, spaCy's rule-based matcher
engines and components not only let you find the words and phrases you're
looking for – they also give you access to the tokens within the document and
their relationships. This means you can easily access and analyze the
surrounding tokens, merge spans into single tokens or add entries to the named
entities in `doc.ents`.

<Accordion title="Should I use rules or train a model?" id="rules-vs-model">

For complex tasks, it's usually better to train a statistical entity recognition
model. However, statistical models require training data, so for many
situations, rule-based approaches are more practical. This is especially true at
the start of a project: you can use a rule-based approach as part of a data
collection process, to help you "bootstrap" a statistical model.

Training a model is useful if you have some examples and you want your system to
be able to **generalize** based on those examples. It works especially well if
there are clues in the _local context_. For instance, if you're trying to detect
person or company names, your application may benefit from a statistical named
entity recognition model.

Rule-based systems are a good choice if there's a more or less **finite number**
of examples that you want to find in the data, or if there's a very **clear,
structured pattern** you can express with token rules or regular expressions.
For instance, country names, IP addresses or URLs are things you might be able
to handle well with a purely rule-based approach.

You can also combine both approaches and improve a statistical model with rules
to handle very specific cases and boost accuracy. For details, see the section
on [rule-based entity recognition](#entityruler).

</Accordion>

<Accordion title="When should I use the token matcher vs. the phrase matcher?" id="matcher-vs-phrase-matcher">

The `PhraseMatcher` is useful if you already have a large terminology list or
gazetteer consisting of single or multi-token phrases that you want to find
exact instances of in your data. As of spaCy v2.1.0, you can also match on the
`LOWER` attribute for fast and case-insensitive matching.

The `Matcher` isn't as blazing fast as the `PhraseMatcher`, since it compares
across individual token attributes. However, it allows you to write very
abstract representations of the tokens you're looking for, using lexical
attributes, linguistic features predicted by the model, operators, set
membership and rich comparison. For example, you can find a noun, followed by a
verb with the lemma "love" or "like", followed by an optional determiner and
another token that's at least ten characters long.

</Accordion>

## Token-based matching {#matcher}

spaCy features a rule-matching engine, the [`Matcher`](/api/matcher), that
operates over tokens, similar to regular expressions. The rules can refer to
token annotations (e.g. the token `text` or `tag_`) and flags (e.g. `IS_PUNCT`).
The rule matcher also lets you pass in a custom callback to act on matches – for
example, to merge entities and apply custom labels. You can also associate
patterns with entity IDs, to allow some basic entity linking or disambiguation.
To match large terminology lists, you can use the
[`PhraseMatcher`](/api/phrasematcher), which accepts `Doc` objects as match
patterns.

### Adding patterns {#adding-patterns}

Let's say we want to enable spaCy to find a combination of three tokens:

1. A token whose **lowercase form matches "hello"**, e.g. "Hello" or "HELLO".
2. A token whose **`is_punct` flag is set to `True`**, i.e. any punctuation.
3. A token whose **lowercase form matches "world"**, e.g. "World" or "WORLD".

```python
[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
```

<Infobox title="Important note" variant="danger">

When writing patterns, keep in mind that **each dictionary** represents **one
token**. If spaCy's tokenization doesn't match the tokens defined in a pattern,
the pattern is not going to produce any results. When developing complex
patterns, make sure to check examples against spaCy's tokenization:

```python
doc = nlp("A complex-example,!")
print([token.text for token in doc])
```

</Infobox>

First, we initialize the `Matcher` with a vocab. The matcher must always share
the same vocab with the documents it will operate on. We can now call
[`matcher.add()`](/api/matcher#add) with an ID and our custom pattern. The
second argument lets you pass in an optional callback function to invoke on a
successful match. For now, we set it to `None`.

```python
### {executable="true"}
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", None, pattern)

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
```

The matcher returns a list of `(match_id, start, end)` tuples – in this case,
`[(15578876784678163569, 0, 3)]`, which maps to the span `doc[0:3]` of our
original document. The `match_id` is the [hash value](/usage/spacy-101#vocab) of
the string ID "HelloWorld". To get the string value, you can look up the ID in
the [`StringStore`](/api/stringstore).

```python
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
    span = doc[start:end]                    # The matched span
```

Optionally, we could also choose to add more than one pattern, for example to
also match sequences without punctuation between "hello" and "world":

```python
matcher.add("HelloWorld", None,
            [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
            [{"LOWER": "hello"}, {"LOWER": "world"}])
```

By default, the matcher will only return the matches and **not do anything
else**, like merge entities or assign labels. This is all up to you and can be
defined individually for each pattern, by passing in a callback function as the
`on_match` argument on `add()`. This is useful, because it lets you write
entirely custom and **pattern-specific logic**. For example, you might want to
merge _some_ patterns into one token, while adding entity labels for other
pattern types. You shouldn't have to create different matchers for each of those
processes.

#### Available token attributes {#adding-patterns-attributes}

The available token pattern keys correspond to a number of
[`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are:

| Attribute                              | Type    | Description                                                                                            |
| -------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------ |
| `ORTH`                                 | unicode | The exact verbatim text of a token.                                                                    |
| `TEXT` <Tag variant="new">2.1</Tag>    | unicode | The exact verbatim text of a token.                                                                    |
| `LOWER`                                | unicode | The lowercase form of the token text.                                                                  |
| `LENGTH`                               | int     | The length of the token text.                                                                          |
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`     | bool    | Token text consists of alphabetic characters, ASCII characters, digits.                                |
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE`     | bool    | Token text is in lowercase, uppercase, titlecase.                                                      |
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP`      | bool    | Token is punctuation, whitespace, stop word.                                                           |
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`   | bool    | Token text resembles a number, URL, email.                                                             |
| `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`  | unicode | The token's simple and extended part-of-speech tag, dependency label, lemma, shape.                    |
| `ENT_TYPE`                             | unicode | The token's entity label.                                                                              |
| `_` <Tag variant="new">2.1</Tag>       | dict    | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). |

<Accordion title="Does it matter if the attribute names are uppercase or lowercase?">

No, it shouldn't. spaCy will normalize the names internally and
`{"LOWER": "text"}` and `{"lower": "text"}` will both produce the same result.
Using the uppercase version is mostly a convention to make it clear that the
attributes are "special" and don't exactly map to the token attributes like
`Token.lower` and `Token.lower_`.

</Accordion>

<Accordion title="Why are not all token attributes supported?">

spaCy can't provide access to all of the attributes because the `Matcher` loops
over the Cython data, not the Python objects. Inside the matcher, we're dealing
with a [`TokenC` struct](/api/cython-structs#tokenc) – we don't have an instance
of [`Token`](/api/token). This means that all of the attributes that refer to
computed properties can't be accessed.

The uppercase attribute names like `LOWER` or `IS_PUNCT` refer to symbols from
the
[`spacy.attrs`](https://github.com/explosion/spaCy/tree/master/spacy/attrs.pyx)
enum table. They're passed into a function that essentially is a big case/switch
statement, to figure out which struct field to return. The same attribute
identifiers are used in [`Doc.to_array`](/api/doc#to_array), and a few other
places in the code where you need to describe fields like this.

</Accordion>

---

<Infobox title="Tip: Try the interactive matcher explorer">

[![Matcher demo](../images/matcher-demo.jpg)](https://explosion.ai/demos/matcher)

The [Matcher Explorer](https://explosion.ai/demos/matcher) lets you test the
rule-based `Matcher` by creating token patterns interactively and running them
over your text. Each token can set multiple attributes like text value,
part-of-speech tag or boolean flags. The token-based view lets you explore how
spaCy processes your text – and why your pattern matches, or why it doesn't.

</Infobox>

#### Extended pattern syntax and attributes {#adding-patterns-attributes-extended new="2.1"}

Instead of mapping to a single value, token patterns can also map to a
**dictionary of properties**. For example, you can specify that the value of a
lemma should be part of a list of values, or set a minimum character length. The
following rich comparison attributes are available:

> #### Example
>
> ```python
> # Matches "love cats" or "likes flowers"
> pattern1 = [{"LEMMA": {"IN": ["like", "love"]}},
>             {"POS": "NOUN"}]
>
> # Matches tokens of length >= 10
> pattern2 = [{"LENGTH": {">=": 10}}]
> ```

| Attribute                  | Value Type | Description                                                                       |
| -------------------------- | ---------- | --------------------------------------------------------------------------------- |
| `IN`                       | any        | Attribute value is member of a list.                                              |
| `NOT_IN`                   | any        | Attribute value is _not_ member of a list.                                        |
| `==`, `>=`, `<=`, `>`, `<` | int, float | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. |
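
To make the semantics of these operators more concrete, here's a minimal
plain-Python sketch of how a single attribute value could be checked against
such a dictionary. The `satisfies` helper and the `COMPARATORS` table are
hypothetical names used for illustration only. They are not part of spaCy's API,
and the sketch assumes that all operators in one dictionary have to hold at the
same time.

```python
import operator

# Hypothetical lookup table, not part of spaCy: maps the comparison
# operators from the table above to plain Python functions
COMPARATORS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

def satisfies(value, spec):
    """Check a plain attribute value against a dict like {"IN": [...]}."""
    for op, target in spec.items():
        if op == "IN":
            ok = value in target
        elif op == "NOT_IN":
            ok = value not in target
        elif op in COMPARATORS:
            ok = COMPARATORS[op](value, target)
        else:
            raise ValueError("Unsupported operator: {}".format(op))
        if not ok:  # Assumption: every operator in the dict must hold
            return False
    return True

print(satisfies("love", {"IN": ["like", "love"]}))  # Lemma is in the list
print(satisfies(12, {">=": 10}))                    # Length is at least 10
print(satisfies(9, {">=": 10, "<": 20}))            # Fails the first check
```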

#### Regular expressions {#regex new="2.1"}

In some cases, only matching tokens and token attributes isn't enough – for
example, you might want to match different spellings of a word, without having
to add a new pattern for each spelling.

```python
pattern = [{"TEXT": {"REGEX": "^[Uu](\\.?|nited)$"}},
           {"TEXT": {"REGEX": "^[Ss](\\.?|tates)$"}},
           {"LOWER": "president"}]
```

The `REGEX` operator allows defining rules for any attribute string value,
including custom attributes. It always needs to be applied to an attribute like
`TEXT`, `LOWER` or `TAG`:

```python
# Match different spellings of token texts
pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]

# Match tokens with fine-grained POS tags starting with 'V'
pattern = [{"TAG": {"REGEX": "^V"}}]

# Match custom attribute values with regular expressions
pattern = [{"_": {"country": {"REGEX": "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"}}}]
```

<Infobox title="Important note" variant="warning">

When using the `REGEX` operator, keep in mind that it operates on **single
tokens**, not the whole text. Each expression you provide will be matched on a
token. If you need to match on the whole text instead, see the details on
[regex matching on the whole text](#regex-text).

</Infobox>
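
The effect of matching on single tokens can be sketched with the standard
library alone. The token texts below are written by hand as a stand-in for a
tokenized `Doc` (real tokenization may differ), and the sketch assumes that each
expression is applied to a token text in the same way as `re.search`. It also
shows why the `^` and `$` anchors matter: without them, an expression can
succeed inside a longer token.

```python
import re

# Hand-written token texts standing in for a tokenized Doc (an assumption,
# real tokenization may differ)
tokens = ["The", "U.", "S.", "President", "visited", "the", "United", "States"]

u_regex = re.compile(r"^[Uu](\.?|nited)$")
s_regex = re.compile(r"^[Ss](\.?|tates)$")

# Each expression is tested against one token text at a time
u_hits = [t for t in tokens if u_regex.search(t)]
s_hits = [t for t in tokens if s_regex.search(t)]
print(u_hits, s_hits)

# Without ^ and $, a search also succeeds inside longer tokens
spelling = re.compile(r"deff?in[ia]tely")
candidates = ["definitely", "deffinately", "indefinitely"]
loose_hits = [t for t in candidates if spelling.search(t)]
print(loose_hits)
```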

##### Matching regular expressions on the full text {#regex-text}

If your expressions apply to multiple tokens, a simple solution is to match on
the `doc.text` with `re.finditer` and use the
[`Doc.char_span`](/api/doc#char_span) method to create a `Span` from the
character indices of the match. If the matched characters don't map to one or
more valid tokens, `Doc.char_span` returns `None`.

> #### What's a valid token sequence?
>
> In the example, the expression will also match `"US"` in `"USA"`. However,
> `"USA"` is a single token and `Span` objects are **sequences of tokens**. So
> `"US"` cannot be its own span, because it does not end on a token boundary.

```python
### {executable="true"}
import spacy
import re

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)
```

<Accordion title="How can I expand the match to a valid token sequence?">

In some cases, you might want to expand the match to the closest token
boundaries, so you can create a `Span` for `"USA"`, even though only the
substring `"US"` is matched. You can calculate this using the character offsets
of the tokens in the document, available as
[`Token.idx`](/api/token#attributes). This lets you create a list of valid token
start and end boundaries and leaves you with a rather basic algorithmic problem:
Given a number, find the next lowest (start token) or the next highest (end
token) number that's part of a given list of numbers. This will be the closest
valid token boundary.

There are many ways to do this and the most straightforward one is to create a
dict keyed by characters in the `Doc`, mapped to the token they're part of. It's
easy to write and less error-prone, and gives you a constant lookup time: you
only ever need to create the dict once per `Doc`.

```python
chars_to_tokens = {}
for token in doc:
    for i in range(token.idx, token.idx + len(token.text)):
        chars_to_tokens[i] = token.i
```

You can then look up the character at a given position and get the index of the
token that the character is part of. Your span would then be
`doc[token_start:token_end]`. If a character isn't in the dict, it's part of the
whitespace that tokens are split on. That hopefully shouldn't happen, though,
because it'd mean your regex is producing matches with leading or trailing
whitespace.

```python
### {highlight="5-8"}
span = doc.char_span(start, end)
if span is not None:
    print("Found match:", span.text)
else:
    start_token = chars_to_tokens.get(start)
    end_token = chars_to_tokens.get(end - 1)  # end is exclusive, so use the last matched character
    if start_token is not None and end_token is not None:
        span = doc[start_token:end_token + 1]
        print("Found closest match:", span.text)
```
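
If you'd rather not build a dict with one entry per character, the "next lowest
/ next highest number" search described above can also be sketched with the
standard library's `bisect` module. The function name
`expand_to_token_boundaries` is hypothetical, and the whitespace-separated
chunks are only a stand-in for real tokens, so the example runs without a
loaded model.

```python
import re
from bisect import bisect_left, bisect_right

def expand_to_token_boundaries(token_starts, token_ends, start, end):
    """Snap a character range outward to the closest token boundaries.

    token_starts and token_ends are sorted lists of character offsets.
    """
    # Closest token start at or before `start`
    i = bisect_right(token_starts, start) - 1
    # Closest token end at or after `end`
    j = bisect_left(token_ends, end)
    if i < 0 or j >= len(token_ends):
        return None
    return token_starts[i], token_ends[j]

# Stand-in for spaCy tokens: whitespace-separated chunks and their offsets.
# With a real Doc you could use [t.idx for t in doc] and
# [t.idx + len(t.text) for t in doc] instead.
text = "commonly known as USA or US"
chunks = list(re.finditer(r"\S+", text))
token_starts = [m.start() for m in chunks]
token_ends = [m.end() for m in chunks]

match = re.search(r"US", text)  # First hit is the "US" inside "USA"
expanded = expand_to_token_boundaries(token_starts, token_ends, *match.span())
print(text[expanded[0]:expanded[1]])
```

Because the returned offsets fall on token boundaries, they could then be passed
to `Doc.char_span`. Each lookup is a binary search over the token offsets, so
this trades the constant lookup time of the dict for a smaller data structure.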

</Accordion>

---

#### Operators and quantifiers {#quantifiers}

The matcher also lets you use quantifiers, specified as the `'OP'` key.
Quantifiers let you define sequences of tokens to be matched, e.g. one or more
punctuation marks, or specify optional tokens. Note that there are no nested or
scoped quantifiers – instead, you can build those behaviors with `on_match`
callbacks.

| OP  | Description                                                      |
| --- | ---------------------------------------------------------------- |
| `!` | Negate the pattern, by requiring it to match exactly 0 times.    |
| `?` | Make the pattern optional, by allowing it to match 0 or 1 times. |
| `+` | Require the pattern to match 1 or more times.                    |
| `*` | Allow the pattern to match zero or more times.                   |

> #### Example
>
> ```python
> pattern = [{"LOWER": "hello"},
>            {"IS_PUNCT": True, "OP": "?"}]
> ```

<Infobox title="Note on operator behaviour" variant="warning">

In versions before v2.1.0, the semantics of the `+` and `*` operators were
inconsistent. They were usually interpreted "greedily", i.e. longer matches
were returned where possible. However, if you specified two `+` and `*` patterns
in a row and their matches overlapped, the first operator behaved non-greedily.
This quirk in the semantics is corrected in spaCy v2.1.0.

</Infobox>

#### Using wildcard token patterns {#adding-patterns-wildcard new="2"}

While the token attributes offer many options to write highly specific patterns,
you can also use an empty dictionary, `{}`, as a wildcard representing **any
token**. This is useful if you know the context of what you're trying to match,
but very little about the specific token and its characters. For example, let's
say you're trying to extract people's user names from your data. All you know is
that they are listed as "User name: {username}". The name itself may contain any
character, but no whitespace – so you'll know it will be handled as one token.

```python
[{"ORTH": "User"}, {"ORTH": "name"}, {"ORTH": ":"}, {}]
```

#### Validating and debugging patterns {#pattern-validation new="2.1"}

The `Matcher` can validate patterns against a JSON schema with the option
`validate=True`. This is useful for debugging patterns during development, in
particular for catching unsupported attributes.

```python
### {executable="true"}
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab, validate=True)
# Add match ID "HelloWorld" with unsupported attribute CASEINSENSITIVE
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"CASEINSENSITIVE": "world"}]
matcher.add("HelloWorld", None, pattern)
# 🚨 Raises an error:
# MatchPatternError: Invalid token patterns for matcher rule 'HelloWorld'
# Pattern 0:
# - Additional properties are not allowed ('CASEINSENSITIVE' was unexpected) [2]

```

### Adding on_match rules {#on_match}

To move on to a more realistic example, let's say you're working with a large
corpus of blog articles, and you want to match all mentions of "Google I/O"
(which spaCy tokenizes as `['Google', 'I', '/', 'O']`). To be safe, you only
match on the uppercase versions, in case someone has written it as "Google i/o".

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = English()
matcher = Matcher(nlp.vocab)

def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)

pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", add_event_ent, pattern)
doc = nlp("This is a text about Google I/O")
matches = matcher(doc)
```

By the way, very similar logic is implemented in the built-in
[`EntityRuler`](/api/entityruler). It also takes care of handling overlapping
matches, which you would otherwise have to take care of yourself.

> #### Tip: Visualizing matches
>
> When working with entities, you can use [displaCy](/api/top-level#displacy) to
> quickly generate a NER visualization from your updated `Doc`, which can be
> exported as an HTML file:
>
> ```python
> from spacy import displacy
> html = displacy.render(doc, style="ent", page=True,
>                        options={"ents": ["EVENT"]})
> ```
>
> For more info and examples, see the usage guide on
> [visualizing spaCy](/usage/visualizers).

We can now call the matcher on our documents. The patterns will be matched in
the order they occur in the text. The matcher will then iterate over the
matches, look up the callback for the match ID that was matched, and invoke it.

```python
doc = nlp(YOUR_TEXT_HERE)
matcher(doc)
```

When the callback is invoked, it is passed four arguments: the matcher itself,
the document, the position of the current match, and the total list of matches.
This allows you to write callbacks that consider the entire set of matched
phrases, so that you can resolve overlaps and other conflicts in whatever way
you prefer.

| Argument  | Type      | Description                                                                                                          |
| --------- | --------- | -------------------------------------------------------------------------------------------------------------------- |
| `matcher` | `Matcher` | The matcher instance.                                                                                                |
| `doc`     | `Doc`     | The document the matcher was used on.                                                                                |
| `i`       | int       | Index of the current match (`matches[i]`).                                                                           |
| `matches` | list      | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`.  |
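
Since the callback receives the full list of matches, one possible way to
resolve overlaps is to prefer the longest match and drop everything that
overlaps with it. Here's a minimal sketch of that strategy in plain Python,
operating on `(match_id, start, end)` tuples. The function name
`filter_overlaps` and the integer match IDs are made up for illustration, and
"longest match wins" is only one of several reasonable policies.

```python
def filter_overlaps(matches):
    """Keep the longest matches and drop any match overlapping a kept one."""
    # Longest matches first; for equal lengths, the earlier start wins
    by_length = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept = []
    seen_tokens = set()
    for match_id, start, end in by_length:
        # A shorter or equally long match overlaps a kept one exactly when
        # its first or last token has already been claimed
        if start not in seen_tokens and end - 1 not in seen_tokens:
            kept.append((match_id, start, end))
            seen_tokens.update(range(start, end))
    # Return the surviving matches in document order
    return sorted(kept, key=lambda m: m[1])

# Illustrative match tuples, the IDs are placeholders for real hash values
matches = [(1, 0, 2), (2, 0, 4), (3, 3, 5), (4, 6, 7)]
print(filter_overlaps(matches))
```

Inside an `on_match` callback, you could apply a function like this to the
`matches` argument before deciding which spans to act on.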

### Using custom pipeline components {#matcher-pipeline}

Let's say your data also contains some annoying pre-processing artifacts, like
leftover HTML line breaks (e.g. `<br>` or `<BR/>`). To make your text easier to
analyze, you want to merge those into one token and flag them, to make sure you
can ignore them later. Ideally, this should all be done automatically as you
process the text. You can achieve this by adding a
[custom pipeline component](/usage/processing-pipelines#custom-components)
that's called on each `Doc` object, merges the leftover HTML spans and sets an
attribute `bad_html` on the token.

```python
### {executable="true"}
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# We're using a class because the component needs to be initialised with
# the shared vocab via the nlp object
class BadHTMLMerger(object):
    def __init__(self, nlp):
        # Register a new token extension to flag bad HTML
        Token.set_extension("bad_html", default=False)
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add(
            "BAD_HTML",
            None,
            [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
            [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
        )

    def __call__(self, doc):
        # This method is invoked when the component is called on a Doc
        matches = self.matcher(doc)
        spans = []  # Collect the matched spans here
        for match_id, start, end in matches:
            spans.append(doc[start:end])
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
                for token in span:
                    token._.bad_html = True  # Mark token as bad HTML
        return doc

nlp = spacy.load("en_core_web_sm")
html_merger = BadHTMLMerger(nlp)
nlp.add_pipe(html_merger, last=True)  # Add component to the pipeline
doc = nlp("Hello<br>world! <br/> This is a test.")
for token in doc:
    print(token.text, token._.bad_html)

```

Instead of hard-coding the patterns into the component, you could also make it
take a path to a JSON file containing the patterns. This lets you reuse the
component with different patterns, depending on your application:

```python
html_merger = BadHTMLMerger(nlp, path="/path/to/patterns.json")
```
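
The `path` argument above isn't implemented in the example component. As a
rough sketch of what the loading step could look like, the hypothetical helper
`load_patterns` below reads a JSON file that is assumed to contain a list of
patterns, each being a list of token dictionaries. Both the helper and the file
layout are assumptions for illustration, not a format prescribed by spaCy.

```python
import json
import os
import tempfile

def load_patterns(path):
    """Read a list of token patterns from a JSON file (hypothetical helper)."""
    with open(path, encoding="utf8") as f:
        patterns = json.load(f)
    # Minimal sanity check: each pattern is a list of token dictionaries
    for pattern in patterns:
        is_token_list = isinstance(pattern, list) and all(
            isinstance(token, dict) for token in pattern
        )
        if not is_token_list:
            raise ValueError("Each pattern must be a list of token dictionaries")
    return patterns

# Round trip: write the two hard-coded patterns to a file and read them back
patterns = [
    [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
    [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "patterns.json")
    with open(path, "w", encoding="utf8") as f:
        json.dump(patterns, f)
    loaded = load_patterns(path)
print(loaded == patterns)
```

In the component's `__init__`, the loaded list could then be unpacked into the
matcher, for example `self.matcher.add("BAD_HTML", None, *load_patterns(path))`.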

<Infobox title="📖 Processing pipelines">

For more details and examples of how to **create custom pipeline components**
and **extension attributes**, see the
[usage guide](/usage/processing-pipelines).

</Infobox>

### Example: Using linguistic annotations {#example1}

Let's say you're analyzing user comments and you want to find out what people
are saying about Facebook. You want to start off by finding adjectives following
"Facebook is" or "Facebook was". This is obviously a very rudimentary solution,
but it'll be fast, and a great way to get an idea for what's in your data. Your
pattern could look like this:

```python
[{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]
```

This translates to a token whose lowercase form matches "facebook" (like
Facebook, facebook or FACEBOOK), followed by a token with the lemma "be" (for
example, is, was, or 's), followed by an **optional** adverb, followed by an
adjective. Using the linguistic annotations here is especially useful, because
you can tell spaCy to match "Facebook's annoying", but **not** "Facebook's
annoying ads". The optional adverb makes sure you won't miss adjectives with
intensifiers, like "pretty awful" or "very nice".

To get a quick overview of the results, you could collect all sentences
containing a match and render them with the
[displaCy visualizer](/usage/visualizers). In the callback function, you'll have
access to the `start` and `end` of each match, as well as the parent `Doc`. This
lets you determine the sentence containing the match, `doc[start:end].sent`,
and calculate the start and end of the matched span within the sentence. Using
displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
list of dictionaries containing the text and entities to render.

```python
### {executable="true"}
import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matched_sents = []  # Collect data of matched sentences to be visualized

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # Matched span
    sent = span.sent  # Sentence containing matched span
    # Append mock entity for match in displaCy style to matched_sents.
    # Get the match span within the sentence by offsetting the start and end
    # of the span by the start of the sentence in the doc
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"},
           {"POS": "ADJ"}]
matcher.add("FacebookIs", collect_sents, pattern)  # add pattern
doc = nlp("I'd say that Facebook is evil. – Facebook is pretty cool, right?")
matches = matcher(doc)

# Serve visualization of sentences containing match with displaCy
# set manual=True to make displaCy render straight from a dictionary
# (if you're not running the code within a Jupyter environment, you can
# use displacy.serve instead)
displacy.render(matched_sents, style="ent", manual=True)
```

### Example: Phone numbers {#example2}

Phone numbers can have many different formats and matching them is often tricky.
During tokenization, spaCy will leave sequences of numbers intact and only split
on whitespace and punctuation. This means that your match pattern will have to
look out for number sequences of a certain length, surrounded by specific
punctuation – depending on the
[national conventions](https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers).

The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us
anything about the length. However, you can use the `SHAPE` attribute, with each
`d` representing a digit (the shape only records up to 4 digits / characters of
the same type in a row):

```python
[{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
 {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]
```

This will match phone numbers of the format **(123) 4567 8901** or **(123)
4567-8901**. To also match formats like **(123) 456 789**, you can add a second
pattern using `'ddd'` in place of `'dddd'`. By hard-coding some values, you can
match only certain, country-specific numbers. For example, here's a pattern to
match the most common formats of
[international German numbers](https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany):

```python
[{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
 {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
```

Depending on the formats your application needs to match, creating an extensive
set of rules like this is often better than training a model. It'll produce more
predictable results, is much easier to modify and extend, and doesn't require
any training data – only a set of test cases.

```python
### {executable="true"}
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"},
           {"ORTH": "-", "OP": "?"}, {"SHAPE": "ddd"}]
matcher.add("PHONE_NUMBER", None, pattern)

doc = nlp("Call me at (123) 456 789 or (123) 456 789!")
print([t.text for t in doc])
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
```
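
To get a feeling for which `SHAPE` values to expect when writing patterns like
the ones above, here's a rough plain-Python approximation of the shape feature.
It's a sketch for illustration and not spaCy's implementation: the name
`approx_shape` is made up, and edge cases may be handled differently by the
library. It shows why a six-digit token still has the shape `dddd`, which is the
reason the German example combines `SHAPE` with `LENGTH`.

```python
def approx_shape(text):
    """Approximate a token shape: digits become d, letters become x or X."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # Punctuation and symbols are kept as they are
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Only the first 4 characters of a run are recorded
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)

for text in ["123", "4567", "123456", "Hello"]:
    print(text, approx_shape(text))
```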

### Example: Hashtags and emoji on social media {#example3}

Social media posts, especially tweets, can be difficult to work with. They're
very short and often contain various emoji and hashtags. By only looking at the
plain text, you'll lose a lot of valuable semantic information.

Let's say you've extracted a large sample of social media posts on a specific
topic, for example posts mentioning a brand name or product. As the first step
of your data exploration, you want to filter out posts containing certain emoji
and use them to assign a general sentiment score, based on whether the expressed
emotion is positive or negative, e.g. 😀 or 😞. You also want to find, merge and
label hashtags like `#MondayMotivation`, to be able to ignore or analyze them
later.

> #### Note on sentiment analysis
>
> Ultimately, sentiment analysis is not always _that_ easy. In addition to the
> emoji, you'll also want to take specific words into account and check the
> `subtree` for intensifiers like "very", to increase the sentiment score. At
> some point, you might also want to train a sentiment model. However, the
> approach described in this example is very useful for **bootstrapping rules to
> collect training data**. It's also an incredibly fast way to gather first
> insights into your data – with about 1 million tweets, you'd be looking at a
> processing time of **under 1 minute**.

By default, spaCy's tokenizer will split emoji into separate tokens. This means
that you can create a pattern for one or more emoji tokens. Valid hashtags
usually consist of a `#`, plus a sequence of ASCII characters with no
whitespace, making them easy to match as well.

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # We only want the tokenizer, so no need to load a model
matcher = Matcher(nlp.vocab)

pos_emoji = ["😀", "😃", "😂", "🤣", "😊", "😍"]  # Positive emoji
neg_emoji = ["😞", "😠", "😩", "😢", "😭", "😒"]  # Negative emoji

# Add patterns to match one or more emoji tokens
pos_patterns = [[{"ORTH": emoji}] for emoji in pos_emoji]
neg_patterns = [[{"ORTH": emoji}] for emoji in neg_emoji]

# Function to label the sentiment
def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == "HAPPY":  # Don't forget to get string!
        doc.sentiment += 0.1  # Add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == "SAD":
        doc.sentiment -= 0.1  # Subtract 0.1 for negative sentiment

matcher.add("HAPPY", label_sentiment, *pos_patterns)  # Add positive pattern
matcher.add("SAD", label_sentiment, *neg_patterns)  # Add negative pattern

# Add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add("HASHTAG", None, [{"ORTH": "#"}, {"IS_ASCII": True}])

doc = nlp("Hello world 😀 #MondayMotivation")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = doc.vocab.strings[match_id]  # Look up string ID
    span = doc[start:end]
    print(string_id, span.text)
```

Because the `on_match` callback receives the ID of each match, you can use the
same function to handle the sentiment assignment for both the positive and
negative pattern. To keep it simple, we'll either add or subtract `0.1` points –
this way, the score will also reflect combinations of emoji, even positive _and_
negative ones.
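The additive scoring can be illustrated without spaCy. The following is a minimal plain-Python sketch, not part of the matcher itself: the function name `emoji_sentiment` and the pre-split token list are assumptions made for illustration, and the emoji sets are shortened versions of the lists above.

```python
POS_EMOJI = {"😀", "😃", "😂", "🤣", "😊", "😍"}  # Positive emoji
NEG_EMOJI = {"😞", "😠", "😩", "😢", "😭", "😒"}  # Negative emoji

def emoji_sentiment(tokens, step=0.1):
    """Add `step` per positive emoji and subtract it per negative emoji."""
    score = 0.0
    for token in tokens:
        if token in POS_EMOJI:
            score += step
        elif token in NEG_EMOJI:
            score -= step
    # Round to hide floating-point noise from repeated additions
    return round(score, 2)

# Two positive emoji and one negative emoji partially cancel out
mixed = emoji_sentiment(["Great", "day", "😀", "😍", "but", "😞"])
print(mixed)
```

Because each emoji contributes independently, a post with mixed emoji ends up with a score between the purely positive and purely negative cases.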

With a library like [Emojipedia](https://github.com/bcongdon/python-emojipedia),
we can also retrieve a short description for each emoji – for example, 😍's
official title is "Smiling Face With Heart-Eyes". Assigning it to a
[custom attribute](/usage/processing-pipelines#custom-components-attributes) on
the emoji span will make it available as `span._.emoji_desc`.

```python
from emojipedia import Emojipedia  # Installation: pip install emojipedia
from spacy.tokens import Span  # Get the global Span object

Span.set_extension("emoji_desc", default=None)  # Register the custom attribute

def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == "HAPPY":  # Don't forget to get string!
        doc.sentiment += 0.1  # Add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == "SAD":
        doc.sentiment -= 0.1  # Subtract 0.1 for negative sentiment
    span = doc[start:end]
    emoji = Emojipedia.search(span[0].text)  # Get data for emoji
    span._.emoji_desc = emoji.title  # Assign emoji description

```

To label the hashtags, we can use a
[custom attribute](/usage/processing-pipelines#custom-components-attributes) set
on the respective token:

```python
### {executable="true"}
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add("HASHTAG", None, [{"ORTH": "#"}, {"IS_ASCII": True}])

# Register token extension
Token.set_extension("is_hashtag", default=False)

doc = nlp("Hello world 😀 #MondayMotivation")
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
    if doc.vocab.strings[match_id] == "HASHTAG":
        hashtags.append(doc[start:end])
with doc.retokenize() as retokenizer:
    for span in hashtags:
        retokenizer.merge(span)
        for token in span:
            token._.is_hashtag = True

for token in doc:
    print(token.text, token._.is_hashtag)
```

To process a stream of social media posts, we can use
[`Language.pipe`](/api/language#pipe), which will return a stream of `Doc`
objects that we can pass to [`Matcher.pipe`](/api/matcher#pipe).

```python
docs = nlp.pipe(LOTS_OF_TWEETS)
matches = matcher.pipe(docs)
```

## Efficient phrase matching {#phrasematcher}

If you need to match large terminology lists, you can also use the
[`PhraseMatcher`](/api/phrasematcher) and create [`Doc`](/api/doc) objects
instead of token patterns, which is much more efficient overall. The `Doc`
patterns can contain single or multiple tokens.

### Adding phrase patterns {#adding-phrase-patterns}

```python
### {executable="true"}
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
```

Since spaCy is used for processing both the patterns and the text to be matched,
you won't have to worry about specific tokenization – for example, you can
simply pass in `nlp("Washington, D.C.")` and won't have to write a complex token
pattern covering the exact tokenization of the term.

<Infobox title="Important note on creating patterns" variant="warning">

To create the patterns, each phrase has to be processed with the `nlp` object.
If you have a model loaded, doing this in a loop or list comprehension can
easily become inefficient and slow. If you **only need the tokenization and
lexical attributes**, you can run [`nlp.make_doc`](/api/language#make_doc)
instead, which will only run the tokenizer. For an additional speed boost, you
can also use the [`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will
process the texts as a stream.

```diff
- patterns = [nlp(term) for term in LOTS_OF_TERMS]
+ patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]
+ patterns = list(nlp.tokenizer.pipe(LOTS_OF_TERMS))
```

</Infobox>

### Matching on other token attributes {#phrasematcher-attrs new="2.1"}

By default, the `PhraseMatcher` will match on the verbatim token text, i.e.
`Token.text`. By setting the `attr` argument on initialization, you can change
**which token attribute the matcher should use** when comparing the phrase
pattern to the matched `Doc`. For example, using the attribute `LOWER` lets you
match on `Token.lower` and create case-insensitive match patterns:

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(name) for name in ["Angela Merkel", "Barack Obama"]]
matcher.add("Names", None, *patterns)

doc = nlp("angela merkel and us president barack Obama")
for match_id, start, end in matcher(doc):
    print("Matched based on lowercase token text:", doc[start:end])
```

<Infobox title="Important note on creating patterns" variant="warning">

The examples here use [`nlp.make_doc`](/api/language#make_doc) to create `Doc`
object patterns as efficiently as possible and without running any of the other
pipeline components. If the token attribute you want to match on is set by a
pipeline component, **make sure that the pipeline component runs** when you
create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
objects need to have part-of-speech tags set by the `tagger`. You can either
call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use
[`nlp.select_pipes`](/api/language#select_pipes) to disable components
selectively.

</Infobox>

Another possible use case is matching number tokens like IP addresses based on
their shape. This means that you won't have to worry about how those strings
will be tokenized and you'll be able to find tokens and combinations of tokens
based on a few examples. Here, we're matching on the shapes `ddd.d.d.d` and
`ddd.ddd.d.d`:

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("IP", None, nlp("127.0.0.1"), nlp("127.127.0.0"))

doc = nlp("Often the router will have an IP address such as 192.168.1.1 or 192.168.2.1.")
for match_id, start, end in matcher(doc):
    print("Matched based on token shape:", doc[start:end])
```
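To see why `192.168.1.1` matches a pattern created from `127.127.0.0`, it can help to look at how a shape string is derived. The function below is a simplified plain-Python approximation written for illustration, not spaCy's own implementation; the name `word_shape` and the cap of four repeated characters per run are assumptions of this sketch.

```python
def word_shape(text, max_run=4):
    """Approximate a token shape: digits -> d, lowercase -> x, uppercase -> X."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # Punctuation is kept as-is
        # Count how often the same shape character repeats in a row
        run = run + 1 if shape_char == last else 0
        last = shape_char
        if run < max_run:
            shape.append(shape_char)
    return "".join(shape)

for text in ["127.0.0.1", "127.127.0.0", "192.168.1.1", "192.168.2.1"]:
    print(text, word_shape(text))
```

Two strings with different digits produce the same shape as long as the digit groups have the same lengths, which is what lets a handful of example patterns cover many addresses.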

In theory, the same also works for attributes like `POS`. For example, a pattern
`nlp("I like cats")` matched based on its part-of-speech tag would return a
match for "I love dogs". You could also match on boolean flags like `IS_PUNCT`
to match phrases with the same sequence of punctuation and non-punctuation
tokens as the pattern. But this can easily get confusing and doesn't have much
of an advantage over writing one or two token patterns.

## Rule-based entity recognition {#entityruler new="2.1"}

The [`EntityRuler`](/api/entityruler) is an exciting new component that lets you
add named entities based on pattern dictionaries, and makes it easy to combine
rule-based and statistical named entity recognition for even more powerful
models.

### Entity Patterns {#entityruler-patterns}

Entity patterns are dictionaries with two keys: `"label"`, specifying the label
to assign to the entity if the pattern is matched, and `"pattern"`, the match
pattern. The entity ruler accepts two types of patterns:

1. **Phrase patterns** for exact string matches (string).

   ```python
   {"label": "ORG", "pattern": "Apple"}
   ```

2. **Token patterns** with one dictionary describing one token (list).

   ```python
   {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
   ```

### Using the entity ruler {#entityruler-usage}

The [`EntityRuler`](/api/entityruler) is a pipeline component that's typically
added via [`nlp.add_pipe`](/api/language#add_pipe). When the `nlp` object is
called on a text, it will find matches in the `doc` and add them as entities to
the `doc.ents`, using the specified pattern label as the entity label. If any
matches were to overlap, the pattern matching most tokens takes priority. If
they also happen to be equally long, then the match occurring first in the `Doc`
is chosen.

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
```
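The overlap rule described above (the longest match wins, and ties go to the match that starts first) can be sketched in plain Python. This is an illustration under stated assumptions rather than the entity ruler's actual code: the helper name `resolve_overlaps` and the `(label, start, end)` tuples with token offsets are hypothetical.

```python
def resolve_overlaps(matches):
    """Keep the longest matches; break ties by the earliest start offset."""
    # Sort by length (descending), then by start offset (ascending)
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    seen_tokens = set()
    kept = []
    for label, start, end in ordered:
        span_tokens = set(range(start, end))
        if span_tokens & seen_tokens:
            continue  # Overlaps with a match that has higher priority
        kept.append((label, start, end))
        seen_tokens |= span_tokens
    # Return the surviving matches in document order
    return sorted(kept, key=lambda m: m[1])

candidates = [("ORG", 0, 1), ("GPE", 8, 10), ("LOC", 9, 10), ("ORG", 0, 2)]
resolved = resolve_overlaps(candidates)
print(resolved)
```

Here the two-token candidates take priority, so the shorter candidates that share tokens with them are dropped.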

The entity ruler is designed to integrate with spaCy's existing statistical
models and enhance the named entity recognizer. If it's added **before the
`"ner"` component**, the entity recognizer will respect the existing entity
spans and adjust its predictions around it. This can significantly improve
accuracy in some cases. If it's added **after the `"ner"` component**, the
entity ruler will only add spans to the `doc.ents` if they don't overlap with
existing entities predicted by the model. To overwrite overlapping entities, you
can set `overwrite_ents=True` on initialization.

```python
### {executable="true"}
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "MyCorp Inc."}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp("MyCorp Inc. is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

#### Validating and debugging EntityRuler patterns {#entityruler-pattern-validation new="2.1.8"}

The `EntityRuler` can validate patterns against a JSON schema with the option
`validate=True`. See details under
[Validating and debugging patterns](#pattern-validation).

```python
ruler = EntityRuler(nlp, validate=True)
```

### Adding IDs to patterns {#entityruler-ent-ids new="2.2.2"}

The [`EntityRuler`](/api/entityruler) can also accept an `id` attribute for each
pattern. Using the `id` attribute allows multiple patterns to be associated with
the same entity.

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc1 = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])

doc2 = nlp("Apple is opening its first big office in San Fran.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])
```

If the `id` attribute is included in the [`EntityRuler`](/api/entityruler)
patterns, the `ent_id_` property of the matched entity is set to the `id` given
in the patterns. So in the example above it's easy to identify that "San
Francisco" and "San Fran" are both the same entity.

### Using pattern files {#entityruler-files}

The [`to_disk`](/api/entityruler#to_disk) and
[`from_disk`](/api/entityruler#from_disk) let you save and load patterns to and
from JSONL (newline-delimited JSON) files, containing one pattern object per
line.

```json
### patterns.jsonl
{"label": "ORG", "pattern": "Apple"}
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
```

```python
ruler.to_disk("./patterns.jsonl")
new_ruler = EntityRuler(nlp).from_disk("./patterns.jsonl")
```
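To make the file format concrete, here is a small sketch that writes and reads a JSONL pattern file using only Python's standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical and only illustrate the one-object-per-line layout; in practice you would normally rely on the entity ruler's own `to_disk` and `from_disk` methods.

```python
import json
import os
import tempfile

patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as file_:
        for record in records:
            file_.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf8") as file_:
        return [json.loads(line) for line in file_ if line.strip()]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "patterns.jsonl")
    write_jsonl(path, patterns)
    loaded = read_jsonl(path)

print(loaded == patterns)
```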

<Infobox title="Integration with Prodigy">

If you're using the [Prodigy](https://prodi.gy) annotation tool, you might
recognize these pattern files from bootstrapping your named entity and text
classification labelling. The patterns for the `EntityRuler` follow the same
syntax, so you can use your existing Prodigy pattern files in spaCy, and vice
versa.

</Infobox>

When you save out an `nlp` object that has an `EntityRuler` added to its
pipeline, its patterns are automatically exported to the model directory:

```python
nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.add_pipe(ruler)
nlp.to_disk("/path/to/model")
```

The saved model now includes the `"entity_ruler"` in its `"pipeline"` setting in
the `meta.json`, and the model directory contains a file `entityruler.jsonl`
with the patterns. When you load the model back in, all pipeline components will
be restored and deserialized – including the entity ruler. This lets you ship
powerful model packages with binary weights _and_ rules included!

### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}

When using a large number of **phrase patterns** (roughly more than 10,000),
it's useful to understand how the `add_patterns` function of the `EntityRuler`
works. For each **phrase pattern**, the `EntityRuler` calls the `nlp` object to
construct a `Doc` object. This is needed in case you add the `EntityRuler` at
the end of an existing pipeline with, for example, a POS tagger and want to
extract matches based on the pattern's POS signature. In this case you would
pass a config value of `phrase_matcher_attr="POS"` for the `EntityRuler`.

Running the full language pipeline across every pattern in a large list scales
linearly and can therefore take a long time on large numbers of phrase patterns.
As of spaCy 2.2.4, the `add_patterns` function has been refactored to use
`nlp.pipe` on all phrase patterns, resulting in about a 10x-20x speed-up with
5,000-100,000 phrase patterns respectively. Even with this speed-up (but
especially if you're using an older version), the `add_patterns` function can
still take a long time. An easy workaround to make this function run faster is
disabling the other language pipes while adding the phrase patterns.

```python
entityruler = EntityRuler(nlp)
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]

with nlp.select_pipes(enable="tagger"):
    entityruler.add_patterns(patterns)
```

## Combining models and rules {#models-rules}

You can combine statistical and rule-based components in a variety of ways.
Rule-based components can be used to improve the accuracy of statistical models,
by presetting tags, entities or sentence boundaries for specific tokens. The
statistical models will usually respect these preset annotations, which
sometimes improves the accuracy of other decisions. You can also use rule-based
components after a statistical model to correct common errors. Finally,
rule-based components can reference the attributes set by statistical models, in
order to implement more abstract logic.

### Example: Expanding named entities {#models-rules-ner}

When using a pretrained
[named entity recognition](/usage/linguistic-features/#named-entities) model to
extract information from your texts, you may find that the predicted span only
includes parts of the entity you're looking for. Sometimes, this happens if the
statistical model predicts entities incorrectly. Other times, it happens if the
way the entity type was defined in the original training corpus doesn't match
what you need for your application.

> #### Where corpora come from
>
> Corpora used to train models from scratch are often produced in academia. They
> contain text from various sources with linguistic features labeled manually by
> human annotators (following a set of specific guidelines). The corpora are
> then distributed with evaluation data, so other researchers can benchmark
> their algorithms and everyone can report numbers on the same data. However,
> most applications need to learn information that isn't contained in any
> available corpus.

For example, the corpus spaCy's [English models](/models/en) were trained on
defines a `PERSON` entity as just the **person name**, without titles like "Mr"
or "Dr". This makes sense, because it makes it easier to resolve the entity type
back to a knowledge base. But what if your application needs the full names,
_including_ the titles?

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

While you could try and teach the model a new definition of the `PERSON` entity
by [updating it](/usage/training/#example-train-ner) with more examples of spans
that include the title, this might not be the most efficient approach. The
existing model was trained on over 2 million words, so in order to completely
change the definition of an entity type, you might need a lot of training
examples. However, if you already have the predicted `PERSON` entities, you can
use a rule-based approach that checks whether they come with a title and if so,
expands the entity span by one token. After all, what all titles in this example
have in common is that _if_ they occur, they occur in the **previous token**
right before the person entity.

```python
### {highlight="7-11"}
from spacy.tokens import Span

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        # Only check for title if it's a person and not the first token
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            else:
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc
```

The above function takes a `Doc` object, modifies its `doc.ents` and returns it.
This is exactly what a [pipeline component](/usage/processing-pipelines) does,
so in order to let it run automatically when processing a text with the `nlp`
object, we can use [`nlp.add_pipe`](/api/language#add_pipe) to add it to the
current pipeline.

```python
### {executable="true"}
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            else:
                # Keep PERSON entities without a title unchanged
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Add the component after the named entity recognizer
nlp.add_pipe(expand_person_entities, after='ner')

doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

An alternative approach would be to use an
[extension attribute](/usage/processing-pipelines/#custom-components-attributes)
like `._.person_title` and add it to `Span` objects (which includes entity spans
in `doc.ents`). The advantage here is that the entity text stays intact and can
still be used to look up the name in a knowledge base. The following function
takes a `Span` object, checks the previous token if it's a `PERSON` entity and
returns the title if one is found. The `Span.doc` attribute gives us easy access
to the span's parent document.

```python
def get_person_title(span):
    if span.label_ == "PERSON" and span.start != 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
            return prev_token.text
```

We can now use the [`Span.set_extension`](/api/span#set_extension) method to add
the custom extension attribute `"person_title"`, using `get_person_title` as the
getter function.

```python
### {executable="true"}
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def get_person_title(span):
    if span.label_ == "PERSON" and span.start != 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
            return prev_token.text

# Register the Span extension as 'person_title'
Span.set_extension("person_title", getter=get_person_title)

doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_, ent._.person_title) for ent in doc.ents])
```

### Example: Using entities, part-of-speech tags and the dependency parse {#models-rules-pos-dep}

> #### Linguistic features
>
> This example makes extensive use of part-of-speech tag and dependency
> attributes and related `Doc`, `Token` and `Span` methods. For an introduction
> on this, see the guide on
> [linguistic features](/usage/linguistic-features/). Also
> see the [annotation specs](/api/annotation#pos-tagging) for details on the
> label schemes.

Let's say you want to parse professional biographies and extract the person
names and company names, and whether it's a company they're _currently_ working
at, or a _previous_ company. One approach could be to try and train a named
entity recognizer to predict `CURRENT_ORG` and `PREVIOUS_ORG` – but this
distinction is very subtle and something the entity recognizer may struggle to
learn. Nothing about "Acme Corp Inc." is inherently "current" or "previous".

However, the syntax of the sentence holds some very important clues: we can
check for trigger words like "work", whether they're **past tense** or **present
tense**, whether company names are attached to it and whether the person is the
subject. All of this information is available in the part-of-speech tags and the
dependency parse.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alex Smith worked at Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

> - `nsubj`: Nominal subject.
> - `prep`: Preposition.
> - `pobj`: Object of preposition.
> - `NNP`: Proper noun, singular.
> - `VBD`: Verb, past tense.
> - `IN`: Conjunction, subordinating or preposition.

![Visualization of dependency parse](../images/displacy-model-rules.svg "[`spacy.displacy`](/api/top-level#displacy) visualization with `options={'fine_grained': True}` to output the fine-grained part-of-speech tags, i.e. `Token.tag_`")

In this example, "worked" is the root of the sentence and is a past tense verb.
Its subject is "Alex Smith", the person who worked. "at Acme Corp Inc." is a
prepositional phrase attached to the verb "worked". To extract this
relationship, we can start by looking at the predicted `PERSON` entities, find
their heads and check whether they're attached to a trigger word like "work".
Next, we can check for prepositional phrases attached to the head and whether
they contain an `ORG` entity. Finally, to determine whether the company
affiliation is current, we can check the head's part-of-speech tag.

```python
person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
for ent in person_entities:
    # Because the entity is a span, we need to use its root token. The head
    # is the syntactic governor of the person, e.g. the verb
    head = ent.root.head
    if head.lemma_ == "work":
        # Check if the children contain a preposition
        preps = [token for token in head.children if token.dep_ == "prep"]
        for prep in preps:
            # Check if tokens part of ORG entities are in the preposition's
            # children, e.g. at -> Acme Corp Inc.
            orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
            # If the verb is in past tense, the company was a previous company
            print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
```

To apply this logic automatically when we process a text, we can add it to the
`nlp` object as a
[custom pipeline component](/usage/processing-pipelines/#custom-components). The
above logic also expects that entities are merged into single tokens. spaCy
ships with a handy built-in `merge_entities` that takes care of that. Instead of
just printing the result, you could also write it to
[custom attributes](/usage/processing-pipelines#custom-components-attributes) on
the entity `Span` – for example `._.orgs` or `._.prev_orgs` and
`._.current_orgs`.

> #### Merging entities
>
> Under the hood, entities are merged using the
> [`Doc.retokenize`](/api/doc#retokenize) context manager:
>
> ```python
> with doc.retokenize() as retokenizer:
>     for ent in doc.ents:
>         retokenizer.merge(ent)
> ```

```python
### {executable="true"}
import spacy
from spacy.pipeline import merge_entities
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
                print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
    return doc

# To make the entities easier to work with, we'll merge them into single tokens
nlp.add_pipe(merge_entities)
nlp.add_pipe(extract_person_orgs)

doc = nlp("Alex Smith worked at Acme Corp Inc.")
# If you're not in a Jupyter / IPython environment, use displacy.serve
displacy.render(doc, options={'fine_grained': True})
```

If you change the sentence structure above, for example to "was working", you'll
notice that our current logic fails and doesn't correctly detect the company as
a past organization. That's because the root is a participle and the tense
information is in the attached auxiliary "was":

![Visualization of dependency parse](../images/displacy-model-rules2.svg)

To solve this, we can adjust the rules to also check for the above construction:

```python
### {highlight="9-11"}
def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [t for t in prep.children if t.ent_type_ == "ORG"]
                aux = [token for token in head.children if token.dep_ == "aux"]
                past_aux = any(t.tag_ == "VBD" for t in aux)
                past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
                print({'person': ent, 'orgs': orgs, 'past': past})
    return doc
```
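The tense check on its own can be isolated into a small function that only looks at tag strings, which makes it easy to test the rule without running a model. This is a plain-Python sketch for illustration: the function name `is_past` and the idea of passing in the tags as plain strings are assumptions, not part of spaCy's API.

```python
def is_past(head_tag, aux_tags):
    """Return True if the verb construction is past tense.

    head_tag: fine-grained tag of the trigger verb, e.g. "VBD" or "VBG"
    aux_tags: fine-grained tags of the auxiliaries attached to the verb
    """
    past_aux = any(tag == "VBD" for tag in aux_tags)
    # Simple past ("worked") or past progressive ("was working")
    return head_tag == "VBD" or (head_tag == "VBG" and past_aux)

print(is_past("VBD", []))       # "worked"
print(is_past("VBG", ["VBD"]))  # "was working"
print(is_past("VBG", ["VBZ"]))  # "is working"
```

Keeping the rule in a separate function like this also gives each new construction you want to cover an obvious place to live.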

In your final rule-based system, you may end up with **several different code
paths** to cover the types of constructions that occur in your data.
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								---
 								title: Rule-based matching
 								teaser: Find phrases and tokens, and match entities
 								menu:
 								  - ['Token Matcher', 'matcher']
 								  - ['Phrase Matcher', 'phrasematcher']
 								  - ['Entity Ruler', 'entityruler']
 								  - ['Models & Rules', 'models-rules']
 								---

Compared to using regular expressions on raw text, spaCy's rule-based matcher
engines and components not only let you find the words and phrases you're
looking for – they also give you access to the tokens within the document and
their relationships. This means you can easily access and analyze the
surrounding tokens, merge spans into single tokens or add entries to the named
entities in `doc.ents`.

<Accordion title="Should I use rules or train a model?" id="rules-vs-model">

For complex tasks, it's usually better to train a statistical entity recognition
model. However, statistical models require training data, so for many
situations, rule-based approaches are more practical. This is especially true at
the start of a project: you can use a rule-based approach as part of a data
collection process, to help you "bootstrap" a statistical model.

Training a model is useful if you have some examples and you want your system to
be able to **generalize** based on those examples. It works especially well if
there are clues in the _local context_. For instance, if you're trying to detect
person or company names, your application may benefit from a statistical named
entity recognition model.

Rule-based systems are a good choice if there's a more or less **finite number**
of examples that you want to find in the data, or if there's a very **clear,
structured pattern** you can express with token rules or regular expressions.
For instance, country names, IP addresses or URLs are things you might be able
to handle well with a purely rule-based approach.

You can also combine both approaches and improve a statistical model with rules
to handle very specific cases and boost accuracy. For details, see the section
on [rule-based entity recognition](#entityruler).

</Accordion>

<Accordion title="When should I use the token matcher vs. the phrase matcher?" id="matcher-vs-phrase-matcher">

The `PhraseMatcher` is useful if you already have a large terminology list or
gazetteer consisting of single or multi-token phrases that you want to find
exact instances of in your data. As of spaCy v2.1.0, you can also match on the
`LOWER` attribute for fast and case-insensitive matching.

The `Matcher` isn't as blazing fast as the `PhraseMatcher`, since it compares
across individual token attributes. However, it allows you to write very
abstract representations of the tokens you're looking for, using lexical
attributes, linguistic features predicted by the model, operators, set
membership and rich comparison. For example, you can find a noun, followed by a
verb with the lemma "love" or "like", followed by an optional determiner and
another token that's at least ten characters long.
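
As a rough sketch, the pattern described in the previous paragraph could be
written as follows. The rule name `LovePattern` is made up for illustration,
and the snippet assumes spaCy v2.2.2 or later, where `Matcher.add` accepts a
list of patterns as its second argument. It only registers the pattern: to run
it on text, you would need a pipeline that predicts part-of-speech tags and
lemmas.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough to build and register a pattern
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    # A noun
    {"POS": "NOUN"},
    # A verb with the lemma "love" or "like"
    {"POS": "VERB", "LEMMA": {"IN": ["love", "like"]}},
    # An optional determiner
    {"POS": "DET", "OP": "?"},
    # Another token that is at least ten characters long
    {"LENGTH": {">=": 10}},
]
# "LovePattern" is a hypothetical rule name
matcher.add("LovePattern", [pattern])
```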

</Accordion>

## Token-based matching {#matcher}

spaCy features a rule-matching engine, the [`Matcher`](/api/matcher), that
operates over tokens, similar to regular expressions. The rules can refer to
token annotations (e.g. the token `text` or `tag_`) and flags (e.g. `IS_PUNCT`).
The rule matcher also lets you pass in a custom callback to act on matches – for
example, to merge entities and apply custom labels. You can also associate
patterns with entity IDs, to allow some basic entity linking or disambiguation.
To match large terminology lists, you can use the
[`PhraseMatcher`](/api/phrasematcher), which accepts `Doc` objects as match
patterns.

### Adding patterns {#adding-patterns}

Let's say we want to enable spaCy to find a combination of three tokens:

1. A token whose **lowercase form matches "hello"**, e.g. "Hello" or "HELLO".
2. A token whose **`is_punct` flag is set to `True`**, i.e. any punctuation.
3. A token whose **lowercase form matches "world"**, e.g. "World" or "WORLD".

```python
[{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
```

<Infobox title="Important note" variant="danger">

When writing patterns, keep in mind that **each dictionary** represents **one
token**. If spaCy's tokenization doesn't match the tokens defined in a pattern,
the pattern is not going to produce any results. When developing complex
patterns, make sure to check examples against spaCy's tokenization:

```python
doc = nlp("A complex-example,!")
print([token.text for token in doc])
```

</Infobox>

First, we initialize the `Matcher` with a vocab. The matcher must always share
the same vocab with the documents it will operate on. We can now call
[`matcher.add()`](/api/matcher#add) with an ID and our custom pattern. The
second argument lets you pass in an optional callback function to invoke on a
successful match. For now, we set it to `None`.

```python
### {executable="true"}
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", None, pattern)

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
```

The matcher returns a list of `(match_id, start, end)` tuples – in this case,
`[(15578876784678163569, 0, 3)]`, which maps to the span `doc[0:3]` of our
original document. The `match_id` is the [hash value](/usage/spacy-101#vocab) of
the string ID "HelloWorld". To get the string value, you can look up the ID in
the [`StringStore`](/api/stringstore).

```python
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
    span = doc[start:end]                    # The matched span
```

Optionally, we could also choose to add more than one pattern, for example to
also match sequences without punctuation between "hello" and "world":

```python
matcher.add("HelloWorld", None,
            [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
            [{"LOWER": "hello"}, {"LOWER": "world"}])
```

By default, the matcher will only return the matches and **not do anything
else**, like merge entities or assign labels. This is all up to you and can be
defined individually for each pattern, by passing in a callback function as the
`on_match` argument on `add()`. This is useful, because it lets you write
entirely custom and **pattern-specific logic**. For example, you might want to
merge _some_ patterns into one token, while adding entity labels for other
pattern types. You shouldn't have to create different matchers for each of those
processes.
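
To illustrate the callback mechanism, here is a minimal sketch of an `on_match`
function that simply collects the text of every matched span. The names
`collect_match` and `matched_texts` are made up for this example. The snippet
assumes spaCy v2.2.2 or later, where `Matcher.add` takes a list of patterns and
an `on_match` keyword argument; on older versions, the callback would be passed
as the second positional argument instead. A blank English pipeline is used
because `LOWER` and `IS_PUNCT` don't depend on a trained model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only, no trained components
matcher = Matcher(nlp.vocab)
matched_texts = []

def collect_match(matcher, doc, i, matches):
    # The callback receives the matcher, the doc, the index of the
    # current match and the full list of matches
    match_id, start, end = matches[i]
    matched_texts.append(doc[start:end].text)

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern], on_match=collect_match)

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)  # the callback is invoked once per match
print(matched_texts)
```

Because the callback is attached to the rule rather than to the matcher as a
whole, a second rule added to the same matcher could use a different function.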

#### Available token attributes {#adding-patterns-attributes}

The available token pattern keys correspond to a number of
[`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are:

| Attribute                             | Type    | Description                                                                                            |
| ------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------ |
| `ORTH`                                | unicode | The exact verbatim text of a token.                                                                    |
| `TEXT` <Tag variant="new">2.1</Tag>   | unicode | The exact verbatim text of a token.                                                                    |
| `LOWER`                               | unicode | The lowercase form of the token text.                                                                  |
| `LENGTH`                              | int     | The length of the token text.                                                                          |
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`    | bool    | Token text consists of alphabetic characters, ASCII characters, digits.                                |
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE`    | bool    | Token text is in lowercase, uppercase, titlecase.                                                      |
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP`     | bool    | Token is punctuation, whitespace, stop word.                                                           |
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`  | bool    | Token text resembles a number, URL, email.                                                             |
| `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` | unicode | The token's simple and extended part-of-speech tag, dependency label, lemma, shape.                    |
| `ENT_TYPE`                            | unicode | The token's entity label.                                                                              |
| `_` <Tag variant="new">2.1</Tag>      | dict    | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). |

<Accordion title="Does it matter if the attribute names are uppercase or lowercase?">

No, it shouldn't. spaCy will normalize the names internally and
`{"LOWER": "text"}` and `{"lower": "text"}` will both produce the same result.
Using the uppercase version is mostly a convention to make it clear that the
attributes are "special" and don't exactly map to the token attributes like
`Token.lower` and `Token.lower_`.
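
One way to check this yourself, as a small sketch: register the same pattern
once with uppercase and once with lowercase keys and compare the results. The
rule names `UPPER_KEYS` and `LOWER_KEYS` are made up, and the snippet assumes
spaCy v2.2.2 or later, where `Matcher.add` accepts a list of patterns.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # lexical attributes only, no trained model needed
doc = nlp("Hello, world! Hello world!")

# Same pattern, written with uppercase keys ...
upper = Matcher(nlp.vocab)
upper.add("UPPER_KEYS", [[{"LOWER": "hello"}, {"IS_PUNCT": True}]])

# ... and with lowercase keys
lower = Matcher(nlp.vocab)
lower.add("LOWER_KEYS", [[{"lower": "hello"}, {"is_punct": True}]])

upper_spans = [(start, end) for _, start, end in upper(doc)]
lower_spans = [(start, end) for _, start, end in lower(doc)]
print(upper_spans == lower_spans)
```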

</Accordion>

<Accordion title="Why are not all token attributes supported?">

spaCy can't provide access to all of the attributes because the `Matcher` loops
over the Cython data, not the Python objects. Inside the matcher, we're dealing
with a [`TokenC` struct](/api/cython-structs#tokenc) – we don't have an instance
of [`Token`](/api/token). This means that all of the attributes that refer to
computed properties can't be accessed.

The uppercase attribute names like `LOWER` or `IS_PUNCT` refer to symbols from
the
[`spacy.attrs`](https://github.com/explosion/spaCy/tree/master/spacy/attrs.pyx)
enum table. They're passed into a function that essentially is a big case/switch
statement, to figure out which struct field to return. The same attribute
identifiers are used in [`Doc.to_array`](/api/doc#to_array), and a few other
places in the code where you need to describe fields like this.

</Accordion>

---

<Infobox title="Tip: Try the interactive matcher explorer">

[![Matcher demo](../images/matcher-demo.jpg)](https://explosion.ai/demos/matcher)

The [Matcher Explorer](https://explosion.ai/demos/matcher) lets you test the
rule-based `Matcher` by creating token patterns interactively and running them
over your text. Each token can set multiple attributes like text value,
part-of-speech tag or boolean flags. The token-based view lets you explore how
spaCy processes your text – and why your pattern matches, or why it doesn't.

</Infobox>

#### Extended pattern syntax and attributes {#adding-patterns-attributes-extended new="2.1"}

Instead of mapping to a single value, token patterns can also map to a
**dictionary of properties**. For example, you can specify that the value of a
lemma should be part of a list of values, or set a minimum character length. The
following rich comparison attributes are available:

> #### Example
>
> ```python
> # Matches "love cats" or "likes flowers"
> pattern1 = [{"LEMMA": {"IN": ["like", "love"]}},
>             {"POS": "NOUN"}]
>
> # Matches tokens of length >= 10
> pattern2 = [{"LENGTH": {">=": 10}}]
> ```

| Attribute                  | Value Type | Description                                                                       |
| -------------------------- | ---------- | --------------------------------------------------------------------------------- |
| `IN`                       | any        | Attribute value is member of a list.                                              |
| `NOT_IN`                   | any        | Attribute value is _not_ member of a list.                                        |
| `==`, `>=`, `<=`, `>`, `<` | int, float | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. |
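
The following sketch runs two of these operators end to end. To stay
independent of a trained model, it uses the lexical attributes `LOWER` and
`LENGTH` on a blank English pipeline rather than `LEMMA` or `POS`. The rule
name `LIKE_LONG_WORD` and the example sentence are made up, and the snippet
assumes spaCy v2.2.2 or later, where `Matcher.add` accepts a list of patterns.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # LOWER and LENGTH need no trained components
matcher = Matcher(nlp.vocab)

# "love" or "like" in any casing, followed by a token of 10+ characters
pattern = [{"LOWER": {"IN": ["love", "like"]}}, {"LENGTH": {">=": 10}}]
matcher.add("LIKE_LONG_WORD", [pattern])  # hypothetical rule name

doc = nlp("I love programming and like cats")
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)  # ['love programming']
```

Only "love programming" matches here, because "cats" is shorter than ten
characters.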

#### Regular expressions {#regex new="2.1"}

In some cases, only matching tokens and token attributes isn't enough – for
example, you might want to match different spellings of a word, without having
to add a new pattern for each spelling.

```python
pattern = [{"TEXT": {"REGEX": "^[Uu](\\.?|nited)$"}},
           {"TEXT": {"REGEX": "^[Ss](\\.?|tates)$"}},
           {"LOWER": "president"}]
```
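
To get a feel for which token texts these two expressions accept, you can try
them directly with Python's `re` module. This is only a sketch: it assumes the
matcher applies each expression to the attribute value with search semantics,
which is why the expressions are anchored with `^` and `$`. The candidate
strings are made up for illustration.

```python
import re

# Same expressions as in the pattern above, written as raw strings
first = re.compile(r"^[Uu](\.?|nited)$")
second = re.compile(r"^[Ss](\.?|tates)$")

# Hypothetical token texts to test against each expression
candidates = ["United", "U.", "u", "Unit", "States", "S.", "s", "State"]

first_hits = [text for text in candidates if first.search(text)]
second_hits = [text for text in candidates if second.search(text)]

print(first_hits)   # ['United', 'U.', 'u']
print(second_hits)  # ['States', 'S.', 's']
```

Partial spellings such as "Unit" or "State" are rejected because each
expression has to cover the whole token text.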

The `REGEX` operator allows defining rules for any attribute string value,
including custom attributes. It always needs to be applied to an attribute like
`TEXT`, `LOWER` or `TAG`:

 								```python
-												Improve regex matching docs [ci skip]

											
										
										
											2019-08-19 14:59:41 +03:00
+								# Match different spellings of token texts
 								pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								# Match tokens with fine-grained POS tags starting with 'V'
 								pattern = [{"TAG": {"REGEX": "^V"}}]
 								# Match custom attribute values with regular expressions
pattern = [{"_": {"country": {"REGEX": "^[Uu](nited|\\.?) ?[Ss](tates|\\.?)$"}}}]
```
<Infobox title="Important note" variant="warning">

When using the `REGEX` operator, keep in mind that it operates on **single
tokens**, not the whole text. Each expression you provide will be matched on a
token. If you need to match on the whole text instead, see the details on
[regex matching on the whole text](#regex-text).

</Infobox>
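
To make the single-token restriction concrete, here is a plain-Python sketch that does not use spaCy at all. It assumes a whitespace split as a stand-in for the tokenizer and applies the country expression from above first to each token on its own, then to the whole text.

```python
import re

expression = re.compile(r"^[Uu](nited|\.?) ?[Ss](tates|\.?)$")

# Assumption: a whitespace split stands in for spaCy's tokenizer here
text = "United States"
tokens = text.split()

# Token level: each token is tested on its own, so neither one matches
token_hits = [token for token in tokens if expression.search(token)]

# Whole text: the expression sees both words at once and matches
text_hit = expression.search(text)

print(token_hits)
print(text_hit is not None)
```

Neither `"United"` nor `"States"` satisfies the expression by itself, which is why a multi-word expression like this one needs to run on the full text.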

##### Matching regular expressions on the full text {#regex-text}
If your expressions apply to multiple tokens, a simple solution is to match on
the `doc.text` with `re.finditer` and use the
[`Doc.char_span`](/api/doc#char_span) method to create a `Span` from the
character indices of the match. If the matched characters don't map to one or
more valid tokens, `Doc.char_span` returns `None`.
> #### What's a valid token sequence?
>
> In the example, the expression will also match `"US"` in `"USA"`. However,
> `"USA"` is a single token and `Span` objects are **sequences of tokens**. So
> `"US"` cannot be its own span, because it does not end on a token boundary.
```python
### {executable="true"}
import spacy
import re
nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

# In a raw string, a single backslash is enough to escape the period
expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)
```
<Accordion title="How can I expand the match to a valid token sequence?">

In some cases, you might want to expand the match to the closest token
boundaries, so you can create a `Span` for `"USA"`, even though only the
substring `"US"` is matched. You can calculate this using the character offsets
of the tokens in the document, available as
[`Token.idx`](/api/token#attributes). This lets you create a list of valid token
start and end boundaries and leaves you with a rather basic algorithmic problem:
Given a number, find the next lowest (start token) or the next highest (end
token) number that's part of a given list of numbers. This will be the closest
valid token boundary.

There are many ways to do this and the most straightforward one is to create a
dict keyed by characters in the `Doc`, mapped to the token they're part of. It's
easy to write and less error-prone, and gives you a constant lookup time: you
only ever need to create the dict once per `Doc`.

```python
chars_to_tokens = {}
for token in doc:
    for i in range(token.idx, token.idx + len(token.text)):
        chars_to_tokens[i] = token.i
```

You can then look up the character at a given position and get the index of the
token that the character is part of. Your span would then be
`doc[start_token:end_token + 1]`. If a character isn't in the dict, it means
it's the whitespace that tokens are split on. That hopefully shouldn't happen,
though, because it'd mean your regex is producing matches with leading or
trailing whitespace.

```python
### {highlight="5-8"}
span = doc.char_span(start, end)
if span is not None:
    print("Found match:", span.text)
else:
    start_token = chars_to_tokens.get(start)
    end_token = chars_to_tokens.get(end - 1)  # end is exclusive: use last matched character
    if start_token is not None and end_token is not None:
        span = doc[start_token:end_token + 1]
        print("Found closest match:", span.text)
```
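
The "next lowest / next highest number" lookup described above can also be sketched with the standard library's `bisect` module instead of a per-character dict. This is only an illustration: the `expand_to_token_boundaries` helper and the hard-coded offsets are made up for this example. In practice, the two lists could be built from `token.idx` and `token.idx + len(token)`.

```python
from bisect import bisect_left, bisect_right


def expand_to_token_boundaries(token_starts, token_ends, start, end):
    """Expand a character range to the closest enclosing token boundaries.

    Hypothetical helper: both lists hold sorted character offsets, with
    token_ends being exclusive. Returns (start, end) or None.
    """
    # Next lowest token start: the rightmost start that is <= `start`
    i = bisect_right(token_starts, start) - 1
    # Next highest token end: the leftmost end that is >= `end`
    j = bisect_left(token_ends, end)
    if i < 0 or j >= len(token_ends):
        return None
    return token_starts[i], token_ends[j]


# Assumed offsets for the tokens "(", "USA", ")", "or", "US"
text = "(USA) or US"
token_starts = [0, 1, 4, 6, 9]
token_ends = [1, 4, 5, 8, 11]

# The regex matched only "US" inside "USA", i.e. characters 1 to 3
new_start, new_end = expand_to_token_boundaries(token_starts, token_ends, 1, 3)
print(text[new_start:new_end])
```

Because the expanded offsets fall on token boundaries, they could then be passed to `Doc.char_span`. Like the dict approach, this sketch assumes the match doesn't start or end in whitespace.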

</Accordion>

---
#### Operators and quantifiers {#quantifiers}

The matcher also lets you use quantifiers, specified as the `'OP'` key.
Quantifiers let you define sequences of tokens to be matched, e.g. one or more
punctuation marks, or specify optional tokens. Note that there are no nested or
scoped quantifiers – instead, you can build those behaviors with `on_match`
callbacks.

| OP  | Description                                                      |
| --- | ---------------------------------------------------------------- |
| `!` | Negate the pattern, by requiring it to match exactly 0 times.    |
| `?` | Make the pattern optional, by allowing it to match 0 or 1 times. |
| `+` | Require the pattern to match 1 or more times.                    |
| `*` | Allow the pattern to match zero or more times.                   |

> #### Example
>
> ```python
> pattern = [{"LOWER": "hello"},
>            {"IS_PUNCT": True, "OP": "?"}]
> ```

<Infobox title="Note on operator behaviour" variant="warning">

In versions before v2.1.0, the `+` and `*` operators behaved inconsistently.
They were usually interpreted "greedily", i.e. longer matches were returned
where possible. However, if you specified two `+` and `*` patterns in a row and
their matches overlapped, the first operator behaved non-greedily. This quirk in
the semantics was corrected in spaCy v2.1.0.

</Infobox>
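
As a rough illustration of the table above, the following plain-Python sketch mimics how `?`, `+` and `*` change the number of tokens a single pattern entry may consume. It is a toy model, not spaCy's implementation. The `match_at` helper, the `(predicate, op)` pattern format and the `"1"` marker for "exactly once" are all made up for this example. It always prefers the longest repetition, whereas the real `Matcher` can report several overlapping matches. The `!` operator is left out.

```python
def match_at(tokens, pattern, pos=0):
    """Return the end index of a greedy match starting at `pos`, or None.

    Hypothetical helper: `pattern` is a list of (predicate, op) pairs, where
    op is "1" (exactly once), "?", "+" or "*".
    """
    if not pattern:
        return pos
    (predicate, op), rest = pattern[0], pattern[1:]
    # Count how many consecutive tokens satisfy the predicate from `pos`
    run = 0
    while pos + run < len(tokens) and predicate(tokens[pos + run]):
        run += 1
    low, high = {"1": (1, 1), "?": (0, 1), "+": (1, run), "*": (0, run)}[op]
    # Try the longest allowed repetition first, then back off
    for count in range(min(high, run), low - 1, -1):
        end = match_at(tokens, rest, pos + count)
        if end is not None:
            return end
    return None


def is_hello(token):
    return token.lower() == "hello"


def is_punct(token):
    return not any(char.isalnum() for char in token)


tokens = ["Hello", "!", "!", "world"]
optional = match_at(tokens, [(is_hello, "1"), (is_punct, "?")])
one_or_more = match_at(tokens, [(is_hello, "1"), (is_punct, "+")])
print(tokens[:optional])     # ['Hello', '!']
print(tokens[:one_or_more])  # ['Hello', '!', '!']
```

With `?`, at most one punctuation token is consumed. With `+`, the whole run of punctuation is consumed, and the match fails if there is none.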

#### Using wildcard token patterns {#adding-patterns-wildcard new="2"}

While the token attributes offer many options to write highly specific patterns,
you can also use an empty dictionary, `{}`, as a wildcard representing **any
token**. This is useful if you know the context of what you're trying to match,
but very little about the specific token and its characters. For example, let's
say you're trying to extract people's user names from your data. All you know is
that they are listed as "User name: {username}". The name itself may contain any
character, but no whitespace – so you'll know it will be handled as one token.

```python
[{"ORTH": "User"}, {"ORTH": "name"}, {"ORTH": ":"}, {}]
```
#### Validating and debugging patterns {#pattern-validation new="2.1"}

The `Matcher` can validate patterns against a JSON schema with the option
`validate=True`. This is useful for debugging patterns during development, in
particular for catching unsupported attributes.

```python
### {executable="true"}
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab, validate=True)
# Add match ID "HelloWorld" with unsupported attribute CASEINSENSITIVE
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"CASEINSENSITIVE": "world"}]
matcher.add("HelloWorld", None, pattern)
# 🚨 Raises an error:
# MatchPatternError: Invalid token patterns for matcher rule 'HelloWorld'
# Pattern 0:
# - Additional properties are not allowed ('CASEINSENSITIVE' was unexpected) [2]
```
### Adding on_match rules {#on_match}

To move on to a more realistic example, let's say you're working with a large
corpus of blog articles, and you want to match all mentions of "Google I/O"
(which spaCy tokenizes as `['Google', 'I', '/', 'O']`). To be safe, you only
match on the uppercase versions, so that text written as "Google i/o" is not
matched.

```python
### {executable="true"}
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.tokens import Span
nlp = English()
matcher = Matcher(nlp.vocab)

def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)
+								pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
 								matcher.add("GoogleIO", add_event_ent, pattern)
 								doc = nlp("This is a text about Google I/O")
 								matches = matcher(doc)
 								```
 								A very similar logic has been implemented in the built-in
 								[`EntityRuler`](/api/entityruler) by the way. It also takes care of handling
 								overlapping matches, which you would otherwise have to take care of yourself.
 								> #### Tip: Visualizing matches
 								>
 								> When working with entities, you can use [displaCy](/api/top-level#displacy) to
 								> quickly generate a NER visualization from your updated `Doc`, which can be
 								> exported as an HTML file:
 								>
 								> ```python
 								> from spacy import displacy
 								> html = displacy.render(doc, style="ent", page=True,
 								>                        options={"ents": ["EVENT"]})
 								> ```
 								>
 								> For more info and examples, see the usage guide on
 								> [visualizing spaCy](/usage/visualizers).
 								We can now call the matcher on our documents. The patterns will be matched in
 								the order they occur in the text. The matcher will then iterate over the
 								matches, look up the callback for the match ID that was matched, and invoke it.
 								```python
 								doc = nlp(YOUR_TEXT_HERE)
 								matcher(doc)
 								```
 								When the callback is invoked, it is passed four arguments: the matcher itself,
 								the document, the position of the current match, and the total list of matches.
 								This allows you to write callbacks that consider the entire set of matched
 								phrases, so that you can resolve overlaps and other conflicts in whatever way
 								you prefer.
 								| Argument  | Type      | Description                                                                                                          |
 								| --------- | --------- | -------------------------------------------------------------------------------------------------------------------- |
 								| `matcher` | `Matcher` | The matcher instance.                                                                                                |
 								| `doc`     | `Doc`     | The document the matcher was used on.                                                                                |
 								| `i`       | int       | Index of the current match (`matches[i]`).                                                                           |
 								| `matches` | list      | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`.  |
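Because a callback receives the full `matches` list, one way to resolve overlaps is to filter the `(match_id, start, end)` tuples before acting on them. The following is a minimal sketch in plain Python, not part of spaCy: the helper name `keep_longest` and the "prefer the longest span" policy are assumptions chosen for illustration, and the match tuples are hand-written rather than produced by a matcher.

```python
def keep_longest(matches):
    """Return non-overlapping (match_id, start, end) tuples, preferring longer spans."""
    # Visit longer spans first; ties are broken by the earlier start index
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    taken = set()  # Token indices that are already part of a kept span
    result = []
    for match_id, start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            result.append((match_id, start, end))
            taken.update(range(start, end))
    # Return the kept matches in document order
    return sorted(result, key=lambda m: m[1])

# Hypothetical matches: the first two overlap on token 1
matches = [(1, 0, 2), (2, 1, 4), (3, 5, 6)]
filtered = keep_longest(matches)
print(filtered)
```

Other policies, such as keeping the first match or ranking by match ID, fit the same structure by changing the sort key.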
 								### Using custom pipeline components {#matcher-pipeline}
 								Let's say your data also contains some annoying pre-processing artifacts, like
 								leftover HTML line breaks (e.g. `<br>` or `<BR/>`). To make your text easier to
 								analyze, you want to merge those into one token and flag them, to make sure you
 								can ignore them later. Ideally, this should all be done automatically as you
 								process the text. You can achieve this by adding a
 								[custom pipeline component](/usage/processing-pipelines#custom-components)
 								that's called on each `Doc` object, merges the leftover HTML spans and sets an
 								attribute `bad_html` on the token.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.matcher import Matcher
 								from spacy.tokens import Token
 								# We're using a class because the component needs to be initialised with
 								# the shared vocab via the nlp object
 								class BadHTMLMerger(object):
 								    def __init__(self, nlp):
 								        # Register a new token extension to flag bad HTML
 								        Token.set_extension("bad_html", default=False)
 								        self.matcher = Matcher(nlp.vocab)
 								        self.matcher.add(
 								            "BAD_HTML",
 								            None,
 								            [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
 								            [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
 								        )
 								    def __call__(self, doc):
 								        # This method is invoked when the component is called on a Doc
 								        matches = self.matcher(doc)
 								        spans = []  # Collect the matched spans here
 								        for match_id, start, end in matches:
 								            spans.append(doc[start:end])
 								        with doc.retokenize() as retokenizer:
 								            for span in spans:
 								                retokenizer.merge(span)
 								                for token in span:
 								                    token._.bad_html = True  # Mark token as bad HTML
 								        return doc
 								nlp = spacy.load("en_core_web_sm")
 								html_merger = BadHTMLMerger(nlp)
 								nlp.add_pipe(html_merger, last=True)  # Add component to the pipeline
 								doc = nlp("Hello<br>world! <br/> This is a test.")
 								for token in doc:
 								    print(token.text, token._.bad_html)
 								```
 								Instead of hard-coding the patterns into the component, you could also make it
 								take a path to a JSON file containing the patterns. This lets you reuse the
 								component with different patterns, depending on your application:
 								```python
 								html_merger = BadHTMLMerger(nlp, path="/path/to/patterns.json")
 								```
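How the component reads that file is up to you. As a rough sketch, assuming the JSON file contains a list of token patterns (each pattern being a list of dictionaries), a loader could look like this. The helper name `load_patterns` and the file layout are assumptions for illustration, not something spaCy prescribes.

```python
import json
import tempfile
from pathlib import Path

def load_patterns(path):
    """Read a JSON file holding a list of token patterns (each a list of dicts)."""
    with Path(path).open(encoding="utf8") as f:
        patterns = json.load(f)
    # Basic sanity check so malformed files fail early
    for pattern in patterns:
        if not isinstance(pattern, list) or not all(isinstance(t, dict) for t in pattern):
            raise ValueError(f"Invalid pattern: {pattern!r}")
    return patterns

# Write an example file so the sketch runs on its own
example = [
    [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
    [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "patterns.json"
    path.write_text(json.dumps(example), encoding="utf8")
    patterns = load_patterns(path)

print(len(patterns))
```

Inside the component's `__init__`, the loaded list could then be unpacked into the same call used in the class above, for example `self.matcher.add("BAD_HTML", None, *load_patterns(path))`.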
 								<Infobox title="📖 Processing pipelines">
 								For more details and examples of how to **create custom pipeline components**
 								and **extension attributes**, see the
 								[usage guide](/usage/processing-pipelines).
 								</Infobox>
 								### Example: Using linguistic annotations {#example1}
 								Let's say you're analyzing user comments and you want to find out what people
 								are saying about Facebook. You want to start off by finding adjectives following
 								"Facebook is" or "Facebook was". This is obviously a very rudimentary solution,
 								but it'll be fast, and a great way to get an idea for what's in your data. Your
 								pattern could look like this:
 								```python
 								[{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]
 								```
 								This translates to a token whose lowercase form matches "facebook" (like
 								Facebook, facebook or FACEBOOK), followed by a token with the lemma "be" (for
 								example, is, was, or 's), followed by an **optional** adverb, followed by an
 								adjective. Using the linguistic annotations here is especially useful, because
 								you can tell spaCy to match "Facebook's annoying", but **not** "Facebook's
 								annoying ads". The optional adverb makes sure you won't miss adjectives with
 								intensifiers, like "pretty awful" or "very nice".
 								To get a quick overview of the results, you could collect all sentences
 								containing a match and render them with the
 								[displaCy visualizer](/usage/visualizers). In the callback function, you'll have
 								access to the `start` and `end` of each match, as well as the parent `Doc`. This
 								lets you determine the sentence containing the match, `doc[start:end].sent`,
 								and calculate the start and end of the matched span within the sentence. Using
 								displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
 								list of dictionaries containing the text and entities to render.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy import displacy
 								from spacy.matcher import Matcher
 								nlp = spacy.load("en_core_web_sm")
 								matcher = Matcher(nlp.vocab)
 								matched_sents = []  # Collect data of matched sentences to be visualized
 								def collect_sents(matcher, doc, i, matches):
 								    match_id, start, end = matches[i]
 								    span = doc[start:end]  # Matched span
 								    sent = span.sent  # Sentence containing matched span
 								    # Append mock entity for match in displaCy style to matched_sents
 								    # Get the match span by offsetting the start and end of the span with the
 								    # start and end of the sentence in the doc
 								    match_ents = [{
 								        "start": span.start_char - sent.start_char,
 								        "end": span.end_char - sent.start_char,
 								        "label": "MATCH",
 								    }]
 								    matched_sents.append({"text": sent.text, "ents": match_ents})
 								pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"},
 								           {"POS": "ADJ"}]
 								matcher.add("FacebookIs", collect_sents, pattern)  # add pattern
 								doc = nlp("I'd say that Facebook is evil. – Facebook is pretty cool, right?")
 								matches = matcher(doc)
 								# Serve visualization of sentences containing match with displaCy
 								# set manual=True to make displaCy render straight from a dictionary
 								# (if you're not running the code within a Jupyter environment, you can
 								# use displacy.serve instead)
 								displacy.render(matched_sents, style="ent", manual=True)
 								```
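The offset arithmetic in `collect_sents` can be checked without a model. The sketch below uses plain strings and assumes, purely for illustration, that the second sentence is one sentence and that the matched span is "Facebook is pretty cool". The actual sentence boundaries and matches depend on the model's predictions.

```python
doc_text = "I'd say that Facebook is evil. – Facebook is pretty cool, right?"
sent_text = "Facebook is pretty cool, right?"  # Assumed sentence boundary
span_text = "Facebook is pretty cool"  # Assumed matched span

# Character offsets relative to the whole document
sent_start_char = doc_text.index(sent_text)
span_start_char = doc_text.index(span_text, sent_start_char)
span_end_char = span_start_char + len(span_text)

# displaCy's manual mode expects offsets relative to the rendered text,
# so subtract the sentence start from both ends of the span
entry = {
    "text": sent_text,
    "ents": [{
        "start": span_start_char - sent_start_char,
        "end": span_end_char - sent_start_char,
        "label": "MATCH",
    }],
}
ent = entry["ents"][0]
print(sent_text[ent["start"]:ent["end"]])
```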
 								### Example: Phone numbers {#example2}
 								Phone numbers can have many different formats and matching them is often tricky.
 								During tokenization, spaCy will leave sequences of numbers intact and only split
 								on whitespace and punctuation. This means that your match pattern will have to
 								look out for number sequences of a certain length, surrounded by specific
 								punctuation – depending on the
 								[national conventions](https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers).
 								The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us
 								anything about the length. However, you can use the `SHAPE` flag, with each `d`
 								representing a digit (up to 4 digits / characters):
 								```python
 								[{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
 								 {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]
 								```
 								This will match phone numbers of the format **(123) 4567 8901** or **(123)
 								4567-8901**. To also match formats like **(123) 456 789**, you can add a second
 								pattern using `'ddd'` in place of `'dddd'`. By hard-coding some values, you can
 								match only certain, country-specific numbers. For example, here's a pattern to
 								match the most common formats of
 								[international German numbers](https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers#Germany):
 								```python
 								[{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
 								 {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
 								```
 								Depending on the formats your application needs to match, creating an extensive
 								set of rules like this is often better than training a model. It'll produce more
 								predictable results, is much easier to modify and extend, and doesn't require
 								any training data – only a set of test cases.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.matcher import Matcher
 								nlp = spacy.load("en_core_web_sm")
 								matcher = Matcher(nlp.vocab)
 								pattern = [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"},
 								           {"ORTH": "-", "OP": "?"}, {"SHAPE": "ddd"}]
 								matcher.add("PHONE_NUMBER", None, pattern)
 								doc = nlp("Call me at (123) 456 789 or (123) 456 789!")
 								print([t.text for t in doc])
 								matches = matcher(doc)
 								for match_id, start, end in matches:
 								    span = doc[start:end]
 								    print(span.text)
 								```
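To see why the German pattern above pairs `"SHAPE": "dddd"` with `"LENGTH": 6`, it helps to look at how a shape string is built. The function below is a simplified approximation written for this guide, not spaCy's own implementation: it maps digits to `d`, letters to `x` or `X`, and stops extending a run of the same character after four.

```python
def approx_shape(text):
    """Approximate a token shape: d for digits, x/X for letters, runs capped at 4."""
    shape = []
    last = ""
    run = 0  # Length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)

for token in ["123", "4567", "123456"]:
    print(token, approx_shape(token), len(token))
```

Under this approximation, a four-digit and a six-digit token share the shape `dddd`, so the shape alone cannot tell them apart and a length check is needed as well.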
 								### Example: Hashtags and emoji on social media {#example3}
 								Social media posts, especially tweets, can be difficult to work with. They're
 								very short and often contain various emoji and hashtags. By only looking at the
 								plain text, you'll lose a lot of valuable semantic information.
 								Let's say you've extracted a large sample of social media posts on a specific
 								topic, for example posts mentioning a brand name or product. As the first step
 								of your data exploration, you want to filter out posts containing certain emoji
 								and use them to assign a general sentiment score, based on whether the expressed
 								emotion is positive or negative, e.g. 😀 or 😞. You also want to find, merge and
 								label hashtags like `#MondayMotivation`, to be able to ignore or analyze them
 								later.
 								> #### Note on sentiment analysis
 								>
 								> Ultimately, sentiment analysis is not always _that_ easy. In addition to the
 								> emoji, you'll also want to take specific words into account and check the
 								> `subtree` for intensifiers like "very", to increase the sentiment score. At
 								> some point, you might also want to train a sentiment model. However, the
 								> approach described in this example is very useful for **bootstrapping rules to
 								> collect training data**. It's also an incredibly fast way to gather first
 								> insights into your data – with about 1 million tweets, you'd be looking at a
 								> processing time of **under 1 minute**.
 								By default, spaCy's tokenizer will split emoji into separate tokens. This means
 								that you can create a pattern for one or more emoji tokens. Valid hashtags
 								usually consist of a `#`, plus a sequence of ASCII characters with no
 								whitespace, making them easy to match as well.
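The hashtag rule can be illustrated on a plain list of token strings. This is only a sketch: the token list is written by hand and assumes the `#` has been split off as its own token, and `find_hashtags` is a hypothetical helper rather than a spaCy function. It requires Python 3.7+ for `str.isascii`.

```python
def find_hashtags(tokens):
    """Return (start, end) token index pairs for '#' followed by an ASCII token."""
    spans = []
    for i in range(len(tokens) - 1):
        if tokens[i] == "#" and tokens[i + 1].isascii():
            spans.append((i, i + 2))
    return spans

# Hand-written tokens, assumed to resemble the tokenizer's output
tokens = ["Hello", "world", "😀", "#", "MondayMotivation"]
hashtags = find_hashtags(tokens)
print(hashtags)
```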
 								```python
 								### {executable="true"}
 								from spacy.lang.en import English
 								from spacy.matcher import Matcher
 								nlp = English()  # We only want the tokenizer, so no need to load a model
 								matcher = Matcher(nlp.vocab)
 								pos_emoji = ["😀", "😃", "😂", "🤣", "😊", "😍"]  # Positive emoji
 								neg_emoji = ["😞", "😠", "😩", "😢", "😭", "😒"]  # Negative emoji
 								# Add patterns to match one or more emoji tokens
 								pos_patterns = [[{"ORTH": emoji}] for emoji in pos_emoji]
 								neg_patterns = [[{"ORTH": emoji}] for emoji in neg_emoji]
 								# Function to label the sentiment
 								def label_sentiment(matcher, doc, i, matches):
 								    match_id, start, end = matches[i]
 								    if doc.vocab.strings[match_id] == "HAPPY":  # Don't forget to get string!
 								        doc.sentiment += 0.1  # Add 0.1 for positive sentiment
 								    elif doc.vocab.strings[match_id] == "SAD":
 								        doc.sentiment -= 0.1  # Subtract 0.1 for negative sentiment
 								matcher.add("HAPPY", label_sentiment, *pos_patterns)  # Add positive pattern
 								matcher.add("SAD", label_sentiment, *neg_patterns)  # Add negative pattern
 								# Add pattern for valid hashtag, i.e. '#' plus any ASCII token
 								matcher.add("HASHTAG", None, [{"ORTH": "#"}, {"IS_ASCII": True}])
 								doc = nlp("Hello world 😀 #MondayMotivation")
 								matches = matcher(doc)
 								for match_id, start, end in matches:
 								    string_id = doc.vocab.strings[match_id]  # Look up string ID
 								    span = doc[start:end]
 								    print(string_id, span.text)
 								```
 								Because the `on_match` callback receives the ID of each match, you can use the
 								same function to handle the sentiment assignment for both the positive and
 								negative pattern. To keep it simple, we'll either add or subtract `0.1` points –
 								this way, the score will also reflect combinations of emoji, even positive _and_
 								negative ones.
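As a minimal sketch of how the score accumulates, the snippet below reuses the callback idea on a text with two positive emoji and one negative one. It assumes a blank English pipeline created with `spacy.blank` and the list-based `Matcher.add` signature (patterns as the second argument, `on_match` as a keyword), which is available in spaCy v2.2.2 and later. The emoji lists are shortened for brevity.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # Tokenizer only, no trained model required
matcher = Matcher(nlp.vocab)

def label_sentiment(matcher, doc, i, matches):
    # Called once per match; adjust the score based on the rule name
    match_id, start, end = matches[i]
    rule = doc.vocab.strings[match_id]
    if rule == "HAPPY":
        doc.sentiment += 0.1
    elif rule == "SAD":
        doc.sentiment -= 0.1

# Assumption: list-based add() signature (spaCy v2.2.2+)
matcher.add("HAPPY", [[{"ORTH": "😀"}], [{"ORTH": "😍"}]], on_match=label_sentiment)
matcher.add("SAD", [[{"ORTH": "😞"}]], on_match=label_sentiment)

doc = nlp("Great start 😀 😍 but the ending 😞")
matches = matcher(doc)
print(len(matches), round(doc.sentiment, 2))
```

Each match triggers the callback once, so the two positive matches and the one negative match partially cancel out in the final score.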
 								With a library like [Emojipedia](https://github.com/bcongdon/python-emojipedia),
 								we can also retrieve a short description for each emoji – for example, 😍's
 								official title is "Smiling Face With Heart-Eyes". Assigning it to a
 								[custom attribute](/usage/processing-pipelines#custom-components-attributes) on
 								the emoji span will make it available as `span._.emoji_desc`.
 								```python
 								from emojipedia import Emojipedia  # Installation: pip install emojipedia
 								from spacy.tokens import Span  # Get the global Span object
 								Span.set_extension("emoji_desc", default=None)  # Register the custom attribute
 								def label_sentiment(matcher, doc, i, matches):
 								    match_id, start, end = matches[i]
 								    if doc.vocab.strings[match_id] == "HAPPY":  # Don't forget to get string!
 								        doc.sentiment += 0.1  # Add 0.1 for positive sentiment
 								    elif doc.vocab.strings[match_id] == "SAD":
 								        doc.sentiment -= 0.1  # Subtract 0.1 for negative sentiment
 								    span = doc[start:end]
 								    emoji = Emojipedia.search(span[0].text)  # Get data for emoji
 								    span._.emoji_desc = emoji.title  # Assign emoji description
 								```
 								To label the hashtags, we can use a
 								[custom attribute](/usage/processing-pipelines#custom-components-attributes) set
 								on the respective token:
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.matcher import Matcher
 								from spacy.tokens import Token
 								nlp = spacy.load("en_core_web_sm")
 								matcher = Matcher(nlp.vocab)
 								# Add pattern for valid hashtag, i.e. '#' plus any ASCII token
 								matcher.add("HASHTAG", None, [{"ORTH": "#"}, {"IS_ASCII": True}])
 								# Register token extension
 								Token.set_extension("is_hashtag", default=False)
 								doc = nlp("Hello world 😀 #MondayMotivation")
 								matches = matcher(doc)
 								hashtags = []
 								for match_id, start, end in matches:
 								    if doc.vocab.strings[match_id] == "HASHTAG":
 								        hashtags.append(doc[start:end])
 								with doc.retokenize() as retokenizer:
 								    for span in hashtags:
 								        retokenizer.merge(span)
 								        for token in span:
 								            token._.is_hashtag = True
 								for token in doc:
 								    print(token.text, token._.is_hashtag)
 								```
 								To process a stream of social media posts, we can use
 								[`Language.pipe`](/api/language#pipe), which will return a stream of `Doc`
 								objects that we can pass to [`Matcher.pipe`](/api/matcher#pipe).
 								```python
 								docs = nlp.pipe(LOTS_OF_TWEETS)
 								matches = matcher.pipe(docs)
 								```
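A slightly fuller sketch of the streaming setup follows. `LOTS_OF_TWEETS` is a hypothetical stand-in list, and instead of `Matcher.pipe` the loop calls the matcher on each `Doc` directly, which should give the same matches while keeping each `Doc` at hand. The list-based `Matcher.add` signature assumes spaCy v2.2.2 or later.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # Tokenizer only
matcher = Matcher(nlp.vocab)
# Assumption: list-based add() signature (spaCy v2.2.2+)
matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

# Hypothetical sample data standing in for a real stream of posts
LOTS_OF_TWEETS = [
    "Hello world #MondayMotivation",
    "No tags here",
    "Shipping it #release",
]

hashtags_per_doc = []
for doc in nlp.pipe(LOTS_OF_TWEETS):
    matches = matcher(doc)
    # Collect the text of every hashtag span found in this doc
    hashtags_per_doc.append([doc[start:end].text for _, start, end in matches])

print(hashtags_per_doc)
```

Because `nlp.pipe` yields the `Doc` objects lazily, the texts are processed one after another rather than being held in memory all at once.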
 								## Efficient phrase matching {#phrasematcher}
 								If you need to match large terminology lists, you can also use the
 								[`PhraseMatcher`](/api/phrasematcher) and create [`Doc`](/api/doc) objects
 								instead of token patterns, which is much more efficient overall. The `Doc`
 								patterns can contain single or multiple tokens.
 								### Adding phrase patterns {#adding-phrase-patterns}
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.matcher import PhraseMatcher
 								nlp = spacy.load('en_core_web_sm')
 								matcher = PhraseMatcher(nlp.vocab)
 								terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
 								# Only run nlp.make_doc to speed things up
 								patterns = [nlp.make_doc(text) for text in terms]
 								matcher.add("TerminologyList", None, *patterns)
 								doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
 								          "converse in the Oval Office inside the White House in Washington, D.C.")
 								matches = matcher(doc)
 								for match_id, start, end in matches:
 								    span = doc[start:end]
 								    print(span.text)
 								```
 								Since spaCy is used for processing both the patterns and the text to be matched,
 								you won't have to worry about specific tokenization – for example, you can
 								simply pass in `nlp("Washington, D.C.")` and won't have to write a complex token
 								pattern covering the exact tokenization of the term.
 								<Infobox title="Important note on creating patterns" variant="warning">
 								To create the patterns, each phrase has to be processed with the `nlp` object.
 								If you have a model loaded, doing this in a loop or list comprehension can
 								easily become inefficient and slow. If you **only need the tokenization and
 								lexical attributes**, you can run [`nlp.make_doc`](/api/language#make_doc)
 								instead, which will only run the tokenizer. For an additional speed boost, you
 								can also use the [`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will
 								process the texts as a stream.
 								```diff
 								- patterns = [nlp(term) for term in LOTS_OF_TERMS]
 								+ patterns = [nlp.make_doc(term) for term in LOTS_OF_TERMS]
 								+ patterns = list(nlp.tokenizer.pipe(LOTS_OF_TERMS))
 								```
 								</Infobox>
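As a rough illustration of the streaming variant, the sketch below builds the patterns with `nlp.tokenizer.pipe` and then runs the matcher. `LOTS_OF_TERMS` is a hypothetical short list, the sample sentence is made up for this example, and the list-based `PhraseMatcher.add` call assumes spaCy v2.2.2 or later.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # Tokenizer only
matcher = PhraseMatcher(nlp.vocab)

# Hypothetical terminology list
LOTS_OF_TERMS = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Tokenize all terms as a stream; no other pipeline components are run
patterns = list(nlp.tokenizer.pipe(LOTS_OF_TERMS))
# Assumption: list-based add() signature (spaCy v2.2.2+)
matcher.add("TerminologyList", patterns)

doc = nlp("Angela Merkel met Barack Obama in Washington, D.C. last week.")
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```

The patterns and the text go through the same tokenizer, so the multi-token term `Washington, D.C.` lines up without any hand-written token pattern.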
 								### Matching on other token attributes {#phrasematcher-attrs new="2.1"}
 								By default, the `PhraseMatcher` will match on the verbatim token text, i.e.
 								`Token.text`. By setting the `attr` argument on initialization, you can change
 								**which token attribute the matcher should use** when comparing the phrase
 								pattern to the matched `Doc`. For example, using the attribute `LOWER` lets you
 								match on `Token.lower` and create case-insensitive match patterns:
 								```python
 								### {executable="true"}
 								from spacy.lang.en import English
 								from spacy.matcher import PhraseMatcher
 								nlp = English()
 								matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
 								patterns = [nlp.make_doc(name) for name in ["Angela Merkel", "Barack Obama"]]
 								matcher.add("Names", None, *patterns)
 								doc = nlp("angela merkel and us president barack Obama")
 								for match_id, start, end in matcher(doc):
 								    print("Matched based on lowercase token text:", doc[start:end])
 								```
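To make the effect of `attr` concrete, the following sketch runs the same pattern through two matchers, one with the default attribute and one with `attr="LOWER"`. It assumes a blank English pipeline and the list-based `PhraseMatcher.add` signature from spaCy v2.2.2 or later. The variable names and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # Tokenizer only
patterns = [nlp.make_doc("Angela Merkel")]

verbatim_matcher = PhraseMatcher(nlp.vocab)  # Default: compare verbatim token text
lower_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # Compare lowercase forms
# Assumption: list-based add() signature (spaCy v2.2.2+)
verbatim_matcher.add("Names", patterns)
lower_matcher.add("Names", patterns)

doc = nlp("angela merkel and ANGELA MERKEL spoke first")
verbatim_hits = [doc[s:e].text for _, s, e in verbatim_matcher(doc)]
lower_hits = [doc[s:e].text for _, s, e in lower_matcher(doc)]
print(verbatim_hits)
print(lower_hits)
```

The default matcher finds nothing here because neither spelling in the text equals the pattern verbatim, while the `LOWER` matcher finds both.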
 								<Infobox title="Important note on creating patterns" variant="warning">
 								The examples here use [`nlp.make_doc`](/api/language#make_doc) to create `Doc`
 								object patterns as efficiently as possible and without running any of the other
 								pipeline components. If the token attribute you want to match on is set by a
 								pipeline component, **make sure that the pipeline component runs** when you
 								create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
 								objects need to have part-of-speech tags set by the `tagger`. You can either
 								call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use
 								[`nlp.select_pipes`](/api/language#select_pipes) to disable components
 								selectively.
 								</Infobox>
 								Another possible use case is matching number tokens like IP addresses based on
 								their shape. This means that you won't have to worry about how those strings will
 								be tokenized and you'll be able to find tokens and combinations of tokens based
 								on a few examples. Here, we're matching on the shapes `ddd.d.d.d` and
 								`ddd.ddd.d.d`:
 								```python
 								### {executable="true"}
 								from spacy.lang.en import English
 								from spacy.matcher import PhraseMatcher
 								nlp = English()
 								matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
 								matcher.add("IP", None, nlp("127.0.0.1"), nlp("127.127.0.0"))
 								doc = nlp("Often the router will have an IP address such as 192.168.1.1 or 192.168.2.1.")
 								for match_id, start, end in matcher(doc):
 								    print("Matched based on token shape:", doc[start:end])
 								```
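To make the shape notation concrete, here is a minimal sketch of how a shape string can be derived from a token's text. The helper `token_shape` is a hypothetical name used for illustration and only approximates spaCy's `SHAPE` attribute: digits map to `d`, lowercase letters to `x`, uppercase letters to `X`, other characters are kept, and runs of the same shape character are assumed to be truncated after four.

```python
def token_shape(text):
    """Approximate a spaCy-style shape string (illustrative, not spaCy's code)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char  # punctuation such as "." is kept as-is
        # Track the length of the current run of identical shape characters
        run = run + 1 if mapped == last else 1
        last = mapped
        # Assumption: runs longer than four characters are truncated
        if run <= 4:
            shape.append(mapped)
    return "".join(shape)


print(token_shape("127.0.0.1"))    # ddd.d.d.d
print(token_shape("127.127.0.0"))  # ddd.ddd.d.d
print(token_shape("192.168.1.1"))  # ddd.ddd.d.d
```

Both IP addresses in the example text share the shape `ddd.ddd.d.d` with the second pattern, which is why they match even though their digits differ.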
 								In theory, the same also works for attributes like `POS`. For example, a pattern
 								`nlp("I like cats")` matched based on its part-of-speech tag would return a
 								match for "I love dogs". You could also match on boolean flags like `IS_PUNCT`
 								to match phrases with the same sequence of punctuation and non-punctuation
 								tokens as the pattern. But this can easily get confusing and doesn't have much
 								of an advantage over writing one or two token patterns.
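As a rough sketch of why matching on boolean flags gets confusing quickly, the snippet below compares sequences of a punctuation flag over plain Python lists of strings instead of spaCy tokens. The helpers `is_punct` and `find_flag_matches` are hypothetical and not part of spaCy's API.

```python
import string


def is_punct(token):
    # Simplified flag: every character is ASCII punctuation
    return all(char in string.punctuation for char in token)


def find_flag_matches(tokens, pattern, flag):
    """Return (start, end) windows whose flag sequence equals the pattern's."""
    target = [flag(token) for token in pattern]
    size = len(target)
    return [
        (i, i + size)
        for i in range(len(tokens) - size + 1)
        if [flag(token) for token in tokens[i:i + size]] == target
    ]


tokens = ["Hello", ",", "world", "!", "Bye", "."]
pattern = ["Yes", "!"]  # one non-punctuation token followed by punctuation
print(find_flag_matches(tokens, pattern, is_punct))  # [(0, 2), (2, 4), (4, 6)]
```

The pattern "Yes!" ends up matching every word followed by any punctuation mark, which is rarely what a phrase pattern is meant to express.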
 								## Rule-based entity recognition {#entityruler new="2.1"}
 								The [`EntityRuler`](/api/entityruler) is an exciting new component that lets you
 								add named entities based on pattern dictionaries, and makes it easy to combine
 								rule-based and statistical named entity recognition for even more powerful
 								models.
 								### Entity Patterns {#entityruler-patterns}
 								Entity patterns are dictionaries with two keys: `"label"`, specifying the label
 								to assign to the entity if the pattern is matched, and `"pattern"`, the match
 								pattern. The entity ruler accepts two types of patterns:
 								1. **Phrase patterns** for exact string matches (string).
 								   ```python
 								   {"label": "ORG", "pattern": "Apple"}
 								   ```
 								2. **Token patterns** with one dictionary describing one token (list).
 								   ```python
 								   {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
 								   ```
 								### Using the entity ruler {#entityruler-usage}
 								The [`EntityRuler`](/api/entityruler) is a pipeline component that's typically
 								added via [`nlp.add_pipe`](/api/language#add_pipe). When the `nlp` object is
 								called on a text, it will find matches in the `doc` and add them as entities to
 								the `doc.ents`, using the specified pattern label as the entity label. If any
 								matches were to overlap, the pattern matching most tokens takes priority. If
 								they also happen to be equally long, then the match occurring first in the `Doc` is
 								chosen.
 								```python
 								### {executable="true"}
 								from spacy.lang.en import English
 								from spacy.pipeline import EntityRuler
 								nlp = English()
 								ruler = EntityRuler(nlp)
 								patterns = [{"label": "ORG", "pattern": "Apple"},
 								            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
 								ruler.add_patterns(patterns)
 								nlp.add_pipe(ruler)
 								doc = nlp("Apple is opening its first big office in San Francisco.")
 								print([(ent.text, ent.label_) for ent in doc.ents])
 								```
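The priority rule for overlapping matches described above can be sketched in plain Python. This is an illustration of the stated rule only, not the `EntityRuler`'s actual code: `resolve_overlaps` is a hypothetical helper, and matches are assumed to be `(label, start, end)` tuples with token offsets.

```python
def resolve_overlaps(matches):
    """Pick non-overlapping (label, start, end) matches, longest first."""
    # Longest match first; for equal lengths, the earlier start wins
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    taken = set()  # token indices already covered by a kept match
    kept = []
    for label, start, end in ordered:
        tokens = set(range(start, end))
        if tokens & taken:
            continue  # overlaps a higher-priority match
        kept.append((label, start, end))
        taken |= tokens
    return sorted(kept, key=lambda m: m[1])


matches = [("GPE", 8, 10), ("LOC", 9, 10), ("ORG", 3, 5), ("PRODUCT", 4, 6)]
print(resolve_overlaps(matches))  # [('ORG', 3, 5), ('GPE', 8, 10)]
```

Here the two-token `GPE` match beats the one-token `LOC` match inside it, and `ORG` beats the equally long `PRODUCT` match because it starts earlier.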
 								The entity ruler is designed to integrate with spaCy's existing statistical
 								models and enhance the named entity recognizer. If it's added **before the
 								`"ner"` component**, the entity recognizer will respect the existing entity
 								spans and adjust its predictions around them. This can significantly improve
 								accuracy in some cases. If it's added **after the `"ner"` component**, the
 								entity ruler will only add spans to the `doc.ents` if they don't overlap with
 								existing entities predicted by the model. To overwrite overlapping entities, you
 								can set `overwrite_ents=True` on initialization.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.pipeline import EntityRuler
 								nlp = spacy.load("en_core_web_sm")
 								ruler = EntityRuler(nlp)
 								patterns = [{"label": "ORG", "pattern": "MyCorp Inc."}]
 								ruler.add_patterns(patterns)
 								nlp.add_pipe(ruler)
 								doc = nlp("MyCorp Inc. is a company in the U.S.")
 								print([(ent.text, ent.label_) for ent in doc.ents])
 								```
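To illustrate the difference `overwrite_ents` makes when the ruler runs after the `"ner"` component, here is a simplified sketch in plain Python. The helper `merge_entities`, the tuple representation and the token offsets are all assumptions made for illustration; the sketch also assumes the ruler's own matches don't overlap each other.

```python
def merge_entities(model_ents, ruler_matches, overwrite_ents=False):
    """Combine model entities with ruler matches, both as (label, start, end)."""
    entities = list(model_ents)
    for label, start, end in ruler_matches:
        overlapping = [e for e in entities if e[1] < end and e[2] > start]
        if overlapping and not overwrite_ents:
            continue  # keep the model's prediction, skip the rule match
        # Drop whatever the rule match overlaps, then add the rule match
        entities = [e for e in entities if e not in overlapping]
        entities.append((label, start, end))
    return sorted(entities, key=lambda e: e[1])


# Hypothetical model output: the first token was tagged as a PERSON
model_ents = [("PERSON", 0, 1), ("GPE", 7, 8)]
ruler_matches = [("ORG", 0, 2)]

print(merge_entities(model_ents, ruler_matches))
# [('PERSON', 0, 1), ('GPE', 7, 8)]
print(merge_entities(model_ents, ruler_matches, overwrite_ents=True))
# [('ORG', 0, 2), ('GPE', 7, 8)]
```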
 								#### Validating and debugging EntityRuler patterns {#entityruler-pattern-validation new="2.1.8"}
 								The `EntityRuler` can validate patterns against a JSON schema with the option
 								`validate=True`. See details under
 								[Validating and debugging patterns](#pattern-validation).
 								```python
 								ruler = EntityRuler(nlp, validate=True)
 								```
 								### Adding IDs to patterns {#entityruler-ent-ids new="2.2.2"}
 								The [`EntityRuler`](/api/entityruler) can also accept an `id` attribute for each
 								pattern. Using the `id` attribute allows multiple patterns to be associated with
 								the same entity.
 								```python
 								### {executable="true"}
 								from spacy.lang.en import English
 								from spacy.pipeline import EntityRuler
 								nlp = English()
 								ruler = EntityRuler(nlp)
 								patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"},
 								            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
 								            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}]
 								ruler.add_patterns(patterns)
 								nlp.add_pipe(ruler)
 								doc1 = nlp("Apple is opening its first big office in San Francisco.")
 								print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])
 								doc2 = nlp("Apple is opening its first big office in San Fran.")
 								print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])
 								```
 								If the `id` attribute is included in the [`EntityRuler`](/api/entityruler)
 								patterns, the `ent_id_` property of the matched entity is set to the `id` given
 								in the patterns. So in the example above it's easy to identify that "San
 								Francisco" and "San Fran" are both the same entity.
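One place where shared IDs can help is downstream aggregation. The sketch below counts mentions by ID instead of by surface text. The `mentions` list is hand-written sample data in the `(text, label, ent_id)` form the example prints; it is an assumption for illustration, not captured output.

```python
from collections import Counter

# Assumed (text, label, ent_id) tuples, in the form printed by the example
mentions = [
    ("Apple", "ORG", "apple"),
    ("San Francisco", "GPE", "san-francisco"),
    ("Apple", "ORG", "apple"),
    ("San Fran", "GPE", "san-francisco"),
]

# Count by ID rather than by text, so spelling variants are merged
counts_by_id = Counter(ent_id for _, _, ent_id in mentions)
print(counts_by_id["san-francisco"])  # 2
```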
 								### Using pattern files {#entityruler-files}
 								The [`to_disk`](/api/entityruler#to_disk) and
 								[`from_disk`](/api/entityruler#from_disk) methods let you save and load patterns
 								to and from JSONL (newline-delimited JSON) files, containing one pattern object
 								per line.
 								```json
 								### patterns.jsonl
 								{"label": "ORG", "pattern": "Apple"}
 								{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
 								```
 								```python
 								ruler.to_disk("./patterns.jsonl")
 								new_ruler = EntityRuler(nlp).from_disk("./patterns.jsonl")
 								```
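If you want to generate or inspect such a file outside of spaCy, the JSONL format is easy to handle with the standard library. The following is a minimal sketch: `write_jsonl` and `read_jsonl` are hypothetical helper names, and the file is written to a temporary directory purely for demonstration.

```python
import json
import os
import tempfile

patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]


def write_jsonl(path, entries):
    # One JSON object per line
    with open(path, "w", encoding="utf8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


def read_jsonl(path):
    # Skip blank lines, parse every other line as one JSON object
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "patterns.jsonl")
    write_jsonl(path, patterns)
    loaded = read_jsonl(path)

print(loaded == patterns)  # True
```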
 								<Infobox title="Integration with Prodigy">
 								If you're using the [Prodigy](https://prodi.gy) annotation tool, you might
 								recognize these pattern files from bootstrapping your named entity and text
 								classification labelling. The patterns for the `EntityRuler` follow the same
 								syntax, so you can use your existing Prodigy pattern files in spaCy, and vice
 								versa.
 								</Infobox>
 								When you save out an `nlp` object that has an `EntityRuler` added to its
 								pipeline, its patterns are automatically exported to the model directory:
 								```python
 								nlp = spacy.load("en_core_web_sm")
 								ruler = EntityRuler(nlp)
 								ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
 								nlp.add_pipe(ruler)
 								nlp.to_disk("/path/to/model")
 								```
 								The saved model now includes the `"entity_ruler"` in its `"pipeline"` setting in
 								the `meta.json`, and the model directory contains a file `entityruler.jsonl`
 								with the patterns. When you load the model back in, all pipeline components will
 								be restored and deserialized – including the entity ruler. This lets you ship
 								powerful model packages with binary weights _and_ rules included!
 								### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
 								When using a large number of **phrase patterns** (roughly more than 10,000),
 								it's useful to understand how the `add_patterns` method of the `EntityRuler`
 								works. For each **phrase pattern**, the `EntityRuler` calls the `nlp` object to
 								construct a `Doc` object. This is done in case you add the `EntityRuler` at the
 								end of an existing pipeline with, for example, a POS tagger, and want to extract
 								matches based on the pattern's POS signature. In that case, you would pass the
 								config value `phrase_matcher_attr="POS"` to the `EntityRuler`.
 								Running the full language pipeline on every pattern scales linearly with the
 								number of patterns, so it can take a long time for large pattern lists. As of
 								spaCy 2.2.4, `add_patterns` uses `nlp.pipe` on all phrase patterns, resulting in
 								a speedup of about 10x to 20x for 5,000 to 100,000 phrase patterns,
 								respectively. Even with this speedup (and especially if you're using an older
 								version), `add_patterns` can still take a long time. An easy workaround is to
 								disable the other pipeline components while adding the phrase patterns.
 								```python
 								entityruler = EntityRuler(nlp)
 								patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
 								with nlp.select_pipes(enable="tagger"):
 								    entityruler.add_patterns(patterns)
 								```
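Conceptually, the `with` block above restricts the pipeline to the enabled components and restores the rest on exit. The toy class below sketches that idea in plain Python. `ToyPipeline` and `make_component` are hypothetical and do not reflect spaCy's implementation; a "doc" is just a list recording which components ran.

```python
from contextlib import contextmanager


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; not spaCy's implementation."""

    def __init__(self, components):
        self.components = dict(components)  # name -> callable
        self.disabled = set()

    def __call__(self, doc):
        for name, component in self.components.items():
            if name not in self.disabled:
                doc = component(doc)
        return doc

    @contextmanager
    def select_pipes(self, enable):
        previous = set(self.disabled)
        # Disable everything that is not explicitly enabled
        self.disabled = {name for name in self.components if name not in enable}
        try:
            yield self
        finally:
            self.disabled = previous  # restore the earlier state on exit


def make_component(name):
    # Each toy component only records that it ran
    return lambda doc: doc + [name]


toy_nlp = ToyPipeline((name, make_component(name)) for name in ("tagger", "parser", "ner"))

with toy_nlp.select_pipes(enable=["tagger"]):
    inside = toy_nlp([])
after = toy_nlp([])

print(inside)  # ['tagger']
print(after)   # ['tagger', 'parser', 'ner']
```

With fewer components running per pattern, each of the many pattern texts is cheaper to process, which is where the time savings come from.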
 								## Combining models and rules {#models-rules}
 								You can combine statistical and rule-based components in a variety of ways.
 								Rule-based components can be used to improve the accuracy of statistical models,
 								by presetting tags, entities or sentence boundaries for specific tokens. The
 								statistical models will usually respect these preset annotations, which
 								sometimes improves the accuracy of other decisions. You can also use rule-based
 								components after a statistical model to correct common errors. Finally,
 								rule-based components can reference the attributes set by statistical models, in
 								order to implement more abstract logic.
 								### Example: Expanding named entities {#models-rules-ner}
 								When using a pretrained
 								[named entity recognition](/usage/linguistic-features/#named-entities) model to
 								extract information from your texts, you may find that the predicted span only
 								includes parts of the entity you're looking for. Sometimes, this happens if the
 								statistical model predicts entities incorrectly. Other times, it happens if the
 								way the entity type was defined in the original training corpus doesn't match
 								what you need for your application.
 								> #### Where corpora come from
 								>
 								> Corpora used to train models from scratch are often produced in academia. They
 								> contain text from various sources with linguistic features labeled manually by
 								> human annotators (following a set of specific guidelines). The corpora are
 								> then distributed with evaluation data, so other researchers can benchmark
 								> their algorithms and everyone can report numbers on the same data. However,
 								> most applications need to learn information that isn't contained in any
 								> available corpus.
 								For example, the corpus spaCy's [English models](/models/en) were trained on
 								defines a `PERSON` entity as just the **person name**, without titles like "Mr"
 								or "Dr". This makes sense, because it makes it easier to resolve the entity type
 								back to a knowledge base. But what if your application needs the full names,
 								_including_ the titles?
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
 								print([(ent.text, ent.label_) for ent in doc.ents])
 								```
 								While you could try and teach the model a new definition of the `PERSON` entity
 								by [updating it](/usage/training/#example-train-ner) with more examples of spans
 								that include the title, this might not be the most efficient approach. The
 								existing model was trained on over 2 million words, so in order to completely
 								change the definition of an entity type, you might need a lot of training
 								examples. However, if you already have the predicted `PERSON` entities, you can
 								use a rule-based approach that checks whether they come with a title and if so,
 								expands the entity span by one token. After all, what all titles in this example
 								have in common is that _if_ they occur, they occur in the **previous token**
 								right before the person entity.
 								```python
 								### {highlight="7-11"}
 								from spacy.tokens import Span
 								def expand_person_entities(doc):
 								    new_ents = []
 								    for ent in doc.ents:
 								        # Only check for title if it's a person and not the first token
 								        if ent.label_ == "PERSON" and ent.start != 0:
 								            prev_token = doc[ent.start - 1]
 								            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
 								                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
 								                new_ents.append(new_ent)
 								            else:
 								                new_ents.append(ent)
 								        else:
 								            new_ents.append(ent)
 								    doc.ents = new_ents
 								    return doc
 								```
 								The above function takes a `Doc` object, modifies its `doc.ents` and returns it.
 								This is exactly what a [pipeline component](/usage/processing-pipelines) does,
 								so in order to let it run automatically when processing a text with the `nlp`
 								object, we can use [`nlp.add_pipe`](/api/language#add_pipe) to add it to the
 								current pipeline.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.tokens import Span
 								nlp = spacy.load("en_core_web_sm")
 								def expand_person_entities(doc):
 								    new_ents = []
 								    for ent in doc.ents:
 								        if ent.label_ == "PERSON" and ent.start != 0:
 								            prev_token = doc[ent.start - 1]
 								            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
 								                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
 								                new_ents.append(new_ent)
 								            else:
 								                new_ents.append(ent)
 								        else:
 								            new_ents.append(ent)
 								    doc.ents = new_ents
 								    return doc
 								# Add the component after the named entity recognizer
 								nlp.add_pipe(expand_person_entities, after='ner')
 								doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
 								print([(ent.text, ent.label_) for ent in doc.ents])
 								```
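
To check the component's logic without loading a statistical model, the entities can be set by hand on a manually constructed `Doc`. The following is only a minimal sketch: it assumes spaCy is installed, the example words and the `TITLES` constant are made up for illustration, and the entity spans are assigned manually rather than predicted.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Hypothetical constant, introduced here only to avoid repeating the tuple
TITLES = ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.")

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in TITLES:
                # Title found: extend the span one token to the left
                new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label))
            else:
                # Person without a title: keep the entity unchanged
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Build a Doc from a word list and assign two PERSON entities by hand
doc = Doc(Vocab(), words=["Dr", "Alex", "Smith", "met", "Sam", "Jones"])
doc.ents = [Span(doc, 1, 3, label="PERSON"), Span(doc, 4, 6, label="PERSON")]
doc = expand_person_entities(doc)
result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

The second entity has no title in front of it, so it should come through unchanged rather than being dropped.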
 								An alternative approach would be to use an
 								[extension attribute](/usage/processing-pipelines/#custom-components-attributes)
 								like `._.person_title` and add it to `Span` objects (which includes entity spans
 								in `doc.ents`). The advantage here is that the entity text stays intact and can
 								still be used to look up the name in a knowledge base. The following function
 								takes a `Span` object, checks the previous token if it's a `PERSON` entity and
 								returns the title if one is found. The `Span.doc` attribute gives us easy access
 								to the span's parent document.
 								```python
 								def get_person_title(span):
 								    if span.label_ == "PERSON" and span.start != 0:
 								        prev_token = span.doc[span.start - 1]
 								        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
 								            return prev_token.text
 								```
 								We can now use the [`Span.set_extension`](/api/span#set_extension) method to add
 								the custom extension attribute `"person_title"`, using `get_person_title` as the
 								getter function.
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.tokens import Span
 								nlp = spacy.load("en_core_web_sm")
 								def get_person_title(span):
 								    if span.label_ == "PERSON" and span.start != 0:
 								        prev_token = span.doc[span.start - 1]
 								        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
 								            return prev_token.text
 								# Register the Span extension as 'person_title'
 								Span.set_extension("person_title", getter=get_person_title)
 								doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
 								print([(ent.text, ent.label_, ent._.person_title) for ent in doc.ents])
 								```
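
Because the entity text stays intact, the title can be combined with it only where needed, for example to build a display name. This is a rough sketch under the same assumptions as before: spaCy is installed, the entities are set by hand instead of predicted, and `TITLES` and `display_name` are hypothetical names used for illustration. The `force=True` argument only makes the snippet safe to re-run in the same session.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Hypothetical constant, introduced here only to avoid repeating the tuple
TITLES = ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.")

def get_person_title(span):
    if span.label_ == "PERSON" and span.start != 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in TITLES:
            return prev_token.text
    return None

# force=True overwrites an existing registration if the snippet is re-run
Span.set_extension("person_title", getter=get_person_title, force=True)

doc = Doc(Vocab(), words=["Dr", "Alex", "Smith", "met", "Sam", "Jones"])
doc.ents = [Span(doc, 1, 3, label="PERSON"), Span(doc, 4, 6, label="PERSON")]

rows = []
for ent in doc.ents:
    title = ent._.person_title
    # ent.text is untouched and can still be used for a knowledge base lookup
    display_name = f"{title} {ent.text}" if title else ent.text
    rows.append((ent.text, display_name))
print(rows)
```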
 								### Example: Using entities, part-of-speech tags and the dependency parse {#models-rules-pos-dep}
 								> #### Linguistic features
 								>
 								> This example makes extensive use of part-of-speech tag and dependency
 								> attributes and related `Doc`, `Token` and `Span` methods. For an introduction,
 								> see the guide on [linguistic features](/usage/linguistic-features/). Also
 								> see the [annotation specs](/api/annotation#pos-tagging) for details on the
 								> label schemes.
 								Let's say you want to parse professional biographies and extract the person
 								names and company names, and whether it's a company they're _currently_ working
 								at, or a _previous_ company. One approach could be to try and train a named
 								entity recognizer to predict `CURRENT_ORG` and `PREVIOUS_ORG` – but this
 								distinction is very subtle and something the entity recognizer may struggle to
 								learn. Nothing about "Acme Corp Inc." is inherently "current" or "previous".
 								However, the syntax of the sentence holds some very important clues: we can
 								check for trigger words like "work", whether they're **past tense** or **present
 								tense**, whether company names are attached to them and whether the person is the
 								subject. All of this information is available in the part-of-speech tags and the
 								dependency parse.
 								```python
 								### {executable="true"}
 								import spacy
 								nlp = spacy.load("en_core_web_sm")
 								doc = nlp("Alex Smith worked at Acme Corp Inc.")
 								print([(ent.text, ent.label_) for ent in doc.ents])
 								```
 								> - `nsubj`: Nominal subject.
 								> - `prep`: Preposition.
 								> - `pobj`: Object of preposition.
 								> - `NNP`: Proper noun, singular.
 								> - `VBD`: Verb, past tense.
 								> - `IN`: Conjunction, subordinating or preposition.
 								![Visualization of dependency parse](../images/displacy-model-rules.svg "[`spacy.displacy`](/api/top-level#displacy) visualization with `options={'fine_grained': True}` to output the fine-grained part-of-speech tags, i.e. `Token.tag_`")
 								In this example, "worked" is the root of the sentence and is a past tense verb.
 								Its subject is "Alex Smith", the person who worked. "at Acme Corp Inc." is a
 								prepositional phrase attached to the verb "worked". To extract this
 								relationship, we can start by looking at the predicted `PERSON` entities, find
 								their heads and check whether they're attached to a trigger word like "work".
 								Next, we can check for prepositional phrases attached to the head and whether
 								they contain an `ORG` entity. Finally, to determine whether the company
 								affiliation is current, we can check the head's part-of-speech tag.
 								```python
 								person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
 								for ent in person_entities:
 								    # Because the entity is a span, we need to use its root token. The head
 								    # is the syntactic governor of the person, e.g. the verb
 								    head = ent.root.head
 								    if head.lemma_ == "work":
 								        # Check if the children contain a preposition
 								        preps = [token for token in head.children if token.dep_ == "prep"]
 								        for prep in preps:
 								            # Check if tokens part of ORG entities are in the preposition's
 								            # children, e.g. at -> Acme Corp Inc.
 								            orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
 								            # If the verb is in past tense, the company was a previous company
 								            print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
 								```
 								To apply this logic automatically when we process a text, we can add it to the
 								`nlp` object as a
 								[custom pipeline component](/usage/processing-pipelines/#custom-components). The
 								above logic also expects that entities are merged into single tokens. spaCy
 								ships with a handy built-in `merge_entities` that takes care of that. Instead of
 								just printing the result, you could also write it to
 								[custom attributes](/usage/processing-pipelines#custom-components-attributes) on
 								the entity `Span` – for example `._.orgs` or `._.prev_orgs` and
 								`._.current_orgs`.
 								> #### Merging entities
 								>
 								> Under the hood, entities are merged using the
 								> [`Doc.retokenize`](/api/doc#retokenize) context manager:
 								>
 								> ```python
 								> with doc.retokenize() as retokenizer:
 								>   for ent in doc.ents:
 								>       retokenizer.merge(ent)
 								> ```
 								```python
 								### {executable="true"}
 								import spacy
 								from spacy.pipeline import merge_entities
 								from spacy import displacy
 								nlp = spacy.load("en_core_web_sm")
 								def extract_person_orgs(doc):
 								    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
 								    for ent in person_entities:
 								        head = ent.root.head
 								        if head.lemma_ == "work":
 								            preps = [token for token in head.children if token.dep_ == "prep"]
 								            for prep in preps:
 								                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
 								                print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
 								    return doc
 								# To make the entities easier to work with, we'll merge them into single tokens
 								nlp.add_pipe(merge_entities)
 								nlp.add_pipe(extract_person_orgs)
 								doc = nlp("Alex Smith worked at Acme Corp Inc.")
 								# If you're not in a Jupyter / IPython environment, use displacy.serve
 								displacy.render(doc, options={'fine_grained': True})
 								```
 								If you change the sentence structure above, for example to "was working", you'll
 								notice that our current logic fails and doesn't correctly detect the company as
 								a past organization. That's because the root is a participle and the tense
 								information is in the attached auxiliary "was":
 								![Visualization of dependency parse](../images/displacy-model-rules2.svg)
 								To solve this, we can adjust the rules to also check for the above construction:
 								```python
 								### {highlight="9-11"}
 								def extract_person_orgs(doc):
 								    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
 								    for ent in person_entities:
 								        head = ent.root.head
 								        if head.lemma_ == "work":
 								            preps = [token for token in head.children if token.dep_ == "prep"]
 								            for prep in preps:
 								                orgs = [t for t in prep.children if t.ent_type_ == "ORG"]
 								                aux = [token for token in head.children if token.dep_ == "aux"]
 								                past_aux = any(t.tag_ == "VBD" for t in aux)
 								                past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
 								                print({'person': ent, 'orgs': orgs, 'past': past})
 								    return doc
 								```
 								In your final rule-based system, you may end up with **several different code
 								paths** to cover the types of constructions that occur in your data.
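
One way to keep those code paths manageable is to isolate the tense check in a small helper that works on plain tag strings, so each construction can be tested without running a pipeline. The sketch below is only an illustration: `is_past` is a hypothetical helper, not part of spaCy, and it covers just the two constructions discussed above.

```python
def is_past(head_tag, aux_tags=()):
    """Decide whether a verb construction is past tense, given the
    fine-grained tag of the head verb and the tags of its auxiliaries."""
    # Simple past, e.g. "worked"
    if head_tag == "VBD":
        return True
    # Progressive with a past-tense auxiliary, e.g. "was working"
    if head_tag == "VBG":
        return any(tag == "VBD" for tag in aux_tags)
    return False

examples = {
    "worked": is_past("VBD"),
    "was working": is_past("VBG", ["VBD"]),
    "is working": is_past("VBG", ["VBZ"]),
    "works": is_past("VBZ"),
}
print(examples)
```

Inside `extract_person_orgs`, such a helper could be called with `head.tag_` and the tags of the head's `aux` children.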