---
title: PhraseMatcher
teaser: Match sequences of tokens, based on documents
tag: class
source: spacy/matcher/phrasematcher.pyx
new: 2
---

The `PhraseMatcher` lets you efficiently match large terminology lists. While
the [`Matcher`](/api/matcher) lets you match sequences based on lists of token
descriptions, the `PhraseMatcher` accepts match patterns in the form of `Doc`
objects.

## PhraseMatcher.\_\_init\_\_ {#init tag="method"}

Create the rule-based `PhraseMatcher`. Setting a different `attr` to match on
will change the token attributes that will be compared to determine a match. By
default, the incoming `Doc` is checked for sequences of tokens with the same
`ORTH` value, i.e. the verbatim token text. Matching on the attribute `LOWER`
will result in case-insensitive matching, since only the lowercase token texts
are compared. In theory, it's also possible to match on sequences of the same
part-of-speech tags or dependency labels.
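
The difference between the default `ORTH` matching and `LOWER` matching can be sketched as follows. This is a minimal illustration, not part of the original reference: it assumes a blank English pipeline created with `spacy.blank("en")` as a stand-in for `nlp`, and the variable names are made up for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline stands in for `nlp`; it only tokenizes.
nlp = spacy.blank("en")

# Default behavior: compare the verbatim token text (ORTH).
exact_matcher = PhraseMatcher(nlp.vocab)
exact_matcher.add("OBAMA", None, nlp("Barack Obama"))

# attr="LOWER": compare the lowercase form of each token instead.
lower_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
lower_matcher.add("OBAMA", None, nlp("Barack Obama"))

doc = nlp("BARACK OBAMA visited Berlin")
exact_spans = [doc[start:end].text for _, start, end in exact_matcher(doc)]
lower_spans = [doc[start:end].text for _, start, end in lower_matcher(doc)]
```

Here only `lower_matcher` should find the all-caps mention, since the `ORTH` values of the pattern and the text differ.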

If `validate=True` is set, additional validation is performed when patterns are
added. At the moment, it will check whether a `Doc` has attributes assigned that
aren't necessary to produce the matches (for example, part-of-speech tags if the
`PhraseMatcher` matches on the token text). Since this can often lead to
significantly worse performance when creating the pattern, a `UserWarning` will
be shown.

> #### Example
>
> ```python
> from spacy.matcher import PhraseMatcher
> matcher = PhraseMatcher(nlp.vocab)
> ```

| Name                                    | Type            | Description                                                                                 |
| --------------------------------------- | --------------- | ------------------------------------------------------------------------------------------- |
| `vocab`                                 | `Vocab`         | The vocabulary object, which must be shared with the documents the matcher will operate on. |
| `attr` <Tag variant="new">2.1</Tag>     | int / unicode   | The token attribute to match on. Defaults to `ORTH`, i.e. the verbatim token text.          |
| `validate` <Tag variant="new">2.1</Tag> | bool            | Validate patterns added to the matcher.                                                     |
| **RETURNS**                             | `PhraseMatcher` | The newly constructed object.                                                               |

<Infobox title="Changed in v2.1" variant="warning">

As of v2.1, the `PhraseMatcher` doesn't have a phrase length limit anymore, so
the `max_length` argument is now deprecated.

</Infobox>

## PhraseMatcher.\_\_call\_\_ {#call tag="method"}

Find all token sequences matching the supplied patterns on the `Doc`.

> #### Example
>
> ```python
> from spacy.matcher import PhraseMatcher
>
> matcher = PhraseMatcher(nlp.vocab)
> matcher.add("OBAMA", None, nlp("Barack Obama"))
> doc = nlp("Barack Obama lifts America one last time in emotional farewell")
> matches = matcher(doc)
> ```

| Name        | Type  | Description                                                                                                                                                              |
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc`       | `Doc` | The document to match over.                                                                                                                                              |
| **RETURNS** | list  | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
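
As a rough sketch of how the returned tuples might be consumed: the `match_id` is a hash, so the string name can be looked up in the vocabulary's string store, and `start` and `end` can be used to slice the `Doc`. The blank pipeline and the `results` list are assumptions made for this illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline stands in for `nlp`.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp("Barack Obama"))

doc = nlp("Barack Obama lifts America one last time in emotional farewell")
results = []
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # hash -> string ID
    span = doc[start:end]                    # the matched token sequence
    results.append((rule_name, start, end, span.text))
```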

## PhraseMatcher.pipe {#pipe tag="method"}

Match a stream of documents, yielding them in turn.

> #### Example
>
> ```python
> from spacy.matcher import PhraseMatcher
> matcher = PhraseMatcher(nlp.vocab)
> # pipe() expects Doc objects, so process the raw texts first
> for doc in matcher.pipe(nlp.pipe(texts), batch_size=50):
>     pass
> ```

| Name         | Type     | Description                                               |
| ------------ | -------- | --------------------------------------------------------- |
| `docs`       | iterable | A stream of documents.                                    |
| `batch_size` | int      | The number of documents to accumulate into a working set. |
| **YIELDS**   | `Doc`    | Documents, in order.                                      |

## PhraseMatcher.\_\_len\_\_ {#len tag="method"}

Get the number of rules added to the matcher. Note that this only returns the
number of rules (identical to the number of IDs), not the number of individual
patterns.

> #### Example
>
> ```python
> matcher = PhraseMatcher(nlp.vocab)
> assert len(matcher) == 0
> matcher.add("OBAMA", None, nlp("Barack Obama"))
> assert len(matcher) == 1
> ```

| Name        | Type | Description          |
| ----------- | ---- | -------------------- |
| **RETURNS** | int  | The number of rules. |
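
The distinction between rules and patterns can be illustrated with a short sketch. It assumes a blank English pipeline as `nlp`; the names `one_rule` and `two_rules` are hypothetical and exist only for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline stands in for `nlp`.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Two patterns under a single ID still count as one rule.
matcher.add("HEALTH", None, nlp("health care reform"), nlp("healthcare reform"))
one_rule = len(matcher)

# A second ID adds a second rule.
matcher.add("OBAMA", None, nlp("Barack Obama"))
two_rules = len(matcher)
```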

## PhraseMatcher.\_\_contains\_\_ {#contains tag="method"}

Check whether the matcher contains rules for a match ID.

> #### Example
>
> ```python
> matcher = PhraseMatcher(nlp.vocab)
> assert "OBAMA" not in matcher
> matcher.add("OBAMA", None, nlp("Barack Obama"))
> assert "OBAMA" in matcher
> ```

| Name        | Type    | Description                                           |
| ----------- | ------- | ----------------------------------------------------- |
| `key`       | unicode | The match ID.                                         |
| **RETURNS** | bool    | Whether the matcher contains rules for this match ID. |

## PhraseMatcher.add {#add tag="method"}

Add a rule to the matcher, consisting of an ID key, one or more patterns, and a
callback function to act on the matches. The callback function will receive the
arguments `matcher`, `doc`, `i` and `matches`, where `i` is the index of the
current match in `matches`. If patterns already exist for the given ID, the new
patterns are added to them. An existing `on_match` callback for that ID will be
overwritten.

> #### Example
>
> ```python
> def on_match(matcher, doc, i, matches):
>     print("Matched!", matches)
>
> matcher = PhraseMatcher(nlp.vocab)
> matcher.add("OBAMA", on_match, nlp("Barack Obama"))
> matcher.add("HEALTH", on_match, nlp("health care reform"),
>                                 nlp("healthcare reform"))
> doc = nlp("Barack Obama urges Congress to find courage to defend his healthcare reforms")
> matches = matcher(doc)
> ```

| Name       | Type               | Description                                                                                   |
| ---------- | ------------------ | --------------------------------------------------------------------------------------------- |
| `match_id` | unicode            | An ID for the thing you're matching.                                                          |
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
| `*docs`    | `Doc`              | `Doc` objects of the phrases to match.                                                        |
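
The following sketch illustrates the behavior described above when the same ID is added twice: the patterns accumulate, while only the most recently supplied callback is kept. The blank pipeline, the two callback functions and the `seen` list are assumptions introduced for the illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline stands in for `nlp`.
nlp = spacy.blank("en")
seen = []

def first_callback(matcher, doc, i, matches):
    seen.append(("first", None))

def second_callback(matcher, doc, i, matches):
    # `i` indexes the current match within `matches`.
    match_id, start, end = matches[i]
    seen.append(("second", doc[start:end].text))

matcher = PhraseMatcher(nlp.vocab)
matcher.add("HEALTH", first_callback, nlp("health care reform"))
# Same ID again: the pattern is added, the callback is replaced.
matcher.add("HEALTH", second_callback, nlp("healthcare reform"))

doc = nlp("She backs healthcare reform and health care reform alike")
matches = matcher(doc)
matched_texts = sorted(text for _, text in seen)
```

Both spellings are matched, and every entry in `seen` should come from `second_callback`.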

## PhraseMatcher.remove {#remove tag="method" new="2.2"}

Remove a rule from the matcher by match ID. A `KeyError` is raised if the key
does not exist.

> #### Example
>
> ```python
> matcher = PhraseMatcher(nlp.vocab)
> matcher.add("OBAMA", None, nlp("Barack Obama"))
> assert "OBAMA" in matcher
> matcher.remove("OBAMA")
> assert "OBAMA" not in matcher
> ```

| Name  | Type    | Description               |
| ----- | ------- | ------------------------- |
| `key` | unicode | The ID of the match rule. |
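
A minimal sketch of guarding against the `KeyError` when a rule may already have been removed. The blank pipeline and the `raised` flag are assumptions for the sake of the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline stands in for `nlp`.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp("Barack Obama"))
matcher.remove("OBAMA")

# Removing the same ID a second time raises a KeyError.
try:
    matcher.remove("OBAMA")
    raised = False
except KeyError:
    raised = True
```

Checking `"OBAMA" in matcher` before calling `remove` is an alternative that avoids the exception.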