spaCy/website/docs/api/phrasematcher.md
Ines Montani e597110d31
💫 Update website (#3285)
<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 19:31:19 +01:00

156 lines
7.5 KiB
Markdown

---
title: PhraseMatcher
teaser: Match sequences of tokens, based on documents
tag: class
source: spacy/matcher/phrasematcher.pyx
new: 2
---
The `PhraseMatcher` lets you efficiently match large terminology lists. While
the [`Matcher`](/api/matcher) lets you match sequences based on lists of token
descriptions, the `PhraseMatcher` accepts match patterns in the form of `Doc`
objects.
## PhraseMatcher.\_\_init\_\_ {#init tag="method"}
Create the rule-based `PhraseMatcher`. Setting a different `attr` to match on
will change the token attributes that will be compared to determine a match. By
default, the incoming `Doc` is checked for sequences of tokens with the same
`ORTH` value, i.e. the verbatim token text. Matching on the attribute `LOWER`
will result in case-insensitive matching, since only the lowercase token texts
are compared. In theory, it's also possible to match on sequences of the same
part-of-speech tags or dependency labels.
If `validate=True` is set, additional validation is performed when pattern are
added. At the moment, it will check whether a `Doc` has attributes assigned that
aren't necessary to produce the matches (for example, part-of-speech tags if the
`PhraseMatcher` matches on the token text). Since this can often lead to
significantly worse performance when creating the pattern, a `UserWarning` will
be shown.
> #### Example
>
> ```python
> from spacy.matcher import PhraseMatcher
> matcher = PhraseMatcher(nlp.vocab)
> ```
| Name | Type | Description |
| --------------------------------------- | --------------- | ------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. |
| `attr` <Tag variant="new">2.1</Tag> | int / unicode | The token attribute to match on. Defaults to `ORTH`, i.e. the verbatim token text. |
| `validate` <Tag variant="new">2.1</Tag> | bool | Validate patterns added to the matcher. |
| **RETURNS** | `PhraseMatcher` | The newly constructed object. |
<Infobox title="Changed in v2.1" variant="warning">
As of v2.1, the `PhraseMatcher` doesn't have a phrase length limit anymore, so
the `max_length` argument is now deprecated.
</Infobox>
## PhraseMatcher.\_\_call\_\_ {#call tag="method"}
Find all token sequences matching the supplied patterns on the `Doc`.
> #### Example
>
> ```python
> from spacy.matcher import PhraseMatcher
>
> matcher = PhraseMatcher(nlp.vocab)
> matcher.add("OBAMA", None, nlp(u"Barack Obama"))
> doc = nlp(u"Barack Obama lifts America one last time in emotional farewell")
> matches = matcher(doc)
> ```
| Name | Type | Description |
| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doc` | `Doc` | The document to match over. |
| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
## PhraseMatcher.pipe {#pipe tag="method"}
Match a stream of documents, yielding them in turn.
> #### Example
>
> ```python
> from spacy.matcher import PhraseMatcher
> matcher = PhraseMatcher(nlp.vocab)
> for doc in matcher.pipe(texts, batch_size=50, n_threads=4):
> pass
> ```
| Name | Type | Description |
| ------------ | -------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| `docs` | iterable | A stream of documents. |
| `batch_size` | int | The number of documents to accumulate into a working set. |
| `n_threads` | int | The number of threads with which to work on the buffer in parallel, if the `PhraseMatcher` implementation supports multi-threading. |
| **YIELDS** | `Doc` | Documents, in order. |
## PhraseMatcher.\_\_len\_\_ {#len tag="method"}
Get the number of rules added to the matcher. Note that this only returns the
number of rules (identical with the number of IDs), not the number of individual
patterns.
> #### Example
>
> ```python
> matcher = PhraseMatcher(nlp.vocab)
> assert len(matcher) == 0
> matcher.add("OBAMA", None, nlp(u"Barack Obama"))
> assert len(matcher) == 1
> ```
| Name | Type | Description |
| ----------- | ---- | -------------------- |
| **RETURNS** | int | The number of rules. |
## PhraseMatcher.\_\_contains\_\_ {#contains tag="method"}
Check whether the matcher contains rules for a match ID.
> #### Example
>
> ```python
> matcher = PhraseMatcher(nlp.vocab)
> assert "OBAMA" not in matcher
> matcher.add("OBAMA", None, nlp(u"Barack Obama"))
> assert "OBAMA" in matcher
> ```
| Name | Type | Description |
| ----------- | ------- | ----------------------------------------------------- |
| `key` | unicode | The match ID. |
| **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
## PhraseMatcher.add {#add tag="method"}
Add a rule to the matcher, consisting of an ID key, one or more patterns, and a
callback function to act on the matches. The callback function will receive the
arguments `matcher`, `doc`, `i` and `matches`. If a pattern already exists for
the given ID, the patterns will be extended. An `on_match` callback will be
overwritten.
> #### Example
>
> ```python
> def on_match(matcher, doc, id, matches):
> print('Matched!', matches)
>
> matcher = PhraseMatcher(nlp.vocab)
> matcher.add("OBAMA", on_match, nlp(u"Barack Obama"))
> matcher.add("HEALTH", on_match, nlp(u"health care reform"),
> nlp(u"healthcare reform"))
> doc = nlp(u"Barack Obama urges Congress to find courage to defend his healthcare reforms")
> matches = matcher(doc)
> ```
| Name | Type | Description |
| ---------- | ------------------ | --------------------------------------------------------------------------------------------- |
| `match_id` | unicode | An ID for the thing you're matching. |
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
| `*docs` | list | `Doc` objects of the phrases to match. |