---
title: DependencyMatcher
teaser: Match subtrees within a dependency parse
tag: class
new: 3
source: spacy/matcher/dependencymatcher.pyx
---

The `DependencyMatcher` follows the same API as the [`Matcher`](/api/matcher)
and [`PhraseMatcher`](/api/phrasematcher) and lets you match on dependency trees
using
[Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).
It requires a pretrained [`DependencyParser`](/api/parser) or other component
that sets the `Token.dep` and `Token.head` attributes.

## Pattern format {#patterns}

> ```python
> ### Example
> # pattern: "[subject] ... initially founded"
> [
>   # anchor token: founded
>   {
>     "RIGHT_ID": "founded",
>     "RIGHT_ATTRS": {"ORTH": "founded"}
>   },
>   # founded -> subject
>   {
>     "LEFT_ID": "founded",
>     "REL_OP": ">",
>     "RIGHT_ID": "subject",
>     "RIGHT_ATTRS": {"DEP": "nsubj"}
>   },
>   # "founded" immediately follows "initially"
>   {
>     "LEFT_ID": "founded",
>     "REL_OP": ";",
>     "RIGHT_ID": "initially",
>     "RIGHT_ATTRS": {"ORTH": "initially"}
>   }
> ]
> ```

A pattern added to the `DependencyMatcher` consists of a list of dictionaries,
with each dictionary describing a token to match. Except for the first
dictionary, which defines an anchor token using only `RIGHT_ID` and
`RIGHT_ATTRS`, each dictionary should have the following keys:

| Name          | Description                                                                                                                                                            |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `LEFT_ID`     | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~                                                                     |
| `REL_OP`      | An operator that describes how the two nodes are related. ~~str~~                                                                                                      |
| `RIGHT_ID`    | A unique name for the right-hand node in the relation. ~~str~~                                                                                                         |
| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |

The first dictionary defines an anchor token. Each additional token added to the
pattern is linked to an existing token `LEFT_ID` by the relation `REL_OP` and is
described by the name `RIGHT_ID` and the attributes `RIGHT_ATTRS`.

Let's say we want to find sentences describing who founded what kind of company:

- `Smith founded a healthcare company in 2005.`
- `Williams initially founded an insurance company in 1987.`
- `Lee, an established CEO, founded yet another AI startup.`

Since it's the root of the dependency parse, `founded` is a good choice for the
anchor token in our pattern:

```python
pattern = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}}
]
```

We can add the subject as the token with the dependency label `nsubj` that is a
direct child `>` of the anchor token named `anchor_founded`:

```python
pattern = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    }
]
```

We can then add the direct object along with its modifier:

```python
pattern = [
    # ...
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object",
        "RIGHT_ATTRS": {"DEP": "dobj"},
    },
    {
        "LEFT_ID": "founded_object",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object_modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
    }
]
```
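To see the assembled pattern in action, the sketch below runs it over the first example sentence. It assumes a hand-annotated `Doc` (the `words`, `heads` and `deps` values are written out by hand for illustration, not produced by a trained parser) and a blank `Vocab`; in a real pipeline you would call `nlp(text)` instead.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-written parse for illustration (assumption: no trained pipeline).
# `heads` holds absolute token indices; the root token points at itself.
words = ["Smith", "founded", "a", "healthcare", "company", "in", "2005", "."]
heads = [1, 1, 4, 4, 1, 1, 5, 1]
deps = ["nsubj", "ROOT", "det", "compound", "dobj", "prep", "pobj", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# The full pattern: anchor, subject, direct object, object modifier
pattern = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object",
        "RIGHT_ATTRS": {"DEP": "dobj"},
    },
    {
        "LEFT_ID": "founded_object",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object_modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
    },
]

matcher = DependencyMatcher(doc.vocab)
matcher.add("FOUNDED", [pattern])
matches = matcher(doc)

# One list of token texts per match, in the order of the pattern nodes
matched_texts = [[doc[i].text for i in token_ids] for _, token_ids in matches]
print(matched_texts)
```

Because every node except the anchor is reached through a `LEFT_ID` that was defined earlier, the pattern describes a single connected subtree rooted at `anchor_founded`.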

### Operators

The following operators are supported by the `DependencyMatcher`, most of which
come directly from
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):

| Symbol    | Description                                                                                                         |
| --------- | ------------------------------------------------------------------------------------------------------------------- |
| `A < B`   | `A` is the immediate dependent of `B`                                                                               |
| `A > B`   | `A` is the immediate head of `B`                                                                                    |
| `A << B`  | `A` is the dependent in a chain to `B` following dep->head paths                                                    |
| `A >> B`  | `A` is the head in a chain to `B` following head->dep paths                                                         |
| `A . B`   | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree                   |
| `A .* B`  | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_                 |
| `A ; B`   | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_ |
| `A ;* B`  | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_                  |
| `A $+ B`  | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`                 |
| `A $- B`  | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`                  |
| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`                                |
| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`                                 |
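The operator definitions can be read as predicates over token indices and head pointers. The following pure-Python sketch restates the table in that form; it is an illustration only and not spaCy's implementation. The helper names (`ancestors`, `same_tree`, `siblings`, `OPS`) and the convention that a root token is its own head are assumptions made for this example.

```python
def ancestors(i, heads):
    """Indices on the dep->head path from token i up to its root."""
    path = []
    while heads[i] != i:  # assumption: the root is its own head
        i = heads[i]
        path.append(i)
    return path


def same_tree(a, b, heads):
    """True if tokens a and b share the same root."""
    root_a = (ancestors(a, heads) or [a])[-1]
    root_b = (ancestors(b, heads) or [b])[-1]
    return root_a == root_b


def siblings(a, b, heads):
    """True if a and b are distinct non-root tokens with the same parent."""
    return a != b and heads[a] == heads[b] and heads[a] != a and heads[b] != b


# Each entry answers "does `A op B` hold?" for token indices a and b
OPS = {
    "<": lambda a, b, h: a != b and h[a] == b,
    ">": lambda a, b, h: a != b and h[b] == a,
    "<<": lambda a, b, h: b in ancestors(a, h),
    ">>": lambda a, b, h: a in ancestors(b, h),
    ".": lambda a, b, h: a == b - 1 and same_tree(a, b, h),
    ".*": lambda a, b, h: a < b and same_tree(a, b, h),
    ";": lambda a, b, h: a == b + 1 and same_tree(a, b, h),
    ";*": lambda a, b, h: a > b and same_tree(a, b, h),
    "$+": lambda a, b, h: siblings(a, b, h) and a == b - 1,
    "$-": lambda a, b, h: siblings(a, b, h) and a == b + 1,
    "$++": lambda a, b, h: siblings(a, b, h) and a < b,
    "$--": lambda a, b, h: siblings(a, b, h) and a > b,
}

# Williams(0) initially(1) founded(2) an(3) insurance(4) company(5)
heads = [2, 2, 2, 5, 5, 2]

founded_heads_williams = OPS[">"](2, 0, heads)        # founded > Williams
founded_dominates_insurance = OPS[">>"](2, 4, heads)  # founded >> insurance
founded_after_initially = OPS[";"](2, 1, heads)       # founded ; initially
company_right_of_williams = OPS["$++"](0, 5, heads)   # Williams $++ company
print(founded_heads_williams, founded_dominates_insurance,
      founded_after_initially, company_right_of_williams)
```

In a pattern, `A` corresponds to the `LEFT_ID` node and `B` to the `RIGHT_ID` node, so `"REL_OP": ">"` reads as "the left-hand node is the immediate head of the right-hand node".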

## DependencyMatcher.\_\_init\_\_ {#init tag="method"}

Create a `DependencyMatcher`.

> #### Example
>
> ```python
> from spacy.matcher import DependencyMatcher
> matcher = DependencyMatcher(nlp.vocab)
> ```

| Name           | Description                                                                                           |
| -------------- | ----------------------------------------------------------------------------------------------------- |
| `vocab`        | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ |
| _keyword-only_ |                                                                                                       |
| `validate`     | Validate all patterns added to this matcher. ~~bool~~                                                 |
|
|
|
|
## DependencyMatcher.\_\call\_\_ {#call tag="method"}
|
|
|
|
Find all tokens matching the supplied patterns on the `Doc` or `Span`.
|
|
|
|
> #### Example
|
|
>
|
|
> ```python
|
|
> from spacy.matcher import DependencyMatcher
|
|
>
|
|
> matcher = DependencyMatcher(nlp.vocab)
|
|
> pattern = [{"RIGHT_ID": "founded_id",
|
|
> "RIGHT_ATTRS": {"ORTH": "founded"}}]
|
|
> matcher.add("FOUNDED", [pattern])
|
|
> doc = nlp("Bill Gates founded Microsoft.")
|
|
> matches = matcher(doc)
|
|
> ```
|
|
|
|
| Name | Description |
|
|
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
|
|
| **RETURNS** | A list of `(match_id, token_ids)` tuples, describing the matches. The `match_id` is the ID of the match pattern and `token_ids` is a list of token indices matched by the pattern, where the position of each token in the list corresponds to the position of the node specification in the pattern. ~~List[Tuple[int, List[int]]]~~ |
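The return value can be decoded by looking up `match_id` in the vocab's string store and pairing each entry of `token_ids` with the pattern node at the same position. The sketch below shows this under the assumption of a hand-annotated `Doc` and a blank `Vocab` (the parse is written by hand for illustration rather than predicted by a pipeline).

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-written parse for illustration; heads are absolute token indices
words = ["Bill", "Gates", "founded", "Microsoft", "."]
heads = [1, 2, 2, 2, 2]
deps = ["compound", "nsubj", "ROOT", "dobj", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

pattern = [
    {"RIGHT_ID": "founded_id", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {
        "LEFT_ID": "founded_id",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]

matcher = DependencyMatcher(doc.vocab)
matcher.add("FOUNDED", [pattern])
matches = matcher(doc)

match_id, token_ids = matches[0]
# The match ID is stored as a hash; the string store maps it back to the key
rule_name = doc.vocab.strings[match_id]
# token_ids[k] is the token matched by the k-th node of the pattern
by_node = {node["RIGHT_ID"]: doc[i].text for node, i in zip(pattern, token_ids)}
print(rule_name, token_ids, by_node)
```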

## DependencyMatcher.\_\_len\_\_ {#len tag="method"}

Get the number of rules added to the dependency matcher. Note that this only
returns the number of rules (identical with the number of IDs), not the number
of individual patterns.

> #### Example
>
> ```python
> matcher = DependencyMatcher(nlp.vocab)
> assert len(matcher) == 0
> pattern = [{"RIGHT_ID": "founded_id",
>   "RIGHT_ATTRS": {"ORTH": "founded"}}]
> matcher.add("FOUNDED", [pattern])
> assert len(matcher) == 1
> ```

| Name        | Description                  |
| ----------- | ---------------------------- |
| **RETURNS** | The number of rules. ~~int~~ |
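The difference between rules and patterns shows up as soon as one ID carries more than one pattern. A minimal sketch, assuming a blank `Vocab` and three made-up single-node patterns:

```python
from spacy.matcher import DependencyMatcher
from spacy.vocab import Vocab

matcher = DependencyMatcher(Vocab())

# Illustrative single-node patterns (names chosen for this example)
founded = [{"RIGHT_ID": "founded", "RIGHT_ATTRS": {"ORTH": "founded"}}]
started = [{"RIGHT_ID": "started", "RIGHT_ATTRS": {"ORTH": "started"}}]
acquired = [{"RIGHT_ID": "acquired", "RIGHT_ATTRS": {"ORTH": "acquired"}}]

# Two patterns under one ID still count as a single rule
matcher.add("FOUNDED", [founded, started])
rules_after_first_add = len(matcher)

# A second ID adds a second rule
matcher.add("ACQUIRED", [acquired])
rules_after_second_add = len(matcher)

print(rules_after_first_add, rules_after_second_add)
```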

## DependencyMatcher.\_\_contains\_\_ {#contains tag="method"}

Check whether the matcher contains rules for a match ID.

> #### Example
>
> ```python
> matcher = DependencyMatcher(nlp.vocab)
> assert "FOUNDED" not in matcher
> matcher.add("FOUNDED", [pattern])
> assert "FOUNDED" in matcher
> ```

| Name        | Description                                                    |
| ----------- | -------------------------------------------------------------- |
| `key`       | The match ID. ~~str~~                                          |
| **RETURNS** | Whether the matcher contains rules for this match ID. ~~bool~~ |

## DependencyMatcher.add {#add tag="method"}

Add a rule to the matcher, consisting of an ID key, one or more patterns, and an
optional callback function to act on the matches. The callback function will
receive the arguments `matcher`, `doc`, `i` and `matches`. If patterns already
exist for the given ID, the new patterns are appended to them, and any existing
`on_match` callback is overwritten.

> #### Example
>
> ```python
> def on_match(matcher, doc, i, matches):
>     print("Matched!", matches)
>
> matcher = DependencyMatcher(nlp.vocab)
> matcher.add("FOUNDED", patterns, on_match=on_match)
> ```

| Name           | Description                                                                                                                                                           |
| -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `match_id`     | An ID for the patterns. ~~str~~                                                                                                                                       |
| `patterns`     | A list of match patterns. A pattern consists of a list of dicts, where each dict describes a token in the tree. ~~List[List[Dict[str, Union[str, Dict]]]]~~           |
| _keyword-only_ |                                                                                                                                                                       |
| `on_match`     | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[DependencyMatcher, Doc, int, List[Tuple]], Any]]~~ |
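As a sketch of how the callback arguments fit together, the example below collects the matched token texts inside `on_match`, using `i` to index into `matches`. It assumes a hand-annotated `Doc` and a blank `Vocab`; the `seen` list is a name introduced for this illustration.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-written parse for illustration; heads are absolute token indices
words = ["Bill", "Gates", "founded", "Microsoft", "."]
heads = [1, 2, 2, 2, 2]
deps = ["compound", "nsubj", "ROOT", "dobj", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

seen = []  # filled by the callback, one entry per match


def on_match(matcher, doc, i, matches):
    # `matches` is the full list of matches; `i` is the index of the current one
    match_id, token_ids = matches[i]
    seen.append([doc[t].text for t in token_ids])


pattern = [
    {"RIGHT_ID": "founded_id", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {
        "LEFT_ID": "founded_id",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]

matcher = DependencyMatcher(doc.vocab)
matcher.add("FOUNDED", [pattern], on_match=on_match)
matches = matcher(doc)
print(seen)
```

Since a later `add` call for the same ID overwrites the callback, pass `on_match` again when extending an existing rule if the callback should stay in place.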

## DependencyMatcher.get {#get tag="method"}

Retrieve the pattern stored for a key. Returns the rule as an
`(on_match, patterns)` tuple containing the callback and available patterns.

> #### Example
>
> ```python
> matcher.add("FOUNDED", patterns, on_match=on_match)
> on_match, patterns = matcher.get("FOUNDED")
> ```

| Name        | Description                                                                                                 |
| ----------- | ----------------------------------------------------------------------------------------------------------- |
| `key`       | The ID of the match rule. ~~str~~                                                                           |
| **RETURNS** | The rule, as an `(on_match, patterns)` tuple. ~~Tuple[Optional[Callable], List[List[Union[Dict, Tuple]]]]~~ |
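A minimal sketch of inspecting a stored rule, assuming a blank `Vocab` and a no-op callback defined only for this example:

```python
from spacy.matcher import DependencyMatcher
from spacy.vocab import Vocab


def on_match(matcher, doc, i, matches):
    # Placeholder callback for illustration; it does nothing
    pass


pattern = [{"RIGHT_ID": "founded_id", "RIGHT_ATTRS": {"ORTH": "founded"}}]

matcher = DependencyMatcher(Vocab())
matcher.add("FOUNDED", [pattern], on_match=on_match)

# Unpack the stored rule into its callback and its list of patterns
callback, stored_patterns = matcher.get("FOUNDED")
first_node_name = stored_patterns[0][0]["RIGHT_ID"]
print(callback is on_match, len(stored_patterns), first_node_name)
```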

## DependencyMatcher.remove {#remove tag="method"}

Remove a rule from the dependency matcher. A `KeyError` is raised if the match
ID does not exist.

> #### Example
>
> ```python
> matcher.add("FOUNDED", patterns)
> assert "FOUNDED" in matcher
> matcher.remove("FOUNDED")
> assert "FOUNDED" not in matcher
> ```

| Name  | Description                       |
| ----- | --------------------------------- |
| `key` | The ID of the match rule. ~~str~~ |