mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-13 18:56:36 +03:00
WIP: update docs [ci skip]
parent f174c7b1f3
commit 157caf4dfa
|
@ -11,7 +11,8 @@ and [`PhraseMatcher`](/api/phrasematcher) and lets you match on dependency trees
|
|||
using
|
||||
[Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).
|
||||
It requires a pretrained [`DependencyParser`](/api/parser) or other component
|
||||
that sets the `Token.dep` and `Token.head` attributes.
|
||||
that sets the `Token.dep` and `Token.head` attributes. See the
|
||||
[usage guide](/usage/rule-based-matching#dependencymatcher) for examples.
|
||||
|
||||
## Pattern format {#patterns}
|
||||
|
||||
|
@ -48,63 +49,18 @@ dictionary, which defines an anchor token using only `RIGHT_ID` and
|
|||
|
||||
| Name | Description |
|
||||
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. |
|
||||
| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ |
|
||||
| `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ |
|
||||
| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ |
|
||||
| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |
|
||||
|
||||
The first dictionary defines an anchor token, and each additional token added to the
|
||||
pattern is linked to an existing token `LEFT_ID` by the relation `REL_OP` and is
|
||||
described by the name `RIGHT_ID` and the attributes `RIGHT_ATTRS`.
|
||||
<Infobox title="Designing dependency matcher patterns" emoji="📖">
|
||||
|
||||
Let's say we want to find sentences describing who founded what kind of company:
|
||||
For examples of how to construct dependency matcher patterns for different types
|
||||
of relations, see the usage guide on
|
||||
[dependency matching](/usage/rule-based-matching#dependencymatcher).
|
||||
|
||||
- `Smith founded a healthcare company in 2005.`
|
||||
- `Williams initially founded an insurance company in 1987.`
|
||||
- `Lee, an established CEO, founded yet another AI startup.`
|
||||
|
||||
Since it's the root of the dependency parse, `founded` is a good choice for the
|
||||
anchor token in our pattern:
|
||||
|
||||
```python
|
||||
pattern = [
|
||||
{"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}}
|
||||
]
|
||||
```
|
||||
|
||||
We can add the subject as the token with the dependency label `nsubj` that is a
|
||||
direct child `>` of the anchor token named `anchor_founded`:
|
||||
|
||||
```python
|
||||
pattern = [
|
||||
{"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
|
||||
{
|
||||
"LEFT_ID": "anchor_founded",
|
||||
"REL_OP": ">",
|
||||
"RIGHT_ID": "subject",
|
||||
"RIGHT_ATTRS": {"DEP": "nsubj"},
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
And the direct object along with its modifier:
|
||||
|
||||
```python
|
||||
pattern = [ ...
|
||||
{
|
||||
"LEFT_ID": "anchor_founded",
|
||||
"REL_OP": ">",
|
||||
"RIGHT_ID": "founded_object",
|
||||
"RIGHT_ATTRS": {"DEP": "dobj"},
|
||||
},
|
||||
{
|
||||
"LEFT_ID": "founded_object",
|
||||
"REL_OP": ">",
|
||||
"RIGHT_ID": "founded_object_modifier",
|
||||
"RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
|
||||
}
|
||||
]
|
||||
```
|
||||
</Infobox>
|
||||
|
||||
### Operators
|
||||
|
||||
|
@ -113,19 +69,19 @@ come directly from
|
|||
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):
|
||||
|
||||
| Symbol | Description |
|
||||
| --------- | ------------------------------------------------------------------------------------------------------------------- |
|
||||
| `A < B` | `A` is the immediate dependent of `B` |
|
||||
| `A > B` | `A` is the immediate head of `B` |
|
||||
| `A << B` | `A` is the dependent in a chain to `B` following dep->head paths |
|
||||
| `A >> B` | `A` is the head in a chain to `B` following head->dep paths |
|
||||
| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree |
|
||||
| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_ |
|
||||
| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_ |
|
||||
| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_ |
|
||||
| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1` |
|
||||
| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1` |
|
||||
| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i` |
|
||||
| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i` |
|
||||
| --------- | -------------------------------------------------------------------------------------------------------------------- |
|
||||
| `A < B` | `A` is the immediate dependent of `B`. |
|
||||
| `A > B` | `A` is the immediate head of `B`. |
|
||||
| `A << B` | `A` is the dependent in a chain to `B` following dep → head paths. |
|
||||
| `A >> B` | `A` is the head in a chain to `B` following head → dep paths. |
|
||||
| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. |
|
||||
| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
|
||||
| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
|
||||
| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
|
||||
| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. |
|
||||
| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. |
|
||||
| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. |
|
||||
| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. |
|
||||
|
||||
## DependencyMatcher.\_\_init\_\_ {#init tag="method"}
|
||||
|
||||
|
|
File diff suppressed because one or more lines are too long
Before Width: | Height: | Size: 4.6 KiB After Width: | Height: | Size: 25 KiB |
|
@ -20,7 +20,7 @@
|
|||
</text>
|
||||
|
||||
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="309.5">
|
||||
<tspan class="displacy-word" fill="currentColor" x="750">company.</tspan>
|
||||
<tspan class="displacy-word" fill="currentColor" x="750">company</tspan>
|
||||
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="750"></tspan>
|
||||
</text>
|
||||
|
||||
|
|
Before Width: | Height: | Size: 3.8 KiB After Width: | Height: | Size: 3.8 KiB |
|
@ -974,10 +974,12 @@ to match phrases with the same sequence of punctuation and non-punctuation
|
|||
tokens as the pattern. But this can easily get confusing and doesn't have much
|
||||
of an advantage over writing one or two token patterns.
|
||||
|
||||
## Dependency Matcher {#dependencymatcher new="3"}
|
||||
## Dependency Matcher {#dependencymatcher new="3" model="parser"}
|
||||
|
||||
The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within
|
||||
the dependency parse. It requires a model containing a parser such as the
|
||||
the dependency parse using
|
||||
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
|
||||
operators. It requires a model containing a parser such as the
|
||||
[`DependencyParser`](/api/dependencyparser). Instead of defining a list of
|
||||
adjacent tokens as in `Matcher` patterns, the `DependencyMatcher` patterns match
|
||||
tokens in the dependency parse and specify the relations between them.
|
||||
|
@ -1014,15 +1016,15 @@ tokens in the dependency parse and specify the relations between them.
|
|||
> matches = matcher(doc)
|
||||
> ```
|
||||
|
||||
A pattern added to the `DependencyMatcher` consists of a list of dictionaries,
|
||||
with each dictionary describing a token to match and its relation to an existing
|
||||
token in the pattern. Except for the first dictionary, which defines an anchor
|
||||
token using only `RIGHT_ID` and `RIGHT_ATTRS`, each pattern should have the
|
||||
following keys:
|
||||
A pattern added to the dependency matcher consists of a **list of
|
||||
dictionaries**, with each dictionary describing a **token to match** and its
|
||||
**relation to an existing token** in the pattern. Except for the first
|
||||
dictionary, which defines an anchor token using only `RIGHT_ID` and
|
||||
`RIGHT_ATTRS`, each dictionary should have the following keys:
|
||||
|
||||
| Name | Description |
|
||||
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. |
|
||||
| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ |
|
||||
| `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ |
|
||||
| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ |
|
||||
| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |
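
The key rules in the table above can be checked mechanically. The following is a minimal sketch in plain Python, assuming only the rules stated here: the helper `check_pattern` is a hypothetical name introduced for illustration, is not part of spaCy, and makes no claim about how spaCy itself validates patterns.

```python
def check_pattern(pattern):
    """Return a list of problems found in a dependency matcher pattern.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    problems = []
    seen = set()  # RIGHT_ID names defined by earlier dictionaries
    for i, node in enumerate(pattern):
        right_id = node.get("RIGHT_ID")
        if right_id is None or "RIGHT_ATTRS" not in node:
            problems.append(f"node {i}: RIGHT_ID and RIGHT_ATTRS are required")
        if i == 0:
            # The anchor token uses only RIGHT_ID and RIGHT_ATTRS.
            if "LEFT_ID" in node or "REL_OP" in node:
                problems.append("node 0: the anchor must not set LEFT_ID or REL_OP")
        else:
            if "REL_OP" not in node:
                problems.append(f"node {i}: REL_OP is required")
            if node.get("LEFT_ID") not in seen:
                problems.append(f"node {i}: LEFT_ID must name an earlier node")
        if right_id in seen:
            problems.append(f"node {i}: RIGHT_ID {right_id!r} is not unique")
        if right_id is not None:
            seen.add(right_id)
    return problems


good = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]

# "founded_object" is used as LEFT_ID before any dictionary defines it.
bad = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {
        "LEFT_ID": "founded_object",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object_modifier",
        "RIGHT_ATTRS": {"DEP": "compound"},
    },
]

print(check_pattern(good))
print(check_pattern(bad))
```

The order of the dictionaries matters in this sketch because a `LEFT_ID` can only refer to a name that an earlier dictionary has already introduced as `RIGHT_ID`.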
|
||||
|
@ -1040,54 +1042,68 @@ can be used as `LEFT_ID` in another dict.
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Dependency matcher operators
|
||||
### Dependency matcher operators {#dependencymatcher-operators}
|
||||
|
||||
The following operators are supported by the `DependencyMatcher`, most of which
|
||||
come directly from
|
||||
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):
|
||||
|
||||
| Symbol | Description |
|
||||
| --------- | ------------------------------------------------------------------------------------------------------------------- |
|
||||
| `A < B` | `A` is the immediate dependent of `B` |
|
||||
| `A > B` | `A` is the immediate head of `B` |
|
||||
| `A << B` | `A` is the dependent in a chain to `B` following dep->head paths |
|
||||
| `A >> B` | `A` is the head in a chain to `B` following head->dep paths |
|
||||
| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree |
|
||||
| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_ |
|
||||
| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_ |
|
||||
| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_ |
|
||||
| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1` |
|
||||
| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1` |
|
||||
| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i` |
|
||||
| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i` |
|
||||
| --------- | -------------------------------------------------------------------------------------------------------------------- |
|
||||
| `A < B` | `A` is the immediate dependent of `B`. |
|
||||
| `A > B` | `A` is the immediate head of `B`. |
|
||||
| `A << B` | `A` is the dependent in a chain to `B` following dep → head paths. |
|
||||
| `A >> B` | `A` is the head in a chain to `B` following head → dep paths. |
|
||||
| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. |
|
||||
| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
|
||||
| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
|
||||
| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
|
||||
| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. |
|
||||
| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. |
|
||||
| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. |
|
||||
| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. |
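
To make the operator definitions concrete, here is a small sketch in plain Python over a hand-written parse. The `words` and `heads` lists and all helper names are assumptions made for illustration, not spaCy's implementation. Because the toy sentence forms a single tree, the "within the same dependency tree" condition holds trivially and is not checked.

```python
# Hand-written parse of "Smith founded a healthcare company".
words = ["Smith", "founded", "a", "healthcare", "company"]
heads = [1, 1, 4, 4, 1]  # index of each token's head; the root points to itself


def parent(i):
    """Return the head index of token i, or None for the root."""
    return None if heads[i] == i else heads[i]


def is_head_of(a, b):
    """A > B: A is the immediate head of B (A < B is the mirror image)."""
    return parent(b) == a


def is_ancestor_of(a, b):
    """A >> B: A is reached from B by following dep -> head links."""
    current = parent(b)
    while current is not None:
        if current == a:
            return True
        current = parent(current)
    return False


def immediately_precedes(a, b):
    """A . B: A.i == B.i - 1."""
    return a == b - 1


def has_right_sibling(a, b):
    """A $++ B: A and B share a parent and A.i < B.i."""
    return parent(a) is not None and parent(a) == parent(b) and a < b


def has_right_immediate_sibling(a, b):
    """A $+ B: A and B share a parent and A.i == B.i - 1."""
    return has_right_sibling(a, b) and a == b - 1


founded, company, healthcare = 1, 4, 3
print(is_head_of(founded, company))         # founded > company
print(is_head_of(founded, healthcare))      # founded > healthcare
print(is_ancestor_of(founded, healthcare))  # founded >> healthcare
print(has_right_sibling(0, company))        # Smith $++ company
```

In this toy parse, `founded` is the immediate head of `company` but only an indirect ancestor of `healthcare`, which is the difference between `>` and `>>`.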
|
||||
|
||||
### Designing dependency matcher patterns
|
||||
### Designing dependency matcher patterns {#dependencymatcher-patterns}
|
||||
|
||||
Let's say we want to find sentences describing who founded what kind of company:
|
||||
|
||||
- `Smith founded a healthcare company in 2005.`
|
||||
- `Williams initially founded an insurance company in 1987.`
|
||||
- `Lee, an experienced CEO, has founded two AI startups.`
|
||||
- _Smith founded a healthcare company in 2005._
|
||||
- _Williams initially founded an insurance company in 1987._
|
||||
- _Lee, an experienced CEO, has founded two AI startups._
|
||||
|
||||
The dependency parse for `Smith founded a healthcare company` shows types of
|
||||
The dependency parse for "Smith founded a healthcare company" shows types of
|
||||
relations and tokens we want to match:
|
||||
|
||||
> #### Visualizing the parse
|
||||
>
|
||||
> The [`displacy` visualizer](/usage/visualizer) lets you render `Doc` objects
|
||||
> and their dependency parse and part-of-speech tags:
|
||||
>
|
||||
> ```python
|
||||
> import spacy
|
||||
> from spacy import displacy
|
||||
>
|
||||
> nlp = spacy.load("en_core_web_sm")
|
||||
> doc = nlp("Smith founded a healthcare company")
|
||||
> displacy.serve(doc)
|
||||
> ```
|
||||
|
||||
import DisplaCyDepFoundedHtml from 'images/displacy-dep-founded.html'
|
||||
|
||||
<Iframe title="displaCy visualization of dependencies" html={DisplaCyDepFoundedHtml} height={450} />
|
||||
|
||||
The relations we're interested in are:
|
||||
|
||||
- the founder is the subject (`nsubj`) of the token with the text `founded`
|
||||
- the company is the object (`dobj`) of `founded`
|
||||
- the kind of company may be an adjective (`amod`, not shown above) or a
|
||||
compound (`compound`)
|
||||
- the founder is the **subject** (`nsubj`) of the token with the text `founded`
|
||||
- the company is the **object** (`dobj`) of `founded`
|
||||
- the kind of company may be an **adjective** (`amod`, not shown above) or a
|
||||
**compound** (`compound`)
|
||||
|
||||
The first step is to pick an anchor token for the pattern. Since it's the root
|
||||
of the dependency parse, `founded` is a good choice here. It is often easier to
|
||||
construct patterns when all dependency relation operators point from the head to
|
||||
the children. In this example, we'll only use `>`, which connects a head to an
|
||||
immediate dependent as `head > child`.
|
||||
The first step is to pick an **anchor token** for the pattern. Since it's the
|
||||
root of the dependency parse, `founded` is a good choice here. It is often
|
||||
easier to construct patterns when all dependency relation operators point from
|
||||
the head to the children. In this example, we'll only use `>`, which connects a
|
||||
head to an immediate dependent as `head > child`.
|
||||
|
||||
The simplest dependency matcher pattern will identify and name a single token in
|
||||
the tree:
|
||||
|
@ -1099,7 +1115,6 @@ from spacy.matcher import DependencyMatcher
|
|||
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
matcher = DependencyMatcher(nlp.vocab)
|
||||
|
||||
pattern = [
|
||||
{
|
||||
"RIGHT_ID": "anchor_founded", # unique name
|
||||
|
@ -1116,6 +1131,7 @@ Now that we have a named anchor token (`anchor_founded`), we can add the founder
|
|||
as the immediate dependent (`>`) of `founded` with the dependency label `nsubj`:
|
||||
|
||||
```python
|
||||
### Step 1 {highlight="8,10"}
|
||||
pattern = [
|
||||
{
|
||||
"RIGHT_ID": "anchor_founded",
|
||||
|
@ -1127,31 +1143,37 @@ pattern = [
|
|||
"RIGHT_ID": "subject",
|
||||
"RIGHT_ATTRS": {"DEP": "nsubj"},
|
||||
}
|
||||
# ...
|
||||
]
|
||||
```
|
||||
|
||||
The direct object (`dobj`) is added in the same way:
|
||||
|
||||
```python
|
||||
pattern = [ ...
|
||||
### Step 2 {highlight=""}
|
||||
pattern = [
|
||||
#...
|
||||
{
|
||||
"LEFT_ID": "anchor_founded",
|
||||
"REL_OP": ">",
|
||||
"RIGHT_ID": "founded_object",
|
||||
"RIGHT_ATTRS": {"DEP": "dobj"},
|
||||
}
|
||||
# ...
|
||||
]
|
||||
```
|
||||
|
||||
When the subject and object tokens are added, they are required to have names
|
||||
under the key `RIGHT_ID`, which are allowed to be any unique string, e.g.
|
||||
`founded_subject`. These names can then be used as `LEFT_ID` to link new tokens
|
||||
into the pattern. For the final part of our pattern, we'll specify that the
|
||||
token `founded_object` should have a modifier with the dependency relation
|
||||
`founded_subject`. These names can then be used as `LEFT_ID` to **link new
|
||||
tokens into the pattern**. For the final part of our pattern, we'll specify that
|
||||
the token `founded_object` should have a modifier with the dependency relation
|
||||
`amod` or `compound`:
|
||||
|
||||
```python
|
||||
pattern = [ ...
|
||||
### Step 3 {highlight="7"}
|
||||
pattern = [
|
||||
# ...
|
||||
{
|
||||
"LEFT_ID": "founded_object",
|
||||
"REL_OP": ">",
|
||||
|
@ -1168,8 +1190,6 @@ each new token needs to be linked to an existing token on its left. As for
|
|||
`founded` in this example, a token may be linked to more than one token on its
|
||||
right:
|
||||
|
||||
<!-- TODO: adjust for final example, prettify -->
|
||||
|
||||
![Dependency matcher pattern](../images/dep-match-diagram.svg)
|
||||
|
||||
The full pattern comes together as shown in the example below:
|
||||
|
@ -1209,11 +1229,10 @@ pattern = [
|
|||
|
||||
matcher.add("FOUNDED", [pattern])
|
||||
doc = nlp("Lee, an experienced CEO, has founded two AI startups.")
|
||||
|
||||
matches = matcher(doc)
|
||||
print(matches) # [(4851363122962674176, [7, 0, 10, 9])]
|
||||
|
||||
# each token_id corresponds to one pattern dict
|
||||
print(matches) # [(4851363122962674176, [7, 0, 10, 9])]
|
||||
# Each token_id corresponds to one pattern dict
|
||||
match_id, token_ids = matches[0]
|
||||
for i in range(len(token_ids)):
|
||||
print(pattern[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)
|
||||
|
|
|
@ -26,6 +26,7 @@ menu:
|
|||
- [End-to-end project workflows](#features-projects)
|
||||
- [New built-in components](#features-pipeline-components)
|
||||
- [New custom component API](#features-components)
|
||||
- [Dependency matching](#features-dep-matcher)
|
||||
- [Python type hints](#features-types)
|
||||
- [New methods & attributes](#new-methods)
|
||||
- [New & updated documentation](#new-docs)
|
||||
|
@ -152,7 +153,6 @@ add to your pipeline and customize for your use case:
|
|||
| [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. |
|
||||
| [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. |
|
||||
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
|
||||
| [`DependencyMatcher`](/api/dependencymatcher) | Component for matching subtrees within a dependency parse. |
|
||||
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
|
||||
|
||||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||
|
@ -202,6 +202,34 @@ aren't set.
|
|||
|
||||
</Infobox>
|
||||
|
||||
### Dependency matching {#features-dep-matcher}
|
||||
|
||||
<!-- TODO: improve summary -->
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> # TODO: example
|
||||
> ```
|
||||
|
||||
The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within
|
||||
the dependency parse using
|
||||
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
|
||||
operators. It follows the same API as the token-based [`Matcher`](/api/matcher).
|
||||
A pattern added to the dependency matcher consists of a **list of
|
||||
dictionaries**, with each dictionary describing a **token to match** and its
|
||||
**relation to an existing token** in the pattern.
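
As a rough illustration of what such a pattern expresses, the sketch below matches a pattern against a hand-written parse in plain Python. It is a toy under stated assumptions, not spaCy's implementation: `toy_match` and `attrs_match` are hypothetical helpers, the `words`, `heads` and `deps` lists are written by hand rather than produced by a parser, and only the `>` operator and the `ORTH` and `DEP` attributes are supported.

```python
# Hand-written parse of "Smith founded a healthcare company".
words = ["Smith", "founded", "a", "healthcare", "company"]
heads = [1, 1, 4, 4, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "det", "compound", "dobj"]

pattern = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {"LEFT_ID": "anchor_founded", "REL_OP": ">",
     "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "anchor_founded", "REL_OP": ">",
     "RIGHT_ID": "founded_object", "RIGHT_ATTRS": {"DEP": "dobj"}},
    {"LEFT_ID": "founded_object", "REL_OP": ">",
     "RIGHT_ID": "founded_object_modifier",
     "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}}},
]


def attrs_match(i, attrs):
    """Check token i against ORTH/DEP constraints (plain value or {"IN": [...]})."""
    values = {"ORTH": words[i], "DEP": deps[i]}
    for key, want in attrs.items():
        have = values[key]
        if isinstance(want, dict):
            if have not in want["IN"]:
                return False
        elif have != want:
            return False
    return True


def toy_match(pattern):
    """Return one list of token indices per match, one index per pattern dict."""
    results = []

    def extend(assigned):
        k = len(assigned)  # number of pattern dicts already bound to tokens
        if k == len(pattern):
            results.append([assigned[node["RIGHT_ID"]] for node in pattern])
            return
        node = pattern[k]
        for i in range(len(words)):
            if not attrs_match(i, node["RIGHT_ATTRS"]):
                continue
            if k > 0:
                left = assigned[node["LEFT_ID"]]
                # Only ">" is handled: LEFT_ID must be the immediate head of token i.
                if i == left or heads[i] != left:
                    continue
            extend({**assigned, node["RIGHT_ID"]: i})

    extend({})
    return results


for token_ids in toy_match(pattern):
    for node, i in zip(pattern, token_ids):
        print(node["RIGHT_ID"] + ":", words[i])
```

Each match lists one token index per pattern dictionary, in pattern order, which is why the names under `RIGHT_ID` can be zipped with the indices.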
|
||||
|
||||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||
|
||||
- **Usage:**
|
||||
[Dependency matching](/usage/rule-based-matching#dependencymatcher)
|
||||
- **API:** [`DependencyMatcher`](/api/dependencymatcher)
|
||||
- **Implementation:**
|
||||
[`spacy/matcher/dependencymatcher.pyx`](https://github.com/explosion/spaCy/tree/develop/spacy/matcher/dependencymatcher.pyx)
|
||||
|
||||
</Infobox>
|
||||
|
||||
### Type hints and type-based data validation {#features-types}
|
||||
|
||||
> #### Example
|
||||
|
|