WIP: update docs [ci skip]

Ines Montani 2020-09-04 16:30:31 +02:00
parent f174c7b1f3
commit 157caf4dfa
5 changed files with 154 additions and 176 deletions

@ -11,7 +11,8 @@ and [`PhraseMatcher`](/api/phrasematcher) and lets you match on dependency trees
using using
[Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). [Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).
It requires a pretrained [`DependencyParser`](/api/parser) or other component It requires a pretrained [`DependencyParser`](/api/parser) or other component
that sets the `Token.dep` and `Token.head` attributes. that sets the `Token.dep` and `Token.head` attributes. See the
[usage guide](/usage/rule-based-matching#dependencymatcher) for examples.
## Pattern format {#patterns} ## Pattern format {#patterns}
@ -48,63 +49,18 @@ dictionary, which defines an anchor token using only `RIGHT_ID` and
| Name | Description | | Name | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. | | `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ |
| `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ | | `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ |
| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ | | `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ |
| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ | | `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |
The first pattern defines an anchor token and each additional token added to the <Infobox title="Designing dependency matcher patterns" emoji="📖">
pattern is linked to an existing token `LEFT_ID` by the relation `REL_OP` and is
described by the name `RIGHT_ID` and the attributes `RIGHT_ATTRS`.
Let's say we want to find sentences describing who founded what kind of company: For examples of how to construct dependency matcher patterns for different types
of relations, see the usage guide on
[dependency matching](/usage/rule-based-matching#dependencymatcher).
- `Smith founded a healthcare company in 2005.` </Infobox>
- `Williams initially founded an insurance company in 1987.`
- `Lee, an established CEO, founded yet another AI startup.`
Since it's the root of the dependency parse, `founded` is a good choice for the
anchor token in our pattern:
```python
pattern = [
{"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}}
]
```
We can add the subject as the token with the dependency label `nsubj` that is a
direct child `>` of the anchor token named `anchor_founded`:
```python
pattern = [
{"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
{
"LEFT_ID": "anchor_founded",
"REL_OP": ">",
"RIGHT_ID": "subject",
"RIGHT_ATTRS": {"DEP": "nsubj"},
}
]
```
And the direct object along with its modifier:
```python
pattern = [ ...
{
"LEFT_ID": "anchor_founded",
"REL_OP": ">",
"RIGHT_ID": "founded_object",
"RIGHT_ATTRS": {"DEP": "dobj"},
},
{
"LEFT_ID": "founded_object",
"REL_OP": ">",
"RIGHT_ID": "founded_object_modifier",
"RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
}
]
```
### Operators ### Operators
@ -113,19 +69,19 @@ come directly from
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html): [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):
| Symbol | Description | | Symbol | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------- | | --------- | -------------------------------------------------------------------------------------------------------------------- |
| `A < B` | `A` is the immediate dependent of `B` | | `A < B` | `A` is the immediate dependent of `B`. |
| `A > B` | `A` is the immediate head of `B` | | `A > B` | `A` is the immediate head of `B`. |
| `A << B` | `A` is the dependent in a chain to `B` following dep->head paths | | `A << B` | `A` is the dependent in a chain to `B` following dep &rarr; head paths. |
| `A >> B` | `A` is the head in a chain to `B` following head->dep paths | | `A >> B` | `A` is the head in a chain to `B` following head &rarr; dep paths. |
| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree | | `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. |
| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_ | | `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_ | | `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_ | | `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1` | | `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. |
| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1` | | `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. |
| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i` | | `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. |
| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i` | | `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. |
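The four tree operators in the table (`<`, `>`, `<<`, `>>`) can be illustrated with a small stand-alone sketch. It does not use spaCy and is not how the `DependencyMatcher` is implemented. The `heads` list is a hand-written parse of "Smith founded a healthcare company" (the attachment of "a" to "company" is an assumption), and the function names are hypothetical.

```python
# Toy parse for "Smith founded a healthcare company" (hand-written, not parser output).
# heads[i] is the index of token i's head; the root token points to itself.
words = ["Smith", "founded", "a", "healthcare", "company"]
heads = [1, 1, 4, 4, 1]


def immediate_head(a: int, b: int) -> bool:
    """A > B: token a is the immediate head of token b."""
    return a != b and heads[b] == a


def immediate_dependent(a: int, b: int) -> bool:
    """A < B: token a is the immediate dependent of token b."""
    return immediate_head(b, a)


def head_in_chain(a: int, b: int) -> bool:
    """A >> B: token a is reached from token b by following dep -> head links."""
    node = b
    while heads[node] != node:  # stop once the root is reached
        node = heads[node]
        if node == a:
            return True
    return False


def dependent_in_chain(a: int, b: int) -> bool:
    """A << B: token a is a (possibly indirect) dependent of token b."""
    return head_in_chain(b, a)


founded, healthcare, company = 1, 3, 4
print(immediate_head(founded, company))         # True
print(immediate_head(founded, healthcare))      # False
print(head_in_chain(founded, healthcare))       # True
print(dependent_in_chain(healthcare, founded))  # True
```

In this sketch, `>` and `<` only look one step up or down the tree, while `>>` and `<<` follow the chain of heads all the way to the root.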
## DependencyMatcher.\_\_init\_\_ {#init tag="method"} ## DependencyMatcher.\_\_init\_\_ {#init tag="method"}

@ -20,7 +20,7 @@
</text> </text>
<text class="displacy-token" fill="currentColor" text-anchor="middle" y="309.5"> <text class="displacy-token" fill="currentColor" text-anchor="middle" y="309.5">
<tspan class="displacy-word" fill="currentColor" x="750">company.</tspan> <tspan class="displacy-word" fill="currentColor" x="750">company</tspan>
<tspan class="displacy-tag" dy="2em" fill="currentColor" x="750"></tspan> <tspan class="displacy-tag" dy="2em" fill="currentColor" x="750"></tspan>
</text> </text>

@ -974,10 +974,12 @@ to match phrases with the same sequence of punctuation and non-punctuation
tokens as the pattern. But this can easily get confusing and doesn't have much tokens as the pattern. But this can easily get confusing and doesn't have much
of an advantage over writing one or two token patterns. of an advantage over writing one or two token patterns.
## Dependency Matcher {#dependencymatcher new="3"} ## Dependency Matcher {#dependencymatcher new="3" model="parser"}
The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within
the dependency parse. It requires a model containing a parser such as the the dependency parse using
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
operators. It requires a model containing a parser such as the
[`DependencyParser`](/api/dependencyparser). Instead of defining a list of [`DependencyParser`](/api/dependencyparser). Instead of defining a list of
adjacent tokens as in `Matcher` patterns, the `DependencyMatcher` patterns match adjacent tokens as in `Matcher` patterns, the `DependencyMatcher` patterns match
tokens in the dependency parse and specify the relations between them. tokens in the dependency parse and specify the relations between them.
@ -1014,15 +1016,15 @@ tokens in the dependency parse and specify the relations between them.
> matches = matcher(doc) > matches = matcher(doc)
> ``` > ```
A pattern added to the `DependencyMatcher` consists of a list of dictionaries, A pattern added to the dependency matcher consists of a **list of
with each dictionary describing a token to match and its relation to an existing dictionaries**, with each dictionary describing a **token to match** and its
token in the pattern. Except for the first dictionary, which defines an anchor **relation to an existing token** in the pattern. Except for the first
token using only `RIGHT_ID` and `RIGHT_ATTRS`, each pattern should have the dictionary, which defines an anchor token using only `RIGHT_ID` and
following keys: `RIGHT_ATTRS`, each pattern should have the following keys:
| Name | Description | | Name | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. | | `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ |
| `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ | | `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ |
| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ | | `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ |
| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ | | `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |
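The key requirements in the table can be checked before a pattern is handed to the matcher. The following is a minimal sketch in plain Python, assuming only the rules stated above: the first dictionary needs `RIGHT_ID` and `RIGHT_ATTRS`, every later dictionary needs all four keys, each `LEFT_ID` must name an earlier node, and each `RIGHT_ID` must be unique. The helper `check_pattern` is hypothetical and not part of spaCy.

```python
from typing import Any, Dict, List

ANCHOR_KEYS = {"RIGHT_ID", "RIGHT_ATTRS"}
REQUIRED_KEYS = {"LEFT_ID", "REL_OP", "RIGHT_ID", "RIGHT_ATTRS"}


def check_pattern(pattern: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in a dependency matcher pattern."""
    problems = []
    seen = set()  # RIGHT_ID names defined by earlier dictionaries
    for idx, node in enumerate(pattern):
        expected = ANCHOR_KEYS if idx == 0 else REQUIRED_KEYS
        missing = expected - set(node)
        if missing:
            problems.append(f"node {idx}: missing keys {sorted(missing)}")
        if idx > 0 and node.get("LEFT_ID") not in seen:
            problems.append(f"node {idx}: LEFT_ID is not defined in an earlier node")
        right_id = node.get("RIGHT_ID")
        if right_id in seen:
            problems.append(f"node {idx}: RIGHT_ID {right_id!r} is not unique")
        if right_id is not None:
            seen.add(right_id)
    return problems


pattern = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
print(check_pattern(pattern))  # []
```

A check like this only covers the structure of the pattern. Whether the `REL_OP` value and the `RIGHT_ATTRS` keys are valid is still up to the matcher itself.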
@ -1040,54 +1042,68 @@ can be used as `LEFT_ID` in another dict.
</Infobox> </Infobox>
### Dependency matcher operators ### Dependency matcher operators {#dependencymatcher-operators}
The following operators are supported by the `DependencyMatcher`, most of which The following operators are supported by the `DependencyMatcher`, most of which
come directly from come directly from
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html): [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):
| Symbol | Description | | Symbol | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------- | | --------- | -------------------------------------------------------------------------------------------------------------------- |
| `A < B` | `A` is the immediate dependent of `B` | | `A < B` | `A` is the immediate dependent of `B`. |
| `A > B` | `A` is the immediate head of `B` | | `A > B` | `A` is the immediate head of `B`. |
| `A << B` | `A` is the dependent in a chain to `B` following dep->head paths | | `A << B` | `A` is the dependent in a chain to `B` following dep &rarr; head paths. |
| `A >> B` | `A` is the head in a chain to `B` following head->dep paths | | `A >> B` | `A` is the head in a chain to `B` following head &rarr; dep paths. |
| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree | | `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. |
| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_ | | `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_ | | `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_ | | `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1` | | `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. |
| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1` | | `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. |
| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i` | | `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. |
| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i` | | `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. |
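The linear-order and sibling operators boil down to comparisons of token indices, plus a shared-parent check for the `$` variants. Here is a rough stand-alone sketch of those definitions, written without spaCy. The `heads` list is a hand-written parse of "Smith founded a healthcare company" (an assumption, not parser output), `OPS` and `same_parent` are hypothetical names, and the "same dependency tree" condition holds trivially because the toy sentence has a single root.

```python
from typing import Callable, Dict

# heads[i] is the index of token i's head; the root token points to itself.
words = ["Smith", "founded", "a", "healthcare", "company"]
heads = [1, 1, 4, 4, 1]


def same_parent(a: int, b: int) -> bool:
    """True if tokens a and b are distinct non-root tokens sharing a head."""
    return a != b and heads[a] == heads[b] and heads[a] != a and heads[b] != b


# Each operator takes the indices of A and B and mirrors the table above.
OPS: Dict[str, Callable[[int, int], bool]] = {
    ".": lambda a, b: a == b - 1,
    ".*": lambda a, b: a < b,
    ";": lambda a, b: a == b + 1,
    ";*": lambda a, b: a > b,
    "$+": lambda a, b: same_parent(a, b) and a == b - 1,
    "$-": lambda a, b: same_parent(a, b) and a == b + 1,
    "$++": lambda a, b: same_parent(a, b) and a < b,
    "$--": lambda a, b: same_parent(a, b) and a > b,
}

smith, a_det, healthcare, company = 0, 2, 3, 4
print(OPS["."](smith, 1))             # True: "Smith" immediately precedes "founded"
print(OPS["$++"](smith, company))     # True: both attach to "founded"
print(OPS["$+"](smith, company))      # False: siblings, but not adjacent
print(OPS["$+"](a_det, healthcare))   # True: adjacent children of "company"
```

Note that the sibling operators are strictly narrower than their linear counterparts: `A $+ B` implies `A . B`, but not the other way around.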
### Designing dependency matcher patterns ### Designing dependency matcher patterns {#dependencymatcher-patterns}
Let's say we want to find sentences describing who founded what kind of company: Let's say we want to find sentences describing who founded what kind of company:
- `Smith founded a healthcare company in 2005.` - _Smith founded a healthcare company in 2005._
- `Williams initially founded an insurance company in 1987.` - _Williams initially founded an insurance company in 1987._
- `Lee, an experienced CEO, has founded two AI startups.` - _Lee, an experienced CEO, has founded two AI startups._
The dependency parse for `Smith founded a healthcare company` shows types of The dependency parse for "Smith founded a healthcare company" shows types of
relations and tokens we want to match: relations and tokens we want to match:
> #### Visualizing the parse
>
> The [`displacy` visualizer](/usage/visualizer) lets you render `Doc` objects
> and their dependency parse and part-of-speech tags:
>
> ```python
> import spacy
> from spacy import displacy
>
> nlp = spacy.load("en_core_web_sm")
> doc = nlp("Smith founded a healthcare company")
> displacy.serve(doc)
> ```
import DisplaCyDepFoundedHtml from 'images/displacy-dep-founded.html' import DisplaCyDepFoundedHtml from 'images/displacy-dep-founded.html'
<Iframe title="displaCy visualization of dependencies" html={DisplaCyDepFoundedHtml} height={450} /> <Iframe title="displaCy visualization of dependencies" html={DisplaCyDepFoundedHtml} height={450} />
The relations we're interested in are: The relations we're interested in are:
- the founder is the subject (`nsubj`) of the token with the text `founded` - the founder is the **subject** (`nsubj`) of the token with the text `founded`
- the company is the object (`dobj`) of `founded` - the company is the **object** (`dobj`) of `founded`
- the kind of company may be an adjective (`amod`, not shown above) or a - the kind of company may be an **adjective** (`amod`, not shown above) or a
compound (`compound`) **compound** (`compound`)
The first step is to pick an anchor token for the pattern. Since it's the root The first step is to pick an **anchor token** for the pattern. Since it's the
of the dependency parse, `founded` is a good choice here. It is often easier to root of the dependency parse, `founded` is a good choice here. It is often
construct patterns when all dependency relation operators point from the head to easier to construct patterns when all dependency relation operators point from
the children. In this example, we'll only use `>`, which connects a head to an the head to the children. In this example, we'll only use `>`, which connects a
immediate dependent as `head > child`. head to an immediate dependent as `head > child`.
The simplest dependency matcher pattern will identify and name a single token in The simplest dependency matcher pattern will identify and name a single token in
the tree: the tree:
@ -1099,7 +1115,6 @@ from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm") nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab) matcher = DependencyMatcher(nlp.vocab)
pattern = [ pattern = [
{ {
"RIGHT_ID": "anchor_founded", # unique name "RIGHT_ID": "anchor_founded", # unique name
@ -1116,6 +1131,7 @@ Now that we have a named anchor token (`anchor_founded`), we can add the founder
as the immediate dependent (`>`) of `founded` with the dependency label `nsubj`: as the immediate dependent (`>`) of `founded` with the dependency label `nsubj`:
```python ```python
### Step 1 {highlight="8,10"}
pattern = [ pattern = [
{ {
"RIGHT_ID": "anchor_founded", "RIGHT_ID": "anchor_founded",
@ -1127,31 +1143,37 @@ pattern = [
"RIGHT_ID": "subject", "RIGHT_ID": "subject",
"RIGHT_ATTRS": {"DEP": "nsubj"}, "RIGHT_ATTRS": {"DEP": "nsubj"},
} }
# ...
] ]
``` ```
The direct object (`dobj`) is added in the same way: The direct object (`dobj`) is added in the same way:
```python ```python
pattern = [ ... ### Step 2 {highlight=""}
pattern = [
# ...
{ {
"LEFT_ID": "anchor_founded", "LEFT_ID": "anchor_founded",
"REL_OP": ">", "REL_OP": ">",
"RIGHT_ID": "founded_object", "RIGHT_ID": "founded_object",
"RIGHT_ATTRS": {"DEP": "dobj"}, "RIGHT_ATTRS": {"DEP": "dobj"},
} }
# ...
] ]
``` ```
When the subject and object tokens are added, they are required to have names When the subject and object tokens are added, they are required to have names
under the key `RIGHT_ID`, which are allowed to be any unique string, e.g. under the key `RIGHT_ID`, which are allowed to be any unique string, e.g.
`founded_subject`. These names can then be used as `LEFT_ID` to link new tokens `founded_subject`. These names can then be used as `LEFT_ID` to **link new
into the pattern. For the final part of our pattern, we'll specify that the tokens into the pattern**. For the final part of our pattern, we'll specify that
token `founded_object` should have a modifier with the dependency relation the token `founded_object` should have a modifier with the dependency relation
`amod` or `compound`: `amod` or `compound`:
```python ```python
pattern = [ ... ### Step 3 {highlight="7"}
pattern = [
# ...
{ {
"LEFT_ID": "founded_object", "LEFT_ID": "founded_object",
"REL_OP": ">", "REL_OP": ">",
@ -1168,8 +1190,6 @@ each new token needs to be linked to an existing token on its left. As for
`founded` in this example, a token may be linked to more than one token on its `founded` in this example, a token may be linked to more than one token on its
right: right:
<!-- TODO: adjust for final example, prettify -->
![Dependency matcher pattern](../images/dep-match-diagram.svg) ![Dependency matcher pattern](../images/dep-match-diagram.svg)
The full pattern comes together as shown in the example below: The full pattern comes together as shown in the example below:
@ -1209,11 +1229,10 @@ pattern = [
matcher.add("FOUNDED", [pattern]) matcher.add("FOUNDED", [pattern])
doc = nlp("Lee, an experienced CEO, has founded two AI startups.") doc = nlp("Lee, an experienced CEO, has founded two AI startups.")
matches = matcher(doc) matches = matcher(doc)
print(matches) # [(4851363122962674176, [6, 0, 10, 9])]
# each token_id corresponds to one pattern dict print(matches) # [(4851363122962674176, [6, 0, 10, 9])]
# Each token_id corresponds to one pattern dict
match_id, token_ids = matches[0] match_id, token_ids = matches[0]
for i in range(len(token_ids)): for i in range(len(token_ids)):
print(pattern[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text) print(pattern[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)

@ -26,6 +26,7 @@ menu:
- [End-to-end project workflows](#features-projects) - [End-to-end project workflows](#features-projects)
- [New built-in components](#features-pipeline-components) - [New built-in components](#features-pipeline-components)
- [New custom component API](#features-components) - [New custom component API](#features-components)
- [Dependency matching](#features-dep-matcher)
- [Python type hints](#features-types) - [Python type hints](#features-types)
- [New methods & attributes](#new-methods) - [New methods & attributes](#new-methods)
- [New & updated documentation](#new-docs) - [New & updated documentation](#new-docs)
@ -152,7 +153,6 @@ add to your pipeline and customize for your use case:
| [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. | | [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. |
| [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. | | [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. |
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. | | [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
| [`DependencyMatcher`](/api/dependencymatcher) | Component for matching subtrees within a dependency parse. |
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). | | [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
<Infobox title="Details & Documentation" emoji="📖" list> <Infobox title="Details & Documentation" emoji="📖" list>
@ -202,6 +202,34 @@ aren't set.
</Infobox> </Infobox>
### Dependency matching {#features-dep-matcher}
<!-- TODO: improve summary -->
> #### Example
>
> ```python
> # TODO: example
> ```
The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within
the dependency parse using
[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
operators. It follows the same API as the token-based [`Matcher`](/api/matcher).
A pattern added to the dependency matcher consists of a **list of
dictionaries**, with each dictionary describing a **token to match** and its
**relation to an existing token** in the pattern.
<Infobox title="Details & Documentation" emoji="📖" list>
- **Usage:**
[Dependency matching](/usage/rule-based-matching#dependencymatcher)
- **API:** [`DependencyMatcher`](/api/dependencymatcher)
- **Implementation:**
[`spacy/matcher/dependencymatcher.pyx`](https://github.com/explosion/spaCy/tree/develop/spacy/matcher/dependencymatcher.pyx)
</Infobox>
### Type hints and type-based data validation {#features-types} ### Type hints and type-based data validation {#features-types}
> #### Example > #### Example