WIP: update docs [ci skip]

2025-07-15 18:52:29 +03:00 · 2020-09-04 16:30:31 +02:00 · 2020-09-04 16:30:31 +02:00 · 157caf4dfa
commit 157caf4dfa
parent f174c7b1f3
5 changed files with 154 additions and 176 deletions
--- a/website/docs/api/dependencymatcher.md
+++ b/website/docs/api/dependencymatcher.md
@@ -11,7 +11,8 @@ and [`PhraseMatcher`](/api/phrasematcher) and lets you match on dependency trees
 using
 [Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).
 It requires a pretrained [`DependencyParser`](/api/parser) or other component
-that sets the `Token.dep` and `Token.head` attributes.
+that sets the `Token.dep` and `Token.head` attributes. See the
+[usage guide](/usage/rule-based-matching#dependencymatcher) for examples.

 ## Pattern format {#patterns}

@@ -48,63 +49,18 @@ dictionary, which defines an anchor token using only `RIGHT_ID` and

 | Name          | Description                                                                                                                                                            |
 | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `LEFT_ID`     | The name of the left-hand node in the relation, which has been defined in an earlier node.                                                                             |
+| `LEFT_ID`     | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~                                                                     |
 | `REL_OP`      | An operator that describes how the two nodes are related. ~~str~~                                                                                                      |
 | `RIGHT_ID`    | A unique name for the right-hand node in the relation. ~~str~~                                                                                                         |
 | `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |

-The first pattern defines an anchor token and each additional token added to the
-pattern is linked to an existing token `LEFT_ID` by the relation `REL_OP` and is
-described by the name `RIGHT_ID` and the attributes `RIGHT_ATTRS`.
+<Infobox title="Designing dependency matcher patterns" emoji="📖">

-Let's say we want to find sentences describing who founded what kind of company:
+For examples of how to construct dependency matcher patterns for different types
+of relations, see the usage guide on
+[dependency matching](/usage/rule-based-matching#dependencymatcher).

- `Smith founded a healthcare company in 2005.`
- `Williams initially founded an insurance company in 1987.`
- `Lee, an established CEO, founded yet another AI startup.`
-
-Since it's the root of the dependency parse, `founded` is a good choice for the
-anchor token in our pattern:
-
-```python
-pattern = [
-    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}}
-]
-```
-
-We can add the subject as the token with the dependency label `nsubj` that is a
-direct child `>` of the anchor token named `anchor_founded`:
-
-```python
-pattern = [
-    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
-    {
-        "LEFT_ID": "anchor_founded",
-        "REL_OP": ">",
-        "RIGHT_ID": "subject",
-        "RIGHT_ATTRS": {"DEP": "nsubj"},
-    }
-]
-```
-
-And the direct object along with its modifier:
-
-```python
-pattern = [ ...
-    {
-        "LEFT_ID": "anchor_founded",
-        "REL_OP": ">",
-        "RIGHT_ID": "founded_object",
-        "RIGHT_ATTRS": {"DEP": "dobj"},
-    },
-    {
-        "LEFT_ID": "founded_object",
-        "REL_OP": ">",
-        "RIGHT_ID": "founded_object_modifier",
-        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
-    }
-]
-```
+</Infobox>

 ### Operators

@@ -112,20 +68,20 @@ The following operators are supported by the `DependencyMatcher`, most of which
 come directly from
 [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):

-| Symbol    | Description                                                                                                         |
-| --------- | ------------------------------------------------------------------------------------------------------------------- |
-| `A < B`   | `A` is the immediate dependent of `B`                                                                               |
-| `A > B`   | `A` is the immediate head of `B`                                                                                    |
-| `A << B`  | `A` is the dependent in a chain to `B` following dep->head paths                                                    |
-| `A >> B`  | `A` is the head in a chain to `B` following head->dep paths                                                         |
-| `A . B`   | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree                   |
-| `A .* B`  | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_                 |
-| `A ; B`   | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_ |
-| `A ;* B`  | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_                  |
-| `A $+ B`  | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`                 |
-| `A $- B`  | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`                  |
-| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`                                |
-| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`                                 |
+| Symbol    | Description                                                                                                          |
+| --------- | -------------------------------------------------------------------------------------------------------------------- |
+| `A < B`   | `A` is the immediate dependent of `B`.                                                                               |
+| `A > B`   | `A` is the immediate head of `B`.                                                                                    |
+| `A << B`  | `A` is the dependent in a chain to `B` following dep &rarr; head paths.                                              |
+| `A >> B`  | `A` is the head in a chain to `B` following head &rarr; dep paths.                                                   |
+| `A . B`   | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree.                   |
+| `A .* B`  | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_.                 |
+| `A ; B`   | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
+| `A ;* B`  | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_.                  |
+| `A $+ B`  | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`.                 |
+| `A $- B`  | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`.                  |
+| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`.                                |
+| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`.                                 |

 ## DependencyMatcher.\_\_init\_\_ {#init tag="method"}
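
Note on the operator table above: a minimal sketch, in plain Python rather than spaCy, of what a few of the operators check. The hand-written parse, the `heads` list and the helper names are assumptions made for illustration only; they are not part of the `DependencyMatcher` API, and the real implementation may differ.

```python
# Toy parse of "Smith founded a healthcare company" (written by hand, not
# produced by a parser). heads[i] is the index of token i's head, and the
# root points at itself, mirroring how Token.head behaves for a root token.
words = ["Smith", "founded", "a", "healthcare", "company"]
heads = [1, 1, 4, 4, 1]


def is_head_of(a, b):
    """A > B: A is the immediate head of B."""
    return a != b and heads[b] == a


def is_ancestor_of(a, b):
    """A >> B: A is reached from B by following dep -> head links."""
    node = b
    while heads[node] != node:  # stop once we arrive at the root
        node = heads[node]
        if node == a:
            return True
    return False


def immediately_precedes(a, b):
    """A . B: A.i == B.i - 1 (the toy parse is a single tree)."""
    return a == b - 1


def right_immediate_sibling(a, b):
    """A $+ B: A and B share a parent and A.i == B.i - 1."""
    both_have_parent = heads[a] != a and heads[b] != b
    return both_have_parent and heads[a] == heads[b] and a == b - 1


founded, company, healthcare = 1, 4, 3
print(is_head_of(founded, company))         # founded > company
print(is_ancestor_of(founded, healthcare))  # founded >> healthcare
```

In this toy tree, `founded` is the direct head of `company` and an ancestor of `healthcare`, which is the same distinction the table draws between `>` and `>>`.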

--- a/website/docs/images/dep-match-diagram.svg
+++ b/website/docs/images/dep-match-diagram.svg
--- a/website/docs/images/displacy-dep-founded.html
+++ b/website/docs/images/displacy-dep-founded.html
@@ -20,7 +20,7 @@
 </text>

 <text class="displacy-token" fill="currentColor" text-anchor="middle" y="309.5">
-    <tspan class="displacy-word" fill="currentColor" x="750">company.</tspan>
+    <tspan class="displacy-word" fill="currentColor" x="750">company</tspan>
    <tspan class="displacy-tag" dy="2em" fill="currentColor" x="750"></tspan>
 </text>

--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -974,10 +974,12 @@ to match phrases with the same sequence of punctuation and non-punctuation
 tokens as the pattern. But this can easily get confusing and doesn't have much
 of an advantage over writing one or two token patterns.

-## Dependency Matcher {#dependencymatcher new="3"}
+## Dependency Matcher {#dependencymatcher new="3" model="parser"}

 The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within
-the dependency parse. It requires a model containing a parser such as the
+the dependency parse using
+[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
+operators. It requires a model containing a parser such as the
 [`DependencyParser`](/api/dependencyparser). Instead of defining a list of
 adjacent tokens as in `Matcher` patterns, the `DependencyMatcher` patterns match
 tokens in the dependency parse and specify the relations between them.
@@ -1014,15 +1016,15 @@ tokens in the dependency parse and specify the relations between them.
 > matches = matcher(doc)
 > ```

-A pattern added to the `DependencyMatcher` consists of a list of dictionaries,
-with each dictionary describing a token to match and its relation to an existing
-token in the pattern. Except for the first dictionary, which defines an anchor
-token using only `RIGHT_ID` and `RIGHT_ATTRS`, each pattern should have the
-following keys:
+A pattern added to the dependency matcher consists of a **list of
+dictionaries**, with each dictionary describing a **token to match** and its
+**relation to an existing token** in the pattern. Except for the first
+dictionary, which defines an anchor token using only `RIGHT_ID` and
+`RIGHT_ATTRS`, each dictionary should have the following keys:

 | Name          | Description                                                                                                                                                            |
 | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `LEFT_ID`     | The name of the left-hand node in the relation, which has been defined in an earlier node.                                                                             |
+| `LEFT_ID`     | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~                                                                     |
 | `REL_OP`      | An operator that describes how the two nodes are related. ~~str~~                                                                                                      |
 | `RIGHT_ID`    | A unique name for the right-hand node in the relation. ~~str~~                                                                                                         |
 | `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |
@@ -1040,54 +1042,68 @@ can be used as `LEFT_ID` in another dict.

 </Infobox>

-### Dependency matcher operators
+### Dependency matcher operators {#dependencymatcher-operators}

 The following operators are supported by the `DependencyMatcher`, most of which
 come directly from
 [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):

-| Symbol    | Description                                                                                                         |
-| --------- | ------------------------------------------------------------------------------------------------------------------- |
-| `A < B`   | `A` is the immediate dependent of `B`                                                                               |
-| `A > B`   | `A` is the immediate head of `B`                                                                                    |
-| `A << B`  | `A` is the dependent in a chain to `B` following dep->head paths                                                    |
-| `A >> B`  | `A` is the head in a chain to `B` following head->dep paths                                                         |
-| `A . B`   | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree                   |
-| `A .* B`  | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_                 |
-| `A ; B`   | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_ |
-| `A ;* B`  | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_                  |
-| `A $+ B`  | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`                 |
-| `A $- B`  | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`                  |
-| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`                                |
-| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`                                 |
+| Symbol    | Description                                                                                                          |
+| --------- | -------------------------------------------------------------------------------------------------------------------- |
+| `A < B`   | `A` is the immediate dependent of `B`.                                                                               |
+| `A > B`   | `A` is the immediate head of `B`.                                                                                    |
+| `A << B`  | `A` is the dependent in a chain to `B` following dep &rarr; head paths.                                              |
+| `A >> B`  | `A` is the head in a chain to `B` following head &rarr; dep paths.                                                   |
+| `A . B`   | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree.                   |
+| `A .* B`  | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_.                 |
+| `A ; B`   | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
+| `A ;* B`  | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_.                  |
+| `A $+ B`  | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`.                 |
+| `A $- B`  | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`.                  |
+| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`.                                |
+| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`.                                 |

-### Designing dependency matcher patterns
+### Designing dependency matcher patterns {#dependencymatcher-patterns}

 Let's say we want to find sentences describing who founded what kind of company:

- `Smith founded a healthcare company in 2005.`
- `Williams initially founded an insurance company in 1987.`
- `Lee, an experienced CEO, has founded two AI startups.`
+- _Smith founded a healthcare company in 2005._
+- _Williams initially founded an insurance company in 1987._
+- _Lee, an experienced CEO, has founded two AI startups._

-The dependency parse for `Smith founded a healthcare company` shows types of
+The dependency parse for "Smith founded a healthcare company" shows the types of
 relations and tokens we want to match:

+> #### Visualizing the parse
+>
+> The [`displacy` visualizer](/usage/visualizer) lets you render `Doc` objects
+> and their dependency parse and part-of-speech tags:
+>
+> ```python
+> import spacy
+> from spacy import displacy
+>
+> nlp = spacy.load("en_core_web_sm")
+> doc = nlp("Smith founded a healthcare company")
+> displacy.serve(doc)
+> ```
+
 import DisplaCyDepFoundedHtml from 'images/displacy-dep-founded.html'

 <Iframe title="displaCy visualization of dependencies" html={DisplaCyDepFoundedHtml} height={450} />

 The relations we're interested in are:

- the founder is the subject (`nsubj`) of the token with the text `founded`
- the company is the object (`dobj`) of `founded`
- the kind of company may be an adjective (`amod`, not shown above) or a
-  compound (`compound`)
+- the founder is the **subject** (`nsubj`) of the token with the text `founded`
+- the company is the **object** (`dobj`) of `founded`
+- the kind of company may be an **adjective** (`amod`, not shown above) or a
+  **compound** (`compound`)

-The first step is to pick an anchor token for the pattern. Since it's the root
-of the dependency parse, `founded` is a good choice here. It is often easier to
-construct patterns when all dependency relation operators point from the head to
-the children. In this example, we'll only use `>`, which connects a head to an
-immediate dependent as `head > child`.
+The first step is to pick an **anchor token** for the pattern. Since it's the
+root of the dependency parse, `founded` is a good choice here. It is often
+easier to construct patterns when all dependency relation operators point from
+the head to the children. In this example, we'll only use `>`, which connects a
+head to an immediate dependent as `head > child`.

 The simplest dependency matcher pattern will identify and name a single token in
 the tree:
@@ -1099,11 +1115,10 @@ from spacy.matcher import DependencyMatcher

 nlp = spacy.load("en_core_web_sm")
 matcher = DependencyMatcher(nlp.vocab)
-
 pattern = [
  {
-    "RIGHT_ID": "anchor_founded",      # unique name
-    "RIGHT_ATTRS": {"ORTH": "founded"} # token pattern for "founded"
+    "RIGHT_ID": "anchor_founded",       # unique name
+    "RIGHT_ATTRS": {"ORTH": "founded"}  # token pattern for "founded"
  }
 ]
 matcher.add("FOUNDED", [pattern])
@@ -1116,6 +1131,7 @@ Now that we have a named anchor token (`anchor_founded`), we can add the founder
 as the immediate dependent (`>`) of `founded` with the dependency label `nsubj`:

 ```python
+### Step 1 {highlight="8,10"}
 pattern = [
    {
        "RIGHT_ID": "anchor_founded",
@@ -1127,31 +1143,37 @@ pattern = [
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    }
+    # ...
 ]
 ```

 The direct object (`dobj`) is added in the same way:

 ```python
-pattern = [ ...
+### Step 2 {highlight=""}
+pattern = [
+    # ...
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object",
        "RIGHT_ATTRS": {"DEP": "dobj"},
    }
+    # ...
 ]
 ```

 When the subject and object tokens are added, they are required to have names
 under the key `RIGHT_ID`, which are allowed to be any unique string, e.g.
-`founded_subject`. These names can then be used as `LEFT_ID` to link new tokens
-into the pattern. For the final part of our pattern, we'll specify that the
-token `founded_object` should have a modifier with the dependency relation
+`founded_subject`. These names can then be used as `LEFT_ID` to **link new
+tokens into the pattern**. For the final part of our pattern, we'll specify that
+the token `founded_object` should have a modifier with the dependency relation
 `amod` or `compound`:

 ```python
-pattern = [ ...
+### Step 3 {highlight="7"}
+pattern = [
+    # ...
    {
        "LEFT_ID": "founded_object",
        "REL_OP": ">",
@@ -1168,8 +1190,6 @@ each new token needs to be linked to an existing token on its left. As for
 `founded` in this example, a token may be linked to more than one token on its
 right:

-<!-- TODO: adjust for final example, prettify -->
-
 ![Dependency matcher pattern](../images/dep-match-diagram.svg)

 The full pattern comes together as shown in the example below:
@@ -1209,11 +1229,10 @@ pattern = [

 matcher.add("FOUNDED", [pattern])
 doc = nlp("Lee, an experienced CEO, has founded two AI startups.")
-
 matches = matcher(doc)
-print(matches) # [(4851363122962674176, [6, 0, 10, 9])]

-# each token_id corresponds to one pattern dict
+print(matches) # [(4851363122962674176, [7, 0, 10, 9])]
+# Each token_id corresponds to one pattern dict
 match_id, token_ids = matches[0]
 for i in range(len(token_ids)):
    print(pattern[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)
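
Note on the pattern format described in this file: a minimal sketch of the structural rules stated above (the first dictionary uses only `RIGHT_ID` and `RIGHT_ATTRS`, later dictionaries use all four keys, and every `LEFT_ID` must name a node defined earlier). The helper `check_pattern` is hypothetical and written for illustration; it is not spaCy's own validation, which may check more or report errors differently.

```python
ALL_KEYS = {"LEFT_ID", "REL_OP", "RIGHT_ID", "RIGHT_ATTRS"}
ANCHOR_KEYS = {"RIGHT_ID", "RIGHT_ATTRS"}


def check_pattern(pattern):
    """Return a list of structural problems found in a pattern (hypothetical helper)."""
    problems = []
    defined = set()  # RIGHT_ID names introduced so far
    for i, node in enumerate(pattern):
        # The anchor (first dict) takes two keys, every later dict takes four.
        expected = ANCHOR_KEYS if i == 0 else ALL_KEYS
        if set(node) != expected:
            problems.append(f"dict {i}: expected keys {sorted(expected)}")
            continue
        # A new token must be linked to a token named earlier in the list.
        if i > 0 and node["LEFT_ID"] not in defined:
            problems.append(f"dict {i}: LEFT_ID {node['LEFT_ID']!r} is not defined earlier")
        if node["RIGHT_ID"] in defined:
            problems.append(f"dict {i}: RIGHT_ID {node['RIGHT_ID']!r} is not unique")
        defined.add(node["RIGHT_ID"])
    return problems


pattern = [
    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {"LEFT_ID": "anchor_founded", "REL_OP": ">",
     "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "anchor_founded", "REL_OP": ">",
     "RIGHT_ID": "founded_object", "RIGHT_ATTRS": {"DEP": "dobj"}},
    {"LEFT_ID": "founded_object", "REL_OP": ">",
     "RIGHT_ID": "founded_object_modifier",
     "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}}},
]
problems = check_pattern(pattern)
print(problems)
```

Because the check walks the list in order, moving the `founded_object_modifier` dictionary ahead of `founded_object` would be reported as an undefined `LEFT_ID`, which is the ordering constraint the usage guide describes.
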
--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@@ -26,6 +26,7 @@ menu:
 - [End-to-end project workflows](#features-projects)
 - [New built-in components](#features-pipeline-components)
 - [New custom component API](#features-components)
+- [Dependency matching](#features-dep-matcher)
 - [Python type hints](#features-types)
 - [New methods & attributes](#new-methods)
 - [New & updated documentation](#new-docs)
@@ -152,7 +153,6 @@ add to your pipeline and customize for your use case:
 | [`Morphologizer`](/api/morphologizer)           | Trainable component to predict morphological features.                                                                                                                                                                  |
 | [`Lemmatizer`](/api/lemmatizer)                 | Standalone component for rule-based and lookup lemmatization.                                                                                                                                                           |
 | [`AttributeRuler`](/api/attributeruler)         | Component for setting token attributes using match patterns.                                                                                                                                                            |
-| [`DependencyMatcher`](/api/dependencymatcher)   | Component for matching subtrees within a dependency parse.                                                                                                                                                              |
 | [`Transformer`](/api/transformer)               | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |

 <Infobox title="Details & Documentation" emoji="📖" list>
@@ -202,6 +202,34 @@ aren't set.

 </Infobox>

+### Dependency matching {#features-dep-matcher}
+
+<!-- TODO: improve summary -->
+
+> #### Example
+>
+> ```python
+> # TODO: example
+> ```
+
+The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within
+the dependency parse using
+[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
+operators. It follows the same API as the token-based [`Matcher`](/api/matcher).
+A pattern added to the dependency matcher consists of a **list of
+dictionaries**, with each dictionary describing a **token to match** and its
+**relation to an existing token** in the pattern.
+
+<Infobox title="Details & Documentation" emoji="📖" list>
+
+- **Usage:**
+  [Dependency matching](/usage/rule-based-matching#dependencymatcher)
+- **API:** [`DependencyMatcher`](/api/dependencymatcher)
+- **Implementation:**
+  [`spacy/matcher/dependencymatcher.pyx`](https://github.com/explosion/spaCy/tree/develop/spacy/matcher/dependencymatcher.pyx)
+
+</Infobox>
+
 ### Type hints and type-based data validation {#features-types}

 > #### Example