From 157caf4dfa6483b751e2ea6afd975e7446fd9ceb Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Fri, 4 Sep 2020 16:30:31 +0200
Subject: [PATCH] WIP: update docs [ci skip]

---
 website/docs/api/dependencymatcher.md | 88 ++++---------
 website/docs/images/dep-match-diagram.svg | 91 +++++---------
 website/docs/images/displacy-dep-founded.html | 2 +-
 website/docs/usage/rule-based-matching.md | 119 ++++++++++--------
 website/docs/usage/v3.md | 30 ++++-
 5 files changed, 154 insertions(+), 176 deletions(-)

diff --git a/website/docs/api/dependencymatcher.md b/website/docs/api/dependencymatcher.md
index 333f82043..c90a715d9 100644
--- a/website/docs/api/dependencymatcher.md
+++ b/website/docs/api/dependencymatcher.md
@@ -11,7 +11,8 @@ and [`PhraseMatcher`](/api/phrasematcher) and lets you match on dependency trees
 using
 [Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html).
 It requires a pretrained [`DependencyParser`](/api/parser) or other component
-that sets the `Token.dep` and `Token.head` attributes.
+that sets the `Token.dep` and `Token.head` attributes. See the
+[usage guide](/usage/rule-based-matching#dependencymatcher) for examples.

 ## Pattern format {#patterns}

@@ -48,63 +49,18 @@ dictionary, which defines an anchor token using only `RIGHT_ID` and

 | Name | Description |
 | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. |
+| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ |
 | `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ |
 | `RIGHT_ID` | A unique name for the right-hand node in the relation.
~~str~~ |
 | `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |

-The first pattern defines an anchor token and each additional token added to the
-pattern is linked to an existing token `LEFT_ID` by the relation `REL_OP` and is
-described by the name `RIGHT_ID` and the attributes `RIGHT_ATTRS`.
+

-Let's say we want to find sentences describing who founded what kind of company:
+For examples of how to construct dependency matcher patterns for different types
+of relations, see the usage guide on
+[dependency matching](/usage/rule-based-matching#dependencymatcher).

-- `Smith founded a healthcare company in 2005.`
-- `Williams initially founded an insurance company in 1987.`
-- `Lee, an established CEO, founded yet another AI startup.`
-
-Since it's the root of the dependency parse, `founded` is a good choice for the
-anchor token in our pattern:
-
-```python
-pattern = [
-    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}}
-]
-```
-
-We can add the subject as the token with the dependency label `nsubj` that is a
-direct child `>` of the anchor token named `anchor_founded`:
-
-```python
-pattern = [
-    {"RIGHT_ID": "anchor_founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
-    {
-        "LEFT_ID": "anchor_founded",
-        "REL_OP": ">",
-        "RIGHT_ID": "subject",
-        "RIGHT_ATTRS": {"DEP": "nsubj"},
-    }
-]
-```
-
-And the direct object along with its modifier:
-
-```python
-pattern = [ ...
-    {
-        "LEFT_ID": "anchor_founded",
-        "REL_OP": ">",
-        "RIGHT_ID": "founded_object",
-        "RIGHT_ATTRS": {"DEP": "dobj"},
-    },
-    {
-        "LEFT_ID": "founded_object",
-        "REL_OP": ">",
-        "RIGHT_ID": "founded_object_modifier",
-        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
-    }
-]
-```
+

 ### Operators

@@ -112,20 +68,20 @@ The following operators are supported by the `DependencyMatcher`, most of which
 come directly from
 [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):

-| Symbol | Description |
-| --------- | ------------------------------------------------------------------------------------------------------------------- |
-| `A < B` | `A` is the immediate dependent of `B` |
-| `A > B` | `A` is the immediate head of `B` |
-| `A << B` | `A` is the dependent in a chain to `B` following dep->head paths |
-| `A >> B` | `A` is the head in a chain to `B` following head->dep paths |
-| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree |
-| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_ |
-| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_ |
-| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_ |
-| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1` |
-| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1` |
-| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i` |
-| `A $-- B` | `B` is a left sibling of `A`, i.e.
`A` and `B` have the same parent and `A.i > B.i` |
+| Symbol | Description |
+| --------- | -------------------------------------------------------------------------------------------------------------------- |
+| `A < B` | `A` is the immediate dependent of `B`. |
+| `A > B` | `A` is the immediate head of `B`. |
+| `A << B` | `A` is the dependent in a chain to `B` following dep → head paths. |
+| `A >> B` | `A` is the head in a chain to `B` following head → dep paths. |
+| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. |
+| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
+| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
+| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
+| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. |
+| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. |
+| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. |
+| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`.
|

 ## DependencyMatcher.\_\_init\_\_ {#init tag="method"}

diff --git a/website/docs/images/dep-match-diagram.svg b/website/docs/images/dep-match-diagram.svg
index f23c573e2..676be4137 100644
--- a/website/docs/images/dep-match-diagram.svg
+++ b/website/docs/images/dep-match-diagram.svg
@@ -1,64 +1,39 @@
[SVG markup stripped in extraction. Text labels removed by this hunk: `ID: founded` / `ORTH: founded`; `ID: subject` / `DEP: nsubj`; `ID: object` / `DEP: dobj`; `ID: modifier` / `DEP: amod | compound`; three `>` operator labels. The added lines carry no text labels.]
diff --git a/website/docs/images/displacy-dep-founded.html b/website/docs/images/displacy-dep-founded.html
index 3f89ffd4a..e22984ee1 100644
--- a/website/docs/images/displacy-dep-founded.html
+++ b/website/docs/images/displacy-dep-founded.html
@@ -20,7 +20,7 @@
-    company.
+    company
diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index 532796303..01d60ddb8 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -974,10 +974,12 @@ to match phrases with the same sequence of punctuation and non-punctuation
 tokens as the pattern. But this can easily get confusing and doesn't have much
 of an advantage over writing one or two token patterns.

-## Dependency Matcher {#dependencymatcher new="3"}
+## Dependency Matcher {#dependencymatcher new="3" model="parser"}

 The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within
-the dependency parse. It requires a model containing a parser such as the
+the dependency parse using
+[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
+operators. It requires a model containing a parser such as the
 [`DependencyParser`](/api/dependencyparser).
Instead of defining a list of
 adjacent tokens as in `Matcher` patterns, the `DependencyMatcher` patterns match
 tokens in the dependency parse and specify the relations between them.
@@ -1014,15 +1016,15 @@ tokens in the dependency parse and specify the relations between them.
 > matches = matcher(doc)
 > ```

-A pattern added to the `DependencyMatcher` consists of a list of dictionaries,
-with each dictionary describing a token to match and its relation to an existing
-token in the pattern. Except for the first dictionary, which defines an anchor
-token using only `RIGHT_ID` and `RIGHT_ATTRS`, each pattern should have the
-following keys:
+A pattern added to the dependency matcher consists of a **list of
+dictionaries**, with each dictionary describing a **token to match** and its
+**relation to an existing token** in the pattern. Except for the first
+dictionary, which defines an anchor token using only `RIGHT_ID` and
+`RIGHT_ATTRS`, each pattern should have the following keys:

 | Name | Description |
 | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. |
+| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ |
 | `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ |
 | `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ |
 | `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ |
@@ -1040,54 +1042,68 @@ can be used as `LEFT_ID` in another dict.
-### Dependency matcher operators
+### Dependency matcher operators {#dependencymatcher-operators}

 The following operators are supported by the `DependencyMatcher`, most of which
 come directly from
 [Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html):

-| Symbol | Description |
-| --------- | ------------------------------------------------------------------------------------------------------------------- |
-| `A < B` | `A` is the immediate dependent of `B` |
-| `A > B` | `A` is the immediate head of `B` |
-| `A << B` | `A` is the dependent in a chain to `B` following dep->head paths |
-| `A >> B` | `A` is the head in a chain to `B` following head->dep paths |
-| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree |
-| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_ |
-| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_ |
-| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_ |
-| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1` |
-| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1` |
-| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i` |
-| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i` |
+| Symbol | Description |
+| --------- | -------------------------------------------------------------------------------------------------------------------- |
+| `A < B` | `A` is the immediate dependent of `B`. |
+| `A > B` | `A` is the immediate head of `B`. |
+| `A << B` | `A` is the dependent in a chain to `B` following dep → head paths.
|
+| `A >> B` | `A` is the head in a chain to `B` following head → dep paths. |
+| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. |
+| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
+| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. |
+| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. |
+| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. |
+| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. |
+| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. |
+| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. |

-### Designing dependency matcher patterns
+### Designing dependency matcher patterns {#dependencymatcher-patterns}

 Let's say we want to find sentences describing who founded what kind of company:

-- `Smith founded a healthcare company in 2005.`
-- `Williams initially founded an insurance company in 1987.`
-- `Lee, an experienced CEO, has founded two AI startups.`
+- _Smith founded a healthcare company in 2005._
+- _Williams initially founded an insurance company in 1987._
+- _Lee, an experienced CEO, has founded two AI startups._

-The dependency parse for `Smith founded a healthcare company` shows types of
+The dependency parse for "Smith founded a healthcare company" shows types of
 relations and tokens we want to match:

+> #### Visualizing the parse
+>
+> The [`displacy` visualizer](/usage/visualizer) lets you render `Doc` objects
+> and their dependency parse and part-of-speech tags:
+>
+> ```python
+> import spacy
+> from spacy import displacy
+>
+> nlp = spacy.load("en_core_web_sm")
+> doc = nlp("Smith founded a healthcare company")
+> displacy.serve(doc)
+> ```
+
 import DisplaCyDepFoundedHtml from 'images/displacy-dep-founded.html'

 The relations we're interested in are:

-- the founder is the subject (`nsubj`) of the token with the text `founded`
-- the company is the object (`dobj`) of `founded`
-- the kind of company may be an adjective (`amod`, not shown above) or a
-  compound (`compound`)
+- the founder is the **subject** (`nsubj`) of the token with the text `founded`
+- the company is the **object** (`dobj`) of `founded`
+- the kind of company may be an **adjective** (`amod`, not shown above) or a
+  **compound** (`compound`)

-The first step is to pick an anchor token for the pattern. Since it's the root
-of the dependency parse, `founded` is a good choice here. It is often easier to
-construct patterns when all dependency relation operators point from the head to
-the children. In this example, we'll only use `>`, which connects a head to an
-immediate dependent as `head > child`.
+The first step is to pick an **anchor token** for the pattern. Since it's the
+root of the dependency parse, `founded` is a good choice here. It is often
+easier to construct patterns when all dependency relation operators point from
+the head to the children. In this example, we'll only use `>`, which connects a
+head to an immediate dependent as `head > child`.
 The simplest dependency matcher pattern will identify and name a single token in
 the tree:
@@ -1099,11 +1115,10 @@ from spacy.matcher import DependencyMatcher

 nlp = spacy.load("en_core_web_sm")
 matcher = DependencyMatcher(nlp.vocab)
-
 pattern = [
     {
-        "RIGHT_ID": "anchor_founded", # unique name
-        "RIGHT_ATTRS": {"ORTH": "founded"} # token pattern for "founded"
+        "RIGHT_ID": "anchor_founded",  # unique name
+        "RIGHT_ATTRS": {"ORTH": "founded"}  # token pattern for "founded"
     }
 ]
 matcher.add("FOUNDED", [pattern])
@@ -1116,6 +1131,7 @@ Now that we have a named anchor token (`anchor_founded`), we can add the founder
 as the immediate dependent (`>`) of `founded` with the dependency label `nsubj`:

 ```python
+### Step 1 {highlight="8,10"}
 pattern = [
     {
         "RIGHT_ID": "anchor_founded",
@@ -1127,31 +1143,37 @@ pattern = [
         "RIGHT_ID": "subject",
         "RIGHT_ATTRS": {"DEP": "nsubj"},
     }
+    # ...
 ]
 ```

 The direct object (`dobj`) is added in the same way:

 ```python
-pattern = [ ...
+### Step 2 {highlight=""}
+pattern = [
+    # ...
     {
         "LEFT_ID": "anchor_founded",
         "REL_OP": ">",
         "RIGHT_ID": "founded_object",
         "RIGHT_ATTRS": {"DEP": "dobj"},
     }
+    # ...
 ]
 ```

 When the subject and object tokens are added, they are required to have names
 under the key `RIGHT_ID`, which are allowed to be any unique string, e.g.
-`founded_subject`. These names can then be used as `LEFT_ID` to link new tokens
-into the pattern. For the final part of our pattern, we'll specify that the
-token `founded_object` should have a modifier with the dependency relation
+`founded_subject`. These names can then be used as `LEFT_ID` to **link new
+tokens into the pattern**. For the final part of our pattern, we'll specify that
+the token `founded_object` should have a modifier with the dependency relation
 `amod` or `compound`:

 ```python
-pattern = [ ...
+### Step 3 {highlight="7"}
+pattern = [
+    # ...
     {
         "LEFT_ID": "founded_object",
         "REL_OP": ">",
@@ -1168,8 +1190,6 @@ each new token needs to be linked to an existing token on its left.
 As for `founded` in this example, a token may be linked to more than one token
 on its right:

-
-
 ![Dependency matcher pattern](../images/dep-match-diagram.svg)

 The full pattern comes together as shown in the example below:
@@ -1209,11 +1229,10 @@ pattern = [

 matcher.add("FOUNDED", [pattern])
 doc = nlp("Lee, an experienced CEO, has founded two AI startups.")
-
 matches = matcher(doc)

-print(matches) # [(4851363122962674176, [6, 0, 10, 9])]
-# each token_id corresponds to one pattern dict
+print(matches)  # [(4851363122962674176, [6, 0, 10, 9])]
+# Each token_id corresponds to one pattern dict
 match_id, token_ids = matches[0]
 for i in range(len(token_ids)):
     print(pattern[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)
diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md
index bce261c42..e5228ab21 100644
--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@@ -26,6 +26,7 @@ menu:
 - [End-to-end project workflows](#features-projects)
 - [New built-in components](#features-pipeline-components)
 - [New custom component API](#features-components)
+- [Dependency matching](#features-dep-matcher)
 - [Python type hints](#features-types)
 - [New methods & attributes](#new-methods)
 - [New & updated documentation](#new-docs)
@@ -152,7 +153,6 @@ add to your pipeline and customize for your use case:
 | [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. |
 | [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. |
 | [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
-| [`DependencyMatcher`](/api/dependencymatcher) | Component for matching subtrees within a dependency parse. |
 | [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/embeddings-transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers).
|
@@ -202,6 +202,34 @@ aren't set.

+### Dependency matching {#features-dep-matcher}
+
+
+
+> #### Example
+>
+> ```python
+> # TODO: example
+> ```
+
+The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within
+the dependency parse using
+[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html)
+operators. It follows the same API as the token-based [`Matcher`](/api/matcher).
+A pattern added to the dependency matcher consists of a **list of
+dictionaries**, with each dictionary describing a **token to match** and its
+**relation to an existing token** in the pattern.
+
+
+
+- **Usage:**
+  [Dependency matching](/usage/rule-based-matching#dependencymatcher),
+- **API:** [`DependencyMatcher`](/api/dependencymatcher),
+- **Implementation:**
+  [`spacy/matcher/dependencymatcher.pyx`](https://github.com/explosion/spaCy/tree/develop/spacy/matcher/dependencymatcher.pyx)
+
+
+
 ### Type hints and type-based data validation {#features-types}

 > #### Example