Add SPACY as a Matcher attribute (#6463)

Adriane Boyd 2020-11-30 02:34:50 +01:00 committed by GitHub
parent 3a5cc5f8b4
commit 03ae77e603
3 changed files with 37 additions and 24 deletions


@@ -174,6 +174,10 @@ TOKEN_PATTERN_SCHEMA = {
             "title": "Token is the first in a sentence",
             "$ref": "#/definitions/boolean_value",
         },
+        "SPACY": {
+            "title": "Token has a trailing space",
+            "$ref": "#/definitions/boolean_value",
+        },
         "LIKE_NUM": {
             "title": "Token resembles a number",
             "$ref": "#/definitions/boolean_value",


@@ -440,6 +440,7 @@ def test_attr_pipeline_checks(en_vocab):
         ([{"IS_LEFT_PUNCT": True}], "``"),
         ([{"IS_RIGHT_PUNCT": True}], "''"),
         ([{"IS_STOP": True}], "the"),
+        ([{"SPACY": True}], "the"),
         ([{"LIKE_NUM": True}], "1"),
         ([{"LIKE_URL": True}], "http://example.com"),
         ([{"LIKE_EMAIL": True}], "mail@example.com"),


@@ -158,7 +158,7 @@ The available token pattern keys correspond to a number of
 rule-based matching are:
 
 | Attribute | Value Type | Description |
-| -------------------------------------- | ------------- | ------------------------------------------------------------------------------------------------------ |
+| ------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `ORTH` | unicode | The exact verbatim text of a token. |
 | `TEXT` <Tag variant="new">2.1</Tag> | unicode | The exact verbatim text of a token. |
 | `LOWER` | unicode | The lowercase form of the token text. |
@@ -167,6 +167,7 @@ rule-based matching are:
 | `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | bool | Token text is in lowercase, uppercase, titlecase. |
 | `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | bool | Token is punctuation, whitespace, stop word. |
 | `IS_SENT_START` | bool | Token is start of sentence. |
+| `SPACY` | bool | Token has a trailing space. |
 | `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | bool | Token text resembles a number, URL, email. |
 | `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` | unicode | The token's simple and extended part-of-speech tag, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). |
 | `ENT_TYPE` | unicode | The token's entity label. |
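
With the row above in place, `SPACY` can be combined with other token attributes. A minimal sketch, assuming the v2.x `Matcher` API, that uses `"SPACY": False` to require tokens written without intervening spaces (e.g. a hyphenated compound):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "SPACY": False requires that a token has no trailing space, so this
# only matches "non-profit" written without spaces around the hyphen.
pattern = [
    {"LOWER": "non", "SPACY": False},
    {"ORTH": "-", "SPACY": False},
    {"LOWER": "profit"},
]
matcher.add("NON_PROFIT", None, pattern)

doc = nlp("They run a non-profit organization.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # non-profit
```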
@@ -1102,21 +1103,28 @@ powerful model packages with binary weights _and_ rules included!
 
 ### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
 
-When using a large amount of **phrase patterns** (roughly > 10000) it's useful to understand how the `add_patterns` function of the EntityRuler works. For each **phrase pattern**,
-the EntityRuler calls the nlp object to construct a doc object. This happens in case you try
-to add the EntityRuler at the end of an existing pipeline with, for example, a POS tagger and want to
-extract matches based on the pattern's POS signature.
+When using a large amount of **phrase patterns** (roughly > 10000) it's useful
+to understand how the `add_patterns` function of the EntityRuler works. For each
+**phrase pattern**, the EntityRuler calls the nlp object to construct a doc
+object. This happens in case you try to add the EntityRuler at the end of an
+existing pipeline with, for example, a POS tagger and want to extract matches
+based on the pattern's POS signature.
 
-In this case you would pass a config value of `phrase_matcher_attr="POS"` for the EntityRuler.
+In this case you would pass a config value of `phrase_matcher_attr="POS"` for
+the EntityRuler.
 
-Running the full language pipeline across every pattern in a large list scales linearly and can therefore take a long time on large amounts of phrase patterns.
+Running the full language pipeline across every pattern in a large list scales
+linearly and can therefore take a long time on large amounts of phrase patterns.
 
-As of spaCy 2.2.4 the `add_patterns` function has been refactored to use nlp.pipe on all phrase patterns resulting in about a 10x-20x speed up with 5,000-100,000 phrase patterns respectively.
+As of spaCy 2.2.4 the `add_patterns` function has been refactored to use
+nlp.pipe on all phrase patterns resulting in about a 10x-20x speed up with
+5,000-100,000 phrase patterns respectively.
 
-Even with this speedup (but especially if you're using an older version) the `add_patterns` function can still take a long time.
+Even with this speedup (but especially if you're using an older version) the
+`add_patterns` function can still take a long time.
 
-An easy workaround to make this function run faster is disabling the other language pipes
-while adding the phrase patterns.
+An easy workaround to make this function run faster is disabling the other
+language pipes while adding the phrase patterns.
 
 ```python
 entityruler = EntityRuler(nlp)
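
The fenced example above is cut off at the hunk boundary. A sketch of the complete workaround the section describes, assuming the v2.x API (the model name and pattern contents are illustrative):

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")  # illustrative model
entityruler = EntityRuler(nlp, phrase_matcher_attr="POS")
patterns = [{"label": "ORG", "pattern": f"company {i}"} for i in range(50000)]

# Disable every pipe except the tagger while the patterns are added, so the
# nlp calls made by add_patterns only run the component the patterns need.
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
with nlp.disable_pipes(*other_pipes):
    entityruler.add_patterns(patterns)
```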