Add SPACY as a Matcher attribute (#6463)

Adriane Boyd 2020-11-30 02:34:50 +01:00 committed by GitHub
parent 3a5cc5f8b4
commit 03ae77e603
3 changed files with 37 additions and 24 deletions


@@ -174,6 +174,10 @@ TOKEN_PATTERN_SCHEMA = {
"title": "Token is the first in a sentence",
"$ref": "#/definitions/boolean_value",
},
"SPACY": {
"title": "Token has a trailing space",
"$ref": "#/definitions/boolean_value",
},
"LIKE_NUM": {
"title": "Token resembles a number",
"$ref": "#/definitions/boolean_value",

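With the schema entry above in place, `SPACY` can be used like any other boolean token attribute in match patterns. A minimal sketch (the pattern and example text are illustrative, not from the commit):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# SPACY is True for tokens followed by whitespace (i.e. token.whitespace_)
pattern = [{"LOWER": "new", "SPACY": True}, {"LOWER": "york"}]
matcher.add("NEW_YORK", [pattern])

doc = nlp("I moved to New York.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "New York"
```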

@@ -440,6 +440,7 @@ def test_attr_pipeline_checks(en_vocab):
([{"IS_LEFT_PUNCT": True}], "``"),
([{"IS_RIGHT_PUNCT": True}], "''"),
([{"IS_STOP": True}], "the"),
([{"SPACY": True}], "the"),
([{"LIKE_NUM": True}], "1"),
([{"LIKE_URL": True}], "http://example.com"),
([{"LIKE_EMAIL": True}], "mail@example.com"),


@@ -157,20 +157,21 @@ The available token pattern keys correspond to a number of
[`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are:
| Attribute | Value Type | Description |
| ------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ORTH` | unicode | The exact verbatim text of a token. |
| `TEXT` <Tag variant="new">2.1</Tag> | unicode | The exact verbatim text of a token. |
| `LOWER` | unicode | The lowercase form of the token text. |
| `LENGTH` | int | The length of the token text. |
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | bool | Token text consists of alphabetic characters, ASCII characters, digits. |
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | bool | Token text is in lowercase, uppercase, titlecase. |
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | bool | Token is punctuation, whitespace, stop word. |
| `IS_SENT_START` | bool | Token is start of sentence. |
| `SPACY` | bool | Token has a trailing space. |
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | bool | Token text resembles a number, URL, email. |
| `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` | unicode | The token's simple and extended part-of-speech tag, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). |
| `ENT_TYPE` | unicode | The token's entity label. |
| `_` <Tag variant="new">2.1</Tag> | dict | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). |
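
Because `SPACY` is a boolean attribute, `{"SPACY": False}` can likewise require that a token has no trailing space, which is useful for matching tokens that are glued together in the original text. A hypothetical sketch (illustrative pattern and text):

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# A number-like token with no trailing space, immediately followed by "%":
# matches "80%" but not "80 %"
matcher.add("PERCENT", [[{"LIKE_NUM": True, "SPACY": False}, {"ORTH": "%"}]])

doc = nlp("Turnout rose 80% this year, not 80 %.")
print([doc[start:end].text for _, start, end in matcher(doc)])  # ['80%']
```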
<Accordion title="Does it matter if the attribute names are uppercase or lowercase?">
@@ -1102,21 +1103,28 @@ powerful model packages with binary weights _and_ rules included!
### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
When using a large number of **phrase patterns** (roughly > 10000), it's useful
to understand how the `add_patterns` function of the `EntityRuler` works. For
each **phrase pattern**, the `EntityRuler` calls the `nlp` object to construct
a `Doc` object. This happens if you add the `EntityRuler` at the end of an
existing pipeline with, for example, a POS tagger, and want to extract matches
based on the patterns' POS signatures.

In this case you would pass a config value of `phrase_matcher_attr="POS"` when
creating the `EntityRuler`.

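For example, a rough sketch (assuming a pretrained pipeline such as `en_core_web_sm` is installed, so tokens have POS tags):

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")  # pipeline with a POS tagger
# Phrase patterns will be matched on token.pos_ instead of the verbatim text
ruler = EntityRuler(nlp, phrase_matcher_attr="POS")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.add_pipe(ruler)
```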
Running the full language pipeline across every pattern in a large list scales
linearly and can therefore take a long time with large numbers of phrase
patterns.

As of spaCy 2.2.4, the `add_patterns` function has been refactored to use
`nlp.pipe` on all phrase patterns, resulting in a roughly 10x-20x speedup with
5,000-100,000 phrase patterns, respectively.

Even with this speedup (but especially if you're using an older version), the
`add_patterns` function can still take a long time.

An easy workaround to make this function run faster is to disable the other
language pipes while adding the phrase patterns:
```python
entityruler = EntityRuler(nlp)
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]  # illustrative patterns
# Disable the other pipeline components so only the tokenizer runs
with nlp.disable_pipes(*nlp.pipe_names):
    entityruler.add_patterns(patterns)
```