Add SPACY as a Matcher attribute (#6463)

Adriane Boyd 2020-11-30 02:34:50 +01:00 committed by GitHub
parent 3a5cc5f8b4
commit 03ae77e603
3 changed files with 37 additions and 24 deletions


@@ -174,6 +174,10 @@ TOKEN_PATTERN_SCHEMA = {
             "title": "Token is the first in a sentence",
             "$ref": "#/definitions/boolean_value",
         },
+        "SPACY": {
+            "title": "Token has a trailing space",
+            "$ref": "#/definitions/boolean_value",
+        },
         "LIKE_NUM": {
             "title": "Token resembles a number",
             "$ref": "#/definitions/boolean_value",


@@ -440,6 +440,7 @@ def test_attr_pipeline_checks(en_vocab):
         ([{"IS_LEFT_PUNCT": True}], "``"),
         ([{"IS_RIGHT_PUNCT": True}], "''"),
         ([{"IS_STOP": True}], "the"),
+        ([{"SPACY": True}], "the"),
         ([{"LIKE_NUM": True}], "1"),
         ([{"LIKE_URL": True}], "http://example.com"),
         ([{"LIKE_EMAIL": True}], "mail@example.com"),


@@ -158,7 +158,7 @@ The available token pattern keys correspond to a number of
 rule-based matching are:
 
 | Attribute | Value Type | Description |
-| -------------------------------------- | ------------- | ------------------------------------------------------------------------------------------------------ |
+| ------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `ORTH` | unicode | The exact verbatim text of a token. |
 | `TEXT` <Tag variant="new">2.1</Tag> | unicode | The exact verbatim text of a token. |
 | `LOWER` | unicode | The lowercase form of the token text. |
@@ -167,6 +167,7 @@ rule-based matching are:
 | `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | bool | Token text is in lowercase, uppercase, titlecase. |
 | `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | bool | Token is punctuation, whitespace, stop word. |
 | `IS_SENT_START` | bool | Token is start of sentence. |
+| `SPACY` | bool | Token has a trailing space. |
 | `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | bool | Token text resembles a number, URL, email. |
 | `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` | unicode | The token's simple and extended part-of-speech tag, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). |
 | `ENT_TYPE` | unicode | The token's entity label. |
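
With the row above in place, `SPACY` can be combined with other token attributes. A minimal sketch, assuming the v2.x `Matcher` API, that uses `"SPACY": False` to require tokens written without intervening spaces (e.g. a hyphenated compound):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "SPACY": False requires that a token has no trailing space, so this
# only matches "non-profit" written without spaces around the hyphen.
pattern = [
    {"LOWER": "non", "SPACY": False},
    {"ORTH": "-", "SPACY": False},
    {"LOWER": "profit"},
]
matcher.add("NON_PROFIT", None, pattern)

doc = nlp("They run a non-profit organization.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # non-profit
```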
@@ -1102,21 +1103,28 @@ powerful model packages with binary weights _and_ rules included!
 
 ### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
 
-When using a large amount of **phrase patterns** (roughly > 10000) it's useful to understand how the `add_patterns` function of the EntityRuler works. For each **phrase pattern**,
-the EntityRuler calls the nlp object to construct a doc object. This happens in case you try
-to add the EntityRuler at the end of an existing pipeline with, for example, a POS tagger and want to
-extract matches based on the pattern's POS signature.
+When using a large amount of **phrase patterns** (roughly > 10000) it's useful
+to understand how the `add_patterns` function of the EntityRuler works. For each
+**phrase pattern**, the EntityRuler calls the nlp object to construct a doc
+object. This happens in case you try to add the EntityRuler at the end of an
+existing pipeline with, for example, a POS tagger and want to extract matches
+based on the pattern's POS signature.
 
-In this case you would pass a config value of `phrase_matcher_attr="POS"` for the EntityRuler.
+In this case you would pass a config value of `phrase_matcher_attr="POS"` for
+the EntityRuler.
 
-Running the full language pipeline across every pattern in a large list scales linearly and can therefore take a long time on large amounts of phrase patterns.
+Running the full language pipeline across every pattern in a large list scales
+linearly and can therefore take a long time on large amounts of phrase patterns.
 
-As of spaCy 2.2.4 the `add_patterns` function has been refactored to use nlp.pipe on all phrase patterns resulting in about a 10x-20x speed up with 5,000-100,000 phrase patterns respectively.
+As of spaCy 2.2.4 the `add_patterns` function has been refactored to use
+nlp.pipe on all phrase patterns resulting in about a 10x-20x speed up with
+5,000-100,000 phrase patterns respectively.
 
-Even with this speedup (but especially if you're using an older version) the `add_patterns` function can still take a long time.
+Even with this speedup (but especially if you're using an older version) the
+`add_patterns` function can still take a long time.
 
-An easy workaround to make this function run faster is disabling the other language pipes
-while adding the phrase patterns.
+An easy workaround to make this function run faster is disabling the other
+language pipes while adding the phrase patterns.
 
 ```python
 entityruler = EntityRuler(nlp)
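
The fenced example above is cut off at the hunk boundary. A sketch of the complete workaround the section describes, assuming the v2.x API (the model name and pattern contents are illustrative):

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")  # illustrative model
entityruler = EntityRuler(nlp, phrase_matcher_attr="POS")
patterns = [{"label": "ORG", "pattern": f"company {i}"} for i in range(50000)]

# Disable every pipe except the tagger while the patterns are added, so the
# nlp calls made by add_patterns only run the component the patterns need.
other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
with nlp.disable_pipes(*other_pipes):
    entityruler.add_patterns(patterns)
```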