Fix behaviour of Matcher's ? quantifier for v2.1 (#3105)
* Add failing test for matcher bug #3009
* Deduplicate matches from Matcher
* Update matcher ? quantifier test
* Fix bug with ? quantifier in Matcher
The ? quantifier indicates that a token may occur zero or one times. When
the optional token pattern fit, the matcher failed to consider the valid
matches in which that token is skipped. Consider a simple regex like:
    .?b
Given the string 'b', the .? part consumes the 'b', so the literal 'b' in
the pattern has nothing left to match and we get no match at all. The same
bug left us with too few matches in some cases. For instance, consider:
    .?.?
Given a string of length two, like 'ab', there are three possible
matches: [a, b, ab]. We were only recovering 'ab'. This should now be
fixed. The fix also uncovered another bug: we weren't deduplicating the
matches. There are two ways to match 'a' and two ways to match 'b', as the
first token of the pattern or as the second. This ambiguity is spurious,
so we need to deduplicate.
Closes #2464 and #3009
* Fix Python2
2018-12-29 18:18:09 +03:00
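The counting argument in the commit message can be checked with a small, self-contained sketch. This is not spaCy's Matcher implementation; the helper name `optional_matches` and the (start, end) span representation are assumptions made purely for illustration. It enumerates every way a run of optional wildcards can consume tokens, then deduplicates the resulting spans.

```python
from itertools import product


def optional_matches(tokens, n_optional):
    """Enumerate (start, end) spans for n_optional consecutive '?' wildcards.

    Hypothetical helper for illustration only, not part of spaCy.
    """
    raw = []
    for start in range(len(tokens)):
        # Each optional slot consumes either zero tokens or one token.
        for choice in product((0, 1), repeat=n_optional):
            length = sum(choice)
            # Skip empty matches and spans that run past the end.
            if length == 0 or start + length > len(tokens):
                continue
            raw.append((start, start + length))
    return raw


tokens = ["a", "b"]
raw = optional_matches(tokens, 2)
unique = sorted(set(raw))  # deduplicate spans reached by different paths

print(len(raw))     # number of derivations before deduplication
print(unique)       # distinct spans after deduplication
```

Under these assumptions, 'a' and 'b' are each reached by two derivations (first slot or second slot), which is the spurious ambiguity the commit message describes; collapsing the spans into a set leaves the three distinct matches.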

# coding: utf-8
from __future__ import unicode_literals

import pytest
from spacy.matcher import Matcher
from spacy.tokens import Doc


PATTERNS = [
    # "1": no quantifier
    ("1", [[{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"POS": "ADP"}]]),
    # "2": "*" allows zero or more intervening non-punctuation ASCII tokens
    (
        "2",
        [
            [
                {"LEMMA": "have"},
                {"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"},
                {"LOWER": "to"},
                {"LOWER": "do"},
                {"POS": "ADP"},
            ]
        ],
    ),
    # "3": "?" allows zero or one intervening non-punctuation ASCII token
    (
        "3",
        [
            [
                {"LEMMA": "have"},
                {"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"},
                {"LOWER": "to"},
                {"LOWER": "do"},
                {"POS": "ADP"},
            ]
        ],
    ),
]


@pytest.fixture
def doc(en_tokenizer):
    doc = en_tokenizer("also has to do with")
    # Assign the tags by hand so the test does not need a trained tagger.
    doc[0].tag_ = "RB"
    doc[1].tag_ = "VBZ"
    doc[2].tag_ = "TO"
    doc[3].tag_ = "VB"
    doc[4].tag_ = "IN"
    return doc


@pytest.fixture
def matcher(en_tokenizer):
    return Matcher(en_tokenizer.vocab)


@pytest.mark.parametrize("pattern", PATTERNS)
def test_issue3009(doc, matcher, pattern):
    """Test problem with matcher quantifiers"""
    # pattern is a (key, list of token patterns) pair from PATTERNS.
    matcher.add(pattern[0], None, *pattern[1])
    matches = matcher(doc)
    # Every pattern, with or without a quantifier, should yield a match.
    assert matches

2019-02-08 16:14:49 +03:00
def test_issue2464(matcher):
    """Test problem with successive ?. This is the same bug, so putting it here."""
    doc = Doc(matcher.vocab, words=["a", "b"])
    # Two successive optional wildcard tokens.
    matcher.add("4", None, [{"OP": "?"}, {"OP": "?"}])
    matches = matcher(doc)
    # Expected spans: "a", "b" and "a b", each reported exactly once.
    assert len(matches) == 3