mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 10:16:27 +03:00
Support match alignments (#7321)
* Support match alignments * change naming from match_alignments to with_alignments, add conditional flow if with_alignments is given, validate with_alignments, add related test case * remove added errors, utilize bint type, cleanup whitespace * fix no new line in end of file * Minor formatting * Skip alignments processing if as_spans is set * Add with_alignments to Matcher API docs * Update website/docs/api/matcher.md Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
This commit is contained in:
parent
ff84075839
commit
ee159b8543
106
.github/contributors/broaddeep.md
vendored
Normal file
106
.github/contributors/broaddeep.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Dongjun Park |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2021-03-06 |
|
||||
| GitHub username | broaddeep |
|
||||
| Website (optional) | |
|
|
@ -46,6 +46,12 @@ cdef struct TokenPatternC:
|
|||
int32_t nr_py
|
||||
quantifier_t quantifier
|
||||
hash_t key
|
||||
int32_t token_idx
|
||||
|
||||
|
||||
cdef struct MatchAlignmentC:
|
||||
int32_t token_idx
|
||||
int32_t length
|
||||
|
||||
|
||||
cdef struct PatternStateC:
|
||||
|
|
|
@ -196,7 +196,7 @@ cdef class Matcher:
|
|||
else:
|
||||
yield doc
|
||||
|
||||
def __call__(self, object doclike, *, as_spans=False, allow_missing=False):
|
||||
def __call__(self, object doclike, *, as_spans=False, allow_missing=False, with_alignments=False):
|
||||
"""Find all token sequences matching the supplied pattern.
|
||||
|
||||
doclike (Doc or Span): The document to match over.
|
||||
|
@ -204,10 +204,16 @@ cdef class Matcher:
|
|||
start, end) tuples.
|
||||
allow_missing (bool): Whether to skip checks for missing annotation for
|
||||
attributes included in patterns. Defaults to False.
|
||||
with_alignments (bool): Return match alignment information, which is
|
||||
`List[int]` with length of matched span. Each entry denotes the
|
||||
corresponding index of token pattern. If as_spans is set to True,
|
||||
this setting is ignored.
|
||||
RETURNS (list): A list of `(match_id, start, end)` tuples,
|
||||
describing the matches. A match tuple describes a span
|
||||
`doc[start:end]`. The `match_id` is an integer. If as_spans is set
|
||||
to True, a list of Span objects is returned.
|
||||
If with_alignments is set to True and as_spans is set to False,
|
||||
A list of `(match_id, start, end, alignments)` tuples is returned.
|
||||
"""
|
||||
if isinstance(doclike, Doc):
|
||||
doc = doclike
|
||||
|
@ -217,6 +223,9 @@ cdef class Matcher:
|
|||
length = doclike.end - doclike.start
|
||||
else:
|
||||
raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
|
||||
# Skip alignments calculations if as_spans is set
|
||||
if as_spans:
|
||||
with_alignments = False
|
||||
cdef Pool tmp_pool = Pool()
|
||||
if not allow_missing:
|
||||
for attr in (TAG, POS, MORPH, LEMMA, DEP):
|
||||
|
@ -232,18 +241,20 @@ cdef class Matcher:
|
|||
error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
|
||||
raise ValueError(error_msg)
|
||||
matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
|
||||
extensions=self._extensions, predicates=self._extra_predicates)
|
||||
extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
|
||||
final_matches = []
|
||||
pairs_by_id = {}
|
||||
# For each key, either add all matches, or only the filtered, non-overlapping ones
|
||||
for (key, start, end) in matches:
|
||||
# For each key, either add all matches, or only the filtered,
|
||||
# non-overlapping ones this `match` can be either (start, end) or
|
||||
# (start, end, alignments) depending on `with_alignments=` option.
|
||||
for key, *match in matches:
|
||||
span_filter = self._filter.get(key)
|
||||
if span_filter is not None:
|
||||
pairs = pairs_by_id.get(key, [])
|
||||
pairs.append((start,end))
|
||||
pairs.append(match)
|
||||
pairs_by_id[key] = pairs
|
||||
else:
|
||||
final_matches.append((key, start, end))
|
||||
final_matches.append((key, *match))
|
||||
matched = <char*>tmp_pool.alloc(length, sizeof(char))
|
||||
empty = <char*>tmp_pool.alloc(length, sizeof(char))
|
||||
for key, pairs in pairs_by_id.items():
|
||||
|
@ -255,14 +266,18 @@ cdef class Matcher:
|
|||
sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length
|
||||
else:
|
||||
raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter))
|
||||
for (start, end) in sorted_pairs:
|
||||
for match in sorted_pairs:
|
||||
start, end = match[:2]
|
||||
assert 0 <= start < end # Defend against segfaults
|
||||
span_len = end-start
|
||||
# If no tokens in the span have matched
|
||||
if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0:
|
||||
final_matches.append((key, start, end))
|
||||
final_matches.append((key, *match))
|
||||
# Mark tokens that have matched
|
||||
memset(&matched[start], 1, span_len * sizeof(matched[0]))
|
||||
if with_alignments:
|
||||
final_matches_with_alignments = final_matches
|
||||
final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
|
||||
# perform the callbacks on the filtered set of results
|
||||
for i, (key, start, end) in enumerate(final_matches):
|
||||
on_match = self._callbacks.get(key, None)
|
||||
|
@ -270,6 +285,22 @@ cdef class Matcher:
|
|||
on_match(self, doc, i, final_matches)
|
||||
if as_spans:
|
||||
return [Span(doc, start, end, label=key) for key, start, end in final_matches]
|
||||
elif with_alignments:
|
||||
# convert alignments List[Dict[str, int]] --> List[int]
|
||||
final_matches = []
|
||||
# when multiple alignment (belongs to the same length) is found,
|
||||
# keeps the alignment that has largest token_idx
|
||||
for key, start, end, alignments in final_matches_with_alignments:
|
||||
sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
|
||||
alignments = [0] * (end-start)
|
||||
for align in sorted_alignments:
|
||||
if align['length'] >= end-start:
|
||||
continue
|
||||
# Since alignments are sorted in order of (length, token_idx)
|
||||
# this overwrites smaller token_idx when they have same length.
|
||||
alignments[align['length']] = align['token_idx']
|
||||
final_matches.append((key, start, end, alignments))
|
||||
return final_matches
|
||||
else:
|
||||
return final_matches
|
||||
|
||||
|
@ -288,9 +319,9 @@ def unpickle_matcher(vocab, patterns, callbacks):
|
|||
return matcher
|
||||
|
||||
|
||||
cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
|
||||
cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple(), bint with_alignments=0):
|
||||
"""Find matches in a doc, with a compiled array of patterns. Matches are
|
||||
returned as a list of (id, start, end) tuples.
|
||||
returned as a list of (id, start, end) tuples or (id, start, end, alignments) tuples (if with_alignments != 0)
|
||||
|
||||
To augment the compiled patterns, we optionally also take two Python lists.
|
||||
|
||||
|
@ -302,6 +333,8 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
|||
"""
|
||||
cdef vector[PatternStateC] states
|
||||
cdef vector[MatchC] matches
|
||||
cdef vector[vector[MatchAlignmentC]] align_states
|
||||
cdef vector[vector[MatchAlignmentC]] align_matches
|
||||
cdef PatternStateC state
|
||||
cdef int i, j, nr_extra_attr
|
||||
cdef Pool mem = Pool()
|
||||
|
@ -328,12 +361,14 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
|||
for i in range(length):
|
||||
for j in range(n):
|
||||
states.push_back(PatternStateC(patterns[j], i, 0))
|
||||
transition_states(states, matches, predicate_cache,
|
||||
doclike[i], extra_attr_values, predicates)
|
||||
if with_alignments != 0:
|
||||
align_states.resize(states.size())
|
||||
transition_states(states, matches, align_states, align_matches, predicate_cache,
|
||||
doclike[i], extra_attr_values, predicates, with_alignments)
|
||||
extra_attr_values += nr_extra_attr
|
||||
predicate_cache += len(predicates)
|
||||
# Handle matches that end in 0-width patterns
|
||||
finish_states(matches, states)
|
||||
finish_states(matches, states, align_matches, align_states, with_alignments)
|
||||
seen = set()
|
||||
for i in range(matches.size()):
|
||||
match = (
|
||||
|
@ -346,16 +381,22 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
|
|||
# first .?, or the second .? -- it doesn't matter, it's just one match.
|
||||
# Skip 0-length matches. (TODO: fix algorithm)
|
||||
if match not in seen and matches[i].length > 0:
|
||||
output.append(match)
|
||||
if with_alignments != 0:
|
||||
# since the length of align_matches equals to that of match, we can share same 'i'
|
||||
output.append(match + (align_matches[i],))
|
||||
else:
|
||||
output.append(match)
|
||||
seen.add(match)
|
||||
return output
|
||||
|
||||
|
||||
cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
|
||||
vector[vector[MatchAlignmentC]]& align_states, vector[vector[MatchAlignmentC]]& align_matches,
|
||||
int8_t* cached_py_predicates,
|
||||
Token token, const attr_t* extra_attrs, py_predicates) except *:
|
||||
Token token, const attr_t* extra_attrs, py_predicates, bint with_alignments) except *:
|
||||
cdef int q = 0
|
||||
cdef vector[PatternStateC] new_states
|
||||
cdef vector[vector[MatchAlignmentC]] align_new_states
|
||||
cdef int nr_predicate = len(py_predicates)
|
||||
for i in range(states.size()):
|
||||
if states[i].pattern.nr_py >= 1:
|
||||
|
@ -370,23 +411,39 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
|
|||
# it in the states list, because q doesn't advance.
|
||||
state = states[i]
|
||||
states[q] = state
|
||||
# Separate from states, performance is guaranteed for users who only need basic options (without alignments).
|
||||
# `align_states` always corresponds to `states` 1:1.
|
||||
if with_alignments != 0:
|
||||
align_state = align_states[i]
|
||||
align_states[q] = align_state
|
||||
while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND):
|
||||
# Update alignment before the transition of current state
|
||||
# 'MatchAlignmentC' maps 'original token index of current pattern' to 'current matching length'
|
||||
if with_alignments != 0:
|
||||
align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
|
||||
if action == RETRY_EXTEND:
|
||||
# This handles the 'extend'
|
||||
new_states.push_back(
|
||||
PatternStateC(pattern=states[q].pattern, start=state.start,
|
||||
length=state.length+1))
|
||||
if with_alignments != 0:
|
||||
align_new_states.push_back(align_states[q])
|
||||
if action == RETRY_ADVANCE:
|
||||
# This handles the 'advance'
|
||||
new_states.push_back(
|
||||
PatternStateC(pattern=states[q].pattern+1, start=state.start,
|
||||
length=state.length+1))
|
||||
if with_alignments != 0:
|
||||
align_new_states.push_back(align_states[q])
|
||||
states[q].pattern += 1
|
||||
if states[q].pattern.nr_py != 0:
|
||||
update_predicate_cache(cached_py_predicates,
|
||||
states[q].pattern, token, py_predicates)
|
||||
action = get_action(states[q], token.c, extra_attrs,
|
||||
cached_py_predicates)
|
||||
# Update alignment before the transition of current state
|
||||
if with_alignments != 0:
|
||||
align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
|
||||
if action == REJECT:
|
||||
pass
|
||||
elif action == ADVANCE:
|
||||
|
@ -399,29 +456,50 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
|
|||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length+1))
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
elif action == MATCH_DOUBLE:
|
||||
# push match without last token if length > 0
|
||||
if state.length > 0:
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length))
|
||||
# MATCH_DOUBLE emits matches twice,
|
||||
# add one more to align_matches in order to keep 1:1 relationship
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
# push match with last token
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length+1))
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
elif action == MATCH_REJECT:
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length))
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
elif action == MATCH_EXTEND:
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start,
|
||||
length=state.length))
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_states[q])
|
||||
states[q].length += 1
|
||||
q += 1
|
||||
states.resize(q)
|
||||
for i in range(new_states.size()):
|
||||
states.push_back(new_states[i])
|
||||
# `align_states` always corresponds to `states` 1:1
|
||||
if with_alignments != 0:
|
||||
align_states.resize(q)
|
||||
for i in range(align_new_states.size()):
|
||||
align_states.push_back(align_new_states[i])
|
||||
|
||||
|
||||
cdef int update_predicate_cache(int8_t* cache,
|
||||
|
@ -444,15 +522,27 @@ cdef int update_predicate_cache(int8_t* cache,
|
|||
raise ValueError(Errors.E125.format(value=result))
|
||||
|
||||
|
||||
cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states) except *:
|
||||
cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states,
|
||||
vector[vector[MatchAlignmentC]]& align_matches,
|
||||
vector[vector[MatchAlignmentC]]& align_states,
|
||||
bint with_alignments) except *:
|
||||
"""Handle states that end in zero-width patterns."""
|
||||
cdef PatternStateC state
|
||||
cdef vector[MatchAlignmentC] align_state
|
||||
for i in range(states.size()):
|
||||
state = states[i]
|
||||
if with_alignments != 0:
|
||||
align_state = align_states[i]
|
||||
while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE):
|
||||
# Update alignment before the transition of current state
|
||||
if with_alignments != 0:
|
||||
align_state.push_back(MatchAlignmentC(state.pattern.token_idx, state.length))
|
||||
is_final = get_is_final(state)
|
||||
if is_final:
|
||||
ent_id = get_ent_id(state.pattern)
|
||||
# `align_matches` always corresponds to `matches` 1:1
|
||||
if with_alignments != 0:
|
||||
align_matches.push_back(align_state)
|
||||
matches.push_back(
|
||||
MatchC(pattern_id=ent_id, start=state.start, length=state.length))
|
||||
break
|
||||
|
@ -607,7 +697,7 @@ cdef int8_t get_quantifier(PatternStateC state) nogil:
|
|||
cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL:
|
||||
pattern = <TokenPatternC*>mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC))
|
||||
cdef int i, index
|
||||
for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs):
|
||||
for i, (quantifier, spec, extensions, predicates, token_idx) in enumerate(token_specs):
|
||||
pattern[i].quantifier = quantifier
|
||||
# Ensure attrs refers to a null pointer if nr_attr == 0
|
||||
if len(spec) > 0:
|
||||
|
@ -628,6 +718,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
|
|||
pattern[i].py_predicates[j] = index
|
||||
pattern[i].nr_py = len(predicates)
|
||||
pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
|
||||
pattern[i].token_idx = token_idx
|
||||
i = len(token_specs)
|
||||
# Use quantifier to identify final ID pattern node (rather than previous
|
||||
# uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
|
||||
|
@ -638,6 +729,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
|
|||
pattern[i].nr_attr = 1
|
||||
pattern[i].nr_extra_attr = 0
|
||||
pattern[i].nr_py = 0
|
||||
pattern[i].token_idx = -1
|
||||
return pattern
|
||||
|
||||
|
||||
|
@ -655,7 +747,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
|
|||
"""This function interprets the pattern, converting the various bits of
|
||||
syntactic sugar before we compile it into a struct with init_pattern.
|
||||
|
||||
We need to split the pattern up into three parts:
|
||||
We need to split the pattern up into four parts:
|
||||
* Normal attribute/value pairs, which are stored on either the token or lexeme,
|
||||
can be handled directly.
|
||||
* Extension attributes are handled specially, as we need to prefetch the
|
||||
|
@ -664,13 +756,14 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
|
|||
functions and store them. So we store these specially as well.
|
||||
* Extension attributes that have extra predicates are stored within the
|
||||
extra_predicates.
|
||||
* Token index that this pattern belongs to.
|
||||
"""
|
||||
tokens = []
|
||||
string_store = vocab.strings
|
||||
for spec in token_specs:
|
||||
for token_idx, spec in enumerate(token_specs):
|
||||
if not spec:
|
||||
# Signifier for 'any token'
|
||||
tokens.append((ONE, [(NULL_ATTR, 0)], [], []))
|
||||
tokens.append((ONE, [(NULL_ATTR, 0)], [], [], token_idx))
|
||||
continue
|
||||
if not isinstance(spec, dict):
|
||||
raise ValueError(Errors.E154.format())
|
||||
|
@ -679,7 +772,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
|
|||
extensions = _get_extensions(spec, string_store, extensions_table)
|
||||
predicates = _get_extra_predicates(spec, extra_predicates, vocab)
|
||||
for op in ops:
|
||||
tokens.append((op, list(attr_values), list(extensions), list(predicates)))
|
||||
tokens.append((op, list(attr_values), list(extensions), list(predicates), token_idx))
|
||||
return tokens
|
||||
|
||||
|
||||
|
|
|
@ -204,3 +204,90 @@ def test_matcher_remove():
|
|||
# removing again should throw an error
|
||||
with pytest.raises(ValueError):
|
||||
matcher.remove("Rule")
|
||||
|
||||
|
||||
def test_matcher_with_alignments_greedy_longest(en_vocab):
|
||||
cases = [
|
||||
("aaab", "a* b", [0, 0, 0, 1]),
|
||||
("baab", "b a* b", [0, 1, 1, 2]),
|
||||
("aaab", "a a a b", [0, 1, 2, 3]),
|
||||
("aaab", "a+ b", [0, 0, 0, 1]),
|
||||
("aaba", "a+ b a+", [0, 0, 1, 2]),
|
||||
("aabaa", "a+ b a+", [0, 0, 1, 2, 2]),
|
||||
("aaba", "a+ b a*", [0, 0, 1, 2]),
|
||||
("aaaa", "a*", [0, 0, 0, 0]),
|
||||
("baab", "b a* b b*", [0, 1, 1, 2]),
|
||||
("aabb", "a* b* a*", [0, 0, 1, 1]),
|
||||
("aaab", "a+ a+ a b", [0, 1, 2, 3]),
|
||||
("aaab", "a+ a+ a+ b", [0, 1, 2, 3]),
|
||||
("aaab", "a+ a a b", [0, 1, 2, 3]),
|
||||
("aaab", "a+ a a", [0, 1, 2]),
|
||||
("aaab", "a+ a a?", [0, 1, 2]),
|
||||
("aaaa", "a a a a a?", [0, 1, 2, 3]),
|
||||
("aaab", "a+ a b", [0, 0, 1, 2]),
|
||||
("aaab", "a+ a+ b", [0, 0, 1, 2]),
|
||||
]
|
||||
for string, pattern_str, result in cases:
|
||||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(matcher.vocab, words=list(string))
|
||||
pattern = []
|
||||
for part in pattern_str.split():
|
||||
if part.endswith("+"):
|
||||
pattern.append({"ORTH": part[0], "OP": "+"})
|
||||
elif part.endswith("*"):
|
||||
pattern.append({"ORTH": part[0], "OP": "*"})
|
||||
elif part.endswith("?"):
|
||||
pattern.append({"ORTH": part[0], "OP": "?"})
|
||||
else:
|
||||
pattern.append({"ORTH": part})
|
||||
matcher.add("PATTERN", [pattern], greedy="LONGEST")
|
||||
matches = matcher(doc, with_alignments=True)
|
||||
n_matches = len(matches)
|
||||
|
||||
_, s, e, expected = matches[0]
|
||||
|
||||
assert expected == result, (string, pattern_str, s, e, n_matches)
|
||||
|
||||
|
||||
def test_matcher_with_alignments_nongreedy(en_vocab):
|
||||
cases = [
|
||||
(0, "aaab", "a* b", [[0, 1], [0, 0, 1], [0, 0, 0, 1], [1]]),
|
||||
(1, "baab", "b a* b", [[0, 1, 1, 2]]),
|
||||
(2, "aaab", "a a a b", [[0, 1, 2, 3]]),
|
||||
(3, "aaab", "a+ b", [[0, 1], [0, 0, 1], [0, 0, 0, 1]]),
|
||||
(4, "aaba", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2]]),
|
||||
(5, "aabaa", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2], [0, 0, 1, 2, 2], [0, 1, 2, 2] ]),
|
||||
(6, "aaba", "a+ b a*", [[0, 1], [0, 0, 1], [0, 0, 1, 2], [0, 1, 2]]),
|
||||
(7, "aaaa", "a*", [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0]]),
|
||||
(8, "baab", "b a* b b*", [[0, 1, 1, 2]]),
|
||||
(9, "aabb", "a* b* a*", [[1], [2], [2, 2], [0, 1], [0, 0, 1], [0, 0, 1, 1], [0, 1, 1], [1, 1]]),
|
||||
(10, "aaab", "a+ a+ a b", [[0, 1, 2, 3]]),
|
||||
(11, "aaab", "a+ a+ a+ b", [[0, 1, 2, 3]]),
|
||||
(12, "aaab", "a+ a a b", [[0, 1, 2, 3]]),
|
||||
(13, "aaab", "a+ a a", [[0, 1, 2]]),
|
||||
(14, "aaab", "a+ a a?", [[0, 1], [0, 1, 2]]),
|
||||
(15, "aaaa", "a a a a a?", [[0, 1, 2, 3]]),
|
||||
(16, "aaab", "a+ a b", [[0, 1, 2], [0, 0, 1, 2]]),
|
||||
(17, "aaab", "a+ a+ b", [[0, 1, 2], [0, 0, 1, 2]]),
|
||||
]
|
||||
for case_id, string, pattern_str, results in cases:
|
||||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(matcher.vocab, words=list(string))
|
||||
pattern = []
|
||||
for part in pattern_str.split():
|
||||
if part.endswith("+"):
|
||||
pattern.append({"ORTH": part[0], "OP": "+"})
|
||||
elif part.endswith("*"):
|
||||
pattern.append({"ORTH": part[0], "OP": "*"})
|
||||
elif part.endswith("?"):
|
||||
pattern.append({"ORTH": part[0], "OP": "?"})
|
||||
else:
|
||||
pattern.append({"ORTH": part})
|
||||
|
||||
matcher.add("PATTERN", [pattern])
|
||||
matches = matcher(doc, with_alignments=True)
|
||||
n_matches = len(matches)
|
||||
|
||||
for _, s, e, expected in matches:
|
||||
assert expected in results, (case_id, string, pattern_str, s, e, n_matches)
|
||||
assert len(expected) == e - s
|
||||
|
|
|
@ -120,13 +120,14 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
|
|||
> matches = matcher(doc)
|
||||
> ```
|
||||
|
||||
| Name | Description |
|
||||
| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
|
||||
| `allow_missing` <Tag variant="new">3</Tag> | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
|
||||
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
|
||||
| Name | Description |
|
||||
| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
|
||||
| _keyword-only_ | |
|
||||
| `as_spans` <Tag variant="new">3</Tag> | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
|
||||
| `allow_missing` <Tag variant="new">3</Tag> | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
|
||||
| `with_alignments` <Tag variant="new">3.1</Tag> | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the corresponding index of the token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ |
|
||||
| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
|
||||
|
||||
## Matcher.\_\_len\_\_ {#len tag="method" new="2"}
|
||||
|
||||
|
|
Loading…
Reference in New Issue
Block a user