Mirror of https://github.com/explosion/spaCy.git (synced 2024-11-14 05:37:03 +03:00)

Commit aeba99ab0d (parent 476472d181)

* Fix comment
* Introduce bulk merge to increase performance on many span merges
* Sign contributor agreement
* Implement pull request suggestions
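The main change is that merges registered inside a single `doc.retokenize()` block are now applied in one pass instead of one `_merge()` call per span. A minimal usage sketch based on the tests added below; the blank `Vocab()` and the example words are illustrative assumptions, not part of the diff:

```python
from spacy.vocab import Vocab
from spacy.tokens import Doc

doc = Doc(Vocab(), words=["They", "like", "the", "beach", "boys", "all", "night"])

# Both merges are recorded against the original token indices and applied
# together on exit from the context manager, which is the bulk path this
# commit adds for the case of more than one registered merge.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:5], attrs={"lemma": "the beach boys"})
    retokenizer.merge(doc[5:7], attrs={"lemma": "all night"})

assert len(doc) == 4
assert doc[2].text == "the beach boys"
assert doc[3].text == "all night"
```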
.github/contributors/grivaz.md (new file, vendored, 106 lines)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry      |
|------------------------------- | ---------- |
| Name                           | C. Grivaz  |
| Company name (if applicable)   |            |
| Title or role (if applicable)  |            |
| Date                           | 08.22.2018 |
| GitHub username                | grivaz     |
| Website (optional)             |            |
@@ -249,6 +249,7 @@ class Errors(object):
             "error. Are you writing to a default function argument?")
     E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
             "Span objects, or dicts if set to manual=True.")
+    E097 = ("Can't merge non-disjoint spans. '{token}' is already part of tokens to merge")
 
 
 @add_codes
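The new E097 message backs the non-disjoint check added to `Retokenizer.merge()` below. A short illustration of when it fires, my own construction mirroring the `test_spans_merge_non_disjoint` test added further down:

```python
import pytest
from spacy.vocab import Vocab
from spacy.tokens import Doc

doc = Doc(Vocab(), words=["Los", "Angeles", "start", "."])

# doc[0:2] and doc[1:3] both claim token 1 ("Angeles"), so the second
# merge() call raises a ValueError carrying the E097 message.
with pytest.raises(ValueError):
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[0:2])
        retokenizer.merge(doc[1:3])
```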
@@ -5,6 +5,7 @@ from ..util import get_doc
 from ...tokens import Doc
 from ...vocab import Vocab
 from ...attrs import LEMMA
+from ...tokens import Span
 
 import pytest
 import numpy
@@ -156,6 +157,23 @@ def test_doc_api_merge(en_tokenizer):
     assert doc[7].text == 'all night'
     assert doc[7].text_with_ws == 'all night'
 
+    # merge both with bulk merge
+    doc = en_tokenizer(text)
+    assert len(doc) == 9
+    with doc.retokenize() as retokenizer:
+        retokenizer.merge(doc[4: 7], attrs={'tag':'NAMED', 'lemma':'LEMMA',
+                                            'ent_type':'TYPE'})
+        retokenizer.merge(doc[7: 9], attrs={'tag':'NAMED', 'lemma':'LEMMA',
+                                            'ent_type':'TYPE'})
+
+    assert len(doc) == 6
+    assert doc[4].text == 'the beach boys'
+    assert doc[4].text_with_ws == 'the beach boys '
+    assert doc[4].tag_ == 'NAMED'
+    assert doc[5].text == 'all night'
+    assert doc[5].text_with_ws == 'all night'
+    assert doc[5].tag_ == 'NAMED'
+
 
 def test_doc_api_merge_children(en_tokenizer):
     """Test that attachments work correctly after merging."""
@@ -4,6 +4,7 @@ from __future__ import unicode_literals
 from ..util import get_doc
 from ...vocab import Vocab
 from ...tokens import Doc
+from ...tokens import Span
 
 import pytest
@@ -16,16 +17,8 @@ def test_spans_merge_tokens(en_tokenizer):
     assert len(doc) == 4
     assert doc[0].head.text == 'Angeles'
     assert doc[1].head.text == 'start'
-    doc.merge(0, len('Los Angeles'), tag='NNP', lemma='Los Angeles', ent_type='GPE')
-    assert len(doc) == 3
-    assert doc[0].text == 'Los Angeles'
-    assert doc[0].head.text == 'start'
-
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
-    assert len(doc) == 4
-    assert doc[0].head.text == 'Angeles'
-    assert doc[1].head.text == 'start'
-    doc.merge(0, len('Los Angeles'), tag='NNP', lemma='Los Angeles', label='GPE')
+    with doc.retokenize() as retokenizer:
+        retokenizer.merge(doc[0 : 2], attrs={'tag':'NNP', 'lemma':'Los Angeles', 'ent_type':'GPE'})
     assert len(doc) == 3
     assert doc[0].text == 'Los Angeles'
     assert doc[0].head.text == 'start'
@@ -38,8 +31,8 @@ def test_spans_merge_heads(en_tokenizer):
     doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
 
     assert len(doc) == 8
-    doc.merge(doc[3].idx, doc[4].idx + len(doc[4]), tag=doc[4].tag_,
-              lemma='pilates class', ent_type='O')
+    with doc.retokenize() as retokenizer:
+        retokenizer.merge(doc[3 : 5], attrs={'tag':doc[4].tag_, 'lemma':'pilates class', 'ent_type':'O'})
     assert len(doc) == 7
     assert doc[0].head.i == 1
     assert doc[1].head.i == 1
@@ -48,6 +41,14 @@ def test_spans_merge_heads(en_tokenizer):
     assert doc[4].head.i in [1, 3]
     assert doc[5].head.i == 4
 
+def test_spans_merge_non_disjoint(en_tokenizer):
+    text = "Los Angeles start."
+    tokens = en_tokenizer(text)
+    doc = get_doc(tokens.vocab, [t.text for t in tokens])
+    with pytest.raises(ValueError):
+        with doc.retokenize() as retokenizer:
+            retokenizer.merge(doc[0: 2], attrs={'tag': 'NNP', 'lemma': 'Los Angeles', 'ent_type': 'GPE'})
+            retokenizer.merge(doc[0: 1], attrs={'tag': 'NNP', 'lemma': 'Los Angeles', 'ent_type': 'GPE'})
 
 def test_span_np_merges(en_tokenizer):
     text = "displaCy is a parse tool built with Javascript"
@@ -111,6 +112,25 @@ def test_spans_entity_merge_iob():
     assert doc[0].ent_iob_ == "B"
     assert doc[1].ent_iob_ == "I"
 
+    words = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
+    doc = Doc(Vocab(), words=words)
+    doc.ents = [(doc.vocab.strings.add('ent-de'), 3, 5),
+                (doc.vocab.strings.add('ent-fg'), 5, 7)]
+    assert doc[3].ent_iob_ == "B"
+    assert doc[4].ent_iob_ == "I"
+    assert doc[5].ent_iob_ == "B"
+    assert doc[6].ent_iob_ == "I"
+    with doc.retokenize() as retokenizer:
+        retokenizer.merge(doc[2 : 4])
+        retokenizer.merge(doc[4 : 6])
+        retokenizer.merge(doc[7 : 9])
+    for token in doc:
+        print(token)
+        print(token.ent_iob)
+    assert len(doc) == 6
+    assert doc[3].ent_iob_ == "B"
+    assert doc[4].ent_iob_ == "I"
+
 
 def test_spans_sentence_update_after_merge(en_tokenizer):
     text = "Stewart Lee is a stand up comedian. He lives in England and loves Joe Pasquale."
@@ -5,6 +5,9 @@
 from __future__ import unicode_literals
 
 from libc.string cimport memcpy, memset
+from libc.stdlib cimport malloc, free
+
+from cymem.cymem cimport Pool
 
 from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end
 from .span cimport Span
@@ -14,24 +17,31 @@ from ..structs cimport LexemeC, TokenC
 from ..attrs cimport TAG
 from ..attrs import intify_attrs
 from ..util import SimpleFrozenDict
+from ..errors import Errors
 
 
 cdef class Retokenizer:
     """Helper class for doc.retokenize() context manager."""
     cdef Doc doc
     cdef list merges
     cdef list splits
+    cdef set tokens_to_merge
     def __init__(self, doc):
         self.doc = doc
         self.merges = []
         self.splits = []
+        self.tokens_to_merge = set()
 
     def merge(self, Span span, attrs=SimpleFrozenDict()):
         """Mark a span for merging. The attrs will be applied to the resulting
         token.
         """
+        for token in span:
+            if token.i in self.tokens_to_merge:
+                raise ValueError(Errors.E097.format(token=repr(token)))
+            self.tokens_to_merge.add(token.i)
+
         attrs = intify_attrs(attrs, strings_map=self.doc.vocab.strings)
-        self.merges.append((span.start_char, span.end_char, attrs))
+        self.merges.append((span, attrs))
 
     def split(self, Token token, orths, attrs=SimpleFrozenDict()):
         """Mark a Token for splitting, into the specified orths. The attrs
@@ -47,20 +57,22 @@ cdef class Retokenizer:
 
     def __exit__(self, *args):
         # Do the actual merging here
-        for start_char, end_char, attrs in self.merges:
-            start = token_by_start(self.doc.c, self.doc.length, start_char)
-            end = token_by_end(self.doc.c, self.doc.length, end_char)
-            _merge(self.doc, start, end+1, attrs)
+        if len(self.merges) > 1:
+            _bulk_merge(self.doc, self.merges)
+        elif len(self.merges) == 1:
+            (span, attrs) = self.merges[0]
+            start = span.start
+            end = span.end
+            _merge(self.doc, start, end, attrs)
         for start_char, orths, attrs in self.splits:
             raise NotImplementedError
 
 
 def _merge(Doc doc, int start, int end, attributes):
     """Retokenize the document, such that the span at
     `doc.text[start_idx : end_idx]` is merged into a single token. If
     `start_idx` and `end_idx `do not mark start and end token boundaries,
     the document remains unchanged.
 
     start_idx (int): Character index of the start of the slice to merge.
     end_idx (int): Character index after the end of the slice to merge.
     **attributes: Attributes to assign to the merged token. By default,
@@ -131,3 +143,139 @@ def _merge(Doc doc, int start, int end, attributes):
     # Clear the cached Python objects
     # Return the merged Python object
     return doc[start]
+
+def _bulk_merge(Doc doc, merges):
+    """Retokenize the document, such that the spans described in 'merges'
+    are merged into a single token. This method assumes that the merges
+    are in the same order at which they appear in the doc, and that merges
+    do not intersect each other in any way.
+
+    merges: Tokens to merge, and corresponding attributes to assign to the
+        merged token. By default, attributes are inherited from the
+        syntactic root of the span.
+    RETURNS (Token): The first newly merged token.
+    """
+    cdef Span span
+    cdef const LexemeC* lex
+    cdef Pool mem = Pool()
+    tokens = <TokenC**>mem.alloc(len(merges), sizeof(TokenC))
+    spans = []
+
+    def _get_start(merge):
+        return merge[0].start
+    merges.sort(key=_get_start)
+
+    for merge_index, (span, attributes) in enumerate(merges):
+        start = span.start
+        end = span.end
+        spans.append(span)
+
+        # House the new merged token where it starts
+        token = &doc.c[start]
+
+        tokens[merge_index] = token
+
+        # Assign attributes
+        for attr_name, attr_value in attributes.items():
+            if attr_name == TAG:
+                doc.vocab.morphology.assign_tag(token, attr_value)
+            else:
+                Token.set_struct_attr(token, attr_name, attr_value)
+
+    # Memorize span roots and sets dependencies of the newly merged
+    # tokens to the dependencies of their roots.
+    span_roots = []
+    for i, span in enumerate(spans):
+        span_roots.append(span.root.i)
+        tokens[i].dep = span.root.dep
+
+    # We update token.lex after keeping span root and dep, since
+    # setting token.lex will change span.start and span.end properties
+    # as it modifies the character offsets in the doc
+    for token_index in range(len(merges)):
+        new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
+        if spans[token_index][-1].whitespace_:
+            new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
+        lex = doc.vocab.get(doc.mem, new_orth)
+        tokens[token_index].lex = lex
+        # We set trailing space here too
+        tokens[token_index].spacy = doc.c[spans[token_index].end-1].spacy
+
+    # Begin by setting all the head indices to absolute token positions
+    # This is easier to work with for now than the offsets
+    # Before thinking of something simpler, beware the case where a
+    # dependency bridges over the entity. Here the alignment of the
+    # tokens changes.
+    for i in range(doc.length):
+        doc.c[i].head += i
+
+    # Set the head of the merged token from the Span
+    for i in range(len(merges)):
+        tokens[i].head = doc.c[span_roots[i]].head
+
+    # Adjust deps before shrinking tokens
+    # Tokens which point into the merged token should now point to it
+    # Subtract the offset from all tokens which point to >= end
+    offsets = []
+    current_span_index = 0
+    current_offset = 0
+    for i in range(doc.length):
+        if current_span_index < len(spans) and i == spans[current_span_index].end:
+            # last token was the last of the span
+            current_offset += (spans[current_span_index].end - spans[current_span_index].start) - 1
+            current_span_index += 1
+
+        if current_span_index < len(spans) and \
+                spans[current_span_index].start <= i < spans[current_span_index].end:
+            offsets.append(spans[current_span_index].start - current_offset)
+        else:
+            offsets.append(i - current_offset)
+
+    for i in range(doc.length):
+        doc.c[i].head = offsets[doc.c[i].head]
+
+    # Now compress the token array
+    offset = 0
+    in_span = False
+    span_index = 0
+    for i in range(doc.length):
+        if in_span and i == spans[span_index].end:
+            # First token after a span
+            in_span = False
+            span_index += 1
+        if span_index < len(spans) and i == spans[span_index].start:
+            # First token in a span
+            doc.c[i - offset] = doc.c[i]  # move token to its place
+            offset += (spans[span_index].end - spans[span_index].start) - 1
+            in_span = True
+        if not in_span:
+            doc.c[i - offset] = doc.c[i]  # move token to its place
+
+    for i in range(doc.length - offset, doc.length):
+        memset(&doc.c[i], 0, sizeof(TokenC))
+        doc.c[i].lex = &EMPTY_LEXEME
+    doc.length -= offset
+
+    # ...And, set heads back to a relative position
+    for i in range(doc.length):
+        doc.c[i].head -= i
+
+    # Set the left/right children, left/right edges
+    set_children_from_heads(doc.c, doc.length)
+
+    # Make sure ent_iob remains consistent
+    for (span, _) in merges:
+        if span.end < len(offsets):
+            # if it's not the last span
+            token_after_span_position = offsets[span.end]
+            if doc.c[token_after_span_position].ent_iob == 1 \
+                    and doc.c[token_after_span_position - 1].ent_iob in (0, 2):
+                if doc.c[token_after_span_position - 1].ent_type == doc.c[token_after_span_position].ent_type:
+                    doc.c[token_after_span_position - 1].ent_iob = 3
+                else:
+                    # If they're not the same entity type, let them be two entities
+                    doc.c[token_after_span_position].ent_iob = 3
+
+    # Return the merged Python object
+    return doc[spans[0].start]
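The trickiest part of `_bulk_merge()` above is remapping head indices while the token array shrinks. Below is a small pure-Python paraphrase of the `offsets` computation (my own sketch, not part of the commit; span boundaries are plain `(start, end)` tuples here): for each pre-merge token index it returns the index that token, or the merged token absorbing it, occupies after compression.

```python
# Pure-Python sketch of the offsets pass in _bulk_merge: maps each pre-merge
# token index to its post-merge index, given sorted, disjoint merge spans.
def compute_offsets(doc_length, spans):
    offsets = []
    current_span_index = 0
    current_offset = 0
    for i in range(doc_length):
        if current_span_index < len(spans) and i == spans[current_span_index][1]:
            # the previous token closed a merged span, so later tokens shift left
            start, end = spans[current_span_index]
            current_offset += (end - start) - 1
            current_span_index += 1
        if (current_span_index < len(spans)
                and spans[current_span_index][0] <= i < spans[current_span_index][1]):
            # every token inside a span maps to the merged token's new position
            offsets.append(spans[current_span_index][0] - current_offset)
        else:
            offsets.append(i - current_offset)
    return offsets

# 9 tokens, merging [2, 4) and [4, 6): tokens 2-3 collapse onto new index 2,
# tokens 4-5 onto new index 3, and everything after shifts left by two.
assert compute_offsets(9, [(2, 4), (4, 6)]) == [0, 1, 2, 2, 3, 3, 4, 5, 6]
```

Heads are first rewritten as absolute token positions and then looked up in this table, so a dependency that bridges over a merged span still lands on the correct surviving token.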
@@ -884,6 +884,28 @@ cdef class Doc:
         '''
         return Retokenizer(self)
 
+    def _bulk_merge(self, spans, attributes):
+        """Retokenize the document, such that the spans given as arguments
+        are merged into single tokens. The spans need to be in document
+        order, and no span intersection is allowed.
+
+        spans (Span[]): Spans to merge, in document order, with all span
+            intersections empty. Cannot be emty.
+        attributes (Dictionary[]): Attributes to assign to the merged tokens. By default,
+            must be the same lenghth as spans, emty dictionaries are allowed.
+            attributes are inherited from the syntactic root of the span.
+        RETURNS (Token): The first newly merged token.
+        """
+        cdef unicode tag, lemma, ent_type
+
+        assert len(attributes) == len(spans), "attribute length should be equal to span length" + str(len(attributes)) +\
+            str(len(spans))
+        with self.retokenize() as retokenizer:
+            for i, span in enumerate(spans):
+                fix_attributes(self, attributes[i])
+                remove_label_if_necessary(attributes[i])
+                retokenizer.merge(span, attributes[i])
+
     def merge(self, int start_idx, int end_idx, *args, **attributes):
         """Retokenize the document, such that the span at
         `doc.text[start_idx : end_idx]` is merged into a single token. If
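For reference, a hedged sketch of the `Doc._bulk_merge()` contract described in the docstring above: one attribute dict per span, lists of equal length, spans in document order. The example data is my own, and `doc.retokenize()` remains the public entry point; this method is internal.

```python
from spacy.vocab import Vocab
from spacy.tokens import Doc

doc = Doc(Vocab(), words=["Los", "Angeles", "is", "a", "city"])

spans = [doc[0:2]]
# One attrs dict per span; empty dicts are allowed, but the two lists must
# have the same length or the assertion in _bulk_merge fails.
attributes = [{"lemma": "Los Angeles", "ent_type": "GPE"}]

doc._bulk_merge(spans, attributes)

assert len(doc) == 4
assert doc[0].text == "Los Angeles"
assert doc[0].ent_type_ == "GPE"
```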
@@ -905,20 +927,12 @@ cdef class Doc:
             attributes[LEMMA] = lemma
             attributes[ENT_TYPE] = ent_type
         elif not args:
-            if 'label' in attributes and 'ent_type' not in attributes:
-                if isinstance(attributes['label'], int):
-                    attributes[ENT_TYPE] = attributes['label']
-                else:
-                    attributes[ENT_TYPE] = self.vocab.strings[attributes['label']]
-            if 'ent_type' in attributes:
-                attributes[ENT_TYPE] = attributes['ent_type']
+            fix_attributes(self, attributes)
         elif args:
             raise ValueError(Errors.E034.format(n_args=len(args),
                                                 args=repr(args),
                                                 kwargs=repr(attributes)))
-        # More deprecated attribute handling =/
-        if 'label' in attributes:
-            attributes['ent_type'] = attributes.pop('label')
+        remove_label_if_necessary(attributes)
 
         attributes = intify_attrs(attributes, strings_map=self.vocab.strings)
 
@@ -1034,3 +1048,17 @@ def unpickle_doc(vocab, hooks_and_data, bytes_data):
 
 
 copy_reg.pickle(Doc, pickle_doc, unpickle_doc)
+
+def remove_label_if_necessary(attributes):
+    # More deprecated attribute handling =/
+    if 'label' in attributes:
+        attributes['ent_type'] = attributes.pop('label')
+
+def fix_attributes(doc, attributes):
+    if 'label' in attributes and 'ent_type' not in attributes:
+        if isinstance(attributes['label'], int):
+            attributes[ENT_TYPE] = attributes['label']
+        else:
+            attributes[ENT_TYPE] = doc.vocab.strings[attributes['label']]
+    if 'ent_type' in attributes:
+        attributes[ENT_TYPE] = attributes['ent_type']