mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 09:14:32 +03:00
Merge changes to test_ner
commit
46c02d25b1
106
.github/contributors/Hazoom.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Moshe Hazoom |
|
||||
| Company name (if applicable) | Amenity Analytics |
|
||||
| Title or role (if applicable) | NLP Engineer |
|
||||
| Date | 2019-09-15 |
|
||||
| GitHub username | Hazoom |
|
||||
| Website (optional) | |
|
|
@ -120,7 +120,7 @@ class Errors(object):
|
|||
E011 = ("Unknown operator: '{op}'. Options: {opts}")
|
||||
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
|
||||
E013 = ("Error selecting action in matcher")
|
||||
E014 = ("Uknown tag ID: {tag}")
|
||||
E014 = ("Unknown tag ID: {tag}")
|
||||
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
|
||||
"`force=True` to overwrite.")
|
||||
E016 = ("MultitaskObjective target should be function or one of: dep, "
|
||||
|
|
|
@ -69,7 +69,7 @@ class Pipe(object):
|
|||
predictions = self.predict([doc])
|
||||
if isinstance(predictions, tuple) and len(predictions) == 2:
|
||||
scores, tensors = predictions
|
||||
self.set_annotations([doc], scores, tensor=tensors)
|
||||
self.set_annotations([doc], scores, tensors=tensors)
|
||||
else:
|
||||
self.set_annotations([doc], predictions)
|
||||
return doc
|
||||
|
@ -90,7 +90,7 @@ class Pipe(object):
|
|||
predictions = self.predict(docs)
|
||||
if isinstance(predictions, tuple) and len(predictions) == 2:
|
||||
scores, tensors = predictions
|
||||
self.set_annotations(docs, scores, tensor=tensors)
|
||||
self.set_annotations(docs, scores, tensors=tensors)
|
||||
else:
|
||||
self.set_annotations(docs, predictions)
|
||||
yield from docs
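Note on the `tensor` to `tensors` rename: `Pipe.__call__` and `Pipe.pipe` pass the second element of the prediction tuple by keyword, so the keyword has to match the parameter name declared by `set_annotations`. A minimal stand-alone sketch of that failure mode follows; `NewStylePipe`, `OldStylePipe` and `call_fails_with_type_error` are hypothetical names for illustration and are not part of spaCy.

```python
class NewStylePipe(object):
    """Hypothetical stand-in whose hook declares the `tensors` keyword."""

    def predict(self, docs):
        return ([1, 2, 3], [4, 5, 6])

    def set_annotations(self, docs, scores, tensors=None):
        return docs

    def __call__(self, doc):
        predictions = self.predict([doc])
        if isinstance(predictions, tuple) and len(predictions) == 2:
            scores, tensors = predictions
            # The keyword used here must exist on set_annotations.
            self.set_annotations([doc], scores, tensors=tensors)
        else:
            self.set_annotations([doc], predictions)
        return doc


class OldStylePipe(NewStylePipe):
    """Hypothetical subclass that still declares the old `tensor` keyword."""

    def set_annotations(self, docs, scores, tensor=None):
        return docs


def call_fails_with_type_error(pipe, doc):
    """Return True if calling the pipe raises TypeError."""
    try:
        pipe(doc)
    except TypeError:
        return True
    return False
```

A subclass that keeps the old parameter name fails as soon as the base class calls it with `tensors=...`, which is presumably why the `DummyPipe` test helper is updated in the same commit.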
|
||||
|
@ -937,11 +937,6 @@ class TextCategorizer(Pipe):
|
|||
def labels(self, value):
|
||||
self.cfg["labels"] = tuple(value)
|
||||
|
||||
def __call__(self, doc):
|
||||
scores, tensors = self.predict([doc])
|
||||
self.set_annotations([doc], scores, tensors=tensors)
|
||||
return doc
|
||||
|
||||
def pipe(self, stream, batch_size=128, n_threads=-1):
|
||||
for docs in util.minibatch(stream, size=batch_size):
|
||||
docs = list(docs)
|
||||
|
|
|
@ -66,7 +66,8 @@ cdef class BiluoPushDown(TransitionSystem):
|
|||
UNIT: Counter(),
|
||||
OUT: Counter()
|
||||
}
|
||||
actions[OUT][''] = 1
|
||||
actions[OUT][''] = 1 # Represents a token predicted to be outside of any entity
|
||||
actions[UNIT][''] = 1 # Represents a token prohibited to be in an entity
|
||||
for entity_type in kwargs.get('entity_types', []):
|
||||
for action in (BEGIN, IN, LAST, UNIT):
|
||||
actions[action][entity_type] = 1
|
||||
|
@ -162,8 +163,7 @@ cdef class BiluoPushDown(TransitionSystem):
|
|||
for i in range(self.n_moves):
|
||||
if self.c[i].move == move and self.c[i].label == label:
|
||||
return self.c[i]
|
||||
else:
|
||||
raise KeyError(Errors.E022.format(name=name))
|
||||
raise KeyError(Errors.E022.format(name=name))
|
||||
|
||||
cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
|
||||
# TODO: Apparent Cython bug here when we try to use the Transition()
|
||||
|
@ -267,7 +267,7 @@ cdef class Begin:
|
|||
return False
|
||||
elif label == 0:
|
||||
return False
|
||||
elif preset_ent_iob == 1 or preset_ent_iob == 2:
|
||||
elif preset_ent_iob == 1:
|
||||
# Ensure we don't clobber preset entities. If no entity preset,
|
||||
# ent_iob is 0
|
||||
return False
|
||||
|
@ -283,8 +283,8 @@ cdef class Begin:
|
|||
# Otherwise, force acceptance, even if we're across a sentence
|
||||
# boundary or the token is whitespace.
|
||||
return True
|
||||
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
|
||||
# If the next word is B or O, we can't B now
|
||||
elif st.B_(1).ent_iob == 3:
|
||||
# If the next word is B, we can't B now
|
||||
return False
|
||||
elif st.B_(1).sent_start == 1:
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
|
@ -327,6 +327,7 @@ cdef class In:
|
|||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
cdef attr_t preset_ent_label = st.B_(0).ent_type
|
||||
if label == 0:
|
||||
return False
|
||||
elif st.E_(0).ent_type != label:
|
||||
|
@ -336,13 +337,22 @@ cdef class In:
|
|||
elif st.B(1) == -1:
|
||||
# If we're at the end, we can't I.
|
||||
return False
|
||||
elif preset_ent_iob == 2:
|
||||
return False
|
||||
elif preset_ent_iob == 3:
|
||||
return False
|
||||
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3:
|
||||
# If we know the next word is B or O, we can't be I (must be L)
|
||||
elif st.B_(1).ent_iob == 3:
|
||||
# If we know the next word is B, we can't be I (must be L)
|
||||
return False
|
||||
elif preset_ent_iob == 1:
|
||||
if st.B_(1).ent_iob in (0, 2):
|
||||
# if next preset is missing or O, this can't be I (must be L)
|
||||
return False
|
||||
elif label != preset_ent_label:
|
||||
# If label isn't right, reject
|
||||
return False
|
||||
else:
|
||||
# Otherwise, force acceptance, even if we're across a sentence
|
||||
# boundary or the token is whitespace.
|
||||
return True
|
||||
elif st.B(1) != -1 and st.B_(1).sent_start == 1:
|
||||
# Don't allow entities to extend across sentence boundaries
|
||||
return False
|
||||
|
@ -388,17 +398,24 @@ cdef class In:
|
|||
else:
|
||||
return 1
|
||||
|
||||
|
||||
cdef class Last:
|
||||
@staticmethod
|
||||
cdef bint is_valid(const StateC* st, attr_t label) nogil:
|
||||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
cdef attr_t preset_ent_label = st.B_(0).ent_type
|
||||
if label == 0:
|
||||
return False
|
||||
elif not st.entity_is_open():
|
||||
return False
|
||||
elif st.B_(0).ent_iob == 1 and st.B_(1).ent_iob != 1:
|
||||
elif preset_ent_iob == 1 and st.B_(1).ent_iob != 1:
|
||||
# If a preset entity has I followed by not-I, is L
|
||||
return True
|
||||
if label != preset_ent_label:
|
||||
# If label isn't right, reject
|
||||
return False
|
||||
else:
|
||||
# Otherwise, force acceptance, even if we're across a sentence
|
||||
# boundary or the token is whitespace.
|
||||
return True
|
||||
elif st.E_(0).ent_type != label:
|
||||
return False
|
||||
elif st.B_(1).ent_iob == 1:
|
||||
|
@ -451,12 +468,13 @@ cdef class Unit:
|
|||
cdef int preset_ent_iob = st.B_(0).ent_iob
|
||||
cdef attr_t preset_ent_label = st.B_(0).ent_type
|
||||
if label == 0:
|
||||
return False
|
||||
# this is only allowed if it's a preset blocked annotation
|
||||
if preset_ent_label == 0 and preset_ent_iob == 3:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
elif st.entity_is_open():
|
||||
return False
|
||||
elif preset_ent_iob == 2:
|
||||
# Don't clobber preset O
|
||||
return False
|
||||
elif st.B_(1).ent_iob == 1:
|
||||
# If next token is In, we can't be Unit -- must be Begin
|
||||
return False
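The `Unit.is_valid` checks visible in this hunk can be paraphrased in plain Python. This is only a sketch of the lines shown above: `unit_check` and its flat argument list are hypothetical (the Cython code reads these values from `StateC`), and the checks that lie outside the hunk are not reproduced.

```python
INSIDE, BEGIN = 1, 3  # ent_iob codes as used in the hunk above


def unit_check(label, preset_ent_iob, preset_ent_label, entity_is_open, next_ent_iob):
    """Return True/False when a check from the hunk decides, else None."""
    if label == 0:
        # The empty label is only allowed for a preset blocked annotation.
        return preset_ent_label == 0 and preset_ent_iob == BEGIN
    elif entity_is_open:
        return False
    elif next_ent_iob == INSIDE:
        # If the next token is In, this must be Begin rather than Unit.
        return False
    # The remaining checks are outside the hunk and are not reproduced here.
    return None
```

With the preset-O branch removed, a token previously marked `O` no longer rules out `U-<label>` at this point, which matches what `test_overwrite_token` asserts.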
|
||||
|
|
|
@ -130,16 +130,14 @@ cdef class Parser:
|
|||
def __reduce__(self):
|
||||
return (Parser, (self.vocab, self.moves, self.model), None, None)
|
||||
|
||||
@property
|
||||
def tok2vec(self):
|
||||
return self.model.tok2vec
|
||||
|
||||
@property
|
||||
def move_names(self):
|
||||
names = []
|
||||
for i in range(self.moves.n_moves):
|
||||
name = self.moves.move_name(self.moves.c[i].move, self.moves.c[i].label)
|
||||
names.append(name)
|
||||
# Explicitly removing the internal "U-" token used for blocking entities
|
||||
if name != "U-":
|
||||
names.append(name)
|
||||
return names
|
||||
|
||||
nr_feature = 8
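The effect of the `move_names` change is a plain filter over the action names. A sketch with a hypothetical helper name (`visible_move_names` is not a spaCy function):

```python
def visible_move_names(all_names):
    """Drop the internal "U-" action (Unit with an empty label)."""
    return [name for name in all_names if name != "U-"]
```

Only the exact string `"U-"` is dropped; labelled Unit actions such as `"U-GPE"` are kept.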
|
||||
|
|
|
@ -16,10 +16,23 @@ def test_doc_add_entities_set_ents_iob(en_vocab):
|
|||
ner(doc)
|
||||
assert len(list(doc.ents)) == 0
|
||||
assert [w.ent_iob_ for w in doc] == (["O"] * len(doc))
|
||||
|
||||
doc.ents = [(doc.vocab.strings["ANIMAL"], 3, 4)]
|
||||
assert [w.ent_iob_ for w in doc] == ["", "", "", "B"]
|
||||
assert [w.ent_iob_ for w in doc] == ["O", "O", "O", "B"]
|
||||
|
||||
doc.ents = [(doc.vocab.strings["WORD"], 0, 2)]
|
||||
assert [w.ent_iob_ for w in doc] == ["B", "I", "", ""]
|
||||
assert [w.ent_iob_ for w in doc] == ["B", "I", "O", "O"]
|
||||
|
||||
|
||||
def test_ents_reset(en_vocab):
|
||||
text = ["This", "is", "a", "lion"]
|
||||
doc = get_doc(en_vocab, text)
|
||||
ner = EntityRecognizer(en_vocab)
|
||||
ner.begin_training([])
|
||||
ner(doc)
|
||||
assert [t.ent_iob_ for t in doc] == (["O"] * len(doc))
|
||||
doc.ents = list(doc.ents)
|
||||
assert [t.ent_iob_ for t in doc] == (["O"] * len(doc))
|
||||
|
||||
|
||||
def test_add_overlapping_entities(en_vocab):
|
||||
|
|
|
@ -2,7 +2,9 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.pipeline import EntityRecognizer
|
||||
from spacy.lang.en import English
|
||||
|
||||
from spacy.pipeline import EntityRecognizer, EntityRuler
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.syntax.ner import BiluoPushDown
|
||||
from spacy.gold import GoldParse
|
||||
|
@ -115,7 +117,6 @@ def test_oracle_moves_missing_B(en_vocab):
|
|||
moves.add_action(move_types.index("U"), label)
|
||||
moves.preprocess_gold(gold)
|
||||
seq = moves.get_oracle_sequence(doc, gold)
|
||||
print(seq)
|
||||
|
||||
|
||||
def test_oracle_moves_whitespace(en_vocab):
|
||||
|
@ -137,3 +138,147 @@ def test_oracle_moves_whitespace(en_vocab):
|
|||
moves.add_action(move_types.index(action), label)
|
||||
moves.preprocess_gold(gold)
|
||||
moves.get_oracle_sequence(doc, gold)
|
||||
|
||||
|
||||
def test_accept_blocked_token():
|
||||
"""Test successful blocking of tokens to be in an entity."""
|
||||
# 1. test normal behaviour
|
||||
nlp1 = English()
|
||||
doc1 = nlp1("I live in New York")
|
||||
ner1 = EntityRecognizer(doc1.vocab)
|
||||
assert [token.ent_iob_ for token in doc1] == ["", "", "", "", ""]
|
||||
assert [token.ent_type_ for token in doc1] == ["", "", "", "", ""]
|
||||
|
||||
# Add the OUT action
|
||||
ner1.moves.add_action(5, "")
|
||||
ner1.add_label("GPE")
|
||||
# Get into the state just before "New"
|
||||
state1 = ner1.moves.init_batch([doc1])[0]
|
||||
ner1.moves.apply_transition(state1, "O")
|
||||
ner1.moves.apply_transition(state1, "O")
|
||||
ner1.moves.apply_transition(state1, "O")
|
||||
# Check that B-GPE is valid.
|
||||
assert ner1.moves.is_valid(state1, "B-GPE")
|
||||
|
||||
# 2. test blocking behaviour
|
||||
nlp2 = English()
|
||||
doc2 = nlp2("I live in New York")
|
||||
ner2 = EntityRecognizer(doc2.vocab)
|
||||
|
||||
# set "New York" to a blocked entity
|
||||
doc2.ents = [(0, 3, 5)]
|
||||
assert [token.ent_iob_ for token in doc2] == ["", "", "", "B", "B"]
|
||||
assert [token.ent_type_ for token in doc2] == ["", "", "", "", ""]
|
||||
|
||||
# Check that B-GPE is now invalid.
|
||||
ner2.moves.add_action(4, "")
|
||||
ner2.moves.add_action(5, "")
|
||||
ner2.add_label("GPE")
|
||||
state2 = ner2.moves.init_batch([doc2])[0]
|
||||
ner2.moves.apply_transition(state2, "O")
|
||||
ner2.moves.apply_transition(state2, "O")
|
||||
ner2.moves.apply_transition(state2, "O")
|
||||
# we can only use U- for "New"
|
||||
assert not ner2.moves.is_valid(state2, "B-GPE")
|
||||
assert ner2.moves.is_valid(state2, "U-")
|
||||
ner2.moves.apply_transition(state2, "U-")
|
||||
# we can only use U- for "York"
|
||||
assert not ner2.moves.is_valid(state2, "B-GPE")
|
||||
assert ner2.moves.is_valid(state2, "U-")
|
||||
|
||||
|
||||
def test_overwrite_token():
|
||||
nlp = English()
|
||||
ner1 = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner1, name="ner")
|
||||
nlp.begin_training()
|
||||
|
||||
# The untrained NER will predict O for each token
|
||||
doc = nlp("I live in New York")
|
||||
assert [token.ent_iob_ for token in doc] == ["O", "O", "O", "O", "O"]
|
||||
assert [token.ent_type_ for token in doc] == ["", "", "", "", ""]
|
||||
|
||||
# Check that a new ner can overwrite O
|
||||
ner2 = EntityRecognizer(doc.vocab)
|
||||
ner2.moves.add_action(5, "")
|
||||
ner2.add_label("GPE")
|
||||
state = ner2.moves.init_batch([doc])[0]
|
||||
assert ner2.moves.is_valid(state, "B-GPE")
|
||||
assert ner2.moves.is_valid(state, "U-GPE")
|
||||
ner2.moves.apply_transition(state, "B-GPE")
|
||||
assert ner2.moves.is_valid(state, "I-GPE")
|
||||
assert ner2.moves.is_valid(state, "L-GPE")
|
||||
|
||||
|
||||
def test_ruler_before_ner():
|
||||
""" Test that an NER works after an entity_ruler: the second can add annotations """
|
||||
nlp = English()
|
||||
|
||||
# 1 : Entity Ruler - should set "this" to B and everything else to empty
|
||||
ruler = EntityRuler(nlp)
|
||||
patterns = [{"label": "THING", "pattern": "This"}]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
# 2: untrained NER - should set everything else to O
|
||||
untrained_ner = nlp.create_pipe("ner")
|
||||
untrained_ner.add_label("MY_LABEL")
|
||||
nlp.add_pipe(untrained_ner)
|
||||
nlp.begin_training()
|
||||
|
||||
doc = nlp("This is Antti Korhonen speaking in Finland")
|
||||
expected_iobs = ["B", "O", "O", "O", "O", "O", "O"]
|
||||
expected_types = ["THING", "", "", "", "", "", ""]
|
||||
assert [token.ent_iob_ for token in doc] == expected_iobs
|
||||
assert [token.ent_type_ for token in doc] == expected_types
|
||||
|
||||
|
||||
def test_ner_before_ruler():
|
||||
""" Test that an entity_ruler works after an NER: the second can overwrite O annotations """
|
||||
nlp = English()
|
||||
|
||||
# 1: untrained NER - should set everything to O
|
||||
untrained_ner = nlp.create_pipe("ner")
|
||||
untrained_ner.add_label("MY_LABEL")
|
||||
nlp.add_pipe(untrained_ner, name="uner")
|
||||
nlp.begin_training()
|
||||
|
||||
# 2 : Entity Ruler - should set "this" to B and keep everything else O
|
||||
ruler = EntityRuler(nlp)
|
||||
patterns = [{"label": "THING", "pattern": "This"}]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
doc = nlp("This is Antti Korhonen speaking in Finland")
|
||||
expected_iobs = ["B", "O", "O", "O", "O", "O", "O"]
|
||||
expected_types = ["THING", "", "", "", "", "", ""]
|
||||
assert [token.ent_iob_ for token in doc] == expected_iobs
|
||||
assert [token.ent_type_ for token in doc] == expected_types
|
||||
|
||||
|
||||
def test_block_ner():
|
||||
""" Test functionality for blocking tokens so they can't be in a named entity """
|
||||
# block "Antti L Korhonen" from being a named entity
|
||||
nlp = English()
|
||||
nlp.add_pipe(BlockerComponent1(2, 5))
|
||||
untrained_ner = nlp.create_pipe("ner")
|
||||
untrained_ner.add_label("MY_LABEL")
|
||||
nlp.add_pipe(untrained_ner, name="uner")
|
||||
nlp.begin_training()
|
||||
doc = nlp("This is Antti L Korhonen speaking in Finland")
|
||||
expected_iobs = ["O", "O", "B", "B", "B", "O", "O", "O"]
|
||||
expected_types = ["", "", "", "", "", "", "", ""]
|
||||
assert [token.ent_iob_ for token in doc] == expected_iobs
|
||||
assert [token.ent_type_ for token in doc] == expected_types
|
||||
|
||||
|
||||
class BlockerComponent1(object):
|
||||
name = "my_blocker"
|
||||
|
||||
def __init__(self, start, end):
|
||||
self.start = start
|
||||
self.end = end
|
||||
|
||||
def __call__(self, doc):
|
||||
doc.ents = [(0, self.start, self.end)]
|
||||
return doc
|
||||
|
|
|
@ -426,7 +426,7 @@ def test_issue957(en_tokenizer):
|
|||
def test_issue999(train_data):
|
||||
"""Test that adding entities and resuming training works passably OK.
|
||||
There are two issues here:
|
||||
1) We have to readd labels. This isn't very nice.
|
||||
1) We have to re-add labels. This isn't very nice.
|
||||
2) There's no way to set the learning rate for the weight update, so we
|
||||
end up out-of-scale, causing it to learn too fast.
|
||||
"""
|
||||
|
|
42
spacy/tests/regression/test_issue4267.py
Normal file
|
@ -0,0 +1,42 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
import spacy
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.pipeline import EntityRuler
|
||||
from spacy.tokens import Span
|
||||
|
||||
|
||||
def test_issue4267():
|
||||
""" Test that running an entity_ruler after ner gives consistent results"""
|
||||
nlp = English()
|
||||
ner = nlp.create_pipe("ner")
|
||||
ner.add_label("PEOPLE")
|
||||
nlp.add_pipe(ner)
|
||||
nlp.begin_training()
|
||||
|
||||
assert "ner" in nlp.pipe_names
|
||||
|
||||
# assert that we have correct IOB annotations
|
||||
doc1 = nlp("hi")
|
||||
assert doc1.is_nered
|
||||
for token in doc1:
|
||||
assert token.ent_iob == 2
|
||||
|
||||
# add entity ruler and run again
|
||||
ruler = EntityRuler(nlp)
|
||||
patterns = [{"label": "SOFTWARE", "pattern": "spacy"}]
|
||||
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
assert "entity_ruler" in nlp.pipe_names
|
||||
assert "ner" in nlp.pipe_names
|
||||
|
||||
# assert that we still have correct IOB annotations
|
||||
doc2 = nlp("hi")
|
||||
assert doc2.is_nered
|
||||
for token in doc2:
|
||||
assert token.ent_iob == 2
|
|
@ -13,7 +13,7 @@ class DummyPipe(Pipe):
|
|||
def predict(self, docs):
|
||||
return ([1, 2, 3], [4, 5, 6])
|
||||
|
||||
def set_annotations(self, docs, scores, tensor=None):
|
||||
def set_annotations(self, docs, scores, tensors=None):
|
||||
return docs
|
||||
|
||||
|
||||
|
|
|
@ -146,11 +146,12 @@ def _merge(Doc doc, merges):
|
|||
syntactic root of the span.
|
||||
RETURNS (Token): The first newly merged token.
|
||||
"""
|
||||
cdef int i, merge_index, start, end, token_index
|
||||
cdef int i, merge_index, start, end, token_index, current_span_index, current_offset, offset, span_index
|
||||
cdef Span span
|
||||
cdef const LexemeC* lex
|
||||
cdef TokenC* token
|
||||
cdef Pool mem = Pool()
|
||||
cdef int merged_iob = 0
|
||||
tokens = <TokenC**>mem.alloc(len(merges), sizeof(TokenC))
|
||||
spans = []
|
||||
|
||||
|
|
|
@ -256,7 +256,7 @@ cdef class Doc:
|
|||
def is_nered(self):
|
||||
"""Check if the document has named entities set. Will return True if
|
||||
*any* of the tokens has a named entity tag set (even if the others are
|
||||
uknown values).
|
||||
unknown values).
|
||||
"""
|
||||
if len(self) == 0:
|
||||
return True
|
||||
|
@ -525,13 +525,11 @@ cdef class Doc:
|
|||
|
||||
def __set__(self, ents):
|
||||
# TODO:
|
||||
# 1. Allow negative matches
|
||||
# 2. Ensure pre-set NERs are not over-written during statistical
|
||||
# prediction
|
||||
# 3. Test basic data-driven ORTH gazetteer
|
||||
# 4. Test more nuanced date and currency regex
|
||||
# 1. Test basic data-driven ORTH gazetteer
|
||||
# 2. Test more nuanced date and currency regex
|
||||
tokens_in_ents = {}
|
||||
cdef attr_t entity_type
|
||||
cdef attr_t kb_id
|
||||
cdef int ent_start, ent_end
|
||||
for ent_info in ents:
|
||||
entity_type, kb_id, ent_start, ent_end = get_entity_info(ent_info)
|
||||
|
@ -545,27 +543,31 @@ cdef class Doc:
|
|||
tokens_in_ents[token_index] = (ent_start, ent_end, entity_type, kb_id)
|
||||
cdef int i
|
||||
for i in range(self.length):
|
||||
self.c[i].ent_type = 0
|
||||
self.c[i].ent_kb_id = 0
|
||||
self.c[i].ent_iob = 0 # Means missing.
|
||||
cdef attr_t ent_type
|
||||
cdef int start, end
|
||||
for ent_info in ents:
|
||||
ent_type, ent_kb_id, start, end = get_entity_info(ent_info)
|
||||
if ent_type is None or ent_type < 0:
|
||||
# Mark as O
|
||||
for i in range(start, end):
|
||||
self.c[i].ent_type = 0
|
||||
self.c[i].ent_kb_id = 0
|
||||
self.c[i].ent_iob = 2
|
||||
else:
|
||||
# Mark (inside) as I
|
||||
for i in range(start, end):
|
||||
self.c[i].ent_type = ent_type
|
||||
self.c[i].ent_kb_id = ent_kb_id
|
||||
self.c[i].ent_iob = 1
|
||||
# Set start as B
|
||||
self.c[start].ent_iob = 3
|
||||
# default values
|
||||
entity_type = 0
|
||||
kb_id = 0
|
||||
|
||||
# Set ent_iob to Missing (0) by default unless this token was nered before
|
||||
ent_iob = 0
|
||||
if self.c[i].ent_iob != 0:
|
||||
ent_iob = 2
|
||||
|
||||
# overwrite if the token was part of a specified entity
|
||||
if i in tokens_in_ents.keys():
|
||||
ent_start, ent_end, entity_type, kb_id = tokens_in_ents[i]
|
||||
if entity_type is None or entity_type <= 0:
|
||||
# Blocking this token from being overwritten by downstream NER
|
||||
ent_iob = 3
|
||||
elif ent_start == i:
|
||||
# Marking the start of an entity
|
||||
ent_iob = 3
|
||||
else:
|
||||
# Marking the inside of an entity
|
||||
ent_iob = 1
|
||||
|
||||
self.c[i].ent_type = entity_type
|
||||
self.c[i].ent_kb_id = kb_id
|
||||
self.c[i].ent_iob = ent_iob
|
||||
|
||||
@property
|
||||
def noun_chunks(self):
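The new per-token logic in the `Doc.ents` setter can be followed more easily in plain Python. The sketch below is an approximation under stated assumptions: `assign_ent_iob` is a hypothetical helper, entities are given as `(entity_type, start, end)` tuples, and the `kb_id` handling and the overlap check from the real setter are left out.

```python
MISSING, INSIDE, OUTSIDE, BEGIN = 0, 1, 2, 3  # ent_iob codes used in the diff


def assign_ent_iob(previous_iob, ents):
    """Return a list of (ent_iob, ent_type) pairs, one per token.

    previous_iob: the ent_iob code each token had before the assignment.
    ents: (entity_type, start, end) tuples; entity_type <= 0 marks a block.
    """
    tokens_in_ents = {}
    for entity_type, start, end in ents:
        for i in range(start, end):
            tokens_in_ents[i] = (start, end, entity_type)

    result = []
    for i, old_iob in enumerate(previous_iob):
        entity_type = 0
        # Missing stays missing; anything annotated before becomes O.
        ent_iob = OUTSIDE if old_iob != MISSING else MISSING
        if i in tokens_in_ents:
            start, end, entity_type = tokens_in_ents[i]
            if entity_type is None or entity_type <= 0:
                ent_iob = BEGIN  # blocked from downstream NER
            elif start == i:
                ent_iob = BEGIN  # start of an entity
            else:
                ent_iob = INSIDE  # inside an entity
        result.append((ent_iob, entity_type))
    return result
```

For example, `assign_ent_iob([0, 0, 0, 0, 0], [(0, 3, 5)])` marks the last two tokens as `B` with an empty type and leaves the rest missing, which is the blocked state checked in `test_accept_blocked_token`.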
|
||||
|
|
|
@ -754,7 +754,8 @@ cdef class Token:
|
|||
def ent_iob_(self):
|
||||
"""IOB code of named entity tag. "B" means the token begins an entity,
|
||||
"I" means it is inside an entity, "O" means it is outside an entity,
|
||||
and "" means no entity tag is set.
|
||||
and "" means no entity tag is set. "B" with an empty ent_type
|
||||
means that the token is blocked from further processing by NER.
|
||||
|
||||
RETURNS (unicode): IOB code of named entity tag.
|
||||
"""
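Read from the consumer side, the docstring describes four string codes plus one special combination. A small sketch, assuming the integer-to-string order `("", "I", "O", "B")` implied by the codes 0 to 3 used elsewhere in this commit; `describe_token` is a hypothetical helper:

```python
IOB_STRINGS = ("", "I", "O", "B")  # index is the integer ent_iob code


def describe_token(ent_iob, ent_type):
    """Summarise a token's entity state from its ent_iob code and ent_type."""
    code = IOB_STRINGS[ent_iob]
    if code == "B" and not ent_type:
        # "B" with an empty ent_type: blocked from further NER processing.
        return "blocked"
    return {"": "unset", "I": "inside", "O": "outside", "B": "begin"}[code]
```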
|
||||
|
|
|
@ -588,8 +588,8 @@ data.
|
|||
```python
|
||||
### Entry structure
|
||||
{
|
||||
"orth": string,
|
||||
"id": int,
|
||||
"orth": string, # the word text
|
||||
"id": int, # can correspond to row in vectors table
|
||||
"lower": string,
|
||||
"norm": string,
|
||||
"shape": string
|
||||
|
|
|
@ -181,6 +181,7 @@ All output files generated by this command are compatible with
|
|||
| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. |
|
||||
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
||||
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
|
||||
<<<<<<< HEAD
|
||||
|
||||
## Debug data {#debug-data new="2.2"}
|
||||
|
||||
|
@ -341,6 +342,8 @@ will not be available.
|
|||
```
|
||||
|
||||
</Accordion>
|
||||
=======
|
||||
>>>>>>> master
|
||||
|
||||
## Train {#train}
|
||||
|
||||
|
@ -512,14 +515,17 @@ tokenization can be provided.
|
|||
|
||||
Create a new model directory from raw data, like word frequencies, Brown
|
||||
clusters and word vectors. This command is similar to the `spacy model` command
|
||||
in v1.x.
|
||||
in v1.x. Note that in order to populate the model's vocab, you need to pass in a
|
||||
JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) as
|
||||
`--jsonl-loc` with optional `id` values that correspond to the vectors table.
|
||||
Just loading in vectors will not automatically populate the vocab.
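As a rough illustration, the lines of such a vocabulary file could be produced as follows. The sketch uses only the `orth`, `id` and `lower` keys from the entry structure in the annotation specs; `vocab_jsonl_lines` is a hypothetical helper, and which attributes `init-model` actually needs is not stated here.

```python
import json


def vocab_jsonl_lines(words):
    """Build one JSON object per word; `id` is the assumed vectors-table row."""
    lines = []
    for row, word in enumerate(words):
        entry = {"orth": word, "id": row, "lower": word.lower()}
        lines.append(json.dumps(entry, sort_keys=True))
    return lines
```

Writing these lines to a file, one per line, gives something that can be passed as `--jsonl-loc`.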
|
||||
|
||||
<Infobox title="Deprecation note" variant="warning">
|
||||
|
||||
As of v2.1.0, the `--freqs-loc` and `--clusters-loc` are deprecated and have
|
||||
been replaced with the `--jsonl-loc` argument, which lets you pass in a
|
||||
[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one
|
||||
lexical entry per line. For more details on the format, see the
|
||||
[JSONL](http://jsonlines.org/) file containing one lexical entry per line. For
|
||||
more details on the format, see the
|
||||
[annotation specs](/api/annotation#vocab-jsonl).
|
||||
|
||||
</Infobox>
|
||||
|
@ -529,14 +535,14 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
|
|||
[--prune-vectors]
|
||||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| ----------------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. |
|
||||
| `output_dir` | positional | Model output directory. Will be created if it doesn't exist. |
|
||||
| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted vocabulary file with lexical attributes. |
|
||||
| `--vectors-loc`, `-v` | option | Optional location of vectors file. Should be a tab-separated file in Word2Vec format where the first column contains the word and the remaining columns the values. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. |
|
||||
| `--prune-vectors`, `-V` | flag | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. |
|
||||
| **CREATES** | model | A spaCy model containing the vocab and vectors. |
|
||||
| Argument | Type | Description |
|
||||
| ----------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. |
|
||||
| `output_dir` | positional | Model output directory. Will be created if it doesn't exist. |
|
||||
| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. |
|
||||
| `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. |
|
||||
| `--prune-vectors`, `-V` | flag | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. |
|
||||
| **CREATES** | model | A spaCy model containing the vocab and vectors. |
|
||||
|
||||
## Evaluate {#evaluate new="2"}
|
||||
|
||||
|
|