Merge changes to test_ner

Matthew Honnibal 2019-09-18 21:41:24 +02:00
commit 46c02d25b1
15 changed files with 405 additions and 78 deletions

.github/contributors/Hazoom.md (new file, 106 lines added)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Moshe Hazoom |
| Company name (if applicable) | Amenity Analytics |
| Title or role (if applicable) | NLP Engineer |
| Date | 2019-09-15 |
| GitHub username | Hazoom |
| Website (optional) | |


@ -120,7 +120,7 @@ class Errors(object):
E011 = ("Unknown operator: '{op}'. Options: {opts}") E011 = ("Unknown operator: '{op}'. Options: {opts}")
E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}") E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}")
E013 = ("Error selecting action in matcher") E013 = ("Error selecting action in matcher")
E014 = ("Uknown tag ID: {tag}") E014 = ("Unknown tag ID: {tag}")
E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use " E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use "
"`force=True` to overwrite.") "`force=True` to overwrite.")
E016 = ("MultitaskObjective target should be function or one of: dep, " E016 = ("MultitaskObjective target should be function or one of: dep, "


@ -69,7 +69,7 @@ class Pipe(object):
predictions = self.predict([doc]) predictions = self.predict([doc])
if isinstance(predictions, tuple) and len(predictions) == 2: if isinstance(predictions, tuple) and len(predictions) == 2:
scores, tensors = predictions scores, tensors = predictions
self.set_annotations([doc], scores, tensor=tensors) self.set_annotations([doc], scores, tensors=tensors)
else: else:
self.set_annotations([doc], predictions) self.set_annotations([doc], predictions)
return doc return doc
@ -90,7 +90,7 @@ class Pipe(object):
predictions = self.predict(docs) predictions = self.predict(docs)
if isinstance(predictions, tuple) and len(predictions) == 2: if isinstance(predictions, tuple) and len(predictions) == 2:
scores, tensors = predictions scores, tensors = predictions
self.set_annotations(docs, scores, tensor=tensors) self.set_annotations(docs, scores, tensors=tensors)
else: else:
self.set_annotations(docs, predictions) self.set_annotations(docs, predictions)
yield from docs yield from docs
@ -937,11 +937,6 @@ class TextCategorizer(Pipe):
def labels(self, value): def labels(self, value):
self.cfg["labels"] = tuple(value) self.cfg["labels"] = tuple(value)
def __call__(self, doc):
scores, tensors = self.predict([doc])
self.set_annotations([doc], scores, tensors=tensors)
return doc
def pipe(self, stream, batch_size=128, n_threads=-1): def pipe(self, stream, batch_size=128, n_threads=-1):
for docs in util.minibatch(stream, size=batch_size): for docs in util.minibatch(stream, size=batch_size):
docs = list(docs) docs = list(docs)


@ -66,7 +66,8 @@ cdef class BiluoPushDown(TransitionSystem):
UNIT: Counter(), UNIT: Counter(),
OUT: Counter() OUT: Counter()
} }
actions[OUT][''] = 1 actions[OUT][''] = 1 # Represents a token predicted to be outside of any entity
actions[UNIT][''] = 1 # Represents a token prohibited to be in an entity
for entity_type in kwargs.get('entity_types', []): for entity_type in kwargs.get('entity_types', []):
for action in (BEGIN, IN, LAST, UNIT): for action in (BEGIN, IN, LAST, UNIT):
actions[action][entity_type] = 1 actions[action][entity_type] = 1
@ -162,7 +163,6 @@ cdef class BiluoPushDown(TransitionSystem):
for i in range(self.n_moves): for i in range(self.n_moves):
if self.c[i].move == move and self.c[i].label == label: if self.c[i].move == move and self.c[i].label == label:
return self.c[i] return self.c[i]
else:
raise KeyError(Errors.E022.format(name=name)) raise KeyError(Errors.E022.format(name=name))
cdef Transition init_transition(self, int clas, int move, attr_t label) except *: cdef Transition init_transition(self, int clas, int move, attr_t label) except *:
@ -267,7 +267,7 @@ cdef class Begin:
return False return False
elif label == 0: elif label == 0:
return False return False
elif preset_ent_iob == 1 or preset_ent_iob == 2: elif preset_ent_iob == 1:
# Ensure we don't clobber preset entities. If no entity preset, # Ensure we don't clobber preset entities. If no entity preset,
# ent_iob is 0 # ent_iob is 0
return False return False
@ -283,8 +283,8 @@ cdef class Begin:
# Otherwise, force acceptance, even if we're across a sentence # Otherwise, force acceptance, even if we're across a sentence
# boundary or the token is whitespace. # boundary or the token is whitespace.
return True return True
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3: elif st.B_(1).ent_iob == 3:
# If the next word is B or O, we can't B now # If the next word is B, we can't B now
return False return False
elif st.B_(1).sent_start == 1: elif st.B_(1).sent_start == 1:
# Don't allow entities to extend across sentence boundaries # Don't allow entities to extend across sentence boundaries
@ -327,6 +327,7 @@ cdef class In:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0: if label == 0:
return False return False
elif st.E_(0).ent_type != label: elif st.E_(0).ent_type != label:
@ -336,13 +337,22 @@ cdef class In:
elif st.B(1) == -1: elif st.B(1) == -1:
# If we're at the end, we can't I. # If we're at the end, we can't I.
return False return False
elif preset_ent_iob == 2:
return False
elif preset_ent_iob == 3: elif preset_ent_iob == 3:
return False return False
elif st.B_(1).ent_iob == 2 or st.B_(1).ent_iob == 3: elif st.B_(1).ent_iob == 3:
# If we know the next word is B or O, we can't be I (must be L) # If we know the next word is B, we can't be I (must be L)
return False return False
elif preset_ent_iob == 1:
if st.B_(1).ent_iob in (0, 2):
# if next preset is missing or O, this can't be I (must be L)
return False
elif label != preset_ent_label:
# If label isn't right, reject
return False
else:
# Otherwise, force acceptance, even if we're across a sentence
# boundary or the token is whitespace.
return True
elif st.B(1) != -1 and st.B_(1).sent_start == 1: elif st.B(1) != -1 and st.B_(1).sent_start == 1:
# Don't allow entities to extend across sentence boundaries # Don't allow entities to extend across sentence boundaries
return False return False
@ -388,16 +398,23 @@ cdef class In:
else: else:
return 1 return 1
cdef class Last: cdef class Last:
@staticmethod @staticmethod
cdef bint is_valid(const StateC* st, attr_t label) nogil: cdef bint is_valid(const StateC* st, attr_t label) nogil:
cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0: if label == 0:
return False return False
elif not st.entity_is_open(): elif not st.entity_is_open():
return False return False
elif st.B_(0).ent_iob == 1 and st.B_(1).ent_iob != 1: elif preset_ent_iob == 1 and st.B_(1).ent_iob != 1:
# If a preset entity has I followed by not-I, is L # If a preset entity has I followed by not-I, is L
if label != preset_ent_label:
# If label isn't right, reject
return False
else:
# Otherwise, force acceptance, even if we're across a sentence
# boundary or the token is whitespace.
return True return True
elif st.E_(0).ent_type != label: elif st.E_(0).ent_type != label:
return False return False
@ -451,12 +468,13 @@ cdef class Unit:
cdef int preset_ent_iob = st.B_(0).ent_iob cdef int preset_ent_iob = st.B_(0).ent_iob
cdef attr_t preset_ent_label = st.B_(0).ent_type cdef attr_t preset_ent_label = st.B_(0).ent_type
if label == 0: if label == 0:
# this is only allowed if it's a preset blocked annotation
if preset_ent_label == 0 and preset_ent_iob == 3:
return True
else:
return False return False
elif st.entity_is_open(): elif st.entity_is_open():
return False return False
elif preset_ent_iob == 2:
# Don't clobber preset O
return False
elif st.B_(1).ent_iob == 1: elif st.B_(1).ent_iob == 1:
# If next token is In, we can't be Unit -- must be Begin # If next token is In, we can't be Unit -- must be Begin
return False return False
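
Note on the preset checks in the hunks above: they compare `ent_iob` against bare integers. The following is a minimal pure-Python sketch of how those codes can be read, assuming the mapping given in the `Doc.ents` setter comments in this same commit (0 missing, 1 I, 2 O, 3 B). The helper names `classify_preset` and `unit_valid_with_empty_label` are hypothetical and are not part of spaCy.

```python
# Integer IOB codes as used by the is_valid checks (assumed mapping, see lead-in).
MISSING, INSIDE, OUTSIDE, BEGIN = 0, 1, 2, 3


def classify_preset(ent_iob, ent_type):
    """Hypothetical helper: name the preset annotation found on a token."""
    if ent_iob == BEGIN and ent_type == 0:
        # "B" with an empty entity type is the blocked-token convention.
        return "blocked"
    names = {MISSING: "missing", INSIDE: "inside", OUTSIDE: "outside", BEGIN: "begin"}
    return names[ent_iob]


def unit_valid_with_empty_label(ent_iob, ent_type):
    """Mirror of the `label == 0` branch added to Unit.is_valid:
    the empty-label U move is only accepted on a preset blocked token."""
    return ent_type == 0 and ent_iob == BEGIN
```

Under this reading, a preset O (code 2) no longer blocks any move, and only the B-with-empty-type combination forces the internal "U-" action.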


@ -130,15 +130,13 @@ cdef class Parser:
def __reduce__(self): def __reduce__(self):
return (Parser, (self.vocab, self.moves, self.model), None, None) return (Parser, (self.vocab, self.moves, self.model), None, None)
@property
def tok2vec(self):
return self.model.tok2vec
@property @property
def move_names(self): def move_names(self):
names = [] names = []
for i in range(self.moves.n_moves): for i in range(self.moves.n_moves):
name = self.moves.move_name(self.moves.c[i].move, self.moves.c[i].label) name = self.moves.move_name(self.moves.c[i].move, self.moves.c[i].label)
# Explicitly removing the internal "U-" token used for blocking entities
if name != "U-":
names.append(name) names.append(name)
return names return names


@ -16,10 +16,23 @@ def test_doc_add_entities_set_ents_iob(en_vocab):
ner(doc) ner(doc)
assert len(list(doc.ents)) == 0 assert len(list(doc.ents)) == 0
assert [w.ent_iob_ for w in doc] == (["O"] * len(doc)) assert [w.ent_iob_ for w in doc] == (["O"] * len(doc))
doc.ents = [(doc.vocab.strings["ANIMAL"], 3, 4)] doc.ents = [(doc.vocab.strings["ANIMAL"], 3, 4)]
assert [w.ent_iob_ for w in doc] == ["", "", "", "B"] assert [w.ent_iob_ for w in doc] == ["O", "O", "O", "B"]
doc.ents = [(doc.vocab.strings["WORD"], 0, 2)] doc.ents = [(doc.vocab.strings["WORD"], 0, 2)]
assert [w.ent_iob_ for w in doc] == ["B", "I", "", ""] assert [w.ent_iob_ for w in doc] == ["B", "I", "O", "O"]
def test_ents_reset(en_vocab):
text = ["This", "is", "a", "lion"]
doc = get_doc(en_vocab, text)
ner = EntityRecognizer(en_vocab)
ner.begin_training([])
ner(doc)
assert [t.ent_iob_ for t in doc] == (["O"] * len(doc))
doc.ents = list(doc.ents)
assert [t.ent_iob_ for t in doc] == (["O"] * len(doc))
def test_add_overlapping_entities(en_vocab): def test_add_overlapping_entities(en_vocab):


@ -2,7 +2,9 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
from spacy.pipeline import EntityRecognizer from spacy.lang.en import English
from spacy.pipeline import EntityRecognizer, EntityRuler
from spacy.vocab import Vocab from spacy.vocab import Vocab
from spacy.syntax.ner import BiluoPushDown from spacy.syntax.ner import BiluoPushDown
from spacy.gold import GoldParse from spacy.gold import GoldParse
@ -115,7 +117,6 @@ def test_oracle_moves_missing_B(en_vocab):
moves.add_action(move_types.index("U"), label) moves.add_action(move_types.index("U"), label)
moves.preprocess_gold(gold) moves.preprocess_gold(gold)
seq = moves.get_oracle_sequence(doc, gold) seq = moves.get_oracle_sequence(doc, gold)
print(seq)
def test_oracle_moves_whitespace(en_vocab): def test_oracle_moves_whitespace(en_vocab):
@ -137,3 +138,147 @@ def test_oracle_moves_whitespace(en_vocab):
moves.add_action(move_types.index(action), label) moves.add_action(move_types.index(action), label)
moves.preprocess_gold(gold) moves.preprocess_gold(gold)
moves.get_oracle_sequence(doc, gold) moves.get_oracle_sequence(doc, gold)
def test_accept_blocked_token():
"""Test successful blocking of tokens to be in an entity."""
# 1. test normal behaviour
nlp1 = English()
doc1 = nlp1("I live in New York")
ner1 = EntityRecognizer(doc1.vocab)
assert [token.ent_iob_ for token in doc1] == ["", "", "", "", ""]
assert [token.ent_type_ for token in doc1] == ["", "", "", "", ""]
# Add the OUT action
ner1.moves.add_action(5, "")
ner1.add_label("GPE")
# Get into the state just before "New"
state1 = ner1.moves.init_batch([doc1])[0]
ner1.moves.apply_transition(state1, "O")
ner1.moves.apply_transition(state1, "O")
ner1.moves.apply_transition(state1, "O")
# Check that B-GPE is valid.
assert ner1.moves.is_valid(state1, "B-GPE")
# 2. test blocking behaviour
nlp2 = English()
doc2 = nlp2("I live in New York")
ner2 = EntityRecognizer(doc2.vocab)
# set "New York" to a blocked entity
doc2.ents = [(0, 3, 5)]
assert [token.ent_iob_ for token in doc2] == ["", "", "", "B", "B"]
assert [token.ent_type_ for token in doc2] == ["", "", "", "", ""]
# Check that B-GPE is now invalid.
ner2.moves.add_action(4, "")
ner2.moves.add_action(5, "")
ner2.add_label("GPE")
state2 = ner2.moves.init_batch([doc2])[0]
ner2.moves.apply_transition(state2, "O")
ner2.moves.apply_transition(state2, "O")
ner2.moves.apply_transition(state2, "O")
# we can only use U- for "New"
assert not ner2.moves.is_valid(state2, "B-GPE")
assert ner2.moves.is_valid(state2, "U-")
ner2.moves.apply_transition(state2, "U-")
# we can only use U- for "York"
assert not ner2.moves.is_valid(state2, "B-GPE")
assert ner2.moves.is_valid(state2, "U-")
def test_overwrite_token():
nlp = English()
ner1 = nlp.create_pipe("ner")
nlp.add_pipe(ner1, name="ner")
nlp.begin_training()
# The untrained NER will predict O for each token
doc = nlp("I live in New York")
assert [token.ent_iob_ for token in doc] == ["O", "O", "O", "O", "O"]
assert [token.ent_type_ for token in doc] == ["", "", "", "", ""]
# Check that a new ner can overwrite O
ner2 = EntityRecognizer(doc.vocab)
ner2.moves.add_action(5, "")
ner2.add_label("GPE")
state = ner2.moves.init_batch([doc])[0]
assert ner2.moves.is_valid(state, "B-GPE")
assert ner2.moves.is_valid(state, "U-GPE")
ner2.moves.apply_transition(state, "B-GPE")
assert ner2.moves.is_valid(state, "I-GPE")
assert ner2.moves.is_valid(state, "L-GPE")
def test_ruler_before_ner():
""" Test that an NER works after an entity_ruler: the second can add annotations """
nlp = English()
# 1 : Entity Ruler - should set "this" to B and everything else to empty
ruler = EntityRuler(nlp)
patterns = [{"label": "THING", "pattern": "This"}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
# 2: untrained NER - should set everything else to O
untrained_ner = nlp.create_pipe("ner")
untrained_ner.add_label("MY_LABEL")
nlp.add_pipe(untrained_ner)
nlp.begin_training()
doc = nlp("This is Antti Korhonen speaking in Finland")
expected_iobs = ["B", "O", "O", "O", "O", "O", "O"]
expected_types = ["THING", "", "", "", "", "", ""]
assert [token.ent_iob_ for token in doc] == expected_iobs
assert [token.ent_type_ for token in doc] == expected_types
def test_ner_before_ruler():
""" Test that an entity_ruler works after an NER: the second can overwrite O annotations """
nlp = English()
# 1: untrained NER - should set everything to O
untrained_ner = nlp.create_pipe("ner")
untrained_ner.add_label("MY_LABEL")
nlp.add_pipe(untrained_ner, name="uner")
nlp.begin_training()
# 2 : Entity Ruler - should set "this" to B and keep everything else O
ruler = EntityRuler(nlp)
patterns = [{"label": "THING", "pattern": "This"}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp("This is Antti Korhonen speaking in Finland")
expected_iobs = ["B", "O", "O", "O", "O", "O", "O"]
expected_types = ["THING", "", "", "", "", "", ""]
assert [token.ent_iob_ for token in doc] == expected_iobs
assert [token.ent_type_ for token in doc] == expected_types
def test_block_ner():
""" Test functionality for blocking tokens so they can't be in a named entity """
# block "Antti L Korhonen" from being a named entity
nlp = English()
nlp.add_pipe(BlockerComponent1(2, 5))
untrained_ner = nlp.create_pipe("ner")
untrained_ner.add_label("MY_LABEL")
nlp.add_pipe(untrained_ner, name="uner")
nlp.begin_training()
doc = nlp("This is Antti L Korhonen speaking in Finland")
expected_iobs = ["O", "O", "B", "B", "B", "O", "O", "O"]
expected_types = ["", "", "", "", "", "", "", ""]
assert [token.ent_iob_ for token in doc] == expected_iobs
assert [token.ent_type_ for token in doc] == expected_types
class BlockerComponent1(object):
name = "my_blocker"
def __init__(self, start, end):
self.start = start
self.end = end
def __call__(self, doc):
doc.ents = [(0, self.start, self.end)]
return doc


@ -426,7 +426,7 @@ def test_issue957(en_tokenizer):
def test_issue999(train_data): def test_issue999(train_data):
"""Test that adding entities and resuming training works passably OK. """Test that adding entities and resuming training works passably OK.
There are two issues here: There are two issues here:
1) We have to readd labels. This isn't very nice. 1) We have to read labels. This isn't very nice.
2) There's no way to set the learning rate for the weight update, so we 2) There's no way to set the learning rate for the weight update, so we
end up out-of-scale, causing it to learn too fast. end up out-of-scale, causing it to learn too fast.
""" """


@ -0,0 +1,42 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
import spacy
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
from spacy.tokens import Span
def test_issue4267():
""" Test that running an entity_ruler after ner gives consistent results"""
nlp = English()
ner = nlp.create_pipe("ner")
ner.add_label("PEOPLE")
nlp.add_pipe(ner)
nlp.begin_training()
assert "ner" in nlp.pipe_names
# assert that we have correct IOB annotations
doc1 = nlp("hi")
assert doc1.is_nered
for token in doc1:
assert token.ent_iob == 2
# add entity ruler and run again
ruler = EntityRuler(nlp)
patterns = [{"label": "SOFTWARE", "pattern": "spacy"}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
assert "entity_ruler" in nlp.pipe_names
assert "ner" in nlp.pipe_names
# assert that we still have correct IOB annotations
doc2 = nlp("hi")
assert doc2.is_nered
for token in doc2:
assert token.ent_iob == 2


@ -13,7 +13,7 @@ class DummyPipe(Pipe):
def predict(self, docs): def predict(self, docs):
return ([1, 2, 3], [4, 5, 6]) return ([1, 2, 3], [4, 5, 6])
def set_annotations(self, docs, scores, tensor=None): def set_annotations(self, docs, scores, tensors=None):
return docs return docs


@ -146,11 +146,12 @@ def _merge(Doc doc, merges):
syntactic root of the span. syntactic root of the span.
RETURNS (Token): The first newly merged token. RETURNS (Token): The first newly merged token.
""" """
cdef int i, merge_index, start, end, token_index cdef int i, merge_index, start, end, token_index, current_span_index, current_offset, offset, span_index
cdef Span span cdef Span span
cdef const LexemeC* lex cdef const LexemeC* lex
cdef TokenC* token cdef TokenC* token
cdef Pool mem = Pool() cdef Pool mem = Pool()
cdef int merged_iob = 0
tokens = <TokenC**>mem.alloc(len(merges), sizeof(TokenC)) tokens = <TokenC**>mem.alloc(len(merges), sizeof(TokenC))
spans = [] spans = []


@ -256,7 +256,7 @@ cdef class Doc:
def is_nered(self): def is_nered(self):
"""Check if the document has named entities set. Will return True if """Check if the document has named entities set. Will return True if
*any* of the tokens has a named entity tag set (even if the others are *any* of the tokens has a named entity tag set (even if the others are
uknown values). unknown values).
""" """
if len(self) == 0: if len(self) == 0:
return True return True
@ -525,13 +525,11 @@ cdef class Doc:
def __set__(self, ents): def __set__(self, ents):
# TODO: # TODO:
# 1. Allow negative matches # 1. Test basic data-driven ORTH gazetteer
# 2. Ensure pre-set NERs are not over-written during statistical # 2. Test more nuanced date and currency regex
# prediction
# 3. Test basic data-driven ORTH gazetteer
# 4. Test more nuanced date and currency regex
tokens_in_ents = {} tokens_in_ents = {}
cdef attr_t entity_type cdef attr_t entity_type
cdef attr_t kb_id
cdef int ent_start, ent_end cdef int ent_start, ent_end
for ent_info in ents: for ent_info in ents:
entity_type, kb_id, ent_start, ent_end = get_entity_info(ent_info) entity_type, kb_id, ent_start, ent_end = get_entity_info(ent_info)
@ -545,27 +543,31 @@ cdef class Doc:
tokens_in_ents[token_index] = (ent_start, ent_end, entity_type, kb_id) tokens_in_ents[token_index] = (ent_start, ent_end, entity_type, kb_id)
cdef int i cdef int i
for i in range(self.length): for i in range(self.length):
self.c[i].ent_type = 0 # default values
self.c[i].ent_kb_id = 0 entity_type = 0
self.c[i].ent_iob = 0 # Means missing. kb_id = 0
cdef attr_t ent_type
cdef int start, end # Set ent_iob to Missing (0) by default unless this token was nered before
for ent_info in ents: ent_iob = 0
ent_type, ent_kb_id, start, end = get_entity_info(ent_info) if self.c[i].ent_iob != 0:
if ent_type is None or ent_type < 0: ent_iob = 2
# Mark as O
for i in range(start, end): # overwrite if the token was part of a specified entity
self.c[i].ent_type = 0 if i in tokens_in_ents.keys():
self.c[i].ent_kb_id = 0 ent_start, ent_end, entity_type, kb_id = tokens_in_ents[i]
self.c[i].ent_iob = 2 if entity_type is None or entity_type <= 0:
# Blocking this token from being overwritten by downstream NER
ent_iob = 3
elif ent_start == i:
# Marking the start of an entity
ent_iob = 3
else: else:
# Mark (inside) as I # Marking the inside of an entity
for i in range(start, end): ent_iob = 1
self.c[i].ent_type = ent_type
self.c[i].ent_kb_id = ent_kb_id self.c[i].ent_type = entity_type
self.c[i].ent_iob = 1 self.c[i].ent_kb_id = kb_id
# Set start as B self.c[i].ent_iob = ent_iob
self.c[start].ent_iob = 3
@property @property
def noun_chunks(self): def noun_chunks(self):
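
Note on the rewritten `Doc.ents` setter: the new per-token loop can be sketched in plain Python as below. This is an illustration under stated assumptions, not the Cython implementation. Entities are given as hypothetical `(label, start, end)` triples with integer labels, `kb_id` is left out, and a `None` label is normalised to 0 here for simplicity.

```python
MISSING, INSIDE, OUTSIDE, BEGIN = 0, 1, 2, 3


def assign_iob(previous_iob, ents):
    """Return a list of (ent_iob, ent_type) pairs, one per token.

    previous_iob: the ent_iob code each token had before the assignment.
    ents: (label, start, end) triples; a label of 0 or None blocks the span.
    """
    tokens_in_ents = {}
    for label, start, end in ents:
        for i in range(start, end):
            tokens_in_ents[i] = (start, end, label)

    result = []
    for i, old_iob in enumerate(previous_iob):
        ent_type = 0
        # Missing stays missing; anything that was annotated before becomes O.
        ent_iob = MISSING if old_iob == MISSING else OUTSIDE
        if i in tokens_in_ents:
            start, end, label = tokens_in_ents[i]
            if label is None or label <= 0:
                # Blocked: every token gets B with an empty type.
                ent_iob, ent_type = BEGIN, 0
            elif start == i:
                ent_iob, ent_type = BEGIN, label
            else:
                ent_iob, ent_type = INSIDE, label
        result.append((ent_iob, ent_type))
    return result
```

For example, blocking tokens 3 and 4 of a five-token doc that has no annotations yet gives B with an empty type on both tokens, which matches the `["", "", "", "B", "B"]` expectation in `test_accept_blocked_token`.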


@ -754,7 +754,8 @@ cdef class Token:
def ent_iob_(self): def ent_iob_(self):
"""IOB code of named entity tag. "B" means the token begins an entity, """IOB code of named entity tag. "B" means the token begins an entity,
"I" means it is inside an entity, "O" means it is outside an entity, "I" means it is inside an entity, "O" means it is outside an entity,
and "" means no entity tag is set. and "" means no entity tag is set. "B" with an empty ent_type
means that the token is blocked from further processing by NER.
RETURNS (unicode): IOB code of named entity tag. RETURNS (unicode): IOB code of named entity tag.
""" """


@ -588,8 +588,8 @@ data.
```python ```python
### Entry structure ### Entry structure
{ {
"orth": string, "orth": string, # the word text
"id": int, "id": int, # can correspond to row in vectors table
"lower": string, "lower": string,
"norm": string, "norm": string,
"shape": string "shape": string


@ -181,6 +181,7 @@ All output files generated by this command are compatible with
| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. | | `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. |
| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). | | `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). | | `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). |
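
Note on the `iob` row above: a line in that format could be split as in the sketch below. The function name `parse_iob_line` is hypothetical and this is not the converter spaCy ships. It only illustrates the two documented shapes, `word|B-ENT` and `word|POS|B-ENT`, and assumes the word itself contains no `|` character.

```python
def parse_iob_line(line):
    """Split one sentence line into (word, pos, iob_tag) tuples.

    pos is None when the token uses the two-field `word|B-ENT` shape.
    """
    tokens = []
    for item in line.split():
        parts = item.split("|")
        if len(parts) == 2:
            word, tag = parts
            pos = None
        elif len(parts) == 3:
            word, pos, tag = parts
        else:
            raise ValueError("Unexpected token format: {!r}".format(item))
        tokens.append((word, pos, tag))
    return tokens
```

Calling `parse_iob_line("New|B-GPE York|I-GPE")` yields two tuples with `None` in the POS slot.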
<<<<<<< HEAD
## Debug data {#debug-data new="2.2"} ## Debug data {#debug-data new="2.2"}
@ -341,6 +342,8 @@ will not be available.
``` ```
</Accordion> </Accordion>
=======
>>>>>>> master
## Train {#train} ## Train {#train}
@ -512,14 +515,17 @@ tokenization can be provided.
Create a new model directory from raw data, like word frequencies, Brown Create a new model directory from raw data, like word frequencies, Brown
clusters and word vectors. This command is similar to the `spacy model` command clusters and word vectors. This command is similar to the `spacy model` command
in v1.x. in v1.x. Note that in order to populate the model's vocab, you need to pass in a
JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) as
`--jsonl-loc` with optional `id` values that correspond to the vectors table.
Just loading in vectors will not automatically populate the vocab.
<Infobox title="Deprecation note" variant="warning"> <Infobox title="Deprecation note" variant="warning">
As of v2.1.0, the `--freqs-loc` and `--clusters-loc` are deprecated and have As of v2.1.0, the `--freqs-loc` and `--clusters-loc` are deprecated and have
been replaced with the `--jsonl-loc` argument, which lets you pass in a been replaced with the `--jsonl-loc` argument, which lets you pass in a
[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one [JSONL](http://jsonlines.org/) file containing one lexical entry per line. For
lexical entry per line. For more details on the format, see the more details on the format, see the
[annotation specs](/api/annotation#vocab-jsonl). [annotation specs](/api/annotation#vocab-jsonl).
</Infobox> </Infobox>
@ -530,11 +536,11 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
``` ```
| Argument | Type | Description | | Argument | Type | Description |
| ----------------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. | | `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. |
| `output_dir` | positional | Model output directory. Will be created if it doesn't exist. | | `output_dir` | positional | Model output directory. Will be created if it doesn't exist. |
| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted vocabulary file with lexical attributes. | | `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. |
| `--vectors-loc`, `-v` | option | Optional location of vectors file. Should be a tab-separated file in Word2Vec format where the first column contains the word and the remaining columns the values. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. | | `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. |
| `--prune-vectors`, `-V` | flag | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. | | `--prune-vectors`, `-V` | flag | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. |
| **CREATES** | model | A spaCy model containing the vocab and vectors. | | **CREATES** | model | A spaCy model containing the vocab and vectors. |
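
Note on the two input files described above: the sketch below writes both to in-memory streams using only the standard library. It assumes the vectors header row holds the row count followed by the vector width, as in the Word2Vec text format, and that `id` is the row index into the vectors table. The helper names and the sample words, vectors and `shape` values are illustrative and are not taken from spaCy.

```python
import io
import json


def write_vocab_jsonl(entries, stream):
    """Write one lexical entry per line (JSONL)."""
    for entry in entries:
        stream.write(json.dumps(entry) + "\n")


def write_vectors_txt(words, vectors, stream):
    """Write a header row with the table dimensions, then one
    space-separated row per word."""
    stream.write("{} {}\n".format(len(vectors), len(vectors[0])))
    for word, vector in zip(words, vectors):
        stream.write(word + " " + " ".join(str(v) for v in vector) + "\n")


# Illustrative data: two words with three-dimensional vectors.
words = ["apple", "pear"]
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
entries = [
    # "id" is set to the row of the word in the vectors table.
    {"orth": word, "id": row, "lower": word.lower(), "norm": word.lower(), "shape": "xxxx"}
    for row, word in enumerate(words)
]

vocab_file = io.StringIO()
vectors_file = io.StringIO()
write_vocab_jsonl(entries, vocab_file)
write_vectors_txt(words, vectors, vectors_file)
```

Writing the same content to real files and passing their paths as `--jsonl-loc` and `--vectors-loc` is the intended use of this layout.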