title | teaser | tag | source | new |
---|---|---|---|---|
Example | A training instance | class | spacy/training/example.pyx | 3.0 |
An `Example` holds the information for one training instance. It stores two `Doc` objects: one holding the gold-standard reference data, and one holding the predictions of the pipeline. An `Alignment` object stores the alignment between these two documents, as they can differ in tokenization.
Example.__init__
Construct an `Example` object from the `predicted` document and the `reference` document. If `alignment` is `None`, it will be initialized from the words in both documents.
Example
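```python
from spacy.tokens import Doc
from spacy.training import Example

# `nlp` is assumed to be an existing pipeline, e.g. nlp = spacy.blank("en")
words = ["hello", "world", "!"]
spaces = [True, False, False]
predicted = Doc(nlp.vocab, words=words, spaces=spaces)
# parse_gold_doc and my_data are placeholders for your own code that builds the reference Doc
reference = parse_gold_doc(my_data)
example = Example(predicted, reference)
```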
Name | Description |
---|---|
predicted | The document containing (partial) predictions. Cannot be `None`. |
reference | The document containing gold-standard annotations. Cannot be `None`. |
keyword-only | |
alignment | An object holding the alignment between the tokens of the `predicted` and `reference` documents. |
Example.from_dict
Construct an `Example` object from the `predicted` document and the reference annotations provided as a dictionary. For more details on the required format, see the training format documentation.
Example
```python
from spacy.tokens import Doc
from spacy.training import Example

predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
```
Name | Description |
---|---|
predicted | The document containing (partial) predictions. Cannot be `None`. |
example_dict | The reference annotations as a dictionary (`Dict[str, obj]`). |
RETURNS | The newly constructed object. |
Example.text
The text of the `predicted` document in this `Example`.
Example
```python
raw_text = example.text
```
Name | Description |
---|---|
RETURNS | The text of the predicted document. |
Example.predicted
The `Doc` holding the predictions. Occasionally also referred to as `example.x`.
Example
```python
docs = [eg.predicted for eg in examples]
predictions, _ = model.begin_update(docs)
set_annotations(docs, predictions)
```
Name | Description |
---|---|
RETURNS | The document containing (partial) predictions. |
Example.reference
The `Doc` holding the gold-standard annotations. Occasionally also referred to as `example.y`.
Example
```python
for i, eg in enumerate(examples):
    for j, label in enumerate(all_labels):
        gold_labels[i][j] = eg.reference.cats.get(label, 0.0)
```
Name | Description |
---|---|
RETURNS | The document containing gold-standard annotations. |
Example.alignment
The `Alignment` object mapping the tokens of the `predicted` document to those of the `reference` document.
Example
```python
tokens_x = ["Apply", "some", "sunscreen"]
x = Doc(vocab, words=tokens_x)
tokens_y = ["Apply", "some", "sun", "screen"]
example = Example.from_dict(x, {"words": tokens_y})
alignment = example.alignment
assert list(alignment.y2x.data) == [[0], [1], [2], [2]]
```
Name | Description |
---|---|
RETURNS | The `Alignment` object mapping the tokens of the `predicted` document to those of the `reference` document. |
Example.get_aligned
Get the aligned view of a certain token attribute, denoted by its int ID or string name.
Example
```python
predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
```
Name | Description |
---|---|
field | Attribute ID or string name. |
as_string | Whether or not to return the list of values as strings. Defaults to `False`. |
RETURNS | List of integer values, or string values if `as_string` is `True`. |
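To make the idea of an "aligned view" more concrete, here is a minimal pure-Python sketch that projects per-token reference values onto the predicted tokens. It is not spaCy's implementation: the function name `project_attribute`, the nested-list alignment table, and the choice to return `None` when the aligned reference tokens disagree are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def project_attribute(
    x2y: Sequence[Sequence[int]], ref_values: Sequence[str]
) -> List[Optional[str]]:
    """Project per-token reference values onto the predicted tokens.

    x2y[i] lists the reference token indices aligned to predicted token i
    (a hypothetical, simplified alignment table).
    """
    projected: List[Optional[str]] = []
    for ref_indices in x2y:
        values = {ref_values[j] for j in ref_indices}
        # Keep the value only if all aligned reference tokens agree on it.
        projected.append(values.pop() if len(values) == 1 else None)
    return projected


# "sunscreen" (predicted) aligns to "sun" and "screen" (reference).
x2y = [[0], [1], [2, 3]]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
aligned_tags = project_attribute(x2y, tags_ref)
print(aligned_tags)
```

Because both reference tokens aligned to "sunscreen" carry the same tag, the projected list has one tag per predicted token, which mirrors the shape of the result in the example above.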
Example.get_aligned_parse
Get the aligned view of the dependency parse. If `projectivize` is set to `True`, non-projective dependency trees are made projective through the Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
Example
```python
doc = nlp("He pretty quickly walks away")
example = Example.from_dict(doc, {"heads": [3, 2, 3, 0, 2]})
proj_heads, proj_labels = example.get_aligned_parse(projectivize=True)
assert proj_heads == [3, 2, 3, 0, 3]
```
Name | Description |
---|---|
projectivize | Whether or not to projectivize the dependency trees. Defaults to `True`. |
RETURNS | A tuple of two lists: the aligned head indices and the aligned dependency labels. |
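For readers unfamiliar with projectivity, the following sketch checks whether a tree given as a list of head indices is projective, meaning that every token between a head and its dependent descends from that head. It does not implement the pseudo-projective transformation itself. The helper name `is_projective` and the convention that the root token is its own head are assumptions of this sketch, and the input is assumed to be a well-formed tree.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Return True if every arc in the tree is projective.

    heads[i] is the index of token i's head; the root is its own head
    (a convention assumed for this sketch).
    """

    def descends_from(token: int, ancestor: int) -> bool:
        # Walk up the head chain until we reach the ancestor or the root.
        while True:
            if token == ancestor:
                return True
            if heads[token] == token:
                return False
            token = heads[token]

    for dep, head in enumerate(heads):
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must descend from head.
        for between in range(lo + 1, hi):
            if not descends_from(between, head):
                return False
    return True


# "He pretty quickly walks away", with "walks" (index 3) as the root.
crossing = [3, 2, 3, 3, 2]  # "away" attached to "quickly", spanning the root
flat = [3, 2, 3, 3, 3]      # "away" attached to "walks" instead
print(is_projective(crossing), is_projective(flat))
```

In the first tree, the arc from "quickly" to "away" spans the root "walks", which is not a descendant of "quickly", so the tree is non-projective. Reattaching "away" to "walks" removes the problem.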
Example.get_aligned_ner
Get the aligned view of the NER BILUO tags.
Example
```python
words = ["Mrs", "Smith", "flew", "to", "New York"]
doc = Doc(en_vocab, words=words)
entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
ner_tags = example.get_aligned_ner()
assert ner_tags == ["B-PERSON", "L-PERSON", "O", "O", "U-LOC"]
```
Name | Description |
---|---|
RETURNS | List of BILUO values, denoting whether tokens are part of an NER annotation or not. |
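As a reminder of what the BILUO scheme encodes (Begin, In, Last, Unit, Out), the sketch below derives BILUO tags from character-offset entities over a plain list of words. It is a simplified illustration, not spaCy's implementation: the function name `biluo_from_offsets` is hypothetical, words are assumed to be separated by single spaces, and entity boundaries are assumed to coincide with word boundaries.

```python
from typing import List, Tuple


def biluo_from_offsets(
    words: List[str], entities: List[Tuple[int, int, str]]
) -> List[str]:
    """Tag each word with a BILUO label from character-offset entities."""
    # Compute (start, end) character offsets, assuming single-space separators.
    offsets = []
    position = 0
    for word in words:
        offsets.append((position, position + len(word)))
        position += len(word) + 1

    tags = ["O"] * len(words)
    for ent_start, ent_end, label in entities:
        # Indices of the words that lie fully inside the entity span.
        inside = [
            i for i, (start, end) in enumerate(offsets)
            if start >= ent_start and end <= ent_end
        ]
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"
        elif len(inside) > 1:
            tags[inside[0]] = f"B-{label}"
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"
            tags[inside[-1]] = f"L-{label}"
    return tags


words = ["Mrs", "Smith", "flew", "to", "New", "York"]
entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
tags = biluo_from_offsets(words, entities)
print(tags)
```

Here "New" and "York" are separate words, so the location is tagged `B-LOC`, `L-LOC`. When a single token covers the whole entity, as with "New York" in the predicted tokenization above, the tag is `U-LOC` instead.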
Example.get_aligned_spans_y2x
Get the aligned view of any set of `Span` objects defined over `Example.reference`. The resulting span indices will align to the tokenization in `Example.predicted`.
Example
```python
words = ["Mr and Mrs Smith", "flew", "to", "New York"]
doc = Doc(en_vocab, words=words)
entities = [(0, 16, "PERSON")]
tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
ents_ref = example.reference.ents
assert [(ent.start, ent.end) for ent in ents_ref] == [(0, 4)]
ents_y2x = example.get_aligned_spans_y2x(ents_ref)
assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)]
```
Name | Description |
---|---|
y_spans | `Span` objects aligned to the tokenization of `reference`. |
RETURNS | `Span` objects aligned to the tokenization of `predicted`. |
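The index arithmetic behind mapping a span from one tokenization to the other can be sketched in a few lines. The helper below is hypothetical and simplified: it represents the alignment as nested lists and covers every target token touched by the source span, whereas the library may treat partially overlapping tokens differently.

```python
from typing import Optional, Sequence, Tuple


def map_span(
    table: Sequence[Sequence[int]], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Map the token span [start, end) through an alignment table.

    table[i] lists the target token indices aligned to source token i.
    Returns None if no target token is aligned to the span.
    """
    targets = sorted({j for i in range(start, end) for j in table[i]})
    if not targets:
        return None
    return targets[0], targets[-1] + 1


# reference: ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
# predicted: ["Mr and Mrs Smith", "flew", "to", "New York"]
y2x = [[0], [0], [0], [0], [1], [2], [3], [3]]
person = map_span(y2x, 0, 4)
city = map_span(y2x, 6, 8)
print(person, city)
```

The four reference tokens of the person name all align to the first predicted token, so the span `(0, 4)` over the reference becomes `(0, 1)` over the predicted tokens, matching the example above.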
Example.get_aligned_spans_x2y
Get the aligned view of any set of `Span` objects defined over `Example.predicted`. The resulting span indices will align to the tokenization in `Example.reference`. This method is particularly useful to assess the accuracy of predicted entities against the original gold-standard annotation.
Example
```python
nlp.add_pipe("my_ner")
doc = nlp("Mr and Mrs Smith flew to New York")
tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
example = Example.from_dict(doc, {"words": tokens_ref})
ents_pred = example.predicted.ents
# Assume the NER model has found "Mr and Mrs Smith" as a named entity
assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
ents_x2y = example.get_aligned_spans_x2y(ents_pred)
assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
```
Name | Description |
---|---|
x_spans | `Span` objects aligned to the tokenization of `predicted`. |
RETURNS | `Span` objects aligned to the tokenization of `reference`. |
Example.to_dict
Return a dictionary representation of the reference annotation contained in this `Example`.
Example
```python
eg_dict = example.to_dict()
```
Name | Description |
---|---|
RETURNS | Dictionary representation of the reference annotation. |
Example.split_sents
Split one `Example` into multiple `Example` objects, one for each sentence.
Example
```python
doc = nlp("I went yesterday had lots of fun")
tokens_ref = ["I", "went", "yesterday", "had", "lots", "of", "fun"]
sents_ref = [True, False, False, True, False, False, False]
example = Example.from_dict(doc, {"words": tokens_ref, "sent_starts": sents_ref})
split_examples = example.split_sents()
assert split_examples[0].text == "I went yesterday "
assert split_examples[1].text == "had lots of fun"
```
Name | Description |
---|---|
RETURNS | List of Example objects, one for each original sentence. |
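The role of the sentence-start flags can be illustrated without spaCy. The sketch below groups a word list into sentences wherever a flag is `True`. The function name `split_by_sent_starts` is hypothetical, and the sketch only handles plain word lists, not `Doc` objects or their annotations.

```python
from typing import List


def split_by_sent_starts(
    words: List[str], sent_starts: List[bool]
) -> List[List[str]]:
    """Group words into sentences, opening a new one at each start flag."""
    sentences: List[List[str]] = []
    for word, is_start in zip(words, sent_starts):
        # Open a new sentence at each start flag (and for the very first word).
        if is_start or not sentences:
            sentences.append([])
        sentences[-1].append(word)
    return sentences


tokens_ref = ["I", "went", "yesterday", "had", "lots", "of", "fun"]
sents_ref = [True, False, False, True, False, False, False]
sentences = split_by_sent_starts(tokens_ref, sents_ref)
print(sentences)
```

The two `True` flags produce two groups, corresponding to the two `Example` objects returned in the example above.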
Alignment
Calculate alignment tables between two tokenizations.
Alignment attributes
Name | Description |
---|---|
x2y | The `Ragged` object holding the alignment from `x` to `y`. |
y2x | The `Ragged` object holding the alignment from `y` to `x`. |
The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you'll be able to align `["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not `["I", "'m"]` and `["I", "am"]`.
Example
```python
from spacy.training import Alignment

bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
a2b = alignment.x2y
assert list(a2b.dataXd) == [0, 1, 1, 2]
```
If `a2b.dataXd[1] == a2b.dataXd[2] == 1`, that means that `A[1]` (`"'"`) and `A[2]` (`"s"`) both align to `B[1]` (`"'s"`), where `A` is `bert_tokens` and `B` is `spacy_tokens`.
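The same information can be read in the opposite direction. As a rough sketch, the helper below inverts a flat A-to-B mapping like the one above into per-B lists of A indices. The function `invert_alignment` is hypothetical, and it assumes that every token in A aligns to exactly one token in B, which holds for this example but not in general.

```python
from typing import List


def invert_alignment(x2y: List[int], n_y: int) -> List[List[int]]:
    """Turn a flat x-to-y mapping into per-y lists of x indices."""
    y2x: List[List[int]] = [[] for _ in range(n_y)]
    for x_index, y_index in enumerate(x2y):
        y2x[y_index].append(x_index)
    return y2x


a2b = [0, 1, 1, 2]  # one B index per token in A, as in the example above
b2a = invert_alignment(a2b, n_y=3)
print(b2a)
```

The middle entry of the result lists two indices because `"'s"` in B covers both `"'"` and `"s"` in A.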
Alignment.from_strings
Name | Description |
---|---|
A | String values of candidate tokens to align. |
B | String values of reference tokens to align. |
RETURNS | An Alignment object describing the alignment. |
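One way to think about aligning two tokenizations of the same string is through shared character positions. The sketch below follows that idea in pure Python. It is not the algorithm spaCy uses, and the names `char_to_token` and `align_tokens` are made up for this illustration. It also makes the "same string" requirement explicit by raising an error when the tokenizations do not add up to the same text.

```python
from typing import List, Sequence


def char_to_token(tokens: Sequence[str]) -> List[int]:
    """Map every character position in the joined string to its token index."""
    return [i for i, token in enumerate(tokens) for _ in token]


def align_tokens(a: Sequence[str], b: Sequence[str]) -> List[List[int]]:
    """For each token in a, list the indices of the tokens in b it overlaps."""
    if "".join(a) != "".join(b):
        raise ValueError("Both tokenizations must add up to the same string.")
    b_index = char_to_token(b)
    a2b: List[List[int]] = []
    position = 0
    for token in a:
        # Collect the b tokens covering this token's characters, in order.
        covered = sorted(set(b_index[position:position + len(token)]))
        a2b.append(covered)
        position += len(token)
    return a2b


bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
a2b = align_tokens(bert_tokens, spacy_tokens)
b2a = align_tokens(spacy_tokens, bert_tokens)
print(a2b, b2a)
```

Calling `align_tokens(["I", "'m"], ["I", "am"])` raises a `ValueError` in this sketch, because the two lists join to different strings.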