| title   | teaser              | tag   | source                 | new |
| ------- | ------------------- | ----- | ---------------------- | --- |
| Example | A training instance | class | spacy/gold/example.pyx | 3.0 |
An `Example` holds the information for one training instance. It stores two `Doc` objects: one for holding the gold-standard reference data, and one for holding the predictions of the pipeline. An `Alignment` object stores the alignment between these two documents, as they can differ in tokenization.
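To illustrate how these pieces fit together, here is a minimal sketch, assuming a blank English pipeline and made-up tokens, that builds an `Example` from two differing tokenizations and accesses its two documents and their alignment:

```python
import spacy
from spacy.tokens import Doc
from spacy.gold import Example

nlp = spacy.blank("en")  # assumption: a blank pipeline is enough for this sketch
# The pipeline's tokenization keeps "New York" as a single token
predicted = Doc(nlp.vocab, words=["I", "like", "New York"])
# The gold-standard tokenization splits it into two tokens
example = Example.from_dict(predicted, {"words": ["I", "like", "New", "York"]})

print(len(example.predicted), "predicted tokens")  # 3
print(len(example.reference), "reference tokens")  # 4
print(example.alignment.y2x.data)                  # reference-to-predicted mapping
```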
## Example.__init__
Construct an `Example` object from the `predicted` document and the `reference` document. If `alignment` is `None`, it will be initialized from the words in both documents.
#### Example

```python
from spacy.tokens import Doc
from spacy.gold import Example

words = ["hello", "world", "!"]
spaces = [True, False, False]
predicted = Doc(nlp.vocab, words=words, spaces=spaces)
reference = parse_gold_doc(my_data)
example = Example(predicted, reference)
```
| Name           | Type        | Description                                                                                       |
| -------------- | ----------- | ------------------------------------------------------------------------------------------------- |
| `predicted`    | `Doc`       | The document containing (partial) predictions. Cannot be `None`.                                   |
| `reference`    | `Doc`       | The document containing gold-standard annotations. Cannot be `None`.                               |
| _keyword-only_ |             |                                                                                                     |
| `alignment`    | `Alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents.   |
## Example.from_dict
Construct an `Example` object from the `predicted` document and the reference annotations provided as a dictionary. For more details on the required format, see the training format documentation.
#### Example

```python
from spacy.tokens import Doc
from spacy.gold import Example

predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
```
| Name           | Type             | Description                                                      |
| -------------- | ---------------- | ---------------------------------------------------------------- |
| `predicted`    | `Doc`            | The document containing (partial) predictions. Cannot be `None`. |
| `example_dict` | `Dict[str, obj]` | The gold-standard annotations as a dictionary. Cannot be `None`. |
| **RETURNS**    | `Example`        | The newly constructed object.                                    |
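The annotation dictionary is not limited to token-level attributes. As a hedged sketch following the character-offset entity format and document-level `cats` described in the training format documentation (the vocab, words, and offsets here are made up for illustration):

```python
from spacy.tokens import Doc
from spacy.gold import Example

predicted = Doc(vocab, words=["I", "like", "London"])
annotations = {
    "words": ["I", "like", "London"],
    "entities": [(7, 13, "LOC")],                # character-offset spans
    "cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0},  # document-level categories
}
example = Example.from_dict(predicted, annotations)
assert example.reference.ents[0].label_ == "LOC"
```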
## Example.text
The text of the `predicted` document in this `Example`.
#### Example

```python
raw_text = example.text
```
| Name        | Type | Description                           |
| ----------- | ---- | ------------------------------------- |
| **RETURNS** | str  | The text of the `predicted` document. |
## Example.predicted
#### Example

```python
docs = [eg.predicted for eg in examples]
predictions, _ = model.begin_update(docs)
set_annotations(docs, predictions)
```
The `Doc` holding the predictions. Occasionally also referred to as `example.x`.
| Name        | Type  | Description                                    |
| ----------- | ----- | ---------------------------------------------- |
| **RETURNS** | `Doc` | The document containing (partial) predictions. |
## Example.reference
#### Example

```python
for i, eg in enumerate(examples):
    for j, label in enumerate(all_labels):
        gold_labels[i][j] = eg.reference.cats.get(label, 0.0)
```
The `Doc` holding the gold-standard annotations. Occasionally also referred to as `example.y`.
| Name        | Type  | Description                                         |
| ----------- | ----- | ---------------------------------------------------- |
| **RETURNS** | `Doc` | The document containing gold-standard annotations.   |
## Example.alignment
#### Example

```python
tokens_x = ["Apply", "some", "sunscreen"]
x = Doc(vocab, words=tokens_x)
tokens_y = ["Apply", "some", "sun", "screen"]
example = Example.from_dict(x, {"words": tokens_y})
alignment = example.alignment
assert list(alignment.y2x.data) == [[0], [1], [2], [2]]
```
The `Alignment` object mapping the tokens of the `predicted` document to those of the `reference` document.
| Name        | Type        | Description                                                                     |
| ----------- | ----------- | -------------------------------------------------------------------------------- |
| **RETURNS** | `Alignment` | The alignment between the tokens of the `predicted` and `reference` documents.    |
## Example.get_aligned
#### Example

```python
predicted = Doc(vocab, words=["Apply", "some", "sunscreen"])
token_ref = ["Apply", "some", "sun", "screen"]
tags_ref = ["VERB", "DET", "NOUN", "NOUN"]
example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref})
assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"]
```
Get the aligned view of a certain token attribute, denoted by its int ID or string name.
| Name        | Type                       | Description                                                         | Default |
| ----------- | -------------------------- | -------------------------------------------------------------------- | ------- |
| `field`     | int or str                 | Attribute ID or string name.                                          |         |
| `as_string` | bool                       | Whether or not to return the list of values as strings.              | `False` |
| **RETURNS** | `List[int]` or `List[str]` | List of integer values, or string values if `as_string` is `True`.   |         |
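When `as_string` is left at its default `False`, the aligned values come back as integer IDs. As a sketch continuing the example above (resolving the IDs through the vocab's `StringStore` is an assumption about how you would want to read them back):

```python
# Continuing the example above: without as_string=True, the aligned "TAG"
# values are integer IDs rather than strings
tag_ids = example.get_aligned("TAG")
tags = [predicted.vocab.strings[int(tag_id)] for tag_id in tag_ids]
assert tags == ["VERB", "DET", "NOUN"]
```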
## Example.get_aligned_parse
#### Example

```python
doc = nlp("He pretty quickly walks away")
example = Example.from_dict(doc, {"heads": [3, 2, 3, 0, 2]})
proj_heads, proj_labels = example.get_aligned_parse(projectivize=True)
assert proj_heads == [3, 2, 3, 0, 3]
```
Get the aligned view of the dependency parse. If `projectivize` is set to `True`, non-projective dependency trees are made projective through the Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005).
| Name           | Type                     | Description                                                                                  | Default |
| -------------- | ------------------------ | --------------------------------------------------------------------------------------------- | ------- |
| `projectivize` | bool                     | Whether or not to projectivize the dependency trees.                                            | `True`  |
| **RETURNS**    | `List[int]`, `List[str]` | The aligned head indices and dependency labels (projectivized if `projectivize` is `True`).    |         |
## Example.get_aligned_ner
#### Example

```python
words = ["Mrs", "Smith", "flew", "to", "New York"]
doc = Doc(en_vocab, words=words)
entities = [(0, 9, "PERSON"), (18, 26, "LOC")]
gold_words = ["Mrs Smith", "flew", "to", "New", "York"]
example = Example.from_dict(doc, {"words": gold_words, "entities": entities})
ner_tags = example.get_aligned_ner()
assert ner_tags == ["B-PERSON", "L-PERSON", "O", "O", "U-LOC"]
```
Get the aligned view of the NER BILUO tags.
| Name        | Type        | Description                                                                          |
| ----------- | ----------- | ------------------------------------------------------------------------------------- |
| **RETURNS** | `List[str]` | List of BILUO values, denoting whether tokens are part of an NER annotation or not.    |
## Example.get_aligned_spans_y2x
#### Example

```python
words = ["Mr and Mrs Smith", "flew", "to", "New York"]
doc = Doc(en_vocab, words=words)
entities = [(0, 16, "PERSON")]
tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"]
example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities})
ents_ref = example.reference.ents
assert [(ent.start, ent.end) for ent in ents_ref] == [(0, 4)]
ents_y2x = example.get_aligned_spans_y2x(ents_ref)
assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)]
```
Get the aligned view of any set of `Span` objects defined over `example.reference`. The resulting span indices will align to the tokenization in `example.predicted`.
| Name        | Type             | Description                                                     |
| ----------- | ---------------- | ----------------------------------------------------------------- |
| `y_spans`   | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.reference`.   |
| **RETURNS** | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.predicted`.   |
## Example.get_aligned_spans_x2y
#### Example

```python
nlp.add_pipe("my_ner")
doc = nlp("Mr and Mrs Smith flew to New York")
tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"]
example = Example.from_dict(doc, {"words": tokens_ref})
ents_pred = example.predicted.ents
# Assume the NER model has found "Mr and Mrs Smith" as a named entity
assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)]
ents_x2y = example.get_aligned_spans_x2y(ents_pred)
assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)]
```
Get the aligned view of any set of `Span` objects defined over `example.predicted`. The resulting span indices will align to the tokenization in `example.reference`. This method is particularly useful to assess the accuracy of predicted entities against the original gold-standard annotation.
| Name        | Type             | Description                                                     |
| ----------- | ---------------- | ----------------------------------------------------------------- |
| `x_spans`   | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.predicted`.   |
| **RETURNS** | `Iterable[Span]` | `Span` objects aligned to the tokenization of `self.reference`.   |
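As a rough sketch of such an accuracy check (this is not spaCy's built-in `Scorer`; it compares entity boundaries only and assumes `example` already carries both predicted and reference entities):

```python
# Compare predicted entity boundaries against the gold entities once both
# are expressed in the reference tokenization (labels are ignored here)
gold_offsets = {(ent.start, ent.end) for ent in example.reference.ents}
aligned_pred = list(example.get_aligned_spans_x2y(example.predicted.ents))
correct = sum((span.start, span.end) in gold_offsets for span in aligned_pred)
precision = correct / len(aligned_pred) if aligned_pred else 0.0
```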
## Example.to_dict
Return a dictionary representation of the reference annotation contained in this `Example`.
#### Example

```python
eg_dict = example.to_dict()
```
| Name        | Type             | Description                                          |
| ----------- | ---------------- | ------------------------------------------------------ |
| **RETURNS** | `Dict[str, Any]` | Dictionary representation of the reference annotation. |
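To see what this representation looks like for your own data, you can simply inspect the returned dictionary. The grouping into document-level and token-level annotation mentioned in the comment below is an assumption about the current format, so verify it against your spaCy version:

```python
# Inspect the structure of the reference annotation. The top-level keys are
# expected to group document-level data (e.g. cats, entities) and token-level
# attributes -- treat the exact key names as an assumption
eg_dict = example.to_dict()
for key, value in eg_dict.items():
    print(key, sorted(value))
```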
## Example.split_sents
#### Example

```python
doc = nlp("I went yesterday had lots of fun")
tokens_ref = ["I", "went", "yesterday", "had", "lots", "of", "fun"]
sents_ref = [True, False, False, True, False, False, False]
example = Example.from_dict(doc, {"words": tokens_ref, "sent_starts": sents_ref})
split_examples = example.split_sents()
assert split_examples[0].text == "I went yesterday "
assert split_examples[1].text == "had lots of fun"
```
Split one `Example` into multiple `Example` objects, one for each sentence.
| Name        | Type            | Description                                               |
| ----------- | --------------- | ----------------------------------------------------------- |
| **RETURNS** | `List[Example]` | List of `Example` objects, one for each original sentence.  |
## Alignment
Calculate alignment tables between two tokenizations.
### Alignment attributes
| Name  | Type     | Description                                                 |
| ----- | -------- | ------------------------------------------------------------ |
| `x2y` | `Ragged` | The `Ragged` object holding the alignment from `x` to `y`.   |
| `y2x` | `Ragged` | The `Ragged` object holding the alignment from `y` to `x`.   |
The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you'll be able to align `["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not `["I", "'m"]` and `["I", "am"]`.
#### Example

```python
from spacy.gold import Alignment

bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
a2b = alignment.x2y
assert list(a2b.dataXd) == [0, 1, 1, 2]
```
If `a2b.dataXd[1] == a2b.dataXd[2] == 1`, that means that `A[1]` (`"'"`) and `A[2]` (`"s"`) both align to `B[1]` (`"'s"`).
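Conversely, when the two token lists do not add up to the same underlying string, no alignment can be computed. A small sketch (the exact exception raised for mismatched texts is an assumption here):

```python
from spacy.gold import Alignment

# "I" + "'" + "m" and "I" + "'m" both add up to "I'm", so they can be aligned
aligned = Alignment.from_strings(["I", "'", "m"], ["I", "'m"])
print(list(aligned.x2y.dataXd))  # mapping from the first tokenization to the second

# "I'm" vs. "Iam": the underlying strings differ, so this is expected to fail
try:
    Alignment.from_strings(["I", "'m"], ["I", "am"])
except Exception as err:
    print("Unable to align:", err)
```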
### Alignment.from_strings
| Name        | Type        | Description                                      |
| ----------- | ----------- | -------------------------------------------------- |
| `A`         | list        | String values of candidate tokens to align.        |
| `B`         | list        | String values of reference tokens to align.        |
| **RETURNS** | `Alignment` | An `Alignment` object describing the alignment.    |