| title | teaser | tag | source |
| --- | --- | --- | --- |
| GoldParse | A collection for training annotations | class | spacy/gold.pyx |
GoldParse.__init__
Create a GoldParse. Unlike annotations in entities, label annotations in
cats can overlap, i.e. a single word can be covered by multiple labelled
spans. The TextCategorizer component expects true
examples of a label to have the value 1.0, and negative examples of a label to
have the value 0.0. Labels not in the dictionary are treated as missing – the
gradient for those labels will be zero.
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document the annotations refer to. |
| words | iterable | A sequence of unicode word strings. |
| tags | iterable | A sequence of strings, representing tag annotations. |
| heads | iterable | A sequence of integers, representing syntactic head offsets. |
| deps | iterable | A sequence of strings, representing the syntactic relation types. |
| entities | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as (start_char, end_char, label) tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
| cats | dict | Labels for text classification. Each key in the dictionary may be a string or an int, or a (start_char, end_char, label) tuple, indicating that the label is applied to only part of the document (usually a sentence). |
| RETURNS | GoldParse | The newly constructed object. |
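To make the treatment of missing labels in cats concrete, here is a minimal sketch in plain Python. It is not spaCy's implementation: the helper name `cats_gradient` is hypothetical, and the simple `score - truth` difference (the gradient of a squared-error loss, up to a constant factor) is an assumption chosen for illustration.

```python
def cats_gradient(scores, cats):
    """Return a per-label gradient, assuming a squared-error style loss.

    scores: dict mapping each label to the model's predicted score.
    cats: dict of gold labels, 1.0 for true and 0.0 for negative examples.
    Both the function name and the loss are illustrative assumptions.
    """
    gradient = {}
    for label, score in scores.items():
        if label in cats:
            # Annotated label: push the score towards the gold value.
            gradient[label] = score - cats[label]
        else:
            # Label absent from the dictionary: treated as missing, no update.
            gradient[label] = 0.0
    return gradient


scores = {"POSITIVE": 0.75, "NEGATIVE": 0.5, "SPAM": 0.25}
cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}
print(cats_gradient(scores, cats))
```

Here "SPAM" has no entry in cats, so it contributes a zero gradient, while the two annotated labels are pulled towards 1.0 and 0.0 respectively.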
GoldParse.__len__
Get the number of gold-standard tokens.
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | int | The number of gold-standard tokens. |
GoldParse.is_projective
Whether the provided syntactic annotations form a projective dependency tree.
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | bool | Whether the annotations form a projective tree. |
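As an illustration of what projectivity means, the sketch below checks the textbook definition: for every arc, each token lying strictly between the head and the dependent must be a descendant of that head. It is a plain-Python sketch and not spaCy's implementation. The function name `is_projective_tree` is hypothetical, and the input format (absolute head indices, with the root pointing to itself) is an assumption made for this example.

```python
def is_projective_tree(heads):
    """Check projectivity of a dependency tree.

    heads[i] is the absolute index of token i's head; the root points to
    itself. This input convention is an assumption of the sketch.
    """
    def is_descendant(token, ancestor):
        # Walk up the head chain until we reach the ancestor or stop at the root.
        seen = set()
        while token not in seen:
            if token == ancestor:
                return True
            seen.add(token)
            token = heads[token]
        return False

    for dep, head in enumerate(heads):
        left, right = sorted((dep, head))
        # Every token strictly inside the arc must be dominated by its head.
        for between in range(left + 1, right):
            if not is_descendant(between, head):
                return False
    return True


# "I like London ." with "like" (index 1) as the root of every other token.
print(is_projective_tree([1, 1, 1, 1]))
# Token 1 attaches to token 3 across the root at index 2: a crossing arc.
print(is_projective_tree([2, 3, 2, 2]))
```

The first tree has no token trapped under an unrelated arc; in the second, the arc between tokens 1 and 3 spans the root, which is not a descendant of token 3.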
Attributes
| Name | Type | Description |
| --- | --- | --- |
| tags | list | The part-of-speech tag annotations. |
| heads | list | The syntactic head annotations. |
| labels | list | The syntactic relation-type annotations. |
| ents | list | The named entity annotations. |
| cand_to_gold | list | The alignment from candidate tokenization to gold tokenization. |
| gold_to_cand | list | The alignment from gold tokenization to candidate tokenization. |
| cats | list | Entries in the list should be either a label, or a (start, end, label) triple. The tuple form is used for categories applied to spans of the document. |
Utilities
gold.biluo_tags_from_offsets
Encode labelled spans into per-token tags, using the
BILUO scheme (Begin, In, Last, Unit, Out). Returns a
list of unicode strings describing the tags. Each tag string is either "",
"O" or "{action}-{label}", where action is one of "B", "I", "L" or "U".
O denotes a non-entity token. B denotes the beginning of a multi-token
entity, I the inside of an entity of three or more tokens, and L the end of
an entity of two or more tokens. U denotes a single-token entity. The string
"-" is used where the entity offsets don't align with the tokenization in
the Doc object. The training algorithm treats these tags as missing values.
Example
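import spacy
from spacy.gold import biluo_tags_from_offsets
# Only the tokenizer is needed here, so a blank English pipeline is enough
nlp = spacy.blank("en")
doc = nlp(u"I like London.")
entities = [(7, 13, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
assert tags == ["O", "O", "U-LOC", "O"]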
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
| entities | iterable | A sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. |
| RETURNS | list | Unicode strings, describing the BILUO tags. |
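The encoding described above, including the "I" tag for longer entities and the "-" tag for misaligned offsets, can be sketched in plain Python. This is an illustrative sketch rather than spaCy's implementation: the helper `biluo_from_offsets` is hypothetical, it takes token boundaries as (start_char, end_char) pairs instead of a Doc, and the choice to mark every token overlapped by a misaligned span as "-" is an assumption.

```python
def biluo_from_offsets(token_spans, entities):
    """Encode (start_char, end_char, label) spans as one BILUO tag per token.

    token_spans: list of (start_char, end_char) pairs, one per token.
    This helper is a hypothetical illustration, not part of spaCy.
    """
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["O"] * len(token_spans)
    for start_char, end_char, label in entities:
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is None or last is None:
            # Misaligned span: mark every token it overlaps as missing.
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < end_char and tok_end > start_char:
                    tags[i] = "-"
        elif first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    return tags


# Token boundaries for "I like New York City."
# (tokens: I, like, New, York, City, .)
token_spans = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 20), (20, 21)]
aligned = biluo_from_offsets(token_spans, [(7, 20, "GPE")])
misaligned = biluo_from_offsets(token_spans, [(7, 18, "GPE")])
print(aligned)
print(misaligned)
```

The first call covers three whole tokens and yields a B, I, L sequence. The second span ends in the middle of "City", so the affected tokens are tagged "-" instead.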
gold.offsets_from_biluo_tags
Decode per-token tags following the BILUO scheme into
entity offsets.
Example
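import spacy
from spacy.gold import offsets_from_biluo_tags
# Only the tokenizer is needed here, so a blank English pipeline is enough
nlp = spacy.blank("en")
doc = nlp(u"I like London.")
tags = ["O", "O", "U-LOC", "O"]
entities = offsets_from_biluo_tags(doc, tags)
assert entities == [(7, 13, "LOC")]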
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the BILUO tags refer to. |
| tags | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
| RETURNS | list | A sequence of (start, end, label) triples. start and end will be character-offset integers denoting the slice into the original string. |
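The reverse direction can be sketched the same way. The plain-Python helper below, `offsets_from_tags`, is hypothetical and not spaCy's implementation. It assumes token boundaries are supplied as (start_char, end_char) pairs instead of a Doc, and it ignores "O", "-" and empty tags.

```python
def offsets_from_tags(token_spans, tags):
    """Decode one BILUO tag per token into (start_char, end_char, label) triples.

    token_spans: list of (start_char, end_char) pairs, one per token.
    This helper is a hypothetical illustration, not part of spaCy.
    """
    entities = []
    start = None  # Index of the token that opened the current entity.
    for i, tag in enumerate(tags):
        if tag in ("O", "-", ""):
            start = None
            continue
        action, label = tag.split("-", 1)
        if action == "U":
            # Single-token entity: use this token's own boundaries.
            entities.append((token_spans[i][0], token_spans[i][1], label))
        elif action == "B":
            start = i
        elif action == "L" and start is not None:
            # Close the entity opened by the most recent B tag.
            entities.append((token_spans[start][0], token_spans[i][1], label))
            start = None
        # "I" tags only continue an open entity, so nothing is emitted.
    return entities


# Token boundaries for "I like New York City."
token_spans = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 20), (20, 21)]
tags = ["O", "O", "B-GPE", "I-GPE", "L-GPE", "O"]
decoded = offsets_from_tags(token_spans, tags)
print(decoded)
```

The character offsets come from the first and last token of each entity, which is why the result depends on the tokenization the tags were produced for.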
gold.spans_from_biluo_tags
Convert per-token tags following the BILUO scheme into
Span objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite doc.ents.
Example
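import spacy
from spacy.gold import spans_from_biluo_tags
# Only the tokenizer is needed here, so a blank English pipeline is enough
nlp = spacy.blank("en")
doc = nlp(u"I like London.")
tags = ["O", "O", "U-LOC", "O"]
doc.ents = spans_from_biluo_tags(doc, tags)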
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the BILUO tags refer to. |
| tags | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
| RETURNS | list | A sequence of Span objects with added entity labels. |