Create a GoldParse. Unlike annotations in entities, label annotations in
cats can overlap, i.e. a single word can be covered by multiple labelled
spans. The TextCategorizer component expects true
examples of a label to have the value 1.0, and negative examples of a label to
have the value 0.0. Labels not in the dictionary are treated as missing – the
gradient for those labels will be zero.
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document the annotations refer to. |
| words | iterable | A sequence of unicode word strings. |
| tags | iterable | A sequence of strings, representing tag annotations. |
| heads | iterable | A sequence of integers, representing syntactic head offsets. |
| deps | iterable | A sequence of strings, representing the syntactic relation types. |
| entities | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as (start_char, end_char, label) tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
| cats | dict | Labels for text classification. Each key in the dictionary may be a string or an int, or a (start_char, end_char, label) tuple, indicating that the label is applied to only part of the document (usually a sentence). |
| RETURNS | GoldParse | The newly constructed object. |
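The treatment of missing labels in cats can be illustrated with a small pure-Python sketch. This is not spaCy's TextCategorizer code: the function name `masked_gradient` and the squared-error loss are assumptions chosen only to show how a label absent from the dictionary ends up with a zero gradient.

```python
def masked_gradient(scores, cats):
    """Gradient of 0.5 * (score - truth) ** 2 with respect to each score.

    Assumption: `scores` maps every known label to a model prediction and
    `cats` maps annotated labels to 1.0 (true) or 0.0 (false).
    """
    grad = {}
    for label, score in scores.items():
        if label in cats:
            # Annotated label: the error drives the update.
            grad[label] = score - cats[label]
        else:
            # Label not in the dictionary: treated as missing, no update.
            grad[label] = 0.0
    return grad


scores = {"SPORTS": 0.8, "POLITICS": 0.3, "TECH": 0.6}
cats = {"SPORTS": 1.0, "POLITICS": 0.0}  # "TECH" is left unannotated
grad = masked_gradient(scores, cats)
print(grad["TECH"])  # 0.0
```

Leaving a label out of cats is therefore different from setting it to 0.0: the former says nothing about the label, the latter marks the text as a negative example.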
GoldParse.__len__
Get the number of gold-standard tokens.
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | int | The number of gold-standard tokens. |
GoldParse.is_projective
Whether the provided syntactic annotations form a projective dependency tree.
| Name | Type | Description |
| --- | --- | --- |
| RETURNS | bool | Whether the annotations form a projective tree. |
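To make projectivity concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation. It assumes that `heads[i]` is the absolute index of token i's head and that the root token is its own head; the name `is_projective_sketch` is hypothetical. The tree is treated as projective when no two dependency arcs cross.

```python
def is_projective_sketch(heads):
    """Return True if no two dependency arcs cross.

    Assumption: heads[i] is the absolute index of token i's head,
    and the root token is its own head.
    """
    # Represent each non-root arc as a (left, right) pair of token indices.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # left2 lies strictly inside the first arc while right2 lies
            # outside it. The mirrored case is caught when the loops visit
            # the same pair in the other order.
            if left1 < left2 < right1 < right2:
                return False
    return True


# "I like London": both "I" and "London" attach to the root "like".
print(is_projective_sketch([1, 1, 1]))     # True
# Arcs 0-2 and 1-3 cross, so this tree is non-projective.
print(is_projective_sketch([2, 3, 2, 2]))  # False
```

The quadratic pairwise check is chosen for readability; it is adequate for sentence-length inputs.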
Attributes
| Name | Type | Description |
| --- | --- | --- |
| tags | list | The part-of-speech tag annotations. |
| heads | list | The syntactic head annotations. |
| labels | list | The syntactic relation-type annotations. |
| ents | list | The named entity annotations. |
| cand_to_gold | list | The alignment from candidate tokenization to gold tokenization. |
| gold_to_cand | list | The alignment from gold tokenization to candidate tokenization. |
| cats | list | Entries in the list should be either a label, or a (start, end, label) triple. The tuple form is used for categories applied to spans of the document. |
gold.docs_to_json
Convert one or more Doc objects into spaCy's JSON format.
Example

    from spacy.gold import docs_to_json

    doc = nlp(u"I like London")
    json_data = docs_to_json([doc])

| Name | Type | Description |
| --- | --- | --- |
| docs | iterable / Doc | The Doc object(s) to convert. |
| id | int | ID to assign to the JSON. Defaults to 0. |
| RETURNS | list | The data in spaCy's JSON format. |
gold.biluo_tags_from_offsets
Encode labelled spans into per-token tags, using the
BILUO scheme (Begin, In, Last, Unit, Out). Returns a
list of unicode strings describing the tags. Each tag string is either
"", "O" or "{action}-{label}", where action is one of
"B", "I", "L", "U". The string "-" is used where the entity offsets
don't align with the tokenization in the Doc object; the training algorithm
treats these as missing values. O denotes a non-entity token. B denotes
the beginning of a multi-token entity, I the inside of an entity of three or
more tokens, and L the end of an entity of two or more tokens. U denotes a
single-token entity.
Example

    from spacy.gold import biluo_tags_from_offsets

    doc = nlp(u"I like London.")
    entities = [(7, 13, "LOC")]
    tags = biluo_tags_from_offsets(doc, entities)
    assert tags == ["O", "O", "U-LOC", "O"]

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
| entities | iterable | A sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. |
| RETURNS | list | A list of unicode strings, describing the BILUO tags. |
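The encoding described above can be sketched in pure Python without a Doc object. This is an illustration under stated assumptions, not spaCy's implementation: `token_spans` is a hypothetical regex-based stand-in for the tokenizer, and `biluo_from_offsets_sketch` is a hypothetical helper that works on (start_char, end_char) pairs per token.

```python
import re


def token_spans(text):
    """Hypothetical stand-in for a tokenizer: (start_char, end_char) per token."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def biluo_from_offsets_sketch(spans, entities):
    """Turn (start_char, end_char, label) triples into per-token BILUO tags."""
    starts = {start: i for i, (start, _) in enumerate(spans)}
    ends = {end: i for i, (_, end) in enumerate(spans)}
    tags = ["O"] * len(spans)
    for start_char, end_char, label in entities:
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is None or last is None or first > last:
            # Offsets do not line up with token boundaries: mark every
            # token the span touches as a missing value.
            for i, (start, end) in enumerate(spans):
                if start < end_char and end > start_char:
                    tags[i] = "-"
            continue
        if first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    return tags


spans = token_spans("I like London.")
print(biluo_from_offsets_sketch(spans, [(7, 13, "LOC")]))
# ['O', 'O', 'U-LOC', 'O']
```

With the misaligned offsets (7, 12, "LOC"), the same call yields "-" for the "London" token, which mirrors the missing-value behaviour described above.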
gold.offsets_from_biluo_tags
Encode per-token tags following the BILUO scheme into
entity offsets.
Example

    from spacy.gold import offsets_from_biluo_tags

    doc = nlp(u"I like London.")
    tags = ["O", "O", "U-LOC", "O"]
    entities = offsets_from_biluo_tags(doc, tags)
    assert entities == [(7, 13, "LOC")]

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the BILUO tags refer to. |
| tags | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
| RETURNS | list | A sequence of (start, end, label) triples. start and end will be character-offset integers denoting the slice into the original string. |
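The reverse direction can be sketched the same way. This is a simplified illustration, not spaCy's implementation: `offsets_from_biluo_sketch` is a hypothetical helper that takes token character spans directly instead of a Doc, and it does not check that the label on an L tag matches the label on the preceding B tag.

```python
def offsets_from_biluo_sketch(spans, tags):
    """Turn per-token BILUO tags into (start_char, end_char, label) triples.

    Assumption: `spans` holds one (start_char, end_char) pair per token,
    in the same order as `tags`.
    """
    entities = []
    open_start = None  # start_char of the entity currently being built
    for (start, end), tag in zip(spans, tags):
        action, _, label = tag.partition("-")
        if action == "U" and label:
            entities.append((start, end, label))
            open_start = None
        elif action == "B" and label:
            open_start = start
        elif action == "I" and label:
            continue
        elif action == "L" and label and open_start is not None:
            entities.append((open_start, end, label))
            open_start = None
        else:
            # "O", "-" and "" all close any entity in progress.
            open_start = None
    return entities


# Character spans for the tokens of "I like London."
spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(offsets_from_biluo_sketch(spans, ["O", "O", "U-LOC", "O"]))
# [(7, 13, 'LOC')]
```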
gold.spans_from_biluo_tags
Encode per-token tags following the BILUO scheme into
Span objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite the doc.ents.
Example

    from spacy.gold import spans_from_biluo_tags

    doc = nlp(u"I like London.")
    tags = ["O", "O", "U-LOC", "O"]
    doc.ents = spans_from_biluo_tags(doc, tags)

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the BILUO tags refer to. |
| tags | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
| RETURNS | list | A sequence of Span objects with added entity labels. |