| title | teaser | tag | source |
|---|---|---|---|
| GoldParse | A collection for training annotations | class | spacy/gold.pyx | 
GoldParse.__init__
Create a GoldParse. Unlike annotations in entities, label annotations in
cats can overlap, i.e. a single word can be covered by multiple labelled
spans. The TextCategorizer component expects true
examples of a label to have the value 1.0, and negative examples of a label to
have the value 0.0. Labels not in the dictionary are treated as missing – the
gradient for those labels will be zero.
| Name | Type | Description | 
|---|---|---|
| doc | Doc | The document the annotations refer to. | 
| words | iterable | A sequence of unicode word strings. | 
| tags | iterable | A sequence of strings, representing tag annotations. | 
| heads | iterable | A sequence of integers, representing syntactic head offsets. | 
| deps | iterable | A sequence of strings, representing the syntactic relation types. | 
| entities | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as (start_char, end_char, label) tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. | 
| cats | dict | Labels for text classification. Each key in the dictionary may be a string or an int, or a (start_char, end_char, label) tuple, indicating that the label is applied to only part of the document (usually a sentence). | 
| RETURNS | GoldParse | The newly constructed object. | 
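
The rule that labels missing from cats produce a zero gradient can be illustrated without the library. The sketch below is only an illustration: the function name `cats_gradient` and the simple "score minus target" gradient are assumptions, not the TextCategorizer's actual implementation.

```python
def cats_gradient(scores, cats):
    """Return a per-label gradient (score - target) for a hypothetical loss.

    scores: dict mapping each label to the model's predicted score.
    cats:   dict of gold annotations, 1.0 for true and 0.0 for false labels.
    Labels absent from `cats` are treated as missing and get gradient 0.0.
    """
    gradient = {}
    for label, score in scores.items():
        if label in cats:
            gradient[label] = score - cats[label]
        else:
            gradient[label] = 0.0  # missing annotation: no update
    return gradient


scores = {"POSITIVE": 0.75, "NEGATIVE": 0.25, "SPAM": 0.5}
cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}  # "SPAM" is not annotated
gradient = cats_gradient(scores, cats)
```

Here "SPAM" receives no update because it is absent from the dictionary, whereas annotating it with 0.0 would push its score down.
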
GoldParse.__len__
Get the number of gold-standard tokens.
| Name | Type | Description | 
|---|---|---|
| RETURNS | int | The number of gold-standard tokens. | 
GoldParse.is_projective
Whether the provided syntactic annotations form a projective dependency tree.
| Name | Type | Description | 
|---|---|---|
| RETURNS | bool | Whether annotations form projective tree. | 
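
As a rough illustration of what projectivity means, the sketch below checks that no two dependency arcs cross. It assumes heads are given as absolute token indices with the root pointing to itself, which may differ from how GoldParse stores heads internally. The name `has_no_crossing_arcs` is hypothetical.

```python
def has_no_crossing_arcs(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the absolute index of token i's head; the root token
    points to itself (an assumption made for this sketch).
    """
    # Represent each arc by its left and right endpoint, skipping the root.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for lo1, hi1 in arcs:
        for lo2, hi2 in arcs:
            # The second arc starts strictly inside the first one and
            # ends strictly outside it, so the two arcs cross.
            if lo1 < lo2 < hi1 < hi2:
                return False
    return True


projective = has_no_crossing_arcs([1, 1, 1])         # "I like London"
non_projective = has_no_crossing_arcs([2, 3, 2, 2])  # arcs 0-2 and 1-3 cross
```

The quadratic loop keeps the sketch short; it is meant to show the definition, not to be efficient.
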
Attributes
| Name | Type | Description | 
|---|---|---|
| words | list | The words. | 
| tags | list | The part-of-speech tag annotations. | 
| heads | list | The syntactic head annotations. | 
| labels | list | The syntactic relation-type annotations. | 
| ner | list | The named entity annotations as BILUO tags. | 
| cand_to_gold | list | The alignment from candidate tokenization to gold tokenization. | 
| gold_to_cand | list | The alignment from gold tokenization to candidate tokenization. | 
| cats | list | Entries in the list should be either a label, or a (start, end, label) triple. The tuple form is used for categories applied to spans of the document. | 
Utilities
gold.docs_to_json
Convert a list of Doc objects into the
JSON-serializable format used by the
spacy train command.
Example

    from spacy.gold import docs_to_json
    doc = nlp(u"I like London")
    json_data = docs_to_json([doc])

| Name | Type | Description | 
|---|---|---|
| docs | iterable / Doc | The Doc object(s) to convert. | 
| id | int | ID to assign to the JSON. Defaults to 0. | 
| RETURNS | list | The data in spaCy's JSON format. | 
gold.align
Calculate alignment tables between two tokenizations, using the Levenshtein algorithm. The alignment is case-insensitive.
The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
["I", "'", "m"] and ["I", "'m"], which both add up to "I'm", but not
["I", "'m"] and ["I", "am"].
Example

    from spacy.gold import align
    bert_tokens = ["obama", "'", "s", "podcast"]
    spacy_tokens = ["obama", "'s", "podcast"]
    alignment = align(bert_tokens, spacy_tokens)
    cost, a2b, b2a, a2b_multi, b2a_multi = alignment

| Name | Type | Description | 
|---|---|---|
| tokens_a | list | String values of candidate tokens to align. | 
| tokens_b | list | String values of reference tokens to align. | 
| RETURNS | tuple | A (cost, a2b, b2a, a2b_multi, b2a_multi) tuple describing the alignment. | 
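
The assumption that both tokenizations add up to the same string can be checked up front. The helper below is a minimal sketch with a hypothetical name (`add_up_to_same_string`). It only mirrors the stated assumption, including the case-insensitivity, and is not how align itself validates its input.

```python
def add_up_to_same_string(tokens_a, tokens_b):
    """Return True if both token lists concatenate to the same string,
    ignoring case."""
    return "".join(tokens_a).lower() == "".join(tokens_b).lower()


alignable = add_up_to_same_string(["I", "'", "m"], ["I", "'m"])
not_alignable = add_up_to_same_string(["I", "'m"], ["I", "am"])
```
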
The returned tuple contains the following alignment information:
Example

    a2b = array([0, -1, -1, 2])
    b2a = array([0, 2, 3])
    a2b_multi = {1: 1, 2: 1}
    b2a_multi = {}

If a2b[3] == 2, that means that tokens_a[3] aligns to tokens_b[2]. If there's no one-to-one alignment for a token, it has the value -1.

| Name | Type | Description | 
|---|---|---|
| cost | int | The number of misaligned tokens. | 
| a2b | numpy.ndarray[ndim=1, dtype='int32'] | One-to-one mappings of indices in tokens_a to indices in tokens_b. | 
| b2a | numpy.ndarray[ndim=1, dtype='int32'] | One-to-one mappings of indices in tokens_b to indices in tokens_a. | 
| a2b_multi | dict | A dictionary mapping indices in tokens_a to indices in tokens_b, where multiple tokens of tokens_a align to the same token of tokens_b. | 
| b2a_multi | dict | A dictionary mapping indices in tokens_b to indices in tokens_a, where multiple tokens of tokens_b align to the same token of tokens_a. | 
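
One way to consume these tables is to group the indices of tokens_a under the token of tokens_b they map to, combining the one-to-one and many-to-one information. The sketch below uses plain lists in place of NumPy arrays and a hypothetical helper name (`group_a_by_b`). The values are taken from the example above.

```python
def group_a_by_b(a2b, a2b_multi, n_b):
    """For each index in tokens_b, collect the indices of tokens_a aligned to it."""
    groups = [[] for _ in range(n_b)]
    for i, j in enumerate(a2b):
        if j != -1:
            groups[j].append(i)             # one-to-one alignment
        elif i in a2b_multi:
            groups[a2b_multi[i]].append(i)  # many-to-one alignment
    return groups


bert_tokens = ["obama", "'", "s", "podcast"]
a2b = [0, -1, -1, 2]
a2b_multi = {1: 1, 2: 1}

groups = group_a_by_b(a2b, a2b_multi, n_b=3)
pieces = [[bert_tokens[i] for i in group] for group in groups]
```

With these inputs, the second entry of `pieces` holds the two tokens "'" and "s", which together correspond to the single token "'s".
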
gold.biluo_tags_from_offsets
Encode labelled spans into per-token tags, using the
BILUO scheme (Begin, In, Last, Unit, Out). Returns a
list of unicode strings, describing the tags. Each tag string will be of the
form of either "", "O" or "{action}-{label}", where action is one of
"B", "I", "L", "U". The string "-" is used where the entity offsets
don't align with the tokenization in the Doc object. The training algorithm
will view these as missing values. O denotes a non-entity token. B denotes
the beginning of a multi-token entity, I the inside of an entity of three or
more tokens, and L the end of an entity of two or more tokens. U denotes a
single-token entity.
Example

    from spacy.gold import biluo_tags_from_offsets
    doc = nlp(u"I like London.")
    entities = [(7, 13, "LOC")]
    tags = biluo_tags_from_offsets(doc, entities)
    assert tags == ["O", "O", "U-LOC", "O"]

| Name | Type | Description | 
|---|---|---|
| doc | Doc | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. | 
| entities | iterable | A sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. | 
| RETURNS | list | Unicode strings, describing the BILUO tags. | 
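
The encoding described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes tokens are given as hypothetical (start_char, end_char) pairs rather than a Doc, and the name `biluo_from_offsets` is made up for this example.

```python
def biluo_from_offsets(token_spans, entities):
    """Encode (start, end, label) character spans as per-token BILUO tags.

    token_spans: list of (start_char, end_char) pairs, one per token.
    entities:    list of (start_char, end_char, label) triples.
    """
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["O"] * len(token_spans)
    for ent_start, ent_end, label in entities:
        first = starts.get(ent_start)
        last = ends.get(ent_end)
        if first is None or last is None:
            # The entity boundary falls inside a token: mark every
            # overlapping token as missing.
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < ent_end and tok_end > ent_start:
                    tags[i] = "-"
        elif first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    return tags


# Token spans for "I like London."
london = [(0, 1), (2, 6), (7, 13), (13, 14)]
unit = biluo_from_offsets(london, [(7, 13, "LOC")])
misaligned = biluo_from_offsets(london, [(7, 12, "LOC")])

# Token spans for "I like New York City"
nyc = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 20)]
multi = biluo_from_offsets(nyc, [(7, 20, "GPE")])
```
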
gold.offsets_from_biluo_tags
Encode per-token tags following the BILUO scheme into entity offsets.
Example

    from spacy.gold import offsets_from_biluo_tags
    doc = nlp(u"I like London.")
    tags = ["O", "O", "U-LOC", "O"]
    entities = offsets_from_biluo_tags(doc, tags)
    assert entities == [(7, 13, "LOC")]

| Name | Type | Description | 
|---|---|---|
| doc | Doc | The document that the BILUO tags refer to. | 
| entities | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". | 
| RETURNS | list | A sequence of (start, end, label) triples. start and end will be character-offset integers denoting the slice into the original string. | 
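
The reverse direction can be sketched the same way. As before, this is an illustration under assumptions rather than the library's code: tokens are represented by hypothetical (start_char, end_char) pairs, `offsets_from_biluo` is a made-up name, and malformed tag sequences are simply skipped.

```python
def offsets_from_biluo(token_spans, tags):
    """Decode per-token BILUO tags into (start_char, end_char, label) triples."""
    entities = []
    start = None  # character offset where the current multi-token entity began
    for (tok_start, tok_end), tag in zip(token_spans, tags):
        if tag is None or tag in ("", "O", "-"):
            continue  # non-entity token or missing value
        action, label = tag.split("-", 1)
        if action == "U":
            entities.append((tok_start, tok_end, label))
        elif action == "B":
            start = tok_start
        elif action == "L" and start is not None:
            entities.append((start, tok_end, label))
            start = None
        # "I" tokens lie inside an entity and need no action here.
    return entities


# Token spans for "I like London."
london = [(0, 1), (2, 6), (7, 13), (13, 14)]
single = offsets_from_biluo(london, ["O", "O", "U-LOC", "O"])

# Token spans for "I like New York City"
nyc = [(0, 1), (2, 6), (7, 10), (11, 15), (16, 20)]
multi = offsets_from_biluo(nyc, ["O", "O", "B-GPE", "I-GPE", "L-GPE"])
```
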
gold.spans_from_biluo_tags
Encode per-token tags following the BILUO scheme into
Span objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite the doc.ents.
Example

    from spacy.gold import spans_from_biluo_tags
    doc = nlp(u"I like London.")
    tags = ["O", "O", "U-LOC", "O"]
    doc.ents = spans_from_biluo_tags(doc, tags)

| Name | Type | Description | 
|---|---|---|
| doc | Doc | The document that the BILUO tags refer to. | 
| entities | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". | 
| RETURNS | list | A sequence of Span objects with added entity labels. |