mirror of
https://github.com/explosion/spaCy.git
synced 2025-02-23 15:02:46 +03:00
adding example dictionary formats to data-formats.md
This commit is contained in:
parent
c376c2e122
commit
2ab8c3f780
@@ -28,7 +28,7 @@ spaCy's training format. To convert one or more existing `Doc` objects to
spaCy's JSON format, you can use the
[`gold.docs_to_json`](/api/top-level#docs_to_json) helper.

> #### Annotating entities {#biluo}
>
> Named entities are provided in the
> [BILUO](/usage/linguistic-features#accessing-ner) notation. Tokens outside an
@@ -75,6 +75,121 @@ from the English Wall Street Journal portion of the Penn Treebank:
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
```

### Annotations in dictionary format {#dict-input}

To create [`Example`](/api/example) objects, you can create a dictionary of the
gold-standard annotations `gold_dict`, and then call

```python
example = Example.from_dict(doc, gold_dict)
```

There are currently two supported formats for this dictionary of annotations:
one with a simple, flat structure of keys, and one with a more hierarchical
structure.

#### Flat structure {#dict-flat}

Here is the full overview of possible entries in a flat dictionary of
annotations. You only need to specify the keys corresponding to the task you
want to train.

```python
{
    "text": string,  # Raw text.
    "words": List[string],  # List of gold tokens.
    "lemmas": List[string],  # List of lemmas.
    "spaces": List[bool],  # List of boolean values indicating whether the corresponding token is followed by a space or not.
    "tags": List[string],  # List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging).
    "pos": List[string],  # List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging).
    "morphs": List[string],  # List of [morphological features](/usage/linguistic-features#rule-based-morphology).
    "sent_starts": List[bool],  # List of boolean values indicating whether each token is the first of a sentence or not.
    "deps": List[string],  # List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head.
    "heads": List[int],  # List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text.
    "entities": List[string],  # Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens.
    "entities": List[(int, int, string)],  # Option 2: List of `(start, end, label)` tuples defining all entities in the text.
    "cats": Dict[str, float],  # Dictionary of `label:value` pairs indicating how relevant a certain [category](/api/textcategorizer) is for the text.
    "links": Dict[(int, int), Dict],  # Dictionary of `offset:dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs.
}
```

There are a few caveats to take into account:

- Multiple formats are possible for the "entities" entry, but you have to pick
  one.
- Any values for sentence starts will be ignored if there are annotations for
  dependency relations.
- If the dictionary contains values for "text" and "words", but not "spaces",
  the latter are inferred automatically. If "words" is not provided either, the
  values are inferred from the `doc` argument.

##### Examples

```python
# Training data for a part-of-speech tagger
doc = Doc(vocab, words=["I", "like", "stuff"])
example = Example.from_dict(doc, {"tags": ["NOUN", "VERB", "NOUN"]})

# Training data for an entity recognizer (option 1)
doc = nlp("Laura flew to Silicon Valley.")
biluo_tags = ["U-PERSON", "O", "O", "B-LOC", "L-LOC", "O"]
example = Example.from_dict(doc, {"entities": biluo_tags})

# Training data for an entity recognizer (option 2)
doc = nlp("Laura flew to Silicon Valley.")
entity_tuples = [
    (0, 5, "PERSON"),
    (14, 28, "LOC"),
]
example = Example.from_dict(doc, {"entities": entity_tuples})

# Training data for text categorization
doc = nlp("I'm pretty happy about that!")
example = Example.from_dict(doc, {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})

# Training data for an Entity Linking component
doc = nlp("Russ Cochran his reprints include EC Comics.")
example = Example.from_dict(doc, {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}})
```
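The two formats for "entities" shown above encode the same information. As an illustration of how character offsets map onto per-token [BILUO tags](#biluo), here is a minimal plain-Python sketch; it assumes each token is given with its start character offset (in spaCy, `token.idx`), and it is purely illustrative, not the converter spaCy uses internally:

```python
def offsets_to_biluo(tokens, entities):
    """Convert (start, end, label) character offsets into per-token BILUO tags.

    tokens:   list of (text, start_char) pairs, e.g. [(t.text, t.idx) for t in doc]
    entities: list of (start_char, end_char, label) tuples
    """
    tags = ["O"] * len(tokens)
    # Character span of each token.
    bounds = [(start, start + len(text)) for text, start in tokens]
    for ent_start, ent_end, label in entities:
        # Tokens fully covered by the entity span.
        covered = [i for i, (s, e) in enumerate(bounds)
                   if s >= ent_start and e <= ent_end]
        if not covered:
            continue
        if len(covered) == 1:
            tags[covered[0]] = f"U-{label}"  # Unit-length entity
        else:
            tags[covered[0]] = f"B-{label}"  # Begin
            for i in covered[1:-1]:
                tags[i] = f"I-{label}"       # In
            tags[covered[-1]] = f"L-{label}" # Last
    return tags

# "Laura flew to Silicon Valley." with the offsets from option 2 above:
tokens = [("Laura", 0), ("flew", 6), ("to", 11), ("Silicon", 14), ("Valley", 22), (".", 28)]
entities = [(0, 5, "PERSON"), (14, 28, "LOC")]
print(offsets_to_biluo(tokens, entities))
# ['U-PERSON', 'O', 'O', 'B-LOC', 'L-LOC', 'O']
```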

#### Hierarchical structure {#dict-hierarch}

Internally, a more hierarchical dictionary structure is used to store
gold-standard annotations. Its format is similar to the structure described in
the previous section, but there are two main sections, `token_annotation` and
`doc_annotation`, and the keys for token annotations should be uppercase
[`Token` attributes](/api/token#attributes) such as "ORTH" and "TAG".

```python
{
    "text": string,  # Raw text.
    "token_annotation": {
        "ORTH": List[string],  # List of gold tokens.
        "LEMMA": List[string],  # List of lemmas.
        "SPACY": List[bool],  # List of boolean values indicating whether the corresponding token is followed by a space or not.
        "TAG": List[string],  # List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging).
        "POS": List[string],  # List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging).
        "MORPH": List[string],  # List of [morphological features](/usage/linguistic-features#rule-based-morphology).
        "SENT_START": List[bool],  # List of boolean values indicating whether each token is the first of a sentence or not.
        "DEP": List[string],  # List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head.
        "HEAD": List[int],  # List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text.
    },
    "doc_annotation": {
        "entities": List[(int, int, string)],  # List of `(start, end, label)` tuples defining all entities in the text.
        "cats": Dict[str, float],  # Dictionary of `label:value` pairs indicating how relevant a certain [category](/api/textcategorizer) is for the text.
        "links": Dict[(int, int), Dict],  # Dictionary of `offset:dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs.
    }
}
```

There are a few caveats to take into account:

- Any values for sentence starts will be ignored if there are annotations for
  dependency relations.
- If the dictionary contains values for "text" and "ORTH", but not "SPACY", the
  latter are inferred automatically. If "ORTH" is not provided either, the
  values are inferred from the `doc` argument.
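Since the hierarchical format carries the same information as the flat one, the correspondence between the two can be sketched as a simple key mapping. The following plain-Python sketch is purely illustrative: the mapping table is taken from the two overviews above, but the function itself is not part of spaCy's API.

```python
# Flat keys and their uppercase Token-attribute counterparts, per the tables above.
FLAT_TO_TOKEN_ATTR = {
    "words": "ORTH",
    "lemmas": "LEMMA",
    "spaces": "SPACY",
    "tags": "TAG",
    "pos": "POS",
    "morphs": "MORPH",
    "sent_starts": "SENT_START",
    "deps": "DEP",
    "heads": "HEAD",
}
# Document-level keys keep their names and move under "doc_annotation".
DOC_KEYS = {"entities", "cats", "links"}

def flat_to_hierarchical(gold_dict):
    """Rearrange a flat gold_dict into token_annotation/doc_annotation sections."""
    hier = {"token_annotation": {}, "doc_annotation": {}}
    for key, value in gold_dict.items():
        if key == "text":
            hier["text"] = value
        elif key in FLAT_TO_TOKEN_ATTR:
            hier["token_annotation"][FLAT_TO_TOKEN_ATTR[key]] = value
        elif key in DOC_KEYS:
            hier["doc_annotation"][key] = value
    return hier

flat = {"words": ["I", "like", "stuff"], "tags": ["NOUN", "VERB", "NOUN"]}
print(flat_to_hierarchical(flat))
# {'token_annotation': {'ORTH': ['I', 'like', 'stuff'], 'TAG': ['NOUN', 'VERB', 'NOUN']}, 'doc_annotation': {}}
```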

## Training config {#config new="3"}

Config files define the training process and model pipeline and can be passed to
@@ -41,9 +41,8 @@ both documents.
## Example.from_dict {#from_dict tag="classmethod"}

Construct an `Example` object from the `predicted` document and the reference
annotations provided as a dictionary. For more details on the required format,
see the [training format documentation](/api/data-formats#dict-input).

> #### Example
>