adding example dictionary formats to data-formats.md

2025-08-13 08:34:57 +03:00 · 2020-08-02 19:11:01 +02:00 · 2ab8c3f780
commit 2ab8c3f780
parent c376c2e122
2 changed files with 118 additions and 4 deletions
--- a/website/docs/api/data-formats.md
+++ b/website/docs/api/data-formats.md
@@ -28,7 +28,7 @@ spaCy's training format. To convert one or more existing `Doc` objects to
 spaCy's JSON format, you can use the
 [`gold.docs_to_json`](/api/top-level#docs_to_json) helper.

-> #### Annotating entities
+> #### Annotating entities {#biluo}
 >
 > Named entities are provided in the
 > [BILUO](/usage/linguistic-features#accessing-ner) notation. Tokens outside an
@@ -75,6 +75,121 @@ from the English Wall Street Journal portion of the Penn Treebank:
 https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
 ```

+### Annotations in dictionary format {#dict-input}
+
+To create [`Example`](/api/example) objects, you can create a dictionary of the
+gold-standard annotations `gold_dict`, and then call
+
+```python
+example = Example.from_dict(doc, gold_dict)
+```
+
+There are currently two formats supported for this dictionary of annotations:
+one with a simple, flat structure of keywords, and one with a more hierarchical
+structure.
+
+#### Flat structure {#dict-flat}
+
+Here is the full overview of potential entries in a flat dictionary of
+annotations. You only need to specify those keys corresponding to the task you
+want to train.
+
+```python
+{
+    "text": string,                        # Raw text.
+    "words": List[string],                 # List of gold tokens.
+    "lemmas": List[string],                # List of lemmas.
+    "spaces": List[bool],                  # List of boolean values indicating whether the corresponding token is followed by a space or not.
+    "tags": List[string],                  # List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging).
+    "pos": List[string],                   # List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging).
+    "morphs": List[string],                # List of [morphological features](/usage/linguistic-features#rule-based-morphology).
+    "sent_starts": List[bool],             # List of boolean values indicating whether each token is the first of a sentence or not.
+    "deps": List[string],                  # List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head.
+    "heads": List[int],                    # List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text.
+    "entities": List[string],              # Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens.
+    "entities": List[(int, int, string)],  # Option 2: List of `(start, end, label)` tuples defining all entities in the text, using character offsets.
+    "cats": Dict[str, float],              # Dictionary of `label:value` pairs indicating how relevant a certain [category](/api/textcategorizer) is for the text.
+    "links": Dict[(int, int), Dict],       # Dictionary of `offset:dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs.
+}
+```
+
+There are a few caveats to take into account:
+
+- Multiple formats are possible for the "entities" entry, but you have to pick
+  one.
+- Any values for sentence starts will be ignored if there are annotations for
+  dependency relations.
+- If the dictionary contains values for "text" and "words", but not "spaces",
+  the latter are inferred automatically. If "words" is not provided either, the
+  values are inferred from the `doc` argument.
+
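The last caveat can be illustrated with a minimal sketch of how "spaces" could be derived from "text" and "words". The helper `infer_spaces` is hypothetical and written only for illustration; it is not spaCy's own alignment code, and it assumes the words occur in the text in order, separated by single spaces or by nothing.

```python
from typing import List


def infer_spaces(text: str, words: List[str]) -> List[bool]:
    """Return, for each word, whether it is followed by a space in `text`.

    Hypothetical helper for illustration; not spaCy's implementation.
    """
    spaces = []
    position = 0
    for word in words:
        # Locate the word at or after the end of the previous word.
        start = text.find(word, position)
        if start == -1:
            raise ValueError(f"Word {word!r} not found after offset {position}")
        position = start + len(word)
        # The word has a trailing space if the next character is a space.
        spaces.append(position < len(text) and text[position] == " ")
    return spaces


text = "Laura flew to Silicon Valley."
words = ["Laura", "flew", "to", "Silicon", "Valley", "."]
spaces = infer_spaces(text, words)
```

Here the final period is its own token, so "Valley" and "." both get `False` while the other words get `True`.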
+##### Examples
+
+```python
+# Training data for a part-of-speech tagger
+doc = Doc(nlp.vocab, words=["I", "like", "stuff"])
+example = Example.from_dict(doc, {"tags": ["NOUN", "VERB", "NOUN"]})
+
+# Training data for an entity recognizer (option 1)
+doc = nlp("Laura flew to Silicon Valley.")
+biluo_tags = ["U-PERS", "O", "O", "B-LOC", "L-LOC", "O"]
+example = Example.from_dict(doc, {"entities": biluo_tags})
+
+# Training data for an entity recognizer (option 2)
+doc = nlp("Laura flew to Silicon Valley.")
+entity_tuples = [
+    (0, 5, "PERSON"),
+    (14, 28, "LOC"),
+]
+example = Example.from_dict(doc, {"entities": entity_tuples})
+
+# Training data for text categorization
+doc = nlp("I'm pretty happy about that!")
+example = Example.from_dict(doc, {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
+
+# Training data for an Entity Linking component
+doc = nlp("Russ Cochran his reprints include EC Comics.")
+example = Example.from_dict(doc, {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}})
+```
+
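The two "entities" options describe the same information at different granularity: per-token BILUO tags versus character-offset tuples. The sketch below shows one way the offsets could be mapped to tags. The helper `offsets_to_biluo` and the hand-written `token_spans` are assumptions made for illustration, not spaCy's API; the sketch assumes that entity boundaries coincide with token boundaries and that entities do not overlap.

```python
from typing import List, Tuple


def offsets_to_biluo(
    token_spans: List[Tuple[int, int]],
    entities: List[Tuple[int, int, str]],
) -> List[str]:
    """Convert character-offset entities to one BILUO tag per token.

    Hypothetical helper for illustration; not spaCy's implementation.
    """
    tags = ["O"] * len(token_spans)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie entirely inside the entity span.
        covered = [
            i
            for i, (tok_start, tok_end) in enumerate(token_spans)
            if tok_start >= ent_start and tok_end <= ent_end
        ]
        if not covered:
            continue
        if len(covered) == 1:
            tags[covered[0]] = f"U-{label}"
        else:
            tags[covered[0]] = f"B-{label}"
            for i in covered[1:-1]:
                tags[i] = f"I-{label}"
            tags[covered[-1]] = f"L-{label}"
    return tags


# Character spans of the six tokens in "Laura flew to Silicon Valley."
token_spans = [(0, 5), (6, 10), (11, 13), (14, 21), (22, 28), (28, 29)]
entities = [(0, 5, "PERSON"), (14, 28, "LOC")]
biluo_tags = offsets_to_biluo(token_spans, entities)
```

Note that the tag list has one entry per token, including the sentence-final period.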
+#### Hierarchical structure {#dict-hierarch}
+
+Internally, a more hierarchical dictionary structure is used to store
+gold-standard annotations. Its format is similar to the structure described in
+the previous section, but there are two main sections, `token_annotation` and
+`doc_annotation`, and the keys for token annotations should be uppercase
+[`Token` attributes](/api/token#attributes) such as "ORTH" and "TAG".
+
+```python
+{
+    "text": string,                            # Raw text.
+    "token_annotation": {
+        "ORTH": List[string],                  # List of gold tokens.
+        "LEMMA": List[string],                 # List of lemmas.
+        "SPACY": List[bool],                   # List of boolean values indicating whether the corresponding token is followed by a space or not.
+        "TAG": List[string],                   # List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging).
+        "POS": List[string],                   # List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging).
+        "MORPH": List[string],                 # List of [morphological features](/usage/linguistic-features#rule-based-morphology).
+        "SENT_START": List[bool],              # List of boolean values indicating whether each token is the first of a sentence or not.
+        "DEP": List[string],                   # List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head.
+        "HEAD": List[int],                     # List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text.
+    },
+    "doc_annotation": {
+        "entities": List[(int, int, string)],  # List of `(start, end, label)` tuples defining all entities in the text.
+        "cats": Dict[str, float],              # Dictionary of `label:value` pairs indicating how relevant a certain [category](/api/textcategorizer) is for the text.
+        "links": Dict[(int, int), Dict],       # Dictionary of `offset:dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs.
+    }
+}
+```
+
+There are a few caveats to take into account:
+
+- Any values for sentence starts will be ignored if there are annotations for
+  dependency relations.
+- If the dictionary contains values for "text" and "ORTH", but not "SPACY", the
+  latter are inferred automatically. If "ORTH" is not provided either, the
+  values are inferred from the `doc` argument.
+
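To make the correspondence between the two layouts concrete, here is a rough sketch that regroups a flat dictionary into the hierarchical one, using only the key names listed in the two overviews. The function `flat_to_hierarchical` is hypothetical and is not how spaCy converts annotations internally; the tag values are illustrative.

```python
from typing import Any, Dict

# Flat key -> uppercase token attribute, following the two listings.
TOKEN_KEYS = {
    "words": "ORTH",
    "lemmas": "LEMMA",
    "spaces": "SPACY",
    "tags": "TAG",
    "pos": "POS",
    "morphs": "MORPH",
    "sent_starts": "SENT_START",
    "deps": "DEP",
    "heads": "HEAD",
}
# Document-level keys keep their lowercase names.
DOC_KEYS = ("entities", "cats", "links")


def flat_to_hierarchical(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Regroup flat annotations into token- and doc-level sections.

    Hypothetical helper for illustration; not spaCy's implementation.
    """
    nested: Dict[str, Any] = {"token_annotation": {}, "doc_annotation": {}}
    if "text" in flat:
        nested["text"] = flat["text"]
    for key, value in flat.items():
        if key in TOKEN_KEYS:
            nested["token_annotation"][TOKEN_KEYS[key]] = value
        elif key in DOC_KEYS:
            nested["doc_annotation"][key] = value
    return nested


flat = {
    "words": ["I", "like", "stuff"],
    "tags": ["PRP", "VBP", "NN"],
    "cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0},
}
nested = flat_to_hierarchical(flat)
```

Token-level lists end up under `token_annotation` with uppercase keys, while "cats" moves unchanged into `doc_annotation`.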
 ## Training config {#config new="3"}

 Config files define the training process and model pipeline and can be passed to
--- a/website/docs/api/example.md
+++ b/website/docs/api/example.md
@@ -41,9 +41,8 @@ both documents.
 ## Example.from_dict {#from_dict tag="classmethod"}

 Construct an `Example` object from the `predicted` document and the reference
-annotations provided as a dictionary.
-
-<!-- TODO: document formats? legacy & token_annotation stuff -->
+annotations provided as a dictionary. For more details on the required format,
+see the [training format documentation](/api/data-formats#dict-input).

 > #### Example
 >