From 8387ce4c01db48d92ac5638e18316c0f1fc8861e Mon Sep 17 00:00:00 2001 From: Raphael Mitsch Date: Thu, 2 Jun 2022 14:03:47 +0200 Subject: [PATCH] Add Doc.from_json() (#10688) * Implement Doc.from_json: rough draft. * Implement Doc.from_json: first draft with tests. * Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json(). * Implement Doc.from_json: formatting changes. * Implement Doc.to_json(): reverting unrelated formatting changes. * Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file. * Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls. * Implement Doc.from_json(): handling sentence boundaries in spans. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): added parser-free sentence boundaries transfer. * Implementing Doc.from_json(): incorporated various PR feedback. * Renaming fixture for document without dependencies. Co-authored-by: Adriane Boyd * Implementing Doc.from_json(): using two sent_starts instead of one. Co-authored-by: Adriane Boyd * Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps. Co-authored-by: Adriane Boyd * Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master. * Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations. * Implement Doc.from_json(): reverting unwanted formatting/rebasing changes. * Implement Doc.from_json(): added check for char_span() calculation for entities. * Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd * Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test. * Implement Doc.from_json(): removed redundancy in annotation type key naming. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): Simplifying setting annotation values. Co-authored-by: Adriane Boyd * Implement doc.from_json(): renaming annot_types to token_attrs. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs. * Implement Doc.from_json(): removing default categories. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): refactoring to only have keys for present annotations. * Implement Doc.from_json(): fix check for tokens' HEAD attributes. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): refactoring Doc.from_json(). * Implement Doc.from_json(): fixing span_group retrieval. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): fixing span retrieval. * Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json(). * Implement Doc.from_json(): added comment regarding Token and Span extension support. * Implement Doc.from_json(): renaming inconsistent_props to partial_attrs.. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): adjusting error message. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): extending E1038 message. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): added params to E1038 raises. * Implement Doc.from_json(): combined attribute collection with partial attributes check. * Implement Doc.from_json(): added optional schema validation. 
* Implement Doc.from_json(): fixed optional fields in schema, tests. * Implement Doc.from_json(): removed redundant None check for DEP. * Implement Doc.from_json(): added passing of schema validatoin message to E1037.. * Implement Doc.from_json(): removing redundant error E1040. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): changing message for E1037. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json(). * Update spacy/tests/doc/test_json_doc_conversion.py * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): website docs update. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): fixing Doc reference in website docs. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): reformatted website/docs/api/doc.md. * Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts. * Implement Doc.from_json(): fixing bug in tests. Co-authored-by: Adriane Boyd * Implement Doc.from_json(): fix setting of sentence starts for docs without DEP. * Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py. * Implement Doc.from_json(): simplify token sentence start manipulation. Co-authored-by: Adriane Boyd * Combine related error messages * Update spacy/tests/doc/test_json_doc_conversion.py Co-authored-by: Adriane Boyd --- spacy/errors.py | 5 + spacy/schemas.py | 26 +++ spacy/tests/doc/test_json_doc_conversion.py | 191 ++++++++++++++++++++ spacy/tests/doc/test_to_json.py | 72 -------- spacy/tokens/doc.pyi | 3 + spacy/tokens/doc.pyx | 147 ++++++++++++++- website/docs/api/doc.md | 39 ++++ 7 files changed, 405 insertions(+), 78 deletions(-) create mode 100644 spacy/tests/doc/test_json_doc_conversion.py delete mode 100644 spacy/tests/doc/test_to_json.py diff --git a/spacy/errors.py b/spacy/errors.py index 665b91659..75e24789f 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -921,6 +921,11 @@ class Errors(metaclass=ErrorsWithCodes): E1035 = ("Token index {i} out of bounds ({length})") E1036 = ("Cannot index into NoneNode") E1037 = ("Invalid attribute value '{attr}'.") + E1038 = ("Invalid JSON input: {message}") + E1039 = ("The {obj} start or end annotations (start: {start}, end: {end}) " + "could not be aligned to token boundaries.") + E1040 = ("Doc.from_json requires all tokens to have the same attributes. " + "Some tokens do not contain annotation for: {partial_attrs}") # Deprecated model shortcuts, only used in errors and warnings diff --git a/spacy/schemas.py b/spacy/schemas.py index 7d87658f2..b284b82e5 100644 --- a/spacy/schemas.py +++ b/spacy/schemas.py @@ -485,3 +485,29 @@ class RecommendationSchema(BaseModel): word_vectors: Optional[str] = None transformer: Optional[RecommendationTrf] = None has_letters: bool = True + + +class DocJSONSchema(BaseModel): + """ + JSON/dict format for JSON representation of Doc objects. 
+ """ + + cats: Optional[Dict[StrictStr, StrictFloat]] = Field( + None, title="Categories with corresponding probabilities" + ) + ents: Optional[List[Dict[StrictStr, Union[StrictInt, StrictStr]]]] = Field( + None, title="Information on entities" + ) + sents: Optional[List[Dict[StrictStr, StrictInt]]] = Field( + None, title="Indices of sentences' start and end indices" + ) + text: StrictStr = Field(..., title="Document text") + spans: Dict[StrictStr, List[Dict[StrictStr, Union[StrictStr, StrictInt]]]] = Field( + None, title="Span information - end/start indices, label, KB ID" + ) + tokens: List[Dict[StrictStr, Union[StrictStr, StrictInt]]] = Field( + ..., title="Token information - ID, start, annotations" + ) + _: Optional[Dict[StrictStr, Any]] = Field( + None, title="Any custom data stored in the document's _ attribute" + ) diff --git a/spacy/tests/doc/test_json_doc_conversion.py b/spacy/tests/doc/test_json_doc_conversion.py new file mode 100644 index 000000000..85e4def29 --- /dev/null +++ b/spacy/tests/doc/test_json_doc_conversion.py @@ -0,0 +1,191 @@ +import pytest +import spacy +from spacy import schemas +from spacy.tokens import Doc, Span + + +@pytest.fixture() +def doc(en_vocab): + words = ["c", "d", "e"] + pos = ["VERB", "NOUN", "NOUN"] + tags = ["VBP", "NN", "NN"] + heads = [0, 0, 1] + deps = ["ROOT", "dobj", "dobj"] + ents = ["O", "B-ORG", "O"] + morphs = ["Feat1=A", "Feat1=B", "Feat1=A|Feat2=D"] + + return Doc( + en_vocab, + words=words, + pos=pos, + tags=tags, + heads=heads, + deps=deps, + ents=ents, + morphs=morphs, + ) + + +@pytest.fixture() +def doc_without_deps(en_vocab): + words = ["c", "d", "e"] + pos = ["VERB", "NOUN", "NOUN"] + tags = ["VBP", "NN", "NN"] + ents = ["O", "B-ORG", "O"] + morphs = ["Feat1=A", "Feat1=B", "Feat1=A|Feat2=D"] + + return Doc( + en_vocab, + words=words, + pos=pos, + tags=tags, + ents=ents, + morphs=morphs, + sent_starts=[True, False, True], + ) + + +def test_doc_to_json(doc): + json_doc = doc.to_json() + assert json_doc["text"] == "c d e " + assert len(json_doc["tokens"]) == 3 + assert json_doc["tokens"][0]["pos"] == "VERB" + assert json_doc["tokens"][0]["tag"] == "VBP" + assert json_doc["tokens"][0]["dep"] == "ROOT" + assert len(json_doc["ents"]) == 1 + assert json_doc["ents"][0]["start"] == 2 # character offset! + assert json_doc["ents"][0]["end"] == 3 # character offset! 
+ assert json_doc["ents"][0]["label"] == "ORG" + assert not schemas.validate(schemas.DocJSONSchema, json_doc) + + +def test_doc_to_json_underscore(doc): + Doc.set_extension("json_test1", default=False) + Doc.set_extension("json_test2", default=False) + doc._.json_test1 = "hello world" + doc._.json_test2 = [1, 2, 3] + json_doc = doc.to_json(underscore=["json_test1", "json_test2"]) + assert "_" in json_doc + assert json_doc["_"]["json_test1"] == "hello world" + assert json_doc["_"]["json_test2"] == [1, 2, 3] + assert not schemas.validate(schemas.DocJSONSchema, json_doc) + + +def test_doc_to_json_underscore_error_attr(doc): + """Test that Doc.to_json() raises an error if a custom attribute doesn't + exist in the ._ space.""" + with pytest.raises(ValueError): + doc.to_json(underscore=["json_test3"]) + + +def test_doc_to_json_underscore_error_serialize(doc): + """Test that Doc.to_json() raises an error if a custom attribute value + isn't JSON-serializable.""" + Doc.set_extension("json_test4", method=lambda doc: doc.text) + with pytest.raises(ValueError): + doc.to_json(underscore=["json_test4"]) + + +def test_doc_to_json_span(doc): + """Test that Doc.to_json() includes spans""" + doc.spans["test"] = [Span(doc, 0, 2, "test"), Span(doc, 0, 1, "test")] + json_doc = doc.to_json() + assert "spans" in json_doc + assert len(json_doc["spans"]) == 1 + assert len(json_doc["spans"]["test"]) == 2 + assert json_doc["spans"]["test"][0]["start"] == 0 + assert not schemas.validate(schemas.DocJSONSchema, json_doc) + + +def test_json_to_doc(doc): + new_doc = Doc(doc.vocab).from_json(doc.to_json(), validate=True) + new_tokens = [token for token in new_doc] + assert new_doc.text == doc.text == "c d e " + assert len(new_tokens) == len([token for token in doc]) == 3 + assert new_tokens[0].pos == doc[0].pos + assert new_tokens[0].tag == doc[0].tag + assert new_tokens[0].dep == doc[0].dep + assert new_tokens[0].head.idx == doc[0].head.idx + assert new_tokens[0].lemma == doc[0].lemma + assert len(new_doc.ents) == 1 + assert new_doc.ents[0].start == 1 + assert new_doc.ents[0].end == 2 + assert new_doc.ents[0].label_ == "ORG" + + +def test_json_to_doc_underscore(doc): + if not Doc.has_extension("json_test1"): + Doc.set_extension("json_test1", default=False) + if not Doc.has_extension("json_test2"): + Doc.set_extension("json_test2", default=False) + + doc._.json_test1 = "hello world" + doc._.json_test2 = [1, 2, 3] + json_doc = doc.to_json(underscore=["json_test1", "json_test2"]) + new_doc = Doc(doc.vocab).from_json(json_doc, validate=True) + assert all([new_doc.has_extension(f"json_test{i}") for i in range(1, 3)]) + assert new_doc._.json_test1 == "hello world" + assert new_doc._.json_test2 == [1, 2, 3] + + +def test_json_to_doc_spans(doc): + """Test that Doc.from_json() includes correct.spans.""" + doc.spans["test"] = [ + Span(doc, 0, 2, label="test"), + Span(doc, 0, 1, label="test", kb_id=7), + ] + json_doc = doc.to_json() + new_doc = Doc(doc.vocab).from_json(json_doc, validate=True) + assert len(new_doc.spans) == 1 + assert len(new_doc.spans["test"]) == 2 + for i in range(2): + assert new_doc.spans["test"][i].start == doc.spans["test"][i].start + assert new_doc.spans["test"][i].end == doc.spans["test"][i].end + assert new_doc.spans["test"][i].label == doc.spans["test"][i].label + assert new_doc.spans["test"][i].kb_id == doc.spans["test"][i].kb_id + + +def test_json_to_doc_sents(doc, doc_without_deps): + """Test that Doc.from_json() includes correct.sents.""" + for test_doc in (doc, doc_without_deps): + json_doc = 
test_doc.to_json() + new_doc = Doc(doc.vocab).from_json(json_doc, validate=True) + assert [sent.text for sent in test_doc.sents] == [ + sent.text for sent in new_doc.sents + ] + assert [token.is_sent_start for token in test_doc] == [ + token.is_sent_start for token in new_doc + ] + + +def test_json_to_doc_cats(doc): + """Test that Doc.from_json() includes correct .cats.""" + cats = {"A": 0.3, "B": 0.7} + doc.cats = cats + json_doc = doc.to_json() + new_doc = Doc(doc.vocab).from_json(json_doc, validate=True) + assert new_doc.cats == cats + + +def test_json_to_doc_spaces(): + """Test that Doc.from_json() preserves spaces correctly.""" + doc = spacy.blank("en")("This is just brilliant.") + json_doc = doc.to_json() + new_doc = Doc(doc.vocab).from_json(json_doc, validate=True) + assert doc.text == new_doc.text + + +def test_json_to_doc_attribute_consistency(doc): + """Test that Doc.from_json() raises an exception if tokens don't all have the same set of properties.""" + doc_json = doc.to_json() + doc_json["tokens"][1].pop("morph") + with pytest.raises(ValueError): + Doc(doc.vocab).from_json(doc_json) + + +def test_json_to_doc_validation_error(doc): + """Test that Doc.from_json() raises an exception when validating invalid input.""" + doc_json = doc.to_json() + doc_json.pop("tokens") + with pytest.raises(ValueError): + Doc(doc.vocab).from_json(doc_json, validate=True) diff --git a/spacy/tests/doc/test_to_json.py b/spacy/tests/doc/test_to_json.py deleted file mode 100644 index 202281654..000000000 --- a/spacy/tests/doc/test_to_json.py +++ /dev/null @@ -1,72 +0,0 @@ -import pytest -from spacy.tokens import Doc, Span - - -@pytest.fixture() -def doc(en_vocab): - words = ["c", "d", "e"] - pos = ["VERB", "NOUN", "NOUN"] - tags = ["VBP", "NN", "NN"] - heads = [0, 0, 0] - deps = ["ROOT", "dobj", "dobj"] - ents = ["O", "B-ORG", "O"] - morphs = ["Feat1=A", "Feat1=B", "Feat1=A|Feat2=D"] - return Doc( - en_vocab, - words=words, - pos=pos, - tags=tags, - heads=heads, - deps=deps, - ents=ents, - morphs=morphs, - ) - - -def test_doc_to_json(doc): - json_doc = doc.to_json() - assert json_doc["text"] == "c d e " - assert len(json_doc["tokens"]) == 3 - assert json_doc["tokens"][0]["pos"] == "VERB" - assert json_doc["tokens"][0]["tag"] == "VBP" - assert json_doc["tokens"][0]["dep"] == "ROOT" - assert len(json_doc["ents"]) == 1 - assert json_doc["ents"][0]["start"] == 2 # character offset! - assert json_doc["ents"][0]["end"] == 3 # character offset! 
- assert json_doc["ents"][0]["label"] == "ORG" - - -def test_doc_to_json_underscore(doc): - Doc.set_extension("json_test1", default=False) - Doc.set_extension("json_test2", default=False) - doc._.json_test1 = "hello world" - doc._.json_test2 = [1, 2, 3] - json_doc = doc.to_json(underscore=["json_test1", "json_test2"]) - assert "_" in json_doc - assert json_doc["_"]["json_test1"] == "hello world" - assert json_doc["_"]["json_test2"] == [1, 2, 3] - - -def test_doc_to_json_underscore_error_attr(doc): - """Test that Doc.to_json() raises an error if a custom attribute doesn't - exist in the ._ space.""" - with pytest.raises(ValueError): - doc.to_json(underscore=["json_test3"]) - - -def test_doc_to_json_underscore_error_serialize(doc): - """Test that Doc.to_json() raises an error if a custom attribute value - isn't JSON-serializable.""" - Doc.set_extension("json_test4", method=lambda doc: doc.text) - with pytest.raises(ValueError): - doc.to_json(underscore=["json_test4"]) - - -def test_doc_to_json_span(doc): - """Test that Doc.to_json() includes spans""" - doc.spans["test"] = [Span(doc, 0, 2, "test"), Span(doc, 0, 1, "test")] - json_doc = doc.to_json() - assert "spans" in json_doc - assert len(json_doc["spans"]) == 1 - assert len(json_doc["spans"]["test"]) == 2 - assert json_doc["spans"]["test"][0]["start"] == 0 diff --git a/spacy/tokens/doc.pyi b/spacy/tokens/doc.pyi index 7e9340d58..a40fa74aa 100644 --- a/spacy/tokens/doc.pyi +++ b/spacy/tokens/doc.pyi @@ -170,6 +170,9 @@ class Doc: def extend_tensor(self, tensor: Floats2d) -> None: ... def retokenize(self) -> Retokenizer: ... def to_json(self, underscore: Optional[List[str]] = ...) -> Dict[str, Any]: ... + def from_json( + self, doc_json: Dict[str, Any] = ..., validate: bool = False + ) -> Doc: ... def to_utf8_array(self, nr_char: int = ...) -> Ints2d: ... @staticmethod def _get_array_attrs() -> Tuple[Any]: ... diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index c0b67fb7c..9bae9afd3 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -1,4 +1,6 @@ # cython: infer_types=True, bounds_check=False, profile=True +from typing import Set + cimport cython cimport numpy as np from libc.string cimport memcpy @@ -15,6 +17,7 @@ from thinc.api import get_array_module, get_current_ops from thinc.util import copy_array import warnings +import spacy.schemas from .span cimport Span from .token cimport MISSING_DEP from ._dict_proxies import SpanGroups @@ -34,7 +37,7 @@ from .. import parts_of_speech from .underscore import Underscore, get_ext_args from ._retokenize import Retokenizer from ._serialize import ALL_ATTRS as DOCBIN_ALL_ATTRS - +from ..util import get_words_and_spaces DEF PADDING = 5 @@ -1475,6 +1478,138 @@ cdef class Doc: remove_label_if_necessary(attributes[i]) retokenizer.merge(span, attributes[i]) + def from_json(self, doc_json, *, validate=False): + """Convert a JSON document generated by Doc.to_json() to a Doc. + + doc_json (Dict): JSON representation of doc object to load. + validate (bool): Whether to validate `doc_json` against the expected schema. + Defaults to False. + RETURNS (Doc): A doc instance corresponding to the specified JSON representation. 
+ """ + + if validate: + schema_validation_message = spacy.schemas.validate(spacy.schemas.DocJSONSchema, doc_json) + if schema_validation_message: + raise ValueError(Errors.E1038.format(message=schema_validation_message)) + + ### Token-level properties ### + + words = [] + token_attrs_ids = (POS, HEAD, DEP, LEMMA, TAG, MORPH) + # Map annotation type IDs to their string equivalents. + token_attrs = {t: self.vocab.strings[t].lower() for t in token_attrs_ids} + token_annotations = {} + + # Gather token-level properties. + for token_json in doc_json["tokens"]: + words.append(doc_json["text"][token_json["start"]:token_json["end"]]) + for attr, attr_json in token_attrs.items(): + if attr_json in token_json: + if token_json["id"] == 0 and attr not in token_annotations: + token_annotations[attr] = [] + elif attr not in token_annotations: + raise ValueError(Errors.E1040.format(partial_attrs=attr)) + token_annotations[attr].append(token_json[attr_json]) + + # Initialize doc instance. + start = 0 + cdef const LexemeC* lex + cdef bint has_space + reconstructed_words, spaces = get_words_and_spaces(words, doc_json["text"]) + assert words == reconstructed_words + + for word, has_space in zip(words, spaces): + lex = self.vocab.get(self.mem, word) + self.push_back(lex, has_space) + + # Set remaining token-level attributes via Doc.from_array(). + if HEAD in token_annotations: + token_annotations[HEAD] = [ + head - i for i, head in enumerate(token_annotations[HEAD]) + ] + + if DEP in token_annotations and HEAD not in token_annotations: + token_annotations[HEAD] = [0] * len(token_annotations[DEP]) + if HEAD in token_annotations and DEP not in token_annotations: + raise ValueError(Errors.E1017) + if POS in token_annotations: + for pp in set(token_annotations[POS]): + if pp not in parts_of_speech.IDS: + raise ValueError(Errors.E1021.format(pp=pp)) + + # Collect token attributes, assert all tokens have exactly the same set of attributes. + attrs = [] + partial_attrs: Set[str] = set() + for attr in token_attrs.keys(): + if attr in token_annotations: + if len(token_annotations[attr]) != len(words): + partial_attrs.add(token_attrs[attr]) + attrs.append(attr) + if len(partial_attrs): + raise ValueError(Errors.E1040.format(partial_attrs=partial_attrs)) + + # If there are any other annotations, set them. + if attrs: + array = self.to_array(attrs) + if array.ndim == 1: + array = numpy.reshape(array, (array.size, 1)) + j = 0 + + for j, (attr, annot) in enumerate(token_annotations.items()): + if attr is HEAD: + for i in range(len(words)): + array[i, j] = annot[i] + elif attr is MORPH: + for i in range(len(words)): + array[i, j] = self.vocab.morphology.add(annot[i]) + else: + for i in range(len(words)): + array[i, j] = self.vocab.strings.add(annot[i]) + self.from_array(attrs, array) + + ### Span/document properties ### + + # Complement other document-level properties (cats, spans, ents). + self.cats = doc_json.get("cats", {}) + + # Set sentence boundaries, if dependency parser not available but sentences are specified in JSON. 
+ if not self.has_annotation("DEP"): + for sent in doc_json.get("sents", {}): + char_span = self.char_span(sent["start"], sent["end"]) + if char_span is None: + raise ValueError(Errors.E1039.format(obj="sentence", start=sent["start"], end=sent["end"])) + char_span[0].is_sent_start = True + for token in char_span[1:]: + token.is_sent_start = False + + + for span_group in doc_json.get("spans", {}): + spans = [] + for span in doc_json["spans"][span_group]: + char_span = self.char_span(span["start"], span["end"], span["label"], span["kb_id"]) + if char_span is None: + raise ValueError(Errors.E1039.format(obj="span", start=span["start"], end=span["end"])) + spans.append(char_span) + self.spans[span_group] = spans + + if "ents" in doc_json: + ents = [] + for ent in doc_json["ents"]: + char_span = self.char_span(ent["start"], ent["end"], ent["label"]) + if char_span is None: + raise ValueError(Errors.E1039.format(obj="entity"), start=ent["start"], end=ent["end"]) + ents.append(char_span) + self.ents = ents + + # Add custom attributes. Note that only Doc extensions are currently considered, Token and Span extensions are + # not yet supported. + for attr in doc_json.get("_", {}): + if not Doc.has_extension(attr): + Doc.set_extension(attr) + self._.set(attr, doc_json["_"][attr]) + + return self + def to_json(self, underscore=None): """Convert a Doc to JSON. @@ -1485,12 +1620,10 @@ cdef class Doc: """ data = {"text": self.text} if self.has_annotation("ENT_IOB"): - data["ents"] = [{"start": ent.start_char, "end": ent.end_char, - "label": ent.label_} for ent in self.ents] + data["ents"] = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in self.ents] if self.has_annotation("SENT_START"): sents = list(self.sents) - data["sents"] = [{"start": sent.start_char, "end": sent.end_char} - for sent in sents] + data["sents"] = [{"start": sent.start_char, "end": sent.end_char} for sent in sents] if self.cats: data["cats"] = self.cats data["tokens"] = [] @@ -1516,7 +1649,9 @@ cdef class Doc: for span_group in self.spans: data["spans"][span_group] = [] for span in self.spans[span_group]: - span_data = {"start": span.start_char, "end": span.end_char, "label": span.label_, "kb_id": span.kb_id_} + span_data = { + "start": span.start_char, "end": span.end_char, "label": span.label_, "kb_id": span.kb_id_ + } data["spans"][span_group].append(span_data) if underscore: diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md index 0008cde31..f97f4ad83 100644 --- a/website/docs/api/doc.md +++ b/website/docs/api/doc.md @@ -481,6 +481,45 @@ Deserialize, i.e. import the document contents from a binary string. | `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | | **RETURNS** | The `Doc` object. ~~Doc~~ | +## Doc.to_json {#to_json tag="method"} + +Serializes a document to JSON. Note that this is format differs from the +deprecated [`JSON training format`](/api/data-formats#json-input). + +> #### Example +> +> ```python +> doc = nlp("All we have to decide is what to do with the time that is given us.") +> assert doc.to_json()["text"] == doc.text +> ``` + +| Name | Description | +| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `underscore` | Optional list of string names of custom `Doc` attributes. Attribute values need to be JSON-serializable. 
Values will be added to an `"_"` key in the data, e.g. `"_": {"foo": "bar"}`. ~~Optional[List[str]]~~ | +| **RETURNS** | The data in JSON format. ~~Dict[str, Any]~~ | + +## Doc.from_json {#from_json tag="method" new="3.3.1"} + +Deserializes a document from JSON, i.e. generates a document from the provided +JSON data as generated by [`Doc.to_json()`](/api/doc#to_json). + +> #### Example +> +> ```python +> from spacy.tokens import Doc +> doc = nlp("All we have to decide is what to do with the time that is given us.") +> doc_json = doc.to_json() +> deserialized_doc = Doc(nlp.vocab).from_json(doc_json) +> assert deserialized_doc.text == doc.text == doc_json["text"] +> ``` + +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------- | +| `doc_json` | The Doc data in JSON format from [`Doc.to_json`](#to_json). ~~Dict[str, Any]~~ | +| _keyword-only_ | | +| `validate` | Whether to validate the JSON input against the expected schema for detailed debugging. Defaults to `False`. ~~bool~~ | +| **RETURNS** | A `Doc` corresponding to the provided JSON. ~~Doc~~ | + ## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"} Context manager to handle retokenization of the `Doc`. Modifications to the
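
Below is a minimal usage sketch, not part of the patch itself, that exercises the `Doc.to_json()`/`Doc.from_json()` round-trip added above, including schema validation and a custom underscore attribute. It assumes a blank English pipeline; the extension name `review_id` is purely illustrative.

```python
import spacy
from spacy.tokens import Doc

# A blank pipeline is enough; any pipeline that produces a Doc would work.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Custom Doc extensions are serialized under the "_" key when listed in `underscore`.
if not Doc.has_extension("review_id"):
    Doc.set_extension("review_id", default=None)
doc._.review_id = "r-123"

doc_json = doc.to_json(underscore=["review_id"])

# validate=True checks doc_json against DocJSONSchema before deserializing;
# e.g. dropping the "tokens" key would raise a ValueError (E1038).
new_doc = Doc(nlp.vocab).from_json(doc_json, validate=True)
assert new_doc.text == doc.text
assert new_doc._.review_id == "r-123"
```

As the tests in the patch show, token-level annotation must be complete: if any token lacks an attribute that the others carry (e.g. `morph`), `Doc.from_json()` raises E1040 rather than filling in the gap.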