mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-10 19:57:17 +03:00
Add Doc.from_json() (#10688)
* Implement Doc.from_json: rough draft.
* Implement Doc.from_json: first draft with tests.
* Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json().
* Implement Doc.from_json: formatting changes.
* Implement Doc.to_json(): reverting unrelated formatting changes.
* Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file.
* Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls.
* Implement Doc.from_json(): handling sentence boundaries in spans.
* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.
* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.
* Implementing Doc.from_json(): incorporated various PR feedback.
* Renaming fixture for document without dependencies. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implementing Doc.from_json(): using two sent_starts instead of one. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master.
* Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations.
* Implement Doc.from_json(): reverting unwanted formatting/rebasing changes.
* Implement Doc.from_json(): added check for char_span() calculation for entities.
* Update spacy/tokens/doc.pyx Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test.
* Implement Doc.from_json(): removed redundancy in annotation type key naming. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): Simplifying setting annotation values. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement doc.from_json(): renaming annot_types to token_attrs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs.
* Implement Doc.from_json(): removing default categories. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): simplifying lexeme initialization. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): refactoring to only have keys for present annotations.
* Implement Doc.from_json(): fix check for tokens' HEAD attributes. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): refactoring Doc.from_json().
* Implement Doc.from_json(): fixing span_group retrieval. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): fixing span retrieval.
* Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json().
* Implement Doc.from_json(): added comment regarding Token and Span extension support.
* Implement Doc.from_json(): renaming inconsistent_props to partial_attrs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): adjusting error message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): extending E1038 message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): added params to E1038 raises.
* Implement Doc.from_json(): combined attribute collection with partial attributes check.
* Implement Doc.from_json(): added optional schema validation.
* Implement Doc.from_json(): fixed optional fields in schema, tests.
* Implement Doc.from_json(): removed redundant None check for DEP.
* Implement Doc.from_json(): added passing of schema validation message to E1037.
* Implement Doc.from_json(): removing redundant error E1040. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): changing message for E1037. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json().
* Update spacy/tests/doc/test_json_doc_conversion.py
* Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): docstring update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): website docs update. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): docstring formatting. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): fixing Doc reference in website docs. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): reformatted website/docs/api/doc.md.
* Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts.
* Implement Doc.from_json(): fixing bug in tests. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): fix setting of sentence starts for docs without DEP.
* Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py.
* Implement Doc.from_json(): simplify token sentence start manipulation. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Combine related error messages
* Update spacy/tests/doc/test_json_doc_conversion.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
parent a322d6d5f2
commit 8387ce4c01
@@ -921,6 +921,11 @@ class Errors(metaclass=ErrorsWithCodes):
     E1035 = ("Token index {i} out of bounds ({length})")
     E1036 = ("Cannot index into NoneNode")
     E1037 = ("Invalid attribute value '{attr}'.")
+    E1038 = ("Invalid JSON input: {message}")
+    E1039 = ("The {obj} start or end annotations (start: {start}, end: {end}) "
+             "could not be aligned to token boundaries.")
+    E1040 = ("Doc.from_json requires all tokens to have the same attributes. "
+             "Some tokens do not contain annotation for: {partial_attrs}")
 
 
 # Deprecated model shortcuts, only used in errors and warnings
@@ -485,3 +485,29 @@ class RecommendationSchema(BaseModel):
     word_vectors: Optional[str] = None
     transformer: Optional[RecommendationTrf] = None
     has_letters: bool = True
+
+
+class DocJSONSchema(BaseModel):
+    """
+    JSON/dict format for JSON representation of Doc objects.
+    """
+
+    cats: Optional[Dict[StrictStr, StrictFloat]] = Field(
+        None, title="Categories with corresponding probabilities"
+    )
+    ents: Optional[List[Dict[StrictStr, Union[StrictInt, StrictStr]]]] = Field(
+        None, title="Information on entities"
+    )
+    sents: Optional[List[Dict[StrictStr, StrictInt]]] = Field(
+        None, title="Sentences' start and end character indices"
+    )
+    text: StrictStr = Field(..., title="Document text")
+    spans: Optional[Dict[StrictStr, List[Dict[StrictStr, Union[StrictStr, StrictInt]]]]] = Field(
+        None, title="Span information - end/start indices, label, KB ID"
+    )
+    tokens: List[Dict[StrictStr, Union[StrictStr, StrictInt]]] = Field(
+        ..., title="Token information - ID, start, annotations"
+    )
+    _: Optional[Dict[StrictStr, Any]] = Field(
+        None, title="Any custom data stored in the document's _ attribute"
+    )
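For orientation, a payload in the shape `DocJSONSchema` describes might look like the sketch below. The values mirror the three-token test fixture used in this commit. `check_required_keys` is a hypothetical, stdlib-only helper written for this illustration; it is not part of spaCy, whose own validation goes through `spacy.schemas.validate`.

```python
import json

# Illustrative payload: per the schema, only "text" and "tokens" are required.
doc_json = {
    "text": "c d e ",
    "ents": [{"start": 2, "end": 3, "label": "ORG"}],  # character offsets
    "sents": [{"start": 0, "end": 5}],
    "cats": {"A": 0.3, "B": 0.7},
    "tokens": [
        {"id": 0, "start": 0, "end": 1, "pos": "VERB", "tag": "VBP"},
        {"id": 1, "start": 2, "end": 3, "pos": "NOUN", "tag": "NN"},
        {"id": 2, "start": 4, "end": 5, "pos": "NOUN", "tag": "NN"},
    ],
}

REQUIRED_KEYS = ("text", "tokens")


def check_required_keys(data):
    """Hypothetical helper: list the required top-level keys missing from data."""
    return [key for key in REQUIRED_KEYS if key not in data]


# The payload survives a JSON round trip and carries both required keys.
round_tripped = json.loads(json.dumps(doc_json))
missing = check_required_keys(round_tripped)
```

All other top-level keys (`cats`, `ents`, `sents`, `spans`, `_`) are optional, so a payload without them should still pass the required-key check.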
191
spacy/tests/doc/test_json_doc_conversion.py
Normal file
@@ -0,0 +1,191 @@
+import pytest
+import spacy
+from spacy import schemas
+from spacy.tokens import Doc, Span
+
+
+@pytest.fixture()
+def doc(en_vocab):
+    words = ["c", "d", "e"]
+    pos = ["VERB", "NOUN", "NOUN"]
+    tags = ["VBP", "NN", "NN"]
+    heads = [0, 0, 1]
+    deps = ["ROOT", "dobj", "dobj"]
+    ents = ["O", "B-ORG", "O"]
+    morphs = ["Feat1=A", "Feat1=B", "Feat1=A|Feat2=D"]
+
+    return Doc(
+        en_vocab,
+        words=words,
+        pos=pos,
+        tags=tags,
+        heads=heads,
+        deps=deps,
+        ents=ents,
+        morphs=morphs,
+    )
+
+
+@pytest.fixture()
+def doc_without_deps(en_vocab):
+    words = ["c", "d", "e"]
+    pos = ["VERB", "NOUN", "NOUN"]
+    tags = ["VBP", "NN", "NN"]
+    ents = ["O", "B-ORG", "O"]
+    morphs = ["Feat1=A", "Feat1=B", "Feat1=A|Feat2=D"]
+
+    return Doc(
+        en_vocab,
+        words=words,
+        pos=pos,
+        tags=tags,
+        ents=ents,
+        morphs=morphs,
+        sent_starts=[True, False, True],
+    )
+
+
+def test_doc_to_json(doc):
+    json_doc = doc.to_json()
+    assert json_doc["text"] == "c d e "
+    assert len(json_doc["tokens"]) == 3
+    assert json_doc["tokens"][0]["pos"] == "VERB"
+    assert json_doc["tokens"][0]["tag"] == "VBP"
+    assert json_doc["tokens"][0]["dep"] == "ROOT"
+    assert len(json_doc["ents"]) == 1
+    assert json_doc["ents"][0]["start"] == 2  # character offset!
+    assert json_doc["ents"][0]["end"] == 3  # character offset!
+    assert json_doc["ents"][0]["label"] == "ORG"
+    assert not schemas.validate(schemas.DocJSONSchema, json_doc)
+
+
+def test_doc_to_json_underscore(doc):
+    Doc.set_extension("json_test1", default=False)
+    Doc.set_extension("json_test2", default=False)
+    doc._.json_test1 = "hello world"
+    doc._.json_test2 = [1, 2, 3]
+    json_doc = doc.to_json(underscore=["json_test1", "json_test2"])
+    assert "_" in json_doc
+    assert json_doc["_"]["json_test1"] == "hello world"
+    assert json_doc["_"]["json_test2"] == [1, 2, 3]
+    assert not schemas.validate(schemas.DocJSONSchema, json_doc)
+
+
+def test_doc_to_json_underscore_error_attr(doc):
+    """Test that Doc.to_json() raises an error if a custom attribute doesn't
+    exist in the ._ space."""
+    with pytest.raises(ValueError):
+        doc.to_json(underscore=["json_test3"])
+
+
+def test_doc_to_json_underscore_error_serialize(doc):
+    """Test that Doc.to_json() raises an error if a custom attribute value
+    isn't JSON-serializable."""
+    Doc.set_extension("json_test4", method=lambda doc: doc.text)
+    with pytest.raises(ValueError):
+        doc.to_json(underscore=["json_test4"])
+
+
+def test_doc_to_json_span(doc):
+    """Test that Doc.to_json() includes spans"""
+    doc.spans["test"] = [Span(doc, 0, 2, "test"), Span(doc, 0, 1, "test")]
+    json_doc = doc.to_json()
+    assert "spans" in json_doc
+    assert len(json_doc["spans"]) == 1
+    assert len(json_doc["spans"]["test"]) == 2
+    assert json_doc["spans"]["test"][0]["start"] == 0
+    assert not schemas.validate(schemas.DocJSONSchema, json_doc)
+
+
+def test_json_to_doc(doc):
+    new_doc = Doc(doc.vocab).from_json(doc.to_json(), validate=True)
+    new_tokens = [token for token in new_doc]
+    assert new_doc.text == doc.text == "c d e "
+    assert len(new_tokens) == len([token for token in doc]) == 3
+    assert new_tokens[0].pos == doc[0].pos
+    assert new_tokens[0].tag == doc[0].tag
+    assert new_tokens[0].dep == doc[0].dep
+    assert new_tokens[0].head.idx == doc[0].head.idx
+    assert new_tokens[0].lemma == doc[0].lemma
+    assert len(new_doc.ents) == 1
+    assert new_doc.ents[0].start == 1
+    assert new_doc.ents[0].end == 2
+    assert new_doc.ents[0].label_ == "ORG"
+
+
+def test_json_to_doc_underscore(doc):
+    if not Doc.has_extension("json_test1"):
+        Doc.set_extension("json_test1", default=False)
+    if not Doc.has_extension("json_test2"):
+        Doc.set_extension("json_test2", default=False)
+
+    doc._.json_test1 = "hello world"
+    doc._.json_test2 = [1, 2, 3]
+    json_doc = doc.to_json(underscore=["json_test1", "json_test2"])
+    new_doc = Doc(doc.vocab).from_json(json_doc, validate=True)
+    assert all([new_doc.has_extension(f"json_test{i}") for i in range(1, 3)])
+    assert new_doc._.json_test1 == "hello world"
+    assert new_doc._.json_test2 == [1, 2, 3]
+
+
+def test_json_to_doc_spans(doc):
+    """Test that Doc.from_json() includes correct .spans."""
+    doc.spans["test"] = [
+        Span(doc, 0, 2, label="test"),
+        Span(doc, 0, 1, label="test", kb_id=7),
+    ]
+    json_doc = doc.to_json()
+    new_doc = Doc(doc.vocab).from_json(json_doc, validate=True)
+    assert len(new_doc.spans) == 1
+    assert len(new_doc.spans["test"]) == 2
+    for i in range(2):
+        assert new_doc.spans["test"][i].start == doc.spans["test"][i].start
+        assert new_doc.spans["test"][i].end == doc.spans["test"][i].end
+        assert new_doc.spans["test"][i].label == doc.spans["test"][i].label
+        assert new_doc.spans["test"][i].kb_id == doc.spans["test"][i].kb_id
+
+
+def test_json_to_doc_sents(doc, doc_without_deps):
+    """Test that Doc.from_json() includes correct .sents."""
+    for test_doc in (doc, doc_without_deps):
+        json_doc = test_doc.to_json()
+        new_doc = Doc(doc.vocab).from_json(json_doc, validate=True)
+        assert [sent.text for sent in test_doc.sents] == [
+            sent.text for sent in new_doc.sents
+        ]
+        assert [token.is_sent_start for token in test_doc] == [
+            token.is_sent_start for token in new_doc
+        ]
+
+
+def test_json_to_doc_cats(doc):
+    """Test that Doc.from_json() includes correct .cats."""
+    cats = {"A": 0.3, "B": 0.7}
+    doc.cats = cats
+    json_doc = doc.to_json()
+    new_doc = Doc(doc.vocab).from_json(json_doc, validate=True)
+    assert new_doc.cats == cats
+
+
+def test_json_to_doc_spaces():
+    """Test that Doc.from_json() preserves spaces correctly."""
+    doc = spacy.blank("en")("This is just brilliant.")
+    json_doc = doc.to_json()
+    new_doc = Doc(doc.vocab).from_json(json_doc, validate=True)
+    assert doc.text == new_doc.text
+
+
+def test_json_to_doc_attribute_consistency(doc):
+    """Test that Doc.from_json() raises an exception if tokens don't all have the same set of properties."""
+    doc_json = doc.to_json()
+    doc_json["tokens"][1].pop("morph")
+    with pytest.raises(ValueError):
+        Doc(doc.vocab).from_json(doc_json)
+
+
+def test_json_to_doc_validation_error(doc):
+    """Test that Doc.from_json() raises an exception when validating invalid input."""
+    doc_json = doc.to_json()
+    doc_json.pop("tokens")
+    with pytest.raises(ValueError):
+        Doc(doc.vocab).from_json(doc_json, validate=True)
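The tests above rely on the distinction between character offsets (what `to_json()` writes for entities, for example `start == 2`, `end == 3`) and token indices (what `new_doc.ents[0].start == 1` checks). The sketch below shows one way such a strict alignment could work. `char_span_to_token_span` is a hypothetical stdlib-only helper for illustration, not spaCy's `Doc.char_span` implementation.

```python
def char_span_to_token_span(tokens, start_char, end_char):
    """Map a character range to a (start, end) token range.

    Returns None when either boundary does not coincide with a token
    boundary, which is the situation error E1039 reports.
    """
    starts = {tok["start"]: i for i, tok in enumerate(tokens)}
    ends = {tok["end"]: i for i, tok in enumerate(tokens)}
    if start_char not in starts or end_char not in ends:
        return None
    start, end = starts[start_char], ends[end_char] + 1  # end is exclusive
    return (start, end) if start < end else None


# Token offsets for the fixture text "c d e ".
tokens = [
    {"id": 0, "start": 0, "end": 1},
    {"id": 1, "start": 2, "end": 3},
    {"id": 2, "start": 4, "end": 5},
]

org = char_span_to_token_span(tokens, 2, 3)         # the ORG entity "d"
misaligned = char_span_to_token_span(tokens, 1, 3)  # starts on whitespace
```

Here the ORG entity at characters 2 to 3 maps to the token range 1 to 2, matching the assertions in `test_json_to_doc`.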
@@ -1,72 +0,0 @@
-import pytest
-from spacy.tokens import Doc, Span
-
-
-@pytest.fixture()
-def doc(en_vocab):
-    words = ["c", "d", "e"]
-    pos = ["VERB", "NOUN", "NOUN"]
-    tags = ["VBP", "NN", "NN"]
-    heads = [0, 0, 0]
-    deps = ["ROOT", "dobj", "dobj"]
-    ents = ["O", "B-ORG", "O"]
-    morphs = ["Feat1=A", "Feat1=B", "Feat1=A|Feat2=D"]
-    return Doc(
-        en_vocab,
-        words=words,
-        pos=pos,
-        tags=tags,
-        heads=heads,
-        deps=deps,
-        ents=ents,
-        morphs=morphs,
-    )
-
-
-def test_doc_to_json(doc):
-    json_doc = doc.to_json()
-    assert json_doc["text"] == "c d e "
-    assert len(json_doc["tokens"]) == 3
-    assert json_doc["tokens"][0]["pos"] == "VERB"
-    assert json_doc["tokens"][0]["tag"] == "VBP"
-    assert json_doc["tokens"][0]["dep"] == "ROOT"
-    assert len(json_doc["ents"]) == 1
-    assert json_doc["ents"][0]["start"] == 2  # character offset!
-    assert json_doc["ents"][0]["end"] == 3  # character offset!
-    assert json_doc["ents"][0]["label"] == "ORG"
-
-
-def test_doc_to_json_underscore(doc):
-    Doc.set_extension("json_test1", default=False)
-    Doc.set_extension("json_test2", default=False)
-    doc._.json_test1 = "hello world"
-    doc._.json_test2 = [1, 2, 3]
-    json_doc = doc.to_json(underscore=["json_test1", "json_test2"])
-    assert "_" in json_doc
-    assert json_doc["_"]["json_test1"] == "hello world"
-    assert json_doc["_"]["json_test2"] == [1, 2, 3]
-
-
-def test_doc_to_json_underscore_error_attr(doc):
-    """Test that Doc.to_json() raises an error if a custom attribute doesn't
-    exist in the ._ space."""
-    with pytest.raises(ValueError):
-        doc.to_json(underscore=["json_test3"])
-
-
-def test_doc_to_json_underscore_error_serialize(doc):
-    """Test that Doc.to_json() raises an error if a custom attribute value
-    isn't JSON-serializable."""
-    Doc.set_extension("json_test4", method=lambda doc: doc.text)
-    with pytest.raises(ValueError):
-        doc.to_json(underscore=["json_test4"])
-
-
-def test_doc_to_json_span(doc):
-    """Test that Doc.to_json() includes spans"""
-    doc.spans["test"] = [Span(doc, 0, 2, "test"), Span(doc, 0, 1, "test")]
-    json_doc = doc.to_json()
-    assert "spans" in json_doc
-    assert len(json_doc["spans"]) == 1
-    assert len(json_doc["spans"]["test"]) == 2
-    assert json_doc["spans"]["test"][0]["start"] == 0
@@ -170,6 +170,9 @@ class Doc:
     def extend_tensor(self, tensor: Floats2d) -> None: ...
     def retokenize(self) -> Retokenizer: ...
     def to_json(self, underscore: Optional[List[str]] = ...) -> Dict[str, Any]: ...
+    def from_json(
+        self, doc_json: Dict[str, Any] = ..., validate: bool = False
+    ) -> Doc: ...
     def to_utf8_array(self, nr_char: int = ...) -> Ints2d: ...
     @staticmethod
     def _get_array_attrs() -> Tuple[Any]: ...
@@ -1,4 +1,6 @@
 # cython: infer_types=True, bounds_check=False, profile=True
+from typing import Set
+
 cimport cython
 cimport numpy as np
 from libc.string cimport memcpy
@@ -15,6 +17,7 @@ from thinc.api import get_array_module, get_current_ops
 from thinc.util import copy_array
 import warnings
 
+import spacy.schemas
 from .span cimport Span
 from .token cimport MISSING_DEP
 from ._dict_proxies import SpanGroups
@@ -34,7 +37,7 @@ from .. import parts_of_speech
 from .underscore import Underscore, get_ext_args
 from ._retokenize import Retokenizer
 from ._serialize import ALL_ATTRS as DOCBIN_ALL_ATTRS
-
+from ..util import get_words_and_spaces
 
 DEF PADDING = 5
 
@@ -1475,6 +1478,138 @@ cdef class Doc:
                 remove_label_if_necessary(attributes[i])
                 retokenizer.merge(span, attributes[i])
 
+    def from_json(self, doc_json, *, validate=False):
+        """Convert a JSON document generated by Doc.to_json() to a Doc.
+
+        doc_json (Dict): JSON representation of doc object to load.
+        validate (bool): Whether to validate `doc_json` against the expected schema.
+            Defaults to False.
+        RETURNS (Doc): A doc instance corresponding to the specified JSON representation.
+        """
+
+        if validate:
+            schema_validation_message = spacy.schemas.validate(spacy.schemas.DocJSONSchema, doc_json)
+            if schema_validation_message:
+                raise ValueError(Errors.E1038.format(message=schema_validation_message))
+
+        ### Token-level properties ###
+
+        words = []
+        token_attrs_ids = (POS, HEAD, DEP, LEMMA, TAG, MORPH)
+        # Map annotation type IDs to their string equivalents.
+        token_attrs = {t: self.vocab.strings[t].lower() for t in token_attrs_ids}
+        token_annotations = {}
+
+        # Gather token-level properties.
+        for token_json in doc_json["tokens"]:
+            words.append(doc_json["text"][token_json["start"]:token_json["end"]])
+            for attr, attr_json in token_attrs.items():
+                if attr_json in token_json:
+                    if token_json["id"] == 0 and attr not in token_annotations:
+                        token_annotations[attr] = []
+                    elif attr not in token_annotations:
+                        raise ValueError(Errors.E1040.format(partial_attrs=attr))
+                    token_annotations[attr].append(token_json[attr_json])
+
+        # Initialize doc instance.
+        start = 0
+        cdef const LexemeC* lex
+        cdef bint has_space
+        reconstructed_words, spaces = get_words_and_spaces(words, doc_json["text"])
+        assert words == reconstructed_words
+
+        for word, has_space in zip(words, spaces):
+            lex = self.vocab.get(self.mem, word)
+            self.push_back(lex, has_space)
+
+        # Set remaining token-level attributes via Doc.from_array().
+        if HEAD in token_annotations:
+            token_annotations[HEAD] = [
+                head - i for i, head in enumerate(token_annotations[HEAD])
+            ]
+
+        if DEP in token_annotations and HEAD not in token_annotations:
+            token_annotations[HEAD] = [0] * len(token_annotations[DEP])
+        if HEAD in token_annotations and DEP not in token_annotations:
+            raise ValueError(Errors.E1017)
+        if POS in token_annotations:
+            for pp in set(token_annotations[POS]):
+                if pp not in parts_of_speech.IDS:
+                    raise ValueError(Errors.E1021.format(pp=pp))
+
+        # Collect token attributes, assert all tokens have exactly the same set of attributes.
+        attrs = []
+        partial_attrs: Set[str] = set()
+        for attr in token_attrs.keys():
+            if attr in token_annotations:
+                if len(token_annotations[attr]) != len(words):
+                    partial_attrs.add(token_attrs[attr])
+                attrs.append(attr)
+        if len(partial_attrs):
+            raise ValueError(Errors.E1040.format(partial_attrs=partial_attrs))
+
+        # If there are any other annotations, set them.
+        if attrs:
+            array = self.to_array(attrs)
+            if array.ndim == 1:
+                array = numpy.reshape(array, (array.size, 1))
+            j = 0
+
+            for j, (attr, annot) in enumerate(token_annotations.items()):
+                if attr is HEAD:
+                    for i in range(len(words)):
+                        array[i, j] = annot[i]
+                elif attr is MORPH:
+                    for i in range(len(words)):
+                        array[i, j] = self.vocab.morphology.add(annot[i])
+                else:
+                    for i in range(len(words)):
+                        array[i, j] = self.vocab.strings.add(annot[i])
+            self.from_array(attrs, array)
+
+        ### Span/document properties ###
+
+        # Complement other document-level properties (cats, spans, ents).
+        self.cats = doc_json.get("cats", {})
+
+        # Set sentence boundaries, if dependency parser not available but sentences are specified in JSON.
+        if not self.has_annotation("DEP"):
+            for sent in doc_json.get("sents", {}):
+                char_span = self.char_span(sent["start"], sent["end"])
+                if char_span is None:
+                    raise ValueError(Errors.E1039.format(obj="sentence", start=sent["start"], end=sent["end"]))
+                char_span[0].is_sent_start = True
+                for token in char_span[1:]:
+                    token.is_sent_start = False
+
+
+        for span_group in doc_json.get("spans", {}):
+            spans = []
+            for span in doc_json["spans"][span_group]:
+                char_span = self.char_span(span["start"], span["end"], span["label"], span["kb_id"])
+                if char_span is None:
+                    raise ValueError(Errors.E1039.format(obj="span", start=span["start"], end=span["end"]))
+                spans.append(char_span)
+            self.spans[span_group] = spans
+
+        if "ents" in doc_json:
+            ents = []
+            for ent in doc_json["ents"]:
+                char_span = self.char_span(ent["start"], ent["end"], ent["label"])
+                if char_span is None:
+                    raise ValueError(Errors.E1039.format(obj="entity", start=ent["start"], end=ent["end"]))
+                ents.append(char_span)
+            self.ents = ents
+
+        # Add custom attributes. Note that only Doc extensions are currently considered, Token and Span extensions are
+        # not yet supported.
+        for attr in doc_json.get("_", {}):
+            if not Doc.has_extension(attr):
+                Doc.set_extension(attr)
+            self._.set(attr, doc_json["_"][attr])
+
+        return self
+
     def to_json(self, underscore=None):
         """Convert a Doc to JSON.
 
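Two steps of the token-level half of `from_json()` can be sketched in plain Python: collecting per-attribute value lists while rejecting partially annotated tokens (the condition behind E1040), and turning absolute head indices into the relative offsets that `Doc.from_array()` expects. The function names below are hypothetical and the check is simplified; the method in the hunk above works on attribute IDs rather than on string keys.

```python
def gather_token_annotations(tokens, attr_names):
    """Collect one value list per attribute.

    Every token must carry exactly the attributes present on the first
    token; otherwise a ValueError is raised (compare error E1040).
    """
    if not tokens:
        return {}
    present = [name for name in attr_names if name in tokens[0]]
    annotations = {name: [] for name in present}
    for token in tokens:
        for name in attr_names:
            if (name in token) != (name in annotations):
                raise ValueError(
                    f"Some tokens do not contain annotation for: {name}"
                )
        for name in present:
            annotations[name].append(token[name])
    return annotations


def to_relative_heads(heads):
    """Convert absolute head indices into offsets relative to each token."""
    return [head - i for i, head in enumerate(heads)]


# Values mirror the `doc` fixture: heads = [0, 0, 1].
tokens = [
    {"id": 0, "pos": "VERB", "head": 0},
    {"id": 1, "pos": "NOUN", "head": 0},
    {"id": 2, "pos": "NOUN", "head": 1},
]
annotations = gather_token_annotations(tokens, ("pos", "head"))
relative = to_relative_heads(annotations["head"])
```

For the fixture, token 1 points one position back to token 0 and token 2 points one position back to token 1, so the relative offsets are `[0, -1, -1]`.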
@@ -1485,12 +1620,10 @@ cdef class Doc:
         """
         data = {"text": self.text}
         if self.has_annotation("ENT_IOB"):
-            data["ents"] = [{"start": ent.start_char, "end": ent.end_char,
-                             "label": ent.label_} for ent in self.ents]
+            data["ents"] = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in self.ents]
         if self.has_annotation("SENT_START"):
             sents = list(self.sents)
-            data["sents"] = [{"start": sent.start_char, "end": sent.end_char}
-                             for sent in sents]
+            data["sents"] = [{"start": sent.start_char, "end": sent.end_char} for sent in sents]
         if self.cats:
             data["cats"] = self.cats
         data["tokens"] = []
@@ -1516,7 +1649,9 @@ cdef class Doc:
             for span_group in self.spans:
                 data["spans"][span_group] = []
                 for span in self.spans[span_group]:
-                    span_data = {"start": span.start_char, "end": span.end_char, "label": span.label_, "kb_id": span.kb_id_}
+                    span_data = {
+                        "start": span.start_char, "end": span.end_char, "label": span.label_, "kb_id": span.kb_id_
+                    }
                     data["spans"][span_group].append(span_data)
 
         if underscore:
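Since `to_json()` writes sentences as character ranges, `from_json()` has to turn them back into per-token flags when no dependency parse is available. A minimal stdlib-only sketch of that transfer follows; `sentence_starts` is a hypothetical helper and only approximates what the method does through `Doc.char_span()` and `Token.is_sent_start`.

```python
def sentence_starts(tokens, sents):
    """Return one flag per token: True where a sentence begins, False for
    other tokens inside a sentence, None if no sentence covers the token.

    Raises ValueError when a sentence range does not line up with token
    boundaries (compare error E1039).
    """
    flags = [None] * len(tokens)
    for sent in sents:
        covered = [
            i
            for i, tok in enumerate(tokens)
            if sent["start"] <= tok["start"] and tok["end"] <= sent["end"]
        ]
        aligned = (
            covered
            and tokens[covered[0]]["start"] == sent["start"]
            and tokens[covered[-1]]["end"] == sent["end"]
        )
        if not aligned:
            raise ValueError(
                f"sentence ({sent['start']}, {sent['end']}) "
                "could not be aligned to token boundaries"
            )
        flags[covered[0]] = True
        for i in covered[1:]:
            flags[i] = False
    return flags


# Offsets for "c d e " with sentence starts [True, False, True],
# as in the `doc_without_deps` fixture.
tokens = [
    {"start": 0, "end": 1},
    {"start": 2, "end": 3},
    {"start": 4, "end": 5},
]
sents = [{"start": 0, "end": 3}, {"start": 4, "end": 5}]
flags = sentence_starts(tokens, sents)
```

The resulting flags reproduce the `sent_starts=[True, False, True]` values the fixture was built with, which is the round trip `test_json_to_doc_sents` checks.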
@@ -481,6 +481,45 @@ Deserialize, i.e. import the document contents from a binary string.
 | `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
 | **RETURNS** | The `Doc` object. ~~Doc~~ |
 
+## Doc.to_json {#to_json tag="method"}
+
+Serializes a document to JSON. Note that this format differs from the
+deprecated [`JSON training format`](/api/data-formats#json-input).
+
+> #### Example
+>
+> ```python
+> doc = nlp("All we have to decide is what to do with the time that is given us.")
+> assert doc.to_json()["text"] == doc.text
+> ```
+
+| Name | Description |
+| ------------ | ----------- |
+| `underscore` | Optional list of string names of custom `Doc` attributes. Attribute values need to be JSON-serializable. Values will be added to an `"_"` key in the data, e.g. `"_": {"foo": "bar"}`. ~~Optional[List[str]]~~ |
+| **RETURNS** | The data in JSON format. ~~Dict[str, Any]~~ |
+
+## Doc.from_json {#from_json tag="method" new="3.3.1"}
+
+Deserializes a document from JSON, i.e. generates a document from the provided
+JSON data as generated by [`Doc.to_json()`](/api/doc#to_json).
+
+> #### Example
+>
+> ```python
+> from spacy.tokens import Doc
+> doc = nlp("All we have to decide is what to do with the time that is given us.")
+> doc_json = doc.to_json()
+> deserialized_doc = Doc(nlp.vocab).from_json(doc_json)
+> assert deserialized_doc.text == doc.text == doc_json["text"]
+> ```
+
+| Name | Description |
+| -------------- | ----------- |
+| `doc_json` | The Doc data in JSON format from [`Doc.to_json`](#to_json). ~~Dict[str, Any]~~ |
+| _keyword-only_ | |
+| `validate` | Whether to validate the JSON input against the expected schema for detailed debugging. Defaults to `False`. ~~bool~~ |
+| **RETURNS** | A `Doc` corresponding to the provided JSON. ~~Doc~~ |
+
 ## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
 
 Context manager to handle retokenization of the `Doc`. Modifications to the