spaCy/spacy/tokens/doc.pyi
Raphael Mitsch 8387ce4c01
Add Doc.from_json() (#10688)
* Implement Doc.from_json: rough draft.

* Implement Doc.from_json: first draft with tests.

* Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json().

* Implement Doc.from_json: formatting changes.

* Implement Doc.to_json(): reverting unrelated formatting changes.

* Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file.

* Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls.

* Implement Doc.from_json(): handling sentence boundaries in spans.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): incorporated various PR feedback.

* Renaming fixture for document without dependencies.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): using two sent_starts instead of one.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master.

* Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations.

* Implement Doc.from_json(): reverting unwanted formatting/rebasing changes.

* Implement Doc.from_json(): added check for char_span() calculation for entities.

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test.

* Implement Doc.from_json(): removed redundancy in annotation type key naming.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): Simplifying setting annotation values.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement doc.from_json(): renaming annot_types to token_attrs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs.

* Implement Doc.from_json(): removing default categories.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring to only have keys for present annotations.

* Implement Doc.from_json(): fix check for tokens' HEAD attributes.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring Doc.from_json().

* Implement Doc.from_json(): fixing span_group retrieval.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing span retrieval.

* Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json().

* Implement Doc.from_json(): added comment regarding Token and Span extension support.

* Implement Doc.from_json(): renaming inconsistent_props to partial_attrs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusting error message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): extending E1038 message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): added params to E1038 raises.

* Implement Doc.from_json(): combined attribute collection with partial attributes check.

* Implement Doc.from_json(): added optional schema validation.

* Implement Doc.from_json(): fixed optional fields in schema, tests.

* Implement Doc.from_json(): removed redundant None check for DEP.

* Implement Doc.from_json(): added passing of schema validation message to E1037.

* Implement Doc.from_json(): removing redundant error E1040.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): changing message for E1037.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json().

* Update spacy/tests/doc/test_json_doc_conversion.py

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): website docs update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing Doc reference in website docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): reformatted website/docs/api/doc.md.

* Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts.

* Implement Doc.from_json(): fixing bug in tests.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fix setting of sentence starts for docs without DEP.

* Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py.

* Implement Doc.from_json(): simplify token sentence start manipulation.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Combine related error messages

* Update spacy/tests/doc/test_json_doc_conversion.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 14:03:47 +02:00

from typing import Callable, Protocol, Iterable, Iterator, Optional
from typing import Union, Tuple, List, Dict, Any, overload
from cymem.cymem import Pool
from thinc.types import Floats1d, Floats2d, Ints2d
from .span import Span
from .token import Token
from ._dict_proxies import SpanGroups
from ._retokenize import Retokenizer
from ..lexeme import Lexeme
from ..vocab import Vocab
from .underscore import Underscore
from pathlib import Path
import numpy as np

class DocMethod(Protocol):
    def __call__(self: Doc, *args: Any, **kwargs: Any) -> Any: ...  # type: ignore[misc]

class Doc:
    vocab: Vocab
    mem: Pool
    spans: SpanGroups
    max_length: int
    length: int
    sentiment: float
    cats: Dict[str, float]
    user_hooks: Dict[str, Callable[..., Any]]
    user_token_hooks: Dict[str, Callable[..., Any]]
    user_span_hooks: Dict[str, Callable[..., Any]]
    tensor: np.ndarray[Any, np.dtype[np.float_]]
    user_data: Dict[str, Any]
    has_unknown_spaces: bool
    _context: Any
    @classmethod
    def set_extension(
        cls,
        name: str,
        default: Optional[Any] = ...,
        getter: Optional[Callable[[Doc], Any]] = ...,
        setter: Optional[Callable[[Doc, Any], None]] = ...,
        method: Optional[DocMethod] = ...,
        force: bool = ...,
    ) -> None: ...
    @classmethod
    def get_extension(
        cls, name: str
    ) -> Tuple[
        Optional[Any],
        Optional[DocMethod],
        Optional[Callable[[Doc], Any]],
        Optional[Callable[[Doc, Any], None]],
    ]: ...
    @classmethod
    def has_extension(cls, name: str) -> bool: ...
    @classmethod
    def remove_extension(
        cls, name: str
    ) -> Tuple[
        Optional[Any],
        Optional[DocMethod],
        Optional[Callable[[Doc], Any]],
        Optional[Callable[[Doc, Any], None]],
    ]: ...
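    # Usage sketch (illustrative comment, not part of the stub): registering and
    # reading back a custom extension attribute. The name "is_greeting" and the
    # `nlp` pipeline object are hypothetical examples.
    #
    #     Doc.set_extension("is_greeting", default=False)
    #     doc = nlp("Hello world")
    #     doc._.is_greeting = True
    #     assert Doc.has_extension("is_greeting")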
    def __init__(
        self,
        vocab: Vocab,
        words: Optional[List[str]] = ...,
        spaces: Optional[List[bool]] = ...,
        user_data: Optional[Dict[Any, Any]] = ...,
        tags: Optional[List[str]] = ...,
        pos: Optional[List[str]] = ...,
        morphs: Optional[List[str]] = ...,
        lemmas: Optional[List[str]] = ...,
        heads: Optional[List[int]] = ...,
        deps: Optional[List[str]] = ...,
        sent_starts: Optional[List[Union[bool, None]]] = ...,
        ents: Optional[List[str]] = ...,
    ) -> None: ...
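    # Usage sketch (illustrative): constructing a Doc directly from words and
    # spaces, without running a pipeline.
    #
    #     from spacy.vocab import Vocab
    #     doc = Doc(Vocab(), words=["Hello", "world"], spaces=[True, False])
    #     assert doc.text == "Hello world"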
    @property
    def _(self) -> Underscore: ...
    @property
    def is_tagged(self) -> bool: ...
    @property
    def is_parsed(self) -> bool: ...
    @property
    def is_nered(self) -> bool: ...
    @property
    def is_sentenced(self) -> bool: ...
    def has_annotation(
        self, attr: Union[int, str], *, require_complete: bool = ...
    ) -> bool: ...
    @overload
    def __getitem__(self, i: int) -> Token: ...
    @overload
    def __getitem__(self, i: slice) -> Span: ...
    def __iter__(self) -> Iterator[Token]: ...
    def __len__(self) -> int: ...
    def __unicode__(self) -> str: ...
    def __bytes__(self) -> bytes: ...
    def __str__(self) -> str: ...
    def __repr__(self) -> str: ...
    @property
    def doc(self) -> Doc: ...
    def char_span(
        self,
        start_idx: int,
        end_idx: int,
        label: Union[int, str] = ...,
        kb_id: Union[int, str] = ...,
        vector: Optional[Floats1d] = ...,
        alignment_mode: str = ...,
    ) -> Span: ...
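    # Usage sketch (illustrative): creating a Span from character offsets; note
    # that at runtime char_span() returns None when the offsets do not align to
    # token boundaries under the default alignment_mode="strict".
    #
    #     span = doc.char_span(0, 5, label="GREETING")  # "Hello" in "Hello world"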
    def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
    @property
    def has_vector(self) -> bool: ...
    vector: Floats1d
    vector_norm: float
    @property
    def text(self) -> str: ...
    @property
    def text_with_ws(self) -> str: ...
    ents: Tuple[Span, ...]
    def set_ents(
        self,
        entities: List[Span],
        *,
        blocked: Optional[List[Span]] = ...,
        missing: Optional[List[Span]] = ...,
        outside: Optional[List[Span]] = ...,
        default: str = ...,
    ) -> None: ...
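    # Usage sketch (illustrative): overwriting the document's entities with a
    # manually created span; the label "GPE" is just an example.
    #
    #     doc = Doc(Vocab(), words=["I", "like", "Berlin"])
    #     doc.set_ents([Span(doc, 2, 3, label="GPE")])
    #     assert [e.text for e in doc.ents] == ["Berlin"]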
    @property
    def noun_chunks(self) -> Iterator[Span]: ...
    @property
    def sents(self) -> Iterator[Span]: ...
    @property
    def lang(self) -> int: ...
    @property
    def lang_(self) -> str: ...
    def count_by(
        self, attr_id: int, exclude: Optional[Any] = ..., counts: Optional[Any] = ...
    ) -> Dict[Any, int]: ...
    def from_array(
        self, attrs: Union[int, str, List[Union[int, str]]], array: Ints2d
    ) -> Doc: ...
    def to_array(
        self, py_attr_ids: Union[int, str, List[Union[int, str]]]
    ) -> np.ndarray[Any, np.dtype[np.float_]]: ...
    @staticmethod
    def from_docs(
        docs: List[Doc],
        ensure_whitespace: bool = ...,
        attrs: Optional[Union[Tuple[Union[str, int], ...], List[Union[int, str]]]] = ...,
    ) -> Doc: ...
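    # Usage sketch (illustrative): concatenating docs that share a vocab into a
    # single Doc, keeping whitespace between them.
    #
    #     merged = Doc.from_docs([doc1, doc2], ensure_whitespace=True)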
    def get_lca_matrix(self) -> Ints2d: ...
    def copy(self) -> Doc: ...
    def to_disk(
        self, path: Union[str, Path], *, exclude: Iterable[str] = ...
    ) -> None: ...
    def from_disk(
        self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str, ...]] = ...
    ) -> Doc: ...
    def to_bytes(self, *, exclude: Union[List[str], Tuple[str, ...]] = ...) -> bytes: ...
    def from_bytes(
        self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str, ...]] = ...
    ) -> Doc: ...
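    # Usage sketch (illustrative): a binary serialization round trip; the
    # receiving Doc must be created with a compatible vocab.
    #
    #     data = doc.to_bytes()
    #     doc2 = Doc(doc.vocab).from_bytes(data)
    #     assert doc2.text == doc.text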
    def to_dict(self, *, exclude: Union[List[str], Tuple[str, ...]] = ...) -> bytes: ...
    def from_dict(
        self, msg: bytes, *, exclude: Union[List[str], Tuple[str, ...]] = ...
    ) -> Doc: ...
    def extend_tensor(self, tensor: Floats2d) -> None: ...
    def retokenize(self) -> Retokenizer: ...
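    # Usage sketch (illustrative): merging two tokens into one via the
    # retokenizer context manager.
    #
    #     with doc.retokenize() as retokenizer:
    #         retokenizer.merge(doc[0:2])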
    def to_json(self, underscore: Optional[List[str]] = ...) -> Dict[str, Any]: ...
    def from_json(
        self, doc_json: Dict[str, Any] = ..., validate: bool = ...
    ) -> Doc: ...
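    # Usage sketch (illustrative): a JSON round trip with the from_json() method
    # added in #10688; validate=True checks the input against the Doc JSON
    # schema before deserializing.
    #
    #     json_doc = doc.to_json()
    #     doc2 = Doc(doc.vocab).from_json(json_doc, validate=True)
    #     assert doc2.text == doc.text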
    def to_utf8_array(self, nr_char: int = ...) -> Ints2d: ...
    @staticmethod
    def _get_array_attrs() -> Tuple[Any, ...]: ...