mirror of https://github.com/explosion/spaCy.git
synced 2025-11-04 01:48:04 +03:00
* Implement Doc.from_json: rough draft.
* Implement Doc.from_json: first draft with tests.
* Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json().
* Implement Doc.from_json: formatting changes.
* Implement Doc.to_json(): reverting unrelated formatting changes.
* Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file.
* Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls.
* Implement Doc.from_json(): handling sentence boundaries in spans.
* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.
* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.
* Implementing Doc.from_json(): incorporated various PR feedback.
* Renaming fixture for document without dependencies.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implementing Doc.from_json(): using two sent_starts instead of one.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master.
* Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations.
* Implement Doc.from_json(): reverting unwanted formatting/rebasing changes.
* Implement Doc.from_json(): added check for char_span() calculation for entities.
* Update spacy/tokens/doc.pyx
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test.
* Implement Doc.from_json(): removed redundancy in annotation type key naming.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): simplifying setting annotation values.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): renaming annot_types to token_attrs.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs.
* Implement Doc.from_json(): removing default categories.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): simplifying lexeme initialization.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): simplifying lexeme initialization.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): refactoring to only have keys for present annotations.
* Implement Doc.from_json(): fix check for tokens' HEAD attributes.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): refactoring Doc.from_json().
* Implement Doc.from_json(): fixing span_group retrieval.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): fixing span retrieval.
* Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json().
* Implement Doc.from_json(): added comment regarding Token and Span extension support.
* Implement Doc.from_json(): renaming inconsistent_props to partial_attrs.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): adjusting error message.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): extending E1038 message.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): added params to E1038 raises.
* Implement Doc.from_json(): combined attribute collection with partial attributes check.
* Implement Doc.from_json(): added optional schema validation.
* Implement Doc.from_json(): fixed optional fields in schema, tests.
* Implement Doc.from_json(): removed redundant None check for DEP.
* Implement Doc.from_json(): added passing of schema validation message to E1037.
* Implement Doc.from_json(): removing redundant error E1040.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): changing message for E1037.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json().
* Update spacy/tests/doc/test_json_doc_conversion.py
* Implement Doc.from_json(): docstring update.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): docstring update.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): website docs update.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): docstring formatting.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): docstring formatting.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): fixing Doc reference in website docs.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): reformatted website/docs/api/doc.md.
* Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts.
* Implement Doc.from_json(): fixing bug in tests.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Implement Doc.from_json(): fix setting of sentence starts for docs without DEP.
* Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py.
* Implement Doc.from_json(): simplify token sentence start manipulation.
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Combine related error messages
* Update spacy/tests/doc/test_json_doc_conversion.py
  Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
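The commits above revolve around the Doc JSON layout that `Doc.to_json()` emits and `Doc.from_json()` consumes. As a rough, hand-written sketch of that layout (key names follow the spaCy website docs; treat the exact shape as an assumption, since the schema class added in these commits is the authoritative definition), the character-offset invariants that the commits add checks for look like this:

```python
# Hand-written sketch of the Doc JSON layout; keys are assumptions
# based on spaCy's docs, not the authoritative schema.
doc_json = {
    "text": "Hello world",
    "tokens": [
        {"id": 0, "start": 0, "end": 5},
        {"id": 1, "start": 6, "end": 11},
    ],
    "ents": [],
    "sents": [{"start": 0, "end": 11}],
}

# from_json() rebuilds tokens from character offsets, so every token's
# (start, end) pair must slice cleanly into "text" ...
for token in doc_json["tokens"]:
    assert 0 <= token["start"] < token["end"] <= len(doc_json["text"])

# ... and entity/sentence spans are reconstructed via Doc.char_span(),
# which is why the commits add checks that these offsets align.
for span in doc_json["ents"] + doc_json["sents"]:
    assert doc_json["text"][span["start"]:span["end"]]
```

If the offsets do not line up with the text, `from_json()` raises rather than silently producing a misaligned `Doc`; passing `validate=True` additionally checks the input against the JSON schema first.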
179 lines · 5.7 KiB · Python
from typing import Callable, Protocol, Iterable, Iterator, Optional
from typing import Union, Tuple, List, Dict, Any, overload
from cymem.cymem import Pool
from thinc.types import Floats1d, Floats2d, Ints2d
from .span import Span
from .token import Token
from ._dict_proxies import SpanGroups
from ._retokenize import Retokenizer
from ..lexeme import Lexeme
from ..vocab import Vocab
from .underscore import Underscore
from pathlib import Path
import numpy as np

class DocMethod(Protocol):
    def __call__(self: Doc, *args: Any, **kwargs: Any) -> Any: ...  # type: ignore[misc]

class Doc:
    vocab: Vocab
    mem: Pool
    spans: SpanGroups
    max_length: int
    length: int
    sentiment: float
    cats: Dict[str, float]
    user_hooks: Dict[str, Callable[..., Any]]
    user_token_hooks: Dict[str, Callable[..., Any]]
    user_span_hooks: Dict[str, Callable[..., Any]]
    tensor: np.ndarray[Any, np.dtype[np.float_]]
    user_data: Dict[str, Any]
    has_unknown_spaces: bool
    _context: Any
    @classmethod
    def set_extension(
        cls,
        name: str,
        default: Optional[Any] = ...,
        getter: Optional[Callable[[Doc], Any]] = ...,
        setter: Optional[Callable[[Doc, Any], None]] = ...,
        method: Optional[DocMethod] = ...,
        force: bool = ...,
    ) -> None: ...
    @classmethod
    def get_extension(
        cls, name: str
    ) -> Tuple[
        Optional[Any],
        Optional[DocMethod],
        Optional[Callable[[Doc], Any]],
        Optional[Callable[[Doc, Any], None]],
    ]: ...
    @classmethod
    def has_extension(cls, name: str) -> bool: ...
    @classmethod
    def remove_extension(
        cls, name: str
    ) -> Tuple[
        Optional[Any],
        Optional[DocMethod],
        Optional[Callable[[Doc], Any]],
        Optional[Callable[[Doc, Any], None]],
    ]: ...
    def __init__(
        self,
        vocab: Vocab,
        words: Optional[List[str]] = ...,
        spaces: Optional[List[bool]] = ...,
        user_data: Optional[Dict[Any, Any]] = ...,
        tags: Optional[List[str]] = ...,
        pos: Optional[List[str]] = ...,
        morphs: Optional[List[str]] = ...,
        lemmas: Optional[List[str]] = ...,
        heads: Optional[List[int]] = ...,
        deps: Optional[List[str]] = ...,
        sent_starts: Optional[List[Union[bool, None]]] = ...,
        ents: Optional[List[str]] = ...,
    ) -> None: ...
    @property
    def _(self) -> Underscore: ...
    @property
    def is_tagged(self) -> bool: ...
    @property
    def is_parsed(self) -> bool: ...
    @property
    def is_nered(self) -> bool: ...
    @property
    def is_sentenced(self) -> bool: ...
    def has_annotation(
        self, attr: Union[int, str], *, require_complete: bool = ...
    ) -> bool: ...
    @overload
    def __getitem__(self, i: int) -> Token: ...
    @overload
    def __getitem__(self, i: slice) -> Span: ...
    def __iter__(self) -> Iterator[Token]: ...
    def __len__(self) -> int: ...
    def __unicode__(self) -> str: ...
    def __bytes__(self) -> bytes: ...
    def __str__(self) -> str: ...
    def __repr__(self) -> str: ...
    @property
    def doc(self) -> Doc: ...
    def char_span(
        self,
        start_idx: int,
        end_idx: int,
        label: Union[int, str] = ...,
        kb_id: Union[int, str] = ...,
        vector: Optional[Floats1d] = ...,
        alignment_mode: str = ...,
    ) -> Span: ...
    def similarity(self, other: Union[Doc, Span, Token, Lexeme]) -> float: ...
    @property
    def has_vector(self) -> bool: ...
    vector: Floats1d
    vector_norm: float
    @property
    def text(self) -> str: ...
    @property
    def text_with_ws(self) -> str: ...
    ents: Tuple[Span]
    def set_ents(
        self,
        entities: List[Span],
        *,
        blocked: Optional[List[Span]] = ...,
        missing: Optional[List[Span]] = ...,
        outside: Optional[List[Span]] = ...,
        default: str = ...
    ) -> None: ...
    @property
    def noun_chunks(self) -> Iterator[Span]: ...
    @property
    def sents(self) -> Iterator[Span]: ...
    @property
    def lang(self) -> int: ...
    @property
    def lang_(self) -> str: ...
    def count_by(
        self, attr_id: int, exclude: Optional[Any] = ..., counts: Optional[Any] = ...
    ) -> Dict[Any, int]: ...
    def from_array(
        self, attrs: Union[int, str, List[Union[int, str]]], array: Ints2d
    ) -> Doc: ...
    def to_array(
        self, py_attr_ids: Union[int, str, List[Union[int, str]]]
    ) -> np.ndarray[Any, np.dtype[np.float_]]: ...
    @staticmethod
    def from_docs(
        docs: List[Doc],
        ensure_whitespace: bool = ...,
        attrs: Optional[Union[Tuple[Union[str, int]], List[Union[int, str]]]] = ...,
    ) -> Doc: ...
    def get_lca_matrix(self) -> Ints2d: ...
    def copy(self) -> Doc: ...
    def to_disk(
        self, path: Union[str, Path], *, exclude: Iterable[str] = ...
    ) -> None: ...
    def from_disk(
        self, path: Union[str, Path], *, exclude: Union[List[str], Tuple[str]] = ...
    ) -> Doc: ...
    def to_bytes(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
    def from_bytes(
        self, bytes_data: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
    ) -> Doc: ...
    def to_dict(self, *, exclude: Union[List[str], Tuple[str]] = ...) -> bytes: ...
    def from_dict(
        self, msg: bytes, *, exclude: Union[List[str], Tuple[str]] = ...
    ) -> Doc: ...
    def extend_tensor(self, tensor: Floats2d) -> None: ...
    def retokenize(self) -> Retokenizer: ...
    def to_json(self, underscore: Optional[List[str]] = ...) -> Dict[str, Any]: ...
    def from_json(
        self, doc_json: Dict[str, Any] = ..., validate: bool = False
    ) -> Doc: ...
    def to_utf8_array(self, nr_char: int = ...) -> Ints2d: ...
    @staticmethod
    def _get_array_attrs() -> Tuple[Any]: ...
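The four classmethods at the top of the stub (`set_extension`, `get_extension`, `has_extension`, `remove_extension`) form a class-level extension registry. To show how their signatures fit together, here is a toy re-implementation; `MiniDoc` and its storage dict are invented for illustration and are not spaCy's actual implementation:

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Toy stand-in for the extension registry the stub declares.
# Entries mirror get_extension()'s return order:
# (default, method, getter, setter).
class MiniDoc:
    _extensions: Dict[str, Tuple[Any, Any, Any, Any]] = {}

    @classmethod
    def set_extension(cls, name: str, default: Optional[Any] = None,
                      getter: Optional[Callable] = None,
                      setter: Optional[Callable] = None,
                      method: Optional[Callable] = None,
                      force: bool = False) -> None:
        # Registering twice is an error unless force=True, matching
        # the force parameter in the stub's signature.
        if name in cls._extensions and not force:
            raise ValueError(f"Extension {name!r} already set")
        cls._extensions[name] = (default, method, getter, setter)

    @classmethod
    def get_extension(cls, name: str) -> Tuple[Any, Any, Any, Any]:
        return cls._extensions[name]

    @classmethod
    def has_extension(cls, name: str) -> bool:
        return name in cls._extensions

    @classmethod
    def remove_extension(cls, name: str) -> Tuple[Any, Any, Any, Any]:
        # Returns the removed entry, as the stub's return type suggests.
        return cls._extensions.pop(name)

MiniDoc.set_extension("is_greeting", default=False)
assert MiniDoc.has_extension("is_greeting")
default, method, getter, setter = MiniDoc.get_extension("is_greeting")
assert default is False and getter is None
MiniDoc.remove_extension("is_greeting")
assert not MiniDoc.has_extension("is_greeting")
```

In spaCy itself the registry backs the `Doc._` namespace (the `_` property typed as `Underscore` above), which is where extension values are read and written per document.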