from cymem.cymem cimport Pool
cimport numpy as np
from preshed.counter cimport PreshCounter

from ..vocab cimport Vocab
from ..structs cimport TokenC, LexemeC
from ..typedefs cimport attr_t
from ..attrs cimport attr_id_t
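

# Look up the value of attribute `feat_name` (e.g. LEMMA, POS) on a single
# token. Safe to call without the GIL.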
cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil
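

# Cython fused types can only range over plain type names, so the const
# pointer types get typedefs first.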
ctypedef const LexemeC* const_Lexeme_ptr
ctypedef const TokenC* const_TokenC_ptr
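

# Either a const LexemeC* or a const TokenC*, so that Doc.push_back() can
# accept both lexemes and tokens.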
ctypedef fused LexemeOrToken:
    const_Lexeme_ptr
    const_TokenC_ptr
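

# Return the index of the token starting at the character offset start_char,
# or -1 if no token starts there. A return value of -2 signals an exception.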
cdef int token_by_start(const TokenC* tokens, int length, int start_char) except -2
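

# Return the index of the token ending at the character offset end_char,
# or -1 if no token ends there. A return value of -2 signals an exception.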
cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2


# Retokenization (#2172; a step towards #1487): the doc.retokenize() context
# manager handles merging spans, and will soon handle splitting tokens.
# Merging and splitting are done like this:
#
#     with doc.retokenize() as retokenizer:
#         for start, end, label in matches:
#             retokenizer.merge(doc[start:end], attrs={'ent_type': label})
#
# The retokenizer accumulates the merge requests and applies them together at
# the end of the block, which makes retokenization more efficient and much
# less error-prone. A retokenizer.split() function will then be added, to
# handle splitting a single token into multiple tokens. These methods take
# Span and Token objects; to work directly from offsets, append to the
# .merges and .splits lists on the retokenizer.
#
# The doc.merge() method's behaviour remains unchanged, so this is 100%
# backwards compatible (modulo bugs): internally, doc.merge() fixes up its
# arguments (to handle the various deprecated call styles), opens the
# retokenizer, and makes the single merge. Deprecation warnings on direct
# calls to doc.merge() can be added later, to migrate users to the
# retokenize context manager.
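#
# A minimal usage sketch of the pattern described above. The model name,
# match pattern and 'GPE' label are illustrative assumptions, not part of
# this file:
#
#     import spacy
#     from spacy.matcher import Matcher
#
#     nlp = spacy.load('en_core_web_sm')
#     doc = nlp(u'She lives in New York')
#     matcher = Matcher(nlp.vocab)
#     matcher.add('GPE', None, [{'ORTH': 'New'}, {'ORTH': 'York'}])
#     with doc.retokenize() as retokenizer:
#         for match_id, start, end in matcher(doc):
#             retokenizer.merge(doc[start:end], attrs={'ent_type': 'GPE'})
#
#     # 'New' and 'York' are now merged into one token with ent_type 'GPE'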
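

# Given each token's head offset, fill in the number of left/right children
# and the left/right edge of each token's subtree.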
cdef int set_children_from_heads(TokenC* tokens, int length) except -1
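

# The Doc: a contiguous array of TokenC structs, plus the Python-level state
# (vocab, vectors, hooks, user data) that belongs to the document.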
cdef class Doc:
    cdef readonly Pool mem         # Memory pool that owns the doc's C data
    cdef readonly Vocab vocab      # Shared vocabulary (string store, lexemes)

    cdef public object _vector         # Cached document vector
    cdef public object _vector_norm    # Cached norm of the document vector

    cdef public object tensor     # Token representations set by pipeline models
    cdef public object cats       # Categories predicted for the document
    cdef public object user_data  # Free-form dict for user-supplied data

    cdef TokenC* c                # The doc's tokens, as a contiguous C array

    cdef public bint is_tagged    # Whether part-of-speech tags have been set
    cdef public bint is_parsed    # Whether a dependency parse has been set

    cdef public float sentiment   # Scalar sentiment score for the document

    # User functions that override Doc, Token and Span behaviour
    cdef public dict user_hooks
    cdef public dict user_token_hooks
    cdef public dict user_span_hooks

    cdef public list _py_tokens   # Lazily created Python Token wrappers

    cdef int length               # Current number of tokens
    cdef int max_length           # Allocated capacity of the token array

    cdef public object noun_chunks_iterator   # Language-specific noun chunker

    cdef object __weakref__       # Allow weak references to the Doc
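
    # Append a single lexeme or token to the doc; has_space records whether
    # the new token is followed by whitespace.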
    cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1
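
    # Export the given attribute IDs to a numpy array of shape
    # (n_tokens, n_features).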
    cpdef np.ndarray to_array(self, object features)
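
    # Copy dependency parse state from a parsed token array onto the doc.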
    cdef void set_parse(self, const TokenC* parsed) nogil