spaCy/spacy/tokens/doc.pxd

cimport numpy as np
from cymem.cymem cimport Pool

from ..attrs cimport attr_id_t
from ..structs cimport LexemeC, SpanC, TokenC
from ..typedefs cimport attr_t
from ..vocab cimport Vocab


cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil
# Variant used by the Matcher: normalizes TokenC.sent_start values to booleans.
cdef attr_t get_token_attr_for_matcher(const TokenC* token, attr_id_t feat_name) nogil


ctypedef const LexemeC* const_Lexeme_ptr
ctypedef const TokenC* const_TokenC_ptr

ctypedef fused LexemeOrToken:
    const_Lexeme_ptr
    const_TokenC_ptr


# Adjusts only the tokens in the given range rather than the whole document.
cdef int set_children_from_heads(TokenC* tokens, int start, int end) except -1


cdef int _set_lr_kids_and_edges(TokenC* tokens, int start, int end, int loop_count) except -1


cdef int token_by_start(const TokenC* tokens, int length, int start_char) except -2


cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2


# Helper shared by Doc.get_lca_matrix and Span.get_lca_matrix; returns a memoryview.
cdef int [:,:] _get_lca_matrix(Doc, int start, int end)


cdef class Doc:
    cdef readonly Pool mem
    cdef readonly Vocab vocab

    cdef public object _vector
    cdef public object _vector_norm

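    # A place for users to hang neural network outputs.
    cdef public object tensor
    # Slot for text categories.
    cdef public object cats
    cdef public object user_data
    # Span groups attached to the Doc (exposed as doc.spans).
    cdef readonly object spans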

    # Pointer to the underlying TokenC array (formerly named Doc.data).
    cdef TokenC* c

    # Trainable pipes can store their activations here when save_activations is enabled.
    cdef public dict activations

    cdef public dict user_hooks
    cdef public dict user_token_hooks
    cdef public dict user_span_hooks

    # Records whether the Doc was built without known spacing information.
    cdef public bint has_unknown_spaces

    cdef public object _context

    cdef int length
    cdef int max_length

    cdef public object noun_chunks_iterator

    # Allows weak references to Doc objects.
    cdef object __weakref__

    cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1

    cpdef np.ndarray to_array(self, object features)
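The declarations of `token_by_start` and `token_by_end` give only their signatures, so the following is a minimal pure-Python sketch of what such lookups might do, not the library's implementation. It assumes that each token records its starting character offset and its length, and that `-1` is the ordinary "no such token" result (which would explain why the declarations reserve `-2`, not `-1`, as the error value). The `Tok` tuple and both function bodies are hypothetical stand-ins for illustration.

```python
from bisect import bisect_left
from typing import List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for the token fields this sketch needs."""
    idx: int     # character offset of the token's first character
    length: int  # number of characters in the token


def token_by_start(tokens: List[Tok], start_char: int) -> int:
    """Return the index of the token that starts at start_char, or -1."""
    starts = [t.idx for t in tokens]
    # Token offsets are increasing, so a binary search is enough.
    i = bisect_left(starts, start_char)
    if i < len(starts) and starts[i] == start_char:
        return i
    return -1


def token_by_end(tokens: List[Tok], end_char: int) -> int:
    """Return the index of the token that ends at end_char, or -1."""
    ends = [t.idx + t.length for t in tokens]
    i = bisect_left(ends, end_char)
    if i < len(ends) and ends[i] == end_char:
        return i
    return -1


# "Hello world": "Hello" spans characters 0..5 and "world" spans 6..11.
tokens = [Tok(0, 5), Tok(6, 5)]
```

Under these assumptions, an offset that falls inside a token or on whitespace matches no token boundary, so both lookups report `-1` instead of raising.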