spaCy/spacy/tokens/doc.pyx
Ines Montani d33953037e
💫 Port master changes over to develop (#2979)
* Create aryaprabhudesai.md (#2681)

* Update _install.jade (#2688)

Typo fix: "models" -> "model"

* Add FAC to spacy.explain (resolves #2706)

* Remove docstrings for deprecated arguments (see #2703)

* When calling getoption() in conftest.py, pass a default option (#2709)

* When calling getoption() in conftest.py, pass a default option

This is necessary to allow testing an installed spacy by running:

  pytest --pyargs spacy

* Add contributor agreement

* update bengali token rules for hyphen and digits (#2731)

* Less norm computations in token similarity (#2730)

* Less norm computations in token similarity

* Contributor agreement

* Remove ')' for clarity (#2737)

Sorry, don't mean to be nitpicky, I just noticed this when going through the CLI and thought it was a quick fix. That said, if this was intention than please let me know.

* added contributor agreement for mbkupfer (#2738)

* Basic support for Telugu language (#2751)

* Lex _attrs for polish language (#2750)

* Signed spaCy contributor agreement

* Added polish version of english lex_attrs

* Introduces a bulk merge function, in order to solve issue #653 (#2696)

* Fix comment

* Introduce bulk merge to increase performance on many span merges

* Sign contributor agreement

* Implement pull request suggestions

* Describe converters more explicitly (see #2643)

* Add multi-threading note to Language.pipe (resolves #2582) [ci skip]

* Fix formatting

* Fix dependency scheme docs (closes #2705) [ci skip]

* Don't set stop word in example (closes #2657) [ci skip]

* Add words to portuguese language _num_words (#2759)

* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Update Indonesian model (#2752)

* adding e-KTP in tokenizer exceptions list

* add exception token

* removing lines with containing space as it won't matter since we use .split() method in the end, added new tokens in exception

* add tokenizer exceptions list

* combining base_norms with norm_exceptions

* adding norm_exception

* fix double key in lemmatizer

* remove unused import on punctuation.py

* reformat stop_words to reduce number of lines, improve readibility

* updating tokenizer exception

* implement is_currency for lang/id

* adding orth_first_upper in tokenizer_exceptions

* update the norm_exception list

* remove bunch of abbreviations

* adding contributors file

* Fixed spaCy+Keras example (#2763)

* bug fixes in keras example

* created contributor agreement

* Adding French hyphenated first name (#2786)

* Fix typo (closes #2784)

* Fix typo (#2795) [ci skip]

Fixed typo on line 6 "regcognizer --> recognizer"

* Adding basic support for Sinhala language. (#2788)

* adding Sinhala language package, stop words, examples and lex_attrs.

* Adding contributor agreement

* Updating contributor agreement

* Also include lowercase norm exceptions

* Fix error (#2802)

* Fix error
ValueError: cannot resize an array that references or is referenced
by another array in this way.  Use the resize function

* added spaCy Contributor Agreement

* Add charlax's contributor agreement (#2805)

* agreement of contributor, may I introduce a tiny pl languge contribution (#2799)

* Contributors agreement

* Contributors agreement

* Contributors agreement

* Add jupyter=True to displacy.render in documentation (#2806)

* Revert "Also include lowercase norm exceptions"

This reverts commit 70f4e8adf3.

* Remove deprecated encoding argument to msgpack

* Set up dependency tree pattern matching skeleton (#2732)

* Fix bug when too many entity types. Fixes #2800

* Fix Python 2 test failure

* Require older msgpack-numpy

* Restore encoding arg on msgpack-numpy

* Try to fix version pin for msgpack-numpy

* Update Portuguese Language (#2790)

* Add words to portuguese language _num_words

* Add words to portuguese language _num_words

* Portuguese - Add/remove stopwords, fix tokenizer, add currency symbols

* Extended punctuation and norm_exceptions in the Portuguese language

* Correct error in spacy universe docs concerning spacy-lookup (#2814)

* Update Keras Example for (Parikh et al, 2016) implementation  (#2803)

* bug fixes in keras example

* created contributor agreement

* baseline for Parikh model

* initial version of parikh 2016 implemented

* tested asymmetric models

* fixed grevious error in normalization

* use standard SNLI test file

* begin to rework parikh example

* initial version of running example

* start to document the new version

* start to document the new version

* Update Decompositional Attention.ipynb

* fixed calls to similarity

* updated the README

* import sys package duh

* simplified indexing on mapping word to IDs

* stupid python indent error

* added code from https://github.com/tensorflow/tensorflow/issues/3388 for tf bug workaround

* Fix typo (closes #2815) [ci skip]

* Update regex version dependency

* Set version to 2.0.13.dev3

* Skip seemingly problematic test

* Remove problematic test

* Try previous version of regex

* Revert "Remove problematic test"

This reverts commit bdebbef455.

* Unskip test

* Try older version of regex

* 💫 Update training examples and use minibatching (#2830)

<!--- Provide a general summary of your changes in the title. -->

## Description
Update the training examples in `/examples/training` to show usage of spaCy's `minibatch` and `compounding` helpers ([see here](https://spacy.io/usage/training#tips-batch-size) for details). The lack of batching in the examples has caused some confusion in the past, especially for beginners who would copy-paste the examples, update them with large training sets and experienced slow and unsatisfying results.

### Types of change
enhancements

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Visual C++ link updated (#2842) (closes #2841) [ci skip]

* New landing page

* Add contribution agreement

* Correcting lang/ru/examples.py (#2845)

* Correct some grammatical inaccuracies in lang\ru\examples.py; filled Contributor Agreement

* Correct some grammatical inaccuracies in lang\ru\examples.py

* Move contributor agreement to separate file

* Set version to 2.0.13.dev4

* Add Persian(Farsi) language support (#2797)

* Also include lowercase norm exceptions

* Remove in favour of https://github.com/explosion/spaCy/graphs/contributors

* Rule-based French Lemmatizer (#2818)

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech 
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Set version to 2.0.13

* Fix formatting and consistency

* Update docs for new version [ci skip]

* Increment version [ci skip]

* Add info on wheels [ci skip]

* Adding "This is a sentence" example to Sinhala (#2846)

* Add wheels badge

* Update badge [ci skip]

* Update README.rst [ci skip]

* Update murmurhash pin

* Increment version to 2.0.14.dev0

* Update GPU docs for v2.0.14

* Add wheel to setup_requires

* Import prefer_gpu and require_gpu functions from Thinc

* Add tests for prefer_gpu() and require_gpu()

* Update requirements and setup.py

* Workaround bug in thinc require_gpu

* Set version to v2.0.14

* Update push-tag script

* Unhack prefer_gpu

* Require thinc 6.10.6

* Update prefer_gpu and require_gpu docs [ci skip]

* Fix specifiers for GPU

* Set version to 2.0.14.dev1

* Set version to 2.0.14

* Update Thinc version pin

* Increment version

* Fix msgpack-numpy version pin

* Increment version

* Update version to 2.0.16

* Update version [ci skip]

* Redundant ')' in the Stop words' example (#2856)

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Documentation improvement regarding joblib and SO (#2867)

Some documentation improvements

## Description
1. Fixed the dead URL to joblib
2. Fixed Stack Overflow brand name (with space)

### Types of change
Documentation

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* raise error when setting overlapping entities as doc.ents (#2880)

* Fix out-of-bounds access in NER training

The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so much be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.

* Change PyThaiNLP Url (#2876)

* Fix missing comma

* Add example showing a fix-up rule for space entities

* Set version to 2.0.17.dev0

* Update regex version

* Revert "Update regex version"

This reverts commit 62358dd867.

* Try setting older regex version, to align with conda

* Set version to 2.0.17

* Add spacy-js to universe [ci-skip]

* Add spacy-raspberry to universe (closes #2889)

* Add script to validate universe json [ci skip]

* Removed space in docs + added contributor indo (#2909)

* - removed unneeded space in documentation

* - added contributor info

* Allow input text of length up to max_length, inclusive (#2922)

* Include universe spec for spacy-wordnet component (#2919)

* feat: include universe spec for spacy-wordnet component

* chore: include spaCy contributor agreement

* Minor formatting changes [ci skip]

* Fix image [ci skip]

Twitter URL doesn't work on live site

* Check if the word is in one of the regular lists specific to each POS (#2886)

* 💫 Create random IDs for SVGs to prevent ID clashes (#2927)

Resolves #2924.

## Description
Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.)

### Types of change
bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Fix typo [ci skip]

* fixes symbolic link on py3 and windows (#2949)

* fixes symbolic link on py3 and windows
during setup of spacy using command
python -m spacy link en_core_web_sm en
closes #2948

* Update spacy/compat.py

Co-Authored-By: cicorias <cicorias@users.noreply.github.com>

* Fix formatting

* Update universe [ci skip]

* Catalan Language Support (#2940)

* Catalan language Support

* Ddding Catalan to documentation

* Sort languages alphabetically [ci skip]

* Update tests for pytest 4.x (#2965)

<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize))
- [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here)

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Fix regex pin to harmonize with conda (#2964)

* Update README.rst

* Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977)

Fixes #2976

* Fix typo

* Fix typo

* Remove duplicate file

* Require thinc 7.0.0.dev2

Fixes bug in gpu_ops that would use cupy instead of numpy on CPU

* Add missing import

* Fix error IDs

* Fix tests
2018-11-29 16:30:29 +01:00

1089 lines
43 KiB
Cython
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# coding: utf8
# cython: infer_types=True
# cython: bounds_check=False
# cython: profile=True
from __future__ import unicode_literals
cimport cython
cimport numpy as np
import numpy
import numpy.linalg
import struct
import dill
import msgpack
from thinc.neural.util import get_array_module, copy_array
from libc.string cimport memcpy, memset
from libc.math cimport sqrt
from .span cimport Span
from .token cimport Token
from .span cimport Span
from .token cimport Token
from .printers import parse_tree
from ..lexeme cimport Lexeme, EMPTY_LEXEME
from ..typedefs cimport attr_t, flags_t
from ..attrs import intify_attrs, IDS
from ..attrs cimport attr_id_t
from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER
from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB
from ..attrs cimport ENT_TYPE, SENT_START
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t
from ..util import normalize_slice
from ..compat import is_config, copy_reg, pickle, basestring_
from ..errors import deprecation_warning, models_warning, user_warning
from ..errors import Errors, Warnings
from .. import util
from .underscore import Underscore, get_ext_args
from ._retokenize import Retokenizer
DEF PADDING = 5
cdef int bounds_check(int i, int length, int padding) except -1:
if (i + padding) < 0:
raise IndexError(Errors.E026.format(i=i, length=length))
if (i - padding) >= length:
raise IndexError(Errors.E026.format(i=i, length=length))
cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil:
if feat_name == LEMMA:
return token.lemma
elif feat_name == POS:
return token.pos
elif feat_name == TAG:
return token.tag
elif feat_name == DEP:
return token.dep
elif feat_name == HEAD:
return token.head
elif feat_name == SENT_START:
return token.sent_start
elif feat_name == SPACY:
return token.spacy
elif feat_name == ENT_IOB:
return token.ent_iob
elif feat_name == ENT_TYPE:
return token.ent_type
else:
return Lexeme.get_struct_attr(token.lex, feat_name)
def _get_chunker(lang):
try:
cls = util.get_lang_class(lang)
except ImportError:
return None
except KeyError:
return None
return cls.Defaults.syntax_iterators.get(u'noun_chunks')
cdef class Doc:
"""A sequence of Token objects. Access sentences and named entities, export
annotations to numpy arrays, losslessly serialize to compressed binary
strings. The `Doc` object holds an array of `TokenC` structs. The
Python-level `Token` and `Span` objects are views of this array, i.e.
they don't own the data themselves.
EXAMPLE: Construction 1
>>> doc = nlp(u'Some text')
Construction 2
>>> from spacy.tokens import Doc
>>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
spaces=[True, False, False])
"""
@classmethod
def set_extension(cls, name, **kwargs):
if cls.has_extension(name) and not kwargs.get('force', False):
raise ValueError(Errors.E090.format(name=name, obj='Doc'))
Underscore.doc_extensions[name] = get_ext_args(**kwargs)
@classmethod
def get_extension(cls, name):
return Underscore.doc_extensions.get(name)
@classmethod
def has_extension(cls, name):
return name in Underscore.doc_extensions
@classmethod
def remove_extension(cls, name):
if not cls.has_extension(name):
raise ValueError(Errors.E046.format(name=name))
return Underscore.doc_extensions.pop(name)
def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None,
orths_and_spaces=None):
"""Create a Doc object.
vocab (Vocab): A vocabulary object, which must match any models you
want to use (e.g. tokenizer, parser, entity recognizer).
words (list or None): A list of unicode strings to add to the document
as words. If `None`, defaults to empty list.
spaces (list or None): A list of boolean values, of the same length as
words. True means that the word is followed by a space, False means
it is not. If `None`, defaults to `[True]*len(words)`
user_data (dict or None): Optional extra data to attach to the Doc.
RETURNS (Doc): The newly constructed object.
"""
self.vocab = vocab
size = 20
self.mem = Pool()
# Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
# However, we need to remember the true starting places, so that we can
# realloc.
data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
cdef int i
for i in range(size + (PADDING*2)):
data_start[i].lex = &EMPTY_LEXEME
data_start[i].l_edge = i
data_start[i].r_edge = i
self.c = data_start + PADDING
self.max_length = size
self.length = 0
self.is_tagged = False
self.is_parsed = False
self.sentiment = 0.0
self.cats = {}
self.user_hooks = {}
self.user_token_hooks = {}
self.user_span_hooks = {}
self.tensor = numpy.zeros((0,), dtype='float32')
self.user_data = {} if user_data is None else user_data
self._vector = None
self.noun_chunks_iterator = _get_chunker(self.vocab.lang)
cdef unicode orth
cdef bint has_space
if orths_and_spaces is None and words is not None:
if spaces is None:
spaces = [True] * len(words)
elif len(spaces) != len(words):
raise ValueError(Errors.E027)
orths_and_spaces = zip(words, spaces)
if orths_and_spaces is not None:
for orth_space in orths_and_spaces:
if isinstance(orth_space, unicode):
orth = orth_space
has_space = True
elif isinstance(orth_space, bytes):
raise ValueError(Errors.E028.format(value=orth_space))
else:
orth, has_space = orth_space
# Note that we pass self.mem here --- we have ownership, if LexemeC
# must be created.
self.push_back(
<const LexemeC*>self.vocab.get(self.mem, orth), has_space)
# Tough to decide on policy for this. Is an empty doc tagged and parsed?
# There's no information we'd like to add to it, so I guess so?
if self.length == 0:
self.is_tagged = True
self.is_parsed = True
@property
def _(self):
return Underscore(Underscore.doc_extensions, self)
@property
def is_sentenced(self):
# Check if the document has sentence boundaries,
# i.e at least one tok has the sent_start in (-1, 1)
if 'sents' in self.user_hooks:
return True
if self.is_parsed:
return True
for i in range(self.length):
if self.c[i].sent_start == -1 or self.c[i].sent_start == 1:
return True
else:
return False
def __getitem__(self, object i):
"""Get a `Token` or `Span` object.
i (int or tuple) The index of the token, or the slice of the document
to get.
RETURNS (Token or Span): The token at `doc[i]]`, or the span at
`doc[start : end]`.
EXAMPLE:
>>> doc[i]
Get the `Token` object at position `i`, where `i` is an integer.
Negative indexing is supported, and follows the usual Python
semantics, i.e. `doc[-2]` is `doc[len(doc) - 2]`.
>>> doc[start : end]]
Get a `Span` object, starting at position `start` and ending at
position `end`, where `start` and `end` are token indices. For
instance, `doc[2:5]` produces a span consisting of tokens 2, 3 and
4. Stepped slices (e.g. `doc[start : end : step]`) are not
supported, as `Span` objects must be contiguous (cannot have gaps).
You can use negative indices and open-ended ranges, which have
their normal Python semantics.
"""
if isinstance(i, slice):
start, stop = normalize_slice(len(self), i.start, i.stop, i.step)
return Span(self, start, stop, label=0)
if i < 0:
i = self.length + i
bounds_check(i, self.length, PADDING)
return Token.cinit(self.vocab, &self.c[i], i, self)
def __iter__(self):
"""Iterate over `Token` objects, from which the annotations can be
easily accessed. This is the main way of accessing `Token` objects,
which are the main way annotations are accessed from Python. If faster-
than-Python speeds are required, you can instead access the annotations
as a numpy array, or access the underlying C data directly from Cython.
EXAMPLE:
>>> for token in doc
"""
cdef int i
for i in range(self.length):
yield Token.cinit(self.vocab, &self.c[i], i, self)
def __len__(self):
"""The number of tokens in the document.
RETURNS (int): The number of tokens in the document.
EXAMPLE:
>>> len(doc)
"""
return self.length
def __unicode__(self):
return u''.join([t.text_with_ws for t in self])
def __bytes__(self):
return u''.join([t.text_with_ws for t in self]).encode('utf-8')
def __str__(self):
if is_config(python3=True):
return self.__unicode__()
return self.__bytes__()
def __repr__(self):
return self.__str__()
@property
def doc(self):
return self
def char_span(self, int start_idx, int end_idx, label=0, vector=None):
"""Create a `Span` object from the slice `doc.text[start : end]`.
doc (Doc): The parent document.
start (int): The index of the first character of the span.
end (int): The index of the first character after the span.
label (uint64 or string): A label to attach to the Span, e.g. for
named entities.
vector (ndarray[ndim=1, dtype='float32']): A meaning representation of
the span.
RETURNS (Span): The newly constructed object.
"""
if not isinstance(label, int):
label = self.vocab.strings.add(label)
cdef int start = token_by_start(self.c, self.length, start_idx)
if start == -1:
return None
cdef int end = token_by_end(self.c, self.length, end_idx)
if end == -1:
return None
# Currently we have the token index, we want the range-end index
end += 1
cdef Span span = Span(self, start, end, label=label, vector=vector)
return span
def similarity(self, other):
"""Make a semantic similarity estimate. The default estimate is cosine
similarity using an average of word vectors.
other (object): The object to compare with. By default, accepts `Doc`,
`Span`, `Token` and `Lexeme` objects.
RETURNS (float): A scalar similarity score. Higher is more similar.
"""
if 'similarity' in self.user_hooks:
return self.user_hooks['similarity'](self, other)
if isinstance(other, (Lexeme, Token)) and self.length == 1:
if self.c[0].lex.orth == other.orth:
return 1.0
elif isinstance(other, (Span, Doc)):
if len(self) == len(other):
for i in range(self.length):
if self[i].orth != other[i].orth:
break
else:
return 1.0
if self.vocab.vectors.n_keys == 0:
models_warning(Warnings.W007.format(obj='Doc'))
if self.vector_norm == 0 or other.vector_norm == 0:
user_warning(Warnings.W008.format(obj='Doc'))
return 0.0
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
property has_vector:
"""A boolean value indicating whether a word vector is associated with
the object.
RETURNS (bool): Whether a word vector is associated with the object.
"""
def __get__(self):
if 'has_vector' in self.user_hooks:
return self.user_hooks['has_vector'](self)
elif self.vocab.vectors.data.size:
return True
elif self.tensor.size:
return True
else:
return False
property vector:
"""A real-valued meaning representation. Defaults to an average of the
token vectors.
RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array
representing the document's semantics.
"""
def __get__(self):
if 'vector' in self.user_hooks:
return self.user_hooks['vector'](self)
if self._vector is not None:
return self._vector
elif not len(self):
self._vector = numpy.zeros((self.vocab.vectors_length,),
dtype='f')
return self._vector
elif self.vocab.vectors.data.size > 0:
vector = numpy.zeros((self.vocab.vectors_length,), dtype='f')
for token in self.c[:self.length]:
vector += self.vocab.get_vector(token.lex.orth)
self._vector = vector / len(self)
return self._vector
elif self.tensor.size > 0:
self._vector = self.tensor.mean(axis=0)
return self._vector
else:
return numpy.zeros((self.vocab.vectors_length,),
dtype='float32')
def __set__(self, value):
self._vector = value
property vector_norm:
"""The L2 norm of the document's vector representation.
RETURNS (float): The L2 norm of the vector representation.
"""
def __get__(self):
if 'vector_norm' in self.user_hooks:
return self.user_hooks['vector_norm'](self)
cdef float value
cdef double norm = 0
if self._vector_norm is None:
norm = 0.0
for value in self.vector:
norm += value * value
self._vector_norm = sqrt(norm) if norm != 0 else 0
return self._vector_norm
def __set__(self, value):
self._vector_norm = value
property text:
"""A unicode representation of the document text.
RETURNS (unicode): The original verbatim text of the document.
"""
def __get__(self):
return u''.join(t.text_with_ws for t in self)
property text_with_ws:
"""An alias of `Doc.text`, provided for duck-type compatibility with
`Span` and `Token`.
RETURNS (unicode): The original verbatim text of the document.
"""
def __get__(self):
return self.text
property ents:
"""Iterate over the entities in the document. Yields named-entity
`Span` objects, if the entity recognizer has been applied to the
document.
YIELDS (Span): Entities in the document.
EXAMPLE: Iterate over the span to get individual Token objects,
or access the label:
>>> tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
>>> ents = list(tokens.ents)
>>> assert ents[0].label == 346
>>> assert ents[0].label_ == 'PERSON'
>>> assert ents[0].orth_ == 'Best'
>>> assert ents[0].text == 'Mr. Best'
"""
def __get__(self):
cdef int i
cdef const TokenC* token
cdef int start = -1
cdef attr_t label = 0
output = []
for i in range(self.length):
token = &self.c[i]
if token.ent_iob == 1:
if start == -1:
seq = ['%s|%s' % (t.text, t.ent_iob_) for t in self[i-5:i+5]]
raise ValueError(Errors.E093.format(seq=' '.join(seq)))
elif token.ent_iob == 2 or token.ent_iob == 0:
if start != -1:
output.append(Span(self, start, i, label=label))
start = -1
label = 0
elif token.ent_iob == 3:
if start != -1:
output.append(Span(self, start, i, label=label))
start = i
label = token.ent_type
if start != -1:
output.append(Span(self, start, self.length, label=label))
return tuple(output)
def __set__(self, ents):
# TODO:
# 1. Allow negative matches
# 2. Ensure pre-set NERs are not over-written during statistical
# prediction
# 3. Test basic data-driven ORTH gazetteer
# 4. Test more nuanced date and currency regex
tokens_in_ents = {}
cdef attr_t entity_type
cdef int ent_start, ent_end
for ent_info in ents:
entity_type, ent_start, ent_end = get_entity_info(ent_info)
for token_index in range(ent_start, ent_end):
if token_index in tokens_in_ents.keys():
raise ValueError(Errors.E103.format(
span1=(tokens_in_ents[token_index][0],
tokens_in_ents[token_index][1],
self.vocab.strings[tokens_in_ents[token_index][2]]),
span2=(ent_start, ent_end, self.vocab.strings[entity_type])))
tokens_in_ents[token_index] = (ent_start, ent_end, entity_type)
cdef int i
for i in range(self.length):
self.c[i].ent_type = 0
self.c[i].ent_iob = 0 # Means missing.
cdef attr_t ent_type
cdef int start, end
for ent_info in ents:
ent_type, start, end = get_entity_info(ent_info)
if ent_type is None or ent_type < 0:
# Mark as O
for i in range(start, end):
self.c[i].ent_type = 0
self.c[i].ent_iob = 2
else:
# Mark (inside) as I
for i in range(start, end):
self.c[i].ent_type = ent_type
self.c[i].ent_iob = 1
# Set start as B
self.c[start].ent_iob = 3
property noun_chunks:
"""Iterate over the base noun phrases in the document. Yields base
noun-phrase #[code Span] objects, if the document has been
syntactically parsed. A base noun phrase, or "NP chunk", is a noun
phrase that does not permit other NPs to be nested within it so no
NP-level coordination, no prepositional phrases, and no relative
clauses.
YIELDS (Span): Noun chunks in the document.
"""
def __get__(self):
if not self.is_parsed:
raise ValueError(Errors.E029)
# Accumulate the result before beginning to iterate over it. This
# prevents the tokenisation from being changed out from under us
# during the iteration. The tricky thing here is that Span accepts
# its tokenisation changing, so it's okay once we have the Span
# objects. See Issue #375.
spans = []
if self.noun_chunks_iterator is not None:
for start, end, label in self.noun_chunks_iterator(self):
spans.append(Span(self, start, end, label=label))
for span in spans:
yield span
property sents:
"""Iterate over the sentences in the document. Yields sentence `Span`
objects. Sentence spans have no label. To improve accuracy on informal
texts, spaCy calculates sentence boundaries from the syntactic
dependency parse. If the parser is disabled, the `sents` iterator will
be unavailable.
EXAMPLE:
>>> doc = nlp("This is a sentence. Here's another...")
>>> assert [s.root.text for s in doc.sents] == ["is", "'s"]
"""
def __get__(self):
if not self.is_sentenced:
raise ValueError(Errors.E030)
if 'sents' in self.user_hooks:
yield from self.user_hooks['sents'](self)
else:
start = 0
for i in range(1, self.length):
if self.c[i].sent_start == 1:
yield Span(self, start, i)
start = i
if start != self.length:
yield Span(self, start, self.length)
cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1:
if self.length == 0:
# Flip these to false when we see the first token.
self.is_tagged = False
self.is_parsed = False
if self.length == self.max_length:
self._realloc(self.length * 2)
cdef TokenC* t = &self.c[self.length]
if LexemeOrToken is const_TokenC_ptr:
t[0] = lex_or_tok[0]
else:
t.lex = lex_or_tok
if self.length == 0:
t.idx = 0
else:
t.idx = (t-1).idx + (t-1).lex.length + (t-1).spacy
t.l_edge = self.length
t.r_edge = self.length
if t.lex.orth == 0:
raise ValueError(Errors.E031.format(i=self.length))
t.spacy = has_space
self.length += 1
return t.idx + t.lex.length + t.spacy
@cython.boundscheck(False)
cpdef np.ndarray to_array(self, object py_attr_ids):
"""Export given token attributes to a numpy `ndarray`.
If `attr_ids` is a sequence of M attributes, the output array will be
of shape `(N, M)`, where N is the length of the `Doc` (in tokens). If
`attr_ids` is a single attribute, the output shape will be (N,). You
can specify attributes by integer ID (e.g. spacy.attrs.LEMMA) or
string name (e.g. 'LEMMA' or 'lemma').
attr_ids (list[]): A list of attributes (int IDs or string names).
RETURNS (numpy.ndarray[long, ndim=2]): A feature matrix, with one row
per word, and one column per attribute indicated in the input
`attr_ids`.
EXAMPLE:
>>> from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
>>> doc = nlp(text)
>>> # All strings mapped to integers, for easy export to numpy
>>> np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
"""
cdef int i, j
cdef attr_id_t feature
cdef np.ndarray[attr_t, ndim=2] output
# Handle scalar/list inputs of strings/ints for py_attr_ids
if not hasattr(py_attr_ids, '__iter__') \
and not isinstance(py_attr_ids, basestring_):
py_attr_ids = [py_attr_ids]
# Allow strings, e.g. 'lemma' or 'LEMMA'
py_attr_ids = [(IDS[id_.upper()] if hasattr(id_, 'upper') else id_)
for id_ in py_attr_ids]
# Make an array from the attributes --- otherwise our inner loop is
# Python dict iteration.
cdef np.ndarray attr_ids = numpy.asarray(py_attr_ids, dtype='i')
output = numpy.ndarray(shape=(self.length, len(attr_ids)),
dtype=numpy.uint64)
c_output = <attr_t*>output.data
c_attr_ids = <attr_id_t*>attr_ids.data
cdef TokenC* token
cdef int nr_attr = attr_ids.shape[0]
for i in range(self.length):
token = &self.c[i]
for j in range(nr_attr):
c_output[i*nr_attr + j] = get_token_attr(token, c_attr_ids[j])
# Handle 1d case
return output if len(attr_ids) >= 2 else output.reshape((self.length,))
def count_by(self, attr_id_t attr_id, exclude=None,
PreshCounter counts=None):
"""Count the frequencies of a given attribute. Produces a dict of
`{attribute (int): count (ints)}` frequencies, keyed by the values of
the given attribute ID.
attr_id (int): The attribute ID to key the counts.
RETURNS (dict): A dictionary mapping attributes to integer counts.
EXAMPLE:
>>> from spacy import attrs
>>> doc = nlp(u'apple apple orange banana')
>>> tokens.count_by(attrs.ORTH)
{12800L: 1, 11880L: 2, 7561L: 1}
>>> tokens.to_array([attrs.ORTH])
array([[11880], [11880], [7561], [12800]])
"""
cdef int i
cdef attr_t attr
cdef size_t count
if counts is None:
counts = PreshCounter()
output_dict = True
else:
output_dict = False
# Take this check out of the loop, for a bit of extra speed
if exclude is None:
for i in range(self.length):
counts.inc(get_token_attr(&self.c[i], attr_id), 1)
else:
for i in range(self.length):
if not exclude(self[i]):
attr = get_token_attr(&self.c[i], attr_id)
counts.inc(attr, 1)
if output_dict:
return dict(counts)
def _realloc(self, new_size):
self.max_length = new_size
n = new_size + (PADDING * 2)
# What we're storing is a "padded" array. We've jumped forward PADDING
# places, and are storing the pointer to that. This way, we can access
# words out-of-bounds, and get out-of-bounds markers.
# Now that we want to realloc, we need the address of the true start,
# so we jump the pointer back PADDING places.
cdef TokenC* data_start = self.c - PADDING
data_start = <TokenC*>self.mem.realloc(data_start, n * sizeof(TokenC))
self.c = data_start + PADDING
cdef int i
for i in range(self.length, self.max_length + PADDING):
self.c[i].lex = &EMPTY_LEXEME
cdef void set_parse(self, const TokenC* parsed) nogil:
# TODO: This method is fairly misleading atm. It's used by Parser
# to actually apply the parse calculated. Need to rethink this.
# Probably we should use from_array?
self.is_parsed = True
for i in range(self.length):
self.c[i] = parsed[i]
def from_array(self, attrs, array):
if SENT_START in attrs and HEAD in attrs:
raise ValueError(Errors.E032)
cdef int i, col
cdef attr_id_t attr_id
cdef TokenC* tokens = self.c
cdef int length = len(array)
# Get set up for fast loading
cdef Pool mem = Pool()
cdef int n_attrs = len(attrs)
attr_ids = <attr_id_t*>mem.alloc(n_attrs, sizeof(attr_id_t))
for i, attr_id in enumerate(attrs):
attr_ids[i] = attr_id
# Now load the data
for i in range(self.length):
token = &self.c[i]
for j in range(n_attrs):
Token.set_struct_attr(token, attr_ids[j], array[i, j])
# Auxiliary loading logic
for col, attr_id in enumerate(attrs):
if attr_id == TAG:
for i in range(length):
if array[i, col] != 0:
self.vocab.morphology.assign_tag(&tokens[i], array[i, col])
# set flags
self.is_parsed = bool(HEAD in attrs or DEP in attrs)
self.is_tagged = bool(TAG in attrs or POS in attrs)
# if document is parsed, set children
if self.is_parsed:
set_children_from_heads(self.c, self.length)
return self
def get_lca_matrix(self):
"""Calculates the lowest common ancestor matrix for a given `Doc`.
Returns LCA matrix containing the integer index of the ancestor, or -1
if no common ancestor is found (ex if span excludes a necessary
ancestor). Apologies about the recursion, but the impact on
performance is negligible given the natural limitations on the depth
of a typical human sentence.
"""
# Efficiency notes:
# We can easily improve the performance here by iterating in Cython.
# To loop over the tokens in Cython, the easiest way is:
# for token in doc.c[:doc.c.length]:
# head = token + token.head
# Both token and head will be TokenC* here. The token.head attribute
# is an integer offset.
def __pairwise_lca(token_j, token_k, lca_matrix):
if lca_matrix[token_j.i][token_k.i] != -2:
return lca_matrix[token_j.i][token_k.i]
elif token_j == token_k:
lca_index = token_j.i
elif token_k.head == token_j:
lca_index = token_j.i
elif token_j.head == token_k:
lca_index = token_k.i
elif (token_j.head == token_j) and (token_k.head == token_k):
lca_index = -1
else:
lca_index = __pairwise_lca(token_j.head, token_k.head,
lca_matrix)
lca_matrix[token_j.i][token_k.i] = lca_index
lca_matrix[token_k.i][token_j.i] = lca_index
return lca_index
lca_matrix = numpy.empty((len(self), len(self)), dtype=numpy.int32)
lca_matrix.fill(-2)
for j in range(len(self)):
token_j = self[j]
for k in range(j, len(self)):
token_k = self[k]
lca_matrix[j][k] = __pairwise_lca(token_j, token_k, lca_matrix)
lca_matrix[k][j] = lca_matrix[j][k]
return lca_matrix
def to_disk(self, path, **exclude):
"""Save the current state to a directory.
path (unicode or Path): A path to a directory, which will be created if
it doesn't exist. Paths may be either strings or Path-like objects.
"""
path = util.ensure_path(path)
with path.open('wb') as file_:
file_.write(self.to_bytes(**exclude))
def from_disk(self, path, **exclude):
"""Loads state from a directory. Modifies the object in place and
returns it.
path (unicode or Path): A path to a directory. Paths may be either
strings or `Path`-like objects.
RETURNS (Doc): The modified `Doc` object.
"""
path = util.ensure_path(path)
with path.open('rb') as file_:
bytes_data = file_.read()
return self.from_bytes(bytes_data, **exclude)
def to_bytes(self, **exclude):
"""Serialize, i.e. export the document contents to a binary string.
RETURNS (bytes): A losslessly serialized copy of the `Doc`, including
all annotations.
"""
array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE]
if self.is_tagged:
array_head.append(TAG)
# if doc parsed add head and dep attribute
if self.is_parsed:
array_head.extend([HEAD, DEP])
# otherwise add sent_start
else:
array_head.append(SENT_START)
# Msgpack doesn't distinguish between lists and tuples, which is
# vexing for user data. As a best guess, we *know* that within
# keys, we must have tuples. In values we just have to hope
# users don't mind getting a list instead of a tuple.
serializers = {
'text': lambda: self.text,
'array_head': lambda: array_head,
'array_body': lambda: self.to_array(array_head),
'sentiment': lambda: self.sentiment,
'tensor': lambda: self.tensor,
}
if 'user_data' not in exclude and self.user_data:
user_data_keys, user_data_values = list(zip(*self.user_data.items()))
serializers['user_data_keys'] = lambda: msgpack.dumps(user_data_keys)
serializers['user_data_values'] = lambda: msgpack.dumps(user_data_values)
return util.to_bytes(serializers, exclude)
def from_bytes(self, bytes_data, **exclude):
"""Deserialize, i.e. import the document contents from a binary string.
data (bytes): The string to load from.
RETURNS (Doc): Itself.
"""
if self.length != 0:
raise ValueError(Errors.E033.format(length=self.length))
deserializers = {
'text': lambda b: None,
'array_head': lambda b: None,
'array_body': lambda b: None,
'sentiment': lambda b: None,
'tensor': lambda b: None,
'user_data_keys': lambda b: None,
'user_data_values': lambda b: None,
}
msg = util.from_bytes(bytes_data, deserializers, exclude)
# Msgpack doesn't distinguish between lists and tuples, which is
# vexing for user data. As a best guess, we *know* that within
# keys, we must have tuples. In values we just have to hope
# users don't mind getting a list instead of a tuple.
if 'user_data' not in exclude and 'user_data_keys' in msg:
user_data_keys = msgpack.loads(msg['user_data_keys'],
use_list=False, raw=False)
user_data_values = msgpack.loads(msg['user_data_values'], raw=False)
for key, value in zip(user_data_keys, user_data_values):
self.user_data[key] = value
cdef int i, start, end, has_space
if 'sentiment' not in exclude and 'sentiment' in msg:
self.sentiment = msg['sentiment']
if 'tensor' not in exclude and 'tensor' in msg:
self.tensor = msg['tensor']
start = 0
cdef const LexemeC* lex
cdef unicode orth_
text = msg['text']
attrs = msg['array_body']
for i in range(attrs.shape[0]):
end = start + attrs[i, 0]
has_space = attrs[i, 1]
orth_ = text[start:end]
lex = self.vocab.get(self.mem, orth_)
self.push_back(lex, has_space)
start = end + has_space
self.from_array(msg['array_head'][2:], attrs[:, 2:])
return self
def extend_tensor(self, tensor):
'''Concatenate a new tensor onto the doc.tensor object.
The doc.tensor attribute holds dense feature vectors
computed by the models in the pipeline. Let's say a
document with 30 words has a tensor with 128 dimensions
per word. doc.tensor.shape will be (30, 128). After
calling doc.extend_tensor with an array of shape (30, 64),
doc.tensor == (30, 192).
'''
xp = get_array_module(self.tensor)
if self.tensor.size == 0:
self.tensor.resize(tensor.shape, refcheck=False)
copy_array(self.tensor, tensor)
else:
self.tensor = xp.hstack((self.tensor, tensor))
def retokenize(self):
'''Context manager to handle retokenization of the Doc.
Modifications to the Doc's tokenization are stored, and then
made all at once when the context manager exits. This is
much more efficient, and less error-prone.
All views of the Doc (Span and Token) created before the
retokenization are invalidated, although they may accidentally
continue to work.
'''
return Retokenizer(self)
def _bulk_merge(self, spans, attributes):
"""Retokenize the document, such that the spans given as arguments
are merged into single tokens. The spans need to be in document
order, and no span intersection is allowed.
spans (Span[]): Spans to merge, in document order, with all span
intersections empty. Cannot be emty.
attributes (Dictionary[]): Attributes to assign to the merged tokens. By default,
must be the same lenghth as spans, emty dictionaries are allowed.
attributes are inherited from the syntactic root of the span.
RETURNS (Token): The first newly merged token.
"""
cdef unicode tag, lemma, ent_type
assert len(attributes) == len(spans), "attribute length should be equal to span length" + str(len(attributes)) +\
str(len(spans))
with self.retokenize() as retokenizer:
for i, span in enumerate(spans):
fix_attributes(self, attributes[i])
remove_label_if_necessary(attributes[i])
retokenizer.merge(span, attributes[i])
def merge(self, int start_idx, int end_idx, *args, **attributes):
"""Retokenize the document, such that the span at
`doc.text[start_idx : end_idx]` is merged into a single token. If
`start_idx` and `end_idx `do not mark start and end token boundaries,
the document remains unchanged.
start_idx (int): Character index of the start of the slice to merge.
end_idx (int): Character index after the end of the slice to merge.
**attributes: Attributes to assign to the merged token. By default,
attributes are inherited from the syntactic root of the span.
RETURNS (Token): The newly merged token, or `None` if the start and end
indices did not fall at token boundaries.
"""
cdef unicode tag, lemma, ent_type
if len(args) == 3:
deprecation_warning(Warnings.W003)
tag, lemma, ent_type = args
attributes[TAG] = tag
attributes[LEMMA] = lemma
attributes[ENT_TYPE] = ent_type
elif not args:
fix_attributes(self, attributes)
elif args:
raise ValueError(Errors.E034.format(n_args=len(args),
args=repr(args),
kwargs=repr(attributes)))
remove_label_if_necessary(attributes)
attributes = intify_attrs(attributes, strings_map=self.vocab.strings)
cdef int start = token_by_start(self.c, self.length, start_idx)
if start == -1:
return None
cdef int end = token_by_end(self.c, self.length, end_idx)
if end == -1:
return None
# Currently we have the token index, we want the range-end index
end += 1
with self.retokenize() as retokenizer:
retokenizer.merge(self[start:end], attrs=attributes)
return self[start]
def print_tree(self, light=False, flat=False):
"""Returns the parse trees in JSON (dict) format.
light (bool): Don't include lemmas or entities.
flat (bool): Don't include arcs or modifiers.
RETURNS (dict): Parse tree as dict.
EXAMPLE:
>>> doc = nlp('Bob brought Alice the pizza. Alice ate the pizza.')
>>> trees = doc.print_tree()
>>> trees[1]
{'modifiers': [
{'modifiers': [], 'NE': 'PERSON', 'word': 'Alice',
'arc': 'nsubj', 'POS_coarse': 'PROPN', 'POS_fine': 'NNP',
'lemma': 'Alice'},
{'modifiers': [
{'modifiers': [], 'NE': '', 'word': 'the', 'arc': 'det',
'POS_coarse': 'DET', 'POS_fine': 'DT', 'lemma': 'the'}],
'NE': '', 'word': 'pizza', 'arc': 'dobj', 'POS_coarse': 'NOUN',
'POS_fine': 'NN', 'lemma': 'pizza'},
{'modifiers': [], 'NE': '', 'word': '.', 'arc': 'punct',
'POS_coarse': 'PUNCT', 'POS_fine': '.', 'lemma': '.'}],
'NE': '', 'word': 'ate', 'arc': 'ROOT', 'POS_coarse': 'VERB',
'POS_fine': 'VBD', 'lemma': 'eat'}
"""
return parse_tree(self, light=light, flat=flat)
cdef int token_by_start(const TokenC* tokens, int length, int start_char) except -2:
cdef int i
for i in range(length):
if tokens[i].idx == start_char:
return i
else:
return -1
cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2:
cdef int i
for i in range(length):
if tokens[i].idx + tokens[i].lex.length == end_char:
return i
else:
return -1
cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
cdef TokenC* head
cdef TokenC* child
cdef int i
# Set number of left/right children to 0. We'll increment it in the loops.
for i in range(length):
tokens[i].l_kids = 0
tokens[i].r_kids = 0
tokens[i].l_edge = i
tokens[i].r_edge = i
# Twice, for non-projectivity
for loop_count in range(2):
# Set left edges
for i in range(length):
child = &tokens[i]
head = &tokens[i + child.head]
if child < head and loop_count == 0:
head.l_kids += 1
if child.l_edge < head.l_edge:
head.l_edge = child.l_edge
if child.r_edge > head.r_edge:
head.r_edge = child.r_edge
# Set right edges --- same as above, but iterate in reverse
for i in range(length-1, -1, -1):
child = &tokens[i]
head = &tokens[i + child.head]
if child > head and loop_count == 0:
head.r_kids += 1
if child.r_edge > head.r_edge:
head.r_edge = child.r_edge
if child.l_edge < head.l_edge:
head.l_edge = child.l_edge
# Set sentence starts
for i in range(length):
if tokens[i].head == 0 and tokens[i].dep != 0:
tokens[tokens[i].l_edge].sent_start = True
def pickle_doc(doc):
bytes_data = doc.to_bytes(vocab=False, user_data=False)
hooks_and_data = (doc.user_data, doc.user_hooks, doc.user_span_hooks,
doc.user_token_hooks)
return (unpickle_doc, (doc.vocab, dill.dumps(hooks_and_data), bytes_data))
def unpickle_doc(vocab, hooks_and_data, bytes_data):
user_data, doc_hooks, span_hooks, token_hooks = dill.loads(hooks_and_data)
doc = Doc(vocab, user_data=user_data).from_bytes(bytes_data,
exclude='user_data')
doc.user_hooks.update(doc_hooks)
doc.user_span_hooks.update(span_hooks)
doc.user_token_hooks.update(token_hooks)
return doc
copy_reg.pickle(Doc, pickle_doc, unpickle_doc)
def remove_label_if_necessary(attributes):
# More deprecated attribute handling =/
if 'label' in attributes:
attributes['ent_type'] = attributes.pop('label')
def fix_attributes(doc, attributes):
if 'label' in attributes and 'ent_type' not in attributes:
if isinstance(attributes['label'], int):
attributes[ENT_TYPE] = attributes['label']
else:
attributes[ENT_TYPE] = doc.vocab.strings[attributes['label']]
if 'ent_type' in attributes:
attributes[ENT_TYPE] = attributes['ent_type']
def get_entity_info(ent_info):
if isinstance(ent_info, Span):
ent_type = ent_info.label
start = ent_info.start
end = ent_info.end
elif len(ent_info) == 3:
ent_type, start, end = ent_info
else:
ent_id, ent_type, start, end = ent_info
return ent_type, start, end