==============
The Doc Object
==============

.. autoclass:: spacy.tokens.Tokens

:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
  The Tokens class behaves as a Python sequence, supporting the usual operators,
  :code:`len()`, etc. Negative indexing is supported. Slices are supported as of
  v0.89.

  .. code::

    >>> tokens = nlp(u'Zero one two three four five six')
    >>> tokens[0].orth_
    u'Zero'
    >>> tokens[-1].orth_
    u'six'
    >>> span = tokens[0:4]
    >>> [w.orth_ for w in span]
    [u'Zero', u'one', u'two', u'three']
    >>> span.string
    u'Zero one two three'

:code:`sents`
  Iterate over the sentences in the document. Each sentence is a Span object
  (see the example below).

:code:`ents`
  Iterate over the entities in the document. Each entity is a Span object.
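
  For example, both iterators yield Span objects, so the :code:`string`
  attribute shown above works for sentences and entities alike. A minimal
  sketch; the entity spans found depend on the statistical model, so the
  output below is only illustrative:

  .. code::

    >>> tokens = nlp(u'Google bought DeepMind. The deal was announced in London.')
    >>> [sent.string.strip() for sent in tokens.sents]
    [u'Google bought DeepMind.', u'The deal was announced in London.']
    >>> [ent.string.strip() for ent in tokens.ents]
    [u'Google', u'DeepMind', u'London']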

:code:`to_array`
  Given a list of M attribute IDs, export the tokens to a numpy ndarray
  of shape N*M, where N is the length of the document.

  Arguments:
    attr_ids (list[int]): A list of attribute ID ints.

  Returns:
    feat_array (numpy.ndarray[long, ndim=2]):
      A feature matrix, with one row per word, and one column per attribute
      indicated in the input attr_ids.
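
  For example, exporting two attributes at once gives one column per
  attribute. A minimal sketch, assuming :code:`attrs.LEMMA` is available as an
  attribute ID alongside :code:`attrs.ORTH`:

  .. code::

    >>> from spacy.en import English, attrs
    >>> nlp = English()
    >>> tokens = nlp(u'apple apples')
    >>> feat_array = tokens.to_array([attrs.ORTH, attrs.LEMMA])
    >>> feat_array.shape   # one row per token, one column per attribute
    (2, 2)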

:code:`count_by`
  Produce a dict of {attribute (int): count (int)} frequencies, keyed
  by the values of the given attribute ID.

  .. code::

    >>> from spacy.en import English, attrs
    >>> nlp = English()
    >>> tokens = nlp(u'apple apple orange banana')
    >>> tokens.count_by(attrs.ORTH)
    {12800L: 1, 11880L: 2, 7561L: 1}
    >>> tokens.to_array([attrs.ORTH])
    array([[11880],
           [11880],
           [ 7561],
           [12800]])

:code:`merge`
  Merge a multi-word expression into a single token. Currently
  experimental; API is likely to change.
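
  A minimal sketch of the current usage, assuming the experimental signature
  takes character offsets into the original string followed by tag, lemma and
  entity type strings (this may differ in your version):

  .. code::

    >>> tokens = nlp(u'New York is a city.')
    >>> # Merge the characters spanning u'New York' into a single token
    >>> tokens.merge(0, len(u'New York'), u'NNP', u'New York', u'GPE')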

:code:`to_bytes()`
  Get a byte-string representation of the document, i.e. serialize.

:code:`from_bytes(self, byte_string)`
  Load data from a byte-string, i.e. deserialize.

:code:`Doc.read_bytes`
  A staticmethod, used to read bytes from a file.

  Example of serialization::

    doc1 = EN(u'This is a simple test. With a couple of sentences.')
    doc2 = EN(u'This is another test document.')

    with open('/tmp/spacy_docs.bin', 'wb') as file_:
        file_.write(doc1.to_bytes())
        file_.write(doc2.to_bytes())

    with open('/tmp/spacy_docs.bin', 'rb') as file_:
        bytes1, bytes2 = Doc.read_bytes(file_)
        r1 = Doc(EN.vocab).from_bytes(bytes1)
        r2 = Doc(EN.vocab).from_bytes(bytes2)

    assert r1.string == doc1.string
    assert r2.string == doc2.string

Internals
  A Tokens instance stores the annotations in a C-array of `TokenC` structs.
  Each `TokenC` struct holds a const pointer to a `LexemeC` struct, which
  describes a vocabulary item.

  The Token objects are built lazily, from this underlying C data.

  For faster access, the underlying C data can be accessed from Cython. You can
  also export the data to a numpy array via `Tokens.to_array` if pure Python
  access is required and you need slightly better performance.
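
  For example, a bulk export via `Tokens.to_array` lets you process the whole
  document with numpy instead of constructing a Python Token object per word.
  A minimal sketch, reusing the :code:`nlp` object and :code:`attrs.ORTH`
  attribute ID from the examples above:

  .. code::

    >>> import numpy
    >>> from spacy.en import attrs
    >>> tokens = nlp(u'apple apple orange banana')
    >>> orths = tokens.to_array([attrs.ORTH])[:, 0]
    >>> numpy.unique(orths).shape   # number of distinct word types
    (3,)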