spaCy/docs/source/reference/using/document.rst

==============
The Doc Object
==============


.. py:class:: spacy.tokens.doc.Doc

  .. py:method:: __init__(self, Vocab vocab, orths_and_spaces=None)

    :param Vocab vocab: A vocabulary object.

    :param list orths_and_spaces=None: Defaults to None.

  .. py:method:: __getitem__(self, int i)

    :returns: Token

  .. py:method:: __getitem__(self, slice start_colon_end)

    :returns: Span

  .. py:method:: __iter__(self)

    Iterate over tokens

    .. code::

      >>> tokens = nlp(u'Zero one two three four five six')
      >>> tokens[0].orth_
      u'Zero'
      >>> tokens[-1].orth_
      u'six'

  .. py:method:: __len__(self)

    Number of tokens

  .. py:attribute:: sents

    Iterate over sentences in the document.

    :returns generator: Sentences

  .. py:attribute:: ents

    Iterate over named entities in the document.

    :returns tuple: Named Entities

  .. py:attribute:: noun_chunks

    :returns generator:

  .. py:method:: to_array(self, list attr_ids)

    Given a list of M attribute IDs, export the tokens to a numpy ndarray
    of shape N*M, where N is the length of the sentence.

    :param list[int] attr_ids: A list of attribute ID ints.

    :returns feat_array:
      A feature matrix, with one row per word, and one column per attribute
      indicated in the input attr_ids.

  .. py:method:: count_by(self, attr_id)

    Produce a dict of {attribute (int): count (ints)} frequencies, keyed
    by the values of the given attribute ID.

    .. code::

      >>> from spacy.en import English, attrs
      >>> nlp = English()
      >>> tokens = nlp(u'apple apple orange banana')
      >>> tokens.count_by(attrs.ORTH)
      {12800L: 1, 11880L: 2, 7561L: 1}
      >>> tokens.to_array([attrs.ORTH])
      array([[11880],
            [11880],
            [ 7561],
            [12800]])

  .. py:method:: from_array(self, attrs, array)

  .. py:method:: to_bytes(self)

  .. py:method:: from_bytes(self)

  .. py:method:: read_bytes(self)

  .. py:method:: merge(self, int start_idx, int end_idx, unicode tag, unicode lemma, unicode ent_type)

    Merge a multi-word expression into a single token.  Currently
    experimental; API is likely to change.