spaCy/docs/source/api.rst

===
API
===


.. autoclass:: spacy.en.English

  +-----------+----------------------------------------+-------------+--------------------------+
  | Attribute | Type                                   | Attr API    | NoteS                    |
  +===========+========================================+=============+==========================+
  | strings   | :py:class:`strings.StringStore`        | __getitem__ | string <-> int  mapping  |
  +-----------+----------------------------------------+-------------+--------------------------+
  | vocab     | :py:class:`vocab.Vocab`                | __getitem__ | Look up Lexeme object    |
  +-----------+----------------------------------------+-------------+--------------------------+
  | tokenizer | :py:class:`tokenizer.Tokenizer`        | __call__    | Get Tokens given unicode |
  +-----------+----------------------------------------+-------------+--------------------------+
  | tagger    | :py:class:`en.pos.EnPosTagger`         | __call__    | Set POS tags on Tokens   |
  +-----------+----------------------------------------+-------------+--------------------------+
  | parser    | :py:class:`syntax.parser.GreedyParser` | __call__    | Set parse on Tokens      |
  +-----------+----------------------------------------+-------------+--------------------------+


  .. automethod:: spacy.en.English.__call__


.. autoclass:: spacy.tokens.Tokens
  :members: 
  
  +---------------+-------------+-------------+
  | Attribute     | Type        | Useful      |
  +===============+=============+=============+
  | vocab         | Vocab       | __getitem__ |
  +---------------+-------------+-------------+
  | vocab.strings | StringStore | __getitem__ |
  +---------------+-------------+-------------+


  Internals
    A Tokens instance stores the annotations in a C-array of `TokenC` structs.
    Each TokenC struct holds a const pointer to a LexemeC struct, which describes
    a vocabulary item.

    The Token objects are built lazily, from this underlying C-data.

    For faster access, the underlying C data can be accessed from Cython.  You
    can also export the data to a numpy array, via `Tokens.to_array`, if pure Python
    access is required, and you need slightly better performance.  However, this
    is both slower and has a worse API than Cython access.  

.. Once a Token object has been created, it is persisted internally in Tokens._py_tokens.


.. autoclass:: spacy.tokens.Token

  Integer IDs are provided for all string features.  The (unicode) string is
  provided by an attribute of the same name followed by an underscore, e.g.
  token.orth is an integer ID, token.orth\_ is the unicode value.

  The only exception is the Token.string attribute, which is (unicode)
  string-typed.

  +--------------------------------------------------------------------------------+
  | **Context-independent Attributes** (calculated once per entry in vocab)        |
  +-----------------+-------------+-----------+------------------------------------+
  | Attribute       | Type        | Attribute | Type                               |
  +-----------------+-------------+-----------+------------------------------------+
  | orth/orth\_     | int/unicode | __len__   | int                                |
  +-----------------+-------------+-----------+------------------------------------+
  | lower/lower\_   | int/unicode | cluster   | int                                |
  +-----------------+-------------+-----------+------------------------------------+
  | norm/norm\_     | int/unicode | prob      | float                              |
  +-----------------+-------------+-----------+------------------------------------+
  | shape/shape\_   | int/unicode | repvec    | ndarray(shape=(300,), dtype=float) |
  +-----------------+-------------+-----------+------------------------------------+
  | prefix/prefix\_ | int/unicode |           |                                    |
  +-----------------+-------------+-----------+------------------------------------+
  | suffix/suffix\_ | int/unicode |           |                                    |
  +-----------------+-------------+-----------+------------------------------------+
  | **Context-dependent Attributes** (calculated once per token in input)          |
  +-----------------+-------------+-----------+------------------------------------+
  | Attribute       | Type        | Attribute | Type                               |
  +-----------------+-------------+-----------+------------------------------------+
  | whitespace\_    | unicode     | string    | unicode                            |
  +-----------------+-------------+-----------+------------------------------------+
  | pos/pos\_       | int/unicode | dep/dep\_ | int/unicode                        |
  +-----------------+-------------+-----------+------------------------------------+
  | tag/tag\_       | int/unicode |           |                                    |
  +-----------------+-------------+-----------+------------------------------------+
  | lemma/lemma\_   | int/unicode |           |                                    |
  +-----------------+-------------+-----------+------------------------------------+

  **String Features**

  string
    The form of the word as it appears in the string, include trailing
    whitespace.  This is useful when you need to use linguistic features to
    add inline mark-up to the string.

  orth
    The form of the word with no string normalization or processing, as it
    appears in the string, without trailing whitespace.

  lemma
    The "base" of the word, with no inflectional suffixes, e.g. the lemma of
    "developing" is "develop", the lemma of "geese" is "goose", etc.  Note that
    *derivational* suffixes are not stripped, e.g. the lemma of "instutitions"
    is "institution", not "institute".  Lemmatization is performed using the
    WordNet data, but extended to also cover closed-class words such as
    pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his".
    We assign pronouns the lemma -PRON-.

  lower
    The form of the word, but forced to lower-case, i.e. lower = word.orth\_.lower()

  norm
    The form of the word, after language-specific normalizations have been
    applied.

  shape
    A transform of the word's string, to show orthographic features.  The
    characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d. 
    After these mappings, sequences of 4 or more of the same character are
    truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx,
    :) --> :)

  prefix
    A length-N substring from the start of the word.  Length may vary by
    language; currently for English n=1, i.e. prefix = word.orth\_[:1]

  suffix
    A length-N substring from the end of the word.  Length may vary by
    language; currently for English n=3, i.e. suffix = word.orth\_[-3:]

  **Distributional Features**

  prob
    The unigram log-probability of the word, estimated from counts from a
    large corpus, smoothed using Simple Good Turing estimation.

  cluster
    The Brown cluster ID of the word.  These are often useful features for
    linear models.  If you're using a non-linear model, particularly
    a neural net or random forest, consider using the real-valued word
    representation vector, in Token.repvec, instead.

  repvec
    A "word embedding" representation: a dense real-valued vector that supports
    similarity queries between words.  By default, spaCy currently loads
    vectors produced by the Levy and Goldberg (2014) dependency-based word2vec
    model.

  **Syntactic Features**

  tag
    A morphosyntactic tag, e.g. NN, VBZ, DT, etc.  These tags are
    language/corpus specific, and typically describe part-of-speech and some
    amount of morphological information.  For instance, in the Penn Treebank
    tag set, VBZ is assigned to a present-tense singular verb.

  pos
    A part-of-speech tag, from the Google Universal Tag Set, e.g. NOUN, VERB,
    ADV.  Constants for the 17 tag values are provided in spacy.parts\_of\_speech.
 
  dep
    The type of syntactic dependency relation between the word and its
    syntactic head.

  n_lefts
    The number of immediate syntactic children preceding the word in the
    string.

  n_rights
    The number of immediate syntactic children following the word in the
    string.

  **Navigating the Dependency Tree**

  head
    The Token that is the immediate syntactic head of the word.  If the word is
    the root of the dependency tree, the same word is returned.

  lefts
    An iterator for the immediate leftward syntactic children of the word.

  rights
    An iterator for the immediate rightward syntactic children of the word.
    
  children
    An iterator that yields from lefts, and then yields from rights.
 
  subtree
    An iterator for the part of the sentence syntactically governed by the
    word, including the word itself.

.. py:class:: vocab.Vocab(self, data_dir=None, lex_props_getter=None)

  .. py:method:: __len__(self) --> int
  
  .. py:method:: __getitem__(self, id: int) --> unicode
  
  .. py:method:: __getitem__(self, string: unicode) --> int
  
  .. py:method:: __setitem__(self, py_str: unicode, props: Dict[str, int[float]) --> None

  .. py:method:: dump(self, loc: unicode) --> None
  
  .. py:method:: load_lexemes(self, loc: unicode) --> None

  .. py:method:: load_vectors(self, loc: unicode) --> None

.. py:class:: strings.StringStore(self)

  .. py:method:: __len__(self) --> int

  .. py:method:: __getitem__(self, id: int) --> unicode
  
  .. py:method:: __getitem__(self, string: bytes) --> id
  
  .. py:method:: __getitem__(self, string: unicode) --> id

  .. py:method:: dump(self, loc: unicode) --> None

  .. py:method:: load(self, loc: unicode) --> None

.. py:class:: tokenizer.Tokenizer(self, Vocab vocab, rules, prefix_re, suffix_re, infix_re, pos_tags, tag_names)

  .. py:method:: tokens_from_list(self, List[unicode]) --> spacy.tokens.Tokens

  .. py:method:: __call__(self, string: unicode) --> spacy.tokens.Tokens)

  .. py:attribute:: vocab: spacy.vocab.Vocab

.. py:class:: en.pos.EnPosTagger(self, strings: spacy.strings.StringStore, data_dir: unicode)

  .. py:method:: __call__(self, tokens: spacy.tokens.Tokens)

  .. py:method:: train(self, tokens: spacy.tokens.Tokens, List[int] golds) --> int

  .. py:method:: load_morph_exceptions(self, exc: Dict[unicode, Dict])

.. py:class:: syntax.parser.GreedyParser(self, model_dir: unicode)

  .. py:method:: __call__(self, tokens: spacy.tokens.Tokens) --> None

  .. py:method:: train(self, spacy.tokens.Tokens) --> None
* Warn about crappy docs 2015-01-25 13:57:28 +03:00			`===`
			`API`
			`===`
* Work on API reference 2014-12-27 10:45:47 +03:00
* Ammend warning 2015-01-27 10:56:18 +03:00
* Work on api.rst 2015-01-27 08:47:28 +03:00			`.. autoclass:: spacy.en.English`
* Upd docs 2015-01-17 08:19:21 +03:00
* Edt API docs 2015-01-24 12:49:44 +03:00			`+-----------+----------------------------------------+-------------+--------------------------+`
			`\| Attribute \| Type \| Attr API \| NoteS \|`
			`+===========+========================================+=============+==========================+`
			\| strings \| :py:class:`strings.StringStore` \| __getitem__ \| string <-> int mapping \|
			`+-----------+----------------------------------------+-------------+--------------------------+`
			\| vocab \| :py:class:`vocab.Vocab` \| __getitem__ \| Look up Lexeme object \|
			`+-----------+----------------------------------------+-------------+--------------------------+`
			\| tokenizer \| :py:class:`tokenizer.Tokenizer` \| __call__ \| Get Tokens given unicode \|
			`+-----------+----------------------------------------+-------------+--------------------------+`
			\| tagger \| :py:class:`en.pos.EnPosTagger` \| __call__ \| Set POS tags on Tokens \|
			`+-----------+----------------------------------------+-------------+--------------------------+`
			\| parser \| :py:class:`syntax.parser.GreedyParser` \| __call__ \| Set parse on Tokens \|
			`+-----------+----------------------------------------+-------------+--------------------------+`
* Work on API reference 2014-12-27 10:45:47 +03:00
* Upd docs 2015-01-17 08:19:21 +03:00
* Work on api.rst 2015-01-27 08:47:28 +03:00			`.. automethod:: spacy.en.English.__call__`
* Upd docs 2015-01-17 08:19:21 +03:00

* Work on api.rst 2015-01-27 08:47:28 +03:00			`.. autoclass:: spacy.tokens.Tokens`
			`:members:`

* Upd docs 2015-01-17 08:19:21 +03:00			`+---------------+-------------+-------------+`
			`\| Attribute \| Type \| Useful \|`
			`+===============+=============+=============+`
			`\| vocab \| Vocab \| __getitem__ \|`
			`+---------------+-------------+-------------+`
			`\| vocab.strings \| StringStore \| __getitem__ \|`
			`+---------------+-------------+-------------+`


* Improve API docs for Token 2015-02-12 02:03:06 +03:00			`Internals`
			A Tokens instance stores the annotations in a C-array of `TokenC` structs.
			`Each TokenC struct holds a const pointer to a LexemeC struct, which describes`
			`a vocabulary item.`
* Upd docs 2015-01-17 08:19:21 +03:00
* Improve API docs for Token 2015-02-12 02:03:06 +03:00			`The Token objects are built lazily, from this underlying C-data.`
* Upd docs 2015-01-17 08:19:21 +03:00
* Improve API docs for Token 2015-02-12 02:03:06 +03:00			`For faster access, the underlying C data can be accessed from Cython. You`
			can also export the data to a numpy array, via `Tokens.to_array`, if pure Python
			`access is required, and you need slightly better performance. However, this`
			`is both slower and has a worse API than Cython access.`
* Work on api.rst 2015-01-27 08:47:28 +03:00
			`.. Once a Token object has been created, it is persisted internally in Tokens._py_tokens.`


			`.. autoclass:: spacy.tokens.Token`
* Improve API docs for Token 2015-02-12 02:03:06 +03:00
			`Integer IDs are provided for all string features. The (unicode) string is`
			`provided by an attribute of the same name followed by an underscore, e.g.`
			`token.orth is an integer ID, token.orth\_ is the unicode value.`

			`The only exception is the Token.string attribute, which is (unicode)`
			`string-typed.`
* Improve API docs 2015-01-27 09:56:36 +03:00
			`+--------------------------------------------------------------------------------+`
* Improve API docs for Token 2015-02-12 02:03:06 +03:00			`\| Context-independent Attributes (calculated once per entry in vocab) \|`
* Improve API docs 2015-01-27 09:56:36 +03:00			`+-----------------+-------------+-----------+------------------------------------+`
			`\| Attribute \| Type \| Attribute \| Type \|`
* Improve API docs for Token 2015-02-12 02:03:06 +03:00			`+-----------------+-------------+-----------+------------------------------------+`
* Improve API docs 2015-01-27 09:56:36 +03:00			`\| orth/orth\_ \| int/unicode \| __len__ \| int \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| lower/lower\_ \| int/unicode \| cluster \| int \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| norm/norm\_ \| int/unicode \| prob \| float \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| shape/shape\_ \| int/unicode \| repvec \| ndarray(shape=(300,), dtype=float) \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| prefix/prefix\_ \| int/unicode \| \| \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| suffix/suffix\_ \| int/unicode \| \| \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| Context-dependent Attributes (calculated once per token in input) \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| Attribute \| Type \| Attribute \| Type \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| whitespace\_ \| unicode \| string \| unicode \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| pos/pos\_ \| int/unicode \| dep/dep\_ \| int/unicode \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| tag/tag\_ \| int/unicode \| \| \|`
			`+-----------------+-------------+-----------+------------------------------------+`
			`\| lemma/lemma\_ \| int/unicode \| \| \|`
			`+-----------------+-------------+-----------+------------------------------------+`
* Improve API docs for Token 2015-02-12 02:03:06 +03:00
			`String Features`

			`string`
			`The form of the word as it appears in the string, include trailing`
			`whitespace. This is useful when you need to use linguistic features to`
			`add inline mark-up to the string.`

			`orth`
			`The form of the word with no string normalization or processing, as it`
			`appears in the string, without trailing whitespace.`

			`lemma`
			`The "base" of the word, with no inflectional suffixes, e.g. the lemma of`
			`"developing" is "develop", the lemma of "geese" is "goose", etc. Note that`
			`derivational suffixes are not stripped, e.g. the lemma of "instutitions"`
			`is "institution", not "institute". Lemmatization is performed using the`
			`WordNet data, but extended to also cover closed-class words such as`
			`pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his".`
			`We assign pronouns the lemma -PRON-.`

			`lower`
			`The form of the word, but forced to lower-case, i.e. lower = word.orth\_.lower()`

			`norm`
			`The form of the word, after language-specific normalizations have been`
			`applied.`

			`shape`
			`A transform of the word's string, to show orthographic features. The`
			`characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d.`
			`After these mappings, sequences of 4 or more of the same character are`
			`truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx,`
			`:) --> :)`

			`prefix`
			`A length-N substring from the start of the word. Length may vary by`
			`language; currently for English n=1, i.e. prefix = word.orth\_[:1]`

			`suffix`
			`A length-N substring from the end of the word. Length may vary by`
			`language; currently for English n=3, i.e. suffix = word.orth\_[-3:]`

			`Distributional Features`

			`prob`
			`The unigram log-probability of the word, estimated from counts from a`
			`large corpus, smoothed using Simple Good Turing estimation.`

			`cluster`
			`The Brown cluster ID of the word. These are often useful features for`
			`linear models. If you're using a non-linear model, particularly`
			`a neural net or random forest, consider using the real-valued word`
			`representation vector, in Token.repvec, instead.`

			`repvec`
			`A "word embedding" representation: a dense real-valued vector that supports`
			`similarity queries between words. By default, spaCy currently loads`
			`vectors produced by the Levy and Goldberg (2014) dependency-based word2vec`
			`model.`

			`Syntactic Features`

			`tag`
			`A morphosyntactic tag, e.g. NN, VBZ, DT, etc. These tags are`
			`language/corpus specific, and typically describe part-of-speech and some`
			`amount of morphological information. For instance, in the Penn Treebank`
			`tag set, VBZ is assigned to a present-tense singular verb.`

			`pos`
			`A part-of-speech tag, from the Google Universal Tag Set, e.g. NOUN, VERB,`
			`ADV. Constants for the 17 tag values are provided in spacy.parts\_of\_speech.`

			`dep`
			`The type of syntactic dependency relation between the word and its`
			`syntactic head.`

			`n_lefts`
			`The number of immediate syntactic children preceding the word in the`
			`string.`

			`n_rights`
			`The number of immediate syntactic children following the word in the`
			`string.`

			`Navigating the Dependency Tree`

			`head`
			`The Token that is the immediate syntactic head of the word. If the word is`
			`the root of the dependency tree, the same word is returned.`

			`lefts`
			`An iterator for the immediate leftward syntactic children of the word.`

			`rights`
			`An iterator for the immediate rightward syntactic children of the word.`

* Upd docs, improving description of parse tree navigation 2015-03-03 12:34:07 +03:00			`children`
			`An iterator that yields from lefts, and then yields from rights.`

			`subtree`
			`An iterator for the part of the sentence syntactically governed by the`
			`word, including the word itself.`
* Upd docs 2015-01-17 08:19:21 +03:00
* Fix links in API docs 2015-01-26 17:08:56 +03:00			`.. py:class:: vocab.Vocab(self, data_dir=None, lex_props_getter=None)`
* Upd docs 2015-01-17 08:19:21 +03:00
			`.. py:method:: __len__(self) --> int`

			`.. py:method:: __getitem__(self, id: int) --> unicode`

			`.. py:method:: __getitem__(self, string: unicode) --> int`

* Edt API docs 2015-01-24 12:49:44 +03:00			`.. py:method:: __setitem__(self, py_str: unicode, props: Dict[str, int[float]) --> None`
* Upd docs 2015-01-17 08:19:21 +03:00
			`.. py:method:: dump(self, loc: unicode) --> None`

			`.. py:method:: load_lexemes(self, loc: unicode) --> None`

			`.. py:method:: load_vectors(self, loc: unicode) --> None`

* Fix links in API docs 2015-01-26 17:08:56 +03:00			`.. py:class:: strings.StringStore(self)`
* Upd docs 2015-01-17 08:19:21 +03:00
			`.. py:method:: __len__(self) --> int`

			`.. py:method:: __getitem__(self, id: int) --> unicode`

* Edt API docs 2015-01-24 12:49:44 +03:00			`.. py:method:: __getitem__(self, string: bytes) --> id`
* Upd docs 2015-01-17 08:19:21 +03:00
			`.. py:method:: __getitem__(self, string: unicode) --> id`

			`.. py:method:: dump(self, loc: unicode) --> None`

			`.. py:method:: load(self, loc: unicode) --> None`

* Fix links in API docs 2015-01-26 17:08:56 +03:00			`.. py:class:: tokenizer.Tokenizer(self, Vocab vocab, rules, prefix_re, suffix_re, infix_re, pos_tags, tag_names)`
* Upd docs 2015-01-17 08:19:21 +03:00
			`.. py:method:: tokens_from_list(self, List[unicode]) --> spacy.tokens.Tokens`

			`.. py:method:: __call__(self, string: unicode) --> spacy.tokens.Tokens)`

			`.. py:attribute:: vocab: spacy.vocab.Vocab`

* Fix links in API docs 2015-01-26 17:08:56 +03:00			`.. py:class:: en.pos.EnPosTagger(self, strings: spacy.strings.StringStore, data_dir: unicode)`
* Upd docs 2015-01-17 08:19:21 +03:00
			`.. py:method:: __call__(self, tokens: spacy.tokens.Tokens)`

			`.. py:method:: train(self, tokens: spacy.tokens.Tokens, List[int] golds) --> int`

			`.. py:method:: load_morph_exceptions(self, exc: Dict[unicode, Dict])`

* Fix links in API docs 2015-01-26 17:08:56 +03:00			`.. py:class:: syntax.parser.GreedyParser(self, model_dir: unicode)`
* Upd docs 2015-01-17 08:19:21 +03:00
			`.. py:method:: __call__(self, tokens: spacy.tokens.Tokens) --> None`

			`.. py:method:: train(self, spacy.tokens.Tokens) --> None`