* Add using/ docs.

2026-03-03 11:21:29 +03:00 · 2015-07-08 17:59:07 +02:00 · 2015-07-08 17:59:07 +02:00 · 2f7110e852
commit 2f7110e852
parent 4b07c17d6f
4 changed files with 237 additions and 0 deletions
--- a/docs/source/reference/using/document.rst
+++ b/docs/source/reference/using/document.rst
@@ -0,0 +1,70 @@
+========
+Document
+========
+
+.. autoclass:: spacy.tokens.Tokens
+
+:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
+  The Tokens class behaves as a Python sequence, supporting the usual operators,
+  len(), etc.  Negative indexing is supported; slicing is not yet supported.
+
+  .. code::
+
+    >>> tokens = nlp(u'Zero one two three four five six')
+    >>> tokens[0].orth_
+    u'Zero'
+    >>> tokens[-1].orth_
+    u'six'
+    >>> tokens[0:4]
+    Error
+
+:code:`sents`
+  Iterate over sentences in the document.
+
+:code:`ents`
+  Iterate over entities in the document.
+
+:code:`to_array`
+  Given a list of M attribute IDs, export the tokens to a numpy ndarray
+  of shape (N, M), where N is the number of tokens.
+
+    Arguments:
+        attr_ids (list[int]): A list of attribute ID ints.
+
+    Returns:
+        feat_array (numpy.ndarray[long, ndim=2]):
+        A feature matrix, with one row per word, and one column per attribute
+        indicated in the input attr_ids.
+ 
+:code:`count_by`
+  Produce a dict of {attribute value (int): count (int)} frequencies, keyed
+  by the values that the given attribute ID takes across the tokens.
+
+    >>> from spacy.en import English, attrs
+    >>> nlp = English()
+    >>> tokens = nlp(u'apple apple orange banana')
+    >>> tokens.count_by(attrs.ORTH)
+    {12800L: 1, 11880L: 2, 7561L: 1}
+    >>> tokens.to_array([attrs.ORTH])
+    array([[11880],
+          [11880],
+          [ 7561],
+          [12800]])
+
+:code:`merge`
+  Merge a multi-word expression into a single token.  Currently
+  experimental; API is likely to change.
+
+
+
+Internals
+  A Tokens instance stores the annotations in a C-array of `TokenC` structs.
+  Each TokenC struct holds a const pointer to a LexemeC struct, which describes
+  a vocabulary item.
+
+  The Token objects are built lazily, from this underlying C-data.
+
+  For faster access, the underlying C data can be accessed from Cython.  If
+  you must stay in pure Python but need somewhat better performance, you can
+  export the data to a numpy array via `Tokens.to_array`.  However, this is
+  slower than Cython access and has a less convenient API.
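
Reviewer's sketch (not part of the patch): the `count_by` and `to_array` entries in document.rst describe a frequency dict keyed by attribute values and an N-by-M matrix with one row per token. The pure-Python stand-in below only illustrates those return shapes. It does not call spaCy; the `doc` list, the `ORTH` and `LOWER` constants, and both helper functions are hypothetical, and the integer IDs are copied from the doctest in the patch.

```python
from collections import Counter

# Hypothetical attribute IDs; the real constants live in spacy's attrs module.
ORTH, LOWER = 0, 1

# Hypothetical stand-in for the parsed text u'apple apple orange banana':
# one mapping of attribute ID -> integer value per token.
doc = [
    {ORTH: 11880, LOWER: 11880},  # apple
    {ORTH: 11880, LOWER: 11880},  # apple
    {ORTH: 7561, LOWER: 7561},    # orange
    {ORTH: 12800, LOWER: 12800},  # banana
]


def count_by(tokens, attr_id):
    """Return {attribute value: number of tokens with that value}."""
    return dict(Counter(tok[attr_id] for tok in tokens))


def to_array(tokens, attr_ids):
    """Return N rows (one per token) of M columns (one per attribute ID)."""
    return [[tok[attr_id] for attr_id in attr_ids] for tok in tokens]


counts = count_by(doc, ORTH)
matrix = to_array(doc, [ORTH, LOWER])
```

Here `counts` maps each of the three distinct ORTH IDs to its frequency, and `matrix` has four rows and two columns. A nested list stands in for the numpy ndarray that the documented method returns.
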
--- a/docs/source/reference/using/index.rst
+++ b/docs/source/reference/using/index.rst
@@ -0,0 +1,11 @@
+==================
+Annotation Objects
+==================
+
+
+.. toctree::
+    :maxdepth: 3
+
+    document.rst
+    token.rst
+    span.rst
--- a/docs/source/reference/using/span.rst
+++ b/docs/source/reference/using/span.rst
@@ -0,0 +1,32 @@
+====
+Span
+====
+
+.. autoclass:: spacy.spans.Span
+
+:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
+  Sequence API
+
+:code:`head`
+  Syntactic head, or None
+
+:code:`lefts`
+  Tokens to the left of the span
+
+:code:`rights`
+  Tokens to the right of the span
+
+:code:`orth` / :code:`orth_`
+  Orth string
+
+:code:`lemma` / :code:`lemma_`
+  Lemma string
+
+:code:`string`
+  String
+
+:code:`label` / :code:`label_`
+  Label
+
+:code:`subtree`
+  Lefts + [self] + Rights
--- a/docs/source/reference/using/token.rst
+++ b/docs/source/reference/using/token.rst
@@ -0,0 +1,124 @@
+====================
+spacy.tokens.Token
+====================
+
+A Token represents a single word, punctuation mark or significant whitespace symbol.
+
+Integer IDs are provided for all string features.  The (unicode) string is
+provided by an attribute of the same name followed by an underscore, e.g.
+token.orth is an integer ID, token.orth\_ is the unicode value.
+
+The only exception is the Token.string attribute, which is (unicode)
+string-typed.
+
+**String Features**
+
+:code:`string`
+  The form of the word as it appears in the string, including trailing
+  whitespace.  This is useful when you need to use linguistic features to
+  add inline mark-up to the string.
+
+:code:`orth` / :code:`orth_`
+  The form of the word with no string normalization or processing, as it
+  appears in the string, without trailing whitespace.
+
+:code:`lemma` / :code:`lemma_`
+  The "base" of the word, with no inflectional suffixes, e.g. the lemma of
+  "developing" is "develop", the lemma of "geese" is "goose", etc.  Note that
+  *derivational* suffixes are not stripped, e.g. the lemma of "institutions"
+  is "institution", not "institute".  Lemmatization is performed using the
+  WordNet data, but extended to also cover closed-class words such as
+  pronouns.  By default, the WN lemmatizer returns "hi" as the lemma of "his".
+  We assign pronouns the lemma -PRON-.
+
+:code:`lower` / :code:`lower_`
+  The form of the word, but forced to lower-case, i.e. lower = word.orth\_.lower()
+
+:code:`norm` / :code:`norm_`
+  The form of the word, after language-specific normalizations have been
+  applied.
+
+:code:`shape` / :code:`shape_`
+  A transform of the word's string, to show orthographic features.  The
+  characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d.
+  After these mappings, sequences of 4 or more of the same character are
+  truncated to length 4.  Examples: C3Po --> XdXx, favorite --> xxxx,
+  :) --> :)
+
+:code:`prefix` / :code:`prefix_`
+  A length-N substring from the start of the word.  Length may vary by
+  language; currently for English N=1, i.e. prefix = word.orth\_[:1]
+
+:code:`suffix` / :code:`suffix_`
+  A length-N substring from the end of the word.  Length may vary by
+  language; currently for English N=3, i.e. suffix = word.orth\_[-3:]
+
+**Distributional Features**
+
+:code:`prob`
+  The unigram log-probability of the word, estimated from counts from a
+  large corpus, smoothed using Simple Good Turing estimation.
+
+:code:`cluster`
+  The Brown cluster ID of the word.  These are often useful features for
+  linear models.  If you're using a non-linear model, particularly
+  a neural net or random forest, consider using the real-valued word
+  representation vector, in Token.repvec, instead.
+
+:code:`repvec`
+  A "word embedding" representation: a dense real-valued vector that supports
+  similarity queries between words.  By default, spaCy currently loads
+  vectors produced by the Levy and Goldberg (2014) dependency-based word2vec
+  model.
+
+**Syntactic Features**
+
+:code:`tag`
+  A morphosyntactic tag, e.g. NN, VBZ, DT, etc.  These tags are
+  language/corpus specific, and typically describe part-of-speech and some
+  amount of morphological information.  For instance, in the Penn Treebank
+  tag set, VBZ is assigned to a third-person singular present-tense verb.
+
+:code:`pos`
+  A part-of-speech tag, from the Google Universal Tag Set, e.g. NOUN, VERB,
+  ADV.  Constants for the 17 tag values are provided in spacy.parts\_of\_speech.
+
+:code:`dep`
+  The type of syntactic dependency relation between the word and its
+  syntactic head.
+
+:code:`n_lefts`
+  The number of immediate syntactic children preceding the word in the
+  string.
+
+:code:`n_rights`
+  The number of immediate syntactic children following the word in the
+  string.
+
+**Navigating the Dependency Tree**
+
+:code:`head`
+  The Token that is the immediate syntactic head of the word.  If the word is
+  the root of the dependency tree, the same word is returned.
+
+:code:`lefts`
+  An iterator for the immediate leftward syntactic children of the word.
+
+:code:`rights`
+  An iterator for the immediate rightward syntactic children of the word.
+
+:code:`children`
+  An iterator that yields from lefts, and then yields from rights.
+
+:code:`subtree`
+  An iterator for the part of the sentence syntactically governed by the
+  word, including the word itself.
+
+
+**Named Entities**
+
+:code:`ent_type`
+  The entity type of the token, if it is part of an entity.
+
+:code:`ent_iob`
+  The IOB (inside, outside, begin) entity recognition tag for the token.
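
Reviewer's sketch (not part of the patch): the `shape` entry in token.rst states a mapping rule (a-z to x, A-Z to X, 0-9 to d) followed by truncation of runs of four or more identical characters to length four. A minimal pure-Python reading of that description is given below. The function name `word_shape` is hypothetical, and this is not claimed to be spaCy's own implementation, which may differ in edge cases such as non-ASCII letters.

```python
def word_shape(text, max_run=4):
    """Map characters to x/X/d and cap runs of identical mapped characters."""
    out = []
    run_char = None  # mapped character of the current run
    run_len = 0      # length of the current run so far
    for ch in text:
        if "a" <= ch <= "z":
            mapped = "x"
        elif "A" <= ch <= "Z":
            mapped = "X"
        elif "0" <= ch <= "9":
            mapped = "d"
        else:
            mapped = ch  # punctuation and other characters pass through
        if mapped == run_char:
            run_len += 1
        else:
            run_char, run_len = mapped, 1
        # Keep at most max_run characters from each run.
        if run_len <= max_run:
            out.append(mapped)
    return "".join(out)


examples = {word: word_shape(word) for word in ("C3Po", "favorite", ":)")}
```

Applied to the three examples from the patch, this gives `XdXx` for `C3Po`, `xxxx` for `favorite`, and leaves `:)` unchanged. The run cap is applied after mapping, following the wording "after these mappings" in the entry.
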