* Work on reorganization of docs

Matthew Honnibal 2015-08-08 19:14:32 +02:00
parent 63f86efa8b
commit 67979a8008
7 changed files with 411 additions and 227 deletions

View File

@ -54,11 +54,12 @@ and a small usage snippet.
.. toctree::
    :maxdepth: 4

    loading.rst
    processing.rst
    using/document.rst
    using/span.rst
    using/token.rst
    using/lexeme.rst
    lookup.rst

.. _English: processing.html

View File

@ -1,27 +1,6 @@
=================
Loading Resources
=================
99\% of the time, you will load spaCy's resources using a language pipeline class,
e.g. `spacy.en.English`. The pipeline class reads the data from disk, from a
specified directory. By default, spaCy installs data into each language's
package directory, and loads it from there.
Usually, this is all you will need:
>>> from spacy.en import English
>>> nlp = English()
If you need to replace some of the components, you may want to just make your
own pipeline class --- the English class itself does almost no work; it just
applies the modules in order. You can also provide a function or class that
produces a tokenizer, tagger, parser or entity recognizer to :code:`English.__init__`,
to customize the pipeline:
>>> from spacy.en import English
>>> from my_module import MyTagger
>>> nlp = English(Tagger=MyTagger)
In more detail:

.. code::
@ -44,12 +23,12 @@ In more detail:
:code:`Tokenizer`
:code:`(Vocab vocab, unicode data_dir)(unicode) --> Doc`

A class/function that creates the tokenizer.

:code:`Tagger` / :code:`Parser` / :code:`Entity`
:code:`(Vocab vocab, unicode data_dir)(Doc) --> None`

A class/function that creates the part-of-speech tagger /
syntactic dependency parser / named entity recogniser.
May be None or False, to disable the component.
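For illustration, a hedged sketch of a component that follows this calling convention (:code:`MyTagger` is a hypothetical stand-in, not a class shipped with spaCy):

>>> from spacy.en import English
>>> class MyTagger(object):
...     '''Hypothetical component: built with (vocab, data_dir), applied to a Doc.'''
...     def __init__(self, vocab, data_dir):
...         self.vocab = vocab            # the shared Vocab instance
...     def __call__(self, doc):
...         pass                          # annotate doc in place; must return None
>>> nlp = English(Tagger=MyTagger)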

View File

@ -17,33 +17,95 @@ up in the vocabulary directly:
.. py:class:: vocab.Vocab(self, data_dir=None, lex_props_getter=None)

.. py:method:: __len__(self)

:returns: number of words in the vocabulary
:rtype: int

.. py:method:: __getitem__(self, key_int)

:param int key_int:
Integer ID

:returns: A Lexeme object

.. py:method:: __getitem__(self, key_str)

:param unicode key_str:
A string in the vocabulary

:rtype: Lexeme

.. py:method:: __setitem__(self, orth_str, props)

:param unicode orth_str:
The orth key

:param dict props:
A props dictionary

:returns: None

.. py:method:: dump(self, loc)

:param unicode loc:
Path where the vocabulary should be saved

.. py:method:: load_lexemes(self, loc)

:param unicode loc:
Path to load the lexemes.bin file from

.. py:method:: load_vectors(self, loc)

:param unicode loc:
Path to load the vectors.bin from
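For orientation, a brief usage sketch (assuming the vocabulary is reached through a loaded pipeline, as in the examples elsewhere in these docs; exact values depend on the installed data):

>>> from spacy.en import English
>>> nlp = English()
>>> apple = nlp.vocab[u'apple']           # unicode key --> Lexeme
>>> same_apple = nlp.vocab[apple.orth]    # integer ID --> the same Lexeme
>>> n_words = len(nlp.vocab)              # number of words in the vocabulary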
.. py:class:: strings.StringStore(self)

.. py:method:: __len__(self)

:returns:
Number of strings in the string-store

.. py:method:: __getitem__(self, key_int)

:param int key_int: An integer key

:returns:
The string that the integer key maps to
:rtype: unicode

.. py:method:: __getitem__(self, key_unicode)

:param unicode key_unicode:
A key, as a unicode string

:returns:
The integer ID of the string.
:rtype: int

.. py:method:: __getitem__(self, key_utf8_bytes)

:param bytes key_utf8_bytes:
A key, as a UTF-8 encoded byte-string

:returns:
The integer ID of the string.
:rtype: int

.. py:method:: dump(self, loc)

:param loc:
File path to save the strings.txt to.

.. py:method:: load(self, loc)

:param loc:
File path to load the strings.txt from.
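A short sketch of the two lookup directions (assuming the store is reached via the vocab, as is conventional; the integer IDs are illustrative):

>>> from spacy.en import English
>>> nlp = English()
>>> strings = nlp.vocab.strings
>>> apple_id = strings[u'apple']          # unicode key --> integer ID
>>> strings[apple_id]                     # integer key --> unicode string
u'apple'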

View File

@ -1,33 +1,76 @@
================
spacy.en.English
================
99\% of the time, you will load spaCy's resources using a language pipeline class,
e.g. `spacy.en.English`. The pipeline class reads the data from disk, from a
specified directory. By default, spaCy installs data into each language's
package directory, and loads it from there.
Usually, this is all you will need:
>>> from spacy.en import English
>>> nlp = English()
If you need to replace some of the components, you may want to just make your
own pipeline class --- the English class itself does almost no work; it just
applies the modules in order. You can also provide a function or class that
produces a tokenizer, tagger, parser or entity recognizer to :code:`English.__init__`,
to customize the pipeline:
>>> from spacy.en import English
>>> from my_module import MyTagger
>>> nlp = English(Tagger=MyTagger)
The text processing API is very small and simple. Everything is a callable object,
and you will almost always apply the pipeline all at once.

.. py:class:: spacy.en.English

.. py:method:: __init__(self, data_dir=..., Tokenizer=..., Tagger=..., Parser=..., Entity=..., Matcher=..., Packer=None, load_vectors=True)

:param unicode data_dir:
The data directory. May be None, to disable any data loading (including
the vocabulary).

:param Tokenizer:
A class/function that creates the tokenizer.

:param Tagger:
A class/function that creates the part-of-speech tagger.

:param Parser:
A class/function that creates the dependency parser.

:param Entity:
A class/function that creates the named entity recogniser.

:param bool load_vectors:
A boolean value to control whether the word vectors are loaded.

.. py:method:: __call__(text, tag=True, parse=True, entity=True) --> Doc

:param unicode text:
The text to be processed. No pre-processing needs to be applied, and any
length of text can be submitted. Usually you will submit a whole document.
Text may be zero-length. An exception is raised if byte strings are supplied.

:param bool tag:
Whether to apply the part-of-speech tagger. Required for parsing and entity
recognition.

:param bool parse:
Whether to apply the syntactic dependency parser.

:param bool entity:
Whether to apply the named entity recognizer.

:return: A document
:rtype: :py:class:`spacy.tokens.Doc`

:Example:

>>> from spacy.en import English
>>> nlp = English()
@ -44,24 +87,3 @@ entity (bool)
TypeError: Argument 'string' has incorrect type (expected unicode, got str)
>>> doc = nlp(b'Some text'.decode('utf8')) # Encode to unicode first.
>>>
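The boolean flags above can be used to run only part of the pipeline. A hedged sketch (tag values depend on the loaded models):

>>> doc = nlp(u'This is a sentence.', parse=False, entity=False)              # tag only
>>> doc[1].tag_                                                               # POS tags are set
>>> doc = nlp(u'This is a sentence.', tag=False, parse=False, entity=False)   # tokenize only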
Tokenizer
---------
.. autoclass:: spacy.tokenizer.Tokenizer
:members:
Tagger
------
.. autoclass:: spacy.en.pos.EnPosTagger
:members:
Parser and Entity Recognizer
----------------------------
.. autoclass:: spacy.syntax.parser.Parser
:members:

View File

@ -2,69 +2,93 @@
The Doc Object
==============

.. py:class:: spacy.tokens.doc.Doc

.. py:method:: __init__(self, Vocab vocab, orths_and_spaces=None)

:param Vocab vocab: A vocabulary object.

:param list orths_and_spaces: Defaults to None.

.. py:method:: __getitem__(self, int i)

:returns: Token

.. py:method:: __getitem__(self, slice start_colon_end)

:returns: Span

.. py:method:: __iter__(self)

Iterate over tokens

.. code::

>>> tokens = nlp(u'Zero one two three four five six')
>>> tokens[0].orth_
u'Zero'
>>> tokens[-1].orth_
u'six'

.. py:method:: __len__(self)

Number of tokens
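A brief sequence-access sketch (slice access is assumed to give a Span, per the signature above; outputs shown are illustrative):

>>> tokens = nlp(u'Zero one two three four five six')
>>> len(tokens)
7
>>> tokens[0].orth_, tokens[-1].orth_
(u'Zero', u'six')
>>> span = tokens[2:5]
>>> [t.orth_ for t in span]
[u'two', u'three', u'four']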
.. py:attribute:: sents

Iterate over sentences in the document.

:returns generator: Sentences

.. py:attribute:: ents

Iterate over named entities in the document.

:returns tuple: Named Entities

.. py:attribute:: noun_chunks

:returns generator:
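A hedged sketch of iterating these views (each is assumed to yield Span objects, as documented in the Span section; output depends on the models, so it is omitted):

>>> doc = nlp(u'Mr. Best flew to New York. He arrived on Saturday.')
>>> [sent.string for sent in doc.sents]              # sentence spans
>>> [(ent.label_, ent.string) for ent in doc.ents]   # named entity spans
>>> [np.string for np in doc.noun_chunks]            # base noun phrase spans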
.. py:method:: to_array(self, list attr_ids)
Given a list of M attribute IDs, export the tokens to a numpy ndarray
of shape N*M, where N is the length of the sentence.
:param list[int] attr_ids: A list of attribute ID ints.
:returns feat_array:
A feature matrix, with one row per word, and one column per attribute
indicated in the input attr_ids.
.. py:method:: count_by(self, attr_id)
Produce a dict of {attribute (int): count (ints)} frequencies, keyed
by the values of the given attribute ID.
.. code::
>>> from spacy.en import English, attrs
>>> nlp = English()
>>> tokens = nlp(u'apple apple orange banana')
>>> tokens.count_by(attrs.ORTH)
{12800L: 1, 11880L: 2, 7561L: 1}
>>> tokens.to_array([attrs.ORTH])
array([[11880],
[11880],
[ 7561],
[12800]])
.. py:method:: from_array(self, attrs, array)
.. py:method:: to_bytes(self)
.. py:method:: from_bytes(self)
.. py:method:: read_bytes(self)
.. py:method:: merge(self, int start_idx, int end_idx, unicode tag, unicode lemma, unicode ent_type)

Merge a multi-word expression into a single token. Currently
experimental; API is likely to change.

**Internals**

A Doc instance stores the annotations in a C-array of `TokenC` structs.
Each TokenC struct holds a const pointer to a LexemeC struct, which describes
a vocabulary item.
The Token objects are built lazily, from this underlying C-data.
For faster access, the underlying C data can be accessed from Cython. You
can also export the data to a numpy array, via `Doc.to_array`, if pure Python
access is required, and you need slightly better performance. However, this
is both slower and has a worse API than Cython access.

View File

@ -4,29 +4,55 @@ The Span Object
.. autoclass:: spacy.spans.Span

.. py:class:: Span

.. py:method:: __getitem__

.. py:method:: __iter__

.. py:method:: __len__

.. py:attribute:: root

Syntactic head

.. py:attribute:: lefts

Tokens that are:
1. To the left of the span;
2. Syntactic children of words within the span
i.e.
.. code::
lefts = [span.doc[i] for i in range(0, span.start) if span.doc[i].head in span]
.. py:attribute:: rights
Tokens that are:
1. To the right of the span;
2. Syntactic children of words within the span
i.e.
.. code::
rights = [span.doc[i] for i in range(span.end, len(span.doc)) if span.doc[i].head in span]
.. py:attribute:: string
.. py:attribute:: lemma / lemma\_
.. py:attribute:: label / label\_
.. py:attribute:: subtree

Lefts + [self] + Rights
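A short usage sketch (entity spans are taken from Doc.ents; labels and parses depend on the models, so outputs are omitted):

>>> tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
>>> ents = list(tokens.ents)
>>> ents[0].label_          # e.g. u'PERSON', depending on the model
>>> ents[0].string          # the span's text
>>> ents[0].root.orth_      # the token that heads the span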

View File

@ -11,115 +11,185 @@ token.orth is an integer ID, token.orth\_ is the unicode value.
The only exception is the Token.string attribute, which is (unicode)
string-typed.
.. py:class:: Token

.. py:method:: __init__(self, Vocab vocab, Doc doc, int offset)

**String Views**

.. py:attribute:: orth / orth\_

The form of the word with no string normalization or processing, as it
appears in the string, without trailing whitespace.

.. py:attribute:: lemma / lemma\_

The "base" of the word, with no inflectional suffixes, e.g. the lemma of
"developing" is "develop", the lemma of "geese" is "goose", etc. Note that
*derivational* suffixes are not stripped, e.g. the lemma of "institutions"
is "institution", not "institute". Lemmatization is performed using the
WordNet data, but extended to also cover closed-class words such as
pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his".
We assign pronouns the lemma -PRON-.

.. py:attribute:: lower / lower\_

The form of the word, but forced to lower-case, i.e. lower = word.orth\_.lower()

.. py:attribute:: norm / norm\_

The form of the word, after language-specific normalizations have been
applied.

.. py:attribute:: shape / shape\_

A transform of the word's string, to show orthographic features. The
characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d.
After these mappings, sequences of 4 or more of the same character are
truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx,
:) --> :)

.. py:attribute:: prefix / prefix\_

A length-N substring from the start of the word. Length may vary by
language; currently for English n=1, i.e. prefix = word.orth\_[:1]

.. py:attribute:: suffix / suffix\_

A length-N substring from the end of the word. Length may vary by
language; currently for English n=3, i.e. suffix = word.orth\_[-3:]
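For orientation, a hedged sketch of these string views (the values shown are illustrative and depend on the loaded English data):

>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'geese are developing')
>>> geese = tokens[0]
>>> geese.orth_, geese.lower_, geese.shape_
(u'geese', u'geese', u'xxxx')
>>> geese.lemma_                 # per the WordNet-based lemmatizer described above
u'goose'
>>> geese.prefix_, geese.suffix_
(u'g', u'ese')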
.. py:attribute:: lex_id

**Alignment and Output**

.. py:attribute:: idx

.. py:method:: __len__(self)

.. py:method:: __unicode__(self)

.. py:method:: __str__(self)

.. py:attribute:: string

The form of the word as it appears in the string, **including trailing
whitespace**. This is useful when you need to use linguistic features to
add inline mark-up to the string.
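A hedged sketch of using the alignment attributes, assuming the usual invariant that :code:`token.string` preserves trailing whitespace and that :code:`idx` is the token's character offset:

>>> doc = nlp(u'Give it back! He pleaded.')
>>> u''.join(token.string for token in doc) == u'Give it back! He pleaded.'
True
>>> doc[4].idx      # character offset of "He" in the original text
14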
.. py:method:: nbor(self, int i=1)
.. py:method:: nbor(self, int i=1)

**Distributional Features**

.. py:attribute:: repvec

A "word embedding" representation: a dense real-valued vector that supports
similarity queries between words. By default, spaCy currently loads
vectors produced by the Levy and Goldberg (2014) dependency-based word2vec
model.
.. py:attribute:: cluster
The Brown cluster ID of the word. These are often useful features for
linear models. If you're using a non-linear model, particularly
a neural net or random forest, consider using the real-valued word
representation vector, in Token.repvec, instead.
.. py:attribute:: prob
The unigram log-probability of the word, estimated from counts from a
large corpus, smoothed using Simple Good Turing estimation.
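A brief sketch (repvec requires load_vectors=True; the exact values and dimensionality depend on the model data, so outputs are omitted):

>>> tokens = nlp(u'Apples and oranges')
>>> apple = tokens[0]
>>> apple.prob           # a negative float (unigram log-probability)
>>> apple.cluster        # an integer (Brown cluster ID)
>>> apple.repvec.shape   # a numpy array holding the word vector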
**Navigating the Dependency Tree**
.. py:attribute:: pos / pos\_
A part-of-speech tag, from the Google Universal Tag Set, e.g. NOUN, VERB,
ADV. Constants for the 17 tag values are provided in spacy.parts\_of\_speech.
.. py:attribute:: tag / tag\_
A morphosyntactic tag, e.g. NN, VBZ, DT, etc. These tags are
language/corpus specific, and typically describe part-of-speech and some
amount of morphological information. For instance, in the Penn Treebank
tag set, VBZ is assigned to a present-tense singular verb.
.. py:attribute:: dep / dep\_
The type of syntactic dependency relation between the word and its
syntactic head.
.. py:attribute:: head
The Token that is the immediate syntactic head of the word. If the word is
the root of the dependency tree, the same word is returned.
.. py:attribute:: lefts
An iterator for the immediate leftward syntactic children of the word.
.. py:attribute:: rights
An iterator for the immediate rightward syntactic children of the word.
.. py:attribute:: n_lefts
The number of immediate syntactic children preceding the word in the
string.
.. py:attribute:: n_rights
The number of immediate syntactic children following the word in the
string.
.. py:attribute:: children
An iterator that yields from lefts, and then yields from rights.
.. py:attribute:: subtree
An iterator for the part of the sentence syntactically governed by the
word, including the word itself.
.. py:attribute:: left_edge
.. py:attribute:: right_edge
.. py:attribute:: conjuncts
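A hedged navigation sketch (the parse shown is what one would typically expect for this sentence; actual output depends on the parser model):

>>> tokens = nlp(u'I like green eggs and ham.')
>>> eggs = tokens[3]
>>> eggs.head.orth_                   # the word governing "eggs"
u'like'
>>> [w.orth_ for w in eggs.lefts]     # immediate leftward children
[u'green']
>>> [w.orth_ for w in eggs.subtree]   # "eggs" and everything it governs
[u'green', u'eggs', u'and', u'ham']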
**Named Entities**
.. py:attribute:: ent_type
If the token is part of an entity, its entity type
.. py:attribute:: ent_iob
The IOB (inside, outside, begin) entity recognition tag for the token
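A minimal sketch of reading the entity annotations token by token (both attributes are integer IDs in this API; output is omitted because it depends on the entity model):

>>> tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
>>> [(t.orth_, t.ent_iob, t.ent_type) for t in tokens]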
**Lexeme Flags**
.. py:method:: check_flag(self, attr_id_t flag_id)
.. py:attribute:: is_oov
.. py:attribute:: is_alpha
.. py:attribute:: is_ascii
.. py:attribute:: is_digit
.. py:attribute:: is_lower
.. py:attribute:: is_title
.. py:attribute:: is_punct
.. py:attribute:: is_space
.. py:attribute:: like_url
.. py:attribute:: like_num
.. py:attribute:: like_email
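A short sketch of the boolean flags (flag semantics are the usual ones suggested by their names; the values shown are what one would expect for this text):

>>> tokens = nlp(u'Give it back! He pleaded.')
>>> tokens[0].is_alpha, tokens[0].is_lower, tokens[0].is_title
(True, False, True)
>>> tokens[3].is_punct      # the "!" token
True
>>> tokens[0].like_num
False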