* Work on API reference docs

This commit is contained in:
Matthew Honnibal 2015-07-08 12:34:23 +02:00
parent 99e84488da
commit b42db257b7


@@ -1,6 +1,6 @@
=========
Reference
=========

Overview
--------
@@ -31,11 +31,26 @@ e.g. `spacy.en.English`. The pipeline class reads the data from disk, from a
specified directory. By default, spaCy installs data into each language's
package directory, and loads it from there.
Usually, this is all you will need:
>>> from spacy.en import English
>>> nlp = English()
If you need to replace some of the components, you may want to just make your
own pipeline class --- the English class itself does almost no work; it just
applies the modules in order. You can also provide a function or class that
produces a tokenizer, tagger, parser or entity recognizer to :code:`English.__init__`,
to customize the pipeline:
>>> from spacy.en import English
>>> from my_module import MyTagger
>>> nlp = English(Tagger=MyTagger)
In more detail:
.. code::

  class English(object):
      ...
      def __init__(self,
                   data_dir=path.join(path.dirname(__file__), 'data'),
                   Tokenizer=Tokenizer.from_dir,
@@ -45,34 +60,117 @@ package directory, and loads it from there.
                   load_vectors=True
      ):
:code:`data_dir`
  :code:`unicode path`

  The data directory. May be None, to disable any data loading (including
  the vocabulary).

:code:`Tokenizer`
  :code:`(Vocab vocab, unicode data_dir)(unicode) --> Tokens`

  A class/function that creates the tokenizer.

:code:`Tagger` / :code:`Parser` / :code:`Entity`
  :code:`(Vocab vocab, unicode data_dir)(Tokens) --> None`

  A class/function that creates the part-of-speech tagger /
  syntactic dependency parser / named entity recogniser.
  May be None or False, to disable the component.

:code:`load_vectors` (bool)
  A boolean value to control whether the word vectors are loaded.
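As an illustration, here is what a component like the :code:`MyTagger` example
above might look like. The body is a hypothetical sketch; only the call
signatures come from the argument descriptions above.

.. code::

  class MyTagger(object):
      """Hypothetical component: constructed with (vocab, data_dir),
      then called on a Tokens object, which it annotates in place."""
      def __init__(self, vocab, data_dir):
          self.vocab = vocab        # the shared Vocab instance
          self.data_dir = data_dir  # where to load any model data from

      def __call__(self, tokens):
          # Annotate each token in place; return nothing.
          for token in tokens:
              pass  # e.g. assign a part-of-speech tag here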
Processing Text
---------------

The text processing API is very small and simple. Everything is a callable object,
and you will almost always apply the pipeline all at once.
.. py:method:: English.__call__(text, tag=True, parse=True, entity=True) --> Tokens
text (unicode)
The text to be processed. No pre-processing needs to be applied, and any
length of text can be submitted. Usually you will submit a whole document.
Text may be zero-length. An exception is raised if byte strings are supplied.
tag (bool)
Whether to apply the part-of-speech tagger.
parse (bool)
Whether to apply the syntactic dependency parser.
entity (bool)
Whether to apply the named entity recognizer.
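For example, if you only need tokenization and part-of-speech tags, you can
skip the more expensive parsing and entity recognition steps:

>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'A phrase to tag but not parse.', parse=False, entity=False)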
Accessing Annotation
--------------------
spaCy provides a rich API for using the annotations it calculates. It is arranged
into three data classes:
1. :code:`Tokens`: A container, which provides document-level access;
2. :code:`Span`: A contiguous sequence of tokens, e.g. a sentence or an entity;
3. :code:`Token`: An individual token, and a node in the parse tree.
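For instance (a sketch; the exact sentence and entity boundaries depend on
the loaded data):

>>> tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
>>> token = tokens[1]                     # a Token
>>> sentence = next(iter(tokens.sents))   # a sentence, as a Span
>>> entity = next(iter(tokens.ents))      # a named entity, as a Span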
.. autoclass:: spacy.tokens.Tokens
:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
The Tokens class behaves as a Python sequence, supporting the usual operators,
len(), etc. Negative indexing is supported. Slices are not yet supported.
.. code::
>>> tokens = nlp(u'Zero one two three four five six')
>>> tokens[0].orth_
u'Zero'
>>> tokens[-1].orth_
u'six'
>>> tokens[0:4]
Error
:code:`sents`
Iterate over sentences in the document.
:code:`ents`
Iterate over entities in the document.
:code:`to_array`
Given a list of M attribute IDs, export the tokens to a numpy ndarray
of shape (N, M), where N is the number of tokens in the document.
Arguments:
attr_ids (list[int]): A list of attribute ID ints.
Returns:
feat_array (numpy.ndarray[long, ndim=2]):
A feature matrix, with one row per word, and one column per attribute
indicated in the input attr_ids.
:code:`count_by`
Produce a dict of {attribute value (int): count (int)} frequencies, keyed
by the values of the given attribute ID.
>>> from spacy.en import English, attrs
>>> nlp = English()
>>> tokens = nlp(u'apple apple orange banana')
>>> tokens.count_by(attrs.ORTH)
{12800L: 1, 11880L: 2, 7561L: 1}
>>> tokens.to_array([attrs.ORTH])
array([[11880],
[11880],
[ 7561],
[12800]])
:code:`merge`
Merge a multi-word expression into a single token. Currently
experimental; API is likely to change.
Internals
@@ -87,6 +185,34 @@ load_vectors
access is required, and you need slightly better performance. However, this
is both slower and has a worse API than Cython access.
.. autoclass:: spacy.spans.Span
:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
Sequence API
:code:`head`
Syntactic head, or None
:code:`lefts`
Tokens to the left of the span
:code:`rights`
Tokens to the right of the span
:code:`orth` / :code:`orth_`
Orth string
:code:`lemma` / :code:`lemma_`
Lemma string
:code:`string`
String
:code:`label` / :code:`label_`
Label
:code:`subtree`
Lefts + [self] + Rights
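A brief usage sketch, assuming the document contains at least one entity:

>>> tokens = nlp(u'Best flew to New York.')
>>> entities = list(tokens.ents)  # each entity is a Span
>>> entities[0].orth_             # the entity's surface string
>>> entities[0].label_            # its label, as a string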
.. autoclass:: spacy.tokens.Token
@@ -239,6 +365,24 @@ load_vectors
ent_iob
The IOB (inside, outside, begin) entity recognition tag for the token
Lexical Lookup
--------------
Where possible, spaCy computes information over lexical *types*, rather than
*tokens*. If you process a large batch of text, the number of unique types
you will see grows much more slowly than the number of tokens --- so
it's much more efficient to compute over types. And, in small samples, we generally
want to know about the distribution of a word in the language at large ---
which again, is type-based information.
You can access the lexical features via the Token object, but you can also look them
up in the vocabulary directly:
>>> from spacy.en import English
>>> nlp = English()
>>> lexeme = nlp.vocab[u'Amazon']
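Because the lookup is type-based, no text needs to be processed: the lexeme
carries the word's context-independent attributes. Treat the attribute set
below as illustrative of the lexical attributes:

>>> lexeme.orth_  # the string itself
u'Amazon'
>>> lexeme.prob   # unigram log-probability, from the loaded data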
.. py:class:: vocab.Vocab(self, data_dir=None, lex_props_getter=None)

  .. py:method:: __len__(self) --> int
@@ -255,6 +399,7 @@ load_vectors
  .. py:method:: load_vectors(self, loc: unicode) --> None

.. py:class:: strings.StringStore(self)

  .. py:method:: __len__(self) --> int
@@ -269,24 +414,4 @@ load_vectors
  .. py:method:: load(self, loc: unicode) --> None
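A sketch of typical usage: the store maps strings to integer IDs and back
(the particular IDs vary with the data):

>>> string_id = nlp.vocab.strings[u'apple']  # unicode --> int
>>> orth = nlp.vocab.strings[string_id]      # int --> string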
.. py:class:: tokenizer.Tokenizer(self, Vocab vocab, rules, prefix_re, suffix_re, infix_re, pos_tags, tag_names)

  .. py:method:: tokens_from_list(self, List[unicode]) --> spacy.tokens.Tokens

  .. py:method:: __call__(self, string: unicode) --> spacy.tokens.Tokens

  .. py:attribute:: vocab: spacy.vocab.Vocab

.. py:class:: en.pos.EnPosTagger(self, strings: spacy.strings.StringStore, data_dir: unicode)

  .. py:method:: __call__(self, tokens: spacy.tokens.Tokens)

  .. py:method:: train(self, tokens: spacy.tokens.Tokens, golds: List[int]) --> int

  .. py:method:: load_morph_exceptions(self, exc: Dict[unicode, Dict])

.. py:class:: syntax.parser.Parser(self, model_dir: unicode)

  .. py:method:: __call__(self, tokens: spacy.tokens.Tokens) --> None

  .. py:method:: train(self, spacy.tokens.Tokens) --> None