* Update quickstart, with work on api-at-a-glance

This commit is contained in:
Matthew Honnibal 2015-01-24 17:27:50 +11:00
parent 76cd024095
commit cb6a526fcd


@@ -5,6 +5,8 @@ Quick Start
Install
-------

.. py:currentmodule:: spacy

.. code:: bash

    $ pip install spacy
@@ -17,159 +19,180 @@ the spacy.en package directory.
Usage
-----

The main entry-point is :meth:`en.English.__call__`, which accepts a unicode string
as an argument, and returns a :py:class:`tokens.Tokens` object. You can
iterate over it to get :py:class:`tokens.Token` objects, which provide
a convenient API:

>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'I ate the pizza with anchovies.')
>>> pizza = tokens[3]
>>> (pizza.orth, pizza.orth_, pizza.head.lemma, pizza.head.lemma_)
(14702, u'pizza', 14702, u'ate')
spaCy maps all strings to sequential integer IDs --- a common idiom in NLP.
If an attribute `Token.foo` is an integer ID, then `Token.foo_` is the string,
e.g. `pizza.orth_` and `pizza.orth` provide the integer ID and the string of
the original orthographic form of the word, with no string normalizations
applied.
.. note::

    en.English.__call__ is stateful --- it has an important **side-effect**:
    spaCy maps strings to sequential integers, so when it processes a new
    word, the mapping table is updated.

    Future releases will feature a way to reconcile :py:class:`strings.StringStore`
    mappings, but for now, you should only work with one instance of the pipeline
    at a time.

    This issue only affects rare words. spaCy's pre-compiled lexicon has 260,000
    words; the string IDs for these words will always be consistent.
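
A minimal sketch of what the side-effect means in practice, assuming the
``vocab.strings`` table (listed in the API section below) can be queried directly
with a unicode string:

>>> # Within one pipeline instance, a string always resolves to the same ID
>>> nlp.vocab.strings[u'pizza'] == tokens[3].orth
True
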
(Most of the) API at a glance
-----------------------------

**Process the string:**

.. py:class:: spacy.en.English(self, data_dir=join(dirname(__file__), 'data'))
.. py:method:: __call__(self, text: unicode, tag=True, parse=False) --> Tokens
+-----------------+--------------+--------------+
| Attribute       | Type         | Its API      |
+=================+==============+==============+
| vocab           | Vocab        | __getitem__  |
+-----------------+--------------+--------------+
| vocab.strings   | StringStore  | __getitem__  |
+-----------------+--------------+--------------+
| tokenizer       | Tokenizer    | __call__     |
+-----------------+--------------+--------------+
| tagger          | EnPosTagger  | __call__     |
+-----------------+--------------+--------------+
| parser          | GreedyParser | __call__     |
+-----------------+--------------+--------------+
**Get dict or numpy array:**
.. py:method:: tokens.Tokens.to_array(self, attr_ids: List[int]) --> numpy.ndarray[ndim=2, dtype=int32]
.. py:method:: tokens.Tokens.count_by(self, attr_id: int) --> Dict[int, int]
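
A sketch of the export methods; ``IS_ALPHA`` appears elsewhere in this document,
while ``LEMMA`` is assumed here to be among the constants exposed by
``spacy.en.attrs``:

>>> from spacy.en.attrs import LEMMA, IS_ALPHA
>>> lemma_counts = tokens.count_by(LEMMA)     # {lemma ID: count}, e.g. for bag-of-words models
>>> arr = tokens.to_array((LEMMA, IS_ALPHA))  # one row per word, one column per attribute
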
**Get Token objects**
.. py:method:: tokens.Tokens.__getitem__(self, i) --> Token
.. py:method:: tokens.Tokens.__iter__(self) --> Iterator[Token]
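
Indexing and iteration both yield :py:class:`tokens.Token` objects:

>>> tokens[3].orth_
u'pizza'
>>> [token.orth_ for token in tokens][3]
u'pizza'
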
**Embedded word representations**

.. py:attribute:: tokens.Token.repvec
.. py:attribute:: lexeme.Lexeme.repvec
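
A sketch of using the embeddings, assuming ``repvec`` is a one-dimensional numpy
array of floats (the dimensionality is not documented in this section):

>>> import numpy
>>> pizza_vec = tokens[3].repvec                  # token-level access
>>> anchovy_vec = nlp.vocab[u'anchovies'].repvec  # lexeme-level access
>>> cosine = numpy.dot(pizza_vec, anchovy_vec) / (
...     numpy.linalg.norm(pizza_vec) * numpy.linalg.norm(anchovy_vec))
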
**Navigate dependency parse**

.. py:method:: nbor(self, i=1) --> Token
.. py:method:: child(self, i=1) --> Token
.. py:method:: sibling(self, i=1) --> Token
.. py:attribute:: head: Token
.. py:attribute:: dep: int
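
Continuing the example from the Usage section, a short sketch of walking the
parse (assuming ``nbor(i)`` returns the token ``i`` positions to the right in
the string):

>>> pizza = tokens[3]
>>> pizza.head.orth_     # the syntactic head of 'pizza' is 'ate'
u'ate'
>>> pizza.nbor(1).orth_  # the next word in the string, regardless of the parse
u'with'
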
**Align to original string**

.. py:attribute:: string: unicode

    Padded with original whitespace.

.. py:attribute:: length: int

    Length, in unicode code-points. Equal to len(self.orth_).

    self.string[self.length:] gets whitespace.

.. py:attribute:: idx: int

    Starting offset of word in the original string.
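
Because ``string`` carries the token's trailing whitespace and ``idx`` gives its
offset, the original text can be recovered exactly; a small sketch:

>>> text = u'I ate the pizza with anchovies.'
>>> tokens = nlp(text)
>>> u''.join(token.string for token in tokens) == text
True
>>> text[tokens[3].idx:].startswith(u'pizza')
True
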
Features
--------

**Boolean features**

>>> lexeme = nlp.vocab[u'Apple']
>>> lexeme.is_alpha, lexeme.is_upper
(True, False)
>>> tokens = nlp(u'Apple computers')
>>> tokens[0].is_alpha, tokens[0].is_upper
(True, False)
>>> from spacy.en.attrs import IS_ALPHA, IS_UPPER
>>> tokens.to_array((IS_ALPHA, IS_UPPER))[0]
array([1, 0])

+----------+---------------------------------------------------------------+
| is_alpha | :py:meth:`str.isalpha`                                        |
+----------+---------------------------------------------------------------+
| is_digit | :py:meth:`str.isdigit`                                        |
+----------+---------------------------------------------------------------+
| is_lower | :py:meth:`str.islower`                                        |
+----------+---------------------------------------------------------------+
| is_title | :py:meth:`str.istitle`                                        |
+----------+---------------------------------------------------------------+
| is_upper | :py:meth:`str.isupper`                                        |
+----------+---------------------------------------------------------------+
| is_ascii | all(ord(c) < 128 for c in string)                             |
+----------+---------------------------------------------------------------+
| is_punct | all(unicodedata.category(c).startswith('P') for c in string)  |
+----------+---------------------------------------------------------------+
| like_url | Using various heuristics, does the string resemble a URL?     |
+----------+---------------------------------------------------------------+
| like_num | "Two", "10", "1,000", "10.54", "1/2" etc all match            |
+----------+---------------------------------------------------------------+

**String-transform Features**

+----------+---------------------------------------------------------------+
| orth     | The original string, unmodified.                              |
+----------+---------------------------------------------------------------+
| lower    | The original string, forced to lower-case                     |
+----------+---------------------------------------------------------------+
| norm     | The string after additional normalization                     |
+----------+---------------------------------------------------------------+
| shape    | Word shape, e.g. 10 --> dd, Garden --> Xxxx, Hi!5 --> Xx!d    |
+----------+---------------------------------------------------------------+
| prefix   | A short slice from the start of the string.                   |
+----------+---------------------------------------------------------------+
| suffix   | A short slice from the end of the string.                     |
+----------+---------------------------------------------------------------+
| lemma    | The word's lemma, i.e. morphological suffixes removed         |
+----------+---------------------------------------------------------------+
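
As with the boolean features, each of these is available both as an integer ID
(e.g. ``token.shape``) and as the corresponding string (e.g. ``token.shape_``).
A sketch, with no outputs shown because the exact normalizations are
model-dependent:

>>> token = nlp(u'Anchovies')[0]
>>> token.orth_, token.lower_, token.shape_   # unmodified, lower-cased, and word-shape strings
>>> token.prefix_, token.suffix_              # short slices from the start and end
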
**Syntactic labels**
+----------+---------------------------------------------------------------+
| pos      | The word's part-of-speech, from the Google Universal Tag Set |
+----------+---------------------------------------------------------------+
| tag      | A fine-grained morphosyntactic tag, e.g. VBZ, NNS, etc        |
+----------+---------------------------------------------------------------+
| dep      | Dependency type label between word and its head, e.g. subj    |
+----------+---------------------------------------------------------------+
**Distributional**
+---------+-----------------------------------------------------------+
| cluster | Brown cluster ID of the word                              |
+---------+-----------------------------------------------------------+
| prob    | Log probability of word, smoothed with Simple Good-Turing |
+---------+-----------------------------------------------------------+