Quick Start
===========

Install
-------

.. py:currentmodule:: spacy

With Python 2.7 or Python 3, using Linux or OSX, run:

.. code:: bash

    $ pip install spacy
    $ python -m spacy.en.download

.. _300 mb of data: http://s3-us-west-1.amazonaws.com/media.spacynlp.com/en_data_all-0.4.tgz

The download command fetches about `300 mb of data`_ --- the parser model and
word vectors --- and installs it within the spacy.en package directory.

If you're stuck using a server with an old version of Python, and you don't
have root access, I've prepared a bootstrap script to help you compile a local
Python install. Run:

.. code:: bash

    $ curl https://raw.githubusercontent.com/honnibal/spaCy/master/bootstrap_python_env.sh | bash && source .env/bin/activate

The other way to install the package is to clone the GitHub repository and
build it from source. This requires an additional dependency, Cython.
If you're using Python 2, I also recommend installing fabric and fabtools ---
this is how I build the project.

.. code:: bash

    $ git clone https://github.com/honnibal/spaCy.git
    $ cd spaCy
    $ virtualenv .env && source .env/bin/activate
    $ export PYTHONPATH=`pwd`
    $ pip install -r requirements.txt
    $ python setup.py build_ext --inplace
    $ python -m spacy.en.download
    $ pip install pytest
    $ py.test tests/

Python packaging is awkward at the best of times, and it's particularly tricky
with C extensions built via Cython that require large data files. So, please
report issues as you encounter them, and bear with me :)

Usage
-----
The main entry-point is :meth:`en.English.__call__`, which accepts a unicode string
as an argument, and returns a :py:class:`tokens.Doc` object. You can
iterate over it to get :py:class:`tokens.Token` objects, which provide
a convenient API:

>>> from __future__ import unicode_literals # If Python 2
>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'I ate the pizza with anchovies.')
>>> pizza = tokens[3]
>>> (pizza.orth, pizza.orth_, pizza.head.lemma, pizza.head.lemma_)
(14702, u'pizza', 14702, u'eat')

spaCy maps all strings to sequential integer IDs --- a common trick in NLP.
If an attribute `Token.foo` is an integer ID, then `Token.foo_` is the string,
e.g. `pizza.orth` and `pizza.orth_` provide the integer ID and the string of
the original orthographic form of the word.

.. note:: en.English.__call__ is stateful --- it has an important **side-effect**.

   When it processes a previously unseen word, it increments the ID counter,
   assigns the ID to the string, and writes the mapping in
   :py:data:`English.vocab.strings` (instance of
   :py:class:`strings.StringStore`).
   Future releases will feature a way to reconcile mappings, but for now, you
   should only work with one instance of the pipeline at a time.

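The StringStore lets you move between the two representations directly. A
minimal sketch, assuming (as the table in the next section suggests) that
`vocab.strings` supports `__getitem__` in both directions --- string to
integer ID, and integer ID back to string:

.. code:: python

    # Minimal sketch of the string <--> ID mapping. Assumes nlp = English()
    # as in the example above, and that StringStore.__getitem__ accepts
    # either a unicode string or an integer ID.
    from spacy.en import English

    nlp = English()
    tokens = nlp(u'I ate the pizza with anchovies.')
    pizza = tokens[3]

    pizza_id = nlp.vocab.strings[u'pizza']          # string --> integer ID
    assert pizza_id == pizza.orth
    assert nlp.vocab.strings[pizza_id] == u'pizza'  # integer ID --> string
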
(Most of the) API at a glance
-----------------------------

**Process the string:**

.. py:class:: spacy.en.English(self, data_dir=join(dirname(__file__), 'data'))

.. py:method:: __call__(self, text: unicode, tag=True, parse=True, entity=True, merge_mwes=False) --> Doc

+-----------------+--------------+--------------+
| Attribute       | Type         | Its API      |
+=================+==============+==============+
| vocab           | Vocab        | __getitem__  |
+-----------------+--------------+--------------+
| vocab.strings   | StringStore  | __getitem__  |
+-----------------+--------------+--------------+
| tokenizer       | Tokenizer    | __call__     |
+-----------------+--------------+--------------+
| tagger          | EnPosTagger  | __call__     |
+-----------------+--------------+--------------+
| parser          | GreedyParser | __call__     |
+-----------------+--------------+--------------+
| entity          | GreedyParser | __call__     |
+-----------------+--------------+--------------+

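A rough sketch of how these pieces fit together, using the keyword flags from
the `__call__` signature above to skip annotations you don't need:

.. code:: python

    # Sketch: the English object bundles the vocab, tokenizer, tagger, parser
    # and entity recogniser listed in the table above. The keyword flags on
    # __call__ control which annotations are computed.
    from spacy.en import English

    nlp = English()

    # Full pipeline: tokenize, tag, parse and recognise entities
    doc = nlp(u'London is a big city in the United Kingdom.')

    # Tokenize and tag only --- cheaper when the parse tree isn't needed
    doc_tagged_only = nlp(u'London is a big city.', parse=False, entity=False)
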

**Get dict or numpy array:**

.. py:method:: tokens.Doc.to_array(self, attr_ids: List[int]) --> ndarray[ndim=2, dtype=long]

.. py:method:: tokens.Doc.count_by(self, attr_id: int) --> Dict[int, int]

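For example --- a sketch, assuming that spacy.en.attrs also exposes an ORTH
attribute ID alongside the IS_ALPHA and IS_UPPER flags used in the Features
section below:

.. code:: python

    # Sketch of Doc.to_array and Doc.count_by. ORTH is assumed to be an
    # attribute ID exported by spacy.en.attrs, like IS_ALPHA and IS_UPPER.
    from spacy.en import English
    from spacy.en.attrs import IS_ALPHA, IS_UPPER, ORTH

    nlp = English()
    tokens = nlp(u'The cat sat on the mat next to the dog.')

    # One row per token, one column per requested attribute ID
    arr = tokens.to_array((IS_ALPHA, IS_UPPER))
    print(arr.shape)

    # Dict mapping integer ID --> frequency within this Doc
    counts = tokens.count_by(ORTH)
    print(counts[nlp.vocab.strings[u'the']])   # lower-case 'the' occurs twice
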
**Get Token objects**

.. py:method:: tokens.Doc.__getitem__(self, i) --> Token

.. py:method:: tokens.Doc.__iter__(self) --> Iterator[Token]

**Get sentence or named entity spans**

.. py:attribute:: tokens.Doc.sents --> Iterator[Span]

.. py:attribute:: tokens.Doc.ents --> Iterator[Span]

You can iterate over a Span to access individual Tokens, or access its
start, end or label.

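For example --- a sketch, assuming the default pipeline (tag, parse and entity
recognition all enabled) so that both `.sents` and `.ents` are populated:

.. code:: python

    # Sketch: iterating sentence spans and entity spans. A Span is itself a
    # sequence of Tokens, as described above.
    from spacy.en import English

    nlp = English()
    tokens = nlp(u'Mr. Best flew to New York on Saturday morning. He arrived at noon.')

    for sent in tokens.sents:
        print(u' '.join(t.orth_ for t in sent))

    for ent in tokens.ents:
        # Each entity span also exposes .start, .end and .label
        print(u' '.join(t.orth_ for t in ent))
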
**Embedded word representations**

.. py:attribute:: tokens.Token.repvec

.. py:attribute:: lexeme.Lexeme.repvec

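The repvec attribute holds the word's embedding, from the word vectors
installed by `spacy.en.download`. A sketch, assuming repvec is a numpy array
of floats:

.. code:: python

    # Sketch: comparing words through their embedding vectors. Assumes
    # Token.repvec (and Lexeme.repvec) is a numpy float array, so cosine
    # similarity can be computed with numpy directly.
    import numpy
    from spacy.en import English

    def cosine(a, b):
        return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

    nlp = English()
    apples, oranges, cars = nlp(u'apples oranges cars')
    print(cosine(apples.repvec, oranges.repvec))   # expect: relatively high
    print(cosine(apples.repvec, cars.repvec))      # expect: lower
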
**Navigate to tree- or string-neighbor tokens**

.. py:method:: nbor(self, i=1) --> Token

.. py:method:: child(self, i=1) --> Token

.. py:method:: sibling(self, i=1) --> Token

.. py:attribute:: head: Token

.. py:attribute:: dep: int

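For example --- a sketch of walking from a token to its syntactic head and to
its string neighbour; `dep_` is assumed to follow the `Token.foo` /
`Token.foo_` convention described above:

.. code:: python

    # Sketch: tree and string navigation from a single token.
    from spacy.en import English

    nlp = English()
    tokens = nlp(u'I ate the pizza with anchovies.')
    pizza = tokens[3]

    print(pizza.head.orth_)     # the word pizza attaches to, e.g. u'ate'
    print(pizza.dep_)           # label of the arc to the head (assumed string form)
    print(pizza.nbor().orth_)   # the next token in the string
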
**Align to original string**

.. py:attribute:: string: unicode

   Padded with original whitespace.

.. py:attribute:: length: int

   Length, in unicode code-points. Equal to len(self.orth_).

.. py:attribute:: idx: int

   Starting offset of word in the original string.

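These attributes make it straightforward to line tokens back up with the text
you passed in. A sketch, assuming (as stated above) that the whitespace
padding on `string` is enough to reproduce the input exactly by concatenation:

.. code:: python

    # Sketch: aligning tokens back to the original string.
    from spacy.en import English

    nlp = English()
    text = u'I ate the pizza with anchovies.'
    tokens = nlp(text)

    # token.string is padded with the original whitespace
    assert u''.join(t.string for t in tokens) == text

    for t in tokens:
        # idx is the character offset of the token in the original string
        assert text[t.idx:t.idx + len(t.orth_)] == t.orth_
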
Features
--------

**Boolean features**

>>> lexeme = nlp.vocab[u'Apple']
>>> lexeme.is_alpha, lexeme.is_upper
(True, False)

>>> tokens = nlp(u'Apple computers')
>>> tokens[0].is_alpha, tokens[0].is_upper
(True, False)

>>> from spacy.en.attrs import IS_ALPHA, IS_UPPER
>>> tokens.to_array((IS_ALPHA, IS_UPPER))[0]
array([1, 0])

+----------+---------------------------------------------------------------+
| is_alpha | :py:meth:`str.isalpha` |
+----------+---------------------------------------------------------------+
| is_digit | :py:meth:`str.isdigit` |
+----------+---------------------------------------------------------------+
| is_lower | :py:meth:`str.islower` |
+----------+---------------------------------------------------------------+
| is_title | :py:meth:`str.istitle` |
+----------+---------------------------------------------------------------+
| is_upper | :py:meth:`str.isupper` |
+----------+---------------------------------------------------------------+
| is_ascii | all(ord(c) < 128 for c in string) |
+----------+---------------------------------------------------------------+
| is_punct | all(unicodedata.category(c).startswith('P') for c in string) |
+----------+---------------------------------------------------------------+
| like_url | Using various heuristics, does the string resemble a URL? |
+----------+---------------------------------------------------------------+
| like_num | "Two", "10", "1,000", "10.54", "1/2" etc all match |
+----------+---------------------------------------------------------------+
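
These flags are available both on Token (as above) and on the Lexeme entries
in the vocab. A sketch --- the expected values are what the table suggests,
not verified output:

.. code:: python

    # Sketch: boolean features read from vocab entries (Lexeme), assuming
    # Lexeme exposes the same is_* / like_* attributes as Token.
    from spacy.en import English

    nlp = English()

    print(nlp.vocab[u'1,000'].like_num)               # expect True
    print(nlp.vocab[u'http://example.com'].like_url)  # expect True
    print(nlp.vocab[u'!'].is_punct)                   # expect True
    print(nlp.vocab[u'pizza'].is_ascii)               # expect True
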

**String-transform Features**

+----------+---------------------------------------------------------------+
| orth | The original string, unmodified. |
+----------+---------------------------------------------------------------+
| lower | The original string, forced to lower-case |
+----------+---------------------------------------------------------------+
| norm | The string after additional normalization |
+----------+---------------------------------------------------------------+
| shape | Word shape, e.g. 10 --> dd, Garden --> Xxxx, Hi!5 --> Xx!d |
+----------+---------------------------------------------------------------+
| prefix | A short slice from the start of the string. |
+----------+---------------------------------------------------------------+
| suffix | A short slice from the end of the string. |
+----------+---------------------------------------------------------------+
| lemma | The word's lemma, i.e. morphological suffixes removed |
+----------+---------------------------------------------------------------+
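
Each of these follows the `Token.foo` / `Token.foo_` convention, so the string
forms below are an assumption based on that pattern rather than documented
attribute names:

.. code:: python

    # Sketch: string-transform features via the assumed underscore variants.
    from spacy.en import English

    nlp = English()
    tokens = nlp(u'Apples and oranges')
    apples = tokens[0]

    print(apples.orth_)     # the original string, u'Apples'
    print(apples.lower_)    # lower-cased form, u'apples'
    print(apples.shape_)    # word shape, cf. Garden --> Xxxx above
    print(apples.prefix_)   # a short slice from the start of the string
    print(apples.suffix_)   # a short slice from the end of the string
    print(apples.lemma_)    # lemma, e.g. u'apple'
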

**Syntactic labels**

+----------+---------------------------------------------------------------+
| pos | The word's part-of-speech, from the Google Universal Tag Set |
+----------+---------------------------------------------------------------+
| tag | A fine-grained morphosyntactic tag, e.g. VBZ, NNS, etc |
+----------+---------------------------------------------------------------+
| dep | Dependency type label between word and its head, e.g. subj |
+----------+---------------------------------------------------------------+
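
Again a sketch, assuming the underscore variants give the string forms of
these labels:

.. code:: python

    # Sketch: POS tags and dependency labels for each token.
    from spacy.en import English

    nlp = English()
    tokens = nlp(u'She sells seashells by the seashore.')

    for t in tokens:
        # coarse POS (Google Universal Tag Set), fine-grained tag, dependency label
        print(u'%s\t%s\t%s\t%s' % (t.orth_, t.pos_, t.tag_, t.dep_))
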

**Distributional**

+---------+-----------------------------------------------------------+
| cluster | Brown cluster ID of the word |
+---------+-----------------------------------------------------------+
| prob | Log probability of word, smoothed with Simple Good-Turing |
+---------+-----------------------------------------------------------+
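
A sketch of the distributional features --- assuming cluster is a plain
integer and prob a float log-probability, both readable from vocab entries:

.. code:: python

    # Sketch: Brown cluster ID and smoothed log-probability of a word.
    from spacy.en import English

    nlp = English()

    the = nlp.vocab[u'the']
    anchovies = nlp.vocab[u'anchovies']

    print(the.cluster)        # integer Brown cluster ID
    print(the.prob)           # log-probability, smoothed with Simple Good-Turing
    assert the.prob > anchovies.prob   # 'the' is far more frequent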