.. spaCy documentation master file, created by
   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

spaCy NLP Tokenizer and Lexicon
================================

spaCy is an industrial-strength multi-language tokenizer, bristling with features
you never knew you wanted. You do want these features, though: your current
tokenizer has been doing it wrong.

Where other tokenizers give you a list of strings, spaCy gives you references
to rich lexical types, for easy, excellent and efficient feature extraction.
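
To make the idea of "references to rich lexical types" concrete, here is a
minimal Python sketch. It is not spaCy's implementation: the names ``Lexeme``,
``Lexicon``, ``word_shape`` and ``tokenize`` are hypothetical, and the
regular-expression tokenizer is a deliberately naive stand-in. It only
illustrates the principle that features are computed once per distinct word
and then shared by every occurrence of that word.

.. code-block:: python

    import re
    from dataclasses import dataclass


    def word_shape(text):
        """Map each character to a class, e.g. 'Hello' -> 'Xxxxx'."""
        shape = []
        for ch in text:
            if ch.isdigit():
                shape.append("d")
            elif ch.isalpha():
                shape.append("X" if ch.isupper() else "x")
            else:
                shape.append(ch)  # keep punctuation as-is
        return "".join(shape)


    @dataclass(frozen=True)
    class Lexeme:
        """Hypothetical lexical type: a word plus pre-computed features."""
        text: str
        shape: str
        is_alpha: bool
        is_title: bool


    class Lexicon:
        """Hypothetical vocabulary that builds each Lexeme only once."""

        def __init__(self):
            self._entries = {}

        def lookup(self, text):
            lex = self._entries.get(text)
            if lex is None:
                # Features are computed here, on first sight of the word.
                lex = Lexeme(text, word_shape(text), text.isalpha(), text.istitle())
                self._entries[text] = lex
            return lex


    def tokenize(lexicon, string):
        """Naive stand-in tokenizer: word runs or single punctuation marks."""
        return [lexicon.lookup(piece) for piece in re.findall(r"\w+|[^\w\s]", string)]


    lexicon = Lexicon()
    tokens = tokenize(lexicon, "Call 42 now, call 42 later.")
    for lex in tokens:
        print(lex.text, lex.shape, lex.is_alpha, lex.is_title)

Because both occurrences of ``42`` resolve to the same object, per-token
feature extraction in this sketch reduces to attribute access on a shared
entry rather than repeated string processing.
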

* **Easy**: Tokenizer returns a sequence of rich lexical types, with features
  pre-computed:

    >>> from spacy.en import EN
    >>> for w in EN.tokenize(string):
    ...     print w.sic, w.shape, w.cluster, w.oft_title, w.can_verb

  Check out the tutorial and API docs.

* **Excellent**: Distributional and orthographic features are crucial to robust
  NLP. Without them, models can only learn from tiny annotated training
  corpora. Read more.

* **Efficient**: spaCy serves you rich lexical objects faster than most
  tokenizers can give you a list of strings.

+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+--------+-------+--------------+--------------+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+
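
The Speed Factor column appears to be the ratio of the two wall-clock times,
with NLTK as the baseline. That reading is an assumption, not something the
table states. The sketch below checks it; ``parse_duration`` and
``speed_factor`` are hypothetical helper names used only for illustration.

.. code-block:: python

    def parse_duration(text):
        """Convert a duration such as '6m4s' or '9.5s' to seconds."""
        minutes, _, seconds = text.rpartition("m")
        total = float(seconds.rstrip("s"))
        if minutes:
            total += 60 * int(minutes)
        return total


    def speed_factor(baseline, candidate):
        """How many times faster the candidate ran than the baseline."""
        return parse_duration(baseline) / parse_duration(candidate)


    # Times copied from the table above: NLTK is the baseline.
    factor = speed_factor("6m4s", "9.5s")
    print(round(factor, 1))

Under this assumption, 364 seconds divided by 9.5 seconds gives a factor of
roughly 38.3, in line with the figure reported for spaCy.
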

.. toctree::
    :hidden:
    :maxdepth: 3

    what/index.rst
    why/index.rst
    how/index.rst