mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-27 17:54:39 +03:00
48 lines
1.6 KiB
ReStructuredText
48 lines
1.6 KiB
ReStructuredText
.. spaCy documentation master file, created by
|
|
sphinx-quickstart on Tue Aug 19 16:27:38 2014.
|
|
You can adapt this file completely to your liking, but it should at least
|
|
contain the root `toctree` directive.
|
|
|
|
spaCy NLP Tokenizer and Lexicon
|
|
================================
|
|
|
|
spaCy is an industrial-strength multi-language tokenizer, bristling with features
|
|
you never knew you wanted. You do want these features though --- your current
|
|
tokenizer has been doing it wrong.
|
|
Where other tokenizers give you a list of strings, spaCy gives you references
|
|
to rich lexical types, for easy, excellent and efficient feature extraction.
|
|
|
|
* **Easy**: Tokenizer returns a sequence of rich lexical types, with features
|
|
pre-computed:
|
|
|
|
>>> from spacy.en import EN
|
|
>>> for w in EN.tokenize(string):
|
|
... print w.sic, w.shape, w.cluster, w.oft_title, w.can_verb
|
|
|
|
Check out the tutorial and API docs.
|
|
|
|
* **Excellent**: Distributional and orthographic features are crucial to robust
|
|
NLP. Without them, models can only learn from tiny annotated training
|
|
corpora. Read more.
|
|
|
|
* **Efficient**: spaCy serves you rich lexical objects faster than most
|
|
tokenizers can give you a list of strings.
|
|
|
|
+--------+-------+--------------+--------------+
|
|
| System | Time | Words/second | Speed Factor |
|
|
+--------+-------+--------------+--------------+
|
|
| NLTK | 6m4s | 89,000 | 1.00 |
|
|
+--------+-------+--------------+--------------+
|
|
| spaCy | 9.5s | 3,093,000 | 38.30 |
|
|
+--------+-------+--------------+--------------+
|
|
|
|
|
|
|
|
.. toctree::
|
|
:hidden:
|
|
:maxdepth: 3
|
|
|
|
what/index.rst
|
|
why/index.rst
|
|
how/index.rst
|