commit f15deaad5b  (parent 1ccabc806e)

    Upd docs

@@ -10,14 +10,27 @@ spaCy NLP Tokenizer and Lexicon
 spaCy is a library for industrial-strength NLP in Python and Cython. spaCy's
 take on NLP is that it's mostly about feature extraction --- that's the part
 that's specific to NLP, so that's what an NLP library should focus on.
-It should tell you what the current best-practice is, and help you do exactly
-that, quickly and efficiently.
 
-Best-practice is to **use lots of large lexicons**. Let's say you hit the word
-*belieber* in production. What will your system know about this word? A bad
-system will only know things about the words in its training corpus, which
-probably consists of texts written before Justin Bieber was even born.
-It doesn't have to be like that.
+spaCy also believes that for NLP, **efficiency is critical**. If you're
+running batch jobs, you probably have an enormous amount of data; if you're
+serving requests one-by-one, you want lower latency and fewer servers. Even if
+you're doing exploratory research on relatively small samples, you should still
+value efficiency, because it means you can run more experiments.
+
+Depending on the task, spaCy is between 10 and 200 times faster than NLTK,
+often with much better accuracy. See Benchmarks for details, and
+Why is spaCy so fast? for a discussion of the algorithms and implementation
+that makes this possible.
+
++---------+----------+-------------+----------+
+| System  | Tokenize | --> Counts  | --> Stem |
++---------+----------+-------------+----------+
+| spaCy   | 1m42s    | 1m59s       | 1m59s    |
++---------+----------+-------------+----------+
+| NLTK    | 20m2s    | 28m24s      | 52m28    |
++---------+----------+-------------+----------+
+
+Times for 100m words of text.
 
 
 Unique Lexicon-centric design
@@ -25,15 +38,14 @@ Unique Lexicon-centric design
 
 spaCy helps you build models that generalise better, by making it easy to use
 more robust features. Instead of a list of strings, the tokenizer returns
-references to rich lexical types. Its tokenizer returns sequence of references
-to rich lexical types. Features which ask about the word's Brown cluster, its
-typical part-of-speech tag, how it's usually cased etc require no extra effort:
+references to rich lexical types. Features which ask about the word's Brown cluster,
+its typical part-of-speech tag, how it's usually cased etc require no extra effort:
 
     >>> from spacy.en import EN
     >>> from spacy.feature_names import *
     >>> feats = (
         SIC,      # ID of the original word form
-        NORM,     # ID of the normalized word form
+        STEM,     # ID of the stemmed word form
         CLUSTER,  # ID of the word's Brown cluster
         IS_TITLE, # Was the word title-cased?
         POS_TYPE  # A cluster ID describing what POS tags the word is usually assigned
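A rough sketch of how a feature tuple like the one above might be consumed
follows. It is an illustration only, assuming the 0.x-era ``EN.tokenize()``
call and a ``to_array()`` method on the returned tokens; neither appears in
this diff, so check the docs of your installed version:

    # Hedged sketch: turn lexicon-backed feature IDs into one row per token.
    # EN.tokenize() and tokens.to_array() are assumptions, not from this diff.
    from spacy.en import EN
    from spacy.feature_names import SIC, STEM, CLUSTER, IS_TITLE, POS_TYPE

    feats = (SIC, STEM, CLUSTER, IS_TITLE, POS_TYPE)

    tokens = EN.tokenize(u"Justin Bieber tweeted at his beliebers.")

    # Each row holds the feature IDs for one token, so a word that never
    # appeared in your training corpus still carries cluster, casing and
    # POS-type information drawn from the lexicon.
    feature_rows = tokens.to_array(feats)
    print(feature_rows.shape)   # expected: (number_of_tokens, len(feats))

Because the IDs come from the lexicon rather than the training corpus, the
goal is that even a rare word such as *belieber* arrives with useful CLUSTER,
IS_TITLE and POS_TYPE values.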
@@ -113,14 +125,6 @@ all to the special tokenization rules.
 
 spaCy's tokenizer is also incredibly efficient:
 
-+--------+---------------+--------------+
-| System | Tokens/second | Speed Factor |
-+--------+---------------+--------------+
-| NLTK   | 89 000        | 1.00         |
-+--------+---------------+--------------+
-| spaCy  | 3 093 000     | 38.30        |
-+--------+---------------+--------------+
-
 spaCy can create an inverted index of the 1.8 billion word Gigaword corpus,
 in under half an hour --- on a Macbook Air. See the `inverted
 index tutorial`_.
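The inverted index tutorial itself is not reproduced here; the sketch below
only illustrates the general idea, and assumes the 0.x-era ``EN.tokenize()``
call plus a ``token.sic`` integer ID for the original word form (both are
assumptions, not taken from this diff):

    # Hedged sketch of an inverted index: map each word-form ID to the set of
    # documents containing it. EN.tokenize() and token.sic are assumptions
    # about the 0.x-era API; any tokenizer yielding integer word IDs would do.
    from collections import defaultdict

    from spacy.en import EN


    def build_inverted_index(documents):
        """Return a dict mapping word-form ID -> set of document indices."""
        index = defaultdict(set)
        for doc_id, text in enumerate(documents):
            for token in EN.tokenize(text):
                index[token.sic].add(doc_id)  # assumed: ID of the raw word form
        return index


    docs = [u"Bieber fans call themselves beliebers.",
            u"An inverted index maps word IDs to document IDs."]
    index = build_inverted_index(docs)

Keying the index on integer word-form IDs rather than strings keeps it
compact, which matters at the scale of a corpus like Gigaword.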