spaCy is a library for industrial-strength text processing in Python and Cython.
It features extremely efficient, up-to-date algorithms, and a rethink of how those
algorithms should be accessed.

A typical text-processing API looks something like this:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize('''Some string of language.'''))
[('Some', 'DT'), ('string', 'VBG'), ('of', 'IN'), ('language', 'NN'), ('.', '.')]

This API often leaves you with a lot of busy-work. If you're doing some machine
learning or information extraction, all the strings have to be mapped to integers,
and you have to save and load the mapping at training and runtime. If you want
to display mark-up based on the annotation, you have to realign the tokens to your
original string.

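Concretely, the bookkeeping tends to look something like this. This is a minimal
sketch, and the Vocab class and file format here are invented for illustration:

.. code-block:: python

    import json

    class Vocab(object):
        '''Sketch of the string-to-integer bookkeeping you end up writing.'''
        def __init__(self):
            self.strings = {}

        def id_of(self, string):
            # Assign IDs first-come, first-served. They must be stable
            # across runs, so the table has to be saved and reloaded.
            return self.strings.setdefault(string, len(self.strings))

        def save(self, loc):
            with open(loc, 'w') as file_:
                json.dump(self.strings, file_)

    vocab = Vocab()
    tagged = [('Some', 'DT'), ('string', 'VBG'), ('of', 'IN')]
    ids = [(vocab.id_of(word), vocab.id_of(tag)) for word, tag in tagged]
    vocab.save('vocab.json')  # ...and remember to load it again at runtime.
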
I've been writing NLP systems for almost ten years now, so I've done these
things dozens of times. When designing spaCy, I thought carefully about how to
make the right thing easy.

We begin by initializing a global vocabulary store:

>>> from spacy.en import EN
>>> EN.load()

The vocabulary reads in a data file with all sorts of pre-computed lexical
features. You can load anything you like here, but by default I give you:

* String IDs for the word's string, its prefix, suffix and "shape";
* Length (in unicode code points);
* A cluster ID, representing distributional similarity;
* A second cluster ID, representing the word's typical POS tag distribution;
* A Good-Turing smoothed unigram probability;
* 64 boolean features, for assorted orthographic and distributional flags
  (see the sketch below).

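The boolean features can be packed into a single 64-bit integer and tested with
bit masks. Here's a sketch of the idea, with invented flag names:

.. code-block:: python

    # Sketch: packing boolean lexical flags into one 64-bit integer.
    IS_ALPHA = 1 << 0
    IS_DIGIT = 1 << 1
    IS_UPPER = 1 << 2

    def flags_of(word):
        flags = 0
        flags |= IS_ALPHA * word.isalpha()
        flags |= IS_DIGIT * word.isdigit()
        flags |= IS_UPPER * word.isupper()
        return flags

    # Computed once per word type, then any flag is a single AND away.
    flags = flags_of('NLP')
    print(bool(flags & IS_UPPER))   # True
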
With so many features pre-computed, you usually don't have to do any string
processing at all. You give spaCy your string, and tell it to give you either
a numpy array, or a counts dictionary:

>>> from spacy.en import feature_names as fn
>>> tokens = EN.tokenize(u'''Some string of language.''')
>>> tokens.to_array((fn.WORD, fn.SUFFIX, fn.CLUSTER))
...
>>> tokens.count_by(fn.WORD)

If you do need strings, you can simply iterate over the Tokens object:

>>> for token in tokens:
...

I mostly use this for debugging and testing.

spaCy returns these rich Tokens objects much faster than most other tokenizers
can give you a list of strings --- in fact, spaCy's POS tagger is *4 times
faster* than CoreNLP's tokenizer:

+----------+----------+---------------+----------+
| System   | Tokenize | POS Tag       |          |
+==========+==========+===============+==========+
| ZPar     |          | ~1,500s       |          |
+----------+----------+---------------+----------+

spaCy is designed to **make the right thing easy**, where the right thing is to:

* **Use rich distributional and orthographic features**. Without these, your model
  will be very brittle and domain dependent.

* **Compute features per type, not per token**. Because of Zipf's law, you can
  expect this to be exponentially more efficient (see the sketch below).

* **Minimize string processing**, and instead compute with arrays of ID ints.

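Here's a minimal sketch of the per-type idea, with an invented stand-in for the
real feature extraction: compute features once per unique word type, and serve
repeated tokens from a cache:

.. code-block:: python

    def expensive_features(word):
        '''Stand-in for real per-type features (clusters, probabilities, etc.).'''
        return (len(word), word[-3:], word.lower())

    def featurize(tokens):
        features = []
        cache = {}
        for word in tokens:
            # Zipf's law: a handful of frequent types cover most tokens, so
            # the cache hit-rate is high and expensive_features() runs rarely.
            if word not in cache:
                cache[word] = expensive_features(word)
            features.append(cache[word])
        return features

    featurize('the cat sat on the mat near the door'.split())
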
Tokenization done right
=======================

Most tokenizers rely on complicated regular expressions. Often, they leave you
with no way to align the tokens back to the original string --- a vital feature
if you want to display some mark-up, such as spelling correction. The regular
expressions also interact, making it hard to accommodate special cases.

spaCy introduces a **novel tokenization algorithm** that's much faster and much
more flexible:

.. code-block:: python

    def _consume_prefix(chunk, prefixes):
        '''Split off a leading punctuation character, if it's listed.'''
        if chunk and chunk[0] in prefixes:
            return chunk[0], chunk[1:]
        return '', chunk

    def _consume_suffix(chunk, suffixes):
        '''Split off a trailing punctuation character, if it's listed.'''
        if chunk and chunk[-1] in suffixes:
            return chunk[-1], chunk[:-1]
        return '', chunk

    def tokenize(string, prefixes, suffixes, specials):
        '''Sketch of spaCy's tokenization algorithm.'''
        tokens = []
        cache = {}
        for chunk in string.split():
            # Because of Zipf's law, the cache serves the majority of "chunks".
            if chunk in cache:
                tokens.extend(cache[chunk])
                continue
            key = chunk
            subtokens = []
            trailing = []
            # Process a chunk by splitting off prefixes e.g. ( " { and
            # suffixes e.g. , . : --- after each split, check whether we're
            # left with a special case, e.g. contractions (can't, won't, etc),
            # emoticons, abbreviations, etc. This makes the tokenization easy
            # to update and customize.
            while chunk:
                if chunk in specials:
                    subtokens.extend(specials[chunk])
                    break
                prefix, chunk = _consume_prefix(chunk, prefixes)
                if prefix:
                    subtokens.append(prefix)
                    continue
                suffix, chunk = _consume_suffix(chunk, suffixes)
                if suffix:
                    # Suffixes must follow the core token, in string order.
                    trailing.append(suffix)
                    continue
                # No prefix, suffix or special case: emit the rest whole.
                subtokens.append(chunk)
                break
            subtokens.extend(reversed(trailing))
            tokens.extend(subtokens)
            cache[key] = subtokens
        return tokens

Your data is going to have its own quirks, so it's really useful to have
a tokenizer you can easily control. To see the limitations of the standard
regex-based approach, check out `CMU's recent work on tokenizing tweets
<http://www.ark.cs.cmu.edu/TweetNLP/>`_. Despite a lot of careful attention,
they can't handle all of their known emoticons correctly --- doing so would
interfere with the way they process other punctuation. This isn't a problem
for spaCy: we just add them all to the special tokenization rules.

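For instance, with the sketch above, an emoticon that would otherwise be mangled
by the punctuation rules is just one more entry in the specials table. The rule
tables here are toy ones, invented for illustration:

.. code-block:: python

    prefixes = {'(', '"', '{'}
    suffixes = {',', '.', ':', ')'}
    specials = {"can't": ['ca', "n't"],
                ':-)': [':-)']}    # keep the emoticon as one token

    tokenize('''I can't believe it :-)''', prefixes, suffixes, specials)
    # ['I', 'ca', "n't", 'believe', 'it', ':-)']
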
Comparison with NLTK
====================

`NLTK <http://nltk.org>`_ provides interfaces to a wide variety of NLP
tools and resources, and its own implementations of a few algorithms. It comes
with comprehensive documentation, and a book introducing concepts in NLP. For
these reasons, it's very widely known. However, if you're trying to make money
or do cutting-edge research, NLTK is not a good choice.

The `list of stuff in NLTK <http://www.nltk.org/py-modindex.html>`_ looks impressive,
but almost none of it is useful for real work. You're not going to make any money,
or do top research, by using the NLTK chat bots, theorem provers, toy CCG implementation,
etc. Most of NLTK is there to assist in the explanation of ideas in computational
linguistics, at roughly an undergraduate level. But it also claims to support
serious work, by wrapping external tools.

In a pretty well-known essay, Joel Spolsky discusses the pain of dealing with
`leaky abstractions <http://www.joelonsoftware.com/articles/LeakyAbstractions.html>`_.
An abstraction tells you not to care about implementation
details, but sometimes the implementation matters after all. When it
does, you have to waste time revising your assumptions.

NLTK's wrappers call external tools via subprocesses, and wrap this up so
that it looks like a native API. This abstraction leaks *a lot*. The system
calls impose far more overhead than a normal Python function call, which makes
the most natural way to program against the API infeasible.

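You don't even need NLTK to see the scale of the problem. Compare a normal
function call with an equivalent round-trip through a fresh process, which is
roughly what a wrapper that holds no persistent pipe does on every call. This
is a minimal sketch, and the absolute numbers will vary by machine:

.. code-block:: python

    import subprocess
    import timeit

    def native_call(text):
        '''A normal, in-process function call.'''
        return text.split()

    def subprocess_call(text):
        '''Spawn a fresh interpreter and round-trip the text through it.'''
        proc = subprocess.Popen(
            ['python', '-c', 'import sys; sys.stdout.write(sys.stdin.read())'],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        output, _ = proc.communicate(text.encode('utf8'))
        return output

    text = 'Some string of language.'
    print(timeit.timeit(lambda: native_call(text), number=100))
    print(timeit.timeit(lambda: subprocess_call(text), number=100))
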
Case study: POS tagging
-----------------------

Here's a quick comparison of the following POS taggers:

* **Stanford (CLI)**: The Stanford POS tagger, invoked once as a batch process
  from the command-line;
* **nltk.tag.stanford**: The Stanford tagger, invoked document-by-document via
  NLTK's wrapper;
* **nltk.pos_tag**: NLTK's own POS tagger, invoked document-by-document;
* **spacy.en.pos_tag**: spaCy's POS tagger, invoked document-by-document.

+-------------------+-------------+--------+
| System            | Speed (w/s) | % Acc. |
+===================+=============+========+
| spaCy             | 107,000     | 96.7   |
+-------------------+-------------+--------+
| Stanford (CLI)    | 8,000       | 96.7   |
+-------------------+-------------+--------+
| nltk.pos_tag      | 543         | 94.0   |
+-------------------+-------------+--------+
| nltk.tag.stanford | 209         | 96.7   |
+-------------------+-------------+--------+

Experimental details TODO. Three things are apparent from this comparison:

1. The native NLTK tagger, nltk.pos_tag, is both slow and inaccurate;

2. Calling the Stanford tagger document-by-document via NLTK is **40x** slower
   than invoking the model once as a batch process, via the command-line;

3. spaCy is over 10x faster than the Stanford tagger, even when called
   **sentence-by-sentence**.

The problem is that NLTK simply wraps the command-line
interfaces of these tools, so communication is via a subprocess. NLTK does not
even hold open a pipe for you --- the model is reloaded, again and again.

To use the wrapper effectively, you should batch up your text as much as possible.
This probably isn't how you would like to structure your pipeline, and you
might not be able to batch up much text at all, e.g. if serving a single
request means processing a single document.

Technically, NLTK does give you Python functions to access lots of different
systems --- but you can't use them as you would expect to use a normal Python
function. The abstraction leaks.

Here's the bottom line: the Stanford tools are written in Java, so using them
from Python sucks. You shouldn't settle for this. It's a problem that springs
purely from the tooling, rather than the domain.

Summary
-------

NLTK is a well-known Python library for NLP, but for the important bits, you
don't get actual Python modules. You get wrappers which shell out to external
tools, via subprocesses. This is not at all the same thing.

spaCy is implemented in Cython, just like numpy, scikit-learn, lxml and other
high-performance Python libraries. So you get a native Python API, but the
performance you expect from a program written in C.

.. toctree::