.. spaCy documentation master file, created by
   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

================================
spaCy NLP Tokenizer and Lexicon
================================

spaCy is a library for industrial-strength NLP in Python and Cython. spaCy's
take on NLP is that it's mostly about feature extraction --- that's the part
that's specific to NLP, so that's what an NLP library should focus on.
It should tell you what the current best practice is, and help you do exactly
that, quickly and efficiently.

Best practice is to **use lots of large lexicons**. Let's say you hit the word
*belieber* in production. What will your system know about this word? A bad
system will only know things about the words in its training corpus, which
probably consists of texts written before Justin Bieber was even born.
It doesn't have to be like that.

Unique Lexicon-centric design
=============================

spaCy helps you build models that generalise better, by making it easy to use
more robust features. Instead of a list of strings, the tokenizer returns a
sequence of references to rich lexical types. Features which ask about the
word's Brown cluster, its typical part-of-speech tag, how it's usually cased
and so on require no extra effort:

>>> from spacy.en import EN
>>> from spacy.feature_names import *
>>> feats = (
...     SIC,       # ID of the original word form
...     NORM,      # ID of the normalized word form
...     CLUSTER,   # ID of the word's Brown cluster
...     IS_TITLE,  # Was the word title-cased?
...     POS_TYPE,  # A cluster ID describing what POS tags the word is usually assigned
... )
>>> tokens = EN.tokenize(u'Split words, punctuation, emoticons etc.! ^_^')
>>> tokens.to_array(feats)[:5]
array([[ 1,  2,  3,  4,  5],
       [...],
       [...],
       [...],
       [...]])

spaCy is designed to **make the right thing easy**, where the right thing is to:

* **Use rich distributional and orthographic features**. Without these, your model
  will be very brittle and domain dependent.

* **Compute features per type, not per token**. Because of Zipf's law, you can
  expect this to be far more efficient (see the sketch after this list).

* **Minimize string processing**, and instead compute with arrays of ID ints.

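The second point deserves a quick illustration. Here is a minimal sketch of the
idea --- not spaCy's implementation; the function names and features below are
made up for the example:

.. code-block:: python

    # Compute features once per distinct word form ("type") and reuse them for
    # every occurrence ("token"). The expensive work then scales with the
    # vocabulary size, not with the length of the text.
    def extract_features(word):
        # Stand-in for a costly feature function (cluster lookup, shape, etc.)
        return (word.lower(), word.istitle(), word.isdigit())

    def featurize(words):
        cache = {}
        features = []
        for word in words:
            if word not in cache:
                cache[word] = extract_features(word)
            features.append(cache[word])
        return features

    words = "the cat sat on the mat near the other cat".split()
    feats = featurize(words)
    # extract_features ran only once per distinct word --- 7 calls for 10 tokens.
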
For the current list of lexical features, see `Lexical Features`_.

.. _lexical features: features.html

Tokenization done right
=======================

Most tokenizers rely on complicated regular expressions. Often, they leave you
with no way to align the tokens back to the original string --- a vital feature
if you want to display some mark-up, such as spelling correction. The regular
expressions also interact, making it hard to accommodate special cases.

spaCy introduces a **novel tokenization algorithm** that's much faster and much
more flexible:

.. code-block:: python

    def tokenize(string, prefixes={}, suffixes={}, specials={}):
        '''Sketch of spaCy's tokenization algorithm.'''
        tokens = []
        cache = {}
        for chunk in string.split():
            # Because of Zipf's law, the cache serves the majority of "chunks".
            if chunk in cache:
                tokens.extend(cache[chunk])
                continue
            key = chunk

            subtokens = []
            # Process a chunk by splitting off prefixes e.g. ( " { and suffixes e.g. , . :
            # If we split one off, check whether we're left with a special case,
            # e.g. contractions (can't, won't, etc), emoticons, abbreviations, etc.
            # This makes the tokenization easy to update and customize.
            # (A full implementation holds suffixes back and re-attaches them after
            # the remaining chunk; this sketch glosses over that ordering.)
            while chunk:
                prefix, chunk = _consume_prefix(chunk, prefixes)
                if prefix:
                    subtokens.append(prefix)
                    if chunk in specials:
                        subtokens.extend(specials[chunk])
                        break
                suffix, chunk = _consume_suffix(chunk, suffixes)
                if suffix:
                    subtokens.append(suffix)
                    if chunk in specials:
                        subtokens.extend(specials[chunk])
                        break
                if not prefix and not suffix:
                    # Nothing left to split off --- what remains is a single token.
                    subtokens.append(chunk)
                    break
            cache[key] = subtokens
            tokens.extend(subtokens)
        return tokens

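``_consume_prefix`` and ``_consume_suffix`` are left undefined in the sketch.
A minimal way to fill them in, assuming the prefix and suffix tables hold single
punctuation characters (an illustration, not spaCy's actual rule format):

.. code-block:: python

    def _consume_prefix(chunk, prefixes):
        '''Split off one leading prefix character, if present.'''
        if chunk and chunk[0] in prefixes:
            return chunk[0], chunk[1:]
        return None, chunk

    def _consume_suffix(chunk, suffixes):
        '''Split off one trailing suffix character, if present.'''
        if chunk and chunk[-1] in suffixes:
            return chunk[-1], chunk[:-1]
        return None, chunk
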
Your data is going to have its own quirks, so it's really useful to have
a tokenizer you can easily control. To see the limitations of the standard
regex-based approach, check out `CMU's recent work on tokenizing tweets
<http://www.ark.cs.cmu.edu/TweetNLP/>`_. Despite a lot of careful attention,
they can't handle all of their known emoticons correctly --- doing so would
interfere with the way they process other punctuation. This isn't a problem
for spaCy: we just add them all to the special tokenization rules.

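For instance, a special-case table for the sketch above is just a dictionary
mapping raw chunks to the token sequences they should produce (the entries
below are illustrative, not spaCy's actual data):

.. code-block:: python

    specials = {
        "can't": ["ca", "n't"],
        "won't": ["wo", "n't"],
        "U.S.":  ["U.S."],
        "^_^":   ["^_^"],
        ":-)":   [":-)"],
    }

Handling a new emoticon or abbreviation is a one-line change to this table.
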
spaCy's tokenizer is also incredibly efficient:

+--------+---------------+--------------+
| System | Tokens/second | Speed Factor |
+--------+---------------+--------------+
| NLTK   | 89,000        | 1.00         |
+--------+---------------+--------------+
| spaCy  | 3,093,000     | 38.30        |
+--------+---------------+--------------+

spaCy can create an inverted index of the 1.8 billion word Gigaword corpus
in under half an hour --- on a MacBook Air. See the `inverted index
tutorial`_.

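The tutorial walks through the details; the core idea is roughly the following
sketch, which assumes ``EN.tokenize`` and ``to_array`` behave as in the feature
example above and leaves out document iteration and storage:

.. code-block:: python

    from collections import defaultdict

    from spacy.en import EN
    from spacy.feature_names import SIC

    def build_index(documents):
        '''Map each word-form ID to the set of documents it occurs in.'''
        index = defaultdict(set)
        for doc_id, text in enumerate(documents):
            tokens = EN.tokenize(text)
            for word_id in tokens.to_array((SIC,))[:, 0]:
                index[int(word_id)].add(doc_id)
        return index
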
.. _inverted index tutorial: index_tutorial.html

Comparison with NLTK
====================

`NLTK <http://nltk.org>`_ provides interfaces to a wide variety of NLP
tools and resources, and its own implementations of a few algorithms. It comes
with comprehensive documentation, and a book introducing concepts in NLP. For
these reasons, it's very widely known. However, if you're trying to make money
or do cutting-edge research, NLTK is not a good choice.

The `list of stuff in NLTK <http://www.nltk.org/py-modindex.html>`_ looks impressive,
but almost none of it is useful for real work. You're not going to make any money,
or do top research, by using the NLTK chat bots, theorem provers, toy CCG implementation,
etc. Most of NLTK is there to assist in the explanation of ideas in computational
linguistics, at roughly an undergraduate level. But it also claims to support
serious work, by wrapping external tools.

In a pretty well-known essay, Joel Spolsky discusses the pain of dealing with
`leaky abstractions <http://www.joelonsoftware.com/articles/LeakyAbstractions.html>`_.
An abstraction tells you not to care about implementation
details, but sometimes the implementation matters after all. When it
does, you have to waste time revising your assumptions.

NLTK's wrappers call external tools via subprocesses, and wrap this up so
that it looks like a native API. This abstraction leaks *a lot*. The system
calls impose far more overhead than a normal Python function call, which makes
the most natural way to program against the API infeasible.

Case study: POS tagging
-----------------------

Here's a quick comparison of the following POS taggers:

* **Stanford (CLI)**: The Stanford POS tagger, invoked once as a batch process
  from the command-line;
* **nltk.tag.stanford**: The Stanford tagger, invoked document-by-document via
  NLTK's wrapper;
* **nltk.pos_tag**: NLTK's own POS tagger, invoked document-by-document;
* **spacy.en.pos_tag**: spaCy's POS tagger, invoked document-by-document.

+-------------------+-------------------+--------------+
| System            | Speed (words/sec) | Accuracy (%) |
+-------------------+-------------------+--------------+
| spaCy             | 107,000           | 96.7         |
+-------------------+-------------------+--------------+
| Stanford (CLI)    | 8,000             | 96.7         |
+-------------------+-------------------+--------------+
| nltk.pos_tag      | 543               | 94.0         |
+-------------------+-------------------+--------------+
| nltk.tag.stanford | 209               | 96.7         |
+-------------------+-------------------+--------------+

Experimental details TODO. Three things are apparent from this comparison:

1. The native NLTK tagger, nltk.pos_tag, is both slow and inaccurate;

2. Calling the Stanford tagger document-by-document via NLTK is **40x** slower
   than invoking the model once as a batch process, via the command-line;

3. spaCy is over 10x faster than the Stanford tagger, even when called
   **sentence-by-sentence**.

The problem is that NLTK simply wraps the command-line
interfaces of these tools, so communication is via a subprocess. NLTK does not
even hold open a pipe for you --- the model is reloaded, again and again.

To use the wrapper effectively, you should batch up your text as much as possible.
This probably isn't how you would like to structure your pipeline, and you
might not be able to batch up much text at all, e.g. if serving a single
request means processing a single document.

Technically, NLTK does give you Python functions to access lots of different
systems --- but you can't use them as you would expect to use a normal Python
function. The abstraction leaks.

Here's the bottom line: the Stanford tools are written in Java, so using them
from Python sucks. You shouldn't settle for this. It's a problem that springs
purely from the tooling, rather than the domain.

Summary
-------

NLTK is a well-known Python library for NLP, but for the important bits, you
don't get actual Python modules. You get wrappers which hand off to external
tools, via subprocesses. This is not at all the same thing.

spaCy is implemented in Cython, just like numpy, scikit-learn, lxml and other
high-performance Python libraries. So you get a native Python API, but the
performance you expect from a program written in C.

.. toctree::
    :hidden:
    :maxdepth: 3

    features.rst
    license_stories.rst