
.. spaCy documentation master file, created by
   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

==============================
spaCy: Industrial-strength NLP
==============================


.. _Issue Tracker: https://github.com/honnibal/spaCy/issues

**2015-07-08**: `Version 0.88 released`_

.. _Version 0.88 released: updates.html

`spaCy`_ is a new library for text processing in Python and Cython.
I wrote it because I think small companies are terrible at
natural language processing (NLP).  Or rather:
small companies are using terrible NLP technology.

.. _spaCy: https://github.com/honnibal/spaCy/

To do great NLP, you have to know a little about linguistics, a lot
about machine learning, and almost everything about the latest research.
The people who fit this description seldom join small companies.
Most are broke --- they've just finished grad school.
If they don't want to stay in academia, they join Google, IBM, etc.

The net result is that outside of the tech giants, commercial NLP has changed
little in the last ten years.  In academia, it's changed entirely.  Amazing
improvements in quality.  Orders of magnitude faster.  But the
academic code is always GPL, undocumented, unusable, or all three.  You could
implement the ideas yourself, but the papers are hard to read, and training
data is exorbitantly expensive.  So what are you left with?  A common answer is
NLTK, which was written primarily as an educational resource.  Nothing past the
tokenizer is suitable for production use.

I used to think that the NLP community just needed to do more to communicate
its findings to software engineers.  So I wrote two blog posts, explaining
`how to write a part-of-speech tagger`_ and `parser`_.  Both were well received,
and there's been a bit of interest in `my research software`_ --- even though
it's entirely undocumented, and mostly unusable to anyone but me.

.. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop

.. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/

.. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/

So six months ago I quit my post-doc, and I've been working day and night on
spaCy since.  I'm now pleased to announce an alpha release.

If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
It's by far the fastest NLP software ever released.
The full processing pipeline completes in 20ms per document, including accurate
tagging and parsing.  All strings are mapped to integer IDs, tokens are linked
to embedded word representations, and a range of useful features are pre-calculated
and cached.

If none of that made any sense to you, here's the gist of it.  Computers don't
understand text.  This is unfortunate, because that's what the web almost entirely
consists of.  We want to recommend people text based on other text they liked.
We want to shorten text to display it on a mobile screen.  We want to aggregate
it, link it, filter it, categorise it, generate it and correct it.

spaCy provides a library of utility functions that help programmers build such
products.  It's commercial open source software: you can either use it under
the AGPL, or you can `buy a commercial license`_ for a one-time fee.

.. _buy a commercial license: license.html

Example functionality
---------------------

Let's say you're developing a proofreading tool, or possibly an IDE for
writers.  You're convinced by Stephen King's advice that `adverbs are not your
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
you want to **highlight all adverbs**.  We'll use one of the examples he finds
particularly egregious:

    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> # Load the pipeline, and call it with some text.
    >>> nlp = spacy.en.English()
    >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
    >>> print(u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’


Easy enough --- but the problem is that we've also highlighted "back".
While "back" is undoubtedly an adverb, we probably don't want to highlight it.
If what we're trying to do is flag dubious stylistic choices, we'll need to
refine our logic.  It turns out only a certain type of adverb is of interest to
us.

There are lots of ways we might do this, depending on just what words
we want to flag.  The simplest way to exclude adverbs like "back" and "not"
is by word frequency: these words are much more common than the prototypical
manner adverbs that the style guides are worried about.

The :py:attr:`Lexeme.prob` and :py:attr:`Token.prob` attributes give a
log probability estimate of the word:

   >>> nlp.vocab[u'back'].prob
   -7.403977394104004
   >>> nlp.vocab[u'not'].prob
   -5.407193660736084
   >>> nlp.vocab[u'quietly'].prob
   -11.07155704498291

(The probability estimate is based on counts from a 3 billion word corpus,
smoothed using the `Simple Good-Turing`_ method.)

.. _`Simple Good-Turing`: http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL02/Code/sgt-gale.pdf
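
To make those numbers a little more concrete, here's a minimal sketch of what a
log probability is, using unsmoothed relative frequencies over a toy corpus.
It is not the Simple Good-Turing estimate described above: the ``log_probs``
helper, the toy sentence and the choice of natural logarithm are all
assumptions made for illustration.

```python
import math
from collections import Counter


def log_probs(words):
    """Map each word to the natural log of its relative frequency.

    This is an unsmoothed maximum-likelihood estimate: unseen words get
    no entry at all, which is the gap smoothing methods try to fill.
    """
    counts = Counter(words)
    total = sum(counts.values())
    return {word: math.log(count / total) for word, count in counts.items()}


# Toy corpus, made up for illustration only.
toy_corpus = "he gave it back and she gave it back quietly".split()
table = log_probs(toy_corpus)

# More frequent words have log probabilities closer to zero.
print(sorted(table.items(), key=lambda item: item[1], reverse=True))
```

The values are always negative, and a rarer word gets a more negative value,
which is why "quietly" scores lower than "back" in the spaCy output above.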

So we can easily exclude the N most frequent words in English from our adverb
marker.  Let's try N=1000 for now:

    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> nlp = spacy.en.English()
    >>> # Find log probability of Nth most frequent word
    >>> probs = [lex.prob for lex in nlp.vocab]
    >>> probs.sort()
    >>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
    >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
    >>> print(u''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens))
    ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’
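
The cut-off only needs the Nth largest log probability, so sorting the whole
vocabulary isn't strictly necessary.  Here's a minimal sketch using the
standard library's ``heapq.nlargest`` on made-up values;
``nth_most_frequent_prob`` is a hypothetical helper, not part of spaCy.

```python
import heapq


def nth_most_frequent_prob(probs, n):
    """Return the log probability of the n-th most frequent word.

    Gives the same answer as sorted(probs)[-n], but only tracks the
    n largest values instead of sorting the whole list.
    """
    if not 1 <= n <= len(probs):
        raise ValueError("n must be between 1 and len(probs)")
    return heapq.nlargest(n, probs)[-1]


# Made-up log probabilities, for illustration only.
toy_probs = [-5.4, -7.4, -11.1, -3.2, -9.8, -6.0]
threshold = nth_most_frequent_prob(toy_probs, 3)

# Words strictly rarer than the threshold are the ones we would still flag.
flagged = [p for p in toy_probs if p < threshold]
print(threshold, flagged)
```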

There are lots of other ways we could refine the logic, depending on just what
words we want to flag.  Let's say we wanted to only flag adverbs that modified words
similar to "pleaded".  This is easy to do, as spaCy loads a vector-space
representation for every word (by default, the vectors produced by
`Levy and Goldberg (2014)`_).  Naturally, the vector is provided as a numpy
array:

    >>> pleaded = tokens[7]
    >>> pleaded.repvec.shape
    (300,)
    >>> pleaded.repvec[:5]
    array([ 0.04229792,  0.07459262,  0.00820188, -0.02181299,  0.07519238], dtype=float32)

.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

We want to sort the words in our vocabulary by their similarity to "pleaded".
There are lots of ways to measure the similarity of two vectors.  We'll use the
cosine metric:

    >>> from numpy import dot
    >>> from numpy.linalg import norm

    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
    >>> words = [w for w in nlp.vocab if w.has_repvec]
    >>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
    >>> words.reverse()
    >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
    1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
    >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
    50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
    >>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
    100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
    >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
    1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
    >>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
    50000-50010 fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists

As you can see, the similarity model that these vectors give us is excellent
--- we're still getting meaningful results at 1000 words, off a single
prototype!  The only problem is that the list really contains two clusters of
words: one associated with the legal meaning of "pleaded", and one for the more
general sense.  Sorting out these clusters is an area of active research.
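
Calling a Python lambda once per vocabulary entry gets slow as the vocabulary
grows.  If the vectors are stacked into a single matrix, the same ranking can
be computed in one pass.  The sketch below runs on random toy data and assumes
nothing about how spaCy stores its vectors; ``rank_by_cosine`` is a
hypothetical helper.

```python
import numpy as np


def rank_by_cosine(matrix, target):
    """Return (row indices sorted by similarity to target, similarities).

    Rows with zero norm would cause a division by zero, so this sketch
    assumes every row of `matrix` is non-zero.
    """
    # Dividing the dot products by both norms gives the cosine.
    row_norms = np.linalg.norm(matrix, axis=1)
    sims = matrix.dot(target) / (row_norms * np.linalg.norm(target))
    # argsort is ascending, so negate to put the most similar rows first.
    return np.argsort(-sims), sims


# Random stand-ins for 1000 word vectors of 300 dimensions.
rng = np.random.RandomState(0)
toy_vectors = rng.normal(size=(1000, 300)).astype(np.float32)
target = toy_vectors[42]

order, sims = rank_by_cosine(toy_vectors, target)
# A vector is always most similar to itself.
print(order[0], sims[order[0]])
```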


A simple work-around is to average the vectors of several words, and use that
as our target:

    >>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
    >>> say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)
    >>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
    >>> words.reverse()
    >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
    1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
    >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
    50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
    >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
    1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate

These definitely look like words that King might scold a writer for attaching
adverbs to.  Recall that our original adverb highlighting function looked like
this:

    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> # Load the pipeline, and call it with some text.
    >>> nlp = spacy.en.English()
    >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
    ...              tag=True, parse=False)
    >>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’


We wanted to refine the logic so that only adverbs modifying evocative verbs
of communication, like "pleaded", were highlighted.  We've now built a vector that
represents that type of word, so now we can highlight adverbs based on
subtle logic, homing in on adverbs that seem the most stylistically
problematic, given our starting assumptions:

    >>> import numpy
    >>> from numpy import dot
    >>> from numpy.linalg import norm
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV, VERB
    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
    >>> def is_bad_adverb(token, target_verb, tol):
    ...   if token.pos != ADV:
    ...     return False
    ...   elif token.head.pos != VERB:
    ...     return False
    ...   elif cosine(token.head.repvec, target_verb) < tol:
    ...     return False
    ...   else:
    ...     return True
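
The function above is defined but never called, so here's a sketch of how it
might be wired up.  To keep it runnable without the spaCy models, the tokens
are hypothetical stand-ins (a small ``ToyToken`` class with ``pos``, ``head``,
``repvec`` and ``string`` attributes), the part-of-speech constants are plain
strings, and the vectors are hand-made 3-dimensional toys.  The ``tol`` value
of 0.8 is an arbitrary assumption that you'd want to tune on real data.

```python
import numpy as np

# Stand-ins for the spacy.parts_of_speech constants (assumed, for illustration).
ADV, VERB = "ADV", "VERB"


class ToyToken(object):
    """Hypothetical stand-in for a parsed token; not spaCy's Token class."""

    def __init__(self, string, pos, repvec, head=None):
        self.string = string
        self.pos = pos
        self.repvec = np.asarray(repvec, dtype=np.float32)
        # A token with no explicit head is treated as its own head.
        self.head = head if head is not None else self


def cosine(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


def is_bad_adverb(token, target_vector, tol):
    """True for adverbs whose head is a verb close to the target vector."""
    return (token.pos == ADV
            and token.head.pos == VERB
            and cosine(token.head.repvec, target_vector) >= tol)


# Average a couple of toy "verbs of saying" to get the target vector.
say_vector = np.mean([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]], axis=0)

gave = ToyToken("Give ", VERB, [0.0, 1.0, 0.2])
pleaded = ToyToken("pleaded ", VERB, [1.0, 0.15, 0.05])
tokens = [
    gave,
    ToyToken("back ", ADV, [0.1, 0.1, 1.0], head=gave),
    pleaded,
    ToyToken("abjectly", ADV, [0.2, 0.3, 0.9], head=pleaded),
]

highlighted = "".join(
    tok.string.upper() if is_bad_adverb(tok, say_vector, tol=0.8) else tok.string
    for tok in tokens
)
print(highlighted)
```

Only the adverb attached to the verb near the target vector is upper-cased;
"back" is left alone because its head verb points in a different direction.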


This example was somewhat contrived --- and, truth be told, I've never really
bought the idea that adverbs were a grave stylistic sin.  But hopefully it got
the message across: the state-of-the-art NLP technologies are very powerful.
spaCy gives you easy and efficient access to them, which lets you build all
sorts of useful products and features that were previously impossible.


Independent Evaluation
----------------------

.. table:: Independent evaluation by Yahoo! Labs and Emory
  University, to appear at ACL 2015. Higher is better.
  
  +----------------+------------+------------+------------+
  | System         | Language   | Accuracy   | Speed      |
  +----------------+------------+------------+------------+
  | spaCy v0.86    | Cython     | 91.9       | **13,963** |
  +----------------+------------+------------+------------+
  | ClearNLP       | Java       | 91.7       | 10,271     |
  +----------------+------------+------------+------------+
  | spaCy v0.84    | Cython     | 90.9       | 13,963     |
  +----------------+------------+------------+------------+
  | CoreNLP        | Java       | 89.6       | 8,602      |
  +----------------+------------+------------+------------+
  | MATE           | Java       | **92.5**   | 550        |
  +----------------+------------+------------+------------+
  | Turbo          | C++        | 92.4       | 349        |
  +----------------+------------+------------+------------+
  | Yara           | Java       | 92.3       | 340        |
  +----------------+------------+------------+------------+

 
Accuracy is the percentage of unlabelled arcs that are correct; speed is tokens per second.

Joel Tetreault and Amanda Stent (Yahoo! Labs) and Jin-ho Choi (Emory) performed
a detailed comparison of the best parsers available.  All numbers above
are taken from the pre-print they kindly made available to me,
except for spaCy v0.86. 

I'm particularly grateful to the authors for discussion of their results, which
led to the improvement in accuracy between v0.84 and v0.86.  A tip from Jin-ho
(developer of ClearNLP) was particularly useful.


Detailed Speed Comparison
-------------------------

**Set up**: 100,000 plain-text documents were streamed from an SQLite3
database, and processed with an NLP library, to one of three levels of detail
--- tokenization, tagging, or parsing.  The tasks are additive: to parse the
text you have to tokenize and tag it.  The pre-processing was not subtracted
from the times: I report the time required for the whole pipeline to complete,
as a mean per document in milliseconds.

**Hardware**: Intel i7-3770 (2012)

.. table:: Per-document processing times.  Lower is better.

  +--------------+---------------------------+--------------------------------+
  |              | Absolute (ms per doc)     | Relative (to spaCy)            |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | System       | Tokenize | Tag    | Parse | Tokenize | Tag     | Parse     |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | spaCy        | 0.2ms    | 1ms    | 19ms  | 1x       | 1x      | 1x        |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | CoreNLP      | 2ms      | 10ms   | 49ms  | 10x      | 10x     | 2.6x      |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | ZPar         | 1ms      | 8ms    | 850ms | 5x       | 8x      | 44.7x     |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | NLTK         | 4ms      | 443ms  | n/a   | 20x      | 443x    |  n/a      |
  +--------------+----------+--------+-------+----------+---------+-----------+
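
If you want to run a comparison along these lines yourself, here's a minimal
sketch of the measurement loop: documents are streamed from SQLite and the mean
wall-clock time per document is reported in milliseconds.  The in-memory
database, the ``docs`` table and the ``str.split`` stand-in for a tokenizer are
all assumptions for illustration; this is not the script that produced the
numbers above.

```python
import sqlite3
import time


def mean_ms_per_doc(conn, process):
    """Stream documents from the `docs` table and time `process` on each.

    The clock covers the whole loop, so database reads are included,
    in line with a set-up where pre-processing is not subtracted.
    """
    n_docs = 0
    start = time.perf_counter()
    for (text,) in conn.execute("SELECT text FROM docs"):
        process(text)
        n_docs += 1
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / n_docs if n_docs else 0.0


# Build a small in-memory database of toy documents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (text TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?)",
    [("Give it back, he pleaded abjectly.",)] * 1000,
)

# str.split stands in for a real tokenizer, tagger or parser.
baseline = mean_ms_per_doc(conn, str.split)
print("%.4f ms per document" % baseline)
```

Dividing another system's figure by the baseline gives the relative columns
in the table.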


Efficiency is a major concern for NLP applications.  It is very common to hear
people say that they cannot afford more detailed processing, because their
datasets are too large.  This is a bad position to be in.  If you can't apply
detailed processing, you generally have to cobble together various heuristics.
This normally takes a few iterations, and what you come up with will usually be
brittle and difficult to reason about.

spaCy's parser is faster than most taggers, and its tokenizer is fast enough
for any workload.  And the tokenizer doesn't just give you a list
of strings.  A spaCy token is a pointer to a Lexeme struct, from which you can
access a wide range of pre-computed features, including embedded word
representations.
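
The idea of mapping strings to integer IDs and caching per-word features can
be sketched in a few lines of plain Python.  This only illustrates the
concept: ``ToyVocab``, ``ToyLexeme`` and the features chosen are hypothetical,
and the real Lexeme is a struct, not a Python class.

```python
from collections import namedtuple

# Features computed once per word type, then shared by every token.
ToyLexeme = namedtuple("ToyLexeme", "id orth lower is_alpha is_title length")


class ToyVocab(object):
    """Hypothetical vocabulary: maps strings to integer IDs and lexemes."""

    def __init__(self):
        self._ids = {}       # string -> integer ID
        self._lexemes = []   # integer ID -> ToyLexeme

    def __getitem__(self, orth):
        idx = self._ids.get(orth)
        if idx is None:
            # First sighting: assign the next ID and pre-compute features.
            idx = len(self._lexemes)
            self._ids[orth] = idx
            self._lexemes.append(ToyLexeme(
                id=idx, orth=orth, lower=orth.lower(),
                is_alpha=orth.isalpha(), is_title=orth.istitle(),
                length=len(orth),
            ))
        return self._lexemes[idx]


vocab = ToyVocab()
# A "document" becomes a list of references to shared lexemes.
doc = [vocab[word] for word in "Give it back , give it back".split()]
print([lex.id for lex in doc])
```

Repeated words resolve to the same object, so the feature work is paid once
per word type and not once per token.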

.. I wrote spaCy because I think existing commercial NLP engines are crap.
  Alchemy API are a typical example.  Check out this part of their terms of
  service:
  publish or perform any benchmark or performance tests or analysis relating to
  the Service or the use thereof without express authorization from AlchemyAPI;

.. Did you get that?  You're not allowed to evaluate how well their system works,
  unless you're granted a special exception.  Their system must be pretty
  terrible to motivate such an embarrassing restriction.
  They must know this makes them look bad, but they apparently believe allowing
  you to evaluate their product would make them look even worse!

.. spaCy is based on science, not alchemy.  It's open source, and I am happy to
  clarify any detail of the algorithms I've implemented.
  It's evaluated against the current best published systems, following the standard
  methodologies.  These evaluations show that it performs extremely well.
.. See `Benchmarks`_ for details.


.. toctree::
    :maxdepth: 4
    :hidden:

    quickstart.rst
    reference/index.rst
    license.rst
    updates.rst
-												* Re-add docs, sorting out mess from gh-pages

											
										
										
											2014-09-25 20:42:20 +04:00
+								.. spaCy documentation master file, created by
 								   sphinx-quickstart on Tue Aug 19 16:27:38 2014.
 								   You can adapt this file completely to your liking, but it should at least
 								   contain the root `toctree` directive.
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								==============================
 								spaCy: Industrial-strength NLP
 								==============================
-												* Re-add docs, sorting out mess from gh-pages

											
										
										
											2014-09-25 20:42:20 +04:00
-												* Advertise new version

											
										
										
											2015-01-31 15:05:17 +03:00
-												* Upd project status note

											
										
										
											2015-02-01 08:47:33 +03:00
+								.. _Issue Tracker: https://github.com/honnibal/spaCy/issues
-												* Announce v0.88

											
										
										
											2015-07-09 13:12:29 +03:00
+								**2015-07-08**: `Version 0.88 released`_
-												* Note new release in docs

											
										
										
											2015-06-08 02:47:06 +03:00
-												* Announce v0.87 in docs

											
										
										
											2015-07-01 16:36:41 +03:00
+								.. _Version 0.87 released: updates.html
-												* Adjust speed claim in index.rst

											
										
										
											2015-06-26 05:43:52 +03:00
-												* Merge index.rst

											
										
										
											2015-01-25 19:07:46 +03:00
+								`spaCy`_ is a new library for text processing in Python and Cython.
-												* Explain acronym

											
										
										
											2015-01-25 21:10:04 +03:00
+								I wrote it because I think small companies are terrible at
 								natural language processing (NLP).  Or rather:
-												* Rework intro text

											
										
										
											2015-01-24 16:58:52 +03:00
+								small companies are using terrible NLP technology.
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
-												Update index.rst

Fix broken links.
											
										
										
											2015-01-25 17:58:05 +03:00
+								.. _spaCy: https://github.com/honnibal/spaCy/
-												* Edits to docs

											
										
										
											2015-01-25 14:57:37 +03:00
-												* Rework intro text

											
										
										
											2015-01-24 16:58:52 +03:00
+								To do great NLP, you have to know a little about linguistics, a lot
 								about machine learning, and almost everything about the latest research.
-												* Minor edits to intro

											
										
										
											2015-01-24 17:06:14 +03:00
+								The people who fit this description seldom join small companies.
 								Most are broke --- they've just finished grad school.
-												* Rework intro text

											
										
										
											2015-01-24 16:58:52 +03:00
+								If they don't want to stay in academia, they join Google, IBM, etc.
-												* Update main page

											
										
										
											2015-01-23 15:11:16 +03:00
-												* Rework intro text

											
										
										
											2015-01-24 16:58:52 +03:00
+								The net result is that outside of the tech giants, commercial NLP has changed
 								little in the last ten years.  In academia, it's changed entirely.  Amazing
-												Use consistent sentence spacing within files

											
										
										
											2015-04-19 11:43:46 +03:00
+								improvements in quality.  Orders of magnitude faster.  But the
-												* Rework intro text

											
										
										
											2015-01-24 16:58:52 +03:00
+								academic code is always GPL, undocumented, unuseable, or all three.  You could
 								implement the ideas yourself, but the papers are hard to read, and training
-												* Upd index.rst

											
										
										
											2015-01-25 15:38:36 +03:00
+								data is exorbitantly expensive.  So what are you left with?  A common answer is
 								NLTK, which was written primarily as an educational resource.  Nothing past the
 								tokenizer is suitable for production use.
-												* Rework intro text

											
										
										
											2015-01-24 16:58:52 +03:00
 								I used to think that the NLP community just needed to do more to communicate
 								its findings to software engineers.  So I wrote two blog posts, explaining
-												Minor copyediting

											
										
										
											2015-04-19 11:56:32 +03:00
+								`how to write a part-of-speech tagger`_ and `parser`_.  Both were well received,
-												* Rework intro text

											
										
										
											2015-01-24 16:58:52 +03:00
+								and there's been a bit of interest in `my research software`_ --- even though
 								it's entirely undocumented, and mostly unuseable to anyone but me.
 								.. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop
 								.. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
 								.. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
 								So six months ago I quit my post-doc, and I've been working day and night on
 								spaCy since.  I'm now pleased to announce an alpha release.
-												* Update main page

											
										
										
											2015-01-23 15:11:16 +03:00
 								If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
-												* Minor edits to intro

											
										
										
											2015-01-24 17:06:14 +03:00
+								It's by far the fastest NLP software ever released.
-												* Adjust speed claim in index.rst

											
										
										
											2015-06-26 05:43:52 +03:00
+								The full processing pipeline completes in 20ms per document, including accurate
-												* Rework intro text

											
										
										
											2015-01-24 16:58:52 +03:00
+								tagging and parsing.  All strings are mapped to integer IDs, tokens are linked
 								to embedded word representations, and a range of useful features are pre-calculated
 								and cached.
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
 								If none of that made any sense to you, here's the gist of it.  Computers don't
-												Use consistent sentence spacing within files

											
										
										
											2015-04-19 11:43:46 +03:00
+								understand text.  This is unfortunate, because that's what the web almost entirely
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								consists of.  We want to recommend people text based on other text they liked.
 								We want to shorten text to display it on a mobile screen.  We want to aggregate
 								it, link it, filter it, categorise it, generate it and correct it.
-												* Update main page

											
										
										
											2015-01-23 15:11:16 +03:00
+								spaCy provides a library of utility functions that help programmers build such
 								products.  It's commercial open source software: you can either use it under
-												Remove trailing whitespace

											
										
										
											2015-04-19 11:31:31 +03:00
+								the AGPL, or you can `buy a commercial license`_ for a one-time fee.
-												* Update main page

											
										
										
											2015-01-23 15:11:16 +03:00
-												Update index.rst

Fix broken links.
											
										
										
											2015-01-25 17:58:05 +03:00
+								.. _buy a commercial license: license.html
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
 								Example functionality
 								---------------------
 								Let's say you're developing a proofreading tool, or possibly an IDE for
 								writers.  You're convinced by Stephen King's advice that `adverbs are not your
 								friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
-												* Improve example functionality, adding usage of word vectors

											
										
										
											2015-01-23 00:22:00 +03:00
+								you want to **highlight all adverbs**.  We'll use one of the examples he finds
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								particularly egregious:
 								    >>> import spacy.en
-												* Edits to docs

											
										
										
											2015-01-25 14:57:37 +03:00
+								    >>> from spacy.parts_of_speech import ADV
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								    >>> # Load the pipeline, and call it with some text.
 								    >>> nlp = spacy.en.English()
-												* Make corrections to example code

											
										
										
											2015-02-07 16:45:09 +03:00
+								    >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
 								    >>> print u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens)
 								    u‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
-												* Upd docs

											
										
										
											2014-12-09 08:08:01 +03:00
-												* Edits to docs

											
										
										
											2015-01-25 14:57:37 +03:00
+								Easy enough --- but the problem is that we've also highlighted "back".
 								While "back" is undoubtedly an adverb, we probably don't want to highlight it.
 								If what we're trying to do is flag dubious stylistic choices, we'll need to
 								refine our logic.  It turns out only a certain type of adverb is of interest to
 								us.
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
-												* Edits to docs

											
										
										
											2015-01-25 14:57:37 +03:00
+								There are lots of ways we might do this, depending on just what words
-												* Minor edits to index.rst

											
										
										
											2015-01-25 14:07:08 +03:00
+								we want to flag.  The simplest way to exclude adverbs like "back" and "not"
-												* Improve example functionality, adding usage of word vectors

											
										
										
											2015-01-23 00:22:00 +03:00
+								is by word frequency: these words are much more common than the prototypical
 								manner adverbs that the style guides are worried about.
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
-												* Minor edits to index.rst

											
										
										
											2015-01-25 14:07:08 +03:00
+								The :py:attr:`Lexeme.prob` and :py:attr:`Token.prob` attribute gives a
 								log probability estimate of the word:
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
-												* Make corrections to example code

											
										
										
											2015-02-07 16:45:09 +03:00
+								   >>> nlp.vocab[u'back'].prob
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								   -7.403977394104004
-												* Make corrections to example code

											
										
										
											2015-02-07 16:45:09 +03:00
+								   >>> nlp.vocab[u'not'].prob
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								   -5.407193660736084
-												* Make corrections to example code

											
										
										
											2015-02-07 16:45:09 +03:00
+								   >>> nlp.vocab[u'quietly'].prob
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								   -11.07155704498291
-												* Minor edits to index.rst

											
										
										
											2015-01-25 14:07:08 +03:00
+								(The probability estimate is based on counts from a 3 billion word corpus,
-												* Edits to docs

											
										
										
											2015-01-25 14:57:37 +03:00
+								smoothed using the `Simple Good-Turing`_ method.)
-												* Minor edits to index.rst

											
										
										
											2015-01-25 14:07:08 +03:00
 								.. _`Simple Good-Turing`: http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL02/Code/sgt-gale.pdf
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								So we can easily exclude the N most frequent words in English from our adverb
 								marker.  Let's try N=1000 for now:
 								    >>> import spacy.en
-												* Edits to docs

											
										
										
											2015-01-25 14:57:37 +03:00
+								    >>> from spacy.parts_of_speech import ADV
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								    >>> nlp = spacy.en.English()
 								    >>> # Find log probability of Nth most frequent word
 								    >>> probs = [lex.prob for lex in nlp.vocab]
-												* Edits to docs

											
										
										
											2015-01-25 14:57:37 +03:00
+								    >>> probs.sort()
 								    >>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
-												* Make corrections to example code

											
										
										
											2015-02-07 16:45:09 +03:00
+								    >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
 								    >>> print u''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens)
-												* Impove index docs

											
										
										
											2015-01-15 23:08:35 +03:00
+								    ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’
There are lots of other ways we could refine the logic, depending on just what
words we want to flag.  Let's say we wanted to only flag adverbs that modified words
similar to "pleaded".  This is easy to do, as spaCy loads a vector-space
representation for every word (by default, the vectors produced by
`Levy and Goldberg (2014)`_).  Naturally, the vector is provided as a numpy
array:

    >>> pleaded = tokens[7]
    >>> pleaded.repvec.shape
    (300,)
    >>> pleaded.repvec[:5]
    array([ 0.04229792,  0.07459262,  0.00820188, -0.02181299,  0.07519238], dtype=float32)

.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

We want to sort the words in our vocabulary by their similarity to "pleaded".
There are lots of ways to measure the similarity of two vectors.  We'll use the
cosine metric:

    >>> from numpy import dot
    >>> from numpy.linalg import norm
    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
    >>> words = [w for w in nlp.vocab if w.has_repvec]
    >>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
    >>> words.reverse()
    >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
    1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
    >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
    50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
    >>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
    100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
    >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
    1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
    >>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
    50000-50010 fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists

As you can see, the similarity model that these vectors give us is excellent
--- we're still getting meaningful results at 1000 words, off a single
prototype!  The only problem is that the list really contains two clusters of
words: one associated with the legal meaning of "pleaded", and one for the more
general sense.  Sorting out these clusters is an area of active research.

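If you don't have the model data to hand, the ranking step is easy to try on
made-up vectors.  The sketch below is not spaCy code: the three-dimensional
vectors and the ``rank_by_similarity`` helper are invented for illustration,
and only the cosine formula is the same as in the example above.

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def rank_by_similarity(vectors, target):
    """Return the words in `vectors`, most similar to `target` first."""
    return sorted(vectors, key=lambda w: cosine(vectors[w], target), reverse=True)

# Toy vectors, invented for this sketch (the real ones have 300 dimensions).
toy_vectors = {
    'pleaded': [0.9, 0.1, 0.0],
    'begged':  [0.8, 0.2, 0.1],
    'puck':    [0.0, 0.1, 0.9],
}
ranking = rank_by_similarity(toy_vectors, toy_vectors['pleaded'])
print(ranking)
```

Passing ``reverse=True`` to ``sorted`` does much the same job as the separate
``words.reverse()`` call in the session above.
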
A simple work-around is to average the vectors of several words, and use that
as our target:

    >>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
    >>> say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)
    >>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
    >>> words.reverse()
    >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
    1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
    >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
    50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
    >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
    1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate

These definitely look like words that King might scold a writer for attaching
adverbs to.  Recall that our original adverb highlighting function looked like
this:

    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> # Load the pipeline, and call it with some text.
    >>> nlp = spacy.en.English()
    >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
    ...              tag=True, parse=False)
    >>> print(u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’

We wanted to refine the logic so that only adverbs modifying evocative verbs
of communication, like "pleaded", were highlighted.  We've now built a vector that
represents that type of word, so we can highlight adverbs based on
subtle logic, homing in on adverbs that seem the most stylistically
problematic, given our starting assumptions:

    >>> import numpy
    >>> from numpy import dot
    >>> from numpy.linalg import norm
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV, VERB
    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
    >>> def is_bad_adverb(token, target_verb, tol):
    ...   if token.pos != ADV:
    ...     return False
    ...   elif token.head.pos != VERB:
    ...     return False
    ...   elif cosine(token.head.repvec, target_verb) < tol:
    ...     return False
    ...   else:
    ...     return True

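To see the three checks in action without loading a model, here is a sketch
with stand-in objects.  Everything in it besides the logic of
``is_bad_adverb`` is an assumption made for illustration: ``FakeToken``, the
integer ``ADV`` and ``VERB`` codes, the two-dimensional vectors and the ``0.8``
tolerance are all invented, and real spaCy tokens are not built this way.

```python
from collections import namedtuple
from math import sqrt

# Stand-ins for spaCy's part-of-speech constants (the values are arbitrary).
ADV, VERB = 1, 2

# Hypothetical token type: just the three attributes the function reads.
FakeToken = namedtuple('FakeToken', ['pos', 'head', 'repvec'])

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2)))

def is_bad_adverb(token, target_verb, tol):
    """The same three checks as above, collapsed into one expression."""
    return (token.pos == ADV
            and token.head.pos == VERB
            and cosine(token.head.repvec, target_verb) >= tol)

say_vector = [1.0, 0.0]                       # invented "verb of saying" target
pleaded = FakeToken(VERB, None, [0.9, 0.1])   # close to the target
walked = FakeToken(VERB, None, [0.1, 0.9])    # far from the target
abjectly = FakeToken(ADV, pleaded, [0.0, 0.0])
slowly = FakeToken(ADV, walked, [0.0, 0.0])

print(is_bad_adverb(abjectly, say_vector, 0.8))  # adverb on a "saying" verb
print(is_bad_adverb(slowly, say_vector, 0.8))    # adverb on an unrelated verb
print(is_bad_adverb(pleaded, say_vector, 0.8))   # not an adverb at all
```

The tolerance is the knob to experiment with: raising it flags fewer adverbs,
lowering it flags more.
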
This example was somewhat contrived --- and, truth be told, I've never really
bought the idea that adverbs were a grave stylistic sin.  But hopefully it got
the message across: the state-of-the-art NLP technologies are very powerful.
spaCy gives you easy and efficient access to them, which lets you build all
sorts of useful products and features that were previously impossible.

Independent Evaluation
----------------------

.. table:: Independent evaluation by Yahoo! Labs and Emory
  University, to appear at ACL 2015. Higher is better.

  +----------------+------------+------------+------------+
  | System         | Language   | Accuracy   | Speed      |
  +----------------+------------+------------+------------+
  | spaCy v0.86    | Cython     | 91.9       | **13,963** |
  +----------------+------------+------------+------------+
  | ClearNLP       | Java       | 91.7       | 10,271     |
  +----------------+------------+------------+------------+
  | spaCy v0.84    | Cython     | 90.9       | 13,963     |
  +----------------+------------+------------+------------+
  | CoreNLP        | Java       | 89.6       | 8,602      |
  +----------------+------------+------------+------------+
  | MATE           | Java       | **92.5**   | 550        |
  +----------------+------------+------------+------------+
  | Turbo          | C++        | 92.4       | 349        |
  +----------------+------------+------------+------------+
  | Yara           | Java       | 92.3       | 340        |
  +----------------+------------+------------+------------+

Accuracy is % unlabelled arcs correct, speed is tokens per second.

Joel Tetreault and Amanda Stent (Yahoo! Labs) and Jin-ho Choi (Emory) performed
a detailed comparison of the best parsers available.  All numbers above
are taken from the pre-print they kindly made available to me,
except for spaCy v0.86.

I'm particularly grateful to the authors for discussion of their results, which
led to the improvement in accuracy between v0.84 and v0.86.  A tip from Jin-ho
(developer of ClearNLP) was particularly useful.

Detailed Speed Comparison
-------------------------

**Set up**: 100,000 plain-text documents were streamed from an SQLite3
database, and processed with an NLP library, to one of three levels of detail
--- tokenization, tagging, or parsing.  The tasks are additive: to parse the
text you have to tokenize and tag it.  The pre-processing was not subtracted
from the times --- I report the time required for the pipeline to complete.
I report mean times per document, in milliseconds.

**Hardware**: Intel i7-3770 (2012)

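The measurement itself can be sketched in a few lines.  This is not the script
used for the numbers reported here; it is a minimal illustration of the set-up
described above, in which the table name ``docs``, the ``mean_ms_per_doc``
helper and the whitespace ``tokenize`` stand-in are all assumptions.  Note that
the clock keeps running while rows are fetched, so the time spent streaming
is included in the result.

```python
import sqlite3
import time

def mean_ms_per_doc(conn, process):
    """Stream every document through `process`; return (mean ms, doc count)."""
    n_docs = 0
    start = time.time()
    # Iterating the cursor fetches rows one at a time, inside the timed region.
    for (text,) in conn.execute('SELECT text FROM docs'):
        process(text)
        n_docs += 1
    elapsed = time.time() - start
    return 1000.0 * elapsed / n_docs, n_docs

# A tiny in-memory database standing in for the 100,000-document corpus.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (text TEXT)')
conn.executemany('INSERT INTO docs VALUES (?)',
                 [('Give it back, he pleaded abjectly.',)] * 1000)

tokenize = lambda text: text.split()   # placeholder for a real pipeline
mean_ms, n_docs = mean_ms_per_doc(conn, tokenize)
print('%d docs, %.4f ms per doc' % (n_docs, mean_ms))
```

Swapping ``tokenize`` for a tagging or parsing function gives the other two
levels of detail, each of which includes the cost of the levels below it.
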
.. table:: Per-document processing times.  Lower is better.

  +--------------+---------------------------+--------------------------------+
  |              | Absolute (ms per doc)     | Relative (to spaCy)            |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | System       | Tokenize | Tag    | Parse | Tokenize | Tag     | Parse     |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | spaCy        | 0.2ms    | 1ms    | 19ms  | 1x       | 1x      | 1x        |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | CoreNLP      | 2ms      | 10ms   | 49ms  | 10x      | 10x     | 2.6x      |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | ZPar         | 1ms      | 8ms    | 850ms | 5x       | 8x      | 44.7x     |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | NLTK         | 4ms      | 443ms  | n/a   | 20x      | 443x    |  n/a      |
  +--------------+----------+--------+-------+----------+---------+-----------+

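The relative columns are simply each system's time divided by spaCy's time for
the same task.  As a quick check of that arithmetic, using only the absolute
figures from the table (the ``relative`` helper and the dictionary layout are
mine, not part of any library):

```python
# Absolute times in ms per document, copied from the table above.
# None marks a task for which the table reports n/a.
absolute = {
    'spaCy':   {'tokenize': 0.2, 'tag': 1, 'parse': 19},
    'CoreNLP': {'tokenize': 2, 'tag': 10, 'parse': 49},
    'ZPar':    {'tokenize': 1, 'tag': 8, 'parse': 850},
    'NLTK':    {'tokenize': 4, 'tag': 443, 'parse': None},
}

def relative(system, task, baseline='spaCy'):
    """Slow-down factor of `system` on `task`, relative to `baseline`."""
    ms = absolute[system][task]
    if ms is None:
        return None
    return round(float(ms) / absolute[baseline][task], 1)

for system in ['CoreNLP', 'ZPar', 'NLTK']:
    print(system, [relative(system, task) for task in ('tokenize', 'tag', 'parse')])
```

Rounded to one decimal place, this reproduces the 2.6x and 44.7x parse figures
for CoreNLP and ZPar.
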
Efficiency is a major concern for NLP applications.  It is very common to hear
people say that they cannot afford more detailed processing, because their
datasets are too large.  This is a bad position to be in.  If you can't apply
detailed processing, you generally have to cobble together various heuristics.
This normally takes a few iterations, and what you come up with will usually be
brittle and difficult to reason about.

spaCy's parser is faster than most taggers, and its tokenizer is fast enough
for any workload.  And the tokenizer doesn't just give you a list
of strings.  A spaCy token is a pointer to a Lexeme struct, from which you can
access a wide range of pre-computed features, including embedded word
representations.

.. I wrote spaCy because I think existing commercial NLP engines are crap.
  Alchemy API are a typical example.  Check out this part of their terms of
  service:
  publish or perform any benchmark or performance tests or analysis relating to
  the Service or the use thereof without express authorization from AlchemyAPI;

.. Did you get that?  You're not allowed to evaluate how well their system works,
  unless you're granted a special exception.  Their system must be pretty
  terrible to motivate such an embarrassing restriction.
  They must know this makes them look bad, but they apparently believe allowing
  you to evaluate their product would make them look even worse!

.. spaCy is based on science, not alchemy.  It's open source, and I am happy to
  clarify any detail of the algorithms I've implemented.
  It's evaluated against the current best published systems, following the standard
  methodologies.  These evaluations show that it performs extremely well.

.. See `Benchmarks`_ for details.

.. toctree::
    :maxdepth: 4
    :hidden:

    quickstart.rst
    reference/index.rst
    license.rst
    updates.rst