* Work on docs for v0.89

2015-07-29 22:34:10 +02:00 · 2bcb58456d
commit 2bcb58456d
parent 320836e346
5 changed files with 168 additions and 319 deletions
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@ -7,56 +7,18 @@
 spaCy: Industrial-strength NLP
 ==============================
-
+`spaCy`_ is a library for building tomorrow's language technology products.
-.. _Issue Tracker: https://github.com/honnibal/spaCy/issues
+It's like Stanford's CoreNLP for Python, but with a fundamentally different
-
+objective.  While CoreNLP is primarily built for conducting research, spaCy is
-**2015-07-08**: `Version 0.88 released`_
+designed for application.
 .. _Version 0.87 released: updates.html
 `spaCy`_ is a new library for text processing in Python and Cython.
 I wrote it because I think small companies are terrible at
 natural language processing (NLP).  Or rather:
 small companies are using terrible NLP technology.
 .. _spaCy: https://github.com/honnibal/spaCy/
 To do great NLP, you have to know a little about linguistics, a lot
 about machine learning, and almost everything about the latest research.
 The people who fit this description seldom join small companies.
 Most are broke --- they've just finished grad school.
 If they don't want to stay in academia, they join Google, IBM, etc.
 The net result is that outside of the tech giants, commercial NLP has changed
 little in the last ten years.  In academia, it's changed entirely.  Amazing
 improvements in quality.  Orders of magnitude faster.  But the
 academic code is always GPL, undocumented, unusable, or all three.  You could
 implement the ideas yourself, but the papers are hard to read, and training
 data is exorbitantly expensive.  So what are you left with?  A common answer is
 NLTK, which was written primarily as an educational resource.  Nothing past the
 tokenizer is suitable for production use.
 I used to think that the NLP community just needed to do more to communicate
 its findings to software engineers.  So I wrote two blog posts, explaining
 `how to write a part-of-speech tagger`_ and `parser`_.  Both were well received,
 and there's been a bit of interest in `my research software`_ --- even though
 it's entirely undocumented, and mostly unusable to anyone but me.
 .. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop
 .. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
 .. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
 So six months ago I quit my post-doc, and I've been working day and night on
 spaCy since.  I'm now pleased to announce an alpha release.
 If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
 It's by far the fastest NLP software ever released.
-The full processing pipeline completes in 20ms per document, including accurate
+The full processing pipeline completes in under 50ms per document, including accurate
-tagging and parsing.  All strings are mapped to integer IDs, tokens are linked
+tagging, entity recognition and parsing.  All strings are mapped to integer IDs,
-to embedded word representations, and a range of useful features are pre-calculated
+tokens are linked to embedded word representations, and a range of useful features
-and cached.
+are pre-calculated and cached.  The full analysis can be exported to numpy
 arrays, or losslessly serialized into binary data smaller than the raw text.
 If none of that made any sense to you, here's the gist of it.  Computers don't
 understand text.  This is unfortunate, because that's what the web almost entirely
@ -68,267 +30,17 @@ spaCy provides a library of utility functions that help programmers build such
 products.  It's commercial open source software: you can either use it under
 the AGPL, or you can `buy a commercial license`_ for a one-time fee.
 .. _spaCy: https://github.com/honnibal/spaCy/
 .. _Issue Tracker: https://github.com/honnibal/spaCy/issues
 **2015-07-28**: `Version 0.89 released`_
 .. _Version 0.89 released: updates.html
 .. _buy a commercial license: license.html
 Example functionality
 ---------------------
 Let's say you're developing a proofreading tool, or possibly an IDE for
 writers.  You're convinced by Stephen King's advice that `adverbs are not your
 friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
 you want to **highlight all adverbs**.  We'll use one of the examples he finds
 particularly egregious:
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> # Load the pipeline, and call it with some text.
    >>> nlp = spacy.en.English()
    >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
    >>> print u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens)
    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
 Easy enough --- but the problem is that we've also highlighted "back".
 While "back" is undoubtedly an adverb, we probably don't want to highlight it.
 If what we're trying to do is flag dubious stylistic choices, we'll need to
 refine our logic.  It turns out only a certain type of adverb is of interest to
 us.
 There are lots of ways we might do this, depending on just what words
 we want to flag.  The simplest way to exclude adverbs like "back" and "not"
 is by word frequency: these words are much more common than the prototypical
 manner adverbs that the style guides are worried about.
 The :py:attr:`Lexeme.prob` and :py:attr:`Token.prob` attributes give a
 log probability estimate of the word:
   >>> nlp.vocab[u'back'].prob
   -7.403977394104004
   >>> nlp.vocab[u'not'].prob
   -5.407193660736084
   >>> nlp.vocab[u'quietly'].prob
   -11.07155704498291
 (The probability estimate is based on counts from a 3 billion word corpus,
 smoothed using the `Simple Good-Turing`_ method.)
 .. _`Simple Good-Turing`: http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL02/Code/sgt-gale.pdf
 So we can easily exclude the N most frequent words in English from our adverb
 marker.  Let's try N=1000 for now:
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> nlp = spacy.en.English()
    >>> # Find log probability of Nth most frequent word
    >>> probs = [lex.prob for lex in nlp.vocab]
    >>> probs.sort()
    >>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
    >>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
    >>> print u''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens)
    ‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’
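The frequency cut-off used above can be sketched without loading spaCy. The snippet below is a minimal illustration under stated assumptions: the dictionary is a hand-made stand-in for the vocabulary (the values for "not", "back" and "quietly" are the ones quoted earlier, the other two are invented), and `nth_most_frequent_threshold` is a hypothetical helper, not part of spaCy's API.

```python
def nth_most_frequent_threshold(log_probs, n):
    """Return the log probability of the n-th most frequent word.

    `log_probs` maps word -> log probability (higher means more frequent).
    Hypothetical helper for illustration; not a spaCy function.
    """
    ranked = sorted(log_probs.values(), reverse=True)
    # Clamp n so a small vocabulary does not raise IndexError.
    n = max(1, min(n, len(ranked)))
    return ranked[n - 1]


# Toy vocabulary. The first three values are the ones quoted above;
# the last two are made up for the example.
toy_log_probs = {
    "not": -5.407193660736084,
    "back": -7.403977394104004,
    "quietly": -11.07155704498291,
    "the": -3.5,        # invented
    "abjectly": -16.0,  # invented
}

# Treat the 3 most frequent words as "too common to flag".
threshold = nth_most_frequent_threshold(toy_log_probs, 3)
rare_words = sorted(w for w, p in toy_log_probs.items() if p < threshold)
```

With N=3, the threshold is the log probability of "back", so only the words strictly rarer than "back" remain candidates for highlighting.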
 There are lots of other ways we could refine the logic, depending on just what
 words we want to flag.  Let's say we wanted to only flag adverbs that modified words
 similar to "pleaded".  This is easy to do, as spaCy loads a vector-space
 representation for every word (by default, the vectors produced by
 `Levy and Goldberg (2014)`_).  Naturally, the vector is provided as a numpy
 array:
    >>> pleaded = tokens[7]
    >>> pleaded.repvec.shape
    (300,)
    >>> pleaded.repvec[:5]
    array([ 0.04229792,  0.07459262,  0.00820188, -0.02181299,  0.07519238], dtype=float32)
 .. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
 We want to sort the words in our vocabulary by their similarity to "pleaded".
 There are lots of ways to measure the similarity of two vectors.  We'll use the
 cosine metric:
    >>> from numpy import dot
    >>> from numpy.linalg import norm
    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
    >>> words = [w for w in nlp.vocab if w.has_repvec]
    >>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
    >>> words.reverse()
    >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
    1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
    >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
    50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
    >>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
    100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
    >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
    1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
    >>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
    50000-50010 fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists
 As you can see, the similarity model that these vectors give us is excellent
 --- we're still getting meaningful results at 1000 words, off a single
 prototype!  The only problem is that the list really contains two clusters of
 words: one associated with the legal meaning of "pleaded", and one for the more
 general sense.  Sorting out these clusters is an area of active research.
 A simple work-around is to average the vectors of several words, and use that
 as our target:
    >>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
    >>> say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)
    >>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
    >>> words.reverse()
    >>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
    1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
    >>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
    50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
    >>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
    1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate
 These definitely look like words that King might scold a writer for attaching
 adverbs to.  Recall that our original adverb highlighting function looked like
 this:
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV
    >>> # Load the pipeline, and call it with some text.
    >>> nlp = spacy.en.English()
    >>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
                     tag=True, parse=False)
    >>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
    ‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
 We wanted to refine the logic so that only adverbs modifying evocative verbs
 of communication, like "pleaded", were highlighted.  We've now built a vector that
 represents that type of word, so now we can highlight adverbs based on
 subtle logic, homing in on adverbs that seem the most stylistically
 problematic, given our starting assumptions:
    >>> import numpy
    >>> from numpy import dot
    >>> from numpy.linalg import norm
    >>> import spacy.en
    >>> from spacy.parts_of_speech import ADV, VERB
    >>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
    >>> def is_bad_adverb(token, target_verb, tol):
    ...   if token.pos != ADV:
    ...     return False
    ...   elif token.head.pos != VERB:
    ...     return False
    ...   elif cosine(token.head.repvec, target_verb) < tol:
    ...     return False
    ...   else:
    ...     return True
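The filter above can be exercised end to end with toy data. The following is a rough sketch, assuming a hypothetical `ToyToken` stand-in (not spaCy's Token class), string part-of-speech labels instead of spaCy's integer IDs, and invented three-dimensional vectors. It only illustrates the logic: average a few verb vectors, then flag adverbs whose head verb is close to that average.

```python
from collections import namedtuple

import numpy as np

# Stand-in for a parsed token: text, part-of-speech label, vector and head.
# These are hypothetical toy objects, not spaCy's Token class.
ToyToken = namedtuple("ToyToken", ["text", "pos", "vector", "head"])


def cosine(v1, v2):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


def is_bad_adverb(token, target_vector, tol):
    """True if `token` is an adverb whose head is a verb close to the target."""
    if token.pos != "ADV":
        return False
    if token.head is None or token.head.pos != "VERB":
        return False
    return cosine(token.head.vector, target_vector) >= tol


# Invented 3-d vectors; real word vectors have hundreds of dimensions.
pleaded = ToyToken("pleaded", "VERB", np.array([0.9, 0.1, 0.0]), None)
give = ToyToken("Give", "VERB", np.array([0.0, 0.2, 0.9]), None)
abjectly = ToyToken("abjectly", "ADV", np.array([0.1, 0.1, 0.1]), pleaded)
back = ToyToken("back", "ADV", np.array([0.1, 0.1, 0.1]), give)

# Average a few "verbs of saying" to form the target vector.
say_vector = np.mean(
    [np.array([1.0, 0.0, 0.0]), np.array([0.8, 0.2, 0.0])], axis=0
)

flagged = [t.text for t in (abjectly, back) if is_bad_adverb(t, say_vector, tol=0.8)]
```

Here "abjectly" is flagged because its head, "pleaded", points in nearly the same direction as the averaged target, while "back" is skipped because "Give" does not. The tolerance of 0.8 is arbitrary and would need tuning on real vectors.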
 This example was somewhat contrived --- and, truth be told, I've never really
 bought the idea that adverbs were a grave stylistic sin.  But hopefully it got
 the message across: the state-of-the-art NLP technologies are very powerful.
 spaCy gives you easy and efficient access to them, which lets you build all
 sorts of useful products and features that were previously impossible.
 Independent Evaluation
 ----------------------
 .. table:: Independent evaluation by Yahoo! Labs and Emory
  University, to appear at ACL 2015. Higher is better.
  +----------------+------------+------------+------------+
  | System         | Language   | Accuracy   | Speed      |        
  +----------------+------------+------------+------------+
  | spaCy v0.86    | Cython     | 91.9       | **13,963** |
  +----------------+------------+------------+------------+
  | ClearNLP       | Java       | 91.7       | 10,271     |
  +----------------+------------+------------+------------+
  | spaCy v0.84    | Cython     | 90.9       | 13,963     |
  +----------------+------------+------------+------------+
  | CoreNLP        | Java       | 89.6       | 8,602      |
  +----------------+------------+------------+------------+
  | MATE           | Java       | **92.5**   | 550        |
  +----------------+------------+------------+------------+
  | Turbo          | C++        | 92.4       | 349        |
  +----------------+------------+------------+------------+
  | Yara           | Java       | 92.3       | 340        |
  +----------------+------------+------------+------------+
 Accuracy is % unlabelled arcs correct, speed is tokens per second.
 Joel Tetreault and Amanda Stent (Yahoo! Labs) and Jin-ho Choi (Emory) performed
 a detailed comparison of the best parsers available.  All numbers above
 are taken from the pre-print they kindly made available to me,
 except for spaCy v0.86. 
 I'm particularly grateful to the authors for discussion of their results, which
 led to the improvement in accuracy between v0.84 and v0.86.  A tip from Jin-ho
 (developer of ClearNLP) was particularly useful.
 Detailed Speed Comparison
 -------------------------
 **Set up**: 100,000 plain-text documents were streamed from an SQLite3
 database, and processed with an NLP library, to one of three levels of detail
 --- tokenization, tagging, or parsing.  The tasks are additive: to parse the
 text you have to tokenize and tag it.  The  pre-processing was not subtracted
 from the times --- I report the time required for the pipeline to complete.
 I report mean times per document, in milliseconds.
 **Hardware**: Intel i7-3770 (2012)
 .. table:: Per-document processing times.  Lower is better.
  +--------------+---------------------------+--------------------------------+
  |              | Absolute (ms per doc)     | Relative (to spaCy)            |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | System       | Tokenize | Tag    | Parse | Tokenize | Tag     | Parse     |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | spaCy        | 0.2ms    | 1ms    | 19ms  | 1x       | 1x      | 1x        |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | CoreNLP      | 2ms      | 10ms   | 49ms  | 10x      | 10x     | 2.6x      |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | ZPar         | 1ms      | 8ms    | 850ms | 5x       | 8x      | 44.7x     |
  +--------------+----------+--------+-------+----------+---------+-----------+
  | NLTK         | 4ms      | 443ms  | n/a   | 20x      | 443x    |  n/a      |
  +--------------+----------+--------+-------+----------+---------+-----------+
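The relative columns are just each system's time divided by spaCy's time for the same task. As a quick check of that arithmetic, here is a small sketch; the numbers are copied from the table, and `relative_to` is a hypothetical helper written for this example.

```python
# Absolute per-document times in milliseconds, copied from the table above.
# None marks a task the system was not measured on.
times_ms = {
    "spaCy":   {"tokenize": 0.2, "tag": 1.0,   "parse": 19.0},
    "CoreNLP": {"tokenize": 2.0, "tag": 10.0,  "parse": 49.0},
    "ZPar":    {"tokenize": 1.0, "tag": 8.0,   "parse": 850.0},
    "NLTK":    {"tokenize": 4.0, "tag": 443.0, "parse": None},
}


def relative_to(baseline, times):
    """Return each system's time as a multiple of the baseline system's time."""
    base = times[baseline]
    return {
        system: {
            task: (None if ms is None else round(ms / base[task], 1))
            for task, ms in tasks.items()
        }
        for system, tasks in times.items()
    }


relative = relative_to("spaCy", times_ms)
```

For instance, CoreNLP's parse time works out to 49 / 19, or roughly 2.6 times spaCy's, which matches the table.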
 Efficiency is a major concern for NLP applications.  It is very common to hear
 people say that they cannot afford more detailed processing, because their
 datasets are too large.  This is a bad position to be in.  If you can't apply
 detailed processing, you generally have to cobble together various heuristics.
 This normally takes a few iterations, and what you come up with will usually be
 brittle and difficult to reason about.
 spaCy's parser is faster than most taggers, and its tokenizer is fast enough
 for any workload.  And the tokenizer doesn't just give you a list
 of strings.  A spaCy token is a pointer to a Lexeme struct, from which you can
 access a wide range of pre-computed features, including embedded word
 representations.
 .. I wrote spaCy because I think existing commercial NLP engines are crap.
  Alchemy API are a typical example.  Check out this part of their terms of
  service:
  publish or perform any benchmark or performance tests or analysis relating to
  the Service or the use thereof without express authorization from AlchemyAPI;
 .. Did you get that?  You're not allowed to evaluate how well their system works,
  unless you're granted a special exception.  Their system must be pretty
  terrible to motivate such an embarrassing restriction.
  They must know this makes them look bad, but they apparently believe allowing
  you to evaluate their product would make them look even worse!
 .. spaCy is based on science, not alchemy.  It's open source, and I am happy to
  clarify any detail of the algorithms I've implemented.
  It's evaluated against the current best published systems, following the standard
  methodologies.  These evaluations show that it performs extremely well.
 .. See `Benchmarks`_ for details.
 .. toctree::
    :maxdepth: 4
    :hidden:
--- a/docs/source/reference/index.rst
+++ b/docs/source/reference/index.rst
@ -59,6 +59,7 @@ and a small usage snippet.
    using/document.rst
    using/span.rst
    using/token.rst
    using/lexeme.rst
 .. _English: processing.html
@ -69,6 +70,8 @@ and a small usage snippet.
 .. _Span: using/span.html
 .. _Lexeme: using/lexeme.html
 .. _Vocab: lookup.html
 .. _StringStore: lookup.html
@ -79,8 +82,6 @@ and a small usage snippet.
 .. _Parser: processing.html
 .. _Lexeme: lookup.html
 .. _Scorer: misc.html
 .. _GoldParse:  misc.html
--- a/docs/source/reference/using/document.rst
+++ b/docs/source/reference/using/document.rst
@ -6,7 +6,7 @@ The Doc Object
 :code:`__getitem__`, :code:`__iter__`, :code:`__len__`
  The Tokens class behaves as a Python sequence, supporting the usual operators,
-  len(), etc.  Negative indexing is supported. Slices are not yet.
+  len(), etc.  Negative indexing is supported. Slices are supported as of v0.89.
  .. code::
@ -15,14 +15,17 @@ The Doc Object
    u'Zero'
    >>> tokens[-1].orth_
    u'six'
-    >>> tokens[0:4]
+    >>> span = tokens[0:4]
-    Error
+    >>> [w.orth_ for w in span]
    [u'Zero', u'one', u'two', u'three']
    >>> span.string
    u'Zero one two three'
 :code:`sents`
-  Iterate over sentences in the document.
+  Iterate over sentences in the document. Each sentence is a Span object.
 :code:`ents`
-  Iterate over entities in the document.
+  Iterate over entities in the document. Each entity is a Span object.
 :code:`to_array`
  Given a list of M attribute IDs, export the tokens to a numpy ndarray
@ -55,8 +58,36 @@ The Doc Object
  Merge a multi-word expression into a single token.  Currently
  experimental; API is likely to change.
 :code:`to_bytes()`
  Get a byte-string representation of the document, i.e. serialize.
 :code:`from_bytes(self, byte_string)`
  Load data from a byte-string, i.e. deserialize
 :code:`Doc.read_bytes`
  A staticmethod, used to read bytes from a file.
 Example of serialization:
 ::
    doc1 = EN(u'This is a simple test. With a couple of sentences.')
    doc2 = EN(u'This is another test document.')
    with open('/tmp/spacy_docs.bin', 'wb') as file_:
        file_.write(doc1.to_bytes())
        file_.write(doc2.to_bytes())
    with open('/tmp/spacy_docs.bin', 'rb') as file_:
        bytes1, bytes2 = Doc.read_bytes(file_)
        r1 = Doc(EN.vocab).from_bytes(bytes1)
        r2 = Doc(EN.vocab).from_bytes(bytes2)
    assert r1.string == doc1.string
    assert r2.string == doc2.string
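Storing several documents in one file means the reader has to know where each byte string ends. The sketch below shows one generic way to do that, by prefixing every record with its length. This is an assumption made purely for illustration: it is not a description of spaCy's on-disk format or of how `Doc.read_bytes` works internally, and `write_record` and `read_records` are hypothetical names.

```python
import io
import struct


def write_record(file_, payload):
    """Write one byte string, preceded by its length as a 4-byte unsigned int."""
    file_.write(struct.pack("<I", len(payload)))
    file_.write(payload)


def read_records(file_):
    """Yield each byte string written by `write_record`, in order."""
    while True:
        header = file_.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<I", header)
        payload = file_.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


# An in-memory buffer stands in for a file opened in binary mode.
buffer = io.BytesIO()
write_record(buffer, b"first serialized doc")
write_record(buffer, b"second serialized doc")
buffer.seek(0)
records = list(read_records(buffer))
```

Because the reader is a generator, a large file can be processed one record at a time without loading everything into memory.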
 Internals
  A Tokens instance stores the annotations in a C-array of `TokenC` structs.
  Each TokenC struct holds a const pointer to a LexemeC struct, which describes
@ -66,5 +97,4 @@ Internals
  For faster access, the underlying C data can be accessed from Cython.  You
  can also export the data to a numpy array, via `Tokens.to_array`, if pure Python
-  access is required, and you need slightly better performance.  However, this
+  access is required, and you need slightly better performance. 
  is both slower and has a worse API than Cython access.
--- a/docs/source/reference/using/token.rst
+++ b/docs/source/reference/using/token.rst
@ -53,6 +53,41 @@ string-typed.
  whitespace**.  This is useful when you need to use linguistic features to
  add inline mark-up to the string.
 **Boolean Features**
 :code:`is_oov`
  Is the word out-of-vocabulary?
 :code:`is_alpha`
  Equivalent to `word.orth_.isalpha()`
 :code:`is_ascii`
  Equivalent to `all(ord(c) < 128 for c in word.orth_)`
 :code:`is_digit`
  Equivalent to `word.orth_.isdigit()`
 :code:`is_lower`
  Equivalent to `word.orth_.islower()`
 :code:`is_title`
  Equivalent to `word.orth_.istitle()`
 :code:`is_punct`
  Is the word punctuation?  (Python strings have no `ispunct()` method, so there is no built-in equivalent.)
 :code:`is_space`
  Equivalent to `word.orth_.isspace()`
 :code:`like_url`
  Does the word resemble a URL?
 :code:`like_num`
  Does the word represent a number? e.g. "10.9", "10", "ten", etc
 :code:`like_email`
  Does the word resemble an email?
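For readers who want to see what the string-based flags amount to, here is a plain-Python sketch that computes them for a word. It assumes nothing about spaCy's internals: `boolean_features` is a hypothetical helper, and the `is_punct` rule (every character in a Unicode punctuation category) is one reasonable definition chosen for this example, which may differ from spaCy's own.

```python
import unicodedata


def is_ascii(text):
    """True if every character is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in text)


def is_punct(text):
    """True if every character is in a Unicode punctuation category.

    Assumption: `str` has no `ispunct()` method, so this rule is a guess
    at a sensible definition; spaCy's own rule may differ.
    """
    return bool(text) and all(
        unicodedata.category(c).startswith("P") for c in text
    )


def boolean_features(text):
    """Compute a dictionary of simple shape flags for one word."""
    return {
        "is_alpha": text.isalpha(),
        "is_ascii": is_ascii(text),
        "is_digit": text.isdigit(),
        "is_lower": text.islower(),
        "is_title": text.istitle(),
        "is_punct": is_punct(text),
        "is_space": text.isspace(),
    }


features = {
    word: boolean_features(word) for word in ["Zero", "10", "caf\u00e9", "!?"]
}
```

The attributes on Token and Lexeme are looked up per word, so these flags do not have to be recomputed from the string each time they are needed.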
 **Distributional Features**
@ -115,6 +150,12 @@ string-typed.
  An iterator for the part of the sentence syntactically governed by the
  word, including the word itself.
 :code:`left_edge`
  The leftmost descendant of the word's subtree. Equivalent to `list(word.subtree)[0]`
 :code:`right_edge`
  The rightmost descendant of the word's subtree. Equivalent to `list(word.subtree)[-1]`
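The relationship between a subtree and its edges can be shown on a toy dependency tree. This is a sketch under stated assumptions: the tree is stored as a plain list of head indices (the root is its own head), the parse is hand-annotated for illustration, and `subtree`, `left_edge` and `right_edge` are hypothetical functions rather than spaCy's implementation.

```python
def subtree(heads, i):
    """Return indices of token i and all its descendants, in sentence order.

    `heads[j]` is the index of token j's head; the root is its own head.
    Assumes the head indices form a valid tree (no cycles).
    """
    def is_descendant(j):
        # Walk up the head chain from j until we reach i or the root.
        while True:
            if j == i:
                return True
            if heads[j] == j:
                return False
            j = heads[j]

    return [j for j in range(len(heads)) if is_descendant(j)]


def left_edge(heads, i):
    """Index of the leftmost token in the subtree of token i."""
    return subtree(heads, i)[0]


def right_edge(heads, i):
    """Index of the rightmost token in the subtree of token i."""
    return subtree(heads, i)[-1]


# "The quick fox jumped over the lazy dog", hand-annotated for illustration.
words = ["The", "quick", "fox", "jumped", "over", "the", "lazy", "dog"]
heads = [2, 2, 3, 3, 3, 7, 7, 4]

fox_span = (words[left_edge(heads, 2)], words[right_edge(heads, 2)])
over_span = (words[left_edge(heads, 4)], words[right_edge(heads, 4)])
```

In this toy parse the subtree of "fox" runs from "The" to "fox", and the subtree of "over" runs from "over" to "dog", so the two edges are enough to recover the full phrase a word governs.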
 **Named Entities**
--- a/docs/source/updates.rst
+++ b/docs/source/updates.rst
@ -10,18 +10,83 @@ To update your installation:
 Most updates ship a new model, so you will usually have to redownload the data.
-v0.89
+2015-07-28 v0.89
-----
+----------------
 Major update!
 * Support efficient binary serialization.  The dependency tree,
  part-of-speech tags, named entities, tokenization and text can be dumped to a
  byte string smaller than the original text representation.  Serialization is
  lossless, so there's no need to separately store the original text.
  Serialize:
  .. code-block:: python
      byte_string = doc.to_bytes()
  Deserialize by first creating a Doc object, and then loading the bytes:
  .. code-block:: python
      doc = Doc(nlp.vocab)
      doc.from_bytes(byte_string)
  If you have a binary file with several parses saved, you can iterate over
  them using the staticmethod `Doc.read_bytes`. Putting it all together:
  .. code-block:: python
      import codecs
      from spacy.en import English
      from spacy.tokens.doc import Doc  # import path assumed for this release
      def serialize(nlp, texts, out_loc):
          with open(out_loc, 'wb') as out_file:
              for text in texts:
                  doc = nlp(text)
                  out_file.write(doc.to_bytes())
      def deserialize(nlp, file_loc):
          docs = []
          with open(file_loc, 'rb') as read_file:
              for byte_string in Doc.read_bytes(read_file):
                  doc = Doc(nlp.vocab).from_bytes(byte_string)
                  docs.append(doc)
          return docs
  Full tutorial coming soon.
 * Fix probability estimates, and base them off counts from the 2015 Reddit Comments
  dump.  The probability estimates are now very reliable, and out-of-vocabulary
  words now receive an accurate smoothed probability estimate.
 * Fix regression in parse times on very long texts. Recent versions were
  calculating parse features in a way that was polynomial in input length. 
-* Add tag SP (coarse tag SPACE) for whitespace tokens. Ensure entity recogniser
+
-  does not assign entities to whitespace.
+* Allow slicing into the Doc object, so that you can do e.g. doc[2:4]. Returns
  a Span object.
 * Add tag SP (coarse tag SPACE) for whitespace tokens.  Fix bug where
  whitespace was sometimes marked as an entity.
 * Reduce memory usage. Memory usage now under 2GB per process.
 * Rename :code:`Span.head` to :code:`Span.root`, fix its documentation, and make
  it more efficient.  I considered adding Span.head, Span.dep and Span.dep\_ as
  well, but for now I leave these as accessible via :code:`Span.root.head`,
  :code:`Span.root.dep`, and :code:`Span.root.dep\_`, to keep the API smaller.
 * Add boolean features to Token and Lexeme objects.
 * Main parse function now marked **nogil**. This
  means I'll be able to add a Worker class that allows multi-threaded
  processing.  This will be available in the next version.  In the meantime,
  you should continue to use multiprocessing for parallelization.
 2015-07-08 v0.88
 ----------------