* Workon docs for v0.89
parent 320836e346
commit 2bcb58456d
|
@ -7,56 +7,18 @@
|
||||||
spaCy: Industrial-strength NLP
|
spaCy: Industrial-strength NLP
|
||||||
==============================
|
==============================
|
||||||
|
|
||||||
|
`spaCy`_ is a library for building tomorrow's language technology products.
|
||||||
.. _Issue Tracker: https://github.com/honnibal/spaCy/issues
|
It's like Stanford's CoreNLP for Python, but with a fundamentally different
|
||||||
|
objective. While CoreNLP is primarily built for conducting research, spaCy is
|
||||||
**2015-07-08**: `Version 0.88 released`_
|
designed for application.
|
||||||
|
|
||||||
.. _Version 0.87 released: updates.html
|
|
||||||
|
|
||||||
`spaCy`_ is a new library for text processing in Python and Cython.
|
|
||||||
I wrote it because I think small companies are terrible at
|
|
||||||
natural language processing (NLP). Or rather:
|
|
||||||
small companies are using terrible NLP technology.
|
|
||||||
|
|
||||||
.. _spaCy: https://github.com/honnibal/spaCy/
|
|
||||||
|
|
||||||
To do great NLP, you have to know a little about linguistics, a lot
|
|
||||||
about machine learning, and almost everything about the latest research.
|
|
||||||
The people who fit this description seldom join small companies.
|
|
||||||
Most are broke --- they've just finished grad school.
|
|
||||||
If they don't want to stay in academia, they join Google, IBM, etc.
|
|
||||||
|
|
||||||
The net result is that outside of the tech giants, commercial NLP has changed
|
|
||||||
little in the last ten years. In academia, it's changed entirely. Amazing
|
|
||||||
improvements in quality. Orders of magnitude faster. But the
|
|
||||||
academic code is always GPL, undocumented, unusable, or all three. You could
|
|
||||||
implement the ideas yourself, but the papers are hard to read, and training
|
|
||||||
data is exorbitantly expensive. So what are you left with? A common answer is
|
|
||||||
NLTK, which was written primarily as an educational resource. Nothing past the
|
|
||||||
tokenizer is suitable for production use.
|
|
||||||
|
|
||||||
I used to think that the NLP community just needed to do more to communicate
|
|
||||||
its findings to software engineers. So I wrote two blog posts, explaining
|
|
||||||
`how to write a part-of-speech tagger`_ and `parser`_. Both were well received,
|
|
||||||
and there's been a bit of interest in `my research software`_ --- even though
|
|
||||||
it's entirely undocumented, and mostly unusable to anyone but me.
|
|
||||||
|
|
||||||
.. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop
|
|
||||||
|
|
||||||
.. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
|
|
||||||
|
|
||||||
.. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
|
|
||||||
|
|
||||||
So six months ago I quit my post-doc, and I've been working day and night on
|
|
||||||
spaCy since. I'm now pleased to announce an alpha release.
|
|
||||||
|
|
||||||
If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
|
If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
|
||||||
It's by far the fastest NLP software ever released.
|
It's by far the fastest NLP software ever released.
|
||||||
The full processing pipeline completes in 20ms per document, including accurate
|
The full processing pipeline completes in under 50ms per document, including accurate
|
||||||
tagging and parsing. All strings are mapped to integer IDs, tokens are linked
|
tagging, entity recognition and parsing. All strings are mapped to integer IDs,
|
||||||
to embedded word representations, and a range of useful features are pre-calculated
|
tokens are linked to embedded word representations, and a range of useful features
|
||||||
and cached.
|
are pre-calculated and cached. The full analysis can be exported to numpy
|
||||||
|
arrays, or losslessly serialized into binary data smaller than the raw text.
|
||||||
|
|
||||||
If none of that made any sense to you, here's the gist of it. Computers don't
|
If none of that made any sense to you, here's the gist of it. Computers don't
|
||||||
understand text. This is unfortunate, because that's what the web almost entirely
|
understand text. This is unfortunate, because that's what the web almost entirely
|
||||||
|
@ -68,267 +30,17 @@ spaCy provides a library of utility functions that help programmers build such
|
||||||
products. It's commercial open source software: you can either use it under
|
products. It's commercial open source software: you can either use it under
|
||||||
the AGPL, or you can `buy a commercial license`_ for a one-time fee.
|
the AGPL, or you can `buy a commercial license`_ for a one-time fee.
|
||||||
|
|
||||||
|
|
||||||
|
.. _spaCy: https://github.com/honnibal/spaCy/
|
||||||
|
|
||||||
|
.. _Issue Tracker: https://github.com/honnibal/spaCy/issues
|
||||||
|
|
||||||
|
**2015-07-08**: `Version 0.89 released`_
|
||||||
|
|
||||||
|
.. _Version 0.89 released: updates.html
|
||||||
|
|
||||||
.. _buy a commercial license: license.html
|
.. _buy a commercial license: license.html
|
||||||
|
|
||||||
Example functionality
|
|
||||||
---------------------
|
|
||||||
|
|
||||||
Let's say you're developing a proofreading tool, or possibly an IDE for
|
|
||||||
writers. You're convinced by Stephen King's advice that `adverbs are not your
|
|
||||||
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
|
|
||||||
you want to **highlight all adverbs**. We'll use one of the examples he finds
|
|
||||||
particularly egregious:
|
|
||||||
|
|
||||||
>>> import spacy.en
|
|
||||||
>>> from spacy.parts_of_speech import ADV
|
|
||||||
>>> # Load the pipeline, and call it with some text.
|
|
||||||
>>> nlp = spacy.en.English()
|
|
||||||
>>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
|
|
||||||
>>> print u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens)
|
|
||||||
u‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
|
|
||||||
|
|
||||||
|
|
||||||
Easy enough --- but the problem is that we've also highlighted "back".
|
|
||||||
While "back" is undoubtedly an adverb, we probably don't want to highlight it.
|
|
||||||
If what we're trying to do is flag dubious stylistic choices, we'll need to
|
|
||||||
refine our logic. It turns out only a certain type of adverb is of interest to
|
|
||||||
us.
|
|
||||||
|
|
||||||
There are lots of ways we might do this, depending on just what words
|
|
||||||
we want to flag. The simplest way to exclude adverbs like "back" and "not"
|
|
||||||
is by word frequency: these words are much more common than the prototypical
|
|
||||||
manner adverbs that the style guides are worried about.
|
|
||||||
|
|
||||||
The :py:attr:`Lexeme.prob` and :py:attr:`Token.prob` attributes give a
|
|
||||||
log probability estimate of the word:
|
|
||||||
|
|
||||||
>>> nlp.vocab[u'back'].prob
|
|
||||||
-7.403977394104004
|
|
||||||
>>> nlp.vocab[u'not'].prob
|
|
||||||
-5.407193660736084
|
|
||||||
>>> nlp.vocab[u'quietly'].prob
|
|
||||||
-11.07155704498291
|
|
||||||
|
|
||||||
(The probability estimate is based on counts from a 3 billion word corpus,
|
|
||||||
smoothed using the `Simple Good-Turing`_ method.)
|
|
||||||
|
|
||||||
.. _`Simple Good-Turing`: http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL02/Code/sgt-gale.pdf
|
|
||||||
|
|
||||||
So we can easily exclude the N most frequent words in English from our adverb
|
|
||||||
marker. Let's try N=1000 for now:
|
|
||||||
|
|
||||||
>>> import spacy.en
|
|
||||||
>>> from spacy.parts_of_speech import ADV
|
|
||||||
>>> nlp = spacy.en.English()
|
|
||||||
>>> # Find log probability of Nth most frequent word
|
|
||||||
>>> probs = [lex.prob for lex in nlp.vocab]
|
|
||||||
>>> probs.sort()
|
|
||||||
>>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
|
|
||||||
>>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
|
|
||||||
>>> print u''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens)
|
|
||||||
‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’
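The frequency cutoff used above can be sketched without spaCy. This is a minimal illustration, assuming a plain dictionary of log probabilities: the values for "not", "back" and "quietly" are the ones quoted earlier, while the other two entries and the helper names ``frequency_cutoff`` and ``is_rare`` are invented for the example.

```python
# Log probabilities keyed by word. The first three values are quoted in the
# text above; the last two are invented purely for illustration.
log_probs = {
    "not": -5.407193660736084,
    "back": -7.403977394104004,
    "quietly": -11.07155704498291,
    "abjectly": -15.0,  # invented value
    "the": -3.5,        # invented value
}


def frequency_cutoff(log_probs, n):
    """Return the log probability of the n-th most frequent word."""
    ranked = sorted(log_probs.values(), reverse=True)
    return ranked[n - 1]


def is_rare(word, log_probs, cutoff):
    """True if the word is strictly less frequent than the cutoff."""
    return log_probs[word] < cutoff


# Treat the 3 most frequent words as "too common to flag".
cutoff = frequency_cutoff(log_probs, 3)
rare_words = sorted(w for w in log_probs if is_rare(w, log_probs, cutoff))
```

With these toy numbers "back" sits exactly on the cutoff and is excluded, which mirrors the strict ``<`` comparison in the ``is_adverb`` lambda.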
|
|
||||||
|
|
||||||
There are lots of other ways we could refine the logic, depending on just what
|
|
||||||
words we want to flag. Let's say we wanted to only flag adverbs that modified words
|
|
||||||
similar to "pleaded". This is easy to do, as spaCy loads a vector-space
|
|
||||||
representation for every word (by default, the vectors produced by
|
|
||||||
`Levy and Goldberg (2014)`_). Naturally, the vector is provided as a numpy
|
|
||||||
array:
|
|
||||||
|
|
||||||
>>> pleaded = tokens[7]
|
|
||||||
>>> pleaded.repvec.shape
|
|
||||||
(300,)
|
|
||||||
>>> pleaded.repvec[:5]
|
|
||||||
array([ 0.04229792, 0.07459262, 0.00820188, -0.02181299, 0.07519238], dtype=float32)
|
|
||||||
|
|
||||||
.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
|
|
||||||
|
|
||||||
We want to sort the words in our vocabulary by their similarity to "pleaded".
|
|
||||||
There are lots of ways to measure the similarity of two vectors. We'll use the
|
|
||||||
cosine metric:
|
|
||||||
|
|
||||||
>>> from numpy import dot
|
|
||||||
>>> from numpy.linalg import norm
|
|
||||||
|
|
||||||
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
|
|
||||||
>>> words = [w for w in nlp.vocab if w.has_repvec]
|
|
||||||
>>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
|
|
||||||
>>> words.reverse()
|
|
||||||
>>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
|
|
||||||
1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
|
|
||||||
>>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
|
|
||||||
50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
|
|
||||||
>>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
|
|
||||||
100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
|
|
||||||
>>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
|
|
||||||
1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
|
|
||||||
>>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
|
|
||||||
50000-50010 fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists
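For readers without numpy at hand, the cosine ranking can be reproduced in plain Python. This is only a sketch: the three-dimensional vectors below are made up, whereas the real ``repvec`` arrays have 300 dimensions.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


# Toy vectors, invented for the example.
vectors = {
    "pleaded": [1.0, 0.0, 0.0],
    "begged": [3.0, 1.0, 0.0],
    "confessed": [1.0, 1.0, 0.0],
    "ford": [0.0, 0.0, 2.0],
}

target = vectors["pleaded"]
# Most similar first, as with words.sort(...) followed by words.reverse().
ranked = sorted(vectors, key=lambda w: cosine(vectors[w], target), reverse=True)
```

Because cosine ignores vector length, "begged" ranks above "confessed" here even though its vector is longer.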
|
|
||||||
|
|
||||||
As you can see, the similarity model that these vectors give us is excellent
|
|
||||||
--- we're still getting meaningful results at 1000 words, off a single
|
|
||||||
prototype! The only problem is that the list really contains two clusters of
|
|
||||||
words: one associated with the legal meaning of "pleaded", and one for the more
|
|
||||||
general sense. Sorting out these clusters is an area of active research.
|
|
||||||
|
|
||||||
|
|
||||||
A simple work-around is to average the vectors of several words, and use that
|
|
||||||
as our target:
|
|
||||||
|
|
||||||
>>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
|
|
||||||
>>> say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)
|
|
||||||
>>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
|
|
||||||
>>> words.reverse()
|
|
||||||
>>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
|
|
||||||
1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
|
|
||||||
>>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
|
|
||||||
50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
|
|
||||||
>>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
|
|
||||||
1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate
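The averaging step is just a component-wise mean. A minimal sketch with invented three-dimensional vectors follows; ``mean_vector`` is a hypothetical helper, not part of spaCy.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Stand-ins for the repvec arrays of a few "say" verbs (values invented).
say_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
say_vector = mean_vector(say_vectors)
```

The resulting prototype can then be passed to a cosine function in place of a single word's vector.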
|
|
||||||
|
|
||||||
These definitely look like words that King might scold a writer for attaching
|
|
||||||
adverbs to. Recall that our original adverb highlighting function looked like
|
|
||||||
this:
|
|
||||||
|
|
||||||
>>> import spacy.en
|
|
||||||
>>> from spacy.parts_of_speech import ADV
|
|
||||||
>>> # Load the pipeline, and call it with some text.
|
|
||||||
>>> nlp = spacy.en.English()
|
|
||||||
>>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
|
|
||||||
...              tag=True, parse=False)
|
|
||||||
>>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
|
|
||||||
‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
We wanted to refine the logic so that only adverbs modifying evocative verbs
|
|
||||||
of communication, like "pleaded", were highlighted. We've now built a vector that
|
|
||||||
represents that type of word, so now we can highlight adverbs based on
|
|
||||||
subtle logic, honing in on adverbs that seem the most stylistically
|
|
||||||
problematic, given our starting assumptions:
|
|
||||||
|
|
||||||
>>> import numpy
|
|
||||||
>>> from numpy import dot
|
|
||||||
>>> from numpy.linalg import norm
|
|
||||||
>>> import spacy.en
|
|
||||||
>>> from spacy.parts_of_speech import ADV, VERB
|
|
||||||
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
|
|
||||||
>>> def is_bad_adverb(token, target_verb, tol):
|
|
||||||
...     if token.pos != ADV:
|
|
||||||
...         return False
|
|
||||||
...     elif token.head.pos != VERB:
|
|
||||||
...         return False
|
|
||||||
...     elif cosine(token.head.repvec, target_verb) < tol:
|
|
||||||
...         return False
|
|
||||||
...     else:
|
|
||||||
...         return True
|
|
||||||
|
|
||||||
|
|
||||||
This example was somewhat contrived --- and, truth be told, I've never really
|
|
||||||
bought the idea that adverbs were a grave stylistic sin. But hopefully it got
|
|
||||||
the message across: the state-of-the-art NLP technologies are very powerful.
|
|
||||||
spaCy gives you easy and efficient access to them, which lets you build all
|
|
||||||
sorts of useful products and features that were previously impossible.
|
|
||||||
|
|
||||||
|
|
||||||
Independent Evaluation
|
|
||||||
----------------------
|
|
||||||
|
|
||||||
.. table:: Independent evaluation by Yahoo! Labs and Emory
|
|
||||||
University, to appear at ACL 2015. Higher is better.
|
|
||||||
|
|
||||||
+----------------+------------+------------+------------+
|
|
||||||
| System | Language | Accuracy | Speed |
|
|
||||||
+----------------+------------+------------+------------+
|
|
||||||
| spaCy v0.86 | Cython | 91.9 | **13,963** |
|
|
||||||
+----------------+------------+------------+------------+
|
|
||||||
| ClearNLP | Java | 91.7 | 10,271 |
|
|
||||||
+----------------+------------+------------+------------+
|
|
||||||
| spaCy v0.84 | Cython | 90.9 | 13,963 |
|
|
||||||
+----------------+------------+------------+------------+
|
|
||||||
| CoreNLP | Java | 89.6 | 8,602 |
|
|
||||||
+----------------+------------+------------+------------+
|
|
||||||
| MATE | Java | **92.5** | 550 |
|
|
||||||
+----------------+------------+------------+------------+
|
|
||||||
| Turbo | C++ | 92.4 | 349 |
|
|
||||||
+----------------+------------+------------+------------+
|
|
||||||
| Yara | Java | 92.3 | 340 |
|
|
||||||
+----------------+------------+------------+------------+
|
|
||||||
|
|
||||||
|
|
||||||
Accuracy is % unlabelled arcs correct, speed is tokens per second.
|
|
||||||
|
|
||||||
Joel Tetreault and Amanda Stent (Yahoo! Labs) and Jin-ho Choi (Emory) performed
|
|
||||||
a detailed comparison of the best parsers available. All numbers above
|
|
||||||
are taken from the pre-print they kindly made available to me,
|
|
||||||
except for spaCy v0.86.
|
|
||||||
|
|
||||||
I'm particularly grateful to the authors for discussion of their results, which
|
|
||||||
led to the improvement in accuracy between v0.84 and v0.86. A tip from Jin-ho
|
|
||||||
(developer of ClearNLP) was particularly useful.
|
|
||||||
|
|
||||||
|
|
||||||
Detailed Speed Comparison
|
|
||||||
-------------------------
|
|
||||||
|
|
||||||
**Set up**: 100,000 plain-text documents were streamed from an SQLite3
|
|
||||||
database, and processed with an NLP library, to one of three levels of detail
|
|
||||||
--- tokenization, tagging, or parsing. The tasks are additive: to parse the
|
|
||||||
text you have to tokenize and tag it. The pre-processing was not subtracted
|
|
||||||
from the times --- I report the time required for the pipeline to complete.
|
|
||||||
I report mean times per document, in milliseconds.
|
|
||||||
|
|
||||||
**Hardware**: Intel i7-3770 (2012)
|
|
||||||
|
|
||||||
.. table:: Per-document processing times. Lower is better.
|
|
||||||
|
|
||||||
+--------------+---------------------------+--------------------------------+
|
|
||||||
| | Absolute (ms per doc) | Relative (to spaCy) |
|
|
||||||
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
||||||
| System | Tokenize | Tag | Parse | Tokenize | Tag | Parse |
|
|
||||||
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
||||||
| spaCy | 0.2ms | 1ms | 19ms | 1x | 1x | 1x |
|
|
||||||
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
||||||
| CoreNLP | 2ms | 10ms | 49ms | 10x | 10x | 2.6x |
|
|
||||||
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
||||||
| ZPar | 1ms | 8ms | 850ms | 5x | 8x | 44.7x |
|
|
||||||
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
||||||
| NLTK | 4ms | 443ms | n/a | 20x | 443x | n/a |
|
|
||||||
+--------------+----------+--------+-------+----------+---------+-----------+
|
|
||||||
|
|
||||||
|
|
||||||
Efficiency is a major concern for NLP applications. It is very common to hear
|
|
||||||
people say that they cannot afford more detailed processing, because their
|
|
||||||
datasets are too large. This is a bad position to be in. If you can't apply
|
|
||||||
detailed processing, you generally have to cobble together various heuristics.
|
|
||||||
This normally takes a few iterations, and what you come up with will usually be
|
|
||||||
brittle and difficult to reason about.
|
|
||||||
|
|
||||||
spaCy's parser is faster than most taggers, and its tokenizer is fast enough
|
|
||||||
for any workload. And the tokenizer doesn't just give you a list
|
|
||||||
of strings. A spaCy token is a pointer to a Lexeme struct, from which you can
|
|
||||||
access a wide range of pre-computed features, including embedded word
|
|
||||||
representations.
|
|
||||||
|
|
||||||
.. I wrote spaCy because I think existing commercial NLP engines are crap.
|
|
||||||
Alchemy API are a typical example. Check out this part of their terms of
|
|
||||||
service:
|
|
||||||
publish or perform any benchmark or performance tests or analysis relating to
|
|
||||||
the Service or the use thereof without express authorization from AlchemyAPI;
|
|
||||||
|
|
||||||
.. Did you get that? You're not allowed to evaluate how well their system works,
|
|
||||||
unless you're granted a special exception. Their system must be pretty
|
|
||||||
terrible to motivate such an embarrassing restriction.
|
|
||||||
They must know this makes them look bad, but they apparently believe allowing
|
|
||||||
you to evaluate their product would make them look even worse!
|
|
||||||
|
|
||||||
.. spaCy is based on science, not alchemy. It's open source, and I am happy to
|
|
||||||
clarify any detail of the algorithms I've implemented.
|
|
||||||
It's evaluated against the current best published systems, following the standard
|
|
||||||
methodologies. These evaluations show that it performs extremely well.
|
|
||||||
.. See `Benchmarks`_ for details.
|
|
||||||
|
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 4
|
:maxdepth: 4
|
||||||
:hidden:
|
:hidden:
|
||||||
|
|
|
@ -59,6 +59,7 @@ and a small usage snippet.
|
||||||
using/document.rst
|
using/document.rst
|
||||||
using/span.rst
|
using/span.rst
|
||||||
using/token.rst
|
using/token.rst
|
||||||
|
using/lexeme.rst
|
||||||
|
|
||||||
|
|
||||||
.. _English: processing.html
|
.. _English: processing.html
|
||||||
|
@ -69,6 +70,8 @@ and a small usage snippet.
|
||||||
|
|
||||||
.. _Span: using/span.html
|
.. _Span: using/span.html
|
||||||
|
|
||||||
|
.. _Lexeme: using/lexeme.html
|
||||||
|
|
||||||
.. _Vocab: lookup.html
|
.. _Vocab: lookup.html
|
||||||
|
|
||||||
.. _StringStore: lookup.html
|
.. _StringStore: lookup.html
|
||||||
|
@ -79,8 +82,6 @@ and a small usage snippet.
|
||||||
|
|
||||||
.. _Parser: processing.html
|
.. _Parser: processing.html
|
||||||
|
|
||||||
.. _Lexeme: lookup.html
|
|
||||||
|
|
||||||
.. _Scorer: misc.html
|
.. _Scorer: misc.html
|
||||||
|
|
||||||
.. _GoldParse: misc.html
|
.. _GoldParse: misc.html
|
||||||
|
|
|
@ -6,7 +6,7 @@ The Doc Object
|
||||||
|
|
||||||
:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
|
:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
|
||||||
The Tokens class behaves as a Python sequence, supporting the usual operators,
|
The Tokens class behaves as a Python sequence, supporting the usual operators,
|
||||||
len(), etc. Negative indexing is supported. Slices are not yet.
|
len(), etc. Negative indexing is supported. Slices are supported as of v0.89.
|
||||||
|
|
||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
|
@ -15,14 +15,17 @@ The Doc Object
|
||||||
u'Zero'
|
u'Zero'
|
||||||
>>> tokens[-1].orth_
|
>>> tokens[-1].orth_
|
||||||
u'six'
|
u'six'
|
||||||
>>> tokens[0:4]
|
>>> span = tokens[0:4]
|
||||||
Error
|
>>> [w.orth_ for w in span]
|
||||||
|
[u'Zero', u'one', u'two', u'three']
|
||||||
|
>>> span.string
|
||||||
|
u'Zero one two three'
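To illustrate how a sequence type can hand back a span object for slices while still returning single items for integer indices, here is a toy sketch. ``ToyDoc`` and ``ToySpan`` are hypothetical stand-ins holding plain strings; they are not spaCy's classes and say nothing about how spaCy implements slicing.

```python
class ToySpan(object):
    """A view onto words[start:end] of a ToyDoc (hypothetical class)."""

    def __init__(self, doc, start, end):
        self.doc = doc
        self.start = start
        self.end = end

    def __len__(self):
        return self.end - self.start

    def __iter__(self):
        for i in range(self.start, self.end):
            yield self.doc.words[i]

    @property
    def string(self):
        return u' '.join(self)


class ToyDoc(object):
    """A minimal sequence of words that supports integer and slice indexing."""

    def __init__(self, words):
        self.words = list(words)

    def __len__(self):
        return len(self.words)

    def __getitem__(self, i):
        if isinstance(i, slice):
            # slice.indices() resolves None and negative bounds for us.
            start, end, step = i.indices(len(self.words))
            if step != 1:
                raise ValueError("stepped slices are not supported in this sketch")
            return ToySpan(self, start, max(start, end))
        return self.words[i]


doc = ToyDoc(u'Zero one two three four five six'.split())
span = doc[0:4]
```

Returning a view rather than a copied list keeps the span tied to its parent document, which is the design the ``Span`` object suggests.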
|
||||||
|
|
||||||
:code:`sents`
|
:code:`sents`
|
||||||
Iterate over sentences in the document.
|
Iterate over sentences in the document. Each sentence is a Span object.
|
||||||
|
|
||||||
:code:`ents`
|
:code:`ents`
|
||||||
Iterate over entities in the document.
|
Iterate over entities in the document. Each entity is a Span object.
|
||||||
|
|
||||||
:code:`to_array`
|
:code:`to_array`
|
||||||
Given a list of M attribute IDs, export the tokens to a numpy ndarray
|
Given a list of M attribute IDs, export the tokens to a numpy ndarray
|
||||||
|
@ -55,8 +58,36 @@ The Doc Object
|
||||||
Merge a multi-word expression into a single token. Currently
|
Merge a multi-word expression into a single token. Currently
|
||||||
experimental; API is likely to change.
|
experimental; API is likely to change.
|
||||||
|
|
||||||
|
:code:`to_bytes()`
|
||||||
|
Get a byte-string representation of the document, i.e. serialize.
|
||||||
|
|
||||||
|
:code:`from_bytes(self, byte_string)`
|
||||||
|
Load data from a byte-string, i.e. deserialize.
|
||||||
|
|
||||||
|
:code:`Doc.read_bytes`
|
||||||
|
A staticmethod, used to read bytes from a file.
|
||||||
|
|
||||||
|
|
||||||
|
Example of serialization:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
doc1 = EN(u'This is a simple test. With a couple of sentences.')
|
||||||
|
doc2 = EN(u'This is another test document.')
|
||||||
|
|
||||||
|
with open('/tmp/spacy_docs.bin', 'wb') as file_:
|
||||||
|
    file_.write(doc1.to_bytes())
|
||||||
|
    file_.write(doc2.to_bytes())
|
||||||
|
|
||||||
|
with open('/tmp/spacy_docs.bin', 'rb') as file_:
|
||||||
|
    bytes1, bytes2 = Doc.read_bytes(file_)
|
||||||
|
    r1 = Doc(EN.vocab).from_bytes(bytes1)
|
||||||
|
    r2 = Doc(EN.vocab).from_bytes(bytes2)
|
||||||
|
|
||||||
|
assert r1.string == doc1.string
|
||||||
|
assert r2.string == doc2.string
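Writing several byte strings to one file only works if the reader can tell where each record ends. One common way to do that is a length prefix, sketched below. This is an assumption for illustration only: it is not a description of spaCy's actual on-disk format, and ``write_record`` and ``read_records`` are hypothetical helpers.

```python
import io
import struct


def write_record(file_, payload):
    """Write one record as a 4-byte little-endian length followed by the bytes."""
    file_.write(struct.pack('<I', len(payload)))
    file_.write(payload)


def read_records(file_):
    """Yield each record's payload until the file is exhausted."""
    while True:
        header = file_.read(4)
        if not header:
            break
        if len(header) < 4:
            raise ValueError("truncated length header")
        (length,) = struct.unpack('<I', header)
        payload = file_.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


# An in-memory file stands in for '/tmp/spacy_docs.bin'.
buffer_ = io.BytesIO()
write_record(buffer_, b'first document')
write_record(buffer_, b'')
write_record(buffer_, b'second document')
buffer_.seek(0)
records = list(read_records(buffer_))
```

A generator suits this job because the caller can process one document at a time without loading the whole file.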
|
||||||
|
|
||||||
|
|
||||||
Internals
|
Internals
|
||||||
A Tokens instance stores the annotations in a C-array of `TokenC` structs.
|
A Tokens instance stores the annotations in a C-array of `TokenC` structs.
|
||||||
Each TokenC struct holds a const pointer to a LexemeC struct, which describes
|
Each TokenC struct holds a const pointer to a LexemeC struct, which describes
|
||||||
|
@ -66,5 +97,4 @@ Internals
|
||||||
|
|
||||||
For faster access, the underlying C data can be accessed from Cython. You
|
For faster access, the underlying C data can be accessed from Cython. You
|
||||||
can also export the data to a numpy array, via `Tokens.to_array`, if pure Python
|
can also export the data to a numpy array, via `Tokens.to_array`, if pure Python
|
||||||
access is required, and you need slightly better performance. However, this
|
access is required, and you need slightly better performance.
|
||||||
is both slower and has a worse API than Cython access.
|
|
||||||
|
|
|
@ -53,6 +53,41 @@ string-typed.
|
||||||
whitespace**. This is useful when you need to use linguistic features to
|
whitespace**. This is useful when you need to use linguistic features to
|
||||||
add inline mark-up to the string.
|
add inline mark-up to the string.
|
||||||
|
|
||||||
|
**Boolean Features**
|
||||||
|
|
||||||
|
:code:`is_oov`
|
||||||
|
Is the word out-of-vocabulary?
|
||||||
|
|
||||||
|
:code:`is_alpha`
|
||||||
|
Equivalent to `word.orth_.isalpha()`
|
||||||
|
|
||||||
|
:code:`is_ascii`
|
||||||
|
Equivalent to `all(ord(c) < 128 for c in word.orth_)`
|
||||||
|
|
||||||
|
:code:`is_digit`
|
||||||
|
Equivalent to `word.orth_.isdigit()`
|
||||||
|
|
||||||
|
:code:`is_lower`
|
||||||
|
Equivalent to `word.orth_.islower()`
|
||||||
|
|
||||||
|
:code:`is_title`
|
||||||
|
Equivalent to `word.orth_.istitle()`
|
||||||
|
|
||||||
|
:code:`is_punct`
|
||||||
|
Is the word punctuation? (Python's `str` has no `ispunct()` method, so there is no direct built-in equivalent.)
|
||||||
|
|
||||||
|
:code:`is_space`
|
||||||
|
Equivalent to `word.orth_.isspace()`
|
||||||
|
|
||||||
|
:code:`like_url`
|
||||||
|
Does the word resemble a URL?
|
||||||
|
|
||||||
|
:code:`like_num`
|
||||||
|
Does the word represent a number? e.g. "10.9", "10", "ten", etc.
|
||||||
|
|
||||||
|
:code:`like_email`
|
||||||
|
Does the word resemble an email?
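The string-method equivalences listed above can be checked in plain Python. The sketch below computes them for an ordinary string. It is not spaCy code, ``boolean_features`` is a hypothetical helper, and the ``is_punct`` check is an ASCII-only approximation that I am assuming for illustration.

```python
import string


def is_ascii(word):
    """True if every character is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in word)


def is_punct(word):
    """ASCII-only approximation: every character is in string.punctuation."""
    return bool(word) and all(c in string.punctuation for c in word)


def boolean_features(word):
    """Collect the string-derived flags for one word into a dict."""
    return {
        'is_alpha': word.isalpha(),
        'is_ascii': is_ascii(word),
        'is_digit': word.isdigit(),
        'is_lower': word.islower(),
        'is_title': word.istitle(),
        'is_punct': is_punct(word),
        'is_space': word.isspace(),
    }


hello = boolean_features(u'Hello')
cafe = boolean_features(u'caf\xe9')
```

Flags such as ``like_url`` or ``is_oov`` are left out because they depend on logic or data that the text above does not spell out.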
|
||||||
|
|
||||||
|
|
||||||
**Distributional Features**
|
**Distributional Features**
|
||||||
|
|
||||||
|
@ -115,6 +150,12 @@ string-typed.
|
||||||
An iterator for the part of the sentence syntactically governed by the
|
An iterator for the part of the sentence syntactically governed by the
|
||||||
word, including the word itself.
|
word, including the word itself.
|
||||||
|
|
||||||
|
:code:`left_edge`
|
||||||
|
The leftmost descendant of the word's subtree. Equivalent to `list(word.subtree)[0]`
|
||||||
|
|
||||||
|
:code:`right_edge`
|
||||||
|
The rightmost descendant of the word's subtree. Equivalent to `list(word.subtree)[-1]`
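The relationship between ``subtree``, ``left_edge`` and ``right_edge`` can be seen on a toy dependency tree. ``ToyNode`` below is a hypothetical class written for this illustration, assuming a projective tree where each node knows its position in the sentence; it is not spaCy's Token.

```python
class ToyNode(object):
    """A word at sentence position i with a list of syntactic children."""

    def __init__(self, i, text):
        self.i = i
        self.text = text
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    @property
    def subtree(self):
        """Yield this node and all its descendants in sentence order."""
        ordered = sorted(self.children, key=lambda c: c.i)
        for child in ordered:
            if child.i < self.i:
                for node in child.subtree:
                    yield node
        yield self
        for child in ordered:
            if child.i > self.i:
                for node in child.subtree:
                    yield node

    @property
    def left_edge(self):
        return list(self.subtree)[0]

    @property
    def right_edge(self):
        return list(self.subtree)[-1]


# "the quick fox jumped over it", with "jumped" as the root.
the, quick, fox = ToyNode(0, 'the'), ToyNode(1, 'quick'), ToyNode(2, 'fox')
jumped, over, it = ToyNode(3, 'jumped'), ToyNode(4, 'over'), ToyNode(5, 'it')
fox.add(the)
fox.add(quick)
jumped.add(fox)
jumped.add(over)
over.add(it)
```

Here ``fox.left_edge`` is "the" and ``over.right_edge`` is "it", so the two edges together delimit the contiguous stretch of the sentence that a word governs.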
|
||||||
|
|
||||||
|
|
||||||
**Named Entities**
|
**Named Entities**
|
||||||
|
|
||||||
|
|
|
@ -10,18 +10,83 @@ To update your installation:
|
||||||
|
|
||||||
Most updates ship a new model, so you will usually have to redownload the data.
|
Most updates ship a new model, so you will usually have to redownload the data.
|
||||||
|
|
||||||
v0.89
|
2015-07-28 v0.89
|
||||||
-----
|
----------------
|
||||||
|
|
||||||
|
Major update!
|
||||||
|
|
||||||
|
* Support efficient binary serialization. The dependency tree,
|
||||||
|
part-of-speech tags, named entities, tokenization and text can be dumped to a
|
||||||
|
byte string smaller than the original text representation. Serialization is
|
||||||
|
lossless, so there's no need to separately store the original text.
|
||||||
|
|
||||||
|
Serialize:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
byte_string = doc.to_bytes()
|
||||||
|
|
||||||
|
Deserialize by first creating a Doc object, and then loading the bytes:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
doc = Doc(nlp.vocab)
|
||||||
|
doc.from_bytes(byte_string)
|
||||||
|
|
||||||
|
If you have a binary file with several parses saved, you can iterate over
|
||||||
|
them using the staticmethod `Doc.read_bytes`. Putting it all together:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
import codecs
|
||||||
|
|
||||||
|
from spacy.en import English
from spacy.tokens.doc import Doc
|
||||||
|
|
||||||
|
def serialize(nlp, texts, out_loc):
|
||||||
|
    with open(out_loc, 'wb') as out_file:
|
||||||
|
        for text in texts:
|
||||||
|
            doc = nlp(text)
|
||||||
|
            out_file.write(doc.to_bytes())
|
||||||
|
|
||||||
|
def deserialize(nlp, file_loc):
|
||||||
|
    docs = []
|
||||||
|
    with open(file_loc, 'rb') as read_file:
|
||||||
|
        for byte_string in Doc.read_bytes(read_file):
|
||||||
|
            doc = Doc(nlp.vocab).from_bytes(byte_string)
|
||||||
|
            docs.append(doc)
|
||||||
|
    return docs
|
||||||
|
|
||||||
|
|
||||||
|
Full tutorial coming soon.
|
||||||
|
|
||||||
|
|
||||||
|
* Fix probability estimates, and base them off counts from the 2015 Reddit Comments
|
||||||
|
dump. The probability estimates are now very reliable, and out-of-vocabulary
|
||||||
|
words now receive an accurate smoothed probability estimate.
|
||||||
|
|
||||||
* Fix regression in parse times on very long texts. Recent versions were
|
* Fix regression in parse times on very long texts. Recent versions were
|
||||||
calculating parse features in a way that was polynomial in input length.
|
calculating parse features in a way that was polynomial in input length.
|
||||||
* Add tag SP (coarse tag SPACE) for whitespace tokens. Ensure entity recogniser
|
|
||||||
does not assign entities to whitespace.
|
* Allow slicing into the Doc object, so that you can do e.g. doc[2:4]. Returns
|
||||||
|
a Span object.
|
||||||
|
|
||||||
|
* Add tag SP (coarse tag SPACE) for whitespace tokens. Fix bug where
|
||||||
|
whitespace was sometimes marked as an entity.
|
||||||
|
|
||||||
|
* Reduce memory usage. Memory usage now under 2GB per process.
|
||||||
|
|
||||||
* Rename :code:`Span.head` to :code:`Span.root`, fix its documentation, and make
|
* Rename :code:`Span.head` to :code:`Span.root`, fix its documentation, and make
|
||||||
it more efficient. I considered adding Span.head, Span.dep and Span.dep\_ as
|
it more efficient. I considered adding Span.head, Span.dep and Span.dep\_ as
|
||||||
well, but for now I leave these as accessible via :code:`Span.root.head`,
|
well, but for now I leave these as accessible via :code:`Span.root.head`,
|
||||||
:code:`Span.root.dep`, and :code:`Span.root.dep\_`, to keep the API smaller.
|
:code:`Span.root.dep`, and :code:`Span.root.dep\_`, to keep the API smaller.
|
||||||
|
|
||||||
|
* Add boolean features to Token and Lexeme objects.
|
||||||
|
|
||||||
|
* Main parse function now marked **nogil**. This
|
||||||
|
means I'll be able to add a Worker class that allows multi-threaded
|
||||||
|
processing. This will be available in the next version. In the meantime,
|
||||||
|
you should continue to use multiprocessing for parallelization.
|
||||||
|
|
||||||
|
|
||||||
2015-07-08 v0.88
|
2015-07-08 v0.88
|
||||||
----------------
|
----------------
|
||||||
|
|