mirror of
https://github.com/explosion/spaCy.git
synced 2025-02-04 05:34:10 +03:00
* Work on docs for v0.89
parent
320836e346
commit
2bcb58456d
|
@ -7,56 +7,18 @@
|
|||
spaCy: Industrial-strength NLP
|
||||
==============================
|
||||
|
||||
|
||||
.. _Issue Tracker: https://github.com/honnibal/spaCy/issues
|
||||
|
||||
**2015-07-08**: `Version 0.88 released`_
|
||||
|
||||
.. _Version 0.87 released: updates.html
|
||||
|
||||
`spaCy`_ is a new library for text processing in Python and Cython.
|
||||
I wrote it because I think small companies are terrible at
|
||||
natural language processing (NLP). Or rather:
|
||||
small companies are using terrible NLP technology.
|
||||
|
||||
.. _spaCy: https://github.com/honnibal/spaCy/
|
||||
|
||||
To do great NLP, you have to know a little about linguistics, a lot
|
||||
about machine learning, and almost everything about the latest research.
|
||||
The people who fit this description seldom join small companies.
|
||||
Most are broke --- they've just finished grad school.
|
||||
If they don't want to stay in academia, they join Google, IBM, etc.
|
||||
|
||||
The net result is that outside of the tech giants, commercial NLP has changed
|
||||
little in the last ten years. In academia, it's changed entirely. Amazing
|
||||
improvements in quality. Orders of magnitude faster. But the
|
||||
academic code is always GPL, undocumented, unusable, or all three. You could
|
||||
implement the ideas yourself, but the papers are hard to read, and training
|
||||
data is exorbitantly expensive. So what are you left with? A common answer is
|
||||
NLTK, which was written primarily as an educational resource. Nothing past the
|
||||
tokenizer is suitable for production use.
|
||||
|
||||
I used to think that the NLP community just needed to do more to communicate
|
||||
its findings to software engineers. So I wrote two blog posts, explaining
|
||||
`how to write a part-of-speech tagger`_ and `parser`_. Both were well received,
|
||||
and there's been a bit of interest in `my research software`_ --- even though
|
||||
it's entirely undocumented, and mostly unusable to anyone but me.
|
||||
|
||||
.. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop
|
||||
|
||||
.. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
|
||||
|
||||
.. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
|
||||
|
||||
So six months ago I quit my post-doc, and I've been working day and night on
|
||||
spaCy since. I'm now pleased to announce an alpha release.
|
||||
`spaCy`_ is a library for building tomorrow's language technology products.
|
||||
It's like Stanford's CoreNLP for Python, but with a fundamentally different
|
||||
objective. While CoreNLP is primarily built for conducting research, spaCy is
|
||||
designed for application.
|
||||
|
||||
If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
|
||||
It's by far the fastest NLP software ever released.
|
||||
The full processing pipeline completes in 20ms per document, including accurate
|
||||
tagging and parsing. All strings are mapped to integer IDs, tokens are linked
|
||||
to embedded word representations, and a range of useful features are pre-calculated
|
||||
and cached.
|
||||
The full processing pipeline completes in under 50ms per document, including accurate
|
||||
tagging, entity recognition and parsing. All strings are mapped to integer IDs,
|
||||
tokens are linked to embedded word representations, and a range of useful features
|
||||
are pre-calculated and cached. The full analysis can be exported to numpy
|
||||
arrays, or losslessly serialized into binary data smaller than the raw text.
|
||||
|
||||
If none of that made any sense to you, here's the gist of it. Computers don't
|
||||
understand text. This is unfortunate, because that's what the web almost entirely
|
||||
|
@ -68,267 +30,17 @@ spaCy provides a library of utility functions that help programmers build such
|
|||
products. It's commercial open source software: you can either use it under
|
||||
the AGPL, or you can `buy a commercial license`_ for a one-time fee.
|
||||
|
||||
|
||||
.. _spaCy: https://github.com/honnibal/spaCy/
|
||||
|
||||
.. _Issue Tracker: https://github.com/honnibal/spaCy/issues
|
||||
|
||||
**2015-07-28**: `Version 0.89 released`_
|
||||
|
||||
.. _Version 0.89 released: updates.html
|
||||
|
||||
.. _buy a commercial license: license.html
|
||||
|
||||
Example functionality
|
||||
---------------------
|
||||
|
||||
Let's say you're developing a proofreading tool, or possibly an IDE for
|
||||
writers. You're convinced by Stephen King's advice that `adverbs are not your
|
||||
friend <http://www.brainpickings.org/2013/03/13/stephen-king-on-adverbs/>`_, so
|
||||
you want to **highlight all adverbs**. We'll use one of the examples he finds
|
||||
particularly egregious:
|
||||
|
||||
>>> import spacy.en
|
||||
>>> from spacy.parts_of_speech import ADV
|
||||
>>> # Load the pipeline, and call it with some text.
|
||||
>>> nlp = spacy.en.English()
|
||||
>>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’", tag=True, parse=False)
|
||||
>>> print u''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens)
|
||||
‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
|
||||
|
||||
|
||||
Easy enough --- but the problem is that we've also highlighted "back".
|
||||
While "back" is undoubtedly an adverb, we probably don't want to highlight it.
|
||||
If what we're trying to do is flag dubious stylistic choices, we'll need to
|
||||
refine our logic. It turns out only a certain type of adverb is of interest to
|
||||
us.
|
||||
|
||||
There are lots of ways we might do this, depending on just what words
|
||||
we want to flag. The simplest way to exclude adverbs like "back" and "not"
|
||||
is by word frequency: these words are much more common than the prototypical
|
||||
manner adverbs that the style guides are worried about.
|
||||
|
||||
The :py:attr:`Lexeme.prob` and :py:attr:`Token.prob` attributes give a
|
||||
log probability estimate of the word:
|
||||
|
||||
>>> nlp.vocab[u'back'].prob
|
||||
-7.403977394104004
|
||||
>>> nlp.vocab[u'not'].prob
|
||||
-5.407193660736084
|
||||
>>> nlp.vocab[u'quietly'].prob
|
||||
-11.07155704498291
|
||||
|
||||
(The probability estimate is based on counts from a 3 billion word corpus,
|
||||
smoothed using the `Simple Good-Turing`_ method.)
|
||||
|
||||
.. _`Simple Good-Turing`: http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL02/Code/sgt-gale.pdf
|
||||
|
||||
So we can easily exclude the N most frequent words in English from our adverb
|
||||
marker. Let's try N=1000 for now:
|
||||
|
||||
>>> import spacy.en
|
||||
>>> from spacy.parts_of_speech import ADV
|
||||
>>> nlp = spacy.en.English()
|
||||
>>> # Find log probability of Nth most frequent word
|
||||
>>> probs = [lex.prob for lex in nlp.vocab]
|
||||
>>> probs.sort()
|
||||
>>> is_adverb = lambda tok: tok.pos == ADV and tok.prob < probs[-1000]
|
||||
>>> tokens = nlp(u"‘Give it back,’ he pleaded abjectly, ‘it’s mine.’")
|
||||
>>> print u''.join(tok.string.upper() if is_adverb(tok) else tok.string for tok in tokens)
|
||||
‘Give it back,’ he pleaded ABJECTLY, ‘it’s mine.’
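The frequency cut-off can be tried out without loading the model. The snippet below is a minimal sketch, assuming a plain dict of log probabilities stands in for ``nlp.vocab``. The values for "back", "not" and "quietly" are the ones quoted above; the other entries and the helper names ``nth_highest`` and ``is_rare`` are invented for the illustration.

```python
# Toy stand-in for nlp.vocab: word -> log probability.
# 'back', 'not' and 'quietly' use the values quoted earlier;
# the other entries are invented for illustration.
log_probs = {
    u'not': -5.407193660736084,
    u'back': -7.403977394104004,
    u'quietly': -11.07155704498291,
    u'the': -3.5,        # invented
    u'abjectly': -17.0,  # invented
}


def nth_highest(values, n):
    """Return the n-th highest value (n=1 is the maximum)."""
    return sorted(values)[-n]


def is_rare(word, threshold):
    """True if the word is less probable than the threshold."""
    return log_probs[word] < threshold


# With a real vocabulary you would use n=1000; the toy table only has
# five entries, so cut off after the 3 most frequent words instead.
threshold = nth_highest(log_probs.values(), 3)
rare_words = sorted(w for w in log_probs if is_rare(w, threshold))
print(rare_words)
```

Because the comparison is strict, the word sitting exactly at the threshold ("back" in this toy table) is treated as frequent and left unflagged.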
|
||||
|
||||
There are lots of other ways we could refine the logic, depending on just what
|
||||
words we want to flag. Let's say we wanted to only flag adverbs that modified words
|
||||
similar to "pleaded". This is easy to do, as spaCy loads a vector-space
|
||||
representation for every word (by default, the vectors produced by
|
||||
`Levy and Goldberg (2014)`_). Naturally, the vector is provided as a numpy
|
||||
array:
|
||||
|
||||
>>> pleaded = tokens[7]
|
||||
>>> pleaded.repvec.shape
|
||||
(300,)
|
||||
>>> pleaded.repvec[:5]
|
||||
array([ 0.04229792, 0.07459262, 0.00820188, -0.02181299, 0.07519238], dtype=float32)
|
||||
|
||||
.. _Levy and Goldberg (2014): https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
|
||||
|
||||
We want to sort the words in our vocabulary by their similarity to "pleaded".
|
||||
There are lots of ways to measure the similarity of two vectors. We'll use the
|
||||
cosine metric:
|
||||
|
||||
>>> from numpy import dot
|
||||
>>> from numpy.linalg import norm
|
||||
|
||||
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
|
||||
>>> words = [w for w in nlp.vocab if w.has_repvec]
|
||||
>>> words.sort(key=lambda w: cosine(w.repvec, pleaded.repvec))
|
||||
>>> words.reverse()
|
||||
>>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
|
||||
1-20 pleaded, pled, plead, confessed, interceded, pleads, testified, conspired, motioned, demurred, countersued, remonstrated, begged, apologised, consented, acquiesced, petitioned, quarreled, appealed, pleading
|
||||
>>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
|
||||
50-60 counselled, bragged, backtracked, caucused, refiled, dueled, mused, dissented, yearned, confesses
|
||||
>>> print('100-110', ', '.join(w.orth_ for w in words[100:110]))
|
||||
100-110 cabled, ducked, sentenced, perjured, absconded, bargained, overstayed, clerked, confided, sympathizes
|
||||
>>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
|
||||
1000-1010 scorned, baled, righted, requested, swindled, posited, firebombed, slimed, deferred, sagged
|
||||
>>> print('50000-50010', ', '.join(w.orth_ for w in words[50000:50010]))
|
||||
50000-50010 fb, ford, systems, puck, anglers, ik, tabloid, dirty, rims, artists
|
||||
|
||||
As you can see, the similarity model that these vectors give us is excellent
|
||||
--- we're still getting meaningful results at 1000 words, off a single
|
||||
prototype! The only problem is that the list really contains two clusters of
|
||||
words: one associated with the legal meaning of "pleaded", and one for the more
|
||||
general sense. Sorting out these clusters is an area of active research.
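To make the ranking step concrete without the 300-dimensional vectors, here is a small pure-Python sketch of the same cosine sort. The three-dimensional vectors are invented, and ``rank_by_similarity`` is a hypothetical helper, not part of spaCy.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def rank_by_similarity(vectors, target):
    """Return the words sorted from most to least similar to target."""
    return sorted(vectors, key=lambda w: cosine(vectors[w], target),
                  reverse=True)


# Invented 3-d vectors; real spaCy vectors have 300 dimensions.
vectors = {
    'pleaded': [0.9, 0.1, 0.0],
    'begged': [0.8, 0.2, 0.1],
    'tabloid': [0.0, 0.1, 0.9],
}
ranking = rank_by_similarity(vectors, vectors['pleaded'])
print(ranking)
```

Passing ``reverse=True`` to ``sorted`` does the job of the separate ``words.reverse()`` call in the session above.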
|
||||
|
||||
|
||||
A simple work-around is to average the vectors of several words, and use that
|
||||
as our target:
|
||||
|
||||
>>> say_verbs = ['pleaded', 'confessed', 'remonstrated', 'begged', 'bragged', 'confided', 'requested']
|
||||
>>> say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)
|
||||
>>> words.sort(key=lambda w: cosine(w.repvec, say_vector))
|
||||
>>> words.reverse()
|
||||
>>> print('1-20', ', '.join(w.orth_ for w in words[0:20]))
|
||||
1-20 bragged, remonstrated, enquired, demurred, sighed, mused, intimated, retorted, entreated, motioned, ranted, confided, countersued, gestured, implored, interceded, muttered, marvelled, bickered, despaired
|
||||
>>> print('50-60', ', '.join(w.orth_ for w in words[50:60]))
|
||||
50-60 flaunted, quarrelled, ingratiated, vouched, agonized, apologised, lunched, joked, chafed, schemed
|
||||
>>> print('1000-1010', ', '.join(w.orth_ for w in words[1000:1010]))
|
||||
1000-1010 hoarded, waded, ensnared, clamoring, abided, deploring, shriveled, endeared, rethought, berate
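The averaging step is just an element-wise mean. A minimal sketch, assuming plain Python lists in place of numpy arrays; the two-dimensional vectors and the ``mean_vector`` helper are invented for the illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / float(count) for column in zip(*vectors)]


# Invented 2-d vectors standing in for the repvec of each verb.
say_vectors = [
    [1.0, 0.0],   # e.g. a verb used mostly in one sense
    [0.0, 1.0],   # e.g. a verb used mostly in another sense
    [0.5, 0.5],
]
say_vector = mean_vector(say_vectors)
print(say_vector)
```

The mean lands between the individual vectors, which is why an averaged prototype is less tied to any single word's dominant sense.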
|
||||
|
||||
These definitely look like words that King might scold a writer for attaching
|
||||
adverbs to. Recall that our original adverb highlighting function looked like
|
||||
this:
|
||||
|
||||
>>> import spacy.en
|
||||
>>> from spacy.parts_of_speech import ADV
|
||||
>>> # Load the pipeline, and call it with some text.
|
||||
>>> nlp = spacy.en.English()
|
||||
>>> tokens = nlp("‘Give it back,’ he pleaded abjectly, ‘it’s mine.’",
|
||||
...              tag=True, parse=False)
|
||||
>>> print(''.join(tok.string.upper() if tok.pos == ADV else tok.string for tok in tokens))
|
||||
‘Give it BACK,’ he pleaded ABJECTLY, ‘it’s mine.’
|
||||
|
||||
|
||||
|
||||
We wanted to refine the logic so that only adverbs modifying evocative verbs
|
||||
of communication, like "pleaded", were highlighted. We've now built a vector that
|
||||
represents that type of word, so now we can highlight adverbs based on
|
||||
subtle logic, homing in on adverbs that seem the most stylistically
|
||||
problematic, given our starting assumptions:
|
||||
|
||||
>>> import numpy
|
||||
>>> from numpy import dot
|
||||
>>> from numpy.linalg import norm
|
||||
>>> import spacy.en
|
||||
>>> from spacy.parts_of_speech import ADV, VERB
|
||||
>>> cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
|
||||
>>> def is_bad_adverb(token, target_verb, tol):
...     if token.pos != ADV:
...         return False
...     elif token.head.pos != VERB:
...         return False
...     elif cosine(token.head.repvec, target_verb) < tol:
...         return False
...     else:
...         return True
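To see the function make a decision, here is a self-contained sketch that runs it over stand-in tokens. Everything except the body of ``is_bad_adverb`` is an assumption for the demo: ``FakeToken``, the string-valued ``ADV`` and ``VERB`` constants, the two-dimensional vectors and the tolerance of 0.5 are all invented, and a real run would use spaCy tokens and the averaged vector built earlier.

```python
import math
from collections import namedtuple

# Stand-ins for spaCy's part-of-speech IDs and Token class; hypothetical.
ADV, VERB = 'ADV', 'VERB'
FakeToken = namedtuple('FakeToken', ['orth_', 'pos', 'repvec', 'head'])


def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = (math.sqrt(sum(a * a for a in v1)) *
             math.sqrt(sum(b * b for b in v2)))
    return dot / norms


def is_bad_adverb(token, target_verb, tol):
    """Flag adverbs whose head is a verb close to the target vector."""
    if token.pos != ADV:
        return False
    elif token.head.pos != VERB:
        return False
    elif cosine(token.head.repvec, target_verb) < tol:
        return False
    else:
        return True


say_vector = [1.0, 0.0]  # invented prototype vector
pleaded = FakeToken('pleaded', VERB, [0.9, 0.1], None)
give = FakeToken('Give', VERB, [0.1, 0.9], None)
abjectly = FakeToken('abjectly', ADV, [0.0, 0.0], pleaded)
back = FakeToken('back', ADV, [0.0, 0.0], give)

flagged = [t.orth_ for t in (pleaded, give, abjectly, back)
           if is_bad_adverb(t, say_vector, tol=0.5)]
print(flagged)
```

In this toy setup only the adverb attached to the verb near the prototype is flagged; the right tolerance for real vectors would need tuning on your own data.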
|
||||
|
||||
|
||||
This example was somewhat contrived --- and, truth be told, I've never really
|
||||
bought the idea that adverbs were a grave stylistic sin. But hopefully it got
|
||||
the message across: the state-of-the-art NLP technologies are very powerful.
|
||||
spaCy gives you easy and efficient access to them, which lets you build all
|
||||
sorts of useful products and features that were previously impossible.
|
||||
|
||||
|
||||
Independent Evaluation
|
||||
----------------------
|
||||
|
||||
.. table:: Independent evaluation by Yahoo! Labs and Emory
    University, to appear at ACL 2015. Higher is better.

    +----------------+------------+------------+------------+
    | System         | Language   | Accuracy   | Speed      |
    +----------------+------------+------------+------------+
    | spaCy v0.86    | Cython     | 91.9       | **13,963** |
    +----------------+------------+------------+------------+
    | ClearNLP       | Java       | 91.7       | 10,271     |
    +----------------+------------+------------+------------+
    | spaCy v0.84    | Cython     | 90.9       | 13,963     |
    +----------------+------------+------------+------------+
    | CoreNLP        | Java       | 89.6       | 8,602      |
    +----------------+------------+------------+------------+
    | MATE           | Java       | **92.5**   | 550        |
    +----------------+------------+------------+------------+
    | Turbo          | C++        | 92.4       | 349        |
    +----------------+------------+------------+------------+
    | Yara           | Java       | 92.3       | 340        |
    +----------------+------------+------------+------------+
|
||||
|
||||
|
||||
Accuracy is % unlabelled arcs correct, speed is tokens per second.
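For readers unfamiliar with the metric, here is a minimal sketch of how an unlabelled arc score can be computed. It assumes each token's head is given as an index, with -1 marking the root; the example parse and the function name are invented, and published evaluations typically add conventions (such as how punctuation is handled) that this sketch ignores.

```python
def unlabelled_accuracy(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted head matches the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError('head lists must have the same length')
    correct = sum(1 for g, p in zip(gold_heads, predicted_heads) if g == p)
    return 100.0 * correct / len(gold_heads)


# Invented example: head index for each of five tokens (-1 marks the root).
gold = [1, -1, 1, 4, 1]
predicted = [1, -1, 1, 1, 1]
print(unlabelled_accuracy(gold, predicted))
```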
|
||||
|
||||
Joel Tetreault and Amanda Stent (Yahoo! Labs) and Jin-ho Choi (Emory) performed
|
||||
a detailed comparison of the best parsers available. All numbers above
|
||||
are taken from the pre-print they kindly made available to me,
|
||||
except for spaCy v0.86.
|
||||
|
||||
I'm particularly grateful to the authors for discussion of their results, which
|
||||
led to the improvement in accuracy between v0.84 and v0.86. A tip from Jin-ho
|
||||
(developer of ClearNLP) was particularly useful.
|
||||
|
||||
|
||||
Detailed Speed Comparison
|
||||
-------------------------
|
||||
|
||||
**Set up**: 100,000 plain-text documents were streamed from an SQLite3
|
||||
database, and processed with an NLP library, to one of three levels of detail
|
||||
--- tokenization, tagging, or parsing. The tasks are additive: to parse the
|
||||
text you have to tokenize and tag it. The pre-processing was not subtracted
|
||||
from the times --- I report the time required for the pipeline to complete.
|
||||
I report mean times per document, in milliseconds.
|
||||
|
||||
**Hardware**: Intel i7-3770 (2012)
|
||||
|
||||
.. table:: Per-document processing times. Lower is better.

    +--------------+---------------------------+--------------------------------+
    |              | Absolute (ms per doc)     | Relative (to spaCy)            |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | System       | Tokenize | Tag    | Parse | Tokenize | Tag     | Parse     |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | spaCy        | 0.2ms    | 1ms    | 19ms  | 1x       | 1x      | 1x        |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | CoreNLP      | 2ms      | 10ms   | 49ms  | 10x      | 10x     | 2.6x      |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | ZPar         | 1ms      | 8ms    | 850ms | 5x       | 8x      | 44.7x     |
    +--------------+----------+--------+-------+----------+---------+-----------+
    | NLTK         | 4ms      | 443ms  | n/a   | 20x      | 443x    | n/a       |
    +--------------+----------+--------+-------+----------+---------+-----------+
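The relative columns are simply each system's time divided by spaCy's time for the same task. A short sketch that recomputes them from the absolute numbers in the table; the dict layout and the ``relative_to`` helper are my own scaffolding.

```python
# Absolute per-document times in milliseconds, copied from the table.
# None marks a task the library was not measured on.
times_ms = {
    'spaCy':   {'tokenize': 0.2, 'tag': 1.0,   'parse': 19.0},
    'CoreNLP': {'tokenize': 2.0, 'tag': 10.0,  'parse': 49.0},
    'ZPar':    {'tokenize': 1.0, 'tag': 8.0,   'parse': 850.0},
    'NLTK':    {'tokenize': 4.0, 'tag': 443.0, 'parse': None},
}


def relative_to(baseline, system, task):
    """Return how many times slower `system` is than `baseline`."""
    value = times_ms[system][task]
    if value is None:
        return None
    return round(value / times_ms[baseline][task], 1)


for system in ('CoreNLP', 'ZPar', 'NLTK'):
    ratios = [relative_to('spaCy', system, task)
              for task in ('tokenize', 'tag', 'parse')]
    print(system, ratios)
```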
|
||||
|
||||
|
||||
Efficiency is a major concern for NLP applications. It is very common to hear
|
||||
people say that they cannot afford more detailed processing, because their
|
||||
datasets are too large. This is a bad position to be in. If you can't apply
|
||||
detailed processing, you generally have to cobble together various heuristics.
|
||||
This normally takes a few iterations, and what you come up with will usually be
|
||||
brittle and difficult to reason about.
|
||||
|
||||
spaCy's parser is faster than most taggers, and its tokenizer is fast enough
|
||||
for any workload. And the tokenizer doesn't just give you a list
|
||||
of strings. A spaCy token is a pointer to a Lexeme struct, from which you can
|
||||
access a wide range of pre-computed features, including embedded word
|
||||
representations.
|
||||
|
||||
.. I wrote spaCy because I think existing commercial NLP engines are crap.
|
||||
Alchemy API are a typical example. Check out this part of their terms of
|
||||
service:
|
||||
publish or perform any benchmark or performance tests or analysis relating to
|
||||
the Service or the use thereof without express authorization from AlchemyAPI;
|
||||
|
||||
.. Did you get that? You're not allowed to evaluate how well their system works,
|
||||
unless you're granted a special exception. Their system must be pretty
|
||||
terrible to motivate such an embarrassing restriction.
|
||||
They must know this makes them look bad, but they apparently believe allowing
|
||||
you to evaluate their product would make them look even worse!
|
||||
|
||||
.. spaCy is based on science, not alchemy. It's open source, and I am happy to
|
||||
clarify any detail of the algorithms I've implemented.
|
||||
It's evaluated against the current best published systems, following the standard
|
||||
methodologies. These evaluations show that it performs extremely well.
|
||||
.. See `Benchmarks`_ for details.
|
||||
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 4
|
||||
:hidden:
|
||||
|
|
|
@ -59,6 +59,7 @@ and a small usage snippet.
|
|||
using/document.rst
|
||||
using/span.rst
|
||||
using/token.rst
|
||||
using/lexeme.rst
|
||||
|
||||
|
||||
.. _English: processing.html
|
||||
|
@ -69,6 +70,8 @@ and a small usage snippet.
|
|||
|
||||
.. _Span: using/span.html
|
||||
|
||||
.. _Lexeme: using/lexeme.html
|
||||
|
||||
.. _Vocab: lookup.html
|
||||
|
||||
.. _StringStore: lookup.html
|
||||
|
@ -79,8 +82,6 @@ and a small usage snippet.
|
|||
|
||||
.. _Parser: processing.html
|
||||
|
||||
.. _Lexeme: lookup.html
|
||||
|
||||
.. _Scorer: misc.html
|
||||
|
||||
.. _GoldParse: misc.html
|
||||
|
|
|
@ -6,7 +6,7 @@ The Doc Object
|
|||
|
||||
:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
|
||||
The Tokens class behaves as a Python sequence, supporting the usual operators,
|
||||
len(), etc. Negative indexing is supported. Slices are not yet.
|
||||
len(), etc. Negative indexing is supported. Slices are supported as of v0.89.
|
||||
|
||||
.. code::
|
||||
|
||||
|
@ -15,14 +15,17 @@ The Doc Object
|
|||
u'Zero'
|
||||
>>> tokens[-1].orth_
|
||||
u'six'
|
||||
>>> tokens[0:4]
|
||||
Error
|
||||
>>> span = tokens[0:4]
|
||||
>>> [w.orth_ for w in span]
|
||||
[u'Zero', u'one', u'two', u'three']
|
||||
>>> span.string
|
||||
u'Zero one two three'
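If you want to see how a container can return a single item for an integer and a span-like view for a slice, the following toy classes sketch the idea. ``ToyDoc`` and ``ToySpan`` are hypothetical and say nothing about how spaCy implements this internally; they only mimic the behaviour shown in the session above.

```python
class ToySpan(object):
    """A view onto a contiguous run of words (hypothetical, not spaCy's Span)."""

    def __init__(self, words, start, end):
        self.words = words
        self.start = start
        self.end = end

    def __iter__(self):
        return iter(self.words[self.start:self.end])

    def __len__(self):
        return self.end - self.start

    @property
    def string(self):
        return u' '.join(self)


class ToyDoc(object):
    """Sequence of words: integers index a word, slices return a ToySpan."""

    def __init__(self, text):
        self.words = text.split()

    def __len__(self):
        return len(self.words)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices() clamps the bounds and resolves negatives.
            start, end, step = key.indices(len(self.words))
            if step != 1:
                raise ValueError('stepped slices are not supported')
            return ToySpan(self.words, start, max(start, end))
        return self.words[key]  # negative integers work as usual


tokens = ToyDoc(u'Zero one two three four five six')
span = tokens[0:4]
print(list(span), span.string, tokens[-1])
```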
|
||||
|
||||
:code:`sents`
|
||||
Iterate over sentences in the document.
|
||||
Iterate over sentences in the document. Each sentence is a Span object.
|
||||
|
||||
:code:`ents`
|
||||
Iterate over entities in the document.
|
||||
Iterate over entities in the document. Each entity is a Span object.
|
||||
|
||||
:code:`to_array`
|
||||
Given a list of M attribute IDs, export the tokens to a numpy ndarray
|
||||
|
@ -55,8 +58,36 @@ The Doc Object
|
|||
Merge a multi-word expression into a single token. Currently
|
||||
experimental; API is likely to change.
|
||||
|
||||
:code:`to_bytes()`
|
||||
Get a byte-string representation of the document, i.e. serialize.
|
||||
|
||||
:code:`from_bytes(self, byte_string)`
|
||||
Load data from a byte-string, i.e. deserialize.
|
||||
|
||||
:code:`Doc.read_bytes`
|
||||
A staticmethod, used to read bytes from a file.
|
||||
|
||||
|
||||
Example of serialization:
|
||||
|
||||
::
|
||||
|
||||
    doc1 = EN(u'This is a simple test. With a couple of sentences.')
    doc2 = EN(u'This is another test document.')

    with open('/tmp/spacy_docs.bin', 'wb') as file_:
        file_.write(doc1.to_bytes())
        file_.write(doc2.to_bytes())

    with open('/tmp/spacy_docs.bin', 'rb') as file_:
        bytes1, bytes2 = Doc.read_bytes(file_)
        r1 = Doc(EN.vocab).from_bytes(bytes1)
        r2 = Doc(EN.vocab).from_bytes(bytes2)

    assert r1.string == doc1.string
    assert r2.string == doc2.string
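Writing several byte strings back to back only works if the reader can tell where each one ends. One common way to do that is a length prefix, sketched below with the standard library. This is an illustration of the general technique, not a description of spaCy's actual on-disk format; ``write_record`` and ``read_records`` are hypothetical helpers.

```python
import io
import struct


def write_record(file_, payload):
    """Write a 4-byte little-endian length header, then the payload."""
    file_.write(struct.pack('<I', len(payload)))
    file_.write(payload)


def read_records(file_):
    """Yield each payload from a stream written by write_record."""
    while True:
        header = file_.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError('truncated length header')
        (length,) = struct.unpack('<I', header)
        payload = file_.read(length)
        if len(payload) < length:
            raise ValueError('truncated record')
        yield payload


# An in-memory buffer stands in for the file on disk.
buffer_ = io.BytesIO()
write_record(buffer_, b'first document')
write_record(buffer_, b'second document')
buffer_.seek(0)
records = list(read_records(buffer_))
print(records)
```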
|
||||
|
||||
|
||||
Internals
|
||||
A Tokens instance stores the annotations in a C-array of `TokenC` structs.
|
||||
Each TokenC struct holds a const pointer to a LexemeC struct, which describes
|
||||
|
@ -66,5 +97,4 @@ Internals
|
|||
|
||||
For faster access, the underlying C data can be accessed from Cython. You
|
||||
can also export the data to a numpy array, via `Tokens.to_array`, if pure Python
|
||||
access is required, and you need slightly better performance. However, this
|
||||
is both slower and has a worse API than Cython access.
|
||||
access is required, and you need slightly better performance.
|
||||
|
|
|
@ -53,6 +53,41 @@ string-typed.
|
|||
whitespace**. This is useful when you need to use linguistic features to
|
||||
add inline mark-up to the string.
|
||||
|
||||
**Boolean Features**
|
||||
|
||||
:code:`is_oov`
|
||||
Is the word out-of-vocabulary?
|
||||
|
||||
:code:`is_alpha`
|
||||
Equivalent to `word.orth_.isalpha()`
|
||||
|
||||
:code:`is_ascii`
|
||||
Equivalent to `all(ord(c) < 128 for c in word.orth_)`
|
||||
|
||||
:code:`is_digit`
|
||||
Equivalent to `word.orth_.isdigit()`
|
||||
|
||||
:code:`is_lower`
|
||||
Equivalent to `word.orth_.islower()`
|
||||
|
||||
:code:`is_title`
|
||||
Equivalent to `word.orth_.istitle()`
|
||||
|
||||
:code:`is_punct`
|
||||
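True if the word consists of punctuation characters. (Python strings have no `ispunct()` method, so there is no direct built-in equivalent.)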
|
||||
|
||||
:code:`is_space`
|
||||
Equivalent to `word.orth_.isspace()`
|
||||
|
||||
:code:`like_url`
|
||||
Does the word resemble a URL?
|
||||
|
||||
:code:`like_num`
|
||||
Does the word represent a number? e.g. "10.9", "10", "ten", etc.
|
||||
|
||||
:code:`like_email`
|
||||
Does the word resemble an email?
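The string-based flags can be mirrored in plain Python, which is handy for checking what a flag should return for a given word. The sketch below is an approximation under my own assumptions: ``is_ascii`` and ``is_punct`` are hypothetical helpers, and the punctuation check only knows about ASCII punctuation.

```python
import string


def is_ascii(text):
    """True if every character is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in text)


def is_punct(text):
    """True if the text is non-empty and made only of ASCII punctuation.

    Python strings have no ispunct() method, so this is one possible
    approximation; Unicode punctuation such as curly quotes is not covered.
    """
    return bool(text) and all(c in string.punctuation for c in text)


words = [u'hello', u'caf\xe9', u'...', u'10']
# One tuple per word: (is_alpha, is_ascii, is_digit, is_punct)
flags = [(w.isalpha(), is_ascii(w), w.isdigit(), is_punct(w)) for w in words]
print(flags)
```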
|
||||
|
||||
|
||||
**Distributional Features**
|
||||
|
||||
|
@ -115,6 +150,12 @@ string-typed.
|
|||
An iterator for the part of the sentence syntactically governed by the
|
||||
word, including the word itself.
|
||||
|
||||
:code:`left_edge`
|
||||
The leftmost descendant of the word's subtree. Equivalent to `list(word.subtree)[0]`
|
||||
|
||||
:code:`right_edge`
|
||||
The rightmost descendant of the word's subtree. Equivalent to `list(word.subtree)[-1]`
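To illustrate what the two edge attributes pick out, here is a small sketch over a hand-written dependency tree. It assumes the parse is given as a list of head indices with the root pointing to itself; the sentence, the heads and the ``subtree`` helper are invented, and this is not how spaCy computes the values.

```python
def subtree(heads, index):
    """Indices of the word at `index` and all its descendants, in order."""
    members = set([index])
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in members and child not in members:
                members.add(child)
                changed = True
    return sorted(members)


# Invented parse of "The quick fox jumped over the dog":
# heads[i] is the index of word i's head; the root points to itself.
words = ['The', 'quick', 'fox', 'jumped', 'over', 'the', 'dog']
heads = [2, 2, 3, 3, 3, 6, 4]

fox_subtree = subtree(heads, 2)
left_edge = words[fox_subtree[0]]
right_edge = words[fox_subtree[-1]]
print(left_edge, right_edge)
```

Here the subtree of "fox" covers "The quick fox", so its left edge is "The" and its right edge is "fox" itself.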
|
||||
|
||||
|
||||
**Named Entities**
|
||||
|
||||
|
|
|
@ -10,18 +10,83 @@ To update your installation:
|
|||
|
||||
Most updates ship a new model, so you will usually have to redownload the data.
|
||||
|
||||
v0.89
|
||||
-----
|
||||
2015-07-28 v0.89
|
||||
----------------
|
||||
|
||||
Major update!
|
||||
|
||||
* Support efficient binary serialization. The dependency tree,
|
||||
part-of-speech tags, named entities, tokenization and text can be dumped to a
|
||||
byte string smaller than the original text representation. Serialization is
|
||||
lossless, so there's no need to separately store the original text.
|
||||
|
||||
Serialize:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
byte_string = doc.to_bytes()
|
||||
|
||||
Deserialize by first creating a Doc object, and then loading the bytes:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
doc = Doc(nlp.vocab)
|
||||
doc.from_bytes(byte_string)
|
||||
|
||||
If you have a binary file with several parses saved, you can iterate over
|
||||
them using the staticmethod `Doc.read_bytes`. Putting it all together:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    from spacy.en import English
    # Import path for Doc is assumed; adjust it to your spaCy version.
    from spacy.tokens import Doc

    def serialize(nlp, texts, out_loc):
        # Parse each text and append its byte string to one binary file.
        with open(out_loc, 'wb') as out_file:
            for text in texts:
                doc = nlp(text)
                out_file.write(doc.to_bytes())

    def deserialize(nlp, file_loc):
        # Read the byte strings back and rebuild one Doc per record.
        docs = []
        with open(file_loc, 'rb') as read_file:
            for byte_string in Doc.read_bytes(read_file):
                doc = Doc(nlp.vocab).from_bytes(byte_string)
                docs.append(doc)
        return docs
|
||||
|
||||
|
||||
Full tutorial coming soon.
|
||||
|
||||
|
||||
* Fix probability estimates, and base them off counts from the 2015 Reddit Comments
|
||||
dump. The probability estimates are now very reliable, and out-of-vocabulary
|
||||
words now receive an accurate smoothed probability estimate.
|
||||
|
||||
* Fix regression in parse times on very long texts. Recent versions were
|
||||
calculating parse features in a way that was polynomial in input length.
|
||||
* Add tag SP (coarse tag SPACE) for whitespace tokens. Ensure entity recogniser
|
||||
does not assign entities to whitespace.
|
||||
|
||||
* Allow slicing into the Doc object, so that you can do e.g. doc[2:4]. Returns
|
||||
a Span object.
|
||||
|
||||
* Add tag SP (coarse tag SPACE) for whitespace tokens. Fix bug where
|
||||
whitespace was sometimes marked as an entity.
|
||||
|
||||
* Reduce memory usage. Memory usage now under 2GB per process.
|
||||
|
||||
* Rename :code:`Span.head` to :code:`Span.root`, fix its documentation, and make
|
||||
it more efficient. I considered adding Span.head, Span.dep and Span.dep\_ as
|
||||
well, but for now I leave these as accessible via :code:`Span.root.head`,
|
||||
:code:`Span.root.dep`, and :code:`Span.root.dep\_`, to keep the API smaller.
|
||||
|
||||
* Add boolean features to Token and Lexeme objects.
|
||||
|
||||
* Main parse function now marked **nogil**. This
|
||||
means I'll be able to add a Worker class that allows multi-threaded
|
||||
processing. This will be available in the next version. In the meantime,
|
||||
you should continue to use multiprocessing for parallelization.
|
||||
|
||||
|
||||
2015-07-08 v0.88
|
||||
----------------
|
||||
|
|