* Upd docs

commit 0c6402ab73 (parent 096ef2b199)

@@ -1,71 +1,106 @@
Overview
========

Don't Settle for a List of Strings
==================================

What and Why
------------
spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.

*"Other NLP tokenizers return lists of strings, which is downright
barbaric."* --- me

Most tokenizers give you a sequence of strings. That's barbaric.
Giving you strings invites you to compute on every *token*, when what
you should be doing is computing on every *type*. Remember
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll
see exponentially fewer types than tokens.
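To see what that means in practice, here is a rough, plain-Python illustration
(not part of spaCy; the corpus path is hypothetical) of how few distinct types
back a large stream of tokens::

    # Count tokens vs. types in a whitespace-split corpus.  The file name is
    # hypothetical; any large text file shows the same pattern.
    import io
    from collections import Counter

    with io.open("corpus.txt", encoding="utf8") as f:
        tokens = f.read().split()

    counts = Counter(tokens)
    print("tokens:", len(tokens))                       # work done per *token*
    print("types: ", len(counts))                       # work done per *type*
    print("ratio: ", len(tokens) / float(len(counts)))  # grows with corpus size

The bigger the corpus, the bigger that ratio gets --- which is why per-type
computation wins.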
Instead of strings, spaCy gives you references to Lexeme objects, from which you
can access an excellent set of pre-computed orthographic and distributional
features. spaCy splits text into a list of lexical types, and those types come
with a variety of features pre-computed. It's designed to **make the right thing
easy**, where the right thing is:

* A global vocabulary store;

* Cached orthographic features;

* Clever use of distributional data.

::

    >>> from spacy import en
    >>> apples, are, nt, oranges, dots = en.EN.tokenize(u"Apples aren't oranges...")
    >>> are.prob >= oranges.prob
    True
    >>> apples.check_flag(en.IS_TITLE)
    True
    >>> apples.check_flag(en.OFT_TITLE)
    False
    >>> are.check_flag(en.CAN_NOUN)
    False

spaCy makes it easy to write very efficient NLP applications, because your feature
functions have to do almost no work: almost every lexical property you'll want
is pre-computed for you. See the tutorial for an example POS tagger.
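For instance, a feature function for a tagger can be little more than a few flag
checks. The sketch below is illustrative only --- it assumes the ``en.EN.tokenize``,
``check_flag`` and ``prob`` API shown in the example above, and the feature names
are made up::

    # Illustrative feature function over pre-computed Lexemes, using only the
    # check_flag()/prob API from the example above.  Feature names are made up.
    from spacy import en

    def case_features(lexeme):
        # Each line is a cheap lookup on the cached Lexeme, not a fresh
        # string computation.
        return {
            'is_title': lexeme.check_flag(en.IS_TITLE),
            'oft_title': lexeme.check_flag(en.OFT_TITLE),
            'can_noun': lexeme.check_flag(en.CAN_NOUN),
            'log_prob': lexeme.prob,
        }

    tokens = en.EN.tokenize(u"Apples aren't oranges...")
    print([case_features(tok) for tok in tokens])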
Benchmark
---------

Let's say you're writing an entity tagger for English. Case distinctions are an
important feature here: you need to know whether the word you're tagging is
upper-cased, lower-cased, title-cased, non-alphabetic, etc.

The right thing is to call the string.isupper(), string.islower(), string.isalpha()
etc. functions once for every *type* in your vocabulary, instead of once for every
*token* in the text you're tagging. When you encounter a new word, you want to
create a lexeme object, calculate its features, and save it. That's the *right*
way to do it, so it's what spaCy does for you. The tokenizer itself is also very
efficient; see the benchmarks below.
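In plain Python, the caching pattern looks something like this (a sketch of the
general idea, not spaCy's implementation)::

    # Compute case features once per *type* and reuse them for every *token*.
    case_cache = {}

    def case_features(word):
        feats = case_cache.get(word)
        if feats is None:
            feats = (word.isupper(), word.islower(),
                     word.istitle(), word.isalpha())
            case_cache[word] = feats   # computed once per type
        return feats                   # reused for every later token

    for token in u"The the THE the".split():
        print(token, case_features(token))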
Other tokenizers give you a list of strings, which makes it really easy to do
the wrong thing. And the wrong thing isn't just a little bit worse: it's
**exponentially** worse, because of
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_.
.. raw:: html

    <center>
      <figure>
        <embed width="650em" height="auto"
               type="image/svg+xml" src="chart.svg"/>
      </figure>
    </center>
Over the Gigaword corpus, if you compute some feature on a per-token basis, you'll
make **500x more calls** to that function than if you had computed it on a
per-type basis.

(Mouse-over a line to see its value at that point. And yes, it's a bit snarky
to present the graph in a linear scale --- but it isn't misleading.)

Zipf's Law also makes distributional information a really powerful source of
type-based features. It's really handy to know where a word falls in the language's
frequency distribution, especially compared to variants of the word. For instance,
we might be processing a Twitter comment that contains the string "nasa". We have
little hope of recognising this as an entity except by noting that the string "NASA"
is much more common, and that both strings are quite rare.
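As a sketch of that idea, one could compare the unigram probabilities of a word's
casing variants, reusing the ``tokenize``/``prob`` API from the earlier example
(the helper below is hypothetical)::

    # Hypothetical helper: compare the probability of casing variants, using
    # the EN.tokenize()/.prob API shown earlier.
    from spacy import en

    def casing_variant_probs(word):
        variants = [word.lower(), word.title(), word.upper()]
        # Tokenizing a one-word string gives us the Lexeme for each variant.
        return dict((v, en.EN.tokenize(v)[0].prob) for v in variants)

    # "nasa" itself is rare, but the variant "NASA" is far more probable ---
    # good evidence that we're looking at an unusually-cased entity name.
    print(casing_variant_probs(u"nasa"))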
.. Each spaCy Lexeme comes with a rich, curated set of orthographic and
   distributional features. Different languages get a different set of features,
   to take into account different orthographic conventions and morphological
   complexity. It's also easy to define your own features.

.. And, of course, we take care to get the details right. Indices into the original
   text are always easy to calculate, so it's easy to, say, mark entities with in-line
   mark-up. You'll also receive tokens for newlines, tabs and other non-space whitespace,
   making it easy to do paragraph and sentence recognition. And, of course, we deal
   smartly with all the random unicode whitespace and punctuation characters you might
   not have thought of.
Benchmarks
----------

We ask two things here:

1. How fast is the spaCy tokenizer itself, relative to other tokenizers?

2. How fast are applications using spaCy's pre-computed lexical features,
   compared to applications that re-compute their features on every token?
+--------+-------+--------------+--------------+
| System | Time  | Words/second | Speed Factor |
+========+=======+==============+==============+
| NLTK   | 6m4s  | 89,000       | 1.00         |
+--------+-------+--------------+--------------+
| spaCy  | 9.5s  | 3,093,000    | 38.30        |
+--------+-------+--------------+--------------+

The comparison refers to 30 million words from the English Gigaword, on
a MacBook Air. For context, calling string.split() on the data completes in
about 5s.
spaCy uses more memory than a standard tokenizer, but is far more efficient. We
compare against the NLTK tokenizer and the Penn Treebank's tokenizer.sed script.
We also give the performance of Python's native string.split, for reference.
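A rough version of the timing harness looks like this (a sketch only: the corpus
path is hypothetical, and the numbers above were measured separately)::

    # Rough timing harness for the comparison above.  The corpus file name is
    # hypothetical; results depend heavily on the machine.
    import io
    import time
    import nltk
    from spacy.en import EN

    with io.open("gigaword_sample.txt", encoding="utf8") as f:
        text = f.read()

    for name, tokenize in [("str.split", lambda t: t.split()),
                           ("NLTK", nltk.word_tokenize),
                           ("spaCy", EN.tokenize)]:
        start = time.time()
        tokens = tokenize(text)
        print(name, len(tokens), "tokens in %.1fs" % (time.time() - start))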
Pros and Cons
-------------

Pros:

- All tokens come with indices into the original string
- Full unicode support
- Extensible to other languages
- Batch operations computed efficiently in Cython
- Cython API
- numpy interoperability
- Stuff

Cons:

- It's new (released September 2014)
- Security concerns, from memory management
- Higher memory usage (up to 1GB)
- More conceptually complicated
- Tokenization rules expressed in code, not as data
@@ -6,6 +6,40 @@

spaCy NLP Tokenizer and Lexicon
================================
spaCy splits a string of natural language into a list of references to lexical
types:

>>> from spacy.en import EN
>>> tokens = EN.tokenize(u"Examples aren't easy, are they?")
>>> type(tokens[0])
spacy.word.Lexeme
>>> tokens[1] is tokens[5]   # both tokens of "are" share one Lexeme
True
Other tokenizers return lists of strings, which is
`downright barbaric <guide/overview.html>`__. If you get a list of strings,
you have to write all the features yourself, and you'll probably compute them
on a per-token basis, instead of a per-type basis. At scale, that's very
inefficient.
spaCy's tokens come with the following orthographic and distributional features
pre-computed:

* Orthographic flags, such as is_alpha, is_digit, is_punct, is_title, etc.;

* Useful string transforms, such as canonical casing, word shape, ASCIIfied,
  etc.;

* Unigram log probability;

* Brown cluster;

* can_noun, can_verb etc. tag-dictionary flags;

* oft_upper, oft_title etc. case-behaviour flags.

The features are up-to-date with current NLP research, but you can replace or
augment them if you need to.
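For a taste of what that looks like in code, the sketch below reads a few of
these features through the ``tokenize``/``check_flag``/``prob`` API used in the
overview examples; the flag constants are the ones shown there::

    # Sketch: read a few pre-computed features per lexical type, using the
    # API from the overview examples (flag constants as shown there).
    from spacy import en

    tokens = en.EN.tokenize(u"Examples aren't easy, are they?")
    for tok in tokens:
        print(tok.check_flag(en.IS_TITLE),   # orthographic flag
              tok.check_flag(en.OFT_TITLE),  # case-behaviour flag
              tok.check_flag(en.CAN_NOUN),   # tag-dictionary flag
              tok.prob)                      # unigram log probability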
.. toctree::
    :maxdepth: 3
@@ -16,19 +50,23 @@ spaCy NLP Tokenizer and Lexicon

    modules/index.rst
Source (GitHub)
----------------

http://github.com/honnibal/spaCy
License
-------

Copyright Matthew Honnibal.

spaCy is non-free software. Its source is published, but the copyright is
retained by the author (Matthew Honnibal). Licenses are currently under
preparation.

+------------------+------+
| Non-commercial   | $0   |
+------------------+------+
| Trial commercial | $0   |
+------------------+------+
| Full commercial  | $500 |
+------------------+------+

There is currently a gap between the output of academic NLP researchers and
the needs of small software companies. I left academia to try to correct this.
My idea is that non-commercial and trial commercial use should "feel" just like
free software. But if you do use the code in a commercial product, a small
fixed license fee will apply, in order to fund development.

honnibal@gmail.com