mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 18:26:30 +03:00
* Upd docs
parent
096ef2b199
commit
0c6402ab73
|
@ -1,71 +1,106 @@
|
||||||
Overview
|
Don't Settle for a List of Strings
|
||||||
========
|
==================================
|
||||||
|
|
||||||
What and Why
|
|
||||||
------------
|
|
||||||
|
|
||||||
spaCy is a lightning-fast, full-cream NLP tokenizer and lexicon.
|
*"Other NLP tokenizers return lists of strings, which is downright
|
||||||
|
barbaric."* --- me
|
||||||
|
|
||||||
Most tokenizers give you a sequence of strings. That's barbaric.
|
|
||||||
Giving you strings invites you to compute on every *token*, when what
|
|
||||||
you should be doing is computing on every *type*. Remember
|
|
||||||
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_: you'll
|
|
||||||
see exponentially fewer types than tokens.
|
|
||||||
|
|
||||||
Instead of strings, spaCy gives you references to Lexeme objects, from which you
|
spaCy splits text into a list of lexical types, which come with a variety of
|
||||||
can access an excellent set of pre-computed orthographic and distributional features:
|
features pre-computed. It's designed to **make the right thing easy**, where the right
|
||||||
|
thing is:
|
||||||
|
|
||||||
::
|
* A global vocabulary store;
|
||||||
|
|
||||||
>>> from spacy import en
|
* Cached orthographic features;
|
||||||
>>> apples, are, nt, oranges, dots = en.EN.tokenize(u"Apples aren't oranges...")
|
|
||||||
>>> are.prob >= oranges.prob
|
|
||||||
True
|
|
||||||
>>> apples.check_flag(en.IS_TITLE)
|
|
||||||
True
|
|
||||||
>>> apples.check_flag(en.OFT_TITLE)
|
|
||||||
False
|
|
||||||
>>> are.check_flag(en.CAN_NOUN)
|
|
||||||
False
|
|
||||||
|
|
||||||
spaCy makes it easy to write very efficient NLP applications, because your feature
|
* Clever use of distributional data.
|
||||||
functions have to do almost no work: almost every lexical property you'll want
|
|
||||||
is pre-computed for you. See the tutorial for an example POS tagger.
|
Let's say you're writing an entity tagger for English. Case distinctions are an
|
||||||
|
important feature here: you need to know whether the word you're tagging is
|
||||||
|
upper-cased, lower-cased, title-cased, non-alphabetic, etc.
|
||||||
|
The right thing is to call the string.isupper(), string.islower(), string.isalpha()
|
||||||
|
etc functions once for every *type* in your vocabulary, instead
|
||||||
|
of once for every *token* in the text you're tagging.
|
||||||
|
When you encounter a new word, you want to create a lexeme object, calculate its
|
||||||
|
features, and save it.
|
||||||
|
|
||||||
Benchmark
|
That's the *right* way to do it, so it's what spaCy does for you.
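
The per-type caching described above can be sketched roughly as follows. This is a minimal illustration, not spaCy's implementation: the ``Lexeme`` class, the ``vocab`` dictionary, and the ``lookup`` and ``tokenize`` helpers are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
class Lexeme:
    """Hypothetical stand-in for a lexical type (not spaCy's own class)."""

    def __init__(self, string):
        self.string = string
        # Each case feature is computed exactly once, when the type is created.
        self.is_upper = string.isupper()
        self.is_lower = string.islower()
        self.is_title = string.istitle()
        self.is_alpha = string.isalpha()


# Global vocabulary store: one Lexeme per distinct string.
vocab = {}


def lookup(string):
    """Return the cached Lexeme for this string, creating it on first sight."""
    lexeme = vocab.get(string)
    if lexeme is None:
        lexeme = Lexeme(string)
        vocab[string] = lexeme
    return lexeme


def tokenize(text):
    """Split on whitespace (a simplification) and return Lexeme references."""
    return [lookup(word) for word in text.split()]


tokens = tokenize("the cat saw the other cat")
```

Repeated words come back as references to the same object, so a feature function only has to read an attribute instead of re-deriving it for every token.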
|
||||||
---------
|
|
||||||
|
|
||||||
The tokenizer itself is also very efficient:
|
Other tokenizers give you a list of strings, which makes it really easy to do
|
||||||
|
the wrong thing. And the wrong thing isn't just a little bit worse: it's
|
||||||
|
**exponentially** worse, because of
|
||||||
|
`Zipf's law <http://en.wikipedia.org/wiki/Zipf's_law>`_.
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<center>
|
||||||
|
<figure>
|
||||||
|
<embed
|
||||||
|
width="650em" height="auto"
|
||||||
|
type="image/svg+xml" src="chart.svg"/>
|
||||||
|
</figure>
|
||||||
|
</center>
|
||||||
|
|
||||||
|
Over the Gigaword corpus, if you compute some feature on a per-token basis, you'll
|
||||||
|
make **500x more calls** to that function than if you had computed it on a per-type
|
||||||
|
basis.
|
||||||
|
(Mouse-over a line to see its value at that point. And yes, it's a bit snarky
|
||||||
|
to present the graph in a linear scale --- but it isn't misleading.)
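
To see where the gap between per-token and per-type computation comes from, here is a small sketch that counts how many feature calls each strategy would make. The ``count_calls`` helper and the toy sentence are assumptions for illustration only; the ratio on a real corpus such as Gigaword is far larger than on this example.

```python
def count_calls(words):
    """Return (per_token_calls, per_type_calls) for a list of word strings."""
    per_token_calls = len(words)      # one call for every occurrence
    per_type_calls = len(set(words))  # one call for every distinct string
    return per_token_calls, per_type_calls


words = "the cat sat on the mat and the dog sat on the cat".split()
per_token, per_type = count_calls(words)
```

As the text grows, the token count keeps rising linearly while new types appear more and more rarely, which is why the two curves diverge.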
|
||||||
|
|
||||||
|
Zipf's Law also makes distributional information a really powerful source of
|
||||||
|
type-based features. It's really handy to know where a word falls in the language's
|
||||||
|
frequency distribution, especially compared to variants of the word. For instance,
|
||||||
|
we might be processing a Twitter comment that contains the string "nasa". We have
|
||||||
|
little hope of recognising this as an entity except by noting that the string "NASA"
|
||||||
|
is much more common, and that both strings are quite rare.
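
The "nasa" versus "NASA" comparison might look something like the sketch below. The frequency counts are invented for illustration, and ``most_common_casing`` is a hypothetical helper, not part of spaCy.

```python
from collections import Counter, defaultdict


def most_common_casing(counts):
    """Map each lower-cased form to its most frequent observed casing."""
    by_lower = defaultdict(Counter)
    for form, freq in counts.items():
        by_lower[form.lower()][form] += freq
    return {
        lower: variants.most_common(1)[0][0]
        for lower, variants in by_lower.items()
    }


# Made-up corpus counts, purely illustrative.
toy_counts = Counter({"NASA": 40, "nasa": 2, "Nasa": 5, "the": 1000, "The": 120})
canonical = most_common_casing(toy_counts)
```

A tagger that sees the rare string "nasa" can then check whether its dominant casing is upper-case, which is a useful hint that the word is an entity.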
|
||||||
|
|
||||||
|
.. Each spaCy Lexeme comes with a rich, curated set of orthographic and
|
||||||
|
.. distributional features. Different languages get a different set of features,
|
||||||
|
.. to take into account different orthographic conventions and morphological
|
||||||
|
.. complexity. It's also easy to define your own features.
|
||||||
|
|
||||||
|
.. And, of course, we take care to get the details right. Indices into the original
|
||||||
|
.. text are always easy to calculate, so it's easy to, say, mark entities with in-line
|
||||||
|
.. mark-up. You'll also receive tokens for newlines, tabs and other non-space whitespace,
|
||||||
|
.. making it easy to do paragraph and sentence recognition. And, of course, we deal
|
||||||
|
.. smartly with all the random unicode whitespace and punctuation characters you might
|
||||||
|
.. not have thought of.
|
||||||
|
|
||||||
|
|
||||||
|
Benchmarks
|
||||||
|
----------
|
||||||
|
|
||||||
|
Here we ask two things:
|
||||||
|
|
||||||
|
1. How fast is the spaCy tokenizer itself, relative to other tokenizers?
|
||||||
|
|
||||||
|
2. How fast are applications using spaCy's pre-computed lexical features,
|
||||||
|
compared to applications that re-compute their features on every token?
|
||||||
|
|
||||||
+--------+-------+--------------+--------------+
|
+--------+-------+--------------+--------------+
|
||||||
| System | Time | Words/second | Speed Factor |
|
| System | Time | Words/second | Speed Factor |
|
||||||
+--------+-------+--------------+--------------+
|
+--------+-------+--------------+--------------+
|
||||||
| NLTK | 6m4s | 89,000 | 1.00 |
|
| NLTK | 6m4s | 89,000 | 1.00 |
|
||||||
+--------+-------+--------------+--------------+
|
+--------+-------+--------------+--------------+
|
||||||
| spaCy | 9.5s | 3,093,000 | 38.30 |
|
| spaCy | | | |
|
||||||
+--------+-------+--------------+--------------+
|
+--------+-------+--------------+--------------+
|
||||||
|
|
||||||
The comparison refers to 30 million words from the English Gigaword, on
|
|
||||||
a MacBook Air. For context, calling string.split() on the data completes in
|
spaCy uses more memory than a standard tokenizer, but is far more efficient. We
|
||||||
about 5s.
|
compare against the NLTK tokenizer and the Penn Treebank's tokenizer.sed script.
|
||||||
|
We also give the performance of Python's native string.split, for reference.
|
||||||
|
|
||||||
|
|
||||||
Pros and Cons
|
Pros and Cons
|
||||||
-------------
|
-------------
|
||||||
|
|
||||||
Pros:
|
Pros:
|
||||||
|
|
||||||
- All tokens come with indices into the original string
|
- Stuff
|
||||||
- Full unicode support
|
|
||||||
- Extensible to other languages
|
|
||||||
- Batch operations computed efficiently in Cython
|
|
||||||
- Cython API
|
|
||||||
- numpy interoperability
|
|
||||||
|
|
||||||
Cons:
|
Cons:
|
||||||
|
|
||||||
- It's new (released September 2014)
|
- It's new (released September 2014)
|
||||||
- Security concerns, from memory management
|
|
||||||
- Higher memory usage (up to 1gb)
|
- Higher memory usage (up to 1gb)
|
||||||
- More conceptually complicated
|
- More complicated
|
||||||
- Tokenization rules expressed in code, not as data
|
|
||||||
|
|
||||||
|
|
|
@ -6,6 +6,40 @@
|
||||||
spaCy NLP Tokenizer and Lexicon
|
spaCy NLP Tokenizer and Lexicon
|
||||||
================================
|
================================
|
||||||
|
|
||||||
|
spaCy splits a string of natural language into a list of references to lexical types:
|
||||||
|
|
||||||
|
>>> from spacy.en import EN
|
||||||
|
>>> tokens = EN.tokenize(u"Examples aren't easy, are they?")
|
||||||
|
>>> type(tokens[0])
|
||||||
|
spacy.word.Lexeme
|
||||||
|
>>> tokens[1] is tokens[5]
|
||||||
|
True
|
||||||
|
|
||||||
|
Other tokenizers return lists of strings, which is
|
||||||
|
`downright barbaric <guide/overview.html>`__. If you get a list of strings,
|
||||||
|
you have to write all the features yourself, and you'll probably compute them
|
||||||
|
on a per-token basis, instead of a per-type basis. At scale, that's very
|
||||||
|
inefficient.
|
||||||
|
|
||||||
|
spaCy's tokens come with the following orthographic and distributional features
|
||||||
|
pre-computed:
|
||||||
|
|
||||||
|
* Orthographic flags, such as is_alpha, is_digit, is_punct, is_title etc;
|
||||||
|
|
||||||
|
* Useful string transforms, such as canonical casing, word shape, ASCIIfied,
|
||||||
|
etc;
|
||||||
|
|
||||||
|
* Unigram log probability;
|
||||||
|
|
||||||
|
* Brown cluster;
|
||||||
|
|
||||||
|
* can_noun, can_verb etc tag-dictionary;
|
||||||
|
|
||||||
|
* oft_upper, oft_title etc case-behaviour flags.
|
||||||
|
|
||||||
|
The features are up-to-date with current NLP research, but you can replace or
|
||||||
|
augment them if you need to.
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 3
|
:maxdepth: 3
|
||||||
|
|
||||||
|
@ -15,20 +49,24 @@ spaCy NLP Tokenizer and Lexicon
|
||||||
api/index.rst
|
api/index.rst
|
||||||
|
|
||||||
modules/index.rst
|
modules/index.rst
|
||||||
|
|
||||||
|
|
||||||
Source (GitHub)
|
|
||||||
----------------
|
|
||||||
|
|
||||||
http://github.com/honnibal/spaCy
|
|
||||||
|
|
||||||
License
|
License
|
||||||
-------
|
=======
|
||||||
|
|
||||||
Copyright Matthew Honnibal
|
+------------------+------+
|
||||||
|
| Non-commercial | $0 |
|
||||||
|
+------------------+------+
|
||||||
|
| Trial commercial | $0 |
|
||||||
|
+------------------+------+
|
||||||
|
| Full commercial | $500 |
|
||||||
|
+------------------+------+
|
||||||
|
|
||||||
Non-commercial use: $0
|
spaCy is non-free software. Its source is published, but the copyright is
|
||||||
Commercial trial use: $0
|
retained by the author (Matthew Honnibal). Licenses are currently under preparation.
|
||||||
Full commercial license: $500
|
|
||||||
|
There is currently a gap between the output of academic NLP researchers, and
|
||||||
|
the needs of small software companies. I left academia to try to correct this.
|
||||||
|
My idea is that non-commercial and trial commercial use should "feel" just like
|
||||||
|
free software. But, if you do use the code in a commercial product, a small
|
||||||
|
fixed license-fee will apply, in order to fund development.
|
||||||
|
|
||||||
honnibal@gmail.com
|
|
||||||
|
|