* Rework intro text

This commit is contained in:
Matthew Honnibal 2015-01-25 00:58:52 +11:00
parent 83a7e91f3c
commit 70d4a9dcc5


@@ -8,21 +8,42 @@ spaCy: Industrial-strength NLP
 ==============================
 
 spaCy is a new library for text processing in Python and Cython.
-Most commercial NLP development is based on obsolete
-technology. Over the last 3-5 years, the field has advanced dramatically, but
-only the tech giants have really been able to capitalize. The research is all
-public, but it's been too hard for small companies to read and apply it.
-Many end up relying on `NLTK`_, which is intended primarily as an educational
-resource.
-
-.. _NLTK: https://www.nltk.org/
+I wrote it because I think small companies are terrible at NLP. Or rather:
+small companies are using terrible NLP technology.
+
+To do great NLP, you have to know a little about linguistics, a lot
+about machine learning, and almost everything about the latest research.
+The people who fit this description seldom join small companies, and almost
+never start them. Most are broke --- they've just finished grad school.
+If they don't want to stay in academia, they join Google, IBM, etc.
+
+The net result is that outside of the tech giants, commercial NLP has changed
+little in the last ten years. In academia, it's changed entirely. Amazing
+improvements in quality. Orders of magnitude faster. But the
+academic code is always GPL, undocumented, unuseable, or all three. You could
+implement the ideas yourself, but the papers are hard to read, and training
+data is exorbitantly expensive. So what are you left with? NLTK?
+
+I used to think that the NLP community just needed to do more to communicate
+its findings to software engineers. So I wrote two blog posts, explaining
+`how to write a part-of-speech tagger`_ and `parser`_. Both were very well received,
+and there's been a bit of interest in `my research software`_ --- even though
+it's entirely undocumented, and mostly unuseable to anyone but me.
+
+.. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop
+.. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
+.. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
+
+So six months ago I quit my post-doc, and I've been working day and night on
+spaCy since. I'm now pleased to announce an alpha release.
 
 If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
-It's by far the fastest NLP software available. The full processing pipeline
-completes in 7ms per document, including accurate tagging and parsing. All strings
-are mapped to integer IDs, tokens are linked to embedded word representations,
-and a range of useful features are pre-calculated and cached.
+The full processing pipeline completes in 7ms per document, including accurate
+tagging and parsing. All strings are mapped to integer IDs, tokens are linked
+to embedded word representations, and a range of useful features are pre-calculated
+and cached.
 
 If none of that made any sense to you, here's the gist of it. Computers don't
 understand text. This is unfortunate, because that's what the web almost entirely
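The intro text above says that spaCy maps all strings to integer IDs so that later comparisons and feature lookups work on cheap integers rather than Python strings. The following is a minimal toy sketch of that string-interning idea; the `StringStore` class name and its methods are illustrative assumptions for this example, not spaCy's actual implementation or API.

```python
class StringStore:
    """Toy bidirectional string <-> integer ID mapping (string interning)."""

    def __init__(self):
        self._ids = {}      # string -> integer ID
        self._strings = []  # integer ID -> string

    def __call__(self, string):
        """Return the ID for `string`, interning it on first sight."""
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def decode(self, id_):
        """Recover the original string from its ID."""
        return self._strings[id_]


store = StringStore()
ids = [store(w) for w in "the cat sat on the mat".split()]
# ids == [0, 1, 2, 3, 0, 4]: both occurrences of "the" share ID 0,
# so an equality check or a dict lookup never touches the string again.
```

The payoff of this design is that downstream components (taggers, parsers, feature caches) can key everything on small integers, which is one reason the per-document pipeline cost quoted above can stay so low.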