mirror of
https://github.com/explosion/spaCy.git
synced 2025-07-10 16:22:29 +03:00
Minor copyediting
parent 7bddd15e27
commit 1b79d947b9
@@ -8,12 +8,12 @@ python:
 - "2.7"
 - "3.4"
 
-# command to install dependencies
+# install dependencies
 install:
 - "pip install --upgrade setuptools"
 - "pip install -r requirements.txt"
 - "export PYTHONPATH=`pwd`"
 - "python setup.py build_ext --inplace"
-# command to run tests
+# run tests
 script:
 - py.test tests/
@@ -3,20 +3,18 @@ spaCy
 
 http://honnibal.github.io/spaCy
 
-Fast, state-of-the-art natural language processing pipeline. Commercial licenses available, or use under AGPL.
+A pipeline for fast, state-of-the-art natural language processing. Commercial licenses available, otherwise under AGPL.
 
 Version 0.80 released
 ---------------------
 
 2015-04-13
 
-* Preliminary named entity recognition support. Accuracy is currently
-substantially behind the current state-of-the-art. I'm working on
-improvements.
+* Preliminary support for named-entity recognition. Its accuracy is substantially behind the state-of-the-art. I'm working on improvements.
 
 * Better sentence boundary detection, drawn from the syntactic structure.
 
-* Lots of bug fixes
+* Lots of bug fixes.
 
 
 Supports:
@@ -28,14 +28,14 @@ can access an excellent set of pre-computed orthographic and distributional feat
 >>> are.check_flag(en.CAN_NOUN)
 False
 
-spaCy makes it easy to write very efficient NLP applications, because your feature
+spaCy makes it easy to write efficient NLP applications, because your feature
 functions have to do almost no work: almost every lexical property you'll want
 is pre-computed for you. See the tutorial for an example POS tagger.
 
 Benchmark
 ---------
 
-The tokenizer itself is also very efficient:
+The tokenizer itself is also efficient:
 
 +--------+-------+--------------+--------------+
 | System | Time | Words/second | Speed Factor |
@@ -56,7 +56,7 @@ Pros:
 
 - All tokens come with indices into the original string
 - Full unicode support
-- Extensible to other languages
+- Extendable to other languages
 - Batch operations computed efficiently in Cython
 - Cython API
 - numpy interoperability
@@ -135,7 +135,7 @@ lexical types.
 
 In a sample of text, vocabulary size grows exponentially slower than word
 count. So any computations we can perform over the vocabulary and apply to the
-word count are very efficient.
+word count are efficient.
 
 
 Part-of-speech Tagger
@@ -37,7 +37,7 @@ tokenizer is suitable for production use.
 
 I used to think that the NLP community just needed to do more to communicate
 its findings to software engineers. So I wrote two blog posts, explaining
-`how to write a part-of-speech tagger`_ and `parser`_. Both were very well received,
+`how to write a part-of-speech tagger`_ and `parser`_. Both were well received,
 and there's been a bit of interest in `my research software`_ --- even though
 it's entirely undocumented, and mostly unuseable to anyone but me.
 
@@ -202,7 +202,7 @@ this:
 
 We wanted to refine the logic so that only adverbs modifying evocative verbs
 of communication, like "pleaded", were highlighted. We've now built a vector that
-represents that type of word, so now we can highlight adverbs based on very
+represents that type of word, so now we can highlight adverbs based on
 subtle logic, honing in on adverbs that seem the most stylistically
 problematic, given our starting assumptions:
 
@@ -35,7 +35,7 @@ And if you're ever in acquisition or IPO talks, the story is simple.
 spaCy can also be used as free open-source software, under the Aferro GPL
 license. If you use it this way, you must comply with the AGPL license terms.
 When you distribute your project, or offer it as a network service, you must
-distribute the source-code, and grant users an AGPL license to it.
+distribute the source-code and grant users an AGPL license to it.
 
 
 .. I left academia in June 2014, just when I should have been submitting my first
@@ -7,8 +7,8 @@ Updates
 Five days ago I presented the alpha release of spaCy, a natural language
 processing library that brings state-of-the-art technology to small companies.
 
-spaCy has been very well received, and there are now a lot of eyes on the project.
-Naturally, lots of issues have surfaced. I'm very grateful to those who've reported
+spaCy has been well received, and there are now a lot of eyes on the project.
+Naturally, lots of issues have surfaced. I'm grateful to those who've reported
 them. I've worked hard to address them as quickly as I could.
 
 Bug Fixes
@@ -26,7 +26,7 @@ Bug Fixes
 just store an index into that list, instead of a hash.
 
 * Parse tree navigation API was rough, and buggy.
 The parse-tree navigation API was the last thing I added before v0.3. I've
 now replaced it with something better. The previous API design was flawed,
 and the implementation was buggy --- Token.child() and Token.head were
 sometimes inconsistent.
@@ -108,9 +108,9 @@ input to be segmented into sentences, but with no sentence segmenter. This
 caused a drop in parse accuracy of 4%!
 
 Over the last five days, I've worked hard to correct this. I implemented the
-modifications to the parsing algorithm I had planned, from Dongdong Zhang et al
+modifications to the parsing algorithm I had planned, from Dongdong Zhang et al.
 (2013), and trained and evaluated the parser on raw text, using the version of
-the WSJ distributed by Read et al (2012), and used in Dridan and Oepen's
+the WSJ distributed by Read et al. (2012), and used in Dridan and Oepen's
 experiments.
 
 I'm pleased to say that on the WSJ at least, spaCy 0.4 performs almost exactly