mixin columns(...names)
  tr
    each name in names
      th= name


mixin row(...cells)
  tr
    each cell in cells
      td= cell


mixin comparison(name)
  details&attributes(attributes)
    summary
      h4 #{name}

    block
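//- Usage: +columns(...) renders a table header row, +row(...) a table body row,
//- and +comparison("Title") wraps its block in a collapsible details/summary
//- section, forwarding any attributes (e.g. open=true) to the details element.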
+comparison("Peer-reviewed Evaluations")(open=true)
  p spaCy is committed to rigorous evaluation under standard methodology. Two peer-reviewed papers in 2015 confirm that:

  ol
    li spaCy is the fastest syntactic parser in the world;
    li its accuracy is within 1% of the best available;
    li the few systems that are more accurate are 20× slower or more.

  p spaCy v0.84 was evaluated by researchers at Yahoo! Labs and Emory University, as part of a survey paper benchmarking current state-of-the-art dependency parsers (#[a(href="http://aclweb.org/anthology/P/P15/P15-1038.pdf") Choi et al., 2015]).

  table
    thead
      +columns("System", "Language", "Accuracy (%)", "Speed (words per second)")

    tbody
      +row("spaCy v0.97", "Cython", "91.8", "13,000 (est.)")
      +row("ClearNLP", "Java", "91.7", "10,271")
      +row("CoreNLP", "Java", "89.6", "8,602")
      +row("MATE", "Java", "92.5", "550")
      +row("Turbo", "C++", "92.4", "349")
      +row("Yara", "Java", "92.3", "340")

  p Discussion with the authors led to accuracy improvements in spaCy, which have been accepted for publication at EMNLP 2015, in joint work with Macquarie University (#[a(href="http://aclweb.org/anthology/D/D15/D15-1162.pdf") Honnibal and Johnson, 2015]).
+comparison("How does spaCy compare to NLTK?")
  .columnar
    .col
      h5 spaCy
      ul
        li.pro Over 400 times faster
        li.pro State-of-the-art accuracy
        li.pro Tokenizer maintains alignment to the original string (see the example below)
        li.pro Powerful, concise API
        li.pro Integrated word vectors
        li.con English only (at present)

    .col
      h5 NLTK
      ul
        li.con Slow
        li.con Low accuracy
        li.con Tokens do not align to the original string
        li.con Models return lists of strings
        li.con No word vector support
        li.pro Multiple languages
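  p For example (a minimal sketch, assuming the spaCy 0.x import path, from spacy.en import English), each token records its character offset into the original string, so annotations can always be mapped back to the source text:

  pre
    code.
      from spacy.en import English

      nlp = English()
      doc = nlp(u'spaCy keeps every token aligned to the original text.')
      for token in doc:
          # token.orth_ is the token text; token.idx is its character offset
          print(token.orth_, token.idx)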
+comparison("How does spaCy compare to CoreNLP?")
  .columnar
    .col
      h5 spaCy
      ul
        li.pro 50% faster
        li.pro More accurate parser
        li.pro Integrated word vectors
        li.pro Minimalist design
        li.pro Great documentation
        li.con English only
        li.pro Python

    .col
      h5 CoreNLP
      ul
        li.pro More accurate NER
        li.pro Coreference resolution
        li.pro Sentiment analysis
        li.con Little documentation
        li.pro Multiple languages
        li.neutral Java
+comparison("How does spaCy compare to ClearNLP?")
  .columnar
    .col
      h5 spaCy
      ul
        li.pro 30% faster
        li.pro Well documented
        li.con English only
        li.neutral Equivalent accuracy
        li.pro Python

    .col
      h5 ClearNLP
      ul
        li.pro Semantic Role Labelling
        li.pro Models for biology/life science
        li.pro Multiple languages
        li.neutral Equivalent accuracy
        li.neutral Java
//- +comparison("Accuracy Summary")

//- +comparison("Speed Summary")
//-   table
//-     thead
//-       tr
//-         th.
//-         th(colspan=3) Absolute (ms per doc)
//-         th(colspan=3) Relative (to spaCy)
//-
//-     tbody
//-       tr
//-         td: strong System
//-         td: strong Split
//-         td: strong Tag
//-         td: strong Parse
//-         td: strong Split
//-         td: strong Tag
//-         td: strong Parse
//-
//-       +row("spaCy", "0.2ms", "1ms", "19ms", "1x", "1x", "1x")
//-       +row("CoreNLP", "2ms", "10ms", "49ms", "10x", "10x", "2.6x")
//-       +row("ZPar", "1ms", "8ms", "850ms", "5x", "8x", "44.7x")
//-       +row("NLTK", "4ms", "443ms", "n/a", "20x", "443x", "n/a")
//-
//-   p
//-     | <strong>Setup</strong>: 100,000 plain-text documents were streamed
//-     | from an SQLite3 database, and processed with an NLP library, to one
//-     | of three levels of detail – tokenization, tagging, or parsing.
//-     | The tasks are additive: to parse the text you have to tokenize and
//-     | tag it. The pre-processing was not subtracted from the times –
//-     | I report the time required for the pipeline to complete. I report
//-     | mean times per document, in milliseconds.
//-
//-   p
//-     | <strong>Hardware</strong>: Intel i7-3770 (2012)
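//-
//- A rough sketch of that kind of harness (the "documents.db" filename and "docs"
//- table are hypothetical; assumes the spaCy 0.x import path):
//-
//-   import sqlite3
//-   import time
//-
//-   from spacy.en import English
//-
//-   nlp = English()
//-   conn = sqlite3.connect('documents.db')
//-   start = time.time()
//-   n_docs = 0
//-   for (text,) in conn.execute('SELECT text FROM docs'):
//-       doc = nlp(text)  # tokenizes, tags and parses in one call
//-       n_docs += 1
//-   print('%.1f ms per document' % ((time.time() - start) * 1000.0 / n_docs))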