mirror of
https://github.com/explosion/spaCy.git
synced 2025-03-10 06:15:49 +03:00
79 lines
2.5 KiB
Plaintext
+comparison("NLTK")
//+comparison("Pattern")
+comparison("CoreNLP")
+comparison("ClearNLP")
//+comparison("OpenNLP")
//+comparison("GATE")

+comparison("Accuracy Summary")

+comparison("Speed Summary")
    table
        thead
            tr
                th.
                th(colspan=3) Absolute (ms per doc)
                th(colspan=3) Relative (to spaCy)

        tbody
            tr
                td: strong System
                td: strong Split
                td: strong Tag
                td: strong Parse
                td: strong Split
                td: strong Tag
                td: strong Parse

            +row("spaCy", "0.2ms", "1ms", "19ms", "1x", "1x", "1x")
            +row("CoreNLP", "2ms", "10ms", "49ms", "10x", "10x", "2.6x")
            +row("ZPar", "1ms", "8ms", "850ms", "5x", "8x", "44.7x")
            +row("NLTK", "4ms", "443ms", "n/a", "20x", "443x", "n/a")

    p
        | <strong>Setup</strong>: 100,000 plain-text documents were streamed
        | from an SQLite3 database and processed with an NLP library to one
        | of three levels of detail – tokenization, tagging, or parsing.
        | The tasks are additive: to parse the text you have to tokenize and
        | tag it. The pre-processing was not subtracted from the times, so
        | each figure is the time required for the whole pipeline to complete.
        | I report mean times per document, in milliseconds.

    p
        | <strong>Hardware</strong>: Intel i7-3770 (2012)


+comparison("Independent Evaluation")
    p
        | Independent evaluation by Yahoo! Labs and Emory
        | University, to appear at ACL 2015. Higher is better.

    table
        thead
            +columns("System", "Language", "Accuracy", "Speed")

        tbody
            +row("spaCy v0.86", "Cython", "91.9", "13,963")
            +row("spaCy v0.84", "Cython", "90.6", "13,963")
            +row("ClearNLP", "Java", "91.7", "10,271")
            +row("CoreNLP", "Java", "89.6", "8,602")
            +row("MATE", "Java", "92.5", "550")
            +row("Turbo", "C++", "92.4", "349")
            +row("Yara", "Java", "92.3", "340")

    p
        | Accuracy is the percentage of unlabelled arcs correct; speed is
        | tokens per second.

    p
        | Joel Tetreault and Amanda Stent (Yahoo! Labs) and Jin-ho Choi (Emory)
        | performed a detailed comparison of the best parsers available.
        | All numbers above are taken from the pre-print they kindly made
        | available to me, except for spaCy v0.86.

    p
        | I'm particularly grateful to the authors for discussion of their
        | results, which led to the improvement in accuracy between v0.84 and
        | v0.86. A tip from Jin-ho (developer of ClearNLP) was especially
        | useful.