mixin columns(...names) tr each name in names th= name mixin row(...cells) tr each cell in cells td= cell mixin comparison(name) details summary h4 #{name} block +comparison("Peer-reviewed Evaluations")(open=true) p spaCy is committed to rigorous evaluation under standard methodology. Two papers in 2015 confirm that: ol li spaCy is the fastest syntactic parser in the world; li Its accuracy is within 1% of the best available; li The few systems that are more accurate are 20× slower or more. p spaCy v0.84 was evaluated by researchers at Yahoo! Labs and Emory University, as part of a survey paper benchmarking the current state-of-the-art dependency parsers (#[a(href="http://aclweb.org/anthology/P/P15/P15-1038.pdf") Choi et al., 2015]). table thead +columns("System", "Language", "Accuracy", "Speed") tbody +row("spaCy v0.97", "Cython", "91.8", "13,000 (est.)") +row("ClearNLP", "Java", "91.7", "10,271") +row("CoreNLP", "Java", "89.6", "8,602") +row("MATE", "Java", "92.5", "550") +row("Turbo", "C++", "92.4", "349") +row("Yara", "Java", "92.3", "340") p Discussion with the authors led to accuracy improvements in spaCy, which have been accepted for publication in EMNLP, in joint work with Macquarie University (#[a(href="//aclweb.org/anthology/D/D15/D15-1162.pdf") Honnibal and Johnson, 2015]). +comparison("How does spaCy compare to NLTK?") .columnar .col h5 spaCy ul li.pro Over 400 times faster li.pro State-of-the-art accuracy li.pro Tokenizer maintains alignment li.pro Powerful, concise API li.pro Integrated word vectors li.con English only (at present) .col h5 NLTK ul li.con Slow li.con Low accuracy li.con Tokens do not align to original string li.con Models return lists of strings li.con No word vector support li.pro Multiple languages +comparison("How does spaCy compare to CoreNLP?") .columnar .col h5 spaCy ul li.pro 50% faster li.pro More accurate parser li.pro Word vectors integration li.pro Minimalist design li.pro Great documentation li.con English only li.pro Python .col h5 CoreNLP ul li.pro More accurate NER li.pro Coreference resolution li.pro Sentiment analysis li.con Little documentation li.pro Multiple languages li.neutral Java +comparison("How does spaCy compare to ClearNLP?") .columnar .col h5 spaCy ul li.pro 30% faster li.pro Well documented li.con English only li.neutral Equivalent accuracy li.pro Python .col h5 ClearNLP ul li.pro Semantic Role Labelling li.pro Model for biology/life-science li.pro Multiple Languages li.neutral Equivalent accuracy li.neutral Java //-+comparison("Accuracy Summary") //-+comparison("Speed Summary") //- table //- thead //- tr //- th. //- th(colspan=3) Absolute (ms per doc) //- th(colspan=3) Relative (to spaCy) //- //- tbody //- tr //- td: strong System //- td: strong Split //- td: strong Tag //- td: strong Parse //- td: strong Split //- td: strong Tag //- td: strong Parse //- //- +row("spaCy", "0.2ms", "1ms", "19ms", "1x", "1x", "1x") //- +row("spaCy", "0.2ms", "1ms", "19ms", "1x", "1x", "1x") //- +row("CoreNLP", "2ms", "10ms", "49ms", "10x", "10x", "2.6x") //- +row("ZPar", "1ms", "8ms", "850ms", "5x", "8x", "44.7x") //- +row("NLTK", "4ms", "443ms", "n/a", "20x", "443x", "n/a") //- //- p //- | Set up: 100,000 plain-text documents were streamed //- | from an SQLite3 database, and processed with an NLP library, to one //- | of three levels of detail – tokenization, tagging, or parsing. //- | The tasks are additive: to parse the text you have to tokenize and //- | tag it. The pre-processing was not subtracted from the times – //- | I report the time required for the pipeline to complete. I report //- | mean times per document, in milliseconds. //- //- p //- | Hardware: Intel i7-3770 (2012)