spaCy/website/docs/usage/facts-figures.md
2020-02-03 13:10:46 +01:00


title: Facts & Figures
teaser: The hard numbers for spaCy and how it compares to other tools
next: /usage/spacy-101
menu: Feature Comparison (comparison), Benchmarks (benchmarks)

Feature comparison

Here's a quick comparison of the functionalities offered by spaCy, NLTK and CoreNLP.

spaCy NLTK CoreNLP
Programming language Python Python Java / Python
Neural network models
Integrated word vectors
Multi-language support
Tokenization
Part-of-speech tagging
Sentence segmentation
Dependency parsing
Entity recognition
Entity linking
Coreference resolution

When should I use what?

Natural Language Understanding is an active area of research and development, so there are many different tools and technologies catering to different use cases. The table below summarizes a few libraries (spaCy, NLTK, AllenNLP, StanfordNLP and TensorFlow) to help you get a feel for how things fit together.

spaCy NLTK AllenNLP StanfordNLP TensorFlow
I'm a beginner and just getting started with NLP.
I want to build an end-to-end production application.
I want to try out different neural network architectures for NLP.
I want to try the latest models with state-of-the-art accuracy.
I want to train models from my own data.
I want my application to be efficient on CPU.

Benchmarks

Two peer-reviewed papers in 2015 confirmed that spaCy offers the fastest syntactic parser in the world and that its accuracy is within 1% of the best available. The few systems that are more accurate are 20× slower or more.

About the evaluation

The first of the evaluations was published by Yahoo! Labs and Emory University, as part of a survey of current parsing technologies (Choi et al., 2015). Their results and subsequent discussions helped us develop a novel psychologically-motivated technique to improve spaCy's accuracy, which we published in joint work with Macquarie University (Honnibal and Johnson, 2015).


Algorithm comparison

In this section, we compare spaCy's algorithms to recently published systems, using some of the most popular benchmarks. These benchmarks are designed to help isolate the contributions of specific algorithmic decisions, so they promote slightly "idealized" conditions. Specifically, the text comes pre-processed with "gold standard" token and sentence boundaries. The data sets also tend to be fairly small, to help researchers iterate quickly. These conditions mean the models trained on these data sets are not always useful for practical purposes.

Parse accuracy (Penn Treebank / Wall Street Journal)

This is the "classic" evaluation, so it's the number parsing researchers are most easily able to put in context. However, it's quite far removed from actual usage: it uses sentences with gold-standard segmentation and tokenization, from a pretty specific type of text (articles from a single newspaper, 1984-1989).

Methodology

Andor et al. (2016) chose slightly different experimental conditions from Choi et al. (2015), so the two accuracy tables here do not present directly comparable figures.

| System | Year | Type | Accuracy |
| --- | --- | --- | --- |
| spaCy v2.0.0 | 2017 | neural | 94.48 |
| spaCy v1.1.0 | 2016 | linear | 92.80 |
| Dozat and Manning | 2017 | neural | 95.75 |
| Andor et al. | 2016 | neural | 94.44 |
| SyntaxNet Parsey McParseface | 2016 | neural | 94.15 |
| Weiss et al. | 2015 | neural | 93.91 |
| Zhang and McDonald | 2014 | linear | 93.32 |
| Martins et al. | 2013 | linear | 93.10 |

NER accuracy (OntoNotes 5, no pre-process)

This is the evaluation we use to tune spaCy's parameters and to decide which algorithms are better than others. It's reasonably close to actual usage, because it requires the parses to be produced from raw text, without any pre-processing.

| System | Year | Type | Accuracy |
| --- | --- | --- | --- |
| spaCy en_core_web_lg v2.0.0a3 | 2017 | neural | 85.85 |
| Strubell et al. | 2017 | neural | 86.81 |
| Chiu and Nichols | 2016 | neural | 86.19 |
| Durrett and Klein | 2014 | neural | 84.04 |
| Ratinov and Roth | 2009 | linear | 83.45 |
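
NER accuracy on benchmarks like this one is conventionally reported as an entity-level F-score, where a predicted entity counts as correct only if both its boundaries and its label match a gold entity exactly. The sketch below illustrates that calculation under this assumption. The function name `entity_f_score` and the `(start, end, label)` tuple layout are hypothetical, chosen for illustration, and this is not spaCy's own evaluation code.

```python
def entity_f_score(gold, predicted):
    """Return (precision, recall, F1) over exact-match entity spans.

    Each entity is a hypothetical (start, end, label) tuple.
    """
    gold_set = set(gold)
    pred_set = set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: the third prediction has the right label but the wrong end
# offset, so it counts as both a false positive and a missed gold entity.
gold = [(0, 2, "PERSON"), (5, 6, "ORG"), (9, 10, "GPE")]
predicted = [(0, 2, "PERSON"), (5, 6, "ORG"), (9, 11, "GPE")]
precision, recall, f1 = entity_f_score(gold, predicted)
```

Because the match is exact, a boundary that is off by one token is penalized twice, which is one reason scores on raw text tend to be lower than scores on pre-tokenized text.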

Model comparison

In this section, we provide benchmark accuracies for the pretrained model pipelines we distribute with spaCy. Evaluations are conducted end-to-end from raw text, with no "gold standard" pre-processing, over text from a mix of genres where possible.

Methodology

The evaluation was conducted on raw text with no gold standard information. The parser, tagger and entity recognizer were trained on the OntoNotes 5 corpus, the word vectors on Common Crawl.
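
For readers unfamiliar with the metric, UAS is conventionally the unlabeled attachment score: the percentage of tokens whose predicted syntactic head matches the gold head. The following is a minimal sketch of that definition, not the evaluation code used for these tables. The function name and the list-of-head-indices representation are assumptions for illustration, and it presumes the predicted and gold tokenizations are already aligned, which an evaluation on raw text has to handle separately.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted head index equals the gold head.

    Both arguments are lists with one head index per token (hypothetical
    layout). The root token is represented as pointing to itself.
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted tokenizations must be aligned")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# "She ate the apple": gold heads are She->ate, ate->ate (root),
# the->apple, apple->ate. The prediction wrongly attaches "the" to "ate".
gold_heads = [1, 1, 3, 1]
predicted_heads = [1, 1, 1, 1]
uas = unlabeled_attachment_score(gold_heads, predicted_heads)
```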

English

| Model | spaCy | Type | UAS | NER F | POS | WPS | Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| en_core_web_sm 2.0.0 | 2.x | neural | 91.7 | 85.3 | 97.0 | 10.1k | 35MB |
| en_core_web_md 2.0.0 | 2.x | neural | 91.7 | 85.9 | 97.1 | 10.0k | 115MB |
| en_core_web_lg 2.0.0 | 2.x | neural | 91.9 | 85.9 | 97.2 | 10.0k | 812MB |
| en_core_web_sm 1.2.0 | 1.x | linear | 86.6 | 78.5 | 96.6 | 25.7k | 50MB |
| en_core_web_md 1.2.1 | 1.x | linear | 90.6 | 81.4 | 96.7 | 18.8k | 1GB |

Spanish

Evaluation note

The NER accuracy refers to the "silver standard" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than accuracy measured against correct human annotations.

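| Model | spaCy | Type | UAS | NER F | POS | WPS | Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| es_core_news_sm 2.0.0 | 2.x | neural | 89.8 | 88.7 | 96.9 | n/a | 35MB |
| es_core_news_md 2.0.0 | 2.x | neural | 90.2 | 89.0 | 97.8 | n/a | 93MB |
| es_core_web_md 1.1.0 | 1.x | linear | 87.5 | 94.2 | 96.7 | n/a | 377MB |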

Detailed speed comparison

Here we compare the per-document processing time of various spaCy functionalities against other NLP libraries. We show both absolute timings (in ms) and relative performance (normalized to spaCy). Lower is better.

This evaluation was conducted in 2015. We're working on benchmarks on current CPU and GPU hardware. In the meantime, we're grateful to the Stanford folks for drawing our attention to what seems to be a long-standing error in our CoreNLP benchmarks, especially for their tokenizer. Until we run corrected experiments, we have updated the table using their figures.

Methodology

  • Set up: 100,000 plain-text documents were streamed from an SQLite3 database, and processed with an NLP library, to one of three levels of detail — tokenization, tagging, or parsing. The tasks are additive: to parse the text you have to tokenize and tag it. The pre-processing was not subtracted from the times — we report the time required for the pipeline to complete. We report mean times per document, in milliseconds.
  • Hardware: Intel i7-3770 (2012)
  • Implementation: spacy-benchmarks
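
The setup described above can be sketched roughly as follows. This is an illustrative harness, not the code from spacy-benchmarks: the table name `documents`, the column name `text`, the function name `mean_ms_per_doc` and the in-memory database are all assumptions, and `str.split` stands in for a real NLP pipeline. In this sketch the time spent reading rows from SQLite is included in the measurement.

```python
import sqlite3
import time


def mean_ms_per_doc(connection, process):
    """Stream documents from SQLite and return the mean time per document in ms."""
    cursor = connection.execute("SELECT text FROM documents")
    n_docs = 0
    start = time.perf_counter()
    for (text,) in cursor:
        process(text)  # run the whole pipeline; nothing is subtracted
        n_docs += 1
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / n_docs if n_docs else 0.0


# Hypothetical in-memory database standing in for the 100,000-document corpus.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE documents (text TEXT)")
connection.executemany(
    "INSERT INTO documents (text) VALUES (?)",
    [("This is a short test document.",)] * 1000,
)

# Stand-in for an NLP library: whitespace tokenization only.
ms_per_doc = mean_ms_per_doc(connection, str.split)
```

To time a different level of detail, pass a different callable as `process`, for example one that tokenizes and tags, or one that runs the full parse.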
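| System | Tokenize (absolute, per doc) | Tag (absolute, per doc) | Parse (absolute, per doc) | Tokenize (relative to spaCy) | Tag (relative to spaCy) | Parse (relative to spaCy) |
| --- | --- | --- | --- | --- | --- | --- |
| spaCy | 0.2ms | 1ms | 19ms | 1x | 1x | 1x |
| CoreNLP | 0.18ms | 10ms | 49ms | 0.9x | 10x | 2.6x |
| ZPar | 1ms | 8ms | 850ms | 5x | 8x | 44.7x |
| NLTK | 4ms | 443ms | n/a | 20x | 443x | n/a |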
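
The relative figures are each system's absolute time divided by spaCy's time for the same task. As a quick check, the sketch below recomputes them from the absolute timings in the table. The dictionary layout and the function name `relative_to` are illustrative assumptions, and `None` stands in for the n/a entries.

```python
# Absolute timings in ms per document, copied from the table above.
absolute_ms = {
    "spaCy": {"tokenize": 0.2, "tag": 1, "parse": 19},
    "CoreNLP": {"tokenize": 0.18, "tag": 10, "parse": 49},
    "ZPar": {"tokenize": 1, "tag": 8, "parse": 850},
    "NLTK": {"tokenize": 4, "tag": 443, "parse": None},  # None = n/a
}


def relative_to(baseline, timings):
    """Normalize every system's timings to the baseline system's timings."""
    base = timings[baseline]
    return {
        system: {
            task: None if ms is None else round(ms / base[task], 1)
            for task, ms in tasks.items()
        }
        for system, tasks in timings.items()
    }


relative = relative_to("spaCy", absolute_ms)
for system, tasks in relative.items():
    print(system, tasks)
```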