spaCy/website/usage/_facts-figures/_benchmarks.jade

//- 💫 DOCS > USAGE > FACTS & FIGURES > BENCHMARKS

p
    |  Two peer-reviewed papers in 2015 confirm that spaCy offers the
    |  #[strong fastest syntactic parser in the world] and that
    |  #[strong its accuracy is within 1% of the best] available. The few
    |  systems that are more accurate are 20&times; slower or more.

+aside("About the evaluation")
    |  The first of the evaluations was published by #[strong Yahoo! Labs] and
    |  #[strong Emory University], as part of a survey of current parsing
    |  technologies #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") (Choi et al., 2015)].
    |  Their results and subsequent discussions helped us develop a novel
    |  psychologically-motivated technique to improve spaCy's accuracy, which
    |  we published in joint work with Macquarie University
    |  #[+a("https://aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].

include _benchmarks-choi-2015

+h(3, "algorithm") Algorithm comparison

p
    |  In this section, we compare spaCy's algorithms to recently published
    |  systems, using some of the most popular benchmarks. These benchmarks are
    |  designed to help isolate the contributions of specific algorithmic
    |  decisions, so they promote slightly "idealised" conditions. Specifically,
    |  the text comes pre-processed with "gold standard" token and sentence
    |  boundaries. The data sets also tend to be fairly small, to help
    |  researchers iterate quickly. These conditions mean the models trained on
    |  these data sets are not always useful for practical purposes.

+h(4, "parse-accuracy-penn") Parse accuracy (Penn Treebank / Wall Street Journal)

p
    |  This is the "classic" evaluation, so it's the number parsing researchers
    |  are most easily able to put in context. However, it's quite far removed
    |  from actual usage: it uses sentences with gold-standard segmentation and
    |  tokenization, from a pretty specific type of text (articles from a single
    |  newspaper, 1984-1989).

+aside("Methodology")
    |  #[+a("http://arxiv.org/abs/1603.06042") Andor et al. (2016)] chose
    |  slightly different experimental conditions from
    |  #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") Choi et al. (2015)],
    |  so the two accuracy tables here do not present directly comparable
    |  figures.

+table(["System", "Year", "Type", "Accuracy"])
    +row
        +cell spaCy v2.0.0
        +cell 2017
        +cell neural
        +cell.u-text-right 94.48

    +row
        +cell spaCy v1.1.0
        +cell 2016
        +cell linear
        +cell.u-text-right 92.80

    +row("divider")
        +cell
            +a("https://arxiv.org/pdf/1611.01734.pdf") Dozat and Manning
            +cell 2017
            +cell neural
            +cell.u-text-right #[strong 95.75]

    +row
        +cell
            +a("http://arxiv.org/abs/1603.06042") Andor et al.
        +cell 2016
        +cell neural
        +cell.u-text-right 94.44

    +row
        +cell
            +a("https://github.com/tensorflow/models/tree/master/syntaxnet") SyntaxNet Parsey McParseface
        +cell 2016
        +cell neural
        +cell.u-text-right 94.15

    +row
        +cell
            +a("http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43800.pdf") Weiss et al.
        +cell 2015
        +cell neural
        +cell.u-text-right 93.91

    +row
        +cell
            +a("http://research.google.com/pubs/archive/38148.pdf") Zhang and McDonald
        +cell 2014
        +cell linear
        +cell.u-text-right 93.32

    +row
        +cell
            +a("http://www.cs.cmu.edu/~ark/TurboParser/") Martins et al.
        +cell 2013
        +cell linear
        +cell.u-text-right 93.10

+h(4, "ner-accuracy-ontonotes5") NER accuracy (OntoNotes 5, no pre-process)

p
    |  This is the evaluation we use to tune spaCy's parameters are decide which
    |  algorithms are better than others. It's reasonably close to actual usage,
    |  because it requires the parses to be produced from raw text, without any
    |  pre-processing.

+table(["System", "Year", "Type", "Accuracy"])
    +row
        +cell spaCy #[+a("/models/en#en_core_web_lg") #[code en_core_web_lg]] v2.0.0
        +cell 2017
        +cell neural
        +cell.u-text-right 86.45

    +row("divider")
        +cell
            +a("https://arxiv.org/pdf/1702.02098.pdf") Strubell et al.
        +cell 2017
        +cell neural
        +cell.u-text-right #[strong 86.81]

    +row
        +cell
            +a("https://www.semanticscholar.org/paper/Named-Entity-Recognition-with-Bidirectional-LSTM-C-Chiu-Nichols/10a4db59e81d26b2e0e896d3186ef81b4458b93f") Chiu and Nichols
        +cell 2016
        +cell neural
        +cell.u-text-right 86.19

    +row
        +cell
            +a("https://www.semanticscholar.org/paper/A-Joint-Model-for-Entity-Analysis-Coreference-Typi-Durrett-Klein/28eb033eee5f51c5e5389cbb6b777779203a6778") Durrett and Klein
        +cell 2014
        +cell neural
        +cell.u-text-right 84.04

    +row
        +cell
            +a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth
        +cell 2009
        +cell linear
        +cell.u-text-right 83.45

+h(3, "spacy-models") Model comparison

include _benchmarks-models

+h(3, "speed-comparison") Detailed speed comparison

p
    |  Here we compare the per-document processing time of various spaCy
    |  functionalities against other NLP libraries. We show both absolute
    |  timings (in ms) and relative performance (normalized to spaCy). Lower is
    |  better.

+infobox("Important note", "⚠️")
    |  This evaluation was conducted in 2015. We're working on benchmarks on
    |  current CPU and GPU hardware.

+aside("Methodology")
    |  #[strong Set up:] 100,000 plain-text documents were streamed from an
    |  SQLite3 database, and processed with an NLP library, to one of three
    |  levels of detail — tokenization, tagging, or parsing. The tasks are
    |  additive: to parse the text you have to tokenize and tag it. The
    |  pre-processing was not subtracted from the times — we report the time
    |  required for the pipeline to complete. We report mean times per document,
    |  in milliseconds.#[br]#[br]
    |  #[strong Hardware]: Intel i7-3770 (2012)#[br]
    |  #[strong Implementation]: #[+src(gh("spacy-benchmarks")) #[code spacy-benchmarks]]

+table
    +row.u-text-label.u-text-center
        +head-cell
        +head-cell(colspan="3") Absolute (ms per doc)
        +head-cell(colspan="3") Relative (to spaCy)

    +row
        each column in ["System", "Tokenize", "Tag", "Parse", "Tokenize", "Tag", "Parse"]
            +head-cell=column

    +row
        +cell #[strong spaCy]
        each data in [ "0.2ms", "1ms", "19ms"]
            +cell.u-text-right #[strong=data]

        each data in ["1x", "1x", "1x"]
            +cell.u-text-right=data

    +row
        +cell CoreNLP
        each data in ["2ms", "10ms", "49ms", "10x", "10x", "2.6x"]
            +cell.u-text-right=data
    +row
        +cell ZPar
        each data in ["1ms", "8ms", "850ms", "5x", "8x", "44.7x"]
            +cell.u-text-right=data
    +row
        +cell NLTK
        each data in ["4ms", "443ms"]
            +cell.u-text-right=data
        +cell.u-text-right #[em n/a]
        each data in ["20x", "443x"]
            +cell.u-text-right=data
        +cell.u-text-right #[em n/a]
Update usage documentation 2017-10-03 15:26:20 +03:00			`//- 💫 DOCS > USAGE > FACTS & FIGURES > BENCHMARKS`

			`p`
			`\| Two peer-reviewed papers in 2015 confirm that spaCy offers the`
			`\| #[strong fastest syntactic parser in the world] and that`
			`\| #[strong its accuracy is within 1% of the best] available. The few`
			`\| systems that are more accurate are 20× slower or more.`

			`+aside("About the evaluation")`
			`\| The first of the evaluations was published by #[strong Yahoo! Labs] and`
			`\| #[strong Emory University], as part of a survey of current parsing`
			`\| technologies #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") (Choi et al., 2015)].`
			`\| Their results and subsequent discussions helped us develop a novel`
			`\| psychologically-motivated technique to improve spaCy's accuracy, which`
			`\| we published in joint work with Macquarie University`
			`\| #[+a("https://aclweb.org/anthology/D/D15/D15-1162.pdf") (Honnibal and Johnson, 2015)].`

			`include _benchmarks-choi-2015`

			`+h(3, "algorithm") Algorithm comparison`

			`p`
			`\| In this section, we compare spaCy's algorithms to recently published`
			`\| systems, using some of the most popular benchmarks. These benchmarks are`
			`\| designed to help isolate the contributions of specific algorithmic`
			`\| decisions, so they promote slightly "idealised" conditions. Specifically,`
			`\| the text comes pre-processed with "gold standard" token and sentence`
			`\| boundaries. The data sets also tend to be fairly small, to help`
			`\| researchers iterate quickly. These conditions mean the models trained on`
			`\| these data sets are not always useful for practical purposes.`

			`+h(4, "parse-accuracy-penn") Parse accuracy (Penn Treebank / Wall Street Journal)`

			`p`
			`\| This is the "classic" evaluation, so it's the number parsing researchers`
			`\| are most easily able to put in context. However, it's quite far removed`
			`\| from actual usage: it uses sentences with gold-standard segmentation and`
			`\| tokenization, from a pretty specific type of text (articles from a single`
			`\| newspaper, 1984-1989).`

			`+aside("Methodology")`
			`\| #[+a("http://arxiv.org/abs/1603.06042") Andor et al. (2016)] chose`
			`\| slightly different experimental conditions from`
			`\| #[+a("https://aclweb.org/anthology/P/P15/P15-1038.pdf") Choi et al. (2015)],`
			`\| so the two accuracy tables here do not present directly comparable`
			`\| figures.`

			`+table(["System", "Year", "Type", "Accuracy"])`
			`+row`
			`+cell spaCy v2.0.0`
			`+cell 2017`
			`+cell neural`
			`+cell.u-text-right 94.48`

			`+row`
			`+cell spaCy v1.1.0`
			`+cell 2016`
			`+cell linear`
			`+cell.u-text-right 92.80`

			`+row("divider")`
			`+cell`
			`+a("https://arxiv.org/pdf/1611.01734.pdf") Dozat and Manning`
			`+cell 2017`
			`+cell neural`
			`+cell.u-text-right #[strong 95.75]`

			`+row`
			`+cell`
			`+a("http://arxiv.org/abs/1603.06042") Andor et al.`
			`+cell 2016`
			`+cell neural`
			`+cell.u-text-right 94.44`

			`+row`
			`+cell`
			`+a("https://github.com/tensorflow/models/tree/master/syntaxnet") SyntaxNet Parsey McParseface`
			`+cell 2016`
			`+cell neural`
			`+cell.u-text-right 94.15`

			`+row`
			`+cell`
			`+a("http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43800.pdf") Weiss et al.`
			`+cell 2015`
			`+cell neural`
			`+cell.u-text-right 93.91`

			`+row`
			`+cell`
			`+a("http://research.google.com/pubs/archive/38148.pdf") Zhang and McDonald`
			`+cell 2014`
			`+cell linear`
			`+cell.u-text-right 93.32`

			`+row`
			`+cell`
			`+a("http://www.cs.cmu.edu/~ark/TurboParser/") Martins et al.`
			`+cell 2013`
			`+cell linear`
			`+cell.u-text-right 93.10`

			`+h(4, "ner-accuracy-ontonotes5") NER accuracy (OntoNotes 5, no pre-process)`

			`p`
			`\| This is the evaluation we use to tune spaCy's parameters are decide which`
			`\| algorithms are better than others. It's reasonably close to actual usage,`
			`\| because it requires the parses to be produced from raw text, without any`
			`\| pre-processing.`

			`+table(["System", "Year", "Type", "Accuracy"])`
			`+row`
			`+cell spaCy #[+a("/models/en#en_core_web_lg") #[code en_core_web_lg]] v2.0.0`
			`+cell 2017`
			`+cell neural`
			`+cell.u-text-right 86.45`

			`+row("divider")`
			`+cell`
			`+a("https://arxiv.org/pdf/1702.02098.pdf") Strubell et al.`
			`+cell 2017`
			`+cell neural`
			`+cell.u-text-right #[strong 86.81]`

			`+row`
			`+cell`
			`+a("https://www.semanticscholar.org/paper/Named-Entity-Recognition-with-Bidirectional-LSTM-C-Chiu-Nichols/10a4db59e81d26b2e0e896d3186ef81b4458b93f") Chiu and Nichols`
			`+cell 2016`
			`+cell neural`
			`+cell.u-text-right 86.19`

			`+row`
			`+cell`
			`+a("https://www.semanticscholar.org/paper/A-Joint-Model-for-Entity-Analysis-Coreference-Typi-Durrett-Klein/28eb033eee5f51c5e5389cbb6b777779203a6778") Durrett and Klein`
			`+cell 2014`
			`+cell neural`
			`+cell.u-text-right 84.04`

			`+row`
			`+cell`
			`+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth`
			`+cell 2009`
			`+cell linear`
			`+cell.u-text-right 83.45`

			`+h(3, "spacy-models") Model comparison`

			`include _benchmarks-models`

			`+h(3, "speed-comparison") Detailed speed comparison`

			`p`
			`\| Here we compare the per-document processing time of various spaCy`
			`\| functionalities against other NLP libraries. We show both absolute`
			`\| timings (in ms) and relative performance (normalized to spaCy). Lower is`
			`\| better.`

			`+infobox("Important note", "⚠️")`
			`\| This evaluation was conducted in 2015. We're working on benchmarks on`
			`\| current CPU and GPU hardware.`

			`+aside("Methodology")`
			`\| #[strong Set up:] 100,000 plain-text documents were streamed from an`
			`\| SQLite3 database, and processed with an NLP library, to one of three`
			`\| levels of detail — tokenization, tagging, or parsing. The tasks are`
			`\| additive: to parse the text you have to tokenize and tag it. The`
			`\| pre-processing was not subtracted from the times — we report the time`
			`\| required for the pipeline to complete. We report mean times per document,`
			`\| in milliseconds.#[br]#[br]`
			`\| #[strong Hardware]: Intel i7-3770 (2012)#[br]`
			`\| #[strong Implementation]: #[+src(gh("spacy-benchmarks")) #[code spacy-benchmarks]]`

			`+table`
			`+row.u-text-label.u-text-center`
			`+head-cell`
			`+head-cell(colspan="3") Absolute (ms per doc)`
			`+head-cell(colspan="3") Relative (to spaCy)`

			`+row`
			`each column in ["System", "Tokenize", "Tag", "Parse", "Tokenize", "Tag", "Parse"]`
			`+head-cell=column`

			`+row`
			`+cell #[strong spaCy]`
			`each data in [ "0.2ms", "1ms", "19ms"]`
			`+cell.u-text-right #[strong=data]`

			`each data in ["1x", "1x", "1x"]`
			`+cell.u-text-right=data`

			`+row`
			`+cell CoreNLP`
			`each data in ["2ms", "10ms", "49ms", "10x", "10x", "2.6x"]`
			`+cell.u-text-right=data`
			`+row`
			`+cell ZPar`
			`each data in ["1ms", "8ms", "850ms", "5x", "8x", "44.7x"]`
			`+cell.u-text-right=data`
			`+row`
			`+cell NLTK`
			`each data in ["4ms", "443ms"]`
			`+cell.u-text-right=data`
			`+cell.u-text-right #[em n/a]`
			`each data in ["20x", "443x"]`
			`+cell.u-text-right=data`
			`+cell.u-text-right #[em n/a]`