---
title: Facts & Figures
teaser: The hard numbers for spaCy and how it compares to other tools
next: /usage/spacy-101
menu:
  - ['Feature Comparison', 'comparison']
  - ['Benchmarks', 'benchmarks']
  # TODO: - ['Citing spaCy', 'citation']
---

## Comparison {#comparison hidden="true"}

spaCy is a **free, open-source library** for advanced **Natural Language
Processing** (NLP) in Python. It's designed specifically for **production use**
and helps you build applications that process and "understand" large volumes of
text. It can be used to build information extraction or natural language
understanding systems.

### Feature overview {#comparison-features}

import Features from 'widgets/features.js'

### When should I use spaCy? {#comparison-usage}

- ✅ **I'm a beginner and just getting started with NLP.** – spaCy makes it
  easy to get started and comes with extensive documentation, including a
  beginner-friendly [101 guide](/usage/spacy-101), a free interactive
  [online course](https://course.spacy.io) and a range of
  [video tutorials](https://www.youtube.com/c/ExplosionAI).
- ✅ **I want to build an end-to-end production application.** – spaCy is
  specifically designed for production use and lets you build and train
  powerful NLP pipelines and package them for easy deployment.
- ✅ **I want my application to be efficient on GPU _and_ CPU.** – While spaCy
  lets you train modern NLP models that are best run on GPU, it also offers
  CPU-optimized pipelines, which are less accurate but much cheaper to run.
- ✅ **I want to try out different neural network architectures for NLP.** –
  spaCy lets you customize and swap out the model architectures powering its
  components, and implement your own using a framework like PyTorch or
  TensorFlow. The declarative configuration system makes it easy to mix and
  match functions and keep track of your hyperparameters to make sure your
  experiments are reproducible.
- ❌ **I want to build a language generation application.** – spaCy's focus is
  natural language _processing_ and extracting information from large volumes
  of text. While you can use it to help you re-write existing text, it doesn't
  include any specific functionality for language generation tasks.
- ❌ **I want to research machine learning algorithms.** – spaCy is built on the
  latest research, but it's not a research library. If your goal is to write
  papers and run benchmarks, spaCy is probably not a good choice. However, you
  can use it to make the results of your research easily available for others
  to use, e.g. via a custom spaCy component.

## Benchmarks {#benchmarks}

spaCy v3.0 introduces transformer-based pipelines that bring spaCy's accuracy
right up to **current state-of-the-art**. You can also use a CPU-optimized
pipeline, which is less accurate but much cheaper to run.

> #### Evaluation details
>
> - **OntoNotes 5.0:** spaCy's English models are trained on this corpus, as
>   it's several times larger than other English treebanks. However, most
>   systems do not report accuracies on it.
> - **Penn Treebank:** The "classic" parsing evaluation for research. However,
>   it's quite far removed from actual usage: it uses sentences with
>   gold-standard segmentation and tokenization, from a pretty specific type of
>   text (articles from a single newspaper, 1984-1989).

import Benchmarks from 'usage/\_benchmarks-models.mdx'
| Dependency Parsing System                                                      |  UAS |  LAS |
| ------------------------------------------------------------------------------ | ---: | ---: |
| spaCy RoBERTa (2020)                                                           | 95.1 | 93.7 |
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019)             | 97.2 | 95.7 |
**Dependency parsing accuracy** on the Penn Treebank. See [NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more results. Project template: [`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank).
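
The UAS and LAS columns above are the unlabeled and labeled attachment scores: the percentage of tokens that receive the correct head, and the percentage that receive both the correct head and the correct dependency label. The following is a minimal sketch of that calculation, not the evaluation code behind the table. The function name `attachment_scores` and the toy annotations are hypothetical, and published evaluations can differ in details such as whether punctuation tokens are counted.

```python
from typing import Sequence, Tuple


def attachment_scores(
    gold_heads: Sequence[int],
    gold_labels: Sequence[str],
    pred_heads: Sequence[int],
    pred_labels: Sequence[str],
) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens.

    Hypothetical helper for illustration: heads are token indices,
    and the root token points at itself.
    """
    n = len(gold_heads)
    if n == 0:
        raise ValueError("need at least one token")
    if not (len(gold_labels) == len(pred_heads) == len(pred_labels) == n):
        raise ValueError("all sequences must have the same length")

    head_hits = 0   # tokens with the correct head (counts toward UAS)
    label_hits = 0  # tokens with the correct head AND label (counts toward LAS)
    for g_head, g_label, p_head, p_label in zip(
        gold_heads, gold_labels, pred_heads, pred_labels
    ):
        if g_head == p_head:
            head_hits += 1
            if g_label == p_label:
                label_hits += 1

    return 100.0 * head_hits / n, 100.0 * label_hits / n


# Toy sentence "She ate the apple" with made-up predictions.
gold_heads = [1, 1, 3, 1]
gold_labels = ["nsubj", "ROOT", "det", "dobj"]
pred_heads = [1, 1, 1, 1]                       # "the" gets the wrong head
pred_labels = ["nsubj", "ROOT", "det", "iobj"]  # "apple" gets the wrong label

uas, las = attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # UAS=75.0 LAS=50.0
```

Because a label only counts when the head is also correct, LAS can never exceed UAS, which matches every row of the table.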
### Speed comparison {#benchmarks-speed}

We compare the speed of different NLP libraries, measured in words per second
(WPS); higher is better. The evaluation was performed on 10,000 Reddit comments.
| Library | Pipeline                                        | WPS CPU | WPS GPU |
| ------- | ----------------------------------------------- | ------: | ------: |
| spaCy   | [`en_core_web_lg`](/models/en#en_core_web_lg)   |  10,014 |  14,954 |
| spaCy   | [`en_core_web_trf`](/models/en#en_core_web_trf) |     684 |   3,768 |
| Stanza  | `en_ewt`                                        |     878 |   2,180 |
| Flair   | `pos`(`-fast`) & `ner`(`-fast`)                 |     323 |   1,184 |
| UDPipe  | `english-ewt-ud-2.5`                            |   1,101 |   _n/a_ |
**End-to-end processing speed** on raw unannotated text. Project template: [`benchmarks/speed`](%%GITHUB_PROJECTS/benchmarks/speed).
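
A words-per-second figure like those above is a word count divided by wall-clock processing time. The sketch below shows that arithmetic only. It is not the code from the `benchmarks/speed` project template: `words_per_second` and `whitespace_pipeline` are hypothetical names, and the whitespace splitter is a stand-in so the example runs without any NLP library installed.

```python
import time
from typing import Callable, List


def words_per_second(
    process_batch: Callable[[List[str]], int],
    texts: List[str],
    timer: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one pass over `texts` and return words processed per second.

    `process_batch` is assumed to process every text and return the total
    number of words it saw (hypothetical contract for this sketch).
    """
    start = timer()
    n_words = process_batch(texts)
    elapsed = timer() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_words / elapsed


def whitespace_pipeline(texts: List[str]) -> int:
    """Stand-in 'pipeline': split on whitespace and count the pieces."""
    return sum(len(text.split()) for text in texts)


if __name__ == "__main__":
    sample = ["This is a short comment.", "Another one, slightly longer than that."] * 5000
    print(f"{words_per_second(whitespace_pipeline, sample):,.0f} WPS")
```

To time a real pipeline, `process_batch` could instead run the texts through the library under test and sum the resulting token counts, for example by iterating over spaCy's `nlp.pipe(texts)` and adding up `len(doc)`. How words were counted for the table is not stated here, so numbers produced this way are not directly comparable to it.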