---
title: Facts & Figures
teaser: The hard numbers for spaCy and how it compares to other tools
next: /usage/spacy-101
menu:
- ['Feature Comparison', 'comparison']
- ['Benchmarks', 'benchmarks']
# TODO: - ['Citing spaCy', 'citation']
---
## Comparison {#comparison hidden="true"}
spaCy is a **free, open-source library** for advanced **Natural Language
Processing** (NLP) in Python. It's designed specifically for **production use**
and helps you build applications that process and "understand" large volumes of
text. It can be used to build information extraction or natural language
understanding systems.
### Feature overview {#comparison-features}

import Features from 'widgets/features.js'

<Features />
### When should I use spaCy? {#comparison-usage}
- ✅ **I'm a beginner and just getting started with NLP.** spaCy makes it easy
to get started and comes with extensive documentation, including a
beginner-friendly [101 guide](/usage/spacy-101), a free interactive
[online course](https://course.spacy.io) and a range of
[video tutorials](https://www.youtube.com/c/ExplosionAI).
- ✅ **I want to build an end-to-end production application.** spaCy is
specifically designed for production use and lets you build and train powerful
NLP pipelines and package them for easy deployment.
- ✅ **I want my application to be efficient on GPU _and_ CPU.** While spaCy
lets you train modern NLP models that are best run on GPU, it also offers
CPU-optimized pipelines, which are less accurate but much cheaper to run.
- ✅ **I want to try out different neural network architectures for NLP.**
spaCy lets you customize and swap out the model architectures powering its
components, and implement your own using a framework like PyTorch or
TensorFlow. The declarative configuration system makes it easy to mix and
match functions and keep track of your hyperparameters to make sure your
experiments are reproducible.
- ❌ **I want to build a language generation application.** spaCy's focus is
natural language _processing_ and extracting information from large volumes of
text. While you can use it to help you re-write existing text, it doesn't
include any specific functionality for language generation tasks.
- ❌ **I want to research machine learning algorithms.** spaCy is built on the
latest research, but it's not a research library. If your goal is to write
papers and run benchmarks, spaCy is probably not a good choice. However, you
can use it to make the results of your research easily available for others to
use, e.g. via a custom spaCy component.
## Benchmarks {#benchmarks}
spaCy v3.0 introduces transformer-based pipelines that bring spaCy's accuracy
right up to **current state-of-the-art**. You can also use a CPU-optimized
pipeline, which is less accurate but much cheaper to run.
<!-- TODO: update benchmarks and intro -->
> #### Evaluation details
>
> - **OntoNotes 5.0:** spaCy's English models are trained on this corpus, as
> it's several times larger than other English treebanks. However, most
> systems do not report accuracies on it.
> - **Penn Treebank:** The "classic" parsing evaluation for research. However,
> it's quite far removed from actual usage: it uses sentences with
> gold-standard segmentation and tokenization, from a pretty specific type of
> text (articles from a single newspaper, 1984-1989).
import Benchmarks from 'usage/_benchmarks-models.mdx'
<Benchmarks />
<figure>

| Dependency Parsing System                                                      |  UAS |  LAS |
| ------------------------------------------------------------------------------ | ---: | ---: |
| spaCy RoBERTa (2020)                                                           | 95.1 | 93.7 |
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019)             | 97.2 | 95.7 |

<figcaption class="caption">

**Dependency parsing accuracy** on the Penn Treebank. See
[NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more
results. Project template:
[`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank).

</figcaption>

</figure>
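
For readers unfamiliar with the two columns above: UAS (unlabeled attachment
score) is the percentage of tokens assigned the correct head, and LAS (labeled
attachment score) additionally requires the correct dependency label. The
snippet below is a minimal sketch of that arithmetic on a toy sentence. The
`attachment_scores` helper and the `(head, label)` representation are
illustrative assumptions, not the evaluation code behind the table, and the
sketch scores every token, whereas published evaluations often apply extra
conventions such as excluding punctuation.

```python
from typing import List, Tuple

# Hypothetical representation: one (head index, dependency label) pair per token
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for one aligned token sequence."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty parse")
    # UAS counts tokens whose predicted head matches the gold head
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and label both match
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labeled_hits / total


# Toy parse of "She ate fish quickly"; head index 0 stands for the root
gold = [(2, "nsubj"), (0, "ROOT"), (2, "dobj"), (2, "advmod")]
pred = [(2, "nsubj"), (0, "ROOT"), (2, "nmod"), (3, "advmod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS {uas:.1f}, LAS {las:.1f}")  # UAS 75.0, LAS 50.0
```

In the toy example, three of four heads are correct (UAS 75.0), but only two
tokens have both the correct head and label (LAS 50.0), which is why LAS is
never higher than UAS.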
### Speed comparison {#benchmarks-speed}
We compare the speed of different NLP libraries, measured in words per second
(WPS), where higher is better. The evaluation was performed on 10,000 Reddit
comments.
<figure>

| Library | Pipeline                                        | WPS CPU <Help>words per second on CPU, higher is better</Help> | WPS GPU <Help>words per second on GPU, higher is better</Help> |
| ------- | ----------------------------------------------- | -------------------------------------------------------------: | -------------------------------------------------------------: |
| spaCy   | [`en_core_web_lg`](/models/en#en_core_web_lg)   |                                                         10,014 |                                                         14,954 |
| spaCy   | [`en_core_web_trf`](/models/en#en_core_web_trf) |                                                            684 |                                                          3,768 |
| Stanza  | `en_ewt`                                        |                                                            878 |                                                          2,180 |
| Flair   | `pos`(`-fast`) & `ner`(`-fast`)                 |                                                            323 |                                                          1,184 |
| UDPipe  | `english-ewt-ud-2.5`                            |                                                          1,101 |                                                          _n/a_ |

<figcaption class="caption">

**End-to-end processing speed** on raw unannotated text. Project template:
[`benchmarks/speed`](%%GITHUB_PROJECTS/benchmarks/speed).

</figcaption>

</figure>
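
To make the WPS metric concrete, here is a minimal sketch of how a
words-per-second figure could be computed for any text-processing callable. The
`words_per_second` helper, the whitespace-based word count and the stand-in
`process` function are assumptions made for illustration. The numbers in the
table come from the project template linked above, which may count words and
time runs differently.

```python
import time
from typing import Callable, Iterable


def words_per_second(
    texts: Iterable[str],
    process: Callable[[str], object],
    timer: Callable[[], float] = time.perf_counter,
) -> float:
    """Time `process` over all texts and return whitespace-delimited words per second."""
    texts = list(texts)
    # Assumption: a "word" is a whitespace-delimited chunk of the raw text
    n_words = sum(len(text.split()) for text in texts)
    start = timer()
    for text in texts:
        process(text)
    elapsed = timer() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_words / elapsed


# Stand-in workload; replace `str.split` with your own pipeline callable
comments = [
    "This is a short comment.",
    "Another one, slightly longer than the first.",
] * 1000

wps = words_per_second(comments, process=str.split)
print(f"{wps:,.0f} words per second")
```

The `timer` argument is injectable so the calculation can be checked with a
fake clock. For a meaningful comparison, run each library on the same texts and
hardware, and measure end-to-end processing of raw text as the benchmark above
does.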
<!-- TODO: ## Citing spaCy {#citation}
-->