mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-31 16:07:41 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			123 lines
		
	
	
		
			6.0 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
---
title: Facts & Figures
teaser: The hard numbers for spaCy and how it compares to other tools
next: /usage/spacy-101
menu:
  - ['Feature Comparison', 'comparison']
  - ['Benchmarks', 'benchmarks']
  # TODO: - ['Citing spaCy', 'citation']
---

## Comparison {#comparison hidden="true"}

spaCy is a **free, open-source library** for advanced **Natural Language
Processing** (NLP) in Python. It's designed specifically for **production use**
and helps you build applications that process and "understand" large volumes of
text. It can be used to build information extraction or natural language
understanding systems.

### Feature overview {#comparison-features}

import Features from 'widgets/features.js'

<Features />

### When should I use spaCy? {#comparison-usage}

- ✅ **I'm a beginner and just getting started with NLP.** – spaCy makes it easy
  to get started and comes with extensive documentation, including a
  beginner-friendly [101 guide](/usage/spacy-101), a free interactive
  [online course](https://course.spacy.io) and a range of
  [video tutorials](https://www.youtube.com/c/ExplosionAI).
- ✅ **I want to build an end-to-end production application.** – spaCy is
  specifically designed for production use and lets you build and train powerful
  NLP pipelines and package them for easy deployment.
- ✅ **I want my application to be efficient on GPU _and_ CPU.** – While spaCy
  lets you train modern NLP models that are best run on GPU, it also offers
  CPU-optimized pipelines, which are less accurate but much cheaper to run.
- ✅ **I want to try out different neural network architectures for NLP.** –
  spaCy lets you customize and swap out the model architectures powering its
  components, and implement your own using a framework like PyTorch or
  TensorFlow. The declarative configuration system makes it easy to mix and
  match functions and keep track of your hyperparameters to make sure your
  experiments are reproducible.
- ❌ **I want to build a language generation application.** – spaCy's focus is
  natural language _processing_ and extracting information from large volumes of
  text. While you can use it to help you rewrite existing text, it doesn't
  include any specific functionality for language generation tasks.
- ❌ **I want to research machine learning algorithms.** – spaCy is built on the
  latest research, but it's not a research library. If your goal is to write
  papers and run benchmarks, spaCy is probably not a good choice. However, you
  can use it to make the results of your research easily available for others to
  use, e.g. via a custom spaCy component.

## Benchmarks {#benchmarks}

spaCy v3.0 introduces transformer-based pipelines that bring spaCy's accuracy
right up to the **current state of the art**. You can also use a CPU-optimized
pipeline, which is less accurate but much cheaper to run.

<!-- TODO: update benchmarks and intro -->

> #### Evaluation details
>
> - **OntoNotes 5.0:** spaCy's English models are trained on this corpus, as
>   it's several times larger than other English treebanks. However, most
>   systems do not report accuracies on it.
> - **Penn Treebank:** The "classic" parsing evaluation for research. However,
>   it's quite far removed from actual usage: it uses sentences with
>   gold-standard segmentation and tokenization, from a pretty specific type of
>   text (articles from a single newspaper, 1984-1989).

import Benchmarks from 'usage/\_benchmarks-models.md'

<Benchmarks />

<figure>

| Dependency Parsing System                                                      |  UAS |  LAS |
| ------------------------------------------------------------------------------ | ---: | ---: |
| spaCy RoBERTa (2020)                                                           | 95.1 | 93.7 |
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019)             | 97.2 | 95.7 |

<figcaption class="caption">

**Dependency parsing accuracy** on the Penn Treebank. See
[NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more
results. Project template:
[`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank).

</figcaption>

</figure>

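For readers unfamiliar with the column headers: UAS (unlabeled attachment score) is the percentage of tokens whose predicted head is correct, and LAS (labeled attachment score) additionally requires the dependency label to match. The following is a minimal sketch of that calculation, not the evaluation code used to produce the table. The function name `attachment_scores` and the toy annotations are hypothetical, and published evaluations often apply extra conventions (such as excluding punctuation) that are not modeled here.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for two aligned token lists.

    Each token is a (head_index, dependency_label) pair.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    correct_heads = 0  # tokens with the right head (counts toward UAS)
    correct_both = 0   # tokens with the right head and label (counts toward LAS)
    for (gold_head, gold_label), (pred_head, pred_label) in zip(gold, predicted):
        if gold_head == pred_head:
            correct_heads += 1
            if gold_label == pred_label:
                correct_both += 1

    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


# Toy sentence "She ate the pizza" with hypothetical annotations.
# Head indices are 1-based token positions; 0 marks the root.
gold = [(2, "nsubj"), (0, "ROOT"), (4, "det"), (2, "dobj")]
predicted = [(2, "nsubj"), (0, "ROOT"), (4, "det"), (2, "nmod")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS: {uas:.1f}  LAS: {las:.1f}")  # UAS: 100.0  LAS: 75.0
```

In the toy example every head is correct, but one label is wrong, so LAS is lower than UAS. This is why LAS can never exceed UAS in the table above.
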
### Speed comparison {#benchmarks-speed}

We compare the speed of different NLP libraries, measured in words per second
(WPS), where higher is better. The evaluation was performed on 10,000 Reddit
comments.

<figure>

| Library | Pipeline                                        | WPS CPU <Help>words per second on CPU, higher is better</Help> | WPS GPU <Help>words per second on GPU, higher is better</Help> |
| ------- | ----------------------------------------------- | -------------------------------------------------------------: | -------------------------------------------------------------: |
| spaCy   | [`en_core_web_lg`](/models/en#en_core_web_lg)   |                                                         10,014 |                                                         14,954 |
| spaCy   | [`en_core_web_trf`](/models/en#en_core_web_trf) |                                                            684 |                                                          3,768 |
| Stanza  | `en_ewt`                                        |                                                            878 |                                                          2,180 |
| Flair   | `pos`(`-fast`) & `ner`(`-fast`)                 |                                                            323 |                                                          1,184 |
| UDPipe  | `english-ewt-ud-2.5`                            |                                                          1,101 |                                                          _n/a_ |

<figcaption class="caption">

**End-to-end processing speed** on raw unannotated text. Project template:
[`benchmarks/speed`](%%GITHUB_PROJECTS/benchmarks/speed).

</figcaption>

</figure>

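A words-per-second figure like the ones in the table above can be approximated by timing a pipeline over a batch of raw texts and dividing the word count by the elapsed time. The sketch below illustrates the idea only and is not the benchmark's actual code: `words_per_second` and `toy_pipeline` are hypothetical names, words are counted by whitespace splitting (an assumption), and a real measurement would pass in a library's processing function and account for warm-up and batching.

```python
import time


def words_per_second(process, texts):
    """Time `process` over `texts` and return throughput in words per second."""
    texts = list(texts)
    # Assumption: a "word" is a whitespace-delimited string.
    n_words = sum(len(text.split()) for text in texts)

    start = time.perf_counter()
    for text in texts:
        process(text)
    elapsed = time.perf_counter() - start

    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure; use more texts")
    return n_words / elapsed


def toy_pipeline(text):
    # Hypothetical stand-in for a real NLP pipeline call.
    return [token.lower() for token in text.split()]


# 10,000 short placeholder texts standing in for real comments.
comments = [
    "This is a short comment.",
    "Another one, slightly longer than the first.",
] * 5000

print(f"{words_per_second(toy_pipeline, comments):,.0f} words per second")
```

The absolute number depends heavily on hardware, batch size and text length, so results are only comparable when every library is measured on the same machine and the same texts.
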
<!-- TODO: ## Citing spaCy {#citation}

-->