spaCy is a **free, open-source library** for advanced **Natural Language
Processing** (NLP) in Python.
If you're working with a lot of text, you'll eventually want to know more about
it. For example, what's it about? What do the words mean in context? Who is
doing what to whom? What companies and products are mentioned? Which texts are
similar to each other?
spaCy is designed specifically for **production use** and helps you build
applications that process and "understand" large volumes of text. It can be used
to build **information extraction** or **natural language understanding**
systems, or to pre-process text for **deep learning**.
- [Features](#features)
- [Linguistic annotations](#annotations)
  - [Tokenization](#annotations-token)
  - [POS tags and dependencies](#annotations-pos-deps)
  - [Named entities](#annotations-ner)
- [Word vectors and similarity](#vectors-similarity)
- [Pipelines](#pipelines)
- [Library architecture](#architecture)
- [Vocab, hashes and lexemes](#vocab)
- [Serialization](#serialization)
- [Training](#training)
- [Language data](#language-data)
- [Community & FAQ](#community)
### What spaCy isn't {#what-spacy-isnt}
- **spaCy is not a platform or "an API"**. Unlike a platform, spaCy does not
provide software as a service or a web application. It's an open-source
library designed to help you build NLP applications, not a consumable service.
- **spaCy is not an out-of-the-box chat bot engine**. While spaCy can be used to
power conversational applications, it's not designed specifically for chat
bots, and only provides the underlying text processing capabilities.
- **spaCy is not research software**. It's built on the latest research, but
it's designed to get things done. This leads to fairly different design
decisions than [NLTK](https://github.com/nltk/nltk) or
[CoreNLP](https://stanfordnlp.github.io/CoreNLP/), which were created as
platforms for teaching and research. The main difference is that spaCy is
integrated and opinionated. spaCy tries to avoid asking the user to choose
between multiple algorithms that deliver equivalent functionality. Keeping the
menu small lets spaCy deliver generally better performance and developer
experience.
- **spaCy is not a company**. It's an open-source library. Our company
publishing spaCy and other software is called
[Explosion](https://explosion.ai).
## Features {#features}
In the documentation, you'll come across mentions of spaCy's features and
capabilities. Some of them refer to linguistic concepts, while others are
related to more general machine learning functionality.
| Name | Description |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Tokenization** | Segmenting text into words, punctuation marks etc. |
| **Part-of-speech** (POS) **Tagging** | Assigning word types to tokens, like verb or noun. |
| **Dependency Parsing** | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| **Lemmatization** | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
| **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences. |
| **Named Entity Recognition** (NER) | Labelling named "real-world" objects, like persons, companies or locations. |
| **Entity Linking** (EL) | Disambiguating textual entities to unique identifiers in a knowledge base. |
| **Similarity** | Comparing words, text spans and documents and how similar they are to each other. |
| **Text Classification** | Assigning categories or labels to a whole document, or parts of a document. |
| **Rule-based Matching** | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| **Training** | Updating and improving a statistical model's predictions. |
| **Serialization** | Saving objects to files or byte strings. |
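To make two of the rows above concrete, here is a minimal sketch of tokenization and rule-based matching. It assumes spaCy v3 and uses a blank English pipeline (`spacy.blank("en")`) so that no trained package has to be downloaded. The sample text and the pattern name `HELLO_WORLD` are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline: tokenizer and language data only,
# no trained components (an assumption made to keep the sketch small).
nlp = spacy.blank("en")
doc = nlp("Hello, world! Hello world!")

# Tokenization: punctuation is split off into separate tokens.
tokens = [token.text for token in doc]

# Rule-based matching: "hello", then any punctuation token, then "world".
# The pattern matches on token attributes rather than raw characters.
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])  # "HELLO_WORLD" is an illustrative name
matched = [doc[start:end].text for _match_id, start, end in matcher(doc)]

print(tokens)
print(matched)
```

Only the first "Hello, world" is matched, because the second occurrence has no punctuation token between the two words.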
### Statistical models {#statistical-models}
While some of spaCy's features work independently, others require
[trained pipelines](/models) to be loaded, which enable spaCy to **predict**
linguistic annotations – for example, whether a word is a verb or a noun. A
trained pipeline can consist of multiple components that use a statistical model
trained on labeled data. spaCy currently offers trained pipelines for a variety
of languages, which can be installed as individual Python modules. Pipeline
packages can differ in size, speed, memory usage, accuracy and the data they
include. The package you choose always depends on your use case and the texts
you're working with. For a general-purpose use case, the small, default packages
are always a good start. They typically include the following components:
- **Binary weights** for the part-of-speech tagger, dependency parser and named
entity recognizer to predict those annotations in context.
- **Lexical entries** in the vocabulary, i.e. words and their
context-independent attributes like the shape or spelling.
- **Data files** like lemmatization rules and lookup tables.
- **Word vectors**, i.e. multi-dimensional meaning representations of words that
let you determine how similar they are to each other.
- **Configuration** options, like the language and processing pipeline settings
and model implementations to use, to put spaCy in the correct state when you
load the pipeline.
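Some of the parts listed above can be inspected on any `nlp` object. The sketch below assumes spaCy v3 and, to avoid a download, uses a blank English pipeline with the rule-based `sentencizer` standing in for a trained package. A blank pipeline has no binary weights or word vectors, so only the configuration and lexical entries are shown.

```python
import spacy

# Stand-in for a trained package: a blank pipeline plus one rule-based component.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Configuration: the language and the names of the pipeline components.
print(nlp.pipe_names)
print(nlp.config["nlp"]["lang"])

# Lexical entries: context-independent attributes stored in the vocabulary.
lexeme = nlp.vocab["Apple"]
print(lexeme.text, lexeme.shape_, lexeme.is_alpha)
```

With a trained package such as `en_core_web_sm`, the same attributes are available, and `nlp.pipe_names` would list the trained components as well.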
## Linguistic annotations {#annotations}
spaCy provides a variety of linguistic annotations to give you **insights into a
text's grammatical structure**. This includes the word types, like the parts of
speech, and how the words are related to each other. For example, if you're
analyzing text, it makes a huge difference whether a noun is the subject of a
sentence, or the object – or whether "google" is used as a verb, or refers to
the website or company in a specific context.
> #### Loading pipelines
>
> ```cli
> $ python -m spacy download en_core_web_sm
>
> >>> import spacy
> >>> nlp = spacy.load("en_core_web_sm")
> ```
Once you've [downloaded and installed](/usage/models) a trained pipeline, you
can load it via [`spacy.load`](/api/top-level#spacy.load). This will return a
`Language` object containing all components and data needed to process text. We
usually call it `nlp`. Calling the `nlp` object on a string of text will return
a processed `Doc`:
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
Even though a `Doc` is processed – e.g. split into individual words and
annotated – it still holds **all information of the original text**, like
whitespace characters. You can always get the offset of a token into the
original string, or reconstruct the original by joining the tokens and their
trailing whitespace. This way, you'll never lose any information when processing
text with spaCy.
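As a small illustration of this property, the sketch below checks each token's character offset and rebuilds the original string from the tokens. It assumes spaCy v3 and uses a blank English pipeline, since only the tokenizer is needed here. The double space in the sample text is deliberate, to show that irregular whitespace survives too.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained pipeline required
text = "Apple is looking  at buying U.K. startup for $1 billion"
doc = nlp(text)

# token.idx is the character offset of the token in the original string.
offsets = [(token.text, token.idx) for token in doc]
offsets_ok = all(text[idx:idx + len(tok)] == tok for tok, idx in offsets)

# token.text_with_ws is the token text plus its trailing whitespace, if any,
# so joining these pieces reproduces the input exactly.
reconstructed = "".join(token.text_with_ws for token in doc)

print(offsets_ok)
print(reconstructed == text)
```

`doc.text` gives the same reconstructed string without the explicit join.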
### Tokenization {#annotations-token}
import Tokenization101 from 'usage/101/\_tokenization.md'