mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-11-04 01:48:04 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			596 lines
		
	
	
		
			26 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			596 lines
		
	
	
		
			26 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
---
 | 
						||
title: 'spaCy 101: Everything you need to know'
 | 
						||
teaser: The most important concepts, explained in simple terms
 | 
						||
menu:
 | 
						||
  - ["What's spaCy?", 'whats-spacy']
 | 
						||
  - ['Features', 'features']
 | 
						||
  - ['Linguistic Annotations', 'annotations']
 | 
						||
  - ['Pipelines', 'pipelines']
 | 
						||
  - ['Architecture', 'architecture']
 | 
						||
  - ['Vocab', 'vocab']
 | 
						||
  - ['Serialization', 'serialization']
 | 
						||
  - ['Training', 'training']
 | 
						||
  - ['Language Data', 'language-data']
 | 
						||
  - ['Community & FAQ', 'community-faq']
 | 
						||
---
 | 
						||
 | 
						||
Whether you're new to spaCy, or just want to brush up on some NLP basics and
 | 
						||
implementation details – this page should have you covered. Each section will
 | 
						||
explain one of spaCy's features in simple terms and with examples or
 | 
						||
illustrations. Some sections will also reappear across the usage guides as a
 | 
						||
quick introduction.
 | 
						||
 | 
						||
> #### Help us improve the docs
 | 
						||
>
 | 
						||
> Did you spot a mistake or come across explanations that are unclear? We always
 | 
						||
> appreciate improvement
 | 
						||
> [suggestions](https://github.com/explosion/spaCy/issues) or
 | 
						||
> [pull requests](https://github.com/explosion/spaCy/pulls). You can find a
 | 
						||
> "Suggest edits" link at the bottom of each page that points you to the source.
 | 
						||
 | 
						||
<Infobox title="Take the free interactive course">
 | 
						||
 | 
						||
[](https://course.spacy.io)
 | 
						||
 | 
						||
In this course you'll learn how to use spaCy to build advanced natural language
 | 
						||
understanding systems, using both rule-based and machine learning approaches. It
 | 
						||
includes 55 exercises featuring interactive coding practice, multiple-choice
 | 
						||
questions and slide decks.
 | 
						||
 | 
						||
<p><Button to="https://course.spacy.io" variant="primary">Start the course</Button></p>
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
## What's spaCy? {#whats-spacy}
 | 
						||
 | 
						||
<Grid cols={2}>
 | 
						||
 | 
						||
<div>
 | 
						||
 | 
						||
spaCy is a **free, open-source library** for advanced **Natural Language
 | 
						||
Processing** (NLP) in Python.
 | 
						||
 | 
						||
If you're working with a lot of text, you'll eventually want to know more about
 | 
						||
it. For example, what's it about? What do the words mean in context? Who is
 | 
						||
doing what to whom? What companies and products are mentioned? Which texts are
 | 
						||
similar to each other?
 | 
						||
 | 
						||
spaCy is designed specifically for **production use** and helps you build
 | 
						||
applications that process and "understand" large volumes of text. It can be used
 | 
						||
to build **information extraction** or **natural language understanding**
 | 
						||
systems, or to pre-process text for **deep learning**.
 | 
						||
 | 
						||
</div>
 | 
						||
 | 
						||
<Infobox title="Table of contents" id="toc">
 | 
						||
 | 
						||
- [Features](#features)
 | 
						||
- [Linguistic annotations](#annotations)
 | 
						||
- [Tokenization](#annotations-token)
 | 
						||
- [POS tags and dependencies](#annotations-pos-deps)
 | 
						||
- [Named entities](#annotations-ner)
 | 
						||
- [Word vectors and similarity](#vectors-similarity)
 | 
						||
- [Pipelines](#pipelines)
 | 
						||
- [Library architecture](#architecture)
 | 
						||
- [Vocab, hashes and lexemes](#vocab)
 | 
						||
- [Serialization](#serialization)
 | 
						||
- [Training](#training)
 | 
						||
- [Language data](#language-data)
 | 
						||
- [Community & FAQ](#community)
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
</Grid>
 | 
						||
 | 
						||
### What spaCy isn't {#what-spacy-isnt}
 | 
						||
 | 
						||
- ❌ **spaCy is not a platform or "an API"**. Unlike a platform, spaCy does not
 | 
						||
  provide a software as a service, or a web application. It's an open-source
 | 
						||
  library designed to help you build NLP applications, not a consumable service.
 | 
						||
- ❌ **spaCy is not an out-of-the-box chat bot engine**. While spaCy can be used
 | 
						||
  to power conversational applications, it's not designed specifically for chat
 | 
						||
  bots, and only provides the underlying text processing capabilities.
 | 
						||
- ❌**spaCy is not research software**. It's built on the latest research, but
 | 
						||
  it's designed to get things done. This leads to fairly different design
 | 
						||
  decisions than [NLTK](https://github.com/nltk/nltk) or
 | 
						||
  [CoreNLP](https://stanfordnlp.github.io/CoreNLP/), which were created as
 | 
						||
  platforms for teaching and research. The main difference is that spaCy is
 | 
						||
  integrated and opinionated. spaCy tries to avoid asking the user to choose
 | 
						||
  between multiple algorithms that deliver equivalent functionality. Keeping the
 | 
						||
  menu small lets spaCy deliver generally better performance and developer
 | 
						||
  experience.
 | 
						||
- ❌ **spaCy is not a company**. It's an open-source library. Our company
 | 
						||
  publishing spaCy and other software is called
 | 
						||
  [Explosion](https://explosion.ai).
 | 
						||
 | 
						||
## Features {#features}
 | 
						||
 | 
						||
In the documentation, you'll come across mentions of spaCy's features and
 | 
						||
capabilities. Some of them refer to linguistic concepts, while others are
 | 
						||
related to more general machine learning functionality.
 | 
						||
 | 
						||
| Name                                  | Description                                                                                                        |
 | 
						||
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
 | 
						||
| **Tokenization**                      | Segmenting text into words, punctuations marks etc.                                                                |
 | 
						||
| **Part-of-speech** (POS) **Tagging**  | Assigning word types to tokens, like verb or noun.                                                                 |
 | 
						||
| **Dependency Parsing**                | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
 | 
						||
| **Lemmatization**                     | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".      |
 | 
						||
| **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences.                                                                       |
 | 
						||
| **Named Entity Recognition** (NER)    | Labelling named "real-world" objects, like persons, companies or locations.                                        |
 | 
						||
| **Entity Linking** (EL)               | Disambiguating textual entities to unique identifiers in a knowledge base.                                         |
 | 
						||
| **Similarity**                        | Comparing words, text spans and documents and how similar they are to each other.                                  |
 | 
						||
| **Text Classification**               | Assigning categories or labels to a whole document, or parts of a document.                                        |
 | 
						||
| **Rule-based Matching**               | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.       |
 | 
						||
| **Training**                          | Updating and improving a statistical model's predictions.                                                          |
 | 
						||
| **Serialization**                     | Saving objects to files or byte strings.                                                                           |
 | 
						||
 | 
						||
### Statistical models {#statistical-models}
 | 
						||
 | 
						||
While some of spaCy's features work independently, others require
 | 
						||
[trained pipelines](/models) to be loaded, which enable spaCy to **predict**
 | 
						||
linguistic annotations – for example, whether a word is a verb or a noun. A
 | 
						||
trained pipeline can consist of multiple components that use a statistical model
 | 
						||
trained on labeled data. spaCy currently offers trained pipelines for a variety
 | 
						||
of languages, which can be installed as individual Python modules. Pipeline
 | 
						||
packages can differ in size, speed, memory usage, accuracy and the data they
 | 
						||
include. The package you choose always depends on your use case and the texts
 | 
						||
you're working with. For a general-purpose use case, the small, default packages
 | 
						||
are always a good start. They typically include the following components:
 | 
						||
 | 
						||
- **Binary weights** for the part-of-speech tagger, dependency parser and named
 | 
						||
  entity recognizer to predict those annotations in context.
 | 
						||
- **Lexical entries** in the vocabulary, i.e. words and their
 | 
						||
  context-independent attributes like the shape or spelling.
 | 
						||
- **Data files** like lemmatization rules and lookup tables.
 | 
						||
- **Word vectors**, i.e. multi-dimensional meaning representations of words that
 | 
						||
  let you determine how similar they are to each other.
 | 
						||
- **Configuration** options, like the language and processing pipeline settings
 | 
						||
  and model implementations to use, to put spaCy in the correct state when you
 | 
						||
  load the pipeline.
 | 
						||
 | 
						||
## Linguistic annotations {#annotations}
 | 
						||
 | 
						||
spaCy provides a variety of linguistic annotations to give you **insights into a
 | 
						||
text's grammatical structure**. This includes the word types, like the parts of
 | 
						||
speech, and how the words are related to each other. For example, if you're
 | 
						||
analyzing text, it makes a huge difference whether a noun is the subject of a
 | 
						||
sentence, or the object – or whether "google" is used as a verb, or refers to
 | 
						||
the website or company in a specific context.
 | 
						||
 | 
						||
> #### Loading pipelines
 | 
						||
>
 | 
						||
> ```cli
 | 
						||
> $ python -m spacy download en_core_web_sm
 | 
						||
>
 | 
						||
> >>> import spacy
 | 
						||
> >>> nlp = spacy.load("en_core_web_sm")
 | 
						||
> ```
 | 
						||
 | 
						||
Once you've [downloaded and installed](/usage/models) a trained pipeline, you
 | 
						||
can load it via [`spacy.load`](/api/top-level#spacy.load). This will return a
 | 
						||
`Language` object containing all components and data needed to process text. We
 | 
						||
usually call it `nlp`. Calling the `nlp` object on a string of text will return
 | 
						||
a processed `Doc`:
 | 
						||
 | 
						||
```python
 | 
						||
### {executable="true"}
 | 
						||
import spacy
 | 
						||
 | 
						||
nlp = spacy.load("en_core_web_sm")
 | 
						||
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
 | 
						||
for token in doc:
 | 
						||
    print(token.text, token.pos_, token.dep_)
 | 
						||
```
 | 
						||
 | 
						||
Even though a `Doc` is processed – e.g. split into individual words and
 | 
						||
annotated – it still holds **all information of the original text**, like
 | 
						||
whitespace characters. You can always get the offset of a token into the
 | 
						||
original string, or reconstruct the original by joining the tokens and their
 | 
						||
trailing whitespace. This way, you'll never lose any information when processing
 | 
						||
text with spaCy.
 | 
						||
 | 
						||
### Tokenization {#annotations-token}
 | 
						||
 | 
						||
import Tokenization101 from 'usage/101/\_tokenization.md'
 | 
						||
 | 
						||
<Tokenization101 />
 | 
						||
 | 
						||
<Infobox title="Tokenization rules" emoji="📖">
 | 
						||
 | 
						||
To learn more about how spaCy's tokenization rules work in detail, how to
 | 
						||
**customize and replace** the default tokenizer and how to **add
 | 
						||
language-specific data**, see the usage guides on
 | 
						||
[language data](/usage/linguistic-features#language-data) and
 | 
						||
[customizing the tokenizer](/usage/linguistic-features#tokenization).
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Part-of-speech tags and dependencies {#annotations-pos-deps model="parser"}
 | 
						||
 | 
						||
import PosDeps101 from 'usage/101/\_pos-deps.md'
 | 
						||
 | 
						||
<PosDeps101 />
 | 
						||
 | 
						||
<Infobox title="Part-of-speech tagging and morphology" emoji="📖">
 | 
						||
 | 
						||
To learn more about **part-of-speech tagging** and rule-based morphology, and
 | 
						||
how to **navigate and use the parse tree** effectively, see the usage guides on
 | 
						||
[part-of-speech tagging](/usage/linguistic-features#pos-tagging) and
 | 
						||
[using the dependency parse](/usage/linguistic-features#dependency-parse).
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Named Entities {#annotations-ner model="ner"}
 | 
						||
 | 
						||
import NER101 from 'usage/101/\_named-entities.md'
 | 
						||
 | 
						||
<NER101 />
 | 
						||
 | 
						||
<Infobox title="Named Entity Recognition" emoji="📖">
 | 
						||
 | 
						||
To learn more about entity recognition in spaCy, how to **add your own
 | 
						||
entities** to a document and how to **train and update** the entity predictions
 | 
						||
of a model, see the usage guides on
 | 
						||
[named entity recognition](/usage/linguistic-features#named-entities) and
 | 
						||
[training pipelines](/usage/training).
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Word vectors and similarity {#vectors-similarity model="vectors"}
 | 
						||
 | 
						||
import Vectors101 from 'usage/101/\_vectors-similarity.md'
 | 
						||
 | 
						||
<Vectors101 />
 | 
						||
 | 
						||
<Infobox title="Word vectors" emoji="📖">
 | 
						||
 | 
						||
To learn more about word vectors, how to **customize them** and how to load
 | 
						||
**your own vectors** into spaCy, see the usage guide on
 | 
						||
[using word vectors and semantic similarities](/usage/linguistic-features#vectors-similarity).
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
## Pipelines {#pipelines}
 | 
						||
 | 
						||
import Pipelines101 from 'usage/101/\_pipelines.md'
 | 
						||
 | 
						||
<Pipelines101 />
 | 
						||
 | 
						||
<Infobox title="Processing pipelines" emoji="📖">
 | 
						||
 | 
						||
To learn more about **how processing pipelines work** in detail, how to enable
 | 
						||
and disable their components, and how to **create your own**, see the usage
 | 
						||
guide on [language processing pipelines](/usage/processing-pipelines).
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
## Architecture {#architecture}
 | 
						||
 | 
						||
import Architecture101 from 'usage/101/\_architecture.md'
 | 
						||
 | 
						||
<Architecture101 />
 | 
						||
 | 
						||
## Vocab, hashes and lexemes {#vocab}
 | 
						||
 | 
						||
Whenever possible, spaCy tries to store data in a vocabulary, the
 | 
						||
[`Vocab`](/api/vocab), that will be **shared by multiple documents**. To save
 | 
						||
memory, spaCy also encodes all strings to **hash values** – in this case for
 | 
						||
example, "coffee" has the hash `3197928453018144401`. Entity labels like "ORG"
 | 
						||
and part-of-speech tags like "VERB" are also encoded. Internally, spaCy only
 | 
						||
"speaks" in hash values.
 | 
						||
 | 
						||
> - **Token**: A word, punctuation mark etc. _in context_, including its
 | 
						||
>   attributes, tags and dependencies.
 | 
						||
> - **Lexeme**: A "word type" with no context. Includes the word shape and
 | 
						||
>   flags, e.g. if it's lowercase, a digit or punctuation.
 | 
						||
> - **Doc**: A processed container of tokens in context.
 | 
						||
> - **Vocab**: The collection of lexemes.
 | 
						||
> - **StringStore**: The dictionary mapping hash values to strings, for example
 | 
						||
>   `3197928453018144401` → "coffee".
 | 
						||
 | 
						||

 | 
						||
 | 
						||
If you process lots of documents containing the word "coffee" in all kinds of
 | 
						||
different contexts, storing the exact string "coffee" every time would take up
 | 
						||
way too much space. So instead, spaCy hashes the string and stores it in the
 | 
						||
[`StringStore`](/api/stringstore). You can think of the `StringStore` as a
 | 
						||
**lookup table that works in both directions** – you can look up a string to get
 | 
						||
its hash, or a hash to get its string:
 | 
						||
 | 
						||
```python
 | 
						||
### {executable="true"}
 | 
						||
import spacy
 | 
						||
 | 
						||
nlp = spacy.load("en_core_web_sm")
 | 
						||
doc = nlp("I love coffee")
 | 
						||
print(doc.vocab.strings["coffee"])  # 3197928453018144401
 | 
						||
print(doc.vocab.strings[3197928453018144401])  # 'coffee'
 | 
						||
```
 | 
						||
 | 
						||
Now that all strings are encoded, the entries in the vocabulary **don't need to
 | 
						||
include the word text** themselves. Instead, they can look it up in the
 | 
						||
`StringStore` via its hash value. Each entry in the vocabulary, also called
 | 
						||
[`Lexeme`](/api/lexeme), contains the **context-independent** information about
 | 
						||
a word. For example, no matter if "love" is used as a verb or a noun in some
 | 
						||
context, its spelling and whether it consists of alphabetic characters won't
 | 
						||
ever change. Its hash value will also always be the same.
 | 
						||
 | 
						||
```python
 | 
						||
### {executable="true"}
 | 
						||
import spacy
 | 
						||
 | 
						||
nlp = spacy.load("en_core_web_sm")
 | 
						||
doc = nlp("I love coffee")
 | 
						||
for word in doc:
 | 
						||
    lexeme = doc.vocab[word.text]
 | 
						||
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
 | 
						||
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)
 | 
						||
```
 | 
						||
 | 
						||
> - **Text**: The original text of the lexeme.
 | 
						||
> - **Orth**: The hash value of the lexeme.
 | 
						||
> - **Shape**: The abstract word shape of the lexeme.
 | 
						||
> - **Prefix**: By default, the first letter of the word string.
 | 
						||
> - **Suffix**: By default, the last three letters of the word string.
 | 
						||
> - **is alpha**: Does the lexeme consist of alphabetic characters?
 | 
						||
> - **is digit**: Does the lexeme consist of digits?
 | 
						||
 | 
						||
| Text   | Orth                  | Shape  | Prefix | Suffix | is_alpha | is_digit |
 | 
						||
| ------ | --------------------- | ------ | ------ | ------ | -------- | -------- |
 | 
						||
| I      | `4690420944186131903` | `X`    | I      | I      | `True`   | `False`  |
 | 
						||
| love   | `3702023516439754181` | `xxxx` | l      | ove    | `True`   | `False`  |
 | 
						||
| coffee | `3197928453018144401` | `xxxx` | c      | fee    | `True`   | `False`  |
 | 
						||
 | 
						||
The mapping of words to hashes doesn't depend on any state. To make sure each
 | 
						||
value is unique, spaCy uses a
 | 
						||
[hash function](https://en.wikipedia.org/wiki/Hash_function) to calculate the
 | 
						||
hash **based on the word string**. This also means that the hash for "coffee"
 | 
						||
will always be the same, no matter which pipeline you're using or how you've
 | 
						||
configured spaCy.
 | 
						||
 | 
						||
However, hashes **cannot be reversed** and there's no way to resolve
 | 
						||
`3197928453018144401` back to "coffee". All spaCy can do is look it up in the
 | 
						||
vocabulary. That's why you always need to make sure all objects you create have
 | 
						||
access to the same vocabulary. If they don't, spaCy might not be able to find
 | 
						||
the strings it needs.
 | 
						||
 | 
						||
```python
 | 
						||
### {executable="true"}
 | 
						||
import spacy
 | 
						||
from spacy.tokens import Doc
 | 
						||
from spacy.vocab import Vocab
 | 
						||
 | 
						||
nlp = spacy.load("en_core_web_sm")
 | 
						||
doc = nlp("I love coffee")  # Original Doc
 | 
						||
print(doc.vocab.strings["coffee"])  # 3197928453018144401
 | 
						||
print(doc.vocab.strings[3197928453018144401])  # 'coffee' 👍
 | 
						||
 | 
						||
empty_doc = Doc(Vocab())  # New Doc with empty Vocab
 | 
						||
# empty_doc.vocab.strings[3197928453018144401] will raise an error :(
 | 
						||
 | 
						||
empty_doc.vocab.strings.add("coffee")  # Add "coffee" and generate hash
 | 
						||
print(empty_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍
 | 
						||
 | 
						||
new_doc = Doc(doc.vocab)  # Create new doc with first doc's vocab
 | 
						||
print(new_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍
 | 
						||
```
 | 
						||
 | 
						||
If the vocabulary doesn't contain a string for `3197928453018144401`, spaCy will
 | 
						||
raise an error. You can re-add "coffee" manually, but this only works if you
 | 
						||
actually _know_ that the document contains that word. To prevent this problem,
 | 
						||
spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This
 | 
						||
will give you the object and its encoded annotations, plus the "key" to decode
 | 
						||
it.
 | 
						||
 | 
						||
## Serialization {#serialization}
 | 
						||
 | 
						||
import Serialization101 from 'usage/101/\_serialization.md'
 | 
						||
 | 
						||
<Serialization101 />
 | 
						||
 | 
						||
<Infobox title="Saving and loading" emoji="📖">
 | 
						||
 | 
						||
To learn more about how to **save and load your own pipelines**, see the usage
 | 
						||
guide on [saving and loading](/usage/saving-loading#models).
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
## Training {#training}
 | 
						||
 | 
						||
import Training101 from 'usage/101/\_training.md'
 | 
						||
 | 
						||
<Training101 />
 | 
						||
 | 
						||
<Infobox title="Training pipelines and models" emoji="📖">
 | 
						||
 | 
						||
To learn more about **training and updating** pipelines, how to create training
 | 
						||
data and how to improve spaCy's named models, see the usage guides on
 | 
						||
[training](/usage/training).
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Training config and lifecycle {#training-config}
 | 
						||
 | 
						||
Training config files include all **settings and hyperparameters** for training
 | 
						||
your pipeline. Instead of providing lots of arguments on the command line, you
 | 
						||
only need to pass your `config.cfg` file to [`spacy train`](/api/cli#train).
 | 
						||
This also makes it easy to integrate custom models and architectures, written in
 | 
						||
your framework of choice. A pipeline's `config.cfg` is considered the "single
 | 
						||
source of truth", both at **training** and **runtime**.
 | 
						||
 | 
						||
> ```ini
 | 
						||
> ### config.cfg (excerpt)
 | 
						||
> [training]
 | 
						||
> accumulate_gradient = 3
 | 
						||
>
 | 
						||
> [training.optimizer]
 | 
						||
> @optimizers = "Adam.v1"
 | 
						||
>
 | 
						||
> [training.optimizer.learn_rate]
 | 
						||
> @schedules = "warmup_linear.v1"
 | 
						||
> warmup_steps = 250
 | 
						||
> total_steps = 20000
 | 
						||
> initial_rate = 0.01
 | 
						||
> ```
 | 
						||
 | 
						||

 | 
						||
 | 
						||
<Infobox title="Training configuration system" emoji="📖">
 | 
						||
 | 
						||
For more details on spaCy's **configuration system** and how to use it to
 | 
						||
customize your pipeline components, component models, training settings and
 | 
						||
hyperparameters, see the [training config](/usage/training#config) usage guide.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### Trainable components {#training-components}
 | 
						||
 | 
						||
spaCy's [`Pipe`](/api/pipe) class helps you implement your own trainable
 | 
						||
components that have their own model instance, make predictions over `Doc`
 | 
						||
objects and can be updated using [`spacy train`](/api/cli#train). This lets you
 | 
						||
plug fully custom machine learning components into your pipeline that can be
 | 
						||
configured via a single training config.
 | 
						||
 | 
						||
> #### config.cfg (excerpt)
 | 
						||
>
 | 
						||
> ```ini
 | 
						||
> [components.my_component]
 | 
						||
> factory = "my_component"
 | 
						||
>
 | 
						||
> [components.my_component.model]
 | 
						||
> @architectures = "my_model.v1"
 | 
						||
> width = 128
 | 
						||
> ```
 | 
						||
 | 
						||

 | 
						||
 | 
						||
<Infobox title="Custom trainable components" emoji="📖">
 | 
						||
 | 
						||
To learn more about how to implement your own **model architectures** and use
 | 
						||
them to power custom **trainable components**, see the usage guides on the
 | 
						||
[trainable component API](/usage/processing-pipelines#trainable-components) and
 | 
						||
implementing [layers and architectures](/usage/layers-architectures#components)
 | 
						||
for trainable components.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
## Language data {#language-data}
 | 
						||
 | 
						||
import LanguageData101 from 'usage/101/\_language-data.md'
 | 
						||
 | 
						||
<LanguageData101 />
 | 
						||
 | 
						||
## Community & FAQ {#community-faq}
 | 
						||
 | 
						||
We're very happy to see the spaCy community grow and include a mix of people
 | 
						||
from all kinds of different backgrounds – computational linguistics, data
 | 
						||
science, deep learning, research and more. If you'd like to get involved, below
 | 
						||
are some answers to the most important questions and resources for further
 | 
						||
reading.
 | 
						||
 | 
						||
### Help, my code isn't working! {#faq-help-code}
 | 
						||
 | 
						||
Bugs suck, and we're doing our best to continuously improve the tests and fix
 | 
						||
bugs as soon as possible. Before you submit an issue, do a quick search and
 | 
						||
check if the problem has already been reported. If you're having installation or
 | 
						||
loading problems, make sure to also check out the
 | 
						||
[troubleshooting guide](/usage/#troubleshooting). Help with spaCy is available
 | 
						||
via the following platforms:
 | 
						||
 | 
						||
> #### How do I know if something is a bug?
 | 
						||
>
 | 
						||
> Of course, it's always hard to know for sure, so don't worry – we're not going
 | 
						||
> to be mad if a bug report turns out to be a typo in your code. As a simple
 | 
						||
> rule, any C-level error without a Python traceback, like a **segmentation
 | 
						||
> fault** or **memory error**, is **always** a spaCy bug.
 | 
						||
>
 | 
						||
> Because models are statistical, their performance will never be _perfect_.
 | 
						||
> However, if you come across **patterns that might indicate an underlying
 | 
						||
> issue**, please do file a report. Similarly, we also care about behaviors that
 | 
						||
> **contradict our docs**.
 | 
						||
 | 
						||
- [Stack Overflow](https://stackoverflow.com/questions/tagged/spacy): **Usage
 | 
						||
  questions** and everything related to problems with your specific code. The
 | 
						||
  Stack Overflow community is much larger than ours, so if your problem can be
 | 
						||
  solved by others, you'll receive help much quicker.
 | 
						||
- [Gitter chat](https://gitter.im/explosion/spaCy): **General discussion** about
 | 
						||
  spaCy, meeting other community members and exchanging **tips, tricks and best
 | 
						||
  practices**.
 | 
						||
- [GitHub issue tracker](https://github.com/explosion/spaCy/issues): **Bug
 | 
						||
  reports** and **improvement suggestions**, i.e. everything that's likely
 | 
						||
  spaCy's fault. This also includes problems with the trained pipelines beyond
 | 
						||
  statistical imprecisions, like patterns that point to a bug.
 | 
						||
 | 
						||
<Infobox title="Important note" variant="warning">
 | 
						||
 | 
						||
Please understand that we won't be able to provide individual support via email.
 | 
						||
We also believe that help is much more valuable if it's shared publicly, so that
 | 
						||
**more people can benefit from it**. If you come across an issue and you think
 | 
						||
you might be able to help, consider posting a quick update with your solution.
 | 
						||
No matter how simple, it can easily save someone a lot of time and headache –
 | 
						||
and the next time you need help, they might repay the favor.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### How can I contribute to spaCy? {#faq-contributing}
 | 
						||
 | 
						||
You don't have to be an NLP expert or Python pro to contribute, and we're happy
 | 
						||
to help you get started. If you're new to spaCy, a good place to start is the
 | 
						||
[`help wanted (easy)` label](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A"help+wanted+%28easy%29")
 | 
						||
on GitHub, which we use to tag bugs and feature requests that are easy and
 | 
						||
self-contained. We also appreciate contributions to the docs – whether it's
 | 
						||
fixing a typo, improving an example or adding additional explanations. You'll
 | 
						||
find a "Suggest edits" link at the bottom of each page that points you to the
 | 
						||
source.
 | 
						||
 | 
						||
Another way of getting involved is to help us improve the
 | 
						||
[language data](/usage/linguistic-features#language-data) – especially if you
 | 
						||
happen to speak one of the languages currently in
 | 
						||
[alpha support](/usage/models#languages). Even adding simple tokenizer
 | 
						||
exceptions, stop words or lemmatizer data can make a big difference. It will
 | 
						||
also make it easier for us to provide a trained pipeline for the language in the
 | 
						||
future. Submitting a test that documents a bug or performance issue, or covers
 | 
						||
functionality that's especially important for your application is also very
 | 
						||
helpful. This way, you'll also make sure we never accidentally introduce
 | 
						||
regressions to the parts of the library that you care about the most.
 | 
						||
 | 
						||
**For more details on the types of contributions we're looking for, the code
 | 
						||
conventions and other useful tips, make sure to check out the
 | 
						||
[contributing guidelines](%%GITHUB_SPACY/CONTRIBUTING.md).**
 | 
						||
 | 
						||
<Infobox title="Code of Conduct" variant="warning">
 | 
						||
 | 
						||
spaCy adheres to the
 | 
						||
[Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/).
 | 
						||
By participating, you are expected to uphold this code.
 | 
						||
 | 
						||
</Infobox>
 | 
						||
 | 
						||
### I've built something cool with spaCy – how can I get the word out? {#faq-project-with-spacy}
 | 
						||
 | 
						||
First, congrats – we'd love to check it out! When you share your project on
 | 
						||
Twitter, don't forget to tag [@spacy_io](https://twitter.com/spacy_io) so we
 | 
						||
don't miss it. If you think your project would be a good fit for the
 | 
						||
[spaCy Universe](/universe), **feel free to submit it!** Tutorials are also
 | 
						||
incredibly valuable to other users and a great way to get exposure. So we
 | 
						||
strongly encourage **writing up your experiences**, or sharing your code and
 | 
						||
some tips and tricks on your blog. Since our website is open-source, you can add
 | 
						||
your project or tutorial by making a pull request on GitHub.
 | 
						||
 | 
						||
If you would like to use the spaCy logo on your site, please get in touch and
 | 
						||
ask us first. However, if you want to show support and tell others that your
 | 
						||
project is using spaCy, you can grab one of our **spaCy badges** here:
 | 
						||
 | 
						||
<img src={`https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg`} />
 | 
						||
 | 
						||
```markdown
 | 
						||
[](https://spacy.io)
 | 
						||
```
 | 
						||
 | 
						||
<img src={`https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg`}
 | 
						||
/>
 | 
						||
 | 
						||
```markdown
 | 
						||
[](https://spacy.io)
 | 
						||
```
 |