spaCy/website/docs/usage/101/_training.md

spaCy's tagger, parser, text categorizer and many other components are powered
by **statistical models**. Every "decision" these components make – for example,
which part-of-speech tag to assign, or whether a word is a named entity – is a
**prediction** based on the model's current **weight values**. The weight
values are estimated based on examples the model has seen
during **training**. To train a model, you first need training data – examples
of text, and the labels you want the model to predict. A label could be a
part-of-speech tag, a named entity or any other information.
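
As a small illustration, training data for named entities could be written as
pairs of text and annotations. This is only a sketch: the sentences, the
`TRAIN_DATA` name and the `entity_texts` helper are made up for this example,
and the exact annotation format depends on the component you train.

```python
# Each example pairs an input text with the labels the model should predict.
# Entities are given as (start_char, end_char, label) character offsets.
# The sentences and the variable name are illustrative assumptions.
TRAIN_DATA = [
    ("Amazon is hiring engineers in Berlin",
     {"entities": [(0, 6, "ORG"), (30, 36, "GPE")]}),
    ("She works for Apple",
     {"entities": [(14, 19, "ORG")]}),
]


def entity_texts(text, annotations):
    """Return (span_text, label) pairs so the offsets can be checked by eye."""
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


for text, annotations in TRAIN_DATA:
    print(entity_texts(text, annotations))
```

Printing the spans is a cheap way to catch offsets that are off by one before
any training happens.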

Training is an iterative process in which the model's predictions are compared 
against the reference annotations in order to estimate the **gradient of the
loss**. The gradient of the loss is then used to calculate the gradient of the
weights through [backpropagation](https://thinc.ai/backprop101). The gradients
indicate how the weight values should be changed so that the model's
predictions become more similar to the reference labels over time. 
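
The loop described above can be sketched in plain Python with a tiny logistic
model. This is not how spaCy or Thinc implement training: the toy data, the
function names (`predict`, `loss_and_gradients`, `train`), the number of steps
and the learning rate are all assumptions chosen to keep the example short.

```python
import math

# Toy training data (an assumption for this sketch): one numeric feature per
# example, paired with a 0/1 reference label.
TRAIN = [(0.5, 0), (1.0, 0), (2.5, 1), (3.0, 1)]


def predict(weight, bias, x):
    """Return the predicted probability that the label is 1."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def loss_and_gradients(weight, bias, data):
    """Compare predictions with the reference labels.

    Returns the mean cross-entropy loss and its gradient with respect to
    the weight and the bias.
    """
    loss = d_weight = d_bias = 0.0
    for x, label in data:
        p = predict(weight, bias, x)
        loss -= label * math.log(p) + (1 - label) * math.log(1 - p)
        d_score = p - label      # gradient of the loss w.r.t. the raw score
        d_weight += d_score * x  # chain rule: pass the gradient back to the weight
        d_bias += d_score        # ... and to the bias
    n = len(data)
    return loss / n, d_weight / n, d_bias / n


def train(data, steps=200, learn_rate=0.5):
    """Repeatedly nudge the weights against their gradients."""
    weight, bias = 0.0, 0.0
    losses = []
    for _ in range(steps):
        loss, d_weight, d_bias = loss_and_gradients(weight, bias, data)
        losses.append(loss)
        weight -= learn_rate * d_weight
        bias -= learn_rate * d_bias
    return weight, bias, losses


weight, bias, losses = train(TRAIN)
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The recorded `losses` let you check that the predictions move closer to the
reference labels as the updates accumulate.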

> - **Training data:** Examples and their annotations.
> - **Text:** The input text the model should predict a label for.
> - **Label:** The label the model should predict.
> - **Gradient:** The direction and rate of change for a numeric value.
>   Updating the weights in the opposite direction of their gradient reduces
>   the loss, which should result in predictions that are closer to the
>   reference labels on the training data.

![The training process](../../images/training.svg)

When training a model, we don't just want it to memorize our examples – we want
it to come up with a theory that can be **generalized across unseen data**.
After all, we don't just want the model to learn that this one instance of
"Amazon" right here is a company – we want it to learn that "Amazon", in
contexts _like this_, is most likely a company. That's why the training data
should always be representative of the data we want to process. A model trained
on Wikipedia, where sentences in the first person are extremely rare, will
likely perform badly on Twitter. Similarly, a model trained on romantic novels
will likely perform badly on legal text.

This also means that in order to know how the model is performing, and whether
it's learning the right things, you don't only need **training data** – you'll
also need **evaluation data**. If you only test the model with the data it was
trained on, you'll have no idea how well it's generalizing. If you want to train
a model from scratch, you usually need at least a few hundred examples for both
training and evaluation. A good rule of thumb is that you need roughly 10 times
as many samples for each additional significant figure of accuracy you report.
If you only have 100 samples and your model predicts 92 of them correctly, you
would report an accuracy of 0.9 rather than 0.92.
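
A minimal sketch of holding out evaluation data and reporting accuracy for the
92-out-of-100 case might look like this. The helper names (`split_data`,
`accuracy`, `standard_error`), the 80/20 split and the fixed seed are
assumptions for illustration, and the binomial standard error is just one
common way to see why the second digit is not trustworthy at this sample size.

```python
import math
import random


def split_data(examples, eval_fraction=0.2, seed=0):
    """Shuffle the examples and split them into (training, evaluation) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def accuracy(predicted, reference):
    """Return the fraction of predictions that match the reference labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def standard_error(acc, n_samples):
    """Return the binomial standard error of an accuracy estimate."""
    return math.sqrt(acc * (1 - acc) / n_samples)


# Hold out a fifth of 100 (placeholder) examples for evaluation.
train_examples, eval_examples = split_data(range(100))

# The case from the text: 92 of 100 evaluation samples predicted correctly.
reference = [1] * 92 + [0] * 8
predicted = [1] * 100
acc = accuracy(predicted, reference)
print(f"raw: {acc}, reported: {round(acc, 1)}, "
      f"standard error: {standard_error(acc, len(reference)):.3f}")
```

With an uncertainty of nearly 0.03 on 100 samples, the second decimal place
carries little information, which is why rounding to 0.9 is the safer report.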