spaCy/website/docs/usage/101/_training.md

spaCy's tagger, parser, text categorizer and many other components are powered
by **statistical models**. Every "decision" these components make – for example,
which part-of-speech tag to assign, or whether a word is a named entity – is a
**prediction** based on the model's current **weight values**. The weight values
are estimated based on examples the model has seen during **training**. To train
a model, you first need training data – examples of text, and the labels you
want the model to predict. This could be a part-of-speech tag, a named entity or
any other information.
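
As a rough illustration, training data for a named entity task can be thought of as pairs of text and annotations. The sketch below uses plain Python tuples with character offsets; the structure, the example sentence and the label names `ORG` and `GPE` are assumptions made for illustration, not a required format.

```python
# Illustrative only: each example pairs a text with the labels we want
# the model to predict. Entities are (start_char, end_char, label).
TRAIN_DATA = [
    (
        "Amazon opened a new office in Berlin",
        {"entities": [(0, 6, "ORG"), (30, 36, "GPE")]},
    ),
]

# Recover the annotated spans to check that the offsets match the text.
text, annotations = TRAIN_DATA[0]
spans = [(text[start:end], label) for start, end, label in annotations["entities"]]
```

Slicing the text with the offsets is a cheap sanity check that the annotations line up with the words they are meant to label.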

Training is an iterative process in which the model's predictions are compared
against the reference annotations in order to estimate the **gradient of the
loss**. The gradient of the loss is then used to calculate the gradient of the
weights through [backpropagation](https://thinc.ai/backprop101). The gradients
indicate how the weight values should be changed so that the model's predictions
become more similar to the reference labels over time.
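
The loop described above can be sketched with a toy model that has a single weight. This is a minimal illustration in plain Python, assuming a squared-error loss and a fixed learning rate; the function names are hypothetical and this is not how spaCy's components are implemented internally.

```python
# Toy model: prediction = weight * x. All names here are illustrative.

def predict(weight, x):
    return weight * x

def loss_gradient(prediction, reference):
    # Gradient of the loss 0.5 * (prediction - reference) ** 2
    # with respect to the prediction.
    return prediction - reference

def train(examples, weight=0.0, learn_rate=0.1, n_iter=50):
    for _ in range(n_iter):
        for x, reference in examples:
            prediction = predict(weight, x)
            d_prediction = loss_gradient(prediction, reference)
            # Backpropagation via the chain rule: the gradient of the
            # weight is the gradient of the loss times d(prediction)/d(weight).
            d_weight = d_prediction * x
            # Step against the gradient so the loss gets smaller.
            weight -= learn_rate * d_weight
    return weight

# The reference labels follow y = 2 * x, so the weight should approach 2.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
trained_weight = train(examples)
```

Each pass compares the prediction with the reference, turns the difference into a gradient, and nudges the weight slightly, which is why the predictions only become more similar to the reference labels over many iterations.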

> - **Training data:** Examples and their annotations.
> - **Text:** The input text the model should predict a label for.
> - **Label:** The label the model should predict.
> - **Gradient:** The direction and rate of change for a numeric value.
>   Adjusting the weights against the gradient of the loss should result in
>   predictions that are closer to the reference labels on the training data.

![The training process](../../images/training.svg)

When training a model, we don't just want it to memorize our examples – we want
it to come up with a theory that can be **generalized across unseen data**.
After all, we don't just want the model to learn that this one instance of
"Amazon" right here is a company – we want it to learn that "Amazon", in
contexts _like this_, is most likely a company. That's why the training data
should always be representative of the data we want to process. A model trained
on Wikipedia, where sentences in the first person are extremely rare, will
likely perform badly on Twitter. Similarly, a model trained on romantic novels
will likely perform badly on legal text.

This also means that in order to know how the model is performing, and whether
it's learning the right things, you don't only need **training data** – you'll
also need **evaluation data**. If you only test the model with the data it was
trained on, you'll have no idea how well it's generalizing. If you want to train
a model from scratch, you usually need at least a few hundred examples for both
training and evaluation.
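
To see why held-out data matters, here is a small sketch in plain Python. It assumes a hypothetical "model" that simply memorizes its training examples; the helper names, the 80/20 split and the toy labels are all illustrative choices, not recommendations.

```python
import random

def split_examples(examples, eval_fraction=0.2, seed=0):
    # Shuffle a copy so the split does not depend on the original order.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

def accuracy(predict, examples):
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)

# Ten unique toy examples with made-up labels.
examples = [(f"text {i}", "EVEN" if i % 2 == 0 else "ODD") for i in range(10)]
train_data, eval_data = split_examples(examples)

# A "model" that only memorizes what it saw during training.
memory = dict(train_data)

def memorizing_model(text):
    return memory.get(text, "UNKNOWN")

train_accuracy = accuracy(memorizing_model, train_data)
eval_accuracy = accuracy(memorizing_model, eval_data)
```

The memorizing model is perfect on the data it was trained on and useless on the held-out examples, which is exactly the failure that testing only on training data would hide.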