spaCy/website/docs/usage/101/_tokenization.md

During processing, spaCy first **tokenizes** the text, i.e. segments it into
words, punctuation and so on. This is done by applying rules specific to each
language. For example, punctuation at the end of a sentence should be split off
– whereas "U.K." should remain one token. Each `Doc` consists of individual
tokens, and we can iterate over them:

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
```

|   0   |  1  |    2    |  3  |   4    |  5   |    6    |  7  |  8  |  9  |   10    |
| :---: | :-: | :-----: | :-: | :----: | :--: | :-----: | :-: | :-: | :-: | :-----: |
| Apple | is  | looking | at  | buying | U.K. | startup | for | \$  |  1  | billion |

First, the raw text is split on whitespace characters, similar to
`text.split(' ')`. Then, the tokenizer processes the text from left to right. On
each substring, it performs two checks:

1. **Does the substring match a tokenizer exception rule?** For example, "don't"
   does not contain whitespace, but should be split into two tokens, "do" and
   "n't", while "U.K." should always remain one token.

2. **Can a prefix, suffix or infix be split off?** For example, punctuation like
   commas, periods, hyphens or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop,
starting with the newly split substrings. This way, spaCy can split **complex,
nested tokens** like combinations of abbreviations and multiple punctuation
marks.
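
The loop described above can be illustrated with a toy sketch in plain Python.
This is not spaCy's implementation: the `toy_tokenize` function, the
`SPECIAL_CASES` table and the three regular expressions are hypothetical,
hand-picked stand-ins for the much larger language data spaCy ships, and the
real tokenizer handles more cases than shown here.

```python
import re

# Hypothetical toy rules, far smaller than spaCy's real language data.
SPECIAL_CASES = {
    "don't": ["do", "n't"],  # split into two tokens
    "U.K.": ["U.K."],        # protect from punctuation rules
}
PREFIXES = re.compile(r'^[$("“¿]')
SUFFIXES = re.compile(r'((?<=[0-9])km|[)"”!.,])$')
INFIXES = re.compile(r'--|[-/…]')


def toy_tokenize(text):
    tokens = []
    # Step 1: split the raw text on whitespace.
    for substring in text.split():
        suffixes = []  # collected from the outside in, reversed before use
        while substring:
            # Check 1: an exception rule takes priority over punctuation rules.
            if substring in SPECIAL_CASES:
                tokens.extend(SPECIAL_CASES[substring])
                break
            # Check 2: split off one prefix, then re-run the checks.
            match = PREFIXES.search(substring)
            if match:
                tokens.append(match.group())
                substring = substring[match.end():]
                continue
            # Check 2 (continued): split off one suffix, then re-run the checks.
            match = SUFFIXES.search(substring)
            if match:
                suffixes.append(match.group())
                substring = substring[:match.start()]
                continue
            # Nothing left to peel off: split the remainder on infixes.
            start = 0
            for match in INFIXES.finditer(substring):
                if match.start() > start:
                    tokens.append(substring[start:match.start()])
                tokens.append(match.group())
                start = match.end()
            if start < len(substring):
                tokens.append(substring[start:])
            break
        # Suffixes were gathered right to left, so restore reading order.
        tokens.extend(reversed(suffixes))
    return tokens


print(toy_tokenize("Apple is looking at buying U.K. startup for $1 billion"))
print(toy_tokenize("(don't visit the U.K.!)"))
```

With these toy rules, the second call yields `(`, `do`, `n't`, `visit`, `the`,
`U.K.`, `!`, `)`: the substring `U.K.!)` loses `)` and `!` as suffixes before
the remaining `U.K.` matches an exception and is kept whole.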

> - **Tokenizer exception:** Special-case rule to split a string into several
>   tokens or prevent a token from being split when punctuation rules are
>   applied.
> - **Prefix:** Character(s) at the beginning, e.g. `$`, `(`, `“`, `¿`.
> - **Suffix:** Character(s) at the end, e.g. `km`, `)`, `”`, `!`.
> - **Infix:** Character(s) in between, e.g. `-`, `--`, `/`, `…`.

![Example of the tokenization process](../../images/tokenization.svg)

While punctuation rules are usually pretty general, tokenizer exceptions
strongly depend on the specifics of the individual language. This is why each
[available language](/usage/models#languages) has its own subclass, like
`English` or `German`, that loads in lists of hard-coded data and exception
rules.