mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-30 20:06:30 +03:00
51 lines
2.3 KiB
Markdown
51 lines
2.3 KiB
Markdown
|
During processing, spaCy first **tokenizes** the text, i.e. segments it into
|
|||
|
words, punctuation and so on. This is done by applying rules specific to each
|
|||
|
language. For example, punctuation at the end of a sentence should be split off
|
|||
|
– whereas "U.K." should remain one token. Each `Doc` consists of individual
|
|||
|
tokens, and we can iterate over them:
|
|||
|
|
|||
|
```python
|
|||
|
### {executable="true"}
|
|||
|
import spacy
|
|||
|
|
|||
|
nlp = spacy.load("en_core_web_sm")
|
|||
|
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
|
|||
|
for token in doc:
|
|||
|
print(token.text)
|
|||
|
```
|
|||
|
|
|||
|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|
|||
|
| :---: | :-: | :-----: | :-: | :----: | :--: | :-----: | :-: | :-: | :-: | :-----: |
|
|||
|
| Apple | is | looking | at | buying | U.K. | startup | for | \$ | 1 | billion |
|
|||
|
|
|||
|
First, the raw text is split on whitespace characters, similar to
|
|||
|
`text.split(' ')`. Then, the tokenizer processes the text from left to right. On
|
|||
|
each substring, it performs two checks:
|
|||
|
|
|||
|
1. **Does the substring match a tokenizer exception rule?** For example, "don't"
|
|||
|
does not contain whitespace, but should be split into two tokens, "do" and
|
|||
|
"n't", while "U.K." should always remain one token.
|
|||
|
|
|||
|
2. **Can a prefix, suffix or infix be split off?** For example punctuation like
|
|||
|
commas, periods, hyphens or quotes.
|
|||
|
|
|||
|
If there's a match, the rule is applied and the tokenizer continues its loop,
|
|||
|
starting with the newly split substrings. This way, spaCy can split **complex,
|
|||
|
nested tokens** like combinations of abbreviations and multiple punctuation
|
|||
|
marks.
|
|||
|
|
|||
|
> - **Tokenizer exception:** Special-case rule to split a string into several
|
|||
|
> tokens or prevent a token from being split when punctuation rules are
|
|||
|
> applied.
|
|||
|
> - **Prefix:** Character(s) at the beginning, e.g. `$`, `(`, `“`, `¿`.
|
|||
|
> - **Suffix:** Character(s) at the end, e.g. `km`, `)`, `”`, `!`.
|
|||
|
> - **Infix:** Character(s) in between, e.g. `-`, `--`, `/`, `…`.
|
|||
|
|
|||
|
![Example of the tokenization process](../../images/tokenization.svg)
|
|||
|
|
|||
|
While punctuation rules are usually pretty general, tokenizer exceptions
|
|||
|
strongly depend on the specifics of the individual language. This is why each
|
|||
|
[available language](/usage/models#languages) has its own subclass like
|
|||
|
`English` or `German`, that loads in lists of hard-coded data and exception
|
|||
|
rules.
|