spaCy/website/usage/_spacy-101/_tokenization.jade
//- 💫 DOCS > USAGE > SPACY 101 > TOKENIZATION

p
    | During processing, spaCy first #[strong tokenizes] the text, i.e.
    | segments it into words, punctuation and so on. This is done by applying
    | rules specific to each language. For example, punctuation at the end of
    | a sentence should be split off, whereas "U.K." should remain one token.
    | Each #[code Doc] consists of individual tokens, and we can simply
    | iterate over them:

+code-exec.
    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
    for token in doc:
        print(token.text)

+table([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).u-text-center
    +row
        for cell in ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "for", "$", "1", "billion"]
            +cell=cell

p
    | First, the raw text is split on whitespace characters, similar to
    | #[code text.split(' ')]. Then, the tokenizer processes the text from
    | left to right. On each substring, it performs two checks:

+list("numbers")
    +item
        | #[strong Does the substring match a tokenizer exception rule?] For
        | example, "don't" does not contain whitespace, but should be split
        | into two tokens, "do" and "n't", while "U.K." should always
        | remain one token.
    +item
        | #[strong Can a prefix, suffix or infix be split off?] For example,
        | punctuation like commas, periods, hyphens or quotes.

p
    | If there's a match, the rule is applied and the tokenizer continues its
    | loop, starting with the newly split substrings. This way, spaCy can
    | split #[strong complex, nested tokens] like combinations of
    | abbreviations and multiple punctuation marks.

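p
    | For instance, a string like "(Don't!)" is taken apart step by step: the
    | opening bracket is split off as a prefix, the closing bracket and the
    | exclamation mark as suffixes, and the remaining "Don't" is handled by an
    | exception rule. The following is only a rough sketch and assumes the
    | small #[code en_core_web_sm] model is installed:

+code.
    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u"(Don't!)")
    # prefix, exception-split contraction and suffixes, roughly:
    # ['(', 'Do', "n't", '!', ')']
    print([token.text for token in doc])
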
+aside
    | #[strong Tokenizer exception:] Special-case rule to split a string into
    | several tokens or prevent a token from being split when punctuation
    | rules are applied.#[br]
    | #[strong Prefix:] Character(s) at the beginning, e.g.
    | #[code $], #[code (], #[code “], #[code ¿].#[br]
    | #[strong Suffix:] Character(s) at the end, e.g.
    | #[code km], #[code )], #[code ”], #[code !].#[br]
    | #[strong Infix:] Character(s) in between, e.g.
    | #[code -], #[code --], #[code /], #[code …].

+graphic("/assets/img/tokenization.svg")
    include ../../assets/img/tokenization.svg

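p
    | An example similar to the one shown in the graphic can be tried out in
    | code. As a rough sketch (again assuming #[code en_core_web_sm] is
    | installed), the surrounding quotes are split off as prefix and suffix,
    | the contraction "Let's" is handled by an exception rule and "N.Y."
    | stays one token:

+code.
    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'"Let\'s go to N.Y.!"')
    # expected, roughly: ['"', 'Let', "'s", 'go', 'to', 'N.Y.', '!', '"']
    print([token.text for token in doc])
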
p
    | While punctuation rules are usually pretty general, tokenizer exceptions
    | strongly depend on the specifics of the individual language. This is
    | why each #[+a("/usage/models#languages") available language] has its
    | own subclass, like #[code English] or #[code German], which loads in
    | lists of hard-coded data and exception rules.

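p
    | These exception lists can also be extended at runtime via
    | #[code nlp.tokenizer.add_special_case]. The sketch below is only an
    | illustration: "gimme" is not a built-in exception, and the split into
    | "gim" and "me" is a made-up special case.

+code.
    import spacy
    from spacy.attrs import ORTH

    nlp = spacy.load('en_core_web_sm')
    # add a custom special case so "gimme" is split into two tokens
    nlp.tokenizer.add_special_case(u'gimme', [{ORTH: u'gim'}, {ORTH: u'me'}])
    print([token.text for token in nlp(u'gimme that')])  # ['gim', 'me', 'that']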