spaCy/website/docs/usage/_spacy-101/_tokenization.jade

//- 💫 DOCS > USAGE > SPACY 101 > TOKENIZATION
p
    | During processing, spaCy first #[strong tokenizes] the text, i.e.
    | segments it into words, punctuation and so on. This is done by applying
    | rules specific to each language. For example, punctuation at the end of a
    | sentence should be split off, whereas "U.K." should remain one token.
    | Each #[code Doc] consists of individual tokens, and we can simply iterate
    | over them:
+code.
    for token in doc:
        print(token.text)
+table([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).u-text-center
    +row
        for cell in ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "for", "$", "1", "billion"]
            +cell=cell
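
p
    | The idea described above (split off punctuation, but keep exceptions
    | such as "U.K." whole) can be sketched in plain Python. This is a
    | simplified illustration only, not spaCy's actual tokenizer: the function
    | #[code toy_tokenize] and the #[code EXCEPTIONS], #[code PREFIXES] and
    | #[code SUFFIXES] values are hypothetical and only cover this example.

```python
# Toy rule-based tokenizer. All names and rule sets here are hypothetical
# and chosen for illustration; spaCy's real rules are language-specific
# and far more extensive.

# Strings that must stay a single token even though they contain punctuation.
EXCEPTIONS = {"U.K.", "U.S."}
# Characters split off from the start or end of a whitespace-delimited chunk.
PREFIXES = "$(\"'"
SUFFIXES = ".,!?)\"'"


def toy_tokenize(text):
    """Split on whitespace, then peel punctuation off each chunk."""
    tokens = []
    for chunk in text.split():
        trailing = []
        # Keep peeling until the chunk is a known exception or nothing is left to split.
        while chunk and chunk not in EXCEPTIONS:
            if len(chunk) > 1 and chunk[0] in PREFIXES:
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif len(chunk) > 1 and chunk[-1] in SUFFIXES:
                trailing.append(chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        if chunk:
            tokens.append(chunk)
        # Suffixes were collected from the outside in, so restore their order.
        tokens.extend(reversed(trailing))
    return tokens


tokens = toy_tokenize("Apple is looking at buying U.K. startup for $1 billion")
for token in tokens:
    print(token)
```

p
    | With these assumed rules, the sentence yields the same eleven tokens as
    | the table above: "U.K." matches an exception and is left alone, while
    | "$1" is split into "$" and "1".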