diff --git a/website/assets/img/docs/tokenization.svg b/website/assets/img/docs/tokenization.svg
new file mode 100644
index 000000000..cc185a3a7
--- /dev/null
+++ b/website/assets/img/docs/tokenization.svg
@@ -0,0 +1,123 @@
[123 lines of SVG markup: a step-by-step diagram tokenizing “Let’s go to N.Y.!”, applying EXCEPTION, PREFIX and SUFFIX rules to each substring until everything is marked DONE]
diff --git a/website/docs/usage/_spacy-101/_tokenization.jade b/website/docs/usage/_spacy-101/_tokenization.jade
index 64e3f5881..95a9cc520 100644
--- a/website/docs/usage/_spacy-101/_tokenization.jade
+++ b/website/docs/usage/_spacy-101/_tokenization.jade
@@ -16,3 +16,47 @@ p
     +row
         for cell in ["Apple", "is", "looking", "at", "buying", "U.K.", "startup", "for", "$", "1", "billion"]
             +cell=cell
+
+p
+    | First, the raw text is split on whitespace characters, similar to
+    | #[code text.split(' ')]. Then, the tokenizer processes the text from
+    | left to right. On each substring, it performs two checks:
+
++list("numbers")
+    +item
+        | #[strong Does the substring match a tokenizer exception rule?] For
+        | example, "don't" does not contain whitespace, but should be split
+        | into two tokens, "do" and "n't", while "U.K." should always
+        | remain one token.
+    +item
+        | #[strong Can a prefix, suffix or infix be split off?] For example,
+        | punctuation like commas, periods, hyphens or quotes.
+
+p
+    | If there's a match, the rule is applied and the tokenizer continues its
+    | loop, starting with the newly split substrings. This way, spaCy can split
+    | #[strong complex, nested tokens] like combinations of abbreviations and
+    | multiple punctuation marks.
+
++aside
+    | #[strong Tokenizer exception:] Special-case rule to split a string into
+    | several tokens or prevent a token from being split when punctuation rules
+    | are applied.#[br]
+    | #[strong Prefix:] Character(s) at the beginning, e.g.
+    | #[code $], #[code (], #[code “], #[code ¿].#[br]
+    | #[strong Suffix:] Character(s) at the end, e.g.
+    | #[code km], #[code )], #[code ”], #[code !].#[br]
+    | #[strong Infix:] Character(s) in between, e.g.
+    | #[code -], #[code --], #[code /], #[code …].#[br]
+
++image
+    include ../../../assets/img/docs/tokenization.svg
+    .u-text-right
+        +button("/assets/img/docs/tokenization.svg", false, "secondary").u-text-tag View large graphic
+
+p
+    | While punctuation rules are usually pretty general, tokenizer exceptions
+    | strongly depend on the specifics of the individual language. This is
+    | why each #[+a("/docs/api/language-models") available language] has its
+    | own subclass like #[code English] or #[code German] that loads in lists
+    | of hard-coded data and exception rules.
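Note, not part of the patch: a quick way to sanity-check the behaviour documented above is to run the default English tokenizer on the string from the new graphic. This is only a sketch; it assumes an English model is installed and linked as 'en', and the exact load call may differ between spaCy versions.

    # Illustrative only: tokenize the example string from tokenization.svg.
    # Assumes an installed English model linked as 'en' (adjust for your setup).
    import spacy

    nlp = spacy.load('en')
    doc = nlp(u"“Let’s go to N.Y.!”")
    print([token.text for token in doc])
    # roughly: ['“', 'Let', '’s', 'go', 'to', 'N.Y.', '!', '”']

As in the graphic, the opening quote is split off as a prefix, the closing quote and "!" as suffixes, "Let" / "’s" via an exception rule, and "N.Y." stays a single token.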
diff --git a/website/docs/usage/spacy-101.jade b/website/docs/usage/spacy-101.jade
index 7c6525004..8b2d0c17e 100644
--- a/website/docs/usage/spacy-101.jade
+++ b/website/docs/usage/spacy-101.jade
@@ -94,9 +94,10 @@ p
 include _spacy-101/_tokenization
 
 +infobox
-    | To learn more about how spaCy's tokenizer and its rules work in detail,
-    | how to #[strong customise] it and how to #[strong add your own tokenizer]
-    | to a processing pipeline, see the usage guide on
+    | To learn more about how spaCy's tokenization rules work in detail,
+    | how to #[strong customise and replace] the default tokenizer and how to
+    | #[strong add language-specific data], see the usage guides on
+    | #[+a("/docs/usage/adding-languages") adding languages] and
     | #[+a("/docs/usage/customizing-tokenizer") customising the tokenizer].
 
 +h(3, "annotations-pos-deps") Part-of-speech tags and dependencies
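Note, not part of the patch: in practice, "customise" in the updated infobox often just means registering a special-case rule at runtime. A rough sketch using spaCy's add_special_case API, with a made-up example string; the load call again depends on your installed model and spaCy version:

    # Illustrative only: register a tokenizer exception at runtime.
    import spacy
    from spacy.attrs import ORTH

    nlp = spacy.load('en')
    nlp.tokenizer.add_special_case(u"gimme", [{ORTH: u"gim"}, {ORTH: u"me"}])
    print([t.text for t in nlp(u"gimme that")])   # expected: ['gim', 'me', 'that']

Full tokenizer replacement and language-specific exception data are covered in the two usage guides linked above.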