//- 💫 DOCS > USAGE > LINGUISTIC FEATURES > TOKENIZATION
p
    | Tokenization is the task of splitting a text into meaningful segments,
    | called #[em tokens]. The input to the tokenizer is a unicode text, and
    | the output is a #[+api("doc") #[code Doc]] object. To construct a
    | #[code Doc] object, you need a #[+api("vocab") #[code Vocab]] instance,
    | a sequence of #[code word] strings, and optionally a sequence of
    | #[code spaces] booleans, which allow you to maintain alignment of the
    | tokens into the original string.
include ../_spacy-101/_tokenization

+h(4, "101-data") Tokenizer data

p
    | #[strong Global] and #[strong language-specific] tokenizer data is
    | supplied via the language data in
    | #[+src(gh("spaCy", "spacy/lang")) #[code spacy/lang]].
    | The tokenizer exceptions define special cases like "don't" in English,
    | which needs to be split into two tokens: #[code {ORTH: "do"}] and
    | #[code {ORTH: "n't", LEMMA: "not"}]. The prefixes, suffixes and infixes
    | mostly define punctuation rules – for example, when to split off periods
    | (at the end of a sentence), and when to leave tokens containing periods
    | intact (abbreviations like "U.S.").
+graphic("/assets/img/language_data.svg")
    include ../../assets/img/language_data.svg

+infobox
    | For more details on the language-specific data, see the
    | usage guide on #[+a("/usage/adding-languages") adding languages].
+h(3, "special-cases") Adding special case tokenization rules

p
    | Most domains have at least some idiosyncrasies that require custom
    | tokenization rules. These could be very specific expressions, or
    | abbreviations only used in this specific field.
+aside("Language data vs. custom tokenization")
    | Tokenization rules that are specific to one language, but can be
    | #[strong generalised across that language] should ideally live in the
    | language data in #[+src(gh("spaCy", "spacy/lang")) #[code spacy/lang]] – we
    | always appreciate pull requests! Anything that's specific to a domain or
    | text type – like financial trading abbreviations, or Bavarian youth slang
    | – should be added as a special case rule to your tokenizer instance. If
    | you're dealing with a lot of customisations, it might make sense to create
    | an entirely custom subclass.
p
    | Here's how to add a special case rule to an existing
    | #[+api("tokenizer") #[code Tokenizer]] instance:
+code.
    import spacy
    from spacy.symbols import ORTH, LEMMA, POS

    nlp = spacy.load('en')
    doc = nlp(u'gimme that') # phrase to tokenize
    assert [w.text for w in doc] == [u'gimme', u'that'] # current tokenization

    # add special case rule
    special_case = [{ORTH: u'gim', LEMMA: u'give', POS: u'VERB'}, {ORTH: u'me'}]
    nlp.tokenizer.add_special_case(u'gimme', special_case)
    assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']
    # Pronoun lemma is returned as -PRON-!
    assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'-PRON-', u'that']
p
    | For details on spaCy's custom pronoun lemma #[code -PRON-],
    | #[+a("/usage/#pron-lemma") see here].
    | The special case doesn't have to match an entire whitespace-delimited
    | substring. The tokenizer will incrementally split off punctuation, and
    | keep looking up the remaining substring:
+code.
    assert 'gimme' not in [w.text for w in nlp(u'gimme!')]
    assert 'gimme' not in [w.text for w in nlp(u'("...gimme...?")')]
p
    | The special case rules have precedence over the punctuation splitting:

+code.
    from spacy.symbols import TAG  # not part of the earlier import line

    special_case = [{ORTH: u'...gimme...?', LEMMA: u'give', TAG: u'VB'}]
    nlp.tokenizer.add_special_case(u'...gimme...?', special_case)
    assert len(nlp(u'...gimme...?')) == 1
p
    | Because the special-case rules allow you to set arbitrary token
    | attributes, such as the part-of-speech tag or the lemma, they make a good
    | mechanism for arbitrary fix-up rules. Having this logic live in the
    | tokenizer isn't very satisfying from a design perspective, however, so
    | the API may eventually be exposed on the
    | #[+api("language") #[code Language]] class itself.
+h(3, "how-tokenizer-works") How spaCy's tokenizer works

p
    | spaCy introduces a novel tokenization algorithm that gives a better
    | balance between performance, ease of definition, and ease of alignment
    | into the original string.
p
    | After consuming a prefix or suffix, we consult the special cases again.
    | We want the special cases to handle things like "don't" in English, and
    | we want the same rule to work for "(don't)!". We do this by splitting
    | off the open bracket, then the exclamation, then the close bracket, and
    | finally matching the special-case. Here's an implementation of the
    | algorithm in Python, optimized for readability rather than performance:
+code.
    def tokenizer_pseudo_code(text, special_cases,
                              find_prefix, find_suffix, find_infixes):
        tokens = []
        for substring in text.split(' '):
            suffixes = []
            while substring:
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ''
                elif find_prefix(substring) is not None:
                    split = find_prefix(substring)
                    tokens.append(substring[:split])
                    substring = substring[split:]
                elif find_suffix(substring) is not None:
                    split = find_suffix(substring)
                    suffixes.append(substring[split:])
                    substring = substring[:split]
                elif find_infixes(substring):
                    infixes = find_infixes(substring)
                    offset = 0
                    for match in infixes:
                        tokens.append(substring[offset : match.start()])
                        tokens.append(substring[match.start() : match.end()])
                        offset = match.end()
                    substring = substring[offset:]
                else:
                    tokens.append(substring)
                    substring = ''
            tokens.extend(reversed(suffixes))
        return tokens
p
    | The algorithm can be summarized as follows:

+list("numbers")
    +item Iterate over space-separated substrings.
    +item
        | Check whether we have an explicitly defined rule for this substring.
        | If we do, use it.
    +item Otherwise, try to consume a prefix.
    +item
        | If we consumed a prefix, go back to the beginning of the loop, so
        | that special-cases always get priority.
    +item If we didn't consume a prefix, try to consume a suffix.
    +item
        | If we can't consume a prefix or suffix, look for "infixes", such as
        | hyphens.
    +item Once we can't consume any more of the string, handle it as a single token.
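p
    | To make the steps concrete, here is a small, self-contained run of the
    | same loop on "(don't)! well-known". The three helper functions and their
    | regular expressions are assumptions made up for this sketch (they are
    | not spaCy's real punctuation rules), and the special cases are plain
    | lists of strings rather than attribute dictionaries.

```python
import re

# Hypothetical punctuation rules, for illustration only.
PREFIX_RE = re.compile(r'''^[\[\("']''')
SUFFIX_RE = re.compile(r'''[\]\)"'!?.,]$''')
INFIX_RE = re.compile(r'''[-~]''')


def find_prefix(substring):
    # Return the length of the prefix to split off, or None.
    match = PREFIX_RE.search(substring)
    return match.end() if match else None


def find_suffix(substring):
    # Return the index where the suffix starts, or None.
    match = SUFFIX_RE.search(substring)
    return match.start() if match else None


def find_infixes(substring):
    # Return a (possibly empty) list of match objects.
    return list(INFIX_RE.finditer(substring))


def tokenizer_pseudo_code(text, special_cases,
                          find_prefix, find_suffix, find_infixes):
    tokens = []
    for substring in text.split(' '):
        suffixes = []
        while substring:
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ''
            elif find_prefix(substring) is not None:
                split = find_prefix(substring)
                tokens.append(substring[:split])
                substring = substring[split:]
            elif find_suffix(substring) is not None:
                split = find_suffix(substring)
                suffixes.append(substring[split:])
                substring = substring[:split]
            elif find_infixes(substring):
                offset = 0
                for match in find_infixes(substring):
                    tokens.append(substring[offset:match.start()])
                    tokens.append(substring[match.start():match.end()])
                    offset = match.end()
                substring = substring[offset:]
            else:
                tokens.append(substring)
                substring = ''
        # Suffixes were collected outside-in, so emit them in reverse.
        tokens.extend(reversed(suffixes))
    return tokens


special_cases = {"don't": ["do", "n't"]}
result = tokenizer_pseudo_code("(don't)! well-known", special_cases,
                               find_prefix, find_suffix, find_infixes)
print(result)
```

p
    | The bracket is split off as a prefix, the exclamation mark and closing
    | bracket are set aside as suffixes, and only then does "don't" match
    | the special case. The hyphen in "well-known" is handled by the infix
    | branch.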
+h(3, "native-tokenizers") Customizing spaCy's Tokenizer class

p
    | Let's imagine you wanted to create a tokenizer for a new language or
    | specific domain. There are five things you would need to define:
+list("numbers")
    +item
        | A dictionary of #[strong special cases]. This handles things like
        | contractions, units of measurement, emoticons, certain
        | abbreviations, etc.

    +item
        | A function #[code prefix_search], to handle
        | #[strong preceding punctuation], such as open quotes, open brackets,
        | etc.

    +item
        | A function #[code suffix_search], to handle
        | #[strong succeeding punctuation], such as commas, periods, close
        | quotes, etc.

    +item
        | A function #[code infix_finditer], to handle non-whitespace
        | separators, such as hyphens etc.

    +item
        | An optional boolean function #[code token_match] matching strings
        | that should never be split, overriding the previous rules.
        | Useful for things like URLs or numbers.
p
    | You shouldn't usually need to create a #[code Tokenizer] subclass.
    | Standard usage is to use #[code re.compile()] to build a regular
    | expression object, and pass its #[code .search()] and
    | #[code .finditer()] methods:
+code.
    import regex as re
    import spacy
    from spacy.tokenizer import Tokenizer

    prefix_re = re.compile(r'''^[\[\("']''')
    suffix_re = re.compile(r'''[\]\)"']$''')
    infix_re = re.compile(r'''[-~]''')
    simple_url_re = re.compile(r'''^https?://''')

    def custom_tokenizer(nlp):
        return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                    suffix_search=suffix_re.search,
                                    infix_finditer=infix_re.finditer,
                                    token_match=simple_url_re.match)

    nlp = spacy.load('en')
    nlp.tokenizer = custom_tokenizer(nlp)
p
    | If you need to subclass the tokenizer instead, the relevant methods to
    | specialize are #[code find_prefix], #[code find_suffix] and
    | #[code find_infix].
+infobox("Important note", "⚠️")
    | When customising the prefix, suffix and infix handling, remember that
    | you're passing in #[strong functions] for spaCy to execute, e.g.
    | #[code prefix_re.search] – not just the regular expressions. This means
    | that your functions also need to define how the rules should be applied.
    | For example, if you're adding your own prefix rules, you need
    | to make sure they're only applied to characters at the
    | #[strong beginning of a token], e.g. by adding #[code ^]. Similarly,
    | suffix rules should only be applied at the #[strong end of a token],
    | so your expression should end with a #[code $].
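p
    | The effect of the anchors can be checked with plain regular
    | expressions. The sketch below uses Python's built-in #[code re] module
    | and made-up example strings. The patterns reuse the character classes
    | from the example above, and the unanchored variant is added here only
    | for comparison.

```python
import re

# Same character class, with and without the anchor (illustrative only).
unanchored_prefix = re.compile(r'''[\[\("']''')
anchored_prefix = re.compile(r'''^[\[\("']''')
anchored_suffix = re.compile(r'''[\]\)"']$''')

text = 'say"hi'

# Without ^, .search() finds the quote in the middle of the string,
# so a tokenizer would treat it as a "prefix" at the wrong position.
mid_match = unanchored_prefix.search(text)
print(mid_match.start())

# With ^, the same string yields no prefix at all.
no_match = anchored_prefix.search(text)
print(no_match)

# Anchored patterns only fire at the edges of the string.
prefix_match = anchored_prefix.search('"hi')
suffix_match = anchored_suffix.search('hi)')
inner_bracket = anchored_suffix.search('h)i')
print(prefix_match.span(), suffix_match.span(), inner_bracket)
```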
+h(3, "custom-tokenizer") Hooking an arbitrary tokenizer into the pipeline

p
    | The tokenizer is the first component of the processing pipeline and the
    | only one that can't be replaced by writing to #[code nlp.pipeline]. This
    | is because it has a different signature from all the other components:
    | it takes a text and returns a #[code Doc], whereas all other components
    | expect to already receive a tokenized #[code Doc].
+graphic("/assets/img/pipeline.svg")
    include ../../assets/img/pipeline.svg
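p
    | The difference in signatures can be sketched without spaCy. Everything
    | in the block below is a hypothetical stand-in (#[code SimpleDoc],
    | #[code run_pipeline] and the two callables are invented for this
    | illustration), and it is a simplification of the idea, not spaCy's
    | actual implementation.

```python
class SimpleDoc(object):
    """Hypothetical stand-in for a Doc: a list of words plus one tag per word."""

    def __init__(self, words):
        self.words = words
        self.tags = [None] * len(words)


def whitespace_tokenizer(text):
    # Tokenizer signature: text in, document out.
    return SimpleDoc(text.split(' '))


def tag_everything_as_x(doc):
    # Component signature: document in, document out.
    doc.tags = ['X'] * len(doc.words)
    return doc


def run_pipeline(text, tokenizer, components):
    # The tokenizer runs exactly once, on the raw text...
    doc = tokenizer(text)
    # ...and every later component receives the already tokenized document.
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline('hello world', whitespace_tokenizer, [tag_everything_as_x])
print(doc.words, doc.tags)
```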
p
    | To overwrite the existing tokenizer, you need to replace
    | #[code nlp.tokenizer] with a custom function that takes a text, and
    | returns a #[code Doc].
+code.
    nlp = spacy.load('en')
    nlp.tokenizer = my_tokenizer
+table(["Argument", "Type", "Description"])
    +row
        +cell #[code text]
        +cell unicode
        +cell The raw text to tokenize.

    +row("foot")
        +cell returns
        +cell #[code Doc]
        +cell The tokenized document.
+infobox("Important note: using a custom tokenizer")
    .o-block
        | In spaCy v1.x, you had to add a custom tokenizer by passing it to the
        | #[code make_doc] keyword argument, or by passing a tokenizer "factory"
        | to #[code create_make_doc]. This was unnecessarily complicated. Since
        | spaCy v2.0, you can simply write to #[code nlp.tokenizer]. If your
        | tokenizer needs the vocab, you can write a function and use
        | #[code nlp.vocab].

    +code-new.
        nlp.tokenizer = my_tokenizer
        nlp.tokenizer = my_tokenizer_factory(nlp.vocab)
    +code-old.
        nlp = spacy.load('en', make_doc=my_tokenizer)
        nlp = spacy.load('en', create_make_doc=my_tokenizer_factory)
+h(3, "custom-tokenizer-example") Example: A custom whitespace tokenizer

p
    | To construct the tokenizer, we usually want attributes of the #[code nlp]
    | pipeline. Specifically, we want the tokenizer to hold a reference to the
    | vocabulary object. Let's say we have the following class as
    | our tokenizer:
+code.
    from spacy.tokens import Doc

    class WhitespaceTokenizer(object):
        def __init__(self, vocab):
            self.vocab = vocab

        def __call__(self, text):
            words = text.split(' ')
            # All tokens 'own' a subsequent space character in this tokenizer
            spaces = [True] * len(words)
            return Doc(self.vocab, words=words, spaces=spaces)
p
    | As you can see, we need a #[code Vocab] instance to construct this, but
    | we won't have it until we get back the loaded #[code nlp] object. The
    | simplest solution is to build the tokenizer in two steps. This also means
    | that you can reuse the "tokenizer factory" and initialise it with
    | different instances of #[code Vocab].
+code.
    nlp = spacy.load('en')
    nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
+h(3, "own-annotations") Bringing your own annotations

p
    | spaCy generally assumes by default that your data is raw text. However,
    | sometimes your data is partially annotated, e.g. with pre-existing
    | tokenization, part-of-speech tags, etc. The most common situation is
    | that you have pre-defined tokenization. If you have a list of strings,
    | you can create a #[code Doc] object directly. Optionally, you can also
    | specify a list of boolean values, indicating whether each word has a
    | subsequent space.

+code.
    doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
p
    | If provided, the spaces list must be the same length as the words list.
    | The spaces list affects the #[code doc.text], #[code span.text],
    | #[code token.idx], #[code span.start_char] and #[code span.end_char]
    | attributes. If you don't provide a #[code spaces] sequence, spaCy will
    | assume that all words are whitespace delimited.
+code.
    good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
    bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
    assert bad_spaces.text == u'Hello , world !'
    assert good_spaces.text == u'Hello, world!'
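p
    | To see why the spaces list matters for character offsets, the following
    | plain-Python sketch rebuilds a text and the start offset of each word
    | from the two lists. The helper #[code text_and_offsets] is hypothetical
    | and only mirrors the idea behind #[code doc.text] and
    | #[code token.idx]. It does not use or reproduce spaCy's own code.

```python
def text_and_offsets(words, spaces):
    """Join words using the spaces flags and record where each word starts."""
    if len(words) != len(spaces):
        raise ValueError('words and spaces must have the same length')
    pieces = []
    offsets = []
    position = 0
    for word, has_space in zip(words, spaces):
        offsets.append(position)
        # A word "owns" the single space that follows it, if any.
        piece = word + (' ' if has_space else '')
        pieces.append(piece)
        position += len(piece)
    return ''.join(pieces), offsets


words = ['Hello', ',', 'world', '!']
text, offsets = text_and_offsets(words, [False, True, False, False])
print(text)
print(offsets)
```

p
    | With the correct flags, every offset points at the first character of
    | its word in the reconstructed string, which is what keeps tokens aligned
    | with the original text.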
p
    | Once you have a #[+api("doc") #[code Doc]] object, you can write to its
    | attributes to set the part-of-speech tags, syntactic dependencies, named
    | entities and other attributes. For details, see the respective usage
    | pages.