//- 💫 DOCS > USAGE > ADDING LANGUAGES > LANGUAGE DATA

p
    |  The individual components #[strong expose variables] that can be imported
    |  within a language module, and added to the language's #[code Defaults].
    |  Some components, like the punctuation rules, usually don't need much
    |  customisation and can simply be imported from the global rules. Others,
    |  like the tokenizer and norm exceptions, are very specific and will make
    |  a big difference to spaCy's performance on the particular language and
    |  when training a language model.

+table(["Variable", "Type", "Description"])
    +row
        +cell #[code STOP_WORDS]
        +cell set
        +cell Individual words.

    +row
        +cell #[code TOKENIZER_EXCEPTIONS]
        +cell dict
        +cell Keyed by strings, mapped to a list of one dict per token with token attributes.

    +row
        +cell #[code TOKEN_MATCH]
        +cell regex
        +cell Regexes to match complex tokens, e.g. URLs.

    +row
        +cell #[code NORM_EXCEPTIONS]
        +cell dict
        +cell Keyed by strings, mapped to their norms.

    +row
        +cell #[code TOKENIZER_PREFIXES]
        +cell list
        +cell Strings or regexes, usually not customised.

    +row
        +cell #[code TOKENIZER_SUFFIXES]
        +cell list
        +cell Strings or regexes, usually not customised.

    +row
        +cell #[code TOKENIZER_INFIXES]
        +cell list
        +cell Strings or regexes, usually not customised.

    +row
        +cell #[code LEX_ATTRS]
        +cell dict
        +cell Attribute ID mapped to function.

    +row
        +cell #[code SYNTAX_ITERATORS]
        +cell dict
        +cell
            |  Iterator ID mapped to function. Currently only supports
            |  #[code 'noun_chunks'].

    +row
        +cell #[code LOOKUP]
        +cell dict
        +cell Keyed by strings, mapped to their lemma.

    +row
        +cell #[code LEMMA_RULES], #[code LEMMA_INDEX], #[code LEMMA_EXC]
        +cell dict
        +cell Lemmatization rules, keyed by part of speech.

    +row
        +cell #[code TAG_MAP]
        +cell dict
        +cell
            |  Keyed by strings, mapped to
            |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
            |  tags.

    +row
        +cell #[code MORPH_RULES]
        +cell dict
        +cell Keyed by strings, mapped to a dict of their morphological features.

+aside("Should I ever update the global data?")
    |  Reusable language data is collected as atomic pieces in the root of the
    |  #[+src(gh("spaCy", "lang")) #[code spacy.lang]] package. Often, when a new
    |  language is added, you'll find a pattern or symbol that's missing. Even
    |  if it isn't common in other languages, it might be best to add it to the
    |  shared language data, unless it has some conflicting interpretation. For
    |  instance, we don't expect to see guillemet quotation symbols
    |  (#[code »] and #[code «]) in English text. But if we do see
    |  them, we'd probably prefer the tokenizer to split them off.

+infobox("For languages with non-Latin characters")
    |  In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
    |  needs to know the language's character set. If the language you're adding
    |  uses non-Latin characters, you might need to add the required character
    |  classes to the global
    |  #[+src(gh("spacy", "spacy/lang/char_classes.py")) #[code char_classes.py]].
    |  spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
    |  to keep this simple and readable. If the language requires very specific
    |  punctuation rules, you should consider overwriting the default regular
    |  expressions with your own in the language's #[code Defaults].


+h(3, "language-subclass") Creating a #[code Language] subclass

p
    |  Language-specific code and resources should be organised into a
    |  subpackage of spaCy, named according to the language's
    |  #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code].
    |  For instance, code and resources specific to Spanish are placed into a
    |  directory #[code spacy/lang/es], which can be imported as
    |  #[code spacy.lang.es].

p
    |  To get started, you can use our
    |  #[+src(gh("spacy-dev-resources", "templates/new_language")) templates]
    |  for the most important files. Here's what the class template looks like:

+code("__init__.py (excerpt)").
    # import language-specific data
    from .stop_words import STOP_WORDS
    from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
    from .lex_attrs import LEX_ATTRS

    from ..tokenizer_exceptions import BASE_EXCEPTIONS
    from ...language import Language
    from ...attrs import LANG
    from ...util import update_exc

    # create Defaults class in the module scope (necessary for pickling!)
    class XxxxxDefaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'xx' # language ISO code

        # optional: replace flags with custom functions, e.g. like_num()
        lex_attr_getters.update(LEX_ATTRS)

        # merge base exceptions and custom tokenizer exceptions
        tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
        stop_words = set(STOP_WORDS)

    # create actual Language class
    class Xxxxx(Language):
        lang = 'xx' # language ISO code
        Defaults = XxxxxDefaults # override defaults

    # set default export – this allows the language class to be lazy-loaded
    __all__ = ['Xxxxx']

+infobox("Why lazy-loading?")
    |  Some languages contain large volumes of custom data, like lemmatizer
    |  lookup tables, or complex regular expressions that are expensive to
    |  compute. As of spaCy v2.0, #[code Language] classes are not imported on
    |  initialisation and are only loaded when you import them directly, or load
    |  a model that requires a language to be loaded. To lazy-load languages in
    |  your application, you can use the
    |  #[+api("util#get_lang_class") #[code util.get_lang_class()]] helper
    |  function with the two-letter language code as its argument.

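p
    |  For example, here's a minimal sketch of lazy-loading in action, assuming
    |  the placeholder ISO code #[code 'xx'] used in the template above:

+code("Lazy-loading (sketch)").
    from spacy.util import get_lang_class

    lang_cls = get_lang_class('xx')  # resolve the Language subclass by ISO code
    nlp = lang_cls()                 # instantiate it without loading a model
    doc = nlp(u"This only uses the language data and tokenizer.")
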
+h(3, "stop-words") Stop words

p
    |  A #[+a("https://en.wikipedia.org/wiki/Stop_words") "stop list"] is a
    |  classic trick from the early days of information retrieval when search
    |  was largely about keyword presence and absence. It is still sometimes
    |  useful today to filter out common words from a bag-of-words model. To
    |  improve readability, #[code STOP_WORDS] are separated by spaces and
    |  newlines, and added as a multiline string.

+aside("What does spaCy consider a stop word?")
    |  There's no particularly principled logic behind what words should be
    |  added to the stop list. Make a list that you think might be useful
    |  to people and is likely to be unsurprising. As a rule of thumb, words
    |  that are very rare are unlikely to be useful stop words.

+code("Example").
    STOP_WORDS = set("""
    a about above across after afterwards again against all almost alone along
    already also although always am among amongst amount an and another any anyhow
    anyone anything anyway anywhere are around as at

    back be became because become becomes becoming been before beforehand behind
    being below beside besides between beyond both bottom but by
    """.split())

+infobox("Important note")
    |  When adding stop words from an online source, always #[strong include the link]
    |  in a comment. Make sure to #[strong proofread] and double-check the words
    |  carefully. A lot of the lists available online have been passed around
    |  for years and often contain mistakes, like unicode errors or random words
    |  that were once added for a specific use case, but don't actually
    |  qualify.

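p
    |  A quick way to double-check your list is to import it and test a few
    |  entries directly – for example, using the English stop words:

+code("Quick check (sketch)").
    from spacy.lang.en.stop_words import STOP_WORDS

    assert "afterwards" in STOP_WORDS        # listed in the example above
    assert "unsurprising" not in STOP_WORDS  # regular content word
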
+h(3, "tokenizer-exceptions") Tokenizer exceptions

p
    |  spaCy's #[+a("/usage/linguistic-features#how-tokenizer-works") tokenization algorithm]
    |  lets you deal with whitespace-delimited chunks separately. This makes it
    |  easy to define special-case rules, without worrying about how they
    |  interact with the rest of the tokenizer. Whenever the key string is
    |  matched, the special-case rule is applied, giving the defined sequence of
    |  tokens. You can also attach attributes to the subtokens covered by your
    |  special case, such as the subtokens' #[code LEMMA] or #[code TAG].

p
    |  Tokenizer exceptions can be added in the following format:

+code("tokenizer_exceptions.py (excerpt)").
    TOKENIZER_EXCEPTIONS = {
        "don't": [
            {ORTH: "do", LEMMA: "do"},
            {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
    }

+infobox("Important note")
    |  If an exception consists of more than one token, the #[code ORTH] values
    |  combined always need to #[strong match the original string]. The way the
    |  original string is split up can be pretty arbitrary sometimes – for
    |  example "gonna" is split into "gon" (lemma "go") and "na" (lemma "to").
    |  Because of how the tokenizer works, it's currently not possible to split
    |  single-letter strings into multiple tokens.

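p
    |  For illustration, the "gonna" case mentioned above can be written as a
    |  regular exception entry – a minimal sketch, with the #[code NORM] values
    |  chosen here for demonstration:

+code("Example (sketch)").
    TOKENIZER_EXCEPTIONS = {
        "gonna": [
            {ORTH: "gon", LEMMA: "go", NORM: "going"},
            {ORTH: "na", LEMMA: "to", NORM: "to"}]
    }
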
p
    |  Unambiguous abbreviations, like month names or locations in English,
    |  should be added to exceptions with a lemma assigned, for example
    |  #[code {ORTH: "Jan.", LEMMA: "January"}]. Since the exceptions are
    |  added in Python, you can use custom logic to generate them more
    |  efficiently and make your data less verbose. How you do this ultimately
    |  depends on the language. Here's an example of how exceptions for time
    |  formats like "1a.m." and "1am" are generated in the English
    |  #[+src(gh("spaCy", "spacy/lang/en/tokenizer_exceptions.py")) #[code tokenizer_exceptions.py]]:

+code("tokenizer_exceptions.py (excerpt)").
    # use short, internal variable for readability
    _exc = {}

    for h in range(1, 12 + 1):
        for period in ["a.m.", "am"]:
            # always keep an eye on string interpolation!
            _exc["%d%s" % (h, period)] = [
                {ORTH: "%d" % h},
                {ORTH: period, LEMMA: "a.m."}]
        for period in ["p.m.", "pm"]:
            _exc["%d%s" % (h, period)] = [
                {ORTH: "%d" % h},
                {ORTH: period, LEMMA: "p.m."}]

    # only declare this at the bottom
    TOKENIZER_EXCEPTIONS = dict(_exc)

+aside("Generating tokenizer exceptions")
    |  Keep in mind that generating exceptions only makes sense if there's a
    |  clearly defined and #[strong finite number] of them, like common
    |  contractions in English. This is not always the case – in Spanish, for
    |  instance, infinitive or imperative reflexive verbs and pronouns are one
    |  token (e.g. "vestirme"). In cases like this, spaCy shouldn't be
    |  generating exceptions for #[em all verbs]. Instead, this will be handled
    |  at a later stage during lemmatization.

p
    |  When adding the tokenizer exceptions to the #[code Defaults], you can use
    |  the #[+api("util#update_exc") #[code update_exc()]] helper function to merge
    |  them with the global base exceptions (including one-letter abbreviations
    |  and emoticons). The function performs a basic check to make sure
    |  exceptions are provided in the correct format. It can take any number of
    |  exceptions dicts as its arguments, and will update and overwrite the
    |  exceptions in this order. For example, if your language's tokenizer
    |  exceptions include a custom tokenization pattern for "a.", it will
    |  overwrite the base exceptions with the language's custom one.

+code("Example").
    from ...util import update_exc

    BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
    TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}

    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    # {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}

+infobox("About spaCy's custom pronoun lemma")
    |  Unlike verbs and common nouns, there's no clear base form of a personal
    |  pronoun. Should the lemma of "me" be "I", or should we normalize person
    |  as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
    |  novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
    |  all personal pronouns.

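p
    |  In the tokenizer exceptions, this simply means assigning the string
    |  #[code.u-nowrap -PRON-] as the lemma of the pronoun token – for instance,
    |  a hypothetical entry for "I'm" could look like this:

+code("Example (sketch)").
    TOKENIZER_EXCEPTIONS = {
        "I'm": [
            {ORTH: "I", LEMMA: "-PRON-", NORM: "i"},
            {ORTH: "'m", LEMMA: "be", NORM: "am"}]
    }
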
+h(3, "norm-exceptions") Norm exceptions

p
    |  In addition to #[code ORTH] or #[code LEMMA], tokenizer exceptions can
    |  also set a #[code NORM] attribute. This is useful to specify a normalised
    |  version of the token – for example, the norm of "n't" is "not". By default,
    |  a token's norm equals its lowercase text. If the lowercase spelling of a
    |  word exists, norms should always be in lowercase.

+aside-code("Norms vs. lemmas").
    doc = nlp(u"I'm gonna realise")
    norms = [token.norm_ for token in doc]
    lemmas = [token.lemma_ for token in doc]
    assert norms == ['i', 'am', 'going', 'to', 'realize']
    assert lemmas == ['i', 'be', 'go', 'to', 'realise']

p
    |  spaCy usually tries to normalise words with different spellings to a single,
    |  common spelling. This has no effect on any other token attributes, or
    |  tokenization in general, but it ensures that
    |  #[strong equivalent tokens receive similar representations]. This can
    |  improve the model's predictions on words that weren't common in the
    |  training data, but are equivalent to other words – for example, "realize"
    |  and "realise", or "thx" and "thanks".

p
    |  Similarly, spaCy also includes
    |  #[+src(gh("spaCy", "spacy/lang/norm_exceptions.py")) global base norms]
    |  for normalising different styles of quotation marks and currency
    |  symbols. Even though #[code $] and #[code €] are very different, spaCy
    |  normalises them both to #[code $]. This way, they'll always be seen as
    |  similar, no matter how common they were in the training data.

p
    |  Norm exceptions can be provided as a simple dictionary. For more examples,
    |  see the English
    |  #[+src(gh("spaCy", "spacy/lang/en/norm_exceptions.py")) #[code norm_exceptions.py]].

+code("Example").
    NORM_EXCEPTIONS = {
        "cos": "because",
        "fav": "favorite",
        "accessorise": "accessorize",
        "accessorised": "accessorized"
    }

p
    |  To add the custom norm exceptions lookup table, you can use the
    |  #[code add_lookups()] helper function. It takes the default attribute
    |  getter function as its first argument, plus a variable list of
    |  dictionaries. If a string's norm is found in one of the dictionaries,
    |  that value is used – otherwise, the default function is called and the
    |  token is assigned its default norm.

+code.
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
                                         NORM_EXCEPTIONS, BASE_NORMS)

p
    |  The order of the dictionaries is also the lookup order – so if your
    |  language's norm exceptions overwrite any of the global exceptions, they
    |  should be added first. Also note that the tokenizer exceptions will
    |  always have priority over the attribute getters.

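p
    |  Once the defaults are wired up, the effect is easy to verify – a quick
    |  sketch, assuming the norm exceptions above are used by the hypothetical
    |  #[code Xxxxx] language class from the template:

+code("Usage (sketch)").
    nlp = Xxxxx()                   # the placeholder Language subclass
    doc = nlp(u"cos it was my fav")
    assert doc[0].norm_ == 'because'
    assert doc[4].norm_ == 'favorite'
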
+h(3, "lex-attrs") Lexical attributes

p
    |  spaCy provides a range of #[+api("token#attributes") #[code Token] attributes]
    |  that return useful information on that token – for example, whether it's
    |  uppercase or lowercase, a left or right punctuation mark, or whether it
    |  resembles a number or email address. Most of these functions, like
    |  #[code is_lower] or #[code like_url], should be language-independent.
    |  Others, like #[code like_num] (which includes both digits and number
    |  words), require some customisation.

+aside("Best practices")
    |  Keep in mind that those functions are only intended to be an approximation.
    |  It's always better to prioritise simplicity and performance over covering
    |  very specific edge cases.#[br]#[br]
    |  English number words are pretty simple, because even large numbers
    |  consist of individual tokens, and we can get away with splitting and
    |  matching strings against a list. In other languages, like German, "two
    |  hundred and thirty-four" is one word, and thus one token. Here, it's best
    |  to match a string against a list of number word fragments (instead of a
    |  technically almost infinite list of possible number words).

p
    |  Here's an example from the English
    |  #[+src(gh("spaCy", "spacy/lang/en/lex_attrs.py")) #[code lex_attrs.py]]:

+code("lex_attrs.py").
    _num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
                  'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
                  'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
                  'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
                  'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
                  'gajillion', 'bazillion']

    def like_num(text):
        text = text.replace(',', '').replace('.', '')
        if text.isdigit():
            return True
        if text.count('/') == 1:
            num, denom = text.split('/')
            if num.isdigit() and denom.isdigit():
                return True
        if text in _num_words:
            return True
        return False

    LEX_ATTRS = {
        LIKE_NUM: like_num
    }

p
    |  By updating the default lexical attributes with a custom #[code LEX_ATTRS]
    |  dictionary in the language's defaults via
    |  #[code lex_attr_getters.update(LEX_ATTRS)], only the new custom functions
    |  are overwritten – all other default getters are kept as they are.

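p
    |  A simple way to test a lexical attribute function is to call it on a few
    |  strings directly – for example, with the #[code like_num()] function
    |  defined above:

+code("Quick check (sketch)").
    assert like_num('eleven') is True
    assert like_num('1,000,000') is True
    assert like_num('3/4') is True
    assert like_num('elevenish') is False
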
+h(3, "syntax-iterators") Syntax iterators

p
    |  Syntax iterators are functions that compute views of a #[code Doc]
    |  object based on its syntax. At the moment, this data is only used for
    |  extracting
    |  #[+a("/usage/linguistic-features#noun-chunks") noun chunks], which
    |  are available as the #[+api("doc#noun_chunks") #[code Doc.noun_chunks]]
    |  property. Because base noun phrases work differently across languages,
    |  the rules to compute them are part of the individual language's data. If
    |  a language does not include a noun chunks iterator, the property won't
    |  be available. For examples, see the existing syntax iterators:

+aside-code("Noun chunks example").
    doc = nlp(u'A phrase with another phrase occurs.')
    chunks = list(doc.noun_chunks)
    assert chunks[0].text == "A phrase"
    assert chunks[1].text == "another phrase"

+table(["Language", "Code", "Source"])
    for lang in ["en", "de", "fr", "es"]
        +row
            +cell=LANGUAGES[lang]
            +cell #[code=lang]
            +cell
                +src(gh("spaCy", "spacy/lang/" + lang + "/syntax_iterators.py"))
                    code lang/#{lang}/syntax_iterators.py

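p
    |  The #[code SYNTAX_ITERATORS] variable itself is just a dictionary mapping
    |  the iterator ID to the function – a minimal sketch, assuming a
    |  #[code noun_chunks()] generator is defined in the language's
    |  #[code syntax_iterators.py]:

+code("syntax_iterators.py (sketch)").
    def noun_chunks(obj):
        # yield (start, end, label) triples for each base noun phrase –
        # the actual logic depends on the language's dependency labels
        ...

    SYNTAX_ITERATORS = {
        'noun_chunks': noun_chunks
    }
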
+h(3, "lemmatizer") Lemmatizer

p
    |  As of v2.0, spaCy supports simple lookup-based lemmatization. This is
    |  usually the quickest and easiest way to get started. The data is stored
    |  in a dictionary mapping a string to its lemma. To determine a token's
    |  lemma, spaCy simply looks it up in the table. Here's an example from
    |  the Spanish language data:

+code("lang/es/lemmatizer.py (excerpt)").
    LOOKUP = {
        "aba": "abar",
        "ababa": "abar",
        "ababais": "abar",
        "ababan": "abar",
        "ababanes": "ababán",
        "ababas": "abar",
        "ababoles": "ababol",
        "ababábites": "ababábite"
    }

p
    |  To provide a lookup lemmatizer for your language, import the lookup table
    |  and add it to the #[code Language] class as #[code lemma_lookup]:

+code.
    lemma_lookup = dict(LOOKUP)

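p
    |  In context, this usually means adding one line to the language's
    |  #[code Defaults] in its #[code __init__.py] – a minimal sketch, reusing
    |  the placeholder class from the template above:

+code("__init__.py (sketch)").
    from .lemmatizer import LOOKUP

    class XxxxxDefaults(Language.Defaults):
        # ... tokenizer_exceptions, stop_words etc. as shown above ...
        lemma_lookup = dict(LOOKUP)
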
+h(3, "tag-map") Tag map

p
    |  Most treebanks define a custom part-of-speech tag scheme, striking a
    |  balance between level of detail and ease of prediction. While it's
    |  useful to have custom tagging schemes, it's also useful to have a common
    |  scheme, to which the more specific tags can be related. The tagger can
    |  learn a tag scheme with any arbitrary symbols. However, you need to
    |  define how those symbols map down to the
    |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies tag set].
    |  This is done by providing a tag map.

p
    |  The keys of the tag map should be #[strong strings in your tag set]. The
    |  values should be a dictionary. The dictionary must have an entry
    |  #[code POS] whose value is one of the
    |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
    |  tags. Optionally, you can also include morphological features or other
    |  token attributes in the tag map. This allows you to do simple
    |  #[+a("/usage/linguistic-features#rule-based-morphology") rule-based morphological analysis].

+code("Example").
    from ..symbols import POS, NOUN, VERB, DET

    TAG_MAP = {
        "NNS":  {POS: NOUN, "Number": "plur"},
        "VBG":  {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
        "DT":   {POS: DET}
    }

+h(3, "morph-rules") Morph rules

p
    |  The morphology rules let you set token attributes such as lemmas, keyed
    |  by the extended part-of-speech tag and token text. The morphological
    |  features and their possible values are language-specific and based on the
    |  #[+a("http://universaldependencies.org") Universal Dependencies scheme].

+code("Example").
    from ..symbols import LEMMA

    MORPH_RULES = {
        "VBZ": {
            "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
            "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
            "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
            "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
            "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
        }
    }

p
    |  In the example of #[code "am"], the attributes look like this:

+table(["Attribute", "Description"])
    +row
        +cell #[code LEMMA: "be"]
        +cell Base form, e.g. "to be".

    +row
        +cell #[code "VerbForm": "Fin"]
        +cell
            |  Finite verb. Finite verbs have a subject and can be the root of
            |  an independent clause – "I am." is a valid, complete
            |  sentence.

    +row
        +cell #[code "Person": "One"]
        +cell First person, i.e. "#[strong I] am".

    +row
        +cell #[code "Tense": "Pres"]
        +cell
            |  Present tense, i.e. actions that are happening right now or
            |  actions that usually happen.

    +row
        +cell #[code "Mood": "Ind"]
        +cell
            |  Indicative, i.e. something happens, has happened or will happen
            |  (as opposed to imperative or conditional).


+infobox("Important note", "⚠️")
    |  The morphological attributes are currently #[strong not all used by spaCy].
    |  Full integration is still being developed. In the meantime, it can still
    |  be useful to add them, especially if the language you're adding includes
    |  important distinctions and special cases. This ensures that as soon as
    |  full support is introduced, your language will be able to assign all
    |  possible attributes.