spaCy/website/usage/_adding-languages/_language-data.jade

//- 💫 DOCS > USAGE > ADDING LANGUAGES > LANGUAGE DATA

p
    |  The individual components #[strong expose variables] that can be imported
    |  within a language module, and added to the language's #[code Defaults].
    |  Some components, like the punctuation rules, usually don't need much
    |  customisation and can be imported from the global rules. Others,
    |  like the tokenizer and norm exceptions, are very specific and will make
    |  a big difference to spaCy's performance on that language and to
    |  training a language model for it.


+table(["Variable", "Type", "Description"])
    +row
        +cell #[code STOP_WORDS]
        +cell set
        +cell Individual words.

    +row
        +cell #[code TOKENIZER_EXCEPTIONS]
        +cell dict
        +cell Keyed by strings, mapped to a list of dicts (one per token) with token attributes.

    +row
        +cell #[code TOKEN_MATCH]
        +cell regex
        +cell Regexes to match complex tokens, e.g. URLs.

    +row
        +cell #[code NORM_EXCEPTIONS]
        +cell dict
        +cell Keyed by strings, mapped to their norms.

    +row
        +cell #[code TOKENIZER_PREFIXES]
        +cell list
        +cell Strings or regexes, usually not customised.

    +row
        +cell #[code TOKENIZER_SUFFIXES]
        +cell list
        +cell Strings or regexes, usually not customised.

    +row
        +cell #[code TOKENIZER_INFIXES]
        +cell list
        +cell Strings or regexes, usually not customised.

    +row
        +cell #[code LEX_ATTRS]
        +cell dict
        +cell Attribute ID mapped to function.

    +row
        +cell #[code SYNTAX_ITERATORS]
        +cell dict
        +cell
            |  Iterator ID mapped to function. Currently only supports
            |  #[code 'noun_chunks'].

    +row
        +cell #[code LOOKUP]
        +cell dict
        +cell Keyed by strings mapping to their lemma.

    +row
        +cell #[code LEMMA_RULES], #[code LEMMA_INDEX], #[code LEMMA_EXC]
        +cell dict
        +cell Lemmatization rules, keyed by part of speech.

    +row
        +cell #[code TAG_MAP]
        +cell dict
        +cell
            |  Keyed by strings mapped to
            |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
            |  tags.

    +row
        +cell #[code MORPH_RULES]
        +cell dict
        +cell Keyed by strings mapped to a dict of their morphological features.

+aside("Should I ever update the global data?")
    |  Reusable language data is collected as atomic pieces in the root of the
    |  #[+src(gh("spaCy", "lang")) #[code spacy.lang]] package. Often, when a new
    |  language is added, you'll find a pattern or symbol that's missing. Even
    |  if it isn't common in other languages, it might be best to add it to the
    |  shared language data, unless it has some conflicting interpretation. For
    |  instance, we don't expect to see guillemet quotation symbols
    |  (#[code &raquo;] and #[code &laquo;]) in English text. But if we do see
    |  them, we'd probably prefer the tokenizer to split them off.

+infobox("For languages with non-latin characters")
    |  In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
    |  needs to know the language's character set. If the language you're adding
    |  uses non-latin characters, you might need to add the required character
    |  classes to the global
    |  #[+src(gh("spaCy", "spacy/lang/char_classes.py")) #[code char_classes.py]].
    |  spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
    |  to keep this simple and readable. If the language requires very specific
    |  punctuation rules, you should consider overwriting the default regular
    |  expressions with your own in the language's #[code Defaults].


+h(3, "language-subclass") Creating a #[code Language] subclass

p
    |  Language-specific code and resources should be organised into a
    |  subpackage of spaCy, named according to the language's
    |  #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code].
    |  For instance, code and resources specific to Spanish are placed into a
    |  directory #[code spacy/lang/es], which can be imported as
    |  #[code spacy.lang.es].

p
    |  To get started, you can use our
    |  #[+src(gh("spacy-dev-resources", "templates/new_language")) templates]
    |  for the most important files. Here's what the class template looks like:

+code("__init__.py (excerpt)").
    # import language-specific data
    from .stop_words import STOP_WORDS
    from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
    from .lex_attrs import LEX_ATTRS

    from ..tokenizer_exceptions import BASE_EXCEPTIONS
    from ...language import Language
    from ...attrs import LANG
    from ...util import update_exc

    # create Defaults class in the module scope (necessary for pickling!)
    class XxxxxDefaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'xx' # language ISO code

        # optional: replace flags with custom functions, e.g. like_num()
        lex_attr_getters.update(LEX_ATTRS)

        # merge base exceptions and custom tokenizer exceptions
        tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
        stop_words = STOP_WORDS

    # create actual Language class
    class Xxxxx(Language):
        lang = 'xx' # language ISO code
        Defaults = XxxxxDefaults # override defaults

    # set default export – this allows the language class to be lazy-loaded
    __all__ = ['Xxxxx']

+infobox("Why lazy-loading?")
    |  Some languages contain large volumes of custom data, like lemmatizer
    |  lookup tables, or complex regular expressions that are expensive to
    |  compute. As of spaCy v2.0, #[code Language] classes are not imported on
    |  initialisation and are only loaded when you import them directly, or load
    |  a model that requires a language to be loaded. To lazy-load languages in
    |  your application, you can use the
    |  #[+api("util#get_lang_class") #[code util.get_lang_class()]] helper
    |  function with the two-letter language code as its argument.

+h(3, "stop-words") Stop words

p
    |  A #[+a("https://en.wikipedia.org/wiki/Stop_words") "stop list"] is a
    |  classic trick from the early days of information retrieval when search
    |  was largely about keyword presence and absence. It is still sometimes
    |  useful today to filter out common words from a bag-of-words model. To
    |  improve readability, #[code STOP_WORDS] are separated by spaces and
    |  newlines, and added as a multiline string.

+aside("What does spaCy consider a stop word?")
    |  There's no particularly principled logic behind what words should be
    |  added to the stop list. Make a list that you think might be useful
    |  to people and is likely to be unsurprising. As a rule of thumb, words
    |  that are very rare are unlikely to be useful stop words.

+code("Example").
    STOP_WORDS = set(&quot;&quot;&quot;
    a about above across after afterwards again against all almost alone along
    already also although always am among amongst amount an and another any anyhow
    anyone anything anyway anywhere are around as at

    back be became because become becomes becoming been before beforehand behind
    being below beside besides between beyond both bottom but by
    &quot;&quot;&quot;.split())

+infobox("Important note")
    |  When adding stop words from an online source, always #[strong include the link]
    |  in a comment. Make sure to #[strong proofread] and double-check the words
    |  carefully. A lot of the lists available online have been passed around
    |  for years and often contain mistakes, like unicode errors or random words
    |  that have once been added for a specific use case, but don't actually
    |  qualify.
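
p
    |  Part of the proofreading can be automated. The following is a minimal
    |  sketch in Python 3, not part of spaCy: the helper name
    |  #[code check_stop_words] and the sample list are hypothetical, and
    |  treating non-lowercase entries as suspicious is an assumption that may
    |  not hold for every language.

```python
from collections import Counter

# made-up sample with a duplicate, a capitalised entry and a broken character
# (\ufffd is the Unicode replacement character left behind by encoding errors)
RAW_STOP_WORDS = """
a about above across after
again against all all Almost alon\ufffd
"""


def check_stop_words(raw):
    """Return (duplicates, suspicious) for a whitespace-separated stop list.

    Hypothetical helper for illustration, not part of spaCy.
    """
    words = raw.split()
    counts = Counter(words)
    # entries that are listed more than once
    duplicates = sorted(w for w, n in counts.items() if n > 1)
    # entries that are not lowercase or contain the replacement character
    suspicious = sorted(w for w in counts
                        if w != w.lower() or '\ufffd' in w)
    return duplicates, suspicious


duplicates, suspicious = check_stop_words(RAW_STOP_WORDS)
```

p
    |  A check like this only flags candidates. Whether a word actually
    |  qualifies as a stop word still needs a human decision.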

+h(3, "tokenizer-exceptions") Tokenizer exceptions

p
    |  spaCy's #[+a("/usage/linguistic-features#how-tokenizer-works") tokenization algorithm]
    |  lets you deal with whitespace-delimited chunks separately. This makes it
    |  easy to define special-case rules, without worrying about how they
    |  interact with the rest of the tokenizer. Whenever the key string is
    |  matched, the special-case rule is applied, giving the defined sequence of
    |  tokens. You can also attach attributes to the subtokens covered by your
    |  special case, such as the subtokens' #[code LEMMA] or #[code TAG].

p
    |  Tokenizer exceptions can be added in the following format:

+code("tokenizer_exceptions.py (excerpt)").
    TOKENIZER_EXCEPTIONS = {
        "don't": [
            {ORTH: "do", LEMMA: "do"},
            {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
    }

+infobox("Important note")
    |  If an exception consists of more than one token, the #[code ORTH] values
    |  combined always need to #[strong match the original string]. The way the
    |  original string is split up can be pretty arbitrary sometimes – for
    |  example "gonna" is split into "gon" (lemma "go") and "na" (lemma "to").
    |  Because of how the tokenizer works, it's currently not possible to split
    |  single-letter strings into multiple tokens.
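
p
    |  The rule that the #[code ORTH] values must join back into the original
    |  string is easy to check mechanically. This is a minimal sketch under
    |  stated assumptions: #[code ORTH] and #[code LEMMA] are plain string
    |  stand-ins for spaCy's symbols, and #[code find_mismatches] is a
    |  hypothetical helper, not a spaCy function.

```python
# string stand-ins for the attribute IDs normally imported from spaCy
ORTH, LEMMA = 'ORTH', 'LEMMA'

EXCEPTIONS = {
    "don't": [{ORTH: "do", LEMMA: "do"}, {ORTH: "n't", LEMMA: "not"}],
    "gonna": [{ORTH: "gon", LEMMA: "go"}, {ORTH: "na", LEMMA: "to"}],
    "cannot": [{ORTH: "can"}, {ORTH: "not "}],  # deliberately wrong: extra space
}


def find_mismatches(exceptions):
    """Return the keys whose ORTH values don't join back into the key."""
    return sorted(key for key, tokens in exceptions.items()
                  if ''.join(token[ORTH] for token in tokens) != key)


mismatches = find_mismatches(EXCEPTIONS)
```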

p
    |  Unambiguous abbreviations, like month names or locations in English,
    |  should be added to exceptions with a lemma assigned, for example
    |  #[code {ORTH: "Jan.", LEMMA: "January"}]. Since the exceptions are
    |  added in Python, you can use custom logic to generate them more
    |  efficiently and make your data less verbose. How you do this ultimately
    |  depends on the language. Here's an example of how exceptions for time
    |  formats like "1a.m." and "1am" are generated in the English
    |  #[+src(gh("spaCy", "spacy/lang/en/tokenizer_exceptions.py")) #[code tokenizer_exceptions.py]]:

+code("tokenizer_exceptions.py (excerpt)").
    # use short, internal variable for readability
    _exc = {}

    for h in range(1, 12 + 1):
        for period in ["a.m.", "am"]:
            # always keep an eye on string interpolation!
            _exc["%d%s" % (h, period)] = [
                {ORTH: "%d" % h},
                {ORTH: period, LEMMA: "a.m."}]
        for period in ["p.m.", "pm"]:
            _exc["%d%s" % (h, period)] = [
                {ORTH: "%d" % h},
                {ORTH: period, LEMMA: "p.m."}]

    # only declare this at the bottom
    TOKENIZER_EXCEPTIONS = _exc

+aside("Generating tokenizer exceptions")
    |  Keep in mind that generating exceptions only makes sense if there's a
    |  clearly defined and #[strong finite number] of them, like common
    |  contractions in English. This is not always the case – in Spanish for
    |  instance, infinitive or imperative reflexive verbs and pronouns are one
    |  token (e.g. "vestirme"). In cases like this, spaCy shouldn't be
    |  generating exceptions for #[em all verbs]. Instead, this will be handled
    |  at a later stage during lemmatization.

p
    |  When adding the tokenizer exceptions to the #[code Defaults], you can use
    |  the #[+api("util#update_exc") #[code update_exc()]] helper function to merge
    |  them with the global base exceptions (including one-letter abbreviations
    |  and emoticons). The function performs a basic check to make sure
    |  exceptions are provided in the correct format. It can take any number of
    |  exceptions dicts as its arguments, and will update and overwrite the
    |  exceptions in this order. For example, if your language's tokenizer
    |  exceptions include a custom tokenization pattern for "a.", it will
    |  overwrite the base exceptions with the language's custom one.

+code("Example").
    from ...util import update_exc

    BASE_EXCEPTIONS =  {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
    TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}

    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    # {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}

+infobox("About spaCy's custom pronoun lemma")
    |  Unlike verbs and common nouns, there's no clear base form of a personal
    |  pronoun. Should the lemma of "me" be "I", or should we normalize person
    |  as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
    |  novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
    |  all personal pronouns.

+h(3, "norm-exceptions") Norm exceptions
    +tag-new(2)

p
    |  In addition to #[code ORTH] or #[code LEMMA], tokenizer exceptions can
    |  also set a #[code NORM] attribute. This is useful to specify a normalised
    |  version of the token – for example, the norm of "n't" is "not". By default,
    |  a token's norm equals its lowercase text. If the lowercase spelling of a
    |  word exists, norms should always be in lowercase.

+aside-code("Norms vs. lemmas").
    doc = nlp(u"I'm gonna realise")
    norms = [token.norm_ for token in doc]
    lemmas = [token.lemma_ for token in doc]
    assert norms == ['i', 'am', 'going', 'to', 'realize']
    assert lemmas == ['-PRON-', 'be', 'go', 'to', 'realise']

p
    |  spaCy usually tries to normalise words with different spellings to a single,
    |  common spelling. This has no effect on any other token attributes, or
    |  tokenization in general, but it ensures that
    |  #[strong equivalent tokens receive similar representations]. This can
    |  improve the model's predictions on words that weren't common in the
    |  training data, but are equivalent to other words – for example, "realize"
    |  and "realise", or "thx" and "thanks".

p
    |  Similarly, spaCy also includes
    |  #[+src(gh("spaCy", "spacy/lang/norm_exceptions.py")) global base norms]
    |  for normalising different styles of quotation marks and currency
    |  symbols. Even though #[code $] and #[code €] are very different, spaCy
    |  normalises them both to #[code $]. This way, they'll always be seen as
    |  similar, no matter how common they were in the training data.

p
    |  Norm exceptions can be provided as a simple dictionary. For more examples,
    |  see the English
    |  #[+src(gh("spaCy", "spacy/lang/en/norm_exceptions.py")) #[code norm_exceptions.py]].

+code("Example").
    NORM_EXCEPTIONS = {
        "cos": "because",
        "fav": "favorite",
        "accessorise": "accessorize",
        "accessorised": "accessorized"
    }

p
    |  To add the custom norm exceptions lookup table, you can use the
    |  #[code add_lookups()] helper function. It takes the default attribute
    |  getter function as its first argument, plus a variable list of
    |  dictionaries. If a string's norm is found in one of the dictionaries,
    |  that value is used – otherwise, the default function is called and the
    |  token is assigned its default norm.

+code.
    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
                                         NORM_EXCEPTIONS, BASE_NORMS)

p
    |  The order of the dictionaries is also the lookup order – so if your
    |  language's norm exceptions overwrite any of the global exceptions, they
    |  should be added first. Also note that the tokenizer exceptions will
    |  always have priority over the attribute getters.
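
p
    |  The lookup order can be illustrated with a simplified stand-in. The
    |  function #[code add_lookups_sketch] below is hypothetical and only
    |  mirrors the behaviour described above; it is not spaCy's implementation,
    |  and both dictionaries contain made-up entries.

```python
def add_lookups_sketch(default_func, *lookups):
    """Simplified stand-in for the behaviour described above (not spaCy's code)."""
    def get_attr(string):
        # the dictionaries are consulted in the order they were passed in
        for lookup in lookups:
            if string in lookup:
                return lookup[string]
        # no dictionary has an entry, so fall back to the default getter
        return default_func(string)
    return get_attr


NORM_EXCEPTIONS = {"cos": "because", "€": "EUR"}  # language-specific, made up
BASE_NORMS = {"€": "$", "£": "$"}                  # shared, made up

get_norm = add_lookups_sketch(lambda string: string.lower(),
                              NORM_EXCEPTIONS, BASE_NORMS)
```

p
    |  Because #[code NORM_EXCEPTIONS] is passed first, its entry for
    |  #[code €] wins over the one in #[code BASE_NORMS]. Strings found in
    |  neither dictionary are simply lowercased by the default getter.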

+h(3, "lex-attrs") Lexical attributes
    +tag-new(2)

p
    |  spaCy provides a range of #[+api("token#attributes") #[code Token] attributes]
    |  that return useful information on that token – for example, whether it's
    |  uppercase or lowercase, a left or right punctuation mark, or whether it
    |  resembles a number or email address. Most of these functions, like
    |  #[code is_lower] or #[code like_url], should be language-independent.
    |  Others, like #[code like_num] (which includes both digits and number
    |  words), require some customisation.

+aside("Best practices")
    |  Keep in mind that those functions are only intended to be an approximation.
    |  It's always better to prioritise simplicity and performance over covering
    |  very specific edge cases.#[br]#[br]
    |  English number words are pretty simple, because even large numbers
    |  consist of individual tokens, and we can get away with splitting and
    |  matching strings against a list. In other languages, like German, "two
    |  hundred and thirty-four" is one word, and thus one token. Here, it's best
    |  to match a string against a list of number word fragments (instead of a
    |  technically almost infinite list of possible number words).
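
p
    |  For a language with compound number words, the fragment approach from
    |  the aside might look roughly like this. It is a sketch only: the
    |  fragment list is deliberately small and incomplete, the regular
    |  expression is one possible way to do the matching, and none of it is
    |  taken from spaCy's German language data.

```python
import re

# a deliberately small, incomplete list of German number word fragments
_num_fragments = ['ein', 'eins', 'zwei', 'drei', 'vier', 'fünf', 'sechs',
                  'sieben', 'acht', 'neun', 'zehn', 'elf', 'zwölf', 'zwanzig',
                  'dreißig', 'vierzig', 'fünfzig', 'hundert', 'tausend']

_fragment = '(?:%s)' % '|'.join(re.escape(f) for f in _num_fragments)
# one or more fragments, optionally joined by "und" (as in "vierunddreißig")
_compound_re = re.compile('%s(?:(?:und)?%s)*' % (_fragment, _fragment))


def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    # the whole string has to be made up of known fragments
    return _compound_re.fullmatch(text.lower()) is not None
```

p
    |  In line with the advice above, this stays an approximation. Words built
    |  from fragments that are missing from the list will not be recognised.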

p
    |  Here's an example from the English
    |  #[+src(gh("spaCy", "spacy/lang/en/lex_attrs.py")) #[code lex_attrs.py]]:

+code("lex_attrs.py").
    _num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
                  'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
                  'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
                  'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
                  'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
                  'gajillion', 'bazillion']

    def like_num(text):
        text = text.replace(',', '').replace('.', '')
        if text.isdigit():
            return True
        if text.count('/') == 1:
            num, denom = text.split('/')
            if num.isdigit() and denom.isdigit():
                return True
        if text.lower() in _num_words:
            return True
        return False

    LEX_ATTRS = {
        LIKE_NUM: like_num
    }

p
    |  By updating the default lexical attributes with a custom #[code LEX_ATTRS]
    |  dictionary in the language's defaults via
    |  #[code lex_attr_getters.update(LEX_ATTRS)], only the functions you
    |  define are overwritten. All other attributes keep their default getters.

+h(3, "syntax-iterators") Syntax iterators

p
    |  Syntax iterators are functions that compute views of a #[code Doc]
    |  object based on its syntax. At the moment, this data is only used for
    |  extracting
    |  #[+a("/usage/linguistic-features#noun-chunks") noun chunks], which
    |  are available as the #[+api("doc#noun_chunks") #[code Doc.noun_chunks]]
    |  property. Because base noun phrases work differently across languages,
    |  the rules to compute them are part of the individual language's data. If
    |  a language does not include a noun chunks iterator, the property won't
    |  be available. For examples, see the existing syntax iterators:

+aside-code("Noun chunks example").
    doc = nlp(u'A phrase with another phrase occurs.')
    chunks = list(doc.noun_chunks)
    assert chunks[0].text == "A phrase"
    assert chunks[1].text == "another phrase"

+table(["Language", "Code", "Source"])
    for lang in ["en", "de", "fr", "es"]
        +row
            +cell=LANGUAGES[lang]
            +cell #[code=lang]
            +cell
                +src(gh("spaCy", "spacy/lang/" + lang + "/syntax_iterators.py"))
                    code lang/#{lang}/syntax_iterators.py

+h(3, "lemmatizer") Lemmatizer
    +tag-new(2)

p
    |  As of v2.0, spaCy supports simple lookup-based lemmatization. This is
    |  usually the quickest and easiest way to get started. The data is stored
    |  in a dictionary mapping a string to its lemma. To determine a token's
    |  lemma, spaCy simply looks it up in the table. Here's an example from
    |  the Spanish language data:

+code("lang/es/lemmatizer.py (excerpt)").
    LOOKUP = {
        "aba": "abar",
        "ababa": "abar",
        "ababais": "abar",
        "ababan": "abar",
        "ababanes": "ababán",
        "ababas": "abar",
        "ababoles": "ababol",
        "ababábites": "ababábite"
    }

p
    |  To provide a lookup lemmatizer for your language, import the lookup table
    |  and add it to the #[code Language] class as #[code lemma_lookup]:

+code.
    lemma_lookup = dict(LOOKUP)

+h(3, "tag-map") Tag map

p
    |  Most treebanks define a custom part-of-speech tag scheme, striking a
    |  balance between level of detail and ease of prediction. While it's
    |  useful to have custom tagging schemes, it's also useful to have a common
    |  scheme, to which the more specific tags can be related. The tagger can
    |  learn a tag scheme with any arbitrary symbols. However, you need to
    |  define how those symbols map down to the
    |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies tag set].
    |  This is done by providing a tag map.

p
    |  The keys of the tag map should be #[strong strings in your tag set]. The
    |  values should be a dictionary. The dictionary must have an entry #[code POS]
    |  whose value is one of the
    |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
    |  tags. Optionally, you can also include morphological features or other
    |  token attributes in the tag map as well. This allows you to do simple
    |  #[+a("/usage/linguistic-features#rule-based-morphology") rule-based morphological analysis].

+code("Example").
    from ..symbols import POS, NOUN, VERB, DET

    TAG_MAP = {
        "NNS":  {POS: NOUN, "Number": "plur"},
        "VBG":  {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
        "DT":   {POS: DET}
    }
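
p
    |  The requirement that every entry has a valid #[code POS] value can be
    |  checked before training. This sketch makes a few assumptions:
    |  #[code POS] and the tag names are plain strings instead of spaCy
    |  symbols, #[code invalid_tags] is a hypothetical helper, and the set
    |  below lists the 17 Universal Dependencies part-of-speech tags. spaCy
    |  itself may accept additional values.

```python
# string stand-in for the symbol normally imported from spaCy
POS = 'POS'

# the 17 Universal Dependencies part-of-speech tags
UD_POS_TAGS = {'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN',
               'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB',
               'X'}

TAG_MAP = {
    "NNS": {POS: 'NOUN', "Number": "plur"},
    "VBG": {POS: 'VERB', "VerbForm": "part"},
    "DT": {POS: 'DETERMINER'},  # deliberately wrong: not a UD tag
    "UH": {"Polite": "form"},   # deliberately wrong: no POS entry
}


def invalid_tags(tag_map):
    """Return the tags whose entry lacks a valid Universal Dependencies POS."""
    return sorted(tag for tag, attrs in tag_map.items()
                  if attrs.get(POS) not in UD_POS_TAGS)


bad = invalid_tags(TAG_MAP)
```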

+h(3, "morph-rules") Morph rules

p
    |  The morphology rules let you set token attributes such as lemmas, keyed
    |  by the extended part-of-speech tag and token text. The morphological
    |  features and their possible values are language-specific and based on the
    |  #[+a("http://universaldependencies.org") Universal Dependencies scheme].


+code("Example").
    from ..symbols import LEMMA

    MORPH_RULES = {
        "VBZ": {
            "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
            "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
            "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
            "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
            "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
        }
    }

p
    |  In the example of #[code "am"], the attributes look like this:

+table(["Attribute", "Description"])
    +row
        +cell #[code LEMMA: "be"]
        +cell Base form, e.g. "to be".

    +row
        +cell #[code "VerbForm": "Fin"]
        +cell
            |  Finite verb. Finite verbs have a subject and can be the root of
            |  an independent clause – "I am." is a valid, complete
            |  sentence.

    +row
        +cell #[code "Person": "One"]
        +cell First person, i.e. "#[strong I] am".

    +row
        +cell #[code "Tense": "Pres"]
        +cell
            |  Present tense, i.e. actions that are happening right now or
            |  actions that usually happen.

    +row
        +cell #[code "Mood": "Ind"]
        +cell
            |  Indicative, i.e. something happens, has happened or will happen
            |  (as opposed to imperative or conditional).
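
p
    |  Keying the rules by tag first and token text second means the same
    |  string can receive different attributes depending on how it was tagged.
    |  The following sketch illustrates that two-level lookup. The helper
    |  #[code morph_features] is hypothetical, #[code LEMMA] is a string
    |  stand-in for the spaCy symbol, and returning an empty dict for unknown
    |  pairs is an assumption.

```python
LEMMA = 'LEMMA'  # string stand-in for the symbol imported from spaCy

MORPH_RULES = {
    "VBZ": {
        "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three"},
        "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three"},
    }
}


def morph_features(tag, text, rules=MORPH_RULES):
    """Return the features for a (tag, text) pair, or an empty dict."""
    return rules.get(tag, {}).get(text, {})
```

p
    |  Here, #[code "'s"] tagged as #[code VBZ] gets the lemma "be", while the
    |  same string under a tag without rules gets no extra attributes.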


+infobox("Important note", "⚠️")
    |  The morphological attributes are currently #[strong not all used by spaCy].
    |  Full integration is still being developed. In the meantime, it can still
    |  be useful to add them, especially if the language you're adding includes
    |  important distinctions and special cases. This ensures that as soon as
    |  full support is introduced, your language will be able to assign all
    |  possible attributes.
-												Update usage documentation

											
										
										
											2017-10-03 15:26:20 +03:00
+								//- 💫 DOCS > USAGE > ADDING LANGUAGES > LANGUAGE DATA
-												Update adding languages docs and add 101

											
										
										
											2017-06-04 00:54:23 +03:00
 								p
 								    |  The individual components #[strong expose variables] that can be imported
 								    |  within a language module, and added to the language's #[code Defaults].
 								    |  Some components, like the punctuation rules, usually don't need much
-												Fix formatting and wording

											
										
										
											2018-05-07 22:24:35 +03:00
+								    |  customisation and can be imported from the global rules. Others,
-												Update adding languages docs and add 101

											
										
										
											2017-06-04 00:54:23 +03:00
+								    |  like the tokenizer and norm exceptions, are very specific and will make
 								    |  a big difference to spaCy's performance on the particular language and
 								    |  training a language model.
 								+table(["Variable", "Type", "Description"])
 								    +row
 								        +cell #[code STOP_WORDS]
 								        +cell set
 								        +cell Individual words.
 								    +row
 								        +cell #[code TOKENIZER_EXCEPTIONS]
 								        +cell dict
 								        +cell Keyed by strings mapped to list of one dict per token with token attributes.
 								    +row
 								        +cell #[code TOKEN_MATCH]
 								        +cell regex
 								        +cell Regexes to match complex tokens, e.g. URLs.
 								    +row
 								        +cell #[code NORM_EXCEPTIONS]
 								        +cell dict
 								        +cell Keyed by strings, mapped to their norms.
 								    +row
 								        +cell #[code TOKENIZER_PREFIXES]
 								        +cell list
 								        +cell Strings or regexes, usually not customised.
 								    +row
 								        +cell #[code TOKENIZER_SUFFIXES]
 								        +cell list
 								        +cell Strings or regexes, usually not customised.
 								    +row
 								        +cell #[code TOKENIZER_INFIXES]
 								        +cell list
 								        +cell Strings or regexes, usually not customised.
 								    +row
 								        +cell #[code LEX_ATTRS]
 								        +cell dict
 								        +cell Attribute ID mapped to function.
-												Add details on syntax iterators

											
										
										
											2017-06-05 00:16:33 +03:00
+								    +row
 								        +cell #[code SYNTAX_ITERATORS]
 								        +cell dict
 								        +cell
 								            |  Iterator ID mapped to function. Currently only supports
 								            |  #[code 'noun_chunks'].
-												Update adding languages docs and add 101

											
										
										
											2017-06-04 00:54:23 +03:00
+								    +row
 								        +cell #[code LOOKUP]
 								        +cell dict
 								        +cell Keyed by strings mapping to their lemma.
 								    +row
 								        +cell #[code LEMMA_RULES], #[code LEMMA_INDEX], #[code LEMMA_EXC]
 								        +cell dict
 								        +cell Lemmatization rules, keyed by part of speech.
 								    +row
 								        +cell #[code TAG_MAP]
 								        +cell dict
 								        +cell
 								            |  Keyed by strings mapped to
 								            |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
 								            |  tags.
 								    +row
 								        +cell #[code MORPH_RULES]
 								        +cell dict
 								        +cell Keyed by strings mapped to a dict of their morphological features.
 								+aside("Should I ever update the global data?")
 								    |  Reuseable language data is collected as atomic pieces in the root of the
-												Update usage documentation

											
										
										
											2017-10-03 15:26:20 +03:00
+								    |  #[+src(gh("spaCy", "lang")) #[code spacy.lang]] package. Often, when a new
-												Update adding languages docs and add 101

											
										
										
											2017-06-04 00:54:23 +03:00
+								    |  language is added, you'll find a pattern or symbol that's missing. Even
 								    |  if it isn't common in other languages, it might be best to add it to the
 								    |  shared language data, unless it has some conflicting interpretation. For
 								    |  instance, we don't expect to see guillemot quotation symbols
 								    |  (#[code &raquo;] and #[code &laquo;]) in English text. But if we do see
 								    |  them, we'd probably prefer the tokenizer to split them off.
 								+infobox("For languages with non-latin characters")
 								    |  In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
 								    |  needs to know the language's character set. If the language you're adding
 								    |  uses non-latin characters, you might need to add the required character
 								    |  classes to the global
-												Update usage documentation

											
										
										
											2017-10-03 15:26:20 +03:00
+								    |  #[+src(gh("spacy", "spacy/lang/char_classes.py")) #[code char_classes.py]].
-												Update adding languages docs and add 101

											
										
										
											2017-06-04 00:54:23 +03:00
+								    |  spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
 								    |  to keep this simple and readable. If the language requires very specific
 								    |  punctuation rules, you should consider overwriting the default regular
 								    |  expressions with your own in the language's #[code Defaults].
-												Update usage documentation

											
										
										
											2017-10-03 15:26:20 +03:00
+								+h(3, "language-subclass") Creating a #[code Language] subclass
-												Add "Adding languages" workflow (closes #562)

											
										
										
											2016-12-19 01:51:09 +03:00
 								p
 								    |  Language-specific code and resources should be organised into a
 								    |  subpackage of spaCy, named according to the language's
 								    |  #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code].
 								    |  For instance, code and resources specific to Spanish are placed into a
-												Update adding languages docs

											
										
										
											2017-05-12 16:38:17 +03:00
+								    |  directory #[code spacy/lang/es], which can be imported as
 								    |  #[code spacy.lang.es].
-												Add "Adding languages" workflow (closes #562)

											
										
										
											2016-12-19 01:51:09 +03:00
 								p
 								    |  To get started, you can use our
 								    |  #[+src(gh("spacy-dev-resources", "templates/new_language")) templates]
 								    |  for the most important files. Here's what the class template looks like:
 								+code("__init__.py (excerpt)").
-												Update adding languages docs

											
										
										
											2017-05-12 16:38:17 +03:00
+								    # import language-specific data
 								    from .stop_words import STOP_WORDS
 								    from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 								    from .lex_attrs import LEX_ATTRS
 								    from ..tokenizer_exceptions import BASE_EXCEPTIONS
 								    from ...language import Language
 								    from ...attrs import LANG
 								    from ...util import update_exc
-												Add "Adding languages" workflow (closes #562)

											
										
										
											2016-12-19 01:51:09 +03:00
-												Move Defaults subclass to module scope (necessary for pickling)

											
										
										
											2017-05-20 20:02:27 +03:00
+								    # create Defaults class in the module scope (necessary for pickling!)
 								    class XxxxxDefaults(Language.Defaults):
 								        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
 								        lex_attr_getters[LANG] = lambda text: 'xx' # language ISO code
+								        # optional: replace flags with custom functions, e.g. like_num()
 								        lex_attr_getters.update(LEX_ATTRS)
+								        # merge base exceptions and custom tokenizer exceptions
 								        tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+								        stop_words = STOP_WORDS
+								    # create actual Language class
 								    class Xxxxx(Language):
 								        lang = 'xx' # language ISO code
 								        Defaults = XxxxxDefaults # override defaults
+								    # set default export – this allows the language class to be lazy-loaded
 								    __all__ = ['Xxxxx']
+								+infobox("Why lazy-loading?")
+								    |  Some languages contain large volumes of custom data, like lemmatizer
+								    |  lookup tables, or complex regular expressions that are expensive to
+								    |  compute. As of spaCy v2.0, #[code Language] classes are not imported on
 								    |  initialisation and are only loaded when you import them directly, or load
 								    |  a model that requires a language to be loaded. To lazy-load languages in
+								    |  your application, you can use the
 								    |  #[+api("util#get_lang_class") #[code util.get_lang_class()]] helper
+								    |  function with the two-letter language code as its argument.
 								+h(3, "stop-words") Stop words
 								p
 								    |  A #[+a("https://en.wikipedia.org/wiki/Stop_words") "stop list"] is a
 								    |  classic trick from the early days of information retrieval when search
 								    |  was largely about keyword presence and absence. It is still sometimes
+								    |  useful today to filter out common words from a bag-of-words model. To
 								    |  improve readability, #[code STOP_WORDS] are separated by spaces and
 								    |  newlines, and added as a multiline string.
 								+aside("What does spaCy consider a stop word?")
+								    |  There's no particularly principled logic behind what words should be
+								    |  added to the stop list. Make a list that you think might be useful
 								    |  to people and is likely to be unsurprising. As a rule of thumb, words
 								    |  that are very rare are unlikely to be useful stop words.
 								+code("Example").
+								    STOP_WORDS = set("""
+								    a about above across after afterwards again against all almost alone along
 								    already also although always am among amongst amount an and another any anyhow
 								    anyone anything anyway anywhere are around as at
 								    back be became because become becomes becoming been before beforehand behind
 								    being below beside besides between beyond both bottom but by
+								    """.split())
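 								p
 								    |  As a rough sketch of how such a set can be used to filter a
 								    |  bag-of-words model: the shortened stop list, the helper name
 								    |  #[code bag_of_words()] and the whitespace tokenization below are
 								    |  assumptions made for illustration, not part of spaCy.

```python
from collections import Counter

# A tiny stand-in stop list; real lists are much longer.
STOP_WORDS = set("""
a about above and
of on the to
""".split())


def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens that are not stop words."""
    tokens = text.lower().split()
    return Counter(token for token in tokens if token not in STOP_WORDS)


counts = bag_of_words("The cat sat on the mat and the dog")
```

 								p
 								    |  Because #[code split()] ignores runs of spaces and newlines, the layout
 								    |  of the multiline string has no effect on the resulting set.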
+								+infobox("Important note")
 								    |  When adding stop words from an online source, always #[strong include the link]
 								    |  in a comment. Make sure to #[strong proofread] and double-check the words
 								    |  carefully. A lot of the lists available online have been passed around
 								    |  for years and often contain mistakes, like Unicode errors or random words
 								    |  that were once added for a specific use case, but don't actually
 								    |  qualify.
+								+h(3, "tokenizer-exceptions") Tokenizer exceptions
 								p
+								    |  spaCy's #[+a("/usage/linguistic-features#how-tokenizer-works") tokenization algorithm]
+								    |  lets you deal with whitespace-delimited chunks separately. This makes it
 								    |  easy to define special-case rules, without worrying about how they
 								    |  interact with the rest of the tokenizer. Whenever the key string is
 								    |  matched, the special-case rule is applied, giving the defined sequence of
 								    |  tokens. You can also attach attributes to the subtokens covered by your
 								    |  special case, such as each subtoken's #[code LEMMA] or #[code TAG].
 								p
 								    |  Tokenizer exceptions can be added in the following format:
+								+code("tokenizer_exceptions.py (excerpt)").
+								    TOKENIZER_EXCEPTIONS = {
 								        "don't": [
 								            {ORTH: "do", LEMMA: "do"},
+								            {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
+								    }
+								+infobox("Important note")
 								    |  If an exception consists of more than one token, the #[code ORTH] values
 								    |  combined always need to #[strong match the original string]. The way the
 								    |  original string is split up can be pretty arbitrary sometimes – for
+								    |  example "gonna" is split into "gon" (lemma "go") and "na" (lemma "to").
+								    |  Because of how the tokenizer works, it's currently not possible to split
 								    |  single-letter strings into multiple tokens.
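 								p
 								    |  The rule that the #[code ORTH] values must add up to the original string
 								    |  is easy to check by hand. The sketch below is not part of spaCy: plain
 								    |  strings stand in for the #[code ORTH] and #[code LEMMA] symbols, and
 								    |  #[code find_mismatches()] is a hypothetical helper written for this
 								    |  example.

```python
# Plain strings stand in for the ORTH and LEMMA symbols used by spaCy.
ORTH, LEMMA = "ORTH", "LEMMA"

exceptions = {
    "don't": [{ORTH: "do", LEMMA: "do"}, {ORTH: "n't", LEMMA: "not"}],
    "gonna": [{ORTH: "gon", LEMMA: "go"}, {ORTH: "na", LEMMA: "to"}],
}


def find_mismatches(exc):
    """Return the keys whose ORTH values do not join back to the key."""
    return [
        key for key, tokens in exc.items()
        if "".join(token[ORTH] for token in tokens) != key
    ]


# A typical mistake: putting the lemma into ORTH instead of the surface text.
broken = dict(exceptions)
broken["gonna"] = [{ORTH: "going"}, {ORTH: "to"}]
```

 								p
 								    |  Here #[code find_mismatches(exceptions)] returns an empty list, while
 								    |  the #[code broken] dict is reported because "going" + "to" does not
 								    |  spell "gonna".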
 								p
 								    |  Unambiguous abbreviations, like month names or locations in English,
+								    |  should be added to exceptions with a lemma assigned, for example
 								    |  #[code {ORTH: "Jan.", LEMMA: "January"}]. Since the exceptions are
 								    |  added in Python, you can use custom logic to generate them more
 								    |  efficiently and make your data less verbose. How you do this ultimately
 								    |  depends on the language. Here's an example of how exceptions for time
 								    |  formats like "1a.m." and "1am" are generated in the English
+								    |  #[+src(gh("spaCy", "spacy/lang/en/tokenizer_exceptions.py")) #[code tokenizer_exceptions.py]]:
 								+code("tokenizer_exceptions.py (excerpt)").
 								    # use short, internal variable for readability
 								    _exc = {}
 								    for h in range(1, 12 + 1):
 								        for period in ["a.m.", "am"]:
 								            # always keep an eye on string interpolation!
 								            _exc["%d%s" % (h, period)] = [
 								                {ORTH: "%d" % h},
 								                {ORTH: period, LEMMA: "a.m."}]
 								        for period in ["p.m.", "pm"]:
 								            _exc["%d%s" % (h, period)] = [
 								                {ORTH: "%d" % h},
 								                {ORTH: period, LEMMA: "p.m."}]
 								    # only declare this at the bottom
+								    TOKENIZER_EXCEPTIONS = _exc
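 								p
 								    |  Generated data is worth a quick sanity check before it is used. The
 								    |  following standalone variant of the loop is a sketch rather than the
 								    |  code from the English module: plain strings stand in for the
 								    |  #[code ORTH] and #[code LEMMA] symbols so that it runs without spaCy,
 								    |  and the two inner loops are folded into one.

```python
# Plain strings stand in for the ORTH and LEMMA symbols used by spaCy.
ORTH, LEMMA = "ORTH", "LEMMA"

_exc = {}
for h in range(1, 12 + 1):
    # each lemma is paired with the spellings that should map to it
    for lemma, periods in [("a.m.", ["a.m.", "am"]), ("p.m.", ["p.m.", "pm"])]:
        for period in periods:
            _exc["%d%s" % (h, period)] = [
                {ORTH: "%d" % h},
                {ORTH: period, LEMMA: lemma}]

TOKENIZER_EXCEPTIONS = _exc
```

 								p
 								    |  Twelve hours times four spellings gives 48 entries, which is a simple
 								    |  number to assert on when the generating logic changes.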
+								+aside("Generating tokenizer exceptions")
 								    |  Keep in mind that generating exceptions only makes sense if there's a
 								    |  clearly defined and #[strong finite number] of them, like common
 								    |  contractions in English. This is not always the case – in Spanish for
 								    |  instance, infinitive or imperative reflexive verbs and pronouns are one
 								    |  token (e.g. "vestirme"). In cases like this, spaCy shouldn't be
 								    |  generating exceptions for #[em all verbs]. Instead, this will be handled
 								    |  at a later stage during lemmatization.
+								p
 								    |  When adding the tokenizer exceptions to the #[code Defaults], you can use
+								    |  the #[+api("util#update_exc") #[code update_exc()]] helper function to merge
+								    |  them with the global base exceptions (including one-letter abbreviations
+								    |  and emoticons). The function performs a basic check to make sure
 								    |  exceptions are provided in the correct format. It can take any number of
 								    |  exceptions dicts as its arguments, and will update and overwrite the
 								    |  exceptions in this order. For example, if your language's tokenizer
 								    |  exceptions include a custom tokenization pattern for "a.", it will
 								    |  overwrite the base exceptions with the language's custom one.
+								+code("Example").
 								    from ...symbols import ORTH, LEMMA
 								    from ...util import update_exc
+								    BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
 								    TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
+								    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
 								    # {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
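 								p
 								    |  The merge order can be reproduced without spaCy. The function
 								    |  #[code update_exc_sketch()] below is a hypothetical stand-in that only
 								    |  models the "later dicts overwrite earlier ones" behaviour described
 								    |  above. It leaves out the format check that the real helper performs,
 								    |  and uses plain strings in place of the #[code ORTH] and #[code LEMMA]
 								    |  symbols.

```python
# Plain strings stand in for the ORTH and LEMMA symbols used by spaCy.
ORTH, LEMMA = "ORTH", "LEMMA"


def update_exc_sketch(base_exceptions, *addition_dicts):
    """Merge exception dicts left to right; later dicts win on shared keys."""
    merged = dict(base_exceptions)  # copy, so the base dict is not modified
    for additions in addition_dicts:
        merged.update(additions)
    return merged


BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}
merged = update_exc_sketch(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
```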
+								+infobox("About spaCy's custom pronoun lemma")
+								    |  Unlike verbs and common nouns, there's no clear base form of a personal
 								    |  pronoun. Should the lemma of "me" be "I", or should we normalize person
 								    |  as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
 								    |  novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
 								    |  all personal pronouns.
+								+h(3, "norm-exceptions") Norm exceptions
+								    +tag-new(2)
 								p
 								    |  In addition to #[code ORTH] or #[code LEMMA], tokenizer exceptions can
 								    |  also set a #[code NORM] attribute. This is useful to specify a normalised
 								    |  version of the token – for example, the norm of "n't" is "not". By default,
 								    |  a token's norm equals its lowercase text. If the lowercase spelling of a
 								    |  word exists, norms should always be in lowercase.
+								+aside-code("Norms vs. lemmas").
+								    doc = nlp(u"I'm gonna realise")
+								    norms = [token.norm_ for token in doc]
 								    lemmas = [token.lemma_ for token in doc]
 								    assert norms == ['i', 'am', 'going', 'to', 'realize']
 								    assert lemmas == ['i', 'be', 'go', 'to', 'realise']
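 								p
 								    |  As a rough sketch of how a #[code NORM] value sits alongside #[code ORTH]
 								    |  and #[code LEMMA] in a tokenizer exception, the snippet below uses plain
 								    |  strings as stand-ins for spaCy's attribute symbols. The exception entry
 								    |  and the #[code default_norm] and #[code norms_for] helpers are
 								    |  hypothetical and only illustrate the fallback to the lowercase text.

```python
# Stand-ins for spaCy's attribute symbols (assumption: the real ones are
# imported from spacy.symbols; plain strings keep this sketch self-contained).
ORTH, LEMMA, NORM = "ORTH", "LEMMA", "NORM"

# Hypothetical exception: "don't" is split into two tokens, and the second
# token carries an explicit norm.
TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do"},
        {ORTH: "n't", LEMMA: "not", NORM: "not"},
    ],
}


def default_norm(text):
    # By default, a token's norm equals its lowercase text.
    return text.lower()


def norms_for(string):
    """Return the norms for an exception string, using the default if unset."""
    tokens = TOKENIZER_EXCEPTIONS[string]
    return [t.get(NORM, default_norm(t[ORTH])) for t in tokens]


# The ORTH values of an exception are expected to join back to the
# original string, so check that for every entry.
for string, tokens in TOKENIZER_EXCEPTIONS.items():
    assert "".join(t[ORTH] for t in tokens) == string
```

 								p
 								    |  Here, #[code norms_for("don't")] yields one norm per token: the default
 								    |  lowercase form for the first token and the explicit value for the second.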
 								p
 								    |  spaCy usually tries to normalise words with different spellings to a single,
 								    |  common spelling. This has no effect on any other token attributes, or
 								    |  tokenization in general, but it ensures that
 								    |  #[strong equivalent tokens receive similar representations]. This can
 								    |  improve the model's predictions on words that weren't common in the
 								    |  training data, but are equivalent to other words – for example, "realize"
 								    |  and "realise", or "thx" and "thanks".
 								p
 								    |  Similarly, spaCy also includes
 								    |  #[+src(gh("spaCy", "spacy/lang/norm_exceptions.py")) global base norms]
 								    |  for normalising different styles of quotation marks and currency
 								    |  symbols. Even though #[code $] and #[code €] are very different, spaCy
 								    |  normalises them both to #[code $]. This way, they'll always be seen as
 								    |  similar, no matter how common they were in the training data.
 								p
 								    |  Norm exceptions can be provided as a simple dictionary. For more examples,
 								    |  see the English
 								    |  #[+src(gh("spaCy", "spacy/lang/en/norm_exceptions.py")) #[code norm_exceptions.py]].
 								+code("Example").
 								    NORM_EXCEPTIONS = {
 								        "cos": "because",
 								        "fav": "favorite",
 								        "accessorise": "accessorize",
 								        "accessorised": "accessorized"
 								    }
 								p
 								    |  To add the custom norm exceptions lookup table, you can use the
 								    |  #[code add_lookups()] helper function. It takes the default attribute
 								    |  getter function as its first argument, plus a variable list of
 								    |  dictionaries. If a string is found in one of the dictionaries,
 								    |  that value is used – otherwise, the default function is called and the
 								    |  token is assigned its default norm.
 								+code.
 								    lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
 								                                         NORM_EXCEPTIONS, BASE_NORMS)
 								p
 								    |  The order of the dictionaries is also the lookup order – so if your
 								    |  language's norm exceptions overwrite any of the global exceptions, they
 								    |  should be added first. Also note that the tokenizer exceptions will
 								    |  always have priority over the attribute getters.
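 								p
 								    |  The lookup order can be illustrated with a simplified stand-in for the
 								    |  helper. This is a sketch of the behaviour described above, not spaCy's
 								    |  actual implementation; #[code add_lookups_sketch] and the table
 								    |  contents are hypothetical.

```python
def add_lookups_sketch(default_func, *lookups):
    """Simplified stand-in for add_lookups(): try each table in order."""
    def get_attr(string):
        for lookup in lookups:
            if string in lookup:
                return lookup[string]
        # No table knows the string, so fall back to the default getter.
        return default_func(string)
    return get_attr


# Hypothetical tables. The language-specific one is passed first so that
# its entries win over the global base norms.
NORM_EXCEPTIONS = {"cos": "because", "€": "euro"}
BASE_NORMS = {"€": "$", "£": "$"}

get_norm = add_lookups_sketch(lambda s: s.lower(), NORM_EXCEPTIONS, BASE_NORMS)
```

 								p
 								    |  With this setup, #[code "€"] resolves through the first table,
 								    |  #[code "£"] through the second, and any other string falls back to
 								    |  its lowercase form.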
 								+h(3, "lex-attrs") Lexical attributes
 								    +tag-new(2)
 								p
 								    |  spaCy provides a range of #[+api("token#attributes") #[code Token] attributes]
 								    |  that return useful information on that token – for example, whether it's
 								    |  uppercase or lowercase, a left or right punctuation mark, or whether it
 								    |  resembles a number or email address. Most of these functions, like
 								    |  #[code is_lower] or #[code like_url], should be language-independent.
 								    |  Others, like #[code like_num] (which includes both digits and number
 								    |  words), require some customisation.
 								+aside("Best practices")
 								    |  Keep in mind that these functions are only intended to be an approximation.
 								    |  It's always better to prioritise simplicity and performance over covering
 								    |  very specific edge cases.#[br]#[br]
 								    |  English number words are pretty simple, because even large numbers
 								    |  consist of individual tokens, and we can get away with splitting and
 								    |  matching strings against a list. In other languages, like German, "two
 								    |  hundred and thirty-four" is one word, and thus one token. Here, it's best
 								    |  to match a string against a list of number word fragments (instead of a
 								    |  technically almost infinite list of possible number words).
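 								p
 								    |  A minimal sketch of the fragment-matching idea, assuming a small,
 								    |  incomplete list of German number word fragments. The list and the
 								    |  #[code like_num_fragments] function are hypothetical and not taken
 								    |  from spaCy's German language data.

```python
import re

# Hypothetical, incomplete list of German number word fragments.
_num_fragments = ["ein", "zwei", "drei", "vier", "und", "zwanzig",
                  "dreißig", "hundert", "tausend"]

# One or more fragments in a row, covering the whole string.
_fragments_re = re.compile(
    "(?:" + "|".join(re.escape(f) for f in _num_fragments) + ")+"
)


def like_num_fragments(text):
    """Check whether text consists only of known number word fragments."""
    return _fragments_re.fullmatch(text.lower()) is not None
```

 								p
 								    |  In keeping with the advice above, this is only an approximation: it
 								    |  accepts "zweihundertvierunddreißig", but it would also accept a
 								    |  lone "und".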
 								p
 								    |  Here's an example from the English
 								    |  #[+src(gh("spaCy", "spacy/lang/en/lex_attrs.py")) #[code lex_attrs.py]]:
 								+code("lex_attrs.py").
 								    from ...attrs import LIKE_NUM
 								    _num_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
 								                  'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
 								                  'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty',
 								                  'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety',
 								                  'hundred', 'thousand', 'million', 'billion', 'trillion', 'quadrillion',
 								                  'gajillion', 'bazillion']
 								    def like_num(text):
 								        text = text.replace(',', '').replace('.', '')
 								        if text.isdigit():
 								            return True
 								        if text.count('/') == 1:
 								            num, denom = text.split('/')
 								            if num.isdigit() and denom.isdigit():
 								                return True
 								        if text.lower() in _num_words:
 								            return True
 								        return False
 								    LEX_ATTRS = {
 								        LIKE_NUM: like_num
 								    }
 								p
 								    |  Updating the default lexical attributes with a custom #[code LEX_ATTRS]
 								    |  dictionary in the language's defaults via
 								    |  #[code lex_attr_getters.update(LEX_ATTRS)] overwrites only the functions
 								    |  you have customised. All other default getters stay in place.
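 								p
 								    |  The effect of the update can be sketched with plain dictionaries. The
 								    |  string keys below are stand-ins for spaCy's attribute IDs, and all
 								    |  function names are hypothetical.

```python
# Stand-ins for attribute IDs (assumption: spaCy uses symbols such as
# LIKE_NUM and IS_LOWER from spacy.attrs instead of strings).
LIKE_NUM, IS_LOWER = "LIKE_NUM", "IS_LOWER"


def default_like_num(text):
    return text.isdigit()


def default_is_lower(text):
    return text.islower()


def custom_like_num(text):
    # Hypothetical override that also accepts a few number words.
    return text.isdigit() or text.lower() in {"one", "two", "three"}


lex_attr_getters = {LIKE_NUM: default_like_num, IS_LOWER: default_is_lower}
LEX_ATTRS = {LIKE_NUM: custom_like_num}

# Only the keys present in LEX_ATTRS are replaced.
lex_attr_getters.update(LEX_ATTRS)
```

 								p
 								    |  After the update, #[code LIKE_NUM] points to the custom function, while
 								    |  #[code IS_LOWER] still uses its default getter.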
 								+h(3, "syntax-iterators") Syntax iterators
 								p
 								    |  Syntax iterators are functions that compute views of a #[code Doc]
 								    |  object based on its syntax. At the moment, this data is only used for
 								    |  extracting
 								    |  #[+a("/usage/linguistic-features#noun-chunks") noun chunks], which
 								    |  are available as the #[+api("doc#noun_chunks") #[code Doc.noun_chunks]]
 								    |  property. Because base noun phrases work differently across languages,
 								    |  the rules to compute them are part of the individual language's data. If
 								    |  a language does not include a noun chunks iterator, the property won't
 								    |  be available. For examples, see the existing syntax iterators:
 								+aside-code("Noun chunks example").
 								    doc = nlp(u'A phrase with another phrase occurs.')
 								    chunks = list(doc.noun_chunks)
 								    assert chunks[0].text == "A phrase"
 								    assert chunks[1].text == "another phrase"
 								+table(["Language", "Code", "Source"])
 								    for lang in ["en", "de", "fr", "es"]
 								        +row
 								            +cell=LANGUAGES[lang]
 								            +cell #[code=lang]
 								            +cell
 								                +src(gh("spaCy", "spacy/lang/" + lang + "/syntax_iterators.py"))
 								                    code lang/#{lang}/syntax_iterators.py
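 								p
 								    |  To give a feel for what such an iterator computes, here is a toy version
 								    |  that works on a plain list of (text, tag) pairs instead of a
 								    |  #[code Doc]. It only looks at coarse part-of-speech tags, which is an
 								    |  assumption made for brevity. The iterators linked above are more
 								    |  involved, and the function name and input format here are hypothetical.

```python
def toy_noun_chunks(tagged_tokens):
    """Yield (start, end) offsets for DET? ADJ* NOUN+ sequences."""
    i = 0
    n = len(tagged_tokens)
    while i < n:
        start = i
        # Optional determiner.
        if tagged_tokens[i][1] == "DET":
            i += 1
        # Any number of adjectives.
        while i < n and tagged_tokens[i][1] == "ADJ":
            i += 1
        # At least one noun is required to form a chunk.
        noun_start = i
        while i < n and tagged_tokens[i][1] == "NOUN":
            i += 1
        if i > noun_start:
            yield start, i
        elif i == start:
            i += 1  # nothing matched at this position, move on


tokens = [("A", "DET"), ("phrase", "NOUN"), ("with", "ADP"),
          ("another", "DET"), ("phrase", "NOUN"), ("occurs", "VERB"),
          (".", "PUNCT")]
chunks = [" ".join(text for text, _ in tokens[start:end])
          for start, end in toy_noun_chunks(tokens)]
```

 								p
 								    |  The generator yields offsets rather than strings, so the caller decides
 								    |  how to slice the underlying sequence.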
 								+h(3, "lemmatizer") Lemmatizer
 								    +tag-new(2)
 								p
 								    |  As of v2.0, spaCy supports simple lookup-based lemmatization. This is
 								    |  usually the quickest and easiest way to get started. The data is stored
 								    |  in a dictionary mapping a string to its lemma. To determine a token's
 								    |  lemma, spaCy simply looks it up in the table. Here's an example from
 								    |  the Spanish language data:
 								+code("lang/es/lemmatizer.py (excerpt)").
 								    LOOKUP = {
 								        "aba": "abar",
 								        "ababa": "abar",
 								        "ababais": "abar",
 								        "ababan": "abar",
 								        "ababanes": "ababán",
 								        "ababas": "abar",
 								        "ababoles": "ababol",
 								        "ababábites": "ababábite"
 								    }
 								p
 								    |  To provide a lookup lemmatizer for your language, import the lookup table
 								    |  and add it to the #[code Language] class as #[code lemma_lookup]:
 								+code.
 								    lemma_lookup = dict(LOOKUP)
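 								p
 								    |  Conceptually, the lookup amounts to a dictionary access. The sketch
 								    |  below assumes that strings missing from the table are returned
 								    |  unchanged; #[code lookup_lemma] is a hypothetical helper, not part of
 								    |  spaCy's API.

```python
# Small excerpt of a lookup table, mapping a string to its lemma.
LOOKUP = {
    "aba": "abar",
    "ababa": "abar",
    "ababoles": "ababol",
}


def lookup_lemma(string, table=LOOKUP):
    # Assumption: strings that are not in the table are returned as-is.
    return table.get(string, string)
```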
 								+h(3, "tag-map") Tag map
 								p
 								    |  Most treebanks define a custom part-of-speech tag scheme, striking a
 								    |  balance between level of detail and ease of prediction. While it's
 								    |  useful to have custom tagging schemes, it's also useful to have a common
 								    |  scheme, to which the more specific tags can be related. The tagger can
 								    |  learn a tag scheme with any arbitrary symbols. However, you need to
 								    |  define how those symbols map down to the
 								    |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies tag set].
 								    |  This is done by providing a tag map.
 								p
 								    |  The keys of the tag map should be #[strong strings in your tag set]. The
 								    |  values should be dictionaries. Each dictionary must have an entry #[code POS]
 								    |  whose value is one of the
 								    |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
 								    |  tags. Optionally, you can also include morphological features or other
 								    |  token attributes in the tag map. This allows you to do simple
 								    |  #[+a("/usage/linguistic-features#rule-based-morphology") rule-based morphological analysis].
 								+code("Example").
 								    from ..symbols import POS, NOUN, VERB, DET
 								    TAG_MAP = {
 								        "NNS":  {POS: NOUN, "Number": "plur"},
 								        "VBG":  {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
 								        "DT":   {POS: DET}
 								    }
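 								p
 								    |  As an illustration of how such a table can be read, the sketch below
 								    |  splits an entry into its coarse part-of-speech tag and the remaining
 								    |  features. Plain strings stand in for spaCy's symbols, and
 								    |  #[code resolve_tag] is a hypothetical helper.

```python
POS = "POS"  # stand-in for the POS symbol imported from spaCy's symbols

TAG_MAP = {
    "NNS": {POS: "NOUN", "Number": "plur"},
    "VBG": {POS: "VERB", "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT": {POS: "DET"},
}


def resolve_tag(tag):
    """Split a tag map entry into its coarse POS and remaining features."""
    entry = dict(TAG_MAP[tag])  # copy, so the table itself is not modified
    pos = entry.pop(POS)
    return pos, entry
```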
 								+h(3, "morph-rules") Morph rules
 								p
 								    |  The morphology rules let you set token attributes such as lemmas, keyed
 								    |  by the extended part-of-speech tag and token text. The morphological
 								    |  features and their possible values are language-specific and based on the
 								    |  #[+a("http://universaldependencies.org") Universal Dependencies scheme].
 								+code("Example").
 								    from ..symbols import LEMMA
 								    MORPH_RULES = {
 								        "VBZ": {
 								            "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
 								            "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
 								            "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
 								            "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
 								            "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
 								        }
 								    }
 								p
 								    |  In the example of #[code "am"], the attributes look like this:
 								+table(["Attribute", "Description"])
 								    +row
 								        +cell #[code LEMMA: "be"]
 								        +cell Base form, e.g. "to be".
 								    +row
 								        +cell #[code "VerbForm": "Fin"]
 								        +cell
 								            |  Finite verb. Finite verbs have a subject and can be the root of
 								            |  an independent clause – "I am." is a valid, complete
 								            |  sentence.
 								    +row
 								        +cell #[code "Person": "One"]
 								        +cell First person, i.e. "#[strong I] am".
 								    +row
 								        +cell #[code "Tense": "Pres"]
 								        +cell
 								            |  Present tense, i.e. actions that are happening right now or
 								            |  actions that usually happen.
 								    +row
 								        +cell #[code "Mood": "Ind"]
 								        +cell
 								            |  Indicative, i.e. something happens, has happened or will happen
 								            |  (as opposed to imperative or conditional).
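 								p
 								    |  Since the rules are keyed first by tag and then by token text, reading
 								    |  them is a two-step dictionary lookup. A minimal sketch, using a string
 								    |  stand-in for the #[code LEMMA] symbol and a hypothetical
 								    |  #[code morph_attrs] helper:

```python
LEMMA = "LEMMA"  # stand-in for the LEMMA symbol imported from spaCy's symbols

MORPH_RULES = {
    "VBZ": {
        "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One",
               "Tense": "Pres", "Mood": "Ind"},
        "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three",
               "Tense": "Pres", "Mood": "Ind"},
    }
}


def morph_attrs(tag, text):
    """Return the attributes for a (tag, text) pair, or an empty dict."""
    return MORPH_RULES.get(tag, {}).get(text, {})
```

 								p
 								    |  Returning an empty dictionary for unknown combinations is a choice made
 								    |  for this sketch, so that callers don't need to handle missing keys.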
 								+infobox("Important note", "⚠️")
 								    |  The morphological attributes are currently #[strong not all used by spaCy].
 								    |  Full integration is still being developed. In the meantime, it can still
 								    |  be useful to add them, especially if the language you're adding includes
 								    |  important distinctions and special cases. This ensures that as soon as
 								    |  full support is introduced, your language will be able to assign all
 								    |  possible attributes.