---
title: Adding Languages
next: /usage/training
menu:
  - ['Language Data', 'language-data']
  - ['Testing', 'testing']
  - ['Training', 'training']
---

Adding full support for a language touches many different parts of the spaCy
library. This guide explains how to fit everything together, and points you to
the specific workflows for each component.

> #### Working on spaCy's source
>
> To add a new language to spaCy, you'll need to **modify the library's code**.
> The easiest way to do this is to clone the
> [repository](https://github.com/explosion/spaCy/tree/master/) and **build
> spaCy from source**. For more information on this, see the
> [installation guide](/usage). Unlike spaCy's core, which is mostly written in
> Cython, all language data is stored in regular Python files. This means that
> you won't have to rebuild anything in between – you can simply make edits and
> reload spaCy to test them.

<Grid cols={2}>

<div>

Obviously, there are lots of ways you can organize your code when you implement
your own language data. This guide will focus on how it's done within spaCy. For
full language support, you'll need to create a `Language` subclass, define
custom **language data**, like a stop list and tokenizer exceptions, and test
the new tokenizer. Once the language is set up, you can **build the
vocabulary**, including word frequencies, Brown clusters and word vectors.
Finally, you can **train the tagger and parser**, and save the model to a
directory.

For some languages, you may also want to develop a solution for lemmatization
and morphological analysis.

</div>

<Infobox title="Table of Contents">

- [Language data 101](#101)
- [The Language subclass](#language-subclass)
- [Stop words](#stop-words)
- [Tokenizer exceptions](#tokenizer-exceptions)
- [Norm exceptions](#norm-exceptions)
- [Lexical attributes](#lex-attrs)
- [Syntax iterators](#syntax-iterators)
- [Lemmatizer](#lemmatizer)
- [Tag map](#tag-map)
- [Morph rules](#morph-rules)
- [Testing the language](#testing)
- [Training](#training)

</Infobox>

</Grid>

## Language data {#language-data}

import LanguageData101 from 'usage/101/\_language-data.md'

<LanguageData101 />

The individual components **expose variables** that can be imported within a
language module, and added to the language's `Defaults`. Some components, like
the punctuation rules, usually don't need much customization and can be imported
from the global rules. Others, like the tokenizer and norm exceptions, are very
specific and will make a big difference to spaCy's performance on the particular
language and to training a language model.

| Variable                                  | Type  | Description                                                                                                |
| ----------------------------------------- | ----- | ---------------------------------------------------------------------------------------------------------- |
| `STOP_WORDS`                              | set   | Individual words.                                                                                          |
| `TOKENIZER_EXCEPTIONS`                    | dict  | Keyed by strings, mapped to a list of dicts with token attributes, one dict per token.                     |
| `TOKEN_MATCH`                             | regex | Regexes to match complex tokens, e.g. URLs.                                                                |
| `NORM_EXCEPTIONS`                         | dict  | Keyed by strings, mapped to their norms.                                                                   |
| `TOKENIZER_PREFIXES`                      | list  | Strings or regexes, usually not customized.                                                                |
| `TOKENIZER_SUFFIXES`                      | list  | Strings or regexes, usually not customized.                                                                |
| `TOKENIZER_INFIXES`                       | list  | Strings or regexes, usually not customized.                                                                |
| `LEX_ATTRS`                               | dict  | Attribute ID mapped to function.                                                                           |
| `SYNTAX_ITERATORS`                        | dict  | Iterator ID mapped to function. Currently only supports `'noun_chunks'`.                                   |
| `LOOKUP`                                  | dict  | Keyed by strings, mapped to their lemma.                                                                   |
| `LEMMA_RULES`, `LEMMA_INDEX`, `LEMMA_EXC` | dict  | Lemmatization rules, keyed by part of speech.                                                              |
| `TAG_MAP`                                 | dict  | Keyed by strings, mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| `MORPH_RULES`                             | dict  | Keyed by strings, mapped to a dict of their morphological features.                                        |

> #### Should I ever update the global data?
>
> Reusable language data is collected as atomic pieces in the root of the
> [`spacy.lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang)
> module. Often, when a new language is added, you'll find a pattern or symbol
> that's missing. Even if it isn't common in other languages, it might be best
> to add it to the shared language data, unless it has some conflicting
> interpretation. For instance, we don't expect to see guillemet quotation
> marks (`»` and `«`) in English text. But if we do see them, we'd probably
> prefer the tokenizer to split them off.

<Infobox title="For languages with non-Latin characters">

In order for the tokenizer to split suffixes, prefixes and infixes, spaCy needs
to know the language's character set. If the language you're adding uses
non-Latin characters, you might need to define the required character classes in
the global
[`char_classes.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py).
For efficiency, spaCy uses hard-coded unicode ranges to define character
classes, the definitions of which can be found on
[Wikipedia](https://en.wikipedia.org/wiki/Unicode_block). If the language
requires very specific punctuation rules, you should consider overwriting the
default regular expressions with your own in the language's `Defaults`.

</Infobox>

### Creating a `Language` subclass {#language-subclass}

Language-specific code and resources should be organized into a sub-package of
spaCy, named according to the language's
[ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). For instance,
code and resources specific to Spanish are placed into a directory
`spacy/lang/es`, which can be imported as `spacy.lang.es`.

To get started, you can use our
[templates](https://github.com/explosion/spacy-dev-resources/tree/master/templates/new_language)
for the most important files. Here's what the class template looks like:

```python
### __init__.py (excerpt)
# import language-specific data
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .lex_attrs import LEX_ATTRS

from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...language import Language
from ...attrs import LANG
from ...util import update_exc

# Create Defaults class in the module scope (necessary for pickling!)
class XxxxxDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters[LANG] = lambda text: "xx"  # language ISO code

    # Optional: replace flags with custom functions, e.g. like_num()
    lex_attr_getters.update(LEX_ATTRS)

    # Merge base exceptions and custom tokenizer exceptions
    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
    stop_words = STOP_WORDS

# Create actual Language class
class Xxxxx(Language):
    lang = "xx"  # language ISO code
    Defaults = XxxxxDefaults  # override defaults

# Set default export – this allows the language class to be lazy-loaded
__all__ = ["Xxxxx"]
```

<Infobox title="Why lazy-loading?">

Some languages contain large volumes of custom data, like lemmatizer lookup
tables, or complex regular expressions that are expensive to compute. As of
spaCy v2.0, `Language` classes are not imported on initialization and are only
loaded when you import them directly, or load a model that requires a language
to be loaded. To lazy-load languages in your application, you can use the
[`util.get_lang_class`](/api/top-level#util.get_lang_class) helper function with
the two-letter language code as its argument.

</Infobox>

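The pattern behind this is just a deferred import. Here's a minimal, hypothetical stand-in (not spaCy's actual implementation – `load_class` is an illustrative helper, demonstrated with a stdlib module in place of a language package):

```python
import importlib

def load_class(module_path, class_name):
    # Nothing is imported until this is called, so unused modules
    # never pay their import cost.
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# A stdlib module standing in for a spacy.lang subpackage:
decoder_cls = load_class("json.decoder", "JSONDecoder")
```

In spaCy itself, you'd simply call `util.get_lang_class("xx")` and let the library resolve the module path for you.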
### Stop words {#stop-words}

A ["stop list"](https://en.wikipedia.org/wiki/Stop_words) is a classic trick
from the early days of information retrieval when search was largely about
keyword presence and absence. It is still sometimes useful today to filter out
common words from a bag-of-words model. To improve readability, `STOP_WORDS`
are separated by spaces and newlines, and added as a multiline string.

> #### What does spaCy consider a stop word?
>
> There's no particularly principled logic behind what words should be added to
> the stop list. Make a list that you think might be useful to people and is
> likely to be unsurprising. As a rule of thumb, words that are very rare are
> unlikely to be useful stop words.

```python
### Example
STOP_WORDS = set("""
a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at

back be became because become becomes becoming been before beforehand behind
being below beside besides between beyond both bottom but by
""".split())
```

<Infobox title="Important note" variant="warning">

When adding stop words from an online source, always **include the link** in a
comment. Make sure to **proofread** and double-check the words carefully. A lot
of the lists available online have been passed around for years and often
contain mistakes, like unicode errors or random words that were once added for
a specific use case, but don't actually qualify.

</Infobox>

### Tokenizer exceptions {#tokenizer-exceptions}

spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works)
lets you deal with whitespace-delimited chunks separately. This makes it easy to
define special-case rules, without worrying about how they interact with the
rest of the tokenizer. Whenever the key string is matched, the special-case rule
is applied, giving the defined sequence of tokens. You can also attach
attributes to the subtokens covered by your special case, such as the subtokens'
`LEMMA` or `TAG`.

Tokenizer exceptions can be added in the following format:

```python
### tokenizer_exceptions.py (excerpt)
TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}]
}
```

<Infobox title="Important note" variant="warning">

If an exception consists of more than one token, the `ORTH` values combined
always need to **match the original string**. The way the original string is
split up can be pretty arbitrary sometimes – for example `"gonna"` is split into
`"gon"` (lemma "go") and `"na"` (lemma "to"). Because of how the tokenizer
works, it's currently not possible to split single-letter strings into multiple
tokens.

</Infobox>

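Since this constraint is easy to get wrong when writing exceptions by hand, a small validation helper can catch mismatches early. This is an illustrative sketch only, using a plain string in place of spaCy's `ORTH` attribute ID:

```python
ORTH = "orth"  # plain-string stand-in for spacy.attrs.ORTH

def check_exceptions(exc):
    # Every exception's subtoken texts must join back to the original key.
    for key, tokens in exc.items():
        joined = "".join(t[ORTH] for t in tokens)
        if joined != key:
            raise ValueError("ORTH values %r don't match key %r" % (joined, key))

check_exceptions({"gonna": [{ORTH: "gon"}, {ORTH: "na"}]})  # passes
```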
Unambiguous abbreviations, like month names or locations in English, should be
added to exceptions with a lemma assigned, for example
`{ORTH: "Jan.", LEMMA: "January"}`. Since the exceptions are added in Python,
you can use custom logic to generate them more efficiently and make your data
less verbose. How you do this ultimately depends on the language. Here's an
example of how exceptions for time formats like "1a.m." and "1am" are generated
in the English
[`tokenizer_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tokenizer_exceptions.py):

```python
### tokenizer_exceptions.py (excerpt)
# use short, internal variable for readability
_exc = {}

for h in range(1, 12 + 1):
    for period in ["a.m.", "am"]:
        # always keep an eye on string interpolation!
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "a.m."}]
    for period in ["p.m.", "pm"]:
        _exc["%d%s" % (h, period)] = [
            {ORTH: "%d" % h},
            {ORTH: period, LEMMA: "p.m."}]

# only declare this at the bottom
TOKENIZER_EXCEPTIONS = _exc
```

> #### Generating tokenizer exceptions
>
> Keep in mind that generating exceptions only makes sense if there's a clearly
> defined and **finite number** of them, like common contractions in English.
> This is not always the case – in Spanish for instance, infinitive or
> imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In
> cases like this, spaCy shouldn't be generating exceptions for _all verbs_.
> Instead, this will be handled at a later stage during lemmatization.

When adding the tokenizer exceptions to the `Defaults`, you can use the
[`update_exc`](/api/top-level#util.update_exc) helper function to merge them
with the global base exceptions (including one-letter abbreviations and
emoticons). The function performs a basic check to make sure exceptions are
provided in the correct format. It can take any number of exception dicts as
its arguments, and will update and overwrite the exceptions in this order. For
example, if your language's tokenizer exceptions include a custom tokenization
pattern for "a.", it will overwrite the base exceptions with the language's
custom one.

```python
### Example
from ...util import update_exc

BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]}
TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", LEMMA: "all"}]}

tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
# {"a.": [{ORTH: "a.", LEMMA: "all"}], ":)": [{ORTH: ":)"}]}
```

<Infobox title="About spaCy's custom pronoun lemma" variant="warning">

Unlike verbs and common nouns, there's no clear base form of a personal pronoun.
Should the lemma of "me" be "I", or should we normalize person as well, giving
"it" – or maybe "he"? spaCy's solution is to introduce a novel symbol, `-PRON-`,
which is used as the lemma for all personal pronouns.

</Infobox>

### Norm exceptions {#norm-exceptions new="2"}

In addition to `ORTH` or `LEMMA`, tokenizer exceptions can also set a `NORM`
attribute. This is useful to specify a normalized version of the token – for
example, the norm of "n't" is "not". By default, a token's norm equals its
lowercase text. If the lowercase spelling of a word exists, norms should always
be in lowercase.

> #### Norms vs. lemmas
>
> ```python
> doc = nlp(u"I'm gonna realise")
> norms = [token.norm_ for token in doc]
> lemmas = [token.lemma_ for token in doc]
> assert norms == ["i", "am", "going", "to", "realize"]
> assert lemmas == ["i", "be", "go", "to", "realise"]
> ```

spaCy usually tries to normalize words with different spellings to a single,
common spelling. This has no effect on any other token attributes, or
tokenization in general, but it ensures that **equivalent tokens receive similar
representations**. This can improve the model's predictions on words that
weren't common in the training data, but are equivalent to other words – for
example, "realise" and "realize", or "thx" and "thanks".

Similarly, spaCy also includes
[global base norms](https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py)
for normalizing different styles of quotation marks and currency symbols. Even
though `$` and `€` are very different, spaCy normalizes them both to `$`. This
way, they'll always be seen as similar, no matter how common they were in the
training data.

Norm exceptions can be provided as a simple dictionary. For more examples, see
the English
[`norm_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/en/norm_exceptions.py).

```python
### Example
NORM_EXCEPTIONS = {
    "cos": "because",
    "fav": "favorite",
    "accessorise": "accessorize",
    "accessorised": "accessorized"
}
```

To add the custom norm exceptions lookup table, you can use the `add_lookups()`
helper function. It takes the default attribute getter function as its first
argument, plus a variable list of dictionaries. If a string is found in one of
the dictionaries, its mapped value is used – otherwise, the default function is
called and the token is assigned its default norm.

```python
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM],
                                     NORM_EXCEPTIONS, BASE_NORMS)
```

The order of the dictionaries is also the lookup order – so if your language's
norm exceptions overwrite any of the global exceptions, they should be added
first. Also note that the tokenizer exceptions will always have priority over
the attribute getters.

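The lookup order is easiest to see with a simplified, pure-Python stand-in for `add_lookups` (the tables below are hypothetical, and lowercasing stands in for the default norm getter):

```python
def add_lookups(default_func, *tables):
    # Try each table in the order given; fall back to the default getter.
    def get_norm(string):
        for table in tables:
            if string in table:
                return table[string]
        return default_func(string)
    return get_norm

BASE_NORMS = {"’": "'"}               # hypothetical global norms
NORM_EXCEPTIONS = {"cos": "because"}  # hypothetical language-specific norms

# Language-specific table first, so it wins over the global one
get_norm = add_lookups(lambda s: s.lower(), NORM_EXCEPTIONS, BASE_NORMS)
assert get_norm("cos") == "because"  # found in NORM_EXCEPTIONS
assert get_norm("’") == "'"          # falls through to BASE_NORMS
assert get_norm("Hello") == "hello"  # default: lowercase
```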
### Lexical attributes {#lex-attrs new="2"}

spaCy provides a range of [`Token` attributes](/api/token#attributes) that
return useful information on that token – for example, whether it's uppercase or
lowercase, a left or right punctuation mark, or whether it resembles a number or
email address. Most of these functions, like `is_lower` or `like_url`, should be
language-independent. Others, like `like_num` (which includes both digits and
number words), require some customization.

> #### Best practices
>
> Keep in mind that those functions are only intended to be an approximation.
> It's always better to prioritize simplicity and performance over covering very
> specific edge cases.
>
> English number words are pretty simple, because even large numbers consist of
> individual tokens, and we can get away with splitting and matching strings
> against a list. In other languages, like German, "two hundred and thirty-four"
> is one word, and thus one token. Here, it's best to match a string against a
> list of number word fragments (instead of a technically almost infinite list
> of possible number words).

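The fragment-matching idea sketches out as a small dynamic program over prefixes. This is illustrative only – the fragment list is abbreviated and `like_compound_num` is a hypothetical helper, not spaCy's code:

```python
# Abbreviated, illustrative fragment list – a real one would cover all
# German number words.
_num_fragments = ["zwei", "vier", "und", "hundert", "dreißig"]

def like_compound_num(text):
    # True if the text can be segmented entirely into number-word fragments.
    text = text.lower()
    reachable = [True] + [False] * len(text)
    for i in range(1, len(text) + 1):
        for frag in _num_fragments:
            if (i >= len(frag) and reachable[i - len(frag)]
                    and text[i - len(frag):i] == frag):
                reachable[i] = True
                break
    return bool(text) and reachable[len(text)]

assert like_compound_num("zweihundertvierunddreißig")  # "234" as one word
assert not like_compound_num("zweihand")
```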
Here's an example from the English
[`lex_attrs.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py):

```python
### lex_attrs.py
_num_words = ["zero", "one", "two", "three", "four", "five", "six", "seven",
              "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
              "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty",
              "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
              "hundred", "thousand", "million", "billion", "trillion", "quadrillion",
              "gajillion", "bazillion"]

def like_num(text):
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    return False

LEX_ATTRS = {
    LIKE_NUM: like_num
}
```

By updating the default lexical attributes with a custom `LEX_ATTRS` dictionary
in the language's defaults via `lex_attr_getters.update(LEX_ATTRS)`, only the
new custom functions are overwritten.

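Because the merge is a plain `dict.update`, only the keys you provide are replaced and every other default getter stays intact – a quick sketch, with string keys standing in for spaCy's attribute IDs:

```python
# Hypothetical default getters (string keys stand in for attribute IDs)
defaults = {
    "like_num": lambda text: text.isdigit(),
    "is_lower": lambda text: text.islower(),
}
LEX_ATTRS = {
    "like_num": lambda text: text.isdigit() or text.lower() in ("one", "two"),
}

lex_attr_getters = dict(defaults)
lex_attr_getters.update(LEX_ATTRS)  # only "like_num" is replaced

assert lex_attr_getters["like_num"]("two")                   # custom wins
assert lex_attr_getters["is_lower"] is defaults["is_lower"]  # default kept
```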
### Syntax iterators {#syntax-iterators}

Syntax iterators are functions that compute views of a `Doc` object based on its
syntax. At the moment, this data is only used for extracting
[noun chunks](/usage/linguistic-features#noun-chunks), which are available as
the [`Doc.noun_chunks`](/api/doc#noun_chunks) property. Because base noun
phrases work differently across languages, the rules to compute them are part of
the individual language's data. If a language does not include a noun chunks
iterator, the property won't be available. For examples, see the existing syntax
iterators:

> #### Noun chunks example
>
> ```python
> doc = nlp(u"A phrase with another phrase occurs.")
> chunks = list(doc.noun_chunks)
> assert chunks[0].text == u"A phrase"
> assert chunks[1].text == u"another phrase"
> ```

| Language | Code | Source                                                                                                            |
| -------- | ---- | ----------------------------------------------------------------------------------------------------------------- |
| English  | `en` | [`lang/en/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py) |
| German   | `de` | [`lang/de/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/de/syntax_iterators.py) |
| French   | `fr` | [`lang/fr/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/syntax_iterators.py) |
| Spanish  | `es` | [`lang/es/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/es/syntax_iterators.py) |

### Lemmatizer {#lemmatizer new="2"}

As of v2.0, spaCy supports simple lookup-based lemmatization. This is usually
the quickest and easiest way to get started. The data is stored in a dictionary
mapping a string to its lemma. To determine a token's lemma, spaCy simply looks
it up in the table. Here's an example from the Spanish language data:

```python
### lang/es/lemmatizer.py (excerpt)
LOOKUP = {
    "aba": "abar",
    "ababa": "abar",
    "ababais": "abar",
    "ababan": "abar",
    "ababanes": "ababán",
    "ababas": "abar",
    "ababoles": "ababol",
    "ababábites": "ababábite"
}
```

To provide a lookup lemmatizer for your language, import the lookup table and
add it to the `Language` class as `lemma_lookup`:

```python
lemma_lookup = LOOKUP
```

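The lookup strategy itself reduces to a dictionary `get` with the original string as the fallback – a minimal sketch (`lemmatize_lookup` is a hypothetical helper, not spaCy's API):

```python
LOOKUP = {"ababa": "abar", "ababoles": "ababol"}

def lemmatize_lookup(string, lookup):
    # Return the table entry if present, otherwise the string unchanged.
    return lookup.get(string, string)

assert lemmatize_lookup("ababa", LOOKUP) == "abar"
assert lemmatize_lookup("spacy", LOOKUP) == "spacy"  # out-of-table fallback
```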
### Tag map {#tag-map}

Most treebanks define a custom part-of-speech tag scheme, striking a balance
between level of detail and ease of prediction. While it's useful to have custom
tagging schemes, it's also useful to have a common scheme, to which the more
specific tags can be related. The tagger can learn a tag scheme with any
arbitrary symbols. However, you need to define how those symbols map down to the
[Universal Dependencies tag set](http://universaldependencies.org/u/pos/all.html).
This is done by providing a tag map.

The keys of the tag map should be **strings in your tag set**. The values should
be a dictionary. The dictionary must have an entry `POS` whose value is one of
the [Universal Dependencies](http://universaldependencies.org/u/pos/all.html)
tags. Optionally, you can also include morphological features or other token
attributes in the tag map. This allows you to do simple
[rule-based morphological analysis](/usage/linguistic-features#rule-based-morphology).

```python
### Example
from ..symbols import POS, NOUN, VERB, DET

TAG_MAP = {
    "NNS":  {POS: NOUN, "Number": "plur"},
    "VBG":  {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT":   {POS: DET}
}
```

### Morph rules {#morph-rules}

The morphology rules let you set token attributes such as lemmas, keyed by the
extended part-of-speech tag and token text. The morphological features and their
possible values are language-specific and based on the
[Universal Dependencies scheme](http://universaldependencies.org).

```python
### Example
from ..symbols import LEMMA

MORPH_RULES = {
    "VBZ": {
        "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
        "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
        "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}
    }
}
```

In the example of `"am"`, the attributes look like this:

| Attribute           | Description                                                                                                                    |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `LEMMA: "be"`       | Base form, e.g. "to be".                                                                                                       |
| `"VerbForm": "Fin"` | Finite verb. Finite verbs have a subject and can be the root of an independent clause – "I am." is a valid, complete sentence. |
| `"Person": "One"`   | First person, i.e. "**I** am".                                                                                                 |
| `"Tense": "Pres"`   | Present tense, i.e. actions that are happening right now or actions that usually happen.                                       |
| `"Mood": "Ind"`     | Indicative, i.e. something happens, has happened or will happen (as opposed to imperative or conditional).                     |

<Infobox title="Important note" variant="warning">

The morphological attributes are currently **not all used by spaCy**. Full
integration is still being developed. In the meantime, it can still be useful to
add them, especially if the language you're adding includes important
distinctions and special cases. This ensures that as soon as full support is
introduced, your language will be able to assign all possible attributes.

</Infobox>

## Testing the new language {#testing}

Before using the new language or submitting a
[pull request](https://github.com/explosion/spaCy/pulls) to spaCy, you should
make sure it works as expected. This is especially important if you've added
custom regular expressions for token matching or punctuation – you don't want
to cause regressions.

<Infobox title="spaCy's test suite">

spaCy uses the [pytest framework](https://docs.pytest.org/en/latest/) for
testing. For more details on how the tests are structured and best practices for
writing your own tests, see our
[tests documentation](https://github.com/explosion/spaCy/tree/master/spacy/tests).

</Infobox>

### Writing language-specific tests {#testing-custom}

It's recommended to always add at least some tests with examples specific to
the language. Language tests should be located in
[`tests/lang`](https://github.com/explosion/spaCy/tree/master/spacy/tests/lang)
in a directory named after the language ID. You'll also need to create a
fixture for your tokenizer in the
[`conftest.py`](https://github.com/explosion/spaCy/tree/master/spacy/tests/conftest.py).
Always use the [`get_lang_class`](/api/top-level#util.get_lang_class) helper
function within the fixture, instead of importing the class at the top of the
file. This will load the language data only when it's needed. (Otherwise, _all
data_ would be loaded every time you run a test.)

```python
@pytest.fixture
def en_tokenizer():
    return util.get_lang_class("en").Defaults.create_tokenizer()
```

When adding test cases, always
[`parametrize`](https://github.com/explosion/spaCy/tree/master/spacy/tests#parameters)
them – this will make it easier for others to add more test cases without
having to modify the test itself. You can also add parameter tuples, for
example, a test sentence and its expected length, or a list of expected tokens.
Here's an example of an English tokenizer test for combinations of punctuation
and abbreviations:

```python
### Example test
@pytest.mark.parametrize('text,length', [
    ("The U.S. Army likes Shock and Awe.", 8),
    ("U.N. regulations are not a part of their concern.", 10),
    ("“Isn't it?”", 6)])
def test_en_tokenizer_handles_punct_abbrev(en_tokenizer, text, length):
    tokens = en_tokenizer(text)
    assert len(tokens) == length
```

## Training a language model {#training}

Much of spaCy's functionality requires models to be trained from labeled data.
For instance, in order to use the named entity recognizer, you need to first
train a model on text annotated with examples of the entities you want to
recognize. The parser, part-of-speech tagger and text categorizer all also
require models to be trained from labeled examples. The word vectors, word
probabilities and word clusters also require training, although these can be
trained from unlabeled text, which tends to be much easier to collect.

### Creating a vocabulary file

spaCy expects that common words will be cached in a [`Vocab`](/api/vocab)
instance. The vocabulary caches lexical features. spaCy loads the vocabulary
from binary data, in order to keep loading efficient. The easiest way to save
out a new binary vocabulary file is to use the `spacy init-model` command, which
expects a JSONL file with words and their lexical attributes. See the docs on
the [vocab JSONL format](/api/annotation#vocab-jsonl) for details.

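As a rough sketch, each line of such a JSONL file is one JSON object describing
a word. The field names used here (`"orth"` for the word text, `"prob"` for its
log probability) are assumptions for illustration – check the
[vocab JSONL format](/api/annotation#vocab-jsonl) docs for the authoritative
schema:

```python
import json

# Hypothetical vocabulary entries; the field names are assumptions for
# illustration, not the authoritative schema.
entries = [
    {"orth": "the", "prob": -3.5},
    {"orth": "dog", "prob": -8.2},
]

# Write one JSON object per line (JSONL).
with open("vocab.jsonl", "w", encoding="utf8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```

The resulting file can then be passed to `spacy init-model` – see
`python -m spacy init-model --help` for the exact arguments your spaCy version
expects.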
#### Training the word vectors {#word-vectors}

[Word2vec](https://en.wikipedia.org/wiki/Word2vec) and related algorithms let
you train useful word similarity models from unlabeled text. This is a key part
of using deep learning for NLP with limited labeled data. The vectors are also
useful by themselves – they power the `.similarity` methods in spaCy. For best
results, you should pre-process the text with spaCy before training the
Word2vec model. This ensures your tokenization will match. You can use our
[word vectors training script](https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py),
which pre-processes the text with your language-specific tokenizer and trains
the model using [Gensim](https://radimrehurek.com/gensim/). The `vectors.bin`
file should consist of one word and vector per line.

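The "one word and vector per line" layout can be sketched like this (the words
and the 3-dimensional vectors are made up purely for illustration; real vectors
typically have 100 or more dimensions):

```python
# Made-up words and tiny 3-dimensional vectors, for illustration only.
vectors = [
    ("dog", [0.12, -0.31, 0.84]),
    ("cat", [0.10, -0.28, 0.79]),
]

# One word per line, followed by its space-separated vector components.
lines = ["{} {}".format(word, " ".join(str(c) for c in vec))
         for word, vec in vectors]

with open("vectors.txt", "w", encoding="utf8") as f:
    f.write("\n".join(lines) + "\n")
```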
```python
https://github.com/explosion/spacy-dev-resources/tree/master/training/word_vectors.py
```

If you don't have a large sample of text available, you can also convert word
vectors produced by a variety of other tools into spaCy's format. See the docs
on [converting word vectors](/usage/vectors-similarity#converting) for details.

### Creating or converting a training corpus

The easiest way to train spaCy's tagger, parser, entity recognizer or text
categorizer is to use the [`spacy train`](/api/cli#train) command-line utility.
In order to use this, you'll need training and evaluation data in the
[JSON format](/api/annotation#json-input) spaCy expects for training.

If your data is in one of the supported formats, the easiest solution might be
to use the [`spacy convert`](/api/cli#convert) command-line utility. This
supports several popular formats, including the IOB format for named entity
recognition, the JSONL format produced by our annotation tool
[Prodigy](https://prodi.gy), and the
[CoNLL-U](http://universaldependencies.org/docs/format.html) format used by the
[Universal Dependencies](http://universaldependencies.org/) corpus.

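For instance, IOB-formatted NER data pairs each token with a `B-`/`I-`/`O`
entity tag. A minimal sketch of reading one such sentence – the exact
`token|tag` layout used here is an assumption for illustration, so check the
[`spacy convert`](/api/cli#convert) docs for the format the converter actually
expects:

```python
# One sentence in a simple IOB layout: whitespace-separated tokens, each
# written as text|tag (this exact layout is an assumption for illustration).
line = "Alex|B-PER is|O visiting|O Berlin|B-LOC"

# Split into (text, tag) pairs.
tokens = [tuple(tok.rsplit("|", 1)) for tok in line.split()]

# Collect the tokens that begin an entity (tags starting with "B-").
entity_starts = [text for text, tag in tokens if tag.startswith("B-")]
```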
One thing to keep in mind is that spaCy expects to train its models from
**whole documents**, not just single sentences. If your corpus only contains
single sentences, spaCy's models will never learn to expect multi-sentence
documents, leading to low performance on real text. To mitigate this problem,
you can use the `-N` argument to the `spacy convert` command, to merge some of
the sentences into longer pseudo-documents.

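Conceptually, the merging works like this sketch – grouping every N
single-sentence examples into one pseudo-document. This is only an illustration
of the idea, not spaCy's actual conversion code:

```python
def merge_sentences(sentences, n):
    """Group single sentences into pseudo-documents of up to n sentences
    each. Conceptual sketch only, not spaCy's implementation."""
    return [" ".join(sentences[i:i + n])
            for i in range(0, len(sentences), n)]

docs = merge_sentences(["One.", "Two.", "Three.", "Four.", "Five."], 2)
```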
### Training the tagger and parser {#train-tagger-parser}

Once you have your training and evaluation data in the format spaCy expects,
you can train your model using spaCy's [`train`](/api/cli#train) command. Note
that training statistical models still involves a degree of trial-and-error.
You may need to tune one or more settings, also called "hyper-parameters", to
achieve optimal performance. See the
[usage guide on training](/usage/training#tagger-parser) for more details.