| //- 💫 DOCS > USAGE > ADDING LANGUAGES
 | ||
| 
 | ||
| include ../../_includes/_mixins
 | ||
| 
 | ||
| p
 | ||
|     |  Adding full support for a language touches many different parts of the
 | ||
|     |  spaCy library. This guide explains how to fit everything together, and
 | ||
|     |  points you to the specific workflows for each component. Obviously,
 | ||
|     |  there are lots of ways you can organise your code when you implement
 | ||
|     |  your own #[+api("language") #[code Language]] class. This guide will
 | ||
|     |  focus on how it's done within spaCy. For full language support, we'll
 | ||
|     |  need to:
 | ||
| 
 | ||
| +list("numbers")
 | ||
|     +item
 | ||
|         |  Create a #[strong #[code Language] subclass] and
 | ||
|         |  #[a(href="#language-subclass") implement it].
 | ||
| 
 | ||
|     +item
 | ||
|         |  Define custom #[strong language data], like a
 | ||
|         |  #[a(href="#stop-words") stop list], #[a(href="#tag-map") tag map]
 | ||
|         |  and #[a(href="#tokenizer-exceptions") tokenizer exceptions].
 | ||
| 
 | ||
|     +item
 | ||
|         |  #[strong Build the vocabulary] including
 | ||
|         |  #[a(href="#word-frequencies") word frequencies],
 | ||
|         |  #[a(href="#brown-clusters") Brown clusters] and
 | ||
|         |  #[a(href="#word-vectors") word vectors].
 | ||
| 
 | ||
|     +item
 | ||
|         |  #[strong Set up] a #[a(href="#model-directory") model directory] and #[strong train] the #[a(href="#train-tagger-parser") tagger and parser].
 | ||
| 
 | ||
| p
 | ||
|     |  For some languages, you may also want to develop a solution for
 | ||
|     |  lemmatization and morphological analysis.
 | ||
| 
 | ||
| +h(2, "language-subclass") Creating a #[code Language] subclass
 | ||
| 
 | ||
| p
 | ||
|     |  Language-specific code and resources should be organised into a
 | ||
|     |  subpackage of spaCy, named according to the language's
 | ||
|     |  #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code].
 | ||
|     |  For instance, code and resources specific to Spanish are placed into a
 | ||
|     |  folder #[code spacy/es], which can be imported as #[code spacy.es].
 | ||
| 
 | ||
| p
 | ||
|     |  To get started, you can use our
 | ||
|     |  #[+src(gh("spacy-dev-resources", "templates/new_language")) templates]
 | ||
|     |  for the most important files. Here's what the class template looks like:
 | ||
| 
 | ||
| +code("__init__.py (excerpt)").
 | ||
|     # Import language-specific data
 | ||
|     from .language_data import *
 | ||
| 
 | ||
|     class Xxxxx(Language):
 | ||
|         lang = 'xx' # ISO code
 | ||
| 
 | ||
|         class Defaults(Language.Defaults):
 | ||
|             lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
 | ||
|             lex_attr_getters[LANG] = lambda text: 'xx'
 | ||
| 
 | ||
|             # override defaults
 | ||
|             tokenizer_exceptions = TOKENIZER_EXCEPTIONS
 | ||
|             tag_map = TAG_MAP
 | ||
|             stop_words = STOP_WORDS
 | ||
| 
 | ||
| p
 | ||
|     |  Additionally, the new #[code Language] class needs to be added to the
 | ||
|     |  list of available languages in #[+src(gh("spaCy", "spacy/__init__.py")) __init__.py].
 | ||
|     |  The languages are then registered using the #[code set_lang_class()] function.
 | ||
| 
 | ||
| +code("spacy/__init__.py").
 | ||
|     from . import en
 | ||
|     from . import xx
 | ||
| 
 | ||
|     _languages = (en.English, ..., xx.Xxxxx)
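| 
| p
|     |  A minimal sketch of the registration step, assuming a plain dictionary
|     |  registry. The #[code Language] base class and both helper functions
|     |  below are stand-ins written for this example, so that it runs without
|     |  spaCy installed. spaCy's own #[code set_lang_class()] and
|     |  #[code get_lang_class()] may behave differently.

```python
# Stand-in registry: maps an ISO code to a language class.
_LANGUAGES = {}


def set_lang_class(name, cls):
    """Register a language class under its ISO code."""
    _LANGUAGES[name] = cls


def get_lang_class(name):
    """Return the class registered for `name`, or fail loudly."""
    if name not in _LANGUAGES:
        raise RuntimeError("Language not supported: %s" % name)
    return _LANGUAGES[name]


class Language(object):
    """Stand-in for spaCy's Language base class."""
    lang = None


class Xxxxx(Language):
    lang = 'xx'  # ISO code


# Register every available language under its own `lang` attribute.
for cls in (Xxxxx,):
    set_lang_class(cls.lang, cls)

found = get_lang_class('xx')
```

| p
|     |  Keying the registry by the class's own #[code lang] attribute keeps the
|     |  ISO code defined in one place only.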
 | ||
| 
 | ||
| p You'll also need to list the new package in #[+src(gh("spaCy", "spacy/setup.py")) setup.py]:
 | ||
| 
 | ||
| +code("spacy/setup.py").
 | ||
|     PACKAGES = [
 | ||
|         'spacy',
 | ||
|         'spacy.tokens',
 | ||
|         'spacy.en',
 | ||
|         'spacy.xx',
 | ||
|         # ...
 | ||
|     ]
 | ||
| 
 | ||
| +h(2, "language-data") Adding language data
 | ||
| 
 | ||
| p
 | ||
|     |  Every language is full of exceptions and special cases, especially
 | ||
|     |  amongst the most common words. Some of these exceptions are shared
 | ||
|     |  between multiple languages, while others are entirely idiosyncratic.
 | ||
|     |  spaCy makes it easy to deal with these exceptions on a case-by-case
 | ||
|     |  basis, by defining simple rules and exceptions. The exceptions data is
 | ||
|     |  defined in Python in the
 | ||
|     |  #[+src(gh("spacy-dev-resources", "templates/new_language")) language data],
 | ||
|     |  so that Python functions can be used to help you generalise and combine
 | ||
|     |  the data as you require.
 | ||
| 
 | ||
| +infobox("For languages with non-Latin characters")
 | ||
|     |  In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
 | ||
|     |  needs to know the language's character set. If the language you're adding
 | ||
|     |  uses non-Latin characters, you might need to add the required character
 | ||
|     |  classes to the global
 | ||
|     |  #[+src(gh("spacy", "spacy/language_data/punctuation.py")) punctuation.py].
 | ||
|     |  spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
 | ||
|     |  to keep this simple and readable. If the language requires very specific
 | ||
|     |  punctuation rules, you should consider overwriting the default regular
 | ||
|     |  expressions with your own in the language's #[code Defaults].
 | ||
| 
 | ||
| +h(3, "stop-words") Stop words
 | ||
| 
 | ||
| p
 | ||
|     |  A #[+a("https://en.wikipedia.org/wiki/Stop_words") "stop list"] is a
 | ||
|     |  classic trick from the early days of information retrieval when search
 | ||
|     |  was largely about keyword presence and absence. It is still sometimes
 | ||
|     |  useful today to filter out common words from a bag-of-words model.
 | ||
| 
 | ||
| +aside("What does spaCy consider a stop word?")
 | ||
|     |  There's no particularly principled logic behind what words should be
 | ||
|     |  added to the stop list. Make a list that you think might be useful
 | ||
|     |  to people and is likely to be unsurprising. As a rule of thumb, words
 | ||
|     |  that are very rare are unlikely to be useful stop words.
 | ||
| 
 | ||
| p
 | ||
|     |  To improve readability, #[code STOP_WORDS] are separated by spaces and
 | ||
|     |  newlines, and added as a multiline string:
 | ||
| 
 | ||
| +code("Example").
 | ||
|     STOP_WORDS = set("""
 | ||
|     a about above across after afterwards again against all almost alone along
 | ||
|     already also although always am among amongst amount an and another any anyhow
 | ||
|     anyone anything anyway anywhere are around as at
 | ||
| 
 | ||
|     back be became because become becomes becoming been before beforehand behind
 | ||
|     being below beside besides between beyond both bottom but by
 | ||
|     """.split())
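| 
| p
|     |  As a rough illustration of how such a stop list might be used, the
|     |  sketch below filters a list of tokens against a shortened version of
|     |  the set above. The #[code remove_stop_words()] helper is hypothetical
|     |  and not part of spaCy.

```python
# Shortened stop list, built the same way: split a multiline string on
# whitespace and collect the words into a set.
STOP_WORDS = set("""
a about above across after afterwards again against all almost alone along

back be became because become becomes becoming been before beforehand behind
""".split())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


filtered = remove_stop_words(["All", "cats", "sleep", "after", "lunch"])
# filtered == ["cats", "sleep", "lunch"]
```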
 | ||
| 
 | ||
| +h(3, "tag-map") Tag map
 | ||
| 
 | ||
| p
 | ||
|     |  Most treebanks define a custom part-of-speech tag scheme, striking a
 | ||
|     |  balance between level of detail and ease of prediction.  While it's
 | ||
|     |  useful to have custom tagging schemes, it's also useful to have a common
 | ||
|     |  scheme, to which the more specific tags can be related. The tagger can
 | ||
|     |  learn a tag scheme with any arbitrary symbols. However, you need to
 | ||
|     |  define how those symbols map down to the
 | ||
|     |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies tag set].
 | ||
|     |  This is done by providing a tag map.
 | ||
| 
 | ||
| p
 | ||
|     |  The keys of the tag map should be #[strong strings in your tag set]. The
 | ||
|     |  values should be dictionaries. Each dictionary must have an entry #[code POS]
 | ||
|     |  whose value is one of the
 | ||
|     |  #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
 | ||
|     |  tags. Optionally, you can also include morphological features or other
 | ||
|     |  token attributes in the tag map. This allows you to do simple
 | ||
|     |  #[+a("/docs/usage/pos-tagging#rule-based-morphology") rule-based morphological analysis].
 | ||
| 
 | ||
| +code("Example").
 | ||
|     TAG_MAP = {
 | ||
|         "NNS":  {POS: NOUN, "Number": "plur"},
 | ||
|         "VBG":  {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
 | ||
|         "DT":   {POS: DET}
 | ||
|     }
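| 
| p
|     |  A small sanity check can catch tag map entries that lack a valid
|     |  #[code POS] value. The sketch below is only an illustration: the
|     |  string constants stand in for spaCy's symbols, the
|     |  #[code check_tag_map()] helper is hypothetical, and the tag list is
|     |  the Universal Dependencies v2 set plus the older #[code CONJ].

```python
# Stand-ins for spaCy's symbols so the sketch runs on its own.
POS, NOUN, VERB, DET = "pos", "NOUN", "VERB", "DET"

# Universal Dependencies v2 tags, plus CONJ from the v1 tag set.
UNIVERSAL_POS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "CONJ", "DET", "INTJ", "NOUN",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

TAG_MAP = {
    "NNS": {POS: NOUN, "Number": "plur"},
    "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT": {POS: DET},
}


def check_tag_map(tag_map):
    """Return the tags whose entry has no POS value from the universal set."""
    return sorted(tag for tag, attrs in tag_map.items()
                  if attrs.get(POS) not in UNIVERSAL_POS)


problems = check_tag_map(TAG_MAP)
```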
 | ||
| 
 | ||
| +h(3, "tokenizer-exceptions") Tokenizer exceptions
 | ||
| 
 | ||
| p
 | ||
|     |  spaCy's #[+a("/docs/usage/customizing-tokenizer#how-tokenizer-works") tokenization algorithm]
 | ||
|     |  lets you deal with whitespace-delimited chunks separately. This makes it
 | ||
|     |  easy to define special-case rules, without worrying about how they
 | ||
|     |  interact with the rest of the tokenizer. Whenever the key string is
 | ||
|     |  matched, the special-case rule is applied, giving the defined sequence of
 | ||
|     |  tokens. You can also attach attributes, such as #[code LEMMA] or
|     |  #[code TAG], to the subtokens covered by your special case.
 | ||
| 
 | ||
| p
 | ||
|     |  Tokenizer exceptions can be added in the following format:
 | ||
| 
 | ||
| +code("language_data.py").
 | ||
|     TOKENIZER_EXCEPTIONS = {
 | ||
|         "don't": [
 | ||
|             {ORTH: "do", LEMMA: "do"},
 | ||
|             {ORTH: "n't", LEMMA: "not", TAG: "RB"}
 | ||
|         ]
 | ||
|     }
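| 
| p
|     |  To illustrate how a special-case rule is applied to a
|     |  whitespace-delimited chunk, here is a deliberately simplified sketch.
|     |  It only does the exception lookup and ignores prefix, suffix and infix
|     |  handling. The constants are stand-ins for spaCy's symbols, and
|     |  #[code tokenize()] is a hypothetical helper, not spaCy's tokenizer.

```python
# Stand-ins for spaCy's attribute symbols.
ORTH, LEMMA, TAG = "orth", "lemma", "tag"

TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not", TAG: "RB"},
    ]
}


def tokenize(text, exceptions):
    """Split on whitespace, then expand chunks that match a special case."""
    tokens = []
    for chunk in text.split():
        if chunk in exceptions:
            # The key matched: emit the defined sequence of subtokens.
            tokens.extend(attrs[ORTH] for attrs in exceptions[chunk])
        else:
            tokens.append(chunk)
    return tokens


tokens = tokenize("I don't know", TOKENIZER_EXCEPTIONS)
# tokens == ["I", "do", "n't", "know"]
```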
 | ||
| 
 | ||
| p
 | ||
|     |  Some exceptions, like certain abbreviations, will always be mapped to a
 | ||
|     |  single token containing only an #[code ORTH] property. To make your data
 | ||
|     |  less verbose, you can use the helper function #[code strings_to_exc()]
 | ||
|     |  with a simple array of strings:
 | ||
| 
 | ||
| +code("Example").
 | ||
|     from ..language_data import update_exc, strings_to_exc
 | ||
| 
 | ||
|     ORTH_ONLY = ["a.", "b.", "c."]
 | ||
|     converted = strings_to_exc(ORTH_ONLY)
 | ||
|     # {"a.": [{ORTH: "a."}], "b.": [{ORTH: "b."}], "c.": [{ORTH: "c."}]}
 | ||
| 
 | ||
|     update_exc(TOKENIZER_EXCEPTIONS, converted)
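| 
| p
|     |  For readers who want to see what the two helpers do, here is a
|     |  self-contained approximation. Both functions are re-implemented for
|     |  this sketch rather than imported from spaCy, and the two consistency
|     |  checks in #[code update_exc()] are assumptions about sensible
|     |  behaviour, not a description of the library's code.

```python
ORTH = "orth"  # stand-in for spaCy's ORTH symbol


def strings_to_exc(orths):
    """Turn a list of strings into single-token exceptions."""
    return {orth: [{ORTH: orth}] for orth in orths}


def update_exc(exc, additions):
    """Merge `additions` into `exc` in place, after basic checks."""
    for key, subtokens in additions.items():
        # Assumed check: the subtoken texts must add up to the key.
        joined = "".join(attrs[ORTH] for attrs in subtokens)
        if joined != key:
            raise ValueError("Key %r does not match subtokens %r" % (key, joined))
    # Assumed check: refuse to overwrite existing exceptions silently.
    overlap = set(exc) & set(additions)
    if overlap:
        raise ValueError("Duplicate exceptions: %s" % sorted(overlap))
    exc.update(additions)


TOKENIZER_EXCEPTIONS = {"don't": [{ORTH: "do"}, {ORTH: "n't"}]}
converted = strings_to_exc(["a.", "b.", "c."])
update_exc(TOKENIZER_EXCEPTIONS, converted)
```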
 | ||
| 
 | ||
| p
 | ||
|     |  Unambiguous abbreviations, like month names or locations in English,
 | ||
|     |  should be added to #[code TOKENIZER_EXCEPTIONS] with a lemma assigned,
 | ||
|     |  for example #[code {ORTH: "Jan.", LEMMA: "January"}].
 | ||
| 
 | ||
| +h(3, "custom-tokenizer-exceptions") Custom tokenizer exceptions
 | ||
| 
 | ||
| p
 | ||
|     |  For language-specific tokenizer exceptions, you can use the
 | ||
|     |  #[code update_exc()] function to update the existing exceptions with a
 | ||
|     |  custom dictionary. This is especially useful for exceptions that follow
 | ||
|     |  a consistent pattern. Instead of adding each exception manually, you can
 | ||
|     |  write a simple function that returns a dictionary of exceptions.
 | ||
| 
 | ||
| p
 | ||
|     |  For example, here's how exceptions for time formats like "1a.m." and
 | ||
|     |  "1am" are generated in the English
 | ||
|     |  #[+src(gh("spaCy", "spacy/en/language_data.py")) language_data.py]:
 | ||
| 
 | ||
| +code("language_data.py").
 | ||
|     from .. import language_data
|     from ..language_data import update_exc
 | ||
| 
 | ||
|     def get_time_exc(hours):
 | ||
|         exc = {}
 | ||
|         for hour in hours:
 | ||
|             exc["%da.m." % hour] = [{ORTH: "%d" % hour}, {ORTH: "a.m."}]
 | ||
|             exc["%dp.m." % hour] = [{ORTH: "%d" % hour}, {ORTH: "p.m."}]
 | ||
|             exc["%dam" % hour]   = [{ORTH: "%d" % hour}, {ORTH: "am", LEMMA: "a.m."}]
 | ||
|             exc["%dpm" % hour]   = [{ORTH: "%d" % hour}, {ORTH: "pm", LEMMA: "p.m."}]
 | ||
|         return exc
 | ||
| 
 | ||
| 
 | ||
|     TOKENIZER_EXCEPTIONS = dict(language_data.TOKENIZER_EXCEPTIONS)
 | ||
| 
 | ||
|     hours = 12
 | ||
|     update_exc(TOKENIZER_EXCEPTIONS, get_time_exc(range(1, hours + 1)))
 | ||
| 
 | ||
| +h(3, "utils") Shared utils
 | ||
| 
 | ||
| p
 | ||
|     |  The #[code spacy.language_data] package provides constants and functions
 | ||
|     |  that can be imported and used across languages.
 | ||
| 
 | ||
| +aside("About spaCy's custom pronoun lemma")
 | ||
|     |  Unlike verbs and common nouns, there's no clear base form of a personal
 | ||
|     |  pronoun. Should the lemma of "me" be "I", or should we normalize person
 | ||
|     |  as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
 | ||
|     |  novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
 | ||
|     |  all personal pronouns.
 | ||
| 
 | ||
| +table(["Name", "Description"])
 | ||
|     +row
 | ||
|         +cell #[code PRON_LEMMA]
 | ||
|         +cell
 | ||
|             |  Special value for pronoun lemmas (#[code "-PRON-"]).
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code DET_LEMMA]
 | ||
|         +cell
 | ||
|             |  Special value for determiner lemmas, used in languages with
 | ||
|             |  inflected determiners (#[code "-DET-"]).
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code ENT_ID]
 | ||
|         +cell
 | ||
|             |  Special value for entity IDs (#[code "ent_id"]).
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code update_exc(exc, additions)]
 | ||
|         +cell
 | ||
|             |  Update an existing dictionary of exceptions #[code exc] with a
 | ||
|             |  dictionary of #[code additions].
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code strings_to_exc(orths)]
 | ||
|         +cell
 | ||
|             |  Convert an array of strings to a dictionary of exceptions of the
 | ||
|             |  format #[code {"string": [{ORTH: "string"}]}].
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code expand_exc(excs, search, replace)]
 | ||
|         +cell
 | ||
|             |  Search for a string #[code search] in a dictionary of exceptions
 | ||
|             |  #[code excs] and if found, copy the entry and replace
 | ||
|             |  #[code search] with #[code replace] in both the key and
 | ||
|             |  #[code ORTH] value. Useful to provide exceptions containing
 | ||
|             |  different versions of special Unicode characters, like
 | ||
|             |  #[code '] and #[code ’].
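| 
| p
|     |  The behaviour described for #[code expand_exc()] in the table above
|     |  could look roughly like the sketch below. It is a re-implementation
|     |  for illustration only. In particular, returning the new entries
|     |  instead of modifying #[code excs] in place is a choice made for this
|     |  example.

```python
ORTH = "orth"  # stand-in for spaCy's ORTH symbol


def expand_exc(excs, search, replace):
    """Copy every exception containing `search`, with `search` swapped for
    `replace` in both the key and the ORTH values."""
    updates = {}
    for key, subtokens in excs.items():
        if search not in key:
            continue
        new_subtokens = []
        for attrs in subtokens:
            fixed = dict(attrs)  # copy, so the original entry is untouched
            fixed[ORTH] = fixed[ORTH].replace(search, replace)
            new_subtokens.append(fixed)
        updates[key.replace(search, replace)] = new_subtokens
    return updates


excs = {"don't": [{ORTH: "do"}, {ORTH: "n't"}], "a.": [{ORTH: "a."}]}
# Add variants that use the typographic apostrophe.
expanded = expand_exc(excs, "'", "’")
```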
 | ||
| 
 | ||
| p
 | ||
|     |  If you've written a custom function that seems like it might be useful
 | ||
|     |  for several languages, consider adding it to
 | ||
|     |  #[+src(gh("spaCy", "spacy/language_data/util.py")) language_data/util.py]
 | ||
|     |  instead of the individual language module.
 | ||
| 
 | ||
| +h(3, "shared-data") Shared language data
 | ||
| 
 | ||
| p
 | ||
|     |  Because languages can vary in quite arbitrary ways, spaCy avoids
 | ||
|     |  organising the language data into an explicit inheritance hierarchy.
 | ||
|     |  Instead, reusable functions and data are collected as atomic pieces in
 | ||
|     |  the #[code spacy.language_data] package.
 | ||
| 
 | ||
| +aside-code("Example").
 | ||
|     from ..language_data import update_exc, strings_to_exc
 | ||
|     from ..language_data import EMOTICONS
 | ||
| 
 | ||
|     # Add custom emoticons
 | ||
|     EMOTICONS = EMOTICONS + ["8===D", ":~)"]
 | ||
| 
 | ||
|     # Add emoticons to tokenizer exceptions
 | ||
|     update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(EMOTICONS))
 | ||
| 
 | ||
| +table(["Name", "Description", "Source"])
 | ||
|     +row
 | ||
|         +cell #[code EMOTICONS]
 | ||
| 
 | ||
|         +cell
 | ||
|             |  Common Unicode emoticons without whitespace.
 | ||
| 
 | ||
|         +cell
 | ||
|             +src(gh("spaCy", "spacy/language_data/emoticons.py")) emoticons.py
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code TOKENIZER_PREFIXES]
 | ||
| 
 | ||
|         +cell
 | ||
|             |  Regular expressions to match left-attaching tokens and
 | ||
|             |  punctuation, e.g. #[code $], #[code (], #[code "]
 | ||
| 
 | ||
|         +cell
 | ||
|             +src(gh("spaCy", "spacy/language_data/punctuation.py")) punctuation.py
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code TOKENIZER_SUFFIXES]
 | ||
| 
 | ||
|         +cell
 | ||
|             |  Regular expressions to match right-attaching tokens and
 | ||
|             |  punctuation, e.g. #[code %], #[code )], #[code "]
 | ||
| 
 | ||
|         +cell
 | ||
|             +src(gh("spaCy", "spacy/language_data/punctuation.py")) punctuation.py
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code TOKENIZER_INFIXES]
 | ||
| 
 | ||
|         +cell
 | ||
|             |  Regular expressions to match token separators, e.g. #[code -]
 | ||
| 
 | ||
|         +cell
 | ||
|             +src(gh("spaCy", "spacy/language_data/punctuation.py")) punctuation.py
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code TAG_MAP]
 | ||
| 
 | ||
|         +cell
 | ||
|             |  A tag map that maps the universal part-of-speech tags to
|             |  themselves, with no morphological features.
 | ||
| 
 | ||
|         +cell
 | ||
|             +src(gh("spaCy", "spacy/language_data/tag_map.py")) tag_map.py
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code ENTITY_RULES]
 | ||
| 
 | ||
|         +cell
 | ||
|             |  Patterns for named entities commonly missed by the statistical
 | ||
|             |  entity recognizer, for use in the rule matcher.
 | ||
| 
 | ||
|         +cell
 | ||
|             +src(gh("spaCy", "spacy/language_data/entity_rules.py")) entity_rules.py
 | ||
| 
 | ||
|     +row
 | ||
|         +cell #[code FALSE_POSITIVES]
 | ||
| 
 | ||
|         +cell
 | ||
|             |  Patterns for phrases commonly mistaken for named entities by the
 | ||
|             |  statistical entity recognizer, to use in the rule matcher.
 | ||
| 
 | ||
|         +cell
 | ||
|             +src(gh("spaCy", "spacy/language_data/entity_rules.py")) entity_rules.py
 | ||
| 
 | ||
| p
 | ||
|     |  Individual languages can extend and override any of these expressions.
 | ||
|     |  Often, when a new language is added, you'll find a pattern or symbol
 | ||
|     |  that's missing. Even if this pattern or symbol isn't common in other
 | ||
|     |  languages, it might be best to add it to the base expressions, unless it
 | ||
|     |  has some conflicting interpretation. For instance, we don't expect to
 | ||
|     |  see guillemet quotation symbols (#[code »] and #[code «]) in
 | ||
|     |  English text. But if we do see them, we'd probably prefer the tokenizer
 | ||
|     |  to split them off.
 | ||
| 
 | ||
| +h(2, "vocabulary") Building the vocabulary
 | ||
| 
 | ||
| p
 | ||
|     |  spaCy expects that common words will be cached in a
 | ||
|     |  #[+api("vocab") #[code Vocab]] instance. The vocabulary caches lexical
 | ||
|     |  features, and makes it easy to use information from unlabelled text
 | ||
|     |  samples in your models. Specifically, you'll usually want to collect
 | ||
|     |  word frequencies, and train two types of distributional similarity model:
 | ||
|     |  Brown clusters, and word vectors. The Brown clusters are used as features
 | ||
|     |  by linear models, while the word vectors are useful for lexical
 | ||
|     |  similarity models and deep learning.
 | ||
| 
 | ||
| +h(3, "word-frequencies") Word frequencies
 | ||
| 
 | ||
| p
 | ||
|     |  To generate the word frequencies from a large, raw corpus, you can use the
 | ||
|     |  #[+src(gh("spacy-dev-resources", "training/word_freqs.py")) word_freqs.py]
 | ||
|     |  script from the spaCy developer resources. Note that your corpus should
 | ||
|     |  not be preprocessed (for example, it should still contain punctuation). The
 | ||
|     |  #[+a("/docs/usage/cli#model") #[code model] command] expects a
 | ||
|     |  tab-separated word frequencies file with three columns:
 | ||
| 
 | ||
| +list("numbers")
 | ||
|     +item The number of times the word occurred in your language sample.
 | ||
|     +item The number of distinct documents the word occurred in.
 | ||
|     +item The word itself.
 | ||
| 
 | ||
| p
 | ||
|     |  An example word frequencies file could look like this:
 | ||
| 
 | ||
| +code("es_word_freqs.txt", "text").
 | ||
|     6361109	111	Aunque
 | ||
|     23598543	111	aunque
 | ||
|     10097056	111	claro
 | ||
|     193454	111	aro
 | ||
|     7711123	111	viene
 | ||
|     12812323	111	mal
 | ||
|     23414636	111	momento
 | ||
|     2014580	111	felicidad
 | ||
|     233865	111	repleto
 | ||
|     15527	111	eto
 | ||
|     235565	111	deliciosos
 | ||
|     17259079	111	buena
 | ||
|     71155	111	Anímate
 | ||
|     37705	111	anímate
 | ||
|     33155	111	cuéntanos
 | ||
|     2389171	111	cuál
 | ||
|     961576	111	típico
 | ||
| 
 | ||
| p
 | ||
|     |  You should make sure you use the spaCy tokenizer for your
 | ||
|     |  language to segment the text for your word frequencies. This will ensure
 | ||
|     |  that the frequencies refer to the same segmentation standards you'll be
 | ||
|     |  using at run-time. For instance, spaCy's English tokenizer segments
 | ||
|     |  "can't" into two tokens. If we segmented the text by whitespace to
 | ||
|     |  produce the frequency counts, we'd have incorrect frequency counts for
 | ||
|     |  the tokens "ca" and "n't".
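| 
| p
|     |  As a sketch of how the three columns could be produced, the example
|     |  below counts word and document frequencies and formats them as
|     |  tab-separated rows. It is not the #[code word_freqs.py] script. Both
|     |  function names are hypothetical, and #[code str.split] is only a
|     |  placeholder for your language's spaCy tokenizer.

```python
from collections import Counter


def count_frequencies(docs, tokenize):
    """Count total occurrences and distinct documents per word."""
    word_counts = Counter()
    doc_counts = Counter()
    for text in docs:
        tokens = tokenize(text)
        word_counts.update(tokens)
        # Each document contributes at most once per distinct word.
        doc_counts.update(set(tokens))
    return word_counts, doc_counts


def format_rows(word_counts, doc_counts):
    """Format rows as: frequency, document frequency, word."""
    return ["%d\t%d\t%s" % (word_counts[word], doc_counts[word], word)
            for word in sorted(word_counts)]


docs = ["aunque claro", "claro claro"]
# Placeholder tokenizer; use the spaCy tokenizer for real data.
word_counts, doc_counts = count_frequencies(docs, str.split)
rows = format_rows(word_counts, doc_counts)
```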
 | ||
| 
 | ||
| +h(3, "brown-clusters") Training the Brown clusters
 | ||
| 
 | ||
| p
 | ||
|     |  spaCy's tagger, parser and entity recognizer are designed to use
 | ||
|     |  distributional similarity features provided by the
 | ||
|     |  #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
 | ||
|     |  You should train a model with between 500 and 1000 clusters. A minimum
 | ||
|     |  frequency threshold of 10 usually works well.
 | ||
| 
 | ||
| p
 | ||
|     |  An example clusters file could look like this:
 | ||
| 
 | ||
| +code("es_clusters.data", "text").
 | ||
|     0000	Vestigial	1
 | ||
|     0000	Vesturland	1
 | ||
|     0000	Veyreau	1
 | ||
|     0000	Veynes	1
 | ||
|     0000	Vexilografía	1
 | ||
|     0000	Vetrigne	1
 | ||
|     0000	Vetónica	1
 | ||
|     0000	Asunden	1
 | ||
|     0000	Villalambrús	1
 | ||
|     0000	Vichuquén	1
 | ||
|     0000	Vichtis	1
 | ||
|     0000	Vichigasta	1
 | ||
|     0000	VAAH	1
 | ||
|     0000	Viciebsk	1
 | ||
|     0000	Vicovaro	1
 | ||
|     0000	Villardeveyo	1
 | ||
|     0000	Vidala	1
 | ||
|     0000	Videoguard	1
 | ||
|     0000	Vedás	1
 | ||
|     0000	Videocomunicado	1
 | ||
|     0000	VideoCrypt	1
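| 
| p
|     |  A file in this layout can be read with a few lines of Python. The
|     |  sketch assumes the three tab-separated columns are the cluster bit
|     |  string, the word and the word's count, as in the example above.
|     |  #[code read_clusters()] is a hypothetical helper, not part of spaCy.

```python
def read_clusters(lines):
    """Map each word to its (cluster bit string, count) pair."""
    clusters = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        bits, word, count = line.split("\t")
        # Keep the bit string as text so leading zeros are preserved.
        clusters[word] = (bits, int(count))
    return clusters


sample = ["0000\tVestigial\t1\n", "0000\tVesturland\t1\n", "\n"]
clusters = read_clusters(sample)
```

| p
|     |  With a real file, you would pass an open file object instead of the
|     |  #[code sample] list.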
 | ||
| 
 | ||
| +h(3, "word-vectors") Training the word vectors
 | ||
| 
 | ||
| p
 | ||
|     |  #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
 | ||
|     |  algorithms let you train useful word similarity models from unlabelled
 | ||
|     |  text. This is a key part of using
 | ||
|     |  #[+a("/docs/usage/deep-learning") deep learning] for NLP with limited
 | ||
|     |  labelled data. The vectors are also useful by themselves – they power
 | ||
|     |  the #[code .similarity()] methods in spaCy. For best results, you should
 | ||
|     |  pre-process the text with spaCy before training the Word2vec model. This
 | ||
|     |  ensures your tokenization will match.
 | ||
| 
 | ||
| p
 | ||
|     |  You can use our
 | ||
|     |  #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
 | ||
|     |  which pre-processes the text with your language-specific tokenizer and
 | ||
|     |  trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
 | ||
|     |  The #[code vectors.bin] file should consist of one word and vector per line.
 | ||
| 
 | ||
| +h(2, "model-directory") Setting up a model directory
 | ||
| 
 | ||
| p
 | ||
|     |  Once you've collected the word frequencies, Brown clusters and word
 | ||
|     |  vectors files, you can use the
 | ||
|     |  #[+a("/docs/usage/cli#model") #[code model] command] to create a data
 | ||
|     |  directory:
 | ||
| 
 | ||
| +code(false, "bash").
 | ||
|     python -m spacy model [lang] [model_dir] [freqs_data] [clusters_data] [vectors_data]
 | ||
| 
 | ||
| +aside-code("your_data_directory", "yaml").
 | ||
|     ├── vocab/
 | ||
|     |   ├── lexemes.bin   # via nlp.vocab.dump(path)
 | ||
|     |   ├── strings.json  # via nlp.vocab.strings.dump(file_)
 | ||
|     |   └── oov_prob      # optional
 | ||
|     ├── pos/              # optional
 | ||
|     |   ├── model         # via nlp.tagger.model.dump(path)
 | ||
|     |   └── config.json   # via Language.train
 | ||
|     ├── deps/             # optional
 | ||
|     |   ├── model         # via nlp.parser.model.dump(path)
 | ||
|     |   └── config.json   # via Language.train
 | ||
|     └── ner/              # optional
 | ||
|         ├── model         # via nlp.entity.model.dump(path)
 | ||
|         └── config.json   # via Language.train
 | ||
| 
 | ||
| p
 | ||
|     |  This creates a spaCy data directory with a vocabulary model, ready to be
 | ||
|     |  loaded. By default, the command expects to be able to find your language
 | ||
|     |  class using #[code spacy.util.get_lang_class(lang_id)].
 | ||
| 
 | ||
| 
 | ||
| +h(2, "train-tagger-parser") Training the tagger and parser
 | ||
| 
 | ||
| p
 | ||
|     |  You can now train the model using a corpus for your language annotated
 | ||
|     |  with #[+a("http://universaldependencies.org/") Universal Dependencies].
 | ||
|     |  If your corpus uses the 
 | ||
|     |  #[+a("http://universaldependencies.org/docs/format.html") CoNLL-U] format, 
 | ||
|     |  i.e. files with the extension #[code .conllu], you can use the
 | ||
|     |  #[+a("/docs/usage/cli#convert") #[code convert] command] to convert it to
 | ||
|     |  spaCy's #[+a("/docs/api/annotation#json-input") JSON format] for training.
 | ||
| 
 | ||
| p
 | ||
|     |  Once you have your UD corpus transformed into JSON, you can train your
 | ||
|     |  model using spaCy's
 | ||
|     |  #[+a("/docs/usage/cli#train") #[code train] command]:
 | ||
| 
 | ||
| +code(false, "bash").
 | ||
|     python -m spacy train [lang] [output_dir] [train_data] [dev_data] [--n_iter] [--parser_L1] [--no_tagger] [--no_parser] [--no_ner]
 |