//- 💫 DOCS > USAGE > ADDING LANGUAGES > TRAINING

p
    |  spaCy expects that common words will be cached in a
    |  #[+api("vocab") #[code Vocab]] instance. The vocabulary caches lexical
    |  features, and makes it easy to use information from unlabelled text
    |  samples in your models. Specifically, you'll usually want to collect
    |  word frequencies, and train word vectors. To generate the word frequencies
    |  from a large, raw corpus, you can use the
    |  #[+src(gh("spacy-dev-resources", "training/word_freqs.py")) #[code word_freqs.py]]
    |  script from the spaCy developer resources.

+github("spacy-dev-resources", "training/word_freqs.py")

p
    |  Note that your corpus should not be preprocessed (for example, it should
    |  still contain its punctuation). The word frequencies should be generated
    |  as a tab-separated file with three columns:

+list("numbers")
    +item The number of times the word occurred in your language sample.
    +item The number of distinct documents the word occurred in.
    +item The word itself.

+code("es_word_freqs.txt", "text").
 | 
						||
    6361109	111	Aunque
 | 
						||
    23598543	111	aunque
 | 
						||
    10097056	111	claro
 | 
						||
    193454	111	aro
 | 
						||
    7711123	111	viene
 | 
						||
    12812323	111	mal
 | 
						||
    23414636	111	momento
 | 
						||
    2014580	111	felicidad
 | 
						||
    233865	111	repleto
 | 
						||
    15527	111	eto
 | 
						||
    235565	111	deliciosos
 | 
						||
    17259079	111	buena
 | 
						||
    71155	111	Anímate
 | 
						||
    37705	111	anímate
 | 
						||
    33155	111	cuéntanos
 | 
						||
    2389171	111	cuál
 | 
						||
    961576	111	típico
 | 
						||
 | 
						||
+aside("Brown Clusters")
 | 
						||
    |  Additionally, you can use distributional similarity features provided by the
 | 
						||
    |  #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
 | 
						||
    |  You should train a model with between 500 and 1000 clusters. A minimum
 | 
						||
    |  frequency threshold of 10 usually works well.
 | 
						||
 | 
						||
p
    |  You should make sure you use the spaCy tokenizer for your
    |  language to segment the text for your word frequencies. This will ensure
    |  that the frequencies refer to the same segmentation standards you'll be
    |  using at run-time. For instance, spaCy's English tokenizer segments
    |  "can't" into two tokens. If you segment the text by whitespace to
    |  produce the frequency counts, you'll end up with incorrect counts for
    |  the tokens "ca" and "n't".
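
p
    |  For illustration only, here's a minimal sketch of counting token and
    |  document frequencies with the spaCy tokenizer and writing them in the
    |  three-column format shown above. The language code, corpus and file names
    |  are placeholders, and the #[code word_freqs.py] script above is the more
    |  complete reference.

+code("word_freqs_sketch.py", "python").
    from collections import Counter
    import spacy

    nlp = spacy.blank("en")       # swap in the tokenizer for your language
    word_counts = Counter()       # total occurrences of each token
    doc_counts = Counter()        # number of documents each token occurs in

    texts = ["I can't come.", "He can't either."]   # placeholder corpus
    for doc in nlp.pipe(texts):
        tokens = [w.text for w in doc if not w.is_space]
        word_counts.update(tokens)
        doc_counts.update(set(tokens))

    # One line per token: frequency, document frequency, word.
    with open("word_freqs.txt", "w", encoding="utf8") as f_out:
        for word, freq in word_counts.most_common():
            f_out.write("%d\t%d\t%s\n" % (freq, doc_counts[word], word))
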
+h(4, "word-vectors") Training the word vectors
 | 
						||
 | 
						||
p
 | 
						||
    |  #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
 | 
						||
    |  algorithms let you train useful word similarity models from unlabelled
 | 
						||
    |  text. This is a key part of using
 | 
						||
    |  #[+a("/usage/deep-learning") deep learning] for NLP with limited
 | 
						||
    |  labelled data. The vectors are also useful by themselves – they power
 | 
						||
    |  the #[code .similarity()] methods in spaCy. For best results, you should
 | 
						||
    |  pre-process the text with spaCy before training the Word2vec model. This
 | 
						||
    |  ensures your tokenization will match. You can use our
 | 
						||
    |  #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
 | 
						||
    |  which pre-processes the text with your language-specific tokenizer and
 | 
						||
    |  trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
 | 
						||
    |  The #[code vectors.bin] file should consist of one word and vector per line.
 | 
						||
 | 
						||
+github("spacy-dev-resources", "training/word_vectors.py")
 | 
						||
 | 
						||
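
p
    |  For orientation, here's a rough, hypothetical sketch of the same idea:
    |  tokenize with spaCy, train a Word2vec model with Gensim and export the
    |  vectors. The corpus, file names and hyperparameters are placeholders, and
    |  Gensim 4 renamed the #[code size] parameter to #[code vector_size].

+code("word_vectors_sketch.py", "python").
    import spacy
    from gensim.models import Word2Vec

    nlp = spacy.blank("en")   # use the same tokenizer you'll use at run-time

    # Placeholder corpus: stream sentences from your raw text files in practice.
    texts = ["This is one sentence.", "This is another sentence."]
    sentences = [[w.text for w in doc if not w.is_space]
                 for doc in nlp.pipe(texts)]

    # min_count=1 only so this toy example runs; use a higher threshold on real data.
    model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=4)

    # Writes a header line, then one word and its vector per line.
    model.wv.save_word2vec_format("vectors.txt", binary=False)
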
+h(3, "train-tagger-parser") Training the tagger and parser
 | 
						||
 | 
						||
p
 | 
						||
    |  You can now train the model using a corpus for your language annotated
 | 
						||
    |  with #[+a("http://universaldependencies.org/") Universal Dependencies].
 | 
						||
    |  If your corpus uses the
 | 
						||
    |  #[+a("http://universaldependencies.org/docs/format.html") CoNLL-U] format,
 | 
						||
    |  i.e. files with the extension #[code .conllu], you can use the
 | 
						||
    |  #[+api("cli#convert") #[code convert]] command to convert it to spaCy's
 | 
						||
    |  #[+a("/api/annotation#json-input") JSON format] for training.
 | 
						||
    |  Once you have your UD corpus transformed into JSON, you can train your
 | 
						||
    |  model use the using spaCy's #[+api("cli#train") #[code train]] command.
 | 
						||
 | 
						||
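
p
    |  As an illustration of the two steps (the file names, language code and
    |  output directories are placeholders, and the exact CLI arguments may
    |  differ between spaCy versions), the conversion and training can be
    |  scripted like this:

+code("convert_and_train_sketch.py", "python").
    import subprocess

    # Convert CoNLL-U training and development data to spaCy's JSON format.
    subprocess.check_call(["python", "-m", "spacy", "convert",
                           "es_ancora-ud-train.conllu", "data/"])
    subprocess.check_call(["python", "-m", "spacy", "convert",
                           "es_ancora-ud-dev.conllu", "data/"])

    # Train the tagger and parser: language code, output directory,
    # training data and development data.
    subprocess.check_call(["python", "-m", "spacy", "train", "es", "models/",
                           "data/es_ancora-ud-train.json",
                           "data/es_ancora-ud-dev.json"])
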
+infobox
    |  For more details and examples of how to
    |  #[strong train the tagger and dependency parser], see the
    |  #[+a("/usage/training#tagger-parser") usage guide on training].