mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-26 18:06:29 +03:00
//- 💫 DOCS > USAGE > ADDING LANGUAGES > TRAINING

p
    | spaCy expects that common words will be cached in a
    | #[+api("vocab") #[code Vocab]] instance. The vocabulary caches lexical
    | features, and makes it easy to use information from unlabelled text
    | samples in your models. Specifically, you'll usually want to collect
    | word frequencies, and train word vectors. To generate the word frequencies
    | from a large, raw corpus, you can use the
    | #[+src(gh("spacy-dev-resources", "training/word_freqs.py")) #[code word_freqs.py]]
    | script from the spaCy developer resources.

+github("spacy-dev-resources", "training/word_freqs.py")

p
    | Note that your corpus should not be preprocessed: the counts need to
    | include punctuation, for example. The word frequencies should be
    | generated as a tab-separated file with three columns:

+list("numbers")
    +item The number of times the word occurred in your language sample.
    +item The number of distinct documents the word occurred in.
    +item The word itself.

+code("es_word_freqs.txt", "text").
    6361109 111 Aunque
    23598543 111 aunque
    10097056 111 claro
    193454 111 aro
    7711123 111 viene
    12812323 111 mal
    23414636 111 momento
    2014580 111 felicidad
    233865 111 repleto
    15527 111 eto
    235565 111 deliciosos
    17259079 111 buena
    71155 111 Anímate
    37705 111 anímate
    33155 111 cuéntanos
    2389171 111 cuál
    961576 111 típico

+aside("Brown Clusters")
    | Additionally, you can use distributional similarity features provided by the
    | #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
    | You should train a model with between 500 and 1000 clusters. A minimum
    | frequency threshold of 10 usually works well.

p
    | You should make sure you use the spaCy tokenizer for your
    | language to segment the text for your word frequencies. This will ensure
    | that the frequencies refer to the same segmentation standards you'll be
    | using at run-time. For instance, spaCy's English tokenizer segments
    | "can't" into two tokens. If we segmented the text by whitespace to
    | produce the frequency counts, we'd have incorrect frequency counts for
    | the tokens "ca" and "n't".

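p
    | As a rough illustration of the three-column format and of why
    | tokenization matters, the sketch below counts word and document
    | frequencies using only the Python standard library. The names
    | #[code toy_tokenize], #[code count_word_freqs] and #[code format_rows]
    | are hypothetical helpers introduced here; they are not part of spaCy or
    | of the #[code word_freqs.py] script. The toy tokenizer only imitates the
    | "can't" split and does not separate punctuation, so in practice you
    | would pass in the spaCy tokenizer for your language instead.

```python
from collections import Counter


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: splits on whitespace,
    then splits a trailing "n't" off each chunk, imitating how spaCy's
    English tokenizer segments "can't" into "ca" and "n't"."""
    tokens = []
    for chunk in text.split():
        if chunk.endswith("n't") and len(chunk) > 3:
            tokens.extend([chunk[:-3], "n't"])
        else:
            tokens.append(chunk)
    return tokens


def count_word_freqs(docs, tokenize):
    """Return (term_freqs, doc_freqs) Counters for an iterable of texts."""
    term_freqs = Counter()
    doc_freqs = Counter()
    for text in docs:
        tokens = tokenize(text)
        term_freqs.update(tokens)
        doc_freqs.update(set(tokens))  # each document counts once per word
    return term_freqs, doc_freqs


def format_rows(term_freqs, doc_freqs):
    """Format rows as: frequency <TAB> document count <TAB> word."""
    return [
        "{}\t{}\t{}".format(freq, doc_freqs[word], word)
        for word, freq in term_freqs.most_common()
    ]


# Two tiny made-up "documents", for illustration only.
docs = ["I can't go", "I can't can't stay"]
term_freqs, doc_freqs = count_word_freqs(docs, toy_tokenize)
rows = format_rows(term_freqs, doc_freqs)
```

p
    | Swapping #[code toy_tokenize] for #[code str.split] would count
    | "can't" as a single entry and leave "ca" and "n't" without any counts,
    | which is the mismatch described above.
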
+h(4, "word-vectors") Training the word vectors

p
    | #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
    | algorithms let you train useful word similarity models from unlabelled
    | text. This is a key part of using
    | #[+a("/usage/deep-learning") deep learning] for NLP with limited
    | labelled data. The vectors are also useful by themselves: they power
    | the #[code .similarity()] methods in spaCy. For best results, you should
    | pre-process the text with spaCy before training the Word2vec model. This
    | ensures your tokenization will match. You can use our
    | #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
    | which pre-processes the text with your language-specific tokenizer and
    | trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
    | The #[code vectors.bin] file should consist of one word and vector per line.

+github("spacy-dev-resources", "training/word_vectors.py")

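p
    | To make the "one word and vector per line" layout more concrete, here
    | is a minimal sketch that parses such lines and compares two words by
    | cosine similarity, the usual measure for word vectors. It assumes a
    | plain-text layout of a word followed by whitespace-separated floats;
    | that layout, the helper names #[code parse_vector_lines] and
    | #[code cosine], and the toy values are all assumptions made for
    | illustration, not a specification of the #[code vectors.bin] file.

```python
import math


def parse_vector_lines(lines):
    """Parse lines of the assumed form "word v1 v2 ... vn" into a dict
    mapping each word to a list of floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors.
    Returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up three-dimensional toy vectors, for illustration only.
sample = [
    "buena 1.0 0.0 1.0",
    "mal -1.0 0.0 -1.0",
    "claro 1.0 0.0 1.0",
]
vectors = parse_vector_lines(sample)
sim_same = cosine(vectors["buena"], vectors["claro"])
sim_opposite = cosine(vectors["buena"], vectors["mal"])
```

p
    | Real vectors trained with Word2vec typically have a few hundred
    | dimensions, but the comparison works the same way.
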
+h(3, "train-tagger-parser") Training the tagger and parser

p
    | You can now train the model using a corpus for your language annotated
    | with #[+a("http://universaldependencies.org/") Universal Dependencies].
    | If your corpus uses the
    | #[+a("http://universaldependencies.org/docs/format.html") CoNLL-U] format,
    | i.e. files with the extension #[code .conllu], you can use the
    | #[+api("cli#convert") #[code convert]] command to convert it to spaCy's
    | #[+a("/api/annotation#json-input") JSON format] for training.
    | Once you have your UD corpus transformed into JSON, you can train your
    | model using spaCy's #[+api("cli#train") #[code train]] command.

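p
    | For orientation, a #[code .conllu] file stores one token per line in
    | ten tab-separated columns, with blank lines between sentences and
    | #[code #] comment lines. The sketch below reads that layout with the
    | Python standard library. It is not spaCy's converter and says nothing
    | about spaCy's JSON format; #[code read_conllu] is a hypothetical helper,
    | and the sample annotation is made up for illustration.

```python
# The ten CoNLL-U columns, in file order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Return a list of sentences, each a list of token dicts keyed by
    the CoNLL-U column names."""
    sentences = []
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                sentences.append(sentence)
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        token = dict(zip(CONLLU_FIELDS, values))
        if not token["id"].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        sentence.append(token)
    if sentence:
        sentences.append(sentence)
    return sentences


# A made-up two-token sentence, for illustration only.
sample = "\n".join([
    "# text = Viene claro",
    "1\tViene\tvenir\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tclaro\tclaro\tADV\t_\t_\t1\tadvmod\t_\t_",
    "",
])
sentences = read_conllu(sample)
```

p
    | Each token's #[code head] column holds the #[code id] of its syntactic
    | head, with #[code 0] marking the root of the sentence.
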
+infobox
    | For more details and examples of how to
    | #[strong train the tagger and dependency parser], see the
    | #[+a("/usage/training#tagger-parser") usage guide on training].