mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 10:16:27 +03:00
Update adding language / training docs (see #966)
Add data examples and more info on training and CLI commands
parent
ae2b77db1b
commit
e6bdf5bc5c
|
@ -27,9 +27,10 @@ p
|
||||||
| #[a(href="#brown-clusters") Brown clusters] and
|
| #[a(href="#brown-clusters") Brown clusters] and
|
||||||
| #[a(href="#word-vectors") word vectors].
|
| #[a(href="#word-vectors") word vectors].
|
||||||
|
|
||||||
|
+item
|
||||||
|
| #[strong Set up] a #[a(href="#model-directory") model directory] and #[strong train] the #[a(href="#train-tagger-parser") tagger and parser].
|
||||||
|
|
||||||
p
|
p
|
||||||
| Once you have the tokenizer and vocabulary, you can
|
|
||||||
| #[+a("/docs/usage/training") train the tagger, parser and entity recognizer].
|
|
||||||
| For some languages, you may also want to develop a solution for
|
| For some languages, you may also want to develop a solution for
|
||||||
| lemmatization and morphological analysis.
|
| lemmatization and morphological analysis.
|
||||||
|
|
||||||
|
@ -406,12 +407,111 @@ p
|
||||||
| by linear models, while the word vectors are useful for lexical
|
| by linear models, while the word vectors are useful for lexical
|
||||||
| similarity models and deep learning.
|
| similarity models and deep learning.
|
||||||
|
|
||||||
|
+h(3, "word-frequencies") Word frequencies
|
||||||
|
|
||||||
|
p
|
||||||
|
| To generate the word frequencies from a large, raw corpus, you can use the
|
||||||
|
| #[+src(gh("spacy-dev-resources", "training/word_freqs.py")) word_freqs.py]
|
||||||
|
| script from the spaCy developer resources. Note that your corpus should
|
||||||
|
| not be preprocessed (for example, it needs to keep its punctuation). The
|
||||||
|
| #[+a("/docs/usage/cli#model") #[code model] command] expects a
|
||||||
|
| tab-separated word frequencies file with three columns:
|
||||||
|
|
||||||
|
+list("numbers")
|
||||||
|
+item The number of times the word occurred in your language sample.
|
||||||
|
+item The number of distinct documents the word occurred in.
|
||||||
|
+item The word itself.
|
||||||
|
|
||||||
|
p
|
||||||
|
| An example word frequencies file could look like this:
|
||||||
|
|
||||||
|
+code("es_word_freqs.txt", "text").
|
||||||
|
6361109 111 Aunque
|
||||||
|
23598543 111 aunque
|
||||||
|
10097056 111 claro
|
||||||
|
193454 111 aro
|
||||||
|
7711123 111 viene
|
||||||
|
12812323 111 mal
|
||||||
|
23414636 111 momento
|
||||||
|
2014580 111 felicidad
|
||||||
|
233865 111 repleto
|
||||||
|
15527 111 eto
|
||||||
|
235565 111 deliciosos
|
||||||
|
17259079 111 buena
|
||||||
|
71155 111 Anímate
|
||||||
|
37705 111 anímate
|
||||||
|
33155 111 cuéntanos
|
||||||
|
2389171 111 cuál
|
||||||
|
961576 111 típico
|
||||||
|
|
||||||
|
p
|
||||||
|
| You should make sure you use the spaCy tokenizer for your
|
||||||
|
| language to segment the text for your word frequencies. This will ensure
|
||||||
|
| that the frequencies refer to the same segmentation standards you'll be
|
||||||
|
| using at run-time. For instance, spaCy's English tokenizer segments
|
||||||
|
| "can't" into two tokens. If we segmented the text by whitespace to
|
||||||
|
| produce the frequency counts, we'd have incorrect frequency counts for
|
||||||
|
| the tokens "ca" and "n't".
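The three-column layout and the tokenization point above can be illustrated with a minimal Python sketch. It is not the `word_freqs.py` script itself: `count_word_freqs`, `format_freq_lines` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is only a stand-in that mimics the "can't" split. A real run would pass your language's spaCy tokenizer instead.

```python
from collections import Counter


def count_word_freqs(docs, tokenize):
    """Return (total_counts, doc_counts) for an iterable of raw document strings.

    `tokenize` is a placeholder for your language's tokenizer: any callable
    that maps a string to a list of token strings.
    """
    total_counts = Counter()
    doc_counts = Counter()
    for text in docs:
        tokens = tokenize(text)
        total_counts.update(tokens)
        # Count each distinct word only once per document.
        doc_counts.update(set(tokens))
    return total_counts, doc_counts


def format_freq_lines(total_counts, doc_counts):
    """Yield tab-separated lines: frequency, document count, word."""
    for word, freq in total_counts.most_common():
        yield "{}\t{}\t{}".format(freq, doc_counts[word], word)


def toy_tokenize(text):
    # Stand-in only: splits on whitespace and separates a trailing "n't",
    # so "can't" becomes "ca" + "n't". Not the real spaCy tokenizer.
    tokens = []
    for chunk in text.split():
        if chunk.endswith("n't") and len(chunk) > 3:
            tokens.extend([chunk[:-3], "n't"])
        else:
            tokens.append(chunk)
    return tokens


docs = ["I can't come", "You can't stay , can you"]
totals, doc_freqs = count_word_freqs(docs, toy_tokenize)
freq_lines = list(format_freq_lines(totals, doc_freqs))
```

With whitespace splitting alone, "can't" would be counted as a single word and "ca" and "n't" would never appear in the file, which is the mismatch the paragraph above warns about.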
|
||||||
|
|
||||||
|
+h(3, "brown-clusters") Training the Brown clusters
|
||||||
|
|
||||||
|
p
|
||||||
|
| spaCy's tagger, parser and entity recognizer are designed to use
|
||||||
|
| distributional similarity features provided by the
|
||||||
|
| #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
|
||||||
|
| You should train a model with between 500 and 1000 clusters. A minimum
|
||||||
|
| frequency threshold of 10 usually works well.
|
||||||
|
|
||||||
|
p
|
||||||
|
| An example clusters file could look like this:
|
||||||
|
|
||||||
|
+code("es_clusters.data", "text").
|
||||||
|
0000 Vestigial 1
|
||||||
|
0000 Vesturland 1
|
||||||
|
0000 Veyreau 1
|
||||||
|
0000 Veynes 1
|
||||||
|
0000 Vexilografía 1
|
||||||
|
0000 Vetrigne 1
|
||||||
|
0000 Vetónica 1
|
||||||
|
0000 Asunden 1
|
||||||
|
0000 Villalambrús 1
|
||||||
|
0000 Vichuquén 1
|
||||||
|
0000 Vichtis 1
|
||||||
|
0000 Vichigasta 1
|
||||||
|
0000 VAAH 1
|
||||||
|
0000 Viciebsk 1
|
||||||
|
0000 Vicovaro 1
|
||||||
|
0000 Villardeveyo 1
|
||||||
|
0000 Vidala 1
|
||||||
|
0000 Videoguard 1
|
||||||
|
0000 Vedás 1
|
||||||
|
0000 Videocomunicado 1
|
||||||
|
0000 VideoCrypt 1
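As a rough illustration of how a file in this shape can be read, here is a minimal Python sketch. It assumes each line holds a cluster bit string, a word and a count separated by whitespace, as in the example above. `read_clusters` and its `min_freq` parameter are hypothetical and are not part of spaCy or of the brown-cluster tool.

```python
def read_clusters(lines, min_freq=1):
    """Map each word to its cluster bit string.

    Assumes lines of the form "<bit string> <word> <count>". Lines that do
    not have exactly three fields are skipped. `min_freq` is an illustrative
    filter on the count column.
    """
    clusters = {}
    for line in lines:
        pieces = line.split()
        if len(pieces) != 3:
            continue
        bits, word, freq = pieces
        if int(freq) >= min_freq:
            clusters[word] = bits
    return clusters


# A few lines taken from the example file above.
sample = [
    "0000 Vestigial 1",
    "0000 Vesturland 1",
    "0000 VAAH 1",
]
clusters = read_clusters(sample)
```

In practice you would pass an open file object instead of a list, since iterating over a file also yields one line at a time.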
|
||||||
|
|
||||||
|
+h(3, "word-vectors") Training the word vectors
|
||||||
|
|
||||||
|
p
|
||||||
|
| #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
|
||||||
|
| algorithms let you train useful word similarity models from unlabelled
|
||||||
|
| text. This is a key part of using
|
||||||
|
| #[+a("/docs/usage/deep-learning") deep learning] for NLP with limited
|
||||||
|
| labelled data. The vectors are also useful by themselves – they power
|
||||||
|
| the #[code .similarity()] methods in spaCy. For best results, you should
|
||||||
|
| pre-process the text with spaCy before training the Word2vec model. This
|
||||||
|
| ensures your tokenization will match.
|
||||||
|
|
||||||
|
p
|
||||||
|
| You can use our
|
||||||
|
| #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
|
||||||
|
| which pre-processes the text with your language-specific tokenizer and
|
||||||
|
| trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
|
||||||
|
| The #[code vectors.bin] file should consist of one word and vector per line.
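To make the "one word and vector per line" layout concrete, the sketch below parses such lines and compares two vectors. It rests on two assumptions that the text above does not confirm: that the values on each line are separated by single spaces, and that similarity is measured as cosine similarity. The function names and the toy vectors are made up for illustration.

```python
import math


def parse_vector_line(line):
    """Split a line "<word> <v1> <v2> ..." into (word, list of floats)."""
    pieces = line.rstrip().split(" ")
    word = pieces[0]
    vector = [float(value) for value in pieces[1:]]
    return word, vector


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Define similarity with an all-zero vector as 0.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy three-dimensional vectors, invented for this example.
toy_lines = [
    "bueno 1.0 0.0 1.0",
    "buena 1.0 0.0 1.0",
    "mal 0.0 1.0 0.0",
]
vectors = dict(parse_vector_line(line) for line in toy_lines)
same = cosine_similarity(vectors["bueno"], vectors["buena"])
different = cosine_similarity(vectors["bueno"], vectors["mal"])
```

Because the word is the first field on each line, the tokenization used to train the vectors has to match the tokenization used at run-time, otherwise lookups by token text will miss.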
|
||||||
|
|
||||||
|
+h(2, "model-directory") Setting up a model directory
|
||||||
|
|
||||||
p
|
p
|
||||||
| Once you've collected the word frequencies, Brown clusters and word
|
| Once you've collected the word frequencies, Brown clusters and word
|
||||||
| vectors files, you can use the
|
| vectors files, you can use the
|
||||||
| #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
|
|
||||||
| script from our
|
|
||||||
| #[+a(gh("spacy-dev-resources")) developer resources], or use the new
|
|
||||||
| #[+a("/docs/usage/cli#model") #[code model] command] to create a data
|
| #[+a("/docs/usage/cli#model") #[code model] command] to create a data
|
||||||
| directory:
|
| directory:
|
||||||
|
|
||||||
|
@ -438,49 +538,20 @@ p
|
||||||
| loaded. By default, the command expects to be able to find your language
|
| loaded. By default, the command expects to be able to find your language
|
||||||
| class using #[code spacy.util.get_lang_class(lang_id)].
|
| class using #[code spacy.util.get_lang_class(lang_id)].
|
||||||
|
|
||||||
+h(3, "word-frequencies") Word frequencies
|
|
||||||
|
+h(2, "train-tagger-parser") Training the tagger and parser
|
||||||
|
|
||||||
p
|
p
|
||||||
| The #[+a("/docs/usage/cli#model") #[code model] command] expects a
|
| You can now train the model using a corpus for your language annotated
|
||||||
| tab-separated word frequencies file with three columns:
|
| with #[+a("http://universaldependencies.org/") Universal Dependencies].
|
||||||
|
| If your corpus uses the CoNLL-U format, you can use the
|
||||||
+list("numbers")
|
| #[+a("/docs/usage/cli#convert") #[code convert] command] to convert it to
|
||||||
+item The number of times the word occurred in your language sample.
|
| spaCy's #[+a("/docs/api/annotation#json-input") JSON format] for training.
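For orientation, this is roughly what the input side of such a conversion involves. The sketch below reads CoNLL-U text into plain Python dictionaries, relying on the ten tab-separated columns defined by the Universal Dependencies format. It is not the implementation of the `convert` command and does not produce spaCy's JSON format. `read_conllu` is a hypothetical helper and the two-token sentence is a toy annotation.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLU_FIELDS):
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        current.append(dict(zip(CONLLU_FIELDS, fields)))
    if current:
        sentences.append(current)
    return sentences


# Toy annotation, written by hand for this example.
toy_corpus = (
    "# text = viene mal\n"
    "1\tviene\tvenir\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tmal\tmal\tADV\t_\t_\t1\tadvmod\t_\t_\n"
    "\n"
)
sentences = read_conllu(toy_corpus)
```

The `head` column stores the `id` of each token's syntactic parent, with `0` marking the root, and that is the information the parser is trained on.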
|
||||||
+item The number of distinct documents the word occurred in.
|
|
||||||
+item The word itself.
|
|
||||||
|
|
||||||
p
|
p
|
||||||
| You should make sure you use the spaCy tokenizer for your
|
| Once you have your UD corpus transformed into JSON, you can train your
|
||||||
| language to segment the text for your word frequencies. This will ensure
|
| model using spaCy's
|
||||||
| that the frequencies refer to the same segmentation standards you'll be
|
| #[+a("/docs/usage/cli#train") #[code train] command]:
|
||||||
| using at run-time. For instance, spaCy's English tokenizer segments
|
|
||||||
| "can't" into two tokens. If we segmented the text by whitespace to
|
|
||||||
| produce the frequency counts, we'll have incorrect frequency counts for
|
|
||||||
| the tokens "ca" and "n't".
|
|
||||||
|
|
||||||
+h(3, "brown-clusters") Training the Brown clusters
|
+code(false, "bash").
|
||||||
|
python -m spacy train [lang] [output_dir] [train_data] [dev_data] [--n_iter] [--parser_L1] [--no_tagger] [--no_parser] [--no_ner]
|
||||||
p
|
|
||||||
| spaCy's tagger, parser and entity recognizer are designed to use
|
|
||||||
| distributional similarity features provided by the
|
|
||||||
| #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
|
|
||||||
| You should train a model with between 500 and 1000 clusters. A minimum
|
|
||||||
| frequency threshold of 10 usually works well.
|
|
||||||
|
|
||||||
+h(3, "word-vectors") Training the word vectors
|
|
||||||
|
|
||||||
p
|
|
||||||
| #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
|
|
||||||
| algorithms let you train useful word similarity models from unlabelled
|
|
||||||
| text. This is a key part of using
|
|
||||||
| #[+a("/docs/usage/deep-learning") deep learning] for NLP with limited
|
|
||||||
| labelled data. The vectors are also useful by themselves – they power
|
|
||||||
| the #[code .similarity()] methods in spaCy. For best results, you should
|
|
||||||
| pre-process the text with spaCy before training the Word2vec model. This
|
|
||||||
| ensures your tokenization will match.
|
|
||||||
|
|
||||||
p
|
|
||||||
| You can use our
|
|
||||||
| #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
|
|
||||||
| which pre-processes the text with your language-specific tokenizer and
|
|
||||||
| trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
|
|
||||||
|
|