//- 💫 DOCS > USAGE > ADDING LANGUAGES > TRAINING

p
    |  spaCy expects that common words will be cached in a
    |  #[+api("vocab") #[code Vocab]] instance. The vocabulary caches lexical
    |  features, and makes it easy to use information from unlabelled text
    |  samples in your models. Specifically, you'll usually want to collect
    |  word frequencies, and train word vectors. To generate the word frequencies
    |  from a large, raw corpus, you can use the
    |  #[+src(gh("spacy-dev-resources", "training/word_freqs.py")) #[code word_freqs.py]]
    |  script from the spaCy developer resources.

+github("spacy-dev-resources", "training/word_freqs.py")

p
    |  Note that your corpus should not be preprocessed (i.e. it should still
    |  include punctuation, for example). The word frequencies should be
    |  generated as a tab-separated file with three columns:

+list("numbers")
    +item The number of times the word occurred in your language sample.
    +item The number of distinct documents the word occurred in.
    +item The word itself.

+code("es_word_freqs.txt", "text").
    6361109	111	Aunque
    23598543	111	aunque
    10097056	111	claro
    193454	111	aro
    7711123	111	viene
    12812323	111	mal
    23414636	111	momento
    2014580	111	felicidad
    233865	111	repleto
    15527	111	eto
    235565	111	deliciosos
    17259079	111	buena
    71155	111	Anímate
    37705	111	anímate
    33155	111	cuéntanos
    2389171	111	cuál
    961576	111	típico

+aside("Brown Clusters")
    |  Additionally, you can use distributional similarity features provided by the
    |  #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
    |  You should train a model with between 500 and 1000 clusters. A minimum
    |  frequency threshold of 10 usually works well.

p
    |  You should make sure you use the spaCy tokenizer for your
    |  language to segment the text for your word frequencies. This will ensure
    |  that the frequencies refer to the same segmentation standards you'll be
    |  using at run-time. For instance, spaCy's English tokenizer segments
    |  "can't" into two tokens. If we segmented the text by whitespace to
    |  produce the frequency counts, we'd have incorrect frequency counts for
    |  the tokens "ca" and "n't".

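p
    |  As a rough illustration of the expected output, the sketch below counts
    |  frequencies with a language-specific spaCy tokenizer and writes the
    |  three-column file described above. It's not the #[code word_freqs.py]
    |  script itself, and the file names and example texts are placeholders
    |  only.

+code("Counting word frequencies (sketch)", "python").
    from collections import Counter
    from spacy.lang.es import Spanish

    nlp = Spanish()  # tokenizer-only pipeline for the target language

    word_counts = Counter()  # total occurrences of each token
    doc_counts = Counter()   # number of documents each token occurs in

    # Placeholder documents; in practice, stream your raw corpus instead.
    texts = ["Aunque claro, viene en mal momento.",
             "Cuéntanos cuál es tu típico momento de felicidad."]

    for doc in nlp.pipe(texts):
        words = [token.text for token in doc if not token.is_space]
        word_counts.update(words)
        doc_counts.update(set(words))

    with open("es_word_freqs.txt", "w", encoding="utf8") as f:
        for word, freq in word_counts.most_common():
            f.write("{}\t{}\t{}\n".format(freq, doc_counts[word], word))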
+h(4, "word-vectors") Training the word vectors

p
    |  #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
    |  algorithms let you train useful word similarity models from unlabelled
    |  text. This is a key part of using
    |  #[+a("/usage/deep-learning") deep learning] for NLP with limited
    |  labelled data. The vectors are also useful by themselves – they power
    |  the #[code .similarity()] methods in spaCy. For best results, you should
    |  pre-process the text with spaCy before training the Word2vec model. This
    |  ensures your tokenization will match. You can use our
    |  #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
    |  which pre-processes the text with your language-specific tokenizer and
    |  trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
    |  The #[code vectors.bin] file should consist of one word and vector per line.

+github("spacy-dev-resources", "training/word_vectors.py")

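p
    |  If you'd rather script this step yourself, the sketch below shows the
    |  general shape of the process: tokenize with spaCy, train with Gensim,
    |  then export the vectors. The corpus, file names and hyperparameters are
    |  placeholders, not the settings used by the script above.

+code("Training word vectors with Gensim (sketch)", "python").
    from gensim.models import Word2Vec
    from spacy.lang.es import Spanish

    nlp = Spanish()  # tokenizer-only pipeline, so tokenization matches run-time

    # Placeholder sentences; in practice, stream them from your raw corpus.
    texts = ["Aunque claro, viene en mal momento.",
             "Cuéntanos cuál es tu típico momento de felicidad."]
    sentences = [[token.text for token in doc] for doc in nlp.pipe(texts)]

    # min_count=1 only so this toy example runs; use a higher threshold
    # (e.g. 10) and tune the other settings on a real corpus.
    # NOTE: in Gensim 4+ the parameter is vector_size instead of size.
    model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=4)

    # Export in word2vec text format: a header line, then one word and its
    # vector per line.
    model.wv.save_word2vec_format("vectors.txt", binary=False)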
+h(3, "train-tagger-parser") Training the tagger and parser

p
    |  You can now train the model using a corpus for your language annotated
    |  with #[+a("http://universaldependencies.org/") Universal Dependencies].
    |  If your corpus uses the
    |  #[+a("http://universaldependencies.org/docs/format.html") CoNLL-U] format,
    |  i.e. files with the extension #[code .conllu], you can use the
    |  #[+api("cli#convert") #[code convert]] command to convert it to spaCy's
    |  #[+a("/api/annotation#json-input") JSON format] for training.
    |  Once you have your UD corpus transformed into JSON, you can train your
    |  model using spaCy's #[+api("cli#train") #[code train]] command.

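p
    |  As a rough sketch, assuming spaCy v2's command line interface, the two
    |  steps might look like the following. The corpus file names and output
    |  paths are placeholders, and you'd normally run these commands directly
    |  in your shell.

+code("Converting and training (sketch)", "python").
    import subprocess
    import sys

    # Convert CoNLL-U training and development data to spaCy's JSON format.
    # Equivalent to: python -m spacy convert <input_file> <output_dir>
    subprocess.run([sys.executable, "-m", "spacy", "convert",
                    "es-ud-train.conllu", "train/"], check=True)
    subprocess.run([sys.executable, "-m", "spacy", "convert",
                    "es-ud-dev.conllu", "dev/"], check=True)

    # Train the tagger and parser on the converted corpus.
    # Equivalent to: python -m spacy train <lang> <output_dir> <train> <dev>
    subprocess.run([sys.executable, "-m", "spacy", "train", "es", "models/",
                    "train/es-ud-train.json", "dev/es-ud-dev.json"], check=True)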
+infobox
    |  For more details and examples of how to
    |  #[strong train the tagger and dependency parser], see the
    |  #[+a("/usage/training#tagger-parser") usage guide on training].