//- 💫 DOCS > USAGE > ADDING LANGUAGES > TRAINING

p
    |  spaCy expects that common words will be cached in a
    |  #[+api("vocab") #[code Vocab]] instance. The vocabulary caches lexical
    |  features, and makes it easy to use information from unlabelled text
    |  samples in your models. Specifically, you'll usually want to collect
    |  word frequencies, and train word vectors. To generate the word frequencies
    |  from a large, raw corpus, you can use the
    |  #[+src(gh("spacy-dev-resources", "training/word_freqs.py")) #[code word_freqs.py]]
    |  script from the spaCy developer resources.

+github("spacy-dev-resources", "training/word_freqs.py")

p
    |  Note that your corpus should not be preprocessed (for example, it should
    |  still contain its punctuation). The word frequencies should be generated
    |  as a tab-separated file with three columns:

+list("numbers")
    +item The number of times the word occurred in your language sample.
    +item The number of distinct documents the word occurred in.
    +item The word itself.

+code("es_word_freqs.txt", "text").
 | 
						||
    6361109	111	Aunque
 | 
						||
    23598543	111	aunque
 | 
						||
    10097056	111	claro
 | 
						||
    193454	111	aro
 | 
						||
    7711123	111	viene
 | 
						||
    12812323	111	mal
 | 
						||
    23414636	111	momento
 | 
						||
    2014580	111	felicidad
 | 
						||
    233865	111	repleto
 | 
						||
    15527	111	eto
 | 
						||
    235565	111	deliciosos
 | 
						||
    17259079	111	buena
 | 
						||
    71155	111	Anímate
 | 
						||
    37705	111	anímate
 | 
						||
    33155	111	cuéntanos
 | 
						||
    2389171	111	cuál
 | 
						||
    961576	111	típico
 | 
						||
 | 
						||
+aside("Brown Clusters")
 | 
						||
    |  Additionally, you can use distributional similarity features provided by the
 | 
						||
    |  #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
 | 
						||
    |  You should train a model with between 500 and 1000 clusters. A minimum
 | 
						||
    |  frequency threshold of 10 usually works well.
 | 
						||
 | 
						||
p
    |  You should make sure you use the spaCy tokenizer for your
    |  language to segment the text for your word frequencies. This will ensure
    |  that the frequencies refer to the same segmentation standards you'll be
    |  using at run-time. For instance, spaCy's English tokenizer segments
    |  "can't" into two tokens. If you segment the text by whitespace to
    |  produce the frequency counts, you'll end up with incorrect counts for
    |  the tokens "ca" and "n't".
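
p
    |  For illustration only, here's a minimal sketch of counting token and
    |  document frequencies with the spaCy tokenizer and writing them in the
    |  three-column format shown above. The language code, corpus and file names
    |  are placeholders, and the #[code word_freqs.py] script above is the more
    |  complete reference.

+code("word_freqs_sketch.py", "python").
    from collections import Counter
    import spacy

    nlp = spacy.blank("en")       # swap in the tokenizer for your language
    word_counts = Counter()       # total occurrences of each token
    doc_counts = Counter()        # number of documents each token occurs in

    texts = ["I can't come.", "He can't either."]   # placeholder corpus
    for doc in nlp.pipe(texts):
        tokens = [w.text for w in doc if not w.is_space]
        word_counts.update(tokens)
        doc_counts.update(set(tokens))

    # One line per token: frequency, document frequency, word.
    with open("word_freqs.txt", "w", encoding="utf8") as f_out:
        for word, freq in word_counts.most_common():
            f_out.write("%d\t%d\t%s\n" % (freq, doc_counts[word], word))
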
+h(4, "word-vectors") Training the word vectors
 | 
						||
 | 
						||
p
 | 
						||
    |  #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
 | 
						||
    |  algorithms let you train useful word similarity models from unlabelled
 | 
						||
    |  text. This is a key part of using
 | 
						||
    |  #[+a("/usage/deep-learning") deep learning] for NLP with limited
 | 
						||
    |  labelled data. The vectors are also useful by themselves – they power
 | 
						||
    |  the #[code .similarity()] methods in spaCy. For best results, you should
 | 
						||
    |  pre-process the text with spaCy before training the Word2vec model. This
 | 
						||
    |  ensures your tokenization will match. You can use our
 | 
						||
    |  #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
 | 
						||
    |  which pre-processes the text with your language-specific tokenizer and
 | 
						||
    |  trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
 | 
						||
    |  The #[code vectors.bin] file should consist of one word and vector per line.
 | 
						||
 | 
						||
+github("spacy-dev-resources", "training/word_vectors.py")
 | 
						||
 | 
						||
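
p
    |  For orientation, here's a rough, hypothetical sketch of the same idea:
    |  tokenize with spaCy, train a Word2vec model with Gensim and export the
    |  vectors. The corpus, file names and hyperparameters are placeholders, and
    |  Gensim 4 renamed the #[code size] parameter to #[code vector_size].

+code("word_vectors_sketch.py", "python").
    import spacy
    from gensim.models import Word2Vec

    nlp = spacy.blank("en")   # use the same tokenizer you'll use at run-time

    # Placeholder corpus: stream sentences from your raw text files in practice.
    texts = ["This is one sentence.", "This is another sentence."]
    sentences = [[w.text for w in doc if not w.is_space]
                 for doc in nlp.pipe(texts)]

    # min_count=1 only so this toy example runs; use a higher threshold on real data.
    model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=4)

    # Writes a header line, then one word and its vector per line.
    model.wv.save_word2vec_format("vectors.txt", binary=False)
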
+h(3, "train-tagger-parser") Training the tagger and parser
 | 
						||
 | 
						||
p
 | 
						||
    |  You can now train the model using a corpus for your language annotated
 | 
						||
    |  with #[+a("http://universaldependencies.org/") Universal Dependencies].
 | 
						||
    |  If your corpus uses the
 | 
						||
    |  #[+a("http://universaldependencies.org/docs/format.html") CoNLL-U] format,
 | 
						||
    |  i.e. files with the extension #[code .conllu], you can use the
 | 
						||
    |  #[+api("cli#convert") #[code convert]] command to convert it to spaCy's
 | 
						||
    |  #[+a("/api/annotation#json-input") JSON format] for training.
 | 
						||
    |  Once you have your UD corpus transformed into JSON, you can train your
 | 
						||
    |  model use the using spaCy's #[+api("cli#train") #[code train]] command.
 | 
						||
 | 
						||
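
p
    |  As an illustration of the two steps (the file names, language code and
    |  output directories are placeholders, and the exact CLI arguments may
    |  differ between spaCy versions), the conversion and training can be
    |  scripted like this:

+code("convert_and_train_sketch.py", "python").
    import subprocess

    # Convert CoNLL-U training and development data to spaCy's JSON format.
    subprocess.check_call(["python", "-m", "spacy", "convert",
                           "es_ancora-ud-train.conllu", "data/"])
    subprocess.check_call(["python", "-m", "spacy", "convert",
                           "es_ancora-ud-dev.conllu", "data/"])

    # Train the tagger and parser: language code, output directory,
    # training data and development data.
    subprocess.check_call(["python", "-m", "spacy", "train", "es", "models/",
                           "data/es_ancora-ud-train.json",
                           "data/es_ancora-ud-dev.json"])
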
+infobox
    |  For more details and examples of how to
    |  #[strong train the tagger and dependency parser], see the
    |  #[+a("/usage/training#tagger-parser") usage guide on training].