mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-24 17:06:29 +03:00
Reformat word frequencies section in "adding languages" workflow
parent ddf5c5bb61
commit a2525c76ee
@@ -424,16 +424,22 @@ p
 +h(3, "word-frequencies") Word frequencies
 
 p
-| The #[code init.py] script expects a tab-separated word frequencies file
-| with three columns: the number of times the word occurred in your language
-| sample, the number of distinct documents the word occurred in, and the
-| word itself. You should make sure you use the spaCy tokenizer for your
+| The #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
+| script expects a tab-separated word frequencies file with three columns:
+
++list("numbers")
++item The number of times the word occurred in your language sample.
++item The number of distinct documents the word occurred in.
++item The word itself.
+
+p
+| You should make sure you use the spaCy tokenizer for your
 | language to segment the text for your word frequencies. This will ensure
 | that the frequencies refer to the same segmentation standards you'll be
-| using at run-time. For instance, spaCy's English tokenizer segments "can't"
-| into two tokens. If we segmented the text by whitespace to produce the
-| frequency counts, we'll have incorrect frequency counts for the tokens
-| "ca" and "n't".
+| using at run-time. For instance, spaCy's English tokenizer segments
+| "can't" into two tokens. If we segmented the text by whitespace to
+| produce the frequency counts, we'll have incorrect frequency counts for
+| the tokens "ca" and "n't".
 
 +h(3, "brown-clusters") Training the Brown clusters
 
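The three-column file described in the section above could be produced roughly as follows. This is a minimal sketch, not the code used by init.py: the function names `count_frequencies` and `write_frequencies` are hypothetical, and `toy_tokenize` is only a stand-in that mimics the "can't" split mentioned in the text. In practice you would pass in the spaCy tokenizer for your language instead. The column order (count, document count, word) follows the description in the section.

```python
import io
from collections import Counter


def count_frequencies(texts, tokenize):
    """Count tokens over an iterable of document strings.

    Returns two Counters: total occurrences per word, and the number
    of distinct documents each word occurred in.
    """
    word_freq = Counter()
    doc_freq = Counter()
    for text in texts:
        tokens = list(tokenize(text))
        word_freq.update(tokens)
        # A set ensures each word counts once per document.
        doc_freq.update(set(tokens))
    return word_freq, doc_freq


def write_frequencies(word_freq, doc_freq, file_):
    """Write one tab-separated row per word: count, document count, word."""
    for word, freq in word_freq.most_common():
        file_.write("{}\t{}\t{}\n".format(freq, doc_freq[word], word))


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer (assumption, not spaCy).

    Splits on whitespace, then splits a trailing "n't" into its own token,
    so "can't" becomes "ca" and "n't".
    """
    tokens = []
    for chunk in text.split():
        if chunk.endswith("n't") and len(chunk) > 3:
            tokens.extend([chunk[:-3], "n't"])
        else:
            tokens.append(chunk)
    return tokens


if __name__ == "__main__":
    sample = ["I can't go", "you can't stay", "I go"]
    wf, df = count_frequencies(sample, toy_tokenize)
    out = io.StringIO()
    write_frequencies(wf, df, out)
    print(out.getvalue(), end="")
```

Passing the tokenizer in as a callable keeps the counting logic independent of how the text is segmented, which makes it easy to see the problem the section warns about: with `str.split` as the tokenizer, the counts would be recorded under "can't" and the tokens "ca" and "n't" would be missing from the file.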