mirror of https://github.com/explosion/spaCy.git
Reformat word frequencies section in "adding languages" workflow
This commit is contained in:
parent ddf5c5bb61
commit a2525c76ee
@@ -424,16 +424,22 @@ p
 +h(3, "word-frequencies") Word frequencies

 p
-    | The #[code init.py] script expects a tab-separated word frequencies file
-    | with three columns: the number of times the word occurred in your language
-    | sample, the number of distinct documents the word occurred in, and the
-    | word itself. You should make sure you use the spaCy tokenizer for your
+    | The #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
+    | script expects a tab-separated word frequencies file with three columns:
+
++list("numbers")
+    +item The number of times the word occurred in your language sample.
+    +item The number of distinct documents the word occurred in.
+    +item The word itself.
+
+p
+    | You should make sure you use the spaCy tokenizer for your
     | language to segment the text for your word frequencies. This will ensure
     | that the frequencies refer to the same segmentation standards you'll be
-    | using at run-time. For instance, spaCy's English tokenizer segments "can't"
-    | into two tokens. If we segmented the text by whitespace to produce the
-    | frequency counts, we'll have incorrect frequency counts for the tokens
-    | "ca" and "n't".
+    | using at run-time. For instance, spaCy's English tokenizer segments
+    | "can't" into two tokens. If we segmented the text by whitespace to
+    | produce the frequency counts, we'll have incorrect frequency counts for
+    | the tokens "ca" and "n't".

 +h(3, "brown-clusters") Training the Brown clusters
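For reference, a minimal sketch of producing the file this section describes, using the spaCy tokenizer for segmentation as the docs recommend. This is not the init.py script itself; the language code, in-memory corpus, and output filename are placeholders, and it assumes a current spaCy install:

# Count word and document frequencies with the spaCy tokenizer and write
# the tab-separated file described above (count, document count, word).
# Placeholders: language code, corpus, and output filename.
from collections import Counter

import spacy

nlp = spacy.blank("en")          # substitute your language code

word_freq = Counter()            # column 1: total occurrences
doc_freq = Counter()             # column 2: distinct documents

corpus = [
    "I can't believe it.",
    "It can't be helped.",
]                                # in practice, stream texts from your corpus

for text in corpus:
    tokens = [token.text for token in nlp.tokenizer(text)]
    word_freq.update(tokens)
    doc_freq.update(set(tokens))

with open("word_freqs.tsv", "w", encoding="utf8") as f_out:
    for word, freq in word_freq.most_common():
        f_out.write("%d\t%d\t%s\n" % (freq, doc_freq[word], word))

Each output line then has the form frequency, tab, document count, tab, word. Because the tokenizer did the segmentation, "can't" is counted as the tokens "ca" and "n't", matching the segmentation the pipeline uses at run-time.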