Reformat word frequencies section in "adding languages" workflow

This commit is contained in:
Ines Montani 2016-12-19 17:18:38 +01:00
parent ddf5c5bb61
commit a2525c76ee


@@ -424,16 +424,22 @@ p
+h(3, "word-frequencies") Word frequencies

p
    |  The #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
    |  script expects a tab-separated word frequencies file with three columns:

+list("numbers")
    +item The number of times the word occurred in your language sample.
    +item The number of distinct documents the word occurred in.
    +item The word itself.

p
    |  You should make sure you use the spaCy tokenizer for your
    |  language to segment the text for your word frequencies. This will
    |  ensure that the frequencies refer to the same segmentation standards
    |  you'll be using at run-time. For instance, spaCy's English tokenizer
    |  segments "can't" into two tokens. If we segmented the text by
    |  whitespace to produce the frequency counts, we'd have incorrect
    |  frequency counts for the tokens "ca" and "n't".
+h(3, "brown-clusters") Training the Brown clusters