Matthew Honnibal 2017-11-08 03:01:16 +01:00
commit 2bdf68a632
2 changed files with 39 additions and 18 deletions


@@ -149,7 +149,9 @@ p
 +aside
   | #[+api("language#begin_training") #[code begin_training()]]: Start the
-  | training and return an optimizer function to update the model's weights.#[br]
+  | training and return an optimizer function to update the model's weights.
+  | Can take an optional function converting the training data to spaCy's
+  | training format.#[br]
   | #[+api("language#update") #[code update()]]: Update the model with the
   | training example and gold data.#[br]
   | #[+api("language#to_disk") #[code to_disk()]]: Save the updated model to
@@ -165,38 +167,38 @@ p
     nlp.update([doc], [gold], drop=0.5, sgd=optimizer)
     nlp.to_disk('/model')
 p
   | The #[+api("language#update") #[code nlp.update]] method takes the
   | following arguments:
 +table(["Name", "Description"])
   +row
-    +cell #[code train_data]
-    +cell The training data.
-  +row
-    +cell #[code get_data]
-    +cell
-      | An optional function converting the training data to spaCy's
-      | JSON format.
-  +row
-    +cell #[code doc]
+    +cell #[code docs]
     +cell
       | #[+api("doc") #[code Doc]] objects. The #[code update] method
       | takes a sequence of them, so you can batch up your training
-      | examples.
+      | examples. Alternatively, you can also pass in a sequence of
+      | raw texts.
   +row
-    +cell #[code gold]
+    +cell #[code golds]
     +cell
       | #[+api("goldparse") #[code GoldParse]] objects. The #[code update]
       | method takes a sequence of them, so you can batch up your
-      | training examples.
+      | training examples. Alternatively, you can also pass in a
+      | dictionary containing the annotations.
   +row
     +cell #[code drop]
-    +cell Dropout rate. Makes it harder for the model to just memorise the data.
+    +cell
+      | Dropout rate. Makes it harder for the model to just memorise
+      | the data.
   +row
-    +cell #[code optimizer]
-    +cell Callable to update the model's weights.
+    +cell #[code sgd]
+    +cell
+      | An optimizer, i.e. a callable to update the model's weights. If
+      | not set, spaCy will create a new one and save it for further use.
 p
   | Instead of writing your own training loop, you can also use the
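The changed docs above note that `update` takes parallel sequences of docs and golds, so training examples can be batched. As a rough illustration of that batching pattern in plain Python (the `minibatch` helper below is a hypothetical sketch, independent of spaCy, though spaCy ships a similar utility in `spacy.util`):

```python
from itertools import islice

def minibatch(items, size):
    """Group (text, annotations) training examples into fixed-size
    batches, so each batch can be unzipped into the parallel docs
    and golds sequences that update() accepts."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Toy training data: (raw text, annotations dict) pairs.
train_data = [("text-%d" % i, {"entities": []}) for i in range(5)]
batches = list(minibatch(train_data, 2))
# Unzip one batch into the two parallel sequences for an update() call.
docs, golds = zip(*batches[0])
```

Each batch would then be passed along as `nlp.update(docs, golds, drop=0.5, sgd=optimizer)` inside the loop.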


@@ -17,6 +17,25 @@ p
   | runtime inputs must match. This means you'll have to
   | #[strong retrain your models] with spaCy v2.0.
+
++h(3, "migrating-document-processing") Document processing
+
+p
+  | The #[+api("language#pipe") #[code Language.pipe]] method allows spaCy
+  | to batch documents, which brings a
+  | #[strong significant performance advantage] in v2.0. The new neural
+  | networks introduce some overhead per batch, so if you're processing a
+  | number of documents in a row, you should use #[code nlp.pipe] and process
+  | the texts as a stream.
+
++code-new docs = nlp.pipe(texts)
++code-old docs = (nlp(text) for text in texts)
+
+p
+  | To make usage easier, there's now a boolean #[code as_tuples]
+  | keyword argument that lets you pass in an iterator of
+  | #[code (text, context)] pairs, so you can get back an iterator of
+  | #[code (doc, context)] tuples.
+
 +h(3, "migrating-saving-loading") Saving, loading and serialization
 p
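The `as_tuples` behaviour described in the added section can be sketched in plain Python. This is a toy stand-in, not spaCy's actual implementation (which also batches the texts internally); `process` here stands for any per-text function:

```python
def pipe_with_context(process, pairs):
    # Mimic nlp.pipe(..., as_tuples=True): consume (text, context)
    # pairs lazily and yield (result, context) tuples in the same order,
    # so each processed document stays paired with its metadata.
    for text, context in pairs:
        yield process(text), context

pairs = [("hello world", {"id": 1}), ("spaCy v2", {"id": 2})]
results = list(pipe_with_context(str.upper, pairs))
# -> [("HELLO WORLD", {"id": 1}), ("SPACY V2", {"id": 2})]
```

The point of the pattern is that the contexts ride along with the stream, so you never have to re-align metadata with documents after batching.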