Merge branch 'develop' into spacy.io

2025-11-10 04:47:51 +03:00 · 2019-02-24 22:22:30 +01:00 · e983eefee7
commit e983eefee7
parent 69cfd7d2ce d0b3af9222
5 changed files with 62 additions and 55 deletions
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@@ -47,8 +47,8 @@ shortcut for this and instantiate the component using its string name and
 ## DependencyParser.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-This usually happens under the hood when you call the `nlp` object on a text and
-all pipeline components are applied to the `Doc` in order. Both
+This usually happens under the hood when the `nlp` object is called on a text
+and all pipeline components are applied to the `Doc` in order. Both
 [`__call__`](/api/dependencyparser#call) and
 [`pipe`](/api/dependencyparser#pipe) delegate to the
 [`predict`](/api/dependencyparser#predict) and
@@ -70,8 +70,9 @@ all pipeline components are applied to the `Doc` in order. Both

 ## DependencyParser.pipe {#pipe tag="method"}

-Apply the pipe to a stream of documents. Both
-[`__call__`](/api/dependencyparser#call) and
+Apply the pipe to a stream of documents. This usually happens under the hood
+when the `nlp` object is called on a text and all pipeline components are
+applied to the `Doc` in order. Both [`__call__`](/api/dependencyparser#call) and
 [`pipe`](/api/dependencyparser#pipe) delegate to the
 [`predict`](/api/dependencyparser#predict) and
 [`set_annotations`](/api/dependencyparser#set_annotations) methods.
@@ -79,9 +80,8 @@ Apply the pipe to a stream of documents. Both
 > #### Example
 >
 > ```python
-> texts = [u"One doc", u"...", u"Lots of docs"]
 > parser = DependencyParser(nlp.vocab)
-> for doc in parser.pipe(texts, batch_size=50):
+> for doc in parser.pipe(docs, batch_size=50):
 >     pass
 > ```

@@ -102,10 +102,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
 > scores = parser.predict([doc1, doc2])
 > ```

-| Name        | Type     | Description               |
-| ----------- | -------- | ------------------------- |
-| `docs`      | iterable | The documents to predict. |
-| **RETURNS** | -        | Scores from the model.    |
+| Name        | Type     | Description                                                                                                                                                                                                                        |
+| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `docs`      | iterable | The documents to predict.                                                                                                                                                                                                          |
+| **RETURNS** | tuple    | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document. |

 ## DependencyParser.set_annotations {#set_annotations tag="method"}

--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -47,8 +47,8 @@ shortcut for this and instantiate the component using its string name and
 ## EntityRecognizer.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-This usually happens under the hood when you call the `nlp` object on a text and
-all pipeline components are applied to the `Doc` in order. Both
+This usually happens under the hood when the `nlp` object is called on a text
+and all pipeline components are applied to the `Doc` in order. Both
 [`__call__`](/api/entityrecognizer#call) and
 [`pipe`](/api/entityrecognizer#pipe) delegate to the
 [`predict`](/api/entityrecognizer#predict) and
@@ -70,8 +70,9 @@ all pipeline components are applied to the `Doc` in order. Both

 ## EntityRecognizer.pipe {#pipe tag="method"}

-Apply the pipe to a stream of documents. Both
-[`__call__`](/api/entityrecognizer#call) and
+Apply the pipe to a stream of documents. This usually happens under the hood
+when the `nlp` object is called on a text and all pipeline components are
+applied to the `Doc` in order. Both [`__call__`](/api/entityrecognizer#call) and
 [`pipe`](/api/entityrecognizer#pipe) delegate to the
 [`predict`](/api/entityrecognizer#predict) and
 [`set_annotations`](/api/entityrecognizer#set_annotations) methods.
@@ -79,9 +80,8 @@ Apply the pipe to a stream of documents. Both
 > #### Example
 >
 > ```python
-> texts = [u"One doc", u"...", u"Lots of docs"]
 > ner = EntityRecognizer(nlp.vocab)
-> for doc in ner.pipe(texts, batch_size=50):
+> for doc in ner.pipe(docs, batch_size=50):
 >     pass
 > ```

@@ -102,10 +102,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
 > scores = ner.predict([doc1, doc2])
 > ```

-| Name        | Type     | Description               |
-| ----------- | -------- | ------------------------- |
-| `docs`      | iterable | The documents to predict. |
-| **RETURNS** | -        | Scores from the model.    |
+| Name        | Type     | Description                                                                                                                                                                                                                        |
+| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `docs`      | iterable | The documents to predict.                                                                                                                                                                                                          |
+| **RETURNS** | tuple    | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document. |

 ## EntityRecognizer.set_annotations {#set_annotations tag="method"}

--- a/website/docs/api/goldparse.md
+++ b/website/docs/api/goldparse.md
@@ -7,17 +7,23 @@ source: spacy/gold.pyx

 ## GoldParse.\_\_init\_\_ {#init tag="method"}

-Create a `GoldParse`.
+Create a `GoldParse`. Unlike annotations in `entities`, label annotations in
+`cats` can overlap, i.e. a single word can be covered by multiple labelled
+spans. The [`TextCategorizer`](/api/textcategorizer) component expects true
+examples of a label to have the value `1.0`, and negative examples of a label to
+have the value `0.0`. Labels not in the dictionary are treated as missing – the
+gradient for those labels will be zero.

-| Name        | Type        | Description                                                                                                                                           |
-| ----------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `doc`       | `Doc`       | The document the annotations refer to.                                                                                                                |
-| `words`     | iterable    | A sequence of unicode word strings.                                                                                                                   |
-| `tags`      | iterable    | A sequence of strings, representing tag annotations.                                                                                                  |
-| `heads`     | iterable    | A sequence of integers, representing syntactic head offsets.                                                                                          |
-| `deps`      | iterable    | A sequence of strings, representing the syntactic relation types.                                                                                     |
-| `entities`  | iterable    | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. |
-| **RETURNS** | `GoldParse` | The newly constructed object.                                                                                                                         |
+| Name        | Type        | Description                                                                                                                                                                                                               |
+| ----------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `doc`       | `Doc`       | The document the annotations refer to.                                                                                                                                                                                    |
+| `words`     | iterable    | A sequence of unicode word strings.                                                                                                                                                                                       |
+| `tags`      | iterable    | A sequence of strings, representing tag annotations.                                                                                                                                                                      |
+| `heads`     | iterable    | A sequence of integers, representing syntactic head offsets.                                                                                                                                                              |
+| `deps`      | iterable    | A sequence of strings, representing the syntactic relation types.                                                                                                                                                         |
+| `entities`  | iterable    | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions.                                                                     |
+| `cats`      | dict        | Labels for text classification. Each key in the dictionary may be a string or an int, or a `(start_char, end_char, label)` tuple, indicating that the label is applied to only part of the document (usually a sentence). |
+| **RETURNS** | `GoldParse` | The newly constructed object.                                                                                                                                                                                             |

 ## GoldParse.\_\_len\_\_ {#len tag="method"}

@@ -52,11 +58,10 @@ Whether the provided syntactic annotations form a projective dependency tree.
 ### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}

 Encode labelled spans into per-token tags, using the
-[BILUO scheme](/api/annotation#biluo) (Begin/In/Last/Unit/Out).
-
-Returns a list of unicode strings, describing the tags. Each tag string will be
-of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one
-of `"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
+[BILUO scheme](/api/annotation#biluo) (Begin, In, Last, Unit, Out). Returns a
+list of unicode strings, describing the tags. Each tag string will be of the
+form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
+`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
 don't align with the tokenization in the `Doc` object. The training algorithm
 will view these as missing values. `O` denotes a non-entity token. `B` denotes
 the beginning of a multi-token entity, `I` the inside of an entity of three or
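The BILUO encoding described in this hunk can be sketched in plain Python. This is a minimal illustration under stated assumptions, not spaCy's implementation of `gold.biluo_tags_from_offsets`: the function name `biluo_from_offsets` and the `token_spans` argument (a list of `(start_char, end_char)` pairs, one per token) are hypothetical stand-ins for the `Doc` tokenization.

```python
def biluo_from_offsets(token_spans, entities):
    """Encode (start_char, end_char, label) spans as per-token BILUO tags.

    token_spans: list of (start_char, end_char) pairs, one per token
                 (hypothetical stand-in for a tokenized Doc).
    entities:    list of (start_char, end_char, label) tuples.
    """
    # Map character boundaries to token indices.
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["O"] * len(token_spans)  # "O" marks a non-entity token
    for start_char, end_char, label in entities:
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is not None and last is not None and first <= last:
            if first == last:
                tags[first] = "U-" + label  # single-token entity
            else:
                tags[first] = "B-" + label  # beginning of the entity
                for i in range(first + 1, last):
                    tags[i] = "I-" + label  # inside (three or more tokens)
                tags[last] = "L-" + label  # last token of the entity
        else:
            # Offsets do not align with token boundaries: mark every
            # overlapping token as "-" so it can be treated as missing.
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < end_char and tok_end > start_char:
                    tags[i] = "-"
    return tags


# Tokens of "New York City is big" as character spans.
spans = [(0, 3), (4, 8), (9, 13), (14, 16), (17, 20)]
print(biluo_from_offsets(spans, [(0, 13, "GPE")]))
# ['B-GPE', 'I-GPE', 'L-GPE', 'O', 'O']
print(biluo_from_offsets(spans, [(0, 6, "GPE")]))
# ['-', '-', 'O', 'O', 'O']
```

The second call shows the misaligned case: the span ends in the middle of "York", so both overlapping tokens receive `"-"`.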
--- a/website/docs/api/tagger.md
+++ b/website/docs/api/tagger.md
@@ -47,8 +47,8 @@ shortcut for this and instantiate the component using its string name and
 ## Tagger.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-This usually happens under the hood when you call the `nlp` object on a text and
-all pipeline components are applied to the `Doc` in order. Both
+This usually happens under the hood when the `nlp` object is called on a text
+and all pipeline components are applied to the `Doc` in order. Both
 [`__call__`](/api/tagger#call) and [`pipe`](/api/tagger#pipe) delegate to the
 [`predict`](/api/tagger#predict) and
 [`set_annotations`](/api/tagger#set_annotations) methods.
@@ -69,16 +69,17 @@ all pipeline components are applied to the `Doc` in order. Both

 ## Tagger.pipe {#pipe tag="method"}

-Apply the pipe to a stream of documents. Both [`__call__`](/api/tagger#call) and
+Apply the pipe to a stream of documents. This usually happens under the hood
+when the `nlp` object is called on a text and all pipeline components are
+applied to the `Doc` in order. Both [`__call__`](/api/tagger#call) and
 [`pipe`](/api/tagger#pipe) delegate to the [`predict`](/api/tagger#predict) and
 [`set_annotations`](/api/tagger#set_annotations) methods.

 > #### Example
 >
 > ```python
-> texts = [u"One doc", u"...", u"Lots of docs"]
 > tagger = Tagger(nlp.vocab)
-> for doc in tagger.pipe(texts, batch_size=50):
+> for doc in tagger.pipe(docs, batch_size=50):
 >     pass
 > ```

@@ -99,10 +100,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
 > scores = tagger.predict([doc1, doc2])
 > ```

-| Name        | Type     | Description               |
-| ----------- | -------- | ------------------------- |
-| `docs`      | iterable | The documents to predict. |
-| **RETURNS** | -        | Scores from the model.    |
+| Name        | Type     | Description                                                                                                                                                                                                                        |
+| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `docs`      | iterable | The documents to predict.                                                                                                                                                                                                          |
+| **RETURNS** | tuple    | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document. |

 ## Tagger.set_annotations {#set_annotations tag="method"}

--- a/website/docs/api/textcategorizer.md
+++ b/website/docs/api/textcategorizer.md
@@ -64,8 +64,8 @@ argument.
 ## TextCategorizer.\_\_call\_\_ {#call tag="method"}

 Apply the pipe to one document. The document is modified in place, and returned.
-This usually happens under the hood when you call the `nlp` object on a text and
-all pipeline components are applied to the `Doc` in order. Both
+This usually happens under the hood when the `nlp` object is called on a text
+and all pipeline components are applied to the `Doc` in order. Both
 [`__call__`](/api/textcategorizer#call) and [`pipe`](/api/textcategorizer#pipe)
 delegate to the [`predict`](/api/textcategorizer#predict) and
 [`set_annotations`](/api/textcategorizer#set_annotations) methods.
@@ -86,17 +86,18 @@ delegate to the [`predict`](/api/textcategorizer#predict) and

 ## TextCategorizer.pipe {#pipe tag="method"}

-Apply the pipe to a stream of documents. Both
-[`__call__`](/api/textcategorizer#call) and [`pipe`](/api/textcategorizer#pipe)
-delegate to the [`predict`](/api/textcategorizer#predict) and
+Apply the pipe to a stream of documents. This usually happens under the hood
+when the `nlp` object is called on a text and all pipeline components are
+applied to the `Doc` in order. Both [`__call__`](/api/textcategorizer#call) and
+[`pipe`](/api/textcategorizer#pipe) delegate to the
+[`predict`](/api/textcategorizer#predict) and
 [`set_annotations`](/api/textcategorizer#set_annotations) methods.

 > #### Example
 >
 > ```python
-> texts = [u"One doc", u"...", u"Lots of docs"]
 > textcat = TextCategorizer(nlp.vocab)
-> for doc in textcat.pipe(texts, batch_size=50):
+> for doc in textcat.pipe(docs, batch_size=50):
 >     pass
 > ```

@@ -117,10 +118,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
 > scores = textcat.predict([doc1, doc2])
 > ```

-| Name        | Type     | Description               |
-| ----------- | -------- | ------------------------- |
-| `docs`      | iterable | The documents to predict. |
-| **RETURNS** | -        | Scores from the model.    |
+| Name        | Type     | Description                                                                                                                                                                                                                        |
+| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `docs`      | iterable | The documents to predict.                                                                                                                                                                                                          |
+| **RETURNS** | tuple    | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document. |

 ## TextCategorizer.set_annotations {#set_annotations tag="method"}
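The `cats` convention introduced in the `goldparse.md` change of this commit (true labels `1.0`, negative labels `0.0`, absent labels treated as missing with zero gradient) can be illustrated with a small sketch. The helper `masked_gradient` is hypothetical, and the `score - target` difference assumes a squared-error style loss chosen purely for illustration; the docs only state that missing labels receive a zero gradient, not how `TextCategorizer` computes its loss.

```python
def masked_gradient(scores, cats, labels):
    """Per-label gradient that ignores labels missing from `cats`.

    scores: predicted score per label, in the same order as `labels`.
    cats:   dict mapping a label to 1.0 (true) or 0.0 (negative).
    labels: all labels the model predicts.
    """
    grads = []
    for label, score in zip(labels, scores):
        if label in cats:
            # Annotated label: gradient of 0.5 * (score - target) ** 2.
            grads.append(score - cats[label])
        else:
            # Label not in the dictionary: treated as missing, no update.
            grads.append(0.0)
    return grads


labels = ["POSITIVE", "NEGATIVE", "SPAM"]
cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}  # "SPAM" is not annotated
print(masked_gradient([0.75, 0.5, 0.25], cats, labels))
# [-0.25, 0.5, 0.0]
```

Leaving a label out of `cats` is therefore different from setting it to `0.0`: the former contributes nothing to the update, while the latter pushes the score down.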