---
title: GoldParse
teaser: A collection for training annotations
tag: class
source: spacy/gold.pyx
---

## GoldParse.\_\_init\_\_ {#init tag="method"}

Create a `GoldParse`. Unlike annotations in `entities`, label annotations in
`cats` can overlap, i.e. a single word can be covered by multiple labelled
spans. The [`TextCategorizer`](/api/textcategorizer) component expects true
examples of a label to have the value `1.0`, and negative examples of a label to
have the value `0.0`. Labels not in the dictionary are treated as missing – the
gradient for those labels will be zero.

| Name        | Type        | Description                                                                                                                                                                                                                            |
| ----------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`       | The document the annotations refer to.                                                                                                                                                                                                 |
| `words`     | iterable    | A sequence of unicode word strings.                                                                                                                                                                                                    |
| `tags`      | iterable    | A sequence of strings, representing tag annotations.                                                                                                                                                                                   |
| `heads`     | iterable    | A sequence of integers, representing syntactic head offsets.                                                                                                                                                                           |
| `deps`      | iterable    | A sequence of strings, representing the syntactic relation types.                                                                                                                                                                      |
| `entities`  | iterable    | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
| `cats`      | dict        | Labels for text classification. Each key in the dictionary may be a string or an int, or a `(start_char, end_char, label)` tuple, indicating that the label is applied to only part of the document (usually a sentence).              |
| `links`     | dict        | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either 1.0 (positive) or 0.0 (negative).                           |
| **RETURNS** | `GoldParse` | The newly constructed object.                                                                                                                                                                                                          |
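
As a rough illustration of the argument shapes described above, the following sketch builds `entities`, `cats` and `links` values in plain Python. The text, the label names and the KB IDs are placeholders chosen for this example, and the final `GoldParse` call is left as a comment because it assumes an `nlp` object has already been loaded.

```python
text = "I like London"

# Character offsets of the entity span, computed from the raw text.
start_char = text.index("London")
end_char = start_char + len("London")

# Entity annotations as (start_char, end_char, label) tuples.
entities = [(start_char, end_char, "LOC")]

# Text classification labels: 1.0 marks a true example, 0.0 a negative one.
# A label that is left out (e.g. a hypothetical "SPORTS") counts as missing.
cats = {"TRAVEL": 1.0, "FOOD": 0.0}

# Entity links: (start_char, end_char) keys mapped to {kb_id: value} dicts.
# "Q84" and "Q123" are placeholder KB IDs used only for illustration.
links = {(start_char, end_char): {"Q84": 1.0, "Q123": 0.0}}

# With a loaded pipeline, these values could be passed on like this:
# doc = nlp(text)
# gold = GoldParse(doc, entities=entities, cats=cats, links=links)
```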

## GoldParse.\_\_len\_\_ {#len tag="method"}

Get the number of gold-standard tokens.

| Name        | Type | Description                         |
| ----------- | ---- | ----------------------------------- |
| **RETURNS** | int  | The number of gold-standard tokens. |

## GoldParse.is_projective {#is_projective tag="property"}

Whether the provided syntactic annotations form a projective dependency tree.

| Name        | Type | Description                               |
| ----------- | ---- | ----------------------------------------- |
| **RETURNS** | bool | Whether annotations form projective tree. |
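
To make the notion of projectivity more concrete, here is a minimal sketch of a crossing-arcs check. It is not spaCy's implementation: the function name `has_no_crossing_arcs` is hypothetical, and it assumes that `heads[i]` is the absolute index of the head of token `i`, with the root pointing to itself.

```python
def has_no_crossing_arcs(heads):
    """Return True if no two dependency arcs cross each other."""
    # Represent each arc as a (left, right) pair of token indices,
    # skipping the root, which is assumed to point to itself.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if i != h]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross if exactly one end of the second arc
            # lies strictly inside the first arc.
            if a < c < b < d:
                return False
    return True


# "I like London": both other tokens attach to "like" (index 1).
print(has_no_crossing_arcs([1, 1, 1]))  # True

# Arcs 0-2 and 1-3 cross, so this tree is not projective.
print(has_no_crossing_arcs([2, 3, 2, 2]))  # False
```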

## Attributes {#attributes}

| Name                                 | Type | Description                                                                                                                                              |
| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `words`                              | list | The words.                                                                                                                                               |
| `tags`                               | list | The part-of-speech tag annotations.                                                                                                                      |
| `heads`                              | list | The syntactic head annotations.                                                                                                                          |
| `labels`                             | list | The syntactic relation-type annotations.                                                                                                                 |
| `ner`                                | list | The named entity annotations as BILUO tags.                                                                                                              |
| `cand_to_gold`                       | list | The alignment from candidate tokenization to gold tokenization.                                                                                          |
| `gold_to_cand`                       | list | The alignment from gold tokenization to candidate tokenization.                                                                                          |
| `cats` <Tag variant="new">2</Tag>    | dict | Keys in the dict should be either a label, or a `(start, end, label)` triple. The tuple form is used for categories applied to spans of the document.    |
| `links` <Tag variant="new">2.2</Tag> | dict | Keys in the dictionary are `(start_char, end_char)` tuples, and the values are dictionaries with `kb_id:value` entries.                                  |

## Utilities {#util}

### gold.docs_to_json {#docs_to_json tag="function"}

Convert a list of `Doc` objects into the
[JSON-serializable format](/api/annotation#json-input) used by the
[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
'paragraph' in the output doc.

> #### Example
>
> ```python
> from spacy.gold import docs_to_json
>
> doc = nlp("I like London")
> json_data = docs_to_json([doc])
> ```

| Name        | Type             | Description                                |
| ----------- | ---------------- | ------------------------------------------ |
| `docs`      | iterable / `Doc` | The `Doc` object(s) to convert.            |
| `id`        | int              | ID to assign to the JSON. Defaults to `0`. |
| **RETURNS** | dict             | The data in spaCy's JSON format.           |

### gold.align {#align tag="function"}

Calculate alignment tables between two tokenizations, using the Levenshtein
algorithm. The alignment is case-insensitive.

<Infobox title="Important note" variant="warning">

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.

</Infobox>

> #### Example
>
> ```python
> from spacy.gold import align
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = align(bert_tokens, spacy_tokens)
> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
> ```

| Name        | Type  | Description                                                                |
| ----------- | ----- | -------------------------------------------------------------------------- |
| `tokens_a`  | list  | String values of candidate tokens to align.                                |
| `tokens_b`  | list  | String values of reference tokens to align.                                |
| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |

The returned tuple contains the following alignment information:

> #### Example
>
> ```python
> a2b = array([0, -1, -1, 2])
> b2a = array([0, 2, 3])
> a2b_multi = {1: 1, 2: 1}
> b2a_multi = {}
> ```
>
> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
> there's no one-to-one alignment for a token, it has the value `-1`.

| Name        | Type                                   | Description                                                                                                                                     |
| ----------- | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `cost`      | int                                    | The number of misaligned tokens.                                                                                                                |
| `a2b`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`.                                                                          |
| `b2a`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`.                                                                          |
| `a2b_multi` | dict                                   | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
| `b2a_multi` | dict                                   | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |
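
One possible way to combine the one-to-one and many-to-one tables is sketched below. The helper name `group_by_target` is hypothetical, and plain lists stand in for the `numpy` arrays that `align` returns. The input values are copied from the example above.

```python
def group_by_target(a2b, a2b_multi, n_b):
    """Collect, for each index in tokens_b, the indices in tokens_a aligned to it."""
    groups = {j: [] for j in range(n_b)}
    for i, j in enumerate(a2b):
        if j < 0:
            # No one-to-one alignment: fall back to the many-to-one table.
            j = a2b_multi.get(i, -1)
        if j >= 0:
            groups[j].append(i)
    return groups


a2b = [0, -1, -1, 2]
a2b_multi = {1: 1, 2: 1}
groups = group_by_target(a2b, a2b_multi, n_b=3)
print(groups)  # {0: [0], 1: [1, 2], 2: [3]}
```

Here, `tokens_a[1]` and `tokens_a[2]` (`"'"` and `"s"`) are both grouped under `tokens_b[1]` (`"'s"`).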

### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}

Encode labelled spans into per-token tags, using the
[BILUO scheme](/api/annotation#biluo) (Begin, In, Last, Unit, Out). Returns a
list of unicode strings describing the tags. Each tag string takes one of the
forms `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`,
`"L"`, `"U"`. The string `"-"` is used where the entity offsets don't align with
the tokenization in the `Doc` object. The training algorithm will view these as
missing values. `O` denotes a non-entity token. `B` denotes the beginning of a
multi-token entity, `I` the inside of an entity of three or more tokens, and `L`
the end of an entity of two or more tokens. `U` denotes a single-token entity.

> #### Example
>
> ```python
> from spacy.gold import biluo_tags_from_offsets
>
> doc = nlp("I like London.")
> entities = [(7, 13, "LOC")]
> tags = biluo_tags_from_offsets(doc, entities)
> assert tags == ["O", "O", "U-LOC", "O"]
> ```

| Name        | Type     | Description                                                                                                                                     |
| ----------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`    | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document.                          |
| `entities`  | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. |
| **RETURNS** | list     | Unicode strings, describing the [BILUO](/api/annotation#biluo) tags.                                                                            |
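
The encoding itself can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: the function name `biluo_tags_from_spans` is hypothetical, and tokens are given as `(start_char, end_char)` pairs instead of a `Doc`.

```python
def biluo_tags_from_spans(token_spans, entities):
    """Tag tokens, given as (start_char, end_char) pairs, with BILUO strings."""
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    tags = ["O"] * len(token_spans)
    for start_char, end_char, label in entities:
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is None or last is None:
            # Offsets don't line up with token boundaries: mark every
            # token the span touches as a missing value.
            for i, (tok_start, tok_end) in enumerate(token_spans):
                if tok_start < end_char and tok_end > start_char:
                    tags[i] = "-"
        elif first == last:
            tags[first] = f"U-{label}"
        else:
            tags[first] = f"B-{label}"
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"
            tags[last] = f"L-{label}"
    return tags


# Token boundaries for "I like London." as (start_char, end_char) pairs.
token_spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(biluo_tags_from_spans(token_spans, [(7, 13, "LOC")]))
# ['O', 'O', 'U-LOC', 'O']
```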

### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}

Convert per-token tags following the [BILUO scheme](/api/annotation#biluo) into
entity offsets.

> #### Example
>
> ```python
> from spacy.gold import offsets_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> entities = offsets_from_biluo_tags(doc, tags)
> assert entities == [(7, 13, "LOC")]
> ```

| Name        | Type     | Description                                                                                                                                                                                                                 |
| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`    | The document that the BILUO tags refer to.                                                                                                                                                                                  |
| `tags`      | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
| **RETURNS** | list     | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string.                                                                               |
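
The reverse direction can be sketched in the same spirit. Again, this is an illustration only: the function name `offsets_from_tags` is hypothetical, tokens are given as `(start_char, end_char)` pairs instead of a `Doc`, and tags that are `None`, `""`, `"O"` or `"-"` are simply skipped.

```python
def offsets_from_tags(token_spans, tags):
    """Turn per-token BILUO tags into (start_char, end_char, label) triples."""
    offsets = []
    span_start = None
    for (tok_start, tok_end), tag in zip(token_spans, tags):
        if not tag or tag in ("O", "-"):
            # Non-entity token or missing value: nothing to record.
            continue
        action, label = tag.split("-", 1)
        if action == "U":
            offsets.append((tok_start, tok_end, label))
        elif action == "B":
            # Remember where the multi-token entity begins.
            span_start = tok_start
        elif action == "L" and span_start is not None:
            offsets.append((span_start, tok_end, label))
            span_start = None
        # "I" tags only continue an open entity, so they need no action.
    return offsets


# Token boundaries for "I like London." as (start_char, end_char) pairs.
token_spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
print(offsets_from_tags(token_spans, ["O", "O", "U-LOC", "O"]))
# [(7, 13, 'LOC')]
```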

### gold.spans_from_biluo_tags {#spans_from_biluo_tags tag="function" new="2.1"}

Convert per-token tags following the [BILUO scheme](/api/annotation#biluo) into
[`Span`](/api/span) objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite `doc.ents`.

> #### Example
>
> ```python
> from spacy.gold import spans_from_biluo_tags
>
> doc = nlp("I like London.")
> tags = ["O", "O", "U-LOC", "O"]
> doc.ents = spans_from_biluo_tags(doc, tags)
> ```

| Name        | Type     | Description                                                                                                                                                                                                                 |
| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc`       | `Doc`    | The document that the BILUO tags refer to.                                                                                                                                                                                  |
| `tags`      | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
| **RETURNS** | list     | A sequence of `Span` objects with added entity labels.                                                                                                                                                                      |
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
+								---
 								title: GoldParse
 								teaser: A collection for training annotations
 								tag: class
 								source: spacy/gold.pyx
 								---
 								## GoldParse.\_\_init\_\_ {#init tag="method"}
-												Fix remaining inaccuracies in API docs (closes #2329)

											
										
										
											2019-02-25 00:21:25 +03:00
+								Create a `GoldParse`. Unlike annotations in `entities`, label annotations in
 								`cats` can overlap, i.e. a single word can be covered by multiple labelled
 								spans. The [`TextCategorizer`](/api/textcategorizer) component expects true
 								examples of a label to have the value `1.0`, and negative examples of a label to
 								have the value `0.0`. Labels not in the dictionary are treated as missing – the
 								gradient for those labels will be zero.
-												Auto-format [ci skip]

											
										
										
											2019-02-27 14:07:35 +03:00
+								| Name        | Type        | Description                                                                                                                                                                                                                            |
 								| ----------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `doc`       | `Doc`       | The document the annotations refer to.                                                                                                                                                                                                 |
 								| `words`     | iterable    | A sequence of unicode word strings.                                                                                                                                                                                                    |
 								| `tags`      | iterable    | A sequence of strings, representing tag annotations.                                                                                                                                                                                   |
 								| `heads`     | iterable    | A sequence of integers, representing syntactic head offsets.                                                                                                                                                                           |
 								| `deps`      | iterable    | A sequence of strings, representing the syntactic relation types.                                                                                                                                                                      |
-												💫 Improve handling of missing NER tags (closes #2603) (#3341)

* Improve handling of missing NER tags

GoldParse can accept missing NER tags, if entities is provided
in BILUO format (rather than as spans). Missing tags can be provided
as None values.

Fix bug that occurred when first tag was a None value. Closes #2603.

* Document specification of missing NER tags.

											
										
										
											2019-02-27 14:06:32 +03:00
+								| `entities`  | iterable    | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
-												Auto-format [ci skip]

											
										
										
											2019-02-27 14:07:35 +03:00
+								| `cats`      | dict        | Labels for text classification. Each key in the dictionary may be a string or an int, or a `(start_char, end_char, label)` tuple, indicating that the label is applied to only part of the document (usually a sentence).              |
-												Remove u-strings and fix formatting [ci skip]

											
										
										
											2019-09-12 17:11:15 +03:00
+								| `links`     | dict        | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either 1.0 (positive) or 0.0 (negative).                           |
-												Auto-format [ci skip]

											
										
										
											2019-02-27 14:07:35 +03:00
+								| **RETURNS** | `GoldParse` | The newly constructed object.                                                                                                                                                                                                          |
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
 								## GoldParse.\_\_len\_\_ {#len tag="method"}
 								Get the number of gold-standard tokens.
 								| Name        | Type | Description                         |
 								| ----------- | ---- | ----------------------------------- |
 								| **RETURNS** | int  | The number of gold-standard tokens. |
 								## GoldParse.is_projective {#is_projective tag="property"}
 								Whether the provided syntactic annotations form a projective dependency tree.
 								| Name        | Type | Description                               |
 								| ----------- | ---- | ----------------------------------------- |
 								| **RETURNS** | bool | Whether annotations form projective tree. |
 								## Attributes {#attributes}
-												Documentation for Entity Linking (#4065)

* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bool's instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustements to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts

											
										
										
											2019-09-12 12:38:34 +03:00
+								| Name                                 | Type | Description                                                                                                                                              |
 								| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `words`                              | list | The words.                                                                                                                                               |
 								| `tags`                               | list | The part-of-speech tag annotations.                                                                                                                      |
 								| `heads`                              | list | The syntactic head annotations.                                                                                                                          |
 								| `labels`                             | list | The syntactic relation-type annotations.                                                                                                                 |
 								| `ner`                                | list | The named entity annotations as BILUO tags.                                                                                                              |
 								| `cand_to_gold`                       | list | The alignment from candidate tokenization to gold tokenization.                                                                                          |
 								| `gold_to_cand`                       | list | The alignment from gold tokenization to candidate tokenization.                                                                                          |
 								| `cats` <Tag variant="new">2</Tag>    | list | Entries in the list should be either a label, or a `(start, end, label)` triple. The tuple form is used for categories applied to spans of the document. |
 								| `links` <Tag variant="new">2.2</Tag> | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries.                                 |
-												💫 Update website (#3285)

<!--- Provide a general summary of your changes in the title. -->

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2019-02-17 21:31:19 +03:00
 								## Utilities {#util}
-												Document gold.docs_to_json [ci skip]

											
										
										
											2019-07-10 11:27:33 +03:00
+								### gold.docs_to_json {#docs_to_json tag="function"}
 								Convert a list of Doc objects into the
 								[JSON-serializable format](/api/annotation#json-input) used by the
-												Fix documentation for the docs_to_json function (#4456)


											
										
										
											2019-10-17 00:17:58 +03:00
+								[`spacy train`](/api/cli#train) command. Each input doc will be treated as a 'paragraph' in the output doc.
-												Document gold.docs_to_json [ci skip]

											
										
										
											2019-07-10 11:27:33 +03:00
 								> #### Example
 								>
 								> ```python
 								> from spacy.gold import docs_to_json
 								>
-												Remove u-strings and fix formatting [ci skip]

											
										
										
											2019-09-12 17:11:15 +03:00
+								> doc = nlp("I like London")
-												Document gold.docs_to_json [ci skip]

											
										
										
											2019-07-10 11:27:33 +03:00
+								> json_data = docs_to_json([doc])
 								> ```
 								| Name        | Type             | Description                                |
 								| ----------- | ---------------- | ------------------------------------------ |
 								| `docs`      | iterable / `Doc` | The `Doc` object(s) to convert.            |
 								| `id`        | int              | ID to assign to the JSON. Defaults to `0`. |
-												Fix documentation for the docs_to_json function (#4456)


											
										
										
											2019-10-17 00:17:58 +03:00
+								| **RETURNS** | dict             | The data in spaCy's JSON format.           |
-												Document gold.docs_to_json [ci skip]

											
										
										
											2019-07-10 11:27:33 +03:00
-												Add API documentation

											
										
										
											2019-07-17 15:30:04 +03:00
+								### gold.align {#align tag="function"}
 								Calculate alignment tables between two tokenizations, using the Levenshtein
 								algorithm. The alignment is case-insensitive.
 								<Infobox title="Important note" variant="warning">
 								The current implementation of the alignment algorithm assumes that both
 								tokenizations add up to the same string. For example, you'll be able to align
 								`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
 								`["I", "'m"]` and `["I", "am"]`.
 								</Infobox>
 								> #### Example
 								>
 								> ```python
 								> from spacy.gold import align
 								>
 								> bert_tokens = ["obama", "'", "s", "podcast"]
 								> spacy_tokens = ["obama", "'s", "podcast"]
 								> alignment = align(bert_tokens, spacy_tokens)
 								> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
 								> ```
 								| Name        | Type  | Description                                                                |
 								| ----------- | ----- | -------------------------------------------------------------------------- |
 								| `tokens_a`  | list  | String values of candidate tokens to align.                                |
 								| `tokens_b`  | list  | String values of reference tokens to align.                                |
 								| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |

 								The returned tuple contains the following alignment information:

 								> #### Example
 								>
 								> ```python
 								> a2b = array([0, -1, -1, 2])
 								> b2a = array([0, 2, 3])
 								> a2b_multi = {1: 1, 2: 1}
 								> b2a_multi = {}
 								> ```
 								>
 								> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
 								> there's no one-to-one alignment for a token, it has the value `-1`.
 								| Name        | Type                                   | Description                                                                                                                                     |
 								| ----------- | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `cost`      | int                                    | The number of misaligned tokens.                                                                                                                |
 								| `a2b`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`.                                                                          |
 								| `b2a`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`.                                                                          |
 								| `a2b_multi` | dict                                   | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
 								| `b2a_multi` | dict                                   | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |
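To make the shape of these values more concrete, here is a minimal pure-Python sketch that derives `a2b` and `a2b_multi` from character offsets. It is not the Levenshtein-based implementation used by `gold.align`; it only assumes, as noted above, that both tokenizations add up to the same string. The helper names `char_spans` and `one_way_alignment` are hypothetical and not part of spaCy.

```python
def char_spans(tokens):
    """Return the (start, end) character offsets of each token in the joined string."""
    spans, pos = [], 0
    for token in tokens:
        spans.append((pos, pos + len(token)))
        pos += len(token)
    return spans


def one_way_alignment(tokens_a, tokens_b):
    """Hypothetical helper: map indices in tokens_a to indices in tokens_b."""
    # Case-insensitive check that both tokenizations spell the same string.
    if "".join(tokens_a).lower() != "".join(tokens_b).lower():
        raise ValueError("Both tokenizations must add up to the same string")
    spans_b = char_spans(tokens_b)
    exact = {span: j for j, span in enumerate(spans_b)}
    a2b, a2b_multi = [], {}
    for i, (start, end) in enumerate(char_spans(tokens_a)):
        if (start, end) in exact:
            # Identical character span: one-to-one alignment.
            a2b.append(exact[(start, end)])
            continue
        a2b.append(-1)
        # Otherwise look for a token in tokens_b that fully contains this one.
        for j, (b_start, b_end) in enumerate(spans_b):
            if b_start <= start and end <= b_end:
                a2b_multi[i] = j
                break
    return a2b, a2b_multi


bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
a2b, a2b_multi = one_way_alignment(bert_tokens, spacy_tokens)
```

For the tokens from the example above, this sketch gives `a2b == [0, -1, -1, 2]` and `a2b_multi == {1: 1, 2: 1}`: the tokens `"'"` and `"s"` have no one-to-one partner, but both fall inside `"'s"`.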
 								### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}
 								Encode labelled spans into per-token tags, using the
 								[BILUO scheme](/api/annotation#biluo) (Begin, In, Last, Unit, Out). Returns a
 								list of unicode strings describing the tags. Each tag string takes one of the
 								forms `""`, `"O"` or `"{action}-{label}"`, where `action` is one of
 								`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
 								don't align with the tokenization in the `Doc` object. The training algorithm
 								will view these as missing values. `O` denotes a non-entity token. `B` denotes
 								the beginning of a multi-token entity, `I` the inside of an entity of three or
 								more tokens, and `L` the end of an entity of two or more tokens. `U` denotes a
 								single-token entity.
 								> #### Example
 								>
 								> ```python
 								> from spacy.gold import biluo_tags_from_offsets
 								>
 								> doc = nlp("I like London.")
 								> entities = [(7, 13, "LOC")]
 								> tags = biluo_tags_from_offsets(doc, entities)
 								> assert tags == ["O", "O", "U-LOC", "O"]
 								> ```
 								| Name        | Type     | Description                                                                                                                                     |
 								| ----------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `doc`       | `Doc`    | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document.                          |
 								| `entities`  | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. |
 								| **RETURNS** | list     | Unicode strings, describing the [BILUO](/api/annotation#biluo) tags.                                                                            |
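The tagging rules described above can be sketched without a `Doc` object. The following is an illustrative approximation, not spaCy's implementation: the hypothetical helper `biluo_from_char_spans` takes the `(start, end)` character offsets of each token directly, whereas `biluo_tags_from_offsets` reads them from the `Doc`.

```python
def biluo_from_char_spans(token_spans, entities):
    """Hypothetical helper: tag tokens, given as (start, end) character spans."""
    starts = {start: i for i, (start, _) in enumerate(token_spans)}
    ends = {end: i for i, (_, end) in enumerate(token_spans)}
    # Start with "-" (missing value) everywhere and fill in what we can.
    tags = ["-"] * len(token_spans)
    for start, end, label in entities:
        first, last = starts.get(start), ends.get(end)
        if first is None or last is None:
            continue  # offsets don't line up with token boundaries
        if first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    # Tokens that share no character with any entity are tagged "O".
    covered = set()
    for start, end, _ in entities:
        covered.update(range(start, end))
    for i, (t_start, t_end) in enumerate(token_spans):
        if tags[i] == "-" and not covered.intersection(range(t_start, t_end)):
            tags[i] = "O"
    return tags


# Character spans of the tokens "I", "like", "London", "." in "I like London."
token_spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
aligned = biluo_from_char_spans(token_spans, [(7, 13, "LOC")])
misaligned = biluo_from_char_spans(token_spans, [(7, 10, "LOC")])
```

Here `aligned` is `["O", "O", "U-LOC", "O"]`, matching the example above. In `misaligned`, the span ends in the middle of `"London"`, so that token keeps the `"-"` tag and the result is `["O", "O", "-", "O"]`.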
 								### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}
 								Convert per-token tags following the [BILUO scheme](/api/annotation#biluo) into
 								entity offsets.
 								> #### Example
 								>
 								> ```python
 								> from spacy.gold import offsets_from_biluo_tags
 								>
 								> doc = nlp("I like London.")
 								> tags = ["O", "O", "U-LOC", "O"]
 								> entities = offsets_from_biluo_tags(doc, tags)
 								> assert entities == [(7, 13, "LOC")]
 								> ```
 								| Name        | Type     | Description                                                                                                                                                                                                                 |
 								| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `doc`       | `Doc`    | The document that the BILUO tags refer to.                                                                                                                                                                                  |
 								| `tags`      | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
 								| **RETURNS** | list     | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string.                                                                               |
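The reverse direction can be sketched in the same way. This is a simplified illustration under the same assumption as before, namely that token boundaries are passed in as `(start, end)` character spans instead of a `Doc`. The helper name `offsets_from_tags` is hypothetical, and the sketch silently skips malformed tag sequences.

```python
def offsets_from_tags(token_spans, tags):
    """Hypothetical helper: turn BILUO tags into (start, end, label) triples."""
    entities = []
    start, label = None, None  # state of the entity currently being read
    for (t_start, t_end), tag in zip(token_spans, tags):
        if tag in ("", "O", "-"):
            continue  # missing value or non-entity token
        action, _, tag_label = tag.partition("-")
        if action == "U":
            # Single-token entity.
            entities.append((t_start, t_end, tag_label))
        elif action == "B":
            # Remember where the multi-token entity begins.
            start, label = t_start, tag_label
        elif action == "L" and start is not None:
            # Close the entity at the end of its last token.
            entities.append((start, t_end, label))
            start, label = None, None
        # "I" tokens lie inside an open entity and need no bookkeeping.
    return entities


# Character spans of the tokens "I", "like", "London", "." in "I like London."
token_spans = [(0, 1), (2, 6), (7, 13), (13, 14)]
entities = offsets_from_tags(token_spans, ["O", "O", "U-LOC", "O"])
```

For the tags from the example above, `entities` is `[(7, 13, "LOC")]`.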
 								### gold.spans_from_biluo_tags {#spans_from_biluo_tags tag="function" new="2.1"}
 								Convert per-token tags following the [BILUO scheme](/api/annotation#biluo) into
 								[`Span`](/api/span) objects. This can be used to create entity spans from
 								token-based tags, e.g. to overwrite the `doc.ents`.
 								> #### Example
 								>
 								> ```python
 								> from spacy.gold import spans_from_biluo_tags
 								>
 								> doc = nlp("I like London.")
 								> tags = ["O", "O", "U-LOC", "O"]
 								> doc.ents = spans_from_biluo_tags(doc, tags)
 								> ```
 								| Name        | Type     | Description                                                                                                                                                                                                                 |
 								| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 								| `doc`       | `Doc`    | The document that the BILUO tags refer to.                                                                                                                                                                                  |
 								| `tags`      | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. |
 								| **RETURNS** | list     | A sequence of `Span` objects with added entity labels.                                                                                                                                                                      |