remove old alignment information in top-level

svlandeg 2020-08-03 22:06:35 +02:00
parent 01f9c1d06e
commit c66481a699
2 changed files with 37 additions and 77 deletions


@@ -278,33 +278,46 @@ Split one `Example` into multiple `Example` objects, one for each sentence.
| ----------- | --------------- | ---------------------------------------------------------- |
| **RETURNS** | `List[Example]` | List of `Example` objects, one for each original sentence. |
## Alignment {#alignment-object}
## Alignment {#alignment-object new="3"}
An `Alignment` object aligns the tokens of the reference document to the tokens
in the document holding the predictions. It is stored in
[`example.alignment`](#alignment).
Calculate alignment tables between two tokenizations.
<!-- TODO: document `from_indices` and `from_strings`, or keep this as internal
implementation detail? -->
> #### Example
>
> ```python
> other_tokens = ["i listened to", "obama", "'", "s", "podcasts", "."]
> spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."]
> predicted = Doc(vocab, words=other_tokens, spaces=[True, False, False, True, False, False])
> reference = Doc(vocab, words=spacy_tokens, spaces=[True, True, True, False, True, False])
> example = Example(predicted, reference)
> align = example.alignment
> assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1]
> assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5]
> assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2]
> assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5]
> ```
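
The `lengths` and `dataXd` arrays asserted in the example can be read as a flattened list of lists: `lengths[i]` says how many consecutive entries of `dataXd` belong to token `i`. As a minimal sketch in plain Python (the helper `unflatten` is hypothetical and not part of spaCy or Thinc), the per-token alignments could be recovered like this:

```python
from typing import List


def unflatten(lengths: List[int], data: List[int]) -> List[List[int]]:
    """Split a flat `data` list into rows whose sizes are given by `lengths`.

    Hypothetical helper for illustration only; it mirrors how the
    `lengths` / `dataXd` pair in the example is laid out.
    """
    rows = []
    start = 0
    for n in lengths:
        rows.append(data[start:start + n])
        start += n
    return rows


# Values copied from the assertions in the example.
x2y_rows = unflatten([3, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 4, 5, 5])
y2x_rows = unflatten([1, 1, 1, 1, 2, 2], [0, 0, 0, 1, 2, 3, 4, 5])
print(x2y_rows)
print(y2x_rows)
```

Read this way, the first token of `x` (`"i listened to"`) maps to tokens 0, 1 and 2 of `y`, while the last two tokens of `y` each map to two tokens of `x`.
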
### Attributes {#alignment-attributes}
### Alignment attributes {#alignment-attributes}
| Name | Type | Description |
| ----- | -------------------------------------------------- | ---------------------------------------------------------- |
| `x2y` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `x` to `y`. |
| `y2x` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `y` to `x`. |
<Infobox title="Important note" variant="warning">
The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.
</Infobox>
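
The precondition in the note above can be checked before attempting an alignment. The following is only a sketch: `same_text` is a hypothetical helper, and it assumes an exact, case-sensitive comparison of the concatenated tokens, which may be stricter than what the library requires.

```python
from typing import List


def same_text(tokens_a: List[str], tokens_b: List[str]) -> bool:
    """Return True if both tokenizations concatenate to the same string.

    Hypothetical helper, not part of spaCy. Assumes a case-sensitive
    comparison with no whitespace normalization.
    """
    return "".join(tokens_a) == "".join(tokens_b)


# Both tokenizations add up to "I'm", so they can be aligned.
print(same_text(["I", "'", "m"], ["I", "'m"]))
# "I'm" and "Iam" differ, so these cannot be aligned.
print(same_text(["I", "'m"], ["I", "am"]))
```
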
> #### Example
>
> ```python
> from spacy.gold import Alignment
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
> a2b = alignment.x2y
> assert list(a2b.dataXd) == [0, 1, 1, 2]
> ```
>
> If `a2b.dataXd[1] == a2b.dataXd[2] == 1`, that means that `A[1]` (`"'"`) and `A[2]` (`"s"`) both align to `B[1]` (`"'s"`).
### Alignment.from_strings {#from_strings tag="classmethod"}
| Name | Type | Description |
| ----------- | ----------- | ----------------------------------------------- |
| `A` | list | String values of candidate tokens to align. |
| `B` | list | String values of reference tokens to align. |
| **RETURNS** | `Alignment` | An `Alignment` object describing the alignment. |
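
To make the `x2y` output in the example more concrete, here is a character-offset sketch that reproduces `[0, 1, 1, 2]` for the same inputs. It is an illustration under stated assumptions, not how `Alignment.from_strings` is implemented: the function names are hypothetical, tokens are assumed to be non-empty, and both tokenizations must concatenate to exactly the same string.

```python
from typing import List, Tuple


def char_to_token(tokens: List[str]) -> List[int]:
    """Map each character position of the joined string to its token index."""
    return [i for i, token in enumerate(tokens) for _ in token]


def align_tokens(tokens_a: List[str], tokens_b: List[str]) -> Tuple[List[int], List[int]]:
    """Return `(lengths, data)` describing the alignment from A to B.

    Hypothetical sketch: `lengths[i]` is the number of B tokens that
    A[i] overlaps, and `data` lists those B indices back to back.
    """
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("Both tokenizations must add up to the same string")
    b_index = char_to_token(tokens_b)
    lengths: List[int] = []
    data: List[int] = []
    offset = 0
    for token in tokens_a:
        # Collect the distinct B tokens covered by this token's characters.
        targets: List[int] = []
        for j in b_index[offset:offset + len(token)]:
            if not targets or targets[-1] != j:
                targets.append(j)
        lengths.append(len(targets))
        data.extend(targets)
        offset += len(token)
    return lengths, data


bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
a2b_lengths, a2b_data = align_tokens(bert_tokens, spacy_tokens)
b2a_lengths, b2a_data = align_tokens(spacy_tokens, bert_tokens)
print(a2b_lengths, a2b_data)
print(b2a_lengths, b2a_data)
```

Swapping the arguments gives the reverse direction, where `"'s"` in `B` overlaps both `"'"` and `"s"` in `A`.
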


@@ -353,59 +353,6 @@ Convert a list of Doc objects into the
| `id` | int | ID to assign to the JSON. Defaults to `0`. |
| **RETURNS** | dict | The data in spaCy's JSON format. |
### gold.align {#align tag="function"}
Calculate alignment tables between two tokenizations, using the Levenshtein
algorithm. The alignment is case-insensitive.
<Infobox title="Important note" variant="warning">
The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.
</Infobox>
> #### Example
>
> ```python
> from spacy.gold import align
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = align(bert_tokens, spacy_tokens)
> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
> ```
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------------------------------- |
| `tokens_a` | list | String values of candidate tokens to align. |
| `tokens_b` | list | String values of reference tokens to align. |
| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |
The returned tuple contains the following alignment information:
> #### Example
>
> ```python
> a2b = array([0, -1, -1, 2])
> b2a = array([0, 2, 3])
> a2b_multi = {1: 1, 2: 1}
> b2a_multi = {}
> ```
>
> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
> there's no one-to-one alignment for a token, it has the value `-1`.
| Name | Type | Description |
| ----------- | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `cost` | int | The number of misaligned tokens. |
| `a2b` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`. |
| `b2a` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`. |
| `a2b_multi` | dict | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
| `b2a_multi` | dict | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |
### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}
Encode labelled spans into per-token tags, using the