mirror of
https://github.com/explosion/spaCy.git
synced 2025-02-23 23:20:52 +03:00
remove old alignment information in top-level
parent
01f9c1d06e
commit
c66481a699
@@ -278,33 +278,46 @@ Split one `Example` into multiple `Example` objects, one for each sentence.
| ----------- | --------------- | ---------------------------------------------------------- |
| **RETURNS** | `List[Example]` | List of `Example` objects, one for each original sentence. |

## Alignment {#alignment-object}
## Alignment {#alignment-object new="3"}

An `Alignment` object aligns the tokens of the reference document to the tokens
in the document holding the predictions. It is stored in
[`example.alignment`](#alignment).
Calculate alignment tables between two tokenizations.

<!-- TODO: document `from_indices` and `from_strings`, or keep this as internal
implementation detail? -->

> #### Example
>
> ```python
> other_tokens = ["i listened to", "obama", "'", "s", "podcasts", "."]
> spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."]
> predicted = Doc(vocab, words=other_tokens, spaces=[True, False, False, True, False, False])
> reference = Doc(vocab, words=spacy_tokens, spaces=[True, True, True, False, True, False])
> example = Example(predicted, reference)
> align = example.alignment
> assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1]
> assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5]
> assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2]
> assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5]
> ```

### Attributes {#alignment-attributes}
### Alignment attributes {#alignment-attributes}

| Name | Type | Description |
| ----- | -------------------------------------------------- | ---------------------------------------------------------- |
| `x2y` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `x` to `y`. |
| `y2x` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `y` to `x`. |
| `y2x` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `y` to `x`. |

<Infobox title="Important note" variant="warning">

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.

</Infobox>
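
The precondition described in the note can be checked before attempting an alignment. The following is a minimal sketch: `same_text` is a hypothetical helper, not part of spaCy, and it assumes a plain string comparison. spaCy's own check may treat whitespace or case differently.

```python
from typing import List


def same_text(tokens_a: List[str], tokens_b: List[str]) -> bool:
    """Return True if both tokenizations concatenate to the same string.

    Hypothetical helper for illustration only; not a spaCy function.
    """
    return "".join(tokens_a) == "".join(tokens_b)


# Both tokenizations add up to "I'm", so they can be aligned.
alignable = same_text(["I", "'", "m"], ["I", "'m"])

# "I'm" versus "Iam": the strings differ, so alignment is not possible.
not_alignable = same_text(["I", "'m"], ["I", "am"])
```

Running a check like this first gives a clearer failure than an alignment that cannot be computed.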

> #### Example
>
> ```python
> from spacy.gold import Alignment
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
> a2b = alignment.x2y
> assert list(a2b.dataXd) == [0, 1, 1, 2]
> ```
>
> If `a2b.dataXd[1] == a2b.dataXd[2] == 1`, that means that `A[1]` (`"'"`) and `A[2]` (`"s"`) both align to `B[1]` (`"'s"`).
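
To make the `lengths` and `dataXd` layout in the examples more concrete, here is a rough pure-Python sketch of a character-based alignment. It is not spaCy's implementation: `char_owners` and `align_tokens` are hypothetical names, whitespace inside tokens is ignored by assumption, and the result is returned as two plain lists rather than a `Ragged` object.

```python
from typing import List, Tuple


def char_owners(tokens: List[str]) -> List[int]:
    """For every non-whitespace character, record the index of its token."""
    owners: List[int] = []
    for i, token in enumerate(tokens):
        owners.extend([i] * len("".join(token.split())))
    return owners


def align_tokens(x: List[str], y: List[str]) -> Tuple[List[int], List[int]]:
    """Return (lengths, data) describing which tokens of y each x token overlaps.

    lengths[i] is the number of y tokens aligned to x[i]; data is the
    flattened list of those y indices, in order.
    """
    text_x = "".join("".join(t.split()) for t in x)
    text_y = "".join("".join(t.split()) for t in y)
    if text_x != text_y:
        raise ValueError("Tokenizations do not add up to the same string")
    lengths = [0] * len(x)
    data: List[int] = []
    previous = None
    # Walk both tokenizations character by character and record each
    # new (x token, y token) pair the first time it appears.
    for pair in zip(char_owners(x), char_owners(y)):
        if pair != previous:
            lengths[pair[0]] += 1
            data.append(pair[1])
            previous = pair
    return lengths, data


bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
x2y_lengths, x2y_data = align_tokens(bert_tokens, spacy_tokens)
y2x_lengths, y2x_data = align_tokens(spacy_tokens, bert_tokens)
```

Here `x2y_data` comes out as `[0, 1, 1, 2]`, matching the `dataXd` values in the example, and `y2x_lengths` is `[1, 2, 1]` because `"'s"` covers two of the other tokens.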

### Alignment.from_strings {#classmethod tag="function"}

| Name | Type | Description |
| ----------- | ----------- | ----------------------------------------------- |
| `A` | list | String values of candidate tokens to align. |
| `B` | list | String values of reference tokens to align. |
| **RETURNS** | `Alignment` | An `Alignment` object describing the alignment. |

@@ -353,59 +353,6 @@ Convert a list of Doc objects into the
| `id` | int | ID to assign to the JSON. Defaults to `0`. |
| **RETURNS** | dict | The data in spaCy's JSON format. |

### gold.align {#align tag="function"}

Calculate alignment tables between two tokenizations, using the Levenshtein
algorithm. The alignment is case-insensitive.

<Infobox title="Important note" variant="warning">

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.

</Infobox>

> #### Example
>
> ```python
> from spacy.gold import align
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = align(bert_tokens, spacy_tokens)
> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
> ```

| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------------------------------- |
| `tokens_a` | list | String values of candidate tokens to align. |
| `tokens_b` | list | String values of reference tokens to align. |
| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |

The returned tuple contains the following alignment information:

> #### Example
>
> ```python
> a2b = array([0, -1, -1, 2])
> b2a = array([0, 2, 3])
> a2b_multi = {1: 1, 2: 1}
> b2a_multi = {}
> ```
>
> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
> there's no one-to-one alignment for a token, it has the value `-1`.

| Name | Type | Description |
| ----------- | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `cost` | int | The number of misaligned tokens. |
| `a2b` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`. |
| `b2a` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`. |
| `a2b_multi` | dict | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
| `b2a_multi` | dict | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |

### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}

Encode labelled spans into per-token tags, using the