Correct alignment example and documentation (#11491)

* Correct example and documentation

* Added altered example.md

* Changes based on review + apply prettier

* Remove unnecessary 'the'

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

Richard Hudson 2022-09-14 09:36:55 +02:00 committed by GitHub
parent 6be6913ba5
commit 3f0c3ad7d3
2 changed files with 15 additions and 11 deletions

@@ -286,10 +286,14 @@ Calculate alignment tables between two tokenizations.
 ### Alignment attributes {#alignment-attributes"}
+Alignment attributes are managed using `AlignmentArray`, which is a
+simplified version of Thinc's [Ragged](https://thinc.ai/docs/api-types#ragged)
+type that only supports the `data` and `length` attributes.
-| Name | Description |
-| ----- | --------------------------------------------------------------------- |
-| `x2y` | The `Ragged` object holding the alignment from `x` to `y`. ~~Ragged~~ |
-| `y2x` | The `Ragged` object holding the alignment from `y` to `x`. ~~Ragged~~ |
+| Name | Description |
+| ----- | ------------------------------------------------------------------------------------- |
+| `x2y` | The `AlignmentArray` object holding the alignment from `x` to `y`. ~~AlignmentArray~~ |
+| `y2x` | The `AlignmentArray` object holding the alignment from `y` to `x`. ~~AlignmentArray~~ |
 <Infobox title="Important note" variant="warning">
@@ -309,10 +313,10 @@ tokenizations add up to the same string. For example, you'll be able to align
 > spacy_tokens = ["obama", "'s", "podcast"]
 > alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
 > a2b = alignment.x2y
-> assert list(a2b.dataXd) == [0, 1, 1, 2]
+> assert list(a2b.data) == [0, 1, 1, 2]
 > ```
 >
-> If `a2b.dataXd[1] == a2b.dataXd[2] == 1`, that means that `A[1]` (`"'"`) and
+> If `a2b.data[1] == a2b.data[2] == 1`, that means that `A[1]` (`"'"`) and
 > `A[2]` (`"s"`) both align to `B[1]` (`"'s"`).
 ### Alignment.from_strings {#classmethod tag="function"}
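
The documentation changed above describes `AlignmentArray` as a simplified `Ragged`: a flat `data` array plus a per-token count, which the examples in this diff read from `lengths`. The sketch below shows how those two pieces could be combined to recover one row per token. It is plain Python with no spaCy dependency; `rows_from_flat` is a hypothetical helper, and the `b2a_*` values were worked out by hand on the assumption that `bert_tokens` is `["obama", "'", "s", "podcast"]` (that line is outside the hunk).

```python
from itertools import accumulate


def rows_from_flat(data, lengths):
    """Split a flat `data` sequence into one row per token, guided by `lengths`.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    ends = list(accumulate(lengths))   # exclusive end offset of each row
    starts = [0] + ends[:-1]           # start offset of each row
    return [list(data[start:end]) for start, end in zip(starts, ends)]


# Assumed y -> x values for ["obama", "'s", "podcast"] against
# ["obama", "'", "s", "podcast"]: the token "'s" covers two source tokens.
b2a_data = [0, 1, 2, 3]
b2a_lengths = [1, 2, 1]

rows = rows_from_flat(b2a_data, b2a_lengths)
print(rows)  # [[0], [1, 2], [3]]
```

Keeping the mapping flat avoids nested arrays of uneven length; the counts are only needed when a token maps to more or fewer than one token on the other side.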

@@ -1422,9 +1422,9 @@ other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
 spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
 align = Alignment.from_strings(other_tokens, spacy_tokens)
 print(f"a -> b, lengths: {align.x2y.lengths}") # array([1, 1, 1, 1, 1, 1, 1, 1])
-print(f"a -> b, mapping: {align.x2y.dataXd}") # array([0, 1, 2, 3, 4, 4, 5, 6]) : two tokens both refer to "'s"
+print(f"a -> b, mapping: {align.x2y.data}") # array([0, 1, 2, 3, 4, 4, 5, 6]) : two tokens both refer to "'s"
 print(f"b -> a, lengths: {align.y2x.lengths}") # array([1, 1, 1, 1, 2, 1, 1]) : the token "'s" refers to two tokens
-print(f"b -> a, mappings: {align.y2x.dataXd}") # array([0, 1, 2, 3, 4, 5, 6, 7])
+print(f"b -> a, mappings: {align.y2x.data}") # array([0, 1, 2, 3, 4, 5, 6, 7])
 ```
 Here are some insights from the alignment information generated in the example
@@ -1433,10 +1433,10 @@ above:
 - The one-to-one mappings for the first four tokens are identical, which means
   they map to each other. This makes sense because they're also identical in the
   input: `"i"`, `"listened"`, `"to"` and `"obama"`.
-- The value of `x2y.dataXd[6]` is `5`, which means that `other_tokens[6]`
+- The value of `x2y.data[6]` is `5`, which means that `other_tokens[6]`
   (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
-- `x2y.dataXd[4]` and `x2y.dataXd[5]` are both `4`, which means that both tokens
-  4 and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens`
+- `x2y.data[4]` and `x2y.data[5]` are both `4`, which means that both tokens 4
+  and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens`
   (`"'s"`).
 <Infobox title="Important note" variant="warning">
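
To see where the numbers in the corrected example come from, the mapping can be reproduced without spaCy by walking both tokenizations character by character. This is a minimal sketch, not spaCy's implementation: `char_owner` and `align_flat` are hypothetical names, and the code assumes both token lists concatenate to exactly the same string (no lowercasing or whitespace handling).

```python
def char_owner(tokens):
    """Return, for each character of the joined string, the index of its token."""
    return [i for i, token in enumerate(tokens) for _ in token]


def align_flat(x_tokens, y_tokens):
    """Return (data, lengths) describing how each x token maps onto y tokens.

    Illustrative only; assumes both tokenizations join to the same string.
    """
    if "".join(x_tokens) != "".join(y_tokens):
        raise ValueError("tokenizations must add up to the same string")
    groups = [[] for _ in x_tokens]
    for xi, yi in zip(char_owner(x_tokens), char_owner(y_tokens)):
        # Record each y token once per x token; indices only ever increase.
        if not groups[xi] or groups[xi][-1] != yi:
            groups[xi].append(yi)
    data = [yi for group in groups for yi in group]
    lengths = [len(group) for group in groups]
    return data, lengths


other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]

x2y_data, x2y_lengths = align_flat(other_tokens, spacy_tokens)
y2x_data, y2x_lengths = align_flat(spacy_tokens, other_tokens)

print(x2y_data)     # [0, 1, 2, 3, 4, 4, 5, 6]
print(x2y_lengths)  # [1, 1, 1, 1, 1, 1, 1, 1]
print(y2x_data)     # [0, 1, 2, 3, 4, 5, 6, 7]
print(y2x_lengths)  # [1, 1, 1, 1, 2, 1, 1]
```

The printed lists match the values in the example's comments: `"'"` and `"s"` both map to index 4 on the spaCy side, and `"'s"` is the only token with a length of 2 in the reverse direction.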