remove old alignment information in top-level

2025-07-17 11:42:30 +03:00 · 2020-08-03 22:06:35 +02:00 · 2020-08-03 22:06:35 +02:00 · c66481a699
commit c66481a699
parent 01f9c1d06e
2 changed files with 37 additions and 77 deletions
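
Note on the `Ragged` layout documented in this diff: `x2y` and `y2x` each hold a flat `dataXd` array plus one `lengths` entry per token. The plain-Python sketch below only illustrates that layout and is not spaCy's implementation. `char_to_token` and `align_one_way` are hypothetical helper names, and dropping spaces inside tokens before comparing is an assumption. For the `bert_tokens` / `spacy_tokens` pair it yields the same `dataXd` values as the new `from_strings` example.

```python
from typing import List, Tuple


def char_to_token(tokens: List[str]) -> List[int]:
    """Map every non-space character position to the index of its token."""
    mapping: List[int] = []
    for i, token in enumerate(tokens):
        mapping.extend([i] * len(token.replace(" ", "")))
    return mapping


def align_one_way(a: List[str], b: List[str]) -> Tuple[List[int], List[int]]:
    """Return (lengths, data) in a Ragged-like layout for the a -> b direction.

    Hypothetical helper: assumes both tokenizations add up to the same
    string once spaces are removed, and raises ValueError otherwise.
    """
    if "".join(a).replace(" ", "") != "".join(b).replace(" ", ""):
        raise ValueError("tokenizations do not add up to the same string")
    b_index = char_to_token(b)
    lengths: List[int] = []
    data: List[int] = []
    pos = 0
    for token in a:
        n = len(token.replace(" ", ""))
        targets: List[int] = []
        # Collect the distinct b-token indices covering this a-token, in order.
        for j in b_index[pos:pos + n]:
            if not targets or targets[-1] != j:
                targets.append(j)
        lengths.append(len(targets))
        data.extend(targets)
        pos += n
    return lengths, data


bert_tokens = ["obama", "'", "s", "podcast"]
spacy_tokens = ["obama", "'s", "podcast"]
x2y_lengths, x2y_data = align_one_way(bert_tokens, spacy_tokens)
y2x_lengths, y2x_data = align_one_way(spacy_tokens, bert_tokens)
```

Reading the result: `x2y_lengths[i]` says how many entries of `x2y_data` belong to token `i` of the first tokenization, so a token that spans several tokens on the other side simply gets a longer slice.
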
--- a/website/docs/api/example.md
+++ b/website/docs/api/example.md
@@ -278,33 +278,46 @@ Split one `Example` into multiple `Example` objects, one for each sentence.
 | ----------- | --------------- | ---------------------------------------------------------- |
 | **RETURNS** | `List[Example]` | List of `Example` objects, one for each original sentence. |

-## Alignment {#alignment-object}
+## Alignment {#alignment-object new="3"}

-An `Alignment` object aligns the tokens of the reference document to the tokens
-in the document holding the predictions. It is stored in
-[`example.alignment`](#alignment).
+Calculate alignment tables between two tokenizations.

-<!-- TODO: document `from_indices` and `from_strings`, or keep this as internal
-implementation detail? -->
-
-> #### Example
->
-> ```python
-> other_tokens = ["i listened to", "obama", "'", "s", "podcasts", "."]
-> spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."]
-> predicted = Doc(vocab, words=other_tokens, spaces=[True, False, False, True, False, False])
-> reference = Doc(vocab, words=spacy_tokens, spaces=[True, True, True, False, True, False])
-> example = Example(predicted, reference)
-> align = example.alignment
-> assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1]
-> assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5]
-> assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2]
-> assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5]
-> ```
-
-### Attributes {#alignment-attributes}
+### Alignment attributes {#alignment-attributes}

 | Name  | Type                                               | Description                                                |
 | ----- | -------------------------------------------------- | ---------------------------------------------------------- |
 | `x2y` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `x` to `y`. |
-| `y2x` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `y` to `x`. |
+| `y2x` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `y` to `x`. |
+
+
+<Infobox title="Important note" variant="warning">
+
+The current implementation of the alignment algorithm assumes that both
+tokenizations add up to the same string. For example, you'll be able to align
+`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
+`["I", "'m"]` and `["I", "am"]`.
+
+</Infobox>
+
+> #### Example
+>
+> ```python
+> from spacy.gold import Alignment
+>
+> bert_tokens = ["obama", "'", "s", "podcast"]
+> spacy_tokens = ["obama", "'s", "podcast"]
+> alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
+> a2b = alignment.x2y
+> assert list(a2b.dataXd) == [0, 1, 1, 2]
+> ```
+>
+> Here `a2b.dataXd[1] == a2b.dataXd[2] == 1`, which means that `bert_tokens[1]` (`"'"`) and `bert_tokens[2]` (`"s"`) both align to `spacy_tokens[1]` (`"'s"`).
+
+### Alignment.from_strings {#from_strings tag="classmethod"}
+
+| Name        | Type        | Description                                     |
+| ----------- | ----------- | ----------------------------------------------- |
+| `A`         | list        | String values of candidate tokens to align.     |
+| `B`         | list        | String values of reference tokens to align.     |
+| **RETURNS** | `Alignment` | An `Alignment` object describing the alignment. |
+
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -353,59 +353,6 @@ Convert a list of Doc objects into the
 | `id`        | int              | ID to assign to the JSON. Defaults to `0`. |
 | **RETURNS** | dict             | The data in spaCy's JSON format.           |

-### gold.align {#align tag="function"}
-
-Calculate alignment tables between two tokenizations, using the Levenshtein
-algorithm. The alignment is case-insensitive.
-
-<Infobox title="Important note" variant="warning">
-
-The current implementation of the alignment algorithm assumes that both
-tokenizations add up to the same string. For example, you'll be able to align
-`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
-`["I", "'m"]` and `["I", "am"]`.
-
-</Infobox>
-
-> #### Example
->
-> ```python
-> from spacy.gold import align
->
-> bert_tokens = ["obama", "'", "s", "podcast"]
-> spacy_tokens = ["obama", "'s", "podcast"]
-> alignment = align(bert_tokens, spacy_tokens)
-> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
-> ```
-
-| Name        | Type  | Description                                                                |
-| ----------- | ----- | -------------------------------------------------------------------------- |
-| `tokens_a`  | list  | String values of candidate tokens to align.                                |
-| `tokens_b`  | list  | String values of reference tokens to align.                                |
-| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |
-
-The returned tuple contains the following alignment information:
-
-> #### Example
->
-> ```python
-> a2b = array([0, -1, -1, 2])
-> b2a = array([0, 2, 3])
-> a2b_multi = {1: 1, 2: 1}
-> b2a_multi = {}
-> ```
->
-> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
-> there's no one-to-one alignment for a token, it has the value `-1`.
-
-| Name        | Type                                   | Description                                                                                                                                     |
-| ----------- | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
-| `cost`      | int                                    | The number of misaligned tokens.                                                                                                                |
-| `a2b`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`.                                                                          |
-| `b2a`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`.                                                                          |
-| `a2b_multi` | dict                                   | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
-| `b2a_multi` | dict                                   | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |
-
 ### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}

 Encode labelled spans into per-token tags, using the