diff --git a/website/docs/api/example.md b/website/docs/api/example.md
index 0fe56da9c..d3f61c7e2 100644
--- a/website/docs/api/example.md
+++ b/website/docs/api/example.md
@@ -278,33 +278,46 @@ Split one `Example` into multiple `Example` objects, one for each sentence.
 | ----------- | --------------- | ---------------------------------------------------------- |
 | **RETURNS** | `List[Example]` | List of `Example` objects, one for each original sentence. |
 
-## Alignment {#alignment-object}
+## Alignment {#alignment-object new="3"}
 
-An `Alignment` object aligns the tokens of the reference document to the tokens
-in the document holding the predictions. It is stored in
-[`example.alignment`](#alignment).
+Calculate alignment tables between two tokenizations.
 
-
-
-> #### Example
->
-> ```python
-> other_tokens = ["i listened to", "obama", "'", "s", "podcasts", "."]
-> spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."]
-> predicted = Doc(vocab, words=other_tokens, spaces=[True, False, False, True, False, False])
-> reference = Doc(vocab, words=spacy_tokens, spaces=[True, True, True, False, True, False])
-> example = Example(predicted, reference)
-> align = example.alignment
-> assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1]
-> assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5]
-> assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2]
-> assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5]
-> ```
-
-### Attributes {#alignment-attributes}
+### Alignment attributes {#alignment-attributes"}
 
 | Name | Type | Description |
 | ----- | -------------------------------------------------- | ---------------------------------------------------------- |
 | `x2y` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `x` to `y`. |
-| `y2x` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `y` to `x`. |
\ No newline at end of file
+| `y2x` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `y` to `x`. |
+
+
+
+
+The current implementation of the alignment algorithm assumes that both
+tokenizations add up to the same string. For example, you'll be able to align
+`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
+`["I", "'m"]` and `["I", "am"]`.
+
+
+
+> #### Example
+>
+> ```python
+> from spacy.gold import Alignment
+>
+> bert_tokens = ["obama", "'", "s", "podcast"]
+> spacy_tokens = ["obama", "'s", "podcast"]
+> alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
+> a2b = alignment.x2y
+> assert list(a2b.dataXd) == [0, 1, 1, 2]
+> ```
+>
+> If `a2b.dataXd[1] == a2b.dataXd[2] == 1`, that means that `A[1]` (`"'"`) and `A[2]` (`"s"`) both align to `B[1]` (`"'s"`).
+
+### Alignment.from_strings {#classmethod tag="function"}
+
+| Name | Type | Description |
+| ----------- | ----------- | ----------------------------------------------- |
+| `A` | list | String values of candidate tokens to align. |
+| `B` | list | String values of reference tokens to align. |
+| **RETURNS** | `Alignment` | An `Alignment` object describing the alignment. |
+
diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index ede7f9e21..68158645d 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -353,59 +353,6 @@ Convert a list of Doc objects into the
 | `id` | int | ID to assign to the JSON. Defaults to `0`. |
 | **RETURNS** | dict | The data in spaCy's JSON format. |
 
-### gold.align {#align tag="function"}
-
-Calculate alignment tables between two tokenizations, using the Levenshtein
-algorithm. The alignment is case-insensitive.
-
-
-The current implementation of the alignment algorithm assumes that both
-tokenizations add up to the same string. For example, you'll be able to align
-`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
-`["I", "'m"]` and `["I", "am"]`.
-
-
-
-> #### Example
->
-> ```python
-> from spacy.gold import align
->
-> bert_tokens = ["obama", "'", "s", "podcast"]
-> spacy_tokens = ["obama", "'s", "podcast"]
-> alignment = align(bert_tokens, spacy_tokens)
-> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
-> ```
-
-| Name | Type | Description |
-| ----------- | ----- | -------------------------------------------------------------------------- |
-| `tokens_a` | list | String values of candidate tokens to align. |
-| `tokens_b` | list | String values of reference tokens to align. |
-| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |
-
-The returned tuple contains the following alignment information:
-
-> #### Example
->
-> ```python
-> a2b = array([0, -1, -1, 2])
-> b2a = array([0, 2, 3])
-> a2b_multi = {1: 1, 2: 1}
-> b2a_multi = {}
-> ```
->
-> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
-> there's no one-to-one alignment for a token, it has the value `-1`.
-
-| Name | Type | Description |
-| ----------- | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
-| `cost` | int | The number of misaligned tokens. |
-| `a2b` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`. |
-| `b2a` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`. |
-| `a2b_multi` | dict | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
-| `b2a_multi` | dict | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |
-
 ### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}
 
 Encode labelled spans into per-token tags, using the