update gold.align explanation in linguistic features

This commit is contained in:
svlandeg 2020-08-03 18:15:36 +02:00
parent 35946783c4
commit f846245936


In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
apply them to spaCy tokens. spaCy's
[`Alignment`](/api/example#alignment-object) object gives you the one-to-one
mappings of token indices in both directions, and also records the indices
where multiple tokens align to one single token.
> #### ✏️ Things to try
>
> 1. Change the capitalization in one of the token lists, for example,
> `"obama"` to `"Obama"`. You'll see that the alignment is case-insensitive.
> 2. Change `"podcasts"` in `other_tokens` to `"pod", "casts"`. You should see
> that there are now two tokens of length 2 in `y2x`, one corresponding to
> "'s", and one to "podcasts".
> 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that all
> tokens now correspond 1-to-1.
```python
### {executable="true"}
from spacy.gold import Alignment
other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(f"a -> b, lengths: {align.x2y.lengths}") # array([1, 1, 1, 1, 1, 1, 1, 1])
print(f"a -> b, mapping: {align.x2y.dataXd}") # array([0, 1, 2, 3, 4, 4, 5, 6]) : two tokens both refer to "'s"
print(f"b -> a, lengths: {align.y2x.lengths}") # array([1, 1, 1, 1, 2, 1, 1]) : the token "'s" refers to two tokens
print(f"b -> a, mappings: {align.y2x.dataXd}") # array([0, 1, 2, 3, 4, 5, 6, 7])
```
Here are some insights from the alignment information generated in the example
above (see the sketch after this list for one way to put these arrays to use):
- The one-to-one mappings for the first four tokens are identical, which means
they map to each other. This makes sense because they're also identical in the
input: `"i"`, `"listened"`, `"to"` and `"obama"`.
- The value of `x2y.dataXd[6]` is `5`, which means that `other_tokens[6]`
(`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
- `x2y.dataXd[4]` and `x2y.dataXd[5]` are both `4`, which means that both tokens
4 and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens`
(`"'s"`).
<Infobox title="Important note" variant="warning">
<!-- TODO: does it though? -->
The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not