💫 Document spacy.gold.align (#3980)

Co-authored-by: Ines Montani <ines@ines.io>
2025-10-17 09:14:14 +03:00 · 2019-07-17 15:34:35 +02:00 · 2019-07-17 15:34:35 +02:00 · 57d7076a72
commit 57d7076a72
parent fe0e1873a3 1d5ff3e455
3 changed files with 135 additions and 8 deletions
--- a/spacy/gold.pyx
+++ b/spacy/gold.pyx
@@ -70,15 +70,33 @@ def merge_sents(sents):
    return [(m_deps, m_brackets)]
-def align(cand_words, gold_words):
-    if cand_words == gold_words:
-        alignment = numpy.arange(len(cand_words))
+def align(tokens_a, tokens_b):
+    """Calculate alignment tables between two tokenizations, using the Levenshtein
+    algorithm. The alignment is case-insensitive.
+    tokens_a (List[str]): The candidate tokenization.
+    tokens_b (List[str]): The reference tokenization.
+    RETURNS: (tuple): A 5-tuple consisting of the following information:
+      * cost (int): The number of misaligned tokens.
+      * a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`.
+        For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns
+        to `tokens_b[6]`. If there's no one-to-one alignment for a token,
+        it has the value -1.
+      * b2a (List[int]): The same as `a2b`, but mapping the other direction.
+      * a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a`
+        to indices in `tokens_b`, where multiple tokens of `tokens_a` align to
+        the same token of `tokens_b`.
+      * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other
+            direction.
+    """
+    if tokens_a == tokens_b:
+        alignment = numpy.arange(len(tokens_a))
        return 0, alignment, alignment, {}, {}
-    cand_words = [w.replace(" ", "").lower() for w in cand_words]
+    tokens_a = [w.replace(" ", "").lower() for w in tokens_a]
-    gold_words = [w.replace(" ", "").lower() for w in gold_words]
+    tokens_b = [w.replace(" ", "").lower() for w in tokens_b]
-    cost, i2j, j2i, matrix = _align.align(cand_words, gold_words)
+    cost, i2j, j2i, matrix = _align.align(tokens_a, tokens_b)
-    i2j_multi, j2i_multi = _align.multi_align(i2j, j2i, [len(w) for w in cand_words],
+    i2j_multi, j2i_multi = _align.multi_align(i2j, j2i, [len(w) for w in tokens_a],
-                                [len(w) for w in gold_words])
+                                                        [len(w) for w in tokens_b])
    for i, j in list(i2j_multi.items()):
        if i2j_multi.get(i+1) != j and i2j_multi.get(i-1) != j:
            i2j[i] = j
--- a/website/docs/api/goldparse.md
+++ b/website/docs/api/goldparse.md
@@ -76,6 +76,50 @@ Convert a list of Doc objects into the
 | `id`        | int              | ID to assign to the JSON. Defaults to `0`. |
 | **RETURNS** | list             | The data in spaCy's JSON format.           |
 ### gold.align {#align tag="function"}
 Calculate alignment tables between two tokenizations, using the Levenshtein
 algorithm. The alignment is case-insensitive.
 > #### Example
 >
 > ```python
 > from spacy.gold import align
 >
 > bert_tokens = ["obama", "'", "s", "podcast"]
 > spacy_tokens = ["obama", "'s", "podcast"]
 > alignment = align(bert_tokens, spacy_tokens)
 > cost, a2b, b2a, a2b_multi, b2a_multi = alignment
 > ```
 | Name        | Type  | Description                                                                |
 | ----------- | ----- | -------------------------------------------------------------------------- |
 | `tokens_a`  | list  | String values of candidate tokens to align.                                |
 | `tokens_b`  | list  | String values of reference tokens to align.                                |
 | **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |
 The returned tuple contains the following alignment information:
 > #### Example
 >
 > ```python
 > a2b = array([0, -1, -1, 2])
 > b2a = array([0, 2, 3])
 > a2b_multi = {1: 1, 2: 1}
 > b2a_multi = {}
 > ```
 >
 > If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
 > there's no one-to-one alignment for a token, it has the value `-1`.
 | Name        | Type                                   | Description                                                                                                                                     |
 | ----------- | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
 | `cost`      | int                                    | The number of misaligned tokens.                                                                                                                |
 | `a2b`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`.                                                                          |
 | `b2a`       | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`.                                                                          |
 | `a2b_multi` | dict                                   | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
 | `b2a_multi` | dict                                   | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |
 ### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}
 Encode labelled spans into per-token tags, using the
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -963,6 +963,71 @@ Once you have a [`Doc`](/api/doc) object, you can write to its attributes to set
 the part-of-speech tags, syntactic dependencies, named entities and other
 attributes. For details, see the respective usage pages.
 ### Aligning tokenization {#aligning-tokenization}
 spaCy's tokenization is non-destructive and uses language-specific rules
 optimized for compatibility with treebank annotations. Other tools and resources
 can sometimes tokenize things differently – for example, `"I'm"` →
 `["I", "'", "m"]` instead of `["I", "'m"]`.
 In cases like that, you often want to align the tokenization so that you can
 merge annotations from different sources together, or take vectors predicted by
 a [pre-trained BERT model](https://github.com/huggingface/pytorch-transformers)
 and apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align)
 helper returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the
 number of misaligned tokens, the one-to-one mappings of token indices in both
 directions and the indices where multiple tokens align to one single token.
 > #### ✏️ Things to try
 >
 > 1. Change the capitalization in one of the token lists – for example,
 >    `"obama"` to `"Obama"`. You'll see that the alignment is case-insensitive.
 > 2. Change `"podcasts"` in `other_tokens` to `"pod", "casts"`. You should see
 >    that there are now 4 misaligned tokens and that the new many-to-one mapping
 >    is reflected in `a2b_multi`.
 > 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that the
 >    `cost` is `0` and all corresponding mappings are also identical.
 ```python
 ### {executable="true"}
 from spacy.gold import align
 other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
 spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
 cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
 print("Misaligned tokens:", cost)  # 2
 print("One-to-one mappings a -> b", a2b)  # array([0, 1, 2, 3, -1, -1, 5, 6])
 print("One-to-one mappings b -> a", b2a)  # array([0, 1, 2, 3, 5, 6, 7])
 print("Many-to-one mappings a -> b", a2b_multi)  # {4: 4, 5: 4}
 print("Many-to-one mappings b -> a", b2a_multi)  # {}
 ```
 Here are some insights from the alignment information generated in the example
 above:
 - Two tokens are misaligned.
 - The one-to-one mappings for the first four tokens are identical, which means
  they map to each other. This makes sense because they're also identical in the
  input: `"i"`, `"listened"`, `"to"` and `"obama"`.
 - The index mapped to `a2b[6]` is `5`, which means that `other_tokens[6]`
  (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
 - `a2b[4]` is `-1`, which means that there is no one-to-one alignment for the
  token at `other_tokens[4]`. The token `"'"` doesn't exist on its own in
  `spacy_tokens`. The same goes for `a2b[5]` and `other_tokens[5]`, i.e. `"s"`.
 - The dictionary `a2b_multi` shows that both tokens 4 and 5 of `other_tokens`
  (`"'"` and `"s"`) align to token 4 of `spacy_tokens` (`"'s"`).
 - The dictionary `b2a_multi` shows that there are no tokens in `spacy_tokens`
  that map to multiple tokens in `other_tokens`.
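To make the points above concrete, here is a minimal sketch of how the one-to-one and many-to-one mappings could be combined to find, for each spaCy token, the `other_tokens` that align to it. The alignment values are copied from the comments in the example above (as plain lists rather than numpy arrays), and `group_by_target` is a hypothetical helper written for this illustration, not part of spaCy.

```python
# Alignment values copied from the comments in the example above.
other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
a2b = [0, 1, 2, 3, -1, -1, 5, 6]
a2b_multi = {4: 4, 5: 4}


def group_by_target(a2b, a2b_multi, n_b):
    """Hypothetical helper: for each index in b, collect the indices in a
    that align to it, either one-to-one or many-to-one."""
    groups = [[] for _ in range(n_b)]
    for i, j in enumerate(a2b):
        if j >= 0:
            # One-to-one alignment.
            groups[j].append(i)
        elif i in a2b_multi:
            # No one-to-one alignment, so fall back to the many-to-one table.
            groups[a2b_multi[i]].append(i)
    return groups


groups = group_by_target(a2b, a2b_multi, len(spacy_tokens))
for j, indices in enumerate(groups):
    print(spacy_tokens[j], "<-", [other_tokens[i] for i in indices])
```

With groups like these, annotations or vectors attached to `other_tokens` could be merged per spaCy token, for instance by averaging the vectors of `"'"` and `"s"` for `"'s"`. How to combine them is a design choice that depends on the annotation type.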
 <Infobox title="Important note" variant="warning">
 The current implementation of the alignment algorithm assumes that both
 tokenizations add up to the same string. For example, you'll be able to align
 `["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
 `["I", "'m"]` and `["I", "am"]`.
 </Infobox>
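One way to guard against this is to check up front whether two tokenizations add up to the same string. The sketch below is an assumption-based pre-check, not something `align` does for you: `adds_up_to_same_string` is a hypothetical helper, and its normalization (removing spaces and lowercasing) mirrors what `align` applies to each token before aligning.

```python
def normalize(tokens):
    # Mirror the per-token normalization in align: drop spaces, lowercase.
    return "".join(token.replace(" ", "").lower() for token in tokens)


def adds_up_to_same_string(tokens_a, tokens_b):
    """Hypothetical pre-check: do both tokenizations spell the same text?"""
    return normalize(tokens_a) == normalize(tokens_b)


print(adds_up_to_same_string(["I", "'", "m"], ["I", "'m"]))  # True
print(adds_up_to_same_string(["I", "'m"], ["I", "am"]))  # False
```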
 ## Merging and splitting {#retokenization new="2.1"}
 The [`Doc.retokenize`](/api/doc#retokenize) context manager lets you merge and