update gold.align explanation in linguistic features

svlandeg 2020-08-03 18:15:36 +02:00
parent 35946783c4
commit f846245936


@@ -1089,54 +1089,48 @@ In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
apply them to spaCy tokens. spaCy's [`Alignment`](/api/example#alignment-object)
object provides the one-to-one mappings of token indices in both directions, and
also accounts for indices where multiple tokens align to one single token.
> #### ✏️ Things to try
>
> 1. Change the capitalization in one of the token lists, for example,
>    `"obama"` to `"Obama"`. You'll see that the alignment is case-insensitive.
> 2. Change `"podcasts"` in `other_tokens` to `"pod", "casts"`. You should see
>    that there are now two tokens of length 2 in `y2x`, one corresponding to
>    `"'s"`, and one to `"podcasts"`.
> 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that all
>    tokens now correspond 1-to-1.
```python
### {executable="true"}
from spacy.gold import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(f"a -> b, lengths: {align.x2y.lengths}")  # array([1, 1, 1, 1, 1, 1, 1, 1])
print(f"a -> b, mapping: {align.x2y.dataXd}")   # array([0, 1, 2, 3, 4, 4, 5, 6]): two tokens both refer to "'s"
print(f"b -> a, lengths: {align.y2x.lengths}")  # array([1, 1, 1, 1, 2, 1, 1]): the token "'s" refers to two tokens
print(f"b -> a, mapping: {align.y2x.dataXd}")   # array([0, 1, 2, 3, 4, 5, 6, 7])
```
Here are some insights from the alignment information generated in the example
above:

- The one-to-one mappings for the first four tokens are identical, which means
  they map to each other. This makes sense because they're also identical in the
  input: `"i"`, `"listened"`, `"to"` and `"obama"`.
- The value of `x2y.dataXd[6]` is `5`, which means that `other_tokens[6]`
  (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
- `x2y.dataXd[4]` and `x2y.dataXd[5]` are both `4`, which means that both tokens
  4 and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens`
  (`"'s"`).
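A common next step is to use the alignment to project per-token annotations from one tokenization onto the other. The following is a minimal pure-Python sketch, not spaCy API: it consumes the `y2x` lengths and flat data arrays shown in the example above, and the `labels` list is a made-up set of tags added purely for illustration.

```python
# Hypothetical illustration: project per-token labels from other_tokens onto
# spacy_tokens using the y2x alignment arrays from the example above.
other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]

# y2x alignment, in the ragged (lengths + flat data) layout printed above
y2x_lengths = [1, 1, 1, 1, 2, 1, 1]
y2x_data = [0, 1, 2, 3, 4, 5, 6, 7]

# Made-up labels for the tokens in other_tokens (illustration only)
labels = ["PRON", "VERB", "ADP", "PROPN", "PUNCT", "PART", "NOUN", "PUNCT"]

# For each spaCy token, collect the labels of the source tokens it covers:
# walk the flat data array in chunks given by the lengths array
aligned = []
offset = 0
for length in y2x_lengths:
    indices = y2x_data[offset:offset + length]
    aligned.append([labels[i] for i in indices])
    offset += length

for token, token_labels in zip(spacy_tokens, aligned):
    print(token, token_labels)  # e.g. the token "'s" gets ['PUNCT', 'PART']
```

Tokens that align one-to-one get a single label, while `"'s"` collects the labels of both source tokens it covers, leaving you to decide how to merge them.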
<Infobox title="Important note" variant="warning">

<!-- TODO: does it though? -->

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not