Mirror of https://github.com/explosion/spaCy.git (synced 2025-02-23 23:20:52 +03:00)

Commit f846245936 (parent 35946783c4): update gold.align explanation in linguistic features
In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
apply them to spaCy tokens. spaCy's [`Alignment`](/api/example#alignment-object)
object provides the one-to-one mappings of token indices in both directions, and
takes into account indices where multiple tokens align to one single token.

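Before looking at spaCy's own helper, it can help to see the underlying idea. The
following is a simplified, self-contained sketch of offset-based alignment – not
spaCy's actual implementation – where `char_spans` and `align_tokens` are
hypothetical helper names. Like the real algorithm, it assumes both tokenizations
add up to the same string:

```python
# Illustrative sketch only (NOT spaCy's implementation): compute each
# token's character span in the concatenated text, then map each token
# of A to all tokens of B whose character spans overlap it.
def char_spans(tokens):
    spans, start = [], 0
    for tok in tokens:
        spans.append((start, start + len(tok)))
        start += len(tok)
    return spans

def align_tokens(a, b):
    spans_a, spans_b = char_spans(a), char_spans(b)
    a2b = []
    for start_a, end_a in spans_a:
        # collect all b tokens whose span overlaps [start_a, end_a)
        a2b.append([j for j, (start_b, end_b) in enumerate(spans_b)
                    if start_b < end_a and start_a < end_b])
    return a2b

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
print(align_tokens(other_tokens, spacy_tokens))
# [[0], [1], [2], [3], [4], [4], [5], [6]]
```

Note that both `"'"` and `"s"` map to token 4 (`"'s"`), matching the
many-to-one behavior described above.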
> #### ✏️ Things to try
>
> 1. Change the capitalization in one of the token lists – for example,
>    `"obama"` to `"Obama"`. You'll see that the alignment is case-insensitive.
> 2. Change `"podcasts"` in `other_tokens` to `"pod", "casts"`. You should see
>    that there are now two tokens of length 2 in `y2x`, one corresponding to
>    `"'s"`, and one to `"podcasts"`.
> 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that all
>    tokens now correspond 1-to-1.

```python
### {executable="true"}
from spacy.gold import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(f"a -> b, lengths: {align.x2y.lengths}")  # array([1, 1, 1, 1, 1, 1, 1, 1])
print(f"a -> b, mapping: {align.x2y.dataXd}")  # array([0, 1, 2, 3, 4, 4, 5, 6]): two tokens both refer to "'s"
print(f"b -> a, lengths: {align.y2x.lengths}")  # array([1, 1, 1, 1, 2, 1, 1]): the token "'s" refers to two tokens
print(f"b -> a, mapping: {align.y2x.dataXd}")  # array([0, 1, 2, 3, 4, 5, 6, 7])
```

Here are some insights from the alignment information generated in the example
above:

- The one-to-one mappings for the first four tokens are identical, which means
  they map to each other. This makes sense because they're also identical in the
  input: `"i"`, `"listened"`, `"to"` and `"obama"`.
- The value of `x2y.dataXd[6]` is `5`, which means that `other_tokens[6]`
  (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
- `x2y.dataXd[4]` and `x2y.dataXd[5]` are both `4`, which means that both tokens
  4 and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens`
  (`"'s"`).

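One common use of this information is projecting per-token annotations from one
tokenization onto the other. The following self-contained sketch hardcodes the
`x2y` mapping printed above; the `project_labels` helper and the example labels
are hypothetical, not part of spaCy's API:

```python
other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
x2y = [0, 1, 2, 3, 4, 4, 5, 6]  # align.x2y.dataXd from the example above

def project_labels(labels_x, x2y, n_y):
    """Carry each x-token label over to the y token it aligns to.
    If several x tokens map to one y token, the first label wins."""
    labels_y = [None] * n_y
    for i_x, i_y in enumerate(x2y):
        if labels_y[i_y] is None:
            labels_y[i_y] = labels_x[i_x]
    return labels_y

# Hypothetical per-token labels predicted on other_tokens
labels = ["O", "O", "O", "B-PERSON", "O", "O", "O", "O"]
print(project_labels(labels, x2y, len(spacy_tokens)))
# ['O', 'O', 'O', 'B-PERSON', 'O', 'O', 'O']
```

Tokens 4 and 5 (`"'"` and `"s"`) collapse into the single `"'s"` token, so the
projected label list has one entry per spaCy token.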
<Infobox title="Important note" variant="warning">

<!-- TODO: does it though? -->

The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not