Sync develop with nightly docs state (#5883)

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
This commit is contained in:
Ines Montani 2020-08-06 00:28:14 +02:00 committed by GitHub
parent d92954ac1d
commit 06e80d95cd
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
7 changed files with 359 additions and 129 deletions

View File

@ -54,7 +54,7 @@ def debug_model_cli(
nlp, config = util.load_model_from_config(cfg)
except ValueError as e:
msg.fail(str(e), exits=1)
seed = config.get("training", {}).get("seed", None)
seed = config["pretraining"]["seed"]
if seed is not None:
msg.info(f"Fixing random seed: {seed}")
fix_random_seed(seed)

View File

@ -221,17 +221,21 @@ config from being resolved. This means that you may not see all validation
errors at once and some issues are only shown once previous errors have been
fixed.
Instead of specifying all required settings in the config file, you can rely on
an auto-fill functionality that uses spaCy's built-in defaults. The resulting
full config can be written to file and used in downstream training tasks.
```bash
$ python -m spacy debug config [config_path] [--code] [overrides]
$ python -m spacy debug config [config_path] [--code_path] [--output] [--auto_fill] [--diff] [overrides]
```
> #### Example
> #### Example 1
>
> ```bash
> $ python -m spacy debug config ./config.cfg
> ```
<Accordion title="Example output" spaced>
<Accordion title="Example 1 output" spaced>
```
✘ Config validation error
@ -250,12 +254,30 @@ training -> width extra fields not permitted
</Accordion>
| Argument | Type | Description |
| -------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `config_path` | positional | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--code`, `-c` | option | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
> #### Example 2
>
> ```bash
> $ python -m spacy debug config ./minimal_config.cfg -F -o ./filled_config.cfg
> ```
<Accordion title="Example 2 output" spaced>
```
✔ Auto-filled config is valid
✔ Saved updated config to ./filled_config.cfg
```
</Accordion>
| Argument | Type | Default | Description |
| --------------------- | ---------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `config_path` | positional | - | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `--code_path`, `-c` | option | `None` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-models) for new architectures. |
| `--auto_fill`, `-F` | option | `False` | Whether or not to auto-fill the config with built-in defaults if possible. If `False`, the provided config needs to be complete. |
| `--output_path`, `-o` | option | `None` | Output path where the filled config can be stored. Use '-' for standard output. |
| `--diff`, `-D` | option | `False` | Show a visual diff if config was auto-filled. |
| `--help`, `-h` | flag | `False` | Show help message and available arguments. |
| overrides | | `None` | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
### debug data {#debug-data}
@ -433,7 +455,135 @@ will not be available.
| `--help`, `-h` | flag | Show help message and available arguments. |
| overrides | | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.use_gpu 1`. |
<!-- TODO: document debug profile and debug model? -->
<!-- TODO: document debug profile?-->
### debug model {#debug-model}
Debug a Thinc [`Model`](https://thinc.ai/docs/api-model) by running it on a
sample text and checking how it updates its internal weights and parameters.
```bash
$ python -m spacy debug model [config_path] [component] [--layers] [-DIM] [-PAR] [-GRAD] [-ATTR] [-P0] [-P1] [-P2] [P3] [--gpu_id]
```
> #### Example 1
>
> ```bash
> $ python -m spacy debug model ./config.cfg tagger -P0
> ```
<Accordion title="Example 1 output" spaced>
```
Using CPU
Fixing random seed: 0
Analysing model with ID 62
========================== STEP 0 - before training ==========================
Layer 0: model ID 62:
'extract_features>>list2ragged>>with_array-ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed>>with_array-maxout>>layernorm>>dropout>>ragged2list>>with_array-residual>>residual>>residual>>residual>>with_array-softmax'
Layer 1: model ID 59:
'extract_features>>list2ragged>>with_array-ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed>>with_array-maxout>>layernorm>>dropout>>ragged2list>>with_array-residual>>residual>>residual>>residual'
Layer 2: model ID 61: 'with_array-softmax'
Layer 3: model ID 24:
'extract_features>>list2ragged>>with_array-ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed>>with_array-maxout>>layernorm>>dropout>>ragged2list'
Layer 4: model ID 58: 'with_array-residual>>residual>>residual>>residual'
Layer 5: model ID 60: 'softmax'
Layer 6: model ID 13: 'extract_features'
Layer 7: model ID 14: 'list2ragged'
Layer 8: model ID 16:
'with_array-ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed'
Layer 9: model ID 22: 'with_array-maxout>>layernorm>>dropout'
Layer 10: model ID 23: 'ragged2list'
Layer 11: model ID 57: 'residual>>residual>>residual>>residual'
Layer 12: model ID 15:
'ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed'
Layer 13: model ID 21: 'maxout>>layernorm>>dropout'
Layer 14: model ID 32: 'residual'
Layer 15: model ID 40: 'residual'
Layer 16: model ID 48: 'residual'
Layer 17: model ID 56: 'residual'
Layer 18: model ID 3: 'ints-getitem>>hashembed'
Layer 19: model ID 6: 'ints-getitem>>hashembed'
Layer 20: model ID 9: 'ints-getitem>>hashembed'
...
```
</Accordion>
In this example log, we just print the name of each layer after creation of the
model ("Step 0"), which helps us to understand the internal structure of the
Neural Network, and to focus on specific layers that we want to inspect further
(see next example).
> #### Example 2
>
> ```bash
> $ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2
> ```
<Accordion title="Example 2 output" spaced>
```
Using CPU
Fixing random seed: 0
Analysing model with ID 62
========================= STEP 0 - before training =========================
Layer 5: model ID 60: 'softmax'
- dim nO: None
- dim nI: 96
- param W: None
- param b: None
Layer 15: model ID 40: 'residual'
- dim nO: None
- dim nI: None
======================= STEP 1 - after initialization =======================
Layer 5: model ID 60: 'softmax'
- dim nO: 4
- dim nI: 96
- param W: (4, 96) - sample: [0. 0. 0. 0. 0.]
- param b: (4,) - sample: [0. 0. 0. 0.]
Layer 15: model ID 40: 'residual'
- dim nO: 96
- dim nI: None
========================== STEP 2 - after training ==========================
Layer 5: model ID 60: 'softmax'
- dim nO: 4
- dim nI: 96
- param W: (4, 96) - sample: [ 0.00283958 -0.00294119 0.00268396 -0.00296219
-0.00297141]
- param b: (4,) - sample: [0.00300002 0.00300002 0.00300002 0.00300002]
Layer 15: model ID 40: 'residual'
- dim nO: 96
- dim nI: None
```
</Accordion>
In this example log, we see how initialization of the model (Step 1) propagates
the correct values for the `nI` (input) and `nO` (output) dimensions of the
various layers. In the `softmax` layer, this step also defines the `W` matrix as
an all-zero matrix determined by the `nO` and `nI` dimensions. After a first
training step (Step 2), this matrix has clearly updated its values through the
training feedback loop.
| Argument | Type | Default | Description |
| ----------------------- | ---------- | ------- | ---------------------------------------------------------------------------------------------------- |
| `config_path` | positional | | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. |
| `component` | positional | | Name of the pipeline component of which the model should be analysed. |
| `--layers`, `-l` | option | | Comma-separated names of layer IDs to print. |
| `--dimensions`, `-DIM` | option | `False` | Show dimensions of each layer. |
| `--parameters`, `-PAR` | option | `False` | Show parameters of each layer. |
| `--gradients`, `-GRAD` | option | `False` | Show gradients of each layer. |
| `--attributes`, `-ATTR` | option | `False` | Show attributes of each layer. |
| `--print-step0`, `-P0` | option | `False` | Print model before training. |
| `--print-step1`, `-P1` | option | `False` | Print model after initialization. |
| `--print-step2`, `-P2` | option | `False` | Print model after training. |
| `--print-step3`, `-P3` | option | `False` | Print final predictions. |
| `--help`, `-h` | flag | | Show help message and available arguments. |
## Train {#train}

View File

@ -28,7 +28,7 @@ spaCy's training format. To convert one or more existing `Doc` objects to
spaCy's JSON format, you can use the
[`gold.docs_to_json`](/api/top-level#docs_to_json) helper.
> #### Annotating entities
> #### Annotating entities {#biluo}
>
> Named entities are provided in the
> [BILUO](/usage/linguistic-features#accessing-ner) notation. Tokens outside an
@ -75,6 +75,123 @@ from the English Wall Street Journal portion of the Penn Treebank:
https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json
```
### Annotations in dictionary format {#dict-input}
To create [`Example`](/api/example) objects, you can create a dictionary of the
gold-standard annotations `gold_dict`, and then call
```python
example = Example.from_dict(doc, gold_dict)
```
There are currently two formats supported for this dictionary of annotations:
one with a simple, flat structure of keywords, and one with a more hierarchical
structure.
#### Flat structure {#dict-flat}
Here is the full overview of potential entries in a flat dictionary of
annotations. You need to only specify those keys corresponding to the task you
want to train.
```python
### Flat dictionary
{
"text": string, # Raw text.
"words": List[string], # List of gold tokens.
"lemmas": List[string], # List of lemmas.
"spaces": List[bool], # List of boolean values indicating whether the corresponding tokens is followed by a space or not.
"tags": List[string], # List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging).
"pos": List[string], # List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging).
"morphs": List[string], # List of [morphological features](/usage/linguistic-features#rule-based-morphology).
"sent_starts": List[bool], # List of boolean values indicating whether each token is the first of a sentence or not.
"deps": List[string], # List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head.
"heads": List[int], # List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text.
"entities": List[string], # Option 1: List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens.
"entities": List[(int, int, string)], # Option 2: List of `"(start, end, label)"` tuples defining all entities in.
"cats": Dict[str, float], # Dictionary of `label:value` pairs indicating how relevant a certain [category](/api/textcategorizer) is for the text.
"links": Dict[(int, int), Dict], # Dictionary of `offset:dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The charachter offsets are linked to a dictionary of relevant knowledge base IDs.
}
```
There are a few caveats to take into account:
- Multiple formats are possible for the "entities" entry, but you have to pick
one.
- Any values for sentence starts will be ignored if there are annotations for
dependency relations.
- If the dictionary contains values for "text" and "words", but not "spaces",
the latter are inferred automatically. If "words" is not provided either, the
values are inferred from the `doc` argument.
##### Examples
```python
# Training data for a part-of-speech tagger
doc = Doc(vocab, words=["I", "like", "stuff"])
example = Example.from_dict(doc, {"tags": ["NOUN", "VERB", "NOUN"]})
# Training data for an entity recognizer (option 1)
doc = nlp("Laura flew to Silicon Valley.")
biluo_tags = ["U-PERS", "O", "O", "B-LOC", "L-LOC"]
example = Example.from_dict(doc, {"entities": biluo_tags})
# Training data for an entity recognizer (option 2)
doc = nlp("Laura flew to Silicon Valley.")
entity_tuples = [
(0, 5, "PERSON"),
(14, 28, "LOC"),
]
example = Example.from_dict(doc, {"entities": entity_tuples})
# Training data for text categorization
doc = nlp("I'm pretty happy about that!")
example = Example.from_dict(doc, {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
# Training data for an Entity Linking component
doc = nlp("Russ Cochran his reprints include EC Comics.")
example = Example.from_dict(doc, {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}})
```
#### Hierachical structure {#dict-hierarch}
Internally, a more hierarchical dictionary structure is used to store
gold-standard annotations. Its format is similar to the structure described in
the previous section, but there are two main sections `token_annotation` and
`doc_annotation`, and the keys for token annotations should be uppercase
[`Token` attributes](/api/token#attributes) such as "ORTH" and "TAG".
```python
### Hierarchical dictionary
{
"text": string, # Raw text.
"token_annotation": {
"ORTH": List[string], # List of gold tokens.
"LEMMA": List[string], # List of lemmas.
"SPACY": List[bool], # List of boolean values indicating whether the corresponding tokens is followed by a space or not.
"TAG": List[string], # List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging).
"POS": List[string], # List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging).
"MORPH": List[string], # List of [morphological features](/usage/linguistic-features#rule-based-morphology).
"SENT_START": List[bool], # List of boolean values indicating whether each token is the first of a sentence or not.
"DEP": List[string], # List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head.
"HEAD": List[int], # List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text.
},
"doc_annotation": {
"entities": List[(int, int, string)], # List of [BILUO tags](#biluo) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens.
"cats": Dict[str, float], # Dictionary of `label:value` pairs indicating how relevant a certain [category](/api/textcategorizer) is for the text.
"links": Dict[(int, int), Dict], # Dictionary of `offset:dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The charachter offsets are linked to a dictionary of relevant knowledge base IDs.
}
}
```
There are a few caveats to take into account:
- Any values for sentence starts will be ignored if there are annotations for
dependency relations.
- If the dictionary contains values for "text" and "ORTH", but not "SPACY", the
latter are inferred automatically. If "ORTH" is not provided either, the
values are inferred from the `doc` argument.
## Training config {#config new="3"}
Config files define the training process and model pipeline and can be passed to

View File

@ -32,14 +32,12 @@ architectures and their arguments and hyperparameters.
> nlp.add_pipe("entity_linker", config=config)
> ```
<!-- TODO: finish API docs -->
| Setting | Type | Description | Default |
| ---------------- | ------------------------------------------ | ----------------- | ----------------------------------------------- |
| `kb` | `KnowledgeBase` | | `None` |
| `labels_discard` | `Iterable[str]` | | `[]` |
| `incl_prior` | bool | |  `True` |
| `incl_context` | bool | | `True` |
| ---------------- | ------------------------------------------ | ----------------------------------------------------------------------- | ----------------------------------------------- |
| `kb` | `KnowledgeBase` | The [`KnowledgeBase`](/api/kb) holding all entities and their aliases. | `None` |
| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. | `[]` |
| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. | `True` |
| `incl_context` | bool | Whether or not to include the local context in the model. | `True` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [EntityLinker](/api/architectures#EntityLinker) |
```python
@ -75,10 +73,10 @@ shortcut for this and instantiate the component using its string name and
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| _keyword-only_ | | |
| `kb` | `KnowlegeBase` | |
| `labels_discard` | `Iterable[str]` | |
| `incl_prior` | bool | |
| `incl_context` | bool | |
| `kb` | `KnowlegeBase` | The [`KnowledgeBase`](/api/kb) holding all entities and their aliases. |
| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. |
| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. |
| `incl_context` | bool | Whether or not to include the local context in the model. |
## EntityLinker.\_\_call\_\_ {#call tag="method"}
@ -130,15 +128,12 @@ applied to the `Doc` in order. Both [`__call__`](/api/entitylinker#call) and
## EntityLinker.begin_training {#begin_training tag="method"}
Initialize the pipe for training, using data examples if available. Returns an
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object. Before calling this
method, a knowledge base should have been defined with
[`set_kb`](/api/entitylinker#set_kb).
[`Optimizer`](https://thinc.ai/docs/api-optimizers) object.
> #### Example
>
> ```python
> entity_linker = nlp.add_pipe("entity_linker", last=True)
> entity_linker.set_kb(kb)
> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline)
> ```
@ -210,22 +205,6 @@ pipe's entity linking model and context encoder. Delegates to
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
## EntityLinker.set_kb {#set_kb tag="method"}
Define the knowledge base (KB) used for disambiguating named entities to KB
identifiers.
> #### Example
>
> ```python
> entity_linker = nlp.add_pipe("entity_linker")
> entity_linker.set_kb(kb)
> ```
| Name | Type | Description |
| ---- | --------------- | ------------------------------- |
| `kb` | `KnowledgeBase` | The [`KnowledgeBase`](/api/kb). |
## EntityLinker.create_optimizer {#create_optimizer tag="method"}
Create an optimizer for the pipeline component.

View File

@ -8,8 +8,9 @@ new: 3.0
An `Example` holds the information for one training instance. It stores two
`Doc` objects: one for holding the gold-standard reference data, and one for
holding the predictions of the pipeline. An `Alignment` object stores the
alignment between these two documents, as they can differ in tokenization.
holding the predictions of the pipeline. An [`Alignment`](#alignment-object)
object stores the alignment between these two documents, as they can differ in
tokenization.
## Example.\_\_init\_\_ {#init tag="method"}
@ -40,9 +41,8 @@ both documents.
## Example.from_dict {#from_dict tag="classmethod"}
Construct an `Example` object from the `predicted` document and the reference
annotations provided as a dictionary.
<!-- TODO: document formats? legacy & token_annotation stuff -->
annotations provided as a dictionary. For more details on the required format,
see the [training format documentation](/api/data-formats#dict-input).
> #### Example
>
@ -244,8 +244,9 @@ accuracy of predicted entities against the original gold-standard annotation.
## Example.to_dict {#to_dict tag="method"}
Return a dictionary representation of the reference annotation contained in this
`Example`.
Return a
[hierarchical dictionary representation](/api/data-formats#dict-hierarch) of the
reference annotation contained in this `Example`.
> #### Example
>
@ -276,3 +277,46 @@ Split one `Example` into multiple `Example` objects, one for each sentence.
| Name | Type | Description |
| ----------- | --------------- | ---------------------------------------------------------- |
| **RETURNS** | `List[Example]` | List of `Example` objects, one for each original sentence. |
## Alignment {#alignment-object new="3"}
Calculate alignment tables between two tokenizations.
### Alignment attributes {#alignment-attributes"}
| Name | Type | Description |
| ----- | -------------------------------------------------- | ---------------------------------------------------------- |
| `x2y` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `x` to `y`. |
| `y2x` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | The `Ragged` object holding the alignment from `y` to `x`. |
<Infobox title="Important note" variant="warning">
The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.
</Infobox>
> #### Example
>
> ```python
> from spacy.gold import Alignment
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = Alignment.from_strings(bert_tokens, spacy_tokens)
> a2b = alignment.x2y
> assert list(a2b.dataXd) == [0, 1, 1, 2]
> ```
>
> If `a2b.dataXd[1] == a2b.dataXd[2] == 1`, that means that `A[1]` (`"'"`) and
> `A[2]` (`"s"`) both align to `B[1]` (`"'s"`).
### Alignment.from_strings {#classmethod tag="function"}
| Name | Type | Description |
| ----------- | ----------- | ----------------------------------------------- |
| `A` | list | String values of candidate tokens to align. |
| `B` | list | String values of reference tokens to align. |
| **RETURNS** | `Alignment` | An `Alignment` object describing the alignment. |

View File

@ -468,59 +468,6 @@ Convert a list of Doc objects into the
| `id` | int | ID to assign to the JSON. Defaults to `0`. |
| **RETURNS** | dict | The data in spaCy's JSON format. |
### gold.align {#align tag="function"}
Calculate alignment tables between two tokenizations, using the Levenshtein
algorithm. The alignment is case-insensitive.
<Infobox title="Important note" variant="warning">
The current implementation of the alignment algorithm assumes that both
tokenizations add up to the same string. For example, you'll be able to align
`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not
`["I", "'m"]` and `["I", "am"]`.
</Infobox>
> #### Example
>
> ```python
> from spacy.gold import align
>
> bert_tokens = ["obama", "'", "s", "podcast"]
> spacy_tokens = ["obama", "'s", "podcast"]
> alignment = align(bert_tokens, spacy_tokens)
> cost, a2b, b2a, a2b_multi, b2a_multi = alignment
> ```
| Name | Type | Description |
| ----------- | ----- | -------------------------------------------------------------------------- |
| `tokens_a` | list | String values of candidate tokens to align. |
| `tokens_b` | list | String values of reference tokens to align. |
| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. |
The returned tuple contains the following alignment information:
> #### Example
>
> ```python
> a2b = array([0, -1, -1, 2])
> b2a = array([0, 2, 3])
> a2b_multi = {1: 1, 2: 1}
> b2a_multi = {}
> ```
>
> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If
> there's no one-to-one alignment for a token, it has the value `-1`.
| Name | Type | Description |
| ----------- | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| `cost` | int | The number of misaligned tokens. |
| `a2b` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`. |
| `b2a` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`. |
| `a2b_multi` | dict | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. |
| `b2a_multi` | dict | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. |
### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"}
Encode labelled spans into per-token tags, using the

View File

@ -1089,51 +1089,44 @@ In situations like that, you often want to align the tokenization so that you
can merge annotations from different sources together, or take vectors predicted
by a
[pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and
apply them to spaCy tokens. spaCy's [`gold.align`](/api/top-level#align) helper
returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the number
of misaligned tokens, the one-to-one mappings of token indices in both
directions and the indices where multiple tokens align to one single token.
apply them to spaCy tokens. spaCy's [`Alignment`](/api/example#alignment-object) object
allows the one-to-one mappings of token indices in both directions as well as
taking into account indices where multiple tokens align to one single token.
> #### ✏️ Things to try
>
> 1. Change the capitalization in one of the token lists for example,
> `"obama"` to `"Obama"`. You'll see that the alignment is case-insensitive.
> 2. Change `"podcasts"` in `other_tokens` to `"pod", "casts"`. You should see
> that there are now 4 misaligned tokens and that the new many-to-one mapping
> is reflected in `a2b_multi`.
> 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that the
> `cost` is `0` and all corresponding mappings are also identical.
> that there are now two tokens of length 2 in `y2x`, one corresponding to
> "'s", and one to "podcasts".
> 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that all
> tokens now correspond 1-to-1.
```python
### {executable="true"}
from spacy.gold import align
from spacy.gold import Alignment
other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
print("Edit distance:", cost) # 3
print("One-to-one mappings a -> b", a2b) # array([0, 1, 2, 3, -1, -1, 5, 6])
print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, -1, 6, 7])
print("Many-to-one mappings a -> b", a2b_multi) # {4: 4, 5: 4}
print("Many-to-one mappings b-> a", b2a_multi) # {}
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(f"a -> b, lengths: {align.x2y.lengths}") # array([1, 1, 1, 1, 1, 1, 1, 1])
print(f"a -> b, mapping: {align.x2y.dataXd}") # array([0, 1, 2, 3, 4, 4, 5, 6]) : two tokens both refer to "'s"
print(f"b -> a, lengths: {align.y2x.lengths}") # array([1, 1, 1, 1, 2, 1, 1]) : the token "'s" refers to two tokens
print(f"b -> a, mappings: {align.y2x.dataXd}") # array([0, 1, 2, 3, 4, 5, 6, 7])
```
Here are some insights from the alignment information generated in the example
above:
- The edit distance (cost) is `3`: two deletions and one insertion.
- The one-to-one mappings for the first four tokens are identical, which means
they map to each other. This makes sense because they're also identical in the
input: `"i"`, `"listened"`, `"to"` and `"obama"`.
- The index mapped to `a2b[6]` is `5`, which means that `other_tokens[6]`
- The value of `x2y.dataXd[6]` is `5`, which means that `other_tokens[6]`
(`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`).
- `a2b[4]` is `-1`, which means that there is no one-to-one alignment for the
token at `other_tokens[4]`. The token `"'"` doesn't exist on its own in
`spacy_tokens`. The same goes for `a2b[5]` and `other_tokens[5]`, i.e. `"s"`.
- The dictionary `a2b_multi` shows that both tokens 4 and 5 of `other_tokens`
(`"'"` and `"s"`) align to token 4 of `spacy_tokens` (`"'s"`).
- The dictionary `b2a_multi` shows that there are no tokens in `spacy_tokens`
that map to multiple tokens in `other_tokens`.
- `x2y.dataXd[4]` and `x2y.dataXd[5]` are both `4`, which means that both tokens
4 and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens`
(`"'s"`).
<Infobox title="Important note" variant="warning">