Merge pull request #5890 from svlandeg/feature/el-docs

Ines Montani 2020-08-07 11:56:56 +02:00 committed by GitHub
commit 21c9ea5bd7
8 changed files with 230 additions and 43 deletions

View File

@@ -169,9 +169,9 @@ class Errors:
"training a named entity recognizer, also make sure that none of "
"your annotated entity spans have leading or trailing whitespace "
"or punctuation. "
"You can also use the experimental `debug-data` command to "
"You can also use the experimental `debug data` command to "
"validate your JSON-formatted training data. For details, run:\n"
"python -m spacy debug-data --help")
"python -m spacy debug data --help")
E025 = ("String is too long: {length} characters. Max is 2**30.")
E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
"length {length}.")

View File

@@ -20,7 +20,7 @@ def create_docbin_reader(
class Corpus:
"""Iterate Example objects from a file or directory of DocBin (.spacy)
formatted data files.
path (Path): The directory or filename to read from.
gold_preproc (bool): Whether to set up the Example object with gold-standard

View File

@@ -148,19 +148,133 @@ architectures into your training config.
## Text classification architectures {#textcat source="spacy/ml/models/textcat.py"}
A text classification architecture needs to take a `Doc` as input and produce a
score for each potential label class. Textcat challenges can be binary (e.g.
sentiment analysis) or involve multiple possible labels. Challenges with
multiple labels can either treat those labels as mutually exclusive (each
example gets exactly one label), or allow several labels to apply to the same
example at once.
As the properties of text classification problems can vary widely, we provide
several different built-in architectures. It is recommended to experiment with
different architectures and settings to determine what works best on your
specific data and challenge.
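Each of these architectures is registered by name and can be plugged in as the
`model` of the `textcat` component. As a minimal sketch (reusing the
`TextCatBOW` settings from the example config further down):

```python
import spacy

nlp = spacy.blank("en")
# The "model" block corresponds to the [model] section of a training config
config = {
    "model": {
        "@architectures": "spacy.TextCatBOW.v1",
        "exclusive_classes": False,
        "ngram_size": 1,
        "no_output_layer": False,
    }
}
textcat = nlp.add_pipe("textcat", config=config)
```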
### spacy.TextCatEnsemble.v1 {#TextCatEnsemble}
Stacked ensemble of a bag-of-words model and a neural network model. The neural
network has an internal CNN Tok2Vec layer and uses attention.
> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatEnsemble.v1"
> exclusive_classes = false
> pretrained_vectors = null
> width = 64
> embed_size = 2000
> conv_depth = 2
> window_size = 1
> ngram_size = 1
> dropout = null
> nO = null
> ```
| Name | Type | Description |
| -------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
| `width` | int | Output dimension of the feature encoding step. |
| `embed_size` | int | Input dimension of the feature encoding step. |
| `conv_depth` | int | Depth of the Tok2Vec layer. |
| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `dropout` | float | The dropout rate. |
| `nO` | int | Output dimension, determined by the number of different labels. |
If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.
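For instance, in this minimal sketch the output dimension is inferred from the
number of labels added before training starts:

```python
import spacy

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
optimizer = nlp.begin_training()  # nO is now set to 2, one score per label
```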
### spacy.TextCatCNN.v1 {#TextCatCNN}
> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatCNN.v1"
> exclusive_classes = false
> nO = null
>
> [model.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 4
> embed_size = 2000
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> dropout = null
> ```
A neural network model where token vectors are calculated using a CNN. The
vectors are mean pooled and used as features in a feed-forward network. This
architecture is usually less accurate than the ensemble, but runs faster.
| Name | Type | Description |
| ------------------- | ------------------------------------------ | --------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO` | int | Output dimension, determined by the number of different labels. |
If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.
### spacy.TextCatBOW.v1 {#TextCatBOW}
An ngram "bag-of-words" model. This architecture should run much faster than the
others, but may not be as accurate, especially if texts are short.
> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatBOW.v1"
> exclusive_classes = false
> ngram_size = 1
> no_output_layer = false
> nO = null
> ```
| Name | Type | Description |
| ------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
| `nO` | int | Output dimension, determined by the number of different labels. |
If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.
### spacy.TextCatLowData.v1 {#TextCatLowData}
## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
An `EntityLinker` component disambiguates textual mentions (tagged as named
entities) to unique identifiers, grounding the named entities into the "real
world". This requires 3 main components:
- A [`KnowledgeBase`](/api/kb) (KB) holding the unique identifiers, potential
synonyms and prior probabilities.
- A candidate generation step to produce a set of likely identifiers, given a
certain textual mention.
- A machine learning [`Model`](https://thinc.ai/docs/api-model) that picks the
most plausible ID from the set of candidates.
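As a minimal sketch, these components are wired together when the component is
added to the pipeline, here reusing the registered functions from the example
config below:

```python
import spacy

nlp = spacy.blank("en")
config = {
    # Both the KB creation and the candidate generation are registered functions
    "kb_loader": {"@assets": "spacy.EmptyKB.v1", "entity_vector_length": 64},
    "get_candidates": {"@assets": "spacy.CandidateGenerator.v1"},
}
entity_linker = nlp.add_pipe("entity_linker", config=config)
```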
### spacy.EntityLinker.v1 {#EntityLinker}
<!-- TODO: intro -->
The `EntityLinker` model architecture is a Thinc
[`Model`](https://thinc.ai/docs/api-model) with a `Linear` output layer.
> #### Example Config
>
@@ -170,10 +284,47 @@ architectures into your training config.
> nO = null
>
> [model.tok2vec]
> # ...
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 2
> embed_size = 300
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> dropout = null
>
> [kb_loader]
> @assets = "spacy.EmptyKB.v1"
> entity_vector_length = 64
>
> [get_candidates]
> @assets = "spacy.CandidateGenerator.v1"
> ```
| Name | Type | Description |
| --------- | ------------------------------------------ | ---------------------------------------------------------------------------------------- |
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO` | int | Output dimension, determined by the length of the vectors encoding each entity in the KB. |
If the `nO` dimension is not set, the `EntityLinker` component will set it when
`begin_training` is called.
### spacy.EmptyKB.v1 {#EmptyKB}
A function that creates a default, empty `KnowledgeBase` from a
[`Vocab`](/api/vocab) instance.
| Name | Type | Description |
| ---------------------- | ---- | ------------------------------------------------------------------------- |
| `entity_vector_length` | int | The length of the vectors encoding each entity in the KB. Defaults to 64. |
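Functionally, this resolves to a loader roughly like the following sketch (the
function name `empty_kb` is illustrative, not the internal implementation):

```python
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

def empty_kb(vocab: Vocab) -> KnowledgeBase:
    # An empty KB that only fixes the entity vector length up front
    return KnowledgeBase(vocab=vocab, entity_vector_length=64)
```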
### spacy.CandidateGenerator.v1 {#CandidateGenerator}
A function that takes as input a [`KnowledgeBase`](/api/kb) and a
[`Span`](/api/span) object denoting a named entity, and returns a list of
plausible [`Candidate` objects](/api/kb/#candidate_init).
The default `CandidateGenerator` simply uses the text of a mention to find its
potential aliases in the `KnowledgeBase`. Note that this lookup is
case-sensitive.
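For example, a case-insensitive variant could be provided as a custom
registered function, here under the hypothetical name `"my_candidates.v1"` (a
sketch, assuming the `assets` registry used in the configs above):

```python
from typing import Iterable

import spacy
from spacy.kb import Candidate, KnowledgeBase
from spacy.tokens import Span

@spacy.registry.assets("my_candidates.v1")
def create_lowercase_candidates():
    def get_candidates(kb: KnowledgeBase, span: Span) -> Iterable[Candidate]:
        # Look up the mention in lowercase instead of its verbatim text
        return kb.get_candidates(span.text.lower())

    return get_candidates
```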

View File

@@ -132,7 +132,7 @@ $ python -m spacy init config [output] [--base] [--lang] [--model] [--pipeline]
| `--base`, `-b` | option | Optional base config file to auto-fill with defaults. |
| `--lang`, `-l` | option | Optional language code to use for blank config. If a `--pipeline` is specified, the components will be added in order. |
| `--model`, `-m` | option | Optional base model to copy config from. If a `--pipeline` is specified, only those components will be kept, and all other components not in the model will be added. |
| `--pipeline`, `-p` | option | Optional comma-separated pipeline of components to add to blank language or model. |
| **CREATES** | config | Complete and auto-filled config file for training. |
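For example, to create a blank English config with a tagger and parser
(hypothetical file name):

```bash
$ python -m spacy init config config.cfg --lang en --pipeline tagger,parser
```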
### init model {#init-model new="2"}
@@ -271,7 +271,7 @@ low data labels and more.
<Infobox title="New in v3.0" variant="warning">
The `debug data` command is now available as a subcommand of `spacy debug`. It
takes the same arguments as `train` and reads settings off the
[`config.cfg` file](/usage/training#config) and optional
[overrides](/usage/training#config-overrides) on the CLI.
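For example (hypothetical path):

```bash
$ python -m spacy debug data ./config.cfg
```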

View File

@@ -174,12 +174,32 @@ run [`spacy pretrain`](/api/cli#pretrain).
### Binary training format {#binary-training new="3"}
> #### Example
>
> ```python
> from pathlib import Path
> from spacy.tokens import DocBin
> from spacy.gold import Corpus
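>
> # "docs" is an existing list of Doc objects; "dir" is an output directory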
> output_file = Path(dir) / "output.spacy"
> data = DocBin(docs=docs).to_bytes()
> with output_file.open("wb") as file_:
>     file_.write(data)
> reader = Corpus(output_file)
> ```
<!-- TODO: document DocBin format -->
The main data format used in spaCy v3 is a binary format created by serializing
a [`DocBin`](/api/docbin) object, which represents a collection of `Doc`
objects. Typically, the extension for these binary files is `.spacy`, and they
are used as input format for specifying a [training corpus](/api/corpus) and for
spaCy's CLI [`train`](/api/cli#train) command.
This binary format is extremely efficient in storage, especially when packing
multiple documents together.
The built-in [`convert`](/api/cli#convert) command helps you convert spaCy's
previous [JSON format](#json-input) to this new `DocBin` format. It also
supports conversion of the `.conllu` format used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies).
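For example (hypothetical file names):

```bash
$ python -m spacy convert ./train.json ./corpus
```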
### JSON training format {#json-input tag="deprecated"}
@@ -187,7 +207,7 @@ well as spaCy's previous [JSON format](#json-input).
As of v3.0, the JSON input format is deprecated and is replaced by the
[binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
objects to JSON, you can now serialize them directly using the
[`DocBin`](/api/docbin) container and then use them as input data.
[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`

View File

@@ -9,6 +9,13 @@ api_string_name: entity_linker
api_trainable: true
---
An `EntityLinker` component disambiguates textual mentions (tagged as named
entities) to unique identifiers, grounding the named entities into the "real
world". It requires a `KnowledgeBase`, as well as a function to generate
plausible candidates from that `KnowledgeBase` given a certain textual mention,
and an ML model to pick the right candidate, given the local context of the
mention.
## Config and implementation {#config}
The default config is defined by the pipeline component factory and describes
@@ -23,22 +30,24 @@ architectures and their arguments and hyperparameters.
> ```python
> from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
> config = {
> "kb": None,
> "labels_discard": [],
> "incl_prior": True,
> "incl_context": True,
> "model": DEFAULT_NEL_MODEL,
> "kb_loader": {'@assets': 'spacy.EmptyKB.v1', 'entity_vector_length': 64},
> "get_candidates": {'@assets': 'spacy.CandidateGenerator.v1'},
> }
> nlp.add_pipe("entity_linker", config=config)
> ```
| Setting | Type | Description | Default |
| ---------------- | -------------------------------------------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------ |
| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. | `[]` |
| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. | `True` |
| `incl_context` | bool | Whether or not to include the local context in the model. | `True` |
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [EntityLinker](/api/architectures#EntityLinker) |
| `kb_loader` | `Callable[[Vocab], KnowledgeBase]` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. | An empty KnowledgeBase with `entity_vector_length` 64. |
| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. | Built-in dictionary-lookup function. |
```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
@@ -53,7 +62,11 @@ https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
> entity_linker = nlp.add_pipe("entity_linker")
>
> # Construction via add_pipe with custom model
> config = {"model": {"@architectures": "my_el"}}
> config = {"model": {"@architectures": "my_el.v1"}}
> entity_linker = nlp.add_pipe("entity_linker", config=config)
>
> # Construction via add_pipe with custom KB and candidate generation
> config = {"kb_loader": {"@assets": "my_kb.v1"}, "get_candidates": {"@assets": "my_candidates.v1"},}
> entity_linker = nlp.add_pipe("entity_linker", config=config)
>
> # Construction from class
@@ -65,18 +78,20 @@ Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).
<!-- TODO: finish API docs -->
Note that both the internal KB and the candidate generator can be customized by
providing custom registered functions, as sketched below.
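A sketch of such a custom KB loader, registered under the hypothetical name
`"my_kb.v1"` used in the construction example above (the KB contents are
placeholders):

```python
import spacy
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

@spacy.registry.assets("my_kb.v1")
def load_custom_kb():
    def kb_loader(vocab: Vocab) -> KnowledgeBase:
        kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
        # Placeholder contents: one entity and one alias for illustration
        kb.add_entity(entity="Q42", freq=12, entity_vector=[0.0] * 64)
        kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
        return kb

    return kb_loader
```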
| Name | Type | Description |
| ---------------- | -------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
| _keyword-only_ | | |
| `kb_loader` | `Callable[[Vocab], KnowledgeBase]` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. |
| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. |
| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. |
| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. |
| `incl_context` | bool | Whether or not to include the local context in the model. |
## EntityLinker.\_\_call\_\_ {#call tag="method"}

View File

@@ -380,8 +380,9 @@ table instead of only returning the structured data.
> #### ✏️ Things to try
>
> 1. Add the components `"ner"` and `"sentencizer"` _before_ the
> `"entity_linker"`. The analysis should now show no problems, because
> requirements are met.
```python
### {executable="true"}

View File

@@ -122,7 +122,7 @@ related to more general machine learning functionality.
| **Lemmatization** | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
| **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences. |
| **Named Entity Recognition** (NER) | Labelling named "real-world" objects, like persons, companies or locations. |
| **Entity Linking** (EL) | Disambiguating textual entities to unique identifiers in a knowledge base. |
| **Similarity** | Comparing words, text spans and documents and how similar they are to each other. |
| **Text Classification** | Assigning categories or labels to a whole document, or parts of a document. |
| **Rule-based Matching** | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
@@ -379,7 +379,7 @@ spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This
will give you the object and its encoded annotations, plus the "key" to decode
it.
## Knowledge base {#kb}
To support the entity linking task, spaCy stores external knowledge in a
[`KnowledgeBase`](/api/kb). The knowledge base (KB) uses the `Vocab` to store
@@ -426,7 +426,7 @@ print("Number of aliases in KB:", kb.get_size_aliases()) # 2
### Candidate generation
Given a textual entity, the knowledge base can provide a list of plausible
candidates or entity identifiers. The [`EntityLinker`](/api/entitylinker) will
take this list of candidates as input, and disambiguate the mention to the most
probable identifier, given the document context.
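A minimal sketch of this lookup, with hypothetical KB contents (the attributes
on the returned [`Candidate`](/api/kb/#candidate_init) objects are what the
`EntityLinker` weighs against the document context):

```python
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

kb = KnowledgeBase(vocab=Vocab(), entity_vector_length=3)
kb.add_entity(entity="Q2146908", freq=12, entity_vector=[4, 3, 0])
kb.add_alias(alias="Russ Cochran", entities=["Q2146908"], probabilities=[0.8])

for candidate in kb.get_candidates("Russ Cochran"):
    print(candidate.entity_, candidate.alias_, candidate.prior_prob)
```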