mirror of https://github.com/explosion/spaCy.git
synced 2025-07-04 20:03:13 +03:00
Merge pull request #5890 from svlandeg/feature/el-docs
This commit is contained in commit 21c9ea5bd7
@@ -169,9 +169,9 @@ class Errors:
            "training a named entity recognizer, also make sure that none of "
            "your annotated entity spans have leading or trailing whitespace "
            "or punctuation. "
            "You can also use the experimental `debug data` command to "
            "validate your JSON-formatted training data. For details, run:\n"
            "python -m spacy debug data --help")
    E025 = ("String is too long: {length} characters. Max is 2**30.")
    E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
            "length {length}.")
@@ -20,7 +20,7 @@ def create_docbin_reader(

class Corpus:
    """Iterate Example objects from a file or directory of DocBin (.spacy)
    formatted data files.

    path (Path): The directory or filename to read from.
    gold_preproc (bool): Whether to set up the Example object with gold-standard
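In a training config, such a corpus is typically wired up through a registered reader function rather than constructed directly. A minimal, hedged sketch — the reader name `spacy.Corpus.v1` and the `corpus/train.spacy` path are illustrative assumptions, not verified against any particular version:

```ini
[training.train_corpus]
@readers = "spacy.Corpus.v1"
path = "corpus/train.spacy"
gold_preproc = false
```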
@@ -148,19 +148,133 @@ architectures into your training config.

## Text classification architectures {#textcat source="spacy/ml/models/textcat.py"}

A text classification architecture needs to take a `Doc` as input, and produce a
score for each potential label class. Textcat challenges can be binary (e.g.
sentiment analysis) or involve multiple possible labels. Multi-class challenges
can either have mutually exclusive labels (each example has exactly one label),
or multiple labels may be applicable at the same time.

As the properties of text classification problems can vary widely, we provide
several different built-in architectures. It is recommended to experiment with
different architectures and settings to determine what works best on your
specific data and challenge.

### spacy.TextCatEnsemble.v1 {#TextCatEnsemble}

Stacked ensemble of a bag-of-words model and a neural network model. The neural
network has an internal CNN Tok2Vec layer and uses attention.

> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatEnsemble.v1"
> exclusive_classes = false
> pretrained_vectors = null
> width = 64
> embed_size = 2000
> conv_depth = 2
> window_size = 1
> ngram_size = 1
> dropout = null
> nO = null
> ```

| Name                 | Type  | Description |
| -------------------- | ----- | ----------- |
| `exclusive_classes`  | bool  | Whether or not categories are mutually exclusive. |
| `pretrained_vectors` | bool  | Whether or not pretrained vectors will be used in addition to the feature vectors. |
| `width`              | int   | Output dimension of the feature encoding step. |
| `embed_size`         | int   | Input dimension of the feature encoding step. |
| `conv_depth`         | int   | Depth of the Tok2Vec layer. |
| `window_size`        | int   | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
| `ngram_size`         | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `dropout`            | float | The dropout rate. |
| `nO`                 | int   | Output dimension, determined by the number of different labels. |

If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.

### spacy.TextCatCNN.v1 {#TextCatCNN}

> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatCNN.v1"
> exclusive_classes = false
> nO = null
>
> [model.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 4
> embed_size = 2000
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> dropout = null
> ```

A neural network model where token vectors are calculated using a CNN. The
vectors are mean pooled and used as features in a feed-forward network. This
architecture is usually less accurate than the ensemble, but runs faster.

| Name                | Type                                       | Description |
| ------------------- | ------------------------------------------ | ----------- |
| `exclusive_classes` | bool                                       | Whether or not categories are mutually exclusive. |
| `tok2vec`           | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO`                | int                                        | Output dimension, determined by the number of different labels. |

If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.

### spacy.TextCatBOW.v1 {#TextCatBOW}

An ngram "bag-of-words" model. This architecture should run much faster than the
others, but may not be as accurate, especially if texts are short.

> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatBOW.v1"
> exclusive_classes = false
> ngram_size = 1
> no_output_layer = false
> nO = null
> ```

| Name                | Type | Description |
| ------------------- | ---- | ----------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `ngram_size`        | int  | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `no_output_layer`   | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
| `nO`                | int  | Output dimension, determined by the number of different labels. |

If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.

### spacy.TextCatLowData.v1 {#TextCatLowData}

## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}

An `EntityLinker` component disambiguates textual mentions (tagged as named
entities) to unique identifiers, grounding the named entities into the "real
world". This requires 3 main components:

- A [`KnowledgeBase`](/api/kb) (KB) holding the unique identifiers, potential
  synonyms and prior probabilities.
- A candidate generation step to produce a set of likely identifiers, given a
  certain textual mention.
- A machine learning [`Model`](https://thinc.ai/docs/api-model) that picks the
  most plausible ID from the set of candidates.

### spacy.EntityLinker.v1 {#EntityLinker}

The `EntityLinker` model architecture is a `Thinc` `Model` with a Linear output
layer.

> #### Example Config
>
@@ -170,10 +284,47 @@ architectures into your training config.
> nO = null
>
> [model.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 2
> embed_size = 300
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> dropout = null
>
> [kb_loader]
> @assets = "spacy.EmptyKB.v1"
> entity_vector_length = 64
>
> [get_candidates]
> @assets = "spacy.CandidateGenerator.v1"
> ```

| Name      | Type                                       | Description |
| --------- | ------------------------------------------ | ----------- |
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO`      | int                                        | Output dimension, determined by the length of the vectors encoding each entity in the KB. |

If the `nO` dimension is not set, the Entity Linking component will set it when
`begin_training` is called.

### spacy.EmptyKB.v1 {#EmptyKB}

A function that creates a default, empty `KnowledgeBase` from a
[`Vocab`](/api/vocab) instance.

| Name                   | Type | Description |
| ---------------------- | ---- | ----------- |
| `entity_vector_length` | int  | The length of the vectors encoding each entity in the KB - 64 by default. |

### spacy.CandidateGenerator.v1 {#CandidateGenerator}

A function that takes as input a [`KnowledgeBase`](/api/kb) and a
[`Span`](/api/span) object denoting a named entity, and returns a list of
plausible [`Candidate` objects](/api/kb/#candidate_init).

The default `CandidateGenerator` simply uses the text of a mention to find its
potential aliases in the `KnowledgeBase`. Note that this function is
case-dependent.

@@ -132,7 +132,7 @@ $ python -m spacy init config [output] [--base] [--lang] [--model] [--pipeline]
| `--base`, `-b`     | option | Optional base config file to auto-fill with defaults. |
| `--lang`, `-l`     | option | Optional language code to use for blank config. If a `--pipeline` is specified, the components will be added in order. |
| `--model`, `-m`    | option | Optional base model to copy config from. If a `--pipeline` is specified, only those components will be kept, and all other components not in the model will be added. |
| `--pipeline`, `-p` | option | Optional comma-separated pipeline of components to add to blank language or model. |
| **CREATES**        | config | Complete and auto-filled config file for training. |

### init model {#init-model new="2"}

@@ -271,7 +271,7 @@ low data labels and more.

<Infobox title="New in v3.0" variant="warning">

The `debug data` command is now available as a subcommand of `spacy debug`. It
takes the same arguments as `train` and reads settings off the
[`config.cfg` file](/usage/training#config) and optional
[overrides](/usage/training#config-overrides) on the CLI.

@@ -174,12 +174,32 @@ run [`spacy pretrain`](/api/cli#pretrain).

### Binary training format {#binary-training new="3"}

> #### Example
>
> ```python
> from pathlib import Path
> from spacy.tokens import DocBin
> from spacy.gold import Corpus
>
> output_file = Path(dir) / "output.spacy"
> data = DocBin(docs=docs).to_bytes()
> with output_file.open("wb") as file_:
>     file_.write(data)
> reader = Corpus(output_file)
> ```

The main data format used in spaCy v3 is a binary format created by serializing
a [`DocBin`](/api/docbin) object, which represents a collection of `Doc`
objects. Typically, the extension for these binary files is `.spacy`, and they
are used as input format for specifying a [training corpus](/api/corpus) and for
spaCy's CLI [`train`](/api/cli#train) command.

This binary format is extremely efficient in storage, especially when packing
multiple documents together.

The built-in [`convert`](/api/cli#convert) command helps you convert spaCy's
previous [JSON format](#json-input) to this new `DocBin` format. It also
supports conversion of the `.conllu` format used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies).

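Reading a serialized collection back is the mirror image of the example above. A short sketch of an in-memory round trip (the blank pipeline and example texts are made up for illustration):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # a blank pipeline; we only need its vocab
docs = [nlp("Berlin is a city"), nlp("ACME is a company")]

# Serialize the collection of Doc objects to the binary format ...
data = DocBin(docs=docs).to_bytes()

# ... and recover the Doc objects, sharing the same vocab
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```

Writing `data` to a file with the `.spacy` extension gives you exactly the kind of corpus file the `train` command expects.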
### JSON training format {#json-input tag="deprecated"}

@@ -187,7 +207,7 @@ well as spaCy's previous [JSON format](#json-input).

As of v3.0, the JSON input format is deprecated and is replaced by the
[binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
objects to JSON, you can now serialize them directly using the
[`DocBin`](/api/docbin) container and then use them as input data.

[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`

@@ -9,6 +9,13 @@ api_string_name: entity_linker
api_trainable: true
---

An `EntityLinker` component disambiguates textual mentions (tagged as named
entities) to unique identifiers, grounding the named entities into the "real
world". It requires a `KnowledgeBase`, as well as a function to generate
plausible candidates from that `KnowledgeBase` given a certain textual mention,
and an ML model to pick the right candidate, given the local context of the
mention.

## Config and implementation {#config}

The default config is defined by the pipeline component factory and describes

@@ -23,22 +30,24 @@ architectures and their arguments and hyperparameters.
> ```python
> from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
> config = {
>    "labels_discard": [],
>    "incl_prior": True,
>    "incl_context": True,
>    "model": DEFAULT_NEL_MODEL,
>    "kb_loader": {'@assets': 'spacy.EmptyKB.v1', 'entity_vector_length': 64},
>    "get_candidates": {'@assets': 'spacy.CandidateGenerator.v1'},
> }
> nlp.add_pipe("entity_linker", config=config)
> ```

| Setting          | Type                                                     | Description | Default |
| ---------------- | -------------------------------------------------------- | ----------- | ------- |
| `labels_discard` | `Iterable[str]`                                          | NER labels that will automatically get a "NIL" prediction. | `[]` |
| `incl_prior`     | bool                                                     | Whether or not to include prior probabilities from the KB in the model. | `True` |
| `incl_context`   | bool                                                     | Whether or not to include the local context in the model. | `True` |
| `model`          | [`Model`](https://thinc.ai/docs/api-model)               | The model to use. | [EntityLinker](/api/architectures#EntityLinker) |
| `kb_loader`      | `Callable[[Vocab], KnowledgeBase]`                       | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. | An empty KnowledgeBase with `entity_vector_length` 64. |
| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. | Built-in dictionary-lookup function. |

```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py

@@ -53,7 +62,11 @@ https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
> entity_linker = nlp.add_pipe("entity_linker")
>
> # Construction via add_pipe with custom model
> config = {"model": {"@architectures": "my_el.v1"}}
> entity_linker = nlp.add_pipe("entity_linker", config=config)
>
> # Construction via add_pipe with custom KB and candidate generation
> config = {"kb_loader": {"@assets": "my_kb.v1"}, "get_candidates": {"@assets": "my_candidates.v1"}}
> entity_linker = nlp.add_pipe("entity_linker", config=config)
>
> # Construction from class

@@ -65,15 +78,17 @@ Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).

Note that both the internal KB as well as the candidate generator can be
customized by providing custom registered functions.

| Name             | Type                                                     | Description |
| ---------------- | -------------------------------------------------------- | ----------- |
| `vocab`          | `Vocab`                                                  | The shared vocabulary. |
| `model`          | `Model`                                                  | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `name`           | str                                                      | String name of the component instance. Used to add entries to the `losses` during training. |
| _keyword-only_   |                                                          | |
| `kb_loader`      | `Callable[[Vocab], KnowledgeBase]`                       | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. |
| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. |
| `labels_discard` | `Iterable[str]`                                          | NER labels that will automatically get a "NIL" prediction. |
| `incl_prior`     | bool                                                     | Whether or not to include prior probabilities from the KB in the model. |
| `incl_context`   | bool                                                     | Whether or not to include the local context in the model. |

@@ -380,8 +380,9 @@ table instead of only returning the structured data.

> #### ✏️ Things to try
>
> 1. Add the components `"ner"` and `"sentencizer"` _before_ the
>    `"entity_linker"`. The analysis should now show no problems, because
>    requirements are met.

```python
### {executable="true"}

@@ -122,7 +122,7 @@ related to more general machine learning functionality.
| **Lemmatization**                     | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
| **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences. |
| **Named Entity Recognition** (NER)    | Labelling named "real-world" objects, like persons, companies or locations. |
| **Entity Linking** (EL)               | Disambiguating textual entities to unique identifiers in a knowledge base. |
| **Similarity**                        | Comparing words, text spans and documents and how similar they are to each other. |
| **Text Classification**               | Assigning categories or labels to a whole document, or parts of a document. |
| **Rule-based Matching**               | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |

@@ -379,7 +379,7 @@ spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This
will give you the object and its encoded annotations, plus the "key" to decode
it.

## Knowledge base {#kb}

To support the entity linking task, spaCy stores external knowledge in a
[`KnowledgeBase`](/api/kb). The knowledge base (KB) uses the `Vocab` to store

@@ -426,7 +426,7 @@ print("Number of aliases in KB:", kb.get_size_aliases()) # 2

### Candidate generation

Given a textual entity, the knowledge base can provide a list of plausible
candidates or entity identifiers. The [`EntityLinker`](/api/entitylinker) will
take this list of candidates as input, and disambiguate the mention to the most
probable identifier, given the document context.