Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-26 17:24:41 +03:00)

Merge pull request #5890 from svlandeg/feature/el-docs

Commit 21c9ea5bd7
@@ -169,9 +169,9 @@ class Errors:
             "training a named entity recognizer, also make sure that none of "
             "your annotated entity spans have leading or trailing whitespace "
             "or punctuation. "
-            "You can also use the experimental `debug-data` command to "
+            "You can also use the experimental `debug data` command to "
             "validate your JSON-formatted training data. For details, run:\n"
-            "python -m spacy debug-data --help")
+            "python -m spacy debug data --help")
     E025 = ("String is too long: {length} characters. Max is 2**30.")
     E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
             "length {length}.")
@@ -20,7 +20,7 @@ def create_docbin_reader(
 
 class Corpus:
     """Iterate Example objects from a file or directory of DocBin (.spacy)
-    formated data files.
+    formatted data files.
 
     path (Path): The directory or filename to read from.
     gold_preproc (bool): Whether to set up the Example object with gold-standard
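To make the `Corpus` reader documented above concrete, here is a minimal usage sketch. The directory path is hypothetical, and calling the reader with an `nlp` object (as the training loop does) is assumed from the surrounding v3 development code:

```python
import spacy
from spacy.gold import Corpus

# Read Example objects from a directory of DocBin (.spacy) files.
# "./corpus/train" is a placeholder path.
corpus = Corpus("./corpus/train", gold_preproc=False)
nlp = spacy.blank("en")
train_examples = list(corpus(nlp))  # the reader is called with an nlp object
```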
@@ -148,19 +148,133 @@ architectures into your training config.
 
 ## Text classification architectures {#textcat source="spacy/ml/models/textcat.py"}
 
+A text classification architecture needs to take a `Doc` as input, and produce a
+score for each potential label class. Textcat challenges can be binary (e.g.
+sentiment analysis) or involve multiple possible labels. Multi-label challenges
+can either have mutually exclusive labels (each example has exactly one label),
+or multiple labels may be applicable at the same time.
+
+As the properties of text classification problems can vary widely, we provide
+several different built-in architectures. It is recommended to experiment with
+different architectures and settings to determine what works best on your
+specific data and challenge.
+
 ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble}
 
+Stacked ensemble of a bag-of-words model and a neural network model. The neural
+network has an internal CNN Tok2Vec layer and uses attention.
+
+> #### Example Config
+>
+> ```ini
+> [model]
+> @architectures = "spacy.TextCatEnsemble.v1"
+> exclusive_classes = false
+> pretrained_vectors = null
+> width = 64
+> embed_size = 2000
+> conv_depth = 2
+> window_size = 1
+> ngram_size = 1
+> dropout = null
+> nO = null
+> ```
+
+| Name | Type | Description |
+| -------------------- | ----- | ----------- |
+| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
+| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
+| `width` | int | Output dimension of the feature encoding step. |
+| `embed_size` | int | Input dimension of the feature encoding step. |
+| `conv_depth` | int | Depth of the Tok2Vec layer. |
+| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
+| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
+| `dropout` | float | The dropout rate. |
+| `nO` | int | Output dimension, determined by the number of different labels. |
+
+If the `nO` dimension is not set, the TextCategorizer component will set it when
+`begin_training` is called.
+
+### spacy.TextCatCNN.v1 {#TextCatCNN}
+
+> #### Example Config
+>
+> ```ini
+> [model]
+> @architectures = "spacy.TextCatCNN.v1"
+> exclusive_classes = false
+> nO = null
+>
+> [model.tok2vec]
+> @architectures = "spacy.HashEmbedCNN.v1"
+> pretrained_vectors = null
+> width = 96
+> depth = 4
+> embed_size = 2000
+> window_size = 1
+> maxout_pieces = 3
+> subword_features = true
+> dropout = null
+> ```
+
+A neural network model where token vectors are calculated using a CNN. The
+vectors are mean pooled and used as features in a feed-forward network. This
+architecture is usually less accurate than the ensemble, but runs faster.
+
+| Name | Type | Description |
+| ------------------- | ------------------------------------------ | ----------- |
+| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
+| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
+| `nO` | int | Output dimension, determined by the number of different labels. |
+
+If the `nO` dimension is not set, the TextCategorizer component will set it when
+`begin_training` is called.
+
 ### spacy.TextCatBOW.v1 {#TextCatBOW}
 
-### spacy.TextCatCNN.v1 {#TextCatCNN}
+An ngram "bag-of-words" model. This architecture should run much faster than the
+others, but may not be as accurate, especially if texts are short.
+
+> #### Example Config
+>
+> ```ini
+> [model]
+> @architectures = "spacy.TextCatBOW.v1"
+> exclusive_classes = false
+> ngram_size = 1
+> no_output_layer = false
+> nO = null
+> ```
+
+| Name | Type | Description |
+| ------------------- | ---- | ----------- |
+| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
+| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
+| `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
+| `nO` | int | Output dimension, determined by the number of different labels. |
+
+If the `nO` dimension is not set, the TextCategorizer component will set it when
+`begin_training` is called.
 
 ### spacy.TextCatLowData.v1 {#TextCatLowData}
 
 ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
 
+An `EntityLinker` component disambiguates textual mentions (tagged as named
+entities) to unique identifiers, grounding the named entities into the "real
+world". This requires 3 main components:
+
+- A [`KnowledgeBase`](/api/kb) (KB) holding the unique identifiers, potential
+  synonyms and prior probabilities.
+- A candidate generation step to produce a set of likely identifiers, given a
+  certain textual mention.
+- A machine learning [`Model`](https://thinc.ai/docs/api-model) that picks the
+  most plausible ID from the set of candidates.
+
 ### spacy.EntityLinker.v1 {#EntityLinker}
 
-<!-- TODO: intro -->
+The `EntityLinker` model architecture is a `Thinc` `Model` with a Linear output
+layer.
 
 > #### Example Config
 >
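As a quick illustration of how these architecture configs are used, here is a hedged sketch that plugs `spacy.TextCatBOW.v1` into a pipeline at runtime via `add_pipe` instead of a config file. The label names are illustrative:

```python
import spacy

config = {
    "model": {
        "@architectures": "spacy.TextCatBOW.v1",
        "exclusive_classes": False,
        "ngram_size": 1,
        "no_output_layer": False,
        "nO": None,
    }
}
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat", config=config)
# nO is inferred from the added labels once begin_training runs.
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
```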
@@ -170,10 +284,47 @@ architectures into your training config.
 > nO = null
 >
 > [model.tok2vec]
-> # ...
+> @architectures = "spacy.HashEmbedCNN.v1"
+> pretrained_vectors = null
+> width = 96
+> depth = 2
+> embed_size = 300
+> window_size = 1
+> maxout_pieces = 3
+> subword_features = true
+> dropout = null
+>
+> [kb_loader]
+> @assets = "spacy.EmptyKB.v1"
+> entity_vector_length = 64
+>
+> [get_candidates]
+> @assets = "spacy.CandidateGenerator.v1"
 > ```
 
-| Name | Type | Description |
-| --------- | ------------------------------------------ | ----------- |
-| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | |
-| `nO` | int | |
+| Name | Type | Description |
+| --------- | ------------------------------------------ | ----------- |
+| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
+| `nO` | int | Output dimension, determined by the length of the vectors encoding each entity in the KB. |
+
+If the `nO` dimension is not set, the `EntityLinker` component will set it when
+`begin_training` is called.
+
+### spacy.EmptyKB.v1 {#EmptyKB}
+
+A function that creates a default, empty `KnowledgeBase` from a
+[`Vocab`](/api/vocab) instance.
+
+| Name | Type | Description |
+| ---------------------- | ---- | ----------- |
+| `entity_vector_length` | int | The length of the vectors encoding each entity in the KB, 64 by default. |
+
+### spacy.CandidateGenerator.v1 {#CandidateGenerator}
+
+A function that takes as input a [`KnowledgeBase`](/api/kb) and a
+[`Span`](/api/span) object denoting a named entity, and returns a list of
+plausible [`Candidate` objects](/api/kb/#candidate_init).
+
+The default `CandidateGenerator` simply uses the text of a mention to find its
+potential aliases in the knowledge base. Note that this function is
+case-dependent.
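Roughly what `spacy.EmptyKB.v1` is described as doing, sketched against the public `KnowledgeBase` API. This is an illustration, not the registered function's actual implementation:

```python
from spacy.vocab import Vocab
from spacy.kb import KnowledgeBase

# An empty KB with 64-dimensional entity vectors, matching the
# entity_vector_length default from the example config above.
vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
print(kb.get_size_entities())  # 0, nothing has been added yet
```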
@@ -132,7 +132,7 @@ $ python -m spacy init config [output] [--base] [--lang] [--model] [--pipeline]
 | `--base`, `-b` | option | Optional base config file to auto-fill with defaults. |
 | `--lang`, `-l` | option | Optional language code to use for blank config. If a `--pipeline` is specified, the components will be added in order. |
 | `--model`, `-m` | option | Optional base model to copy config from. If a `--pipeline` is specified, only those components will be kept, and all other components not in the model will be added. |
-| `--pipeline`, `-p` | option | Optional comma-separate pipeline of components to add to blank language or model. |
+| `--pipeline`, `-p` | option | Optional comma-separated pipeline of components to add to blank language or model. |
 | **CREATES** | config | Complete and auto-filled config file for training. |
 
 ### init model {#init-model new="2"}
@@ -271,7 +271,7 @@ low data labels and more.
 
 <Infobox title="New in v3.0" variant="warning">
 
-The `debug-data` command is now available as a subcommand of `spacy debug`. It
+The `debug data` command is now available as a subcommand of `spacy debug`. It
 takes the same arguments as `train` and reads settings off the
 [`config.cfg` file](/usage/training#config) and optional
 [overrides](/usage/training#config-overrides) on the CLI.
@@ -174,12 +174,32 @@ run [`spacy pretrain`](/api/cli#pretrain).
 
 ### Binary training format {#binary-training new="3"}
 
-The built-in [`convert`](/api/cli#convert) command helps you convert the
-`.conllu` format used by the
-[Universal Dependencies corpora](https://github.com/UniversalDependencies) as
-well as spaCy's previous [JSON format](#json-input).
+> #### Example
+>
+> ```python
+> from pathlib import Path
+> from spacy.tokens import DocBin
+> from spacy.gold import Corpus
+> output_file = Path(dir) / "output.spacy"
+> data = DocBin(docs=docs).to_bytes()
+> with output_file.open("wb") as file_:
+>     file_.write(data)
+> reader = Corpus(output_file)
+> ```
 
-<!-- TODO: document DocBin format -->
+The main data format used in spaCy v3 is a binary format created by serializing
+a [`DocBin`](/api/docbin) object, which represents a collection of `Doc`
+objects. Typically, the extension for these binary files is `.spacy`, and they
+are used as input format for specifying a [training corpus](/api/corpus) and for
+spaCy's CLI [`train`](/api/cli#train) command.
+
+This binary format is extremely efficient in storage, especially when packing
+multiple documents together.
+
+The built-in [`convert`](/api/cli#convert) command helps you convert spaCy's
+previous [JSON format](#json-input) to this new `DocBin` format. It also
+supports conversion of the `.conllu` format used by the
+[Universal Dependencies corpora](https://github.com/UniversalDependencies).
 
 ### JSON training format {#json-input tag="deprecated"}
 
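The quoted example assumes `docs` and `dir` are already defined. A self-contained variant of the same idea, with a hypothetical file name:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = [nlp("I like pizza."), nlp("Berlin is a nice city.")]
# Serialize the collection of Doc objects to the binary .spacy format.
data = DocBin(docs=docs).to_bytes()
with open("train.spacy", "wb") as file_:
    file_.write(data)
```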
@@ -187,7 +207,7 @@ well as spaCy's previous [JSON format](#json-input).
 
 As of v3.0, the JSON input format is deprecated and is replaced by the
 [binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
-objects to JSON, you can now now serialize them directly using the
+objects to JSON, you can now serialize them directly using the
 [`DocBin`](/api/docbin) container and then use them as input data.
 
 [`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`
@@ -9,6 +9,13 @@ api_string_name: entity_linker
 api_trainable: true
 ---
 
+An `EntityLinker` component disambiguates textual mentions (tagged as named
+entities) to unique identifiers, grounding the named entities into the "real
+world". It requires a `KnowledgeBase`, as well as a function to generate
+plausible candidates from that `KnowledgeBase` given a certain textual mention,
+and an ML model to pick the right candidate, given the local context of the
+mention.
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
@@ -23,22 +30,24 @@ architectures and their arguments and hyperparameters.
 > ```python
 > from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
 > config = {
->     "kb": None,
 >     "labels_discard": [],
 >     "incl_prior": True,
 >     "incl_context": True,
 >     "model": DEFAULT_NEL_MODEL,
+>     "kb_loader": {'@assets': 'spacy.EmptyKB.v1', 'entity_vector_length': 64},
+>     "get_candidates": {'@assets': 'spacy.CandidateGenerator.v1'},
 > }
 > nlp.add_pipe("entity_linker", config=config)
 > ```
 
-| Setting | Type | Description | Default |
-| ---------------- | ------------------------------------------ | ----------- | ------- |
-| `kb` | `KnowledgeBase` | The [`KnowledgeBase`](/api/kb) holding all entities and their aliases. | `None` |
-| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. | `[]` |
-| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. | `True` |
-| `incl_context` | bool | Whether or not to include the local context in the model. | `True` |
-| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [EntityLinker](/api/architectures#EntityLinker) |
+| Setting | Type | Description | Default |
+| ---------------- | -------------------------------------------------------- | ----------- | ------- |
+| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. | `[]` |
+| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. | `True` |
+| `incl_context` | bool | Whether or not to include the local context in the model. | `True` |
+| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [EntityLinker](/api/architectures#EntityLinker) |
+| `kb_loader` | `Callable[[Vocab], KnowledgeBase]` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. | An empty `KnowledgeBase` with `entity_vector_length` 64. |
+| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. | Built-in dictionary-lookup function. |
 
 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
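A hedged sketch of what a custom `kb_loader` might look like: a function registered under `@assets` that returns a `Callable[[Vocab], KnowledgeBase]`, mirroring the documented `spacy.EmptyKB.v1` default. The registry name is illustrative, and the registration API is assumed from the `@assets` references above:

```python
from spacy.util import registry
from spacy.kb import KnowledgeBase

@registry.assets.register("my_kb.v1")
def configure_custom_kb(entity_vector_length: int):
    def create_kb(vocab):
        kb = KnowledgeBase(vocab=vocab, entity_vector_length=entity_vector_length)
        # Entities and aliases would be added (or loaded from disk) here.
        return kb
    return create_kb
```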
@@ -53,7 +62,11 @@ https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
 > entity_linker = nlp.add_pipe("entity_linker")
 >
 > # Construction via add_pipe with custom model
-> config = {"model": {"@architectures": "my_el"}}
+> config = {"model": {"@architectures": "my_el.v1"}}
 > entity_linker = nlp.add_pipe("entity_linker", config=config)
 >
+> # Construction via add_pipe with custom KB and candidate generation
+> config = {"kb_loader": {"@assets": "my_kb.v1"}, "get_candidates": {"@assets": "my_candidates.v1"}}
+> entity_linker = nlp.add_pipe("entity_linker", config=config)
+>
 > # Construction from class
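A matching hedged sketch for the `"my_candidates.v1"` reference in the construction example above, following the documented `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` signature. The case-insensitive lookup is the illustrative twist, and `kb.get_candidates` is assumed to be the alias-lookup method used by the built-in generator:

```python
from spacy.util import registry

@registry.assets.register("my_candidates.v1")
def configure_candidates():
    def get_candidates(kb, span):
        # Unlike the default case-dependent lookup, match aliases
        # case-insensitively.
        return kb.get_candidates(span.text.lower())
    return get_candidates
```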
@@ -65,18 +78,20 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).
 
-<!-- TODO: finish API docs -->
+Note that both the internal KB and the candidate generator can be
+customized by providing custom registered functions.
 
-| Name | Type | Description |
-| ---------------- | --------------- | ----------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
-| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
-| _keyword-only_ | | |
-| `kb` | `KnowlegeBase` | The [`KnowledgeBase`](/api/kb) holding all entities and their aliases. |
-| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. |
-| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. |
-| `incl_context` | bool | Whether or not to include the local context in the model. |
+| Name | Type | Description |
+| ---------------- | -------------------------------------------------------- | ----------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
+| _keyword-only_ | | |
+| `kb_loader` | `Callable[[Vocab], KnowledgeBase]` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. |
+| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. |
+| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. |
+| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. |
+| `incl_context` | bool | Whether or not to include the local context in the model. |
 
 ## EntityLinker.\_\_call\_\_ {#call tag="method"}
 
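Putting the parameters above together, a sketch that wires the default KB loader and candidate generator in explicitly through `add_pipe`, equivalent to the defaults from the config table earlier:

```python
import spacy

nlp = spacy.blank("en")
config = {
    "kb_loader": {"@assets": "spacy.EmptyKB.v1", "entity_vector_length": 64},
    "get_candidates": {"@assets": "spacy.CandidateGenerator.v1"},
}
entity_linker = nlp.add_pipe("entity_linker", config=config)
```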
@@ -380,8 +380,9 @@ table instead of only returning the structured data.
 
 > #### ✏️ Things to try
 >
-> 1. Add the components `"ner"` and `"sentencizer"` _before_ the entity linker.
->    The analysis should now show no problems, because requirements are met.
+> 1. Add the components `"ner"` and `"sentencizer"` _before_ the
+>    `"entity_linker"`. The analysis should now show no problems, because
+>    requirements are met.
 
 ```python
 ### {executable="true"}
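The "things to try" above can be reproduced with a sketch along these lines, assuming the v3 `nlp.analyze_pipes` API this page documents:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("ner")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_linker")
# With "ner" and "sentencizer" before "entity_linker", the analysis
# should report no unmet requirements.
nlp.analyze_pipes(pretty=True)
```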
@@ -122,7 +122,7 @@ related to more general machine learning functionality.
 | **Lemmatization** | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
 | **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences. |
 | **Named Entity Recognition** (NER) | Labelling named "real-world" objects, like persons, companies or locations. |
-| **Entity Linking** (EL) | Disambiguating textual entities to unique identifiers in a Knowledge Base. |
+| **Entity Linking** (EL) | Disambiguating textual entities to unique identifiers in a knowledge base. |
 | **Similarity** | Comparing words, text spans and documents and how similar they are to each other. |
 | **Text Classification** | Assigning categories or labels to a whole document, or parts of a document. |
 | **Rule-based Matching** | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
@@ -379,7 +379,7 @@ spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This
 will give you the object and its encoded annotations, plus the "key" to decode
 it.
 
-## Knowledge Base {#kb}
+## Knowledge base {#kb}
 
 To support the entity linking task, spaCy stores external knowledge in a
 [`KnowledgeBase`](/api/kb). The knowledge base (KB) uses the `Vocab` to store
@@ -426,7 +426,7 @@ print("Number of aliases in KB:", kb.get_size_aliases())  # 2
 
 ### Candidate generation
 
-Given a textual entity, the Knowledge Base can provide a list of plausible
+Given a textual entity, the knowledge base can provide a list of plausible
 candidates or entity identifiers. The [`EntityLinker`](/api/entitylinker) will
 take this list of candidates as input, and disambiguate the mention to the most
 probable identifier, given the document context.
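Continuing the knowledge-base example this hunk belongs to (note `kb.get_size_aliases()` in its context line), a sketch of candidate generation. The entity ID, frequency and vectors are illustrative, and `kb.get_candidates` taking the alias string is assumed from the surrounding docs:

```python
from spacy.vocab import Vocab
from spacy.kb import KnowledgeBase

vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_alias(alias="Douglas", entities=["Q1004791"], probabilities=[1.0])

# The KB proposes plausible candidates; the EntityLinker then
# disambiguates the mention against the document context.
candidates = kb.get_candidates("Douglas")
print([candidate.entity_ for candidate in candidates])
```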