Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-26 17:24:41 +03:00)

Merge pull request #5890 from svlandeg/feature/el-docs

Commit 21c9ea5bd7
@@ -169,9 +169,9 @@ class Errors:
             "training a named entity recognizer, also make sure that none of "
             "your annotated entity spans have leading or trailing whitespace "
             "or punctuation. "
-            "You can also use the experimental `debug-data` command to "
+            "You can also use the experimental `debug data` command to "
             "validate your JSON-formatted training data. For details, run:\n"
-            "python -m spacy debug-data --help")
+            "python -m spacy debug data --help")
     E025 = ("String is too long: {length} characters. Max is 2**30.")
     E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
             "length {length}.")
@@ -20,7 +20,7 @@ def create_docbin_reader(
 
 class Corpus:
     """Iterate Example objects from a file or directory of DocBin (.spacy)
-    formated data files.
+    formatted data files.
 
     path (Path): The directory or filename to read from.
     gold_preproc (bool): Whether to set up the Example object with gold-standard
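To make the `Corpus` reader documented above concrete, here is a minimal usage sketch. The directory path is hypothetical, and calling the reader with an `nlp` object (as the training loop does) is assumed from the surrounding v3 development code:

```python
import spacy
from spacy.gold import Corpus

# Read Example objects from a directory of DocBin (.spacy) files.
# "./corpus/train" is a placeholder path.
corpus = Corpus("./corpus/train", gold_preproc=False)
nlp = spacy.blank("en")
train_examples = list(corpus(nlp))  # the reader is called with an nlp object
```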
@@ -148,19 +148,133 @@ architectures into your training config.
 
 ## Text classification architectures {#textcat source="spacy/ml/models/textcat.py"}
 
+A text classification architecture needs to take a `Doc` as input, and produce a
+score for each potential label class. Textcat challenges can be binary (e.g.
+sentiment analysis) or involve multiple possible labels. Multi-label challenges
+can either have mutually exclusive labels (each example has exactly one label),
+or multiple labels may be applicable at the same time.
+
+As the properties of text classification problems can vary widely, we provide
+several different built-in architectures. It is recommended to experiment with
+different architectures and settings to determine what works best on your
+specific data and challenge.
+
 ### spacy.TextCatEnsemble.v1 {#TextCatEnsemble}
 
+Stacked ensemble of a bag-of-words model and a neural network model. The neural
+network has an internal CNN Tok2Vec layer and uses attention.
+
+> #### Example Config
+>
+> ```ini
+> [model]
+> @architectures = "spacy.TextCatEnsemble.v1"
+> exclusive_classes = false
+> pretrained_vectors = null
+> width = 64
+> embed_size = 2000
+> conv_depth = 2
+> window_size = 1
+> ngram_size = 1
+> dropout = null
+> nO = null
+> ```
+
+| Name | Type | Description |
+| -------------------- | ----- | ----------- |
+| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
+| `pretrained_vectors` | bool | Whether or not pretrained vectors will be used in addition to the feature vectors. |
+| `width` | int | Output dimension of the feature encoding step. |
+| `embed_size` | int | Input dimension of the feature encoding step. |
+| `conv_depth` | int | Depth of the Tok2Vec layer. |
+| `window_size` | int | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
+| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
+| `dropout` | float | The dropout rate. |
+| `nO` | int | Output dimension, determined by the number of different labels. |
+
+If the `nO` dimension is not set, the TextCategorizer component will set it when
+`begin_training` is called.
+
+### spacy.TextCatCNN.v1 {#TextCatCNN}
+
+> #### Example Config
+>
+> ```ini
+> [model]
+> @architectures = "spacy.TextCatCNN.v1"
+> exclusive_classes = false
+> nO = null
+>
+> [model.tok2vec]
+> @architectures = "spacy.HashEmbedCNN.v1"
+> pretrained_vectors = null
+> width = 96
+> depth = 4
+> embed_size = 2000
+> window_size = 1
+> maxout_pieces = 3
+> subword_features = true
+> dropout = null
+> ```
+
+A neural network model where token vectors are calculated using a CNN. The
+vectors are mean pooled and used as features in a feed-forward network. This
+architecture is usually less accurate than the ensemble, but runs faster.
+
+| Name | Type | Description |
+| ------------------- | ------------------------------------------ | ----------- |
+| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
+| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
+| `nO` | int | Output dimension, determined by the number of different labels. |
+
+If the `nO` dimension is not set, the TextCategorizer component will set it when
+`begin_training` is called.
+
 ### spacy.TextCatBOW.v1 {#TextCatBOW}
 
-### spacy.TextCatCNN.v1 {#TextCatCNN}
+An ngram "bag-of-words" model. This architecture should run much faster than the
+others, but may not be as accurate, especially if texts are short.
+
+> #### Example Config
+>
+> ```ini
+> [model]
+> @architectures = "spacy.TextCatBOW.v1"
+> exclusive_classes = false
+> ngram_size = 1
+> no_output_layer = false
+> nO = null
+> ```
+
+| Name | Type | Description |
+| ------------------- | ---- | ----------- |
+| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
+| `ngram_size` | int | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
+| `no_output_layer` | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
+| `nO` | int | Output dimension, determined by the number of different labels. |
+
+If the `nO` dimension is not set, the TextCategorizer component will set it when
+`begin_training` is called.
 
 ### spacy.TextCatLowData.v1 {#TextCatLowData}
 
 ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
 
+An `EntityLinker` component disambiguates textual mentions (tagged as named
+entities) to unique identifiers, grounding the named entities into the "real
+world". This requires 3 main components:
+
+- A [`KnowledgeBase`](/api/kb) (KB) holding the unique identifiers, potential
+  synonyms and prior probabilities.
+- A candidate generation step to produce a set of likely identifiers, given a
+  certain textual mention.
+- A machine learning [`Model`](https://thinc.ai/docs/api-model) that picks the
+  most plausible ID from the set of candidates.
+
 ### spacy.EntityLinker.v1 {#EntityLinker}
 
-<!-- TODO: intro -->
+The `EntityLinker` model architecture is a `Thinc` `Model` with a Linear output
+layer.
 
 > #### Example Config
 >
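As a quick illustration of how these architecture configs are used, here is a hedged sketch that plugs `spacy.TextCatBOW.v1` into a pipeline at runtime via `add_pipe` instead of a config file. The label names are illustrative:

```python
import spacy

config = {
    "model": {
        "@architectures": "spacy.TextCatBOW.v1",
        "exclusive_classes": False,
        "ngram_size": 1,
        "no_output_layer": False,
        "nO": None,
    }
}
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat", config=config)
# nO is inferred from the added labels once begin_training runs.
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
```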
@@ -170,10 +284,47 @@ architectures into your training config.
 > nO = null
 >
 > [model.tok2vec]
-> # ...
+> @architectures = "spacy.HashEmbedCNN.v1"
+> pretrained_vectors = null
+> width = 96
+> depth = 2
+> embed_size = 300
+> window_size = 1
+> maxout_pieces = 3
+> subword_features = true
+> dropout = null
+>
+> [kb_loader]
+> @assets = "spacy.EmptyKB.v1"
+> entity_vector_length = 64
+>
+> [get_candidates]
+> @assets = "spacy.CandidateGenerator.v1"
 > ```
 
-| Name | Type | Description |
-| --------- | ------------------------------------------ | ----------- |
-| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | |
-| `nO` | int | |
+| Name | Type | Description |
+| --------- | ------------------------------------------ | ----------- |
+| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
+| `nO` | int | Output dimension, determined by the length of the vectors encoding each entity in the KB. |
+
+If the `nO` dimension is not set, the `EntityLinker` component will set it when
+`begin_training` is called.
+
+### spacy.EmptyKB.v1 {#EmptyKB}
+
+A function that creates a default, empty `KnowledgeBase` from a
+[`Vocab`](/api/vocab) instance.
+
+| Name | Type | Description |
+| ---------------------- | ---- | ----------- |
+| `entity_vector_length` | int | The length of the vectors encoding each entity in the KB, 64 by default. |
+
+### spacy.CandidateGenerator.v1 {#CandidateGenerator}
+
+A function that takes as input a [`KnowledgeBase`](/api/kb) and a
+[`Span`](/api/span) object denoting a named entity, and returns a list of
+plausible [`Candidate` objects](/api/kb/#candidate_init).
+
+The default `CandidateGenerator` simply uses the text of a mention to find its
+potential aliases in the knowledge base. Note that this function is
+case-dependent.
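Roughly what `spacy.EmptyKB.v1` is described as doing, sketched against the public `KnowledgeBase` API. This is an illustration, not the registered function's actual implementation:

```python
from spacy.vocab import Vocab
from spacy.kb import KnowledgeBase

# An empty KB with 64-dimensional entity vectors, matching the
# entity_vector_length default from the example config above.
vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
print(kb.get_size_entities())  # 0, nothing has been added yet
```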
@@ -132,7 +132,7 @@ $ python -m spacy init config [output] [--base] [--lang] [--model] [--pipeline]
 | `--base`, `-b` | option | Optional base config file to auto-fill with defaults. |
 | `--lang`, `-l` | option | Optional language code to use for blank config. If a `--pipeline` is specified, the components will be added in order. |
 | `--model`, `-m` | option | Optional base model to copy config from. If a `--pipeline` is specified, only those components will be kept, and all other components not in the model will be added. |
-| `--pipeline`, `-p` | option | Optional comma-separate pipeline of components to add to blank language or model. |
+| `--pipeline`, `-p` | option | Optional comma-separated pipeline of components to add to blank language or model. |
 | **CREATES** | config | Complete and auto-filled config file for training. |
 
 ### init model {#init-model new="2"}
@@ -271,7 +271,7 @@ low data labels and more.
 
 <Infobox title="New in v3.0" variant="warning">
 
-The `debug-data` command is now available as a subcommand of `spacy debug`. It
+The `debug data` command is now available as a subcommand of `spacy debug`. It
 takes the same arguments as `train` and reads settings off the
 [`config.cfg` file](/usage/training#config) and optional
 [overrides](/usage/training#config-overrides) on the CLI.
@@ -174,12 +174,32 @@ run [`spacy pretrain`](/api/cli#pretrain).
 
 ### Binary training format {#binary-training new="3"}
 
-The built-in [`convert`](/api/cli#convert) command helps you convert the
-`.conllu` format used by the
-[Universal Dependencies corpora](https://github.com/UniversalDependencies) as
-well as spaCy's previous [JSON format](#json-input).
+> #### Example
+>
+> ```python
+> from pathlib import Path
+> from spacy.tokens import DocBin
+> from spacy.gold import Corpus
+> output_file = Path(dir) / "output.spacy"
+> data = DocBin(docs=docs).to_bytes()
+> with output_file.open("wb") as file_:
+>     file_.write(data)
+> reader = Corpus(output_file)
+> ```
 
-<!-- TODO: document DocBin format -->
+The main data format used in spaCy v3 is a binary format created by serializing
+a [`DocBin`](/api/docbin) object, which represents a collection of `Doc`
+objects. Typically, the extension for these binary files is `.spacy`, and they
+are used as input format for specifying a [training corpus](/api/corpus) and for
+spaCy's CLI [`train`](/api/cli#train) command.
+
+This binary format is extremely efficient in storage, especially when packing
+multiple documents together.
+
+The built-in [`convert`](/api/cli#convert) command helps you convert spaCy's
+previous [JSON format](#json-input) to this new `DocBin` format. It also
+supports conversion of the `.conllu` format used by the
+[Universal Dependencies corpora](https://github.com/UniversalDependencies).
 
 ### JSON training format {#json-input tag="deprecated"}
 
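The quoted example assumes `docs` and `dir` are already defined. A self-contained variant of the same idea, with a hypothetical file name:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = [nlp("I like pizza."), nlp("Berlin is a nice city.")]
# Serialize the collection of Doc objects to the binary .spacy format.
data = DocBin(docs=docs).to_bytes()
with open("train.spacy", "wb") as file_:
    file_.write(data)
```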
@@ -187,7 +207,7 @@ well as spaCy's previous [JSON format](#json-input).
 
 As of v3.0, the JSON input format is deprecated and is replaced by the
 [binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
-objects to JSON, you can now now serialize them directly using the
+objects to JSON, you can now serialize them directly using the
 [`DocBin`](/api/docbin) container and then use them as input data.
 
 [`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`
@@ -9,6 +9,13 @@ api_string_name: entity_linker
 api_trainable: true
 ---
 
+An `EntityLinker` component disambiguates textual mentions (tagged as named
+entities) to unique identifiers, grounding the named entities into the "real
+world". It requires a `KnowledgeBase`, as well as a function to generate
+plausible candidates from that `KnowledgeBase` given a certain textual mention,
+and an ML model to pick the right candidate, given the local context of the
+mention.
+
 ## Config and implementation {#config}
 
 The default config is defined by the pipeline component factory and describes
@@ -23,22 +30,24 @@ architectures and their arguments and hyperparameters.
 > ```python
 > from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
 > config = {
->     "kb": None,
 >     "labels_discard": [],
 >     "incl_prior": True,
 >     "incl_context": True,
 >     "model": DEFAULT_NEL_MODEL,
+>     "kb_loader": {'@assets': 'spacy.EmptyKB.v1', 'entity_vector_length': 64},
+>     "get_candidates": {'@assets': 'spacy.CandidateGenerator.v1'},
 > }
 > nlp.add_pipe("entity_linker", config=config)
 > ```
 
-| Setting | Type | Description | Default |
-| ---------------- | ------------------------------------------ | ----------- | ------- |
-| `kb` | `KnowledgeBase` | The [`KnowledgeBase`](/api/kb) holding all entities and their aliases. | `None` |
-| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. | `[]` |
-| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. | `True` |
-| `incl_context` | bool | Whether or not to include the local context in the model. | `True` |
-| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [EntityLinker](/api/architectures#EntityLinker) |
+| Setting | Type | Description | Default |
+| ---------------- | -------------------------------------------------------- | ----------- | ------- |
+| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. | `[]` |
+| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. | `True` |
+| `incl_context` | bool | Whether or not to include the local context in the model. | `True` |
+| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [EntityLinker](/api/architectures#EntityLinker) |
+| `kb_loader` | `Callable[[Vocab], KnowledgeBase]` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. | An empty `KnowledgeBase` with `entity_vector_length` 64. |
+| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. | Built-in dictionary-lookup function. |
 
 ```python
 https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
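A hedged sketch of what a custom `kb_loader` might look like: a function registered under `@assets` that returns a `Callable[[Vocab], KnowledgeBase]`, mirroring the documented `spacy.EmptyKB.v1` default. The registry name is illustrative, and the registration API is assumed from the `@assets` references above:

```python
from spacy.util import registry
from spacy.kb import KnowledgeBase

@registry.assets.register("my_kb.v1")
def configure_custom_kb(entity_vector_length: int):
    def create_kb(vocab):
        kb = KnowledgeBase(vocab=vocab, entity_vector_length=entity_vector_length)
        # Entities and aliases would be added (or loaded from disk) here.
        return kb
    return create_kb
```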
@@ -53,7 +62,11 @@ https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
 > entity_linker = nlp.add_pipe("entity_linker")
 >
 > # Construction via add_pipe with custom model
-> config = {"model": {"@architectures": "my_el"}}
+> config = {"model": {"@architectures": "my_el.v1"}}
 > entity_linker = nlp.add_pipe("entity_linker", config=config)
 >
+> # Construction via add_pipe with custom KB and candidate generation
+> config = {"kb_loader": {"@assets": "my_kb.v1"}, "get_candidates": {"@assets": "my_candidates.v1"}}
+> entity_linker = nlp.add_pipe("entity_linker", config=config)
+>
 > # Construction from class
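A matching hedged sketch for the `"my_candidates.v1"` reference in the construction example above, following the documented `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` signature. The case-insensitive lookup is the illustrative twist, and `kb.get_candidates` is assumed to be the alias-lookup method used by the built-in generator:

```python
from spacy.util import registry

@registry.assets.register("my_candidates.v1")
def configure_candidates():
    def get_candidates(kb, span):
        # Unlike the default case-dependent lookup, match aliases
        # case-insensitively.
        return kb.get_candidates(span.text.lower())
    return get_candidates
```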
@@ -65,18 +78,20 @@ Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).
 
-<!-- TODO: finish API docs -->
+Note that both the internal KB and the candidate generator can be
+customized by providing custom registered functions.
 
-| Name | Type | Description |
-| ---------------- | --------------- | ----------- |
-| `vocab` | `Vocab` | The shared vocabulary. |
-| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
-| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
-| _keyword-only_ | | |
-| `kb` | `KnowlegeBase` | The [`KnowledgeBase`](/api/kb) holding all entities and their aliases. |
-| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. |
-| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. |
-| `incl_context` | bool | Whether or not to include the local context in the model. |
+| Name | Type | Description |
+| ---------------- | -------------------------------------------------------- | ----------- |
+| `vocab` | `Vocab` | The shared vocabulary. |
+| `model` | `Model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
+| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
+| _keyword-only_ | | |
+| `kb_loader` | `Callable[[Vocab], KnowledgeBase]` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. |
+| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. |
+| `labels_discard` | `Iterable[str]` | NER labels that will automatically get a "NIL" prediction. |
+| `incl_prior` | bool | Whether or not to include prior probabilities from the KB in the model. |
+| `incl_context` | bool | Whether or not to include the local context in the model. |
 
 ## EntityLinker.\_\_call\_\_ {#call tag="method"}
 
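Putting the parameters above together, a sketch that wires the default KB loader and candidate generator in explicitly through `add_pipe`, equivalent to the defaults from the config table earlier:

```python
import spacy

nlp = spacy.blank("en")
config = {
    "kb_loader": {"@assets": "spacy.EmptyKB.v1", "entity_vector_length": 64},
    "get_candidates": {"@assets": "spacy.CandidateGenerator.v1"},
}
entity_linker = nlp.add_pipe("entity_linker", config=config)
```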
@@ -380,8 +380,9 @@ table instead of only returning the structured data.
 
 > #### ✏️ Things to try
 >
-> 1. Add the components `"ner"` and `"sentencizer"` _before_ the entity linker.
->    The analysis should now show no problems, because requirements are met.
+> 1. Add the components `"ner"` and `"sentencizer"` _before_ the
+>    `"entity_linker"`. The analysis should now show no problems, because
+>    requirements are met.
 
 ```python
 ### {executable="true"}
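The "things to try" above can be reproduced with a sketch along these lines, assuming the v3 `nlp.analyze_pipes` API this page documents:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("ner")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_linker")
# With "ner" and "sentencizer" before "entity_linker", the analysis
# should report no unmet requirements.
nlp.analyze_pipes(pretty=True)
```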
@@ -122,7 +122,7 @@ related to more general machine learning functionality.
 | **Lemmatization** | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
 | **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences. |
 | **Named Entity Recognition** (NER) | Labelling named "real-world" objects, like persons, companies or locations. |
-| **Entity Linking** (EL) | Disambiguating textual entities to unique identifiers in a Knowledge Base. |
+| **Entity Linking** (EL) | Disambiguating textual entities to unique identifiers in a knowledge base. |
 | **Similarity** | Comparing words, text spans and documents and how similar they are to each other. |
 | **Text Classification** | Assigning categories or labels to a whole document, or parts of a document. |
 | **Rule-based Matching** | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
@@ -379,7 +379,7 @@ spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This
 will give you the object and its encoded annotations, plus the "key" to decode
 it.
 
-## Knowledge Base {#kb}
+## Knowledge base {#kb}
 
 To support the entity linking task, spaCy stores external knowledge in a
 [`KnowledgeBase`](/api/kb). The knowledge base (KB) uses the `Vocab` to store
@@ -426,7 +426,7 @@ print("Number of aliases in KB:", kb.get_size_aliases())  # 2
 
 ### Candidate generation
 
-Given a textual entity, the Knowledge Base can provide a list of plausible
+Given a textual entity, the knowledge base can provide a list of plausible
 candidates or entity identifiers. The [`EntityLinker`](/api/entitylinker) will
 take this list of candidates as input, and disambiguate the mention to the most
 probable identifier, given the document context.
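Continuing the knowledge-base example this hunk belongs to (note `kb.get_size_aliases()` in its context line), a sketch of candidate generation. The entity ID, frequency and vectors are illustrative, and `kb.get_candidates` taking the alias string is assumed from the surrounding docs:

```python
from spacy.vocab import Vocab
from spacy.kb import KnowledgeBase

vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_alias(alias="Douglas", entities=["Q1004791"], probabilities=[1.0])

# The KB proposes plausible candidates; the EntityLinker then
# disambiguates the mention against the document context.
candidates = kb.get_candidates("Douglas")
print([candidate.entity_ for candidate in candidates])
```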