mirror of https://github.com/explosion/spaCy.git
synced 2025-07-04 20:03:13 +03:00
Merge pull request #5890 from svlandeg/feature/el-docs
This commit is contained in commit 21c9ea5bd7
@@ -169,9 +169,9 @@ class Errors:
            "training a named entity recognizer, also make sure that none of "
            "your annotated entity spans have leading or trailing whitespace "
            "or punctuation. "
            "You can also use the experimental `debug data` command to "
            "validate your JSON-formatted training data. For details, run:\n"
            "python -m spacy debug data --help")
    E025 = ("String is too long: {length} characters. Max is 2**30.")
    E026 = ("Error accessing token at position {i}: out of bounds in Doc of "
            "length {length}.")
@@ -20,7 +20,7 @@ def create_docbin_reader(

class Corpus:
    """Iterate Example objects from a file or directory of DocBin (.spacy)
    formatted data files.

    path (Path): The directory or filename to read from.
    gold_preproc (bool): Whether to set up the Example object with gold-standard
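In a training config, such a corpus is typically wired up through a registered reader function rather than constructed directly. A minimal, hedged sketch — the reader name `spacy.Corpus.v1` and the `corpus/train.spacy` path are illustrative assumptions, not verified against any particular version:

```ini
[training.train_corpus]
@readers = "spacy.Corpus.v1"
path = "corpus/train.spacy"
gold_preproc = false
```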
@@ -148,19 +148,133 @@ architectures into your training config.

## Text classification architectures {#textcat source="spacy/ml/models/textcat.py"}

A text classification architecture needs to take a `Doc` as input, and produce a
score for each potential label class. Textcat challenges can be binary (e.g.
sentiment analysis) or involve multiple possible labels. Multi-class challenges
can either have mutually exclusive labels (each example has exactly one label),
or multiple labels may be applicable at the same time.

As the properties of text classification problems can vary widely, we provide
several different built-in architectures. It is recommended to experiment with
different architectures and settings to determine what works best on your
specific data and challenge.

### spacy.TextCatEnsemble.v1 {#TextCatEnsemble}

Stacked ensemble of a bag-of-words model and a neural network model. The neural
network has an internal CNN Tok2Vec layer and uses attention.

> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatEnsemble.v1"
> exclusive_classes = false
> pretrained_vectors = null
> width = 64
> embed_size = 2000
> conv_depth = 2
> window_size = 1
> ngram_size = 1
> dropout = null
> nO = null
> ```

| Name                 | Type  | Description |
| -------------------- | ----- | ----------- |
| `exclusive_classes`  | bool  | Whether or not categories are mutually exclusive. |
| `pretrained_vectors` | bool  | Whether or not pretrained vectors will be used in addition to the feature vectors. |
| `width`              | int   | Output dimension of the feature encoding step. |
| `embed_size`         | int   | Input dimension of the feature encoding step. |
| `conv_depth`         | int   | Depth of the Tok2Vec layer. |
| `window_size`        | int   | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. |
| `ngram_size`         | int   | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `dropout`            | float | The dropout rate. |
| `nO`                 | int   | Output dimension, determined by the number of different labels. |

If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.

### spacy.TextCatCNN.v1 {#TextCatCNN}

> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatCNN.v1"
> exclusive_classes = false
> nO = null
>
> [model.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 4
> embed_size = 2000
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> dropout = null
> ```

A neural network model where token vectors are calculated using a CNN. The
vectors are mean pooled and used as features in a feed-forward network. This
architecture is usually less accurate than the ensemble, but runs faster.

| Name                | Type                                       | Description |
| ------------------- | ------------------------------------------ | ----------- |
| `exclusive_classes` | bool                                       | Whether or not categories are mutually exclusive. |
| `tok2vec`           | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO`                | int                                        | Output dimension, determined by the number of different labels. |

If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.

### spacy.TextCatBOW.v1 {#TextCatBOW}

An ngram "bag-of-words" model. This architecture should run much faster than the
others, but may not be as accurate, especially if texts are short.

> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy.TextCatBOW.v1"
> exclusive_classes = false
> ngram_size = 1
> no_output_layer = false
> nO = null
> ```

| Name                | Type | Description |
| ------------------- | ---- | ----------- |
| `exclusive_classes` | bool | Whether or not categories are mutually exclusive. |
| `ngram_size`        | int  | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. |
| `no_output_layer`   | bool | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes=True`, else `Logistic`). |
| `nO`                | int  | Output dimension, determined by the number of different labels. |

If the `nO` dimension is not set, the TextCategorizer component will set it when
`begin_training` is called.

### spacy.TextCatLowData.v1 {#TextCatLowData}

## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}

An `EntityLinker` component disambiguates textual mentions (tagged as named
entities) to unique identifiers, grounding the named entities into the "real
world". This requires 3 main components:

- A [`KnowledgeBase`](/api/kb) (KB) holding the unique identifiers, potential
  synonyms and prior probabilities.
- A candidate generation step to produce a set of likely identifiers, given a
  certain textual mention.
- A machine learning [`Model`](https://thinc.ai/docs/api-model) that picks the
  most plausible ID from the set of candidates.

### spacy.EntityLinker.v1 {#EntityLinker}

The `EntityLinker` model architecture is a `Thinc` `Model` with a Linear output
layer.

> #### Example Config
>
@@ -170,10 +284,47 @@ architectures into your training config.
> nO = null
>
> [model.tok2vec]
> @architectures = "spacy.HashEmbedCNN.v1"
> pretrained_vectors = null
> width = 96
> depth = 2
> embed_size = 300
> window_size = 1
> maxout_pieces = 3
> subword_features = true
> dropout = null
>
> [kb_loader]
> @assets = "spacy.EmptyKB.v1"
> entity_vector_length = 64
>
> [get_candidates]
> @assets = "spacy.CandidateGenerator.v1"
> ```

| Name      | Type                                       | Description |
| --------- | ------------------------------------------ | ----------- |
| `tok2vec` | [`Model`](https://thinc.ai/docs/api-model) | The [`tok2vec`](#tok2vec) layer of the model. |
| `nO`      | int                                        | Output dimension, determined by the length of the vectors encoding each entity in the KB. |

If the `nO` dimension is not set, the Entity Linking component will set it when
`begin_training` is called.

### spacy.EmptyKB.v1 {#EmptyKB}

A function that creates a default, empty `KnowledgeBase` from a
[`Vocab`](/api/vocab) instance.

| Name                   | Type | Description |
| ---------------------- | ---- | ----------- |
| `entity_vector_length` | int  | The length of the vectors encoding each entity in the KB - 64 by default. |

### spacy.CandidateGenerator.v1 {#CandidateGenerator}

A function that takes as input a [`KnowledgeBase`](/api/kb) and a
[`Span`](/api/span) object denoting a named entity, and returns a list of
plausible [`Candidate` objects](/api/kb/#candidate_init).

The default `CandidateGenerator` simply uses the text of a mention to find its
potential aliases in the `KnowledgeBase`. Note that this function is
case-dependent.

@@ -132,7 +132,7 @@ $ python -m spacy init config [output] [--base] [--lang] [--model] [--pipeline]
| `--base`, `-b`     | option | Optional base config file to auto-fill with defaults. |
| `--lang`, `-l`     | option | Optional language code to use for blank config. If a `--pipeline` is specified, the components will be added in order. |
| `--model`, `-m`    | option | Optional base model to copy config from. If a `--pipeline` is specified, only those components will be kept, and all other components not in the model will be added. |
| `--pipeline`, `-p` | option | Optional comma-separated pipeline of components to add to blank language or model. |
| **CREATES**        | config | Complete and auto-filled config file for training. |

### init model {#init-model new="2"}

@@ -271,7 +271,7 @@ low data labels and more.

<Infobox title="New in v3.0" variant="warning">

The `debug data` command is now available as a subcommand of `spacy debug`. It
takes the same arguments as `train` and reads settings off the
[`config.cfg` file](/usage/training#config) and optional
[overrides](/usage/training#config-overrides) on the CLI.

@@ -174,12 +174,32 @@ run [`spacy pretrain`](/api/cli#pretrain).

### Binary training format {#binary-training new="3"}

> #### Example
>
> ```python
> from pathlib import Path
> from spacy.tokens import DocBin
> from spacy.gold import Corpus
>
> output_file = Path(dir) / "output.spacy"
> data = DocBin(docs=docs).to_bytes()
> with output_file.open("wb") as file_:
>     file_.write(data)
> reader = Corpus(output_file)
> ```

The main data format used in spaCy v3 is a binary format created by serializing
a [`DocBin`](/api/docbin) object, which represents a collection of `Doc`
objects. Typically, the extension for these binary files is `.spacy`, and they
are used as input format for specifying a [training corpus](/api/corpus) and for
spaCy's CLI [`train`](/api/cli#train) command.

This binary format is extremely efficient in storage, especially when packing
multiple documents together.

The built-in [`convert`](/api/cli#convert) command helps you convert spaCy's
previous [JSON format](#json-input) to this new `DocBin` format. It also
supports conversion of the `.conllu` format used by the
[Universal Dependencies corpora](https://github.com/UniversalDependencies).

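Reading a serialized collection back is the mirror image of the example above. A short sketch of an in-memory round trip (the blank pipeline and example texts are made up for illustration):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # a blank pipeline; we only need its vocab
docs = [nlp("Berlin is a city"), nlp("ACME is a company")]

# Serialize the collection of Doc objects to the binary format ...
data = DocBin(docs=docs).to_bytes()

# ... and recover the Doc objects, sharing the same vocab
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```

Writing `data` to a file with the `.spacy` extension gives you exactly the kind of corpus file the `train` command expects.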
### JSON training format {#json-input tag="deprecated"}

@@ -187,7 +207,7 @@ well as spaCy's previous [JSON format](#json-input).

As of v3.0, the JSON input format is deprecated and is replaced by the
[binary format](#binary-training). Instead of converting [`Doc`](/api/doc)
objects to JSON, you can now serialize them directly using the
[`DocBin`](/api/docbin) container and then use them as input data.

[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy`

@@ -9,6 +9,13 @@ api_string_name: entity_linker
api_trainable: true
---

An `EntityLinker` component disambiguates textual mentions (tagged as named
entities) to unique identifiers, grounding the named entities into the "real
world". It requires a `KnowledgeBase`, as well as a function to generate
plausible candidates from that `KnowledgeBase` given a certain textual mention,
and an ML model to pick the right candidate, given the local context of the
mention.

## Config and implementation {#config}

The default config is defined by the pipeline component factory and describes

@@ -23,22 +30,24 @@ architectures and their arguments and hyperparameters.
> ```python
> from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL
> config = {
>    "labels_discard": [],
>    "incl_prior": True,
>    "incl_context": True,
>    "model": DEFAULT_NEL_MODEL,
>    "kb_loader": {'@assets': 'spacy.EmptyKB.v1', 'entity_vector_length': 64},
>    "get_candidates": {'@assets': 'spacy.CandidateGenerator.v1'},
> }
> nlp.add_pipe("entity_linker", config=config)
> ```

| Setting          | Type                                                     | Description | Default |
| ---------------- | -------------------------------------------------------- | ----------- | ------- |
| `labels_discard` | `Iterable[str]`                                          | NER labels that will automatically get a "NIL" prediction. | `[]` |
| `incl_prior`     | bool                                                     | Whether or not to include prior probabilities from the KB in the model. | `True` |
| `incl_context`   | bool                                                     | Whether or not to include the local context in the model. | `True` |
| `model`          | [`Model`](https://thinc.ai/docs/api-model)               | The model to use. | [EntityLinker](/api/architectures#EntityLinker) |
| `kb_loader`      | `Callable[[Vocab], KnowledgeBase]`                       | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. | An empty KnowledgeBase with `entity_vector_length` 64. |
| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. | Built-in dictionary-lookup function. |

```python
https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py

@@ -53,7 +62,11 @@ https://github.com/explosion/spaCy/blob/develop/spacy/pipeline/entity_linker.py
> entity_linker = nlp.add_pipe("entity_linker")
>
> # Construction via add_pipe with custom model
> config = {"model": {"@architectures": "my_el.v1"}}
> entity_linker = nlp.add_pipe("entity_linker", config=config)
>
> # Construction via add_pipe with custom KB and candidate generation
> config = {"kb_loader": {"@assets": "my_kb.v1"}, "get_candidates": {"@assets": "my_candidates.v1"}}
> entity_linker = nlp.add_pipe("entity_linker", config=config)
>
> # Construction from class

@@ -65,15 +78,17 @@ Create a new pipeline instance. In your application, you would normally use a
shortcut for this and instantiate the component using its string name and
[`nlp.add_pipe`](/api/language#add_pipe).

Note that both the internal KB as well as the candidate generator can be
customized by providing custom registered functions.

| Name             | Type                                                     | Description |
| ---------------- | -------------------------------------------------------- | ----------- |
| `vocab`          | `Vocab`                                                  | The shared vocabulary. |
| `model`          | `Model`                                                  | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
| `name`           | str                                                      | String name of the component instance. Used to add entries to the `losses` during training. |
| _keyword-only_   |                                                          | |
| `kb_loader`      | `Callable[[Vocab], KnowledgeBase]`                       | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. |
| `get_candidates` | `Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]` | Function that generates plausible candidates for a given `Span` object. |
| `labels_discard` | `Iterable[str]`                                          | NER labels that will automatically get a "NIL" prediction. |
| `incl_prior`     | bool                                                     | Whether or not to include prior probabilities from the KB in the model. |
| `incl_context`   | bool                                                     | Whether or not to include the local context in the model. |

@@ -380,8 +380,9 @@ table instead of only returning the structured data.

> #### ✏️ Things to try
>
> 1. Add the components `"ner"` and `"sentencizer"` _before_ the
>    `"entity_linker"`. The analysis should now show no problems, because
>    requirements are met.

```python
### {executable="true"}

@@ -122,7 +122,7 @@ related to more general machine learning functionality.
| **Lemmatization**                     | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
| **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences. |
| **Named Entity Recognition** (NER)    | Labelling named "real-world" objects, like persons, companies or locations. |
| **Entity Linking** (EL)               | Disambiguating textual entities to unique identifiers in a knowledge base. |
| **Similarity**                        | Comparing words, text spans and documents and how similar they are to each other. |
| **Text Classification**               | Assigning categories or labels to a whole document, or parts of a document. |
| **Rule-based Matching**               | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |

@@ -379,7 +379,7 @@ spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This
will give you the object and its encoded annotations, plus the "key" to decode
it.

## Knowledge base {#kb}

To support the entity linking task, spaCy stores external knowledge in a
[`KnowledgeBase`](/api/kb). The knowledge base (KB) uses the `Vocab` to store

@@ -426,7 +426,7 @@ print("Number of aliases in KB:", kb.get_size_aliases()) # 2

### Candidate generation

Given a textual entity, the knowledge base can provide a list of plausible
candidates or entity identifiers. The [`EntityLinker`](/api/entitylinker) will
take this list of candidates as input, and disambiguate the mention to the most
probable identifier, given the document context.