Add SpanCategorizer component (#6747)

* Draft spancat model
* Add spancat model
* Add test for extract_spans
* Add extract_spans layer
* Upd extract_spans
* Add spancat model
* Add test for spancat model
* Upd spancat model
* Update spancat component
* Upd spancat
* Update spancat model
* Add quick spancat test
* Import SpanCategorizer
* Fix SpanCategorizer component
* Import SpanGroup
* Fix span extraction
* Fix import
* Fix import
* Upd model
* Update spancat models
* Add scoring, update defaults
* Update and add docs
* Fix type
* Update spacy/ml/extract_spans.py
* Auto-format and fix import
* Fix comment
* Fix type
* Fix type
* Update website/docs/api/spancategorizer.md
* Fix comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Better defense

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix labels list

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/extract_spans.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Set annotations during update
* Set annotations in spancat
* fix imports in test
* Update spacy/pipeline/spancat.py
* replace MaxoutLogistic with LinearLogistic
* fix config
* various small fixes
* remove set_annotations parameter in update
* use our beloved tupley format with recent support for doc.spans
* bugfix to allow renaming the default span_key (scores weren't showing up)
* use different key in docs example
* change defaults to better-working parameters from project (WIP)
* register spacy.extract_spans.v1 for legacy purposes
* Upd dev version so can build wheel
* layers instead of architectures for smaller building blocks
* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Include additional scores from overrides in combined score weights
* Parameterize spans key in scoring

Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so that it's possible to evaluate multiple `spancat` components in the same pipeline.

* Use the (intentionally very short) default spans key `sc` in the `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust the key in the `getter` instead.

Note that for `spancat` components with a custom `span_key`, the score weights currently need to be modified manually in `[training.score_weights]` for them to be available during training.

To suppress the default score weights `spans_sc_p/r/f` during training, set them to `null` in `[training.score_weights]`.

* Update website/docs/api/scorer.md
* Fix scorer for spans key containing underscore
* Increment version
* Add Spans to Evaluate CLI (#8439)
* Add Spans to Evaluate CLI
* Change to spans_key
* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix spancat GPU issues (#8455)
* Fix GPU issues
* Require thinc >=8.0.6
* Switch to glorot_uniform_init
* Fix and test ngram suggester
* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
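
The commit message states that span scores are returned under the prefix `spans_{spans_key}` and that a custom key needs its own entries in `[training.score_weights]`. The following is a minimal sketch of how those key names line up. The helpers `span_score_keys` and `score_weights_for` are hypothetical and not part of spaCy; the weights 1.0/0.0/0.0 simply mirror the component's default score weights.

```python
from typing import Dict, Optional


def span_score_keys(spans_key: str = "sc") -> Dict[str, str]:
    """Hypothetical helper: names of the P/R/F scores for a given spans key."""
    prefix = f"spans_{spans_key}"
    return {"p": f"{prefix}_p", "r": f"{prefix}_r", "f": f"{prefix}_f"}


def score_weights_for(spans_key: str) -> Dict[str, Optional[float]]:
    """Hypothetical helper: weights one might place in [training.score_weights].

    For a custom key, the default `spans_sc_*` entries are set to None,
    which corresponds to `null` in the config and suppresses them.
    """
    keys = span_score_keys(spans_key)
    weights: Dict[str, Optional[float]] = {
        keys["f"]: 1.0,
        keys["p"]: 0.0,
        keys["r"]: 0.0,
    }
    if spans_key != "sc":
        for name in span_score_keys("sc").values():
            weights[name] = None
    return weights


if __name__ == "__main__":
    for name, weight in score_weights_for("labeled_spans").items():
        print(name, weight)
```

Because the key is embedded in the score name, two `spancat` components with different keys report their scores side by side without clashing.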
2025-08-17 18:44:56 +03:00 · 2021-06-24 12:35:27 +02:00 · 2021-06-24 12:35:27 +02:00 · f9946154d9
commit f9946154d9
parent 172dfec4f2
17 changed files with 1257 additions and 6 deletions
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,7 +5,7 @@ requires = [
    "cymem>=2.0.2,<2.1.0",
    "preshed>=3.0.2,<3.1.0",
    "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=8.0.5,<8.1.0",
+    "thinc>=8.0.6,<8.1.0",
    "blis>=0.4.0,<0.8.0",
    "pathy",
    "numpy>=1.15.0",
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,7 +2,7 @@
 spacy-legacy>=3.0.6,<3.1.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=8.0.5,<8.1.0
+thinc>=8.0.6,<8.1.0
 blis>=0.4.0,<0.8.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0
--- a/setup.cfg
+++ b/setup.cfg
@@ -37,14 +37,14 @@ setup_requires =
    cymem>=2.0.2,<2.1.0
    preshed>=3.0.2,<3.1.0
    murmurhash>=0.28.0,<1.1.0
-    thinc>=8.0.5,<8.1.0
+    thinc>=8.0.6,<8.1.0
 install_requires =
    # Our libraries
    spacy-legacy>=3.0.6,<3.1.0
    murmurhash>=0.28.0,<1.1.0
    cymem>=2.0.2,<2.1.0
    preshed>=3.0.2,<3.1.0
-    thinc>=8.0.5,<8.1.0
+    thinc>=8.0.6,<8.1.0
    blis>=0.4.0,<0.8.0
    wasabi>=0.8.1,<1.1.0
    srsly>=2.4.1,<3.0.0
--- a/spacy/cli/evaluate.py
+++ b/spacy/cli/evaluate.py
@@ -60,6 +60,7 @@ def evaluate(
    displacy_path: Optional[Path] = None,
    displacy_limit: int = 25,
    silent: bool = True,
    spans_key="sc",
 ) -> Scorer:
    msg = Printer(no_print=silent, pretty=not silent)
    fix_random_seed()
@@ -90,6 +91,9 @@ def evaluate(
        "SENT P": "sents_p",
        "SENT R": "sents_r",
        "SENT F": "sents_f",
        "SPAN P": f"spans_{spans_key}_p",
        "SPAN R": f"spans_{spans_key}_r",
        "SPAN F": f"spans_{spans_key}_f",
        "SPEED": "speed",
    }
    results = {}
@@ -121,6 +125,10 @@ def evaluate(
        if scores["ents_per_type"]:
            print_prf_per_type(msg, scores["ents_per_type"], "NER", "type")
            data["ents_per_type"] = scores["ents_per_type"]
    if f"spans_{spans_key}_per_type" in scores:
        if scores[f"spans_{spans_key}_per_type"]:
            print_prf_per_type(msg, scores[f"spans_{spans_key}_per_type"], "SPANS", "type")
            data[f"spans_{spans_key}_per_type"] = scores[f"spans_{spans_key}_per_type"]
    if "cats_f_per_type" in scores:
        if scores["cats_f_per_type"]:
            print_prf_per_type(msg, scores["cats_f_per_type"], "Textcat F", "label")
--- a/spacy/ml/extract_spans.py
+++ b/spacy/ml/extract_spans.py
@@ -0,0 +1,60 @@
 from typing import Tuple, Callable
 from thinc.api import Model, to_numpy
 from thinc.types import Ragged, Ints1d
 from ..util import registry
@registry.layers("spacy.extract_spans.v1")
 def extract_spans() -> Model[Tuple[Ragged, Ragged], Ragged]:
    """Extract spans from a sequence of source arrays, as specified by an array
    of (start, end) indices. The output is a ragged array of the
    extracted spans.
    """
    return Model(
        "extract_spans", forward, layers=[], refs={}, attrs={}, dims={}, init=init
    )
 def init(model, X=None, Y=None):
    pass
 def forward(
    model: Model, source_spans: Tuple[Ragged, Ragged], is_train: bool
 ) -> Tuple[Ragged, Callable]:
    """Get subsequences from source vectors."""
    ops = model.ops
    X, spans = source_spans
    assert spans.dataXd.ndim == 2
    indices = _get_span_indices(ops, spans, X.lengths)
    Y = Ragged(X.dataXd[indices], spans.dataXd[:, 1] - spans.dataXd[:, 0])
    x_shape = X.dataXd.shape
    x_lengths = X.lengths
    def backprop_windows(dY: Ragged) -> Tuple[Ragged, Ragged]:
        dX = Ragged(ops.alloc2f(*x_shape), x_lengths)
        ops.scatter_add(dX.dataXd, indices, dY.dataXd)
        return (dX, spans)
    return Y, backprop_windows
 def _get_span_indices(ops, spans: Ragged, lengths: Ints1d) -> Ints1d:
    """Construct a flat array that has the indices we want to extract from the
    source data. For instance, if we want the spans (5, 9), (8, 10) the
    indices will be [5, 6, 7, 8, 8, 9].
    """
    spans, lengths = _ensure_cpu(spans, lengths)
    indices = []
    offset = 0
    for i, length in enumerate(lengths):
        spans_i = spans[i].dataXd + offset
        for j in range(spans_i.shape[0]):
            indices.append(ops.xp.arange(spans_i[j, 0], spans_i[j, 1]))
        offset += length
    return ops.flatten(indices)
 def _ensure_cpu(spans: Ragged, lengths: Ints1d) -> Tuple[Ragged, Ints1d]:
    return (Ragged(to_numpy(spans.dataXd), to_numpy(spans.lengths)), to_numpy(lengths))
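
The offset logic in `_get_span_indices` can be illustrated without Thinc. This is a pure-Python sketch under the assumption that spans are given per document as `(start, end)` pairs and that token rows of all documents are concatenated in order; `span_indices` is a hypothetical name, not a function in this commit.

```python
from typing import List, Sequence, Tuple


def span_indices(
    spans_per_doc: Sequence[Sequence[Tuple[int, int]]],
    lengths: Sequence[int],
) -> List[int]:
    """Flatten per-document (start, end) spans into row indices of the
    concatenated token array. `lengths` holds the token count of each doc."""
    indices: List[int] = []
    offset = 0
    for spans, length in zip(spans_per_doc, lengths):
        for start, end in spans:
            # Shift document-local offsets by the rows of all previous docs.
            indices.extend(range(start + offset, end + offset))
        offset += length
    return indices


if __name__ == "__main__":
    # Two docs with 5 and 10 tokens; the second doc's rows start at 5.
    print(span_indices([[(0, 3), (2, 3)], [(5, 7)]], [5, 10]))
    # prints [0, 1, 2, 2, 10, 11]
```

Overlapping spans repeat indices, which is why the backward pass has to accumulate gradients with a scatter-add instead of a plain assignment.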
--- a/spacy/ml/models/__init__.py
+++ b/spacy/ml/models/__init__.py
@@ -1,6 +1,7 @@
 from .entity_linker import *  # noqa
 from .multi_task import *  # noqa
 from .parser import *  # noqa
 from .spancat import * # noqa
 from .tagger import *  # noqa
 from .textcat import *  # noqa
 from .tok2vec import *  # noqa
--- a/spacy/ml/models/spancat.py
+++ b/spacy/ml/models/spancat.py
@@ -0,0 +1,54 @@
 from typing import List, Tuple
 from thinc.api import Model, with_getitem, chain, list2ragged, Logistic
 from thinc.api import Maxout, Linear, concatenate, glorot_uniform_init
 from thinc.api import reduce_mean, reduce_max, reduce_first, reduce_last
 from thinc.types import Ragged, Floats2d
 from ...util import registry
 from ...tokens import Doc
 from ..extract_spans import extract_spans
@registry.layers.register("spacy.LinearLogistic.v1")
 def build_linear_logistic(nO=None, nI=None) -> Model[Floats2d, Floats2d]:
    """An output layer for multi-label classification. It uses a linear layer
    followed by a logistic activation.
    """
    return chain(Linear(nO=nO, nI=nI, init_W=glorot_uniform_init), Logistic())
@registry.layers.register("spacy.mean_max_reducer.v1")
 def build_mean_max_reducer(hidden_size: int) -> Model[Ragged, Floats2d]:
    """Reduce sequences by concatenating their last, first, mean and max
    pooled vectors, and then combine the concatenated vectors with a hidden layer.
    """
    return chain(
        concatenate(reduce_last(), reduce_first(), reduce_mean(), reduce_max()),
        Maxout(nO=hidden_size, normalize=True, dropout=0.0),
    )
@registry.architectures.register("spacy.SpanCategorizer.v1")
 def build_spancat_model(
    tok2vec: Model[List[Doc], List[Floats2d]],
    reducer: Model[Ragged, Floats2d],
    scorer: Model[Floats2d, Floats2d],
 ) -> Model[Tuple[List[Doc], Ragged], Floats2d]:
    """Build a span categorizer model, given a token-to-vector model, a
    reducer model to map the sequence of vectors for each span down to a single
    vector, and a scorer model to map the vectors to probabilities.
    tok2vec (Model[List[Doc], List[Floats2d]]): The tok2vec model.
    reducer (Model[Ragged, Floats2d]): The reducer model.
    scorer (Model[Floats2d, Floats2d]): The scorer model.
    """
    model = chain(
        with_getitem(0, chain(tok2vec, list2ragged())),
        extract_spans(),
        reducer,
        scorer,
    )
    model.set_ref("tok2vec", tok2vec)
    model.set_ref("reducer", reducer)
    model.set_ref("scorer", scorer)
    return model
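
To make the pooling step of `spacy.mean_max_reducer.v1` concrete, here is a plain-Python sketch of the concatenation only. It assumes a span is a non-empty list of equally sized token vectors and leaves out the Maxout hidden layer; `pool_span` is a hypothetical name used for illustration.

```python
from typing import List


def pool_span(vectors: List[List[float]]) -> List[float]:
    """Concatenate the last, first, mean and max pooled vectors of one span,
    in the same order as the reducer's `concatenate(...)` call."""
    if not vectors:
        raise ValueError("a span needs at least one token vector")
    width = len(vectors[0])
    last = list(vectors[-1])
    first = list(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(width)]
    maxed = [max(v[d] for v in vectors) for d in range(width)]
    return last + first + mean + maxed


if __name__ == "__main__":
    # A two-token span with vectors of width 2 yields a vector of width 8.
    print(pool_span([[1.0, 4.0], [3.0, 2.0]]))
    # prints [3.0, 2.0, 1.0, 4.0, 2.0, 3.0, 3.0, 4.0]
```

The pooled vector is four times as wide as a token vector, so the hidden layer that follows maps it back down to `hidden_size`.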
--- a/spacy/pipeline/__init__.py
+++ b/spacy/pipeline/__init__.py
@@ -11,6 +11,7 @@ from .senter import SentenceRecognizer
 from .sentencizer import Sentencizer
 from .tagger import Tagger
 from .textcat import TextCategorizer
 from .spancat import SpanCategorizer
 from .textcat_multilabel import MultiLabel_TextCategorizer
 from .tok2vec import Tok2Vec
 from .functions import merge_entities, merge_noun_chunks, merge_subtokens
@@ -27,6 +28,7 @@ __all__ = [
    "Pipe",
    "SentenceRecognizer",
    "Sentencizer",
    "SpanCategorizer",
    "Tagger",
    "TextCategorizer",
    "Tok2Vec",
--- a/spacy/pipeline/spancat.py
+++ b/spacy/pipeline/spancat.py
@@ -0,0 +1,411 @@
 import numpy
 from typing import List, Dict, Callable, Tuple, Optional, Iterable, Any
 from thinc.api import Config, Model, get_current_ops, set_dropout_rate, Ops
 from thinc.api import Optimizer
 from thinc.types import Ragged, Ints2d, Floats2d
 from ..scorer import Scorer
 from ..language import Language
 from .trainable_pipe import TrainablePipe
 from ..tokens import Doc, SpanGroup, Span
 from ..vocab import Vocab
 from ..training import Example, validate_examples
 from ..errors import Errors
 from ..util import registry
 spancat_default_config = """
 [model]
@architectures = "spacy.SpanCategorizer.v1"
 scorer = {"@layers": "spacy.LinearLogistic.v1"}
 [model.reducer]
@layers = spacy.mean_max_reducer.v1
 hidden_size = 128
 [model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
 [model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
 width = 96
 rows = [5000, 2000, 1000, 1000]
 attrs = ["ORTH", "PREFIX", "SUFFIX", "SHAPE"]
 include_static_vectors = false
 [model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
 width = ${model.tok2vec.embed.width}
 window_size = 1
 maxout_pieces = 3
 depth = 4
 """
 DEFAULT_SPANCAT_MODEL = Config().from_str(spancat_default_config)["model"]
@registry.misc("ngram_suggester.v1")
 def build_ngram_suggester(sizes: List[int]) -> Callable[[List[Doc]], Ragged]:
    """Suggest all spans of the given lengths. Spans are returned as a ragged
    array of integers. The array has two columns, indicating the start and end
    position."""
    def ngram_suggester(docs: List[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []
        lengths = []
        for doc in docs:
            starts = ops.xp.arange(len(doc), dtype="i")
            starts = starts.reshape((-1, 1))
            length = 0
            for size in sizes:
                if size <= len(doc):
                    starts_size = starts[:len(doc) - (size - 1)]
                    spans.append(ops.xp.hstack((starts_size, starts_size + size)))
                    length += spans[-1].shape[0]
                if spans:
                    assert spans[-1].ndim == 2, spans[-1].shape
            lengths.append(length)
        if len(spans) > 0:
            output = Ragged(ops.xp.vstack(spans), ops.asarray(lengths, dtype="i"))
        else:
            output = Ragged(ops.xp.zeros((0,0)), ops.asarray(lengths, dtype="i"))
        assert output.dataXd.ndim == 2
        return output
    return ngram_suggester
@Language.factory(
    "spancat",
    assigns=["doc.spans"],
    default_config={
        "threshold": 0.5,
        "spans_key": "sc",
        "max_positive": None,
        "model": DEFAULT_SPANCAT_MODEL,
        "suggester": {"@misc": "ngram_suggester.v1", "sizes": [1, 2, 3]},
    },
    default_score_weights={"spans_sc_f": 1.0, "spans_sc_p": 0.0, "spans_sc_r": 0.0},
 )
 def make_spancat(
    nlp: Language,
    name: str,
    suggester: Callable[[List[Doc]], Ragged],
    model: Model[Tuple[List[Doc], Ragged], Floats2d],
    spans_key: str,
    threshold: float = 0.5,
    max_positive: Optional[int] = None,
 ) -> "SpanCategorizer":
    """Create a SpanCategorizer component. The span categorizer consists of two
    parts: a suggester function that proposes candidate spans, and a labeller
    model that predicts one or more labels for each span.
    suggester (Callable[[List[Doc]], Ragged]): A function that suggests spans.
        Spans are returned as a ragged array with two integer columns, for the
        start and end positions.
    model (Model[Tuple[List[Doc], Ragged], Floats2d]): A model instance that
        is given a list of documents and (start, end) indices representing
        candidate span offsets. The model predicts a probability for each category
        for each span.
    spans_key (str): Key of the doc.spans dict to save the spans under. During
        initialization and training, the component will look for spans on the
        reference document under the same key.
    threshold (float): Minimum probability to consider a prediction positive.
        Spans with a positive prediction will be saved on the Doc. Defaults to
        0.5.
    max_positive (Optional[int]): Maximum number of labels to consider positive
        per span. Defaults to None, indicating no limit.
    """
    return SpanCategorizer(
        nlp.vocab,
        suggester=suggester,
        model=model,
        spans_key=spans_key,
        threshold=threshold,
        max_positive=max_positive,
        name=name,
    )
 class SpanCategorizer(TrainablePipe):
    """Pipeline component to label spans of text.
    DOCS: https://spacy.io/api/spancategorizer
    """
    def __init__(
        self,
        vocab: Vocab,
        model: Model[Tuple[List[Doc], Ragged], Floats2d],
        suggester: Callable[[List[Doc]], Ragged],
        name: str = "spancat",
        *,
        spans_key: str = "spans",
        threshold: float = 0.5,
        max_positive: Optional[int] = None,
    ) -> None:
        """Initialize the span categorizer.
        DOCS: https://spacy.io/api/spancategorizer#init
        """
        self.cfg = {
            "labels": [],
            "spans_key": spans_key,
            "threshold": threshold,
            "max_positive": max_positive,
        }
        self.vocab = vocab
        self.suggester = suggester
        self.model = model
        self.name = name
    @property
    def key(self) -> str:
        """Key of the doc.spans dict to save the spans under. During
        initialization and training, the component will look for spans on the
        reference document under the same key.
        """
        return self.cfg["spans_key"]
    def add_label(self, label: str) -> int:
        """Add a new label to the pipe.
        label (str): The label to add.
        RETURNS (int): 0 if label is already present, otherwise 1.
        DOCS: https://spacy.io/api/spancategorizer#add_label
        """
        if not isinstance(label, str):
            raise ValueError(Errors.E187)
        if label in self.labels:
            return 0
        self.cfg["labels"].append(label)
        self.vocab.strings.add(label)
        return 1
    @property
    def labels(self) -> Tuple[str]:
        """RETURNS (Tuple[str]): The labels currently added to the component.
        DOCS: https://spacy.io/api/spancategorizer#labels
        """
        return tuple(self.cfg["labels"])
    @property
    def label_data(self) -> List[str]:
        """RETURNS (List[str]): Information about the component's labels.
        DOCS: https://spacy.io/api/spancategorizer#label_data
        """
        return list(self.labels)
    def predict(self, docs: Iterable[Doc]):
        """Apply the pipeline's model to a batch of docs, without modifying them.
        docs (Iterable[Doc]): The documents to predict.
        RETURNS: The model's prediction for each document.
        DOCS: https://spacy.io/api/spancategorizer#predict
        """
        indices = self.suggester(docs, ops=self.model.ops)
        scores = self.model.predict((docs, indices))
        return (indices, scores)
    def set_annotations(self, docs: Iterable[Doc], indices_scores) -> None:
        """Modify a batch of Doc objects, using pre-computed scores.
        docs (Iterable[Doc]): The documents to modify.
        indices_scores: The span indices and scores to set, produced by SpanCategorizer.predict.
        DOCS: https://spacy.io/api/spancategorizer#set_annotations
        """
        labels = self.labels
        indices, scores = indices_scores
        offset = 0
        for i, doc in enumerate(docs):
            indices_i = indices[i].dataXd
            doc.spans[self.key] = self._make_span_group(
                doc, indices_i, scores[offset : offset + indices.lengths[i]], labels
            )
            offset += indices.lengths[i]
    def update(
        self,
        examples: Iterable[Example],
        *,
        drop: float = 0.0,
        sgd: Optional[Optimizer] = None,
        losses: Optional[Dict[str, float]] = None,
    ) -> Dict[str, float]:
        """Learn from a batch of documents and gold-standard information,
        updating the pipe's model. Delegates to predict and get_loss.
        examples (Iterable[Example]): A batch of Example objects.
        drop (float): The dropout rate.
        sgd (thinc.api.Optimizer): The optimizer.
        losses (Dict[str, float]): Optional record of the loss during training.
            Updated using the component name as the key.
        RETURNS (Dict[str, float]): The updated losses dictionary.
        DOCS: https://spacy.io/api/spancategorizer#update
        """
        if losses is None:
            losses = {}
        losses.setdefault(self.name, 0.0)
        validate_examples(examples, "SpanCategorizer.update")
        self._validate_categories(examples)
        if not any(len(eg.predicted) if eg.predicted else 0 for eg in examples):
            # Handle cases where there are no tokens in any docs.
            return losses
        docs = [eg.predicted for eg in examples]
        spans = self.suggester(docs, ops=self.model.ops)
        if spans.lengths.sum() == 0:
            return losses
        set_dropout_rate(self.model, drop)
        scores, backprop_scores = self.model.begin_update((docs, spans))
        loss, d_scores = self.get_loss(examples, (spans, scores))
        backprop_scores(d_scores)
        if sgd is not None:
            self.finish_update(sgd)
        losses[self.name] += loss
        return losses
    def get_loss(
        self, examples: Iterable[Example], spans_scores: Tuple[Ragged, Ragged]
    ) -> Tuple[float, float]:
        """Find the loss and gradient of loss for the batch of documents and
        their predicted scores.
        examples (Iterable[Example]): The batch of examples.
        spans_scores: Scores representing the model's predictions.
        RETURNS (Tuple[float, float]): The loss and the gradient.
        DOCS: https://spacy.io/api/spancategorizer#get_loss
        """
        spans, scores = spans_scores
        spans = Ragged(
            self.model.ops.to_numpy(spans.data), self.model.ops.to_numpy(spans.lengths)
        )
        label_map = {label: i for i, label in enumerate(self.labels)}
        target = numpy.zeros(scores.shape, dtype=scores.dtype)
        offset = 0
        for i, eg in enumerate(examples):
            # Map (start, end) offset of spans to the row in the d_scores array,
            # so that we can adjust the gradient for predictions that were
            # in the gold standard.
            spans_index = {}
            spans_i = spans[i].dataXd
            for j in range(spans.lengths[i]):
                start = int(spans_i[j, 0])
                end = int(spans_i[j, 1])
                spans_index[(start, end)] = offset + j
            for gold_span in self._get_aligned_spans(eg):
                key = (gold_span.start, gold_span.end)
                if key in spans_index:
                    row = spans_index[key]
                    k = label_map[gold_span.label_]
                    target[row, k] = 1.0
            # The target is a flat array for all docs. Track the position
            # we're at within the flat array.
            offset += spans.lengths[i]
        target = self.model.ops.asarray(target, dtype="f")
        # The target will have the values 0 (for untrue predictions) or 1
        # (for true predictions).
        # The scores should be in the range [0, 1].
        # If the prediction is 0.9 and it's true, the gradient
        # will be -0.1 (0.9 - 1.0).
        # If the prediction is 0.9 and it's false, the gradient will be
        # 0.9 (0.9 - 0.0)
        d_scores = scores - target
        loss = float((d_scores ** 2).sum())
        return loss, d_scores
    def initialize(
        self,
        get_examples: Callable[[], Iterable[Example]],
        *,
        nlp: Language = None,
        labels: Optional[Dict] = None,
    ) -> None:
        """Initialize the pipe for training, using a representative set
        of data examples.
        get_examples (Callable[[], Iterable[Example]]): Function that
            returns a representative sample of gold-standard Example objects.
        nlp (Language): The current nlp object the component is part of.
        labels: The labels to add to the component, typically generated by the
            `init labels` command. If no labels are provided, the get_examples
            callback is used to extract the labels from the data.
        DOCS: https://spacy.io/api/spancategorizer#initialize
        """
        subbatch = []
        if labels is not None:
            for label in labels:
                self.add_label(label)
        for eg in get_examples():
            if labels is None:
                for span in eg.reference.spans[self.key]:
                    self.add_label(span.label_)
            if len(subbatch) < 10:
                subbatch.append(eg)
        self._require_labels()
        if subbatch:
            docs = [eg.x for eg in subbatch]
            spans = self.suggester(docs)
            Y = self.model.ops.alloc2f(spans.dataXd.shape[0], len(self.labels))
            self.model.initialize(X=(docs, spans), Y=Y)
        else:
            self.model.initialize()
    def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
        """Score a batch of examples.
        examples (Iterable[Example]): The examples to score.
        RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans.
        DOCS: https://spacy.io/api/spancategorizer#score
        """
        validate_examples(examples, "SpanCategorizer.score")
        self._validate_categories(examples)
        kwargs = dict(kwargs)
        attr_prefix = "spans_"
        kwargs.setdefault("attr", f"{attr_prefix}{self.key}")
        kwargs.setdefault("labels", self.labels)
        kwargs.setdefault("multi_label", True)
        kwargs.setdefault("threshold", self.cfg["threshold"])
        kwargs.setdefault(
            "getter", lambda doc, key: doc.spans.get(key[len(attr_prefix) :], [])
        )
        kwargs.setdefault("has_annotation", lambda doc: self.key in doc.spans)
        return Scorer.score_spans(examples, **kwargs)
    def _validate_categories(self, examples):
        # TODO
        pass
    def _get_aligned_spans(self, eg: Example):
        return eg.get_aligned_spans_y2x(eg.reference.spans.get(self.key, []))
    def _make_span_group(
        self, doc: Doc, indices: Ints2d, scores: Floats2d, labels: List[str]
    ) -> SpanGroup:
        spans = SpanGroup(doc, name=self.key)
        max_positive = self.cfg["max_positive"]
        threshold = self.cfg["threshold"]
        for i in range(indices.shape[0]):
            start = int(indices[i, 0])
            end = int(indices[i, 1])
            positives = []
            for j, score in enumerate(scores[i]):
                if score >= threshold:
                    positives.append((score, start, end, labels[j]))
            positives.sort(reverse=True)
            if max_positive:
                positives = positives[:max_positive]
            for score, start, end, label in positives:
                spans.append(Span(doc, start, end, label=label))
        return spans
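
The selection rule in `_make_span_group` (keep labels whose score reaches `threshold`, best first, optionally capped by `max_positive`) can be sketched for a single candidate span as follows. `select_labels` is a hypothetical helper that works on plain lists rather than on `Doc` and `SpanGroup` objects.

```python
from typing import List, Optional, Tuple


def select_labels(
    scores: List[float],
    labels: List[str],
    threshold: float = 0.5,
    max_positive: Optional[int] = None,
) -> List[Tuple[float, str]]:
    """Return (score, label) pairs for one span, highest score first."""
    positives = [(s, label) for s, label in zip(scores, labels) if s >= threshold]
    positives.sort(reverse=True)
    if max_positive:
        # None (or 0) means no limit on the number of labels per span.
        positives = positives[:max_positive]
    return positives


if __name__ == "__main__":
    labels = ["PERSON", "LOC", "ORG"]
    print(select_labels([0.9, 0.2, 0.6], labels))
    # prints [(0.9, 'PERSON'), (0.6, 'ORG')]
    print(select_labels([0.9, 0.2, 0.6], labels, max_positive=1))
    # prints [(0.9, 'PERSON')]
```

Since every label is scored independently, a single span can end up in the group more than once with different labels.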
--- a/spacy/pipeline/trainable_pipe.pyx
+++ b/spacy/pipeline/trainable_pipe.pyx
@@ -101,7 +101,8 @@ cdef class TrainablePipe(Pipe):
    def update(self,
               examples: Iterable["Example"],
-               *, drop: float=0.0,
+               *,
               drop: float=0.0,
               sgd: Optimizer=None,
               losses: Optional[Dict[str, float]]=None) -> Dict[str, float]:
        """Learn from a batch of documents and gold-standard information,
--- a/spacy/tests/pipeline/test_pipe_factories.py
+++ b/spacy/tests/pipeline/test_pipe_factories.py
@@ -353,6 +353,7 @@ def test_language_factories_invalid():
        ([{"a": 0.0, "b": 0.0}, {"c": 1.0}], {}, {"a": 0.0, "b": 0.0, "c": 1.0}),
        ([{"a": 0.0, "b": 0.0}, {"c": 0.0}], {"c": 0.2}, {"a": 0.0, "b": 0.0, "c": 1.0}),
        ([{"a": 0.5, "b": 0.5, "c": 1.0, "d": 1.0}], {"a": 0.0, "b": 0.0}, {"a": 0.0, "b": 0.0, "c": 0.5, "d": 0.5}),
        ([{"a": 0.5, "b": 0.5, "c": 1.0, "d": 1.0}], {"a": 0.0, "b": 0.0, "f": 0.0}, {"a": 0.0, "b": 0.0, "c": 0.5, "d": 0.5, "f": 0.0}),
    ],
 )
 def test_language_factories_combine_score_weights(weights, override, expected):
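
The parametrized cases above pin down the expected behaviour of combining score weights with overrides: overrides are merged in (including keys no component declared), and non-zero weights are normalized to sum to 1. The function below is only a sketch that reproduces those four cases; `combine_weights_sketch` is a hypothetical name and the body is not necessarily how spaCy implements it.

```python
from typing import Dict, List, Optional

Weights = Dict[str, Optional[float]]


def combine_weights_sketch(
    weights: List[Weights], overrides: Optional[Weights] = None
) -> Weights:
    """Merge per-component score weights, apply overrides, then normalize."""
    result: Weights = {}
    for component_weights in weights:
        result.update(component_weights)
    # Overrides may also introduce keys that no component declared.
    result.update(overrides or {})
    total = sum(v for v in result.values() if v)
    if total > 0:
        for key, value in result.items():
            if value:
                result[key] = round(value / total, 2)
    return result


if __name__ == "__main__":
    print(combine_weights_sketch(
        [{"a": 0.5, "b": 0.5, "c": 1.0, "d": 1.0}],
        {"a": 0.0, "b": 0.0, "f": 0.0},
    ))
    # prints {'a': 0.0, 'b': 0.0, 'c': 0.5, 'd': 0.5, 'f': 0.0}
```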
--- a/spacy/tests/pipeline/test_spancat.py
+++ b/spacy/tests/pipeline/test_spancat.py
@@ -0,0 +1,146 @@
 from numpy.testing import assert_equal
 from spacy.language import Language
 from spacy.training import Example
 from spacy.util import fix_random_seed, registry
 SPAN_KEY = "labeled_spans"
 TRAIN_DATA = [
    ("Who is Shaka Khan?", {"spans": {SPAN_KEY: [(7, 17, "PERSON")]}}),
    (
        "I like London and Berlin.",
        {"spans": {SPAN_KEY: [(7, 13, "LOC"), (18, 24, "LOC")]}},
    ),
 ]
 def make_get_examples(nlp):
    train_examples = []
    for t in TRAIN_DATA:
        eg = Example.from_dict(nlp.make_doc(t[0]), t[1])
        train_examples.append(eg)
    def get_examples():
        return train_examples
    return get_examples
 def test_simple_train():
    fix_random_seed(0)
    nlp = Language()
    spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
    get_examples = make_get_examples(nlp)
    nlp.initialize(get_examples)
    sgd = nlp.create_optimizer()
    assert len(spancat.labels) != 0
    for i in range(40):
        losses = {}
        nlp.update(list(get_examples()), losses=losses, drop=0.1, sgd=sgd)
    doc = nlp("I like London and Berlin.")
    assert doc.spans[spancat.key] == doc.spans[SPAN_KEY]
    assert len(doc.spans[spancat.key]) == 2
    assert doc.spans[spancat.key][0].text == "London"
    scores = nlp.evaluate(get_examples())
    assert f"spans_{SPAN_KEY}_f" in scores
    assert scores[f"spans_{SPAN_KEY}_f"] == 1.0
 def test_ngram_suggester(en_tokenizer):
    # test different n-gram lengths
    for size in [1, 2, 3]:
        ngram_suggester = registry.misc.get("ngram_suggester.v1")(sizes=[size])
        docs = [
            en_tokenizer(text)
            for text in [
                "a",
                "a b",
                "a b c",
                "a b c d",
                "a b c d e",
                "a " * 100,
            ]
        ]
        ngrams = ngram_suggester(docs)
        # span sizes are correct
        for s in ngrams.data:
            assert s[1] - s[0] == size
        # spans are within docs
        offset = 0
        for i, doc in enumerate(docs):
            spans = ngrams.dataXd[offset : offset + ngrams.lengths[i]]
            spans_set = set()
            for span in spans:
                assert 0 <= span[0] < len(doc)
                assert 0 < span[1] <= len(doc)
                spans_set.add((span[0], span[1]))
            # spans are unique
            assert spans.shape[0] == len(spans_set)
            offset += ngrams.lengths[i]
        # the number of spans is correct
        assert_equal(
            ngrams.lengths,
            [max(0, len(doc) - (size - 1)) for doc in docs]
        )
    # test 1-3-gram suggestions
    ngram_suggester = registry.misc.get("ngram_suggester.v1")(sizes=[1, 2, 3])
    docs = [
        en_tokenizer(text) for text in ["a", "a b", "a b c", "a b c d", "a b c d e"]
    ]
    ngrams = ngram_suggester(docs)
    assert_equal(ngrams.lengths, [1, 3, 6, 9, 12])
    assert_equal(
        ngrams.data,
        [
            # doc 0
            [0, 1],
            # doc 1
            [0, 1],
            [1, 2],
            [0, 2],
            # doc 2
            [0, 1],
            [1, 2],
            [2, 3],
            [0, 2],
            [1, 3],
            [0, 3],
            # doc 3
            [0, 1],
            [1, 2],
            [2, 3],
            [3, 4],
            [0, 2],
            [1, 3],
            [2, 4],
            [0, 3],
            [1, 4],
            # doc 4
            [0, 1],
            [1, 2],
            [2, 3],
            [3, 4],
            [4, 5],
            [0, 2],
            [1, 3],
            [2, 4],
            [3, 5],
            [0, 3],
            [1, 4],
            [2, 5],
        ],
    )
    # test some empty docs
    ngram_suggester = registry.misc.get("ngram_suggester.v1")(sizes=[1])
    docs = [en_tokenizer(text) for text in ["", "a", ""]]
    ngrams = ngram_suggester(docs)
    assert_equal(ngrams.lengths, [len(doc) for doc in docs])
    # test all empty docs
    ngram_suggester = registry.misc.get("ngram_suggester.v1")(sizes=[1])
    docs = [en_tokenizer(text) for text in ["", "", ""]]
    ngrams = ngram_suggester(docs)
    assert_equal(ngrams.lengths, [len(doc) for doc in docs])
--- a/spacy/tests/test_models.py
+++ b/spacy/tests/test_models.py
@ -1,11 +1,14 @@
 from typing import List
 import pytest
 from thinc.api import fix_random_seed, Adam, set_dropout_rate
 from thinc.api import Ragged, reduce_mean, Logistic, chain, Relu
 from numpy.testing import assert_array_equal, assert_array_almost_equal
 import numpy
 from spacy.ml.models import build_Tok2Vec_model, MultiHashEmbed, MaxoutWindowEncoder
 from spacy.ml.models import build_bow_text_classifier, build_simple_cnn_text_classifier
 from spacy.ml.models import build_spancat_model
 from spacy.ml.staticvectors import StaticVectors
 from spacy.ml.extract_spans import extract_spans, _get_span_indices
 from spacy.lang.en import English
 from spacy.lang.en.examples import sentences as EN_SENTENCES
@ -205,3 +208,63 @@ def test_empty_docs(model_func, kwargs):
        # Test backprop
        output, backprop = model.begin_update(docs)
        backprop(output)
 def test_init_extract_spans():
    model = extract_spans().initialize()
 def test_extract_spans_span_indices():
    model = extract_spans().initialize()
    spans = Ragged(
        model.ops.asarray([[0, 3], [2, 3], [5, 7]], dtype="i"),
        model.ops.asarray([2, 1], dtype="i"),
    )
    x_lengths = model.ops.asarray([5, 10], dtype="i")
    indices = _get_span_indices(model.ops, spans, x_lengths)
    assert list(indices) == [0, 1, 2, 2, 10, 11]
 def test_extract_spans_forward_backward():
    model = extract_spans().initialize()
    X = Ragged(model.ops.alloc2f(15, 4), model.ops.asarray([5, 10], dtype="i"))
    spans = Ragged(
        model.ops.asarray([[0, 3], [2, 3], [5, 7]], dtype="i"),
        model.ops.asarray([2, 1], dtype="i"),
    )
    Y, backprop = model.begin_update((X, spans))
    assert list(Y.lengths) == [3, 1, 2]
    assert Y.dataXd.shape == (6, 4)
    dX, spans2 = backprop(Y)
    assert spans2 is spans
    assert dX.dataXd.shape == X.dataXd.shape
    assert list(dX.lengths) == list(X.lengths)
 def test_spancat_model_init():
    model = build_spancat_model(
        build_Tok2Vec_model(**get_tok2vec_kwargs()), reduce_mean(), Logistic()
    )
    model.initialize()
 def test_spancat_model_forward_backward(nO=5):
    tok2vec = build_Tok2Vec_model(**get_tok2vec_kwargs())
    docs = get_docs()
    spans_list = []
    lengths = []
    for doc in docs:
        spans_list.append(doc[:2])
        spans_list.append(doc[1:4])
        lengths.append(2)
    spans = Ragged(
        tok2vec.ops.asarray([[s.start, s.end] for s in spans_list], dtype="i"),
        tok2vec.ops.asarray(lengths, dtype="i"),
    )
    model = build_spancat_model(
        tok2vec, reduce_mean(), chain(Relu(nO=nO), Logistic())
    ).initialize(X=(docs, spans))
    Y, backprop = model((docs, spans), is_train=True)
    assert Y.shape == (spans.dataXd.shape[0], nO)
    backprop(Y)
--- a/spacy/util.py
+++ b/spacy/util.py
@ -1394,7 +1394,8 @@ def combine_score_weights(
    # We divide each weight by the total weight sum.
    # We first need to extract all None/null values for score weights that
    # shouldn't be shown in the table *or* be weighted
-    result = {key: overrides.get(key, value) for w_dict in weights for (key, value) in w_dict.items()}
+    result = {key: value for w_dict in weights for (key, value) in w_dict.items()}
    result.update(overrides)
    weight_sum = sum([v if v else 0.0 for v in result.values()])
    for key, value in result.items():
        if value and weight_sum > 0:
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@ -9,6 +9,7 @@ menu:
  - ['Parser & NER', 'parser']
  - ['Tagging', 'tagger']
  - ['Text Classification', 'textcat']
  - ['Span Classification', 'spancat']
  - ['Entity Linking', 'entitylinker']
 ---
@ -736,6 +737,54 @@ Since v2, new labels can be added to this component, even after training.
 </Accordion>
 ## Span classification architectures {#spancat source="spacy/ml/models/spancat.py"}
 ### spacy.SpanCategorizer.v1 {#SpanCategorizer}
 > #### Example Config
 >
 > ```ini
 > [model]
 > @architectures = "spacy.SpanCategorizer.v1"
 > scorer = {"@layers": "spacy.LinearLogistic.v1"}
 > 
 > [model.reducer]
 > @layers = "spacy.mean_max_reducer.v1"
 > hidden_size = 128
 >
 > [model.tok2vec]
 > @architectures = "spacy.Tok2Vec.v1"
 >
 > [model.tok2vec.embed]
 > @architectures = "spacy.MultiHashEmbed.v1"
 > # ...
 >
 > [model.tok2vec.encode]
 > @architectures = "spacy.MaxoutWindowEncoder.v1"
 > # ...
 > ```
 Build a span categorizer model to power a
 [`SpanCategorizer`](/api/spancategorizer) component, given a token-to-vector
 model, a reducer model to map the sequence of vectors for each span down to a
 single vector, and a scorer model to map the vectors to probabilities.
 | Name        | Description                                                                     |
 | ----------- | ------------------------------------------------------------------------------- |
 | `tok2vec`   | The token-to-vector model. ~~Model[List[Doc], List[Floats2d]]~~                 |
 | `reducer`   | The reducer model. ~~Model[Ragged, Floats2d]~~                                  |
 | `scorer`    | The scorer model. ~~Model[Floats2d, Floats2d]~~                                 |
 | **CREATES** | The model using the architecture. ~~Model[Tuple[List[Doc], Ragged], Floats2d]~~ |
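As a rough illustration of the `(start, end)` span offsets the model consumes, the sketch below flattens per-document spans into row indices over the concatenated token vectors of a batch. It is a plain-Python approximation and not the library code; the function name `span_row_indices` and its arguments are hypothetical.

```python
from typing import List, Tuple


def span_row_indices(
    spans_per_doc: List[List[Tuple[int, int]]], doc_lengths: List[int]
) -> List[int]:
    """Map doc-relative (start, end) spans to rows of the concatenated token matrix."""
    indices: List[int] = []
    offset = 0  # row of the first token of the current doc
    for spans, n_tokens in zip(spans_per_doc, doc_lengths):
        for start, end in spans:
            indices.extend(range(offset + start, offset + end))
        offset += n_tokens
    return indices


# Two docs with 5 and 10 tokens; the second doc's tokens start at row 5.
rows = span_row_indices([[(0, 3), (2, 3)], [(5, 7)]], [5, 10])
print(rows)  # [0, 1, 2, 2, 10, 11]
```

Each span contributes `end - start` rows, so the reducer receives one variable-length sequence of vectors per candidate span.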
 ### spacy.mean_max_reducer.v1 {#mean_max_reducer}
 Reduce sequences by concatenating their mean and max pooled vectors, and then
 combine the concatenated vectors with a hidden layer.
 | Name          | Description                           |
 | ------------- | ------------------------------------- |
 | `hidden_size` | The size of the hidden layer. ~~int~~ |
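The pooling step can be sketched in plain Python for a single span. This is a minimal illustration, assuming the mean vector comes first in the concatenation; it leaves out the hidden layer, and the helper name `mean_max_pool` is hypothetical.

```python
from typing import List


def mean_max_pool(span_vectors: List[List[float]]) -> List[float]:
    """Concatenate the element-wise mean and max of one span's token vectors."""
    width = len(span_vectors[0])
    n = len(span_vectors)
    mean = [sum(vec[i] for vec in span_vectors) / n for i in range(width)]
    peak = [max(vec[i] for vec in span_vectors) for i in range(width)]
    return mean + peak


# A span of two tokens with 2-dimensional vectors yields a 4-dimensional vector.
pooled = mean_max_pool([[1.0, 4.0], [3.0, 2.0]])
print(pooled)  # [2.0, 3.0, 3.0, 4.0]
```

The pooled vector is twice as wide as the token vectors, which is why a hidden layer of size `hidden_size` follows to bring it back to a fixed width.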
 ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}
 An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions
--- a/website/docs/api/spancategorizer.md
+++ b/website/docs/api/spancategorizer.md
@ -0,0 +1,453 @@
 ---
 title: SpanCategorizer
 tag: class,experimental
 source: spacy/pipeline/spancat.py
 new: 3.1
 teaser: 'Pipeline component for labeling potentially overlapping spans of text'
 api_base_class: /api/pipe
 api_string_name: spancat
 api_trainable: true
 ---
 A span categorizer consists of two parts: a [suggester function](#suggesters)
 that proposes candidate spans, which may or may not overlap, and a labeler model
 that predicts zero or more labels for each candidate.
 ## Config and implementation {#config}
 The default config is defined by the pipeline component factory and describes
 how the component should be configured. You can override its settings via the
 `config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
 [`config.cfg` for training](/usage/training#config). See the
 [model architectures](/api/architectures) documentation for details on the
 architectures and their arguments and hyperparameters.
 > #### Example
 >
 > ```python
 > from spacy.pipeline.spancat import DEFAULT_SPANCAT_MODEL
 > config = {
 >     "threshold": 0.5,
 >     "spans_key": "labeled_spans",
 >     "max_positive": None,
 >     "model": DEFAULT_SPANCAT_MODEL,
 >     "suggester": {"@misc": "ngram_suggester.v1", "sizes": [1, 2, 3]},
 > }
 > nlp.add_pipe("spancat", config=config)
 > ```
 | Setting        | Description                                                                                                                                                                                                                                                                                             |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `suggester`    | A function that [suggests spans](#suggesters). Spans are returned as a ragged array with two integer columns, for the start and end positions. Defaults to [`ngram_suggester`](#ngram_suggester). ~~Callable[[List[Doc]], Ragged]~~                                                                     |
 | `model`        | A model instance that is given a list of documents and `(start, end)` indices representing candidate span offsets. The model predicts a probability for each category for each span. Defaults to [SpanCategorizer](/api/architectures#SpanCategorizer). ~~Model[Tuple[List[Doc], Ragged], Floats2d]~~   |
 | `spans_key`    | Key of the [`Doc.spans`](/api/doc#spans) dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"spans"`. ~~str~~                                                                               |
 | `threshold`    | Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Defaults to `0.5`. ~~float~~                                                                                                                                                          |
 | `max_positive` | Maximum number of labels to consider positive per span. Defaults to `None`, indicating no limit. ~~Optional[int]~~                                                                                                                                                                                      |
 ```python
 %%GITHUB_SPACY/spacy/pipeline/spancat.py
 ```
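The interaction between `threshold` and `max_positive` can be sketched for a single candidate span as follows. This is an illustration only: it assumes a score equal to the threshold counts as positive and that the highest-scoring labels are kept when `max_positive` truncates the list. The helper `select_labels` and the label names are hypothetical.

```python
from typing import Dict, List, Optional


def select_labels(
    scores: Dict[str, float], threshold: float = 0.5, max_positive: Optional[int] = None
) -> List[str]:
    """Pick labels for one candidate span from its per-label probabilities."""
    # Assumption: a score equal to the threshold counts as positive.
    positive = [label for label, score in scores.items() if score >= threshold]
    # Highest-scoring labels first, so truncation keeps the best ones.
    positive.sort(key=lambda label: scores[label], reverse=True)
    if max_positive is not None:
        positive = positive[:max_positive]
    return positive


span_scores = {"PERSON": 0.9, "ORG": 0.6, "GPE": 0.2}
print(select_labels(span_scores))  # ['PERSON', 'ORG']
print(select_labels(span_scores, max_positive=1))  # ['PERSON']
print(select_labels(span_scores, threshold=0.95))  # []
```

A span with no positive label is not saved on the `Doc`, so raising the threshold reduces the number of predicted spans.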
 ## SpanCategorizer.\_\_init\_\_ {#init tag="method"}
 > #### Example
 >
 > ```python
 > # Construction via add_pipe with default model
 > spancat = nlp.add_pipe("spancat")
 >
 > # Construction via add_pipe with custom model
 > config = {"model": {"@architectures": "my_spancat"}}
 > spancat = nlp.add_pipe("spancat", config=config)
 >
 > # Construction from class
 > from spacy.pipeline import SpanCategorizer
 > spancat = SpanCategorizer(nlp.vocab, model, suggester)
 > ```
 Create a new pipeline instance. In your application, you would normally use a
 shortcut for this and instantiate the component using its string name and
 [`nlp.add_pipe`](/api/language#add_pipe).
 | Name           | Description                                                                                                                                                                                                                          |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `vocab`        | The shared vocabulary. ~~Vocab~~                                                                                                                                                                                                     |
 | `model`        | A model instance that is given a list of documents and `(start, end)` indices representing candidate span offsets. The model predicts a probability for each category for each span. ~~Model[Tuple[List[Doc], Ragged], Floats2d]~~   |
 | `suggester`    | A function that [suggests spans](#suggesters). Spans are returned as a ragged array with two integer columns, for the start and end positions. ~~Callable[[List[Doc]], Ragged]~~                                                     |
 | `name`         | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~                                                                                                                                  |
 | _keyword-only_ |                                                                                                                                                                                                                                      |
 | `spans_key`    | Key of the [`Doc.spans`](/api/doc#spans) dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"spans"`. ~~str~~            |
 | `threshold`    | Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Defaults to `0.5`. ~~float~~                                                                                       |
 | `max_positive` | Maximum number of labels to consider positive per span. Defaults to `None`, indicating no limit. ~~Optional[int]~~                                                                                                                   |
 ## SpanCategorizer.\_\_call\_\_ {#call tag="method"}
 Apply the pipe to one document. The document is modified in place, and returned.
 This usually happens under the hood when the `nlp` object is called on a text
 and all pipeline components are applied to the `Doc` in order. Both
 [`__call__`](/api/spancategorizer#call) and [`pipe`](/api/spancategorizer#pipe)
 delegate to the [`predict`](/api/spancategorizer#predict) and
 [`set_annotations`](/api/spancategorizer#set_annotations) methods.
 > #### Example
 >
 > ```python
 > doc = nlp("This is a sentence.")
 > spancat = nlp.add_pipe("spancat")
 > # This usually happens under the hood
 > processed = spancat(doc)
 > ```
 | Name        | Description                      |
 | ----------- | -------------------------------- |
 | `doc`       | The document to process. ~~Doc~~ |
 | **RETURNS** | The processed document. ~~Doc~~  |
 ## SpanCategorizer.pipe {#pipe tag="method"}
 Apply the pipe to a stream of documents. This usually happens under the hood
 when the `nlp` object is called on a text and all pipeline components are
 applied to the `Doc` in order. Both [`__call__`](/api/spancategorizer#call) and
 [`pipe`](/api/spancategorizer#pipe) delegate to the
 [`predict`](/api/spancategorizer#predict) and
 [`set_annotations`](/api/spancategorizer#set_annotations) methods.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > for doc in spancat.pipe(docs, batch_size=50):
 >     pass
 > ```
 | Name           | Description                                                   |
 | -------------- | ------------------------------------------------------------- |
 | `stream`       | A stream of documents. ~~Iterable[Doc]~~                      |
 | _keyword-only_ |                                                               |
 | `batch_size`   | The number of documents to buffer. Defaults to `128`. ~~int~~ |
 | **YIELDS**     | The processed documents in order. ~~Doc~~                     |
 ## SpanCategorizer.initialize {#initialize tag="method"}
 Initialize the component for training. `get_examples` should be a function that
 returns an iterable of [`Example`](/api/example) objects. The data examples are
 used to **initialize the model** of the component and can either be the full
 training data or a representative sample. Initialization includes validating the
 network,
 [inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
 setting up the label scheme based on the data. This method is typically called
 by [`Language.initialize`](/api/language#initialize) and lets you customize
 arguments it receives via the
 [`[initialize.components]`](/api/data-formats#config-initialize) block in the
 config.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > spancat.initialize(lambda: [], nlp=nlp)
 > ```
 >
 > ```ini
 > ### config.cfg
 > [initialize.components.spancat]
 >
 > [initialize.components.spancat.labels]
 > @readers = "spacy.read_labels.v1"
 > path = "corpus/labels/spancat.json"
 > ```
 | Name           | Description                                                                                                                                                                                                                                                                                                                                                                                                |
 | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~                                                                                                                                                                                                                                                                      |
 | _keyword-only_ |                                                                                                                                                                                                                                                                                                                                                                                                            |
 | `nlp`          | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~                                                                                                                                                                                                                                                                                                                                       |
 | `labels`       | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
 ## SpanCategorizer.predict {#predict tag="method"}
 Apply the component's model to a batch of [`Doc`](/api/doc) objects without
 modifying them.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > scores = spancat.predict([doc1, doc2])
 > ```
 | Name        | Description                                 |
 | ----------- | ------------------------------------------- |
 | `docs`      | The documents to predict. ~~Iterable[Doc]~~ |
 | **RETURNS** | The model's prediction for each document.   |
 ## SpanCategorizer.set_annotations {#set_annotations tag="method"}
 Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > scores = spancat.predict(docs)
 > spancat.set_annotations(docs, scores)
 > ```
 | Name     | Description                                               |
 | -------- | --------------------------------------------------------- |
 | `docs`   | The documents to modify. ~~Iterable[Doc]~~                |
 | `scores` | The scores to set, produced by `SpanCategorizer.predict`. |
 ## SpanCategorizer.update {#update tag="method"}
 Learn from a batch of [`Example`](/api/example) objects containing the
 predictions and gold-standard annotations, and update the component's model.
 Delegates to [`predict`](/api/spancategorizer#predict) and
 [`get_loss`](/api/spancategorizer#get_loss).
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > optimizer = nlp.initialize()
 > losses = spancat.update(examples, sgd=optimizer)
 > ```
 | Name              | Description                                                                                                                        |
 | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
 | `examples`        | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~                                                  |
 | _keyword-only_    |                                                                                                                                    |
 | `drop`            | The dropout rate. ~~float~~                                                                                                        |
 | `sgd`             | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~                      |
 | `losses`          | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~           |
 | **RETURNS**       | The updated `losses` dictionary. ~~Dict[str, float]~~                                                                              |
 ## SpanCategorizer.get_loss {#get_loss tag="method"}
 Find the loss and gradient of loss for the batch of documents and their
 predicted scores.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > scores = spancat.predict([eg.predicted for eg in examples])
 > loss, d_loss = spancat.get_loss(examples, scores)
 > ```
 | Name        | Description                                                                 |
 | ----------- | --------------------------------------------------------------------------- |
 | `examples`  | The batch of examples. ~~Iterable[Example]~~                                |
 | `scores`    | Scores representing the model's predictions.                                |
 | **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
 ## SpanCategorizer.score {#score tag="method"}
 Score a batch of examples.
 > #### Example
 >
 > ```python
 > scores = spancat.score(examples)
 > ```
 | Name           | Description                                                                                                            |
 | -------------- | ---------------------------------------------------------------------------------------------------------------------- |
 | `examples`     | The examples to score. ~~Iterable[Example]~~                                                                           |
 | _keyword-only_ |                                                                                                                        |
 | **RETURNS**    | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans). ~~Dict[str, Union[float, Dict[str, float]]]~~ |
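To give a sense of what span-level scores measure, the sketch below computes precision, recall and F-score over labeled spans. It assumes a predicted span counts as correct only if its start, end and label all match a gold span. The function `span_prf` and its output keys are hypothetical, and the actual output of `Scorer.score_spans` may be structured differently.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall and F-score over labeled spans."""
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return {"p": precision, "r": recall, "f": f_score}


gold = {(0, 2, "A"), (3, 5, "B")}
pred = {(0, 2, "A"), (3, 4, "B")}
print(span_prf(gold, pred))  # {'p': 0.5, 'r': 0.5, 'f': 0.5}
```

Because spans may overlap, each `(start, end, label)` triple is compared independently, with no constraint that spans be disjoint.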
 ## SpanCategorizer.create_optimizer {#create_optimizer tag="method"}
 Create an optimizer for the pipeline component.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > optimizer = spancat.create_optimizer()
 > ```
 | Name        | Description                  |
 | ----------- | ---------------------------- |
 | **RETURNS** | The optimizer. ~~Optimizer~~ |
 ## SpanCategorizer.use_params {#use_params tag="method, contextmanager"}
 Modify the pipe's model to use the given parameter values.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > with spancat.use_params(optimizer.averages):
 >     spancat.to_disk("/best_model")
 > ```
 | Name     | Description                                        |
 | -------- | -------------------------------------------------- |
 | `params` | The parameter values to use in the model. ~~dict~~ |
 ## SpanCategorizer.add_label {#add_label tag="method"}
 Add a new label to the pipe. Raises an error if the output dimension is already
 set, or if the model has already been fully [initialized](#initialize). Note
 that you don't have to call this method if you provide a **representative data
 sample** to the [`initialize`](#initialize) method. In this case, all labels
 found in the sample will be automatically added to the model, and the output
 dimension will be [inferred](/usage/layers-architectures#thinc-shape-inference)
 automatically.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > spancat.add_label("MY_LABEL")
 > ```
 | Name        | Description                                                 |
 | ----------- | ----------------------------------------------------------- |
 | `label`     | The label to add. ~~str~~                                   |
 | **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ |
 ## SpanCategorizer.to_disk {#to_disk tag="method"}
 Serialize the pipe to disk.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > spancat.to_disk("/path/to/spancat")
 > ```
 | Name           | Description                                                                                                                                |
 | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
 | `path`         | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
 | _keyword-only_ |                                                                                                                                            |
 | `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~                                                |
 ## SpanCategorizer.from_disk {#from_disk tag="method"}
 Load the pipe from disk. Modifies the object in place and returns it.
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > spancat.from_disk("/path/to/spancat")
 > ```
 | Name           | Description                                                                                     |
 | -------------- | ----------------------------------------------------------------------------------------------- |
 | `path`         | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
 | _keyword-only_ |                                                                                                 |
 | `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~     |
 | **RETURNS**    | The modified `SpanCategorizer` object. ~~SpanCategorizer~~                                      |
 ## SpanCategorizer.to_bytes {#to_bytes tag="method"}
 > #### Example
 >
 > ```python
 > spancat = nlp.add_pipe("spancat")
 > spancat_bytes = spancat.to_bytes()
 > ```
 Serialize the pipe to a bytestring.
 | Name           | Description                                                                                 |
 | -------------- | ------------------------------------------------------------------------------------------- |
 | _keyword-only_ |                                                                                             |
 | `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
 | **RETURNS**    | The serialized form of the `SpanCategorizer` object. ~~bytes~~                              |
 ## SpanCategorizer.from_bytes {#from_bytes tag="method"}
 Load the pipe from a bytestring. Modifies the object in place and returns it.
 > #### Example
 >
 > ```python
 > spancat_bytes = spancat.to_bytes()
 > spancat = nlp.add_pipe("spancat")
 > spancat.from_bytes(spancat_bytes)
 > ```
 | Name           | Description                                                                                 |
 | -------------- | ------------------------------------------------------------------------------------------- |
 | `bytes_data`   | The data to load from. ~~bytes~~                                                            |
 | _keyword-only_ |                                                                                             |
 | `exclude`      | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
 | **RETURNS**    | The `SpanCategorizer` object. ~~SpanCategorizer~~                                           |
 ## SpanCategorizer.labels {#labels tag="property"}
 The labels currently added to the component.
 > #### Example
 >
 > ```python
 > spancat.add_label("MY_LABEL")
 > assert "MY_LABEL" in spancat.labels
 > ```
 | Name        | Description                                            |
 | ----------- | ------------------------------------------------------ |
 | **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
 ## SpanCategorizer.label_data {#label_data tag="property"}
 The labels currently added to the component and their internal meta information.
 This is the data generated by [`init labels`](/api/cli#init-labels) and used by
 [`SpanCategorizer.initialize`](/api/spancategorizer#initialize) to initialize
 the model with a pre-defined label set.
 > #### Example
 >
 > ```python
 > labels = spancat.label_data
 > spancat.initialize(lambda: [], nlp=nlp, labels=labels)
 > ```
 | Name        | Description                                                |
 | ----------- | ---------------------------------------------------------- |
 | **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
 ## Serialization fields {#serialization-fields}
 During serialization, spaCy will export several data fields used to restore
 different aspects of the object. If needed, you can exclude them from
 serialization by passing in the string names via the `exclude` argument.
 > #### Example
 >
 > ```python
 > spancat.to_disk("/path", exclude=["vocab"])
 > ```
 | Name    | Description                                                    |
 | ------- | -------------------------------------------------------------- |
 | `vocab` | The shared [`Vocab`](/api/vocab).                              |
 | `cfg`   | The config file. You usually don't want to exclude this.       |
 | `model` | The binary model data. You usually don't want to exclude this. |
 ## Suggesters {#suggesters tag="registered functions" source="spacy/pipeline/spancat.py"}
 ### spacy.ngram_suggester.v1 {#ngram_suggester}
 > #### Example Config
 >
 > ```ini
 > [components.spancat.suggester]
 > @misc = "spacy.ngram_suggester.v1"
 > sizes = [1, 2, 3]
 > ```
 Suggest all spans of the given lengths. Spans are returned as a ragged array of
 integers. The array has two columns, indicating the start and end position.
 | Name        | Description                                                                                                          |
 | ----------- | -------------------------------------------------------------------------------------------------------------------- |
 | `sizes`     | The phrase lengths to suggest. For example, `[1, 2]` will suggest phrases consisting of 1 or 2 tokens. ~~List[int]~~ |
 | **CREATES** | The suggester function. ~~Callable[[List[Doc]], Ragged]~~                                                            |
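The enumeration performed by an n-gram suggester can be sketched for a single document in plain Python. This is a simplified illustration that returns a list of tuples instead of a ragged array; the helper name `ngram_spans` is hypothetical.

```python
from typing import List, Tuple


def ngram_spans(n_tokens: int, sizes: List[int]) -> List[Tuple[int, int]]:
    """List (start, end) offsets of every n-gram of the given sizes in one doc."""
    spans = []
    for size in sizes:
        # A doc shorter than `size` contributes no spans of that size.
        for start in range(max(0, n_tokens - size + 1)):
            spans.append((start, start + size))
    return spans


print(ngram_spans(3, [1, 2, 3]))
# [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)]
print([len(ngram_spans(n, [1, 2, 3])) for n in range(1, 6)])  # [1, 3, 6, 9, 12]
```

The number of candidates grows roughly linearly with document length for a fixed set of sizes, so adding larger sizes increases the work done by the labeler model.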
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@ -94,6 +94,7 @@
                    { "text": "Morphologizer", "url": "/api/morphologizer" },
                    { "text": "SentenceRecognizer", "url": "/api/sentencerecognizer" },
                    { "text": "Sentencizer", "url": "/api/sentencizer" },
                    { "text": "SpanCategorizer", "url": "/api/spancategorizer" },
                    { "text": "Tagger", "url": "/api/tagger" },
                    { "text": "TextCategorizer", "url": "/api/textcategorizer" },
                    { "text": "Tok2Vec", "url": "/api/tok2vec" },