spaCy/scorer.md at 3d56a3f286485a9f1741286476f8872d8e3be1c8

mirror of https://github.com/explosion/spaCy.git synced 2025-07-10 16:22:29 +03:00

Refactor the Scorer to improve flexibility (#5731)

* Refactor the Scorer to improve flexibility

Refactor the `Scorer` to improve flexibility for arbitrary pipeline
components.

* Individual pipeline components provide their own `evaluate` methods
that score a list of `Example`s and return a dictionary of scores
* `Scorer` is initialized either:
  * with a provided pipeline containing components to be scored
  * with a default pipeline containing the built-in statistical
    components (senter, tagger, morphologizer, parser, ner)
* `Scorer.score` evaluates a list of `Example`s and returns a dictionary
of scores referring to the scores provided by the components in the
pipeline

Significant differences:

* `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc`
and the new `morph_acc`, `pos_acc`, and `lemma_acc`
* Scoring is no longer cumulative: `Scorer.score` scores a list of
examples rather than a single example and does not retain any state
about previously scored examples
* PRF values in the returned scores are no longer multiplied by 100

* Add kwargs to Morphologizer.evaluate

* Create generalized scoring methods in Scorer

* Generalized static scoring methods are added to `Scorer`
  * Methods require an attribute (either on Token or Doc) that is
used to key the returned scores

Naming differences:

* `uas`, `las`, and `las_per_type` in the scores dict are renamed to
`dep_uas`, `dep_las`, and `dep_las_per_type`

Scoring differences:

* `Doc.sents` is now scored as spans rather than on sentence-initial
token positions so that `Doc.sents` and `Doc.ents` can be scored with
the same method (this lowers scores since a single incorrect sentence
start results in two incorrect spans)

* Simplify / extend hasattr check for eval method

* Add hasattr check to tokenizer scoring
* Simplify to hasattr check for component scoring

* Reset Example alignment if docs are set

Reset the Example alignment if either doc is set in case the
tokenization has changed.

* Add PRF tokenization scoring for tokens as spans

Add PRF scores for tokens as character spans. The scores are:

* token_acc: # correct tokens / # gold tokens
* token_p/r/f: PRF for (token.idx, token.idx + len(token))

* Add docstring to Scorer.score_tokenization

* Rename component.evaluate() to component.score()

* Update Scorer API docs

* Update scoring for positive_label in textcat

* Fix TextCategorizer.score kwargs

* Update Language.evaluate docs

* Update score names in default config

2020-07-25 12:53:02 +02:00


title	teaser	tag	source
Scorer	Compute evaluation scores	class	spacy/scorer.py

The Scorer computes evaluation scores. It's typically created by Language.evaluate.

In addition, the Scorer provides a number of evaluation methods for evaluating Token and Doc attributes.

Scorer.__init__

Create a new Scorer.

Example

import spacy
from spacy.scorer import Scorer

# default scoring pipeline
scorer = Scorer()

# provided scoring pipeline
nlp = spacy.load("en_core_web_sm")
scorer = Scorer(nlp)

Name	Type	Description
`nlp`	Language	The pipeline to use for scoring, where each pipeline component may provide a scoring method. If none is provided, then a default pipeline for the multi-language code `xx` is constructed containing: `senter`, `tagger`, `morphologizer`, `parser`, `ner`, `textcat`.
RETURNS	`Scorer`	The newly created object.

Scorer.score

Calculate the scores for a list of Example objects using the scoring methods provided by the components in the pipeline.

The returned Dict contains the scores provided by the individual pipeline components. For the scoring methods provided by the Scorer and used by the core pipeline components, the individual score names start with the Token or Doc attribute being scored: token_acc, token_p/r/f, sents_p/r/f, tag_acc, pos_acc, morph_acc, morph_per_feat, lemma_acc, dep_uas, dep_las, dep_las_per_type, ents_p/r/f, ents_per_type, textcat_macro_auc, textcat_macro_f.

Example

scorer = Scorer()
scorer.score(examples)

Name	Type	Description
`examples`	`Iterable[Example]`	The `Example` objects holding both the predictions and the correct gold-standard annotations.
RETURNS	`Dict`	A dictionary of scores.

Scorer.score_tokenization

Scores the tokenization:

* token_acc: # correct tokens / # gold tokens
* token_p/r/f: PRF for token character spans

Name	Type	Description
`examples`	`Iterable[Example]`	The `Example` objects holding both the predictions and the correct gold-standard annotations.
RETURNS	`Dict`	A dictionary containing the scores `token_acc/p/r/f`.
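
The two tokenization scores can be sketched in plain Python as follows. This is a minimal illustration of the definitions given above, not the library's implementation: the helper `char_spans` and the example sentence are hypothetical, and spaCy's own counting (which works through the `Example` alignment) may differ in detail.

```python
def char_spans(text, tokens):
    """Map a list of token strings to (start, end) character offsets in text."""
    spans, pos = set(), 0
    for tok in tokens:
        start = text.index(tok, pos)
        end = start + len(tok)
        spans.add((start, end))
        pos = end
    return spans


text = "Don't stop now"
gold = char_spans(text, ["Do", "n't", "stop", "now"])
pred = char_spans(text, ["Don't", "stop", "now"])

# A predicted token is correct if its character span matches a gold span.
correct = len(gold & pred)

token_p = correct / len(pred)
token_r = correct / len(gold)
token_f = 2 * token_p * token_r / (token_p + token_r)
token_acc = correct / len(gold)  # "# correct tokens / # gold tokens"

print({"token_acc": token_acc, "token_p": token_p, "token_r": token_r, "token_f": token_f})
```

Under the definition quoted above, `token_acc` coincides with recall in this sketch, since both divide the number of correct tokens by the number of gold tokens.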

Scorer.score_token_attr

Scores a single token attribute.

Name	Type	Description
`examples`	`Iterable[Example]`	The `Example` objects holding both the predictions and the correct gold-standard annotations.
`attr`	`str`	The attribute to score.
`getter`	`callable`	Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`.
RETURNS	`Dict`	A dictionary containing the score `attr_acc`.
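
As a rough sketch of how the `getter` argument and the `attr_acc` key fit together, the following plain-Python example computes an attribute accuracy over token pairs. The function `token_attr_accuracy` and the `SimpleNamespace` stand-ins for tokens are hypothetical, and the sketch assumes the predicted and gold tokens are already aligned one to one, which the real method has to work out from the `Example` objects.

```python
from types import SimpleNamespace


def token_attr_accuracy(pairs, attr, getter=getattr):
    """Accuracy of one attribute over aligned (predicted, gold) token pairs."""
    correct = sum(getter(pred, attr) == getter(gold, attr) for pred, gold in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0
    # The score is keyed by the attribute name, e.g. "tag_acc" for attr="tag".
    return {f"{attr}_acc": accuracy}


def tok(tag):
    """Stand-in for a token with a single `tag` attribute (illustration only)."""
    return SimpleNamespace(tag=tag)


pairs = [
    (tok("DT"), tok("DT")),
    (tok("NN"), tok("NN")),
    (tok("VBZ"), tok("VBD")),  # wrong tag
    (tok("JJ"), tok("JJ")),
]

scores = token_attr_accuracy(pairs, "tag")

# A custom getter can normalize values before comparison, here by
# comparing only the first two characters of each tag.
coarse = token_attr_accuracy(pairs, "tag", getter=lambda t, a: getattr(t, a)[:2])
print(scores, coarse)
```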

Scorer.score_token_attr_per_feat

Scores a single token attribute per feature, where the attribute value is in UFEATS (Universal Dependencies FEATS) format.

Name	Type	Description
`examples`	`Iterable[Example]`	The `Example` objects holding both the predictions and the correct gold-standard annotations.
`attr`	`str`	The attribute to score.
`getter`	`callable`	Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`.
RETURNS	`Dict`	A dictionary containing the per-feature PRF scores under the key `attr_per_feat`.
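
Per-feature scoring can be sketched as follows for values such as `Case=Nom|Number=Sing`. This is an illustration under stated assumptions rather than the library's code: `parse_feats` and `per_feat_prf` are hypothetical helpers, the inputs are plain strings for tokens that are assumed to be aligned one to one, and the exact counting rules in spaCy may differ.

```python
from collections import defaultdict


def parse_feats(feats):
    """Parse a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def per_feat_prf(pred_feats, gold_feats):
    """Precision, recall and F-score per feature name."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, gold in zip(pred_feats, gold_feats):
        p, g = parse_feats(pred), parse_feats(gold)
        for feat in set(p) | set(g):
            if p.get(feat) == g.get(feat):
                counts[feat]["tp"] += 1
            else:
                if feat in p:
                    counts[feat]["fp"] += 1  # predicted value is wrong or spurious
                if feat in g:
                    counts[feat]["fn"] += 1  # gold value was missed
    scores = {}
    for feat, c in counts.items():
        prec = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        rec = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[feat] = {"p": prec, "r": rec, "f": f}
    return scores


gold = ["Case=Nom|Number=Sing", "Number=Plur", "_"]
pred = ["Case=Nom|Number=Plur", "Number=Plur", "Case=Acc"]
morph_per_feat = per_feat_prf(pred, gold)
print(morph_per_feat)
```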

Scorer.score_spans

Returns PRF scores for labeled or unlabeled spans.

Name	Type	Description
`examples`	`Iterable[Example]`	The `Example` objects holding both the predictions and the correct gold-standard annotations.
`attr`	`str`	The attribute to score.
`getter`	`callable`	Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`.
RETURNS	`Dict`	A dictionary containing the PRF scores under the keys `attr_p/r/f` and the per-type PRF scores under `attr_per_type`.
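
The shape of the returned dictionary can be illustrated with a plain-Python sketch that scores labeled spans represented as `(label, start, end)` tuples. The function `span_scores` and the tuple representation are assumptions for illustration; the real method reads `Span` objects from each `Doc`.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def span_scores(pred, gold, attr="ents"):
    """Overall and per-type PRF for sets of (label, start, end) spans."""
    pred, gold = set(pred), set(gold)
    overall = prf(len(pred & gold), len(pred - gold), len(gold - pred))
    per_type = {}
    for label in {span[0] for span in pred | gold}:
        p = {s for s in pred if s[0] == label}
        g = {s for s in gold if s[0] == label}
        per_type[label] = prf(len(p & g), len(p - g), len(g - p))
    return {
        f"{attr}_p": overall["p"],
        f"{attr}_r": overall["r"],
        f"{attr}_f": overall["f"],
        f"{attr}_per_type": per_type,
    }


gold = [("PERSON", 0, 2), ("ORG", 5, 6), ("GPE", 8, 9)]
pred = [("PERSON", 0, 2), ("ORG", 5, 7), ("GPE", 8, 9)]  # ORG boundary is wrong
scores = span_scores(pred, gold)
print(scores)
```

A span counts as correct only if both its label and its boundaries match, so the `ORG` span with a wrong end offset is both a false positive and a false negative.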

Scorer.score_deps

Calculate the UAS, LAS, and LAS per type scores for dependency parses.

Name	Type	Description
`examples`	`Iterable[Example]`	The `Example` objects holding both the predictions and the correct gold-standard annotations.
`attr`	`str`	The attribute containing the dependency label.
`getter`	`callable`	Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`.
`head_attr`	`str`	The attribute containing the head token.
`head_getter`	`callable`	Defaults to `getattr`. If provided, `head_getter(token, attr)` should return the head for an individual `Token`.
`ignore_labels`	`Tuple`	Labels to ignore while scoring (e.g., `punct`).
RETURNS	`Dict`	A dictionary containing the scores: `attr_uas`, `attr_las`, and `attr_las_per_type`.
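
A simplified sketch of UAS and LAS, including the effect of `ignore_labels`, is shown below. It assumes identical tokenization in the predicted and gold parses and represents each token as a `(head_index, label)` pair; the function `dep_scores` is hypothetical, and the library's handling of misaligned tokens and ignored labels may differ.

```python
def dep_scores(pred, gold, ignore_labels=()):
    """UAS and LAS over aligned lists of (head_index, label) pairs."""
    total = uas = las = 0
    for (p_head, p_label), (g_head, g_label) in zip(pred, gold):
        if g_label in ignore_labels:
            continue  # e.g. skip punctuation
        total += 1
        if p_head == g_head:
            uas += 1  # correct head
            if p_label == g_label:
                las += 1  # correct head and correct label
    return {
        "dep_uas": uas / total if total else 0.0,
        "dep_las": las / total if total else 0.0,
    }


# "She ate the apple ."
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj"), (1, "punct")]
pred = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "nsubj"), (3, "punct")]

scores = dep_scores(pred, gold, ignore_labels=("punct",))
scores_all = dep_scores(pred, gold)
print(scores, scores_all)
```

With `punct` ignored, the wrongly attached full stop no longer counts against UAS, while the mislabeled object still lowers LAS.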

Scorer.score_cats

Calculate PRF and ROC AUC scores for a doc-level attribute that is a dict containing scores for each label, such as Doc.cats.

Name	Type	Description
`examples`	`Iterable[Example]`	The `Example` objects holding both the predictions and the correct gold-standard annotations.
`attr`	`str`	The attribute to score.
`getter`	`callable`	Defaults to `getattr`. If provided, `getter(doc, attr)` should return the cats for an individual `Doc`.
`labels`	`Iterable[str]`	The set of possible labels. Defaults to `[]`.
`multi_label`	`bool`	Whether the attribute allows multiple labels. Defaults to `True`.
`positive_label`	`str`	The positive label for a binary task with exclusive classes. Defaults to `None`.
RETURNS	`Dict`	A dictionary containing the scores: 1) for binary exclusive with positive label: `attr_p/r/f`; 2) for 3+ exclusive classes, macro-averaged fscore: `attr_macro_f`; 3) for multilabel, macro-averaged AUC: `attr_macro_auc`; 4) for all: `attr_f_per_type`, `attr_auc_per_type`.
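
The macro-averaged F-score for three or more exclusive classes (case 2 above) can be sketched in plain Python. The function `macro_f`, the label names and the toy scores are assumptions for illustration; the sketch takes the highest-scoring label as the prediction and leaves out the AUC scores entirely.

```python
def macro_f(pred_cats, gold_labels, labels, attr="cats"):
    """Per-label F-scores and their unweighted mean for exclusive classes."""
    f_per_type = {}
    for label in labels:
        tp = fp = fn = 0
        for cats, gold in zip(pred_cats, gold_labels):
            pred = max(cats, key=cats.get)  # highest-scoring label wins
            if pred == label and gold == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif gold == label:
                fn += 1
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f_per_type[label] = 2 * p * r / (p + r) if p + r else 0.0
    return {
        f"{attr}_macro_f": sum(f_per_type.values()) / len(f_per_type),
        f"{attr}_f_per_type": f_per_type,
    }


labels = ["POS", "NEG", "NEU"]
pred_cats = [
    {"POS": 0.8, "NEG": 0.1, "NEU": 0.1},
    {"POS": 0.2, "NEG": 0.7, "NEU": 0.1},
    {"POS": 0.6, "NEG": 0.3, "NEU": 0.1},  # gold is NEG, predicted POS
    {"POS": 0.1, "NEG": 0.2, "NEU": 0.7},
]
gold_labels = ["POS", "NEG", "NEG", "NEU"]

scores = macro_f(pred_cats, gold_labels, labels)
print(scores)
```

Macro-averaging gives every label equal weight regardless of how often it occurs, so a rare label that is scored badly lowers the result as much as a frequent one.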
