* Refactor the Scorer to improve flexibility

  Refactor the `Scorer` to improve flexibility for arbitrary pipeline components.

  * Individual pipeline components provide their own `evaluate` methods that score a list of `Example`s and return a dictionary of scores
  * `Scorer` is initialized either:
    * with a provided pipeline containing components to be scored
    * with a default pipeline containing the built-in statistical components (senter, tagger, morphologizer, parser, ner)
  * `Scorer.score` evaluates a list of `Example`s and returns a dictionary of scores referring to the scores provided by the components in the pipeline

  Significant differences:

  * `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc` and the new `morph_acc`, `pos_acc`, and `lemma_acc`
  * Scoring is no longer cumulative: `Scorer.score` scores a list of examples rather than a single example and does not retain any state about previously scored examples
  * PRF values in the returned scores are no longer multiplied by 100

* Add kwargs to Morphologizer.evaluate

* Create generalized scoring methods in Scorer

  * Generalized static scoring methods are added to `Scorer`
  * Methods require an attribute (either on Token or Doc) that is used to key the returned scores

  Naming differences:

  * `uas`, `las`, and `las_per_type` in the scores dict are renamed to `dep_uas`, `dep_las`, and `dep_las_per_type`

  Scoring differences:

  * `Doc.sents` is now scored as spans rather than on sentence-initial token positions so that `Doc.sents` and `Doc.ents` can be scored with the same method (this lowers scores since a single incorrect sentence start results in two incorrect spans)

* Simplify / extend hasattr check for eval method

  * Add hasattr check to tokenizer scoring
  * Simplify to hasattr check for component scoring

* Reset Example alignment if docs are set

  Reset the Example alignment if either doc is set, in case the tokenization has changed.

* Add PRF tokenization scoring for tokens as spans

  Add PRF scores for tokens as character spans. The scores are:

  * token_acc: # correct tokens / # gold tokens
  * token_p/r/f: PRF for (token.idx, token.idx + len(token))

* Add docstring to Scorer.score_tokenization

* Rename component.evaluate() to component.score()

* Update Scorer API docs

* Update scoring for positive_label in textcat

* Fix TextCategorizer.score kwargs

* Update Language.evaluate docs

* Update score names in default config
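Taken together, the new flow looks roughly like this — a minimal sketch, assuming `nlp` is a loaded pipeline and `examples` is a list of `Example` objects:

```python
from spacy.scorer import Scorer

# Score a whole list of Examples in one call; no state is kept between
# calls, and PRF values are now in the 0.0-1.0 range (not multiplied by 100).
scorer = Scorer(nlp)
scores = scorer.score(examples)

# Renamed keys: "tag_acc" (was "tags_acc"), "dep_uas"/"dep_las" (were "uas"/"las")
print(scores["tag_acc"], scores["dep_uas"], scores["dep_las"])
```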
| title  | teaser                    | tag   | source          |
| ------ | ------------------------- | ----- | --------------- |
| Scorer | Compute evaluation scores | class | spacy/scorer.py |
The `Scorer` computes evaluation scores. It's typically created by `Language.evaluate`. In addition, the `Scorer` provides a number of evaluation methods for evaluating `Token` and `Doc` attributes.
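For example — a minimal sketch, assuming `examples` is a list of `Example` objects holding predicted and gold-standard docs, and that `Language.evaluate` returns the dictionary of scores produced by the `Scorer`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Language.evaluate creates a Scorer internally and scores the examples
scores = nlp.evaluate(examples)
print(scores["token_acc"], scores["ents_f"])
```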
## Scorer.__init__

Create a new `Scorer`.
#### Example

```python
import spacy
from spacy.scorer import Scorer

# default scoring pipeline
scorer = Scorer()

# provided scoring pipeline
nlp = spacy.load("en_core_web_sm")
scorer = Scorer(nlp)
```
| Name        | Type       | Description |
| ----------- | ---------- | ----------- |
| `nlp`       | `Language` | The pipeline to use for scoring, where each pipeline component may provide a scoring method. If none is provided, then a default pipeline for the multi-language code `xx` is constructed containing: `senter`, `tagger`, `morphologizer`, `parser`, `ner`, `textcat`. |
| **RETURNS** | `Scorer`   | The newly created object. |
## Scorer.score

Calculate the scores for a list of `Example` objects using the scoring methods provided by the components in the pipeline.

The returned `Dict` contains the scores provided by the individual pipeline components. For the scoring methods provided by the `Scorer` and used by the core pipeline components, the individual score names start with the `Token` or `Doc` attribute being scored: `token_acc`, `token_p/r/f`, `sents_p/r/f`, `tag_acc`, `pos_acc`, `morph_acc`, `morph_per_feat`, `lemma_acc`, `dep_uas`, `dep_las`, `dep_las_per_type`, `ents_p/r/f`, `ents_per_type`, `textcat_macro_auc`, `textcat_macro_f`.
#### Example

```python
scorer = Scorer()
scorer.score(examples)
```
| Name        | Type                | Description |
| ----------- | ------------------- | ----------- |
| `examples`  | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
| **RETURNS** | `Dict`              | A dictionary of scores. |
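For example, a few of the keys listed above can be pulled out of the returned dictionary like this (a sketch, assuming `examples` as above; which keys are present depends on the components in the scoring pipeline):

```python
from spacy.scorer import Scorer

scorer = Scorer()
scores = scorer.score(examples)

# token- and span-level scores, keyed by the attribute being scored
print(scores["token_acc"], scores["sents_f"])
print(scores["tag_acc"], scores["pos_acc"], scores["morph_acc"])
print(scores["dep_uas"], scores["dep_las"])
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
```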
## Scorer.score_tokenization

Scores the tokenization:

- `token_acc`: # correct tokens / # gold tokens
- `token_p/r/f`: PRF for token character spans
| Name        | Type                | Description |
| ----------- | ------------------- | ----------- |
| `examples`  | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
| **RETURNS** | `Dict`              | A dictionary containing the scores `token_acc/p/r/f`. |
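A usage sketch, assuming `examples` is an iterable of `Example` objects and calling the method as one of the generalized static scoring methods mentioned above:

```python
from spacy.scorer import Scorer

scores = Scorer.score_tokenization(examples)
# token_acc: # correct tokens / # gold tokens
# token_p/r/f: PRF over token character spans
print(scores["token_acc"], scores["token_p"], scores["token_r"], scores["token_f"])
```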
## Scorer.score_token_attr

Scores a single token attribute.
| Name        | Type                | Description |
| ----------- | ------------------- | ----------- |
| `examples`  | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
| `attr`      | `str`               | The attribute to score. |
| `getter`    | `callable`          | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. |
| **RETURNS** | `Dict`              | A dictionary containing the score `attr_acc`. |
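For instance, to score fine-grained tags or coarse-grained POS tags (a sketch, assuming `examples` as above; the returned key is prefixed with whatever `attr` you pass):

```python
from spacy.scorer import Scorer

# Default getter compares getattr(token, "tag"); keyed in the result as "tag_acc"
scores = Scorer.score_token_attr(examples, "tag")
print(scores["tag_acc"])

# A custom getter can be supplied, e.g. to compare the string values instead
scores = Scorer.score_token_attr(examples, "pos", getter=lambda t, attr: t.pos_)
print(scores["pos_acc"])
```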
## Scorer.score_token_attr_per_feat

Scores a single token attribute in UFEATS format, broken down per feature.
| Name        | Type                | Description |
| ----------- | ------------------- | ----------- |
| `examples`  | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
| `attr`      | `str`               | The attribute to score. |
| `getter`    | `callable`          | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. |
| **RETURNS** | `Dict`              | A dictionary containing the per-feature PRF scores under the key `attr_per_feat`. |
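For example, scoring morphological features per feature — a sketch that assumes the default getter yields a UFEATS string such as `Case=Nom|Number=Sing` for the attribute; otherwise a custom getter should be passed:

```python
from spacy.scorer import Scorer

# Per-feature PRF scores (e.g. Case, Number, Tense) under "morph_per_feat"
scores = Scorer.score_token_attr_per_feat(examples, "morph")
print(scores["morph_per_feat"])
```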
## Scorer.score_spans

Returns PRF scores for labeled or unlabeled spans.
| Name        | Type                | Description |
| ----------- | ------------------- | ----------- |
| `examples`  | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
| `attr`      | `str`               | The attribute to score. |
| `getter`    | `callable`          | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. |
| **RETURNS** | `Dict`              | A dictionary containing the PRF scores under the keys `attr_p/r/f` and the per-type PRF scores under `attr_per_type`. |
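For example, named entities and sentence boundaries can both be scored as spans — a sketch, assuming `examples` as above and docs that have the relevant annotation set:

```python
from spacy.scorer import Scorer

# Doc.ents as labeled spans: "ents_p/r/f" and "ents_per_type"
scores = Scorer.score_spans(examples, "ents")
print(scores["ents_f"], scores["ents_per_type"])

# Doc.sents as unlabeled spans: "sents_p/r/f"
scores = Scorer.score_spans(examples, "sents")
print(scores["sents_f"])
```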
## Scorer.score_deps

Calculate the UAS, LAS, and LAS per type scores for dependency parses.
| Name            | Type                | Description |
| --------------- | ------------------- | ----------- |
| `examples`      | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
| `attr`          | `str`               | The attribute containing the dependency label. |
| `getter`        | `callable`          | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. |
| `head_attr`     | `str`               | The attribute containing the head token. |
| `head_getter`   | `callable`          | Defaults to `getattr`. If provided, `head_getter(token, attr)` should return the head for an individual `Token`. |
| `ignore_labels` | `Tuple`             | Labels to ignore while scoring (e.g. `punct`). |
| **RETURNS**     | `Dict`              | A dictionary containing the scores: `attr_uas`, `attr_las`, and `attr_las_per_type`. |
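A usage sketch for dependency scores, assuming `examples` as above; the returned keys are prefixed with the `attr` passed in (here `dep_uas`, `dep_las`, and `dep_las_per_type`, matching the renamed keys noted at the top):

```python
from spacy.scorer import Scorer

scores = Scorer.score_deps(
    examples,
    "dep",                     # attribute holding the dependency label
    head_attr="head",          # attribute holding the head token
    ignore_labels=("punct",),  # skip punctuation while scoring
)
print(scores["dep_uas"], scores["dep_las"], scores["dep_las_per_type"])
```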
## Scorer.score_cats

Calculate PRF and ROC AUC scores for a doc-level attribute that is a dict containing scores for each label, like `Doc.cats`.
| Name             | Type                | Description |
| ---------------- | ------------------- | ----------- |
| `examples`       | `Iterable[Example]` | The `Example` objects holding both the predictions and the correct gold-standard annotations. |
| `attr`           | `str`               | The attribute to score. |
| `getter`         | `callable`          | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the cats for an individual `Doc`. |
| `labels`         | `Iterable[str]`     | The set of possible labels. Defaults to `[]`. |
| `multi_label`    | `bool`              | Whether the attribute allows multiple labels. Defaults to `True`. |
| `positive_label` | `str`               | The positive label for a binary task with exclusive classes. Defaults to `None`. |
| **RETURNS**      | `Dict`              | A dictionary containing the scores: 1) for binary exclusive with positive label: `attr_p/r/f`; 2) for 3+ exclusive classes, macro-averaged fscore: `attr_macro_f`; 3) for multilabel, macro-averaged AUC: `attr_macro_auc`; 4) for all: `attr_f_per_type` and `attr_auc_per_type`. |
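A usage sketch for a hypothetical multilabel setup with two labels, assuming `examples` as above; the returned keys are prefixed with the `attr` passed in:

```python
from spacy.scorer import Scorer

scores = Scorer.score_cats(
    examples,
    "cats",
    labels=["POSITIVE", "NEGATIVE"],
    multi_label=True,
)
# multilabel case: macro-averaged AUC plus per-type breakdowns
print(scores["cats_macro_auc"])
print(scores["cats_f_per_type"], scores["cats_auc_per_type"])
```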