title | teaser | tag | source |
---|---|---|---|
Scorer | Compute evaluation scores | class | spacy/scorer.py |
The Scorer computes evaluation scores. It's typically created by Language.evaluate. In addition, the Scorer provides a number of evaluation methods for evaluating Token and Doc attributes.
Scorer.__init__
Create a new Scorer.
Example
import spacy
from spacy.scorer import Scorer

# Default scoring pipeline
scorer = Scorer()

# Provided scoring pipeline
nlp = spacy.load("en_core_web_sm")
scorer = Scorer(nlp)
Name | Description |
---|---|
nlp | The pipeline to use for scoring, where each pipeline component may provide a scoring method. If none is provided, then a default pipeline for the multi-language code xx is constructed containing: senter, tagger, morphologizer, parser, ner, textcat. |
Scorer.score
Calculate the scores for a list of Example objects using the scoring methods provided by the components in the pipeline.

The returned Dict contains the scores provided by the individual pipeline components. For the scoring methods provided by the Scorer and used by the core pipeline components, the individual score names start with the Token or Doc attribute being scored:
- token_acc, token_p, token_r, token_f, sents_p, sents_r, sents_f
- tag_acc, pos_acc, morph_acc, morph_per_feat, lemma_acc
- dep_uas, dep_las, dep_las_per_type
- ents_p, ents_r, ents_f, ents_per_type
- textcat_macro_auc, textcat_macro_f
Example
scorer = Scorer()
scores = scorer.score(examples)
Name | Description |
---|---|
examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
RETURNS | A dictionary of scores. |
Scorer.score_tokenization
Scores the tokenization:
- token_acc: number of correct tokens / number of gold tokens
- token_p, token_r, token_f: precision, recall and F-score for token character spans
Example
scores = Scorer.score_tokenization(examples)
Name | Description |
---|---|
examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
RETURNS | A dictionary containing the scores token_acc, token_p, token_r and token_f. |
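To make the span-based definitions above concrete, here is a minimal pure-Python sketch of precision, recall and F-score over token character spans. It is not spaCy's implementation: the helper name `prf_from_spans` and the `(start, end)` tuple representation are assumptions made for illustration.

```python
def prf_from_spans(gold_spans, pred_spans):
    """Return (precision, recall, F-score) for two collections of (start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # spans with identical character offsets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Text: "I can't go"
# Gold tokens: "I", "ca", "n't", "go"
gold = [(0, 1), (2, 4), (4, 7), (8, 10)]
# Predicted tokens: "I", "can't", "go"
pred = [(0, 1), (2, 7), (8, 10)]
p, r, f = prf_from_spans(gold, pred)
print(p, r, f)
```

Two of the three predicted tokens match a gold token exactly, and two of the four gold tokens are recovered, so precision and recall differ even though only one tokenization decision was wrong.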
Scorer.score_token_attr
Scores a single token attribute.
Example
scores = Scorer.score_token_attr(examples, "pos")
print(scores["pos_acc"])
Name | Description |
---|---|
examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
attr | The attribute to score. |
keyword-only | |
getter | Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. |
RETURNS | A dictionary containing the score {attr}_acc. |
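The `getter(token, attr)` contract can be illustrated with a small stand-alone sketch. It assumes the gold and predicted tokens are already aligned one-to-one, and it uses `SimpleNamespace` objects as stand-ins for tokens; `attr_accuracy` and `lower_getter` are hypothetical names, not part of spaCy.

```python
from types import SimpleNamespace

def attr_accuracy(gold_tokens, pred_tokens, attr, getter=getattr):
    """Fraction of aligned token pairs whose attribute values match."""
    if not gold_tokens:
        return 0.0
    correct = sum(
        getter(gold, attr) == getter(pred, attr)
        for gold, pred in zip(gold_tokens, pred_tokens)
    )
    return correct / len(gold_tokens)

def lower_getter(token, attr):
    # Same call shape as the getter argument: (token, attr) -> value
    return getattr(token, attr).lower()

# Stand-in tokens with a string-valued "pos" attribute (an assumption)
gold = [SimpleNamespace(pos=p) for p in ("NOUN", "VERB", "DET")]
pred = [SimpleNamespace(pos=p) for p in ("noun", "VERB", "ADJ")]

strict_acc = attr_accuracy(gold, pred, "pos")
relaxed_acc = attr_accuracy(gold, pred, "pos", getter=lower_getter)
print(strict_acc, relaxed_acc)
```

Swapping the getter changes what counts as a match without touching the scoring loop, which is the reason for exposing it as a keyword argument.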
Scorer.score_token_attr_per_feat
Scores a single token attribute per feature, for a token attribute in the Universal Dependencies FEATS format.
Example
scores = Scorer.score_token_attr_per_feat(examples, "morph")
print(scores["morph_per_feat"])
Name | Description |
---|---|
examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
attr | The attribute to score. |
keyword-only | |
getter | Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. |
RETURNS | A dictionary containing the per-feature PRF scores under the key {attr}_per_feat. |
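As a rough illustration of per-feature scoring, the sketch below splits FEATS strings such as `Case=Nom|Number=Sing` into individual features and counts matches per feature name. It assumes aligned tokens and is not spaCy's implementation; `split_feats` and `per_feat_counts` are hypothetical helpers.

```python
from collections import defaultdict

def split_feats(feats):
    """Split a UD FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def per_feat_counts(gold_feats, pred_feats):
    """Count true positives, false positives and false negatives per feature name."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold, pred in zip(gold_feats, pred_feats):
        gold_d, pred_d = split_feats(gold), split_feats(pred)
        for name in set(gold_d) | set(pred_d):
            if name in gold_d and gold_d[name] == pred_d.get(name):
                counts[name]["tp"] += 1
            else:
                if name in pred_d:
                    counts[name]["fp"] += 1  # predicted value is wrong or spurious
                if name in gold_d:
                    counts[name]["fn"] += 1  # gold value was missed
    return dict(counts)

gold = ["Case=Nom|Number=Sing", "Number=Plur", "_"]
pred = ["Case=Nom|Number=Plur", "Number=Plur", "Tense=Past"]
counts = per_feat_counts(gold, pred)
print(counts)
```

Precision, recall and F-score for each feature then follow from its own counts, so a weak `Number` prediction does not hide a strong `Case` prediction.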
Scorer.score_spans
Returns PRF scores for labeled or unlabeled spans.
Example
scores = Scorer.score_spans(examples, "ents")
print(scores["ents_f"])
Name | Description |
---|---|
examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
attr | The attribute to score. |
keyword-only | |
getter | Defaults to getattr. If provided, getter(doc, attr) should return the Span objects for an individual Doc. |
RETURNS | A dictionary containing the PRF scores under the keys {attr}_p, {attr}_r, {attr}_f and the per-type PRF scores under {attr}_per_type. |
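The relationship between the overall PRF scores and the per-type scores can be sketched in plain Python. The example below represents spans as `(start, end, label)` tuples, which is an assumption for illustration; `prf` and `score_labeled_spans` are hypothetical helpers rather than spaCy functions.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

def score_labeled_spans(gold, pred):
    """Score (start, end, label) spans overall and per label."""
    gold, pred = set(gold), set(pred)
    overall = prf(len(gold & pred), len(pred - gold), len(gold - pred))
    per_type = {}
    for label in sorted({span[2] for span in gold | pred}):
        g = {span for span in gold if span[2] == label}
        p = {span for span in pred if span[2] == label}
        per_type[label] = prf(len(g & p), len(p - g), len(g - p))
    return overall, per_type

gold = [(0, 2, "PERSON"), (5, 6, "ORG"), (9, 10, "GPE")]
pred = [(0, 2, "PERSON"), (5, 6, "GPE"), (9, 10, "GPE")]
overall, per_type = score_labeled_spans(gold, pred)
print(overall)
print(per_type)
```

A span with the right boundaries but the wrong label counts as a false negative for the gold label and a false positive for the predicted label, which is why the per-type breakdown is often more informative than the overall F-score.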
Scorer.score_deps
Calculate the UAS, LAS, and LAS per type scores for dependency parses.
Example
def dep_getter(token, attr):
    dep = getattr(token, attr)
    dep = token.vocab.strings.as_string(dep).lower()
    return dep

scores = Scorer.score_deps(
    examples,
    "dep",
    getter=dep_getter,
    ignore_labels=("p", "punct")
)
print(scores["dep_uas"], scores["dep_las"])
Name | Description |
---|---|
examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
attr | The attribute to score. |
keyword-only | |
getter | Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. |
head_attr | The attribute containing the head token. |
head_getter | Defaults to getattr. If provided, head_getter(token, attr) should return the head for an individual Token. |
ignore_labels | Labels to ignore while scoring (e.g. "punct"). |
RETURNS | A dictionary containing the scores: {attr}_uas, {attr}_las, and {attr}_las_per_type. |
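For readers unfamiliar with the two attachment scores, the following simplified sketch shows how UAS and LAS differ and what `ignore_labels` does. It assumes identical gold and predicted tokenization and represents each token as a `(head_index, dep_label)` pair; `uas_las` is a hypothetical helper, not how spaCy computes the scores internally.

```python
def uas_las(gold, pred, ignore_labels=()):
    """Return (UAS, LAS) for aligned lists of (head_index, dep_label) pairs."""
    total = unlabeled = labeled = 0
    for (gold_head, gold_dep), (pred_head, pred_dep) in zip(gold, pred):
        if gold_dep.lower() in ignore_labels:
            continue  # e.g. skip punctuation tokens entirely
        total += 1
        if gold_head == pred_head:
            unlabeled += 1  # correct head, label not checked
            if gold_dep.lower() == pred_dep.lower():
                labeled += 1  # correct head and correct label
    if total == 0:
        return 0.0, 0.0
    return unlabeled / total, labeled / total

# "She ate pizza ." with "ate" (index 1) as the root
gold = [(1, "nsubj"), (1, "ROOT"), (1, "dobj"), (1, "punct")]
pred = [(1, "nsubj"), (1, "ROOT"), (1, "nmod"), (2, "punct")]

uas_all, las_all = uas_las(gold, pred)
uas_nopunct, las_nopunct = uas_las(gold, pred, ignore_labels=("p", "punct"))
print(uas_all, las_all, uas_nopunct, las_nopunct)
```

Ignoring punctuation removes the misattached final token from the totals, so both scores rise in this example.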
Scorer.score_cats
Calculate PRF and ROC AUC scores for a doc-level attribute that is a dict containing scores for each label like Doc.cats. The returned dictionary contains the following scores:
- {attr}_micro_p, {attr}_micro_r and {attr}_micro_f: each instance across each label is weighted equally
- {attr}_macro_p, {attr}_macro_r and {attr}_macro_f: the average values across evaluations per label
- {attr}_f_per_type and {attr}_auc_per_type: each contains a dictionary of scores, keyed by label
- A final {attr}_score and corresponding {attr}_score_desc (text description)

The reported {attr}_score depends on the classification properties:
- binary exclusive with positive label: {attr}_score is set to the F-score of the positive label
- 3+ exclusive classes, macro-averaged F-score: {attr}_score = {attr}_macro_f
- multilabel, macro-averaged AUC: {attr}_score = {attr}_macro_auc
Example
labels = ["LABEL_A", "LABEL_B", "LABEL_C"]
scores = Scorer.score_cats(
    examples,
    "cats",
    labels=labels
)
print(scores["cats_macro_auc"])
Name | Description |
---|---|
examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
attr | The attribute to score. |
keyword-only | |
getter | Defaults to getattr. If provided, getter(doc, attr) should return the cats for an individual Doc. |
labels | The set of possible labels. Defaults to []. |
multi_label | Whether the attribute allows multiple labels. Defaults to True. |
positive_label | The positive label for a binary task with exclusive classes. Defaults to None. |
RETURNS | A dictionary containing the scores, with inapplicable scores as None. |
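The difference between the micro and macro scores is easiest to see with unbalanced labels. The sketch below computes both F-scores from per-label `(tp, fp, fn)` counts; the counts are invented for illustration, and `f_score` and `micro_macro_f` are hypothetical helpers rather than part of spaCy.

```python
def f_score(tp, fp, fn):
    """F-score from raw counts, 0.0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f(counts):
    """counts maps each label to a (tp, fp, fn) tuple."""
    # Micro: pool the counts of all labels, then compute a single F-score.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f_score(tp, fp, fn)
    # Macro: compute one F-score per label, then take the unweighted mean.
    macro = sum(f_score(*c) for c in counts.values()) / len(counts)
    return micro, macro

# A frequent, well-predicted label and a rare, poorly predicted one
counts = {"LABEL_A": (90, 10, 10), "LABEL_B": (1, 4, 4)}
micro, macro = micro_macro_f(counts)
print(micro, macro)
```

The micro score is dominated by the frequent label, while the macro score gives the rare label equal weight and therefore drops sharply.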
Scorer.score_links
Returns PRF for predicted links on the entity level. To disentangle the performance of the NEL from the NER, this method only evaluates NEL links for entities that overlap between the gold reference and the predictions.
Example
scores = Scorer.score_links(
    examples,
    negative_labels=["NIL", ""]
)
print(scores["nel_micro_f"])
Name | Description |
---|---|
examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
keyword-only | |
negative_labels | The string values that refer to no annotation (e.g. "NIL"). |
RETURNS | A dictionary containing the scores. |
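The idea of scoring the linker only on entities shared by gold and prediction can be sketched as follows. This is a simplified illustration, not spaCy's implementation: it treats "overlap" as identical `(start, end)` offsets, represents links as plain dicts from offsets to KB identifiers, and uses the hypothetical helper name `link_counts`.

```python
def link_counts(gold_links, pred_links, negative_labels=("NIL", "")):
    """Count (tp, fp, fn) for links on entities present in both inputs.

    Both arguments map (start, end) entity offsets to a KB identifier.
    """
    tp = fp = fn = 0
    # Entities missed or invented by the NER are skipped, so they
    # do not count for or against the entity linker.
    for span in gold_links.keys() & pred_links.keys():
        gold_id, pred_id = gold_links[span], pred_links[span]
        gold_neg = gold_id in negative_labels
        pred_neg = pred_id in negative_labels
        if gold_neg and pred_neg:
            continue  # no link expected and none predicted
        if gold_id == pred_id:
            tp += 1
        else:
            if not pred_neg:
                fp += 1  # a wrong link was predicted
            if not gold_neg:
                fn += 1  # the correct link was missed
    return tp, fp, fn

gold_links = {(0, 5): "Q1", (10, 15): "Q2", (20, 25): "NIL", (30, 35): "Q9"}
pred_links = {(0, 5): "Q1", (10, 15): "Q3", (20, 25): "NIL", (40, 45): "Q7"}
result = link_counts(gold_links, pred_links)
print(result)
```

The entities at `(30, 35)` and `(40, 45)` appear on only one side, so they are left out of the counts entirely.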