| title | teaser | tag | source |
|---|---|---|---|
| Scorer | Compute evaluation scores | class | spacy/scorer.py |
The Scorer computes evaluation scores. It's typically created by
Language.evaluate. In addition, the Scorer
provides a number of methods for evaluating Token and
Doc attributes.
Scorer.__init__
Create a new Scorer.
Example
import spacy
from spacy.scorer import Scorer

# Default scoring pipeline
scorer = Scorer()

# Provided scoring pipeline
nlp = spacy.load("en_core_web_sm")
scorer = Scorer(nlp)
| Name | Description |
|---|---|
| nlp | The pipeline to use for scoring, where each pipeline component may provide a scoring method. If none is provided, then a default pipeline is constructed using the default_lang and default_pipeline settings. |
| default_lang | The language to use for a default pipeline if nlp is not provided. Defaults to xx. |
| default_pipeline | The pipeline components to use for a default pipeline if nlp is not provided. Defaults to ("senter", "tagger", "morphologizer", "parser", "ner", "textcat"). |
| keyword-only | |
| \*\*kwargs | Any additional settings to pass on to the individual scoring methods. |
Scorer.score
Calculate the scores for a list of Example objects using the
scoring methods provided by the components in the pipeline.
The returned Dict contains the scores provided by the individual pipeline
components. For the scoring methods provided by the Scorer and used by the
core pipeline components, the individual score names start with the Token or
Doc attribute being scored:
- token_acc, token_p, token_r, token_f
- sents_p, sents_r, sents_f
- tag_acc
- pos_acc
- morph_acc, morph_micro_p, morph_micro_r, morph_micro_f, morph_per_feat
- lemma_acc
- dep_uas, dep_las, dep_las_per_type
- ents_p, ents_r, ents_f, ents_per_type
- spans_sc_p, spans_sc_r, spans_sc_f
- cats_score (depends on config, description provided in cats_score_desc), cats_micro_p, cats_micro_r, cats_micro_f, cats_macro_p, cats_macro_r, cats_macro_f, cats_macro_auc, cats_f_per_type, cats_auc_per_type
Example
scorer = Scorer()
scores = scorer.score(examples)
| Name | Description |
|---|---|
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
| RETURNS | A dictionary of scores. |
Scorer.score_tokenization
Scores the tokenization:
- token_acc: number of correct tokens / number of gold tokens
- token_p, token_r, token_f: precision, recall and F-score for token character spans
Docs with has_unknown_spaces are skipped during scoring.
Example
scores = Scorer.score_tokenization(examples)
| Name | Description |
|---|---|
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
| RETURNS | A dictionary containing the scores token_acc, token_p, token_r, token_f. |
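As an illustration of what precision, recall and F-score over token character spans mean, here is a minimal sketch in plain Python. It is not the spaCy implementation: the helper `prf` and the sample offsets are hypothetical, and each token is reduced to a `(start, end)` character span.

```python
def prf(pred_items, gold_items):
    """Precision, recall and F-score of predicted items against gold items."""
    pred, gold = set(pred_items), set(gold_items)
    tp = len(pred & gold)  # items found in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


# Hypothetical data: (start, end) character offsets for "New York is big".
gold_spans = [(0, 3), (4, 8), (9, 11), (12, 15)]
# The prediction wrongly merges "New York" into a single token.
pred_spans = [(0, 8), (9, 11), (12, 15)]

token_p, token_r, token_f = prf(pred_spans, gold_spans)
print(token_p, token_r, token_f)
```

Comparing character spans rather than token indices keeps the comparison meaningful when the predicted and gold tokenizations have different lengths.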
Scorer.score_token_attr
Scores a single token attribute. Tokens with missing values in the reference doc are skipped during scoring.
Example
scores = Scorer.score_token_attr(examples, "pos")
print(scores["pos_acc"])
| Name | Description |
|---|---|
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
| attr | The attribute to score. |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. |
| missing_values | Attribute values to treat as missing annotation in the reference annotation. Defaults to {0, None, ""}. |
| RETURNS | A dictionary containing the score {attr}_acc. |
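The effect of skipping tokens with missing reference values can be sketched in plain Python. This is an illustration only, not the library's code: `attr_accuracy` is a hypothetical helper, the predicted and reference tokens are assumed to line up one-to-one, and returning None when nothing can be scored is a choice made for this sketch.

```python
MISSING_VALUES = {0, None, ""}  # mirrors the documented default


def attr_accuracy(pred_values, gold_values, missing_values=MISSING_VALUES):
    """Accuracy over the tokens whose reference value is not missing."""
    scored = correct = 0
    for pred, gold in zip(pred_values, gold_values):
        if gold in missing_values:
            continue  # no reference annotation, so the token is not scored
        scored += 1
        correct += int(pred == gold)
    # Sketch assumption: report None when no token could be scored.
    return correct / scored if scored else None


# Hypothetical data: two of the four reference tags are missing.
gold_pos = ["NOUN", "", "VERB", None]
pred_pos = ["NOUN", "ADJ", "ADV", "X"]

pos_acc = attr_accuracy(pred_pos, gold_pos)
print(pos_acc)
```

Only the first and third tokens are scored here, so predictions on unannotated tokens neither help nor hurt the result.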
Scorer.score_token_attr_per_feat
Scores a single token attribute per feature, for a token attribute in the Universal Dependencies FEATS format (for example, Case=Nom|Number=Sing). Tokens with missing values in the reference doc are skipped during scoring.
Example
scores = Scorer.score_token_attr_per_feat(examples, "morph")
print(scores["morph_per_feat"])
| Name | Description |
|---|---|
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
| attr | The attribute to score. |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. |
| missing_values | Attribute values to treat as missing annotation in the reference annotation. Defaults to {0, None, ""}. |
| RETURNS | A dictionary containing the micro PRF scores under the key {attr}_micro_p/r/f and the per-feature PRF scores under {attr}_per_feat. |
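To make the difference between the per-feature and the micro PRF scores concrete, the following sketch splits FEATS strings into individual features and counts matches per field. It is a simplified illustration, not the spaCy implementation: the helper names and sample data are hypothetical, and tokens are assumed to line up one-to-one.

```python
from collections import defaultdict


def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


def feats_by_field(morphs):
    """Map each FEATS field to a set of (token index, "Field=Value") items."""
    per_field = defaultdict(set)
    for i, morph in enumerate(morphs):
        for feat in (morph.split("|") if morph else []):
            per_field[feat.split("=")[0]].add((i, feat))
    return per_field


def score_feats(pred_morphs, gold_morphs):
    # Skip tokens whose reference value is missing (None or "").
    pairs = [(p, g) for p, g in zip(pred_morphs, gold_morphs) if g]
    pred = feats_by_field([p for p, _ in pairs])
    gold = feats_by_field([g for _, g in pairs])
    per_feat, totals = {}, [0, 0, 0]
    for field in sorted(set(pred) | set(gold)):
        p_set, g_set = pred.get(field, set()), gold.get(field, set())
        counts = (len(p_set & g_set), len(p_set - g_set), len(g_set - p_set))
        per_feat[field] = prf(*counts)
        totals = [t + c for t, c in zip(totals, counts)]
    # Micro scores pool the counts of all features before computing PRF.
    return prf(*totals), per_feat


# Hypothetical data: the third token has no reference annotation.
gold = ["Case=Nom|Number=Sing", "Number=Plur", None]
pred = ["Case=Nom|Number=Plur", "Number=Plur", "Tense=Past"]

micro, per_feat = score_feats(pred, gold)
print(micro, per_feat)
```

Because every feature is counted on its own, a token with one wrong feature out of two still earns partial credit, which a whole-string accuracy score would not give.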
Scorer.score_spans
Returns PRF scores for labeled or unlabeled spans.
Example
scores = Scorer.score_spans(examples, "ents")
print(scores["ents_f"])
| Name | Description |
|---|---|
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
| attr | The attribute to score. |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(doc, attr) should return the Span objects for an individual Doc. |
| has_annotation | Defaults to None. If provided, has_annotation(doc) should return whether a Doc has annotation for this attr. Docs without annotation are skipped for scoring purposes. |
| labeled | Defaults to True. If set to False, two spans will be considered equal if their start and end match, irrespective of their label. |
| allow_overlap | Defaults to False. Whether or not to allow overlapping spans. If set to False, the alignment will automatically resolve conflicts. |
| RETURNS | A dictionary containing the PRF scores under the keys {attr}_p, {attr}_r, {attr}_f and the per-type PRF scores under {attr}_per_type. |
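The effect of the labeled setting can be sketched with spans reduced to `(start, end, label)` tuples. This is a plain-Python illustration under that assumption, not the library's code; `span_prf` and the sample spans are hypothetical.

```python
def span_prf(pred_spans, gold_spans, labeled=True):
    """PRF over (start, end, label) tuples; labels are ignored if labeled=False."""

    def key(span):
        start, end, label = span
        return (start, end, label) if labeled else (start, end)

    pred = {key(span) for span in pred_spans}
    gold = {key(span) for span in gold_spans}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


# Hypothetical data: the second span has the right boundaries but the wrong label.
gold_ents = [(0, 2, "PERSON"), (5, 6, "ORG")]
pred_ents = [(0, 2, "PERSON"), (5, 6, "GPE")]

labeled_scores = span_prf(pred_ents, gold_ents, labeled=True)
unlabeled_scores = span_prf(pred_ents, gold_ents, labeled=False)
print(labeled_scores, unlabeled_scores)
```

Comparing the two results shows how much of the error comes from span boundaries and how much from label assignment.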
Scorer.score_deps
Calculate the UAS, LAS, and LAS per type scores for dependency parses. Tokens
with missing values for the attr (typically dep) are skipped during scoring.
Example
def dep_getter(token, attr):
    dep = getattr(token, attr)
    dep = token.vocab.strings.as_string(dep).lower()
    return dep

scores = Scorer.score_deps(
    examples,
    "dep",
    getter=dep_getter,
    ignore_labels=("p", "punct")
)
print(scores["dep_uas"], scores["dep_las"])
| Name | Description |
|---|---|
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
| attr | The attribute to score. |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(token, attr) should return the value of the attribute for an individual Token. |
| head_attr | The attribute containing the head token. |
| head_getter | Defaults to getattr. If provided, head_getter(token, attr) should return the head for an individual Token. |
| ignore_labels | Labels to ignore while scoring (e.g. "punct"). |
| missing_values | Attribute values to treat as missing annotation in the reference annotation. Defaults to {0, None, ""}. |
| RETURNS | A dictionary containing the scores: {attr}_uas, {attr}_las, and {attr}_las_per_type. |
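As a rough illustration of unlabeled and labeled attachment scores, the sketch below treats each token as a `(head index, dependency label)` pair. It assumes the predicted and reference tokens line up one-to-one, in which case both scores reduce to simple accuracies. The helper `attachment_scores` and the sample parse are hypothetical, not the library's code.

```python
def attachment_scores(pred, gold, ignore_labels=(), missing_values=(0, None, "")):
    """Return (UAS, LAS) over (head index, dep label) pairs, one per token."""
    scored = heads_ok = labeled_ok = 0
    for (pred_head, pred_dep), (gold_head, gold_dep) in zip(pred, gold):
        if gold_dep in missing_values:
            continue  # no reference annotation for this token
        if gold_dep in ignore_labels:
            continue  # e.g. punctuation, excluded from scoring
        scored += 1
        if pred_head == gold_head:
            heads_ok += 1  # counts towards UAS
            if pred_dep == gold_dep:
                labeled_ok += 1  # head and label both correct: counts towards LAS
    if not scored:
        return None, None  # sketch assumption: nothing to score
    return heads_ok / scored, labeled_ok / scored


# Hypothetical parse of a four-token sentence whose root is token 1.
gold_parse = [(1, "nsubj"), (1, "ROOT"), (1, "dobj"), (1, "punct")]
pred_parse = [(1, "nsubj"), (1, "ROOT"), (1, "nmod"), (2, "punct")]

uas, las = attachment_scores(pred_parse, gold_parse, ignore_labels=("punct",))
uas_all, las_all = attachment_scores(pred_parse, gold_parse)
print(uas, las, uas_all, las_all)
```

Ignoring punctuation removes the misattached final token from the count, which is why the first pair of scores is higher than the second.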
Scorer.score_cats
Calculate PRF and ROC AUC scores for a doc-level attribute that is a dict
containing scores for each label like Doc.cats. The returned dictionary
contains the following scores:
- {attr}_micro_p, {attr}_micro_r and {attr}_micro_f: each instance across each label is weighted equally
- {attr}_macro_p, {attr}_macro_r and {attr}_macro_f: the average values across evaluations per label
- {attr}_f_per_type and {attr}_auc_per_type: each contains a dictionary of scores, keyed by label
- A final {attr}_score and corresponding {attr}_score_desc (text description)

The reported {attr}_score depends on the classification properties:

- binary exclusive with positive label: {attr}_score is set to the F-score of the positive label
- 3+ exclusive classes, macro-averaged F-score: {attr}_score = {attr}_macro_f
- multilabel, macro-averaged AUC: {attr}_score = {attr}_macro_auc
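The difference between micro and macro averaging can be sketched from per-label counts of true positives, false positives and false negatives. This is a plain-Python illustration with hypothetical helper names and made-up counts, not the library's code.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


def micro_macro(counts):
    """counts maps each label to a (tp, fp, fn) tuple."""
    per_label = {label: prf(*c) for label, c in counts.items()}
    # Micro: pool the counts of all labels, then compute PRF once.
    micro = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute PRF per label, then average the per-label values.
    n_labels = len(per_label)
    macro = tuple(sum(s[i] for s in per_label.values()) / n_labels for i in range(3))
    return micro, macro, per_label


# Hypothetical counts: a frequent label scored well, a rare label scored badly.
counts = {"LABEL_A": (8, 2, 0), "LABEL_B": (1, 0, 3)}

micro, macro, per_label = micro_macro(counts)
print(micro, macro, per_label)
```

With these counts the micro F-score stays close to the frequent label's score, while the macro F-score is pulled down by the rare label, since each label contributes equally to the average.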
Example
labels = ["LABEL_A", "LABEL_B", "LABEL_C"]
scores = Scorer.score_cats(
    examples,
    "cats",
    labels=labels
)
print(scores["cats_macro_auc"])
| Name | Description |
|---|---|
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
| attr | The attribute to score. |
| keyword-only | |
| getter | Defaults to getattr. If provided, getter(doc, attr) should return the cats for an individual Doc. |
| labels | The set of possible labels. Defaults to []. |
| multi_label | Whether the attribute allows multiple labels. Defaults to True. |
| positive_label | The positive label for a binary task with exclusive classes. Defaults to None. |
| RETURNS | A dictionary containing the scores, with inapplicable scores as None. |
Scorer.score_links
Returns PRF for predicted links on the entity level. To disentangle the performance of the NEL from the NER, this method only evaluates NEL links for entities that overlap between the gold reference and the predictions.
Example
scores = Scorer.score_links(
    examples,
    negative_labels=["NIL", ""]
)
print(scores["nel_micro_f"])
| Name | Description |
|---|---|
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. |
| keyword-only | |
| negative_labels | The string values that refer to no annotation (e.g. "NIL"). |
| RETURNS | A dictionary containing the scores. |
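One way to picture how link scoring is restricted to entities shared by the reference and the predictions is the sketch below. It is an assumption-laden illustration, not the library's code: entities are reduced to `(start, end)` keys mapped to knowledge base IDs, "overlap" is simplified to an exact span match, and the counting rules in `link_prf` are one plausible reading of the description above.

```python
def link_prf(pred_links, gold_links, negative_labels=("NIL", "")):
    """Micro PRF for entity links, scored only on spans present in both dicts."""
    tp = fp = fn = 0
    for span, gold_id in gold_links.items():
        if span not in pred_links:
            continue  # entity missed by the NER: not counted against the NEL
        pred_id = pred_links[span]
        gold_neg = gold_id in negative_labels
        pred_neg = pred_id in negative_labels
        if gold_neg and pred_neg:
            continue  # true negative, ignored
        if gold_id == pred_id:
            tp += 1
        elif gold_neg:
            fp += 1  # predicted a link where the reference has none
        elif pred_neg:
            fn += 1  # failed to predict a link that the reference has
        else:
            fp += 1  # a wrong link is both a false positive
            fn += 1  # and a false negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


# Hypothetical data: span (10, 12) is missing from the predictions and
# span (20, 21) is missing from the reference, so neither is scored.
gold_links = {(0, 2): "Q1", (5, 6): "NIL", (8, 9): "Q3", (10, 12): "Q4"}
pred_links = {(0, 2): "Q1", (5, 6): "Q9", (8, 9): "NIL", (20, 21): "Q7"}

nel_p, nel_r, nel_f = link_prf(pred_links, gold_links)
print(nel_p, nel_r, nel_f)
```

Skipping the unmatched spans is what separates linking errors from recognition errors: a missed entity lowers the NER scores but leaves the link scores untouched.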
get_ner_prf
Compute micro-PRF and per-entity PRF scores.
| Name | Description |
|---|---|
| examples | The Example objects holding both the predictions and the correct gold-standard annotations. |