mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 09:14:32 +03:00
Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model confiug parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. * Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>
This commit is contained in:
parent
bab9976d9a
commit
b5d999e510
121
examples/training/textcat_example_data/CC0.txt
Normal file
121
examples/training/textcat_example_data/CC0.txt
Normal file
|
@ -0,0 +1,121 @@
|
|||
Creative Commons Legal Code
|
||||
|
||||
CC0 1.0 Universal
|
||||
|
||||
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
|
||||
LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
|
||||
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
|
||||
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
|
||||
REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
|
||||
PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
|
||||
THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
|
||||
HEREUNDER.
|
||||
|
||||
Statement of Purpose
|
||||
|
||||
The laws of most jurisdictions throughout the world automatically confer
|
||||
exclusive Copyright and Related Rights (defined below) upon the creator
|
||||
and subsequent owner(s) (each and all, an "owner") of an original work of
|
||||
authorship and/or a database (each, a "Work").
|
||||
|
||||
Certain owners wish to permanently relinquish those rights to a Work for
|
||||
the purpose of contributing to a commons of creative, cultural and
|
||||
scientific works ("Commons") that the public can reliably and without fear
|
||||
of later claims of infringement build upon, modify, incorporate in other
|
||||
works, reuse and redistribute as freely as possible in any form whatsoever
|
||||
and for any purposes, including without limitation commercial purposes.
|
||||
These owners may contribute to the Commons to promote the ideal of a free
|
||||
culture and the further production of creative, cultural and scientific
|
||||
works, or to gain reputation or greater distribution for their Work in
|
||||
part through the use and efforts of others.
|
||||
|
||||
For these and/or other purposes and motivations, and without any
|
||||
expectation of additional consideration or compensation, the person
|
||||
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
|
||||
is an owner of Copyright and Related Rights in the Work, voluntarily
|
||||
elects to apply CC0 to the Work and publicly distribute the Work under its
|
||||
terms, with knowledge of his or her Copyright and Related Rights in the
|
||||
Work and the meaning and intended legal effect of CC0 on those rights.
|
||||
|
||||
1. Copyright and Related Rights. A Work made available under CC0 may be
|
||||
protected by copyright and related or neighboring rights ("Copyright and
|
||||
Related Rights"). Copyright and Related Rights include, but are not
|
||||
limited to, the following:
|
||||
|
||||
i. the right to reproduce, adapt, distribute, perform, display,
|
||||
communicate, and translate a Work;
|
||||
ii. moral rights retained by the original author(s) and/or performer(s);
|
||||
iii. publicity and privacy rights pertaining to a person's image or
|
||||
likeness depicted in a Work;
|
||||
iv. rights protecting against unfair competition in regards to a Work,
|
||||
subject to the limitations in paragraph 4(a), below;
|
||||
v. rights protecting the extraction, dissemination, use and reuse of data
|
||||
in a Work;
|
||||
vi. database rights (such as those arising under Directive 96/9/EC of the
|
||||
European Parliament and of the Council of 11 March 1996 on the legal
|
||||
protection of databases, and under any national implementation
|
||||
thereof, including any amended or successor version of such
|
||||
directive); and
|
||||
vii. other similar, equivalent or corresponding rights throughout the
|
||||
world based on applicable law or treaty, and any national
|
||||
implementations thereof.
|
||||
|
||||
2. Waiver. To the greatest extent permitted by, but not in contravention
|
||||
of, applicable law, Affirmer hereby overtly, fully, permanently,
|
||||
irrevocably and unconditionally waives, abandons, and surrenders all of
|
||||
Affirmer's Copyright and Related Rights and associated claims and causes
|
||||
of action, whether now known or unknown (including existing as well as
|
||||
future claims and causes of action), in the Work (i) in all territories
|
||||
worldwide, (ii) for the maximum duration provided by applicable law or
|
||||
treaty (including future time extensions), (iii) in any current or future
|
||||
medium and for any number of copies, and (iv) for any purpose whatsoever,
|
||||
including without limitation commercial, advertising or promotional
|
||||
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
|
||||
member of the public at large and to the detriment of Affirmer's heirs and
|
||||
successors, fully intending that such Waiver shall not be subject to
|
||||
revocation, rescission, cancellation, termination, or any other legal or
|
||||
equitable action to disrupt the quiet enjoyment of the Work by the public
|
||||
as contemplated by Affirmer's express Statement of Purpose.
|
||||
|
||||
3. Public License Fallback. Should any part of the Waiver for any reason
|
||||
be judged legally invalid or ineffective under applicable law, then the
|
||||
Waiver shall be preserved to the maximum extent permitted taking into
|
||||
account Affirmer's express Statement of Purpose. In addition, to the
|
||||
extent the Waiver is so judged Affirmer hereby grants to each affected
|
||||
person a royalty-free, non transferable, non sublicensable, non exclusive,
|
||||
irrevocable and unconditional license to exercise Affirmer's Copyright and
|
||||
Related Rights in the Work (i) in all territories worldwide, (ii) for the
|
||||
maximum duration provided by applicable law or treaty (including future
|
||||
time extensions), (iii) in any current or future medium and for any number
|
||||
of copies, and (iv) for any purpose whatsoever, including without
|
||||
limitation commercial, advertising or promotional purposes (the
|
||||
"License"). The License shall be deemed effective as of the date CC0 was
|
||||
applied by Affirmer to the Work. Should any part of the License for any
|
||||
reason be judged legally invalid or ineffective under applicable law, such
|
||||
partial invalidity or ineffectiveness shall not invalidate the remainder
|
||||
of the License, and in such case Affirmer hereby affirms that he or she
|
||||
will not (i) exercise any of his or her remaining Copyright and Related
|
||||
Rights in the Work or (ii) assert any associated claims and causes of
|
||||
action with respect to the Work, in either case contrary to Affirmer's
|
||||
express Statement of Purpose.
|
||||
|
||||
4. Limitations and Disclaimers.
|
||||
|
||||
a. No trademark or patent rights held by Affirmer are waived, abandoned,
|
||||
surrendered, licensed or otherwise affected by this document.
|
||||
b. Affirmer offers the Work as-is and makes no representations or
|
||||
warranties of any kind concerning the Work, express, implied,
|
||||
statutory or otherwise, including without limitation warranties of
|
||||
title, merchantability, fitness for a particular purpose, non
|
||||
infringement, or the absence of latent or other defects, accuracy, or
|
||||
the present or absence of errors, whether or not discoverable, all to
|
||||
the greatest extent permissible under applicable law.
|
||||
c. Affirmer disclaims responsibility for clearing rights of other persons
|
||||
that may apply to the Work or any use thereof, including without
|
||||
limitation any person's Copyright and Related Rights in the Work.
|
||||
Further, Affirmer disclaims responsibility for obtaining any necessary
|
||||
consents, permissions or other rights required for any use of the
|
||||
Work.
|
||||
d. Affirmer understands and acknowledges that Creative Commons is not a
|
||||
party to this document and has no duty or obligation with respect to
|
||||
this CC0 or use of the Work.
|
359
examples/training/textcat_example_data/CC_BY-SA-3.0.txt
Normal file
359
examples/training/textcat_example_data/CC_BY-SA-3.0.txt
Normal file
|
@ -0,0 +1,359 @@
|
|||
Creative Commons Legal Code
|
||||
|
||||
Attribution-ShareAlike 3.0 Unported
|
||||
|
||||
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
|
||||
LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN
|
||||
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
|
||||
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
|
||||
REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR
|
||||
DAMAGES RESULTING FROM ITS USE.
|
||||
|
||||
License
|
||||
|
||||
THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE
|
||||
COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY
|
||||
COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS
|
||||
AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.
|
||||
|
||||
BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE
|
||||
TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY
|
||||
BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS
|
||||
CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND
|
||||
CONDITIONS.
|
||||
|
||||
1. Definitions
|
||||
|
||||
a. "Adaptation" means a work based upon the Work, or upon the Work and
|
||||
other pre-existing works, such as a translation, adaptation,
|
||||
derivative work, arrangement of music or other alterations of a
|
||||
literary or artistic work, or phonogram or performance and includes
|
||||
cinematographic adaptations or any other form in which the Work may be
|
||||
recast, transformed, or adapted including in any form recognizably
|
||||
derived from the original, except that a work that constitutes a
|
||||
Collection will not be considered an Adaptation for the purpose of
|
||||
this License. For the avoidance of doubt, where the Work is a musical
|
||||
work, performance or phonogram, the synchronization of the Work in
|
||||
timed-relation with a moving image ("synching") will be considered an
|
||||
Adaptation for the purpose of this License.
|
||||
b. "Collection" means a collection of literary or artistic works, such as
|
||||
encyclopedias and anthologies, or performances, phonograms or
|
||||
broadcasts, or other works or subject matter other than works listed
|
||||
in Section 1(f) below, which, by reason of the selection and
|
||||
arrangement of their contents, constitute intellectual creations, in
|
||||
which the Work is included in its entirety in unmodified form along
|
||||
with one or more other contributions, each constituting separate and
|
||||
independent works in themselves, which together are assembled into a
|
||||
collective whole. A work that constitutes a Collection will not be
|
||||
considered an Adaptation (as defined below) for the purposes of this
|
||||
License.
|
||||
c. "Creative Commons Compatible License" means a license that is listed
|
||||
at https://creativecommons.org/compatiblelicenses that has been
|
||||
approved by Creative Commons as being essentially equivalent to this
|
||||
License, including, at a minimum, because that license: (i) contains
|
||||
terms that have the same purpose, meaning and effect as the License
|
||||
Elements of this License; and, (ii) explicitly permits the relicensing
|
||||
of adaptations of works made available under that license under this
|
||||
License or a Creative Commons jurisdiction license with the same
|
||||
License Elements as this License.
|
||||
d. "Distribute" means to make available to the public the original and
|
||||
copies of the Work or Adaptation, as appropriate, through sale or
|
||||
other transfer of ownership.
|
||||
e. "License Elements" means the following high-level license attributes
|
||||
as selected by Licensor and indicated in the title of this License:
|
||||
Attribution, ShareAlike.
|
||||
f. "Licensor" means the individual, individuals, entity or entities that
|
||||
offer(s) the Work under the terms of this License.
|
||||
g. "Original Author" means, in the case of a literary or artistic work,
|
||||
the individual, individuals, entity or entities who created the Work
|
||||
or if no individual or entity can be identified, the publisher; and in
|
||||
addition (i) in the case of a performance the actors, singers,
|
||||
musicians, dancers, and other persons who act, sing, deliver, declaim,
|
||||
play in, interpret or otherwise perform literary or artistic works or
|
||||
expressions of folklore; (ii) in the case of a phonogram the producer
|
||||
being the person or legal entity who first fixes the sounds of a
|
||||
performance or other sounds; and, (iii) in the case of broadcasts, the
|
||||
organization that transmits the broadcast.
|
||||
h. "Work" means the literary and/or artistic work offered under the terms
|
||||
of this License including without limitation any production in the
|
||||
literary, scientific and artistic domain, whatever may be the mode or
|
||||
form of its expression including digital form, such as a book,
|
||||
pamphlet and other writing; a lecture, address, sermon or other work
|
||||
of the same nature; a dramatic or dramatico-musical work; a
|
||||
choreographic work or entertainment in dumb show; a musical
|
||||
composition with or without words; a cinematographic work to which are
|
||||
assimilated works expressed by a process analogous to cinematography;
|
||||
a work of drawing, painting, architecture, sculpture, engraving or
|
||||
lithography; a photographic work to which are assimilated works
|
||||
expressed by a process analogous to photography; a work of applied
|
||||
art; an illustration, map, plan, sketch or three-dimensional work
|
||||
relative to geography, topography, architecture or science; a
|
||||
performance; a broadcast; a phonogram; a compilation of data to the
|
||||
extent it is protected as a copyrightable work; or a work performed by
|
||||
a variety or circus performer to the extent it is not otherwise
|
||||
considered a literary or artistic work.
|
||||
i. "You" means an individual or entity exercising rights under this
|
||||
License who has not previously violated the terms of this License with
|
||||
respect to the Work, or who has received express permission from the
|
||||
Licensor to exercise rights under this License despite a previous
|
||||
violation.
|
||||
j. "Publicly Perform" means to perform public recitations of the Work and
|
||||
to communicate to the public those public recitations, by any means or
|
||||
process, including by wire or wireless means or public digital
|
||||
performances; to make available to the public Works in such a way that
|
||||
members of the public may access these Works from a place and at a
|
||||
place individually chosen by them; to perform the Work to the public
|
||||
by any means or process and the communication to the public of the
|
||||
performances of the Work, including by public digital performance; to
|
||||
broadcast and rebroadcast the Work by any means including signs,
|
||||
sounds or images.
|
||||
k. "Reproduce" means to make copies of the Work by any means including
|
||||
without limitation by sound or visual recordings and the right of
|
||||
fixation and reproducing fixations of the Work, including storage of a
|
||||
protected performance or phonogram in digital form or other electronic
|
||||
medium.
|
||||
|
||||
2. Fair Dealing Rights. Nothing in this License is intended to reduce,
|
||||
limit, or restrict any uses free from copyright or rights arising from
|
||||
limitations or exceptions that are provided for in connection with the
|
||||
copyright protection under copyright law or other applicable laws.
|
||||
|
||||
3. License Grant. Subject to the terms and conditions of this License,
|
||||
Licensor hereby grants You a worldwide, royalty-free, non-exclusive,
|
||||
perpetual (for the duration of the applicable copyright) license to
|
||||
exercise the rights in the Work as stated below:
|
||||
|
||||
a. to Reproduce the Work, to incorporate the Work into one or more
|
||||
Collections, and to Reproduce the Work as incorporated in the
|
||||
Collections;
|
||||
b. to create and Reproduce Adaptations provided that any such Adaptation,
|
||||
including any translation in any medium, takes reasonable steps to
|
||||
clearly label, demarcate or otherwise identify that changes were made
|
||||
to the original Work. For example, a translation could be marked "The
|
||||
original work was translated from English to Spanish," or a
|
||||
modification could indicate "The original work has been modified.";
|
||||
c. to Distribute and Publicly Perform the Work including as incorporated
|
||||
in Collections; and,
|
||||
d. to Distribute and Publicly Perform Adaptations.
|
||||
e. For the avoidance of doubt:
|
||||
|
||||
i. Non-waivable Compulsory License Schemes. In those jurisdictions in
|
||||
which the right to collect royalties through any statutory or
|
||||
compulsory licensing scheme cannot be waived, the Licensor
|
||||
reserves the exclusive right to collect such royalties for any
|
||||
exercise by You of the rights granted under this License;
|
||||
ii. Waivable Compulsory License Schemes. In those jurisdictions in
|
||||
which the right to collect royalties through any statutory or
|
||||
compulsory licensing scheme can be waived, the Licensor waives the
|
||||
exclusive right to collect such royalties for any exercise by You
|
||||
of the rights granted under this License; and,
|
||||
iii. Voluntary License Schemes. The Licensor waives the right to
|
||||
collect royalties, whether individually or, in the event that the
|
||||
Licensor is a member of a collecting society that administers
|
||||
voluntary licensing schemes, via that society, from any exercise
|
||||
by You of the rights granted under this License.
|
||||
|
||||
The above rights may be exercised in all media and formats whether now
|
||||
known or hereafter devised. The above rights include the right to make
|
||||
such modifications as are technically necessary to exercise the rights in
|
||||
other media and formats. Subject to Section 8(f), all rights not expressly
|
||||
granted by Licensor are hereby reserved.
|
||||
|
||||
4. Restrictions. The license granted in Section 3 above is expressly made
|
||||
subject to and limited by the following restrictions:
|
||||
|
||||
a. You may Distribute or Publicly Perform the Work only under the terms
|
||||
of this License. You must include a copy of, or the Uniform Resource
|
||||
Identifier (URI) for, this License with every copy of the Work You
|
||||
Distribute or Publicly Perform. You may not offer or impose any terms
|
||||
on the Work that restrict the terms of this License or the ability of
|
||||
the recipient of the Work to exercise the rights granted to that
|
||||
recipient under the terms of the License. You may not sublicense the
|
||||
Work. You must keep intact all notices that refer to this License and
|
||||
to the disclaimer of warranties with every copy of the Work You
|
||||
Distribute or Publicly Perform. When You Distribute or Publicly
|
||||
Perform the Work, You may not impose any effective technological
|
||||
measures on the Work that restrict the ability of a recipient of the
|
||||
Work from You to exercise the rights granted to that recipient under
|
||||
the terms of the License. This Section 4(a) applies to the Work as
|
||||
incorporated in a Collection, but this does not require the Collection
|
||||
apart from the Work itself to be made subject to the terms of this
|
||||
License. If You create a Collection, upon notice from any Licensor You
|
||||
must, to the extent practicable, remove from the Collection any credit
|
||||
as required by Section 4(c), as requested. If You create an
|
||||
Adaptation, upon notice from any Licensor You must, to the extent
|
||||
practicable, remove from the Adaptation any credit as required by
|
||||
Section 4(c), as requested.
|
||||
b. You may Distribute or Publicly Perform an Adaptation only under the
|
||||
terms of: (i) this License; (ii) a later version of this License with
|
||||
the same License Elements as this License; (iii) a Creative Commons
|
||||
jurisdiction license (either this or a later license version) that
|
||||
contains the same License Elements as this License (e.g.,
|
||||
Attribution-ShareAlike 3.0 US)); (iv) a Creative Commons Compatible
|
||||
License. If you license the Adaptation under one of the licenses
|
||||
mentioned in (iv), you must comply with the terms of that license. If
|
||||
you license the Adaptation under the terms of any of the licenses
|
||||
mentioned in (i), (ii) or (iii) (the "Applicable License"), you must
|
||||
comply with the terms of the Applicable License generally and the
|
||||
following provisions: (I) You must include a copy of, or the URI for,
|
||||
the Applicable License with every copy of each Adaptation You
|
||||
Distribute or Publicly Perform; (II) You may not offer or impose any
|
||||
terms on the Adaptation that restrict the terms of the Applicable
|
||||
License or the ability of the recipient of the Adaptation to exercise
|
||||
the rights granted to that recipient under the terms of the Applicable
|
||||
License; (III) You must keep intact all notices that refer to the
|
||||
Applicable License and to the disclaimer of warranties with every copy
|
||||
of the Work as included in the Adaptation You Distribute or Publicly
|
||||
Perform; (IV) when You Distribute or Publicly Perform the Adaptation,
|
||||
You may not impose any effective technological measures on the
|
||||
Adaptation that restrict the ability of a recipient of the Adaptation
|
||||
from You to exercise the rights granted to that recipient under the
|
||||
terms of the Applicable License. This Section 4(b) applies to the
|
||||
Adaptation as incorporated in a Collection, but this does not require
|
||||
the Collection apart from the Adaptation itself to be made subject to
|
||||
the terms of the Applicable License.
|
||||
c. If You Distribute, or Publicly Perform the Work or any Adaptations or
|
||||
Collections, You must, unless a request has been made pursuant to
|
||||
Section 4(a), keep intact all copyright notices for the Work and
|
||||
provide, reasonable to the medium or means You are utilizing: (i) the
|
||||
name of the Original Author (or pseudonym, if applicable) if supplied,
|
||||
and/or if the Original Author and/or Licensor designate another party
|
||||
or parties (e.g., a sponsor institute, publishing entity, journal) for
|
||||
attribution ("Attribution Parties") in Licensor's copyright notice,
|
||||
terms of service or by other reasonable means, the name of such party
|
||||
or parties; (ii) the title of the Work if supplied; (iii) to the
|
||||
extent reasonably practicable, the URI, if any, that Licensor
|
||||
specifies to be associated with the Work, unless such URI does not
|
||||
refer to the copyright notice or licensing information for the Work;
|
||||
and (iv) , consistent with Ssection 3(b), in the case of an
|
||||
Adaptation, a credit identifying the use of the Work in the Adaptation
|
||||
(e.g., "French translation of the Work by Original Author," or
|
||||
"Screenplay based on original Work by Original Author"). The credit
|
||||
required by this Section 4(c) may be implemented in any reasonable
|
||||
manner; provided, however, that in the case of a Adaptation or
|
||||
Collection, at a minimum such credit will appear, if a credit for all
|
||||
contributing authors of the Adaptation or Collection appears, then as
|
||||
part of these credits and in a manner at least as prominent as the
|
||||
credits for the other contributing authors. For the avoidance of
|
||||
doubt, You may only use the credit required by this Section for the
|
||||
purpose of attribution in the manner set out above and, by exercising
|
||||
Your rights under this License, You may not implicitly or explicitly
|
||||
assert or imply any connection with, sponsorship or endorsement by the
|
||||
Original Author, Licensor and/or Attribution Parties, as appropriate,
|
||||
of You or Your use of the Work, without the separate, express prior
|
||||
written permission of the Original Author, Licensor and/or Attribution
|
||||
Parties.
|
||||
d. Except as otherwise agreed in writing by the Licensor or as may be
|
||||
otherwise permitted by applicable law, if You Reproduce, Distribute or
|
||||
Publicly Perform the Work either by itself or as part of any
|
||||
Adaptations or Collections, You must not distort, mutilate, modify or
|
||||
take other derogatory action in relation to the Work which would be
|
||||
prejudicial to the Original Author's honor or reputation. Licensor
|
||||
agrees that in those jurisdictions (e.g. Japan), in which any exercise
|
||||
of the right granted in Section 3(b) of this License (the right to
|
||||
make Adaptations) would be deemed to be a distortion, mutilation,
|
||||
modification or other derogatory action prejudicial to the Original
|
||||
Author's honor and reputation, the Licensor will waive or not assert,
|
||||
as appropriate, this Section, to the fullest extent permitted by the
|
||||
applicable national law, to enable You to reasonably exercise Your
|
||||
right under Section 3(b) of this License (right to make Adaptations)
|
||||
but not otherwise.
|
||||
|
||||
5. Representations, Warranties and Disclaimer
|
||||
|
||||
UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR
|
||||
OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY
|
||||
KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE,
|
||||
INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF
|
||||
LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS,
|
||||
WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION
|
||||
OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU.
|
||||
|
||||
6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE
|
||||
LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR
|
||||
ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES
|
||||
ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS
|
||||
BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
|
||||
|
||||
7. Termination
|
||||
|
||||
a. This License and the rights granted hereunder will terminate
|
||||
automatically upon any breach by You of the terms of this License.
|
||||
Individuals or entities who have received Adaptations or Collections
|
||||
from You under this License, however, will not have their licenses
|
||||
terminated provided such individuals or entities remain in full
|
||||
compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 will
|
||||
survive any termination of this License.
|
||||
b. Subject to the above terms and conditions, the license granted here is
|
||||
perpetual (for the duration of the applicable copyright in the Work).
|
||||
Notwithstanding the above, Licensor reserves the right to release the
|
||||
Work under different license terms or to stop distributing the Work at
|
||||
any time; provided, however that any such election will not serve to
|
||||
withdraw this License (or any other license that has been, or is
|
||||
required to be, granted under the terms of this License), and this
|
||||
License will continue in full force and effect unless terminated as
|
||||
stated above.
|
||||
|
||||
8. Miscellaneous
|
||||
|
||||
a. Each time You Distribute or Publicly Perform the Work or a Collection,
|
||||
the Licensor offers to the recipient a license to the Work on the same
|
||||
terms and conditions as the license granted to You under this License.
|
||||
b. Each time You Distribute or Publicly Perform an Adaptation, Licensor
|
||||
offers to the recipient a license to the original Work on the same
|
||||
terms and conditions as the license granted to You under this License.
|
||||
c. If any provision of this License is invalid or unenforceable under
|
||||
applicable law, it shall not affect the validity or enforceability of
|
||||
the remainder of the terms of this License, and without further action
|
||||
by the parties to this agreement, such provision shall be reformed to
|
||||
the minimum extent necessary to make such provision valid and
|
||||
enforceable.
|
||||
d. No term or provision of this License shall be deemed waived and no
|
||||
breach consented to unless such waiver or consent shall be in writing
|
||||
and signed by the party to be charged with such waiver or consent.
|
||||
e. This License constitutes the entire agreement between the parties with
|
||||
respect to the Work licensed here. There are no understandings,
|
||||
agreements or representations with respect to the Work not specified
|
||||
here. Licensor shall not be bound by any additional provisions that
|
||||
may appear in any communication from You. This License may not be
|
||||
modified without the mutual written agreement of the Licensor and You.
|
||||
f. The rights granted under, and the subject matter referenced, in this
|
||||
License were drafted utilizing the terminology of the Berne Convention
|
||||
for the Protection of Literary and Artistic Works (as amended on
|
||||
September 28, 1979), the Rome Convention of 1961, the WIPO Copyright
|
||||
Treaty of 1996, the WIPO Performances and Phonograms Treaty of 1996
|
||||
and the Universal Copyright Convention (as revised on July 24, 1971).
|
||||
These rights and subject matter take effect in the relevant
|
||||
jurisdiction in which the License terms are sought to be enforced
|
||||
according to the corresponding provisions of the implementation of
|
||||
those treaty provisions in the applicable national law. If the
|
||||
standard suite of rights granted under applicable copyright law
|
||||
includes additional rights not granted under this License, such
|
||||
additional rights are deemed to be included in the License; this
|
||||
License is not intended to restrict the license of any rights under
|
||||
applicable law.
|
||||
|
||||
|
||||
Creative Commons Notice
|
||||
|
||||
Creative Commons is not a party to this License, and makes no warranty
|
||||
whatsoever in connection with the Work. Creative Commons will not be
|
||||
liable to You or any party on any legal theory for any damages
|
||||
whatsoever, including without limitation any general, special,
|
||||
incidental or consequential damages arising in connection to this
|
||||
license. Notwithstanding the foregoing two (2) sentences, if Creative
|
||||
Commons has expressly identified itself as the Licensor hereunder, it
|
||||
shall have all rights and obligations of Licensor.
|
||||
|
||||
Except for the limited purpose of indicating to the public that the
|
||||
Work is licensed under the CCPL, Creative Commons does not authorize
|
||||
the use by either party of the trademark "Creative Commons" or any
|
||||
related trademark or logo of Creative Commons without the prior
|
||||
written consent of Creative Commons. Any permitted use will be in
|
||||
compliance with Creative Commons' then-current trademark usage
|
||||
guidelines, as may be published on its website or otherwise made
|
||||
available upon request from time to time. For the avoidance of doubt,
|
||||
this trademark restriction does not form part of the License.
|
||||
|
||||
Creative Commons may be contacted at https://creativecommons.org/.
|
428
examples/training/textcat_example_data/CC_BY-SA-4.0.txt
Normal file
428
examples/training/textcat_example_data/CC_BY-SA-4.0.txt
Normal file
|
@ -0,0 +1,428 @@
|
|||
Attribution-ShareAlike 4.0 International
|
||||
|
||||
=======================================================================
|
||||
|
||||
Creative Commons Corporation ("Creative Commons") is not a law firm and
|
||||
does not provide legal services or legal advice. Distribution of
|
||||
Creative Commons public licenses does not create a lawyer-client or
|
||||
other relationship. Creative Commons makes its licenses and related
|
||||
information available on an "as-is" basis. Creative Commons gives no
|
||||
warranties regarding its licenses, any material licensed under their
|
||||
terms and conditions, or any related information. Creative Commons
|
||||
disclaims all liability for damages resulting from their use to the
|
||||
fullest extent possible.
|
||||
|
||||
Using Creative Commons Public Licenses
|
||||
|
||||
Creative Commons public licenses provide a standard set of terms and
|
||||
conditions that creators and other rights holders may use to share
|
||||
original works of authorship and other material subject to copyright
|
||||
and certain other rights specified in the public license below. The
|
||||
following considerations are for informational purposes only, are not
|
||||
exhaustive, and do not form part of our licenses.
|
||||
|
||||
Considerations for licensors: Our public licenses are
|
||||
intended for use by those authorized to give the public
|
||||
permission to use material in ways otherwise restricted by
|
||||
copyright and certain other rights. Our licenses are
|
||||
irrevocable. Licensors should read and understand the terms
|
||||
and conditions of the license they choose before applying it.
|
||||
Licensors should also secure all rights necessary before
|
||||
applying our licenses so that the public can reuse the
|
||||
material as expected. Licensors should clearly mark any
|
||||
material not subject to the license. This includes other CC-
|
||||
licensed material, or material used under an exception or
|
||||
limitation to copyright. More considerations for licensors:
|
||||
wiki.creativecommons.org/Considerations_for_licensors
|
||||
|
||||
Considerations for the public: By using one of our public
|
||||
licenses, a licensor grants the public permission to use the
|
||||
licensed material under specified terms and conditions. If
|
||||
the licensor's permission is not necessary for any reason--for
|
||||
example, because of any applicable exception or limitation to
|
||||
copyright--then that use is not regulated by the license. Our
|
||||
licenses grant only permissions under copyright and certain
|
||||
other rights that a licensor has authority to grant. Use of
|
||||
the licensed material may still be restricted for other
|
||||
reasons, including because others have copyright or other
|
||||
rights in the material. A licensor may make special requests,
|
||||
such as asking that all changes be marked or described.
|
||||
Although not required by our licenses, you are encouraged to
|
||||
respect those requests where reasonable. More considerations
|
||||
for the public:
|
||||
wiki.creativecommons.org/Considerations_for_licensees
|
||||
|
||||
=======================================================================
|
||||
|
||||
Creative Commons Attribution-ShareAlike 4.0 International Public
|
||||
License
|
||||
|
||||
By exercising the Licensed Rights (defined below), You accept and agree
|
||||
to be bound by the terms and conditions of this Creative Commons
|
||||
Attribution-ShareAlike 4.0 International Public License ("Public
|
||||
License"). To the extent this Public License may be interpreted as a
|
||||
contract, You are granted the Licensed Rights in consideration of Your
|
||||
acceptance of these terms and conditions, and the Licensor grants You
|
||||
such rights in consideration of benefits the Licensor receives from
|
||||
making the Licensed Material available under these terms and
|
||||
conditions.
|
||||
|
||||
|
||||
Section 1 -- Definitions.
|
||||
|
||||
a. Adapted Material means material subject to Copyright and Similar
|
||||
Rights that is derived from or based upon the Licensed Material
|
||||
and in which the Licensed Material is translated, altered,
|
||||
arranged, transformed, or otherwise modified in a manner requiring
|
||||
permission under the Copyright and Similar Rights held by the
|
||||
Licensor. For purposes of this Public License, where the Licensed
|
||||
Material is a musical work, performance, or sound recording,
|
||||
Adapted Material is always produced where the Licensed Material is
|
||||
synched in timed relation with a moving image.
|
||||
|
||||
b. Adapter's License means the license You apply to Your Copyright
|
||||
and Similar Rights in Your contributions to Adapted Material in
|
||||
accordance with the terms and conditions of this Public License.
|
||||
|
||||
c. BY-SA Compatible License means a license listed at
|
||||
creativecommons.org/compatiblelicenses, approved by Creative
|
||||
Commons as essentially the equivalent of this Public License.
|
||||
|
||||
d. Copyright and Similar Rights means copyright and/or similar rights
|
||||
closely related to copyright including, without limitation,
|
||||
performance, broadcast, sound recording, and Sui Generis Database
|
||||
Rights, without regard to how the rights are labeled or
|
||||
categorized. For purposes of this Public License, the rights
|
||||
specified in Section 2(b)(1)-(2) are not Copyright and Similar
|
||||
Rights.
|
||||
|
||||
e. Effective Technological Measures means those measures that, in the
|
||||
absence of proper authority, may not be circumvented under laws
|
||||
fulfilling obligations under Article 11 of the WIPO Copyright
|
||||
Treaty adopted on December 20, 1996, and/or similar international
|
||||
agreements.
|
||||
|
||||
f. Exceptions and Limitations means fair use, fair dealing, and/or
|
||||
any other exception or limitation to Copyright and Similar Rights
|
||||
that applies to Your use of the Licensed Material.
|
||||
|
||||
g. License Elements means the license attributes listed in the name
|
||||
of a Creative Commons Public License. The License Elements of this
|
||||
Public License are Attribution and ShareAlike.
|
||||
|
||||
h. Licensed Material means the artistic or literary work, database,
|
||||
or other material to which the Licensor applied this Public
|
||||
License.
|
||||
|
||||
i. Licensed Rights means the rights granted to You subject to the
|
||||
terms and conditions of this Public License, which are limited to
|
||||
all Copyright and Similar Rights that apply to Your use of the
|
||||
Licensed Material and that the Licensor has authority to license.
|
||||
|
||||
j. Licensor means the individual(s) or entity(ies) granting rights
|
||||
under this Public License.
|
||||
|
||||
k. Share means to provide material to the public by any means or
|
||||
process that requires permission under the Licensed Rights, such
|
||||
as reproduction, public display, public performance, distribution,
|
||||
dissemination, communication, or importation, and to make material
|
||||
available to the public including in ways that members of the
|
||||
public may access the material from a place and at a time
|
||||
individually chosen by them.
|
||||
|
||||
l. Sui Generis Database Rights means rights other than copyright
|
||||
resulting from Directive 96/9/EC of the European Parliament and of
|
||||
the Council of 11 March 1996 on the legal protection of databases,
|
||||
as amended and/or succeeded, as well as other essentially
|
||||
equivalent rights anywhere in the world.
|
||||
|
||||
m. You means the individual or entity exercising the Licensed Rights
|
||||
under this Public License. Your has a corresponding meaning.
|
||||
|
||||
|
||||
Section 2 -- Scope.
|
||||
|
||||
a. License grant.
|
||||
|
||||
1. Subject to the terms and conditions of this Public License,
|
||||
the Licensor hereby grants You a worldwide, royalty-free,
|
||||
non-sublicensable, non-exclusive, irrevocable license to
|
||||
exercise the Licensed Rights in the Licensed Material to:
|
||||
|
||||
a. reproduce and Share the Licensed Material, in whole or
|
||||
in part; and
|
||||
|
||||
b. produce, reproduce, and Share Adapted Material.
|
||||
|
||||
2. Exceptions and Limitations. For the avoidance of doubt, where
|
||||
Exceptions and Limitations apply to Your use, this Public
|
||||
License does not apply, and You do not need to comply with
|
||||
its terms and conditions.
|
||||
|
||||
3. Term. The term of this Public License is specified in Section
|
||||
6(a).
|
||||
|
||||
4. Media and formats; technical modifications allowed. The
|
||||
Licensor authorizes You to exercise the Licensed Rights in
|
||||
all media and formats whether now known or hereafter created,
|
||||
and to make technical modifications necessary to do so. The
|
||||
Licensor waives and/or agrees not to assert any right or
|
||||
authority to forbid You from making technical modifications
|
||||
necessary to exercise the Licensed Rights, including
|
||||
technical modifications necessary to circumvent Effective
|
||||
Technological Measures. For purposes of this Public License,
|
||||
simply making modifications authorized by this Section 2(a)
|
||||
(4) never produces Adapted Material.
|
||||
|
||||
5. Downstream recipients.
|
||||
|
||||
a. Offer from the Licensor -- Licensed Material. Every
|
||||
recipient of the Licensed Material automatically
|
||||
receives an offer from the Licensor to exercise the
|
||||
Licensed Rights under the terms and conditions of this
|
||||
Public License.
|
||||
|
||||
b. Additional offer from the Licensor -- Adapted Material.
|
||||
Every recipient of Adapted Material from You
|
||||
automatically receives an offer from the Licensor to
|
||||
exercise the Licensed Rights in the Adapted Material
|
||||
under the conditions of the Adapter's License You apply.
|
||||
|
||||
c. No downstream restrictions. You may not offer or impose
|
||||
any additional or different terms or conditions on, or
|
||||
apply any Effective Technological Measures to, the
|
||||
Licensed Material if doing so restricts exercise of the
|
||||
Licensed Rights by any recipient of the Licensed
|
||||
Material.
|
||||
|
||||
6. No endorsement. Nothing in this Public License constitutes or
|
||||
may be construed as permission to assert or imply that You
|
||||
are, or that Your use of the Licensed Material is, connected
|
||||
with, or sponsored, endorsed, or granted official status by,
|
||||
the Licensor or others designated to receive attribution as
|
||||
provided in Section 3(a)(1)(A)(i).
|
||||
|
||||
b. Other rights.
|
||||
|
||||
1. Moral rights, such as the right of integrity, are not
|
||||
licensed under this Public License, nor are publicity,
|
||||
privacy, and/or other similar personality rights; however, to
|
||||
the extent possible, the Licensor waives and/or agrees not to
|
||||
assert any such rights held by the Licensor to the limited
|
||||
extent necessary to allow You to exercise the Licensed
|
||||
Rights, but not otherwise.
|
||||
|
||||
2. Patent and trademark rights are not licensed under this
|
||||
Public License.
|
||||
|
||||
3. To the extent possible, the Licensor waives any right to
|
||||
collect royalties from You for the exercise of the Licensed
|
||||
Rights, whether directly or through a collecting society
|
||||
under any voluntary or waivable statutory or compulsory
|
||||
licensing scheme. In all other cases the Licensor expressly
|
||||
reserves any right to collect such royalties.
|
||||
|
||||
|
||||
Section 3 -- License Conditions.
|
||||
|
||||
Your exercise of the Licensed Rights is expressly made subject to the
|
||||
following conditions.
|
||||
|
||||
a. Attribution.
|
||||
|
||||
1. If You Share the Licensed Material (including in modified
|
||||
form), You must:
|
||||
|
||||
a. retain the following if it is supplied by the Licensor
|
||||
with the Licensed Material:
|
||||
|
||||
i. identification of the creator(s) of the Licensed
|
||||
Material and any others designated to receive
|
||||
attribution, in any reasonable manner requested by
|
||||
the Licensor (including by pseudonym if
|
||||
designated);
|
||||
|
||||
ii. a copyright notice;
|
||||
|
||||
iii. a notice that refers to this Public License;
|
||||
|
||||
iv. a notice that refers to the disclaimer of
|
||||
warranties;
|
||||
|
||||
v. a URI or hyperlink to the Licensed Material to the
|
||||
extent reasonably practicable;
|
||||
|
||||
b. indicate if You modified the Licensed Material and
|
||||
retain an indication of any previous modifications; and
|
||||
|
||||
c. indicate the Licensed Material is licensed under this
|
||||
Public License, and include the text of, or the URI or
|
||||
hyperlink to, this Public License.
|
||||
|
||||
2. You may satisfy the conditions in Section 3(a)(1) in any
|
||||
reasonable manner based on the medium, means, and context in
|
||||
which You Share the Licensed Material. For example, it may be
|
||||
reasonable to satisfy the conditions by providing a URI or
|
||||
hyperlink to a resource that includes the required
|
||||
information.
|
||||
|
||||
3. If requested by the Licensor, You must remove any of the
|
||||
information required by Section 3(a)(1)(A) to the extent
|
||||
reasonably practicable.
|
||||
|
||||
b. ShareAlike.
|
||||
|
||||
In addition to the conditions in Section 3(a), if You Share
|
||||
Adapted Material You produce, the following conditions also apply.
|
||||
|
||||
1. The Adapter's License You apply must be a Creative Commons
|
||||
license with the same License Elements, this version or
|
||||
later, or a BY-SA Compatible License.
|
||||
|
||||
2. You must include the text of, or the URI or hyperlink to, the
|
||||
Adapter's License You apply. You may satisfy this condition
|
||||
in any reasonable manner based on the medium, means, and
|
||||
context in which You Share Adapted Material.
|
||||
|
||||
3. You may not offer or impose any additional or different terms
|
||||
or conditions on, or apply any Effective Technological
|
||||
Measures to, Adapted Material that restrict exercise of the
|
||||
rights granted under the Adapter's License You apply.
|
||||
|
||||
|
||||
Section 4 -- Sui Generis Database Rights.
|
||||
|
||||
Where the Licensed Rights include Sui Generis Database Rights that
|
||||
apply to Your use of the Licensed Material:
|
||||
|
||||
a. for the avoidance of doubt, Section 2(a)(1) grants You the right
|
||||
to extract, reuse, reproduce, and Share all or a substantial
|
||||
portion of the contents of the database;
|
||||
|
||||
b. if You include all or a substantial portion of the database
|
||||
contents in a database in which You have Sui Generis Database
|
||||
Rights, then the database in which You have Sui Generis Database
|
||||
Rights (but not its individual contents) is Adapted Material,
|
||||
|
||||
including for purposes of Section 3(b); and
|
||||
c. You must comply with the conditions in Section 3(a) if You Share
|
||||
all or a substantial portion of the contents of the database.
|
||||
|
||||
For the avoidance of doubt, this Section 4 supplements and does not
|
||||
replace Your obligations under this Public License where the Licensed
|
||||
Rights include other Copyright and Similar Rights.
|
||||
|
||||
|
||||
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
|
||||
|
||||
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
|
||||
EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
|
||||
AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
|
||||
ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
|
||||
IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
|
||||
WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
|
||||
PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
|
||||
ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
|
||||
KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
|
||||
ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
|
||||
|
||||
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
|
||||
TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
|
||||
NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
|
||||
INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
|
||||
COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
|
||||
USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
|
||||
ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
|
||||
DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
|
||||
IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
|
||||
|
||||
c. The disclaimer of warranties and limitation of liability provided
|
||||
above shall be interpreted in a manner that, to the extent
|
||||
possible, most closely approximates an absolute disclaimer and
|
||||
waiver of all liability.
|
||||
|
||||
|
||||
Section 6 -- Term and Termination.
|
||||
|
||||
a. This Public License applies for the term of the Copyright and
|
||||
Similar Rights licensed here. However, if You fail to comply with
|
||||
this Public License, then Your rights under this Public License
|
||||
terminate automatically.
|
||||
|
||||
b. Where Your right to use the Licensed Material has terminated under
|
||||
Section 6(a), it reinstates:
|
||||
|
||||
1. automatically as of the date the violation is cured, provided
|
||||
it is cured within 30 days of Your discovery of the
|
||||
violation; or
|
||||
|
||||
2. upon express reinstatement by the Licensor.
|
||||
|
||||
For the avoidance of doubt, this Section 6(b) does not affect any
|
||||
right the Licensor may have to seek remedies for Your violations
|
||||
of this Public License.
|
||||
|
||||
c. For the avoidance of doubt, the Licensor may also offer the
|
||||
Licensed Material under separate terms or conditions or stop
|
||||
distributing the Licensed Material at any time; however, doing so
|
||||
will not terminate this Public License.
|
||||
|
||||
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
|
||||
License.
|
||||
|
||||
|
||||
Section 7 -- Other Terms and Conditions.
|
||||
|
||||
a. The Licensor shall not be bound by any additional or different
|
||||
terms or conditions communicated by You unless expressly agreed.
|
||||
|
||||
b. Any arrangements, understandings, or agreements regarding the
|
||||
Licensed Material not stated herein are separate from and
|
||||
independent of the terms and conditions of this Public License.
|
||||
|
||||
|
||||
Section 8 -- Interpretation.
|
||||
|
||||
a. For the avoidance of doubt, this Public License does not, and
|
||||
shall not be interpreted to, reduce, limit, restrict, or impose
|
||||
conditions on any use of the Licensed Material that could lawfully
|
||||
be made without permission under this Public License.
|
||||
|
||||
b. To the extent possible, if any provision of this Public License is
|
||||
deemed unenforceable, it shall be automatically reformed to the
|
||||
minimum extent necessary to make it enforceable. If the provision
|
||||
cannot be reformed, it shall be severed from this Public License
|
||||
without affecting the enforceability of the remaining terms and
|
||||
conditions.
|
||||
|
||||
c. No term or condition of this Public License will be waived and no
|
||||
failure to comply consented to unless expressly agreed to by the
|
||||
Licensor.
|
||||
|
||||
d. Nothing in this Public License constitutes or may be interpreted
|
||||
as a limitation upon, or waiver of, any privileges and immunities
|
||||
that apply to the Licensor or You, including from the legal
|
||||
processes of any jurisdiction or authority.
|
||||
|
||||
|
||||
=======================================================================
|
||||
|
||||
Creative Commons is not a party to its public
|
||||
licenses. Notwithstanding, Creative Commons may elect to apply one of
|
||||
its public licenses to material it publishes and in those instances
|
||||
will be considered the “Licensor.” The text of the Creative Commons
|
||||
public licenses is dedicated to the public domain under the CC0 Public
|
||||
Domain Dedication. Except for the limited purpose of indicating that
|
||||
material is shared under a Creative Commons public license or as
|
||||
otherwise permitted by the Creative Commons policies published at
|
||||
creativecommons.org/policies, Creative Commons does not authorize the
|
||||
use of the trademark "Creative Commons" or any other trademark or logo
|
||||
of Creative Commons without its prior written consent including,
|
||||
without limitation, in connection with any unauthorized modifications
|
||||
to any of its public licenses or any other arrangements,
|
||||
understandings, or agreements concerning use of licensed material. For
|
||||
the avoidance of doubt, this paragraph does not form part of the
|
||||
public licenses.
|
||||
|
||||
Creative Commons may be contacted at creativecommons.org.
|
||||
|
34
examples/training/textcat_example_data/README.md
Normal file
34
examples/training/textcat_example_data/README.md
Normal file
|
@ -0,0 +1,34 @@
|
|||
## Examples of textcat training data
|
||||
|
||||
spacy JSON training files were generated from JSONL with:
|
||||
|
||||
```
|
||||
python textcatjsonl_to_trainjson.py -m en file.jsonl .
|
||||
```
|
||||
|
||||
`cooking.json` is an example with mutually-exclusive classes with two labels:
|
||||
|
||||
* `baking`
|
||||
* `not_baking`
|
||||
|
||||
`jigsaw-toxic-comment.json` is an example with multiple labels per instance:
|
||||
|
||||
* `insult`
|
||||
* `obscene`
|
||||
* `severe_toxic`
|
||||
* `toxic`
|
||||
|
||||
### Data Sources
|
||||
|
||||
* `cooking.jsonl`: https://cooking.stackexchange.com. The meta IDs link to the
|
||||
original question as `https://cooking.stackexchange.com/questions/ID`, e.g.,
|
||||
`https://cooking.stackexchange.com/questions/2` for the first instance.
|
||||
* `jigsaw-toxic-comment.jsonl`: [Jigsaw Toxic Comments Classification
|
||||
Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
|
||||
|
||||
### Data Licenses
|
||||
|
||||
* `cooking.jsonl`: CC BY-SA 4.0 ([`CC_BY-SA-4.0.txt`](CC_BY-SA-4.0.txt))
|
||||
* `jigsaw-toxic-comment.jsonl`:
|
||||
* text: CC BY-SA 3.0 ([`CC_BY-SA-3.0.txt`](CC_BY-SA-3.0.txt))
|
||||
* annotation: CC0 ([`CC0.txt`](CC0.txt))
|
3487
examples/training/textcat_example_data/cooking.json
Normal file
3487
examples/training/textcat_example_data/cooking.json
Normal file
File diff suppressed because it is too large
Load Diff
10
examples/training/textcat_example_data/cooking.jsonl
Normal file
10
examples/training/textcat_example_data/cooking.jsonl
Normal file
|
@ -0,0 +1,10 @@
|
|||
{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "2"}, "text": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven by laying the strips out on a cookie sheet. When using this method, how long should I cook the bacon for, and at what temperature?\n"}
|
||||
{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "3"}, "text": "What is the difference between white and brown eggs?\nI always use brown extra large eggs, but I can't honestly say why I do this other than habit at this point. Are there any distinct advantages or disadvantages like flavor, shelf life, etc?\n"}
|
||||
{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "4"}, "text": "What is the difference between baking soda and baking powder?\nAnd can I use one in place of the other in certain recipes?\n"}
|
||||
{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "5"}, "text": "In a tomato sauce recipe, how can I cut the acidity?\nIt seems that every time I make a tomato sauce for pasta, the sauce is a little bit too acid for my taste. I've tried using sugar or sodium bicarbonate, but I'm not satisfied with the results.\n"}
|
||||
{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "6"}, "text": "What ingredients (available in specific regions) can I substitute for parsley?\nI have a recipe that calls for fresh parsley. I have substituted other fresh herbs for their dried equivalents but I don't have fresh or dried parsley. Is there something else (ex another dried herb) that I can use instead of parsley?\nI know it is used mainly for looks rather than taste but I have a pasta recipe that calls for 2 tablespoons of parsley in the sauce and then another 2 tablespoons on top when it is done. I know the parsley on top is more for looks but there must be something about the taste otherwise it would call for parsley within the sauce as well.\nI would especially like to hear about substitutes available in Southeast Asia and other parts of the world where the obvious answers (such as cilantro) are not widely available.\n"}
|
||||
{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "9"}, "text": "What is the internal temperature a steak should be cooked to for Rare/Medium Rare/Medium/Well?\nI'd like to know when to take my steaks off the grill and please everybody.\n"}
|
||||
{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "11"}, "text": "How should I poach an egg?\nWhat's the best method to poach an egg without it turning into an eggy soupy mess?\n"}
|
||||
{"cats": {"baking": 0.0, "not_baking": 1.0}, "meta": {"id": "12"}, "text": "How can I make my Ice Cream \"creamier\"\nMy ice cream doesn't feel creamy enough. I got the recipe from Good Eats, and I can't tell if it's just the recipe or maybe that I'm just not getting my \"batter\" cold enough before I try to make it (I let it chill overnight in the refrigerator, but it doesn't always come out of the machine looking like \"soft serve\" as he said on the show - it's usually a little thinner).\nRecipe: http://www.foodnetwork.com/recipes/alton-brown/serious-vanilla-ice-cream-recipe/index.html\nThanks!\n"}
|
||||
{"cats": {"baking": 1.0, "not_baking": 0.0}, "meta": {"id": "17"}, "text": "How long and at what temperature do the various parts of a chicken need to be cooked?\nI'm interested in baking thighs, legs, breasts and wings. How long do each of these items need to bake and at what temperature?\n"}
|
||||
{"cats": {"baking": 1.0, "not_baking": 0.0}, "meta": {"id": "27"}, "text": "Do I need to sift flour that is labeled sifted?\nIs there really an advantage to sifting flour that I bought that was labeled 'sifted'?\n"}
|
2987
examples/training/textcat_example_data/jigsaw-toxic-comment.json
Normal file
2987
examples/training/textcat_example_data/jigsaw-toxic-comment.json
Normal file
File diff suppressed because it is too large
Load Diff
|
@ -0,0 +1,10 @@
|
|||
{"meta": {"id": "0000997932d777bf"}, "text": "Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}}
|
||||
{"meta": {"id": "001956c382006abd"}, "text": "I'm Sorry \n\nI'm sorry I screwed around with someones talk page. It was very bad to do. I know how having the templates on their talk page helps you assert your dominance over them. I know I should bow down to the almighty administrators. But then again, I'm going to go play outside....with your mom. 76.122.79.82", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 1}}
|
||||
{"meta": {"id": "0020e7119b96eeeb"}, "text": "Stupid peace of shit stop deleting my stuff asshole go die and fall in a hole go to hell!", "cats": {"insult": 1, "obscene": 1, "severe_toxic": 1, "toxic": 1}}
|
||||
{"meta": {"id": "000103f0d9cfb60f"}, "text": "D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}}
|
||||
{"meta": {"id": "001dc38a83d420cf"}, "text": "GET FUCKED UP. GET FUCKEEED UP. GOT A DRINK THAT YOU CANT PUT DOWN???/ GET FUCK UP GET FUCKED UP. I'M FUCKED UP RIGHT NOW!", "cats": {"insult": 0, "obscene": 1, "severe_toxic": 0, "toxic": 1}}
|
||||
{"meta": {"id": "000113f07ec002fd"}, "text": "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}}
|
||||
{"meta": {"id": "0001b41b1c6bb37e"}, "text": "\"\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of \"\"types of accidents\"\" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport \"", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}}
|
||||
{"meta": {"id": "0001d958c54c6e35"}, "text": "You, sir, are my hero. Any chance you remember what page that's on?", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}}
|
||||
{"meta": {"id": "00025465d4725e87"}, "text": "\"\n\nCongratulations from me as well, use the tools well. · talk \"", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 0}}
|
||||
{"meta": {"id": "002264ea4d5f2887"}, "text": "Why can't you believe how fat Artie is? Did you see him on his recent appearence on the Tonight Show with Jay Leno? He looks absolutely AWFUL! If I had to put money on it, I'd say that Artie Lange is a can't miss candidate for the 2007 Dead pool! \n\n \nKindly keep your malicious fingers off of my above comment, . Everytime you remove it, I will repost it!!!", "cats": {"insult": 0, "obscene": 0, "severe_toxic": 0, "toxic": 1}}
|
|
@ -0,0 +1,53 @@
|
|||
from pathlib import Path
|
||||
import plac
|
||||
import spacy
|
||||
from spacy.gold import docs_to_json
|
||||
import srsly
|
||||
import sys
|
||||
|
||||
@plac.annotations(
|
||||
model=("Model name. Defaults to 'en'.", "option", "m", str),
|
||||
input_file=("Input file (jsonl)", "positional", None, Path),
|
||||
output_dir=("Output directory", "positional", None, Path),
|
||||
n_texts=("Number of texts to convert", "option", "t", int),
|
||||
)
|
||||
def convert(model='en', input_file=None, output_dir=None, n_texts=0):
|
||||
# Load model with tokenizer + sentencizer only
|
||||
nlp = spacy.load(model)
|
||||
nlp.disable_pipes(*nlp.pipe_names)
|
||||
sentencizer = nlp.create_pipe("sentencizer")
|
||||
nlp.add_pipe(sentencizer, first=True)
|
||||
|
||||
texts = []
|
||||
cats = []
|
||||
count = 0
|
||||
|
||||
if not input_file.exists():
|
||||
print("Input file not found:", input_file)
|
||||
sys.exit(1)
|
||||
else:
|
||||
with open(input_file) as fileh:
|
||||
for line in fileh:
|
||||
data = srsly.json_loads(line)
|
||||
texts.append(data["text"])
|
||||
cats.append(data["cats"])
|
||||
|
||||
if output_dir is not None:
|
||||
output_dir = Path(output_dir)
|
||||
if not output_dir.exists():
|
||||
output_dir.mkdir()
|
||||
else:
|
||||
output_dir = Path(".")
|
||||
|
||||
docs = []
|
||||
for i, doc in enumerate(nlp.pipe(texts)):
|
||||
doc.cats = cats[i]
|
||||
docs.append(doc)
|
||||
if n_texts > 0 and count == n_texts:
|
||||
break
|
||||
count += 1
|
||||
|
||||
srsly.write_json(output_dir / input_file.with_suffix(".json"), [docs_to_json(docs)])
|
||||
|
||||
if __name__ == "__main__":
|
||||
plac.call(convert)
|
|
@ -270,7 +270,7 @@ def debug_data(
|
|||
|
||||
if "textcat" in pipeline:
|
||||
msg.divider("Text Classification")
|
||||
labels = [label for label in gold_train_data["textcat"]]
|
||||
labels = [label for label in gold_train_data["cats"]]
|
||||
model_labels = _get_labels_from_model(nlp, "textcat")
|
||||
new_labels = [l for l in labels if l not in model_labels]
|
||||
existing_labels = [l for l in labels if l in model_labels]
|
||||
|
@ -281,13 +281,44 @@ def debug_data(
|
|||
)
|
||||
if new_labels:
|
||||
labels_with_counts = _format_labels(
|
||||
gold_train_data["textcat"].most_common(), counts=True
|
||||
gold_train_data["cats"].most_common(), counts=True
|
||||
)
|
||||
msg.text("New: {}".format(labels_with_counts), show=verbose)
|
||||
if existing_labels:
|
||||
msg.text(
|
||||
"Existing: {}".format(_format_labels(existing_labels)), show=verbose
|
||||
)
|
||||
if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]):
|
||||
msg.fail(
|
||||
"The train and dev labels are not the same. "
|
||||
"Train labels: {}. "
|
||||
"Dev labels: {}.".format(
|
||||
_format_labels(gold_train_data["cats"]),
|
||||
_format_labels(gold_dev_data["cats"]),
|
||||
)
|
||||
)
|
||||
if gold_train_data["n_cats_multilabel"] > 0:
|
||||
msg.info("The train data contains instances without "
|
||||
"mutually-exclusive classes. Use '--textcat-multilabel' "
|
||||
"when training."
|
||||
)
|
||||
if gold_dev_data["n_cats_multilabel"] == 0:
|
||||
msg.warn(
|
||||
"Potential train/dev mismatch: the train data contains "
|
||||
"instances without mutually-exclusive classes while the "
|
||||
"dev data does not."
|
||||
)
|
||||
else:
|
||||
msg.info(
|
||||
"The train data contains only instances with "
|
||||
"mutually-exclusive classes."
|
||||
)
|
||||
if gold_dev_data["n_cats_multilabel"] > 0:
|
||||
msg.fail(
|
||||
"Train/dev mismatch: the dev data contains instances "
|
||||
"without mutually-exclusive classes while the train data "
|
||||
"contains only instances with mutually-exclusive classes."
|
||||
)
|
||||
|
||||
if "tagger" in pipeline:
|
||||
msg.divider("Part-of-speech Tagging")
|
||||
|
@ -450,6 +481,7 @@ def debug_data(
|
|||
)
|
||||
)
|
||||
|
||||
|
||||
msg.divider("Summary")
|
||||
good_counts = msg.counts[MESSAGES.GOOD]
|
||||
warn_counts = msg.counts[MESSAGES.WARN]
|
||||
|
@ -504,6 +536,7 @@ def _compile_gold(train_docs, pipeline):
|
|||
"n_sents": 0,
|
||||
"n_nonproj": 0,
|
||||
"n_cycles": 0,
|
||||
"n_cats_multilabel": 0,
|
||||
"texts": set(),
|
||||
}
|
||||
for doc, gold in train_docs:
|
||||
|
@ -526,6 +559,8 @@ def _compile_gold(train_docs, pipeline):
|
|||
data["ner"]["-"] += 1
|
||||
if "textcat" in pipeline:
|
||||
data["cats"].update(gold.cats)
|
||||
if list(gold.cats.values()).count(1.0) != 1:
|
||||
data["n_cats_multilabel"] += 1
|
||||
if "tagger" in pipeline:
|
||||
data["tags"].update([x for x in gold.tags if x is not None])
|
||||
if "parser" in pipeline:
|
||||
|
|
|
@ -61,6 +61,7 @@ def evaluate(
|
|||
"NER P": "%.2f" % scorer.ents_p,
|
||||
"NER R": "%.2f" % scorer.ents_r,
|
||||
"NER F": "%.2f" % scorer.ents_f,
|
||||
"Textcat": "%.2f" % scorer.textcat_score,
|
||||
}
|
||||
msg.table(results, title="Results")
|
||||
|
||||
|
|
|
@ -21,48 +21,24 @@ from .. import about
|
|||
|
||||
|
||||
@plac.annotations(
|
||||
# fmt: off
|
||||
lang=("Model language", "positional", None, str),
|
||||
output_path=("Output directory to store model in", "positional", None, Path),
|
||||
train_path=("Location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path=("Location of JSON-formatted development data", "positional", None, Path),
|
||||
raw_text=(
|
||||
"Path to jsonl file with unlabelled text documents.",
|
||||
"option",
|
||||
"rt",
|
||||
Path,
|
||||
),
|
||||
raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path),
|
||||
base_model=("Name of model to update (optional)", "option", "b", str),
|
||||
pipeline=("Comma-separated names of pipeline components", "option", "p", str),
|
||||
vectors=("Model to load vectors from", "option", "v", str),
|
||||
n_iter=("Number of iterations", "option", "n", int),
|
||||
n_early_stopping=(
|
||||
"Maximum number of training epochs without dev accuracy improvement",
|
||||
"option",
|
||||
"ne",
|
||||
int,
|
||||
),
|
||||
n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int),
|
||||
n_examples=("Number of examples", "option", "ns", int),
|
||||
use_gpu=("Use GPU", "option", "g", int),
|
||||
version=("Model version", "option", "V", str),
|
||||
meta_path=("Optional path to meta.json to use as base.", "option", "m", Path),
|
||||
init_tok2vec=(
|
||||
"Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.",
|
||||
"option",
|
||||
"t2v",
|
||||
Path,
|
||||
),
|
||||
parser_multitasks=(
|
||||
"Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'",
|
||||
"option",
|
||||
"pt",
|
||||
str,
|
||||
),
|
||||
entity_multitasks=(
|
||||
"Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'",
|
||||
"option",
|
||||
"et",
|
||||
str,
|
||||
),
|
||||
init_tok2vec=("Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", "option", "t2v", Path),
|
||||
parser_multitasks=("Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'", "option", "pt", str),
|
||||
entity_multitasks=("Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'", "option", "et", str),
|
||||
noise_level=("Amount of corruption for data augmentation", "option", "nl", float),
|
||||
orth_variant_level=(
|
||||
"Amount of orthography variation for data augmentation",
|
||||
|
@ -73,8 +49,12 @@ from .. import about
|
|||
eval_beam_widths=("Beam widths to evaluate, e.g. 4,8", "option", "bw", str),
|
||||
gold_preproc=("Use gold preprocessing", "flag", "G", bool),
|
||||
learn_tokens=("Make parser learn gold-standard tokenization", "flag", "T", bool),
|
||||
textcat_multilabel=("Textcat classes aren't mutually exclusive (multilabel)", "flag", "TML", bool),
|
||||
textcat_arch=("Textcat model architecture", "option", "ta", str),
|
||||
textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str),
|
||||
verbose=("Display more information for debug", "flag", "VV", bool),
|
||||
debug=("Run data diagnostics before training", "flag", "D", bool),
|
||||
# fmt: on
|
||||
)
|
||||
def train(
|
||||
lang,
|
||||
|
@ -99,6 +79,9 @@ def train(
|
|||
eval_beam_widths="",
|
||||
gold_preproc=False,
|
||||
learn_tokens=False,
|
||||
textcat_multilabel=False,
|
||||
textcat_arch="bow",
|
||||
textcat_positive_label=None,
|
||||
verbose=False,
|
||||
debug=False,
|
||||
):
|
||||
|
@ -184,9 +167,36 @@ def train(
|
|||
if pipe not in nlp.pipe_names:
|
||||
if pipe == "parser":
|
||||
pipe_cfg = {"learn_tokens": learn_tokens}
|
||||
elif pipe == "textcat":
|
||||
pipe_cfg = {
|
||||
"exclusive_classes": not textcat_multilabel,
|
||||
"architecture": textcat_arch,
|
||||
"positive_label": textcat_positive_label,
|
||||
}
|
||||
else:
|
||||
pipe_cfg = {}
|
||||
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
|
||||
else:
|
||||
if pipe == "textcat":
|
||||
textcat_cfg = nlp.get_pipe("textcat").cfg
|
||||
base_cfg = {
|
||||
"exclusive_classes": textcat_cfg["exclusive_classes"],
|
||||
"architecture": textcat_cfg["architecture"],
|
||||
"positive_label": textcat_cfg["positive_label"]
|
||||
}
|
||||
pipe_cfg = {
|
||||
"exclusive_classes": not textcat_multilabel,
|
||||
"architecture": textcat_arch,
|
||||
"positive_label": textcat_positive_label,
|
||||
}
|
||||
if base_cfg != pipe_cfg:
|
||||
msg.fail("The base textcat model configuration does"
|
||||
"not match the provided training options. "
|
||||
"Existing cfg: {}, provided cfg: {}".format(
|
||||
base_cfg, pipe_cfg
|
||||
),
|
||||
exits=1
|
||||
)
|
||||
else:
|
||||
msg.text("Starting with blank model '{}'".format(lang))
|
||||
lang_cls = util.get_lang_class(lang)
|
||||
|
@ -194,6 +204,12 @@ def train(
|
|||
for pipe in pipeline:
|
||||
if pipe == "parser":
|
||||
pipe_cfg = {"learn_tokens": learn_tokens}
|
||||
elif pipe == "textcat":
|
||||
pipe_cfg = {
|
||||
"exclusive_classes": not textcat_multilabel,
|
||||
"architecture": textcat_arch,
|
||||
"positive_label": textcat_positive_label,
|
||||
}
|
||||
else:
|
||||
pipe_cfg = {}
|
||||
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
|
||||
|
@ -234,12 +250,88 @@ def train(
|
|||
components = _load_pretrained_tok2vec(nlp, init_tok2vec)
|
||||
msg.text("Loaded pretrained tok2vec for: {}".format(components))
|
||||
|
||||
# Verify textcat config
|
||||
if "textcat" in pipeline:
|
||||
textcat_labels = nlp.get_pipe("textcat").cfg["labels"]
|
||||
if textcat_positive_label and textcat_positive_label not in textcat_labels:
|
||||
msg.fail(
|
||||
"The textcat_positive_label (tpl) '{}' does not match any "
|
||||
"label in the training data.".format(textcat_positive_label),
|
||||
exits=1,
|
||||
)
|
||||
if textcat_positive_label and len(textcat_labels) != 2:
|
||||
msg.fail(
|
||||
"A textcat_positive_label (tpl) '{}' was provided for training "
|
||||
"data that does not appear to be a binary classification "
|
||||
"problem with two labels.".format(textcat_positive_label),
|
||||
exits=1,
|
||||
)
|
||||
train_docs = corpus.train_docs(
|
||||
nlp, noise_level=noise_level, gold_preproc=gold_preproc, max_length=0
|
||||
)
|
||||
train_labels = set()
|
||||
if textcat_multilabel:
|
||||
multilabel_found = False
|
||||
for text, gold in train_docs:
|
||||
train_labels.update(gold.cats.keys())
|
||||
if list(gold.cats.values()).count(1.0) != 1:
|
||||
multilabel_found = True
|
||||
if not multilabel_found and not base_model:
|
||||
msg.warn(
|
||||
"The textcat training instances look like they have "
|
||||
"mutually-exclusive classes. Remove the flag "
|
||||
"'--textcat-multilabel' to train a classifier with "
|
||||
"mutually-exclusive classes."
|
||||
)
|
||||
if not textcat_multilabel:
|
||||
for text, gold in train_docs:
|
||||
train_labels.update(gold.cats.keys())
|
||||
if list(gold.cats.values()).count(1.0) != 1 and not base_model:
|
||||
msg.warn(
|
||||
"Some textcat training instances do not have exactly "
|
||||
"one positive label. Modifying training options to "
|
||||
"include the flag '--textcat-multilabel' for classes "
|
||||
"that are not mutually exclusive."
|
||||
)
|
||||
nlp.get_pipe("textcat").cfg["exclusive_classes"] = False
|
||||
textcat_multilabel = True
|
||||
break
|
||||
if base_model and set(textcat_labels) != train_labels:
|
||||
msg.fail(
|
||||
"Cannot extend textcat model using data with different "
|
||||
"labels. Base model labels: {}, training data labels: "
|
||||
"{}.".format(textcat_labels, list(train_labels)), exits=1
|
||||
)
|
||||
if textcat_multilabel:
|
||||
msg.text(
|
||||
"Textcat evaluation score: ROC AUC score macro-averaged across "
|
||||
"the labels '{}'".format(", ".join(textcat_labels))
|
||||
)
|
||||
elif textcat_positive_label and len(textcat_labels) == 2:
|
||||
msg.text(
|
||||
"Textcat evaluation score: F1-score for the "
|
||||
"label '{}'".format(textcat_positive_label)
|
||||
)
|
||||
elif len(textcat_labels) > 1:
|
||||
if len(textcat_labels) == 2:
|
||||
msg.warn(
|
||||
"If the textcat component is a binary classifier with "
|
||||
"exclusive classes, provide '--textcat_positive_label' for "
|
||||
"an evaluation on the positive class."
|
||||
)
|
||||
msg.text(
|
||||
"Textcat evaluation score: F1-score macro-averaged across "
|
||||
"the labels '{}'".format(", ".join(textcat_labels))
|
||||
)
|
||||
else:
|
||||
msg.fail(
|
||||
"Unsupported textcat configuration. Use `spacy debug-data` "
|
||||
"for more information."
|
||||
)
|
||||
|
||||
# fmt: off
|
||||
row_head = ["Itn", "Dep Loss", "NER Loss", "UAS", "NER P", "NER R", "NER F", "Tag %", "Token %", "CPU WPS", "GPU WPS"]
|
||||
row_widths = [3, 10, 10, 7, 7, 7, 7, 7, 7, 7, 7]
|
||||
if has_beam_widths:
|
||||
row_head.insert(1, "Beam W.")
|
||||
row_widths.insert(1, 7)
|
||||
row_head, output_stats = _configure_training_output(pipeline, use_gpu, has_beam_widths)
|
||||
row_widths = [len(w) for w in row_head]
|
||||
row_settings = {"widths": row_widths, "aligns": tuple(["r" for i in row_head]), "spacing": 2}
|
||||
# fmt: on
|
||||
print("")
|
||||
|
@ -297,7 +389,7 @@ def train(
|
|||
)
|
||||
nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
|
||||
start_time = timer()
|
||||
scorer = nlp_loaded.evaluate(dev_docs, debug)
|
||||
scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)
|
||||
end_time = timer()
|
||||
if use_gpu < 0:
|
||||
gpu_wps = None
|
||||
|
@ -313,7 +405,7 @@ def train(
|
|||
corpus.dev_docs(nlp_loaded, gold_preproc=gold_preproc)
|
||||
)
|
||||
start_time = timer()
|
||||
scorer = nlp_loaded.evaluate(dev_docs)
|
||||
scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose)
|
||||
end_time = timer()
|
||||
cpu_wps = nwords / (end_time - start_time)
|
||||
acc_loc = output_path / ("model%d" % i) / "accuracy.json"
|
||||
|
@ -355,10 +447,19 @@ def train(
|
|||
i,
|
||||
losses,
|
||||
scorer.scores,
|
||||
output_stats,
|
||||
beam_width=beam_width if has_beam_widths else None,
|
||||
cpu_wps=cpu_wps,
|
||||
gpu_wps=gpu_wps,
|
||||
)
|
||||
if i == 0 and "textcat" in pipeline:
|
||||
textcats_per_cat = scorer.scores.get("textcats_per_cat", {})
|
||||
for cat, cat_score in textcats_per_cat.items():
|
||||
if cat_score.get("roc_auc_score", 0) < 0:
|
||||
msg.warn(
|
||||
"Textcat ROC AUC score is undefined due to "
|
||||
"only one value in label '{}'.".format(cat)
|
||||
)
|
||||
msg.row(progress, **row_settings)
|
||||
# Early stopping
|
||||
if n_early_stopping is not None:
|
||||
|
@ -399,6 +500,8 @@ def _score_for_model(meta):
|
|||
mean_acc.append((acc["uas"] + acc["las"]) / 2)
|
||||
if "ner" in pipes:
|
||||
mean_acc.append((acc["ents_p"] + acc["ents_r"] + acc["ents_f"]) / 3)
|
||||
if "textcat" in pipes:
|
||||
mean_acc.append(acc["textcat_score"])
|
||||
return sum(mean_acc) / len(mean_acc)
|
||||
|
||||
|
||||
|
@ -482,40 +585,55 @@ def _get_metrics(component):
|
|||
return ("token_acc",)
|
||||
|
||||
|
||||
def _get_progress(itn, losses, dev_scores, beam_width=None, cpu_wps=0.0, gpu_wps=0.0):
|
||||
def _configure_training_output(pipeline, use_gpu, has_beam_widths):
|
||||
row_head = ["Itn"]
|
||||
output_stats = []
|
||||
for pipe in pipeline:
|
||||
if pipe == "tagger":
|
||||
row_head.extend(["Tag Loss ", " Tag % "])
|
||||
output_stats.extend(["tag_loss", "tags_acc"])
|
||||
elif pipe == "parser":
|
||||
row_head.extend(["Dep Loss ", " UAS ", " LAS "])
|
||||
output_stats.extend(["dep_loss", "uas", "las"])
|
||||
elif pipe == "ner":
|
||||
row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "])
|
||||
output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"])
|
||||
elif pipe == "textcat":
|
||||
row_head.extend(["Textcat Loss", "Textcat"])
|
||||
output_stats.extend(["textcat_loss", "textcat_score"])
|
||||
row_head.extend(["Token %", "CPU WPS"])
|
||||
output_stats.extend(["token_acc", "cpu_wps"])
|
||||
|
||||
if use_gpu >= 0:
|
||||
row_head.extend(["GPU WPS"])
|
||||
output_stats.extend(["gpu_wps"])
|
||||
|
||||
if has_beam_widths:
|
||||
row_head.insert(1, "Beam W.")
|
||||
return row_head, output_stats
|
||||
|
||||
|
||||
def _get_progress(
|
||||
itn, losses, dev_scores, output_stats, beam_width=None, cpu_wps=0.0, gpu_wps=0.0
|
||||
):
|
||||
scores = {}
|
||||
for col in [
|
||||
"dep_loss",
|
||||
"tag_loss",
|
||||
"uas",
|
||||
"tags_acc",
|
||||
"token_acc",
|
||||
"ents_p",
|
||||
"ents_r",
|
||||
"ents_f",
|
||||
"cpu_wps",
|
||||
"gpu_wps",
|
||||
]:
|
||||
scores[col] = 0.0
|
||||
for stat in output_stats:
|
||||
scores[stat] = 0.0
|
||||
scores["dep_loss"] = losses.get("parser", 0.0)
|
||||
scores["ner_loss"] = losses.get("ner", 0.0)
|
||||
scores["tag_loss"] = losses.get("tagger", 0.0)
|
||||
scores.update(dev_scores)
|
||||
scores["textcat_loss"] = losses.get("textcat", 0.0)
|
||||
scores["cpu_wps"] = cpu_wps
|
||||
scores["gpu_wps"] = gpu_wps or 0.0
|
||||
result = [
|
||||
itn,
|
||||
"{:.3f}".format(scores["dep_loss"]),
|
||||
"{:.3f}".format(scores["ner_loss"]),
|
||||
"{:.3f}".format(scores["uas"]),
|
||||
"{:.3f}".format(scores["ents_p"]),
|
||||
"{:.3f}".format(scores["ents_r"]),
|
||||
"{:.3f}".format(scores["ents_f"]),
|
||||
"{:.3f}".format(scores["tags_acc"]),
|
||||
"{:.3f}".format(scores["token_acc"]),
|
||||
"{:.0f}".format(scores["cpu_wps"]),
|
||||
"{:.0f}".format(scores["gpu_wps"]),
|
||||
]
|
||||
scores.update(dev_scores)
|
||||
formatted_scores = []
|
||||
for stat in output_stats:
|
||||
format_spec = "{:.3f}"
|
||||
if stat.endswith("_wps"):
|
||||
format_spec = "{:.0f}"
|
||||
formatted_scores.append(format_spec.format(scores[stat]))
|
||||
result = [itn + 1]
|
||||
result.extend(formatted_scores)
|
||||
if beam_width is not None:
|
||||
result.insert(1, beam_width)
|
||||
return result
|
||||
|
|
|
@ -457,6 +457,14 @@ class Errors(object):
|
|||
E160 = ("Can't find language data file: {path}")
|
||||
E161 = ("Found an internal inconsistency when predicting entity links. "
|
||||
"This is likely a bug in spaCy, so feel free to open an issue.")
|
||||
E162 = ("Cannot evaluate textcat model on data with different labels.\n"
|
||||
"Labels in model: {model_labels}\nLabels in evaluation "
|
||||
"data: {eval_labels}")
|
||||
E163 = ("cumsum was found to be unstable: its last element does not "
|
||||
"correspond to sum")
|
||||
E164 = ("x is neither increasing nor decreasing: {}.")
|
||||
E165 = ("Only one class present in y_true. ROC AUC score is not defined in "
|
||||
"that case.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
|
|
@ -57,6 +57,7 @@ def tags_to_entities(tags):
|
|||
def merge_sents(sents):
|
||||
m_deps = [[], [], [], [], [], []]
|
||||
m_brackets = []
|
||||
m_cats = sents.pop()
|
||||
i = 0
|
||||
for (ids, words, tags, heads, labels, ner), brackets in sents:
|
||||
m_deps[0].extend(id_ + i for id_ in ids)
|
||||
|
@ -68,6 +69,7 @@ def merge_sents(sents):
|
|||
m_brackets.extend((b["first"] + i, b["last"] + i, b["label"])
|
||||
for b in brackets)
|
||||
i += len(ids)
|
||||
m_deps.append(m_cats)
|
||||
return [(m_deps, m_brackets)]
|
||||
|
||||
|
||||
|
@ -199,6 +201,7 @@ class GoldCorpus(object):
|
|||
n = 0
|
||||
i = 0
|
||||
for raw_text, paragraph_tuples in self.train_tuples:
|
||||
cats = paragraph_tuples.pop()
|
||||
for sent_tuples, brackets in paragraph_tuples:
|
||||
n += len(sent_tuples[1])
|
||||
if self.limit and i >= self.limit:
|
||||
|
@ -260,11 +263,7 @@ class GoldCorpus(object):
|
|||
if len(docs) != len(paragraph_tuples):
|
||||
n_annots = len(paragraph_tuples)
|
||||
raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots))
|
||||
if len(docs) == 1:
|
||||
return [GoldParse.from_annot_tuples(docs[0], paragraph_tuples[0][0],
|
||||
make_projective=make_projective)]
|
||||
else:
|
||||
return [GoldParse.from_annot_tuples(doc, sent_tuples,
|
||||
return [GoldParse.from_annot_tuples(doc, sent_tuples,
|
||||
make_projective=make_projective)
|
||||
for doc, (sent_tuples, brackets)
|
||||
in zip(docs, paragraph_tuples)]
|
||||
|
@ -415,6 +414,10 @@ def json_to_tuple(doc):
|
|||
sents.append([
|
||||
[ids, words, tags, heads, labels, ner],
|
||||
sent.get("brackets", [])])
|
||||
cats = {}
|
||||
for cat in paragraph.get("cats", {}):
|
||||
cats[cat["label"]] = cat["value"]
|
||||
sents.append(cats)
|
||||
if sents:
|
||||
yield [paragraph.get("raw", None), sents]
|
||||
|
||||
|
@ -528,9 +531,10 @@ cdef class GoldParse:
|
|||
"""
|
||||
@classmethod
|
||||
def from_annot_tuples(cls, doc, annot_tuples, make_projective=False):
|
||||
_, words, tags, heads, deps, entities = annot_tuples
|
||||
_, words, tags, heads, deps, entities, cats = annot_tuples
|
||||
return cls(doc, words=words, tags=tags, heads=heads, deps=deps,
|
||||
entities=entities, make_projective=make_projective)
|
||||
entities=entities, cats=cats,
|
||||
make_projective=make_projective)
|
||||
|
||||
def __init__(self, doc, annot_tuples=None, words=None, tags=None, morphology=None,
|
||||
heads=None, deps=None, entities=None, make_projective=False,
|
||||
|
@ -739,7 +743,10 @@ def docs_to_json(docs, id=0):
|
|||
docs = [docs]
|
||||
json_doc = {"id": id, "paragraphs": []}
|
||||
for i, doc in enumerate(docs):
|
||||
json_para = {'raw': doc.text, "sentences": []}
|
||||
json_para = {'raw': doc.text, "sentences": [], "cats": []}
|
||||
for cat, val in doc.cats.items():
|
||||
json_cat = {"label": cat, "value": val}
|
||||
json_para["cats"].append(json_cat)
|
||||
ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||||
biluo_tags = biluo_tags_from_offsets(doc, ent_offsets)
|
||||
for j, sent in enumerate(doc.sents):
|
||||
|
|
|
@ -591,6 +591,7 @@ class Language(object):
|
|||
# Populate vocab
|
||||
else:
|
||||
for _, annots_brackets in get_gold_tuples():
|
||||
_ = annots_brackets.pop()
|
||||
for annots, _ in annots_brackets:
|
||||
for word in annots[1]:
|
||||
_ = self.vocab[word] # noqa: F841
|
||||
|
@ -659,7 +660,7 @@ class Language(object):
|
|||
DOCS: https://spacy.io/api/language#evaluate
|
||||
"""
|
||||
if scorer is None:
|
||||
scorer = Scorer()
|
||||
scorer = Scorer(pipeline=self.pipeline)
|
||||
if component_cfg is None:
|
||||
component_cfg = {}
|
||||
docs, golds = zip(*docs_golds)
|
||||
|
|
|
@ -504,6 +504,7 @@ class Tagger(Pipe):
|
|||
orig_tag_map = dict(self.vocab.morphology.tag_map)
|
||||
new_tag_map = OrderedDict()
|
||||
for raw_text, annots_brackets in get_gold_tuples():
|
||||
_ = annots_brackets.pop()
|
||||
for annots, brackets in annots_brackets:
|
||||
ids, words, tags, heads, deps, ents = annots
|
||||
for tag in tags:
|
||||
|
@ -1021,6 +1022,10 @@ class TextCategorizer(Pipe):
|
|||
return 1
|
||||
|
||||
def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs):
|
||||
for raw_text, annots_brackets in get_gold_tuples():
|
||||
cats = annots_brackets.pop()
|
||||
for cat in cats:
|
||||
self.add_label(cat)
|
||||
if self.model is True:
|
||||
self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors")
|
||||
self.require_labels()
|
||||
|
|
381
spacy/scorer.py
381
spacy/scorer.py
|
@ -1,7 +1,10 @@
|
|||
# coding: utf8
|
||||
from __future__ import division, print_function, unicode_literals
|
||||
|
||||
import numpy as np
|
||||
|
||||
from .gold import tags_to_entities, GoldParse
|
||||
from .errors import Errors
|
||||
|
||||
|
||||
class PRFScore(object):
|
||||
|
@ -34,10 +37,39 @@ class PRFScore(object):
|
|||
return 2 * ((p * r) / (p + r + 1e-100))
|
||||
|
||||
|
||||
class ROCAUCScore(object):
|
||||
"""
|
||||
An AUC ROC score.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self.golds = []
|
||||
self.cands = []
|
||||
self.saved_score = 0.0
|
||||
self.saved_score_at_len = 0
|
||||
|
||||
def score_set(self, cand, gold):
|
||||
self.cands.append(cand)
|
||||
self.golds.append(gold)
|
||||
|
||||
@property
|
||||
def score(self):
|
||||
if len(self.golds) == self.saved_score_at_len:
|
||||
return self.saved_score
|
||||
try:
|
||||
self.saved_score = _roc_auc_score(self.golds, self.cands)
|
||||
# catch ValueError: Only one class present in y_true.
|
||||
# ROC AUC score is not defined in that case.
|
||||
except:
|
||||
self.saved_score = -float("inf")
|
||||
self.saved_score_at_len = len(self.golds)
|
||||
return self.saved_score
|
||||
|
||||
|
||||
class Scorer(object):
|
||||
"""Compute evaluation scores."""
|
||||
|
||||
def __init__(self, eval_punct=False):
|
||||
def __init__(self, eval_punct=False, pipeline=None):
|
||||
"""Initialize the Scorer.
|
||||
|
||||
eval_punct (bool): Evaluate the dependency attachments to and from
|
||||
|
@ -54,6 +86,24 @@ class Scorer(object):
|
|||
self.ner = PRFScore()
|
||||
self.ner_per_ents = dict()
|
||||
self.eval_punct = eval_punct
|
||||
self.textcat = None
|
||||
self.textcat_per_cat = dict()
|
||||
self.textcat_positive_label = None
|
||||
self.textcat_multilabel = False
|
||||
|
||||
if pipeline:
|
||||
for name, model in pipeline:
|
||||
if name == "textcat":
|
||||
self.textcat_positive_label = model.cfg.get("positive_label", None)
|
||||
if self.textcat_positive_label:
|
||||
self.textcat = PRFScore()
|
||||
if not model.cfg.get("exclusive_classes", False):
|
||||
self.textcat_multilabel = True
|
||||
for label in model.cfg.get("labels", []):
|
||||
self.textcat_per_cat[label] = ROCAUCScore()
|
||||
else:
|
||||
for label in model.cfg.get("labels", []):
|
||||
self.textcat_per_cat[label] = PRFScore()
|
||||
|
||||
@property
|
||||
def tags_acc(self):
|
||||
|
@ -101,10 +151,47 @@ class Scorer(object):
|
|||
for k, v in self.ner_per_ents.items()
|
||||
}
|
||||
|
||||
@property
|
||||
def textcat_score(self):
|
||||
"""RETURNS (float): f-score on positive label for binary exclusive,
|
||||
macro-averaged f-score for 3+ exclusive,
|
||||
macro-averaged AUC ROC score for multilabel (-1 if undefined)
|
||||
"""
|
||||
if not self.textcat_multilabel:
|
||||
# binary multiclass
|
||||
if self.textcat_positive_label:
|
||||
return self.textcat.fscore * 100
|
||||
# other multiclass
|
||||
return (
|
||||
sum([score.fscore for label, score in self.textcat_per_cat.items()])
|
||||
/ (len(self.textcat_per_cat) + 1e-100)
|
||||
* 100
|
||||
)
|
||||
# multilabel
|
||||
return max(
|
||||
sum([score.score for label, score in self.textcat_per_cat.items()])
|
||||
/ (len(self.textcat_per_cat) + 1e-100),
|
||||
-1,
|
||||
)
|
||||
|
||||
@property
|
||||
def textcats_per_cat(self):
|
||||
"""RETURNS (dict): Scores per textcat label.
|
||||
"""
|
||||
if not self.textcat_multilabel:
|
||||
return {
|
||||
k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100}
|
||||
for k, v in self.textcat_per_cat.items()
|
||||
}
|
||||
return {
|
||||
k: {"roc_auc_score": max(v.score, -1)}
|
||||
for k, v in self.textcat_per_cat.items()
|
||||
}
|
||||
|
||||
@property
|
||||
def scores(self):
|
||||
"""RETURNS (dict): All scores with keys `uas`, `las`, `ents_p`,
|
||||
`ents_r`, `ents_f`, `tags_acc` and `token_acc`.
|
||||
`ents_r`, `ents_f`, `tags_acc`, `token_acc`, and `textcat_score`.
|
||||
"""
|
||||
return {
|
||||
"uas": self.uas,
|
||||
|
@ -115,6 +202,8 @@ class Scorer(object):
|
|||
"ents_per_type": self.ents_per_type,
|
||||
"tags_acc": self.tags_acc,
|
||||
"token_acc": self.token_acc,
|
||||
"textcat_score": self.textcat_score,
|
||||
"textcats_per_cat": self.textcats_per_cat,
|
||||
}
|
||||
|
||||
def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")):
|
||||
|
@ -192,9 +281,297 @@ class Scorer(object):
|
|||
self.unlabelled.score_set(
|
||||
set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps)
|
||||
)
|
||||
if (
|
||||
len(gold.cats) > 0
|
||||
and set(self.textcat_per_cat) == set(gold.cats)
|
||||
and set(gold.cats) == set(doc.cats)
|
||||
):
|
||||
goldcat = max(gold.cats, key=gold.cats.get)
|
||||
candcat = max(doc.cats, key=doc.cats.get)
|
||||
if self.textcat_positive_label:
|
||||
self.textcat.score_set(
|
||||
set([self.textcat_positive_label]) & set([candcat]),
|
||||
set([self.textcat_positive_label]) & set([goldcat]),
|
||||
)
|
||||
for label in self.textcat_per_cat:
|
||||
if self.textcat_multilabel:
|
||||
self.textcat_per_cat[label].score_set(
|
||||
doc.cats[label], gold.cats[label]
|
||||
)
|
||||
else:
|
||||
self.textcat_per_cat[label].score_set(
|
||||
set([label]) & set([candcat]), set([label]) & set([goldcat])
|
||||
)
|
||||
elif len(self.textcat_per_cat) > 0:
|
||||
model_labels = set(self.textcat_per_cat)
|
||||
eval_labels = set(gold.cats)
|
||||
raise ValueError(
|
||||
Errors.E162.format(model_labels=model_labels, eval_labels=eval_labels)
|
||||
)
|
||||
if verbose:
|
||||
gold_words = [item[1] for item in gold.orig_annot]
|
||||
for w_id, h_id, dep in cand_deps - gold_deps:
|
||||
print("F", gold_words[w_id], dep, gold_words[h_id])
|
||||
for w_id, h_id, dep in gold_deps - cand_deps:
|
||||
print("M", gold_words[w_id], dep, gold_words[h_id])
|
||||
|
||||
|
||||
#############################################################################
|
||||
#
|
||||
# The following implementation of roc_auc_score() is adapted from
|
||||
# scikit-learn, which is distributed under the following license:
|
||||
#
|
||||
# New BSD License
|
||||
#
|
||||
# Copyright (c) 2007–2019 The scikit-learn developers.
|
||||
# All rights reserved.
|
||||
#
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
#
|
||||
# a. Redistributions of source code must retain the above copyright notice,
|
||||
# this list of conditions and the following disclaimer.
|
||||
# b. Redistributions in binary form must reproduce the above copyright
|
||||
# notice, this list of conditions and the following disclaimer in the
|
||||
# documentation and/or other materials provided with the distribution.
|
||||
# c. Neither the name of the Scikit-learn Developers nor the names of
|
||||
# its contributors may be used to endorse or promote products
|
||||
# derived from this software without specific prior written
|
||||
# permission.
|
||||
#
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||
# ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR
|
||||
# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
|
||||
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
|
||||
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
||||
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
||||
# OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
|
||||
# DAMAGE.
|
||||
|
||||
def _roc_auc_score(y_true, y_score):
|
||||
"""Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
|
||||
from prediction scores.
|
||||
|
||||
Note: this implementation is restricted to the binary classification task
|
||||
|
||||
Parameters
|
||||
----------
|
||||
y_true : array, shape = [n_samples] or [n_samples, n_classes]
|
||||
True binary labels or binary label indicators.
|
||||
The multiclass case expects shape = [n_samples] and labels
|
||||
with values in ``range(n_classes)``.
|
||||
|
||||
y_score : array, shape = [n_samples] or [n_samples, n_classes]
|
||||
Target scores, can either be probability estimates of the positive
|
||||
class, confidence values, or non-thresholded measure of decisions
|
||||
(as returned by "decision_function" on some classifiers). For binary
|
||||
y_true, y_score is supposed to be the score of the class with greater
|
||||
label. The multiclass case expects shape = [n_samples, n_classes]
|
||||
where the scores correspond to probability estimates.
|
||||
|
||||
Returns
|
||||
-------
|
||||
auc : float
|
||||
|
||||
References
|
||||
----------
|
||||
.. [1] `Wikipedia entry for the Receiver operating characteristic
|
||||
<https://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_
|
||||
|
||||
.. [2] Fawcett T. An introduction to ROC analysis[J]. Pattern Recognition
|
||||
Letters, 2006, 27(8):861-874.
|
||||
|
||||
.. [3] `Analyzing a portion of the ROC curve. McClish, 1989
|
||||
<https://www.ncbi.nlm.nih.gov/pubmed/2668680>`_
|
||||
"""
|
||||
if len(np.unique(y_true)) != 2:
|
||||
raise ValueError(Errors.E165)
|
||||
fpr, tpr, _ = _roc_curve(y_true, y_score)
|
||||
return _auc(fpr, tpr)
|
||||
|
||||
|
||||
def _roc_curve(y_true, y_score):
|
||||
"""Compute Receiver operating characteristic (ROC)
|
||||
|
||||
Note: this implementation is restricted to the binary classification task.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
|
||||
y_true : array, shape = [n_samples]
|
||||
True binary labels. If labels are not either {-1, 1} or {0, 1}, then
|
||||
pos_label should be explicitly given.
|
||||
|
||||
y_score : array, shape = [n_samples]
|
||||
Target scores, can either be probability estimates of the positive
|
||||
class, confidence values, or non-thresholded measure of decisions
|
||||
(as returned by "decision_function" on some classifiers).
|
||||
|
||||
Returns
|
||||
-------
|
||||
fpr : array, shape = [>2]
|
||||
Increasing false positive rates such that element i is the false
|
||||
positive rate of predictions with score >= thresholds[i].
|
||||
|
||||
tpr : array, shape = [>2]
|
||||
Increasing true positive rates such that element i is the true
|
||||
positive rate of predictions with score >= thresholds[i].
|
||||
|
||||
thresholds : array, shape = [n_thresholds]
|
||||
Decreasing thresholds on the decision function used to compute
|
||||
fpr and tpr. `thresholds[0]` represents no instances being predicted
|
||||
and is arbitrarily set to `max(y_score) + 1`.
|
||||
|
||||
Notes
|
||||
-----
|
||||
Since the thresholds are sorted from low to high values, they
|
||||
are reversed upon returning them to ensure they correspond to both ``fpr``
|
||||
and ``tpr``, which are sorted in reversed order during their calculation.
|
||||
|
||||
References
|
||||
----------
|
||||
.. [1] `Wikipedia entry for the Receiver operating characteristic
|
||||
<https://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_
|
||||
|
||||
.. [2] Fawcett T. An introduction to ROC analysis[J]. Pattern Recognition
|
||||
Letters, 2006, 27(8):861-874.
|
||||
"""
|
||||
fps, tps, thresholds = _binary_clf_curve(y_true, y_score)
|
||||
|
||||
# Add an extra threshold position
|
||||
# to make sure that the curve starts at (0, 0)
|
||||
tps = np.r_[0, tps]
|
||||
fps = np.r_[0, fps]
|
||||
thresholds = np.r_[thresholds[0] + 1, thresholds]
|
||||
|
||||
if fps[-1] <= 0:
|
||||
fpr = np.repeat(np.nan, fps.shape)
|
||||
else:
|
||||
fpr = fps / fps[-1]
|
||||
|
||||
if tps[-1] <= 0:
|
||||
tpr = np.repeat(np.nan, tps.shape)
|
||||
else:
|
||||
tpr = tps / tps[-1]
|
||||
|
||||
return fpr, tpr, thresholds
|
||||
|
||||
|
||||
def _binary_clf_curve(y_true, y_score):
|
||||
"""Calculate true and false positives per binary classification threshold.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
y_true : array, shape = [n_samples]
|
||||
True targets of binary classification
|
||||
|
||||
y_score : array, shape = [n_samples]
|
||||
Estimated probabilities or decision function
|
||||
|
||||
Returns
|
||||
-------
|
||||
fps : array, shape = [n_thresholds]
|
||||
A count of false positives, at index i being the number of negative
|
||||
samples assigned a score >= thresholds[i]. The total number of
|
||||
negative samples is equal to fps[-1] (thus true negatives are given by
|
||||
fps[-1] - fps).
|
||||
|
||||
tps : array, shape = [n_thresholds <= len(np.unique(y_score))]
|
||||
An increasing count of true positives, at index i being the number
|
||||
of positive samples assigned a score >= thresholds[i]. The total
|
||||
number of positive samples is equal to tps[-1] (thus false negatives
|
||||
are given by tps[-1] - tps).
|
||||
|
||||
thresholds : array, shape = [n_thresholds]
|
||||
Decreasing score values.
|
||||
"""
|
||||
pos_label = 1.
|
||||
|
||||
y_true = np.ravel(y_true)
|
||||
y_score = np.ravel(y_score)
|
||||
|
||||
# make y_true a boolean vector
|
||||
y_true = (y_true == pos_label)
|
||||
|
||||
# sort scores and corresponding truth values
|
||||
desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
|
||||
y_score = y_score[desc_score_indices]
|
||||
y_true = y_true[desc_score_indices]
|
||||
weight = 1.
|
||||
|
||||
# y_score typically has many tied values. Here we extract
|
||||
# the indices associated with the distinct values. We also
|
||||
# concatenate a value for the end of the curve.
|
||||
distinct_value_indices = np.where(np.diff(y_score))[0]
|
||||
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
|
||||
|
||||
# accumulate the true positives with decreasing threshold
|
||||
tps = _stable_cumsum(y_true * weight)[threshold_idxs]
|
||||
fps = 1 + threshold_idxs - tps
|
||||
return fps, tps, y_score[threshold_idxs]
|
||||
|
||||
|
||||
def _stable_cumsum(arr, axis=None, rtol=1e-05, atol=1e-08):
|
||||
"""Use high precision for cumsum and check that final value matches sum
|
||||
|
||||
Parameters
|
||||
----------
|
||||
arr : array-like
|
||||
To be cumulatively summed as flat
|
||||
axis : int, optional
|
||||
Axis along which the cumulative sum is computed.
|
||||
The default (None) is to compute the cumsum over the flattened array.
|
||||
rtol : float
|
||||
Relative tolerance, see ``np.allclose``
|
||||
atol : float
|
||||
Absolute tolerance, see ``np.allclose``
|
||||
"""
|
||||
out = np.cumsum(arr, axis=axis, dtype=np.float64)
|
||||
expected = np.sum(arr, axis=axis, dtype=np.float64)
|
||||
if not np.all(np.isclose(out.take(-1, axis=axis), expected, rtol=rtol,
|
||||
atol=atol, equal_nan=True)):
|
||||
raise ValueError(Errors.E163)
|
||||
return out
|
||||
|
||||
|
||||
def _auc(x, y):
|
||||
"""Compute Area Under the Curve (AUC) using the trapezoidal rule
|
||||
|
||||
This is a general function, given points on a curve. For computing the
|
||||
area under the ROC-curve, see :func:`roc_auc_score`.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
x : array, shape = [n]
|
||||
x coordinates. These must be either monotonic increasing or monotonic
|
||||
decreasing.
|
||||
y : array, shape = [n]
|
||||
y coordinates.
|
||||
|
||||
Returns
|
||||
-------
|
||||
auc : float
|
||||
"""
|
||||
x = np.ravel(x)
|
||||
y = np.ravel(y)
|
||||
|
||||
direction = 1
|
||||
dx = np.diff(x)
|
||||
if np.any(dx < 0):
|
||||
if np.all(dx <= 0):
|
||||
direction = -1
|
||||
else:
|
||||
raise ValueError(Errors.E164.format(x))
|
||||
|
||||
area = direction * np.trapz(y, x)
|
||||
if isinstance(area, np.memmap):
|
||||
# Reductions such as .sum used internally in np.trapz do not return a
|
||||
# scalar by default for numpy.memmap instances contrary to
|
||||
# regular numpy.ndarray instances.
|
||||
area = area.dtype.type(area)
|
||||
return area
|
||||
|
|
|
@ -342,6 +342,7 @@ cdef class ArcEager(TransitionSystem):
|
|||
actions[RIGHT][label] = 1
|
||||
actions[REDUCE][label] = 1
|
||||
for raw_text, sents in kwargs.get('gold_parses', []):
|
||||
_ = sents.pop()
|
||||
for (ids, words, tags, heads, labels, iob), ctnts in sents:
|
||||
heads, labels = nonproj.projectivize(heads, labels)
|
||||
for child, head, label in zip(ids, heads, labels):
|
||||
|
|
|
@ -72,6 +72,7 @@ cdef class BiluoPushDown(TransitionSystem):
|
|||
actions[action][entity_type] = 1
|
||||
moves = ('M', 'B', 'I', 'L', 'U')
|
||||
for raw_text, sents in kwargs.get('gold_parses', []):
|
||||
_ = sents.pop()
|
||||
for (ids, words, tags, heads, labels, biluo), _ in sents:
|
||||
for i, ner_tag in enumerate(biluo):
|
||||
if ner_tag != 'O' and ner_tag != '-':
|
||||
|
|
|
@ -589,6 +589,7 @@ cdef class Parser:
|
|||
doc_sample = []
|
||||
gold_sample = []
|
||||
for raw_text, annots_brackets in islice(get_gold_tuples(), 1000):
|
||||
_ = annots_brackets.pop()
|
||||
for annots, brackets in annots_brackets:
|
||||
ids, words, tags, heads, deps, ents = annots
|
||||
doc_sample.append(Doc(self.vocab, words=words))
|
||||
|
|
|
@ -3,8 +3,12 @@ from __future__ import unicode_literals
|
|||
|
||||
from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags
|
||||
from spacy.gold import spans_from_biluo_tags, GoldParse
|
||||
from spacy.gold import GoldCorpus, docs_to_json
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Doc
|
||||
from .util import make_tempdir
|
||||
import pytest
|
||||
import srsly
|
||||
|
||||
|
||||
def test_gold_biluo_U(en_vocab):
|
||||
|
@ -81,3 +85,28 @@ def test_gold_ner_missing_tags(en_tokenizer):
|
|||
doc = en_tokenizer("I flew to Silicon Valley via London.")
|
||||
biluo_tags = [None, "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"]
|
||||
gold = GoldParse(doc, entities=biluo_tags) # noqa: F841
|
||||
|
||||
|
||||
def test_roundtrip_docs_to_json():
|
||||
text = "I flew to Silicon Valley via London."
|
||||
cats = {"TRAVEL": 1.0, "BAKING": 0.0}
|
||||
nlp = English()
|
||||
doc = nlp(text)
|
||||
doc.cats = cats
|
||||
doc[0].is_sent_start = True
|
||||
for i in range(1, len(doc)):
|
||||
doc[i].is_sent_start = False
|
||||
|
||||
with make_tempdir() as tmpdir:
|
||||
json_file = tmpdir / "roundtrip.json"
|
||||
srsly.write_json(json_file, [docs_to_json(doc)])
|
||||
goldcorpus = GoldCorpus(str(json_file), str(json_file))
|
||||
|
||||
reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp))
|
||||
|
||||
assert len(doc) == goldcorpus.count_train()
|
||||
assert text == reloaded_doc.text
|
||||
assert "TRAVEL" in goldparse.cats
|
||||
assert "BAKING" in goldparse.cats
|
||||
assert cats["TRAVEL"] == goldparse.cats["TRAVEL"]
|
||||
assert cats["BAKING"] == goldparse.cats["BAKING"]
|
||||
|
|
|
@ -1,9 +1,14 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy as np
|
||||
from numpy.testing import assert_almost_equal, assert_array_almost_equal
|
||||
import pytest
|
||||
from pytest import approx
|
||||
from spacy.errors import Errors
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.scorer import Scorer
|
||||
from spacy.scorer import Scorer, ROCAUCScore
|
||||
from spacy.scorer import _roc_auc_score, _roc_curve
|
||||
from .util import get_doc
|
||||
|
||||
test_ner_cardinal = [
|
||||
|
@ -66,3 +71,74 @@ def test_ner_per_type(en_vocab):
|
|||
assert results["ents_per_type"]["ORG"]["p"] == 50
|
||||
assert results["ents_per_type"]["ORG"]["r"] == 100
|
||||
assert results["ents_per_type"]["ORG"]["f"] == approx(66.66666)
|
||||
|
||||
|
||||
def test_roc_auc_score():
|
||||
# Binary classification, toy tests from scikit-learn test suite
|
||||
y_true = [0, 1]
|
||||
y_score = [0, 1]
|
||||
tpr, fpr, _ = _roc_curve(y_true, y_score)
|
||||
roc_auc = _roc_auc_score(y_true, y_score)
|
||||
assert_array_almost_equal(tpr, [0, 0, 1])
|
||||
assert_array_almost_equal(fpr, [0, 1, 1])
|
||||
assert_almost_equal(roc_auc, 1.)
|
||||
|
||||
y_true = [0, 1]
|
||||
y_score = [1, 0]
|
||||
tpr, fpr, _ = _roc_curve(y_true, y_score)
|
||||
roc_auc = _roc_auc_score(y_true, y_score)
|
||||
assert_array_almost_equal(tpr, [0, 1, 1])
|
||||
assert_array_almost_equal(fpr, [0, 0, 1])
|
||||
assert_almost_equal(roc_auc, 0.)
|
||||
|
||||
y_true = [1, 0]
|
||||
y_score = [1, 1]
|
||||
tpr, fpr, _ = _roc_curve(y_true, y_score)
|
||||
roc_auc = _roc_auc_score(y_true, y_score)
|
||||
assert_array_almost_equal(tpr, [0, 1])
|
||||
assert_array_almost_equal(fpr, [0, 1])
|
||||
assert_almost_equal(roc_auc, 0.5)
|
||||
|
||||
y_true = [1, 0]
|
||||
y_score = [1, 0]
|
||||
tpr, fpr, _ = _roc_curve(y_true, y_score)
|
||||
roc_auc = _roc_auc_score(y_true, y_score)
|
||||
assert_array_almost_equal(tpr, [0, 0, 1])
|
||||
assert_array_almost_equal(fpr, [0, 1, 1])
|
||||
assert_almost_equal(roc_auc, 1.)
|
||||
|
||||
y_true = [1, 0]
|
||||
y_score = [0.5, 0.5]
|
||||
tpr, fpr, _ = _roc_curve(y_true, y_score)
|
||||
roc_auc = _roc_auc_score(y_true, y_score)
|
||||
assert_array_almost_equal(tpr, [0, 1])
|
||||
assert_array_almost_equal(fpr, [0, 1])
|
||||
assert_almost_equal(roc_auc, .5)
|
||||
|
||||
# same result as above with ROCAUCScore wrapper
|
||||
score = ROCAUCScore()
|
||||
score.score_set(0.5, 1)
|
||||
score.score_set(0.5, 0)
|
||||
assert_almost_equal(score.score, .5)
|
||||
|
||||
|
||||
# check that errors are raised in undefined cases and score is -inf
|
||||
y_true = [0, 0]
|
||||
y_score = [0.25, 0.75]
|
||||
with pytest.raises(ValueError):
|
||||
_roc_auc_score(y_true, y_score)
|
||||
|
||||
score = ROCAUCScore()
|
||||
score.score_set(0.25, 0)
|
||||
score.score_set(0.75, 0)
|
||||
assert score.score == -float("inf")
|
||||
|
||||
y_true = [1, 1]
|
||||
y_score = [0.25, 0.75]
|
||||
with pytest.raises(ValueError):
|
||||
_roc_auc_score(y_true, y_score)
|
||||
|
||||
score = ROCAUCScore()
|
||||
score.score_set(0.25, 1)
|
||||
score.score_set(0.75, 1)
|
||||
assert score.score == -float("inf")
|
||||
|
|
|
@ -552,6 +552,10 @@ spaCy's JSON format, you can use the
|
|||
"last": int, # index of last token
|
||||
"label": string # phrase label
|
||||
}]
|
||||
}],
|
||||
"cats": [{ # new in v2.2: categories for text classifier
|
||||
"label": string, # text category label
|
||||
"value": float / bool # label applies (1.0/true) or not (0.0/false)
|
||||
}]
|
||||
}]
|
||||
}]
|
||||
|
|
|
@ -361,9 +361,10 @@ will only train the tagger and parser.
|
|||
|
||||
```bash
|
||||
$ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
||||
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping] [--n-examples] [--use-gpu]
|
||||
[--version] [--meta-path] [--init-tok2vec] [--parser-multitasks]
|
||||
[--entity-multitasks] [--gold-preproc] [--noise-level] [--learn-tokens]
|
||||
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping]
|
||||
[--n-examples] [--use-gpu] [--version] [--meta-path] [--init-tok2vec]
|
||||
[--parser-multitasks] [--entity-multitasks] [--gold-preproc] [--noise-level]
|
||||
[--learn-tokens] [--textcat-arch] [--textcat-multilabel] [--textcat-positive-label]
|
||||
[--verbose]
|
||||
```
|
||||
|
||||
|
@ -387,7 +388,10 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
|
|||
| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
|
||||
| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
|
||||
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
|
||||
| `--learn-tokens`, `-T` | flag | Make parser learn gold-standard tokenization by merging subtokens. Typically used for languages like Chinese. |
|
||||
| `--learn-tokens`, `-T` | flag | Make parser learn gold-standard tokenization by merging ] subtokens. Typically used for languages like Chinese. |
|
||||
| `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
|
||||
| `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
|
||||
| `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option |Text classification positive label for binary classes with two labels. |
|
||||
| `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
|
||||
| `--help`, `-h` | flag | Show help message and available arguments. |
|
||||
| **CREATES** | model, pickle | A spaCy model on each epoch. |
|
||||
|
|
|
@ -46,14 +46,16 @@ Update the evaluation scores from a single [`Doc`](/api/doc) /
|
|||
|
||||
## Properties
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------------------------------------- | ----- | ------------------------------------------------------------------------------------------------------------- |
|
||||
| `token_acc` | float | Tokenization accuracy. |
|
||||
| `tags_acc` | float | Part-of-speech tag accuracy (fine grained tags, i.e. `Token.tag`). |
|
||||
| `uas` | float | Unlabelled dependency score. |
|
||||
| `las` | float | Labelled dependency score. |
|
||||
| `ents_p` | float | Named entity accuracy (precision). |
|
||||
| `ents_r` | float | Named entity accuracy (recall). |
|
||||
| `ents_f` | float | Named entity accuracy (F-score). |
|
||||
| `ents_per_type` <Tag variant="new">2.1.5</Tag> | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. |
|
||||
| `scores` | dict | All scores with keys `uas`, `las`, `ents_p`, `ents_r`, `ents_f`, `ents_per_type`, `tags_acc` and `token_acc`. |
|
||||
| Name | Type | Description |
|
||||
| ----------------------------------------------- | ----- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `token_acc` | float | Tokenization accuracy. |
|
||||
| `tags_acc` | float | Part-of-speech tag accuracy (fine grained tags, i.e. `Token.tag`). |
|
||||
| `uas` | float | Unlabelled dependency score. |
|
||||
| `las` | float | Labelled dependency score. |
|
||||
| `ents_p` | float | Named entity accuracy (precision). |
|
||||
| `ents_r` | float | Named entity accuracy (recall). |
|
||||
| `ents_f` | float | Named entity accuracy (F-score). |
|
||||
| `ents_per_type` <Tag variant="new">2.1.5</Tag> | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. |
|
||||
| `textcat_score` <Tag variant="new">2.2</Tag> | float | F-score on positive label for binary exclusive, macro-averaged F-score for 3+ exclusive, macro-averaged AUC ROC score for multilabel (`-1` if undefined). |
|
||||
| `textcats_per_cat` <Tag variant="new">2.2</Tag> | dict | Scores per textcat label, keyed by label. |
|
||||
| `scores` | dict | All scores, keyed by type. |
|
||||
|
|
Loading…
Reference in New Issue
Block a user