spaCy/spacy
adrianeboyd b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.
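
As a rough illustration of the shape described above, the following sketch builds one paragraph entry with a `"cats"` list and converts it to a label-to-value mapping. The labels, the text and the `cats_to_dict` helper are hypothetical and not part of spaCy; token-level annotation is omitted.

```python
import json

# Hypothetical labels and text, for illustration only.
paragraph = {
    "raw": "This movie was great.",
    "sentences": [],  # token-level annotation omitted in this sketch
    "cats": [
        {"label": "POSITIVE", "value": 1.0},
        {"label": "NEGATIVE", "value": 0.0},
    ],
}
doc = {"id": 0, "paragraphs": [paragraph]}


def cats_to_dict(paragraph):
    """Hypothetical helper: turn the list form into a label -> value mapping."""
    return {cat["label"]: cat["value"] for cat in paragraph.get("cats", [])}


# Serialize a one-document corpus in the list-of-docs layout.
serialized = json.dumps([doc], indent=2)
```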

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
    relying on sklearn)
* Provide user info on textcat evaluation settings, potential
  incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
  pipeline
* Add textcat evaluation to evaluate CLI
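
For the exclusive-class cases listed above, a macro-averaged F1 can be sketched as follows. This is a minimal standalone illustration, not the `Scorer` code itself; the function names and the list-of-labels input format are assumptions.

```python
def f1_for_label(gold, pred, label):
    """F1 for a single label, treating it as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)
```

With a single provided positive label, `f1_for_label` alone corresponds to the binary case; with two or more exclusive classes, `macro_f1` averages over all labels.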

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model config parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fix Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.
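
To show what the metric measures, here is a small pure-Python ROC AUC. It is not the scikit-learn-adapted implementation mentioned above: this sketch uses the pairwise-ranking definition (the probability that a random positive scores higher than a random negative, with ties counted as half), and the function name is an assumption.

```python
def roc_auc(y_true, y_score):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        # The score is undefined when only one class is present.
        raise ValueError("ROC AUC needs both positive and negative examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of examples, so it is only suitable for illustrating the definition on small inputs.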

* Add ROCAUCScore tests and improve errors/warnings

* Add tests for ROCAUCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.
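
A check of this kind might look like the sketch below. It assumes each training instance carries a `cats` dict mapping labels to values and treats an instance as exclusive only if exactly one label has the value 1.0; the helper name and that criterion are assumptions, not necessarily what the train CLI does.

```python
def needs_multilabel(cats_per_doc):
    """Return True if any instance does not have exactly one positive label."""
    for cats in cats_per_doc:
        positives = sum(1 for value in cats.values() if value == 1.0)
        if positives != 1:
            return True
    return False


# Hypothetical data: the second instance has two positive labels.
train_cats = [
    {"A": 1.0, "B": 0.0},
    {"A": 1.0, "B": 1.0},
]
multilabel = needs_multilabel(train_cats)
```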

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
  data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00
..
cli Add textcat to train CLI (#4226) 2019-09-15 22:31:31 +02:00
data Make spacy/data a package 2017-03-18 20:04:22 +01:00
displacy Improve token pattern checking without validation (#4105) 2019-08-21 14:00:37 +02:00
lang 💫 Adjust Table API and add docs (#4289) 2019-09-15 22:08:13 +02:00
matcher Tidy up and auto-format [ci skip] 2019-08-31 13:39:06 +02:00
pipeline Add textcat to train CLI (#4226) 2019-09-15 22:31:31 +02:00
syntax Add textcat to train CLI (#4226) 2019-09-15 22:31:31 +02:00
tests Add textcat to train CLI (#4226) 2019-09-15 22:31:31 +02:00
tokens 💫 Adjust Table API and add docs (#4289) 2019-09-15 22:08:13 +02:00
__init__.pxd * Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags. 2014-10-24 02:23:42 +11:00
__init__.py Fix formatting (hopefully also restarts build properly) 2019-03-20 09:55:45 +01:00
__main__.py Update __main__.py 2019-03-20 09:43:26 +01:00
_align.pyx Improve alignment around quotes 2018-08-16 01:04:34 +02:00
_ml.py Tidy up and auto-format 2019-09-11 14:00:36 +02:00
about.py Set version to v2.2.0.dev6 2019-09-11 18:07:20 +02:00
attrs.pxd Fix attrs alignment 2019-07-12 17:59:47 +02:00
attrs.pyx Merge changes from master 2019-08-21 14:18:52 +02:00
compat.py Fix symlink creation to show error message on failure (#3589) (resolves #3307)) 2019-04-16 11:58:31 +02:00
errors.py Add textcat to train CLI (#4226) 2019-09-15 22:31:31 +02:00
glossary.py Include Norwegian NER entity types in glossary [ci skip] 2019-09-15 17:16:21 +02:00
gold.pxd Merge changes from master 2019-08-21 14:18:52 +02:00
gold.pyx Add textcat to train CLI (#4226) 2019-09-15 22:31:31 +02:00
kb.pxd rename entity frequency 2019-07-19 17:40:28 +02:00
kb.pyx Documentation for Entity Linking (#4065) 2019-09-12 11:38:34 +02:00
language.py Add textcat to train CLI (#4226) 2019-09-15 22:31:31 +02:00
lemmatizer.py 💫 Adjust Table API and add docs (#4289) 2019-09-15 22:08:13 +02:00
lexeme.pxd 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325) 2019-02-24 21:13:51 +01:00
lexeme.pyx Tidy up property code style (#3391) 2019-03-11 15:59:09 +01:00
lookups.py 💫 Adjust Table API and add docs (#4289) 2019-09-15 22:08:13 +02:00
morphology.pxd annotate kb_id through ents in doc 2019-03-22 11:36:44 +01:00
morphology.pyx 💫 Adjust Table API and add docs (#4289) 2019-09-15 22:08:13 +02:00
parts_of_speech.pxd Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
parts_of_speech.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
scorer.py Add textcat to train CLI (#4226) 2019-09-15 22:31:31 +02:00
strings.pxd Try to fix StringStore clean up (see #1506) 2017-11-11 03:11:27 +03:00
strings.pyx Merge branch 'master' into feature/lemmatizer 2019-03-16 13:44:22 +01:00
structs.pxd Merge changes from master 2019-08-21 14:18:52 +02:00
symbols.pxd Fix symbol alignment 2019-07-12 17:48:38 +02:00
symbols.pyx ensure Span.as_doc keeps the entity links + unit test 2019-06-25 15:28:51 +02:00
tokenizer.pxd Flush tokenizer cache when necessary (#4258) 2019-09-08 20:52:46 +02:00
tokenizer.pyx Flush tokenizer cache when necessary (#4258) 2019-09-08 20:52:46 +02:00
typedefs.pxd Work on changing StringStore to return hashes. 2017-05-28 12:36:27 +02:00
typedefs.pyx Tidy up rest 2017-10-27 21:07:59 +02:00
util.py 💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178) 2019-09-09 19:17:55 +02:00
vectors.pyx Update Vectors.find docs [ci skip] 2019-03-16 17:10:57 +01:00
vocab.pxd 💫 WIP: Basic lookup class scaffolding and JSON for all lemmati… (#4167) 2019-08-22 14:21:32 +02:00
vocab.pyx Bloom-filter backed Lookup Tables (#4268) 2019-09-12 17:26:11 +02:00