mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-27 02:16:32 +03:00
Examples of textcat training data
spaCy JSON training files were generated from JSONL with:
python textcatjsonl_to_trainjson.py -m en file.jsonl .
cooking.json is an example with two mutually exclusive labels:
- baking
- not_baking
jigsaw-toxic-comment.json is an example with multiple labels per instance:
- insult
- obscene
- severe_toxic
- toxic
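The difference between the two labeling schemes can be sketched in a few lines of Python. This is only an illustration: the `text` and `cats` field names, the 0.5 threshold, and the two sample records are assumptions made up for the example, not lines taken from the data files, and `read_jsonl` and `is_mutually_exclusive` are hypothetical helpers.

```python
import json

# Hypothetical sample records. The "text" and "cats" field names are an
# assumption about the JSONL layout; the texts are invented for illustration.
SAMPLE_JSONL = """\
{"text": "How long should I knead bread dough?", "cats": {"baking": 1.0, "not_baking": 0.0}}
{"text": "You are a fool.", "cats": {"insult": 1.0, "obscene": 0.0, "severe_toxic": 0.0, "toxic": 1.0}}
"""


def read_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_mutually_exclusive(cats, threshold=0.5):
    """Return True when exactly one label scores at or above the threshold."""
    positives = [label for label, score in cats.items() if score >= threshold]
    return len(positives) == 1


records = read_jsonl(SAMPLE_JSONL)
for record in records:
    kind = "exclusive" if is_mutually_exclusive(record["cats"]) else "multi-label"
    print(kind, sorted(record["cats"]))
```

A check like this could be run over a whole file to confirm that every instance in a mutually exclusive dataset has exactly one positive label, while a multi-label dataset may have zero, one, or several.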
Data Sources
- cooking.jsonl: https://cooking.stackexchange.com. The meta IDs link to the original question as https://cooking.stackexchange.com/questions/ID, e.g., https://cooking.stackexchange.com/questions/2 for the first instance.
- jigsaw-toxic-comment.jsonl: Jigsaw Toxic Comments Classification Challenge
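As a minimal sketch of the ID-to-URL mapping described for cooking.jsonl, the snippet below rebuilds a question link from a meta ID. The `meta` and `id` key names and the `question_url` helper are assumptions for illustration; only the URL pattern comes from the description above.

```python
# Base URL taken from the data source description.
COOKING_BASE_URL = "https://cooking.stackexchange.com/questions/"


def question_url(meta_id):
    """Build the original question URL from a meta ID (hypothetical helper)."""
    return f"{COOKING_BASE_URL}{meta_id}"


# Assumed record shape: the ID is stored under record["meta"]["id"].
record = {"meta": {"id": "2"}}
print(question_url(record["meta"]["id"]))
```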
Data Licenses
- cooking.jsonl: CC BY-SA 4.0 (CC_BY-SA-4.0.txt)
- jigsaw-toxic-comment.jsonl:
  - text: CC BY-SA 3.0 (CC_BY-SA-3.0.txt)
  - annotation: CC0 (CC0.txt)