mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-11 12:18:04 +03:00
35 lines
1.0 KiB
Markdown
35 lines
1.0 KiB
Markdown
|
## Examples of textcat training data
|
||
|
|
||
|
spacy JSON training files were generated from JSONL with:
|
||
|
|
||
|
```
|
||
|
python textcatjsonl_to_trainjson.py -m en file.jsonl .
|
||
|
```
|
||
|
|
||
|
`cooking.json` is an example with mutually-exclusive classes with two labels:
|
||
|
|
||
|
* `baking`
|
||
|
* `not_baking`
|
||
|
|
||
|
`jigsaw-toxic-comment.json` is an example with multiple labels per instance:
|
||
|
|
||
|
* `insult`
|
||
|
* `obscene`
|
||
|
* `severe_toxic`
|
||
|
* `toxic`
|
||
|
|
||
|
### Data Sources
|
||
|
|
||
|
* `cooking.jsonl`: https://cooking.stackexchange.com. The meta IDs link to the
|
||
|
original question as `https://cooking.stackexchange.com/questions/ID`, e.g.,
|
||
|
`https://cooking.stackexchange.com/questions/2` for the first instance.
|
||
|
* `jigsaw-toxic-comment.jsonl`: [Jigsaw Toxic Comments Classification
|
||
|
Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
|
||
|
|
||
|
### Data Licenses
|
||
|
|
||
|
* `cooking.jsonl`: CC BY-SA 4.0 ([`CC_BY-SA-4.0.txt`](CC_BY-SA-4.0.txt))
|
||
|
* `jigsaw-toxic-comment.jsonl`:
|
||
|
* text: CC BY-SA 3.0 ([`CC_BY-SA-3.0.txt`](CC_BY-SA-3.0.txt))
|
||
|
* annotation: CC0 ([`CC0.txt`](CC0.txt))
|