Examples of textcat training data
The spaCy JSON training files were generated from the JSONL files with:
python textcatjsonl_to_trainjson.py -m en file.jsonl .
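For orientation, a JSONL file holds one JSON object per line. The sketch below parses such lines with only the standard library. It is not the code of textcatjsonl_to_trainjson.py, and the `text` and `cats` field names and the sample records are assumptions made for illustration.

```python
import json


def iter_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; the "text" and "cats" field names are assumptions.
sample = [
    '{"text": "How do I proof bread dough?", "cats": {"baking": 1.0, "not_baking": 0.0}}',
    '{"text": "How long should I grill a steak?", "cats": {"baking": 0.0, "not_baking": 1.0}}',
]
records = list(iter_jsonl(sample))
print(records[0]["cats"])
```

To read a real file, an open file handle can be passed in place of the list, since iterating over a text file yields its lines.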
cooking.json is an example of mutually exclusive classes, with two labels:
- baking
- not_baking
jigsaw-toxic-comment.json is an example with multiple labels per instance:
- insult
- obscene
- severe_toxic
- toxic
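The difference between the two examples can be sketched as a check on the per-instance labels. The snippet assumes each instance carries a dictionary mapping label names to scores and treats a score of 0.5 or more as positive. Both the layout and the threshold are assumptions for illustration, and the sample values are made up.

```python
def has_single_positive_label(cats, threshold=0.5):
    """Return True if exactly one label scores at or above the threshold."""
    positives = [label for label, score in cats.items() if score >= threshold]
    return len(positives) == 1


# Mutually exclusive: exactly one of the two labels applies.
cooking_cats = {"baking": 1.0, "not_baking": 0.0}

# Multiple labels per instance: any number of labels may apply, including none.
jigsaw_cats = {"insult": 1.0, "obscene": 1.0, "severe_toxic": 0.0, "toxic": 1.0}

print(has_single_positive_label(cooking_cats))
print(has_single_positive_label(jigsaw_cats))
```

An instance from a mutually exclusive data set should always pass this check. An instance from a multi-label data set may or may not.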
Data Sources
- cooking.jsonl: https://cooking.stackexchange.com. The meta IDs link to the original question as https://cooking.stackexchange.com/questions/ID, e.g., https://cooking.stackexchange.com/questions/2 for the first instance.
- jigsaw-toxic-comment.jsonl: Jigsaw Toxic Comments Classification Challenge
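The mapping from a cooking meta ID to its source question amounts to appending the ID to the question URL prefix. A minimal sketch follows; the helper name `question_url` is hypothetical, and how the ID is stored inside each record is not shown here.

```python
BASE_URL = "https://cooking.stackexchange.com/questions/"


def question_url(meta_id):
    """Build the URL of the original question from a meta ID."""
    return f"{BASE_URL}{meta_id}"


print(question_url(2))  # https://cooking.stackexchange.com/questions/2
```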
Data Licenses
- cooking.jsonl: CC BY-SA 4.0 (CC_BY-SA-4.0.txt)
- jigsaw-toxic-comment.jsonl:
  - text: CC BY-SA 3.0 (CC_BY-SA-3.0.txt)
  - annotation: CC0 (CC0.txt)