# Hatexplain

## Summary

* [Introduction](#introduction)
* [Dataset Structure](#dataset-structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)

## Introduction

Hatexplain is the first benchmark hate speech dataset covering multiple aspects of the issue. Each post in the dataset is annotated from three different perspectives: the basic, commonly used 3-class classification (i.e., hate, offensive or normal), the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which the annotators' labeling decision (as hate, offensive or normal) is based.

## Dataset Structure

### Data Instances

Sample Entry:

```
{
  "id": "24198545_gab",
  "annotators": [
    {
      "label": 0, # hatespeech
      "annotator_id": 4,
      "target": ["African"]
    },
    {
      "label": 0, # hatespeech
      "annotator_id": 3,
      "target": ["African"]
    },
    {
      "label": 2, # offensive
      "annotator_id": 5,
      "target": ["African"]
    }
  ],
  "rationales": [
    [0,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
  ],
  "post_tokens": ["and","this","is","why","i","end","up","with","nigger","trainee","doctors","who","can","not","speak","properly","lack","basic","knowledge","of","biology","it","truly","scary","if","the","public","only","knew"]
}
```

### Data Fields

* `id` : Unique id for each post (the post id)
* `annotators` : The list of annotations from each annotator
* `annotators` [label] : The label assigned by the annotator to this post. Possible values: hatespeech (0), normal (1) or offensive (2)
* `annotators` [annotator_id] : The unique id assigned to each annotator
* `annotators` [target] : A list of the target communities present in the post
* `rationales` : A list of rationales selected by annotators. Each rationale is a list of 0/1 values with one entry per token. A value of 1 means that the token is part of the rationale selected by the annotator. To get the particular token, use the same index position in `post_tokens`
* `post_tokens` : The list of tokens representing the post which was annotated

### Data Splits

[Post_id_divisions](https://github.com/hate-alert/HateXplain/blob/master/Data/post_id_divisions.json) contains a dictionary of train, valid and test post ids, which are used to divide the dataset into training, validation and test sets in the ratio 8:1:1.

## Reference

We would like to acknowledge Binny Mathew et al. for creating and maintaining the Hatexplain dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the Hatexplain dataset and its creators, please visit [the Hatexplain repository](https://github.com/hate-alert/HateXplain).

## License

The dataset has been released under the MIT License.

## Citation

```
@inproceedings{mathew2020hatexplain,
  title={HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection},
  author={Binny Mathew and Punyajoy Saha and Seid Muhie Yimam and Chris Biemann and Pawan Goyal and Animesh Mukherjee},
  year={2021},
  booktitle={AAAI Conference on Artificial Intelligence}
}
```
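
Returning to the fields described under Data Fields, the following minimal Python sketch shows how the rationale masks line up with `post_tokens` and how the per-annotator labels might be combined. It is an illustration, not part of the dataset's tooling: the entry is a shortened, made-up record with placeholder tokens, the helper names `rationale_tokens` and `majority_label` are hypothetical, and majority voting is assumed here as one possible way to aggregate labels.

```python
from collections import Counter

# Label ids as listed under Data Fields.
LABEL_NAMES = {0: "hatespeech", 1: "normal", 2: "offensive"}

# Made-up entry in the same shape as the sample entry (placeholder tokens).
entry = {
    "id": "example_post",  # hypothetical id, not a real post
    "annotators": [
        {"label": 0, "annotator_id": 4, "target": ["African"]},
        {"label": 0, "annotator_id": 3, "target": ["African"]},
        {"label": 2, "annotator_id": 5, "target": ["African"]},
    ],
    "rationales": [
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 1, 1, 0],
    ],
    "post_tokens": ["tok0", "tok1", "tok2", "tok3"],
}


def rationale_tokens(entry):
    """Return, for each rationale mask, the tokens marked with 1."""
    tokens = entry["post_tokens"]
    return [
        [tok for tok, flag in zip(tokens, mask) if flag == 1]
        for mask in entry["rationales"]
    ]


def majority_label(entry):
    """Return the label name chosen by most annotators, or None on a tie.

    Majority voting is an assumption made for this sketch.
    """
    counts = Counter(a["label"] for a in entry["annotators"])
    ranked = counts.most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no single label has the most votes
    return LABEL_NAMES[ranked[0][0]]


print(rationale_tokens(entry))
print(majority_label(entry))
```

Each mask has the same length as `post_tokens`, so pairing them with `zip` recovers the highlighted tokens without any index arithmetic.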