# HateXplain
## Summary
* [Introduction](#introduction)
* [Dataset Structure](#dataset-structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)
## Introduction
HateXplain is the first benchmark hate speech dataset covering multiple aspects of the issue. Each post in the dataset is annotated from three perspectives: the commonly used 3-class classification (hate, offensive, or normal); the target community (the community victimized by the hate speech or offensive speech in the post); and the rationales (the portions of the post on which the annotators based their labeling decision).
## Dataset Structure
### Data Instances
Sample Entry:
```
{
"id": "24198545_gab",
"annotators": [
{
"label": 0, # hatespeech
"annotator_id": 4,
"target": ["African"]
},
{
"label": 0, # hatespeech
"annotator_id": 3,
"target": ["African"]
},
{
"label": 2, # offensive
"annotator_id": 5,
"target": ["African"]
}
],
"rationales":[
[0,0,0,0,0,0,0,0,1,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
],
"post_tokens": ["and","this","is","why","i","end","up","with","nigger","trainee","doctors","who","can","not","speak","properly","lack","basic","knowledge","of","biology","it","truly","scary","if","the","public","only","knew"]
}
```
### Data Fields
* `id` : Unique id for each post
* `annotators` : The list of annotations from each annotator
* `annotators` [label] : The label assigned by the annotator to this post. Possible values: hatespeech (0), normal (1), or offensive (2)
* `annotators` [annotator_id] : The unique id assigned to each annotator
* `annotators` [target] : The list of target communities the annotator identified in the post
* `rationales` : A list of rationales selected by the annotators. Each rationale is a list of 0/1 values, one per token. A value of 1 means the token at that index is part of the rationale selected by the annotator; the token itself is found at the same index in `post_tokens`
* `post_tokens` : The list of tokens representing the post which was annotated
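
The fields above can be combined as in the following minimal sketch. It recovers the rationale tokens from each 0/1 mask and collapses the annotator labels with a simple majority vote. The helper names `rationale_tokens` and `majority_label`, the toy entry, and the majority-vote rule itself are illustrative assumptions; this card does not prescribe how to aggregate annotations.

```python
from collections import Counter

# Label ids as listed in the field description above.
LABEL_NAMES = {0: "hatespeech", 1: "normal", 2: "offensive"}


def rationale_tokens(post_tokens, mask):
    """Return the tokens whose mask value is 1."""
    if len(post_tokens) != len(mask):
        raise ValueError("mask and post_tokens must have the same length")
    return [tok for tok, flag in zip(post_tokens, mask) if flag == 1]


def majority_label(annotators):
    """Return the label name chosen by most annotators, or None on a tie.

    Majority voting is an assumption made for this sketch.
    """
    counts = Counter(a["label"] for a in annotators).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return LABEL_NAMES[counts[0][0]]


# Hypothetical toy entry with the same layout as the sample entry.
entry = {
    "id": "toy_0",
    "annotators": [
        {"label": 0, "annotator_id": 1, "target": ["Other"]},
        {"label": 0, "annotator_id": 2, "target": ["Other"]},
        {"label": 2, "annotator_id": 3, "target": ["Other"]},
    ],
    "rationales": [[0, 0, 0, 1, 1], [0, 0, 0, 1, 0]],
    "post_tokens": ["this", "is", "a", "toy", "example"],
}

selected = [rationale_tokens(entry["post_tokens"], m) for m in entry["rationales"]]
label = majority_label(entry["annotators"])
print(selected, label)
```

Returning `None` on a tie leaves the decision about undecided posts to the caller instead of picking a label arbitrarily.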
### Data Splits
[Post_id_divisions](https://github.com/hate-alert/HateXplain/blob/master/Data/post_id_divisions.json) contains a dictionary of train, validation, and test post ids. These ids divide the dataset into train, validation, and test sets in an 8:1:1 ratio.
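
As a rough sketch of how such a division dictionary might be applied, the function below groups entries by the split whose id list contains their id. The function name `split_entries`, the `id_key` parameter, and the toy ids and split names are assumptions for illustration. In practice the dictionary would be read from a local copy of the JSON file (for example with `json.load`), and its actual key names should be checked before use.

```python
def split_entries(entries, divisions, id_key="id"):
    """Group entries by the split whose id list contains their id.

    `divisions` maps a split name to a list of post ids. Entries whose id
    appears in no list are skipped.
    """
    id_to_split = {
        post_id: name for name, ids in divisions.items() for post_id in ids
    }
    splits = {name: [] for name in divisions}
    for entry in entries:
        name = id_to_split.get(entry[id_key])
        if name is not None:
            splits[name].append(entry)
    return splits


# Hypothetical toy data; real ids look like "24198545_gab".
divisions = {"train": ["a1", "a2"], "val": ["a3"], "test": ["a4"]}
entries = [{"id": "a1"}, {"id": "a2"}, {"id": "a3"}, {"id": "a4"}, {"id": "zz"}]

splits = split_entries(entries, divisions)
for name, items in splits.items():
    print(name, [e["id"] for e in items])
```

Building the id-to-split lookup once keeps the grouping linear in the number of entries, which matters more for the full dataset than for this toy input.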
## Reference
We would like to acknowledge Binny Mathew et al. for creating and maintaining the HateXplain dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the HateXplain dataset and its creators, please visit [the HateXplain GitHub repository](https://github.com/hate-alert/HateXplain).
## License
The dataset has been released under the MIT License.
## Citation
```
@inproceedings{mathew2020hatexplain,
title={HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection},
author={Binny Mathew and Punyajoy Saha and Seid Muhie Yimam and Chris Biemann and Pawan Goyal and Animesh Mukherjee},
year={2021},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}
}
```