# Social Bias Frames

## Summary

* [Introduction](#introduction)
* [Dataset Structure](#dataset-structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)

## Introduction

Social Bias Frames is a new way of representing the biases and offensiveness that are implied in language. For example, these frames are meant to distill the implication that "women (candidates) are less qualified" behind the statement "we shouldn’t lower our standards to hire more women." The Social Bias Inference Corpus (SBIC) supports large-scale learning and evaluation of social implications, with over 150k structured annotations of social media posts spanning over 34k implications about a thousand demographic groups.

## Dataset Structure

### Data Instances

Each instance contains a post that may contain an offensive statement, along with annotated information about the nature of the offensive implication, the demographics of the annotator, and the origin of the post.

* Size of downloaded dataset files: 6.32 MB
* Size of the generated dataset: 44.47 MB
* Total amount of disk used: 50.80 MB

An example from the 'validation' split looks as follows.

```
{
    "HITId": "3Z8UJEJOCZ74QHXOTFP0JOOGD7039J",
    "WorkerId": "7162172225567546910",
    "annotatorAge": "27.0",
    "annotatorGender": "woman",
    "annotatorMinority": "women",
    "annotatorPolitics": "liberal",
    "annotatorRace": "white",
    "dataSource": "t/davidson",
    "intentYN": "1.0",
    "offensiveYN": "1.0",
    "post": "Every half-breed cop in America is trying to rape white women.",
    "sexPhrase": "rape white women",
    "sexReason": "rape",
    "sexYN": "1.0",
    "speakerMinorityYN": "0.0",
    "targetCategory": "race",
    "targetMinority": "mixed folks",
    "targetStereotype": "mixed folks are rapists.",
    "whoTarget": "1.0"
}
```

### Data Fields

The data fields are the same among all splits.

* `whoTarget`: a string; ‘0.0’ if the target is a group, ‘1.0’ if the target is an individual, and blank if the post is not offensive
* `intentYN`: a string indicating whether the intent behind the statement was to offend. This is a categorical variable with four possible answers: ‘1.0’ if yes, ‘0.66’ if probably, ‘0.33’ if probably not, and ‘0.0’ if no.
* `sexYN`: a string indicating whether the post contains a sexual or lewd reference. This is a categorical variable with three possible answers: ‘1.0’ if yes, ‘0.5’ if maybe, ‘0.0’ if no.
* `sexReason`: a string containing a free-text explanation of what is sexual if the post was marked as such, blank otherwise
* `offensiveYN`: a string indicating whether the post could be offensive to anyone. This is a categorical variable with three possible answers: ‘1.0’ if yes, ‘0.5’ if maybe, ‘0.0’ if no.
* `annotatorGender`: a string indicating the gender of the MTurk worker
* `annotatorMinority`: a string indicating whether the MTurk worker identifies as a minority
* `sexPhrase`: a string indicating which part of the post references something sexual, blank otherwise
* `speakerMinorityYN`: a string indicating whether the speaker is part of the same minority group that is being targeted. This is a categorical variable with three possible answers: ‘1.0’ if yes, ‘0.5’ if maybe, ‘0.0’ if no.
* `WorkerId`: a hashed string version of the MTurk worker ID
* `HITId`: a string ID that uniquely identifies each post
* `annotatorPolitics`: a string indicating the political leaning of the MTurk worker
* `annotatorRace`: a string indicating the race of the MTurk worker
* `annotatorAge`: a string indicating the age of the MTurk worker
* `post`: a string containing the text of the post that was annotated
* `targetMinority`: a string indicating the demographic group targeted
* `targetCategory`: a string indicating the high-level category of the demographic group(s) targeted
* `targetStereotype`: a string containing the implied statement
* `dataSource`: a string indicating the source of the post (`t/...` means Twitter, `r/...` means a subreddit)

### Data Splits

To ensure that no post appears in multiple splits, the curators defined a training instance as a post together with its three sets of annotations. They then split the dataset into train, validation, and test sets (75%/12.5%/12.5%).

|                    | train  | validation | test  |
|:------------------ |:------:|:----------:|:-----:|
| Social Bias Frames | 112900 | 16738      | 17501 |
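Note that all annotation fields are stored as strings, with blanks marking non-applicable values (for example, `whoTarget` on inoffensive posts), so downstream code typically coerces them to floats before modeling. The sketch below is a minimal example, assuming the corpus is available through the Hugging Face `datasets` library under the identifier `social_bias_frames` (adjust the name if your copy is published elsewhere); the `to_score` helper is a name introduced here purely for illustration.

```python
from datasets import load_dataset

# The identifier below assumes the Hugging Face Hub copy of SBIC;
# adjust it if your mirror publishes the corpus under another name.
dataset = load_dataset("social_bias_frames")

# Sanity-check the split sizes against the table above.
for split_name, split in dataset.items():
    print(f"{split_name}: {len(split)} instances")

def to_score(value):
    """Coerce a string-encoded label such as '1.0' or '0.66' to a float,
    mapping blank strings (non-applicable annotations) to None."""
    return float(value) if value else None

example = dataset["validation"][0]
print(to_score(example["offensiveYN"]), "->", example["targetStereotype"])
```

Because each post carries several annotation rows (three sets per post under the splitting scheme above), one common preprocessing step is to average these scores across annotators to obtain per-post labels.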
## Reference

We would like to acknowledge Sap, Maarten et al. for creating and maintaining the Social Bias Frames dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the Social Bias Frames dataset and its creators, please visit [the Social Bias Frames website](https://maartensap.com/social-bias-frames/).

## License

The dataset has been released under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

## Citation

```
@inproceedings{sap-etal-2020-social,
    title = "Social Bias Frames: Reasoning about Social and Power Implications of Language",
    author = "Sap, Maarten and Gabriel, Saadia and Qin, Lianhui and Jurafsky, Dan and Smith, Noah A. and Choi, Yejin",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.486",
    doi = "10.18653/v1/2020.acl-main.486",
    pages = "5477--5490",
}
```