# SHP-2
## Summary
* [Introduction](#introduction)
* [Dataset Structure](#dataset-structure)
* [Dataset Creation](#dataset-creation)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)
## Introduction
SHP-2 is a dataset of 4.8M collective human preferences over responses to questions/instructions in 129 different subject areas, from cooking to legal advice. It is an extended version of the original 385K [SHP dataset](https://huggingface.co/datasets/stanfordnlp/SHP).
The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., [SteamSHP](https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl)).
Each example is a Reddit or StackExchange post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit / StackExchange users (collectively). SHP exploits the fact that if comment A was written after comment B but has a higher score nonetheless, then A is ostensibly more preferred to B. If A had been written before B, then we could not conclude this, since its higher score could have been the result of more visibility. We chose data where the preference label is intended to reflect which response is more helpful rather than which is less harmful, the latter being the focus of much past work.
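As a concrete illustration of this rule, here is a minimal sketch (the function and dict keys are ours, chosen to mirror the per-comment fields documented later in this card) of how a preference can be inferred from two comments' timestamps and scores:
```
# Minimal sketch of the timestamp-based preference rule described above.
# The dict keys mirror the per-comment fields documented later in this card.
def infer_preference(comment_a, comment_b):
    """Return the preferred comment, or None if no preference can be inferred."""
    # A was written after B yet has a higher score: its advantage cannot be
    # explained by extra visibility, so A is ostensibly preferred.
    if (comment_a["created_at_utc"] > comment_b["created_at_utc"]
            and comment_a["score"] > comment_b["score"]):
        return comment_a
    # Symmetric case: B was written after A but still scored higher.
    if (comment_b["created_at_utc"] > comment_a["created_at_utc"]
            and comment_b["score"] > comment_a["score"]):
        return comment_b
    # Otherwise the higher score may just reflect earlier posting, so no label.
    return None
```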
How is SHP different from [Anthropic's HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf) and [Open Assistant](https://huggingface.co/datasets/OpenAssistant/oasst1)?
| Dataset | Size | Input| Label| Domains |Data Format |Length|
|:---|:---:|:---|:---|:---:|:---|:---|
|SHP-2 |4.8M| Naturally occurring human-written responses| Collective Human Preference |129 (labelled)| Question/Instruction + Response (Single-turn)| up to 10.1K T5 tokens|
|HH-RLHF |91K |Dialogue with LLM |Individual Human Preference |not labelled |Live Chat (Multi-turn)| up to 1.5K T5 tokens|
|OASST |161K| Dialogue with LLM| K Individual Preferences, Aggregated| not labelled| Live Chat (Multi-Turn) |up to 1.5K T5 tokens|
How is SHP different from other datasets that have scraped Reddit, like [ELI5](https://huggingface.co/datasets/defunct-datasets/eli5#source-data)? SHP uses the timestamp information to infer preferences, while ELI5 only provides comments and scores -- the latter are not enough to infer preferences since comments made earlier tend to get higher scores from more visibility. It also contains data from more domains:
| Dataset | Size | Comments + Scores | Preferences | Number of Domains |
|:------- |:----:|:-----------------:|:-----------:|:------------------------------------------- |
| SHP-2 | 4.8M | Yes | Yes | 129 (70 from Reddit, 59 from StackExchange) |
| SHP | 385K | Yes | Yes | 18 (from Reddit) |
| ELI5 | 270K | Yes | No | 3 |
## Dataset Structure
There are two directories, one for Reddit and one for StackExchange. There are 70 subdirectories under reddit/, one for each subreddit, and 59 subdirectories under stackexchange/, one for each StackExchange site. Each subdirectory contains a JSONL file for each of the training, validation, and test splits. Here's how to get the data using the Hugging Face `datasets` library:
```
from datasets import load_dataset
# Load all the data
dataset = load_dataset("stanfordnlp/shp-2")
# Load one of the subreddits
dataset = load_dataset("stanfordnlp/shp-2", data_dir="reddit/askculinary")
# Load one of the StackExchange sites
dataset = load_dataset("stanfordnlp/shp-2", data_dir="stackexchange/stack_academia")
```
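If you only need a single split, or want to avoid downloading everything up front, the usual `load_dataset` arguments apply (a sketch; `split` and `streaming` are standard `datasets` options rather than anything SHP-2-specific):
```
from datasets import load_dataset

# Load only the training split of one subreddit
train = load_dataset("stanfordnlp/shp-2", data_dir="reddit/askculinary", split="train")

# Stream the whole dataset instead of downloading it all at once
streamed = load_dataset("stanfordnlp/shp-2", streaming=True)
```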
### Data Instances
Here's an example from reddit/askculinary/train.json:
```
{
`post_id`:"qt3nxl",
`domain`:"askculinary_train",
`upvote_ratio`:0.98,
`history`:"What's the best way to disassemble raspberries? Like this, but down to the individual seeds: https:\/\/i.imgur.com\/Z0c6ZKE.jpg I've been pulling them apart with tweezers and it's really time consuming. I have about 10 pounds to get through this weekend.",
`c_root_id_A`:"hkh25sc",
`c_root_id_B`:"hkh25lp",
`created_at_utc_A`:1636822112,
`created_at_utc_B`:1636822110,
`score_A`:340,
`score_B`:166,
`human_ref_A`:"Pectinex, perhaps? It's an enzyme that breaks down cellulose. With citrus, you let it sit in a dilute solution of pectinex overnight to break down the connective tissues. You end up with perfect citrus supremes. If you let the raspberries sit for a shorter time, I wonder if it would separate the seeds the same way...? Here's an example: https:\/\/www.chefsteps.com\/activities\/perfect-citrus-supreme",
`human_ref_B`:"Raspberry juice will make a bright stain at first, but in a matter of weeks it will start to fade away to almost nothing. It is what is known in the natural dye world as a fugitive dye, it will fade even without washing or exposure to light. I hope she gets lots of nice photos of these stains on her dress, because soon that will be all she has left of them!",
`labels`:1,
`metadata_A`: "",
`metadata_B`: "",
`seconds_difference`:2.0,
`score_ratio`:2.0481927711
}
```
Here's an example from stackexchange/stack_academia/validation.json:
```
{
`post_id`:"87393",
`domain`:"academia_validation",
`history`:"What to answer an author asking me if I reviewed his/her paper? <sep> Suppose I review someone's paper anonymously, the paper gets accepted, and a year or two later we meet e.g. in a social event and he/she asks me "did you review my paper?". What should I answer? There are several sub-questions here: Suppose the review was a good one, and the paper eventualy got accepted, so I do not mind telling that I was the reviewer. Is there any rule/norm prohibiting me from telling the truth? Suppose the review was not so good, so I do not want to reveal. What can I answer? If I just say "I am not allowed to tell you", this immediately reveals me... On the other hand, I do not want to lie. What options do I have?",
`c_root_id_A`:"87434",
`c_root_id_B`:"87453",
`created_at_utc_A`:1490989560,
`created_at_utc_B`:1491012608,
`score_A`:2,
`score_B`:5,
`human_ref_A`:"I am aware of at least one paper where a referee went out of cover (after the review process of course) and was explicitly mentioned in a later paper: <blockquote> X and Y thank Z, who as the anonymous referee was kind enough to point out the error (and later became non-anonymous). </blockquote> so it is sure fine to answer truthfully that yes you did review, but only if you wish of course (and most likely if you have been helpful and the authors of the paper responsive).",
`human_ref_B`:"Perhaps you should follow the example of Howard Percy Robertson (known as the 'R' in the famous FLRW, or Friedmann-Lematre-Robertson-Walker metric used in physical cosmology.) He was the referee of the famous Einstein-Rosen paper, which was rejected by Physical Review, prompting Einstein never to publish in Physical Review again. Einstein ignored the referee report, but months later, it seems, Robertson had a chance to talk to Einstein and may have helped convince him of the error of his ways. However, as far as we know, he never revealed to Einstein that he was the anonymous referee for Physical Review. It was not until 2005 I believe, long after the death of all participants, that Physical Review chose to disclose the referee's identity (http://physicstoday.scitation.org/doi/full/10.1063/1.2117822).",
`labels`:"0",
`metadata_A`:"Post URL: https://academia.stackexchange.com/questions/87393, Response URL: https://academia.stackexchange.com/questions/87434, Post author username: Erel Segal-Halevi, Post author profile: https://academia.stackexchange.com/users/787, Response author username: mts, Response author profile: https://academia.stackexchange.com/users/49583",
`metadata_B`:"Post URL: https://academia.stackexchange.com/questions/87393, Response URL: https://academia.stackexchange.com/questions/87453, Post author username: Erel Segal-Halevi, Post author profile: https://academia.stackexchange.com/users/787, Response author username: Viktor Toth, Response author profile: https://academia.stackexchange.com/users/7938",
`seconds_difference`:23048.0,
`score_ratio`:2.5
}
```
### Data Fields
* `post_id`: the ID of the Reddit post (string)
* `domain`: the subreddit or StackExchange site and the split the example is drawn from, separated by an underscore (string)
* `upvote_ratio`: the percent of votes received by the post that were positive (aka upvotes), -1.0 for stackexchange as there is no such data (float)
* `history`: the post title concatenated to the post body (string)
* `c_root_id_A`: the ID of comment A (string)
* `c_root_id_B`: the ID of comment B (string)
* `created_at_utc_A`: utc timestamp of when comment A was created (integer)
* `created_at_utc_B`: utc timestamp of when comment B was created (integer)
* `score_A`: (# positive votes - # negative votes + 1) received by comment A (integer)
* `score_B`: (# positive votes - # negative votes + 1) received by comment B (integer)
* `human_ref_A`: text of comment A (string)
* `human_ref_B`: text of comment B (string)
* `labels`: the preference label -- it is 1 if A is preferred to B; 0 if B is preferred to A. This was randomized such that the label distribution is roughly 50/50. (integer)
* `metadata_A`: metadata (post/response URLs and author info) for the StackExchange post and comment A; an empty string for Reddit examples (string)
* `metadata_B`: metadata (post/response URLs and author info) for the StackExchange post and comment B; an empty string for Reddit examples (string)
* `seconds_difference`: how many seconds after the less preferred comment the more preferred one was created (will always be >= 0) (integer)
* `score_ratio`: the ratio of the more preferred comment's score to the less preferred comment's score (will be >= 1) (float)
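To illustrate how these fields fit together, here is a minimal sketch (the helper name and output format are ours, not part of the dataset) that turns one example into the kind of (prompt, chosen, rejected) triple commonly used to train reward models:
```
def to_preference_triple(example):
    # labels == 1 means human_ref_A is preferred; 0 means human_ref_B is.
    # int() guards against the label occasionally appearing as a string.
    a_preferred = int(example["labels"]) == 1
    return {
        "prompt": example["history"],  # post title + body
        "chosen": example["human_ref_A"] if a_preferred else example["human_ref_B"],
        "rejected": example["human_ref_B"] if a_preferred else example["human_ref_A"],
        # score_ratio (always >= 1) can be used to drop low-confidence pairs.
        "score_ratio": example["score_ratio"],
    }

# Example usage on a loaded split:
# triples = dataset["train"].map(to_preference_triple)
```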
## Dataset Creation
### Curation Rationale
ELI5 was built to provide a testbed for machines to learn how to answer more complex questions, which requires them to find and combine information in a coherent manner. The dataset was built by gathering questions that were asked by community members of three subreddits, including [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/), along with the answers that were provided by other users. The [rules of the subreddit](https://www.reddit.com/r/explainlikeimfive/wiki/detailed_rules/) make this data particularly well suited to training a model for abstractive question answering: the questions need to seek an objective explanation about well established facts, and the answers provided need to be understandable to a layperson without any particular knowledge domain.
### Source Data
### Initial Data Collection and Normalization
The data was obtained by filtering submissions and comments from the subreddits of interest from the XML dumps of the [Reddit forum](https://www.reddit.com/) hosted on Pushshift.io.
In order to further improve the quality of the selected examples, only questions with a score of at least 2 and at least one answer with a score of at least 2 were selected for the dataset. The dataset questions and answers span a period from August 2012 to August 2019.
### Who are the source language producers?
The language producers are users of the [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/), [r/askscience](https://www.reddit.com/r/askscience/), and [r/AskHistorians](https://www.reddit.com/r/AskHistorians/) subreddits between 2012 and 2019. No further demographic information was available from the data source.
## Reference
We would like to acknowledge Ethayarajh et al. for creating and maintaining the SHP-2 dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the SHP-2 dataset and its creators, please visit [the SHP-2 website](https://huggingface.co/datasets/stanfordnlp/SHP-2).
## License
The dataset has been released under the Creative Commons Attribution-ShareAlike 4.0 International License.
## Citation
```
@InProceedings{pmlr-v162-ethayarajh22a,
title = {Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information},
author = {Ethayarajh, Kawin and Choi, Yejin and Swayamdipta, Swabha},
booktitle = {Proceedings of the 39th International Conference on Machine Learning},
pages = {5988--6008},
year = {2022},
editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
volume = {162},
series = {Proceedings of Machine Learning Research},
month = {17--23 Jul},
publisher = {PMLR},
}
```