da ma
# examples

## AlphaNLI

### Original data

```json
{
  "story_id": "58090d3f-8a91-4c89-83ef-2b4994de9d241",
  "obs1": "Ron started his new job as a landscaper today.",
  "obs2": "Ron is immediately fired for insubordination.",
  "hyp1": "Ron ignores his bosses's orders and called him an idiot.",
  "hyp2": "Ron's boss called him an idiot.",
  "label": 1 // taken from the corresponding .lst file
}
```

### Converted data

#### Rules

* There are three roles: obs1, obs2, and hyp.
* Each original sample produces two records, one per hypothesis. As in the examples, the hypothesis selected by "label" gets `"class_label": true` and the other gets `false`.

#### Examples

* Record 1

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["obs1"],
      "utterance": "Ron started his new job as a landscaper today."
    },
    {
      "roles": ["obs2"],
      "utterance": "Ron is immediately fired for insubordination."
    },
    {
      "roles": ["hyp"],
      "utterance": "Ron ignores his bosses's orders and called him an idiot.",
      "class_label": true
    }
  ]
}
```

* Record 2

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["obs1"],
      "utterance": "Ron started his new job as a landscaper today."
    },
    {
      "roles": ["obs2"],
      "utterance": "Ron is immediately fired for insubordination."
    },
    {
      "roles": ["hyp"],
      "utterance": "Ron's boss called him an idiot.",
      "class_label": false
    }
  ]
}
```

## ASTE

### Original data

```
In the shop , these MacBooks are encased in a soft rubber enclosure - so you will never know about the razor edge until you buy it , get it home , break the seal and use it ( very clever con ) .####[([11, 12], [10], 'POS')]
```

### Converted data

#### Rules

Each line of a data file is one sample, in the following format:

```
sentence####[(target position, opinion position, sentiment), ..., (target position, opinion position, sentiment)]
```

The list after `####` holds the ASTE (target-opinion-sentiment) triples.

* A sample may contain several triples. Each triple should be written to the "aspects" list.
* Both the target position and the opinion position are lists giving the positions in the sentence of all words of the value. In the converted data, "start" is the first position and "end" is the last position plus one, as in the example.
* Each folder produces one part.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "In the shop , these MacBooks are encased in a soft rubber enclosure - so you will never know about the razor edge until you buy it , get it home , break the seal and use it ( very clever con ) .",
      "aspects": [
        {
          "target": { "value": "rubber enclosure", "start": 11, "end": 13 },
          "opinion": { "value": "soft", "start": 10, "end": 11 },
          "sentiment": "POS"
        }
      ]
    }
  ]
}
```

## Banking77

### Original data

```csv!
text,category
I am still waiting on my card?,card_arrival
```

### Converted data

#### Rules

The original data file is in CSV format and is processed as single-turn dialogue:

* The content of the `text` field goes into "utterance".
* The content of the `category` field goes into the "active_intents" list.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "I am still waiting on my card?",
      "active_intents": ["card_arrival"]
    }
  ]
}
```

## CamRest676

### Original data

```json
{
  "dial": [
    {
      "turn": 0,
      "usr": {
        "transcript": "I need to find an expensive restauant that's in the south section of the city.",
        "slu": [
          { "act": "inform", "slots": [["pricerange", "expensive"]] },
          { "act": "inform", "slots": [["area", "south"]] }
        ]
      },
      "sys": {
        "sent": "There are several restaurants in the south part of town that serve expensive food. Do you have a cuisine preference?",
        "DA": ["food"]
      }
    },
  ],
  "dialogue_id": 0,
  "finished": true,
  "goal": {
    "constraints": [["pricerange", "expensive"], ["area", "south"]],
    "request-slots": ["address"],
    "text": "Task 11193: You are looking for an expensive restaurant and it should be in the south part of town. Make sure you get the address of the venue."
  }
}
```

### Converted data

#### Rules

* Only the content of the "dial" field needs to be processed.
* For the "usr" role, slot values whose act is "inform" go into "slot_value_table"; for the act "request", the requested slots go into "requested_slots". The "DA" field of the "sys" role may also contain slots to be requested, and these go into "requested_slots" as well.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["usr"],
      "utterance": "I need to find an expensive restauant that's in the south section of the city.",
      "belief_state": [
        {
          "intent": "inform",
          "slot_value_table": [
            { "slot": "pricerange", "values": [{ "value": "expensive" }] },
            { "slot": "area", "values": [{ "value": "south" }] }
          ],
          "requested_slots": []
        }
      ]
    },
    {
      "roles": ["sys"],
      "utterance": "There are several restaurants in the south part of town that serve expensive food. Do you have a cuisine preference?",
      "belief_state": [
        {
          "slot_value_table": [
            { "slot": "pricerange", "values": [{ "value": "expensive" }] },
            { "slot": "area", "values": [{ "value": "south" }] }
          ],
          "requested_slots": ["food"]
        }
      ]
    }
  ]
}
```

## CANARD

### Original data

```json
{
  "History": [
    "Johnny Unitas",
    "1964 MVP season",
    "what team did unitas play for",
    "The Colts",
    "how many games did the colts win",
    "the Colts ran off 10 straight victories to finish with a 12-2 record."
  ],
  "QuAC_dialog_id": "C_2ba58216460d43aa986fc0e897537239_0",
  "Question": "who did they play in the playoffs",
  "Question_no": 3,
  "Rewrite": "who did the Colts play in the playoffs?"
}
```

### Converted data

#### Rules

* The first two elements of the "History" list always go into the "title" dictionary.
* From the third element on, the entries form a multi-turn Q&A in "question"/"answer" role order and go into the "dialog" list.
* The sentence in the "Question" field is the last turn of the sample. When adding it to the list, also put the rewritten sentence (the "Rewrite" field) into that turn's "rewritten" field.
* The "QuAC_dialog_id" field identifies the article in the QuAC dataset on which the Q&A is based. Look up the matching passage in QuAC and put it into the external knowledge.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "title": {
    "article": "Johnny Unitas",
    "section": "1964 MVP season"
  },
  "dialog": [
    {
      "roles": ["question"],
      "utterance": "what team did unitas play for"
    },
    {
      "roles": ["answer"],
      "utterance": "The Colts"
    },
    {
      "roles": ["question"],
      "utterance": "how many games did the colts win"
    },
    {
      "roles": ["answer"],
      "utterance": "the Colts ran off 10 straight victories to finish with a 12-2 record."
    },
    {
      "roles": ["question"],
      "utterance": "who did they play in the playoffs",
      "rewritten": "who did the Colts play in the playoffs?"
    }
  ],
  "knowledge": {
    "type": "text",
    "value": "The 1964 season would see the Colts return to the top of the Western Conference. After dropping their season opener to ..."
  }
}
```

## CLINC150

### Original data

```json
{
  "oos_val": [
    ["set a warning for when my bank account starts running low", "oos"],
  ],
  "val": [
    ["in spanish, meet me tomorrow is said how", "translate"],
  ]
}
```

### Converted data

#### Rules

The first element of each sample list is the utterance, and the second element is the intent of that utterance.

* Put the intent into the "active_intents" list.
* The original data contains an intent named "oos" that is listed separately; process it like any other intent.
* Only the samples in `data_full.json` need to be converted.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "set a warning for when my bank account starts running low",
      "active_intents": ["oos"]
    }
  ]
}
```

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "in spanish, meet me tomorrow is said how",
      "active_intents": ["translate"]
    }
  ]
}
```

## CMUDoG

### Original data

```json
{
  "date": "2018-03-22T16:25:07.321Z",
  "history": [
    {
      "docIdx": 0,
      "text": "Hi there",
      "uid": "user1",
      "utcTimestamp": "2018-03-22T16:27:10.212Z"
    },
    {
      "docIdx": 0,
      "text": "Hey! The movie we're supposed to discuss is the Wolf of Wall Street. Have you watched it?",
      "uid": "user2",
      "utcTimestamp": "2018-03-22T16:27:26.682Z"
    }
  ],
  "rating": 2,
  "status": 0,
  "uid1LogInTime": "2018-03-22T16:25:07.321Z",
  "uid1LogOutTime": "2018-03-22T16:37:44.095Z",
  "uid1response": {
    "feedback": "They said it had bad reviews and that critics said it was dull.",
    "response": [2, 3, 5],
    "type": "finish"
  },
  "uid2LogInTime": "2018-03-22T16:25:07.471Z",
  "uid2LogOutTime": "2018-03-22T16:36:58.235Z",
  "uid2response": {
    "response": [],
    "type": "abandonWithoutAnsweringFeedbackQuestions"
  },
  "user1_id": "USR3959",
  "user2_id": "USR2840",
  "whoSawDoc": ["user2"],
  "wikiDocumentIdx": 28
}
```

### Converted data

#### Rules

The Conversation folder contains the dialogue data; each JSON file is one sample. The WikiData folder contains the wiki content that serves as external knowledge for the dialogues.

* Put the content of "history" into the "dialog" list as a multi-turn dialogue.
* Each JSON file in WikiData has a unique "wikiDocumentIdx" value. Use the sample's "wikiDocumentIdx" to find the matching wiki file and put the entire file content into the external knowledge "knowledge".
* The remaining fields of the original data can be ignored.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["user1"],
      "utterance": "Hi there"
    },
    {
      "roles": ["user2"],
      "utterance": "Hey! The movie we're supposed to discuss is the Wolf of Wall Street. Have you watched it?"
    }
  ],
  "knowledge": {
    "type": "wiki",
    "value": {
      // All contents in "WikiData/The_Wolf_of_Wall_Street.json",
      // which has the "wikiDocumentIdx" value 28
    }
  }
}
```

## Commonsense-Dialogues

### Original data

```json
{
  "1": {
    "context": "kai was a reasonable person who was listened to so he put her life in perspective.",
    "speaker": "Kai",
    "turns": [
      "I know what she was going through was hard but this is just a temporary feeling",
      "And she is young indeed.",
      "everything will be okay. I'll be there for her",
      "It must hurt so much now though",
      "I told her that after every bad thing that happens, there is good that happens. She is a good person and good things will happen to her",
      "Kai that was really thoughtful of you"
    ]
  },
}
```

### Converted data

#### Rules

* Each sample is a multi-turn dialogue in which the role named in the "speaker" field alternates with "third-person". The first utterance always belongs to the "speaker" role.
* The content of the "context" field goes into the external knowledge.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["Kai"],
      "utterance": "I know what she was going through was hard but this is just a temporary feeling"
    },
    {
      "roles": ["third-person"],
      "utterance": "And she is young indeed."
    },
    {
      "roles": ["Kai"],
      "utterance": "everything will be okay. I'll be there for her"
    },
    {
      "roles": ["third-person"],
      "utterance": "It must hurt so much now though"
    },
    {
      "roles": ["Kai"],
      "utterance": "I told her that after every bad thing that happens, there is good that happens. She is a good person and good things will happen to her"
    },
    {
      "roles": ["third-person"],
      "utterance": "Kai that was really thoughtful of you"
    }
  ],
  "knowledge": {
    "type": "text",
    "value": "kai was a reasonable person who was listened to so he put her life in perspective."
  }
}
```

## CommonsenseQA

### Original data

```json!
{"answerKey": "A", "id": "075e483d21c29a511267ef62bedc0461", "question": {"question_concept": "punishing", "choices": [{"label": "A", "text": "ignore"}, {"label": "B", "text": "enforce"}, {"label": "C", "text": "authoritarian"}, {"label": "D", "text": "yell at"}, {"label": "E", "text": "avoid"}], "stem": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?"}}
```

### Converted data

#### Rules

Each line of the original data file is one sample. It contains a commonsense question, the concept the question involves, five candidate answers, and the label of the correct answer.

* Process each sample as a multi-turn dialogue: the first turn (role "stem") is the question itself, and the next five turns are the contents of options A to E in order. Label the correct option true and the wrong options false.
* The concept the question involves goes into the "domain" list.

#### Example

```json
{
  "turn": "multi",
  "domain": ["punishing"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["stem"],
      "utterance": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?"
    },
    {
      "roles": ["A"],
      "utterance": "ignore",
      "class_label": true
    },
    {
      "roles": ["B"],
      "utterance": "enforce",
      "class_label": false
    },
    {
      "roles": ["C"],
      "utterance": "authoritarian",
      "class_label": false
    },
    {
      "roles": ["D"],
      "utterance": "yell at",
      "class_label": false
    },
    {
      "roles": ["E"],
      "utterance": "avoid",
      "class_label": false
    }
  ]
}
```

## CommonsenseQA 2.0

### Original data

```json!
{"id": "0000488c294c99bd1a6cf10258dae8c1", "question": "The world trade center is no more because of 9/11?", "answer": "yes", "confidence": 0.89, "date": "12/16/2020", "relational_prompt": "because", "topic_prompt": "world trade center", "relational_prompt_used": true, "topic_prompt_used": true, "validations": ["yes", "yes", "yes", "no"]}
```

### Converted data

#### Rules

Each line of the original data file is one sample. It contains a commonsense statement phrased as a question and whether that statement is true.

* Process each sample as a single-turn dialogue and record the answer directly in "class_label". The answer is given in the "answer" field: yes becomes true, no becomes false.
* The content of the "topic_prompt" field goes into the "domain" list.

#### Example

```json
{
  "turn": "single",
  "domain": ["world trade center"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "The world trade center is no more because of 9/11?",
      "class_label": true
    }
  ]
}
```

## CoQA

### Original data

```json
{
  "version": "1.0",
  "data": [
    {
      "source": "wikipedia",
      "id": "3zotghdk5ibi9cex97fepx7jetpso7",
      "filename": "Vatican_Library.txt",
      "story": "The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established ...",
      "questions": [
        { "input_text": "When was the Vat formally opened?", "turn_id": 1 },
        { "input_text": "what is the library for?", "turn_id": 2 }
      ],
      "answers": [
        {
          "span_start": 151,
          "span_end": 179,
          "span_text": "Formally established in 1475",
          "input_text": "It was formally established in 1475",
          "turn_id": 1
        },
        {
          "span_start": 454,
          "span_end": 494,
          "span_text": "he Vatican Library is a research library",
          "input_text": "research",
          "turn_id": 2
        }
      ],
      "name": "Vatican_Library.txt"
    }
  ]
}
```

### Converted data

#### Rules

The original data is a JSON file whose "data" list contains multiple articles, each with a series of question-answer pairs.

* In "question"/"answer" order, add each question and its answer to the "dialog" list. Use "input_text" as the "utterance" of the answer.
* The content of "story" goes into "knowledge" as external knowledge.
* The final "name" field goes into the "title" dictionary.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "title": {
    "name": "Vatican_Library.txt"
  },
  "dialog": [
    {
      "roles": ["question"],
      "utterance": "When was the Vat formally opened?"
    },
    {
      "roles": ["answer"],
      "utterance": "It was formally established in 1475"
    },
    {
      "roles": ["question"],
      "utterance": "what is the library for?"
    },
    {
      "roles": ["answer"],
      "utterance": "research"
    }
  ],
  "knowledge": {
    "type": "text",
    "value": "The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established ..."
  }
}
```

## DailyDialog

### Original data

`dialogues_text.txt`:

```
The kitchen stinks . __eou__ I'll throw out the garbage . __eou__
```

`dialogues_topic.txt`:

```
1
```

`dialogues_act.txt`:

```
3 4
```

`dialogues_emotion.txt`:

```
2 0
```

### Converted data

#### Rules

Each line of the original data is one multi-turn dialogue sample.

* `dialogues_text.txt` contains the dialogue itself: two roles (filled in as "ROLE1" and "ROLE2") speaking in turn. Each utterance ends with the marker `__eou__`.
* `dialogues_topic.txt` gives the topic of the dialogue on the corresponding line, which goes into the "domain" list. Each dialogue has exactly one topic.
* `dialogues_act.txt` and `dialogues_emotion.txt` give the act and emotion of each utterance of the dialogue on the corresponding line. They go into the "active_intents" and "emotions" of each turn respectively.
* The meaning of the topic, act, and emotion numbers can be found in `readme.txt`.

| number | topic | act | emotion |
| ------ | ----- | --- | ------- |
| 0 | | | no emotion |
| 1 | Ordinary Life | inform | anger |
| 2 | School Life | question | disgust |
| 3 | Culture & Education | directive | fear |
| 4 | Attitude & Emotion | commissive | happiness |
| 5 | Relationship | | sadness |
| 6 | Tourism | | surprise |
| 7 | Health | | |
| 8 | Work | | |
| 9 | Politics | | |
| 10 | Finance | | |

#### Example

```json
{
  "turn": "multi",
  "domain": ["Ordinary Life"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE1"],
      "utterance": "The kitchen stinks .",
      "active_intents": ["directive"],
      "emotions": [{ "emotion": "disgust" }]
    },
    {
      "roles": ["ROLE2"],
      "utterance": "I'll throw out the garbage .",
      "active_intents": ["commissive"],
      "emotions": [{ "emotion": "no emotion" }]
    }
  ]
}
```

## DDRel

### Original data

```json!
{"pair-id": "0", "session-id": "0", "label": "5", "context": ["B: Here it is. Pray for me, Gallagher.", "A: Stew, your hands are shaking. You've been drinking again.", "B: Come on, come on. Here they come, Gallagher!", "A: The boss is getting hoarse.", "B: There's the third one. If I don't get the last one, there's a certain sob sister I know that's going to get a kick right in the . . . oh! Whoops, almost had that."], "nameA": "GALLAGHER", "nameB": "STEW"}
```

### Converted data

#### Rules

Each line of the original data file is one multi-turn dialogue sample.

* The original data gives the full names of A and B; use these names in the "roles" list.
* The "label" field gives the relationship between the two roles. The meaning of each number is listed in `Readme.md`. Put the full relation name into "instance_relations".

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["STEW"],
      "utterance": "Here it is. Pray for me, Gallagher."
    },
    {
      "roles": ["GALLAGHER"],
      "utterance": "Stew, your hands are shaking. You've been drinking again."
    },
    {
      "roles": ["STEW"],
      "utterance": "Come on, come on. Here they come, Gallagher!"
    },
    {
      "roles": ["GALLAGHER"],
      "utterance": "The boss is getting hoarse."
    },
    {
      "roles": ["STEW"],
      "utterance": "There's the third one. If I don't get the last one, there's a certain sob sister I know that's going to get a kick right in the . . . oh! Whoops, almost had that."
    }
  ],
  "instance_relations": [
    {
      "instance1": "GALLAGHER",
      "instance2": "STEW",
      "relations": [{ "relation": "Lovers" }]
    }
  ]
}
```

## DialogSum

### Original data

```json!
{"fname": "train_5", "dialogue": "#Person1#: Happy birthday, Aims!\n#Person2#: Thank you, Lisa.\n#Person1#: Here is a present for you. I hope you like it.\n#Person2#: Oh, great! I love it! You know I've been expecting this for a long time.\n#Person1#: I'm very glad to hear that.\n#Person2#: Come here ; let me introduce some friends to you.", "summary": "Lisa gives Aims a birthday present and Aims loves it.", "topic": "birthday"}
```

### Converted data

#### Rules

Each line of the original data file is one multi-turn dialogue sample.

* Fill the roles and utterances into "dialog" as usual, taking care to remove the extra marker characters (as in the example, `#Person1#` becomes `Person1`).
* The "summary" field goes into our "summary", and the "topic" field goes into the "domain" list.

#### Example

```json
{
  "turn": "multi",
  "domain": ["birthday"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["Person1"],
      "utterance": "Happy birthday, Aims!"
    },
    {
      "roles": ["Person2"],
      "utterance": "Thank you, Lisa."
    },
    {
      "roles": ["Person1"],
      "utterance": "Here is a present for you. I hope you like it."
    },
    {
      "roles": ["Person2"],
      "utterance": "Oh, great! I love it! You know I've been expecting this for a long time."
    },
    {
      "roles": ["Person1"],
      "utterance": "I'm very glad to hear that."
    },
    {
      "roles": ["Person2"],
      "utterance": "Come here ; let me introduce some friends to you."
    }
  ],
  "summary": "Lisa gives Aims a birthday present and Aims loves it."
}
```

## DoQA

### Original data

`doqa_dataset/doqa-cooking-train-v2.1.json`:

```json
{
  "data": [
    {
      "title": "Tips for grilling duck legs?",
      "background": "I recently attempted to grill duck legs on my propane Webber. I was afraid of flare-ups due to ...",
      "paragraphs": [
        {
          "context": "I think grilling is probably a bad plan for duck legs; the fat content is a real danger like you said, and duck legs are tough enough ... CANNOTANSWER",
          "id": "C_0f23256ebfa44983b3a884e457b3211e",
          "qas": [
            {
              "question": "Tips for grilling duck legs?",
              "answers": [
                {
                  "text": "I think grilling is probably a bad plan for duck legs",
                  "answer_start": 0,
                  "input_text": "I think grilling is probably a bad plan for duck legs"
                }
              ],
              "id": "C_0f23256ebfa44983b3a884e457b3211e_q#0",
              "followup": "y",
              "yesno": "x",
              "orig_answer": {
                "text": "I think grilling is probably a bad plan for duck legs",
                "answer_start": 0,
                "input_text": "I think grilling is probably a bad plan for duck legs"
              }
            }
          ]
        }
      ]
    }
  ]
}
```

### Converted data

#### Rules

The original data is in JSON format. Only the labeled train and dev files need to be processed.

* The file name indicates the domain of the dialogue, which goes into the "domain" list.
* "qas" contains a series of questions and answers, which go into the "dialog" list as a multi-turn dialogue. For each answer turn, use the "input_text" from the "answers" list as the utterance.
* The content the Q&A is based on goes into "knowledge" as external knowledge. It consists of "title", "background", and the "context" inside "paragraphs". See the example for the format.

#### Example

```json
{
  "turn": "multi",
  "domain": ["cooking"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["question"],
      "utterance": "Tips for grilling duck legs?"
    },
    {
      "roles": ["answer"],
      "utterance": "I think grilling is probably a bad plan for duck legs"
    }
  ],
  "knowledge": {
    "type": "document",
    "value": {
      "title": "Tips for grilling duck legs?",
      "background": "I recently attempted to grill duck legs on my propane Webber. I was afraid of flare-ups due to ...",
      "context": "I think grilling is probably a bad plan for duck legs; the fat content is a real danger like you said, and duck legs are tough enough ... CANNOTANSWER"
    }
  }
}
```

## DREAM

### Original data

```json
[
  [
    "M: I am considering dropping my dancing class. I am not making any progress.",
    "W: If I were you, I stick with it. It's definitely worth time and effort."
  ],
  [
    {
      "question": "What does the man suggest the woman do?",
      "choice": [
        "Consult her dancing teacher.",
        "Take a more interesting class.",
        "Continue her dancing class."
      ],
      "answer": "Continue her dancing class."
    }
  ],
  "5-510"
],
```

### Converted data

#### Rules

The original data file is a JSON list, and each element of the list is one sample.

* Each sample is itself a list. Its first element is the dialogue, which goes into "knowledge" as external knowledge.
* The second element contains all question groups; some dialogues have more than one question. If there are several questions, produce a separate record for each one (the dialogue stays the same; only the question and options change).
* Each question has three candidate answers, exactly one of which is correct. The question has the role "question" and the three options have the roles "choiceA" to "choiceC" in order (some dialogues contain speakers named A/B, which must not be confused with the options). Fill them into "dialog" in the format shown in the example, and give the correct and wrong options the corresponding "class_label".

#### Example

The original data in this example has only one question, so only one record is produced.

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["question"],
      "utterance": "What does the man suggest the woman do?"
    },
    {
      "roles": ["choiceA"],
      "utterance": "Consult her dancing teacher.",
      "class_label": false
    },
    {
      "roles": ["choiceB"],
      "utterance": "Take a more interesting class.",
      "class_label": false
    },
    {
      "roles": ["choiceC"],
      "utterance": "Continue her dancing class.",
      "class_label": true
    }
  ],
  "knowledge": {
    "type": "dialogue",
    "value": {
      "dialog": [
        {
          "roles": ["M"],
          "utterance": "I am considering dropping my dancing class. I am not making any progress."
        },
        {
          "roles": ["W"],
          "utterance": "If I were you, I stick with it. It's definitely worth time and effort."
        }
      ]
    }
  }
}
```

## E2E

### Original data

```csv!
mr,ref
"name[The Vaults], eatType[pub], priceRange[more than £30], customer rating[5 out of 5], near[Café Adriatic]",The Vaults pub near Café Adriatic has a 5 star rating. Prices start at £30.
```

### Converted data

#### Rules

The original data file is in CSV format; commas outside quotation marks are the separators.

* The `mr` field goes into "utterance".
* The `ref` field goes into "rewritten".

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "name[The Vaults], eatType[pub], priceRange[more than £30], customer rating[5 out of 5], near[Café Adriatic]",
      "rewritten": "The Vaults pub near Café Adriatic has a 5 star rating. Prices start at £30."
    }
  ]
}
```

## E2E_Dialogue

### Original data

`movie_all.tsv`:

```tsv!
session.ID	Message.ID	Message.Timestamp	Message.From	Message.Text	Dialog-Acts
1	1	2016-03-08T03:01:14.630Z	user	I'd like 2 tickets to see Zoolander 2 tomorrow at Regal Meridian 16 theater in Seattle at 9:25 PM	request(ticket;moviename=Zoolander 2;date=tomorrow;theater=Regal Meridian 16;city=Seattle;starttime=9:25 PM;numberofpeople=2)
1	2	2016-03-08T03:13:48.111Z	agent	Okay, your purchase of 2 tickets for Zoolander 2 is confirmed.	inform(taskcomplete;numberofpeople=2;moviename=Zoolander 2)
```

### Converted data

#### Rules

The original data is in TSV format, and each file contains human-machine dialogues from a different domain. All utterances sharing the same "session.ID" form one complete dialogue sample, which is processed as a multi-turn dialogue.

* Fill the "domain" list of the sample based on the file name.
* All "Message.Text" utterances of one "session.ID" go into the "dialog" list.
* The "Dialog-Acts" column annotates the dialogue acts of the utterance, which go into the "dialog_act" list. There may be more than one act. Inside the parentheses of each act, items joined by an equals sign are slot-value pairs and go into "slot_value_table".
* An item without an equals sign is the target of the act. Treat the act as the slot and this item as the value, and put it into "slot_value_table" as well (see `inform(taskcomplete;)` in the example).
* The value after the equals sign may be empty. In that case leave the "values" list empty (no slot value).

#### Example

```json
{
  "turn": "multi",
  "domain": ["movie"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["user"],
      "utterance": "I'd like 2 tickets to see Zoolander 2 tomorrow at Regal Meridian 16 theater in Seattle at 9:25 PM",
      "dialog_act": [
        {
          "act": "request",
          "slot_value_table": [
            { "slot": "request", "values": [{ "value": "ticket" }] },
            { "slot": "moviename", "values": [{ "value": "Zoolander 2" }] },
            { "slot": "date", "values": [{ "value": "tomorrow" }] },
            { "slot": "theater", "values": [{ "value": "Regal Meridian 16" }] },
            { "slot": "city", "values": [{ "value": "Seattle" }] },
            { "slot": "starttime", "values": [{ "value": "9:25 PM" }] },
            { "slot": "numberofpeople", "values": [{ "value": "2" }] }
          ]
        }
      ]
    },
    {
      "roles": ["agent"],
      "utterance": "Okay, your purchase of 2 tickets for Zoolander 2 is confirmed.",
      "dialog_act": [
        {
          "act": "inform",
          "slot_value_table": [
            { "slot": "inform", "values": [{ "value": "taskcomplete" }] },
            { "slot": "numberofpeople", "values": [{ "value": "2" }] },
            { "slot": "moviename", "values": [{ "value": "Zoolander 2" }] }
          ]
        }
      ]
    }
  ]
}
```

## Character-Identification-EmoryNLP

### Original data

```json
{
  "season_id": "trn",
  "episodes": [
    {
      "episode_id": "s01_e01",
      "scenes": [
        {
          "scene_id": "s01_e01_c01",
          "utterances": [
            {
              "utterance_id": "s01_e01_c01_u001",
              "speakers": ["Monica Geller"],
              "transcript": "There's nothing to tell! He's just some guy I work with!",
              "tokens": [
                ["There", "'s", "nothing", "to", "tell", "!"],
                ["He", "'s", "just", "some", "guy", "I", "work", "with", "!"]
              ],
              "character_entities": [
                [],
                [[0, 1, "Paul the Wine Guy"], [4, 5, "Paul the Wine Guy"], [5, 6, "Monica Geller"]]
              ]
            },
          ]
        }
      ]
    }
  ]
}
```

### Converted data

#### Rules

Only the data inside "scenes" needs attention.

* The roles in the "speakers" field go into the "roles" list.
* Join all tokens in "tokens" into one string (separated by spaces) and put it into "utterance".
* "character_entities" indicates which character each personal reference (such as a pronoun) at the given position in each sentence refers to. The positions in the original data are relative to each sentence after sentence splitting, so when filling "characters" the offset of the preceding sentences must be added back. For plural references there are several characters, and all of them go into the "value" list.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["Monica Geller"],
      "utterance": "There 's nothing to tell ! He 's just some guy I work with !",
      "characters": [
        { "value": ["Paul the Wine Guy"], "start": 6, "end": 7 },
        { "value": ["Paul the Wine Guy"], "start": 10, "end": 11 },
        { "value": ["Monica Geller"], "start": 11, "end": 12 }
      ]
    }
  ]
}
```

## DSTC8

### Original data

```json
{
  "userInput": {
    "text": "I want to visit New York."
  },
  "context": {
    "requestedSlots": ["to_location"]
  },
  "labels": [
    {
      "slot": "to_location",
      "valueSpan": { "startIndex": 16, "endIndex": 24 }
    }
  ],
  "id": "27_001082",
  "splitKey": 1.0
}
```

### Converted data

#### Rules

The original data file is a JSON list. Each element of the list is one sample and is processed as a single-turn dialogue.

* The "text" of "userInput" goes into "dialog".
* If the sample has a "labels" field, put its slot-value pairs into the "belief_state" list. If there is no "labels" field, leave the "belief_state" list empty.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "I want to visit New York.",
      "belief_state": [
        {
          "slot_value_table": [
            { "slot": "to_location", "values": [{ "value": "New York" }] }
          ]
        }
      ]
    }
  ]
}
```

## Emotion-Detection-EmoryNLP

### Original data

```json
{
  "season_id": "trn",
  "episodes": [
    {
      "episode_id": "s01_e02",
      "scenes": [
        {
          "scene_id": "s01_e02_c01",
          "utterances": [
            {
              "utterance_id": "s01_e02_c01_u001",
              "speakers": ["Monica Geller"],
              "transcript": "What you guys don't understand is, for us, kissing is as important as any part of it.",
              "tokens": [
                ["What", "you", "guys", "do", "n't", "understand", "is", ",", "for", "us", ",", "kissing", "is", "as", "important", "as", "any", "part", "of", "it", "."]
              ],
              "emotion": "Joyful"
            },
          ]
        }
      ]
    }
  ]
}
```

### Converted data

#### Rules

Only the data inside "scenes" needs attention.

* The roles in the "speakers" field go into the "roles" list.
* Join all tokens in "tokens" into one string (separated by spaces) and put it into "utterance".
* The emotion annotated in the "emotion" field is also filled in as shown in the example.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["Monica Geller"],
      "utterance": "What you guys do n't understand is , for us , kissing is as important as any part of it .",
      "emotions": [{ "emotion": "Joyful" }]
    }
  ]
}
```

## EmpatheticDialogues

### Original data

```csv!
conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
hit:0_conv:1,3,sentimental,I remember going to the fireworks with my best friend. There was a lot of people_comma_ but it only felt like us in the world.,1,This was a best friend. I miss her.,5|5|5_2|2|5,
```

### Converted data

#### Rules

The original data file is in CSV format. Only the `context`, `prompt`, and `utterance` fields need attention.

* The `prompt` text goes into "dialog" first, with the role fixed as "Speaker". The emotion given in `context` goes into "emotions".
* The `utterance` text then goes into "dialog", with the role fixed as "Listener".
* Replace every `_comma_` with an ordinary comma ",".

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["Speaker"],
      "utterance": "I remember going to the fireworks with my best friend. There was a lot of people, but it only felt like us in the world.",
      "emotions": [{ "emotion": "sentimental" }]
    },
    {
      "roles": ["Listener"],
      "utterance": "This was a best friend. I miss her."
    }
  ]
}
```

## FriendsPersona

### Original data

```csv!
,scene_id,text,character,cAGR,cCON,cEXT,cOPN,cNEU
17,01_e03_c09,"<b>s01_e03_c09(0) for Phoebe Buffay</b><br><br><b>Ross Geller</b>: A thumb?!<br><br>(Phoebe nods.)<br><br><b>All</b>: Eww!<br><br><b>Phoebe Buffay</b>: I know! I know, I opened it up and there it was, just floating in there, like this tiny little hitch-hiker!<br><br><b>Chandler Bing</b>: Well, maybe it's a contest, y'know? Like, collect all five?<br><br><b>Phoebe Buffay</b>: Does, um, anyone wanna see?<br><br>",Phoebe Buffay,1,1,1,0,0
```

### Converted data

#### Rules

The original data file is in CSV format. The content of the `text` field is HTML: utterances are separated by `<br><br>`, and the speaker of each utterance is wrapped in `<b></b>`.

* Process the content of `text` as a multi-turn dialogue. If an utterance has no speaker, use "#NOTE#" as the role of that turn. Skip the first entry (`<b>s01_e03_c09(0) for Phoebe Buffay</b><br><br>`) and start directly from the dialogue itself.
* The `character` field names the role whose persona is being judged. The `cAGR`, `cCON`, etc. fields are the corresponding persona traits, graded as -1, 0, or 1. Fill these into the "role_personas" list as shown in the example. Expand the abbreviations into full persona names according to the table below:

| Abbreviation | Full name |
| --- | ------- |
| AGR | Agreeableness |
| CON | Conscientiousness |
| EXT | Extroversion |
| OPN | Openness |
| NEU | Neuroticism |

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["Ross Geller"],
      "utterance": "A thumb?!"
    },
    {
      "roles": ["#NOTE#"],
      "utterance": "(Phoebe nods.)"
    },
    {
      "roles": ["All"],
      "utterance": "Eww!"
    },
    {
      "roles": ["Phoebe Buffay"],
      "utterance": "I know! I know, I opened it up and there it was, just floating in there, like this tiny little hitch-hiker!"
    },
    {
      "roles": ["Chandler Bing"],
      "utterance": "Well, maybe it's a contest, y'know? Like, collect all five?"
    },
    {
      "roles": ["Phoebe Buffay"],
      "utterance": "Does, um, anyone wanna see?"
    }
  ],
  "role_personas": [
    {
      "name": "Phoebe Buffay",
      "personas": [
        { "persona": "Agreeable", "sentiment": 1 },
        { "persona": "Conscientious", "sentiment": 1 },
        { "persona": "Extraverted", "sentiment": 1 },
        { "persona": "Open to experience", "sentiment": 0 },
        { "persona": "Emotionally Stable", "sentiment": 0 }
      ]
    }
  ]
}
```

## FriendsQA

### Original data

```json
{
  "data": [
    {
      "title": "s04_e19_c04",
      "paragraphs": [
        {
          "utterances:": [
            {
              "uid": 0,
              "speakers": ["#NOTE#"],
              "utterance": "[ Scene : Central Perk , Joey is whining to Chandler about the tickets . ]"
            },
            {
              "uid": 1,
              "speakers": ["Joey Tribbiani"],
              "utterance": "Come on !"
            },
            {
              "uid": 2,
              "speakers": ["Chandler Bing"],
              "utterance": "Yes , Gunther , can I get two cups of chino , please ?"
            }
          ],
          "qas": [
            {
              "id": "s04_e19_c04_What",
              "question": "What action is Joey doing ?",
              "answers": [
                { "answer_text": "Joey is whining", "utterance_id": 0, "inner_start": 6, "inner_end": 8, "is_speaker": false },
                { "answer_text": "whining to Chandler", "utterance_id": 0, "inner_start": 8, "inner_end": 10, "is_speaker": false }
              ]
            },
            {
              "id": "s04_e19_c04_Who",
              "question": "Who asks for two cups of chino ?",
              "answers": [
                { "answer_text": "Chandler Bing", "utterance_id": 2, "inner_start": -1, "inner_end": -1, "is_speaker": true }
              ]
            }
          ]
        }
      ]
    }
  ]
}
```

### Converted data

#### Rules

Only the elements of the "paragraphs" list matter. Each element is a dictionary containing a dialogue and the questions asked about it.

* The list in the "utterances:" field holds a multi-turn dialogue; put it into "knowledge".
* The list in the "qas" field holds a series of questions. Each question goes into "dialog" with the role "question". **Create a separate record for each question.**
* Each question carries its "answers", extracted from the original dialogue. Put each "answer_text" into "dialog" as an utterance with the role "answer".

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    { "roles": ["question"], "utterance": "What action is Joey doing ?" },
    { "roles": ["answer"], "utterance": "Joey is whining" },
    { "roles": ["answer"], "utterance": "whining to Chandler" }
  ],
  "knowledge": {
    "type": "dialogue",
    "value": {
      "dialog": [
        { "roles": ["#NOTE#"], "utterance": "[ Scene : Central Perk , Joey is whining to Chandler about the tickets . ]" },
        { "roles": ["Joey Tribbiani"], "utterance": "Come on !" },
        { "roles": ["Chandler Bing"], "utterance": "Yes , Gunther , can I get two cups of chino , please ?" }
      ]
    }
  }
}
```

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    { "roles": ["question"], "utterance": "Who asks for two cups of chino ?" },
    { "roles": ["answer"], "utterance": "Chandler Bing" }
  ],
  "knowledge": {
    "type": "dialogue",
    "value": {
      "dialog": [
        { "roles": ["#NOTE#"], "utterance": "[ Scene : Central Perk , Joey is whining to Chandler about the tickets . ]" },
        { "roles": ["Joey Tribbiani"], "utterance": "Come on !" },
        { "roles": ["Chandler Bing"], "utterance": "Yes , Gunther , can I get two cups of chino , please ?"
        }
      ]
    }
  }
}
```

## GoEmotions

### Raw data

```
My favourite food is anything I didn't have to cook myself.	27	eebbqej
Maybe that’s what happened to the great white at Houston zoo	6,22	eczq8zg
```

### Converted data

#### Rules

The raw data file is in TSV format, with columns separated by tabs. Only the first two columns matter.

* Process the sentence in the first column as a single-turn dialogue.
* The second column holds one or more numbers from 0 to 27 that indicate emotions. The emotion names are listed in `emotions.txt`: the number equals the line number minus 1.
* Every emotion other than neutral has a "sentiment" attribute. Find the field in `sentiment_dict.json` that contains the emotion and put that field (for example, "ambiguous") into "sentiment".

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "My favourite food is anything I didn't have to cook myself.",
      "emotions": [{ "emotion": "neutral" }]
    }
  ]
}
```

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "Maybe that’s what happened to the great white at Houston zoo",
      "emotions": [
        { "emotion": "confusion", "sentiment": "ambiguous" },
        { "emotion": "realization", "sentiment": "ambiguous" }
      ]
    }
  ]
}
```

## Google Simulated Dialogue

### Raw data

```json
{
  "dialogue_id": "movies_00000004",
  "turns": [
    {
      "dialogue_state": [
        { "slot": "num_tickets", "value": "3" },
        { "slot": "time", "value": "6:00 pm" },
        { "slot": "movie", "value": "a man called ove" }
      ],
      "system_acts": [
        { "slot": "movie", "type": "REQUEST" },
        { "slot": "num_tickets", "type": "REQUEST" }
      ],
      "system_utterance": {
        "slots": [],
        "text": "which movie , and how many tickets do you need ?",
        "tokens": ["which", "movie", ",", "and", "how", "many", "tickets", "do", "you", "need", "?"]
      },
      "user_acts": [{ "type": "INFORM" }],
      "user_utterance": {
        "slots": [
          { "exclusive_end": 3, "slot": "num_tickets", "start": 2 },
          { "exclusive_end": 12, "slot": "movie", "start": 8 }
        ],
        "text": "i need 3 tickets for the movie called a man called ove",
        "tokens": ["i", "need", "3", "tickets", "for", "the", "movie", "called", "a", "man", "called", "ove"]
      }
    }
  ]
}
```

### Converted data

#### Rules

* Put the dialogue state into the user's "belief_state".
* Put the acts and slots of the system utterance into the system's "dialog_act".
* Put the acts and slots of the user utterance into the user's "dialog_act".

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["system"],
      "utterance": "which movie , and how many tickets do you need ?",
      "dialog_act": [
        { "act": "REQUEST", "slot_value_table": [{ "slot": "movie" }] },
        { "act": "REQUEST", "slot_value_table": [{ "slot": "num_tickets" }] }
      ]
    },
    {
      "roles": ["user"],
      "utterance": "i need 3 tickets for the movie called a man called ove",
      "belief_state": [
        {
          "slot_value_table": [
            { "slot": "num_tickets", "values": [{ "value": "3" }] },
            { "slot": "movie", "values": [{ "value": "a man called ove" }] },
            { "slot": "time", "values": [{ "value": "6:00 pm" }] }
          ]
        }
      ],
      "dialog_act": [
        {
          "act": "INFORM",
          "slot_value_table": [
            { "slot": "num_tickets", "values": [{ "value": "3", "start": 2, "end": 3 }] },
            { "slot": "movie", "values": [{ "value": "a man called ove", "start": 8, "end": 12 }] }
          ]
        }
      ]
    }
  ]
}
```

## HWU64

### Raw data

```csv!
"text","category"
"what alarms do i have set right now","alarm_query"
```

### Converted data

#### Rules

The raw data is a CSV file of single-turn dialogues. Process it the same way as Banking77.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "what alarms do i have set right now",
      "active_intents": ["alarm_query"]
    }
  ]
}
```

## MAMS

### Raw data

```xml
<?xml version="1.0" encoding="utf-8"?>
<sentences>
  <sentence>
    <text>It might be the best sit down food I've had in the area, so if you are going to the upright citizen brigade, or the garden, it could be just the place for you.</text>
    <aspectCategories>
      <aspectCategory category="food" polarity="positive"/>
      <aspectCategory category="place" polarity="neutral"/>
    </aspectCategories>
  </sentence>
</sentences>
```

### Converted data

#### Rules

The raw data is in XML format; process it as single-turn dialogues.

* Put the text in `<text>` into "dialog".
* Put the content of `<aspectCategories>` into "aspects": "category" maps to "target" and "polarity" maps to "sentiment".
* In the datasets under the `MAMS-ATSA` folder, each target entity is annotated with its start and end positions, as shown below. In that case, annotate the positions as well.

```xml
<aspectTerm from="30" polarity="positive" term="food" to="34"/>
```

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "It 
might be the best sit down food I've had in the area, so if you are going to the upright citizen brigade, or the garden, it could be just the place for you.",
      "aspects": [
        {
          "target": {
            "value": "food",
            // Add index if exists.
            // "start": 30,
            // "end": 34
          },
          "sentiment": "positive"
        },
        {
          "target": { "value": "place" },
          "sentiment": "neutral"
        }
      ]
    }
  ]
}
```

## MELD

### Raw data

```csv!
Sr No.,Utterance,Speaker,Emotion,Sentiment,Dialogue_ID,Utterance_ID,Season,Episode,StartTime,EndTime
1,also I was the point person on my company’s transition from the KL-5 to GR-6 system.,Chandler,neutral,neutral,0,0,8,21,"00:16:16,059","00:16:21,731"
```

### Converted data

#### Rules

The raw data file is in CSV format; only the `Utterance`, `Speaker`, `Emotion`, and `Sentiment` fields matter.

* Fill in "utterance" and the corresponding "roles" as a single-turn dialogue.
* Add the turn's emotion information to "emotions" in the format shown in the example.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["Chandler"],
      "utterance": "also I was the point person on my company’s transition from the KL-5 to GR-6 system.",
      "emotions": [{ "emotion": "neutral", "sentiment": "neutral" }]
    }
  ]
}
```

## Molweni

### Raw data

```json
{
  "data": {
    "title": "train",
    "dialogues": [
      {
        "edus": [
          { "text": "bacon5o there 's no `` fixmbr '' with ubuntu .", "speaker": "sipher" },
          { "text": "i dont want ubuntu , it does n't support my internet , thus i can not use it", "speaker": "Bacon5o" },
          { "text": "my ati has no aiglx support so i ca n't speak for how FILEPATH is", "speaker": "morfic" }
        ],
        "context": "sipher: bacon5o there 's no `` fixmbr '' with ubuntu . 
bacon5o: i dont want ubuntu , it does n't support my internet , thus i can not use it morfic: my ati has no aiglx support so i ca n't speak for how filepath is",
        "qas": [
          {
            "question": "Why does Bacon5o not want ubuntu ?",
            "id": "f44090680211f17295b1248f4c087491",
            "answers": [
              { "text": "it does n't support my internet", "answer_start": 85 }
            ],
            "is_impossible": false
          }
        ],
        "relations": [
          { "y": 1, "x": 0, "type": "Comment" }
        ]
      }
    ]
  }
}
```

### Converted data

#### Rules

Only the list in the "dialogues" field of the raw data matters.

* The dialogue in the "edus" list serves as external knowledge and goes into the "knowledge" field with type dialogue. The data in the "relations" list goes into "knowledge" in its original format (see the example).
* Process the question-answer pairs in the "qas" list as a multi-turn "question-answer" dialogue.
* If "is_impossible" is true, set the "utterance" of the corresponding answer in "dialog" to "NA" instead of using the original text.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    { "roles": ["question"], "utterance": "Why does Bacon5o not want ubuntu ?" },
    { "roles": ["answer"], "utterance": "it does n't support my internet" }
  ],
  "knowledge": {
    "type": "dialogue",
    "value": {
      "dialog": [
        {
          "roles": ["sipher"],
          "utterance": "bacon5o there 's no `` fixmbr '' with ubuntu ."
        },
        {
          "roles": ["Bacon5o"],
          "utterance": "i dont want ubuntu , it does n't support my internet , thus i can not use it"
        },
        {
          "roles": ["morfic"],
          "utterance": "my ati has no aiglx support so i ca n't speak for how FILEPATH is"
        }
      ],
      "relations": [
        { "y": 1, "x": 0, "type": "Comment" }
      ]
    }
  }
}
```

## MuDuCo

### Raw data

`muduco_music.json`:

```json
{
  "domain": "music",
  "dialogs": {
    "002f8b03-27b5-d787-55f2-c0f601844f20": {
      "split": "train",
      "turns": [
        {
          "number": 1,
          "utterance": "Can you play Red by Taylor Swift ?",
          "named_entities": {
            "person": [
              { "turn_id": 1, "span": { "start": 4, "end": 7 }, "text": "you" },
              { "turn_id": 1, "span": { "start": 20, "end": 32 }, "text": "Taylor Swift" }
            ],
            "entity": [
              { "turn_id": 1, "span": { "start": 13, "end": 16 }, "text": "Red" }
            ]
          },
          "references": {
            "personal_pronoun": [
              { "turn_id": 1, "span": { "start": 4, "end": 7 }, "text": "you" }
            ]
          },
          "links": [],
          "graded": true,
          "rewritten_utterance": "Can you play Red by Taylor Swift ?",
          "rewrite_required": false
        },
        ...
      ]
    }
  }
}
```

### Converted data

#### Rules

Each raw data file holds the dialogues of one domain. Every dictionary in "dialogs" is a dialogue, and "turns" contains the whole dialogue; process it as a multi-turn dialogue.

* Put the raw "domain" into the "domain" list.
* Fill the dialogue into "dialog" as a multi-turn dialogue with alternating "USER" and "SYSTEM" roles.
* Each turn contains "named_entities"; put this information into the turn's "named_entity_recognition" list in the format shown in the example.
* If "rewrite_required" is true in the raw data, put the content of "rewritten_utterance" into the turn's "rewritten" field. Otherwise, omit the "rewritten" field.

#### Example

```json
{
  "turn": "multi",
  "domain": ["music"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["USER"],
      "utterance": "Can you play Red by Taylor Swift ?",
      "named_entity_recognition": [
        {
          "type": "person",
          "values": [
            { "value": "you", "start": 4, "end": 7 },
            { "value": "Taylor Swift", "start": 20, "end": 32 }
          ]
        },
        {
          "type": "entity",
          "values": [
            { "value": "Red", "start": 13, "end": 16 }
          ]
        }
      ],
      // Rewrite the utterance if "rewrite_required" is true.
      // "rewritten": "Can you play Red by Taylor Swift ?"
    },
    ...
  ]
}
```

## MultiDoGo

### Raw data

`paper_splits/splits_annotated_at_turn_level/airline/train.tsv`:

```tsv!
conversationId	turnNumber	utteranceId	utterance	slot-labels	intent
009ab879-ab52-4507-979e-8bc92badecef	12	<CONV>009ab879-ab52-4507-979e-8bc92badecef<TURN>12	kavigmailcom and five passengers	email_address O number_of_passengers O	contentonly
```

### Converted data

#### Rules

The raw data is in TSV format; only the turn-level data needs to be processed. Each record is a single-turn dialogue. Only the last three columns matter.

* Fill the "domain" list with the name of the folder that contains the data.
* The "slot-labels" column annotates the utterance at token level: O means no label, and any other label is the slot name. If the same slot name appears on consecutive tokens, those tokens together form one value for that slot. Put the slot-value pairs into the "slot_value_table" of "dialog_act".
* Put the content of the "intent" column into the "active_intents" list.

#### Example

```json
{
  "turn": "single",
  "domain": ["airline"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "kavigmailcom and five passengers",
      "dialog_act": [
        {
          "slot_value_table": [
            { "slot": "email_address", "values": [{ "value": "kavigmailcom", "start": 0, "end": 1 }] },
            { "slot": "number_of_passengers", "values": [{ "value": "five", "start": 2, "end": 3 }] }
          ]
        }
      ],
      "active_intents": ["contentonly"]
    }
  ]
}
```

## MultiWOZ 2.2

### Raw data

```json
[
  {
    "dialogue_id": "PMUL4398.json",
    "services": ["restaurant", "hotel"],
    "turns": [
      {
        "frames": [
          {
            "actions": [],
            "service": "restaurant",
            "slots": [],
            "state": {
              "active_intent": "find_restaurant",
              "requested_slots": [],
              "slot_values": {
                "restaurant-area": ["centre"],
                "restaurant-pricerange": ["expensive"]
              }
            }
          },
          {
            "actions": [],
            "service": "hotel",
            "slots": [],
            "state": {
              "active_intent": "find_hotel",
              "requested_slots": [],
              "slot_values": {}
            }
          }
        ],
        "speaker": "USER",
        "turn_id": "0",
        "utterance": "i need a place to dine in the center thats expensive"
      },
      {
        "frames": [],
        "speaker": "SYSTEM",
        "turn_id": "1",
        "utterance": "I have several options for you; do you prefer African, Asian, or British food?"
      }
    ]
  }
]
```

### Converted data

#### Rules

* Put the dialogue's "services" into the "domain" list.
* The "state" in "frames" belongs to the user's turns: add every "active_intent" that is not "NONE" to the "active_intents" list, copy "requested_slots" as is, and put "slot_values" into the "belief_state" list, with each service corresponding to one domain.
* Put the content of "slots" into each role's "dialog_act" list.

#### Example

```json
{
  "turn": "multi",
  "domain": ["restaurant", "hotel"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["USER"],
      "utterance": "i need a place to dine in the center thats expensive",
      "belief_state": [
        {
          "domain": "restaurant",
          "slot_value_table": [
            { "slot": "restaurant-area", "values": [{ "value": "centre" }] },
            { "slot": "restaurant-pricerange", "values": [{ "value": "expensive" }] }
          ],
          "requested_slots": []
        },
        {
          "domain": "hotel",
          "slot_value_table": [],
          "requested_slots": []
        }
      ],
      "dialog_act": [],
      "active_intents": ["find_restaurant", "find_hotel"]
    },
    {
      "roles": ["SYSTEM"],
      "utterance": "I have several options for you; do you prefer African, Asian, or British food?",
      "dialog_act": []
    }
  ]
}
```

## MuTual

### Raw data

```json
{
  "answers": "B",
  "options": [
    "f : no suit has the same style as it . it 's the style that makes it special . it is worth the price .",
    "f : although the suit you sew is the same as it , the material of this suit is imported from italy .",
    "f : the material of this suit is imported from france , it makes the suit special .",
    "f : but the color of our suit is very special ."
  ],
  "article": "m : excuse me . how much is this suit ? f : it 's on sale today for $ 750 . it 's normally $ 900 . m : wow , that is pretty expensive ! i was thinking that it might be 4 or 500 ...",
  "id": "train_1"
}
```

### Converted data

#### Rules

The raw data folder contains multiple txt files, each holding one sample in JSON format. The task is to choose the most likely next utterance given the dialogue history in "article".

* Put the contents of "options" into the "dialog" list as four choices. "answers" gives the correct answer; use it to mark each option as correct or incorrect in "class_label".
* Put the content of "article" into "knowledge" as external knowledge.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["A"],
      "utterance": "f : no suit has the same style as it . it 's the style that makes it special . 
it is worth the price .",
      "class_label": false
    },
    {
      "roles": ["B"],
      "utterance": "f : although the suit you sew is the same as it , the material of this suit is imported from italy .",
      "class_label": true
    },
    {
      "roles": ["C"],
      "utterance": "f : the material of this suit is imported from france , it makes the suit special .",
      "class_label": false
    },
    {
      "roles": ["D"],
      "utterance": "f : but the color of our suit is very special .",
      "class_label": false
    }
  ],
  "knowledge": {
    "type": "text",
    "value": "m : excuse me . how much is this suit ? f : it 's on sale today for $ 750 . it 's normally $ 900 . m : wow , that is pretty expensive ! i was thinking that it might be 4 or 500 ..."
  }
}
```

## NarrativeQA

### Raw data

`qaps.csv`:

```csv!
document_id,set,question,answer1,answer2,question_tokenized,answer1_tokenized,answer2_tokenized
0025577043f5090cd603c6aea60f26e236195594,test,Who is Mark Hunter?,He is a high school student in Phoenix.,A loner and outsider student with a radio station.,Who is Mark Hunter ?,He is a high school student in Phoenix .,A loner and outsider student with a radio station .
```

`third_party/wikipedia/summaries.csv`:

```csv!
document_id,set,summary,summary_tokenized^M
0025577043f5090cd603c6aea60f26e236195594,test," Mark Hunter (Slater), a high school student in a sleepy suburb of Phoenix, Arizona, starts an FM pirate radio station that ...
```

### Converted data

#### Rules

The raw data is in CSV format, and the content needed for the converted data is spread across several files.

* `qaps.csv` contains the main QA part. Process it as a multi-turn dialogue and fill it into the "dialog" list. Only the "question", "answer1", and "answer2" columns matter here.
* The external knowledge needed for the dialogue is stored in `third_party/wikipedia/summaries.csv`. The "document_id" column of the dialogue records the ID of the relevant knowledge. Look this ID up in `summaries.csv` and put the content of the "summary" column into "knowledge" as external knowledge.
* This CSV file is badly malformed, so it is recommended to parse the format yourself rather than use Python's csv library.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    { "roles": ["question"], "utterance": "Who is Mark Hunter?" },
    { "roles": ["answer1"], "utterance": "He is a high school student in Phoenix." },
    { "roles": ["answer2"], "utterance": "A loner and outsider student with a radio station."
    }
  ],
  "knowledge": {
    "type": "text",
    "value": "Mark Hunter (Slater), a high school student in a sleepy suburb of Phoenix, Arizona, starts an FM pirate radio station that ..."
  }
}
```

## NLU++

### Raw data

Example from the `banking` folder:

```json
{
  "text": "How much did I spend in total until May on amazon prime?",
  "intents": ["how_much", "transfer_payment_deposit"],
  "slots": {
    "date_to": {
      "text": "May",
      "span": [36, 39],
      "value": { "day": 31, "month": 5, "year": 2022 }
    },
    "company_name": {
      "text": "amazon prime",
      "span": [43, 55],
      "value": "amazon prime"
    }
  }
}
```

### Converted data

#### Rules

The raw data is in JSON format; each element of the list is a sample. A sample contains only the user's utterance, so process it as a single-turn dialogue.

* Samples belong to different domains depending on the folder that contains the raw data file. Put the folder name into the "domain" list.
* Put the content of "text" into the "dialog" list.
* Put the content of "intents" into the "active_intents" list.
  * If this field is missing, there are no intents; in that case, omit the "active_intents" list.
* Put the slot-value pairs described in "slots" into the "slot_value_table" of "dialog_act". For each slot, use only the raw "text" as the slot value and ignore the normalized "value".
  * If this field is missing, no slot-value pairs were detected; in that case, omit the "dialog_act" list.

#### Example

```json
{
  "turn": "single",
  "domain": ["banking"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "How much did I spend in total until May on amazon prime?",
      "dialog_act": [
        {
          "slot_value_table": [
            { "slot": "date_to", "values": [{ "value": "May", "start": 36, "end": 39 }] },
            { "slot": "company_name", "values": [{ "value": "amazon prime", "start": 43, "end": 55 }] }
          ]
        }
      ],
      "active_intents": ["how_much", "transfer_payment_deposit"]
    }
  ]
}
```

## PERSONA-CHAT

### Raw data

```json
{
  "train":[
    {
      "personality":[
        "i like to remodel homes .",
        "i like to go hunting .",
        "i like to shoot a bow .",
        "my favorite holiday is halloween ."
      ],
      "utterances":[
        { ... },
        {
          // Last element in "utterances"
          "candidates":[
            "hello i am doing well how are you ?",
            "ll something like that . do you play games ?",
            ...
          ],
          "history":[
            "hi , how are you doing ? i'm getting ready to do some cheetah chasing to stay in shape .",
            "you must be very fast . hunting is one of my favorite hobbies .",
            "i am ! for my hobby i like to do canning or some whittling ."
          ]
        }
      ]
    }
  ]
}
```

### Converted data

#### Rules

The raw data is in JSON format; each element in the list of a data split is a sample.

* The "utterances" list of a sample has many elements; only the last one needs to be processed. Fill the contents of the last element's "history" list into the "dialog" list as a multi-turn dialogue alternating between ROLE1 and ROLE2. The "candidates" field can be ignored.
* The "personality" field at the start of each sample goes into "knowledge" as external knowledge. Convert it as shown in the example, with "role" fixed to ROLE2 (the second speaker).

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    { "roles": ["ROLE1"], "utterance": "hi , how are you doing ? i'm getting ready to do some cheetah chasing to stay in shape ." },
    { "roles": ["ROLE2"], "utterance": "you must be very fast . hunting is one of my favorite hobbies ." },
    { "roles": ["ROLE1"], "utterance": "i am ! for my hobby i like to do canning or some whittling ." }
  ],
  "knowledge": {
    "type": "persona",
    "value": [
      {
        "role": "ROLE2",
        "description": [
          "i like to remodel homes .",
          "i like to go hunting .",
          "i like to shoot a bow .",
          "my favorite holiday is halloween ."
        ]
      }
    ]
  }
}
```

## QuAC

### Raw data

```json
{
  "data": [
    {
      "paragraphs": [
        {
          "context": "According to the Indian census of 2001, there were 30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of ... 
CANNOTANSWER",
          "qas": [
            {
              "followup": "m",
              "yesno": "x",
              "question": "Where is Malayali located?",
              "answers": [
                {
                  "text": "30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India,",
                  "answer_start": 51
                }
              ],
              "id": "C_69758fcdfc1f46baba0e92c0f3b0919c_1_q#0",
              "orig_answer": {
                "text": "30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India,",
                "answer_start": 51
              }
            }
          ],
          "id": "C_69758fcdfc1f46baba0e92c0f3b0919c_1"
        }
      ],
      "section_title": "Geographic distribution and population",
      "background": "The Malayali people or Keralite people (also spelt Malayalee, Malayalam script: mlyaalli and keerlliiy[?]) are ...",
      "title": "Malayali"
    }
  ]
}
```

### Converted data

#### Rules

The raw data is a JSON file whose format is essentially the same as DoQA, so the dataset can be processed in a similar way.

* Process the QA content in the "qas" list as a multi-turn dialogue; see DoQA for details.
* The external knowledge is still of type "document". Put the contents of "title", "section_title", "background", and "context" into "knowledge".

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    { "roles": ["question"], "utterance": "Where is Malayali located?" },
    { "roles": ["answer"], "utterance": "30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India," }
  ],
  "knowledge": {
    "type": "document",
    "value": {
      "title": "Malayali",
      "section_title": "Geographic distribution and population",
      "background": "The Malayali people or Keralite people (also spelt Malayalee, Malayalam script: mlyaalli and keerlliiy[?]) are ...",
      "context": "According to the Indian census of 2001, there were 30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of ... CANNOTANSWER"
    }
  }
}
```

## RACE

### Raw data

```json
{
  "answers": ["C", "A", "B", "C"],
  "options": [
    [
      "he has much money.",
      "he likes the shops.",
      "he likes to compare the prices between the same items.",
      "he has nothing to do but shopping."
    ],
    [
      "their ways of shopping are quite different",
      "they hate each other.",
      "they needn't buy anything for the family",
      "they don't have time for it."
    ],
    [
      "he is young",
      "he is absent-minded",
      "he often loses his money",
      "he doesn't like shopping"
    ],
    [
      "the shop was closed that day",
      "the policeman stopped him",
      "he forgot some of them",
      "he gave all the money to the beggar"
    ]
  ],
  "questions": [
    "The husband likes shopping because _ .",
    "They never go shopping together because _ .",
    "Jimmy can't do the shopping well because _ .",
    "Jimmy didn't buy what his mother wanted because _ ."
  ],
  "article": "My husband is a born shopper. He loves to look at things and to touch them. He likes to ...",
  "id": "high1.txt"
}
```

### Converted data

#### Rules

The raw data folder contains multiple txt files, each holding one sample in JSON format, similar in form to middle- and high-school English reading comprehension. Generate a separate record for each question and its answer options.

* The "questions" list contains all the questions, one record per question. Each element of the "options" list holds the candidate answers for the question at the same index, and the "answers" list gives the correct answer to each question. Put the question and its candidate answers into the "dialog" list, and mark each option as correct or incorrect in "class_label".
* Put the "article" text into "knowledge" as external knowledge.

#### Example

This sample has 4 questions, so it yields 4 records. The first record looks like this:

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    { "roles": ["question"], "utterance": "The husband likes shopping because _ ." },
    { "roles": ["A"], "utterance": "he has much money.", "class_label": false },
    { "roles": ["B"], "utterance": "he likes the shops.", "class_label": false },
    { "roles": ["C"], "utterance": "he likes to compare the prices between the same items.", "class_label": true },
    { "roles": ["D"], "utterance": "he has nothing to do but shopping.", "class_label": false }
  ],
  "knowledge": {
    "type": "text",
    "value": "My husband is a born shopper. He loves to look at things and to touch them. He likes to ..."
  }
}
```

## Reading-comprehension

### Raw data

```json!
[
  {
    "scene_id": "s01_e01_c04",
    "query": "The apartment is practically empty , as @ent03 has taken all of the furniture , the stereo and the good TV. 
@placeholder confides in his friends about his upset at being divorced at twenty - six .",
    "answer": "@ent00",
    "utterances": [
      {
        "tokens": "( squatting and reading the instructions ) I 'm supposed to attach a brackety thing to the side things , using a bunch of these little worm guys . I have no brackety thing , I see no whim guys whatsoever and - I can not feel my legs .",
        "speakers": "@ent00"
      },
      {
        "tokens": "( @ent01 and @ent02 are finishing assembling the bookcase . )",
        "speakers": ""
      },
      {
        "tokens": "I 'm thinking we 've got a bookcase here .",
        "speakers": "@ent01"
      }
    ]
  }
]
```

### Converted data

#### Rules

The raw data is a question-answering dataset, so the question and answer are processed as a multi-turn dialogue and the background dialogue goes into external knowledge.

* Put "query" into "dialog" as the question with the role "question", and "answer" as the answer with the role "answer".
* Add the dialogue in "utterances" to "knowledge" in the format shown in the example. If "speakers" is empty, use "@NOTE" as the speaker role.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["question"],
      "utterance": "The apartment is practically empty , as @ent03 has taken all of the furniture , the stereo and the good TV. @placeholder confides in his friends about his upset at being divorced at twenty - six ."
    },
    { "roles": ["answer"], "utterance": "@ent00" }
  ],
  "knowledge": {
    "type": "dialogue",
    "value": {
      "dialog": [
        {
          "roles": ["@ent00"],
          "utterance": "( squatting and reading the instructions ) I 'm supposed to attach a brackety thing to the side things , using a bunch of these little worm guys . I have no brackety thing , I see no whim guys whatsoever and - I can not feel my legs ."
        },
        {
          "roles": ["@NOTE"],
          "utterance": "( @ent01 and @ent02 are finishing assembling the bookcase . )"
        },
        {
          "roles": ["@ent01"],
          "utterance": "I 'm thinking we 've got a bookcase here ."
        }
      ]
    }
  }
}
```

## RECCON

### Raw data

```json!
{
  "tr_4466": [
    [
      {
        "turn": 1,
        "speaker": "A",
        "utterance": "Hey , you wanna see a movie tomorrow ?",
        "emotion": "happiness",
        "expanded emotion cause evidence": [1],
        "expanded emotion cause span": ["see a movie tomorrow ?"],
        "type": ["no-context"]
      },
      {
        "turn": 2,
        "speaker": "B",
        "utterance": "Sounds like a good plan . 
What do you want to see ?",
        "emotion": "happiness",
        "expanded emotion cause evidence": [1],
        "expanded emotion cause span": ["see a movie tomorrow ?"],
        "type": ["inter-personal"]
      },
      {
        "turn": 3,
        "speaker": "A",
        "utterance": "How about Legally Blonde .",
        "emotion": "neutral"
      }
    ]
  ]
}
```

### Converted data

#### Rules

Each field in the raw data holds a list, and each list inside that list is a multi-turn dialogue.

* Process the dialogue content as a multi-turn dialogue.
* For the "emotion" field: if it is not neutral, it is followed by one or more pieces of evidence supporting the emotion judgment. Fill them into the "evidence" list in the format shown in the example. If it is neutral, just set "emotion" to neutral without adding "evidence".

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["A"],
      "utterance": "Hey , you wanna see a movie tomorrow ?",
      "emotions": [
        {
          "emotion": "happiness",
          "evidence": [
            { "turn": 1, "span": "see a movie tomorrow ?", "type": "no-context" }
          ]
        }
      ]
    },
    {
      "roles": ["B"],
      "utterance": "Sounds like a good plan . What do you want to see ?",
      "emotions": [
        {
          "emotion": "happiness",
          "evidence": [
            { "turn": 1, "span": "see a movie tomorrow ?", "type": "inter-personal" }
          ]
        }
      ]
    },
    {
      "roles": ["A"],
      "utterance": "How about Legally Blonde .",
      "emotions": [{ "emotion": "neutral" }]
    }
  ]
}
```

## Restaurant8k

### Raw data

```json
{
  "userInput": {
    "text": "There will be 5 adults and 1 child."
  },
  "context": {
    "requestedSlots": ["people"]
  },
  "labels": [
    {
      "slot": "people",
      "valueSpan": { "startIndex": 14, "endIndex": 34 }
    }
  ]
}
```

### Converted data

#### Rules

The raw data file is a JSON list; each element is a sample, processed as a single-turn dialogue.

* Ignore the "context" field and consider only "userInput" and "labels".
* Put the content of "userInput" into the single-turn "dialog".
* Put the slot-value pairs from the "labels" list into the "slot_value_table" list of "belief_state". The slot "value" is obtained by slicing the "utterance" string with [startIndex:endIndex].
* Some dialogues have no slots, in which case the "labels" field is absent; leave the "belief_state" list empty.
* Some "valueSpan" entries have no "startIndex" field, which is shorthand for "starting from the beginning"; set startIndex=0 in that case.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "There will be 5 adults and 1 child.",
      "belief_state": [
        {
          "slot_value_table": [
            { "slot": "people", "values": [{ "value": "5 adults and 1 child" }] }
          ]
        }
      ]
    }
  ]
}
```

## RNNLG

### Raw data

`original/hotel/train.json`:

```json!
[
  "inform_no_match(acceptscreditcards='no';pricerange='pricey')",
  "there are no pricey hotel -s that do not accept credit card -s",
  "i am sorry but there is no place , where no credit card -s are accepted and in the pricey price range"
]
```

### Converted data

#### Rules

The raw data is a JSON list; each element is a sample. Process it as single-turn dialogues.

* The raw data is split into several folders by domain; put the relevant domain into the "domain" list.
* Each raw sample is a list of 3 elements. The first element goes into "knowledge" as external knowledge of type "text".
* Generate one record for each of the last two elements, filling it into "dialog". That is, each raw sample yields two records.

#### Example

```json
{
  "turn": "single",
  "domain": ["hotel"],
  "locale": "en",
  "dialog": [
    { "roles": ["ROLE"], "utterance": "there are no pricey hotel -s that do not accept credit card -s" }
  ],
  "knowledge": {
    "type": "text",
    "value": "inform_no_match(acceptscreditcards='no';pricerange='pricey')"
  }
}
```

```json
{
  "turn": "single",
  "domain": ["hotel"],
  "locale": "en",
  "dialog": [
    { "roles": ["ROLE"], "utterance": "i am sorry but there is no place , where no credit card -s are accepted and in the pricey price range" }
  ],
  "knowledge": {
    "type": "text",
    "value": "inform_no_match(acceptscreditcards='no';pricerange='pricey')"
  }
}
```

## SentiHood

### Raw data

```json
{
  "opinions": [
    { "sentiment": "Positive", "aspect": "nightlife", "target_entity": "LOCATION1" },
    { "sentiment": "Positive", "aspect": "transit-location", "target_entity": "LOCATION1" }
  ],
  "id": 209,
  "text": " Another option is LOCATION1 which is very central and has tons of clubs/bars within walking distance of each other"
}
```

### Converted data

#### Rules

The raw data is a JSON list; each element is a sample, processed as a single-turn dialogue.

* Fill the elements of the "opinions" list into the "aspects" list as shown in the example.

#### Example

```json
{
  "turn": "single",
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": " Another option is LOCATION1 which is very central and has tons of clubs/bars within walking distance of each other",
      "aspects": [
        { "target": {"value": "LOCATION1"}, "opinion": {"value": "nightlife"}, "sentiment": "Positive" },
        { "target": {"value": "LOCATION1"}, "opinion": {"value": "transit-location"}, "sentiment": "Positive" }
      ]
    }
  ]
}
```

## SGD

### Raw data

```json
[
  {
    "dialogue_id": "1_00000",
    "services": ["Restaurants_1"],
    "turns": [
      {
        "frames": [
          {
            "actions": [
              { "act": "INFORM_INTENT", "canonical_values": ["FindRestaurants"], "slot": "intent", "values": ["FindRestaurants"] }
            ],
            "service": "Restaurants_1",
            "slots": [],
            "state": {
              "active_intent": "FindRestaurants",
              "requested_slots": [],
              "slot_values": {}
            }
          }
        ],
        "speaker": "USER",
        "utterance": "I am feeling hungry so I would like to find a place to eat."
      },
      {
        "frames": [
          {
            "actions": [
              { "act": "REQUEST", "canonical_values": [], "slot": "city", "values": [] }
            ],
            "service": "Restaurants_1",
            "slots": []
          }
        ],
        "speaker": "SYSTEM",
        "utterance": "Do you have a specific which you want the eating place to be located at?"
      }
    ]
  }
]
```

### Converted data

#### Rules

The raw data is in JSON format and contains multiple user-system dialogue interactions; process them as multi-turn dialogues.

* For the list in the dialogue's "services" field, strip the underscore and number from each element, keeping only the domain name, and put it into the "domain" list.
* Put the "actions" of USER and SYSTEM into "dialog_act". A slot may have multiple values.
* Put the USER "state" into "belief_state". The "active_intent" part goes into the outer "active_intents" list instead.
* The remaining fields in the raw data can be ignored.

#### Example

```json
{
  "turn": "multi",
  "domain": ["Restaurants"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["USER"],
      "utterance": "I am feeling hungry so I would like to find a place to eat.",
      "belief_state": [],
      "dialog_act": [
        {
          "act": "INFORM_INTENT",
          "slot_value_table": [
            {
              "slot": "intent",
              "values": [
                { "value": "FindRestaurants", "canonical_value": "FindRestaurants" }
              ]
            }
          ]
        }
      ],
      "active_intents": ["FindRestaurants"]
    },
    {
      "roles": ["SYSTEM"],
      "utterance": "Do you have a specific which you want the eating place to be located at?",
      "dialog_act": [
        { "act": "REQUEST", "slot_value_table": [{ "slot": "city" }] }
      ]
    }
  ]
}
```

## SNIPS

### Raw data

```json
{
  "AddToPlaylist": [
    {
      "data": [
        { "text": "Add another " },
        { "text": "song", "entity": "music_item" },
        { "text": " to the " },
        { "text": "Cita Romántica", "entity": "playlist" },
        { "text": " playlist. " }
      ]
    }
  ]
}
```

### Converted data

#### Rules

The raw data is a JSON file: each key of the dictionary is a domain, and its value is a list in which every element is a sample.

* Put the sample's domain into the "domain" list.
* Concatenating all "text" values in the "data" list gives a complete sentence. Process the data as a single-turn dialogue, filling the concatenated sentence into "dialog" as the "utterance".
* Some elements of the "data" list have an "entity" field, which marks a slot-value pair. Fill it into the "slot_value_table" of "belief_state" in the format shown in the example.

#### Example

```json
{
  "turn": "single",
  "domain": ["AddToPlaylist"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["ROLE"],
      "utterance": "Add another song to the Cita Romántica playlist. ",
      "belief_state": [
        {
          "slot_value_table": [
            { "slot": "music_item", "values": [{"value": "song"}] },
            { "slot": "playlist", "values": [{"value": "Cita Romántica"}] }
          ]
        }
      ]
    }
  ]
}
```

## Soccer

### Raw data

`KVR/train_incar.txt`:

```!
#conv#
0 dish_parking distance 2_miles
0 dish_parking traffic_info road_block_nearby
0 dish_parking poi_type parking_garage
...
1 where s the nearest parking_garage	the nearest parking_garage is dish_parking at 550_alester_ave would you like directions there	['parking_garage', '550_alester_ave', 'dish_parking']
2 yes please set directions via a route that avoids all heavy_traffic if possible	it looks like there is a road block being reported on the route but i will still find the quickest route to 550_alester_ave	['550_alester_ave']
3 thanks so much for your help	you re very welcome	[]
```

### Converted data

#### Rules

Only the files under the `KVR/` directory need to be processed. The raw data is plain text containing a knowledge graph and a dialogue grounded in it. Samples are separated by `#conv#`.

* Lines whose first column is 0 form the knowledge graph, stored in "source-relation-target" order. Put all of them into "knowledge" as external knowledge (see the example).
* After the knowledge graph comes the dialogue itself. Each line is a two-turn "USER-SYSTEM" exchange separated by tabs. Fill the dialogue into the "dialog" list as a multi-turn dialogue. The list at the end of each raw line can be ignored.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    { "roles": ["USER"], "utterance": "where s the nearest parking_garage" },
    { "roles": ["SYSTEM"], "utterance": "the nearest parking_garage is dish_parking at 550_alester_ave would you like directions there" },
    { "roles": ["USER"], "utterance": "yes please set directions via a route that avoids all heavy_traffic if possible" },
    { "roles": ["SYSTEM"], "utterance": "it looks like there is a road block being reported on the route but i will still find the quickest route to 550_alester_ave" },
    { "roles": ["USER"], "utterance": "thanks so much for your help" },
    { "roles": ["SYSTEM"], "utterance": "you re very welcome" }
  ],
  "knowledge": {
    "type": "kg",
    "value": {
      "direction": "directed",
      "graph": [
        { "source": "dish_parking", "relation": "distance", "target": "2_miles" },
        { "source": "dish_parking", "relation": "traffic_info", "target": "road_block_nearby" },
        { "source": "dish_parking", "relation": "poi_type", "target": "parking_garage" },
        ...
      ]
    }
  }
}
```

## SocialIQA

### Raw data

`train.jsonl`:

```json!
{"context": "Cameron decided to have a barbecue and gathered her friends together.", "question": "How would Others feel as a result?", "answerA": "like attending", "answerB": "like staying home", "answerC": "a good friend to have"}
```

`train-labels.lst`:

```json!
1
```

### Converted data

#### Rules

The raw data consists of a jsonl file holding the dialogue content and an lst file holding the labels. The two files correspond line by line, and each line is one sample. The values 1\~3 in the lst file correspond to the options answerA\~answerC in the jsonl file.

* The raw data targets a QA task, so process it as multi-turn dialogue: fill the "question" into "dialog" first, then the three candidate answers, and mark whether each answer is correct in "class_label".
* Put the "context" into "knowledge" as external knowledge.

#### Example

```json
{
  "turn": "multi",
  "locale": "en",
  "dialog": [
    {
      "roles": ["question"],
      "utterance": "How would Others feel as a result?"
    },
    {
      "roles": ["answerA"],
      "utterance": "like attending",
      "class_label": true
    },
    {
      "roles": ["answerB"],
      "utterance": "like staying home",
      "class_label": false
    },
    {
      "roles": ["answerC"],
      "utterance": "a good friend to have",
      "class_label": false
    }
  ],
  "knowledge": {
    "type": "text",
    "value": "Cameron decided to have a barbecue and gathered her friends together."
  }
}
```

## TaskMaster

### Raw data

```json
[
  {
    "conversation_id": "dlg-00028478-84a9-4ca7-a3e2-be514a3b8c9d",
    "instruction_id": "movie-tickets-2",
    "utterances": [
      {
        "index": 0,
        "speaker": "ASSISTANT",
        "text": "Hi there! How can I help?"
      },
      {
        "index": 1,
        "speaker": "USER",
        "text": "Oh well, I've tried to go see Aquaman in Reno, Nevada.",
        "segments": [
          {
            "start_index": 30,
            "end_index": 37,
            "text": "Aquaman",
            "annotations": [
              { "name": "movie_ticket.name.movie" }
            ]
          },
          {
            "start_index": 41,
            "end_index": 54,
            "text": "Reno, Nevada.",
            "annotations": [
              { "name": "movie_ticket.location.theater.accept" }
            ]
          }
        ]
      }
    ]
  }
]
```

### Converted data

#### Rules

The raw data is the JSON files that contain the dialogue content, not the csv files (the csv files only store the dataset split). Process it as multi-turn dialogue.

* For each dialogue, remove the trailing hyphen and number from "instruction_id" and put the result into the "domain" list.
* If the "segments" field of an utterance has content, fill it into the "slot_value_table" list of "dialog_act".

#### Example

```json
{
  "turn": "multi",
  "domain": ["movie-tickets"],
  "locale": "en",
  "dialog": [
    {
      "roles": ["ASSISTANT"],
      "utterance": "Hi there! How can I help?"
    },
    {
      "roles": ["USER"],
      "utterance": "Oh well, I've tried to go see Aquaman in Reno, Nevada.",
      "dialog_act": [
        {
          "slot_value_table": [
            {
              "slot": "movie_ticket.name.movie",
              "values": [
                { "value": "Aquaman", "start": 30, "end": 37 }
              ]
            },
            {
              "slot": "movie_ticket.location.theater.accept",
              "values": [
                { "value": "Reno, Nevada.", "start": 41, "end": 54 }
              ]
            }
          ]
        }
      ]
    }
  ]
}
```
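
The TaskMaster rules above can be sketched in Python roughly as follows. This is a minimal sketch, not the project's actual converter: the function name `convert_taskmaster_dialog` is hypothetical, and it assumes that each annotation of a segment becomes one slot entry whose single value carries the segment text and its character offsets, as in the example.

```python
import re
from typing import Any, Dict, List


def convert_taskmaster_dialog(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one raw TaskMaster conversation into the target format.

    Hypothetical helper; field names follow the raw data and example above.
    """
    # Strip the trailing "-<number>" from instruction_id to obtain the domain.
    domain = re.sub(r"-\d+$", "", raw["instruction_id"])

    dialog: List[Dict[str, Any]] = []
    for utt in raw["utterances"]:
        turn: Dict[str, Any] = {
            "roles": [utt["speaker"]],
            "utterance": utt["text"],
        }
        # Only utterances that carry "segments" get a "dialog_act" entry.
        table = []
        for seg in utt.get("segments", []):
            # Assumption: one slot entry per annotation name.
            for ann in seg.get("annotations", []):
                table.append({
                    "slot": ann["name"],
                    "values": [{
                        "value": seg["text"],
                        "start": seg["start_index"],
                        "end": seg["end_index"],
                    }],
                })
        if table:
            turn["dialog_act"] = [{"slot_value_table": table}]
        dialog.append(turn)

    return {"turn": "multi", "domain": [domain], "locale": "en", "dialog": dialog}
```

Applied to each element of the raw JSON list, this should yield one converted dialogue per conversation. Fields such as "conversation_id" and "index" are dropped here because the example does not carry them over.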
