---
tags: research
---

# 🦄 Difficult SA Dataset Survey

- Textual Meaning Detection
    - Sarcasm Detection
    - SemEval-2021 Task 6 Detection of Persuasion Techniques in Texts and Images: https://aclanthology.org/2021.semeval-1.7.pdf (highest score: 0.593 F1-Micro)
- Dialogue ABSA
    ![](https://i.imgur.com/ZrDh0us.png =300x)
    - [x] [freesunshine0316/lab-conv-asa](https://github.com/freesunshine0316/lab-conv-asa/blob/master/data), [paper](https://dl.acm.org/doi/pdf/10.1613/jair.1.12802) downloaded under ikmpc: `~projects/research/`.
        - Annotation: follow SemEval 2014 Annotation Guideline for Standard Aspect SA [link](https://alt.qcri.org/semeval2014/task4/data/uploads/semeval14_absa_annotationguidelines.pdf)
        - Source: DuConv(3000) + NewsDialogue(200) dialogues
        - Target Mention and Target Opinion Expression
        - ![](https://i.imgur.com/S8A9Vov.png)
    - [ ] DiaASQ: [paper](https://deepai.org/publication/diaasq-a-benchmark-of-conversational-aspect-based-sentiment-quadruple-analysis), [github](https://github.com/unikcc/DiaASQ)
        - A dataset download link is provided but is broken.
        - Annotation: follow SemEval 2014 Annotation Guideline for Standard Aspect SA [link](https://alt.qcri.org/semeval2014/task4/data/uploads/semeval14_absa_annotationguidelines.pdf)
        - Source: 9M Weibo posts and comments from 200 verified bloggers, followed by phone-related filtering, pruning of meaningless replies, and a 10-reply limit. In the end, 1000 dialogues are annotated.
        - ![](https://i.imgur.com/E6lFjwk.png =400x)
    - [ ] Wang, Jiancheng, et al. "Sentiment classification in customer service dialogue with topic-aware multi-task learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 05. 2020. [paper](https://ojs.aaai.org/index.php/AAAI/article/view/6454) **Dataset Missing**
    - [x] Scenario SA: https://arxiv.org/pdf/1907.05562v1.pdf **Dataset Available (downloaded)** `~projects/sceraioSA`
- Conflict ABSA
    - see MAMS repo's Laptop and Rest
- Challenging ABSA: MAMS and ARTS
- Structured Sentiment Analysis
    - [SemEval 2022 Shared Task Official](https://competitions.codalab.org/competitions/33556#learn_the_details-overview)

## Annotation Guideline

- SemEval 2014 Annotation Guideline for Standard Aspect SA [link](https://alt.qcri.org/semeval2014/task4/data/uploads/semeval14_absa_annotationguidelines.pdf)

## Number of aspectTerm or aspectCategory

Both RobertaABSA and ARTS remove the Conflict label from the SemEval data.

- ARTS
    ![](https://i.imgur.com/un3XHXh.png)
- RobertaABSA
    ![](https://i.imgur.com/kxDE16w.png)

This shows that a problem academia has dropped because it is troublesome to define (Conflict is the example here) is exactly the kind of problem industry finds interesting and wants handled with care. Eagle has asked what to do about a similar case.

Q. Beyond deciding that an aspect is conflict, how could it be given a more detailed label? It may also be worth looking into opinion-term annotation: [semeval triplet annot](https://github.com/xuuuluuu/SemEval-Triplet-data)

Data source: `SemEval 14 Laptop train.json`

![](https://i.imgur.com/Qpj53Gj.png)
![](https://i.imgur.com/Hswh0E9.png)
![](https://i.imgur.com/3zJOA7D.png)

I'd have to replace the battery once, but that was only a couple months ago and it's been working perfect ever since.

## Conflict Aspect Terms

SemEval baseline counts

![](https://i.imgur.com/IslOi1r.png)

Conflict could be annotated in more detail, and it is worth observing how LLMs handle it. The Conflict cases suggest that labeling sentiment per aspect term is still too coarse-grained; aspect term and opinion expression could be considered together. In the first stage of this research, the data can be used to define a finer-grained task. After that, think about how to solve it with existing BERT or T5 models, or with current LLMs.

- Conflict Aspect Terms count table

| Dataset | train | test |
| -------- | -------- | -------- |
| SemEval14 Restaurant | 287 | 66 |
| SemEval14 Laptop | 45 | 16 |

## Merge Status Table

Can ACSA and ATSA in the MAMS dataset be merged on text?

| Unit: sentences | train | val | test | all |
| -------- | -------- | -------- | --- | --- |
| ATSA | 4297 | 500 | 500 | 5297 |
| ACSA | 3149 | 400 | 400 | 3949 |
| ACSA $\cap$ ATSA | 2556 | 37 | 35 | 3949 |

This means all 3949 ACSA sentences can be pulled out and merged into a dataset for a new task. The per-split intersections sum to only 2628 (2556 + 37 + 35), so the remaining 1321 sentences presumably match only across splits.
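
The overlap counts in the merge table above could be reproduced with a check along the following lines. This is only a minimal sketch, not the script actually used for the table: `overlap_report`, `normalize`, and the toy sentences are hypothetical names and data introduced for illustration, and it assumes each task's splits have already been loaded as lists of sentence texts. It also counts distinct texts, so the numbers would differ from sentence counts if a split contains duplicate sentences.

```python
from typing import Dict, Iterable

SPLITS = ("train", "val", "test")


def normalize(text: str) -> str:
    """Collapse whitespace so trivially different copies of a sentence match."""
    return " ".join(text.split())


def overlap_report(
    acsa: Dict[str, Iterable[str]], atsa: Dict[str, Iterable[str]]
) -> Dict[str, int]:
    """Count distinct sentence texts shared by ACSA and ATSA, per split and pooled."""
    acsa_sets = {s: {normalize(t) for t in acsa[s]} for s in SPLITS}
    atsa_sets = {s: {normalize(t) for t in atsa[s]} for s in SPLITS}

    # Intersection inside each split (same-split matches only).
    report = {s: len(acsa_sets[s] & atsa_sets[s]) for s in SPLITS}

    # Intersection after pooling all splits of each task.
    acsa_all = set().union(*acsa_sets.values())
    atsa_all = set().union(*atsa_sets.values())
    pooled = acsa_all & atsa_all
    report["all"] = len(pooled)

    # Texts that match only when the split boundary is ignored.
    same_split = set().union(*(acsa_sets[s] & atsa_sets[s] for s in SPLITS))
    report["cross_split_only"] = len(pooled - same_split)
    return report


# Toy data (hypothetical), not taken from MAMS.
toy_acsa = {"train": ["a b", "c d"], "val": ["e f"], "test": ["g h"]}
toy_atsa = {"train": ["a  b", "g h"], "val": ["c d"], "test": ["x y"]}
report = overlap_report(toy_acsa, toy_atsa)
print(report)
```

A non-zero `cross_split_only` value indicates sentences that sit in different splits in the two tasks, which matters if the merged dataset is supposed to keep the original train/val/test boundaries.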