# Data Mining (Team 5)

+ [GitHub repo](https://github.com/tangerine1202/DataMining-Team5)
+ [Progress 1 report](https://www.icloud.com/keynote/0d450KCHSvLz1fSuLVXTm1GnA#DM-progress-report1)
+ [AI Competition](https://tbrain.trendmicro.com.tw/Competitions/Details/26)

- next week
    - QA
        - [x] train-val-test split (0.7, 0.2, 0.1)
        - [x] Try other metrics: LCS, F1 (SQuAD), exact_match (SQuAD)
        - [x] An HF model can easily be wrapped in a PyTorch model.
        - [optional]
            - How to feed s in
            - ==Are each q' and r' independent? Or do some (q', r') pairs need to go together?==
            - There are many (q', r') groups; how should they be unified, and do they need to be unified at all?
            - How many unseen tokens are in the dataset (probably none)
- next week
    - Token Classification (洪偉豪)
        - LCS ([Result](https://docs.google.com/spreadsheets/d/1tJhIE_M-DWNVVNykEF0YOGspO6J80v6Fd4LBCmqKO5k/edit?usp=sharing))
        - Filter noise (#, [CLS]) out of predict_Q and predict_R
        - Tune hyperparameters (currently training for only one epoch because suitable hyperparameters have not been found yet)

---

- **Token Classification (NER)**
    - [Custom Named Entity Recognition with BERT.ipynb](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb)
    - [huggingface tasks -- token classification](https://huggingface.co/docs/transformers/tasks/token_classification)
- **Classification**
    - [Opinion mining](https://paperswithcode.com/task/opinion-mining)
    - [LIAR-PLUS](https://github.com/Tariq60/LIAR-PLUS)
        - contains statement and justification (statement excluded)
        - from [Where is Your Evidence: Improving Fact-checking by Justification Modeling](https://paperswithcode.com/paper/where-is-your-evidence-improving-fact)
- Argument Mining
    - **Argument component identification**
        - Selecting the parts of a general text that can be part of an argumentation (i.e. argument segments vs. non-argument segments).
        - methods
            - structured (e.g. RST)
            - unstructured
        - approaches
            1. sentence-level classification
            2. sequence-level argument discourse units (ADUs, argument propositions)
    - Argument component classification
        - Determine the type of an argument proposition (e.g. premise, claim, conclusion, etc.).
    - Argument connection detection
        - none, support, attack, etc.
    - some datasets
        - [AIF-DB](https://corpora.aifdb.org)
        - [UKP](https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/1997)
        - [AMPERE++](https://zenodo.org/record/6362430#.Y2tt_i9CaAn) from [AURC](https://github.com/trtm/AURC)
    - reference
        - [Arguing with BERT](https://negedng.github.io/files/MSc_thesis.pdf)
            - includes an introduction to well-known datasets
        - [Argument Mining: A Survey -- 6. Identifying Argument Components](https://direct.mit.edu/coli/article/45/4/765/93362/Argument-Mining-A-Survey)
- [Stance Detection](https://paperswithcode.com/task/stance-detection)
    - The extraction of a subject's reaction to a claim made by a primary actor. For example:

    ```
    Source: "Apples are the most delicious fruit in existence"
    Reply: "Obviously not, because that is a reuben from Katz's"
    Stance: deny
    ```
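
The train-val-test split (0.7, 0.2, 0.1) from the QA checklist could be done roughly as follows. This is only a minimal sketch: the function name `split_dataset`, the fixed seed, and the choice to shuffle before cutting are assumptions, not taken from the team's repo.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle a copy of `items` and cut it into train/val/test parts.

    `ratios` and `seed` are illustrative defaults (assumptions).
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder absorbs any rounding
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

If several rows in the dataset share the same source text, it may be safer to split by source ID instead of by row, so that near-duplicates do not leak from train into val or test.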
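
For the LCS metric mentioned in the notes, a longest-common-subsequence length can be computed with standard dynamic programming. The sketch below works on strings or token lists. The normalisation in `lcs_score` (LCS length divided by the longer sequence) is a hypothetical choice for illustration and may differ from the competition's official scoring.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous element of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(pred, gold):
    """Hypothetical normalisation: LCS length over the longer sequence."""
    if not pred and not gold:
        return 1.0
    return lcs_length(pred, gold) / max(len(pred), len(gold))


print(lcs_length("ABCBDAB", "BDCABA"))
print(lcs_score("abcd".split(), "abcd".split()))
```

Passing token lists instead of raw strings makes the score token-level, which is usually less sensitive to small spelling differences.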
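
Filtering noise such as `#` and `[CLS]` out of predict_Q and predict_R could look like the sketch below. It assumes the predictions are lists of BERT WordPiece token strings, that `##` marks a continuation piece, and that words are joined with spaces. The names `SPECIAL_TOKENS` and `clean_tokens` are hypothetical.

```python
# Assumed set of special tokens to drop; adjust to the tokenizer in use.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def clean_tokens(tokens):
    """Drop special tokens and merge '##' continuation pieces into words."""
    words = []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            continue
        if tok.startswith("##"):
            piece = tok[2:]
            if words:
                words[-1] += piece  # glue onto the previous word
            else:
                words.append(piece)  # stray continuation at the start
        else:
            words.append(tok)
    return " ".join(words)


print(clean_tokens(["[CLS]", "app", "##les", "are", "good", "[SEP]", "[PAD]"]))
```

When predictions are still token IDs, calling the Hugging Face tokenizer's `decode` with `skip_special_tokens=True` is likely a simpler route than post-processing strings by hand.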