---
title: "中文長文本語意理解 - Victor"
tags: PyConTW2023, 2023-organize, 2023-共筆
---
# 中文長文本語意理解 - Victor
{%hackmd H6-2BguNT8iE7ZUrnoG1Tg %}
<iframe src="https://app.sli.do/event/nWMgU4pojcQWkgyLGsd4dR" height="450" width="100%"></iframe>
> Collaborative writing starts from below
---
- Slides link: https://docs.google.com/presentation/d/1_wFt91_eYSJ2F_suyuHQIcXOYYtNeJk1/edit?usp=sharing&ouid=117277489792224340959&rtpof=true&sd=true
[toc]
**Framework**
- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP): the largest text-processing framework in the Chinese-language domain
## Introduction
### Basic Usage
1. information extraction: structuring
2. text classification
Use case: converting unstructured text into structured data
e.g. source text: court judgments
- court
- plaintiff
- loss of earnings
- ....
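A toy sketch of what "structuring" a judgment means (not the talk's actual code — the snippet, field names, and regexes below are hypothetical, rule-based stand-ins for a learned extractor):

```python
import re

# Hypothetical judgment snippet; target fields follow the example above.
judgment = "臺灣臺北地方法院民事判決。原告:王小明。工作損失:新臺幣50,000元。"

# Map each target field to a regex (a real system would use a model, not rules).
patterns = {
    "法院": r"^(\S+法院)",
    "原告": r"原告[::](\S+?)。",
    "工作損失": r"工作損失[::](\S+?)。",
}

def extract(text, patterns):
    """Turn unstructured text into a field -> value dict."""
    result = {}
    for field, pat in patterns.items():
        m = re.search(pat, text)
        if m:
            result[field] = m.group(1)
    return result

structured = extract(judgment, patterns)
# e.g. {"法院": "臺灣臺北地方法院", "原告": "王小明", "工作損失": "新臺幣50,000元"}
```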
### Modeling
```graphviz
digraph {
node[shape="box"]
text_input[label="Text Input"]
IE[label="information extraction"]
TC[label="text classification"]
text_input -> {IE TC}
}
```
```graphviz
digraph {
node[shape="box"]
UIE[label="pretrained\nUIE"]
NER[label="Named Entity\nRecognition"]
RE[label="Relation Extraction"]
ED[label="Event Detection"]
SE[label="Sentiment Extraction"]
MRE[label="more downstream tasks"]
UIE -> {NER RE ED SE} -> MRE
}
```
### Information Extraction - Model Pretraining
- UIE (Universal Information Extraction, text-to-structure) pretrained model
- treats information extraction as different tasks, covering the following four text tasks:
+ Named Entity Recognition
+ Relation Extraction
+ Event Detection
+ Sentiment Extraction
- USM (Universal Semantic Matching) pretrained model
- Structuring
- Utterance Structure
- Pair Structure
- Conceptualizing
- more info: https://sites.research.google/usm/
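The text-to-structure idea above can be sketched as follows. The schema shapes loosely follow PaddleNLP's UIE schema style (lists for entities, dicts for subject → relation/argument), but the exact field names and prompt wording here are assumptions, not the library's API:

```python
# How UIE's text-to-structure view frames the four tasks as extraction schemas.
schemas = {
    "Named Entity Recognition": ["人物", "地點", "組織"],   # entity types
    "Relation Extraction": {"人物": ["任職於", "出生地"]},  # subject -> relations
    "Event Detection": {"地震觸發詞": ["時間", "震級"]},    # trigger -> arguments
    "Sentiment Extraction": "情感傾向[正向,負向]",          # classification-style target
}

def prompts_for(task):
    """Flatten a schema into the list of extraction targets the model is asked for."""
    schema = schemas[task]
    if isinstance(schema, str):
        return [schema]
    if isinstance(schema, dict):
        return [f"{subj}的{rel}" for subj, rels in schema.items() for rel in rels]
    return list(schema)
```

With one pretrained model, switching tasks is just switching schemas — which is why the fine-tuned downstream tasks in the diagram all hang off the same UIE node.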
### Text Extraction - Model Pretraining
- UTC (Universal Text Classification)
- essentially still based on USM under the hood
## Long Sequence issues and Solutions
### Issues arising from Long Sequences
- Model training is restricted by **GPU memory**
- length ~ over 10,000 tokens
- on a typical 16 GB GPU, around 2,000+ tokens is already close to the limit
- Paddle's default handling keeps only the first 2048 tokens and discards the rest => information missing
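The truncation behaviour described above amounts to a head-only slice (a minimal sketch; `MAX_LEN` is the 2048-token limit mentioned in the talk, and the token list is synthetic):

```python
MAX_LEN = 2048  # token limit mentioned in the talk

def truncate(tokens, max_len=MAX_LEN):
    """Keep only the first max_len tokens; everything after is silently lost."""
    return tokens[:max_len]

doc = [f"tok{i}" for i in range(10_000)]  # a ~10k-token "long text"
kept = truncate(doc)
dropped = len(doc) - len(kept)  # 7952 tokens discarded => information missing
```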
==Question: do long texts really affect these tasks?==
+ Information Extraction
+ can be handled by splitting the text into chunks
+ Text Classification
+ we do not know where the key content is, so chunking is not suitable
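The chunking approach for extraction can be sketched like this (a minimal illustration with a stand-in extractor; real systems would dedupe by span offsets rather than by value):

```python
def chunk(tokens, size=512, overlap=64):
    """Split a long token sequence into overlapping chunks."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def extract_over_chunks(tokens, extractor, size=512, overlap=64):
    """Run an extractor per chunk and merge the spans it finds."""
    found = []
    for c in chunk(tokens, size, overlap):
        for span in extractor(c):
            if span not in found:  # overlapping chunks may yield duplicates
                found.append(span)
    return found
```

This works for extraction because each target span is local to some chunk; classification has no such guarantee, which is why chunking is a poor fit there.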
### General Solutions to Long Texts
1. chunking
2. with or without recurrent mechanism
3. :thumbsup: (recommended by the presenter) shortening or summarization
a process of removing the noise
### Proposed Solution (UIE + UTC)
Shorten the text first, then run the task
**Stage 1**. Fine-tuning UIE
- Goal: find the statements related to the labels (the passages we are interested in), with the noise already removed (about a 60% boost in performance metrics; based on token indexes?)
**Stage 2**. UTC
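The two-stage idea can be sketched with stand-ins for the fine-tuned UIE (stage 1) and UTC (stage 2) models — the relevance test and classifier below are hypothetical placeholders, not the actual models:

```python
def stage1_shorten(text, is_relevant):
    """Stage 1 (UIE stand-in): keep only label-relevant sentences, dropping noise."""
    sentences = [s for s in text.split("。") if s]
    return "。".join(s for s in sentences if is_relevant(s))

def stage2_classify(short_text, classifier):
    """Stage 2 (UTC stand-in): classify the shortened, denoised text."""
    return classifier(short_text)

def pipeline(text, is_relevant, classifier):
    """Shorten first, then classify — the long text never reaches the classifier."""
    return stage2_classify(stage1_shorten(text, is_relevant), classifier)
```

The point of the design is that stage 2 only ever sees text that fits comfortably within the model's token limit.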
## Result and Conclusion
### Information Extraction
- First extract the information related to the labels
- Experiment Results
- the fine-tuned UIE performs better
- finding indexes is harder for GPT
| Model | Precision | Recall | F1 |
| -------- | -------- | -------- | ----- |
| Baseline (0 shot UIE) | 0.22 | 0.38 | 0.28 |
| UIE | 0.80 | 0.83 | 0.82 |
| GPT 3.5 | 0.26 | 0.06 | 0.10 |
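As a quick sanity check, the reported F1 scores are consistent with F1 = 2PR / (P + R) up to rounding (the table values are rounded to two decimals, so the recomputed F1 can differ by about 0.01):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) from the information-extraction table above
rows = {
    "Baseline (0 shot UIE)": (0.22, 0.38, 0.28),
    "UIE":                   (0.80, 0.83, 0.82),
    "GPT 3.5":               (0.26, 0.06, 0.10),
}
for name, (p, r, reported) in rows.items():
    assert abs(f1(p, r) - reported) <= 0.01, name  # consistent up to rounding
```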
### Text Classification
| Model | Precision | Recall | F1 |
| -------- | -------- | -------- | ----- |
| Baseline (ERNIE) | 0.40 | 0.46 | 0.43 |
| UIE + UTC| 0.87 | 0.89 | 0.88 |
| GPT 3.5 | 0.76 | 0.65 | 0.70 |
## Conclusion
- Few-shot **prompt learning** shows promise.
- Text shortening through **two stage modeling** could also be a viable solution.
- Speaker: [GitHub (vic4code)](https://github.com/vic4code)
## QA
+ Do different languages/scripts require additional training?
+ Yes
+ In the IE experiment results, how many classes were there? Was the data balanced?
+ It was checked; not shown due to time constraints
+ How long does a text have to be to count as a long text?
+ No clear standard; when it exceeds a certain limit
+ Slides link?
+ Will be provided later
+ Slides link: https://docs.google.com/presentation/d/1_wFt91_eYSJ2F_suyuHQIcXOYYtNeJk1/edit?usp=sharing&ouid=117277489792224340959&rtpof=true&sd=true
+ How should one choose between PaddleNLP and LangChain?
+ LangChain leans more toward prompt-based usage
+ Choose according to the task at hand
+ Can the UIE model do unsupervised extraction of keywords or specific data?
+ Yes
Below is the part where the speaker updated or corrected the talk/tutorial after the speech