---
title: "中文長文本語意理解 - Victor"
tags: PyConTW2023, 2023-organize, 2023-共筆
---

# 中文長文本語意理解 (Semantic Understanding of Long Chinese Texts) - Victor
{%hackmd H6-2BguNT8iE7ZUrnoG1Tg %}
<iframe src=https://app.sli.do/event/nWMgU4pojcQWkgyLGsd4dR height=450 width=100%></iframe>

> Collaborative notes start below

---

- Slides link: https://docs.google.com/presentation/d/1_wFt91_eYSJ2F_suyuHQIcXOYYtNeJk1/edit?usp=sharing&ouid=117277489792224340959&rtpof=true&sd=true

[toc]

**Framework**
- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP): the largest framework for text processing in the Chinese-language domain

## Introduction

### Basic Usage
1. Information extraction: structuralization
2. Text classification

Purpose: convert unstructured text into structured data.

e.g. Text: a court judgment
- Court
- Plaintiff
- Work loss
- ....

### Modeling

```graphviz
digraph {
    node[shape="r"]
    text_input[label="Text Input"]
    IE[label="information extraction"]
    TC[label="text classification"]
    text_input -> IE, TC
}
```

```graphviz
digraph {
    node[shape="r"]
    UIE[label="pretrained\nUIE"]
    NER[label="Named Entity"]
    RE[label="Relation Extraction"]
    ED[label="Event Detection"]
    SE[label="Sentiment Extraction"]
    MRE[label="more downstream tasks"]
    UIE -> NER, RE, ED, SE -> MRE
}
```

### Information Extraction
- Model Pretraining
    - UIE (Universal Information Extraction, text-to-structure) pretrained model
        - Treats information extraction as a set of different tasks, including the following four text tasks:
            + Named Entity Recognition
            + Relation Extraction
            + Event Detection
            + Sentiment Extraction
    - USM (Unified Semantic Matching) pretrained model
        - Structuring
            - Utterance Structure
            - Pair Structure
        - Conceptualizing
        - more info: https://sites.research.google/usm/ (this page appears to describe Google's Universal Speech Model, which shares the acronym)

### Text Classification
- Model Pretraining
    - UTC (Universal Text Classification)
        - Under the hood, it is still essentially USM

## Long Sequence Issues and Solutions

### Issues Arising from Long Sequences
- Model training is restricted by **GPU memory**
    - Sequence length: roughly over 10,000
    - A typical 16 GB GPU is already close to its limit at a little over 2,000 tokens
- Paddle's approach is to keep only the first 2048 tokens and discard the rest => information is lost

==Think: do long texts really affect these tasks?==
+ Information Extraction
    + Can be handled by splitting the text into chunks
+ Text Classification
    + We do not know where the key content is, so chunking is not suitable

### General Solutions to Long Texts
1. Chunking
2. With or without a recurrent mechanism
3. :thumbsup: (recommended by the presenter) Shortening or summarization
   A process of removing noise

### Proposed Solution (UIE + UTC)
Shorten the text first, then perform the task.

**Stage 1**. Fine-tuning UIE
- Goal: find the statements related to the label (the passages we are interested in), with the noise already removed (about 60% boost in performance metrics; based on token indexes?)

**Stage 2**. UTC

## Result and Conclusion

### Information Extraction
- First extract the information related to the label
- Experiment Results
    - The fine-tuned UIE performs better
    - Finding indexes is harder for GPT

| Model | Precision | Recall | F1 |
| -------- | -------- | -------- | ----- |
| Baseline (0 shot UIE) | 0.22 | 0.38 | 0.28 |
| UIE | 0.80 | 0.83 | 0.82 |
| GPT 3.5 | 0.26 | 0.06 | 0.10 |

### Text Classification

| Model | Precision | Recall | F1 |
| -------- | -------- | -------- | ----- |
| Baseline (ERNIE) | 0.40 | 0.46 | 0.43 |
| UIE + UTC | 0.87 | 0.89 | 0.88 |
| GPT 3.5 | 0.76 | 0.65 | 0.70 |

## Conclusion
- Few-shot **prompt learning** shows promise.
- Text shortening through **two-stage modeling** could also be a viable solution.
- Speaker: [GitHub (vic4code)](https://github.com/vic4code)

## QA
+ Does a different language (script) require additional training?
    + Yes
+ In the IE experiment results, how many classes were there? Was the data balanced?
    + This was done, but it was left out of the slides due to time constraints
+ How long does a text have to be to count as a long text?
    + There is no clear standard; it is a long text once it exceeds some limit
+ Slides link?
    + To be provided later
    + Slides link: https://docs.google.com/presentation/d/1_wFt91_eYSJ2F_suyuHQIcXOYYtNeJk1/edit?usp=sharing&ouid=117277489792224340959&rtpof=true&sd=true
+ How should one choose between PaddleNLP and LangChain?
    + LangChain leans more toward prompt usage
    + Choose according to the task
+ Can the UIE model do unsupervised learning, to extract keywords or specific data?
    + Yes

Below is the part that the speaker updated or corrected after the talk
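The shorten-then-classify idea from the Proposed Solution section can be sketched in plain Python. This is only an illustrative sketch written for these notes, not the speaker's implementation: `chunk_tokens`, `shorten`, `two_stage_classify`, `toy_extract` and `toy_classify` are hypothetical names, the window size and overlap are arbitrary, and the two toy functions merely stand in for a fine-tuned UIE model (Stage 1) and a UTC model (Stage 2).

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases: an extractor maps a chunk to the spans it keeps,
# a classifier maps the shortened text to a label.
Extractor = Callable[[Sequence[str]], List[str]]
Classifier = Callable[[Sequence[str]], str]


def chunk_tokens(tokens: Sequence[str], max_len: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that a span lying on a
    window boundary is still seen whole by at least one window.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def shorten(tokens: Sequence[str], max_len: int, overlap: int, extract: Extractor) -> List[str]:
    """Stage 1: run the extractor on every chunk and keep label-related spans.

    Simplification: identical spans are kept only once, which removes the
    duplicates caused by overlapping windows but also genuine repetitions.
    """
    kept: List[str] = []
    for chunk in chunk_tokens(tokens, max_len, overlap):
        for span in extract(chunk):
            if span not in kept:
                kept.append(span)
    return kept


def two_stage_classify(tokens: Sequence[str], max_len: int, overlap: int,
                       extract: Extractor, classify: Classifier) -> str:
    """Stage 2: classify the shortened text instead of the full long text."""
    return classify(shorten(tokens, max_len, overlap, extract))


# Toy stand-ins (assumptions for illustration only).
KEYWORDS = {"plaintiff", "damages"}


def toy_extract(chunk: Sequence[str]) -> List[str]:
    """Pretend extractor: keep tokens that look label-related."""
    return [tok for tok in chunk if tok in KEYWORDS]


def toy_classify(spans: Sequence[str]) -> str:
    """Pretend classifier: decide from the extracted spans only."""
    return "civil" if "damages" in spans else "other"


if __name__ == "__main__":
    doc = ["noise"] * 5 + ["plaintiff"] + ["noise"] * 5 + ["damages"]
    # Truncation keeps only the first window, so the key tokens are lost.
    print("truncated:", toy_classify(toy_extract(doc[:4])))
    # Shortening first looks at every window before classifying.
    print("two-stage:", two_stage_classify(doc, 4, 1, toy_extract, toy_classify))
```

In this toy setting, truncation never sees the key tokens near the end of the text, while the two-stage version classifies only the extracted spans, which mirrors the motivation given in the talk for shortening before classification.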