Semantic Understanding of Long Chinese Texts - Victor
Welcome to PyCon TW 2023 Collaborative Writing
Collaborative Writing Workplace: https://hackmd.io/@pycontw/2023
On mobile, tap the button above to unfold the agenda.
Collaborative writing starts below
Framework
Introduction
Basic Usage
- information extraction: structuralization
- text classification
Purpose: convert unstructured text into structured data
e.g. text: court judgments
Modeling
- UIE (Universal Information Extraction, text-to-structure) pretrained model
- treats information extraction as different tasks, including the following four text tasks:
- Named Entity Recognition
- Relation Extraction
- Event Detection
- Sentiment Extraction
- USM (Universal Semantic Matching) pretrained model
- UTC (Universal Text Classification)
Long Sequence issues and Solutions
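- Model training is restricted by GPU memory
- text length: roughly 10,000 or more
- a typical 16 GB GPU is already close to its limit at a little over 2,000 tokens
- Paddle's approach keeps only the first 2048 tokens and discards the rest => information is lost
Question to consider: do long texts really affect these tasks?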
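The information loss from keeping only the head of a document can be illustrated with a minimal sketch. It assumes one character equals one token, which is only a rough stand-in for a real Chinese tokenizer, and the sample document is made up for illustration.

```python
# Minimal sketch: head truncation at a fixed token budget.
# Assumption: one character = one token (a rough stand-in for a real tokenizer).

MAX_LEN = 2048  # the budget mentioned in the notes


def truncate_head(tokens, max_len=MAX_LEN):
    """Keep the first max_len tokens and drop everything after them."""
    return tokens[:max_len]


# Hypothetical long document: the key sentence sits near the end.
filler = "背景敘述。" * 600               # 3,000 characters of background
key_sentence = "被告應賠償原告十萬元。"    # "The defendant shall pay the plaintiff 100,000."
document = filler + key_sentence

tokens = list(document)
kept = truncate_head(tokens)

print(len(tokens))                    # 3011
print(len(kept))                      # 2048
print(key_sentence in "".join(kept))  # False: the key sentence was cut off
```

If the relevant passage happens to fall after the cutoff, no downstream model can recover it, which is what motivates the solutions listed next.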
- Information Extraction
- Text Classification
General Solutions to Long Texts
- chunking
- with or without recurrent mechanism
- shortening or summarization (recommended by the presenter)
- a process of removing noise
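The chunking option above, in its simplest form without any recurrent mechanism, can be sketched as a sliding window. The function name `chunk_tokens` and the `overlap` parameter are illustrative choices, not something shown in the talk; overlapping windows are one common way to avoid cutting a relevant span in half.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split tokens into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens, so a span that straddles
    one boundary is still seen whole in the next window.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))
print(chunk_tokens(tokens, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk is then processed independently and the per-chunk outputs have to be merged afterwards. A recurrent mechanism would instead carry some state from one chunk to the next.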
Proposed Solution (UIE + UTC)
Shorten the text first, then perform the task
Stage 1. Finetuning UIE
- Goal: find the statements related to the label (the passages we are interested in), with noise already removed (about 60% boost in performance metrics; based on token indexes?)
Stage 2. UTC
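The control flow of the two stages can be sketched with toy stand-ins. Nothing below is the speaker's code or the PaddleNLP API: `LABEL_CUES`, `stage1_shorten`, and `stage2_classify` are hypothetical, and keyword matching merely takes the place of the fine-tuned UIE extractor and the UTC classifier to show the "shorten first, then classify" idea.

```python
import re

# Hypothetical label -> cue words. A fine-tuned UIE model would learn
# which passages matter instead of relying on a fixed keyword list.
LABEL_CUES = {
    "compensation": ["賠償", "給付"],
    "custody": ["監護", "扶養"],
}


def split_sentences(text):
    """Split on Chinese sentence-ending punctuation, keeping the punctuation."""
    return [s for s in re.findall(r"[^。!?]*[。!?]?", text) if s]


def stage1_shorten(text, label_cues=LABEL_CUES):
    """Stand-in for Stage 1: keep only sentences that mention a label cue."""
    cues = [cue for words in label_cues.values() for cue in words]
    return "".join(s for s in split_sentences(text) if any(c in s for c in cues))


def stage2_classify(short_text, label_cues=LABEL_CUES):
    """Stand-in for Stage 2: pick the label whose cues occur most often."""
    scores = {label: sum(short_text.count(cue) for cue in words)
              for label, words in label_cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Made-up document: one relevant sentence buried in thousands of characters.
document = "本院審理如下。" * 500 + "被告應賠償原告十萬元。" + "訴訟費用由被告負擔。" * 500
short = stage1_shorten(document)

print(len(document), len(short))
print(stage2_classify(short))
```

The point of the design is that the classifier only ever sees the shortened text, so its input stays well under the length limit regardless of how long the original document is.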
Result and Conclusion
- First extract the information related to the label
- Experiment results
- The fine-tuned UIE performs better
- Finding indexes is harder for GPT
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Baseline (0 shot UIE) | 0.22 | 0.38 | 0.28 |
| UIE | 0.80 | 0.83 | 0.82 |
| GPT 3.5 | 0.26 | 0.06 | 0.10 |
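The notes do not record how the extraction results were scored. As one plausible reading of "based on token indexes", the sketch below assumes exact-match scoring over `(start, end)` spans; the function name `span_prf` and the sample spans are made up. Under this assumption a span that is off by a single position counts as a miss, which would be consistent with index prediction being hard for a generative model.

```python
def span_prf(predicted, gold):
    """Exact-match precision, recall and F1 over sets of (start, end) spans."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)

    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = [(10, 25), (40, 58)]
pred = [(10, 25), (41, 58)]  # second span is off by one position

p, r, f = span_prf(pred, gold)
print(p, r, f)  # 0.5 0.5 0.5
```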
Text Classification
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Baseline (ERNIE) | 0.40 | 0.46 | 0.43 |
| UIE + UTC | 0.87 | 0.89 | 0.88 |
| GPT 3.5 | 0.76 | 0.65 | 0.70 |
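As a quick consistency check on the text classification numbers, F1 is the harmonic mean of precision $P$ and recall $R$. Plugging in the UIE + UTC row, and assuming the reported values are rounded to two decimals:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.87 \times 0.89}{0.87 + 0.89} = \frac{1.5486}{1.76} \approx 0.88
$$

The same formula gives about 0.43 for the baseline row and about 0.70 for the GPT 3.5 row, matching the reported F1 values.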
Conclusion
- Few-shot prompt learning shows promise.
- Text shortening through two-stage modeling could also be a viable solution.
- Speaker's GitHub (vic4code)
QA
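- Does text in a different language or script require additional training?
- In the IE experiment results, how many classes were there? Was the data balanced?
- How long does a text have to be to count as a long text?
- Link to the slides?
- How should one choose between PaddleNLP and LangChain?
- LangChain leans more toward prompt usage
- Choose according to the task
- Can the UIE model do unsupervised learning, to extract keywords or specific data?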
Below are the speaker's updates or corrections to the slides after the talk