---
title: "中文長文本語意理解 - Victor"
tags: PyConTW2023, 2023-organize, 2023-共筆
---
# 中文長文本語意理解 - Victor
{%hackmd H6-2BguNT8iE7ZUrnoG1Tg %}
<iframe src="https://app.sli.do/event/nWMgU4pojcQWkgyLGsd4dR" height="450" width="100%"></iframe>
> Collaborative writing starts from below
---
- Slides link: https://docs.google.com/presentation/d/1_wFt91_eYSJ2F_suyuHQIcXOYYtNeJk1/edit?usp=sharing&ouid=117277489792224340959&rtpof=true&sd=true
[toc]
**Framework**
- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP): the largest text-processing framework in the Chinese-language domain
## Introduction
### Basic Usage
1. information extraction: structuring
2. text classification
Use case: converting unstructured text into structured data
e.g. source text: court judgments
- court
- plaintiff
- loss of earnings
- ....
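A toy sketch of what "structuring" a judgment means (not the talk's actual code — the snippet, field names, and regexes below are hypothetical, rule-based stand-ins for a learned extractor):

```python
import re

# Hypothetical judgment snippet; target fields follow the example above.
judgment = "臺灣臺北地方法院民事判決。原告:王小明。工作損失:新臺幣50,000元。"

# Map each target field to a regex (a real system would use a model, not rules).
patterns = {
    "法院": r"^(\S+法院)",
    "原告": r"原告[::](\S+?)。",
    "工作損失": r"工作損失[::](\S+?)。",
}

def extract(text, patterns):
    """Turn unstructured text into a field -> value dict."""
    result = {}
    for field, pat in patterns.items():
        m = re.search(pat, text)
        if m:
            result[field] = m.group(1)
    return result

structured = extract(judgment, patterns)
# e.g. {"法院": "臺灣臺北地方法院", "原告": "王小明", "工作損失": "新臺幣50,000元"}
```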
### Modeling
```graphviz
digraph {
node[shape="box"]
text_input[label="Text Input"]
IE[label="information extraction"]
TC[label="text classification"]
text_input -> {IE TC}
}
```
```graphviz
digraph {
node[shape="box"]
UIE[label="pretrained\nUIE"]
NER[label="Named Entity\nRecognition"]
RE[label="Relation Extraction"]
ED[label="Event Detection"]
SE[label="Sentiment Extraction"]
MRE[label="more downstream tasks"]
UIE -> {NER RE ED SE} -> MRE
}
```
### Information Extraction - Model Pretraining
- UIE (Universal Information Extraction, text-to-structure) pretrained model
- treats information extraction as different tasks, covering the following four text tasks:
+ Named Entity Recognition
+ Relation Extraction
+ Event Detection
+ Sentiment Extraction
- USM (Universal Semantic Matching) pretrained model
- Structuring
- Utterance Structure
- Pair Structure
- Conceptualizing
- more info: https://sites.research.google/usm/
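The text-to-structure idea above can be sketched as follows. The schema shapes loosely follow PaddleNLP's UIE schema style (lists for entities, dicts for subject → relation/argument), but the exact field names and prompt wording here are assumptions, not the library's API:

```python
# How UIE's text-to-structure view frames the four tasks as extraction schemas.
schemas = {
    "Named Entity Recognition": ["人物", "地點", "組織"],   # entity types
    "Relation Extraction": {"人物": ["任職於", "出生地"]},  # subject -> relations
    "Event Detection": {"地震觸發詞": ["時間", "震級"]},    # trigger -> arguments
    "Sentiment Extraction": "情感傾向[正向,負向]",          # classification-style target
}

def prompts_for(task):
    """Flatten a schema into the list of extraction targets the model is asked for."""
    schema = schemas[task]
    if isinstance(schema, str):
        return [schema]
    if isinstance(schema, dict):
        return [f"{subj}的{rel}" for subj, rels in schema.items() for rel in rels]
    return list(schema)
```

With one pretrained model, switching tasks is just switching schemas — which is why the fine-tuned downstream tasks in the diagram all hang off the same UIE node.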
### Text Extraction - Model Pretraining
- UTC (Universal Text Classification)
- essentially still based on USM under the hood
## Long Sequence issues and Solutions
### Issues arising from Long Sequences
- Model training is restricted by **GPU memory**
- length ~ over 10,000 tokens
- on a typical 16 GB GPU, around 2,000+ tokens is already close to the limit
- Paddle's default handling keeps only the first 2048 tokens and discards the rest => information missing
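The truncation behaviour described above amounts to a head-only slice (a minimal sketch; `MAX_LEN` is the 2048-token limit mentioned in the talk, and the token list is synthetic):

```python
MAX_LEN = 2048  # token limit mentioned in the talk

def truncate(tokens, max_len=MAX_LEN):
    """Keep only the first max_len tokens; everything after is silently lost."""
    return tokens[:max_len]

doc = [f"tok{i}" for i in range(10_000)]  # a ~10k-token "long text"
kept = truncate(doc)
dropped = len(doc) - len(kept)  # 7952 tokens discarded => information missing
```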
==Question: do long texts really affect these tasks?==
+ Information Extraction
+ can be handled by splitting the text into chunks
+ Text Classification
+ we do not know where the key content is, so chunking is not suitable
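The chunking approach for extraction can be sketched like this (a minimal illustration with a stand-in extractor; real systems would dedupe by span offsets rather than by value):

```python
def chunk(tokens, size=512, overlap=64):
    """Split a long token sequence into overlapping chunks."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def extract_over_chunks(tokens, extractor, size=512, overlap=64):
    """Run an extractor per chunk and merge the spans it finds."""
    found = []
    for c in chunk(tokens, size, overlap):
        for span in extractor(c):
            if span not in found:  # overlapping chunks may yield duplicates
                found.append(span)
    return found
```

This works for extraction because each target span is local to some chunk; classification has no such guarantee, which is why chunking is a poor fit there.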
### General Solutions to Long Texts
1. chunking
2. with or without recurrent mechanism
3. :thumbsup: (recommended by the presenter) shortening or summarization
a process of removing the noise
### Proposed Solution (UIE + UTC)
Shorten the text first, then run the task
**Stage 1**. Fine-tuning UIE
- Goal: find the statements related to the labels (the passages we are interested in), with the noise already removed (about a 60% boost in performance metrics; based on token indexes?)
**Stage 2**. UTC
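The two-stage idea can be sketched with stand-ins for the fine-tuned UIE (stage 1) and UTC (stage 2) models — the relevance test and classifier below are hypothetical placeholders, not the actual models:

```python
def stage1_shorten(text, is_relevant):
    """Stage 1 (UIE stand-in): keep only label-relevant sentences, dropping noise."""
    sentences = [s for s in text.split("。") if s]
    return "。".join(s for s in sentences if is_relevant(s))

def stage2_classify(short_text, classifier):
    """Stage 2 (UTC stand-in): classify the shortened, denoised text."""
    return classifier(short_text)

def pipeline(text, is_relevant, classifier):
    """Shorten first, then classify — the long text never reaches the classifier."""
    return stage2_classify(stage1_shorten(text, is_relevant), classifier)
```

The point of the design is that stage 2 only ever sees text that fits comfortably within the model's token limit.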
## Result and Conclusion
### Information Extraction
- First extract the information related to the labels
- Experiment Results
- the fine-tuned UIE performs better
- finding indexes is harder for GPT
| Model | Precision | Recall | F1 |
| -------- | -------- | -------- | ----- |
| Baseline (0 shot UIE) | 0.22 | 0.38 | 0.28 |
| UIE | 0.80 | 0.83 | 0.82 |
| GPT 3.5 | 0.26 | 0.06 | 0.10 |
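As a quick sanity check, the reported F1 scores are consistent with F1 = 2PR / (P + R) up to rounding (the table values are rounded to two decimals, so the recomputed F1 can differ by about 0.01):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) from the information-extraction table above
rows = {
    "Baseline (0 shot UIE)": (0.22, 0.38, 0.28),
    "UIE":                   (0.80, 0.83, 0.82),
    "GPT 3.5":               (0.26, 0.06, 0.10),
}
for name, (p, r, reported) in rows.items():
    assert abs(f1(p, r) - reported) <= 0.01, name  # consistent up to rounding
```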
### Text Classification
| Model | Precision | Recall | F1 |
| -------- | -------- | -------- | ----- |
| Baseline (ERNIE) | 0.40 | 0.46 | 0.43 |
| UIE + UTC| 0.87 | 0.89 | 0.88 |
| GPT 3.5 | 0.76 | 0.65 | 0.70 |
## Conclusion
- Few-shot **prompt learning** shows promise.
- Text shortening through **two stage modeling** could also be a viable solution.
- Speaker: [GitHub (vic4code)](https://github.com/vic4code)
## QA
+ Do different languages/scripts require additional training?
+ Yes
+ In the IE experiment results, how many classes were there? Was the data balanced?
+ It was checked; not shown due to time constraints
+ How long does a text have to be to count as a long text?
+ No clear standard; when it exceeds a certain limit
+ Slides link?
+ Will be provided later
+ Slides link: https://docs.google.com/presentation/d/1_wFt91_eYSJ2F_suyuHQIcXOYYtNeJk1/edit?usp=sharing&ouid=117277489792224340959&rtpof=true&sd=true
+ How should one choose between PaddleNLP and LangChain?
+ LangChain leans more toward prompt-based usage
+ Choose according to the task at hand
+ Can the UIE model do unsupervised extraction of keywords or specific data?
+ Yes
Below is the part where the speaker updated or corrected the talk/tutorial after the speech