
中文長文本語意理解 (Chinese Long-Text Semantic Understanding) - Victor

Welcome to PyCon TW 2023 Collaborative Writing

Collaborative Writing Workspace: https://hackmd.io/@pycontw/2023
On mobile, please tap the button above to unfold the agenda.

Collaborative writing starts from below


Framework

  • PaddleNLP: the largest framework for text processing in the Chinese-language domain

Introduction

Basic Usage

  1. information extraction: structuralization
  2. text classification

Purpose: converting unstructured text into structured data
e.g. document: a court judgment (法院判決書), extracting fields such as

  • court (法院)
  • plaintiff (原告)
  • loss of work income (工作損失)
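
Below is a minimal sketch of this kind of extraction using PaddleNLP's Taskflow API. The schema fields mirror the court-judgment example above, while the input sentence and the exact output format are illustrative assumptions, not material from the talk.

```python
# A minimal sketch, assuming PaddleNLP's Taskflow API (pip install paddlenlp).
from paddlenlp import Taskflow

# Fields we want to pull out of an unstructured court judgment.
schema = ["法院", "原告", "工作損失"]  # court, plaintiff, loss of work income
ie = Taskflow("information_extraction", schema=schema)

# Hypothetical sentence from a judgment; real documents are much longer.
text = "臺灣臺北地方法院民事判決,原告王小明主張工作損失新臺幣十二萬元。"
print(ie(text))  # a list of dicts with extracted spans, offsets, and probabilities
```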

Modeling







(Diagram) Text Input → information extraction (IE); Text Input → text classification (TC)











(Diagram) pretrained UIE → Named Entity Recognition (NER), Relation Extraction (RE), Event Detection (ED), Sentiment Extraction (SE); each of these feeds into more downstream tasks





Information Extraction - Model Pretraining

  • UIE (Universal Information Extraction, text-to-structure) pretrained model
  • treats information extraction as different tasks, including the following four text tasks:
    • Named Entity Recognition
    • Relation Extraction
    • Event Detection
    • Sentiment Extraction
  • USM (Universal Semantic Matching) pretrained model
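
As a rough illustration of the text-to-structure idea, the same pretrained UIE model can be pointed at these different tasks just by changing the schema prompt. The sketch below uses PaddleNLP's Taskflow; the schema strings are illustrative examples, not the speaker's actual prompts.

```python
# Sketch: one pretrained UIE model, different tasks via different schemas.
from paddlenlp import Taskflow

# Named entity recognition: a flat list of entity types.
ner = Taskflow("information_extraction", schema=["人物", "時間", "地點"])

# Relation extraction: a nested schema mapping a subject to its relations.
rel = Taskflow("information_extraction", schema={"原告": ["律師", "求償金額"]})

# Sentiment extraction: sentiment phrased as a classification-style prompt.
senti = Taskflow("information_extraction", schema="情感傾向[正向,負向]")
```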

Text Classification - Model Pretraining

  • UTC (Universal Text Classification)
    • its underlying essence is still USM
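
A hedged sketch of how UTC is typically used through PaddleNLP's zero-shot text classification Taskflow (available in recent PaddleNLP releases); the candidate labels and the input sentence below are made-up placeholders.

```python
# Sketch: zero-shot text classification backed by UTC.
from paddlenlp import Taskflow

cls = Taskflow(
    "zero_shot_text_classification",
    schema=["交通事故", "醫療糾紛", "契約糾紛"],  # hypothetical candidate labels
)
print(cls("原告因車禍受傷,向被告請求工作損失賠償。"))
```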

Long Sequence Issues and Solutions

Issues Arising from Long Sequences

  • Model training is restricted by GPU memory
    • sequence lengths can exceed ~10,000 tokens
    • on a typical 16 GB GPU, roughly 2,000+ tokens is already about the limit
  • Paddle's handling is to keep only the first 2,048 tokens and discard the rest => information missing

Question: do long texts really affect these tasks?

  • Information Extraction
    • can be handled by splitting the text into chunks
  • Text Classification
    • we do not know where the key content is, so chunking is not suitable

General Solutions to Long Texts

  1. chunking (a minimal sketch follows this list)
  2. with or without a recurrent mechanism
  3. (recommended by the presenter) shortening or summarization
    • a process of removing noise
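
A minimal sketch of option 1, chunking: split a long document into overlapping windows so each piece fits within the model's length limit. The max_len and stride values are illustrative, not from the talk.

```python
def chunk_text(text: str, max_len: int = 512, stride: int = 128):
    """Yield overlapping character-level chunks of `text`."""
    step = max_len - stride
    for start in range(0, len(text), step):
        yield text[start:start + max_len]
        if start + max_len >= len(text):
            break

# Each chunk can then be fed to the model separately (works for IE, but not
# for classification when we don't know where the key content is).
chunks = list(chunk_text("很長的判決書全文……" * 200))
```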

Proposed Solution (UIE + UTC)

First shorten the text, then perform the task.

Stage 1. Fine-tuning UIE

  • Goal: find the statements related to the labels (the passages we are interested in), with the noise already removed (about a 60% boost in performance metrics; based on token indexes?)

Stage 2. UTC
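
Stage 2 classifies the shortened text with UTC. The sketch below strings the two stages together with PaddleNLP's Taskflow; the checkpoint path, schema prompt, and candidate labels are placeholders, not the speaker's actual configuration.

```python
from paddlenlp import Taskflow

# Stage 1: a fine-tuned UIE model that extracts only label-relevant passages.
uie = Taskflow(
    "information_extraction",
    schema=["與標籤相關的敘述"],               # hypothetical schema prompt
    task_path="./checkpoint/uie_finetuned",    # hypothetical checkpoint path
)
# Stage 2: UTC (zero-shot text classification) on the shortened text.
utc = Taskflow(
    "zero_shot_text_classification",
    schema=["交通事故", "醫療糾紛", "契約糾紛"],  # hypothetical labels
)

def classify_long_text(text: str):
    spans = uie(text)[0].get("與標籤相關的敘述", [])
    shortened = "".join(span["text"] for span in spans)
    return utc(shortened or text)  # fall back to the full text if nothing matched
```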

Results and Conclusion

Information Extraction

  • First extract the information related to the labels
  • Experiment results
    • the fine-tuned UIE performs better
    • for GPT, locating the token indexes is harder
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Baseline (zero-shot UIE) | 0.22 | 0.38 | 0.28 |
| UIE | 0.80 | 0.83 | 0.82 |
| GPT-3.5 | 0.26 | 0.06 | 0.10 |

Text Classification

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Baseline (ERNIE) | 0.40 | 0.46 | 0.43 |
| UIE + UTC | 0.87 | 0.89 | 0.88 |
| GPT-3.5 | 0.76 | 0.65 | 0.70 |

Conclusion

  • Few-shot prompt learning shows promise.
  • Text shortening through two-stage modeling could also be a viable solution.
  • Speaker's GitHub: vic4code

QA

Below is the part where the speaker updated or corrected the slides after the talk.