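Building a Content Database for Small Communities and Individuals: An Automatic Labeling Approach Based on Fine-Tuned Pre-trained Language Models - James Chen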

Welcome to PyCon TW 2023 Collaborative Writing

Collaborative Writing Workspace: https://hackmd.io/@pycontw/2023
On mobile, tap the button at the top to expand the agenda list.

Collaborative notes start below

Slide: https://drive.google.com/file/d/14gKp-LD92dj2jW7DLZs9kcUJ67QOYpAw/view?usp=sharing

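Disclaimer

  1. This talk focuses on small, personalized content databases, not large commercial applications
  2. Pre-trained language models here means BERT-based models, not large language models such as ChatGPT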

Do we really need ChatGPT / LLMs?

  • Randomness can actually break the structure of the target model

Content Database

  • e.g. full-text newspaper databases, legal and regulatory databases
    Every group has this need

  • Purpose: Produce new content, build new discourses

  • Distinctive features:

    • Large amount of data collection
    • Multi-label classification in different dimensions
    • Wide variation in the needs
    • Needs by small advocacy/social movement organizations

Workflow of creating a content database

  • Summarization is important when processing long texts

Workflow diagram:

  • Source of Contents -> Collect -> Format -> Summarize
  • Summarize -> Save (Database)
  • Summarize -> Label -> Save (Database)
  • Save (Database) -> Data Visualization, Label, Search, Analyze

Labeling is complex

  • Multi-class
    Each image has exactly one label
  • Multi-label
    Each image can have several labels
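
The difference between the two settings shows up in how the target vector is encoded. The sketch below is only an illustration: the label set in `LABELS` is hypothetical and does not come from the talk.

```python
# Hypothetical label set, used only for illustration.
LABELS = ["politics", "economy", "environment"]


def one_hot(label):
    """Multi-class target: exactly one position is set to 1."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return [1 if name == label else 0 for name in LABELS]


def multi_hot(labels):
    """Multi-label target: any number of positions may be set to 1."""
    chosen = set(labels)
    unknown = chosen - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in chosen else 0 for name in LABELS]


multi_class_target = one_hot("economy")
multi_label_target = multi_hot(["politics", "environment"])
```

A multi-class model typically ends in a softmax over all labels, while a multi-label model scores each label independently.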

(Multi-instance)(?)

The hard part is not multi-label but hierarchical labels

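  • Challenge: hierarchical labels, e.g. "cloud" has many cloud types beneath it
  • Possible solutions:
    • Method 1: pass the result of the previous model down to the next sub-level
    • Method 2: sequence output (the speaker has not tried this)
    • Task reformulation
  • The approach the speaker adopted:
    • Convert the classification task into a textual entailment task
      • Turn a multiple-choice question into yes/no questions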
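
A minimal sketch of turning one multiple-choice question into yes/no questions. The hierarchy follows the cloud example from the notes, but the specific child labels, the `TEMPLATE` wording, and the function names are assumptions for illustration, not the speaker's actual code. Each (premise, hypothesis) pair would then go to a binary entailment classifier, for example a fine-tuned BERT-style model, which is not shown here.

```python
# Hypothetical two-level hierarchy based on the cloud example.
HIERARCHY = {
    "cloud": ["cumulus", "stratus", "cirrus"],
}

# Hypothetical hypothesis template; the wording is an assumption.
TEMPLATE = "This text is about {label}."


def flatten(hierarchy):
    """List every parent followed by its children, parent first."""
    names = []
    for parent, children in hierarchy.items():
        names.append(parent)
        names.extend(children)
    return names


def build_entailment_pairs(text, labels, template=TEMPLATE):
    """Create one (premise, hypothesis) pair per candidate label."""
    return [(text, template.format(label=label)) for label in labels]


text = "Low grey layers covered the sky all day."
pairs = build_entailment_pairs(text, flatten(HIERARCHY))
```

Because parents and children all become the same kind of yes/no question, a single model can cover every level of the hierarchy.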

Task reformulation

  • Pros
    • Simplify complex label structures while maintaining hierarchical relationships
    • Better performance (more obvious with smaller sample sizes)
    • Easier to optimize inference speed
  • Cons
    • Label names and prompts affect performance
    • Slower (mainly during training)
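
One possible reading of the inference-speed point, which the notes do not spell out, is that child labels only need to be checked when their parent label has been accepted. The sketch below assumes that reading. `keyword_stub` is a stand-in for a real entailment model, and the hierarchy is hypothetical.

```python
# Hypothetical hierarchy, used only for illustration.
HIERARCHY = {
    "cloud": ["cumulus", "stratus"],
    "rain": ["drizzle", "shower"],
}


def predict_labels(text, hierarchy, is_entailed):
    """Check each parent first; check its children only if the parent holds."""
    accepted = []
    calls = 0
    for parent, children in hierarchy.items():
        calls += 1
        if not is_entailed(text, parent):
            continue  # skip the whole subtree
        accepted.append(parent)
        for child in children:
            calls += 1
            if is_entailed(text, child):
                accepted.append(child)
    return accepted, calls


def keyword_stub(text, label):
    """Stand-in for a fine-tuned entailment model: plain substring match."""
    return label in text.lower()


labels, calls = predict_labels("Cumulus cloud over the harbour", HIERARCHY, keyword_stub)
```

With six labels in total, the example needs only four model calls because the "rain" subtree is skipped.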

Concept is the most important thing

  • Understanding of the task
  • The concept behind each label
  • The boundaries of each concept, e.g. edge cases

Human annotation:

  • Consistency is the most important thing when labeling manually, not who is right or wrong
  • The first labeled material should not be used
  • 2-3 rounds of review and comparison are necessary
  • Discussion and teamwork are at the heart of whether manual annotation succeeds or fails
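
The notes do not name a consistency metric. As an assumption, the sketch below uses Cohen's kappa, a common way to measure agreement between two annotators after correcting for chance. The annotation data is made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled at random
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations for four items.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Comparing kappa across review rounds is one way to check whether discussion is actually making the annotators more consistent.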

ChatGPT annotation:

  1. Embedding: convert each sentence into an embedding vector
  2. Filter: compute the cosine similarity between the target sentence and the other sentences
  3. Classification with a prompt
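
The three steps can be sketched as follows. The notes do not say how the filtered sentences are used, so this sketch assumes they serve as few-shot examples in the prompt. The vectors are toy values standing in for real embeddings, the prompt wording is hypothetical, and no model API is called.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(target_vec, candidates, top_k=2):
    """Keep the top_k (sentence, label, vector) candidates closest to the target."""
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(target_vec, item[2]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(target, examples):
    """Assemble a few-shot classification prompt (hypothetical wording)."""
    parts = ["Classify the last sentence. Answer with one label."]
    for sentence, label, _ in examples:
        parts.append(f"Sentence: {sentence}\nLabel: {label}")
    parts.append(f"Sentence: {target}\nLabel:")
    return "\n\n".join(parts)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
labeled = [
    ("Parliament passed the budget.", "politics", [0.9, 0.1, 0.0]),
    ("The river flooded the valley.", "environment", [0.0, 0.2, 0.9]),
    ("The minister resigned today.", "politics", [0.8, 0.3, 0.1]),
]
target_sentence = "The senate debated the new bill."
target_vector = [1.0, 0.2, 0.0]

neighbours = most_similar(target_vector, labeled, top_k=2)
prompt = build_prompt(target_sentence, neighbours)
```

The resulting `prompt` string would then be sent to the chat model, and its reply parsed as the label.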

Below are the parts of the talk/tutorial that the speaker updated or corrected after the talk