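Building a Content Database for Small Communities and Individuals: An Automatic Tagging Approach Based on Fine-Tuning Pre-trained Language Models - James Chen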

Welcome to the PyCon TW 2023 collaborative notes

Collaborative notes entry point: https://hackmd.io/@pycontw/2023
On mobile, tap the button above to expand the agenda list.

Collaborative notes start below

Slide: https://drive.google.com/file/d/14gKp-LD92dj2jW7DLZs9kcUJ67QOYpAw/view?usp=sharing

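Disclaimer

This talk focuses on small and personal content databases, not large-scale commercial applications
Pre-trained language models: BERT-based, not large language models such as ChatGPT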

Do we really need ChatGPT / LLMs?

Randomness can actually break the structure of the target model

Content Database

e.g. full-text newspaper databases, legal and regulatory databases
Every group has this need
Purpose: Produce new content, build new discourses
Characteristics:
- Large amount of data collection
- Multi-label classification in different dimensions
- Wide variation in the needs
- Needs by small advocacy/social movement organizations

Workflow for creating a content database

Summarization is important for handling long texts

Labeling is complex

Multi-class
Each image has exactly one label
Multi-label
Each image can have multiple labels
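
To make the distinction concrete, here is a minimal sketch of how the targets are commonly encoded in each setting. The label names are hypothetical and chosen only for illustration; they are not from the talk.

```python
# Hypothetical label set, used only for illustration.
LABELS = ["cloud", "sun", "bird"]

def one_hot(label):
    """Multi-class target: exactly one position is 1."""
    return [1 if name == label else 0 for name in LABELS]

def multi_hot(labels):
    """Multi-label target: any number of positions may be 1."""
    chosen = set(labels)
    return [1 if name in chosen else 0 for name in LABELS]

multi_class_target = one_hot("sun")
multi_label_target = multi_hot(["cloud", "bird"])
print(multi_class_target)  # [0, 1, 0]
print(multi_label_target)  # [1, 0, 1]
```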

(Multi-instance)(?)

The difficulty lies not in multi-label classification but in hierarchical labels

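Challenge: hierarchical labels, e.g. under "cloud" there are the various cloud types
Possible Solutions:
- Approach 1: pass the previous model's result down to the next sub-level
- Approach 2: sequence output (the speaker has not tried this)
- Task reformulation
The approach the speaker adopted:
- Convert the classification task into a textual entailment task
  - Turn a multiple-choice question into true/false questions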
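
The reformulation above can be sketched as follows: each (text, candidate label) pair becomes one yes/no entailment example, with the text as the premise and a sentence derived from the label as the hypothesis. The hierarchy, the prompt template, and the function names are assumptions for illustration, not the speaker's actual setup.

```python
# Hypothetical hierarchy: parent label -> child labels.
HIERARCHY = {
    "cloud": ["cumulus", "stratus", "cirrus"],
}

# Hypothetical prompt template; the wording is an assumption.
TEMPLATE = "This text is about {label}."

def all_labels(hierarchy):
    """Flatten the hierarchy: each parent followed by its children."""
    labels = []
    for parent, children in hierarchy.items():
        labels.append(parent)
        labels.extend(children)
    return labels

def to_entailment_examples(text, gold_labels, hierarchy=HIERARCHY):
    """Turn one multi-label sample into one binary example per candidate label."""
    gold = set(gold_labels)
    return [
        {
            "premise": text,
            "hypothesis": TEMPLATE.format(label=label),
            "entailed": label in gold,
        }
        for label in all_labels(hierarchy)
    ]

examples = to_entailment_examples(
    "Puffy white heaps drifted over the harbor.",
    gold_labels=["cloud", "cumulus"],
)
for ex in examples:
    print(ex["hypothesis"], "->", ex["entailed"])
```

Each resulting pair could then be passed to a BERT-based sentence-pair classifier for fine-tuning; that step is not shown here.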

Task reformulation

Pros
- Simplify complex label structures while maintaining hierarchical relationships
- Better performance (more obvious with smaller sample sizes)
- Easier to optimize inference speed
Cons
- Label names and prompts affect performance
- Slower (mainly during training)
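
One possible reading of the inference-speed point is that, once every label is a yes/no question, child labels only need to be checked when their parent is judged entailed. The sketch below assumes that reading; the talk notes do not confirm it. `is_entailed` is a hypothetical keyword-matching stub standing in for a fine-tuned model, and the hierarchy is invented.

```python
# Hypothetical hierarchy: parent label -> child labels.
HIERARCHY = {
    "cloud": ["cumulus", "stratus"],
    "rain": ["drizzle", "downpour"],
}

def is_entailed(text, label):
    """Stub standing in for a fine-tuned entailment model (keyword match only)."""
    return label in text.lower()

def predict(text, hierarchy=HIERARCHY, judge=is_entailed):
    """Check children only when their parent is entailed; count judge calls."""
    predicted, calls = [], 0
    for parent, children in hierarchy.items():
        calls += 1
        if not judge(text, parent):
            continue  # skip the whole subtree
        predicted.append(parent)
        for child in children:
            calls += 1
            if judge(text, child):
                predicted.append(child)
    return predicted, calls

labels, calls = predict("A cumulus cloud hung over the valley.")
print(labels, calls)
```

In this toy run the "rain" subtree is skipped, so the judge is called 4 times instead of the 6 a flat pass over all labels would need.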

Concept is the most important thing

Understanding of the task
The concept behind each label
The boundaries of each concept, e.g. edge cases

Human annotation:

Consistency is the most important thing when labeling manually, not who is right or wrong
The first batch of labeled material should not be used
2-3 rounds of review and comparison are necessary
Discussion and teamwork are at the heart of whether manual annotation succeeds or fails
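
As an illustration of the review-and-comparison rounds, the sketch below compares two annotators' label sets and lists the items that need discussion. The data and the function name are hypothetical, and exact-match agreement is only one simple way to measure consistency; the talk did not specify a metric.

```python
def compare_annotations(annotator_a, annotator_b):
    """Return the exact-agreement rate and the item ids that need discussion.

    Each argument maps an item id to the set of labels that annotator assigned.
    Only items labeled by both annotators are compared.
    """
    shared = sorted(annotator_a.keys() & annotator_b.keys())
    disagreements = [i for i in shared if annotator_a[i] != annotator_b[i]]
    rate = 1 - len(disagreements) / len(shared) if shared else 0.0
    return rate, disagreements

# Hypothetical annotations for three documents.
round_a = {"doc1": {"cloud"}, "doc2": {"cloud", "cumulus"}, "doc3": set()}
round_b = {"doc1": {"cloud"}, "doc2": {"cloud"}, "doc3": set()}

rate, to_discuss = compare_annotations(round_a, round_b)
print(f"agreement: {rate:.2f}, discuss: {to_discuss}")
```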

ChatGPT annotation:

Embedding: convert each sentence into an embedding (a vector)
Filter: compute the cosine similarity between the target sentence and the other sentences
Classification with a prompt
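
The three steps above can be sketched as follows. The toy 3-dimensional vectors stand in for real sentence embeddings, and the similarity threshold and prompt wording are assumptions; the notes do not record which embedding model, threshold, or prompt the speaker used. No API call is made here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def filter_similar(target_vec, candidates, threshold=0.8):
    """Keep candidate sentences whose embedding is close to the target's."""
    return [
        sentence
        for sentence, vec in candidates
        if cosine_similarity(target_vec, vec) >= threshold
    ]

def build_prompt(sentence, label):
    """Hypothetical yes/no classification prompt."""
    return f'Does the following sentence concern "{label}"? Answer yes or no.\n{sentence}'

# Toy vectors standing in for real sentence embeddings.
target = [1.0, 0.0, 0.0]
candidates = [
    ("Clouds gathered at noon.", [0.9, 0.1, 0.0]),
    ("The budget was approved.", [0.0, 1.0, 0.0]),
]
kept = filter_similar(target, candidates)
prompts = [build_prompt(s, "cloud") for s in kept]
print(kept)
```

Only the sentences that pass the filter are turned into prompts, which keeps the number of prompts sent to the model small.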

Below are the parts of the slides that the speaker updated or corrected after the talk