---
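title: "Building a Content Database for Small Communities/Individuals: An Automatic Labeling Approach Based on Fine-Tuned Pretrained Language Models - James Chen"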
tags: PyConTW2023, 2023-organize, 2023-共筆
---
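# Building a Content Database for Small Communities/Individuals: An Automatic Labeling Approach Based on Fine-Tuned Pretrained Language Models - James Chen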
{%hackmd H6-2BguNT8iE7ZUrnoG1Tg %}
<iframe src=https://app.sli.do/event/bY5QDzZ1YbKwzAD5YgNqhk height=450 width=100%></iframe>
> Collaborative notes start below
Slide: https://drive.google.com/file/d/14gKp-LD92dj2jW7DLZs9kcUJ67QOYpAw/view?usp=sharing
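### Disclaimer
1. This talk focuses on small and personal content databases, not large commercial applications
2. "Pretrained language model" here means BERT-based models, not large language models such as ChatGPT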
### Do we really need ChatGPT / LLMs?
- Randomness can actually undermine the structure of the target model
![Transformer timeline. On the vertical axis, number of parameters. Colors describe Transformer family](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*oC7wTQanpEw9mUizTZeuzw.png)
### Content Database
- e.g. full-text newspaper databases, legal and regulatory databases; every group has a need for one
- Purpose: **Produce new content, build new discourses**
- Special characteristics:
    - Large amount of data collection
    - **Multi-label classification in different dimensions**
    - Wide variation in needs
- Needs of small advocacy/social movement organizations
### Workflow of creating a content database
![](https://hackmd.io/_uploads/r1tCyoZA2.jpg)
- Summarization is important when handling long texts
```graphviz
digraph {
node[shape="box"]
SC[label="Source of Contents"]
Save[label="Save (Database)"]
DV[label="Data Visualization"]
SC -> Collect -> Format -> Summarize
Summarize -> Label, Save
Label -> Save
Save -> Label
Save -> Search, DV, Analyze
}
```
### Labeling is complex
- Multi-class: each image has exactly one label
- Multi-label: each image can have multiple labels
- (Multi-instance?)
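
The difference between the two settings shows up in how the target is encoded. The sketch below is only an illustration: the label names are hypothetical and do not come from the talk.

```python
# Hypothetical label set, used only for illustration.
LABELS = ["politics", "environment", "labor"]


def one_hot(label: str) -> list[int]:
    """Multi-class target: exactly one label applies, so exactly one 1."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return [1 if name == label else 0 for name in LABELS]


def multi_hot(labels: set[str]) -> list[int]:
    """Multi-label target: any number of labels may apply."""
    unknown = labels - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in LABELS]


multi_class_target = one_hot("environment")
multi_label_target = multi_hot({"environment", "labor"})
print(multi_class_target)  # [0, 1, 0]
print(multi_label_target)  # [0, 1, 1]
```
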
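
The hard part is not multi-label classification itself but hierarchical labels.
- Challenge: hierarchical labels, e.g. "cloud" has the various cloud types underneath it
- Possible solutions:
    - Approach 1: pass the result of the previous model down to the next sub-level
    - Approach 2: sequence output (the speaker has not tried this)
    - Task reformulation
- The approach the speaker adopted:
    - Reformulate the classification task as a textual entailment task
    - Turn multiple-choice questions into yes/no questions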
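
A minimal sketch of this kind of reformulation, not the speaker's actual code: each (text, label) pair becomes one yes/no entailment example. The label hierarchy, the `parent / child` naming scheme, and the hypothesis template are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical hierarchy: parent label -> child labels.
HIERARCHY = {
    "cloud": ["cumulus", "stratus"],
    "rain": [],
}

# Hypothetical hypothesis template (the wording is an assumption).
TEMPLATE = "This text is about {label}."


@dataclass
class EntailmentExample:
    premise: str     # the document or its summary
    hypothesis: str  # a yes/no statement about one label
    entailed: bool   # True if the label applies to the text


def all_labels(hierarchy: dict[str, list[str]]) -> list[str]:
    """Flatten the hierarchy; child names keep their parent as a prefix."""
    labels = []
    for parent, children in hierarchy.items():
        labels.append(parent)
        labels.extend(f"{parent} / {child}" for child in children)
    return labels


def to_entailment_examples(text: str, gold: set[str]) -> list[EntailmentExample]:
    """One multiple-choice item becomes one yes/no item per label."""
    return [
        EntailmentExample(text, TEMPLATE.format(label=label), label in gold)
        for label in all_labels(HIERARCHY)
    ]


examples = to_entailment_examples(
    "Puffy cumulus clouds drifted over the city.",
    {"cloud", "cloud / cumulus"},
)
for ex in examples:
    print(ex.hypothesis, "->", ex.entailed)
```

Because the parent name is carried inside each child's hypothesis, the hierarchy survives even though every example is a flat yes/no question.
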
#### Task reformulation
- Pros
    - Simplifies complex label structures while maintaining hierarchical relationships
    - Better performance (more noticeable with smaller sample sizes)
    - Inference speed is easier to optimize
- Cons
    - Label names and prompts affect performance
    - Slower (mainly during training)
### Concept is the most important thing
- Understanding of the task
- The concept behind each label
- The boundaries of each concept, e.g. edge cases
#### Human annotation:
- **Consistency** is the most important thing when labeling manually, not who is right or wrong
- The first batch of labeled material **should not** be used
- 2-3 rounds of **review** and comparison are necessary
- **Discussion** and **teamwork** are at the heart of the success or failure of manual annotation
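
One way to put a number on consistency between review rounds is an agreement score between two annotators. The notes do not say which measure, if any, the speaker used. Cohen's kappa below is an assumption, chosen because it is a common choice, and the annotation data is made up.

```python
from collections import Counter


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty items")
    n = len(a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected if each annotator labeled at random
    # with their own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels from two annotators on the same four items.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```
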
#### ChatGPT annotation:
1. Embedding: convert each sentence into an embedding vector
2. Filter: compute the cosine similarity between the target sentence and the other sentences
3. Classification with a prompt
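
The three steps can be sketched roughly as follows. This is an illustration under assumptions, not the speaker's pipeline. The 2-dimensional vectors stand in for real embeddings, the 0.8 threshold is arbitrary, and using the filtered sentences as few-shot examples in the prompt is a guess at how the filter feeds the classification step. No model is called; the code only builds the prompt text.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_similar(target_vec, pool, threshold=0.8):
    """Keep (sentence, label) pairs whose vector is close to the target."""
    scored = [
        (cosine_similarity(target_vec, vec), sentence, label)
        for sentence, vec, label in pool
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(s, label) for score, s, label in scored if score >= threshold]


def build_prompt(target: str, examples, labels: list[str]) -> str:
    """Assemble a few-shot classification prompt as plain text."""
    lines = [f"Classify the sentence into one of: {', '.join(labels)}.", ""]
    for sentence, label in examples:
        lines += [f"Sentence: {sentence}", f"Label: {label}", ""]
    lines += [f"Sentence: {target}", "Label:"]
    return "\n".join(lines)


# Toy data: hypothetical sentences, vectors, and labels.
pool = [
    ("Workers went on strike over wages.", [0.9, 0.1], "labor"),
    ("The river was polluted by the factory.", [0.0, 1.0], "environment"),
]
target_sentence = "The union demanded higher pay."
target_vec = [1.0, 0.0]

neighbors = filter_similar(target_vec, pool)
prompt = build_prompt(target_sentence, neighbors, ["labor", "environment"])
print(prompt)
```
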
Below are the speaker's updates or corrections to the slides after the talk