---
title: "Building a Content Database for Small Communities and Individuals: An Auto-Labeling Approach Based on Fine-Tuned Pre-trained Language Models - James Chen"
tags: PyConTW2023, 2023-organize, 2023-共筆
---

# Building a Content Database for Small Communities and Individuals: An Auto-Labeling Approach Based on Fine-Tuned Pre-trained Language Models - James Chen

{%hackmd H6-2BguNT8iE7ZUrnoG1Tg %}
<iframe src=https://app.sli.do/event/bY5QDzZ1YbKwzAD5YgNqhk height=450 width=100%></iframe>

> Collaborative writing starts below

Slide: https://drive.google.com/file/d/14gKp-LD92dj2jW7DLZs9kcUJ67QOYpAw/view?usp=sharing

### disclaimer

1. This talk focuses on small-scale and personalized content databases, not large commercial applications
2. Pre-trained language models here means BERT-based models, not large language models, e.g. ChatGPT

### Do we really need ChatGPT / LLM

- Randomness can actually break the structure of the target model

![Transformer timeline. On the vertical axis, number of parameters. Colors describe Transformer family](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*oC7wTQanpEw9mUizTZeuzw.png)

### Content Database

- e.g. full-text newspaper databases, legal and regulatory databases; every group has this need
- Purpose: **Produce new content, build new discourses**
- Special characteristics:
    - Large amount of data collection
    - **Multi-label classification in different dimensions**
    - Wide variation in the needs
    - Needs of small advocacy/social movement organizations

### workflow of creating content database

![](https://hackmd.io/_uploads/r1tCyoZA2.jpg)

- Summarization is important when handling long texts

```graphviz
digraph {
    node[shape="r"]
    SC[label="Source of Contents"]
    Save[label="Save (Database)"]
    DV[label="Data Visualization"]
    SC -> Collect -> Format -> Summarize
    Summarize -> Label, Save
    Label -> Save
    Save -> Label
    Save -> Search, DV, Analyze
}
```

### Labeling is complex

- Multi-class: each image has exactly one label
- Multi-label: an image can have several labels (Multi-instance)(?)

The hard part is not multi-label but hierarchical labels.

- Challenge: hierarchical labels, e.g. under "cloud" there are the various types of clouds
- Possible Solutions:
    - Method 1: pass the result of the upper-level model down to the next sub-level
    - Method 2: sequence output (the speaker has not tried this)
    - Task reformulation
- The approach the speaker took:
    - Turn the classification task into a textual entailment task
    - Turn one multiple-choice question into several yes/no questions

#### Task reformulation

- Pros
    - Simplifies complex label structures while maintaining hierarchical relationships
    - Better performance (more obvious with smaller sample sizes)
    - Easier to optimize inference speed
- Cons
    - Label names and prompts affect performance
    - Slower (mainly during training)

### Concept is the most important thing

- Understanding of the task
- The concept behind each label
- The boundaries of each concept, e.g. edge cases

#### Human annotation:

- **Consistency** is the most important thing when labeling manually, not who is right or wrong
- The first batch of labeled material **should not** be used
- 2-3 rounds of **review** and comparison are necessary
- **Discussion** and **Teamwork** are at the heart of the success or failure of manual annotation

#### chatGPT annotation:

1. Embedding: convert each sentence into an embedding vector
2. Filter: compute the cosine similarity between the target sentence and the other sentences
3. Classification with prompt

Below is the part the speaker updated or corrected in the slides after the talk
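
To make the task reformulation idea from the labeling section more concrete (classification turned into textual entailment, one multiple-choice question turned into several yes/no questions), here is a minimal sketch that is not taken from the slides. The label tree, the prompt template in `make_hypothesis`, and the `keyword_entails` function are all hypothetical; `keyword_entails` is only a keyword-matching stand-in for what would be a fine-tuned BERT-based entailment classifier.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical two-level label tree; the label names are illustrative only.
LABEL_TREE: Dict[str, List[str]] = {
    "cloud": ["cumulus", "stratus"],
    "policy": ["labor", "housing"],
}

PREFIX = "This text is about "


def make_hypothesis(label: str) -> str:
    """Turn a label name into a yes/no statement (assumed prompt template)."""
    return f"{PREFIX}{label}."


def build_pairs(text: str, labels: List[str]) -> List[Tuple[str, str]]:
    """Build one (premise, hypothesis) pair per candidate label."""
    return [(text, make_hypothesis(label)) for label in labels]


def predict_labels(
    text: str,
    entails: Callable[[str, str], bool],
    tree: Dict[str, List[str]] = LABEL_TREE,
) -> List[str]:
    """Top-down inference: child labels are only checked if the parent is entailed."""
    accepted: List[str] = []
    for parent, children in tree.items():
        if not entails(text, make_hypothesis(parent)):
            continue  # skip the whole subtree, which saves inference calls
        accepted.append(parent)
        for child in children:
            if entails(text, make_hypothesis(child)):
                accepted.append(f"{parent}/{child}")
    return accepted


def keyword_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for a fine-tuned entailment model: plain keyword matching."""
    label = hypothesis[len(PREFIX):-1]
    return label in premise.lower()


if __name__ == "__main__":
    print(build_pairs("Cumulus is a puffy cloud.", LABEL_TREE["cloud"]))
    print(predict_labels("Cumulus is a puffy cloud.", keyword_entails))
```

Skipping a subtree when its parent is rejected is one possible way to get the "easier to optimize inference speed" benefit listed under Pros. The sketch also shows why label names and prompt wording matter: the hypothesis text is the only thing the classifier sees about a label.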
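
The three chatGPT annotation steps listed above (embedding, filter, classification with prompt) could be wired together roughly as follows. This sketch is not from the slides and rests on several assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the filter is read as "keep the labeled sentences most similar to the target and use them as prompt examples", and the threshold, `top_k`, and prompt wording are hypothetical. No LLM API is called; the last step only builds the prompt text.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(sentence: str) -> Dict[str, float]:
    """Stand-in embedding: word counts. A real setup would call an embedding model."""
    return {word: float(n) for word, n in Counter(sentence.lower().split()).items()}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_similar(
    target: str,
    labeled: List[Tuple[str, str]],
    threshold: float = 0.3,  # assumed cut-off, to be tuned on real data
    top_k: int = 3,          # assumed number of examples to keep
) -> List[Tuple[float, str, str]]:
    """Keep the labeled sentences most similar to the target, best first."""
    target_vec = toy_embed(target)
    scored = [
        (cosine_similarity(target_vec, toy_embed(sentence)), sentence, label)
        for sentence, label in labeled
    ]
    kept = [item for item in scored if item[0] >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:top_k]


def build_prompt(target: str, examples: List[Tuple[float, str, str]]) -> str:
    """Assemble a classification prompt from the filtered examples."""
    lines = ["Label the last sentence using the examples as a guide."]
    for _, sentence, label in examples:
        lines.append(f"Sentence: {sentence}\nLabel: {label}")
    lines.append(f"Sentence: {target}\nLabel:")
    return "\n\n".join(lines)


if __name__ == "__main__":
    labeled = [
        ("the union demands higher wages", "labor"),
        ("rent is rising in the city", "housing"),
    ]
    target = "workers demand higher wages"
    print(build_prompt(target, filter_similar(target, labeled)))
```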