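Building a Content Database for Small Communities and Individuals: An Automatic Labeling Approach Based on Fine-Tuned Pretrained Language Models - James Chen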
Welcome to PyCon TW 2023 Collaborative Writing
Collaborative Writing Workplace: https://hackmd.io/@pycontw/2023
On mobile, tap the button above to expand the agenda list.
Collaborative writing starts below
Slide: https://drive.google.com/file/d/14gKp-LD92dj2jW7DLZs9kcUJ67QOYpAw/view?usp=sharing
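Disclaimer
- This talk focuses on small, personalized content databases rather than large commercial applications
- Pretrained language models: BERT-based, rather than large language models, e.g. ChatGPT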
Do we really need ChatGPT / LLM?
Content Database
Workflow for creating a content database
Labeling is complex
- Multi-class
Each image has exactly one label
- Multi-label
Each image can have multiple labels
(Multi-instance)(?)
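To make the difference between the two settings concrete, here is a minimal sketch of how the targets are usually encoded. The label vocabulary `LABELS` and both function names are hypothetical and only serve the illustration; they are not taken from the talk.

```python
from typing import List, Set

# Hypothetical label vocabulary, used only for illustration.
LABELS = ["cloud", "mountain", "sea"]


def encode_multiclass(label: str) -> List[int]:
    """Multi-class target: a one-hot vector, exactly one position is 1."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return [1 if name == label else 0 for name in LABELS]


def encode_multilabel(labels: Set[str]) -> List[int]:
    """Multi-label target: a multi-hot vector, any number of positions may be 1."""
    unknown = labels - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in LABELS]


print(encode_multiclass("cloud"))           # [1, 0, 0]
print(encode_multilabel({"cloud", "sea"}))  # [1, 0, 1]
```

In practice a multi-class model typically ends in a softmax over the labels, while a multi-label model typically uses one independent sigmoid per label.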
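The hard part is not multi-label, but hierarchical labels
- Challenge: labels are hierarchical, e.g. under "cloud" there are further labels for each type of cloud
- Possible Solutions:
- Approach 1: pass the result of the previous model down to the next sub-level
- Approach 2: sequence output (the speaker has not tried this)
- Task reformulation
- The approach the speaker adopted: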
- Pros
- Simplify complex label structures while maintaining hierarchical relationships
- Better performance (more obvious with smaller sample sizes)
- Easier to optimize inference speed
- Cons
- Label names and prompts affect performance
- Slower (mainly during training)
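Approach 1 in the list above (passing the previous model's result down to the next sub-level) can be sketched as a small top-down cascade. This is only an illustration under stated assumptions: the hierarchy, the `sky` label, and the keyword-based stand-in "models" are all hypothetical, and real classifiers would be fine-tuned models rather than string checks. It is not the speaker's adopted method.

```python
from typing import Callable, Dict, List

Classifier = Callable[[str], str]

# Hypothetical two-level hierarchy: parent label -> child labels.
HIERARCHY: Dict[str, List[str]] = {
    "cloud": ["cumulus", "stratus", "cirrus"],
    "sky": [],
}


def top_level_model(text: str) -> str:
    """Stand-in for a trained top-level classifier (keyword rule only)."""
    return "cloud" if "cloud" in text.lower() else "sky"


def cloud_model(text: str) -> str:
    """Stand-in for a classifier trained only on the children of 'cloud'."""
    lowered = text.lower()
    for child in HIERARCHY["cloud"]:
        if child in lowered:
            return child
    return HIERARCHY["cloud"][0]  # arbitrary fallback for this sketch


# One child-level classifier per parent label that has children.
CHILD_MODELS: Dict[str, Classifier] = {"cloud": cloud_model}


def predict_path(text: str) -> List[str]:
    """Predict the parent first, then let that result select the child model."""
    parent = top_level_model(text)
    path = [parent]
    child_model = CHILD_MODELS.get(parent)
    if child_model is not None:
        path.append(child_model(text))
    return path


print(predict_path("A thin cirrus cloud at sunset"))  # ['cloud', 'cirrus']
print(predict_path("A clear blue evening"))           # ['sky']
```

One known trade-off of such a cascade is that a mistake at the parent level cannot be corrected further down, since the child model is chosen from the parent's prediction.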
Concept is the most important thing
- Understanding of the task
- The concept behind each label
- The boundaries of each concept, e.g. edge cases
Human annotation:
- Consistency is the most important thing when labeling manually, not who is right or wrong
- The first labeled material should not be used
- 2-3 rounds of review and comparison are necessary
- Discussion and teamwork are at the heart of whether manual annotation succeeds or fails
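The notes do not say how consistency was measured. One common way to quantify it between two annotators is raw agreement together with Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes two annotators labeling the same items with a single label each; the function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which both annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item list")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    observed = percent_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["cloud", "cloud", "sea", "sea"]
annotator_2 = ["cloud", "sea", "sea", "sea"]
print(percent_agreement(annotator_1, annotator_2))  # 0.75
print(cohen_kappa(annotator_1, annotator_2))        # 0.5
```

Tracking such a number across the 2-3 review rounds gives a rough signal of whether the annotators' shared understanding of the labels is converging.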
ChatGPT annotation:
- Embedding: convert each sentence into an embedding vector
- Filter: compute the cosine similarity between the target sentence and the other sentences
- Classification with a prompt
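The three steps above can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the toy 2-dimensional vectors stand in for real sentence embeddings, the 0.8 threshold is arbitrary, the notes do not say whether the filter keeps or discards similar sentences (this sketch keeps them), and the prompt wording is hypothetical. No call to an actual ChatGPT API is made.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Step 1 (Embedding): toy vectors standing in for a real embedding model.
EMBEDDINGS: Dict[str, List[float]] = {
    "target sentence": [1.0, 0.0],
    "similar sentence": [0.9, 0.1],
    "unrelated sentence": [0.0, 1.0],
}


def filter_similar(target: str, candidates: Sequence[str],
                   threshold: float = 0.8) -> List[str]:
    """Step 2 (Filter): keep candidates whose similarity to the target is high."""
    target_vec = EMBEDDINGS[target]
    return [s for s in candidates
            if cosine_similarity(target_vec, EMBEDDINGS[s]) >= threshold]


def build_prompt(sentence: str, labels: Sequence[str]) -> str:
    """Step 3 (Classification): hypothetical prompt text to send to a chat model."""
    options = ", ".join(labels)
    return (f"Classify the sentence into one of: {options}.\n"
            f"Sentence: {sentence}\nLabel:")


kept = filter_similar("target sentence", ["similar sentence", "unrelated sentence"])
print(kept)  # ['similar sentence']
print(build_prompt(kept[0], ["cloud", "sea"]))
```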
Below is the part where the speaker updated or corrected the talk/tutorial slides after the speech