# <span class="orange">Streambench Final Project</span>

2024 ADL Final Project
[StreamBench: Final Project Slides](https://docs.google.com/presentation/d/1sG7cwG52AIDTf736F6IjFEDEQ59vzbt_JJlNMMFRUVA/edit#slide=id.g2febf2ddb21_0_59)

## Deadlines

1. Friday-Saturday: put the report on HackMD
2. Sunday: slides
3. Monday: record the video

## Report Contents

### 邱偉誠, WEI-CHENG CHIU, M11315038, m11315038@mail.ntust.edu.tw

#### Experiments

**Double-model voting agents** (e.g., gemma-9b + qwen-7b): both models answer every question, and weights adjusted from the running accuracy decide which model's answer to use. We found the two models' answers were nearly identical, presumably because their parameter counts are so close that there is little behavioral difference between them. I also reproduced the multi-agent setup from the StreamBench paper at the very start; GPT and Claude can be combined into a complementary multi-agent system probably because their parameter counts are large enough for emergent abilities, and their training data differs enough for the models to complement each other.

#### Classification Approach

Starting from the provided DDXPlus public dataset, we had Perplexity crawl the symptoms of the 49 diseases, then used Claude to organize them into a "super prompt" cheat sheet.

#### Work Distribution

1. Built the main structure of the code
2. First to propose the super-prompting method, which worked on the classification task (61% -> 69%)
3. Full report composing

### 趙亦天, CHAO I-TIEN, R12944037, r12944037@ntu.edu.tw

#### Related Work

**Text-to-SQL at Pinterest Engineering**

Pinterest's data engineers routinely write SQL queries to analyze user behavior, content trends, and system performance. To turn complex analytical questions into efficient SQL, Pinterest used LLMs to build a Text-to-SQL feature integrated into Querybook, its open-source big-data SQL tool. The feature lets users enter a natural-language question and converts it into a SQL query automatically.

The approach consists of the following steps:

1. **User Input and Table Selection**: the user poses an analytical question and selects the tables relevant to it.
2. **Schema Retrieval**: table schema information is fetched from the metadata store, including table names, descriptions, column names, column types, and column descriptions.
3. **Prompt Construction**: the natural-language question and the table schema information are combined into a prompt formatted for the LLM.
4. **Model Inference**: the LLM processes the prompt and generates SQL.
5. **Streaming Results**: the generated SQL is shown to the user in real time.

#### SQL Generation Approach

#### Prompt Design

We introduced a **Chain-of-Thought** method into the prompt, guiding the model to generate the SQL query in specified steps. The prompt analyzes the question in two stages:

- **Stage 1: find the relevant tables and columns**
    - The model first analyzes the input user query.
    - From the provided table schema and schema information, it identifies the tables and columns relevant to the question and extracts their information.
    - The tables and columns to be used must be output in the following format:
      ```
      OUTPUT FORMAT:
      Table1: Column1, Column2
      Table2: Column3, Column4
      Table3: Column5, etc.
      ```
- **Stage 2: generate the SQL query**
    - Determine the question type of the user query, e.g., filters, conditions, aggregations, relationships.
    - Based on the analysis above, generate SQL code that answers the user query.

---

#### Vector Storage

We classify each correctly answered SQL query into a question type according to the keywords it contains ("JOIN", "GROUP BY", "WHERE", "ORDER BY", "DISTINCT", etc.) and store the queries in vector memory by category.

When the system later faces a question of the same type, it retrieves previously successful SQL queries of that type from the vector memory as references, improving both efficiency and accuracy.

---

#### Table Schema Description/Information

We use the table schema descriptions provided in the BIRD dataset to explain the relationships between tables and columns. This ensures the model understands table-column relations more completely, improving the accuracy and relevance of its answers. The procedure (sketched in code after this list) is:

1. Fetch the schema information of every table.
2. Match each table's name against the table names in the schema descriptions; if the names match, include that table's schema description as related information.
3. Provide this related information to the model as reference material for generating the answer.
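The matching in step 2 is simple name alignment. A minimal sketch of it in Python follows; the data shapes (`schemas` mapping table name to column list, `descriptions` mapping table name to description text) and the function name are illustrative assumptions, not the exact structures in our codebase.

```python
# Sketch of step 2: attach a BIRD-style schema description to each table
# the query needs. `schemas` and `descriptions` are assumed dict layouts.

def collect_schema_context(used_tables, schemas, descriptions):
    """Build the schema reference text passed to the model."""
    parts = []
    for table in used_tables:
        if table not in schemas:
            continue  # the query referenced a table we do not know; skip it
        lines = [f"Table: {table}",
                 "Columns: " + ", ".join(schemas[table])]
        # Case-insensitive name match between schema and description metadata.
        desc = next((text for name, text in descriptions.items()
                     if name.lower() == table.lower()), None)
        if desc is not None:
            lines.append("Description: " + desc)
        parts.append("\n".join(lines))
    return "\n\n".join(parts)
```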
---

#### Cheat Sheet

We built a cheat sheet around common SQL keywords and embedded it in the prompt as reference material. For each SQL keyword it holds a description plus the words commonly associated with that keyword, helping the model identify and use the right keyword when processing a question, so it understands in which situations each keyword applies.

```
"SELECT": {
    "description": "Used to specify the columns to retrieve from one or more tables.",
    "related_words": ["list", "name", "retrieve", "describe", "show", "state", "what is", "identify", "provide", "give"]
},
"FROM": {
    "description": "Indicates the source table(s) for the query.",
    "related_words": ["source", "table", "dataset", "database", "origin"]
},
"WHERE": {
    "description": "Filters rows based on specified conditions.",
    "related_words": ["condition", "filter", "restrict", "match", "refers to", "when", "which", "with", "that", "in"]
}
```

#### Experiments

**Multi-agent experiment**: split the task between two LLMs. Our records and analysis:

- **First attempt**: Agent 1 rewrote the user query more clearly, based on the table schema information, and added the columns likely to be needed; Agent 2 then generated the SQL from the rewritten query. This division of labor beat the baseline by only 1%. We believe Agent 1's job was too complex, so the split never paid off and instead added processing complexity.
- **Second attempt**: Agent 1 only selected the tables and columns to use, based on the table schema information; Agent 2 generated the SQL query from Agent 1's selection. This was a clear improvement over the first attempt, raising accuracy by 3% to 29.44%.
- **Third attempt**: Agent 1 kept the same role of selecting the relevant tables and columns. Agent 2, when generating the SQL, additionally consulted a cheat sheet of SQL keywords intended to teach it how to write SQL correctly. The result was worse than without the cheat sheet: accuracy actually dropped. We therefore concentrated on redesigning and optimizing the cheat sheet afterwards.
- **Multi-agent effect and problems**: although both agents used the same model (Gemma 2 9B), the experiments showed no significant gain from using two agents, while the runtime was two to three times that of a single agent. Later experiments therefore switched to a single agent, reducing wasted compute and improving efficiency.

#### Work Distribution

1. Researched SQL-related work
2. Designed the SQL generation pipeline and its experiments
3. Studied how to integrate the table schema into the pipeline
4. Designed the vector storage update scheme and added the SQL classification method
5. Designed the SQL cheat sheet

---

### 陳威宇, WEI-YU CHEN, r12922040@ntu.edu.tw

#### Experiments

##### Classification

Early in the classification task we tried a single agent running multiple stages: case summarization, answer validation, and history outline generation. Running inference three times consumed a lot of time and compute, and the final results were unsatisfactory, sometimes even below the baseline.

Concept (for reference):
![image](https://hackmd.io/_uploads/r1t4eqWS1e.png)

After the experiments reached 69% accuracy, we revised the disease traits used in the prompt. Some of the initial traits had missing cases or wrong names; after correcting them we had ChatGPT-4o reorganize the traits, keeping the old features while adding some new keywords, which raised the public score to 70%.

Next we fine-tuned parameters such as temperature and top_k, and the prompt format (e.g., "Matching Symptoms" and "History Examples"). These adjustments brought the public score up to 71%.

![rolling_acc](https://hackmd.io/_uploads/SyolU5WHkl.png)

##### SQL Generation

After the initial SQL generation experiments, we found two-stage generation to be as ineffective and time-consuming as the multi-round single-agent approach in classification, so we switched to single-stage generation. To approximate the effect of multiple stages, we added a Chain-of-Thought procedure to the prompt:

```
Follow these steps using a chain-of-thought reasoning approach:
1. Extract key information, such as required columns, filters, conditions, aggregations, and relationships.
2. Identify the schema components referenced (e.g., tables, columns).
3. Match with Schema:
   - Use patterns from similar sample queries (by db_id or question type) to construct your SQL query.
   - Translate specific terms from the user query into evidence-backed SQL components.
   - Look for direct keyword matches, synonyms, or related meanings between the query and column/table names.
4. Generate SQL Query:
   - Identify required SELECT columns, WHERE conditions, GROUP BY clauses, and ORDER BY rules.
   - Build a query that follows standard SQL syntax and matches the schema and query context.
   - Ensure that table joins, filters, and aggregations are consistent with both the schema and user intent.
```
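For context, here is a minimal sketch of how such a single-stage prompt can be assembled and its output post-processed; `call_llm` is a stand-in for the project's inference wrapper (e.g., around gemma-2-9b-it), the step text is abbreviated, and the output marker is an illustrative assumption.

```python
# Sketch of single-stage SQL generation with a chain-of-thought prompt.
# `call_llm` is an assumed wrapper: prompt string in, completion string out.

COT_STEPS = (
    "Follow these steps using a chain-of-thought reasoning approach:\n"
    "1. Extract key information ...\n"   # abbreviated; full text above
    "4. Generate SQL Query ...\n"
)

def generate_sql(user_query, schema_context, examples, call_llm):
    prompt = "\n\n".join([
        COT_STEPS,
        "Schema:\n" + schema_context,
        "Similar solved examples:\n" + "\n\n".join(examples),
        "Question: " + user_query,
        "After reasoning, output the final query on a line starting with SQL:",
    ])
    answer = call_llm(prompt)
    # Keep only the text after the last "SQL:" marker so the reasoning
    # steps are never mistaken for the query itself.
    return answer.rsplit("SQL:", 1)[-1].strip()
```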
In addition, we noticed that adding the table schema and history examples made the prompt overly long. To address this we shortened and streamlined the prompt further: after checking the table descriptions, removing redundant wording, and describing the columns in more natural language, we raised accuracy from 29% to 34%.

Finally, since the table descriptions had already been filtered and simplified, we changed how the database schema is retrieved, replacing the original complex keyword extraction with simply adding every referenced table and its full description to the prompt. Together with parameter fine-tuning, this raised accuracy from 34% to 37%.

![rolling_acc](https://hackmd.io/_uploads/rkcZ8qWSke.png)

#### Discussion

##### SQL Generation

We believe the Chain-of-Thought method played a major role. It first consolidates the database schema and decides which columns are needed before generating the complete SQL query. Compared with generating directly from all the information at once, this structured procedure made the model's queries more stable and more often correct.

We ultimately ran into a problem on the SQL generation private test set. Because we had tuned the table descriptions so carefully during the experiments, we put the filtered table descriptions directly into the prompt in place of the original table schema. We had assumed the public and private test sets would be evaluated on the same tables, so we hard-coded the table descriptions into the program.

When the private benchmark was released, we discovered that its tables were completely different from the public test set's, so our hard-coded table descriptions were useless. Worse, since we never fed the original table schema into the prompt, the model failed completely on the private test set: a final score of 0%, i.e., every answer wrong. (A simple guard against this failure mode is sketched at the end of this section.)

#### Work Distribution

1. Parameter fine-tuning and prompt optimization and improvement for both tasks.
2. Final code integration and simplification, README writing, and environment setup.
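The guard mentioned in the Discussion above would have been cheap: prefer the hand-tuned description when one exists, but always fall back to the benchmark-provided raw schema for unseen tables. A minimal sketch, with `CURATED_DESCRIPTIONS` standing in for the dict we hard-coded and `raw_schema_text` for the schema the benchmark supplies; both names are illustrative.

```python
# Sketch of a fallback that avoids the 0% private-set failure: never leave
# the model without a schema just because a curated description is missing.

CURATED_DESCRIPTIONS: dict[str, str] = {
    # populated from the public-set tables we tuned by hand
}

def schema_for_prompt(table: str, raw_schema_text: str) -> str:
    curated = CURATED_DESCRIPTIONS.get(table)
    if curated is not None:
        return curated       # tuned wording, shorter prompt
    return raw_schema_text   # unseen table: fall back to the raw schema
```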
---

### 郭思言, Szu-Yen Kuo, kszuyen@gmail.com (r12945040@ntu.edu.tw)

#### Related Work

- Survey on different RAG implementations:
    - With the rapid development of Retrieval-Augmented Generation (RAG), implementations of many kinds are in wide use. Systems such as Microsoft's GraphRAG, or Medical Graph RAG for medical diagnosis, are drawing increasing attention because they can build more complex relationships between passages. By introducing a graph structure, these methods significantly strengthen the ability to extract relevant material, making them particularly suitable for domains such as medical diagnosis.
    - However, graph-based RAG systems usually rely on LLMs such as ChatGPT or LLaMA to parse the text and construct the graph's nodes and edges. Although GraphRAG is strong at handling complex relations, given this final project's restrictions on LLM usage we ultimately chose a vector database as the core store of our RAG system.

#### Experiments

**1. Random pick instead of top-k**

| Method | Accuracy |
| ------ | -------- |
| Cls - random | 0.3861 |
| Cls - top_k | 0.5153 |
| SQL - random | 0.2679 |
| SQL - top_k | 0.3051 |

**2. Classification**

- **Fine-tuning** (KTO - n: fine-tune and update parameters every n steps)

| Method | Accuracy |
| ------ | -------- |
| w/o fine-tuning | 60.66% |
| KTO - 240 | 55.39% |
| KTO - 60 | 44.05% |

- **Embedding model - 1**

| Method | Accuracy |
| ------ | -------- |
| sentence-transformers/all-mpnet-base-v2 | 60.66% |
| abhinand/MedEmbed-large-v0.1 | 63.04% |

This shows the advantage of a domain-specific embedding model on a medical task. Although all-mpnet-base-v2 is a strong general-purpose model, its accuracy is slightly below MedEmbed's. The gain suggests a model pretrained on medical-domain data captures the distinctions and semantic relatedness in medical text better, and therefore performs better at diagnosis prediction.

- **Embedding model - 2**

| Method | Accuracy |
| ------ | -------- |
| abhinand/MedEmbed-base-v0.1 | 69.67% |
| abhinand/MedEmbed-large-v0.1 | 65.99% |

The experiments show a large embedding model is not necessarily better than the smaller base model. We suspect the large model's pretraining data or method does not fully match the specific needs of this task, which hurts its effectiveness.

- **RAG storage unit**

| Method | Accuracy |
| ------ | -------- |
| patient (paragraph) | 60.66% |
| symptoms (sentence) | 44.17% |

Idea: use a single symptom to find other patients in the database with similar symptoms, and from them infer which diseases that symptom may correspond to. However, retrieval based on such local features ignores the context and combined characteristics of the patient's overall narrative, so important links between symptoms can be missed, hurting diagnostic accuracy. The results show that the patient's full narrative provides more complete background and helps the model capture the whole picture of the illness.

- **Multi-agent**

| Method | Accuracy |
| ------ | -------- |
| 1 agent (gemma-9b) | 69.67% |
| 3 agents voting (gemma-9b, ministral-8b, llama-3b) | 68.25% |
| 2 agents with regenerate (main: gemma-9b, ministral-8b) | 64.91% |
| 2 agents with regenerate (main: ministral-8b, gemma-9b) | 68.48% |

We found the only reasonably strong models in the model list are gemma-9b and ministral-8b; in 3-agent voting, llama-3b was frequently wrong. The 2-agent regenerate approach has both models answer at the same time; if the answers differ, the other LLM's answer is shown to the main model, which is asked whether it still stands by its own answer or wants to change it. The results show this does not raise accuracy, and on the questions where the answers differed, regeneration made errors more likely.

**3. SQL generation: 26% -> 30.50%**

- Optimized the two-stage approach and improved prompting:
    - Stage 1:
        - Have the LLM analyze which tables and columns will be used, and enforce formatted output.
        - Remove descriptions that mislead the model's output.
    - Stage 2:
        - Organize stage 1's output, providing only the user query and the cleaned tables, then have the LLM generate the SQL.

#### Discussion

1. **Random vs. top-k retrieval**: Top-k retrieval clearly beats random selection on both tasks, improving accuracy by roughly 13% and 4% respectively (classification: 0.5153 vs. 0.3861; SQL: 0.3051 vs. 0.2679; both policies are sketched after this list). Top-k is far better at surfacing information relevant to the query, while random retrieval can inject large amounts of irrelevant or noisy data, confirming how strongly the retrieval strategy affects results.
2. **Classification fine-tuning**: Fine-tuning did not improve the classification task and in fact slightly hurt it. The model without fine-tuning reached 60.66%, and the more frequent the KTO fine-tuning, the larger the drop. This may be overfitting caused by too little data, or overly frequent updates disrupting stable learning.
3. **Multi-agent approach**: We ultimately chose gemma-9b as the main answering model because the experiments showed no accuracy gain from the multi-agent methods. This is likely because the models available in the provided list are relatively weak overall and cannot effectively complement or reinforce one another. In theory a multi-agent system should improve results through cooperation between models, but it fell short in these experiments, so we shifted our focus to optimizing prompting.
4. **SQL generation improvements**: In the two-stage prompting method, stage 1 has the LLM parse the needed tables and columns and standardizes the output format, reducing misleading output; stage 2 then builds structured SQL generation on stage 1's result. This indicates that staged guidance effectively reduces the complexity of the generation task and improves generation quality.
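The two retrieval policies compared in point 1 differ only in how they pick examples from the vector memory. A minimal sketch over an in-memory store, using cosine similarity; the array layout and function names are illustrative.

```python
import random
import numpy as np

def retrieve_topk(query_vec, memory_vecs, memory_texts, k=5):
    """Top-k policy: return the k stored cases most similar to the query."""
    sims = memory_vecs @ query_vec / (
        np.linalg.norm(memory_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    best = np.argsort(-sims)[:k]            # indices of the k highest scores
    return [memory_texts[i] for i in best]

def retrieve_random(memory_texts, k=5):
    """Random policy: the baseline that scored 0.3861 / 0.2679."""
    return random.sample(memory_texts, min(k, len(memory_texts)))
```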
#### Work Distribution

- Random picking, fine-tuning, symptoms-only RAG, and multi-agent experiments.
- Medical-specific embedding model: MedEmbed.
- Continued team lead 亦天's work, further optimizing the two-stage approach and pushing the SQL generation task above 30%.

---

### 李季軒, CHI-HSUAN LEE, B10902111, b10902111@ntu.edu.tw

#### Abstract

Model fine-tuning using LoRA. We examine how different fine-tuning algorithms (SFT, KTO, DPO) affect classification. Testing each algorithm's effectiveness with a zero-shot template shows that reinforcement-style algorithms such as KTO and DPO struggle to raise accuracy, while SFT helps markedly. Then, on top of the LLM + RAG pipeline, training with SFT on prompt + ICL text raised accuracy further, indicating that ICL still has room to improve.

#### Approach

The methods below touch only the LLM's own parameters, so they can be applied to different tasks and pipelines alike: zero-shot and RAG, classification and SQL generation.

- **SFT**
    - Sample structure: `{"prompt", "completion"}`
    - Data collection: the test samples already seen become the future training set. When the model answers incorrectly, the example is simply discarded; when it answers correctly, the response is treated as the gold answer and packed with the input into the training set.
    - Procedure: collecting a training set takes time, and fine-tuning is more efficient once a batch of data has accumulated, but too long a cycle makes the model's growth curve less smooth, so we fine-tune the model every 20 steps. To keep the model from seeing the same sample too many times, older training samples must be removed; a simple scheme is to keep only the newest 200 samples.
- **KTO**
    - Sample structure: `{"prompt", "completion", "label"}`
    - Goal: fix SFT's waste of data. KTO can use both correct and incorrect model answers as training samples.
    - Data collection: a KTO training sample has three fields: (1) prompt, (2) completion, (3) label. Prompt and completion are the input and the answer; the label marks whether the answer was correct. With labels assigned from the benchmark feedback, every sample joins the training set whether the model answered correctly or not.
    - Procedure: same as SFT.
- **DPO**
    - Sample structure: `{"prompt", "chosen", "rejected"}`
    - Data collection: the LLM must produce two different responses per input; concretely, we force distinct responses with generation parameters such as `num_beams`, `num_beam_groups`, and `diversity_penalty` (see the sketch after this list). The higher-scoring response is submitted as the official answer. When the benchmark feedback says the answer is correct, the other response must be the wrong one, giving us the response pair ("chosen", "rejected") that DPO needs. When the feedback says the answer is wrong, we cannot tell which response would be chosen or rejected and must discard the sample, the same flaw as SFT.
    - Procedure: same as SFT.
- **SFT+DPO**
    - Idea: LLM training has three stages: pretraining, fine-tuning, RLHF. Pretraining is already done by the open-source model; the fine-tuning stage can use SFT; and the final RLHF stage usually has a trained LLM act as a teacher that applies a reinforcement-learning algorithm to make the final adjustments to the target model. Combining the preceding algorithms can approximate this effect.
    - Procedure: SFT and DPO each keep their own training set, and collecting data for both at once does not conflict. SFT fine-tuning proceeds as before. DPO instead fine-tunes at steps 600, 1000, 1300, and 1600, and rather than keeping the newest 200 samples it trains for 3 epochs on all samples collected so far, clearing the DPO training set after each run.
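A minimal sketch of the DPO pair collection described above, using Hugging Face diverse beam search to force two distinct responses; the model/tokenizer objects and the `feedback_correct` flag come from the streaming loop and are assumptions here, not our exact code.

```python
# Sketch: collect a DPO ("chosen", "rejected") pair from one benchmark step.
# Assumes a causal LM, e.g., the provided qwen checkpoint.

def collect_dpo_pair(model, tokenizer, prompt, feedback_correct, dpo_set):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        num_beams=4,
        num_beam_groups=2,       # two groups -> two distinct answers
        diversity_penalty=1.0,   # push the groups apart
        num_return_sequences=2,
        do_sample=False,
    )
    prompt_len = inputs["input_ids"].shape[1]
    texts = tokenizer.batch_decode(out[:, prompt_len:], skip_special_tokens=True)
    answer, alternative = texts[0], texts[1]   # texts[0] has the higher score
    if feedback_correct:
        # Correct feedback pins down both sides of the pair.
        dpo_set.append({"prompt": prompt,
                        "chosen": answer,
                        "rejected": alternative})
    # Wrong feedback: we cannot tell which side is better, so discard.
    return answer
```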
#### Experiments

##### 1. Classification and SQL generation

This experiment checks whether fine-tuning is effective on these two tasks, comparing each against StreamICL.

| Task | Method | Accuracy (%) |
| -------------- | --------- | ------------ |
| classification | zeroshot | 43.48 |
| classification | streamICL | 54.71 |
| classification | SFT | 57.88 |
| sql generation | zeroshot | 26.08 |
| sql generation | streamICL | 30.25 |
| sql generation | SFT | 24.91 |

SFT beats StreamICL on classification, but on SQL generation it underperforms, falling even below zero-shot. We therefore paused applying fine-tuning to SQL generation.

##### 2. Zero-shot fine-tuning

This experiment uses the zero-shot setting as the base and tests the four approaches above.

| Task | Method | Accuracy (%) |
| -------------- | ------------------ | ------------ |
| classification | without finetuning | 43.48 |
| classification | SFT | 57.88 |
| classification | KTO | 41.38 |
| classification | DPO | 44.04 |
| classification | SFT+DPO | 58.59 |

The results show SFT has by far the strongest effect; adding the other RL-style algorithms yields only a small further gain.

##### 3. RAG test

This experiment builds on 邱偉誠 WEI-CHENG's RAG pipeline (the version without the cheat sheet) and measures the gain from supervised fine-tuning.

| Task | Method | Accuracy (%) |
| -------------- | ------------------- | ------------ |
| classification | baseline | 60.43 |
| classification | SFT | 61.565 |
| classification | SFT + simple prompt | 67.63 |
| classification | simple prompt | 45.12 |

Adding SFT alone did raise accuracy slightly, but not satisfactorily. We then deleted some information from the prompt ("Key Symptoms Found", "Common Symptoms") and retested: accuracy rose by 6%. Simply trimming the prompt without fine-tuning drops accuracy to roughly zero-shot level.

#### Discussion

- SFT concentrates on selected correct samples, so its effect is clearer on classification, while the generalization gain on generation tasks is insufficient. SQL generation also has lower base accuracy than classification, so training-set collection is less efficient, which further drags down fine-tuning's effect.
- For RL algorithms like KTO and DPO, classification responses are just short labels with little diversity, which likely limits the training gain.
- Trimming the prompt probably helps fine-tuning because the model learns how to respond during fine-tuning and no longer needs the auxiliary prompt as a hint, which also reduces noise.

#### Conclusion

SFT can serve as a general-purpose auxiliary technique that further improves a pipeline's performance. Its actual impact may be limited by the size of the training set and the complexity of the task.

#### Work Distribution

- Development and testing of the fine-tuning methods.
- Alternative cheat-sheet retrieval method (BM25).

---

## <span class="purple">**12/19 Meeting**</span>

```
adl meeting
Dec 19 (Thu) · 8:00-10:00 PM
Time zone: Asia/Taipei
Google Meet joining info
Video call link: https://meet.google.com/bzi-qasr-nma
```

------------------------------

**邱偉誠 WEI-CHENG CHIU**: I will put the report together.

ADL wrap-up report outline:
- Abstract
- Introduction
- Related Work (related work we researched)
- Approach (what we actually implemented)
- Experiments (experiments along the way)
- Discussion (why the results work)
- Conclusion
- Work Distribution

Experiments/Discussion will take the largest share.

What everyone needs to provide (just collect it in this HackMD note):
1. Passport English name, student ID, email (for the top of the document)
2. Related material you researched (for Related Work)
3. Your past experiments, organized (for Experiments)
4. Any thoughts you want to contribute (for Discussion)
5. Your own work distribution (I will put it at the end of the report)

Before the report is uploaded (presumably also by the team lead), a preliminary version will be posted here for everyone to review.

About the YouTube video: my idea is that I first open a Google Slides deck with a simple structure, everyone edits the part that belongs to them (presenting what they did, e.g., Experiments & Approach), records their own part with OBS (about two minutes per person), and finally we splice the clips together.

-----------------------------

Report work split:
- github current version: 26.98%
- github current version + llama3 8b: 25.94%
- github current version + mistral 7b: 17.66%

## <span class="purple">**12/12 Meeting**</span>

```
adl final project meeting~
Dec 12 (Thu) · 8:00-10:00 PM
Time zone: Asia/Taipei
Google Meet joining info
Video call link: https://meet.google.com/nfp-hyyj-mzs
```

- multiple agent

思言:
- Experiments:
  1. Embed model
     - ministral + base: 58.05%
     - gemma + base: 69.67%
     - gemma + large: 65.99%
  2. 3 agents voting: 68.25%
  3. 2 agents regenerate:
     - gemma, ministral: 64.91% (most regenerations were wrong)
     - ministral, gemma: 68.48%
  4. switch:
     - 40%: gemma, >40%: 2 agents regenerate (ministral, gemma): 69.73%
     - 20%: 69.78%

nico:
- Tried LangChain packages with the LLM: the packages cannot be used with the checkpoints provided by the TA.
- Using two LLMs (both gemma 2 9b) to separate the task:
  - First attempt:
    1. LLM1 is given the table description to explain the query
    2. LLM2 is given the explained query to generate SQL
    3. Result: 27.77%
  - Second attempt (a sketch of this split follows the list):
    1. LLM1 is given the table description to select the tables and columns used in the query
    2. LLM2 is given the selected tables to generate SQL
    3. Result: 29.6%
  - Third attempt:
    1. LLM1 is given the table description to select the tables and columns used in the query
    2. LLM2 is given the cheat sheet + the selected tables to generate SQL
    3. Result: 27%
  - Fourth attempt:
    1. Search for the correct answer and take it as a hint
    2. LLM1 is given the table description + the hint to output SQL code
    3. LLM2 is given that SQL code and generates the SQL answer
    4. Result: 49%
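A minimal sketch of the second attempt's two-agent split; both agents can be the same gemma-2-9b checkpoint, and `call_llm` is an assumed wrapper rather than our actual interface.

```python
# Sketch: agent 1 trims the schema, agent 2 writes SQL against the trimmed view.

def two_agent_sql(user_query, table_description, call_llm):
    # Agent 1: narrow the schema to the tables/columns the query needs.
    selection = call_llm(
        "Given these table descriptions:\n" + table_description +
        "\n\nList only the tables and columns needed to answer:\n" + user_query
    )
    # Agent 2: generate SQL from the trimmed schema instead of the full one.
    return call_llm(
        "Using only these tables and columns:\n" + selection +
        "\n\nWrite a SQL query that answers:\n" + user_query
    )
```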
Table description format:

```
codebase_community_description = {
    "votes": {
        "Id": [
            "column_description: the vote id",
        ],
        "PostId": [
            "column_name: Post Id",
            "column_description: the id of the post that is voted",
        ],
        "VoteTypeId": [
            "column_name: Vote Type Id",
            "column_description: the id of the vote type",
        ],
        "CreationDate": [
            "column_name: Creation Date",
            "column_description: the creation date of the vote",
            "data_format: datetime"
        ],
```

SQL hint format:

```
{
    "question": "List the football team that has a build up play speed of 31, build up plan dribbling of 53, and build up play passing of 32. Only indicate the short name of the team.",
    "hint": "SELECT DISTINCT t1.team_short_name FROM Team AS t1 INNER JOIN Team_Attributes AS t2 ON t1.team_api_id = t2.team_api_id WHERE t2.buildUpPlaySpeed = 31 AND t2.buildUpPlayDribbling = 53 AND t2.buildUpPlayPassing = 32"
},
{
    "question": "What is the average overall rating of the football player Aaron Doran?",
    "hint": "SELECT CAST(SUM(t2.overall_rating) AS REAL) / COUNT(t2.id) FROM Player AS t1 INNER JOIN Player_Attributes AS t2 ON t1.player_api_id = t2.player_api_id WHERE t1.player_name = 'Aaron Doran'"
},
```
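To feed descriptions in this format to a model, they have to be flattened into prompt text. A minimal sketch of that traversal, matching the nested dict shape shown above; the function name is illustrative.

```python
# Sketch: flatten the nested table-description dict above into prompt lines.

def description_to_prompt(db_description):
    lines = []
    for table, columns in db_description.items():
        lines.append(f"Table {table}:")
        for column, facts in columns.items():
            # Each column maps to a list of "key: value" strings.
            lines.append(f"  {column}: " + "; ".join(facts))
    return "\n".join(lines)
```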
## <span class="purple">**12/5 Meeting**</span>

```
adl final project meeting~
Dec 5 (Thu) · 8:00-10:00 PM
Time zone: Asia/Taipei
Google Meet joining info
Video call link: https://meet.google.com/nfp-hyyj-mzs
```

1. Experiments:
   - Update once every n samples
   - Prompt rewrite
   - Random pick instead of top_k [name=szu-yen]
     - classification-public:
       - top_k: 0.5153
       - random: 0.3861
     - sql-generation-public:
       - top_k: 0.3051
       - random: 0.2679
   - Fine-tune prompt
   - Model fine-tuning with LoRA [name=waynechiu]
   - Multiple agent [name=willychen]
2. Questions:
   - May external resources be used for prompt edits, the embedding model, and the validation model?
   - Pretrained SQL models
   - Adding diagnostic rules to the prompt

### <span class="purple">**12/5 Shared Content**</span>

Wayne Chiu 邱偉誠:
code: https://drive.google.com/file/d/1ptykSZlZ86GPc8Z2sOkP_QgBldW-Tc0S/view?usp=drive_link

- Progress:
  1. For the classification task, the single model "google/gemma-2-9b-it" (whose zero-shot performance already tested as excellent) plus the optimized RAG mechanism reaches **60.66%** accuracy.
  2. Originally loading the model took 12 GB of VRAM and classification inference took 4-6 hours; with **unsloth**, it needs only 9 GB of VRAM and classification inference takes 1-2 hours. Judging by the results, performance really does not drop noticeably, as its GitHub page claims. (I found this open-source project because it suits quick fine-tuning, though I have no fine-tuning plan yet.)
- Questions:
  1. The fine-tuning method is still undecided; every fine-tuning approach tried so far only hurts performance, and the tuning method still needs research.
  2. With unsloth, my 16 GB of VRAM can fit two models (9b+7b or 8b+8b), but I still cannot think of a good way for multi-agent to substantially improve accuracy, so there is no reason to use it.
- Already tested: medical-specialized embedding models such as BioBERT/ClinicalBERT did not improve the classification task.

**Result** ("google/gemma-2-9b-it"):
- classification-public: 0.6066
- sql-generation-public: 0.2640 (below baseline) -> needs optimization

The unsloth library makes LM fine-tuning about 2x faster with 50% less VRAM, and extends context length 6x with no loss of accuracy.
- Docs: https://docs.unsloth.ai/
- GitHub: https://github.com/unslothai/unsloth
- Sample code (unsloth fine-tuning Llama-3.2-Vision-11B on Colab): https://colab.research.google.com/drive/1j0N4XTY1zXXy7mPAhOC1_gMYZ2F2EBlk?usp=sharing

**How to run my code**:
- Register an account at https://www.pinecone.io/ and request a Pinecone API token (free), then export it in the terminal with `export PINECONE_API_KEY=your_api_token`; the environment then defaults to the Pinecone database of your own account.
- Make sure the environment's CUDA version is >= 11.7, then install with `pip install unsloth` and `pip install --no-deps --upgrade "flash-attn>=2.6.3"`.
- `python main.py --bench_name classification_public --device cuda --use_wandb`
- `python main.py --bench_name sql_generation_public --device cuda --use_wandb`

The Pinecone free tier costs nothing; it only caps you at five indexes, and our case needs just two, which is plenty.

思言:
- Code for calculating output .csv score: https://drive.google.com/file/d/1BNXvW-Zu15WPiG1IzNgqwiiW794W5fEJ/view?usp=sharing
- Medical embedding model: `abhinand/MedEmbed-large-v0.1`
- KTO, PPO (X)
- embed symptoms (X)

季軒:
- **Code**: https://github.com/niconini-github/GrandChallenge-streambench-final-project/blob/chi-hsuan/examples/zeroshot_tuned.py
  How to run: same as the other example code; currently only qwen is supported.
- **Finetuning details**:
  - Supervised fine-tuning
    - Data collection: keep the newest 200 correct question-prediction pairs as the training set.
    - Frequency: train once every 20 steps, 1 epoch per run.
  - DPO
    - Data collection: have the model generate two answers at once, one as chosen and the other as rejected; when the chosen one is the correct answer, add the pair to the training set.
    - Training is triggered at steps 600, 1000, 1300, and 1600.
- **Problems**:
  1. Loading several models at once runs out of memory.
  2. The model can only learn from data it has answered correctly.
- **TODO**:
  1. ~~Adopt Unsloth~~.
  2. Diversify: to obtain more varied training data, use different models, generation configs, ...

nico:
- Adapted and adopted the prompts built into Querybook.

## <span class="blue">**11/20 Meeting**</span>

```
ADL final project meeting
Nov 19 (Tue) · 8:00-10:00 PM
Time zone: Asia/Taipei
Google Meet joining info
Video call link: https://meet.google.com/xre-nnpe-bft
```

1. Introductions
2. Walk through the paper's implementation in the repo
   - Owner:
3. Decide which LLM the project will use
   - [Available LLMs](https://github.com/appier-research/streambench-final-project/blob/main/model_list.md)
4. Share information or experience on LLM classification (medical diagnosis)
   - Owner: 思言
   - Owner:
5. Share information or experience on LLM generation (text to SQL)
   - Owner: nico
   - Owner:
6. Share on various LLM components (prompts, RAG memory, RAG retriever, etc.)
   - Owner: nico
   - Owner:
7. Work split (research, coding, report, ...)
8. Decide meeting frequency and times

### <span class="blue">**11/20 Shared Content**</span>

-----------------------------------------

- wayne:
  Goals:
  - Improve the LLM's stream-learning ability.
  - Complete the classification and SQL generation tasks.
  - Use enhanced RAG and a multi-agent system to boost performance.
  1. Enhanced prompt & few-shot learning: dynamically generate prompts combining few-shot examples with retrieval results to lift early performance.
  2. Dynamic RAG vector database: store successful cases in Pinecone and retrieve similar cases as context; the database expands dynamically, so accuracy keeps improving as cases accumulate.
  3. Multi-agent collaborative decisions: several models (e.g., GPT, Claude, G) collaborate, choosing the best result by voting or weighting.
  4. Feedback-driven continuous improvement: based on correct and incorrect feedback (use the inverse of a wrong answer as positive feedback?), update the RAG memory and prompts for continuous learning.

- 威宇 (drawing on suggestions from ChatGPT):
  1. Weighted RAG instead of 0 or 1.
  2. Use multiple agents to evaluate the response of the original agent, and give explanations.
  3. The model evaluates its response based on the feedback.
  4. Fine-tune the model with LoRA on the go.
  5. Error clustering: identify patterns of common mistakes.
  6. Error-recovery prompts: design prompts that acknowledge prior mistakes and request clarification or improvement based on feedback. Example: "You made a mistake in your previous response. Please try again with the following context."
  7. Role specialization: partition agents by specific responsibilities (e.g., retrieval specialist, summarizer, and validator). Validator agent: compare the output against expected behavior (e.g., for SQL, ensure schema compliance). Explanation agent: use chain-of-thought prompts to explain why the main agent's response is correct or incorrect.
  8. Medical diagnosis: diagnostic rules or knowledge bases, semantic similarity between the patient profiles and label descriptions, symptom embedding (or any medical embeddings available), check if the output class exists (0-48).
  9. Text-to-SQL generation: validate the generated SQL (a cheap validator is sketched below).
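One inexpensive way to realize point 9, sketched with sqlite3's EXPLAIN: the engine compiles the statement, so syntax errors and missing tables or columns surface without running the query. The helper name is illustrative.

```python
import sqlite3

def validate_sql(db_path, sql):
    """Return None if `sql` compiles against the schema, else the error text."""
    conn = sqlite3.connect(db_path)
    try:
        # EXPLAIN compiles the statement (catching syntax errors and unknown
        # tables/columns) without executing it over the data.
        conn.execute("EXPLAIN " + sql)
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()
```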
-----------------------------------------

<span class="blue">**LLM generation (text to SQL): information & experience**</span>

- nico:
  - Text-to-SQL at Pinterest
  - Steps:
    1. The user asks an analytical question to Querybook, choosing the tables to be used.
    2. The question, selected SQL dialect, and table schemas are compiled into a Text-to-SQL prompt.
    3. The prompt is fed into the LLM.
    4. A streaming response is generated and displayed to the user.
  - [All prompts used in QueryBook](https://github.com/pinterest/querybook/tree/1f14756b2ff08b6b9decb4b1d9f5561ac82d2ea3/querybook/server/lib/ai_assistant/prompts)

<span class="blue">**Various LLM components: information & experience**</span>

- nico:
  - Query rewriting: https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb
- JamesLee:
  - Prompt tuning: https://textgrad.com/
  - Prompt tuning & ICL tuning: https://dspy.ai/

<style>
.blue { color:#79CCEE; }
.orange { color:#F68F02; }
.pink { color:#FE9FA1; }
.yellow { color:#FDB515; }
.purple { color:#BAA2ED; }
.block { margin-left : 1em; }
</style>
