Week 11 - Think It Simple

# Think It Simple

## Week 11

----

## NLU

+ Natural Language Understanding (NLU)
+ Analyzes the user's input and parses it into a machine-readable format
  + For example JSON, XML, or YAML
+ Example: "Should I bring an umbrella to Xindian tomorrow?"
+ Output:

```json=
{
    "intent": "query_weather",
    "location": "Xindian",
    "day": "tomorrow",
    "query_type": "chance of rain"
}
```

----

## LLM As NLU

+ With a suitable prompt, an LLM can act as an NLU system
+ This relies on In-Context Learning
+ A suitable prompt may include:
  + Detailed instructions (Instruction Following)
  + A moderate number of examples (Few-Shot Examples)

----

## NLU DIY

+ Building a simple NLU system with an LLM requires:
  1. Large Language Model
  2. Sentence Encoder
+ Here we use [Vicuna](https://github.com/lm-sys/FastChat) and [USEM](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3), respectively

---

## Vicuna

+ A model fine-tuned from LLaMA
+ Trained on the [ShareGPT](https://sharegpt.com/) dataset
+ [Reference dataset link](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)

----

## Model Weight

1. Download the original LLaMA-7B model weights
2. Download the Vicuna-7B delta weights
3. Merge the two sets of weights

----

## Clone Model

```sh=
git clone https://huggingface.co/decapoda-research/llama-7b-hf
git clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1
```

----

## Apply Delta

```sh=
python -m fastchat.model.apply_delta \
    --base-model-path llama-7b-hf \
    --delta-path vicuna-7b-delta-v1.1 \
    --target-model-path vicuna-7b \
    --low-cpu-mem  # use this when RAM is limited
```

----

## Add Swap

+ If you run out of memory, add swap memory
+ Swap uses disk space as if it were memory

```sh=
sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h  # check memory usage
```

----

## Quantization

Hugging Face and bitsandbytes (BNB) recently added [support for 4-bit quantization](https://huggingface.co/blog/4bit-transformers-bitsandbytes)

```python=
from transformers import LlamaForCausalLM as ModelCls
from transformers import LlamaTokenizerFast as TkCls

# Named `model` so that the next slide's `model.generate` call works.
model: ModelCls = ModelCls.from_pretrained(
    "vicuna-7b",
    device_map="auto",
    # load_in_8bit=True,  # needs about 9 GB of VRAM
    load_in_4bit=True,  # needs about 4.5 GB of VRAM
)
tk: TkCls = TkCls.from_pretrained("vicuna-7b")
```

----

## Hello, Vicuna!

```python=
from transformers import TextStreamer

prompt = """### User: Hello, Vicuna!
### Assistant:"""
# Stream tokens to stdout as they are generated, without echoing the prompt.
st = TextStreamer(tk, skip_prompt=True)
input_ids = tk(prompt, return_tensors="pt")["input_ids"]
input_ids = input_ids.to(model.device)  # put the inputs on the model's device
model.generate(input_ids, max_new_tokens=128, streamer=st)
```

----

## In-Context Learning

```txt
[Instruction]
You are the Dialogue State Tracker of a high-speed rail ticketing system. Analyze the user's input and output the state in JSON format.

[Examples]
User: Departing on July 3rd
JSON: {"intent":"date_only","date":"07/03"}

(more examples ...)

User: May 12th
JSON: <=== the LLM starts generating here
```

----

## Example Selection

+ As the system grows more complex, the number of examples keeps increasing
+ Putting every example into the prompt causes two problems:
  1. The model has a length limit, so too many examples may not fit
  2. Irrelevant examples can mislead the model into wrong outputs
+ Solution: **retrieve the examples most relevant to the input**
+ [Active Example Selection for In-Context Learning](https://arxiv.org/abs/2211.04486)

---

# Example Selection

----

## Sentence Similarity

+ Encode each sentence into a vector
+ Compute the distance between vectors; the closer they are, the more similar the sentences
+ This requires a sentence encoder and a distance function

----

## Sentence Encoder

+ Google's [Universal Sentence Encoder Multilingual](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3) supports 16 languages, including Traditional and Simplified Chinese, English, French, German, Italian, Japanese, Korean ...

----

## Usage

```python=
import tensorflow_text  # registers the ops the multilingual model needs
import tensorflow_hub

url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"
encoder = tensorflow_hub.load(url)
embed = encoder(["今天天氣真好"])  # "The weather is really nice today"
embed.shape  # TensorShape([1, 512])
```

----

## Calculate Distance

Use scikit-learn's [Euclidean distance](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html)

```python=
from sklearn.metrics.pairwise import euclidean_distances

# "The weather is really nice today", "I'm a Taurus", "Departing at 5 p.m."
keys = ["今天天氣真好", "我是金牛座", "下午五點出發"]
query = ["這什麼鳥天氣"]  # "What lousy weather"

# `encoder` is the model loaded on the previous slide.
key_emb = encoder(keys)
query_emb = encoder(query)
euclidean_distances(query_emb, key_emb)
# Outputs: [[0.8632417, 1.3535709, 1.1286112]]
# The smaller the distance, the more similar the sentences
```

---

# Integration

Demo Time!
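
----

## Integration Sketch

The example-selection and prompt-building steps above could be combined roughly as follows. This is a minimal sketch, not the code from the demo: `toy_encode` is a hypothetical character-bigram stand-in so the snippet runs without TensorFlow (in practice it would be replaced by the USEM `encoder`), and the example pool, `select_examples`, and `build_prompt` are illustrative names that do not come from the slides.

```python
import math
from collections import Counter

# Hypothetical example pool: (user utterance, expected JSON) pairs.
EXAMPLES = [
    ("Departing on July 3rd", '{"intent":"date_only","date":"07/03"}'),
    ("From Taipei to Tainan", '{"intent":"route_only","from":"Taipei","to":"Tainan"}'),
    ("Two adult tickets", '{"intent":"ticket_count","adult":2}'),
]

INSTRUCTION = (
    "You are the Dialogue State Tracker of a high-speed rail ticketing system. "
    "Analyze the user's input and output the state in JSON format."
)


def toy_encode(text):
    """Stand-in encoder (assumption): sparse character-bigram counts."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def distance(a, b):
    """Euclidean distance between two sparse count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def select_examples(query, examples, k=2):
    """Return the k examples whose utterances are closest to the query."""
    q = toy_encode(query)
    ranked = sorted(examples, key=lambda ex: distance(q, toy_encode(ex[0])))
    return ranked[:k]


def build_prompt(query, examples, k=2):
    """Assemble the few-shot prompt; the LLM continues after the final 'JSON:'."""
    lines = ["[Instruction]", INSTRUCTION, "", "[Examples]"]
    for user, state in select_examples(query, examples, k):
        lines += [f"User: {user}", f"JSON: {state}", ""]
    lines += [f"User: {query}", "JSON:"]
    return "\n".join(lines)


print(build_prompt("Departing on May 12th", EXAMPLES, k=2))
```

The resulting string can be tokenized and passed to `model.generate` in the same way as the "Hello, Vicuna!" prompt. Selecting only the `k` nearest examples keeps the prompt short and leaves out examples unrelated to the input.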