---
# type: slide
# slideOptions:
#   center: false
---

# llama.cpp on EdgeAI

> [name=Rita]

---

## Bindings

* Python: abetlen/llama-cpp-python
* Go: go-skynet/go-llama.cpp
* Node.js: withcatai/node-llama-cpp
* Ruby: yoshoku/llama_cpp.rb
* Rust: mdrokz/rust-llama.cpp
* C#/.NET: SciSharp/LLamaSharp
* Scala 3: donderom/llm4s
* Clojure: phronmophobic/llama.clj
* React Native: mybigday/llama.rn
* Java: kherud/java-llama.cpp
* Zig: deins/llama.cpp.zig

---

## Which Models does Llama.cpp Support?

* Llama 2
* Quantized models in the GGUF format

> What is GGUF?
> A binary format designed to quickly load and save large language models.

:::spoiler Other Models
* LLaMA 🦙
* LLaMA 2 🦙🦙
* Falcon
* Alpaca
* GPT4All
* Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
* Vigogne (French)
* Vicuna
* Koala
* OpenBuddy 🐶 (Multilingual)
* Pygmalion/Metharme
* WizardLM
* Baichuan 1 & 2 + derivations
* Aquila 1 & 2
* Starcoder models
* Mistral AI v0.1
* Refact
* Persimmon 8B
* MPT
* Bloom
* Yi models
* StableLM-3b-4e1t
* Deepseek models
* Qwen models
* Mixtral MoE
* PLaMo-13B
* GPT-2
:::

---

## How to use Llama.cpp

* Llama.cpp does not expose separate components such as a tokenizer and a model.
* It combines them in a single class, called `Llama`.

1. Load the model

```python!
from llama_cpp_cuda_tensorcores import Llama
import llama_cpp

# Load the GGUF model and offload layers to the GPU
llama = Llama("models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=128)
PROMPT = "hello, do you know about the hakkaido?"
```

----

2. The most direct way to use it

```python
llama(PROMPT)
```

3. Tokenize the prompt
* The input prompt needs to be converted to `bytes` first.

```python!
tokens = llama.tokenize(bytes(PROMPT, "utf-8"), special=True)
```

----

4. Generate tokens

```python!
completion_tokens = []
for token in llama.generate(tokens, top_k=40, top_p=0.95, temp=1.0, repeat_penalty=1.1):
    # Stop once the model emits the end-of-sequence token
    if token == llama_cpp.llama_token_eos(llama.model):
        finish_reason = "stop"
        break
    completion_tokens.append(token)
```

5. Detokenize the output

```python!
output = llama.detokenize(completion_tokens).decode(
    "utf-8", errors="ignore")
```

---

## How to Test the Benchmark?

There are several approaches, such as measuring `Perplexity`.
But that is too academic for personal use.

----

If our goal is simply to test speed on a local machine, the easiest way is to measure how fast tokens are generated.

----

To measure token speed, the usual way is to time how long the model takes to produce its output tokens (a minimal timing sketch appears on a later slide).
`However, there are some problems.`

----

Not every model is supported by `Llama.cpp`, so we have to match each model's format, which means using a different package for each one.

----

We need to decide which model formats will be used before we can write tests for the different models.
If we only want to measure the ==response generation rate==, we can reuse the loader modules from [`text-generation-webui`](https://github.com/oobabooga/text-generation-webui/).
--> They already cover most of the common model formats.

* 'Transformers': huggingface_loader,
* 'AutoGPTQ': AutoGPTQ_loader,
* 'GPTQ-for-LLaMa': GPTQ_loader,
* 'llama.cpp': llamacpp_loader,
* 'llamacpp_HF': llamacpp_HF_loader,
* 'ExLlama': ExLlama_loader,

---

## Current Benchmark Workflow

The test program uses the benchmark provided by `openvino`.

Main items measured:
1. model generation time
    * tokens/s
    * s/token
2. memory utilization
    * only the process CPU and memory
    * GPU is not supported

---

## It's Time to Discuss the Machine Environment!!!

1. llama_cpp appears to support only ==`x86`== (judging from its scripts), so on ==`arm`== you will almost certainly need to run it in Docker.
2. The default Python on NVIDIA Jetson is also quite old, so some package versions lag far behind or cannot be installed at all.
3. As for the CUDA version, the repo says 11.8 or newer is required, but it has been tested and runs on 11.7!
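----

### Measuring tokens/s Directly (Sketch)

A minimal way to time generation with plain `llama-cpp-python`, following the same API as the usage slides earlier. This is only an illustrative sketch, **not** the `openvino` benchmark that produced the numbers later in this deck; the prompt, the 256-token cap, and the sampling parameters are assumptions.

```python!
# Timing sketch only: plain llama-cpp-python plus time.perf_counter(),
# not the openvino benchmark used for the results shown later.
import time

from llama_cpp import Llama

llama = Llama("models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=128)
tokens = llama.tokenize(bytes("hello, do you know about the hakkaido?", "utf-8"), special=True)

MAX_NEW_TOKENS = 256  # cap the run so the measurement always terminates
generated = []
start = time.perf_counter()
for token in llama.generate(tokens, top_k=40, top_p=0.95, temp=1.0, repeat_penalty=1.1):
    if token == llama.token_eos() or len(generated) >= MAX_NEW_TOKENS:
        break
    generated.append(token)
elapsed = time.perf_counter() - start

print(f"Generated {len(generated)} tokens in {elapsed:.2f} s")
if generated:
    print(f"{len(generated) / elapsed:.2f} tokens/s, {elapsed / len(generated):.4f} s/token")
```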
----

### Dockerfile Runtime

* Builder
    * Base image: nvidia/cuda:12.1.1-devel-ubuntu22.04
    * Packages: requirements.txt -> built into wheels
* Runtime
    * Base image: nvidia/cuda:12.1.1-runtime-ubuntu22.04
    * Installs all packages from the prebuilt wheels

(A hedged sketch of such a Dockerfile is on the final slide.)

---

## Actual Test Results and Issues

:::warning
* Errors occur fairly often during testing
![image](https://hackmd.io/_uploads/r14SOdqdT.png)
:::

----

### llama-2-13b-chat.Q4_K_M.gguf
##### https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF

* overall usage
![image](https://hackmd.io/_uploads/Hk9fFuqup.png)
* CPU usage = **1 proc**

----

* GPU memory usage = **8351 MB**
![image](https://hackmd.io/_uploads/SyxbOOOq_p.png)

----

<!--
* Responses:
![image](https://hackmd.io/_uploads/BJYialbdp.png)

----
-->

* **Benchmark**
![image](https://hackmd.io/_uploads/ryStudc_a.png)

```!
--- Benchmarking ---
Input length: 7 tokens
Generated 379 tokens in 19.21 s on CPU
Maximum rss memory consumption: 1710.70 MB, Maximum shared memory consumption: 762.19 MB
```

----

### llama-2-7b-chat.Q4_K_M.gguf
##### https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF

* overall usage
![image](https://hackmd.io/_uploads/BJ9OYuc_T.png)
* CPU usage = **1 proc**

----

* GPU memory usage = **4719 MB**
![image](https://hackmd.io/_uploads/HkRPFd9da.png)

----

<!--
* Responses:
![image](https://hackmd.io/_uploads/BJYialbdp.png)
-->

* **Benchmark**
![image](https://hackmd.io/_uploads/H1vUF_cdT.png)

----

![image](https://hackmd.io/_uploads/HkxjSF_q_a.png)

```!
--- Benchmarking ---
Input length: 7 tokens
Generated 501 tokens in 17.38 s on CPU
Average throughput: 28.93 tokens/s on GPU, 0.0346 s/token
Maximum rss memory consumption: 1548.51 MB, Maximum shared memory consumption: 587.46 MB
```
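----

### Dockerfile Sketch

A minimal two-stage Dockerfile matching the Builder / Runtime split described earlier: the devel image builds wheels from requirements.txt, and the runtime image only installs those wheels. This is an illustrative sketch, not the actual file used; the apt packages, wheel directory, and the commented `CMAKE_ARGS` hint are assumptions.

```dockerfile
# --- Builder: has the CUDA toolchain, turns requirements.txt into wheels ---
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements.txt .
# ENV CMAKE_ARGS="-DLLAMA_CUBLAS=on"  # likely needed if llama-cpp-python is compiled here
RUN pip3 wheel --wheel-dir /build/wheels -r requirements.txt

# --- Runtime: only the CUDA runtime, installs the prebuilt wheels ---
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /build/wheels /wheels
RUN pip3 install --no-index --find-links=/wheels /wheels/*.whl
```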