---
# type: slide
# slideOptions:
# center: false
---
# llama.cpp on EdgeAI
> [name=Rita]
---
## Bindings
* Python: abetlen/llama-cpp-python
* Go: go-skynet/go-llama.cpp
* Node.js: withcatai/node-llama-cpp
* Ruby: yoshoku/llama_cpp.rb
* Rust: mdrokz/rust-llama.cpp
* C#/.NET: SciSharp/LLamaSharp
* Scala 3: donderom/llm4s
* Clojure: phronmophobic/llama.clj
* React Native: mybigday/llama.rn
* Java: kherud/java-llama.cpp
* Zig: deins/llama.cpp.zig
---
## Which Models does Llama.cpp Support?
* Llama 2
* Quantized models in GGUF format
> What is GGUF?
> A binary format designed to quickly load and save large language models.
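
As a small illustration, here is a minimal sketch (based on the public GGUF spec, not llama.cpp code; the model path is just an example) that peeks at a GGUF file's header:

```python
import struct

# GGUF v2+ header: 4-byte magic "GGUF", uint32 version,
# uint64 tensor count, uint64 metadata key/value count (little-endian)
with open("models/llama-2-7b-chat.Q4_K_M.gguf", "rb") as f:
    magic = f.read(4)
    version, = struct.unpack("<I", f.read(4))
    n_tensors, n_kv = struct.unpack("<QQ", f.read(16))

print(magic, version, n_tensors, n_kv)
```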
:::spoiler Other Models
* LLaMA 🦙
* LLaMA 2 🦙🦙
* Falcon
* Alpaca
* GPT4All
* Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
* Vigogne (French)
* Vicuna
* Koala
* OpenBuddy 🐶 (Multilingual)
* Pygmalion/Metharme
* WizardLM
* Baichuan 1 & 2 + derivations
* Aquila 1 & 2
* Starcoder models
* Mistral AI v0.1
* Refact
* Persimmon 8B
* MPT
* Bloom
* Yi models
* StableLM-3b-4e1t
* Deepseek models
* Qwen models
* Mixtral MoE
* PLaMo-13B
* GPT-2
:::
---
## How to use Llama.cpp
* llama.cpp doesn't expose separate components such as a tokenizer, a model, etc.
* It combines them into a single class/struct called `Llama`.
1. Load the model
```python!
from llama_cpp_cuda_tensorcores import Llama  # CUDA tensor-core build of llama-cpp-python
import llama_cpp

# load the GGUF model and offload layers to the GPU
llama = Llama("models/llama-2-7b-chat.Q4_K_M.gguf",
              n_gpu_layers=128)
PROMPT = "hello, do you know about Hokkaido?"
```
----
2. The direct (high-level) way to use it
```python
llama(PROMPT)
```
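Calling the `Llama` object directly runs the whole tokenize → generate → detokenize pipeline and returns an OpenAI-style completion dict; a minimal sketch of pulling the text out of it (`max_tokens=128` is just an example value):
```python
result = llama(PROMPT, max_tokens=128)
# the generated text is in the first choice
print(result["choices"][0]["text"])
```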
3. Tokenize the prompt
* The input prompt needs to be converted to `bytes` first.
```python!
tokens = llama.tokenize(bytes(PROMPT, "utf-8"),
                        special=True)
```
----
4. Model generation
```python!
completion_tokens = []
for token in llama.generate(tokens, top_k=40,
                            top_p=0.95, temp=1.0,
                            repeat_penalty=1.1):
    # stop as soon as the end-of-sequence token appears
    if token == llama_cpp.llama_token_eos(llama.model):
        finish_reason = "stop"
        break
    completion_tokens.append(token)
```
5. Decode
```python!
# detokenize returns bytes, so decode back to a string
output = llama.detokenize(completion_tokens).decode(
    "utf-8", errors="ignore")
```
---
## How to Benchmark?
There are several ways, such as measuring `perplexity`,
but that is too academic for personal use.
----
Therefore, if our goal is simply to test the speed on a local machine,
the easiest way is to measure the token generation rate.
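A minimal sketch of that idea, reusing the `llama`/`tokens` objects from the previous section (the 256-token cap and sampling settings are just example values):
```python
import time

MAX_TOKENS = 256  # cap so the loop always terminates
start = time.perf_counter()
n_tokens = 0
for token in llama.generate(tokens, top_k=40, top_p=0.95,
                            temp=1.0, repeat_penalty=1.1):
    if token == llama_cpp.llama_token_eos(llama.model):
        break
    n_tokens += 1
    if n_tokens >= MAX_TOKENS:
        break
elapsed = time.perf_counter() - start

print(f"{n_tokens / elapsed:.2f} tokens/s, "
      f"{elapsed / max(n_tokens, 1):.4f} s/token")
```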
----
To test the token speed, the usual way is to measure how long it takes to output the tokens.
`However, there are some problems.`
----
Not every model is supported by `llama.cpp`, so we have to match each model's format, which means using a different package per format.
----
We need to determine which model formats will be used before writing tests for the different models.
If we only want to test the ==response generation rate==, we can directly reuse the loader module from [`text-generation-webui`](https://github.com/oobabooga/text-generation-webui/) (see the sketch after this list).
--> It can directly handle the vast majority of model formats.
* 'Transformers': huggingface_loader,
* 'AutoGPTQ': AutoGPTQ_loader,
* 'GPTQ-for-LLaMa': GPTQ_loader,
* 'llama.cpp': llamacpp_loader,
* 'llamacpp_HF': llamacpp_HF_loader,
* 'ExLlama': ExLlama_loader,
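
A rough sketch of how such a name → loader mapping keeps one benchmark script format-agnostic (this is not text-generation-webui's actual code; the loader function and model path are illustrative):
```python
from llama_cpp import Llama

def llamacpp_loader(model_path):
    # each loader wraps its own backend package behind the same interface
    return Llama(model_path, n_gpu_layers=128)

LOADERS = {
    "llama.cpp": llamacpp_loader,
    # "Transformers": huggingface_loader, "AutoGPTQ": AutoGPTQ_loader, ...
}

def load_model(loader_name, model_path):
    return LOADERS[loader_name](model_path)

model = load_model("llama.cpp", "models/llama-2-7b-chat.Q4_K_M.gguf")
```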
---
## Current Benchmark Workflow
The test program uses the benchmark provided by `openvino`.
Main items measured:
1. model generation time
* tokens/s
* s/token
2. memory utilization (see the sketch below)
    * only for the process's CPU and memory
    * GPU is not supported
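
A minimal sketch of how such process-level numbers can be collected with `psutil` while the model generates (an assumption about the approach, not the openvino benchmark's actual code; `shared` is only reported on Linux):
```python
import threading
import time
import psutil

def track_memory(stop_event, peak, interval=0.05):
    # poll the current process and keep the peak rss / shared values
    proc = psutil.Process()
    while not stop_event.is_set():
        info = proc.memory_info()
        peak["rss"] = max(peak["rss"], info.rss)
        peak["shared"] = max(peak["shared"], getattr(info, "shared", 0))
        time.sleep(interval)

stop, peak = threading.Event(), {"rss": 0, "shared": 0}
monitor = threading.Thread(target=track_memory, args=(stop, peak))
monitor.start()
# ... run the token-generation loop here ...
stop.set()
monitor.join()
print(f"Maximum rss memory consumption: {peak['rss'] / 2**20:.2f} MB")
```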
---
## It's time to discuss the machine environment!!!
1. llama_cpp appears to support only ==`x86`== (judging from their build scripts), so on an ==`arm`== architecture you will most likely need to run it inside Docker
2. The default Python on NVIDIA Jetson is also fairly old, so some package versions are too far apart or simply cannot be installed
3. As for the CUDA version, the repo says 11.8 or newer is required, but it has been tested and runs on 11.7!
----
### Dockerfile runtime
* Builder
    * Environment: nvidia/cuda:12.1.1-devel-ubuntu22.04
    * Packages: requirements.txt -> built into wheels
* Runtime
    * Environment: nvidia/cuda:12.1.1-runtime-ubuntu22.04
    * Install all the packages from the prebuilt wheels (a sketch follows)
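
A rough sketch of such a two-stage Dockerfile (the image tags follow the description above; package names and extra apt dependencies are assumptions):
```dockerfile
# Builder: compile everything in requirements.txt into wheels
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip3 wheel -r requirements.txt -w /wheels

# Runtime: install only the prebuilt wheels, no compiler needed
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip
COPY --from=builder /wheels /wheels
COPY requirements.txt .
RUN pip3 install --no-index --find-links=/wheels -r requirements.txt
```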
---
## Actual Test Results and Issues
:::warning
* Errors frequently occurred during testing

:::
----
### llama-2-13b-chat.Q4_K_M.gguf
##### https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF
* overall usage

* CPU usage = **1 proc**
----
* GPU memory usage = **8351 MB**

----
<!--
* 回答狀況:

---- -->
* **Benchmark**

```!
--- Benchmarking ---
Input length: 7 tokens
Generated 379 tokens in 19.21 s on CPU
Maximum rss memory consumption: 1710.70 MB,
Maximum shared memory consumption: 762.19 MB
```
----
### llama-2-7b-chat.Q4_K_M.gguf
##### https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF
* overall usage

* CPU usage = **1 proc**
----
* GPU memory usage = **4719 MB**

----
<!-- * 回答狀況:
 -->
* **Benchmark**

----

```!
--- Benchmarking ---
Input length: 7 tokens
Generated 501 tokens in 17.38 s on CPU
Average throughput: 28.93 tokens/s on GPU, 0.0346 s/token
Maximum rss memory consumption: 1548.51 MB, Maximum shared memory consumption: 587.46 MB
```