LLM Inference in Practice - Using Ollama as an Example

tags: `2024/10` `LLM` `Ollama`

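(2024/10/10) Since ChatGPT burst onto the scene on 2022/11/30, the LLM wave has not faded; if anything, it keeps growing. I used a company server to try out Ollama, one of the best-known tools in the open-source LLM world, and this post records what I did.

A GitHub repository for reference is listed at the end of the post.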


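Introduction to Ollama

Ollama is an open-source platform for running large language models locally. The table below lists some of the models Ollama supports, such as Llama 3. It is designed to simplify model setup and configuration, including GPU usage.
Ollama is very easy to use: from downloading Ollama to running a command such as ollama run llama3.2 (or another model) in the CLI (Command Line Interface), the whole process takes less than 10 minutes. Ollama supports the mainstream operating systems macOS, Linux, and Windows; my walkthrough is based on Linux.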

Model	Parameters	Size	Download
Llama 3.2	3B	2.0GB	`ollama run llama3.2`
Llama 3.2	1B	1.3GB	`ollama run llama3.2:1b`
Llama 3.1	8B	4.7GB	`ollama run llama3.1`
Llama 3.1	70B	40GB	`ollama run llama3.1:70b`
Llama 3.1	405B	231GB	`ollama run llama3.1:405b`
Phi 3 Mini	3.8B	2.3GB	`ollama run phi3`
Phi 3 Medium	14B	7.9GB	`ollama run phi3:medium`
Gemma 2	2B	1.6GB	`ollama run gemma2:2b`
Gemma 2	9B	5.5GB	`ollama run gemma2`
Gemma 2	27B	16GB	`ollama run gemma2:27b`
Mistral	7B	4.1GB	`ollama run mistral`
Moondream 2	1.4B	829MB	`ollama run moondream`
Neural Chat	7B	4.1GB	`ollama run neural-chat`
Starling	7B	4.1GB	`ollama run starling-lm`
Code Llama	7B	3.8GB	`ollama run codellama`
Llama 2 Uncensored	7B	3.8GB	`ollama run llama2-uncensored`
LLaVA	7B	4.5GB	`ollama run llava`
Solar	10.7B	6.1GB	`ollama run solar`

Source: Ollama Model List (GitHub)
Note: approximate memory requirements are 8 GB of RAM to run 7B models, 16 GB to run 13B models, and 32 GB to run 33B models.
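
The rule of thumb in the note can be written as a small lookup. This is only a sketch of that note: the function name `min_ram_gb` and the `RAM_RULES` list are made up for illustration, and the thresholds are exactly the three figures quoted above, nothing more precise.

```python
# Rule of thumb from the note above: (model size in billions of parameters, RAM in GB).
RAM_RULES = [(7, 8), (13, 16), (33, 32)]


def min_ram_gb(params_billion):
    """Return the smallest listed RAM tier that covers the model, or None if the note has no figure."""
    for max_params, ram_gb in RAM_RULES:
        if params_billion <= max_params:
            return ram_gb
    return None  # larger than 33B: the note gives no figure


for size in (1, 3, 8, 70):
    print(f"{size}B -> {min_ram_gb(size)}")
```

Models larger than 33B return None on purpose, because the note does not say how much RAM they need.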

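Installing Ollama

Before installing Ollama, it is best to have a GPU (NVIDIA, or an Apple M-series Mac) with its drivers installed, including NVIDIA CUDA. Ollama can also run on the CPU alone, only much more slowly.

On the Ollama website you will see three options, macOS, Linux, and Windows; download the one that matches your operating system.

Ollama website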

After clicking Download, a screen appears where you select your operating system.

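After choosing Linux, the page shows that the installation can be done with the single command below. Open a Terminal (if you have not already), then type (or paste) the command to complete the installation.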

curl -fsSL https://ollama.com/install.sh | sh

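Running Ollama from the CLI (Command Line Interface)

Running Ollama from the CLI is the most direct approach, especially for anyone with Linux experience. This section covers how to run it from the command line in a Terminal.

1. Download and run a model

Once installation is complete, the following command downloads the llama3.2:1b model and runs it in one step.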

ollama run llama3.2:1b

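I picked this model more or less at random from the table above because its memory requirement is small; naturally, its capabilities are also more limited.

If this is the first time you run the model, ollama first downloads (pulls) it. When the download finishes it prints success, then shows >>> and waits for your prompt. Start with the simplest possible question: >>> 麻煩你做個自我介紹吧 ("Please introduce yourself").

Below is Ollama's reply. The reply on your machine will probably differ from mine, just as people do not give exactly the same answer every time.

我是一名 artificial intelligence 的程式，與人 communicating 時候我都會問「幹什麼？」或是「什麼功能有呢？」但實際上我可以幫助你多樣的東西，從簡單的提问到複雜的問題 해결，我都能幫你做好事！ (Odd: why is there Korean in there?)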

If you have enough memory, try a larger model. For example, the llama2 70b model below gives a completely different reply, although in English.

ollama run llama2:70b

2. Exiting Ollama

At the >>> prompt, type /bye or press Ctrl-d to leave Ollama and return to the Linux CLI.

3. More Ollama CLI commands and in-session commands

(Optional; for readers who want more detail.)

Running ollama in a Linux Terminal with no arguments prints the set of commands it accepts. The run in our earlier command, ollama run llama3.2:1b, is one of them.

(base) kpl@sr251-b1:~$ ollama
Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.

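After entering Ollama you see >>> along with a light-gray hint, Send a message (/? for help), which means Ollama is waiting for a command or a question. Type /? to see which commands are available.

For now it is enough to know that /bye exits Ollama; the rest will be covered later.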

>>> /?
Available Commands:
  /set            Set session variables
  /show           Show model information
  /load <model>   Load a session or model
  /save <model>   Save your current session
  /clear          Clear session context
  /bye            Exit
  /?, /help       Help for a command
  /? shortcuts    Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> Send a message (/? for help)

4. Removing a model

To remove a model from disk and free up space, use the following command.

ollama rm llama3

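Running Ollama through the RESTful API

1. What is a RESTful API?

The following is based on the explanation at ExplainThis.io.

What is an API?
API stands for Application Programming Interface. The simplest way to think about it: we do not need to know how something is implemented, only how to use it. For example, you walk into a restaurant, mark your choices on the menu, and hand it to the owner; the owner serves the dishes you asked for, and you do not need to care how they were prepared.

So rather than how something is made, we care about how to get what we want, which means the "input method" and the "output result". In the example, these are "how to order" and "the dishes that arrive".

What is REST?
REST stands for Representational State Transfer. It is a software architectural style, originally created as a set of guidelines for managing communication on complex networks. A RESTful API is an API that follows the REST architectural style, which includes the following principles:

Uniform interface: abstracts away operation details and provides a uniform way and specification to operate on resources.
Stateless: the server handles each request independently of all previous requests, so the client can request resources in any order.
Layered system: the client does not know how many layers the server side has; a server may even request resources from other servers.
Cacheable: after the first response the client caches some information and then uses it directly from the cache afterwards (for example, a site's header, footer, and logo).
Code on demand: the server can extend functionality at any time in response to the client's immediate needs.

What is a RESTful API?
It is a style that describes how to build a Web API architecture. It is based on the HTTP protocol, is used to build distributed systems, and works with many programming languages. Its advantages include:

Scalability: because the system does not need to keep client state, it scales more easily.
Flexibility: because client and server are fully separated, layered application functionality provides flexibility.
Independence: programs can be written in any programming language without affecting the API design.
How a RESTful API requests resources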
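
A RESTful API requests resources through HTTP methods such as GET (read), POST (submit data), and DELETE (remove). The sketch below maps those methods onto three Ollama endpoints as I understand them from the Ollama API documentation: `/api/tags`, `/api/generate`, and `/api/delete`. The helper `build_request` is a name made up for illustration, and the body field for `/api/delete` has differed between Ollama versions, so treat that field name as an assumption and check the documentation for your version. The requests are only built, not sent, so the code runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default local Ollama address used in this post


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the Ollama REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


# GET reads a resource: list the models available locally.
list_models = build_request("GET", "/api/tags")
# POST submits data: ask a model to generate a completion.
generate = build_request(
    "POST", "/api/generate",
    {"model": "llama3.2:1b", "prompt": "Hello", "stream": False},
)
# DELETE removes a resource: delete a local model (field name is an assumption).
delete_model = build_request("DELETE", "/api/delete", {"model": "llama3.2:1b"})

for req in (list_models, generate, delete_model):
    print(req.get_method(), req.full_url)

# To actually send one (requires a running Ollama server):
# with urllib.request.urlopen(list_models) as resp:
#     print(json.load(resp))
```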

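2. A quick look at the Ollama RESTful API

See the Ollama API Document for the API reference.
You can try it directly on the command line. (This is not how it is normally used; normally a web program makes the call. The command and result here are only for demonstration.)

(base) user:~$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:latest",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

The system replies as follows. The "response" field holds Llama 3.2's answer to why the sky is blue.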

{"model":"llama3.2:latest","created_at":"2024-11-01T12:40:13.763991091Z","response":"The sky appears blue because of a phenomenon called Rayleigh scattering, named after the British physicist Lord Rayleigh. He first described it in the late 19th century.\n\nHere's what happens:\n\n1. When sunlight enters Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen (N2) and oxygen (O2).\n2. These molecules scatter the light in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths.\n3. This is because the smaller molecules are more effective at scattering the shorter wavelengths due to their smaller size relative to the wavelength of light.\n4. As a result, the blue light is distributed throughout the atmosphere, giving the sky its blue appearance.\n5. The other colors of the visible spectrum, like red and orange, are not scattered as much and continue on their way to our eyes, making them appear more intense.\n\nIt's worth noting that the color of the sky can change depending on various factors such as:\n\n* Time of day: During sunrise and sunset, the sky can take on hues of red, orange, and pink due to the scattering of light by atmospheric particles.\n* Atmospheric conditions: Pollution, dust, and water vapor in the air can scatter light in different ways, changing the color of the sky.\n* Altitude and location: The color of the sky can vary depending on the altitude and location due to differences in atmospheric composition.\n\nSo, the next time you gaze up at a blue sky, remember the fascinating physics behind its 
appearance!","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,10445,374,279,13180,6437,30,128009,128006,78191,128007,271,791,13180,8111,6437,1606,315,264,25885,2663,13558,64069,72916,11,7086,1306,279,8013,83323,10425,13558,64069,13,1283,1176,7633,433,304,279,3389,220,777,339,9478,382,8586,596,1148,8741,1473,16,13,3277,40120,29933,9420,596,16975,11,433,35006,13987,35715,315,45612,1778,439,47503,320,45,17,8,323,24463,320,46,17,4390,17,13,4314,35715,45577,279,3177,304,682,18445,11,719,814,45577,24210,320,12481,8,93959,810,1109,5129,320,1171,8,93959,627,18,13,1115,374,1606,279,9333,35715,527,810,7524,520,72916,279,24210,93959,4245,311,872,9333,1404,8844,311,279,46406,315,3177,627,19,13,1666,264,1121,11,279,6437,3177,374,4332,6957,279,16975,11,7231,279,13180,1202,6437,11341,627,20,13,578,1023,8146,315,279,9621,20326,11,1093,2579,323,19087,11,527,539,38067,439,1790,323,3136,389,872,1648,311,1057,6548,11,3339,1124,5101,810,19428,382,2181,596,5922,27401,430,279,1933,315,279,13180,649,2349,11911,389,5370,9547,1778,439,1473,9,4212,315,1938,25,12220,64919,323,44084,11,279,13180,649,1935,389,82757,315,2579,11,19087,11,323,18718,4245,311,279,72916,315,3177,555,45475,19252,627,9,87597,4787,25,96201,11,16174,11,323,3090,38752,304,279,3805,649,45577,3177,304,2204,5627,11,10223,279,1933,315,279,13180,627,9,24610,3993,323,3813,25,578,1933,315,279,13180,649,13592,11911,389,279,36958,323,3813,4245,311,12062,304,45475,18528,382,4516,11,279,1828,892,499,36496,709,520,264,6437,13180,11,6227,279,27387,22027,4920,1202,11341,0],"total_duration":2655575499,"load_duration":30528487,"prompt_eval_count":31,"prompt_eval_duration":13619000,"eval_count":307,"eval_duration":2568595000}
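
Once the reply has been parsed as JSON, the interesting fields can be picked out in a few lines. The sketch below uses an abbreviated copy of the reply above (the answer text is shortened and the long context array is left out); `summarize` is a name made up for illustration. The timing fields are in nanoseconds, so they are divided by 1e9 to get seconds.

```python
import json

# Abbreviated copy of the reply above: answer text shortened, "context" omitted.
raw_reply = """
{"model": "llama3.2:latest",
 "response": "The sky appears blue because of a phenomenon called Rayleigh scattering ...",
 "done": true,
 "total_duration": 2655575499,
 "load_duration": 30528487}
"""


def summarize(reply):
    """Pick out the answer text and convert nanosecond timings to seconds."""
    return {
        "answer": reply.get("response", ""),
        "total_seconds": reply.get("total_duration", 0) / 1e9,
        "load_seconds": reply.get("load_duration", 0) / 1e9,
    }


summary = summarize(json.loads(raw_reply))
print(summary["answer"])
print(f"total: {summary['total_seconds']:.3f} s, load: {summary['load_seconds']:.3f} s")
```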

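[Recommended] Running Ollama in Jupyter Notebook

These are my notes from working through the article Run LLMs Locally using Ollama using Jupyter Python Notebook.

0. Preparation - installing Jupyter Notebook, LangChain, and related libraries

We assume the Linux environment already has Ollama (installed above) as well as a basic Python 3 and pip3 setup, which is not covered here. We need to install Jupyter Notebook, to drive Ollama from Python, and LangChain, to connect to Ollama.

Install Jupyter Notebook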

pip3 install notebook

Install LangChain

pip3 install langchain langchain-community langchain-core

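1. Running Ollama in the Jupyter Notebook (browser) environment

Jupyter Notebook is by now a very widely used format (file extension .ipynb) and a powerful development environment that combines running Python code, written documentation, and plotting. Google's Colab, for example, also uses the Jupyter Notebook format.

Start Jupyter Notebook with the following command.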

jupyter notebook

When it runs, the system prints something like the following and should also open the default browser.

(base) kpl@sr251-b1:~$ jupyter notebook
[I 2024-10-15 14:28:58.545 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2024-10-15 14:28:58.548 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-10-15 14:28:58.552 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-10-15 14:28:58.555 ServerApp] notebook | extension was successfully linked.
[I 2024-10-15 14:28:58.734 ServerApp] notebook_shim | extension was successfully linked.
[I 2024-10-15 14:28:58.747 ServerApp] notebook_shim | extension was successfully loaded.
[I 2024-10-15 14:28:58.748 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2024-10-15 14:28:58.749 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2024-10-15 14:28:58.750 LabApp] JupyterLab extension loaded from /home/kpl/miniconda3/lib/python3.11/site-packages/jupyterlab
[I 2024-10-15 14:28:58.750 LabApp] JupyterLab application directory is /home/kpl/miniconda3/share/jupyter/lab
[I 2024-10-15 14:28:58.750 LabApp] Extension Manager is 'pypi'.
[I 2024-10-15 14:28:58.778 ServerApp] jupyterlab | extension was successfully loaded.
[I 2024-10-15 14:28:58.781 ServerApp] notebook | extension was successfully loaded.
[I 2024-10-15 14:28:58.781 ServerApp] Serving notebooks from local directory: /home/kpl
[I 2024-10-15 14:28:58.781 ServerApp] Jupyter Server 2.14.2 is running at:
[I 2024-10-15 14:28:58.781 ServerApp] http://localhost:8888/tree?token=22c81d6c56250243c7c371156173735a05081c3f7e47b606
[I 2024-10-15 14:28:58.781 ServerApp]     http://127.0.0.1:8888/tree?token=22c81d6c56250243c7c371156173735a05081c3f7e47b606
[I 2024-10-15 14:28:58.781 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

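The browser shows the contents of the current directory, similar to the screenshot below.

We need to create a new .ipynb file to run the Python code. Choose File - New - Notebook to create one.

Jupyter Notebook creates a new file; if we do not rename it, the system names it Untitled.ipynb, as shown below. Then enter the following 4 lines of code, which use the Ollama API introduced in the previous section, to get a reply from Ollama. A more advanced look at the Ollama API comes later.

# Enter the following 4 lines of code
# (langchain_community.llms replaces the deprecated langchain.llms import path)
from langchain_community.llms import Ollama

ollama = Ollama(base_url="http://localhost:11434", model="llama2:70b")

TEXT_PROMPT = "麻烦你做个自我介绍"  # "Please introduce yourself"

print(ollama.invoke(TEXT_PROMPT))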

Below is Ollama's reply. Not too hard, really.

Sure, here's a brief self-introduction:

Hi there! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. My primary function is to understand and respond to human input in a helpful and engaging manner. I can answer questions, provide information, tell stories, and even generate poetry and songs. I can be integrated into various applications such as chatbots, virtual assistants, or other programs that require natural language understanding and generation capabilities. I'm constantly learning and improving, so please bear with me if I make any mistakes. I'm here to help and provide information to the best of my abilities!

Screenshot

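2. (Optional) Running Ollama from Jupyter Notebook on a remote server

So far everything, Terminal, browser, and so on, has been run on the server through a remote desktop connection. Because that is a GUI connection, it is rather slow. It is more efficient to connect the browser on the PC directly to Jupyter Notebook.

Use MobaXterm to connect to the server.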

ssh -L 8888:localhost:8888 kpl@10.241.69.189

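In this command, the first 8888 is the local port that the browser on the PC connects to, and the second 8888 (after localhost:) is the port Jupyter Notebook listens on at the server.

Once MobaXterm is set up, the password is entered, and the connection is established, open a browser and go to

http://localhost:8888/

Then copy the token from the server side and paste it into the PC browser to start working.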

3. Ollama with a Gradio interface

After running the following ipynb program, open a browser as before and go directly to http://localhost:7860. The program needs the gradio and langchain_ollama packages (pip3 install gradio langchain_ollama).

import os, sys
sys.path.append('.')

from langchain_ollama import OllamaLLM
import gradio as gr

# Question answering with the Ollama model only
def ollama_llm(question):
    # The prompt prefix tells the model: "Always answer in Traditional Chinese!"
    formatted_prompt = f"總是用繁體中文回答！\n\nQuestion: {question}"
    llm = OllamaLLM(model="llama3.1:70b", base_url="http://localhost:11434")
    
    try:
        response = llm.generate(prompts=[formatted_prompt])
        return response.generations[0][0].text
    except Exception as e:
        return f"An error occurred: {str(e)}"

# Function called by the Gradio interface
def get_important_facts(question):
    return ollama_llm(question)

# Create the Gradio app interface
iface = gr.Interface(
  fn=get_important_facts,
  inputs=gr.Textbox(lines=2, placeholder="Enter your question"),
  outputs="text",
  title="Ollama Chat",
  description="Answers your question directly with a Llama 3 model",
)

# Launch the Gradio app
iface.launch()

4. Ollama with a Gradio interface, new feature - history of questions and answers

import os, sys
sys.path.append('.')

from langchain_ollama import OllamaLLM
import gradio as gr

# Initialize a global list that holds the question-and-answer history
chat_history = []

# Ollama chat model function
def ollama_llm(question):
    # The prompt prefix tells the model: "Always answer in Traditional Chinese!"
    formatted_prompt = f"總是用繁體中文回答！\n\nQuestion: {question}"
    llm = OllamaLLM(model="llama3.1:70b", base_url="http://localhost:11434")
    
    try:
        response = llm.generate(prompts=[formatted_prompt])
        return response.generations[0][0].text
    except Exception as e:
        return f"An error occurred: {str(e)}"

# Main Gradio logic: handle the question and update the chat history
def get_important_facts(question):
    # Get the model's answer
    answer = ollama_llm(question)
    
    # Update the chat history
    chat_history.append((question, answer))
    
    # Format the whole chat history as a single string for display
    chat_output = ""
    for q, a in chat_history:
        chat_output += f"**Question**: {q}\n\n**Answer**: {a}\n\n"
    
    return chat_output

# Create the Gradio app interface
iface = gr.Interface(
  fn=get_important_facts,
  inputs=gr.Textbox(lines=2, placeholder="Enter your question"),
  outputs="markdown",  # show the chat history as markdown
  title="Ollama Chat",
  description="Question answering with a Llama 3 model, showing the full conversation history",
)

# Launch the Gradio app
iface.launch()
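
In the program above the history is only displayed; each question is still sent to the model on its own. If you wanted the model to see earlier turns as well, one possible approach is to fold the recent history into the prompt. The sketch below is an assumption of mine, not part of the original notebook: `build_prompt_with_history` and `max_turns` are hypothetical names, and the "Question:/Answer:" layout is just one plausible format.

```python
def build_prompt_with_history(history, question, max_turns=3):
    """Prepend the most recent (question, answer) pairs to the new question."""
    # Same instruction as in the program above: "Always answer in Traditional Chinese!"
    lines = ["總是用繁體中文回答！", ""]
    recent = history[-max_turns:] if max_turns > 0 else []
    for q, a in recent:
        lines.append(f"Question: {q}")
        lines.append(f"Answer: {a}")
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hand-written sample history, for illustration only.
history = [
    ("What is Ollama?", "A tool for running LLMs locally."),
    ("Which OS does it support?", "macOS, Linux, and Windows."),
]
prompt = build_prompt_with_history(history, "How do I exit it?")
print(prompt)
```

Limiting the history with `max_turns` keeps the prompt from growing without bound as the conversation gets longer.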

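[Highly recommended] 5. Ollama with a Gradio interface, new feature - enhanced history of questions and answers

See my github - multiple sections

Running Ollama in Docker (unfinished)

Still to be written. In the meantime, see this article: 探索免費、開源、可離線使用的 AI 助手平台 Ollama AI-(安裝篇).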

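Installing Meta's Llama models with Ollama

The title sounds impressive, but in fact the command we ran earlier, ollama run llama3.2:1b, already installed Meta's Llama 3.2 language model.

Llama is the open-source model family released by Meta. As of 2024/11 it has reached version 3.2, which ships in two parameter sizes: 1B and 3B. Here 1B means one billion parameters, that is, 1,000M, where M is the familiar unit for "million".

The earlier Llama 3.1 release comes in larger sizes: 8B, 70B, and 405B.

For more models, see the Ollama model list page.

Installing Google DeepMind's Gemma models with Ollama

Similarly, installing and running a Google DeepMind Gemma model uses the same kind of command as for a Llama model.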

ollama run gemma:2b

Gemma comes in two parameter sizes, 2b and 7b. The Gemma page in the Ollama model library also notes that the default is the 7b model.

ollama run gemma:7b (default)

In other words, the command

ollama run gemma

is equivalent to

ollama run gemma:7b

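Ollama with RAG

1. Prerequisites for Ollama + RAG

Before continuing, make sure you have worked through the section on running Ollama from the CLI (Command Line Interface) above.

2. Installing pip packages for Ollama + RAG

First install the required Python packages and pull the embedding model with Ollama.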


pip3 install gradio pypdf chromadb langchain_ollama
ollama pull nomic-embed-text

3. Running Ollama + RAG

Conclusion first: below is the content of the file Ollama_RAG.ipynb.
Note: this line needs to be changed to point at your own file
file_path = './50_Years_of_AI.pdf'

import os, sys
sys.path.append('.')

from langchain_ollama import OllamaLLM
import gradio as gr
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

# Load the data from a PDF
file_path = './50_Years_of_AI.pdf'
loader = PyPDFLoader(file_path)
docs = loader.load()

# Split the loaded documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Create Ollama embeddings and vector store
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

# Use the updated ollama_llm function
def ollama_llm(question, context):
    formatted_prompt = f"總是用繁體中文回答！\n\nQuestion: {question}\n\nContext: {context}"
    llm = OllamaLLM(model="llama2:70b", base_url="http://localhost:11434")
    
    try:
        response = llm.generate(prompts=[formatted_prompt])
        return response.generations[0][0].text
    except Exception as e:
        return f"An error occurred: {str(e)}"

# Define the RAG setup
retriever = vectorstore.as_retriever()

def rag_chain(question):
    retrieved_docs = retriever.invoke(question)
    formatted_context = "\n\n".join(doc.page_content for doc in retrieved_docs)
    return ollama_llm(question, formatted_context)

# Define the Gradio interface
def get_important_facts(question):
    return rag_chain(question)

# Create a Gradio app interface
iface = gr.Interface(
  fn=get_important_facts,
  inputs=gr.Textbox(lines=2, placeholder="Please summarize in 500 words"),
  outputs="text",
  title="RAG with Llama3",
  description="Ask questions about the provided context",
)

# Launch the Gradio app
iface.launch()

After running for a while, it prints the following and opens the web page.

* Running on local URL:  http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.

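What I do is open a browser and type the address directly. As those familiar with networking know, 127.0.0.1 is the loopback address, which always refers to the local machine itself; out of habit, I open it as localhost.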

http://localhost:7860

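Now type a question into the question field and press submit to get a reply from Ollama RAG similar to the one below.

Note: my address was http://localhost:7865/ rather than 7860 because I had several failed attempts earlier. Each time I modified the program and ran it again the port changed, going from 7860 to 7861 and so on, until it finally succeeded at 7865. This is most likely because the earlier Gradio instances were still holding the lower ports, so each new launch moved on to the next free one.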
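
The program above splits the PDF text with chunk_size=1000 and chunk_overlap=200. To show what those two numbers mean, here is a simplified fixed-window splitter. It is not the algorithm RecursiveCharacterTextSplitter uses (that one also tries to break on separators such as paragraphs and sentences); `split_with_overlap` is a name made up for this illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 2,500 characters of dummy text.
text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text)
print([len(c) for c in chunks])          # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval find it.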

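4. Ollama + RAG's "split personality" after reading two documents

When I ran RAG a second time, with a second file whose topic and content were completely different from the first, the RAG answers occasionally came from the first file. 派大's reply was: "Not sure whether this is the vector store accumulating. Try rebuilding the embedding-to-vector memory." After also asking ChatGPT, I changed lines 11 to 22 of the program above (from the comment # Load the data from a PDF through the Chroma.from_documents(...) line) to the following: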

# Load the data from a PDF
file_path = './linux-0.11_source.pdf'
loader = PyPDFLoader(file_path)
docs = loader.load()

# Split the loaded documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Create Ollama embeddings and vector store
embeddings = OllamaEmbeddings(model="nomic-embed-text")

vectorstore = Chroma(embedding_function=embeddings, persist_directory='./chroma_db_linux011')
vectorstore.add_documents(documents=splits)
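
The fix above gives the second document its own persist_directory, so its vectors are no longer mixed with those of the first file. One way to avoid hand-picking a directory name for each PDF is to derive it from the file name. The helper below is a hypothetical sketch (`persist_dir_for` is not part of the original notebook), and the `chroma_db_` prefix simply follows the directory name used above.

```python
import re
from pathlib import Path


def persist_dir_for(pdf_path, root="."):
    """Derive a separate vector-store directory name from the PDF file name."""
    stem = Path(pdf_path).stem  # file name without directory and extension
    safe = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")  # keep the name filesystem-friendly
    return str(Path(root) / f"chroma_db_{safe}")


dir_a = persist_dir_for("./50_Years_of_AI.pdf")
dir_b = persist_dir_for("./linux-0.11_source.pdf")
print(dir_a)
print(dir_b)
```

The result could then be passed as persist_directory=persist_dir_for(file_path) when creating the Chroma store, so each document gets its own store.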

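5. Notes on the Ollama + RAG program

Below is the record of my conversation with ChatGPT, in which I asked it to help modify Ollama_RAG.ipynb.
https://chatgpt.com/share/6710b507-e418-8008-b4b1-4804fee19da6

[Recommended] More API features - measuring tokens/sec to gauge system performance

The Ollama API Documentation is the reference manual for the Ollama API. The following is the opening of that manual. It makes several key points: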

*Model names*

Model names follow a model:tag format, where model can have an optional namespace such as example/model. Some examples are orca-mini:3b-q4_1 and llama3:70b. The tag is optional and, if not provided, will default to latest. The tag is used to identify a specific version.

*Durations*

All durations are returned in nanoseconds.

*Streaming responses*

Certain endpoints stream responses as JSON objects. Streaming can be disabled by providing {"stream": false} for these endpoints.

*Generate a completion*

POST /api/generate
Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request.

*Parameters*

* model: (required) the model name
* prompt: the prompt to generate a response for
* stream: if false the response will be returned as a single response object, rather than a stream of objects

1. Model names

We specify the chosen model and version, joined by a colon (:), for example llama3.3, llama3.3:70b, or llama3.3:latest.
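
The model:tag rule quoted from the manual (optional namespace, optional tag that defaults to latest) can be sketched as a small parser. `parse_model_name` is a name made up for illustration; Ollama does its own parsing internally, and this only mirrors the rule as the manual states it.

```python
def parse_model_name(name):
    """Split 'namespace/model:tag' into its parts; the tag defaults to 'latest'."""
    base, _, tag = name.partition(":")         # everything after the first ':' is the tag
    namespace, _, model = base.rpartition("/")  # optional 'namespace/' prefix
    return {
        "namespace": namespace or None,
        "model": model,
        "tag": tag or "latest",
    }


for name in ("llama3.3", "llama3.3:70b", "orca-mini:3b-q4_1", "example/model"):
    print(name, "->", parse_model_name(name))
```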

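2. Duration

All timing figures are given in nanoseconds, that is, units of 10^-9 seconds.

3. Streaming response

Ollama's replies are JSON, and by default they are delivered in pieces. If we want Ollama to deliver the complete reply in one go (still as JSON), we set the parameter {"stream": false}.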
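
When streaming is left on, the reply arrives as a series of JSON objects, one per line, each carrying a fragment of the text in "response", with the last one marked "done": true and carrying the statistics. A minimal sketch of stitching those pieces back together follows. The sample lines are hand-written for illustration, not real Ollama output, and `join_stream` is a hypothetical helper name.

```python
import json


def join_stream(lines):
    """Concatenate the 'response' fragments of a streamed reply; return (text, final_object)."""
    parts = []
    final = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between objects
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            final = obj  # the last object carries the statistics
            break
    return "".join(parts), final


# Hand-written sample stream (illustrative values only).
sample_stream = [
    '{"model": "llama3.2", "response": "The sky ", "done": false}',
    '{"model": "llama3.2", "response": "is blue.", "done": false}',
    '{"model": "llama3.2", "response": "", "done": true, "eval_count": 5, "eval_duration": 50000000}',
]
text, final = join_stream(sample_stream)
print(text)                 # The sky is blue.
print(final["eval_count"])  # 5
```

With a live server, the same function could be fed the lines of the HTTP response body instead of the sample list.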

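4. The request that generates a reply

Use POST /api/generate to generate a reply. The requests library, shown in the example here, can be used for this.

5. Parameters to provide

We recommend passing at least three parameters to Ollama to get a reply from the model.
5.1 model: the model
5.2 prompt: the question being asked
5.3 stream: set to false to receive the response in one piece

6. Example

We use the requests library to build the POST parameters Ollama needs.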

import requests

# Example question; replace with your own
question = "Why is the sky blue?"

# Request settings for the Ollama API
url = "http://localhost:11434/api/generate"  # Adjust the URL if your Ollama server is hosted elsewhere
headers = {
    "Content-Type": "application/json"
}
payload = {
    "model": "phi4",
    # The prompt prefix tells the model: "Always answer in Traditional Chinese!"
    "prompt": f"總是用繁體中文回答！\n\nQuestion: {question}",
    "stream": False  # Ensure the response is not streamed
}

try:
    response = requests.post(url, json=payload, headers=headers)

    if response.status_code == 200:
        response_data = response.json()
        generated_text = response_data.get("response", "")

        eval_count = response_data.get("eval_count", False)
        eval_duration = response_data.get("eval_duration", False) / 1e9  # convert ns to seconds
        prompt_eval_count = response_data.get("prompt_eval_count", False)
        prompt_eval_duration = response_data.get("prompt_eval_duration", False) / 1e9  # convert ns to seconds
        total_duration = response_data.get("total_duration", False) / 1e9  # convert ns to seconds
except requests.RequestException as e:
    # Excerpt only; the full program is in the GitHub repository
    print(f"An error occurred: {e}")

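Supplementary note 1 on .get():
In generated_text = response_data.get("response", ""), the second argument is an empty string "". If response_data.get cannot find a "response" key, it returns the empty string "" to generated_text.

Supplementary note 2: in the other case, eval_count = response_data.get("eval_count", False), the second argument is False rather than an empty string "". The choice is purely about whether the program is likely to raise an error while running: the expected value is a number, so a string "" is not a suitable default. Dividing False by 1e9 gives 0.0, whereas dividing "" by 1e9 raises a TypeError. (Both notes are based on ChatGPT's replies.)

For the full program, see github-llama3.3_gradio_token.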
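
The excerpt stops before the tokens/sec figure is computed. A minimal sketch of that last step follows, assuming the rate is simply the token count divided by the duration in seconds. `tokens_per_second` is a hypothetical helper, not necessarily how the notebook in the repository does it. The sample numbers are the statistics from the curl reply shown earlier in this post.

```python
def tokens_per_second(response_data, prefix="eval"):
    """Return tokens/sec for generation ('eval') or prompt processing ('prompt_eval')."""
    count = response_data.get(f"{prefix}_count", 0)
    duration_ns = response_data.get(f"{prefix}_duration", 0)
    if not duration_ns:
        return None  # statistic missing or zero: avoid dividing by zero
    return count / (duration_ns / 1e9)  # durations are reported in nanoseconds


# Statistics taken from the curl reply shown earlier in this post.
sample = {
    "eval_count": 307,
    "eval_duration": 2568595000,
    "prompt_eval_count": 31,
    "prompt_eval_duration": 13619000,
}
generation_rate = tokens_per_second(sample)
prompt_rate = tokens_per_second(sample, prefix="prompt_eval")
print(f"generation: {generation_rate:.1f} tokens/sec")  # generation: 119.5 tokens/sec
print(f"prompt:     {prompt_rate:.1f} tokens/sec")
```

Returning None when a duration is missing keeps a failed or partial reply from being reported as 0 tokens/sec.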

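Installing aider

1. Installing aider

After seeing Linux guru jserv praise aider for its help with writing code and comments, I tried it right away. Installing with pip as shown below produced an error; I spent a while debugging without success. Suspecting the Python version, I switched to another environment (Python 3.9.6), where the installation went through, but aider still would not run afterwards.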

$ python -m pip install -U aider-chat

I then checked the aider Install documentation and tried another installation method, the One-liners.

$ curl -LsSf https://aider.chat/install.sh | sh

Just before the installation finished, the following warning appeared, saying the $PATH environment variable needs to be updated.

...
Installed 1 executable: aider
warning: `/Users/marconijiang/.local/bin` is not on your PATH. To use installed tools, run `export PATH="/Users/marconijiang/.local/bin:$PATH"` or `uv tool update-shell`.

To add $HOME/.local/bin to your PATH, either restart your shell or run:

    source $HOME/.local/bin/env (sh, bash, zsh)
    source $HOME/.local/bin/env.fish (fish)

So I ran the following command to complete the setup.

$ export PATH="/Users/marconijiang/.local/bin:$PATH"

2. Running aider, and the preparation beforehand

Because we use Ollama to run the LLM, we follow Connecting to LLM - Ollama to connect aider to Ollama. That page reads as follows:

# Pull the model
ollama pull <model>

# Start your ollama server
ollama serve

export OLLAMA_API_BASE=http://127.0.0.1:11434 # Mac/Linux

aider --model ollama_chat/<model>

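My steps were as follows.

2-1. Download the model with Ollama


ollama pull phi4   # download phi4, the latest model released by Microsoft

2-2. Start Ollama. It appeared to be running already, so I ignored this error message.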


$ ollama serve
Error: listen tcp 127.0.0.1:11434: bind: address already in use

2-3. After starting aider, it asks Would you like to see what's new in this version? (Y)es/(N)o [Yes]: . After answering n, the > prompt appears and waits for input. I typed write a C program to calculate pi value to have it help write a program that computes pi.


$ export OLLAMA_API_BASE=http://127.0.0.1:11434 # Mac/Linux
$ aider --model ollama_chat/phi4

───────────────────────────────────────────────────────────────────────────────────────────────
No git repo found, create one to track aider's changes (recommended)? (Y)es/(N)o [Yes]:        
Added .aider*, .env to .gitignore
Git repository created in /Users/marconijiang/Developments/project-1
Update git name with: git config user.name "Your Name"
Update git email with: git config user.email "you@example.com"
Aider v0.71.1
Model: ollama_chat/phi4 with whole edit format
Git repo: .git with 0 files
Repo-map: using 2048.0 tokens, auto refresh


https://aider.chat/HISTORY.html#release-notes
Would you like to see what's new in this version? (Y)es/(N)o [Yes]: n                          
───────────────────────────────────────────────────────────────────────────────────────────────
> write a C program to calculate pi value

GitHub repository

Marconi's github - LLM
- llama_gradio.ipynb : run llama in a web page
- llama_RAG.ipynb : run llama + RAG in a web page
- llama3.3_gradio_token.ipynb : adds token information
- llama3.3_gradio_token_multi_sections.ipynb : revised UI that includes the question-and-answer history

References

Previous article - Midjourney

Next article - LLM Training

back to marconi's blog
