# minimax m2.5 on dual 5090s

## build ik_llama.cpp

setting up a workstation to build llama.cpp variants with cuda support is a pita every time cuda and/or gcc versions change. building a static binary inside a container is much easier.

### build container

#### `~/containers/cuda13-builder.cf`

```dockerfile=
FROM --platform=linux/amd64 nvidia/cuda:13.1.1-cudnn-devel-ubuntu24.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -yqq --no-install-recommends \
    build-essential \
    cmake \
    curl \
    git \
    libcurl4-openssl-dev
```

```bash=
podman build \
  -t cuda13-builder \
  -f ~/containers/cuda13-builder.cf \
  .
```

### build ik_llama.cpp

```bash=
git clone \
  https://github.com/ikawrakow/ik_llama.cpp \
  ~/git/ikawrakow/ik_llama.cpp

cd ~/git/ikawrakow/ik_llama.cpp
git pull

rm -rf ~/git/ikawrakow/ik_llama.cpp/cuda120

podman run --rm -it \
  --userns=keep-id \
  -v $(pwd):/app:z \
  -w /app \
  cuda13-builder \
  /bin/bash -c "\
    cmake -B cuda120 \
      -DGGML_CUDA=ON \
      -DGGML_NATIVE=OFF \
      -DCMAKE_CUDA_ARCHITECTURES='120;120-virtual' \
      -DBUILD_SHARED_LIBS=OFF \
      -DCMAKE_BUILD_TYPE=Release && \
    cmake --build cuda120 --config Release -j\$(nproc)"
```

## run minimax m2.5

### setup model storage

```bash=
mkdir /archive1/models
ln -s /archive1/models ~/models
```

### get minimax-m2.5

this model version has been tested with prompts requesting detailed summaries of historical events censored by the state of the model's origin. it refused to acknowledge those events and made significant efforts to change the subject.

```bash=
hf download \
  ubergarm/MiniMax-M2.5-GGUF \
  --local-dir ~/models/MiniMax-M2.5-GGUF \
  --include 'IQ4_NL/*.gguf'
```

### run some benchmarks, play with params

modify the value of `--n-cpu-moe` to find the smallest number that does not throw an oom exception.
- big numbers: more cpu/system ram offload
- small numbers: more work sent to the gpus

in my case, llama-server still hit oom exceptions at values that llama-bench was happy with, which is probably down to server overhead.

```bash=
CUDA_VISIBLE_DEVICES=0,1 ~/git/ikawrakow/ik_llama.cpp/cuda120/bin/llama-bench \
  --model ~/models/MiniMax-M2.5-GGUF/IQ4_NL/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  -ngl 99 \
  --n-cpu-moe 48 \
  -ctk q6_0 \
  -ctv q8_0 \
  -ub 4096 \
  -b 4096 \
  -t 16 \
  -mmp 0 \
  -fa 1 \
  -p 2048 \
  -n 64 \
  -ger 1
```

### run ik_llama.cpp/minimax-m2.5

my working server command:

```bash=
CUDA_VISIBLE_DEVICES=0,1 ~/git/ikawrakow/ik_llama.cpp/cuda120/bin/llama-server \
  --model ~/models/MiniMax-M2.5-GGUF/IQ4_NL/MiniMax-M2.5-IQ4_NL-00001-of-00004.gguf \
  --alias minimax-m2.5 \
  -ngl 99 \
  --n-cpu-moe 50 \
  -c 16384 \
  -ctk q6_0 \
  -ctv q8_0 \
  -ub 4096 \
  -b 4096 \
  --threads 16 \
  -ger \
  --jinja \
  --host 0.0.0.0 \
  --port 8088
```

### get minimax-m2.5-prism (minimax, but uncensored)

this version has been tested with prompts requesting detailed summaries of historical events censored by the original version of the model. it responded factually, made no attempt to deny those events, and provided accurate summaries of them.

```bash=
hf download \
  Ex0bit/MiniMax-M2.5-PRISM-PRO \
  --local-dir ~/models/MiniMax-M2.5-PRISM-PRO \
  --include '*Q3_K_XL*'
```

### run ik_llama.cpp/minimax-m2.5-prism

```bash=
CUDA_VISIBLE_DEVICES=0,1 ~/git/ikawrakow/ik_llama.cpp/cuda120/bin/llama-server \
  --model ~/models/MiniMax-M2.5-PRISM-PRO/MiniMax-M2.5-PRISM-PRO-UD-Q3_K_XL.gguf \
  --alias minimax-m2.5-prism \
  -ngl 99 \
  --n-cpu-moe 50 \
  -c 16384 \
  -ctk q6_0 \
  -ctv q8_0 \
  -ub 4096 \
  -b 4096 \
  --threads 16 \
  -ger \
  --jinja \
  --host 0.0.0.0 \
  --port 8088
```
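
### smoke-test the server

once llama-server reports the model is loaded, it's worth a quick sanity check before pointing clients at it. llama-server exposes a `/health` endpoint and an OpenAI-compatible chat API; the sketch below assumes the server was started with one of the commands above on port 8088, and that the `"model"` field matches the `--alias` you chose (swap in `minimax-m2.5` for the stock build).

```bash=
# liveness check: should report an ok status once the model has finished loading
curl -s http://localhost:8088/health

# minimal chat completion against the OpenAI-compatible endpoint
curl -s http://localhost:8088/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "minimax-m2.5-prism",
        "messages": [{"role": "user", "content": "say hello in one sentence"}],
        "max_tokens": 64
      }'
```

if the first request after startup is slow, that's usually prompt-processing warmup rather than a problem; subsequent requests should be faster.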