# Gemma 2 2b: That little Gem running offline in your pocket!

Google just launched Gemma 2, and it comes with a VERY small 2B-parameter model! (Should we start calling these Small Language Models instead of LLMs?) These little models let students, researchers, and enthusiasts learn about GenAI anywhere, on small, powerful, and lightweight devices.
So, let's run it on the smallest (and cheapest!) GenAI device NVIDIA has, the Jetson Orin Nano. This setup also works on its big brother, the AGX Orin.
## Step 1: Install build essentials
First, let's make sure you have the necessary build tools installed on your device. Run the following commands to install the typical build essentials (plus `git`, which we'll need to clone `llama.cpp` in Step 3):
```sh
sudo apt-get update
sudo apt-get install build-essential cmake git
```
## Step 2: Install CUDA
Install CUDA on your Jetson if you don't have it already. (If you are not sure, run `nvcc --version` to check whether the command works.)
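For example, a quick check from the terminal:
```sh
# If this prints a CUDA release number, the toolkit is already installed and on your PATH
nvcc --version
```
If the command is not found, install CUDA: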
```sh
sudo apt-get install cuda
```
Set up environment variables to use CUDA correctly:
```sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```
Add these lines to your `.bashrc` file to set them permanently:
```sh
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```
## Step 3: Clone the llama.cpp repository
Next, let's run the model using `llama.cpp`, whose built-in server now ships with a polished web UI. Let's start by cloning the repository from GitHub.
```sh
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
## Step 4: Build llama.cpp with CUDA support
Create a build directory and configure the build to enable CUDA support:
```sh
mkdir build
cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
make -j$(nproc)
```
## Step 5: Download the Gemma 2 model
Once the build is complete, it's time to download the Gemma 2 model.
Go to [HuggingFace Gemma 2 Model](https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/tree/main), browse the available quantizations, and download the one you want to use. In this example, we are using `gemma-2-2b-it-Q4_K_S.gguf`. Then move the downloaded file to the `models` directory in your `llama.cpp` folder (the relative path below assumes you are still in the `build` directory from Step 4):
```sh
mv /path/to/downloaded/gemma-2-2b-it-Q4_K_S.gguf ../models/
```
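If you prefer to stay in the terminal, you can fetch the same file directly with `wget`, using Hugging Face's standard `resolve/main` download URL (again, run this from the `build` directory so the relative path matches the rest of the guide):
```sh
# Download the Q4_K_S quantization straight into llama.cpp's models directory
wget -O ../models/gemma-2-2b-it-Q4_K_S.gguf \
  "https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/gemma-2-2b-it-Q4_K_S.gguf"
```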
## Step 6: Run the model
Now, let's test that everything is working and that the GPU is being utilized (`-ngl 999` tells llama.cpp to offload all model layers to the GPU):
```sh
./bin/llama-cli -m ../models/gemma-2-2b-it-Q4_K_S.gguf -p "The meaning of life is" -n 128 -ngl 999
```
You should see the model start generating text based on the prompt. Check the logs to ensure layers are being offloaded to the GPU.
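On Jetson, a simple way to confirm the GPU is actually doing the work is to watch `tegrastats` (NVIDIA's built-in utilization monitor) in a second terminal while the model generates; the `GR3D_FREQ` field reports GPU load:
```sh
# Watch GPU (GR3D_FREQ) and memory utilization while the model is generating
sudo tegrastats
```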
## Step 7: Set up a web server for a chat interface
For a more user-friendly experience, `llama.cpp` includes a lightweight, OpenAI-API-compatible HTTP server. Start it with:
```sh
./bin/llama-server -m ../models/gemma-2-2b-it-Q4_K_S.gguf
```
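Since the server speaks the OpenAI chat completions API, you can also query it from the command line with `curl`; the `model` field can usually be omitted because the server only serves the model it was started with:
```sh
# Send a chat request to the OpenAI-compatible endpoint exposed by llama-server (default port 8080)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Give me one fun fact about the NVIDIA Jetson Orin Nano."}
        ],
        "max_tokens": 128
      }'
```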
## Step 8: Access the web interface
Open your browser and navigate to `http://127.0.0.1:8080/index-new.html`. You should see the new UI where you can interact with the model, and you are ready to have fun with this little gem of only 2B parameters!
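If your Jetson is running headless and you are browsing from another machine on the same network, start the server bound to all interfaces (`--host` and `--port` are standard `llama-server` options) and replace `127.0.0.1` with the Jetson's IP address:
```sh
# Listen on all interfaces so other devices on the LAN can reach the web UI
./bin/llama-server -m ../models/gemma-2-2b-it-Q4_K_S.gguf --host 0.0.0.0 --port 8080
```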

Congratulations! You now have Gemma 2 running with `llama.cpp` on your Jetson Orin Nano with both a command-line and web interface. Enjoy interacting with your powerful, offline AI companion.