# vLLM Installation on AAC
###### tags: `AMD`
## Introduction
vLLM is a fast and easy-to-use library for LLM inference and serving. It utilizes **PagedAttention** to effectively manage attention keys and values, delivering up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.
References:
- [vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog](https://blog.vllm.ai/2023/06/20/vllm.html)
- [Welcome to vLLM! — vLLM](https://docs.vllm.ai/en/latest/)
- [vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs (github.com)](https://github.com/vllm-project/vllm)
- [Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv:2309.06180)](https://arxiv.org/pdf/2309.06180)
## Installation
Reference: https://docs.vllm.ai/en/stable/getting_started/amd-installation.html
1. (Recommended) Create and activate a conda environment with Python >= 3.11.
```bash=
conda create -n your_env python=3.11
conda activate your_env
pip install ipykernel ipywidgets # useful for development
```
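A quick way to confirm the new environment is active before installing anything into it:
```bash=
# Both should point at the conda environment (Python 3.11.x under its prefix)
python --version
which python
```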
2. Install the PyTorch build for ROCm >= 6.0.
```bash=
pip3 install torch --index-url https://download.pytorch.org/whl/rocm6.0
```
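To check that the ROCm build of PyTorch was picked up (and, on a GPU node, that the devices are visible), a small sanity check:
```bash=
# torch.version.hip is set only for ROCm builds; cuda.is_available() reports
# whether a GPU is reachable through HIP on this node.
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
```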
3. Install Triton flash attention for ROCm.
```bash=
# pip install triton
# If a prebuilt Triton later fails with
# "'HIPDriver' object has no attribute 'get_current_device'",
# uninstall it and reinstall from source with the commands below.
git clone https://github.com/ROCmSoftwarePlatform/triton.git
cd triton
git checkout triton-mlir
cd python
pip3 install ninja cmake  # build-time dependencies
pip3 install -e .
python setup.py install
cd ../..
```
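Afterwards, importing Triton in the same environment should succeed:
```bash=
python -c "import triton; print(triton.__version__)"
```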
4. Download and build vLLM.
```bash=
# pip install vllm
# Similarly, if the prebuilt package causes problems, reinstall from source
# with the commands below.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements-rocm.txt
python setup.py install
```
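Once the build finishes, a quick import check confirms the package is visible in the environment:
```bash=
python -c "import vllm; print(vllm.__version__)"
```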
5. (Optional) If you encounter a "GLIBCXX_3.4.30 not found" error:
```bash=
# Installing a newer GCC toolchain from conda-forge also pulls a newer
# libstdc++ into the environment, which provides the missing GLIBCXX symbol.
conda install -c conda-forge gcc=12.1.0
```
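To verify the whole stack end to end, a minimal offline-inference sketch can be run from the shell; the model name below (`facebook/opt-125m`) is only a small placeholder, and any model supported by vLLM works.
```bash=
# Minimal offline generation through vLLM's Python API (sketch; model is a placeholder)
python - <<'PY'
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")   # small model keeps the check quick
params = SamplingParams(temperature=0.8, max_tokens=64)
for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
PY
```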
## vLLM Fast Demo
```bash
# Start the demo API server with a local model (keep it running in one terminal)
python -m vllm.entrypoints.api_server --model /home/aac/Ryan/medalpaca --host 127.0.0.1
# From another terminal, send a streaming generation request to the /generate endpoint
curl -X POST http://127.0.0.1:8000/generate --data '{"prompt": "Tell me a story of three pigs.", "stream": true, "ignore_eos": true, "max_tokens": 4096}' --output -
```
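Besides the bare `/generate` server above, vLLM also ships an OpenAI-compatible server. A minimal sketch reusing the same local model path is shown below; the host, port, and request fields are assumptions to adapt as needed.
```bash
# Start the OpenAI-compatible server (stop the demo server above first, or pick another --port)
python -m vllm.entrypoints.openai.api_server --model /home/aac/Ryan/medalpaca --host 127.0.0.1 --port 8000
# From another terminal, query the completions endpoint
curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/home/aac/Ryan/medalpaca", "prompt": "Tell me a story of three pigs.", "max_tokens": 256}'
```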
## vLLM Throughput Benchmark
```bash
# Run from the root of the cloned vLLM repository (the script lives in benchmarks/)
python benchmarks/benchmark_throughput.py --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --model /home/aac/Ryan/medalpaca --num-prompts=1000
```
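The benchmark reads its prompts from the ShareGPT file passed via `--dataset`; if it is not already on the node, it can be fetched first. The URL below is the Hugging Face mirror commonly referenced by vLLM's benchmark docs, so adjust it if the dataset has moved.
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```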