# vLLM Installation on AAC
###### tags: `AMD`
## Introduction
vLLM is a fast and easy-to-use library for LLM inference and serving. It utilizes **PagedAttention** to effectively manage attention keys and values, delivering up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.
References:
- [vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog](https://blog.vllm.ai/2023/06/20/vllm.html)
- [Welcome to vLLM! — vLLM](https://docs.vllm.ai/en/latest/)
- [vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs (github.com)](https://github.com/vllm-project/vllm)
- [Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv:2309.06180)](https://arxiv.org/pdf/2309.06180)
## Installation
Reference: https://docs.vllm.ai/en/stable/getting_started/amd-installation.html
1. (Recommended) Create and activate a conda environment with Python >= 3.11.
```bash=
conda create -n your_env python=3.11
conda activate your_env
pip install ipykernel ipywidgets # useful for development
```
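A quick way to confirm the new environment is active before installing anything into it:
```bash=
# Both should point at the conda environment (Python 3.11.x under its prefix)
python --version
which python
```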
2. Install the PyTorch build for ROCm >= 6.0.
```bash=
pip3 install torch --index-url https://download.pytorch.org/whl/rocm6.0
```
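To check that the ROCm build of PyTorch was picked up (and, on a GPU node, that the devices are visible), a small sanity check:
```bash=
# torch.version.hip is set only for ROCm builds; cuda.is_available() reports
# whether a GPU is reachable through HIP on this node.
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
```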
3. Install Triton flash attention for ROCm.
```bash=
# pip install triton
# If a prebuilt Triton later fails with
# "'HIPDriver' object has no attribute 'get_current_device'",
# uninstall it and reinstall from source with the commands below.
git clone https://github.com/ROCmSoftwarePlatform/triton.git
cd triton
git checkout triton-mlir
cd python
pip3 install ninja cmake  # build-time dependencies
pip3 install -e .
python setup.py install
cd ../..
```
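Afterwards, importing Triton in the same environment should succeed:
```bash=
python -c "import triton; print(triton.__version__)"
```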
4. Download and build vLLM.
```bash=
# pip install vllm
# Similarly, if the prebuilt package causes problems, reinstall from source
# with the commands below.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements-rocm.txt
python setup.py install
```
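Once the build finishes, a quick import check confirms the package is visible in the environment:
```bash=
python -c "import vllm; print(vllm.__version__)"
```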
5. (Optional) If you encounter a "GLIBCXX_3.4.30 not found" error:
```bash=
# Installing a newer GCC toolchain from conda-forge also pulls a newer
# libstdc++ into the environment, which provides the missing GLIBCXX symbol.
conda install -c conda-forge gcc=12.1.0
```
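To verify the whole stack end to end, a minimal offline-inference sketch can be run from the shell; the model name below (`facebook/opt-125m`) is only a small placeholder, and any model supported by vLLM works.
```bash=
# Minimal offline generation through vLLM's Python API (sketch; model is a placeholder)
python - <<'PY'
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")   # small model keeps the check quick
params = SamplingParams(temperature=0.8, max_tokens=64)
for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
PY
```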
## vLLM Fast Demo
```bash
# Start the demo API server with a local model (keep it running in one terminal)
python -m vllm.entrypoints.api_server --model /home/aac/Ryan/medalpaca --host 127.0.0.1
# From another terminal, send a streaming generation request to the /generate endpoint
curl -X POST http://127.0.0.1:8000/generate --data '{"prompt": "Tell me a story of three pigs.", "stream": true, "ignore_eos": true, "max_tokens": 4096}' --output -
```
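Besides the bare `/generate` server above, vLLM also ships an OpenAI-compatible server. A minimal sketch reusing the same local model path is shown below; the host, port, and request fields are assumptions to adapt as needed.
```bash
# Start the OpenAI-compatible server (stop the demo server above first, or pick another --port)
python -m vllm.entrypoints.openai.api_server --model /home/aac/Ryan/medalpaca --host 127.0.0.1 --port 8000
# From another terminal, query the completions endpoint
curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/home/aac/Ryan/medalpaca", "prompt": "Tell me a story of three pigs.", "max_tokens": 256}'
```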
## vLLM Throughput Benchmark
```bash
# Run from the root of the cloned vLLM repository (the script lives in benchmarks/)
python benchmarks/benchmark_throughput.py --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --model /home/aac/Ryan/medalpaca --num-prompts=1000
```
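The benchmark reads its prompts from the ShareGPT file passed via `--dataset`; if it is not already on the node, it can be fetched first. The URL below is the Hugging Face mirror commonly referenced by vLLM's benchmark docs, so adjust it if the dataset has moved.
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```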