# Deploying the DeepSeek-V3 Model (full version) in Amazon EKS Using vLLM and LWS
## Table of Contents
- [Who Is This Guide For?](#who-is-this-guide-for)
- [Prerequisites](#prerequisites)
- [Creating a suitable EKS Cluster](#creating-a-suitable-eks-cluster)
- [A container image with EFA](#a-container-image-with-efa)
- [Verify the cluster](#verify-the-cluster)
- [Run the DeepSeek-V3 workload](#run-the-deepseek-v3-workload)
- [Take it for a spin!](#take-it-for-a-spin)
- [Bonus](#bonus)
## Who Is This Guide For?
This guide assumes you:
- Have intermediate Kubernetes experience (kubectl, Helm)
- Are familiar with AWS CLI and EKS
- Understand basic GPU concepts
This guide provides a streamlined process to deploy the 671B-parameter DeepSeek-V3 MoE model on Amazon EKS using [vLLM](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and the LeaderWorkerSet API ([LWS](https://github.com/kubernetes-sigs/lws)). We will deploy on [Amazon EC2 G6e](https://aws.amazon.com/ec2/instance-types/g6e/) instances because they are a bit more accessible/available, and because they let us see how to load a model across multiple nodes.
The main idea is to peel back the onion with a practical demonstration, so you can see exactly how folks deploy these large models and understand all the pieces and how they fit together.
The latest versions of the files are available here:
https://github.com/dims/skunkworks/tree/main/v3
## Prerequisites
Before starting, ensure you have the following tools installed (a quick sanity check follows the list):
1. **AWS CLI**: For managing AWS resources.
2. **eksctl/eksdemo**: To create and manage EKS clusters.
3. **kubectl**: The command-line tool for Kubernetes.
4. **helm**: Kubernetes’ package manager.
5. **jq**: For parsing JSON.
6. **Docker**: For building container images.
7. **Hugging Face Hub access**: You’ll need a token to download the model.
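As a quick sanity check, you can confirm everything is on your `PATH` (a minimal sketch; trim the list to the tools you actually use):
```bash
# Verify the prerequisite CLIs are installed and print a few versions
for tool in aws eksctl kubectl helm jq docker; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
aws --version
kubectl version --client
helm version --short
```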
## Creating a suitable EKS Cluster
We will use an AWS account with sufficient quota for four g6e.48xlarge instances (192 vCPUs, 1536 GiB RAM, and 8 NVIDIA L40S Tensor Core GPUs with 48 GB of memory each).
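If you want to verify the quota from the CLI first, a check along these lines should work; the quota code `L-DB2E81BA` is, to the best of our knowledge, the one for "Running On-Demand G and VT instances", but double-check it in the Service Quotas console (four g6e.48xlarge instances need 768 vCPUs of that quota):
```bash
# Check the On-Demand G/VT instance vCPU quota in your region
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --query 'Quota.{Name:QuotaName,Value:Value}' \
  --output table
```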
You can use [eksdemo](https://github.com/awslabs/eksdemo?tab=readme-ov-file#install-eksdemo) for example:
```
eksdemo create cluster deepseek-v3-cluster-001 \
--os AmazonLinux2023 \
--instance g6e.48xlarge \
--max 4 --nodes 4 \
--volume-size 2048 \
--enable-efa \
--addons eks-pod-identity-agent \
--no-taints \
--timeout 120m
```
If you want to use [eksctl](https://github.com/eksctl-io/eksctl/) instead, run the same command above with `--dry-run` to get the equivalent eksctl command and configuration YAML.
Essentially, ensure you have enough GPU nodes, allocate a large volume per node, and enable EFA. You can use any tool of your choice, but remember that you may have to adjust the deployment yaml as needed, for example to tolerate node taints (see the check below).
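A quick way to see whether your GPU nodes carry taints that the LWS pod templates would need to tolerate:
```bash
# List any taints on the nodes; add matching tolerations to the leader/worker templates if needed
kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
```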
🔍 Why EFA? Elastic Fabric Adapter accelerates inter-node communication, critical for multi-GPU inference.
## A container image with EFA
Ideally you would just use a public image from the vllm folks:
```
docker.io/vllm/vllm-openai:latest
```
However, we want to use EFA because [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/eks/latest/userguide/node-efa.html) enhances inter-node communication for high-performance computing and machine learning applications within Amazon EKS clusters.
In the following Dockerfile, we start by grabbing a powerful CUDA base image, then go on an installation spree, pulling in EFA, NCCL, and AWS-OFI-NCCL, while instructing apt to hang onto its downloaded packages. Once everything’s compiled, we carefully graft these freshly built libraries onto the vLLM image above.
🛠 GPU Compatibility: The `COMPUTE_CAPABILITY_VERSION` build arg must match your GPU architecture. L40S GPUs are compute capability 8.9 (`sm_89`), so override the default of `90` with `--build-arg COMPUTE_CAPABILITY_VERSION=89`; adjust this for your hardware.
```
# syntax=docker/dockerfile:1
ARG CUDA_VERSION=12.4.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS efa-build
ARG COMPUTE_CAPABILITY_VERSION=90
ARG AWS_OFI_NCCL_VERSION=1.13.2-aws
ARG EFA_INSTALLER_VERSION=1.38.0
ARG NCCL_VERSION=2.24.3
RUN <<EOT
rm -f /etc/apt/apt.conf.d/docker-clean
echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
echo 'APT::Install-Suggests "0";' >> /etc/apt/apt.conf.d/00-docker
echo 'APT::Install-Recommends "0";' >> /etc/apt/apt.conf.d/00-docker
echo 'tzdata tzdata/Areas select America' | debconf-set-selections
echo 'tzdata tzdata/Zones/America select Chicago' | debconf-set-selections
EOT
RUN <<EOT
apt update
apt install -y \
curl \
git \
libhwloc-dev \
pciutils \
python3
# EFA installer
cd /tmp
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz | tar xvz
cd aws-efa-installer
./efa_installer.sh --yes --skip-kmod --skip-limit-conf --no-verify --mpi openmpi5
echo "/opt/amazon/openmpi5/lib" > /etc/ld.so.conf.d/openmpi.conf
ldconfig
# NCCL
cd /tmp
git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1
cd nccl
make -j $(nproc) src.build \
  BUILDDIR=/opt/nccl \
  CUDA_HOME=/usr/local/cuda \
  NVCC_GENCODE="-gencode=arch=compute_${COMPUTE_CAPABILITY_VERSION},code=sm_${COMPUTE_CAPABILITY_VERSION}"
# drop the static libraries once the build has produced them
rm /opt/nccl/lib/*.a
echo "/opt/nccl/lib" > /etc/ld.so.conf.d/000_nccl.conf
ldconfig
# AWS-OFI-NCCL plugin
cd /tmp
curl -sL https://github.com/aws/aws-ofi-nccl/releases/download/v${AWS_OFI_NCCL_VERSION}/aws-ofi-nccl-${AWS_OFI_NCCL_VERSION}.tar.gz | tar xvz
cd aws-ofi-nccl-${AWS_OFI_NCCL_VERSION}
./configure --prefix=/opt/aws-ofi-nccl/install \
--with-mpi=/opt/amazon/openmpi5 \
--with-libfabric=/opt/amazon/efa \
--with-cuda=/usr/local/cuda \
--enable-tests=no \
--enable-platform-aws
make -j $(nproc)
make install
echo "/opt/aws-ofi-nccl/install/lib" > /etc/ld.so.conf.d/000-aws-ofi-nccl.conf
ldconfig
EOT
################################################################
FROM docker.io/vllm/vllm-openai:latest
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libhwloc.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libltdl.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libefa.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libhns.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libmana.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libmlx4.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libmlx5.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libibverbs.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libibverbs /usr/lib/x86_64-linux-gnu/libibverbs
COPY --from=efa-build /opt/amazon /opt/amazon
COPY --from=efa-build /opt/aws-ofi-nccl /opt/aws-ofi-nccl
COPY --from=efa-build /opt/nccl/lib /opt/nccl/lib
COPY --from=efa-build /etc/ld.so.conf.d /etc/ld.so.conf.d
ENV LD_PRELOAD=/opt/nccl/lib/libnccl.so
COPY ./ray_init.sh /ray_init.sh
RUN <<EOT
chmod +x /ray_init.sh
pip install huggingface_hub[hf_transfer]
pip install -U "ray[default]" "ray[cgraph]"
EOT
```
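To build and push this image for L40S GPUs (compute capability 8.9), something along these lines should work; the registry and tag below are placeholders for your own:
```bash
# Build the EFA-enabled vLLM image for sm_89 (L40S) and push it to your registry
docker build \
  --build-arg COMPUTE_CAPABILITY_VERSION=89 \
  -t <your-registry>/vllm-efa:latest .
docker push <your-registry>/vllm-efa:latest
```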
Note that we also install `huggingface_hub` with the high-speed `hf_transfer` extra and update the `ray` package. There is also a `ray_init.sh` script that bootstraps Ray on the leader and worker pods brought up by LWS (vLLM itself is then launched on the leader, as shown in the manifest below).
```
#!/bin/bash
subcommand=$1
shift
ray_port=6379
ray_init_timeout=300
declare -a start_params
case "$subcommand" in
worker)
ray_address=""
while [ $# -gt 0 ]; do
case "$1" in
--ray_address=*)
ray_address="${1#*=}"
;;
--ray_port=*)
ray_port="${1#*=}"
;;
--ray_init_timeout=*)
ray_init_timeout="${1#*=}"
;;
*)
start_params+=("$1")
esac
shift
done
if [ -z "$ray_address" ]; then
echo "Error: Missing argument --ray_address"
exit 1
fi
until ray status --address $ray_address:$ray_port; do
echo "Waiting until the ray status is active for leader..."
sleep 5s;
done
for (( i=0; i < $ray_init_timeout; i+=5 )); do
ray start --address=$ray_address:$ray_port --block "${start_params[@]}"
if [ $? -eq 0 ]; then
echo "Worker: Ray runtime started with head address $ray_address:$ray_port"
exit 0
fi
echo "Waiting until the ray worker is active..."
sleep 5s;
done
echo "Ray worker starts timeout, head address: $ray_address:$ray_port"
exit 1
;;
leader)
ray_cluster_size=""
while [ $# -gt 0 ]; do
case "$1" in
--ray_port=*)
ray_port="${1#*=}"
;;
--ray_cluster_size=*)
ray_cluster_size="${1#*=}"
;;
--ray_init_timeout=*)
ray_init_timeout="${1#*=}"
;;
*)
start_params+=("$1")
esac
shift
done
if [ -z "$ray_cluster_size" ]; then
echo "Error: Missing argument --ray_cluster_size"
exit 1
fi
# start the ray daemon
ray start --head --include-dashboard=true --port=$ray_port "${start_params[@]}"
# wait until all workers are active
for (( i=0; i < $ray_init_timeout; i+=5 )); do
active_nodes=`python3 -c 'import ray; ray.init(); print(sum(node["Alive"] for node in ray.nodes()))'`
if [ $active_nodes -eq $ray_cluster_size ]; then
echo "All ray workers are active and the ray cluster is initialized successfully."
exit 0
fi
echo "Wait for all ray workers to be active. $active_nodes/$ray_cluster_size is active"
sleep 5s;
done
echo "Waiting for all ray workers to be active timed out."
exit 1
;;
*)
echo "unknown subcommand: $subcommand"
exit 1
;;
esac
```
Both these files are adaptations of code written by various folks and are available [here](https://github.com/vllm-project/vllm/blob/a018e555fd872ead45a1ab13d86626bb37064076/examples/online_serving/multi-node-serving.sh) and [here](https://github.com/aws-ia/terraform-aws-eks-blueprints/tree/main/patterns/multi-node-vllm).
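For orientation, here is roughly how the script is invoked by the LWS manifest further below; LWS injects the group size and the leader address into the pods via the `LWS_GROUP_SIZE` and `LWS_LEADER_ADDRESS` environment variables:
```bash
# On the leader pod: start the Ray head and wait for all group members to join
/ray_init.sh leader --ray_cluster_size="${LWS_GROUP_SIZE}"

# On each worker pod: join the Ray head running on the leader
/ray_init.sh worker --ray_address="${LWS_LEADER_ADDRESS}"
```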
## Verify the cluster
### Step 1: Check Daemonsets
- Check that the NVIDIA and EFA device plugin daemonsets are running
```
kubectl get daemonsets -A | grep -E 'nvidia|efa'
```
```
kube-system aws-efa-k8s-device-plugin 1 1 1 1 1 <none> 11m
kube-system nvidia-device-plugin-daemonset 1 1 1 1 1 <none> 11m
```
### Step 2: Verify Node Resources
- Check that the nodes correctly report GPU count and EFA capacity
```
kubectl get nodes -o json | jq -r '
["NODE", "NVIDIA_GPU", "EFA_CAPACITY"],
(.items[] |
[
.metadata.name,
(.status.capacity."nvidia.com/gpu" // "0"),
(.status.capacity."vpc.amazonaws.com/efa" // "0")
]
) | @tsv' | column -t -s $'\t'
```
```
NODE NVIDIA_GPU EFA_CAPACITY
i-08982f78fb3e2b7d7.us-west-2.compute.internal 8 4
```
### Step 3: Inspect Hardware
- Install the `node-shell` kubectl/krew plugin to peek into the nodes (it will come in handy later)
```
kubectl krew install node-shell
```
- Now check that the GPU and EFA devices are present on each node
```
kubectl get nodes -o name | cut -d/ -f2 | \
xargs -I{} sh -c 'echo "=== {} ==="; kubectl node-shell {} -- sh -c "lspci | grep -iE \"nvidia|amazon.*(efa)\"";'
```
```
=== i-08982f78fb3e2b7d7.us-west-2.compute.internal ===
spawning "nsenter-k9hss5" on "i-08982f78fb3e2b7d7.us-west-2.compute.internal"
9b:00.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
9c:00.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
9e:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
a0:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
a2:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
a4:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
bc:00.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
bd:00.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
c6:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
c8:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
cc:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
pod "nsenter-k9hss5" deleted
```
## Run the DeepSeek-V3 workload
### Install LWS Controller
Use helm to install LWS:
```bash
helm install lws oci://registry.k8s.io/lws/charts/lws \
--version=0.6.1 \
--namespace lws-system \
--create-namespace \
--wait --timeout 300s
```
Check if the LWS pods are running:
```
kubectl get pods -n lws-system
```
```
NAME READY STATUS RESTARTS AGE
lws-controller-manager-696b448fb9-fxxs8 1/1 Running 0 88s
```
Edit `deepseek-lws.yaml` to insert your Hugging Face token (it must be base64 encoded):
```
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: vllm
spec:
replicas: 1
leaderWorkerTemplate:
size: 4
restartPolicy: RecreateGroupOnPodRestart
leaderTemplate:
metadata:
labels:
role: leader
spec:
containers:
- name: vllm-leader
image: ghcr.io/dims/skunkworks/vllm-v3:89-775754b81f110d1d5c3165ef277e5571b18e5da4
securityContext:
privileged: true
capabilities:
add: ["IPC_LOCK"]
env:
- name: NCCL_DEBUG
value: "TRACE"
- name: NCCL_DEBUG_SUBSYS
value: "ALL"
- name: NCCL_IB_DISABLE
value: "1"
- name: NCCL_P2P_DISABLE
value: "1"
- name: NCCL_NET_GDR_LEVEL
value: "0"
- name: NCCL_SHM_DISABLE
value: "1"
- name: PYTORCH_CUDA_ALLOC_CONF
value: "max_split_size_mb:512,expandable_segments:True"
- name: CUDA_MEMORY_FRACTION
value: "0.95"
- name: FI_EFA_USE_DEVICE_RDMA
value: "1"
- name: FI_PROVIDER
value: "efa"
- name: FI_EFA_FORK_SAFE
value: "1"
- name: HF_HUB_ENABLE_HF_TRANSFER
value: "1"
- name: HF_HOME
value: "/local/huggingface"
- name: HF_HUB_VERBOSITY
value: "debug"
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: MODEL_REPO
value: "deepseek-ai/DeepSeek-V3"
command: ["/bin/bash"]
args:
- "-c"
- |
set -x
# start ray leader
/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
sleep 30
ray status
# download and install model
huggingface-cli download ${MODEL_REPO}
# start vllm server
vllm serve ${MODEL_REPO} \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 4 \
--disable-log-requests \
--uvicorn-log-level error \
--max-model-len 32768 \
--trust-remote-code \
--device cuda \
--gpu-memory-utilization 0.8
resources:
limits:
nvidia.com/gpu: "8"
cpu: "96"
memory: 384Gi
vpc.amazonaws.com/efa: 4
requests:
nvidia.com/gpu: "8"
cpu: "96"
memory: 384Gi
vpc.amazonaws.com/efa: 4
ports:
- containerPort: 8000
readinessProbe:
tcpSocket:
port: 8000
initialDelaySeconds: 15
periodSeconds: 10
volumeMounts:
- name: local-storage
mountPath: /local
- name: shm
mountPath: /dev/shm
volumes:
- name: local-storage
hostPath:
path: /root/local
type: DirectoryOrCreate
- name: shm
emptyDir:
medium: Memory
sizeLimit: "512Gi"
workerTemplate:
spec:
containers:
- name: vllm-worker
image: ghcr.io/dims/skunkworks/vllm-v3:89-775754b81f110d1d5c3165ef277e5571b18e5da4
securityContext:
privileged: true
capabilities:
add: ["IPC_LOCK"]
env:
- name: HF_HUB_ENABLE_HF_TRANSFER
value: "1"
- name: HF_HOME
value: "/local/huggingface"
- name: HF_HUB_VERBOSITY
value: "debug"
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: NCCL_DEBUG
value: "TRACE"
command: ["/bin/bash"]
args:
- "-c"
- |
set -x
# start ray worker
/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)
resources:
limits:
nvidia.com/gpu: "8"
cpu: "96"
memory: 384Gi
vpc.amazonaws.com/efa: 4
requests:
nvidia.com/gpu: "8"
cpu: "96"
memory: 384Gi
vpc.amazonaws.com/efa: 4
volumeMounts:
- name: local-storage
mountPath: /local
- name: shm
mountPath: /dev/shm
volumes:
- name: local-storage
hostPath:
path: /root/local
type: DirectoryOrCreate
- name: shm
emptyDir:
medium: Memory
sizeLimit: "512Gi"
---
apiVersion: v1
kind: Service
metadata:
name: vllm-leader
spec:
ports:
- name: port-8000
port: 8000
targetPort: 8000
- name: port-8265
port: 8265
targetPort: 8265
type: ClusterIP
selector:
leaderworkerset.sigs.k8s.io/name: vllm
role: leader
---
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
type: Opaque
data:
token: "PASTE_BASE_64_VERSION_OF_YOUR_HF_TOKEN_HERE"
```
Important: Replace `PASTE_BASE_64_VERSION_OF_YOUR_HF_TOKEN_HERE` with the base64-encoded version of your Hugging Face token. To encode it, you can run `echo -n 'your_token' | base64`.
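Alternatively, you can let `kubectl` handle the encoding and create the secret directly; if you go this route, drop the `Secret` block from the yaml so a later `kubectl apply` does not overwrite it with the placeholder:
```bash
# Create the Hugging Face token secret directly (kubectl base64-encodes the value for you)
kubectl create secret generic hf-token-secret \
  --from-literal=token='your_hf_token_here'
```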
A couple of other things are worth pointing out. If you look at the `vllm` command line, you will notice:
```
--tensor-parallel-size 8 \
--pipeline-parallel-size 4 \
```
Across the 4 nodes we have 32 GPUs; we split them into 8-way tensor parallelism (within each node) and 4 pipeline-parallel stages (across nodes), for a total of 8 x 4 = 32 GPUs (read about these parameters [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-a-single-node)).
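You can double-check that this math matches what the cluster actually exposes (a small sketch):
```bash
# Sum schedulable GPUs across all nodes; it should equal tensor-parallel-size * pipeline-parallel-size (8 * 4 = 32)
kubectl get nodes -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
  | awk '{sum += $1} END {print sum, "GPUs total"}'
```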
Apply the yaml using kubectl:
```bash
kubectl apply -f deepseek-lws.yaml
```
```
leaderworkerset.leaderworkerset.x-k8s.io/vllm created
service/vllm-leader unchanged
secret/hf-token-secret unchanged
```
Check on the vllm pods:
```
kubectl get pods
```
```
NAME READY STATUS RESTARTS AGE
vllm-0 0/1 Running 0 50s
vllm-0-1 1/1 Running 0 50s
vllm-0-2 1/1 Running 0 50s
vllm-0-3 1/1 Running 0 50s
```
You will need to wait until `vllm-0` gets to `1/1`. You can check on what is happening inside the main pod using:
```
kubectl logs vllm-0 -f
```
In `deepseek-lws.yaml`, you will notice that we have turned the logging all the way up so you get an idea of everything that is happening (or not!) in the system. Once you are familiar with it, you can turn the verbosity back down as far as you wish.
You will see the model being downloaded:
```
(RayWorkerWrapper pid=444, ip=192.168.130.178) Downloading 'model-00118-of-000163.safetensors' to '/local/huggingface/hub/models--deepseek-ai--DeepSeek-V3/blobs/72680742383e3ac1d20bc8abef7c730f880310b88a07e72d5a6ee47bc38613e9.incomplete'
(RayWorkerWrapper pid=444, ip=192.168.130.178) Downloading 'model-00119-of-000163.safetensors' to '/local/huggingface/hub/models--deepseek-ai--DeepSeek-V3/blobs/a7f04447b66d432a8800d6fb40788f980bd74d679716cb3ea6ed4ef328c73b43.incomplete'
(RayWorkerWrapper pid=444, ip=192.168.130.178) Downloading 'model-00120-of-000163.safetensors' to '/local/huggingface/hub/models--deepseek-ai--DeepSeek-V3/blobs/a540ce2f0766c50c12a7af78f41b3f5b6b64ebe8cdc804ee0ff8ff81a90248cc.incomplete'
```
If you inspect `deepseek-lws.yaml`, you will see that the `/root/local` directory on the host is used to store the model. So even if a pod fails for some reason, the next pod will pick up the download from where the previous one left off.
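Since we installed the `node-shell` plugin earlier, you can peek at that on-host cache directly (substitute a node name from `kubectl get nodes`):
```bash
# Inspect the on-host Hugging Face cache that backs the /local volume mount
kubectl node-shell <node-name> -- sh -c 'du -sh /root/local/huggingface/hub/*'
```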
After a while you will see the following:
```
Loading safetensors checkpoint shards: 94% Completed | 153/163 [00:13<00:00, 23.11it/s]
Loading safetensors checkpoint shards: 96% Completed | 156/163 [00:13<00:00, 22.98it/s]
Loading safetensors checkpoint shards: 98% Completed | 159/163 [00:13<00:00, 22.92it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:13<00:00, 12.10it/s]
```
Once you see the following, the vLLM OpenAI-compatible API endpoint is ready!
```
INFO 04-19 18:04:54 [launcher.py:26] Available routes are:
INFO 04-19 18:04:54 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /health, Methods: GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /load, Methods: GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /version, Methods: GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /score, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /invocations, Methods: POST
```
## Take it for a spin!
### Access the API
To access the DeepSeek-V3 model from your local machine, forward the service ports:
```bash
kubectl port-forward svc/vllm-leader 8000:8000 8265:8265
```
To check that the model is registered, query the `/v1/models` endpoint:
```bash
curl -X GET "http://127.0.0.1:8000/v1/models" | jq
```
To test the deployment, use the following command:
```bash
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V3",
"messages": [
{
"role": "user",
"content": "What is Kubernetes?"
}
]
}' | jq
```
You will see something like:
```
{
"id": "chatcmpl-f0447f97931d49bab156dfd266055de0",
"object": "chat.completion",
"created": 1745112232,
"model": "deepseek-ai/DeepSeek-V3",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": null,
"content": "Kubernetes, often abbreviated as K8s, is an open-source platform designed to automate deploying, scaling, and operating application containers. It was originally developed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF).\n\nHere are some key features and components of Kubernetes:\n\n1. **Container Orchestration**: Kubernetes manages containerized applications across a cluster of machines. It ensures that the desired number of containers are running and can automatically replace any that fail.\n\n2. **Scaling**: Kubernetes can automatically scale applications up or down based on demand, ensuring optimal resource utilization.\n\n3. **Service Discovery and Load Balancing**: Kubernetes can expose a container using a DNS name or its own IP address. If traffic to a container is high, Kubernetes can load balance and distribute the network traffic to stabilize the deployment.\n\n4. **Storage Orchestration**: Kubernetes allows you to automatically mount a storage system of your choice, whether local storage, public cloud providers, or network storage systems.\n\n5. **Self-Healing**: Kubernetes can restart containers that fail, replace containers, kill containers that don't respond to your user-defined health check, and advertise them to clients only when they are ready to serve.\n\n6. **Automated Rollouts and Rollbacks**: Kubernetes allows you to describe the desired state for your deployed containers and can change the actual state to the desired state at a controlled rate. If something goes wrong, Kubernetes can roll back the change.\n\n7. **Configurations and Secrets**: Kubernetes manages configurations and secrets, ensuring sensitive information is securely handled and configurations are consistent across environments.\n\n8. **Portability**: Kubernetes can run on various platforms, including on-premises, public cloud, and hybrid environments, making it highly flexible.\n\nKubernetes uses a declarative model to define the desired state of the system, and it continuously works to maintain that state. It is widely used in the industry for managing containerized applications in production environments.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 7,
"total_tokens": 397,
"completion_tokens": 390,
"prompt_tokens_details": null
},
"prompt_logprobs": null
}
```
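If you want to watch tokens arrive incrementally, the OpenAI-compatible endpoint also supports streaming; a quick sketch:
```bash
# Stream the chat completion as server-sent events (-N disables curl buffering)
curl -N -X POST "http://127.0.0.1:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Write a haiku about Kubernetes"}],
    "stream": true
  }'
```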
Now feel free to tweak the `deepseek-lws.yaml` and re-apply the changes using:
```
kubectl apply -f deepseek-lws.yaml
```
Just to be sure, you can clean up using `kubectl delete -f deepseek-lws.yaml` and use `kubectl get pods` to make sure all the pods are gone before you run `kubectl apply`.
**Happy Hacking!!**
## Bonus
If you were a keen observer, you will have noticed that we forwarded port `8265` as well; point your browser at the Ray dashboard:
- http://localhost:8265/#/overview
- http://localhost:8265/#/cluster
You can watch GPU usage there, especially while an inference request is running.
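You can also watch GPU utilization from inside the leader pod while a request is in flight (a rough sketch; `vllm-0` is the leader pod name from earlier, and it assumes `nvidia-smi` is available in the container):
```bash
# Poll GPU utilization and memory every 5 seconds inside the leader pod
kubectl exec -it vllm-0 -- nvidia-smi \
  --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5
```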
## Thanks
This post is based on [Bryant Biggs](https://github.com/bryantbiggs)'s work in various repositories; thanks, Bryant. Also thanks to [Arush Sharma](https://github.com/rushmash91) for a quick review and suggestions. Kudos to the folks in the Ray, vLLM, LWS, and Kubernetes communities for making it easier to compose these complex scenarios.
## Things to try
As mentioned earlier, we rely on a host node directory here to persist model downloads across pod restarts. There are other options you can try as well; see the mozilla.ai link below, which uses Persistent Volumes, for example. Yet another option to store/load the model is the [FSx for Lustre Container Storage Interface (CSI) driver](https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html).
The `terraform-aws-eks-blueprints` [GitHub repo](https://github.com/aws-ia/terraform-aws-eks-blueprints/tree/main/patterns/multi-node-vllm) also has a Terraform-based setup you can try.
## Links
- https://huggingface.co/deepseek-ai/DeepSeek-V3
- https://www.theriseunion.com/en/blog/DeepSeek-V3-R1-671B-GPU-Requirements.html
- https://blog.mozilla.ai/deploying-deepseek-v3-on-kubernetes/
- https://github.com/aws-samples/deepseek-using-vllm-on-eks
- https://community.aws/content/2sJofoAecl6jVdDwVqglbZwKz2E
- https://docs.vllm.ai/en/latest/serving/distributed_serving.html
- https://github.com/vllm-project/vllm/issues/11539
- https://community.aws/content/2sJofoAecl6jVdDwVqglbZwKz2E/hosting-deepseek-r1-on-amazon-eks
- https://apxml.com/posts/gpu-requirements-deepseek-r1
- https://unsloth.ai/blog/deepseekr1-dynamic
- https://aws-ia.github.io/terraform-aws-eks-blueprints/patterns/machine-learning/multi-node-vllm/#dockerfile
- https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/main/patterns/multi-node-vllm/Dockerfile
- https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/Deepseek/DeepSeek-R1-LMI-FP8.ipynb
- https://docs.aws.amazon.com/eks/latest/userguide/machine-learning-on-eks.html