Davanum Srinivas
# Deploying the DeepSeek-V3 Model (full version) in Amazon EKS Using vLLM and LWS

## Table of Contents

- [Who Is This Guide For?](#who-is-this-guide-for)
- [Prerequisites](#prerequisites)
- [Creating a suitable EKS Cluster](#creating-a-suitable-eks-cluster)
- [A container image with EFA](#a-container-image-with-efa)
- [Verify the cluster](#verify-the-cluster)
- [Run the deepseek-v3 workload](#run-the-deepseek-v3-workload)
- [Take it for a spin!](#take-it-for-a-spin)
- [Bonus](#bonus)

## Who Is This Guide For?

This guide assumes you:

- Have intermediate Kubernetes experience (kubectl, Helm)
- Are familiar with AWS CLI and EKS
- Understand basic GPU concepts

This guide provides a streamlined process to deploy the 671B parameter DeepSeek-V3 MoE model on Amazon EKS using [vLLM](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and the LeaderWorkerSet API ([LWS](https://github.com/kubernetes-sigs/lws)). We will be deploying on [Amazon EC2 G6e](https://aws.amazon.com/ec2/instance-types/g6e/) instances as they are a bit more accessible/available, and we want to see how to load models across multiple nodes. The main idea here is to peel the onion and see exactly how folks are deploying these large models, with a practical demonstration to understand all the pieces and how they fit together.

The latest versions of the files are available here: https://github.com/dims/skunkworks/tree/main/v3

## Prerequisites

Before starting, ensure you have the following tools installed (see the quick check below):

1. **AWS CLI**: For managing AWS resources.
2. **eksctl/eksdemo**: To create and manage EKS clusters.
3. **kubectl**: The command-line tool for Kubernetes.
4. **helm**: Kubernetes' package manager.
5. **jq**: For parsing JSON.
6. **Docker**: For building container images.
7. **Hugging Face Hub access**: You'll need a token to download the model.
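A minimal sanity check that the tools above are installed and on your PATH (swap `eksctl` for `eksdemo` if that is what you plan to use):

```bash
# Quick check that the prerequisite CLIs are available
for tool in aws eksctl kubectl helm jq docker; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done

aws --version
kubectl version --client
helm version --short
```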
## Creating a suitable EKS Cluster

We will use an AWS account with sufficient quota for four g6e.48xlarge instances (192 vCPUs, 1536GB RAM, and 8x L40S Tensor Core GPUs with 48 GB of memory per GPU). You can use [eksdemo](https://github.com/awslabs/eksdemo?tab=readme-ov-file#install-eksdemo), for example:

```
eksdemo create cluster deepseek-v3-cluster-001 \
  --os AmazonLinux2023 \
  --instance g6e.48xlarge \
  --max 4 --nodes 4 \
  --volume-size 2048 \
  --enable-efa \
  --addons eks-pod-identity-agent \
  --no-taints \
  --timeout 120m
```

If you want to use [eksctl](https://github.com/eksctl-io/eksctl/) instead, run the same command with `--dry-run` to get the equivalent command and configuration YAML. Essentially, ensure you have enough GPU nodes, allocate a large volume size per node, and enable EFA. You can use any tool of your choice, but remember that you may have to adjust the deployment YAML, for example for taints, as needed.

🔍 Why EFA? Elastic Fabric Adapter accelerates inter-node communication, critical for multi-GPU inference.

## A container image with EFA

Ideally you would just use a public image from the vLLM folks:

```
docker.io/vllm/vllm-openai:latest
```

However, we want to use EFA because [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/eks/latest/userguide/node-efa.html) enhances inter-node communication for high-performance computing and machine learning applications within Amazon EKS clusters.

In the following Dockerfile, we start by grabbing a powerful CUDA base image, then go on an installation spree, pulling in EFA, NCCL, and AWS-OFI-NCCL, while instructing apt to hang onto its downloaded packages. Once everything's compiled, we carefully graft these freshly built libraries onto the vLLM image above.

🛠 GPU Compatibility: The `COMPUTE_CAPABILITY_VERSION=90` setting is specific to L40S GPUs. Adjust this for your hardware.

```
# syntax=docker/dockerfile:1

ARG CUDA_VERSION=12.4.1

FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS efa-build

ARG COMPUTE_CAPABILITY_VERSION=90
ARG AWS_OFI_NCCL_VERSION=1.13.2-aws
ARG EFA_INSTALLER_VERSION=1.38.0
ARG NCCL_VERSION=2.24.3

RUN <<EOT
  rm -f /etc/apt/apt.conf.d/docker-clean
  echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
  echo 'APT::Install-Suggests "0";' >> /etc/apt/apt.conf.d/00-docker
  echo 'APT::Install-Recommends "0";' >> /etc/apt/apt.conf.d/00-docker
  echo 'tzdata tzdata/Areas select America' | debconf-set-selections
  echo 'tzdata tzdata/Zones/America select Chicago' | debconf-set-selections
EOT

RUN <<EOT
  apt update
  apt install -y \
    curl \
    git \
    libhwloc-dev \
    pciutils \
    python3

  # EFA installer
  cd /tmp
  curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz | tar xvz
  cd aws-efa-installer
  ./efa_installer.sh --yes --skip-kmod --skip-limit-conf --no-verify --mpi openmpi5
  echo "/opt/amazon/openmpi5/lib" > /etc/ld.so.conf.d/openmpi.conf
  ldconfig

  # NCCL
  cd /tmp
  git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1
  cd nccl
  make -j $(nproc) src.build \
    BUILDDIR=/opt/nccl \
    CUDA_HOME=/usr/local/cuda \
    NVCC_GENCODE="-gencode=arch=compute_${COMPUTE_CAPABILITY_VERSION},code=sm_${COMPUTE_CAPABILITY_VERSION} -gencode=arch=compute_${COMPUTE_CAPABILITY_VERSION},code=sm_${COMPUTE_CAPABILITY_VERSION}"
  # keep only the shared libraries; the static ones are not needed in the final image
  rm /opt/nccl/lib/*.a
  echo "/opt/nccl/lib" > /etc/ld.so.conf.d/000_nccl.conf
  ldconfig

  # AWS-OFI-NCCL plugin
  cd /tmp
  curl -sL https://github.com/aws/aws-ofi-nccl/releases/download/v${AWS_OFI_NCCL_VERSION}/aws-ofi-nccl-${AWS_OFI_NCCL_VERSION}.tar.gz | tar xvz
  cd aws-ofi-nccl-${AWS_OFI_NCCL_VERSION}
  ./configure --prefix=/opt/aws-ofi-nccl/install \
    --with-mpi=/opt/amazon/openmpi5 \
    --with-libfabric=/opt/amazon/efa \
    --with-cuda=/usr/local/cuda \
    --enable-tests=no \
    --enable-platform-aws
  make -j $(nproc)
  make install
  echo "/opt/aws-ofi-nccl/install/lib" > /etc/ld.so.conf.d/000-aws-ofi-nccl.conf
  ldconfig
EOT

################################################################

FROM docker.io/vllm/vllm-openai:latest

COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libhwloc.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libltdl.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libefa.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libhns.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libmana.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libmlx4.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libmlx5.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libibverbs.* /usr/lib/x86_64-linux-gnu/
COPY --from=efa-build /usr/lib/x86_64-linux-gnu/libibverbs /usr/lib/x86_64-linux-gnu/libibverbs
COPY --from=efa-build /opt/amazon /opt/amazon
COPY --from=efa-build /opt/aws-ofi-nccl /opt/aws-ofi-nccl
COPY --from=efa-build /opt/nccl/lib /opt/nccl/lib
COPY --from=efa-build /etc/ld.so.conf.d /etc/ld.so.conf.d

ENV LD_PRELOAD=/opt/nccl/lib/libnccl.so

COPY ./ray_init.sh /ray_init.sh

RUN <<EOT
  chmod +x /ray_init.sh
  pip install huggingface_hub[hf_transfer]
  pip install -U "ray[default]" "ray[cgraph]"
EOT
```

Note: we also install `huggingface_hub` with the high-speed `hf_transfer` component and update the `ray` package.
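If you want to build and publish this image yourself (the manifest later in this guide points at a prebuilt one under `ghcr.io/dims/skunkworks`), something like the following sketch works. The registry and tag here are placeholders, and the `ray_init.sh` shown in the next section needs to sit next to the Dockerfile since it is `COPY`'d in:

```bash
# Placeholder image name -- substitute your own registry/tag.
# For ECR, create the repository and log in first, e.g.:
#   aws ecr get-login-password --region <region> | \
#     docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
IMAGE="<your-registry>/vllm-efa:latest"

docker build -t "$IMAGE" .
docker push "$IMAGE"
```

If you go this route, remember to point the `image:` fields in `deepseek-lws.yaml` at whatever you pushed.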
There is a `ray_init.sh` which helps us start `vllm` and `ray` in the leader and worker nodes brought up by LWS.

```
#!/bin/bash

subcommand=$1
shift

ray_port=6379
ray_init_timeout=300
declare -a start_params

case "$subcommand" in
  worker)
    ray_address=""
    while [ $# -gt 0 ]; do
      case "$1" in
        --ray_address=*)
          ray_address="${1#*=}"
          ;;
        --ray_port=*)
          ray_port="${1#*=}"
          ;;
        --ray_init_timeout=*)
          ray_init_timeout="${1#*=}"
          ;;
        *)
          start_params+=("$1")
      esac
      shift
    done

    if [ -z "$ray_address" ]; then
      echo "Error: Missing argument --ray_address"
      exit 1
    fi

    until ray status --address $ray_address:$ray_port; do
      echo "Waiting until the ray status is active for leader..."
      sleep 5s;
    done

    for (( i=0; i < $ray_init_timeout; i+=5 )); do
      ray start --address=$ray_address:$ray_port --block "${start_params[@]}"
      if [ $? -eq 0 ]; then
        echo "Worker: Ray runtime started with head address $ray_address:$ray_port"
        exit 0
      fi
      echo "Waiting until the ray worker is active..."
      sleep 5s;
    done
    echo "Ray worker starts timeout, head address: $ray_address:$ray_port"
    exit 1
    ;;

  leader)
    ray_cluster_size=""
    while [ $# -gt 0 ]; do
      case "$1" in
        --ray_port=*)
          ray_port="${1#*=}"
          ;;
        --ray_cluster_size=*)
          ray_cluster_size="${1#*=}"
          ;;
        --ray_init_timeout=*)
          ray_init_timeout="${1#*=}"
          ;;
        *)
          start_params+=("$1")
      esac
      shift
    done

    if [ -z "$ray_cluster_size" ]; then
      echo "Error: Missing argument --ray_cluster_size"
      exit 1
    fi

    # start the ray daemon
    ray start --head --include-dashboard=true --port=$ray_port "${start_params[@]}"

    # wait until all workers are active
    for (( i=0; i < $ray_init_timeout; i+=5 )); do
      active_nodes=`python3 -c 'import ray; ray.init(); print(sum(node["Alive"] for node in ray.nodes()))'`
      if [ $active_nodes -eq $ray_cluster_size ]; then
        echo "All ray workers are active and the ray cluster is initialized successfully."
        exit 0
      fi
      echo "Wait for all ray workers to be active. $active_nodes/$ray_cluster_size is active"
      sleep 5s;
    done
    echo "Waiting for all ray workers to be active timed out."
    exit 1
    ;;

  *)
    echo "unknown subcommand: $subcommand"
    exit 1
    ;;
esac
```

Both these files are adaptations of code written by various folks and are available [here](https://github.com/vllm-project/vllm/blob/a018e555fd872ead45a1ab13d86626bb37064076/examples/online_serving/multi-node-serving.sh) and [here](https://github.com/aws-ia/terraform-aws-eks-blueprints/tree/main/patterns/multi-node-vllm).
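For reference, this is how the LWS manifest further below ends up invoking the script (illustration only; in the real deployment LWS injects `LWS_GROUP_SIZE` and `LWS_LEADER_ADDRESS` for you, and `<leader-address>` is just a placeholder here):

```bash
# On the leader pod: start the Ray head and wait until all 4 nodes have joined
/ray_init.sh leader --ray_cluster_size=4

# On each worker pod: join the Ray cluster at the leader's address
/ray_init.sh worker --ray_address=<leader-address>
```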
## Verify the cluster

### Step 1: Check Daemonsets

- Check if the Nvidia and EFA daemonsets are running

```
kubectl get daemonsets -A | grep -E 'nvidia|efa'
```

```
kube-system   aws-efa-k8s-device-plugin        1   1   1   1   1   <none>   11m
kube-system   nvidia-device-plugin-daemonset   1   1   1   1   1   <none>   11m
```

### Step 2: Verify Node Resources

- Check if the nodes are correctly annotated with the GPU count and EFA capacity

```
kubectl get nodes -o json | jq -r '
  ["NODE", "NVIDIA_GPU", "EFA_CAPACITY"],
  (.items[] |
    [
      .metadata.name,
      (.status.capacity."nvidia.com/gpu" // "0"),
      (.status.capacity."vpc.amazonaws.com/efa" // "0")
    ]
  ) | @tsv' | column -t -s $'\t'
```

```
NODE                                            NVIDIA_GPU  EFA_CAPACITY
i-08982f78fb3e2b7d7.us-west-2.compute.internal  8           4
```

### Step 3: Inspect Hardware

- Install the `node-shell` kubectl/krew plugin to peek into the nodes (will be handy for later)

```
kubectl krew install node-shell
```

- Now check if the devices are correctly present in each node

```
kubectl get nodes -o name | cut -d/ -f2 | \
  xargs -I{} sh -c 'echo "=== {} ==="; kubectl node-shell {} -- sh -c "lspci | grep -iE \"nvidia|amazon.*(efa)\"";'
```

```
=== i-08982f78fb3e2b7d7.us-west-2.compute.internal ===
spawning "nsenter-k9hss5" on "i-08982f78fb3e2b7d7.us-west-2.compute.internal"
9b:00.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
9c:00.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
9e:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
a0:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
a2:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
a4:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
bc:00.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
bd:00.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
c6:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
c8:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
cc:00.0 3D controller: NVIDIA Corporation AD102GL [L40S] (rev a1)
pod "nsenter-k9hss5" deleted
```
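Optionally, you can run `nvidia-smi` on each node through the same `node-shell` plugin to confirm the driver sees all eight L40S GPUs. This assumes `nvidia-smi` is present on the host, which is the case for the GPU-enabled EKS AMIs:

```bash
# List the GPUs the driver sees on every node
kubectl get nodes -o name | cut -d/ -f2 | \
  xargs -I{} sh -c 'echo "=== {} ==="; kubectl node-shell {} -- nvidia-smi -L'
```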
## Run the deepseek-v3 workload

### Install LWS Controller

Use helm to install LWS:

```bash
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --version=0.6.1 \
  --namespace lws-system \
  --create-namespace \
  --wait --timeout 300s
```

Check if the LWS pods are running:

```
kubectl get pods -n lws-system
```

```
NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-696b448fb9-fxxs8   1/1     Running   0          88s
```

Edit `deepseek-lws.yaml` to insert your Hugging Face token (ensure it's base64 encoded):

```
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 4
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: ghcr.io/dims/skunkworks/vllm-v3:89-775754b81f110d1d5c3165ef277e5571b18e5da4
            securityContext:
              privileged: true
              capabilities:
                add: ["IPC_LOCK"]
            env:
              - name: NCCL_DEBUG
                value: "TRACE"
              - name: NCCL_DEBUG_SUBSYS
                value: "ALL"
              - name: NCCL_IB_DISABLE
                value: "1"
              - name: NCCL_P2P_DISABLE
                value: "1"
              - name: NCCL_NET_GDR_LEVEL
                value: "0"
              - name: NCCL_SHM_DISABLE
                value: "1"
              - name: PYTORCH_CUDA_ALLOC_CONF
                value: "max_split_size_mb:512,expandable_segments:True"
              - name: CUDA_MEMORY_FRACTION
                value: "0.95"
              - name: FI_EFA_USE_DEVICE_RDMA
                value: "1"
              - name: FI_PROVIDER
                value: "efa"
              - name: FI_EFA_FORK_SAFE
                value: "1"
              - name: HF_HUB_ENABLE_HF_TRANSFER
                value: "1"
              - name: HF_HOME
                value: "/local/huggingface"
              - name: HF_HUB_VERBOSITY
                value: "debug"
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: token
              - name: MODEL_REPO
                value: "deepseek-ai/DeepSeek-V3"
            command: ["/bin/bash"]
            args:
              - "-c"
              - |
                set -x
                # start ray leader
                /ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                sleep 30
                ray status
                # download and install model
                huggingface-cli download ${MODEL_REPO}
                # start vllm server
                vllm serve ${MODEL_REPO} \
                  --host 0.0.0.0 \
                  --port 8000 \
                  --tensor-parallel-size 8 \
                  --pipeline-parallel-size 4 \
                  --disable-log-requests \
                  --uvicorn-log-level error \
                  --max-model-len 32768 \
                  --trust-remote-code \
                  --device cuda \
                  --gpu-memory-utilization 0.8
            resources:
              limits:
                nvidia.com/gpu: "8"
                cpu: "96"
                memory: 384Gi
                vpc.amazonaws.com/efa: 4
              requests:
                nvidia.com/gpu: "8"
                cpu: "96"
                memory: 384Gi
                vpc.amazonaws.com/efa: 4
            ports:
              - containerPort: 8000
            readinessProbe:
              tcpSocket:
                port: 8000
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - name: local-storage
                mountPath: /local
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: local-storage
            hostPath:
              path: /root/local
              type: DirectoryOrCreate
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "512Gi"
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: ghcr.io/dims/skunkworks/vllm-v3:89-775754b81f110d1d5c3165ef277e5571b18e5da4
            securityContext:
              privileged: true
              capabilities:
                add: ["IPC_LOCK"]
            env:
              - name: HF_HUB_ENABLE_HF_TRANSFER
                value: "1"
              - name: HF_HOME
                value: "/local/huggingface"
              - name: HF_HUB_VERBOSITY
                value: "debug"
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-token-secret
                    key: token
              - name: NCCL_DEBUG
                value: "TRACE"
            command: ["/bin/bash"]
            args:
              - "-c"
              - |
                set -x
                # start ray worker
                /ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)
            resources:
              limits:
                nvidia.com/gpu: "8"
                cpu: "96"
                memory: 384Gi
                vpc.amazonaws.com/efa: 4
              requests:
                nvidia.com/gpu: "8"
                cpu: "96"
                memory: 384Gi
                vpc.amazonaws.com/efa: 4
            volumeMounts:
              - name: local-storage
                mountPath: /local
              - name: shm
                mountPath: /dev/shm
        volumes:
          - name: local-storage
            hostPath:
              path: /root/local
              type: DirectoryOrCreate
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "512Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: port-8000
      port: 8000
      targetPort: 8000
    - name: port-8265
      port: 8265
      targetPort: 8265
  type: ClusterIP
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  token: "PASTE_BASE_64_VERSION_OF_YOUR_HF_TOKEN_HERE"
```

Important: Replace `"PASTE_BASE_64_VERSION_OF_YOUR_HF_TOKEN_HERE"` with the base64-encoded version of your Hugging Face token. To base64 encode it, you can use a tool like `echo -n 'your_token' | base64`.
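If you would rather not paste the token into the manifest at all, one alternative is to create the secret out of band and let kubectl handle the base64 encoding (if you do this, drop the `Secret` block from `deepseek-lws.yaml` so a later `kubectl apply` does not overwrite it):

```bash
# Creates the same hf-token-secret the manifest expects;
# --from-literal base64-encodes the value for you
kubectl create secret generic hf-token-secret \
  --from-literal=token="$HF_TOKEN"
```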
A couple of other things to point out: if you look at the `vllm` command line, you will notice

```
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
```

Across the 4 nodes we have 32 GPUs; we are splitting these into 8-way tensor parallelism and 4 pipeline stages, for a total of 32 (read about these params [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-a-single-node)).

Apply the yaml using kubectl:

```bash
kubectl apply -f deepseek-lws.yaml
```

```
leaderworkerset.leaderworkerset.x-k8s.io/vllm created
service/vllm-leader unchanged
secret/hf-token-secret unchanged
```

Check on the vllm pods:

```
kubectl get pods
```

```
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Running   0          50s
vllm-0-1   1/1     Running   0          50s
vllm-0-2   1/1     Running   0          50s
vllm-0-3   1/1     Running   0          50s
```

You will need to wait until `vllm-0` gets to `1/1`. You can check in on what is happening inside the main pod using:

```
kubectl logs vllm-0 -f
```

In `deepseek-lws.yaml`, you will notice that we have turned all the logging way up so you get an idea of all the things happening (or not!) in the system. Once you get familiar, you can turn the logging back down as much as you wish.

You will see the model being downloaded:

```
(RayWorkerWrapper pid=444, ip=192.168.130.178) Downloading 'model-00118-of-000163.safetensors' to '/local/huggingface/hub/models--deepseek-ai--DeepSeek-V3/blobs/72680742383e3ac1d20bc8abef7c730f880310b88a07e72d5a6ee47bc38613e9.incomplete'
(RayWorkerWrapper pid=444, ip=192.168.130.178) Downloading 'model-00119-of-000163.safetensors' to '/local/huggingface/hub/models--deepseek-ai--DeepSeek-V3/blobs/a7f04447b66d432a8800d6fb40788f980bd74d679716cb3ea6ed4ef328c73b43.incomplete'
(RayWorkerWrapper pid=444, ip=192.168.130.178) Downloading 'model-00120-of-000163.safetensors' to '/local/huggingface/hub/models--deepseek-ai--DeepSeek-V3/blobs/a540ce2f0766c50c12a7af78f41b3f5b6b64ebe8cdc804ee0ff8ff81a90248cc.incomplete'
```

If you inspect `deepseek-lws.yaml`, you will see that the `/root/local` directory on the host is used to store the model. So even if a pod fails for some reason, the next pod will pick up downloading from where the previous one left off.
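The download plus weight loading can take a long while. Two optional ways to keep an eye on it from the outside (the host path below matches the `hostPath` in the manifest; adjust the timeout to taste):

```bash
# Block until the leader's readiness probe passes
kubectl wait --for=condition=Ready pod/vllm-0 --timeout=3600s

# Peek at how much of the model has landed in the leader node's hostPath volume
NODE=$(kubectl get pod vllm-0 -o jsonpath='{.spec.nodeName}')
kubectl node-shell "$NODE" -- du -sh /root/local/huggingface
```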
After a while you will see the following:

```
Loading safetensors checkpoint shards:  94% Completed | 153/163 [00:13<00:00, 23.11it/s]
Loading safetensors checkpoint shards:  96% Completed | 156/163 [00:13<00:00, 22.98it/s]
Loading safetensors checkpoint shards:  98% Completed | 159/163 [00:13<00:00, 22.92it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:13<00:00, 12.10it/s]
```

Once you see the following, the vLLM OpenAI-compatible endpoint is ready!

```
INFO 04-19 18:04:54 [launcher.py:26] Available routes are:
INFO 04-19 18:04:54 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /health, Methods: GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /load, Methods: GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /version, Methods: GET
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /score, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-19 18:04:54 [launcher.py:34] Route: /invocations, Methods: POST
```

## Take it for a spin!

### Access the API

To access the DeepSeek-V3 model from your localhost, use the following command:

```bash
kubectl port-forward svc/vllm-leader 8000:8000 8265:8265
```

To check that the model is registered, query the models endpoint:

```bash
curl -X GET "http://127.0.0.1:8000/v1/models" | jq
```

To test the deployment, use the following command:

```bash
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [
      {
        "role": "user",
        "content": "What is Kubernetes?"
      }
    ]
  }' | jq
```

You will see something like:

```
{
  "id": "chatcmpl-f0447f97931d49bab156dfd266055de0",
  "object": "chat.completion",
  "created": 1745112232,
  "model": "deepseek-ai/DeepSeek-V3",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Kubernetes, often abbreviated as K8s, is an open-source platform designed to automate deploying, scaling, and operating application containers. It was originally developed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF).\n\nHere are some key features and components of Kubernetes:\n\n1. **Container Orchestration**: Kubernetes manages containerized applications across a cluster of machines. It ensures that the desired number of containers are running and can automatically replace any that fail.\n\n2. **Scaling**: Kubernetes can automatically scale applications up or down based on demand, ensuring optimal resource utilization.\n\n3. **Service Discovery and Load Balancing**: Kubernetes can expose a container using a DNS name or its own IP address. If traffic to a container is high, Kubernetes can load balance and distribute the network traffic to stabilize the deployment.\n\n4. **Storage Orchestration**: Kubernetes allows you to automatically mount a storage system of your choice, whether local storage, public cloud providers, or network storage systems.\n\n5. **Self-Healing**: Kubernetes can restart containers that fail, replace containers, kill containers that don't respond to your user-defined health check, and advertise them to clients only when they are ready to serve.\n\n6. **Automated Rollouts and Rollbacks**: Kubernetes allows you to describe the desired state for your deployed containers and can change the actual state to the desired state at a controlled rate. If something goes wrong, Kubernetes can roll back the change.\n\n7. **Configurations and Secrets**: Kubernetes manages configurations and secrets, ensuring sensitive information is securely handled and configurations are consistent across environments.\n\n8. **Portability**: Kubernetes can run on various platforms, including on-premises, public cloud, and hybrid environments, making it highly flexible.\n\nKubernetes uses a declarative model to define the desired state of the system, and it continuously works to maintain that state. It is widely used in the industry for managing containerized applications in production environments.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 397,
    "completion_tokens": 390,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}
```
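Since this is the standard OpenAI-compatible chat API that vLLM serves, you can also ask for a streamed response, for example by adding `"stream": true` to the same request (use `curl -N` so the server-sent events are not buffered):

```bash
# Same endpoint as above, but tokens stream back as they are generated
curl -N -X POST "http://127.0.0.1:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3",
    "stream": true,
    "messages": [
      { "role": "user", "content": "What is Kubernetes?" }
    ]
  }'
```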
Now feel free to tweak `deepseek-lws.yaml` and re-apply the changes using:

```
kubectl apply -f deepseek-lws.yaml
```

Just to be sure, you can clean up using `kubectl delete -f deepseek-lws.yaml` and use `kubectl get pods` to make sure all the pods are gone before you run `kubectl apply` again.

**Happy Hacking!!**

## Bonus

If you were a keen observer, you noticed that we forwarded port `8265` as well; point your browser at the Ray dashboard:

- http://localhost:8265/#/overview
- http://localhost:8265/#/cluster

You can see the GPU usage, specifically when you are running an inference.
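If you prefer the CLI over the dashboard, a quick way to get the same cluster-wide view is to ask the Ray head running in the leader pod (the `ray` CLI is already in the image we built, and the leader runs `ray status` during startup):

```bash
# Cluster-wide resource usage (CPUs, GPUs, memory) as seen by the Ray head
kubectl exec -it vllm-0 -- ray status
```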
## Thanks

This post is based on [Bryant Biggs](https://github.com/bryantbiggs)'s work in various repositories, thanks Bryant. Also thanks to [Arush Sharma](https://github.com/rushmash91) for a quick review and suggestions. Kudos to the folks in the Ray, vLLM, LWS, and Kubernetes communities for making it easier to compose these complex scenarios.

## Things to try

As mentioned earlier, we are relying on a host node directory here to persist model downloads across pod restarts. There are other options you can try as well; see the mozilla.ai link below, which uses Persistent Volumes, for example. Yet another option to store/load the model is the [FSx for Lustre Container Storage Interface (CSI) driver](https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html). The `terraform-aws-eks-blueprints` [GitHub repo](https://github.com/aws-ia/terraform-aws-eks-blueprints/tree/main/patterns/multi-node-vllm) has a Terraform-based setup you can try too.

## Links

- https://huggingface.co/deepseek-ai/DeepSeek-V3
- https://www.theriseunion.com/en/blog/DeepSeek-V3-R1-671B-GPU-Requirements.html
- https://blog.mozilla.ai/deploying-deepseek-v3-on-kubernetes/
- https://github.com/aws-samples/deepseek-using-vllm-on-eks
- https://community.aws/content/2sJofoAecl6jVdDwVqglbZwKz2E
- https://docs.vllm.ai/en/latest/serving/distributed_serving.html
- https://github.com/vllm-project/vllm/issues/11539
- https://community.aws/content/2sJofoAecl6jVdDwVqglbZwKz2E/hosting-deepseek-r1-on-amazon-eks
- https://apxml.com/posts/gpu-requirements-deepseek-r1
- https://unsloth.ai/blog/deepseekr1-dynamic
- https://aws-ia.github.io/terraform-aws-eks-blueprints/patterns/machine-learning/multi-node-vllm/#dockerfile
- https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/main/patterns/multi-node-vllm/Dockerfile
- https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/Deepseek/DeepSeek-R1-LMI-FP8.ipynb
- https://docs.aws.amazon.com/eks/latest/userguide/machine-learning-on-eks.html
