# Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management - Part 1

![image](https://hackmd.io/_uploads/S1OtrPTMWx.png)

https://arxiv.org/abs/2511.20172

This document explains the background of the titled paper for readers who are encountering the CXL ecosystem for the first time.

## 1. KVCache

https://moon-walker.medium.com/long-context%EB%A1%9C-%EC%9D%B8%ED%95%9C-large-kv-cache%EC%9D%98-%EB%AC%B8%EC%A0%9C%EC%A0%90%EA%B3%BC-%ED%95%B4%EA%B2%B0-%EB%B0%A9%EC%95%88-part-i-kv-cache%EC%9D%98-%EB%A9%94%EB%AA%A8%EB%A6%AC-%EC%9A%94%EA%B5%AC%EB%9F%89-025f3d5dea93

Auto-regressive model = a model that predicts the next step's output from the outputs of the previous steps.

For example, GPT caches the intermediate activations produced while generating earlier tokens (the KV cache). This saves the GPU FLOPs that would otherwise be spent recomputing values for earlier tokens, at the cost of extra memory for the KV cache.

The technique also applies to other sequence models, but it is especially effective for Transformers (the Transformer was proposed in Google's 2017 paper "Attention Is All You Need").

As an LLM's context window size increases, the KV cache size grows linearly with it. Ways to deal with the resulting memory pressure:

* Reduce the memory footprint of the model weights (e.g. quantization)
* Reduce the memory footprint of the KV cache (e.g. GQA, MQA)
    * Attention - a technique that lets the model learn how each word in the input sequence relates to the other words (it finds the most relevant words and gives their information more weight)
    * GQA - Grouped Query Attention
    * MQA - Multi Query Attention
* Use model parallelism to split the model across multiple GPUs (e.g. tensor parallelism)

https://jsj6040.tistory.com/61

### LLM inference structure

https://velog.io/@hbcho/LLM-%EB%82%B4%EB%B6%80-%ED%95%B4%EB%B6%80-Prefill-Decoding%EA%B3%BC-QKV%EC%9D%98-%EC%A7%84%EC%A7%9C-%EC%9E%91%EB%8F%99-%EC%9B%90%EB%A6%AC

#### Prefill (input token processing phase)

All tokens entered by the user are processed in parallel, and the results are stored in KV cache memory.

#### Decode (sequential output token generation phase)

The previously stored KV cache is consulted to compute what is important, and a token is generated from the combined result. The new token's K and V are then appended to the cache.

#### KV Caching (minimizing unnecessary computation)

Memory that stores the results of past computation. At each step, only the K and V for the single new token need to be computed.

https://www.byhand.ai/p/kv-cache-prefill-decode

https://sanedajangbu-ai.tistory.com/17

https://dytis.tistory.com/54

## 2. RDMA-based disaggregated memory pool

### Remote Direct Memory Access as a network protocol

When server A sends data to server B over TCP/IP, the overhead of adding and removing headers consumes a lot of CPU resources. RDMA is used to improve on this.

![image](https://hackmd.io/_uploads/Hyb5JOaGZl.png)

* Zero Copy
* Kernel bypass & Protocol offload

There are three main types:

- InfiniBand - used in place of Ethernet in supercomputers and similar systems. Used by Intel and NVIDIA
- RoCE - RDMA over Converged Ethernet
    - Also called IBoE (InfiniBand over Ethernet)
    - Runs at L2 on Ethernet (RoCE v1) or at L3 over UDP (RoCE v2)
    - Congestion control requires separate network configuration such as PFC (Priority Flow Control)
- iWARP - Internet Wide Area RDMA Protocol (RDMA over TCP)
    - Relies on TCP's reliability and congestion control

### RDMA Atomic Operations

* RDMA Fetch-And-Add (FAA)
* RDMA Compare-And-Swap (CAS)

## 3. RDMA for KVCache offloading

RDMA is a networking protocol, not a memory bus.

* Dynamo
  NVIDIA
  https://developer.nvidia.com/ko-kr/blog/nvidia-dynamo-accelerates-llm-d-community-initiatives-for-advancing-large-scale-distributed-inference/
  https://github.com/ai-dynamo/dynamo
* MoonCake
  Serving platform for Kimi, a leading LLM service provided by Moonshot AI. KVCache-centric scheduler
  https://arxiv.org/abs/2407.00079
  https://www.usenix.org/conference/fast25/presentation/qin
  https://github.com/kvcache-ai/Mooncake

Limitations

- data path: extra data movement through host memory == extra latency
- control path: overhead for preparing requests and polling for completions, cross-component synchronization
- scheduling problem between local and remote memory

![image](https://hackmd.io/_uploads/r1y-U_Tfbl.png)

## 4. LLM Latency

* Time-to-First-Token (TTFT)
  A metric for evaluating LLM performance: the time from when the user submits a prompt until the first token is generated. It covers several steps, such as embedding the input data and prompt through the appropriate encoder, concatenating the embeddings, and prepending a Beginning of Sequence (BOS) token before feeding the input to the LLM, up to the point where the first output token appears.
* Time Per Output Token (TPOT)
  The average time taken to generate each subsequent output token after the first token has been generated.

## 5. in-database (in-DB) LLM inference capabilities

### in-DB LLM

Lets customers use generative AI with their own business documents without moving data to a separate vector database and without needing AI expertise.

### Retrieval-Augmented Generation (RAG) based intelligent Q&A

The goal is to overcome hallucination, a known limitation of LLMs.

* Fine-tuning - optimizes a pre-trained model by further training it on data from a specific domain; the result is too restricted to that domain
* RAG - the user enters a question > RAG retrieves information related to the question from an external database > the LLM answers based on the retrieved information -> this can be used for general purposes

1. Run your own data through an embedding model -> convert the data into vector form and build a vector DB > the Retriever (document searcher) uses this DB to find information related to the user's query
2. Query vectorization and relevant information extraction (augmentation)
    - Vectorize the user's question (query)
    - Apply various search techniques to the vector DB to extract the most relevant parts of the source information, or the top K items
    - The extracted information is given to the LLM together with the query text
3. Answer generation by the LLM
    - The final answer is generated from the query text and the extracted relevant information

![image](https://hackmd.io/_uploads/SkXK_P6fZg.png)

![image](https://hackmd.io/_uploads/Hyuj_vTGbg.png)

### natural language to SQL translation

Also called Text2SQL or NL2SQL. It is classified as a semantic parsing task: it takes a natural language question (NLQ) and a database schema (tables, columns, data types) as input and translates them into SQL that a SQL executor can run.

### Databases that support in-DB LLM

#### MindsDB

An open-source AI platform that lets you connect to and query data from more than 200 sources, including databases (PostgreSQL, MySQL) and SaaS apps (Slack, Gmail), using SQL or natural language.

https://mindsdb.com/

#### Aurora

Amazon. Distributed storage volume (InnoDB)

https://aws.amazon.com/ko/blogs/tech/auroramysql-monitoring-with-amazonbedrock/

#### PolarDB

Alibaba

https://www.alibabacloud.com/help/en/polardb/polardb-for-mysql/user-guide/large-language-model-use-cases

#### GaussDB

Huawei

https://arxiv.org/html/2506.23322v1

![image](https://hackmd.io/_uploads/BJhAKOaf-e.png)

#### Oracle

https://www.oracle.com/kr/news/announcement/oracle-announces-in-database-llms-and-automated-in-database-vector-store-with-heatwave-genai-2024-06-26/

## 6. CXL based RPC

Telepathic Datacenters: Fast RPCs using Shared CXL Memory
https://arxiv.org/abs/2408.11325

HydraRPC: RPC in the CXL Era
https://www.usenix.org/system/files/atc24-ma.pdf, https://www.youtube.com/watch?v=i2nA6NJvFmM
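
As a closing worked example for Section 1 (KVCache): the note there says the KV cache grows linearly with the context window and that GQA/MQA shrink it. The sketch below is a rough back-of-the-envelope estimate, not a measurement. It assumes one K and one V tensor per layer, each holding `n_kv_heads * head_dim` values per token. The function name `kv_cache_bytes` and the model shape (32 layers, 32 query heads, head dimension 128, fp16) are illustrative assumptions and are not taken from the Beluga paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    """Rough KV cache size estimate in bytes (hypothetical helper).

    Counts 2 tensors (K and V) per layer, each storing
    n_kv_heads * head_dim elements per token per sequence.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed model shape for illustration only.
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

GIB = 2 ** 30
for name, kv_heads in [("MHA", N_HEADS), ("GQA (8 KV heads)", 8), ("MQA", 1)]:
    for seq_len in (4096, 8192):
        size = kv_cache_bytes(seq_len, N_LAYERS, kv_heads, HEAD_DIM)
        print(f"{name:18s} seq_len={seq_len:5d}: {size / GIB:.4f} GiB")
```

Doubling `seq_len` doubles the estimate, which is the linear growth mentioned in Section 1, while reducing `n_kv_heads` (GQA, or 1 for MQA) reduces it proportionally.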