:::spoiler Semantic Caching for AI Agents: Overview

Semantic caching is a technique used to make AI agents faster and more cost-effective by reducing unnecessary LLM calls. It addresses two major production challenges: **high inference cost** and **latency**, which often limit how well AI applications can scale.

## Why Traditional Caching Falls Short

Traditional input-output caches only work when the input text is an exact match. This approach fails when users ask the same question using different wording. For example, “How can I get a refund?” and “I want my money back” are treated as completely different queries, even though they mean the same thing.

Semantic caching solves this by focusing on **meaning**, not exact text.

## How Semantic Caching Works

Semantic caching uses **embeddings** to represent queries in a semantic space. By measuring the distance between embeddings, the system can determine how similar two questions are in meaning. If two queries are semantically similar enough (based on a threshold), the system reuses a previously generated response instead of calling the LLM again. This reduces cost and improves response time without sacrificing quality.

## Building a Semantic Cache Step by Step

The course starts by building a semantic cache from scratch to explain how it works internally:

- Create embeddings for user queries
- Compare distances between embeddings
- Define a similarity threshold to decide when queries are “close enough”

After this, the cache is implemented using **Redis**, bringing it closer to a production setup.

## Production-Ready Features with Redis

Using Redis adds important real-world capabilities:

- **Time-to-live (TTL)** to keep the cache fresh and control its size
- **Separate caches** for different users, teams, or tenants
- Use of an **open-weight embedding model**, fine-tuned specifically for cache accuracy

These features make semantic caching practical for large-scale systems.

## Measuring Cache Performance

Once the cache is working, its effectiveness is measured using:

- **Hit rate** – how often the cache is used
- **Precision and recall** – how often cached responses are correct

These metrics are visualized using a **confusion matrix**, showing how changing the similarity threshold affects precision and recall. Latency is also measured to demonstrate how even a few cache hits can lead to significant time savings.

## Enhancing Cache Accuracy and Coverage

The course introduces four methods to improve cache quality:

- Optimizing the similarity threshold
- Using a **cross-encoder** for better re-ranking
- Adding a lightweight **LLM-based check** to confirm semantic equivalence
- Applying **fuzzy matching** to handle typos and minor spelling errors

These techniques improve both correctness and usability.

## Integrating Semantic Caching into AI Agents

Finally, semantic caching is integrated into an AI agent. The agent:

- Breaks large questions into smaller sub-questions
- Checks the cache for each part
- Calls the LLM only when necessary

Over time, as the cache warms up, model calls decrease while response quality remains high and latency improves significantly.

## Real-World Context and Next Steps

The course also covers a real-world use case published by Walmart, demonstrating how semantic caching techniques improve production systems at scale.
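
To ground the from-scratch build described in this overview, here is a minimal sketch of a semantic cache: embed each query, compare it against cached queries with cosine similarity, and reuse the stored answer when the best match clears a threshold. The class name, the `all-MiniLM-L6-v2` model, and the `0.80` threshold are illustrative assumptions, not the course's actual choices (the course uses an open-weight embedding model fine-tuned for cache accuracy).

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library


class SimpleSemanticCache:
    """Minimal in-memory semantic cache: embeddings + cosine similarity + threshold."""

    def __init__(self, model_name="all-MiniLM-L6-v2", threshold=0.80):
        self.model = SentenceTransformer(model_name)
        self.threshold = threshold        # "close enough" cutoff, tuned on labeled pairs
        self.entries = []                 # list of (query embedding, cached answer)

    def _embed(self, text):
        vec = self.model.encode(text)
        return vec / np.linalg.norm(vec)  # normalize so dot product == cosine similarity

    def lookup(self, query):
        """Return a cached answer if a semantically similar query was seen before."""
        if not self.entries:
            return None
        q = self._embed(query)
        sims = [float(np.dot(q, emb)) for emb, _ in self.entries]
        best = int(np.argmax(sims))
        return self.entries[best][1] if sims[best] >= self.threshold else None

    def store(self, query, answer):
        self.entries.append((self._embed(query), answer))


cache = SimpleSemanticCache()
cache.store("How can I get a refund?", "Go to Orders, select the item, and request a refund.")
print(cache.lookup("I want my money back"))        # hit only if similarity clears the threshold
print(cache.lookup("What are your store hours?"))  # miss -> None, so the agent would call the LLM
```

In practice the threshold is not hard-coded; it is tuned against labeled query pairs, which is exactly what the metrics and confusion-matrix section above is about.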
:::

:::spoiler

![image](https://hackmd.io/_uploads/SJeY83ZrZe.png)
![image](https://hackmd.io/_uploads/HkZ2I2brZg.png)
![image](https://hackmd.io/_uploads/S1gmgP2-rZl.png)
![image](https://hackmd.io/_uploads/B1Lbw2Wrbx.png)
![image](https://hackmd.io/_uploads/HyaSPh-Hbe.png)
![image](https://hackmd.io/_uploads/H1Cdw3bBWe.png)
![image](https://hackmd.io/_uploads/rynCvnWSWe.png)
![image](https://hackmd.io/_uploads/SJuGu3-S-e.png)
![image](https://hackmd.io/_uploads/r1nj-0WH-e.png)

:::

:::spoiler Semantic Caching and Why It Matters

## What Is Semantic Caching and Why It Matters

Semantic caching helps AI agents **reuse previous results** to reduce **cost and latency**. While model quality generally improves with higher cost per token, price and response time remain major blockers for real-world deployments. Advanced reasoning models like GPT-5 or Claude are powerful but expensive and slow, making inference the dominant cost in many AI systems.

## RAG Systems and the Cost Problem

Many organizations use **Retrieval-Augmented Generation (RAG)** to inject domain-specific and up-to-date information into LLM prompts. RAG reduces hallucinations and keeps responses current, but **AI agents are inherently token-hungry**. They plan, act, reflect, and iterate across multiple LLM calls, increasing:

- Token usage
- Latency
- Prompt size over time

Benchmarks show that a single agent execution for real-world tasks can cost **several dollars per request**, making optimization critical.

## Why Traditional Caching Fails for Natural Language

Caching is based on a simple principle: avoid recomputing answers to repeated questions. However, **exact-match caching does not work well for natural language**. Different phrasings of the same question (e.g., “How do I get a refund?” vs. “I want my money back”) result in cache misses.

- Exact-match caching:
  - Perfect precision
  - Very poor recall
  - Extremely low cache hit rates for natural language

## How Semantic Caching Works

Semantic caching solves this by caching based on **meaning**, not exact text.

**Core workflow:**

1. Embed the user query into a vector
2. Compare it against cached vectors using semantic similarity
3. Classify the result:
   - If similarity is above a threshold → return the cached answer
   - If not → call the RAG + LLM pipeline
4. Store the new question–answer pair in the cache for future reuse

This approach increases recall and cache hit rate but introduces the risk of **false positives**, where an incorrect cached answer may be returned.

## Vector Search as the Backbone

Semantic caching relies on **vector search**, where vectors represent meaning using numerical embeddings. Vector search is already widely used in:

- Search and content discovery
- Recommendation systems
- Fraud and anomaly detection

Semantic caching builds on these ideas but adds new production concerns.

## Production Challenges in Semantic Caching

Beyond similarity search, production-grade semantic caching must address:

- **Accuracy**: Are cache hits actually correct?
- **Performance**: Is the cache hit rate high enough to justify its cost?
- **Scalability**: Can the cache be served without increasing latency?
- **Updatability**: Can the cache be refreshed, invalidated, or warmed?
- **Observability**: Can we measure hit rate, latency, cost savings, and quality?
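
Several of these production concerns, TTL-based freshness, per-tenant separation, and fast lookups, can be sketched with plain `redis-py`. The sketch below namespaces keys per tenant, attaches a TTL to every entry, and does a linear scan over cached embeddings for clarity; the key prefix, TTL value, and embedding model are assumptions, and a production deployment (including the Redis-based one in the course) would use a Redis vector index rather than scanning keys.

```python
import json
import uuid

import numpy as np
import redis
from sentence_transformers import SentenceTransformer  # assumed embedding library

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
THRESHOLD = 0.80                                  # tune on labeled query pairs
TTL_SECONDS = 3600                                # entries expire after one hour


def _embed(text):
    vec = model.encode(text)
    return vec / np.linalg.norm(vec)  # normalize so dot product == cosine similarity


def store(tenant, query, response):
    """Cache a response under the tenant's namespace, with a TTL for freshness."""
    key = f"semcache:{tenant}:{uuid.uuid4().hex}"   # hypothetical key scheme
    payload = {"embedding": _embed(query).tolist(), "response": response}
    r.setex(key, TTL_SECONDS, json.dumps(payload))


def check(tenant, query):
    """Return a cached response for this tenant if a similar query exists, else None."""
    q = _embed(query)
    best_sim, best_resp = -1.0, None
    for key in r.scan_iter(match=f"semcache:{tenant}:*"):
        raw = r.get(key)
        if raw is None:            # key may have expired between scan and get
            continue
        entry = json.loads(raw)
        sim = float(np.dot(q, np.array(entry["embedding"])))
        if sim > best_sim:
            best_sim, best_resp = sim, entry["response"]
    return best_resp if best_sim >= THRESHOLD else None
```

The per-tenant key prefix is what gives each team or application its own cache on shared infrastructure, and the TTL keeps stale answers from being served indefinitely — the two Redis features called out in the overview above.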
## Key Metrics for Evaluating a Semantic Cache

The course focuses on four core metrics:

- **Cache Hit Rate**
  - Frequency of cache hits for a given similarity threshold
  - Primary driver of cost savings
- **Precision**
  - How often cache hits are correct
- **Recall**
  - How often the cache correctly identifies reusable answers
- **F1 Score**
  - Balance between precision and recall

These metrics help teams understand both the **effectiveness** and the **safety** of the cache.

## Techniques to Improve Cache Quality

Several techniques are introduced to improve precision, recall, and efficiency:

- Tuning the similarity distance threshold
- Adding **re-ranking steps** using:
  - Cross-encoder models
  - LLM-as-a-judge
- **Fuzzy matching** for typos and exact-match cases (applied before embeddings)
- Filters for:
  - Temporal or time-sensitive queries
  - Code-related queries (e.g., Python, Java)

Some queries may bypass caching entirely to preserve correctness and save compute.

## Real-World Example: Walmart’s Semantic Cache

Walmart published a production system called **waLLMartCache**, used for both internal tools and customer support.

### Key Design Choices

- **Load Balancer**
  - Enables horizontal scaling across many nodes
- **Dual-Tier Cache**
  - L1: Vector database for semantic search
  - L2: In-memory store (e.g., Redis) for fast data retrieval
- **Multi-Tenancy**
  - Supports multiple teams and applications on shared infrastructure
- **Decision Engine**
  - Detects code-related or time-sensitive queries
  - Routes them directly to the RAG/LLM pipeline instead of the cache

These techniques helped Walmart achieve **nearly 90% accuracy**.

## Semantic Caching Inside an AI Agent

The course concludes by building an AI agent using **LangGraph**:

- The agent decomposes large user queries into smaller sub-queries
- Each sub-query checks the semantic cache
- Unanswered queries trigger RAG and additional reasoning
- A final LLM call synthesizes a personalized response

The system includes a frontend that scrapes website content, allowing users to chat with their own data.

## Key Takeaway

Semantic caching is essential for making AI agents scalable and affordable. By reusing meaning-aware answers, systems reduce cost and latency while maintaining quality. With proper metrics, safeguards, and architecture, semantic caching becomes a powerful foundation for production AI agents.

:::
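
As a closing illustration of the evaluation metrics above, the small sketch below sweeps the similarity threshold over a handful of hypothetical (similarity score, was-the-cached-answer-actually-correct) pairs and reports hit rate, precision, recall, and F1. The numbers are made up purely to show the trade-off; a real evaluation would use labeled query pairs from production logs.

```python
# Each example: (similarity of the best cached match, whether that cached answer truly applies).
# Hypothetical data for illustration only.
examples = [
    (0.95, True), (0.88, True), (0.82, False), (0.75, True),
    (0.66, False), (0.91, True), (0.58, False), (0.79, True),
]

for threshold in (0.6, 0.7, 0.8, 0.9):
    tp = sum(1 for s, ok in examples if s >= threshold and ok)       # correct cache hits
    fp = sum(1 for s, ok in examples if s >= threshold and not ok)   # wrong answers served
    fn = sum(1 for s, ok in examples if s < threshold and ok)        # reusable answers missed
    hits = tp + fp
    hit_rate = hits / len(examples)
    precision = tp / hits if hits else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    print(f"threshold={threshold:.1f}  hit_rate={hit_rate:.2f}  "
          f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

Raising the threshold trades recall (and hit rate, hence cost savings) for precision (safety), which is the same trade-off the confusion-matrix exercise in the course is meant to expose.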