:::spoiler Semantic Caching for AI Agents: Overview
## Semantic Caching for AI Agents: Overview
Semantic caching is a technique used to make AI agents faster and more cost-effective by reducing unnecessary LLM calls. It addresses two major production challenges: **high inference cost** and **latency**, which often limit how well AI applications can scale.
## Why Traditional Caching Falls Short
Traditional input-output caches only work when the input text is an exact match. This approach fails when users ask the same question using different wording. For example, “How can I get a refund?” and “I want my money back” are treated as completely different queries, even though they mean the same thing.
Semantic caching solves this by focusing on **meaning**, not exact text.
## How Semantic Caching Works
Semantic caching uses **embeddings** to represent queries in a semantic space. By measuring the distance between embeddings, the system can determine how similar two questions are in meaning.
If two queries are semantically similar enough (based on a threshold), the system reuses a previously generated response instead of calling the LLM again. This reduces cost and improves response time without sacrificing quality.
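As a minimal sketch of this idea (assuming the open-source `sentence-transformers` library and the `all-MiniLM-L6-v2` embedding model, neither of which is specified in this overview), two differently worded refund questions can be compared like this:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; any sentence-embedding model works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

query_a = "How can I get a refund?"
query_b = "I want my money back"

# Encode both queries into vectors and measure how close they are in meaning.
emb_a, emb_b = model.encode([query_a, query_b], convert_to_tensor=True)
similarity = util.cos_sim(emb_a, emb_b).item()

THRESHOLD = 0.7  # illustrative value; threshold tuning is covered later
print(f"similarity={similarity:.2f}, reuse cached answer={similarity >= THRESHOLD}")
```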
## Building a Semantic Cache Step by Step
The course starts by building a semantic cache from scratch to explain how it works internally:
- Create embeddings for user queries
- Compare distances between embeddings
- Define a similarity threshold to decide when queries are “close enough”
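A from-scratch version of these three steps might look like the sketch below (the class, threshold value, and embedding model are illustrative assumptions, not the course's exact code):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

class SimpleSemanticCache:
    """In-memory semantic cache: embed, compare distances, reuse when close enough."""

    def __init__(self, threshold: float = 0.8):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (query embedding, answer)

    def _embed(self, text: str) -> np.ndarray:
        vec = self.model.encode(text)
        return vec / np.linalg.norm(vec)  # normalize so dot product == cosine similarity

    def lookup(self, query: str) -> str | None:
        """Return a cached answer if any stored query is close enough in meaning."""
        q = self._embed(query)
        for emb, answer in self.entries:
            if float(np.dot(q, emb)) >= self.threshold:
                return answer
        return None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((self._embed(query), answer))
```

On a miss, `lookup` returns `None`; the caller then asks the LLM and calls `store`, so the next rephrasing of the same question can be served from the cache.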
After this, the cache is implemented using **Redis**, bringing it closer to a production setup.
## Production-Ready Features with Redis
Using Redis adds important real-world capabilities:
- **Time-to-live (TTL)** to keep the cache fresh and control size
- **Separate caches** for different users, teams, or tenants
- Use of an **open-weight embedding model**, fine-tuned specifically for cache accuracy
These features make semantic caching practical for large-scale systems.
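A simplified sketch of TTL and per-tenant separation with the `redis-py` client is shown below (a real deployment would use Redis's vector index for the similarity search rather than this brute-force scan; key names and the TTL value are illustrative):

```python
import json
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_store(tenant: str, query: str, query_vec: np.ndarray, answer: str,
                ttl_seconds: int = 3600) -> None:
    """Store one entry under a tenant-scoped key; the TTL keeps the cache fresh."""
    key = f"semcache:{tenant}:{abs(hash(query))}"
    payload = json.dumps({"vec": query_vec.tolist(), "answer": answer})
    r.setex(key, ttl_seconds, payload)  # expires automatically after ttl_seconds

def cache_lookup(tenant: str, query_vec: np.ndarray, threshold: float = 0.8):
    """Scan only this tenant's keys and return the closest stored answer, if any."""
    for key in r.scan_iter(match=f"semcache:{tenant}:*"):
        entry = json.loads(r.get(key))
        vec = np.array(entry["vec"])
        sim = float(np.dot(query_vec, vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if sim >= threshold:
            return entry["answer"]
    return None
```

The TTL on `setex` handles freshness and size control, while the `semcache:{tenant}:` key prefix keeps each team's entries isolated on shared infrastructure.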
## Measuring Cache Performance
Once the cache is working, its effectiveness is measured using:
- **Hit rate** – how often the cache is used
- **Precision and recall** – whether served cache hits are actually correct, and how many reusable answers the cache catches
These metrics are visualized using a **confusion matrix**, showing how changing the similarity threshold affects precision and recall. Latency is also measured to demonstrate how even a few cache hits can lead to significant time savings.
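One way to compute these numbers from a small hand-labeled evaluation set (the labels and threshold are supplied by the evaluator; this sketch is not the course's exact code):

```python
def evaluate_cache(pairs, threshold):
    """pairs: (similarity, should_match) tuples from a labeled evaluation set."""
    tp = fp = fn = tn = 0
    for similarity, should_match in pairs:
        predicted_hit = similarity >= threshold
        if predicted_hit and should_match:
            tp += 1   # correct cache hit
        elif predicted_hit and not should_match:
            fp += 1   # wrong answer served from the cache
        elif should_match:
            fn += 1   # reusable answer missed (unnecessary LLM call)
        else:
            tn += 1   # correctly sent to the LLM
    hit_rate = (tp + fp) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"hit_rate": hit_rate, "precision": precision, "recall": recall,
            "confusion_matrix": [[tp, fn], [fp, tn]]}
```

Sweeping `threshold` over a range and re-running this function reproduces the precision/recall trade-off the confusion matrix visualizes.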
## Enhancing Cache Accuracy and Coverage
The course introduces four methods to improve cache quality:
- Optimizing the similarity threshold
- Using a **cross-encoder** for better re-ranking
- Adding a lightweight **LLM-based check** to confirm semantic equivalence
- Applying **fuzzy matching** to handle typos and minor spelling errors
These techniques improve both correctness and usability.
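Two of these checks can be sketched with the `sentence-transformers` `CrossEncoder` class and the standard-library `difflib` module (the model name and cut-off scores are assumptions, not the course's settings):

```python
from difflib import SequenceMatcher
from sentence_transformers import CrossEncoder

# A cross-encoder scores the query pair jointly, which is usually more accurate
# than comparing two independently computed embeddings.
reranker = CrossEncoder("cross-encoder/stsb-roberta-base")  # assumed model choice

def confirm_with_cross_encoder(query: str, cached_query: str, cutoff: float = 0.6) -> bool:
    """Re-rank a candidate cache hit before trusting it."""
    score = float(reranker.predict([(query, cached_query)])[0])
    return score >= cutoff

def fuzzy_match(query: str, cached_query: str, cutoff: float = 0.9) -> bool:
    """Cheap character-level check that catches typos before any embedding work."""
    return SequenceMatcher(None, query.lower(), cached_query.lower()).ratio() >= cutoff
```

An LLM-based check follows the same pattern: pass both questions to a small model and ask for a yes/no verdict before serving the cached answer.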
## Integrating Semantic Caching into AI Agents
Semantic caching is finally integrated into an AI agent. The agent:
- Breaks large questions into smaller sub-questions
- Checks the cache for each part
- Calls the LLM only when necessary
Over time, as the cache warms up, model calls decrease while response quality remains high and latency improves significantly.
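A plain-Python sketch of that loop (`decompose`, `call_llm`, and `synthesize` are hypothetical placeholders for the agent's own components):

```python
def answer_with_cache(question: str, cache, decompose, call_llm, synthesize) -> str:
    partial_answers = []
    for sub_question in decompose(question):      # break the question into parts
        cached = cache.lookup(sub_question)       # semantic lookup per part
        if cached is not None:
            partial_answers.append(cached)        # cache hit: no LLM call needed
        else:
            fresh = call_llm(sub_question)        # cache miss: call the model
            cache.store(sub_question, fresh)      # warm the cache for next time
            partial_answers.append(fresh)
    return synthesize(question, partial_answers)  # combine parts into one answer
```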
## Real-World Context and Next Steps
The course also covers a real-world use case published by Walmart, demonstrating how semantic caching techniques improve production systems at scale.
:::
:::spoiler Semantic Caching and Why It Matters
## What Is Semantic Caching and Why It Matters
Semantic caching helps AI agents **reuse previous results** to reduce **cost and latency**. Model quality generally scales with cost per token, so price and response time remain major blockers for real-world deployments. Advanced reasoning models like GPT-5 or Claude are powerful but expensive and slow, making inference the dominant cost in many AI systems.
## RAG Systems and the Cost Problem
Many organizations use **Retrieval-Augmented Generation (RAG)** to inject domain-specific and up-to-date information into LLM prompts. RAG reduces hallucinations and keeps responses current, but **AI agents are inherently token-hungry**. They plan, act, reflect, and iterate across multiple LLM calls, increasing:
- Token usage
- Latency
- Prompt size over time
Benchmarks show that a single agent execution for real-world tasks can cost **several dollars per request**, making optimization critical.
## Why Traditional Caching Fails for Natural Language
Caching is based on a simple principle: avoid recomputing answers to repeated questions. However, **exact-match caching does not work well for natural language**. Different phrasings of the same question (e.g., “How do I get a refund?” vs. “I want my money back”) result in cache misses.
- Exact-match caching:
  - Perfect precision
  - Very poor recall
  - Extremely low cache hit rates for natural language
## How Semantic Caching Works
Semantic caching solves this by caching based on **meaning**, not exact text.
**Core workflow:**
1. Embed the user query into a vector
2. Compare it against cached vectors using semantic similarity
3. Classify the result:
   - If similarity is above a threshold → return cached answer
   - If not → call the RAG + LLM pipeline
4. Store the new question–answer pair in the cache for future reuse
This approach increases recall and cache hit rate but introduces the risk of **false positives**, where an incorrect cached answer may be returned.
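In code, the workflow reduces to a thin wrapper around the cache and the RAG pipeline (a minimal sketch; `cache` and `rag_llm_pipeline` stand in for the components described above):

```python
def answer(query: str, cache, rag_llm_pipeline) -> str:
    # Steps 1-3: embed the query, compare against cached vectors, classify.
    cached_answer = cache.lookup(query)
    if cached_answer is not None:
        return cached_answer              # cache hit: no LLM call

    # Cache miss: run the full RAG + LLM pipeline.
    fresh_answer = rag_llm_pipeline(query)

    # Step 4: store the new question-answer pair for future reuse.
    cache.store(query, fresh_answer)
    return fresh_answer
```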
## Vector Search as the Backbone
Semantic caching relies on **vector search**, where vectors represent meaning using numerical embeddings. Vector search is already widely used in:
- Search and content discovery
- Recommendation systems
- Fraud and anomaly detection
Semantic caching builds on these ideas but adds new production concerns.
## Production Challenges in Semantic Caching
Beyond similarity search, production-grade semantic caching must address:
- **Accuracy**: Are cache hits actually correct?
- **Performance**: Is the cache hit rate high enough to justify its cost?
- **Scalability**: Can the cache be served without increasing latency?
- **Updatability**: Can the cache be refreshed, invalidated, or warmed?
- **Observability**: Can we measure hit rate, latency, cost savings, and quality?
## Key Metrics for Evaluating a Semantic Cache
The course focuses on four core metrics:
- **Cache Hit Rate**
  - Frequency of cache hits for a given similarity threshold
  - Primary driver of cost savings
- **Precision**
  - How often cache hits are correct
- **Recall**
  - How often the cache correctly identifies reusable answers
- **F1 Score**
  - Balance between precision and recall
These metrics help teams understand both **effectiveness** and **safety** of the cache.
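In terms of the cache's confusion matrix (a true positive is a correct served cache hit, a false positive an incorrect one, and a false negative a reusable answer the cache missed), the standard definitions are:

$$
\text{Hit rate} = \frac{TP + FP}{\text{all queries}}, \qquad
\text{Precision} = \frac{TP}{TP + FP}
$$

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$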
## Techniques to Improve Cache Quality
Several techniques are introduced to improve precision, recall, and efficiency:
- Tuning the similarity distance threshold
- Adding **re-ranking steps** using:
  - Cross-encoder models
  - LLM-as-a-judge
- **Fuzzy matching** for typos and exact-match cases (before embeddings)
- Filters for:
  - Temporal or time-sensitive queries
  - Code-related queries (e.g., Python, Java)
Some queries may bypass caching entirely to preserve correctness and save compute.
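A lightweight bypass filter can be as simple as keyword and pattern checks run before the cache is consulted (the keyword list and regex below are illustrative, not the course's rules):

```python
import re

TEMPORAL_HINTS = {"today", "now", "latest", "current", "this week", "yesterday"}
CODE_PATTERN = re.compile(r"\bdef |\bclass |\bimport |\bpublic static")

def should_bypass_cache(query: str) -> bool:
    """Route time-sensitive or code-related queries straight to RAG/LLM."""
    lowered = query.lower()
    if any(hint in lowered for hint in TEMPORAL_HINTS):
        return True   # the answer may go stale; don't serve or store it
    if CODE_PATTERN.search(query):
        return True   # code questions are usually too specific to reuse safely
    return False
```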
## Real-World Example: Walmart’s Semantic Cache
Walmart published a production system called **waLLMartCache**, used for both internal tools and customer support.
### Key Design Choices
- **Load Balancer**
  - Enables horizontal scaling across many nodes
- **Dual-Tier Cache**
  - L1: Vector database for semantic search
  - L2: In-memory store (e.g., Redis) for fast data retrieval
- **Multi-Tenancy**
  - Supports multiple teams and applications on shared infrastructure
- **Decision Engine**
  - Detects code-related or time-sensitive queries
  - Routes them directly to RAG/LLM instead of the cache
These techniques helped Walmart achieve **nearly 90% accuracy**.
## Semantic Caching Inside an AI Agent
The course concludes by building an AI agent using **LangGraph**:
- The agent decomposes large user queries into smaller sub-queries
- Each sub-query checks the semantic cache
- Unanswered queries trigger RAG and additional reasoning
- A final LLM synthesizes a personalized response
The system includes a frontend that scrapes website content, allowing users to chat with their own data.
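A skeletal version of the graph wiring might look like this (a sketch assuming LangGraph's `StateGraph` API; the node functions are placeholders, not the course's implementation):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    question: str
    sub_questions: list[str]
    answers: list[str]
    final_answer: str

def decompose(state: AgentState) -> dict:
    # Placeholder: an LLM call would split the question into sub-questions.
    return {"sub_questions": [state["question"]], "answers": []}

def answer_parts(state: AgentState) -> dict:
    # Placeholder: semantic-cache lookup per sub-question, RAG + LLM on misses.
    return {"answers": [f"answer to: {q}" for q in state["sub_questions"]]}

def synthesize(state: AgentState) -> dict:
    # Placeholder: a final LLM call would merge partial answers into one response.
    return {"final_answer": " ".join(state["answers"])}

builder = StateGraph(AgentState)
builder.add_node("decompose", decompose)
builder.add_node("answer_parts", answer_parts)
builder.add_node("synthesize", synthesize)
builder.add_edge(START, "decompose")
builder.add_edge("decompose", "answer_parts")
builder.add_edge("answer_parts", "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()
result = graph.invoke({"question": "How do refunds and returns work?"})
print(result["final_answer"])
```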
## Key Takeaway
Semantic caching is essential for making AI agents scalable and affordable. By reusing meaning-aware answers, systems reduce cost and latency while maintaining quality. With proper metrics, safeguards, and architecture, semantic caching becomes a powerful foundation for production AI agents.
:::