---
title: Semantic Caching for AI Agents
tags: [DeepLearning]

---

:::spoiler

## Semantic Caching for AI Agents: Overview

Semantic caching is a technique used to make AI agents faster and more cost-effective by reducing unnecessary LLM calls. It addresses two major production challenges: **high inference cost** and **latency**, which often limit how well AI applications can scale.


## Why Traditional Caching Falls Short

Traditional input-output caches only work when the input text is an exact match. This approach fails when users ask the same question using different wording. For example, “How can I get a refund?” and “I want my money back” are treated as completely different queries, even though they mean the same thing.

Semantic caching solves this by focusing on **meaning**, not exact text.

## How Semantic Caching Works

Semantic caching uses **embeddings** to represent queries in a semantic space. By measuring the distance between embeddings, the system can determine how similar two questions are in meaning.

If two queries are semantically similar enough (based on a threshold), the system reuses a previously generated response instead of calling the LLM again. This reduces cost and improves response time without sacrificing quality.
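
As a minimal sketch of that similarity check (the embedding model and the threshold value below are illustrative, not the course's exact choices):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative open-weight model; the course's embedding model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q1 = model.encode("How can I get a refund?")
q2 = model.encode("I want my money back")

THRESHOLD = 0.80  # application-specific; tuning it is covered below
is_hit = cosine_similarity(q1, q2) >= THRESHOLD  # True -> reuse the cached answer
```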

## Building a Semantic Cache Step by Step

The course starts by building a semantic cache from scratch to explain how it works internally:
- Create embeddings for user queries
- Compare distances between embeddings
- Define a similarity threshold to decide when queries are “close enough”
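
Those three steps, from scratch and in memory, might look roughly like this (a sketch; `embed_fn` stands in for whichever embedding call is used):

```python
import numpy as np

class SimpleSemanticCache:
    """Minimal in-memory semantic cache holding (embedding, answer) pairs."""

    def __init__(self, embed_fn, threshold: float = 0.85):
        self.embed_fn = embed_fn    # callable: str -> 1-D numpy array
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, answer) tuples

    def lookup(self, query: str):
        """Return a cached answer if any stored query is close enough, else None."""
        q = self.embed_fn(query)
        for emb, answer in self.entries:
            sim = float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb)))
            if sim >= self.threshold:
                return answer
        return None

    def store(self, query: str, answer: str) -> None:
        """Add a new question-answer pair so future similar queries can reuse it."""
        self.entries.append((self.embed_fn(query), answer))
```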

After this, the cache is implemented using **Redis**, bringing it closer to a production setup.

## Production-Ready Features with Redis

Using Redis adds important real-world capabilities:
- **Time-to-live (TTL)** to keep the cache fresh and control size
- **Separate caches** for different users, teams, or tenants
- An **open-weight embedding model**, fine-tuned specifically for cache accuracy

These features make semantic caching practical for large-scale systems.
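
A rough sketch of how TTL and per-tenant separation could be layered on with plain redis-py (the key scheme is illustrative; the semantic similarity lookup itself would use Redis' vector search rather than exact keys):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_store(tenant: str, query: str, answer: str, ttl_seconds: int = 3600):
    """Write an entry under a tenant-scoped key with a TTL so stale answers expire."""
    key = f"semcache:{tenant}:{query}"
    r.set(key, json.dumps({"query": query, "answer": answer}), ex=ttl_seconds)

def cache_get(tenant: str, query: str):
    """Exact-key read; tenants never see each other's entries because of the prefix."""
    raw = r.get(f"semcache:{tenant}:{query}")
    return json.loads(raw)["answer"] if raw else None
```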

## Measuring Cache Performance

Once the cache is working, its effectiveness is measured using:
- **Hit rate** – how often the cache is used
- **Precision and recall** – how often cache hits are actually correct, and how many reusable answers the cache actually finds

These metrics are visualized using a **confusion matrix**, showing how changing the similarity threshold affects precision and recall. Latency is also measured to demonstrate how even a few cache hits can lead to significant time savings.
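
Given a labeled evaluation set (was it a cache hit, was the hit correct, did a reusable answer exist), the metrics fall straight out of the confusion-matrix counts; a minimal sketch:

```python
def cache_metrics(results):
    """results: list of (cache_hit, hit_was_correct, reusable_answer_existed) booleans."""
    tp = sum(1 for hit, correct, _ in results if hit and correct)        # good hits
    fp = sum(1 for hit, correct, _ in results if hit and not correct)    # wrong hits
    fn = sum(1 for hit, _, reusable in results if not hit and reusable)  # missed reuse
    hits = tp + fp

    return {
        "hit_rate": hits / len(results),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Recomputing these numbers while sweeping the similarity threshold reproduces the precision/recall trade-off the confusion matrix visualizes.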

## Enhancing Cache Accuracy and Coverage

The course introduces four methods to improve cache quality:
- Optimizing the similarity threshold
- Using a **cross-encoder** for better re-ranking
- Adding a lightweight **LLM-based check** to confirm semantic equivalence
- Applying **fuzzy matching** to handle typos and minor spelling errors

These techniques improve both correctness and usability.
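
For example, a cross-encoder can double-check a candidate hit before the cached answer is returned (a sketch using the sentence-transformers `CrossEncoder`; the model name and cut-off are illustrative):

```python
from sentence_transformers import CrossEncoder

# Illustrative model; any pretrained sentence-pair similarity cross-encoder works similarly.
reranker = CrossEncoder("cross-encoder/stsb-roberta-base")

def confirm_hit(new_query: str, cached_query: str, min_score: float = 0.7) -> bool:
    """Score the two queries jointly and only accept the cache hit above a cut-off."""
    score = reranker.predict([(new_query, cached_query)])[0]
    return float(score) >= min_score
```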

## Integrating Semantic Caching into AI Agents

Finally, semantic caching is integrated into an AI agent. The agent:
- Breaks large questions into smaller sub-questions
- Checks the cache for each part
- Calls the LLM only when necessary

Over time, as the cache warms up, model calls decrease while response quality remains high and latency improves significantly.
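
A stripped-down version of that loop (the `decompose` and `call_llm` helpers are placeholders, not the course's actual functions):

```python
def answer_with_cache(question: str, cache, decompose, call_llm):
    """Answer each sub-question from the cache when possible, otherwise via the LLM."""
    answers = []
    for part in decompose(question):     # break the large question into sub-questions
        cached = cache.lookup(part)
        if cached is not None:
            answers.append(cached)       # cache hit: no model call
        else:
            fresh = call_llm(part)       # cache miss: fall back to the LLM
            cache.store(part, fresh)     # warm the cache for next time
            answers.append(fresh)
    return answers
```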

## Real-World Context and Next Steps

The course also covers a real-world use case published by Walmart, demonstrating how semantic caching techniques improve production systems at scale.

:::

:::spoiler

![image](https://hackmd.io/_uploads/SJeY83ZrZe.png)

![image](https://hackmd.io/_uploads/HkZ2I2brZg.png)


![image](https://hackmd.io/_uploads/S1gmgP2-rZl.png)

![image](https://hackmd.io/_uploads/B1Lbw2Wrbx.png)

![image](https://hackmd.io/_uploads/HyaSPh-Hbe.png)

![image](https://hackmd.io/_uploads/H1Cdw3bBWe.png)

![image](https://hackmd.io/_uploads/rynCvnWSWe.png)

![image](https://hackmd.io/_uploads/SJuGu3-S-e.png)

![image](https://hackmd.io/_uploads/r1nj-0WH-e.png)


:::

:::spoiler Semantic Caching and Why It Matters


## What Is Semantic Caching and Why It Matters

Semantic caching helps AI agents **reuse previous results** to reduce **cost and latency**. While model quality generally improves with higher cost per token, price and response time remain major blockers for real-world deployments. Advanced reasoning models like GPT-5 or Claude are powerful but expensive and slow, making inference the dominant cost in many AI systems.

## RAG Systems and the Cost Problem

Many organizations use **Retrieval-Augmented Generation (RAG)** to inject domain-specific and up-to-date information into LLM prompts. RAG reduces hallucinations and keeps responses current, but **AI agents are inherently token-hungry**. They plan, act, reflect, and iterate across multiple LLM calls, increasing:
- Token usage
- Latency
- Prompt size over time

Benchmarks show that a single agent execution for real-world tasks can cost **several dollars per request**, making optimization critical.

## Why Traditional Caching Fails for Natural Language

Caching is based on a simple principle: avoid recomputing answers to repeated questions. However, **exact-match caching does not work well for natural language**. Different phrasings of the same question (e.g., “How do I get a refund?” vs. “I want my money back”) result in cache misses.

- Exact-match caching:
  - Perfect precision
  - Very poor recall
  - Extremely low cache hit rates for natural language

## How Semantic Caching Works

Semantic caching solves this by caching based on **meaning**, not exact text.

**Core workflow:**
1. Embed the user query into a vector
2. Compare it against cached vectors using semantic similarity
3. Classify the result:
   - If similarity is above a threshold → return cached answer
   - If not → call the RAG + LLM pipeline
4. Store the new question–answer pair in the cache for future reuse

This approach increases recall and cache hit rate but introduces the risk of **false positives**, where an incorrect cached answer may be returned.
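
Put together, the workflow is a thin wrapper around the normal pipeline (illustrative; `cache` is any object with the lookup/store interface above, and `rag_llm_pipeline` stands in for the retrieval-plus-generation call):

```python
def answer(query: str, cache, rag_llm_pipeline):
    """Semantic-cache workflow: check the cache, otherwise run RAG + LLM and store."""
    cached = cache.lookup(query)      # steps 1-3: embed, compare, classify
    if cached is not None:
        return cached                 # above threshold: reuse the cached answer
    fresh = rag_llm_pipeline(query)   # below threshold: full RAG + LLM call
    cache.store(query, fresh)         # step 4: keep the new pair for future reuse
    return fresh
```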

## Vector Search as the Backbone

Semantic caching relies on **vector search**: queries are represented as numerical embedding vectors, and similarity in meaning corresponds to closeness in vector space. Vector search is already widely used in:
- Search and content discovery
- Recommendation systems
- Fraud and anomaly detection

Semantic caching builds on these ideas but adds new production concerns.

## Production Challenges in Semantic Caching

Beyond similarity search, production-grade semantic caching must address:

- **Accuracy**: Are cache hits actually correct?
- **Performance**: Is the cache hit rate high enough to justify its cost?
- **Scalability**: Can the cache be served without increasing latency?
- **Updatability**: Can the cache be refreshed, invalidated, or warmed?
- **Observability**: Can we measure hit rate, latency, cost savings, and quality?

## Key Metrics for Evaluating a Semantic Cache

The course focuses on four core metrics:

- **Cache Hit Rate**
  - Frequency of cache hits for a given similarity threshold
  - Primary driver of cost savings

- **Precision**
  - How often cache hits are correct

- **Recall**
  - How often the cache correctly identifies reusable answers

- **F1 Score**
  - Balance between precision and recall

These metrics help teams understand both **effectiveness** and **safety** of the cache.
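
In confusion-matrix terms (TP = correct cache hits, FP = incorrect cache hits, FN = reusable answers the cache missed):

$$
\text{Hit rate} = \frac{\text{cache hits}}{\text{total queries}}, \quad
\text{Precision} = \frac{TP}{TP + FP}, \quad
\text{Recall} = \frac{TP}{TP + FN}, \quad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$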

## Techniques to Improve Cache Quality

Several techniques are introduced to improve precision, recall, and efficiency:

- Tuning the similarity distance threshold
- Adding **re-ranking steps** using:
  - Cross-encoder models
  - LLM-as-a-judge
- **Fuzzy matching** to handle typos and near-exact matches cheaply (applied before the embedding step)
- Filters for:
  - Temporal or time-sensitive queries
  - Code-related queries (e.g., Python, Java)

Some queries may bypass caching entirely to preserve correctness and save compute.
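
A sketch of those cheap pre-checks using only the standard library (the cut-off and keyword lists are illustrative, not the course's values):

```python
import difflib
import re

def fuzzy_match(query: str, cached_query: str, cutoff: float = 0.9) -> bool:
    """Catch typos and near-duplicates with character-level similarity,
    before any embedding is computed."""
    ratio = difflib.SequenceMatcher(None, query.lower(), cached_query.lower()).ratio()
    return ratio >= cutoff

def should_bypass_cache(query: str) -> bool:
    """Send time-sensitive or code-related queries straight to RAG/LLM."""
    temporal = re.search(r"\b(today|now|latest|current|this week)\b", query, re.I)
    code = re.search(r"\b(python|java|sql|code|function|script)\b", query, re.I)
    return bool(temporal or code)
```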

## Real-World Example: Walmart’s Semantic Cache

Walmart published a production system called **waLLMartCache**, used for both internal tools and customer support.

### Key Design Choices
- **Load Balancer**
  - Enables horizontal scaling across many nodes

- **Dual-Tier Cache**
  - L1: Vector database for semantic search
  - L2: In-memory store (e.g., Redis) for fast data retrieval

- **Multi-Tenancy**
  - Supports multiple teams and applications on shared infrastructure

- **Decision Engine**
  - Detects code-related or time-sensitive queries
  - Routes them directly to RAG/LLM instead of the cache

These techniques helped Walmart achieve **nearly 90% accuracy**.

## Semantic Caching Inside an AI Agent

The course concludes by building an AI agent using **LangGraph**:

- The agent decomposes large user queries into smaller sub-queries
- Each sub-query checks the semantic cache
- Unanswered queries trigger RAG and additional reasoning
- A final LLM synthesizes a personalized response
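
A rough LangGraph wiring for that flow might look like the following (node bodies are stubs standing in for the course's cache, RAG, and LLM logic):

```python
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

# Placeholder helpers so the sketch runs; the real notebooks wire in the cache, RAG, and LLM.
def split_into_sub_questions(q): return [q]
def semantic_cache_lookup(q): return None              # always a miss in this stub
def rag_llm(q): return f"(fresh answer to: {q})"
def combine_with_llm(q, parts): return " ".join(parts)

class AgentState(TypedDict):
    question: str
    sub_questions: List[str]
    answers: List[str]
    final: str

def decompose(state: AgentState) -> dict:
    return {"sub_questions": split_into_sub_questions(state["question"])}

def answer_parts(state: AgentState) -> dict:
    answers = []
    for sub in state["sub_questions"]:
        cached = semantic_cache_lookup(sub)            # check the semantic cache first
        answers.append(cached if cached is not None else rag_llm(sub))
    return {"answers": answers}

def synthesize(state: AgentState) -> dict:
    return {"final": combine_with_llm(state["question"], state["answers"])}

graph = StateGraph(AgentState)
graph.add_node("decompose", decompose)
graph.add_node("answer_parts", answer_parts)
graph.add_node("synthesize", synthesize)
graph.set_entry_point("decompose")
graph.add_edge("decompose", "answer_parts")
graph.add_edge("answer_parts", "synthesize")
graph.add_edge("synthesize", END)

agent = graph.compile()
# agent.invoke({"question": "How do refunds and returns work?"})
```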

The system includes a frontend that scrapes website content, allowing users to chat with their own data.

## Key Takeaway

Semantic caching is essential for making AI agents scalable and affordable. By reusing meaning-aware answers, systems reduce cost and latency while maintaining quality. With proper metrics, safeguards, and architecture, semantic caching becomes a powerful foundation for production AI agents.


:::