---
title: Deploying and Maintaining RAG Systems
tags: [Pluralsight]

---

:::spoiler Module 1

## Three Pillars of RAG
- Embedding and Indexing - Converting documents into searchable vectors
- Retrieval - Finding the most relevant chunks
- Generation - Creating the answer using the retrieved context



## RAG Trade-offs and Tuning Knobs

Once the three pillars of Retrieval-Augmented Generation (RAG) are understood, the next step is tuning the system. Every RAG system involves trade-offs, mainly between **accuracy, speed, and cost**. There is no universal configuration; the right setup depends on the use case. For example, an internal chatbot may prioritize fast responses, while a compliance or legal system may prioritize accuracy even if it is slower and more expensive.

The three main areas that can be tuned are **indexing, retrieval, and embeddings (including chunking)**.

## Indexing: Speed vs Accuracy at Scale

Indexing determines how fast and how accurately embeddings can be searched.

- **Flat Index**
  - Performs exact searches
  - Very high accuracy
  - Becomes extremely slow at large scale (millions of documents)
  - Best suited for small datasets

- **HNSW (Hierarchical Navigable Small World)**
  - Graph-based indexing method
  - Very fast with only a small loss in accuracy
  - Most commonly used option in production systems
  - Strong balance between performance and quality

- **IVF / PQ (Inverted File with Product Quantization)**
  - Compresses vectors to reduce memory usage
  - Faster searches with lower memory requirements
  - Reduced recall and accuracy
  - Best for extremely large datasets where memory is the main constraint

**Practical guidance**:
- Small datasets → Flat index  
- General production use → HNSW  
- Massive, memory-constrained systems → IVF / PQ  
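The flat index's exhaustive behaviour is easy to see in a short sketch (pure Python, cosine similarity assumed as the metric; `flat_search` is a hypothetical name — a production system would use a library such as FAISS, but the O(n)-per-query cost is the same idea):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def flat_search(query, vectors, k=3):
    """Exact (flat) search: score EVERY vector, then keep the top k.
    This is why flat indexes are accurate but slow at millions of documents."""
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

HNSW and IVF/PQ exist precisely to avoid this full scan, trading a little recall for much less work per query.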

## Retrieval: Controlling Context, Latency, and Cost

Retrieval tuning focuses on how much information is fetched and passed to the LLM.

- **Top-k Retrieval**
  - Higher k (e.g., top 10):
    - More context
    - Slower queries
    - Larger and more expensive prompts
  - Lower k (e.g., top 3):
    - Faster and cheaper
    - Risk of missing critical information

- **Scoring Thresholds**
  - Filters out weak or low-relevance matches
  - Improves quality by avoiding noisy context

- **Vector Compression**
  - Speeds up lookups
  - May slightly reduce quality

Tuning retrieval is about deciding how much context is sufficient without overwhelming the model or increasing cost unnecessarily.
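The top-k and threshold knobs can be combined in a small post-retrieval filter (a sketch; `filter_results` and the `id`/`score` field names are illustrative, not from any particular vector database):

```python
def filter_results(results, k=3, min_score=0.75):
    """Keep at most k hits, then drop anything below the relevance threshold.
    `results` is a list of dicts like {"id": ..., "score": ...}."""
    top = sorted(results, key=lambda r: r["score"], reverse=True)[:k]
    return [r for r in top if r["score"] >= min_score]
```

Raising `k` buys context at the cost of latency and prompt size; raising `min_score` trims noisy matches that would otherwise dilute the prompt.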

## Embeddings and Chunking: Precision vs Efficiency

The final optimization area is embeddings and how data is chunked.

- **Embedding Models**
  - General-purpose models (e.g., OpenAI's text-embedding family):
    - Flexible and effective for broad use cases
    - Typically more expensive
  - Domain-specific models (e.g., Instructor):
    - Perform better for technical, legal, or specialized data
    - Often improve relevance and accuracy in niche domains

- **Chunking Strategy**
  - Smaller chunks:
    - More precise retrieval
    - More chunks to embed, store, and search, raising cost
  - Larger chunks:
    - Fewer chunks to manage
    - Increased risk of irrelevant or noisy content reaching the prompt

- **Chunk Overlap**
  - Repeats a small portion of text across chunks
  - Preserves continuity when content is split
  - Helps prevent loss of important context
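Chunking with overlap is simple to implement; a character-based sketch (the `size`/`overlap` numbers are illustrative, and real pipelines often split on tokens or sentences instead):

```python
def chunk(text, size=200, overlap=50):
    """Split text into fixed-size chunks, repeating `overlap` characters
    between neighbours so context isn't lost at chunk boundaries.
    Assumes overlap < size."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

The repeated tail of each chunk is what preserves continuity when a sentence or clause would otherwise be cut in half.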

## Key Takeaway

Effective RAG performance comes from carefully balancing indexing, retrieval, and embeddings. These choices directly determine whether answers are fast but shallow, accurate but expensive, or well-balanced and production-ready.



:::

:::spoiler Module 2

## Observability in Production RAG Systems

Once a RAG system is live in production, the next major challenge is **visibility**. Without observability, the system becomes a black box: it is impossible to know which documents were retrieved, why a response was slow, or whether hallucinations are slipping into user responses. Observability transforms this by making system behavior transparent and measurable.

Monitoring, logging, and tracing are **not optional** in RAG systems. They are essential for reliability, quality control, and cost management.

## What to Log in a RAG Pipeline

To achieve effective observability, a few key signals must be logged consistently:

- **Query ID**
  - Assign a unique ID to every request
  - Enables end-to-end tracing across the entire pipeline

- **Retrieved Chunks**
  - Log which chunks were retrieved
  - Include rank or similarity scores
  - Helps explain why a particular answer was generated

- **Prompt Size**
  - Track prompt length or token counts
  - Large prompts directly increase latency and cost

- **Latency and Errors**
  - Measure latency at each stage:
    - Embedding
    - Retrieval
    - Generation
  - Log errors such as timeouts or empty retrievals

Even with these minimal logs, teams gain deep insight into how the RAG system behaves in production.
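The signals above can be captured with a thin wrapper around the pipeline; a sketch assuming `embed`, `retrieve`, and `generate` are callables you already have (all names here are hypothetical):

```python
import json
import time
import uuid

def traced_query(question, embed, retrieve, generate):
    """Run the pipeline while logging query ID, retrieved chunk IDs,
    prompt size, and per-stage latency in milliseconds."""
    record = {"query_id": str(uuid.uuid4())}

    t0 = time.perf_counter()
    vector = embed(question)
    record["embed_ms"] = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    chunks = retrieve(vector)  # each chunk: {"id": ..., "text": ...}
    record["retrieve_ms"] = (time.perf_counter() - t1) * 1000
    record["chunks"] = [c["id"] for c in chunks]

    t2 = time.perf_counter()
    answer = generate(question, chunks)
    record["generate_ms"] = (time.perf_counter() - t2) * 1000

    record["prompt_chars"] = len(question) + sum(len(c["text"]) for c in chunks)
    print(json.dumps(record))  # in production, ship this to your log aggregator
    return answer, record
```

One record per request, keyed by `query_id`, is enough to answer "why was this slow?" and "what context produced this answer?" after the fact.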

## RAG Systems Are Not Static

Deploying a RAG system is not a one-time task. RAG pipelines evolve continuously, and even small changes can have significant impact.

Common sources of change include:
- **Prompt templates**: Minor wording changes can alter tone, style, or behavior
- **Embedding models**: New versions may improve semantic understanding or target specific domains
- **Indices**: Documents are added, chunk sizes change, or vectors are compressed to save memory

Because prompts, embeddings, and indices change frequently, RAG systems are inherently dynamic.

## Why Versioning Is Critical

Without versioning, every change becomes a guess:
- You cannot tell whether accuracy improved or retrieval slowed down
- You cannot identify which prompt produced a good response
- Rolling back a bad change becomes nearly impossible

With proper versioning:
- All changes become traceable
- Experiments can be reproduced
- Quality regressions can be rolled back instantly

Versioning acts as a safety net that keeps fast-moving RAG projects stable and predictable.

## Practical Versioning Strategies

In practice, versioning can be applied as follows:

- **Prompts**
  - Store in Git repositories
  - Tag and track changes just like code

- **Embedding Models**
  - Use semantic versioning
  - Clearly identify upgrades and behavior changes

- **Indices**
  - Version changes such as document additions, chunk size updates, or compression strategies

Maintaining a simple versioning table that lists prompts, embeddings, and indices with version tags and change notes makes upgrades safer and rollbacks effortless.
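Such a table can be as lightweight as a checked-in manifest; a sketch (every version number, path, and note below is a made-up example):

```python
# Hypothetical manifest mirroring the versioning table described above.
MANIFEST = {
    "prompt": {"version": "v3.2", "source": "prompts/answer.md",
               "note": "tightened citation instructions"},
    "embedding_model": {"version": "2.1.0",
                        "note": "domain-tuned for legal text"},
    "index": {"version": "2024-06-01",
              "note": "chunk size 512 -> 256, overlap 64"},
}

def release_tag(manifest):
    """A single deterministic tag identifying the exact pipeline configuration,
    so any logged answer can be traced back to the versions that produced it."""
    return "+".join(f"{k}:{v['version']}" for k, v in sorted(manifest.items()))
```

Logging `release_tag(MANIFEST)` alongside each query record ties every answer to a reproducible prompt/embedding/index combination.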

## Key Takeaway

Keeping a RAG system stable is not just about building it correctly once. It is about managing ongoing change. Observability provides visibility, and versioning provides control. Together, they ensure RAG systems remain reliable, debuggable, and production-ready over time.

:::

:::spoiler Module 3

![image](https://hackmd.io/_uploads/rkOYAOgHZe.png)

![image](https://hackmd.io/_uploads/BkXiCOgrbg.png)

![image](https://hackmd.io/_uploads/Sy9R0dlHbx.png)

![image](https://hackmd.io/_uploads/HJEeJYgrZx.png)

![image](https://hackmd.io/_uploads/SJFbkKeBZl.png)

## ANN Index Choices in RAG Systems

As RAG systems scale, they can quickly become expensive and slow. To address this, different types of **Approximate Nearest Neighbor (ANN)** indices are used, each with clear trade-offs between accuracy, speed, and memory.

- **Flat Index**
  - Compares the query against every vector
  - Perfect accuracy
  - Extremely slow at large scale
  - Suitable only for small datasets

- **HNSW (Hierarchical Navigable Small World)**
  - Graph-based index with links between vectors
  - Very fast search performance
  - High (but not perfect) accuracy
  - Most common and reliable choice for production RAG systems

- **IVF (Inverted File Index)**
  - Clusters vectors into buckets
  - Searches only a subset of clusters
  - Faster and more memory-efficient than flat index
  - Results are approximate

- **PQ (Product Quantization)**
  - Compresses vectors into smaller representations
  - Significant memory savings
  - Larger accuracy trade-off compared to other methods

The choice of index depends on what matters most: **speed, memory usage, or accuracy**.
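The IVF idea in particular is easy to sketch: bucket vectors by their nearest centroid, then search only the buckets closest to the query. This is illustrative pure Python (squared-Euclidean distance, pre-chosen centroids; `nprobe` borrows FAISS's name for "how many buckets to search"):

```python
def nearest_centroid(v, centroids):
    """Index of the centroid closest to v (squared-Euclidean distance)."""
    dists = [sum((x - c[i]) ** 2 for i, x in enumerate(v)) for c in centroids]
    return dists.index(min(dists))

def ivf_search(query, vectors, centroids, k=2, nprobe=1):
    """IVF sketch: cluster vectors into buckets, then scan only the
    nprobe buckets nearest the query instead of the whole collection."""
    buckets = {}
    for idx, v in enumerate(vectors):
        buckets.setdefault(nearest_centroid(v, centroids), []).append(idx)
    # Probe the nprobe centroids closest to the query.
    order = sorted(range(len(centroids)),
                   key=lambda c: sum((q - centroids[c][i]) ** 2
                                     for i, q in enumerate(query)))
    candidates = [i for c in order[:nprobe] for i in buckets.get(c, [])]
    # Rank only the candidate subset — the source of both the speedup
    # and the approximation (a true neighbour in an unprobed bucket is missed).
    scored = sorted(candidates,
                    key=lambda i: sum((q - vectors[i][j]) ** 2
                                      for j, q in enumerate(query)))
    return scored[:k]
```

Increasing `nprobe` recovers recall at the cost of scanning more buckets, which is exactly the accuracy/speed dial the bullet points describe.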

## Practical Indexing Strategy

In real-world deployments:
- **HNSW** is the default for most production RAG systems due to its balance of speed and recall
- **IVF or PQ** are best when working with extremely large datasets or memory-constrained environments
- At extreme scale:
  - Indexes are **sharded** across multiple machines
  - Time-based indexes may be used (e.g., one index per day or week for logs and event data)

Choosing the right index and scaling strategy keeps RAG systems efficient and cost-effective as data grows.
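Time-based indexing reduces to a routing function over index names; a sketch (the `logs-YYYY-MM-DD` naming scheme is an assumption, not a standard):

```python
from datetime import date, timedelta

def index_names(start, end, prefix="logs"):
    """Route a time-bounded query to one hypothetical index per day,
    so only the relevant slices of the data are searched."""
    names = []
    d = start
    while d <= end:
        names.append(f"{prefix}-{d.isoformat()}")
        d += timedelta(days=1)
    return names
```

A query over last week then fans out to seven small indexes instead of one monolithic index covering all history, and old indexes can be dropped or archived wholesale.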

## Latency in RAG Systems

Latency is a major concern in RAG systems and comes from three primary stages:

1. **Embedding Time** – converting the query into a vector  
2. **Retrieval Time** – searching the index for relevant chunks  
3. **Generation Time** – the LLM processes context and produces the response  

Even small delays at each stage can add up, making systems feel slow.

## Latency Budgets and Top-k Trade-offs

Optimizing latency requires defining a **latency budget**, which limits how much time each stage is allowed to consume.

- **Top-k Retrieval**
  - High k (e.g., 10–20):
    - More context and potentially higher accuracy
    - Slower retrieval
    - Larger, more expensive prompts
    - Slower generation
  - Low k (e.g., 2–3):
    - Faster and cheaper responses
    - Risk of missing critical information

In RAG systems, **cost scales with context length**, not just model size.
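A latency budget can be enforced mechanically against the per-stage timings the observability section logs; a sketch (the millisecond figures are illustrative for a ~1-second chatbot target):

```python
# Illustrative per-stage budget summing to roughly 1 second.
BUDGET_MS = {"embed": 50, "retrieve": 100, "generate": 850}

def over_budget(timings_ms, budget=BUDGET_MS):
    """Return the stages that exceeded their share of the latency budget,
    e.g. to alert on or to trigger a lower top-k fallback."""
    return [stage for stage, spent in timings_ms.items()
            if spent > budget.get(stage, 0)]
```

When `retrieve` keeps blowing its slice, the usual first lever is lowering top-k, since smaller k shrinks both retrieval time and the prompt fed to generation.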

## Aligning Latency with User Expectations

Latency budgets should reflect how users interact with the system:

- **Chatbots**
  - Expect near-instant responses (often under 1 second)
  - Small top-k
  - Cached embeddings
  - Aggressive filtering

- **Analytic or Research Systems**
  - Users tolerate longer waits (3–5 seconds)
  - Larger top-k for richer and more complete answers
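These two postures can be captured as named retrieval profiles; a sketch (every number and flag below is an illustrative assumption, not a recommended default):

```python
# Hypothetical per-use-case retrieval profiles.
PROFILES = {
    "chatbot":  {"top_k": 3,  "min_score": 0.8,
                 "cache_embeddings": True,  "budget_ms": 1000},
    "research": {"top_k": 15, "min_score": 0.6,
                 "cache_embeddings": False, "budget_ms": 5000},
}

def profile_for(use_case):
    """Look up the tuning profile for a use case, defaulting to the
    stricter chatbot settings when the use case is unknown."""
    return PROFILES.get(use_case, PROFILES["chatbot"])
```

Keeping these knobs in one place makes the latency/accuracy trade-off an explicit, reviewable configuration choice rather than scattered constants.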

## Key Takeaway

Designing effective RAG systems means understanding where latency comes from and how indexing and top-k decisions affect speed, cost, and accuracy. By aligning these technical choices with user expectations, teams can build RAG systems that are both performant and practical.


:::
