Notes From [Pluralsight: Optimizing Data Retrieval Techniques](https://app.pluralsight.com/ilx/video-courses/data-retrieval-techniques-optimizing/course-overview)

# 1. Concepts and Applications of Semantic Indexing

## 1.1 Why Is Retrieval Optimization Important?

![image](https://hackmd.io/_uploads/HkgBK-9N-x.png)

> Jane, a financial analyst, uses a Retrieval Augmented Generation (RAG) system to understand recent changes in international banking regulations. She queries the system, which searches through global banking documents and returns relevant information. If the system isn't optimized, however, it might return inaccurate information, leading to costly mistakes. In this case, missing a regulatory change could have led to non-compliance; optimization ensures she gets accurate, relevant results, helping her make informed decisions.
>
> If retrieval is inefficient or inaccurate, it leads to wasted time, frustration, and lower trust in the system.

- **Faster Results:** Reduces the time users need to find information, improving productivity.
- **Improved Accuracy:** Ensures users get the right results, filtering out noise and irrelevant information.
- **Increased User Confidence:** Builds trust, especially in high-stakes domains like healthcare, finance, or legal.
- **Resource Efficiency:** Optimizes system resources, lowering infrastructure costs while maintaining responsiveness.
- **Enables Advanced Applications:** Powers applications such as AI chatbots that draw on large document collections, recommendation engines, and real-time analytics. Without optimized retrieval, these systems could not keep up.
## 1.2 Techniques For Optimizing Retrieval

![image](https://hackmd.io/_uploads/BkWzJtW54Wx.png)

### Semantic Indexing

* Focuses on text meaning, not just keywords
* Uses vector embeddings to capture context and synonyms
* Improves recall, reduces mismatches, and ensures relevant results
* Example: In a job search platform, searching for "software developer" still returns results labeled "software engineer."

### Query Reformulation

* Enhances or rewrites user queries for better results
* Adds synonyms, corrects errors, and expands phrases
* Reduces ambiguity and captures broader intent
* Example: A user searches "iphon 14 batter issue." The system corrects the spelling and expands the query to "iPhone 14 battery problems" to return better results.

### Hybrid Search

* Combines keyword and semantic search
* Balances precision (keywords) and recall (semantic meaning)
* Effective in complex domains like healthcare and e-commerce
* Example: In healthcare, a user searches "Type 2 diabetes treatment metformin." Keywords ensure "metformin" is matched exactly, while semantic search retrieves broader treatment guidelines and related studies.

### Reranking

* Reorders initial search results by relevance
* Uses machine learning or transformer-based models
* Improves top-ranked results, boosting user satisfaction
* Example: On an e-commerce site, many products match "wireless headphones." Reranking pushes highly rated, relevant, and popular products to the top, improving conversions and user satisfaction.

:::spoiler More Details

### Semantic Indexing

> Semantic indexing focuses on understanding the meaning of text rather than just matching keywords. It uses vector embeddings to capture context, so the system knows that words like "car" and "automobile" mean the same thing. This improves recall and helps users find relevant information even if they use different wording.

### Query Reformulation

> Query reformulation improves search results by modifying the user's query before execution.
> This can include adding synonyms, fixing spelling mistakes, or expanding the query to better capture user intent. It helps reduce ambiguity and ensures relevant results are returned even when the original query isn't perfectly phrased.

### Hybrid Search

> Hybrid search combines keyword-based search with semantic search. Keywords ensure precision for exact matches, while semantic search adds context and meaning. This balance between precision and recall makes hybrid search especially effective in complex domains like healthcare and e-commerce.

### Reranking

> Reranking takes an initial list of results and reorders them so the most relevant ones appear at the top. Modern reranking models use machine learning or transformer-based approaches to evaluate relevance more deeply, which significantly improves the quality of top results and overall user satisfaction.

:::

## 1.3 Relevance Scoring and Re-ranking Concepts

### Relevance Scoring

* Evaluates how closely a document aligns with the user's query
* Considers context, semantics, and intent, not just keyword matches
* Feeds into ranking algorithms to order results
* Critical for user satisfaction, reducing search frustration, and enabling advanced retrieval applications
* **Example**: In a knowledge base search, two articles mention cloud security, but only one directly addresses the user's question. Relevance scoring assigns a higher score to the more contextually aligned article, placing it higher in the results.

### Re-ranking

* Adjusts the order of initial search results to prioritize relevant items
* Important when initial retrieval returns many potential documents
* Uses machine learning models (e.g., transformers, cross-encoders) to evaluate semantic relevance
* **Example**: A search engine retrieves 100 documents for "data privacy regulations." Re-ranking reorganizes these so the most authoritative and relevant documents appear at the top.
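The score-sort-filter flow described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes cosine similarity as the relevance score, uses toy 3-dimensional vectors in place of real embedding-model output, and picks 0.75 as the relevance threshold; all of those choices would be tuned in a real system.

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Relevance score: cosine of the angle between query and document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rerank(query_vec, docs, threshold=0.75):
    """Basic re-ranking: score each document, sort descending,
    and drop anything below the relevance threshold."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in scored if score >= threshold]

# Toy 3-dimensional "embeddings" for illustration only.
query = [1.0, 0.0, 1.0]
documents = [
    ("cloud-security-overview",  [0.9, 0.1, 0.8]),
    ("unrelated-cooking-tips",   [0.0, 1.0, 0.1]),
    ("cloud-security-deep-dive", [1.0, 0.0, 0.9]),
]
for doc_id, score in rerank(query, documents):
    print(f"{doc_id}: {score:.3f}")
```

The off-topic document scores far below the threshold and is filtered out, while the two relevant ones are returned in score order, which is exactly the basic re-ranking behavior described above.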
### Re-ranking Techniques

#### Basic Re-ranking

* Sorts documents by existing relevance scores
* Applies a filter to exclude documents below a specified threshold (e.g., 0.75)

#### Advanced Re-ranking with Machine Learning

* Uses machine learning models trained on user interactions
* Re-ranks documents based on user preferences, behaviors, and contextual information
* Enables intelligent re-ranking for personalized results
* **Example**: In a personalized news or e-commerce platform, the system learns that a user prefers technical articles or certain brands. The re-ranking model uses past behavior and context to prioritize documents that best match the user's preferences.

:::spoiler More Details

### Relevance Scoring

> Relevance scoring is the process of measuring how closely a document matches a user's query. It goes beyond keyword matching by considering context, semantics, and user intent. These scores are then used by ranking algorithms to ensure that the most informative and contextually appropriate documents appear at the top, which directly improves user satisfaction and supports advanced systems like RAG pipelines.

### Re-ranking

> Re-ranking takes an initial set of retrieved documents and adjusts their order so that the most relevant items appear first. This is especially important when the first retrieval step returns many potentially relevant documents, but not all of them are equally useful. Modern re-ranking methods often use machine learning models, such as transformers or cross-encoders, to evaluate semantic relevance more deeply.

### Basic Re-ranking

> In basic re-ranking, documents are simply sorted by the relevance scores already generated by the retrieval system. After sorting, a filtering step can be applied to remove documents below a certain relevance threshold, for example 0.75, to ensure only high-quality results are passed forward.
### Advanced Re-ranking with Machine Learning

> Advanced re-ranking uses machine learning models trained on past user interactions. These models rank documents by considering not just the query, but also user preferences, behavior, and contextual information. This results in more personalized and intelligent ranking outcomes.

:::

## Code Implementing Retrieval Optimization

:::spoiler
#### Add Link to Notebook
:::

# 2. Improving RAG Recall

## Why Data Retrieval Matters in RAG

Retrieval Augmented Generation (RAG) systems rely heavily on effective data retrieval from a knowledge base. A key metric here is **recall**, which measures how well the system retrieves all relevant documents.

- Low recall can result in missing important information.
- Missing information leads to incomplete or incorrect LLM responses.
- Improving recall ensures the LLM has access to all relevant context, improving answer accuracy and usefulness.

---

## Role of Query Reformulation in RAG

Query reformulation improves how user queries are expressed before retrieval, leveraging generative AI capabilities to improve retrieval performance.

**LLM-based query expansion** uses generative AI to:

- Understand user intent more deeply
- Generate semantically related terms and alternative phrasings
- Retrieve relevant documents even when the original query is vague or limited

This approach significantly improves recall in RAG systems.

---

## Types of LLM-Based Query Expansion Techniques

![image](https://hackmd.io/_uploads/ByVbrro4be.png)

### 1. Synonym Expansion

- Adds words or phrases with similar meanings.
- Helps match documents that use different terminology.
- **Example**: *Heart attack* → *Myocardial infarction*

**Benefit**: Improves recall by handling terminology variations.

---

### 2. Contextual Expansion

- Considers the full meaning of the query, not just individual words.
- Adds terms related to the overall intent.
- Retrieves relevant documents even without exact keyword matches.

**Benefit**: Captures intent-driven, context-aware results.

---

### 3. Paraphrasing Expansion

- Rewrites the query in multiple ways while preserving its meaning.
- Useful when documents phrase concepts differently.

**Benefit**: Improves retrieval from diverse and heterogeneous text sources.

---

### 4. Semantic Role Expansion

- Identifies key entities, actions, and relationships in the query.
- Expands the query using related roles and actions.
- **Example**: *Treatment of diabetes* → *Medication for diabetes*, *Insulin therapy*

**Benefit**: Covers variations in how information is described.

---

### 5. Concept Expansion

- Adds higher-level or related concepts.
- Broadens retrieval without losing relevance.
- **Example**: *Renewable energy* → *Solar power*, *Wind energy*, *Sustainable electricity*

**Benefit**: Increases coverage of conceptually related content.

---

## Advantages of LLM-Based Query Expansion in RAG

### Improved Recall

- Expanded queries match more relevant documents.
- Reduces the risk of missing important information.

---

### Bridges Vocabulary Gaps

- Different documents use different terms for the same concept.
- Expansion helps align queries with varied document language.

---

### Intent-Based Retrieval

- Moves beyond exact keyword matching.
- Retrieves semantically relevant documents.

---

### Handles Vague Queries

- LLMs infer intent and enrich unclear queries.
- Helps retrieve meaningful and contextually relevant results.

---

### Strong Domain Awareness

- Especially useful in domains like healthcare, law, and finance.
- Introduces related technical and domain-specific concepts.

---

:::spoiler Key Takeaway

---

### Core Framing (Say these automatically)

* In RAG systems, retrieval quality directly determines answer quality.
* Recall is a critical metric because missing documents lead to incomplete LLM responses.
* Improving recall ensures the model has access to all relevant context.
* Query reformulation is one of the most effective ways to improve retrieval in RAG.

---

### What LLM-Based Query Expansion Does

* LLM-based query expansion enriches the original user query before retrieval.
* Instead of exact keyword matching, the system retrieves based on semantic intent.
* The model generates alternative terms and phrasings to better capture user intent.
* This helps retrieve documents that would otherwise be missed.

---

### Synonym Expansion

* Synonym expansion handles terminology variation across documents.
* Different documents describe the same concept using different words.
* For example, "heart attack" and "myocardial infarction" refer to the same concept.
* Synonym expansion directly improves recall.

---

### Contextual Expansion

* Contextual expansion considers the full meaning of the query, not just keywords.
* The model adds terms related to the overall intent.
* This allows retrieval even when exact keywords are missing.
* It makes retrieval intent-aware rather than keyword-dependent.

---

### Paraphrasing Expansion

* Paraphrasing rewrites the query while preserving its meaning.
* This is useful when documents phrase the same idea differently.
* It improves retrieval across heterogeneous text sources.

---

### Semantic Role Expansion

* Semantic role expansion focuses on entities, actions, and relationships.
* The query is expanded with related actions or roles.
* For example, "treatment of diabetes" expands to "insulin therapy" or "medication."
* This captures variations in how information is described.

---

### Concept Expansion

* Concept expansion adds higher-level or related concepts.
* It broadens retrieval without losing relevance.
* For example, "renewable energy" expands to solar and wind power.
* This increases coverage of conceptually related content.

---

### Why This Works (Impact Statements)

* Query expansion significantly improves recall in RAG systems.
* It bridges the vocabulary gap between queries and documents.
* The system retrieves semantically relevant content, not just text matches.
* This leads to more accurate and complete LLM responses.

---

### Domain-Specific Value

* Query expansion is especially useful in domain-heavy systems.
* LLMs can introduce relevant technical and domain-specific terms.
* This improves retrieval quality in healthcare, finance, and legal applications.
* Better retrieval leads to more trustworthy generation.

---

:::
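The expansion flow described in this section can be sketched as follows. This is a hedged illustration, not a prescribed implementation: `generate` stands for whatever LLM client you use (any callable taking a prompt string and returning text), `fake_llm` is a canned stub so the sketch runs offline, and the prompt wording is only one reasonable way to ask for rewrites.

```python
def expand_query(query: str, generate) -> list[str]:
    """LLM-based query expansion: ask the model for synonyms and paraphrases,
    then retrieve with the original query plus every expansion."""
    prompt = (
        "Rewrite the search query below in 3 alternative ways, using synonyms, "
        "paraphrases, and related domain terms. Return one rewrite per line.\n"
        f"Query: {query}"
    )
    rewrites = [line.strip() for line in generate(prompt).splitlines() if line.strip()]
    # Keep the original query first so exact matches are never lost.
    return [query] + rewrites

# Stub standing in for a real LLM client, so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return (
        "myocardial infarction symptoms\n"
        "signs of a heart attack\n"
        "warning signs of myocardial infarction"
    )

queries = expand_query("heart attack symptoms", fake_llm)
# Each variant is sent to the retriever and the result sets are merged,
# which raises recall when documents use clinical rather than lay terminology.
```

Running each variant against the retriever and merging (then re-ranking) the combined results is what lets a query phrased in lay terms reach documents that only say "myocardial infarction," which is the recall improvement this section describes.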