Notes from https://learn.deeplearning.ai/courses/agentic-ai/
# Module 1: Introduction to Agentic Workflows
:::spoiler What is Agentic AI?
## What is Agentic AI?
Agentic AI is a way of building AI systems where a task is completed through **multiple structured steps** instead of a single prompt. The AI plans, acts, evaluates, and revises—similar to how humans work on complex problems.
---
## Why Agentic Workflows Are More Effective
Traditional LLM usage asks the model to produce an entire output in one shot, like writing an essay without revision. While this works, it limits quality.
Agentic workflows use an **iterative process**—planning, researching, drafting, and revising—which takes longer but results in significantly better outputs.
---
## How an Agentic Workflow Functions
An agentic AI workflow typically:
- Breaks a complex task into smaller steps
- Uses LLMs to plan and make decisions
- Calls tools such as web search APIs to gather information
- Generates drafts and reviews them for gaps or errors
- Optionally involves a human-in-the-loop for critical checks
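As a rough illustration (not code from the course), the loop above might look like the following sketch. `call_llm` wraps whichever LLM API you use (the OpenAI Python SDK is shown only as one possible backend), and `web_search` is a hypothetical placeholder for any search tool.

```python
# Rough sketch of a plan -> research -> draft -> review -> revise loop.
from openai import OpenAI

client = OpenAI()

def call_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def web_search(query: str) -> str:
    """Placeholder for a real web search tool or API."""
    return f"(search results for: {query})"

def run_agentic_workflow(task: str, max_revisions: int = 2) -> str:
    plan = call_llm(f"Break this task into concrete steps:\n{task}")
    query = call_llm(f"Write one good web search query for this plan:\n{plan}")
    notes = web_search(query)
    draft = call_llm(f"Task: {task}\nResearch notes:\n{notes}\n\nWrite a first draft.")
    for _ in range(max_revisions):
        critique = call_llm(f"List gaps or errors in this draft:\n{draft}")
        draft = call_llm(f"Revise the draft to address this critique:\n{critique}\n\nDraft:\n{draft}")
    return draft
```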
---
## Core Skill: Task Decomposition
A key skill in building agentic AI is **decomposing complex tasks into well-defined steps** and designing components that execute each step effectively. This directly impacts the quality and reliability of the final result.
---
## Example: Research Agent
A research agent demonstrates agentic AI by:
- Planning a research strategy
- Collecting and analyzing multiple sources
- Synthesizing findings
- Reviewing for coherence
- Producing a structured final report
This approach yields deeper and more thoughtful results than a single-prompt response.
---
## Autonomy in Agentic AI
Agentic workflows can vary in autonomy—from simple, guided workflows to highly autonomous systems. Understanding this spectrum helps in choosing the right design based on application complexity.
---
## Key Takeaway
Agentic AI transforms LLMs from one-shot generators into **iterative problem-solvers**, enabling higher-quality and more reliable outcomes for complex tasks.
:::
:::spoiler Degrees of Autonomy

## Agentic AI as a Spectrum
Agentic AI systems can be autonomous to **different degrees**, rather than being classified as either an agent or not. Using the term *agentic* avoids unnecessary debate and allows focus on building practical systems with varying levels of autonomy.
---
## Why “Agentic” Matters
The AI community often debated what qualifies as a “true agent.” Treating agentic behavior as a **continuum** helps move past definitions and focus on designing useful workflows that range from simple to highly autonomous.
---
## Less Autonomous Agents
Less autonomous agents follow a **fixed, deterministic sequence** of steps defined by the developer.
Example: writing an essay by:
- Generating search terms
- Calling a web search API
- Fetching pages
- Writing the essay
Here, tool usage and workflow order are hard-coded, and the LLM’s autonomy is mostly limited to text generation.
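A minimal sketch of such a fixed pipeline (helper names are assumptions, not course code): the order of steps is decided by the developer, and the LLM only generates text at each step.

```python
# Fixed, developer-defined sequence: the order of steps is hard-coded and the
# LLM only fills in text at each step. Helpers are placeholders for real APIs.
def call_llm(prompt: str) -> str:
    return f"(LLM output for: {prompt[:40]}...)"                    # placeholder

def web_search(query: str) -> list[str]:
    return ["https://example.com/a", "https://example.com/b"]       # placeholder

def fetch_page(url: str) -> str:
    return f"(contents of {url})"                                   # placeholder

def write_essay(topic: str) -> str:
    query = call_llm(f"Write a web search query for an essay on: {topic}")
    urls = web_search(query)
    pages = [fetch_page(u) for u in urls]
    sources = "\n\n".join(pages)
    return call_llm(f"Write an essay on {topic} using these sources:\n{sources}")
```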
---
## More Autonomous Agents
More autonomous agents allow the **LLM to make decisions**, such as:
- Whether to do web search or other research
- Which sources to fetch and how many
- Whether to convert formats (e.g., PDF to text)
- Whether to reflect, revise, or fetch more data
The exact sequence of steps may be decided dynamically by the LLM, not predefined by the programmer.
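A rough sketch of what letting the LLM choose the next step could look like. The JSON action format and the helper names are assumptions, not something the course prescribes.

```python
# Sketch of a more autonomous loop: the LLM picks the next action itself.
import json

TOOLS = {
    "web_search": lambda q: f"(results for {q})",        # placeholder tools
    "fetch_page": lambda url: f"(contents of {url})",
    "pdf_to_text": lambda path: f"(text of {path})",
}

def call_llm(prompt: str) -> str:
    # Placeholder: a real call would return JSON such as
    # {"action": "web_search", "input": "black holes", "done": false}
    return json.dumps({"action": "finish", "input": "", "done": True})

def autonomous_agent(task: str, max_steps: int = 8) -> list:
    history = []
    for _ in range(max_steps):
        decision = json.loads(call_llm(
            f"Task: {task}\nHistory so far: {history}\n"
            f"Available tools: {list(TOOLS)}\n"
            'Reply with JSON: {"action": ..., "input": ..., "done": ...}'
        ))
        if decision["done"]:
            break
        result = TOOLS[decision["action"]](decision["input"])
        history.append((decision["action"], result))
    return history
```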
---
## Levels of Autonomy
- **Less autonomous**: Steps and tools are predefined; predictable and controllable.
- **Semi-autonomous**: LLM can choose among predefined tools and make limited decisions.
- **Highly autonomous**: LLM decides the workflow, may create or modify tools, and adapts actions dynamically.
---
## Trade-offs and Practical Use
Less autonomous agents are **widely used in production** because they are reliable and easier to control.
Highly autonomous agents are powerful but **harder to control and more unpredictable**, and remain an active area of research.
---
## Key Takeaway
Agentic AI is best understood as a **spectrum of autonomy**. Valuable applications exist at all levels, and choosing the right degree of autonomy depends on the problem, control needs, and reliability requirements.
:::
:::spoiler Benefits of Agentic AI Workflows
## Why Agentic Workflows Matter
The biggest advantage of agentic workflows is that they enable **tasks that were previously not feasible** with single-prompt AI systems. In addition, they deliver higher performance, faster execution through parallelism, and flexible system design through modularity.
---
## Performance Improvements Over Single-Prompt AI
Agentic workflows significantly boost model performance by allowing AI to **generate, evaluate, and improve outputs iteratively**. Even older models, when wrapped in an agentic workflow, can outperform newer models used in a non-agentic way. This shows that workflow design can matter more than upgrading the model itself.
---
## Parallelism for Faster Execution
Agentic workflows can execute multiple steps **in parallel**, such as generating search queries or downloading many web pages simultaneously. While agentic workflows may be slower than a single prompt, they can be much faster than humans by avoiding sequential processing.
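For instance, page downloads can be parallelised with a thread pool; `fetch_page` below is a placeholder standing in for a real HTTP fetch.

```python
# Downloading many sources in parallel instead of one at a time.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_page(url: str) -> str:
    time.sleep(1)                       # stand-in for network latency
    return f"(contents of {url})"       # placeholder for a real HTTP request

urls = [f"https://example.com/article{i}" for i in range(10)]

# Sequentially this would take ~10 s; in parallel it takes roughly the time
# of the slowest single request.
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch_page, urls))

print(f"Fetched {len(pages)} pages")
```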
---
## Modularity and Component Swapping
Agentic systems are modular, allowing developers to:
- Swap or upgrade web search tools (e.g., different search engines or news sources)
- Combine multiple tools in a single workflow
- Use different LLMs for different steps based on strengths
This flexibility makes agentic workflows easier to improve and adapt over time.
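One way to get this modularity is to hide each component behind a small interface so it can be swapped without touching the rest of the workflow. The class and function names below are illustrative, not from the course.

```python
# Each component sits behind a small interface, so it can be swapped without
# touching the rest of the workflow.
from typing import Callable, Protocol

class SearchTool(Protocol):
    def search(self, query: str) -> list[str]: ...

class WebSearch:
    def search(self, query: str) -> list[str]:
        return ["(general web result)"]        # placeholder

class NewsSearch:
    def search(self, query: str) -> list[str]:
        return ["(recent news result)"]        # placeholder

def research_agent(topic: str, search: SearchTool, draft_llm: Callable[[str], str]) -> str:
    sources = search.search(topic)
    return draft_llm(f"Write a report on {topic} from:\n" + "\n".join(sources))

# Swapping the search tool or the drafting model needs no workflow changes:
print(research_agent("black holes", NewsSearch(), lambda p: f"(draft based on: {p[:40]}...)"))
```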
---
## Key Takeaway
Agentic AI workflows provide **better performance, faster execution via parallelism, and flexible system design**. In many cases, improving the workflow delivers greater gains than simply using a more advanced model.
---
:::
:::spoiler Example: Why Agentic Workflows Outperform Single-Prompt AI
### Example 1: Coding with Agentic AI (HumanEval Benchmark)

Imagine asking an AI to write a complete function in one attempt.
In a traditional setup, the model writes the code once and stops. If there is a logic bug or missing edge case, the output fails.
In the HumanEval benchmark:
- GPT-3.5, when asked to write code in one shot, solved roughly **48%** of problems correctly.
- GPT-4 improved this to **67%**, but still followed the same one-shot approach.
Now consider an agentic workflow:
1. The AI writes an initial version of the code.
2. It reviews its own solution and checks for logical errors.
3. It revises the code to handle edge cases.
4. This cycle repeats until the solution stabilizes.
With this approach, **GPT-3.5 wrapped in an agentic workflow performed far better than GPT-3.5 used directly**, closing much of the gap with GPT-4 and, in some reported results, even surpassing one-shot GPT-4.
This shows that *how* the model is used can matter more than *which* model is used.
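A minimal sketch of such a generate, test, and revise loop (placeholder `call_llm`; the test-execution approach is an assumption, not the benchmark's official harness):

```python
# Generate -> test -> revise loop for code. `call_llm` is a placeholder.
import traceback

def call_llm(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"            # placeholder completion

def run_tests(code: str, tests: str) -> str | None:
    """Run candidate code plus tests; return the traceback, or None if it passes."""
    try:
        namespace: dict = {}
        exec(code + "\n" + tests, namespace)             # only run trusted code
        return None
    except Exception:
        return traceback.format_exc()

def solve(problem: str, tests: str, max_rounds: int = 3) -> str:
    code = call_llm(f"Write a Python function for:\n{problem}")
    for _ in range(max_rounds):
        error = run_tests(code, tests)
        if error is None:                                 # solution has stabilised
            break
        code = call_llm(f"This code failed:\n{code}\n\nError:\n{error}\n\nFix it.")
    return code

print(solve("Add two numbers.", "assert add(2, 3) == 5"))
```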
---
### Example 2: Research Essay with Parallelism
Suppose you ask a human to write an essay on black holes. The human:
- Searches one query at a time
- Reads articles sequentially
- Takes notes manually
- Then writes the essay
This process is slow because everything happens one step at a time.
An agentic workflow works differently:
1. Multiple LLMs generate different search queries **in parallel**.
2. Each query returns several high-quality sources.
3. All web pages are downloaded simultaneously.
4. The AI reads and synthesizes all sources together.
5. A final essay is generated and optionally revised.
Even though the agentic workflow has more steps, **parallel execution makes it faster than a human** and more thorough than a single-prompt AI.
---
### Example 3: Modularity in Practice

A developer builds a research agent using:
- One LLM for planning
- A web search engine for data collection
- Another LLM for writing and summarization
Later, the developer:
- Replaces the search engine with a better one
- Adds a news search module for recent updates
- Switches to a different LLM for drafting
Because the workflow is modular, these improvements require **minimal changes** to the system.
This makes agentic workflows easier to evolve and maintain over time.
---
### Core Insight from the Examples
Agentic workflows transform AI from a **single-shot generator** into an **iterative, parallel, and adaptable problem solver**. This is why they achieve higher accuracy, better reasoning, and greater real-world usefulness than traditional prompt-based approaches.
:::
:::spoiler Agentic AI Applications
## Overview
Agentic AI workflows are well-suited for business tasks that can be broken into **clear, structured steps**. As task complexity increases and steps become less predictable, the difficulty and unpredictability of the agent also increase.
---
## Simple, Structured Agentic Applications
Tasks with a **clear step-by-step process** are easier to implement and more reliable.
### Invoice Processing
Agentic AI can automate invoice handling by:
- Converting invoices (PDF → text)
- Identifying whether a document is an invoice
- Extracting key fields (biller, address, amount, due date)
- Storing the extracted data in a database
These workflows work well because the process is well-defined and consistent.
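A compact sketch of these steps, with placeholder helpers and illustrative field names:

```python
# Invoice workflow sketch: convert, classify, extract, store.
import json
import sqlite3

def pdf_to_text(path: str) -> str:
    # Placeholder for a real PDF-to-text tool.
    return "Invoice from Acme Corp, 123 Main St. Amount due: $450. Due 2025-03-01."

def call_llm(prompt: str) -> str:
    # Placeholder for a real extraction call that returns JSON.
    return json.dumps({"is_invoice": True, "biller": "Acme Corp",
                       "address": "123 Main St", "amount": 450.0,
                       "due_date": "2025-03-01"})

def process_invoice(path: str, db: sqlite3.Connection) -> None:
    text = pdf_to_text(path)
    fields = json.loads(call_llm(
        "Extract biller, address, amount, and due_date from this invoice as JSON. "
        f"Set is_invoice to false if the document is not an invoice.\n\n{text}"
    ))
    if not fields["is_invoice"]:
        return
    db.execute("INSERT INTO invoices (biller, address, amount, due_date) VALUES (?, ?, ?, ?)",
               (fields["biller"], fields["address"], fields["amount"], fields["due_date"]))
    db.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (biller TEXT, address TEXT, amount REAL, due_date TEXT)")
process_invoice("invoice_001.pdf", conn)
```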

---
## Moderately Complex Applications
These tasks still follow a clear process but require **data lookups and human review**.
### Customer Order Inquiry Handling
An agent can:
- Extract order details from a customer email
- Look up customer and order records in a database
- Draft a response email
- Route the response for human approval before sending
Such agents are already widely deployed in real businesses.

---
## More Complex, Planning-Based Applications
Some customer requests require the agent to **decide the steps dynamically**, making them harder to build.
### Advanced Customer Service
Examples include:
- Checking inventory across multiple categories (e.g., black jeans vs blue jeans)
- Processing returns by verifying purchase history, return policy, and product condition
- Issuing return labels and updating order status
Here, the agent must plan which APIs or database queries to call based on the request.

---
## Cutting-Edge Applications: Computer-Using Agents
The most difficult agentic systems involve **computer use**, where agents interact with websites like humans:
- Navigating web pages
- Clicking buttons and filling forms
- Reading page content to decide next actions
This area is promising but currently unreliable due to slow-loading pages, complex layouts, and limited agent perception.
---
## What Makes Agentic Workflows Easier or Harder
### Easier
- Clear, predefined procedures
- Text-only inputs
- Business processes that already exist as SOPs
### Harder
- Unknown or dynamic steps
- High autonomy and planning
- Multi-modal inputs (vision, audio, UI interaction)

---
## Key Skill: Task Decomposition
A critical skill in building agentic AI is **breaking complex workflows into discrete, executable steps**. This directly impacts reliability, scalability, and success.
---
## Key Takeaway
Agentic AI is most effective when applied to **structured business workflows**, and becomes more challenging as autonomy, planning, and input complexity increase.
---
## Stories / Examples (In Depth)
### Example 1: Invoice Processing Agent
A finance team traditionally reviews invoices manually to track payments.
An agentic workflow replaces this by converting invoices to text, validating document type, extracting critical fields, and updating databases automatically. Because the steps are known in advance, this agent is reliable and easy to deploy.
---
### Example 2: Customer Order Support Agent
When a customer emails about an order, the agent extracts relevant details, fetches order records, drafts a response, and queues it for human review. This balances automation with oversight and is already common in production systems.
---
### Example 3: Advanced Customer Service Agent
A customer asks to return an item. The agent must verify purchase history, check return eligibility, issue a return label, and update order status. Since the required steps depend on the request, the agent must plan dynamically, making the system more complex and less predictable.
---
### Example 4: Computer-Using Agent (Flight Search)
An agent is asked to check flight seat availability. It navigates airline websites, fills forms, reads results, adapts when a site fails, and switches to alternative sites like Google Flights. While impressive, such agents are still unreliable and best suited for experimentation rather than mission-critical use today.
:::
:::spoiler Task Decomposition
## Core Idea
One of the most important skills in building agentic AI workflows is **task decomposition**—breaking complex human or business tasks into **clear, discrete steps** that an AI system can execute reliably.
---
## Why Decomposition Matters
Directly prompting an LLM to complete a complex task (like writing a deeply researched essay) often produces **shallow or inconsistent results**. Humans naturally work iteratively—planning, researching, drafting, reviewing, and revising. Agentic workflows mimic this human process by splitting tasks into manageable steps.
---
## How to Decompose a Task
When breaking down a task, the key question is:
> **Can each step be performed by an LLM, a tool, a function call, or a small piece of code?**
If not, the step should be decomposed further until it becomes executable.
---
## Example: Research Essay Generation
Instead of one-step essay generation:
1. Generate an outline (LLM)
2. Generate search queries (LLM)
3. Perform web search (tool/API)
4. Write first draft (LLM)
5. Review draft for issues (LLM)
6. Revise draft (LLM)
This iterative approach leads to **more coherent and thoughtful output**.
---
## Iterative Refinement
Task decomposition is **not a one-time activity**:
- Start with a simple workflow
- Evaluate the output
- Identify weaknesses
- Further decompose steps
- Repeat until quality improves
This iterative refinement is common and expected in agentic system design.
---
## Business-Oriented Decomposition Examples
### Customer Order Inquiry
1. Extract key information from email (LLM)
2. Query order database (function/API)
3. Draft response email (LLM)
4. Send email via API (tool)
Each step is clearly defined and executable.
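A sketch of these four steps with a human-approval gate; all helpers and the email/database calls are placeholders, not specific APIs from the course.

```python
# Order-inquiry sketch: extract, look up, draft, then route for human approval.
import json

def call_llm(prompt: str) -> str:
    return json.dumps({"order_id": "A1234", "question": "Where is my order?"})  # placeholder

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped", "eta": "2025-03-03"}     # placeholder DB query

def human_approves(draft: str) -> bool:
    print(f"Draft queued for review:\n{draft}")
    return True                                                                  # placeholder approval step

def send_email(to: str, body: str) -> None:
    print(f"Sending to {to}")                                                    # placeholder email API

def handle_inquiry(email_from: str, email_body: str) -> None:
    details = json.loads(call_llm(f"Extract order_id and question as JSON:\n{email_body}"))
    order = lookup_order(details["order_id"])
    draft = call_llm(f"Draft a reply about this order: {order}\nCustomer asked: {details['question']}")
    if human_approves(draft):
        send_email(email_from, draft)

handle_inquiry("customer@example.com", "Hi, where is my order A1234?")
```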
---
### Invoice Processing
1. Convert invoice PDF to text (specialized AI tool)
2. Extract required fields (LLM)
3. Update database record (function/API)
This works well because the process is structured and predictable.
---
## Building Blocks of Agentic Workflows
Agentic systems are built by sequencing reusable components:
- **LLMs**: text generation, reasoning, extraction, decision-making
- **Specialized AI models**: PDF-to-text, speech-to-text, image analysis
- **Tools & APIs**: web search, email, databases, calendars, weather
- **Retrieval systems (RAG)**: searching large text corpora
- **Code execution tools**: write and run code dynamically
Knowing what building blocks are available expands what workflows you can design.
---
## Key Design Principle
If a step cannot be automated directly:
- Ask how a human would do it
- Break it into smaller sub-steps
- Re-check if those sub-steps are automatable
---
## Key Takeaways
- Task decomposition is central to effective agentic AI
- Each step should be executable by an LLM or a tool
- Iteration and evaluation are necessary to reach high-quality results
- Better decomposition leads to more reliable, scalable workflows
---
## Stories / Examples (In Depth)
### Story 1: Improving a Research Agent
An initial research agent generated outlines, searched the web, and wrote essays—but the results felt disjointed. By adding **draft → review → revise** steps, the output became more coherent and human-like. This showed that refining decomposition directly improves quality.
---
### Story 2: Customer Support Automation
A customer support task was decomposed into extracting details, querying databases, and drafting replies. Because each step was clearly defined and supported by tools, the agent could reliably assist humans and significantly reduce response time.
---
### Story 3: Invoice Automation
Invoice handling was simplified into two steps after PDF-to-text conversion: information extraction and database update. This clear decomposition made the workflow easy to implement and highly dependable.
---
### Final Insight
Agentic AI systems improve through **thoughtful task decomposition, iteration, and evaluation**—not by relying on a single prompt. The quality of the workflow often matters more than the sophistication of the model.
:::
:::spoiler Evaluating (Evals) Agentic AI

## Why Evaluation Matters
One of the strongest predictors of success in building agentic workflows is the ability to run a **disciplined evaluation (eval) process**. Teams that do evals well improve faster and build more reliable agents.
---
## When to Design Evals
It is very hard to predict all failure cases in advance.
**Best practice:**
- First, build the agentic workflow
- Then review real outputs
- Identify what is not satisfactory
- Add evals to measure and reduce those issues
---
## Finding Problems Through Output Review
By manually reading outputs, you can discover unexpected issues that were hard to anticipate earlier.
These issues then guide what evals you should add.
---
## Objective vs Subjective Evals
There are two broad types of evaluation criteria:
### Objective Evals
- Clear yes/no or measurable conditions
- Can be checked using simple code
- Example: Did the output mention a competitor name?
### Subjective Evals
- Harder to score with strict rules
- Often require judgment (e.g., quality, clarity, usefulness)
- Common approach: use an **LLM as a judge**
---
## Using LLMs as Judges
For subjective quality checks, another LLM can evaluate outputs.
- Example: Ask an LLM to assess the quality of an essay
- Track whether quality improves over time as you refine the agent
⚠️ Note: Simple numeric scales (e.g., 1–5) are not very reliable. Better evaluation techniques are introduced later in the course.
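A minimal LLM-as-judge sketch (placeholder `call_llm`; the pass/fail rubric is an assumption that sidesteps the unreliable 1–5 scale mentioned above):

```python
# LLM-as-judge sketch for essay quality.
def call_llm(prompt: str) -> str:
    return "PASS: the essay is coherent and cites its sources."   # placeholder

def judge_essay(essay: str) -> bool:
    verdict = call_llm(
        "You are grading a research essay. Answer PASS or FAIL, then one sentence of reasoning. "
        "PASS only if the essay is coherent, covers the topic, and cites its sources.\n\n"
        f"Essay:\n{essay}"
    )
    return verdict.strip().upper().startswith("PASS")

essays = ["(essay about black holes)", "(essay about coffee sales)"]
pass_rate = sum(judge_essay(e) for e in essays) / len(essays)
print(f"Pass rate: {pass_rate:.0%}")   # track this as the workflow is refined
```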
---
## Types of Agentic Evals
You will encounter two important eval types:
### End-to-End Evals
- Measure the quality of the final output of the entire agent
- Useful for tracking overall system performance
### Component-Level Evals
- Measure the output of individual steps in the workflow
- Useful for improving specific parts of the agent
---
## Error Analysis and Traces
Another key skill is **error analysis**:
- Inspect intermediate outputs (also called traces)
- Identify where the workflow breaks down
- Use insights to refine steps, prompts, or tools
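A simple way to capture traces is to wrap each step and record its inputs, outputs, and timing; the structure below is one possible convention, not a prescribed format.

```python
# Capture a trace of each step's inputs, outputs, and timing for error analysis.
import json
import time

TRACE: list[dict] = []

def traced(step_name: str, fn, *args):
    start = time.time()
    out = fn(*args)
    TRACE.append({"step": step_name, "inputs": list(args), "output": out,
                  "seconds": round(time.time() - start, 3)})
    return out

# Example: wrap each workflow step, then inspect the trace afterwards.
query = traced("generate_query", lambda topic: f"search: {topic}", "black holes")
notes = traced("web_search", lambda q: "(results)", query)
print(json.dumps(TRACE, indent=2))
```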
---
## Key Takeaway
Building strong agentic workflows requires:
- Iterative improvement
- Careful output inspection
- Well-designed evals
- Continuous error analysis
Evaluation is not optional—it is central to agentic AI success.
---
## Stories & Examples
### Example 1: Competitor Mentions in Customer Emails

A customer-support agent unexpectedly mentioned competitors in its responses:
- “We’re better than ComproCo”
- “Unlike RivalCo, returns are easy”
This behavior was awkward for the business and hard to predict upfront.
**Solution:**
- Define competitor mentions as errors
- Add a simple code-based eval to count how often competitor names appear
- Track and reduce this error over time
This is a strong example of an **objective eval**.
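The corresponding code-based eval can be a few lines; the competitor names and responses below are the made-up ones from the example.

```python
# Code-based objective eval: how often do responses mention a competitor?
COMPETITORS = ["ComproCo", "RivalCo"]

def mentions_competitor(response: str) -> bool:
    return any(name.lower() in response.lower() for name in COMPETITORS)

responses = [
    "Your return label is attached.",
    "Unlike RivalCo, returns are easy with us!",
]
error_rate = sum(mentions_competitor(r) for r in responses) / len(responses)
print(f"Competitor-mention rate: {error_rate:.0%}")   # track and reduce over time
```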
---
### Example 2: Evaluating a Research Agent
A research agent generates essays on different topics.
- Outputs vary in depth and coherence
- Hard to judge quality with strict rules
**Solution:**
- Use a separate LLM to review each essay
- Ask it to assess quality (initially using a score or judgment)
- Track whether essay quality improves as the workflow is refined
This demonstrates a **subjective eval** using an LLM as a judge.
:::
:::spoiler Agentic Design Pattern
## Overview
Agentic workflows are built by combining multiple building blocks into structured sequences. Certain **design patterns** help organize these building blocks effectively and improve system performance. The four most important design patterns are **reflection, tool use, planning, and multi-agent collaboration**.
---
## 1. Reflection
Reflection is the process where an LLM reviews and improves its own output.
- The model generates an output (e.g., code or text)
- That output is fed back to the model for critique
- The model identifies issues (correctness, style, efficiency)
- The model revises and produces an improved version
Reflection can also use **external feedback**, such as error messages from running code.
While not perfect, reflection often provides a noticeable performance improvement.
---
## 2. Tool Use
Tool use allows LLMs to call external functions or APIs to complete tasks.
- Examples: web search, database queries, code execution, math calculations
- Enables access to real-time data and precise computation
- Lets the LLM decide *when* and *which* tool to use
This significantly expands what an agent can accomplish beyond text generation.
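As one concrete illustration, here is a tool-use sketch with the OpenAI function-calling API (other providers expose similar mechanisms); the calculator tool, prompt, and model name are assumptions.

```python
# Tool use: the model decides whether (and how) to call a calculator tool.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is 17.5% of 1,240?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:                                   # the model chose to use the tool
    args = json.loads(msg.tool_calls[0].function.arguments)
    print("Model requested calculation:", args["expression"])
else:                                                # the model answered directly
    print(msg.content)
```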
---
## 3. Planning
Planning allows an LLM to decide the **sequence of actions** needed to complete a task.
- The LLM determines steps dynamically
- Steps often involve calling different tools or APIs
- Reduces the need for developers to hard-code workflows
Planning-based agents are more flexible but harder to control and are still experimental.
---
## 4. Multi-Agent Collaboration
Multi-agent workflows use multiple specialized agents working together.
- Each agent has a defined role (e.g., researcher, writer, editor)
- Agents collaborate to solve complex tasks
- Can lead to better results for complex or creative problems
These systems are harder to predict but can outperform single-agent setups.
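A toy multi-agent sketch: each agent is just an LLM call with its own role prompt, chained together (placeholder `call_llm`).

```python
# Toy multi-agent pipeline: researcher -> writer -> editor.
def call_llm(role: str, prompt: str) -> str:
    return f"({role} output)"                        # placeholder LLM call

def write_brochure(product: str) -> str:
    research = call_llm("research agent", f"Gather key facts about {product}.")
    draft = call_llm("marketing agent", f"Write a persuasive brochure using:\n{research}")
    return call_llm("editor agent", f"Polish and tighten this brochure:\n{draft}")

print(write_brochure("a home espresso machine"))
```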
---
## Key Takeaway
Effective agentic systems combine building blocks using proven design patterns.
Reflection, tool use, planning, and multi-agent collaboration enable agents to solve complex problems more reliably, especially when paired with strong evaluation methods.
---
## Stories & Examples
### Example 1: Reflection for Code Improvement
An LLM is asked to write a Python function.
- First output contains bugs or inefficiencies
- The same LLM (or a critique agent) reviews the code and points out issues
- Error messages from running the code are fed back
- The model revises the code and produces a better version
This iterative loop improves code quality over multiple versions.
---
### Example 2: Planning with HuggingGPT
A system is asked to:
- Match a person’s pose from an image
- Generate a new image with a different subject in the same pose
- Describe the image using text-to-speech
The LLM automatically plans the sequence:
1. Detect pose from the original image
2. Generate a new image using that pose
3. Convert text description to speech
The LLM decides the workflow rather than relying on hard-coded steps.
---
### Example 3: Multi-Agent Marketing Workflow
To write a marketing brochure:
- A **research agent** gathers background information
- A **marketing agent** writes persuasive content
- An **editor agent** polishes and refines the text
This mirrors how human teams collaborate and often produces higher-quality results than a single agent.





:::
# Module 2: Reflection Design Pattern
:::spoiler Reflection to improve output of a task


## Reflection Design Pattern — Summary Notes
## What is the Reflection Design Pattern?
The **reflection design pattern** allows a Large Language Model (LLM) to improve its own output by reviewing and revising an initial draft. This mirrors how humans review their own work, identify issues, and make improvements.
## Core Idea
Instead of relying on a single generation, the LLM:
1. Produces an initial output (Version 1).
2. Reviews that output using a second prompt.
3. Generates an improved version (Version 2).
This simple two-step process can improve clarity, correctness, and overall quality.
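A minimal sketch of this two-step loop, with `call_llm` as a placeholder for any LLM client:

```python
# Two-step reflection: generate v1, critique it, produce v2.
def call_llm(prompt: str) -> str:
    return "(model output)"                          # placeholder LLM call

def generate_with_reflection(task: str) -> str:
    v1 = call_llm(task)
    critique = call_llm(
        "Review the draft below. Point out unclear wording, missing details, "
        f"and factual problems.\n\nTask: {task}\n\nDraft:\n{v1}"
    )
    v2 = call_llm(
        f"Rewrite the draft to address this critique.\n\nCritique:\n{critique}\n\nDraft:\n{v1}"
    )
    return v2

print(generate_with_reflection("Write a short email rescheduling Friday's meeting to Monday."))
```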
## How Reflection Works in Practice
- The same LLM or a different LLM can be used for reflection.
- The second prompt explicitly asks the model to **analyze, critique, and improve** the first output.
- This workflow can be hard-coded or automated in applications.
## Use Cases
### 1. Writing (e.g., Emails)
- First draft may contain unclear wording, missing details, typos, or omissions.
- Reflection helps identify these issues and rewrite the content more clearly and professionally.
### 2. Code Generation
- An LLM generates an initial version of code.
- The code is then reviewed by the same or another LLM to:
- Find bugs
- Improve logic
- Fix syntax or structure
## Choosing Different Models
- Different LLMs have different strengths.
- **Reasoning or thinking models** are especially good at:
- Debugging code
- Analyzing errors
- A common strategy:
- Use one model for fast code generation.
- Use a reasoning-focused model for reflection and bug detection.
## Power of External Feedback
Reflection becomes significantly more effective when **external information** is added:
- For code, this includes:
- Execution results
- Error messages
- Logs and outputs
By feeding this real-world feedback back into the LLM during reflection, the model gains concrete evidence of what went wrong and how to fix it.
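One way to wire in this external feedback is to execute the generated code, capture the real error output, and include it in the reflection prompt. The sketch below runs the code in a subprocess; all prompts and placeholder outputs are illustrative.

```python
# Reflection with external feedback: run the generated code, capture the real
# error output, and feed it back into the reflection prompt.
import subprocess
import sys
import tempfile

def call_llm(prompt: str) -> str:
    return "print('hello world')"                    # placeholder code generation

def run_and_capture(code: str) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=30)
    return result.stderr                             # empty string if it ran cleanly

code_v1 = call_llm("Write a script that prints a greeting.")
error = run_and_capture(code_v1)
if error:
    code_v2 = call_llm(f"This code failed:\n{code_v1}\n\nError:\n{error}\n\nFix it.")
else:
    code_v2 = code_v1
```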
## Key Design Considerations
- Reflection is **not magic** and does not guarantee perfect results.
- It usually provides a **modest but meaningful improvement** in performance.
- Reflection is **most powerful when new external information** is available.
- Whenever possible, include real feedback (errors, outputs, results) in the reflection step.
## Comparison to Direct Generation
- Reflection-based workflows often outperform **direct generation** (also called zero-shot prompting).
- The improvement is especially noticeable in tasks that benefit from evaluation and correction, such as coding.
---
## Story / Example: Email Writing with Reflection
A quick email draft may be unclear, contain typos, or miss important details like dates or a signature. By rereading the draft, a human can notice these issues and rewrite the email with clearer wording and complete information.
LLMs can follow the same process: generate an initial email (v1), reflect on it using a second prompt, and produce a clearer and more polished version (v2).
## Story / Example: Code Debugging with Reflection
An LLM writes an initial version of code, but when executed, it throws a syntax error. The error message is passed back into the LLM along with the code. Using this external feedback, the LLM reflects on the mistake and produces a much better second version of the code. This results in higher-quality output than reflection without execution feedback.
:::
:::spoiler Reflection vs Direct Generation
## Direct Generation (Zero-Shot Prompting)
Direct generation means prompting an LLM **once** and accepting the output as final.
- Example tasks:
- Writing an essay about black holes
- Writing Python code to calculate compound interest
- This approach is also called **zero-shot prompting** because:
- No examples of desired input-output pairs are provided
- The model generates the answer in one go
## Prompting Variants (Quick Context)
- **Zero-shot**: No examples included in the prompt
- **One-shot**: One example of input-output included
- **Few-shot**: Multiple examples included
The key idea here is that **direct generation = single-pass output without revision**.
## Why Use Reflection Instead of Direct Generation?
Multiple research studies show that **reflection improves performance** compared to direct generation across many tasks.
- Reflection allows the model to:
- Review its own output
- Identify mistakes or weaknesses
- Generate a higher-quality second version
## Research Evidence
- Based on work by **Madaan et al.**
- Experiments compare:
- Zero-shot prompting (no reflection)
- The same model with reflection
- Results show:
- Reflection consistently outperforms direct generation in many tasks
- Improvement is seen across different models (e.g., GPT-3.5, GPT-4)
- Effectiveness can vary depending on the application
## When Reflection is Especially Helpful
Reflection is most useful when outputs require **correctness, structure, or validation**, such as:
### 1. Structured Outputs
- HTML tables with formatting issues
- Complex JSON with deep nesting
- Reflection helps validate structure and spot bugs
### 2. Instruction Generation
- Step-by-step guides (e.g., brewing tea)
- Reflection helps detect:
- Missing steps
- Logical inconsistencies
- Incomplete instructions
### 3. Creative or Branding Tasks
- Generating domain names
- Reflection helps check:
- Pronunciation difficulty
- Negative or unintended meanings
- Cultural or language-related issues
## Example Reflection Prompts
### Domain Name Review
- Ask the LLM to:
- Review generated names
- Check pronunciation
- Identify negative meanings across languages
- Output only names that meet all criteria
### Email Improvement
- Ask the LLM to:
- Review the first draft
- Check tone and professionalism
- Verify factual accuracy and promises
- Write an improved second draft
## Tips for Writing Effective Reflection Prompts
- Clearly state that the model should **review or reflect** on the first output
- Define **explicit evaluation criteria**
- Example: tone, accuracy, completeness, pronunciation, negative connotations
- Specific criteria guide the model to focus on what matters most
## Learning to Write Better Prompts
- Study prompts written by others
- Review prompts in well-designed open-source projects
- Analyzing high-quality prompts helps improve your own prompt design skills
## Key Takeaway
Reflection is a simple but powerful technique that often leads to **better accuracy, completeness, and quality** than direct generation. It is especially valuable for structured, complex, or high-stakes outputs.
---
## Story / Examples
### Example 1: Domain Name Brainstorming
An LLM generates multiple domain names, but some may:
- Be hard to pronounce
- Have unintended negative meanings
A reflection prompt is used to review and filter the list, resulting in a smaller set of safe and usable domain names. This approach was used in a real team setting to help startups choose better domain names.
### Example 2: Improving an Email Draft
The LLM writes an initial email using provided facts and dates. A reflection prompt then asks it to:
- Check tone
- Verify factual accuracy
- Review commitments and promises
Based on this review, the LLM produces a clearer, more accurate, and more professional second draft.


:::
:::spoiler Reflection Design Pattern in Chart Generation


## Context: Coding Lab and Visualization
In this module’s coding lab, an agent is used to generate **data visualizations** (charts and diagrams). Applying the **reflection design pattern** can significantly improve the **clarity, readability, and visual quality** of these generated charts.
## Problem Setup
- Input data: Coffee machine sales data stored as a CSV file.
- Data includes:
- Types of drinks (latte, cappuccino, hot chocolate, etc.)
- Sale times and prices
- Goal: Create a plot comparing **Q1 (first quarter) coffee sales** for **2024 vs 2025**.
## Initial (Version 1) Generation
- The LLM is prompted to generate Python code to create the visualization.
- The first version (v1) of the code successfully generates a plot.
- However, the output is a **stacked bar chart**, which:
- Is harder to interpret
- Is visually cluttered
- Does not clearly compare sales across years
## Applying Reflection with Multimodal Models
- Reflection is applied by providing the LLM with:
- Version 1 of the code
- The generated plot image
- A **multimodal LLM** is used, meaning it can:
- Accept both text and image inputs
- Perform visual reasoning by examining the chart
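A sketch of that reflection call, using the OpenAI SDK as one possible multimodal backend; the model name and prompt wording are assumptions.

```python
# Reflection over code plus the rendered chart image, using a multimodal model.
import base64
from openai import OpenAI

client = OpenAI()

def reflect_on_chart(code_v1: str, image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",                              # any multimodal model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    "You are an expert data analyst. Critique this chart for readability "
                    "and clarity, then rewrite the plotting code to fix the issues.\n\n"
                    f"Code (v1):\n{code_v1}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```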
## Reflection Outcome
- The multimodal LLM critiques the chart based on visual quality.
- It identifies weaknesses in the stacked bar plot.
- It updates the code to generate a **standard bar chart** that:
- Separates sales for each year
- Is clearer and easier to compare
- Looks more visually pleasing
## Role of Model Selection
- Different LLMs have different strengths.
- A common workflow:
- Use a general-purpose model (e.g., GPT-4o, GPT-5) for initial code generation.
- Use a reasoning-focused or multimodal model for reflection and improvement.
- Experimenting with different combinations often yields better results.
## Importance of Structured Reflection Prompts
Reflection works best when prompts:
- Assign a clear role (e.g., “expert data analyst”)
- Provide:
- Version 1 of the code
- Computational or execution history (if available)
- Specify evaluation criteria such as:
- Readability
- Clarity
- Completeness
- Explicitly ask the model to rewrite the code incorporating improvements
## Performance Considerations
- Reflection does not improve performance uniformly across all applications.
- Studies show:
- Significant improvement in some tasks
- Minor or negligible improvement in others
- It is important to:
- Evaluate reflection’s impact on your specific use case
- Tune both initial generation prompts and reflection prompts
---
## Story / Example: Improving a Coffee Sales Chart
An LLM first generates Python code that produces a stacked bar chart comparing Q1 coffee sales for 2024 and 2025. While technically correct, the visualization is cluttered and difficult to interpret. By feeding both the code and the resulting chart image into a multimodal LLM and asking it to critique the visualization, the model identifies weaknesses and rewrites the code to produce a cleaner bar chart. The final visualization clearly separates sales by year, making the comparison easier and more intuitive.
:::
:::spoiler Evals for Reflection workflow


:::