# Practical LLM Bootcamp for Devs
## Week 3 - Project: Function calling
Last week, we explored how to use Retrieval-Augmented Generation (RAG) to fetch relevant content, primarily using embeddings and vector stores.
Here are some discussion questions from last week as a recap:
- What is the difference between an embedding, a vector, and an array of floats?
- Why use cosine similarity vs FAISS (Facebook AI Similarity Search)?
- When using embeddings and semantic matching, what kinds of questions is it good at finding matches for, and what kinds is it bad at finding matches for?
- Last week, we used the ADA embedding model hosted by OpenAI. How would we know if we should switch to another embedding model or fine-tune an embedding model?
- When integrating with a chat app, how do we know when to search for matching "documents"? In the context of RAG, remember that a document is a small block of text, often 200-1000 words.
- When matching documents are returned, how are they added to the chat?
This week, we want to augment our LLM solutions with the ability to call functions to read or write data. First, the LLM must be able to discern the user's intention, evaluate whether one of the available functions would be helpful, and properly form the function call.
Although OpenAI and Anthropic have first-class support for function calls, we're going to implement it ourselves manually. This will enable us to use function calls with any open source model, and also develop an understanding of what OpenAI and Anthropic are doing under the hood.
### Milestone 0: Account set-up
For this project, we'll need to create a few accounts.
- Create accounts and API keys for:
- [TMDB](https://www.themoviedb.org/signup) - get your token [here](https://developer.themoviedb.org/reference/intro/getting-started)
- [SerpAPI](https://serpapi.com/users/sign_up) - get your key [here](https://serpapi.com/dashboard)
- [Anthropic](https://console.anthropic.com/) - get your API key [here](https://console.anthropic.com/settings/keys)
    - Add $10 in [billing](https://console.anthropic.com/settings/billing). This will also give you access to their excellent prompt improver in the console.
- [Fireworks](https://fireworks.ai/) (optional) - get your key [here](https://fireworks.ai/account/api-keys).
    - Fireworks provides fast hosting for recent open-source models.
    - For this lab, you can borrow my API key.
### Milestone 1: Calling a function -- fetch now playing movies
The goal for this milestone is to support a user's request to view movies that are now playing in theaters.
In order to implement this, the high level design strategy is:
1. Design the system prompt. Instruct the LLM to detect when the user is requesting a list of current movies. If appropriate, it should generate a function call; otherwise, it should respond to the user normally. It's up to you to determine and describe the format of the function call; it just has to be something that is easy for you to parse later. A convenient and commonly used format is [JSON Schema](https://json-schema.org/). See the example sketch after this list.
2. Parse the LLM output. If the LLM is generating a function call, call the function and inject the result as a system message in the message history.
3. In the event of a function call, generate an additional chat completion request. This will give the LLM an opportunity to look at the function's result and respond to the user.
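For example, here's a minimal sketch of what such a system prompt might look like. The tag name, JSON shape, and the `get_now_playing_movies` function name are illustrative assumptions; match them to whatever the starter's `movie_functions.py` actually provides.

```python
# A sketch only -- adapt the function name and tag convention to your starter project.
system_prompt = """\
You are a helpful movie assistant. You have access to the following function:

get_now_playing_movies() -> returns a list of movies currently playing in theaters

If the user asks what is playing now, respond ONLY with a function call wrapped in tags:

<function_call>
{"name": "get_now_playing_movies", "arguments": {}}
</function_call>

Otherwise, respond to the user normally and do not emit a function call.
"""
```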
**Take these steps**
1. Clone the starter project [here](https://github.com/timothy1ee/function_calling_starter). Follow the installation instructions in the README.
- The starter project is a basic Chainlit app, similar to your prework.
- Additionally, it includes a starter movies prompt that already generates function calls
- It provides a helper file of movie functions that work with the relevant APIs
- It does NOT yet parse the LLM function calls
2. Parse the emitted function call
- Invoke the appropriate function, and add the results to the message history as a system message
- Call the LLM again with the updated message history, so it can craft a response to the user
- Note: the helper function `extract_tag_content` will be useful; a rough sketch follows below
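A rough sketch of that parsing step, assuming the prompt format shown earlier. The `get_now_playing_movies` name and the exact signature of `extract_tag_content` are assumptions; check the starter's helper files for the real ones.

```python
import json

def handle_function_call(response_text: str, message_history: list) -> bool:
    """If the LLM emitted a function call, run it, append the result, and return True."""
    # extract_tag_content is the starter's helper; its exact signature may differ.
    call_text = extract_tag_content(response_text, "function_call")
    if not call_text:
        return False

    call = json.loads(call_text)
    if call["name"] == "get_now_playing_movies":  # name assumed; see movie_functions.py
        result = get_now_playing_movies()
        message_history.append(
            {"role": "system", "content": f"Result of get_now_playing_movies: {result}"}
        )
    return True
```

If this returns `True`, make one more chat completion call with the updated message history so the model can summarize the results for the user.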
**Milestone Checkpoints**
- [ ] Ask your bot normal movie questions (should not trigger a function call)
- [ ] Ask your bot what's playing now (should trigger a function call)
- [ ] Modify movie_functions.py to return an error from the function call. Observe the behavior.
### Milestone 3: Calling multiple functions -- fetch showtimes
Now that you've added one function successfully, add another! The LLM now needs to choose between two available functions. In practice, the latest models can handle choosing among 10-20 well-described functions. That said, more narrowly focused models tend to behave more deterministically.
For this milestone, we'd like to add the ability to get showtimes, which requires a movie title and a location (like a zip code or city name). You can use the function `get_showtimes(title, location)`.
**Take these steps**
1. Modify the system prompt to introduce this new functionality (see the snippet after these steps for one way to describe it).
2. Parse the `get_showtimes` function call.
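For example, the new function could be described by appending something like this to the system prompt. This is a sketch only; match the tag and JSON conventions you chose in Milestone 1.

```python
# Appended to the existing system prompt; wording is illustrative.
system_prompt += """
You also have access to:

get_showtimes(title, location) -> returns showtimes for a movie near a location
(location can be a zip code or a city name)

To call it, emit:
<function_call>
{"name": "get_showtimes", "arguments": {"title": "...", "location": "..."}}
</function_call>

If the user asks for showtimes but hasn't given a location, ask for one instead of guessing.
"""
```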
**Milestone Checkpoint**
- [ ] Make sure the "what's playing now" feature still works
- [ ] Try requesting showtimes for a movie
- [ ] Try requesting showtimes without specifying a location
- [ ] Try requesting showtimes for invalid movies or invalid locations
### Milestone 4: Chaining functions
Currently, our bot can call one of two functions. We'd like to augment it so that it can chain several function calls to satisfy a single request.
We'd like to support the user query, "Get the movies playing now, pick a random movie, and get the showtimes for 94158".
In order to support this, one approach is to set up a while loop in the `on_message` function. As long as the bot is generating function calls, you should continue to call functions.
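A rough sketch of that loop is below. `generate_response` and `call_function` are hypothetical stand-ins for however your app calls the LLM and dispatches to the functions in `movie_functions.py`; the tag parsing is the same as in Milestone 1.

```python
import json

async def run_turn(message_history):
    """Keep calling functions until the LLM produces a normal reply (a sketch only)."""
    calls_made = 0
    response_text = ""
    while calls_made < 5:  # cap the loop so a confused model can't run forever
        response_text = await generate_response(message_history)  # hypothetical LLM helper
        call_text = extract_tag_content(response_text, "function_call")  # starter helper
        if not call_text:
            break  # a normal reply; show it to the user
        call = json.loads(call_text)
        result = call_function(call["name"], call["arguments"])  # hypothetical dispatcher
        message_history.append(
            {"role": "system", "content": f"Result of {call['name']}: {result}"}
        )
        calls_made += 1
    return response_text
```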
**Take these steps**
1. No modification to the system prompt is required
2. In `on_message`, set up a while loop (as sketched above) to keep calling functions until the LLM stops generating function calls
**Milestone Checkpoint**
- [ ] Make sure the "what's playing now" and "get showtimes" features still work
- [ ] Try the query, "Get the movies playing now, pick a random movie, and get the showtimes for 94158"
- [ ] Try the query, "Get the movies playing now, pick a random movie, and get the showtimes" (i.e., omit the location parameter)
### Milestone 5: Calling functions with user confirmation -- Buying tickets
It feels relatively safe to call functions that have no side effects. However, calling functions that take actions (sending an email, modifying a database, purchasing a product) feels like it should require confirmation.
Add the feature to buy a ticket for a showtime by calling `buy_ticket(theater, movie, showtime)`. This is a function that just pretends to buy a ticket.
Before actually calling the function, though, confirm the details with the user.
**Take these steps**
1. Add the `buy_ticket` function to the prompt, and parse it. Test that buying works as expected before adding any confirmation step.
2. In the prompt, create an additional function for `confirm_ticket_purchase(theater, movie, showtime)`.
3. In the parsing for `buy_ticket`, add a system message that echoes the ticket details and includes an instruction to confirm the purchase with the user (see the sketch after these steps).
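One way to wire this up is sketched below. It treats `confirm_ticket_purchase` as the call that actually executes the purchase once the user has said yes; that mapping, and the message wording, are assumptions rather than part of the starter.

```python
# Inside your function-call dispatch (a sketch only):
if call["name"] == "buy_ticket":
    args = call["arguments"]
    # Don't buy yet -- echo the details and instruct the LLM to confirm with the user.
    message_history.append({
        "role": "system",
        "content": (
            f"The user wants to buy a ticket: theater={args['theater']}, "
            f"movie={args['movie']}, showtime={args['showtime']}. "
            "Confirm these details with the user. Only after an explicit yes should you "
            "call confirm_ticket_purchase with the same arguments."
        ),
    })
elif call["name"] == "confirm_ticket_purchase":
    args = call["arguments"]
    # The user has confirmed, so perform the (pretend) purchase.
    result = buy_ticket(args["theater"], args["movie"], args["showtime"])
    message_history.append({"role": "system", "content": f"Purchase result: {result}"})
```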
**Milestone Checkpoint**
- [ ] Test the previous features and make sure they still work
- [ ] Try buying a ticket with inadequate parameters
- [ ] Try buying a ticket with positive confirmation
- [ ] Try buying a ticket, but cancel the purchase
### Milestone 6 (optional): Integrating with RAG
One option would be to add another function that fetches extra data, such as movie reviews.
Instead of implementing it that way, let's approach this in the style of RAG. Note that there are many ways to implement RAG; this is just one method.
In this milestone, we want to fetch movie reviews if they're useful context for the user's question. For example, if a user asks, "What are critics saying about this movie?", that should trigger a fetch.
From a system design perspective, every time a user sends a message, we want to process their request and make a judgement about whether to fetch reviews.
**Take these steps**
1. In `on_message`, before sending the message to the current system prompt, send the message history to another prompt. That prompt's only goal is to evaluate if it would be helpful to fetch movie reviews. For example, use a prompt like this:
```python
system_prompt = """\
Based on the conversation, determine if the topic is about a specific movie. Determine if the user is asking a question that would be aided by knowing what critics are saying about the movie. Determine if the reviews for that movie have already been provided in the conversation. If so, do not fetch reviews.
Your only role is to evaluate the conversation, and decide whether to fetch reviews.
Output the current movie, its id, a boolean indicating whether to fetch reviews, and your rationale, all in JSON format. Do not output as a code block.
{
    "movie": "title",
    "id": 123,
    "fetch_reviews": true,
    "rationale": "reasoning"
}
"""
```
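Sending the conversation to that evaluation prompt and parsing its reply might look roughly like this. The sketch assumes the LiteLLM-based setup from the starter (`model` is the same variable used in `app.py`); adapt it to however your app actually calls the LLM.

```python
import json
import litellm

# Inside on_message (async), before sending to the main prompt:
# ask the evaluation prompt whether fetching reviews would help this turn.
eval_messages = [{"role": "system", "content": system_prompt}] + message_history
eval_response = await litellm.acompletion(model=model, messages=eval_messages)
raw_output = eval_response.choices[0].message.content

try:
    context_json = json.loads(raw_output)
except json.JSONDecodeError:
    context_json = {}  # the model didn't return clean JSON; skip fetching this turn
```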
2. If the decision is to fetch a movie review, call the `get_reviews(movie_id)` function. Add the reviews as a system message in the message history. That code might look something like:
```python
if context_json.get("fetch_reviews", False):
movie_id = context_json.get("id")
reviews = get_reviews(movie_id)
reviews = f"Reviews for {context_json.get('movie')} (ID: {movie_id}):\n\n{reviews}"
context_message = {"role": "system", "content": f"CONTEXT: {reviews}"}
message_history.append(context_message)
```
When the original prompt is able to respond, it'll have the benefit of the added context.
**Milestone Checkpoint**
- [ ] Test the previous features and make sure they still work
- [ ] Ask questions like, "What are people saying about this movie" to trigger the fetch of the movie reviews.
### Milestone 7 (optional): Using OpenAI function calling
OpenAI and Anthropic both support function calling (also called tool use):
- OpenAI: https://platform.openai.com/docs/guides/function-calling
- Anthropic: https://docs.anthropic.com/en/docs/build-with-claude/tool-use
What's the motivation? LLMs are unpredictable: function calling sometimes works as we'd hope, and sometimes it doesn't. The model may fail to call a function, call the wrong function, or call the right function with malformed arguments.
OpenAI and Anthropic's solution is an attempt to make function calling more reliable.
It works in the following way:
- Instead of describing the available functions in the system prompt, pass the function definitions as a parameter of the API call.
- Describe the function in this format: https://platform.openai.com/docs/guides/function-calling/step-2-describe-your-function-to-the-model-so-it-knows-how-to-call-it
- Check for tool calls in the response (see the sketch below)
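For example, using the OpenAI Python client, the showtimes function from this project might be described like this. The schema format is OpenAI's; the descriptions and the use of `message_history` are assumptions for this lab.

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_showtimes",
        "description": "Get showtimes for a movie near a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "The movie title"},
                "location": {"type": "string", "description": "A zip code or city name"},
            },
            "required": ["title", "location"],
        },
    },
}]

# message_history is your existing conversation list.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=message_history,
    tools=tools,
)

# If the model decided to call a function, it appears here instead of in the text reply.
tool_calls = response.choices[0].message.tool_calls
```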
Behind the scenes, OpenAI and Anthropic do things that are similar to our manual implementation. However, they've done additional fine-tuning, and also added additional stages in their pipeline that help increase the reliability of the function calls.
**Take these steps**
1. Review the documentation for function calling: https://platform.openai.com/docs/guides/function-calling
2. Port your implementation to use the OpenAI function calling parameters
**Milestone Checkpoint**
- [ ] Test the previous features and make sure they still work
- [ ] In LangSmith, compare the timing of the original implementation vs. the OpenAI function-calling implementation
### Milestone 8 (optional): Using other models
In this milestone, you'll explore the performance and character of different models. There's a convenient library, LiteLLM, that allows you to easily switch between providers. It maps every provider's API to the OpenAI format for convenience.
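For reference, a LiteLLM call looks roughly like this; the starter already wraps this for you, and the model strings are just examples.

```python
import litellm

# The provider is encoded in the model string; the call shape stays the same.
response = litellm.completion(
    model="claude-3-5-sonnet-20241022",  # or "openai/gpt-4o", or a fireworks_ai/... model
    messages=[{"role": "user", "content": "What movies are playing now?"}],
)
print(response.choices[0].message.content)
```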
**Take these steps**
1. In your .env, make sure you've added API keys for Anthropic and Fireworks.ai
2. The starter project is already using LiteLLM. At the top of `app.py`, uncomment different models.
```python
# OpenAI GPT-4
# model = "openai/gpt-4o"
# Anthropic Claude
model = "claude-3-5-sonnet-20241022"
# Fireworks Qwen
# model = "fireworks_ai/accounts/fireworks/models/qwen2p5-coder-32b-instruct"
```
3. Explore different Fireworks models [here](https://fireworks.ai/models)
**Milestone Checkpoint**
- [ ] Test the previous features and see how well they work for different models
## Capstone Project - Week 3
Last week, the project milestone was to build the basic scaffold of your app, including the primary prompt and evaluation. Optionally, you added in RAG and LLM-as-a-judge evaluation.
Note: you do not need to build a Chainlit app. If your app is more fundamentally a backend app, or it is using another way of interacting with the user (Chrome Extension, Slack, SMS, email, custom UI, etc), that's fine and exciting!
### Week 3 Capstone Milestones
- [ ] Continue building your RAG pipeline
- [ ] Implement at least one function call
## Submission
Complete the lab and upload to GitHub.
Submit your [GitHub repos for Week 3 here](https://forms.gle/6Exe7LaamPJgqB2H8).