# Prompt-Engineering (Part 1)

###### tags: `Caloudi` `ChatGPT` `Note`

[Guide link](https://www.promptingguide.ai/)
[GitHub link](https://github.com/dair-ai/Prompt-Engineering-Guide)

# CH1 Prompting Introduction

Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering skills help to better understand the capabilities and limitations of large language models (LLMs). Researchers use prompt engineering to improve the capacity of LLMs on a wide range of common and complex tasks such as question answering and arithmetic reasoning. Developers use prompt engineering to design robust and effective prompting techniques that interface with LLMs and other tools.

This guide covers the basics of standard prompts to provide a rough idea of how to use prompts to interact with and instruct LLMs. All examples are tested with `text-davinci-003` (using OpenAI's playground) unless otherwise specified, with the default configurations, e.g., `temperature=0.7` and `top_p=1`.

Topics:
- [Basic Prompts](#basic-prompts)
- [A Word on LLM Settings](#a-word-on-llm-settings)
- [Standard Prompts](#standard-prompts)
- [Prompt Elements](#elements-of-a-prompt)
- [General Tips for Designing Prompts](#general-tips-for-designing-prompts)

---

## Basic Prompts

You can already achieve a lot with prompts, but the quality of the results depends on how much information you provide. A prompt can contain information like the `instruction` or `question` you are passing to the model and include other details such as `inputs` or `examples`.

Here is a basic example of a simple prompt:

*Prompt*
```
The sky is
```

*Output:*
```
blue

The sky is blue on a clear day. On a cloudy day, the sky may be gray or white.
```

As you can see, the language model outputs a continuation that makes sense given the context `"The sky is"`. The output might be unexpected or far from the task we want to accomplish. This basic example also highlights the need to provide more context or instructions on what specifically we want to achieve.

Let's try to improve it a bit:

*Prompt:*
```
Complete the sentence:

The sky is
```

*Output:*
```
so beautiful today.
```

Is that better? Well, we told the model to complete the sentence, so the result looks a lot better as it follows exactly what we told it to do ("complete the sentence"). This approach of designing optimal prompts to instruct the model to perform a task is what's referred to as **prompt engineering**.

The example above is a basic illustration of what's possible with LLMs today. Today's LLMs can perform all kinds of advanced tasks that range from text summarization to mathematical reasoning to code generation.

---

## A Word on LLM Settings

When working with prompts, you will be interacting with the LLM via an API or directly. You can configure a few parameters to get different results for your prompts.

**Temperature** - In short, the lower the temperature, the more deterministic the results, in the sense that the most probable next token is always picked. Increasing the temperature leads to more randomness, encouraging more diverse or creative outputs; we are essentially increasing the weights of the other possible tokens. In terms of application, we might want to use a lower temperature for something like fact-based QA to encourage more factual and concise responses. For poem generation or other creative tasks, it might be beneficial to increase the temperature.

**Top_p** - Similarly, with top_p, the parameter for a sampling technique called nucleus sampling, you can control how deterministic the model is at generating a response. If you are looking for exact and factual answers, keep this low. If you are looking for more diverse responses, increase it to a higher value. The general recommendation is to alter one, not both.
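To make these settings concrete, here is a minimal sketch of how they are passed to the completions endpoint, assuming the same pre-1.0 `openai` Python client and `text-davinci-003` model used throughout this guide; the `complete` helper and the example prompts are illustrative, not from the guide:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; use your own key

def complete(prompt, temperature=0.7, top_p=1.0):
    """Send a prompt with explicit sampling settings and return the completion text."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=temperature,  # lower -> more deterministic
        top_p=top_p,              # nucleus sampling cutoff
        max_tokens=256,
    )
    return response.choices[0].text.strip()

# Fact-based QA: keep sampling tight.
print(complete("Q: What is the capital of France?\nA:", temperature=0))

# Creative generation: loosen it up.
print(complete("Write a short poem about the sky.", temperature=0.9))
```

Per the recommendation above, vary only one of the two parameters at a time when experimenting.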
Before starting with some basic examples, keep in mind that your results may vary depending on the version of the LLM you are using.

---

## Standard Prompts

We have tried a very simple prompt above. A standard prompt has the following format:

```
<Question>?
```

This can be formatted into a QA format, which is standard in a lot of QA datasets, as follows:

```
Q: <Question>?
A:
```

Given the standard format above, one popular and effective technique for prompting is referred to as few-shot prompting, where we provide exemplars. Few-shot prompts can be formatted as follows:

```
<Question>?
<Answer>

<Question>?
<Answer>

<Question>?
<Answer>

<Question>?
```

And you can already guess that its QA format version would look like this:

```
Q: <Question>?
A: <Answer>

Q: <Question>?
A: <Answer>

Q: <Question>?
A: <Answer>

Q: <Question>?
A:
```

Keep in mind that it's not required to use the QA format. The format depends on the task at hand. For instance, you can perform a simple classification task and give exemplars that demonstrate the task as follows:

*Prompt:*
```
This is awesome! // Positive
This is bad! // Negative
Wow that movie was rad! // Positive
What a horrible show! //
```

*Output:*
```
Negative
```

Few-shot prompts enable in-context learning, which is the ability of language models to learn tasks given only a few examples. We will see more of this in action in the upcoming guides.

---

## Elements of a Prompt

As we cover more and more examples and applications that are possible with prompt engineering, you will notice that there are certain elements that make up a prompt. A prompt can contain any of the following components:

**Instruction** - a specific task or instruction you want the model to perform

**Context** - external information or additional context that can steer the model to better responses

**Input Data** - the input or question that we are interested in finding a response for

**Output Indicator** - indicates the type or format of the output.

Not all the components are required for a prompt, and the format depends on the task at hand. We will touch on more concrete examples in upcoming guides.

---

## General Tips for Designing Prompts

Here are some tips to keep in mind while you are designing your prompts:

### Start Simple

As you get started with designing prompts, keep in mind that it is an iterative process that requires a lot of experimentation to get optimal results. Using a simple playground like OpenAI's or Cohere's is a good starting point.

You can start with simple prompts and keep adding more elements and context as you aim for better results. Versioning your prompts along the way is vital for this reason. As we work through the guide, you will see many examples where specificity, simplicity, and conciseness often give better results.

When you have a big task that involves many different subtasks, you can try to break the task down into simpler subtasks and keep building up as you get better results. This avoids adding too much complexity to the prompt design process at the beginning.
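Because the few-shot QA format shown earlier is purely mechanical string assembly, the iteration recommended here is easy to script. A minimal sketch (the helper name and exemplars are illustrative, not from the guide):

```python
def build_few_shot_prompt(exemplars, question):
    """Assemble the standard few-shot QA format from (question, answer) pairs."""
    lines = []
    for q, a in exemplars:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
        lines.append("")  # blank line between exemplars
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

exemplars = [
    ("Is the sky blue on a clear day?", "Yes"),
    ("Is fire cold?", "No"),
]
print(build_few_shot_prompt(exemplars, "Is snow white?"))
```

Keeping prompt assembly in a function like this also makes it easy to version and diff your prompts as you iterate.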
### The Instruction

You can design effective prompts for various simple tasks by using commands to instruct the model what you want to achieve, such as "Write", "Classify", "Summarize", "Translate", "Order", etc.

Keep in mind that you also need to experiment a lot to see what works best. Try different instructions with different keywords, contexts, and data, and see what works best for your particular use case and task. Usually, the more specific and relevant the context is to the task you are trying to perform, the better. We will touch on the importance of sampling and adding more context in the upcoming guides.

Others recommend placing instructions at the beginning of the prompt. It's also recommended to use a clear separator like "###" between the instruction and the context. For instance:

*Prompt:*
```
### Instruction ###
Translate the text below to traditional Chinese:

Text: "Oh my god!"
```

*Output:*
```
哦我的天啊!
```

### Specificity

Be very specific about the instruction and task you want the model to perform. The more descriptive and detailed the prompt is, the better the results. This is particularly important when you are seeking a desired outcome or style of generation. There aren't specific tokens or keywords that lead to better results; it's more important to have a good format and a descriptive prompt. Providing examples in the prompt is very effective for getting the desired output in specific formats.

When designing prompts, you should also keep in mind the length of the prompt, as there are limitations on how long it can be. Think about how specific and detailed you really need to be: too many unnecessary details are not a good approach. The details should be relevant and contribute to the task at hand. This is something you will need to experiment with a lot. We encourage a lot of experimentation and iteration to optimize prompts for your applications.

As an example, let's try a simple prompt to extract specific information from a piece of text.

*Prompt:*
```
提取以下文本中的地名。

所需格式:
地點:<頓號分隔的地點名稱>

輸入:“Coldplay上次來台開場已是2017年,這次睽違6年再度來台,Live Nation Taiwan 理想國9日在臉書宣布:「搖滾天團 Coldplay 首度降臨高雄!11月就與 Coldplay相約星際漫遊。」演唱會定在11月11日周六晚間19點30分,於高雄國家體育場(世運主場館)開唱,會員預售時間為5月17日中午12點至晚間23點59分,門票將於5月18日中午12點於拓元售票系統星際漫遊開賣。”
```

*Output:*
```
地點:台灣、高雄、國家體育場(世運主場館)
```

The input text is from [ETtoday news](https://star.ettoday.net/news/2495229).

### Avoid Impreciseness

Given the tips above about being detailed and improving format, it's easy to fall into the trap of wanting to be too clever about prompts and potentially creating imprecise descriptions. It's often better to be specific and direct. The analogy here is very similar to effective communication -- the more direct, the more effectively the message gets across.

For example, you might be interested in learning the concept of prompt engineering. You might try something like:

*Prompt:*
```
Explain the concept of prompt engineering. Keep the explanation short, only a few sentences, and don't be too descriptive.
```

It's not clear from the prompt above how many sentences to use or what style. You might still somewhat get good responses with the above prompt, but a better prompt would be one that is very specific, concise, and to the point. Something like:

*Prompt:*
```
Use 2-3 sentences to explain the concept of prompt engineering to a high school student.
```
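The instruction-first layout with a "###" separator combines naturally with the specific, direct phrasing recommended above. A small illustrative template (the helper is an assumption for this note, not part of the guide):

```python
def make_prompt(instruction, text=""):
    """Put the instruction first and separate it from the content with '###'."""
    prompt = f"### Instruction ###\n{instruction}"
    if text:
        prompt += f"\n\nText: {text}"
    return prompt

print(make_prompt(
    "Use 2-3 sentences to explain the concept of prompt engineering "
    "to a high school student."
))
```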
### To do or not to do?

Another common tip when designing prompts is to avoid saying what not to do and instead say what to do. This encourages more specificity and focuses on the details that lead to good responses from the model.

Here is an example of a movie recommendation chatbot failing at exactly what I don't want it to do because of how I wrote the instruction -- focusing on what not to do.

*Prompt:*
```
你現在是一個電影推薦客服。
過程中不能問及相關興趣以及個人資訊。

顧客:請根據我的興趣推薦一部電影。
客服:
```

*Output:*
```
您好,我很樂意為您推薦一部電影,請問您最喜歡什麼類型的電影?
```

Here is a better prompt:

*Prompt:*
```
你現在是一個電影推薦客服。
主要任務為負責從全球熱門電影中推薦一部電影。過程中不能過問用戶的興趣以及個人資訊。
如果客服沒有推薦的電影,應該回覆“抱歉,今天找不到推薦的電影。”。

顧客:請根據我的興趣推薦一部電影。
客服:
```

*Output:*
```
抱歉,今天找不到推薦的電影。不過,你可以參考熱門電影的排行榜,也許可以找到一部適合你的電影。
```

Some of the examples above were adapted from the ["Best practices for prompt engineering with OpenAI API" article.](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)

---

# CH2 Basic Prompting

In the previous guide, we introduced and gave a basic example of a prompt. In this guide, we will provide more examples of how prompts are used and introduce key concepts that will be important for the more advanced guides.

Often, the best way to learn concepts is by going through examples. Below we cover a few examples of how well-crafted prompts can be used to perform all types of interesting and different tasks.

Topics:
- [Text Summarization](#text-summarization)
- [Information Extraction](#information-extraction)
- [Question Answering](#question-answering)
- [Text Classification](#text-classification)
- [Conversation](#conversation)
- [Code Generation](#code-generation)
- [Reasoning](#reasoning)
- [Python Notebooks](#python-notebooks)

---

## Text Summarization

One of the standard tasks in natural language generation is text summarization. Text summarization can include many different flavors and domains. One of the most promising applications of language models is the ability to summarize articles and concepts into quick and easy-to-read summaries.

Let's try a basic summarization task using prompts. Let's say I am interested in learning about antibiotics; I could try a prompt like this:

*Prompt:*
```
請解釋抗生素

A:
```

*Output:*
```
抗生素是一種藥物,用於治療由細菌引起的疾病。它們的作用原理是抑制細菌的生長和繁殖,使細菌群的數量變少,從而減輕病症,改善病人的健康狀況。抗生素可以分為兩大類:抗菌素和抗真菌素。抗菌素可以用於治療由細菌引起的感染,而抗真菌素則用於治療由真菌引起的感染。抗生素的藥劑形式包括片劑、膠囊、注射液和懸液等。由於抗生素可以有效治療細菌性感染,它們在醫學和其他領域都有廣泛的應用。
```

The "A:" is an explicit prompt format used in question answering. I used it here to tell the model that an answer is expected to follow. In this example, it's not clear how this is useful versus not using it, but we will leave that for later examples. Let's just assume that this is too much information and we want to summarize it further. We can instruct the model to summarize it into one sentence like so:

*Prompt:*
```
抗生素是一種藥物,用於治療由細菌引起的疾病。它們的作用原理是抑制細菌的生長和繁殖,使細菌群的數量變少,從而減輕病症,改善病人的健康狀況。抗生素可以分為兩大類:抗菌素和抗真菌素。抗菌素可以用於治療由細菌引起的感染,而抗真菌素則用於治療由真菌引起的感染。抗生素的藥劑形式包括片劑、膠囊、注射液和懸液等。由於抗生素可以有效治療細菌性感染,它們在醫學和其他領域都有廣泛的應用。

請用一句話解釋上述段落:
```

*Output:*
```
抗生素是用來抑制細菌生長和繁殖,縮小細菌群,以減輕病症和改善病人健康狀況的藥物。
```

Without paying too much attention to the accuracy of the output above, which is something we will touch on in a later guide, the model tried to summarize the paragraph in one sentence. You can get clever with the instructions, but we will leave that for a later chapter. Feel free to pause here and experiment to see if you get better results.
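If you want to run this experiment outside the playground, the same prompt works verbatim through the completions endpoint. A sketch, assuming the pre-1.0 `openai` client (the passage string is abbreviated here; substitute the full paragraph above):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

passage = "抗生素是一種藥物,用於治療由細菌引起的疾病。..."  # full paragraph above

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=passage + "\n\n請用一句話解釋上述段落:",
    temperature=0.7,
    max_tokens=128,
)
print(response.choices[0].text.strip())
```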
---

## Information Extraction

While language models are trained to perform natural language generation and related tasks, they are also very capable of performing classification and a range of other natural language processing (NLP) tasks.

Here is an example of a prompt that extracts information from a given paragraph.

*Prompt:*
```
Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT in the preparation of their manuscript and analysis. They should also indicate which LLMs were used. This will alert editors and reviewers to scrutinize manuscripts more carefully for potential biases, inaccuracies and improper source crediting. Likewise, scientific journals should be transparent about their use of LLMs, for example when selecting submitted manuscripts.

Mention the large language model based product mentioned in the paragraph above:
```

*Output:*
```
The large language model based product mentioned in the paragraph above is ChatGPT.
```

There are many ways we can improve the results above, but this is already very useful. By now it should be obvious that you can ask the model to perform different tasks by simply instructing it what to do. That's a powerful capability that AI product builders are already using to build powerful products and experiences.

Paragraph source: [ChatGPT: five priorities for research](https://www.nature.com/articles/d41586-023-00288-7)

---

## Question Answering

One of the best ways to get the model to respond with specific answers is to improve the format of the prompt. As covered before, a prompt can combine instructions, context, input, and output indicators to get improved results. While these components are not required, using them is good practice: the more specific you are with the instruction, the better the results you will get. Below is an example of how this would look following a more structured prompt.

*Prompt:*
```
根據以下文章回答問題。
保持答案簡潔性。
如果不確定答案為何,則回答“不清楚”。

金州勇士(英語:Golden State Warriors,簡稱:GSW),是一支位於美國加利福尼亞州舊金山的美國職業籃球隊,分屬於NBA聯盟西區聯盟的太平洋組,主場球館為大通銀行中心。球隊的格言為「全隊即為一城」(The whole team is a city)。勇士隊是現今北美職業體育聯賽裡,少有的名稱不包含主場所在城市的隊伍,這是因為1971年球團想以北加州的奧克蘭以及南加州的聖地牙哥為雙主場,而以加州的暱名「金州」(Golden State)為名來代表為全加州的球隊。勇士隊前身為費城勇士(Philadelphia Warriors),成立於1946年,是聯盟的創始球隊之一,曾於1947年贏得聯盟第一個賽季的總冠軍。1962年,球隊遷至舊金山灣區,更名為舊金山勇士(San Francisco Warriors);1971年,球隊改名為如今的金州勇士。1972年,球隊遷至和舊金山只有一橋之隔的奧克蘭體育館。球隊在2019–20賽季遷回舊金山,新球場是大通銀行中心(Chase Center)。球隊一共獲得過7次聯盟總冠軍。2015-16賽季金州勇士取得73勝9負,成為NBA史上單季例行賽最佳戰績的隊伍,打破了1995-96賽季芝加哥公牛72勝10負的記錄。2018年總決賽以4-0橫掃克里夫蘭騎士奪得NBA總冠軍達成二連霸,之後經歷三個賽季於2022年總決賽以4-2擊敗波士頓塞爾提克再次奪得NBA總冠軍。

問:金州勇士的前身為何?
答:
```

*Output:*
```
費城勇士(Philadelphia Warriors)。
```

Context obtained from [Wiki](https://zh.wikipedia.org/zh-tw/%E9%87%91%E5%B7%9E%E5%8B%87%E5%A3%AB).

---

## Text Classification

So far, we have used simple instructions to perform a task. As a prompt engineer, you will need to get better at providing better instructions. But that's not all! You will also find that for harder use cases, just providing instructions won't be enough. This is where you need to think more about the context and the different elements you can use in a prompt. Other elements you can provide are `input data` and `examples`.

Let's try to demonstrate this by providing an example of text classification.

*Prompt:*
```
Classify the text into neutral, negative or positive.

Text: I think the food was okay.
Sentiment:
```
*Output:*
```
Neutral
```

We gave the instruction to classify the text, and the model responded with `Neutral`, which is correct. Nothing is wrong with this, but let's say what we really need is for the model to give the label in the exact format we want. So instead of `Neutral`, we want it to return `neutral`. How do we achieve this? There are different ways to do this. We care about specificity here, so the more information we can provide in the prompt, the better the results. We can try providing examples to specify the correct behavior. Let's try again:

*Prompt:*
```
Classify the text into neutral, negative or positive.

Text: I think the vacation is okay.
Sentiment: neutral

Text: I think the food was okay.
Sentiment:
```

*Output:*
```
neutral
```

Perfect! This time the model returned `neutral`, which is the specific label I was looking for. It seems that the example provided in the prompt helped the model to be specific in its output. To highlight why sometimes being specific is important, check out this example and spot the problem:

*Prompt:*
```
Classify the text into nutral, negative or positive.

Text: I think the vacation is okay.
Sentiment:
```

*Output:*
```
Neutral
```

What is the problem here? As a hint, the made-up `nutral` label is completely ignored by the model. Instead, the model outputs `Neutral`, as it has some bias towards that label. But let's assume that what you really want is `nutral`. How would you fix this? Maybe you can try adding descriptions to the labels or adding more examples to the prompt? If you are not sure, we will discuss a few ideas in the upcoming sections.

---

## Conversation

Perhaps one of the more interesting things you can achieve with prompt engineering is instructing the LLM system on how to behave, its intent, and its identity. This is particularly useful when you are building conversational systems like customer service chatbots.

For instance, let's create a conversational system that's able to generate more technical and scientific responses to questions. Note how we are explicitly telling it how to behave through the instruction. This is sometimes referred to as *role prompting*.

*Prompt:*
```
The following is a conversation with an AI research assistant. The assistant tone is technical and scientific.

Human: Hello, who are you?
AI: Greeting! I am an AI research assistant. How can I help you today?
Human: Can you tell me about the creation of blackholes?
AI:
```

*Output:*
```
Sure! Black holes are regions of spacetime where the gravitational force is so strong that nothing, not even light, can escape from it. They are created when a very massive star dies and its core collapses in on itself, forming a singularity of infinite density. The intense gravity of the singularity pulls in all the matter and radiation around it, creating the black hole.
```

Our AI research assistant sounds a bit too technical, right? Okay, let's change this behavior and instruct the system to give more accessible answers.

*Prompt:*
```
The following is a conversation with an AI research assistant. The assistant answers should be easy to understand even by primary school students.

Human: Hello, who are you?
AI: Greeting! I am an AI research assistant. How can I help you today?
Human: Can you tell me about the creation of black holes?
AI:
```

*Output:*
```
Sure! Black holes are created when a very massive star runs out of fuel and collapses in on itself. The gravity of the star is so strong that nothing, not even light, can escape its pull. This creates a point in space-time with an intense gravitational force called a black hole.
```
I think we made some progress. You can continue improving it. I am sure if you add more examples you might get even better results.

---

## Code Generation

One application where LLMs are quite effective is code generation. Copilot is a great example of this. There are a vast number of code-generation tasks you can perform with clever prompts. Let's look at a few examples below.

First, let's try a simple prompt that asks for a PyTorch 1-D convolutional network.

*Prompt:*
```
/* create torch 1d conv network */
```

*Output:*
```
import torch
import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self, num_classes):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm1d(16),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv1d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2))
        self.fc = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        out = self.layer1(x)
```

You can see that we didn't even need to specify the language to use. Note that the output above is truncated mid-function, a reminder that generated code still needs review before use.

Let's switch levels a bit. I want to show you how powerful LLMs can be with a little more effort in designing the prompts.

*Prompt:*
```
"""
Table departments, columns = [DepartmentId, DepartmentName]
Table students, columns = [DepartmentId, StudentId, StudentName]
Create a MySQL query for all students in the Computer Science Department
"""
```

*Output:*
```
SELECT StudentId, StudentName
FROM students
WHERE DepartmentId IN (SELECT DepartmentId FROM departments WHERE DepartmentName = 'Computer Science');
```

This is very impressive. In this case, we provided data about the database schema and asked the model to generate a valid MySQL query.

---

## Reasoning

Perhaps one of the most difficult tasks for an LLM today is one that requires some form of reasoning. Reasoning is one of the areas I am most excited about because of the types of complex applications that can emerge from LLMs.

There have been some improvements in tasks involving mathematical capabilities. That said, it's important to note that current LLMs struggle with reasoning tasks, so this requires even more advanced prompt engineering techniques. We will cover these advanced techniques in the next guide. For now, we will cover a few basic examples to show arithmetic capabilities.

*Prompt:*
```
What is 9,000 * 9,000?
```

*Output:*
```
81,000,000
```

Let's try something more difficult.

*Prompt:*
```
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.

A:
```

*Output:*
```
No, the odd numbers in this group add up to an odd number: 119.
```

That's incorrect! Let's try to improve the result by improving the prompt.

*Prompt:*
```
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.

Solve by breaking the problem into steps. First, identify the odd numbers, add them, and indicate whether the result is odd or even.
```

*Output:*
```
Odd numbers: 15, 5, 13, 7, 1
Sum: 41
41 is an odd number.
```

Much better, right? By the way, I tried this a couple of times and the system sometimes fails. If you provide better instructions combined with examples, it might help get more accurate results.

We will continue to include more examples of common applications in this section of the guide. In the upcoming guides, we will cover even more advanced prompt engineering concepts for improving performance on all these and more difficult tasks.
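A side note on tasks like the one above: the model's arithmetic claims are cheap to verify with ordinary Python, which is handy when you iterate on reasoning prompts:

```python
nums = [15, 32, 5, 13, 82, 7, 1]
odds = [n for n in nums if n % 2 == 1]

print(odds)        # [15, 5, 13, 7, 1]
print(sum(odds))   # 41
print("even" if sum(odds) % 2 == 0 else "odd")  # odd
```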
---

## Python Notebooks

|Description|Notebook|
|--|--|
|Learn how to perform many different types of common tasks using the `openai` and `LangChain` libraries|[Getting Started with Prompt Engineering](../notebooks/pe-lecture.ipynb)|

---

# CH3 Advanced Prompting

By this point, it should be obvious that it helps to improve prompts to get better results on different tasks. That's the whole idea behind prompt engineering.

While those examples were fun, let's cover a few concepts more formally before we jump into more advanced concepts.

Topics:
- [Zero-shot Prompting](#zero-shot-prompting)
- [Few-shot Prompting](#few-shot-prompting)
- [Chain-of-Thought Prompting](#chain-of-thought-prompting)
- [Zero-shot CoT](#zero-shot-cot)
- [Self-Consistency](#self-consistency)
- [Generate Knowledge Prompting](#generated-knowledge-prompting)
- [Automatic Prompt Engineer](#automatic-prompt-engineer-ape)

---

## Zero-Shot Prompting

LLMs today, trained on large amounts of data and tuned to follow instructions, are capable of performing tasks zero-shot. We tried a few zero-shot examples in the previous section. Here is one of the examples we used:

*Prompt:*
```
Classify the text into neutral, negative, or positive.

Text: I think the vacation is okay.
Sentiment:
```

*Output:*
```
Neutral
```

Note that in the prompt above we didn't provide the model with any examples -- that's the zero-shot capability at work. When zero-shot doesn't work, it's recommended to provide demonstrations or examples in the prompt. Below we discuss the approach known as few-shot prompting.

---

## Few-Shot Prompting

While large language models already demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks in the zero-shot setting. To improve on this, few-shot prompting is used as a technique to enable in-context learning, where we provide demonstrations in the prompt to steer the model to better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.

Let's demonstrate few-shot prompting via an example that was presented in [Brown et al. 2020](https://arxiv.org/abs/2005.14165). In the example, the task is to correctly use a new word in a sentence.

1. *Prompt:*
```
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
```

*Output:*
```
When we won the game, we all started to farduddle in celebration.
```

2. *Prompt:*
```
"賽辣"的意思是指對於某事件或是事物表示的不可置信的驚嘆詞,下方是賽辣的例句:
賽辣,台灣竟然踢進世界盃了!
"邱一下"的意思是放鬆一下,下方是邱一下的例句:
```

*Output:*
```
今天太累了,大家一起邱一下吧!
```

We can observe that the model has somehow learned how to perform the task from just one example (i.e., 1-shot). For more difficult tasks, we can experiment with increasing the number of demonstrations (e.g., 3-shot, 5-shot, 10-shot, etc.).

Following the findings from [Min et al. (2022)](https://arxiv.org/abs/2202.12837), here are a few more tips about demonstrations/exemplars when doing few-shot prompting (a scripted version of the first experiment follows this list):

- "the label space and the distribution of the input text specified by the demonstrations are both important (regardless of whether the labels are correct for individual inputs)"
- the format you use also plays a key role in performance; even if you just use random labels, this is much better than no labels at all.
- additional results show that selecting random labels from a true distribution of labels (instead of a uniform distribution) also helps.
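The random-label experiment is easy to script: assign labels at random while keeping the demonstration format fixed. A sketch (the inputs are the ones used in this guide; the use of `random.choice` is an illustrative assumption):

```python
import random

random.seed(0)  # for reproducibility

texts = ["This is awesome!", "This is bad!", "Wow that movie was rad!"]
lines = [f"{t} // {random.choice(['Positive', 'Negative'])}" for t in texts]
lines.append("What a horrible show! //")

print("\n".join(lines))
```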
Let's try out a few examples. Let's first try an example with random labels (meaning the labels Negative and Positive are randomly assigned to the inputs):

*Prompt:*
```
This is awesome! // Negative
This is bad! // Positive
Wow that movie was rad! // Positive
What a horrible show! //
```

*Output:*
```
Negative
```

We still get the correct answer, even though the labels have been randomized. Note that we also kept the format, which helps too. In fact, with further experimentation, it seems the newer GPT models we are experimenting with are becoming more robust to even random formats. Example:

*Prompt:*
```
Positive This is awesome!
This is bad! Negative
Wow that movie was rad!
Positive
What a horrible show! --
```

*Output:*
```
Negative
```

There is no consistency in the format above, but the model still predicted the correct label. We would have to conduct a more thorough analysis to confirm whether this holds for different and more complex tasks, including different variations of prompts.

### Limitations of Few-shot Prompting

Standard few-shot prompting works well for many tasks but is still not a perfect technique, especially when dealing with more complex reasoning tasks. Let's demonstrate why this is the case. Do you recall the previous example where we provided the following task?

```
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.

A:
```

If we try this again, the model outputs the following:

```
Yes, the odd numbers in this group add up to 107, which is an even number.
```

This is not the correct response, which not only highlights the limitations of these systems but also shows that there is a need for more advanced prompt engineering.

Let's add some examples to see if few-shot prompting improves the results.

*Prompt:*
```
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The answer is False.

The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: The answer is True.

The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.
A: The answer is True.

The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.
A: The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:
```

*Output:*
```
The answer is True.
```

That didn't work. It seems few-shot prompting is not enough to get reliable responses for this type of reasoning problem. The example above provides basic information on the task. If you take a closer look, the type of task we have introduced involves a few more reasoning steps. In other words, it might help if we break the problem down into steps and demonstrate that to the model. More recently, [chain-of-thought (CoT) prompting](https://arxiv.org/abs/2201.11903) has been popularized to address more complex arithmetic, commonsense, and symbolic reasoning tasks.

Overall, it seems that providing examples is useful for solving some tasks. When zero-shot and few-shot prompting are not sufficient, it might mean that whatever was learned by the model isn't enough to do well at the task. From here, it is recommended to start thinking about fine-tuning your models or experimenting with more advanced prompting techniques. Up next, we talk about chain-of-thought prompting, one of the most popular of these techniques.
---

## Chain-of-Thought Prompting

Introduced in [Wei et al. (2022)](https://arxiv.org/abs/2201.11903), chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.

*Prompt:*
```
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: Adding all the odd numbers (17, 19) gives 36. The answer is True.

The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.
A: Adding all the odd numbers (11, 13) gives 24. The answer is True.

The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.
A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:
```

*Output:*
```
Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.
```

Wow! We get a perfect result when we provide the reasoning step. We can solve this task with even fewer examples; just one seems enough:

*Prompt:*
```
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:
```

*Output:*
```
Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.
```

Keep in mind that the authors claim this is an emergent ability that arises with sufficiently large language models.

---

## Zero-Shot CoT

One recent idea is [zero-shot CoT](https://arxiv.org/abs/2205.11916) (Kojima et al. 2022), which essentially involves adding "Let's think step by step" to the original prompt. Let's try a simple problem and see how the model performs:

*Prompt:*
```
I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
```

*Output:*
```
11 apples
```

The answer is incorrect! Now let's try with the special prompt.

*Prompt:*
```
I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

Let's think step by step.
```

*Output:*
```
First, you started with 10 apples.
You gave away 2 apples to the neighbor and 2 to the repairman, so you had 6 apples left.
Then you bought 5 more apples, so now you had 11 apples.
Finally, you ate 1 apple, so you would remain with 10 apples.
```

It's impressive that this simple prompt is effective at this task. This is particularly useful when you don't have many examples to use in the prompt.

---

## Self-Consistency

Perhaps one of the more advanced techniques out there for prompt engineering is self-consistency. Proposed by [Wang et al. (2022)](https://arxiv.org/pdf/2203.11171.pdf), self-consistency aims "to replace the naive greedy decoding used in chain-of-thought prompting". The idea is to sample multiple, diverse reasoning paths through few-shot CoT and use the generations to select the most consistent answer. This helps to boost the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.
Let's try the following example for arithmetic reasoning:

*Prompt:*
```
When I was 6 my sister was half my age. Now I’m 70 how old is my sister?
```

*Output:*
```
35
```

The output is wrong! How might we improve this with self-consistency? Let's try it out. We will use the few-shot exemplars from Wang et al. 2022 (Table 17):

*Prompt:*
```
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted. So, they must have planted 21 - 15 = 6 trees. The answer is 6.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.

Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74 chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.

Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so in total he has 7 + 2 = 9 toys. The answer is 9.

Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 = 20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers. The answer is 29.

Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent 5 * 3 = 15 dollars. She has 23 - 15 = 8 dollars left. The answer is 8.

Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?
A:
```

*Output 1:*
```
When you were 6, your sister was half your age. That means your sister was 6 / 2 = 3. Now you are 70. So your sister must be 70 - 3 = 67. The answer is 67.
```

*Output 2:*
```
When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70/2 = 35. The answer is 35.
```

*Output 3:*
```
When you were 6 your sister was half your age, so your sister was 6/2 = 3 years old. Now you are 70, so your sister is 70 - 3 = 67 years old. The answer is 67.
```

Computing the final answer involves a few steps (check out the paper for the details), but for the sake of simplicity, we can see that a majority answer (67) is already emerging, so that would essentially become the final answer.
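The paper's full procedure weights the sampled answers, but the simple majority vote used above can be sketched in a few lines, assuming the pre-1.0 `openai` client; the answer-extraction regex is an illustrative choice that relies on the exemplars' "The answer is ..." convention:

```python
import re
from collections import Counter

import openai  # assumes OPENAI_API_KEY is set in the environment

def self_consistent_answer(prompt, n=5):
    """Sample n diverse CoT completions and majority-vote the final answer."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0.7,  # > 0 so the sampled reasoning paths actually differ
        max_tokens=256,
        n=n,
    )
    answers = []
    for choice in response.choices:
        match = re.search(r"The answer is (\S+)", choice.text)
        if match:
            answers.append(match.group(1).rstrip("."))
    return Counter(answers).most_common(1)[0][0] if answers else None
```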
---

## Generated Knowledge Prompting

LLMs continue to be improved, and one popular technique is to incorporate knowledge or information to help the model make more accurate predictions.

Using a similar idea, can the model also be used to generate knowledge before making a prediction? That's what is attempted in the paper by [Liu et al. 2022](https://arxiv.org/pdf/2110.08387.pdf) -- generate knowledge to be used as part of the prompt. In particular, how helpful is this for tasks such as commonsense reasoning?

Let's try a simple prompt:

*Prompt:*
```
Part of golf is trying to get a higher point total than others. Yes or No?
```

*Output:*
```
Yes.
```

This type of mistake reveals the limitations of LLMs in performing tasks that require more knowledge about the world. How do we improve this with knowledge generation?

First, we generate a few "knowledges":

*Prompt:*
```
Input: Greece is larger than mexico.
Knowledge: Greece is approximately 131,957 sq km, while Mexico is approximately 1,964,375 sq km, making Mexico 1,389% larger than Greece.

Input: Glasses always fog up.
Knowledge: Condensation occurs on eyeglass lenses when water vapor from your sweat, breath, and ambient humidity lands on a cold surface, cools, and then changes into tiny drops of liquid, forming a film that you see as fog. Your lenses will be relatively cool compared to your breath, especially when the outside air is cold.

Input: A fish is capable of thinking.
Knowledge: Fish are more intelligent than they appear. In many areas, such as memory, their cognitive powers match or exceed those of ’higher’ vertebrates including non-human primates. Fish’s long-term memories help them keep track of complex social relationships.

Input: A common effect of smoking lots of cigarettes in one’s lifetime is a higher than normal chance of getting lung cancer.
Knowledge: Those who consistently averaged less than one cigarette per day over their lifetime had nine times the risk of dying from lung cancer than never smokers. Among people who smoked between one and 10 cigarettes per day, the risk of dying from lung cancer was nearly 12 times higher than that of never smokers.

Input: A rock is the same size as a pebble.
Knowledge: A pebble is a clast of rock with a particle size of 4 to 64 millimetres based on the Udden-Wentworth scale of sedimentology. Pebbles are generally considered larger than granules (2 to 4 millimetres diameter) and smaller than cobbles (64 to 256 millimetres diameter).

Input: Part of golf is trying to get a higher point total than others.
Knowledge:
```

*Knowledge 1:*
```
The objective of golf is to play a set of holes in the least number of strokes. A round of golf typically consists of 18 holes. Each hole is played once in the round on a standard golf course. Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.
```

*Knowledge 2:*
```
Golf is a precision club-and-ball sport in which competing players (or golfers) use many types of clubs to hit balls into a series of holes on a course using the fewest number of strokes. The goal is to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.
```

We are using the prompt provided in the paper by [Liu et al. 2022](https://arxiv.org/pdf/2110.08387.pdf).

The next step is to integrate the knowledge and get a prediction.
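Wired together, the two stages look roughly like this (a sketch assuming the pre-1.0 `openai` client; the templates are abbreviated stand-ins for the full prompts above, and the function names are illustrative):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# Abbreviated stand-ins for the full few-shot templates shown above.
KNOWLEDGE_PROMPT = "...\n\nInput: {statement}\nKnowledge:"
ANSWER_PROMPT = (
    "Question: {statement} Yes or No?\n\n"
    "Knowledge: {knowledge}\n\n"
    "Explain and Answer:"
)

def answer_with_generated_knowledge(statement, n_knowledge=2):
    """Step 1: sample knowledge statements. Step 2: answer once per knowledge."""
    knowledge = openai.Completion.create(
        model="text-davinci-003",
        prompt=KNOWLEDGE_PROMPT.format(statement=statement),
        temperature=0.7,
        max_tokens=128,
        n=n_knowledge,  # several knowledge samples, as in the paper
    )
    answers = []
    for choice in knowledge.choices:
        answer = openai.Completion.create(
            model="text-davinci-003",
            prompt=ANSWER_PROMPT.format(
                statement=statement, knowledge=choice.text.strip()
            ),
            temperature=0,
            max_tokens=128,
        )
        answers.append(answer.choices[0].text.strip())
    return answers
```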
I reformatted the question into QA format to guide the answer format.

*Prompt:*
```
Question: Part of golf is trying to get a higher point total than others. Yes or No?

Knowledge: The objective of golf is to play a set of holes in the least number of strokes. A round of golf typically consists of 18 holes. Each hole is played once in the round on a standard golf course. Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.

Explain and Answer:
```

*Answer 1:*
```
No, the objective of golf is not to get a higher point total than others. Rather, the objective is to play a set of holes in the least number of strokes. The total number of strokes is used to determine the winner of the game, not the total number of points.
```

*Prompt:*
```
Question: Part of golf is trying to get a higher point total than others. Yes or No?

Knowledge: Golf is a precision club-and-ball sport in which competing players (or golfers) use many types of clubs to hit balls into a series of holes on a course using the fewest number of strokes. The goal is to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.

Explain and Answer:
```

*Answer 2 (confidence is a lot lower):*
```
Yes, part of golf is trying to get a higher point total than others. Each player tries to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.
```

Some really interesting things happened with this example. In the first answer, the model was very confident, but in the second not so much. I simplified the process for demonstration purposes, but there are a few more details to consider when arriving at the final answer. Check out the paper for more.

---

## Automatic Prompt Engineer (APE)

![](../img/APE.png)

[Zhou et al., (2022)](https://arxiv.org/abs/2211.01910) propose automatic prompt engineer (APE), a framework for automatic instruction generation and selection. The instruction generation problem is framed as natural language synthesis and addressed as a black-box optimization problem, using LLMs to generate and search over candidate solutions.

The first step involves a large language model (as an inference model) that is given output demonstrations to generate instruction candidates for a task. These candidate solutions guide the search procedure. The instructions are executed using a target model, and then the most appropriate instruction is selected based on computed evaluation scores.

APE discovers a better zero-shot CoT prompt than the human-engineered "Let's think step by step" prompt (Kojima et al., 2022). The prompt "Let's work this out in a step by step way to be sure we have the right answer." elicits chain-of-thought reasoning and improves performance on the MultiArith and GSM8K benchmarks:

![](../img/ape-zero-shot-cot.png)

This paper touches on an important topic related to prompt engineering, which is the idea of automatically optimizing prompts. While we don't go deep into this topic in this guide, here are a few key papers if you are interested:

- [AutoPrompt](https://arxiv.org/abs/2010.15980) - proposes an approach to automatically create prompts for a diverse set of tasks based on gradient-guided search.
- [Prefix Tuning](https://arxiv.org/abs/2101.00190) - a lightweight alternative to fine-tuning that prepends a trainable continuous prefix for NLG tasks.
- [Prompt Tuning](https://arxiv.org/abs/2104.08691) - proposes a mechanism for learning soft prompts through backpropagation.

---

# CH4 Prompt Applications

In this guide, we will cover some advanced and interesting ways we can use prompt engineering to perform useful and more advanced tasks.

**Note that this section is under heavy development.**

Topics:
- [Generating Data](#generating-data)
- [Program-Aided Language Models](#pal-program-aided-language-models)
- [Python Notebooks](#python-notebooks)

---

## Generating Data

LLMs have strong capabilities to generate text. Using effective prompt strategies can steer the model to produce better, more consistent, and more factual responses. LLMs can be especially useful for generating data, which is handy for running all sorts of experiments. For example, we can use them to generate quick samples for a sentiment classifier like so:

*Prompt:*
```
Produce 10 exemplars for sentiment analysis. Examples are categorized as either positive or negative. Produce 2 negative examples and 8 positive examples. Use this format for the examples:
Q: <sentence>
A: <sentiment>
```

*Output:*
```
Q: I just got the best news ever!
A: Positive

Q: We just got a raise at work!
A: Positive

Q: I'm so proud of what I accomplished today.
A: Positive

Q: I'm having the best day ever!
A: Positive

Q: I'm really looking forward to the weekend.
A: Positive

Q: I just got the best present ever!
A: Positive

Q: I'm so happy right now.
A: Positive

Q: I'm so blessed to have such an amazing family.
A: Positive

Q: The weather outside is so gloomy.
A: Negative

Q: I just got some terrible news.
A: Negative
```

This is very useful. We use this example for a different test in another section of the guides.

Here is another prompt example showing how to leverage the LLM for data labeling/augmentation tasks.

*Prompt:*
```
produce 3 wine reviews and label taste, flavor, aroma related token; present the result as a json file, in addition add the coordinate of each term for NER task
```

*Output:*
```
[
  {
    "review": "This Chardonnay has a rich, creamy texture with notes of vanilla and buttery oak. The taste is smooth and well-balanced, with a hint of tropical fruit on the finish.",
    "tokens": [
      {"text": "rich", "label": "flavor", "start": 5, "end": 9},
      {"text": "creamy", "label": "texture", "start": 10, "end": 16},
      {"text": "vanilla", "label": "flavor", "start": 28, "end": 35},
      {"text": "buttery", "label": "flavor", "start": 40, "end": 47},
      {"text": "oak", "label": "aroma", "start": 48, "end": 51},
      {"text": "smooth", "label": "taste", "start": 61, "end": 67},
      {"text": "well-balanced", "label": "taste", "start": 68, "end": 81},
      {"text": "hint", "label": "flavor", "start": 89, "end": 93},
      {"text": "tropical fruit", "label": "flavor", "start": 98, "end": 112},
      {"text": "finish", "label": "taste", "start": 116, "end": 122}
    ]
  },
  {
    "review": "This Cabernet Sauvignon has a bold, full-bodied flavor with notes of black cherry and dark chocolate. The aroma is rich and complex, with hints of tobacco and leather.",
    "tokens": [
      {"text": "bold", "label": "flavor", "start": 5, "end": 9},
      {"text": "full-bodied", "label": "texture", "start": 10, "end": 21},
      {"text": "black cherry", "label": "flavor", "start": 30, "end": 42},
      {"text": "dark chocolate", "label": "flavor", "start": 47, "end": 60},
      {"text": "rich", "label": "aroma", "start": 69, "end": 73},
      {"text": "complex", "label": "aroma", "start": 78, "end": 85},
      {"text": "hints", "label": "aroma", "start": 90, "end": 96},
      {"text": "tobacco", "label": "aroma", "start": 101, "end": 108},
      {"text": "leather", "label": "aroma", "start": 113, "end": 120}
    ]
  },
  {
    "review": "This Riesling has a crisp, refreshing taste with notes of green apple and honey. The aroma is floral and fruity, with a hint of citrus.",
    "tokens": [
      {"text": "crisp", "label": "texture", "start": 5, "end": 10},
      {"text": "refreshing", "label": "texture", "start": 12, "end": 22},
      {"text": "green apple", "label": "flavor", "start": 31, "end": 42},
    ]
  }
```

Note that the model's output is cut off mid-list (the outer array is never closed), another reminder to validate generated data before using it.

---

## PAL (Program-Aided Language Models)

[Gao et al., (2022)](https://arxiv.org/abs/2211.10435) presents a method that uses LLMs to read natural language problems and generate programs as the intermediate reasoning steps. Coined program-aided language models (PAL), the approach differs from chain-of-thought prompting in that instead of using free-form text to obtain a solution, it offloads the solution step to a programmatic runtime such as a Python interpreter.

![](../img/pal.png)

Let's look at an example using LangChain and OpenAI GPT-3. We are interested in developing a simple application that's able to interpret the question being asked and provide an answer by leveraging the Python interpreter.

Specifically, we are interested in creating a function that allows the LLM to answer questions that require date understanding. We will provide the LLM with a prompt that includes a few exemplars adopted from [here](https://github.com/reasoning-machines/pal/blob/main/pal/prompt/date_understanding_prompt.py).

These are the imports we need:

```python
import openai
from datetime import datetime
from dateutil.relativedelta import relativedelta
import os
from langchain.llms import OpenAI
from dotenv import load_dotenv
```

Let's first configure a few things:

```python
load_dotenv()

# API configuration
openai.api_key = os.getenv("OPENAI_API_KEY")

# for LangChain
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
```

Setup model instance:

```python
llm = OpenAI(model_name='text-davinci-003', temperature=0)
```

Setup prompt + question:

```python
question = "Today is 27 February 2023. I was born exactly 25 years ago. What is the date I was born in MM/DD/YYYY?"

DATE_UNDERSTANDING_PROMPT = """
# Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?
# If 2015 is coming in 36 hours, then today is 36 hours before.
today = datetime(2015, 1, 1) - relativedelta(hours=36)
# One week from today,
one_week_from_today = today + relativedelta(weeks=1)
# The answer formatted with %m/%d/%Y is
one_week_from_today.strftime('%m/%d/%Y')
# Q: The first day of 2019 is a Tuesday, and today is the first Monday of 2019. What is the date today in MM/DD/YYYY?
# If the first day of 2019 is a Tuesday, and today is the first Monday of 2019, then today is 6 days later.
today = datetime(2019, 1, 1) + relativedelta(days=6)
# The answer formatted with %m/%d/%Y is
today.strftime('%m/%d/%Y')
# Q: The concert was scheduled to be on 06/01/1943, but was delayed by one day to today. What is the date 10 days ago in MM/DD/YYYY?
# If the concert was scheduled to be on 06/01/1943, but was delayed by one day to today, then today is one day later.
today = datetime(1943, 6, 1) + relativedelta(days=1)
# 10 days ago,
ten_days_ago = today - relativedelta(days=10)
# The answer formatted with %m/%d/%Y is
ten_days_ago.strftime('%m/%d/%Y')
# Q: It is 4/19/1969 today. What is the date 24 hours later in MM/DD/YYYY?
# It is 4/19/1969 today.
today = datetime(1969, 4, 19)
# 24 hours later,
later = today + relativedelta(hours=24)
# The answer formatted with %m/%d/%Y is
later.strftime('%m/%d/%Y')
# Q: Jane thought today is 3/11/2002, but today is in fact Mar 12, which is 1 day later. What is the date 24 hours later in MM/DD/YYYY?
# If Jane thought today is 3/11/2002, but today is in fact Mar 12, then today is 3/12/2002.
today = datetime(2002, 3, 12)
# 24 hours later,
later = today + relativedelta(hours=24)
# The answer formatted with %m/%d/%Y is
later.strftime('%m/%d/%Y')
# Q: Jane was born on the last day of February in 2001. Today is her 16-year-old birthday. What is the date yesterday in MM/DD/YYYY?
# If Jane was born on the last day of February in 2001 and today is her 16-year-old birthday, then today is 16 years later.
today = datetime(2001, 2, 28) + relativedelta(years=16)
# Yesterday,
yesterday = today - relativedelta(days=1)
# The answer formatted with %m/%d/%Y is
yesterday.strftime('%m/%d/%Y')
# Q: {question}
""".strip() + '\n'
```

```python
llm_out = llm(DATE_UNDERSTANDING_PROMPT.format(question=question))
print(llm_out)
```

```python
# The model returns Python in the same style as the exemplars; executing it
# defines the variable holding the answer (here, `born`).
exec(llm_out)
print(born)
```

This will output the following: `02/27/1998`

---

## Python Notebooks

|Description|Notebook|
|--|--|
|Learn how to use the Python interpreter in combination with the language model to solve tasks.|[Program-Aided Language Models](../notebooks/pe-pal.ipynb)|

---

# CH5 ChatGPT Prompt Engineering

In this section, we cover the latest prompt engineering techniques for ChatGPT, including tips, applications, limitations, papers, and additional reading materials.

**Note that this section is under heavy development.**

Topics:
- [ChatGPT Introduction](#chatgpt-introduction)
- [Reviewing The Conversation Task](#reviewing-the-conversation-task)
- [Conversations with ChatGPT](#conversations-with-chatgpt)
- [Python Notebooks](#python-notebooks)

---

## ChatGPT Introduction

ChatGPT is a new model [trained by OpenAI](https://openai.com/blog/chatgpt) that can interact conversationally. This model is trained to follow instructions in a prompt to provide appropriate responses in the context of a dialogue. ChatGPT can help with answering questions, suggesting recipes, writing lyrics in a certain style, generating code, and much more.

ChatGPT is trained using Reinforcement Learning from Human Feedback (RLHF). While this model is a lot more capable than previous GPT iterations (and also trained to reduce harmful and untruthful outputs), it still comes with limitations. Let's cover some of the capabilities and limitations with concrete examples.

You can use the research preview of ChatGPT [here](https://chat.openai.com), but for the examples below we will use the `Chat` mode on the OpenAI Playground.

---

## Reviewing The Conversation Task

In one of the [previous guides](https://github.com/dair-ai/Prompt-Engineering-Guide/blob/main/guides/prompts-basic-usage.md#conversation), we covered a bit about conversation capabilities and role prompting.
We covered how to instruct the LLM to have a conversation in a specific style, with a specific intent, behavior, and identity.

Let's review our previous basic example where we created a conversational system that's able to generate more technical and scientific responses to questions.

*Prompt:*
```
The following is a conversation with an AI research assistant. The assistant tone is technical and scientific.

Human: Hello, who are you?
AI: Greeting! I am an AI research assistant. How can I help you today?
Human: Can you tell me about the creation of black holes?
AI:
```

From the example above, you can see two important components:
- the **intent** or explanation of what the chatbot is
- the **identity** which instructs the style or tone the chatbot will use to respond

The simple example above works well with the text completion APIs that use `text-davinci-003`. More recently, OpenAI [announced the ChatGPT APIs](https://openai.com/blog/introducing-chatgpt-and-whisper-apis), which expose a more powerful and cheaper model called `gpt-3.5-turbo` that was specifically built for this type of functionality (chat completions). OpenAI recommends this as their best model even for non-chat use cases. Other benefits of using the ChatGPT APIs are a significant cost reduction (90%) and efficiency.

Big companies like Snap Inc. and Instacart are already integrating conversational features powered by ChatGPT into their products, ranging from personalized recommendations to open-ended shopping goals.

---

## Conversations with ChatGPT

### Multi-turn Conversations

To begin demonstrating the capabilities of ChatGPT, we will use the chatbot assistant example above and discuss the results. Compared to `text-davinci-003`, the `gpt-3.5-turbo` model that powers ChatGPT uses a chat format as input. The model expects a series of messages as input and uses those to generate a response.

*Input:*
```
SYSTEM: You are an AI research assistant. You use a tone that is technical and scientific.
USER: Hello, who are you?
ASSISTANT: Greeting! I am an AI research assistant. How can I help you today?
USER: Can you tell me about the creation of black holes?
```

*Output:*
```
ASSISTANT: Of course! Black holes are created when a massive star runs out of fuel and collapses under its own gravity. This collapse causes the star to become incredibly dense, with a gravitational field so strong that not even light can escape it. This point of no return is known as the event horizon. Once a star crosses the event horizon, it is considered a black hole...
```

Note that in the example above, I have simplified the input and output, but the ChatGPT chat completion API requires messages to be in a specific format. I have added a snapshot below of how this example would look using the `Chat Mode` in the OpenAI Playground:

![](../img/chatgpt-1.png)

The more formal API call for our example would look something like the example below:

```python
import openai

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an AI research assistant. You use a tone that is technical and scientific."},
        {"role": "user", "content": "Hello, who are you?"},
        {"role": "assistant", "content": "Greeting! I am an AI research assistant. How can I help you today?"},
Going forward, developers are expected to interact with ChatGPT via the [Chat Markup Language](https://github.com/openai/openai-python/blob/main/chatml.md) (ChatML for short).

### Single-turn tasks

The chat format enables multi-turn conversations, but it also supports single-turn tasks similar to what we used with `text-davinci-003`. This means we can use ChatGPT to perform tasks similar to those we demonstrated for the original GPT models. For example, let's try to perform the following question-answering task using ChatGPT:

*Input:*
```
USER: Answer the question based on the context below. Keep the answer short and concise. Respond "Unsure about answer" if not sure about the answer.

Context: Teplizumab traces its roots to a New Jersey drug company called Ortho Pharmaceutical. There, scientists generated an early version of the antibody, dubbed OKT3. Originally sourced from mice, the molecule was able to bind to the surface of T cells and limit their cell-killing potential. In 1986, it was approved to help prevent organ rejection after kidney transplants, making it the first therapeutic antibody allowed for human use.

Question: What was OKT3 originally sourced from?

Answer:
```

*Output:*
```
ASSISTANT: Mice.
```

Keep in mind that I am adding the `USER` and `ASSISTANT` labels to better demonstrate how the task can be performed using ChatGPT. Here is the example using the Playground:

![](../img/chatgpt-classic.png)

More formally, this is the API call (I've only included the message component of the request):

```python
CONTENT = """Answer the question based on the context below. Keep the answer short and concise. Respond \"Unsure about answer\" if not sure about the answer.

Context: Teplizumab traces its roots to a New Jersey drug company called Ortho Pharmaceutical. There, scientists generated an early version of the antibody, dubbed OKT3. Originally sourced from mice, the molecule was able to bind to the surface of T cells and limit their cell-killing potential. In 1986, it was approved to help prevent organ rejection after kidney transplants, making it the first therapeutic antibody allowed for human use.

Question: What was OKT3 originally sourced from?

Answer:
"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": CONTENT},
    ],
    temperature=0,
)
```

### Instructing Chat Models

According to the official OpenAI docs, snapshots of the `gpt-3.5-turbo` model will also be made available. For example, we can access the March 1 snapshot as `gpt-3.5-turbo-0301`. This allows developers to opt for specific model versions, which also means that the best practices for instructing models may change from version to version. The current recommendation for `gpt-3.5-turbo-0301` is to add instructions in the `user` message as opposed to the available `system` message.
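To make this concrete, here is a minimal sketch that pins the March 1 snapshot and places the instruction in the `user` message; the sentiment-classification instruction is illustrative, not from the original guide:

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",  # opt into a specific model snapshot
    messages=[
        # Per the recommendation for this snapshot, the instruction goes
        # in the user message rather than the system message.
        {
            "role": "user",
            "content": "Classify the text into neutral, negative or positive.\n\nText: I think the food was okay.\nSentiment:",
        },
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```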
---

## Python Notebooks

|Description|Notebook|
|--|--|
|Learn more about how to make calls to the ChatGPT APIs using the `openai` library.|[ChatGPT API Intro](../notebooks/pe-chatgpt-intro.ipynb)|
|Learn how to use ChatGPT features using the `LangChain` library.|[ChatGPT API with LangChain](../notebooks/pe-chatgpt-langchain.ipynb)|

---

## References

- [Seeing ChatGPT Through Students' Eyes: An Analysis of TikTok Data](https://arxiv.org/abs/2303.05349) (Mar 2023)
- [Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering -- Example of ChatGPT](https://arxiv.org/abs/2303.05352) (Mar 2023)
- [ChatGPT is on the horizon: Could a large language model be all we need for Intelligent Transportation?](https://arxiv.org/abs/2303.05382) (Mar 2023)
- [Making a Computational Attorney](https://arxiv.org/abs/2303.05383) (Mar 2023)
- [Does Synthetic Data Generation of LLMs Help Clinical Text Mining?](https://arxiv.org/abs/2303.04360) (Mar 2023)
- [MenuCraft: Interactive Menu System Design with Large Language Models](https://arxiv.org/abs/2303.04496) (Mar 2023)
- [A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT](https://arxiv.org/abs/2303.04226) (Mar 2023)
- [Exploring the Feasibility of ChatGPT for Event Extraction](https://arxiv.org/abs/2303.03836) (Mar 2023)
- [ChatGPT: Beginning of an End of Manual Annotation? Use Case of Automatic Genre Identification](https://arxiv.org/abs/2303.03953) (Mar 2023)
- [Is ChatGPT a Good NLG Evaluator? A Preliminary Study](https://arxiv.org/abs/2303.04048) (Mar 2023)
- [Will Affective Computing Emerge from Foundation Models and General AI? A First Evaluation on ChatGPT](https://arxiv.org/abs/2303.03186) (Mar 2023)
- [UZH_CLyp at SemEval-2023 Task 9: Head-First Fine-Tuning and ChatGPT Data Generation for Cross-Lingual Learning in Tweet Intimacy Prediction](https://arxiv.org/abs/2303.01194) (Mar 2023)
- [How to format inputs to ChatGPT models](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb) (Mar 2023)
- [Can ChatGPT Assess Human Personalities? A General Evaluation Framework](https://arxiv.org/abs/2303.01248) (Mar 2023)
- [Cross-Lingual Summarization via ChatGPT](https://arxiv.org/abs/2302.14229) (Feb 2023)
- [ChatAug: Leveraging ChatGPT for Text Data Augmentation](https://arxiv.org/abs/2302.13007) (Feb 2023)
- [Dr ChatGPT, tell me what I want to hear: How prompt knowledge impacts health answer correctness](https://arxiv.org/abs/2302.13793) (Feb 2023)
- [An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)](https://arxiv.org/abs/2302.13814) (Feb 2023)
- [ChatGPT: A Meta-Analysis after 2.5 Months](https://arxiv.org/abs/2302.13795) (Feb 2023)
- [Let's have a chat! A Conversation with ChatGPT: Technology, Applications, and Limitations](https://arxiv.org/abs/2302.13817) (Feb 2023)
- [Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback](https://arxiv.org/abs/2302.12813) (Feb 2023)
- [On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective](https://arxiv.org/abs/2302.12095) (Feb 2023)
- [How Generative AI models such as ChatGPT can be (Mis)Used in SPC Practice, Education, and Research? An Exploratory Study](https://arxiv.org/abs/2302.10916) (Feb 2023)
- [Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT](https://arxiv.org/abs/2302.10198) (Feb 2023)
- [A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT](https://arxiv.org/abs/2302.11382) (Feb 2023)
- [Zero-Shot Information Extraction via Chatting with ChatGPT](https://arxiv.org/abs/2302.10205) (Feb 2023)
- [ChatGPT: Jack of all trades, master of none](https://arxiv.org/abs/2302.10724) (Feb 2023)
- [A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning](https://arxiv.org/abs/2302.09068) (Feb 2023)
- [Netizens, Academicians, and Information Professionals' Opinions About AI With Special Reference To ChatGPT](https://arxiv.org/abs/2302.07136) (Feb 2023)
- [Linguistic ambiguity analysis in ChatGPT](https://arxiv.org/abs/2302.06426) (Feb 2023)
- [ChatGPT versus Traditional Question Answering for Knowledge Graphs: Current Status and Future Directions Towards Knowledge Graph Chatbots](https://arxiv.org/abs/2302.06466) (Feb 2023)
- [What ChatGPT and generative AI mean for science](https://www.nature.com/articles/d41586-023-00340-6) (Feb 2023)
- [Applying BERT and ChatGPT for Sentiment Analysis of Lyme Disease in Scientific Literature](https://arxiv.org/abs/2302.06474) (Feb 2023)
- [Exploring AI Ethics of ChatGPT: A Diagnostic Analysis](https://arxiv.org/abs/2301.12867) (Jan 2023)
- [ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education](https://www.edu.sot.tum.de/fileadmin/w00bed/hctl/_my_direct_uploads/ChatGPT_for_Good_.pdf) (Jan 2023)
- [The political ideology of conversational AI: Converging evidence on ChatGPT's pro-environmental, left-libertarian orientation](https://arxiv.org/abs/2301.01768) (Jan 2023)
- [Techniques to improve reliability - OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md)
- [Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts)
- [Introducing ChatGPT](https://openai.com/blog/chatgpt) (Nov 2022)

---