Source: https://www.promptingguide.ai/

Zero-shot Prompting

Today's LLMs have been trained on vast amounts of data, so many tasks can be completed with a single prompt and no examples.

Prompt:

Classify the text into neutral, negative or positive. 
Text: I think the vacation is okay.
Sentiment:

Result

Neutral

Few-shot Prompting

  • Guide the model to the desired output by providing a few example Q&A pairs in a fixed format.

  • For harder tasks, try adding more examples to achieve better results.

Prompt:

A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses
the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses
the word farduddle is:

Result

When we won the game, we all started to farduddle in celebration.
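Assembling a few-shot prompt is mechanical: demos in a fixed format, then the new query. A minimal sketch (the demo format and newline separator here are illustrative assumptions, not a fixed standard):

```python
def few_shot_prompt(demos, query):
    """Build a few-shot prompt: each demo is an (input, completion) pair
    rendered in a fixed format, followed by the new query."""
    parts = [f"{inp}\n{out}" for inp, out in demos]
    parts.append(query)
    return "\n".join(parts)

demos = [
    ('A "whatpu" is a small, furry animal native to Tanzania. '
     "An example of a sentence that uses the word whatpu is:",
     "We were traveling in Africa and we saw these very cute whatpus."),
]
query = ('To do a "farduddle" means to jump up and down really fast. '
         "An example of a sentence that uses the word farduddle is:")
print(few_shot_prompt(demos, query))
```

The model is expected to continue the established pattern and complete the last line.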
  • Limitation of few-shot:

Tasks that are overly complex or require multi-step reasoning may not be solvable with few-shot prompting alone -> addressed by Chain-of-Thought.

Prompt:

The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The answer is False.
The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.
A: The answer is True.
The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.
A: The answer is True.
The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.
A: The answer is False.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A: 

Result: wrong

The answer is True.

Chain-of-Thought Prompting

  • CoT combines the prompt with intermediate reasoning steps, enabling the model to achieve better results on more complex tasks.


Prompt:

The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A:

Result

Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.

Zero-shot COT Prompting

Zero-shot prompting combined with a step-by-step cue ("Let's think step by step") to guide the model's reasoning process.
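The technique is just a template: append the trigger phrase to the plain question. A minimal sketch:

```python
def zero_shot_cot(question, trigger="Let's think step by step."):
    """Append the reasoning trigger phrase to a plain question,
    eliciting an explicit step-by-step chain from the model."""
    return f"{question}\n{trigger}"

print(zero_shot_cot("I went to the market and bought 10 apples. "
                    "I gave 2 to the neighbor. How many apples remain?"))
```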


Prompt:

I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
Let's think step by step.

Result

First, you started with 10 apples.
You gave away 2 apples to the neighbor and 2 to the repairman, so you had 6 apples left.
Then you bought 5 more apples, so now you had 11 apples.
Finally, you ate 1 apple, so you would remain with 10 apples.

Automatic Chain-of-Thought

  • This research proposes an automated version of the "Let's think step by step" approach: it uses language models to generate the CoT demonstrations themselves, eliminating the manual effort of designing examples, and mitigates errors that can appear in the generated chains by increasing the diversity of the demonstrations.

Auto-CoT consists of two stages:

  1. Question clustering: partition the questions of the dataset into clusters.

  2. Demonstration sampling: select a representative question from each cluster and use Zero-Shot-CoT to generate its reasoning chain as a demonstration.
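A toy sketch of the two stages. The real Auto-CoT clusters Sentence-BERT embeddings with k-means and then runs Zero-Shot-CoT on each representative; here clustering is by keyword and the generation step is reduced to attaching the trigger phrase, both simplifying assumptions for illustration:

```python
def cluster_questions(questions, keywords):
    """Stage 1 sketch: partition questions into clusters by topic keyword
    (a stand-in for k-means over sentence embeddings)."""
    clusters = {k: [] for k in keywords}
    for q in questions:
        for k in keywords:
            if k in q.lower():
                clusters[k].append(q)
                break
    return clusters

def sample_demonstrations(clusters):
    """Stage 2 sketch: pick the shortest question in each cluster as its
    representative (Auto-CoT prefers simple questions); a real system would
    then run Zero-Shot-CoT on it to generate the reasoning chain."""
    return [min(qs, key=len) + "\nA: Let's think step by step."
            for qs in clusters.values() if qs]

qs = ["How many apples are left after eating 3 of 10?",
      "What is the sum of the odd numbers 1, 3, 5?",
      "If a train travels 60 km in 2 hours, what is its speed?"]
clusters = cluster_questions(qs, ["apples", "odd", "train"])
print(sample_demonstrations(clusters))
```

Diversity comes from drawing one demonstration per cluster rather than several similar ones.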


Self-Consistency

  • Self-consistency aims "to replace the naive greedy decoding used in chain-of-thought prompting".

  • The idea is to sample multiple diverse reasoning paths via few-shot CoT and use the generated results to select the most consistent answer. This improves the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.
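The selection step is a majority vote over the final answers extracted from the sampled paths. A minimal sketch (the sampled answers below are hypothetical):

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Marginalize over sampled reasoning paths: return the most frequent
    final answer instead of trusting a single greedy decode."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical final answers extracted from five sampled CoT completions:
print(self_consistency(["67", "67", "35", "67", "26"]))  # → 67
```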

Prompt:

When I was 6 my sister was half my age. Now
I’m 70 how old is my sister?

Result: wrong

35

Prompt: Fixed

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done,
there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted.
So, they must have planted 21 - 15 = 6 trees. The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74
chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops
did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of
lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does
he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so
in total he has 7 + 2 = 9 toys. The answer is 9.
Q: There were nine computers in the server room. Five more computers were installed each day, from
monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 =
20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers.
The answer is 29.
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many
golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On
Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent 5 * 3 = $15. She has $23 - $15 = $8 left. The answer is 8.
Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?
A:

Result: correct

When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67.

Generated Knowledge Prompting

  • The new idea proposed here: have the model generate relevant knowledge before predicting, then make the full prediction with that knowledge available.

  • Note: in practice, additional details still need to be considered, such as how the generated knowledge is selected and integrated.


  1. First, generate the knowledge.

Prompt: Create Knowledge

Input: Greece is larger than mexico.
Knowledge: Greece is approximately 131,957 sq km, while Mexico is approximately 1,964,375 sq km, making Mexico 1,389% larger than Greece.
Input: Glasses always fog up.
Knowledge: Condensation occurs on eyeglass lenses when water vapor from your sweat, breath, and ambient humidity lands on a cold surface, cools, and then changes into tiny drops of liquid, forming a film that you see as fog. Your lenses will be relatively cool compared to your breath, especially when the outside air is cold.
Input: A fish is capable of thinking.
Knowledge: Fish are more intelligent than they appear. In many areas, such as memory, their cognitive powers match or exceed those of ’higher’ vertebrates including non-human primates. Fish’s long-term memories help them keep track of complex social relationships.
Input: A common effect of smoking lots of cigarettes in one’s lifetime is a higher than normal chance of getting lung cancer.
Knowledge: Those who consistently averaged less than one cigarette per day over their lifetime had nine times the risk of dying from lung cancer than never smokers. Among people who smoked between one and 10 cigarettes per day, the risk of dying from lung cancer was nearly 12 times higher than that of never smokers.
Input: A rock is the same size as a pebble.
Knowledge: A pebble is a clast of rock with a particle size of 4 to 64 millimetres based on the Udden-Wentworth scale of sedimentology. Pebbles are generally considered larger than granules (2 to 4 millimetres diameter) and smaller than cobbles (64 to 256 millimetres diameter).
Input: Part of golf is trying to get a higher point total than others.
Knowledge:

Knowledge Result:

The objective of golf is to play a set of holes in the least number of strokes. A round of golf typically consists of 18 holes. Each hole is played once in the round on a standard golf course. Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.
  2. Combine the prompt with the knowledge to make the prediction:

Prompt: predict

Question: Part of golf is trying to get a higher point total than others. Yes or No?
Knowledge: The objective of golf is to play a set of holes in the least number of strokes. A round of golf typically consists of 18 holes. Each hole is played once in the round on a standard golf course. Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.
Explain and Answer:

Result:

No, the objective of golf is not to get a higher point total than others. Rather, the objective is to play a set of holes in the least number of strokes. The total number of strokes is used to determine the winner of the game, not the total number of points.
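The two-stage flow above is just prompt assembly; a minimal sketch (the demo pair and field labels mirror the example above, while the function names are illustrative assumptions):

```python
def knowledge_prompt(statement, demos):
    """Stage 1: few-shot prompt asking the model to generate knowledge
    for a new input statement."""
    lines = []
    for inp, knowledge in demos:
        lines += [f"Input: {inp}", f"Knowledge: {knowledge}"]
    lines += [f"Input: {statement}", "Knowledge:"]
    return "\n".join(lines)

def answer_prompt(question, knowledge):
    """Stage 2: combine the question with the generated knowledge
    before asking for the final answer."""
    return (f"Question: {question}\n"
            f"Knowledge: {knowledge}\n"
            "Explain and Answer:")

demos = [("Greece is larger than mexico.",
          "Greece is approximately 131,957 sq km, while Mexico is "
          "approximately 1,964,375 sq km.")]
p1 = knowledge_prompt("Part of golf is trying to get a higher point total "
                      "than others.", demos)
p2 = answer_prompt("Part of golf is trying to get a higher point total "
                   "than others. Yes or No?",
                   "The objective of golf is to play a set of holes in the "
                   "least number of strokes.")
print(p1)
print(p2)
```

The text the model generates for `p1` is what gets pasted into the `knowledge` argument of `p2`.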

Tree of Thoughts (ToT)

Plain prompting falls short on complex tasks that require exploration or look-ahead, which motivated the Tree of Thoughts (ToT) concept.

  • ToT is a framework that generalizes CoT and encourages exploration over thoughts that serve as intermediate steps for general problem solving with language models.

  • ToT (Tree of Thoughts) maintains a tree of thoughts, where each thought is a coherent language sequence that serves as an intermediate step toward solving the problem. This approach lets a language model deliberately evaluate, through an explicit reasoning process, how much progress its intermediate thoughts make toward a solution. The model's ability to generate and evaluate thoughts, combined with search algorithms such as breadth-first search and depth-first search, enables systematic exploration of thoughts with look-ahead and backtracking.

Task demo: Game of 24

Explanation: using the four given numbers and the basic operations (addition, subtraction, multiplication, division), compute the number 24. Each number must be used exactly once, and all four numbers must be used. The game tests calculation ability, logical thinking, and mathematical skill.

To perform BFS in ToT for the Game of 24 task, the LM is prompted to evaluate each candidate thought as "sure/maybe/impossible" with respect to reaching 24. As the authors state, the aim is to promote correct partial solutions that can be verified within a few look-ahead trials, eliminate impossible partial solutions based on "too big/too small" commonsense, and keep the rest as "maybe". Values are sampled 3 times per thought. The process is illustrated in the figure below.
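The search itself can be sketched without an LM: a state is the multiset of remaining numbers, a child thought combines two of them with one operation, and a crude heuristic stands in for the LM's "sure/maybe/impossible" value rating (the function names and the pruning rule here are illustrative assumptions):

```python
from itertools import permutations

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b if b else None}

def expand(nums, steps):
    """Child thoughts: replace two numbers with the result of one operation."""
    for i, j in permutations(range(len(nums)), 2):
        for sym, fn in OPS.items():
            v = fn(nums[i], nums[j])
            if v is None:
                continue
            rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
            yield rest + [v], steps + [f"{nums[i]} {sym} {nums[j]} = {v}"]

def solve24(nums):
    """BFS over the tree of thoughts; a crude stand-in for the LM's value
    rating marks a state 'impossible' when every remaining number already
    exceeds 24, and keeps the rest as 'maybe'."""
    frontier = [(list(nums), [])]
    while frontier:
        nxt = []
        for state_nums, steps in frontier:
            for child_nums, child_steps in expand(state_nums, steps):
                if len(child_nums) == 1:
                    if abs(child_nums[0] - 24) < 1e-6:
                        return child_steps
                elif min(child_nums) <= 24:  # keep 'maybe' states only
                    nxt.append((child_nums, child_steps))
        frontier = nxt
    return None

print(solve24([4, 9, 10, 13]))
```

In the real framework the LM both proposes the children and rates them; the search skeleton (BFS plus pruning) is the same.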

ToT outperforms the other prompting methods.

  • Hulbert (2023) has proposed Tree-of-Thought Prompting, which applies the main concept from ToT frameworks as a simple prompting technique, getting the LLM to evaluate intermediate thoughts in a single prompt. A sample ToT prompt is:

Imagine three different experts are answering this question.
All experts will write down 1 step of their thinking,
then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realises they're wrong at any point then they leave.
The question is...

Retrieval Augmented Generation (RAG)

  • General-purpose language models can be fine-tuned to perform common tasks such as sentiment analysis and named-entity recognition, which typically require no additional background knowledge.

  • For more complex, knowledge-intensive tasks, a system based on a language model can be built that accesses external knowledge sources to complete the task. This enables greater factual consistency, improves the reliability of generated responses, and helps mitigate the "hallucination" problem.

  • Researchers proposed Retrieval Augmented Generation (RAG) to address such knowledge-intensive tasks. RAG combines an information-retrieval component with a text-generation model. RAG can be fine-tuned, and its internal knowledge can be modified efficiently without retraining the entire model.

  • RAG takes an input and retrieves a set of relevant/supporting documents from a source (e.g., Wikipedia). The documents are concatenated with the original input prompt as context and fed to the text generator, which produces the final output. This makes RAG adaptive to situations where facts evolve over time, which is useful because the parametric knowledge of a language model is static. RAG lets language models skip retraining and access the latest information via retrieval when generating reliable outputs.

  • RAG performs strongly on several benchmarks; its generated responses are more factual, specific, and diverse.

  • In recent years, such retriever-based methods have become increasingly popular and are combined with popular language models such as ChatGPT to improve capability and factual consistency.
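The retrieve-then-concatenate pipeline can be sketched in a few lines; word-overlap ranking here is a deliberately simple stand-in for a real sparse or dense retriever, and the prompt template is an illustrative assumption:

```python
def retrieve(query, docs, k=2):
    """Rank documents by word overlap with the query (a stand-in for a
    real sparse/dense retriever) and return the top-k."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def rag_prompt(query, docs):
    """Concatenate the retrieved documents with the question as context."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = ["The Eiffel Tower is in Paris and was completed in 1889.",
        "Photosynthesis converts light energy into chemical energy.",
        "Paris is the capital of France."]
print(rag_prompt("When was the Eiffel Tower in Paris completed?", docs))
```

Updating the system's knowledge then means updating `docs` (the index), not the model weights.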

Automatic Reasoning and Tool-use (ART)

  • Automatic Reasoning and Tool-use (ART) is an approach that interleaves CoT-style reasoning with tool use, and it has proven a strong and robust way to handle many tasks with language models.

  • ART uses a frozen LLM to automatically generate intermediate reasoning steps as a program, without hand-crafting task-specific demonstrations and without carefully scripting the interleaving of model generations and tool use.

ART works as follows:

  • Given a new task, select demonstrations of multi-step reasoning and tool use from a task library.

  • At test time, whenever an external tool is called, ART pauses generation and integrates the tool's output before resuming.

  • ART encourages the model to generalize from the demonstrations, decomposing a new task zero-shot and using tools in the appropriate places. ART is also extensible: people can correct errors in the reasoning steps or add new tools simply by updating the task and tool libraries.

On the BigBench and MMLU benchmarks, ART significantly outperforms few-shot prompting and automatic CoT on unseen tasks, and it exceeds the performance of hand-crafted CoT prompts when human feedback is incorporated.

Automatic Prompt Engineer (APE)

  • Automatic Prompt Engineer (APE) is a framework for automatic instruction generation and selection. The instruction-generation problem is framed as a black-box optimization problem over natural-language program synthesis, using LLMs to generate and search over candidate solutions.

  • The first step involves a large language model (as an inference model) that is given output demonstrations and generates instruction candidates for the task. These candidates guide the search procedure. The instructions are executed using a target model, and the most appropriate instruction is selected based on computed evaluation scores.

  • APE discovered a zero-shot CoT prompt that outperforms the human-engineered "Let's think step by step" prompt (Kojima et al., 2022).

Result: better than "Let's think step by step".
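The generate-score-select loop can be sketched as follows; `toy_model`, the candidate list, and the one-item eval set are illustrative stand-ins, not APE's actual inference or target models:

```python
def select_instruction(candidates, eval_set, model):
    """Score each candidate instruction by accuracy on a small eval set;
    `model` stands in for the target LLM: (instruction, question) -> answer."""
    def accuracy(instr):
        return sum(model(instr, q) == a for q, a in eval_set) / len(eval_set)
    return max(candidates, key=accuracy)

# Toy stand-in model that only answers correctly when the instruction
# asks for step-by-step reasoning:
def toy_model(instr, question):
    return "8" if "step" in instr else "0"

candidates = ["Answer immediately.",
              "Let's work this out in a step by step way."]
eval_set = [("23 - 15 = ?", "8")]
print(select_instruction(candidates, eval_set, toy_model))
```

In the full framework the candidate list itself is generated by an LLM from demonstrations, and scoring uses log-likelihood or execution accuracy on held-out examples.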

Active-Prompt

CoT relies on a fixed set of human-annotated reasoning exemplars. The problem is that these exemplars may not be the most effective ones for different tasks -> Active-Prompt.

  • The first step is to query the LLM, either with a few CoT exemplars or with none. For a set of training questions, k candidate answers are generated. An uncertainty metric is computed over the k answers (using disagreement). The most uncertain questions are selected for human annotation, and the newly annotated exemplars are then used for reasoning over the remaining questions.
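The selection step can be sketched directly; disagreement as the uncertainty metric follows the description above, while the sampled answers and function names are illustrative assumptions:

```python
def disagreement(answers):
    """Uncertainty metric: fraction of distinct answers among k samples."""
    return len(set(answers)) / len(answers)

def select_for_annotation(question_to_answers, n=1):
    """Pick the n most uncertain questions for human CoT annotation."""
    ranked = sorted(question_to_answers.items(),
                    key=lambda kv: disagreement(kv[1]), reverse=True)
    return [q for q, _ in ranked[:n]]

# Hypothetical k=4 sampled answers per training question:
samples = {
    "Q1": ["12", "12", "12", "12"],   # consistent: low uncertainty
    "Q2": ["7", "9", "7", "11"],      # disagreement: high uncertainty
}
print(select_for_annotation(samples))  # → ['Q2']
```

Annotation effort is thus spent only where the model is least reliable.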

Directional Stimulus Prompting

Directional Stimulus Prompting better guides the LLM toward generating the desired summary.

  • A tunable policy language model is trained to generate the stimulus/hint, and reinforcement learning is used to optimize it.

The figure below compares Directional Stimulus Prompting with standard prompting. The policy language model can be small, and it is optimized to generate the hints that guide a black-box frozen LLM.

ReAct Prompting

The ReAct framework uses LLMs to generate both reasoning traces and task-specific actions in an interleaved manner.

  • Generating reasoning traces lets the model induce, track, and update action plans, and even handle exceptions. The action steps can interact with external resources (such as knowledge bases or environments) to gather information.

  • The ReAct framework allows LLMs to interact with external tools and retrieve additional information, leading to more reliable and factual responses. Results show that ReAct outperforms several leading baselines on language and decision-making tasks, and it also improves the human interpretability and trustworthiness of LLMs. Overall, the authors found that the best approach combines ReAct with chain-of-thought (CoT), which allows both internal knowledge and external information to be used during reasoning.

  • ReAct is inspired by the synergy between "acting" and "reasoning" that enables humans to learn new tasks and make decisions or perform reasoning.

ReAct Prompting

  • The first step is to select cases from a training set (e.g., HotPotQA) and compose ReAct-format trajectories. These serve as few-shot exemplars in the prompt. Each trajectory consists of multiple thought-action-observation steps, as shown in the figure above. Free-form thoughts are used for different purposes, such as decomposing questions, extracting information, performing commonsense/arithmetic reasoning, guiding search-query formulation, and synthesizing the final answer.

  • ReAct uses different prompting setups for different task types. For tasks where reasoning is the primary focus (e.g., HotpotQA), multiple thought-action-observation steps make up the task-solving trajectory. For decision-making tasks that involve many action steps, thoughts are used sparingly.
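The thought-action-observation loop can be sketched as a simple controller; the toy knowledge base, the scripted model stand-in, and the `Search`/`Finish` action names below are illustrative assumptions in the style of ReAct traces:

```python
import re

def react_loop(next_step, tools, max_steps=5):
    """Interleave model steps with tool observations. `next_step` stands in
    for the LLM: given the transcript so far, it returns the next
    Thought/Action pair; `tools` maps action names to callables."""
    transcript = []
    for _ in range(max_steps):
        step = next_step("\n".join(transcript))
        transcript.append(step)
        m = re.search(r"Action: (\w+)\[(.+?)\]", step)
        if not m or m.group(1) not in tools:
            break  # e.g. the model emitted Finish[...]
        transcript.append(f"Observation: {tools[m.group(1)](m.group(2))}")
    return "\n".join(transcript)

# Toy knowledge base and a scripted stand-in for the model:
kb = {"Apple Remote": "The Apple Remote is a remote control introduced in 2005."}
tools = {"Search": lambda entity: kb.get(entity, "No result.")}
script = iter([
    "Thought: I should search for the Apple Remote.\nAction: Search[Apple Remote]",
    "Thought: I have enough information.\nAction: Finish[2005]",
])
out = react_loop(lambda transcript: next(script), tools)
print(out)
```

The key design point is that generation pauses at every action so the environment's observation is appended to the context before the next thought.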

Example: prompting results on HotPotQA and Fever with different prompting methods show that ReAct generally outperforms Act (which involves actions only) on both tasks.

  • CoT prompting has demonstrated LLMs' ability to carry out reasoning traces to answer questions involving arithmetic and commonsense reasoning (Wei et al., 2022). However, it has no access to the external world and cannot update its knowledge, which can lead to problems such as fact hallucination and error propagation.

  • ReAct is a general paradigm that combines reasoning and acting. ReAct prompts the LLM to generate verbal reasoning traces and actions for a task. This enables the system to perform dynamic reasoning, creating, maintaining, and adjusting action plans, while also interacting with external environments (e.g., Wikipedia) to incorporate additional information into the reasoning. The figure below shows an example of ReAct and the different steps involved in question answering.


  • Issues:

    1. CoT is prone to fact hallucination.

    2. ReAct's structural constraints reduce its flexibility in formulating reasoning steps.

    3. ReAct depends heavily on the information it retrieves; uninformative search results derail the model's reasoning and make it difficult to recover and reformulate thoughts.

  • A prompting method that switches between ReAct and CoT+Self-Consistency generally outperforms all the other prompting methods.


Multimodal CoT Prompting

Traditional CoT focuses on the language modality. Multimodal CoT, by contrast, incorporates text and vision into a two-stage framework. The first stage is rationale generation based on multimodal information; the second stage, answer inference, leverages the generated rationales.

GraphPrompt: coming soon

Liu et al. (2023) introduce GraphPrompt, a new prompting framework for graphs to improve performance on downstream tasks.