# LLM
for j in [1, \dots ,|\tau|]:
if j mod
# 03/11
# back to search
After:

Before:

# next
## after:
Task:i want to find a height-adjustable desk chair in the color petal green, and price lower than 140.00 dollars
Action:Think[{"response": "I should first use Search to search for a height-adjustable desk chair in petal green."}]
Observation: OK
Action:search[{"product": "height-adjustable desk chair petal green"}]
Observation: WEB PAGE: {
[Back to Search]
Page 1 (Total results: 15)
[Next >]
[B09JSGCW97]
INZOY 360 Degree Swivel Vanity Chair for Makeup Room, Height Adjustable Desk Chair, Velvet Shell Petal Stool Chair with Backrest Cushion, Rolling Chairs for Living Room Bedroom, Pink
$149.99
[B09JJP5BR2]
Henf Cute Desk Chair, Modern Velvet Upholstered Office Chair, 360° Swivel Leisure Computer Chair Leisure Task Chair Flower Vanity Chairs for Home Office (Petal Blue & Gold)
$135.99
[B08DLL59WC]
Goujxcy Desk Chair,Modern Velvet Fabric Office Chair,360° Swivel Height Adjustable Comfy Upholstered Leisure Arm Accent Chair (Pink)
$109.99 }
Action:Think[{"response": "None of the products listed match the color requirement. I should click Next to see more options."}]
Observation: OK
Action:click[{"button": "Next >"}]
## before:
Task:i want to find a height-adjustable desk chair in the color petal green, and price lower than 140.00 dollars
Action:Think[{"response": "I should first use Search to search for a height-adjustable desk chair in petal green."}]
Observation: OK
Action:search[{"product": "height-adjustable desk chair petal green"}]
Observation: WEB PAGE: {
[Back to Search]
Page 1 (Total results: 15)
[Next >]
[B09JSGCW97]
INZOY 360 Degree Swivel Vanity Chair for Makeup Room, Height Adjustable Desk Chair, Velvet Shell Petal Stool Chair with Backrest Cushion, Rolling Chairs for Living Room Bedroom, Pink
$149.99
[B09JJP5BR2]
Henf Cute Desk Chair, Modern Velvet Upholstered Office Chair, 360° Swivel Leisure Computer Chair Leisure Task Chair Flower Vanity Chairs for Home Office (Petal Blue & Gold)
$135.99
[B08DLL59WC]
Goujxcy Desk Chair,Modern Velvet Fabric Office Chair,360° Swivel Height Adjustable Comfy Upholstered Leisure Arm Accent Chair (Pink)
$109.99 }
Action:Think[{"response": "The second product is most relevant to the requirements and is within the budget. I should click the second product."}]
Observation: OK
Action:click[{"button": "B09JJP5BR2"}]
# list
| | test wr | test rw | train wr | train rw | train ratio |
| ---------------- | ------- | ------- | -------- | -------- | ----------- |
| iter 1 | 36.8 | 0.671 | 41.43 | 0.723 | 0.356 |
| iter 2 b1 | 37.2 | 0.655 | 39.2 | 0.695 | 0.4242 |
| iter 2 b2 | 38.0 | 0.661 | 38.1 | 0.691 | 0.4512 |
| iter 2 b3 | 38.8 | 0.662 | 42.59 | 0.718 | 0.4638 |
| iter 2 b4 | 38.4 | 0.667 | 40.69 | 0.702 | 0.4694 |
| iter 2 b5 | 38.4 | 0.657 | 42.86 | 0.705 | 0.475 |
| iter 2 b6 | 38.8 | 0.677 | 41.2 | 0.698 | 0.4813 |
| iter 2 from gpt4 | 42.8 | 0.691 | 45.82 | 0.73 | 0.4854 |
| iter 2 from gpt3 | 44.8 | 0.711 | 45.89 | 0.735 | 0.5058 |
# 03/04
## train on gsm8k (only have gpt3 trajs)
## train on mistral 7b and llama 3/2 on 1/4 of webshop
## science world
| | train thr | win rate | reward | train ratio |
| ------------ | --------- | -------- | ------ | ----------- |
| sft | 0 | 0.53 | 68.1 | 1.00 |
| sft | 50 | 0.615 | 76.68 | 0.863 |
| sft | 100 | 0.61 | 76.8 | 0.702 |
| SIL iter 2 | 100 | 0.615 | 74.56 | 0.814 |
| tmps iter 2 | 100 | 0.625 | 73.35 | 0.756 |
| tmps9 iter 2 | 100 | 0.65 | 78.58 | 0.789 |
| ETO iter 2 | N/A | 0.575 | 75.0 | 0.262 |
| ours iter 2 | 100 | 0.685 | 81.34 | 0.896 |
- ep6 budget=2
- 12000
# 02/25
- sciworld
- gold: 500
- budget: 5
- max_step_l = 4
| | train thr | win rate | reward | train ratio |
| ----------- | --------- | -------- | ------ | ----------- |
| sft | 0 | 0.53 | 68.1 | 1.00 |
| sft | 50 | 0.615 | 76.68 | 0.863 |
| sft | 100 | 0.61 | 76.8 | 0.702 |
| SIL iter 2 | 100 | 0.615 | 74.56 | 0.814 |
| ETO iter 2 | N/A | 0.575 | 74.54 | 0.262 |
| ours iter 2 | 100 | 0.685 | 81.34 | 0.896 |



# 02/10
iter:

| | RFT | ETO | SIL | FGPT4M | FGPT34M |
| ------ | ---- | ---- | ---- | ------ | ------- |
| iter 0 | 37.2 | 37.2 | 46.4 | 46.4 | 46.4 |
| iter 1 | 36.4 | 38.8 | 52.0 | 50.4 | 56.4 |
| iter 2 | 36.8 | 42.0 | 51.5 | 58.4 | 62.0 |
| | winrate | reward | success next | success back | all next | all back | train ratio |
| ----------------------- | ------- | ------ | ------------ | ------------ | -------- | -------- | ----------- |
| gpt 3 | 23.2 | 0.606 | 0.0 | 2.4 | 3.2 | 16.8 | N/A |
| gpt 4 | 35.6 | 0.662 | 2.8 | 2.8 | 16.8 | 14.0 | N/A |
| iter 1 sft | 37.2 | 0.669 | 2.8 | 1.6 | 16.8 | 9.6 | 35.6 |
| iter 1 NAT | 37.2 | 0.662 | 1.2 | 2.0 | 15.6 | 8.8 | 35.6 |
| iter 3 RFT | 36.4 | 0.673 | 2.4 | 1.2 | 18.4 | 10.0 | 36.33 |
| iter 3 ETO (sft + dpo) | 42.0 | 0.687 | 1.2 | 1.2 | 9.6 | 3.2 | 42.1 |
| iter 3 SIL | 52.0 | 0.75 | 4.4 | 3.6 | 8.0 | 7.2 | 46.48 |
| iter 3 SIL+Best-of-N | 53.6 | 0.762 | 4.0 | 3.2 | 9.2 | 7.6 | 54.05 |
| iter 3 fgpt4m | 58.4 | 0.784 | 6.8 | 5.2 | 13.2 | 12.0 | 53.3 |
| iter 3 fgpt34m | 62.0 | 0.819 | 9.2 | 9.6 | 13.2 | 14.0 | 59.85 |
| | winrate | reward | positive next | positive back | all next | all back | train ratio |
| -------------------------- | ------- | ------ | ------------- | ------------- | -------- | -------- | ----------- |
| iter 2 better | 50.0 | 0.74 | 4.0 | 2.8 | | | |
| iter 2 temps (need iter 3) | 51.6 | 0.762 | 4.0 | 3.2 | | | |
| iter 3 temps 9 | 54.4 | 0.771 | 4.4 | 3.6 | | | 52.27 |
| iter 3 fgpt34m + tmp9 | | | | | | | |
# 02/03
# sciworld
| | reward | wr |
| --------- | ------ | ----- |
| thr = 0 | 75.865 | 0.645 |
| thr = 5 | 78.0 | 0.665 |
| thr = 20 | 62.39 | 0.34 |
| thr = 100 | 28.175 | 0.075 |

# eto training data
- iter 1 vs gpt 4
- win: 20%
- lose: 9%
- additional action count
- all in chosen
- next: 27, back to search: 23
- compared to ours:
- next: 237, back to search: 566
add training samples
| | test wr | test rw | next | back to search |
| --------------------------- | ------- | ------- | ---- | -------------- |
| gpt 3 | 23.2 | 0.606 | 0.0 | 2.4 |
| gpt 4 | 35.6 | 0.662 | 2.8 | 2.8 |
| iter 1 all | 36.8 | 0.662 | 2.8 | 1.2 |
| iter 1 NAT | 37.2 | 0.662 | 1.2 | 2.0 |
| iter 1 ETO (sft + dpo) | | | | |
| iter 1 rft | 46.4 | 0.736 | 4.4 | 2.0 |
| iter 2 rft | 52.0 | 0.75 | 4.4 | 3.6 |
| iter 2 better | 50.0 | 0.74 | 4.0 | 2.8 |
| iter 2 temps (need iter 3) | 51.6 | 0.762 | 4.0 | 3.2 |
| iter 3 temps 9 | 54.4 | 0.771 | 4.4 | 3.6 |
| iter 3 fgpt4m | 56.0 | 0.776 | 5.2 | 4.4 |
| iter 3 fgpt34m | 62.0 | 0.819 | 9.2 | 9.6 |
| iter 3 fgpt34m +tmp9 | | | | |
# 01/28
## sciworld
- switch to eto trajs
- better at
- 12 per sec with one cpu
## ablation study
- different temperatures (see the sampling sketch below)
- all positive
- all better positive
- dpo
- hoping to show that DPO reduces the usage of the harder tools
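If "tmp579" / "tmp9" mean sampling at those temperatures, the data collection behind this ablation could look roughly like the sketch below; the environment call, reward threshold, and record layout are assumptions, not the actual pipeline.

```python
import random

def run_episode(task_id: int, temperature: float) -> tuple[list[str], float]:
    # Toy stand-in for a real rollout of the finetuned policy in the env
    # (assumption: the actual code calls the agent and WebShop/SciWorld here).
    random.seed(hash((task_id, temperature)))
    trajectory = [f"Action:search[task {task_id}]", "Action:click[item]"]
    return trajectory, random.random()

def collect_positive_rollouts(task_ids, temperatures=(0.5, 0.7, 0.9),
                              reward_threshold=0.7):
    """Rejection sampling: keep only rollouts whose reward clears the threshold."""
    kept = []
    for task_id in task_ids:
        for temp in temperatures:
            traj, reward = run_episode(task_id, temp)
            if reward >= reward_threshold:
                kept.append({"task": task_id, "temperature": temp,
                             "trajectory": traj, "reward": reward})
    return kept

if __name__ == "__main__":
    print(f"kept {len(collect_positive_rollouts(range(50)))} rollouts")
```

The "all better positive" variant would presumably also require the kept rollout to beat the trajectory already in the training set for that task.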
| | test wr | test rw | next | back to search |
| -------------- | ------- | ------- | ---- | -------------- |
| gpt 3 | 23.2 | 0.606 | 0.0 | 2.4 |
| gpt 4 | 35.6 | 0.662 | 2.8 | 2.8 |
| iter 1 all | 36.8 | 0.662 | 2.8 | 1.2 |
| iter 1 NAT | | | | |
| iter 1 DPO | | | | |
| iter 1 ETO | | | | |
| iter 1 | 46.4 | 0.736 | 4.4 | 2.0 |
| iter 2 | 52.0 | 0.75 | 4.4 | 3.6 |
| iter 2 better | 50.0 | 0.74 | 4.0 | 2.8 |
| iter 2 temps | 51.6 | 0.762 | 4.0 | 3.2 |
| iter 2 temps 9 | 52.4 | 0.767 | 4.8 | 3.2 |
| iter 3 temps 9 | 54.4 | 0.771 | 4.4 | 3.6 |
| iter 2 f4m | 52.4 | 0.768 | 5.2 | 3.2 |
| iter 3 f4m | 56.0 | 0.776 | 5.2 | 4.4 |
| iter 2 f34m | 56.4 | 0.787 | 5.6 | 4.8 |
| iter 3 f34m | 62.0 | 0.819 | 9.2 | 9.6 |
| | test wr | test rw | next | back to search |
| ------------------ | ------- | ------- | ---- | -------------- |
| gpt 3 | 23.2 | 0.606 | 0.0 | 2.4 |
| gpt 4 | 35.6 | 0.662 | 2.8 | 2.8 |
| iter 1 all | 36.8 | 0.662 | 2.8 | 1.2 |
| iter 1 NAT | | | | |
| iter 1 DPO | | | | |
| iter 1 ETO | | | | |
| iter 1 | 46.4 | 0.736 | 4.4 | 2.0 |
| iter 2 | 52.0 | 0.75 | 4.4 | 3.6 |
| iter 2 better | 50.0 | 0.74 | 4.0 | 2.8 |
| iter 2 temps | 51.6 | 0.762 | 4.0 | 3.2 |
| iter 3 temps | | | | |
| iter 2 f4m | 52.4 | 0.768 | 5.2 | 3.2 |
| iter 3 f4m | 56.0 | 0.776 | 5.2 | 4.4 |
| iter 2 f34m | 56.4 | 0.787 | 5.6 | 4.8 |
| iter 3 f34m better | 62.0 | 0.819 | 9.2 | 9.6 |
| iter 3 f34m old | 59.6 | 0.801 | 6.8 | 6.8 |

- train on only gpt3 turbo trajs
- train on different models
- mistral 7b
- train on smaller training dataset
- find tree search exploration paper
- calculate
# 01/21
# start writing method for ACL paper
# start training on sciworld trajs of agent gym
- the traj of gpt4 is really bad
- 0.3
# also start from gpt3
# run sft
| | test wr | test rw | next | back to search |
| ----------- | ------- | ------- | ---- | -------------- |
| gpt 4 | 35.6 | 0.66 | 2.8 | 2.8 |
| iter 1 | 46.4 | 0.74 | 4.4 | 2.0 |
| iter 2 | 52.0 | 0.75 | 4.4 | 3.6 |
| iter 2 f4m | 52.4 | 0.77 | 5.2 | 3.2 |
| iter 3 f4m | 56.0 | 0.78 | 5.2 | 4.4 |
| iter 2 f34m | 56.4 | 0.79 | 5.6 | 4.8 |
| iter 3 f34m | 59.6 | 0.8 | 6.8 | 6.8 |
## GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
## agent gym

- 
## agent bank


- train on chat-gpt 4
- better at using next
- chat3.5
- next: 0.53
- back to search: 3.73
- next step:
- start from chatgpt3.5
| | valid wr | valid rw | next | back to search | test wr | test rw |
| --------- | -------- | -------- | ---- | -------------- | ------- | ------- |
| gpt 4 | 38.0 | 0.671 | 2.0 | 0.67 | 35.6 | 0.662 |
| iter 1 | 43.6 | 0.705 | 1.73 | 1.07 | 46.4 | 0.7362 |
| iter 2 | 43.07 | 0.702 | 1.47 | 0.8 | 50.8 | 0.7478 |
| iter 2 fm | 45.2 | 0.712 | 1.87 | 1.2 | 53.6 | 0.773 |
| iter 3 fm | 45.07 | 0.714 | 1.6 | 2.4 | 56.0 | 0.785 |
# 12/24
- start from gpt 4 neg trajs (reward < 0.66)
- test success rate 5.34% -> 6.76%
- reward 0.36 -> 0.38
- start from gpt 4 (reward > 0.66)
- test success rate 59.49% -> 60.98%
- reward 0.863 -> 0.871
| | tr wr | tr rw | test wr | test rw | Next | Back to Search | train data rate | train rw | loss |
| --------- | ----- | ----- | ------- | ------- | ---- | -------------- | --------------- | -------- | ------ |
| gpt-4 | 36.0 | 0.667 | 38.0 | 0.671 | 2.0 | 0.67 | | | |
| gpt-3.5 | 22.2 | 0.579 | 24.0 | 0.579 | 0.27 | 1.07 | | | |
| iter 1 | 37.72 | 0.677 | 34.8 | 0.658 | 0.4 | 1.6 | 22.19 | 0.579 | 0.0048 |
| iter 2 | 43.77 | 0.71 | 38.8 | 0.688 | 0.93 | 2.0 | 38.69 | 0.712 | 0.0025 |
| iter 3 | 47.16 | 0.729 | 41.07 | 0.694 | 0.8 | 2.53 | 44.91 | 0.75 | 0.0002 |
| iter 4 | 49.41 | 0.744 | 40.4 | 0.69 | 0.53 | 2.4 | 48.02 | 0.768 | 0.0007 |
| iter 4 fm | 52.02 | 0.754 | 42.93 | 0.707 | 0.4 | 3.33 | 51.14 | 0.787 | 0.0031 |
| iter 4 fm | 52.0 | 0.766 | 41.87 | 0.701 | 0.53 | 3.73 | 51.14 | 0.787 | 0.0031 |
# 12/10
- survey on baselines
- From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning
- 
- An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders
- (Thompson Sampling)
- 
- 
- SAMPLE-EFFICIENT ALIGNMENT FOR LLMS
- create own baselines
- use tmp9 as expert data
- to see if we can also identify good actions
- solving long success traj problems
| | tr wr | tr rw | test wr | test rw | Next | Back to Search | train data rate | train rw | loss |
| --------- | ----- | ----- | ------- | ------- | ---- | -------------- | --------------- | -------- | ------ |
| gpt-4 | 36.0 | 0.667 | 38.0 | 0.671 | 2.0 | 0.67 | | | |
| gpt-3.5 | 22.2 | 0.579 | 24.0 | 0.579 | 0.27 | 1.07 | | | |
| iter 1 | 37.72 | 0.677 | 34.8 | 0.658 | 0.4 | 1.6 | 22.19 | 0.579 | 0.0048 |
| iter 2 | 43.77 | 0.71 | 38.8 | 0.688 | 0.93 | 2.0 | 38.69 | 0.712 | 0.0025 |
| iter 3 | 47.16 | 0.729 | 41.07 | 0.694 | 0.8 | 2.53 | 44.91 | 0.75 | 0.0002 |
| iter 4 | 49.41 | 0.744 | 40.4 | 0.69 | 0.53 | 2.4 | 48.02 | 0.768 | 0.0007 |
| iter 4 fm | 52.0 | 0.766 | 41.87 | 0.701 | 0.53 | 3.73 | 51.14 | 0.787 | 0.0031 |
loss 0.0015
- generating gpt4 training data
# 12/3
- use success traj
- drop a little bit
- use xlam as start
| | tr wr | tr rw | test wr | test rw | Next | Back to Search | train data rate |
| ------------ | ----- | ----- | ------- | ------- | ---- | -------------- | --------------- |
| gpt-4 | 36.0 | 0.667 | 38.0 | 0.671 | 2.0 | 0.67 | |
| gpt-3.5 | 22.2 | 0.579 | 24.0 | 0.579 | 0.27 | 1.07 | |
| iter 1 | 37.52 | 0.677 | 35.47 | 0.666 | 0.0 | 1.73 | 22.3 |
| iter 2 | 44.75 | 0.717 | 39.4 | 0.685 | 0.53 | 2.13 | 38.7 |
| iter 2 w tmp | 45.97 | 0.723 | 38.4 | 0.68 | 0.27 | 2.13 | 44.3 |
| iter 2 w fm | 46.74 | 0.727 | 40.4 | 0.693 | 0.93 | 3.47 | 43.6 |
| iter 3 w fm | 49.75 | 0.742 | 41.73 | 0.7 | 1.07 | 2.53 | 49.5 |
| prev 5steps | 49.3 | 0.74 | 43.2 | 0.705 | 0.27 | 0.13 | |
# 11/19
- implement basic search on product selection
- basic traj (97% on 5 steps model)
- steps
1. think
2. search
- three products
- can press next
3. think
4. click product
5. more click
- no next or back to search
- force the agent to click the other two products at step 4 and let it continue the rollout (see the rollout sketch after this list)
- need back to search: 16% of test data have a reward more than 0.2 better than the original
- need next: 7% of test data show that forced-next trajs have a reward 0.2 better than these three trajs
- iter 1 + exploration model + iter 1
- find the useful part of the neg expert trajs and add them as training data
- find from iter 1 mid
- create other models like finetune or xlam
- two gpt3.5 trajs
- run fm456
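A rough sketch of the forced-exploration check above: force the other products (or Next) at the product-selection step, let the model finish the episode, and count how often the forced rollout beats the original reward by the 0.2 margin. The rollout call is a toy stand-in, not the real environment code.

```python
import random

MARGIN = 0.2  # a forced rollout "is needed" if it beats the original reward by this much

def rollout_with_forced_action(task_id: int, forced: str) -> float:
    # Toy stand-in for: replay the original prefix, force `forced` at the
    # product-selection step, let the model finish the episode, return reward.
    random.seed(hash((task_id, forced)))
    return random.random()

def fraction_needing(task_ids, original_rewards, forced_options) -> float:
    """Fraction of tasks where some forced alternative beats the original by MARGIN."""
    hits = 0
    for task_id, base in zip(task_ids, original_rewards):
        best = max(rollout_with_forced_action(task_id, opt) for opt in forced_options)
        hits += best > base + MARGIN
    return hits / len(task_ids)

if __name__ == "__main__":
    tasks = list(range(100))
    base = [rollout_with_forced_action(t, "original choice") for t in tasks]
    print("needs another product:", fraction_needing(tasks, base, ["product 2", "product 3"]))
    print("needs next page:", fraction_needing(tasks, base, ["Next >"]))
```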
| | tr wr | tr rw | test wr | test rw |
| ----------------------------- | ----- | ------ | ------- | ------- |
| gpt-4 | 36.0 | 0.667 | 38.0 | 0.671 |
| gpt-3.5 | 22.2 | 0.579 | 24.0 | 0.579 |
| iter 1 | 37.0 | 0.677 | 33.73 | 0.66 |
| iter 2 | 39.71 | 0.6934 | 37.73 | 0.6823 |
| iter 2 train w tmp579 | 45.92 | 0.72 | 38.53 | 0.685 |
| iter 2 train w fm123 | 45.47 | 0.722 | 40.53 | 0.688 |
| iter 2 train w fm123 + tmp579 | 44.31 | 0.727 | 41.61 | 0.694 |
| iter 3 | 42.75 | 0.7073 | 38.0 | 0.6817 |
# 11/12
- settings
- demonstrations generated by one shot
- success questions (22%)
- can be easy questions
- fail questions
- actions
- thought + action
- can be learned (exploration)
- can be avoided
- training stage
- can use ft models to interact with the env
- including search with env reward
- our model w/o forcing next back to search
- 3%~4% using next and back to search
- select better trajs as training data
- testing stage
- cannot access the env reward system
- need a reward function to search?
- fm2
- use the first 2 steps of the gpt demonstration and the ft model for the steps after step 2 (see the sketch after this list)
- if successful
- the 2-step prefix is something we can use
- however, some actions like forced next appear only in the middle of the demo trajs
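A sketch of the fm idea (fm2, fm123), assuming it works as described above: keep the first k steps of a GPT demonstration, let the finetuned model act from step k+1, and keep the expert prefix as training data if the episode succeeds. The success threshold and interfaces are assumptions.

```python
import random

def finish_from_prefix(demo_id: int, k: int) -> float:
    # Toy stand-in for: replay the expert demo's first k steps in the env,
    # then let the finetuned model act from step k+1 and return the final reward.
    random.seed(hash((demo_id, k)))
    return random.random()

def mine_useful_prefixes(demo_ids, prefix_lengths=(1, 2, 3), success_threshold=0.9):
    """'fm' data: keep an expert prefix if the finetuned model can finish well from it."""
    kept = []
    for demo_id in demo_ids:
        for k in prefix_lengths:
            if finish_from_prefix(demo_id, k) >= success_threshold:
                # The first k expert steps (which may include forced Next /
                # Back to Search) become supervised training steps.
                kept.append((demo_id, k))
    return kept

if __name__ == "__main__":
    print(f"{len(mine_useful_prefixes(range(200)))} useful prefixes")
```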
| | tr wr | tr rw | test wr | test rw | Next | Back to Search |
| -------------------------------------- | ----- | ----- | ------- | ------- | ---- | -------------- |
| gpt-4 | 36.0 | 0.667 | 38.0 | 0.671 | 2.0 | 0.67 |
| gpt-3.5 | 22.2 | 0.579 | 24.0 | 0.579 | 0.27 | 1.07 |
| 5steps | 49.3 | 0.74 | 43.2 | 0.705 | 0.27 | 0.13 |
| 5steps + tmp579 | 51.22 | 0.763 | 45.87 | 0.74 | 0.4 | 0.13 |
| 5steps + tmp579 + from gpt mid (fm123) | 53.04 | 0.789 | 50.27 | 0.779 | 0.8 | 1.07 |
| 6steps (train with 5step + fm) | 54.11 | 0.765 | 42.4 | 0.709 | 0.53 | 0.4 |
| 6stepsfm1 | 50.42 | 0.745 | 41.47 | 0.693 | 1.07 | 1.07 |
| 6stepsfm2 | 39.29 | 0.685 | 35.87 | 0.667 | 0.67 | 5.33 |
| 6stepsfm3 | 27.52 | 0.622 | 28.93 | 0.633 | 1.47 | 5.6 |
| 6steps + fm123 | 55.71 | 0.796 | 49.33 | 0.77 | 1.2 | 1.6 |
| fm_all | 56.02 | 0.804 | 52.67 | 0.796 | 1.2 | 1.2 |
| force next + back to search | 63.33 | 0.85 | 57.73 | 0.826 | 5.87 | 7.2 |
| force next + back to search + fm_all | 64.1 | 0.858 | 60.93 | 0.851 | 6.0 | 7.73 |
# 11/5
- fm
| | tr wr | tr rw | test wr | test rw | Next | Back to Search |
| ------------------------------ | ----- | ----- | ------- | ------- | ---- | -------------- |
| gpt-4 | 36.0 | 0.667 | 38.0 | 0.671 | 2.0 | 0.67 |
| gpt-3.5 | 22.2 | 0.579 | 24.0 | 0.579 | 0.27 | 1.07 |
| 5steps | 49.3 | 0.74 | 43.2 | 0.705 | 0.27 | 0.13 |
| 5steps + tmp579 | 51.22 | 0.763 | 45.87 | 0.74 | 0.4 | 0.13 |
| 5steps + tmp579 + from gpt mid | | | 50.27 | 0.782 | 0.8 | 1.07 |
- chat gpt help do the n step
| | tr wr | tr rw | test wr | test rw | Next | Back to Search |
| ----------------------------- | ----- | ----- | ------- | ------- | ---- | -------------- |
| gpt-4 | 36.0 | 0.667 | 38.0 | 0.671 | 2.0 | 0.67 |
| gpt-3.5 | 22.2 | 0.579 | 24.0 | 0.579 | 0.27 | 1.07 |
| 5steps | 49.3 | 0.74 | 43.2 | 0.705 | 0.27 | 0.13 |
| 5steps + gpt-2step | 39.88 | 0.685 | 37.47 | 0.676 | 1.2 | 5.47 |
| 5steps + gpt-3step | 39.77 | 0.699 | 35.73 | 0.673 | 0.0 | 0.4 |
| 5steps + gpt-4step | 46.34 | 0.73 | 40.13 | 0.695 | 0.27 | 0.13 |
| 5steps + gpt-234step | 51.96 | 0.771 | 47.87 | 0.753 | 0.53 | 0.27 |
| 5steps + tmp579 | 51.22 | 0.763 | 45.87 | 0.74 | 0.4 | 0.13 |
| 5steps + gpt-234step + tmp579 | 52.8 | 0.78 | 49.2 | 0.769 | 0.8 | 0.27 |
- next, back to search: success in test
| | tr wr | tr rw | test wr | test rw | Next | Back to Search |
| ------------------------------------- | ----- | ----- | ------- | ------- | ---- | -------------- |
| gpt-4 | 36.0 | 0.667 | 38.0 | 0.671 | 2.0 | 0.67 |
| gpt-3.5 | 22.2 | 0.579 | 24.0 | 0.579 | 0.27 | 1.07 |
| 5steps | 49.3 | 0.74 | 43.2 | 0.705 | 0.27 | 0.13 |
| 5steps + next | 53.8 | 0.787 | 47.47 | 0.764 | 5.6 | 0.13 |
| 5steps + bst | 51.13 | 0.762 | 45.2 | 0.735 | 0.13 | 2.8 |
| 5steps + tmp579 | 51.22 | 0.763 | 45.87 | 0.74 | 0.4 | 0.13 |
| 5steps + tmp579 + next | 55.28 | 0.801 | 50.8 | 0.791 | 4.93 | 0.13 |
| 5steps + tmp579 + bst | 52.69 | 0.779 | 48.27 | 0.765 | 0.4 | 2.27 |
| 5steps + tmp579 + next + bst | 56.31 | 0.81 | 52.27 | 0.801 | 4.8 | 1.47 |
| 6steps | 51.68 | 0.751 | 43.07 | 0.707 | 0.27 | 0.0 |
| 6steps + next | 56.32 | 0.803 | 48.93 | 0.773 | 6.53 | 0.13 |
| 6steps + bst | 54.4 | 0.784 | 48.4 | 0.764 | 0.53 | 6.13 |
| 6steps + bst + next | 59.91 | 0.826 | 53.73 | 0.805 | 5.73 | 4.0 |
| 6steps + bst + next + 5steps + tmp579 | 63.33 | 0.85 | 57.73 | 0.826 | 6.4 | 3.73 |
- assume
- the more positive training the better
- budget llama3 and llama3 finetune model
-
# 10/29
- test
| | Next | Prev | Back to Search | fake button |
| --------------------------------- | ------ | ---- | -------------- | ----------- |
| gpt-4 | 15.067 | 0.0 | 14.0 | 3.733 |
| gpt-3.5 | 3.733 | 0.0 | 19.6 | 4.667 |
| lamma3 5steps data x 2 | 0.667 | 0.0 | 0.667 | 3.467 |
| lamma3 5steps data x 2 NAT | 0.133 | 0.0 | 0.133 | 3.333 |
| lamma3 5steps data x 2 NAT gpt3.5 | 1.067 | 0.0 | 4.0 | 2.0 |
| lamma3 1step pos+neg | 2.667 | 0.0 | 10.267 | 3.867 |
- evaluate results with goals
- NAT3I-ws-8b-ep12-ep6-ep6-12-ep6-12-all-d-ep1
- winrate, reward: 0.432, 0.704
- selected goal reward > 0.6
| reward range | goal selected rate | goal searched rate |
| ------------ | ------------------ | ------------------ |
| 0~1 | 14.8 | 28.26 |
| 1~1 | 33.88 | 43.88 |
| 0.6~1 | 27.5 | 37.72 |
| 0~0.5 | 0.57 | 17.66 |
- force back to search: 6% on fail data (60%)
- force next: 3% on fail data
# 10/21
| | tr wr | tr rw | test wr | test rw |
| -------------------------- | ------- | ------ | ------- | ------- |
| gpt-4 | 0.36 | 0.6674 | 0.38 | 0.671 |
| gpt-3.5 | 0.221 | 0.578 | 0.24 | 0.5791 |
| lamma3 1step | 0.332 | 0.6589 | 0.3093 | 0.6323 |
| lamma3 2steps | 0.3971 | 0.6934 | 0.3773 | 0.6823 |
| lamma3 3steps | 0.4275 | 0.7073 | 0.38 | 0.6817 |
| lamma3 4steps | 0.4768 | 0.7298 | 0.4133 | 0.69835 |
| lamma3 5steps | 0.4951 | 0.739 | 0.4133 | 0.7016 |
| lamma3 5steps data x 2 | 0.50388 | 0.7473 | 0.432 | 0.7045 |
| lamma3 5steps data x 2 NAT | 0.471 | 0.731 | 0.4173 | 0.6982 |
| continue 2steps | 0.403 | 0.6963 | 0.3893 | 0.687 |
| continue 3steps | 0.4443 | 0.7143 | 0.3907 | 0.6859 |
| continue 4steps | 0.4742 | 0.7268 | 0.404 | 0.6923 |
| continue 5steps | 0.4862 | 0.729 | 0.401 | 0.6891 |
- sub-opt trajs
- gpt-3.5
- exploration
- e.g., navigation
- need to avoid
- finetune models
- need to avoid
- check how many wrong buttons are in the neg data
- in the no-pos scenario
- only train on pos (1+x-y) - x*(train on ft neg in all scenarios) + y*(train on expert neg)
- identify critical steps
- Monte Carlo (see the sketch after this list)
- makes sense in scenarios with positive trajs
- MCTS
- lamma3 5steps data x 2
- wr: 0.432
- rw: 0.7045
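One possible reading of the Monte Carlo critical-step idea above: estimate the value of the state after each step from a handful of rollouts, and treat large jumps in value as the critical actions. The rollout call below is a toy stand-in and the advantage definition is an assumption.

```python
import random
import statistics

def rollout_from(step_index: int, seed: int) -> float:
    # Toy stand-in for letting the policy finish the episode from the state
    # reached after `step_index` actions of the trajectory.
    random.seed(hash((step_index, seed)))
    return random.random()

def step_values(num_steps: int, n_rollouts: int = 8):
    """Monte Carlo value of the state after each step, plus per-action advantages."""
    values = [statistics.mean(rollout_from(i, s) for s in range(n_rollouts))
              for i in range(num_steps + 1)]
    advantages = [after - before for before, after in zip(values, values[1:])]
    return values, advantages

if __name__ == "__main__":
    values, advantages = step_values(num_steps=5)
    critical = max(range(len(advantages)), key=lambda i: abs(advantages[i]))
    print("most critical step:", critical, "advantage:", round(advantages[critical], 3))
```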
| | wr | rw | join wr | join rw |
| ------------------------- | ------ | ------ | ------- | ------- |
| gpt 4 | 0.38 | 0.6714 | 0.4973 | 0.7681 |
| gpt 3.5 | 0.24 | 0.5792 | 0.4533 | 0.7378 |
| start from 1 step gpt 3.5 | 0.3891 | 0.6789 | 0.4435 | 0.7273 |
| start from 2 step gpt 3.5 | 0.338 | 0.6492 | 0.452 | 0.7357 |
| start from 3 step gpt 3.5 | 0.2848 | 0.6177 | 0.4565 | 0.7378 |
# 10/15

{Button must be an ASIN}?

- navigation
- all data
| | Next | Prev | Back to Search |
| ----------------------- | ---- | ---- | -------------- |
| gpt-4 | 50.8 | 0.0 | 21.07 |
| gpt-3.5 | 7.56 | 0.0 | 37.28 |
| p10-ep12 | 0.01 | 0.0 | 0.0 |
| p10-ep12-ep6 | 0.01 | 0.0 | 0.0 |
| p10-ep12-ep6-ep6 | 0.05 | 0.0 | 0.0 |
| p10-ep12-ep6-ep6-12-ep6 | 0.16 | 0.0 | 0.03 |
- success data
| | Next | Prev | Back to Search |
| ----------------------- | ---- | ---- | -------------- |
| gpt-4 | 3.13 | 0.0 | 1.0 |
| gpt-3.5 | 0.45 | 0.0 | 1.5 |
| p10-ep12 | 0.01 | 0.0 | 0.0 |
| p10-ep12-ep6 | 0.01 | 0.0 | 0.0 |
| p10-ep12-ep6-ep6 | 0.01 | 0.0 | 0.0 |
| p10-ep12-ep6-ep6-12-ep6 | 0.03 | 0.0 | 0.03 |
- sub-opt data (>0.6)
| | Next | Prev | Back to Search |
| ----------------------- | ---- | ---- | -------------- |
| gpt-4 | 11.6 | 0.0 | 4.27 |
| gpt-3.5 | 1.69 | 0.0 | 5.08 |
| p10-ep12 | 0.01 | 0.0 | 0.03 |
| p10-ep12-ep6 | 0.01 | 0.0 | 0.0 |
| p10-ep12-ep6-ep6 | 0.01 | 0.0 | 0.0 |
| p10-ep12-ep6-ep6-12-ep6 | 0.03 | 0.0 | 0.03 |
- training data type (join data)
| | tr wr | tr rw | test wr | test rw | tr num | tr loss |
| ------------ | ------ | ------ | ------- | ------- | ------ | ------- |
| gpt-3.5 | 0.221 | 0.578 | 0.24 | 0.5791 | 0 | |
| p10-ep6 | 0.332 | 0.6589 | 0.3093 | 0.6323 | 2025 | 0.0121 |
| p10-ep12-ep6 | 0.3027 | 0.6392 | 0.2827 | 0.6278 | 2025 | 0.0129 |
- multistep rft
| | tr wr | tr rw | test wr | test rw | tr num | tr loss |
| ----------------------- | ------ | ------ | ------- | ------- | ------------ | ------- |
| gpt-4 | 0.36 | 0.6674 | 0.38 | 0.671 | 0 | |
| gpt-3.5 | 0.221 | 0.578 | 0.24 | 0.5791 | 0 | |
| p07-ep12 | 0.2716 | 0.656 | 0.2907 | 0.6517 | 2917 | |
| p10-ep6 | 0.332 | 0.6589 | 0.3093 | 0.6323 | 2025 | 0.0121 |
| p10-ep12 | 0.3362 | 0.6626 | 0.316 | 0.6501 | 2025 | |
| p10-ep12-ep6 | 0.3971 | 0.6934 | 0.3773 | 0.6823 | 3521 | 0.0012 |
| p10-ep12-ep12 | 0.4173 | 0.6994 | 0.3747 | 0.6736 | 3521 | |
| p10-ep12-ep6-ep6 | 0.4275 | 0.7073 | 0.38 | 0.6817 | 4149 | 0.0004 |
| p10-ep12-ep6-ep12 | 0.4416 | 0.7144 | 0.3813 | 0.6831 | 4149 | 0.0001 |
| p10-ep12-ep6-ep6-12-ep6 | 0.4727 | 0.7317 | 0.4093 | 0.6959 | 4844 (0.461) | 0.0016 |


# 10/8
- bug:
- llm output: Think[{"response": "Task:Think[{"response": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button": "Task:click[{"button":
# 10/3
- continue from gpt4
- gpt4
- wr: 0.37
- reward: 0.6655
- total join
- wr: 0.466135
- reward: 0.7629
| | 0 | 1 | 2 | 3 | 4 |
| ------------ | ------- | ------- | -------- | -------- | --------- |
| winrate | 0.3227 | 0.3187 | 0.3306 | 0.3306 | 0.32270 |
| reward | 0.66842 | 0.67197 | 0.66922 | 0.669422 | 0.667430 |
| join winrate | 0.4581 | 0.45418 | 0.462151 | 0.46215 | 0.458167 |
| join reward | 0.75856 | 0.75883 | 0.7603 | 0.7605 | 0.7595617 |
- how well was the model trained?


| | gpt3.5t | thr=0.0 | thr=0.5 | thr=0.7 | thr=1.0 | thr=1.0 ep4 |
| --------------------------- | ------- | ------- | ------- | ------- | ------- | ----------- |
| reward=1 train data winrate | 1.0 | 0.8328 | 0.76716 | 0.7611 | 0.70149 | 0.8567 |
| reward=1 train data reward | 1.0 | 0.92633 | 0.9116 | 0.9095 | 0.87078 | 0.9452 |
- average performance
| | gpt3.5t | thr=0.0 | thr=0.5 | thr=0.7 | thr=1.0 | thr=1.0 ep4 |
| --------------------- | ------- | ------- | ------- | -------- | ------- | ----------- |
| train data num | | 8273 | 5388 | 2679 | 1915 | 1915 |
| train dataset winrate | 0.2224 | 0.23 | 0.232 | 0.245 | 0.257 | 0.315 |
| train dataset reward | 0.5788 | 0.5922 | 0.6159 | 0.6151 | 0.6120 | 0.6512 |
| test winrate | 0.2411 | 0.23505 | 0.2031 | 0.2629 | 0.30278 | 0.3227 |
| test reward | 0.60544 | 0.61163 | 0.6134 | 0.644389 | 0.6424 | 0.66842 |
- example
- "Observation: WEB PAGE: {your button must be in {'2\' 2" x 4\'': 'size', '2\' 2" x 6\'': 'size', '2\' 2" x 8\'': 'size', '2\' 2" x 10\'': 'size', '2\' 2" x 12\'': 'size', .... <|start_header_id|>assistant<|end_header_id|>"
- Action:click[{"button": "2\" 2\" x 18\""}]<|eot_id|><|start_header_id|>user<|end_header_id|>
# 9/26
- generate training data with human instructions and gpt3.5
- total data num: 10479
- average len: 3128
- median: 2700
- average message num: 16
- median: 13
- avg reward: 0.6
- reward range: [0.0, 1.0]
- after filtering out examples over 3500 tokens: 7367
- finetune llama3 with different reward thresholds (filtering sketched after this list)
- evaluate on 251 test data
- chatgpt 4: 0.69
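A minimal sketch of the filtering implied above (reward threshold plus the ~3500-token cutoff); the field names are assumptions.

```python
def filter_examples(examples, reward_threshold=0.5, max_tokens=3500):
    """Keep trajectories above the reward threshold and under the token budget."""
    return [ex for ex in examples
            if ex["reward"] >= reward_threshold and ex["num_tokens"] <= max_tokens]

if __name__ == "__main__":
    demo = [{"reward": 1.0, "num_tokens": 2700},
            {"reward": 0.4, "num_tokens": 3100},
            {"reward": 0.8, "num_tokens": 4200}]
    print(len(filter_examples(demo)))  # -> 1
```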
| | Chatgpt 4 | Chatgpt 3.5t | thr=0.0 | thr=0.5 | thr=0.6 | thr=0.7 | thr=1.0 | thr=0 ep4 |
| ---------------- | --------- | ------------ | ------- | -------- | ------- | ------- | ------- | --------- |
| test reward | 0.696 | .6049 | 0.6116 | 0.613479 | .6175 | 0.64438 | 0.6424 | 0.6684 |
| train reward | | 0.6174 | | 0.62403 | | | 0.62399 | |
| train wr | 0.37 | 0.223 | | 0.232 | | | 0.257 | |
| gpt3 join wr | 0.4033 | 0.223 | | 0.284 | | | 0.324 | |
| gpt3 join reward | 0.71 | 0.6174 | | 0.665 | | | 0.686 | |
| gpt4 join wr | 0.37 | 0.4033 | | 0.412 | | | 0.42 | |
| gpt4 join reward | 0.6694 | 0.71 | | 0.7304 | | | 0.734 | |
| num | | N/A | 8273 | 5388 | 4377 | 2679 | 1915 | N/A |
- next step
- test on training set
- positive train data
- thr = 0.5
- reward: 0.91160
- wr: 0.76716
- thr = 1.0
- reward: 0.87078
- wr: 0.70149
- lamma3 one-shot
- "I can assist you with that. Here's the execution of the task:"
- check how the agent fails
- to decide how to use the neg data
# 9/19
- finished tracing webshop code of agentlite and agentboard
- agentboard is an agent server
- 
- agentlite is a way to operate agentboard
- 
- example: click[{"button": "B01CEEGYEA"}]
Observation: "You have clicked on the product with the code B01CEEGYEA."
Action:Finish[{"response": "Task Finished. Reward: 1.0"}]
Observation: "You have finished the shopping. Reward: 1.0"
- figure out how to get 10k human instructions in agentboard
- able to generate training data
- generating gpt3.5 turbo for 3k of human ins
- gpt4/gpt4/gpt4o mini for 10k?
- currently, writing the code to evaluate lamma on agentlite
-
# 9/12
- webshop
- 
- 
| | cgpt | cgpt action only | cgpt thought | l3r | l3r action only |
| ------------------ | ---- | ---------------- | ------------ | ------ | --------------- |
| gsm8k test | 58% | 69% | 0% | 59.23% | 52.07% |
| gsm8k test + hint | 73% | 75% | N/A | 77.16% | 69% |
- train on llama rephrase
- "please rephrase \"" + thought + "\" in your words directly"
- next step: make sure that the agent is able to generate the same action based on the thought
- action examples
- Now that I have the cost of all the ingredients, I can calculate the total amount spent by adding the costs of the cheddar cheese, cream cheese, and cold cuts.
- With the ingredient costs tallied, I can now determine the total expenditure by aggregating the prices of the cheddar cheese, cream cheese, and cold cuts.
- Apologies for the confusion. Let me correct the calculation.
- I think there may have been a mistake. Let me recalculate the numbers to ensure accuracy.
- The total time Tim spends biking in a week is 16 hours.
- Tim dedicates a total of 16 hours each week to biking.

# summary
- Fact
- llama3 8 shot: 78%
- "Let's think step by step. At the end, you MUST write the answer as an integer after '####' "
- 
- test more shots
- ChatGpt 3.5: 80%
- As an agent:
- llama3: 0% (37% if more obs)
- cal[10hr*3]
- chatgpt: 67.79%
- step 1:
- thought
- action
- llama3 finetune on chatgpt 3.5 pos
- 66% on train
- union with chatgpt(67.79%): 80%
- 58% on test
- collaborate with lamma3 hint
- test: 73%
- hint has some calculation error but not much
- we can increase it
- train dataset: 92%
- llama3 70b: ~90%
- llama3 finetune on chatgpt 3.5 action with 70% acc on train
- test: 69%
- collaborate with lamma3 hint
- 75%
- Learned
- agents are capable of attaching different labels to pos and neg
- but not the idea behind the labels
- letting agents learn to answer true/false does not improve much
- with more correct answers it improves a little bit
- ideas for neg
- iteration
- policy improvement
- there may still exist answers in neg demo
- we can let more questions have answers
- make the answers better
- refine with llama3 style
- testing: reduce neg label probability
- training:
- dpo based on action?
# 8/21
- ablation studies
- finetune with only chatgpt action
- 69%
- combine with lamma3 75%
- how the 8-shot llama3 results work (answer extraction sketched after this list)
- question + "Let's think step by step. At the end, you MUST write the answer as an integer after '####' "
-
- how often does the hint give the result directly
- returns the correct answer directly
- hard train: 19.41%
- test: 13.19%
- a correction exists: 7.5%
- 
- 
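Since the prompt asks for the answer as an integer after '####', scoring presumably parses it back out; a minimal sketch (the comma handling is an assumption).

```python
import re

def extract_final_answer(completion: str) -> int | None:
    """Pull the integer written after '####', as the prompt instructs."""
    match = re.search(r"####\s*(-?[\d,]+)", completion)
    if not match:
        return None
    return int(match.group(1).replace(",", ""))

if __name__ == "__main__":
    print(extract_final_answer("...so Tim bikes 16 hours a week.\n#### 16"))  # 16
```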
# 8/14
- train only on action
- Baseline pure llama 3
- llama3-instruct 40%
| | only positive | + ft + extra2 | + ft + extra9 |
| ----- | ------------- | ------------- | ------------- |
| GSM8K | 73.54% | 75.05% | 71.49% |
| SVAMP | 82.1% | 76.8% | 79.9% |
| ASDiv | 83.5% | 83.29% | 82.45% |
- is chatgpt better training data than the finetuned model?
- pos: 56.7%
- ft: 55.87%
- train with lamma3 1 + ft
| | only positive | + all ft pos | + remain ft pos | + extra ft pos |
| ----- | ------------- | ------------ | --------------- | -------------- |
| GSM8K | 57.9% | 58.6% | 59.5% | 60.57% |
| SVAMP | 70.8% | 63.1% | 67.6% | 67.3% |
| ASDiv | 72.76% | 72.31% | 72.76% | 73.85% |
- results on hard training data
| | finetune only chatgpt pos | lamma3 |
| ------------------------------------ | -------- | ------ |
| pure | 14.86% | 37.21% |
| chatgpt 1 | 28.29% | 40.74% |
| chatgpt 6 | 32.26% | 50.03% |
| chatgpt 6 w/o pointing out incorrect | 33.43% | 49.47% |
| chatgpt 6 + lamma3 1 | 39.87% | 47.18% |
| finetune 6 + lamma3 1 | 14.73% | 35.35% |
| **lamma3 1** | 63.47% | 37.77% |
- Example
- Question: Four classmates were comparing their ages based on their birth month. They found out that Jolyn is 2 months older than Therese while Therese is 5 months older than Aivo. Then, Leon is 2 months older than Aivo. How much older in months is Jolyn than Leon?
- Here's a step-by-step plan to solve the problem:\n\n
- **Step 1: Identify the relationships between the classmates' ages**\n\n*
- Jolyn is 2 months older than Therese\n*
- Therese is 5 months older than Aivo\n*
- Leon is 2 months older than Aivo\n\n
- **Step 2: Create a diagram to visualize the relationships**\n\n
- Let's create a diagram to help us visualize the relationships between the classmates' ages:\n\n
- Aivo - 0 months\n
- Therese - 5 months\n
- Jolyn - 7 months (2 months older than Therese)\n\nLeon - 2 months older than Aivo, so Leon - 2 months\n\n**
- Step 3: Find the difference in months between Jolyn and Leon**\n\n
- We can see from the diagram that Jolyn is 7 months old and Leon is 2 months old. To find the difference, subtract Leon's age from Jolyn's age:\n\n
- 7 (Jolyn) - 2 (Leon) = 5 months\n\n
- **Answer:** Jolyn is 5 months older than Leon.
- Agent: Thought: We have found the age differences between Jolyn and Leon. We can finish the task now.\n\nAction: finish[5]
-
- llama3 on hard training data
- original env: 0%
- it provides multiple "**thought**", "**action**", "**observation**" blocks
- the action line needs to be "Action: "
- modified the env:
- ask llama3 to provide only one thought and action in the correct format.
- 19.58%
- "finish[$1000]"
- ask llama3 to provide a pure number in the answer (parsing sketched after this list)
- 23.68%
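A sketch of the format handling described above: keep only the first "Action:" line of the reply and normalize the finish[...] argument to a pure number (e.g. finish[$1000] -> 1000). The exact parsing rules are assumptions.

```python
import re

def first_action(reply: str) -> str | None:
    """Keep only the first 'Action:' line, as the modified env requires."""
    match = re.search(r"Action:\s*(.+)", reply)
    return match.group(1).strip() if match else None

def finish_value(action: str) -> float | None:
    """Extract the finish[...] argument and strip '$' and ',' so the answer is a pure number."""
    match = re.match(r"finish\[(.+?)\]", action)
    if not match:
        return None
    raw = match.group(1).replace("$", "").replace(",", "").strip()
    try:
        return float(raw)
    except ValueError:
        return None

if __name__ == "__main__":
    reply = "Thought: The profit is $1000.\nAction: finish[$1000] </s>"
    action = first_action(reply)
    print(action, finish_value(action))  # finish[$1000] </s> 1000.0
```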
# 8/8
- chatgpt 6 + llama3 point out error + restart:
- 41.95%
- llama3 plan + start:
- 43.09%
- llama3 no prompt plan + start:
- 51.87%
- Can we use the neg data in training to get more?
- pure: 8%
- continue from neg step 1: 28.48%
- 6 (chatgpt) steps -> 1 step (plan)
- restart: 32.77%
- 6 (chatgpt) + "the anwser is incorrect. Lets start again" + question
- test: the anwser is correct
- test: Lets start again
- test: please judge the trajs
- 6 (chatgpt) + "the anwser is incorrect. please point out the mistake."
- thought: ....
- "Lets start again" + question
- pure + continue + restart: 44%
- continue + restart: 41.28%
- Add the new correct trajs generated by the finetuned model to the training data
- add training accuracy
| | only positive | + all ft pos | + remain ft pos |
| ----- | ------------- | --------- | ------------ |
| GSM8K | 57.9% | 58.6% | 59.5% |
| SVAMP | 70.8% | 63.1% | 67.6% |
| ASDiv | 72.76% | 72.31% | 72.76% |
- After the first finetuning step, 21% of the training data still don't have a solution.
- Finetune on ft trajs or chatgpt trajs for the questions both get correct
| | chatgpt correct | chatgpt wrong |
| ---------------- | --------------- | ------------- |
| Finetune Correct | 56.79% | 10.14% |
| Finetune Wrong | 11.45% | 21.61% |
- 34% of the training data don't have a solution.
| chatgpt correct | chatgpt wrong |
| --------------- | ------------- |
| 68.24% | 31.76% |
# 8/8
| | only positive | p5k + YN22k | p5k + YN22k random | p5k + NY22k |
| ----- | ------------- | ----------- | ------------------ | ----------- |
| GSM8K | 57.9% | 58.75% | 60.87% | 54.43% |
| SVAMP | 70.8% | 64.8% | 70.3% | 59.4% |
| ASDiv | 72.76% | 71.82% | 74.69% | 69.64% |
| | p5k + YN11k | only positive | p5k + NY11k | p5k + YN22k | p5k + NY22k |
| ----- | ----------- | ------------- | ----------- | ----------- | ----------- |
| GSM8K | 59.81% | 57.9% | 56.86% | 58.75% | 54.43% |
| SVAMP | 67.1% | 70.8% | 68.5% | 64.8% | 59.4% |
| ASDiv | 73.059% | 72.76% | 72.91% | 71.82% | 69.64% |
| | only positive | positive 5k + YN 11k | positive 5k + NY 11k | positive 5k + YN 11k user |
| -------- | ------------- | -------------------- | -------------------- | --- |
| GSM8K | 57.9% | 59.81% | 56.86% | 59.51% |
| SVAMP | 70.8% | 67.1% | 68.5% | 66.5% |
| ASDiv | 72.76% | 73.059% | 72.91% | 74.44% |
-
# 8/1
- notes
- notes: XXXXXX. Thought: XXXX. Action: XXXX
- thought can reject notes
- which makes it like a tree
- remove notes from trajectories
- during training the agent can write its own notes but needs to generate the same thought and action.
- refine the neg trajectories by restarting generation from the middle (see the sketch after this list)
- simple refinement does not improve the agent?
- algo:
- randomly select a step i
- remove steps after i
- start again from i
- we might be able to determine which step goes wrong
- WEAK-TO-STRONG GENERALIZATION: ELICITING STRONG CAPABILITIES WITH WEAK SUPERVISION
- needs early stopping when learning sub-opt data
- never thought about using conditioning for sub-opt data
- test if the agent learns yes no correctly.
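A sketch of the refinement algo above: pick a random step i, drop everything from i on, regenerate from the kept prefix, and keep the best-scoring attempt. The regeneration call is a toy stand-in, and keeping the best of n tries is an assumption.

```python
import random

def regenerate_from(prefix: list[str], seed: int) -> tuple[list[str], float]:
    # Toy stand-in for re-running the policy from the truncated prefix
    # (the real code would replay the prefix in the env and sample new steps).
    random.seed(hash((tuple(prefix), seed)))
    new_steps = [f"new_step_{len(prefix) + j}" for j in range(3)]
    return prefix + new_steps, random.random()

def refine_negative_trajectory(steps, old_reward, n_tries=4):
    """Restart generation from a random middle step; keep the best-scoring attempt."""
    best_steps, best_reward = steps, old_reward
    for seed in range(n_tries):
        i = random.randrange(1, len(steps))                       # randomly select a step i
        new_steps, new_reward = regenerate_from(steps[:i], seed)  # drop steps from i on
        if new_reward > best_reward:
            best_steps, best_reward = new_steps, new_reward
    return best_steps, best_reward

if __name__ == "__main__":
    traj = [f"step_{k}" for k in range(5)]
    print(refine_negative_trajectory(traj, old_reward=0.2))
```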
| | only positive | positive 5k + YN 7k | positive 5k + NY 7k |
| -------- | ------------- | ------------------- | --- |
| GSM8K | 57.9% | 60.12% | 55.2% |
| SVAMP | 70.8% | 72.4% | N/A |
| ASDiv | 72.76% | 73.55% | N/A |
-
# 7/18
| | only positive | positive 5k + YN 7k | |
| -------- | ------------- | ------------------- | ------ |
| GSM8K | 57.9% | 60.12% | 61.41% |
| SVAMP | 70.8% | 72.4% | 65.9% |
| ASDiv | 72.76% | 73.55% | 73.10% |
positive 5k:
- generate by chatgpt 3
YN:
- Train an agent with only positive data
- Use the agent to generate data on train set
- For each traj, append the following (see the sketch below)
- {'role': 'user', 'content': 'Additional Qestion: Does the agent provide correct anwser to the user?', 'loss': False}
- If agent is correct:
- {'role': 'user', 'content': 'Yes.', 'loss': True}
- number 7000 * 67%
- else:
- {'role': 'user', 'content': 'No.', 'loss': True}
- 7000 * 33%
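A literal sketch of the YN example construction above; the question/answer strings (spelling included) and the loss flags are copied from the recipe, while the surrounding message layout is an assumption.

```python
def build_yn_example(trajectory_messages: list[dict], agent_was_correct: bool) -> list[dict]:
    """Append the judgement question (no loss) and the Yes/No answer (with loss)."""
    question = {"role": "user",
                "content": "Additional Qestion: Does the agent provide correct anwser to the user?",
                "loss": False}  # prompt string copied verbatim from the recipe above
    answer = {"role": "user",   # the recipe above also marks the Yes/No turn as 'user'
              "content": "Yes." if agent_was_correct else "No.",
              "loss": True}
    return trajectory_messages + [question, answer]

if __name__ == "__main__":
    traj = [{"role": "user", "content": "Q: 2 + 2 = ?", "loss": False},
            {"role": "assistant", "content": "Thought: ...\nAction: finish[4]", "loss": False}]
    print(build_yn_example(traj, agent_was_correct=True)[-1])
```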
positive 5k + YN:
- train from llama3 instruct
Test true false on test data:
- accuracy: 61% -> 51%
- positive recall: 90.87% -> 30.57%
- negative recall: 17.64% -> 81.6%
Questions:
- loss weight in llm.
- how do I generate with two LLM models.
ongoing:
- webshop
# 7/11
Can llama-3 be trained to be better?
0.60121304
Can llama-3 notice the wrong trajectories?
No
- add "New Qestion: Does the agent provide correct anwser to the user? Use yes/no to provide the final answer."
- NAT3 op trajectories 59.2%
- simple yes/no
- accuracy: 61%
- negative accuracy: 17.64%
- positive accuracy: 90.87%
- neg recall: 38.92%
- pos recall: 51.89%
- predict neg: 9%
- with thought
- accuracy: 43%
- negative recall: 25.5%
- positive recall: 69.3%
- predict neg: 72%
- 1.5%: call calculation
- correct examples:
- 
- wrong examples:
- 
- 
| train | llama-3 | llama-3 inv | chat | chat inv |
| ------ | ------- | ----------- | ---- | -------- |
| normal | 0.566 | 0.495 | 0.56 | 0.51023 |
| inv | 0.5041 | 0.5754 | 0.4783 | 0.5625 |
- hotpotqa
- llama3
- em: 25.53, f1: 39.73
- paper
- em: 28.80, f1: 41.37
- xxxxx
- q:
- q: x y old
- search x
- xxxx
- xxxx
- xxxx
- search y
- xxxx
- xxxx
- xxxx
- finish[x]
- webshop
# 7/5
- test
https://gorilla.cs.berkeley.edu/
- webshop
# 6/27
- try more training prompts for 1 step
- no obvious improvement
- train llama3 hotpotqa
- 0% accuracy
- debugging
- test if the agent can learn from predicting the true or false of the trajectories
- see if training on more trajectories helps
# 6/20
- 5K
| | NAT3 op | NAT3 | NAT3 chat-gpt | NAT3 chat-gpt 1 step | NAT3 cl 1 step | NAT3 cl 1 step oss |
| ------------ | --------- | -------- | ------------- | ------------------- | ------------- | ----------------- |
| GSM train | 0.6693 | 0.6455 | 0.6385 | 0.6741 | | |
| GSM8K | 0.57998 | 0.5663 | 0.5602 | 0.5822 | 0.57164518 | 0.57240 |
| SVAMP | 0.708 | 0.694 | 0.713 | 0.721 | 0.731 | |
| ASDiv | 0.727632 | 0.7266 | 0.724 | 0.7246 | 0.70785 | |
| GSM8K t=0.8 | 0.5663381 | 0.534495 | 0.54359 | 0.55648 | | |
-
| | NAT2 paper | NAT2 | NAT3 op | NAT3 | NAT3 chat-gpt | NAT3 op ep3 | NAT3 ep3 | NAT3 chatgpt ep3 |
| --------- | ---------- | ------ | -------- | ------ | ------------- | ----------- | ---------- | ---------------- |
| GSM train | | | 0.6693 | 0.6455 | 0.6385 | 0.7346 | 0.66813 | 0.669878 |
| GSM8K | 0.490 | 0.5026 | 0.57998 | 0.5663 | 0.5602 | 0.5943 | 0.5648 | 0.5739 |
| SVAMP | 0.644 | N/A | 0.708 | 0.694 | 0.713 | 0.649 | 0.691 | 0.72 |
| ASDiv | 0.6866 | N/A | 0.727632 | 0.7266 | 0.724 | 0.727 | 0.730 | 0.7123 |
- NAT2
- train
- 
# 6/13
- 1k
| | NAT2 paper | NAT2 | NAT3 op | NAT3 | NAT3 chat-gpt |
| ----------------------- | ---------- | -------- | -------- | -------- | ------------- |
| normal prompt | 0.490 | 0.44730 | 0.4374 | 0.46322 | 0.51250 |
| detailed action prompt | 0.476 | 0.4503 | 0.4344 | 0.47687 | 0.50796 |
| explain observation | 0.47611 | 0.44882 | 0.4336 | 0.4768 | 0.5216 |
| explain observation2 | 0.475360 | 0.433661 | 0.432145 | 0.482941 | 0.509476 |
| explain observation3 | 0.478392 | N/A | 0.435936 | 0.48976 | 0.50037 |
| more observation + obs3 | 0.48066 | 0.45261 | 0.42987 | 0.5072 | 0.5307 |
- detailed action prompt:
- You can trust the result of the calculator. The calculation can not solve equation
- The answer should not be an expression. The answer should be a number.
- explain observation
- Thought can reason about the current situation **including explaining the meaning of the observation and the next step**
- explain observation 2
- explaining the meaning or the error of the observation.
- explain observation 3
- Thought can reason about the current situation **including explaining the meaning of the observation**.
- NAT2 loss
- 
- NAT3 chatgpt3 loss
- 
- LAMMA 2
- 
- 
# 6/6
- questions:
- test:
- Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
- TokenError: ('EOF in multi-line statement', (2, 0)) [/INST]
- "Action: calculate[32+1]........" => calculate 32+1]......
- or "Action: finish[4] </s> The amount of water left is 4 liters."
- fixed
- train:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
- LlamaTokenizer -> AutoTokenizer (see the tokenizer sketch after this list)
- train data
- 
- llama3:
- old: "Thought:xxxx.\nAction: calculate[16 - 3 - 4]"
- llama3:
- "calculate[(16 - 3 - 4) * 2] </s> </s>>s>s>s>s>s>s>s"
- "Thought: To find the total number of bolts needed, I can start by calculating the amount of white fiber required, which is half the amount of blue fiber needed. Then, I can add the two amounts together to find the total number of bolts.\nAction: calculate[2/2] </s>>s> </s>s> </s> Thought: Now, I can calculate the total number of bolts needed by adding the blue and white fiber together.\nAction: calculate[2 + 1] </s> </s> </s> </s>"
- ideas
- better compile error?
- explain that the usage of the tool is incorrect
- [32hrs * 3] = 96hrs
- error -> hrs -> [32 * 3]
- ask the model to explain the meaning of the observation and if the observation is as expected
- e.g., the obs 32 is the total number of cats
- error handling
- reask the subgoal?
- often the calculation is not the subgoal
- avoid
- "The observation is incorrect. The correct calculation should be:"
- "It seems there was a mistake in the calculation. Let's correct it."
- "I will finish the task by providing the final answer.\n\nAction: finish[7560] </s> The profit is $7560."
# 5/10
- https://github.com/THUDM/AgentTuning
- [Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents](https://arxiv.org/pdf/2402.11651)
- 
- 
- different from our discussion
- they didn't learn the environment
- they didn't learn the judgement
- given a bad action, predict why it is wrong
- the answer can be obtained by providing the correct answer
- they didn't use it as an agent
- guide the agent back to correct trajectories
- e.g., after a wrong format, resend a correct format after getting the obs
- the label only contains correct or wrong
- we should have more information
- [From r to Q∗: Your Language Model is Secretly a Q-Function](https://arxiv.org/pdf/2404.12358)
- reunderstand DPO
- [REBEL: Reinforcement Learning via Regressing Relative Rewards](https://arxiv.org/pdf/2404.16767)
- 
- [In-Context Principle Learning from Mistakes](https://arxiv.org/pdf/2402.05403)
- few-shot prompting
- low level: direct explanation
- high level: 5 key bullet points
- append the low-level or high-level principles into the prompt
- 
- todo:
- Star-gate: Teaching language models to ask clarifying questions
# 5/3
- how to use neg examples
- predict the environment reaction
- predict why the given trajectory is wrong
- ask the agent to act like bad agents, predict the trajectory
- believes in something that is false
- wrong way to use the tools
- wasting interactions
## credit assignment
- https://arxiv.org/pdf/2404.12358
- https://arxiv.org/abs/2404.16767
## base benchmark
- [AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning](https://arxiv.org/abs/2402.15506v2)
- 
- 
- [Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models](https://arxiv.org/pdf/2403.12881.pdf)
- 
- Then we prompt the gpt-3.5-turbo to directly reply to these queries without providing any tool information. To construct normal conversation with the tool information provided, we randomly append half of the queries with irrelevant tools,
- 
- [Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
](https://arxiv.org/pdf/2401.01335.pdf)
- 
- [AgentTuning: Enabling Generalized Agent Abilities for LLMs](https://arxiv.org/pdf/2310.12823.pdf)
- 
## without training
- [Reflexion: Language Agents with Verbal Reinforcement Learning](https://proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90-Paper-Conference.pdf)
- 
- [Training Language Model Agents without Modifying Language Models](https://arxiv.org/abs/2402.11359)
## Multiagent
- [Improving Factuality and Reasoning in Language Models through Multiagent Debate](https://arxiv.org/pdf/2305.14325.pdf)
- 
## application
- Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation
## video
- [VideoAgent: Long-form Video Understanding with Large Language Model as Agent](https://arxiv.org/abs/2403.10517)
- 
## games
- [EUREKA: HUMAN-LEVEL REWARD DESIGN VIA CODING LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2310.12931.pdf)
- [Large Language Models can Implement Policy Iteration](https://arxiv.org/abs/2210.03821v2)
- [AGENTBENCH: EVALUATING LLMS AS AGENTS](https://arxiv.org/pdf/2308.03688.pdf)
- 
## some basic methods
-
## ideas
- The importance of RL
- prevents accumulated errors
- 
- use low score data during
- figure out why it is wrong
- compare the different good and bad examples
- break down each part
- connect the bad examples to negative feedback (words)
- add more information to instructions.
- describe bad example and ask it to predict bad example?
- focus on multi-step problems?
- try to train a model that can guide itself back into distribution
- the agent cmd should be accurate
- learn with examples
- learn with interactions with environments (RL)
-
- 