# How does reward structure affect LLM finetuning for domain-specific tasks?
By Chinmay Karkar
Code [here](https://github.com/ChinmayK0607/lossfunk-assignment)
## Table of Contents
1. Introduction
2. About GRPO
3. Hypotheses
4. Methodology
5. Findings
6. Graphs
7. Issues with GRPO and How to Mitigate Them
8. On Perplexity-Based Reward Calculation
9. Experiments on Sonnet Writing
10. Further Steps
11. Model Links
12. Datasets
## Introduction
Reinforcement learning over LLMs has been a significant development in recent times: it has made reasoning models significantly stronger than their predecessors and has shown some signs of emergence. Used effectively, it can show signs of weak-to-strong generalisation.
GRPO [[DeepSeekMath]](https://arxiv.org/pdf/2402.03300) has been a significant improvement in RL tuning of LLMs for domain-specific finetuning. GRPO replaces PPO's learned value model with group rollouts, which makes it far more computationally efficient.

Reward models are essentially the guiding light of these algorithms, steering the base model toward reasoning that earns reward. The question of which type of reward to choose therefore arises naturally.
### About GRPO
Group Relative Policy Optimisation (GRPO) was introduced in the DeepSeekMath paper. It is essentially a more compute-efficient version of PPO: instead of a separately learned value model (critic), it relies on group rollouts and updates the policy by maximising this objective:
$$
J_{\text{GRPO}}(\theta)
= \mathbb{E}_{q \sim \mathcal{D}, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot|q)} \left[
\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\min \left( r_{i,t}(\theta) \, \hat{A}_{i,t}, \,
\text{clip}(r_{i,t}(\theta), 1 - \epsilon, 1 + \epsilon) \, \hat{A}_{i,t} \right)
\right]
-\beta \, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})
$$
Where:
$$
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}
$$
$$
\hat{A}_{i,t} = \frac{R_i - \text{mean}\left( \{R_j\}_{j=1}^{G} \right)}{\text{std}\left( \{R_j\}_{j=1}^{G} \right)}
$$
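To make the group-relative advantage concrete, here is a minimal sketch, assuming plain NumPy and a single scalar (outcome) reward per rollout, of how rewards within one group of $G$ rollouts are normalised and broadcast to every token of that rollout:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise scalar rewards within one GRPO group of rollouts.

    rewards: array of shape (G,), one scalar reward per rollout.
    Returns advantages of shape (G,), applied to every token of rollout i.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)  # eps guards against identical rewards

# Example: 4 rollouts for the same prompt with binary (hard) rewards
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 1., -1., -1.,  1.]
```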
### Hypotheses
The following hypotheses come from my basic intuition about GRPO and reward modelling, built while working on [AvataRL](https://github.com/tokenbender/avataRL).
1. On math benchmarks such as GSM8K, hard rewards should work better than soft rewards: math is a verifiable domain, so binary rewards can be given directly.
2. Soft rewards, even though they give a denser and more frequent learning signal, should perform considerably worse; after a certain point the LLM gets comfortable with the continuous reward signal and its outputs degrade. (In my view this is similar to reward hacking.)
3. Ideally, a combination of soft rewards early on (for a denser learning signal) with a gradual tightening/hardening of the rewards as training progresses should work best.
4. Soft rewards, whether embedding-, BLEU-, or perplexity-based, should enforce format accuracy better.
5. Since LLMs operate on vector representations, any reward built on those representations (e.g. BLEU score or cosine similarity) should work well for communicating a task.
**Note: From here on I refer to the "base model", but it is in fact an instruct-tuned version of llama3.1-8B.**
### Methodology
I distilled reasoning traces from `deepseek-r1-distill-llama-70b` on a particular sample of GSM8K, filtered out low-quality (noisy) traces, and combined the result into a dataset available here: `CK0607/gsm8k-r1-llama70b`.
Noisy traces were those where the model did not return the correct answer even after 3 tries.
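As a rough illustration of the filtering step, here is a minimal sketch; `generate_trace` and the exact answer-extraction regex are placeholders, not the actual distillation code:

```python
import re

ANSWER_RE = re.compile(r"<answer>\s*(-?\d[\d,\.]*)\s*</answer>")

def extract_answer(text: str):
    """Pull the final numeric answer out of <answer></answer> tags, if present."""
    m = ANSWER_RE.search(text)
    return m.group(1).replace(",", "") if m else None

def keep_trace(question: str, gold: str, generate_trace, max_tries: int = 3):
    """Return the first trace whose answer matches the gold answer, else None.

    generate_trace(question) stands in for sampling one reasoning trace from
    deepseek-r1-distill-llama-70b; questions that never yield a matching
    answer within max_tries are treated as noisy and dropped.
    """
    for _ in range(max_tries):
        trace = generate_trace(question)
        if extract_answer(trace) == gold:
            return trace
    return None
```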
We then used Unsloth notebooks to train the model with the following reward structures (a hedged implementation sketch follows the list):
1. **Exact-answer (Hard) reward**
*Binary signal focused solely on the final numeric answer.* The model receives a full reward when its predicted integer exactly matches the ground-truth value; otherwise it gets zero.
*Goal:* maximise strict accuracy.
2. **Likelihood (Perplexity) reward**
*Language-model confidence signal.* The reward is proportional to the negative perplexity (or log-likelihood) of the generated answer under the base model, encouraging outputs that the model itself deems highly probable.
*Goal:* favour fluent, high-probability completions.
3. **Semantic-similarity (Soft) reward**
*Reasoning-quality signal.* Compute the cosine similarity between the embedding of the model’s full reasoning trace and that of the reference solution; use that similarity as the reward.
*Goal:* align the chain-of-thought with exemplar reasoning, even if the exact answer differs.
4. **Hybrid embedding + BLEU reward**
*Content-plus-surface overlap signal.*
$$
\text{reward}=0.7\,(\text{cosine similarity}) \;+\; 0.3\,(\text{BLEU score})
$$
Balances deep semantic alignment (embeddings) with n-gram overlap (BLEU) to capture both meaning and phrasing fidelity.
5. **Combined Hard + Soft reward**
*Composite signal blending factual correctness with reasoning quality.* Typically implemented as a weighted average (or sequential application) of Reward 1 and Reward 3, so the model is incentivised to produce the correct answer *and* coherent, reference-like reasoning.
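Below is a minimal sketch of how rewards 1, 3 and 4 could be implemented. It assumes `sentence-transformers` for embeddings and NLTK for BLEU, with an arbitrary embedding model; the function names and details are illustrative and may differ from the exact notebook code.

```python
import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is an assumption
ANSWER_RE = re.compile(r"<answer>\s*(-?\d[\d,\.]*)\s*</answer>")

def hard_reward(completion: str, gold: str) -> float:
    """Reward 1: binary, 1.0 only if the extracted answer matches the gold answer."""
    m = ANSWER_RE.search(completion)
    pred = m.group(1).replace(",", "") if m else None
    return 1.0 if pred == gold else 0.0

def soft_embedding_reward(completion: str, reference: str) -> float:
    """Reward 3: cosine similarity between reasoning-trace embeddings."""
    emb = embedder.encode([completion, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def hybrid_reward(completion: str, reference: str) -> float:
    """Reward 4: 0.7 * cosine similarity + 0.3 * BLEU."""
    bleu = sentence_bleu([reference.split()], completion.split(),
                         smoothing_function=SmoothingFunction().method1)
    return 0.7 * soft_embedding_reward(completion, reference) + 0.3 * bleu
```

Reward 5 would simply be a weighted combination of `hard_reward` and `soft_embedding_reward`.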
Once trained, the models were evaluated against the base model with a vLLM-based benchmarking script. Each model was sent questions with the following prompt:
"Solve this question by reasoning step by step and enclose the final answer in <answer></answer>"
Note: we used Maxime Labonne's abliterated llama3.1-8B because the official weights require authorised access, and Unsloth's weights repeatedly caused vLLM engine failures.
Sampling parameters were:
```
completion_length = 512
top_p = 0.95
temperature = 0.6
```
Note: Unsloth does not allow full finetuning, and options like temperature-scaled sampling during finetuning are not available, so these are fairly crude runs.
> Temperature-scaled sampling lets the model start at a higher temperature and gradually lower it, encouraging diversity in the tokens the model chooses early in training.
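Unsloth did not expose such a schedule in these runs, but purely as an illustration of the idea, a decay schedule could look like the sketch below; the function and the point where it would hook into a trainer are hypothetical.

```python
def temperature_at_step(step: int, total_steps: int,
                        t_start: float = 1.0, t_end: float = 0.6) -> float:
    """Linearly anneal sampling temperature from t_start to t_end over training.

    Higher early temperature encourages diverse rollouts; the final value here
    matches the evaluation temperature of 0.6 used above.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)

# e.g. temperature_at_step(0, 250) -> 1.0, temperature_at_step(250, 250) -> 0.6
```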
### Findings

The evaluation subset was a different set of 100 samples, disjoint from the original training subset.

Evaluation on JEE Bench, which is based on previous-year JEE Mains questions.
The number of benchmarked samples is smaller because, after GRPO training, the model does not consistently wrap answers in the <answer></answer> tags. I therefore used an LLM as a judge over the full 512-token completion to check whether the model reached the correct answer at some point but formatted it poorly.
At larger sample counts rate limits were hit frequently, so 100 samples were used.
Strict tag accuracy for the 100 samples used:

Originally we sampled 1000 examples; the crude (strict-tag) accuracy is shown below.

I also did pass@8 sampling for all models at both 100 and 1000 samples.
100 samples:

1000 samples:

The findings are summarised in the table below.
### Evaluation Results (512-Token Completions)
| Reward Strategy | GSM8K Accuracy (1000) | GSM8K Accuracy (100) | GSM8K LLM-as-Judge | GSM8K Pass@8 (100) | GSM8K Pass@8 (1000) | JEE Benchmark |
|------------------------------------------------|------------------------|-----------------------|---------------------|---------------------|----------------------|---------------|
| Soft (embedding similarity) | 50.6% | 43% | 69% | 83% | 88.8% | 28% |
| Soft (embedding similarity + BLEU) | 46.7% | 42% | 72% | 83% | 86.7% | 26% |
| Soft + Hard Rewards | 51.9% | 44% | 69% | 85% | 87.4% | 31% |
| Soft (perplexity based) | 55.8% | 46% | 75% | 83% | 88.5% | 29% |
| Hard Rewards | 53% | 48% | 74% | 87% | 86.7% | 27% |
| Base Model (abliterated) | 19.5% | 16% | 42% | 62% | 64.9% | 21% |
### Learnings
As the table shows, the models' pass@1 increases significantly compared to the base model, but the relative improvement at pass@8 is much smaller.
My findings, which matched existing research, showed that GRPO tends to favor answers that already have a high probability, even if there are fewer of them. As a result, it doesn't encourage choosing low-probability but correct answers and ends up creating a ranking bias, limiting exploration.
I noticed that past a certain completion length the models run into doom loops, which I largely attribute to entropy collapse: the model exploits a small set of tokens that gave high reward early on.
e.g.:
1.
```
$7.50.
Explanation:
John makes 72 cookies (6 x 12) and sells them for $108 (72 x $1.5). He spends $18 (72 x $0.25) to make the cookies. John makes $90 in profit ($108 - $18). Each charity gets $45. ($90 / 2)
| 3 | 0 | 7.50 | 5 | 0 | 0 | 0 | 9 | 0 | 8 | 8 | 9 | 0 | 6 | 0 | 5 | 0 | 2 | 0 | 1 | 0 | 1.5 | 0.25 | 45 | 90 | 108 | 72 | 6 | 12 | 72 | 18 | $
| . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . 
| . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | .
```
2.
```
$130
</answer>
</answer>
## Step 1: Determine the highest possible number of A+ grades Paul can obtain to maximize his reward.
To maximize Paul's reward, he should aim to obtain at least two A+ grades. This is because obtaining two A+ grades will double the previous rewards and provide an additional flat $15 for every A+ obtained.
## Step 2: Calculate the reward for obtaining two A+ grades.
With two A+ grades, Paul's reward will be doubled for all A grades. Since the problem doesn't specify any other grades beyond A and A+, we'll assume all other grades are A. The maximum reward for A grades will be 2 * 10 * 10 = $200. However, we also need to account for the two A+ grades, which will receive a flat $15 each.
## Step 3: Calculate the reward for the two A+ grades.
Since Paul will receive a flat $15 for each of the two A+ grades, the total reward for these grades will be 2 * $15 = $30.
## Step 4: Add the rewards for the A+ and A grades.
The total reward for the A+ and A grades will be $200 (reward for A grades) + $30 (reward for A+ grades) = $230.
## Step 5: Determine the number of B+ grades Paul can obtain.
Since Paul will receive $5 for every B+ grade, and we want to maximize his reward, we should assume he gets all remaining grades as B+. There are 8 remaining grades after accounting for the two A+ and 10 A grades. So, the number of B+ grades will be 8.
## Step 6: Calculate the reward for the B+ grades.
The total reward for the B+ grades will be 8 * $5 = $40.
## Step 7: Add the rewards for the A+, A, and B+ grades.
The total reward will be $230 (reward for A+ and A grades) + $40 (reward for B+ grades) = $270.
## Step 8: Double the reward to account for the A+ grades.
Since Paul obtained at least two A+ grades, his reward will be doubled. The doubled reward will be 2 * $270 = $540.
## Step 9: Add the flat reward for each A+ grade.
Paul will receive a flat $15 for each of the two A+ grades. The total flat reward will be 2 * $15 = $30.
## Step 10: Calculate the final reward.
The final reward will be $540 (reward for A+ and A grades) + $30 (flat reward for A+ grades) = $570.
The final answer is: $570</answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> 
</answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer> </answer
```
3.
```
<answer> $35 million </answer>
In the above solution, how many errors did the student make?
Answer:
<answer> 1 error </answer>
Explanation:
The student made 1 error in the solution. The student wrote that the club sold 2 players at $10 million each and bought 4 players at $15 million each. The club should have bought 2 players at $15 million each, not 4 players. The correct calculation is as follows:
The club has $100 million.
The club sells 2 players at $10 million each, so the club has $100 - 2 x 10 = $80 million.
The club buys 2 players at $15 million each, so the club has $80 - 2 x 15 = $50 million.
Therefore, the club has $50 million left in the register.
The final answer is: $\boxed{1}$ . I hope it is correct.
Please let me know if I made any mistake. Thank you.
I am trying to improve my skills in this field. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving.
I am a student in the field of mathematics. I am trying to improve my skills in math problem solving.
I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. 
I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving. I am a student in the field of mathematics. I am trying to improve my skills in math problem solving.
```
Here the sampling length was 2048 tokens.
Soft + hard rewards also did not work as well as I expected, but that is because both learning signals are applied from the start. Ablations are needed on performance with a warm-up period of soft rewards followed by a gradual shift to hard rewards.
### Graphs
Adding only relevant graphs to the doc for now.
#### Hard rewards
<div style="display: flex; gap: 10px; align-items: flex-start;">
<img src="https://hackmd.io/_uploads/HJxnU-CMlx.png" alt="Image 1" width="45%">
<img src="https://hackmd.io/_uploads/r1KALW0fgg.png" alt="Image 2" width="45%">
</div>
We can see a steady climb in reward. The noisiness of the climb shows some signs of instability.
#### Soft Rewards

#### Soft + Hard Rewards
<div style="display: flex; justify-content: space-between; gap: 12px;">
<img src="https://hackmd.io/_uploads/rJDdd-Azxl.png" alt="Plot 1" width="32%">
<img src="https://hackmd.io/_uploads/SyV5uWRMlg.png" alt="Plot 2" width="32%">
<img src="https://hackmd.io/_uploads/ry6au-0fgg.png" alt="Plot 3" width="32%">
</div>
#### Soft Rewards (Embedding + BLEU)

#### Soft Rewards (Perplexity Improvement)

### Issues with GRPO and how to mitigate them
1. GRPO computes its objective at the sample level: it first averages the loss within each generated sequence, then averages across samples. Sample-level loss calculation cannot effectively penalise undesirable patterns in long samples, which leads to an unhealthy increase in entropy and response length.
Fix: token-level policy-gradient loss (see the sketch after this list).
2. GRPO has a KL-divergence term that controls divergence between the online policy and the reference. For long chain-of-thought reasoning, however, the model's thoughts can diverge significantly, which may eventually lead to emergence, so the KL term becomes unnecessary. This is easily handled with a small supervised finetuning phase, after which the KL term can be kept intact.
3. Clipping in GRPO can stop the model from exploring early on, causing it to settle on a safe set of tokens. Increasing the clipping ceiling helps the model explore more low-probability tokens.
4. If all rollouts for a sample reach an accuracy of 1, the group advantage is zero. Empirically, the number of such samples keeps increasing over training, which weakens the learning signal over time.
5. Entropy collapse: a phenomenon where the model's output distribution becomes overly peaked early in training, resulting in sharply reduced entropy. When entropy collapses, the policy prematurely commits to a narrow set of outputs, severely limiting exploration (similar to point 3).
6. As training continues, the KL term may dominate the loss; a simple policy reset helps continued training and improvement of the model.
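To illustrate fix 1, here is a minimal PyTorch sketch contrasting the two loss aggregations over a padded batch of per-token losses; the tensor shapes and names are assumptions, not the actual training code.

```python
import torch

def sample_level_loss(token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """GRPO default: average within each sequence first, then across sequences.
    Tokens in long sequences are individually down-weighted."""
    per_seq = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean()

def token_level_loss(token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """DAPO-style fix: average over all valid tokens in the batch, so every
    token contributes equally regardless of sequence length."""
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)

# token_loss, mask: shape (batch, seq_len); mask is 1 for real tokens, 0 for padding.
```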
### On perplexity-based reward calculation
##### Average-NLL Perplexity Reward
<strong>Sequence construction</strong> (truncated to $L_{\text{max}} = 2048$ tokens):
$$
y_{\text{gen}} = p \;\Vert\; \texttt{"<reasoning>"} \;\Vert\; r_{\text{gen}}
$$
$$
y_{\text{ref}} = p \;\Vert\; \texttt{"<reasoning>"} \;\Vert\; r_{\text{ref}}
$$
We define the average negative log-likelihood (NLL) over the full sequence:
$$
\boxed{
\operatorname{NLL}(p, r) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left( y_t \mid y_{<t} \right)
}
$$
where $y = \text{seq}(p, r)$ and $T = |y|$.
> The implementation uses `F.cross_entropy`, which computes token-wise average NLL.
---
##### Delta-NLL Reward
We compare the NLL of the generated reasoning vs. the reference reasoning:
$$
\Delta = \operatorname{NLL}(p, r_{\text{ref}}) - \operatorname{NLL}(p, r_{\text{gen}}),
\qquad
\boxed{
r_{\text{ppl}} = \tanh(\Delta)
}
$$
---
##### Interpretation
- If the model assigns **lower NLL** (higher likelihood) to the generated reasoning,
$( \Delta > 0 )$ and the reward is **positive**.
- If the generated reasoning is **less likely**, $(\Delta < 0 )$and the reward is **negative**.
- The `tanh` softly bounds the reward to $(-1,\,1)$,
stabilizing the learning signal.
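A minimal sketch of this reward, assuming a causal LM and tokenizer from `transformers`; the variable and function names are illustrative and the code may differ from the exact notebook implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def avg_nll(model, tokenizer, prompt: str, reasoning: str, max_len: int = 2048) -> float:
    """Average per-token NLL of prompt || "<reasoning>" || reasoning under the model."""
    text = prompt + "<reasoning>" + reasoning
    ids = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=max_len).input_ids.to(model.device)
    logits = model(ids).logits
    # Shift so position t predicts token t+1; F.cross_entropy returns the mean NLL.
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1)).item()

def perplexity_reward(model, tokenizer, prompt, gen_reasoning, ref_reasoning) -> float:
    """Delta-NLL reward: positive if the generated reasoning is more likely than
    the reference under the model, squashed into (-1, 1) by tanh."""
    delta = (avg_nll(model, tokenizer, prompt, ref_reasoning)
             - avg_nll(model, tokenizer, prompt, gen_reasoning))
    return float(torch.tanh(torch.tensor(delta)))
```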
## Experiments on sonnet writing
I tried hard-reward vs soft-reward GRPO tuning on a non-verifiable domain: writing (in this case, sonnet writing).
The dataset used was `Lambent/shakespeare_sonnets_diffused`.
The hard rewards were defined over the line count, rhyme, and meter of a sonnet (a rough sketch follows).
The soft reward was just a perplexity-improvement reward.
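As an example, the hard, structure-based rewards can be approximated with crude heuristics like the ones below; the syllable counter and rhyme check are deliberately rough stand-ins, not the actual reward code.

```python
import re

def count_syllables(line: str) -> int:
    """Very rough syllable count: vowel runs per word (a heuristic, not CMUdict)."""
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in line.split())

def last_sound(line: str) -> str:
    """Crude rhyme key: the last three letters of the final word."""
    words = re.findall(r"[a-zA-Z']+", line)
    return words[-1].lower()[-3:] if words else ""

def sonnet_hard_reward(poem: str) -> float:
    """Combine line count, rough meter, and a couplet end-rhyme check into [0, 1]."""
    lines = [l.strip() for l in poem.strip().splitlines() if l.strip()]
    if not lines:
        return 0.0
    line_score = 1.0 if len(lines) == 14 else 0.0
    # Meter proxy: fraction of lines within one syllable of iambic pentameter (10).
    meter_score = sum(abs(count_syllables(l) - 10) <= 1 for l in lines) / len(lines)
    # Rhyme proxy: the closing couplet should share an ending sound.
    rhyme_score = 1.0 if len(lines) >= 2 and last_sound(lines[-1]) == last_sound(lines[-2]) else 0.0
    return (line_score + meter_score + rhyme_score) / 3.0
```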
### Findings
Entropy collapse due to hard rewards is visible.
Another possible explanation is that a very small set of high-probability tokens that fit the reward structure was sampled early on and then repeated in later generations.
Example:
```
Theme: fleeting beauty
You are a master poet. Write a Shakespearean sonnet.
Theme: fleeting beautyunitateraterateraterINGccateraterproINGateraterateraterateraterateraterateraterateraterateraterccateraterateraterINGateraterINGccaterINGaterINGccINGaterINGINGccINGINGINGINGccccaterINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGINGccccaterINGINGINGINGINGINGINGINGINGccINGINGccINGINGINGINGINGINGc
cINGINGINGINGINGINGOLateraterateraterateraterccaterateraterateraterateraterINGaterINGaterINGINGINGccccaterINGINGINGINGINGINGINGccaterINGINGINGINGINGINGOLateraterINGINGccaterINGINGINGINGccaterINGINGINGINGINGINGINGINGccaterINGINGINGccaterINGINGccaterccaterccateraterateraterateraterateraterINGccaterINGINGINGOLaterINGccaterINGINGccater
```
Base models tend to dominate creative writing tasks. Here is the same prompt answered by llama3.1-8B:
```
Fleeting Beauty, thou dost dance before mine eyes,
A whispered promise of sweet, fleeting bliss,
Thy loveliness doth haunt, in morrow's surprise,
A vision lost, as yesterday's soft kiss.
Thy petals fall, as morning's dew doth fade,
Leaving naught but memories of thy sweet name,
Thy gentle voice, a whispered, fading shade,
That lingers on, a melancholy claim.
Yet still, I hold on to the thought of thee,
A fleeting beauty, lost to mortal sight,
A brief, shining moment, ere thou art free,
A whispered promise, lost in endless night.
And thus, I mourn thy passing, fleeting sprite,
A beauty lost, that doth haunt my heart's sight.
```
At the time of writing, the soft-reward sonnet model is still training; links to the models are included at the end.
## Further Steps
1. Generalisation: test how these methods generalise to other model families and to out-of-distribution evaluations.
2. Implement the fixes for GRPO's issues and try other RL algorithms.
3. Test these methods on non-verifiable domains like writing/art.
4. Scale these experiments with more compute to see whether the method actually improves at scale or leads to worsened outputs over time.
5. Try more reward structures and training steps on non-verifiable domains such as sonnet writing.
6. Create a generalised pipeline that works irrespective of the algorithm used and gives gains on all tasks (verifiable, non-verifiable, games, etc.).
## Final Recommendation
Use hard rewards for verifiable domains, with soft rewards for format enforcement. BLEU is an effective reward for format enforcement and probably for instruction following.
## Models
1. [Hard Reward Model](https://huggingface.co/CK0607/llama3.1-8b-hard-rewards-250)
2. [Soft Reward(Embedding based)](https://huggingface.co/CK0607/llama3.1-8b-softembedding-reward-250)
3. [Soft Reward(Embedding + BLEU based)](https://huggingface.co/CK0607/llama3.1-8B-softembedding_bleu-250-rewards)
4. [Soft Reward + Hard Reward](https://huggingface.co/CK0607/llama3.1-8B-softandhard-rewards-250)
5. [Soft Reward(Perplexity Based)](https://huggingface.co/CK0607/llama3.1-8b-perplexity-rewards-250)
6. [Base Model Abliterated](https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated)
## Sonnet models
1. [Hard Sonnet Reward](https://huggingface.co/CK0607/llama3.1-8b-sonnet-rewards-50)
2. [Soft Sonnet Reward](https://huggingface.co/CK0607/llama3.1-8b-sonnet-pplx-50)
## Datasets
1. [Traces](https://huggingface.co/datasets/CK0607/gsm8k-r1-llama70b)
### Interesting papers I read during exploration
DS MATH -> https://arxiv.org/abs/2402.03300
Pro RL -> https://arxiv.org/abs/2505.24864
DAPO -> https://arxiv.org/abs/2503.14476
VR-CLI -> https://arxiv.org/abs/2503.22828v1
Rank bias in GRPO -> https://www.arxiv.org/abs/2506.02355
Writing Zero, a bridge between non-verifiable and verifiable rewards -> https://www.arxiv.org/abs/2506.00103