# GRPO Training to Generate Circuits from Language

Code generation tasks are very popular in LLM training. The idea behind circuit generation is that if we can represent circuits as code, we might be able to train language models to generate circuits and move towards a Copilot/Cursor-like application for electrical/electronic circuit design. After some searching, I found SPICE netlists to be the best option for representing circuits as code.

## Methodology and Rewards

I selected Qwen2.5-0.5B as the base model and used LoRA for faster fine-tuning. The circuits are generated as **SPICE netlists**: code/text-like representations of circuits. They cannot always be converted directly into graphical schematics and may require specific tools to do so. [More about SPICE](https://www.multisim.com/help/simulation/spice-netlist/#:~:text=A%20SPICE%20netlist%20is%20a,continuation%20from%20the%20previous%20line.)

#### Dataset Generation

The data for this experiment was obtained from this [Github](https://github.com/symbench/spice-datasets), and *Gemma3-27b-it* (via the Gemini API) was used to create prompts for it after a bit of cleaning. The full, cleaned dataset is available on [Huggingface](https://huggingface.co/datasets/Ashed00/SPICE-Circuits).

#### Rewards

The following 3 rewards were used and evaluated. The current goal was to generate syntactically correct netlists.

1. **Hard Rewards**: +1 if the netlist is syntactically correct, -1 if not.
2. **Trained Reward Model**: A DistilBERT model trained for reward assignment. For this, I wrote code to inject syntax errors into the data and then trained the model for binary classification. The reward is the model's 0-1 prediction score.
3. **SBERT Similarity Score**: Cosine similarity between the generated and the correct netlist.

For future training, evaluation of the netlists' simulation results should be added too. This was not included in the current training since implementing result evaluation is difficult due to the huge variation across SPICE netlist libraries, many of which are closed source.
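To make the reward definitions concrete, below is a minimal sketch of the hard and similarity rewards, written in the style of TRL `GRPOTrainer` reward functions (taking `completions` plus dataset columns as keyword arguments and returning a list of floats). The syntax check is a crude heuristic stand-in, not the actual checker used in training, and the `reference` column name and SBERT checkpoint are illustrative assumptions.

```python
import re
from sentence_transformers import SentenceTransformer, util

# Stand-in for a real SPICE syntax checker; a proper parser would be used in practice.
def is_valid_netlist(netlist: str) -> bool:
    lines = [l.strip() for l in netlist.strip().splitlines() if l.strip()]
    if len(lines) < 2 or lines[-1].lower() != ".end":
        return False
    # After the title line, every line must be a comment (*), a control card (.),
    # a continuation (+), or a component line: device letter + at least two fields.
    component = re.compile(r"^[RCLVIDQMEFGHKXBSWTJUZ]\w*(\s+\S+){2,}$", re.IGNORECASE)
    return all(l.startswith(("*", ".", "+")) or component.match(l) for l in lines[1:-1])

def hard_reward(completions, **kwargs):
    """+1 for a syntactically valid netlist, -1 otherwise."""
    return [1.0 if is_valid_netlist(c) else -1.0 for c in completions]

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint

def similarity_reward(completions, reference, **kwargs):
    """Cosine similarity between generated and reference netlist embeddings."""
    gen_emb = sbert.encode(completions, convert_to_tensor=True)
    ref_emb = sbert.encode(reference, convert_to_tensor=True)
    # Diagonal of the pairwise similarity matrix gives the per-sample score.
    return util.cos_sim(gen_emb, ref_emb).diagonal().tolist()
```

If using TRL's `GRPOTrainer`, functions like these can be passed through its `reward_funcs` argument.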
## Initial Hypothesis

Since this task is related to code generation, I started with the hypothesis that hard rewards would provide much better performance than soft rewards, as seen in related literature ([CodeRL](https://arxiv.org/pdf/2207.01780)).

## Results

#### Base Model: `Qwen2.5-0.5B`
- **Pass@1**: 11.82%
- **Pass@5**: 57.64%

---

#### GRPO with **Hard Rewards**

**Checkpoint: 100 steps**
- **Pass@1**: 4.43%
- **Pass@5**: 52.71%

**Checkpoint: 500 steps**
- **Pass@1**: 5.91%
- **Pass@5**: 54.19%

---

#### GRPO with **Reward Model**

**Checkpoint: 100 steps**
- **Pass@1**: 4.43%
- **Pass@5**: 53.69%

**Checkpoint: 500 steps**
- **Pass@1**: 5.91%
- **Pass@5**: 54.19%

---

#### GRPO with **Similarity Score Reward**

**Checkpoint: 100 steps**
- **Pass@1**: 4.43%
- **Pass@5**: 54.19%

**Checkpoint: 500 steps**
- **Pass@1**: 5.91%
- **Pass@5**: 51.23%

![plt](https://hackmd.io/_uploads/rJYFqScIxx.png)
![plt2](https://hackmd.io/_uploads/B1koqwcIgl.png)

---

## Analysis

**Result-1**: The first noticeable result is that the fine-tuned models perform worse than the base model.
- I believe this is due to the limited amount of training after adding the LoRA adapters, which are randomly initialized.
- Due to limited data, compute, and time, I did not perform SFT on the LoRA model. I believe an SFT stage before the RL would lead to better performance, as I saw in one of my previous experiments ([SmolMath](https://hackmd.io/@ashu-00/SmolMath)), where the initial performance was similarly low.
- I also think that training on more data and for more GRPO steps would have led to performance gains over time for some reward signals, as one can see from the improving pass@1 and pass@5 trends.

**Result-2**: The second noticeable result is that the three rewards perform similarly on pass@1 (greedy decoding) but show different trends on pass@5.
- One intuition could be that the three reward signals are quite close to each other and therefore provide similar improvements.
- The pass@5 trend of the similarity-score reward shows a large initial improvement, but performance slowly degrades as training continues.
- The pass@5 trends of the hard reward and the reward model show improvement as the training steps increase. The continuous rewards from the reward model helped early training more than the hard rewards, while the hard rewards start to provide a larger performance boost in the later stage of training.

**Final Recommendation**: A hybrid reward approach would be best for this kind of training: a weighted combination of the 3 reward functions, with the weights changing over the training steps.

$$
R = \alpha \cdot r_{\text{sim}} + \beta \cdot r_{\text{hard}} + \gamma \cdot r_{\text{int}}
$$

Where:
- $r_{\text{sim}}$: similarity-based reward
- $r_{\text{hard}}$: hard reward (final answer correct/incorrect)
- $r_{\text{int}}$: intermediate (reward-model) reward
- $\alpha, \beta, \gamma \in \mathbb{R}$: respective weighting coefficients
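As an illustration of this recommendation (not something run in these experiments), here is a minimal sketch of a scheduled hybrid reward. It reuses the `hard_reward` and `similarity_reward` sketches above, assumes a hypothetical `reward_model_score` helper wrapping the DistilBERT reward model, and the linear weight schedule (dense similarity signal early, sparse hard reward late) is purely an assumption motivated by the pass@5 trends.

```python
def hybrid_reward(completions, reference, step, max_steps, **kwargs):
    """Weighted mix of the three rewards, with weights annealed over training."""
    t = step / max_steps                     # training progress in [0, 1]
    alpha = 0.6 * (1.0 - t)                  # similarity weight decays over time
    beta = 0.2 + 0.6 * t                     # hard-reward weight grows over time
    gamma = 1.0 - alpha - beta               # remainder goes to the reward model

    r_sim = similarity_reward(completions, reference)
    r_hard = hard_reward(completions)
    r_int = reward_model_score(completions)  # hypothetical DistilBERT scorer, outputs in [0, 1]

    return [alpha * s + beta * h + gamma * i
            for s, h, i in zip(r_sim, r_hard, r_int)]
```

In practice the weights (and the scale of each reward term) would need tuning, and the schedule would have to be wired into the training loop, since standard reward-function signatures do not receive the step count by default.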
___

## Other Questions

**Q: Why did you pick this particular project?**

A:
1. While RL in NLP has been worked on for a long time, it still feels like an emerging topic, with new innovations bringing huge improvements and challenges. I was also getting into the basics of RL and its mathematical derivations, and this felt like a great project to explore.
2. As an ECE student, this topic lies at the intersection of my studies and interests. Such topics are rarely picked up because each side is not very aware of the other. Also, while closed-source models are good at such tasks, most open-source, locally deployable models do not perform well on them, unlike code-generation models, which have taken off. So I picked this topic.

**Q: If you had more compute/time, what would you have done?**

A: I would start by collecting more and better-quality data, mostly in the form of human-made circuits, as this feels like the biggest current bottleneck. For the training process, I would want to do a proper SFT before the RL, and obviously train for more steps. Hyperparameter tuning is also needed to get the best performance.

**Q: What did you learn in the project?**

A: A big part of the learning was the behavior of performance trends under different reward structures on tasks like code generation. This project also gave me the time and motivation to research RL and GRPO, along with their mathematical foundations.

**Q: What surprised you the most?**

A: Most surprising were the similar pass@1 results for all 3 reward signals. However, the pass@5 performance gave better insight into the training behavior.

**Q: If you had to write a paper on the project, what else needs to be done?**

A: A more thorough training routine should be followed, including SFT and more GRPO steps. Work should also be done on how to verify that a generated SPICE netlist is correct for a particular prompt, as this analysis is not as simple as for code generation.

___