# Building SmolMath: A Math Reasoning SLM Under 150M Parameters
## Introduction
Most recent Small Language Models (< 400M parameters) have focused largely on improving text-generation performance. While this makes sense, it leaves behind the true capability of SLMs: reasoning.
So I set out with a goal: take a really small language model that already performs well on text benchmarks, and increase its math benchmark scores as much as possible while losing as little as possible on its original benchmarks.
And yes, that means no LoRA adapters and no partially frozen weights. We train every parameter while trying not to fall into the catastrophic forgetting trap.
### Base Model
We start with [SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) as the base model. The original SmolLM2 performs quite poorly on math benchmarks, scoring 1% on 5-shot [GSM8k](https://github.com/openai/grade-school-math) and 0% accuracy on the 0-shot GSM8k test set.
In the end, [SmolMath](https://huggingface.co/Ashed00/SmolMath-135M) improved substantially on these scores, without large drops on the non-math benchmarks.
**[Comparison](#comparison-with-base-model)**
#### Related Models
1. To clarify, during my research I came across models with very high math scores, even higher than the current SmolMath (e.g., Parmanu-Ganita, ~208M). However, these were trained almost entirely for math, with no regard for performance on other text tasks, so I decided against that path.
2. Another category of math-improved models are code-generation models, which write code to solve the problems (e.g., the TinyGSM models). While this is a viable method, it risks straying too far from the original training distribution; it could be tried in future work.
## Methodology and Discussion
We started by trying supervised fine-tuning directly on GSM8k. This, unsurprisingly, led to a variety of issues.
1. **Repetitive answering**: Because of the way answers in GSM8k are formatted, the model keeps rambling until it reaches its maximum length. We tried to fix this with an explicit *end of answer* marker (appending a # END OF ANSWER line to every answer; see the formatting sketch after this list), which did not solve the issue at all.
Example-1:
```
Question: Nancy's old washing machine could only wash 9 pieces of clothing at a time. If she had to wash 19 shirts and 8 sweaters how many loads would she have to do?
Answer:8
#End of Answer.
#End of 2 statement question.
#End ofhtaking question.
#End of question frame.
#End of melodie.
#End of melodie.
#End of
```
Example-2:
```
Question: a student got twice as many sums wrong as he got right . if he attempted 54 sums in all , how many did he solve correctly ?
Options: a ) 12 , b ) 16 , c ) 18 , d ) 24 , e ) 26
Answer:
Explanation :
Let x = total sums. Then,
2x = 54
=> x = 15
Correct Option: E
####The answer is E.
####The answer is E.
####The answer is E.
####The answer is E.
####The answer is E.
####The answer is E.
####The answer is E.
####The answer is E.
####The answer is E.
####The answer is E
```
2. **Overfitting**: The model's GSM8k score peaked during this stage, across all of my experiments. But its poor performance on other math benchmarks showed that there was no real improvement, only overfitting to GSM8k.
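For concreteness, here is a minimal sketch of what this direct-SFT formatting could look like; the dataset id and field names are those of GSM8k on the Hugging Face Hub, while `format_example` and the exact marker string are illustrative:
```python
# Hypothetical sketch of the direct-SFT formatting that produced the
# failures above; dataset id and field names are those of GSM8k on the
# Hugging Face Hub, the marker is the one described in point 1.
from datasets import load_dataset

END_MARKER = "# END OF ANSWER"

def format_example(example):
    # GSM8k answers already end with "#### <number>"; appending an explicit
    # marker was an (unsuccessful) attempt to teach the model where to stop.
    return {"text": f"Question: {example['question']}\n"
                    f"Answer: {example['answer']}\n{END_MARKER}"}

train = load_dataset("openai/gsm8k", "main", split="train").map(format_example)
print(train[0]["text"])
```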
In the end, we chose curriculum training followed by GRPO (reinforcement learning). The curriculum comprised:
1. AddSub Dataset
2. Aqua-RAT (MCQ)
3. GSM8k
The GRPO training was only over GSM8k dataset.
### Curriculum Training
As mentioned, the curriculum is simple supervised fine-tuning (SFT) over a series of datasets of increasing difficulty.
We start with the [AddSub](https://huggingface.co/datasets/Ashed00/combined_math_problems) dataset to develop basic calculation ability. [Aqua-RAT](https://huggingface.co/datasets/deepmind/aqua_rat) was introduced in the middle so the model retains its ability to answer in non-numerical form (this is still an issue in the current SmolMath and can be improved in future work). This was followed by SFT on the GSM8k train set.
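Below is a minimal sketch of the staged loop, assuming TRL's `SFTTrainer`; the dataset ids are the ones linked above, while the AddSub field names, the prompt template, and the hyperparameters are assumptions rather than the exact setup:
```python
# A minimal sketch of the staged curriculum SFT, assuming TRL's SFTTrainer.
# Dataset ids are the ones linked above; the AddSub field names, the prompt
# template and the hyperparameters are assumptions, not the exact setup.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

def qa(q, a):
    return {"text": f"Question: {q}\nAnswer: {a}"}

stages = [
    # (dataset id, config, formatter producing a "text" column)
    ("Ashed00/combined_math_problems", None,
     lambda ex: qa(ex["question"], ex["answer"])),          # assumed fields
    ("deepmind/aqua_rat", "raw",
     lambda ex: qa(ex["question"] + "\nOptions: " + " , ".join(ex["options"]),
                   ex["rationale"] + "\nCorrect Option: " + ex["correct"])),
    ("openai/gsm8k", "main",
     lambda ex: qa(ex["question"], ex["answer"])),
]

for dataset_id, config, fmt in stages:
    train = load_dataset(dataset_id, config, split="train").map(fmt)
    trainer = SFTTrainer(
        model=model,
        args=SFTConfig(output_dir=f"sft-{dataset_id.split('/')[-1]}"),
        train_dataset=train,
    )
    trainer.train()
    model = trainer.model  # carry the updated weights into the next stage
```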
Through this training, we reached an accuracy of 6.22% on the GSM8k test set and 9.64% on the AddSub test set.
All training code is available on [GitHub](https://github.com/Ashu-00/SmolMath).
### GRPO Training
Just for the fun of it, we tried GRPO directly on the base model, which led nowhere. We hypothesize that, because of the base model's low initial accuracy, it rarely earns a positive reward, so the policy barely gets optimized.
Hence, we moved on to training the curriculum-finetuned model with GRPO. The most difficult part of the whole procedure was finding the right reward function.
What was tried and failed:
1. **Normal**: +1 for a correct answer and -1 for an incorrect/unformatted one.
2. **Intermediate step evaluation**: Since GSM8k formats intermediate equations in a distinctive *<<equation>>* form, we adopted this in our model as well, giving a positive reward for every correct equation.
3. **MathBERT Embeddings**: Another failed approach was to reward the cosine similarity between MathBERT embeddings of the completion and the reference solution (a sketch follows below). While the reward increased during training, the model's accuracy dropped, which suggests this signal is useless, and maybe even harmful.
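For reference, a sketch of what this reward could look like, assuming the public `tbs17/MathBERT` checkpoint and simple mean pooling; the exact pooling and pairing used in the run may differ:
```python
# A sketch of the failed embedding-similarity reward, assuming the public
# tbs17/MathBERT checkpoint and simple mean pooling; details may differ
# from the exact run.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("tbs17/MathBERT")
bert = AutoModel.from_pretrained("tbs17/MathBERT")

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt", truncation=True)
    hidden = bert(**inputs).last_hidden_state      # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)           # mean-pooled vector

def similarity_reward(completion: str, reference: str) -> float:
    # Reward = cosine similarity to the reference solution. This number
    # went up during training while accuracy went down.
    return torch.cosine_similarity(embed(completion), embed(reference), dim=0).item()
```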
After trying some variations of these methods, the following reward function is what finally worked.
**Reward function**
- +2 for a correct answer, -1 for an incorrect or wrongly formatted answer
- plus a partial-credit term for intermediate steps:
$$
+2 \times \frac{\text{number of correct intermediate steps}}{\text{total intermediate steps}}
$$
This led to a steady increase in the model's accuracy.
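Below is a minimal sketch of this reward written in the form TRL's `GRPOTrainer` expects (a callable over a batch of completions returning one float each); the regexes and the `answer` column name are assumptions based on the GSM8k format, and the exact implementation is in the repo:
```python
# A minimal sketch of the final reward in the form TRL's GRPOTrainer
# expects. The regexes and the `answer` column name are assumptions
# based on the GSM8k format.
import re

STEP_RE = re.compile(r"<<([^<>=]+)=([^<>]+)>>")   # GSM8k-style <<expr=result>>
FINAL_RE = re.compile(r"####\s*(-?[\d,\.]+)")     # trailing "#### 42" line

def step_fraction(text: str) -> float:
    """Fraction of <<expr=result>> annotations where expr evaluates to result."""
    steps = STEP_RE.findall(text)
    if not steps:
        return 0.0
    ok = 0
    for expr, result in steps:
        try:
            # eval on model-generated text is fine for a local experiment,
            # but should be sandboxed in anything serious.
            if abs(eval(expr) - float(result.replace(",", ""))) < 1e-6:
                ok += 1
        except Exception:
            pass
    return ok / len(steps)

def reward_fn(completions, answer, **kwargs):
    rewards = []
    for completion, gold in zip(completions, answer):
        pred = FINAL_RE.search(completion)
        ref = FINAL_RE.search(gold)
        base = (2.0 if pred and ref
                and pred.group(1).replace(",", "") == ref.group(1).replace(",", "")
                else -1.0)                        # wrong or badly formatted
        rewards.append(base + 2.0 * step_fraction(completion))
    return rewards
```
Such a function can be passed as `reward_funcs` to `GRPOTrainer`, with the GSM8k train split as `train_dataset`, assuming TRL is the GRPO implementation used.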
## Evaluation
SmolMath was evaluated against the base model on the benchmarks originally reported for SmolLM2, and against some similarly sized models on math benchmarks.
### Comparison with Base Model
| **Metrics** | **SmolLM2-135M-8k** | **SmolMath-135M** | **Δ (Change)** |
|-------------------|---------------------|--------------------|----------------|
| HellaSwag | 42.1 | 41.15 | −0.95 |
| PIQA | 68.4 | 63.55 | −4.85 |
| CommonsenseQA | 33.9 | 33.42 | −0.48 |
| TriviaQA | 4.1 | 0.0 | −4.10 |
| Winogrande | 51.3 | 51.78 | +0.48 |
| OpenBookQA | 34.6 | 30.80 | −3.80 |
| GSM8K (0-shot)* | 0.0 | 6.9 | +6.90 |
*Evaluated using the lighteval script, which the SmolLM2 creators favour for their evaluations; its prompt structure differs from SmolMath's.
### Math Benchmarks
| Model | AddSub* (%) | MAWPS** (%) | GSM8K* (%) |
|----------------------|------------|-----------|-----------|
| apple/OpenELM-270M-Instruct | 2.14 | 2.83 | 2.05 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 1.52 | 4.04 | 0.45 |
| SmolMath-no GRPO (ours) | 9.64 | 7.47 | 6.22 |
| SmolMath (ours) | **12.05** | **8.31** | **7.51** |
*Evaluated only on the test set, which was not included in training
**Evaluated on the complete dataset, which was not included in training
## Limitations and Future work
While SmolMath's performance on the non-math benchmarks is not all that poor, in many instances it drifts toward answering with numbers wherever it can, even when that makes no sense. More work can be done here to preserve the original SmolLM character.
## Sample Outputs
All of the outputs below were generated with greedy decoding.
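For reproducibility, a minimal snippet of what greedy generation with the released checkpoint could look like; the prompt template is inferred from the samples, and `max_new_tokens` is illustrative:
```python
# Greedy decoding is simply generation with sampling disabled.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ashed00/SmolMath-135M")
model = AutoModelForCausalLM.from_pretrained("Ashed00/SmolMath-135M")

prompt = "Question: 69 + 69?\nAnswer:"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy
print(tok.decode(out[0], skip_special_tokens=True))
```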
### Simple calculations
```
Question: 1728-300
Answer: 1728-300 = <<1728-300=1428>>1428
#### 1428
```
```
Question: 69 + 69?
Answer: 69 + 69 = <<69+69=138>>138
#### 138
```
```
Question: 15 * 7?
Answer: 15 * 7 = <<15*7=105>>105
15 * 7 = <<15*7=105>>105
#### 105
```
```
Question: 125/25?
Answer: 125/25 = <<125/25=5>>5
#### 5
```
### Word Problems
```
Question: If 4 apples cost 100 dollars each, how much would 2 apples cost?
Answer: 4 apples cost 4*100=$<<4*100=400>>400
2 apples cost 2*100=$<<2*100=200>>200
#### 200
```
```
Question: Lily has 4 pencils. She buys 3 more and then gives 2 to her friend. How many pencils does she have now?
Answer: Lily has 4 + 3 = <<4+3=7>>7 pencils.
She gives 2 to her friend, so now she has 7 - 2 = <<7-2=5>>5 pencils.
#### 5
```
```
Question: A farmer has 4 rows of tomato plants. Each row has 6 plants. If each plant grows 3 tomatoes, how many tomatoes does the farmer harvest?
Answer: The farmer has 4 * 6 = <<4*6=24>>24 plants.
The farmer harvests 24 * 3 = <<24*3=72>>72 tomatoes.
#### 72
```
The following output is slightly wrong: 12 + 15 + 18 is 45, not 42, so the correct answer is 60 - 45 = 15 (and even granting 42, the subtraction 60 - 42 gives 18, not 12).
```
Question: Sarah read 12 pages on Monday, 15 pages on Tuesday, and 18 pages on Wednesday. Her book has 60 pages. How many pages does she have left to read?
Answer: She read 12 + 15 + 18 = <<12+15+18=42>>42 pages.
She has 60 - 42 = <<60-42=12>>12 pages left to read.
#### 12
```
### Non-Math
These examples show the issue: questions that call for a purely textual answer still trigger unnecessary numerical involvement.
```
What sound does a cat make?
A cat makes a hissing sound when it is hungry.
A cat makes a hissing sound when it is angry.
A cat makes a hissing sound when it is happy.
A cat makes a hissing sound when it is tired.
A cat makes a hissing sound when it is sad.
A cat makes a hissing sound when it is happy.
#### 1,000
```
```
Question: What color is Spiderman's costume?
Answer: The color of the costume is blue because 2 x 100 = <<2*100=200>>200
The color of the costume is green because 200 - 100 = <<200-100=100>>100
#### 100
```