Outline
- Model merging is on the rise
- Merged models can easily achieve high scores on the leaderboard
- No large compute resources are needed (a CPU is enough)
- Sakana.ai has automated model merging
- Quick guide to creating your first merged model (Jan's merge notebook)

---

# Fast and easy way to create a high-performance LLM

## Introduction

Creating a high-quality AI model is a goal shared by many in the field, yet traditionally this process demands a significant investment in both quality data collection and advanced hardware for model fine-tuning. These requirements present substantial challenges, including high costs. However, recent advances in [model merging techniques](https://huggingface.co/blog/mlabonne/merge-models) offer a promising solution. This approach allows competitive AI models to be created in a relatively short time, even with limited hardware resources. By overcoming these traditional barriers, model merging paves the way for more efficient and cost-effective development, making high-performance AI more accessible than ever.

## Understanding model merging

Model merging starts from the observation that individually fine-tuned models typically excel at specific tasks, such as mathematics or reasoning. Merging these specialized models, under the hypothesis that the combined model will inherit and integrate the strengths of each, aims to create a versatile LLM capable of performing a broad range of tasks effectively. This technique not only enhances the model's overall capability but also streamlines the development process, potentially leading to LLMs that are strong across multiple specialized domains.

![image](https://hackmd.io/_uploads/Bke3GJ3kR.png)

**Figure 1.** Sakana.ai's merged model can even answer questions about images in Japanese, even though it was merged from an English VLM and a Japanese LLM.

## Insights from Jan's team after building multiple merged models

### Selecting a model with strong task-specific capabilities

Selecting the right foundation model is crucial for successful model merging: a good outcome often depends on starting from a strong base. General-purpose LLMs typically excel at question answering but fall short in specialized areas like mathematics. This led us to seek out models with specific strengths in math to complement our LLMs.

![image](https://hackmd.io/_uploads/S1TN64kTa.png)

**Figure 2.** WizardMath 7B excels on the GSM8K benchmark with a high score.

By identifying and integrating this specialized model, we aim to enhance our current model, ensuring that the merged model not only maintains its strong question-answering ability but also significantly improves in areas where it was previously lacking.

## Cost-effectively improving the base model

We found merging models to be quick and cost-effective, enabling fast adjustments based on the result of each iteration. We could test a whole series of merged-model variants across different foundation models in just 15 minutes per model, and we quickly found a suitable combination.
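To make the merge step concrete, here is a minimal sketch of the core SLERP (spherical linear interpolation) math applied parameter-by-parameter to two checkpoints of the same architecture. The function names (`slerp`, `merge_state_dicts`) and the single global interpolation factor are illustrative assumptions; our actual merges were produced with existing merge tooling, which uses per-layer interpolation factors and handles many practical details.

```python
import torch

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors.
    t = 0 returns v0, t = 1 returns v1; intermediate values follow the
    great-circle path between the two (flattened) parameter vectors."""
    v0_flat, v1_flat = v0.flatten().float(), v1.flatten().float()
    v0_unit = v0_flat / (v0_flat.norm() + eps)
    v1_unit = v1_flat / (v1_flat.norm() + eps)
    # Angle between the two parameter vectors.
    omega = torch.arccos(torch.clamp(torch.dot(v0_unit, v1_unit), -1.0, 1.0))
    if omega.abs() < 1e-4:
        # Nearly parallel weights: fall back to plain linear interpolation.
        merged = (1.0 - t) * v0_flat + t * v1_flat
    else:
        merged = (torch.sin((1.0 - t) * omega) * v0_flat
                  + torch.sin(t * omega) * v1_flat) / torch.sin(omega)
    return merged.reshape(v0.shape).to(v0.dtype)

def merge_state_dicts(sd_math: dict, sd_general: dict, t: float = 0.5) -> dict:
    """Apply SLERP to every shared parameter of two same-architecture models."""
    return {name: slerp(t, sd_math[name], sd_general[name]) for name in sd_math}
```

Because this is pure tensor arithmetic with no gradients, a merge like this runs comfortably on CPU, which is what makes rapid iteration over candidate model pairs possible.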
![image](https://hackmd.io/_uploads/SkYBaVk6a.png)

**Figure 3.** The merged model, Stealth, doubles the mathematical capability of its foundation model while retaining its performance on other tasks.

We ended up with [Stealth 7B v1.1](https://huggingface.co/jan-hq/stealth-v1.1), a [SLERP](https://github.com/Digitous/LLM-SLERP-Merge) merge of:

- [WizardMath](https://huggingface.co/WizardLM/WizardMath-7B-V1.1) for its math capabilities.
- Our own [Trinity](https://huggingface.co/jan-hq/trinity-v1.2) model for its versatility across general tasks.

This combination yielded the best tradeoff: a large gain in mathematical ability while retaining most of the pre-merge performance on general tasks.

## Further finetuning to realign the new capabilities of the model

Merging different LLMs can lead to a mixed answering style because each model was originally trained on different types of data. We therefore applied Direct Preference Optimization ([DPO](https://arxiv.org/abs/2305.18290)) using [Intel's Orca DPO pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs) dataset, chosen for its generally helpful answering style and its concentration of math and coding examples. This produced the final model, [Stealth 7B v1.2](https://huggingface.co/jan-hq/stealth-v1.2), aligned to our technical preferences with minimal loss of quality. A sketch of the DPO objective itself is included at the end of this post.

## Reference

- [Merge LLM with mergekit](https://huggingface.co/blog/mlabonne/merge-models)
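For readers who want a feel for what the DPO step optimizes, below is a minimal sketch of the sigmoid DPO objective from the paper linked above, written directly in PyTorch. The function name and the `beta` default are illustrative assumptions; in practice this loss is computed by standard fine-tuning libraries rather than hand-rolled code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each input is a batch of summed log-probabilities of a response under
    either the trainable policy or the frozen reference model (here, the
    merged checkpoint before DPO)."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Reward the policy for preferring the chosen response over the rejected one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Intuitively, the merged model acts as the frozen reference, and the loss only nudges the policy toward the preferred answering style in the preference pairs, which is how the mixed style gets realigned without retraining from scratch.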