# The Invisible Moat around Open Source LLMs

## Introduction

In the evolving landscape of artificial intelligence, the significance of Large Language Models (LLMs) cannot be overstated. Amid the proliferation of proprietary models such as ChatGPT, Gemini, and Claude, a new category of "open-source" models like Mistral and Llama 2 is carving its niche within the tech community. But are they really open source? And do they give everything away, leaving no moat behind?

## The Emergence of Open-Source Models

Relying on closed-source models often entails substantial deployment costs and makes it difficult for businesses to customize these models to meet specific customer needs.

![image](https://hackmd.io/_uploads/H1S4JFF1C.png)

**Figure 1.** A recent [survey conducted by a16z](https://threadreaderapp.com/thread/1771884750589841823.html) reveals that over 70% of enterprises are already fine-tuning their models to better align with user demands, highlighting the growing preference for adaptability in AI solutions.

### Unveiling the Process of LLM Development

Take a step back to understand the phases of LLM development:

- **Pre-train phase:** This initial stage teaches the model the nuances of human language through extensive textual data. It is the phase where the model's foundational knowledge is established.
- **Fine-tune phase:** Pre-trained models (e.g., Llama-2 7B or Mistral 7B) are then specialized through further training on more specific datasets, enhancing their performance on particular tasks or domains.
- **Alignment phase:** Lastly, models are adjusted to ensure they produce outputs that are ethical and align with societal norms, reducing the propagation of hate speech and misinformation.

![image](https://hackmd.io/_uploads/HkHIxYK1A.png)

**Figure 2.** Essential steps for training LLMs effectively (*Jan's Introduction to LLM course*)

### The Crucial Role of Pre-trained Data

> *Owning the pre-trained dataset is crucial as it represents the model's knowledge.*

Having access to the pre-trained dataset is like owning the core of the LLM's intelligence. The pre-trained dataset acts as a master key to the critical issue of ["catastrophic forgetting"](https://en.wikipedia.org/wiki/Catastrophic_interference) in LLMs: the phenomenon in which a model loses hold of prior knowledge when new information arrives as a sudden distribution shift. With the pre-trained dataset in hand, fine-tuning can introduce new information while transitioning the model's knowledge smoothly.

![image](https://hackmd.io/_uploads/Sk_u1oFkA.png)

**Figure 3.** The "catastrophic forgetting" problem: fine-tuned without the pre-trained dataset, the LLM overfits on new tasks and can no longer communicate normally.

### Step-by-Step Learning

As described above, mixing in the pre-trained dataset ensures smoother distribution shifts when introducing new information, because it embodies a comprehensive spectrum of prior knowledge. This continuity keeps the model robust against sudden changes, akin to a more gradual learning curve in which new information is incrementally integrated with the existing knowledge base. The idea is supported by [EleutherAI's research](https://arxiv.org/abs/2403.08763) on how tasks are sequenced in the learning process, which suggests that introducing dissimilar tasks early on can expand the network's capacity for new information. A sketch of this data-mixing ("replay") setup is shown below.
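As a concrete illustration, here is a minimal sketch of replay-style data mixing using the Hugging Face `datasets` library. The file names are placeholders, and the 1% ratio simply mirrors the replay amount discussed in Table 1 below; this is an illustrative setup, not the exact recipe used in the cited paper.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder corpora: substitute your own pre-training corpus and new-task data.
pretrain_ds = load_dataset("json", data_files="pretrain_corpus.jsonl",
                           split="train", streaming=True)
finetune_ds = load_dataset("json", data_files="new_task.jsonl",
                           split="train", streaming=True)

# Replay: draw ~1% of training examples from the original pre-training data,
# so each fine-tuning batch still carries a slice of the old distribution.
mixed_ds = interleave_datasets(
    [finetune_ds, pretrain_ds],
    probabilities=[0.99, 0.01],  # 99% new task, 1% replayed pre-training data
    seed=42,
)

# `mixed_ds` can then be fed to any standard fine-tuning loop or Trainer.
```

The key point is that replay only works if you actually have the pre-training corpus to sample from, which is exactly the access most "open-source" releases withhold.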
**Table 1.** Final results for English-only 405M-parameter models trained with different replay amounts show that models with more replay better balance learning and forgetting (measured as average loss). Notably, mixing in just 1% of the pre-trained dataset significantly lowers the average loss while shifting the model's knowledge from English (the Pile) to German.

![image](https://hackmd.io/_uploads/ByhPEWu1A.png)

*Note:* **Replay** is the method of combining the training dataset of the pre-trained model with the new task datasets.

### Balancing Learning with Noise

The pre-trained data can also serve as a form of "noise masking", similar to techniques used in training [early computer vision models](https://arxiv.org/abs/1911.04252). This approach introduces a level of ["noise"](https://arxiv.org/abs/2310.05914) during training, which can prevent the model from overfitting to the new dataset. By retaining a mix of original and new data, the model is exposed to a broader range of scenarios, enhancing its generalization capabilities and robustness across tasks.
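The linked "noise" paper is NEFTune, which perturbs token embeddings with uniform noise during fine-tuning. Below is a minimal PyTorch sketch of that idea; the `alpha` value, the forward-hook wiring, and the assumption of a Hugging Face-style `get_input_embeddings()` method are illustrative choices, not the paper's reference implementation.

```python
import torch


def add_neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune-style uniform noise to token embeddings during training.

    embeddings: (batch, seq_len, hidden_dim) output of the embedding layer.
    alpha: noise scale hyperparameter (the paper sweeps values in this range).
    """
    _, seq_len, hidden_dim = embeddings.shape
    # Scale uniform noise by alpha / sqrt(seq_len * hidden_dim), as in NEFTune.
    # (For simplicity this uses the padded length rather than per-example lengths.)
    scale = alpha / (seq_len * hidden_dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1, 1) * scale
    return embeddings + noise


def attach_neftune(model, alpha: float = 5.0):
    """Illustrative: inject noise via a forward hook on the embedding layer."""
    def hook(module, inputs, output):
        # Only perturb embeddings in training mode; leave inference untouched.
        if module.training:
            return add_neftune_noise(output, alpha)
        return output

    return model.get_input_embeddings().register_forward_hook(hook)
```

In the article's framing, replaying a slice of the pre-training data plays a similar regularizing role to this injected noise: both keep the model from collapsing onto the narrow distribution of the new task.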
## Solutions

### Overwhelming Pre-trained Dataset Approach

One partial remedy is to inundate the model with extensive, curated data, allowing for comprehensive fine-tuning. In the open-source community, two notable examples of fine-tuning Mistral as a base model on very large datasets to enhance its capability are [OpenChat](https://huggingface.co/openchat/openchat-3.5-0106) and [Hermes-Pro](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B).

![image](https://hackmd.io/_uploads/B1tOEZOkC.png)

**Figure 4.** After fine-tuning on a large number of data samples, the model's performance improved, outperforming ChatGPT and Grok-1 on some benchmarks.

While effective, this approach demands a comprehensive filtering process to weed out low-quality inputs, and gathering millions of high-quality responses carries an extraordinarily high cost.

### Fully Open-source Model Approach

Another approach is to use fully open-source models from the community. A model is fully open source when its weights, training data, and code are all released; nothing is left hidden. The [OLMo](https://allenai.org/olmo) model from AllenAI is one of the most recent fully open-source models released with this level of detail.

**Table 2.** Benchmark of the OLMo model against other "open-source" models. It is comparable to Llama-2 but falls far behind the most popular pre-trained model, Mistral 7B.

![image](https://hackmd.io/_uploads/Hy6KuHY1C.png)

While this approach gives the community everything it needs to improve the model, the base performance is quite poor compared to models like Mistral 7B or Llama-2 7B. Starting from this lower baseline, raising the overall performance remains a challenge.

## Conclusion

The ownership and strategic use of pre-trained data serve as an invisible moat. It not only enables tackling complex challenges like catastrophic forgetting but also provides a baseline for continuous, targeted improvement. Although solutions to decentralize this advantage exist, their cost remains high.

## References

- [Catastrophic forgetting](https://arxiv.org/abs/2308.08747)
- [Simple and Scalable Strategies to Continually Pre-train Language Models](https://arxiv.org/abs/2403.08763)
- [Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)
- [NEFTune](https://arxiv.org/abs/2310.05914)
- [Self-training with Noisy Student improves ImageNet classification](https://arxiv.org/abs/1911.04252)

---

# Discarded

### Illustrating Catastrophic Forgetting

Catastrophic forgetting can be visualized as a ball in a multidimensional landscape, where moving towards new knowledge risks losing grasp on the old. Pre-trained data acts as a map, guiding fine-tuning in a way that incorporates new information while safeguarding existing knowledge.

![image](https://hackmd.io/_uploads/BJXDEWdy0.png)

**Figure 2.** [Gradient descent demonstration](https://en.wikipedia.org/wiki/Gradient_descent)