# Rebuttal
## General Response
We are grateful for the reviewers' efforts and the recognition of our contributions:
**Novel Research Topic:** The paper explores the unique impact of vocabulary size on language model performance, an aspect often overlooked in LLM research [EsaU,Y8zo,zt86].
**Analyses:** We provide in-depth analyses of why an optimal vocabulary size exists under a given FLOPs budget and why this optimum increases with more FLOPs, with theoretical analyses in Appendix A.1 and demonstration experiments in Sec. 3 and Appendix A.2. [EsaU,zt86]
**Experiments:** The paper includes two effective methods (an IsoFLOPs-based and a derivative-based approach) to predict the optimal vocabulary setting. Extensive experiments on **1200 pre-trained models** (6 non-vocabulary parameter settings × 10 vocabulary size settings × 20 training data settings) are conducted. [EsaU,zt86]
**Applications:** The study's findings offer two practical tools for optimizing the compute allocation of LLM training while taking the vocabulary size into account. [EsaU,Y8zo,zt86]
In response to the reviewers' feedback, we have performed additional analyses and experiments to address the concerns raised. Below, we summarize our responses and the improvements made to our paper.
**New Analyses and Experiments**
- We supplement the paper with a new perspective on parameter growth to demonstrate that vocabulary parameters need to be considered separately from the total parameters, and that larger models deserve larger vocabularies.
- We evaluate the 3B pre-trained models with different vocabulary sizes on 5 more downstream tasks. The model that uses our suggested vocabulary size outperforms its counterpart by a clear margin.
- We compare models trained on the same number of tokens, instead of the same training FLOPs, as requested.
All of these additions and suggestions will be incorporated into our revised version.
<br><br>
## Reviewer 1 (Score 7)
### W1: This paper conducted experiments on language models of various parameter sizes, but the largest model tested was only 3 billion parameters.
Answer:
We acknowledge the importance of evaluating our approach on larger models to establish its scalability. Increasing the model size necessitates pre-training on a larger corpus, which in turn demands more computational resources. For instance, conducting pre-training experiments on 7B models would require an immense computational budget exceeding 10^22 FLOPs, translating to approximately 6 weeks of training time on a cluster with 64 A100 GPUs. However, such a substantial level of computational resources is currently beyond our reach during the rebuttal period. Despite our desire to explore larger model sizes, we are constrained by the practical limitations of our available resources.
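For reference, a back-of-the-envelope sketch of this estimate is given below; the token count, per-GPU throughput, and utilization are illustrative assumptions rather than settings from the paper:

```python
# Rough cost estimate for pre-training a 7B-parameter model.
# All inputs below are illustrative assumptions, not settings from the paper.
n_params = 7e9            # model parameters
n_tokens = 7e11           # assumed ~700B training tokens
flops = 6 * n_params * n_tokens      # standard ~6*N*D training-FLOPs approximation
print(f"total FLOPs ~ {flops:.1e}")  # ~2.9e22, i.e., well above 1e22

n_gpus = 64
peak = 312e12             # A100 peak BF16 FLOP/s
mfu = 0.40                # assumed model-FLOPs utilization
days = flops / (n_gpus * peak * mfu) / 86400
print(f"estimated wall-clock: {days:.0f} days (~{days / 7:.0f} weeks)")
```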
Nonetheless, the value of a scaling law lies precisely in studying it at a relatively small scale so that computational resources can be allocated sensibly when training a large model, avoiding wasted compute. Our experiments with 1200 pre-trained models have demonstrated the existence of an optimal vocabulary size under FLOPs constraints, and the predictions from the theoretical analysis (derivative-based approach) and the experimental fitting (IsoFLOPs-based approach) agree with each other. In fact, we have also made predictions for the pre-training of a 300B model, where the two approaches align well, so we believe the conclusions hold for larger models.
Furthermore, the increase in vocabulary size from 32K in Llama-2 to 128K in Llama-3, which brought performance improvements, can also be seen as a verification of our conclusion that the vocabulary size should grow when more compute is available.
In conclusion, we appreciate your suggestions and we will try our best to conduct more experiments on larger models. We hope you can understand our computational limitations. We will also discuss this in the revised version.
<br><br>
## Reviewer 2 (Score 5)
### W1: Lacks performance on large-scale models, such as whether increasing the vocabulary size to a greater extent performs better than existing models in the market. Table 2's experiments look a little bit less.
Answer:
Thank you for raising this concern. We share the reviewer's interest in comparisons against existing powerful models on the market, such as Llama-2-7B. However, training a 7B model on 2 trillion tokens from scratch is far beyond our current computational resources. Nevertheless, we have evaluated our 3B models on more benchmarks to alleviate this concern.
Specifically, we have added new experimental results on the following benchmarks:
- **MMLU**: Massive Multitask Language Understanding benchmark for broad-domain language evaluation.
- **CommonsenseQA**: A multiple-choice QA dataset for measuring commonsense knowledge.
- **CoQA**: Conversational question answering tasks to test dialog understanding.
- **TruthfulQA**: A QA task aimed at evaluating the truthfulness and factual accuracy of model responses.
- **Lambada**: Tasks designed to predict the endings of text passages, testing language prediction skills.
The following table combines the original Table 2 from the paper with the new experimental results. As shown, our prediction enables better model performance by adjusting only the vocabulary size within different FLOPs budgets. The 3B model with a 43K vocabulary size outperforms its 32K counterpart on 11 out of 12 tasks under the same FLOPs budget (2.3e21 FLOPs). For example, performance on ARC-C improves from 29.1 to 31.5, and Lambada perplexity drops from 15.6 to 13.9. In conclusion, the model using our suggested vocabulary size (i.e., 43K) consistently outperforms its counterpart (i.e., 32K) by a clear margin.
| Tasks | Metric | $V$=32K (Baseline) | $V^{opt}$=43K (Ours) |
|---------------|---------------------|-----------|---------------|
| Winogrande | Normalized Accuracy | 55.7±1.4 | **58.7**±1.4 |
| PIQA | Normalized Accuracy | 72.6±1.0 | **72.7**±1.0 |
| OBQA | Normalized Accuracy | **34.4**±2.1 | 33.0±2.1 |
| Hellaswag | Normalized Accuracy | 55.1±0.5 | **55.7**±0.5 |
| BoolQ | Normalized Accuracy | 60.1±0.9 | **62.3**±0.8 |
| ARC-E | Normalized Accuracy | 53.4±1.0 | **55.0**±1.0 |
| ARC-C | Normalized Accuracy | 29.1±1.3 | **31.5**±1.4 |
| MMLU | Normalized Accuracy | 25.0±0.4 | **25.5**±0.4 |
| CommonsenseQA | Normalized Accuracy | 20.2±1.2 | **21.0**±1.1 |
| CoQA          | Exact Match         | 32.3±2.0  | **37.4**±2.0  |
| TruthfulQA | BLEU | 30.4±1.6 | **31.3**±1.6 |
| Lambada | Normalized Accuracy | 43.0±0.7 | **44.9**±0.7 |
Another feasible way to compete with models in the market would be continual pre-training with a larger vocabulary size. We believe this is a good topic for discussion, but it involves several non-trivial research challenges not strongly related to the main contributions of this paper, i.e., exploring the optimal compute allocation considering vocabulary sizes. Therefore, we will discuss it in the revised version and leave it as an important future work. The challenges we will discuss include:
- Expanding the vocabulary necessitates changes in the tokenization process, which can lead to inconsistencies in token segmentation.
- Ensuring that these new embeddings are compatible and effectively integrate with the pre-trained embeddings is non-trivial (one common warm-start heuristic is sketched after this list).
- Catastrophic forgetting of old word embeddings when learning new word embeddings.
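As a concrete illustration of the embedding-integration challenge, one common warm-start heuristic in the vocabulary-expansion literature initializes each new token's embedding from the old-vocabulary tokens it decomposes into. The sketch below only illustrates that heuristic; it is not a method proposed or evaluated in this paper:

```python
import torch

def init_expanded_embeddings(old_emb: torch.Tensor,
                             new_vocab_size: int,
                             new_to_old_pieces: dict[int, list[int]]) -> torch.Tensor:
    """Initialize embeddings for an expanded vocabulary.

    old_emb:            (V_old, d) pre-trained embedding matrix.
    new_vocab_size:     V_new > V_old.
    new_to_old_pieces:  maps each new token id to the old-vocabulary token ids
                        obtained by re-tokenizing its surface string.
    Copied rows keep the pre-trained embeddings; each new row is set to the
    mean of its constituent old-token embeddings (a common warm start).
    """
    v_old, d = old_emb.shape
    new_emb = torch.empty(new_vocab_size, d, dtype=old_emb.dtype)
    new_emb[:v_old] = old_emb                        # keep old embeddings as-is
    for new_id in range(v_old, new_vocab_size):
        pieces = new_to_old_pieces[new_id]
        new_emb[new_id] = old_emb[pieces].mean(dim=0)
    return new_emb
```

Even with such a warm start, avoiding catastrophic forgetting of the old embeddings during continual pre-training remains an open issue, which is why we leave this direction to future work.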
We will discuss all the above in the revised version. Thank you again for your valuable comments.
### W2.1: IsoFLOPs method is very sensitive.
Answer:
Thank you for your insightful question. You raise a valid point – the IsoFLOPs-based approach can be sensitive to some extent, depending on the granularity, range, and quality of the fitting data. Since the pioneering work on scaling laws by Kaplan et al. 2020 [1] and Hoffmann et al. 2022 [2], the IsoFLOPs-based approach has become a widely-used tool to study the trend of model performance [3]. We have discussed it in our Appendix B.1, and we will add more details on how to reduce sensitivity, such as outlier data removal and repeated experiments, in our polished version.
To evaluate the goodness of fit, we use relative mean square error (rMSE) and the coefficient of determination (R^2). As shown in the table below (also in Figure 3), the results indicate a good fit, with rMSE < 0.001 and R^2 >= 0.89 for all the considered attributes: non-vocabulary parameters ($N_{nv}$), vocabulary parameters ($N_v$), and training characters ($H$). This suggests that these attributes follow a power law with respect to the FLOPs budget.
| | $N_{nv}$ | $N_v$ | $H$ |
|--------|----------|------|-----|
| rMSE | 0.00026 | 0.00051 | 0.00017 |
| R<sup>2</sup> | 0.93 | 0.89 | 0.96 |
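For transparency, the sketch below shows how such a power-law fit and its goodness-of-fit metrics can be computed. The data is synthetic and the definition of rMSE as the mean squared relative error is our working assumption here, so this is an illustration rather than the exact fitting script behind Figure 3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: an attribute (e.g., optimal N_nv) vs. FLOPs budget.
flops = np.array([1e18, 3e18, 1e19, 3e19, 1e20, 3e20])
attr = 2.1e-3 * flops ** 0.52 * np.exp(rng.normal(0.0, 0.02, flops.size))

# Fit a power law attr = a * FLOPs^b via linear regression in log-log space.
b, log_a = np.polyfit(np.log(flops), np.log(attr), deg=1)
pred = np.exp(log_a) * flops ** b

# Relative mean squared error: mean of squared relative residuals.
rmse = np.mean(((pred - attr) / attr) ** 2)

# Coefficient of determination on the log-transformed values.
ss_res = np.sum((np.log(attr) - np.log(pred)) ** 2)
ss_tot = np.sum((np.log(attr) - np.log(attr).mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"exponent b = {b:.3f}, rMSE = {rmse:.1e}, R^2 = {r2:.3f}")
```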
Furthermore, the optimal vocabulary predictions (reported in Table 1) from the IsoFLOPs-based method and the derivative-based method are aligned across small-scale and large-scale models. This independent verification by the derivative-based method validates the predictions from the IsoFLOPs-based method. Therefore, we believe that the IsoFLOPs-based method works well in our case.
[1] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[2] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556
[3] Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
### W2.2: The experiments also looks not enough.
Answer: It is noteworthy that we conduct extensive experiments on **1200 models pre-trained from scratch** (6 non-vocabulary parameter settings × 10 vocabulary size settings × 20 training data settings) to fit our vocabulary scaling law. The key contributions of this paper are the findings about how the vocabulary affects model performance and how much compute should be allocated to the vocabulary, based on the two proposed approaches.
Following previous studies [1,2,3], we mainly use the held-out validation loss to evaluate the 1200 trained models. It is a better metric than downstream-task performance: the held-out loss provides an unbiased measure of the model's ability to generalize to new data while also being cheap to compute. In contrast, downstream-task performance varies greatly across tasks, which makes it unsuitable as the main evaluation metric.
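For clarity, a minimal sketch of the held-out loss computation is given below; the model and data loader are placeholders. Note that a raw per-token loss is not directly comparable across different vocabulary sizes, so comparisons require a vocabulary-insensitive normalization; the sketch shows only the generic per-token computation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def heldout_loss(model, val_loader, device="cuda"):
    """Average next-token cross-entropy (nats/token) on a held-out set.

    `model` maps input ids of shape (B, T) to logits of shape (B, T, V);
    `val_loader` yields batches of token ids of shape (B, T). Both are
    placeholders for illustration.
    """
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for batch in val_loader:
        ids = batch.to(device)
        logits = model(ids[:, :-1])               # predict token t+1 from its prefix
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # (B*(T-1), V)
            ids[:, 1:].reshape(-1),               # (B*(T-1),)
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += ids[:, 1:].numel()
    return total_nll / total_tokens
```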
Downstream-task evaluation is one of the ways we verify our prediction, so we do not devote much space to it in the main paper. We have conducted additional downstream-task experiments in our answer to W1 above, and the new results will be added to our polished version.
### Q1: In determining the scaling law for the relationship between non-embedding model size and data (such as the Chinchilla law), why is it assumed that the vocabulary size is independent of these two factors.
Answer:
Thanks for your question! We do not assume that the vocabulary size is independent of parameters and data. Instead, we make two adjustments in the Preliminary section: 1) we break down the total parameters into non-vocabulary parameters and vocabulary parameters; 2) we measure data not in tokens but in training characters.
By doing so, the vocabulary size $V$ becomes independent of the non-vocabulary parameters $N_{nv}$ and the number of training characters $H$: in an experimental configuration, developers can vary the vocabulary size without affecting the non-vocabulary parameters or the training characters.
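To make the decomposition concrete, here is a rough sketch; the hidden size, depth, and the ≈12·L·d² and ≈6·N·D rules of thumb are standard approximations used for illustration, not exact configurations from the paper:

```python
def count_params(d_model: int, n_layers: int, vocab_size: int, tied: bool = False):
    """Rough transformer parameter decomposition.

    Non-vocabulary parameters: ~12 * n_layers * d_model^2 (attention + MLP blocks).
    Vocabulary parameters: input embedding + LM head, vocab_size * d_model each
    (a single copy if the two matrices are tied).
    """
    n_nv = 12 * n_layers * d_model ** 2
    n_v = (1 if tied else 2) * vocab_size * d_model
    return n_nv, n_v


def training_flops(n_nv: float, n_v: float, tokens: float) -> float:
    # Standard ~6 * N * D approximation, counting all trainable parameters.
    return 6 * (n_nv + n_v) * tokens


# Varying V changes only N_v (and the tokenizer), while N_nv and the raw
# character count H of the corpus stay fixed -- hence V is an independent axis.
n_nv, n_v = count_params(d_model=2048, n_layers=24, vocab_size=32_000)
print(f"N_nv = {n_nv / 1e9:.2f}B, N_v = {n_v / 1e9:.2f}B")
```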
We next detail our motivation for separating vocabulary parameters from non-vocabulary parameters:
Traditionally, scaling up model parameters in language models has been approached in two ways: increasing depth (i.e., the number of layers) or width (i.e., the hidden size). Current empirical practice often expands both simultaneously [4]. This approach overlooks crucial distinctions in how different parameters benefit from parameter expansion. Non-vocabulary parameters benefit from increases in both depth and width, allowing for more complex hierarchical representations and broader feature capture. In contrast, vocabulary parameters, associated with the word embeddings and the language model head, are generally confined to a single layer, limiting their ability to benefit from increases in model depth. This disparity in growth potential suggests that, to maintain a balanced growth rate, vocabulary and non-vocabulary parameters are better considered separately.
[4] Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Tran, Dani Yogatama, and Donald Metzler. 2023. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12342–12364, Singapore. Association for Computational Linguistics.
<br><br>
## Reviewer 3 (Score 6)
### W1: The results are probably mostly applicable to a small number of well-funded labs.
Answer:
Thanks for pointing this out! We would like to clarify that we are not a well-funded lab either. Due to our limited computing resources, we can only afford to train models with up to 3B parameters in our experiments, as the cost of validating scaling-law experiments is indeed very high.
However, we believe that our conclusions benefit the general research community, and especially small labs. Our scaling law with vocabulary provides a compute-optimal allocation suggestion, enabling small labs to train high-performance models without repeatedly trying different vocabulary configurations, thereby saving computing resources.
Even for teams that want to conduct scaling-law experiments themselves, our derivative-based method offers a simple and feasible approach based on theoretical derivation: the recommended vocabulary configuration can be obtained by a numerical search that takes only a few seconds on a CPU, with no need to run a large number of scaling-law experiments. This is particularly advantageous for small labs. We will also make all our scaling-law experimental results public so that more people can benefit from our work.
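To illustrate how lightweight this is, the sketch below performs such a numerical search under a fixed FLOPs budget. The loss function and all coefficients are made-up placeholders standing in for a fitted vocabulary-aware scaling law; they are not the formula or the fitted values from the paper:

```python
def placeholder_loss(n_nv: float, n_v: float, tokens: float) -> float:
    # Stand-in for a fitted vocabulary-aware scaling law; coefficients are made up.
    return 1.8 + 400.0 / n_nv ** 0.34 + 2000.0 / n_v ** 0.45 + 500.0 / tokens ** 0.28


def best_vocab(flops_budget: float, n_nv: float, d_model: int) -> int:
    """Grid-search the vocabulary size minimizing predicted loss at fixed compute,
    using the ~6 * (N_nv + N_v) * D approximation for training FLOPs."""
    best_v, best_loss = 0, float("inf")
    for v in range(4_096, 512_000, 1_024):
        n_v = 2 * v * d_model                        # embedding + LM head parameters
        tokens = flops_budget / (6 * (n_nv + n_v))   # tokens affordable at this V
        loss = placeholder_loss(n_nv, n_v, tokens)
        if loss < best_loss:
            best_v, best_loss = v, loss
    return best_v


# Runs in well under a second on a laptop CPU.
print(best_vocab(flops_budget=3e21, n_nv=2.8e9, d_model=2560))
```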
### Q1: The abstract states that “beyond the conventional 32K” – is this really the convention? See e.g. GPT4o.
Answer:
Thank you for your insightful question! We acknowledge that there is no single "conventional" vocabulary size for language models; it varies with the pre-training corpus and the intended use case. A vocabulary size of 32K is nevertheless a common choice for models trained on English-centric corpora, such as Llama-1, Llama-2, and Mistral. Since our work primarily uses the English-centric SlimPajama corpus for pre-training, we adopted the 32K setting employed by these models as the "conventional" vocabulary size. We will modify the statement in the abstract accordingly.
As for GPT4o, we believe its vocabulary size is larger because it is designed to handle multiple languages (e.g., Chinese). This also highlights an important direction for future research: determining the optimal vocabulary size for multilingual models, which we discuss in Appendix B.4.
Our broader goal is to draw the community's attention to the importance of vocabulary size when training language models and to encourage an appropriate allocation of compute to it. Recently, the industry has begun to recognize that previous compute allocations to the vocabulary were too small; for example, Llama has increased its vocabulary size from 32K to 128K. We hope this clarification helps, and we will add this discussion to the revised version.
### Q2: How does Table 2 look if you train on the same number of tokens instead of using the same FLOPs budget?
Answer:
Thanks for your insightful question! As suggested, we also trained the model using the same number of tokens (i.e., 129B tokens) in addition to the same-FLOPs setting. As shown in the following table, with the same number of training tokens, the model with the suggested 43K vocabulary size improves further over the 32K baseline. We will add these results to the revised version.
| **$V$** | **$N_v$** | **$D$** | **Winogrande** | **PIQA** | **OBQA** | **Hellaswag** | **BoolQ** | **ARC-E** | **ARC-C** | **Average** |
|---------|-----------|---------|----------------|----------|----------|---------------|-----------|-----------|-----------|-------------|
| 32K (Baseline) | 0.20B | 129B | 55.7±1.4 | 72.6±1.0 | **34.4**±2.1 | 55.1±0.5 | 60.1±0.9 | 53.4±1.0 | 29.1±1.3 | 51.5 |
| 43K (Ours, same FLOPs) | 0.27B | 125B | **58.7**±1.4 | **72.7**±1.0 | 33.0±2.1 | 55.7±0.5 | 62.3±0.8 | 55.0±1.0 | 31.5±1.4 | 52.7 |
| 43K (Ours, same tokens) | 0.27B | 129B | 58.6±1.4 | **72.7**±1.0 | 33.6±2.1 | **55.8**±0.5 | **62.4**±0.9 | **55.5**±1.0 | **31.5**±1.4 | **52.9** |
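To clarify how the two settings relate, the sketch below shows the compute bookkeeping behind the same-FLOPs row; the non-vocabulary parameter count is a hypothetical value and the ≈6·(N_nv+N_v)·D rule is a standard approximation, so the resulting token count only roughly matches the 125B figure above:

```python
def tokens_at_fixed_flops(flops: float, n_nv: float, n_v: float) -> float:
    # Under FLOPs ~ 6 * (N_nv + N_v) * D, a larger vocabulary (larger N_v)
    # leaves fewer training tokens D affordable at the same budget.
    return flops / (6 * (n_nv + n_v))


n_nv = 2.8e9                              # hypothetical non-vocabulary parameters
budget = 6 * (n_nv + 0.20e9) * 129e9      # budget implied by the 32K baseline row

for label, n_v in [("32K", 0.20e9), ("43K", 0.27e9)]:
    d = tokens_at_fixed_flops(budget, n_nv, n_v)
    print(f"V={label}: N_v={n_v / 1e9:.2f}B -> ~{d / 1e9:.0f}B tokens at equal FLOPs")
```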