## Official Comment to ACs

Dear ACs,

We sincerely appreciate the efforts of all reviewers to improve our work. Except for `Reviewer ytgM`, all reviewers were positive about our paper (with scores 6, 7, 7). `Reviewer ytgM` gave a negative score of 4 accompanied by a short review with two questions. Although we have provided a detailed response to address the comments of `Reviewer ytgM`, we have not received any further feedback. We hope the ACs can help us remind `Reviewer ytgM` to check our rebuttal.

Best,
The Authors

## Reply to Reviewer GGL7 (2)

We are glad that our additional experiments have resolved your concerns, and we will add the experiments to the revision.

---

> Q14. Actually, there was no typo. Your claim is that one dataset is OOD compared to the other: Surely then, it should be surprising that they have textually identical pairs in them?

**A14.** We apologize for our misunderstanding of your comments. As discussed previously, most of the CMMLU and C-EVAL samples (about 99%) are different. Thus, the tiny overlap may not significantly affect the evaluation.

## Reply to Reviewer 4ok1

Thanks again for your positive rating. We will certainly add the above discussions and experiments to the revision.

---

> Q12. However, after re-reading the paper, I noticed that it currently lacks a comparison or at least a discussion of the existing cascade-based approaches [1-4] for assembling LLMs.

**A12.** We have discussed a cascade-based method (i.e., FrugalGPT) in the related work of the paper (Lines 75-77, Lines 182-184). Thanks for bringing the four other cascade-based methods to our attention. We agree that cascade-based methods are one direction to achieve cost-effectiveness when choosing LLMs, but they differ from RouterDC in terms of **settings, inference cost, and tasks**. We discuss the differences between RouterDC and the mentioned cascade-based methods below.

(i) RouterDC considers **a different setting** compared with the cascade-based methods [1-4]. The cascade-based methods usually **assume that the capacity of an LLM depends on its model size**. Their intuitive idea is to query LLMs from weak (small) to strong (large) until a satisfactory answer is obtained, instead of calling the strong LLMs for all queries. RouterDC does not require this assumption and can select a suitable LLM from multiple small or large candidate LLMs. Hence, routing-based methods are more general. Furthermore, even if LLMs are of the same size, they may have different specialized capabilities.

(ii) Cascade-based methods [1-4] may call LLMs **multiple times** for a query (in the worst case, all candidate LLMs need to be called), whereas RouterDC only needs to call the selected LLM **once** at inference/testing time (see the sketch below).

(iii) RouterDC is general and can be used for **generation tasks**, but Model Cascading [1] and Online Cascade Learning [4] are limited to **classification tasks** (e.g., SST2, MRPC, IMDB). Generation tasks are usually more useful and challenging than classification tasks in NLP.

We will include the above discussion and related works in the revision.
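To make the inference-cost contrast in (ii) concrete, here is a minimal schematic sketch (our illustration, not code from any of the papers; `accept` stands for whatever confidence test a cascade uses to decide whether to escalate):

```python
from typing import Callable, Sequence

LLM = Callable[[str], str]

def cascade_infer(llms: Sequence[LLM], accept: Callable[[str], bool], query: str) -> str:
    """Cascade: try LLMs from weak to strong; worst case calls every candidate."""
    answer = ""
    for llm in llms:            # ordered weak -> strong
        answer = llm(query)     # one LLM call per stage
        if accept(answer):      # e.g., a confidence threshold
            break
    return answer

def router_infer(select: Callable[[str], int], llms: Sequence[LLM], query: str) -> str:
    """Routing (as in RouterDC): pick one LLM for the query and call it exactly once."""
    return llms[select(query)](query)
```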
---

#### References

[1] Model Cascading: Towards Jointly Improving Efficiency and Accuracy of NLP Systems. EMNLP 2022.
[2] Language Model Cascades: Token-level Uncertainty and Beyond. ICLR 2024.
[3] Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning. ICLR 2024.
[4] Online Cascade Learning for Efficient Inference over Streams. ICML 2024.

# Rebuttal for RouterDC

## A Gentle Reminder for Reviewer ytgM

Dear Reviewer ytgM,

We sincerely thank you again for your effort to improve our work. We have provided a detailed response to resolve your concerns. We would like to kindly remind the reviewer that the close date of the reviewer-author discussion is approaching. Please let us know if you have any additional questions or comments.

Best,
The Authors

<!-- [1] still addresses classification tasks, unlike the generation tasks we consider. It needs to run LLM inference to compute probabilities as confidence, which is very time-consuming, especially when there are multiple LLMs. [2] addresses generation tasks but still requires LLM inference; it learns a deferral function from the per-token output probabilities of the LLM. [3] addresses generation tasks; the weak model needs multiple inference passes to measure consistency and thus compute confidence. Nie, L., et al. [4] propose learning cascades online for classification tasks, which differs from the generation tasks that RouterDC focuses on. -->

---

## Reply to Reviewer GGL7

Thanks for your further comments and **raising the score**. For the remaining concerns, we address them as follows.

---

> Q11. One of my concerns was that for the data you trained/tested over, the two things were so closely matched that the different datasets would leak the task just due to the wording, etc.

**A11.** We understand the reviewer's concern that certain words in the query may leak the task identity, making it easy for RouterDC to behave like a task classifier. To resolve this concern, we conducted an additional experiment in **a single-task setting**, i.e., we train the router on the training set of HumanEval and evaluate it on the testing set. The single-task setting is an edge case where **all queries may contain the same task information**. Hence, the router needs to learn how to route queries appropriately based on the query itself instead of some possible task information contained in the query. The table below reports the testing accuracy. As can be seen, **RouterDC largely outperforms the best candidate LLM (i.e., dolphin-2.9-llama3-8b) and existing routing methods**, demonstrating that the router can select appropriate LLMs for queries based on query characteristics. We will add the experiment and discussion to the revision, which will definitely improve our work.

$$\begin{array}{lc} \hline \text{Method} & \text{HumanEval} \newline \hline \text{Mistral-7B} & 28.98 \newline \text{MetaMath-Mistral-7B} & 29.80 \newline \text{zephyr-7b-beta} & 22.04 \newline \text{Chinese-Mistral-7B} & 21.43 \newline \text{dolphin-2.6-mistral-7b} & 45.10 \newline \text{Meta-Llama-3-8B} & 26.73 \newline \text{dolphin-2.9-llama3-8b} & 49.39 \newline \hline \text{ZOOTER} & 39.38 \newline \text{CosineClassifier} & 52.45 \newline \text{RouterDC} & \mathbf{56.32} \newline \hline \end{array}$$

---

> Q12. I'm surprised that there are any strict overlaps between the datasets : Interesting!

**A12.** Thanks for your comments! We guess there was a typo in the comment; it should be "there are **NOT** any strict overlaps between the datasets : Interesting!"

> Q13. But that doesn't change the point that the claim that the alternate datasets are 'OOD' seems like an overreach.

**A13.** Thanks for your insightful comments! We appreciate the reviewer raising the concern about the definition of OOD. In the paper, C-EVAL and MBPP are treated as OOD tasks as they have different task distributions or question-answer instructions. We briefly summarize their differences below.

(i) CMMLU and C-EVAL have **different task distributions**. CMMLU contains more culture- and region-related tasks, while C-EVAL has more STEM tasks. Moreover, CMMLU and C-EVAL use **different prompts** to ask multiple-choice questions. C-EVAL uses consecutive underscores to indicate the answer's location, whereas CMMLU employs no special notation for referencing the answer, except for brackets when the answer is within a sentence.

(ii) HumanEval and MBPP assess the code generation proficiency of LLMs from **two distinct perspectives**. A HumanEval query **gives the header** of a Python function and some comments, requiring the LLM to **implement the rest** of the function. On the other hand, an MBPP query **gives an intent** and asks the LLM to **generate the function from scratch**.

To further resolve this concern, we evaluate the learned router on **one more OOD task: JavaScript** [R1], which aims to generate JavaScript code to solve problems. Different from HumanEval, which generates Python code to solve problems, JavaScript can be viewed as **a distant OOD task**. The table below reports the testing accuracy. As can be seen, RouterDC outperforms existing routing methods by a large margin, demonstrating that our RouterDC is more effective in routing queries of the distant OOD task. We will include the additional experiments and discussions in the revision.

$$\begin{array}{lc} \hline & \text{JavaScript} \newline \hline \text{Mistral-7B} & 29.88 \newline \text{MetaMath-Mistral-7B} & 31.83 \newline \text{zephyr-7b-beta} & 11.71 \newline \text{Chinese-Mistral-7B} & 17.68 \newline \text{dolphin-2.6-mistral-7b} & 45.00 \newline \text{Meta-Llama-3-8B} & 37.07 \newline \text{dolphin-2.9-llama3-8b} & 53.84 \newline \hline \text{ZOOTER} & 41.64 \newline \text{CosineClassifier} & 37.32 \newline \text{RouterDC} & \mathbf{48.66} \newline \hline \end{array}$$

---

#### References

[R1] CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. KDD 2023.

## General Reply

Dear Reviewers and ACs,

We sincerely thank all the reviewers and ACs for your insightful and valuable comments. We are delighted that the reviewers find that:

- our work **addresses a significant challenge** in LLM utilization (`Reviewer 4ok1`).
- our method is **intuitive** (`Reviewer ytgM`), **novel/innovative** (`Reviewers cnpT and 4ok1`), and **practical** (`Reviewer GGL7`).
- RouterDC outperforms existing routing methods (`Reviewers ytgM, cnpT, and 4ok1`) and is capable of routing appropriately (`Reviewer GGL7`).

The `Rebuttal PDF` contains the figures and tables that are referred to in the response to reviewers.

Best,
The Authors

## Reply to Reviewer ytgM

Thanks for the valuable comments. We really appreciate your efforts to help us improve our paper. We carefully address your concerns below and sincerely hope that our reply resolves them. Please let us know if you have any follow-up questions.

> Q1. the relation between the two contrastive losses is missing. I am wondering how the two parts contribute to the final performance.

**A1.** Thanks for your insightful suggestions. The relation between the two contrastive losses can be seen in Figure 5 of the paper, which studies the effect of the hyperparameter $\lambda$ in Eq. (5) (i.e., $\mathcal{L}\_{\text{sample-LLM}} + \lambda \ \mathcal{L}\_{\text{sample-sample}}$). We can see that using the two contrastive losses together (i.e., $\lambda=1$) achieves better overall performance than using the sample-LLM contrastive loss alone (i.e., $\lambda=0$). Moreover, the overall performance of RouterDC is not sensitive to $\lambda$ over a wide range ($\lambda \in [0.5, 5]$), making the value of $\lambda$ easy to choose.

To further study the contributions of the two contrastive losses to the final performance, we report the detailed results for both in-distribution (ID) and out-of-distribution (OOD) scenarios in the following tables. Since the sample-LLM loss provides the supervision signal and is essential for training the router, we focus on comparing RouterDC with and without the sample-sample contrastive loss. As can be seen from the tables below, RouterDC (w/ $\mathcal{L}\_\text{sample-LLM}$ + $\mathcal{L}\_\text{sample-sample}$) outperforms RouterDC (w/ $\mathcal{L}\_\text{sample-LLM}$) on average in both scenarios, demonstrating the usefulness of the proposed sample-sample contrastive loss. Moreover, compared with the previous SOTA, using the sample-LLM contrastive loss alone (i.e., RouterDC (w/ $\mathcal{L}\_\text{sample-LLM}$)) performs better (with an average accuracy improvement of $0.75\%$ in the ID scenario and $0.50\%$ in the OOD scenario), while RouterDC (w/ $\mathcal{L}\_\text{sample-LLM}$ + $\mathcal{L}\_\text{sample-sample}$) achieves better performance by a large margin of $2.77\%$ in the ID scenario and $2.35\%$ in the OOD scenario. We will add this discussion to the revision.

$$\begin{array}{lcccccl} \hline \textbf{(in-distribution)} & \text{MMLU} & \text{GSM8K} & \text{CMMLU} & \text{ARC-C} & \text{HumanEval} & \text{Avg} \newline \hline \text{Previous SOTA} & 63.33 & 66.63 & 51.77 & 57.10 & 40.00 & 55.77 \newline \text{RouterDC } (\text{w/ } \mathcal{L}\_\text{sample-LLM}) & 63.21 & 68.87 & 49.27 & 49.43 & 51.84 & 56.52 \ \text{ (+0.75)} \newline \text{RouterDC } (\text{w/ } \mathcal{L}\_\text{sample-LLM}+\mathcal{L}\_\text{sample-sample}) & 61.07 & 70.32 & 51.77 & 58.52 & 51.02 & \mathbf{58.54}\text{ (+2.77)} \newline \hline \end{array}$$

$$\begin{array}{lcccl} \hline \textbf{(out-of-distribution)} & \text{Pre-Algebra} & \text{MBPP} & \text{C-EVAL} & \text{Avg} \newline \hline \text{Previous SOTA} & 35.36 & 43.12 & 52.01 & 43.50 \newline \text{RouterDC } (\text{w/ } \mathcal{L}\_\text{sample-LLM}) & 36.51 & 47.34 & 48.14 & 44.00 \ \text{ (+0.50)} \newline \text{RouterDC } (\text{w/ } \mathcal{L}\_\text{sample-LLM}+\mathcal{L}\_\text{sample-sample}) & 38.81 & 46.80 & 51.93 & \mathbf{45.85} \text{ (+2.35)} \newline \hline \end{array}$$
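For concreteness, a minimal PyTorch-style sketch of the combined objective in Eq. (5), under simplifying assumptions (InfoNCE-style losses over cosine similarities, one in-cluster positive for the sample-sample term); all function and variable names here are ours, not the authors' code:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.07):
    # anchor: (d,), positive: (d,), negatives: (m, d).
    # Pull the anchor toward the positive, push it away from the negatives.
    cands = torch.cat([positive.unsqueeze(0), negatives], dim=0)    # (1+m, d)
    logits = F.cosine_similarity(anchor.unsqueeze(0), cands) / tau  # (1+m,)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

def routerdc_loss(q_emb, llm_emb, top_idx, bottom_idx, pos_query, neg_queries, lam=1.0):
    # q_emb: query embedding from the encoder E(x; w); llm_emb: (T, d) learnable
    # LLM embeddings. Sample-LLM loss: top-performing LLMs act as positives and
    # bottom-performing ones as negatives. Sample-sample loss: an in-cluster
    # query is the positive, out-cluster queries are negatives.
    l_sample_llm = torch.stack([
        info_nce(q_emb, llm_emb[p], llm_emb[bottom_idx]) for p in top_idx
    ]).mean()
    l_sample_sample = info_nce(q_emb, pos_query, neg_queries)
    return l_sample_llm + lam * l_sample_sample                     # Eq. (5)
```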
---

> Q2. Is it possible to use ZOOTER with the sample-sample loss? What do you expect as the result?

**A2.** Thanks for your valuable suggestion! We conducted additional experiments to study whether the proposed sample-sample contrastive loss is useful for ZOOTER. The following tables show the testing accuracy for the ID and OOD scenarios. As can be seen, integrating $\mathcal{L}\_\text{sample-sample}$ into ZOOTER **leads to improvements** of $+1.52\%$ and $+0.81\%$ for ID and OOD, respectively, demonstrating that the proposed sample-sample contrastive loss is **beneficial** for ZOOTER. We will include the experiments and add this discussion to the revision, which will definitely improve our paper.

$$\begin{array}{lcccccl} \hline \textbf{(in-distribution)} & \text{MMLU} & \text{GSM8K} & \text{CMMLU} & \text{ARC-C} & \text{HumanEval} & \text{Avg} \newline \hline \text{ZOOTER} & 60.48 & 66.69 & 45.27 & 53.13 & 44.29 & 53.97 \newline \text{ZOOTER } (\text{w/ } \mathcal{L}\_\text{sample-sample}) & 60.15 & 69.71 & 46.59 & 54.26 & 46.73 & \mathbf{55.49}\text{ (+1.52)} \newline \hline \end{array}$$

$$\begin{array}{lcccl} \hline \textbf{(out-of-distribution)} & \text{Pre-Algebra} & \text{MBPP} & \text{C-EVAL} & \text{Avg} \newline \hline \text{ZOOTER} & 34.44 & 41.10 & 44.95 & 40.16 \newline \text{ZOOTER } (\text{w/ } \mathcal{L}\_\text{sample-sample}) & 36.05 & 39.84 & 47.03 & \mathbf{40.97} \ \text{ (+0.81)} \newline \hline \end{array}$$
## Reply to Reviewer cnpT

We sincerely thank you for the detailed and positive comments. We take all comments seriously and do our best to address every concern raised. Please let us know if you have any follow-up questions.

> Q1. In Line 157, the authors claim that "The reason is that some similar queries can have dissimilar embeddings and may be routed to different LLMs." Evidence should be provided to support this claim.

**A1.** Thanks for your suggestion. Figure R1(a) in the `Rebuttal PDF` shows a t-SNE visualization of the training query embeddings extracted by the encoder trained with $\mathcal{L}\_\text{sample-LLM}$. As can be seen, query embeddings belonging to different tasks are roughly mixed together. We also provide two GSM8K queries below, both of which require basic calculation of shopping costs. Their embeddings have very low similarity (only $-0.4589$) when the router is trained with $\mathcal{L}\_\text{sample-LLM}$ alone.

```
Mary does her grocery shopping on Saturday. She does her shopping only at a specific store where she is allowed a credit of $100, which must be paid in full before her next shopping trip. That week she spent the full credit limit and paid $15 of it on Tuesday and $23 of it on Thursday. How much credit will Mary need to pay before her next shopping trip?
```

```
Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?
```

After integrating $\mathcal{L}\_\text{sample-sample}$, the training query embeddings have a clear cluster structure (Figure R1(b)). Moreover, the similarity between the above queries increases to $0.9982$. We will add this discussion to the revision.
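The similarity numbers above are cosine similarities between encoder outputs; schematically (a sketch, with `encoder` standing in for the trained query encoder $\mathcal{E}(x;w)$):

```python
import torch.nn.functional as F

def query_similarity(encoder, q1: str, q2: str) -> float:
    # Cosine similarity between two query embeddings; `encoder` maps a string
    # to a 1-D torch tensor.
    e1, e2 = encoder(q1), encoder(q2)
    return F.cosine_similarity(e1.unsqueeze(0), e2.unsqueeze(0)).item()

# For the two GSM8K queries above: about -0.4589 when trained with the
# sample-LLM loss alone, and about 0.9982 after adding the sample-sample loss.
```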
---

> Q2. For the OOD setting (Table 2), RouterDC fails to beat the best individual LLMs on all tasks (e.g., 38.81 vs 39.72 on PreAlgebra).

**A2.** OOD tasks are much more challenging than in-distribution (ID) tasks. Though our RouterDC fails to achieve the highest accuracy on every task, RouterDC can assemble the complementary abilities of the candidate LLMs and achieve the best overall performance (an improvement of 1.90%). Besides, RouterDC performs comparably to the best candidate LLMs on all tasks (i.e., 38.81 vs 39.71 for PreAlgebra, 46.80 vs 47.34 for MBPP, and 51.93 vs 52.01 for C-EVAL). Moreover, RouterDC outperforms existing routing methods by a large margin overall. We will add this discussion to the revision.

---

> Q3. RouterDC may require a large amount of labeled data to train the router.

**A3.** To resolve this concern, we conducted an experiment to study the performance of RouterDC with different numbers of training samples per task. As can be seen from Figure R2 in the `Rebuttal PDF`, the testing accuracy saturates quickly, indicating that **a small number of samples is sufficient** for learning the router (e.g., 100 samples per task). Moreover, with only 30 samples per task, RouterDC already outperforms the previous SOTA overall (57.21 vs 55.77), demonstrating that our RouterDC does not require a large amount of labeled data for training. We will include the experiments and data efficiency analysis in the revision.

---

> Q4. How about the performance of RouterDC without the sample-sample contrastive loss?

**A4.** Thanks for the suggestion. We compare RouterDC w/ and w/o the sample-sample loss against the previous SOTA in the following tables. As can be seen, in both ID and OOD scenarios, using the sample-LLM loss alone (i.e., RouterDC (w/ $\mathcal{L}\_\text{sample-LLM}$)) performs better than the previous SOTA (with an average accuracy improvement of $0.75\%$ in the ID scenario and $0.50\%$ in the OOD scenario). We will add this discussion to the revision.

$$\begin{array}{lcccccl} \hline \textbf{(in-distribution)} & \text{MMLU} & \text{GSM8K} & \text{CMMLU} & \text{ARC-C} & \text{HumanEval} & \text{Avg} \newline \hline \text{Previous SOTA} & 63.33 & 66.63 & 51.77 & 57.10 & 40.00 & 55.77 \newline \text{RouterDC } (\text{w/ } \mathcal{L}\_\text{sample-LLM}) & 63.21 & 68.87 & 49.27 & 49.43 & 51.84 & 56.52 \ \text{ (+0.75)} \newline \text{RouterDC } (\text{w/ } \mathcal{L}\_\text{sample-LLM}+\mathcal{L}\_\text{sample-sample}) & 61.07 & 70.32 & 51.77 & 58.52 & 51.02 & \mathbf{58.54}\text{ (+2.77)} \newline \hline \end{array}$$

$$\begin{array}{lcccl} \hline \textbf{(out-of-distribution)} & \text{Pre-Algebra} & \text{MBPP} & \text{C-EVAL} & \text{Avg} \newline \hline \text{Previous SOTA} & 35.36 & 43.12 & 52.01 & 43.50 \newline \text{RouterDC } (\text{w/ } \mathcal{L}\_\text{sample-LLM}) & 36.51 & 47.34 & 48.14 & 44.00 \ \text{ (+0.50)} \newline \text{RouterDC } (\text{w/ } \mathcal{L}\_\text{sample-LLM}+\mathcal{L}\_\text{sample-sample}) & 38.81 & 46.80 & 51.93 & \mathbf{45.85} \text{ (+2.35)} \newline \hline \end{array}$$

---

> Q5. In Fig. 9, it seems that no samples are routed to the Chinese-Mistral-7B model, why?

**A5.** Thanks for your insightful question. We can see from Table 1 that Chinese-Mistral-7B performs poorly on all tasks and has the worst overall performance, suggesting that its specialized ability may be covered by other candidate LLMs. Hence, no samples are routed to Chinese-Mistral-7B, which also verifies that RouterDC can select suitable LLMs for queries. We will add this discussion to the revision.

---

> Q6. Can you visualize the embeddings of training samples extracted by the encoder $\mathcal{E}(x;w)$ of RouterDC?

**A6.** Thanks for the suggestion. Figure R1(b) in the `Rebuttal PDF` shows the t-SNE visualization of the training query embeddings extracted by the learned encoder of RouterDC. As can be seen, the embeddings exhibit a clear cluster structure.

## Reply to Reviewer GGL7

Thanks for your efforts and useful comments. We take all comments seriously and hope that our reply can resolve your concerns. Please let us know if you have any follow-up questions.

---

> Q1: the meat of the router task is to see whether it's possible to classify these different tasks via a small LM and embeddings
>
> The router being learned here is a task-wise classifier.

**A1.** Sorry for the confusion. Though Figure 9 shows that RouterDC routes most of the queries from the same task to the same LLM, RouterDC is a query-based router instead of a task-wise classifier.

- RouterDC is **not a task-wise classifier** that selects an LLM for each task. The performance of a task-wise classifier is bounded by that of the top-performing LLM. However, Table 1 shows that RouterDC can beat the top-performing LLMs on GSM8K, ARC-C, and HumanEval, suggesting that RouterDC is not simply a task-wise classifier. Furthermore, Figure 9 shows that RouterDC does not always route all queries from the same task to the same LLM. For example, RouterDC assigns 96% and 4% of GSM8K queries to MetaMath-Mistral-7B and dolphin-2.9-llama3-8b, respectively.
- RouterDC is **a query-based router**. All training queries are merged together to learn the router. At the testing stage, the learned router assigns the testing query to a suitable LLM based on the similarity between the query and LLM embeddings. Neither the sample-LLM loss nor the sample-sample loss requires the task identity.
- The previous work **LoraRetriever (ACL 2024) is exactly a task-wise classifier**. As the task identity is unavailable in practice, the cluster label is used instead. Tables 1 and 2 show that LoraRetriever (clustering) is worse than RouterDC, indicating that RouterDC routes queries more effectively.
---

> Q2: Sample-Sample Contrastive Loss : Smells like a post-experiment fix, rather than a principled choice.
> ...
> mDeBERTaV3-base is used to determine which samples 'belong' to which clusters

**A2.** We agree that the sample-sample loss is designed to deal with the training instability observed in early experiments. Contrastive learning is an effective technique for retaining the semantic similarity of sentences in the embedding space (SimCSE, EMNLP 2021; Sentence-T5, ACL 2022). Hence, we introduce the sample-sample loss to encourage the encoder to generate similar embeddings for semantically similar queries. Note that at inference (testing) time, RouterDC does not need to cluster queries.

---

> Q3. Seems unclear whether top-3 is so different from top-4.

**A3.** Thank you for pointing this out. We will fix it in the revision: "the top-three LLMs have significantly higher scores than the bottom-three LLMs".

---

> Q4. how disjoint (really) are (i) CMMLU and C-EVAL; (ii) HumanEval and MBPP?

**A4.** Thanks for the question. (i) A detailed comparison between **CMMLU and C-EVAL** is given in Appendix A of the CMMLU paper (arXiv:2306.09212), showing that they have different distributions and contain only 74 shared samples (**about 1%** of the CMMLU dataset). (ii) We check the overlap between **HumanEval and MBPP** by string matching and find that they are **completely disjoint**.
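The disjointness check in (ii) is a simple exact-match comparison; roughly (our sketch; the whitespace normalization is an assumption about the check, not a detail from the paper):

```python
def strict_overlap(dataset_a: list[str], dataset_b: list[str]) -> set[str]:
    # Queries that appear verbatim (up to whitespace) in both datasets;
    # empty for HumanEval vs. MBPP, per A4.
    normalize = lambda s: " ".join(s.split())
    return {normalize(x) for x in dataset_a} & {normalize(y) for y in dataset_b}
```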
---

> Q5. isn't this proving that (for instance) Meta-Llama3-8B is being consistently chosen for CMMLU?

**A5.** Yes, Meta-Llama-3-8B is consistently chosen for CMMLU queries (with few exceptions), since it performs significantly better than the other candidate LLMs on CMMLU (Table 1), confirming that the learned router can route queries to suitable LLMs.

---

> Q6. L46: "we cluster the training queries" - intra-group vs inter-group (within batch?)

**A6.** Sorry for the confusion. We cluster **all** the training queries into several groups. At each iteration, for a given query, we sample its in-group query and out-group queries from the same mini-batch (see the sketch after Q7 below). We will clarify this in the revision.

---

> Q7. Is it a coincidence that N=5 matches the number of training tasks?

**A7.** Sorry for the confusion. The number of clusters $N$ does not have to equal the number of training tasks. We have conducted an experiment in the paper to study the sensitivity of $N$ (Lines 255-258). As shown in Figure 6, **RouterDC is insensitive to $N$ over a wide range ($N\in [4, 9]$)**, where the number of tasks is 5. In practice, we can choose $N$ by grid search using K-fold cross-validation. We will clarify this in the revision.
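Putting A6 and A7 together, a hypothetical sketch of the clustering and in-/out-group sampling (names and defaults are ours; per the review thread, mDeBERTaV3-base embeddings determine the clusters):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_all_queries(embeddings: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    # Cluster ALL training queries once, before router training (A6).
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

def sample_groups(labels: np.ndarray, batch: np.ndarray, anchor: int,
                  rng: np.random.Generator, n_out: int = 3):
    # At each iteration, within a mini-batch: one in-group query (same cluster
    # as the anchor) and a few out-group queries (other clusters).
    same = batch[(labels[batch] == labels[anchor]) & (batch != anchor)]
    diff = batch[labels[batch] != labels[anchor]]
    return rng.choice(same), rng.choice(diff, size=min(n_out, len(diff)), replace=False)
```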
---

> Q8. Appendix B : Is it troubling that task-identity is roughly the same as pretrained cluster labels? Perhaps the task prompts are a give-away.
>
> "increasing N ... saturates quickly." (ditto)

**A8.** Sorry for the confusion. To study the relationship between task identity and cluster labels, we construct their confusion matrix in Table R1 of the `Rebuttal PDF`. As shown, some task identities differ from the cluster labels. For example, the HumanEval queries are grouped into three clusters. We also construct the confusion matrices for $N=4$ and $N=9$ in Tables R2 and R3 of the `Rebuttal PDF`, respectively. Again, queries from the same task can be grouped into different clusters, and a cluster can be shared across different tasks. We will add this discussion to the revision.

---

> Q9. How could this approach work in a chat context? Is it just for single queries?

**A9.** Great suggestion! Though RouterDC is designed as a query-based router, the framework can be extended to the chat context, e.g., by selecting LLMs based on the recent conversation. We will study this in our future work.

---

> Q10. doesn't seem particularly novel

**A10.** The novelty of RouterDC lies in its **two contrastive losses**: (i) the sample-LLM loss pulls the query embeddings closer to the embeddings of the top-performing LLMs while pushing them away from the embeddings of the bottom-performing LLMs; and (ii) the sample-sample loss improves training stability. The novelty is also recognized by Reviewers cnpT (**novel**) and 4ok1 (**innovative**).

## Reply to Reviewer 4ok1

We sincerely thank you for the detailed and positive comments. We carefully address your concerns below. Please let us know if you have any follow-up questions.

---

> Q1. parameter efficiency, scalability, training cost is likely significant when the number of LLMs scales up (each training query needs to be evaluated on each LLM), and retraining is required when any new LLMs are incorporated.
>
> Table 1 lacks a comparison ... training time ... parameter efficiency

**A1.** Great suggestions! We compare RouterDC with other routing methods in terms of the number of parameters and training time in Table R4 of the `Rebuttal PDF`. As shown, all methods are very efficient in both computation and parameters, requiring only about **28 minutes** of training time and only **86M parameters**.

Moreover, RouterDC is **data-efficient** in training. Though each training query needs to be evaluated on all candidate LLMs, the training set can be very small. Figure R2 in the `Rebuttal PDF` shows the performance of RouterDC with different numbers of training samples per task. We can see that the testing accuracy saturates quickly, indicating that **a small number of samples is sufficient** for learning the router. Moreover, with only 30 samples per task, RouterDC already outperforms the previous SOTA overall (57.21 vs 55.77). Thus, the total number of LLM queries needed for training is affordable.

As RouterDC requires very little training time, retraining the router when new LLMs are incorporated would not be a practical issue. Moreover, learning the router incrementally without retraining is also a practical direction for future work. We will include the computation and data efficiency analysis in the revision.

---

> Q2. incorporate LLM costs into the loss function.
>
> Evaluation results on RouterBench

**A2.** Thanks for your suggestions. Our primary focus is training a router to select suitable LLMs for queries; hence, performance is used as the main metric for learning the router. To resolve the reviewer's concerns, we further conducted experiments on two tasks of RouterBench (i.e., MBPP and GSM8K) and considered the LLM costs. We modify the score $s_i^{(t)}$ to $s_i^{(t)}+c_i^{(t)}$, where $c_i^{(t)}$ is the cost of query $x_i$ using the $t$-th LLM. Figure R3 in the `Rebuttal PDF` shows that RouterDC is more cost-effective than CosineClassifier and ZOOTER on both tasks.
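One natural instantiation of the modified score (the rebuttal writes $s_i^{(t)} + c_i^{(t)}$; the sign and scaling of the cost term below are our assumption for illustration):

```python
import numpy as np

def cost_adjusted_scores(perf: np.ndarray, price: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    # perf[i, t]: score of query i answered by LLM t; price[i, t]: its API cost.
    # Taking c = -alpha * price makes cheaper LLMs preferable at equal quality.
    return perf - alpha * price
```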
---

> Q3. including incapable LLMs can lead to unnecessary computational overhead and performance degradation.
> There is a lack of analysis for this issue.

**A3.** Thanks for your insightful comments. We agree that incapable LLMs are unnecessary. For example, Figure 9 shows that very few queries are routed to Chinese-Mistral-7B and dolphin-2.6-mistral-7b; thus, removing them can reduce computation without sacrificing performance. In practice, one can use a hold-out validation set to analyze the usage of candidate LLMs and offload those LLMs that are rarely used.

---

> Q4. improvements after incorporating sample-sample contrastive loss seem marginal. Can this be further analyzed or explained?

**A4.** Sorry for the confusion caused by Figures 5 and 7 due to the large y-axis range. In fact, RouterDC (w/ $\mathcal{L}\_\text{sample-LLM}$ + $\mathcal{L}\_\text{sample-sample}$) achieves better average accuracy than RouterDC (w/ $\mathcal{L}\_\text{sample-LLM}$) by a significant margin of $2.02\%$ (Table 4 in Appendix A, $\lambda=1$ vs $\lambda=0$). We will clarify this in the revision.

---

> Q5. On OOD tasks, RouterDC performs worse than the best-performing LLM on certain individual tasks. Is there any analysis of the reasons?

**A5.** OOD tasks are much more challenging. Though RouterDC fails to achieve the highest accuracy on all OOD tasks, RouterDC can assemble the complementary abilities of the candidate LLMs and achieve the best overall performance. Specifically, RouterDC performs comparably to the best candidate LLMs on all tasks (i.e., 38.81 vs 39.71 for PreAlgebra, 46.80 vs 47.34 for MBPP, and 51.93 vs 52.01 for C-EVAL). Moreover, RouterDC outperforms existing routing methods by a large margin. We will add this discussion to the revision.

---

> Q6. comparison of normalized average score.

**A6.** Thanks for your insightful suggestion. We normalize the score of method $\mathbb{A}$ by

$$\frac{\text{Acc on task } t \text{ using method } \mathbb{A}}{\text{Acc on task } t \text{ using dolphin-2.9-llama3-8b}} \times 100\%.$$

Tables R5 and R6 in the `Rebuttal PDF` report the normalized scores for the ID and OOD scenarios, respectively, showing that RouterDC outperforms existing routing methods by a large margin.

---

> Q7. no need to include RandomSelect

**A7.** Thanks for your suggestion. We will remove it accordingly in the revision.

---

> Q8. In the "Routing to Different Numbers of LLMs" evaluation, why were LLMs added in the chosen order? Based on Table 1, adding them in performance-descending order might yield smaller accuracy enhancements.

**A8.** Thanks for your comments. The order in which LLMs are added in Figure 13 is the same as the order of the candidate LLMs (top to bottom) in Table 1. We agree that the order affects the accuracy improvement; e.g., adding an incapable LLM yields little or no improvement.

---

> Q9. Would an LLM-dropping mechanism improve overall performance?

**A9.** Good suggestion. As incapable LLMs are unnecessary, one can greedily drop them according to validation performance (see the sketch at the end of this reply). For example, by dropping Mistral-7B and Chinese-Mistral-7B, the average accuracy increases from 58.54 to 58.67.

---

> Q10. Did you cluster queries within one benchmark or across several benchmarks?

**A10.** Sorry for the confusion. We cluster all training queries from **several** benchmarks together. We will clarify this in the revision.

---

> Q11. broader implications: scalability and cost efficiency

**A11.** Please see our reply to Q1 and Q2.
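As a footnote to Q9 above, a hypothetical sketch of the greedy dropping procedure (ours; `val_acc` stands for retraining and evaluating the router on a hold-out validation set with the given candidates):

```python
from typing import Callable, Sequence

def greedy_drop(llms: Sequence[str], val_acc: Callable[[Sequence[str]], float]) -> list[str]:
    # Repeatedly drop the candidate LLM whose removal most improves validation
    # accuracy; stop when no single removal helps.
    kept, best = list(llms), val_acc(llms)
    while len(kept) > 1:
        trials = [(val_acc([m for m in kept if m != d]), d) for d in kept]
        acc, drop = max(trials)
        if acc <= best:
            break
        best, kept = acc, [m for m in kept if m != drop]
    return kept
```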
