<!-- [P]: pointed out, [W]: weakness, [Q]: question -->
<!-- Thank you so much for your great efforts to thoughtfully manage the reviewing process of multiple papers. We are writing to raise a serious concern regarding the review from the `Reviewer sTSP`. -->
### Confidential Comment to AC
Dear AC, SAC and PC,
Thanks for your hard work overseeing the review process. We are writing to express our deep concerns regarding the review from `Reviewer sTSP`, which falls far below the standards of ICML and calls into question the reviewer’s qualification to fairly evaluate submissions in this field.
## Summary of concern
Reviewer sTSP appears to have **entirely failed to understand the basic problem setup and the technical content** of our submission, explicitly stating "_I struggled to understand..._" and "_I didn't go through the proofs..._". Many of the comments focus on basic mathematical notation and standard ML concepts that are clearly defined in the manuscript, suggesting a superficial or incomplete reading of the paper—and a lack of the necessary expertise. In contrast, **all three other reviewers clearly understood the paper**, provided in-depth and constructive comments, and recommended our submission at 3 (Weak Accept).
Given this disparity and Reviewer sTSP's complete lack of background in MLLMs, we believe the reviewer should have flagged the paper as outside their area of expertise rather than submitting a review that was both unprofessional and unconstructive. This experience has been deeply disheartening to us as authors.
## Evidence of issues in review
While there are numerous issues in the review, we highlight the most concerning ones below:
**(1) Reviewer sTSP has no background in MLLM**
Reviewer sTSP's comments reveal a clear lack of familiarity with the field of MLLMs, **which forms the core focus of our paper and raises serious doubts about their ability to judge our work effectively**. For instance, the reviewer questions the meaning of "multimodal language model" and incorrectly assumes our models take only visual input, despite our explicit definition in the preliminaries that MLLMs process both visual and textual inputs (Lines 92-109). As a result, the reviewer fails to understand even our problem setup, which the other reviewers explicitly recognized and appreciated:
- Reviewer vm1D: "_Understanding MLLMs under distribution shifts seems a critical research problem_.”
- Reviewer cWre: "_This paper provides a theoretical framework to analyze and understand the impact of distribution shifts on MLLM performance via EMI._"
**(2) Reviewer sTSP struggles to understand basic ML concepts such as data distribution**
Reviewer sTSP repeatedly expresses confusion over concepts that are not only clearly defined in the manuscript, but also well-understood and acknowledged by other reviewers. For example, Reviewer sTSP questioned "_How are $P_{X}$ and $P_{Y}$ defined? Are these trained models?_..." However, we clearly introduced $P_{\mathbf{X}Y}$ as a data distribution, with $P_{\mathbf{X}}$ and $P_{Y}$ explicitly described as marginals (Lines 92-103). In contrast,
- Reviewer cWre: “_The theoretical analyses are clearly presented._”
- Reviewer vm1D: “_The paper is well constructed and easy to read._”
- Reviewer o1Fb: “_The derivations appear to be sound._”
**(3) Reviewer sTSP didn't engage with our core theoretical and empirical contributions**
The reviewer does not comment on any of our core theorems (e.g., Theorems 4.5 and 4.6 on the EMID upper bound), derivations, or assumptions—all of which are clearly laid out and discussed. Similarly, the reviewer **does not mention a single result, figure, or experiment**, despite our paper including 61 real-world distribution shift scenarios, multiple models, and correlation analyses linking EMI and win rate. This stands in sharp contrast with the other reviewers:
- Reviewer vm1D highlights: “_Through extensive experiments... the authors validate their framework._”
- Reviewer cWre: “_The empirical findings strongly support most of the theoretical conclusions._”
- Reviewer o1Fb: “_The connection between EMI and win rate provides a practical and efficient alternative for model evaluation._”
## Final remark
In light of the situation, we request a careful re-evaluation of the concerns raised by Reviewer sTSP, considering the positive feedback from other reviewers and the thoroughness of our revisions and responses. We sincerely hope our manuscript can receive a fair and balanced assessment, given its theoretical and empirical significance for the field.
Thanks again for your attention and service to the community.
Sincerely,
The authors
----
### Xuefeng's AC note
Dear AC/SAC/PC,
Thank you for taking the time to read our message! We would like to bring to your attention the following situation:
- Firstly, Reviewer sTSP appears to lack familiarity with the current landscape of MLLMs and with probability and machine learning theory, and raises incorrect criticisms of the definitions of MLLMs and their input/output distributions. These core concepts are clearly presented in our paper, e.g., **Section 2 (L92-109)**, and are recognized by multiple reviewers (o1Fb, cWre, vm1D) as well.
- Secondly, Reviewer sTSP questioned our motivation for studying visually-conditioned LLMs, a criticism that is ungrounded. In fact, our paper offers a formal framework for understanding MLLMs under distribution shifts, a problem that is unexplored yet essential for reliable artificial intelligence in the wild. Our theory characterizes how MLLM performance under distribution shift relates to the divergences of the marginal distributions and the conditional dependencies. This goes beyond the simplistic framing suggested by Reviewer sTSP, who incorrectly reduces the problem to a basic comparison between conditional and unconditional sequence models.
- Our extensive experiments comprehensively examine 34 synthetic and 27 natural distribution shift scenarios on 4 representative MLLMs, confirming strong correlations between our theory and practice. This broad evaluation surpasses prior works, which typically consider fewer shifts or models. However, this key contribution is largely overlooked by Reviewer sTSP, which renders the review unfair and subjective.
To sum up, we believe the soundness of this review is questionable. We have clarified all of the concerns in our rebuttal and will incorporate the changes in our revised paper, but we thought we should bring this situation to your attention.
Thank you for your valuable judgment and service.
Sincerely,
Authors
<!-- - Therefore, the problem itself compared unconditional sequence models is not simply *to what degree a conditional distribution (given some prefix) differs from the non-conditional distribution.* as stated by Reviewer sTSP. -->
<!-- Many of the reviewer’s comments focus on basic mathematical notations that are clearly defined in the manuscript, suggesting that the reviewer may not have read the paper carefully—or may lack the background knowledge necessary to evaluate it appropriately. We detail explicit examples of these issues below. -->
<!--
In contrast, **all other reviewers clearly understood the paper** and provided in-depth comments, with Reviewer vm1D explicitly stating: _“The paper is well constructed and easy to read.”_ Conversely, Reviewer sTSP repeatedly expresses confusion (“I struggled to understand…”, “I didn’t go through the proofs…”), suggesting a **lack of preparedness or subject-matter expertise** to provide a fair review.
-->
<!-- **Reviewer sTSP has no background in MLLM**. The reviewer appears unfamiliar with terminology such as "multimodal language model" and the corresponding literature—core concepts central to our method. In the current research literature, the term MLLMs is commonly used to denote a large language model processing combined visual and textual inputs. Although Section 2 (L92-109) explicitly defines multimodal inputs for MLLMs, Reviewer sTSP misunderstood our clearly stated setup by suggesting that the model uses visual input alone. -->
<!-- * **Lack of understanding of basic ML concepts such as data distribution**. The reviewer questioned "_How are $P_{X}$ and $P_{Y}$ defined? Are these trained models?_..." However, our manuscript explicitly states (Lines 92-103) that $\mathbf{X}=(X_{v},X_{t})$ denotes sequences of tokens integrating visual and textual queries, while $Y$ represents response tokens. We clearly introduced $P_{\mathbf{X}Y}$ as a data distribution, with $P_{\mathbf{X}}$ and $P_{Y}$ explicitly described as marginals. This confusion about foundational ML concepts is alarming. -->
<!--
* **Factually incorrect critism**. The reviewer incorrectly claimed "_In the motivation with visual shifts, I would put the definitions of the variables before the shift examples, ..., Line 101: instruction tuning has not been introduced._" Contrary to this, variable definitions (L92-102) and instruction tuning (Lines 104 onwards) were explicitly and clearly presented in the preliminary section preceding the motivation.
-->
<!-- ## Mismatch with other reviewers
1. **Understanding of basic definitions**. Reviewer sTSP repeatedly states confusion about basic definitions, but all of these concepts are well-understood by the other reviewers. Reviewer cWre and vm1D explicitly acknowledged the clarity of our definitions and the validity of theoretical derivations. Reviewer o1Fb even confirmed that "the derivations appear to be sound" and noted that "the connection between EMI and win rate provides a practical and efficient alternative for model evaluation."
2. **Comment on empirical evaluation**. Reviewer sTSP does not comment on any single figure, result, or evaluation method. In contrast, Reviewer vm1D says “Through extensive experiments... the authors validate their framework.”, Reviewer cWre says: “The empirical findings strongly support most of the theoretical conclusions.”
3. **Acknowledge on the motivation**. Reviewer sTSP questions the motivation behind focusing on visually-conditioned LLMs. However, other reviewers clearly acknowledge the importance of our setting, e.g., Reviewer vm1D: “Understanding MLLMs under distribution shifts seems a critical research problem.” and Reviewer cWre: “This paper provides a theoretical framework to analyze and understand the impact of distribution shifts on MLLM performance via EMI.” -->
<!--
As authors, we are deeply disheartened by the lack of care and rigor evident in this review. It is clear that Reviewer sTSP did not engage with the material in good faith or with the technical competence required to review a submission in this area. The misunderstanding of basic ML concepts and failure to follow clearly presented content calls into question the qualification and responsibility as a reviewer.
-->
<!--We fully respect the peer review process and appreciate the efforts of all reviewers. We also appreciate `Reviewer sTSP`'s attempt to understand our paper, and some suggestions for clearer presentation. **However, -->
<!-- **Due to the careless reading and lack of background knowledge of `Reviewer sTSP`, the review from `sTSP` does not engage with the main contributions of our work and fails to provide highly relevant feedback.** It raises concerns about the fairness and expertise of the review process in this case. -->
<!-- > Reference
* A Survey on Multimodal Large Language Models, Yin et al. 2024
* Hallucination of Multimodal Large Language Models: A Survey, Bai et al. 2024
* BLINK: Multimodal Large Language Models Can See but Not Perceive, Fu et al. 2024
* On the Out-Of-Distribution Generalization of Multimodal Large Language Models, Zhang et al. 2024
* Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al. 2024
* Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
-->
---
### Rebuttal to o1Fb
<!-- Raw review: [W] Evaluations are only made on LLaVA v1.5 and LLaVA NeXT. It could benefit from involving the SOTA and representative MLLMs like Qwen2.5-VL and InternVL2.5 -->
> _A1. Applicability on SOTA MLLMs._
Thanks for the excellent suggestion! Following your comment, **we additionally conduct the full evaluation with`Qwen2.5-VL-7B-Instruct` and `InternVL2.5-7B`**.
Specifically, we first evaluate the official release of `Qwen2.5-VL-7B-Instruct` and `InternVL2.5-7B` models on 35 synthetic shifts scenarios, then compute empirical EMI estimates over the models' responses. We perform a Pearson correlation analysis between empirical estimates of EMI difference (EMID) and its upper bound. **Consistent with our existing finding, we observe a strong correlation between EMID and its theoretical upper bound in both models**.
|Model|Pearson $r$|$p$-val|
|-|-|-|
|InternVL2.5-7B|0.67|0.00|
|Qwen2.5-VL-7B|0.81|0.00|
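For reference, a minimal sketch of this correlation analysis; the arrays below are toy stand-ins for the per-scenario estimates, not our actual numbers:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Toy stand-ins for the 35 per-scenario estimates; in practice both arrays
# come from the empirical EMI estimation pipeline over model responses.
bound = rng.uniform(1.1, 1.8, size=35)               # upper-bound estimate per shift
emid = 0.1 * bound + rng.normal(0.0, 0.02, size=35)  # correlated EMID estimate (toy)

r, p = pearsonr(emid, bound)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```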
<!-- We then measure the Spearman rank correlation and Kendall's tau between EMI estimates and win rates across all 35 scenarios. Results are presented in the Tables below. **The correlations clearly demonstrate the alignment between EMI and win rates for these SOTA models**.
|Model|Spearman $\rho$|Spearman $p$-val|Kendall $\tau$|Kendall $p$-val|
|-|-|-|-|-|
|InternVL2.5-7B|0.77|0.00|0.57|0.00|
|Qwen2.5-VL-7B|0.38|0.03|0.27|0.02|
-->
<!-- Raw review: [Q] How do the assumptions made in the theoretical framework impact the applicability of the results to more complex real-world scenarios? Can these assumptions be relaxed in future work?-->
<!--The validity of this assumption depends on how well our MLLM approximates (in terms of KL divergence) the actual conditional distribution $P_{Y|\mathbf{X}}$ that win rate will be computed on. -->
<!-- * Although our empirical validation shows consistent correlation between EMI and win rate, we generally can not guarantee that MLLM approximates arbitrary conditional distributions $Q_{Y|\mathbf{X}}$ encountered during evaluation time, where the approximation error ($\epsilon$) becomes large. -->
<!-- * As the visual instruction tuning explicitly reduces this KL divergence through Eq. (1) in our manuscript, this assumption is safely held on all the in-distribution (ID) data. -->
<!-- Raw review: [Q] The paper mentions the potential use of the upper bound of EMID as a regularizer during post-training or test-time adaptation. Can you provide more details on how this could be implemented and its potential impact on model robustness?-->
> _A2. How do assumptions impact the applicability of proposals to more complex cases?_
This is a very insightful question.
* First, **to claim the closeness between EMI and win rate (Theorem 4.4), we assumed $\epsilon$-representation capacity of the MLLM**.
* $\epsilon$-representation capacity essentially reflects the model’s ability to approximate the target task’s conditional distribution, meaning that the model can approximate this distribution with a KL divergence no greater than $\epsilon$.
* **Given the strong expressive and approximation capabilities of recent large-scale MLLMs, this assumption is generally reasonable in practice.**
* Moreover, there are numerous efforts that improve the diversity of an instruction tuning dataset and robustness of the visual encoder of MLLM [1,2,3,4], which make an MLLM's learned distribution robustly approximate conditional distributions encountered during evaluation time.
* As we continually pursue richer dataset construction strategies and stronger visual recognition in the encoder, the $\epsilon$-representation capacity assumption becomes reasonable in increasingly complex cases.
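* For concreteness, the $\epsilon$-representation capacity assumption can be written as $\min_{\theta\in\Theta}\mathbb{E}_{x\sim P_{\mathbf{X}}}[D_{\rm KL}(P_{Y|\mathbf{X}=x}\,\|\,P_{\theta}(\cdot|x))]\leq\epsilon$, i.e., the model family contains a parameter whose conditional distribution matches the target within an average KL divergence of $\epsilon$.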
* Second, **to claim the relationship between EMID and its upper bound in a simple case (Theorem 4.5), we assumed consistent conditional distributions** over $X_{v}|X_{t}$, $X_{t}|X_{v}$, and $Y|\mathbf{X}$.
* This assumption zeros out the discrepancy between the conditional distributions of ID and OOD data; if the conditional distributions differ substantially between ID and OOD datasets in complex real-world scenarios, the assumption makes our upper bound underestimate the performance gap, i.e., EMID.
* However, we highlight that the strong correlations between EMID and this upper bound have been observed through 61 distribution shift scenarios, implying the validity of our upper bound to quantify EMID.
* **Meanwhile, we also provide a bound for general cases to address non-consistent conditional distributions in Theorem 4.6**. This general-case bound can also be empirically estimated using a procedure similar to that of its simpler counterpart.
* Therefore, as mentioned in our manuscript (lines 302-303), we recommend choosing the proper bound based on the degree of knowledge of the data-generating process for each dataset.
> _A3. Details for practical implications of EMID upper bound_.
While we confined the scope of this project to _presenting the first theoretical framework to quantify MLLM's performance gap_, we further provide **a potential application of EMID upper bound, instruction tuning with regularization, for this rebuttal**.
Without loss of generality, we can treat the input sequence $\mathbf{X}=(X_{v},X_{t})$ as a sequence of intermediate representation vectors of the MLLM, i.e., $\mathbf{Z}=(Z_{v},Z_{t})$, and can also assume that $P_{\theta}(\cdot|\cdot)$ maps this representation to responses, i.e., $P_{\theta}:\mathbf{Z}\rightarrow Y$. This induces a modified bound with the representation variable $\mathbf{Z}$ rather than the raw data input $\mathbf{X}$.
We instantiate this modified EMID bound in an instruction-tuning setup below, where we set the 24th layer's hidden states as $\mathbf{Z}$, and adopt MMD [5] as a differentiable estimator for the JSD terms together with the average empirical model output entropy. We provide evaluation results with LLaVA-v1.5-7B on in-distribution (ID) and visual (V), text (T), and joint (J) synthetic shifts.
`Regularization term for instruction tuning`: $\mathbb{E}[H(P_{\theta}(\cdot|\mathbf{z}))]\cdot(D_{\rm JS}^{\frac{1}{2}}(P_{Z_{v}}||N(0,I))+{D}_{\rm JS}^{\frac{1}{2}}(P_{Z_{t}}||N(0,I)))$
* One cannot access $Q_{\mathbf{X}}$ during the training phase, so we alternatively enforce the distribution of the intermediate representation to be close to the standard Gaussian.
* We sampled 10% of the original instruction tuning dataset from LLaVA-v1.5, and trained the entire LLM and modality connector parameters of LLaVA-v1.5 with and without the regularization term.
|Method|ID|V Shift|T Shift|J Shift|
|-|-|-|-|-|
|Baseline|72.7|65.8|68.0|59.6|
|Baseline + Ours|72.7|**66.3**|**68.3**|**60.8**|
<!-- 2. `Learning objective for test-time adaptation (TTA)`: $\mathbb{E}[H(P_{\theta}(\cdot|\mathbf{z}))]\cdot(D_{\rm JS}^{\frac{1}{2}}(P_{Z_{v}}||Q_{Z_{v}})+{D}_{\rm JS}^{\frac{1}{2}}(P_{Z_{t}}||Q_{Z_{t}}))$
* We store some samples from ID distribution $P_{\mathbf{Z}}$, and dynamically estimate $D_{\rm JS}(P_{\mathbf{Z}}||Q_{\mathbf{Z}})$ using unlabeled test samples from $Q_{\mathbf{Z}}$.
* We update the entire LLM and modality connector parameters for [XXX] epochs before evaluation.
|Method|Natural Shifts (Avg.)|Synthetic Shifts (Avg.)|
|-|-|-|
|Baseline|0.xx|0.xx|
|Baseline + Ours|0.xx|0.xx| -->
* As shown in the table above, we confirm that the proposed EMID bound can be leveraged as an effective regularizer during instruction tuning to pursue better robustness to distribution shifts; a minimal code sketch of the regularizer follows below.
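For illustration, a PyTorch-style sketch of how this regularizer could be instantiated; the pooling of hidden states, the RBF bandwidth, and all function names are simplifying assumptions for exposition, with MMD [5] standing in for the $D_{\rm JS}^{1/2}$ terms as described above:

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased squared MMD with an RBF kernel between two sample batches."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def emid_regularizer(z_v, z_t, logits):
    """Sketch of the EMID-bound regularizer for instruction tuning.

    z_v, z_t: (batch, dim) pooled 24th-layer hidden states of the visual and
              textual spans (pooling is a simplifying assumption here).
    logits:   (batch, seq, vocab) output logits for the response tokens.
    """
    log_p = torch.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(-1).mean()          # E[H(P_theta(.|z))]
    g_v, g_t = torch.randn_like(z_v), torch.randn_like(z_t)  # N(0, I) references
    div = (rbf_mmd2(z_v, g_v).clamp_min(0).sqrt()
           + rbf_mmd2(z_t, g_t).clamp_min(0).sqrt())         # surrogate JSD^(1/2) terms
    return entropy * div  # added to the autoregressive loss with a small weight
```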
> Reference
1. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al. 2024
2. LLaVA-OneVision: Easy Visual Task Transfer, Li et al. 2024
3. Qwen2.5-VL Technical Report, Alibaba Group 2025
4. Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
5. Learning Deep Kernels for Non-Parametric Two-Sample Tests, Liu et al. 2020
<!-- 5. The Representation Jensen-Shannon Divergence, Hoyos-Osorio and Sanchez-Giraldo 2024 -->
### Rebuttal to o1Fb (Actual Submission)
> _A1. Applicability on SOTA MLLMs._
Thanks for the suggestion! Following your comment, **we additionally conduct the full evaluation with `Qwen2.5-VL-7B-Instruct` and `InternVL2.5-7B`**.
Specifically, we first evaluate the official releases of the `Qwen2.5-VL-7B-Instruct` and `InternVL2.5-7B` models on 35 synthetic shifts, then compute EMI estimates over the model responses. We perform a correlation analysis between estimates of the EMI difference (EMID) and its upper bound. **Consistent with our existing findings, we observe a strong correlation between EMID and its theoretical upper bound for both models**.
|Model|Pearson $r$|$p$-val|
|-|-|-|
|InternVL2.5-7B|0.67|0.00|
|Qwen2.5-VL-7B|0.81|0.00|
> _A2. How do assumptions impact the applicability of proposals to more complex cases?_
This is an insightful question.
First, **to claim the closeness between EMI and win rate (Thm 4.4), we assumed $\epsilon$-representation capacity of the MLLM**.
* $\epsilon$-representation capacity essentially reflects the model’s ability to approximate the target task’s conditional distribution, meaning that the model can approximate this distribution with a KLD no greater than $\epsilon$.
* **Given the strong expressive and approximation capabilities of recent large-scale MLLMs, this assumption is generally reasonable in practice.**
* Moreover, numerous efforts improve the diversity of an instruction tuning dataset and robustness of the visual encoder of MLLM [1,2], which makes the learned distribution robustly approximate conditional distributions encountered during evaluation.
* As we continually pursue richer dataset construction and stronger visual recognition in the encoder, the $\epsilon$-representation capacity assumption becomes reasonable in increasingly complex cases.
Second, **to claim the relationship between EMID and its upper bound in a simple case (Thm 4.5), we assumed consistent conditional distributions** over $X_v|X_t$, $X_t|X_v$, and $Y|X$.
* This assumption zeros out the discrepancy between the conditional distributions. If the conditional distributions differ substantially between ID and OOD in some real-world scenarios, this makes our upper bound underestimate the performance gap, i.e., EMID.
* However, we highlight that the strong correlations between EMID and this upper bound have been observed through 61 distribution shifts, implying the validity of our upper bound to quantify EMID.
* **Meanwhile, we also provide a bound for general cases to address non-consistent conditional distributions in Thm 4.6**. This general-case bound can also be empirically estimated using a procedure similar to that of Thm 4.5.
* Therefore, as mentioned in our manuscript (L302-303), we recommend choosing a proper bound based on the knowledge of the data-generating process for datasets.
> _A3. Details for practical implications of EMID upper bound_.
While we confined the scope of this project to _presenting the first theoretical framework to quantify MLLM's performance gap_, we further provide **a potential application of EMID upper bound, instruction tuning with regularization, for this rebuttal**.
Without loss of generality, we can assume the input sequence $X=(X_v,X_t)$ as a sequence of intermediate representation vectors of MLLM, i.e., $Z=(Z_v,Z_t)$, and can also assume that $P_{\theta}(.|.)$ maps this representation to responses, i.e., $P_{\theta}:Z \rightarrow Y$. This induces a modified bound with representation variable $Z$ rather than raw data input $X$.
We instantiate this modified EMID bound in an instruction-tuning setup below, where we set the 24th layer's hidden states as $Z$, and adopt RJSD [3] as a differentiable estimator for the JSD terms together with the average empirical model output entropy. We provide evaluation results with LLaVA-v1.5-7B on in-distribution (ID) and visual (V), text (T), and joint (J) synthetic shifts.
`Regularization term for instruction tuning`: $\mathbb{E}[H(P_{\theta}(\cdot|z))] \cdot (D_{\rm JS}^{1/2}(P_{Z_v}||N(0,I)) + D_{\rm JS}^{1/2}(P_{Z_t}||N(0,I)))$
* One cannot access $Q_X$ during the training phase, so we alternatively enforce the distribution of the intermediate representation to be close to the standard Gaussian.
* We sampled 10% of the instruction dataset from LLaVA-v1.5, and trained the entire LLM and modality connector parameters of LLaVA-v1.5 with and without the regularization.
* As shown in the table, we confirm that the EMID can be leveraged as a regularizer during instruction tuning to pursue better robustness to distribution shifts.
|Method|ID|V Shift|T Shift|J Shift|
|-|-|-|-|-|
|Baseline|72.7|65.8|68.0|59.6|
|Baseline + Ours|72.7|**66.3**|**68.3**|**60.8**|
> Reference
1. Qwen2.5-VL Technical Report, Alibaba Group 2025
2. Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
3. The Representation Jensen-Shannon Divergence, Hoyos-Osorio et al. 2024
---
### Rebuttal to sTSP
<!--
* [Q] The authors suggest there is a risk involved in using multimodal models (language models conditioned on visual input) and suggest a way to measure this is using a difference of mutual information between a model and the distribution of the training data (is this right?) – though it remains unclear to me how that distribution is gotten.
* [Q] The abstract talks about the risk of MLLMs and safe and reliable applications, and the proposed framework about quantifying risk under distribution shifts. It’s not clear what this risk is. Would it make sense to just call it the performance?
* [P] You talk about multimodal language models in general in the introduction, but it seems this is specifically about (bi-modal) vision and language models with vision only in the input. It could also be made clearer that the paper is only about textual output.
* [Q] I also don’t see the reason for focusing on vision in the input and instruction tuning. It seems to me the part of the ideas I followed apply to sequence models in general. As such the question is to what degree a conditional distribution (given some prefix) differs from the non-conditional distribution.
* [P] I don’t understand how the various probability distributions are defined. I don't understand the definitions of the visual and textual shifts, e.g. the relation between $D(P_{X_v}||Q_{X_{v}})$ and $D(P_{X_t}||Q_{X_{t}})$. How are these defined exactly in terms of next token probabilities? What is $P_{X_v}$ and $P_{X_t}$? You only define KL with some arbitrary distribution P.
* [Q] How are $P_{X}$ and $P_{Y}$ defined? Are these trained models? Are $P_{X},P_{Y},P_{XY},P_{\theta}$ all the same trained model differing only in prefixes? E.g. as used in equation 3. For all of these, I would expect to see exact definitions of the distributions in terms of the exact model configuration you say your method works for.
* [Q] In the motivation with visual shifts, I would put the definitions of the variables before the shift examples, it was confusing.
* [Q] Random variables: Could you define the domains of the random variables in line 93? How is (X_v, X_t) combined into a single sequence?
* [P] Line 101: instruction tuning has not been introduced. Line 99: joint population → joint probability? Equation 1, should this not be argmin? Equation 2. Should this be P_{\theta} instead of \theta
* [P] I think the biggest problem with the paper is the writing and underspecified mathematics. The text is unclear throughout the abstract, introduction, and the motivation of what is being done. I don't find it clear what problem is being solved, nor what the exact configuration in which your method applies is.
* [P] More concretely, I don’t find the distributions and the distribution shifts well-defined making it very hard to follow any derivations or reasoning. Even if these (in your mind) are somewhat standard conditional factorizations, I want to see it spelled out, for every such $P_{X},P_{Y},P_{XY},P_{\theta}$, otherwise it’s very hard to be sure that what is written is correct. It was also unclear what is trained and what is not trained.
* [P] The authors define mutual information as a function of a single distribution with an implicit factorization instead of over two random variables. This did not make it easier to follow the writing. For instance, in equation 7 of the EMI (a main contribution), what is the definition of $I(P_{X}\otimes P_{\theta})$, the definition of $I$ is hardcoded for a distribution $P_{XY}$ based on the factorization given. If there is some marginalization used to define it this needs to be defined, or maybe stick to the mutual information between two random variables.
-->
<!-- * Under the problem statement, we specifically contribute the following:
* Proposing effective mutual information (EMI), which can reliably measure the performance of MLLMs, and show theoretical connection between EMI and existing standard metric win rate.
* Deriving a theoretical upper bound of EMI difference (EMID) between in-distribution (ID) and out-of-distribution (OOD) data to quantify the performance degradation of MLLM. See Theorem 4.5 and Theorem 4.6.
* Validating our theoretical statements through extensive evaluation on 61 distribution shfits scenario with four MLLMs.
-->
<!--* That is, they are distributional divergences between input domains, and we can not further spell out the term because we do not know the ground truth distributions for input domains.-->
<!-- Note that they are distribution over observable feature variable rather than modal-related variable. Because we do not know the ground truth probability density function for this distribution, we can not express the $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$ terms in more detail. -->
<!--* The variable $X_v$ and $X_t$ can be elaborated as $X_t=(X_{t,1},...,X_{t,L_t})$ where $X_{t,i}\in V$ and $X_v=(X_{v,1},...,X_{v,L_v})$ where $X_{v,i}\in \mathbb{R}^{\frac{W \times H \times C}{L_v}}$. Here, $V$ denote token vocabulary of the LLM, $L_t$ and $L_v$ denote the length of text and visual seqence, respectively, and the W, H, C denote the width, height, and channel dimension of the input image. -->
<!--* An image $X_v$ of $L_v$ patches are fed into visual encoder and projector that maps the patches to tokens in word embedding space (Refer to [5] for details), then all tokens can be concatenated to a sequence in the word embedding space.-->
We appreciate sTSP's effort to read our paper and provide comments. We present a notation table and responses to each comment.
|Var.|Def.|
|-|-|
|$X_t=(X_{t,1},...,X_{t,L_t})$ where $X_{t,i}\in V$|a random variable (r.v.) of a text input sequence with length $L_t$ of tokens in vocabulary $V$|
|$X_v=(X_{v,1},...,X_{v,L_v})$ where $X_{v,i}\in \mathbb{R}^{D_v}$| a r.v. of a $D_v$-dimensional image embedding sequence with length $L_v$ of tokens produced by a frozen visual encoder|
|$\mathbf{X}=(X_v,X_t)$| a joint r.v. of an input query constructed with a tuple of $X_v$ and $X_t$|
|$Y=(Y_1,...,Y_L)$ where $Y_i\in V$|a r.v. of a text response with length $L$ of tokens|
|$P_{\mathbf{X}}=P(X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|a probability distribution (p.d.) of an input query|
|$P_{X_v}=P(X_{v,1},...,X_{v,L_v})$|a p.d. of a visual input query|
|$P_{X_t}=P(X_{t,1},...,X_{t,L_t})$|a p.d. of a text input query|
|$P_{Y}=P(Y_1,...,Y_L)$| a p.d. of a text response|
|$P_{Y \|\mathbf{X}}=P(Y_1,...,Y_L\|X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|a conditional p.d. of a text response given input query|
|$P_{\mathbf{X}Y}=P(X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t},Y_1,...,Y_L)$|a joint p.d. of input query and response|
|$P_{\theta}(Y\|\mathbf{X})=P_{\theta}(Y_1,...,Y_L\|X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|model's prediction p.d. for a response given input query|
> What problem is being solved? what is the risk?
* As we noted in the introduction, MLLMs suffer from performance degradation when they encounter distribution shifts. We use the term risk to denote this performance degradation.
* `Problem focus`: As noted in the abstract, introduction, and motivation sections, **our goal is to quantify the performance degradation of MLLMs under distribution shifts by presenting an information-theoretic framework**.
> It seems this is about (bi-modal) vision and language models with vision only in the input.
* As we noted in `L107-109`, MLLMs take multimodal input (both visual and text) to produce text output, not "vision only in the input".
* The term MLLM is commonly used in the literature to denote LLMs that receive visual input as well as text [1,2], so we adopted this term following the convention.
> Definition of $P_{X},P_{Y},P_{XY},P_{\theta}$. Are they all the trained model?
* In `L92-103`, we put the definition of random variables and distributions. $P_X,P_Y,P_{XY}$ denote the probability distributions of the input query $\mathbf{X}$, target response $Y$, and their joint $\mathbf{X}Y$, respectively.
* $P_{\theta}$ is the model being trained.
> Definitions of $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$. How are these defined in terms of next token probabilities? What is $P_{X_v}$ and $P_{X_t}$?
As noted in `L96-97`, $X_v$ and $X_t$ are the sequences of visual and text input tokens, so $P_{X_v}$ and $P_{X_t}$ are the corresponding distributions. $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$ are defined at the input level, which should not be confused with next-token probabilities (at the output level).
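For instance, taking $D$ to be the KL divergence, with densities $p_{X_v}$ and $q_{X_v}$ of the ID and OOD visual-input distributions,

$$D_{\rm KL}(P_{X_v}\,\|\,Q_{X_v})=\mathbb{E}_{x_v\sim P_{X_v}}\!\left[\log\frac{p_{X_v}(x_v)}{q_{X_v}(x_v)}\right],$$

and analogously for $D(P_{X_t}\,\|\,Q_{X_t})$; no next-token factorization is involved.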
> I'd put the definitions of the variables before the shift examples. Domains of the variables in L93? How is (X_v,X_t) combined into a sequence?
* We indeed introduced the definition of all random variables in `L92-103` before going into the motivation section.
* We concatenate the visual input tokens $X_v$ and textual input tokens $X_t$ into a single input sequence, where the visual tokens are obtained by encoding the image using a vision encoder (e.g., CLIP-ViT), and then projecting them into the language embedding space. This follows the standard practice in MLLMs, where visual tokens are prepended to the text tokens to form a unified input sequence.
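A minimal pseudo-PyTorch sketch of this standard pipeline; the module names `vision_encoder`, `projector`, and `embed_tokens` are illustrative placeholders, not our exact implementation:

```python
import torch

def build_mllm_input(image, text_ids, vision_encoder, projector, embed_tokens):
    """Concatenate visual and textual tokens into one input sequence X = (X_v, X_t).

    image:    (1, 3, H, W) pixel tensor
    text_ids: (1, L_t) token ids in the vocabulary V
    """
    patches = vision_encoder(image)      # (1, L_v, D_enc) patch features
    x_v = projector(patches)             # (1, L_v, D): map into language embedding space
    x_t = embed_tokens(text_ids)         # (1, L_t, D): text token embeddings
    return torch.cat([x_v, x_t], dim=1)  # (1, L_v + L_t, D): visual tokens prepended
```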
> Definition of MI and EMI
* Our main interest is to express the model performance gap across different distributions. Thus, instead of defining MI over individual random variables, we define MI as a function of their joint distribution, which is mathematically equivalent to the random-variable-based MI, as can be seen in Eq. (3).
* The definition of EMI was explicitly introduced in the Eq (6). Please refer to `L168-169`.
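Concretely, MI as a function of a joint distribution coincides with the usual two-variable MI,

$$I(P_{\mathbf{X}Y})=D_{\rm KL}\big(P_{\mathbf{X}Y}\,\big\|\,P_{\mathbf{X}}\otimes P_{Y}\big)=I(\mathbf{X};Y),$$

and this functional form lets us plug in other joint distributions (e.g., the joint induced by the data marginal $P_{\mathbf{X}}$ and the model's conditional $P_{\theta}$ in Eq. (7)) under a single definition.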
> L101: instruction tuning has not been introduced. L99: joint probability? Eq 1, should this not be argmin? Eq 2. Should this be P_{\theta} instead of \theta
* Instruction tuning is introduced starting at `L104`.
* In statistics, the population (distribution) [3] is used to denote a distribution of the entire collection of objects in contrast to a sampled distribution.
* In Eq (1), both min and argmin can be valid: the former takes an objective-centric perspective, whereas the latter takes a parameter-centric one.
* In Eq (2), we use the first argument to denote the data distribution that the metric is computed on, and use the second argument to denote the model parameter to be evaluated. **We will use $P_{\theta}$ in the revised version.**
> Reference
1. A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al. 2024
2. Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
3. Sampling of Populations: Methods and Applications, Levy and Lemeshow 2013
---
### Rebuttal to sTSP (Actual Submission)
We appreciate sTSP's effort to read our paper and provide comments. Here is a notation table and our responses.
|Var.|Def.|
|-|-|
|$X_t=(X_{t,1},...,X_{t,L_t})$ where $X_{t,i}\in V$|a random variable (r.v.) of a text input sequence with length $L_t$ of tokens in vocabulary $V$|
|$X_v=(X_{v,1},...,X_{v,L_v})$ where $X_{v,i}\in \mathbb{R}^{D_v}$| a r.v. of a $D_v$-dimensional image embedding sequence with length $L_v$ of tokens produced by a frozen vision encoder|
|$\mathbf{X}=(X_v,X_t)$| a joint r.v. of an input query constructed with a tuple of $X_v$ and $X_t$|
|$Y=(Y_1,...,Y_L)$ where $Y_i\in V$|a r.v. of a text response with length $L$ of tokens|
|$P_{\mathbf{X}}=P(X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|a probability distribution (p.d.) of an input|
|$P_{X_v}=P(X_{v,1},...,X_{v,L_v})$|a p.d. of a visual input|
|$P_{X_t}=P(X_{t,1},...,X_{t,L_t})$|a p.d. of a text input|
|$P_{Y}=P(Y_1,...,Y_L)$| a p.d. of a text response|
|$P_{Y \|\mathbf{X}}=P(Y_1,...,Y_L\|X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|a conditional p.d. of a text response given input|
|$P_{\mathbf{X}Y}=P(X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t},Y_1,...,Y_L)$|a joint p.d. of input and response|
|$P_{\theta}(Y\|\mathbf{X})=P_{\theta}(Y_1,...,Y_L\|X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|model's prediction p.d. for a response given input|
> What problem is being solved? What is the risk?
* As we noted in the introduction, MLLMs suffer from performance degradation when they encounter distribution shifts. We use the term risk to denote this performance degradation.
* `Problem focus`: As noted in the abstract, introduction, and motivation sections, **our goal is to quantify the performance degradation of MLLMs under distribution shifts by presenting an information-theoretic framework**.
> It seems this is about (bi-modal) vision and language models with vision only in the input.
* As we noted in `L107-109`, MLLMs take multimodal input (both visual and text) to produce text output, not "vision only in the input".
* The term MLLM is commonly used in the literature to denote LLMs that receive visual input as well as text [1,2], so we adopted this term following the convention.
> Definition of $P_X,P_Y,P_{XY},P_{\theta}$.
* In `L92-103`, we put the definition of random variables and distributions. $P_X,P_Y,P_{XY}$ denote the probability distributions of the input $\mathbf{X}$, target response $Y$, and their joint $\mathbf{X}Y$, respectively.
* $P_{\theta}$ is the model being trained.
> Definitions of $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$. How are these defined in terms of next token probabilities? What is $P_{X_v}$ and $P_{X_t}$?
As noted in `L96-97`, $X_v$ and $X_t$ are the sequences of visual and text input tokens, so $P_{X_v}$ and $P_{X_t}$ are the corresponding distributions. $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$ are defined at the input level, which should not be confused with next-token probabilities (at the output level).
> Put the definitions of variables before shift examples. Domains of the variables in L93? How is (X_v,X_t) combined into a sequence?
* We put the definition of all variables in `L92-103` before the motivation section.
* We concatenate the visual input tokens $X_v$ and textual input tokens $X_t$ into a single sequence, where the visual tokens are obtained by encoding an image using a vision encoder (CLIP-ViT), and then projecting them into the language embedding space. This follows the standard practice in MLLMs, where visual tokens are prepended to the text tokens to form a unified input sequence.
> Definition of MI and EMI
* Our main interest is to express the model performance gap across different distributions. Thus, instead of defining MI over individual random variables, we define MI as a function of their joint distribution, which is mathematically equivalent to the random-variable-based MI, as can be seen in Eq 3.
* The definition of EMI was explicitly introduced in Eq 6. Please refer to `L168-169`.
> L101: instruction tuning has not been introduced. L99: joint probability? Eq 1, should this not be argmin? Eq 2. Should this be P_{\theta} instead of \theta
* Instruction tuning is introduced starting at `L104`.
* In statistics, the population (distribution) [3] is used to denote a distribution of the entire collection of objects in contrast to a sampled distribution.
* In Eq 1, both min and argmin can be valid: the former takes an objective-centric perspective, whereas the latter takes a parameter-centric one.
* In Eq 2, we use the first argument to denote the data distribution that the metric is computed on, and use the second argument to denote the model parameter to be evaluated. **We will use $P_{\theta}$ in the revised version.**
> Reference
1. A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al. 2024
2. Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
3. Sampling of Populations: Methods and Applications, Levy and Lemeshow 2013
---
### Rebuttal to cWre
<!-- [W] I appreciate the theoretical part. However, the current theories do not address how modality fusion and modality interaction affect generalization, despite these techniques being commonly employed in (MLLMs). -->
<!-- [W] In Theorem 4.4, the paper assume that the $\epsilon$-representation capacity holds, where itself requires to be quantified, e.g. it it closed related to the model size, the number of training samples and the input dimension. Given that the assumption fundementally underpins the theoretical contributions presented, I suggest expanding the discourse to elucidate the quantitative interdependencies between $\epsilon$-representation capacity and these factors. -->
<!-- [W] The paper does not demonstrate how to leverage EMI or the EMID upper bound to guide model optimization (e.g., designing robust training objectives or adaptation strategies). It only uses EMI as a post-hoc evaluation tool. -->
<!-- [Q] The paper states, “In Eq. (4), we show that the autoregressive objective for instruction tuning (Eq. (1)) effectively estimates the lower bound of MI when the model’s representation capacity is sufficiently high.” However, while the term $\delta$ can be omitted under this assumption, I suppose that $H(P_{Y})$ could be significantly large, making Eq. (4) an inaccurate estimate of the MI lower bound -->
<!-- [Q] The computation of EMI relies on pre-trained encoders (e.g., CLIP and XLM-R) for feature extraction, but the paper does not discuss the sensitivity of these encoders to domain shifts. For instance, CLIP may underperform on medical images, leading to distorted EMI estimates. -->
<!-- We confine our attention to visual instruction tuning phase, and focus on the relavance between input query $(X_v,X_t)$ and response $Y$ and the distributional discrepancy between ID and OOD to derive theories. -->
<!-- i.e., $\min_{\theta\in\Theta}\mathbb{E}[D_{KL}(P_{Y|X=x}||P_{\theta}(\cdot|x))] \leq \epsilon$,-->
> _A1. Effect of modality fusion/interaction on generalization_.
* MLLMs commonly undergo a modality alignment phase during training, which may affect generalization, and it is known that modality fusion can reduce the sample complexity to improve generalization [1]!
* As noted in `line 250-254` in our paper, **$I(P_{\mathbf{X}Y})$ can be factorized into $0.5 I(P_{X_{v}Y}) + 0.5 I(P_{X_{t}Y}) + 0.5 I(P_{X_{t}Y|X_v}) + 0.5 I(P_{X_{v}Y|X_t})$ where the conditional MI terms $I(P_{X_{t}Y|X_v})$ and $I(P_{X_{v}Y|X_t})$ encapsulate modality interaction.**
* Based on this factorization, we can define per-modality EMI based on $I(P_{X_{v}Y})$ and $I(P_{X_{t}Y})$, and then derive a new upper bound that is constructed with $I(P_{X_{t}Y|X_v})$ and $I(P_{X_{v}Y|X_t})$ terms to capture the effect of modality interaction. We leave the explicit derivation for future work.
> _A2. Interdependencies between $\epsilon$-representation capacity and the model size, number of training samples, and input dimension_.
The $\epsilon$-representation capacity assumption captures the minimum achievable discrepancy between the true distribution $P_{Y|X}$ and the model's distribution $P_{\theta}(\cdot|X)$. Due to the expectation and $\min$ operator, it does not depend on the training sample size but is mainly influenced by model capability.
Specifically, as models become more expressive—e.g., by increasing model size [5] and leveraging advanced positional encoding [6]—the MLLM approaches a universal approximator of sequence-to-sequence mappings [6,7]; as a result, the minimum expected discrepancy tends to decrease, leading to a smaller $\epsilon$.
We will elucidate this in the next version, thanks!
> _A3. How to leverage EMID upper bound to guide model optimization?_
While our primary focus is on presenting the theoretical framework to quantify the performance gap of MLLMs, we also showcase an application of the EMID upper bound in this rebuttal, a **regularization term for visual instruction tuning**.
Due to the space limit, we have included the setups and results in the `rebuttal to reviewer o1Fb, response A3`. **Please refer to that thread!** As shown there, our instantiation of Theorem 4.5 can indeed be used to optimize the model for improved robustness under shifts.
> _A4. Eq. (4) can be an inaccurate estimate of MI lower bound due to potentially large $H(P_Y)$_.
* In Eq. (4): $I(P_{XY})\geq\mathbb{E}[\log P_{\theta}(y|x)]+H(P_Y)$, maximizing $\mathbb{E}[\log P_{\theta}(y|x)]$ through instruction tuning can be interpreted as learning a parameter $\theta$ that maximizes $I(P_{XY})-H(P_Y)$ rather than solely $I(P_{XY})$ (a short derivation of the inequality is given at the end of this response).
* **We do not claim that the log-likelihood term is a tight lower bound of MI but rather suggest that maximizing $\mathbb{E}[\log P_{\theta}(y|x)]$ can implicitly maximize the MI between input and model's response**. We will revise the paper to make this clear.
* To validate this, we reproduce the visual instruction tuning of LLaVA-v1.5-7B on a 10% subset of data, and show how the empirical estimate [2] of MI evolves during training.
|Step|$\hat{I}$|
|-|-|
|1|0.166|
|5|0.172|
|20|0.182|
|100|0.187|
|200|0.194|
|500|0.197|
* As shown above, visual instruction tuning can effectively maximize MI between input and model response.
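For completeness, the inequality in Eq. (4) follows from the standard variational argument:

$$\mathbb{E}_{P_{XY}}[\log P_{\theta}(y|x)]=-H(P_{Y|X})-\mathbb{E}_{x}\big[D_{\rm KL}\big(P_{Y|X=x}\,\|\,P_{\theta}(\cdot|x)\big)\big]\leq -H(P_{Y|X}),$$

so that $I(P_{XY})=H(P_Y)-H(P_{Y|X})\geq \mathbb{E}[\log P_{\theta}(y|x)]+H(P_Y)$; the gap is exactly the expected KL term, which instruction tuning drives down.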
> _A5. Encoder sensitivity analysis to domain shifts for EMI estimation._
* In **Table 4 and 5 of Appendix**, we already discussed two alternative choices of the encoders, e.g., [3], and showed that _our theoretical claims hold in practice consistently across varying encoders with statistical significance_.
* We further conduct an encoder sensitivity analysis under domain shifts by replicating our experiments on the medical domain with CLIP-ViT-B32 and XLM-RoBERTa encoders.
* Specifically, we use 200 samples from LLaVA-Med [4], split them into three subsets based on embedding distance to COCO images (see the sketch after this list), and translate the English queries into the six languages used in the paper via GPT-4o, inducing 28 shift subsets for the correlation analysis between EMID and its upper bound.
|Model|Pearson $r$|$p$-val|
|-|-|-|
|LLaVA-v1.5-7B|0.93|0.00|
* We see that the correlation between the EMID and upper-bound estimates is very strong, even though medical images and text are relatively underrepresented compared with general-domain objects and text, implying that our theorem robustly holds even on special domains where the encoders may not excel.
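A hypothetical sketch of the distance-based splitting step; the centroid-distance criterion and all names are illustrative assumptions, not our exact protocol:

```python
import numpy as np

def split_by_distance(sample_emb, coco_emb, n_splits=3):
    """Split samples into near/mid/far subsets by embedding distance to COCO.

    sample_emb: (N, D) CLIP image embeddings of the LLaVA-Med samples
    coco_emb:   (M, D) CLIP image embeddings of reference COCO images
    """
    centroid = coco_emb.mean(axis=0)
    dist = np.linalg.norm(sample_emb - centroid, axis=1)
    order = np.argsort(dist)                 # nearest to farthest
    return np.array_split(order, n_splits)   # index sets of the three subsets
```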
> Reference
1. A Theory of Multimodal Learning, Lu 2023
2. A Contrastive Log-ratio Upper Bound of Mutual Information, Cheng et al. 2020
3. Universal Embeddings with Multimodal Large Language Models, Jiang et al. 2024
4. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, Li et al. 2023
5. Scaling Laws for Neural Language Models, Kaplan et al. 2020
6. Your Transformer May Not be as Powerful as You Expect, Luo et al. 2022
7. Transformers are Universal In-context Learners, Furuya et al. 2024
---
### Rebuttal to cWre (Actual Submission)
> _A1. Effect of modality fusion/interaction on generalization_.
* MLLMs commonly undergo a modality alignment phase during training, which may affect generalization, and it is known that modality fusion can reduce the sample complexity to improve generalization [1]!
* As noted in `line 250-254` in our paper, **$I(P_{\mathbf{X}Y})$ can be factorized into $0.5 I(P_{X_{v}Y}) + 0.5 I(P_{X_{t}Y}) + 0.5 I(P_{X_{t}Y|X_v}) + 0.5 I(P_{X_{v}Y|X_t})$ where the conditional MI terms $I(P_{X_{t}Y|X_v})$ and $I(P_{X_{v}Y|X_t})$ encapsulate modality interaction.**
* Based on this factorization, we can define per-modality EMI based on $I(P_{X_{v}Y})$ and $I(P_{X_{t}Y})$, and then derive a new upper bound that is constructed with $I(P_{X_{t}Y|X_v})$ and $I(P_{X_{v}Y|X_t})$ terms to capture the effect of modality interaction. We leave the explicit derivation for future work.
> _A2. Interdependencies between $\epsilon$-representation capacity and the model size, number of training samples, and input dimension_.
The $\epsilon$-representation capacity assumption captures the minimum achievable discrepancy between the true distribution $P_{Y|X}$ and the model's distribution $P_{\theta}(\cdot|X)$. Due to the expectation and $\min$ operator, it does not depend on the training sample size but is mainly influenced by model capability.
Specifically, as models become more expressive—e.g., by increasing model size [5] and leveraging advanced positional encoding [6]—the MLLM approaches a universal approximator of sequence-to-sequence mappings [6,7]; as a result, the minimum expected discrepancy tends to decrease, leading to a smaller $\epsilon$.
We will elucidate this in the next version, thanks!
> _A3. How to leverage EMID upper bound to guide model optimization?_
While our primary focus is on presenting the theoretical framework to quantify the performance gap of MLLMs, we also showcase an application of the EMID upper bound in this rebuttal, a **regularization term for visual instruction tuning**.
Due to the space limit, we have included the setups and results in the `rebuttal to reviewer o1Fb, response A3`. **Please refer to that thread!** As shown there, our instantiation of Theorem 4.5 can indeed be used to optimize the model for improved robustness under shifts.
> _A4. Eq. (4) can be an inaccurate estimate of MI lower bound due to potentially large $H(P_Y)$_.
* In Eq. (4): $I(P_{XY})\geq\mathbb{E}[\log P_{\theta}(y|x)]+H(P_Y)$, maximizing $\mathbb{E}[\log P_{\theta}(y|x)]$ through instruction tuning can be interpreted as learning a parameter $\theta$ that maximizes $I(P_{XY})-H(P_Y)$ rather than solely $I(P_{XY})$.
* **We do not claim that the log-likelihood term is a tight lower bound of MI but rather suggest that maximizing $\mathbb{E}[\log P_{\theta}(y|x)]$ can implicitly maximize the MI between input and model's response**. We will revise the paper to make this clear.
* To validate this, we reproduce the visual instruction tuning of LLaVA-v1.5-7B on a 10% subset of data, and show how the empirical estimate [2] of MI evolves during training.
|Step|$\hat{I}$|
|-|-|
|1|0.166|
|5|0.172|
|20|0.182|
|100|0.187|
|200|0.194|
|500|0.197|
* As shown above, visual instruction tuning can effectively maximize MI between input and model response.
> _A5. Encoder sensitivity analysis to domain shifts for EMI estimation._
* In **Table 4 and 5 of Appendix**, we already discussed two alternative choices of the encoders, e.g., [3], and showed that _our theoretical claims hold in practice consistently across varying encoders with statistical significance_.
* We further conduct an encoder sensitivity analysis under domain shifts by replicating our experiments on the medical domain with CLIP-ViT-B32 and XLM-RoBERTa encoders.
* Specifically, we use 200 samples from LLaVA-Med [4], split them into three subsets based on embedding distance to COCO images, and translate the English queries into the six languages used in the paper via GPT-4o, inducing 28 shift subsets for the correlation analysis between EMID and its upper bound.
|Model|Pearson $r$|$p$-val|
|-|-|-|
|LLaVA-v1.5-7B|0.93|0.00|
* We see that the correlation between the EMID and upper-bound estimates is very strong, even though medical images and text are relatively underrepresented compared with general-domain objects and text, implying that our theorem robustly holds even on special domains where the encoders may not excel.
> Reference
1. A Theory of Multimodal Learning, Lu 2023
2. A Contrastive Log-ratio Upper Bound of Mutual Information, Cheng et al. 2020
3. Universal Embeddings with Multimodal Large Language Models, Jiang et al. 2024
4. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, Li et al. 2023
5. Scaling Laws for Neural Language Models, Kaplan et al. 2020
6. Your Transformer May Not be as Powerful as You Expect, Luo et al. 2022
7. Transformers are Universal In-context Learners, Furuya et al. 2024
---
### Rebuttal to vm1D
<!-- [W] In Figure 1, the authors claim that as shift severity increases, performance degradation worsens. However, part of the result in the text shift is obscured by the legend for LLaVA v1.5, requiring adjustment. Additionally, the text shift results do not fully support this claim, as the win rate does not decrease monotonically with shift severity. For instance, LLaVA NeXT 7B performs better in Korean than Arabic, while LLaVA NeXT 13B shows the opposite trend. -->
<!-- [W] The results from the Spearman correlation analysis and Kendall’s tau analysis require further clarification. It would be helpful to specify the meaning of the correlation values to aid in better understanding their implications. -->
<!-- [W] In Figure 3, combining all four models in one graph makes it hard to grasp the authors’ implications. Additionally, the EMID appears much smaller than its upper bound—e.g., in the synthetic shift graph, EMID ranges from -0.02 to 0.10, while the upper bound spans from 1.1 to 1.8. If not misunderstood, this suggests the upper bound is very loose and offers little constraint on EMID. -->
> _A1. Legend issue in Fig 1. and non-monotonic performance trend in text shifts_.
Thank you for pointing out the visualization issue—we will revise the legend in Fig. 1 to improve clarity and avoid confusion.
Regarding the non-monotonic trend in win rate under text shifts, this behavior arises in part from the inherent stochasticity of win rate computation based on GPT-4 API evaluations, making it fundamentally difficult to observe strictly monotonic trends in practice [1]. Additionally, the x-axis in Fig. 1 is sorted by embedding space distances—computed using CLIP ViT for visual shifts and XLM-RoBERTa for text shifts—which may not always reflect the true degree of distributional shift.
That said, we still observe a meaningful overall relationship between embedding distance and performance degradation, both in the 27 natural shifts presented in Fig. 1 and in the 34 synthetic shifts shown in Fig. 6 of the Appendix. Taking inspiration from this empirical analysis, we derive a much more rigorous framework, i.e., the EMID upper bound (Theorem 4.5), to quantify the performance gap, which consistently shows statistical significance across diverse settings.
> _A2. Clarification for the meaning of Spearman correlation and Kendall’s tau_.
Spearman's $\rho$ and Kendall's $\tau$ are both representative measures of correlation, where the former is preferred for detecting weak correlations and the latter better captures strong correlations in small samples and is more robust to outliers in large samples. Both are standard approaches for measuring the correlation between LLM-judge scores and other metrics [2].
We will add this description in our future version of the manuscript. Thank you for the suggestion.
> _A3. Intricate visualization of Figure 3 and the tightness of EMID upper bound_.
* On the left two panels in Figure 3, we presented the overall relationship between EMID and its upper bound, whereas **we distinguished the four models on the right two panels to show the model-dependent differences in this relationship**. We did so to show that Theorem 4.5 actually differentiates the models themselves through the output entropy of each model $H(P_{\theta}(\cdot|x))$, e.g., LLaVA NeXT shows higher sensitivity to the shifts than LLaVA v1.5, as implied by the larger slope.
* In this work, we do not claim the tightness of the derived upper bound. Moreover, verifying the tightness of the proposed bound can be affected by the choice of estimators for the MI and JSD terms during empirical validation. However, we would like to emphasize that **the bound shows a consistent correlation with statistical significance across 61 distribution shift cases over four different models, meaning that our analytic bound on EMID effectively tracks the performance degradation of MLLMs.** We appreciate your valuable concern, and we will explore devising a much tighter bound in future work.
> Reference
1. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Zheng et al. 2023
2. PROMETHEUS: INDUCING FINE-GRAINED EVALUATION CAPABILITY IN LANGUAGE MODELS, Kim et al. 2024
---
### Rebuttal to vm1D (Actual Submission)
> _A1. Legend issue in Fig. 1. and non-monotonic performance trend in text shifts_.
Thank you for pointing out the visualization issue—we will revise the legend in Fig. 1 to improve clarity and avoid confusion!
* Regarding the non-monotonic trend in win rate under text shifts, this behavior arises in part from the inherent stochasticity of win rate computation based on GPT-4 API evaluations, making it fundamentally difficult to observe strictly monotonic trends in practice [1]. Additionally, the x-axis in Fig. 1 is sorted by embedding space distances—computed using CLIP ViT for visual shifts and XLM-RoBERTa for text shifts—which may not always reflect the true degree of distributional shift.
* That said, we still observe a meaningful overall relationship between embedding distance and performance degradation, both in the 27 natural shifts presented in Fig. 1 and in the 34 synthetic shifts shown in Fig. 6 of the Appendix. Taking inspiration from this empirical analysis, we derive a much more rigorous framework, i.e., the EMID upper bound (Theorem 4.5), to quantify the performance gap, which consistently shows statistical significance across diverse settings.
> _A2. Clarification for the meaning of Spearman correlation and Kendall’s tau_.
* Spearman's $\rho$ and Kendall's $\tau$ are both representative measures of monotonic relationships between two variables; the former is preferred for detecting weak correlations, while the latter better captures strong correlations in small samples and is more robust to outliers in large samples. Both are standard approaches for measuring the correlation between LLM-judge scores and other metrics [2], and both handle variables of different data types, e.g., discrete (win rate) versus continuous (EMI).
* Both coefficients range from -1.0 (perfect negative correlation) to 1.0 (perfect positive correlation), where 0.0 indicates no monotonic relationship between the two variables. For Spearman's $\rho$, values of 0.2-0.4 denote weak correlation, 0.4-0.6 moderate correlation, and 0.6-0.8 and 0.8-1.0 strong and very strong correlations, respectively; Kendall's $\tau$ can be interpreted similarly after multiplying by 1.5, i.e., $\rho \approx 1.5\tau$, to compensate for its relatively smaller scale in practice.
* Our analysis in the paper (Table 2) indicates that the EMI consistently shows moderate or strong correlation with the LLM-judge evaluation metric, win rate, across different types of shifts and model architectures.
We will add this description in our future version of the manuscript. Thank you for the suggestion.
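An illustrative computation of both coefficients with SciPy (toy arrays standing in for per-scenario EMI estimates and win rates, not our actual data):

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(0)
emi = rng.uniform(0.1, 0.3, size=35)               # toy per-scenario EMI estimates
win_rate = 60 - 80 * emi + rng.normal(0, 1.5, 35)  # toy win rates, negatively related

rho, p_rho = spearmanr(emi, win_rate)
tau, p_tau = kendalltau(emi, win_rate)
print(f"Spearman rho={rho:.2f} (p={p_rho:.3f}), Kendall tau={tau:.2f} (p={p_tau:.3f})")
# Rule of thumb above: |rho| 0.4-0.6 moderate, 0.6-0.8 strong; rho ~= 1.5 * tau.
```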
> _A3. Intricate visualization of Figure 3 and the tightness of EMID upper bound_.
* On the left two panels in Figure 3, we presented the overall relationship between EMID and its upper bound, whereas **we distinguished the four models on the right two panels to show the model-dependent differences in this relationship**. We did so to show that Theorem 4.5 actually differentiates the models themselves through the output entropy of each model $H(P_{\theta}(\cdot|x))$, e.g., LLaVA NeXT shows higher sensitivity to the shifts than LLaVA v1.5, as implied by the larger slope.
* In this work, we do not claim the tightness of the derived upper bound. Moreover, verifying the tightness of the proposed bound can be affected by the choice of estimators for the MI and JSD terms during empirical validation. However, we would like to emphasize that **the bound shows a consistent correlation with statistical significance across 61 distribution shift cases over four different models, meaning that our analytic bound on EMID effectively tracks the performance degradation of MLLMs.** We appreciate your valuable concern, and we will explore devising a much tighter bound in future work.
> Reference
1. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Zheng et al. 2023
2. PROMETHEUS: INDUCING FINE-GRAINED EVALUATION CAPABILITY IN LANGUAGE MODELS, Kim et al. 2024