<!-- [P]: pointed out, [W]: weakness, [Q]: question -->
<!-- Thank you so much for your great efforts to thoughtfully manage the reviewing process of multiple papers. We are writing to raise a serious concern regarding the review from the `Reviewer sTSP`. -->
### Confidential Comment to AC
Dear AC, SAC and PC,
Thanks for your hard work overseeing the review process. We are writing to express our deep concerns regarding the review from `Reviewer sTSP`, which falls far below the standards of ICML and calls into question the reviewer’s qualification to fairly evaluate submissions in this field.
## Summary of concern
Reviewer sTSP appears to have **entirely failed to understand the basic problem setup and the technical content** of our submission, explicitly stating "_I struggled to understand..._" and "_I didn't go through the proofs..._". Many of the comments focus on basic mathematical notation and standard ML concepts that are clearly defined in the manuscript, suggesting a superficial or incomplete reading of the paper—and a lack of the necessary expertise. In contrast, **all three other reviewers clearly understood the paper**, provided in-depth and constructive comments, and recommended our submission at 3 (Weak Accept).
Given this disparity and Reviewer sTSP's complete lack of background in MLLMs, we believe the reviewer should have flagged the paper as outside their area of expertise rather than submitting a review that was both unprofessional and unconstructive. This experience has been deeply disheartening to us as authors.
## Evidence of issues in review
While there are numerous issues in the review, we highlight the most concerning ones below:
**(1) Reviewer sTSP has no background in MLLM**
Reviewer sTSP's comments reveal a clear lack of familiarity with the field of MLLMs, **which forms the core focus of our paper and raises serious doubts about their ability to judge our work effectively**. For instance, the reviewer questions the meaning of "multimodal language model" and incorrectly assumes our models take only visual input, despite our explicit definition in the preliminaries that MLLMs process both visual and textual inputs (Lines 92-109). As a result, the reviewer fails to understand even our problem setup, which the other reviewers explicitly recognized and appreciated:
- Reviewer vm1D: "_Understanding MLLMs under distribution shifts seems a critical research problem_.”
- Reviewer cWre: "_This paper provides a theoretical framework to analyze and understand the impact of distribution shifts on MLLM performance via EMI._"
**(2) Reviewer sTSP struggles to understand basic ML concepts such as data distribution**
Reviewer sTSP repeatedly expresses confusion over concepts that are not only clearly defined in the manuscript, but also well-understood and acknowledged by other reviewers. For example, Reviewer sTSP questioned "_How are $P_{X}$ and $P_{Y}$ defined? Are these trained models?_..." However, we clearly introduced $P_{\mathbf{X}Y}$ as a data distribution, with $P_{\mathbf{X}}$ and $P_{Y}$ explicitly described as marginals (Lines 92-103). In contrast,
- Reviewer cWre: “_The theoretical analyses are clearly presented._”
- Reviewer vm1D: “_The paper is well constructed and easy to read._”
- Reviewer o1Fb: “_The derivations appear to be sound._”
**(3) Reviewer sTSP didn't engage with our core theoretical and empirical contributions**
The reviewer does not comment on any of our core theorems (e.g., Theorems 4.5 and 4.6 on the EMID upper bound), derivations, or assumptions—all of which are clearly laid out and discussed. Similarly, the reviewer **does not mention a single result, figure, or experiment**, despite our paper including 61 real-world distribution shift scenarios, multiple models, and correlation analyses linking EMI and win rate. This stands in sharp contrast with the other reviewers:
- Reviewer vm1D highlights: “_Through extensive experiments... the authors validate their framework._”
- Reviewer cWre: “_The empirical findings strongly support most of the theoretical conclusions._”
- Reviewer o1Fb: “_The connection between EMI and win rate provides a practical and efficient alternative for model evaluation._”
## Final remark
In light of the situation, we request a careful re-evaluation of the concerns raised by Reviewer sTSP, considering the positive feedback from other reviewers and the thoroughness of our revisions and responses. We sincerely hope our manuscript can receive a fair and balanced assessment, given its theoretical and empirical significance for the field.
Thanks again for your attention and service to the community.
Sincerely,
The authors
----
### Xuefeng's AC note
Dear AC/SAC/PC,
Thank you for taking the time to read our message! We would like to bring to your attention the following situation:
- Firstly, Reviewer sTSP appears to lack familiarity with the current landscape of MLLMs and with probability and machine learning theory, and raises incorrect criticisms of the definitions of MLLMs and their input/output distributions. These core concepts are clearly presented in our paper, e.g., **Section 2 (L92-109)**, and are recognized by multiple reviewers (o1Fb, cWre, vm1D) as well.
- Secondly, Reviewer sTSP questioned our motivation for studying visually-conditioned LLMs, a criticism that is ungrounded. In fact, our paper offers a formal framework for understanding MLLMs under distribution shifts, a problem that is unexplored yet essential for reliable artificial intelligence in the wild. Our theory characterizes how MLLM performance under distribution shift relates to the divergences of the marginal distributions and the conditional dependencies. This goes beyond the simplistic framing suggested by Reviewer sTSP, who incorrectly reduces the problem to a basic comparison between conditional and unconditional sequence models.
- Our extensive experiments comprehensively examine 34 synthetic and 27 natural distribution shift scenarios on 4 representative MLLMs, confirming strong correlations between our theory and practice. This broad evaluation surpasses prior works, which typically consider fewer shifts or models. However, this key contribution is largely overlooked by Reviewer sTSP, which renders the review unfair and subjective.
To sum up, we believe the soundness of this review is questionable. We have clarified all of the concerns in our rebuttal and will incorporate the changes in our revised paper, but we thought we should bring this situation to your attention.
Thank you for your valuable judgment and service.
Sincerely,
Authors
<!-- - Therefore, the problem itself compared unconditional sequence models is not simply *to what degree a conditional distribution (given some prefix) differs from the non-conditional distribution.* as stated by Reviewer sTSP. -->
<!-- Many of the reviewer’s comments focus on basic mathematical notations that are clearly defined in the manuscript, suggesting that the reviewer may not have read the paper carefully—or may lack the background knowledge necessary to evaluate it appropriately. We detail explicit examples of these issues below. -->
<!--
In contrast, **all other reviewers clearly understood the paper** and provided in-depth comments, with Reviewer vm1D explicitly stating: _“The paper is well constructed and easy to read.”_ Conversely, Reviewer sTSP repeatedly expresses confusion (“I struggled to understand…”, “I didn’t go through the proofs…”), suggesting a **lack of preparedness or subject-matter expertise** to provide a fair review.
-->
<!-- **Reviewer sTSP has no background in MLLM**. The reviewer appears unfamiliar with terminology such as "multimodal language model" and the corresponding literature—core concepts central to our method. In the current research literature, the term MLLMs is commonly used to denote a large language model processing combined visual and textual inputs. Although Section 2 (L92-109) explicitly defines multimodal inputs for MLLMs, Reviewer sTSP misunderstood our clearly stated setup by suggesting that the model uses visual input alone. -->
<!-- * **Lack of understanding of basic ML concepts such as data distribution**. The reviewer questioned "_How are $P_{X}$ and $P_{Y}$ defined? Are these trained models?_..." However, our manuscript explicitly states (Lines 92-103) that $\mathbf{X}=(X_{v},X_{t})$ denotes sequences of tokens integrating visual and textual queries, while $Y$ represents response tokens. We clearly introduced $P_{\mathbf{X}Y}$ as a data distribution, with $P_{\mathbf{X}}$ and $P_{Y}$ explicitly described as marginals. This confusion about foundational ML concepts is alarming. -->
<!--
* **Factually incorrect critism**. The reviewer incorrectly claimed "_In the motivation with visual shifts, I would put the definitions of the variables before the shift examples, ..., Line 101: instruction tuning has not been introduced._" Contrary to this, variable definitions (L92-102) and instruction tuning (Lines 104 onwards) were explicitly and clearly presented in the preliminary section preceding the motivation.
-->
<!-- ## Mismatch with other reviewers
1. **Understanding of basic definitions**. Reviewer sTSP repeatedly states confusion about basic definitions, but all of these concepts are well-understood by the other reviewers. Reviewer cWre and vm1D explicitly acknowledged the clarity of our definitions and the validity of theoretical derivations. Reviewer o1Fb even confirmed that "the derivations appear to be sound" and noted that "the connection between EMI and win rate provides a practical and efficient alternative for model evaluation."
2. **Comment on empirical evaluation**. Reviewer sTSP does not comment on any single figure, result, or evaluation method. In contrast, Reviewer vm1D says “Through extensive experiments... the authors validate their framework.”, Reviewer cWre says: “The empirical findings strongly support most of the theoretical conclusions.”
3. **Acknowledge on the motivation**. Reviewer sTSP questions the motivation behind focusing on visually-conditioned LLMs. However, other reviewers clearly acknowledge the importance of our setting, e.g., Reviewer vm1D: “Understanding MLLMs under distribution shifts seems a critical research problem.” and Reviewer cWre: “This paper provides a theoretical framework to analyze and understand the impact of distribution shifts on MLLM performance via EMI.” -->
<!--
As authors, we are deeply disheartened by the lack of care and rigor evident in this review. It is clear that Reviewer sTSP did not engage with the material in good faith or with the technical competence required to review a submission in this area. The misunderstanding of basic ML concepts and failure to follow clearly presented content calls into question the qualification and responsibility as a reviewer.
-->
<!--We fully respect the peer review process and appreciate the efforts of all reviewers. We also appreciate `Reviewer sTSP`'s attempt to understand our paper, and some suggestions for clearer presentation. **However, -->
<!-- **Due to the careless reading and lack of background knowledge of `Reviewer sTSP`, the review from `sTSP` does not engage with the main contributions of our work and fails to provide highly relevant feedback.** It raises concerns about the fairness and expertise of the review process in this case. -->
<!-- > Reference
* A Survey on Multimodal Large Language Models, Yin et al. 2024
* Hallucination of Multimodal Large Language Models: A Survey, Bai et al. 2024
* BLINK: Multimodal Large Language Models Can See but Not Perceive, Fu et al. 2024
* On the Out-Of-Distribution Generalization of Multimodal Large Language Models, Zhang et al. 2024
* Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al. 2024
* Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
-->
---
### Rebuttal to o1Fb
<!-- Raw review: [W] Evaluations are only made on LLaVA v1.5 and LLaVA NeXT. It could benefit from involving the SOTA and representative MLLMs like Qwen2.5-VL and InternVL2.5 -->
> _A1. Applicability on SOTA MLLMs._
Thanks for the excellent suggestion! Following your comment, **we additionally conduct the full evaluation with`Qwen2.5-VL-7B-Instruct` and `InternVL2.5-7B`**.
Specifically, we first evaluate the official release of `Qwen2.5-VL-7B-Instruct` and `InternVL2.5-7B` models on 35 synthetic shifts scenarios, then compute empirical EMI estimates over the models' responses. We perform a Pearson correlation analysis between empirical estimates of EMI difference (EMID) and its upper bound. **Consistent with our existing finding, we observe a strong correlation between EMID and its theoretical upper bound in both models**.
|Model|Pearson $r$|$p$-val|
|-|-|-|
|InternVL2.5-7B|0.67|0.00|
|Qwen2.5-VL-7B|0.81|0.00|
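For reference, a minimal sketch of this correlation analysis; the arrays below are toy stand-ins for the per-scenario estimates, not our actual numbers:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Toy stand-ins for the 35 per-scenario estimates; in practice both arrays
# come from the empirical EMI estimation pipeline over model responses.
bound = rng.uniform(1.1, 1.8, size=35)               # upper-bound estimate per shift
emid = 0.1 * bound + rng.normal(0.0, 0.02, size=35)  # correlated EMID estimate (toy)

r, p = pearsonr(emid, bound)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```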
<!-- We then measure the Spearman rank correlation and Kendall's tau between EMI estimates and win rates across all 35 scenarios. Results are presented in the Tables below. **The correlations clearly demonstrate the alignment between EMI and win rates for these SOTA models**.
|Model|Spearman $\rho$|Spearman $p$-val|Kendall $\tau$|Kendall $p$-val|
|-|-|-|-|-|
|InternVL2.5-7B|0.77|0.00|0.57|0.00|
|Qwen2.5-VL-7B|0.38|0.03|0.27|0.02|
-->
<!-- Raw review: [Q] How do the assumptions made in the theoretical framework impact the applicability of the results to more complex real-world scenarios? Can these assumptions be relaxed in future work?-->
<!--The validity of this assumption depends on how well our MLLM approximates (in terms of KL divergence) the actual conditional distribution $P_{Y|\mathbf{X}}$ that win rate will be computed on. -->
<!-- * Although our empirical validation shows consistent correlation between EMI and win rate, we generally can not guarantee that MLLM approximates arbitrary conditional distributions $Q_{Y|\mathbf{X}}$ encountered during evaluation time, where the approximation error ($\epsilon$) becomes large. -->
<!-- * As the visual instruction tuning explicitly reduces this KL divergence through Eq. (1) in our manuscript, this assumption is safely held on all the in-distribution (ID) data. -->
<!-- Raw review: [Q] The paper mentions the potential use of the upper bound of EMID as a regularizer during post-training or test-time adaptation. Can you provide more details on how this could be implemented and its potential impact on model robustness?-->
> _A2. How do assumptions impact the applicability of proposals to more complex cases?_
This is a very insightful question.
* First, **to claim the closeness between EMI and win rate (Theorem 4.4), we assumed $\epsilon$-representation capacity of the MLLM**.
* $\epsilon$-representation capacity essentially reflects the model’s ability to approximate the target task’s conditional distribution, meaning that the model can approximate this distribution with a KL divergence no greater than $\epsilon$.
* **Given the strong expressive and approximation capabilities of recent large-scale MLLMs, this assumption is generally reasonable in practice.**
* Moreover, there are numerous efforts that improve the diversity of an instruction tuning dataset and robustness of the visual encoder of MLLM [1,2,3,4], which make an MLLM's learned distribution robustly approximate conditional distributions encountered during evaluation time.
* As we continually pursue richer dataset construction strategies and stronger visual recognition in the encoder, the $\epsilon$-representation capacity assumption becomes reasonable in increasingly complex cases.
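* For concreteness, the $\epsilon$-representation capacity assumption can be written as $\min_{\theta\in\Theta}\mathbb{E}_{x\sim P_{\mathbf{X}}}[D_{\rm KL}(P_{Y|\mathbf{X}=x}\,\|\,P_{\theta}(\cdot|x))]\leq\epsilon$, i.e., the model family contains a parameter whose conditional distribution matches the target within an average KL divergence of $\epsilon$.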
* Second, **to claim the relationship between EMID and its upper bound in a simple case (Theorem 4.5), we assumed consistent conditional distributions** over $X_{v}|X_{t}$, $X_{t}|X_{v}$, and $Y|\mathbf{X}$.
* This assumption zeros out the discrepancy between the conditional distributions of ID and OOD data; if the conditional distributions differ substantially between ID and OOD datasets in complex real-world scenarios, the assumption makes our upper bound underestimate the performance gap, i.e., EMID.
* However, we highlight that the strong correlations between EMID and this upper bound have been observed through 61 distribution shift scenarios, implying the validity of our upper bound to quantify EMID.
* **Meanwhile, we also provide a bound for general cases to address non-consistent conditional distributions in Theorem 4.6**. This general-case bound can also be empirically estimated using a procedure similar to that of its simpler counterpart.
* Therefore, as mentioned in our manuscript (lines 302-303), we recommend choosing the proper bound based on the degree of knowledge of the data-generating process for each dataset.
> _A3. Details for practical implications of EMID upper bound_.
While we confined the scope of this project to _presenting the first theoretical framework to quantify MLLM's performance gap_, we further provide **a potential application of EMID upper bound, instruction tuning with regularization, for this rebuttal**.
Without loss of generality, we can treat the input sequence $\mathbf{X}=(X_{v},X_{t})$ as a sequence of intermediate representation vectors of the MLLM, i.e., $\mathbf{Z}=(Z_{v},Z_{t})$, and can also assume that $P_{\theta}(\cdot|\cdot)$ maps this representation to responses, i.e., $P_{\theta}:\mathbf{Z}\rightarrow Y$. This induces a modified bound with the representation variable $\mathbf{Z}$ rather than the raw data input $\mathbf{X}$.
We instantiate this modified EMID bound in an instruction-tuning setup below, where we set the 24th layer's hidden states as $\mathbf{Z}$, and adopt MMD [5] as a differentiable estimator for the JSD terms together with the average empirical model output entropy. We provide evaluation results with LLaVA-v1.5-7B on in-distribution (ID) and visual (V), text (T), and joint (J) synthetic shifts.
`Regularization term for instruction tuning`: $\mathbb{E}[H(P_{\theta}(\cdot|\mathbf{z}))]\cdot(D_{\rm JS}^{\frac{1}{2}}(P_{Z_{v}}||N(0,I))+{D}_{\rm JS}^{\frac{1}{2}}(P_{Z_{t}}||N(0,I)))$
* One cannot access $Q_{\mathbf{X}}$ during the training phase, so we alternatively enforce the distribution of the intermediate representation to be close to the standard Gaussian.
* We sampled 10% of the original instruction tuning dataset from LLaVA-v1.5, and trained the entire LLM and modality connector parameters of LLaVA-v1.5 with and without the regularization term.
|Method|ID|V Shift|T Shift|J Shift|
|-|-|-|-|-|
|Baseline|72.7|65.8|68.0|59.6|
|Baseline + Ours|72.7|**66.3**|**68.3**|**60.8**|
<!-- 2. `Learning objective for test-time adaptation (TTA)`: $\mathbb{E}[H(P_{\theta}(\cdot|\mathbf{z}))]\cdot(D_{\rm JS}^{\frac{1}{2}}(P_{Z_{v}}||Q_{Z_{v}})+{D}_{\rm JS}^{\frac{1}{2}}(P_{Z_{t}}||Q_{Z_{t}}))$
* We store some samples from ID distribution $P_{\mathbf{Z}}$, and dynamically estimate $D_{\rm JS}(P_{\mathbf{Z}}||Q_{\mathbf{Z}})$ using unlabeled test samples from $Q_{\mathbf{Z}}$.
* We update the entire LLM and modality connector parameters for [XXX] epochs before evaluation.
|Method|Natural Shifts (Avg.)|Synthetic Shifts (Avg.)|
|-|-|-|
|Baseline|0.xx|0.xx|
|Baseline + Ours|0.xx|0.xx| -->
* As shown in the table above, we confirm that the proposed EMID bound can be leveraged as an effective regularizer during instruction tuning to pursue better robustness to distribution shifts; a minimal code sketch of the regularizer follows below.
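For illustration, a PyTorch-style sketch of how this regularizer could be instantiated; the pooling of hidden states, the RBF bandwidth, and all function names are simplifying assumptions for exposition, with MMD [5] standing in for the $D_{\rm JS}^{1/2}$ terms as described above:

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased squared MMD with an RBF kernel between two sample batches."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def emid_regularizer(z_v, z_t, logits):
    """Sketch of the EMID-bound regularizer for instruction tuning.

    z_v, z_t: (batch, dim) pooled 24th-layer hidden states of the visual and
              textual spans (pooling is a simplifying assumption here).
    logits:   (batch, seq, vocab) output logits for the response tokens.
    """
    log_p = torch.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(-1).mean()          # E[H(P_theta(.|z))]
    g_v, g_t = torch.randn_like(z_v), torch.randn_like(z_t)  # N(0, I) references
    div = (rbf_mmd2(z_v, g_v).clamp_min(0).sqrt()
           + rbf_mmd2(z_t, g_t).clamp_min(0).sqrt())         # surrogate JSD^(1/2) terms
    return entropy * div  # added to the autoregressive loss with a small weight
```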
> Reference
1. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al. 2024
2. LLaVA-OneVision: Easy Visual Task Transfer, Li et al. 2024
3. Qwen2.5-VL Technical Report, Alibaba Group 2025
4. Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
5. Learning Deep Kernels for Non-Parametric Two-Sample Tests, Liu et al. 2020
<!-- 5. The Representation Jensen-Shannon Divergence, Hoyos-Osorio and Sanchez-Giraldo 2024 -->
### Rebuttal to o1Fb (Actual Submission)
> _A1. Applicability on SOTA MLLMs._
Thanks for the suggestion! Following your comment, **we additionally conduct the full evaluation with `Qwen2.5-VL-7B-Instruct` and `InternVL2.5-7B`**.
Specifically, we first evaluate the official releases of the `Qwen2.5-VL-7B-Instruct` and `InternVL2.5-7B` models on 35 synthetic shifts, then compute EMI estimates over the model responses. We perform a correlation analysis between estimates of the EMI difference (EMID) and its upper bound. **Consistent with our existing findings, we observe a strong correlation between EMID and its theoretical upper bound for both models**.
|Model|Pearson $r$|$p$-val|
|-|-|-|
|InternVL2.5-7B|0.67|0.00|
|Qwen2.5-VL-7B|0.81|0.00|
> _A2. How do assumptions impact the applicability of proposals to more complex cases?_
This is an insightful question.
First, **to claim the closeness between EMI and win rate (Thm 4.4), we assumed $\epsilon$-representation capacity of the MLLM**.
* $\epsilon$-representation capacity essentially reflects the model’s ability to approximate the target task’s conditional distribution, meaning that the model can approximate this distribution with a KLD no greater than $\epsilon$.
* **Given the strong expressive and approximation capabilities of recent large-scale MLLMs, this assumption is generally reasonable in practice.**
* Moreover, numerous efforts improve the diversity of an instruction tuning dataset and robustness of the visual encoder of MLLM [1,2], which makes the learned distribution robustly approximate conditional distributions encountered during evaluation.
* As we continually pursue richer dataset construction and stronger visual recognition in the encoder, the $\epsilon$-representation capacity assumption becomes reasonable in increasingly complex cases.
Second, **to claim the relationship between EMID and its upper bound in a simple case (Thm 4.5), we assumed consistent conditional distributions** over $X_v|X_t$, $X_t|X_v$, and $Y|X$.
* This assumption zeros out the discrepancy between the conditional distributions. If the conditional distributions differ substantially between ID and OOD in some real-world scenarios, this makes our upper bound underestimate the performance gap, i.e., EMID.
* However, we highlight that the strong correlations between EMID and this upper bound have been observed through 61 distribution shifts, implying the validity of our upper bound to quantify EMID.
* **Meanwhile, we also provide a bound for general cases to address non-consistent conditional distributions in Thm 4.6**. This general-case bound can also be empirically estimated using a procedure similar to that of Thm 4.5.
* Therefore, as mentioned in our manuscript (L302-303), we recommend choosing a proper bound based on the knowledge of the data-generating process for datasets.
> _A3. Details for practical implications of EMID upper bound_.
While we confined the scope of this project to _presenting the first theoretical framework to quantify MLLM's performance gap_, we further provide **a potential application of EMID upper bound, instruction tuning with regularization, for this rebuttal**.
Without loss of generality, we can assume the input sequence $X=(X_v,X_t)$ as a sequence of intermediate representation vectors of MLLM, i.e., $Z=(Z_v,Z_t)$, and can also assume that $P_{\theta}(.|.)$ maps this representation to responses, i.e., $P_{\theta}:Z \rightarrow Y$. This induces a modified bound with representation variable $Z$ rather than raw data input $X$.
We instantiate this modified EMID bound in an instruction-tuning setup below, where we set the 24th layer's hidden states as $Z$, and adopt RJSD [3] as a differentiable estimator for the JSD terms together with the average empirical model output entropy. We provide evaluation results with LLaVA-v1.5-7B on in-distribution (ID) and visual (V), text (T), and joint (J) synthetic shifts.
`Regularization term for instruction tuning`: $\mathbb{E}[H(P_{\theta}(\cdot|z))] \cdot (D_{\rm JS}^{1/2}(P_{Z_v}||N(0,I)) + D_{\rm JS}^{1/2}(P_{Z_t}||N(0,I)))$
* One cannot access $Q_X$ during the training phase, so we alternatively enforce the distribution of the intermediate representation to be close to the standard Gaussian.
* We sampled 10% of the instruction dataset from LLaVA-v1.5, and trained the entire LLM and modality connector parameters of LLaVA-v1.5 with and without the regularization.
* As shown in the table, we confirm that the EMID can be leveraged as a regularizer during instruction tuning to pursue better robustness to distribution shifts.
|Method|ID|V Shift|T Shift|J Shift|
|-|-|-|-|-|
|Baseline|72.7|65.8|68.0|59.6|
|Baseline + Ours|72.7|**66.3**|**68.3**|**60.8**|
> Reference
1. Qwen2.5-VL Technical Report, Alibaba Group 2025
2. Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
3. The Representation Jensen-Shannon Divergence, Hoyos-Osorio et al. 2024
---
### Rebuttal to sTSP
<!--
* [Q] The authors suggest there is a risk involved in using multimodal models (language models conditioned on visual input) and suggest a way to measure this is using a difference of mutual information between a model and the distribution of the training data (is this right?) – though it remains unclear to me how that distribution is gotten.
* [Q] The abstract talks about the risk of MLLMs and safe and reliable applications, and the proposed framework about quantifying risk under distribution shifts. It’s not clear what this risk is. Would it make sense to just call it the performance?
* [P] You talk about multimodal language models in general in the introduction, but it seems this is specifically about (bi-modal) vision and language models with vision only in the input. It could also be made clearer that the paper is only about textual output.
* [Q] I also don’t see the reason for focusing on vision in the input and instruction tuning. It seems to me the part of the ideas I followed apply to sequence models in general. As such the question is to what degree a conditional distribution (given some prefix) differs from the non-conditional distribution.
* [P] I don’t understand how the various probability distributions are defined. I don't understand the definitions of the visual and textual shifts, e.g. the relation between $D(P_{X_v}||Q_{X_{v}})$ and $D(P_{X_t}||Q_{X_{t}})$. How are these defined exactly in terms of next token probabilities? What is $P_{X_v}$ and $P_{X_t}$? You only define KL with some arbitrary distribution P.
* [Q] How are $P_{X}$ and $P_{Y}$ defined? Are these trained models? Are $P_{X},P_{Y},P_{XY},P_{\theta}$ all the same trained model differing only in prefixes? E.g. as used in equation 3. For all of these, I would expect to see exact definitions of the distributions in terms of the exact model configuration you say your method works for.
* [Q] In the motivation with visual shifts, I would put the definitions of the variables before the shift examples, it was confusing.
* [Q] Random variables: Could you define the domains of the random variables in line 93? How is (X_v, X_t) combined into a single sequence?
* [P] Line 101: instruction tuning has not been introduced. Line 99: joint population → joint probability? Equation 1, should this not be argmin? Equation 2. Should this be P_{\theta} instead of \theta
* [P] I think the biggest problem with the paper is the writing and underspecified mathematics. The text is unclear throughout the abstract, introduction, and the motivation of what is being done. I don't find it clear what problem is being solved, nor what the exact configuration in which your method applies is.
* [P] More concretely, I don’t find the distributions and the distribution shifts well-defined making it very hard to follow any derivations or reasoning. Even if these (in your mind) are somewhat standard conditional factorizations, I want to see it spelled out, for every such $P_{X},P_{Y},P_{XY},P_{\theta}$, otherwise it’s very hard to be sure that what is written is correct. It was also unclear what is trained and what is not trained.
* [P] The authors define mutual information as a function of a single distribution with an implicit factorization instead of over two random variables. This did not make it easier to follow the writing. For instance, in equation 7 of the EMI (a main contribution), what is the definition of $I(P_{X}\otimes P_{\theta})$, the definition of $I$ is hardcoded for a distribution $P_{XY}$ based on the factorization given. If there is some marginalization used to define it this needs to be defined, or maybe stick to the mutual information between two random variables.
-->
<!-- * Under the problem statement, we specifically contribute the following:
* Proposing effective mutual information (EMI), which can reliably measure the performance of MLLMs, and show theoretical connection between EMI and existing standard metric win rate.
* Deriving a theoretical upper bound of EMI difference (EMID) between in-distribution (ID) and out-of-distribution (OOD) data to quantify the performance degradation of MLLM. See Theorem 4.5 and Theorem 4.6.
* Validating our theoretical statements through extensive evaluation on 61 distribution shfits scenario with four MLLMs.
-->
<!--* That is, they are distributional divergences between input domains, and we can not further spell out the term because we do not know the ground truth distributions for input domains.-->
<!-- Note that they are distribution over observable feature variable rather than modal-related variable. Because we do not know the ground truth probability density function for this distribution, we can not express the $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$ terms in more detail. -->
<!--* The variable $X_v$ and $X_t$ can be elaborated as $X_t=(X_{t,1},...,X_{t,L_t})$ where $X_{t,i}\in V$ and $X_v=(X_{v,1},...,X_{v,L_v})$ where $X_{v,i}\in \mathbb{R}^{\frac{W \times H \times C}{L_v}}$. Here, $V$ denote token vocabulary of the LLM, $L_t$ and $L_v$ denote the length of text and visual seqence, respectively, and the W, H, C denote the width, height, and channel dimension of the input image. -->
<!--* An image $X_v$ of $L_v$ patches are fed into visual encoder and projector that maps the patches to tokens in word embedding space (Refer to [5] for details), then all tokens can be concatenated to a sequence in the word embedding space.-->
We appreciate sTSP's effort to read our paper and provide comments. We present a notation table and responses to each comment.
|Var.|Def.|
|-|-|
|$X_t=(X_{t,1},...,X_{t,L_t})$ where $X_{t,i}\in V$|a random variable (r.v.) of a text input sequence with length $L_t$ of tokens in vocabulary $V$|
|$X_v=(X_{v,1},...,X_{v,L_v})$ where $X_{v,i}\in \mathbb{R}^{D_v}$| a r.v. of a $D_v$-dimensional image embedding sequence with length $L_v$ of tokens produced by a frozen visual encoder|
|$\mathbf{X}=(X_v,X_t)$| a joint r.v. of an input query constructed with a tuple of $X_v$ and $X_t$|
|$Y=(Y_1,...,Y_L)$ where $Y_i\in V$|a r.v. of a text response with length $L$ of tokens|
|$P_{\mathbf{X}}=P(X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|a probability distribution (p.d.) of an input query|
|$P_{X_v}=P(X_{v,1},...,X_{v,L_v})$|a p.d. of a visual input query|
|$P_{X_t}=P(X_{t,1},...,X_{t,L_t})$|a p.d. of a text input query|
|$P_{Y}=P(Y_1,...,Y_L)$| a p.d. of a text response|
|$P_{Y \|\mathbf{X}}=P(Y_1,...,Y_L\|X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|a conditional p.d. of a text response given input query|
|$P_{\mathbf{X}Y}=P(X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t},Y_1,...,Y_L)$|a joint p.d. of input query and response|
|$P_{\theta}(Y\|\mathbf{X})=P_{\theta}(Y_1,...,Y_L\|X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|model's prediction p.d. for a response given input query|
> What problem is being solved? what is the risk?
* As we noted in the introduction, MLLMs suffer from performance degradation when they encounter distribution shifts. We use the term risk to denote this performance degradation.
* `Problem focus`: As noted in the abstract, introduction, and motivation sections, **our goal is to quantify the performance degradation of MLLMs under distribution shifts by presenting an information-theoretic framework**.
> It seems this is about (bi-modal) vision and language models with vision only in the input.
* As we noted in `L107-109`, MLLMs take multimodal input (both visual and text) to produce text output, not "vision only in the input".
* The term MLLM is commonly used in the literature to denote LLMs that receive visual input as well as text [1,2], so we adopted this term following the convention.
> Definition of $P_{X},P_{Y},P_{XY},P_{\theta}$. Are they all the trained model?
* In `L92-103`, we put the definition of random variables and distributions. $P_X,P_Y,P_{XY}$ denote the probability distributions of the input query $\mathbf{X}$, target response $Y$, and their joint $\mathbf{X}Y$, respectively.
* $P_{\theta}$ is the model being trained.
> Definitions of $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$. How are these defined in terms of next token probabilities? What is $P_{X_v}$ and $P_{X_t}$?
As noted in `L96-97`, $X_v$ and $X_t$ are the sequences of visual and text input tokens, so $P_{X_v}$ and $P_{X_t}$ are the corresponding distributions. $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$ are defined at the input level, which should not be confused with next-token probabilities (at the output level).
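For instance, taking $D$ to be the KL divergence, with densities $p_{X_v}$ and $q_{X_v}$ of the ID and OOD visual-input distributions,

$$D_{\rm KL}(P_{X_v}\,\|\,Q_{X_v})=\mathbb{E}_{x_v\sim P_{X_v}}\!\left[\log\frac{p_{X_v}(x_v)}{q_{X_v}(x_v)}\right],$$

and analogously for $D(P_{X_t}\,\|\,Q_{X_t})$; no next-token factorization is involved.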
> I'd put the definitions of the variables before the shift examples. Domains of the variables in L93? How is (X_v,X_t) combined into a sequence?
* We indeed introduced the definition of all random variables in `L92-103` before going into the motivation section.
* We concatenate the visual input tokens $X_v$ and textual input tokens $X_t$ into a single input sequence, where the visual tokens are obtained by encoding the image using a vision encoder (e.g., CLIP-ViT), and then projecting them into the language embedding space. This follows the standard practice in MLLMs, where visual tokens are prepended to the text tokens to form a unified input sequence.
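A minimal pseudo-PyTorch sketch of this standard pipeline; the module names `vision_encoder`, `projector`, and `embed_tokens` are illustrative placeholders, not our exact implementation:

```python
import torch

def build_mllm_input(image, text_ids, vision_encoder, projector, embed_tokens):
    """Concatenate visual and textual tokens into one input sequence X = (X_v, X_t).

    image:    (1, 3, H, W) pixel tensor
    text_ids: (1, L_t) token ids in the vocabulary V
    """
    patches = vision_encoder(image)      # (1, L_v, D_enc) patch features
    x_v = projector(patches)             # (1, L_v, D): map into language embedding space
    x_t = embed_tokens(text_ids)         # (1, L_t, D): text token embeddings
    return torch.cat([x_v, x_t], dim=1)  # (1, L_v + L_t, D): visual tokens prepended
```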
> Definition of MI and EMI
* Our main interest is to express the model performance gap across different distributions. Thus, instead of defining MI over individual random variables, we define MI as a function of their joint distribution, which is mathematically equivalent to the random-variable-based MI, as can be seen in Eq. (3).
* The definition of EMI was explicitly introduced in the Eq (6). Please refer to `L168-169`.
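Concretely, MI as a function of a joint distribution coincides with the usual two-variable MI,

$$I(P_{\mathbf{X}Y})=D_{\rm KL}\big(P_{\mathbf{X}Y}\,\big\|\,P_{\mathbf{X}}\otimes P_{Y}\big)=I(\mathbf{X};Y),$$

and this functional form lets us plug in other joint distributions (e.g., the joint induced by the data marginal $P_{\mathbf{X}}$ and the model's conditional $P_{\theta}$ in Eq. (7)) under a single definition.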
> L101: instruction tuning has not been introduced. L99: joint probability? Eq 1, should this not be argmin? Eq 2. Should this be P_{\theta} instead of \theta
* Instruction tuning is introduced starting at `L104`.
* In statistics, the population (distribution) [3] is used to denote a distribution of the entire collection of objects in contrast to a sampled distribution.
* In Eq (1), both min and argmin can be valid: the former takes an objective-centric perspective, whereas the latter takes a parameter-centric one.
* In Eq (2), we use the first argument to denote the data distribution that the metric is computed on, and use the second argument to denote the model parameter to be evaluated. **We will use $P_{\theta}$ in the revised version.**
> Reference
1. A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al. 2024
2. Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
3. Sampling of Populations: Methods and Applications, Levy and Lemeshow 2013
---
### Rebuttal to sTSP (Actual Submission)
We appreciate sTSP's effort to read our paper and provide comments. Here is a notation table and our responses.
|Var.|Def.|
|-|-|
|$X_t=(X_{t,1},...,X_{t,L_t})$ where $X_{t,i}\in V$|a random variable (r.v.) of a text input sequence with length $L_t$ of tokens in vocabulary $V$|
|$X_v=(X_{v,1},...,X_{v,L_v})$ where $X_{v,i}\in \mathbb{R}^{D_v}$| a r.v. of a $D_v$-dimensional image embedding sequence with length $L_v$ of tokens produced by a frozen vision encoder|
|$\mathbf{X}=(X_v,X_t)$| a joint r.v. of an input query constructed with a tuple of $X_v$ and $X_t$|
|$Y=(Y_1,...,Y_L)$ where $Y_i\in V$|a r.v. of a text response with length $L$ of tokens|
|$P_{\mathbf{X}}=P(X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|a probability distribution (p.d.) of an input|
|$P_{X_v}=P(X_{v,1},...,X_{v,L_v})$|a p.d. of a visual input|
|$P_{X_t}=P(X_{t,1},...,X_{t,L_t})$|a p.d. of a text input|
|$P_{Y}=P(Y_1,...,Y_L)$| a p.d. of a text response|
|$P_{Y \|\mathbf{X}}=P(Y_1,...,Y_L\|X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|a conditional p.d. of a text response given input|
|$P_{\mathbf{X}Y}=P(X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t},Y_1,...,Y_L)$|a joint p.d. of input and response|
|$P_{\theta}(Y\|\mathbf{X})=P_{\theta}(Y_1,...,Y_L\|X_{v,1},...,X_{v,L_v},X_{t,1},...,X_{t,L_t})$|model's prediction p.d. for a response given input|
> What problem is being solved? What is the risk?
* As we noted in the introduction, MLLMs suffer from performance degradation when they encounter distribution shifts. We use the term risk to denote this performance degradation.
* `Problem focus`: As noted in the abstract, introduction, and motivation sections, **our goal is to quantify the performance degradation of MLLMs under distribution shifts by presenting an information-theoretic framework**.
> It seems this is about (bi-modal) vision and language models with vision only in the input.
* As we noted in `L107-109`, MLLMs take multimodal input (both visual and text) to produce text output, not "vision only in the input".
* The term MLLM is commonly used in the literature to denote LLMs that receive visual input as well as text [1,2], so we adopted this term following the convention.
> Definition of $P_X,P_Y,P_{XY},P_{\theta}$.
* In `L92-103`, we put the definition of random variables and distributions. $P_X,P_Y,P_{XY}$ denote the probability distributions of the input $\mathbf{X}$, target response $Y$, and their joint $\mathbf{X}Y$, respectively.
* $P_{\theta}$ is the model being trained.
> Definitions of $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$. How are these defined in terms of next token probabilities? What is $P_{X_v}$ and $P_{X_t}$?
As noted in `L96-97`, $X_v$ and $X_t$ are the sequences of visual and text input tokens, so $P_{X_v}$ and $P_{X_t}$ are the corresponding distributions. $D(P_{X_v}||Q_{X_v})$ and $D(P_{X_t}||Q_{X_t})$ are defined at the input level, which should not be confused with next-token probabilities (at the output level).
> Put the definitions of variables before shift examples. Domains of the variables in L93? How is (X_v,X_t) combined into a sequence?
* We put the definition of all variables in `L92-103` before the motivation section.
* We concatenate the visual input tokens $X_v$ and textual input tokens $X_t$ into a single sequence, where the visual tokens are obtained by encoding an image using a vision encoder (CLIP-ViT), and then projecting them into the language embedding space. This follows the standard practice in MLLMs, where visual tokens are prepended to the text tokens to form a unified input sequence.
> Definition of MI and EMI
* Our main interest is to express the model performance gap across different distributions. Thus, instead of defining MI over individual random variables, we define MI as a function of their joint distribution, which is mathematically equivalent to the random-variable-based MI, as can be seen in Eq 3.
* The definition of EMI was explicitly introduced in Eq 6. Please refer to `L168-169`.
> L101: instruction tuning has not been introduced. L99: joint probability? Eq 1, should this not be argmin? Eq 2. Should this be P_{\theta} instead of \theta
* Instruction tuning is introduced starting at `L104`.
* In statistics, the population (distribution) [3] is used to denote a distribution of the entire collection of objects in contrast to a sampled distribution.
* In Eq 1, both min and argmin can be valid: the former takes an objective-centric perspective, whereas the latter takes a parameter-centric one.
* In Eq 2, we use the first argument to denote the data distribution that the metric is computed on, and use the second argument to denote the model parameter to be evaluated. **We will use $P_{\theta}$ in the revised version.**
> Reference
1. A Fully Open, Vision-Centric Exploration of Multimodal LLMs, Tong et al. 2024
2. Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, Shi et al. 2025
3. Sampling of Populations: Methods and Applications, Levy and Lemeshow 2013
---
### Rebuttal to cWre
<!-- [W] I appreciate the theoretical part. However, the current theories do not address how modality fusion and modality interaction affect generalization, despite these techniques being commonly employed in (MLLMs). -->
<!-- [W] In Theorem 4.4, the paper assume that the $\epsilon$-representation capacity holds, where itself requires to be quantified, e.g. it it closed related to the model size, the number of training samples and the input dimension. Given that the assumption fundementally underpins the theoretical contributions presented, I suggest expanding the discourse to elucidate the quantitative interdependencies between $\epsilon$-representation capacity and these factors. -->
<!-- [W] The paper does not demonstrate how to leverage EMI or the EMID upper bound to guide model optimization (e.g., designing robust training objectives or adaptation strategies). It only uses EMI as a post-hoc evaluation tool. -->
<!-- [Q] The paper states, “In Eq. (4), we show that the autoregressive objective for instruction tuning (Eq. (1)) effectively estimates the lower bound of MI when the model’s representation capacity is sufficiently high.” However, while the term $\delta$ can be omitted under this assumption, I suppose that $H(P_{Y})$ could be significantly large, making Eq. (4) an inaccurate estimate of the MI lower bound -->
<!-- [Q] The computation of EMI relies on pre-trained encoders (e.g., CLIP and XLM-R) for feature extraction, but the paper does not discuss the sensitivity of these encoders to domain shifts. For instance, CLIP may underperform on medical images, leading to distorted EMI estimates. -->
<!-- We confine our attention to visual instruction tuning phase, and focus on the relavance between input query $(X_v,X_t)$ and response $Y$ and the distributional discrepancy between ID and OOD to derive theories. -->
<!-- i.e., $\min_{\theta\in\Theta}\mathbb{E}[D_{KL}(P_{Y|X=x}||P_{\theta}(\cdot|x))] \leq \epsilon$,-->
> _A1. Effect of modality fusion/interaction on generalization_.
* MLLMs commonly undergo a modality alignment phase during training, which may affect generalization, and it is known that modality fusion can reduce the sample complexity to improve generalization [1]!
* As noted in `line 250-254` in our paper, **$I(P_{\mathbf{X}Y})$ can be factorized into $0.5 I(P_{X_{v}Y}) + 0.5 I(P_{X_{t}Y}) + 0.5 I(P_{X_{t}Y|X_v}) + 0.5 I(P_{X_{v}Y|X_t})$ where the conditional MI terms $I(P_{X_{t}Y|X_v})$ and $I(P_{X_{v}Y|X_t})$ encapsulate modality interaction.**
* Based on this factorization, we can define per-modality EMI based on $I(P_{X_{v}Y})$ and $I(P_{X_{t}Y})$, and then derive a new upper bound that is constructed with $I(P_{X_{t}Y|X_v})$ and $I(P_{X_{v}Y|X_t})$ terms to capture the effect of modality interaction. We leave the explicit derivation for future work.
> _A2. Interdependencies between $\epsilon$-representation capacity and the model size, number of training samples, and input dimension_.
The $\epsilon$-representation capacity assumption captures the minimum achievable discrepancy between the true distribution $P_{Y|X}$ and the model's distribution $P_{\theta}(\cdot|X)$. Due to the expectation and $\min$ operator, it does not depend on the training sample size but is mainly influenced by model capability.
Specifically, as models become more expressive—e.g., by increasing model size [5] and leveraging advanced positional encoding [6]—the MLLM approaches a universal approximator of sequence-to-sequence mappings [6,7]; as a result, the minimum expected discrepancy tends to decrease, leading to a smaller $\epsilon$.
We will elucidate this in the next version, thanks!
> _A3. How to leverage EMID upper bound to guide model optimization?_
While our primary focus is on presenting the theoretical framework to quantify the performance gap of MLLMs, we also showcase an application of the EMID upper bound in this rebuttal, a **regularization term for visual instruction tuning**.
Due to the space limit, we have included the setups and results in the `rebuttal to reviewer o1Fb, response A3`. **Please refer to that thread!** As shown there, our instantiation of Theorem 4.5 can indeed be used to optimize the model for improved robustness under shifts.
> _A4. Eq. (4) can be an inaccurate estimate of MI lower bound due to potentially large $H(P_Y)$_.
* In Eq. (4): $I(P_{XY})\geq\mathbb{E}[\log P_{\theta}(y|x)]+H(P_Y)$, maximizing $\mathbb{E}[\log P_{\theta}(y|x)]$ through instruction tuning can be interpreted as learning a parameter $\theta$ that maximizes $I(P_{XY})-H(P_Y)$ rather than solely $I(P_{XY})$ (a short derivation of the inequality is given at the end of this response).
* **We do not claim that the log-likelihood term is a tight lower bound of MI but rather suggest that maximizing $\mathbb{E}[\log P_{\theta}(y|x)]$ can implicitly maximize the MI between input and model's response**. We will revise the paper to make this clear.
* To validate this, we reproduce the visual instruction tuning of LLaVA-v1.5-7B on a 10% subset of data, and show how the empirical estimate [2] of MI evolves during training.
|Step|$\hat{I}$|
|-|-|
|1|0.166|
|5|0.172|
|20|0.182|
|100|0.187|
|200|0.194|
|500|0.197|
* As shown above, visual instruction tuning can effectively maximize MI between input and model response.
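For completeness, the inequality in Eq. (4) follows from the standard variational argument:

$$\mathbb{E}_{P_{XY}}[\log P_{\theta}(y|x)]=-H(P_{Y|X})-\mathbb{E}_{x}\big[D_{\rm KL}\big(P_{Y|X=x}\,\|\,P_{\theta}(\cdot|x)\big)\big]\leq -H(P_{Y|X}),$$

so that $I(P_{XY})=H(P_Y)-H(P_{Y|X})\geq \mathbb{E}[\log P_{\theta}(y|x)]+H(P_Y)$; the gap is exactly the expected KL term, which instruction tuning drives down.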
> _A5. Encoder sensitivity analysis to domain shifts for EMI estimation._
* In **Table 4 and 5 of Appendix**, we already discussed two alternative choices of the encoders, e.g., [3], and showed that _our theoretical claims hold in practice consistently across varying encoders with statistical significance_.
* We further conduct an encoder sensitivity analysis under domain shifts by replicating our experiments on the medical domain with CLIP-ViT-B32 and XLM-RoBERTa encoders.
* Specifically, we use 200 samples from LLaVA-Med [4], split them into three subsets based on embedding distance to COCO images (see the sketch after this list), and translate the English queries into the six languages used in the paper via GPT-4o, inducing 28 shift subsets for the correlation analysis between EMID and its upper bound.
|Model|Pearson $r$|$p$-val|
|-|-|-|
|LLaVA-v1.5-7B|0.93|0.00|
* We see that the correlation between the EMID and upper-bound estimates is very strong, even though medical images and text are relatively underrepresented compared with general-domain objects and text, implying that our theorem robustly holds even on special domains where the encoders may not excel.
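A hypothetical sketch of the distance-based splitting step; the centroid-distance criterion and all names are illustrative assumptions, not our exact protocol:

```python
import numpy as np

def split_by_distance(sample_emb, coco_emb, n_splits=3):
    """Split samples into near/mid/far subsets by embedding distance to COCO.

    sample_emb: (N, D) CLIP image embeddings of the LLaVA-Med samples
    coco_emb:   (M, D) CLIP image embeddings of reference COCO images
    """
    centroid = coco_emb.mean(axis=0)
    dist = np.linalg.norm(sample_emb - centroid, axis=1)
    order = np.argsort(dist)                 # nearest to farthest
    return np.array_split(order, n_splits)   # index sets of the three subsets
```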
> Reference
1. A Theory of Multimodal Learning, Lu 2023
2. A Contrastive Log-ratio Upper Bound of Mutual Information, Cheng et al. 2020
3. Universal Embeddings with Multimodal Large Language Models, Jiang et al. 2024
4. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, Li et al. 2023
5. Scaling Laws for Neural Language Models, Kaplan et al. 2020
6. Your Transformer May Not be as Powerful as You Expect, Luo et al. 2022
7. Transformers are Universal In-context Learners, Furuya et al. 2024
---
### Rebuttal to cWre (Actual Submission)
> _A1. Effect of modality fusion/interaction on generalization_.
* MLLMs commonly undergo a modality alignment phase during training, which may affect generalization, and it is known that modality fusion can reduce the sample complexity to improve generalization [1]!
* As noted in `line 250-254` in our paper, **$I(P_{\mathbf{X}Y})$ can be factorized into $0.5 I(P_{X_{v}Y}) + 0.5 I(P_{X_{t}Y}) + 0.5 I(P_{X_{t}Y|X_v}) + 0.5 I(P_{X_{v}Y|X_t})$ where the conditional MI terms $I(P_{X_{t}Y|X_v})$ and $I(P_{X_{v}Y|X_t})$ encapsulate modality interaction.**
* Based on this factorization, we can define per-modality EMI based on $I(P_{X_{v}Y})$ and $I(P_{X_{t}Y})$, and then derive a new upper bound that is constructed with $I(P_{X_{t}Y|X_v})$ and $I(P_{X_{v}Y|X_t})$ terms to capture the effect of modality interaction. We leave the explicit derivation for future work.
> _A2. Interdependencies between $\epsilon$-representation capacity and the model size, number of training samples, and input dimension_.
The $\epsilon$-representation capacity assumption captures the minimum achievable discrepancy between the true distribution $P_{Y|X}$ and the model's distribution $P_{\theta}(\cdot|X)$. Due to the expectation and $\min$ operator, it does not depend on the training sample size but is mainly influenced by model capability.
Specifically, as models become more expressive—e.g., by increasing model size [5] and leveraging advanced positional encoding [6]—the MLLM approaches a universal approximator of sequence-to-sequence mappings [6,7]; as a result, the minimum expected discrepancy tends to decrease, leading to a smaller $\epsilon$.
We will elucidate this in the next version, thanks!
> _A3. How to leverage EMID upper bound to guide model optimization?_
While our primary focus is on presenting the theoretical framework to quantify the performance gap of MLLMs, we also showcase an application of the EMID upper bound in this rebuttal, a **regularization term for visual instruction tuning**.
Due to the space limit, we have included the setups and results in the `rebuttal to reviewer o1Fb, response A3`. **Please refer to that thread!** As shown there, our instantiation of Theorem 4.5 can indeed be used to optimize the model for improved robustness under shifts.
> _A4. Eq. (4) can be an inaccurate estimate of MI lower bound due to potentially large $H(P_Y)$_.
* In Eq. (4): $I(P_{XY})\geq\mathbb{E}[\log P_{\theta}(y|x)]+H(P_Y)$, maximizing $\mathbb{E}[\log P_{\theta}(y|x)]$ through instruction tuning can be interpreted as learning a parameter $\theta$ that maximizes $I(P_{XY})-H(P_Y)$ rather than solely $I(P_{XY})$.
* **We do not claim that the log-likelihood term is a tight lower bound of MI but rather suggest that maximizing $\mathbb{E}[\log P_{\theta}(y|x)]$ can implicitly maximize the MI between input and model's response**. We will revise the paper to make this clear.
* To validate this, we reproduce the visual instruction tuning of LLaVA-v1.5-7B on a 10% subset of data, and show how the empirical estimate [2] of MI evolves during training.
|Step|$\hat{I}$|
|-|-|
|1|0.166|
|5|0.172|
|20|0.182|
|100|0.187|
|200|0.194|
|500|0.197|
* As shown above, visual instruction tuning can effectively maximize MI between input and model response.
> _A5. Encoder sensitivity analysis to domain shifts for EMI estimation._
* In **Table 4 and 5 of Appendix**, we already discussed two alternative choices of the encoders, e.g., [3], and showed that _our theoretical claims hold in practice consistently across varying encoders with statistical significance_.
* We further conduct an encoder sensitivity analysis under domain shifts by replicating our experiments on the medical domain with CLIP-ViT-B32 and XLM-RoBERTa encoders.
* Specifically, we use 200 samples from LLaVA-Med [4], split them into three subsets based on embedding distance to COCO images, and translate the English queries into the six languages used in the paper via GPT-4o, inducing 28 shift subsets for the correlation analysis between EMID and its upper bound.
|Model|Pearson $r$|$p$-val|
|-|-|-|
|LLaVA-v1.5-7B|0.93|0.00|
* We see that the correlation between the EMID and upper-bound estimates is very strong, even though medical images and text are relatively underrepresented compared with general-domain objects and text, implying that our theorem robustly holds even on special domains where the encoders may not excel.
> Reference
1. A Theory of Multimodal Learning, Lu 2023
2. A Contrastive Log-ratio Upper Bound of Mutual Information, Cheng et al. 2020
3. Universal Embeddings with Multimodal Large Language Models, Jiang et al. 2024
4. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, Li et al. 2023
5. Scaling Laws for Neural Language Models, Kaplan et al. 2020
6. Your Transformer May Not be as Powerful as You Expect, Luo et al. 2022
7. Transformers are Universal In-context Learners, Furuya et al. 2024
---
### Rebuttal to vm1D
<!-- [W] In Figure 1, the authors claim that as shift severity increases, performance degradation worsens. However, part of the result in the text shift is obscured by the legend for LLaVA v1.5, requiring adjustment. Additionally, the text shift results do not fully support this claim, as the win rate does not decrease monotonically with shift severity. For instance, LLaVA NeXT 7B performs better in Korean than Arabic, while LLaVA NeXT 13B shows the opposite trend. -->
<!-- [W] The results from the Spearman correlation analysis and Kendall’s tau analysis require further clarification. It would be helpful to specify the meaning of the correlation values to aid in better understanding their implications. -->
<!-- [W] In Figure 3, combining all four models in one graph makes it hard to grasp the authors’ implications. Additionally, the EMID appears much smaller than its upper bound—e.g., in the synthetic shift graph, EMID ranges from -0.02 to 0.10, while the upper bound spans from 1.1 to 1.8. If not misunderstood, this suggests the upper bound is very loose and offers little constraint on EMID. -->
> _A1. Legend issue in Fig 1. and non-monotonic performance trend in text shifts_.
Thank you for pointing out the visualization issue—we will revise the legend in Fig. 1 to improve clarity and avoid confusion.
Regarding the non-monotonic trend in win rate under text shifts, this behavior arises in part from the inherent stochasticity of win rate computation based on GPT-4 API evaluations, making it fundamentally difficult to observe strictly monotonic trends in practice [1]. Additionally, the x-axis in Fig. 1 is sorted by embedding space distances—computed using CLIP ViT for visual shifts and XLM-RoBERTa for text shifts—which may not always reflect the true degree of distributional shift.
That said, we still observe a meaningful overall relationship between embedding distance and performance degradation, both in the 27 natural shifts presented in Fig. 1 and in the 34 synthetic shifts shown in Fig. 6 of the Appendix. Taking inspiration from this empirical analysis, we derive a much more rigorous framework, i.e., the EMID upper bound (Theorem 4.5), to quantify the performance gap, which consistently shows statistical significance across diverse settings.
> _A2. Clarification for the meaning of Spearman correlation and Kendall’s tau_.
Spearman's $\rho$ and Kendall's $\tau$ are both representative measures of correlation, where the former is preferred for detecting weak correlations and the latter better captures strong correlations in small samples and is more robust to outliers in large samples. Both are standard approaches for measuring the correlation between LLM-judge scores and other metrics [2].
We will add this description in our future version of the manuscript. Thank you for the suggestion.
> _A3. Intricate visualization of Figure 3 and the tightness of EMID upper bound_.
* On the left two panels in Figure 3, we presented the overall relationship between EMID and its upper bound, whereas **we distinguished the four models on the right two panels to show the model-dependent differences in this relationship**. We did so to show that Theorem 4.5 actually differentiates the models themselves through the output entropy of each model $H(P_{\theta}(\cdot|x))$, e.g., LLaVA NeXT shows higher sensitivity to the shifts than LLaVA v1.5, as implied by the larger slope.
* In this work, we do not claim the tightness of the derived upper bound. Moreover, verifying the tightness of the proposed bound can be affected by the choice of estimators for the MI and JSD terms during empirical validation. However, we would like to emphasize that **the bound shows a consistent correlation with statistical significance across 61 distribution shift cases over four different models, meaning that our analytic bound on EMID effectively tracks the performance degradation of MLLMs.** We appreciate your valuable concern, and we will explore devising a much tighter bound in future work.
> Reference
1. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Zheng et al. 2023
2. PROMETHEUS: INDUCING FINE-GRAINED EVALUATION CAPABILITY IN LANGUAGE MODELS, Kim et al. 2024
---
### Rebuttal to vm1D (Actual Submission)
> _A1. Legend issue in Fig. 1. and non-monotonic performance trend in text shifts_.
Thank you for pointing out the visualization issue—we will revise the legend in Fig. 1 to improve clarity and avoid confusion!
* Regarding the non-monotonic trend in win rate under text shifts, this behavior arises in part from the inherent stochasticity of win rate computation based on GPT-4 API evaluations, making it fundamentally difficult to observe strictly monotonic trends in practice [1]. Additionally, the x-axis in Fig. 1 is sorted by embedding space distances—computed using CLIP ViT for visual shifts and XLM-RoBERTa for text shifts—which may not always reflect the true degree of distributional shift.
* That said, we still observe a meaningful overall relationship between embedding distance and performance degradation, both in the 27 natural shifts presented in Fig. 1 and in the 34 synthetic shifts shown in Fig. 6 of the Appendix. Taking inspiration from this empirical analysis, we derive a much more rigorous framework, i.e., the EMID upper bound (Theorem 4.5), to quantify the performance gap, which consistently shows statistical significance across diverse settings.
> _A2. Clarification for the meaning of Spearman correlation and Kendall’s tau_.
* Spearman's $\rho$ and Kendall's $\tau$ are both representative measures of monotonic relationships between two variables; the former is preferred for detecting weak correlations, while the latter better captures strong correlations in small samples and is more robust to outliers in large samples. Both are standard approaches for measuring the correlation between LLM-judge scores and other metrics [2], and both handle variables of different data types, e.g., discrete (win rate) versus continuous (EMI).
* Both coefficients range from -1.0 (perfect negative correlation) to 1.0 (perfect positive correlation), where 0.0 indicates no monotonic relationship between the two variables. For Spearman's $\rho$, values of 0.2-0.4 denote weak correlation, 0.4-0.6 moderate correlation, and 0.6-0.8 and 0.8-1.0 strong and very strong correlations, respectively; Kendall's $\tau$ can be interpreted similarly after multiplying by 1.5, i.e., $\rho \approx 1.5\tau$, to compensate for its relatively smaller scale in practice.
* Our analysis in the paper (Table 2) indicates that the EMI consistently shows moderate or strong correlation with the LLM-judge evaluation metric, win rate, across different types of shifts and model architectures.
We will add this description in our future version of the manuscript. Thank you for the suggestion.
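An illustrative computation of both coefficients with SciPy (toy arrays standing in for per-scenario EMI estimates and win rates, not our actual data):

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(0)
emi = rng.uniform(0.1, 0.3, size=35)               # toy per-scenario EMI estimates
win_rate = 60 - 80 * emi + rng.normal(0, 1.5, 35)  # toy win rates, negatively related

rho, p_rho = spearmanr(emi, win_rate)
tau, p_tau = kendalltau(emi, win_rate)
print(f"Spearman rho={rho:.2f} (p={p_rho:.3f}), Kendall tau={tau:.2f} (p={p_tau:.3f})")
# Rule of thumb above: |rho| 0.4-0.6 moderate, 0.6-0.8 strong; rho ~= 1.5 * tau.
```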
> _A3. Intricate visualization of Figure 3 and the tightness of EMID upper bound_.
* On the left two panels in Figure 3, we presented the overall relationship between EMID and its upper bound, whereas **we distinguished the four models on the right two panels to show the model-dependent differences in this relationship**. We did so to show that Theorem 4.5 actually differentiates the models themselves through the output entropy of each model $H(P_{\theta}(\cdot|x))$, e.g., LLaVA NeXT shows higher sensitivity to the shifts than LLaVA v1.5, as implied by the larger slope.
* In this work, we do not claim the tightness of the derived upper bound. Moreover, verifying the tightness of the proposed bound can be affected by the choice of estimators for the MI and JSD terms during empirical validation. However, we would like to emphasize that **the bound shows a consistent correlation with statistical significance across 61 distribution shift cases over four different models, meaning that our analytic bound on EMID effectively tracks the performance degradation of MLLMs.** We appreciate your valuable concern, and we will explore devising a much tighter bound in future work.
> Reference
1. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Zheng et al. 2023
2. PROMETHEUS: INDUCING FINE-GRAINED EVALUATION CAPABILITY IN LANGUAGE MODELS, Kim et al. 2024