# Reviewer CojG
"Thank you for your response. The response has addressed most of my concerns. I will improve my scores accordingly. For further communication, could you report the performance of your method and BERT on more difficult datasets, such as MMLU or coding genreation, or some other datasets to justify the generalization of your framework. You can just sample some data for evaluation instead of evaluation on all data considering the ddl of response."
## Response:
We are truly delighted that our response has resolved the concerns you had. Moreover, the points you raised gave us the opportunity to reflect and discuss further, which enriched our research.
Before our detailed response, we apologize for the slight delay caused by the ongoing experiments and kindly ask for your understanding.
We are also pleased to address your additional questions. Following your suggestion, we applied CCL to the MMLU task and additionally evaluated BERT as the language model for inference.
- We adopted MMLU's common practice of using a fixed 5-shot setting for the few-shot size. Specifically, we used the "lukaemon/mmlu" dataset available on Hugging Face. This dataset consists of QA pairs across 57 diverse domains, with samples from each domain's train set and validation set available for use as few-shot examples. The train set contains the samples used for the fixed 5-shot setting, while the validation set contains around 30 samples per domain on average (though the exact number varies by domain).
- We combined the train and validation sets to train CCL; that is, we trained a single VAE over the MMLU task setting with 57 environments. Afterward, for each input query, we collected 5 examples at the $c$-embedding level, regardless of domain. This approach follows from the fact that CCL is designed to identify the intent of a query and retrieve examples that help task performance, irrespective of the domain. (We also retrieved 5 examples using the $x$ embedding and refer to the results of this approach as ICL.) A minimal sketch of the two retrieval variants is given below.
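For concreteness, below is a minimal sketch of the two retrieval variants described above. The helpers `encode_x` (a surface embedding of the raw query) and `encode_c` (the VAE latent embedding used by CCL) are hypothetical placeholders, not our actual implementation.

```python
import numpy as np

def top_k_examples(query_emb, pool_embs, k=5):
    """Return indices of the k most similar pool examples by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    return np.argsort(-(p @ q))[:k]

def select_shots(query, pool_texts, encode_x, encode_c, mode="ccl", k=5):
    """Pick k few-shot examples from the combined train+validation pool (all 57 domains)."""
    encode = encode_c if mode == "ccl" else encode_x   # c-level (intent) vs. x-level (surface)
    q_emb = encode(query)
    pool_embs = np.stack([encode(t) for t in pool_texts])
    idx = top_k_examples(q_emb, pool_embs, k)
    return [pool_texts[i] for i in idx]
```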
Table 5. MMLU performance comparison (base model: Phi4-mini-IT)

| Method | Avg. Acc. (%) |
| --- | --- |
| ZS (zero-shot) | 60.48 |
| Few-shot (fixed 5-shot) | 61.37 |
| ICL ($x$-level retrieval) | 61.37 |
| CCL ($c$-level retrieval) | **61.52** |
- Table 5 shows the accuracy on MMLU when using Phi4-mini-IT as the base model. We observe that CCL shows a slight performance improvement compared to other baselines.
- We also evaluated BERT as the base model; the results are reported in Table 6.
Table 6. MMLU performance comparison with BERT as the base model

| Method | Avg. Acc. (%) |
| --- | --- |
| ZS (zero-shot) | 23.11 |
| Few-shot (fixed 5-shot) | 23.17 |
| ICL ($x$-level retrieval) | 23.14 |
| CCL ($c$-level retrieval) | **23.31** |
- With BERT, the effect of few-shot in-context learning is less pronounced than with decoder-based models (e.g., Phi4-mini-IT), but we still observe a slight performance improvement from CCL. One possible reason for the limited effect is BERT's maximum sequence length of 512 tokens, which may cause the 5 examples to be truncated rather than fully included in the input (a small truncation check is sketched after this list).
- In summary, as you suggested, we conducted experiments using the more challenging MMLU benchmark and the encoder-based BERT model. The MMLU experiments aimed to test CCL on a more difficult task, while using BERT as the base model was intended to examine whether the examples selected by CCL perform well regardless of the underlying model architecture.
- A common finding across the two experiments is that CCL consistently demonstrated slight but stable performance improvements over the baselines. These results support the robustness and domain-agnostic applicability of CCL in enhancing few-shot in-context learning.
- We note that these experiments were carried out within the limited time of the discussion period, so further experiments and more detailed analyses are still needed. We will address these points in more detail in the revised version of the manuscript.
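To illustrate the truncation issue mentioned above, here is a small check of whether a 5-shot prompt exceeds BERT's 512-token limit; the prompt format and the `bert-base-uncased` checkpoint are illustrative assumptions, not our exact setup.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def fits_in_bert(shots, query, max_len=512):
    """Return (fits, n_tokens) for the concatenated few-shot prompt plus query."""
    prompt = "\n\n".join(shots + [query])
    n_tokens = len(tokenizer(prompt)["input_ids"])
    return n_tokens <= max_len, n_tokens

# If the prompt overflows, trailing examples are truncated away, which is one
# plausible reason few-shot in-context learning helps BERT less than decoder LMs.
ok, n = fits_in_bert(["Q: <example question>\nA: <example answer>"] * 5,
                     "Q: <target question>\nA:")
print(ok, n)
```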
We truly appreciate your thoughtful and helpful review.
# Reviewer 2GL1
“I appreciate the analyses, but my concern remains: a single‐layer VAE—even with dual reconstruction—still oversimplifies language’s deeply entangled, context-dependent causal factors. Without broader benchmarks or formal guarantees, it’s unclear whether CCL truly learns causal structure rather than surface correlations.”
## Response:
We appreciate the concern regarding the expressivity of a VAE-based latent space and reconstruction-driven learning when modeling language’s deeply entangled, context-dependent causal factors. We acknowledge this limitation in the paper and will expand on it.
- Our framework assumes an invariant causal mechanism $p_\theta(y\mid c)$ and the conditional independence $y \perp (x,t,e,s)\mid c$, and our theoretical analysis is developed under a linear-causal approximation.
- In the paper, we derive the ELBO objective of CCL to reflect the underlying data-generating process and then empirically verify its effectiveness on natural-language tasks. To examine whether CCL remains beneficial when more complex, multi-step reasoning is required, we additionally evaluated CCL during the discussion period on HotpotQA, a multi-hop QA benchmark that requires integrating evidence across documents via stepwise reasoning.
- The HotpotQA experimental setting is as follows. We first assume that, for each target query, an appropriate document is already provided. This assumption is made because CCL is not a retrieval method like RAG, and we therefore aim to exclude the influence of the retriever.
- We further treat each example collected at the $c$-level or $x$-level, together with its corresponding document, as a single shot (a minimal prompt-construction sketch is given below). The purpose of this setting is to test whether, compared with examples selected solely on surface-level similarity, providing the model with examples whose questions share a similar intent helps it learn a more effective information-processing procedure for deriving answers from the given document.
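For clarity, a minimal sketch of how each retrieved example and its corresponding document are packed into a single shot for HotpotQA; the field names and prompt template are illustrative assumptions, not our exact implementation.

```python
def build_hotpotqa_prompt(shots, target_doc, target_question):
    """shots: list of (document, question, answer) triples selected at the
    c-level (CCL) or x-level (ICL); the target's gold document is given,
    so no retriever is involved."""
    parts = [
        f"Document: {doc}\nQuestion: {q}\nAnswer: {a}"
        for doc, q, a in shots
    ]
    parts.append(f"Document: {target_doc}\nQuestion: {target_question}\nAnswer:")
    return "\n\n".join(parts)
```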
Table 3. Accuracy comparison on HotpotQA with Llama-3.2-3B-IT

| Method | Acc. (%) |
| --- | --- |
| ZS (zero-shot) | 80.86 |
| ICL ($x$-level retrieval) | 81.00 |
| CCL ($c$-level retrieval) | **82.29** |
- Table 3 shows that CCL yields higher accuracy than zero-shot and ICL on HotpotQA, supporting the effectiveness of CCL for hierarchical or composite language-understanding problems.
- These additional experiments were run during the brief discussion period and are necessarily limited in scope, yet the results are encouraging. In the revised version, we will incorporate today’s clarifications and the HotpotQA findings.
Your review not only helped us clarify the direction of this work, but the experiments conducted during the discussion period also suggest clear room to further strengthen our method. We hope this helps address the concerns you raised. Thank you again for your constructive review.