# ICML 2024 Rebuttal (CODA)

## Summary of Author-Reviewer Discussion

We thank all the reviewers and Area Chairs for their efforts and time in evaluating our work. After the rebuttal, we are encouraged to see that two of the reviewers raised their scores (`9fWD` raised to `6`, and `o1Qs` raised to `5`), while the remaining reviewer maintained the positive assessment (`NBG4` at `5`). We are pleased to note that our paper has received a cumulative score of `655` post-rebuttal.

In response to the concerns raised by the reviewers, we believe our rebuttal effectively addressed all of them since **no more concerns or limitations were raised**. Below are the keynotes of our discussions with each reviewer:

- Discussion with Reviewer `9fWD`:
  - **Clarification for the loss function designs**: We clarify the loss designs of the Correlation Predictor $H(\cdot)$ and Data Simulator $G(\cdot)$ with supplementary experiments to showcase the detailed results. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=yCWc26U6hL)
  - **Feasibility of CODA for high-dimensional data**: We clarify the feasibility of CODA for high-dimensional data and point out the results in our paper. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=9jeAj14kNp)
  - **Impact of distribution shift intensity**: We include supplementary experiments on different shift intensities to demonstrate the effectiveness of CODA under different distribution shift intensities. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=YrVBMSEhaL)
- Discussion with Reviewer `NBG4`:
  - **Detailed preliminary experiments**: We provide detailed experimental results in Section 3.1. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=WcXCTC7BE0)
  - **Benefits of using CODA compared with existing works**: We discuss the differences between existing works and CODA to showcase our motivation, novelty, and advantages. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=QazEOEA1V4)
  - **Synthetic and real-world concept drifts are considered**: We detail the various concept drift patterns we've considered in our paper. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=wRJIiQmAlG)
- Discussion with Reviewer `o1Qs`:
  - **Clarification for the loss function designs**: We clarify the three regularization terms in the designed loss with supplementary experiments to showcase their effectiveness. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=maoX5RRJ7m)
  - **Effectiveness of the predicted correlation matrices**: We provide supplementary experiments to further illustrate the effectiveness of CODA. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=Y5JCLnHjBO)
  - **Computational complexity of CODA**: We point out the discussion of computational complexity in our paper and provide supplementary experiments to demonstrate the feasibility of CODA. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=jMrzLUywXJ)

Again, we thank all the Area Chairs and the reviewers for their insightful comments and helpful feedback. It is our pleasure to improve the quality of this work with their guidance.

---

## General Comments for All Reviewers

We thank all reviewers for their constructive comments and helpful feedback. We are pleased that they find our paper **well-written** (9fWD and NBG4), our approach **novel and meaningful** (NBG4 and o1Qs), our method **theoretically sound** (9fWD and o1Qs), and **the experiments well-established and effective** (9fWD, NBG4, and o1Qs).

To address your primary concerns, we have done our best to extend the work with additional experiments and to reply to your concerns and suggestions with more clarification and discussion. We propose a model-agnostic framework that tackles the root cause of concept drift by generating future data for model training. The generated training data provides flexibility and transferability for exploring different architecture types. Experimental results reveal that different model architectures can be effectively trained on the generated data. Our responses are summarized as follows:

- (9fWD, o1Qs) We clarify the loss designs of the Correlation Predictor $H(\cdot)$ and Data Simulator $G(\cdot)$ with supplementary experiments to showcase the detailed results.
- (9fWD, NBG4) We detail the various concept drift patterns that we've considered.
- (9fWD, o1Qs) We clarify the feasibility of CODA for high-dimensional data and point out the results in our paper.
- (NBG4) We provide detailed experimental results in Section 3.1.
- (NBG4) We discuss the differences between existing works and CODA to showcase our motivation and novelty.
- (NBG4, o1Qs) We provide dedicated discussions with experimental results to clarify the potential concerns and limitations in effectiveness and efficiency.

We appreciate all of the suggestions made by the reviewers to enhance our work. We are delighted to receive your feedback and eagerly anticipate addressing any follow-up questions you may have.

---

## Reviewer 9fWD

We sincerely appreciate the reviewer's time and effort in reviewing our paper. We thank the reviewer for the constructive comments and for recognizing the effectiveness of our work.

**[Q1-1]: Clarification for the loss of Data Simulator $G(\cdot)$**

**[AW1-1]:** In Eq.(4) and Eq.(5), $\mathcal{D}_T$ represents the dataset at time domain $T$. The loss of the Data Simulator, $\mathcal{L}_{G}$ (Eq.(5)), has two parts with different purposes:

- $ELBO$ is the classic reconstruction loss for learning the probability distribution of a given dataset. In $\mathcal{L}_{G}$, the given dataset is the latest domain's dataset $\mathcal{D}_T$ (the dataset at time $T$).
- $\mathcal{R}_C(\mathcal{\hat{C}}_{T+1}) = \| \mathcal{\hat{C}}_{G} - \mathcal{\hat{C}}_{T+1}\|_{1}$ is a designed regularization term that makes $G(\cdot)$ learn the given future correlation matrix $\mathcal{\hat{C}}_{T+1}$, where $\mathcal{\hat{C}}_{T+1}$ is predicted by $H(\cdot)$ in the previous stage, and $\mathcal{\hat{C}}_{G}$ is the correlation matrix of the CODA-generated dataset.
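
To make the structure of $\mathcal{L}_{G}$ concrete, here is a minimal PyTorch-style sketch of the regularization term; the tensor names (`x_generated`, `c_hat_next`), the `elbo_term` input, and the weighting coefficient are illustrative assumptions rather than the actual CODA implementation.

```python
import torch

def correlation_matrix(x: torch.Tensor) -> torch.Tensor:
    """Feature-feature correlation matrix of a batch x of shape (n_samples, n_features)."""
    # torch.corrcoef treats each row as a variable, so transpose the batch first.
    return torch.corrcoef(x.T)

def simulator_loss(elbo_term: torch.Tensor,
                   x_generated: torch.Tensor,
                   c_hat_next: torch.Tensor,
                   lambda_c: float = 0.5) -> torch.Tensor:
    """Illustrative sketch of the Eq.(5)-style objective: ELBO term + lambda_c * R_C.

    elbo_term   -- reconstruction/ELBO loss already computed by the VAE part of G(.) (assumed given)
    x_generated -- a batch of samples produced by G(.)                               (hypothetical)
    c_hat_next  -- correlation matrix predicted by H(.) for domain T+1
    """
    c_generated = correlation_matrix(x_generated)          # \hat{C}_G in the rebuttal
    reg = torch.sum(torch.abs(c_generated - c_hat_next))   # element-wise l1 distance
    return elbo_term + lambda_c * reg
```
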
**[Q1-2]: How many data samples does the Data Simulator generate for the future domain?**

**[AW1-2]:** The number of generated data samples is **a controllable hyperparameter**, and the default value is the same as the number of samples in $\mathcal{D}_T$. The dataset details are in **Appendix C**. In our main experimental results, we use the default value during the training and inference stages. Furthermore, **for the inference stage, we have investigated the impact of the generated sample size on the performance of models trained with the generated data**. As shown in **Figure 3**, increasing the sample size reduces performance variances for both classification and regression tasks because a larger dataset more accurately represents the data distribution learned by the Data Simulator $G(\cdot)$.

**[Q1-3]: Does one sample in domain $T$ correspond to one sample in domain $T+1$?**

**[AW1-3]:** **No**. The number of samples in different domains is not necessarily the same. In our setting and most real-world scenarios, **we do not have a sample index for each data instance across time domains**, so we cannot treat each instance as a time series and model its temporal evolution pattern. This missing index is one of the challenges that distinguishes our problem from sequence analysis. Therefore, we propose to capture the trend of the data distribution along time domains and generate the future dataset/samples.

**[Q2]: Is it possible to use this method on language or image datasets?**

**[AW2]:** We have conducted experiments on image data.

- **Image datasets**: We conducted experiments on **image datasets, as shown in Table 4** and **Appendix G**. Three models are trained on the generated datasets, and they outperform other baselines. Specifically, for the high-dimensional data, we use the same encoder structure as the baseline method LSSAE (MNIST ConvNet) to save computational costs.
- **Language datasets**: "Concept drift" is defined as a change in the joint distribution between the input $x$ and the target variable $y$. However, it is extremely difficult to define and capture concept drift in natural language at the current stage. For example, the relationship between a question and its answer involves not only syntax but also semantic meaning: a "correct" answer might not share exact words with the question but still convey the appropriate knowledge. One possible solution is to leverage a Retrieval-Augmented Generation **(RAG)** framework and regularly update the corpus **to address the knowledge-outdating problem**. We will add more discussion on natural language processing in the revised manuscript.

**[Q3]: In some data domains, distribution shifts are not smooth and continuous. The model may not perform well in this situation.**

**[AW3]:** We agree that distribution shift intensity is an important factor in this problem. Here, we conduct experiments on different shift intensities for CODA on a synthetic dataset (**2-Moons**), and the results are shown below. In the 2-Moons dataset, there are 10 domains, where domain $i$ undergoes a rotation of $18i$°, as described in Appendix C.

To evaluate CODA under different distribution shift intensities, we conduct experiments with two additional settings and compare them with the original setting:

- **Original:** Source domains: $i = [1, 2, 3, 4, 5, 6, 7, 8, 9]$; test domain: $i = [10]$
- **Setting 1:** Source domains: $i = [2, 4, 6, 8]$; test domain: $i = [10]$
- **Setting 2:** Source domains: $i = [1, 4, 7]$; test domain: $i = [10]$

According to the rotation between two consecutive domains, the ranking of distribution shift intensity is **Original (18°) < Setting 1 (36°) < Setting 2 (54°)**, and the results are shown below:

| 2-Moons    | Original (18°) | Setting 1 (36°) | Setting 2 (54°) |
|:----------:|:--------------:|:---------------:|:---------------:|
| CODA (MLP) | 2.3 $\pm$ 1.0  | 3.1 $\pm$ 1.2   | 3.9 $\pm$ 0.8   |

We can observe that the difficulty of simulating future datasets increases as the distribution shift intensity rises. Furthermore, in real-world data, the distribution shift intensity is unknown. Therefore, in our future work, we will include more discussion on shift intensity detection and on methods for high shift intensity.
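
For reference, the rotated 2-Moons domains described above can be reproduced with a few lines; this is a minimal sketch assuming scikit-learn's `make_moons`, with illustrative sample counts and noise levels rather than the exact settings of Appendix C.

```python
import numpy as np
from sklearn.datasets import make_moons

def make_rotated_moons(domain_idx: int, n_samples: int = 500, noise: float = 0.1):
    """Domain i of the 2-Moons benchmark: the base dataset rotated by 18 * i degrees."""
    X, y = make_moons(n_samples=n_samples, noise=noise, random_state=domain_idx)
    theta = np.deg2rad(18 * domain_idx)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return X @ rotation.T, y

# Original setting uses source domains 1-9 and test domain 10;
# Setting 1 keeps only domains [2, 4, 6, 8], Setting 2 keeps [1, 4, 7].
source_domains = {i: make_rotated_moons(i) for i in [1, 2, 3, 4, 5, 6, 7, 8, 9]}
test_domain = make_rotated_moons(10)
```
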
---

## Reviewer NBG4

We sincerely appreciate the reviewer's time and effort in reviewing our paper. We thank the reviewer for the constructive comments and for recognizing the effectiveness of our work.

**[Q1]: Results of other datasets in Section 3.1**

**[AQ1]:** We only show the results on part of the datasets due to the paper length limitation. Here, we provide the results of Sec. 3.1 on the other datasets in the table below:

| Algorithm     | 2-Moons        | Shuttle       | Appliance      |
|:-------------:|:--------------:|:-------------:|:--------------:|
| GI$^{[2]}$    | 3.5 $\pm$ 1.4  | 7.0 $\pm$ 0.1 | 8.2 $\pm$ 0.6  |
| DRAIN$^{[1]}$ | 3.2 $\pm$ 1.2  | 7.4 $\pm$ 0.3 | 6.4 $\pm$ 0.4  |
| Prelim-LSTM   | 15.2 $\pm$ 2.0 | 8.1 $\pm$ 1.6 | 10.0 $\pm$ 0.4 |

We did not run our Prelim-LSTM on the ONP dataset because previous research has indicated that it exhibits relatively weak concept drift, as we mentioned in Sec. 4.2. We will add this discussion in a footnote of the main text.

**[Q2-1]: Aren't the error in simulating future data and the error in prediction accumulated?**

**[AQ2-1]:** We agree on the error accumulation point, which is important for enhancing quality and reducing model prediction error. CODA can minimize the domain generalization errors from two perspectives:

- When concept drift occurs, prediction errors are mainly caused by **significant changes in the joint distribution**. In other words, models trained on an outdated joint distribution would suffer a significant performance drop in future domains. To mitigate this issue, CODA aims to **effectively learn the future joint distribution** by representing the source domain datasets as correlation matrices. As our **theoretical analysis in Sec. 3.4 and Appendix A** shows, correlation matrices are guaranteed to represent the joint distribution under certain assumptions that can be easily satisfied in real-world scenarios.
- From the perspective of prediction model training, a performance gap inevitably exists between the training and testing sets. Nevertheless, as a **model-agnostic** framework, CODA provides **flexibility in selecting the best-performing model architecture**. Benefiting from this flexibility, CODA can **minimize the errors** accumulated from **sub-optimal model architectures**.

Our experimental results demonstrate that CODA tackles concept drift well and **outperforms** the existing methods on both synthetic and real-world datasets.

**[Q2-2]: What is the benefit of using CODA compared with some mentioned existing works [1][2][3]?**

**[AQ2-2]:** We thank the reviewer for the suggestion, and we will include a discussion of the papers mentioned by the reviewer in our next version. Here, we provide the discussion below:

- **[1] & [2]**
  - There are two advantages of using CODA compared with [1] & [2]. First, **our proposed CODA captures the long-term temporal trends from all the historical data points**, while the existing works [1] & [2] only consider nearby consecutive time points, i.e., the immediate past and the current time point. Second, CODA can capture more complex or cyclic temporal trends by considering a whole picture of multiple data distributions at different time points, while the existing works assume a linear changing trend with a subtle distribution shift on $x$.
- **[3]**
  - CODA is a **model-agnostic framework** that **parallels** existing model-centric works and can be used with various model architectures. In contrast, [3] is a model-centric approach that fine-tunes pre-trained model weights, which limits its applicability to other model architectures. [3] is proposed to tackle concept drift issues similar to existing works, such as GI and DRAIN discussed in our paper.

**[Q3]: Are many samples necessary in each domain?**

**[AQ3]:** Yes, it is important to collect enough samples to sufficiently represent a data distribution. Therefore, to accurately capture the underlying feature correlation and train the Correlation Predictor $H(\cdot)$, we use all the samples in the training domains; the details can be found **in Appendix C**.
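
To illustrate why each domain needs adequate samples, the per-domain correlation matrices fed to $H(\cdot)$ can be estimated as in the NumPy sketch below; treating the target as an extra column alongside the features is an assumption made here for illustration, not necessarily the exact construction used in the paper.

```python
import numpy as np

def domain_correlation(X_t: np.ndarray, y_t: np.ndarray) -> np.ndarray:
    """Empirical Pearson correlation matrix for one time domain.

    X_t has shape (n_samples, n_features) and y_t has shape (n_samples,).
    With too few samples the empirical correlations become noisy estimates,
    which is why adequate data per domain is needed.
    """
    joint = np.column_stack([X_t, y_t])       # append the target as an extra column
    return np.corrcoef(joint, rowvar=False)   # shape: (n_features + 1, n_features + 1)

# One matrix per training domain, e.g. as the input sequence for the Correlation Predictor H(.):
# correlation_sequence = [domain_correlation(X_t, y_t) for (X_t, y_t) in training_domains]
```
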
**[Q4]: What happens if in a dataset the change between tasks is a random number each time? In other words, the change is not always the same.**

**[AQ4]:** Random changes among time domains can be categorized as "abrupt change". Please see the detailed discussion below. We have already considered various concept drift patterns on both synthetic and real-world datasets as follows:

- **Synthetic concept drifts**:
  1. Cyclical change: the 2-Moons dataset is built with a cyclical concept drift pattern, and we also conduct experiments on the **Rot-MNIST** and **Sine** datasets, as shown in Table 4.
  2. Abrupt change: this scenario is usually not considered by domain generalization since the joint distributions of the source and target domains may be significantly different.
- **Real-world concept drifts**: The real-world datasets used in our experiments feature various and unknown patterns of concept drift. They cover diverse realistic temporal trends, such as electricity demand changes (Elec2), space shuttle defects (Shuttle), and appliance energy usage changes (Appliance). More discussion can be found in **Sec. 4.2, Table 4, and Appendix G**.

**[Q5]: What happens if the assumption that the distribution changes does not hold and the distribution remains the same? Can the method simulate future data?**

**[AQ5]:** **Yes**. It is a simplified case of our proposed method. When there is no concept drift, the correlation matrices in all time domains are the same ($\mathcal{C}_1 = \mathcal{C}_2 = \dots = \mathcal{C}$). In this case, $H(\cdot)$ can easily approximate a similar future correlation matrix $\mathcal{\hat{C}}_{T+1}$ for $G(\cdot)$ to generate a similar data distribution, i.e., $\mathcal{\hat{C}}_{T+1} \approx \mathcal{C}$, where $\mathcal{\hat{C}}_{T+1} = H(\mathcal{C}, \dots, \mathcal{C})$. Here, we conduct supplementary experiments to showcase the efficacy of CODA under the no-distribution-shift (i.i.d.) scenario, where we split the test domain into training, validation, and test sets $(6:2:2)$. The results on the test sets of three datasets are shown below:

| I.I.D. Scenario | 2-Moons       | Elec2         | Appliance     |
|:---------------:|:-------------:|:-------------:|:-------------:|
| CODA (MLP)      | 0.0 $\pm$ 0.0 | 4.2 $\pm$ 0.4 | 2.1 $\pm$ 0.2 |

[1] Pentina, Anastasia & Lampert, Christoph H., "Lifelong Learning with Non-i.i.d. Tasks," NeurIPS 2015 (https://dl.acm.org/doi/10.5555/2969239.2969411)

[2] Álvarez, Verónica, et al., "Minimax Forward and Backward Learning of Evolving Tasks with Performance Guarantees," NeurIPS 2023 (https://papers.nips.cc/paper_files/paper/2023/hash/cf4114c34a2b93019aa6e70f99680fae-Abstract-Conference.html)

[3] Zhao, Peng, et al., "Handling Concept Drift via Model Reuse," Special Issue of the ACML 2019 Journal Track (https://dl.acm.org/doi/abs/10.1007/s10994-019-05835-w)

---

## Reviewer o1Qs

We sincerely appreciate the reviewer's time and effort in reviewing our paper. We thank the reviewer for the constructive comments and for recognizing the effectiveness of our work.

**Q1: Clarification for Sec. 3.2 and 3.3 (in Q1-1 to Q1-3 below).**

**[Q1-1]: How can the Cross-Entropy loss be computed using the two correlation matrices (which are not probability distributions) in Eq.(3)?**

**[AQ1-1]:** The core idea of $\mathcal{L}_{CE}$ is as follows:

- Unlike the $\ell_1$-norm and $\ell_2$-norm, which calculate the errors in each element of the correlation matrix $\mathcal{\hat{C}}_t$, we leverage $\mathcal{L}_{CE}$ to measure how well the distribution within $\mathcal{\hat{C}}_t$ matches the ground-truth distribution of $\mathcal{C}_t$. To do so, in $\mathcal{L}_{CE}$, we first normalize the values in $\mathcal{C}$ to the range between 0 and 1, so that each entry can be seen as the probability that two features are correlated.
- The similarity between the two resulting probability matrices $\mathcal{\hat{C}}$ and $\mathcal{C}$ can then be measured by the KL divergence. We simplify this to $\mathcal{L}_{CE}$ because the cross-entropy equals the KL divergence up to a constant, where the constant is the entropy of the ground truth $\mathcal{C}$ (see the identity below).
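
For completeness, this simplification relies on the standard decomposition of cross-entropy into KL divergence plus entropy; written for a single normalized entry $c \in [0, 1]$ of $\mathcal{C}_t$ and its prediction $\hat{c} \in (0, 1)$, viewed as Bernoulli parameters:

$$
-\,c \log \hat{c} - (1 - c)\log(1 - \hat{c})
= \mathrm{KL}\big(\mathrm{Ber}(c)\,\|\,\mathrm{Ber}(\hat{c})\big) + \mathrm{H}\big(\mathrm{Ber}(c)\big),
$$

where $\mathrm{H}(\mathrm{Ber}(c))$ does not depend on $\hat{c}$, so minimizing $\mathcal{L}_{CE}$ is equivalent to minimizing the KL divergence.
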
To clarify the concept of the loss design, we will include the above description in the next version.

**[Q1-2]: Why do the authors consider the $\ell_2$-norm, $\ell_1$-norm, and cross-entropy loss simultaneously in Eq.(3)? There is a lack of intuitive explanation for this. Wouldn't it be enough to apply just one of the three?**

**[AQ1-2]:** The three regularization terms in Eq.(3) are explained as follows:

- The $\ell_1$-norm encourages sparsity in the predicted $\mathcal{\hat{C}}_t$: it effectively "zeroes out" less important feature correlations, since the correlation matrices are generally sparse (as shown in Figure 9 of Appendix F).
- The $\ell_2$-norm ensures smoothness and penalizes large deviations in the elements of the predicted correlation matrix $\mathcal{\hat{C}}_t$, promoting stability in the feature correlations.
- The cross-entropy loss $\mathcal{L}_{CE}$ measures how well the distribution of $\mathcal{\hat{C}}_t$ matches the ground-truth distribution of $\mathcal{C}_t$.

Based on our experiments, the best-performing Correlation Predictor $H(\cdot)$ is optimized with the three regularization terms in Eq.(3) simultaneously, as shown in the table below (MSE between $\mathcal{\hat{C}}_{T+1}$ and $\mathcal{C}_{T+1}$):

| Objective Loss           | 2-Moons    | Elec2     | Shuttle    | Appliance  |
|:------------------------:|:----------:|:---------:|:----------:|:----------:|
| $\ell_1$-norm            | 0.0026     | 0.539     | 0.0370     | 0.0111     |
| $\ell_2$-norm            | 0.0025     | 0.533     | 0.0356     | 0.0104     |
| $\mathcal{L}_{CE}$       | 0.0029     | 0.564     | 0.0389     | 0.0126     |
| $\ell_1$ + $\ell_2$-norm | 0.0025     | 0.531     | 0.0351     | 0.0102     |
| All (Eq.(3))             | **0.0021** | **0.527** | **0.0341** | **0.0096** |

As shown in the above table, the error between the predicted $\mathcal{\hat{C}}_{T+1}$ and the ground truth $\mathcal{C}_{T+1}$ is minimal (as shown in Figure 10 in the Appendix).
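
To show how the three terms can be combined in practice, below is a minimal PyTorch-style sketch of an Eq.(3)-style objective; the term weights and the specific $[0, 1]$ normalization of the matrices are illustrative assumptions, not the exact choices of the paper.

```python
import torch
import torch.nn.functional as F

def predictor_loss(c_pred: torch.Tensor, c_true: torch.Tensor,
                   w_l1: float = 1.0, w_l2: float = 1.0, w_ce: float = 1.0) -> torch.Tensor:
    """Illustrative combination of l1, l2, and cross-entropy terms for H(.) (Eq.(3)-style).

    c_pred, c_true -- predicted and ground-truth correlation matrices for domain t.
    The weights w_* are placeholders; the paper's actual coefficients may differ.
    """
    l1 = torch.sum(torch.abs(c_pred - c_true))   # sparsity: zero out weak correlations
    l2 = torch.norm(c_pred - c_true)             # smoothness: penalize large deviations
    # Map correlations from [-1, 1] to [0, 1] so entries read as "probability of correlation"
    # (the mapping itself is an illustrative assumption).
    p_true = (c_true + 1.0) / 2.0
    p_pred = ((c_pred + 1.0) / 2.0).clamp(1e-6, 1 - 1e-6)
    ce = F.binary_cross_entropy(p_pred, p_true)  # distribution match (CE = KL + constant)
    return w_l1 * l1 + w_l2 * l2 + w_ce * ce
```
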
**[Q1-3]: Definitions of $z$ and $p(z)$ in Eq.(5).**

**[AQ1-3]:** $z$ and $p(z)$ belong to the standard ELBO loss in Eq.(5), where $z$ denotes the latent variables produced by the encoder of the VAE model and $p(z)$ is the prior distribution of $z$, modeled as a standard multivariate Gaussian (zero mean and unit variance). The Gaussian assumption is supported by theoretical foundations (e.g., its connection to the Central Limit Theorem) and provides computational convenience.

**[Q2]: As shown in Fig. (5), the performance with $\lambda_c = 0$ is similar to that of cases with $\lambda_c \neq 0$, suggesting that the proposed method based on feature correlation may have limited effectiveness.**

**[AQ2]:** In fact, **there is no $\lambda_c = 0$ result in Fig.(5)**; the left-most points refer to $\lambda_c = 0.1$. The performance with $\lambda_c = 0$ is shown in Sec. 4.4 (Ablation Study) and Table 2, and it is similar to the baseline "LastDomain" in Table 1. $\lambda_c = 0$ means the future data $\mathcal{D}_{T+1}$ is generated based only on the last domain $\mathcal{D}_T$ without any other information. As shown in Table 2, the effectiveness of integrating $\mathcal{\hat{C}}_{T+1}$ is significant. Note that ONP does not show concept drift, which has been shown by other work (as described in Sec. 4.2). For more detailed experimental results from Fig.(5), we provide the performance with $\lambda_c = [0.1, 0.3, 0.5, 0.7, 0.9]$ in the table below:

| Dataset   | $\lambda_c = 0.1$ | $\lambda_c = 0.3$ | $\lambda_c = 0.5$ | $\lambda_c = 0.7$ | $\lambda_c = 0.9$ |
|:---------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|
| Elec2     | 12.4              | 12.2              | 11.9              | 11.7              | 11.6              |
| ONP       | 37.2              | 37.2              | 37.3              | 37.4              | 37.6              |
| Appliance | 4.59              | 4.58              | 4.56              | 4.55              | 4.54              |

**[Q3]: It seems necessary to compare the computational complexity with that of other algorithms.**

**[AQ3]:** The computational complexity **has been discussed in Appendix H and G**. To improve the computational efficiency, we adopt a naive solution that encodes a high-dimensional dataset (such as Rot-MNIST) into a low-dimensional latent space and then incorporates it into CODA. The encoder structure is the same as the baseline method LSSAE (MNIST ConvNet). Three model architectures are trained, and the results show that CODA outperforms other baselines, as shown in **Table 4 and Appendix G**.

In addition, the efficiency of CODA is comparable with DRAIN on Elec2, **as shown in the table below and in Appendix H**, where "Total" means the training time including $H(\cdot)$, $G(\cdot)$, and the MLP. For further clarification, the training process of CODA is split into three sub-processes, i.e., learning the Correlation Predictor $H(\cdot)$, learning the Data Simulator $G(\cdot)$, and predictive model training. Each sub-process is a manageable sub-problem and takes less training time than an end-to-end process. We can surely further improve the efficiency in future work.

| Framework & Components           | Training Time (s) |
|:--------------------------------:|:-----------------:|
| **DRAIN**                        | **465.936**       |
| **CODA (Total)**                 | **447.817**       |
| Correlation Predictor $H(\cdot)$ | 142.110           |
| Data Simulator $G(\cdot)$        | 290.826           |
| Prediction Model (MLP)           | 14.880            |

**[Q4]: Is the scenario with "w/o $\mathcal{C}_{T+1}$" in the ablation study section different from the case with $\lambda_c = 0$ in Fig.(5)?**

**[AQ4]:** "w/o $\mathcal{C}_{T+1}$" is equivalent to $\lambda_c = 0$, but $\lambda_c = 0$ is not included in Fig.(5). The detailed discussion is in our [previous response for Q2]().

**[Q5]: In the legend of Fig. (5), the graphs for test and val are indistinguishable.**

**[AQ5]:** We thank the reviewer for pointing out the unclear legend of Fig.(5). We will update the legend to distinguish the dashed lines from the solid lines. In Fig.(5), the dashed lines refer to the validation sets, and the solid lines refer to the test sets.

**[Q6]: Could the proposed method be extended to various real-world image datasets?**

**[AQ6]:** **Yes**. We conducted experiments and have shown the results on image data **in Table 4 and Appendix G**. Specifically, for the high-dimensional data, we use the same encoder structure as the baseline method LSSAE (MNIST ConvNet) and train three architecture predictors. The results reveal that all three predictors trained on the dataset generated by CODA outperform other baselines.

---
