Scott Chang
# ICLR 2024 Rebuttal (CODA)

## Rebuttal Summary

### Summary of Rebuttal

We thank all the reviewers and the AC for their time and effort in evaluating our work. We are pleased that all of the reviewers found our experiments well-established and effective, and we sincerely value their constructive comments during the rebuttal session.

We are glad that the unclear parts for Reviewer VxLr were addressed during the rebuttal session; after re-evaluating the first revised version, Reviewer VxLr raised the score. Reviewers BJ2w and q1Jd maintained their scores of 6 after our responses, indicating that we addressed their concerns in the current version without raising further issues.

After the second round of responses for Reviewer VxLr, we firmly believe that we addressed the reviewer's three main concerns:
1. We updated the claims and motivations in our paper.
2. We clarified the novelty that distinguishes CODA from existing approaches (in fact, we propose a brand-new, model-agnostic branch of solution with flexibility and transferability for architecture-type exploration).
3. We conducted more experiments on two real-world datasets to make our experimental results more solid.

As for the second round of responses for Reviewer GM2p, we also firmly believe that we addressed all of the reviewer's follow-up questions:
1. We detailed our experiments on high-dimensional data.
2. We elaborated on the novel perspective behind CODA (the experiments added in our second round of responses for Reviewer VxLr further support this point).
3. We explained why the conditional data generator idea mentioned by the reviewer is not feasible; it is, in fact, out of our work's scope and unrelated to our work.

**With the clarification and extra experiments in the rebuttal, we believe we have resolved all the reviewers' concerns and look forward to positive feedback.**

### Contributions of Our Work

In this work, we propose a model-agnostic framework to address concept drift from a novel, data-centric perspective. The main motivation behind our approach is to "nip the problem in the bud": the root cause of concept drift lies in the temporal evolution of data, and our solution directly tackles this by generating future data for model training. Besides introducing a novel perspective in TDG, the experimental results demonstrate the effectiveness of our solution by achieving SOTA. We believe the proposed perspective will benefit further research in TDG.

## Reviewer VxLr

We thank the reviewer for the constructive comments and appreciate the recognition of the effectiveness of our work.

**W1: Clarification of the Motivation**

*"Concerning using a data-centric approach over a model-centric method in TDG."*

[AW1]: We consider model-centric and data-centric approaches as parallel strategies, and our goal is not to position one approach against the other. Our main motivations are as follows:
1. **Nip the problem in the bud**
   - We believe the fundamental cause of concept drift is the underlying temporal trend of the data distribution over time. Therefore, with the generated future data, TDG can be achieved by training a prediction model on an i.i.d. dataset.
   - In other words, **the root cause of concept drift lies in the temporal evolution of data**. Our solution directly tackles this problem from the data perspective, i.e., it achieves TDG by generating future data for training a prediction model.
2. **Flexibility and transferability for architecture-type exploration**
   - It is evident that the most effective model architecture can vary across datasets and downstream tasks, e.g., MLP, tree-based, and Transformer-based backbones. This observation is supported by Table 1 of our paper, which shows that the best-performing architecture differs among the five datasets.
   - However, existing model-centric methods are limited to specific model architectures. In contrast, our data-centric CODA framework offers flexibility for exploring various architectures by providing transferable training datasets, which are adaptable for training different backbone architectures. For a detailed analysis of this adaptability, refer to "Cross-Architecture Transferability" in Section 4.3.

**W2: Why not generate data in representation space?**

[AW2]: Although the ultimate goal is to train a predictor (as mentioned in the response to Weakness 1), we generate instances that simulate the future data distribution in order to remain model-agnostic: the efficacy of an architecture choice, such as tree-based or Transformer-based models, varies across datasets and downstream tasks, as supported by the results in Table 1. Generating data in representation space cannot guarantee model-agnosticism, because a fixed encoder is required for the representations to be effective.

**W3: Contributions and Novelties**

*"The authors need to explicitly distinguish the contributions of existing works [1][2] and novelties in light of these studies."*

[AW3]: Our main contribution and novelty is a data-centric (model-agnostic) TDG framework that uses feature correlation matrices **to simplify the challenge of capturing the temporal trend**. The challenge of capturing the temporal trend among multiple time points is two-fold:
1. In most real-world benchmarks, **we do not have a sample index for each data instance across time domains**, so we cannot treat each instance as a time series and model its temporal evolution pattern. Therefore, it is impossible to predict future features per sequence; we can only capture the trend of the data distribution along the time domains and generate the future dataset/samples.
2. An alternative is to capture the underlying temporal trend among multiple datasets (distributions) with a kernel density estimation method, which is **computationally infeasible and hardly generates effective training data** (details of the analysis are in Section 3.1).

**Novelties and contributions**
- For the two missed related works [1][2], we have added them to our references. Although they are also generative methods, both data augmentation in latent space [1] and the generation of augmented features within latent space [2] are **not model-agnostic**, because predictors that are not trained with the same encoder cannot recognize the augmented embeddings or features.
- We also ran CODA on Rot-MNIST for performance comparison, using the same encoder structure as [1] (MNIST ConvNet) and training three predictors with different architectures. All three predictors trained on the dataset generated by CODA outperform [1, 2] and the other baselines, which again demonstrates the **effectiveness and transferability of CODA**. We have added the experimental results in Appendix G; the added section title is highlighted in blue.
| Frameworks | Sine | Rot-MNIST |
| :----: | :----: | :----: |
| LSSAE [1] | 36.8 $\pm$ 1.5 | 16.6 $\pm$ 0.7 |
| DDA [2] | 1.6 $\pm$ 0.9 | 13.8 $\pm$ 0.3 |
| GI | 33.2 $\pm$ 0.7 | 7.7 $\pm$ 1.3 |
| DRAIN | 3.0 $\pm$ 1.0 | 7.5 $\pm$ 1.1 |
| CODA (MLP) | 2.7 $\pm$ 0.9 | **6.0 $\pm$ 1.2** |
| CODA (LightGBM) | **1.2 $\pm$ 0.4** | **5.8 $\pm$ 0.6** |
| CODA (FT-Transformer) | **1.1 $\pm$ 0.4** | **6.3 $\pm$ 0.5** |

[1] Tiexin Qin, et al. "Generalizing to evolving domains with latent structure-aware sequential autoencoder." ICML 2022.
[2] Qiuhao Zeng, et al. "Foresee what you will learn: Data augmentation for domain generalization in non-stationary environment." AAAI 2023.

**W4: Why model the correlation between two consecutive domains?**

[AW4]: We would like to clarify this misunderstanding. We do not model the correlation between two consecutive time domains; we use an LSTM to model the temporal trend of **feature correlation matrices over all the time domains in the training datasets**.
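To make this setup concrete, here is a minimal sketch (our illustration, not the paper's code) of the data-side bookkeeping: each time domain is summarized by a feature correlation matrix with the label appended as the final column (as described in our response to Q1), and a naive linear extrapolation stands in for the LSTM-based Correlation Predictor $H(\cdot)$:

```python
import numpy as np

def domain_corr_matrix(X, y):
    """Correlation matrix over [features | label]; the label occupies the
    final row/column, so feature-label correlations are included."""
    Z = np.column_stack([X, y])
    return np.corrcoef(Z, rowvar=False)

# Toy drifting domains: the feature-label correlation strengthens over time.
rng = np.random.default_rng(0)
mats = []
for t in range(5):
    x = rng.normal(size=500)
    y = 0.2 * t * x + rng.normal(size=500)
    mats.append(domain_corr_matrix(x[:, None], y))

# Stand-in for the Correlation Predictor H(.): the paper uses an LSTM over
# C_1..C_T; a linear extrapolation of the last two matrices is used here
# purely for illustration.
C_hat_next = mats[-1] + (mats[-1] - mats[-2])
```

In the actual framework, a sequence model would replace the last line, consuming the sequence $\mathcal{C}_1,\dots,\mathcal{C}_T$ and emitting $\mathcal{\hat{C}}_{T+1}$.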
Eq.(3) optimizes the loss between the predicted $\mathcal{\hat{C}}_t$ and the ground truth $\mathcal{C}_{t}$ given $\mathcal{C}_{1}$ to $\mathcal{C}_{t-1}$.
- As mentioned in the response to your Weakness 3, the two challenges of capturing the temporal trend from datasets are:
  1. We do not have a sample index for each data instance across time domains, so we cannot capture the temporal trend of each sample independently; instead, we must capture the temporal trend among multiple data distributions.
  2. However, our preliminary experiments and analysis show the infeasibility of directly modeling the temporal evolution among data distributions (refer to Section 3.1).
- To this end, the core idea of our solution is **to simplify the data distribution at each time domain so that the underlying temporal trend can be captured better**. In this work, we achieve this simplification with feature correlation matrices and provide a theoretical analysis of the rationale for representing a data distribution by its feature correlation matrix (refer to Section 3.4).
- We want to emphasize that while numerous methods exist to simplify dataset information, **our pioneering research introduces a natural and theoretically supported data-centric approach for this purpose**.

**W5 & Q4: Clarification of the Data Simulator and how the estimated correlation matrix is incorporated for data generation.**

[AW5 & AQ4]: Based on the current data distribution $\mathcal{D}_{T}$, the Data Simulator ${G}(\mathcal{D}_{T} ; \mathcal{\hat{C}}_{T+1} | \theta_{G})$ can simulate the future data distribution $\mathcal{\hat{D}}_{T+1}$ subject to the predicted correlation matrix $\mathcal{\hat{C}}_{T+1}$. We further explain the details as follows:
- Specifically, our CODA framework comprises two replaceable components: the Correlation Predictor ${H}(\cdot)$ (see Section 3.2) and the Data Simulator $G(\cdot)$ (see Section 3.3).
  They can essentially be substituted by other models that perform similar functions, where $G(\cdot)$ should be a generative model that can incorporate prior knowledge into data generation. In our case, the prior knowledge is the predicted future correlation matrix $\mathcal{\hat{C}}_{T+1}$, as described in Eq.(4) and Eq.(5).
- Simultaneously, the trained $G(\cdot)$ should learn a data distribution similar to that of the current domain $\mathcal{D}_{T}$. This is based on the assumption that distribution shifts are smooth and that nearby time domains are closely related (refer to assumption (iii) in Theorem 1).
- In our experiments, we instantiate $G(\cdot)$ with a generative model that jointly learns the encoder and decoder of a VAE-based generative model together with a learnable graph. It can therefore **treat the prior knowledge as an adjacency matrix and encourage the learned graph to be similar to the given prior (refer to Sections 3.3 and 4.1)**.
- Therefore, based on the current data distribution $\mathcal{D}_{T}$, ${G}(\mathcal{D}_{T} ; \mathcal{\hat{C}}_{T+1} | \theta_{G})$ can simulate the future data distribution $\mathcal{\hat{D}}_{T+1}$ subject to the predicted correlation matrix $\mathcal{\hat{C}}_{T+1}$.

**W6: Experimental Results on Commonly Used Benchmarks**

*"Several commonly used benchmark data sets are also missing, including both synthetic (e.g., Circle, Sine) and real (e.g., RMNIST, Portraits, Ocular, Caltran, WILDS) data sets."*

[AW6]: We have considered diverse concept drift patterns:
- Synthetic datasets: Besides 2-Moons, we experimented on the Sine dataset, shown in the table below.
- Real datasets: Besides the real-world datasets used in our experiments (Elec2, ONP, Shuttle, and Appliance), we conducted one more experiment on Rot-MNIST, using the same encoder structure as [1] (MNIST ConvNet) and training three predictors with different architectures; the results are shown in the table below. All three predictors trained on the dataset generated by CODA outperform the other baselines. Furthermore, the differences among the three trained predictors support one of our contributions: the model-agnostic CODA framework is flexible for exploring the best architecture for different datasets and downstream tasks. We have added the experimental results in Appendix G; the added section title is highlighted in blue.
| Frameworks | Sine | Rot-MNIST |
|:---------------------:|:-----------------:|:-----------------:|
| LSSAE [1] | 36.8 $\pm$ 1.5 | 16.6 $\pm$ 0.7 |
| DDA [2] | 1.6 $\pm$ 0.9 | 13.8 $\pm$ 0.3 |
| GI | 33.2 $\pm$ 0.7 | 7.7 $\pm$ 1.3 |
| DRAIN | 3.0 $\pm$ 1.0 | 7.5 $\pm$ 1.1 |
| CODA (MLP) | 2.7 $\pm$ 0.9 | **6.0 $\pm$ 1.2** |
| CODA (LightGBM) | **1.2 $\pm$ 0.4** | **5.8 $\pm$ 0.6** |
| CODA (FT-Transformer) | **1.1 $\pm$ 0.4** | **6.3 $\pm$ 0.5** |

**Q1: Do the correlation matrices include the label information?**

[AQ1]: **Yes, they include label information.** In a feature correlation matrix, each row and column corresponds to a specific feature, and the final row and column represent the label. Each cell indicates the degree of correlation between a pair of features. Furthermore, **Section 3.4** presents a theoretical analysis that guarantees the consistency of our feature correlation estimation under **three assumptions that can be easily satisfied in reality**.

**Q2: Explanation of Eq.(3)**

[AQ2]: The three regularization terms in Eq.(3) are explained as follows:
1. The $\ell_1$-norm encourages sparsity in the predicted $\mathcal{\hat{C}}_t$: it effectively "zeroes out" less important entries, and the correlation matrices are generally sparse (as shown in Appendix Figure 9).
2. The $\ell_2$-norm is sensitive to large errors and penalizes $\mathcal{\hat{C}}_t$ for substantial deviations, promoting overall reconstruction accuracy.
3. The cross-entropy loss $\mathcal{L}_{CE}$ measures how well the distribution of $\mathcal{\hat{C}}_t$ matches the ground-truth distribution of $\mathcal{C}_t$.

Although the error between the predicted $\mathcal{\hat{C}}_{t+1}$ and the ground truth $\mathcal{C}_{t+1}$ is already minimal (as shown in Figure 10 in the Appendix), there is still potential to enhance the Correlation Predictor.
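For intuition only, the three terms might be combined as in the following sketch; the weights `lam1`/`lam2`/`lam3` and the particular cross-entropy instantiation are our illustrative assumptions, with Eq.(3) in the paper defining the actual objective:

```python
import numpy as np

def correlation_loss(C_hat, C, lam1=0.01, lam2=1.0, lam3=0.1, eps=1e-8):
    """Illustrative combination of the three regularizers discussed above."""
    l1 = np.abs(C_hat).sum()                 # sparsity on the prediction
    l2 = ((C_hat - C) ** 2).sum()            # penalize large errors
    # Cross-entropy between normalized magnitude "distributions" of the
    # ground-truth and predicted matrices (one possible instantiation).
    p = np.abs(C).ravel() + eps
    p /= p.sum()
    q = np.abs(C_hat).ravel() + eps
    q /= q.sum()
    ce = -(p * np.log(q)).sum()
    return lam1 * l1 + lam2 * l2 + lam3 * ce

C_true = np.array([[1.0, 0.5], [0.5, 1.0]])
perfect = correlation_loss(C_true, C_true)        # only l1 + entropy terms
off = correlation_loss(0.5 * C_true + 0.1, C_true)  # adds reconstruction error
```

A perturbed prediction incurs a strictly larger loss than the exact matrix, as expected from the $\ell_2$ and cross-entropy terms.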
As one of the future directions, this could be achieved with a more sophisticated sequential prediction framework than an LSTM.

**Q5: Connection between Theorem 1 and the Proposed Method (e.g., Eq.(5))**

*"Theorem 1 states that for two random vectors, if they are bounded and their distributions are close, then the difference between their correlation matrices is also bounded. But how is this related to the algorithm?"*

[AQ5]: Theorem 1 serves as the theoretical foundation for using the prior knowledge (the predicted feature correlation matrix $\mathcal{\hat{C}}_{T+1}$) in the Data Simulator.
- As mentioned in our responses to Weaknesses 3 and 4, directly capturing the temporal trend among multiple time domains is computationally infeasible (refer to Section 3.1). Our framework addresses this by representing the data distribution at each time domain by its feature correlation matrix, which effectively captures the temporal trend. This simplification can faithfully represent the original distribution information only if Theorem 1 holds.
- Based on the analysis in Section 3.4, we conclude that the three assumptions in Theorem 1 can be easily satisfied in reality, which provides the theoretical foundation for the simplification.

---

## Re: Response to the rebuttal (VxLr)

Q1: Motivations and novelties: while the authors claim that "We consider model-centric and data-centric approaches as parallel strategies", the motivations in the paper are still the same. In fact, the authors did not revise the paper to highlight this point at all. Regarding the novelty, I agree that model-agnostic can be considered as a benefit of CODA (though still not emphasized enough in the paper), but other than that, I cannot see fundamental improvements over [1], [2]. In particular, [1] [2] also face challenge 1, and challenge 2 is not prominent in [1] [2] as they generate samples in the representation space.

Q2: The experiments are still not solid enough. RMNIST is probably one of the simplest real-world data sets in TDG.

=====================================================

We appreciate the reviewer adjusting the score in light of our clarifications, and we are glad to further address the remaining concerns.

**[AQ1-1]: Revised the paper.** Thanks to the reviewer for reminding us of the points that should be revised. We have updated the claims and motivations, highlighted in blue in the Introduction section.

**[AQ1-2]: Novelty.** We would like to emphasize that our work proposes **a new branch of solution** to the concept drift problem. We argue that such a new branch is itself novel and serves as a fundamental improvement over the existing literature. Additionally, as a new-branch solution, it is not obliged to improve the approaches of another branch. Our novelty lies in:
1. **Developing a new branch of solution** for addressing the concept drift problem from a data-centric perspective.
2. Our **model-agnostic** CODA framework provides **flexibility and transferability for architecture-type exploration**.

Although [1] and [2] face similar scenarios, their approaches train predictors with **specific encoders**. This setup is not model-agnostic, since the generated embeddings cannot be recognized by predictors and decoders that were not trained with the same encoders.

**[AQ2]:** As the reviewer mentioned, Rot-MNIST is also a real-world dataset with concept drift. Besides, in Table 1, we evaluated on four additional real-world concept drift datasets (Elec2, ONP, Shuttle, Appliance). We believe our experiments on the synthetic Sine dataset and the real-world Rot-MNIST dataset are sufficient. We are trying our best to add one or two more real-world datasets before the rebuttal deadline; thank you for your understanding.

**[AQ2 Part 2]:** We have added two more real-world TDG datasets, Portraits and Forest Cover. For Portraits, we use the same encoder as [2] (Wide ResNet) before the Correlation Predictor module. We have added the experimental results in Appendix G.

| Frameworks | Portraits | Forest Cover |
|:---------------------:|:-----------------:|:------------------:|
| LSSAE [1] | 6.9 $\pm$ 0.3 | 36.8 $\pm$ 0.4 |
| DDA [2] | 5.1 $\pm$ 0.1 | 34.7 $\pm$ 0.5 |
| GI [3] | 6.3 $\pm$ 0.2 | 36.4 $\pm$ 0.4 |
| CODA (MLP) | 5.1 $\pm$ 0.1 | **34.4 $\pm$ 0.4** |
| CODA (LightGBM) | 6.2 $\pm$ 0.1 | **33.0 $\pm$ 0.3** |
| CODA (FT-Transformer) | **4.9 $\pm$ 0.2** | **33.7 $\pm$ 0.3** |

[1] Tiexin Qin, et al. "Generalizing to evolving domains with latent structure-aware sequential autoencoder." ICML 2022.
[2] Qiuhao Zeng, et al. "Foresee what you will learn: Data augmentation for domain generalization in non-stationary environment." AAAI 2023.
[3] Anshul Nasery, et al. "Training for the Future: A Simple Gradient Interpolation Loss to Generalize Along Time." NeurIPS 2021.
**With the discussion and added results above, we hope that we have resolved all the reviewer's concerns, and we look forward to clarifying any further questions that may arise.**

---

## Reviewer BJ2w

We thank the reviewer for the constructive comments and appreciate the recognition of the effectiveness of our work.

**Q1: Difference between the Proposed Framework and Model-centric Methods**

*"The generated data would still be utilized as the training data for prediction models. Would it still go back to model-centric strategies?"*

[AQ1]:
- Our framework focuses on solving the problem by generating effective training data, which identifies it as a data-centric paradigm.
- The *data-centric paradigm* involves methods for building effective training data; in contrast, *model-centric* methods focus on identifying more effective model designs trained on the original data [1].
- Our framework addresses the concept drift issue by generating future datasets for model training, achieving a model-agnostic approach to exploring different model architectures for downstream tasks.

[1] Daochen Zha, et al. "Data-centric artificial intelligence: A survey." arXiv:2303.10158.

**Q2: Baseline Comparison**

*"The generated data is then used to train models, would it be unfair for comparison methods? Should the comparison methods also use the same generated data to fine-tune?"*

**[Q2-1]: Fair performance comparison.**

[AQ2-1]: **Yes, the performance comparison is fair.**
In our experiments, the MLP predictor trained on the data generated by CODA uses the same architecture as the baseline DRAIN. Instead of designing predictor model structures, our approach focuses on the quality and efficacy of the generated training data.

**[Q2-2]: Should the comparison methods also use the same generated data to fine-tune?**

[AQ2-2]: **The other baselines cannot be trained on a single time domain's dataset alone.**
- CODA simulates the one-domain-ahead data for model training. In contrast, the other baselines require all training domains to fine-tune their whole models or frameworks and therefore cannot be trained using data from a single domain alone.

---

## Reviewer q1Jd

We thank the reviewer for the constructive comments and appreciate the recognition of the effectiveness of our work.

**W1: Limited dynamic network adaptability compared to some existing methods.**

[AW1]: We are not sure about the content of this weakness; we read it as claiming that CODA can only be used with certain deep neural networks. Based on this understanding, we believe **this is a misunderstanding**, for the following reasons:
- Our main contribution and novelty is a data-centric (model-agnostic) TDG framework that uses feature correlation matrices to simplify the challenge of future data generation. With the generated future data, TDG can be achieved by training a prediction model on an i.i.d. dataset.
- Our model-agnostic CODA framework offers flexibility for exploring various architectures by providing transferable training datasets. The datasets generated by CODA are adaptable for training different backbone architectures, as demonstrated in Table 1.
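As a small illustration of this transferability claim (a sketch under our own toy assumptions, not the paper's pipeline), the same generated table can train heterogeneous backbones without any shared encoder; here `GradientBoostingClassifier` stands in for LightGBM:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Stand-in for a CODA-generated future dataset D_hat_{T+1}; in the
# framework this would come from the Data Simulator G(D_T; C_hat_{T+1}).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The same generated data trains different backbone architectures --
# the model-agnostic property discussed above.
backbones = {
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),  # LightGBM stand-in
}
scores = {name: model.fit(X, y).score(X, y) for name, model in backbones.items()}
```

Because the generated data lives in the original feature space rather than an encoder's latent space, nothing ties the downstream predictor to a particular architecture.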
For a detailed analysis of this adaptability, refer to "Cross-Architecture Transferability" in Section 4.3.

In sum, being free from a specific model architecture for all downstream tasks is one of our main contributions rather than a limitation.

**W2. Constrained application in model-agnostic learning scenarios.**

[AW2]: We are also not certain what this weakness refers to. We interpret the reviewer's point as "CODA can only be leveraged for model-agnostic tasks." If our understanding is correct, then we believe **this is also a misunderstanding**, for the following reasons:
- "Model-agnostic" is a property of *approaches* rather than of *downstream tasks* [1]. This merit provides the flexibility to explore the predictor architecture best suited to whatever task or scenario is at hand.
- The datasets generated by CODA are adaptable for training different backbone architectures, as demonstrated in Table 1. For a detailed analysis of this adaptability, refer to "Cross-Architecture Transferability" in Section 4.3.

[1] Daochen Zha, et al., "Data-centric artificial intelligence: A survey," arXiv:2303.10158

**W3 & Q1: Effectiveness on High-dimensional Data.**

*"In the context of high-dimensional data, how does CODA maintain performance efficiency?"*

[AW3 & AQ1]: Based on Theorem 1 and our analysis in Section 3.4, we agree that feature correlation matrices may not effectively represent high-dimensional data.
**However, this does not mean that the proposed CODA framework can only work on low-dimensional data.** We evaluated CODA on a high-dimensional dataset (Rotated MNIST) for performance comparison, shown in the table below. We use the same encoder structure as [1] (the MNIST ConvNet) and train predictors with three different architectures. The results show that all three predictors trained on the dataset generated by CODA outperform the other baselines. Furthermore, the differences among the three trained predictors support one of our contributions: the proposed model-agnostic CODA framework is flexible enough to explore the best architecture for different datasets and downstream tasks. We have added the experimental results in Appendix G, and the added section title is highlighted in blue.

| Frameworks | Sine | Rot-MNIST |
| :----: | :----: | :----: |
| LSSAE[2] | 36.8 $\pm$ 1.5 | 16.6 $\pm$ 0.7 |
| DDA[3] | 1.6 $\pm$ 0.9 | 13.8 $\pm$ 0.3 |
| GI | 33.2 $\pm$ 0.7 | 7.7 $\pm$ 1.3 |
| DRAIN | 3.0 $\pm$ 1.0 | 7.5 $\pm$ 1.1 |
| CODA (MLP) | 2.7 $\pm$ 0.9 | **6.0 $\pm$ 1.2** |
| CODA (LightGBM) | **1.2 $\pm$ 0.4** | **5.8 $\pm$ 0.6** |
| CODA (FT-Transformer) | **1.1 $\pm$ 0.4** | **6.3 $\pm$ 0.5** |

[2] Tiexin Qin, et al., "Generalizing to evolving domains with latent structure-aware sequential autoencoder," ICML 2022.
[3] Qiuhao Zeng, et al., "Foresee what you will learn: Data augmentation for domain generalization in non-stationary environment," AAAI 2023.

**W4 & Q2: Effectiveness in Diverse Concept Drift Scenarios.**

*"Does CODA account for various natures of concept drift, such as abrupt or cyclical changes?"*

[AW4 & AQ2]: We have already considered diverse concept drift patterns, as follows:
- Synthetic concept drifts:
  1. Cyclical change: the 2-Moons dataset is built with a cyclical concept drift pattern, and we additionally evaluate on the **Rot-MNIST** and **Sine** datasets, as shown in the table above.
  2. Abrupt change: this type of temporal trend does not fit our assumption that **"the joint distribution of features and labels with smooth data shift over time"** (refer to the Introduction).
- Real-world concept drift: the real-world datasets used in our experiments feature varied and unknown patterns of concept drift. They cover diverse realistic temporal trends, such as electricity demand changes (Elec2), space shuttle defects (Shuttle), and appliances energy usage changes (Appliance).

## Reviewer GM2p

We thank the reviewer for the constructive comments and for recognizing the effectiveness of our work.

**W1: Effectiveness on High-dimensional Data.**

*"The proposed algorithm can only work on low-dimensional data (as the authors also mentioned). It is intractable to learn the correlation matrix on the high-dimensional data. I guess that's why some dataset such as rotating MNIST has been excluded from evaluation."*

[AW1]: Based on Theorem 1 and our analysis in Section 3.4, we agree that feature correlation matrices may not effectively represent high-dimensional data. **However, this does not mean the proposed framework can only work on low-dimensional data.** We evaluated CODA on the high-dimensional dataset (Rotated MNIST) mentioned by the reviewer for performance comparison. We use the same encoder structure as [1] (the MNIST ConvNet) and train predictors with three different architectures. **The results show that all three predictors trained on the dataset generated by CODA outperform the other baselines.** Furthermore, the differences among the three trained predictors support one of our contributions: the proposed model-agnostic CODA framework is flexible enough to explore the best architecture for different datasets and downstream tasks. We have added the experimental results in Appendix G, and the added section title is highlighted in blue.

| Frameworks | Rot-MNIST |
|:---------------------:|:-----------------:|
| LSSAE[1] | 16.6 $\pm$ 0.7 |
| DDA[2] | 13.8 $\pm$ 0.3 |
| GI | 7.7 $\pm$ 1.3 |
| DRAIN | 7.5 $\pm$ 1.1 |
| CODA (MLP) | **6.0 $\pm$ 1.2** |
| CODA (LightGBM) | **5.8 $\pm$ 0.6** |
| CODA (FT-Transformer) | **6.3 $\pm$ 0.5** |

**W2-1: Justification for the end-to-end SOTA comparison.**

[AW2-1]: We agree that end-to-end approaches are often preferred for their convenient training process. At the same time, **we also believe that end-to-end approaches may not always be the best solution for tackling the root cause of a problem**, for the following reasons:
- The main motivation behind our approach is to **"nip the problem in the bud"**.
In other words, **the root cause of concept drift lies in the temporal evolution of data**. Our solution is to tackle this problem directly from the data perspective, i.e., to achieve TDG by generating future data on which a prediction model is trained.
- While end-to-end approaches may overfit due to the tight coupling between data and model, one of our main motivations is to offer the flexibility of model-architecture exploration for different datasets and downstream tasks by providing high-quality and effective training data.
- It is evident that the most effective model architecture varies across datasets and downstream tasks. This observation is supported by Table 1, which shows that the best-performing architecture differs among the five datasets.

**W2-2 & Q1: Efficiency of CODA.**

*"Requires separate steps to solve the final task is inefficient."*

[AW2-2]: In this work, we mainly focus on the effectiveness of achieving TDG rather than on efficiency. Although efficiency is not our main goal, we would like to show that our framework achieves decent efficiency compared to the SOTA method DRAIN. The reason is that, by splitting the whole temporal-trend modeling and data generation process into three sub-processes (learning the Correlation Predictor $H(\cdot)$, learning the Data Simulator $G(\cdot)$, and predictor training), **each sub-process is a manageable sub-problem that takes less training time than a whole end-to-end model**.
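To make the first sub-process concrete, here is a minimal sketch. It is not our actual implementation: the helper names are hypothetical, and a per-entry linear trend stands in for the learned Correlation Predictor $H(\cdot)$, purely to illustrate "compute one correlation matrix per domain, then extrapolate the next one."

```python
import numpy as np

def domain_correlations(domains):
    """One feature correlation matrix per time domain."""
    return [np.corrcoef(X, rowvar=False) for X in domains]

def extrapolate_next(corrs):
    """Naive stand-in for H(.): fit a per-entry linear trend over the
    observed matrices and evaluate it one step into the future."""
    stacked = np.stack(corrs)                      # (T, d, d)
    T = len(corrs)
    t = np.arange(T, dtype=float)
    A = np.vstack([t, np.ones_like(t)]).T          # (T, 2) design matrix
    flat = stacked.reshape(T, -1)                  # (T, d*d)
    coef, *_ = np.linalg.lstsq(A, flat, rcond=None)
    pred = np.array([float(T), 1.0]) @ coef        # evaluate trend at t = T
    C_next = pred.reshape(stacked.shape[1:])
    return np.clip(C_next, -1.0, 1.0)              # keep valid correlations

# Toy example: 5 domains whose feature correlation drifts smoothly
rng = np.random.default_rng(0)
domains = []
for k in range(5):
    rho = 0.1 * k                                  # correlation grows over time
    cov = np.array([[1.0, rho], [rho, 1.0]])
    domains.append(rng.multivariate_normal([0.0, 0.0], cov, size=2000))

C_hat = extrapolate_next(domain_correlations(domains))
```

On this toy drift, the extrapolated off-diagonal entry lands near the continuation of the trend, which is the behavior the Correlation Predictor is trained to produce.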
We report the training-time comparison with the SOTA DRAIN on the Elec2 dataset in the table below, where we train an MLP predictor with the same structure. We have added the experimental results in Appendix H, and the added section title is highlighted in blue.

| Framework & Components | Training Time (s) |
| :----: | :----: |
| **DRAIN** | **465.936** |
| **CODA (Total)** | **447.817** |
| CODA (Correlation Predictor) | 142.110 |
| CODA (Data Simulator) | 290.826 |
| CODA (MLP) | 14.880 |

**W3: A conditional data generator considering the time index.**

*"Why not training a conditional data generator considering the time index."*

[AW3]: Unfortunately, this idea cannot be implemented because of the **lack of sample indices for instances in each time domain**. The explanations are as follows:
- One key challenge of capturing temporal trends across multiple time points is that we have no time index for each individual data instance, so we cannot treat an instance as sequential data and model its temporal evolution pattern, as a diffusion model would.
- An alternative is to capture the underlying temporal trend among multiple datasets (distributions), which is computationally infeasible and makes it hard to generate effective training data (see the analysis in Section 3.1). Therefore, our solution is to simplify the data distribution at each time domain so that the underlying temporal trend can be captured better.
We use feature correlation matrices to achieve this simplification and provide a theoretical analysis that justifies representing a data distribution with a feature correlation matrix (refer to Section 3.4).
- Note that the baseline GI proposes a time-sensitive model that extrapolates samples to the near future via a first-order Taylor expansion, which is an implicit way of using the time index as a condition for prediction. As shown in Table 1, three different architectures trained on the data generated by CODA outperform GI on all benchmarks.

**Q2: Without using all the previous domains for data simulation.**

*"Eq.(5) only uses the last domain and not all the previous domains."*

[AQ2]: In our proposed CODA framework, the trained Data Simulator $G(\cdot)$ should learn a data distribution similar to that of the current domain $\mathcal{D}_{T}$. This is based on the assumption that distribution shifts are smooth and that nearby time domains are closely related (refer to assumption (iii) in Theorem 1). Therefore, starting from the current data distribution $\mathcal{D}_{T}$, ${G}(\mathcal{D}_{T} ; \mathcal{\hat{C}}_{T+1} | \theta_{G})$ can simulate the future data distribution $\mathcal{\hat{D}}_{T+1}$ subject to the predicted correlation matrix $\mathcal{\hat{C}}_{T+1}$.

---

## Re: Response to Rebuttal (GM2p)

Q1: I cannot understand how the model can work on high-dimensional data.
My understanding is that the correlation matrices incur quadratic computational and memory complexity. So, it seems intractable on high-dimensional data.

Q2: Also, if my understanding (about computational and memory complexity) is correct, the training time for the correlation predictor and data simulator sub-processes is not manageable. The training-time comparison to DRAIN on the Elec2 dataset may be misleading since Elec2 has only a few dimensions.

Q3: For conditional generation, I did not mean to use sample indices. I meant to use the domain index as the time index for all the samples in a domain.

=====================================================

We appreciate the reviewer's feedback and are glad to further address the remaining concerns.

**Q1 & Q2: How the model can work on high-dimensional data.**

[AQ1 & AQ2]: We would like to clarify the confusion. We agree that the $O(N^2)$ computational complexity may limit feasibility. However, CODA can still work on high-dimensional data. Our **empirical results** show the efficacy of the proposed CODA framework on a high-dimensional dataset (Rot-MNIST). The key reason is its flexibility in choosing either the input space or the latent space for the Correlation Predictor module, which computes the correlation matrices used to generate future data while preserving the model-agnostic property.

We agree that using feature correlation may be limited by its computational complexity. To this end, we adopt **a simple solution: first encode the original samples into a low-dimensional** latent space, which allows us to compute feature correlations and incorporate them into the CODA framework (as we describe in our [previous response](https://openreview.net/forum?id=CE7lUzrp1o&noteId=FJ6NN4Dmc8)).
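As an illustration of this workaround, the sketch below computes the feature correlation matrix in a low-dimensional latent space instead of the raw pixel space. All names are hypothetical, and a fixed random projection stands in for the trained encoder (in our experiments, the MNIST ConvNet).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical high-dimensional domain: 1000 flattened 28x28 samples
X = rng.normal(size=(1000, 784))

# Stand-in encoder: a fixed random projection to a low-dimensional
# latent space (in CODA this role is played by a trained encoder)
d_latent = 16
W = rng.normal(size=(784, d_latent)) / np.sqrt(784)
Z = X @ W                                   # (1000, 16) latent codes

# Correlation is now computed over 16 latent features rather than
# 784 raw pixels, shrinking the quadratic cost of the matrix
C_latent = np.corrcoef(Z, rowvar=False)     # 16 x 16 instead of 784 x 784
```

The resulting 16x16 matrix is small enough for the Correlation Predictor to model, while the downstream predictor remains free to use any architecture.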
For the purpose of conducting a fair performance comparison with DRAIN, we apply the same pre-processing. The additional experiment also demonstrates the effectiveness of CODA.

Again, we would like to emphasize that **our main contribution lies in proposing a model-agnostic solution (enabled by the Data Generator module) to address concept drift from a novel, data-centric perspective**. We agree that exploring TDG on high-dimensional data is a critical and under-studied topic, and it will be our future direction for enhancing the robustness of our framework.

**Q3: Feasibility of a conditional data generator idea.**

[AQ3]: For training a conditional data generator **as a diffusion model**, $x_1$ and $x_2$ should be the same sample at different time indices. However, **we have no sequential time index for each sample**, so training a diffusion model on such concept drift datasets is infeasible.

On the other hand, it is feasible to train a VAE-based conditional data generator using the time index as an input condition. Unfortunately, such native **conditional generation models can hardly capture the underlying temporal trend**, since the model architecture cannot exploit the continuity of the input time-index condition. **As noted in the existing work GI [1]**:

> **(Section 1)** "as a general-purpose neural network $F(x, t)$ that takes as input $x$, $t$..."
> **(Section 3.3)** "A naive way to do that is to concatenate $t$ with $x$ to obtain an augmented feature vector [$x$, $t$]. However, such an approach cannot capture complex trends in data, e.g., periodicity."

To tackle this difficulty, GI designs a time-sensitive model architecture with a proposed time-dependent activation function.
However, that prior work still captures the temporal trend only implicitly, which may limit TDG performance. In our work, we explicitly capture temporal trends by modeling the temporal evolution of correlation matrices, and empirical results demonstrate that CODA achieves better TDG performance.

[1] Anshul Nasery, et al., "Training for the Future: A Simple Gradient Interpolation Loss to Generalize Along Time," NeurIPS 2021.

**With the clarification above, we hope that we have resolved all the reviewer's concerns and look forward to clarifying any further questions that may arise.**

---

## General Comments for All Reviewers

Dear reviewers,

We thank all reviewers for their constructive comments and helpful feedback. We have revised the paper accordingly and marked the modifications in blue for visibility. We are pleased that the reviewers find our paper well-written and well-organized (VxLr and GM2p), our approach novel and meaningful (BJ2w and q1Jd), our method theoretically sound (BJ2w and GM2p), and our experiments well-established and effective (VxLr, BJ2w, q1Jd, and GM2p).

To address your primary concerns, we have done our best to extend the work with additional experiments and to reply to your concerns and suggestions with more clarification and discussion. We propose a model-agnostic framework that tackles the root cause of concept drift by generating future data for model training. The generated training data provides flexibility and transferability for architecture exploration, and experimental results reveal that different model architectures can be effectively trained on the generated data.

The revisions are summarized as follows:
- (q1Jd, GM2p) We have revised the discussion of effectiveness on high-dimensional data in Section 3.4.
- (VxLr, BJ2w, q1Jd, GM2p) We have added the experiments of baseline comparisons and citations in Appendix G.
- (q1Jd, GM2p) We have added the training-time efficiency comparison in Appendix H.

We appreciate all of the reviewers' suggestions for enhancing our work. We look forward to your feedback and to addressing any follow-up questions you may have.

Sincerely,
Authors
