# ICML 2024 Rebuttal (CODA)

## Summary of Author-Reviewer Discussion

We thank all the reviewers and Area Chairs for their efforts and time in evaluating our work. After the rebuttal, we are encouraged to see that two of the reviewers raised their scores (`9fWD` raised to `6`, and `o1Qs` raised to `5`), while the remaining reviewer maintained the positive assessment (`NBG4` at `5`). We are pleased to note that our paper has received a cumulative score of `655` post-rebuttal.

In response to the concerns raised by the reviewers, we believe our rebuttal effectively addressed all of them since **no more concerns or limitations were raised**. Below are the keynotes of our discussions with each reviewer:

- Discussion with Reviewer `9fWD`:
  - **Clarification for the loss function designs**: We clarify the loss designs of the Correlation Predictor $H(\cdot)$ and Data Simulator $G(\cdot)$ with supplementary experiments to showcase the detailed results. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=yCWc26U6hL)
  - **Feasibility of CODA for high-dimensional data**: We clarify the feasibility of CODA for high-dimensional data and point out the results in our paper. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=9jeAj14kNp)
  - **Impact of distribution shift intensity**: We include supplementary experiments on different shift intensities to demonstrate the effectiveness of CODA under different distribution shift intensities. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=YrVBMSEhaL)
- Discussion with Reviewer `NBG4`:
  - **Detailed preliminary experiments**: We provide detailed experimental results in Section 3.1. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=WcXCTC7BE0)
  - **Benefits of using CODA compared with existing works**: We discuss the differences between existing works and CODA to showcase our motivation, novelty, and advantages. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=QazEOEA1V4)
  - **Synthetic and real-world concept drifts are considered**: We detail the various concept drift patterns we've considered in our paper. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=wRJIiQmAlG)
- Discussion with Reviewer `o1Qs`:
  - **Clarification for the loss function designs**: We clarify the three regularization terms in the designed loss with supplementary experiments to showcase their effectiveness. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=maoX5RRJ7m)
  - **Effectiveness of the predicted correlation matrices**: We provide supplementary experiments to further illustrate the effectiveness of CODA. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=Y5JCLnHjBO)
  - **Computational complexity of CODA**: We point out the discussion of computational complexity in our paper and provide supplementary experiments to demonstrate the feasibility of CODA. [Redirection](https://openreview.net/forum?id=pIajLksc6f&noteId=jMrzLUywXJ)

Again, we thank all the Area Chairs and the reviewers for their insightful comments and helpful feedback. It is our pleasure to improve the quality of this work with their guidance.

---

## General Comments for All Reviewers

We thank all reviewers for their constructive comments and helpful feedback. We are pleased that they find our paper **well-written** (9fWD and NBG4), our approach **novel and meaningful** (NBG4 and o1Qs), our method **theoretically sound** (9fWD and o1Qs), and **the experiments well-established and effective** (9fWD, NBG4, and o1Qs).

To address your primary concerns, we have done our best to extend the work with additional experiments and to reply to your concerns and suggestions with more clarification and discussion. We propose a model-agnostic framework that tackles the root cause of concept drift by generating future data for model training. The generated training data provides flexibility and transferability for exploring different architecture types. Experimental results reveal that different model architectures can be effectively trained on the generated data. Our responses are summarized as follows:

- (9fWD, o1Qs) We clarify the loss designs of the Correlation Predictor $H(\cdot)$ and Data Simulator $G(\cdot)$ with supplementary experiments to showcase the detailed results.
- (9fWD, NBG4) We detail the various concept drift patterns that we've considered.
- (9fWD, o1Qs) We clarify the feasibility of CODA for high-dimensional data and point out the results in our paper.
- (NBG4) We provide detailed experimental results in Section 3.1.
- (NBG4) We discuss the differences between existing works and CODA to showcase our motivation and novelty.
- (NBG4, o1Qs) We provide dedicated discussions with experimental results to clarify the potential concerns and limitations in effectiveness and efficiency.

We appreciate all of the suggestions made by the reviewers to enhance our work. We are delighted to receive your feedback and eagerly anticipate addressing any follow-up questions you may have.

---

## Reviewer 9fWD

We sincerely appreciate the reviewer's time and effort in reviewing our paper. We thank the reviewer for the constructive comments and for recognizing the effectiveness of our work.

**[Q1-1]: Clarification for the loss of Data Simulator $G(\cdot)$**

**[AW1-1]:** In Eq.(4) and Eq.(5), $\mathcal{D}_T$ represents the dataset at time domain $T$. The loss of the Data Simulator, $\mathcal{L}_{G}$ (Eq.(5)), has two parts with different purposes:

- $ELBO$ is the classic reconstruction loss for learning the probability distribution of a given dataset. In $\mathcal{L}_{G}$, the given dataset is the latest domain's dataset $\mathcal{D}_T$ (the dataset at time $T$).
- $\mathcal{R}_C(\mathcal{\hat{C}}_{T+1}) = \| \mathcal{\hat{C}}_{G} - \mathcal{\hat{C}}_{T+1}\|_{1}$ is a designed regularization term that makes $G(\cdot)$ learn the given future correlation matrix $\mathcal{\hat{C}}_{T+1}$, where $\mathcal{\hat{C}}_{T+1}$ is predicted by $H(\cdot)$ in the previous stage, and $\mathcal{\hat{C}}_{G}$ is the correlation matrix of the CODA-generated dataset.
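
To make the structure of $\mathcal{L}_{G}$ concrete, here is a minimal PyTorch-style sketch of the regularization term; the tensor names (`x_generated`, `c_hat_next`), the `elbo_term` input, and the weighting coefficient are illustrative assumptions rather than the actual CODA implementation.

```python
import torch

def correlation_matrix(x: torch.Tensor) -> torch.Tensor:
    """Feature-feature correlation matrix of a batch x of shape (n_samples, n_features)."""
    # torch.corrcoef treats each row as a variable, so transpose the batch first.
    return torch.corrcoef(x.T)

def simulator_loss(elbo_term: torch.Tensor,
                   x_generated: torch.Tensor,
                   c_hat_next: torch.Tensor,
                   lambda_c: float = 0.5) -> torch.Tensor:
    """Illustrative sketch of the Eq.(5)-style objective: ELBO term + lambda_c * R_C.

    elbo_term   -- reconstruction/ELBO loss already computed by the VAE part of G(.) (assumed given)
    x_generated -- a batch of samples produced by G(.)                               (hypothetical)
    c_hat_next  -- correlation matrix predicted by H(.) for domain T+1
    """
    c_generated = correlation_matrix(x_generated)          # \hat{C}_G in the rebuttal
    reg = torch.sum(torch.abs(c_generated - c_hat_next))   # element-wise l1 distance
    return elbo_term + lambda_c * reg
```
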
**[Q1-2]: How many data samples does the Data Simulator generate for the future domain?**

**[AW1-2]:** The number of generated data samples is **a controllable hyperparameter**, and the default value is the same as the number of samples in $\mathcal{D}_T$. The dataset details are in **Appendix C**. In our main experimental results, we use the default value during the training and inference stages. Furthermore, **for the inference stage, we have investigated the impact of the generated sample size on the performance of models trained with the generated data**. As shown in **Figure 3**, increasing the sample size reduces performance variances for both classification and regression tasks because a larger dataset more accurately represents the data distribution learned by the Data Simulator $G(\cdot)$.

**[Q1-3]: Does one sample in domain $T$ correspond to one sample in domain $T+1$?**

**[AW1-3]:** **No**. The number of samples in different domains is not necessarily the same. In our setting and most real-world scenarios, **we do not have a sample index for each data instance across time domains**, so we cannot treat each instance as a time series and model its temporal evolution pattern. This missing index is one of the challenges that distinguishes our problem from sequence analysis. Therefore, we propose to capture the trend of the data distribution along time domains and generate the future dataset/samples.

**[Q2]: Is it possible to use this method on language or image datasets?**

**[AW2]:** We have conducted experiments on image data.

- **Image datasets**: We conducted experiments on **image datasets, as shown in Table 4** and **Appendix G**. Three models are trained on the generated datasets, and they outperform other baselines. Specifically, for the high-dimensional data, we use the same encoder structure as the baseline method LSSAE (MNIST ConvNet) to save computational costs.
- **Language datasets**: "Concept drift" is defined as a change in the joint distribution between the input $x$ and the target variable $y$. However, it is extremely difficult to define and capture concept drift in natural language at the current stage. For example, the relationship between a question and its answer involves not only syntax but also semantic meaning: a "correct" answer might not share exact words with the question but still convey the appropriate knowledge. One possible solution is to leverage a Retrieval-Augmented Generation **(RAG)** framework and regularly update the corpus **to address the knowledge-outdating problem**. We will add more discussion on natural language processing in the revised manuscript.

**[Q3]: In some data domains, distribution shifts are not smooth and continuous. The model may not perform well in this situation.**

**[AW3]:** We agree that distribution shift intensity is an important factor in this problem. Here, we conduct experiments on different shift intensities for CODA on a synthetic dataset (**2-Moons**), and the results are shown below. In the 2-Moons dataset, there are 10 domains, where domain $i$ undergoes a rotation of $18i$°, as described in Appendix C.

To evaluate CODA under different distribution shift intensities, we conduct experiments with two additional settings and compare them with the original setting:

- **Original:** Source domains: $i = [1, 2, 3, 4, 5, 6, 7, 8, 9]$; test domain: $i = [10]$
- **Setting 1:** Source domains: $i = [2, 4, 6, 8]$; test domain: $i = [10]$
- **Setting 2:** Source domains: $i = [1, 4, 7]$; test domain: $i = [10]$

According to the rotation between two consecutive domains, the ranking of distribution shift intensity is **Original (18°) < Setting 1 (36°) < Setting 2 (54°)**, and the results are shown below:

| 2-Moons    | Original (18°) | Setting 1 (36°) | Setting 2 (54°) |
|:----------:|:--------------:|:---------------:|:---------------:|
| CODA (MLP) | 2.3 $\pm$ 1.0  | 3.1 $\pm$ 1.2   | 3.9 $\pm$ 0.8   |

We can observe that the difficulty of simulating future datasets increases as the distribution shift intensity rises. Furthermore, in real-world data, the distribution shift intensity is unknown. Therefore, in our future work, we will include more discussion on shift intensity detection and on methods for high shift intensity.
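
For reference, the rotated 2-Moons domains described above can be reproduced with a few lines; this is a minimal sketch assuming scikit-learn's `make_moons`, with illustrative sample counts and noise levels rather than the exact settings of Appendix C.

```python
import numpy as np
from sklearn.datasets import make_moons

def make_rotated_moons(domain_idx: int, n_samples: int = 500, noise: float = 0.1):
    """Domain i of the 2-Moons benchmark: the base dataset rotated by 18 * i degrees."""
    X, y = make_moons(n_samples=n_samples, noise=noise, random_state=domain_idx)
    theta = np.deg2rad(18 * domain_idx)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return X @ rotation.T, y

# Original setting uses source domains 1-9 and test domain 10;
# Setting 1 keeps only domains [2, 4, 6, 8], Setting 2 keeps [1, 4, 7].
source_domains = {i: make_rotated_moons(i) for i in [1, 2, 3, 4, 5, 6, 7, 8, 9]}
test_domain = make_rotated_moons(10)
```
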
---

## Reviewer NBG4

We sincerely appreciate the reviewer's time and effort in reviewing our paper. We thank the reviewer for the constructive comments and for recognizing the effectiveness of our work.

**[Q1]: Results of other datasets in Section 3.1**

**[AQ1]:** We only show the results on part of the datasets due to the paper length limitation. Here, we provide the results of Sec. 3.1 on the other datasets in the table below:

| Algorithm     | 2-Moons        | Shuttle       | Appliance      |
|:-------------:|:--------------:|:-------------:|:--------------:|
| GI$^{[2]}$    | 3.5 $\pm$ 1.4  | 7.0 $\pm$ 0.1 | 8.2 $\pm$ 0.6  |
| DRAIN$^{[1]}$ | 3.2 $\pm$ 1.2  | 7.4 $\pm$ 0.3 | 6.4 $\pm$ 0.4  |
| Prelim-LSTM   | 15.2 $\pm$ 2.0 | 8.1 $\pm$ 1.6 | 10.0 $\pm$ 0.4 |

We did not run our Prelim-LSTM on the ONP dataset because previous research has indicated that it exhibits relatively weak concept drift, as we mentioned in Sec. 4.2. We will add this discussion in a footnote of the main text.

**[Q2-1]: Aren't the error in simulating future data and the error in prediction accumulated?**

**[AQ2-1]:** We agree on the error accumulation point, which is important for enhancing quality and reducing model prediction error. CODA can minimize the domain generalization errors from two perspectives:

- When concept drift occurs, prediction errors are mainly caused by **significant changes in the joint distribution**. In other words, models trained on an outdated joint distribution would suffer a significant performance drop in future domains. To mitigate this issue, CODA aims to **effectively learn the future joint distribution** by representing the source domain datasets as correlation matrices. As our **theoretical analysis in Sec. 3.4 and Appendix A** shows, correlation matrices are guaranteed to represent the joint distribution under certain assumptions that can be easily satisfied in real-world scenarios.
- From the perspective of prediction model training, a performance gap inevitably exists between the training and testing sets. Nevertheless, as a **model-agnostic** framework, CODA provides **flexibility in selecting the best-performing model architecture**. Benefiting from this flexibility, CODA can **minimize the errors** accumulated from **sub-optimal model architectures**.

Our experimental results demonstrate that CODA tackles concept drift well and **outperforms** the existing methods on both synthetic and real-world datasets.

**[Q2-2]: What is the benefit of using CODA compared with some mentioned existing works [1][2][3]?**

**[AQ2-2]:** We thank the reviewer for the suggestion, and we will include a discussion of the papers mentioned by the reviewer in our next version. Here, we provide the discussion below:

- **[1] & [2]**
  - There are two advantages of using CODA compared with [1] & [2]. First, **our proposed CODA captures the long-term temporal trends from all the historical data points**, while the existing works [1] & [2] only consider nearby consecutive time points, i.e., the immediate past and the current time point. Second, CODA can capture more complex or cyclic temporal trends by considering a whole picture of multiple data distributions at different time points, while the existing works assume a linear changing trend with a subtle distribution shift on $x$.
- **[3]**
  - CODA is a **model-agnostic framework** that **parallels** existing model-centric works and can be used with various model architectures. In contrast, [3] is a model-centric approach that fine-tunes pre-trained model weights, which limits its applicability to other model architectures. [3] is proposed to tackle concept drift issues similar to existing works, such as GI and DRAIN discussed in our paper.

**[Q3]: Are many samples necessary in each domain?**

**[AQ3]:** Yes, it is important to collect enough samples to sufficiently represent a data distribution. Therefore, to accurately capture the underlying feature correlation and train the Correlation Predictor $H(\cdot)$, we use all the samples in the training domains; the details can be found **in Appendix C**.
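
To illustrate why each domain needs adequate samples, the per-domain correlation matrices fed to $H(\cdot)$ can be estimated as in the NumPy sketch below; treating the target as an extra column alongside the features is an assumption made here for illustration, not necessarily the exact construction used in the paper.

```python
import numpy as np

def domain_correlation(X_t: np.ndarray, y_t: np.ndarray) -> np.ndarray:
    """Empirical Pearson correlation matrix for one time domain.

    X_t has shape (n_samples, n_features) and y_t has shape (n_samples,).
    With too few samples the empirical correlations become noisy estimates,
    which is why adequate data per domain is needed.
    """
    joint = np.column_stack([X_t, y_t])       # append the target as an extra column
    return np.corrcoef(joint, rowvar=False)   # shape: (n_features + 1, n_features + 1)

# One matrix per training domain, e.g. as the input sequence for the Correlation Predictor H(.):
# correlation_sequence = [domain_correlation(X_t, y_t) for (X_t, y_t) in training_domains]
```
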
**[Q4]: What happens if in a dataset the change between tasks is a random number each time? In other words, the change is not always the same.**

**[AQ4]:** Random changes among time domains can be categorized as "abrupt change". Please see the detailed discussion below. We have already considered various concept drift patterns on both synthetic and real-world datasets as follows:

- **Synthetic concept drifts**:
  1. Cyclical change: the 2-Moons dataset is built with a cyclical concept drift pattern, and we also conduct experiments on the **Rot-MNIST** and **Sine** datasets, as shown in Table 4.
  2. Abrupt change: this scenario is usually not considered by domain generalization since the joint distributions of the source and target domains may be significantly different.
- **Real-world concept drifts**: The real-world datasets used in our experiments feature various and unknown patterns of concept drift. They cover diverse realistic temporal trends, such as electricity demand changes (Elec2), space shuttle defects (Shuttle), and appliance energy usage changes (Appliance). More discussion can be found in **Sec. 4.2, Table 4, and Appendix G**.

**[Q5]: What happens if the assumption that the distribution changes does not hold and the distribution remains the same? Can the method simulate future data?**

**[AQ5]:** **Yes**. It is a simplified case of our proposed method. When there is no concept drift, the correlation matrices in all time domains are the same ($\mathcal{C}_1 = \mathcal{C}_2 = \dots = \mathcal{C}$). In this case, $H(\cdot)$ can easily approximate a similar future correlation matrix $\mathcal{\hat{C}}_{T+1}$ for $G(\cdot)$ to generate a similar data distribution, i.e., $\mathcal{\hat{C}}_{T+1} \approx \mathcal{C}$, where $\mathcal{\hat{C}}_{T+1} = H(\mathcal{C}, \dots, \mathcal{C})$. Here, we conduct supplementary experiments to showcase the efficacy of CODA under the no-distribution-shift (i.i.d.) scenario, where we split the test domain into training, validation, and test sets $(6:2:2)$. The results on the test sets of three datasets are shown below:

| I.I.D. Scenario | 2-Moons       | Elec2         | Appliance     |
|:---------------:|:-------------:|:-------------:|:-------------:|
| CODA (MLP)      | 0.0 $\pm$ 0.0 | 4.2 $\pm$ 0.4 | 2.1 $\pm$ 0.2 |

[1] Pentina, Anastasia & Lampert, Christoph H., "Lifelong Learning with Non-i.i.d. Tasks," NeurIPS 2015 (https://dl.acm.org/doi/10.5555/2969239.2969411)

[2] Álvarez, Verónica, et al., "Minimax Forward and Backward Learning of Evolving Tasks with Performance Guarantees," NeurIPS 2023 (https://papers.nips.cc/paper_files/paper/2023/hash/cf4114c34a2b93019aa6e70f99680fae-Abstract-Conference.html)

[3] Zhao, Peng, et al., "Handling Concept Drift via Model Reuse," Special Issue of the ACML 2019 Journal Track (https://dl.acm.org/doi/abs/10.1007/s10994-019-05835-w)

---

## Reviewer o1Qs

We sincerely appreciate the reviewer's time and effort in reviewing our paper. We thank the reviewer for the constructive comments and for recognizing the effectiveness of our work.

**Q1: Clarification for Sec. 3.2 and 3.3 (in Q1-1 to Q1-3 below).**

**[Q1-1]: How can the Cross-Entropy loss be computed using the two correlation matrices (which are not probability distributions) in Eq.(3)?**

**[AQ1-1]:** The core idea of $\mathcal{L}_{CE}$ is as follows:

- Unlike the $\ell_1$-norm and $\ell_2$-norm, which calculate the errors in each element of the correlation matrix $\mathcal{\hat{C}}_t$, we leverage $\mathcal{L}_{CE}$ to measure how well the distribution within $\mathcal{\hat{C}}_t$ matches the ground-truth distribution of $\mathcal{C}_t$. To do so, in $\mathcal{L}_{CE}$, we first normalize the values in $\mathcal{C}$ to the range between 0 and 1, so that each entry can be seen as the probability that two features are correlated.
- The similarity between the two resulting probability matrices $\mathcal{\hat{C}}$ and $\mathcal{C}$ can then be measured by the KL divergence. We simplify this to $\mathcal{L}_{CE}$ because the cross-entropy equals the KL divergence up to a constant, where the constant is the entropy of the ground truth $\mathcal{C}$ (see the identity below).
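
For completeness, this simplification relies on the standard decomposition of cross-entropy into KL divergence plus entropy; written for a single normalized entry $c \in [0, 1]$ of $\mathcal{C}_t$ and its prediction $\hat{c} \in (0, 1)$, viewed as Bernoulli parameters:

$$
-\,c \log \hat{c} - (1 - c)\log(1 - \hat{c})
= \mathrm{KL}\big(\mathrm{Ber}(c)\,\|\,\mathrm{Ber}(\hat{c})\big) + \mathrm{H}\big(\mathrm{Ber}(c)\big),
$$

where $\mathrm{H}(\mathrm{Ber}(c))$ does not depend on $\hat{c}$, so minimizing $\mathcal{L}_{CE}$ is equivalent to minimizing the KL divergence.
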
To clarify the concept of the loss design, we will include the above description in the next version.

**[Q1-2]: Why do the authors consider the $\ell_2$-norm, $\ell_1$-norm, and cross-entropy loss simultaneously in Eq.(3)? There is a lack of intuitive explanation for this. Wouldn't it be enough to apply just one of the three?**

**[AQ1-2]:** The three regularization terms in Eq.(3) are explained as follows:

- The $\ell_1$-norm encourages sparsity in the predicted $\mathcal{\hat{C}}_t$: it effectively "zeroes out" less important feature correlations, since the correlation matrices are generally sparse (as shown in Figure 9 of Appendix F).
- The $\ell_2$-norm ensures smoothness and penalizes large deviations in the elements of the predicted correlation matrix $\mathcal{\hat{C}}_t$, promoting stability in the feature correlations.
- The cross-entropy loss $\mathcal{L}_{CE}$ measures how well the distribution of $\mathcal{\hat{C}}_t$ matches the ground-truth distribution of $\mathcal{C}_t$.

Based on our experiments, the best-performing Correlation Predictor $H(\cdot)$ is optimized with the three regularization terms in Eq.(3) simultaneously, as shown in the table below (MSE between $\mathcal{\hat{C}}_{T+1}$ and $\mathcal{C}_{T+1}$):

| Objective Loss           | 2-Moons    | Elec2     | Shuttle    | Appliance  |
|:------------------------:|:----------:|:---------:|:----------:|:----------:|
| $\ell_1$-norm            | 0.0026     | 0.539     | 0.0370     | 0.0111     |
| $\ell_2$-norm            | 0.0025     | 0.533     | 0.0356     | 0.0104     |
| $\mathcal{L}_{CE}$       | 0.0029     | 0.564     | 0.0389     | 0.0126     |
| $\ell_1$ + $\ell_2$-norm | 0.0025     | 0.531     | 0.0351     | 0.0102     |
| All (Eq.(3))             | **0.0021** | **0.527** | **0.0341** | **0.0096** |

As shown in the above table, the error between the predicted $\mathcal{\hat{C}}_{T+1}$ and the ground truth $\mathcal{C}_{T+1}$ is minimal (as shown in Figure 10 in the Appendix).
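
To show how the three terms can be combined in practice, below is a minimal PyTorch-style sketch of an Eq.(3)-style objective; the term weights and the specific $[0, 1]$ normalization of the matrices are illustrative assumptions, not the exact choices of the paper.

```python
import torch
import torch.nn.functional as F

def predictor_loss(c_pred: torch.Tensor, c_true: torch.Tensor,
                   w_l1: float = 1.0, w_l2: float = 1.0, w_ce: float = 1.0) -> torch.Tensor:
    """Illustrative combination of l1, l2, and cross-entropy terms for H(.) (Eq.(3)-style).

    c_pred, c_true -- predicted and ground-truth correlation matrices for domain t.
    The weights w_* are placeholders; the paper's actual coefficients may differ.
    """
    l1 = torch.sum(torch.abs(c_pred - c_true))   # sparsity: zero out weak correlations
    l2 = torch.norm(c_pred - c_true)             # smoothness: penalize large deviations
    # Map correlations from [-1, 1] to [0, 1] so entries read as "probability of correlation"
    # (the mapping itself is an illustrative assumption).
    p_true = (c_true + 1.0) / 2.0
    p_pred = ((c_pred + 1.0) / 2.0).clamp(1e-6, 1 - 1e-6)
    ce = F.binary_cross_entropy(p_pred, p_true)  # distribution match (CE = KL + constant)
    return w_l1 * l1 + w_l2 * l2 + w_ce * ce
```
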
**[Q1-3]: Definitions of $z$ and $p(z)$ in Eq.(5).**

**[AQ1-3]:** $z$ and $p(z)$ belong to the standard ELBO loss in Eq.(5), where $z$ denotes the latent variables produced by the encoder of the VAE model and $p(z)$ is the prior distribution of $z$, modeled as a standard multivariate Gaussian (zero mean and unit variance). The Gaussian assumption is supported by theoretical foundations (e.g., its connection to the Central Limit Theorem) and provides computational convenience.

**[Q2]: As shown in Fig. (5), the performance with $\lambda_c = 0$ is similar to that of cases with $\lambda_c \neq 0$, suggesting that the proposed method based on feature correlation may have limited effectiveness.**

**[AQ2]:** In fact, **there is no $\lambda_c = 0$ result in Fig.(5)**; the left-most points refer to $\lambda_c = 0.1$. The performance with $\lambda_c = 0$ is shown in Sec. 4.4 (Ablation Study) and Table 2, and it is similar to the baseline "LastDomain" in Table 1. $\lambda_c = 0$ means the future data $\mathcal{D}_{T+1}$ is generated based only on the last domain $\mathcal{D}_T$ without any other information. As shown in Table 2, the effectiveness of integrating $\mathcal{\hat{C}}_{T+1}$ is significant. Note that ONP does not show concept drift, which has been shown by other work (as described in Sec. 4.2). For more detailed experimental results from Fig.(5), we provide the performance with $\lambda_c = [0.1, 0.3, 0.5, 0.7, 0.9]$ in the table below:

| Dataset   | $\lambda_c = 0.1$ | $\lambda_c = 0.3$ | $\lambda_c = 0.5$ | $\lambda_c = 0.7$ | $\lambda_c = 0.9$ |
|:---------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|
| Elec2     | 12.4              | 12.2              | 11.9              | 11.7              | 11.6              |
| ONP       | 37.2              | 37.2              | 37.3              | 37.4              | 37.6              |
| Appliance | 4.59              | 4.58              | 4.56              | 4.55              | 4.54              |

**[Q3]: It seems necessary to compare the computational complexity with that of other algorithms.**

**[AQ3]:** The computational complexity **has been discussed in Appendix H and G**. To improve the computational efficiency, we adopt a naive solution that encodes a high-dimensional dataset (such as Rot-MNIST) into a low-dimensional latent space and then incorporates it into CODA. The encoder structure is the same as the baseline method LSSAE (MNIST ConvNet). Three model architectures are trained, and the results show that CODA outperforms other baselines, as shown in **Table 4 and Appendix G**.

In addition, the efficiency of CODA is comparable with DRAIN on Elec2, **as shown in the table below and in Appendix H**, where "Total" means the training time including $H(\cdot)$, $G(\cdot)$, and the MLP. For further clarification, the training process of CODA is split into three sub-processes, i.e., learning the Correlation Predictor $H(\cdot)$, learning the Data Simulator $G(\cdot)$, and predictive model training. Each sub-process is a manageable sub-problem and takes less training time than an end-to-end process. We can surely further improve the efficiency in future work.

| Framework & Components           | Training Time (s) |
|:--------------------------------:|:-----------------:|
| **DRAIN**                        | **465.936**       |
| **CODA (Total)**                 | **447.817**       |
| Correlation Predictor $H(\cdot)$ | 142.110           |
| Data Simulator $G(\cdot)$        | 290.826           |
| Prediction Model (MLP)           | 14.880            |

**[Q4]: Is the scenario with "w/o $\mathcal{C}_{T+1}$" in the ablation study section different from the case with $\lambda_c = 0$ in Fig.(5)?**

**[AQ4]:** "w/o $\mathcal{C}_{T+1}$" is equivalent to $\lambda_c = 0$, but $\lambda_c = 0$ is not included in Fig.(5). The detailed discussion is in our [previous response for Q2]().

**[Q5]: In the legend of Fig. (5), the graphs for test and val are indistinguishable.**

**[AQ5]:** We thank the reviewer for pointing out the unclear legend of Fig.(5). We will update the legend to distinguish the dashed lines from the solid lines. In Fig.(5), the dashed lines refer to the validation sets, and the solid lines refer to the test sets.

**[Q6]: Could the proposed method be extended to various real-world image datasets?**

**[AQ6]:** **Yes**. We conducted experiments and have shown the results on image data **in Table 4 and Appendix G**. Specifically, for the high-dimensional data, we use the same encoder structure as the baseline method LSSAE (MNIST ConvNet) and train three architecture predictors. The results reveal that all three predictors trained on the dataset generated by CODA outperform other baselines.

---
