# WOODS things to change / correct
* [x] Line 1586: Remove the ***
* [x] Add sentence referencing related work section in the intro
* [x] Add section in the appendix discussing all of the supplementary section information
* [x] Add a sentence around the assumption of section 2 that whether this assumption is true or not is out of this scope OR reformulate it such to say that several algorithms use this assumption to provide guarantees while others do not
* [x] (Maybe) Add a sentence to Spurious Fourier or TCMNIST that states the intention of this dataset to create a clear case of failure of ERM
* [x] Add a sentence in conclusion explaining the intuition behind the "why" algorithms fail at synthetic datasets
* [x] Add 1-2 sentence explaining the evaluation process in experimental section
* [x] Add metrics in Table 2 and 4
* [x] Add to the contribution that datasets are a mix of synthetic and real-world
* [x] Add Transfer and CAD/CondCAD to adaptation of algorithms (main body)
* [ ] Add Transfer and CAD/CondCAD to adaptation of algorithms (Supp materials)
* [x] Add results of Transfer and CAD/CondCAD
* [x] Fix typos of page 7 (equation and Equation stuff)
* [x] Mention that CAD/CondCAD and Transfer are not applied to time series forecasting for the same reasons as other algos
* [x] Add hyperparameters for CAD/CondCAD and Transfer
* [ ] EVERYTHING related to new dataset
* [x] Pedestrian motivation
* [x] Pedestrian problem setting
* [x] Pedestrian data
* [x] Pedestrian Preprocessing
* [x] Pedestrian Domain information
* [x] Pedestrian Architecture choice
* [x] Pedestrian ID evaluation
* [x] Pedestrian benchmark results
* [x] Pedestrian credits
* [ ] (Optional) Pedestrian figure
* [ ] (Optional) Update Fig 1 with Pedestrian
* [x] Ethical consideration of new dataset in checklist
* [x] Change all spots where we say that we have 10 datasets
For later:
<!-- * <u>Intuition on the challenges that Train-domain validation</u>: The use of "prediction loss + penalty" (e.g., IRM, VREx, SD, and others) has been shown to not be an effective way of finding true invariant predictors [1,2] without signal from the test domains in datasets such as colored MNIST. There exist poor local minima/stationary parts of the loss that continue to rely on spurious features. (See the latter part of [1] and Proposition 5 of [2] for further intuition). At the same time, a small amount of signal from test domain can help these methods outperform ERM on CMNIST.
-->
# WOODS NeurIPS Reviews & Rebuttals
## Final response
We thank you for your efforts in the reviewing process. We are deeply appreciative of the initial feedback from the six reviewers, which allowed us to greatly improve our work during this discussion process (we added three new algorithms and a new dataset). By the end of the discussion period, we had only heard back from one of the six reviewers about our improvements to WOODS, which is quite disheartening.
We strongly believe that our work is an important stepping stone towards addressing distribution shift in time series, which is currently unexplored. We are the first to collect several relevant time series datasets with distribution shift and to adapt state-of-the-art algorithms developed for static datasets to time series. This greatly lowers the barrier to entry for research in this unexplored area.
## General response to all reviewers
Dear reviewers and area chairs,
We sincerely appreciate your time and effort taken to review our work. We are thrilled that reviewers share our vision of WOODS; reviewers have recognized the following contributions:
* **OOD generalization in time series**: OOD generalization in time series tasks is under-explored in the literature, and this work aims to fill this gap. (Reviewers: HVdg, JNLv, iA6U, ThN9)
* **Accessibility of research**: The codebase for evaluating algorithms creates a low entry barrier for researchers looking for solutions to the problem of OOD generalization in time series. (Reviewers: icR7, ThN9, iA6U)
* **Diversity of datasets**: The benchmark is diverse enough to cover many application areas and distribution shift scenarios. (Reviewers: JNLv, ThN9, HVdg)
* **Clarity of the paper**: The paper is clear and easy to follow (Reviewers: HVdg, icR7, iA6U, JNLv), although some minor adjustments are required; see the Accessibility to information paragraph below.
We are also grateful for the constructive feedback provided in your reviews. We take each of your recommendations very seriously. For the past two weeks, we have been working hard to provide improvements to WOODS, and we are excited to show you the results! We thank the reviewers for their patience over those two weeks.
Following are the main points we addressed in the revised manuscript version. Changes made to the manuscript are shown in blue. Some changes were added to the appendix due to space limitations.
* **Increasing the diversity of WOODS datasets**: Some of the reviewers pointed out that certain time series research areas are underrepresented in our benchmark, e.g., the forecasting field (Reviewers: iA6U, HVdg). To address this concern, we added a forecasting dataset with source domains, called **Pedestrian**. All information about this dataset can be found in Appendix C.11. We plan to continue adding more forecasting datasets in the future.
### Generalization gap (RMSE)
| Algorithm | ID | OOD | Average |
| :--- | :---: | :---: | :---: |
| ERM | 99 (3) | 204.09 (11) | 105 |
### Train-domain validation performance (RMSE)
| Algorithm | T22 | T25 | Average |
| :--- | :---: | :---: | :---: |
| ERM | 196.07 (10.21) | 212.11 (12.65) | 204.09 |
| VREx | 197.98 (7.31) | 205.19 (4.74) | 201.58 |
| GroupDRO | 243.53 (16.90) | 242.94 (9.17) | 243.23 |
| IB-ERM | 224.55 (15.43) | 201.60 (6.41) | 213.07 |
### Oracle train-domain validation performance (RMSE)
| Algorithm | T22 | T25 | Average |
| :--- | :---: | :---: | :---: |
| ERM | 226.78 (11.88) | 219.71 (2.25) | 223.24 |
| VREx | 203.52 (2.82) | 222.63 (4.39) | 213.07 |
| GroupDRO | 261.10 (12.11) | 223.04 (7.84) | 242.06 |
| IB-ERM | 201.43 (10.94) | 209.89 (11.65) | 205.66 |
* **Increasing the coverage of baselines**: Some of you mentioned that the WOODS baselines are missing recent advances in OOD generalization (Reviewers: JNLv, iA6U). This is a good point, as the field is advancing extremely fast. Thus, we added 3 baselines to the benchmark: CAD (2022) [1], CondCAD (2022) [1] and Transfer (2021) [2]. We believe these recent OOD generalization algorithms complement the current set of baselines of WOODS. Results are added to the manuscript and summarized below as well.
* Note: Forecasting dataset results are missing because adapting CAD, CondCAD, and Transfer is impossible without significant changes to the algorithms, due to the absence of a distinction between the featurizer and classifier networks in time series forecasting. Thus, an adaptation of these algorithms to time series forecasting is out of the scope of this work and left for future work.
* Note 2: The IEMOCAP experiments are still running and are thus missing from the updated manuscript.
### Synthetic challenge train-domain validation
| Algorithm | Spurious_Fourier | TCMNIST_Source | TCMNIST_Time | Average |
| :--- | :---: | :---: | :---: | :---: |
| CAD | 10.3 (0.4) | 9.8 (0.1) | 9.9 (0.1) | 10.0 |
| CondCAD | 10.3 (0.6) | 10.1 (0.1) | 9.9 (0.2) | 10.1 |
| Transfer | 9.5 (0.1) | 10.0 (0.2) | 9.8 (0.0) | 9.8 |
### Synthetic challenge test-domain validation
| Algorithm | Spurious_Fourier | TCMNIST_Source | TCMNIST_Time | Average |
| :--- | :---: | :---: | :---: | :---: |
| CAD | 20.4 (2.4) | 26.3 (3.3) | 27.2 (2.3) | 28.0 |
| CondCAD | 16.0 (1.6) | 20.6 (2.5) | 22.7 (2.9) | 20.1 |
| Transfer | 13.4 (2.8) | 18.3 (0.8) | 24.2 (5.3) | 21.6 |
### Real-world datasets train-domain validation
| Algorithm | CAP | SEDFx | PCL | LSA64 | HHAR | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| CAD | 62.2 (1.1) | 66.1 (0.5) | 64.6 (0.6) | 50.3 (2.2) | 85.0 (0.6) | 65.6 |
| CondCAD | 62.6 (0.5) | 66.1 (0.8) | 64.2 (0.3) | 53.4 (1.5) | 84.3 (0.8) | 66.1 |
| Transfer | 55.0 (1.3) | 61.0 (0.9) | 62.3 (0.2) | 47.3 (1.3) | 84.4 (0.5) | 62.0 |
### Real-world datasets oracle train-domain validation
| Algorithm | CAP | SEDFx | PCL | LSA64 | HHAR | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| CAD | 63.6 (0.8) | 67.5 (0.3) | 64.5 (0.4) | 57.8 (1.5) | 84.8 (0.3) | 67.7 |
| CondCAD | 63.3 (0.7) | 66.6 (0.6) | 63.6 (0.3) | 57.4 (1.3) | 84.7 (0.6) | 67.1 |
| Transfer | 57.7 (0.8) | 61.5 (0.7) | 61.9 (0.2) | 51.3 (1.6) | 85.1 (0.3) | 63.5 |
* **Accessibility to information**: We noticed that several reviewers pointed out missing information that was actually already included in the Supplementary material; while references to all of that information were included in the main text, perhaps they were not easy to find or notice (Reviewers: ThN9, sWph, JNLv). We think this is indeed a problem, because if reviewers cannot find this information, general readers will miss it too. Notably, the Related Works section was missed by reviewers because it was only mentioned in a footnote in the main text. To address this problem, we added Appendix A to the Supplementary material, where we provide brief overviews of the supplementary material sections. We also increased the visibility of relevant Appendix sections in the main body of the text. We thank the reviewers for pointing out this issue.
[1] Ruan, Y., Dubois, Y., & Maddison, C. J. (2021). Optimal representations for covariate shift. arXiv preprint arXiv:2201.00057.
[2] Zhang, G., Zhao, H., Yu, Y., & Poupart, P. (2021). Quantifying and improving transferability in domain generalization. Advances in Neural Information Processing Systems, 34, 10957-10970.
## Reviewer sWph (Score=4, Conf=4)
We thank the reviewer for their time and effort in reviewing our work. We address all of your concerns in order in the following paragraphs.
> The authors state "It is commonly assumed that the relationship between the label and some subset of features (potentially a nonlinear transform of the observation [95, 2]) is invariant across all domains." I believe that is not the case in all of the time series application domains (e.g., cloud's ML jobs and their load prediction)
Our intention when stating the assumption in line 86 was not to take a position on its correctness. Instead, we meant to point out that some algorithms use this assumption in their formulation to provide guarantees of OOD generalization (e.g., IRM, CORAL), while others do not (e.g., SD, GroupDRO). In this work, we simply adapt these methods to time series and evaluate them. Thus, we considered commenting on the underlying assumption of those algorithms somewhat out of the scope of this work. However, we realize now that the message conveyed can be confusing for readers, so we made the necessary changes to clarify it. We thank the reviewer for pointing this out.
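For reference, the assumption we paraphrase in line 86 is commonly formalized as follows (one standard formulation from the invariant-prediction literature; the notation below is ours, not the paper's):

```latex
% Invariance assumption (one common formalization; notation ours):
% there exists a representation \Phi such that the conditional label
% distribution given \Phi(X) is the same in every domain e.
\[
  P^{e}\!\left(Y \mid \Phi(X)\right) \;=\; P^{e'}\!\left(Y \mid \Phi(X)\right)
  \qquad \forall\, e, e' \in \mathcal{E}.
\]
```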
> It seems to me that the domains are limited for the synthetic datasets, in particular the first and second datasets. Why did you choose these three correlations (10%, 80%, and 90%)? I would suggest justifying and motivating such selections.
Colored MNIST, from the seminal work of IRM [1], and its variants [2] are now a commonly used dataset formulation in OOD generalization. Colored MNIST (CMNIST) demonstrated the failure mode of ERM under distribution shift in the image domain. This was accomplished by creating training domains with strongly predictive spurious features and weakly predictive invariant features. The spurious correlation is flipped at test time, while the invariant correlation is kept the same. The correlation flip makes it clear which features (spurious or invariant) the model relies on to make predictions. We added this explanation to the manuscript to clarify the dataset's motivation and goal for the reader. The intended failure mechanism behind the domain formulation can also be found in Lines 135-139.
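For concreteness, a minimal sketch of that construction (illustrative function names and default noise level; not the exact WOODS implementation):

```python
import numpy as np

def make_domain(digits, spurious_corr, label_noise=0.25, seed=0):
    """Build one CMNIST-style domain.

    The digit is the weakly predictive invariant feature (label = digit
    class with `label_noise` flips); the color is the strongly predictive
    spurious feature (agrees with the label with prob. `spurious_corr`).
    """
    rng = np.random.default_rng(seed)
    digits = np.asarray(digits)
    labels = np.where(rng.random(digits.shape) < label_noise, 1 - digits, digits)
    colors = np.where(rng.random(labels.shape) < spurious_corr, labels, 1 - labels)
    return labels, colors

# Training domains: color is 90%/80% predictive; test domain: flipped to 10%.
digits = np.random.default_rng(1).integers(0, 2, size=1000)  # binarized digits
train_a = make_domain(digits, spurious_corr=0.9)
train_b = make_domain(digits, spurious_corr=0.8)
test = make_domain(digits, spurious_corr=0.1)
```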
> Time series aspect is not clear in some datasets (few of them are binary classification problems not time series forecasting or classification). Can you clarify that better in the paper?
We are not sure we understood the intended meaning of your feedback on clarifying the datasets' nature. We understand that the time series aspect of the TCMNIST datasets (-Source and -Time) is not clear because the task is the binary classification of frames. The TCMNIST task differs from static classification because labels require information from the prior frame in the video (the label is the parity of the sum of the current and previous digit). More generally, all tasks in WOODS require the model to use information from previous steps at prediction time. Although some may consider videos not to be typical time series but sequential in nature, we do not think this distinction is relevant in this benchmark. Please let us know if this answers your feedback.
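To make the temporal dependency explicit, a small sketch of the label rule we describe above (variable names are illustrative):

```python
import numpy as np

def tcmnist_labels(digit_sequence):
    """Label frame t by the parity of digit[t] + digit[t-1].
    The first frame has no predecessor, so prediction starts at t=1."""
    d = np.asarray(digit_sequence)
    return (d[1:] + d[:-1]) % 2  # 0 = even sum, 1 = odd sum

print(tcmnist_labels([3, 1, 4, 1, 5]))  # [0 1 1 0]
```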
> Main properties and statistics of each dataset are missing in the main paper (e.g., number of samples, main properties, etc.)
We would happily add it to the main text in the final version. However, due to current space limitations, we leave information such as the number of samples, data acquisition, characteristics, and preprocessing for all datasets in Appendixes C.X.1.
> The intuition on why "Algorithms fail on synthetic challenge dataset with train-domain validation" is missing?
The intuition behind this statement is twofold:
* <u>Intuition on the statement itself</u>: We state that the OOD algorithms under investigation fail on the synthetic challenge datasets with train-domain validation because they learn to rely on the spuriously correlated color to make predictions. This can be inferred from the model's 10% accuracy in the test domain, where the correlation between the spurious features and the label is 10% (it was 85% in training). This 10% performance, far below the random-chance performance of 50% accuracy, indicates that the model is using spurious features to make predictions and thus fails the challenge of generalizing OOD with the invariant features.
* <u>Intuition on the difference between train-domain and test-domain validation</u>: If IRM and VREx can reasonably solve the synthetic challenges under test-domain validation, why is this not the case under train-domain validation? Train-domain validation is faithful to the OOD generalization objective in that models remain completely agnostic of the test domain until test time. While rigorous, it is often biased towards the failure of algorithms, because easy spurious features (e.g., the color of the digit) picked up early in training saturate the validation set used to choose the model. Test-domain validation sidesteps this early-training overconfidence by selecting models on test-domain performance instead (a schematic sketch of the two protocols follows below). This distinction is why some algorithms can perform reasonably well on the synthetic challenge datasets with test-domain validation but not with train-domain validation.
We added the first part of the intuition in the manuscript, as it is the most essential for readers. If you think the second part of the intuition is also hard to infer from the paper, we can also add it to the experiments section.
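To illustrate the difference between the two protocols, here is a schematic sketch of model selection under each (the record layout is hypothetical, not the WOODS implementation):

```python
def select_checkpoint(history, protocol):
    """Pick the checkpoint to report from a training run.

    history: list of per-checkpoint records, e.g.
      {"train_val_acc": 0.85, "test_val_acc": 0.10, "step": 500}.

    Train-domain validation stays agnostic of the test domain, so a
    spurious feature learned early can saturate `train_val_acc` and
    dominate the choice; test-domain validation selects on held-out
    test-domain performance instead.
    """
    key = {"train": "train_val_acc", "test": "test_val_acc"}[protocol]
    return max(history, key=lambda record: record[key])
```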
> Explaining clearly that the datasets have both synthetic and real is important in the set of contributions and the rest of the paper.
This has been added to the paper. Thank you for pointing this out.
> What are the metrics used in Tables 2 and 4. Are they the loss defined above? It seems confusing and not explained. Table 3 is much better.
The metric in Tables 2 and 4 is accuracy. However, we failed to add this information to the tables; thank you for pointing this out. We apologize for the confusion, which we have promptly resolved.
> Having ten datasets is a thin benchmark for time-series generalization problems (see the Kaggle time-series challenges with hundreds of datasets).
Compared to similar works that propose benchmarks for OOD generalization, we surpass the number of datasets in DomainBed (seven datasets) and WILDS (ten datasets) with our eleven datasets. Moreover, it is important to remember that our work mainly focuses on curating datasets with various distribution shifts across tasks and domains while presenting them under the umbrella of a single repository. The many Kaggle competitions for time series are typically in the standard i.i.d. or forecasting setting with many repetitive shifts. Searching for datasets with significant distribution shifts and preprocessing them is a long process that requires careful analysis. Thus, it is essential to distinguish between collections of i.i.d. datasets and the WOODS datasets in that regard. We think a benchmark of ten (now eleven) datasets is enough to constitute good coverage of tasks and modalities.
> Missing Related Work and how the current work is different except few sentences.
The full related works section is in Appendix B of the Supplementary material. However, it was only referred to in the main text with a footnote, which failed to catch your and other reviewers' attention. We will add a sentence referring to Appendix B in the main body to make the related work section more apparent. Thank you for pointing this out.
> Last two sentences in the paper are not clear. Is that related to Ethical concerns.
The last two sentences of the conclusion address the checklist question: "Did you discuss any potential negative societal impacts of your work?".
> Several Typos in page 7 (after Equation (1) and Equation (2)), please fix them. For example, mixing usage of "equation" and "Equation"
Typos have been fixed.
[1] Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
[2] Ahuja, K., Wang, J., Dhurandhar, A., Shanmugam, K., & Varshney, K. R. (2020). Empirical or invariant risk minimization? a sample complexity perspective. arXiv preprint arXiv:2010.16412.
## Reviewer ThN9 (Score=5, Conf=3)
We thank the reviewer for their time and effort in reviewing our work. We address all of your concerns in order in the following paragraphs.
> The contribution is not well mentioned. The significance of contribution is not clear.
The contribution of this work is the curation, preprocessing, and formulation of ten (now eleven) time series datasets with considerable distribution shifts between domains. This contribution is a significant step towards addressing the failure mode of empirical risk minimizers on time series with distribution shifts. The size of the contribution is aligned with other seminal works in the field, e.g., DomainBed (seven datasets) and WILDS (ten datasets). On top of the formulation of the datasets, we implement an evaluation framework for easy reproducibility and evaluation of new algorithms, and we adapt common existing OOD generalization algorithms for benchmarking. Our framework is, to our knowledge, the only accessible testbed for investigating solutions to OOD generalization in time series. We hope WOODS will help address distribution shift-induced failures and lead to the safe deployment of deep learning models in the real world.
> There is not enough information about evaluation methods and how to use them for users.
We are sorry that you did not find the documentation helpful enough. We take this comment very seriously because being accessible to users is among our top priorities. We want to mention, in case you missed them, the following key documentation pages:
* How to [download](https://woods.readthedocs.io/en/latest/downloading_datasets.html) a dataset.
* How to [run a sweep](https://woods.readthedocs.io/en/latest/running_a_sweep.html).
* How to add an [algorithm](https://woods.readthedocs.io/en/latest/adding_an_algorithm.html).
* How to add a [dataset](https://woods.readthedocs.io/en/latest/adding_a_dataset.html).
However, these pages are a bit outdated at this point, since the repository has changed a lot in the past few months and we have not yet taken the time to update them. We are making it a priority for the coming week to make these pages more accessible and straightforward for users. We will keep you informed of our progress. Thank you very much for pointing out this flaw in our work.
> There is no ethical consideration.
Ethical considerations for all WOODS datasets are discussed in Question 4 of the NeurIPS checklist (Line 771), while credits and licenses are stated in Appendixes C.X.3. Note that several reviewers have stated that the ethics issues were addressed and well discussed. Could you point to the missing information so that we can resolve this issue? We thank you for raising this point and for the thoroughness of your review.
> The paper is not coherent enough and not easy to read.
We are sorry that you found our paper hard to read. To improve the experience of future readers, we would be very grateful if you could point to the parts of the paper that caused confusion or were not coherent, so that we can improve the reading experience.
> The advantages of using this framework over existing frameworks are not clearly mentioned.
Various codebases exist for evaluating OOD generalization datasets and algorithms (e.g., DomainBed and OODBench). However, these codebases are specifically focused on images and other static modalities of data. Time series tasks (e.g., multiple-prediction classification or forecasting) differ enough from static tasks to require a separate repository. Examples of the most significant differences are <u>dataset tasks</u> that include multiple predictions per sample, which are very hard to adapt in DomainBed, and <u>Time-Domain</u> datasets, which have an entirely disjoint dataset structure from the one seen in DomainBed. WOODS provides the flexibility for many common time series tasks and domain formulations out of the box.
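To illustrate the two environment constructions mentioned above, a minimal sketch (hypothetical function names; the WOODS code is more involved):

```python
import numpy as np

def source_domain_split(sequences, domain_ids, test_domain):
    """Source-domains: each whole sequence belongs to a discrete domain
    (e.g. a subject or a recording site); one domain is held out entirely."""
    train = [s for s, d in zip(sequences, domain_ids) if d != test_domain]
    test = [s for s, d in zip(sequences, domain_ids) if d == test_domain]
    return train, test

def time_domain_split(sequence, boundaries):
    """Time-domains: one long sequence whose distribution drifts over time;
    domains are contiguous intervals of that same sequence, a structure
    with no counterpart in DomainBed's per-image datasets."""
    return np.split(np.asarray(sequence), boundaries)

# e.g. split a 100-step series into domains [0:40], [40:70], [70:100]
domains = time_domain_split(np.arange(100), boundaries=[40, 70])
```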
> There are recent algorithms for OOD performance improvement which is not considered in the experiments.
We agree that, for this benchmark to be complete, we need to include the most recent advances in domain generalization. To this end, we added 3 baselines to the benchmark: CAD (2022) [1], CondCAD (2022) [1] and Transfer (2021) [2]. We believe these recent OOD generalization algorithms complement well the current set of baselines of WOODS. See the general response for more details. The results are added to the paper.
> Although easy evaluation is clearly mentioned as one of the contributions of this paper, there is no clear definition of how the evaluation process works.
We added two sentences in the main body briefly explaining the evaluation process in the experimental section. However, we refrain from going into a deeper explanation in the main body, as our framework follows the DomainBed [3] evaluation method, which has become the gold standard for evaluating OOD generalization algorithms. If further details are needed, Appendix F explains how the reported performances and estimated standard deviations were obtained.
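For readers unfamiliar with it, the DomainBed-style protocol can be summarized by the following sketch (`train_and_eval` is a caller-supplied stand-in for a full training run; the trial counts and hyperparameter range are illustrative, not our exact settings):

```python
import random

def sweep(train_and_eval, algorithms, datasets, n_trials=20, n_seeds=3):
    """DomainBed-style evaluation: per (algorithm, dataset) pair, run a
    random hyperparameter search, repeat each trial over seeds, then
    report the test accuracy of the run picked by the validation
    criterion (here: held-out train-domain accuracy).

    `train_and_eval(algo, dataset, lr, seed)` must return a dict with
    "val_acc" and "test_acc".
    """
    results = {}
    for algo in algorithms:
        for dataset in datasets:
            runs = []
            for _ in range(n_trials):
                lr = 10 ** random.uniform(-5, -2)  # sampled, not grid-searched
                for seed in range(n_seeds):
                    runs.append(train_and_eval(algo, dataset, lr, seed))
            best = max(runs, key=lambda run: run["val_acc"])  # model selection
            results[(algo, dataset)] = best["test_acc"]
    return results
```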
[1] Ruan, Y., Dubois, Y., & Maddison, C. J. (2021). Optimal representations for covariate shift. arXiv preprint arXiv:2201.00057.
[2] Zhang, G., Zhao, H., Yu, Y., & Poupart, P. (2021). Quantifying and improving transferability in domain generalization. Advances in Neural Information Processing Systems, 34, 10957-10970.
[3] Gulrajani, I., & Lopez-Paz, D. (2020). In search of lost domain generalization. arXiv preprint arXiv:2007.01434.
## Reviewer iA6U (Score=5, Conf=5)
We thank the reviewer for their time and effort reviewing our work. We address all of your concerns in the following paragraphs.
> The datasets are diverse, but the tasks are not. In fact, there are 10 datasets, only 1 is for forecasting. I think that time series domain is really challenging and forecasting problem is way challenging than classification. Therefore, you should focus more on the forecasting work than classification. Currently, the tasks are not representative.
We entirely agree that forecasting is underrepresented in WOODS compared to its importance in machine learning. This is why we added the **Pedestrian** dataset to WOODS. The Pedestrian dataset is a source-domain dataset where the task is to forecast the number of pedestrians at different locations within the city of Melbourne, Australia. All the details can be found in the general response. We thank the reviewer for pointing out this gap in WOODS. While forecasting datasets are still a minority in WOODS, we hope to continue addressing this problem in the coming months so that we can achieve fair representation.
> The implemented algorithms are quite general, which are not specific to time series domains. I also think that OOD algorithm + time series is not enough to claim an OOD benchmark (since the main work is on data preprocessing and network design, not the algorithm). To this end, I think the conclusions are not that evident since you do not try the time-series related algorithms, such as AdaRNN [1], [2], [3].
Thank you for pointing us to missing baselines in our work. While AdaRNN has its place among the WOODS baselines, it is not a perfect fit for our work, for two reasons. <u>First</u>, the algorithm is aimed explicitly at forecasting datasets with temporal distribution shift, and only one WOODS dataset has this characteristic, which limits the impact of this algorithm in our work. <u>Second</u>, AdaRNN automatically finds different time intervals in the time series to perform its optimization; however, the WOODS datasets were curated with their domains predefined. Because of this, an adequate adaptation of AdaRNN would require expertise we do not possess and do not think we can acquire within this discussion period. For those two reasons, we instead decided to expand the WOODS baselines with recent algorithms that can be evaluated on more datasets and that we have expertise in, i.e., CAD [4], CondCAD [4], and Transfer [5]. All details about those algorithms can be found in the general comment.
As WOODS is under continual expansion, we will keep this in mind going forward and hope to add AdaRNN as a WOODS baseline soon. In the meantime, we added [1,2,3] to the related works section.
> Why time series is harder than traditional image classification? I fail to find the support in the experiments.
We do not think we have explicitly claimed that time series tasks are intrinsically harder than traditional image classification. If we implicitly gave this impression to the reader, please let us know where, so that we can address this confusion.
That said, we do think that OOD generalization in time series presents unique challenges that _might_ be pragmatically harder to address than in static settings. For instance, recent advances in OOD generalization using large pretrained models such as CLIP might be harder to achieve in time series, because large pretrained time series models are few in number and often narrow in scope. Additionally, the possibility of the data distribution shifting through time is unique to time series, with no analog in static tasks.
[1] Du Y, Wang J, Feng W, et al. Adarnn: Adaptive learning and forecasting of time series[C]//Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021: 402-411.
[2] Wu Z, Liu N, Li G, et al. A Variational Bayesian Approach for Fast Adaptive Air Pollution Prediction[C]//2021 IEEE International Conference on Big Data (Big Data). IEEE, 2021: 1748-1756.
[3] Ye R, Dai Q. A relationship-aligned transfer learning algorithm for time series forecasting[J]. Information Sciences, 2022, 593: 17-34.
[4] Ruan, Y., Dubois, Y., & Maddison, C. J. (2021). Optimal representations for covariate shift. arXiv preprint arXiv:2201.00057.
[5] Zhang, G., Zhao, H., Yu, Y., & Poupart, P. (2021). Quantifying and improving transferability in domain generalization. Advances in Neural Information Processing Systems, 34, 10957-10970.
### Reviewer iA6U - Round 2
### About diversity of tasks
We do not think it is fair to say that we did not improve the diversity of tasks in WOODS, because we added a forecasting dataset during this review process: the **Pedestrian** dataset. Please take a look at the general response for details about this new dataset. We believe that with these 11 datasets, we cover several interesting modalities, and that this suffices to be a contribution to this track. Compared to similar works that propose benchmarks for OOD generalization, we surpass the number of datasets in DomainBed (seven datasets) and WILDS (ten datasets) with our 11 datasets.
### About algorithm selection
Although the core algorithms are available in DomainBed and Deepdg, we would like to correct the impression that open-sourced implementations of OOD algorithms can easily be reused for time series domain generalization; this is not the case. Implementing these algorithms for time series is a challenging and time-consuming task because the construction of environments in time series can take two different forms (i.e., Source-domains and Time-domains). The adaptation of algorithms also needs to account for the environment construction and the nature of the time series task (e.g., forecasting, multiple predictions).
## Reviewer JNLv (Score=6, Conf=5)
We thank the reviewer for their time and effort in reviewing our work. We address all of your concerns in order in the following paragraphs.
>The DG baselines are not enough, more recent DG baselines should be considered and included for comparison and discussion.
We agree that, for this benchmark to be complete, we need to include the most recent advances in domain generalization. To this end, we added 3 baselines to the benchmark: CAD (2022) [1], CondCAD (2022) [1] and Transfer (2021) [2]. We believe these recent OOD generalization algorithms complement WOODS's current set of baselines well. Results are added to the paper.
> The literature survey in the field of DG is lacking.
The full related works section is in Appendix B of the Supplementary material. However, it was only referred to in the main text with a footnote, which failed to catch your and other reviewers' attention. We will add a sentence referring to Appendix B in the main body to make the related work section more apparent. Thank you for pointing this out.
[1] Ruan, Y., Dubois, Y., & Maddison, C. J. (2021). Optimal representations for covariate shift. arXiv preprint arXiv:2201.00057.
[2] Zhang, G., Zhao, H., Yu, Y., & Poupart, P. (2021). Quantifying and improving transferability in domain generalization. Advances in Neural Information Processing Systems, 34, 10957-10970.
## Reviewer HVdg (Score=7, Conf=4)
We thank the reviewer for their time and effort in reviewing our work. We address all of your concerns in order in the following paragraphs.
>Although this paper is focused on OOD generation time series tasks, some of the real datasets collected do not naturally belong to the typical time series field. For instance, dataset LSA64 is for sign language video classification across speakers. It can be considered as sequential data but not typical time series data. A similar situation is with the synthetic challenge dataset – TCMNIST. It is more sequential than temporal data as weak relations are among pictures in different time steps.
Although some may consider videos not to be typical time series but sequential in nature, we do not think this distinction is relevant within the scope of this work. Instead, we think the important criteria are that, **first**, datasets are an ordered collection of observations in time and, **second**, tasks require the model to use information from previous steps at prediction time. The LSA64 and IEMOCAP datasets follow those criteria. The TCMNIST datasets might be fringe cases within those two criteria; however, they serve the purpose of investigating carefully crafted domain configurations. Thus, we think they have their place in our benchmark.
>The setting of the Spurious-Fourier benchmark is not convincing enough. The task is set as performing binary classification based on the frequency characteristics. But as is shown in Figure 3, it is hard to distinguish two signals in the time domain even for human intelligence. So, is it meaningful to handle such task in the time domain? Why not transform time signals to the frequency domain by FFT or other time-frequency transform methods?
The Spurious-Fourier task might not be convincing at first glance; however, distinguishing between different frequency bands is a prevalent underlying task in many machine learning applications to time series. One instance of this is sleep stage classification from EEG measurements, where models need to identify operating frequencies of the brain. Of course, we could easily transform this dataset and classify the FFT of the signals, but we think that would sidestep the challenges encountered in more complex real-world datasets.
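As a toy illustration of the task's premise, here are two classes that differ only by the location of a frequency peak (illustrative frequencies and noise level, not the dataset's actual parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000, endpoint=False)  # 1 s sampled at 1 kHz

def make_signal(peak_hz):
    """The two classes differ only in the frequency of one sinusoid, so
    they are near-indistinguishable in the time domain under noise but
    separable in the frequency domain."""
    return np.sin(2 * np.pi * peak_hz * t) + 0.5 * rng.standard_normal(t.shape)

class_0 = make_signal(peak_hz=10)  # low-frequency peak
class_1 = make_signal(peak_hz=30)  # higher-frequency peak
# The peak is trivially recoverable via an FFT (bin index = Hz for a 1 s window):
print(np.abs(np.fft.rfft(class_1)).argmax())  # ~30
```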
>Definitions of domains 10%, 80%, 90%, and so on in synthetic challenge datasets should be given. It is confusing.
We are sorry that you found our explanation of the domains confusing. We rephrased the explanation of the domains for the Spurious-Fourier and TCMNIST datasets (See lines 129-134 and 150-154 in the revised manuscript). Please let us know if the new explanation makes it clearer for the reader. We are open to iterating on this until we achieve clarity. Thank you for pointing this out.
>Although both source domains shift and time domains are discussed in Section 2, most of the datasets contain different source domains, and few of them contain different time domains. Strictly speaking, only the AusElec dataset considers distribution shifts in the time domain. More datasets with time domain shifts should be given.
The IEMOCAP dataset is also a Time-domain dataset, as the distribution shifts within one dyadic conversation. Still, we agree with your point that the representation of Time-domain datasets in WOODS is low. This is mainly because Time-domain datasets are harder to find in the literature, as they have previously been underexplored in the field. We plan to continue searching for Time-domain datasets and hope to increase their representation in WOODS.
>Although existing OOD generalization algorithms are adapted and evaluated in ten new benchmarks, little discussion of metrics is given. Are there any special properties needed to be considered in the OOD time series task? If not, why can the setting in the vision or NLP field be followed?
We are having difficulty understanding what you mean by metrics. We assume you are referring to the metrics used to evaluate the performance of models on the datasets. We failed to mention the metrics used for multiple datasets in the original manuscript; this has been fixed. If we misunderstood what you meant by metrics, please let us know so we can address your concern correctly.
>The results in real-world datasets and synthetic challenge datasets of the same algorithm are quite different (for instance, in train domain validation, the results are about 60 and 10, respectively). Is the task in synthetic data more difficult? Discussions are needed.
As their name suggests, the synthetic challenge datasets are carefully crafted to be challenging for empirical risk minimizers. This was done by following the recipe of Colored MNIST from the seminal work of IRM [1], which targeted a clear failure mode under distribution shift. CMNIST created training domains with strongly predictive spurious features and weakly predictive invariant features. The spurious feature correlation is flipped at test time, while the invariant correlation is kept the same. The correlation flip makes models relying on spurious features fail completely in the test domain, to the point that they obtain accuracies far below the random prediction accuracy (spurious feature correlation in the test domain ≈ 10%, random prediction accuracy = 50%). This failure makes it clear which features the models rely on, and thus the success or failure of an algorithm is very clear. We agree that some clarification is needed in the main text; we added lines 334-337 to clarify this point.
## Reviewer icR7 (Score=7, Conf=3)
We thank the reviewer for their time and effort in reviewing our work. We address all of your concerns in order in the following paragraphs.
> computational power needed to contribute is significant
We do not think that computational power will be a barrier to using the WOODS datasets, because the required compute varies with how many datasets one chooses to evaluate their method on. Additionally, the WOODS datasets span a nice distribution of computational weights, so one can strategically choose a set of datasets that fits their compute resources:
* <u>Very light</u>: Spurious_Fourier, TCMNIST_Source, TCMNIST_Time (<1% of total compute used)
* <u>Light</u>: PCL, HHAR (~15% of total compute used)
* <u>Medium</u>: CAP, SEDFx, AusElec (~25% of total compute used)
* <u>Heavy</u>: LSA64, IEMOCAP (~60% of total compute used)
> the novelty is somewhat limited as it is mainly an aggregation of existing datasets
Our work mainly focuses on the failure of empirical risk minimizers that arises under distribution shift – shift due to source or shift due to time. Searching for such datasets and preprocessing them is a long process that requires careful analysis. It is essential to distinguish between the contribution of an aggregation of i.i.d. datasets and that of the WOODS datasets. Additionally, the synthetic challenge datasets in WOODS are novel.
> the convenience to use is hampered by the need of many assumed authorizations to use the datasets of the benchmarks
Nine out of ten WOODS datasets are completely open access and are accessible through the WOODS framework with a simple [command](https://woods.readthedocs.io/en/latest/downloading_datasets.html). The only dataset requiring credentialed access is the IEMOCAP dataset. We think this is a reasonable proportion of openly accessible datasets.
> it is not clear that there is a need for a new benchmark (in terms of code especially) as the evaluation part is very similar to DomainBed. [...] As architectures are relatively similar with Vision benchmarks can you clarify why you did not simply add datasets and losses to DomainBed ?
Developing a framework for evaluating OOD generalization algorithms on the numerous time series tasks is an arduous process that took us multiple months. Time series tasks (e.g., multiple-prediction classification or forecasting) differ enough from static tasks to require a separate repository. Examples of the most significant differences are <u>dataset tasks</u> that include multiple predictions per sample, which are very hard to adapt in DomainBed, and <u>Time-Domain</u> datasets, which have an entirely disjoint dataset structure from the one seen in DomainBed. Although adapting DomainBed to time series might be a slightly shorter endeavor, it remains a time-consuming task that can be too high an investment for many researchers looking for solutions in this field. WOODS provides the flexibility for many common time series tasks and domain formulations out of the box.