# PEM KDD Rebuttal
## Common Questions
We thank all the reviewers for their valuable feedback. Here, we will address the common questions asked by all the reviewers.
**1. Missing citations in lines 215, 216.**
We thank the reviewers for identifying missing citations in two lines. We will update the manuscript by adding the following references: [1] for DeepAR model, [2] for Deep Markov models, and [3,4] for Deep State models. We checked and found no other missing citations.
[1] Salinas, David, et al. “DeepAR: Probabilistic forecasting with autoregressive recurrent networks.” International Journal of Forecasting 36.3 (2020).
[2] Krishnan, Rahul, Uri Shalit, and David Sontag. “Structured inference networks for nonlinear state space models.” Proceedings of the AAAI Conference on Artificial Intelligence 31.1 (2017).
[3] Li, Longyuan, et al. “Learning Interpretable Deep State Space Model for Probabilistic Time Series Forecasting.” International Joint Conference on Artificial Intelligence (2019).
[4] Gu, Albert, Karan Goel, and Christopher Ré. “Efficiently Modeling Long Sequences with Structured State Spaces.” International Conference on Learning Representations (2022).
**2. More details on training time and batch size and memory.**
We found that pre-training for up to 5000 epochs on all SSL tasks simultaneously was sufficient, as longer pre-training did not significantly improve SSL-related losses or downstream performance. During training, we set 5000 epochs as the maximum, but most downstream tasks converged and met the early stopping criterion within 1500-2500 epochs. Since the datasets for most tasks fit into GPU memory, we set the batch size equal to the number of training data points. On average, pre-training took around 8 hours and fine-tuning took 20-150 minutes per task (using the Nvidia Tesla V100 GPU mentioned in the paper). We observed similar training times for most top baselines, which we also trained with early stopping. Regarding memory requirements, none of the datasets required more than 8 GB of VRAM. Similarly, during pre-training, since we randomly choose a pre-train dataset for each batch (line 521), the memory requirements range from 4 to 8 GB of VRAM depending on dataset size.
We will include these additional details in the final version.
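For concreteness, below is a minimal sketch of the training setup described above (full-dataset batches, a randomly chosen pre-train dataset per step, and patience-based early stopping). The function names, the SSL-loss interface, and the patience value are illustrative assumptions, not the paper's actual code.

```python
import random
import torch

MAX_EPOCHS = 5000   # maximum number of pre-training epochs, as described above
PATIENCE = 100      # hypothetical early-stopping patience (illustrative value)

def pretrain(model: torch.nn.Module, ssl_loss, pretrain_datasets, lr=1e-3):
    """pretrain_datasets: list of tensors, one per pre-train time-series dataset.
    ssl_loss(model, batch) is a placeholder for the combined loss over all SSL tasks."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best, wait = float("inf"), 0
    for epoch in range(MAX_EPOCHS):
        # One randomly chosen pre-train dataset per batch; the whole dataset is one batch.
        batch = random.choice(pretrain_datasets)
        loss = ssl_loss(model, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Simple patience-based early stopping on the SSL loss.
        if loss.item() < best - 1e-4:
            best, wait = loss.item(), 0
        else:
            wait += 1
            if wait >= PATIENCE:
                break
```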
## Reviewer RU49
We thank the reviewer for the valuable feedback and for recognizing the significance of our work.
We will address the questions and concerns as follows.
**The training details are relatively rough. For the pre-training studies, the training costs, e.g., training time in term of GPU, are useful information for other researchers but not included in the paper.**
We would like to clarify that we have provided specific hyperparameters for both pre-training and fine-tuning in our manuscript, as indicated in lines 646-655. Furthermore, we have conducted ablation studies and hyperparameter sensitivity analysis to investigate the impact of important self-supervised learning (SSL) and architecture hyperparameters in Section 7, with the results presented in Tables 5, 6 and 7.
In addition, we provide further details on training time, pre-training time, and memory usage in the *common response* (Point 2), which we will add to the final version. In summary, pre-training on all the datasets took about 8 hours of GPU time, and fine-tuning took 20-150 minutes per task depending on the size of the training dataset.
**The pre-training usually enable the few-shot or learning efficiency on the downstream tasks. The related experiments are missing in the paper.**
While the use of pre-trained large language models for few-shot or zero-shot learning has been successfully demonstrated in many natural language processing (NLP) tasks, extending this approach to heterogeneous multi-domain time-series datasets and tasks remains an important and open research problem not addressed by any previous work. Our work focuses on the impact of pre-trained model weights on more performant training of downstream tasks, and it would enable future research on leveraging our methods for problems such as few-shot learning in time-series.
In terms of learning efficiency, as we discussed in response to the previous question, our proposed probabilistic epidemic models (PEMs) require similar training time to other baselines while significantly outperforming them in terms of downstream performance. This indicates that our approach not only achieves better results by effectively leveraging pre-train datasets but also does so in a computationally efficient manner.
**Will the pre-trained checkpoints be publicly available?**
Yes, we will publicly release the model weights along with the implementation code on publication of the paper.
## Reviewer 2RSK
We thank the reviewer for their valuable comments and questions, and for recognizing the significance of our work in designing pre-training methods for time-series.
We will address the reviewer's questions and concerns as follows:
**The motivation is not clear enough. The authors argue that there are two challenges in time-series domain pretraining, but they don't explain clearly how their framework can solve these problems.**
As explained in lines 134-156, the two main challenges for applying pre-training frameworks on multiple time-series datasets are the higher heterogeneity of time-series data compared to images and the smaller datasets available for pre-training. As a result, general time-series SSL tasks like random masking may not effectively capture the important properties from a large number of small, heterogeneous datasets with varying patterns such as seasonality, periodicity, noise, etc.
To address these challenges, we specifically designed SSL tasks (Section 4.3) to enable the model to efficiently extract useful epidemic dynamics information such as identifying peaks and their dynamics (PEAKMASK), learning to forecast future values (LASTMASK), and detecting seasonal information (SEASONDETECT). These tasks effectively learn useful epidemiologically relevant patterns from all the heterogeneous epidemic time-series datasets, which can be leveraged for improved predictive performance in multiple downstream tasks. We will stress this point in the introduction and Section 3.2.
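To make the task design concrete, here is a minimal sketch of how masks of this kind could be constructed for a univariate series. The exact window sizes, masking probabilities, and labels used in the paper may differ; this is only an illustration of the idea.

```python
import numpy as np

def lastmask(series: np.ndarray, horizon: int = 4) -> np.ndarray:
    """LASTMASK-style mask: hide the last `horizon` steps so the model
    learns to forecast future values. True = visible, False = masked."""
    mask = np.ones_like(series, dtype=bool)
    mask[-horizon:] = False
    return mask

def peakmask(series: np.ndarray, window: int = 3) -> np.ndarray:
    """PEAKMASK-style mask: hide a window around the peak so the model
    learns peak location and dynamics."""
    peak = int(np.argmax(series))
    mask = np.ones_like(series, dtype=bool)
    mask[max(0, peak - window): peak + window + 1] = False
    return mask
```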
**The experiment results in Tables 1 and 2 are not convincing enough. Direct comparisons are unfair for other methods because PEM uses more data. It would be helpful to see the performance of other methods under the same dataset setting.**
First, we wish to clarify that the other baselines, which use the traditional paradigm of training only on datasets relevant to the task, cannot be trivially adapted to use the pre-train data used by PEMs. For example, there is no straightforward way to leverage a measles dataset when training to predict influenza. In contrast, our approach of pre-training with SSL on a wide range of heterogeneous epidemic time-series datasets is a novel framework that can effectively utilize these multiple pre-train datasets to extract useful patterns. Therefore, we believe that the experimental setup for performance comparison is fair.
In addition, we compare our method with past state-of-the-art SSL methods (Table 5), where we use the full set of pre-train datasets to pre-train the baselines as well (i.e., the same datasets as PEM). However, these SSL methods significantly underperform since they cannot adapt to the heterogeneity of the data and cannot effectively capture useful patterns from the pre-train datasets. This further emphasizes the effectiveness of our SSL methods for pre-training.
**The analysis in the ablation study is insufficient. For example, the impact of each hyperparameter should be discussed in more detail.**
We have thoroughly investigated the impact of hyperparameters related to the main contributions of our work (Tables 5, 6, 7). Specifically, we studied the impact of each SSL task and architectural novelty. We also studied the sensitivity of the hyperparameters related to these, specifically the segment size, masking probabilities of SSL tasks, and reverse instance normalization.
In addition, we have provided specific hyperparameters for both pre-training and fine-tuning in our manuscript, as indicated in lines 646-655. Furthermore, we have provided additional details on training and pre-training time, memory, and batch sizes in the *Common Response* above. We will incorporate these details into the revised manuscript. We have also observed that the model's hyperparameters perform well across all downstream tasks (line 923).
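As background for the reverse instance normalization component mentioned above, the sketch below shows the generic idea (normalize each input series with its own statistics and de-normalize the outputs with the same statistics). It is a simplified illustration of the technique, not the exact layer used in PEM.

```python
import torch

def instance_norm(x: torch.Tensor, eps: float = 1e-5):
    """x: (batch, time). Normalize each series with its own mean/std
    and return the statistics for later de-normalization."""
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True) + eps
    return (x - mean) / std, (mean, std)

def reverse_instance_norm(y: torch.Tensor, stats):
    """De-normalize model outputs with the stored per-instance statistics."""
    mean, std = stats
    return y * std + mean
```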
**In line 921, the authors said that when the segment size is 2, they got the best score, but in Table 7, P=4 has higher scores.**
This is a typo in line 921. We meant that the best segment size is 4. We thank the reviewer for identifying the error and we will fix it in the revised version.
**The writing should be further polished. There are many typos that need to be corrected.**
We have addressed the missing citations in lines 215 and 216 in the *Common Response* above. We will go over the manuscript to correct this and any other small typos in the revised manuscript.
## Reviewer evxR
Thank you for your positive feedback on our methods' technical novelty, effectiveness, and potential impact. Regarding your comments on missing citations, we address the four missing citations in the *Common Response*. We have also added additional details on the compute and memory requirements during pre-training and training in the *Common Response*. We will fix the citations and add the additional pre-training details in the final version.
**Given the ones reported here for the proposed method, what is the training time cost and memory cost of the baselines?**
We apologise for missing your point about comparing with the baselines. We measured the training time and maximum memory requirements of all baselines as follows:
**Training time (min)**

| Model | Influenza-US | Influenza-Japan | Cryptosporidiosis | Typhoid |
|-----------------|--------------|-----------------|-------------------|---------|
| Autoformer | 37.9 | 31.6 | 29.7 | 49.5 |
| Pyraformer | 44.7 | 38.7 | 42.1 | 62.5 |
| Informer | 31.6 | 42.5 | 35.9 | 55.1 |
| Fedformer | 47.4 | 32.9 | 48.6 | 53.9 |
| GP | 3.7 | 3.1 | 2.7 | 3.5 |
| EpiFNP | 27.4 | 22.5 | 29.3 | 47.2 |
| EpiDeep | 39.1 | 42.7 | 39.6 | 53.6 |
| EB | 3.4 | 3.2 | 3.9 | 3.5 |
| FUNNEL | 0.6 | 0.5 | 0.9 | 0.2 |
| PEM | 42.4 | 35.5 | 39.2 | 64.5 |

**Max. memory (GB)**

| Model | Influenza-US | Influenza-Japan | Cryptosporidiosis | Typhoid |
|-----------------|--------------|-----------------|-------------------|---------|
| Autoformer | 4.2 | 3.8 | 4.9 | 3.7 |
| Pyraformer | 2.6 | 2.9 | 2.5 | 2.1 |
| Informer | 4.5 | 3.7 | 4.3 | 3.2 |
| Fedformer | 3.1 | 3.6 | 3.7 | 2.9 |
| GP | 0.2 | 0.1 | 0.2 | 0.2 |
| EpiFNP | 2.8 | 2.1 | 3.5 | 3.1 |
| EpiDeep | 3.2 | 2.7 | 3.4 | 3.1 |
| EB | 0.1 | 0.1 | 0.1 | 0.1 |
| FUNNEL | 0.1 | 0.1 | 0.13 | 0.1 |
| PEM | 4.7 | 3.5 | 4.8 | 4.1 |
We observe that PEM's training time is similar to that of the transformer-based baselines, and its memory requirements are similar for all downstream tasks. Methods like GP, FUNNEL, and EB are not deep-learning based and use considerably less time and memory, but provide worse performance.
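The numbers above can be collected with standard PyTorch utilities; the sketch below illustrates how wall-clock training time and peak GPU memory can be measured, and is not necessarily the exact instrumentation we used.

```python
import time
import torch

def measure(train_fn, device="cuda"):
    """Run a training function and report elapsed minutes and peak GPU memory (GB)."""
    torch.cuda.reset_peak_memory_stats(device)
    start = time.time()
    train_fn()  # runs the full training loop for one model/benchmark pair
    minutes = (time.time() - start) / 60.0
    max_mem_gb = torch.cuda.max_memory_allocated(device) / 1e9
    return minutes, max_mem_gb
```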
**In addition, as discussed in the introduction, [12, 18, 31, 33, 10] are proposed to solve the data sparsity and deal with noise in epidemic forecasting, but they are not considered baselines. What is their performance?**
We explicitly chose state-of-the-art machine-learning-based epidemic and general time-series forecasting baselines. These methods use only the past time-series of an epidemic to forecast future values.
The methods referred to by the reviewer rely on additional sources of external or expert knowledge that are specific to a given epidemic. [12] is a spatio-temporal model that requires graph knowledge between regions, such as mobility. [18, 31] require expert knowledge of the mechanics of the epidemic, such as a mechanistic model of differential equations governing the spread of the disease. [33] is specifically curated for Covid-19 and uses multiple Covid-19-pandemic-specific features relevant to the US. [10] is another Covid-19 forecasting model that ensembles predictions from top models designed by multiple research groups in the US as part of the Covid-19 Forecast Hub.