# STOIC KDD Rebuttal

## Reviewer ivXj

**The framework is a combination of previous works. Please clarify the difference between the baseline models and the proposed model.**

We respectfully disagree. In terms of graph generation, we mention the main drawbacks of the baseline methods compared to STOIC in the second paragraph of Section 2. Specifically, baselines like MTGNN and GDN use a deterministic process of choosing the top-$k$ nodes based on embedding similarity to construct a node's neighborhood, which may not capture structural uncertainty. Recent state-of-the-art methods like GTS and NRI learn a fixed global set of parameters to produce a graph that does not adapt to the input time-series. STOIC instead generates a graph based on similarity in the input time-series using a probabilistic process that captures edge uncertainty. Moreover, unlike previous baselines, we capture other sources of uncertainty from the time-series: temporal uncertainty from the inputs via the Probabilistic Time-series Encoder, and uncertainty and similarity with past data via the Reference Correlation Network. Finally, STOIC adaptively combines uncertainty and information from these multiple sources to generate the resultant forecasts.

**There is no analysis of time complexity.**

The time complexity of the Probabilistic Time-series Encoder scales with the length of the input $T$ and the number of time-series $N$, i.e., $O(NT)$. The graph generation module scales as $O(N^2)$ since it derives an edge probability and a sample for each pairwise combination of nodes. The recurrent graph neural encoder scales as $O(NET)$, where $E$ is the number of edges in the sampled graph, since it applies a GNN and an RNN alternately. Finally, the correlation network module scales as $O(NK)$, where $K$ is the maximum size of the reference set, which is typically much smaller than $N$. Therefore, the total runtime complexity is $O(NET + N^2)$. This is on par with other state-of-the-art methods like NRI and GTS.
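The per-module terms above combine as follows (a short arithmetic summary; it uses $E \ge 1$ for the sampled graph and $K \le N$ as stated above):

```latex
\underbrace{O(NT)}_{\text{PTE}} + \underbrace{O(N^2)}_{\text{GGM}}
+ \underbrace{O(NET)}_{\text{RGNE}} + \underbrace{O(NK)}_{\text{RCM}}
= O(NET + N^2),
\quad \text{since } O(NT) \subseteq O(NET) \text{ for } E \ge 1
\text{ and } O(NK) \subseteq O(N^2) \text{ for } K \le N.
```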
In practice, we also found the GPU time and memory requirements of STOIC to be similar to those of these baselines.

**The writing of this draft is far from being prepared for publication.**

We appreciate the feedback. If the reviewer could point to specific aspects of the writing to improve upon, we will gladly work on them for the revised version.

## Reviewer d3N1

**My main concern is that the proposed approach looks unstable due to the two entangled steps of stochastic sampling: H and A... It would be nice if the authors can present results on the sensitivity and stability, such as intermediate loss and accuracy during training, sensitivity to the choice of hyperparameters, etc. By the way, hyperparameter search spaces for the proposed approach and all baselines should be reported in the paper (at least in Appendix).**

The stochastic processes in the PTE, GGM and RCM modules are made tractable for learning via variational inference, as discussed in lines 450-463. We also used the Gumbel-softmax relaxation for Bernoulli sampling and the reparameterization trick, which we found empirically stabilized training. We have also performed ablations on each of the main components of STOIC to assess their importance (Table 3). All hyperparameters for STOIC are listed in Appendix Section A. We used the official implementations for all baselines, making no architectural changes and only adjusting the batch size and learning rate for each task. We will add these hyperparameters in the revised version. We will also add the loss and performance curves over epochs for each task, as requested by the reviewer.

**I think the individual modules in the proposed approach are not justified well.
Although the authors present the results of ablation studies, I hope to have a clear understanding on the role of each module, and insights to further improve this approach.**

We discuss the technical motivation of each module in Section 4, specifically in the Overview and the first paragraph of each subsection. The ablation experiments in Table 5 also illustrate the effect of the different components on overall performance. Most importantly, removing the graph generation module has the largest impact on model performance, showcasing the importance of automatically learning useful relations across time-series. If the reviewer can suggest any specific questions or experiments that would enable a better understanding of the different modules, we will add them to the Appendix of the revised version.

**It would be nice to include case studies on forecasting results to visually inspect how well the "uncertainty" works. How does the uncertainty region grows or shrinks based on the patterns of given time series? Is it visually better than those of baseline approaches?**

We thank the reviewer for the suggestion. We observed that the confidence intervals are indeed better for STOIC: the 90% confidence interval plots capture the ground truth more often than those of the baselines. This can also be inferred from the better calibration scores reported in Table 2, since CS measures the fraction of the ground truth captured by the confidence intervals. We can gladly add example forecasts of STOIC and the other baselines in the Appendix.

**It would be nice to include time complexity analysis either by theory or by experiments. The proposed approach may not be fast due to the high complexity and stochastic sampling processes.**

The time complexity of the Probabilistic Time-series Encoder scales with the length of the input $T$ and the number of time-series $N$, i.e., $O(NT)$. The graph generation module scales as $O(N^2)$ since it derives an edge probability and a sample for each pairwise combination of nodes.
The recurrent graph neural encoder scales as $O(NET)$, where $E$ is the number of edges in the sampled graph, since it applies a GNN and an RNN alternately. Finally, the correlation network module scales as $O(NK)$, where $K$ is the maximum size of the reference set, which is typically much smaller than $N$. Therefore, the total runtime complexity is $O(NET + N^2)$. This is on par with other state-of-the-art methods like NRI and GTS. In practice, we also found the GPU time and memory requirements of STOIC to be similar to those of these baselines.

**Does "a single layer of a feedforward neural network" mean a linear transformation with a learnable weight matrix, or a two-layer network with one hidden layer?**

We mean the former: a single linear transformation with a learnable weight matrix. We will add this clarification.

**GRU-Cell and GNN in Eq. (7) should be defined in the paper for completeness.**

The GRU-Cell denotes a single step of the GRU applied to one time-step. We will add the details of these commonly used neural components in the Appendix for clarity.

**In Eq. (11) and (14), what is the main reason to put the same value twice with only one exponential?**

The Gaussian distribution is constructed from the mean and variance parameters, which are generated by two separate feedforward layers in both Equation 11 and Equation 14 (denoted by the subscripts of $NN$). Also, since the output of the variance module can be any real value, we apply an exponential to obtain a positive value for the variance.

**If the given time series is too long, do we have to introduce another starting point, e.g., t'', to make Eq (9) tractable?**

We do not believe the size of $t$ would be a computational bottleneck in practice, since the PTE derives the parameters from the embeddings of a GRU. In terms of performance, a GRU may not capture very long temporal patterns, and using an alternative module like a Transformer could be useful at the cost of more compute. However, in our experiments we observed that the GRU is sufficient for capturing the relevant temporal patterns.
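To make the mean/variance parameterization discussed above concrete, here is a minimal sketch (our own names and shapes, not the paper's code) of a GRU embedding feeding two separate feedforward layers, with an exponential ensuring the variance is positive, as in Equations 11 and 14:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Illustrative sketch: two separate linear layers produce the mean and
    the log-variance of a Gaussian; exp() makes the variance positive."""
    def __init__(self, hidden_dim: int, out_dim: int):
        super().__init__()
        self.mean_layer = nn.Linear(hidden_dim, out_dim)    # plays the role of NN_mu
        self.logvar_layer = nn.Linear(hidden_dim, out_dim)  # plays the role of NN_sigma

    def forward(self, h: torch.Tensor) -> torch.distributions.Normal:
        mu = self.mean_layer(h)
        var = torch.exp(self.logvar_layer(h))  # exp() => variance > 0
        return torch.distributions.Normal(mu, var.sqrt())

# Usage: h_T could be the final GRU embedding of each time-series (as in the PTE).
gru = nn.GRU(input_size=1, hidden_size=32, batch_first=True)
head = GaussianHead(32, 1)
x = torch.randn(4, 20, 1)      # 4 time-series, 20 time steps each
_, h_T = gru(x)                # h_T: (1, 4, 32)
dist = head(h_T.squeeze(0))
sample = dist.rsample()        # reparameterization trick keeps gradients flowing
```

The `rsample()` call is what makes the stochastic step trainable end to end, which is the stabilization trick mentioned in our response to Reviewer d3N1.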
**Can you give more intuitions for "A well-calibrated model has k(c) close to c and therefore its CS is close to zero"?**

Intuitively, the confidence intervals provided by the forecast distributions should capture the uncertainty in the dataset. For example, in expectation, 60% of the ground truth should lie within the 60% confidence interval of a forecast distribution. CS measures this empirically across all forecasts during evaluation. Specifically, if a model is perfectly calibrated, then for a confidence level $c$, the fraction of the ground truth inside the $c$-confidence interval should be $c$, i.e., $k(c) = c$ for different values of $c$. This results in CS going to 0.

**The shared repository only contains datasets and README files without actual codes.**

## Reviewer 485e

**The novelty of methods is a little weak. Building a graph for multivariate time-series data is proposed by MTGNN and GDN. Leveraging the functional neural process for time-series data is studied in EPIFNP. The authors integrated them and added a module to deal with past time-series data by creating a stochastic correlation graph and learning the corresponding embeddings.**

We respectfully disagree. In terms of graph generation, we mention the main drawbacks of the baseline methods compared to STOIC in the second paragraph of Section 2. Specifically, baselines like MTGNN and GDN use a deterministic process of choosing the top-$k$ nodes based on embedding similarity to construct a node's neighborhood, which may not capture structural uncertainty. In contrast, STOIC generates a graph based on similarity in the input time-series using a probabilistic process that captures edge uncertainty.
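The probabilistic edge-sampling step can be sketched as follows (an illustration under our own naming, not the paper's code): pairwise similarity of node embeddings gives per-edge Bernoulli logits, and the Gumbel-softmax relaxation mentioned in our response to Reviewer d3N1 keeps the discrete sampling differentiable:

```python
import torch
import torch.nn.functional as F

def sample_edges(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Sketch of probabilistic graph generation: embedding similarity gives
    Bernoulli edge logits; Gumbel-softmax yields a differentiable 0/1 sample."""
    logits = z @ z.t()                                   # N x N similarity scores
    two_class = torch.stack([logits, -logits], dim=-1)   # (edge, no-edge) logits
    soft = F.gumbel_softmax(two_class, tau=tau, hard=True)
    return soft[..., 0]                                  # N x N 0/1 adjacency sample

z = torch.randn(5, 8)   # embeddings for 5 time-series
A = sample_edges(z)     # a different graph can be drawn on every call
```

Because each call draws a fresh adjacency matrix, the model sees the structural uncertainty directly rather than a single deterministic top-$k$ neighborhood.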
While we use methods from EpiFNP to capture other sources of uncertainty from the time-series, namely temporal uncertainty from the inputs via the Probabilistic Time-series Encoder, and uncertainty and similarity with past data via the Reference Correlation Network, unlike previous baselines, we believe adapting them to the multivariate setup by leveraging the learned graph (via RGNE) is non-trivial. Finally, STOIC also solves the important problem of adaptively combining uncertainty and information from these multiple sources to generate the resultant forecasts via the Adaptive Distribution Decoder. Given these novel contributions in graph generation, in leveraging graph structure, temporal information and uncertainty, and in combining them adaptively to provide accurate and calibrated forecasts, we do not believe STOIC is just a combination of past methods.

**In the experiments, some kinds of baselines are missing.**

We have used recent and state-of-the-art baselines for comparison. If the reviewer can point to any other specific baselines, we would gladly include them in the revised version.

**Why the authors do not compare the transformer and its variants for multivariate time-series data, such as informer? They can directly deal with long-sequence time-series forecasting and do not need a correlation graph.**

Transformer variants like Informer are evaluated in univariate forecasting scenarios, where the goal is to forecast a time-series for a single target variable. In contrast, multivariate forecasting datasets such as those used in this work consist of 50-300 time-series with underlying relations between them. Therefore, we compare our work with other state-of-the-art multivariate forecasting methods, many of which model the underlying relations when forecasting all time-series simultaneously. Further, the correlation graph captures both relations and uncertainty between the input and past time-series data, unlike transformer models.
As shown in the ablation study, modeling these relations is important for the superior performance of STOIC.

**How about the time complexity of the proposed method compared with baselines?**

The time complexity of the Probabilistic Time-series Encoder scales with the length of the input $T$ and the number of time-series $N$, i.e., $O(NT)$. The graph generation module scales as $O(N^2)$ since it derives an edge probability and a sample for each pairwise combination of nodes. The recurrent graph neural encoder scales as $O(NET)$, where $E$ is the number of edges in the sampled graph, since it applies a GNN and an RNN alternately. Finally, the correlation network module scales as $O(NK)$, where $K$ is the maximum size of the reference set, which is typically much smaller than $N$. Therefore, the total runtime complexity is $O(NET + N^2)$. This is on par with other state-of-the-art methods like NRI and GTS. In practice, we also found the GPU time and memory requirements of STOIC to be similar to those of these baselines.

## Reviewer zww5

**The framework of StoIC seems to be a simple combination of existing works (from EpiFPN and GTS).**

We respectfully disagree. In terms of graph generation, we mention the main differences and drawbacks of the baseline methods compared to STOIC in the second paragraph of Section 2. Specifically, baselines like GTS and NRI learn a fixed global set of parameters to produce a graph that does not adapt to the input time-series. In contrast, STOIC generates a graph based on similarity in the input time-series using a probabilistic process that captures edge uncertainty. While we use methods from EpiFNP to capture other sources of uncertainty from the time-series, namely temporal uncertainty from the inputs via the Probabilistic Time-series Encoder, and uncertainty and similarity with past data via the Reference Correlation Network, unlike previous baselines, we believe adapting them to the multivariate setup by leveraging the learned graph (via RGNE) is non-trivial.
Finally, STOIC also solves the important problem of adaptively combining uncertainty and information from these multiple sources to generate the resultant forecasts via the Adaptive Distribution Decoder. Given these novel contributions in graph generation, in leveraging graph structure, temporal information and uncertainty, and in combining them adaptively to provide accurate and calibrated forecasts, we do not believe STOIC is just a combination of GTS and EpiFNP.

**There are some data files in the anonymized repository. No source codes nor running instructions are provided. This may limit the reproducibility of the proposed model.**

**Experimental settings are incomplete. For example, the training/val/test ratio.... Also, can we use the mentioned models as baselines for comparison, or are there any hurdles preventing us from doing so?.. I suggest to include more horizon settings.**

We have mentioned the training/test split for most datasets, based on the settings of previous works, in Section 5.1. Specifically, we mention this in lines 540-541 for Flu-Symptoms, lines 546-550 for Covid-19, lines 556-558 for S&P-100, and lines 569-570 for Electricity. For Transit demand, we use the last two weeks for testing and the rest for training and validation, similar to [1]. We divide the full training dataset into an 80/20 split for training and validation, respectively. We will add these details in the revised version. The horizons used for evaluation are also based on the specifications of past works. These dataset splits vary slightly across some of the other papers mentioned by the reviewer. For example, ESG uses a 7:1.5:1.5 split for transit data, whereas StemGNN uses a 7:2:1 split for the transit demand datasets. Our splits are in the same ballpark, and we should not see any significant performance differences.
While we chose state-of-the-art baselines from different categories (with GTS being the best baseline overall), we can also add the baselines referenced by the reviewer in the revised version.

**There are many modules in StoIC; however, but no complexity analysis or run-time analysis is provided to determine whether the proposed model is efficient compared to other methods.**

The time complexity of the Probabilistic Time-series Encoder scales with the length of the input $T$ and the number of time-series $N$, i.e., $O(NT)$. The graph generation module scales as $O(N^2)$ since it derives an edge probability and a sample for each pairwise combination of nodes. The recurrent graph neural encoder scales as $O(NET)$, where $E$ is the number of edges in the sampled graph, since it applies a GNN and an RNN alternately. Finally, the correlation network module scales as $O(NK)$, where $K$ is the maximum size of the reference set, which is typically much smaller than $N$. Therefore, the total runtime complexity is $O(NET + N^2)$. This is on par with other state-of-the-art methods like NRI and GTS. In practice, we also found the GPU time and memory requirements of STOIC to be similar to those of these baselines.

**Without graph learning, the performance of StoIC significantly degenerates to a level of baselines that do not use graphs, which makes me concerning the effectiveness of other modules proposed in StoIC.**

We agree that graph learning is vital for the superior performance of the model. In fact, the novel graph generation method proposed in the GGM module, which also leverages structural uncertainty when sampling the graph, is an important technical contribution of our work. We have also showcased the importance of the other key modules, the Correlation Graph and Adaptive Weighting, in the ablation studies (Table 5), where removing them likewise degrades performance to the level of some of the baselines. This shows that all the modules are important for the superior performance of STOIC.
The observation that the performance without the graph degenerates to the level of the baselines is very interesting. This is because STOIC-NoGraph, i.e., STOIC without the novel GGM and RGNE modules, has strong architectural similarities to EpiFNP, which also has the reference correlation graph, apart from small changes such as STOIC-NoGraph sharing the same decoder across all time-series whereas EpiFNP has an independent module (since it is originally a univariate forecasting model). The scores of STOIC-NoGraph and EpiFNP are therefore quite similar in many cases.

**Do all datasets lack "ground-truth" structures?**

As discussed in line 843, yes, there is no "ground-truth" structure for the datasets. Similar to previous works, we study some of the useful domain-related relations learned by the model and the baselines, but we would not categorize them as ground truth, since the datasets are not generated using any such fixed relations; rather, they are collected from real-world applications, which can have more complex dynamics and relations not captured by the domain-knowledge-based relations alone.

**Other metrics, such as the widely adopted MAE and MAPE, should be used.**

The metrics used in the paper (RMSE, CRPS and CS) are chosen to evaluate both the accuracy and the calibration of the forecasts. While many metrics are used in other works, including MAPE and MAE, each of these metrics has its own benefits and disadvantages and captures different aspects of forecast performance. However, we can gladly add these two metrics in the Appendix.

**EpiFNP also considers the uncertainty of the forecasting and adopts a probabilistic framework. Why its performance drops quickly if we inject Gaussian noises into the data (its decrease is even more pronounced than other non-probabilistic baselines)? I suggest the authors to include more explanations here.**

EpiFNP is a univariate forecasting model that captures the uncertainty of each time-series independently.
It does not capture the relations across time-series that enable models to detect variations and noise in time-series values based on the learned relations. Intuitively, models can leverage these relations to correct deviations in the values caused by noise, making them more robust. In fact, we empirically observe that leveraging structure is so important that most graph-learning-based baselines are significantly more robust than the non-graph baselines (Figure 2).

**I believe my concerns have not been adequately addressed without specific numbers and results, particularly for W1 and W3.**

We thank the reviewer for engaging in the discussion. We have run hyperparameter sensitivity tests for the learning rate, batch size, and number of hidden units (and embedding sizes), and report the results as follows:

| LR | Flu-Symptoms | Covid-19 | SP100 | Electricity | METR-LA | PEMS-BAYS | NYC-Bike | NYC-Taxi |
|----------|--------------|----------|-------|-------------|---------|-----------|----------|----------|
| 1.00E-05 | 0.52 | 32.3 | 0.16 | 212.5 | 1.54 | 3.1 | 2.44 | 9.47 |
| 1.00E-04 | 0.44 | 31.9 | 0.17 | 207.3 | 1.48 | 2.7 | 2.41 | 8.61 |
| 1.00E-03 | 0.42 | 31.7 | 0.16 | 223.9 | 1.63 | 2.5 | 2.52 | 8.98 |

| Batch size | Flu-Symptoms | Covid-19 | SP100 | Electricity | METR-LA | PEMS-BAYS | NYC-Bike | NYC-Taxi |
|------------|--------------|----------|-------|-------------|---------|-----------|----------|----------|
| 16 | 0.44 | 33.8 | 0.19 | 209.3 | 1.57 | 2.6 | 2.95 | 8.69 |
| 32 | 0.46 | 31.5 | 0.21 | 211.7 | 1.49 | 2.5 | 2.48 | 8.92 |
| 64 | 0.45 | 32.9 | 0.19 | 207.3 | 1.48 | 2.7 | 2.45 | 8.73 |
| 128 | 0.42 | 31.7 | 0.16 | 208.5 | 1.48 | 2.5 | 2.41 | 8.61 |

| Embedding size | Flu-Symptoms | Covid-19 | SP100 | Electricity | METR-LA | PEMS-BAYS | NYC-Bike | NYC-Taxi |
|----------------|--------------|----------|-------|-------------|---------|-----------|----------|----------|
| 30 | 0.58 | 39.7 | 0.25 | 229.5 | 1.64 | 3.7 | 3.59 | 8.85 |
| 60 | 0.42 | 31.7 | 0.16 | 207.3 | 1.48 | 2.5 | 2.41 | 8.61 |
| 120 | 0.43 | 30.1 | 0.18 | 205.8 | 1.52 | 2.1 | 2.37 | 8.79 |

As reported earlier, the model is not very sensitive to these hyperparameters. We also report the intermediate losses for the different benchmarks below (for a single run):

*Flu-Symptoms*

| Epoch | Loss | CRPS |
|-------|------|------|
| 100 | 2.25 | 3.77 |
| 200 | 0.73 | 1.18 |
| 300 | 0.39 | 1.07 |
| 400 | 0.22 | 0.86 |
| 500 | 0.19 | 0.53 |
| 600 | 0.17 | 0.51 |
| 700 | 0.16 | 0.43 |
| 753 | 0.16 | 0.42 |

*Covid-19*

| Epoch | Loss | CRPS |
|-------|------|-------|
| 100 | 3.35 | 329.2 |
| 200 | 1.67 | 297.4 |
| 300 | 1.06 | 129.5 |
| 400 | 0.69 | 66.2 |
| 500 | 0.63 | 35.9 |
| 600 | 0.61 | 32.3 |
| 622 | 0.61 | 32.7 |

*SP100*

| Epoch | Loss | CRPS |
|-------|------|------|
| 100 | 2.77 | 1.18 |
| 200 | 2.31 | 0.73 |
| 300 | 1.75 | 0.41 |
| 400 | 0.57 | 0.35 |
| 500 | 0.51 | 0.33 |
| 600 | 0.48 | 0.37 |
| 700 | 0.43 | 0.31 |
| 800 | 0.45 | 0.26 |
| 900 | 0.41 | 0.22 |
| 1000 | 0.41 | 0.18 |
| 1100 | 0.41 | 0.19 |
| 1200 | 0.39 | 0.16 |
| 1230 | 0.39 | 0.16 |

*Electricity*

| Epoch | Loss | CRPS |
|-------|------|-------|
| 100 | 8.53 | 594.6 |
| 200 | 4.41 | 471.3 |
| 300 | 3.98 | 422.9 |
| 400 | 3.82 | 383.5 |
| 500 | 3.97 | 396.5 |
| 600 | 2.66 | 377.1 |
| 700 | 2.61 | 352.4 |
| 800 | 2.95 | 388.8 |
| 900 | 2.38 | 317.4 |
| 1000 | 2.17 | 266.8 |
| 1100 | 2.25 | 272.9 |
| 1200 | 1.72 | 285.4 |
| 1300 | 1.77 | 272.6 |
| 1400 | 1.68 | 238.5 |
| 1500 | 1.64 | 211.9 |
| 1600 | 1.61 | 209.5 |
| 1777 | 1.59 | 208.1 |

*PEMS-BAYS*

| Epoch | Loss | CRPS |
|-------|------|-------|
| 100 | 4.81 | 10.27 |
| 200 | 2.77 | 6.93 |
| 300 | 1.83 | 5.55 |
| 400 | 1.15 | 4.19 |
| 500 | 0.8 | 2.97 |
| 600 | 0.74 | 2.28 |
| 700 | 0.61 | 1.74 |
| 800 | 0.39 | 1.59 |
| 900 | 0.31 | 1.52 |
| 1000 | 0.27 | 1.47 |
| 1104 | 0.28 | 1.47 |

We will add the plots for these tables in the Appendix of the revised version.