# ProfHiT KDD Rebuttal

## Common Response

**Regarding model runtime complexity**

The novel component of ProfHiT is the hierarchy-aware refinement module, which integrates the base forecast distributions. As described in lines 398-403, the total computational complexity of obtaining the refined distributional parameters is $O(N^2)$, where $N$ is the number of nodes in the hierarchy; this is comparable to the reconciliation step of end-to-end methods such as HierE2E. Post-processing techniques such as MinT, ERM, and PERMBU have an even higher time complexity of $O(N^3)$.

The other part of the pipeline that contributes to the time complexity is the base forecasting model. Models such as DeepAR and RNNs (used by HierE2E, SHARQ, and the post-processing methods) as well as TSFNP (used by ProfHiT) scale linearly with both the length of the time-series and the number of nodes $N$. Therefore, ProfHiT and all these baselines use base forecast models of similar time complexity with respect to the size of the hierarchy $N$.

## Reviewer 9PoY

We thank the reviewer for their valuable comments and suggestions. We are grateful that the reviewer found our paper well written and appreciated the theoretical and empirical contributions. We respond to the concerns and comments as follows:

**Although the experimental datasets were carefully chosen, they appear to be relatively small. To make the results more convincing and compelling, I suggest validating the method on larger-scale datasets.**

We chose all the datasets used by previous works in the literature. In addition, we have also benchmarked on weakly consistent datasets (Flu-Symptoms, FB-Survey), which have not been studied by previous works. These datasets range from 57 to 555 nodes in the hierarchy and from 15,677 to 126,540 total observations. If the reviewer can point us to any larger datasets, we would gladly add them to our experiments.
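To make the complexity comparison in the Common Response concrete, below is a minimal, illustrative sketch (not the paper's implementation; the function names and the fixed weight matrix are our own) contrasting an $O(N^3)$ MinT-style reconciliation, which requires matrix inversions, with a single $O(N^2)$ pairwise refinement pass over all node parameters:

```python
import numpy as np

def mint_style_reconcile(y_hat, S, W):
    """MinT-style projection: y_tilde = S (S^T W^-1 S)^-1 S^T W^-1 y_hat.
    The matrix inversions make this O(N^3) in the number of nodes N."""
    W_inv = np.linalg.inv(W)                          # O(N^3)
    P = np.linalg.inv(S.T @ W_inv @ S) @ S.T @ W_inv  # O(N^3)
    return S @ (P @ y_hat)

def pairwise_refine(mu, A):
    """Refinement-style update: every node's parameter is recomputed from
    all N base parameters via a weight matrix A -- one O(N^2) pass."""
    return A @ mu

# Tiny hierarchy: node 0 = total, nodes 1 and 2 are its children.
S = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y_hat = np.array([5.0, 2.0, 2.0])          # incoherent base forecasts
y_tilde = mint_style_reconcile(y_hat, S, np.eye(3))
```

After reconciliation `y_tilde[0]` equals `y_tilde[1] + y_tilde[2]`, i.e. the output is coherent; the refinement pass, in contrast, is a single learned transformation applied during end-to-end training.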
**The proposed method involves modeling the forecast distributions of all time-series nodes in the hierarchy. It would be helpful to discuss the method's complexity and compare it with other methods. Could the high complexity of the proposed method hinder its ability to handle large-scale datasets?**

We have discussed the computational complexity of the main bottleneck component of ProfHiT, the refinement module, in lines 398-403 and showed that it is on par with or better than other baselines. We have added further details in the *Common Response* above, which we will include in the revised version. We agree that scaling up to hierarchies three or more orders of magnitude larger than our datasets may pose computational challenges for ProfHiT and other end-to-end methods (which scale as $O(N^2)$ per pass), and even more so for post-processing methods such as ERM and PERMBU (which scale as $O(N^3)$); addressing this is an important research direction.

**This paper provides a good comparison of the proposed method with various types of methods. However, some of the compared post-processing methods seem outdated. Therefore, I encourage the authors to include a comparison with more recent post-processing methods.**

To the best of our knowledge, the most recent state-of-the-art hierarchical forecasting methods are end-to-end methods such as HierE2E and SHARQ. If the reviewer can point us to any more recent post-processing methods, we would be happy to add them as baselines.

## Reviewer dDvR

We thank the reviewer for their valuable comments and questions. We respond to their questions and concerns as follows:

**The idea of using soft regularization is a bit straightforward.**

We wish to emphasize that one of our main contributions, Soft Distributional Consistency Regularization, is a novel and non-trivial method that enables our model, ProfHiT, to effectively learn from datasets with varying consistency at the distributional level throughout the entire hierarchy.
This regularization allows our method to constrain the full forecast distributions of the outputs in a tractable manner. Used as a soft regularizer, it lets our approach adaptively adhere to the hierarchical constraints based on the inherent consistency of the dataset. Previous works have not considered this challenge of varying consistency and strictly constrain their forecasts, resulting in lower predictive performance than ProfHiT on both strongly and weakly consistent datasets.

**The proposed method could increase the computational cost. The authors are recommended to provide the running times of their proposed method.**

The main bottleneck component of ProfHiT, the refinement module, scales as $O(N^2)$, where $N$ is the number of nodes (lines 398-403). This is on par with other end-to-end methods and better than many post-processing methods, which scale as $O(N^3)$. We also refer the reviewer to the *Common Response* above for a more detailed discussion of running time complexity, which we will add in the revised version.

**The results show that the proposed method can be outperformed by the baselines with a large margin in MAPE. For example, in Table 3, the MAPE achieved by TSFNP-MinT is 1.17, while the MAPE achieved by proposed method is 1.47.**

ProfHiT either significantly outperforms all the baselines or is close to the best-performing baseline in all cases except on one metric for one dataset. Different metrics have their own advantages and disadvantages and measure different aspects of forecast performance (e.g., CS measures calibration, while CRPS measures both accuracy and calibration of probabilistic forecasts). Therefore, rather than looking at a single metric on one benchmark, evaluating the table holistically clearly shows the strength of ProfHiT in providing accurate and well-calibrated forecasts across a variety of datasets.

## Reviewer U5Fw

We thank the reviewer for their valuable comments and questions.
We are glad the reviewer appreciates the importance of the challenges tackled by our work and the experimental results. We respond to their questions and concerns as follows:

**The justification of using Gaussian assumptions.**

The Gaussian assumption for the predictive distribution is a reasonable choice for many real-world time-series datasets with continuous-valued data. In our experiments, we demonstrate that this simple assumption is sufficient for ProfHiT to outperform previous state-of-the-art methods on a wide range of forecasting tasks in fields such as macroeconomics and epidemiology. While we do assume Gaussian distributions for each node's forecast, we do not treat this assumption as a fundamental constraint of the problem statement. Rather, it is a modeling choice that simplifies the parametrization of the forecast distributions and yields a tractable SDCR loss. That said, combining different distributions across the hierarchy is a very interesting possible extension to our model, which we discussed in lines 980-982 as potential future work. In fact, other distributions such as the exponential and gamma distributions also have closed-form JSD and could be used instead of Gaussians depending on the application.

**Clarification of the linearity assumption of the hierarchical relationship**

Previous works on hierarchical forecasting have all formulated the problem assuming linear relationships across nodes connected by the hierarchy. The linear aggregation function is motivated by real-world applications such as the aggregation of case counts across geographical regions in epidemiology, or of macroeconomic metrics across demographics and regions. That said, other kinds of aggregation functions may be more appropriate for certain other applications.
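To illustrate why the linear aggregation assumption combines naturally with Gaussian forecast distributions, here is a small sketch (our own illustration, not code from the paper; treating the children as independent is a simplification, since correlated children would add covariance terms):

```python
import numpy as np

def implied_parent(child_mu, child_sigma):
    """Linear aggregation: a parent equals the sum of its children. With
    Gaussian child forecasts treated as independent (a simplification),
    the implied parent forecast is also Gaussian, with summed means and
    summed variances -- which keeps distribution-level reasoning tractable."""
    mu = float(np.sum(child_mu))
    sigma = float(np.sqrt(np.sum(np.square(child_sigma))))
    return mu, sigma

# Two children N(1, 3^2) and N(2, 4^2) imply a parent N(3, 5^2).
mu, sigma = implied_parent([1.0, 2.0], [3.0, 4.0])
```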
Extending our work to accommodate different kinds of aggregation functions, and deriving a tractable distributional coherency regularization loss for them, is an interesting future direction.

**It seems the proposed SDCR is not consistent with the motivation of calibrated forecasting as claimed...**

The SDCR regularizer helps the model better leverage the distributional consistency induced by the hierarchical constraints when learning from training data across all nodes. This allows the model to generate well-calibrated forecasts that adaptively adhere to the distributional consistency of the hierarchy. In contrast, other baselines apply regularization on point forecasts or on specific quantiles of the distributions, which does not guide the model to learn calibrated forecasts across the hierarchy. Finally, we observed empirically that applying this novel regularization on the full forecast distributions provides significantly better-calibrated forecasts in all the benchmark applications, based on the CS and CRPS scores (Table 3).

**Lack of further discussions between the proposed SDCR and previous method SHARQ**

As mentioned in lines 217-220, SHARQ uses a regularizer on quantiles of the forecast distributions. However, we would like to emphasize two main disadvantages of SHARQ compared to ProfHiT. First, as mentioned in the paper, SHARQ only regularizes a fixed number of quantiles, whereas ProfHiT considers the entire forecast distribution. This enables ProfHiT to capture the full distributional consistency across the hierarchy. Second, it is not straightforward to find good quantile estimators for an arbitrary distribution. In such non-trivial cases, researchers resort to approximation methods (sampling, bootstrapping, bagging, etc.) to estimate the underlying distribution and uncertainty. (We will add the second point to the updated manuscript.)
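The contrast between quantile-level and distribution-level regularization can be sketched as follows. This is purely illustrative: we use the closed-form Gaussian KL divergence as a tractable stand-in divergence rather than the exact SDCR term from the paper, and the independent sum of children is a simplification.

```python
import numpy as np
from scipy.stats import norm

def summed_children(children):
    """Gaussian implied by summing independent Gaussian children (sketch)."""
    mu = sum(m for m, _ in children)
    sigma = float(np.sqrt(sum(s ** 2 for _, s in children)))
    return mu, sigma

def quantile_penalty(parent, children, qs=(0.25, 0.5, 0.75)):
    """SHARQ-style idea: penalize mismatch only at a fixed set of quantiles."""
    p_mu, p_sig = parent
    c_mu, c_sig = summed_children(children)
    return float(sum((norm.ppf(q, p_mu, p_sig) - norm.ppf(q, c_mu, c_sig)) ** 2
                     for q in qs))

def distribution_penalty(parent, children):
    """Distribution-level idea: compare the parent's full Gaussian against the
    Gaussian implied by its children, here via closed-form KL as a stand-in."""
    p_mu, p_sig = parent
    c_mu, c_sig = summed_children(children)
    return float(np.log(c_sig / p_sig)
                 + (p_sig ** 2 + (p_mu - c_mu) ** 2) / (2 * c_sig ** 2) - 0.5)
```

For a perfectly coherent parent/children pair both penalties are zero; the distribution-level penalty, however, reacts to any mismatch in the full densities rather than only at a handful of quantiles.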
Furthermore, while the Gaussian assumption is simple, it is also sufficient for modeling many real-valued time-series datasets. In fact, ProfHiT outperforms all previous state-of-the-art models in a wide range of real-world applications, including macroeconomics and epidemic forecasting.

Finally, as requested by the reviewer, we ran TSFNP with SDCR replaced by SHARQ's quantile regularizer and the refinement module removed. The results are as follows:

|             |      | Tourism-L |      |       | Labour |      |       | Wiki |      |       |  Flu  |      |      | FB Symptoms |      |
|:-----------:|:----:|:---------:|:----:|:-----:|:------:|:----:|:-----:|:----:|:----:|:-----:|:-----:|:----:|:----:|:-----------:|:----:|
| Models/Data | CRPS |    CS     | DCE  | CRPS  |   CS   | DCE  | CRPS  |  CS  | DCE  | CRPS  |  CS   | DCE  | CRPS |     CS      | DCE  |
| TSFNP-SHARQ | 0.16 |   0.11    | 0.08 | 0.042 |  0.16  | 0.07 | 0.217 | 0.14 | 0.11 | 0.372 | 0.068 | 0.16 | 3.19 |    0.12     | 0.18 |
| ProfHiT     | 0.12 |   0.09    | 0.02 | 0.026 |  0.14  | 0.05 | 0.184 | 0.13 | 0.04 | 0.250 | 0.042 | 0.14 | 1.43 |    0.08     | 0.16 |

While using TSFNP as the backbone for SHARQ's quantile regularizer improves its scores, particularly CS, ProfHiT outperforms this variant across all metrics and benchmarks. We would gladly add this result to the revised paper.

**Counter-intuitive experiment results. Table 3 shows that on Flu and FB dataset, the CS (calibration score) of HierE2E is higher than its baseline DeepAR. Similar phenomenon occurs for TSFNP-MINT/ERM against the baseline TSFNP.**

This is a very interesting observation from Table 3. We observe an increase (lower is better) in calibration score after applying the baseline reconciliation methods (ERM, MinT, HierE2E) to the base forecast models in some cases. This phenomenon could be attributed to the fact that the baseline methods do not effectively leverage distributional consistency the way ProfHiT does through SDCR training.
They mostly perform reconciliation on point estimates or as a post-processing step, potentially worsening the calibration of the reconciled forecasts. We thank the reviewer for pointing this out, and we will add this comment to the revised manuscript.

**Further description of the significance of weakly-consistent scenarios and why previous methods cannot be trivially adapted to such scenarios. This helps clarify the contribution/impact of this paper.**

As mentioned in Section 1, lines 93-100, we frequently encounter weakly consistent datasets in real-world applications such as epidemiology and macroeconomics. These datasets deviate from the hierarchical constraints due to many factors such as reporting errors, asynchrony in data aggregation, etc. Therefore, automatically adapting to such deviations is crucial for providing accurate forecasts. We also report the data consistency of the benchmark datasets in Appendix Table 5 to illustrate the extent of the deviations in weakly consistent datasets.

Previous methods cannot be trivially adapted to weakly consistent datasets: they rely on strict constraint enforcement to ensure hierarchical consistency in the output forecasts, either as constraints of an optimization problem or by projecting the forecasts onto a subspace that satisfies the constraints. The only previous method that can potentially adapt to weak consistency is SHARQ, which employs a soft regularization on select quantiles of the forecast distribution. In contrast, ProfHiT's SDCR leverages the full distributional coherency to adaptively trade off between consistency and training accuracy. This results in both better accuracy and calibration than SHARQ on datasets of varying consistency. In fact, even the Distributional Coherency Error of SHARQ is worse than that of ProfHiT on datasets of all consistency levels.

**Further justifications of the assumption of conditional dependency of TSFNP.
Line 292 implies that nodes are not independent, while Eq. 2 suggests a factorization over the nodes.**

In the TSFNP model, we derive the latent variable $z_i$ for node $i$ by modeling the correlation between its input time-series and the past time-series of all nodes via the Stochastic Data Correlation Graph (Appendix Section A). Hence, the first term of the factorization in Eq. 2 is $P(\mathcal{Z} \mid \{y_i\}_i)$, where $\mathcal{Z}$ denotes the latent variables of all nodes. Each $z_i$ is then used to derive the base forecast distribution parameters $\mu_i$ and $\sigma_i$; the second term $P(\mu_i, \sigma_i \mid \mathcal{Z})$ encapsulates this. It is important to note that the formulation in Eq. 2 is very general and applies to any base forecasting method, whether or not it models each node's forecasts independently. Additionally, in practice any differentiable probabilistic base forecasting model, node-wise independent or not, can replace TSFNP depending on the application; the training pipeline would be similar, with potentially minor changes to the likelihood loss.

**Although details are plugged to the appendix, I suggest the authors keep the main paper self-containing to avoid confusion.**

We chose to keep the main manuscript focused on the key technical contributions of our work, namely the Refinement Module and Soft Distributional Coherency Regularization. As TSFNP is a small modification of previous work, we provide its details in the Appendix so that the main manuscript remains self-contained and easy to read for a broader audience. Furthermore, we have made sure to use notation that is consistent throughout the manuscript and does not depend on TSFNP-specific notation, which should help avoid any confusion. This should allow readers to easily leverage ProfHiT with any base forecast model of their choice depending on the application.

**Definition of H.
In line 237, the hierarchical relationship H is defined to be strongly consistent, and the discussion of strong/weak consistency is not necessary following this definition. The authors may better refine their logic.**

We would like to clarify that the hierarchical constraint $H$ of a dataset is independent of its data consistency. $H$ is derived from the assumptions made about the data-generation process of the hierarchical time-series, as explained in the paper. However, many real-world datasets may deviate from these constraints. Therefore, we have formally defined the notion of data consistency in Definition 1 and classified datasets as strongly or weakly consistent in Definition 2 to account for these deviations. This provides a useful framework for understanding the degree to which a dataset adheres to the hierarchical constraints $H$. We will clarify this in the revised version.

## Reviewer aK8F

We thank the reviewer for their valuable comments. We are grateful that the reviewer appreciated the problem of dealing with weak consistency and our thorough evaluation. We address the comments and questions as follows:

**The theoretical study presented in the paper has some limitations. Although the authors have developed their own probabilistic models, it is unclear whether their definition of weak consistency plays a role in the convergence of the model...**

We have evaluated our method in Section 5 on both strongly and weakly consistent datasets of varying $E_T$ (Appendix Table 5) from multiple domains. We observed that ProfHiT is the only model that can adapt to these deviations, providing significantly better accuracy and calibration. We also note that while $E_T$ measures the average deviation from the hierarchical constraints, the dynamics and nature of such deviations across the hierarchy can be complex and caused by multiple factors such as reporting errors, random noise, data asynchrony, and revisions.
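A minimal sketch of how such deviations from the hierarchical constraints can be quantified over a dataset (the relative-deviation normalization below is our illustrative choice, not necessarily the exact definition of $E_T$ in the paper):

```python
import numpy as np

def consistency_deviation(y, parents):
    """Average relative deviation of a dataset from its hierarchical
    constraints H (here: each parent should equal the sum of its children).

    y: (num_nodes, num_timesteps) array of observations.
    parents: dict mapping a parent row index to its child row indices.
    Returns 0 for strongly consistent data; larger values mean weaker
    consistency."""
    devs = []
    for p, kids in parents.items():
        agg = y[kids].sum(axis=0)
        devs.append(np.abs(y[p] - agg) / (np.abs(y[p]) + 1e-8))
    return float(np.mean(devs))
```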
Therefore, modeling such deviations is non-trivial for most real-world applications, and systematically studying their effects on model convergence and performance is an important research direction. We will add this to the discussion section in the updated version.

**Certain sections of the paper are not well-written. For instance, the model details for TSFNP are only available in the appendix, making it challenging to grasp the complete model when reading the main manuscript.**

We chose to keep the main manuscript focused on the key technical contributions of our work, namely the Refinement Module and Soft Distributional Coherency Regularization. As TSFNP is a small modification of previous work, we provide its details in the Appendix so that the main manuscript remains self-contained and easy to read for a broader audience. Furthermore, to avoid any confusion, we have taken care not to use any TSFNP-specific notation in the main manuscript, and we denote all latent variables as $\mathcal{Z}$. This also enables readers to easily leverage ProfHiT with any differentiable base forecast model of their choice depending on the application.