# B2F NeurIPS rebuttal
# Reviewer 1
We thank the reviewer for going through our work and providing valuable feedback.
**It does not seem to me that the authors adequately show that backfill dynamics are actually useful for forecasting. The authors do show improvements to existing models when backfill sequences are used to refine existing models, but that could be attributable to e.g. existing models not using some of the features that the authors used. I would be more convinced of this if the authors showed that a model for forecasting (not refining) using backfill sequences (e.g. removing the model predictions from ModelPredEnc) outperforms a model trained on normal sequences.**
We believe we convincingly showed the importance of modelling backfill dynamics through our experimental setup:
1. **Data:** We chose the most representative features, which have been widely used by the 30+ teams participating in CovidHub, including our candidate models. For example, DeepCovid [1], one of the top models, uses a superset of all our features. Other previously published works [3, 4] leverage some of the line-list and mobility features, and many other teams use their own proprietary datasets as well.
Hence, the performance gains cannot be attributed merely to the features considered; we perform better because we additionally model backfill sequences via our framework.
2. **Baselines:** Further, note that the *PredRNN* baseline in our experiments takes only model predictions as input, and our *FFNReg* baseline uses only the real-time features without backfill sequences. Both baselines perform sub-optimally and even show negative improvements in many cases. This shows that modelling backfill is both necessary and useful for improving model predictions.
3. **Methodology:** Finally, we emphasize that our main contribution is not to design a new COVID-19 forecasting model but to inject the knowledge of backfill into the existing prediction pipeline. Therefore, Back2Future improves *any* model's (statistical or mechanistic/simulation-based) predictions in a real-time setting.
4. **Leaderboard problem:** As an aside, we also note that the leaderboard problem and the corresponding results do not explicitly use any candidate model. Therefore, the improvements there are due solely to Back2Future.
**For example, I would recommend moving subsection 2.1 onwards into a separate section and renaming the remaining part to "preliminaries".** **The paper is also excessively verbose, and the writing/phrasing can be improved in general.** **Some of the small figures in Section 2.1 are also not legible when printed on paper.**
Thank you for the comments. We will move Sec 2.1 to a new preliminaries/setup section and also increase the font/figure sizes to improve readability.
**Results in Table 2 should have confidence intervals**
We have performed significance tests for all our results by running each experiment 5 times and reported that the improvements are indeed statistically significant (footnote on page 8). We also observed very low variance across runs and will be happy to add the variances to the supplementary for reference.
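For reference, below is a minimal sketch of the kind of paired test one can run over 5 matched runs; the numbers and the choice of a paired t-test are placeholders for illustration only, not values or details from the paper.

```python
# Illustrative sketch only: placeholder MAE values, not numbers from the paper.
from scipy.stats import ttest_rel

mae_candidate = [10.2, 10.5, 10.1, 10.4, 10.3]   # candidate model, 5 runs
mae_refined   = [ 9.1,  9.3,  9.0,  9.4,  9.2]   # Back2Future-refined, same 5 runs

t_stat, p_value = ttest_rel(mae_candidate, mae_refined)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```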
**The authors should discuss the mechanism and causes for the frequent feature revisions in their COVID dataset...**
We mentioned the typical causes of backfill in Lines 31-33, along with relevant references such as [2]. Identifying the exact causes of reporting delays and revisions for a specific ongoing pandemic like COVID-19 typically takes many months and is still an open research problem. Note that, in spite of this uncertainty, our framework is robust to such anomalies (see the last set of results in Section 4).
**The authors should briefly describe their model selection method: this seems non-trivial given the temporal nature of the evaluation.**
The insights from the data analysis alluded to in Sec 2.2 guided the design of our framework and architecture. Most of the hyperparameters required only light tuning, and performance was not very sensitive to their values (Line 704 of the supplementary).
We also used only a small subset of the data when initially building our model and tuning the hyperparameters: the first 8 of the 30 weeks of the *CovDS* dataset and only 5 of the 50 states.
In addition, we did not consider the leaderboard setup at all for model selection. We will add these details in the experimental setup.
We also note that this subset of the dataset exhibits backfill behaviors similar to those found in Sec 2.2: a median backfill error (BErr) of 55.58%, an average stability time (STime) of 5.3 weeks, and a difference between stable and real-time MAE (Obs 5) of 13.5. We provided the statistics for the entire dataset in Sec 2.2 only for completeness.
We further show the performance of Back2Future on an additional 6 months of test data. We report the results below for all weeks from Jan 2021 to June 2021 with the same real-time setup, *without* making any changes to the model hyperparameters. We see improvements to model predictions similar to those reported in the paper (Sec 4, Table 2). We also observed a 43.2% reduction in the difference between real-time and stable MAE for the leaderboard problem, similar to the 51.7% decrease reported in Line 385. Thus, we provide a clear and consistent improvement on our tasks.
The robustness of our architecture over a period of a year (June 2020 to June 2021) of an evolving pandemic is noteworthy. We can add these results in the final version.
| Candidate Model | Refining model | K=2 MAPE | K=2 MAE | K=3 MAPE | K=3 MAE | K=4 MAPE | K=4 MAE |
|-----------------|----------------|----------|---------|----------|---------|----------|---------|
| Ensemble | Back2Future | 4.39 | 5.25 | 4.19 | 3.31 | 3.15 | 4.41 |
| GT-DC | Back2Future | 10.33 | 11.84 | 10.93 | 10.79 | 11.27 | 9.92 |
| Umass-MB | Back2Future | 4.66 | 5.43 | 4.89 | 5.68 | 3.11 | 3.32 |
| CMU-TS | Back2Future | 8.04 | 7.5 | 8.23 | 6.42 | 6.22 | 5.73 |
Table 1: Percentage improvement due to refinement by Back2Future for the time period Jan 2021 to June 2021, averaged over 50 states.
**The authors mentioned that all of the models they rectified are the "top" models of COVID-19 Forecast Hub - does the trend still hold for the weaker models?**
We do observe even larger improvements on some of the weaker CovidHub models we evaluated. For instance, we saw over 20% improvements for the IHME-SEIR model, which was placed near the middle of the Hub leaderboard. However, we focus on the presumably harder-to-improve top-performing models in our paper to candidly showcase the impact of Back2Future on real-time COVID-19 forecasting.
## References
[1] Rodriguez, Alexander, et al. "Deepcovid: An operational deep learning-driven framework for explainable real-time covid-19 forecasting." AAAI 2020
[2] Reich, Nicholas G., et al. "A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States." PNAS 2019
[3] Arik, Sercan O., et al. "Interpretable sequence learning for COVID-19 forecasting." NeurIPS 2020
[4] Jin, Xiaoyong, Yu-Xiang Wang, and Xifeng Yan. "Inter-Series Attention Model for COVID-19 Forecasting." SDM 2021
---
# Reviewer 2
We thank the reviewer for going through our work and providing valuable feedback.
**What safeguards were put in place to avoid tuning the methodology to the evaluation data? Was all B2F model development done referencing only data before the evaluation phase of June 2020?**
Most of the hyperparameters required only light tuning, and performance was not very sensitive to their values (Line 704 of the supplementary).
We also used only a small subset of the data when initially building our model and tuning the hyperparameters: the first 8 of the 30 weeks of the *CovDS* dataset and only 5 of the 50 states.
In addition, we did not consider the leaderboard setup at all for model selection. We will add these details in the experimental setup.
We also note that this subset of the dataset exhibits backfill behaviors similar to those found in Sec 2.2: a median backfill error (BErr) of 55.58%, an average stability time (STime) of 5.3 weeks, and a difference between stable and real-time MAE (Obs 5) of 13.5. We provided the statistics for the entire dataset in Sec 2.2 only for completeness.
To further show the performance of Back2Future on an additional 6 months of unseen test data, we report below the results for all weeks from Jan 2021 to June 2021 with the same real-time setup, without making any changes to the model hyperparameters. We see improvements to model predictions similar to those reported in the paper (Sec 4, Table 2). We also observed a 43.2% reduction in the difference between real-time and stable MAE for the leaderboard problem, similar to the 51.7% decrease reported in Line 385. Thus, we provide a clear improvement on our tasks.
The robustness of our architecture over a period of a year (June 2020 to June 2021) of an evolving pandemic is noteworthy. We can add these results in the final version.
| Candidate Model | Refining model | K=2 MAPE | K=2 MAE | K=3 MAPE | K=3 MAE | K=4 MAPE | K=4 MAE |
|-----------------|----------------|----------|---------|----------|---------|----------|---------|
| Ensemble | Back2Future | 4.39 | 5.25 | 4.19 | 3.31 | 3.15 | 4.41 |
| GT-DC | Back2Future | 10.33 | 11.84 | 10.93 | 10.79 | 11.27 | 9.92 |
| Umass-MB | Back2Future | 4.66 | 5.43 | 4.89 | 5.68 | 3.11 | 3.32 |
| CMU-TS | Back2Future | 8.04 | 7.5 | 8.23 | 6.42 | 6.22 | 5.73 |
Table 1: Percentage improvement due to refinement by Back2Future for the time period Jan 2021 to June 2021, averaged over 50 states.
**Regarding prior literature**
Thank you for the references. Ref #3, Ref #4, and Ref #5 do not explicitly discuss backfill, while Ref #8 mentions it but does not address it. The remaining papers focus on univariate backfill in the target only (as the reviewer also noted, our focus is instead on the general problem of revisions in both features and target).
Within univariate backfill, most perform data assimilation and sensor fusion, which requires revision-free signals (Ref #1, Ref #2, Ref #7, Ref #10, Ref #11, and Ref #13). This line of work departs from our goal: we do not assume the existence of such signals; instead, our focus is on statistical correction based on the revision history.
Ref #6, Ref #9, and Ref #12 explicitly model the reporting process of event-based count variables, which is not adequate for several of the signals in our task, such as mobility, which is publicly available only as a normalized signal.
We have indeed cited Ref #13 already, and, for completeness, we will discuss all these references in our extended literature review.
We also note that we have discussed representative works from economics that enrich the context of our work.
**Page 4 line 146: The numbers given for stability times don't appear to correspond to figure 1c.**
The bar graph in Figure 1c shows the average STime over groups of features as grouped in Table 1. In contrast, the numbers reported in Line 146 correspond to the range of STime for individual signals in each state. For example, the 21-week average delay corresponds to the *covidnet* feature for the state of *Indiana*.
**Are these instances where a very large reporting spike was subsequently redistributed to prior dates in the time series? If so, (a) the real time error is not very meaningful; and (b) these errors can be easily handled by teams making outlier corrections to the data before feeding it into models when generating forecasts....**
We found only a small number of instances where reporting spikes were redistributed; one example occurred during the 1st and 2nd weeks of June. Even in these instances, recognizing such patterns is easy in hindsight but non-trivial in real time, especially during a pandemic in which modellers regularly encounter novel patterns.
Therefore, we consider the discrepancy between real-time and stable values for all datapoints. We also note that our framework does not require such anomaly detection techniques to work; in fact, it could potentially be used to detect anomalies by studying the difference between predicted and refined targets, as briefly mentioned in the conclusion.
## References
[1] Jin, Xiaoyong, Yu-Xiang Wang, and Xifeng Yan. "Inter-Series Attention Model for COVID-19 Forecasting." SDM 2021
[2] Qian, Zhaozhi, Ahmed M. Alaa, and Mihaela van der Schaar. "When and How to Lift the Lockdown? Global COVID-19 Scenario Analysis and Policy Assessment using Compartmental Gaussian Processes." NeurIPS 2020
[3] Hawryluk, I. et al., Gaussian Process Nowcasting: Application to COVID-19 Mortality Reporting, arXiv preprint arXiv:2102.11249
[4] David, Farrow. Modeling the Past, Present, and Future of Influenza. Ph.D. dissertation
[5] Osthus, Dave, Ashlynn R. Daughton, and Reid Priedhorsky. "Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited." PLoS (2019)
[6] Lawless, J. F. "Adjustments for reporting delays and the prediction of occurred but not reported events." Canadian Journal of Statistics 22.1 (1994): 15–31.
---
# Reviewer 3
We thank the reviewer for going through our work and providing valuable feedback.
**Given that the paper is not theoretically grounded, I am not really sure whether the advantage of the proposed method is generalizable to other datasets or domains.**
Our contributions are generalizable to other forecasting domains, and Back2Future is methodologically grounded through its technical insights and design choices.
Firstly, our work proposes a general framework that formalizes the problems that data revisions pose for real-time time-series forecasting. Specifically, we introduce the refinement and leaderboard problems that arise due to data revisions. Therefore, our framework can be easily extended to other domains that encounter data revision issues, such as economics, climate science, and social science (some of which we mention in the introduction as well).
Our method, Back2Future, is also methodologically grounded: we formulate a *general model* that helps improve the performance of *any kind* of candidate model (statistical or mechanistic, including theory- and simulation-based):
1. We propose a new neural architecture that uses deep sequential modules to capture revision patterns, and graph neural networks to automatically exploit sparse relations across individual features that exhibit similar historical backfill behaviors (a minimal illustrative sketch follows this list).
2. Further, we embed the candidate model's general prediction bias with respect to backfill, via its prediction history along with backfill patterns, to aid refinement.
3. We also introduce a two-step training schedule that involves model-agnostic pre-training to accelerate fine-tuning for specific candidate models.
4. Finally, due to the generality of our method, Back2Future can be easily applied to any general time series forecasting problem that is affected by data revision.
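For illustration only, here is a minimal, hypothetical sketch of how components like those in points 1 and 2 could be wired together. The module names, tensor shapes, and the simple one-hop neighborhood aggregation (standing in for a full GNN layer) are our assumptions for this sketch and do not reproduce Back2Future's actual architecture.

```python
# Hypothetical sketch (PyTorch): sequential encoding of backfill sequences,
# a one-hop feature-graph aggregation, and a prediction-history encoder.
import torch
import torch.nn as nn

class BackfillRefiner(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        # Point 1: a GRU shared across features encodes each feature's backfill
        # (revision) sequence; a linear layer applied over a feature-similarity
        # adjacency matrix stands in for a GNN layer.
        self.seq_enc = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.graph_lin = nn.Linear(hidden, hidden)
        # Point 2: encoder for the candidate model's own prediction history.
        self.pred_enc = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        # Refinement head maps the combined embedding to a corrected target.
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, bseqs, adj, pred_hist):
        # bseqs:     (n_features, seq_len, 1) backfill sequences
        # adj:       (n_features, n_features) feature-similarity adjacency
        # pred_hist: (1, hist_len, 1) candidate model's prediction history
        _, h = self.seq_enc(bseqs)                           # (1, n_features, hidden)
        h = torch.relu(self.graph_lin(adj @ h.squeeze(0)))   # one-hop aggregation
        _, p = self.pred_enc(pred_hist)                      # (1, 1, hidden)
        z = torch.cat([h.mean(dim=0), p.squeeze()], dim=-1)
        return self.head(z)                                  # refined (corrected) target

# Example with placeholder tensors: 5 features, revision length 10, history length 8.
model = BackfillRefiner()
refined = model(torch.rand(5, 10, 1), torch.eye(5), torch.rand(1, 8, 1))
```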
**the hyperparameter settings are given in the appendix, which is good, but it seems that how they were chosen is not clearly specified.**
We used only a small subset of the data when initially building our model and tuning the hyperparameters: the first 8 of the 30 weeks of the *CovDS* dataset and only 5 of the 50 states.
In addition, we did not consider the leaderboard setup for model selection. We will add this detail to the experimental setup.
We note that this subset of the dataset also exhibits backfill behaviors similar to those found in Sec 2.2: a median backfill error (BErr) of 55.58%, an average stability time (STime) of 5.3 weeks, and a difference between stable and real-time MAE (Obs 5) of 13.5. We provided the statistics for the entire dataset in Sec 2.2 only for completeness.
We emphasize that we followed common practice from recently published works [1, 2, 3, 4, 5, 6, 7, 8] in this space: we formulated a controlled setup over a fixed period (June 2020 to Dec 2020) on which we evaluated our model and the baselines on the refinement tasks.
To further show the robustness of the hyperparameters, we evaluated Back2Future on an additional 6 months of unseen test data. We report below the results for all weeks from Jan 2021 to June 2021 with the same real-time setup, without making any changes to the model hyperparameters. We see improvements to model predictions similar to those reported in the paper (Sec 4, Table 2). We also observed a 43.2% reduction in the difference between real-time and stable MAE for the leaderboard problem, similar to the 51.7% decrease reported in Line 385. Thus, we provide a clear improvement on our tasks.
The robustness of our architecture over a period of a year (June 2020 to June 2021) of an evolving pandemic is noteworthy. We can add these results in the final version.
| Candidate Model | Refining model | K=2 MAPE | K=2 MAE | K=3 MAPE | K=3 MAE | K=4 MAPE | K=4 MAE |
|-----------------|----------------|----------|---------|----------|---------|----------|---------|
| Ensemble | Back2Future | 4.39 | 5.25 | 4.19 | 3.31 | 3.15 | 4.41 |
| GT-DC | Back2Future | 10.33 | 11.84 | 10.93 | 10.79 | 11.27 | 9.92 |
| Umass-MB | Back2Future | 4.66 | 5.43 | 4.89 | 5.68 | 3.11 | 3.32 |
| CMU-TS | Back2Future | 8.04 | 7.5 | 8.23 | 6.42 | 6.22 | 5.73 |
Table 1: Percentage improvement due to refinement by Back2Future for the time period Jan 2021 to June 2021, averaged over 50 states.
## References
[1] Jin, Xiaoyong, Yu-Xiang Wang, and Xifeng Yan. "Inter-Series Attention Model for COVID-19 Forecasting." SDM 2021
[2] Qian, Zhaozhi, Ahmed M. Alaa, and Mihaela van der Schaar. "When and How to Lift the Lockdown? Global COVID-19 Scenario Analysis and Policy Assessment using Compartmental Gaussian Processes." NeurIPS 2020
[3] Adhikari, Bijaya, et al. "Epideep: Exploiting embeddings for epidemic forecasting." KDD 2019.
[4] Zimmer, Christoph, and Reza Yaesoubi. "Influenza Forecasting Framework based on Gaussian Processes." ICML, 2020.
[5] Rodríguez, Alexander, et al. "Steering a Historical Disease Forecasting Model Under a Pandemic: Case of Flu and COVID-19." AAAI 2021
[6] Arik, Sercan O., et al. "Interpretable sequence learning for COVID-19 forecasting." NeurIPS 2020.
[7] Rekatsinas, Theodoros, et al. "Forecasting rare disease outbreaks from open source indicators." Statistical Analysis and Data Mining: The ASA Data Science Journal 10.2 (2017): 136-150.
[8] Chakraborty, Prithwish, et al. "Forecasting a moving target: Ensemble models for ILI case count predictions." ICDM 2014
---
# Reviewer 4
We thank the reviewer for going through our work and providing valuable feedback.
**On a meta-level, I believe the authors have a responsibility to present their claims reasonably. Making unsubstantiated statements like calling their work "the best COVID-19 forecasting model" or overstating how "impressive" their results are could have negative social impact. I would recommend tempering their claims.**
It was not our intent to develop the 'single best' COVID-19 forecasting model; instead, the paper is built around recognizing the central role of multiple models in practice (as recent experiences with flu, dengue, and COVID-19 forecasting show). B2F is a refinement/rectification method, not a standalone forecasting model (hence B2F-Ensemble, B2F-DC, etc.). Our focus is on showing that (a) backfill is an important and complex issue; and (b) we can obtain meaningful improvements for any model (statistical, mechanistic/simulation-based, etc.) and for the leaderboard evaluation setup, in real time, by incorporating backfill systematically using neural models. We will clarify this to prevent any confusion about our intent and our focus on backfill.
## Concerns regarding evaluation
**A proposed model for real-time forecasting should be evaluated in real-time, not on out-of-sample historical data....but the pandemic is still ongoing across the world,... it would greatly strengthen your results if you could show that your model works in the real world...If this model has been evaluated in real-time, please mention so.**
As mentioned above, our focus was not to provide the 'single best' COVID-19 forecasting model, but to introduce backfill and Back2Future as first-class citizens in the space of real-time forecasting and thereby improve any model's (statistical or mechanistic) outputs.
Secondly, to evaluate the effect of backfill, note that we follow the standard setup used for reporting methodological innovations in the field of real-time disease forecasting (including recently published works in similar venues such as NeurIPS 2020 [3, 6], AAAI 2021 [12], SDM 2021 [4], ICML 2020 [11], KDD 2019 [10], PNAS 2012 [14], ASDM 2017 [15], and ICDM 2014 [16]). All these works stipulate a controlled setting over a fixed time period in which they emulate real-time forecasting to compare performance against other baselines.
Note that while there are papers [1, 2, 5] that describe real-time modelling challenges and experiences in more detail and provide many valuable insights, they do not follow a fully controlled setup to the extent needed for our experiments, which we require in order to cleanly quantify the benefit of methodological improvements.
E.g., we cannot know a priori whether some data signal will stop being reported, which necessitates re-running in a real-time-emulated setup.
Finally, as illustrated by Table 1 below, our model's performance is consistent on an *additional 6 months of new data* from Jan 2021 to June 2021 without making any changes to the hyperparameters. Therefore, the robustness of our architecture over a year of data from an evolving pandemic is noteworthy.
**Improving on past real-time forecasts by 3-5% seems near-trivial as we have access to time and knowledge that teams working in real-time did not have in the past**
We believe our experiments provide ample evidence that incorporating backfill using Back2Future leads to robust, consistent, and significant improvements across a wide range of models, as well as better real-time evaluation.
1. Our evaluation spans all 50 states of the US, unlike recently published works [3, 4, 6].
2. We clearly improve the predictions of top-ranking forecasting models.
The 3-7% figure is the average improvement for the Ensemble model only, averaged over both sparsely populated and larger states. As mentioned in Line 380, for some of the major states like Illinois and Texas we observed improvements of over 15%, which is significant.
3. Further, we saw over 10% improvements in at least 25 states for each of the individual models (Lines 373-375). These results are also noteworthy because Back2Future refinement can make any of these models the top-performing model in the Hub (Lines 376-377).
4. Concerning real-time evaluation, as mentioned in Lines 382-388, we also reduced the discrepancy between real-time and stable estimates by over 50%, with over 90% reductions in some states.
5. Finally, all our results are statistically significant as described in the paper (Page 8 footnote 2).
**Furthermore, the paper currently implies that there was only one testing set (ie no hold-out set) when hyperparameter tuning (lines 315-323), which would entirely invalidate the results. Please provide more detail on the evaluation scheme.**
Firstly, we again emphasize that we use the standard controlled setup adopted by recent works that introduce methodological innovations in this space [3, 4, 6, 10, 11, 12, 14], i.e., fixing a dataset over a period of time to emulate real-time forecasting for evaluating the methods and the baselines.
We also went beyond this standard by using only a small subset of the data when initially building our model and validating/tuning the hyperparameters: the first 8 of the 30 weeks of the *CovDS* dataset and only 5 of the 50 states. Most of the hyperparameters required only light tuning, and performance was not very sensitive to their values (Line 704 of the supplementary).
In addition, we did not consider the leaderboard setup for model selection. We will add this detail in the experimental setup.
We also note that this subset of the dataset exhibits backfill behaviors similar to those found in Sec 2.2: a median backfill error (BErr) of 55.58%, an average stability time (STime) of 5.3 weeks, and a difference between stable and real-time MAE (Obs 5) of 13.5. We provided the statistics for the entire dataset in Sec 2.2 only for completeness.
Finally, we further showcase the performance of Back2Future on an additional 6 months of completely new test data. We report below the results for all weeks from Jan 2021 to June 2021 with the same real-time setup, without making any changes to the model hyperparameters. We see improvements to model predictions similar to those reported in the paper (Sec 4, Table 2). We also observed a 43.2% reduction in the difference between real-time and stable MAE for the leaderboard problem, similar to the 51.7% decrease reported in Line 385. Thus, we provide a clear improvement on our tasks.
The robustness of our architecture for over a year (June 2020 to June 2021) of an evolving pandemic is noteworthy and provides immediate benefits. We can add these results in the final version.
| Candidate Model | Refining model | K=2 MAPE | K=2 MAE | K=3 MAPE | K=3 MAE | K=4 MAPE | K=4 MAE |
|-----------------|----------------|----------|---------|----------|---------|----------|---------|
| Ensemble | Back2Future | 4.39 | 5.25 | 4.19 | 3.31 | 3.15 | 4.41 |
| GT-DC | Back2Future | 10.33 | 11.84 | 10.93 | 10.79 | 11.27 | 9.92 |
| Umass-MB | Back2Future | 4.66 | 5.43 | 4.89 | 5.68 | 3.11 | 3.32 |
| CMU-TS | Back2Future | 8.04 | 7.5 | 8.23 | 6.42 | 6.22 | 5.73 |
Table 1: Percentage improvement due to refinement by Back2Future for the time period Jan 2021 to June 2021, averaged over 50 states.
**I would also be curious to see how the model works on hospitalization forecasting, not just mortality. Hospitalizations are not subject to delays in reporting nearly to the degree that deaths are, so my guess would be the gains from the model would be less**
Forecasting mortality is a standardized problem used by CovidHub [2], which regularly publishes performance values for this target. In contrast, evaluation on hospitalization was not yet standardized during our study period due to fundamental discrepancies in reporting across states over time (e.g., some states reported current hospitalizations, others new hospitalizations, and some cumulative counts). There was also much debate about the ground-truth values themselves.
Finally, we note that due to the generality of our framework, our refining approach could be applied to improve predictions of other targets such as hospitalizations and vaccination rates (or even to correct other features). This would be interesting future work.
**There are no uncertainty intervals provided for any of the results.**
We performed statistical significance tests over 5 runs and reported the outcome (footnote on page 8). We observed very low variance in the results across multiple runs and would be happy to include the variances in the supplementary.
**I would have liked to see section 2.2 explored in much higher detail. Which hospital features in particular are more delayed in reporting? How is the variation at the county-level?**
Our goal for Section 2.2 was to show that the backfill effect is significant and complex. Therefore, we presented the most important aggregate statistics to justify our scope.
We believe that our work will open up new research questions concerning data revision and generate other interesting problems, like some of the ones the reviewer has mentioned.
Therefore, we have also publicly released the dataset for further exploration.
**Is Figure 2 a hypothetical rendering of the five proposed canonical backfill patterns, or some sort of aggregate based on the K-means analysis? This whole subsection on the k-means analysis needs much more exposition in general. Can you visualize the results of the clustering?....**
As described in Lines 152-154 and the caption of Figure 2, the plots are the *centroids* of the clusters obtained by K-means clustering. We could not show the individual BSeqs as there are more than 30K of them. We used the DTW distance between sequences as the distance measure and chose 5 clusters to highlight the canonical behaviors. We can also add the elbow plot (aka scree plot) to the supplementary for reference.
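For reference, the clustering step could be reproduced along the following lines. This is an illustrative sketch only: the paper does not prescribe an implementation, and the use of tslearn, the placeholder data, and the variable names are our assumptions here.

```python
# Illustrative sketch: K-means with DTW distance over backfill sequences (BSeqs).
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

bseqs = np.random.rand(200, 20)          # placeholder: 200 BSeqs of length 20
X = to_time_series_dataset(bseqs)        # shape (n_sequences, seq_len, 1)

km = TimeSeriesKMeans(n_clusters=5, metric="dtw", random_state=0)
labels = km.fit_predict(X)

centroids = km.cluster_centers_          # the curves plotted (as in Figure 2) are centroids
# km.inertia_ over a range of n_clusters gives the elbow (scree) plot mentioned above.
```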
We would also like to emphasize that the goal of this analysis is to provide evidence of similar patterns across BSeqs. We can provide more details in the supplementary, but the exact nature of the clusters is not important for our method, and hence we did not go deeper.
## Other comments
**I'm not convinced that NeurIPS is the best venue for this paper,...**
We respectfully disagree: our work makes technical contributions to ML research, general time-series prediction, and real-time forecasting. Our contributions include 1) formulating novel, generalizable problems to tackle data revision issues for general forecasting problems; and 2) introducing a novel deep-learning architecture that leverages key technical insights to successfully tackle these problems.
Firstly, our work introduces the refinement and leaderboard problems that arise due to data revisions; these can be easily extended to other domains that encounter data revision issues, such as economics, climate science, and social science (some of which we mention in the introduction as well).
Secondly, our method, Back2Future, helps improve the performance of *any kind* of model (statistical or mechanistic, including theory- and simulation-based) by leveraging the following methodological innovations:
1. We propose a new neural architecture that uses deep sequential modules to capture revision patterns, and graph neural networks to automatically exploit sparse relations across individual features that exhibit similar historical backfill behaviors.
2. Further, we embed the candidate model's general prediction bias with respect to backfill, via its prediction history along with backfill patterns, to aid refinement.
3. We also introduce a two-step training schedule that involves model-agnostic pre-training to accelerate fine-tuning for specific candidate models.
4. Finally, due to the generality of our method, Back2Future can be easily applied to any general time series forecasting problem that is affected by data revision.
**The backfill issue, the main motivating premise behind the paper, is poorly explained. I would recommend using concrete examples instead of just mathematical jargon...** **The introduced definitions and abbreviations are largely unnecessary, and just make the paper more difficult to read...** **Section 3 is written opaquely in a way that overuses mathematical notation and jargon...**
Since formalizing the refinement and leaderboard problems with respect to a general real-time forecasting setup is an important contribution of our paper, we provide formal definitions and terminology. Due to the unconventional nature of our problem, where the values in the dataset are indexed by multiple components (region, feature, observation, and revision week), we chose to introduce and explain these notations throughout the setup and Sec 3. We will try to add more informal comments and explanations where applicable.
In Section 2.2, we applied our definitions to the CovDS dataset, which provides a real-world example. We will also add a specific example when formalizing the definitions to improve readability.
We also thank the reviewer for the suggestion of adding a visual diagram. We can add a figure showing the entire pipeline of Back2Future for clarity.
**Terms like backfill, exogenous, signal, target, set, and sequence are used throughout the paper in ways that are not typically used for either machine learning or public health practitioners....**
To the best of our knowledge, *backfill* is a well-known term in computational epidemiology and public health research [5, 8, 9]. Other terms like *exogenous* and *signal* are commonly used in time-series modelling and ML [13]. Since our primary audience is the ML community, we formally defined the terms related to real-time forecasting and the backfill problem for a general (application-agnostic) real-time forecasting setting.
**The written English is often awkwardly phrased, and there are also many typos throughout the paper. I recommend having a colleague proofread the paper.**
Although we could not find any typos, we will go over the paper again and smooth out any awkwardly phrased sentences.
## References
[1] Rodriguez, Alexander, et al. "Deepcovid: An operational deep learning-driven framework for explainable real-time covid-19 forecasting." AAAI 2020
[2] Cramer, Estee Y., et al. "Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US." medRxiv (2021).
[3] Arik, Sercan O., et al. "Interpretable sequence learning for COVID-19 forecasting." NeurIPS 2020
[4] Jin, Xiaoyong, Yu-Xiang Wang, and Xifeng Yan. "Inter-Series Attention Model for COVID-19 Forecasting." SDM 2021
[5] Reich, Nicholas G., et al. "A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States." PNAS 2019
[6] Qian, Zhaozhi, Ahmed M. Alaa, and Mihaela van der Schaar. "When and How to Lift the Lockdown? Global COVID-19 Scenario Analysis and Policy Assessment using Compartmental Gaussian Processes." NeurIPS 2020
[7] Wu, Dongxia, et al. "Quantifying Uncertainty in Deep Spatiotemporal Forecasting." arXiv preprint arXiv:2105.11982 (2021).
[8] Osthus, Dave, Ashlynn R. Daughton, and Reid Priedhorsky. "Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited." PLoS 2019
[9] Brooks, Logan C., et al. "Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions." PLoS 2018
[10] Adhikari, Bijaya, et al. "Epideep: Exploiting embeddings for epidemic forecasting." KDD 2019.
[11] Zimmer, Christoph, and Reza Yaesoubi. "Influenza Forecasting Framework based on Gaussian Processes." ICML, 2020.
[12] Rodríguez, Alexander, et al. "Steering a Historical Disease Forecasting Model Under a Pandemic: Case of Flu and COVID-19." AAAI 2021
[13] Box, George EP, et al. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[14] Shaman, Jeffrey, and Alicia Karspeck. "Forecasting seasonal outbreaks of influenza." PNAS 2012
[15] Rekatsinas, Theodoros, et al. "Forecasting rare disease outbreaks from open source indicators." Statistical Analysis and Data Mining: The ASA Data Science Journal 10.2 (2017): 136-150.
[16] Chakraborty, Prithwish, et al. "Forecasting a moving target: Ensemble models for ILI case count predictions." ICDM 2014