# Reviewer 1 gQHv
We would like to sincerely thank Reviewer gQHv for the valuable comments and for recognizing the value of our work.
### Response to Weaknesses
> W1: multivariate parts that author claim to be new seems to not be really in the novel parts, but multi-step ones are main focus. Otherwise, kindly describe why existing methods by Stankeviciute et al. (2021) cannot handle this setting (with small modifications).
We agree with reviewer gQHv's comment. In Stankeviciute et al. (2021), the algorithm handles only univariate time series; the multivariate setting was mentioned as a future direction. We made a small modification to cover multivariate data by computing the vector norm in Eqn (1).
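To make this concrete, below is a minimal sketch of the modified score, assuming forecasts and targets are arrays of shape `(horizon, dim)`; the function name and shapes are illustrative, not the exact notation of Eqn (1).

```python
import numpy as np

def nonconformity_scores(y_hat: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Per-timestep nonconformity scores for a multivariate forecast.

    For dim == 1 this reduces to the absolute error used in the
    univariate setting; for dim > 1 we take the Euclidean norm of the
    error vector at each timestep.
    """
    # y_hat, y: (horizon, dim) -> scores: (horizon,)
    return np.linalg.norm(y_hat - y, axis=-1)
```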
Our key contribution is multi-step forecasting. Due to the curse of dimensionality, the uncertainty regions often become very large for multivariate time series if we directly apply CP or Stankeviciute et al. (2021)'s approach. We developed our method, CopulaCPTS, to significantly shrink the uncertainty regions while still maintaining coverage, especially for multivariate datasets (see Table 1). To summarize, we present the multivariate setting as a motivating and useful application area for this work.
> W2: enhance UQ literatures and discussion on other classes. For example there are several advances in forecasting based on quantile regresison methods for univariate and multivariate https://arxiv.org/abs/2202.11316,https://arxiv.org/abs/2111.06581, and even conformal prediction enhance quantile regression method itself https://arxiv.org/abs/1905.03222.
Thank you for the valuable suggestion! We have incorporated these papers in our literature review section.
Even though quantile methods are widely used for UQ in time-series forecasting, they do not give finite-sample guarantees on coverage, so they are not directly comparable to this work. In the CP section we have discussed conformalized quantile regression (Romano et al., 2019) and cited a later work (Sousa et al., 2022).
Please understand that in the interest of space, we have kept the discussion concise.
> W3: more deep dive on behvarious $\alpha$. For example, is this related to the author claim the capability of capturing dependency?
Please see the updated Appendix C.6 for a study on $\alpha_h$. Let us know if it clarifies the role $\alpha_h$ plays in representing the correlation.
> W4: the experiment setting are poorly described and difficult to examine univatie/multivate cases and which various base forecasters are used and why.
We have edited sections 5.1 and 5.2 and Appendix C.1 for better clarity on the experimental settings and model selection.
### Response to Clarity
> in eq (7) , h is typo?
Yes, thank you for the note. We've corrected it in the updated version of the paper.
> the data split of training and val is not clearly described, especially multivatie settings.
Please see our response to W4 above for the updated description.
Thank you again for your thorough review and insightful questions! Let us know if anything remains unclear.
# Reviewer 3 oH8D
We sincerely thank reviewer oH8D for providing thorough and insightful comments.
> -: many typos in the paper, suggesting a rushed in submission. Among others:
We apologize for the typos! The updated version has them all corrected. Thank you for carefully reading our paper.
> -: in the case of auto-regressive, it is unclear how is treated the newly produced point in order to produce further predictions. In particular, since the newly predicted point is a conformal sets, do authors run conformal prediction using weak-labels/intervals (as in "Cauchois, M., Gupta, S., Ali, A., & Duchi, J. (2022). Predictive Inference with Weak Supervision."), or do they simply use the new point-wise prediction? What guarantees that the previous validity guarantees with respect to the ground-truth will hold (as the prediction is not the true value)? Also, how does one obtain the cumulative distribution for the non-conformity scores of the k+1 prediction? Does one use the same cumulative distribution, and if this is the case, why do we need to re-run an optimisation?
In our current setup we only use the point-wise predictions for autoregressive forecasting. The coverage guarantee always applies only to the $k$ steps predicted by the model, whether they are time steps $\{t+1, \ldots, t+k\}$ or $\{t+2, \ldots, t+k+1\}$, etc. The algorithm described in the paper does not give a cumulative coverage guarantee over more than $k$ steps of forecasts.
We assume we have validation data for the full length of the autoregressive forecasts. This way, we model the copula for the uncertainty of _the first $k$ steps of the autoregressive forecast_, _steps $t+2$ to $t+k+1$_, and so on. As outlined in paragraph 2 of section 4.3 of the paper, we may or may not need to refit for each of these cases.
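As a purely illustrative sketch of this sliding-window calibration (the names and slicing are our simplification, not the paper's implementation):

```python
import numpy as np

def calibration_windows(val_errors: np.ndarray, k: int) -> list:
    """Slice held-out validation errors into length-k windows, one per
    autoregressive offset: steps {t+1..t+k}, {t+2..t+k+1}, and so on.
    A separate copula could then be calibrated per window (or reused,
    as discussed in section 4.3)."""
    T = len(val_errors)
    return [val_errors[start:start + k] for start in range(T - k + 1)]
```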
We understand that it is desirable to have coverage guarantees over the entire forecasting horizon, or to propagate predicted intervals into future predictions. However, these go beyond the scope of our paper, so we leave them for future work.
> -: While the optimisation program introduced to find suitable time-step wise confidence degrees seems fine, it is unclear what we can/should expect from it? In particular, shall we in practice have very similar confidence degrees or very imbalanced ones? What justify to have different confidence degrees at each run? Why not optimizing to minimize the obtained volume? How does it compare to the constant time dichotomic search assuming all confidence degrees being equal?
Please see the newly added appendix C.6 for a study on finding $\alpha_h$.
Please let us know if any of the explanations are unclear! Thank you again.
# Reviewer 4 FqUe
We would like to sincerely thank Reviewer FqUe for providing these comments. We believe there are some misunderstandings, as we will outline in this response.
### Response to Weakness 1
> the ICP validity definition on coverage (Definition 1) is not correct (unless there are reasons, which are not described). The probability needs to be taken over X, Y, and also a calibration set (e.g., [R1])
We respectfully disagree. Our definition of validity follows seminal works on conformal prediction: see Proposition 1 of Vovk (2012) [1] and the introduction of Lei et al. (2018) [2]. We want to point out that $\mathcal{P}$ in our Definition 1 _**is**_ the data distribution of $(X,Y)$. In the regression case, we can represent the training, calibration, and test data as an exchangeable set of samples drawn from distribution $\mathcal{P}$. Theorem 1 in [R1] is identical to our formulation; it omits the subscript because the coverage inequality holds for an arbitrary data distribution $\mathcal{P}$.
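For reference, and paraphrasing our Definition 1 with $\mathcal{P}$ made explicit, the validity statement is
$$
\mathbb{P}_{(X, Y) \sim \mathcal{P}}\left[\, Y \in \Gamma^{1-\alpha}(X) \,\right] \geq 1 - \alpha .
$$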
[1] Vovk, Vladimir. "Conditional validity of inductive conformal predictors." Asian conference on machine learning. PMLR, 2012.
[2] Lei, Jing, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. "Distribution-free predictive inference for regression." Journal of the American Statistical Association 113, no. 523, 2018.
> Also, this coverage guarantee is not used consistently. In particular, the validity statement in Proposition 1 contains the expectation over y, but to my understanding, the probability here is also taken over y.
The expectation in the statement of Proposition 1 is a typo; we sincerely apologize for it. The inequality should be on the probability, as written in the proof, not on the expectation. The content of the proof (in appendix A) is correct and reflects what we mean by the coverage guarantee.
We have removed the extraneous expectation notation in the updated version of the paper. We believe Proposition 1 in the paper is now correct.
> In particular, (11) assumes $u^*$ is given, which needs to be found in Line 15 of Algorithm 1 based on a calibration set. As $u^*$ depends on a calibration set, it is a random vector.
We want to highlight that $u^*$ does not depend on the calibration set, but only on the copula function $C$. Given a copula $C$, the set of values $u^*$ such that $C(u^*) = 1-\alpha$ is fixed. Sklar's theorem (Theorem 1) guarantees the existence of such a $u^*$. The proposition is proven with this exact $u^*$; we assume our search for $u^*$ is accurate by the law of large numbers.
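To illustrate, here is a minimal sketch of how such a $u^*$ can be located on an estimated copula, restricting to a symmetric $u^* = (u, \ldots, u)$ for brevity (the actual Algorithm 1 searches over possibly unequal coordinates):

```python
import numpy as np

def find_u_star(copula_samples: np.ndarray, alpha: float, tol: float = 1e-4) -> float:
    """Bisection for a symmetric u with C(u, ..., u) = 1 - alpha.

    copula_samples: (n, k) calibration scores mapped through the
    estimated marginal CDFs F_h, i.e., samples from the copula C.
    Since the empirical copula is monotone in u, bisection converges
    to the smallest u achieving the target joint level.
    """
    def C(u: float) -> float:
        # Empirical copula evaluated at the symmetric point (u, ..., u).
        return float(np.mean(np.all(copula_samples <= u, axis=1)))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if C(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return hi
```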
> Finally, it is unclear how the exchangeability assumption is used in the proof.
The exchangeability assumption is used when estimating the $\hat{F}$ functions (equations 6 and 7). Note that equation (6) is an ICP procedure; it holds because of the exchangeability assumption and the coverage guarantee that follows from it (equation 3). CopulaCPTS combines the results of $k$ individual ICP procedures into a cumulative guarantee.
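Concretely, each $\hat{F}_h$ can be taken as the empirical CDF of the calibration scores at step $h$; a minimal sketch with the standard finite-sample ICP correction (our notation, which may differ in detail from equation (6)):

```python
import numpy as np

def empirical_cdf(cal_scores_h: np.ndarray, s: float) -> float:
    """Empirical CDF estimate F_h(s) from calibration scores at step h.

    Exchangeability of calibration and test points is what makes this
    a valid (conservative) estimate of the true marginal CDF; the +1
    terms are the usual finite-sample ICP correction.
    """
    n = len(cal_scores_h)
    return (np.sum(cal_scores_h <= s) + 1) / (n + 1)
```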
Please let us know if this explanation is clear. If not, what can we add to the proof to clarify?
### Response to Weakness 2
> Missing baselines:
> - A score function can be simply be $\| \hat{\mathbf{y}} - \mathbf{y} \|$
> - by modifying RNN to estimate the standard deviations of $\hat{\mathbf{y}}$
We have included the requested baselines in Appendix C.7 of the updated paper. However, we justify our choice of not including them in the main text as follows.
Using the L2 norm over the entire forecast horizon as the nonconformity score has two drawbacks: (1) L2 distance does not give us the desired property of producing a cone of uncertainty for time-series forecasting. (2) From an application perspective, it is of limited use for decision-making. In the Covid forecasting scenario, we want to know the highest and lowest number of cases per day with high confidence. For robotics applications, we want to draw bounding boxes around our prediction so we can navigate safely. The L2 norm does not give direct guidance on the uncertainty of each timestep. (To visualize the uncertainty estimate, we assume for each timestep that all error comes from that one step; this results in very inefficient confidence regions.)
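For reference, here is a minimal sketch of how the L2-Conformal baseline in the tables below can be calibrated (our naming and shapes, as an assumption about the obvious implementation):

```python
import numpy as np

def l2_conformal_radius(y_hat_cal: np.ndarray, y_cal: np.ndarray, alpha: float) -> float:
    """Calibrate the L2-Conformal baseline: one scalar score per
    calibration sequence, namely the L2 norm of the full-horizon error.

    The returned radius defines a single ball around the entire
    forecast; it says nothing about how the error splits across
    timesteps, which is the inefficiency discussed above.
    """
    n = len(y_cal)
    scores = np.linalg.norm((y_hat_cal - y_cal).reshape(n, -1), axis=1)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # ICP quantile level
    return float(np.quantile(scores, q))
```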
Directly using an RNN to estimate the standard deviations is widely known to be prone to over-confidence [3], which is why more sophisticated approaches such as the papers you cited have been developed. There is also a misunderstanding here: if one directly estimates the variance with an RNN, one assumes that the error distribution is Gaussian. Conformal prediction is a distribution-free UQ method and avoids such assumptions. One would not need CP once one has a Gaussian estimate, since confidence regions can then be obtained directly, as we do in Figure 11(b) in Appendix C.7.
[3] Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. "On calibration of modern neural networks." In International conference on machine learning, pp. 1321-1330. PMLR, 2017.
Particle dataset $\sigma = 0.01$, with confidence $1-\alpha = 0.9$
| Method | Coverage (%) | Area $\downarrow$ |
| --- | ----------- | -- |
| L2-Conformal | $88.5 \pm 0.4$ | $7.21 \pm 0.35$ |
| Direct Gaussian| $11.9 \pm 0.09$ | $0.07 \pm 0.31$ |
| CF-RNN | $97.0 \pm 2.3$ | $3.13 \pm 3.24$ |
| Copula-RNN | $\textbf{91.5} \pm 2.1$ | $\textbf{1.06} \pm 0.36$ |
Particle dataset $\sigma = 0.05$, with confidence $1-\alpha = 0.9$
| Method | Coverage (%) | Area $\downarrow$ |
| --- | ----------- | -- |
| L2-Conformal | $89.7 \pm 0.6$ | $7.21 \pm 0.35$ |
| Direct Gaussian| $0.0 \pm 0.0$ | $0.08 \pm 0.02$ |
| CF-RNN | $97.0 \pm 2.3$ | $5.79 \pm 0.51$ |
| Copula-RNN | $\textbf{90.3} \pm 0.7$ | $\textbf{4.50} \pm 0.07$ |
### Response to Clarity
> Page 4: The citation (Gibbs & Candes, 2021) in the first paragraph is not appropriate; the cited paper attacks online learning, but the submitted paper does not consider the same setup, making readers confusing.
Gibbs & Candes also consider conformal prediction for time-series forecasting, which is related. However, their setting treats each time step independently (as in online learning), whereas we consider the correlations among the multiple forecast steps jointly.
In the conformal prediction section of related works, we have elaborated on this distinction.
> Page 5: why is (8) true? The Sklar’s theorem only implies the first equality.
For the first equality: by Sklar's theorem (Theorem 1), we have
$$
F(s_1, \ldots, s_k) = C(F_1(s_1), \ldots, F_k(s_k))
$$
Here the $F_h$ are CDFs, and $F_h(s_h)$ is the probability that the nonconformity score at time step $h$ is less than $s_h$. The $\alpha_h$ values are introduced as variables:
$$F_h(s_h) = 1-\alpha_h, \; h \in 1, \ldots, k$$
The last part of the equation is the objective for conformal prediction. We want to find the $\alpha_1, \ldots, \alpha_k$ values such that
$$
C(1-\alpha_1, \ldots, 1-\alpha_k) \geq 1 - \alpha
$$
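Putting the pieces together, the chain behind (8) is
$$
F(s_1, \ldots, s_k) = C(F_1(s_1), \ldots, F_k(s_k)) = C(1-\alpha_1, \ldots, 1-\alpha_k) \geq 1 - \alpha .
$$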
In practice, we perform the search in the case of equality, hence we have written (8) as it is. Your comment made us realize that it might be confusing, so we have updated section 4.2 of the paper to clarify.
>Page 5: epsilon_t below the (8) is not defined.
Thank you for pointing this out. It has been corrected to $\alpha_h$.
>Page 7: in “Metrics”, the expectation in the definition of coverage does not make sense to me.
As written in our Metrics section, coverage is calculated as
$$coverage_{1-\alpha} = \mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim \mathbf{X} \times \mathbf{Y}}\, P\big(\mathbf{y} \in \Gamma^{1-\alpha}(\mathbf{x})\big) \approx \frac{1}{n} \sum^n_{i=1} \mathbb{1}\big(\mathbf{y}_i \in \Gamma^{1-\alpha}(\mathbf{x}_i)\big),$$
which is the expectation over $\mathbf{X} \times \mathbf{Y}$ of the probability that our prediction region covers $\mathbf{y}$.
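In code, the right-hand side amounts to the following (a sketch assuming box-shaped regions given as per-step lower/upper bounds; names are illustrative):

```python
import numpy as np

def empirical_coverage(y_test: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> float:
    """Fraction of test sequences lying inside the predicted region at
    every forecast step, i.e., the joint (cumulative) coverage.

    y_test, lower, upper: arrays of shape (n, horizon, dim).
    """
    inside = np.all((y_test >= lower) & (y_test <= upper), axis=(1, 2))
    return float(np.mean(inside))
```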
> For the Covid19 dataset, it consists of 380 sequences; how can it be collected in UK regions? It would be better if the paper is self-contained. Is there any chance that the exchangeability assumption can be violated for this dataset?
Please see appendix C.2 for a detailed description of how the dataset is obtained and processed. The 380 sequences are daily new cases from 380 UK regions (e.g., Barnet, Worthing, East Devon). We follow the experimental setup of Stankeviciute et al. (2021) [4] and treat them as interrelated time series that are exchangeable.
[4] Stankeviciute, Kamile, Ahmed M. Alaa, and Mihaela van der Schaar. "Conformal time-series forecasting." Advances in Neural Information Processing Systems 34 (2021): 6216-6228.
# Thank you for your response
We appreciate your careful review and reply.
> please explicitly denote that the probability is also taken over calibration samples.
The probability is not taken over the calibration set. Please refer to Proposition 1 of [1]. The precise wording is
"the probability of error $Y_{l+1} \notin \Gamma^{\epsilon}(\ldots)$ does not exceed $\epsilon$."
which is a probability for the test sample $Y_{l+1}$.
[1] Vovk, Vladimir. "Conditional validity of inductive conformal predictors." Asian conference on machine learning. PMLR, 2012.
> The proof needs to consider the randomness from the calibration set. I think an asymptotic guarantee is okay, but the paper still needs to change notations.
It is true that in Proposition 1 we made the assumption that both $C$ and $\textbf{u}^*$ are accurately estimated; an asymptotic guarantee is achievable if we were to remove this assumption.
> Bonferroni correction, reusing calibration dataset.
We use the calibration set in exactly the same way as CF-RNN [2], by treating each forecasting timestep $h$ as a separate task. By the validity of ICP, each of these $k$ estimates is individually valid.
The Bonferroni correction, and hence CF-RNN, yields a cumulative guarantee because of Boole's inequality:
$$\mathbf{P}\Big[ \bigcup_{i=1}^k \big(p_i \leq \tfrac{\alpha}{k}\big) \Big] \leq \sum_{i=1}^k \mathbf{P}\Big[ p_i \leq \tfrac{\alpha}{k} \Big] = k \cdot \frac{\alpha}{k} = \alpha .$$
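For concreteness, a minimal sketch of this Bonferroni-style calibration (our variable names; each of the $k$ steps is calibrated at level $\alpha/k$, however correlated the steps are):

```python
import numpy as np

def bonferroni_radii(cal_scores: np.ndarray, alpha: float) -> np.ndarray:
    """Per-step conformal radii under the Bonferroni correction.

    cal_scores: (n, k) nonconformity scores on the calibration set.
    Each step h receives the (1 - alpha/k) empirical quantile, so by
    Boole's inequality the k steps jointly miscover w.p. at most alpha.
    """
    n, k = cal_scores.shape
    q = min(np.ceil((n + 1) * (1.0 - alpha / k)) / n, 1.0)
    return np.quantile(cal_scores, q, axis=0)
```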
In the second paragraph of Section 5.1, we noted that the Bonferroni correction is an upper bound over copula functions. The idea of our paper is that if we can find an accurate copula, then CopulaCPTS is theoretically guaranteed to produce smaller regions than CF-RNN.
Learning the copula with the calibration set does not constitute reusing it for calibration, as no conformal prediction takes place after the $k$ ICP procedures for the individual forecast steps. Although we did not provide an analysis of the accuracy of the copula estimate, the extensive experimental results show the effectiveness of our method.
[2] Stankeviciute, Kamile, Ahmed M Alaa, and Mihaela van der Schaar. "Conformal time-series forecasting." Advances in Neural Information Processing Systems 34 (2021): 6216-6228
> Instead, use the Guassian as a score function for ICP as ICP works with any score function
The RNN will not be able to output standard deviations if it is not trained with a Gaussian assumption.
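To illustrate what "trained with a Gaussian assumption" entails: the forecaster must be optimized with a Gaussian negative log-likelihood so that its variance head is meaningful. A minimal sketch of that loss (illustrative, not our training code):

```python
import numpy as np

def gaussian_nll(y: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> float:
    """Gaussian negative log-likelihood. Minimizing this, rather than
    plain MSE, is what forces a network's sigma output to track the
    predictive spread; without it the sigma head is unconstrained."""
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                         + (y - mu)**2 / (2 * sigma**2)))
```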
> Figure 11(b) demonstrates that if the intervals are correctly chosen based on ICP, “Direct Gaussian” may provide a smaller interval size than the proposed approach, while satisfying the coverage guarantee simply due to ICP.
This argument is ungrounded. The area at 12% or 0% coverage is not informative of the area at higher coverage levels; ICP methods calibrated for 12% coverage will also have very small areas, for that matter.
We do not know of any methods that combine Bayesian forecasting with ICP, and to our knowledge there is no simple way of combining them without compromising the generality of conformal prediction. We thank the reviewer for the comment and leave this for future work.