# Rebuttal
## Paper Revision Summary
We thank all the reviewers for their insightful feedback and suggestions for improving the paper. We are glad that the reviewers found that our paper studies a new and important problem, is novel, and contains a comprehensive comparison to two baselines together with an extensive experimental evaluation. Below is a summary of the major paper updates:
1. **[Section 1]** Adjust our presentation and fix some typos, following Reviewer $\color{Blue}{\rm{F33Z}}$ and Reviewer $\color{Green}{\rm{K4J1}}$'s suggestions.
2. **[Section 3 - Definition 1]** Revise Eq. (1) in Definition 1 and change the **notation of perturbation** ($\boldsymbol{\delta}_{[u:v]}$ denotes the perturbation vector and $\delta_t$ denotes a single perturbation value added to the time point $x_t$), following Reviewer $\color{Blue}{\rm{F33Z}}$, Reviewer $\color{Green}{\rm{K4J1}}$ and Reviewer $\color{Maroon}{\rm{SiPr}}$'s suggestions.
3. **[Section 3]** Add a detailed discussion on the significance of temporally-localized perturbations, following Reviewer $\color{Green}{\rm{K4J1}}$'s suggestions.
4. **[Section 4.1 - Eq. (2)]** Clarify the notation of the mask in Eq. (2), following Reviewer $\color{Blue}{\rm{F33Z}}$, Reviewer $\color{Green}{\rm{K4J1}}$ and Reviewer $\color{Maroon}{\rm{SiPr}}$'s suggestions.
5. **[Section 4.1 - Eq. (3)]** Fix the typo in Eq. (3), following Reviewer $\color{Maroon}{\rm{SiPr}}$'s suggestions.
6. **[Section 4.3 - Remark 2]** Add a discussion on why the MIA aggregation cannot tolerate a disagreement, following Reviewer $\color{Green}{\rm{K4J1}}$'s suggestions.
7. **[Section 4.3 - Remark 2]** Add a discussion on how to apply MIA to multivariate tasks, following Reviewer $\color{Green}{\rm{K4J1}}$'s suggestions.
8. **[Section 4.3 - Remark 2]** Add a discussion on the impact of certifying only $f_{\rm{MIA}}(\cdot)$ instead of $f(\cdot)$, following Reviewer $\color{Green}{\rm{K4J1}}$'s suggestions.
9. **[Section 4.3, Appendix]** Provide additional experiments evaluating MIA on multivariate forecasting tasks, following Reviewer $\color{Green}{\rm{K4J1}}$'s suggestions.
10. **[Section 4.4]** Add a discussion on why we adopt Gaussian augmentation when training imputation models, following Reviewer $\color{Maroon}{\rm{SiPr}}$'s suggestions.
11. **[Section 4.4, Appendix]** Provide additional experiments evaluating the impact of Gaussian augmentation, following Reviewer $\color{Maroon}{\rm{SiPr}}$'s suggestions.
12. **[Section 4.5]** Revise $k+ L_{adv} -1$ to $2 (k+ L_{adv} -1)$, following Reviewer $\color{Maroon}{\rm{SiPr}}$'s suggestions.
13. **[Section 4.5]** Add an explanation of why the comparison to the two baselines is important, following Reviewer $\color{Green}{\rm{K4J1}}$'s suggestions.
<br/>
Please also let us know if there are any other questions; we really look forward to the discussion with the reviewers to further improve our paper. Thanks!
## Review 1
We thank the reviewer for the valuable comment and sincerely apologize for the mistake in Eq. (1). We provide our responses below.
<br/>
> 1. In equation (1), what does it mean that the $\delta$ index decreases as $x$ index increases?
We are deeply sorry for the mistake in Eq. (1). We have revised Eq. (1) to be:
\begin{equation}
\begin{aligned}
& \mathbf{x}_{1:t_0} + \boldsymbol{\delta}_{[t_{adv}+1 :t_{adv}+L_{adv}]} \\
= & \mathbf{x}_{1:t_0} + [0, \ldots, 0, \delta_{t_{adv}+1}, \ldots, \delta_{t_{adv}+L_{adv}}, 0, \ldots, 0 ]\\
= & [x_1, \ldots, \; \underbrace{x_{t_{adv}+1}+\delta_{t_{adv}+1} , \; \ldots, \; x_{t_{adv}+ L_{adv}}+\delta_{t_{adv}+L_{adv}}}_{\text{Perturbed subsequence}} , \; \ldots, x_{t_0}]
\end{aligned}
\end{equation}
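For concreteness, below is a minimal NumPy sketch of how such a zero-padded, temporally-localized perturbation vector can be constructed (the function and variable names here are ours, purely for illustration):
```python
import numpy as np

def localized_perturbation(t0, t_adv, L_adv, delta_values):
    """Zero-padded perturbation vector: nonzero only at the (1-indexed)
    time points t_adv+1, ..., t_adv+L_adv."""
    assert len(delta_values) == L_adv and t_adv + L_adv <= t0
    delta = np.zeros(t0)
    delta[t_adv:t_adv + L_adv] = delta_values
    return delta

x = np.linspace(1.0, 2.0, 10)                 # toy series x_{1:10}
delta = localized_perturbation(t0=10, t_adv=3, L_adv=2, delta_values=[0.5, -0.5])
x_perturbed = x + delta                       # only x_4 and x_5 are changed
```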
Sorry again for the mistake in Eq. (1). Please let us know if there is additional unclarity.
<br/>
<br/>
> 2. Unclear definition of the mask in Eq. (2).
We thank the reviewer for the valuable comment and sincerely apologize for the unclear definition. For clarity, we redefine the mask in Eq. (2). We denote the mask by $M_{[u : v]}$. The masking operation, denoted by $\mathbf{x}_{1:t_0} \odot M_{[u : v]}$, zeroes out the values in the period $[u : v]$.
---
For ease of understanding, we present an example of the mask:
1. Consider a time series $\mathbf{x}_{1:5}$:
| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |
|----|----|----|----|----|
2. The mask $M_{[2 : 4]}$ is:
| 1 | 0 | 0 | 0 | 1 |
|----|----|----|----|----|
3. The masked series $\mathbf{x}_{1:5} \odot M_{[2:4]}$ is:
| $x_1$ | $\;0\;$ | $\;0\;$ | $\;0\;$ | $x_5$ |
|----|----|-----|----|----|
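As a complement, here is a minimal NumPy sketch of this masking operation under the same 1-indexed convention (the helper name is ours, for illustration only):
```python
import numpy as np

def make_mask(t0, u, v):
    """Binary mask M_{[u:v]}: zero on the (1-indexed) period u..v, one elsewhere."""
    mask = np.ones(t0)
    mask[u - 1:v] = 0.0
    return mask

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # x_{1:5}
M = make_mask(t0=5, u=2, v=4)              # [1, 0, 0, 0, 1]
masked = x * M                             # [x_1, 0, 0, 0, x_5]
```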
<br/>
<br/>
> 3. Would the cat-and-mouse game continue even with the proposed method when other types of attacks are considered?
We thank the reviewer for the insightful question, and we agree that MIA can hardly defend against other types of attacks (e.g., $\ell_2$-norm perturbations), because MIA is designed specifically to defend against temporally-localized perturbations. However, this does not reduce the significance of MIA. We emphasize that multi-attack defenses are typically built upon the development of single-attack defenses: e.g., the universal defense [1] is based on the specific $\ell_p$-perturbation defense [2], and the universal defense [3] is an extension of randomized smoothing [4]. Since MIA is the first $\ell_0$-norm defense in the domain of time series, we believe it will inspire more multi-attack defenses for time series in the future.
<br/>
<br/>
> 4. Even if a mask hides the perturbed area, it seems to make little sense since it is aggregated with the results from other masked time series where the perturbed area remains.
We thank the reviewer for the thoughtful question. We point out that our aggregation is **one-veto based aggregation**, not the commonly-used majority-vote based aggregation. In one-veto based aggregation, the MIA classifier only gives a prediction when there is no disagreement, which is an essential ingredient of the certified robustness. With our masking scheme, we guarantee that no temporally-localized perturbation can circumvent all the masks simultaneously; that is, there always exists a prediction on one of the masked series that is unaffected. With one-veto based aggregation, we further guarantee that **the class returned by the MIA classifier must equal the predictions on all the masked series**, so the return of MIA also equals the unaffected prediction. In other words, the class returned by the MIA classifier is also unaffected; otherwise the MIA classifier returns Alert.
---
For ease of understanding why one-veto based aggregation is necessary, we present a toy example of **0-1 binary classification** where $L_{mask}=1, L_{atk}=1$.
1. Given the input series $\mathbf{x}_{1:3}$, the MIA classifier $f_{\rm{MIA}} (\mathbf{x}_{1:3})$ returns **Class $0$**. The step size of the sliding mask is $L_{mask}-L_{atk}+1=1$. Based on the one-veto based aggregation of MIA, we have:
$f (\mathbf{x}_{1:3} \odot M_{[1:1]}) = f (\mathbf{x}_{1:3} \odot M_{[2:2]}) = f (\mathbf{x}_{1:3} \odot M_{[3:3]}) = f_{\rm{MIA}} (\mathbf{x}_{1:3}) = \text{Class 0}$.
2. Consider an adversary that aims to change the prediction by adding a temporally-localized perturbation ($L_{atk}=1$). WLOG, suppose the adversary takes the perturbation $\boldsymbol{\delta}_{[1:1]}$. Then the perturbed series $\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]}$ is as follows:
| $x_1+\delta_1$ | $x_2$ | $x_3$ |
|----|----|----|
3. The mask $M_{[1:1]}$ can completely occlude the perturbation. The masked series $(\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]}) \odot M_{[1:1]}$ is as follows:
| $0$ | $x_2$ | $x_3$ |
|----|----|----|
which is equal to $\mathbf{x}_{1:3} \odot M_{[1:1]}$. Thus we have $f((\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]}) \odot M_{[1:1]}) = \text{Class 0}$.
4. There are two possible returns of $f_{\rm{MIA}}(\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]})$: **Class 0** or **Alert**:
    1). $f_{\rm{MIA}}(\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]})$ returns **Class 0** if $f(\cdot)$ predicts **Class 0** on all the masked series.
    2). $f_{\rm{MIA}}(\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]})$ returns **Alert** if there exists a disagreement.
$f_{\rm{MIA}}(\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]} ) =
\begin{cases}
\text{Class 0} & f((\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]})\odot M_{[2:2]})= f((\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]})\odot M_{[3:3]})= \textbf{Class 0}\\
\text{Alert} & f((\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]})\odot M_{[2:2]}) \neq \textbf{Class 0} \textbf{ or } f((\mathbf{x}_{1:3}+ \boldsymbol{\delta}_{[1:1]})\odot M_{[3:3]}) \neq \textbf{Class 0}
\end{cases}$
Since $f_{\rm{MIA}}(\cdot)$ returns either **Class 0** or **Alert**, the prediction cannot be manipulated by the perturbation $\boldsymbol{\delta}_{[1:1]}$. We can prove that this conclusion also holds for $\boldsymbol{\delta}_{[2:2]}$ and $\boldsymbol{\delta}_{[3:3]}$ in a similar way.
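To make the one-veto aggregation concrete in code, below is a minimal sketch (purely illustrative; `masks` denotes the sliding masks from Step 1, `f` the base classifier, and the imputation of Step 2 is omitted for brevity):
```python
def mia_classify(f, x, masks, alert="Alert"):
    """One-veto aggregation: return the common label only when the predictions
    on ALL masked series agree; otherwise return an alert."""
    preds = [f(x * m) for m in masks]   # predict on every masked series
    return preds[0] if all(p == preds[0] for p in preds) else alert
```
With the masking guarantee above, at least one masked series is unaffected by the perturbation, so this aggregation can only return that unaffected label or Alert.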
<br/>
<br/>
[1] Croce, Francesco, and Matthias Hein. "Provable robustness against all adversarial $\ell_p$-perturbations for $p\geq 1$." International Conference on Learning Representations. 2020.
[2] Croce, Francesco, Maksym Andriushchenko, and Matthias Hein. "Provable robustness of relu networks via maximization of linear regions." the 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019.
[3] Hong, Hanbin, Binghui Wang, and Yuan Hong. "UniCR: Universally Approximated Certified Robustness via Randomized Smoothing." European Conference on Computer Vision. Springer, Cham, 2022.
[4] Cohen, Jeremy, Elan Rosenfeld, and Zico Kolter. "Certified adversarial robustness via randomized smoothing." International Conference on Machine Learning. PMLR, 2019.
## Review 2
We thank the reviewer for the insightful comments and suggestions. Our new ICLR submission is an updated version that has addressed several concerns from the reviews, such as the typos and the unclear presentation of the technical part. In addition, in our current version, we have added experiments evaluating MIA on multivariate forecasting datasets. Next, we provide detailed answers to the questions.
<br/>
> 1. The practical significance to detect $\ell_0$ norm perturbation in time series is obscure. The paper could provide some examples to illustrate why detect $\ell_0$ norm localized perturbation in time series is practically useful, compared to $\ell_2$ norm perturbation.
We thank the reviewer for the helpful question. First, given the **temporal nature** of time series data, $\ell_0$-norm robustness is a natural research topic when investigating the adversarial robustness of time series models.
Second, **short-term volatility** can be regarded as normal data with a temporally-localized perturbation added. Resistance to short-term volatility is important in long-term forecasting/prediction, where the long-term value is considered unaffected by short-term volatility. A typical example is the well-known investment philosophy "Value Investing" [1], in which the "intrinsic value" of a business is considered robust against short-term volatility. Third, a **local anomaly** can also be regarded as a temporally-localized perturbation. Local anomaly detection is practically useful in real-world scenarios; for instance, detecting a short interval of abnormal heart rate in electronic health records is a local anomaly detection problem, and the same method can be used to detect abnormal network traffic in IoT time-series data. To highlight the risk of temporally-localized perturbations, we empirically show how much an $\ell_0$-norm perturbation can change the output of an undefended forecaster in Appendix B.3. Furthermore, we compare the attacking performance of the $\ell_0$-norm perturbation to the $\ell_2$-norm perturbation, and the empirical results in **Table 1, Table 2, Table 3** suggest that **forecasting models might be more sensitive to $\ell_0$-norm perturbations**.
Table 1. (Electricity) Compare the $\ell_0$-norm localized perturbation to the $\ell_2$-norm perturbation (computed by the algorithm of [2]) on the MSE between the original forecast and the perturbed forecast. The table reports the relative improvement of the $\ell_0$-norm perturbation over the $\ell_2$-norm perturbation (averaged over $128$ randomly selected samples). **A positive value indicates that our $\ell_0$-norm perturbation outperforms the $\ell_2$-norm perturbation**. For fairness, the $\ell_0$ or $\ell_2$ norm of the perturbation is restricted to be no larger than $\boldsymbol{\beta} \times$ the average $\ell_0$ or $\ell_2$ norm over all the testing samples. Values in the tables are calculated as $(\rm{MSE}_{\ell_0} - \rm{MSE}_{\ell_2}) / \rm{MSE}_{\ell_2} \times 100\%$.
| Model $\downarrow \quad$ $\boldsymbol{\beta} \rightarrow$| 10% | 20% | 30% | 40% | 50% |
|:----:|:----:|:----:|:----: |:----:|:----:|
|MLP-Mixer | +769.9 % | +89.5 % | +73.3 % | +12.3 % | +53.1 % |
|GRU | +2.5 % | -1.6 % | -8.3 % | -3.3 % | -6.5 % |
|LSTM | +23.1 % | +15.1 % | -33.5 % | -14.8 % | +2.0 % |
|MLP | +265.0 % | +376.3 % | +211.6 % | +114.1 % | +58.3 % |
Table 2. (Exchange) Compare the $\ell_0$-norm localized perturbation to the $\ell_2$-norm perturbation on the relative improvement in MSE.
| Model $\downarrow \quad$ $\boldsymbol{\beta} \rightarrow$| 10% | 20% | 30% | 40% | 50% |
|:----:|:----:|:----:|:----: |:----:|:----:|
|MLP-Mixer | +1000.1 % | +332.2 % | +114.7 % | +131.7 % | +84.6 % |
|GRU | -48.6 % | -45.1 % | -39.7 % | -31.7 % | -20.9 % |
|LSTM | -8.8 % | -28.3 % | -24.9 % | -16.2 % | -6.6 % |
|MLP | +528.9 % | +101.8 % | +36.9 % | +8.8 % | +6.2 % |
Table 3. (Traffic) Compare the $\ell_0$-norm localized perturbation to the $\ell_2$-norm perturbation on the relative improvement in MSE.
| Model $\downarrow \quad$ $\boldsymbol{\beta} \rightarrow$| 10% | 20% | 30% | 40% | 50% |
|:----:|:----:|:----:|:----: |:----:|:----:|
|MLP-Mixer | +3310.1 % | +841.8 % | +328.5 % | +317.7 % | +205.5 % |
|GRU | +147.0 % | +28.2 % | +7.3 % | -4.7 % | +21.6 % |
|LSTM | +239.0 % | +66.8 % | +14.9 % | +2.5 % | -2.8 % |
|MLP | +1760.4 % | +145.7 % | +38.4 % | +7.5 % | -7.6 % |
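For completeness, the table entries are computed roughly as in the following sketch (names are ours; `mse_l0[i]` and `mse_l2[i]` denote the attack MSEs of sample $i$ under the norm-budget constraint described above):
```python
import numpy as np

def relative_improvement(mse_l0, mse_l2):
    """(MSE_l0 - MSE_l2) / MSE_l2 * 100%, averaged over the sampled test series."""
    mse_l0, mse_l2 = np.asarray(mse_l0), np.asarray(mse_l2)
    return np.mean((mse_l0 - mse_l2) / mse_l2) * 100.0
```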
[1] Piotroski, Joseph D. "Value investing: The use of historical financial statement information to separate winners from losers." Journal of Accounting Research (2000): 1-41.
[2] Dang-Nhu, Raphaël, et al. "Adversarial attacks on probabilistic autoregressive forecasting models." International Conference on Machine Learning. PMLR, 2020.
<br/>
<br/>
> 2. The presentation of the technical development is hard to follow. Many equations are given without sufficient description on their intuition. For example, Eq. 1, it is unclear why the norm is used as the time point index and what is the relationship between $\delta$ and $\delta_t$. Eq. 4, it is unclear on why should all M predictions be consistent instead of allowing some tolerance. Some theoretical analysis is desired. Similarly, Eq 5, 7, etc., are presented without introducing intuition.
We are deeply sorry for the unclarity in the technical part. We have updated it accordingly in our revision. Specifically, in Eq. (1) of our revision, we use $\boldsymbol{\delta}$ to denote the perturbation vector and $\delta_t$ to denote the perturbation value added to the $t$-th time point $x_t$. In Eq. (4), **the robustness certificate of MIA would not hold if we allowed the presence of a disagreement**, since our robustness relies on two preconditions: 1) for an arbitrary temporally-localized perturbation of length $L_{atk}$, there always exists a masked series that is unaffected by the perturbation; this precondition is guaranteed by our specific masking scheme. 2) **The label returned by our MIA classifier must equal the prediction on the unaffected series, which is guaranteed by our strict aggregation.** With these two preconditions, we can then guarantee that the MIA classifier returns the unaffected prediction or Alert. In Eq. (5), $f_{\rm{dis}} (\mathbf{x}) = \Delta\cdot (\lfloor f(\mathbf{x}_{1:t_0})/ \Delta \rfloor + 0.5)$ formulates the discretization process. We adopt discretization because, without it, it is almost impossible for the forecasts on different imputed series to be exactly consistent.
Regarding the Gaussian augmentation noise $\sigma$: Eq. (7) is our loss function for training the imputation model, which computes the average MSE loss over all the masked series with Gaussian augmentation. The Gaussian augmentation is meant to overcome the random noise in the input time series; besides, it helps the imputation model avoid overfitting. The Gaussian noise $\sigma$ is only injected in the training stage; at inference time, we do not add this noise to the input time series. We hope this clarifies the reviewer's concern. Please let us know if there is additional unclarity; we really look forward to further improving our paper based on the suggestions.
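To make the discretization and the masked-training loss more concrete, below is a minimal sketch (purely illustrative and simplified relative to Eqs. (5) and (7); `g` denotes the imputation model, `masks` the sliding masks, and the names are ours):
```python
import numpy as np

def f_dis(forecast, Delta):
    """Discretize a forecast value: Delta * (floor(forecast / Delta) + 0.5)."""
    return Delta * (np.floor(forecast / Delta) + 0.5)

def masked_training_loss(g, x, masks, sigma=0.02, rng=np.random.default_rng(0)):
    """Average MSE of the imputation model over all masked series,
    with Gaussian augmentation injected only at training time."""
    losses = []
    for m in masks:
        noisy = x + rng.normal(0.0, sigma, size=x.shape)   # training-time Gaussian augmentation
        x_hat = g(noisy * m)                               # impute the masked (noisy) series
        losses.append(np.mean((x_hat - x) ** 2))           # reconstruct the clean series
    return float(np.mean(losses))
```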
<br/>
<br/>
Sorry again for our unclear presentation.
> 3. The authors claim in most cases short-term events should not have large impacts on the long-term outcomes. Then the question is why resistance to the short-term perturbation is essential.
We thank the reviewer for the insightful question. We have replaced "in most cases short-term events should ..." with "in some cases ..." in our revision. We admit that in some cases the resistance to short-term perturbations is not so important. However, in some long-term forecasting/prediction scenarios, the resistance to short-term perturbations is essential. For instance, if we want to forecast the future of a certain industry (e.g., the new energy industry), the resistance to short-term perturbations seems essential, since the future of an industry is believed to be unaffected by short-term events. Another typical example is the "Value Investing" strategy [1], in which the future value of a company is considered unaffected by short-term stock price volatility, so strengthening the robustness to short-term perturbations is important.
<br/>
<br/>
> 4. (**Understand MIA**) The proposed method requires the masks to cover arbitrary temporally-localized perturbation of $L_{adv}$. It is better to provide some guidelines on how to ensure this requirement is satisfied.
We thank the reviewer for the valuable suggestion. We have formally proved in Appendix Section A (Page 13) that **for an arbitrary temporally-localized perturbation subject to $L_{atk}$, there always exists a mask generated by our Masking (Step 1) that can occlude it**.
For ease of understanding, below we illustrate how to cover all temporally-localized perturbations intuitively with a toy example.
1. **Given $L_{adv}$, we can list the possible perturbations: $\boldsymbol{\delta}_{[1:L_{adv}]}, \boldsymbol{\delta}_{[2:L_{adv}+1]}, \ldots, \boldsymbol{\delta}_{[t_0-L_{adv}+1:t_0]}$.** Consider an example where the adversary attacks $\mathbf{x}_{1:5}$ ($t_0=5$) subject to $L_{atk} = 2$. Then there are in total four possible perturbed series: **1)** $\mathbf{x}_{1:5}+\boldsymbol{\delta}_{[1:2]}$; **2)** $\mathbf{x}_{1:5}+\boldsymbol{\delta}_{[2:3]}$; **3)** $\mathbf{x}_{1:5}+\boldsymbol{\delta}_{[3:4]}$; **4)** $\mathbf{x}_{1:5}+\boldsymbol{\delta}_{[4:5]}$, as follows:
| $x_1+\delta_1$ | $x_2+\delta_2$ | $x_3$ | $x_4$ | $x_5$ |
|----|----|----|----|----|
| $x_1$ | $x_2+\delta_2$ | $x_3+\delta_3$ | $x_4$ | $x_5$ |
|----|----|----|----|----|
| $x_1$ | $x_2$ | $x_3+\delta_3$ | $x_4+\delta_4$ | $x_5$ |
|----|----|----|----|----|
| $x_1$ | $x_2$ | $x_3$ | $x_4+\delta_4$ | $x_5+\delta_5$ |
|----|----|----|----|----|
2. **An $L_{mask}$-length mask is capable of covering $L_{mask} - L_{atk} + 1$ possible perturbations.** Back to the example, the mask $M_{[1:3]}$ (which occludes the values of $x_1, x_2, x_3$) covers two perturbations, $\boldsymbol{\delta}_{[1:2]}$ and $\boldsymbol{\delta}_{[2:3]}$. Specifically, the masked versions of $\mathbf{x}_{1:5}+ \boldsymbol{\delta}_{[1:2]}$ and $\mathbf{x}_{1:5}+ \boldsymbol{\delta}_{[2:3]}$ are as follows:
| $\color{#F00}{x_1+\delta_1 \rightarrow 0}$ | $\color{#F00}{x_2+\delta_2 \rightarrow 0}$ | $\color{#F00}{x_3 \rightarrow 0}$ | $x_4$ | $x_5$ |
|----|----|----|----|----|
| $\color{#F00}{x_1 \rightarrow 0}$ | $\color{#F00}{x_2 + \delta_2 \rightarrow 0}$ | $\color{#F00}{x_3+\delta_3 \rightarrow 0}$ | $x_4$ | $x_5$ |
|----|----|----|----|----|
3. **We then continue sliding the $L_{mask}$-length mask through the input series with the step size $L_{mask}-L_{adv}+1$.** Back to our example, the step size is $L_{mask}-L_{adv}+1=2$, so the next mask is $M_{[3:5]}$. $M_{[3:5]}$ exactly covers the other two possible perturbations $\boldsymbol{\delta}_{[3:4]}, \boldsymbol{\delta}_{[4:5]}$. The masked perturbed series $(\mathbf{x}_{1:5}+ \boldsymbol{\delta}_{[3:4]}) \odot M_{[3:5]}$ and $(\mathbf{x}_{1:5}+ \boldsymbol{\delta}_{[4:5]}) \odot M_{[3:5]}$ are:
| $x_1$ | $x_2$ | $\color{#F00}{x_3+\delta_3 \rightarrow 0}$ | $\color{#F00}{x_4+\delta_4 \rightarrow 0}$ | $\color{#F00}{x_5 \rightarrow 0}$ |
|----|----|----|----|----|
| $x_1$ | $x_2$ | $\color{#F00}{x_3 \rightarrow 0}$ | $\color{#F00}{x_4+\delta_4 \rightarrow 0}$ | $\color{#F00}{x_5+\delta_5 \rightarrow 0}$ |
|----|----|----|----|----|
4. **In summary, given $L_{atk}$ and $L_{mask}$, we generate the masks as follows**:
\begin{equation}
\begin{aligned}
&M_{[1 + k \alpha : \min(L_{mask} + k \alpha, t_0) ]}, \; k=0, \ldots, \lceil (t_0- L_{mask})/ \alpha \rceil \\
&\text{where} \quad \alpha= L_{mask}-L_{adv}+1
\end{aligned}
\end{equation}
Back to the example, the masks generated by the above equation are $M_{[1+2k \,:\, \min(3+2k,\, 5)]}, k=0,1$, which, as shown in the above discussion, **cover all the possible temporally-localized perturbations**. A minimal code sketch of this mask-generation rule is given below.
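The following sketch translates the 1-indexed intervals into 0-indexed NumPy slices (the function name is ours, for illustration only):
```python
import math
import numpy as np

def generate_masks(t0, L_mask, L_adv):
    """Sliding masks M_{[1+k*alpha : min(L_mask+k*alpha, t0)]} with
    step size alpha = L_mask - L_adv + 1."""
    alpha = L_mask - L_adv + 1
    masks = []
    for k in range(math.ceil((t0 - L_mask) / alpha) + 1):
        u, v = 1 + k * alpha, min(L_mask + k * alpha, t0)   # 1-indexed interval [u:v]
        m = np.ones(t0)
        m[u - 1:v] = 0.0
        masks.append(m)
    return masks

# Toy example above: t0=5, L_mask=3, L_adv=2 yields M_{[1:3]} and M_{[3:5]}
print(generate_masks(5, 3, 2))
```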
We hope this clarifies the reviewer's concern. Please let us know if there is additional unclarity. Thanks!
<br/>
<br/>
> 5. In proposition 1, to conclude the forecast/label cannot be changed, it is better to quantify the change and define what is considered unchanged quantitatively.
We thank the reviewer for pointing this out. In terms of time series classification, the label is considered to have been changed if the MIA classifier returns a different label (rather than Alert). In terms of time series forecasting, the forecast is considered to have been changed if the forecast value changes by any amount, however small.
<br/>
<br/>
> 6. In Remark 3, describe the impact of certificate only for $f_{\rm{MIA}}(\cdot)$ instead of $f(\cdot)$
We thank the reviewer for the valuable question. First, the reason why we derive the certificate only for $f_{\rm{MIA}}(\cdot)$ instead of $f(\cdot)$ is that we place no requirement on $f(\cdot)$; it is impossible to derive a certificate for an arbitrary model. The advantage of certifying robustness for $f_{\rm{MIA}}(\cdot)$ is that **we can arm an arbitrary time series model with certified robustness, since MIA is a plug-and-play method**. The limitations of certifying only $f_{\rm{MIA}}(\cdot)$ are: 1) $f_{\rm{MIA}}(\cdot)$ does not always return the same forecast/label as the normal model; in particular, even for some clean time series, $f_{\rm{MIA}}(\cdot)$ could give a false alert. 2) The forecast of $f_{\rm{MIA}}(\cdot)$ could be inconsistent with the forecast of $f(\cdot)$.
<br/>
<br/>
> 7. It is good to see the comparison to randomized smoothing defenses. It is better to further justify on why this comparison is important.
We thank the reviewer for the valuable comment. Randomized smoothing [1] is a well-known and widely-used method in the field of certified defenses. Randomized smoothing has been applied to defend against various types of attacks and achieves superior certified robustness compared to other certified defenses in their respective fields, including $\ell_0/ \ell_1/\ell_2/\ell_\infty$-norm perturbations [2,3], image translation, Gaussian blur, rotation and scaling [4,5]. **Given the widespread success of randomized smoothing, we naturally regard it as the baseline and compare MIA to two $\ell_0$-norm randomized smoothing defenses**, Derandomized Smoothing and Randomized Ablation. Empirical results demonstrate that MIA outperforms these two baselines on robustness (certified accuracy/MSE/MAE) and inference time. Additionally, MIA does not require the user to retrain the base model, which both baselines require, suggesting that MIA is much more practical. To summarize, **MIA surpasses the two baselines on robustness, inference time and practicality**.
[1] Cohen, Jeremy, Elan Rosenfeld, and Zico Kolter. "Certified adversarial robustness via randomized smoothing." International Conference on Machine Learning. PMLR, 2019.
[2] Yang, Greg, et al. "Randomized smoothing of all shapes and sizes." International Conference on Machine Learning. PMLR, 2020.
[3] Salman, Hadi, et al. "Provably robust deep learning via adversarially trained smoothed classifiers." Advances in Neural Information Processing Systems 32 (2019).
[4] Li, Linyi, et al. "Tss: Transformation-specific smoothing for robustness certification." Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021.
[5] Hao, Zhongkai, et al. "GSmooth: Certified robustness against semantic transformations via generalized randomized smoothing." International Conference on Machine Learning. PMLR, 2022.
<br/>
<br/>
> 8. (**Add experiments**) Is there any discussion on whether the proposed method useful for both univariate and multivariate time series?
We thank the reviewer for the valuable comment. We can easily apply MIA to multivariate time series by repeating the univariate Masking (Step 1) and Imputing (Step 2) on each variable. We then obtain a list of imputed multivariate time series. Finally, we aggregate the labels/forecasts of all the imputed multivariate time series in the same way as the univariate Aggregation (Step 3). In addition, we evaluate MIA on four multivariate time series forecasting tasks (ETTm2, ETTh2 [2], Illness [3] and Weather [4]), following the work [1]. Results are evaluated with different forecasting model architectures (MLP-Mixer, MLP, LSTM, GRU, RNN, TransformerNormal, TransformerPadd, TransformerConv) and are reported in Tables 18-25 in Appendix G of our revision. These extensive experiments demonstrate that MIA behaves on multivariate forecasting tasks similarly to how it behaves on univariate forecasting tasks.
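A simplified sketch of this per-variable extension (purely illustrative; `impute` stands for the univariate imputation model of Step 2, `masks` for the sliding masks of Step 1, and the classification-style aggregation is shown for brevity):
```python
import numpy as np

def mia_multivariate(f, X, masks, impute, alert="Alert"):
    """X has shape (num_variables, t0). Mask and impute each variable
    independently (Steps 1-2), then aggregate the predictions on the
    imputed multivariate series (Step 3)."""
    preds = []
    for m in masks:
        X_imp = np.stack([impute(X[d] * m) for d in range(X.shape[0])])
        preds.append(f(X_imp))
    return preds[0] if all(p == preds[0] for p in preds) else alert
```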
[1] Wu, Haixu, et al. "Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting." Advances in Neural Information Processing Systems 34 (2021): 22419-22430.
[2] Zhou, Haoyi, et al. "Informer: Beyond efficient transformer for long sequence time-series forecasting." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 12. 2021.
[3] https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html
[4] https://www.bgc-jena.mpg.de/wetter/
<br/>
<br/>
## Review 3
> 1. The idea is relatively straightforward: it is like an application of [1,2] in the time series domain. The unique point of imputation model is spiritually similar to [3].
We thank the reviewer for the thoughtful comment and are glad that the reviewer acknowledges the novelty of this paper. We do agree that MIA is a relatively direct extension of [1,2] and that the imputation model is spiritually similar to [3]. We first note that this is the first work that explores $\ell_0$-norm certification for both TSF and TSC models. Our work has several differences compared to prior work. First, MIA also produces robustness certificates for forecasting models, not just for classification models. Second, we propose a masked training algorithm for training the imputation model, which takes into consideration the **sliding-window masking** nature of the MIA-masked series; extensive empirical results validate the effectiveness of our masked training. We also comprehensively compare MIA to two randomized smoothing based approaches on both **certified accuracy/MSE/MAE** and **inference cost**, to show the superior robustness and practicality of MIA.
[1] Xiang, Chong, et al. "{PatchGuard}: A Provably Robust Defense against Adversarial Patches via Small Receptive Fields and Masking." 30th USENIX Security Symposium (USENIX Security 21). 2021.
[2] Xiang, Chong, Saeed Mahloujifar, and Prateek Mittal. "{PatchCleanser}: Certifiably Robust Defense against Adversarial Patches for Any Image Classifier." 31st USENIX Security Symposium (USENIX Security 22). 2022.
[3] Salman, Hadi, et al. "Denoised smoothing: A provable defense for pretrained classifiers." Advances in Neural Information Processing Systems 33 (2020): 21945-21957.
<br/>
<br/>
> 2. Eq. (1): the first line does not contain the temporal localization constraint, so it is not equal to the second line.
Sorry for the mistakes in Eq. (1). We have updated Eq. (1) in our revision. Please see our paper and let us know if there is any additional unclarity.
<br/>
<br/>
> 3. Remark 1, last sentence: the inference cost could not be 0. It should at least be 1.
We thank the reviewer for pointing this out. We admit that the inference cost should be at least $1$ and we have corrected this in our revision.
<br/>
<br/>
> 4. What is the $\sigma$ value used in practice in Eq. (7)? It is not specified in later sections. Also, why Gaussian augmentation is needed here? Unlike other randomized smoothing work, here no Gaussian augmentation is used in the inference time, so the use of Gaussian augmentation in the training time may induce domain mismatch and needs further justifications.
We thank the reviewer for pointing this out. We clarify that we use $\sigma=0.02$, and the Gaussian augmentation in our masked training is used to **make the learned model resistant to the natural noise in time series data**. Extensive literature validates the presence of noise in time series data [1,2,3]; e.g., [4] demonstrates that financial time series data are noisy, and the well-known probabilistic forecasting model DeepAR [5] explicitly models the noise via a Gaussian distribution. In our paper, we deal with such noise by adding Gaussian augmentation to the input series. We are sorry for the lack of an ablation study on Gaussian augmentation. Here we provide additional experiments to show its impact.
Table 1. (DistalPhalanxTW) We compare MIA ($L_{mask}=$ 15%) with Gaussian augmentation ($\sigma=$ 0.01, 0.02, 0.03) to MIA without Gaussian augmentation (baseline) on **Certified Accuracy** at $L_{atk} =$ 5%, 10%, 15%. The table reports the improvement (in absolute %) of Gaussian augmentation in certified accuracy.
| Model $\downarrow$ | $\sigma \downarrow$ | 5 % | 10% | 15% | $\sigma \downarrow$ | 5 % | 10% | 15% |$\sigma \downarrow$ | 5 % | 10% | 15% |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| MLP-Mixer | 0.01 | - 2.0 | 0.0 | - 2.0 | 0.02 | - 1.0 | + 3.0 | + 5.0 | 0.03 | + 1.0 | + 3.0 | + 0.0 |
| FCN | 0.01 | + 1.0 | + 3.0 | 0.0 | 0.02 | + 5.0 | + 6.9 | + 3.0 | 0.03 | + 1.0 | + 3.0 | + 0.0 |
| MLP | 0.01 | + 1.0 | + 1.0 | + 3.0 | 0.02 | 0.0 | + 2.0 | + 3.0 | 0.03 | + 1.0 | + 3.0 | + 0.0 |
|ResNet | 0.01 | - 1.0 | - 3.0 | - 1.0 | 0.02 | 0.0 | + 1.0 | + 1.0 | 0.03 | + 1.0 | + 3.0 | + 0.0 |
[1] Foster, W. R., F. Collopy, and L. H. Ungar. "Neural network forecasting of short, noisy time series." Computers & chemical engineering 16.4 (1992): 293-297.
[2] Passalis, Nikolaos, et al. "Training noise-resilient recurrent photonic networks for financial time series analysis." 2020 28th European Signal Processing Conference (EUSIPCO). IEEE, 2021.
[3] Hwang, Jeng-Ren, Shyi-Ming Chen, and Chia-Hoang Lee. "Handling forecasting problems using fuzzy time series." Fuzzy sets and systems 100.1-3 (1998): 217-228.
[4] Magdon-Ismail, Malik, Alexander Nicholson, and Yaser S. Abu-Mostafa. "Financial markets: very noisy information processing." Proceedings of the IEEE 86.11 (1998): 2184-2195.
[5] Salinas, David, et al. "DeepAR: Probabilistic forecasting with autoregressive recurrent networks." International Journal of Forecasting 36.3 (2020): 1181-1191.
<br/>
<br/>
> 5. Eq. (9): I think the guarantee of RHS should be $2*(k+L_{adv}+1)$ instead of $(k+L_{adv}+1)$.
We thank Reviewer SiPr for pointing out this mistake and we are deeply sorry. We emphasize that this mistake does not affect the superiority of MIA in our experimental comparisons; correcting it actually enlarges the gap between MIA and DS (the baseline method). We have revised the evaluation of DS in Table 1, Table 2 and Table 3. Below we only provide a part of the revised results due to the space limitation:
Table 3. (Exchange) Compare MIA to DS and RA on **FR (%)** at $L_{atk}= 2\%, 4\%, 6\%, 8\%, 10\%$ (higher FR is better).
| Defense $\downarrow \quad L_{atk} \rightarrow$ | 2% | 4% | 6% | 8% | 10% |
|--------------------|------|------|------|------|------|
| DS ($k= 10\%$) | ~~24.8~~ 24.8 | ~~24.8~~ 24.8| ~~24.8~~ 23.8| ~~24.8~~ 23.8 | ~~24.8~~ 22.8|
| RA ($k= 10\%$) | 16.8 | 16.8 | 16.8 | 16.8 | 16.8 |
| MIA ($L_{mask}= 10\%$) | **82.2** | **82.2** | **81.2** | **81.2** | **79.2** |
| DS ($k= 15\%$) | ~~20.8~~ 9.9 | ~~20.8~~ 8.9| ~~20.8~~ 8.9| ~~20.8~~ 2.0 | ~~20.8~~ 1.0|
| RA ($k= 15\%$) | 23.8 | 23.8 | 23.8 | 23.8 | 23.8 |
| MIA ($L_{mask}= 15\%$) | **71.3** | **69.3** | **68.3** | **71.3** | **65.3** |
We have updated Table 1, Table 2 and Table 3 in our revision. Please refer to our paper for more details. Sorry again for the mistake.