# NeurIPS'23 Rebuttal Submission

## Common comment

We sincerely appreciate the reviewers' valuable comments and concerns. In general, we are very pleased that the reviewers have concurred with the three main contributions of this study: **(1) motivation**---"the motivation of the combination of time and frequency domain to detect time-series anomaly is insightful" (Reviewers pYjo and q6Ub); **(2) effectiveness**---"extensive experimental results support the overall claims of the paper" and DualTF "outperforms the compared benchmarks" (Reviewers pYjo, q6Ub, and dAug) with "enough datasets" (Reviewer dAug); **(3) justification**---DualTF provides "the mathematical basis for the loss of granularity" to explain why DualTF works (Reviewer dAug). We hope that the clarifications presented in this rebuttal address the identified concerns.

* **Improving the presentation** (Reviewers pYjo and dAug): The title, introduction, and justification have been revised or will be clearly improved.
* **Clarifying the technical contribution** (Reviewers pYjo, q6Ub, and yZcD): We would like to emphasize the technical contributions of frequency-domain reconstruction and anomaly-score alignment.
* **Enriching the experimental results** (Reviewers pYjo, q6Ub, dAug, and qmoP): Results on additional realistic datasets have been added (see the attached PDF file). Furthermore, more sophisticated evaluation measures have been added.

---

## Summary

We summarize and categorize all reviews as follows:

* **Overall**
  * The paper provides a way to work under this discrepancy, rather than breaking this discrepancy.
  * The importance of exact localization (rather than a slightly smaller level of granularity) is not clear.
* **Technique/Methods**
  * Multidimensional correlation (time, frequency)
  * Align the anomaly scores → truly aligned?
  * It is not clear why such a complex solution is necessary.
  * The output of the FFT should be complex numbers instead of real numbers. If only the amplitude information is used, how is the information loss handled?
  * How is the frequency-domain reconstruction performed?
* **Theorem**
  * Rationale for the uncertainty definition
  * Suggestion to separate the method description from the theorems, which are not essential in the reviewer's opinion.
* **Experiments**
  * [Dataset]
    * Does not compare against all datasets in TSB-UAD
    * Is it possible to test on a challenging dataset?
  * [Baselines]
    * Does not compare against all baselines in TSB-UAD
    * Baselines from the non-deep-learning literature
    * Adding comparisons to more classical methods
    * Fairer comparison with baselines
  * [Evaluation Metrics]
    * New evaluation measures have been omitted (Volume Under the Surface)

---

## Reviewer1 - pYjo, Score(4), Confidence(3)

> *Q1. The uncertainty is not properly defined. I don't see the rationale there, e.g. why not square/cubic/exponential?*

Thank you very much for helping us strengthen the theorem part. The choice of a linear relationship is specific to the Gaussian function and the Fourier transform; it emerges from their fundamental mathematical properties. The Gaussian function has the unique property that its Fourier transform is also a Gaussian function, and **this symmetry leads to the *linear* relationship between the standard deviations (or uncertainties) in time and frequency**, as sketched below.
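For reference, the textbook Fourier-transform pair behind this statement is the following (a standard identity covered in [a, b], not a new result of our paper; the angular-frequency convention is assumed):

$$
g(t) = e^{-\frac{t^{2}}{2\sigma_t^{2}}}
\;\longmapsto\;
\hat{g}(\omega) = \int_{-\infty}^{\infty} g(t)\, e^{-i\omega t}\, dt
= \sigma_t \sqrt{2\pi}\; e^{-\frac{\sigma_t^{2}\omega^{2}}{2}},
\qquad \text{so} \qquad
\sigma_{\omega} = \frac{1}{\sigma_t}, \quad \sigma_t\,\sigma_{\omega} = 1 .
$$

Both sides are Gaussians, and their spreads are tied by a constant product: shrinking the time-domain spread $\sigma_t$ enlarges the frequency-domain spread $\sigma_\omega$ by exactly the reciprocal factor, and the Gaussian is the window that attains the lower bound of the Gabor uncertainty product (the exact constant depends on the Fourier convention and on whether the spread is measured on $g$ or $|g|^2$). Square, cubic, or exponential windows do not preserve their shape under the Fourier transform and therefore do not yield this direct trade-off.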
While other functions, such as square, cubic, or exponential functions, could be used to model specific types of uncertainty, they would not lead to the same fundamental relationship that the Gaussian function yields in the context of the Gabor limit. The Gaussian function is essential due to its role in signal processing and its mathematical properties, which align with both the time and frequency domains. Related works [a, b] provide more detailed insights into the mathematical reasoning behind the linear relationship in the Gabor uncertainty principle.

> *Q2. For multidimensional data it makes more sense that the spectrum and time domain consider correlations between dimensions.*

As you mentioned, considering the correlations within each domain of a multivariate time series is important and has been actively studied [c, d]. However, this paper focuses on identifying independent anomalies within both the time and frequency domains. The main contribution is to effectively detect anomalies in each domain and to mitigate the granularity discrepancy arising from the domain transformation. Consequently, this study emphasizes aligning anomaly detection to a timestamp-level granularity and proposes corresponding solutions. Furthermore, **this study can be combined with other work that concentrates on correlations between dimensions**.

> *Q3. I wonder is there any way that enables us compare with baselines fairer? Maybe running on their datasets? Because I think anomaly detection tasks can be very sensitive to parameters.*

Thank you again for your helpful review. Although we implemented the state-of-the-art baselines and used their datasets in the paper, we would like to report additional experimental results on the 250 benchmark datasets [e], which cover diverse domains with complicated univariate time series and allow a fairer comparison. In a comparative evaluation against the previous state-of-the-art benchmarks, we achieved the highest accuracy of 0.656, with the second-highest accuracy being 0.632 [f]. Further experimental investigations on additional benchmarks will be incorporated into the final supplementary materials.

| Method | Accuracy | Reference |
| --------------------- |:---------:|:----------:|
| USAD | 0.276 | [g] |
| LSTM-VAE | 0.198 | [h] |
| AE | 0.236 | [i] |
| NORMA | 0.474 | [j] |
| SCRIMP | 0.416 | [k] |
| DAMP (out-of-the-box) | 0.512 | [f, l] |
| DAMP (sharpened data) | 0.632 | [f, l] |
| ***Dual-TF*** | **0.656** | ***OURS*** |

> *Q4. The authors over claim about "Breaking the Time-Frequency Granularity Discrepancy". The paper provides a way to work under this discrepancy, rather than breaking this discrepancy.*

We understand your point, and it is helpful for framing the outline of this paper. The key achievement of this study is to address the issue of different levels of anomaly-detection granularity between the two domains, and the alignment to a timestamp-based precision for detecting anomalies holds notable significance. Given your feedback, we plan to **tone down the title as well as the relevant phrases** in the final version.

> *Q5. How do we know if they are truly aligned?*

We think that it is impossible to define "being truly aligned" because the two domains are inherently different. Thus, as a workaround, we propose to derive an optimal length of the inner window that minimizes the information loss, as discussed in Section 3.2 (see the illustrative sketch below).
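To make the timestamp-level alignment concrete, here is a rough illustrative sketch of nested (outer/inner) windowing in NumPy. The helper name `nested_window_scores`, the window lengths, the placeholder per-window score, and the averaging rule are all assumptions made for this illustration, not the exact procedure of Section 3.2, which scores inner windows with the reconstruction error of the frequency-domain model and derives the inner-window length that minimizes information loss.

```python
import numpy as np

def nested_window_scores(x, outer_len, inner_len, stride=1):
    """Illustrative nested windowing: every inner window inside an outer window
    is turned into a one-sided magnitude spectrum, scored, and the score is
    broadcast back onto the timestamps that the inner window covers."""
    scores = np.zeros(len(x))
    counts = np.zeros(len(x))
    for s in range(0, len(x) - outer_len + 1, stride):
        outer = x[s:s + outer_len]
        for t in range(0, outer_len - inner_len + 1, inner_len):
            inner = outer[t:t + inner_len]
            mag = np.abs(np.fft.rfft(inner))   # one-sided spectrum, length inner_len // 2 + 1
            score = mag[1:].sum()              # placeholder score (non-DC spectral energy)
            scores[s + t : s + t + inner_len] += score
            counts[s + t : s + t + inner_len] += 1
    return scores / np.maximum(counts, 1)      # per-timestamp frequency-domain score

# Hypothetical usage on a toy signal with an injected disturbance.
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.1 * rng.standard_normal(1000)
x[500:520] += 1.5
freq_scores = nested_window_scores(x, outer_len=100, inner_len=20)
```

Because every inner-window score is broadcast back onto the timestamps it covers, frequency-domain evidence can be combined with time-domain scores at the same per-timestamp granularity.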
[b] "Time-frequency analysis." Vol. 778. New Jersey: Prentice hall, 1995. [c] "A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019. [d] "Practical approach to asynchronous multivariate time series anomaly detection and localization." Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 2021. [e] "Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress." IEEE Transactions on Knowledge and Data Engineering (2021). [f] "DAMP: accurate time series anomaly detection on trillions of datapoints and ultra-fast arriving data streams." Data Mining and Knowledge Discovery 37.2 (2023): 627-669. [g] "Usad: Unsupervised anomaly detection on multivariate time series." Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 2020. [h] "A multimodal anomaly detector for robot-assisted feeding using an lstm-based variational autoencoder." IEEE Robotics and Automation Letters 3.3 (2018): 1544-1551. [i] "From univariate to multivariate time series anomaly detection with non-local information." Advanced Analytics and Learning on Temporal Data: 6th ECML PKDD Workshop, AALTD 2021, Bilbao, Spain, September 13, 2021, Revised Selected Papers 6. Springer International Publishing, 2021. [j] "Unsupervised and scalable subsequence anomaly detection in large data series." The VLDB Journal (2021): 1-23. [k] "Matrix profile XI: SCRIMP++: time series motif discovery at interactive speeds." 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018. [l] "Matrix profile XXIV: scaling time series anomaly detection to trillions of datapoints and ultra-fast arriving data streams." KDD2022. --- ## Reviewer2 - q6Ub, Score(4), Confidence(4) > *Q1. It's not clear why such a complex solution is necessary. We have methods working on time domain. We have methods working in the frequency domain. Then the anomaly scores (all in time domain) can be averaged for example to detect anomalies. A simple baseline like that is necessary to understand (and ensure) that this is indeed a difficult problem and such solutions are not effective.* It is unfortunate that our primary contribution cannot be properly delivered to you. **We already compared with TFAD [17], exactly the simple baseline that you mentioned; TFAD is shown to offer lower performance than DualTF in Table 3**. Furthermore, the ablation study containing backbone replacement justifies the necessity of each component. > *Q2. The paper cites a recent benchmark, TSB-UAD [6], but does not compare against all datasets (2000 time series) and all baselines. Hence, it's not clear if the proposed solution really advances the state of the art in the area.* Thank you very much for helping us enrich the experiment part. We would like to report additional experimental results with more realistic datasets similar to TSB-UAD [6] benchmarks. We adopted 250 benchmark datasets [d] from diverse domains with complicated univariate time series. In a comparative evaluation against the previous state-of-the-art benchmarks, we achieved the highest accuracy at 0.656, followed by the second-highest accuracy at 0.632 [e]. Further experimental investigations about additional benchmarks will be incorporated into the final supplementary materials. 
| Method | Accuracy | Reference |
| --------------------- |:---------:|:----------:|
| USAD | 0.276 | [f] |
| LSTM-VAE | 0.198 | [g] |
| AE | 0.236 | [h] |
| NORMA | 0.474 | [i] |
| SCRIMP | 0.416 | [j] |
| DAMP (out-of-the-box) | 0.512 | [e, k] |
| DAMP (sharpened data) | 0.632 | [e, k] |
| ***Dual-TF*** | **0.656** | ***OURS*** |

> *Q3. New evaluation measures have been omitted for this area [a]*

Thank you again for your comment. We conducted additional experiments incorporating the new evaluation measures (VUS) per your suggestion. We will present these complete results, including the following tables.

**<*Anomaly Transformer* performance in terms of more evaluation measures>**

| Metrics | TODS(Global) | TODS(Contextual) | TODS(Shapelet) | TODS(Seasonal) | TODS(Trend) | ASD | ECG | PSM | CompanyA |
|:--------- |:------------:|:----------------:|:--------------:|:--------------:|:-----------:|:------:|:------:|:------:|:--------:|
| R_AUC_ROC | 0.9995 | 0.9859 | 0.8457 | 0.9272 | 0.6277 | 0.8498 | 0.6432 | 0.6158 | 0.8493 |
| R_AUC_PR | 0.9994 | 0.9862 | 0.6878 | 0.7736 | 0.3713 | 0.5263 | 0.2447 | 0.4789 | 0.4139 |
| VUS_ROC | 0.9354 | 0.9160 | 0.8065 | 0.9147 | 0.6222 | 0.7952 | 0.6343 | 0.6073 | 0.8335 |
| VUS_PR | 0.8985 | 0.8765 | 0.6000 | 0.6920 | 0.3560 | 0.4466 | 0.2424 | 0.4665 | 0.3590 |

**<*Dual-TF* performance in terms of more evaluation measures>**

| Metrics | TODS(Global) | TODS(Contextual) | TODS(Shapelet) | TODS(Seasonal) | TODS(Trend) | ASD | ECG | PSM | CompanyA |
|:--------- |:------------:|:----------------:|:--------------:|:--------------:|:-----------:|:------:|:------:|:------:|:--------:|
| R_AUC_ROC | 0.9998 | 0.9995 | 0.9097 | 0.9611 | 0.7035 | 0.9013 | 0.7216 | 0.7735 | 0.8653 |
| R_AUC_PR | 0.9998 | 0.9996 | 0.7925 | 0.8719 | 0.4287 | 0.6058 | 0.3809 | 0.6304 | 0.4254 |
| VUS_ROC | 0.9373 | 0.9322 | 0.8843 | 0.9380 | 0.6992 | 0.8505 | 0.7067 | 0.7752 | 0.8568 |
| VUS_PR | 0.9053 | 0.9014 | 0.6950 | 0.7620 | 0.4016 | 0.5127 | 0.3754 | 0.6131 | 0.3694 |

> *Q4. Baselines from the non-deep learning literature are omitted. For example: [b], [c]*

Thank you for your comment. SAND [b] and Series2Graph [c] are undoubtedly outstanding studies in the fields of streaming outlier detection and graph-based anomaly detection, respectively. However, they are beyond our targeted scope of time-series anomaly detection (TSAD). First, SAND does not align well with our research scope because of its primary focus on streaming-data scenarios. Second, Series2Graph is not directly applicable to our research because of its underlying graph-based approach. Graph-based methods often entail constructing complex graphs to model time-series data, which does not align with our framework's intention of leveraging the time and frequency domains directly. Moreover, Series2Graph cannot be integrated with the dual-domain approach of our framework, which aims to bridge the gap between the time and frequency domains for improved anomaly detection accuracy. Although SAND and Series2Graph are outside our research scope, adapting them to dual-domain anomaly detection, given their specific focuses and underlying methodologies, remains an interesting direction for future work. As we reviewed prior works (e.g., TSB-UAD and Anomaly Transformer) that employ the same benchmark datasets as ours (e.g., ASD, PSM, and TODS), it is commonly known that non-deep-learning methods, such as OCSVM, Isolation Forest, and LOF, do not perform well on these benchmark datasets [6, 16].
Therefore, we chose our baselines primarily from the recent deep-learning literature.

[a] "Volume under the surface: a new accuracy evaluation measure for time-series anomaly detection." Proceedings of the VLDB Endowment 15.11 (2022): 2774-2787.
[b] "SAND: streaming subsequence anomaly detection." Proceedings of the VLDB Endowment 14.10 (2021): 1717-1729.
[c] "Series2Graph: graph-based subsequence anomaly detection for time series." arXiv preprint arXiv:2207.12208 (2022).
[d] "Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress." IEEE Transactions on Knowledge and Data Engineering, 2021.
[e] "DAMP: accurate time series anomaly detection on trillions of datapoints and ultra-fast arriving data streams." Data Mining and Knowledge Discovery 37.2 (2023): 627-669.
[f] "USAD: unsupervised anomaly detection on multivariate time series." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
[g] "A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder." IEEE Robotics and Automation Letters 3.3 (2018): 1544-1551.
[h] "From univariate to multivariate time series anomaly detection with non-local information." Advanced Analytics and Learning on Temporal Data: 6th ECML PKDD Workshop, AALTD 2021, Springer, 2021.
[i] "Unsupervised and scalable subsequence anomaly detection in large data series." The VLDB Journal (2021): 1-23.
[j] "Matrix Profile XI: SCRIMP++: time series motif discovery at interactive speeds." 2018 IEEE International Conference on Data Mining (ICDM), 2018.
[k] "Matrix Profile XXIV: scaling time series anomaly detection to trillions of datapoints and ultra-fast arriving data streams." Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022.

---

## Reviewer3 - yZcD, Score(4), Confidence(4)

> *Q1. There is a fatal fact error in this paper: **The output of FFT should be complex numbers instead of the real numbers mentioned in line 216.** Also, complex vectors cannot be directly passed to Transformer. So I guess the authors use the amplitude information only.*

This is a very good question, directly linked to our main contribution. The Fast Fourier Transform (FFT) is a widely used algorithm for efficiently computing the Discrete Fourier Transform (DFT). The DFT decomposes a signal into its constituent frequency components, and the FFT makes this computation much faster than the direct DFT computation. The FFT output is symmetric around the center, meaning that the positive- and negative-frequency components are complex conjugates of each other. As a result, **the imaginary part of the FFT output for real-valued signals is redundant** [a, b]. Owing to this symmetry property, many applications and research papers [7, 8, 9, 31, 32, 33, 34, 35, 36] mentioned in the related-work section, which deal with real-world signals, can safely ignore the imaginary part of the FFT output. By ignoring the imaginary part, one saves memory and computational resources, as the imaginary part essentially contains duplicate information. A detailed explanation of the DFT is given in Section B of the supplementary material.

**Comment:**

> The authors' rebuttal further shows the fatal error that they make. The output of the DFT/FFT is redundant for a real signal in that the positive and negative frequency components are complex conjugates.
> However, it is the negative frequency components that are redundant, not the imaginary part. Specifically, when the DFT is computed for purely real input, the output is Hermitian-symmetric, i.e., the negative-frequency terms are just the complex conjugates of the corresponding positive-frequency terms, and the negative-frequency terms are therefore redundant. We can then ignore the negative frequency terms, and the length of the transformed axis of the output is, therefore, n//2 + 1.
>
> For example, the DFT result of a sequence is [3-2j, 2-j, 1+j, 4, 1-j, 2+j, 3+2j], which is symmetric with respect to the zero frequency. If we take the positive frequency part, the 'real-valued DFT' result should be [4, 1-j, 2+j, 3+2j], and you can reconstruct the negative frequency part with the symmetry property. However, if you only take the real part of the DFT, you get [3, 2, 1, 4, 1, 2, 3]. Such a 'real spectrum' is still symmetric, in other words redundant, but you cannot reconstruct the ignored imaginary part. Discarding the imaginary part leads to information loss while preserving redundant real-valued information.
>
> I suggest the authors read the book mentioned in the rebuttal for detailed information.

We sincerely thank the reviewer for bringing this matter to our attention. We are sorry that we misunderstood this point and that our rebuttal was not described accurately. As you mentioned, it is the negative-frequency components that are redundant, not the imaginary part. Furthermore, we would like to clarify our methodology using the expressions you have provided. In the paper, line 216 merely states that "FFT indicates the fast Fourier transform". Specifically, when the FFT is computed for a real-valued time-series input, the output is Hermitian-symmetric, i.e., the negative-frequency terms are just the complex conjugates of the corresponding positive-frequency terms. We can therefore ignore the negative-frequency terms, so the length of the transformed axis of the output is n//2 + 1, and we then take the absolute value of the remaining complex terms. Here, we compute the **magnitude value** defined in the complex plane. For example, if the FFT result of a sequence is [3-2j, 2-j, 1-j, 4, 1+j, 2+j, 3+2j], which is symmetric with respect to the zero frequency, taking the positive-frequency part gives the one-sided result [4, 1+j, 2+j, 3+2j]. At this point, we compute the magnitude values [$\sqrt{4^2+0^2}, \sqrt{1^2+1^2}, \sqrt{2^2+1^2}, \sqrt{3^2+2^2}$]; it is this real-valued **magnitude spectrum** that we reconstruct for the detection of frequency anomalies. This is a completely distinct concept from reconstruction for the inverse FFT, so the angle (phase) spectrum is not required to detect anomalies. We will revise our draft to clarify this confusion in a future submission. We appreciate your comments on this matter.
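As a small numerical illustration of the one-sided magnitude spectrum described above (a minimal sketch using NumPy on a hypothetical toy window, not our actual pipeline):

```python
import numpy as np

# Hypothetical real-valued inner window (toy values for illustration only).
x = np.array([1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.5])
n = len(x)

full = np.fft.fft(x)       # length n, Hermitian-symmetric complex spectrum
half = np.fft.rfft(x)      # length n//2 + 1, negative-frequency bins dropped
magnitude = np.abs(half)   # real-valued magnitude spectrum

# Hermitian symmetry of the full spectrum for real input: X[k] == conj(X[n-k]),
# so dropping the negative-frequency half loses no information.
assert np.allclose(full[1:n // 2], np.conj(full[n - 1:n // 2:-1]))

# It is this magnitude sequence (x-axis: frequency, y-axis: magnitude) that the
# frequency-domain model reconstructs; no phase spectrum is needed for scoring.
print(half.shape, magnitude.round(3))
```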
---

> *Q2. The authors claim that they train the model with reconstruction task following previous works. However, these works are conducted only on the time domain. They do not elaborate on how they conduct such a reconstruction task on the frequency domain, which should be a key technique of their method. In fact, it would be quite hard to conduct frequency-domain reconstruction because masking introduce unexpected frequency spectrum changes.*

Thank you once more for your comments. The sequence is transformed from the time domain, where the $x$-axis is $time$ and the $y$-axis is $value$, to the frequency domain, where the $x$-axis is $frequency$ and the $y$-axis is $magnitude$. As a result, we reconstruct the sequence values in the frequency domain in the same manner as in the time domain. Therefore, we do not need to introduce masking to conduct the frequency-domain reconstruction. A detailed description will be added in the final version.

[a] Proakis, John G., and Dimitris G. Manolakis. "Digital signal processing: principles, algorithms, and applications." 1995.
[b] Oppenheim, Alan V. "Discrete-time signal processing." Pearson Education India, 1999.

---

## Reviewer4 - dAug, Score(3), Confidence(3)

> *Q1. In case where the anomaly is successfully detected, the importance of exact localization (rather than a slightly smaller level of granularity) is not clear to me.*

We appreciate the reviewer for pointing out this important issue. As you pointed out, exact localization may be unnecessary in some scenarios. However, **in time-critical real-world contexts, such as swiftly identifying a heart attack or a failure in a nuclear power plant**, prompt and precise detection of an incident enables swift determination of its underlying cause and solution. Given the specific demands of such fields, our proposed framework can operate at a finer level of granularity. Furthermore, our experimental results are evaluated with a point-wise F1 score, which is appropriate for exact-localization demands.

> *Q2. While the idea is simple, I found the method description hard to follow. I would suggests separating it from the theorems, which are not essential here in my opinion.*

We are sorry that our presentation was not satisfactory to you. We will simplify the presentation as much as possible. However, we believe that the theorem is an important part of resolving the granularity discrepancy and of determining the appropriate window length in NS-windowing.

> *Q3. I would suggest adding comparison to more classical methods that often outperform in time-series anomaly detection [1].*

As we reviewed prior works (e.g., TSB-UAD and Anomaly Transformer) that employ the same benchmark datasets as ours (e.g., ASD, PSM, and TODS), it is commonly known that non-deep-learning methods, such as OCSVM, Isolation Forest, and LOF, do not perform well on these benchmark datasets [6, 16]. Therefore, we chose our baselines primarily from the recent deep-learning literature. We also include representative classical methods, LOF, ISF, and OCSVM, for a fair comparison.

[1] Rewicki, Ferdinand, Joachim Denzler, and Julia Niebling. "Is It Worth It? Comparing Six Deep and Classical Methods for Unsupervised Anomaly Detection in Time Series." Applied Sciences 13, no. 3 (2023): 1778.

---

## Reviewer5 - qmoP, Score(3), Confidence(5)

> *Q1. Is it possible to test on a challenging dataset?*

Thank you very much for helping us enrich the experimental part. We would like to report additional experimental results on the 250 benchmark datasets [a], which cover diverse domains with complicated univariate time series and allow a fairer comparison. In a comparative evaluation against the previous state-of-the-art benchmarks, we achieved the highest accuracy of 0.656, with the second-highest accuracy being 0.632 [b]. See the PDF file (Table A) in the global response for the result on each dataset. Further experimental investigations on additional benchmarks will be incorporated into the final supplementary materials.
| Method | Accuracy | Reference |
| --------------------- |:---------:|:----------:|
| USAD | 0.276 | [c] |
| LSTM-VAE | 0.198 | [d] |
| AE | 0.236 | [e] |
| NORMA | 0.474 | [f] |
| SCRIMP | 0.416 | [g] |
| DAMP (out-of-the-box) | 0.512 | [b, h] |
| DAMP (sharpened data) | 0.632 | [b, h] |
| ***Dual-TF*** | **0.656** | ***OURS*** |

[a] "Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress." IEEE Transactions on Knowledge and Data Engineering, 2021.
[b] "DAMP: accurate time series anomaly detection on trillions of datapoints and ultra-fast arriving data streams." Data Mining and Knowledge Discovery 37.2 (2023): 627-669.
[c] "USAD: unsupervised anomaly detection on multivariate time series." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
[d] "A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder." IEEE Robotics and Automation Letters 3.3 (2018): 1544-1551.
[e] "From univariate to multivariate time series anomaly detection with non-local information." Advanced Analytics and Learning on Temporal Data: 6th ECML PKDD Workshop, AALTD 2021, Springer, 2021.
[f] "Unsupervised and scalable subsequence anomaly detection in large data series." The VLDB Journal (2021): 1-23.
[g] "Matrix Profile XI: SCRIMP++: time series motif discovery at interactive speeds." 2018 IEEE International Conference on Data Mining (ICDM), 2018.
[h] "Matrix Profile XXIV: scaling time series anomaly detection to trillions of datapoints and ultra-fast arriving data streams." Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022.