# WWW'24 Rebuttal Submission

## Common comment

We sincerely appreciate the reviewers' valuable comments. In general, we are very pleased that the reviewers have concurred with the three main contributions of this study: **(1) motivation**---"nicely motivated discrepancy for time-series anomaly detection" (Reviewer NKfj); **(2) effectiveness**---"experimental results support the overall claims of the paper over multiple datasets" and "evaluation is done on multiple datasets and provides detailed comparisons with the related SOTA approaches" (Reviewers NKfj, gMPs, and xu3u), including "appropriate comparison groups (e.g., TFAD, Anomaly Transformer)" (Reviewer ws1N); **(3) justification**---"the paper explains well why this research problem is an important problem for many Web applications" (Reviewer gMPs).

We also hope that our clarifications can further address the concerns raised by the reviewers, which can be summarized as follows.

* **Improving the presentation of the novelty** (Reviewers NKfj, ws1N, and PsKo): In response to the reviewers' valuable feedback, we have undertaken a comprehensive revision to clearly express the unique contributions of our work. Specifically, we have provided a thorough justification for the proposed nested-sliding window.
* **Clarifying the technical contribution** (Reviewers NKfj, ws1N, and xu3u): We would like to emphasize our technical contributions in two main aspects. First, we have explained the importance of the frequency domain for detecting pattern-wise anomalies and the role of the frequency-domain reconstructor in our paper. Second, we have explained in detail why addressing the time-frequency discrepancy is important and how our approach resolves it. These clarifications are intended to provide a sufficient understanding of the unique technical contributions of our paper.
* **Enriching the experiment results** (Reviewers NKfj and gMPs): The results for other realistic datasets, relevant interesting baselines, and the effect of parameters have been added.

We will make sure that all suggestions and clarifications made in this response are reflected in the final version of the paper.

---

## Summary

We could summarize and categorize all reviews as follows:

* **Novelty**
  * Why is such a complex solution necessary? Comparison with a simple averaging of the scores can reveal the strength of multiple methods.
  * It seems unclear whether the problem arises from the granularity discrepancy of the time-domain and frequency-domain scores, or from the coarseness of the frequency-domain anomaly score.
  * The approach of creating various window sizes and implementing an anomaly detection method seems not to be substantially different from the proposed approach.
* **Technique/Methods**
  * The size of the inner window for the frequency domain is smaller than the size of the original window, which seems to limit the context viewed. For example, using the nested-sliding window may be effective for generating fine-grained anomaly scores; however, this approach may limit the consideration of a variety of frequencies.
  * The technique combines two existing frameworks of attention in the time domain and frequency domain, both of which have been proposed in other transformer work before. Simply adding the scores together is not a novel loss function. This is a solution similar to an ensemble; the key discrepancy here is simply addressed by a sum, which over-simplifies the problem.
  * The optimal window is determined by the dominant frequency and uncertainty, which does not always hold, since non-major frequencies can contain anomalies too. For example, in a heartbeat, the meaningful anomaly is usually in subtle changes. The $1/\nu_{major}$ window would simply miss such an anomaly.
* **Theorem**
  * There is a surprising assumption in the proof of Theorem 3.1 that does not make sense in practice: why does the anomaly need to be in a window that is monotonically increasing followed by monotonically decreasing?
* **Experiments**
  * **[Baseline & Dataset]**
    * The paper cites a recent benchmark, TSB-UAD [6], but does not compare against all datasets (2,000 time series) and all baselines.
    * Data mining methodologies are only evaluated in the appendix, which is not ideal. Several methods are missing; some examples are given (and there are many more in the literature), e.g., Series2Graph.
    * The literature review is based on very recent work in deep learning. However, pattern-based anomaly detection has been studied over the last twenty years [1][2], e.g., contextual anomalies, collective anomalies, and time series discords, and there is no discussion of any of these works. Although the authors tested one matrix profile method, it is unclear how the matrix profile was tested and with which parameters.
    * The claim seems somewhat contradictory: in the Fig. 1 motivation, the authors appear to work on pattern-wise anomalies (or discords, considering the definition first introduced in 2005), but the experiments seem to aim to detect any type of anomaly. The anomalies are not well discussed (e.g., why a given dataset is used). Given the claim of this paper, a universal comparison seems insufficient.
    * Table 1 needs more explanation. For example, what are the point anomaly ratio, the pattern anomaly count, and their ratio?
  * **[Visualization]**
    * The paper seems to introduce a universal anomaly detection approach that claims to be able to detect any type of anomaly. However, it is unclear why t-anomaly + f-anomaly can detect any type of anomaly. The authors should analyze how each type of anomaly benefits from the t-anomaly, f-anomaly, and combined final scores in the visualization shown in Fig. 6.
    * In most of the examples shown in Fig. 6, the f-anomaly score appears to play the important role in anomaly detection (other than Contextual Point). Is there any example in which the t-anomaly score plays an important role in determining the location of the anomaly, or is the t-anomaly score solely used to determine a more concrete location?
  * **[Parameter]**
    * The parameter settings are not clear. For some methods the parameters were chosen from a set of values (how?); for others the default parameters were used (is that fair?).
  * **[Evaluation Metrics]**
    * Why not present all measures in all tables to ease the identification of patterns?
  * **[Efficiency]**
    * The efficiency of the method is not mentioned. The two transformers combined would be very expensive to train on large datasets and could be impossible to deploy in a production setting such as [3].
    * It would be better to mention the computational cost of the proposed work.

---

<!-- Notes: Highlight the reviewers' positive comments; explain how each concern was resolved; ask for feedback since the end of the discussion period is approaching. For Reviewers 1 and 4: focus on the strengths in their reviews and how each concern was addressed; if another reviewer praised a relevant point, quote it briefly. Actively resolve misunderstandings and accept comments to improve the paper. -->

---

Upon reviewing your feedback, we noted that the primary concern behind the low soundness score was around negative betas. Together with the other reviewers' feedback (Reviewer 3U3m), we have tried to address it with an additional ablation and an explanation grounded in the literature. May we request your brief feedback on the proposed improvements to this paper? Also, if you view the improvements as material clarifications, we would be happy to see this reflected in an increased soundness score. We recognize that evaluations encompass various factors, and the written feedback represents just a portion of the assessment.
Nonetheless, we would greatly value any additional specific comments or questions you might have, which we can address further during the discussion period.

---

Dear Reviewer 3U3m,

We hope this message finds you well. Noting the extension of the discussion period to September 15th, we wanted to reach out and ask whether you have any further comments or concerns about our submission. Your prior feedback was very valuable, guiding us to conduct supplementary tests that have strengthened our paper. We would truly appreciate your insights on the added ablation study. Thank you for your time and consideration.

Warm regards, Authors

---

## Reviewer1 - NKfj, Scope(1), Novelty(4), Tech.Qual.(5), Conf.(4)

To save space, we have omitted the description of your question here.

```Web Application scope``` We regret that the high relevance of our paper to Web applications was not conveyed, and we respectfully assert its relevance for the following reasons. First, papers on anomaly detection in time-varying data are highly relevant to Web applications, with about 173 papers published at WWW in the last decade. Anomaly detection is commonly used to monitor states in numerous Web applications such as website KPIs [a, b], cloud servers [c], microservices [d, e, f], and health [g]. Second, we would like to highlight the key role of time-series anomaly detection as an essential technology for Web applications: robust anomaly detection mechanisms are in increasing demand and are essential to enhancing the reliability and security of web-based systems. Third, the CompanyA dataset used in the experiment was obtained from a server monitoring system belonging to the 'Web traffic and logs analysis' area of the Web Mining and Content Analysis track.

**References**

* [a] Xu, Haowen, et al. "Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications." WWW. 2018.
* [b] Dai, Liang, et al.
"SDFVAE: Static and Dynamic Factorized VAE for Anomaly Detection of Multivariate CDN KPIs." WWW. 2021.
* [c] Huang, Tao, et al. "A Semi-Supervised VAE Based Active Anomaly Detection Framework in Multivariate Time Series for Online Systems." WWW. 2022.
* [d] Jiang, Xinrui, et al. "Look Deep into the Microservice System Anomaly through Very Sparse Logs." WWW. 2023.
* [e] Xie, Zhe, et al. "Unsupervised Anomaly Detection on Microservice Traces through Graph VAE." WWW. 2023.
* [f] Shan, Huasong, et al. "ε-Diagnosis: Unsupervised and Real-time Diagnosis of Small-window Long-tail Latency in Large-scale Microservice Platforms." WWW. 2019.
* [g] Ovadia, Oded, et al. "Detection of Infectious Disease Outbreaks in Search Engine Time Series Using Non-Specific Syndromic Surveillance with Effect-Size Filtering." WWW. 2022.

> *Q1. It's not clear why such a complex solution is necessary. We have methods working on time domain. We have methods working in the frequency domain. Then, the anomaly scores (all in the time domain) can be averaged for example, to detect anomalies. A simple baseline like that is necessary to understand (and ensure) that this is indeed a complex problem and such solutions are ineffective. To state that differently, it's not clear if the improvement is coming due to technical contribution or because more "data angles/representations" are used (i.e., you have an ensembling solution but you do not compare against other ensembling solutions, a simple averaging of the score can reveal the strength of multiple methods).*

```Q1``` Thank you very much for helping us clarify our technical contribution. We argue that our contribution extends beyond the use of more data angles/representations. The definition of frequency as the inverse of the time period ($frequency = \frac{1}{period}$) contributes to the coarseness of the frequency-domain anomaly score.
In turn, addressing this frequency coarseness becomes integral to resolving the granularity discrepancy between the two domains, which is the core challenge our methodology aims to tackle. A simple combination of the two domains, as seen in TFAD (Time-Frequency analysis based time series Anomaly Detection model) [59], which represents a straightforward ensembling solution, proves insufficient for achieving precise detection at the time-point level. Therefore, our proposed method is necessary for TSAD (Time-Series Anomaly Detection) employing frequency-domain information.

<!-- [Feedback] A more detailed explanation seems necessary to respond more directly to the reviewer's concern. A good flow would be to (1) first revisit what the "inevitable granularity difference b/w domains" is, (2) explain why it is not resolved by a simple ensemble, and (3) conclude with why our method is therefore necessary. -->

<!-- Thank you very much for helping us clarify our technical contribution. We argue that our contribution extends beyond using more data angles/representations. Instead of merely detecting anomalies in two domains and ensembling them, we enhance performance by addressing the TSAD problem resulting from the inevitable granularity difference between the domains. Comparing our approach with TFAD (Time-Frequency analysis based time series Anomaly Detection model) [17], which represents an ensembling solution, shows that achieving precise detection at the time-point level is challenging with a simple ensemble alone. -->

TFAD [59] can be regarded as such a simple solution, as it essentially averages the time- and frequency-domain scores. We provide a supplementary comparison with this ensemble method in the table below. Despite employing frequency information just as TFAD does, ***Dual-TF*** exhibits significant performance gains on all datasets by addressing the granularity discrepancy that arises between the two domains.
| Algorithms | Domain | Contribution | Average Performance ($F_1$) |
|:----------:|:------:|:------------:|:---------------------------:|
| **TFAD** [59] | Time & Frequency | Using frequency domain | 0.569 $(\pm0.035)$ |
| ***Dual-TF*** | Time & Frequency | Breaking **granularity discrepancy** | **0.712** $(\pm0.011)$ |

<!-- | **Anomaly Transformer** | Time | Transformer architecture for TSAD | $0.636 (\pm0.011)$ | -->

> *Q2. The paper cites a recent benchmark, TSB-UAD [6], but does not compare against all datasets (2000 time series) and all baselines. Hence, it's not clear if the proposed solution really advances the state of the art in the area.*

```Q2``` Following your suggestion, we are currently conducting additional experiments on TSB-UAD [42]. Due to time and space constraints, we show the results on a representative time series for each domain at this moment. As summarized below, we confirm that ***dual-TF*** still outperforms the other baselines in terms of the point-wise $F_1$ score in almost all domains, except for the NASA, OPPORTUNITY, and Occupancy domains, which have less pronounced periodicity. (Please note that we will keep updating the additional experiment results during and after the rebuttal period.)
| Methods | Average for All Domains | Daphnet | Dodgers | ECG | GHL | Genesis | IOPS | KDD21 | MGAB | MITDB | NAB | NASA-MSL | NASA-SMAP | OPPORTUNITY | Occupancy | SMD | SVDB | SensorScope | YAHOO |
|:-------:|:-----------------------:| ------- | ------- | --- | --- | ------- | ---- | ----- | ---- | ----- | --- | -------- | --------- | ----------- | --------- | --- | ---- | ----------- | ----- |
| IF | 0.2238 | 0.1432 | 0.2261 | 0.1449 | 0.0471 | 0.1486 | 0.1502 | 0.0000 | 0.0096 | 0.0148 | 0.1499 | 0.2684 | 0.0000 | **0.2740** | 0.5117 | 0.5985 | 0.6846 | 0.4236 | 0.2333 |
| LOF | 0.1340 | 0.1197 | 0.1210 | 0.0596 | 0.0000 | 0.0086 | 0.3134 | 0.0000 | 0.0034 | 0.0144 | 0.1682 | **0.2833** | **0.2000** | 0.0456 | 0.5122 | 0.1490 | 0.1181 | 0.1447 | 0.1508 |
| OCSVM | 0.1546 | 0.1432 | 0.1709 | 0.1592 | 0.0471 | 0.1477 | 0.1211 | 0.0000 | 0.0066 | 0.0150 | 0.1504 | 0.2659 | **0.2000** | 0.2611 | **0.5341** | 0.1173 | 0.1464 | 0.1473 | 0.1492 |
| Anomaly Transformer | 0.2752 | 0.1438 | 0.2548 | 0.1966 | 0.1226 | 0.2333 | 0.5593 | 0.1294 | 0.1385 | 0.1450 | 0.1888 | 0.1484 | 0.1245 | 0.1350 | 0.1239 | 0.6319 | 0.7841 | 0.4499 | 0.4444 |
| ***dual-TF*** | **0.3150** | **0.1473** | **0.3337** | **0.2175** | **0.1483** | **0.4375** | **0.5946** | **0.1459** | **0.1459** | **0.1487** | **0.1896** | 0.1496 | 0.1499 | 0.1541 | 0.1465 | **0.7263** | **0.8707** | **0.4924** | **0.4706** |

> *Q3. The parameter settings are not clear. In some methods you mention that parameters were chosen from a set of values (how?) in other you say you use the default parameters in these methods (is that fair?). It's unclear if all these methods were compared fairly, under the same settings, or just results were taken "as is" from previous papers.*

```Q3``` We would like to clarify that we did our best to ensure fair comparisons with the other baselines. The primary parameter influencing anomaly detection performance is the window length for each dataset, which was standardized uniformly across all datasets to ensure fair conditions (Table 1). If the original paper of a baseline reports results on the same dataset and settings, we use the performance as presented there. If not, default parameters were employed for the deep-learning-based models.
For the machine learning-based models, parameters were set through grid search as follows:

* ISF: The number of trees is selected from $\{25, 100\}$.
* LOF: The number of neighbors is selected from $\{1, 3, 5, 12\}$.
* OCSVM: The RBF kernel is used. The inverse length-scale parameter $\gamma$ is selected from $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0.5\}$.

<!-- [Feedback] How about the following flow? "We would like to clarify that we did our best to ensure fair comparisons with other baselines." (1) Explain the important common hyperparameters that affect performance (e.g., window length). (2) When the original paper of a baseline uses the same dataset and settings, use the reported values as-is. (3) Otherwise, explain, with examples, within which ranges grid search was performed to find the best-performing values. "In the machine learning-based model, model parameters were set through grid search, while default parameters were employed in the deep learning-based model. Although we attempted various adjustments, the default settings provided by each model demonstrated the optimal performance. Additionally, the primary parameter influencing TSAD performance is the window length for each dataset, which was standardized uniformly across all datasets to ensure fair conditions (Table 1). All algorithms were reproduced and evaluated within our experimental environment." -->

> *Q4. The results are presented in a strange way. First the F1 measure is presented, then Range-AUC and VUS separately but only for 2 methods. Then ablation study happens with F1 again. Why don't you present all measures in all tables to easy identification of patterns*

```Q4``` We will organize the tables as you suggested and include all measures as completely as space permits in the final version.

<!-- Additional experiments were performed with the recently introduced metric (VUS) for the two main SOTA algorithms (Anomaly Transformer and ***dual-TF***), but due to space limitations, the results were presented separately. We will integrate and present all measures in all tables to ease identification in the final version. -->

> *Q5.
> The paper seems to be focusing mainly on DNN solutions. Data mining methodologies are only evaluated in the appendix, which is not ideal. Several methods are missing, some examples (and there are much more in the literature):*
> - "Series2Graph: Graph-based subsequence anomaly detection for time series." arXiv preprint arXiv:2207.12208 (2022).
> - "SAND: streaming subsequence anomaly detection." Proceedings of the VLDB Endowment 14.10 (2021): 1717-1729.

```Q5``` We agree with your point. Our initial focus was on presenting the results of the DNN baselines, which seem more relevant to ours (also using DNNs). However, we also believe that moving the results of widely used data mining methods from the appendix to the main table would provide readers with a more comprehensive view of the comparison. We will reflect this in the final version.

Also, we were indeed aware of the interesting relevant methods you mentioned but were not sure whether they are directly comparable with ours. As far as we know, they detect anomalies at the subsequence level, while we evaluate anomalies at the individual time-point level. Nevertheless, we conducted additional experiments on the TSB-UAD [42] dataset with the two methods in our setting. While a direct performance comparison is difficult due to the different scopes of the papers, we used the point-wise $F_1$, a traditional evaluation metric, and the R_AUC_ROC [41] measure recently designed for sequence anomaly detection. Unlike range-AUC, the point-wise $F_1$ can vary depending on the threshold setting, which is why it scored zero on some datasets. Overall, ***dual-TF*** shows superiority on both measures, especially on datasets with prominent frequencies. The results are shown as follows. We will make sure to discuss these methods and the additional experiment results in the final version.
| Methods | Evaluation Measure | Average for All Domains | Daphnet | Dodgers | ECG | GHL | Genesis | IOPS | KDD21 | MGAB | MITDB | NAB | NASA-MSL | NASA-SMAP | OPPORTUNITY | Occupancy | SMD | SVDB | SensorScope | YAHOO |
|:-------:| ------------------ | ----------------------- | ------- | ------- | --- | --- | ------- | ---- | ----- | ---- | ----- | --- | -------- | --------- | ----------- | --------- | --- | ---- | ----------- | ----- |
| SAND(offline) | point-wise $F_1$ | 0.1461 | 0.0000 | 0.1783 | **0.7385** | 0.0578 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | **0.7121** | 0.0000 | 0.0000 | 0.6564 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0100 | 0.2759 |
| | R_AUC_ROC | 0.6832 | 0.5405 | 0.6679 | 0.6075 | 0.5995 | 0.1102 | 0.2582 | 0.8074 | 0.7141 | 0.9979 | 0.9998 | 0.8799 | **0.9999** | 0.9026 | 0.6528 | 0.5761 | 0.4814 | 0.5063 | 0.9965 |
| SAND(online) | point-wise $F_1$ | 0.1303 | 0.0000 | 0.0913 | 0.6630 | 0.0095 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.4068 | 0.0000 | 0.0840 | **0.6589** | 0.0000 | 0.0000 | 0.0153 | 0.0000 | 0.0000 | 0.4167 |
| | R_AUC_ROC | 0.7489 | 0.7552 | 0.7952 | 0.9988 | 0.8374 | 0.4814 | 0.4678 | 0.5225 | **0.9163** | 0.9708 | **0.9999** | **0.9685** | **0.9999** | 0.7072 | 0.5191 | 0.6115 | 0.4277 | 0.5036 | **0.9966** |
| Series2Graph | point-wise $F_1$ | 0.1164 | 0.0000 | **0.4263** | 0.6987 | 0.0000 | 0.4062 | 0.3426 | 0.0000 | 0.0000 | 0.1012 | 0.0000 | 0.0000 | 0.0000 | 0.0024 | 0.0000 | 0.0000 | 0.0074 | 0.0665 | 0.0444 |
| | R_AUC_ROC | 0.7659 | 0.3172 | 0.8546 | 0.9981 | 0.3535 | **0.9806** | **0.9642** | 0.5145 | 0.6703 | 0.8828 | 0.8544 | 0.6240 | 0.9132 | 0.6981 | 0.7764 | **0.9183** | **0.9651** | 0.5208 | 0.9806 |
| Anomaly Transformer | point-wise $F_1$ | 0.2752 | 0.1438 | 0.2548 | 0.1966 | 0.1226 | 0.2333 | 0.5593 | 0.1294 | 0.1385 | 0.1450 | 0.1888 | 0.1484 | 0.1245 | 0.1350 | 0.1239 | 0.6319 | 0.7841 | 0.4499 | 0.4444 |
| | R_AUC_ROC | 0.6600 | 0.8328 | 0.8374 | 0.9993 | 0.9137 | 0.6244 | 0.4483 | 0.8698 | 0.5849 | 0.5236 | 0.6849 | 0.4627 | 0.5950 | 0.8163 | 0.6139 | 0.4186 | 0.4953 | **0.5260** | 0.6336 |
| ***dual-TF*** | point-wise $F_1$ | **0.3150** | **0.1473** | 0.3337 | 0.2175 | **0.1483** | **0.4375** | **0.5946** | **0.1459** | **0.1459** | 0.1487 | **0.1896** | **0.1496** | 0.1499 | **0.1541** | **0.1465** | **0.7263** | **0.8707** | **0.4924** | **0.4706** |
| | R_AUC_ROC | **0.8017** | **0.8466** | **0.8750** | **0.9993** | **0.9175** | 0.6692 | 0.6298 | **0.8911** | 0.9142 | **0.9991** | 0.9998 | 0.9043 | 0.9739 | **0.9030** | **0.7243** | 0.5805 | 0.4879 | 0.5095 | 0.6060 |

`Reminder`

Dear Reviewer NKfj,

We hope this message finds you well. As the discussion phase is coming to an end soon, we wanted to ask if you had any additional comments or concerns about our submission and rebuttal.
As you mentioned in the strengths section, we are very pleased to hear that you enjoyed reading our paper because of its writing and motivation. Upon reviewing your feedback, we note your concern reflected in the low scope score for the Web. We have tried to address this issue by referencing previous related works published at WWW, along with the relevance scores of the other reviewers, to argue for the high relevance of our paper to Web applications. We have also tried to address your concerns regarding technical novelty, parameter details, and additional experiments with baselines/datasets. Your feedback has been very valuable in guiding us to conduct additional experiments that help strengthen the paper. We sincerely appreciate your insights on these areas. Thank you for your time and consideration.

Best regards, Authors

<!-- (Original content kept.) Thank you for your valuable suggestion regarding the inclusion of Series2Graph and SAND as baselines. We appreciate your thoughtful recommendation. We acknowledge the significance of the two algorithms in subsequence anomaly detection, and they exhibit commendable performance in their respective field. However, we would like to clarify that our focus is on detecting anomalies at the time-point level, which diverges from the scope of subsequence anomaly detection. Nevertheless, we compare our approach with the mentioned papers. The results for Series2Graph and SAND are provided below and will be included in the final version. The results show that ***dual-TF*** exhibits strengths over the two algorithms on most datasets. -->

---

## Reviewer2 - gMPs, Scope(3), Novelty(6), Tech.Qual.(6), Conf.(3)

> *Q1. The proposed approach is very well described and formulated. However, some of the formulation is not easy to follow as it uses its unique formulation. I suggest adding some text to describe some essential steps in more detail.*

We are very glad to hear that you enjoyed reading our paper.
We admit that some of the formulations may be complex (e.g., Definitions 3.2 and 3.3). As you suggested, we will incorporate additional text to describe the crucial steps in more detail in the final version, and we hope that this will address your concern.

> *Q2. How important is the size of the inner window in comparison to the outer window in Dual-TF in terms of the percentage of the outer window?*

Since the outer and inner window sizes are related by $w^{outer}=\rho\cdot w^{inner}$, we report the influence of the two window lengths on performance ($F_1$) below.

| $\rho$ | Point-Global | Point-Contextual | Pattern-Shapelet | Pattern-Seasonal | Pattern-Trend |
| ------ | ------------ | ---------------- | ---------------- | ---------------- | ------------- |
| 1 | 0.3961 | 0.5947 | 0.7103 | 0.7604 | 0.3251 |
| 2 | **0.9682** | **0.9426** | **0.7413** | **0.9247** | **0.4758** |
| 3 | 0.9673 | **0.9426** | **0.7413** | **0.9247** | **0.4758** |
| 4 | 0.9461 | 0.8979 | 0.5608 | 0.8631 | 0.4248 |
| 5 | 0.9327 | 0.5528 | 0.4971 | 0.8235 | 0.3096 |

> *Q3. Are there any of the evaluation results that experiment with the window sizes? For example, a graph that shows the F1 score by changing the outer and inner window size?*

Because the window lengths, $w^{outer}=\rho\cdot w^{inner}$ and $w^{inner}$, are the most crucial hyperparameters in ***Dual-TF***, we examine the effect of varying them on the detection accuracy ($F_1$) on the TODS dataset. To show the impact of the outer window size, the table in Answer 2 above shows the change in the $F_1$ score when $\rho$ varies over $\{1, 2, 3, 4, 5\}$ while maintaining the proposed value of $w^{inner}$. The table below shows the change in the $F_1$ score when $w^{inner}$ varies by $\{0.2, 0.4, 1.0, 1.2, 1.6, 2.0\}$ times the value determined in Section 3.2 while $\rho$ remains constant at $2$.
| $w^{inner}$ | Point-Global | Point-Contextual | Pattern-Shapelet | Pattern-Seasonal | Pattern-Trend |
| ----------- | ------------ | ---------------- | ---------------- | ---------------- | ------------- |
| 0.2 | 0.5513 | 0.7553 | 0.3282 | 0.6339 | 0.0701 |
| 0.4 | 0.8416 | 0.8705 | 0.5071 | 0.6666 | 0.1536 |
| 1.0 | **0.9682** | **0.9426** | **0.7413** | **0.9247** | **0.4758** |
| 1.2 | 0.9461 | 0.8979 | 0.5608 | 0.8631 | 0.3937 |
| 1.6 | 0.8359 | 0.8904 | 0.4615 | 0.8235 | 0.1802 |
| 2.0 | 0.7045 | 0.8215 | 0.2428 | 0.5887 | 0.1197 |

`Reminder` Dear Reviewer gMPs,

As the discussion phase is about to end, we wanted to ask if you have any additional comments or concerns about our submission and rebuttal. Your previous feedback has been very helpful in further clarifying the impact of the window parameters. We truly appreciate your insights. Thank you for your time and consideration.

Warm regards, Authors.

---

## Reviewer3 - ws1N, Scope(4), Novelty(3), Tech.Qual.(5), Conf.(4)

> *Q1. It seems unclear whether the problem arises from the granularity discrepancy of the Time-Domain and Frequency-Domain scores, or from the coarseness of the Frequency-Domain's Anomaly score.*

We appreciate your understanding of this thoughtful aspect of our work. You have accurately identified the causal relationship between the coarseness of the frequency domain's anomaly score and the granularity discrepancy between the time-domain and frequency-domain scores. The inherent definition of frequency as the inverse of the time period ($frequency = \frac{1}{period}$) causes the coarseness of the frequency domain's anomaly score. In turn, addressing this frequency coarseness becomes integral to resolving the granularity discrepancy between the two domains, which is the core challenge our methodology aims to tackle.

> *Q2. The size of the inner-window for the Frequency-Domain is smaller than the size of original window, which seems to limit the context viewed.
For example, using the nested-sliding window may be effective for generating fine-grained anomaly scores; however, this approach may limit the consideration of a variety of frequencies. I wonder if there are any considerations for this limitation.*

Thank you for acknowledging the key points of our approach. To address the concern about the limited context caused by the smaller inner window for the frequency domain, we employ a nested-sliding window approach. This approach ensures that the inner window covers at least one full frequency period, which minimizes the information loss associated with a limited context view. The length of the nested-sliding window is carefully chosen to preserve frequency information effectively (see Section 3.2, Optimal Window Length, for more detail). However, extending the nested-sliding window excessively poses challenges from the perspective of the time domain: an excessively long window hinders the ability to pinpoint anomalous points in the time domain, so a balanced choice of window size is necessary. Resolving the discrepancy while keeping this balance is important for preserving frequency information and still enabling precise anomaly detection in the time domain.

> *Q3. The approach of creating various window sizes and implementing an anomaly detection method seems not to be substantially different from the proposed approach.*

We appreciate your consideration and would like to emphasize that our approach goes beyond the conventional use of various window sizes with standard anomaly detection methods. Furthermore, our key point is not merely relying on frequency information in TSAD; rather, our novelty lies in mitigating the inherent granularity discrepancy between the domains. We hope this rebuttal better highlights the distinctive novelty and technical contribution.
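As a rough illustration of the nested-sliding window idea discussed above, the following sketch enumerates inner windows inside each outer window, with the inner length set to one full period of the dominant frequency. All names (`nested_sliding_windows`, `dominant_period`), the unit stride, and the toy signal are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dominant_period(series):
    # Estimate the dominant frequency nu_major from the magnitude
    # spectrum (DC term excluded); the inner window is then set to
    # one full period, 1 / nu_major.
    spectrum = np.abs(np.fft.rfft(series))
    nu_major = int(np.argmax(spectrum[1:])) + 1  # cycles per series length
    return len(series) // nu_major

def nested_sliding_windows(series, w_inner, rho=2):
    # The outer window (length rho * w_inner) keeps time-domain context,
    # while the inner window slides inside it so that coarse per-window
    # frequency scores can still be produced at a fine granularity.
    w_outer = rho * w_inner
    pairs = []
    for o in range(len(series) - w_outer + 1):
        outer = series[o:o + w_outer]
        for i in range(w_outer - w_inner + 1):
            pairs.append((o, o + i, outer[i:i + w_inner]))
    return pairs

series = np.sin(2 * np.pi * np.arange(64) / 8)  # toy signal with period 8
w_inner = dominant_period(series)               # one full cycle of nu_major
pairs = nested_sliding_windows(series, w_inner, rho=2)
```

Under such a scheme, each time point falls into several inner windows, so per-window frequency-domain scores can be aggregated back to a per-point score, matching the granularity that the time-domain score already has.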
`Reminder` Dear Reviewer ws1N,

After reviewing your feedback, we noted that the main reason for the low novelty score came from the description of the granularity discrepancy between the time and frequency domains and the lack of emphasis on the distinction from previous works. We have tried to address this by adding a detailed description of the granularity discrepancy between the time and frequency domains, along with the positive feedback from another reviewer (Reviewer gMPs). May we ask you for short feedback on the proposed improvements to this paper? It would be helpful to receive your thoughts on the rebuttal provided, and if you consider the improvements an important clarification, we would be grateful for an increase in the novelty score. If you have any additional specific comments or questions, we'd be happy to consider them further during the discussion period.

Best regards, Authors.

---

## Reviewer4 - xu3u, Scope(3), Novelty(2), Tech.Qual.(3), Conf.(4)

> *Q1. The technique is from two existing framework of attention in time domain and frequency domain, both have been proposed before in other transformer work before. However, simply adding the scores together is not a novel loss function. This is a similar solution just like ensemble, the key discrepancy here is when does pattern based definition useful and when does the point-based definition? The paper did not offer any insight here.*

<!-- This appears to be the main reason behind the novelty score of 2. The reviewer's argument has two main points, and it is best to rebut along that flow. (1) It is not simply adding the scores of existing time/frequency methods: the Transformer was already working well in the time domain, but it had not been validated in the frequency domain, and that part is our proposal. In addition, combining the two domains entails a time-frequency discrepancy (explain in detail), and we proposed the nested windowing scheme to resolve it. (2) The analysis of the insight is included in the paper and carries implications (explain further with Figure 6, etc.). We will reflect these points more clearly in the paper. -->

We appreciate your observation regarding the existing frameworks of attention in both the time and frequency domains.
We would like to clarify that our proposed ***Dual-TF*** goes beyond a simple addition of the scores of existing time/frequency methods. In the time domain, the Transformer has demonstrated superiority for TSAD (Time-Series Anomaly Detection), while its effectiveness in the frequency domain remained unverified; it is in this context that we introduced ***Dual-TF***. More specifically, the inherent challenge lies in the fundamental nature of frequency, defined as the inverse of the time period ($frequency = \frac{1}{period}$). Given this definition, there is an inevitable discrepancy between the time and frequency domains. Our primary objective is to derive anomaly scores at the time-point level, because the essence of our research goal is anomaly detection in time-series data, which inherently requires time-point-level anomaly scores. In pursuing this goal, however, we are confronted with the fundamental issue of granularity discrepancy between the time and frequency domains. We believe that the nested-sliding window plays the key role in resolving this discrepancy. The analysis of the mentioned insight is detailed in the Introduction, including Figure 2, and the analysis in Figure 6 shows when the time domain is important and when the frequency domain is important in TSAD. We will make these points clearer in the final version of the paper.

<!-- Instead of merely detecting anomalies in two domains and ensembling them, we enhance performance by addressing the TSAD problem resulting from the inevitable granularity discrepancy between the domains. -->

<!-- We would like to provide a supplementary comparison table below, representing key methods in each domain (Anomaly Transformer & TFAD).
Despite employing the same data angles/representations as TFAD, Dual-TF exhibited significant performance differences by addressing the granularity discrepancy that arises between the two domains on all datasets.

| Algorithms | Domain | Contribution | Average Performance ($F_1$) |
|:-----------------------:|:----------------:|:------------------------------------:|:---------------------------:|
| **Anomaly Transformer** | Time | Transformer architecture for TSAD | $0.569 (\pm0.035)$ |
| **TFAD** | Time & Frequency | Using frequency domain | $0.636 (\pm0.011)$ |
| ***Dual-TF*** | Time & Frequency | Breaking **granularity discrepancy** | $0.712 (\pm0.011)$ | -->

> *Q2. The optimal window is determined by dominal frequency and uncertainty, which does not always hold since there can be non-major frequency have anomaly too. For example, in a heartbeat, the meaningful anomaly is usually in subtle changes. The 1/vmajor would simply miss such anomaly.*

We would like to address a potential misunderstanding. The length represented by $\frac{1}{v_{major}}$ ensures the inclusion of one complete cycle of all frequencies relevant to pattern anomalies. To clarify, a window of length $\frac{1}{v_{major}}$, when applied in the frequency domain, can effectively detect magnitude differences even in cases of subtle changes. For instance, in Figure 8 in Appendix A, even minor variations in the period result in distinct magnitudes of the inner window within the outer window.

<!-- > Explain case by case; the question is hard to pin down exactly. Since it is unclear what "subtle changes" means, we considered several cases. 1) When the subtle change occurs in a window larger than 1/v_major: the role of the time reconstructor. 2) When the subtle change occurs in a window smaller than 1/v_major but is non-major, so the reconstruction error in the frequency-domain window is not large: even a small change in the time domain induces a large change in the frequency domain and thus a large frequency-domain reconstruction error. The experimental results on the ECG dataset demonstrate this.
> Can an anomaly occurring at a non-major frequency be detected under the current setting? Show a window figure demonstrating that even a small time-domain change yields a large frequency change.
> From the reviewer's viewpoint, if a non-major frequency is anomalous, it should somehow be reflected in the time domain; with our model, such a non-major-frequency anomaly may not translate well into a time-domain anomaly, so it could be hard to find with our method. Applying our idea in the frequency domain, for example, would also be very interesting. -->

> *Q3. There is a surprising assumption of proof of theorem 3.1 does not make any sense in practice, why the anomaly need to be in a window that is monotonic increasing followed by monotonic decreasing?*

Thank you for letting us know your concern about Theorem 3.1. Please kindly note that there might be a misunderstanding of the assumption you mentioned. To be clear, we are **not making any assumption about the monotonicity of consecutive windows over which anomalies exist**. The assumption in Theorem 3.1 is about the **monotonicity of the uncertainty** in the time and frequency domains. Specifically, the uncertainty represents the information loss, which exhibits a trade-off between the time and frequency domains with regard to window sizes (i.e., a longer window leads to higher time uncertainty but lower frequency uncertainty). Theorem 3.1 states the optimal window size with the minimum uncertainty (not anomalies) measured in the time and frequency domains. Consequently, there is no assumption on the window sizes where anomalies exist, which we agree would not be practical at all. We hope this helps to clarify the rationale of Theorem 3.1, and we will be glad to discuss further if anything remains unclear or needs improvement. We believe that the first paragraph (Lines 1305-1309) in the proof of Theorem 3.1 (Appendix C) may mislead readers into confusing uncertainty with anomalies; we will revise it to be clearer in the final version.

<!-- > [Archived previous answer] It is unfortunate that our presentation is not satisfactory to you. We would like to provide further clarification on the assumption underlying the proof of Theorem 3.1.
The necessity for a monotonically increasing then monotonically decreasing window came from the challenge of addressing the granularity discrepancy when transforming back from the frequency domain's inner window, where pattern anomalies are detected, to the time domain. More specifically, as we transform the frequency domain back to the time domain, determining precisely which time points within the window correspond to anomalies causes the granularity discrepancy. This discrepancy becomes more pronounced as the window length increases, resulting in greater uncertainty in identifying specific points as anomalies. This uncertainty exhibits a monotonic relationship with the increase and decrease in the number of points as the window length grows. We hope this clarification provides a better understanding of the rationale behind the assumption in Theorem 3.1. -->

> *Q4. The efficiency of the method is not mentioned. The two transformers combined would be very expensive to train on large datasets, and could be impossible to deploy in any production situation such as the setting in [3].*

We appreciate your comments on the efficiency aspect. While we agree that using two transformers increases training and inference costs, we believe the cost is worthwhile given the performance improvement, especially in mission-critical anomaly detection applications. We performed additional experiments on processing time with a commercial GPU (RTX 3090Ti), summarized in the table below. Dual-TF takes on average 1.9 times longer to train and 2.3 times longer to infer than Anomaly Transformer, which uses a single transformer, but it improves accuracy by 11.95% on average over all datasets. In particular, the inference time per data point is **only 6.18ms to 9.15ms**.

**Table. Comparison of computation time with Anomaly Transformer in real-world datasets.**

| Phase(Time/iter.) | Network | ASD | ECG | PSM | CompanyA |
|:-------------------:|:-------------------:|:-----:|:-----:|:-----:|:--------:|
| Train(sec/iter) | Anomaly Transformer | 0.043 | 0.040 | 0.044 | 0.065 |
| | Dual-TF | 0.115 | 0.091 | 0.117 | 0.106 |
| Inference(ms/point) | Anomaly Transformer | 4.02 | 3.29 | 3.18 | 3.14 |
| | Dual-TF | 8.32 | 6.18 | 9.15 | 7.20 |

For ultra-fast-arriving data streams, such as the setting assumed in the work you mentioned, training and inference may need to be conducted jointly in an online manner, and we acknowledge that in such scenarios our method may not be the best solution. Depending on the nature of the application, however, inference within a few milliseconds per data point can be practical enough in situations where time-series data is accumulated periodically and the model can be updated asynchronously at periodic intervals. Nevertheless, we believe that there are many promising directions for future work to improve the time complexity, especially in terms of model selection (e.g., a lightweight backbone model other than a transformer) or windowing strategy (e.g., a sparsely nested window with approximations). We will discuss the relevant future work as well as these limitations in the final version.

<!-- > [Archived previous answer] We appreciate your consideration of the efficiency aspect. We agree that the combined use of two transformers can pose computational challenges. However, many previous methods also employ transformers and other complex architectures, making it challenging to claim superiority in terms of performance improvement. While we cannot assert a distinct advantage in time complexity compared to other approaches using transformers, we believe our method does not significantly lag behind other complex architectures in terms of time complexity.
> As shown in the table below, Dual-TF takes 1.4–2.7 times longer for training and 1.9–2.9 times longer for inference than Anomaly Transformer, which employs one transformer.
This additional cost for the improved accuracy is reasonable considering the two reconstructors and the NS-windows, because the inference time per data point is only 6.18–9.15 ms. -->

> *Q5. The literature review is based on very new work in deep learning. However, pattern -based anomaly detection has been done in the last twenty years [1][2], such as context anomaly, collective anomaly, time series discord. There is zero discussion about any of these works. Although the authors tested one matrix profile, we have no idea on how matrix profile were tested and which parameters it was tested.*

We appreciate your insightful comment, and we acknowledge that collective anomaly detection has been explored over the past two decades, including context anomalies and time-series discords. In our approach, we aim to contribute a more detailed perspective on pattern-wise anomalies classified by a behavior-driven taxonomy [32]. Our focus is on targeting these pattern-wise anomalies, further categorizing them into shapelet, seasonality, and trend anomalies. We believe this refined perspective offers a distinctive view. Regarding the matrix profile, the accuracy of the existing methods is taken from Lu et al. [37]. The number of correct predictions is simply derived by multiplying the accuracy by 250.

> [1] Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: A survey." ACM Computing Surveys (CSUR) 41.3 (2009): 1-58.
> [2] Keogh, Eamonn, et al. "Finding the most unusual time series subsequence: algorithms and applications." Knowledge and Information Systems 11 (2007): 1-27.
> [3] Lu, Yue, Renjie Wu, Abdullah Mueen, Maria A. Zuluaga, and Eamonn J. Keogh. "DAMP: accurate time series anomaly detection on trillions of datapoints and ultra-fast arriving data streams." Data Mining and Knowledge Discovery 37.2 (2023): 627-669.

`Reminder` Dear Reviewer xu3u,

We sincerely hope this message finds you well.
As the discussion phase is coming to an end soon, we would like to inquire whether you have any more comments or concerns regarding our submission and responses. We are very pleased to hear that you enjoyed reading our paper with its extensive experiments. Upon considering your feedback, we recognize your concerns regarding the theoretical analysis, the efficiency of the method, and the main contribution. To verify the significance of our work and resolve the concerns you raised, we have tried to address them by clarifying the main challenge, the theory behind nested-sliding windows, and model efficiency. We greatly appreciate your feedback, as it has provided significant guidance for further clarification that will enhance the reliability of the paper. We appreciate the time and thought you have given us.

Sincerely, Authors.

---

## Reviewer5 - PsKo, Scope(4), Novelty(4), Tech.Qual.(5), Conf.(4)

> *Q1. The claim seems a little bit contradictory, in Fig. 1 motivation, it seems the authors work on Pattern-Wise anomaly (or discord if considering the definition first introduced in 2005). But the experiment seems to aim to detect any type of anomalies. And all the anomalies are not well discussed (like why you use a dataset). Given the claim of this paper, I believe universal comparison is insufficient.*

We appreciate your insightful comments, and we will certainly enhance the discussion on dataset selection to provide a more comprehensive understanding of our experimental choices. Our primary objective is to effectively detect pattern-wise anomalies classified by the introduced behavior-driven anomaly taxonomy, while aiming for comprehensive coverage of various anomaly types. However, recognizing the importance of not overlooking point anomalies, we employ both the time and frequency domains in our approach.
Our experimental design is geared towards validating our claim: maintaining performance in detecting point anomalies while significantly improving the detection of pattern-wise anomalies. To support this claim, we selected datasets that encompass a spectrum of scenarios: datasets with solely point anomalies, datasets with only pattern anomalies, and real-world datasets involving a mixture of both. This diverse dataset selection is intended to demonstrate the superiority and effectiveness of our approach across a range of practical scenarios.

<!-- > There are many methods that already detect point anomalies well. Ultimately, our goal is to detect the newly emerging anomaly type, pattern-based anomalies, well, and thereby detect all types of anomalies well overall.
> Our aim is to find patterns well, but not only patterns: the common situation is a mixture of pattern and point anomalies, and in that case the goal is to find both patterns and points well. Should we state this explicitly? -->

> *Q2. It seems to introduce a universal anomaly detection approach which claims to be able to detect any type of anomalies. However, it is unclear why t-anomaly + f-anomaly can detect any type of anomalies. I believe authors should analyze how each type of anomalies that could benefit from t-anomaly, f-anomaly, and combined final score in visualization shown in Fig. 6*

> *Q4. Most of the example shown in Fig 6 looks like the f-anomaly score plays an important role in anomaly detection (Other than Contextual Point), any example in which t-anomaly score plays an important role in determining the location of the anomaly. Or t-anomaly score is solely used to determine a more concrete location?*

Thank you for your insightful questions. We would like to provide a comprehensive explanation of the roles of the t-anomaly and f-anomaly scores in our approach, especially in reference to the visualization in Figure 6. Their combined effectiveness comes from their distinct yet complementary roles in anomaly detection.
The t-anomaly score specializes in identifying locations with abrupt changes in time values, making it effective at detecting point anomalies. The f-anomaly score, on the other hand, precisely captures the boundaries and the whole range of pattern-wise anomalies. The collaboration between these two scores is synergistic in achieving high anomaly detection capability.

<!-- > Explain in detail, with supplementary explanation of Fig. 6.
> We should offer insight into when the point-based view is advantageous and when the pattern-based view is (time -> point, freq -> pattern).
> Using only t yields many false positives; using only f yields many false negatives. Both are needed for accurate anomaly detection. -->

> *Q3. Table 1 information needs more explanation. For example, what is # Point Anomaly Ratio?, # Pattern Anomaly? And their ratio?*

Certainly, we would like to kindly provide further clarification. By definition, each anomaly is categorized as a pattern-wise anomaly if the length of its anomaly interval exceeds 1; otherwise, it is considered a point-wise anomaly [7]. All numerical values are counted on a per-data-point basis. "# Point Anomaly" is the count of data points classified as point anomalies among the time points labeled as anomalies in the entire test time series; similarly, "# Pattern Anomaly" is the count of data points classified as pattern anomalies. The respective ratios denote the proportions of "# Point Anomaly" and "# Pattern Anomaly" among all time points labeled as anomalies.

> *Q5. It would be better to mention the computation cost of the proposed work in anomaly detection.*

As shown in the table below, Dual-TF takes 1.4–2.7 times longer for training and 1.9–2.9 times longer for inference than Anomaly Transformer, which employs one transformer. This additional cost for the improved accuracy is reasonable considering the two reconstructors and the NS-windows, because the inference time per data point is only 6.18–9.15 ms.

**Table. Comparison of computation time with Anomaly Transformer in real-world datasets.**

| Phase(Time/iter.) | Network | ASD | ECG | PSM | CompanyA |
|:-----------------:|:-------------------:|:------:|:-------:|:-------:|:--------:|
| Train(sec/iter) | Anomaly Transformer | 0.043 | 0.040 | 0.044 | 0.065 |
| | Dual-TF | 0.115 | 0.091 | 0.117 | 0.106 |
| Inference(sec) | Anomaly Transformer | 17.370 | 85.684 | 279.058 | 41.777 |
| | Dual-TF | 35.938 | 194.123 | 803.952 | 95.82 |

`Reminder` Dear Reviewer PsKo,

Thank you for your thoughtful and constructive feedback. We sincerely appreciate the insightful comments and suggestions you provided, which we believe will significantly contribute to enhancing the quality of our paper. We are very pleased to hear that you found it an interesting paper for considering the time and frequency domains simultaneously. After thoroughly considering your comments, we recognize your concerns regarding the novelty, technical contribution, computational cost, and some visual representations. We have tried to address these concerns through an enhanced presentation of the novelty, clarification of the technical contribution of the collaboration between the t-anomaly and f-anomaly scores, more explanation of the dataset statistics, and a report of the computation cost. We greatly appreciate your feedback, as it has provided significant assistance in undertaking further clarification. Thank you once again for your time and consideration.

Best regards, Authors.