We thank all reviewers for their precious time and valuable feedback.
## Review A
---
#### Details on the precise steps
AirTag first detects malicious log entries and then reconstructs the attack story by following the causality among the detected entities and connecting them with any additional entities needed to link them. In other words, the graph is constructed after the OC-SVM classification. We will add these details and provide an algorithm.
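The ordering described above (classify first, build the graph afterwards) can be sketched as follows. This is an illustration only, not AirTag's algorithm: it assumes each log entry reduces to a `(subject, operation, object)` triple and that the classifier output is already available as a set of flagged entry indices. The function name `build_attack_graph`, the triple format, and the `max_hops` parameter are hypothetical.

```python
from collections import defaultdict, deque


def build_attack_graph(entries, flagged, max_hops=1):
    """Return the entries kept for the attack graph, in log order.

    entries:  list of (subject, operation, object) triples (assumed format).
    flagged:  set of entry indices the classifier marked malicious (assumed given).
    max_hops: how far to expand from flagged entities to pull in connecting
              entries; 0 keeps only the flagged entries themselves.
    """
    # Index every entry by the entities it touches.
    by_entity = defaultdict(list)
    for i, (subj, _op, obj) in enumerate(entries):
        by_entity[subj].append(i)
        by_entity[obj].append(i)

    # Seed the traversal with entities that appear in flagged entries.
    depth_of = {}
    queue = deque()
    for i in sorted(flagged):
        for ent in (entries[i][0], entries[i][2]):
            if ent not in depth_of:
                depth_of[ent] = 0
                queue.append(ent)

    # Breadth-first expansion, bounded by max_hops.
    kept = set(flagged)
    while queue:
        ent = queue.popleft()
        if depth_of[ent] >= max_hops:
            continue
        for i in by_entity[ent]:
            kept.add(i)
            for nxt in (entries[i][0], entries[i][2]):
                if nxt not in depth_of:
                    depth_of[nxt] = depth_of[ent] + 1
                    queue.append(nxt)
    return [entries[i] for i in sorted(kept)]
```

Because the traversal starts only from flagged entries, the graph is built over a small fraction of the log rather than over every event.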
#### Labeling and comparison
ATLAS reproduced well-documented attacks and performed controlled experiments. All key attack steps are therefore known in advance. ATLAS labeled all key attack entities. Its labeling is at the entity level and thus marks more events as "malicious" (e.g., a payload-irrelevant event belonging to malware). Our labeling is fine-grained, at the event level, so we mark payload-irrelevant events as benign. We reproduced the ATLAS experiments, relabeled the data, cross-checked the labels with several individuals, and confirmed them with the ATLAS authors. We will clarify.
For the comparison, we re-ran all ATLAS experiments and reproduced their results. We ran ATLAS and AirTag on the same machine when comparing time costs (Section IV-A). Hence, the comparison is fair.
#### ROC curve
Thanks for the suggestion. We will add ROC curves. Our preliminary results show that AirTag is more robust than ATLAS in recovering the whole attack story. We will add details about the results and analysis.
#### Generalization of tokenizer
Thanks for the suggestion. Existing work [1] can parse different log formats automatically. Our tokenizer is generic to the common components of preprocessed logs, e.g., composite paths and operations. We believe the whole process can be automated and generalized. We will add a discussion.
[1] Logstash: https://www.elastic.co/logstash/
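To illustrate what handling composite paths could look like, here is a minimal sketch. It is not the tokenizer from Section III-C: the function name `tokenize_log_entry` and the splitting rule (whitespace first, then path separators, dots, and colons) are assumptions chosen for illustration.

```python
import re


def tokenize_log_entry(entry):
    """Split a preprocessed log entry into lowercase tokens.

    Composite fields such as file paths and host names are broken into
    their components instead of being replaced by a placeholder.
    """
    tokens = []
    for field in entry.split():
        # Split on '/', '\', '.', and ':' and drop empty pieces, so
        # "/usr/bin/python3.8" becomes ["usr", "bin", "python3", "8"].
        parts = [p for p in re.split(r"[\\/.:]+", field) if p]
        tokens.extend(p.lower() for p in parts)
    return tokens
```

Keeping the path components as separate tokens lets a downstream model relate, for example, two files in the same directory, which a single abstracted token would hide.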
#### Inconsistent statements
Thank you. AirTag does not require labeling. We will revise accordingly.
#### Related work
Thank you for the pointers. Log2vec converts log entries into a heterogeneous graph and performs graph embedding. AirTag directly embeds natural language artifacts. Attack2vec requires labeling attack events as the training data, which is labor-intensive and error-prone. AirTag is unsupervised. We will discuss.
## Review B
---
#### Why and how to apply NLP techniques to security log analysis
Analyzing logs to understand system activities is similar to NLP tasks such as sentiment analysis. The core mission is to capture the semantics of individual words and their relationships. For example, sentiment analysis identifies emotional words (e.g., like) and their relations with other words (e.g., not, hardly) to judge the sentiment. Similarly, AirTag identifies entities (e.g., processes) and their relations to determine whether the logged behavior is malicious. Thus, we believe NLP techniques can be used. ATLAS also confirms this.
There are two new challenges: heterogeneous log formats and imbalanced data. AirTag uses a dedicated tokenizer that is better suited to processing log data (Section III-C). To handle the imbalanced data in security tasks and avoid data labeling, AirTag employs unsupervised learning models such as BERT and OC-SVM. Our evaluation results confirm the effectiveness of these designs. We will discuss.
#### Irrelevant logs
AirTag builds the causal graph in postprocessing (on a significantly smaller dataset). It removes irrelevant logs in both preprocessing and postprocessing. In preprocessing (Section III-B), AirTag merges log entries from different sources (consistent with existing methods[2]). In postprocessing, we filter out unrelated logs when reconstructing the causality. We will clarify and provide more details.
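The preprocessing merge mentioned above can be pictured with a short sketch. It assumes each source is already sorted by timestamp and that records are `(timestamp, source_name, message)` tuples; both the record layout and the name `merge_log_sources` are hypothetical and not taken from the paper.

```python
import heapq


def merge_log_sources(*sources):
    """Merge per-source logs into one stream ordered by timestamp.

    Each source must already be sorted by its first field (the timestamp).
    """
    # heapq.merge performs a lazy k-way merge; key= selects the timestamp.
    return list(heapq.merge(*sources, key=lambda rec: rec[0]))
```

For instance, merging a DNS log and an audit log this way yields a single time-ordered sequence in which a name resolution and the process activity that follows it appear next to each other.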
#### Threshold and details of the detection
AirTag performs forensic analysis, not anomaly detection.
The thresholds are determined by following the standard procedure in the machine learning community. We leverage a small randomly sampled training dataset and determine hyperparameter values by comparing their effectiveness. Fig. 7 shows the comparison of using different hyperparameters. We will clarify.
The key for AirTag to generalize to unseen logs is capturing relationships between words in addition to the semantics of individual words, a task on which DNNs show superior results in NLP. For example, sentiment analysis models recognize the words 'like' and 'not' as well as their relation (e.g., 'do not like') to determine the sentiment. For `xaa.com resolve 192.b.c.d`, `xaa.com` is a malicious web address with no benign semantic information in its embedding. The contextual information of `xaa.com resolve 192.b.c.d` differs from that of the benign entry `aa.com resolve 149.a.b.c` because `192.b.c.d` downloads payloads in later steps. Therefore, `xaa.com resolve 192.b.c.d` is classified as malicious.
#### False positives
AirTag obtains a classic causal graph in post-processing. The false positives in AirTag are connected. We will provide a detailed analysis of representative cases to clarify.
AirTag does not raise alarms for single incidents (e.g., an unseen domain name). Similar to how BERT models generalize to unseen entities, AirTag can generalize to new IP/domain names by capturing their contexts and relationships with others.
In NLP tasks, such a scheme has been shown to generalize to unseen data. Our evaluation in Fig. 12 also confirms this. In particular, when the specific names of the payloads are changed, AirTag still achieves good results. We will discuss.
## Review C
---
#### Benefits
We respectfully disagree that the efficiency of attack investigation is unimportant. As explained by existing work [29] that aims to optimize investigation time, attack investigation is a time-sensitive task. Fast investigation shortens the time needed to fix the compromised system, which significantly reduces financial loss, and helps analysts understand attack intentions to prevent potential future damage.
As shown in Fig. 8, AirTag is 2.3x faster than ATLAS. The time-consuming graph operations are integral components of ATLAS, and later components depend on them. Graph operations are hard to optimize [29], and such optimizations do not generalize. In contrast, ML acceleration techniques, e.g., XLA [1], XNNPACK [2], and AI chips, can further reduce the overhead of AirTag. We will discuss.
In addition to low time costs, AirTag removes labor-intensive and error-prone labeling and achieves better generalization (Section II-B).
[1] XLA: https://www.tensorflow.org/xla
[2] XNNPACK: https://github.com/google/XNNPACK
#### Fairness of comparison
AirTag uses frequency-based heuristics just like ATLAS (Section 4.1 of ATLAS paper), consistent with existing papers[2,29]. The comparison is fair.
#### Technical differences
Compared with LogBERT and LAnoBERT, AirTag uses a novel tokenizer that interprets the semantics of file paths instead of abstracting them. Abstracting such information loses its relations with other entities, which causes attack investigation to fail. We will discuss.
#### Details for inputs of BERT and OC-SVM
BERT takes multiple log entries as input (the length is 32).
OC-SVM takes the embedding of each log entry. We will clarify.
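One way to picture the grouping of log entries into BERT inputs is the sketch below. It rests on assumptions the text above does not settle: it reads "length 32" as 32 log entries per input (it could equally mean 32 tokens), and it uses non-overlapping windows. The name `window_entries` is hypothetical.

```python
def window_entries(entries, size=32):
    """Group consecutive log entries into fixed-size, non-overlapping windows.

    The final window may be shorter than `size` (assumed behavior).
    """
    return [entries[i:i + size] for i in range(0, len(entries), size)]
```

Under this reading, each window would be encoded together so that an entry's embedding reflects its neighbors, and the per-entry embeddings would then be passed to the OC-SVM one at a time.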
#### Case study
We will provide a case study to help understand the results of AirTag.
## Review D
---
#### Definition and scope of attack investigation
In short, attack investigation aims to recover the attack chain for given attack symptoms. The analysis handles a noisy network and does not require cleaned data. We will clarify.
#### Evaluation on different attack scenarios
Thanks for the suggestion. Our log files are collected from real attack scenarios, containing both benign (majority) and attack behaviors (Section IV-A). We will provide more details and add experiments on incomplete attacks.