We thank the reviewers for their detailed comments. We are glad to read that Reviewers A & C appreciate the timeliness and urgency of the addressed topic and the developed technique, and that Reviewers D & E point out that our approach can be applied in the hard-to-change infrastructural aviation context -- a challenge in itself. We have studied your comments carefully and will make corrections accordingly to further improve and clarify the paper. We are happy to provide further results, correct misunderstandings, and answer your questions.
### Classification Performance (A, B, C, D)
In an aviation surveillance system that currently applies no security or trust metrics, even suboptimal true positive rates are a big step forward and a considerable improvement in themselves. We acknowledge that an FPR of ~11% is not appropriate for issuing warnings without additional filtering. Naturally, we do not encourage a final decision on a "per message" basis and will emphasize this further in the paper. However, we decided to report "per message" classification performance because it considers the smallest possible classification unit. To address this issue, we will additionally implement classification on a "per track" basis, which will greatly lower the FPR as well as the FNR (down to ~0%, depending on attack sensitivity and decision delay), as sketched below.
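For illustration, a minimal sketch of such a per-track aggregation (window size and threshold are hypothetical placeholders, not our final parameters) could look as follows:

```python
# Sketch of per-track aggregation: a track is only flagged once a sufficient
# fraction of per-message decisions within a sliding window agree.
# Window size and threshold are illustrative placeholders.
from collections import deque

def per_track_decision(message_flags, window=50, threshold=0.8):
    """message_flags: iterable of 0/1 per-message classifier outputs for one track.
    Returns the message index at which the track is flagged, or None."""
    recent = deque(maxlen=window)
    for i, flag in enumerate(message_flags):
        recent.append(flag)
        # Decide only after enough messages have been observed (decision delay).
        if len(recent) == window and sum(recent) / window >= threshold:
            return i
    return None
```

Larger windows and higher thresholds trade a lower FPR against a longer decision delay.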
### Attack Simulation (B, C, D, E)
We indeed work with synthetic attack data; this is, however, a common approach, as no ground truth for attack data on aviation and ADS-B exists, and inserting attacks into the live system would be unethical and irresponsible. Due to the lack of attack samples, we are bound to generate them ourselves, and we keep the simulation as realistic as possible. ADS-B spoofing attacks may lead to several outcomes, e.g., the injection of fake aircraft, confusion of air traffic controllers, or the triggering of collision avoidance warnings. We detect the entire attack class inherently, independent of the attacker's objective. Moreover, an attacker may very well affect multiple sensors; however, the reproduction of consistent reception patterns cannot be performed in a targeted manner (with the exception of an attacker located near the position from which the reports initially originated).
### Machine Learning Specifics (B, D, E)
We use machine learning to model the patterns of how signals are received by which sensors. It is not possible to cross-check these patterns against physical rules because accurate sensor locations are lacking (many sensors in fact hide their positions on purpose).
As for the selection of the classification algorithm, we initially chose a single ML classifier and, when it was suggested, changed the evaluation towards comparing ML classifiers. The algorithms we chose are four typical and widely used binary classification algorithms. We will add more discussion of our ML training and clustering details in the revised version, along with more specifics about our choice and the utilized parameters. Naturally, this is an anomaly detection problem, and we address it with supervised binary classification models because it is indeed feasible to generate realistic attack data. Under these conditions, supervised learning outperforms unsupervised anomaly detection. The sketch below illustrates the type of comparison we perform.
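As an illustration only (the feature matrix, labels, chosen models, and hyperparameters shown are placeholders, not the exact setup of the paper), such a comparison of standard binary classifiers could be set up with scikit-learn as follows:

```python
# Sketch: compare several standard binary classifiers on reception-pattern
# features. X (binary reception patterns) and y (0 = genuine, 1 = synthetic
# attack) as well as the hyperparameters are illustrative placeholders.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    "random_forest": RandomForestClassifier(n_estimators=100),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "mlp": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
}

def compare_classifiers(X, y):
    """Return the mean cross-validated F1 score per classifier."""
    return {name: cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
            for name, clf in CLASSIFIERS.items()}
```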
### Grid Approach (A, D)
Sensors are assigned to grids based (only) on their observed reception patterns. Notably, this is not sufficient to pinpoint sensors. The addition or removal of sensors in other regions (e.g., America) does not affect grids in which the sensor cannot receive reports, since each grid is trained separately. Hence, only the affected grid areas need to be extended and retrained, which can be done dynamically, as the sketch below illustrates.
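A minimal sketch of this idea (data layout and cell size are hypothetical, not the paper's exact implementation):

```python
# Sketch: each grid cell keeps the set of sensors that have received reports in
# it; a sensor change only requires retraining the cells that sensor touches.
from collections import defaultdict

def build_grid_index(reports, cell_size=1.0):
    """reports: iterable of (lat, lon, sensor_id) tuples.
    Returns a mapping {grid cell -> set of sensor ids observed in that cell}."""
    grid = defaultdict(set)
    for lat, lon, sensor_id in reports:
        cell = (int(lat // cell_size), int(lon // cell_size))
        grid[cell].add(sensor_id)
    return grid

def cells_to_retrain(grid, changed_sensor):
    """Only the cells in which the added/removed sensor appears need retraining."""
    return [cell for cell, sensors in grid.items() if changed_sensor in sensors]
```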
### Attack Delay (A, C)
Reference [15] reports detection after a few seconds but relies on assumptions (multiple targets, secondary positioning) that our approach does not require. Attacks (e.g., GPS spoofing) are detected with a rate of more than 99% after 50 minutes; however, we already obtain indications of the attack as early as a few minutes after it has been launched, which is crucial for live reaction and counter-mechanisms.
### Auxiliary Solution (D)
We agree that a "security-by-design" solution to replace ADS-B is more satisfactory, but, unfortunately, it is not realistic and would simply not take the step from a theoretical consideration to the real world (at least within the next decades) due to the way the aviation industry operates.
### Attack Detection (E)
We do not understand why "incremental attacks" should not be detectable with our approach since we consider both "incremental" GPS spoofing (slowly deviating from the authentic track) and "incremental" ADS-B spoofing (injecting reports slowly departing the origin). These attacks are detectable as soon as the reception pattern differs which cannot be prevented by "incremental" change. We provide detection rates for both attack classes.
### Holistic Evaluation (A)
A holistic approach (a combination of all four tests) is an interesting direction and may improve the classification performance. In the revised version, we will provide an evaluation of the effect of filtering reports based on the first three tests (sanity, differential, dependency). Thereby, we are able to provide the requested TPR/FPR of the combined system; a sketch of the envisioned combination follows below.
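A minimal sketch of how such a combination could look (the check functions are hypothetical placeholders for the four tests, not our implementation):

```python
# Sketch: a report is first filtered by the three rule-based tests; only reports
# that pass all of them are handed to the ML-based reception-pattern test.
def combined_decision(report, sanity_check, differential_check, dependency_check, ml_check):
    """Each *_check returns True if the report passes that test.
    Returns True if the report is flagged as an attack."""
    if not (sanity_check(report) and differential_check(report) and dependency_check(report)):
        return True  # flagged by one of the rule-based tests
    return not ml_check(report)  # otherwise defer to the learned test

def tpr_fpr(decisions, labels):
    """decisions/labels: 1 = attack. Returns (TPR, FPR) of the combined system."""
    tp = sum(d and l for d, l in zip(decisions, labels))
    fp = sum(d and not l for d, l in zip(decisions, labels))
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / max(positives, 1), fp / max(negatives, 1)
```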
### Separation Minima (A)
Since the average distance between two sensors is still large at this point, we can only reach a TPR of approximately 60% (see Fig. 6). However, with the mandatory deployment of ADS-B for aircraft, the density of sensors can be expected to increase, and we believe our method will then detect such attacks with high accuracy within the safe separation minimum (5 NM).
### Deployment (A)
In general, our trust evaluation framework can be deployed by anyone with live access to the data. We want to highlight that even the owners of the OpenSky Network do not know the positions of anonymous sensors, in which case the performance of MLAT is limited. Further, MLAT is not secured against malicious attempts (e.g., signal injections) to fool the system, rendering the MLAT test useless. Such an attack, however, is detectable by our system.
### Minor (A)
The notion of "metadata" only refers to the binary reception events; no time or other information is needed.
### Open Questions (B)
In order to address the details of the review, we would need further clarification and explanation regarding the following questions:
- In which sense should the notion of trust that we apply be considered simplistic? What is missing?
- What are references for data trustworthiness and security in participatory sensing? Would they be applicable to the context we consider?
Finally, we thank all reviewers for pointing out typos.