Thank you for your valuable comments. We are glad to read that Reviewers A/B/C/E consider our approach novel/good and that Reviewer D views it as hard work on a difficult problem. We have studied your comments and look forward to making corrections and improvements to the paper that we hope will meet your approval in the revision process. We believe many of the concerns stem from a few misunderstandings, which we would like to clarify below.
**Common concerns** We first address issues raised by multiple reviewers:
1. *Real-world experiments/evaluation* (Reviewers C/D): We did not conduct GPS spoofing experiments on real drones for two reasons. First, GPS spoofing is illegal in most countries, so it is hard to obtain government consent and carry out real-world spoofing attacks. Second, even if we could conduct such experiments, the results would provide anecdotal evidence rather than a general performance evaluation of our methods.
2. *Dataset size* (Reviewers C/E): We would like to collect/use data for more than 9 regions, but we face limited resources and legal restrictions on drone flights. We did test our model on a smaller dataset (600 pairs) versus a larger dataset (nearly 8,000 pairs); the results show that the improvement in accuracy and F1 score is limited. We consider the accuracy (95%) high enough, especially for a prototype evaluation. Nevertheless, we continue to enrich our dataset. We plan to publish the dataset and code once our paper is accepted. We believe that, with help from the whole community, we can build a dataset large enough to be useful in security and in other domains such as computer vision.
We further address comments raised by individual reviewers.
**Reviewer A**
*Matching algorithm:* Our idea is that, in the absence of spoofing, the aerial photo a UAV takes at its real location should match the satellite image retrieved for the claimed GPS location. If the two images do not match, a spoofing attack is likely taking place. As an extreme example, suppose a UAV is flying over the Eiffel Tower while the GPS information embedded in its photos indicates that it is in NYC. The aerial photos taken over Paris will then mismatch the expected satellite images of NYC. The approach you propose could also be viable. We did not pursue it for two reasons: (1) it is much more time- and energy-consuming, since it compares an aerial image against satellite images covering a large area, whereas our method compares only one pair of images; and (2) it has lower accuracy, since matching an aerial image against many candidates to recover its coordinates is much harder than deciding whether two given images match.
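The difference in cost between the two strategies can be illustrated with a minimal sketch. It is not DeepSIM itself: `similarity` is a hypothetical stand-in (normalized correlation over flattened pixel lists) for the learned matching model, and the threshold of 0.8, the helper names, and the toy images are assumptions made purely for illustration.

```python
import math

def similarity(a, b):
    """Hypothetical stand-in for the learned matcher: normalized correlation
    between two equally sized, flattened grayscale images (lists of floats)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # featureless image: no evidence of a match
    return cov / (norm_a * norm_b)

def verify_claimed_location(aerial, satellite_at_claim, threshold=0.8):
    """Pairwise check: one comparison per aerial photo.
    Returns True when the pair does not match, i.e. spoofing is suspected."""
    return similarity(aerial, satellite_at_claim) < threshold

def localize_by_search(aerial, candidate_tiles):
    """Search-based alternative: compare against every candidate tile and
    return (index of the best match, number of comparisons performed)."""
    scores = [similarity(aerial, tile) for tile in candidate_tiles]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, len(scores)

# Toy 4-pixel "images": the Paris tile resembles the aerial photo, the NYC tile does not.
aerial_photo = [0.1, 0.9, 0.2, 0.8]
paris_tile = [0.15, 0.85, 0.25, 0.75]
nyc_tile = [0.9, 0.1, 0.8, 0.2]

spoofed = verify_claimed_location(aerial_photo, nyc_tile)    # GPS claims NYC
genuine = verify_claimed_location(aerial_photo, paris_tile)  # GPS claims Paris
best_index, comparisons = localize_by_search(aerial_photo, [nyc_tile, paris_tile])
```

The pairwise check performs one comparison regardless of how large the operating area is, while the number of comparisons in the search grows with the number of candidate tiles.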
*Obtaining images in battlefield-like situations:* On battlefields, interference may be very strong. In that case, the UAV does not need to send aerial photos to the controller; it can run the "on-board model" to detect spoofing.
*Surface irregularity and inclines:* These two factors may distort the photos (see the following illustrations). We therefore need to take photos as close to vertically as possible. We will illustrate this by adding concrete examples.


**Reviewer B**
*Advantages over existing methods:* A general discussion of our advantages appears on page 2 of our paper under ***Advantages***.
IMU-based methods suffer from intrinsic accumulated system errors, which cause the estimates to drift over long periods of operation. Since IMUs measure acceleration rather than absolute coordinates, any measurement error accumulates over time. We directly compare the real-time picture taken by the UAV with the corresponding satellite image, so our method has no intrinsic accumulated system error. Also, IMU-based methods must be active at all times, whereas our method can be turned on/off as needed. We will make this clear in the revised version.
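The drift argument can be made concrete with a short worked equation. As a simplifying assumption for illustration (not a model taken from the paper), suppose a one-dimensional accelerometer reports $\tilde a(t) = a(t) + b$ with a constant bias $b$. Integrating twice to obtain position gives the error

$$
e(t) = \int_0^t \int_0^{\tau} b \,\mathrm{d}s \,\mathrm{d}\tau = \tfrac{1}{2}\, b\, t^2 .
$$

With an illustrative bias of $b = 0.01\,\mathrm{m/s^2}$, the position error is about 18 m after 60 s and about 1800 m after 600 s. An image comparison, by contrast, is a fresh measurement at every check, so its error does not grow with elapsed flight time.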
As for the mentioned paper, it presents an effective attack on IMU-based methods. With a 2-sigma threshold, IMU-based detection has a high false alarm rate; with a 5-sigma threshold, it is vulnerable to covert attacks. Our method, however, is immune to such attacks.
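The threshold trade-off can be quantified under a simplifying assumption that is ours, not the cited paper's: if the detector's residual is zero-mean Gaussian, the false alarm probability of a two-sided $k$-sigma test is $\operatorname{erfc}(k/\sqrt{2})$. The sketch below evaluates it for the two thresholds mentioned; the function name is hypothetical.

```python
import math

def false_alarm_rate(k_sigma):
    """P(|Z| > k) for a standard normal Z: the chance that a benign residual
    exceeds a two-sided k-sigma threshold (Gaussian assumption)."""
    return math.erfc(k_sigma / math.sqrt(2))

rate_2_sigma = false_alarm_rate(2)  # roughly 4.6% of benign samples flagged
rate_5_sigma = false_alarm_rate(5)  # fewer than one benign sample in a million
```

Under this assumption, a 2-sigma threshold flags about one benign sample in 22, whereas a 5-sigma threshold almost never raises a false alarm but tolerates any injected deviation that stays within five standard deviations.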
*Drones without cameras:* Drones without cameras exist, but they are not common. Cameras are an essential part of nearly all UAV systems, as they assist in maneuvering the drone and provide real-time video feeds to the controller. Even UAVs sold without a camera can be manually equipped with one.
Thank you for pointing out the imprecision in our classification of [32]. We will correct it in the revised version and also add the post-detection methods to our paper.
**Reviewer C**
*Accuracy comparison:* We compare DeepSIM with previous GPS spoofing attack detection methods in Section VII (Related Work), in particular with respect to accuracy. We point out, though, that this is a qualitative comparison: because the DeepSIM approach and data are novel, it is difficult to conduct a fair quantitative performance comparison.
*Novelty of the image comparison technique:* Indeed, we leverage existing image pairing algorithms to build a novel system for detecting GPS spoofing attacks. Proposing novel image pairing methods was not our goal.
*Limitations of our approach:* Oceans and deserts could indeed be obstacles for our approach. Severe weather conditions, however, are addressed by our image augmentation techniques. For night flights, UAVs should have night vision cameras installed, as flying at night is otherwise dangerous. With night vision cameras, our approach will likely work (although we have not tested this yet). As for mountains and forests, there are still enough visual features for our method, except in highly homogeneous mountain/forest areas.
**Reviewer D**
*On-board model storage/power:* According to data from satellite image service providers, the raw data taken by QuickBird for a $25\times 10^6\,\mathrm{m}^2$ ($5\,\mathrm{km} \times 5\,\mathrm{km}$) area is about 400 MB. For reference, the maximum travel distance of the DJI MavicPro is 13 km. The MavicPro supports a 128 GB MicroSD card, which is more than enough for the satellite images. Regarding power consumption, 300 runs correspond to about 1 hour of flight with DeepSIM working constantly. One hour of operation discharges less than 10% of the battery, which is an acceptable level of energy consumption.
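As a rough sanity check of these figures, the following back-of-the-envelope sketch estimates the imagery needed to cover the drone's reach. The worst-case coverage model (a full disc whose radius equals the 13 km travel distance) and the decimal conversion 1 GB = 1000 MB are our assumptions for illustration; the remaining constants are the figures quoted above.

```python
import math

# Figures quoted in the response above.
MB_PER_TILE = 400.0        # raw QuickBird data for one 5 km x 5 km tile
TILE_AREA_KM2 = 5.0 * 5.0  # 25 km^2
MAX_TRAVEL_KM = 13.0       # DJI MavicPro maximum travel distance
CARD_GB = 128.0            # supported MicroSD capacity
RUNS_PER_HOUR = 300        # DeepSIM runs during about 1 hour of flight

# Assumption (ours): worst-case coverage is a full disc of radius 13 km,
# and 1 GB = 1000 MB.
coverage_km2 = math.pi * MAX_TRAVEL_KM ** 2
required_gb = coverage_km2 / TILE_AREA_KM2 * MB_PER_TILE / 1000.0
card_fraction = required_gb / CARD_GB

# Average spacing between consecutive DeepSIM runs.
seconds_per_run = 3600 / RUNS_PER_HOUR

print(f"imagery needed: {required_gb:.1f} GB ({card_fraction:.1%} of the card)")
print(f"one run every {seconds_per_run:.0f} s")
```

Under these assumptions, the imagery amounts to roughly 8.5 GB, under 7% of the 128 GB card, and 300 runs per hour correspond to one check every 12 seconds.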
*False positive rate:* Section VI.A discusses three possible solutions to reduce the FPR. We will add the accuracy performance of the first solution to the revised version.
*Data augmentation:* We only augment aerial images, because they can be affected by environmental conditions. Satellite images are selected by satellite image service providers and are usually stable and clear, so there is no need to augment them. We do have an 8x blowup in the dataset size. With augmentation, our model can perform better under different lighting/season/weather/shadow/etc. conditions.
*Error tolerance:* With slight mismatches (<15m, the error range of GPS), our system works fine. As the mismatch increases, DeepSIM is more likely to classify it as an attack, which is reasonable because large mismatches should be considered attacks. Table VIII does not agree with Table VII because they report different tests run on different data.
We will correct Table IV in the revised version.
**Reviewer E**
*More evaluation:* We preprocessed the images to unify their resolution (Section V.B.1). Our tests show that a higher resolution achieves slightly better results; however, the current resolution ($960\times 720$) strikes a good balance between accuracy and computation overhead. We collected photos from different areas including 9 types of landscapes (mountain/forests/city). Regarding the flying height, we did vary the size of the surface areas to simulate different heights by cropping the images, as you suggested (Section V.B.2).
We agree that non-random data splits can help us better understand the performance. We will add this to the revised version and incorporate the additional references.
We thank all reviewers for pointing out typos.