Thanks for your detailed and valuable comments. We are glad to read that Reviewers A, B, C, and E found our approach and application setting novel, and that Reviewer D acknowledged the hard work on a difficult problem. We have studied your comments carefully and will make corrections and improvements that we hope will meet your approval in the revision process. We believe many concerns stem from a few misunderstandings, which we clarify below.
**Common concerns** We first address two issues raised by multiple reviewers:
1. *Real-world experiments/evaluation* (Reviewers C & D): We did not conduct GPS spoofing experiments on real drones for the following reasons. First, GPS spoofing is illegal in most countries, so it is hard to obtain government consent to implement real-world spoofing attacks. Second, even if we could conduct such experiments, the results would provide anecdotal evidence rather than enable a general performance evaluation of our methods.
2. *Dataset size* (Reviewers C & E): We would like to collect/use data for more than 9 regions, but we face limited resources and legal drone bans. We did test our model on a smaller dataset (600 pairs) versus a larger one (nearly 8,000 pairs); the results show that the improvement in accuracy and F1 score is limited. We consider the accuracy (95%) high enough, in particular for a prototype evaluation. Nevertheless, we keep enriching our dataset and plan to publish data and code once the paper is accepted. We believe that, with help from the whole community, we can build a dataset large enough to be useful in the security area as well as in other domains such as aviation and computer vision.
We further address comments raised by individual reviewers.
**Reviewer A**
*Matching algorithm:* Our idea is that a UAV's aerial photo taken at its real geolocation should match the satellite image extracted for the claimed GPS location if there is no spoofing attack. If these two images do not match, a spoofing attack is likely taking place. As an extreme example, suppose a UAV is flying over the Eiffel Tower while the GPS information embedded in its photos indicates it is in NYC; the aerial photos taken over Paris will not match the expected satellite images of NYC. The alternative approach you propose could also be viable. However, we do not adopt it for the following reasons: (1) it is much more time- and energy-consuming, since it compares an aerial image against a large area of satellite imagery, whereas our method only needs to compare one pair of images; and (2) it can be expected to have lower accuracy, since locating the coordinates of an aerial image is much harder than deciding whether two (approximately aligned) images match.
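To make reason (1) concrete, here is a minimal sketch of the pairwise check: detection reduces to a single similarity test between two feature vectors, rather than a search over a large map. The function name `is_spoofed`, the feature vectors, and the threshold value are illustrative assumptions, not our actual model:

```python
import numpy as np

def is_spoofed(aerial_feat, satellite_feat, threshold=0.8):
    """Flag a spoofing attack when the aerial photo's feature vector does
    not match the satellite tile at the claimed GPS location.
    (Hypothetical sketch: features and threshold are placeholders.)"""
    a = aerial_feat / np.linalg.norm(aerial_feat)
    s = satellite_feat / np.linalg.norm(satellite_feat)
    similarity = float(np.dot(a, s))  # cosine similarity in [-1, 1]
    return similarity < threshold     # low similarity -> likely spoofed

# Matching pair (photo agrees with claimed location): no alarm.
f = np.array([0.2, 0.9, 0.4])
assert is_spoofed(f, f) is False
# Mismatched pair (e.g., Eiffel Tower photo vs. an NYC tile): alarm.
g = np.array([0.9, -0.1, 0.3])
assert is_spoofed(f, g) is True
```

A single dot product per frame is why the pairwise design is cheap compared with matching against every tile in a region.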
*Obtaining images in battlefield-like situations:* In battlefield-like contexts, interference may be very strong. In this case the UAV does not need to send aerial photos to the controller; it can run the "on-board model" to detect spoofing attacks. Once spoofing is confirmed, an appropriate post-detection strategy can be adopted.
*Surface irregularity and inclines:* Both factors may distort the photos (e.g., scale ratios may not be correct at different elevations; see the following illustrations), so we need to collect photos as vertically as possible. We will illustrate this with a concrete example in the paper.


**Reviewer B**
*Advantages over existing methods:* A general discussion of our advantages can be found on Page 2 of our paper (***Advantages***).
As for IMU-based methods, they suffer from intrinsic accumulated system errors, resulting in drifting estimates over long periods of operation. Since IMUs measure only acceleration, not absolute coordinates, even small measurement errors accumulate over time. We directly compare the real-time picture taken by the UAV with the corresponding satellite image, so there is no intrinsic accumulated system error. Also, IMU-based methods must be active all the time, whereas our method can be turned on and off as needed. We will make this clear in the revised version.
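The drift argument can be sketched numerically: double-integrating even a tiny constant accelerometer bias yields a position error that grows quadratically with time. The bias magnitude and sample rate below are assumed purely for illustration:

```python
import numpy as np

# Illustrative sketch (assumed numbers): a constant accelerometer bias,
# double-integrated, produces a quadratically growing position error.
dt = 0.01                         # 100 Hz IMU samples (assumed)
bias = 0.02                       # 0.02 m/s^2 constant bias (assumed)
t = np.arange(0, 60, dt)          # one minute of flight
vel_err = np.cumsum(np.full_like(t, bias)) * dt  # integrate acceleration
pos_err = np.cumsum(vel_err) * dt                # integrate velocity
print(f"position error after 60 s: {pos_err[-1]:.1f} m")  # ~0.5*b*t^2 = 36 m
```

A vision-based comparison against a satellite image has no such integration step, so its error does not grow with flight time.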
As for the mentioned paper, it describes an effective attack against IMU-based methods: if the threshold is 2-sigma, IMU-based detection has a high false-alarm rate; if the threshold is 5-sigma, it is vulnerable to covert attacks. Our method, however, is immune to such attacks.
*Drones without cameras:* Drones without cameras do exist, but we point out that they are not common. Cameras are now an essential part of most UAV systems, as they assist in drone maneuvering and provide real-time video feeds to the ground controller. Even UAVs sold without a camera can be equipped with one manually.
Thanks for pointing out the imprecision in our classification of [32]; it is indeed detection via direction-of-arrival sensing rather than detection at the signal level. We have corrected this in our latest version. Regarding the post-detection methods, we will add these references to our paper.
Thanks for the advice on post-detection countermeasures; we will add these examples and references to our revised version.
**Reviewer C**
*Accuracy comparison:* We compare DeepSIM with previous GPS spoofing attack detection methods in Section VII (Related Work), in particular regarding accuracy. We point out, though, that this is a qualitative comparison: because both the DeepSIM approach and the data are novel, it is difficult to conduct a quantitative performance comparison without losing fairness.
*Novelty of the image comparison technique:* Indeed, we leverage existing image pairing algorithms to build a novel system for detecting GPS spoofing attacks; proposing novel image pairing methods was not our intention in the first place.
*Limitations of our approach:* It is true that oceans and deserts could be obstacles for our approach. However, severe weather conditions are addressed by our image augmentation techniques. For UAV night flights, night-vision cameras should be installed, as flying a UAV at night is otherwise quite dangerous; with such cameras, our approach may also work (although we have not tested it yet). As for mountains and forests, unless the area is highly homogeneous, there are still enough visual features for our method.
**Reviewer D**
*On-board model storage and power requirements:* We looked up data from satellite image service providers. For example, raw data taken by QuickBird for a $25\times 10^6\,m^2$ ($5\,km \times 5\,km$) area is about 400 MB. As a reference, the maximum travel distance of a DJI Mavic Pro is 13 km, and it supports 128 GB MicroSD cards, which is more than enough for satellite images. Regarding power consumption, running the model 300 times corresponds to about one hour of flying with DeepSIM working constantly; one hour of operation discharges less than 10% of the battery, which is an acceptable energy consumption level.
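The storage claim can be checked with back-of-envelope arithmetic using the figures above. The worst-case coverage model below (pre-loading every tile in a square of side twice the maximum range) is our own illustrative assumption:

```python
# Back-of-envelope storage check using the figures quoted in the text.
MB_PER_TILE = 400            # QuickBird raw imagery per 5 km x 5 km tile
TILE_AREA_KM2 = 5 * 5
MAX_RANGE_KM = 13            # DJI Mavic Pro maximum travel distance
SD_CARD_MB = 128 * 1024      # 128 GB MicroSD card

# Worst-case assumption (ours): pre-load all tiles within a square of
# side 2 * 13 km centred on the launch point.
area_km2 = (2 * MAX_RANGE_KM) ** 2
storage_mb = area_km2 / TILE_AREA_KM2 * MB_PER_TILE
print(f"{storage_mb:.0f} MB needed of {SD_CARD_MB} MB available")
assert storage_mb < SD_CARD_MB  # ~10.6 GB, well under the card's capacity
```

Even under this pessimistic coverage assumption, the imagery occupies under a tenth of the supported card.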
*False positive rate:* We do have a discussion in Section VI.A about possible solutions to this issue: 1) analyze the results as a time sequence, 2) raise an exception to the operator when an attack is detected, and 3) combine the results with those from other sensors. We will add accuracy figures for the first solution in the revised version.
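As a sketch of solution 1), a sliding-window majority vote over per-frame detections can suppress isolated false positives while still catching sustained mismatches. The function name, window size, and vote threshold are illustrative choices, not values from the paper:

```python
from collections import deque

def smoothed_alarms(frame_flags, window=5, min_hits=3):
    """Raise an alarm only when at least `min_hits` of the last `window`
    per-frame detections are positive (illustrative parameters)."""
    recent = deque(maxlen=window)   # sliding window of recent detections
    alarms = []
    for flag in frame_flags:
        recent.append(flag)
        alarms.append(sum(recent) >= min_hits)
    return alarms

# A single spurious detection never reaches the vote threshold...
assert smoothed_alarms([0, 0, 1, 0, 0]) == [False] * 5
# ...but a sustained mismatch triggers the alarm.
assert smoothed_alarms([0, 1, 1, 1, 1])[-2:] == [True, True]
```

The window length trades detection latency against false-positive suppression, which is why we plan to report accuracy for this variant separately.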
*Data augmentation:* We apply augmentation techniques only to aerial images. The reason is that aerial images can be affected by environmental conditions, whereas satellite images are curated by satellite image service providers and are usually stable and clear, so there is no need to augment them. We do have an 8x blowup in dataset size. Through augmentation, our model performs better under different lighting, seasons, weather, etc.
*Error tolerance:* When there is a slight mismatch (within the ~15 m error range of GPS), our system works fine. As the mismatch grows larger, DeepSIM is more likely to predict an attack, which is reasonable, because a large mismatch should be considered an attack. Table VIII does not agree with Table VII because they are different tests running on different data.
As for Table IV, we will correct it in the revised version.
**Reviewer E**
*More evaluation:* We apply preprocessing to unify the resolution (Section V.B.1). We have tested that higher resolution achieves slightly better results; however, the current resolution ($960\times 720$) strikes a good balance between accuracy and computation overhead. We collected photos from different areas covering 9 types of landscapes (mountains, forests, cities, etc.). For the flying height, we simulated different heights by cropping the images to change the size of the covered surface area, as you suggested (Section V.B.2).
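The height simulation amounts to a centre crop: a lower altitude covers a proportionally smaller ground footprint. The function name and scale factor below are illustrative, not the paper's exact procedure:

```python
import numpy as np

def simulate_height(img, scale):
    """Centre-crop an aerial image by `scale` (0 < scale <= 1) to mimic a
    photo taken at a proportionally lower altitude, i.e. a smaller ground
    footprint. (Illustrative sketch; names/values are assumptions.)"""
    h, w = img.shape[:2]
    ch, cw = int(h * scale), int(w * scale)
    top, left = (h - ch) // 2, (w - cw) // 2
    return img[top:top + ch, left:left + cw]

# Halving the footprint of a 960x720 frame:
crop = simulate_height(np.zeros((720, 960, 3)), 0.5)
assert crop.shape[:2] == (360, 480)
```

After cropping, the image would be resized back to the unified resolution so all test samples share the same input size.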
As for the dataset split issue, we agree that non-random splits can help us better understand the performance; we will add this to the revised version. We are also glad to incorporate the additional references.
Finally, we thank all reviewers for pointing out typos.