> You will find three (or more) separate reviews. Please enter a response to each review separately. Each of these responses is **limited to 2500 characters**. While each response should address the specific concerns of the corresponding reviewer, it is fine to cross-reference reviewers. The reviewers and the area chair will have access to the complete set of reviews and the rebuttal in the discussion period.
>
Reviewer #1
>
> The approach of using geolocation to improve object detection has been explored before (see references). More traditional graph representations have been used for modeling geometry (not GNNs), which would make this work different, but would severely limit its novelty.
>
> ~~Related work seems to be lacking in terms of covering the area of multi-camera 3d reconstruction. If detection, re-identification, and localization are done simultaneously, this problem can be boiled down to a conventional 3d point estimation instance, considering each object a point in 3d space.~~
>
> ~~Is the detection backbone ever trained? Based on my understanding, the graph representation and geometrical constraints are pruning/improving on top of a pre-trained backbone detector. In that case I wouldn't call it an end-to-end object detection instance.~~
>
> Please see weaknesses. Given that this work does not compare and contrast itself with other previous related works, in terms of approach and performance, it is not easy to evaluate its performance.
>
We are pleased that all reviewers find our approach of achieving detection, re-identification, and geo-localization in an end-to-end manner interesting and novel (R2). We are glad that R2 and R3 find using GNNs to achieve localization a good idea. R1 finds that the "framework is clearly explained and the visualizations make it an easy to follow read". We thank the reviewers for their valuable and constructive comments and address their individual questions separately.
**Contribution:** To the best of our knowledge, GeoGraph is the first end-to-end trainable, multi-view, multi-instance geolocalization method for moving sensors and static objects that can cope with only very coarse pose information, varying and very long baselines, and heterogeneous image properties (stitched panoramas, dash cams, mobile phone footage, different image sizes). We thus believe our approach offers significant technical novelty, which is further highlighted by the fact that we are unaware of any closely related work.
**Conventional methods:** We thank R1 for pointing to more conventional SfM methods and we are happy to add more references. However, as mentioned throughout the paper, the large perspective difference between views caused by the wide-baseline setup, as well as panorama stitching artefacts, make the application of such methods impossible. Indeed, the failure of conventional SfM methods is what motivated our research in the first place.
**Pretrained backbone and end-to-end:** We used a backbone pretrained on ImageNet as feature extractor because doing so usually reduces the hyper-parameter search significantly and yields better results, and has thus become a de facto standard procedure in computer vision. Note that our architecture is fully end-to-end trainable: all network parts support backpropagation, so the gradient of the final geolocalization loss flows back to the backbone during training (regardless of whether the backbone is pretrained or not) to optimize its parameters.
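For illustration, a minimal PyTorch sketch of this training setup (the ResNet-50 backbone, head, and loss are placeholders for exposition, not our actual architecture): ImageNet weights serve only as initialization, while the backbone parameters stay in the optimizer, so the final loss backpropagates into them.

```python
import torch
import torchvision

# Hedged sketch: ImageNet weights are only an initialization; the backbone
# parameters remain trainable, so the final (geo)localization loss
# backpropagates into them. Head and loss are illustrative stand-ins.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()          # expose the 2048-d features
head = torch.nn.Linear(2048, 3)            # stand-in for our task heads

optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-4
)

images = torch.randn(2, 3, 224, 224)       # dummy batch
target = torch.randn(2, 3)                 # dummy geolocation targets

loss = torch.nn.functional.mse_loss(head(backbone(images)), target)
loss.backward()                            # gradients reach the backbone
optimizer.step()
```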
**Additional Experimentation:** We have noted the reviewers' concerns regarding the experiments section. Owing to the character limit, we give a detailed response, plus an additional comparison method, in our response to R3.
Reviewer #2
> ~~Line 287: how are the features extracted from the feature space? Do you take the max probability? Specifically, I want to know how the feature map is converted to a graph, and can this operation be back-propagated?~~
>
> ~~Line 330: "It is worth noting that by relying on a graph formulation we are able to effortlessly deal with a varying number of distinct objects in the scenes." I am curious how this is achieved; as far as the GNN architecture is concerned, the training- and testing-time nodes need to be of the same dimension, but as far as I observe, your input graph structure varies widely with the number of detections. The [36] paper only proposes a higher-order neural network but does not elaborate on the number of nodes.~~
>
> ~~Are Fig. 3 and Fig. 4 output results from the proposed algorithm or just an illustration of the dataset? I would suggest adding more results of the proposed algorithm in Fig. 5 and Fig. 6 instead of these if they are not results.~~
>
> ~~Table 1 shows that the method outperforms [38], but no visualization of the geolocalization is illustrated. It would be great to see some results of the accuracy of localization to support the numbers.~~
>
> ~~Table 2: Is Siamese the baseline for this approach? What happens if you use a fully connected network instead of Siamese? It would be great to see comparisons with more architectures to illustrate the advantage of using this method.~~
>
> I think more results either in supplementary or in the paper need to be shown.
>
> ~~The example results do not quite convey the complete picture of the pipeline and its advantages. Maybe better illustrative images would help to improve the manuscript.~~
>
> I liked the idea behind the manuscript and its bold attempt to learn this formulation in an end-to-end way. But the paper lacks good evaluations and results pushing me towards the current rating. It would be great if the authors address my concerns from above in the rebuttal.
We are pleased that all reviewers find our approach of achieving detection, re-identification, and geo-localization in an end-to-end manner interesting and novel (R2). We are glad that R2 and R3 find using GNNs to achieve localization a good idea. R1 finds that the "framework is clearly explained and the visualizations make it an easy to follow read". We thank the reviewers for their valuable and constructive comments and address their individual questions separately.
**Extracting features:** There are two modes (training and evaluation) in which we extract the features for our graph nodes. During training, we extract features from the cells of the feature map, using the ground truth to know which features belong to which object (illustrated in Fig. 2). During testing (as explained in line 363), the graph nodes are selected using the object detector's CNN: the cells of the feature map associated with a high-score object prediction are used to create the graph. Yes, the operation is backpropagated; it is easiest to regard the "Graph Network" simply as another head of the object detector, analogous to the regression and classification heads.
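As a hedged illustration (tensor shapes and the score threshold are assumptions for exposition, not our exact values), the test-time node extraction amounts to a differentiable gather over feature-map cells:

```python
import torch

# Illustrative sketch, not our exact code: each graph node is the feature
# vector of one feature-map cell. At training time the cell indices come
# from the ground truth; at test time (shown here) they are the cells
# whose detection confidence exceeds a threshold.
feat_map = torch.randn(256, 32, 32)     # C x H x W backbone features
obj_score = torch.rand(32, 32)          # per-cell detection confidence

ys, xs = torch.where(obj_score > 0.5)   # test-time node selection
node_feats = feat_map[:, ys, xs].t()    # (num_nodes, 256) graph nodes

# node_feats is differentiable w.r.t. feat_map, so the graph network's
# loss backpropagates through this gather into the detector backbone.
```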
**Graph Formulation Flexibility:** Our claim of being able to "effortlessly deal with a varying number of distinct objects in the scenes" means that we do not need a pre-determined or hardcoded number of detections. We use GraphConv [36] as our convolutional layer for the aggregation of graph nodes. As we are performing a "link prediction" task, and not a "graph classification" task, the number of nodes does not need to be the same at training and testing time, which is one of the key advantages of GNNs, as also presented in [3]. We will do our best to improve and clarify this part in the text.
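To make this concrete, a small sketch (the feature dimensions are assumed values, and the dot-product edge scorer is a stand-in for our actual link-prediction head): the GraphConv weights act only on feature dimensions, and edge scores are computed per node pair, so the same network handles graphs of any size.

```python
import torch
from torch_geometric.nn import GraphConv

conv = GraphConv(256, 128)  # assumed dimensions, for illustration only

def score_links(node_feats, edge_index):
    # Node embeddings via one GraphConv layer, then a pairwise score per
    # candidate edge (dot product as a stand-in link-prediction head).
    h = conv(node_feats, edge_index).relu()
    src, dst = edge_index
    return (h[src] * h[dst]).sum(dim=-1).sigmoid()

# The same network processes a 5-detection and a 50-detection graph:
for n in (5, 50):
    x = torch.randn(n, 256)                          # one node per detection
    edges = torch.combinations(torch.arange(n)).t()  # fully connected pairs
    print(score_links(x, edges).shape)               # n*(n-1)/2 edge scores
```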
**Suggested changes:** Thank you, good ideas. We will remove Fig. 3 & 4, freeing up space to add illustrations showing 1) predictions similar to Fig. 5 & 6 with the ground truth depicted as in Fig. 3 & 4, and 2) an aerial top view that visually compares the localization predictions of different methods. Finally, we can indeed add additional architectures, such as the suggested fully connected network, to showcase the advantage of using graphs for similarity.
**Additional Experimentation:** We have noted the reviewers' concerns regarding the experiments section. Owing to the character limit, we give a detailed response, plus an additional comparison method, in our response to R3.
Reviewer #3
> ~~Single-method comparison that looks like the authors' previous work [1]. No comparisons with methods that have code available, such as [2]. It is true, it's not an 'end-to-end' method, but that doesn't disqualify a comparison. No experiments involving multiple object tracking (MOT) instead of the graph tracking, and/or some other pose estimation method for localization.~~
>
> ~~Fairly basic ablation study - baseline, with graph, with graph & localization. A number of parameters that affect performance (such as image resolution/occlusions/sample step) are barely mentioned. For example, the sample step is only mentioned when the framework yields worse results from more views on the Mapillary dataset.~~
>
> Future work/experiments related:
> ~~Not sure about that fully-connected-to-sparse graph strategy; there is a good chance that a traffic sign close to the ground shouldn't match one on top of a pole, and I believe connecting them without checking the geometric plausibility of the link hurts training.~~
>
> ~~The bounding boxes are approximate and this is all you have - geolocalization would surely benefit from better-localized instances / 6-DoF position estimation - see the pipeline proposed in [1]. For example, street signs come in standard shapes and sizes across a wide range of countries - this information could be used towards computing a more accurate camera pose and re-localizing all instances from several views.~~
>
> ~~An iterative algorithm based on the re-localization idea would be interesting. I know that the robotcar dataset does not have traffic signs or instances, but you could try a pre-trained model and see how it fares after k laps of re-identified street signs.~~
> ~~Of course, almost-free-but-noisy information such as depth or optical flow could help the current pipeline.~~
>
> The efficiency claim is mentioned a couple of times, but no forward time is given, let alone compared with the previous method.
>
> Paper/writing related:
> 104 much more computationally efficient -- compared to?
> 276 - by using graph >> "a graph formalism/representation"
> 576 remove be
>
> The biggest flaw is the lack of comparisons IMHO. Test [1] and I'm willing to improve my rating.
>
> Furthermore, the paper looks like incremental work over [2] - the graph was added, reducing the localization error by ~10%-20%.
>
We are pleased that all reviewers find our approach of achieving detection, re-identification, and geo-localization in an end-to-end manner interesting and novel (R2). We are glad that R2 and R3 find using GNNs to achieve localization a good idea. R1 finds that the "framework is clearly explained and the visualizations make it an easy to follow read". We thank the reviewers for their valuable and constructive comments and address their individual questions separately.
**Suggestions:** We would like to thank R3 for the suggestions and interest in our work.
**Method design:** We agree with R3 that, in theory, a drawback of the *"fully-connected-to-sparse graph"* strategy is class imbalance, but we did not find this to cause major problems in our case.
Confusion of adjacent detections is particular to the Mapillary dataset and generally rare, since we use CNN features to distinguish between signs.
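As a toy sketch of the fully-connected-to-sparse strategy (the threshold and all names are illustrative assumptions, not our exact values): candidate links first connect all detection pairs, and the GNN's predicted link scores then prune the graph to a sparse set of matches.

```python
import torch

def sparsify(edge_index, link_scores, thresh=0.5):
    # Keep only the candidate links whose predicted score passes the
    # threshold; geometric plausibility is not checked, as R3 notes.
    return edge_index[:, link_scores > thresh]

n = 6                                                  # detections across views
edge_index = torch.combinations(torch.arange(n)).t()   # fully connected graph
link_scores = torch.rand(edge_index.size(1))           # stand-in GNN outputs
sparse_edges = sparsify(edge_index, link_scores)       # sparse match graph
```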
We are aware of [58], but it does not apply to our setting: [58] relies on video sequences separated by a few seconds in an autonomous driving scenario, whereas our setting has generally much larger gaps and often severe perspective changes.
R3's suggestion of using pretrained networks for pose & depth estimation is indeed a good idea. However, our ambition is to rely solely on crowdsourced, publicly available RGB imagery for object localization, which makes our method applicable to a wide range of scenes. Regarding runtime, a forward pass takes 0.21 s per image on a 1080Ti. We will include a comparison of the efficiency/performance of the different methods and parameters in the supp. material.
**Experiments:** The image resolution is provided in line 395 for Pasadena. The Mapillary dataset contains imagery of various sizes from different sensors; we will add an overview in the supp. material.
Multiple object tracking methods operate under different assumptions (fixed cameras, fixed baselines, mostly static background, moving objects, high frame rate) compared to our scenario (moving cameras, drastically changing baselines, moving background, static objects, very low frame rate). We agree that comparing performance to related methods would have been great, but there were simply none when we submitted the paper. Note that *[58] was published after the ECCV paper submission deadline*. Moreover, it only compares to one barely similar method [38] and a clearly outdated MRF-based method [23]. In general, we attribute the lack of related work to the **particular set-up** of the problem and the **lack of datasets with object instance labels across multiple views**. Although **asking for additional experiments is strongly discouraged in the ECCV reviewing guidelines**, we did run the multi-step, hierarchical approach of [23] on Pasadena: it confuses most adjacent objects during clustering, its parameters are very cumbersome to tune manually and must be set separately for each new scene (which is not practical), and it only achieves a geolocalization error of 3.827 m (higher than ours).
[58] Chaabane et al.