# Understanding the Impact of Image Quality and Distance of Objects to Object Detection Performance
*A reproduction by Vasil Dakov, Lauri Warsen, Lászlo Roovers.*
*Source Code: https://github.com/vdakov/frmdl-object-detection-distance-resolution*
All model weights are also uploaded, along with the configuration files.
## Introduction
How high does image quality need to be for object detection? What if you could reduce computational cost by 50%, without losing accuracy, simply by lowering the input image resolution? Current object detection models are typically trained on datasets with consistent image quality or compressed to fixed resolutions. While this simplifies training and benchmarking, it fails to reflect the variability of real-world use cases. In practice, images often vary in quality due to limitations in camera hardware (e.g. maximum resolution), and storage requirements (e.g. file size). Although higher-quality images generally lead to better detection performance, especially when paired with high-capacity models, this approach can be inefficient and impractical. In many cases, using high-resolution images is unnecessary and wasteful, particularly if similar performance can be achieved with optimized models and lower-quality input.
**If this impact is better understood (or even quantified), one can save on computation cost, data collection costs and more accurately know what resources to use for an object detection task.**
This leads us directly to the focus of this blog post: *To what extent does data quality¹ impact object detection, and can it be accounted for?* This is the question tackled in the paper *"Understanding the Impact of Image Quality and Distance of Objects to Object Detection Performance"* by Hao et al. from NYU [1]. **This blog post describes the approach taken there and an attempt by students at TU Delft to reproduce its results.**
> ¹ Data quality is defined in the next section.
## What are image quality and distance in object detection?
Intuitively, image quality depends on the **resolution** of the image. Resolution for images can be defined as the amount of detail that can be perceived/measured. In the context of images and computer vision it can be divided into two parts.
**Spatial resolution**: The number of distinct pixels per unit area in the image captured by the camera. Higher spatial resolution means more pixels and finer image features.
*Here is how an image decreasing in spatial resolution looks.*
| High Resolution | Medium Resolution | Low Resolution |
|-----------------|-------------------|----------------|
|  |  |  |
**Amplitude resolution**: The number of distinct intensity levels available per pixel (e.g. 256 for 8-bit images). Lower amplitude resolution leads to banding or visible quantization artifacts, while higher amplitude resolution allows for smoother gradients and more precise color or brightness representation. QP (Quantization Parameter) is a control variable used in video/image compression to reduce file size by lowering amplitude resolution, specifically in the BPG format [2]. A higher QP leads to more compression and visible loss of detail.
*Here is how an image decreasing in amplitude resolution looks.*
| Full Bit Depth | Mid Bit Depth | Low Bit Depth |
|----------------|---------------|---------------|
|  |  |  |
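To make the two axes concrete, below is a minimal Pillow sketch (our own illustration, not code from the paper; the filename is a placeholder) that degrades an image along each axis separately:

```python
from PIL import Image

def reduce_spatial(img: Image.Image, factor: float) -> Image.Image:
    """Downsample the image: fewer pixels, coarser detail."""
    return img.resize((int(img.width * factor), int(img.height * factor)), Image.BILINEAR)

def reduce_amplitude(img: Image.Image, bits: int) -> Image.Image:
    """Quantize each 8-bit channel to 2**bits intensity levels (causes banding at low bit depths)."""
    step = 2 ** (8 - bits)
    return img.point(lambda v: (v // step) * step)

img = Image.open("street_scene.jpg")       # placeholder filename
low_spatial = reduce_spatial(img, 0.25)    # quarter of the original width and height
low_amplitude = reduce_amplitude(img, 3)   # 8 intensity levels per channel
```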
This work is in the context of object detection, where **distance** is another relevant dimension whose effects have much in common with those of resolution. An object (*e.g. a person*) that is closer to the camera appears bigger and consequently has more pixels and finer details available to it. For an object detection model, one would therefore expect spatial and amplitude resolution to compensate, to a degree, for distance.
*Below is an illustration of an object decreasing in size and pixel count due to distance.*
| Small Distance | Medium Distance | Large Distance |
|----------------|---------------|---------------|
|  |  |  |
## How do current object detection models tackle different resolutions and distances?
Deep learning has been the gold standard for object detection for about a decade, with prominent models like the YOLO [3] family or R-CNN [4]. Despite architectural differences, most modern object detection models share core structural components: (1) **a multi-scale feature extractor**, (2) **multiple detectors that extract bounding boxes at these different scales**, and (3) **a non-maximum suppression step over the resulting boxes.**
![An example of a multi-scale extraction via YOLO [3]. Image source: Original Paper](https://hackmd.io/_uploads/HyA4MLjmxl.png)
**Figure:** An example of a multi-scale extraction via YOLO. *Image source: Original Paper*
Our interest is in (1) and (2). Multi-scale feature extractors leverage deeper layers with larger receptive fields, enabling the network to capture both coarse and fine-grained details. This improves detection across a range of object sizes.
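As a toy illustration of point (1) (our own sketch, not the actual YOLO architecture), the snippet below shows a backbone emitting feature maps at strides 8, 16 and 32, with one detection head conceptually attached per scale:

```python
import torch
import torch.nn as nn

class TinyMultiScaleBackbone(nn.Module):
    """Toy backbone emitting feature maps at strides 8, 16 and 32."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv2d(3, 32, kernel_size=3, stride=8, padding=1)    # stride 8: fine detail
        self.stage2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)   # stride 16
        self.stage3 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)  # stride 32: large receptive field

    def forward(self, x):
        p3 = self.stage1(x)
        p4 = self.stage2(p3)
        p5 = self.stage3(p4)
        return p3, p4, p5  # in a real detector, one detection head is attached per scale

p3, p4, p5 = TinyMultiScaleBackbone()(torch.randn(1, 3, 640, 640))
print(p3.shape, p4.shape, p5.shape)  # (1, 32, 80, 80), (1, 64, 40, 40), (1, 128, 20, 20)
```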
These models also share similarities in how they are trained and evaluated: metrics like **mAP** (Mean Average Precision) at different **IoU** (Intersection over Union) thresholds on benchmarks like ImageNet [6] or COCO [7], evaluated at a standardized resolution.
#### Why current object detection evaluation falls short
The limitation stems from how data is preprocessed and how models are typically evaluated. It is standard for object detectors to be trained at a predetermined number of scales, assuming the data is either at a fixed size (more common) or within a range where this is applicable. Having more scales will always result in better or equivalent predictions (as non-maximum suppression [9] will disregard worse ones), but this comes at the cost of unnecessary computation time.
Next, metrics. Mean Average Precision (mAP) is considered the gold-standard metric for object detection: it quantifies precision across all classes at different IoU thresholds and reflects the overall robustness of both classification and localization. It is based on the per-class Average Precision (AP) and defined as:
$$
\begin{aligned}
\text{AP}_i &= \int_0^1 p_i(r) \, dr \\
\text{mAP} &= \frac{1}{N} \sum_{i=1}^{N} \text{AP}_i
\end{aligned}
$$
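In code, the AP of one class can be computed by integrating the precision envelope over recall. Below is a minimal sketch following the common all-point interpolation (our own helper, not the exact COCO implementation):

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """All-point interpolated AP for one class; inputs must be sorted by increasing recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # monotone non-increasing precision envelope
    idx = np.where(r[1:] != r[:-1])[0]            # indices where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_curves) -> float:
    """per_class_curves: list of (recall, precision) array pairs, one per class."""
    return float(np.mean([average_precision(r, p) for r, p in per_class_curves]))

# toy single-class example
recall = np.array([0.1, 0.4, 0.8])
precision = np.array([1.0, 0.8, 0.6])
print(mean_average_precision([(recall, precision)]))  # ~0.58
```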
The issue with data standardization (on common datasets) and with mAP alone is that they do not consider **under which conditions** such accuracy can be achieved. For example, if we have high-resolution data², we may not actually need that much detail (and could save valuable storage space); or, if the data is resized to a fixed input size, we may be throwing detail away and losing accuracy.
**These limitations suggest that current models may either underutilize high-resolution data or expend unnecessary resources processing information that does not significantly improve performance.**
> ² Like the Eurocity dataset we discuss later.
## Methodology of Original Paper
The paper sets up a controlled experiment that evaluates the impact of distance, spatial resolution and amplitude resolution on detection performance. To address these factors, the authors also propose a resolution-adaptive object detection architecture and train it on data spanning a variety of resolutions and distances.
#### Proposed model: RA-YOLO (Resolution-Adaptive YOLO)
Hao et al. propose an object detector based on YOLOv5 [14] that addresses the problem outlined in the previous section and regulates how many detection heads are used based on the size of the input image. The proposed structure operates at a maximum of 5 scales and consists of **a backbone** (for initial feature extraction), **a neck** (which upsamples and fuses the backbone features) and **a head** (multiple convolutional layers that extract the final bounding boxes). In the case of the used YOLOv5, this means:
- A CSPDarkNet53 [10] backbone, which consists of a sequence of convolutional layers, bound by residual connections and so-called CSP bottlenecks, resulting in more computationally efficient feature extraction
- A PANet [11] neck, a variant of the Feature Pyramid Network that upsamples and concatenates the CSPDarkNet feature maps to the desired scales
- Up to 5 heads, each a series of convolutional layers operating on a set of anchor boxes, outputting offset image coordinates $(x_{center}, y_{center}, x_{offset}, y_{offset})$, a confidence score $s$ and a class $c$
The number of active scales is regulated by two additional hyperparameters, $H_4 \in \mathbb R$ and $H_5 \in \mathbb R$. If the input image height is denoted by $h$ (a minimal sketch of this rule follows the list),
- if $h \le H_4$, only three scales are used,
- if $H_4 < h \le H_5$, four scales are used, and
- if $h > H_5$, all five scales are used.
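A minimal sketch of this rule in code, with the thresholds from the paper as defaults:

```python
def num_active_scales(image_height: int, h4: int = 810, h5: int = 1620) -> int:
    """Return how many detection scales RA-YOLO would enable for an input of this height.

    The defaults are the thresholds reported in the paper; our reproduction lowers
    them to 500 and 1000 (see Methodology of Reproduction).
    """
    if image_height <= h4:
        return 3
    if image_height <= h5:
        return 4
    return 5

print(num_active_scales(640), num_active_scales(1024), num_active_scales(2048))  # 3 4 5
```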
An important note is that training only impacts the scales in the feature pyramid that are used for extraction, by **freezing the remaining layers.** Freezing entails that no backpropagated gradient is accumulated on them and their weights stay the same.
The idea is to take advantage of more scales when necessary, improving accuracy while possibly saving on inference time for smaller inputs.
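A minimal PyTorch sketch of what freezing means in practice; `model.scale_layers` is a hypothetical grouping of layers by scale for illustration only, not the actual Ultralytics module layout:

```python
import torch.nn as nn

def freeze_unused_scales(model: nn.Module, active_scales: int, max_scales: int = 5) -> None:
    """Disable gradients for the layers of inactive scales so the optimizer never updates them."""
    for scale_idx in range(max_scales):
        # `scale_layers` is a hypothetical attribute grouping layers per scale
        for param in model.scale_layers[scale_idx].parameters():
            param.requires_grad = scale_idx < active_scales
```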


**Figure:** The proposed RA-YOLO architecture (top) and its different scales utilization (bottom). *Image source:* Original Paper.
#### Dataset setup
To isolate the impact of both spatial and amplitude resolution, the authors took several existing datasets at different spatial resolutions: TJU (high resolution) [12], EuroCity (medium) [8] and BDD (low) [13]. The second of these, EuroCity, also has distance annotations in its labels, which is what prompted the authors to additionally downscale it by a factor of 720/1024 and create a second set, dubbed EuroCity 1.42x³. For amplitude resolution, they took both TJU and EuroCity (plus their downsampled versions) and converted them to the BPG format, which allows finer control over how strongly the images are quantized. They then encoded all of them with different quantization parameters (QP), which control how coarsely the pixel values are quantized, with $QP \in [0, 51]$: 0 corresponds to the highest quality, 51 to the lowest.
Finally, they took all of them and randomly sampled images to create a final, mixed training dataset representing different types of both spatial and amplitude resolution.
##### Table: Dataset Summary
| Dataset | Spatial Resolution | Amplitude Resolution |
|----------------------|------------------------|------------------------------------------|
| TJU Original | 2000P - 4000P | QP = [0–51] |
| TJU Down2 | 1000P - 2000P | QP = [0–51] |
| TJU Down4 | 500P - 1000P | QP = [0–51] |
| EuroCity Original | 1920 × 1024 | QP = [0–51] |
| EuroCity Down1.42 | 1350 × 720 | QP = [0–51] |
| BDD | 1280 × 720 | Original data are compressed at around 0.648 bits/pixel |
When discussing labels, the authors note that the labels across TJU, EuroCity and BDD are not defined consistently (the example given is that the "rider" category in TJU includes the vehicle, whereas in EuroCity it does not). This is why the study focuses only on the "pedestrian" and "rider" classes. Later, during evaluation, only the "pedestrian" class showed consistent performance.
> ³ Derived from 1024/720 ~ 1.42222...
#### Experiment
The authors' experimental setup comprises several steps:
- Create 3 additional baseline models (YOLO(3), YOLO(4), YOLO(5)) with fixed numbers of scales but the same structure as RA-YOLO.
- Train them all on the **mixed** dataset with various spatial and amplitude resolutions.
- Evaluate their inference speed and accuracy on the mixed dataset.
- Take RA-YOLO and evaluate it on subsets of TJU and EuroCity with different quantization parameters and spatial resolutions. Create a fine-tuned version of RA-YOLO and evaluate both again, this time in terms of precision versus image size (MB per image) and PSNR (Peak Signal-to-Noise Ratio; a minimal sketch of this metric follows the list).
- Take RA-YOLO and evaluate its performance on binned distances, on both EuroCity and EuroCity 1.42x.
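The PSNR referenced above compares a compressed image to its original; here is a minimal sketch of how it can be computed for 8-bit images (our own helper, not code from the paper):

```python
import numpy as np

def psnr(original: np.ndarray, compressed: np.ndarray, max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between an original image and its compressed version (in dB)."""
    mse = np.mean((original.astype(np.float64) - compressed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```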
## Implementation Challenges
Design choices and ambiguities in this paper make it challenging to reproduce. This section outlines the main ones faced throughout this study.
### Challenges with RA-YOLO
The study is ambiguous in its description of RA-YOLO and its structure. The authors state that it is based on YOLOv5, but the description does not correspond 1:1 with the actual YOLOv5 specification. Most importantly, it is not clear how the final part of the backbone, the SPPF (Spatial Pyramid Pooling - Fast) [5] layer, relates to the resolution adaptiveness of the network. The YOLOv5 backbone and neck are not fully symmetric: the SPPF is a sequence of pooling operations and convolutions intended to enhance processing at multiple scales, and it sits right in the middle between the two. Taken at face value, the RA-YOLO figure provided would skip it whenever the network operates at fewer scales. If that is what the authors intended, it would be fine; but as it stands, we as reproducers have to guess at what they meant, and since the SPPF has learnable weights, this is not a trivial detail. There are three options:
1. Take their claims at face value while sticking as close as possible to YOLOv5. This means that the SPPF is kept, but is only used when all scales are used.
2. Skip SPPF altogether. As it is not described explicitly, it might not be there at all.
3. Add an SPPF layer at every optionally used scale. While this adapts the structure, it does not seem to coincide with the "Tile" block, which they state copies the feature maps from the current scale in the backbone, and might therefore contradict their implementation.
Which option was picked, and why, is outlined in **Methodology of Reproduction**.
### Data Processing, Quantity and Compute
This is a reproduction study, with limited time and computing resources (GPUs for model training). The original paper, however, is very data-heavy. It lists image counts in the tens of thousands for EuroCity (the only dataset available to us), a number that grows quickly: sampling at $s$ different spatial and $a$ amplitude resolutions multiplies $n$ data points into $n \cdot s \cdot a$.
Next, the scale thresholds listed for RA-YOLO ($H_4=810$ and $H_5=1620$) are very high and further increase training time. The authors correctly note that this is necessary to move away from how other object detection studies train on fixed-size data (most commonly $640 \times 640$ for YOLO). For our purposes, however, it makes training even slower.
Finally, data acquisition and processing was another challenge for the reproduction. The paper claims in its introduction that the data used for model training would be released after publication. This has not happened, and our attempts to contact the original authors did not receive a response. As such, the data had to be acquired manually from EuroCity and processed both spatially and amplitudinally. This is slow: for example, producing the same 10 amplitudinal scales and spatial resolutions listed for a subset of 2000 EuroCity images takes $\sim 4$ hours.
Both the quantity of data and the scale of the images were infeasible for our purposes. How it was tackled is outlined in **Methodology of Reproduction**.
## Methodology of Reproduction
Given the challenges stemming from the original paper's description, this section outlines the design choices made in this reproduction. With those challenges in mind, **the goal is not to reproduce the same numerical results, but rather the same trends with regard to distance, amplitude and spatial resolution.**
### RA-YOLO Reimplementation
Due to the ambiguities in the original paper, the team decided to stick as close as possible to the original YOLOv5 implementation. In this case, this meant a direct extension of the Ultralytics repository containing the model. While this introduced other technical challenges, as the code is made to be run rather than extended, it guaranteed that any omitted details would be as close to the original as possible.
First, a larger version at 5 fixed scales using the default YOLOv5 structure was created. It was later modified to reflect RA-YOLO. Here is how:
- The team decided to go with one SPPF layer, used only when all 5 scales are active. It is the option with the fewest modifications and seems closest to their architecture. If implementing from scratch, we would recommend a version with multiple SPPFs, but that would stray from a faithful reproduction.
- For the forward pass, the model was modified to skip the unused scale layers based on the current input image height $H_{curr}$. During the backward pass, this is taken into account again and only the weights of the layers that were used are updated.
- To still capture the effects of resolution and distance while limiting computing time, the team downscaled $H_4$ and $H_5$ to $500$ and $1000$ respectively. These values are large enough to show a trend while keeping computation feasible.
### Data and Training
For our reproduction, we focus on the EuroCity dataset due to its rich annotation of both spatial resolution and object distance, which makes it suitable for studying the core hypothesis of the original paper.
We downloaded the EuroCity dataset and followed a preprocessing pipeline similar to the one described by Hao et al. The only difference is that it is done on **a subset of the original data, 2000 images.** It is as follows (a sketch of the pipeline is given after the list):
- **Resizing**: Following the paper, we resized each image to two additional spatial scales using the same scale factors: 720/1024 (~0.70) and 854/1920 (~0.44). This resulted in three versions of each image: the original size and two downscaled versions (for a 1920×1024 EuroCity image, roughly 1350×720 and 854×455).
- **Quantization**: We converted all images to the BPG (Better Portable Graphics) format and applied various QP (quantization parameter) values to create versions with different amplitude resolutions. This was done using the command-line BPG encoder [2] with QP values 16, 24, 34, 38 and 46, as in the paper.
- **Distance labels**: Since EuroCity contains distance annotations, we also grouped instances into buckets representing different object distances (close, medium, far) to later study how these interact with spatial and amplitude resolution.
> Note: We constrained training to the “pedestrian” class only, to keep evaluation consistent with the original paper.
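Below is a sketch of this pipeline under stated assumptions: the paths and the distance bucket edges are placeholders of our own choosing, and `bpgenc` is the reference BPG command-line encoder from [2] (exact flags may differ across versions).

```python
import subprocess
from pathlib import Path
from PIL import Image

SCALE_FACTORS = [1.0, 720 / 1024, 854 / 1920]   # original, ~0.70x, ~0.44x
QP_VALUES = [16, 24, 34, 38, 46]                # amplitude resolutions, as in the paper

def preprocess_image(src: Path, out_dir: Path) -> None:
    """Create every spatial/amplitude variant of one EuroCity image."""
    img = Image.open(src)
    for s in SCALE_FACTORS:
        resized = img.resize((int(img.width * s), int(img.height * s)), Image.BILINEAR)
        png_path = out_dir / f"{src.stem}_s{s:.2f}.png"
        resized.save(png_path)
        for qp in QP_VALUES:
            bpg_path = out_dir / f"{src.stem}_s{s:.2f}_qp{qp}.bpg"
            subprocess.run(["bpgenc", "-q", str(qp), "-o", str(bpg_path), str(png_path)], check=True)

def distance_bucket(distance_m: float) -> str:
    """Map an annotated object distance (meters) to a coarse bucket; the edges here are placeholders."""
    if distance_m < 20:
        return "close"
    if distance_m < 40:
        return "medium"
    return "far"
```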
The goal is to see whether the RA-YOLO model can adaptively perform better than standard YOLOv5 under the mixed dataset (varying spatial & amplitudinal resolution), especially when inference time or input quality is constrained.
The model was trained for 10 epochs with a batch size of 1. The training losses are displayed in the figure below. This batch size was necessary due to the scale-conditional complexity of the model: images of different spatial resolutions require a different number of layers, and gradients should only be propagated back to the layers that were enabled. We believe batching images of the same resolution could still have been possible and would have provided better training speed; however, this would have constituted a definite deviation from the paper. The low number of epochs is a result of the slow training caused by the batch size of 1, combined with time constraints.
<center>
<img src="https://hackmd.io/_uploads/H1FD4HXNle.png" width="600" />
**Figure:** Box loss and Object loss during training.
</center>
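As a side note on the same-resolution batching mentioned above (which we did not implement), a minimal sketch of grouping images by resolution, so that each batch activates the same set of scales, could look like this:

```python
from collections import defaultdict
from pathlib import Path
from PIL import Image

def group_by_resolution(image_paths):
    """Group image paths by (width, height) so each group activates the same set of scales."""
    buckets = defaultdict(list)
    for path in map(Path, image_paths):
        with Image.open(path) as img:
            buckets[img.size].append(path)   # img.size is (width, height)
    return buckets  # each bucket could be fed to the model as one fixed-resolution batch
```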
## Results
The original RA-YOLO paper reported the following performance in terms of mean Average Precision (mAP) at 0.5 IoU:
<center>
<img src="https://hackmd.io/_uploads/rkYPK-m4eg.png" width="400" />
**Figure:** mAP@50 plotted against Megabytes per Image (MB/Image) and against PSNR (Peak Signal-to-Noise Ratio) for RA-YOLO.
*Image source: Original Paper.*
</center>
Implementation bugs and issues with the Ultralytics extension did not allow us to reach results competitive with the original paper. As such, we had to lower our IoU threshold from $0.5$ to the considerably more lenient $0.3$. With this, our re-implementation gave the mAP values shown below:
<div style="display: flex; justify-content: center; align-items: center; gap: 20px;">
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/BkOUM87Exl.jpg" alt="mAP vs MB/Image" width="400">
<p><strong>Figure:</strong> mAP@30 plotted against Megabytes per Image (MB/Image) for our re-implementation.</p>
</div>
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/r1d8zLQEee.jpg" alt="mAP vs PSNR" width="400">
<p><strong>Figure:</strong> mAP@30 plotted against PSNR for our re-implementation.</p>
</div>
</div>
The original paper also evaluated detection performance as a function of object distance. Below is the reported recall curve for the *Pedestrian* class from the Eurocity dataset:
<div style="display: flex; justify-content: center; align-items: center; gap: 20px;">
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/B189iW7Vxe.png" alt="Recall vs Distance - Original" width="400">
<p><strong>Figure:</strong> Recall vs. Distance (in meters) for the Pedestrian class.<br><em>Image source:</em> Original Paper.</p>
</div>
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/Sy3uW8mVge.jpg" alt="Recall vs Distance - Reimplementation" width="400">
<p><strong>Figure:</strong> Recall vs. Distance (in meters) for the Pedestrian class for our re-implementation.</p>
</div>
</div>
#### Interpretation
The results are in line with the conclusions of the paper. **Higher QP values (i.e. stronger compression), while hindering performance, do not break it until past a certain threshold.** Our model seemed to show more resilience to these lower-quality images, but this can be attributed to the lower IoU threshold rather than anything else. The spatial resolution scaling factor of 0.7 does outperform 0.44, which is in line with the original paper. The fact that the unscaled spatial resolution performs worst can be attributed to implementation challenges the team faced throughout the reproduction.
We are also happy to report the same trend as hypothesized for distance: performance on closer objects was much higher than on objects further away.
#### Key Takeaways:
- Like the original, our recall drops significantly with increased object distance, especially in low-quality image variants.
- The resolution-adaptive structure helps maintain higher recall at close distances even when quality is reduced.
- There is still a performance gap between our model and the original paper's, likely due to the reduced dataset, IoU threshold adjustment, and architectural mismatches. It prevents us from fully reproducing all of the trends. They (most likely) train for longer, use more data and had larger computing resources. With that in mind, this is our current best.
## Discussion & Limitations
The paper presents an interesting exploration of image quality, object distance and their effect on detection performance, but there are several critiques to be made of the methodology.
As mentioned in **Implementation Challenges**, no source code or pretrained models were provided, which significantly limits reproducibility. Many of the techniques force readers to make educated guesses or assumptions when trying to replicate the results, which introduces noise and potential inconsistencies in any follow-up work.
Secondly, the paper is not rigorous in its methodology or explanations. Key components, such as how the image resizing scales were derived or the specific thresholds used for scale switching, are either briefly mentioned or implied without detailed justification. For example, the spatial scale parameters appear critical to RA-YOLO's behavior but are introduced with little elaboration or motivation.
There is a notable lack of citations to support technical decisions. Design choices, like the use of a scale-aware head or the specific architectural modifications to YOLO, are presented as given, without referencing prior work that might have inspired them. YOLO is not the only object detection network: why not Fast R-CNN, SSD, or others? This lack of grounding makes it harder to understand the context in which the method sits and raises questions about whether similar ideas have already been explored.
**In defense of the paper**, it’s also worth acknowledging that this paper explores an understudied direction (i.e. the explicit analysis of object detection performance under varying image quality and object distance). As the authors themselves point out, there is limited prior work that isolates these two factors in a controlled experimental setting. This lack of precedent makes their job harder and helps explain some of the ambiguity: there simply aren’t many standardized approaches or baselines to lean on.
## Conclusion
Reproducing RA-YOLO presented challenges, especially due to some suspected minor bugs (e.g. a padding mismatch) that forced us to lower the IoU threshold. Combined with limited documentation and a smaller dataset, this prevented a fully one-to-one replication.
However, our results still captured the core trends from the original paper: detection performance degrades with lower image quality and greater object distance, but RA-YOLO’s design improves robustness under these conditions.
This presents an argument for paying more attention to the data used in object detection. One can save on training time, data collection, or the resources needed for a task, or know what to prioritize: object proximity, data storage, and so on. Either way, it is a starting point for more research on the topic.
While not perfect, our reproduction provides a useful, publicly accessible foundation for future work to build on and refine this promising approach to more resilient object detection.
## References
[1] Yu Hao et al. - "Understanding the Impact of Image Quality and Distance of Objects to Object Detection Performance" - https://arxiv.org/abs/2209.08237
[2] Fabrice Bellard - BPG Encoder/Decoder - https://bellard.org/bpg/
[3] Joseph Redmon et al. - "You Only Look Once: Unified, Real-Time Object Detection" - https://arxiv.org/abs/1506.02640
[4] Ross Girshick et al. - "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation" - https://arxiv.org/abs/1311.2524
[5] Kaiming He et al. - "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition" - https://arxiv.org/abs/1406.4729
[6] ImageNet Benchmark - https://image-net.org/
[7] COCO Dataset - https://cocodataset.org/#home
[8] EuroCity Persons Dataset - https://eurocity-dataset.tudelft.nl/eval/overview/statistics
[9] Non-Maximum Suppression - https://paperswithcode.com/method/non-maximum-suppression
[10] Alexey Bochkovskiy et al. - "YOLOv4: Optimal Speed and Accuracy of Object Detection" - https://arxiv.org/abs/2004.10934v1
[11] Shu Liu et al. - "Path Aggregation Network for Instance Segmentation" - https://arxiv.org/abs/1803.01534
[12] TJU-DHD Dataset - https://paperswithcode.com/dataset/tju-dhd
[13] BDD Dataset - http://bdd-data.berkeley.edu/
[14] YOLOv5 (Ultralytics documentation) - https://docs.ultralytics.com/models/yolov5/#usage-examples
*Generative AI has not been used in the making of this project, with the exception of general questions on OpenAI's ChatGPT as well as occasional code completion aided by GitHub Copilot.*