# Reproducing and Improving TrackNetV4: Tennis Ball Tracking with Deep Learning

*Enhancing the Model with Attention Mechanisms*

## Introduction

Tracking a tiny, fast-moving tennis ball in video footage is a challenging task for computer vision systems. The ball's small size, rapid motion blur, and frequent occlusions make this an interesting problem to tackle. In 2019, the first iteration of TrackNet [1], a deep learning model for tracking tiny high-speed objects, achieved remarkable accuracy in localizing tennis balls across sequential frames. Since then, successive iterations have refined the design, culminating in TrackNetV4 (2024) [3], which improves upon the original TrackNet with an architecture that exploits motion information more effectively.

In this project, we reproduced TrackNetV4 from scratch to verify its claims and explore potential improvements. Beyond replication, we experimented with attention mechanisms and different ways of leveraging motion information to enhance the model's ability to locate the ball. This blog details our methodology, results, and key insights.

## Background: TrackNetV4 Overview

### TrackNet: Evolution

In this section we cover the evolution of TrackNet and the improvements that came with each iteration. The original TrackNet used a heatmap-based approach, with a VGG-16-style encoder to extract spatial features and a deconvolutional decoder to predict ball positions at the pixel level. It processed three consecutive frames as input, enabling it to leverage temporal information implicitly. TrackNetV2 introduced several optimizations, including multi-input, multi-output (MIMO) processing, which allowed the network to predict the ball's position across multiple frames simultaneously, improving efficiency.
Skip connections were also added to better preserve small-object features, and a weighted cross-entropy loss was introduced to improve training under high class imbalance. However, TrackNetV2 still relied on static visual features rather than explicit motion modeling. TrackNetV3 further refined tracking by incorporating background subtraction to highlight moving objects and a trajectory rectification module that interpolated missing ball positions. This allowed better handling of occlusions and improved trajectory completeness.

### TrackNetV4: Motion-Aware Fusion for Improved Tracking

TrackNetV4 takes inspiration from TrackNetV3 by integrating motion attention maps, which explicitly highlight moving objects while suppressing irrelevant background noise. These maps are generated using frame differencing, capturing changes between consecutive frames. A motion prompt layer learns to refine these differences through a power transform, ensuring the network focuses on relevant motion patterns. TrackNetV4 retains the U-Net architecture from TrackNetV2 but introduces a motion-aware fusion mechanism, which integrates motion attention maps with high-level visual features via element-wise multiplication. This improves tracking accuracy, as the model learns to focus on the parts of the image that changed between frames (which the ball is likely part of). The architecture diagram can be seen below.

![TrackNetv4](https://hackmd.io/_uploads/B1-s8sWTkg.png)

## Methodology

### Dataset

The dataset used in this study is the Tennis Tracking dataset, introduced alongside the original TrackNet framework in [1], which serves as a foundational resource for ball tracking in tennis videos. This dataset comprises annotated frames extracted from broadcast footage of the men's singles final at the 2017 Summer Universiade.
The dataset includes 20,844 game-related frames, each labeled with attributes such as frame name, visibility class, ball coordinates (X, Y), and trajectory pattern. Additionally, to enhance model generalization and prevent overfitting, nine different tennis court settings (e.g., clay, grass, hard courts) were recorded, contributing an extra 16,118 frames labeled in the same manner. This results in a collection of 36,962 labeled frames. The dataset is especially suitable for training and evaluating TrackNetV2 and TrackNetV4, which use sequences of consecutive frames to capture the temporal information crucial for tracking the tennis ball's trajectory.

### Network Architecture

As stated before, we train and evaluate two network architectures, TrackNetV2 and TrackNetV4. TrackNetV4 uses TrackNetV2 as a backbone and adds a motion attention layer on top. Since the motion attention layer consists of algebraic operations, both architectures have the same number of learnable parameters: 11.4 million. The architecture is displayed below.

![image](https://hackmd.io/_uploads/S134lOzCkl.png)

This final motion attention layer is what gives TrackNetV4 its higher accuracy compared to its predecessor. The layer is designed to emphasize regions of the input where motion is most significant, typically where the ball is moving across frames. It operates by first converting the input to grayscale to reduce complexity, then computing frame-to-frame differences to capture motion dynamics. These differences highlight areas where pixel intensities change, which often correspond to object movement. The output is passed through a parameterized sigmoid function that learns to selectively weight these motion regions. This learned attention map focuses the model on likely ball locations and suppresses static background noise.
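To make the computation concrete, here is a minimal NumPy sketch of the motion attention pipeline just described: grayscale conversion, frame differencing, and a parameterized sigmoid. The `slope` and `shift` parameters are illustrative stand-ins for the layer's learnable parameters (the paper's exact parameterization, including the power transform, may differ).

```python
import numpy as np

def motion_attention_maps(frames, slope=5.0, shift=0.1):
    """Grayscale conversion -> frame differencing -> parameterized sigmoid.

    frames: float array of shape (T, H, W, 3) with values in [0, 1].
    Returns (T-1, H, W) attention maps with values in (0, 1).
    slope/shift stand in for the learnable parameters of the real layer.
    """
    gray = frames @ np.array([0.299, 0.587, 0.114])   # (T, H, W) luminance
    diffs = np.abs(gray[1:] - gray[:-1])              # inter-frame changes
    # The sigmoid rescales raw differences into soft attention weights.
    return 1.0 / (1.0 + np.exp(-slope * (diffs - shift)))
```

In the network, each resulting map is then multiplied element-wise with the corresponding high-level feature maps, so static regions are damped and moving regions are emphasized.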
By explicitly guiding the network toward dynamic areas, the attention layer improves the network's ability to localize the ball, particularly in challenging conditions like occlusion, blur, or cluttered scenes.

### Training

For both network architectures, we follow the same hyperparameters and settings as the initial implementation of [2]. The models were trained for 30 epochs using the Adadelta optimizer with a starting learning rate of 0.99, a weighted binary cross-entropy loss, and an exponential learning-rate scheduler that decayed the rate by a factor of 0.9 after each epoch.

## Extensions: Adding Attention Mechanisms

### Motivation

We considered adding attention gates to both models, hoping to increase overall accuracy and performance. For our implementation, we followed the integration of attention layers into a U-Net architecture as described in [4]. We hypothesized that the extra attention layers would allow the network to focus on the most relevant spatial features while suppressing irrelevant background noise at the skip connections.

### Approach

We added a total of three attention gates, placed right before the skip connections, and tried two different implementations of the attention layers. The first (later referred to as attention v1) uses the lower-level features as a gating signal for the skip connections, as described in [4]. The second (attention v2) integrates both channel attention and spatial attention for enhanced feature selection. The channel attention module computes the importance of each channel using both average and max pooling, followed by a fully connected network that generates attention weights. The spatial attention module focuses on spatial features by concatenating the average- and max-pooled outputs across the channel dimension, then applying a convolutional layer.
The final module combines both attention mechanisms, refining the input feature map by scaling important channels and emphasizing relevant spatial regions.

## Results

To evaluate the effectiveness of both the reproduced TrackNetV4 model and our proposed extensions, we conducted quantitative and qualitative comparisons across several model variants. The results are summarized in Table 1.

![image](https://hackmd.io/_uploads/BJjoeD6T1g.png)

We found that the reproduced TrackNetV4 performed better than TrackNetV2, in line with the TrackNetV4 paper. However, the absolute accuracy values we obtained differ from the reported ones. For TrackNetV2, we observe an accuracy of 66.5% compared to the reported 85.2%. When we tested the publicly available weights for TrackNetV2, we got a similarly low accuracy of 68.2%, indicating that the discrepancy likely lies in the evaluation code. As the network outputs a heatmap, the coordinates of the blob must be localized, and a prediction is deemed correct if it falls within some tolerance of the ground-truth position. We used a tolerance of 3 pixels; the authors likely used a different, undisclosed value.

When we implemented the motion attention mechanism introduced in TrackNetV4, specifically the element-wise multiplication of feature maps with frame-differencing maps, the accuracy of TrackNetV2 improved from 66.49% to 73.40%. Interestingly, the largest gain was seen with our first version of TrackNetV2 + attention (not to be confused with motion attention), which achieved the highest accuracy (77.17%), outperforming even TrackNetV4. This suggests that attention mechanisms can significantly enhance the U-Net architecture by allowing the model to focus on salient ball features from the encoder before fusing them with decoder features.
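For clarity, the tolerance-based evaluation criterion mentioned above can be sketched as follows. This is a simplified stand-in: we localize the predicted blob via the heatmap argmax (the full pipeline detects the blob centre more carefully), and the tolerance of 3 pixels is the value from our setup.

```python
import numpy as np

def is_correct(heatmap, gt_x, gt_y, tol=3.0):
    """Accept a prediction if the heatmap peak lies within `tol` pixels
    of the ground-truth ball position (argmax used as a simple stand-in
    for blob localization)."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(np.hypot(x - gt_x, y - gt_y)) <= tol
```

Because accuracy is computed from this check, even a small change in `tol` shifts the reported numbers, which is our leading explanation for the gap between our 66.5% and the paper's 85.2%.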
Furthermore, TrackNetV4 + attention v1 showed a decrease in performance compared to TrackNetV2 + attention v1. This is surprising, as the motion map in TrackNetV4 was shown to improve performance. A possible explanation is that the motion attention map pushes the model to predict a ball wherever there is a difference between consecutive frames, which is not always where the ball is, and this may sometimes hurt performance. More tests would be needed to evaluate this. Notably, our second implementation of attention does not show this pattern: attention v2 goes from an accuracy of 72.76% with TrackNetV2 to 76.38% with TrackNetV4.

To further investigate the effect of adding the attention layer to the skip connections, we visualize the attention map for the highest layer (shown below). This reveals something interesting: the attention layer learns to pay minimal attention to the players, meaning the network does not use them to predict the location of the ball. Additionally, the attention layer appears to mask out parts of the image it deems irrelevant for tracking the ball, like the scoreboard and courtside equipment.

![image](https://hackmd.io/_uploads/SygxKITa1x.png)

To better understand how motion information is used by the model, we visualized the motion attention map, which highlights regions of motion via frame differencing. This map (shown below) is multiplied element-wise with the high-level features, helping the network suppress static background and emphasize moving objects.

![image](https://hackmd.io/_uploads/BkcPYm6Tke.png)

Additionally, we visualized the output heatmap of the network for a single frame. As seen in the image below, the model learns to generate high-probability regions around the ball's location, even in frames with heavy blur or partial occlusion.
![image](https://hackmd.io/_uploads/Hy-8YQaTke.png)

For qualitative evaluation, we produced a video with the tracked tennis ball overlaid on the original footage. As can be seen in the video linked below, the tracking is generally smooth and robust. However, there are still occasional misses or slight temporal lag, particularly when the ball is completely occluded or passes in front of a player.

{%youtube RafdmXwVfBM %}

## Discussion & Future Work

### Discussion

#### Main Findings and Reflections

In our experiments, we reproduced TrackNetV4, verifying its improvement over TrackNetV2. We did come across a discrepancy between the reported accuracy in the original paper and our observed accuracy (66.5% reproduced vs. 85.2% reported); however, this is likely due to differences in evaluation tolerance or other localisation criteria, so we focused on the relative improvement. Our investigation of augmenting the skip connections with attention showed significant improvements over TrackNetV2, increasing accuracy to 77.17% and 72.76% for attention v1 and v2 respectively. Additionally, attention v2 significantly improved recall compared to the other configurations, suggesting this architecture is particularly effective at detecting the ball consistently across frames.

#### Effectiveness of Incorporating Attention

Attention v1 with TrackNetV2 (no motion attention) achieved the highest accuracy and a strong precision (92.88%), significantly better than the baseline TrackNetV4 (73.4% accuracy). Attention v2 improved recall significantly (85.72%) but reduced precision (82.06%), highlighting the precision-recall tradeoff in this context. The most complicated model, TrackNetV4 with attention v2, showed a modest improvement to 76.38% but failed to outperform TrackNetV2 + attention v1 in accuracy.
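As a reference point for the attention v1 results, here is a minimal forward-pass sketch of the additive attention gate from [4], where the features on one side of the skip connection gate the features crossing it. The 1x1 convolutions are modeled as per-pixel channel mixing with random stand-in weights (a real layer learns them), and resampling of the gating signal is omitted for simplicity.

```python
import numpy as np

def attention_gate(x, g, inter_ch=8, seed=0):
    """Additive attention gate in the style of Attention U-Net [4].

    x: skip-connection features, shape (Cx, H, W)
    g: gating-signal features, shape (Cg, H, W)
    Returns gated features with the same shape as x.
    """
    rng = np.random.default_rng(seed)
    Wx = rng.standard_normal((inter_ch, x.shape[0])) * 0.1
    Wg = rng.standard_normal((inter_ch, g.shape[0])) * 0.1
    Wp = rng.standard_normal((1, inter_ch)) * 0.1
    # Additive attention: psi = sigmoid(Wp . relu(Wx*x + Wg*g)).
    a = np.maximum(np.einsum('ic,chw->ihw', Wx, x) +
                   np.einsum('ic,chw->ihw', Wg, g), 0.0)
    psi = 1.0 / (1.0 + np.exp(-np.einsum('oi,ihw->ohw', Wp, a)))  # (1, H, W)
    return x * psi  # suppresses skip features where psi is small
```

Since `psi` lies in (0, 1) at every pixel, the gate can only down-weight skip features, which matches the visualisations below where players and scoreboards are suppressed.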
#### Hypotheses for Results

The first interesting behaviour we observe is that attention v1 (accuracy 77.17%) significantly outperforms TrackNetV4 (accuracy 73.4%). One hypothesis is that attention improves the integration of the skip connections in the U-Net architecture over simple concatenation. This gives the model much more informative guidance on spatial information, allowing it to locate the ball more accurately, which would explain the improved recall and accuracy over TrackNetV4.

Another result worth exploring is that attention v2 achieves significantly better recall but fails to significantly improve accuracy. Spatial and channel attention heavily focus on features that closely resemble ball movement. This might make the model overly sensitive to ambiguous motion cues, such as similarly coloured objects or a player interfering with ball-movement signals, which would explain the lower precision, since it results in many more false positives.

Finally, we observe that attention mechanisms yield significantly smaller gains with TrackNetV4. This can potentially be explained by the fact that V4 already incorporates some motion information: the frame-differencing layers focus the model on regions of the image that contain motion but do not differentiate between semantic regions, leaving room for the v2 attention to focus on those regions. This provides some improvement, but only a marginal one compared to TrackNetV2, which has no motion information whatsoever.

#### Insights from Visualisations

Visualisations of spatial and channel attention (attention v2) show the network's ability to focus only on relevant regions, ignoring background noise from players, scoreboards, courtside equipment, and other features that are irrelevant for ball tracking.
The attention gates learned to focus only on regions that are potentially useful for ball tracking, such as open court areas and recent ball paths. Importantly, they learned to deprioritise movement information from the players, which occurs frequently but is not useful for ball tracking. The implication is that attention v2 enhances the model's ability to isolate ball-relevant features, making it more robust to interference from noise such as blur or partial occlusion. In stark contrast, the motion attention maps in TrackNetV4 highlight all regions of motion equally, making them far less robust to noise.

### Future Work

A main weakness of the approaches investigated in this experiment was their poor ability to balance precision and recall. It is worth investigating methods to address this, perhaps using ensemble methods or adaptive gating.

Another possible improvement is incorporating a temporal component. Attention has been shown to be particularly effective at modelling sequences, and it may prove valuable for modelling the frame-to-frame dependencies present here. This could also improve robustness to occlusions, rapid motion, or camera instability.

It is also important to test this architecture on similar but different domains, such as badminton or squash. This has already been done by the TrackNet team, but it would be useful to investigate our attention improvements in these domains as well.

Finally, to address the discrepancy between our accuracy results and those of the original paper, future work should investigate and clarify evaluation criteria, particularly the localisation tolerance, to enable consistent comparison across studies.

## Conclusion

In this project, we successfully reproduced TrackNetV4 and explored how attention mechanisms could improve the model's tennis ball localization.
Our results showed the same trend outlined in the paper: TrackNetV4 outperformed its predecessor, TrackNetV2, through the addition of motion attention maps computed from the difference between consecutive frames. Furthermore, our experiments with adding attention mechanisms to the U-Net architecture's skip connections showed that this change significantly improved the model's results, outperforming our results for TrackNetV4.

## References

[1] Yu-Chuan Huang, I-No Liao, Ching-Hsuan Chen, Tsi-Ui Ik, & Wen-Chih Peng (2019). TrackNet: A Deep Learning Network for Tracking High-speed and Tiny Objects in Sports Applications. CoRR, abs/1907.03698.

[2] Sun, N.E., Lin, Y.C., Chuang, S.P., Hsu, T.H., Yu, D.R., Chung, H.Y., & İk, T.U. (2020). TrackNetV2: Efficient Shuttlecock Tracking Network. In 2020 International Conference on Pervasive Artificial Intelligence (ICPAI) (pp. 86-91).

[3] Arjun Raj, Lei Wang, & Tom Gedeon (2024). TrackNetV4: Enhancing Fast Sports Object Tracking with Motion Attention Maps.

[4] Ozan Oktay, et al. (2018). Attention U-Net: Learning Where to Look for the Pancreas.