Justin Luu | Arjun Vilakathara | Remi Lejeune
Computer vision, especially image segmentation and object detection, is a rapidly evolving field with potential applications across various industries. In recent years, significant improvement has occurred in the development of algorithms and training methodologies, drawing from existing techniques to enhance both the accuracy and efficiency of image segmentation tasks.
As a team interested in CV (Computer Vision), we've been intrigued by the recent surge in AI (Artificial Intelligence) applications and advancements. Our curiosity led us to explore the intersection of AI and gaming, specifically how AI technologies could affect players' experiences in tactical FPS (First-Person Shooters) like Valorant.
The primary goal of aiming in games like Valorant is precision and speed—key factors that can significantly impact gameplay. Traditionally, aimbots have relied on extracting data directly from the game or server to pinpoint the locations of opposing players. These methods are effective for cheating, but our interest here isn't to optimize cheating; rather, we want to see whether something similar can be built using deep learning computer vision techniques.
In our exploration, we will assess how these computer vision techniques can be used to develop an aimbot. The focus will be on evaluating the speed of processing, the quality of image segmentation, and its effectiveness within the game. We aim to develop an aimbot that mimics playing the game like a human would. The first step is to delineate between player models and the background in the game's video feed.
In this blog, we describe how we trained different models to perform object detection in the FPS game Valorant. We will provide an analysis of the results and determine which model is most suitable for agent detection in Valorant.
Multiple computer vision models are analysed in this blog, each with its own specificity. First, there is YOLO (You Only Look Once), a state-of-the-art object detection model. Then, there is FastSAM (Fast Segment Anything Model), an image segmentation model based on SAM (Segment Anything) that has been modified to run up to 50x faster. Finally, there is RT-DETR (Real Time Detection Transformer), which is an object detection model like YOLO but uses a transformer encoder.
Figure 1: YOLOv9 Architecture
YOLO [1] is a popular object detection model known for its speed and accuracy. Unlike traditional object detection systems that apply the model to an image at multiple locations and scales, YOLO applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. YOLO is capable of detecting multiple objects in real-time, making it highly efficient for tasks that require quick processing, such as autonomous driving.
But despite its strengths, YOLO has some limitations. The model may struggle with detecting small objects in close proximity [2].
The YOLO model used for this research is YOLOv9c. At the time of this research, it is the latest available YOLO model.
Figure 2: FastSAM Architecture
FastSAM [3] is an advanced image segmentation model designed for rapid and accurate segmentation tasks. It builds upon SAM [4] (Segment Anything) and has been optimized to run up to 50x faster. It excels in identifying objects within an image, producing precise segmentations that can be used for various applications, such as medical imaging, autonomous driving, and video analysis.
Figure 3: RTDETR Architecture
RT-DETR (Real-Time Detection Transformer) [5] is an advanced object detection model that utilizes a transformer encoder architecture, as opposed to traditional convolutional neural network (CNN)-based models like YOLO. Its architecture involves dividing an image into a grid of patches, which are then processed through a series of transformer encoder layers. These layers enable the model to learn relationships between different parts of the image, allowing it to accurately detect and classify objects.
In this way, RT-DETR addresses some of the limitations of grid-based models like YOLO by providing more precise localization and classification of objects.
Datasets of annotated Valorant images are readily available on the internet. The first dataset we tried is the valorant-object-detection2 dataset, composed of 1416 training images. However, models trained on this dataset performed poorly. After looking more closely at the data, we found that some of it was poorly labeled for the use case of this project: we need annotations where the whole body is covered by a bounding box, not just parts of the chest area.
Figure 4: Example of a 'bad' annotation
This meant that another dataset had to be chosen; the newly selected dataset is the Santyasa Image Dataset, with 2975 training images. A review of this dataset inspired confidence in its annotations, as no obviously faulty ones, like those in the previous dataset, were identified.
The data is divided into three parts: a training set, a validation set, and a testing set. The training process required the dataset to be in YOLO format, which involved a folder of images and a folder of text documents containing the annotations associated with the images.
The labels and annotation in a text document for segmentations would be as follows:
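A hypothetical entry is shown below (values are illustrative; in the standard YOLO polygon format, the class index is followed by normalized x/y coordinates of the polygon vertices):

```
0 0.412 0.305 0.498 0.312 0.505 0.641 0.407 0.652
```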
The labels and annotation in a text document for the bounding boxes would be as follows:
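Again a hypothetical entry (illustrative values; in the standard YOLO box format of `class x_center y_center width height`, with all values normalized to the image size):

```
0 0.455 0.478 0.093 0.347
1 0.455 0.318 0.031 0.042
```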
In both cases, the first number is the class of the object. For segmentations, the remaining numbers are the normalized coordinates of the points that make up the shape of the polygon; for bounding boxes, they are the normalized centre coordinates, width, and height of the box.
In the dataset there are two classes: 0 and 1, where 0 is the body and 1 is the head of the agent.
Here are a few example images from the dataset:
Figure 5: Example images from Santyasa dataset
Then, we created a dataset called Edge Case Dataset, which was made to replicate situations considered as edge cases in the game. For example, where only the head is seen, or only part of the body is seen. The Santyasa Image Dataset does contain a few of these images, but the Edge Case Dataset was made especially for evaluation purposes only since it uses very difficult images.
Here are a few example images from the dataset:
Figure 6: Example images from Edge Case dataset
Finally, to showcase the performance of the models in real time, a video clip was recorded. The video runs at 30 FPS (Frames Per Second) and is used to show the results of running the detection models in real time and to calculate the inference time.
Video 1: Video clip to test real time performance of models on
Training is done using the following equipment, provided by Kaggle: Kaggle GPU P100, 16GB.
Within these specs, these are the parameters used for training:
- FastSAM-s: epochs=100, batch=16, imgsz=640
- YOLOv9c: epochs=100, batch=32, imgsz=640
- RTDETR: epochs=100, batch=32, imgsz=640
The experiments are performed using an RTX 3060 Ti (8GB).
As mentioned, the project utilized YOLO, FastSAM, and RTDETR. This section provides a detailed breakdown of the training process, specifically focusing on the YOLO model for illustrative purposes.
Wandb (Weights & Biases): Used for experiment tracking and visualizing model performance. Secrets management provided by kaggle_secrets was used to securely access the wandb API key.
Ultralytics: The project leverages the Ultralytics suite for training, validating, and testing the YOLO, FastSAM, and RTDETR models. These models are part of the Ultralytics ecosystem, which offers straightforward methods for training and testing.
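As an illustration of the wandb setup mentioned above, fetching the API key from Kaggle's secret store and logging in might look like the sketch below (the secret label `wandb_api_key` is an assumption; any label configured in the Kaggle notebook would work):

```python
import wandb
from kaggle_secrets import UserSecretsClient

# Retrieve the API key stored as a Kaggle secret (label is hypothetical)
user_secrets = UserSecretsClient()
wandb_key = user_secrets.get_secret("wandb_api_key")

# Authenticate wandb so training runs are tracked and visualized
wandb.login(key=wandb_key)
```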
Initially, the project aimed to train models from scratch. However, it was soon realized that this approach required extensive computational resources and time. To address this issue, transfer learning was employed, a technique that significantly accelerates the training process by fine-tuning models pre-trained on a large, generic dataset for a more specific task. Pre-trained versions of all models were fine-tuned with the chosen dataset (Santyasa Image Dataset).
Models were loaded as follows in the code. For YOLO, the model was initialized with pre-trained weights ("YOLOv9c.pt"), and similar steps were taken for the other models.
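A minimal sketch of how the pre-trained models can be loaded through Ultralytics (the exact weight file names are assumptions based on the standard Ultralytics checkpoints):

```python
from ultralytics import YOLO, RTDETR

# Pre-trained checkpoints; fine-tuning starts from these weights
yolo_model = YOLO("yolov9c.pt")        # YOLOv9c
rtdetr_model = RTDETR("rtdetr-l.pt")   # RT-DETR
fastsam_model = YOLO("FastSAM-s.pt")   # FastSAM weights loaded via the YOLO constructor, as discussed later
```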
The ultralytics training pipeline requires details about the dataset in a data.yaml file, which links to the dataset containing annotated images for object detection. This file outlines the dataset's structure, path, classes, and other configurations required for training.
For example:
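A sketch of what such a file might contain (paths are placeholders; the class names follow the two classes described earlier):

```yaml
path: /kaggle/input/santyasa-dataset   # dataset root (placeholder path)
train: images/train                    # training images, relative to path
val: images/val                        # validation images
test: images/test                      # test images

# Class indices and names
names:
  0: body
  1: head
```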
The models were trained for 100 epochs, with a batch size of 16 for FastSAM and 32 for the other two models. Image size was set to 640x640 pixels. It was also ensured that the model and its best iterations were saved for later evaluation and inference.
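Using the Ultralytics API, a training run with these settings might look roughly like this (argument values mirror the parameters above; the project name is hypothetical):

```python
from ultralytics import YOLO

model = YOLO("yolov9c.pt")

# Fine-tune on the Valorant dataset; checkpoints (including best.pt) are saved automatically
results = model.train(
    data="data.yaml",   # dataset description file shown above
    epochs=100,
    batch=32,           # 16 for FastSAM due to memory limits
    imgsz=640,
    project="valorant-detection",  # hypothetical project name
)
```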
The chosen parameters and the inconsistency in batch size selection were influenced by computational resource limitations. FastSAM was trained with a smaller batch size than YOLO and RT-DETR because it would otherwise use too much memory. The goal was to use as much of the GPU as possible, given the limited training time. As a result of this choice, FastSAM may learn more from each individual example, so it is expected to have an advantage in the metrics.
For the experiments, the YOLO, FastSAM, and RTDETR models underwent testing to assess their effectiveness on the test subset of the Santyasa Image Dataset and the Edge Case Dataset created specifically for this project. Additionally, the models were evaluated on selected videos to measure average computation time and detection accuracy.
The experiments began by deploying all three models on the test set of the Santyasa Image Dataset and the Edge Case Dataset. This approach was designed to compare the performance of the models in recognizing and detecting objects.
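For reference, evaluating a trained model on the test split with Ultralytics could be sketched as follows (the checkpoint path is an assumption; the reported quantities mirror those in the tables below):

```python
from ultralytics import YOLO

# Load the best checkpoint produced during training (path is hypothetical)
model = YOLO("runs/detect/train/weights/best.pt")

# Run evaluation on the test split defined in data.yaml
metrics = model.val(data="data.yaml", split="test")

print(metrics.box.map50)  # mAP at IoU 0.5
print(metrics.box.map)    # mAP averaged over IoU 0.5-0.95
print(metrics.box.mp)     # mean precision
print(metrics.box.mr)     # mean recall
```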
After offline testing, the models were applied to a series of pre-selected videos. The primary goal was to evaluate the models under more dynamic conditions, reflecting potential real-world applications. Each model's average computation time per frame and detection efficacy were recorded.
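A simple way to measure the per-frame computation time on a recorded clip is to time each forward pass manually, as in this sketch (the video file name and checkpoint path are placeholders):

```python
import time
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical checkpoint path
cap = cv2.VideoCapture("valorant_clip.mp4")        # placeholder file name

times = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    start = time.perf_counter()
    results = model(frame, verbose=False)            # run detection on a single frame
    times.append((time.perf_counter() - start) * 1000)  # milliseconds

cap.release()
print(f"mean: {np.mean(times):.1f} ms, std: {np.std(times):.1f} ms")
```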
Precision is the metric that indicates the accuracy of the predictions made by the model, specifically focusing on the proportion of positive identifications that were actually correct. It is calculated as the number of true positive detections divided by the total number of elements labeled as positives (true positives + false positives). It is particularly important when the cost of a false positive is high.
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
Recall measures the ability of the model to find all the relevant cases (true positives) within a dataset. It is the proportion of actual positives that were correctly identified, i.e. the fraction of actual objects that were detected by the model. It is a key metric when it is important to capture as many positives as possible.
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
The F1-score is the harmonic mean of precision and recall, providing a single score that balances both concerns in one number. It is particularly useful for our use case, since we need to compare multiple models.
$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
In object detection, AP (Average Precision) measures the accuracy of the model in detecting objects of a particular class by integrating over a precision-recall curve. mAP (mean Average Precision) is the average of AP scores across all classes or over different IoU (Intersection over Union) thresholds.
IoU is a measure used to determine the accuracy of a predicted bounding box. It calculates the ratio of the intersection area between the predicted bounding box and the ground truth bounding box to their union area. A higher IoU indicates a more accurate prediction.
mAP50 and mAP50-95 were used, where the numbers indicate the IoU threshold values: 0.5 for mAP50, and 0.5 to 0.95 in steps of 0.05 for mAP50-95.
$$\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$
$$\text{mAP50} = \frac{1}{N} \sum_{i=1}^{N} AP_i^{\,\text{IoU}=0.5}$$
$$\text{mAP50-95} = \frac{1}{10} \sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \text{mAP}_t$$
In this section, the results of the offline evaluation are presented. Class "0" is the body, class "1" is the head, and class "all" refers to the aggregated results over all images.
| Model | Class | Images | Instances | Precision | Recall | mAP50 | mAP50-95 | F1-score |
|---|---|---|---|---|---|---|---|---|
| YOLOv9c | all | 404 | 881 | 0.871 | 0.418 | 0.653 | 0.304 | 0.565 |
| | 0 | 399 | 461 | 0.891 | 0.675 | 0.795 | 0.417 | 0.569 |
| | 1 | 365 | 420 | 0.85 | 0.162 | 0.51 | 0.192 | 0.272 |
| RT-DETR | all | 404 | 881 | 0.769 | 0.602 | 0.639 | 0.265 | 0.675 |
| | 0 | 399 | 461 | 0.89 | 0.827 | 0.876 | 0.412 | 0.881 |
| | 1 | 365 | 420 | 0.648 | 0.377 | 0.403 | 0.118 | 0.477 |
| FastSAM | all | 404 | 622 | 0.9 | 0.664 | 0.791 | 0.491 | 0.764 |
| | 0 | 365 | 452 | 0.92 | 0.763 | 0.853 | 0.592 | 0.834 |
| | 1 | 154 | 170 | 0.881 | 0.565 | 0.73 | 0.389 | 0.688 |
Table 1: Performance metrics achieved by the models on Santyasa Image Dataset
As shown in Table 1, YOLO scores the worst of the three models on recall, which means it detects fewer true positives than FastSAM and RT-DETR. RT-DETR scores the worst on precision, which indicates that it produces more false positives than the other models. Overall, FastSAM yields the best scores on all metrics for the Santyasa Image Dataset and is therefore the best model on this dataset.
YOLOv9c results
Figure 7: Performance metrics achieved by YOLOv9c on the test set
In Figure 7, the F1-Confidence, Recall-Confidence and Precision-Confidence curves suggest that the best confidence threshold lies around 0.65.
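If this threshold were applied at inference time, it could be passed directly to the prediction call, for example (the checkpoint path and image file are placeholders; the value is taken from the curve reading above):

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical checkpoint path

# Keep only detections with confidence >= 0.65, the threshold suggested by Figure 7
results = model.predict("example_frame.png", conf=0.65)
```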
Figure 8: Predictions made by YOLOv9c
Figure 9: Ground truth of the predictions by YOLOv9c
RT-DETR results
Figure 10: Performance metrics achieved by RT-DETR on the test set
In Figure 10, the F1-Confidence, Recall-Confidence and Precision-Confidence curves suggest that the best confidence threshold lies around 0.7.
Figure 11: Predictions made by RT-DETR
Figure 12: Ground truth of the predictions by RT-DETR
FastSAM results
Figure 13: Performance metrics achieved by FastSAM on the test set
In Figure 13, the F1-Confidence, Recall-Confidence and Precision-Confidence curves suggest that the best confidence threshold lies between 0.7 and 0.8.
Figure 14: Predictions made by FastSAM
Figure 15: Ground truth of the predictions made by FastSAM
Edge cases evaluation
| Model | Class | Images | Instances | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|---|---|---|
| YOLOv9c | all | 12 | 19 | 0 | 0 | 0 | 0 |
| RT-DETR | all | 12 | 19 | 0.225 | 0.307 | 0.161 | 0.0497 |
| | 0 | 11 | 11 | 0.288 | 0.364 | 0.162 | 0.0553 |
| | 1 | 8 | 8 | 0.162 | 0.25 | 0.16 | 0.0441 |
| FastSAM | all | 12 | 22 | 0.362 | 0.0357 | 0.153 | 0.0763 |
| | 0 | 11 | 14 | 0.723 | 0.0714 | 0.305 | 0.153 |
| | 1 | 8 | 8 | 0 | 0 | 0 | 0 |
Table 2: Performance metrics achieved by the models on the Edge Case Dataset
On the Edge Case Dataset, FastSAM has a significantly lower recall score than RT-DETR. This indicates that FastSAM detects only a very small portion (3.57%) of all true positives, missing 96.43% of them. RT-DETR detects about 30.7% of all true positives in the dataset and misses about 69.3% of them. YOLOv9c appears not to detect anything at all in the Edge Case Dataset.
The precision of the models is also low. RT-DETR has a precision score of 0.225, so only 22.5% of the detections classified as positive by RT-DETR are true positives. FastSAM has a significantly higher precision score than RT-DETR (0.362), which shows that 36.2% of FastSAM's positive detections are correct.
Figure 16: YOLOv9c edge case predictions
YOLO does not detect any objects in the Edge Case Dataset.
Figure 17: FastSAM edge case predictions
In Figure 17, you can see that FastSAM cannot detect the agent in the first image even though it is fully visible to the naked eye. It cannot detect any objects in the foggy environment in the third picture either. However, it is able to detect an object in the second picture where a part of the body is visible in a normal environment.
Figure 18: RT-DETR edge case predictions
In Figure 18, you can see that RT-DETR has detected two false positives in the first image and one false positive in the second image. RT-DETR is able to detect the agent in the foggy environment and the agent where only a part of the body is visible in a normal environment.
Figure 19: Ground truth of the predictions in Figures 16, 17 and 18
The inference times for all the models are very low. YOLO struggles with detecting the heads of the agents, but it has the lowest number of false positives. FastSAM has more false positives than YOLO, but it has better detection, since it detects the heads more often than YOLO. RT-DETR is the best at detecting the heads, but it has significantly more false positives than YOLO or FastSAM.
Video of YOLO:
Video of FastSAM:
Video of RT-DETR:
| | YOLOv9c | FastSAM | RT-DETR |
|---|---|---|---|
| mean | 43.4 ms | 33.8 ms | 73.8 ms |
| std | 1.47 ms | 0.28 ms | 1.01 ms |
Table 3: Mean and standard deviation of inference times for each model
From Table 3, it can be seen that FastSAM stands out as the superior model for applications requiring fast and consistent inference times, making it potentially more suitable for real-time object detection tasks. YOLOv9c and RT-DETR, with their longer inference times, are the worse choice in this circumstance.
The reason why all models perform badly on the Edge Case Dataset could be that the dataset they were trained on does not contain many of the "complex" scenarios where only parts of an agent are visible to the model. A few examples of this can be seen in Figures 16-18: RT-DETR has a few false positives in the first picture, while FastSAM and YOLO detect nothing at all.
An interesting case can be observed in the first picture. An agent has their back turned to the player. Even though the agent is clearly visible, it is not detected by any of the models. We think the reason for this might be that the dataset only contains pictures of agents from the front.
RT-DETR has a better recall but worse precision. This means it is more comprehensive in detecting objects but at the cost of accuracy. FastSAM has better precision but worse recall. This model is more reliable when it claims to detect an object, but it misses a lot more actual objects.
From the findings documented in tables 1, 2 and 3 it is clear that FastSAM is the best performing model of the three in terms of the performance metrics, inference time and edge case detection.
In this study, we conducted a comparative analysis of three object detection models—YOLO, FastSAM, and RT-DETR—to identify the most effective model for detecting agents in the FPS game "Valorant". Each model was trained and tested utilizing the Santyasa Image Dataset. They were also tested on a specially designed Edge Case Dataset to challenge their robustness in atypical scenarios.
The evaluation revealed that FastSAM consistently outperformed the other models in terms of inference speed and demonstrated superior performance on the Santyasa Image Dataset, providing the most reliable detections among the three models. YOLO, on the other hand, failed to detect any agents within the Edge Case Dataset, highlighting a significant limitation in handling complex scenarios. RT-DETR detected agents in both datasets but performed less effectively than FastSAM, especially on the Santyasa Image Dataset.
These findings suggest that FastSAM is the preferred model for object detection tasks within "Valorant" under normal conditions. However, the performance of all models in complex or atypical scenarios was underwhelming, indicating a need for further refinement to enhance their robustness and reliability.
The scope and effectiveness of the project faced significant constraints due to limited computational resources, restricted training and testing time, and challenges in dataset creation. These limitations impacted the amount of data that could be processed and the extent to which the models could be optimized.
The available computational resources were insufficient to fully exploit the capabilities of the YOLO, FastSAM, and RTDETR models. High-performance computing environments are typically required to train and fine-tune such sophisticated models efficiently. However, the project had to operate with relatively weak processing power, which prolonged training durations and affected the overall throughput of data processing. This limitation was particularly challenging during model training phases that demanded intensive computations and data handling.
The project was allocated approximately 30 hours of GPU time per week, totaling around 180 hours for the entire duration of the project. Although this initially seemed sufficient, it turned out to be less than expected. For example, training RTDETR a single time took 20 hours. If a model required more time to train than was available on the GPU that week, we had to wait until the quota reset the following week, which meant that several hours were lost. This severely limited the amount of experimentation and optimization that could be conducted.
A considerable amount of time, approximately two weeks (50 hours of GPU time), was spent attempting to understand and implement the FastSAM training and validation code available in the official FastSAM repository. The intent was to utilize the provided code to train the FastSAM model.
Despite following the setup instructions accurately, the training process encountered continuous errors related to dataset formatting. Even after resolving these issues, the model often failed to detect anything, indicating that it had not been trained correctly. Various dataset formats and training parameters were adjusted in an effort to resolve the training issues. Attempts were made to train the model in different environments, including Google Colab and local machines, but these were unsuccessful.
This ongoing struggle led to the exploration of Ultralytics as a resource for training FastSAM. However, further complications arose when it was discovered that the FastSAM constructor did not include a .train() method, making training via this approach impossible.
After many trials and an analysis of the FastSAM code, which revealed that YOLO model weights were used with the YOLO constructor for its transfer learning, it was decided to train the pretrained FastSAM weights using the YOLO constructor instead, just as in their training. Although in theory this matches the training done in the FastSAM code, we cannot guarantee its equivalence. This approach was a deviation from the original goal, which was to use the FastSAM training method directly.
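In practice, the workaround amounted to something like the following sketch (the weight file name matches the FastSAM-s checkpoint mentioned earlier; equivalence with the official FastSAM pipeline is not guaranteed):

```python
from ultralytics import YOLO

# FastSAM weights loaded through the YOLO constructor, mirroring FastSAM's own transfer-learning setup
fastsam_model = YOLO("FastSAM-s.pt")
fastsam_model.train(data="data.yaml", epochs=100, batch=16, imgsz=640)
```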
Although the training dataset contains 2975 images, it is far too small. Valorant is a very dynamic game with many characters and lighting scenarios; capturing a significant number of such scenarios and situations would require potentially tens of thousands of images.
Creating such a large, high-quality, annotated dataset would be impractical given the project's time constraints. The inability to develop a large dataset that could be trusted to be well annotated limited the training potential of the models and narrowed the project's capacity to achieve higher detection accuracy.
Due to limited computational resources, the team chose scaled-down versions of each model, designed specifically to minimize GPU load. Attempts to use larger models were stopped by the necessity to operate with very small batch sizes (1 or 2) to avoid exceeding the available GPU memory; training with such small batch sizes would have resulted in training times exceeding the allocated GPU usage duration. These scaled-down models are advantageous in terms of reduced computational demands, but they generally provide less robustness and lower performance compared to their full-sized counterparts. This limitation may not only have impacted the precision and efficiency of object detection but also restricted the project's capacity to explore more advanced capabilities that might be offered by more powerful models.
These computational, time, data creation, and model selection limitations restricted the scope of the project. While the original intent was to explore the boundaries of object detection within specific environments, the resource constraints meant that the project could not be executed to the extent the group wished for. The limited training and testing time, coupled with the FastSAM issues, the challenges in dataset creation, and the forced choice of scaled-down models, limited the depth of analysis and refinement of the models, leading to a narrower exploration of their potential capabilities.
YOLO, RT-DETR and FastSAM are not yet good enough to be used, since their performance is subpar on the Edge Case Dataset. To improve the performance of the models, we would need to train them on more data that covers more scenarios.
Data augmentation could also be used to create synthetic variations of the existing dataset, increasing its diversity and helping the models generalize better to unseen edge cases.
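As one possible direction, Ultralytics exposes augmentation hyperparameters directly in the training call; a rough sketch follows (the specific values are illustrative, not tuned):

```python
from ultralytics import YOLO

model = YOLO("yolov9c.pt")

# Illustrative augmentation settings passed to training; values are not tuned
model.train(
    data="data.yaml",
    epochs=100,
    imgsz=640,
    hsv_h=0.015,   # hue jitter, helps with map/lighting variation
    hsv_v=0.4,     # brightness jitter
    degrees=5.0,   # small rotations
    scale=0.5,     # zoom in/out to simulate distant or partially visible agents
    fliplr=0.5,    # horizontal flips
    mosaic=1.0,    # mosaic augmentation combines multiple training images
)
```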
FastSAM should be trained with a more powerful GPU so that all the models can be trained using the same batch size. It would be even better if FastSAM could be trained directly.
https://github.com/Remi-Lejeune/computer-vision-valorant
[1] Wang, C.-Y., et al. (2024). YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv:2402.13616.
[2] Li, Y., Li, S., Du, H., Chen, L., Zhang, D., & Li, Y. (2020). YOLO-ACN: Focusing on Small Target and Occluded Object Detection. IEEE Access. doi:10.1109/ACCESS.2020.3046515.
[3] Zhao, X., Ding, W., An, Y., Du, Y., Yu, T., Li, M., Tang, M., & Wang, J. (2023). Fast Segment Anything.
[4] Kirillov, A., et al. (2023). Segment Anything. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[5] Lv, W., Xu, S., Zhao, Y., Wang, G., Wei, J., Cui, C., Du, Y., Dang, Q., & Liu, Y. (2023). DETRs Beat YOLOs on Real-time Object Detection.