# computer-vision-valorant

[Justin Luu](https://github.com/justinluu2311) | [Arjun Vilakathara](https://github.com/avilakathara) | [Remi Lejeune](https://github.com/Remi-Lejeune)

## Introduction

Computer vision, especially image segmentation and object detection, is a rapidly evolving field with potential applications across various industries. In recent years, significant improvements have been made in algorithms and training methodologies, building on existing techniques to enhance both the accuracy and efficiency of image segmentation tasks.

As a team interested in CV (Computer Vision), we have been intrigued by the recent surge in AI (Artificial Intelligence) applications and advancements. Our curiosity led us to explore the intersection of AI and gaming, specifically how AI technologies could affect players' experiences in tactical FPS (First-Person Shooter) games like Valorant.

The primary goal of aiming in games like Valorant is precision and speed, two factors that can significantly impact gameplay. Traditionally, aimbots have relied on extracting data directly from the game or server to pinpoint the locations of opposing players. These methods are effective for cheating, but our interest here is not to optimize how to cheat, but to see whether something similar can be built using deep learning computer vision techniques.

In our exploration, we assess how these computer vision techniques can be used to develop an aimbot. The focus is on evaluating the processing speed, the quality of the image segmentation, and its effectiveness within the game. We aim to develop an aimbot that mimics how a human would play, and the first step is to delineate player models from the background in the game's video feed. In this blog, we describe how we trained different models to perform object detection in the FPS game Valorant. We provide an analysis of the results and determine which model is most suitable for agent detection in Valorant.

## Models

Multiple computer vision models are analysed in this blog, each with its own specificity. First, there is YOLO (You Only Look Once), a state-of-the-art object detection model. Then there is FastSAM (Fast Segment Anything Model), an image segmentation model based on SAM (Segment Anything) that has been modified to run up to 50x faster. Finally, there is RT-DETR (Real Time Detection Transformer), an object detection model like YOLO, but one that uses a transformer encoder.

### YOLO (You Only Look Once)

![image](https://hackmd.io/_uploads/SycBKtirC.png)
*Figure 1: YOLOv9 Architecture*

YOLO [1] is a popular object detection model known for its speed and accuracy. Unlike traditional object detection systems that apply the model to an image at multiple locations and scales, YOLO applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. YOLO is capable of detecting multiple objects in real time, making it highly efficient for tasks that require quick processing, such as autonomous driving. Despite its strengths, YOLO has some limitations: the model may struggle with detecting small objects in close proximity [2]. The YOLO model used for this research is YOLOv9c, the latest available YOLO model at the time of this research.
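As an illustration of this single-pass design, the minimal sketch below runs one image through the Ultralytics YOLO wrapper that is also used later in this project. It is illustrative only and not part of our pipeline; the image path is a placeholder.

```
from ultralytics import YOLO

# Pretrained YOLOv9c weights, before any fine-tuning on Valorant data.
model = YOLO("YOLOv9c.pt")

# One forward pass over the whole image returns boxes, confidences and class ids.
results = model("path/to/frame.jpg")
for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)
```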
### FastSAM (Fast Segment Anything Model)

![image](https://hackmd.io/_uploads/SkOt6YoHA.png)
*Figure 2: FastSAM Architecture*

FastSAM [3] is an advanced image segmentation model designed for rapid and accurate segmentation tasks. It builds upon SAM [4] (Segment Anything) and has been optimized to run up to 50x faster. It excels at identifying objects within an image, producing precise segmentations that can be used for various applications, such as medical imaging, autonomous driving, and video analysis.

### RT-DETR (Real Time Detection Transformer)

![image](https://hackmd.io/_uploads/Hy81AFiSC.png)
*Figure 3: RT-DETR Architecture*

RT-DETR (Real-Time Detection Transformer) [5] is an advanced object detection model that utilizes a transformer encoder architecture, as opposed to traditional convolutional neural network (CNN)-based models like YOLO. Its architecture involves dividing an image into a grid of patches, which are then processed through a series of transformer encoder layers. These layers enable the model to learn relationships between different parts of the image, allowing it to accurately detect and classify objects. In this way, RT-DETR addresses some of the limitations of grid-based models like YOLO by providing more precise localization and classification of objects.

## Dataset

Datasets of annotated Valorant images are readily available on the internet. The first dataset we tried is the [valorant-object-detection2](https://universe.roboflow.com/kwan-li-jqief/valorant-object-detection2/dataset/7) dataset, composed of 1416 training images. However, when models were trained on this dataset, they performed poorly. Looking more closely at the data, we found that some of it was poorly labeled for the use case of this project, since we need the whole body covered by a bounding box, not just parts of the chest area:

![image](https://hackmd.io/_uploads/BydiK0oHR.jpg)
*Figure 4: Example of 'bad' annotation*

This meant that another dataset had to be chosen. The newly selected dataset is the [Santyasa Image Dataset](https://universe.roboflow.com/alfin-scifo/santyasa/dataset/4), with 2975 training images. A review of this dataset inspired confidence in its annotations, as no obviously faulty ones similar to those in the previous dataset were identified. The data is divided into three parts: a training set, a validation set, and a testing set. The training process required the dataset to be in YOLO format, which consists of a folder of images and a folder of text files containing the annotations associated with the images.
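As a quick sanity check on this layout, the short sketch below lists images that have no matching label file. It assumes the usual YOLO convention that every image in `train/images` has a same-named `.txt` file in `train/labels`; the paths are placeholders.

```
from pathlib import Path

# Placeholder paths: point these at the unpacked dataset split.
images_dir = Path("train/images")
labels_dir = Path("train/labels")

# Every image should have a matching <name>.txt annotation file.
missing = [img.name for img in images_dir.glob("*.jpg")
           if not (labels_dir / f"{img.stem}.txt").exists()]
print(f"{len(missing)} images without a label file")
```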
The labels and annotations in a text file for **segmentations** look as follows:

```
0 0.49038460850715637 0.47836539149284363 0.48798078298568726 0.48076921701431274 0.48798078298568726 0.48317307233810425 0.48317307233810425 0.48798078298568726 0.48317307233810425 0.49038460850715637 0.48076921701431274 0.4927884638309479 0.48076921701431274 0.4951923191547394 0.47836539149284363 0.4975961446762085 0.47836539149284363 0.5192307829856873 0.4759615361690521 0.5216346383094788 0.4759615361690521 0.5288461446762085 0.47836539149284363 0.53125 0.47836539149284363 0.5408653616905212 0.4759615361690521 0.5432692170143127 0.4759615361690521 0.5600961446762085 0.47836539149284363 0.5625 0.47836539149284363 0.567307710647583 0.48076921701431274 0.567307710647583 0.48076921701431274 0.5625 0.48798078298568726 0.5552884340286255
```

The labels and annotations in a text file for **bounding boxes** look as follows:

```
1 0.5360576923076923 0.4375 0.03365384615384615 0.040865384615384616
0 0.5360576923076923 0.5552884615384616 0.10817307692307693 0.3004807692307692
```

In both formats, the first number is the class of the annotated object. For segmentations, the remaining numbers are the normalized coordinates of the points that make up the polygon; for bounding boxes, they are the normalized centre coordinates followed by the width and height of the box. The dataset has two classes, 0 and 1, where 0 is the body and 1 is the head of the agent.
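To make the format concrete, here is a small parsing sketch. It is for illustration only; the training pipeline reads these files through Ultralytics.

```
def parse_label_line(line):
    """Parse one line of a YOLO-format label file.

    Bounding-box lines carry 5 numbers: class, x_center, y_center, width, height,
    all normalised to [0, 1]. Segmentation lines carry the class followed by an
    arbitrary number of normalised (x, y) polygon points.
    """
    fields = line.split()
    cls = int(fields[0])                      # 0 = body, 1 = head
    numbers = [float(v) for v in fields[1:]]
    if len(numbers) == 4:                     # bounding box
        x_center, y_center, width, height = numbers
        return cls, "box", (x_center, y_center, width, height)
    # Segmentation polygon: pair the remaining numbers into (x, y) points.
    points = list(zip(numbers[0::2], numbers[1::2]))
    return cls, "polygon", points
```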
Here are a few example images from the dataset:

| ![dwdw](https://hackmd.io/_uploads/Sy_OcthS0.jpg) | ![dww](https://hackmd.io/_uploads/r1eljY2r0.jpg) |
|--------|---------|
| ![efe](https://hackmd.io/_uploads/H1oKjtnSR.jpg) | ![fg](https://hackmd.io/_uploads/ryf-2K3HC.jpg) |

*Figure 5: Example images from the Santyasa dataset*

We then created a dataset called the [Edge Case Dataset](https://universe.roboflow.com/justin-3f4xp/valorant-agents-pl9nm/dataset/1), made to replicate situations considered edge cases in the game, for example where only the head is visible, or only part of the body is visible. The Santyasa Image Dataset does contain a few of these images, but the Edge Case Dataset was made for evaluation purposes only, since it uses very difficult images. Here are a few example images from this dataset:

| ![Screenshot-6-_png.rf.5201dfd3912c2ef5c62a9d5cab881ada](https://hackmd.io/_uploads/BJ1NSnhBC.jpg) | ![Screenshot-7-_png.rf.c6686d58c6d628eaf574480571e863a3](https://hackmd.io/_uploads/ByUVr2nSR.jpg) |
|--------|---------|
| ![Screenshot-15-_png.rf.4d1d9afe4fd4f92bead448de19beff87](https://hackmd.io/_uploads/S1K4S2nBR.jpg) | ![Screenshot-10-_png.rf.8455b5912363482b8c17c3dcec2a672b](https://hackmd.io/_uploads/BJpVH32rA.jpg) |

*Figure 6: Example images from the Edge Case Dataset*

Finally, to showcase the performance of the models in real time, a video clip was recorded. The video runs at 30 FPS (Frames Per Second) and is used both to show the detection models running in real time and to calculate their inference time.

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/vb26pUwg6Hc?si=6mu8sJ3JjbE5oKGL" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

*Video 1: Video clip used to test the real-time performance of the models*

## Equipment

Training was done on equipment provided by Kaggle: a Kaggle GPU P100 with 16GB of memory. Within these specs, the following parameters were used for training:

- FastSAM-s: epochs=100, batch=16, imgsz=640
- YOLOv9c: epochs=100, batch=32, imgsz=640
- RT-DETR: epochs=100, batch=32, imgsz=640

The experiments were run on an RTX 3060 Ti with 8GB of memory.

## Training

As mentioned, the project utilized YOLO, FastSAM, and RT-DETR. This section provides a detailed breakdown of the training process, focusing on the YOLO model for illustrative purposes.

#### Dependencies

- Wandb (Weights & Biases): used for experiment tracking and visualizing model performance. Secrets management provided by kaggle_secrets was used to securely access the wandb API key.
- Ultralytics: the project leverages the Ultralytics suite for training, validating, and testing the YOLO, FastSAM, and RT-DETR models. These models are part of the Ultralytics ecosystem, which offers straightforward methods for training and testing.

#### Loading the models

Initially, the project aimed to train models from scratch. However, it soon became clear that this approach required extensive computational resources and time. To address this, transfer learning was employed: a technique that significantly accelerates training by fine-tuning models pre-trained on a large, generic dataset for a more specific task. Pre-trained versions of all models were fine-tuned on the chosen dataset (the Santyasa Image Dataset). The models were loaded as follows. For YOLO, the model was initialized with pre-trained weights ("YOLOv9c.pt"), and similar steps were taken for the other models.

```
from ultralytics import YOLO
from ultralytics import RTDETR

model_yolo = YOLO("YOLOv9c.pt")
model_fastsam = YOLO("FastSAM-s.pt")
model_rtdetr = RTDETR("rtdetr-l.pt")
```

#### Preparing the data

The Ultralytics training pipeline requires details about the dataset in a data.yaml file, which links to the dataset containing the annotated images for object detection. This file outlines the dataset's structure, path, classes, and other configurations required for training. For example:

```
train: ../train/images
val: ../valid/images
test: ../test/images

nc: 2
names: ['1', '2']
```

#### Training using parameters

The models were trained for 100 epochs, with a batch size of 16 for FastSAM and 32 for the other two models. The image size was set to 640x640 pixels. The model and its best iterations were saved for later evaluation and inference.

```
model.train(data="data.yaml", epochs=100, batch=16, imgsz=640, save=True, device='0')
```

The chosen parameters, and the inconsistency in batch size, were dictated by computational resource limitations. FastSAM was trained with a smaller batch size than YOLO and RT-DETR because it used too much memory otherwise. The goal was to use as much of the GPU as possible, given the limited training time. As a result of this choice, FastSAM has the advantage of being able to learn more from each individual example, so it is expected to have an advantage in the metrics.
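Putting the pieces of this section together, a condensed sketch of the training setup could look like the following. The wandb secret name is an assumption (use whatever name the key was stored under in Kaggle), and in practice each model was trained in its own run rather than one loop.

```
import wandb
from kaggle_secrets import UserSecretsClient
from ultralytics import YOLO, RTDETR

# Experiment tracking: the wandb API key is stored as a Kaggle secret.
# The secret name "wandb_api_key" is an assumption.
wandb.login(key=UserSecretsClient().get_secret("wandb_api_key"))

# The three fine-tuned models with the batch sizes listed under Equipment.
runs = [
    (YOLO("YOLOv9c.pt"), 32),
    (YOLO("FastSAM-s.pt"), 16),   # smaller batch: FastSAM runs out of memory at 32
    (RTDETR("rtdetr-l.pt"), 32),
]

for model, batch in runs:
    model.train(data="data.yaml", epochs=100, batch=batch, imgsz=640,
                save=True, device="0")
```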
## Experiment

For the experiments, the YOLO, FastSAM, and RT-DETR models were tested to assess their effectiveness on the test subset of the Santyasa Image Dataset and on the Edge Case Dataset created specifically for this project. Additionally, the models were evaluated on selected videos to measure average computation time and detection accuracy.

#### Testing on Datasets

The experiments began by deploying all three models on the test set of the Santyasa Image Dataset and the Edge Case Dataset. This approach was designed to compare the performance of the models in recognizing and detecting objects.

```
model = YOLO("chosen model, e.g. 'best_weights_fastsam_final.pt'")
# model = RTDETR("best_weights_rtdetr_final.pt")  # use for RT-DETR
metrics = model.val(conf=0.5, data="path/to/testdata.yaml", split="test")
```

#### Testing on Video Data

After offline testing, the models were applied to a series of pre-selected videos. The primary goal was to evaluate the models under more dynamic conditions, reflecting potential real-world applications. Each model's average computation time per frame and detection efficacy were recorded.

```
# Run experiment for real-time testing on videos
import time
import cv2

cap = cv2.VideoCapture("path/to/video.mp4")
inference_times = []

# Loop through the video frames
while cap.isOpened():
    # Read a frame from the video
    success, frame = cap.read()
    if not success:
        break

    # Run tracking with the chosen model, persisting tracks between frames
    start_time = time.time()
    results = model.track(frame, persist=True, conf=0.5)
    end_time = time.time()
    inference_times.append(end_time - start_time)

    # Visualize the results on the frame
    annotated_frame = results[0].plot()

    # Display the annotated frame
    cv2.imshow("Real-Time Tracking", annotated_frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```

## Results

### Metrics

#### Precision

Precision indicates the accuracy of the predictions made by the model, specifically the proportion of positive identifications that were actually correct. It is calculated as the number of true positive detections divided by the total number of elements labeled as positive (true positives + false positives). It is particularly important when the cost of a false positive is high.

Precision = $\frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

#### Recall

Recall measures the ability of the model to find all the relevant cases (true positives) within a dataset. It is the proportion of actual positives that were correctly identified, i.e. the fraction of actual objects that were detected by the model. It is an important metric when it is important to capture as many positives as possible.

Recall = $\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

#### F1-score

The F1-score is the harmonic mean of precision and recall, providing a single score that balances both concerns in one number. It is particularly useful for our use case since we need to compare multiple models.

F1 Score = $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

#### MAP (Mean Average Precision)

In object detection, AP (Average Precision) measures the accuracy of the model in detecting objects of a particular class, integrating over a precision-recall curve. MAP is the average of AP scores across all classes or over different IoU (Intersection over Union) thresholds.

IoU is a measure used to determine the accuracy of a predicted bounding box. It calculates the ratio of the intersection area between the predicted bounding box and the ground truth bounding box to their union area. A higher IoU indicates a more accurate prediction. MAP50 and MAP50-95 were used, where the numbers indicate the IoU threshold values.

IoU = $\frac{\text{Overlap of Predicted Box and Labeled Box}}{\text{Union of Predicted Box and Labeled Box}}$

MAP50 = $\text{Average of APs calculated at IoU threshold of 0.50 across all classes}$

MAP50-95 = $\text{Average of APs calculated at IoU thresholds from 0.50 to 0.95 across all classes}$
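As a concrete illustration of these definitions, the small sketch below computes precision, recall, F1 and IoU from raw counts and box coordinates. It is for illustration only; in practice the Ultralytics `model.val()` call reports all of these metrics directly.

```
def precision(tp, fp):
    # Fraction of predicted positives that are correct.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Fraction of actual positives that were found.
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2) corner coordinates.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```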
### Offline Evaluation

In this section, the results of the offline evaluation are presented. Class "0" is the body, class "1" is the head, and class "all" refers to the aggregated results over all images.

| **Model** | **Class** | **Images** | **Instances** | **Precision** | **Recall** | **MAP50** | **MAP50-95** | **F1-score** |
|-------------|-----------|------------|---------------|---------|-----------|-----------|--------------|--------------|
| **YOLOv9c** | all | 404 | 881 | 0.871 | 0.418 | 0.653 | 0.304 | 0.565 |
| | 0 | 399 | 461 | 0.891 | 0.675 | 0.795 | 0.417 | 0.569 |
| | 1 | 365 | 420 | 0.85 | 0.162 | 0.51 | 0.192 | 0.272 |
| | | | | | | | | |
| **RT-DETR** | all | 404 | 881 | 0.769 | 0.602 | 0.639 | 0.265 | 0.675 |
| | 0 | 399 | 461 | 0.89 | 0.827 | 0.876 | 0.412 | 0.881 |
| | 1 | 365 | 420 | 0.648 | 0.377 | 0.403 | 0.118 | 0.477 |
| | | | | | | | | |
| **FastSAM** | all | 404 | 622 | **0.9** | **0.664** | **0.791** | **0.491** | **0.764** |
| | 0 | 365 | 452 | 0.92 | 0.763 | 0.853 | 0.592 | 0.834 |
| | 1 | 154 | 170 | 0.881 | 0.565 | 0.73 | 0.389 | 0.688 |

*Table 1: Performance metrics achieved by the models on the Santyasa Image Dataset*

As shown in Table 1, YOLO scores the worst of the three models on recall, which means it detects fewer true positives than FastSAM and RT-DETR. RT-DETR scores the worst on precision, indicating that it has more false positives than the other models. Overall, FastSAM yields the best scores on all metrics for the Santyasa Image Dataset and is therefore the best model on this dataset.

**YOLOv9c results**

| ![F1_curve](https://hackmd.io/_uploads/B15YTFiH0.png) | ![P_curve](https://hackmd.io/_uploads/By5YaYsrC.png) |
|--------|---------|
| ![PR_curve](https://hackmd.io/_uploads/Bk9YaYsS0.png) | ![R_curve](https://hackmd.io/_uploads/H1qKpYiHR.png) |

*Figure 7: Performance metrics achieved by YOLOv9c on the test set*

In Figure 7, the F1-Confidence, Recall-Confidence and Precision-Confidence curves suggest that the best confidence threshold lies around 0.65.

![yolo_predictions](https://hackmd.io/_uploads/HJqlWa2BR.jpg)
*Figure 8: Predictions made by YOLOv9c*

![yolo_labels](https://hackmd.io/_uploads/rJMG-pnB0.jpg)
*Figure 9: Ground truth of the predictions by YOLOv9c*

**RT-DETR results**

| ![F1_curve](https://hackmd.io/_uploads/r1WwTYsrR.png) | ![P_curve](https://hackmd.io/_uploads/HJZP6FirR.png) |
|--------|---------|
| ![PR_curve](https://hackmd.io/_uploads/BkZDTFoSR.png) | ![R_curve](https://hackmd.io/_uploads/SkWwTYiSA.png) |

*Figure 10: Performance metrics achieved by RT-DETR on the test set*

In Figure 10, the F1-Confidence, Recall-Confidence and Precision-Confidence curves suggest that the best confidence threshold lies around 0.7.
**Predictions**

![rtdetr_predictions](https://hackmd.io/_uploads/SkMYx63r0.jpg)
*Figure 11: Predictions made by RT-DETR*

**Ground truth**

![rtdetr_labels](https://hackmd.io/_uploads/SJnpgpnS0.jpg)
*Figure 12: Ground truth of the predictions by RT-DETR*

**FastSAM results**

| ![BoxF1_curve](https://hackmd.io/_uploads/rkcvRYiS0.png) | ![BoxP_curve](https://hackmd.io/_uploads/ry5DRYsSC.png) |
|--------|---------|
| ![BoxPR_curve](https://hackmd.io/_uploads/HJ5wRKirA.png) | ![BoxR_curve](https://hackmd.io/_uploads/SJx9wRYjrR.png) |

*Figure 13: Performance metrics achieved by FastSAM on the test set*

In Figure 13, the F1-Confidence, Recall-Confidence and Precision-Confidence curves suggest that the best confidence threshold lies between 0.7 and 0.8.

**Predictions**

![fastsam_predictions](https://hackmd.io/_uploads/BkvHe6nSC.jpg)
*Figure 14: Predictions made by FastSAM*

**Ground truth**

![fastsam_labels](https://hackmd.io/_uploads/r1ewlp3HR.jpg)
*Figure 15: Ground truth of the predictions made by FastSAM*

**Edge cases evaluation**

| **Model** | Class | Images | Instances | P | R | mAP50 | mAP50-95 |
|-------------|-------|--------|-----------|-------|--------|-------|----------|
| **YOLOv9c** | all | 12 | 19 | 0 | 0 | 0 | 0 |
| | | | | | | | |
| **RT-DETR** | all | 12 | 19 | 0.225 | **0.307** | **0.161** | 0.0497 |
| | 0 | 11 | 11 | 0.288 | 0.364 | 0.162 | 0.0553 |
| | 1 | 8 | 8 | 0.162 | 0.25 | 0.16 | 0.0441 |
| | | | | | | | |
| **FastSAM** | all | 12 | 22 | **0.362** | 0.0357 | 0.153 | **0.0763** |
| | 0 | 11 | 14 | 0.723 | 0.0714 | 0.305 | 0.153 |
| | 1 | 8 | 8 | 0 | 0 | 0 | 0 |

*Table 2: Performance metrics achieved by the models on the Edge Case Dataset*

On the Edge Case Dataset, FastSAM has a significantly lower recall than RT-DETR. This indicates that FastSAM detects only a very small portion (3.57%) of all true positives, missing 96.43% of them. RT-DETR detects about 30.7% of all true positives in the dataset and misses about 69.3% of them. YOLOv9c appears to detect nothing at all in the Edge Case Dataset. The precision of the models is also low. RT-DETR has a precision score of 0.225, so only 22.5% of the detections classified as positive by RT-DETR are true positives. FastSAM has a significantly higher precision score than RT-DETR (0.362), which shows that 36.2% of FastSAM's positive detections are correct.

**YOLO edge case predictions**

![yolo_edge_cases](https://hackmd.io/_uploads/rJfrb63rA.jpg)
*Figure 16: YOLOv9c edge case predictions*

YOLO does not detect any objects in the Edge Case Dataset.

**FastSAM edge case predictions**

![fastsam_edge_cases](https://hackmd.io/_uploads/S1oUba3BA.jpg)
*Figure 17: FastSAM edge case predictions*

In Figure 17, FastSAM cannot detect the agent in the first image even though it is fully visible to the naked eye. It cannot detect any objects in the foggy environment in the third picture either. However, it is able to detect an object in the second picture, where part of the body is visible in a normal environment.

**RT-DETR edge case predictions**

![rtdetr_edge_cases](https://hackmd.io/_uploads/rJxCZa3rA.jpg)
*Figure 18: RT-DETR edge case predictions*

In Figure 18, RT-DETR has detected two false positives in the first image and one false positive in the second image. RT-DETR is able to detect the agent in the foggy environment as well as the agent of which only a part of the body is visible in a normal environment.
**Ground truth Edge Case Dataset**

![edge_cases_labels](https://hackmd.io/_uploads/SkGeGanrC.jpg)
*Figure 19: Ground truth of the predictions in Figures 16, 17 and 18*

### Real-time evaluation

The inference times for all the models are very low. YOLO struggles with detecting the heads of the agents, but it has the lowest number of false positives. FastSAM has more false positives than YOLO, but it has better detection, since it detects the heads more often than YOLO. RT-DETR has the best ability to detect the heads, but it has significantly more false positives than YOLO or FastSAM.

Video of YOLO:

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/Tj21M-y_cIM?si=og4QlEhWEUA9yzyd" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

*Video 2: Results of YOLO*

Video of FastSAM:

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/R3svfqkTfow?si=P4wSDRGD2-ILXWtR" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

*Video 3: Results of FastSAM*

Video of RT-DETR:

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/UWS17TGn6A8?si=RZLrEWjPMJP9dMS5" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

*Video 4: Results of RT-DETR*

| | YOLOv9c | FastSAM | RT-DETR |
|------|---------|---------|---------|
| **mean** | 43.4ms | **33.8ms** | 73.8ms |
| **std** | 1.47ms | **0.28ms** | 1.01ms |

*Table 3: Mean and standard deviation of inference times for each model*

From Table 3, it can be seen that FastSAM stands out as the superior model for applications requiring fast and consistent inference times, making it potentially more suitable for real-time object detection tasks. YOLOv9c and RT-DETR, with their longer inference times, are slower and therefore the worse choice in this setting.
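The numbers in Table 3 follow directly from the per-frame timings collected in the video loop shown earlier; a minimal sketch of that summary step:

```
import numpy as np

def summarise(inference_times):
    # inference_times: per-frame timings (in seconds) collected in the video loop.
    mean_ms = np.mean(inference_times) * 1000
    std_ms = np.std(inference_times) * 1000
    print(f"mean: {mean_ms:.1f} ms, std: {std_ms:.2f} ms")
```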
## Discussion

The reason why all models perform badly on the Edge Case Dataset could be that the dataset they were trained on does not contain many of the "complex" scenarios where only parts of an agent are visible to the model. A few examples of this can be seen in Figures 16, 17 and 18: RT-DETR has a few false positives in the first picture, while FastSAM and YOLO detect nothing at all.

An interesting case can be observed in the first picture, where an agent has their back turned to the player. Even though the agent is clearly visible, it is not detected by any of the models. We think the reason might be that the dataset only contains pictures of agents seen from the front.

RT-DETR has better recall but worse precision. This means it is more comprehensive in detecting objects, but at the cost of accuracy. FastSAM has better precision but worse recall: it is more reliable when it claims to detect an object, but it misses many more actual objects. From the findings documented in Tables 1, 2 and 3, it is clear that FastSAM is the best-performing model of the three in terms of performance metrics, inference time and edge case detection.

## Conclusion

In this study, we conducted a comparative analysis of three object detection models (YOLO, FastSAM, and RT-DETR) to identify the most effective model for detecting agents in the FPS game "Valorant". Each model was trained and tested using the Santyasa Image Dataset. They were also tested on a specially designed Edge Case Dataset to challenge their robustness in atypical scenarios.

The evaluation revealed that FastSAM consistently outperformed the other models in terms of inference speed and demonstrated superior performance on the Santyasa Image Dataset. FastSAM provided the most reliable detections among the three models. On the other hand, YOLO failed to detect any agents within the Edge Case Dataset, highlighting a significant limitation in handling complex scenarios. RT-DETR detected agents in both datasets but performed less effectively than FastSAM, especially on the Santyasa Image Dataset.

These findings suggest that FastSAM is the preferred model for object detection tasks within "Valorant" under normal conditions. However, the performance of all models in complex or atypical scenarios was underwhelming, indicating a need for further refinement to enhance their robustness and reliability.

## Limitations

The scope and effectiveness of the project faced significant constraints due to limited computational resources, restricted training and testing time, and challenges in dataset creation. These limitations affected the amount of data that could be processed and the extent to which the models could be optimized.

#### Computational Power Constraints

The available computational resources were insufficient to fully exploit the capabilities of the YOLO, FastSAM, and RT-DETR models. High-performance computing environments are typically required to train and fine-tune such sophisticated models efficiently. However, the project had to operate with relatively weak processing power, which prolonged training durations and affected the overall throughput of data processing. This limitation was particularly challenging during training phases that demanded intensive computation and data handling.

#### Restricted Training and Testing Time

The project was allocated approximately 30 hours of training time per week, totaling around 180 hours for the entire duration of the project. Although this initially seemed sufficient, it turned out to be less than expected. For example, training RT-DETR a single time took 20 hours. If a model required more training time than was available on the GPU that week, we had to wait until the quota reset the following week, which meant that some hours were lost this way. This severely limited the amount of experimentation and optimization that could be conducted.

#### Difficulties with FastSAM

A considerable amount of time, approximately two weeks (50 hours of GPU time), was spent attempting to understand and implement the FastSAM training and validation code available in the [official FastSAM repository](https://github.com/CASIA-IVA-Lab/FastSAM?tab=readme-ov-file). The intent was to use the provided code to train the FastSAM model. Despite following the setup instructions accurately, the training process ran into continuous errors related to dataset formatting. Even after resolving these issues, the model often failed to detect anything, indicating that it had not been trained correctly. Various dataset formats and training parameters were adjusted in an effort to resolve the training issues.
Attempts were made to train the model in different environments, including Google Colab and local machines, but these were unsuccessful. This ongoing struggle led to the exploration of Ultralytics as a resource for training FastSAM. However, further complications arose when it was discovered that the FastSAM constructor did not include a .train() method, making training via this approach impossible.

```
from ultralytics import FastSAM

model = FastSAM("FastSAM-s.pt")
```

After many trials and an analysis of the FastSAM code, which revealed that YOLO model weights were used with the YOLO constructor for its transfer learning, it was decided to train the pretrained FastSAM using the YOLO constructor instead, just as was done in the original training. Although in theory the training is done just as it is in the FastSAM code, we cannot guarantee its equivalence. This approach was a deviation from the original goal, which was to use the FastSAM training method directly.

#### Challenges in Creating a High-Quality Dataset

Although the dataset used for training contains 2975 training images, it is far too small. Valorant as a game is very dynamic, with many characters and lighting scenarios. Capturing a significant number of such scenarios and situations would require potentially tens of thousands of images. Creating such a large, high-quality, annotated dataset would be impractical given the project's time constraints. The inability to develop a large dataset that could be trusted to be well annotated limited the training potential of the models and subsequently narrowed the project's capacity to achieve potentially higher detection accuracy.

#### Constraints on Model Selection

Due to limited computational resources, the team was motivated to choose scaled-down versions of each model, designed specifically to minimize GPU load. Attempts to use larger models were stopped by the necessity of operating with very small batch sizes (batch size 1 or 2) to avoid exceeding the available GPU memory. Training with such small batch sizes would have resulted in training times exceeding the allocated GPU usage duration. These scaled-down models are advantageous in terms of reduced computational demands, but they generally provide less robustness and lower performance than their full-sized counterparts. This limitation not only could have impacted the precision and efficiency of object detection, but also restricted the project's capacity to explore more advanced capabilities that might be offered by more powerful models.

#### Limited Scope of the Project

These computational, time, data creation, and model selection limitations restricted the scope of the project. While the original intent was to explore the boundaries of object detection within specific environments, the resource constraints meant that the project could not be executed to the extent the group wished for. The limited training and testing time, coupled with the FastSAM issues, the challenges in dataset creation, and the forced choice of scaled-down models, limited the depth of analysis and refinement of the models, leading to a narrower exploration of their potential capabilities.

## Future works

YOLO, RT-DETR and FastSAM are not yet good enough to be used, since their performance is subpar on the Edge Case Dataset. To improve the performance of the models, we would need to train them on more data that covers more scenarios. Data augmentation could also be used to create synthetic variations of the existing dataset, increasing its diversity and helping the models generalize better to unseen edge cases. FastSAM should be trained on a more powerful GPU so that all the models can be trained with the same batch size. It would be even better if FastSAM could be trained directly.
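As a rough illustration of the data augmentation idea above, the Ultralytics trainer exposes augmentation hyperparameters that could be pushed harder than the defaults. The values below are illustrative assumptions, not tuned settings.

```
from ultralytics import YOLO

model = YOLO("YOLOv9c.pt")

# Illustrative augmentation settings (assumed values, not tuned):
model.train(
    data="data.yaml",
    epochs=100,
    batch=32,
    imgsz=640,
    hsv_v=0.5,    # stronger brightness jitter for dark or foggy scenes
    degrees=5.0,  # small random rotations
    scale=0.6,    # zoom in/out, simulating partially visible agents
    fliplr=0.5,   # horizontal flips
    mosaic=1.0,   # mosaic augmentation stitches four training images together
)
```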
## GitHub

https://github.com/Remi-Lejeune/computer-vision-valorant

## References

[1] Wang, Chien-Yao, et al. "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information." arXiv:2402.13616 (2024).

[2] Li, Yongjun, et al. "YOLO-ACN: Focusing on Small Target and Occluded Object Detection." IEEE Access (2020). doi:10.1109/ACCESS.2020.3046515.

[3] Zhao, Xu, et al. "Fast Segment Anything." (2023).

[4] Kirillov, Alexander, et al. "Segment Anything." Proceedings of the IEEE/CVF International Conference on Computer Vision (2023).

[5] Lv, Wenyu, et al. "DETRs Beat YOLOs on Real-time Object Detection." (2023).
