
Revolutionizing Squat Depth Monitoring: A Breakthrough in Computer Vision Technology

Team

This blog post was written as part of the project for the CS4245 course: Seminar Computer Vision by Deep Learning.

Group members

Alain Iskandar (5749972)
Mohamed Msallak (5399920)
Youssef Zahran (5057841)

Tasks

Task Alain Mohamed Youssef
Data collection and augmentation X X X
Architecture 1 X
Architecture 2 X
Architecture 3 X
Storyline X X X
Blog X X X

Introduction

Squats, essential for strength training and powerlifting, offer remarkable benefits such as muscle growth, injury prevention, and improved competitive performance. However, accurately monitoring squat depth poses a significant challenge, as it greatly influences training effectiveness and competition scores. Conventional methods, relying on subjective observations and video recordings, often lack real-time feedback and objective assessments.

To address this challenge, we present in this blog post the first-ever computer vision-based solution for automatic squat depth detection. Our innovative approach aims to revolutionize the assessment of squat depth during training and competitions.

Powerlifting competitions prioritize fair competition and standardization, enforcing specific rules and regulations for squats. These rules encompass criteria such as proper depth, where the hip crease should go below the top surface of the knee. However, achieving and judging adherence to these standards can be challenging and prone to errors.

Figure 1 showcases two different squats performed during a competition, highlighting the complexity and variability of squat techniques encountered in powerlifting. Leveraging computer vision technology, our solution aims to provide real-time and objective feedback on squat depth. By doing so, we enhance the accuracy and consistency of evaluations, improving the overall assessment process.

Stay tuned as we delve into the technical details of our computer vision-based approach and explore its potential to revolutionize squat depth monitoring and assessment in powerlifting.


Figure 1: Example of squats performed during a competition. Judges have to specify whether or not this is a proper squat. Test your own judgment and see if you can identify whether both squats meet the proper depth criteria.

Despite the recognition of the importance of monitoring squat depth, there has been a noticeable absence of a reliable and automated solution in the market. Coaches, training partners, and athletes have long relied on direct observation, using mirrors or video footage, to ensure correct technique. However, these methods are prone to subjectivity, inconsistency, and often lack real-time corrections. The consequences of relying on such subjective assessments are substantial, as athletes may mistakenly believe their technique is appropriate during training, only to realize its inadequacy during competitions. This realization necessitates perilous last-minute adjustments, risking injury and compromised performance while handling near-maximum weights.

Moreover, while video recordings have been useful in capturing squat movements, they require laborious post-workout analysis, suffer from issues related to lighting or image quality, and rely on the inconsistent availability of experienced juries for accurate feedback. These limitations have left a void in the field of squat depth monitoring, hindering progress in training optimization, injury prevention, and performance enhancement.

Solution

We have taken up the challenge of revolutionizing squat depth monitoring by harnessing the power of computer vision technology. Our innovative approach combines the latest advancements in pose estimation and deep learning techniques to provide real-time and accurate assessments of squat depth, setting us apart as pioneers in the field.

By incorporating a pose estimation preprocessing step, our computer vision-based system detects joint angles and overlays skeletal representations on the images, enabling precise depth measurements. These preprocessed images are then fed into a convolutional neural network (CNN) that classifies each squat as either good or bad. The effectiveness of our approach is further validated through a comparison with a method that does not utilize pose estimation. Both networks are trained using labeled images of good and bad squats, ensuring robust performance and consistent evaluations.


Figure 2: Same squats as before but now processed with OpenPose. Again, test your own judgment and see if you can identify whether both squats meet the proper depth criteria.

In Figure 2, we see the same squats as before, but this time they have been processed with OpenPose[3], a pose estimation tool. This processing step has enabled us to gain a better understanding of the squats and assess their validity based on the criterion that the hip crease should go below the top surface of the knee.

Upon analysis, it is evident that the squat on the left is deemed improper as the hip crease does not descend below the knee. However, the squat on the right meets the required depth criteria, with the hip crease clearly going below the knee, thus being considered proper.
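The depth criterion maps directly onto the keypoints that OpenPose produces. As a rough illustration (not part of our pipeline, which instead lets a CNN learn the decision), a rule-based check could compare the vertical positions of the hip and knee keypoints; the BODY_25 indices and the confidence threshold below are assumptions.

```python
# Hypothetical rule-based check of the depth criterion from OpenPose keypoints.
# Indices assume OpenPose's BODY_25 model (9 = right hip, 10 = right knee);
# adjust for other keypoint layouts. In image coordinates y grows downward,
# so "hip crease below the knee" means hip_y > knee_y.

RIGHT_HIP, RIGHT_KNEE = 9, 10  # BODY_25 indices (assumption)

def hip_below_knee(keypoints, min_conf=0.3):
    """keypoints: sequence of (x, y, confidence) triples for one person."""
    _, hip_y, hip_conf = keypoints[RIGHT_HIP]
    _, knee_y, knee_conf = keypoints[RIGHT_KNEE]
    if hip_conf < min_conf or knee_conf < min_conf:
        return None  # joints not detected reliably
    return hip_y > knee_y
```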

The incorporation of pose estimation preprocessing in our solution holds great promise. By leveraging pre-detected joint angles as pre-learned features, our system capitalizes on prior knowledge, leading to enhanced results and accurate squat depth detection. This advancement paves the way for real-time corrections, objective feedback, and data-driven training improvements.

Dataset

In our pursuit to develop a reliable computer vision-based system for squat depth monitoring, we faced a significant challenge: the lack of an existing comprehensive dataset. Undeterred, we embarked on an extensive data collection endeavor, gathering our own set of squat depth images, both pass and fail instances, to train and evaluate our models. In this section, we will delve into the process of data collection, augmentation, and the techniques employed to ensure the diversity and quality of our dataset.

Data collection

Recognizing the importance of a well-balanced dataset, we carefully curated 135 pass squats and 135 fail squats, meticulously capturing variations in squat depth and technique. These images were collected from official powerlifting meets, including but not limited to organizations such as USAPL (USA Powerlifting)[1] and KNKF (Koninklijke Nederlandse Krachtsport en Fitnessfederatie)[2]. During these powerlifting meets, trained juries consisting of experienced individuals familiar with powerlifting rules and standards were responsible for assessing the squats. They determined whether each squat was performed correctly and met the required criteria for a successful lift or if it was deemed unsuccessful. This initial dataset served as the foundation for our training and evaluation processes.

Data augmentation

To augment our dataset and enhance its diversity, we employed a range of transformation techniques using the ImageDataGenerator class from Keras. These techniques introduce variations in the images, making our models more robust and better able to handle real-world scenarios. An example of an image before augmentation is shown in Figure 3. The following augmentation techniques were applied:

  1. Rotation: Randomly rotates the image by ±10 degrees.
  2. Width Shift: Randomly shifts the image horizontally by a fraction of the total width.
  3. Height Shift: Randomly shifts the image vertically by a fraction of the total height.
  4. Shear: Randomly applies shearing transformations to the image.
  5. Zoom: Randomly zooms in or out of the image.
  6. Brightness Adjustment: Randomly adjusts the brightness of the image within a specified range.
  7. Gaussian Noise: A custom preprocessing function adds Gaussian noise to the image, further diversifying the dataset.

Through this augmentation, we expanded our dataset to include 1755 pass squats and 1755 fail squats. Examples of images generated during augmentation can be seen in Figure 4.
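As a concrete sketch, the generator configuration below mirrors the techniques listed above; apart from the ±10 degree rotation, the exact fractions, ranges, and noise level are illustrative assumptions rather than our tuned values.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def add_gaussian_noise(img, sigma=10.0):
    """Preprocessing function: add Gaussian noise (sigma is an assumed value)."""
    noisy = img + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0.0, 255.0)

# Only the ±10 degree rotation is stated in the text; the remaining
# fractions and ranges are illustrative.
datagen = ImageDataGenerator(
    rotation_range=10,            # rotation by up to ±10 degrees
    width_shift_range=0.1,        # horizontal shift as a fraction of width
    height_shift_range=0.1,       # vertical shift as a fraction of height
    shear_range=0.1,              # shearing
    zoom_range=0.1,               # zoom in/out
    brightness_range=(0.8, 1.2),  # brightness adjustment
    preprocessing_function=add_gaussian_noise,
)
```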


Figure 3: Example of image before augmentation.


Figure 4: Example of images generated during augmentation.

Implementation

Our code iterates over the train and test image sets, loading each image and resizing it to a standardized 224x224 pixels. Each image is then expanded with a batch dimension to match the input shape expected by the ImageDataGenerator, which applies the defined augmentation techniques. By generating augmented images in batches, we effectively increase the size of our dataset.

To ensure a robust dataset, we set the augmentation ratio to 13 images per original image (135 × 13 = 1755 per class), which can be adjusted as needed.
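A minimal sketch of that loop is shown below, assuming the `datagen` generator from the previous sketch; the directory layout, file handling, and variable names are ours.

```python
import os
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

AUGMENTATIONS_PER_IMAGE = 13  # augmentation ratio from the text; adjustable

def augment_directory(src_dir, dst_dir, datagen, target_size=(224, 224)):
    """Resize each image, then write AUGMENTATIONS_PER_IMAGE augmented copies."""
    os.makedirs(dst_dir, exist_ok=True)
    for fname in os.listdir(src_dir):
        img = load_img(os.path.join(src_dir, fname), target_size=target_size)
        batch = np.expand_dims(img_to_array(img), axis=0)  # shape (1, 224, 224, 3)
        flow = datagen.flow(batch, batch_size=1,
                            save_to_dir=dst_dir,
                            save_prefix=os.path.splitext(fname)[0],
                            save_format='jpeg')
        for _ in range(AUGMENTATIONS_PER_IMAGE):
            next(flow)  # each call writes one augmented image to dst_dir
```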

Architectures

In this chapter, we explore three distinct architectures that have been implemented, each leveraging a modified version of the renowned VGG16 model [4].

Modified VGG16

The three implemented architectures are based on a modified version of the VGG16 model, a well-known convolutional neural network (CNN) widely used for computer vision tasks such as object detection and classification. The VGG16 architecture consists of 16 weight layers (13 convolutional and 3 fully connected), interleaved with pooling operations. It was trained on the ImageNet dataset, which spans 1000 categories, enabling it to recognize and classify a wide variety of objects, and it achieves a top-5 test accuracy of 92.7% on ImageNet. The architecture of VGG16 is depicted in Figure 5.


Figure 5: Original VGG16 architecture.

In order to adapt VGG16 for our specific squat classification task, we modified the architecture. The last three fully connected layers were replaced with a feedforward layer that consists of 256 neurons, and an additional layer with a single neuron utilizing the sigmoid activation function for binary classification. This modification allowed us to leverage the pre-existing knowledge of VGG16 and enable transfer learning, resulting in accurate squat classification.

During the implementation of transfer learning, we ensured that the weights of the original VGG16 model were frozen. This means that the learned features from the large-scale dataset were preserved and utilized in our network. Only the weights of the added layers were modified to relearn the relevant features specific to squat classification. This approach allowed us to efficiently utilize the pre-existing knowledge while customizing the model to suit our squat classification task.
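A sketch of this modified network in Keras is shown below; the frozen VGG16 base, the 256-neuron layer, and the single sigmoid output follow the description above, while the hidden activation, optimizer, and loss are assumptions.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

# Load VGG16 without its original fully connected layers and freeze its weights
# so the pre-trained convolutional features are preserved.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Replace the top with a 256-neuron dense layer and a single sigmoid output
# for binary (pass/fail) classification.
model = Sequential([
    base,
    Flatten(),
    Dense(256, activation='relu'),   # 'relu' is our assumption; the text only gives the size
    Dense(1, activation='sigmoid'),
])

# Optimizer and loss are assumptions, not stated in the text.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```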

Architecture 1: Data Augmentation, OpenPose, and Modified VGG16

The first architecture, displayed in Figure 6, combines data augmentation, OpenPose, and the modified VGG16 architecture in sequence. We started by augmenting our dataset using the techniques mentioned earlier, increasing both the quantity and diversity of training samples. Next, we passed the augmented images through the OpenPose framework to obtain pose estimates.

The output of OpenPose, which provides skeleton representations showing the connections between the joints, as depicted in Figure 2, was passed through the modified VGG16 architecture for squat classification.

Figure 6: Pipeline of Architecture 1

Architecture 2: Data Augmentation and Modified VGG16

In the second architecture, which is visually captured in Figure 7, we focused on evaluating the impact of data augmentation on the performance of the modified VGG16 architecture. We began by augmenting our dataset using the techniques mentioned earlier. The augmented data was then directly fed into the modified VGG16 architecture without incorporating the OpenPose step. This architecture allowed us to assess the extent to which pose estimation preprocessing enhanced the accuracy of the squat depth monitoring system.

Figure 7: Pipeline of Architecture 2

Architecture 3: OpenPose, Data Augmentation, and Modified VGG16

In the third architecture, illustrated in Figure 8, we sought to investigate the potential synergy between OpenPose and data augmentation in conjunction with the modified VGG16 architecture. As in Architecture 1, OpenPose was used for pose estimation and the modified VGG16 for classification. In Architecture 3, however, the order was reversed: the images were first preprocessed with OpenPose, and it is these skeleton-overlaid images that were augmented using the data augmentation techniques. The augmented images were then fed into the modified VGG16 architecture for squat classification.

Figure 8: Pipeline of Architecture 3

Results

In this chapter, we present our findings, showcasing the performance metrics obtained from each architecture on a test dataset. We then analyze and draw conclusions based on the results obtained.

Experimental Setup

Training was conducted on the Kaggle platform using a P100 GPU. Each architecture was trained once; for more reliable and robust results, multiple training runs would be needed to account for variation between runs and confirm the stability of the findings.

The collected data was divided into a 75% training set and a 25% test set. Before this split, a separate validation set was created, comprising 20 samples of proper squats and 20 samples of improper squats. It is important to note that each architecture utilized exactly the same data for training, testing, and validation. After the data splits, augmentation was performed as described in the Data augmentation section.
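As a sketch of this split, assuming scikit-learn and that the 40-sample validation set is held out from the original (un-augmented) images before the 75/25 train/test split:

```python
from sklearn.model_selection import train_test_split

# images: list of original (un-augmented) image paths; labels: 1 = pass, 0 = fail.
# First hold out a fixed validation set of 20 pass + 20 fail squats,
# then split the remainder 75% train / 25% test. Augmentation happens afterwards.
def split_dataset(images, labels, seed=42):
    rest_x, val_x, rest_y, val_y = train_test_split(
        images, labels, test_size=40, stratify=labels, random_state=seed)
    train_x, test_x, train_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.25, stratify=rest_y, random_state=seed)
    return (train_x, train_y), (test_x, test_y), (val_x, val_y)
```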

Results Analysis

In our pursuit of identifying the optimal architecture for our deep learning model, we recognized the significance of early stopping in achieving reliable results.

Early stopping played a pivotal role in our methodology, allowing us to closely monitor the validation loss during training. By setting the patience hyperparameter to 5, we established a threshold of five consecutive epochs without improvement. This approach helped us identify the epoch at which each model reached its lowest validation loss, as illustrated in Figure 9. It is worth emphasizing that early stopping prevented overfitting and provided us with a more accurate representation of the model's performance.

Figure 9: Validation Loss with Early Stopping for the 3 Architectures
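In Keras, this early-stopping setup could look roughly as follows; patience=5 and monitoring the validation loss follow the text, while restore_best_weights and the epoch cap are assumptions.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training after 5 consecutive epochs without improvement in validation
# loss (patience=5 follows the text; restore_best_weights is our assumption).
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

# Passed to training roughly as:
#   model.fit(train_data, validation_data=val_data,
#             epochs=150, callbacks=[early_stop])
```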

In evaluating the performance of the different architectures for the computer vision project, we have considered several important metrics: Accuracy, Precision, Recall, and F1 Score. Each metric provides valuable insights into different aspects of the models' performance. Below, we discuss the significance of each metric and highlight the drawbacks, if any. The values presented in Table 1 correspond to the epoch associated with the lowest validation loss, reflecting the optimal performance achieved during training.

  • Accuracy: Accuracy measures the overall correctness of the predictions made by the models. In our case, where we have a balanced split of 50% positive and 50% negative samples, accuracy remains relevant as it provides an accurate representation of the model's performance in correctly classifying both positive and negative samples. It is an important metric to consider in assessing the overall effectiveness of the models.
  • Precision and Recall: Precision and Recall are two metrics that evaluate the model's performance on positive samples. Precision quantifies the model's ability to correctly identify positive samples out of the total samples predicted as positive, while Recall measures the model's ability to identify all positive samples correctly. Both precision and recall are crucial in different scenarios.
  • F1 Score: The F1 Score is the harmonic mean of precision and recall, providing a balanced measure of a model's performance. By utilizing the F1 Score, we obtain a single metric that captures both precision and recall, thereby providing a comprehensive evaluation of the model's ability to classify positive samples accurately.
| Architecture | Epoch | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- | --- |
| Architecture 1 | 80 | 0.6892 | 0.6814 | 0.7108 | 0.6958 |
| Architecture 2 | 22 | 0.6784 | 0.6435 | 0.8000 | 0.7133 |
| Architecture 3 | 98 | 0.7597 | 0.7778 | 0.7271 | 0.7516 |

Table 1: Performance Metrics on Test Set
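For reference, the metrics in Table 1 could be computed from the test-set predictions along these lines, assuming scikit-learn and a 0.5 threshold on the sigmoid output:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(model, test_x, test_y, threshold=0.5):
    """Compute the four reported metrics from sigmoid outputs (0.5 threshold assumed)."""
    probs = model.predict(test_x).ravel()
    preds = (probs >= threshold).astype(int)
    return {
        'accuracy': accuracy_score(test_y, preds),
        'precision': precision_score(test_y, preds),
        'recall': recall_score(test_y, preds),
        'f1': f1_score(test_y, preds),
    }
```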

Considering the performance of the three architectures, Architecture 3 achieved the highest F1 Score of 0.7516, indicating a better balance between precision and recall. Additionally, Architecture 3 also exhibited the highest accuracy of 0.7597, which further strengthens its performance as the best architecture for the computer vision project.

Discussion

The results obtained from our evaluation of different architectures for squat depth monitoring provide valuable insights and raise several points for further discussion. In this section, we will delve deeper into these aspects and explore potential avenues for future research and development.

  1. Performance Comparison: The comparison of the three architectures revealed that Architecture 3, incorporating OpenPose, data augmentation, and a modified VGG16 architecture, achieved the highest F1 score and demonstrated superior performance across multiple metrics. This highlights the significance of leveraging joint angle information extracted by OpenPose and the enhanced diversity and robustness introduced by data augmentation. Further optimization can be explored to better understand the contributions of OpenPose and data augmentation techniques.

  2. Challenges of Architecture 1: Architecture 1, combining data augmentation, OpenPose, and a modified VGG16 architecture, faced challenges due to potential misalignment between the applied data augmentation techniques and the OpenPose framework. This inconsistency resulted in difficulties locating the correct joints when the image was rotated or distorted, leading to lower accuracy. Investigating and addressing the misalignment issues can improve the performance of this architecture.

  3. Limitations of OpenPose Framework: Our findings highlight a potential weakness of the OpenPose framework, relying on training with real data. While capturing real-world variability is advantageous, it may limit the framework's ability to generalize well to unseen scenarios or variations in joint angles. Integrating augmented data alongside real data could enhance the model's robustness and adaptability. Research can explore this integration to improve the performance of the OpenPose framework.

  4. Real-Time Monitoring and Feedback: One of the key advantages of our computer vision-based solution is the potential for real-time monitoring and feedback during squat exercises. Future work can focus on optimizing the processing speed and resource utilization of the architectures to enable real-time deployment on edge devices or dedicated hardware. Additionally, integrating the system with a user-friendly interface or mobile application can provide immediate feedback to athletes and trainers, facilitating technique corrections and training optimization.

  5. Generalizability and Extensibility: While our evaluation focused on squat depth monitoring, the underlying computer vision techniques and architectures can be extended to other exercises and movements. The concept of pose estimation combined with deep learning has wide applicability in various domains, including sports analysis, rehabilitation, and physical therapy. Exploring the generalizability and extensibility of our system to other movements can open up new avenues for research and practical applications.

  6. Ethical Considerations: As with any computer vision system, ethical considerations regarding privacy, data security, and consent must be addressed. Future work should emphasize the development of robust privacy protocols, ensuring that the system respects the privacy rights and autonomy of individuals. Additionally, obtaining informed consent and maintaining transparent communication with users regarding the purpose, storage, and usage of their data is crucial for ethical implementation.

Conclusion

In conclusion, our groundbreaking computer vision-based system for automatic squat depth detection represents a significant advancement in the field of strength training and powerlifting. By harnessing the power of pose estimation, deep learning techniques, and data augmentation, we have successfully created a solution that overcomes the limitations of subjective assessments and laborious post-workout analysis.

Our comprehensive evaluation of three different architectures has led us to believe that Architecture 3 is the most robust and accurate configuration. The incorporation of OpenPose, data augmentation, and a modified VGG16 architecture enables real-time corrections, objective feedback, and data-driven training improvements. This achievement marks a significant milestone in squat depth monitoring, offering coaches, training partners, and athletes a reliable and automated tool for optimizing training, preventing injuries, and enhancing performance.

While Architecture 1 showed good accuracy, it presents opportunities for further refinement. Future work can focus on aligning the data augmentation techniques with OpenPose to ensure consistent joint localization, potentially leading to improved performance. Additionally, exploring other pose estimation frameworks or advanced deep learning architectures could contribute to further advancements in squat depth monitoring.

Overall, by delivering a previously unavailable automated solution for squat depth monitoring with promising accuracy, our system reinforces the potential of computer vision technology in revolutionizing strength training, performance analysis, and human movement understanding.

References

1. USAPL (USA Powerlifting). Retrieved from https://www.youtube.com/@USAPowerlifting1/videos
2. KNKF (Koninklijke Nederlandse Krachtsport en Fitnessfederatie). Retrieved from https://www.youtube.com/@powerliften
3. Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S., & Sheikh, Y. A. (2019). OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence.
4. Rohini, G. (2021). Everything you need to know about VGG16.