# Can a Camera Read Your Chord? *[Jasraj Anand](https://www.linkedin.com/in/jasraj747/) · [Nikshith Menta](https://www.linkedin.com/in/nikshithmenta/) · [Sara Cortez](https://www.linkedin.com/in/saracorteztolstoi/)* ## 1. Introduction & Motivation A beginner guitarist receives no visual feedback on whether their finger positions are correct unless they own expensive hardware. Existing tools only partially solve this: audio-based apps such as Chordify tell you whether the sound is right, not whether your fingers are placed correctly, while MIDI-enabled guitars and sensor gloves add cost and setup friction that put them out of reach for most learners. There is academic work on visual chord recognition, but it is based on static images in controlled lab settings, not on live video. This means that learners without a MIDI interface or recording setup get no immediate visual feedback on whether the chord they are forming is correct. We created a system that addresses these shortcomings. Given a live webcam feed, a YouTube video, or a custom video file, it identifies the guitar chord being played in real time, with no extra hardware required. The system reaches 88.07% accuracy on a test set that spans different players, guitars, and lighting conditions across five chord classes: C, D, Em, F, and G. To achieve this, we created our own varied dataset of 1,619 labeled fretboard images, designed a five-stage detection and classification pipeline, and showed that how the input is represented and preprocessed matters more than the choice of architecture at this scale of data. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/BklIwDPe2-x.png" style="width: 100%; display: block; margin: 0 auto;"> <p><em>Figure 1: Classifier output across all five chord classes. 
Each bounding box color corresponds to the detected chord.</em></p> <br/> </div> :::info **:guitar: Explore the project** <table style="width: 100%;"> <tbody> <tr> <td style="width: 25%; min-width: 120px; vertical-align: top;">🎥 <a href="https://drive.google.com/file/d/1ty-b1mIPJI87yHAxNZSCeTcJQATcjx7w/view?usp=drive_link"><strong>Live Demo</strong></a></td> <td style="vertical-align: top;">Works with webcam, YouTube URL, or video file</td> </tr> <tr> <td style="vertical-align: top;">💾 <a href="https://www.kaggle.com/datasets/jasraj312/guitar-chords-fretboard-crop-dataset-5-classes"><strong>Dataset</strong></a></td> <td style="vertical-align: top;">1,619 annotated fretboard images across 5 chord classes, multiple players and conditions</td> </tr> <tr> <td style="vertical-align: top;">💻 <a href="https://github.com/jasraj-jsa/Exploring-Guitar-Chord-Recognition-with-Vision-Models.git"><strong>Code</strong></a></td> <td style="vertical-align: top;">Includes training, inference, and live demo scripts</td> </tr> </tbody> </table> ::: --- **Team Contribution** All three of us contributed to the entire project. We shared the dataset collection and annotation equally. Two of us filmed self-recorded chord examples under different conditions. All three helped source and label YouTube frames. We developed the model structure, training process, and ablation experiments together. The live demo, inference features (EMA smoothing, entropy rejection), and blog write-up were divided among the team. --- ## 2. Dataset A major challenge in this project was finding a dataset with enough real-world variation. At first, we used a large public dataset from Roboflow with over 13,000 images. However, it only featured three different people and very few different backgrounds, so our model memorized the backgrounds and specific users instead of learning the chords. To fix this, we compiled our own diverse dataset from miscellaneous sources. 
We gathered **1041 images** for training and **578** for validation by combining an existing GitHub dataset with a subset of the original Roboflow images; roughly one-third of the training set came from YouTube tutorials, concert videos, and our own photos. Finally, we created a strict test set of **218 images**. To make sure our model truly works in real life, we took these test images from completely different YouTube videos and recorded our own photos using extreme angles and lighting conditions. ### 2.1 Data Pre-processing We introduce a SampleWiseCenter transform, which subtracts the mean pixel intensity from each individual image, as a replacement for ImageNet normalization. This operation removes global brightness differences between images, forcing the model to rely on structural patterns rather than absolute intensity values. As a result, the model becomes less sensitive to exposure variations and focuses on features that are consistent across images. Finally, to match the input requirements of our classification network, we resize all images to a fixed square of 299x299 pixels. ### 2.2 Data Augmentations <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/rysRhvxnbx.png" style="width: 100%; display: block; margin: 0 auto;"> <p><em>Figure 2: Augmentation introduces the viewpoint and lighting variety absent from the raw data, which helps reduce overfitting.</em></p> <br/> </div> Data augmentation artificially increases training diversity by applying label-preserving transformations, improving generalization to unseen data. To reduce sensitivity to viewpoint and lighting variations, we apply a compact augmentation pipeline. Using PyTorch's transforms, we apply random affine transformations (rotations up to 15 degrees and translations up to 5%) to simulate different camera angles. Color jittering adjusts brightness, contrast, and saturation by a factor of 0.2 to mimic lighting changes. 
This pipeline helps the model handle viewpoint and illumination changes, directly supporting better generalization. ## 3. Proposed Method ### 3.1 What We Tried First: Classical Approach Before turning to deep learning, we explored whether classical CV techniques could handle fretboard detection, since they provide a fast baseline that requires no training data. #### Our approach: * Image preprocessing: converting to grayscale and applying Gaussian blur. * Applying Canny edge detection to extract fret and string boundaries. * Passing the resulting edge map to the Hough Line Transform to convert discrete edge pixels into continuous line candidates for string positions. * Exposing parameters such as the Hough thresholds as interactive sliders for fine-tuning the detector. #### Results & Insights: <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/ryiU14ZnWl.png" style="width: 100%; display: block; margin: 0 auto;"> <p><em>Figure 3: Upper strings are detected reliably; lower strings are lost to background noise, motivating the learned detector.</em></p> <br/> </div> The upper strings are detected reliably in clean conditions. However, the lower strings produce too many false edges for the Hough transform to handle because of nearby noise (the fretting hand, clothing). The method also cannot distinguish a guitar string from a visually similar object such as an earphone cable. Hence, we moved on to a learned detector. ### 3.2 Finding the Right Input Representation After the classical approach, we ran three further experiments before settling on our final approach. First, we used MediaPipe to crop only the hand region. However, this approach failed to distinguish transposed chords, since identical hand shapes appear at different fretboard positions (resulting in ≈50% accuracy for three classes). Next, we tried using full guitar images, but the background noise made classification difficult. 
We then experimented with a U-Net–based fretboard detector, but it struggled to reliably capture the full extent of the fretboard. <div style="text-align: center;"> <div style="display: flex; justify-content: center; align-items: center; gap: 15px; width: 100%;"> <img src="https://hackmd.io/_uploads/r1NMhPZnbg.png" style="width: 38%; height: auto;"> <img src="https://hackmd.io/_uploads/Hke35oZhZe.jpg" style="width: 41.5%; height: auto;"> </div> <p><em>Figure 4: Initial input experiments. Left: U-Net segmentation fails to capture the full fretboard. Right: Hand-crop produced by MediaPipe. We have no way of knowing its position relative to the start of the fretboard (the nut of the guitar).</em></p> <br/> </div> MediaPipe isolates the hand on the fretboard, but we had to reject hand-cropping because chords are defined by where the fingertips sit relative to the nut and first fret. Some pairs of chords use the exact same hand shape, just at different positions along the fretboard, so a hand-only crop cannot tell them apart. To keep our approach scalable, we needed to embed fretboard awareness into the model. Since the relevant information is localized strictly in the fretboard region, we needed a robust way to isolate it. To solve this, we manually annotated 714 images from our own photos and YouTube videos to train a custom, lightweight [RF-DETR](https://rapid.roboflow.com/p/sq0dzbo0) Nano object detection model via [Roboflow](https://app.roboflow.com/). This detector successfully extracts the fretboard and crops the input images. By restricting the input to the fretboard area, the model is guided to learn chord-relevant visual patterns instead of irrelevant background information. 
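The cropping step downstream of the detector is simple. Here is a minimal sketch assuming the detector returns a single `(x1, y1, x2, y2)` box in pixel coordinates; the 5% margin is an illustrative choice rather than a tuned value, and the detector call itself is left out since it depends on the deployment API.

```python
import numpy as np


def crop_fretboard(frame: np.ndarray, box, margin: float = 0.05) -> np.ndarray:
    """Crop the detected fretboard region, padding the box by a small relative margin."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    # Expand the box slightly so tight detections do not clip fingertips.
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin
    x1 = max(0, int(x1 - dx))
    y1 = max(0, int(y1 - dy))
    x2 = min(w, int(x2 + dx))
    y2 = min(h, int(y2 + dy))
    return frame[y1:y2, x1:x2]
```

In the live pipeline, this crop is what gets resized and passed on to the classifier.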
<div style="text-align: center;"> <div style="display: flex; flex-direction: column; align-items: center; gap: 15px; width: 100%;"> <img src="https://hackmd.io/_uploads/HyiyR0bhWg.jpg" style="width: 85%; height: auto; border-radius: 8px;"> <img src="https://hackmd.io/_uploads/HJfJR0Zh-e.jpg" style="width: 85%; height: auto; border-radius: 8px;"> </div> <p style="margin-top: 15px;"><em>Figure 5: Fretboard isolation using our custom trained RF-DETR Nano object detector. The original image (top) is cropped to strictly the region of interest (bottom).</em></p> <br/> </div> ### 3.3 Final Pipeline <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/B1k13vlhWe.jpg" style="width: 100%; display: block; margin: 0 auto;"> <p><em>Figure 6: Five-stage inference pipeline. From raw webcam frame to chord label with confidence score.</em></p> <br/> </div> We use a pretrained Convolutional Neural Network (CNN) to avoid training from scratch, which requires large datasets. Pretraining provides general visual features such as edges and shapes that transfer to chord recognition. We adopt Inception-ResNet V2 as a feature extractor, which combines multi-scale feature extraction with residual connections. The network produces high-level spatial features from the input image. We replace the original classification head with a task-specific fully connected network. Global average pooling reduces spatial dimensions, after which two fully connected layers with ReLU and dropout learn chord-specific representations. A final linear layer outputs the chord classes. By combining a dedicated fretboard detector with a pretrained backbone and a task-specific head, the system removes two main reasons for previous failures. It gets rid of background noise at the detection stage. It also addresses the issue of limited labeled data by using ImageNet transfer. This makes real-time camera-only chord classification practical for the first time. 
### 3.4 Training Strategy We train the model using the Adam optimizer (learning rate 0.001) and cross-entropy loss for multi-class classification. To control overfitting and stabilize learning, we use a two-stage training scheme. First, the pretrained backbone is frozen and only the classification head is trained. This updates task-specific weights while keeping general visual features fixed. After 25 epochs, all layers are unfrozen and the full network is fine-tuned, allowing features to adapt to chord-specific patterns. Training runs for up to 200 epochs with early stopping based on validation loss. If validation loss does not improve for 25 epochs, training stops and the best model is retained. Freezing limits overfitting in early training, while later fine-tuning improves task-specific feature learning. Early stopping prevents the model from fitting noise in the training data. ### 3.5 Evaluation Setup We evaluate the model using both accuracy and loss on a held-out validation set. Accuracy measures the percentage of correctly predicted chord labels, while loss reflects the confidence of the predictions. To further analyze model behavior, we compute a confusion matrix, which shows how often each chord is confused with others. This analysis reveals systematic errors that are not visible from accuracy alone. ## 4. Results & Insights ### 4.1 Overall Model Performance The model achieves **overall accuracy of 88.07%** across five chords on a test dataset curated from different YouTube tutorials and other online sources. The per-class breakdown reveals where the model is confident and where it struggles. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/rkEb0wxnZe.png" style="width: 100%; display: block; margin: 0 auto;"> <p><em>Figure 7: Em and G are the weakest classes, with confusion patterns consistent with their visually ambiguous finger positions.</em></p> <br/> </div> #### Insights: * C, D, and F perform strongly. 
* Em is the weakest class at 75%, with confusion spread across all the other classes. This is expected, since Em uses two fingers in a sparse hand shape that can resemble partial hand positions of other chords. * G shows the second-lowest accuracy at 82.6%, with the most samples (10.9%) misclassified as C, likely because G and C share a similar finger cluster on the upper frets. ### 4.2 What The Model Is Actually Learning Quantitative accuracy alone does not tell us what the model uses to make its predictions. To make sense of our results, we applied Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize which regions of the fretboard crop most strongly influence each prediction. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/H1NkGEZnbl.png" style="width: 100%; display: block; margin: 0 auto;"> <p><em>Figure 8: Grad-CAM activations across three samples. Warm regions (red/yellow) indicate high model activation; cool regions (blue) indicate low activation. The model consistently attends to finger-fret contact regions rather than the guitar body or background, suggesting it has learned a musically meaningful representation.</em></p> <br/> </div> Across most samples, the activation maps concentrate on the region where the fingers press the strings against the frets. The model is not being influenced by other elements such as the guitar neck, body, or background. This suggests the model has learned a meaningful representation and can therefore generalize well across a diverse dataset. That said, this is not universal. On harder samples, particularly Em and G, the two weakest classes, activations sometimes spread beyond the contact region to the surrounding hand or neck area. This likely contributes to their lower accuracy. ## 5. 
Experiments & Ablation Studies ### 5.1 Ablation: Data augmentation & normalization To understand the impact of our data pre-processing pipeline, we conducted ablation studies by independently removing data augmentation and the sample-wise centering transform from our training process. We evaluated these ablations by tracking the learning curves and final accuracy metrics on our isolated test set. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/SJLphmM3Wg.png" style="width: 100%; display: block; margin: 0 auto;"> <p><em> Figure 9: Accuracy without (left) and with (right) augmentation. Both curves look identical, reaching ~95% validation accuracy. However, the non-augmented model fails our strict test set, demonstrating that validation scores are misleading without real-world variety. </em></p> <br/> </div> As seen in Figure 9, both models quickly reach 100% training accuracy and plateau at approximately 95% validation accuracy. While these metrics suggest strong performance, our initial training and validation sets lack real-world variety, allowing the unaugmented model to simply memorize backgrounds and specific lighting. Consequently, when evaluated on our strict test set featuring completely unseen videos and extreme angles, the accuracy of the model without augmentation falls to 73.39%, as shown in Table 1. This significant drop shows that data augmentation is essential to mitigate overfitting and force the model to learn the actual chords. <div style="text-align: center;"> | Model Configuration | Highest Val Accuracy | Global Test Accuracy | | ------------------- | -------------------- | -------------------- | | Baseline | 95.50% | 88.07% | | Without Augmentation | 95.50% | 73.39% | | Without Sample-wise Centering | 94.98% | 83.94% | <p><em>Table 1: Ablation results. Removing either step causes a substantial drop in test accuracy.</em></p> </div> Standard normalization uses the average brightness of the whole dataset. 
Instead, we use sample-wise centering, which subtracts the average brightness of each individual image. Table 1 shows that going back to standard normalization drops our test accuracy to 83.94%. Centering each image individually makes the model ignore global lighting changes and focus purely on the hand and fretboard shapes. This simple step is critical for handling the unpredictable lighting found in real guitar videos. ### 5.2 Comparison: Transfer learning vs from scratch To see if transfer learning is actually necessary, we tested training our model from an absolute blank slate (pretrained=False). <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/rJx9rKZh-l.png" style="width: 100%; display: block; margin: 0 auto;"> <p><em>Figure 10: Validation loss comparing our pre-trained model against one trained from scratch. The from-scratch model takes significantly more epochs to converge than the baseline.</em></p> <br/> </div> A model starting from scratch has to learn basic visual concepts (like edges and shapes) before it can recognize chords. Because of this, it took 113 epochs to converge, compared to just 41 for our pre-trained baseline. More importantly, **test accuracy dropped to 68.35%**. This shows that for a small dataset, borrowing foundational knowledge from a pre-trained model is essential for both speed and accuracy. ### 5.3 Comparison: ConvNeXt architecture We wanted to test if a high capacity, transformer-inspired architecture like **ConvNeXt** could outperform our Inception-ResNet V2 baseline. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/HyInrt-3We.png" style="width: 100%; display: block; margin: 0 auto;"> <p><em>Figure 11: Validation accuracy comparison. 
The ConvNeXt model severely overfits, showing it is poorly suited for our data scale.</em></p> <br/> </div> During training, we experimented with progressive unfreezing of the ConvNeXt model stages; however, we could not achieve improvements in performance. We also tried a variation of ConvNeXt, InceptionNeXt, but it did not produce improvements either. The ConvNeXt architecture achieved only **56.42% accuracy** on the test set. This poor performance occurs because high-capacity architectures like ConvNeXt are extremely data-hungry. Our training dataset of just 1,619 images is simply too small for it to generalize. Instead of learning the visual patterns of guitar chords, the model used its massive parameter count to instantly overfit and memorize the specific room environments of the training data. This shows that upgrading to a heavier architecture fails without a massive dataset; our Inception-ResNet V2 baseline is a much better fit for this specific task. ### 5.4 Comparison: Left-Handed Generalization Typically, guitar players use their left hand to shape the chord and the right hand to strum or pluck the strings. However, some left-handed guitarists find it easier to shape chords with their right hand. We explored whether our model could generalize to this set of players. To do so, we kept the exact same architecture and simply flipped half of the training images horizontally. Testing on fully-flipped data, the model achieved **80.23% test accuracy**. ## 6. Discussion & Future Work The main finding of this project is that what you feed the model is more important than which model you choose. Our ablations show that removing augmentation or sample-wise centering costs up to 15 points of test accuracy, while swapping in a heavier backbone made things worse rather than better. For small datasets in visually noisy domains, paying close attention to the input representation is more valuable than searching for the best architecture. 
The approach depends on the availability and quality of the dataset. Limited diversity in chord examples, players, and backgrounds directly reduces the ability of the model to generalize across real-world conditions. In addition, the fretboard detection step introduces a dependency on an external model, and errors in detection can propagate to the classifier. **Future Work:** * Expanding to more chord classes, particularly barre chords, since they require different finger patterns. * Rather than classifying each frame independently, feeding a short sequence of frames to an LSTM or a lightweight transformer could smooth predictions at chord transitions. * A multimodal approach combining vision and audio could resolve issues that neither modality can handle on its own, such as muted strings or open-string variants of the same chord shape. * Song-level feedback: a user inputs a song's chord progression and the system tracks whether they are playing the right chords in the right order. * Scaling to concert and live-show videos, where lighting and camera angles can vary a lot. ## 7. References (1) Canny Edge Detection [[OpenCV](https://docs.opencv.org/4.x/da/d22/tutorial_py_canny.html)] (2) Hough Line Transform [[OpenCV](https://docs.opencv.org/3.4/d9/db0/tutorial_hough_lines.html)] (3) Kristian, Y., Zaman, L., Tenoyo, M., & Jodhinata, A. (2023). Advancing guitar chord recognition: A visual method based on deep convolutional neural networks and deep transfer learning. *ECTI Transactions on Computer and Information Technology, 17*(2), 235–246. [[ResearchGate](https://www.researchgate.net/publication/380532263_Advancing_Guitar_Chord_Recognition_A_Visual_Method_Based_on_Deep_Convolutional_Neural_Networks_and_Deep_Transfer_Learning)] (4) Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2019). Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 
*International Journal of Computer Vision, 128*(2), 336–359. [[Springer](https://doi.org/10.1007/s11263-019-01228-7)] (5) Szegedy, C., et al. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. AAAI. [[arXiv](https://doi.org/10.48550/arXiv.1602.07261)]