# Recognizing Catan Boards with Machine Learning
*A project done for the DSAIT4125 Computer Vision seminar at TU Delft*.
**Authors:** Vasil Dakov (v.d.dakov@student.tudelft.nl), Alperen Guncan (a.i.guncan@student.tudelft.nl), Elena-Oana Milchi (e.o.milchi@student.tudelft.nl)
**Repository link:** https://github.com/vdakov/cv-board-game-detection
## The Pitch
Board game state recognition is a task with applications in strategy analysis, competitions, etc. It may be employed both for visualization/broadcasting and for AI-assisted/driven gameplay. **Catan**, a game that has seen a sudden rise in popularity, exhibits nice geometric and visual properties that make it approachable for a simple pipeline using deep learning computer vision methods.
### The assumptions:
- All tiles are visually distinct (even between editions).
- The board is a standard Catan board. More specifically, it is symmetric, and the tiles are *regular* hexagons (equal sides and angles, with interior angles summing to $720^{\circ}$).
- Viewing angles $\theta$ vary within $[0, 30]$ degrees.
- Numeric labels of tiles are *properly* placed, i.e. near the center of the tile.
- **No settlements, ports, or roads are considered.** This is a simplified version of the game, and could be considered "the initial game state".
## How we get our data: synthetic boards
One of the main challenges of this project is the lack of general datasets. We implement a script to create synthetic Catan boards for training and evaluation.
We first modify the code of an [online Catan board generator](https://jkirschner.github.io/catan-randomizer/) to support multiple art styles for the tiles. Then, we put the board over [any image of a table](https://www.kaggle.com/datasets/dataclusterlabs/table-image-dataset-indoor-object-detection), to replicate the background of a real-taken image. This gives us a board in a **top-down perspective**.
<div id="images" style="text-align:center">
<img src="https://hackmd.io/_uploads/rkkZ4_ZA1x.png" height="250">
<em>
An example of a synthetically generated data point
</em>
</div>
Finally, we apply a **perspective warping**. It can be expressed mathematically as a sequence of affine transformations in homogeneous coordinates. Listed below are the standard rotation $R$, scaling $S$, and translation $T$ matrices.
$$R = \begin{bmatrix}\cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1\end{bmatrix} \ S = \begin{bmatrix}s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1\end{bmatrix} \ T = \begin{bmatrix}1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1\end{bmatrix}$$
Collapsing them into a single matrix gives what is called a **homography matrix**. It can encode any combination of rotation, scaling, and translation.
<div align="center">
$$
\mathbf{H} \equiv \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix} \equiv RST\ldots \quad \forall\, R, S, T \in \mathbb{R}^{3 \times 3}
$$
</div>
This expression shows that **any sequence of 2D affine transformations**, when represented in homogeneous coordinates, results in a **homography matrix** $\mathbf{H} \in \mathbb{R}^{3 \times 3}$.
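As an illustration, composing the three matrices in homogeneous coordinates and applying the result to a point takes only a few lines. This is a minimal NumPy sketch; the function names are ours, not the project's:

```python
import numpy as np

def make_homography(theta, sx, sy, tx, ty):
    """Compose rotation, scaling, and translation (in homogeneous
    coordinates) into a single 3x3 matrix H = R @ S @ T."""
    R = np.array([[np.cos(theta), -np.sin(theta), 0],
                  [np.sin(theta),  np.cos(theta), 0],
                  [0, 0, 1]])
    S = np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1]])
    T = np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]])
    return R @ S @ T

def warp_point(H, x, y):
    """Apply H to a 2D point via homogeneous coordinates."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]  # divide out the scale factor
```

Any point mapped through the composed matrix lands where the individual transformations would have taken it, which is what makes collapsing them into one $\mathbf H$ valid.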
**In the mining script**, we generate arbitrary perspective warping matrices and apply them to the original images (with background mirroring added).
## Pipeline

*Yes, the visualizer does not match with the input - the images were taken separately. We show proper examples [below](#Bringing-it-all-together).*
The pipeline consists of five steps. The goal is to be able to input any image of a Catan board and output a game state with relative positioning of tiles, along with their types and number labels. The steps are:
1. **Perspective Correction** - Fixing the orientation to a top-down perspective.
2. **Object Detection** - Finding a bounding box around the board.
3. **Board Segmentation** - Extracting the tiles.
4. **Classification** - Running the tiles through classification models.
5. **Catan board assembly** - Post-processing into the format of a Catan board.
### Perspective correction
As shown, a perspective change can be represented as a sequence of linear transformations from a top-down point of view. This transformation can be inverted via the matrix $\mathbf H^{-1}$ (provided $\mathbf H$ is non-singular).
We want a **general way to arrive at this top-down perspective.** The benefits are that the hexagons become regular and the layout standardized. Non-deep-learning methods, such as the Direct Linear Transform (DLT) [5], rely on already having corresponding pairs of points between the two perspectives, which we do not have.
Our approach is a **modified** version of "Deep Image Homography Estimation" [1]: a VGG-like CNN whose output is a 9-value vector representing $\mathbf H$.
<div id="images" style="text-align:center">
<img src="https://hackmd.io/_uploads/rky2vaZR1l.png" height="300">
</div>
<div id="images" style="text-align:center">
<em>
The homography estimation network. Source: https://arxiv.org/pdf/1606.03798 by DeTone et al.
</em>
</div>
Our modification is in the **loss function.** We combine regular Mean Squared Error with a photometric loss [6], which directly measures how well the perspective of the original image $I$ is reconstructed via $\mathbf H^{-1}$.
$$L(\mathbf H, \widehat{\mathbf H}, I ) = \lambda_1 \cdot \text{MSE} (\mathbf H, \widehat{\mathbf H}) + \lambda_2 \cdot \text{MSE} (\mathbf H^{-1}I, \widehat{\mathbf H} ^{-1}I )$$
<div style="text-align: center;">
<em>We set both $\lambda_1$ and $\lambda_2$ to 1.</em>
</div>
The motivation is that small changes to the matrix can lead to large changes in the output image. By including the image reconstruction term, small mistakes in either the matrix or the image reverberate strongly back through the gradients.
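The combined loss can be sketched as follows. This is a NumPy sketch for illustration; in the actual pipeline it runs inside the training framework, and the two warped images $\mathbf H^{-1}I$ and $\widehat{\mathbf H}^{-1}I$ come from an inverse perspective warp, which we omit here:

```python
import numpy as np

def mse(a, b):
    """Plain mean squared error between two arrays."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def combined_loss(H_gt, H_pred, img_gt_warped, img_pred_warped,
                  lambda1=1.0, lambda2=1.0):
    """Sum of the matrix-space MSE and the photometric MSE between the
    two reconstructions of the original image.  The `img_*_warped`
    arguments stand for H_gt^{-1} I and H_pred^{-1} I, produced by
    whatever inverse-warping routine the pipeline uses."""
    return lambda1 * mse(H_gt, H_pred) + lambda2 * mse(img_gt_warped, img_pred_warped)
```

With both lambdas at 1, a small matrix error that causes a large image mismatch still produces a large loss value, which is exactly the behavior the modification aims for.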
### Board Detection
Localizing the Catan board is a standard object detection task, so we chose to fine-tune the [YOLOv8](https://yolov8.com/) detector [2]. The mining script provides the bounding boxes of the Catan boards; this is how the network gains an understanding of both the perspective and the boards themselves.

### Board segmentation
At the end of this step, we would like to have all tiles cropped and separated for classification. We use [SAM (Segment Anything Model)](https://segment-anything.com/), a SOTA segmentation model [3]. A challenge arises with fine-tuning this model to our use case. Since SAM is pretrained for zero-shot segmentation, it lacks *semantic* understanding of what a Catan tile truly is. For instance, it may split a wheat tile into multiple sections if the contrast/color palette differs sufficiently, segment the number label separately, or, under some lighting conditions, merge or skip tiles.
Due to the difficulty of fine-tuning the model directly, especially with limited data, we **turn to an approach of segmenting everything, and then using unsupervised learning and post-processing.**
First, we run **DBSCAN** clustering [7] to merge neighboring segments by area, the ideal output being a grid of hexagons that matches the specifications of a Catan board.
Afterwards, we extract the **contour information** from these segments, more specifically the contours that define each hex tile. During extraction, we make sure we only store contours of proper size, with a proper enclosed area, and angles that resemble a hexagon. In short, we apply heuristics based on the expected proportions for a Catan board tile. The contour information is then used to crop the input image into tiles.
<div id="images" style="text-align:center">
<img src="https://hackmd.io/_uploads/ryPqJ1MCyl.png" height="250">
<em style="display:block;">
Intermediate results of the segmentation step. From left to right: Original image, SAM output, merged segments based on DBScan, contour information.
</em>
</div>
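The contour heuristics can be sketched roughly as follows. This is an illustrative check with tolerances chosen for the example rather than the tuned pipeline values:

```python
import math

def looks_like_hexagon(pts, side_tol=0.2, area_tol=0.2):
    """Heuristic check that a closed contour (a list of (x, y) vertices)
    resembles a regular hexagon: six vertices, roughly equal side
    lengths, and an enclosed area close to 3*sqrt(3)/2 * s^2."""
    if len(pts) != 6:
        return False
    sides = [math.dist(pts[i], pts[(i + 1) % 6]) for i in range(6)]
    s = sum(sides) / 6
    if any(abs(d - s) > side_tol * s for d in sides):
        return False
    # Shoelace formula for the enclosed area
    area = abs(sum(pts[i][0] * pts[(i + 1) % 6][1]
                   - pts[(i + 1) % 6][0] * pts[i][1] for i in range(6))) / 2
    expected = 3 * math.sqrt(3) / 2 * s ** 2
    return abs(area - expected) < area_tol * expected
```

Contours failing such a check (too few vertices, skewed sides, wrong area) would be discarded before cropping, which is how superfluous SAM segments get filtered out.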
### Tile classification
A Catan game tile consists of two elements:
- One **number** plate
- The surrounding area which determines the **type** of the tile
There are five resource tiles in Catan: sheep/wool, brick/clay, wood/lumber, wheat/grain, iron/ore. These tiles are numbered. The default (or ‘desert’) tile has no number plate on it.
<div id="images" style="text-align:center">
<img src="https://hackmd.io/_uploads/SJqLEhC6ke.png" height="250">
<img src="https://hackmd.io/_uploads/Skq8VhApJl.png" height="250">
<em style="display:block;">
Examples of Catan tiles. On the left: a 'wheat' tile numbered 11. On the right: a 'desert' tile - not numbered. Images taken from https://www.generatecatanboard.com/scenario/catan
</em>
</div>
We split tile classification into two separate tasks: tile type and tile number. Combining them into a single task would yield too many class combinations and significantly impact accuracy.
For tile classification, we resolve the data scarcity issue in two ways:
#### Dataset generation via board mining
We can use the segmented output of the [previous pipeline step](#Board-segmentation) with our [synthetic board](#How-we-get-our-data:-synthetic-boards), and perform a mask comparison between each tile image and reference images in order to discern the type of the tile.

#### Tile generation from the ground up
Alternatively, we can programmatically compile images of Catan tiles by combining different tile types with different backgrounds and number plates. We found this method highly customizable, as it allows merging tiles and backgrounds of any art style.

In the end, we chose the [first method](#Dataset-generation-via-board-mining), since it fits our pipeline better. Moreover, this method ensures more realistic cropped tile images, as produced by the pipeline, with no potential generation artifacts. We scale the images to 100 × 100 pixels to keep training tractable.
For **tile type classification**, we decided to build a simple CNN classifier. All board tiles are distinct from each other, and should be simple enough to tell apart by a small, generalizable model.
**We then proceed with number prediction**. Our first attempt was similarly a CNN; however, the model showed poor performance, likely due to lack of data, unique fonts, small input images, etc. We thus opted for the [Tesseract Optical Character Recognition (OCR) engine](https://github.com/tesseract-ocr/tesseract) [4]. We zoom into the number label, then remove background clutter by thresholding the pixel values to keep only very dark or bright red colors. The final label defaults to 0 if no number can be read.
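The clutter-removal step before OCR can be sketched like this; the threshold values here are illustrative, not the tuned pipeline values:

```python
import numpy as np

def binarize_number_plate(rgb, dark_max=60, red_min=150, red_gap=60):
    """Strip background clutter from a cropped number plate: keep only
    near-black pixels (printed digits) and strongly red pixels (the
    red 6/8 labels), mapping them to black on a white background."""
    rgb = np.asarray(rgb)
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    dark = (r < dark_max) & (g < dark_max) & (b < dark_max)
    red = (r > red_min) & (r - g > red_gap) & (r - b > red_gap)
    out = np.full(rgb.shape[:2], 255, dtype=np.uint8)
    out[dark | red] = 0  # ink pixels become black
    return out
```

The resulting high-contrast black-on-white image is what gets handed to Tesseract, which performs much better without the tile texture behind the digits.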
### Catan board assembly
At the end, we would like to transform the intermediate results into a format that is easy to visualize and grasp. For each tile we have found, we store its id, and its identified hex type and number label from the [classification step](#Tile-classification).
<div id="images" style="text-align:center">
<img src="https://hackmd.io/_uploads/Bkuzr6b0Jg.png" height="250">
<em>
Showcasing the visualizer with a synthetic board sample
</em>
</div>
An important consideration is the case where the intermediate results are incomplete, e.g. some tiles were not detected, the image is obstructed, or the classifier fails. We skip such hexes in our visualizer. This raises a new challenge: identifying the proper locations of hexes when not all are provided.
We resolve this by using the centers of mass of the identified contour clusters that form hexagons, calculated in the [segmentation step](#Board-segmentation). Since the board has already been corrected for perspective, the center relations follow a regular (top-down) hexagonal grid. We also store the board in JSON format, as a map from center positions to (id, hex type, number label) tuples. This file can then be used by any future engine or visualizer.
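A minimal sketch of the JSON serialization follows; the field names and the "x,y" key encoding are illustrative, since JSON keys must be strings:

```python
import json

def save_board(centers_to_tiles, path):
    """Serialize the assembled board: a map from tile center (x, y) to
    an (id, hex type, number label) tuple.  Centers are encoded as
    "x,y" strings because JSON object keys cannot be tuples."""
    payload = {
        f"{x},{y}": {"id": tid, "type": hex_type, "number": number}
        for (x, y), (tid, hex_type, number) in centers_to_tiles.items()
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```

Storing centers rather than grid indices keeps the format robust to missing tiles: a consumer can snap the surviving centers onto the regular hexagonal grid and infer which positions are empty.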
## Experimental setup
This section discusses the evaluation of each component separately. **All components were evaluated on a test set - either one made specifically for verification, or the held-out split from training.**
### Perspective correction
Evaluating the homography estimation is not trivial due to the specificity of the task. The values of the original homography matrix $\mathbf H$ depend on both the orientation and the size of the input images. Moreover, small changes in the matrix result in large differences in the warped image. We therefore focus on training convergence and on metrics robust to the scale of the output values.
**Training setup**:
- The original network proposed by DeTone et al. [1]
- A dataset of 10 000 perspective images
- 100 epochs + the modified loss
- Adam Optimizer
- A 70-15-15 training-validation-test split
Two **metrics** were used:
- Coefficient of determination: $R^2 = 1 - \frac{\text{Residual Sum Of Squares}}{\text{Total Sum of Squares}}$
**Why:** It is a standard metric in regression tasks, indicating how much of the variance in the data our model can explain.
- Mean Average Corner Error:
$$\begin{aligned}
P &= \text{point } (x, y) \\
n &= \text{number of corners in the bounding shape} \\
N &= \text{number of images} \\
\text{ACE} &= \frac{1}{n} \sum_{i=1}^{n} \left\| P_{\text{gt},i} - P_{\text{pred},i} \right\|_2 \\
\text{mACE} &= \frac{1}{N} \sum_{I \in \text{images}} \frac{1}{n} \sum_{i=1}^{n} \left\| P_{\text{gt},i} - P_{\text{pred},i} \right\|_2
\end{aligned}$$
**Why:** This is the original metric proposed by DeTone et al. [1].
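Both metrics are straightforward to compute from the definitions above; a NumPy sketch:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS/TSS."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / tss

def mean_avg_corner_error(corners_gt, corners_pred):
    """mACE: mean over images of the average Euclidean distance between
    ground-truth and predicted corners.  Inputs are (N, n, 2) arrays."""
    diffs = np.asarray(corners_gt, float) - np.asarray(corners_pred, float)
    dists = np.linalg.norm(diffs, axis=-1)   # (N, n) per-corner errors
    return float(dists.mean(axis=-1).mean())  # ACE per image, then mean
```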
The loss and convergence of the final network are also shared. **All tests were done on synthetic data, as there was no efficient way to obtain homography matrices on real examples.**
### Board detection
We can reuse the infrastructure for evaluation that YOLO provides. Training-validation splits are handled by YOLO as well.
**Training setup**:
- 10 epochs
- Learning rate $\alpha = 0.01$
- Data augmentations including rotations, scalings, shears, HSV adjustments, etc.
The **metrics** we examine are:
- Box Loss/Epochs
- Precision, Recall, mAP50-95
### Board segmentation
Since SAM is pre-trained, this component requires special evaluation. We can still gain some insights on whether the domain-specific assumptions are valid.
For evaluation we choose two metrics, both plotted via histograms:
- **Number of tiles detected** - A standard Catan board has 19 tiles, so the expectation is to detect all 19 tiles, or slightly fewer.
- **Average size (with respect to the image)** - Due to our geometric assumptions, we should expect little variation between tiles, and a constant percentage with respect to the full image. This partially verifies our object detection too.
### Tile classification
We again treat tile types and tile numbers separately. In both cases, as a standard multiclass classification task, we used the following metrics:
- Accuracy - An overall idea of performance.
- Mean Average Precision (MAP) - To gain an idea of the misclassifications per class.
- Per-class ROC/AUC curves, for similar reasons as MAP.
**Training Setup:**
- One-layer CNN (kernel=3, stride=1, ReLU)
- Adam optimizer ($\alpha = 5 \times 10^{-5}$)
- Sparse categorical cross entropy loss
- Dropout for regularization
- Training data augmentation: random flips & rotations
#### Tile type detection
We used a train-validation-test split of 70%-20%-10% for our dataset, using the [first method](#Dataset-generation-via-board-mining).
#### Number detection
To test our number classification pipeline, we compiled a dataset of tiles using the [second method](#tile-generation-from-the-ground-up), which allowed for labeling based on the tile numbers.
For both tests we plot the training performance and the aforementioned metrics.
## Results
This section presents the individual component results from the pipeline, with a brief discussion after each one.
### Perspective correction
<div id="images" style="text-align:center">
<img src="https://hackmd.io/_uploads/BkdnKeMR1e.png" height="300">
</div>
From the training loss and its quick convergence, we can conclude that **the modified loss function is reasonable for the task at hand.**
Below are the results of our metrics. Both showcase the robustness of our homography predictions: the high $R^2$ indicates the remaining variance can be attributed to image variance, while an average corner error of $\approx 9.5$ pixels shows the predictions closely match the true transformation.
| Test set $R^2$ | Test set mACE |
| :----------: | :------------: |
| 0.998963 | 9.58888 |
### Board Detection

Judging by all plots, board detection turned out to be a simple task: the training loss decreases steadily, making the model more and more robust, while the misclassification rate stays low.
From these plots, we can conclude either that YOLO is very well suited to the task, or that the synthetic data we feed it is too easy.
### Board Segmentation
<div id="images" style="text-align:center">
<img src="https://hackmd.io/_uploads/SkzbUGMRke.png" height="250">
<img src="https://hackmd.io/_uploads/SyC4LfG0yg.png" height="250">
</div>
It is clear segmentation is the bottleneck of the pipeline, in terms of performance/accuracy.
The data shows that most tiles are recognized, but rarely all of them. The model never detects more than 19 tiles for any instance, indicating that all superfluous contours are skipped. However, 4 out of 100 instances had no tiles detected, possibly due to poor proportions. Furthermore, most of the detected hexes cover 1.85-1.9% of the original image, which is geometrically sound.
| Mean tile count | Mean tile size |
| -------- | -------- |
| 15.97 | 1.81% |
An empirical check of the generated figures also suggests that segmentation performs well on synthetic data, though not perfectly - at least not well enough for consistent board recognition, given that it rarely finds all tiles.
### Tile classification
#### Tile type detection
The plot below shows the training and validation curves of our model. We find these results satisfactory, as they show that the model converges at an accuracy higher than 95%. Moreover, the curves show that the model does not overfit on the training set.

The table below shows some results evaluated on the test set. We find them satisfactory as they indicate that our CNN was able to correctly classify most of the tiles.
| Test set accuracy | Test set MAP | Test set AUC |
| -------- | -------- | -------- |
| 97% | 97% | 99% |
Lastly, we plotted the ROC curves for all tile types in the figure below. The plot shows that the model perfectly classifies all classes except desert tiles. This can be explained by their similarities in color palettes to grain tiles, causing misclassifications.

#### Number detection
We notice that the scores for number detection are worse than those of our tile type model, so we plot per-class ROC curves for further investigation.
| Test set accuracy | Test set MAP | Test set AUC |
| ----------------- | ------------ | ------------ |
| 90% | 86% | 93% |
The ROC curves below provide insight into the lower AUC score. Our method correctly classifies almost all numbers, except for 9, 11, and tiles with no numbers. We assume that the number 9 is confused with 6. Interestingly, the model classifies the number 11 no better than random labeling. We expect this to negatively impact our pipeline as a whole.

## Bringing it all together
In this section we present how the full pipeline works. We have sampled both a real (assembled by us) Catan board, and a synthetic one, and show intermediate results.
*Unfortunately, there has been no large-scale end-to-end test and metrics extracted, mostly due to a lack of data. With more time, the team could generate more adversarial synthetic examples. Nevertheless, we believe this does illustrate the proof of concept of our pipeline.*
If you wish to try out your own Catan image, retrieve the project from [our GitHub repository](https://github.com/vdakov/cv-board-game-detection) and follow the instructions in the README for downloading the models.
### Synthetic image

<p>
<img src="https://hackmd.io/_uploads/B1sr-wM01e.png" width="45%" />
<img src="https://hackmd.io/_uploads/H1IiWPfA1e.png" width="54%" />
</p>
**Discussion:** The model performs well on the synthetic image. We have 18/19 tiles detected and no misclassifications. This is to be expected, as it reflects the training data throughout our pipeline. The geometric and perspective assumptions also seem most accurate here, since the perturbation is an exact homography transformation, which the segmentation step handles well.
### Real image

<p>
<img src="https://hackmd.io/_uploads/r1EXC8G0kl.png" width="45%" />
<img src="https://hackmd.io/_uploads/rkd0pIfC1x.png" width="54%" />
</p>
**Discussion:** The model performs significantly worse on real-life data. We have 12/19 tiles detected, and only 4 properly classified. We can already see segmentation errors, with 3 tiles not being segmented and contoured correctly. Moreover, only 2 tiles have correct number labels. The classification models require more training with this art style specifically to improve accuracy.
## Future work
Looking at our end-to-end testing, we can clearly see that the two most lacking components are segmentation and classification.
Segmentation is a clear bottleneck in terms of its "theoretical" performance, both due to its complexity and the difficulty of fine-tuning. We believe better, domain-specific solutions exist for the task. A **custom segmentation** network is definitely doable with more time, data, and computational resources. It would also have interesting properties, searching for a non-overlapping set of exactly $N$ masks.
Classification, on the other hand, is the current "bottleneck" in terms of accuracy. Even though experiments indicated no overfitting, we observe a stark contrast between performance on synthetic and real-life samples. With the synthetic sample, all types except one are identified correctly, yet most number labels fail. Once real-life data is involved, nearly all tile types and number labels are incorrect, even though most tiles are found based on the contour image and the board assembly. Heavier data collection is necessary, as well as an approach that makes the network learn the concepts of "wool", "lumber", etc.
The number label classification performs much worse than expected. One extra challenge we did not consider was that the numbers may be rotated unpredictably - this is an issue since Tesseract OCR is **not** rotation invariant, and expects horizontal numbers. Furthermore, the classification component shrinks images to only 100x100 pixels, even if a tile is much larger, which makes it more challenging for the OCR algorithm.
## Conclusion
The pipeline-component structure has many benefits. The modular design allows for separation of concerns, and individual training/evaluation. It also entails explicit intermediate results that are simple to visualize and analyze for errors. Lastly, the design is quite flexible - we can easily swap components, or even add/remove others (e.g. removing the perspective warping in certain use-cases).
However, the pipeline design also has a glaring issue: errors carry over and compound across components. This is clear from the poor end-to-end performance.
With more data and computational resources for training, we expect significantly better results. We had to make many simplifications due to our limited computational capabilities, for example shrinking images to make forward- and backward passes tractable. The small image sizes especially affected segmentation and tile classification, severely reducing accuracy for real-life examples.
---
Our pipeline stands as a successful proof-of-concept, demonstrating considerable potential through the strength of its individual modules.
It showcases the synergy between black-box deep learning techniques and heuristic-driven problem-solving. We hope this work is a foundation for future collaborative development and refinement, contributing meaningfully to the field of computer vision.
---
## References
[1] D. DeTone, T. Malisiewicz, and A. Rabinovich, "Deep image homography estimation," *arXiv preprint arXiv:1606.03798*, 2016 - https://arxiv.org/abs/1606.03798
[2] J. Redmon, S. Divvala, R. Girshick, et al., "You only look once: Unified, real-time object detection," *arXiv preprint arXiv:1506.02640*, 2015 - https://arxiv.org/pdf/1506.02640
[3] A. Kirillov, E. Mintun, N. Ravi, et al., "Segment anything," *arXiv preprint arXiv:2304.02643*, 2023 - https://arxiv.org/pdf/2304.02643
[4] R. Smith, "An overview of the Tesseract OCR engine," in *Proc. 9th Int. Conf. Document Analysis and Recognition (ICDAR)*, vol. 2, Sep. 2007, pp. 629–633 - https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf
[5] E. Dubrofsky, "Homography estimation" - https://www.cs.ubc.ca/sites/default/files/2022-12/Dubrofsky_Elan.pdf
[6] C. Shu et al., "Feature-metric loss for self-supervised learning of depth and egomotion," *ECCV*, 2020 - https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123640562.pdf
[7] M. Ester et al., "A density-based algorithm for discovering clusters in large spatial databases with noise," in *Proc. KDD*, 1996 - https://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf