# Project Blog: Creating a Controlled Dataset to Test Shape vs. Texture Bias in CNNs

**Author:** Yongcheng Huang (5560950)
**Assignment:** Toy problem (control dataset)
**Course:** FunMDL (DSAIT4205)

## 1. Motivation and Background

![image](https://hackmd.io/_uploads/B1GRoD2Qel.png)

In the last decade, Convolutional Neural Networks (CNNs) have achieved monumental success in computer vision. A common intuition is that these networks recognize objects because they learn to understand their "shape," much like humans do. For example, when we see the outline of a cat, we recognize it as a "cat," regardless of whether it is a black cat, a white cat, or a tabby.

However, a compelling hypothesis was introduced by Geirhos et al. (2018) in their seminal work, *"ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness."* They argue that the decision-making process of powerful models trained on ImageNet may be fundamentally different from that of humans. Specifically, they posit that **CNNs are strongly biased towards relying on local object texture rather than global object shape for classification**. For instance, instead of identifying the shape of a "cat," a model might primarily be recognizing its "furry" texture.

This hypothesis is critical for our understanding of and trust in deep learning models. If a model relies on texture as a "shortcut," it can become fragile in new scenarios where textures are inconsistent with its training data. For example, a model that primarily uses texture to identify an "elephant" might misclassify a soccer ball covered in elephant-skin texture as an "elephant."

To test this core hypothesis precisely and controllably, we designed and generated a novel control dataset called the **"Shape-Texture Cue Conflict Dataset."** Its defining characteristic is that the object's shape and texture in each image are intentionally set to contradict each other. This forces the model to choose between shape and texture; by observing its classification results, we can directly quantify its preference for either cue.

**Relevant Paper:**

* Geirhos, R., et al. (2018). *ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.* ICLR 2019. https://arxiv.org/abs/1811.12231

## 2. Control Dataset Description

The "Shape-Texture Cue Conflict Dataset" is a synthetic dataset designed to decouple and recombine the shape and texture of objects. Each image in the dataset combines the **shape** of one category with the **texture** of a completely different category.

### Dataset Composition

We selected several common and visually distinct classes from ImageNet, such as "cat," "elephant," and "car."

* **Shape Source:** We extract the outlines (silhouettes) of these objects.
* **Texture Source:** We extract the characteristic textures of other objects.

We then combine them in a conflicting manner.

### Data Sample Examples

Here are some typical examples from our dataset:

* **Sample 1:**
    * **Shape:** Plane
    * **Texture:** Brick
    * **Image Description:** An image with the clear silhouette of a "plane" whose surface is entirely covered with "brick" texture.
    * **Expected Model Judgment:**
        * If the model outputs "brick," it relies on **texture**.
        * If the model outputs "plane," it relies on **shape**.

![image](https://hackmd.io/_uploads/ryNBqwhQxx.png)
*Figure 1: A sample from the Shape-Texture Cue Conflict Dataset. The shape is a "plane," and the texture is "brick." A ResNet-50 classifies this image as "brick wall," demonstrating a texture bias.*

* **Sample 2:**
    * **Shape:** Car
    * **Texture:** Desert
    * **Image Description:** An object with the silhouette of a "car" but with a body covered in "desert" texture.

![image](https://hackmd.io/_uploads/HyEFqDnQeg.png)

* **Sample 3:**
    * **Shape:** Cat
    * **Texture:** Cloud
    * **Image Description:** An object with the silhouette of a "cat" but with the texture of a "cloud."

![image](https://hackmd.io/_uploads/HklfiwhXxg.png)

In this way, we create two "ground truth" labels for each image: a **shape label** and a **texture label**. When a standard pretrained model (e.g., ResNet-50) classifies these images, its prediction directly reveals its decision-making bias.

## 3. Dataset Generation Method

We use a programmatic approach to generate this dataset, ensuring reproducibility and scalability. The entire process consists of three main steps: shape extraction, texture extraction, and conflict synthesis. Below is the GUI-based generator designed for data generation.

![image](https://hackmd.io/_uploads/Bkw-tD2mle.png)

### Step 1: Extract Object Shape

To obtain clean object silhouettes, we can use a pre-trained instance segmentation or salient object detection model.

* **Input:** An original image from a dataset like ImageNet (e.g., a picture of a cat).
* **Process:** Use a model to generate a binary mask of the object. In this mask, the object area is white (pixel value 255) and the background is black (pixel value 0). A similar result can be obtained by converting images directly to silhouettes.
* **Output:** A "shape template" containing only the object's outline.

![image](https://hackmd.io/_uploads/SyRLjDhXle.png)

### Step 2: Extract Style Texture

We extract the texture from another image. This does not require a complex model; the original image itself is sufficient.

* **Input:** An image with a representative texture (e.g., a close-up of elephant skin).
* **Output:** A "texture source" image.

### Step 3: Synthesize Shape and Texture Conflict

This is the most critical step. We use the "shape template" and "texture source" from the previous steps to generate the final conflict image. This can be achieved either with style transfer or by cropping a texture image with a shape mask, as sketched in the code after this list.

* **Alignment and Scaling:** Resize the "texture source" image so that it fully covers the "shape template."
* **Apply Mask:** Use the "shape template" (binary mask) as a mask on the "texture source" image. This is equivalent to using the "shape template" as a cookie cutter on the "texture source" dough.
* **Output:** A new image with a conflict between shape and texture, e.g., the shape of a "cat" filled with the texture of an "elephant."
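To make the masking variant concrete, here is a minimal Python sketch of Step 3 using PIL and NumPy. The file names and the white background fill are illustrative assumptions; the released generator wraps this logic in a GUI and may differ in detail.

```python
from PIL import Image
import numpy as np

def synthesize_cue_conflict(mask_path: str, texture_path: str,
                            background=(255, 255, 255)) -> Image.Image:
    """Cut a texture with a binary shape mask (the "cookie cutter" step)."""
    # Shape template: object in white (255), background in black (0).
    mask = Image.open(mask_path).convert("L")
    texture = Image.open(texture_path).convert("RGB")

    # Alignment and scaling: resize the texture so it fully covers the template.
    texture = texture.resize(mask.size)

    mask_np = np.array(mask) > 127          # boolean object region
    texture_np = np.array(texture)

    # Keep the texture only inside the silhouette; fill the rest with a flat background.
    out = np.empty_like(texture_np)
    out[:] = background
    out[mask_np] = texture_np[mask_np]
    return Image.fromarray(out)

# Hypothetical usage, matching Sample 1: a "plane" silhouette filled with "brick" texture.
# synthesize_cue_conflict("plane_mask.png", "brick_texture.jpg").save("plane_brick.png")
```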
## 4. Experiment and Verification

Once the dataset is built, the steps to verify the hypothesis are as follows (a code sketch follows the list):

1. **Load Pre-trained Model:** Load a standard CNN model pre-trained on ImageNet, such as a ResNet-50.
2. **Make Predictions:** Feed all images from our "Shape-Texture Cue Conflict Dataset" into the model and record the Top-1 predicted class for each image.
3. **Analyze Results:**
    * Calculate the percentage of predictions that match the **shape label** (shape accuracy).
    * Calculate the percentage of predictions that match the **texture label** (texture accuracy).
4. **Draw Conclusion:**
    * If the **texture accuracy is significantly higher than the shape accuracy**, this strongly supports the hypothesis of Geirhos et al. (2018): the model exhibits a significant texture bias.
    * Conversely, if the shape accuracy is higher, the model has a shape bias.
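The sketch below outlines this evaluation with PyTorch/torchvision. The folder layout, the `<shape>_<texture>.png` filename convention, and the `COARSE` class mapping are illustrative assumptions, not the exact protocol of our released code; a real run needs a complete fine-to-coarse mapping, such as the WordNet-based 16-class mapping used by Geirhos et al.

```python
import torch
from torchvision import models, transforms
from PIL import Image
from pathlib import Path

# Standard ImageNet preprocessing for a pretrained ResNet-50.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
classes = weights.meta["categories"]  # the 1,000 ImageNet class names

# Illustrative (incomplete) mapping from fine-grained ImageNet classes
# to our coarse dataset labels; extend for a full evaluation.
COARSE = {
    "airliner": "plane",
    "warplane": "plane",
    "sports car": "car",
    "tabby": "cat",
}

shape_hits = texture_hits = total = 0
for path in Path("cue_conflict_images").glob("*.png"):
    # Assumed filename convention: "<shape>_<texture>.png", e.g. "plane_brick.png".
    shape_label, texture_label = path.stem.split("_")[:2]
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        pred = classes[model(x).argmax(dim=1).item()]
    coarse = COARSE.get(pred, pred)
    shape_hits += coarse == shape_label
    texture_hits += coarse == texture_label
    total += 1

print(f"shape accuracy:   {shape_hits / total:.1%}")
print(f"texture accuracy: {texture_hits / total:.1%}")
```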
![result](https://hackmd.io/_uploads/S1zk6D3Qge.png)

Through this precisely controlled experiment, we can clearly and intuitively demonstrate a counter-intuitive yet crucial intrinsic property of deep learning models.

## 5. Resource Links

* **Dataset:** [D4vidHuang/Cue_Conflict_Dataset](https://huggingface.co/datasets/D4vidHuang/Cue_Conflict_Dataset)
* **Code:** https://github.com/D4vidHuang/Cue-Conflict-Dataset-Generator.git
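As a closing pointer, the dataset should be loadable straight from the Hugging Face Hub with the `datasets` library. This is a minimal sketch only; the split name and column layout are assumptions, so consult the dataset card first.

```python
from datasets import load_dataset

# Minimal sketch: pull the cue-conflict dataset from the Hugging Face Hub.
# The split name and column layout are assumptions; check the dataset card.
ds = load_dataset("D4vidHuang/Cue_Conflict_Dataset", split="train")
print(ds)  # inspect the available columns (images, shape/texture labels, ...)
```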