---
title: 'ForestNet: Deforestation Classification'
disqus: hackmd
---
ForestNet: Deforestation Classification
===
This blog post was created by the Group 24 team members of the DSAIT4125 Computer Vision course:
* Yaren Aslan (y.aslan-1@student.tudelft.nl)
* Bendik Christensen (b.christensen@student.tudelft.nl)
* Christopher Charlesworth (c.charlesworth@student.tudelft.nl)
## Table of Contents
[TOC]
## Introduction
Forests cover approximately one third of the earth’s land surface, forming the foundation for a large range of ecosystems. They absorb carbon dioxide, regulate weather patterns and provide habitats for millions of species. Their importance to the environment and to us cannot be overstated, but are we doing a good job taking care of them?
In 2017, about a football pitch’s worth of forest was lost every second, with deforestation being a large contributor to the ever-increasing amount of forest destruction (Watts, 2018). Whether due to industrial agriculture, logging, or urban expansion, these growing numbers are fuelling concern for our climate and our ecosystems. As such, it’s crucial to understand the main driving factors for deforestation, so that the proper preventative measures can be undertaken.
One emerging approach for identifying the causes of deforestation applies Computer Vision techniques to analyze satellite imagery for classification of deforestation types. In this blog post, we specifically look at the ForestNet paper (Irvin et al., 2020), applying similar and extended versions of their techniques to enhance the reproducibility and depth of their work.
## Problem Definition
The task is to classify the causes of deforestation using satellite imagery and associated environmental data. The original ForestNet paper employed a Convolutional Neural Network (CNN) based semantic segmentation model with multimodal inputs. However, only the dataset was released, not the code.
To ensure reproducibility, we rebuilt the training pipeline from the ground up, including data preprocessing, augmentation, model training, and evaluation. Additionally, we introduced new experiments focused on:
* Model robustness via metamorphic testing,
* Multimodal feature analysis,
* Masking techniques for spatial attention.
Our repository can be found [here](https://github.com/cv-group-24/deforestation-detection).
## Dataset
The [ForestNet dataset](https://stanfordmlgroup.github.io/projects/forestnet/) includes RGB satellite images as the primary data source for analyzing deforestation. Additional multimodal features, such as Global Forest Change (GFC) images, OpenStreetMap (OSM) data, and National Centers for Environmental Prediction (NCEP) features, provide contextual information but are secondary to the image data.
## Methodology
The proposed solution can be broken down into four main parts. First, we augment our training set to improve robustness and prevent overfitting. Next, we apply a mask to the images such that the model is only trained on the forest loss region. Following this, we train a variety of different CNNs on the resulting images. Finally, we explore different ways to combine the multimodal data with the features learnt by the CNNs before training the final classifier.
### Data Augmentation
To enhance model robustness to variations in input data, we applied data augmentation techniques to the training images. These augmentations simulate real-world variations in satellite imagery, such as differing atmospheric conditions, viewing angles, and sensor artifacts.
The augmentation pipeline was composed of the following categories:
1. Geometric and Resizing Augmentations
* **Standard Resizing:** All images were resized to a fixed resolution to ensure compatibility with the CNN input dimensions. This step standardizes the image input pipeline.
* **Random Cropping:** Random patches were cropped from the full-size image to simulate variation in framing and spatial composition.
* **Center Cropping:** Central regions were selected in some cases to emphasize the main area of interest (e.g., likely deforestation zones).
These transformations help the model become invariant to differences in scale, framing, and spatial coverage.
<figure style="text-align: center;">
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="https://imgur.com/9aICoSq.png" width="250" height="250">
<img src="https://imgur.com/OZdawro.png" width="250" height="250">
</div>
<figcaption><b>Figure 1:</b> Example of Random Cropping augmentation. Original image with the crop area (left), cropped and resized result (right). </figcaption>
</figure>
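As an illustration, here is a minimal torchvision sketch of these geometric and resizing steps; the 256-pixel crop and 224-pixel target size are our own illustrative assumptions, and the exact values live in `data/transforms.py`:

```python
import torchvision.transforms as T

# Random framing: crop a random 256x256 patch from the 322x322 scene,
# then resize to the network input size.
random_crop = T.Compose([
    T.RandomCrop(256),
    T.Resize((224, 224)),
])

# Center framing: keep the central region of interest instead.
center_crop = T.Compose([
    T.CenterCrop(256),
    T.Resize((224, 224)),
])
```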
2. Spatial Transformations
* **Horizontal and Vertical Flips:** Random flips were applied to simulate satellite images captured from various orientations.
* **90-Degree Rotations:** Fixed-angle rotations were introduced to encourage rotational invariance in feature recognition.
* **Affine Transformations:** Small-scale perturbations were introduced in the form of translations, rotations, scaling, and shear transformations. These mimic natural variations in viewpoint or satellite trajectory.
Such spatial alterations help the model develop robustness to changes in camera angle, alignment, and orientation, which are common in multi-pass satellite observations.
<figure style="text-align: center;">
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="https://imgur.com/Y4iPXIJ.png" width="200" height="200">
<img src="https://imgur.com/rmyfIVK.png" width="200" height="200">
<img src="https://imgur.com/BJAWc3B.png" width="200" height="200">
</div>
<figcaption><b>Figure 2:</b> Examples of spatial transformations. Original image (left), horizontal flip (middle), slight rotation (right) </figcaption>
</figure>
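A minimal sketch of these spatial transformations, again with torchvision; the affine parameters are illustrative assumptions rather than the values from `data/transforms.py`:

```python
import random

import torchvision.transforms as T
import torchvision.transforms.functional as TF

class RandomRightAngleRotation:
    """Rotate by a random multiple of 90 degrees (0, 90, 180, or 270)."""
    def __call__(self, img):
        return TF.rotate(img, 90 * random.randint(0, 3))

spatial = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    RandomRightAngleRotation(),
    # Small perturbations in rotation, translation, scale, and shear.
    T.RandomAffine(degrees=5, translate=(0.05, 0.05),
                   scale=(0.95, 1.05), shear=5),
])
```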
3. Weather and Environmental Simulation
Simulated weather-based perturbations were applied to mimic conditions that commonly degrade satellite image quality. These include cloud overlays, fog and haze generation, and snowflake artifacts. These effects are injected using probabilistic combinators, with a 50% chance of applying at least one weather-based transformation per image in relevant modes. This is particularly important for satellite-based models, as cloud cover is one of the most frequent sources of noise and occlusion in real-world observations and has been shown to influence the performance of many computer vision tasks (Fisher, 2014).
<figure style="text-align: center;">
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="https://imgur.com/U0vE0wd.png" width="200" height="200">
<img src="https://imgur.com/VwV8mer.png" width="200" height="200">
<img src="https://imgur.com/B7X88FF.png" width="200" height="200">
</div>
<figcaption><b>Figure 3:</b> Example of different levels of cloud cover. Original (cloudless) image on the left, to heaviest cloud cover on the right. </figcaption>
</figure>
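The sketch below shows one way such a probabilistic combinator can be expressed with albumentations, assuming its built-in fog and snow transforms as stand-ins for the effects described above; the actual overlays in `data/transforms.py` may be implemented differently:

```python
import albumentations as A

# With p=0.5, at least one weather perturbation is applied to roughly
# half of the images, mirroring the combinator described above.
weather = A.Compose([
    A.OneOf([
        A.RandomFog(p=1.0),   # stand-in for fog / haze generation
        A.RandomSnow(p=1.0),  # stand-in for snowflake artifacts
    ], p=0.5),
])

# Albumentations operates on numpy arrays of shape (H, W, C):
# augmented = weather(image=np_image)["image"]
```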
4. Pixel-Level Augmentations
* **Salt-and-Pepper Noise:** Random pixel noise was added to simulate sensor degradation or transmission noise.
* **RGB Channel Shifts:** Selective channel manipulations were applied to reflect color shifts caused by atmospheric distortion or sensor imbalance.
* **Brightness and Contrast Adjustments:** These augmentations allowed the model to adapt to varying lighting conditions across different satellite passes or image batches.
Pixel-level transformations help mitigate issues related to lighting, sensor calibration, and noise sensitivity.
<figure style="text-align: center;">
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="https://imgur.com/8EZm8SB.png" width="200" height="200">
<img src="https://imgur.com/oS9Xnt6.png" width="200" height="200">
<img src="https://imgur.com/LIiPdsL.png" width="200" height="200">
</div>
<figcaption><b>Figure 4:</b> Examples of different pixel-level augmentations. Original image (left), RGB channel shift (middle), 20% contrast enhancement (right). </figcaption>
</figure>
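For instance, salt-and-pepper noise can be implemented in a few lines of NumPy; the 1% noise fraction below is an illustrative assumption:

```python
import numpy as np

def salt_and_pepper(image: np.ndarray, amount: float = 0.01) -> np.ndarray:
    """Flip a random fraction of pixels to pure white (salt) or
    black (pepper), simulating sensor or transmission noise."""
    noisy = image.copy()
    h, w = image.shape[:2]
    n = int(amount * h * w / 2)  # half salt, half pepper
    ys, xs = np.random.randint(0, h, n), np.random.randint(0, w, n)
    noisy[ys, xs] = 255  # salt
    ys, xs = np.random.randint(0, h, n), np.random.randint(0, w, n)
    noisy[ys, xs] = 0    # pepper
    return noisy
```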
For a full understanding of all the transforms used to augment the dataset, please see the `data/transforms.py` file in the [repository](https://github.com/cv-group-24/deforestation-detection). Once the augmentations were applied, the augmented images were concatenated to the original dataset, ensuring that the model was trained with both the augmented and the original images.
### Masking Forest Region
The original satellite imagery has an image size of 322 $\times$ 322 pixels, corresponding to roughly a 5 km $\times$ 5 km area of land. However, this includes large areas in which no deforestation occurred, which are uninformative given the task at hand. Thus, we incorporated the forest loss regions identified by Austin et al. (2019) to have our model focus on only the regions where deforestation occurred.
Each forest loss region is represented as a polygon stored in a `.pkl` file attached to the data sample, so this file was loaded and the masking was applied to the interior of the polygon’s outline. The masking itself consisted of overwriting the data outside the polygon region with a constant value so that the information surrounding the forest loss region is ignored. The forest loss regions themselves were annotated by experts, and examples of such annotations can be seen below:
<figure style="text-align: center;">
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="https://imgur.com/niRPv2e.png" width="200" height="200">
<img src="https://imgur.com/sx5tzv8.png" width="200" height="200">
<img src="https://imgur.com/Bib57GB.png" width="200" height="200">
</div>
<figcaption><b>Figure 5:</b> Visual example of the masking annotation from the dataset and the masking process. Original image (left), outline provided by the expert annotations in the dataset (middle), final masked image (right). </figcaption>
</figure>
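A minimal sketch of this masking step, assuming the `.pkl` file deserializes to a shapely-style polygon in pixel coordinates (the actual dataset layout may differ slightly):

```python
import pickle

import numpy as np
from PIL import Image, ImageDraw

def mask_forest_loss(image: Image.Image, polygon_path: str,
                     fill_value: int = 0) -> Image.Image:
    """Overwrite everything outside the forest loss polygon.

    Assumes the pickle holds an object with an `exterior.coords`
    attribute; multi-polygon regions would be drawn component by
    component in the same way.
    """
    with open(polygon_path, "rb") as f:
        polygon = pickle.load(f)

    # Rasterize the polygon interior into a binary mask.
    mask = Image.new("L", image.size, 0)
    ImageDraw.Draw(mask).polygon(list(polygon.exterior.coords), fill=255)

    # Keep pixels inside the polygon, overwrite the rest.
    arr = np.array(image)
    arr[np.array(mask) == 0] = fill_value
    return Image.fromarray(arr)
```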
### Models
To explore which neural architectures are most effective for classifying deforestation causes, we implemented and evaluated multiple CNNs. These include two custom models we designed from scratch, SimpleCNN and EnhancedCNN, as well as three widely used pretrained models: ResNet, EfficientNet, and DenseNet.
Both SimpleCNN and EnhancedCNN were trained from scratch on the ForestNet dataset, without leveraging external pretraining. This allowed us to assess their performance purely based on features learned from satellite imagery.
#### SimpleCNN
The SimpleCNN was designed as a lightweight baseline architecture. It consists of three convolutional blocks, each followed by ReLU activations and max pooling. This structure results in progressive downsampling of spatial information while expanding the depth of the feature maps. The extracted features are flattened and passed through two fully connected layers for classification. Dropout is applied to prevent overfitting.
As seen in Figure 6, the input image passes through three convolutional blocks before being flattened. Then, it passes through a linear classifier. For the detailed implementation, see <code>models/cnn.py</code> in the <a href="https://github.com/cv-group-24/deforestation-detection" target="_blank">repository</a>.
<figure style="text-align: center;">
<img src="https://i.imgur.com/4X4XeMf.png" alt="SimpleCNN Architecture" style="width: 900px; height: auto;">
<figcaption >
<b>Figure 6:</b> SimpleCNN Model Architecture
</figcaption>
</figure>
<br>
Although computationally efficient and easy to interpret, SimpleCNN is relatively shallow and lacks the capacity to capture complex or hierarchical patterns in the data. Its primary function in this work is to serve as a point of reference against which deeper models can be evaluated.
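A minimal PyTorch sketch of such an architecture is shown below; the channel widths, hidden size, and assumed 224 $\times$ 224 input are illustrative, not the exact configuration in `models/cnn.py`:

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Baseline sketch: three conv blocks, each with ReLU and max
    pooling, followed by a two-layer classifier with dropout."""

    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 28 * 28, 256),  # 224 / 2^3 = 28
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```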
#### EnhancedCNN
To improve upon the limitations of the baseline, we developed the EnhancedCNN, which introduces several architectural enhancements:
* Two convolutional layers per block with four blocks in total, increasing the model’s depth and capacity to extract hierarchical spatial features.
* Batch normalization after each convolutional layer to improve training stability and convergence speed.
* Global average pooling before the final classifier, which reduces each feature map to a single value, greatly reducing the number of parameters before classification and therefore mitigating overfitting.
As seen in Figure 7, the input image passes through four convolutional blocks, then a global average pooling layer is applied. Finally, the output is flattened and passed through a linear classifier. For the detailed implementation, see <code>models/cnn.py</code> in the <a href="https://github.com/cv-group-24/deforestation-detection" target="_blank">repository</a>.
<figure style="text-align: center;">
<img src="https://i.imgur.com/Dg1bzjN.png" alt="EnhancedCNN Architecture" style="width: 900px; height: auto;">
<figcaption >
<b>Figure 7:</b> EnhancedCNN Model Architecture
</figcaption>
</figure>
<br>
The EnhancedCNN architecture is more expressive than SimpleCNN and better suited for the satellite imagery domain, where spatial patterns are often subtle and span multiple scales.
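A corresponding PyTorch sketch, with assumed channel widths rather than the exact ones from `models/cnn.py`:

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two conv layers, each followed by batch norm and ReLU,
    then spatial downsampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.MaxPool2d(2),
    )

class EnhancedCNN(nn.Module):
    """Four two-layer conv blocks, global average pooling, and a
    single linear classifier."""

    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32), conv_block(32, 64),
            conv_block(64, 128), conv_block(128, 256),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.classifier(x)
```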
#### Pre-trained CNNs: ResNet, EfficientNet, DenseNet
To benchmark our custom models, we also evaluated three standard pre-trained architectures:
* **ResNet:** This model was chosen because its residual connections allow for mitigating the vanishing gradient problem (He et al., 2016), helping it learn complex and intricate visual patterns, such as subtle signs of vegetation loss.
* **EfficientNet:** EfficientNet was chosen, as it balances model depth, width, and resolution efficiently (Tan & Le, 2019). Since overfitting could be a concern for this domain-specific dataset, an optimized model such as EfficientNet has the potential to perform well.
* **DenseNet:** Connects each layer to all subsequent layers within a dense block, encouraging feature reuse and improving gradient flow (Huang et al., 2017). Since this architecture is able to pick up on and propagate lower-level features such as textures through the network, we hypothesized that it would suit the detail-oriented nature of the problem.
These models were fine-tuned on the ForestNet dataset, allowing us to compare how well generic image features (learned from ImageNet) transfer to the domain of satellite imagery and deforestation classification.
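The fine-tuning setup follows the standard torchvision pattern: load ImageNet weights and replace the classification head with a four-way layer. The ResNet-50 variant below is an assumption; the same pattern applies to the other two architectures:

```python
import torch.nn as nn
from torchvision import models

# ResNet-50 shown as an example variant; for EfficientNet the head is
# model.classifier[1], for DenseNet it is model.classifier.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 4)  # four deforestation classes
# All layers remain trainable, so the entire network is fine-tuned.
```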
### Multimodal Data
The original ForestNet paper proposes to include the multimodal features by concatenating them onto the features learnt by the CNN before training the final linear layers. After implementing this approach, we noticed that the performance wasn’t improving, despite the model having more data available.
To address this, a combinatorial method was implemented, training CNN-based and multimodal models independently, then combining their outputs. This approach enabled faster experimentation with multimodal architectures, improved explainability (e.g., using a Decision Tree for interpretable predictions), and separated multimodal information from CNN-analyzed image data to assess its classification value.
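As a sketch of the combinatorial idea, the snippet below trains a decision tree on the multimodal features and averages its class probabilities with the CNN’s softmax outputs; the decision tree depth, the equal weighting, and the function signature are our illustrative assumptions rather than the exact combination rule in the repository:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def late_fusion_predict(cnn_probs: np.ndarray,
                        X_mm_train: np.ndarray, y_train: np.ndarray,
                        X_mm_test: np.ndarray,
                        cnn_weight: float = 0.5) -> np.ndarray:
    """Combine an image model and a multimodal model trained
    independently. `cnn_probs` holds the CNN's (N, n_classes) softmax
    outputs on the test set."""
    tree = DecisionTreeClassifier(max_depth=5).fit(X_mm_train, y_train)
    tree_probs = tree.predict_proba(X_mm_test)       # (N, n_classes)
    combined = cnn_weight * cnn_probs + (1 - cnn_weight) * tree_probs
    return np.argmax(combined, axis=1)
```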
During the testing of the combinatorial method, the multimodal features still weren’t particularly helpful. To evaluate the overall utility of the multimodal data for classification, we applied three dimensionality reduction techniques, with a projection sketch given after the list:
* **t-SNE** (Van der Maaten & Hinton, 2008): Nonlinear technique minimizing KL divergence between probability distributions.
* **UMAP** (McInnes et al., 2018): Nonlinear approach optimizing graph representations across dimensions.
* **PCA** (Abdi & Williams, 2010): Linear method maximizing variance along orthogonal components.
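A minimal sketch of producing the three 2-D projections shown in Figure 8; hyperparameters are library defaults, and `umap` refers to the umap-learn package:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

def project_2d(X: np.ndarray) -> dict:
    """Project an (n_samples, n_features) multimodal feature matrix
    to 2-D with each of the three techniques."""
    return {
        "PCA": PCA(n_components=2).fit_transform(X),
        "t-SNE": TSNE(n_components=2).fit_transform(X),
        "UMAP": umap.UMAP(n_components=2).fit_transform(X),
    }
```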
<div align="center">
<table style="border: none;">
<tr>
<td align="center" style="border: none;">
<img src="https://i.imgur.com/7c4PUlk.jpeg" alt="t-SNE" width="100%">
<br><b>(a)</b> t-SNE
</td>
<td align="center" style="border: none;">
<img src="https://i.imgur.com/c4Pw9UE.jpeg" alt="UMAP" width="100%">
<br><b>(b)</b> UMAP
</td>
</tr>
<tr>
<td colspan="2" align="center" style="border: none;">
<img src="https://i.imgur.com/VRDPnNA.jpeg" alt="PCA" width="50%">
<br><b>(c)</b> PCA
</td>
</tr>
</table>
<center>
<b>Figure 8:</b> Dimensionality reduction of multimodal features for deforestation classification. Classes: 0 – Grassland shrubland, 1 – Other, 2 – Plantation, 3 – Smallholder agriculture. All methods (a–c) show significant class overlap, suggesting limited discriminative value.
</center>
</div>
<br>
The dimensionality reduction visualizations showed substantial overlap between classes, suggesting that multimodal features provided minimal discriminative information. This confirmed our hypothesis that the multimodal data contributed little additional value, leading to its exclusion from subsequent model training.
## Results and Discussion
We conducted an extensive performance evaluation of multiple CNN architectures under four distinct training configurations:
* Baseline (no augmentation, no masking)
* With data augmentation
* With masking
* With both masking and data augmentation
The models were evaluated using four metrics: accuracy, precision, recall, and F1-score. These metrics provide a holistic understanding of both classification correctness and balance across classes.
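All four metrics can be computed directly with scikit-learn, as sketched below; the macro averaging shown is an assumption about how the per-class scores were aggregated:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred) -> dict:
    """Compute the four reported metrics, with macro averaging
    giving each class equal weight."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```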
As can be seen in Table 1, EnhancedCNN shows the best performance among all models in the baseline configuration. This could be attributed to its task-specific design and effective depth. ResNet and EfficientNet also perform competitively, indicating that pretrained models can generalize reasonably well even without fine-tuning on deforestation-specific augmentations.
<figure style="text-align: center;">
<img src="https://i.imgur.com/pbsyWRP.png" width="80%">
<figcaption >
<b>Table 1:</b> Performance metrics for all models (trained with no augmentation and no masking).
</figcaption>
</figure>
For the second experiment, we can see that data augmentation led to notable performance improvements across all models, as shown in Table 2. EnhancedCNN benefits the most, with an accuracy boost from 67% to 71%, highlighting its ability to generalize better when trained on more varied data. SimpleCNN also improves, but remains significantly behind due to its limited capacity. ResNet sees consistent gains across all metrics, further validating the robustness of pretrained architectures when enhanced by synthetic variation.
<figure style="text-align: center;">
<img src="https://i.imgur.com/iTlWNNE.png" width="80%">
<figcaption >
<b>Table 2:</b> Performance metrics for models trained with data augmentation only.
</figcaption>
</figure>
Masking, where only deforested regions are emphasized, surprisingly yields poor results, except for SimpleCNN, which now surpasses its augmentation-only performance. This could suggest that the removal of background information helps shallow models focus on discriminative features, but it doesn’t help the more complex models. EnhancedCNN and ResNet perform similarly to each other, as seen in Table 3, though both underperform compared to the augmentation-only case, suggesting they benefit more from diverse spatial context.
<figure style="text-align: center;">
<img src="https://i.imgur.com/lwrLbiq.png" width="80%">
<figcaption >
<b>Table 3:</b> Performance metrics for models trained with masking only.
</figcaption>
</figure>
For the last experiment, the results in Table 4 show that combining masking with augmentation produces mixed results. SimpleCNN continues to benefit and achieves its highest F1-score (0.62). ResNet also shows its best performance, reaching an F1-score of 0.67. However, EnhancedCNN drops significantly, likely due to oversensitivity to input structure when both perturbations and reduced spatial context are introduced. This suggests that deeper, task-specific models like EnhancedCNN may require careful tuning of augmentation intensity and spatial context management.
Finally, when comparing our results to those obtained by the [original paper](https://stanfordmlgroup.github.io/projects/forestnet/) (labelled as 'ForestNet' in Table 4), there is a clear gap in model performance. This could be due to a variety of factors, such as slight differences in approaches, model backbones or computational resources. The main implication is that the paper has low reproducibility, as these differences would have been easier to minimize had the paper included more implementation details. However, when comparing the best results achieved by EnhancedCNN with augmentation only (0.71 accuracy and 0.71 F1-score) with the ForestNet results (0.80 accuracy and 0.74 F1-score), the gap between the performances is smaller. This suggests that the results reported in the original paper are achievable, just not easily reproducible.
<figure style="text-align: center;">
<img src="https://i.imgur.com/4UUp1mu.png" width="80%">
<figcaption >
<b>Table 4:</b> Performance metrics for three models trained with both masking and data augmentation. Additionally, the best accuracy and F1-score from the original ForestNet paper are provided for comparison.
</figcaption>
</figure>
We also conducted metamorphic testing, introducing variations such as additional synthetic clouds and haze, flipped image orientations, and adjusted color balances to simulate potential variations in satellite captures. The objective was to assess how consistently the models predicted under semantically equivalent but visually altered input conditions. Below, we discuss and compare the results for the two best-performing models, ResNet and EnhancedCNN, each trained with data augmentation only. The plots drawn from the results of metamorphic testing can be seen in the figure below.
<div align="center">
<table style="border: none;">
<tr>
<td align="center" style="border: none;">
<img src="https://i.imgur.com/eOahhX7.png" alt="EnhancedCNN" width="80%">
<br><b>(a)</b> EnhancedCNN
</td>
</tr>
<tr>
<td align="center" style="border: none;">
<img src="https://i.imgur.com/loBN9xR.png" alt="ResNet" width="80%">
<br><b>(b)</b> ResNet
</td>
</tr>
</table>
<center>
<p><b>Figure 9:</b> Results of metamorphic testing for the two best performing models, both trained with data augmentation only.</p>
</center>
</div>
<br>
As seen in Figure 9a, the EnhancedCNN model displays strong invariance to perturbations across most classes:
* Grassland shrubland achieves the highest robustness with a prediction ratio of 0.93, showing near-perfect consistency.
* Smallholder agriculture, Plantation, and Other classes also show high stability, with prediction retention rates between 0.77 and 0.82.
* The few misclassifications that do occur fall into logically adjacent classes, such as “Plantation” → “Smallholder agriculture,” indicating semantically reasonable errors.
EnhancedCNN appears to have learned robust class boundaries and generalized well under visual noise. This suggests that the task-specific design and augmentation pipeline led to a model that is not only performant but also resilient to atmospheric and spatial distortions.
The ResNet model, despite its strong pretrained foundation, shows more variability in its predictions, as seen in Figure 9b:
* Plantation remains stable (0.86), but the other classes drop: Smallholder agriculture to 0.74, Grassland shrubland to 0.75, and notably Other to only 0.61.
* Misclassifications are more diverse and frequent, especially for the “Other” class, which gets redistributed across multiple categories.
While ResNet remains a high-performing model overall, it shows greater sensitivity to perturbations compared to EnhancedCNN. This may be due to the generic features learned from pretraining, which are powerful but possibly less optimized for fine-grained deforestation class distinctions under altered conditions.
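For reference, the per-class prediction-retention ratio reported in Figure 9 can be formalized along the following lines; grouping by the original prediction is one plausible reading, and our plotting code may group by ground-truth class instead:

```python
import numpy as np

def retention_per_class(preds_orig: np.ndarray,
                        preds_perturbed: np.ndarray) -> dict:
    """For each class predicted on the original images, the fraction
    of those samples that keep the same prediction after a
    semantics-preserving perturbation."""
    ratios = {}
    for c in np.unique(preds_orig):
        idx = preds_orig == c
        ratios[int(c)] = float(np.mean(preds_perturbed[idx] == c))
    return ratios
```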
## Conclusion
Deforestation type classification was conducted based on the dataset provided with the ForestNet paper by Irvin et al. (2020). Five different CNNs were tested: two trained from scratch and three pre-trained models fine-tuned on this dataset. Additionally, data augmentation, masking and multimodal data integration were implemented similarly to the original paper, in an attempt to improve the paper’s reproducibility and properly understand its reported results. Finally, metamorphic testing was added for a more comprehensive model evaluation. Our repository can be found [here](https://github.com/cv-group-24/deforestation-detection).
Our key findings/contributions were:
* **EnhancedCNN performs the best among all the tested models,** reaching a 67% accuracy with the baseline implementation, improving to 71% with augmentation. Pre-trained models performed competitively, whereas SimpleCNN lagged behind, showing that a more complex architecture is more beneficial for this task.
* **EnhancedCNN is the most robust to data perturbations,** as demonstrated by the metamorphic testing.
* Our best performing model (EnhancedCNN) was **not able to match ForestNet performance**, indicating the difficulty of reproducing the paper's results.
* **Data augmentation improved performance, while masking decreased performance.** We speculate that the masking didn’t improve performance because it should be coupled with a semantic segmentation approach rather than an image classification approach.
* **Multimodal data did not contribute significantly to classification performance.** We discovered this by analysing results throughout the process and by applying dimensionality reduction techniques, which revealed high class overlap.
* Although the dataset was provided, **the code was built from scratch** and is attached in the [repository](https://github.com/cv-group-24/deforestation-detection), improving the paper’s reproducibility.
## References
Irvin, J., Sheng, H., Ramachandran, N., Johnson-Yu, S., Zhou, S., Story, K., ... & Ng, A. Y. (2020). ForestNet: Classifying drivers of deforestation in Indonesia using deep learning on satellite imagery. arXiv preprint arXiv:2011.05479.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433-459.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Tan, M., & Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (pp. 6105-6114). PMLR.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700-4708).
Austin, K. G., Schwantes, A., Gu, Y., & Kasibhatla, P. S. (2019). What causes deforestation in Indonesia? Environmental Research Letters, 14(2), 024007.
Watts, J. (2018, June 27). One football pitch of forest lost every second in 2017, data reveals. The Guardian. https://www.theguardian.com/environment/ng-interactive/2018/jun/27/one-football-pitch-of-forest-lost-every-second-in-2017-data-reveals
Fisher, A. (2014). Cloud and cloud-shadow detection in SPOT5 HRG imagery with automated morphological feature extraction. Remote Sensing, 6(1), 776-800.
###### tags: `Deforestation` `Computer Vision` `Deep Learning` `ForestNet`