# Geospatial Reading Group
- 8/12/2022 - Simone
- ["PolyWorld: Polygonal Building Extraction with Graph Neural Networks in Satellite Images"](https://openaccess.thecvf.com/content/CVPR2022/papers/Zorzi_PolyWorld_Polygonal_Building_Extraction_With_Graph_Neural_Networks_in_Satellite_CVPR_2022_paper.pdf)
- Notes
- 8/5/2022 - No meeting
- 7/29/2022 - Caleb
- ["SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery"](https://arxiv.org/pdf/2207.08051.pdf)
- Notes
- Introduces SatMAE, a method for self-supervised learning (SSL) with temporal or multi-spectral satellite imagery. This includes a vision transformer masked autoencoder that concatenates patches from either timesteps or groups of spectral channels in the initial reshaping.
- Normally, ViTs reshape an image into an L x (P^2 * C) matrix where each of the L rows is a flattened patch of the input image. With a time series of imagery, SatMAE stacks the patches from every timestep along the first (token) dimension.
- They test "Consistent masking" vs. "Independent masking" vs. "Independent masking with inconsistent cropping" (inconsistent cropping is never described)
- They test "stacking channels" in a single patch vs. "grouping channels"
- Date / group encodings
- Results
- 0.9 points better top-1 accuracy than ImageNet weights (Table 1)
- ViT ImageNet weights help massively over training from scratch
- Learning from scratch is best when you do channel grouping, which makes sense
- Problems:
- "Intuitively, the day, minute, and second should be unrelated to the visual appearance of a region." ??!?
- ViT-Large has ~300M parameters vs. ResNet-152's ~60M
- They claim a 7% improvement over SOTA, but this comes from comparing a ViT-Large against a ResNet-50
- The EuroSAT dataset results are garbage -- the TorchGeo paper shows you can get 98% with a ResNet-50
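The temporal patch stacking described above can be sketched in a few lines of numpy. `patchify` is a hypothetical helper illustrating the reshaping, not the authors' implementation:

```python
import numpy as np

def patchify(images, patch_size):
    """Flatten a temporal stack of images into ViT-style patch tokens.

    images: (T, H, W, C) array; a single image is the T == 1 case.
    Returns (T * L, P*P*C) tokens, where L = (H // P) * (W // P).
    """
    T, H, W, C = images.shape
    P = patch_size
    assert H % P == 0 and W % P == 0
    # (T, H/P, P, W/P, P, C) -> (T, H/P, W/P, P, P, C)
    x = images.reshape(T, H // P, P, W // P, P, C).transpose(0, 1, 3, 2, 4, 5)
    # Stack the patches from every timestep along the token dimension,
    # so a T-step series yields T times as many tokens as one image.
    return x.reshape(T * (H // P) * (W // P), P * P * C)
```

For a 3-step series of 8x8 images with 4 channels and P = 4, this yields 12 tokens of length 64; SatMAE's channel-grouped variant applies the same idea over groups of spectral bands instead of timesteps.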
- 7/22/2022 - Loc Trinh
- ["Hierarchical Text-Conditional Image Generation with CLIP Latents"](https://arxiv.org/pdf/2204.06125.pdf) (arXiv preprint so far)
- 7/15/2022 - Saumya
- ["Efficient Visual Pretraining with Contrastive Detection" ICCV 2021](https://openaccess.thecvf.com/content/ICCV2021/papers/Henaff_Efficient_Visual_Pretraining_With_Contrastive_Detection_ICCV_2021_paper.pdf)
- Notes
- Introduces contrastive detection (DETCON), a new objective which maximizes the similarity of object-level features across augmentations. This involves getting an object segmentation for each image, getting features per object, passing those through a small MLP, then doing a contrastive loss vs an augmented view of the same image.
- Better for learning from complex scenes with many objects like in geospatial data.
- It attempts to alleviate the computational burden of self-supervised transfer learning, reducing by up to 10x the computation required to match supervised transfer learning from ImageNet.
- Tests different methods for creating the objects from pixels.
- Links
- [Code](https://github.com/deepmind/detcon/tree/8a3c5d804d454a44915cae8675957b248d9a0a20)
- [PyTorch implementation](https://github.com/isaaccorley/detcon-pytorch)
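The object-level pooling at the core of DETCON can be sketched as follows. This is a numpy sketch of the idea, not the linked implementations, and `mask_pooled_features` is a hypothetical name:

```python
import numpy as np

def mask_pooled_features(feats, seg):
    """Average-pool a feature map over each segment.

    feats: (H, W, D) feature map; seg: (H, W) integer segment ids.
    Returns (K, D): one pooled vector per unique segment id.
    """
    ids = np.unique(seg)
    pooled = np.stack([feats[seg == i].mean(axis=0) for i in ids])
    # L2-normalize so dot products between views act as cosine
    # similarities in the downstream contrastive loss.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

In the full method these per-object vectors (after the projection MLP) from two augmented views of the same image are compared with a contrastive loss: matching segments are pulled together, all other segment pairs pushed apart.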
- 7/8/2022 - Anthony
- ["A ConvNet for the 2020s" CVPR 2022](https://arxiv.org/abs/2201.03545)
- Notes
- Implements a series of training and architectural changes on top of vanilla ResNets, inspired by Swin Transformer design choices
- The changes include
- "Modern" training methods like AdamW, 300 epochs, lots of augmentation
- Changing "where" in the network most of the computation happens. ResNets are organized into 4 stages where the number of blocks per stage follows the ratio (3, 4, 6, 3). The new way is (3, 3, 9, 3). **IDEA:** What happens if we try (9, 3, 3, 3) for the encoder of a land cover mapping application?
- "Patchify" the first part of the network (i.e. the "stem") using 4x4 convs with stride 4 instead of a 7x7 conv with stride 2 followed by max pooling.
- Change block architecture:
- Previous: 256-d incoming, 64x 1x1 convs, BN + ReLU, 64x 3x3 convs, BN + ReLU, 256x 1x1 convs, BN, residual, ReLU
- New: 96-d incoming, 96x 7x7 convs, LN, 384x 1x1 convs, GELU, 96x 1x1 convs, residual
- Separate downsampling layers
- Even though the number of FLOPs stays the same, the resulting networks are up to 50% faster in throughput than their Swin Transformer counterparts, with better performance.
- Links
- Code: https://github.com/facebookresearch/ConvNeXt
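As a sanity check on the stem change: both designs reduce spatial resolution by 4x, so the "patchify" stem is a drop-in replacement. The helper below is a hypothetical pure-Python sketch, assuming the standard ResNet stem padding (pad 3 for the 7x7 conv, pad 1 for the 3x3 max pool):

```python
def conv_out(size, kernel, stride, padding=0):
    """Output spatial size of a conv/pool layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

def resnet_stem(size):
    # 7x7 conv, stride 2, pad 3, followed by 3x3 max pool, stride 2, pad 1.
    return conv_out(conv_out(size, 7, 2, 3), 3, 2, 1)

def convnext_stem(size):
    # Non-overlapping 4x4 "patchify" conv with stride 4.
    return conv_out(size, 4, 4)
```

Both map a 224x224 input to 56x56, but the patchify stem touches each pixel exactly once, mirroring ViT-style patch embedding.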
- 7/1/2022 - Caleb
- ["Segmenting across places: The need for fair transfer learning with satellite imagery"](https://openaccess.thecvf.com/content/CVPR2022W/FaDE-TCV/papers/Zhang_Segmenting_Across_Places_The_Need_for_Fair_Transfer_Learning_With_CVPRW_2022_paper.pdf) CVPRW 2022
- Notes
- Tests the discrepancy in rural vs. urban land cover segmentation performance under different domain adaptation methods and source/target settings: no difference between source and target, spatial difference, urban vs. rural difference
- Finds that domain adaptation methods (specifically the class-balanced self-training and instance adaptive self-training methods) can introduce a larger gap in urban vs. rural performance compared to not using domain adaptation.
- Urban performance is better than rural performance in spatial generalization setting. This is partially explained by the larger difference in imagery between different rural groups (as measured by "maximum mean discrepancy" and "proxy-A-distance")
- Finds little difference in performance between U-Net and DeepLabV3+ architectures
- Finds a very small increase in performance from the domain adaptation methods
- Suggestions: report per-class IoUs, report class distributions as a table, and investigate other domain adaptation methods
- Links
- Code for maximum mean discrepancy: https://github.com/jindongwang/transferlearning/blob/master/code/distance/mmd_pytorch.py
- Code for proxy-A-distance: https://github.com/jindongwang/transferlearning/blob/master/code/distance/proxy_a_distance.py
- [Paper describing maximum mean discrepancy](https://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf) JMLR 2012
- [Paper describing proxy-A-distance](https://www.jmlr.org/papers/volume17/15-239/15-239.pdf) JMLR 2016
- This paper is particularly nice for the formal setup of the domain adaptation problem
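The maximum mean discrepancy used to measure imagery differences between groups can be sketched in numpy. This is the biased V-statistic estimate with an RBF kernel, assuming a fixed bandwidth `gamma`, not the linked repo's code:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased MMD^2 estimate between samples X: (n, d) and Y: (m, d).

    Uses an RBF kernel k(a, b) = exp(-gamma * ||a - b||^2); larger
    values mean the two samples look more different to the kernel.
    """
    def k(a, b):
        # Pairwise squared distances via broadcasting: (n, m) matrix.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

Identical samples give an MMD of zero, and shifting one sample (e.g. a systematic imagery difference between rural groups) drives it up, which is how the paper uses it to explain the urban/rural performance gap.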
## Older papers that I have notes for
- ["Tiling and stitching segmentation output for remote sensing: basic challenges and recommendations"](https://arxiv.org/ftp/arxiv/papers/1805/1805.12219.pdf)
- ["A generalizable and accessible approach to machine learning with global satellite imagery"](https://www.nature.com/articles/s41467-021-24638-z)
- MOSAIKs
- ["An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"](https://arxiv.org/pdf/2010.11929.pdf)
- ViT model
- ["Continental-scale Building Detection from High Resolution Satellite Imagery"](https://arxiv.org/pdf/2107.12283.pdf)
- Google Africa building footprints
- ["Deep Double Descent: Where Bigger Models and More Data Hurt"](https://arxiv.org/pdf/1912.02292.pdf)
- ["Deep High-Resolution Representation Learning for Visual Recognition"](https://arxiv.org/pdf/1908.07919.pdf)
- HRNet
- ["Emerging Properties in Self-Supervised Vision Transformers"](https://arxiv.org/pdf/2104.14294.pdf)
- DINO
- ["Exploring Simple Siamese Representation Learning"](https://arxiv.org/pdf/2011.10566.pdf)
- SimSiam
- ["In-domain Representation Learning for Remote Sensing"](https://arxiv.org/pdf/1911.06721.pdf)
- ["Intriguing Properties of Vision Transformers"](https://arxiv.org/pdf/2105.10497v1.pdf)
- ["Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data"](https://arxiv.org/pdf/2103.16607.pdf)
- SeCo
- ["Using object-based image analysis to map commercial poultry operations from high resolution imagery to support animal health outbreaks and events"](https://geospatialhealth.net/index.php/gh/article/view/919/917)