# ISRO
## Constraints
1. Input images: L1/L2 processed optical (RGB) images converted to 0–255 PNG/JPEG; must handle images up to 2k×2k quickly
2. Scale range: sampling from 0.5 m/px to 10 m/px (objects can be very small: a car spans roughly 9 px at 0.5 m/px but barely a pixel at 10 m/px).
3. Tasks required: captioning, visual grounding (oriented bounding boxes), VQA (binary, numeric float, attribute strings).
4. Robustness: efficiency, scale-resilience, and optional SAR/IR/false color support for multimodality.
5. **Spectral channels: many EO images are multispectral (beyond RGB), or SAR (radar) with very different statistics. Off-the-shelf VLMs pretrained on natural images perform poorly without adaptation.**
6. **Orientation & OBBs: objects are arbitrarily rotated (aircraft/runways/ships). Grounding requires oriented bounding boxes (OBBs), not just axis-aligned boxes.** Datasets like DOTA and many remote-sensing detectors use OBB formats. [OBB Datasets](https://docs.ultralytics.com/datasets/obb/), [DOTA paper](https://arxiv.org/pdf/1711.10398)
7. **Small-instance density & class imbalance: many small objects in large images; background area >> object area. This affects both detectors and VLM attention.**
8. **Geometric scale variance: the same semantic object (building, ship) appears at many scales. Hence models need multi-scale features and careful tiling.**
9. Limited paired image–text data: high-quality image-caption and image-text pairs for remote sensing (and especially SAR) are scarce, so fine-tuning data is limited. VRSBench addresses this gap but is still young. [VRSBench](https://arxiv.org/pdf/2406.12384)
10. Labeling complexity: captions in domain language and oriented boxes both require domain experts to produce, which makes evaluation labels expensive to create.
11. Evaluation specifics: BLEU for captions (relative to expert captions), IoU@0.7 for oriented boxes, and VQA-specific metrics, so the backend must output formats that match these metrics.
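A rough sketch of how the IoU@0.7 check for oriented boxes could be computed, assuming predictions and ground truth come as `(cx, cy, w, h, angle_deg)` tuples and that `shapely` is available; this is illustrative, not the official evaluation script.

```python
# Illustrative IoU@0.7 check for oriented boxes given as (cx, cy, w, h, angle_deg).
import math
from shapely.geometry import Polygon

def obb_to_polygon(cx, cy, w, h, angle_deg):
    """Convert a (cx, cy, w, h, angle) oriented box into a shapely Polygon."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Corner offsets before rotation, then rotate and translate to the box centre.
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return Polygon([(cx + x * cos_t - y * sin_t, cy + x * sin_t + y * cos_t) for x, y in corners])

def rotated_iou(box_a, box_b):
    pa, pb = obb_to_polygon(*box_a), obb_to_polygon(*box_b)
    inter = pa.intersection(pb).area
    union = pa.area + pb.area - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as correct only if its IoU with the ground truth exceeds 0.7.
pred, gt = (100, 100, 60, 20, 30), (102, 98, 58, 22, 28)
print(rotated_iou(pred, gt) >= 0.7)
```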
---
## Background Research/Knowledge
### Review Papers
- [Advancements in Vision–Language Models for Remote Sensing](https://www.mdpi.com/2072-4292/17/1/162)
- [Vision-Language Models in Remote Sensing](https://arxiv.org/pdf/2305.05726v2)
- [Remote Sensing SpatioTemporal Vision-Language Models: A Comprehensive Survey](https://arxiv.org/pdf/2412.02573v3)
### Remote Sensing Domain Knowledge
1. Understand how satellite imagery and geospatial data are fundamentally different from regular photos: issues of viewpoint, resolution, image size, etc.
2. L1/L2 Processing Levels
3. Spectral indices and pre-processing
4. Multispectral and non-optical (SAR, Infrared) data
1. [Combining Information from SAR and Imagery](https://www.nv5geospatialsoftware.com/Support/Maintenance-Detail/the-power-of-combining-information-from-sar-and-imagery)
5. Standard ViTs cannot deal with SAR data. SAR imagery is a noisy, grayscale "speckle" image that looks nothing like an RGB photo, so a standard ViT encoder will extract meaningless features; SAR requires a dedicated encoder trained on SAR data or a fusion architecture.
1. [SAR-TEXT](https://arxiv.org/pdf/2507.18743v3)
2. [SARLANG-1M](https://arxiv.org/pdf/2504.03254v1)
3. [CLOSP](https://arxiv.org/pdf/2507.10403v2)
4. [EarthGPT](https://arxiv.org/pdf/2401.16822v3)
6. False-Color Composites (a band-stacking sketch follows this list).
1. [A technique for generating natural colour images from false colour composite images](https://www.researchgate.net/publication/232950498_A_technique_for_generating_natural_colour_images_from_false_colour_composite_images)
2. [Understanding True Colour and False Colour Composites](https://medium.com/@shiela.mms/understanding-true-colour-and-false-colour-composites-546455becda1)
7. The VLM has to be fine-tuned to understand false-color composites, for example by incorporating the false-color band mapping within the prompts.
1. [Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment](https://arxiv.org/html/2402.09816v2)
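As referenced above, a minimal sketch of assembling a standard near-infrared false-color composite (NIR→R, Red→G, Green→B), assuming each band is already a float NumPy array of reflectance values; the percentile stretch and band choice are illustrative, not a prescribed pipeline.

```python
# Minimal false-color composite sketch; bands are assumed to be float NumPy arrays.
import numpy as np

def percentile_stretch(band, low=2, high=98):
    """Contrast-stretch one band into 0-255 using percentile clipping."""
    lo, hi = np.percentile(band, [low, high])
    scaled = np.clip((band - lo) / max(hi - lo, 1e-6), 0, 1)
    return (scaled * 255).astype(np.uint8)

def false_color_composite(nir, red, green):
    """Stack NIR/Red/Green as RGB; healthy vegetation shows up bright red."""
    return np.dstack([percentile_stretch(b) for b in (nir, red, green)])

# Stand-in data shaped like a small tile; real bands would come from the product
# before conversion to the 0-255 PNG/JPEG input format.
nir, red, green = (np.random.rand(256, 256) for _ in range(3))
rgb = false_color_composite(nir, red, green)   # (256, 256, 3) uint8
```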
### ML-based (VLM/LLM/Vision Models) Knowledge
1. Downsampling and Tiling in VLMs. Resources:
1. [Tiling and stitching segmentation output for remote sensing](https://medium.com/@MarouaMaru7/part-1-paper-explained-tiling-and-stitching-segmentation-output-for-remote-sensing-basic-e343c9fcfd9b)
2. [Exact Tile-Based Segmentation Inference for Images Larger than GPU Memory](https://pmc.ncbi.nlm.nih.gov/articles/PMC10914126/)
2. How to process 2k x 2k images efficiently: most VLMs were trained on 224x224 or 336x336 pixel images, and the computational and memory complexity of self-attention in a ViT is $O(n^2)$, where $n$ is the number of patches. Naive downsampling makes small objects effectively disappear, and naive tiling discards global context, so neither works on its own for satellite data. Resources:
1. [Systematic Evaluation of Image Tiling Adverse Effects on Deep Learning Semantic Segmentation](https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2020.00065/full)
3. Hybrid global-local architecture that processes both a downsampled global view and high-resolution local tiles. Resources:
1. [GLFFNet](https://www.mdpi.com/2072-4292/17/6/1019)
4. [Oriented / rotated object detection](https://openaccess.thecvf.com/content/WACV2021/papers/Yi_Oriented_Object_Detection_in_Aerial_Images_With_Box_Boundary-Aware_Vectors_WACV_2021_paper.pdf)
1. rotated Faster-R-CNN variants
2. CenterNet, BBAVectors
3. rotated YOLO/RetinaNet variants
4. specialized losses/IoU (e.g., GIoU for rotated boxes)
5. Tile + overlap + inference stitching (see the sketch after this list)
6. Feature stitching
7. Hybrid supervision: combine strong detector labels (from DOTA/SpaceNet) with weak/noisy caption pairs or synthetic captions to bootstrap VLMs.
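A minimal sketch of the tile + overlap + stitching idea referenced above, assuming the image is a NumPy array and a hypothetical `run_detector(tile)` that returns `[x1, y1, x2, y2, score]` boxes in tile coordinates; duplicates in the overlap zones would still need global NMS afterwards.

```python
# Tile + overlap + stitching sketch for a 2k x 2k image; run_detector is hypothetical.
def tile_coords(height, width, tile=1024, overlap=128):
    """Yield top-left corners of overlapping tiles that cover the full image."""
    stride = tile - overlap
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    # Make sure the last row/column of tiles reaches the image border.
    if ys[-1] + tile < height:
        ys.append(height - tile)
    if xs[-1] + tile < width:
        xs.append(width - tile)
    for y in ys:
        for x in xs:
            yield y, x

def detect_full_image(image, run_detector, tile=1024, overlap=128):
    """Run the detector on each tile and shift its boxes back to global coordinates."""
    height, width = image.shape[:2]
    all_boxes = []
    for y, x in tile_coords(height, width, tile, overlap):
        for x1, y1, x2, y2, score in run_detector(image[y:y + tile, x:x + tile]):
            all_boxes.append([x1 + x, y1 + y, x2 + x, y2 + y, score])
    return all_boxes   # apply global NMS here to remove duplicates from overlaps
```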
### Interdisciplinary Knowledge
1. Ground Sample Distance: Read about how variable GSD breaks Standard Vision Encoders and how to fix it. Resources:
1. [GAIA Paper](https://arxiv.org/pdf/2502.09598v1)
2. [Repo of satellite ML Techniques](https://github.com/satellite-image-deep-learning/techniques)
2. Using OBBs rather than HBBs for VLMs: a standard Horizontal Bounding Box (HBB) is axis-aligned, so it fails on aerial imagery, where objects (e.g., ships, airplanes, buildings) are not aligned with the cardinal directions of the image sensor; they are oriented. OBB grounding forces the VLM to generate a float vector, a text-generation-for-regression problem that LLM tokenizers are not built for (a parsing sketch follows this list). Resources:
1. [Theoretically Achieving Continuous Representation of Oriented Bounding Boxes](https://arxiv.org/pdf/2402.18975v1),
2. [Unified Large Vision-Language Model for Remote Sensing Visual Grounding](https://arxiv.org/pdf/2411.11904v1)
3. [Review paper for Advancements in VLMs for Remote Sensing](https://www.mdpi.com/2072-4292/17/1/162)
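A minimal sketch of the "text-as-geometry" idea: the VLM emits the OBB as a structured string and a strict parser turns it into floats, rejecting malformed generations. The `<obb>` tag format here is an illustrative assumption, not GeoGround's actual output format.

```python
# Parse a 5-parameter OBB emitted as structured text; the <obb> format is assumed.
import re

OBB_PATTERN = re.compile(
    r"<obb>\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*(-?[\d.]+)\s*</obb>"
)

def parse_obb(text, img_w, img_h):
    """Parse '<obb>cx, cy, w, h, angle</obb>' from model output; None if invalid."""
    match = OBB_PATTERN.search(text)
    if not match:
        return None
    cx, cy, w, h, angle = map(float, match.groups())
    # Reject geometrically impossible boxes instead of silently keeping them.
    if not (0 <= cx <= img_w and 0 <= cy <= img_h and w > 0 and h > 0):
        return None
    return cx, cy, w, h, angle

print(parse_obb("The ship is at <obb>512.0, 230.5, 90.0, 24.0, 37.5</obb>.", 2048, 2048))
```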
### Datasets
1. [SpaceNet](https://spacenet.ai/datasets): building footprints & roads (segmentation, geo tasks).
2. [DOTA](https://arxiv.org/pdf/1711.10398): for oriented bbox detector training
3. [xView2](https://xview2.org/dataset): building damage assessment (pre/post-disaster imagery, many small objects)
- [xView2 Baseline](https://github.com/DIUx-xView/xView2_baseline)
4. [VRSBench](https://arxiv.org/pdf/2406.12384): central for multimodal tasks (captioning, grounding, VQA).
5. [Multimodal Visible-SAR Dataset for Airport Detection in Remote Sensing Imagery ](https://www.scidb.cn/en/detail?dataSetId=29765a93b1c54e3d9fac27dbc03b8214)
6. Sentinel-1 + Sentinel-2 co-registered pairs for SAR/optical alignment and synthetic caption generation
---
## Baseline Models and SOTAs
1. GeoChat (Given as the VRSBench Baseline)
- [Paper Website](https://mbzuai-oryx.github.io/GeoChat/)
- [Hugging Face Link](https://huggingface.co/MBZUAI/geochat-7B)
- Strengths:
- domain specific for remote sensing
- unifies multiple tasks (captioning, VQA, classification)
- can perform region-level reasoning by accepting bounding box coordinates as input
- Weaknesses:
- Fails the OBB task: its grounding uses standard HBBs, so it cannot meet the $IoU > 0.7$ OBB requirement
- Fails the float VQA task due to numerical hallucination (agents with regression-based tools are needed to achieve good accuracy on the float-based evaluation)
2. RS-LLaVA and Co-LLaVA
- [RS-LLaVA Paper Link](https://www.mdpi.com/2072-4292/16/9/1477)
- [Github Repo](https://github.com/BigData-KSU/RS-LLaVA)
- [Co-LLaVA Paper](https://www.mdpi.com/2072-4292/17/3/466)
- Strengths:
- LLaVA adaptations designed for joint captioning and VQA in remote sensing
- SOTA for the standard Captioning (BLEU score) and VQA tasks
- Weakness: no native mechanism for OBB grounding or numeric VQA
3. GeoGround
- [Paper](https://arxiv.org/pdf/2411.11904)
- [Github](https://github.com/VisionXLab/GeoGround)
- Strengths:
- Specific for OBB tasks
- Text-as-Geometry mechanism: GeoGround does not use a separate detection head; instead, it fine-tunes the LLM to generate a structured text string that represents the OBB's geometry.
4. OpenRSD (Alt to GeoGround)
- Non-VLM approach
- open-prompt object detection framework
- [Paper](https://arxiv.org/html/2503.06146v2)
- Precise and non-generative, so it can be used as a tool by an agent, but it is hard to incorporate into the overall generative model architecture
5. TIGeR (Tool-Integrated Geometric Reasoning)
- [Paper Link](https://arxiv.org/html/2510.07181v2)
- [SpatialVLM](https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_SpatialVLM_Endowing_Vision-Language_Models_with_Spatial_Reasoning_Capabilities_CVPR_2024_paper.pdf)
- solves the "float VQA" problem by teaching the VLM to use external tools
- SOTA in Numeric VQA
- the VLM does not generate the answer directly; it generates Python code that calls a predefined tool (a minimal tool-call sketch follows the comparison table below)
6. AnySat
- SOTA in Scale & Modality Resilience
- addresses the Scale Resilience hurdle (0.5m-10m GSD)
- handles varied resolutions in a single model
- [Paper](https://arxiv.org/pdf/2412.14123)
7. EarthGPT
- SOTA in Scale & Modality Resilience
- addresses the Multimodality issue
- [Paper](https://arxiv.org/pdf/2401.16822v3)
8. CLOSP
- SOTA in Scale & Modality Resilience
- uses three separate encoders (text, optical, and SAR) and a contrastive loss to align all three into a unified embedding space (a contrastive-alignment sketch follows this list)
- [Paper](https://arxiv.org/pdf/2507.10403v2)
- [Huggingface](https://huggingface.co/DarthReca/CLOSP-Visual)
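A rough sketch of the tri-encoder contrastive alignment idea behind CLOSP, using an InfoNCE-style loss; the random tensors stand in for real optical, SAR, and text encoders, and this is not CLOSP's actual training code.

```python
# InfoNCE-style alignment of three modalities into one embedding space (illustrative).
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for a ViT (optical), a SAR-specific encoder, and a text encoder,
# each projecting into the same embedding dimension.
batch, dim = 8, 512
optical_emb, sar_emb, text_emb = (torch.randn(batch, dim) for _ in range(3))

# Align every modality pair so all three end up in one shared space.
loss = (info_nce(optical_emb, text_emb)
        + info_nce(sar_emb, text_emb)
        + info_nce(optical_emb, sar_emb))
```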
| Model | Handles OBB Grounding *(IoU > 0.7)* | Handles Numeric (Float) VQA | Handles Scale Resilience (0.5–10m) | Handles SAR/IR | Handles 2k Image Efficiency |
|--------------|---------------------------------------------------------------------------|----------------------------------------------------------|-----------------------------------------------------------|---------------------------------------------|----------------------------------------------------|
| **GeoChat** | No (HBB only; poor angle support) | No (Generative; hallucinates) | Partial (RS-tuned, but not scale-adaptive) | No | No |
| **RS-LLaVA** | No | No | Partial | No | No |
| **GeoGround** | Yes (Primary function; uses text-as-geometry) | No | Partial (Trained on ‘refGeo’) | No | No |
| **TIGeR** | No | Yes (Primary function; uses tool-call) | No | No | No |
| **EarthGPT** | Partial (Does grounding, but not specialized OBB) | No | Yes (Visual-enhanced perception) | Yes (MMRS-1M dataset) | Partial (Handles local/global features) |
| **AnySat** | N/A (Encoder only) | N/A (Encoder only) | Yes (Scale-adaptive JEPA) | Yes (Trained on 11 sensors) | N/A (Encoder only) |
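A minimal sketch of the tool-call pattern behind TIGeR-style numeric VQA, as referenced in the TIGeR entry above: the VLM emits a structured call instead of guessing the number, and the backend runs a deterministic geometry routine. The tool name, JSON call format, and `measure_length_m` helper are assumptions for illustration, not TIGeR's actual interface.

```python
# Tool-call pattern for numeric VQA; tool names and call format are assumed.
import json

def measure_length_m(obb, gsd_m_per_px):
    """Longest side of a (cx, cy, w, h, angle) box, converted from pixels to metres."""
    _, _, w, h, _ = obb
    return max(w, h) * gsd_m_per_px

TOOLS = {"measure_length_m": measure_length_m}

def execute_tool_call(model_output, context):
    """Run the tool named in the model's JSON output with image-level context."""
    call = json.loads(model_output)                # e.g. {"tool": ..., "args": {...}}
    return TOOLS[call["tool"]](**call["args"], **context)

# The VLM grounds the runway first, then delegates the measurement to the tool.
model_output = '{"tool": "measure_length_m", "args": {"obb": [800, 400, 2400, 45, 12.0]}}'
print(execute_tool_call(model_output, {"gsd_m_per_px": 0.5}))   # -> 1200.0 (metres)
```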
Some other models to research:
1. [Grounding DINO](https://arxiv.org/pdf/2303.05499): SOTA for open-vocabulary grounding
2. [Rotation-RetinaNet](https://github.com/ming71/Rotated-RetinaNet)
---
## Possible Hybrids
1. GeoGround + TIGeR
- GeoGround is precise
- TIGeR is SOTA for reasoning and float-based outputs, and integrates tool use into VLMs; this will let us perform well on the float metric as well
- ISSUES:
- dependent on the LLM's ability to generatively regress a 5-parameter float vector; accuracy is uncertain because LLMs are fundamentally weak at regression. We could offload the regression to a tool, but it is unclear how to insert a tool call in the middle of the attention blocks
- Tool-use hallucination: the VLM might guess a numeric answer instead of using the tool, or call a tool for a simple, non-numeric VQA query. We need an instruction-tuning dataset that teaches the VLM to differentiate between simple, attribute, and geometric-numeric VQA
2. CLOSP + fine-tuned VLM
- ISSUES:
- the bridge assumes that text descriptions for SAR and optical images are semantically equivalent. The reasoning model needs domain knowledge of SAR-to-optical description conversions; this can't be solved just by applying mathematical transformations when combining the embedding spaces. Creating an agent with domain knowledge to compare text descriptions is one solution, but that means integrating TIGeR here, which would greatly increase latency, and it is unclear how accurate it would be.
3. Global-Local Aggregator VLM
4. Agentic routing (most obvious and basic; unclear how performance will scale):
- SUMMARY: Use a central LLM to route tasks to specialized, expert models
- ARCHITECTURE:
- Use a frozen, instruction-tuned LLM (Llama 3 8B or Flan-T5)
- Build separate, independent expert models for VQA, captioning, and grounding
- Use pre-defined workflows rather than dynamic ones to minimize hallucinations (a minimal routing sketch follows below)
Many permutations and mixtures are possible; these are just a few off the top of my head. I would have to read all the papers and understand the architectures to arrive at a reasonable ensemble.
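A minimal sketch of the predefined-workflow router from hybrid 4, assuming three hypothetical expert models behind simple placeholder functions; the keyword matcher stands in for the frozen instruction-tuned LLM that would classify intent.

```python
# Central router with predefined workflows; the expert functions are placeholders.
def caption_expert(image):          return "A coastal airport with two runways."
def grounding_expert(image, query): return [(512.0, 230.5, 90.0, 24.0, 37.5)]   # OBBs
def vqa_expert(image, query):       return "yes"

WORKFLOWS = {
    "caption": lambda image, query: caption_expert(image),
    "ground":  lambda image, query: grounding_expert(image, query),
    "vqa":     lambda image, query: vqa_expert(image, query),
}

def route(query):
    """Map the query to one fixed workflow; an LLM-based classifier would go here."""
    text = query.lower()
    if any(k in text for k in ("describe", "caption", "summarize")):
        return "caption"
    if any(k in text for k in ("where", "locate", "find", "mark")):
        return "ground"
    return "vqa"

def answer(image, query):
    return WORKFLOWS[route(query)](image, query)

print(answer(None, "Where are the aircraft parked?"))   # -> list of OBBs
```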
---
### Issues to tackle beyond the ones highlighted above
1. Explainability
2. Real-time / efficient processing for high-res imagery: it is unclear how the architecture can be real-time given the complexity, the scale of the models, the multiple levels of pre-processing, and the potential tooling; there is no clear plan yet for how this will be tackled.
3. Label and language mismatch: Expert captions use domain terms; VLMs trained on generic captions output generic language.
4. Balancing large context (global scene meaning) with tiny object detail is unresolved. Hierarchical multi-scale transformers would help, but processing time rises sharply, while ISRO expects real-time, low-latency responses.
5. Current LVLM grounding heads often fail on small pixel-scale objects. Detection-first is good but not perfect.
6. Scarcity of high-quality SAR image–text pairs
---
## PS Highlights
**OBJECTIVE:** interpret and analyze satellite imagery using natural language
**INPUT:** processed satellite image
**PROCESSING:**
1. **Captioning**: A description that accurately summarizes all the key information in the satellite image.
2. **Grounding:** Localize objects within the satellite image based on the natural language query. The outcome will be oriented bounding boxes overlaid on the image.
3. **Visual Question Answering:** Answer questions that address the geometric and semantic attributes of the objects and features in the given image.
**CHALLENGES/DOMAIN-ORIENTED FEATURES:**
1. **Details:** Describe and measure the local and global attributes of small and large objects in the satellite image accurately.
2. **Efficiency:** Analyse images up to 2k x 2k size quickly and efficiently.
3. **Scale resilience:** Be able to handle sampling from 0.5m/pixel to 10m/pixel imagery efficiently.
4. **Multimodality:** work with SAR or Infrared Imagery. Deal with false color composites for optical images.
**PLATFORM**
1. Hosted Website
2. Interface to upload satellite images (L1/L2 processed, converted to 0-255 .png/.jpg format)
3. Chatbot interface (this is the bare minimum; we can also add image-selection interfaces, like zooming into the image or selecting a specific region of it)