<!-- This is the typical structure of a story line; each element might take 1-3 bullet points.
1. Why interesting? What is the general setting/application. Why should people care.
2. How done now? The typical approach(es) to the setting in (1)
3. What is missing? What’s the problem in (2), and what consequences does this have.
4. Proposed solution. What do you do, and why does it solve the problem in (3)
5. Experimental questions. How do you evaluate experimentally that (4) solves the problem in (2) and its consequences in (3).
Important: The story line is stand alone: you cannot introduce a new term/concept before it has been introduced (by motivating why the term/concept is important). -->
<!-- # Annotating autonomous driving data with foundation models -->
<!-- # Using frame-based foundation models instead of sensor fusion for autonomous driving -->
# Enhancing Autonomous Driving: Leveraging Frame-Based Foundation Models for Improved Pedestrian Detection and Sensor Fusion
Explore the potential of foundation models for computer vision to improve pedestrian detection and sensor fusion in the autonomous driving domain.
# Why interesting
- Autonomous driving systems use computer vision models and remote sensing technologies (e.g. LiDAR, radar) to navigate roads and avoid pedestrians.
- Hand-crafted sensor fusion combines all sensor modalities to refine and correct the predictions of computer vision models that run on RGB images.
- Foundation models for computer vision could improve sensor fusion by increasing initial pedestrian detection accuracy and removing hard-coded decision steps.
<!-- Classifiers use features to make decisions. Features could be the original pixel intensity values, or derived properties thereof. Feature extraction is the process of calculating these derived properties. A good feature extractor retains those properties that are most descriptive for the classification task, while reducing the total amount of information passed to the classifier. A good feature extractor generalizes the sample in such a way that objects of the same class are similar. This makes the classification task easier.
In the context of pedestrian classification, a good feature extractor would be invariant to - for example - the illumination of the scene, or the color of the clothes of the pedestrian. That means that a feature extractor would ideally be *invariant* to these properties.
Foundation models are a type of artificial intelligence models that learn from substantially large-scale databases, and that are capable of generalizing to a variety of unseen data and downstream tasks
The emergence of foundation models offers new possibilities for various downstream tasks via knowledge transfer. Despite the recent success of foundation models in various general-purpose tasks, adapting the knowledge of pre-trained foundation models to the autonomous driving domain remains **relatively unexplored**.
We will use data from the new, real-world [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/) dataset. This dataset consists of short recordings of road scenes made by a vehicle driving through the streets of Delft. This vehicle was equipped with a range of sensors, such as cameras, RADAR, and LiDAR. -->
<!-- - evaluate large visual foundation models, that can be used for pedestrian detection from mono RGB images, compare these with existing SOTA methods, or industry methods.
- The emergence of foundation models offers new possibilities for various downstream tasks via knowledge transfer. -->
# How done now
- Current autonomous driving pipelines rely largely on sensor fusion, where LiDAR and radar data are used to increase the prediction accuracy of computer vision models.
- Complex region proposals are computed from LiDAR and radar to filter out unwanted detections (e.g. pedestrian false positives such as people on motorcycles); a sketch of such a rule follows below.
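To make the hand-crafted step concrete, the sketch below projects LiDAR points into the image plane and keeps only 2D detections that are supported by enough 3D points. This is only an illustration of the general idea; the calibration matrices, box format, and point threshold are hypothetical placeholders, not the actual View-of-Delft calibration.

```python
import numpy as np

def project_lidar_to_image(points_xyz, T_cam_lidar, K):
    """Project LiDAR points (N, 3) into pixel coordinates.

    T_cam_lidar: 4x4 extrinsic matrix (LiDAR frame -> camera frame), placeholder.
    K: 3x3 camera intrinsic matrix, placeholder.
    """
    pts_h = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # homogeneous coords
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                      # points in camera frame
    in_front = pts_cam[:, 2] > 0                                    # drop points behind the camera
    uvw = (K @ pts_cam[in_front].T).T
    return uvw[:, :2] / uvw[:, 2:3]                                 # perspective division -> (M, 2)

def filter_detections(boxes_2d, points_xyz, T_cam_lidar, K, min_points=10):
    """Keep only 2D boxes (x0, y0, x1, y1) containing at least `min_points` projected LiDAR points.

    Mimics a typical hand-crafted fusion rule for suppressing false positives.
    """
    uv = project_lidar_to_image(points_xyz, T_cam_lidar, K)
    kept = []
    for (x0, y0, x1, y1) in boxes_2d:
        inside = (uv[:, 0] >= x0) & (uv[:, 0] <= x1) & (uv[:, 1] >= y0) & (uv[:, 1] <= y1)
        if inside.sum() >= min_points:
            kept.append((x0, y0, x1, y1))
    return kept
```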
<!-- - Autonomous driving datasets are typically sampled with multi modalities of sensors, in addition to RGB images, LiDAR and radar data is usually available as well.
- Sensor fusion and 3D priors are used for smart region proposals where a head classifier can be used to perform detection. -->
<!-- # Shortcomings
- the visual content in autonomous driving scenarios differs significantly from the iconic datasets that are widely used to train foundation models
- autonomous driving data exhibits strong temporal correlations from frame to frame -->
# What is missing
- Hand-crafted sensor fusion is limited by hard-coded 3D calculations and assumptions.
- Sampling multiple data modalities in real time is more complex and computationally intensive than relying on a single data source.
<!-- - Sampling more data modalities in real time is more complex and computationally intensive than having one source of data
- Mounting only one sensor type, i.e. cameras, on some machine (e.g. a car) would simplify the hardware and software design -->
# Proposed solution
<!-- In this project, we propose to explore the generalization ability of the foundation models in the autonomous driving domain. Firstly, we would like to build an off-line semi-automatic annotator on top of a visual foundation model to supplement pixel-level labels for the autonomous driving dataset, [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/). For example:
- use DINOv2 and Segment Anything Model (SAM) to split an input image into several semantically meaningful segments.
- given a few labeled segments, we would like to develop a semantic–aware head to label the remaining segments, by fine-tuning the pre-trained foundation models. -->
<!-- In this project, we propose to explore the generalization ability of the foundation models in the autonomous driving domain, using DINOv2 and Segment Anything Model (SAM) for feature extraction and pedestrian segmentation respectively, for the autonomous driving dataset [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/). -->
<!-- - Explore the generalization ability of foundational models in the autonomous driving domain to improved pedestrian detection and sensor fusion.
- Detect pedestrian bounding boxes using finetuned [YOLO-NAS](https://github.com/Deci-AI/super-gradients/blob/master/YOLONAS.md) on [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/) dataset.
- Propose 3D bounding boxes predictions using 2D YOLO output and raw LiDAR ([3D-Box-Segment-Anything](https://github.com/dvlab-research/3D-Box-Segment-Anything/tree/main)). -->
- Detect pedestrian bounding boxes using foundation models on the [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/) dataset (a minimal detection sketch follows below).
- Eliminate hand-crafted rules for sensor fusion by directly learning features from multiple sensor modalities (see the fusion-head sketch below).
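The exact detector is still an open choice (the commented drafts above mention DINOv2, SAM, and YOLO-NAS). As one hedged starting point, the sketch below uses an open-vocabulary detector, OWL-ViT via Hugging Face `transformers`, prompted with pedestrian-related text queries. The checkpoint name, image path, prompts, and score threshold are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Illustrative checkpoint; any open-vocabulary detector with a similar interface would do.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32").eval()

image = Image.open("frame_000123.jpg")            # hypothetical View-of-Delft camera frame
prompts = [["a pedestrian", "a person walking"]]  # text queries standing in for a "pedestrian" class

inputs = processor(text=prompts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in pixel coordinates; the threshold is a tunable placeholder.
target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]

for box, score in zip(results["boxes"], results["scores"]):
    x0, y0, x1, y1 = box.tolist()
    print(f"pedestrian candidate at ({x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}), score {score:.2f}")
```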
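How the learned fusion should look is also open. As a minimal sketch, assuming we already have a per-detection image feature (e.g. from a frozen foundation-model backbone) and pooled LiDAR/radar features for the same region, a small PyTorch head could learn to accept or reject detections instead of relying on hard-coded rules; all feature dimensions below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class LearnedFusionHead(nn.Module):
    """Scores a detection from concatenated camera and LiDAR/radar features.

    Replaces hand-crafted acceptance rules (e.g. "enough LiDAR points in the box")
    with a learned decision. Feature dimensions are illustrative.
    """

    def __init__(self, img_dim=768, lidar_dim=64, radar_dim=16, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + lidar_dim + radar_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: is this detection a real pedestrian?
        )

    def forward(self, img_feat, lidar_feat, radar_feat):
        fused = torch.cat([img_feat, lidar_feat, radar_feat], dim=-1)
        return self.mlp(fused).squeeze(-1)

# Toy forward pass with random features for a batch of 4 candidate detections.
head = LearnedFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 64), torch.randn(4, 16))
print(torch.sigmoid(logits))  # per-detection confidence after fusion
```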
# Experimental questions
- Can foundation models’ implicit knowledge of driving-related imagery be leveraged to accurately detect pedestrians in 2D?
- How well do foundation models localize pedestrians in 3D relative to the ego vehicle?
- How does learned sensor fusion compare to hand-crafted sensor fusion?
- How suitable are foundation models for real-time use in an autonomous vehicle? (Evaluation sketches for detection accuracy and latency follow below.)
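For the 2D detection question and the learned-vs-hand-crafted comparison, mean average precision over pedestrian boxes is the natural metric (as the commented note below also suggests). A minimal `torchmetrics` sketch, with toy boxes standing in for real predictions and View-of-Delft ground truth:

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy")

# Toy example: one image, one predicted pedestrian box vs. one ground-truth box.
preds = [{
    "boxes": torch.tensor([[100.0, 200.0, 150.0, 330.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([0]),   # single class: pedestrian
}]
targets = [{
    "boxes": torch.tensor([[102.0, 198.0, 148.0, 335.0]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
print(metric.compute()["map"])  # overall mAP; call update() over the whole test split in practice
```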
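Real-time suitability can be probed with a rough latency measurement of whichever detector is chosen. The sketch below times repeated forward passes of a placeholder model on a dummy full-HD frame; the model and input resolution are stand-ins to be swapped for the actual detector and camera resolution.

```python
import time
import torch

def benchmark(model, example_input, warmup=10, iters=50):
    """Rough frames-per-second estimate for a single-image forward pass (CPU timing)."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):        # warm-up iterations to stabilize timing
            model(example_input)
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
        elapsed = time.perf_counter() - start
    return iters / elapsed

# Placeholder model and input; replace with the chosen detector and real frame size.
dummy_model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
fps = benchmark(dummy_model, torch.randn(1, 3, 1080, 1920))
print(f"~{fps:.1f} frames per second")
```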
<!-- - For the task of pedestrian detection, evaluate mAP of proposed solution on View-of-Delft dataset.
- Compare it to existing solution using YOLOv4, LiDAR, and radar. -->