<!-- This is the typical structure of a story line; each element might take 1-3 bullet points.
1. Why interesting? What is the general setting/application. Why should people care.
2. How done now? The typical approach(es) to the setting in (1)
3. What is missing? What’s the problem in (2), and what consequences does this have.
4. Proposed solution. What do you do, and why does it solve the problem in (3)
5. Experimental questions. How do you evaluate experimentally that (4) solves the problem in (2) and its consequences in (3).
Important: The story line is stand alone: you cannot introduce a new term/concept before it has been introduced (by motivating why the term/concept is important). -->
<!-- # Annotating autonomous driving data with foundation models -->
<!-- # Using frame-based foundation models instead of sensor fusion for autonomous driving -->
# Enhancing Autonomous Driving: Leveraging Frame-Based Foundation Models for Improved Pedestrian Detection and Sensor Fusion
Explore the potential of foundation models for computer vision to improve pedestrian detection and sensor fusion in the autonomous driving domain.
# Why interesting
- Autonomous driving systems use computer vision models and remote sensing technologies (e.g. LiDAR, radar) to navigate roads and avoid pedestrians.
- Hand-crafted sensor fusion combines all sensor modalities to refine and correct the predictions of computer vision models that run on RGB images.
- Foundation models for computer vision could improve sensor fusion by increasing initial pedestrian detection accuracy and removing hard-coded decision steps.
<!-- Classifiers use features to make decisions. Features could be the original pixel intensity values, or derived properties thereof. Feature extraction is the process of calculating these derived properties. A good feature extractor retains those properties that are most descriptive for the classification task, while reducing the total amount of information passed to the classifier. A good feature extractor generalizes the sample in such a way that objects of the same class are similar. This makes the classification task easier.
In the context of pedestrian classification, a good feature extractor would be invariant to - for example - the illumination of the scene, or the color of the clothes of the pedestrian. That means that a feature extractor would ideally be *invariant* to these properties.
Foundation models are a type of artificial intelligence models that learn from substantially large-scale databases, and that are capable of generalizing to a variety of unseen data and downstream tasks
The emergence of foundation models offers new possibilities for various downstream tasks via knowledge transfer. Despite the recent success of foundation models in various general-purpose tasks, adapting the knowledge of pre-trained foundation models to the autonomous driving domain remains **relatively unexplored**.
We will use data from the new, real-world [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/) dataset. This dataset consists of short recordings of road scenes made by a vehicle driving through the streets of Delft. This vehicle was equipped with a range of sensors, such as cameras, RADAR, and LiDAR. -->
<!-- - evaluate large visual foundation models, that can be used for pedestrian detection from mono RGB images, compare these with existing SOTA methods, or industry methods.
- The emergence of foundation models offers new possibilities for various downstream tasks via knowledge transfer. -->
# How done now
- Current autonomous driving pipelines rely largely on sensor fusion, where LiDAR and radar data are used to increase the prediction accuracy of computer vision models.
- Complex region proposals are computed from LiDAR and radar to filter out unwanted detections (e.g. pedestrian false positives such as people on motorcycles); a sketch of such a rule follows below.
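To make the hand-crafted step concrete, the sketch below projects LiDAR points into the image plane and keeps only 2D detections that are supported by enough 3D points. This is only an illustration of the general idea; the calibration matrices, box format, and point threshold are hypothetical placeholders, not the actual View-of-Delft calibration.

```python
import numpy as np

def project_lidar_to_image(points_xyz, T_cam_lidar, K):
    """Project LiDAR points (N, 3) into pixel coordinates.

    T_cam_lidar: 4x4 extrinsic matrix (LiDAR frame -> camera frame), placeholder.
    K: 3x3 camera intrinsic matrix, placeholder.
    """
    pts_h = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # homogeneous coords
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                      # points in camera frame
    in_front = pts_cam[:, 2] > 0                                    # drop points behind the camera
    uvw = (K @ pts_cam[in_front].T).T
    return uvw[:, :2] / uvw[:, 2:3]                                 # perspective division -> (M, 2)

def filter_detections(boxes_2d, points_xyz, T_cam_lidar, K, min_points=10):
    """Keep only 2D boxes (x0, y0, x1, y1) containing at least `min_points` projected LiDAR points.

    Mimics a typical hand-crafted fusion rule for suppressing false positives.
    """
    uv = project_lidar_to_image(points_xyz, T_cam_lidar, K)
    kept = []
    for (x0, y0, x1, y1) in boxes_2d:
        inside = (uv[:, 0] >= x0) & (uv[:, 0] <= x1) & (uv[:, 1] >= y0) & (uv[:, 1] <= y1)
        if inside.sum() >= min_points:
            kept.append((x0, y0, x1, y1))
    return kept
```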
<!-- - Autonomous driving datasets are typically sampled with multi modalities of sensors, in addition to RGB images, LiDAR and radar data is usually available as well.
- Sensor fusion and 3D priors are used for smart region proposals where a head classifier can be used to perform detection. -->
<!-- # Shortcomings
- the visual content in autonomous driving scenarios differs significantly from the iconic datasets that are widely used to train foundation models
- autonomous driving data exhibits strong temporal correlations from frame to frame -->
# What is missing
- Hand-crafted sensor fusion is limited by hard-coded 3D calculations and assumptions.
- Sampling multiple data modalities in real time is more complex and computationally intensive than relying on a single data source.
<!-- - Sampling more data modalities in real time is more complex and computationally intensive than having one source of data
- Mounting only one sensor type, i.e. cameras, on some machine (e.g. a car) would simplify the hardware and software design -->
# Proposed solution
<!-- In this project, we propose to explore the generalization ability of the foundation models in the autonomous driving domain. Firstly, we would like to build an off-line semi-automatic annotator on top of a visual foundation model to supplement pixel-level labels for the autonomous driving dataset, [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/). For example:
- use DINOv2 and Segment Anything Model (SAM) to split an input image into several semantically meaningful segments.
- given a few labeled segments, we would like to develop a semantic–aware head to label the remaining segments, by fine-tuning the pre-trained foundation models. -->
<!-- In this project, we propose to explore the generalization ability of the foundation models in the autonomous driving domain, using DINOv2 and Segment Anything Model (SAM) for feature extraction and pedestrian segmentation respectively, for the autonomous driving dataset [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/). -->
<!-- - Explore the generalization ability of foundational models in the autonomous driving domain to improved pedestrian detection and sensor fusion.
- Detect pedestrian bounding boxes using finetuned [YOLO-NAS](https://github.com/Deci-AI/super-gradients/blob/master/YOLONAS.md) on [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/) dataset.
- Propose 3D bounding boxes predictions using 2D YOLO output and raw LiDAR ([3D-Box-Segment-Anything](https://github.com/dvlab-research/3D-Box-Segment-Anything/tree/main)). -->
- Detect pedestrian bounding boxes using foundation models on the [View-of-Delft](https://intelligent-vehicles.org/datasets/view-of-delft/) dataset (a minimal detection sketch follows below).
- Eliminate hand-crafted rules for sensor fusion by directly learning features from multiple sensor modalities (see the fusion-head sketch below).
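The exact detector is still an open choice (the commented drafts above mention DINOv2, SAM, and YOLO-NAS). As one hedged starting point, the sketch below uses an open-vocabulary detector, OWL-ViT via Hugging Face `transformers`, prompted with pedestrian-related text queries. The checkpoint name, image path, prompts, and score threshold are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Illustrative checkpoint; any open-vocabulary detector with a similar interface would do.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32").eval()

image = Image.open("frame_000123.jpg")            # hypothetical View-of-Delft camera frame
prompts = [["a pedestrian", "a person walking"]]  # text queries standing in for a "pedestrian" class

inputs = processor(text=prompts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in pixel coordinates; the threshold is a tunable placeholder.
target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]

for box, score in zip(results["boxes"], results["scores"]):
    x0, y0, x1, y1 = box.tolist()
    print(f"pedestrian candidate at ({x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}), score {score:.2f}")
```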
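How the learned fusion should look is also open. As a minimal sketch, assuming we already have a per-detection image feature (e.g. from a frozen foundation-model backbone) and pooled LiDAR/radar features for the same region, a small PyTorch head could learn to accept or reject detections instead of relying on hard-coded rules; all feature dimensions below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class LearnedFusionHead(nn.Module):
    """Scores a detection from concatenated camera and LiDAR/radar features.

    Replaces hand-crafted acceptance rules (e.g. "enough LiDAR points in the box")
    with a learned decision. Feature dimensions are illustrative.
    """

    def __init__(self, img_dim=768, lidar_dim=64, radar_dim=16, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + lidar_dim + radar_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: is this detection a real pedestrian?
        )

    def forward(self, img_feat, lidar_feat, radar_feat):
        fused = torch.cat([img_feat, lidar_feat, radar_feat], dim=-1)
        return self.mlp(fused).squeeze(-1)

# Toy forward pass with random features for a batch of 4 candidate detections.
head = LearnedFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 64), torch.randn(4, 16))
print(torch.sigmoid(logits))  # per-detection confidence after fusion
```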
# Experimental questions
- Can foundation models’ implicit knowledge of driving-related imagery be leveraged to accurately detect pedestrians in 2D?
- How well do foundation models localize pedestrians in 3D relative to the ego vehicle?
- How does learned sensor fusion compare to hand-crafted sensor fusion?
- How suitable are foundation models for real-time use in an autonomous vehicle? (Evaluation sketches for detection accuracy and latency follow below.)
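For the 2D detection question and the learned-vs-hand-crafted comparison, mean average precision over pedestrian boxes is the natural metric (as the commented note below also suggests). A minimal `torchmetrics` sketch, with toy boxes standing in for real predictions and View-of-Delft ground truth:

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy")

# Toy example: one image, one predicted pedestrian box vs. one ground-truth box.
preds = [{
    "boxes": torch.tensor([[100.0, 200.0, 150.0, 330.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([0]),   # single class: pedestrian
}]
targets = [{
    "boxes": torch.tensor([[102.0, 198.0, 148.0, 335.0]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
print(metric.compute()["map"])  # overall mAP; call update() over the whole test split in practice
```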
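Real-time suitability can be probed with a rough latency measurement of whichever detector is chosen. The sketch below times repeated forward passes of a placeholder model on a dummy full-HD frame; the model and input resolution are stand-ins to be swapped for the actual detector and camera resolution.

```python
import time
import torch

def benchmark(model, example_input, warmup=10, iters=50):
    """Rough frames-per-second estimate for a single-image forward pass (CPU timing)."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):        # warm-up iterations to stabilize timing
            model(example_input)
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
        elapsed = time.perf_counter() - start
    return iters / elapsed

# Placeholder model and input; replace with the chosen detector and real frame size.
dummy_model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
fps = benchmark(dummy_model, torch.randn(1, 3, 1080, 1920))
print(f"~{fps:.1f} frames per second")
```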
<!-- - For the task of pedestrian detection, evaluate mAP of proposed solution on View-of-Delft dataset.
- Compare it to existing solution using YOLOv4, LiDAR, and radar. -->