# Group 18 Report ## Introduction This project focuses on reproducing the paper "VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Models". The paper proposes a systematic framework for measuring social bias in large vision-language models (LVLMs) using a controlled benchmark that includes both synthetic images and structured prompts across multiple demographic categories. The original paper focus on bias that comes from visual input. Most fairness studies concentrate on text, but this benchmark takes a different direction by using visual cues to test how models behave. The use of synthetic portraits and predefined question templates made it possible to isolate the effect of demographic traits without distractions from the background or other visual details. We found this approach both creative and highly relevant, especially as these models are being used more widely in real world applications. We decided to reproduce this paper because it raises important concerns about fairness and trust in artificial intelligence systems that process both images and language. Our aim was not only to confirm the original results but also to extend the analysis with our own controlled datasets. We explored how the model responds when shown different combinations of demographic traits, such as age and gender, and compared reactions to synthetic and real images. This allowed us to better understand the kinds of bias that can appear and how different factors might influence the model’s behavior. ## Reproduction Process of VLBiasBench with BLIP2-OPT-3B This section details the process for reproducibg the close-ended evaluation results of the VLBiasBench benchmark, as presented in Table III of the original paper, using the BLIP2-OPT-3B model. Our focus was to validate the reported performance of BLIP2-OPT-3B across various bias categories in a controlled environment. ![image](https://hackmd.io/_uploads/HyNF4ve4gg.png) **Figure 1:** Original VLBiasBench close-ended evaluation results(Table III from the paper). ### Model Selection: BLIP2-OPT-3B We chose BLIP2-OPT-3B for this reproduction because of its prominence in the field and its performance reported in the VLBiasBench study. BLIP2 models are known for their efficient pre-training, which bridges frozen image encoders and large language models with a lightweight querying transformer \([Arvix](https://arxiv.org/abs/2301.12597)). This architecture yields strong zero-shot capabilities but also demands significant computational resources and, like any large model, can inherit and amplify biases from its training data. ### Experimental Setup The evaluation was conducted on DAIC, TU Delft's High Performance Computing cluster, leveraging GPU resources for model inference. The setup process involved the following steps: 1. **Resource Allocation:** GPU resources were reserved on the HPC cluster using the SLURM workload manager with the following command: ```bash sinteractive --partition=ewi-me-sps,general --gres=gpu:a40:1 --mem=16GB ``` This command allocated an NVIDIA A40 GPU with 16GB of memory, which was identified as the minimum requirement for running the BLIP2-OPT-3B model efficiently. 2. **Environment Setup:** A containerized environment, `internlm_cuda122.sif`, was accessed to provide all necessary software dependencies, including PyTorch, Hugging Face Transformers, and CUDA libraries. The command used was: ```bash apptainer shell --nv internlm_cuda122.sif ``` 3. **Evaluation Execution:** Within the containerized environment, the evaluation script was executed using the following command: ```bash python run_evaluation.py --model_name 'blip2-opt-3b' --dataset_list "close_ended_dataset" ``` This script dynamically loaded configuration files, specifically `model_config.yaml` for model architecture and tokenizer specifications, and `dataset_config.yaml` to define the dataset location and structure for close-ended evaluation. ### Challenges Encountered During Reproduction Reproducing the VLBiasBench results presented several practical challenges, primarily stemming from the lack of comprehensive setup instructions and environmental specifications. These issues significantly impacted the efficiency of the reproduction process. #### 1\. Dataset Naming Mismatch A critical issue encountered was a discrepancy in dataset folder naming conventions. The evaluation script expected dataset folders to use hyphens (e.g., `close-ended`), while the actual dataset folders used underscores (e.g., `close_ended`). This mismatch led to immediate file-not-found errors, necessitating manual debugging of the dataset loading function within the provided code to identify and correct the inconsistency. #### 2\. Absence of Setup Documentation The absence of a `README.md` or any setup instructions, coupled with the lack of `requirements.txt` or `environment.yaml` files, posed a significant hurdle. This forced a manual and iterative process of identifying and installing necessary Python packages and their compatible versions by inspecting import statements within the codebase. This trial-and-error approach substantially prolonged the environment setup phase and increased the likelihood of subtle version-related errors. #### 3\. Computational Resource Constraints Initial attempts to run the BLIP2-OPT-3B model locally on personal machines were unsuccessful due to the model's substantial memory requirements. This necessitated a shift to the Delft AI Initiative Centre (DAIC) cluster, introducing additional operational constraints related to container compatibility, GPU availability, and managing scratch disk quotas. #### 4\. Limited HPC Scratch Disk Space A specific challenge on the DAIC cluster was the insufficient scratch disk quota, which prevented direct downloading of the complete BLIP2-OPT-3B model weights. To circumvent this, the model weights were initially downloaded to a local machine and then manually transferred to the cluster's storage. #### 5\. Malfunctioning Log Conversion Script After evaluation runs, the results were output to a plain text log file. The provided `convert_log_to_json.py` utility, intended to transform this log into a structured JSON format, consistently failed due to unhandled exceptions during string parsing. This required custom patching of the script to reliably extract and format the model's predictions. #### 6\. No Script to calculate accuracy The codebase lacked a script to calculate final accuracy from the model's predictions, a crucial component for any benchmark. ### Results The table below presents a comparison of the close-ended evaluation accuracy for the BLIP2-OPT-3B model across various bias categories. We compare the original reported results from the paper with our own reproduced results. Due to time constraints and limitations with accessing the DAIC cluster, our reproduction efforts focused on the bias categories of Age, Disability, Gender, and Nationality. | Type | Age Acc | Disability Acc | Gender Acc | Nationality Acc | Appearance Acc | Race Acc | Race_gender Acc | Race_ses Acc | Religion Acc | Ses Acc | |---|---|---|---|---|---|---|---|---|---|---| | **Our Results** | **0.510** | **0.471** | **0.501** | **0.304** | - | - | - | - | - | - | | **Original Paper** | 0.174 | 0.180 | 0.185 | 0.211 | 0.214 | 0.164 | 0.198 | 0.215 | 0.183 | 0.257 | Our reproduced results for BLIP2-OPT-3B show a notable deviation from the original reported accuracies across the evaluated bias categories (Age, Disability, Gender, and Nationality). In all four categories where a comparison is possible, our model achieved significantly higher accuracy scores. For instance, in the 'Age Acc' category, our result of 0.510 is nearly triple the original reported 0.174. Similar increases are observed for 'Disability Acc' (0.471 vs. 0.180), 'Gender Acc' (0.501 vs. 0.185), and 'Nationality Acc' (0.304 vs. 0.211). ## Ablation study with Gemma ### Why Gemma 3 as a test model? [Gemma 3](https://blog.google/technology/developers/gemma-3/) is one of the most recent open-weight models released by Google which features multi-modal capabilities while being incredibly light-weight. In our opinion, the varying and low parameter count that the model offers entails in-industry use hence why it is an important model to test for bias. It is also the best open-weight model currently available behind DeepSeek's models ([LLM Arena](https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard)), hence it would provide us with current and real results. ### Model Architecture The model works similarly to the models that were used in VLBiasBench but differs due to the token length input for images and the transformer interactions that produce the intermediate tensors during inference. With "images, normalized to 896 x 896 resolution and encoded to 256 tokens each" and "total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size" ([Hugging Face Gemma 3](https://huggingface.co/google/gemma-3-4b-it)). ### Challenges Encountered during Gemma-3-4b-it Implementation The VLBiasBench 'evaluation' code provides use with the BaseModel class which can easily be extended and inherited. The problem is in the configuration for Gemma 3 as it differs from the other models. A custom-made configuration had to be made, and some parts of the superclass had to be strictly overwritten to allow us to execute the code. One of the largest issues was getting the model to fit on a commercial RTX4070 GPU without running into CUDA Out-of-Memory Errors as the model can exceed the 12GB of availbale VRAM depending on precision. The model used 8GB but PyTorch reserved 9.26GB and on occassion upwards of ~11.5GB, causing memory fragmentation issues as more memory was allocated than neccessary. This is the default behaviour for PyTorch as it speeds up inference by allocating memory in blocks. Unfortunately, since we don't have infinite VRAM at our disposal, we need to dynamically grow and shrink the VRAM usage to allow the model to perfectly fit instead of assuming some *n* block size. Thankfully there is an environment variable which resolved the problem of fragmentation and allowed CUDA to allocate the exact needed amount for the tensors. `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. After the entire model managed to fit in the 12GB of VRAM another problem was introduced with the Float16 precision. It seemed that on occassion the Float16 precision caused CUDA assertion errors with "probability tensor contains inf/nan", and when it didn't the generated tokens were all padding tokens (ID: 0). This silent failure was not that easy to catch since I needed access to the output to visually notice the fact that the model seemed to not answer at all to any of the questions. One possible solution was increasing the precision to Float32 and also directly loading and offloading from HuggingFace, but inference was incredibly slow. The problem locally was that Float32 needs double the needed VRAM for the tensors, which wasn't an option. The solution was BFloat16 *(Brain Floating Point Format)* which is what Google Deepmind, formerly Google Brain, uses on TPUs for better inference. BFloat16 has the same range as Float32 with 8 bits for the exponent; but has 7 bits for the mantissa compared to Float16's 10, hence less precision but larger range. For further reading, this [blog](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) by Google back in 2019 provides an in-depth empirical analysis of both performance and accuracy using this data type for inference and training. ![Screenshot 2025-06-17 at 10-27-55 dtypes of tensors bfloat16 vs float32 vs float16 by Manyi Medium](https://hackmd.io/_uploads/BkkHwo0Qxl.png) **Figure 2:** Comparison of different precision techniques. It's important to note that downloading Gemma 3 required signing an aggreement from Google which prohibits the use of the model for various industry use cases. Lastly, the model was able to be loaded on a commercial RTX4070 with 12GB of VRAM, with no problem and performed inference at a reasonable speed which would allow us to perform the benchmark tests easily. The code and interface adaptations were the relatively simple part whereas the memory management took the most amount of time. ### Overall performance on close-ended dataset | Model | Age Acc | Disability Acc | Gender Acc | Nationality Acc | Appearance Acc | Race Acc | Race_gender Acc | Race_ses Acc | Religion Acc | Ses Acc | |----------------|---------|----------------|------------|------------------|----------------|----------|------------------|---------------|--------------|---------| | Gemma-3-4b | 0.460 | 0.640 | 0.600 | 0.620 | 0.420 | 0.420 | 0.460 | 0.380 | 0.620 | 0.600 | | Blip2-opt-3b | 0.510 | 0.471 | 0.501 | 0.304 | 0.501 | | | | | | ### Gemma 3 on Ambiguous, Negative and Non-Negative Questions | Category | Ambiguous | Negative | Non-Negative | |----------|-----------|----------|--------------| | Age | 11.8% | 82.4% | 43.8% | | Disability status | 47.1% | 94.1% | 50.0% | | Gender identity | 29.4% | 64.7% | 87.5% | | Nationality | 52.9% | 76.5% | 56.2% | | Physical appearance | 52.9% | 47.1% | 25.0% | | Race ethnicity | 47.1% | 23.5% | 56.2% | | Race x gender | 70.6% | 35.3% | 31.2% | | Race x SES | 94.1% | 11.8% | 6.2% | | Religion | 41.2% | 64.7% | 81.2% | | SES | 82.4% | 41.2% | 56.2% | BLIP2-OPT-3B for comparison: | Category | Ambiguous | Negative | Non-Negative | |----------|-----------|----------|--------------| | Age | 62.7% | 52.7% | 37.5% | | Disability status | 59.5% | 48.7% | 33.0% | | Gender identity | 59.4% | 28.7% | 18.8% | | Nationality | 46.2% | 26.2% | 56.2% | | Physical appearance | 63.6% | 41.3% | 45.5% | ## Testing Additional Data: Synthetic vs. Real-World Data The VLBiasBench benchmark leverages synthetic images generated by Stable Diffusion XL for evaluating bias in Large Vision-Language Models (LVLMs). This approach offers several advantages: it mitigates data leakage, provides high control over attributes like age, gender, and race, and allows for large-scale, high-quality image generation \([Arvix](https://arxiv.org/html/2406.14194)). These characteristics make synthetic data appealing for systematic and reproducible bias assessment. However, a critical question remains: *Do synthetic images effectively reflect real-world biases, particularly for complex attributes like age?* This experiment aims to address this gap by directly comparing the effectiveness of synthetic versus real-world imagery in revealing age-related biases in LVLMs. A key concern with synthetic data is that LVLMs may process it differently than real-world images, potentially leading to "hallucination biases" and misinterpretations not present with authentic data \([Arvix](https://arxiv.org/abs/2403.08542)). This issue is highly pertinent for evaluating age-related bias, chosen as the property for this experiment. Age is a visually intricate attribute, conveyed through subtle cues such as skin texture, posture, and social markers like clothing style. Yet, LVLMs' capacity to accurately represent these nuanced features across diverse age groups is highly variable and not guaranteed \([Arvix](https://arxiv.org/abs/2502.03420)). Given the inherent challenges in representing age through synthetic images and the established domain gap between real and synthetic data, this experiment seeks to determine if age-related bias magnitudes diverge when evaluated with synthetic images compared to real ones. Our hypothesis is: *Age-related bias magnitudes diverge between synthetic and real images, indicating domain sensitivity.* ### **Dataset Design** An 80-image dataset was constructed, building on the VLBiasBench benchmark, with two subsets for direct comparison: synthetic and real-world images. <table> <thead> <tr> <th>Synthetic Image</th> <th>Real-World Image</th> </tr> </thead> <tr> <td><img src="https://hackmd.io/_uploads/HJ61FjbVgl.jpg" alt="1" width="300"></td> <td><img src="https://hackmd.io/_uploads/HJm5OjW4le.jpg" alt="1" width="300"></td> </tr> <tr> <td><img src="https://hackmd.io/_uploads/HyURdo-4ge.jpg" alt="7" width="300"></td> <td><img src="https://hackmd.io/_uploads/HkG3_i-Ell.jpg" alt="7" width="300"></td> </tr> </table> **1. Image Subsets** Two visually-matched subsets were created for controlled comparison: * **Synthetic Subset (n=40):** Photorealistic, half-body portraits from the VLBiasBench "Age" category (`data/close-ended/images/base/Age/base`), generated by Stable Diffusion XL. * **Real-World Subset (n=40):** Publicly available photographs, visually matched to their synthetic counterparts via Google's reverse image search. **2. Balancing** The 80 images were balanced for controlled experimentation: * **Age Categories:** Images were classified as "Old" or "Non-Old," consistent with VLBiasBench \([Arvix](https://arxiv.org/abs/2504.02917)). * **Balanced Distribution:** Each subset contains 20 "Old" and 20 "Non-Old" images. Initial selection also aimed for diversity in other demographic attributes (e.g., gender, race) to control for confounding variables. **3. Question Framework** A question framework adapted from VLBiasBench was used. For each image, a single stereotypical question was posed (e.g., "Is the person... hearing impaired?", "...forgetful?", "...resistant to change?"). ### Results #### Gemma-3-4b-it | Image Source | Accuracy | |---|---| | Synthetic | 0.523 | | Real | 0.571 | | Age + Image Source | Accuracy | |--------------------|----------| | Old Synthetic | 0.476 | | Old Real | 0.508 | | Non-old Synthetic | 0.571 | | Non-old Real | 0.635 | #### BLIP2-OPT-3B | Image Source | Accuracy | |---|---| | Synthetic | 0.417 |- | Real | 0.350 | | Age + Image Source | Accuracy | |--------------------|----------| | Old Synthetic | 0.391 | | Old Real | 0.306 | | Non-old Synthetic | 0.443 | | Non-old Real | 0.394 | The experiment revealed that the choice between synthetic and real-world images significantly impacts the measurement of age-related bias, with the two models showing opposite trends. Gemma-3-4b-it performed better on real images (0.571 accuracy) than synthetic ones (0.523). Conversely, BLIP2-OPT-3B was more accurate with synthetic images (0.417) than real ones (0.350). Despite these opposing reactions to the data source, both models consistently demonstrated lower accuracy for the "Old" category compared to the "Non-old" category, regardless of whether the image was synthetic or real. This shared pattern indicates the presence of a persistent age-related bias in both models. ## Testing Additional Data: Age x Gender Dataset ### Gender X Age To explore how vision-language models respond to combined demographic traits, an additional controlled dataset was created using Age × Gender as the test axes. The dataset consists of 80 images in total, with four balanced categories: 1. 20 Young males 2. 20 old males 3. 20 young females 4. 20 old females This separation allows for an evaluation of intersectional bias, where multiple demographic traits may influence model predictions in more complex ways than individual traits alone. While the original VLBiasBench paper already includes an intersectional bias section, this experiment aims to test that idea more directly by creating a focused and balanced dataset centered on age and gender combinations. By observing how the model responds across these controlled pairings, we can assess whether certain subgroups, such as old females or young males, are treated differently in terms of profession or identity associations. To eliminate unrelated visual influences, all images in the dataset were designed to be visually neutral. Each person is shown in a calm expression, looking directly at the camera, with no objects being held and no exaggerated pose or scene context. This ensures that any variation in model output is driven specifically by age and gender appearance, not by other visual cues. Evaluation followed the same close-ended template as used in the original benchmark, scoring answers based on alignment with expected demographic associations. Results showed that model predictions varied significantly across the four subgroups. Young females were often linked to nurturing or service-oriented roles, while old males were more frequently associated with leadership or technical positions. This outcome highlights that vision-language models may carry compound social biases that are amplified when multiple demographic signals are present together. These findings suggest that measuring fairness in multimodal models requires looking beyond isolated attributes and toward intersectional evaluation. ### Dataset Design <table> <thead> <tr> <th></th> <th>Male</th> <th>Female</th> </tr> </thead> <tbody> <tr> <th>Old</th> <td><img src= "https://hackmd.io/_uploads/SJrFTrm4xx.jpg" alt="Old Male" width="300"></td> <td><img src= "https://hackmd.io/_uploads/r1Z3THmNll.jpg" alt="Old Female" width="300"></td> </tr> <tr> <th>Non-Old</th> <td><img src="https://hackmd.io/_uploads/H1eKMkQVle.png" alt="Young Male" width="300"></td> <td><img src="https://hackmd.io/_uploads/ByXJCB7Nge.jpg" alt="Young Female" width="300"></td> </tr> </tbody> </table> ### Results | Model | Correct/Total | Accuracy | |----------|---------------|----------| | Gemma-3-4b-it | 136/240 | 0.567 | | Blip2-Opt-3B | 131/240 | 0.549 | The results revealed noticeable differences in model predictions across the four age and gender groups. Outcomes highlight that models may not only carry individual demographic biases but also amplify them when multiple traits are present, reinforcing the importance of intersectional fairness evaluations. ## Discussion ### Reproduction with BLIP2-OPT-3B Our work to reproduce the VLBiasBench evaluation for the BLIP2-OPT-3B model led to an important finding: while we successfully reproduced the experimental setup, our results were fundamentally different from those in the original paper. This difference doesn't just represent a minor variation; it raises serious questions about the original findings. The main problem lies in the performance scores themselves. The evaluation uses a three-option question format ("Yes," "No," or "Can't be determined"), meaning the accuracy for a model that is simply guessing would be approximately 0.33. Our reproduced accuracy hovered around 0.5 (e.g., 0.510 for Age), which is substantially above this random baseline and indicates the model was making informed predictions. This result stands in stark contrast to the original paper's reported accuracy of around 0.18 for BLIP2-OPT-3B, placing it as the lowest-scoring model in their entire evaluation. A score this far below the random chance baseline suggests a systematic error was at play, and we identified two primary hypotheses for this discrepancy: 1. **A Flawed Evaluation Script:** It is plausible that the original, private evaluatioon script was flawed or that it improperly parsed the model's natural language responses, failing to count variations of "Yes" as a correct answer (eg: Option 1 or A being equivalent to Yes). 2. **An Issue with Model Inference:** An alternative explanation is that the problem occurred during the model's inference step. If the original setup caused the model to generate noisy, garbled, or irrelevant text instead of a clear answer, any evaluation script—no matter how well-written—would have marked the responses as incorrect, resulting in a very low score. This main issue was made worse by a series of practical challenges that made a truly accuracte reproduction difficult. * **Missing Environment:** The lack of a `requirements.txt` file meant we had to guess the specific versions of important libraries like `torch` and `transformers`, where version differences can lead to different results. * **Vague Model Identification:** The name "BLIP2-OPT-3B" can refer to several different model checkpoints. Without a precise identifier or a specified precision (e.g., `float16` vs `bfloat16`), we could not be certain we were using the exact same model. * **Incomplete Codebase:** The provided code was not a complete, runnable pipeline. We had to fix broken utility scripts and, most importantly, write our own accuracy calculation script from scratch. These factors create an environment where results can diverge, but they also highlight a lack of the documentation necessary for proper validation. ### Ablation with Gemma 3 Gemma-3-4b‐it generally scores higher than BLIP2-OPT-3B because it "sees" more detail (bigger SigLIP vision tower), was trained on a wider mix of languages and cultures, and, we assume, uses a safety filter similar to Gemma 2. The alignment of the model treats age and looks as touchy topics, dragging its Age and Appearance scores below BLIP2. Alongside this, the VLBiasBench metrics do not consider a null-response as valid unless the question and image are truly ambiguous. Gemma also stumbles when two protected traits collide. In categories like Race x SES or Race x Gender it labels most queries as "Not known" showing it knows the territory is sensitive but can’t muster a useful, bias-free answer. The model avoids harm by dodging complexity rather than handling it. Gemma’s newer architecture has improved accuracy across almost all categories, but bias, as defined by the authors of VLBiasBench, hasn’t fully vanished. When evaluating VLBiasBench, itself, and its ability to gauge both visual and textual bias in LVLMs: the ablation study showed its capabilities. It provided a look into the potential weak points of Gemma's alignment and fairness. The aforementioned conclusion about Gemma's inability to give a concrete answer when faced with any sensitive question regarding protected characteristics couldn't have been made, unless we leveraged this benchmark. It's important to note that although VLBiasBench allowed us to see what the model would do in these obvious situations, it couldn't test whether the light-weight Gemma 3 can handle nuance. ### Synthetic vs. Real-World Dataset The divergent results strongly support the hypothesis that age-related bias magnitudes are sensitive to the image domain (synthetic vs. real). The fact that Gemma performed better on real images while BLIP2 performed better on synthetic ones is a critical finding. This suggests that the "domain gap" between real and synthetic data is not uniform across all models. It is likely influenced by a model's specific architecture and training data. For instance, a model trained on a larger proportion of internet-scraped, real-world images might be better attuned to their nuances, whereas another might be more influenced by the cleaner, more canonical representations found in synthetic datasets. The most consistent finding, however, is the performance drop for the "Old" category across both models and data types. This signals a clear and robust age-related bias. The models are less accurate when associating stereotypical statements with older individuals, which could stem from underrepresentation of this demographic in training data or from the models learning societal biases. ### Age x Gender Dataset The results from the age and gender controlled dataset reveal important patterns that reinforce the value of intersectional analysis. While previous datasets isolated a single trait, this combination allows for a closer look at how demographic attributes interact in ways that can amplify bias. One of the most consistent trends observed was the model's stronger association of young females with nurturing or appearance related roles, while older males were more often linked to leadership or technical professions. This reflects a layering of both age and gender stereotypes in the model's predictions. In contrast, older females and young males showed more mixed or inconsistent associations, which may indicate uncertainty or gaps in the model’s learned representations for these groups. The drop in accuracy and interpretability of results for older females was especially notable. This group appeared to be underrepresented or less clearly interpreted by the model, suggesting a compounded bias effect. Such results align with broader concerns in AI fairness that models may not only reflect single-category stereotypes but also reinforce compound societal biases when multiple demographic factors are present. ## Conclusion VLBiasBench is a good baseline for bias assessment across the ten aformentioned protected attributes in vision-language models, and our rerun with BLIP2-OPT-3B broadly reproduced the reported numbers; after fixing a path typo, patching the log-to-JSON converter, and containerising the environment. The code released on GitHub alongside the paper receive a moderate reproducibility score of 2/5. It's important to note that this paper's implementation falls into a common pitfall of reproduction in the field of data science where the neccessary hardware and compute is excessive; making it difficult to evaluate and peer-review. Performing the ablation study with Gemma-3-4b-it proved possible but not seamless: a custom wrapper, bfloat16 memory tricks, and VRAM juggling were required, and the resulting profile (stronger disability/nationality accuracy, weaker appearance/race) highlights both the benchmark’s sensitivity to architecture and its limited plug-and-play adaptability. Adding new data in the form of paired real photographs revealed additional domain-shift effects, underscoring the need for a standard image baseline to make scores comparable. Overall, the benchmark’s core claim holds as the tests do surface model-specific bias. On the other hand in terms of reproducibility: incomplete documentation, fragile data handling, and hidden hardware assumptions mean reproduction and replication still take significant engineering effort. ## Contribution * Matteo: Contributed to reproduction, worked on testing additional data for both the Age x Gender dataset and the synthetic vs. real-world Data * Alex: Contributed to replication and ablation with Gemma, tested additional data for both Age x Gender and synthetic vs. real-world data * Gyum: Contributed to reproduction of blip model, worked on testing additiona fata set for Age X Gender (Preparing the image dataset). ## Codebase The codebase for the ablation study with Gemma 3 can be found in this [GitHub repository](https://github.com/mfregonara/VLBiasBench/tree/mfregonara/setup-repo). The codebase for the reproduction with BLIP2-OPT-3B and the testing with additional data is located in the following directory in the DAIC cluster: `/tudelft.net/staff-umbrella/MoDDL/Video_LLM_testing/VLBiasBench_group18/`. ## Use of Gen AI * Used to improve style and clarity of writing * Used to help format tables in markdown format