<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://aclanthology.org/2023.emnlp-main.34/) | [Code link](https://github.com/phoebussi/compress-robust-vqa) | EMNLP 2023
:::success
**Thoughts**
This study introduces the first joint study on the compression and debiasing problems of Vision-Language Pre-trained models (VLPs) for the Visual Question Answering (VQA) task.
:::
## Abstract
Despite their strong performance on Visual Question Answering (VQA) tasks, Vision Language Pre-trained Models (VLPs) struggle with two issues:
1. Reliance on dataset language biases, leading to poor generalization on out-of-distribution (OOD) data.
2. Inefficiency in memory and computation.
This paper explores the possibility of simultaneously **compressing** and **debiasing** VLPs by identifying sparse, robust subnetworks through a systematic training and compression pipeline.
## Background
### OOD problems
The performance significantly drops when VLPs encounter out-of-distribution (OOD) test datasets with answer distributions that differ from those in the training set.
### Dataset-bias problem
The dataset bias problem in VQA has been extensively studied, with numerous debiasing methods applied to conventional small-scale models.
These methods primarily address the issue by regularizing the loss based on the bias degree of the training samples.
## Method
### VLP Architecture and Subnetworks
This study uses LXMERT, a representative VLP, as an example.
LXMERT (Learning Cross-Modality Encoder Representations from Transformers) is a framework for learning vision-and-language connections with transformer encoders.
The parameters to be pruned are:
$$
\boldsymbol{\theta}_{pr} = \{ \mathbf{W}_{\mathrm{emb}}, \mathbf{W}_{\mathrm{vis-fc}}, \mathbf{W}_{\mathrm{plr}} \} \cup \boldsymbol{\theta}_{L_{enc}} \cup \boldsymbol{\theta}_{R_{enc}} \cup \boldsymbol{\theta}_{X_{enc}}
$$
where $\mathbf{W}_{\mathrm{emb}}$, $\mathbf{W}_{\mathrm{vis-fc}}$ and $\mathbf{W}_{\mathrm{plr}}$ are the weights of the embedding layer, the vision fully-connected layer and the pooler layer, respectively. $\boldsymbol{\theta}_{L_{enc}}$, $\boldsymbol{\theta}_{R_{enc}}$ and $\boldsymbol{\theta}_{X_{enc}}$ are the parameters of the language encoder, the object-relationship encoder and the cross-modality encoder.
### Pruning Methods
They explore two representative pruning techniques:
#### Magnitude-based Pruning
This method estimates the importance of model parameters by their absolute values, removing those deemed less critical.
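A minimal NumPy sketch of this idea (not the paper's implementation): given a weight matrix and a target sparsity, zero out the entries with the smallest absolute values. The function name and signature are illustrative.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a weight matrix.

    `sparsity` is the fraction of entries to remove, e.g. 0.7 keeps
    only the top 30% of weights ranked by absolute value.
    Returns the binary mask and the pruned weights.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                 # number of entries to prune
    if k == 0:
        return np.ones_like(weights), weights
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return mask, weights * mask
```

In practice the ranking can be done per-layer (as above) or globally across all prunable matrices in $\boldsymbol{\theta}_{pr}$.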
#### Mask Training
This approach optimizes a binary pruning mask, denoted as $\mathbf{m}$, to achieve specific objectives directly.
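Since a binary mask is not differentiable, mask-training methods typically keep a real-valued score per weight, binarize it in the forward pass, and pass gradients through with a straight-through estimator. A simplified NumPy sketch of that mechanic (the function names are illustrative, not the paper's code):

```python
import numpy as np

def binarize_mask(scores, sparsity):
    """Forward pass: binarize real-valued mask scores so the top
    (1 - sparsity) fraction of weights are kept (mask = 1) and the
    rest are pruned (mask = 0)."""
    k = int(round((1 - sparsity) * scores.size))  # number of weights to keep
    thresh = np.sort(scores.ravel())[::-1][k - 1] # k-th largest score
    return (scores >= thresh).astype(float)

def straight_through_update(scores, grad_wrt_mask, lr=0.1):
    """Backward pass: binarization has zero gradient almost everywhere,
    so the straight-through estimator copies the gradient w.r.t. the
    binary mask directly onto the underlying real-valued scores."""
    return scores - lr * grad_wrt_mask
```

This lets the pruning pattern itself be optimized for the training objective, rather than fixed by weight magnitudes.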
### Debiasing Methods
A biased model that captures language bias is employed to assess the bias degree of training samples.
The training loss for the main model is then adjusted accordingly to counteract this bias.
> Binary Cross-Entropy (BCE)
Computes the binary cross-entropy between the main model's predictions and the soft target scores of the ground-truth answers (the standard, non-debiasing baseline).
> Learned-Mixin +H (LMH)
Incorporates a biased model to learn and adjust for biases during training.
> RUBi
Uses a strategy similar to LMH to regularize the main model's probabilities, but employs standard cross-entropy as the training loss.
> LPF
Measures the bias degree of each training sample and rescales the main model's loss accordingly.
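The common pattern behind these loss-regularization methods can be illustrated with a small NumPy sketch of LPF-style loss rescaling (an assumption-laden simplification, not any method's exact formulation): samples that the biased model already answers confidently are down-weighted, so the main model is pushed to learn from the harder, less biased examples.

```python
import numpy as np

def bias_weighted_loss(main_probs, biased_probs, targets, gamma=2.0):
    """LPF-style reweighting sketch: down-weight samples the biased
    model already answers well. `gamma` controls the strength.

    main_probs / biased_probs: (batch, num_answers) softmax outputs;
    targets: integer ground-truth answer indices.
    """
    idx = np.arange(len(targets))
    p_bias = biased_probs[idx, targets]      # biased model's confidence in the answer
    weights = (1.0 - p_bias) ** gamma        # easy-for-bias samples get small weight
    ce = -np.log(main_probs[idx, targets] + 1e-12)
    return float(np.mean(weights * ce))
```

A question like "What color is the banana?" that the language-only biased model answers with high confidence would contribute almost nothing to this loss, discouraging the main model from relying on the same shortcut.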
### Problem Formulation
The goal is to find a subnetwork $f(\mathbf{m} \odot \boldsymbol{\theta}_{ft})$ that satisfies a target sparsity level $s$ and maximizes the OOD performance:
$$
\max_{\mathbf{m}, \boldsymbol{\theta}_{ft}} (\mathcal{E}_{\mathrm{OOD}}(f(\mathbf{m} \odot \boldsymbol{\theta}_{ft}))), \ \mathrm{s.t.} \ \frac{\| \mathbf{m} \|_0}{|\boldsymbol{\theta}_{pr}|} = (1-s)
$$
where $\mathcal{E}_{\mathrm{OOD}}$ denotes OOD evaluation, $\| \cdot \|_0$ is the $L_0$ norm and $|\boldsymbol{\theta}_{pr}|$ is the total number of parameters in $\boldsymbol{\theta}_{pr}$.
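The sparsity constraint is straightforward to check in code. A small helper (illustrative, assuming the mask is stored as one array per prunable matrix) computes $1 - \| \mathbf{m} \|_0 / |\boldsymbol{\theta}_{pr}|$:

```python
import numpy as np

def sparsity(masks):
    """Fraction of pruned parameters across all prunable matrices:
    1 - ||m||_0 / |theta_pr|, where the L0 norm counts nonzeros."""
    m = np.concatenate([v.ravel() for v in masks.values()])
    return 1.0 - np.count_nonzero(m) / m.size
```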
### Training and Compression Pipeline
A typical training and compression pipeline involves three stages:
> Stage 1: Full Model Fine-tuning
The pre-trained LXMERT $f(\boldsymbol{\theta}_{pt})$ is fine-tuned using loss $\mathcal{L}$, which produces $f(\boldsymbol{\theta}_{ft}) = \mathcal{F}_{\mathcal{L}}(f(\boldsymbol{\theta}_{pt}))$.
> Stage 2: Model Compression
The fine-tuned LXMERT $f(\boldsymbol{\theta}_{ft})$ is compressed and the subnetwork will be $f(\mathbf{m} \odot \boldsymbol{\theta}_{ft})$, where $\mathbf{m} = \mathcal{P}_\mathcal{L}^p(f(\boldsymbol{\theta}_{ft}))$.
> Stage 3: Further Fine-tuning (Optional)
The subnetwork $f(\mathbf{m} \odot \boldsymbol{\theta}_{ft})$ is further fine-tuned using loss $\mathcal{L}^\prime$ and the result will be $f(\mathbf{m} \odot \boldsymbol{\theta}_{ft}^\prime) = \mathcal{F}_{\mathcal{L}^\prime}(f(\mathbf{m} \odot \boldsymbol{\theta}_{ft}))$.
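The three stages can be sketched end-to-end on a toy problem (a minimal NumPy sketch under strong simplifying assumptions: a linear model stands in for LXMERT, gradient descent for $\mathcal{F}_{\mathcal{L}}$, and magnitude pruning for $\mathcal{P}_\mathcal{L}^p$):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(w, X, y, mask=None, steps=300, lr=0.1):
    """Gradient-descent fine-tuning; if a mask is given, pruned
    weights are kept at zero (used in Stages 1 and 3)."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
        if mask is not None:
            w = w * mask
    return w

# Toy data: only 2 of 8 features matter, so a sparse subnetwork suffices.
X = rng.normal(size=(64, 8))
true_w = np.zeros(8)
true_w[:2] = [2.0, -3.0]
y = X @ true_w

w0 = rng.normal(scale=0.1, size=8)           # "pre-trained" initialization
w_ft = fit(w0, X, y)                         # Stage 1: full-model fine-tuning
thresh = np.sort(np.abs(w_ft))[-2]           # Stage 2: keep top-2 weights by magnitude
mask = (np.abs(w_ft) >= thresh).astype(float)
w_sub = fit(w_ft * mask, X, y, mask=mask)    # Stage 3: further fine-tune the subnetwork
```

The paper's key design question is which loss ($\mathcal{L}$ vs. $\mathcal{L}^\prime$, BCE vs. a debiasing loss) to use at each of these stages.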
## Experiment
Below is a figure showing the results of subnetworks obtained from the BCE fine-tuned LXMERT (left) and from the LMH fine-tuned LXMERT (right) on VQA-CP v2.

Below is a figure showing the results of LXMERT subnetworks fine-tuned using various debiasing methods on the VQA-CP v2 dataset.
