---
# Project Charter
<!--
Not require new model, cheaper more performant, cheaper than develop own model from scratch. -->
<!-- why, what, how -->
<!-- prompting evaluation and why it didn'twork -->
<!-- establish point that question and answering models do not perform well with out of context evaluation -->
<!-- No need to some up with dataset specific prompts -->
<!-- visualize prompt framework -->
<!-- no need human annotation involvedment\ -->
---
## Abstract
Generative AI has been the topic of discussion since the turn of the 2020s. Applications like ChatGPT have stunned industry and academia with their ability to mimic conversations with a knowledgeable individual. With Large Language Models (LLMs) at the base of GenAI technology, these models have been growing ever larger, demanding increasingly unrealistic amounts of GPU resources.
However, GenAI still leaves many limitations unaddressed. Applications like ChatGPT perform well at a high level of abstraction or on simple tasks with a fixed number of outcomes, but low-level, niche, and domain-specific knowledge is often thrown out the window. This project formulates the well-known problem of out-of-distribution model performance as a preference selection problem. Extending Reinforcement Learning from Human Feedback ([RLHF](https://arxiv.org/pdf/2203.02155)), I use the novel [Direct Preference Optimization](https://arxiv.org/abs/2305.18290) (DPO) algorithm and its derivatives to make Large Vision Language Models learn subjective preferences that are not present in the available data, and to have the model make inferences beyond the current dataset. This is done by having a supervised fine-tuned (SFT) model self-generate preferred/rejected pairs without human annotation, without the need for reinforcement learning, and without relying on knowledge distillation from closed-source commercial GenAI models like ChatGPT. In doing so, I establish that the out-of-distribution problem is fundamentally a preference selection task in which computational costs are cut, human domain expertise and annotation costs are unnecessary, and training a model from scratch is not always required.
* [Github link](https://github.com/Tony363/HA-DPO/tree/main)
* [Slides Deck](https://docs.google.com/presentation/d/1sknHWkxdDRP-JH9UOo8KMRiCXQlUUI4HfZ2OJfSnvXc/edit?usp=sharing)
---
## Problem Description
Current literature offers few studies or methodologies for preference-aligning Multi-Modal tasks such as Visual Question Answering. This is especially pronounced when big tech companies claim performance on industry tasks while offering no means to evaluate or reverse engineer the complex Multi-Modal systems involved. Despite the plethora of resources available to these companies, there is little refinement and understanding of their deployed systems. To that end, "prompt engineering" has become popular, even among the social sciences. However, the quantifiable reliability of prompting the limited vector space of a Large Vision Language Model (LVLM) remains dubious. In our own studies, while establishing a template chat bot for evaluation purposes, simply adding or removing a '/' in a part-of-speech (POS) tag that indicates the positional location of a visual embedding made a 14% difference in classification performance. As there is not yet a universal and rigorously quantifiable template for querying an LVLM, I chose to diverge from popular "prompt engineering" studies and instead sought closed-form optimization methodologies for both directing and generalizing a model's learnable gradient space.
Below is a comparison between the two sample chat templates that caused the 14% performance difference. The template without the '/' showed a 14% improvement over the template that sticks rigorously to the Vicuna POS tag format:
```
###Human: Given the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img>', '<Img>###Human: Is the person looking straight at the screen? Is the person looking down atthe paper? Is the person looking away?###Assistant:
```
```
###Human: Given the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img>', '</Img>###Human: Is the person looking straight at the screen? Is the person looking down atthe paper? Is the person looking away?###Assistant:
```
<!-- The goal of the project is to explore and provide methodologies to parameterize custom knowledge into a Multi-Modal capable AI system without having to develop and build such models from scratch. This should cut research costs, training time, resource costs, data annotation costs, and domain expertise required on a subject. -->
<!-- Visual Question Answering[1] is a Multimodal, Multiview Computer vision task. Provided with a visual component and a natural language query. The task is to address contextual information using visual components and a natural language query. As the nature of natural language context is free from and open ended, Visual Question Answering tasks are also free-form and open ended. I believe that that is the major difficulty current AGI[2] research has yet to overcome. Therefore, one of the goals of the project is to explore and measure the implications of how Machine Learning can contextualize qualitative abstractions given 2 or more input modal vectors. The tasks that are viable for these datasets may be restricted and further data engineering of the dataset may be necessary. Furthermore, as the task to address queries may be open ended and free form, objective functions, how to measure model effectiveness and error analysis may vary depending on individual components within the VQA framework. -->
---
## Task

Conventional Computer Vision models cannot output qualitative, detailed text responses. Framing the common Multi-Modal task as an encompassing task that can tackle many conventional computer vision problems, specifically Visual Question & Answering, we utilize a toy dataset, the Student Engagement Dataset. This dataset was chosen for its non-conventional qualitative nature, in which performance is subjective and open to criticism. Provided with the set of labels "paper", "screen", and "wander", an LVLM encodes an image, encodes a question, and outputs an answer in the context of the image and question. An explicit evaluation method using DistilBERT then classifies the output sentence of the LVLM; a schematic sketch of this pipeline follows the contribution list below.
The core contributions of the project are the following:
* A novel semi-supervised Multi-Modal preference alignment optimization method that treats the out-of-distribution problem as a preference selection task
* A characterization of the out-of-sample generalizability of closed-form preference alignment algorithms as opposed to RLHF methods
* A precedent for the necessity of preference-aligning model responses, as opposed to "appropriately" prompt engineering a "correct" response within a limited model parameter space and with limited resources
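A minimal, illustrative sketch of the VQA-as-classification pipeline described above. The callables `vqa_model` and `sentence_classifier` are hypothetical stand-ins for the MiniGPT4 inference call and the tuned DistilBERT classifier, not the project's actual API:
```python
from typing import Callable

SED_QUESTION = (
    "Is the person looking straight at the screen? "
    "Is the person looking down at the paper? "
    "Is the person looking away?"
)

def classify_frame(
    image_path: str,
    vqa_model: Callable[[str, str], str],       # (image_path, question) -> free-form answer text
    sentence_classifier: Callable[[str], str],  # answer text -> "paper" | "screen" | "wander"
    question: str = SED_QUESTION,
) -> str:
    """Run one SED frame through the LVLM + sentence-classification pipeline."""
    answer_text = vqa_model(image_path, question)   # MiniGPT4's open-ended response
    return sentence_classifier(answer_text)         # DistilBERT maps it to an SED class
```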
<!-- Using [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4/tree/main), I have achieved a 97% precision recall harmonization on the Student engagement 2K balanced dataset and a 93% precision recall on the entirty of the 19k Student engagement dataset. The next steps for the project is to further study the Visual Question and Answering task using different alighnment optmization algorithms. The core contribution are listed below;
* Study on the generalization ability pre and post alighnment optimization
* Comparison to current Multi Modal Benchmarks
* Develop a methodology that self generate question and answer pairs that allows DPO to provide preference alignment towards a broader distribution unavailable in the in sample dataset -->
---
## Data
* [ICCVW Frame Engagement Annotations](https://cs-people.bu.edu/sbargal/studentdatasets/index.html)

* The Student Engagement Dataset (SED) consists of approximately 19K frames divided between three classes (looking at screen, looking at paper, wandering) from 19 different students. For the frame-level annotations, videos of the 19 participants were sampled at one FPS, giving a total of 18,721 frames. SED comes in an imbalanced and a balanced set. In the imbalanced distribution, the Screen class includes 14 times more samples than the Wander class and three times more than the Paper class: the Paper class includes 4,655 frames, the Screen class 13,483 frames, and the Wander class 583 frames, for a total of 18,721 frames. A more balanced version of the dataset is constructed by removing similar samples from each class; it contains 638 Paper samples, 826 Screen samples, and 509 Wander samples, for a total of 1,973 samples. We sampled only three of the original 19 students for our test set. 80% of the balanced set is used for finetuning; the remaining 20% of the balanced set plus the rest of the imbalanced set is used for evaluation. Lastly, another 85 hard samples drawn from across SED, outside the training set, are used as an additional test set. A rough sketch of this split construction follows the dataset list below.
* [DAiSEE, Dataset for Affective States in E-Environment](https://people.iith.ac.in/vineethnb/resources/daisee/index.html)

* The first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users for recognizing the affective states of boredom, confusion, engagement, and frustration "in the wild". The dataset has four label levels (very low, low, high, and very high) for each affective state, crowd annotated and correlated with a gold-standard annotation created by a team of expert psychologists. DAiSEE functions as the out-of-distribution test dataset: we randomly sampled 1,129 frames from DAiSEE and re-annotated them with the SED labels, yielding 984 Screen samples, 112 Wander samples, and 33 Paper samples.
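The sketch below shows one way to realize the SED splits described above. The index file names, column names, and exact bookkeeping are assumptions for illustration only; the project's actual preprocessing may differ:
```python
# Rough sketch of the SED evaluation splits; file and column names are assumed.
import pandas as pd

balanced = pd.read_csv("sed_balanced_index.csv")      # 1,973 frames: paper / screen / wander
imbalanced = pd.read_csv("sed_imbalanced_index.csv")  # 18,721 frames, heavily skewed to screen

train_ft = balanced.sample(frac=0.8, random_state=0)  # ~80% of the balanced set for finetuning
eval_balanced = balanced.drop(train_ft.index)         # remaining ~20% of the balanced set
eval_imbalanced = imbalanced[                         # rest of the imbalanced set for evaluation
    ~imbalanced["frame_path"].isin(balanced["frame_path"])
]
```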
---
## ML Methodology

We utilize the [Minigpt4](https://minigpt-4.github.io/) framework. It consists of a pretrained vision model ([ViT](https://arxiv.org/abs/2010.11929)), [BLIP2](https://arxiv.org/abs/2301.12597), a hybrid model that maps the vision vector space into a language-compatible vector space, and a language model ([Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)). The main strength of MiniGPT4 is that the vision, language, and hybrid model weights are all frozen, and only an additional linear projection layer between BLIP2 and the language model is parameterized, because the primary task of MiniGPT4 is to learn to align the vision and language vector spaces. Training is therefore limited to the linear layer, while the frozen vision and language components only run inference and their features can be precomputed, shortening the time and resources needed for both training and inference. In my experiments, I finetune and evaluate with a batch size of 1 and 4-bit quantization. Finetuning MiniGPT4 takes 20 GiB of VRAM, while inference requires only 8 GiB.
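A minimal PyTorch sketch of the trainable path described above: a single linear projection maps the frozen BLIP2/Q-Former visual features into the language model's embedding space. The dimensions and module names are illustrative assumptions, not MiniGPT4's actual code:
```python
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    """The only trainable component: project visual query tokens into the LLM embedding space."""
    def __init__(self, vis_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, qformer_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_query_tokens, vis_dim) -> (batch, num_query_tokens, llm_dim)
        return self.proj(qformer_features)

proj = VisionToLLMProjection()
# The ViT, BLIP2 Q-Former, and Vicuna weights stay frozen (requires_grad=False);
# only the projection's parameters go to the optimizer.
optimizer = torch.optim.AdamW(proj.parameters(), lr=1e-4)
```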
---
### Multi Modal Question & Answer data pairs setup
MiniGPT4 expects image_ids paired with their target caption responses. The list of questions used during finetuning is stored in an alignment.txt file; for each image, a question from this file is randomly chosen and asked of MiniGPT4 during finetuning. The original MiniGPT4 finetuning prompts are listed below, and a sketch of this data setup follows the prompt lists:
* *Describe this image in detail*
* *Take a look at this image and describe what you notice*
* *Please provide a detailed description of the picture*
* *Could you describe the contents of this image for me?*
The finetuning prompts specific to the Student Engagement dataset are the following:
* *Is the person looking straight at the screen?*
* *Is the person looking down at the paper?*
* *Is the person looking away?*
* *Is the person looking straight at the screen? Is the person looking down at the paper? Is the person looking away?*
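The sketch below illustrates the setup just described: (image_id, caption) pairs plus a pool of finetuning questions sampled at random per image. The file names and JSON layout are assumptions for illustration, not MiniGPT4's exact format:
```python
import json
import random

# e.g. sed_captions.json: [{"image_id": "frame_0412", "caption": "The person is looking down at the paper."}, ...]
with open("sed_captions.json") as f:
    pairs = json.load(f)

# alignment.txt: one finetuning question per line
with open("alignment.txt") as f:
    questions = [line.strip() for line in f if line.strip()]

def make_training_sample(pair: dict) -> dict:
    """Pair an image with a randomly drawn finetuning question and its target caption."""
    return {
        "image_id": pair["image_id"],
        "question": random.choice(questions),
        "answer": pair["caption"],
    }
```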
---
### Direct Preference Optimization

Reinforcement Learning from Human Feedback (RLHF), as well as DPO, makes use of the [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model). The Bradley-Terry model converts a dataset of preferences into a numeric score, called a reward, for each question-answer pair, such that the scores numerically reflect the preferences of the annotators. A Maximum Likelihood Estimator (MLE) is constructed from the Bradley-Terry model such that the probability of choosing the preferred answer over the rejected answer is maximized. DPO extends the RLHF loss function by reparameterizing the reward in terms of the policy itself, turning the PPO-based RLHF procedure into a single differentiable objective.
#### Brief Derivation
$$P(y_w > y_l) = \frac{e^{r^*(x,y_w)}}{e^{r^*(x,y_w)}+e^{r^*(x,y_l)}}$$
$$A = r^*(x,y_w), \qquad B = r^*(x,y_l)$$
$$\frac{e^A}{e^A + e^B} = \frac{e^A / e^A}{(e^A + e^B)/e^A} = \frac{1}{1 + e^{B-A}} = \frac{1}{1 + e^{-(A - B)}} = \sigma(A - B)$$
$$L = -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log\sigma\left(r_\gamma(x,y_w) - r_\gamma(x,y_l)\right)\right]$$
The above rearranges the Bradley-Terry pairwise comparison model into a sigmoid of the difference between the preferred and rejected rewards; in DPO these rewards will in turn be expressed through log probabilities.
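A tiny numeric check of the identity above: the Bradley-Terry preference probability equals the sigmoid of the reward difference. The reward values are arbitrary example numbers:
```python
import math

def bradley_terry(r_w: float, r_l: float) -> float:
    """P(y_w > y_l) under the Bradley-Terry model for rewards r_w, r_l."""
    return math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

r_w, r_l = 1.7, 0.3   # rewards of the preferred and rejected answers
assert abs(bradley_terry(r_w, r_l) - sigmoid(r_w - r_l)) < 1e-12
print(bradley_terry(r_w, r_l))   # ~0.802
```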
#### Closed form Reward function
The below is the final derivation that turns the PPO objective with its Kullback-Leibler divergence constraint into a reward model. The key is that the optimal policy of the KL-constrained reward maximization problem has a closed form, and when that closed form is substituted back into the Bradley-Terry model, the intractable partition function Z(x) cancels out. The reward therefore becomes a closed-form function of the preferred and rejected log probabilities, and a simple derivative of the resulting loss can be taken.
$$Z(x) = \sum_y \pi_{ref}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$$
$$\pi_r(y|x) = \frac{1}{Z(x)}\pi_{ref}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$$
$$P(y_w > y_l) = \sigma(r(x,y_w) - r(x,y_l)) = \sigma\left(\beta \log \frac{\pi^*(y_w | x)}{\pi_{ref}(y_w | x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l | x)}{\pi_{ref}(y_l | x)} - \beta \log Z(x)\right)$$
The full direct preference optimization objective then follows:
$$L_{DPO}(\pi_\theta;\pi_{ref}) = -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$
DPO takes the log probabilities of the preferred and rejected responses under both the supervised fine-tuned (SFT) reference model and the trainable policy model. The aggregate ratios between the policy and reference probabilities for the preferred and rejected responses are $$\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}$$ and $$\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}$$
The difference between these two log ratios is used to minimize the negative log-likelihood loss, with the policy model's aggregate being the target representation to learn. This is how the SFT model learns to cover sample spaces that are not captured by question, image, answer triplets alone, and learns to output preferred responses about an image given a question. An important aspect of the DPO loss is that the reference SFT model is used both to generate the preferred/rejected pairs and to constrain the loss function within the vector space of the reference SFT model.
The source code implementation for DPO is found [here](https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py).
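A minimal PyTorch sketch of the core DPO loss, assuming each tensor holds summed per-sequence log probabilities of shape `(batch,)`; the function and argument names are illustrative, not the repository's API:
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the trainable policy vs. the frozen SFT reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Negative log-sigmoid of the scaled difference, averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```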
#### Hallucination Aware Direct preference Optimization(HA-DPO)
The *hallucination aware* aspect of HA-DPO comes from the paper [Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization](https://arxiv.org/abs/2311.16839). Put simply, the authors of HA-DPO used GPT-4 through the OpenAI API to generate style-consistent non-hallucinatory/hallucinatory pairs for their Visual Question & Answering (VQA) task on the Visual Genome dataset, evaluated using [Polling-based Object Probing Evaluation](https://github.com/RUCAIBox/POPE). Their goal was to quantify object hallucination in VQA tasks. My method instead self-generates the preferred and rejected pairs using the reference SFT model, but adds the same auxiliary LLM loss to DPO as HA-DPO does. The auxiliary loss is inspired by the [InstructGPT paper](https://arxiv.org/abs/2203.02155). The final loss function I use for my project is the below.
$$L_{dpo}(\pi_{\theta};\pi_{ref}) = -\mathbb{E}_{(x_T,x_I,y_{pos},y_{neg})\sim D}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{pos}|[x_T,x_I])}{\pi_{ref}(y_{pos}|[x_T,x_I])} - \beta \log \frac{\pi_{\theta}(y_{neg}|[x_T,x_I])}{\pi_{ref}(y_{neg}|[x_T,x_I])}\right)\right]$$
$$L_{aux} = -\sum \log P(y|x_P;\pi_{\theta}), \quad (x_P,y)\sim D_{sft}$$
$$L = L_{dpo} + \lambda L_{aux}$$
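A sketch of the combined objective, reusing the `dpo_loss` helper sketched above and assuming `sft_token_logps` holds the policy's per-token log probabilities of the SFT targets; all names are illustrative:
```python
import torch

def hadpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               sft_token_logps: torch.Tensor,   # (batch, seq_len) log P(y_t | x_P) under the policy
               beta: float = 0.1,
               lam: float = 1.0) -> torch.Tensor:
    l_dpo = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=beta)
    l_aux = -sft_token_logps.sum(dim=-1).mean()  # auxiliary SFT negative log-likelihood
    return l_dpo + lam * l_aux
```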
#### Hallucination Aware Kahneman Tversky Optimization(HA-KTO)
There is no published HA-KTO paper; I coined HA-KTO because I performed the same preference alignment finetuning with the [Kahneman-Tversky Optimization algorithm](https://arxiv.org/pdf/2402.01306). I took the currently available preference alignment algorithms from the [source code of Hugging Face TRL](https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py) and implemented them for MiniGPT4 Visual Question & Answering. Unlike DPO, KTO does not build a maximum likelihood objective from the Bradley-Terry model; it is derived from Kahneman and Tversky's prospect theory and only requires a per-example binary signal of whether a response is desirable or undesirable. At the time of writing, KTO is touted as the strongest preference alignment optimization algorithm available.
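The sketch below is a simplified paraphrase of the KTO objective as I understand it from the paper: the policy/reference log-ratio of each response is compared against a detached KL reference point `z0`, with separate weights for desirable and undesirable examples. The reference-point estimation is omitted, and the variable names are assumptions, not the TRL implementation:
```python
import torch

def kto_loss(policy_logps: torch.Tensor, ref_logps: torch.Tensor,
             desirable: torch.Tensor,        # boolean mask: True if the response is desirable
             z0: torch.Tensor,               # detached KL reference point, estimated elsewhere
             beta: float = 0.1, lam_d: float = 1.0, lam_u: float = 1.0) -> torch.Tensor:
    r = policy_logps - ref_logps                          # implied reward (log-ratio) per example
    value_d = lam_d * torch.sigmoid(beta * (r - z0))      # value of desirable responses
    value_u = lam_u * torch.sigmoid(beta * (z0 - r))      # value of undesirable responses
    value = torch.where(desirable, value_d, value_u)
    lam = torch.where(desirable, torch.full_like(r, lam_d), torch.full_like(r, lam_u))
    return (lam - value).mean()
```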
---
## Evaluation
In the evaluation, an image is vectorized by the vision model and a question template is vectorized by the language model. MiniGPT-4 concatenates these vectors as sequential tokens and outputs a sequence of words. The prompting template described in the MiniGPT4 paper was used. The final prompt that is vectorized is shown below; it is currently the template that yields the best classification performance. Note that the `', '` between the ```<Img>``` tags indicates where the visual vector embedding is concatenated:
```
###Human: Given the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img>', '<Img>###Human: Is the person looking straight at the screen? Is the person looking down atthe paper? Is the person looking away?###Assistant:
```
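The sketch below shows how such a prompt is typically combined with the visual embedding in MiniGPT4-style models: the prompt is split at the image placeholder, each text segment is embedded by the language model, and the projected visual tokens are spliced in between. The placeholder string, the `embed_text` callable, and the shapes are assumptions for illustration:
```python
import torch

def build_input_embeddings(prompt: str,
                           image_tokens: torch.Tensor,   # (1, num_query_tokens, llm_dim) projected visual features
                           embed_text) -> torch.Tensor:  # callable: str -> (1, seq_len, llm_dim) text embeddings
    before, after = prompt.split("', '", 1)              # split at the visual-embedding marker
    return torch.cat([embed_text(before), image_tokens, embed_text(after)], dim=1)
```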
DistilBERT, used as a sentence classification model, is tuned on the augmented label captions of the SED dataset and classifies the output of MiniGPT4. Accuracy, F1-macro, Precision-macro, and Recall-macro are aggregated; a sketch of the metric aggregation follows the list below.
**Metrics**
* Accuracy - *base metric of overall performance*
* F1 Macro - *used to gauge model performance on the average of each class*
* Precision Macro - *What proportion of positive identifications was actually correct*
* Recall Macro - *What proportion of actual positives was identified correctly*
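A sketch of how these macro-averaged metrics can be aggregated from the DistilBERT-classified outputs, using scikit-learn; the example label lists are placeholders:
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["screen", "paper", "wander", "screen"]   # gold SED frame labels (placeholder values)
y_pred = ["screen", "paper", "screen", "screen"]   # DistilBERT-classified MiniGPT4 answers

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"acc={accuracy:.3f}  P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```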
## Results
---
**Table 1. SED Balanced test set**

---
**Table 2. SED imbalanced test set**

---
**Table 3. SED Hard Samples test set**

---
**Table 4. DAiSEE out of distribution test set**

---
<!-- ## Communication
The hypothetical "client" or "target audience" is the computational problems highlighted in [5][6][7]. There in this project is not targeted towards a human entity but purely to address computational problems of RLHF, LLMs and LVLMs.
* Weekly meetings every friday 930 est via zoom
* Dr Latecki, Dr Kant, Lu Pang and Tony Siu attend these
* Weekly progress is summarized
* Tasks for the folowing week is decided
* Assessment on the tasks performed is evaluated
* Notes on weekly meetings is found [here](https://docs.google.com/document/d/1_0ds-KxvDJewPCKHj6eLNUGwbOLdLdTJxvAjcBHAGJo/edit?usp=sharing) for project team members
* Contact persons
* lu.pang@temple.edu
* latecki@temple.edu
* kkant@temple.edu
-->
<!-- ## Personnel
* Dr Longin Jan Latecki
* Project Ideation
* Lu Pang, Post Doc student
* Researcher with access to high performance compute
* Experiments on Direct Preference Optimization rejected/preferred response pairs generation
* Set proportion of rejected/preffered generated responses as to cover for out of sample context
* Deliver DPO results to characterize active learning loop to tackle out of sample data
* Stay up to date with different LVLM architectures
* Paper writing
* Tony Siu
* Part time researcher
* Set all baseline evaluations of different models
* Formulated Conversational Evaluation bot
* Designed evaluation framework
* Set different finetuning hyperparameters for Visual Question and answering
* Provided and integrated Direct Preference Optimization code within Conversational Evaluation framework
* Experiments with interchanging vision and language architecture within the Vision Language model
* Write preprocessing, training and evaluation scripts for finetuning
* Paper writing
* Dr Krishna Kant
* Provides compute
* Revises paper -->
<!-- ## Plan
#### Important Dates
* ICPR submission extended deadline, April 10
* ACM-MultiMedia submission deadline, April 12th
* ECAI submission deadline, April 25th -->
<!-- #### Development MileStones
* [x] Get Data
* [x] SED (May 2023)
* [x] Random Sampled & Annotated DAiSEE (Feb 2024)
* [x] Choose Hard Samples from SED
* [x] EDA
* [x] SED (May 2023)
* [x] DAiSEE (Feb 2024)
* [x] Preprocessing (May 2023)
* [x] SED (May 2023)
* [x] DAiSEE (Feb 2024)
* [x] Modeling
* [x] ViperGPT (March 2023)
* Does not work for abract arbitrary Datasets like SED
* [x] VisualChatGPT (April 2023)
* resource limitation for research
* [x] VILT (April 2023)
* Proof of Concept
* [x] MiniGPT4 (July 2023)
* lightweight resource efficient hybrid VQA model
* Only need tune single linear
* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Evaluation
* [X] SED balanced set

* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Minigpt4 SED prompt finetuning
* [x] Minigpt4 original prompt finetuning
* [x] Xception
* [x] MobileNets V3
* [x] VGG16
* [x] SED imbanaced set

* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Minigpt4 SED prompt finetuning
* [x] Minigpt4 original prompt finetuning
* [x] Xception
* [x] MobileNets V3
* [x] VGG16
* [x] Out of Distribution

* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Minigpt4 SED prompt finetuning
* [x] Minigpt4 original prompt finetuning
* [x] Xception
* [x] MobileNets V3
* [x] VGG16
* [x] Hard Samples

* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Minigpt4 SED prompt finetuning
* [x] Minigpt4 original prompt finetuning
* [x] Xception
* [x] MobileNets V3
* [x] VGG16
* [ ] Reporting
-->
<!-- #### Miscellaneous tasks
* 3/22/2024 - 3/26/24
* Set baselines with POS tag ```<img></img>``` as opposed ```<img><img>``` according to the Vicuna[4] template
* Revise paper -->
<!-- ## References
* [1] Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
* [2] Thoppilan, Romal, et al. "Lamda: Language models for dialog applications." arXiv preprint arXiv:2201.08239 (2022).
* [3] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." Advances in Neural Information Processing Systems 36 (2024).
* [4]Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An
open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https:
//vicuna.lmsys.org
* [5] Casper, Stephen, et al. "Open problems and fundamental limitations of reinforcement learning from human feedback." arXiv preprint arXiv:2307.15217 (2023).
* [6] Höglund, S., & Khedri, J. (2023). Comparison Between RLHF and RLAIF in Fine-Tuning a Large Language Model (Dissertation). Retrieved from https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-331926
* [7] Kirk, Robert, et al. "Understanding the effects of rlhf on llm generalisation and diversity." arXiv preprint arXiv:2310.06452 (2023)
* [8] Chen, Hailin, et al. "ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?." arXiv preprint arXiv:2311.16989 (2023).
* [9] Zhu, Deyao, et al. "Minigpt-4: Enhancing vision-language understanding with advanced large language models." arXiv preprint arXiv:2304.10592 (2023).
* [10]Li, Shengzhi, Rongyu Lin, and Shichao Pei. "Multi-modal preference alignment remedies regression of visual instruction tuning on language model." arXiv preprint arXiv:2402.10884 (2024).
* [11] Zhao, Zhiyuan, et al. "Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization." arXiv preprint arXiv:2311.16839 (2023).
* [12] Li, Lei, et al. "Silkie: Preference distillation for large visual language models." arXiv preprint arXiv:2312.10665 (2023).
* -->
<!--
## Appendix
---
### RLHF vs RLAIF
[This](https://klu.ai/glossary/rlaif) article highlights the pros and cons of RLHF vs RLAIF and some common limitations for small and private entities with limited resources. While [8] gives a comprehensive over view of the costs and benefits and the current landscape of LLM development.
---
### The few Studies on Visual Question Answer Preference Alignment Optimization available
[10][11][12] have been recently published. All 3 methods differ from our work in that they do not self generate preferred/rejected response pairs or tweak with the log probability ratios of preferred/rejected response pairs encoded by the reference and sft models. However, it may be a future work to look to incorporating concepts from these works. -->
## Conclusion
My experimental results with DPO preference tuning methods suggest that the out-of-distribution problem common in statistics and machine learning can be formulated as a simple preference selection problem. Both HA-DPO and HA-KTO consistently outperform the other baselines on the SED dataset and also perform well on the DAiSEE out-of-distribution dataset at a batch size of 1. It remains to be seen how far DPO's generalizability can go with higher-precision quantization, larger model sizes, and more compute resources.
## Future Work
Future work may involve methodologies to guarantee the output sentence structure of the LVLM, using context-free grammars as discrete activation functions on the final layers of the language model. This would act as an embedded token parsing mechanism for the model to learn, guiding the learnable gradient space of the LVLM. Additional work would involve experiments with further derivatives of DPO, including the more recent ORPO, which extends DPO so that the reference SFT model is no longer needed at all. However, out-of-distribution generalizability, reduction of human annotation effort, and compute costs would remain research topics for these cutting-edge techniques.