---
# Project Charter
<!--
Not require new model, cheaper more performant, cheaper than develop own model from scratch. -->
<!-- why, what, how -->
<!-- prompting evaluation and why it didn'twork -->
<!-- establish point that question and answering models do not perform well with out of context evaluation -->
<!-- No need to some up with dataset specific prompts -->
<!-- visualize prompt framework -->
<!-- no need human annotation involvedment\ -->
---
## Abstract
Generative AI has been the topic of discussion since the turn of the 2020s. Applications like ChatGPT have stunned industry and academia with their ability to mimic conversations with a knowledgeable individual. With Large Language Models (LLMs) at the base of GenAI technology, these models have been growing ever larger, demanding increasingly unrealistic amounts of GPU resources.
However, GenAI still leaves many limitations unaddressed. Applications like ChatGPT perform well at a high level of abstraction or on simple tasks with a fixed number of outcomes, but low-level, niche, and domain-specific knowledge is often thrown out the window. This project formulates the well-known problem of out-of-distribution model performance as a preference selection problem. Extending Reinforcement Learning from Human Feedback ([RLHF](https://arxiv.org/pdf/2203.02155)), I use the novel [Direct Preference Optimization](https://arxiv.org/abs/2305.18290) (DPO) algorithm and its derivatives to make Large Vision Language Models learn subjective preferences that are not present in the available data, and to have the model make inferences beyond the current dataset. This is done by having a supervised fine-tuned (SFT) model self-generate preferred/rejected pairs without human annotation, without the need for reinforcement learning, and without relying on knowledge distillation from closed-source commercial GenAI models like ChatGPT. In doing so, I establish that the out-of-distribution problem is fundamentally a preference selection task in which computational costs are cut, human domain expertise and annotation costs are unnecessary, and training a model from scratch is not always required.
* [Github link](https://github.com/Tony363/HA-DPO/tree/main)
* [Slides Deck](https://docs.google.com/presentation/d/1sknHWkxdDRP-JH9UOo8KMRiCXQlUUI4HfZ2OJfSnvXc/edit?usp=sharing)
---
## Problem Description
Current literature offers few studies or methodologies for preference-aligning Multi-Modal tasks such as Visual Question Answering. This is especially pronounced when big tech companies claim performance on industry tasks while offering no means to evaluate or reverse engineer the complex Multi-Modal systems involved. Despite the plethora of resources available to these companies, there is little refinement and understanding of their deployed systems. To that end, "prompt engineering" has become popular, even among the social sciences. However, the quantifiable reliability of prompting the limited vector space of a Large Vision Language Model (LVLM) remains dubious. In our own studies, while establishing a template chat bot for evaluation purposes, simply adding or removing a '/' in a part-of-speech (POS) tag that indicates the positional location of a visual embedding made a 14% difference in classification performance. As there is not yet a universal and rigorously quantifiable template for querying an LVLM, I chose to diverge from popular "prompt engineering" studies and instead sought closed-form optimization methodologies for both directing and generalizing a model's learnable gradient space.
Below is a comparison between the two sample chat templates that caused the 14% performance difference. The template without the '/' showed a 14% improvement over the template that sticks rigorously to the Vicuna POS tag format:
```
###Human: Given the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img>', '<Img>###Human: Is the person looking straight at the screen? Is the person looking down atthe paper? Is the person looking away?###Assistant:
```
```
###Human: Given the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img>', '</Img>###Human: Is the person looking straight at the screen? Is the person looking down atthe paper? Is the person looking away?###Assistant:
```
<!-- The goal of the project is to explore and provide methodologies to parameterize custom knowledge into a Multi-Modal capable AI system without having to develop and build such models from scratch. This should cut research costs, training time, resource costs, data annotation costs, and domain expertise required on a subject. -->
<!-- Visual Question Answering[1] is a Multimodal, Multiview Computer vision task. Provided with a visual component and a natural language query. The task is to address contextual information using visual components and a natural language query. As the nature of natural language context is free from and open ended, Visual Question Answering tasks are also free-form and open ended. I believe that that is the major difficulty current AGI[2] research has yet to overcome. Therefore, one of the goals of the project is to explore and measure the implications of how Machine Learning can contextualize qualitative abstractions given 2 or more input modal vectors. The tasks that are viable for these datasets may be restricted and further data engineering of the dataset may be necessary. Furthermore, as the task to address queries may be open ended and free form, objective functions, how to measure model effectiveness and error analysis may vary depending on individual components within the VQA framework. -->
---
## Task

Conventional Computer Vision models cannot output qualitative, detailed text responses. Framing the common Multi-Modal task as an encompassing task that can tackle many conventional computer vision problems, specifically Visual Question & Answering, we utilize a toy dataset, the Student Engagement Dataset. This dataset was chosen for its non-conventional qualitative nature, in which performance is subjective and open to criticism. Provided with the set of labels "paper", "screen", and "wander", an LVLM encodes an image, encodes a question, and outputs an answer in the context of the image and question. An explicit evaluation method using DistilBERT then classifies the output sentence of the LVLM; a schematic sketch of this pipeline follows the contribution list below.
The core contributions of the project are the following:
* A novel semi-supervised Multi-Modal preference alignment optimization method that treats the out-of-distribution problem as a preference selection task
* A characterization of the out-of-sample generalizability of closed-form preference alignment algorithms as opposed to RLHF methods
* A precedent for the necessity of preference-aligning model responses, as opposed to "appropriately" prompt engineering a "correct" response within a limited model parameter space and with limited resources
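A minimal, illustrative sketch of the VQA-as-classification pipeline described above. The callables `vqa_model` and `sentence_classifier` are hypothetical stand-ins for the MiniGPT4 inference call and the tuned DistilBERT classifier, not the project's actual API:
```python
from typing import Callable

SED_QUESTION = (
    "Is the person looking straight at the screen? "
    "Is the person looking down at the paper? "
    "Is the person looking away?"
)

def classify_frame(
    image_path: str,
    vqa_model: Callable[[str, str], str],       # (image_path, question) -> free-form answer text
    sentence_classifier: Callable[[str], str],  # answer text -> "paper" | "screen" | "wander"
    question: str = SED_QUESTION,
) -> str:
    """Run one SED frame through the LVLM + sentence-classification pipeline."""
    answer_text = vqa_model(image_path, question)   # MiniGPT4's open-ended response
    return sentence_classifier(answer_text)         # DistilBERT maps it to an SED class
```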
<!-- Using [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4/tree/main), I have achieved a 97% precision recall harmonization on the Student engagement 2K balanced dataset and a 93% precision recall on the entirty of the 19k Student engagement dataset. The next steps for the project is to further study the Visual Question and Answering task using different alighnment optmization algorithms. The core contribution are listed below;
* Study on the generalization ability pre and post alighnment optimization
* Comparison to current Multi Modal Benchmarks
* Develop a methodology that self generate question and answer pairs that allows DPO to provide preference alignment towards a broader distribution unavailable in the in sample dataset -->
---
## Data
* [ICCVW Frame Engagement Annotations](https://cs-people.bu.edu/sbargal/studentdatasets/index.html)

* The Student Engagement Dataset (SED) consists of approximately 19K frames divided between three classes (looking at screen, looking at paper, wandering) from 19 different students. For the frame-level annotations, videos of the 19 participants were sampled at one FPS, giving a total of 18,721 frames. SED comes in an imbalanced and a balanced set. In the imbalanced distribution, the Screen class includes 14 times more samples than the Wander class and three times more than the Paper class: the Paper class includes 4,655 frames, the Screen class 13,483 frames, and the Wander class 583 frames, for a total of 18,721 frames. A more balanced version of the dataset is constructed by removing similar samples from each class; it contains 638 Paper samples, 826 Screen samples, and 509 Wander samples, for a total of 1,973 samples. We sampled only three of the original 19 students for our test set. 80% of the balanced set is used for finetuning; the remaining 20% of the balanced set plus the rest of the imbalanced set is used for evaluation. Lastly, another 85 hard samples drawn from across SED, outside the training set, are used as an additional test set. A rough sketch of this split construction follows the dataset list below.
* [DAiSEE, Dataset for Affective States in E-Environment](https://people.iith.ac.in/vineethnb/resources/daisee/index.html)

* The first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users for recognizing the affective states of boredom, confusion, engagement, and frustration "in the wild". The dataset has four label levels (very low, low, high, and very high) for each affective state, crowd annotated and correlated with a gold-standard annotation created by a team of expert psychologists. DAiSEE functions as the out-of-distribution test dataset: we randomly sampled 1,129 frames from DAiSEE and re-annotated them with the SED labels, yielding 984 Screen samples, 112 Wander samples, and 33 Paper samples.
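The sketch below shows one way to realize the SED splits described above. The index file names, column names, and exact bookkeeping are assumptions for illustration only; the project's actual preprocessing may differ:
```python
# Rough sketch of the SED evaluation splits; file and column names are assumed.
import pandas as pd

balanced = pd.read_csv("sed_balanced_index.csv")      # 1,973 frames: paper / screen / wander
imbalanced = pd.read_csv("sed_imbalanced_index.csv")  # 18,721 frames, heavily skewed to screen

train_ft = balanced.sample(frac=0.8, random_state=0)  # ~80% of the balanced set for finetuning
eval_balanced = balanced.drop(train_ft.index)         # remaining ~20% of the balanced set
eval_imbalanced = imbalanced[                         # rest of the imbalanced set for evaluation
    ~imbalanced["frame_path"].isin(balanced["frame_path"])
]
```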
---
## ML Methodology

We utilize the [Minigpt4](https://minigpt-4.github.io/) framework. It consists of a pretrained vision model ([ViT](https://arxiv.org/abs/2010.11929)), [BLIP2](https://arxiv.org/abs/2301.12597), a hybrid model that maps the vision vector space into a language-compatible vector space, and a language model ([Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)). The main strength of MiniGPT4 is that the vision, language, and hybrid model weights are all frozen, and only an additional linear projection layer between BLIP2 and the language model is parameterized, because the primary task of MiniGPT4 is to learn to align the vision and language vector spaces. Training is therefore limited to the linear layer, while the frozen vision and language components only run inference and their features can be precomputed, shortening the time and resources needed for both training and inference. In my experiments, I finetune and evaluate with a batch size of 1 and 4-bit quantization. Finetuning MiniGPT4 takes 20 GiB of VRAM, while inference requires only 8 GiB.
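A minimal PyTorch sketch of the trainable path described above: a single linear projection maps the frozen BLIP2/Q-Former visual features into the language model's embedding space. The dimensions and module names are illustrative assumptions, not MiniGPT4's actual code:
```python
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    """The only trainable component: project visual query tokens into the LLM embedding space."""
    def __init__(self, vis_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, qformer_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_query_tokens, vis_dim) -> (batch, num_query_tokens, llm_dim)
        return self.proj(qformer_features)

proj = VisionToLLMProjection()
# The ViT, BLIP2 Q-Former, and Vicuna weights stay frozen (requires_grad=False);
# only the projection's parameters go to the optimizer.
optimizer = torch.optim.AdamW(proj.parameters(), lr=1e-4)
```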
---
### Multi Modal Question & Answer data pairs setup
MiniGPT4 expects image_ids paired with their target caption responses. The list of questions used during finetuning is stored in an alignment.txt file; for each image, a question from this file is randomly chosen and asked of MiniGPT4 during finetuning. The original MiniGPT4 finetuning prompts are listed below, and a sketch of this data setup follows the prompt lists:
* *Describe this image in detail*
* *Take a look at this image and describe what you notice*
* *Please provide a detailed description of the picture*
* *Could you describe the contents of this image for me?*
The finetuning prompts specific to the Student Engagement dataset are the following:
* *Is the person looking straight at the screen?*
* *Is the person looking down at the paper?*
* *Is the person looking away?*
* *Is the person looking straight at the screen? Is the person looking down at the paper? Is the person looking away?*
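The sketch below illustrates the setup just described: (image_id, caption) pairs plus a pool of finetuning questions sampled at random per image. The file names and JSON layout are assumptions for illustration, not MiniGPT4's exact format:
```python
import json
import random

# e.g. sed_captions.json: [{"image_id": "frame_0412", "caption": "The person is looking down at the paper."}, ...]
with open("sed_captions.json") as f:
    pairs = json.load(f)

# alignment.txt: one finetuning question per line
with open("alignment.txt") as f:
    questions = [line.strip() for line in f if line.strip()]

def make_training_sample(pair: dict) -> dict:
    """Pair an image with a randomly drawn finetuning question and its target caption."""
    return {
        "image_id": pair["image_id"],
        "question": random.choice(questions),
        "answer": pair["caption"],
    }
```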
---
### Direct Preference Optimization

Reinforcement Learning from Human Feedback (RLHF), as well as DPO, makes use of the [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model). The Bradley-Terry model converts a dataset of preferences into a numeric score, called a reward, for each question-answer pair, such that the scores numerically reflect the preferences of the annotators. A Maximum Likelihood Estimator (MLE) is constructed from the Bradley-Terry model such that the probability of choosing the preferred answer over the rejected answer is maximized. DPO extends the RLHF loss function by reparameterizing the reward in terms of the policy itself, turning the PPO-based RLHF procedure into a single differentiable objective.
#### Brief Derivation
$$P(y_w > y_l) = \frac{e^{r^*(x,y_w)}}{e^{r^*(x,y_w)}+e^{r^*(x,y_l)}}$$
$$A = r^*(x,y_w), \qquad B = r^*(x,y_l)$$
$$\frac{e^A}{e^A + e^B} = \frac{e^A / e^A}{(e^A + e^B)/e^A} = \frac{1}{1 + e^{B-A}} = \frac{1}{1 + e^{-(A - B)}} = \sigma(A - B)$$
$$L = -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log\sigma\left(r_\gamma(x,y_w) - r_\gamma(x,y_l)\right)\right]$$
The above rearranges the Bradley-Terry pairwise comparison model into a sigmoid of the difference between the preferred and rejected rewards; in DPO these rewards will in turn be expressed through log probabilities.
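A tiny numeric check of the identity above: the Bradley-Terry preference probability equals the sigmoid of the reward difference. The reward values are arbitrary example numbers:
```python
import math

def bradley_terry(r_w: float, r_l: float) -> float:
    """P(y_w > y_l) under the Bradley-Terry model for rewards r_w, r_l."""
    return math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

r_w, r_l = 1.7, 0.3   # rewards of the preferred and rejected answers
assert abs(bradley_terry(r_w, r_l) - sigmoid(r_w - r_l)) < 1e-12
print(bradley_terry(r_w, r_l))   # ~0.802
```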
#### Closed form Reward function
The below is the final derivation that turns the PPO objective with its Kullback-Leibler divergence constraint into a reward model. The key is that the optimal policy of the KL-constrained reward maximization problem has a closed form, and when that closed form is substituted back into the Bradley-Terry model, the intractable partition function Z(x) cancels out. The reward therefore becomes a closed-form function of the preferred and rejected log probabilities, and a simple derivative of the resulting loss can be taken.
$$Z(x) = \sum_y \pi_{ref}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$$
$$\pi_r(y|x) = \frac{1}{Z(x)}\pi_{ref}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$$
$$P(y_w > y_l) = \sigma(r(x,y_w) - r(x,y_l)) = \sigma\left(\beta \log \frac{\pi^*(y_w | x)}{\pi_{ref}(y_w | x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l | x)}{\pi_{ref}(y_l | x)} - \beta \log Z(x)\right)$$
The full direct preference optimization objective then follows:
$$L_{DPO}(\pi_\theta;\pi_{ref}) = -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$
DPO takes the log probabilities of the preferred and rejected responses under both the supervised fine-tuned (SFT) reference model and the trainable policy model. The aggregate ratios between the policy and reference probabilities for the preferred and rejected responses are $$\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}$$ and $$\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}$$
The difference between these two log ratios is used to minimize the negative log-likelihood loss, with the policy model's aggregate being the target representation to learn. This is how the SFT model learns to cover sample spaces that are not captured by question, image, answer triplets alone, and learns to output preferred responses about an image given a question. An important aspect of the DPO loss is that the reference SFT model is used both to generate the preferred/rejected pairs and to constrain the loss function within the vector space of the reference SFT model.
The source code implementation for DPO is found [here](https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py).
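A minimal PyTorch sketch of the core DPO loss, assuming each tensor holds summed per-sequence log probabilities of shape `(batch,)`; the function and argument names are illustrative, not the repository's API:
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the trainable policy vs. the frozen SFT reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Negative log-sigmoid of the scaled difference, averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```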
#### Hallucination Aware Direct preference Optimization(HA-DPO)
The *hallucination aware* aspect of HA-DPO comes from the paper [Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization](https://arxiv.org/abs/2311.16839). Put simply, the authors of HA-DPO used GPT-4 through the OpenAI API to generate style-consistent non-hallucinatory/hallucinatory pairs for their Visual Question & Answering (VQA) task on the Visual Genome dataset, evaluated using [Polling-based Object Probing Evaluation](https://github.com/RUCAIBox/POPE). Their goal was to quantify object hallucination in VQA tasks. My method instead self-generates the preferred and rejected pairs using the reference SFT model, but adds the same auxiliary LLM loss to DPO as HA-DPO does. The auxiliary loss is inspired by the [InstructGPT paper](https://arxiv.org/abs/2203.02155). The final loss function I use for my project is the below.
$$L_{dpo}(\pi_{\theta};\pi_{ref}) = -\mathbb{E}_{(x_T,x_I,y_{pos},y_{neg})\sim D}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{pos}|[x_T,x_I])}{\pi_{ref}(y_{pos}|[x_T,x_I])} - \beta \log \frac{\pi_{\theta}(y_{neg}|[x_T,x_I])}{\pi_{ref}(y_{neg}|[x_T,x_I])}\right)\right]$$
$$L_{aux} = -\sum \log P(y|x_P;\pi_{\theta}), \quad (x_P,y)\sim D_{sft}$$
$$L = L_{dpo} + \lambda L_{aux}$$
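A sketch of the combined objective, reusing the `dpo_loss` helper sketched above and assuming `sft_token_logps` holds the policy's per-token log probabilities of the SFT targets; all names are illustrative:
```python
import torch

def hadpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               sft_token_logps: torch.Tensor,   # (batch, seq_len) log P(y_t | x_P) under the policy
               beta: float = 0.1,
               lam: float = 1.0) -> torch.Tensor:
    l_dpo = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=beta)
    l_aux = -sft_token_logps.sum(dim=-1).mean()  # auxiliary SFT negative log-likelihood
    return l_dpo + lam * l_aux
```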
#### Hallucination Aware Kahneman Tversky Optimization(HA-KTO)
There is no published HA-KTO paper; I coined HA-KTO because I performed the same preference alignment finetuning with the [Kahneman-Tversky Optimization algorithm](https://arxiv.org/pdf/2402.01306). I took the currently available preference alignment algorithms from the [source code of Hugging Face TRL](https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py) and implemented them for MiniGPT4 Visual Question & Answering. Unlike DPO, KTO does not build a maximum likelihood objective from the Bradley-Terry model; it is derived from Kahneman and Tversky's prospect theory and only requires a per-example binary signal of whether a response is desirable or undesirable. At the time of writing, KTO is touted as the strongest preference alignment optimization algorithm available.
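The sketch below is a simplified paraphrase of the KTO objective as I understand it from the paper: the policy/reference log-ratio of each response is compared against a detached KL reference point `z0`, with separate weights for desirable and undesirable examples. The reference-point estimation is omitted, and the variable names are assumptions, not the TRL implementation:
```python
import torch

def kto_loss(policy_logps: torch.Tensor, ref_logps: torch.Tensor,
             desirable: torch.Tensor,        # boolean mask: True if the response is desirable
             z0: torch.Tensor,               # detached KL reference point, estimated elsewhere
             beta: float = 0.1, lam_d: float = 1.0, lam_u: float = 1.0) -> torch.Tensor:
    r = policy_logps - ref_logps                          # implied reward (log-ratio) per example
    value_d = lam_d * torch.sigmoid(beta * (r - z0))      # value of desirable responses
    value_u = lam_u * torch.sigmoid(beta * (z0 - r))      # value of undesirable responses
    value = torch.where(desirable, value_d, value_u)
    lam = torch.where(desirable, torch.full_like(r, lam_d), torch.full_like(r, lam_u))
    return (lam - value).mean()
```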
---
## Evaluation
In the evaluation, an image is vectorized by the vision model and a question template is vectorized by the language model. MiniGPT-4 concatenates these vectors as sequential tokens and outputs a sequence of words. The prompting template described in the MiniGPT4 paper was used. The final prompt that is vectorized is shown below; it is currently the template that yields the best classification performance. Note that the `', '` between the ```<Img>``` tags indicates where the visual vector embedding is concatenated:
```
###Human: Given the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img>', '<Img>###Human: Is the person looking straight at the screen? Is the person looking down atthe paper? Is the person looking away?###Assistant:
```
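The sketch below shows how such a prompt is typically combined with the visual embedding in MiniGPT4-style models: the prompt is split at the image placeholder, each text segment is embedded by the language model, and the projected visual tokens are spliced in between. The placeholder string, the `embed_text` callable, and the shapes are assumptions for illustration:
```python
import torch

def build_input_embeddings(prompt: str,
                           image_tokens: torch.Tensor,   # (1, num_query_tokens, llm_dim) projected visual features
                           embed_text) -> torch.Tensor:  # callable: str -> (1, seq_len, llm_dim) text embeddings
    before, after = prompt.split("', '", 1)              # split at the visual-embedding marker
    return torch.cat([embed_text(before), image_tokens, embed_text(after)], dim=1)
```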
DistilBERT, used as a sentence classification model, is tuned on the augmented label captions of the SED dataset and classifies the output of MiniGPT4. Accuracy, F1-macro, Precision-macro, and Recall-macro are aggregated; a sketch of the metric aggregation follows the list below.
**Metrics**
* Accuracy - *base metric of overall performance*
* F1 Macro - *used to gauge model performance on the average of each class*
* Precision Macro - *What proportion of positive identifications was actually correct*
* Recall Macro - *What proportion of actual positives was identified correctly*
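A sketch of how these macro-averaged metrics can be aggregated from the DistilBERT-classified outputs, using scikit-learn; the example label lists are placeholders:
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["screen", "paper", "wander", "screen"]   # gold SED frame labels (placeholder values)
y_pred = ["screen", "paper", "screen", "screen"]   # DistilBERT-classified MiniGPT4 answers

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"acc={accuracy:.3f}  P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```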
## Results
---
**Table 1. SED Balanced test set**

---
**Table 2. SED imbalanced test set**

---
**Table 3. SED Hard Samples test set**

---
**Table 4. DAiSEE out of distribution test set**

---
<!-- ## Communication
The hypothetical "client" or "target audience" is the computational problems highlighted in [5][6][7]. There in this project is not targeted towards a human entity but purely to address computational problems of RLHF, LLMs and LVLMs.
* Weekly meetings every friday 930 est via zoom
* Dr Latecki, Dr Kant, Lu Pang and Tony Siu attend these
* Weekly progress is summarized
* Tasks for the folowing week is decided
* Assessment on the tasks performed is evaluated
* Notes on weekly meetings is found [here](https://docs.google.com/document/d/1_0ds-KxvDJewPCKHj6eLNUGwbOLdLdTJxvAjcBHAGJo/edit?usp=sharing) for project team members
* Contact persons
* lu.pang@temple.edu
* latecki@temple.edu
* kkant@temple.edu
-->
<!-- ## Personnel
* Dr Longin Jan Latecki
* Project Ideation
* Lu Pang, Post Doc student
* Researcher with access to high performance compute
* Experiments on Direct Preference Optimization rejected/preferred response pairs generation
* Set proportion of rejected/preffered generated responses as to cover for out of sample context
* Deliver DPO results to characterize active learning loop to tackle out of sample data
* Stay up to date with different LVLM architectures
* Paper writing
* Tony Siu
* Part time researcher
* Set all baseline evaluations of different models
* Formulated Conversational Evaluation bot
* Designed evaluation framework
* Set different finetuning hyperparameters for Visual Question and answering
* Provided and integrated Direct Preference Optimization code within Conversational Evaluation framework
* Experiments with interchanging vision and language architecture within the Vision Language model
* Write preprocessing, training and evaluation scripts for finetuning
* Paper writing
* Dr Krishna Kant
* Provides compute
* Revises paper -->
<!-- ## Plan
#### Important Dates
* ICPR submission extended deadline, April 10
* ACM-MultiMedia submission deadline, April 12th
* ECAI submission deadline, April 25th -->
<!-- #### Development MileStones
* [x] Get Data
* [x] SED (May 2023)
* [x] Random Sampled & Annotated DAiSEE (Feb 2024)
* [x] Choose Hard Samples from SED
* [x] EDA
* [x] SED (May 2023)
* [x] DAiSEE (Feb 2024)
* [x] Preprocessing (May 2023)
* [x] SED (May 2023)
* [x] DAiSEE (Feb 2024)
* [x] Modeling
* [x] ViperGPT (March 2023)
* Does not work for abract arbitrary Datasets like SED
* [x] VisualChatGPT (April 2023)
* resource limitation for research
* [x] VILT (April 2023)
* Proof of Concept
* [x] MiniGPT4 (July 2023)
* lightweight resource efficient hybrid VQA model
* Only need tune single linear
* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Evaluation
* [X] SED balanced set

* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Minigpt4 SED prompt finetuning
* [x] Minigpt4 original prompt finetuning
* [x] Xception
* [x] MobileNets V3
* [x] VGG16
* [x] SED imbanaced set

* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Minigpt4 SED prompt finetuning
* [x] Minigpt4 original prompt finetuning
* [x] Xception
* [x] MobileNets V3
* [x] VGG16
* [x] Out of Distribution

* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Minigpt4 SED prompt finetuning
* [x] Minigpt4 original prompt finetuning
* [x] Xception
* [x] MobileNets V3
* [x] VGG16
* [x] Hard Samples

* [x] Minigpt4 + DPO
* [x] Minigpt4 + HADPO
* [x] Minigpt4 SED prompt finetuning
* [x] Minigpt4 original prompt finetuning
* [x] Xception
* [x] MobileNets V3
* [x] VGG16
* [ ] Reporting
-->
<!-- #### Miscellaneous tasks
* 3/22/2024 - 3/26/24
* Set baselines with POS tag ```<img></img>``` as opposed ```<img><img>``` according to the Vicuna[4] template
* Revise paper -->
<!-- ## References
* [1] Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
* [2] Thoppilan, Romal, et al. "Lamda: Language models for dialog applications." arXiv preprint arXiv:2201.08239 (2022).
* [3] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." Advances in Neural Information Processing Systems 36 (2024).
* [4]Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng,
Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An
open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https:
//vicuna.lmsys.org
* [5] Casper, Stephen, et al. "Open problems and fundamental limitations of reinforcement learning from human feedback." arXiv preprint arXiv:2307.15217 (2023).
* [6] Höglund, S., & Khedri, J. (2023). Comparison Between RLHF and RLAIF in Fine-Tuning a Large Language Model (Dissertation). Retrieved from https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-331926
* [7] Kirk, Robert, et al. "Understanding the effects of rlhf on llm generalisation and diversity." arXiv preprint arXiv:2310.06452 (2023)
* [8] Chen, Hailin, et al. "ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?." arXiv preprint arXiv:2311.16989 (2023).
* [9] Zhu, Deyao, et al. "Minigpt-4: Enhancing vision-language understanding with advanced large language models." arXiv preprint arXiv:2304.10592 (2023).
* [10]Li, Shengzhi, Rongyu Lin, and Shichao Pei. "Multi-modal preference alignment remedies regression of visual instruction tuning on language model." arXiv preprint arXiv:2402.10884 (2024).
* [11] Zhao, Zhiyuan, et al. "Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization." arXiv preprint arXiv:2311.16839 (2023).
* [12] Li, Lei, et al. "Silkie: Preference distillation for large visual language models." arXiv preprint arXiv:2312.10665 (2023).
* -->
<!--
## Appendix
---
### RLHF vs RLAIF
[This](https://klu.ai/glossary/rlaif) article highlights the pros and cons of RLHF vs RLAIF and some common limitations for small and private entities with limited resources. While [8] gives a comprehensive over view of the costs and benefits and the current landscape of LLM development.
---
### The few Studies on Visual Question Answer Preference Alignment Optimization available
[10][11][12] have been recently published. All 3 methods differ from our work in that they do not self generate preferred/rejected response pairs or tweak with the log probability ratios of preferred/rejected response pairs encoded by the reference and sft models. However, it may be a future work to look to incorporating concepts from these works. -->
## Conclusion
My experimental results with DPO preference tuning methods suggest that the out-of-distribution problem common in statistics and machine learning can be formulated as a simple preference selection problem. Both HA-DPO and HA-KTO consistently outperform the other baselines on the SED dataset and also perform well on the DAiSEE out-of-distribution dataset at a batch size of 1. It remains to be seen how far DPO's generalizability can go with higher-precision quantization, larger model sizes, and more compute resources.
## Future Work
Future work may involve methodologies to guarantee the output sentence structure of the LVLM, using context-free grammars as discrete activation functions on the final layers of the language model. This would act as an embedded token parsing mechanism for the model to learn, guiding the learnable gradient space of the LVLM. Additional work would involve experiments with further derivatives of DPO, including the more recent ORPO, which extends DPO so that the reference SFT model is no longer needed at all. However, out-of-distribution generalizability, reduction of human annotation effort, and compute costs would remain research topics for these cutting-edge techniques.