Tony Siu

# Project Charter

<!-- Does not require a new model; cheaper and more performant than developing our own model from scratch. -->
<!-- why, what, how -->
<!-- prompting evaluation and why it didn't work -->
<!-- establish the point that question-and-answering models do not perform well under out-of-context evaluation -->
<!-- No need to come up with dataset-specific prompts -->
<!-- visualize prompt framework -->
<!-- no human annotation involvement needed -->

---

## Abstract

Generative AI has been the topic of discussion at the turn of the 2020s. Applications like ChatGPT have stunned industry and academia with their ability to mimic conversation with a knowledgeable individual. With Large Language Models at the base of GenAI technology, these models have been growing ever larger, with unrealistic demands for GPU resources. Yet GenAI still leaves many limitations unaddressed: applications like ChatGPT perform fine at a high level of abstraction or on simple tasks with only a fixed number of outcomes, while low-level, niche, and specific knowledge is often thrown out the window.

This project formulates the well-known problem of out-of-distribution model performance as a preference selection problem. Extending Reinforcement Learning from Human Feedback ([RLHF](https://arxiv.org/pdf/2203.02155)), I use the novel [Direct Preference Optimization](https://arxiv.org/abs/2305.18290) algorithm and its derivatives to make Large Vision Language Models learn subjective preferences that are not present in the available data, and to have the model make inferences beyond the current dataset. This is done such that a Supervised Finetuned (SFT) model self-generates preferred/rejected pairs without human annotation, without the need for reinforcement learning, and without relying on knowledge distillation from closed-source commercial GenAI models like ChatGPT. In doing so, I establish that the out-of-distribution problem is fundamentally a preference selection task in which computational costs are cut, human domain expertise and annotation costs are unnecessary, and training a model from scratch is not always necessary either.

* [Github link](https://github.com/Tony363/HA-DPO/tree/main)
* [Slides Deck](https://docs.google.com/presentation/d/1sknHWkxdDRP-JH9UOo8KMRiCXQlUUI4HfZ2OJfSnvXc/edit?usp=sharing)

---

## Problem Description

Current literature offers little to no studies or methodologies for preference-aligning Multi-Modal tasks such as Visual Question and Answering. This is especially pronounced when commercial big tech companies claim performance on industry tasks while offering no means to evaluate or reverse engineer the complex Multi-Modal systems involved. Despite the plethora of resources available to these companies, there is little refinement or understanding of their deployed systems. To that end, the phenomenon of "prompt engineering" has become popular in the social sciences. However, the quantifiable reliability of prompting the limited vector space of a Large Vision Language Model (LVLM) remains dubious. In our own studies, while building a template chat bot for evaluation purposes, simply adding or removing a '/' in a part-of-speech (POS) tag that indicates the positional location of a visual embedding made a 14% difference in classification performance.
As there is not yet a universal, rigorously quantifiable template for querying an LVLM, I chose to diverge from popular "prompt engineering" studies and instead sought closed-form optimization methodologies for both directing and generalizing a model's learnable gradient space. Below is a comparison between the two sample chat templates that caused the 14% performance difference: the template without the '/' yielded a 14% increase compared to sticking rigorously to the Vicuna POS tag format.

```
###Human: Given the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img>', '<Img>###Human: Is the person looking straight at the screen? Is the person looking down at the paper? Is the person looking away?###Assistant:
```

```
###Human: Given the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img>', '</Img>###Human: Is the person looking straight at the screen? Is the person looking down at the paper? Is the person looking away?###Assistant:
```

<!-- The goal of the project is to explore and provide methodologies to parameterize custom knowledge into a Multi-Modal capable AI system without having to develop and build such models from scratch. This should cut research costs, training time, resource costs, data annotation costs, and the domain expertise required on a subject. -->

<!-- Visual Question Answering[1] is a Multimodal, Multiview Computer vision task. Provided with a visual component and a natural language query, the task is to address contextual information using both. As natural language context is free-form and open-ended, Visual Question Answering tasks are also free-form and open-ended. I believe that this is the major difficulty current AGI[2] research has yet to overcome. Therefore, one of the goals of the project is to explore and measure the implications of how Machine Learning can contextualize qualitative abstractions given 2 or more input modal vectors. The tasks that are viable for these datasets may be restricted, and further data engineering of the dataset may be necessary. Furthermore, as the task of addressing queries may be open-ended and free-form, the objective functions, how to measure model effectiveness, and the error analysis may vary depending on individual components within the VQA framework. -->

---

## Task

![Screenshot 2024-05-03 203907](https://hackmd.io/_uploads/SJuaXb7zA.png)

Conventional Computer Vision models cannot output qualitative, detailed text responses. Framing the common Multi-Modal task, specifically Visual Question & Answering, as an encompassing task that can tackle many conventional computer vision problems, we utilize a toy dataset, the Student Engagement Dataset. This dataset was chosen for its unconventional qualitative nature, where performance is subjective and open to criticism. Provided with the set of labels "paper", "screen", and "wander", the LVLM encodes an image, encodes a question, and outputs an answer in the context of the image and question. An explicit evaluation method using DistilBERT is then used to classify the output sentence of the LVLM.
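To make the classification step concrete, the sketch below maps a free-form MiniGPT4 answer to one of the three SED labels with a DistilBERT sequence classifier. It is a minimal illustration only: the checkpoint path and label order are placeholders (it assumes a checkpoint already fine-tuned on the SED label captions), not the project's actual artifacts.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "path/to/distilbert-sed-classifier"  # hypothetical fine-tuned checkpoint
LABELS = ["paper", "screen", "wander"]      # assumed label order

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=len(LABELS))
model.eval()

def classify_answer(answer_text: str) -> str:
    """Map a free-form LVLM answer sentence to one of the three SED labels."""
    inputs = tokenizer(answer_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(classify_answer("The person is looking down at the paper on the desk."))
```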
The core contributions of the project are the following:

* A novel semi-supervised Multi-Modal preference alignment optimization method that treats the out-of-distribution problem as a preference selection task
* A characterization of the out-of-sample generalizability of closed-form preference alignment algorithms as opposed to RLHF methods
* A precedent for the necessity of preference-aligning model responses, as opposed to "appropriately" prompt-engineering a "correct" response within a limited model parameter space and with limited resources

<!-- Using [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4/tree/main), I have achieved a 97% precision-recall harmonization on the Student Engagement 2K balanced dataset and a 93% precision-recall on the entirety of the 19K Student Engagement dataset. The next steps for the project are to further study the Visual Question and Answering task using different alignment optimization algorithms. The core contributions are listed below:
* Study of the generalization ability pre and post alignment optimization
* Comparison to current Multi-Modal benchmarks
* Develop a methodology that self-generates question and answer pairs, allowing DPO to provide preference alignment towards a broader distribution unavailable in the in-sample dataset -->

---

## Data

* [ICCVW Frame Engagement Annotations](https://cs-people.bu.edu/sbargal/studentdatasets/index.html)
  ![sample_images_small](https://hackmd.io/_uploads/SJlZigPh6.jpg)
  * The Student Engagement Dataset (SED) consists of approximately 19K frames divided between three classes (looking at screen, looking at paper, wandering) from 19 different students. For the frame-level annotations, videos of 19 participants were sampled at one FPS, giving a total of 18,721 frames. The SED comes in an imbalanced and a balanced set. In the imbalanced distribution, the Screen class includes 14 times more samples than the Wander class and three times more than the Paper class: the Paper class includes 4,655 frames, the Screen class 13,483 frames, and the Wander class 583 frames, for a total of 18,721 frames. A more balanced version of this dataset is constructed by removing similar samples from each class; it is more equally distributed and contains 638 samples for the Paper class, 826 for the Screen class, and 509 for the Wander class, for a total of 1,973 samples. We sampled only three students out of the original 19 for our test set. 80% of the balanced set is used for finetuning; 20% of the balanced set and the rest of the imbalanced set are used for evaluation. Lastly, another 85 hard samples drawn from across the SED dataset, not within the training set, are used as an additional test set.
* [DAiSEE, Dataset for Affective States in E-Environments](https://people.iith.ac.in/vineethnb/resources/daisee/index.html)
  ![daisee](https://hackmd.io/_uploads/ByvApBo-0.png)
  * The first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration "in the wild". The dataset has four levels of labels, namely very low, low, high, and very high, for each of the affective states; these are crowd-annotated and correlated with a gold-standard annotation created by a team of expert psychologists. DAiSEE functions as the out-of-distribution test dataset.

We randomly sampled 1,129 frames from DAiSEE and annotated them according to the SED label framework. Having re-annotated DAiSEE, there are 984 Screen samples, 112 Wander samples, and 33 Paper samples.

---

## ML Methodology

![Screenshot 2024-04-28 001814](https://hackmd.io/_uploads/Sky7RSiZA.png)

We utilize the [Minigpt4](https://minigpt-4.github.io/) framework. It consists of a pretrained vision model ([ViT](https://arxiv.org/abs/2010.11929)), [BLIP2](https://arxiv.org/abs/2301.12597) (a hybrid model that transforms the vision vector space into a language-compatible vector space), and a language model ([Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)). The main strength of Minigpt4 is that the vision, language, and hybrid model weights are all frozen, and only an additional linear projection layer between BLIP2 and the language model is parameterized. This is because the primary task of Minigpt4 is to learn to align the vision and language vector spaces. Training is therefore required only for the linear layer, while inference for the vision and language components can be precomputed, shortening the time and resources needed for both training and inference. In my experiments, I finetune and evaluate at a batch size of 1 with 4-bit quantization. Finetuning MiniGPT4 takes about 20 GiB of VRAM, while inference requires only 8 GiB.
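As a rough illustration of this training setup, the sketch below freezes everything except a single linear projection from the vision side into the LLM embedding space. Dimensions and module names are illustrative assumptions, not MiniGPT4's actual code.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps BLIP-2/Q-Former output tokens into the language model's embedding space."""
    def __init__(self, qformer_dim: int = 768, llm_embed_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_embed_dim)   # the only trainable weights

    def forward(self, qformer_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_query_tokens, qformer_dim) -> (batch, num_query_tokens, llm_embed_dim)
        return self.proj(qformer_tokens)

def freeze(module: nn.Module) -> None:
    """Turn off gradients for a frozen backbone (vision encoder, Q-Former, LLM)."""
    for p in module.parameters():
        p.requires_grad = False

# usage sketch: freeze(vision_encoder); freeze(qformer); freeze(llm)
# optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-5)
projector = VisionToLLMProjector()
print(sum(p.numel() for p in projector.parameters() if p.requires_grad), "trainable parameters")
```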
---

### Multi-Modal Question & Answer data pairs setup

Minigpt4 expects image_ids paired with their target caption responses. The questions used during finetuning are stored in an alignment.txt file; for each image, a question is randomly chosen from this file and asked of MiniGPT4 during finetuning (a small sketch of this pairing follows the prompt lists below). The original MiniGPT4 finetuning prompts are:

* *Describe this image in detail*
* *Take a look at this image and describe what you notice*
* *Please provide a detailed description of the picture*
* *Could you describe the contents of this image for me?*

The finetuning prompts specific to the Student Engagement dataset are:

* *Is the person looking straight at the screen?*
* *Is the person looking down at the paper?*
* *Is the person looking away?*
* *Is the person looking straight at the screen? Is the person looking down at the paper? Is the person looking away?*
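A minimal sketch of this image/caption pairing with a randomly drawn question is shown below, assuming a flat folder of JPEG frames, a caption lookup keyed by image_id, and a plain-text alignment file; class and field names are illustrative, not MiniGPT4's actual dataset code.

```python
import random
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class SEDCaptionDataset(Dataset):
    """Pairs each frame with its target caption and a randomly sampled finetuning question."""
    def __init__(self, image_dir: str, captions: dict, alignment_file: str):
        self.image_paths = sorted(Path(image_dir).glob("*.jpg"))
        self.captions = captions                                  # image_id -> target caption
        self.questions = Path(alignment_file).read_text().splitlines()

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        path = self.image_paths[idx]
        question = random.choice(self.questions)                  # one prompt sampled per step
        return {
            "image": Image.open(path).convert("RGB"),             # a vision processor would normally transform this
            "question": question,
            "answer": self.captions[path.stem],                   # target caption for this image_id
        }
```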
---

### Direct Preference Optimization

![Screenshot 2024-04-29 053435](https://hackmd.io/_uploads/HJzAFkpZC.png)

Reinforcement Learning with Human Feedback (RLHF), as well as DPO, makes use of the [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model). The Bradley-Terry model converts a dataset of preferences into a numeric score, called a reward, assigned to each question-answer pair, such that the score numerically reflects the preferences of the annotators. A Maximum Likelihood Estimator (MLE) is constructed from the Bradley-Terry model such that the probability of choosing the preferred answer is maximized over that of the rejected answer. DPO extends the RLHF loss function by making the PPO-based RLHF objective directly differentiable.

#### Brief Derivation

$$P(y_w > y_l) = \frac{e^{r^*(x,y_w)}}{e^{r^*(x,y_w)} + e^{r^*(x,y_l)}}$$

Let

$$A = r^*(x,y_w), \qquad B = r^*(x,y_l)$$

Then

$$\frac{e^A}{e^A + e^B} = \frac{e^A / e^A}{(e^A + e^B)/e^A} = \frac{1}{1 + \frac{e^B}{e^A}} = \frac{1}{1 + e^{-(A - B)}} = \sigma(A - B)$$

$$L = -\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log\sigma\big(r_\gamma(x,y_w) - r_\gamma(x,y_l)\big)\right]$$

The above rearranges the Bradley-Terry pairwise comparison model into a sigmoid of the difference between the preferred and rejected rewards, which DPO later expresses in terms of log probabilities.

#### Closed form Reward function

The below is the final derivation of the PPO objective with Kullback-Leibler divergence as a reward model. The key is that, once the KL-constrained derivation is rearranged, the partition function Z(x) cancels out of the preferred and rejected terms. The reward therefore becomes a closed-form function of the preferred and rejected log probabilities, and a simple derivative of the function can be taken.

$$Z(x) = \sum_y \pi_{ref}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$$

$$\pi_r(y|x) = \frac{1}{Z(x)}\,\pi_{ref}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$$

$$P(y_w > y_l) = \sigma\big(r(x,y_w) - r(x,y_l)\big) = \sigma\left(\beta\log\frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} + \beta\log Z(x) - \beta\log\frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)} - \beta\log Z(x)\right)$$

The below is the full formulation of the Direct Preference Optimization objective function.

![Screenshot 2024-03-22 234527](https://hackmd.io/_uploads/rkedxRsRp.png)

The DPO loss works as follows. DPO takes in the softmax vector representations of the rejected and preferred responses from both the supervised fine-tuned (SFT) reference model and the trainable policy model. The per-response ratios between policy and reference probabilities for the preferred and rejected responses are

$$\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} \qquad \text{and} \qquad \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}$$

The difference between these ratios is used to minimize a negative log-likelihood loss, with the policy model's aggregate as the target representation to learn. This is how the SFT model learns to cover sample spaces that are not learned from question-image-answer triplets alone, and learns to output preferred responses about an image given a question. An important aspect of the DPO loss is that the reference SFT model is used both to generate the preferred/rejected pairs and to constrain the loss function within the vector space of the reference SFT model. The source code implementation for DPO is found [here](https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py).

#### Hallucination Aware Direct Preference Optimization (HA-DPO)

The *hallucination aware* aspect of HA-DPO comes from the paper [Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization](https://arxiv.org/abs/2311.16839). To put it simply, the authors of HA-DPO used GPT-4 via the OpenAI API to generate style-consistent non-hallucinatory/hallucinatory pairs for their Visual Question & Answering (VQA) task on the Visual Genome dataset, evaluated using [Polling-based Object Probing Evaluation](https://github.com/RUCAIBox/POPE). Their goal was the ability to quantify object hallucination in VQA tasks. My method self-generates the preferred and rejected pairs using the reference SFT model, but adds the same auxiliary LLM loss to DPO as HA-DPO does. The added auxiliary loss is inspired by the [InstructGPT paper](https://arxiv.org/abs/2203.02155).
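Before stating the combined loss, here is a minimal sketch of the plain DPO preference term derived above, assuming the per-response log probabilities have already been summed over tokens for each (preferred, rejected) pair; this is a simplified restatement of the objective, not the project's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Negative log-sigmoid of the scaled difference between policy/reference log-ratios."""
    pi_ratio_w = policy_logp_w - ref_logp_w   # log [ pi_theta(y_w|x) / pi_ref(y_w|x) ]
    pi_ratio_l = policy_logp_l - ref_logp_l   # log [ pi_theta(y_l|x) / pi_ref(y_l|x) ]
    logits = beta * (pi_ratio_w - pi_ratio_l)
    return -F.logsigmoid(logits).mean()

# toy usage with fake summed log probabilities for a batch of 2
loss = dpo_loss(torch.tensor([-5.0, -6.0]), torch.tensor([-7.0, -6.5]),
                torch.tensor([-5.5, -6.2]), torch.tensor([-6.8, -6.4]))
print(loss.item())
```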
The final loss function I use for my project is the following:

$$L_{dpo}(\pi_{\theta};\pi_{ref}) = -\mathbb{E}_{(x_T,x_I,y_{pos},y_{neg})\sim D}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{pos}|[x_T,x_I])}{\pi_{ref}(y_{pos}|[x_T,x_I])} - \beta\log\frac{\pi_{\theta}(y_{neg}|[x_T,x_I])}{\pi_{ref}(y_{neg}|[x_T,x_I])}\right)\right]$$

$$L_{aux} = -\sum \log P(y|x_P;\pi_{\theta}),\quad (x_P,y)\sim D_{sft}$$

$$L = L_{dpo} + \lambda L_{aux}$$

#### Hallucination Aware Kahneman Tversky Optimization (HA-KTO)

There is no published HA-KTO paper. I coined HA-KTO because I performed preference alignment finetuning with the [Kahneman Tversky Optimization algorithm](https://arxiv.org/pdf/2402.01306). I simply took the currently available RLHF algorithms from the [source code of huggingface TRL](https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py) and implemented them for MiniGPT4 Visual Question & Answering. Like DPO, KTO formulates a maximum likelihood objective over preference data. As of this writing, KTO is touted to be the best preference alignment optimization algorithm available.

---

## Evaluation

In the evaluation, an image is vectorized by a vision model and a question template is vectorized by a language model. Using both these vectors, Minigpt-4 concatenates them as sequential tokens and outputs a sequence of words. The prompting template described in the MiniGPT4 paper was utilized. The final prompt that is vectorized is shown below; it is currently the template prompt that yields the best classification performance. Note that the ',' between the ```<Img>``` and ```</Img>``` tags indicates where the visual vector embedding is concatenated:

```
###Human: Given the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. Please answer my questions.###Human: <Img>', '<Img>###Human: Is the person looking straight at the screen? Is the person looking down at the paper? Is the person looking away?###Assistant:
```

DistilBERT, used as a sentence classification model, is tuned on the augmented label captions of the SED dataset and classifies the output of MiniGPT4. Accuracy, F1-macro, Precision-macro, and Recall-macro are aggregated (a small aggregation sketch follows the metric list below).

**Metrics**

* Accuracy - *base metric of overall performance*
* F1 Macro - *used to gauge model performance on the average of each class*
* Precision Macro - *what proportion of positive identifications was actually correct*
* Recall Macro - *what proportion of actual positives was identified correctly*
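The sketch below aggregates the four metrics with scikit-learn's macro-averaged scores; `y_true` and `y_pred` are placeholder lists standing in for the reference labels and the DistilBERT-classified MiniGPT4 answers, not the project's actual results.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = ["screen", "paper", "wander", "screen"]   # placeholder ground-truth labels
y_pred = ["screen", "paper", "screen", "screen"]   # placeholder predicted labels

print("Accuracy       :", accuracy_score(y_true, y_pred))
print("F1 macro       :", f1_score(y_true, y_pred, average="macro"))
print("Precision macro:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall macro   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
```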
## Results

---

**Table 1. SED Balanced test set**
![Screenshot 2024-04-28 011042](https://hackmd.io/_uploads/HyKP98jW0.png)

---

**Table 2. SED imbalanced test set**
![Screenshot 2024-04-28 011110](https://hackmd.io/_uploads/B1GFcIiWA.png)

---

**Table 3. SED Hard Samples test set**
![Screenshot 2024-04-28 011135](https://hackmd.io/_uploads/ryhc5UoZ0.png)

---

**Table 4. DAiSEE out of distribution test set**
![Screenshot 2024-04-28 011154](https://hackmd.io/_uploads/B10jc8iZC.png)

---

<!-- ## Communication
The hypothetical "client" or "target audience" is the set of computational problems highlighted in [5][6][7]. This project is therefore not targeted towards a human entity but purely at addressing computational problems of RLHF, LLMs, and LVLMs.
* Weekly meetings every Friday 9:30 EST via Zoom
  * Dr Latecki, Dr Kant, Lu Pang, and Tony Siu attend
  * Weekly progress is summarized
  * Tasks for the following week are decided
  * The tasks performed are assessed
* Notes on weekly meetings are found [here](https://docs.google.com/document/d/1_0ds-KxvDJewPCKHj6eLNUGwbOLdLdTJxvAjcBHAGJo/edit?usp=sharing) for project team members
* Contact persons
  * lu.pang@temple.edu
  * latecki@temple.edu
  * kkant@temple.edu -->

<!-- ## Personnel
* Dr Longin Jan Latecki
  * Project ideation
* Lu Pang, Postdoc
  * Researcher with access to high performance compute
  * Experiments on Direct Preference Optimization rejected/preferred response pair generation
  * Set the proportion of rejected/preferred generated responses so as to cover out-of-sample context
  * Deliver DPO results to characterize an active learning loop to tackle out-of-sample data
  * Stay up to date with different LVLM architectures
  * Paper writing
* Tony Siu
  * Part time researcher
  * Set all baseline evaluations of different models
  * Formulated the conversational evaluation bot
  * Designed the evaluation framework
  * Set different finetuning hyperparameters for Visual Question and Answering
  * Provided and integrated Direct Preference Optimization code within the conversational evaluation framework
  * Experiments with interchanging vision and language architectures within the Vision Language model
  * Wrote preprocessing, training, and evaluation scripts for finetuning
  * Paper writing
* Dr Krishna Kant
  * Provides compute
  * Revises paper -->

<!-- ## Plan
#### Important Dates
* ICPR submission extended deadline, April 10
* ACM-MultiMedia submission deadline, April 12
* ECAI submission deadline, April 25 -->

<!-- #### Development Milestones
* [x] Get Data
  * [x] SED (May 2023)
  * [x] Random sampled & annotated DAiSEE (Feb 2024)
  * [x] Choose hard samples from SED
* [x] EDA
  * [x] SED (May 2023)
  * [x] DAiSEE (Feb 2024)
* [x] Preprocessing (May 2023)
  * [x] SED (May 2023)
  * [x] DAiSEE (Feb 2024)
* [x] Modeling
  * [x] ViperGPT (March 2023)
    * Does not work for abstract, arbitrary datasets like SED
  * [x] VisualChatGPT (April 2023)
    * Resource limitation for research
  * [x] VILT (April 2023)
    * Proof of concept
  * [x] MiniGPT4 (July 2023)
    * Lightweight, resource-efficient hybrid VQA model
    * Only need to tune a single linear layer
  * [x] Minigpt4 + DPO
  * [x] Minigpt4 + HADPO
* [x] Evaluation
  * [x] SED balanced set ![Screenshot 2024-04-01 150853](https://hackmd.io/_uploads/By6uLFdkR.png)
    * [x] Minigpt4 + DPO
    * [x] Minigpt4 + HADPO
    * [x] Minigpt4 SED prompt finetuning
    * [x] Minigpt4 original prompt finetuning
    * [x] Xception
    * [x] MobileNets V3
    * [x] VGG16
  * [x] SED imbalanced set ![Screenshot 2024-04-01 152101](https://hackmd.io/_uploads/SkB4YY_1C.png)
    * [x] Minigpt4 + DPO
    * [x] Minigpt4 + HADPO
    * [x] Minigpt4 SED prompt finetuning
    * [x] Minigpt4 original prompt finetuning
    * [x] Xception
    * [x] MobileNets V3
    * [x] VGG16
  * [x] Out of distribution ![Screenshot 2024-04-01 152657](https://hackmd.io/_uploads/rkoi5FdkA.png)
    * [x] Minigpt4 + DPO
    * [x] Minigpt4 + HADPO
    * [x] Minigpt4 SED prompt finetuning
    * [x] Minigpt4 original prompt finetuning
    * [x] Xception
    * [x] MobileNets V3
    * [x] VGG16
  * [x] Hard samples ![Screenshot 2024-04-01 152811](https://hackmd.io/_uploads/HJKeoFO1A.png)
    * [x] Minigpt4 + DPO
    * [x] Minigpt4 + HADPO
    * [x] Minigpt4 SED prompt finetuning
    * [x] Minigpt4 original prompt finetuning
    * [x] Xception
    * [x] MobileNets V3
    * [x] VGG16
* [ ] Reporting -->

<!-- #### Miscellaneous tasks
* 3/22/2024 - 3/26/24
  * Set baselines with the POS tag ```<img></img>``` as opposed to ```<img><img>``` according to the Vicuna[4] template
  * Revise paper -->
<!-- ## References
* [1] Touvron, Hugo, et al. "LLaMA: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
* [2] Thoppilan, Romal, et al. "LaMDA: Language models for dialog applications." arXiv preprint arXiv:2201.08239 (2022).
* [3] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." Advances in Neural Information Processing Systems 36 (2024).
* [4] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. "Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality." March 2023. URL https://vicuna.lmsys.org
* [5] Casper, Stephen, et al. "Open problems and fundamental limitations of reinforcement learning from human feedback." arXiv preprint arXiv:2307.15217 (2023).
* [6] Höglund, S., & Khedri, J. (2023). "Comparison between RLHF and RLAIF in fine-tuning a large language model" (Dissertation). Retrieved from https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-331926
* [7] Kirk, Robert, et al. "Understanding the effects of RLHF on LLM generalisation and diversity." arXiv preprint arXiv:2310.06452 (2023).
* [8] Chen, Hailin, et al. "ChatGPT's one-year anniversary: Are open-source large language models catching up?" arXiv preprint arXiv:2311.16989 (2023).
* [9] Zhu, Deyao, et al. "MiniGPT-4: Enhancing vision-language understanding with advanced large language models." arXiv preprint arXiv:2304.10592 (2023).
* [10] Li, Shengzhi, Rongyu Lin, and Shichao Pei. "Multi-modal preference alignment remedies regression of visual instruction tuning on language model." arXiv preprint arXiv:2402.10884 (2024).
* [11] Zhao, Zhiyuan, et al. "Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization." arXiv preprint arXiv:2311.16839 (2023).
* [12] Li, Lei, et al. "Silkie: Preference distillation for large visual language models." arXiv preprint arXiv:2312.10665 (2023). -->

<!-- ## Appendix

---

### RLHF vs RLAIF
[This](https://klu.ai/glossary/rlaif) article highlights the pros and cons of RLHF vs RLAIF and some common limitations for small and private entities with limited resources, while [8] gives a comprehensive overview of the costs and benefits and of the current landscape of LLM development.

---

### The few studies on Visual Question Answering preference alignment optimization available
[10][11][12] have been published recently. All three methods differ from our work in that they do not self-generate preferred/rejected response pairs or adjust the log probability ratios of preferred/rejected response pairs encoded by the reference and SFT models. However, future work may look to incorporate concepts from these works. -->

## Conclusion

My experimental results with DPO preference tuning methods suggest that the out-of-distribution problem common in statistics and machine learning can be formulated as a simple preference selection problem. Both HA-DPO and HA-KTO consistently outperform the other baselines on the SED dataset and also perform well on the DAiSEE out-of-distribution dataset at a batch size of 1. It remains to be seen how well DPO generalizes at larger quantization scales, model sizes, and compute budgets.
## Future Work

Future work may investigate methodologies to guarantee the output sentence structure of the LVLM using Context-Free Grammars as discrete activation functions on the final layers of the language model. This would act as an embedded token parsing mechanism for the model to learn, guiding the learnable gradient space of the LVLM. Additional future work would involve experiments with further derivatives of DPO, including the latest ORPO, which extends DPO by no longer needing the reference SFT model at all. However, the questions of out-of-distribution generalizability, reduced reliance on human annotation, and compute costs would remain research topics for these cutting-edge techniques.
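As a toy illustration of the grammar-constrained decoding idea mentioned above (not the proposed design, which would integrate the constraint into training), the sketch below masks the logits of tokens that a grammar state does not permit, so the chosen token always follows the grammar; the vocabulary and allowed ids are made up for the example.

```python
import torch

def constrained_step(logits: torch.Tensor, allowed_token_ids: list[int]) -> int:
    """Pick the next token only among ids the grammar currently allows."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0          # allowed positions keep their original logits
    return int((logits + mask).argmax())   # disallowed positions become -inf and are never chosen

# toy vocabulary of 10 tokens; suppose the grammar only allows ids 2, 5, and 7 at this step
next_id = constrained_step(torch.randn(10), [2, 5, 7])
print(next_id)
```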
