## The Internal State of an LLM Knows When It’s Lying
<br>
Group 10
<br>
<br>
<small>Presenters: 鄧博元、張信富、張巧柔、趙啟翔</small>

----

## Table of Contents

<div style="font-size:xx-large">

* Introduction : <small>Overview of LLMs and the motivation for truthfulness detection.</small>
* Related Work : <small>Key studies and advancements in LLM accuracy and misinformation.</small>
* The True-False Dataset : <small>Description of data sources and preprocessing steps.</small>
* Experiment : <small>Methodology and classifier training.</small>
* Results : <small>Analysis of classifier performance and findings.</small>
* Conclusion : <small>Summary of insights, implications, and future directions.</small>

</div>

---

## Introduction

----

### Background

* LLMs' Remarkable Success : <small>LLMs have demonstrated impressive abilities across a range of tasks.</small>
* Key Challenge : <small>LLMs may generate **inaccurate or false information** with high confidence, leading to potential misinformation.</small>

----

* Possible Causes:

| Reason | Consequence |
| -------- | -------- |
| <small>Word-by-Word Generation</small> | <small>Each word relies on prior words, so early errors can propagate through the sentence.</small> |
| <small>Non-Maximal Probability Sampling</small> | <small>Increases the chance of selecting incorrect words, resulting in false information.</small> |
<!-- | <small>More Correct Options Than Incorrect Ones</small> | <small>If there’s only one incorrect completion, it may have a higher probability than any single correct one, leading to selecting the incorrect option.</small> | -->

----

* An example of how generating words one at a time and committing to them may result in inaccurate information:

![image](https://hackmd.io/_uploads/SyJduQJGyl.png)

----

### Crucial Assumption & Hypothesis

* Assumption <small>The LLM must have some internal notion of whether a sentence is true or false, since this information is required for generating subsequent tokens.</small>
<br>
* Hypothesis <small>The truth or falsehood of a statement should therefore be represented in the LLM’s internal state.</small>

----

### Goal
<br>

* Determine whether LLMs' internal states contain information that indicates the truthfulness of generated statements.

---

## Related Work

----

### Judgement of Hallucination

| Paper | Judgement |
| -------- | -------- |
| <small>Dale et al. (2022)</small> | <small>Translations that are detached from the source are considered hallucinations.</small> |
| <small>Pagnoni et al. (2021)</small> | <small>Information generated in text summarization that is unrelated to the input text is considered a hallucination, even if the information is factually correct.</small> |
| <small>Peng et al. (2023)</small> | <small>False information or hallucinations are detected based on discrepancies between responses to multiple queries.</small> |
| <small>Burns et al. (2022) - Contrast-Consistent Search (CCS)</small> | <small>Hallucination is judged by rephrasing a statement as a question and evaluating the LLM’s responses to different prompts.</small> |
| <small>This paper</small> | <small>Determines whether LLMs' internal states contain information that indicates the truthfulness of generated statements.</small> |

---

## The True-False Dataset

<!-- Then, I will continue the presentation. -->

----

### Topics and Sources

<div style="font-size:xx-large">

| Topic | Source |
| ----------------- | ---------------------------- |
| Cities | simplemaps |
| Inventions | Wikipedia Inventor List |
| Chemical Elements | PubChem (NLM website) |
| Animals | National Geographic Kids |
| Companies | Forbes Global 2000 List (2022) |
| Scientific Facts | *Generated by ChatGPT* |

</div>

<!--
The authors' work requires a labeled dataset of true and false statements, but no suitable one existed.
Therefore, they built a dataset spanning a wide variety of topics, with most of the data drawn from reliable sources.
As you can see, there are 6 topics: Cities, Inventions, Chemical Elements, Animals, Companies, and Scientific Facts.
For the first 5, the data come from simplemaps, Wikipedia, NLM (the U.S. National Library of Medicine), National Geographic, and Forbes.
For scientific facts, the authors generated the data with ChatGPT, which we will describe shortly.
-->

----

### Extraction (first 5 topics)
<!-- .slide: style="text-align: left" -->

For example (Chemical Elements):

<div style="font-size:xx-large">

| Name | Atomic # | Symbol | ... | Unique Prop. |
| -------- | --- | --- | --- | ----------------------------------------- |
| Hydrogen | 1 | H | ... | the most abundant element in the universe |

</div>
<br/>

1. "The atomic number of Hydrogen is <font color=green>1</font>."
2. "The atomic number of Hydrogen is <font color=red>34</font>." &nbsp;&nbsp;<small>(value randomly chosen from another row)</small>

<small>A sketch of this procedure follows on the next slide.</small>

<!--
For the first 5 topics, a table is built by extracting information from the different sources.
Taking chemical elements as an example, Hydrogen enters the table with its atomic number, symbol, some other properties, and one unique property.
For each element, the authors take its name and one random property to build a TRUE sentence, then replace the correct value with the same property's value from another row to form a FALSE sentence.
-->
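----

### Pair Generation (sketch)

A minimal sketch of the extraction step described above, assuming the table is stored as a list of dicts; the field names and `random` choices are illustrative, not the authors' code.

```python
import random

# One table row per element; only a few fields shown for brevity.
ELEMENTS = [
    {"name": "Hydrogen", "atomic number": "1",  "symbol": "H"},
    {"name": "Selenium", "atomic number": "34", "symbol": "Se"},
    {"name": "Iron",     "atomic number": "26", "symbol": "Fe"},
]

def make_pair(rows, row, field):
    """Build one TRUE sentence from `row` and one FALSE sentence by
    swapping in the same field's value taken from a different row."""
    true_sent = f"The {field} of {row['name']} is {row[field]}."
    other = random.choice([r for r in rows if r[field] != row[field]])
    false_sent = f"The {field} of {row['name']} is {other[field]}."
    return (true_sent, 1), (false_sent, 0)

row = ELEMENTS[0]
field = random.choice(["atomic number", "symbol"])  # one property per element
print(make_pair(ELEMENTS, row, field))
```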
----

### Scientific Facts

1. Ask ChatGPT (Feb 13 version) to provide:<br/>
   “scientific facts that are well known to humans”<br/>
   → “The sky is often <font color="green">cloudy</font> when it’s going to rain”
2. Ask ChatGPT to provide the opposite statement:<br/>
   → “The sky is often <font color="red">clear</font> when it’s going to rain”

<!--
For the scientific facts topic, the authors use ChatGPT to provide scientific facts that are well known to humans.
Here is an example: "The sky is often *cloudy* when it's going to rain."
Then ChatGPT is asked to generate the opposite statement: "The sky is often *clear* when it's going to rain."
-->

---

## SAPLMA

++S++tatement ++A++ccuracy ++P++rediction, based on ++L++anguage ++M++odel ++A++ctivations

<!-- SAPLMA: Statement Accuracy Prediction, based on Language Model Activations -->

----

### SAPLMA

- determines the **truthfulness** of LLM-generated statements
- focuses on the LLM's hidden layers
- uses **Facebook OPT-6.7b** and **LLAMA2-7b**
- both have 32 layers

<!--
The authors designed this method to check the truthfulness of LLM-generated statements.
It focuses on the hidden layers of the LLM.
In the experiments, two LLMs are used, Facebook OPT-6.7b and LLAMA2-7b, and both contain 32 hidden layers.
-->
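----

### Getting the Activations (sketch)

A minimal sketch of how the per-statement activation vector could be extracted, assuming HuggingFace `transformers` and taking the hidden state of the statement's last token at a chosen layer as the feature. The layer index is a free choice (explored on the next slides); this is an illustration, not the authors' code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def statement_activation(statement: str, layer: int = 20) -> torch.Tensor:
    """Hidden state of the statement's last token at the given layer (4096-dim)."""
    inputs = tokenizer(statement, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors, each (1, seq_len, 4096)
    return out.hidden_states[layer][0, -1]

vec = statement_activation("The atomic number of Hydrogen is 1.")
print(vec.shape)  # torch.Size([4096])
```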
----

### Hidden Layers

<div style="font-size:x-large">

| Layer | Role |
|-|-|
| 32nd layer | focused on generating the next token |
| 28th layer | |
| 24th layer | |
| 20th layer | |
| 16th layer | closer to the input <br/>-> focused on extracting lower-level information from the input |
|||

</div>

5 variations using activations from different layers

<small>Note: all layers contain 4096 units</small>

<!--
A general hypothesis is that the values in the hidden layers must contain information about whether the LLM "believes" a statement is true or false.
The authors therefore test 5 variations, each using a different hidden layer as input, from the 32nd layer (the last) down to the 16th layer (the middle one), in steps of 4 layers.
They assume the last layer should contain such information, since it handles generation of the next token, while layers closer to the input focus more on extracting lower-level information.
That is why the experiment is split into these variations.
-->

----

### Model Structure

<br/>

<div style="background-color:white">

```flow
input=>start: Input (4096)
l1=>operation: Linear (4096 -> 256)
l2=>operation: Linear (256 -> 128)
l3=>operation: Linear (128 -> 64)
sig=>end: Sigmoid output (64 -> 1)
input->l1->l2->l3->sig
```

</div>

\# all linear layers are followed by ReLU activations

<!--
The model structure is quite simple.
The input is a 4096-unit vector, which passes through 3 linear layers that reduce the dimension from 4096 to 256, then 128, then 64, followed by a single sigmoid output.
Each linear layer is followed by a ReLU activation.
-->

----

### Training

- Adam optimizer, without fine-tuning hyperparameters
- 5 epochs
- for each classifier (one per topic)
    - trained 3 times with different random initial weights
    - reported accuracy is the mean over these 3 runs

<small>A sketch of this probe and its training loop follows.</small>

<!--
The training details are also simple.
The authors use Adam without tuning any hyperparameters, and train for 5 epochs.
Each classifier is trained 3 times with different randomly initialized weights, and the reported accuracy is the mean over these 3 runs.
-->
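----

### Probe Sketch (not the authors' code)

A minimal PyTorch sketch matching the structure and training recipe above: a 4096 → 256 → 128 → 64 → 1 feed-forward probe with ReLU activations and a sigmoid output, trained with Adam for 5 epochs. The batch size, learning rate, and BCE loss are assumptions; the slides only state that no hyperparameters were tuned.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class SAPLMAProbe(nn.Module):
    def __init__(self, input_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        # Probability that the input statement is true.
        return self.net(x).squeeze(-1)

def train_probe(activations: torch.Tensor, labels: torch.Tensor) -> SAPLMAProbe:
    """activations: (N, 4096) hidden states; labels: (N,) with 1 = true, 0 = false."""
    probe = SAPLMAProbe()
    optimizer = torch.optim.Adam(probe.parameters())  # default settings, no tuning
    loss_fn = nn.BCELoss()
    loader = DataLoader(TensorDataset(activations, labels.float()),
                        batch_size=32, shuffle=True)
    for _ in range(5):                                # 5 epochs
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(probe(x), y)
            loss.backward()
            optimizer.step()
    return probe
```

Training each topic's classifier on the remaining topics (the cross-topic setup noted in the Discussion) and averaging 3 runs with different seeds simply wraps `train_probe` in two small loops.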
---

## Results

----

### Three Baselines

* BERT
* a few-shot learner using OPT-6.7b
* a statement ‘X’

----

### 1. SAPLMA significantly outperforms other methods

----

#### Accuracy of all the models tested, for each of the topics

![Screenshot from 2024-11-11 19-55-08](https://hackmd.io/_uploads/rkN2e_kMJe.png =80%x)

----

#### Bar chart comparing the accuracy

![Screenshot from 2024-11-11 20-38-16](https://hackmd.io/_uploads/Hk369u1Gyl.png =80%x)

----

### 2. Accuracy is even higher when using the LLAMA2-7b model

----

#### Accuracy classifying truthfulness of externally generated sentences using SAPLMA with LLAMA2-7b

![Screenshot from 2024-11-11 21-31-48](https://hackmd.io/_uploads/HyEB_F1zJe.png =80%x)

----

### Differences between the topics

- SAPLMA achieves high accuracy for the “<font color=green>cities</font>” and “<font color=green>companies</font>” topics,
- but much lower accuracy when tested on the “<font color=red>animals</font>” and “<font color=red>elements</font>” topics.

----

### 3. Dataset of statements generated by the LLM itself (the OPT-6.7b model)

----

#### Accuracy classifying truthfulness of sentences generated by the LLM (OPT-6.7b) itself

![image](https://hackmd.io/_uploads/HyuNptyzkl.png =80%x)

----

### 4. Estimating the optimal threshold from a held-out validation data set

----

#### Accuracy classifying truthfulness of sentences generated by the LLM itself

![image](https://hackmd.io/_uploads/Hk3s3FyGye.png =80%x)

----

### Observation

- The **20th layer** no longer performs best for statements generated by the LLM itself; the **28th layer** seems to perform best.

---

## Discussion

----

1. Different topics for training and testing <small>They do not consider models that were trained or fine-tuned on data from the same topic as the test set.</small>
2. Sentence length and word frequency affect the probability an LLM assigns to a sentence, so raw sentence probability is a poor indicator of truthfulness.

----

#### SAPLMA’s values are much better aligned with the truth value

![image](https://hackmd.io/_uploads/B1m-xc1G1e.png =80%x)

----

![image](https://hackmd.io/_uploads/Hy3eWiJGyg.png)

---

## Conclusion

----

### Conclusion

<small style="text-align: left">

* Problem addressed: inaccurate and false information generated by LLMs.
* SAPLMA is proposed to address this problem and significantly outperforms few-shot prompting.
* Performance: SAPLMA versus a maximum of 56% accuracy for few-shot prompting.

| Few-shot Prompting | SAPLMA on OPT-6.7b | SAPLMA on LLAMA2-7b |
| -------- | -------- | -------- |
| 56% | 60% ~ 80% | 70% ~ 90% |

* Provides a true-false dataset and a methodology for generating such datasets.

</small>
{"description":"123","title":"The Internal State of an LLM Knows When It’s Lying","contributors":"[{\"id\":\"33a971fe-c7db-4435-8007-a708ae593e01\",\"add\":6411,\"del\":1045},{\"id\":\"5b0dee87-0af0-479a-b018-6417238c8cc2\",\"add\":2706,\"del\":1429},{\"id\":\"23aa18ab-a368-4cd1-8a5b-0422cc5d1c03\",\"add\":2555,\"del\":220},{\"id\":\"cf98e8ad-62b4-4fb9-9628-7a456fc18fce\",\"add\":5025,\"del\":2168}]"}