## The Internal State of an LLM Knows When It’s Lying
<br>
Group 10
<br>
<br>
<small>Presenters: 鄧博元、張信富、張巧柔、趙啟翔</small>

----

## Table of Contents

<div style="font-size:xx-large">

* Introduction : <small>Overview of LLMs and the motivation for truthfulness detection.</small>
* Related Work : <small>Key studies and advancements in LLM accuracy and misinformation.</small>
* The True-False Dataset : <small>Description of data sources and preprocessing steps.</small>
* Experiment : <small>Methodology and classifier training.</small>
* Results : <small>Analysis of classifier performance and findings.</small>
* Conclusion : <small>Summary of insights, implications, and future directions.</small>

</div>

---

## Introduction

----

### Background

* LLMs' Remarkable Success : <small>LLMs have demonstrated impressive abilities across a range of tasks.</small>
* Key Challenge : <small>LLMs may generate **inaccurate or false information** with high confidence, leading to potential misinformation.</small>

----

* Possible Causes:

| Reason | Consequence |
| -------- | -------- |
| <small>Word-by-Word Generation</small> | <small>Each word relies on prior words, so early errors can propagate through the sentence.</small> |
| <small>Non-Maximal Probability Sampling</small> | <small>Increases the chance of selecting incorrect words, resulting in false information.</small> |
<!-- | <small>More Correct Options Than Incorrect Ones</small> | <small>If there’s only one incorrect completion, it may have a higher probability than any single correct one, leading to selecting the incorrect option.</small> | -->

----

* An example of how generating words one at a time and committing to them may result in inaccurate information:

![image](https://hackmd.io/_uploads/SyJduQJGyl.png)

----

### Crucial Assumption & Hypothesis

* Assumption <small>The LLM must have some internal notion of whether a sentence is true or false, since this information is required for generating subsequent tokens.</small>
<br>
* Hypothesis <small>The truth or falsehood of a statement should therefore be represented in the LLM’s internal state.</small>

----

### Goal
<br>

* Determine whether LLMs' internal states contain information that indicates the truthfulness of generated statements.

---

## Related Work

----

### Judgement of Hallucination

| Paper | Judgement |
| -------- | -------- |
| <small>Dale et al. (2022)</small> | <small>Translations that are detached from the source are considered hallucinations.</small> |
| <small>Pagnoni et al. (2021)</small> | <small>Information generated in text summarization that is unrelated to the input text is considered a hallucination, even if the information is factually correct.</small> |
| <small>Peng et al. (2023)</small> | <small>False information or hallucinations are detected based on discrepancies between responses to multiple queries.</small> |
| <small>Burns et al. (2022) - Contrast-Consistent Search (CCS)</small> | <small>Hallucination is judged by rephrasing a statement as a question and evaluating the LLM’s responses to different prompts.</small> |
| <small>This paper</small> | <small>Determines whether LLMs' internal states contain information that indicates the truthfulness of generated statements.</small> |

---

## The True-False Dataset

<!-- Then, I will continue the presentation. -->

----

### Topics and Sources

<div style="font-size:xx-large">

| Topic | Source |
| ----------------- | ---------------------------- |
| Cities | simplemaps |
| Inventions | Wikipedia Inventor List |
| Chemical Elements | PubChem (NLM website) |
| Animals | National Geographic Kids |
| Companies | Forbes Global 2000 List (2022) |
| Scientific Facts | *Generated by ChatGPT* |

</div>

<!--
The authors' work requires a labeled dataset of true and false statements, but no suitable one existed.
Therefore, they built a dataset spanning a wide variety of topics, with most of the data drawn from reliable sources.
As you can see, there are 6 topics: Cities, Inventions, Chemical Elements, Animals, Companies, and Scientific Facts.
For the first 5, the data come from simplemaps, Wikipedia, NLM (the U.S. National Library of Medicine), National Geographic, and Forbes.
For scientific facts, the authors generated the data with ChatGPT, which we will describe shortly.
-->

----

### Extraction (first 5 topics)
<!-- .slide: style="text-align: left" -->

For example (Chemical Elements):

<div style="font-size:xx-large">

| Name | Atomic # | Symbol | ... | Unique Prop. |
| -------- | --- | --- | --- | ----------------------------------------- |
| Hydrogen | 1 | H | ... | the most abundant element in the universe |

</div>
<br/>

1. "The atomic number of Hydrogen is <font color=green>1</font>."
2. "The atomic number of Hydrogen is <font color=red>34</font>." &nbsp;&nbsp;<small>(value randomly chosen from another row)</small>

<small>A sketch of this procedure follows on the next slide.</small>

<!--
For the first 5 topics, a table is built by extracting information from the different sources.
Taking chemical elements as an example, Hydrogen enters the table with its atomic number, symbol, some other properties, and one unique property.
For each element, the authors take its name and one random property to build a TRUE sentence, then replace the correct value with the same property's value from another row to form a FALSE sentence.
-->
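----

### Pair Generation (sketch)

A minimal sketch of the extraction step described above, assuming the table is stored as a list of dicts; the field names and `random` choices are illustrative, not the authors' code.

```python
import random

# One table row per element; only a few fields shown for brevity.
ELEMENTS = [
    {"name": "Hydrogen", "atomic number": "1",  "symbol": "H"},
    {"name": "Selenium", "atomic number": "34", "symbol": "Se"},
    {"name": "Iron",     "atomic number": "26", "symbol": "Fe"},
]

def make_pair(rows, row, field):
    """Build one TRUE sentence from `row` and one FALSE sentence by
    swapping in the same field's value taken from a different row."""
    true_sent = f"The {field} of {row['name']} is {row[field]}."
    other = random.choice([r for r in rows if r[field] != row[field]])
    false_sent = f"The {field} of {row['name']} is {other[field]}."
    return (true_sent, 1), (false_sent, 0)

row = ELEMENTS[0]
field = random.choice(["atomic number", "symbol"])  # one property per element
print(make_pair(ELEMENTS, row, field))
```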
----

### Scientific Facts

1. Ask ChatGPT (Feb 13 version) to provide:<br/>
   “scientific facts that are well known to humans”<br/>
   → “The sky is often <font color="green">cloudy</font> when it’s going to rain”
2. Ask ChatGPT to provide the opposite statement:<br/>
   → “The sky is often <font color="red">clear</font> when it’s going to rain”

<!--
For the scientific facts topic, the authors use ChatGPT to provide scientific facts that are well known to humans.
Here is an example: "The sky is often *cloudy* when it's going to rain."
Then ChatGPT is asked to generate the opposite statement: "The sky is often *clear* when it's going to rain."
-->

---

## SAPLMA

++S++tatement ++A++ccuracy ++P++rediction, based on ++L++anguage ++M++odel ++A++ctivations

<!-- SAPLMA: Statement Accuracy Prediction, based on Language Model Activations -->

----

### SAPLMA

- determines the **truthfulness** of LLM-generated statements
- focuses on the LLM's hidden layers
- uses **Facebook OPT-6.7b** and **LLAMA2-7b**
- both have 32 layers

<!--
The authors designed this method to check the truthfulness of LLM-generated statements.
It focuses on the hidden layers of the LLM.
In the experiments, two LLMs are used, Facebook OPT-6.7b and LLAMA2-7b, and both contain 32 hidden layers.
-->
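----

### Getting the Activations (sketch)

A minimal sketch of how the per-statement activation vector could be extracted, assuming HuggingFace `transformers` and taking the hidden state of the statement's last token at a chosen layer as the feature. The layer index is a free choice (explored on the next slides); this is an illustration, not the authors' code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def statement_activation(statement: str, layer: int = 20) -> torch.Tensor:
    """Hidden state of the statement's last token at the given layer (4096-dim)."""
    inputs = tokenizer(statement, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors, each (1, seq_len, 4096)
    return out.hidden_states[layer][0, -1]

vec = statement_activation("The atomic number of Hydrogen is 1.")
print(vec.shape)  # torch.Size([4096])
```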
----

### Hidden Layers

<div style="font-size:x-large">

| Layer | Role |
|-|-|
| 32nd layer | focused on generating the next token |
| 28th layer | |
| 24th layer | |
| 20th layer | |
| 16th layer | closer to the input <br/>-> focused on extracting lower-level information from the input |
|||

</div>

5 variations using activations from different layers

<small>Note: all layers contain 4096 units</small>

<!--
A general hypothesis is that the values in the hidden layers must contain information about whether the LLM "believes" a statement is true or false.
The authors therefore test 5 variations, each using a different hidden layer as input, from the 32nd layer (the last) down to the 16th layer (the middle one), in steps of 4 layers.
They assume the last layer should contain such information, since it handles generation of the next token, while layers closer to the input focus more on extracting lower-level information.
That is why the experiment is split into these variations.
-->

----

### Model Structure

<br/>

<div style="background-color:white">

```flow
input=>start: Input (4096)
l1=>operation: Linear (4096 -> 256)
l2=>operation: Linear (256 -> 128)
l3=>operation: Linear (128 -> 64)
sig=>end: Sigmoid output (64 -> 1)
input->l1->l2->l3->sig
```

</div>

\# all linear layers are followed by ReLU activations

<!--
The model structure is quite simple.
The input is a 4096-unit vector, which passes through 3 linear layers that reduce the dimension from 4096 to 256, then 128, then 64, followed by a single sigmoid output.
Each linear layer is followed by a ReLU activation.
-->

----

### Training

- Adam optimizer, without fine-tuning hyperparameters
- 5 epochs
- for each classifier (one per topic)
    - trained 3 times with different random initial weights
    - reported accuracy is the mean over these 3 runs

<small>A sketch of this probe and its training loop follows.</small>

<!--
The training details are also simple.
The authors use Adam without tuning any hyperparameters, and train for 5 epochs.
Each classifier is trained 3 times with different randomly initialized weights, and the reported accuracy is the mean over these 3 runs.
-->
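----

### Probe Sketch (not the authors' code)

A minimal PyTorch sketch matching the structure and training recipe above: a 4096 → 256 → 128 → 64 → 1 feed-forward probe with ReLU activations and a sigmoid output, trained with Adam for 5 epochs. The batch size, learning rate, and BCE loss are assumptions; the slides only state that no hyperparameters were tuned.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class SAPLMAProbe(nn.Module):
    def __init__(self, input_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        # Probability that the input statement is true.
        return self.net(x).squeeze(-1)

def train_probe(activations: torch.Tensor, labels: torch.Tensor) -> SAPLMAProbe:
    """activations: (N, 4096) hidden states; labels: (N,) with 1 = true, 0 = false."""
    probe = SAPLMAProbe()
    optimizer = torch.optim.Adam(probe.parameters())  # default settings, no tuning
    loss_fn = nn.BCELoss()
    loader = DataLoader(TensorDataset(activations, labels.float()),
                        batch_size=32, shuffle=True)
    for _ in range(5):                                # 5 epochs
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(probe(x), y)
            loss.backward()
            optimizer.step()
    return probe
```

Training each topic's classifier on the remaining topics (the cross-topic setup noted in the Discussion) and averaging 3 runs with different seeds simply wraps `train_probe` in two small loops.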
---

## Results

----

### Three Baselines

* BERT
* a few-shot learner using OPT-6.7b
* a statement ‘X’

----

### 1. SAPLMA significantly outperforms other methods

----

#### Accuracy of all the models tested, for each of the topics

![Screenshot from 2024-11-11 19-55-08](https://hackmd.io/_uploads/rkN2e_kMJe.png =80%x)

----

#### Bar chart comparing the accuracy

![Screenshot from 2024-11-11 20-38-16](https://hackmd.io/_uploads/Hk369u1Gyl.png =80%x)

----

### 2. Accuracy is even higher when using the LLAMA2-7b model

----

#### Accuracy classifying truthfulness of externally generated sentences using SAPLMA with LLAMA2-7b

![Screenshot from 2024-11-11 21-31-48](https://hackmd.io/_uploads/HyEB_F1zJe.png =80%x)

----

### Differences between the topics

- SAPLMA achieves high accuracy for the “<font color=green>cities</font>” and “<font color=green>companies</font>” topics,
- but much lower accuracy when tested on the “<font color=red>animals</font>” and “<font color=red>elements</font>” topics.

----

### 3. Dataset of statements generated by the LLM itself (the OPT-6.7b model)

----

#### Accuracy classifying truthfulness of sentences generated by the LLM (OPT-6.7b) itself

![image](https://hackmd.io/_uploads/HyuNptyzkl.png =80%x)

----

### 4. Estimating the optimal threshold from a held-out validation data set

----

#### Accuracy classifying truthfulness of sentences generated by the LLM itself

![image](https://hackmd.io/_uploads/Hk3s3FyGye.png =80%x)

----

### Observation

- The **20th layer** no longer performs best for statements generated by the LLM itself; the **28th layer** seems to perform best.

---

## Discussion

----

1. Different topics for training and testing <small>They do not consider models that were trained or fine-tuned on data from the same topic as the test set.</small>
2. Sentence length and word frequency affect the probability an LLM assigns to a sentence, so raw sentence probability is a poor indicator of truthfulness.

----

#### SAPLMA’s values are much better aligned with the truth value

![image](https://hackmd.io/_uploads/B1m-xc1G1e.png =80%x)

----

![image](https://hackmd.io/_uploads/Hy3eWiJGyg.png)

---

## Conclusion

----

### Conclusion

<small style="text-align: left">

* Problem addressed: inaccurate and false information generated by LLMs.
* SAPLMA is proposed to address this problem and significantly outperforms few-shot prompting.
* Performance: SAPLMA versus a maximum of 56% accuracy for few-shot prompting.

| Few-shot Prompting | SAPLMA on OPT-6.7b | SAPLMA on LLAMA2-7b |
| -------- | -------- | -------- |
| 56% | 60% ~ 80% | 70% ~ 90% |

* Provides a true-false dataset and a methodology for generating such datasets.

</small>
{"description":"123","title":"The Internal State of an LLM Knows When It’s Lying","contributors":"[{\"id\":\"33a971fe-c7db-4435-8007-a708ae593e01\",\"add\":6411,\"del\":1045},{\"id\":\"5b0dee87-0af0-479a-b018-6417238c8cc2\",\"add\":2706,\"del\":1429},{\"id\":\"23aa18ab-a368-4cd1-8a5b-0422cc5d1c03\",\"add\":2555,\"del\":220},{\"id\":\"cf98e8ad-62b4-4fb9-9628-7a456fc18fce\",\"add\":5025,\"del\":2168}]"}