# DISCOVERING LATENT KNOWLEDGE IN LANGUAGE MODELS WITHOUT SUPERVISION
###### tags: `progress`
ICLR 2023
## Introduction
- A model does not always produce truthful output.
- For example, if we train a model to imitate human-generated text, it may learn to output common misconceptions.
- The author attributes this to the misalignment between the training objective and the truth.
- As models are applied to more complex domains, human supervision may become less effective at mitigating this misalignment.
- Proposed method: answer questions in a purely unsupervised way by searching for implicit, internal “beliefs” or “knowledge” learned by the model.
- Recall the discussion last time
  - 大山老師's framing of the topic: subjective vs. objective correctness. Subjective correctness means that, within its own world, the model can honestly answer what is correct in that world; objective correctness means retrieving the answer from outside the model's world, e.g., with RAG. If the model's world is altered, e.g., all 7s and 8s in the training data are swapped, then the model should answer 7 * 7 = 64; that is subjectively correct. If only half of the 7s and 8s in the training data are swapped, the model should answer 7 * 7 = 64 with probability one half and 49 with probability one half, detecting that its world is inconsistent. If such a problem is too hard, it can be replaced with a simpler one, or split into a hard part and an easy part, starting with the easy part.
## PROBLEM STATEMENT AND FRAMEWORK
### PROBLEM: DISCOVERING LATENT KNOWLEDGE
- We want methods that do not rely on the model **generating correct outputs** and that do not rely on **external supervision**.
- Instead, we turn to the model’s **unlabeled hidden representations**.
- Let $\phi\left(x\right) \in \mathbb{R}^d$ denote a feature representation of a natural language input $x$. Our goal is to answer the yes/no questions $q_{1}, \dots, q_{n}$ given access only to $\phi$.
### METHOD: CONTRAST-CONSISTENT SEARCH
- To make progress on the goal described above, we exploit the fact that truth has special structure: it satisfies consistency properties that few other features in a language model are likely to satisfy.
- CCS works by (1) constructing a contrast pair $\left(x_i^{+}, x_i^{-}\right)$ for each question $q_i$ by answering it both “Yes” and “No”, (2) extracting and normalizing the hidden representations of both statements, and (3) learning a probe whose outputs on each pair are consistent and confident.

#### Feature extraction and normalization.
- Here $\mu^{+}$ and $\mu^{-}$ are the means of $\{\phi(x_i^{+})\}_{i=1}^{n}$ and $\{\phi(x_i^{-})\}_{i=1}^{n}$; subtracting them removes the superficial difference that every $x_i^{+}$ ends with “Yes” and every $x_i^{-}$ ends with “No”.
$$
\tilde{\phi}\left(x_i^{+}\right):=\phi\left(x_i^{+}\right)-\mu^{+} ; \quad \tilde{\phi}\left(x_i^{-}\right):=\phi\left(x_i^{-}\right)-\mu^{-}
$$
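A minimal sketch of this normalization step, assuming the hidden states of all contrast pairs have already been stacked into tensors `phi_pos` and `phi_neg` of shape `(n, d)` (the function and variable names are illustrative, not from the paper):

```python
import torch

def normalize(phi_pos: torch.Tensor, phi_neg: torch.Tensor):
    """Subtract the per-class mean (mu^+ / mu^-) so a probe cannot rely on
    the superficial "...Yes" vs. "...No" difference between the two halves.
    phi_pos, phi_neg: hidden states for x_i^+ and x_i^-, each of shape (n, d)."""
    mu_pos = phi_pos.mean(dim=0, keepdim=True)   # mu^+
    mu_neg = phi_neg.mean(dim=0, keepdim=True)   # mu^-
    return phi_pos - mu_pos, phi_neg - mu_neg
```

The paper also divides by the per-class standard deviation; only the mean subtraction shown in the equation above is reproduced here.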
#### Mapping activations to probabilities
$$p_{\theta, b}(\tilde{\phi})=\sigma\left(\theta^T \tilde{\phi}+b\right)$$
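A corresponding probe in PyTorch, a sketch of the linear map plus sigmoid above (the class name is mine, not the authors'):

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """p_{theta,b}(phi~) = sigmoid(theta^T phi~ + b): a single linear unit."""
    def __init__(self, d: int):
        super().__init__()
        self.linear = nn.Linear(d, 1)   # weight = theta, bias = b

    def forward(self, phi: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(phi)).squeeze(-1)
```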
#### Training objective
- We use the fact that a statement and its negation should have probabilities that add up to 1.
$$
L_{\text {consistency }}\left(\theta, b ; q_i\right):=\left[p_{\theta, b}\left(x_i^{+}\right)-\left(1-p_{\theta, b}\left(x_i^{-}\right)\right)\right]^2
$$
- We also encourage the probe to be confident, avoiding the degenerate solution $p_{\theta, b} \approx 0.5$ everywhere, with the following confidence loss:
$$
L_{\text {confidence }}\left(\theta, b ; q_i\right):=\min \left\{p_{\theta, b}\left(x_i^{+}\right), p_{\theta, b}\left(x_i^{-}\right)\right\}^2
$$
- The final unsupervised loss is the sum of these two losses, averaged across all contrast pairs:
$$
L_{\mathrm{CCS}}(\theta, b):=\frac{1}{n} \sum_{i=1}^n L_{\text {consistency }}\left(\theta, b ; q_i\right)+L_{\text {confidence }}\left(\theta, b ; q_i\right)
$$
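Putting the two terms together, a minimal training sketch that reuses the `normalize` and `CCSProbe` sketches above; the optimizer, learning rate, and step count are assumptions, not the paper's exact settings:

```python
import torch

def ccs_loss(probe, phi_pos, phi_neg):
    p_pos = probe(phi_pos)                       # p_{theta,b}(x_i^+)
    p_neg = probe(phi_neg)                       # p_{theta,b}(x_i^-)
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # the two estimates should agree
    confidence = torch.min(p_pos, p_neg) ** 2    # discourage p ~ 0.5 everywhere
    return (consistency + confidence).mean()     # average over contrast pairs

def train_ccs(phi_pos, phi_neg, d, steps=1000, lr=1e-3):
    phi_pos, phi_neg = normalize(phi_pos, phi_neg)
    probe = CCSProbe(d)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ccs_loss(probe, phi_pos, phi_neg)
        loss.backward()
        opt.step()
    return probe
```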
#### Inference
- Because we use a soft consistency constraint, $p\left(x_i^{+}\right)$ and $1-p\left(x_i^{-}\right)$ may not be exactly equal (the two probabilities might not sum to 1), so at inference we average them:
$$
\tilde{p}\left(q_i\right):=\frac{1}{2}\left(p\left(x_i^{+}\right)+\left(1-p\left(x_i^{-}\right)\right)\right)
$$
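Inference is then a thresholded average of the two estimates; a small sketch reusing the probe above (inputs are assumed to be normalized with the training-time means):

```python
import torch

def predict(probe, phi_pos, phi_neg, threshold=0.5):
    """tilde{p}(q_i) = 0.5 * (p(x_i^+) + (1 - p(x_i^-))); answer "Yes" if above 0.5."""
    with torch.no_grad():
        p_tilde = 0.5 * (probe(phi_pos) + (1.0 - probe(phi_neg)))
    return p_tilde, p_tilde > threshold
```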
#### Hidden states
- For encoder-decoder models, CCS is evaluated on the last-layer hidden states of both the encoder and the decoder, using whichever generally achieves the lower unsupervised loss.
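For context, a rough sketch of how one might extract such last-layer hidden states with Hugging Face Transformers; the model name, prompt, and choice of token position are illustrative assumptions, not the paper's exact setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder encoder-decoder model; the paper evaluates larger models.
tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

text = "Is Paris the capital of France? Yes"   # one half of a contrast pair
inputs = tok(text, return_tensors="pt")
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    out = model(**inputs, decoder_input_ids=decoder_input_ids,
                output_hidden_states=True)

phi_encoder = out.encoder_last_hidden_state[0, -1]    # last token, encoder last layer
phi_decoder = out.decoder_hidden_states[-1][0, -1]    # last position, decoder last layer
```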
## EVALUATING CCS
### CCS OUTPERFORMS ZERO-SHOT

### CCS IS ROBUST TO MISLEADING PROMPTS
- Recall our goal: to discover latent knowledge in a language model even when the model outputs false text.
- Specifically, we add a prefix to the beginning of our zero-shot prompts that consists of questions answered incorrectly.
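As an illustration (not the paper's exact wording), such a misleading prefix might look like the following, with the few-shot questions deliberately answered incorrectly before the actual zero-shot question is appended:

```python
# Hypothetical misleading prefix: simple questions answered incorrectly on purpose.
misleading_prefix = (
    "Q: Is the sky blue on a clear day? A: No\n"
    "Q: Do cats have four legs? A: No\n"
)
prompt = misleading_prefix + "Q: Is Paris the capital of France? A:"
```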


## ANALYZING CCS
### CCS FINDS A TASK-AGNOSTIC REPRESENTATION OF TRUTH

## Comments
- Could a loss like this be added during the pretraining or finetuning stage, tuning only the last layer's parameters? The authors attribute untruthful outputs to the misalignment between the training objective and the truth.
- Regarding the "truth" or latent knowledge discussed in the paper: can it cover the case where the model's output suggests it does not know whether it knows something, while its internal latent knowledge actually does know whether it knows?
- I think this method is like using logits for confidence estimation and then performing something like calibration (I say "like" because the model's own parameters are not changed).
## Related sites/works
- [eleuther: Eliciting Latent Knowledge](https://www.eleuther.ai/projects/elk)