# DISCOVERING LATENT KNOWLEDGE IN LANGUAGE MODELS WITHOUT SUPERVISION
###### tags: `progress`
ICLR 2023
## Introduction
- A model does not always produce truthful output.
- For example, if we train a model to imitate human-generated text, it may learn to output common misconceptions.
- The author attributes this to the misalignment between the training objective and the truth.
- As models are applied to more complex domains, human supervision may become less effective at mitigating this misalignment.
- Proposed method: answer questions in a purely unsupervised way by searching for implicit, internal “beliefs” or “knowledge” learned by the model.
- Recall the discussion last time
  - 大山老師's framing of the topic: subjective vs. objective correctness. Subjective correctness means that, within its own world, the model can honestly answer what is correct in that world; objective correctness means retrieving the answer from outside the model's world, e.g., with RAG. If the model's world is altered, e.g., all 7s and 8s in the training data are swapped, then the model should answer 7 * 7 = 64; that is subjectively correct. If only half of the 7s and 8s in the training data are swapped, the model should answer 7 * 7 = 64 with probability one half and 49 with probability one half, detecting that its world is inconsistent. If such a problem is too hard, it can be replaced with a simpler one, or split into a hard part and an easy part, starting with the easy part.
## PROBLEM STATEMENT AND FRAMEWORK
### PROBLEM: DISCOVERING LATENT KNOWLEDGE
- We want methods that do not rely on the model **generating correct outputs** and that do not rely on **external supervision**.
- Instead, we turn to the model’s **unlabeled hidden representations**.
- Let $\phi\left(x\right) \in \mathbb{R}^d$ denote a feature representation of a natural language input $x$. Our goal is to answer the yes/no questions $q_{1}, \dots, q_{n}$ given access only to $\phi$.
### METHOD: CONTRAST-CONSISTENT SEARCH
- To make progress on the goal described above, we exploit the fact that truth has special structure: it satisfies consistency properties that few other features in a language model are likely to satisfy.
- CCS works by (1) constructing a contrast pair $\left(x_i^{+}, x_i^{-}\right)$ for each question $q_i$ by answering it both “Yes” and “No”, (2) extracting and normalizing the hidden representations of both statements, and (3) learning a probe whose outputs on each pair are consistent and confident.

#### Feature extraction and normalization.
- Here $\mu^{+}$ and $\mu^{-}$ are the means of $\{\phi(x_i^{+})\}_{i=1}^{n}$ and $\{\phi(x_i^{-})\}_{i=1}^{n}$; subtracting them removes the superficial difference that every $x_i^{+}$ ends with “Yes” and every $x_i^{-}$ ends with “No”.
$$
\tilde{\phi}\left(x_i^{+}\right):=\phi\left(x_i^{+}\right)-\mu^{+} ; \quad \tilde{\phi}\left(x_i^{-}\right):=\phi\left(x_i^{-}\right)-\mu^{-}
$$
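A minimal sketch of this normalization step, assuming the hidden states of all contrast pairs have already been stacked into tensors `phi_pos` and `phi_neg` of shape `(n, d)` (the function and variable names are illustrative, not from the paper):

```python
import torch

def normalize(phi_pos: torch.Tensor, phi_neg: torch.Tensor):
    """Subtract the per-class mean (mu^+ / mu^-) so a probe cannot rely on
    the superficial "...Yes" vs. "...No" difference between the two halves.
    phi_pos, phi_neg: hidden states for x_i^+ and x_i^-, each of shape (n, d)."""
    mu_pos = phi_pos.mean(dim=0, keepdim=True)   # mu^+
    mu_neg = phi_neg.mean(dim=0, keepdim=True)   # mu^-
    return phi_pos - mu_pos, phi_neg - mu_neg
```

The paper also divides by the per-class standard deviation; only the mean subtraction shown in the equation above is reproduced here.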
#### Mapping activations to probabilities
$$p_{\theta, b}(\tilde{\phi})=\sigma\left(\theta^T \tilde{\phi}+b\right)$$
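A corresponding probe in PyTorch, a sketch of the linear map plus sigmoid above (the class name is mine, not the authors'):

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """p_{theta,b}(phi~) = sigmoid(theta^T phi~ + b): a single linear unit."""
    def __init__(self, d: int):
        super().__init__()
        self.linear = nn.Linear(d, 1)   # weight = theta, bias = b

    def forward(self, phi: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(phi)).squeeze(-1)
```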
#### Training objective
- We use the fact that a statement and its negation should have probabilities that add up to 1.
$$
L_{\text {consistency }}\left(\theta, b ; q_i\right):=\left[p_{\theta, b}\left(x_i^{+}\right)-\left(1-p_{\theta, b}\left(x_i^{-}\right)\right)\right]^2
$$
- We also encourage the probe to be confident, avoiding the degenerate solution $p_{\theta, b} \approx 0.5$ everywhere, with the following confidence loss:
$$
L_{\text {confidence }}\left(\theta, b ; q_i\right):=\min \left\{p_{\theta, b}\left(x_i^{+}\right), p_{\theta, b}\left(x_i^{-}\right)\right\}^2
$$
- The final unsupervised loss is the sum of these two losses, averaged across all contrast pairs:
$$
L_{\mathrm{CCS}}(\theta, b):=\frac{1}{n} \sum_{i=1}^n L_{\text {consistency }}\left(\theta, b ; q_i\right)+L_{\text {confidence }}\left(\theta, b ; q_i\right)
$$
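Putting the two terms together, a minimal training sketch that reuses the `normalize` and `CCSProbe` sketches above; the optimizer, learning rate, and step count are assumptions, not the paper's exact settings:

```python
import torch

def ccs_loss(probe, phi_pos, phi_neg):
    p_pos = probe(phi_pos)                       # p_{theta,b}(x_i^+)
    p_neg = probe(phi_neg)                       # p_{theta,b}(x_i^-)
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # the two estimates should agree
    confidence = torch.min(p_pos, p_neg) ** 2    # discourage p ~ 0.5 everywhere
    return (consistency + confidence).mean()     # average over contrast pairs

def train_ccs(phi_pos, phi_neg, d, steps=1000, lr=1e-3):
    phi_pos, phi_neg = normalize(phi_pos, phi_neg)
    probe = CCSProbe(d)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ccs_loss(probe, phi_pos, phi_neg)
        loss.backward()
        opt.step()
    return probe
```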
#### Inference
- Because we use a soft consistency constraint, $p\left(x_i^{+}\right)$ and $1-p\left(x_i^{-}\right)$ may not be exactly equal (the two probabilities might not sum to 1), so at inference we average them:
$$
\tilde{p}\left(q_i\right):=\frac{1}{2}\left(p\left(x_i^{+}\right)+\left(1-p\left(x_i^{-}\right)\right)\right)
$$
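Inference is then a thresholded average of the two estimates; a small sketch reusing the probe above (inputs are assumed to be normalized with the training-time means):

```python
import torch

def predict(probe, phi_pos, phi_neg, threshold=0.5):
    """tilde{p}(q_i) = 0.5 * (p(x_i^+) + (1 - p(x_i^-))); answer "Yes" if above 0.5."""
    with torch.no_grad():
        p_tilde = 0.5 * (probe(phi_pos) + (1.0 - probe(phi_neg)))
    return p_tilde, p_tilde > threshold
```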
#### Hidden states
- For encoder-decoder models, CCS is evaluated on the last-layer hidden states of both the encoder and the decoder, using whichever generally achieves the lower unsupervised loss.
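For context, a rough sketch of how one might extract such last-layer hidden states with Hugging Face Transformers; the model name, prompt, and choice of token position are illustrative assumptions, not the paper's exact setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder encoder-decoder model; the paper evaluates larger models.
tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

text = "Is Paris the capital of France? Yes"   # one half of a contrast pair
inputs = tok(text, return_tensors="pt")
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    out = model(**inputs, decoder_input_ids=decoder_input_ids,
                output_hidden_states=True)

phi_encoder = out.encoder_last_hidden_state[0, -1]    # last token, encoder last layer
phi_decoder = out.decoder_hidden_states[-1][0, -1]    # last position, decoder last layer
```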
## EVALUATING CCS
### CCS OUTPERFORMS ZERO-SHOT

### CCS IS ROBUST TO MISLEADING PROMPTS
- Recall our goal: to discover latent knowledge in a language model even when the model outputs false text.
- Specifically, we add a prefix to the beginning of our zero-shot prompts that consists of questions answered incorrectly.
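As an illustration (not the paper's exact wording), such a misleading prefix might look like the following, with the few-shot questions deliberately answered incorrectly before the actual zero-shot question is appended:

```python
# Hypothetical misleading prefix: simple questions answered incorrectly on purpose.
misleading_prefix = (
    "Q: Is the sky blue on a clear day? A: No\n"
    "Q: Do cats have four legs? A: No\n"
)
prompt = misleading_prefix + "Q: Is Paris the capital of France? A:"
```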


## ANALYZING CCS
### CCS FINDS A TASK-AGNOSTIC REPRESENTATION OF TRUTH

## Comments
- Could a loss like this be added during the pretraining or finetuning stage, tuning only the last layer's parameters? The authors attribute untruthful outputs to the misalignment between the training objective and the truth.
- Regarding the "truth" or latent knowledge discussed in the paper: can it cover the case where the model's output suggests it does not know whether it knows something, while its internal latent knowledge actually does know whether it knows?
- I think this method is like using logits for confidence estimation and then performing something like calibration (I say "like" because the model's own parameters are not changed).
## Related sites/works
- [eleuther: Eliciting Latent Knowledge](https://www.eleuther.ai/projects/elk)