Angelica Chen

Modern language models are often poorly calibrated, especially after RLHF:

![](https://hackmd.io/_uploads/rkCMaRAY3.png)
*Image from the [GPT-4 Technical Report](https://arxiv.org/pdf/2303.08774.pdf)*

Though temperature scaling does appear to be helpful in some instances:

![](https://hackmd.io/_uploads/Sk6YR0AKn.png)
*Image from [Language Models (Mostly) Know What They Know](https://arxiv.org/pdf/2207.05221.pdf), Kadavath et al.*

But here calibration is mostly measured only on classification tasks, and it considers only the max probability rather than the entire distribution. Ideally, for distributions $p$ and $\hat{p}$ over finite sequences of tokens, where $p$ is the "gold" distribution, we want
$$CE(p,\hat{p})=\mathrm{EntRate}(\hat{p})$$
where
\begin{align}
\mathrm{EntRate}(\hat{p}) & := \mathbb{E}_{W\sim\hat{p}}\left[-\frac{1}{|W|}\log\hat{p}(W)\right] \\
CE(p,\hat{p}) &:= \mathbb{E}_{W\sim p}\left[-\frac{1}{|W|}\log\hat{p}(W)\right]
\end{align}
That is, the entropy rate of the generated text should equal the cross-entropy between the true data distribution and the model distribution. This is similar to what is evaluated in [Braverman et al.](http://proceedings.mlr.press/v119/braverman20a/braverman20a.pdf):
- They showed that a well-trained model can be calibrated in a way that also decreases cross-entropy loss via a technique akin to temperature scaling: given model $\hat{p}$, define the *calibrated model* $\hat{p}_\alpha$ as
  $$\hat{p}_\alpha(W) = \frac{\hat{p}(W)^{(1+\alpha)}}{Z_\alpha},$$
  where $\alpha=\arg\min_\alpha CE(p,\hat{p}_\alpha)$.
- Then, if $\hat{p}$ is sufficiently well-trained, $\hat{p}_\alpha$ is calibrated to $p$ and has entropy close to the true entropy rate.

The main general problem is quantifying high-dimensional, multi-class calibration when the number of classes is large.

**Other options**: We can also consider some form of multi-class calibration. Note that calibration always tries to align a model's predictive distribution with some ground-truth dataset/distribution $\mathcal{D}$.
- "Strong Calibration": vector-valued calibration over all vector-valued outcomes. Generally considered impractical, though clustering could be done a priori.
- "Confidence Calibration" (Guo et al. '17): calibrates only the most likely class. For each prompt, let $f(prompt) = P(\arg\max \, response \mid prompt) = p$, which is a function of the prompt only and is usually very small (which is problematic). Then we want $E_{(prompt, response) \sim \mathcal{D}}[1_{response = \arg\max \, response} \mid f(prompt) = p ] = p$ (a binned-ECE sketch of this follows the list).
- "Classwise Calibration" (Kull et al. '19): technically each response is a separate class here, which gives more classes than datapoints. However, we can consider forming clusters of responses: let $C_1, \ldots, C_k$ be $k$ classes of responses, where the $C_i$ form a partition of the response space, and ask for calibration of each class marginally. The partitioning may be task-specific, e.g. $C_1$ representing safe responses and $C_2$ unsafe ones.
- "F-calibration" (https://arxiv.org/pdf/1910.11385.pdf): not a fundamentally new notion of calibration, but a new way of measuring it via matrix-valued kernels, instead of the typical binned ECE, which assumes a block-diagonal kernel.
- "Decision Calibration" (https://arxiv.org/pdf/2107.05719.pdf): generalizes all of the previous notions via downstream loss calibration over a *set* of losses with respect to the Bayes-optimal action. Practically, this seems like multi-task calibration, where each task has a loss (and a corresponding optimal action).
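
To make the binned ECE used for "Confidence Calibration" concrete, here is a minimal NumPy sketch; the bin count and the toy confidence/correctness arrays are arbitrary illustrations, not values from any experiment in these notes.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| over equal-width bins,
    weighted by the fraction of examples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy usage: predicted max-class probabilities and 0/1 correctness indicators.
conf = np.array([0.9, 0.8, 0.55, 0.95, 0.6, 0.7])
hit = np.array([1, 1, 0, 1, 1, 0])
print(expected_calibration_error(conf, hit, n_bins=5))
```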
## Research questions

- How should calibration be measured for generative models?
  - Whether the model's output probabilities align well with the reward distribution
  - We know that RLHF models often output less diverse text and are generally trained in a mode-seeking way
  - In RLHF models, perhaps the better question is: what is a principled way to interpolate between the SFT model distribution and the reward distribution? (*i.e.*, what is a principled way of tuning $\beta$?)
- If our goal is to properly model human preferences, is this even fully possible? There could be many oddities/paradoxes:
  - Arrow's impossibility theorem
  - Does the reward distribution fulfill the Condorcet criterion?
    - The Bradley-Terry preference model does not fulfill the Condorcet criterion
    - But in fact no positional-scoring ranking system does; the best you can do is minimize pairwise differences
    - What then happens to the reward distribution?
  - Modeling preferences can be seen as an instance of finding the minimum feedback arc set in a tournament graph (MFAST); do modern reward models do this well?
    - [Putting Human Assessments of Machine Translation Systems in Order](https://aclanthology.org/W12-3101.pdf)
    - [On The Structure of Parametric Tournaments with Application to Ranking from Pairwise Comparisons](https://proceedings.neurips.cc/paper/2021/hash/64dafb11e52edd3cd840bf24e56ddce6-Abstract.html)
  - The above all arise from the assumption that ...
- Are modern RLHF LLMs poorly calibrated to the true data distribution?
  - **Experiment**: Measure the gap between $CE(p,\hat{p})$ and $\mathrm{EntRate}(\hat{p})$ (see the sketch after this list)
    - But how best to measure $\mathrm{EntRate}(\hat{p})$ when some token probabilities are $\approx 0$?
    - A gap in these two quantities is **not necessarily a bug**, since it depends on the dataset; it's reasonable for cross-entropy to be higher on lower-reward sequences
  - Does this calibration worsen as generated sequence length increases?
- How does RLHF worsen calibration?
  - Possibly due to worse exploration of the action space
- Does this poor calibration contribute to the lack of diversity / mode collapse issues of RLHF models?
  - Empirical examples of mode collapse: https://arxiv.org/pdf/2012.01365.pdf#page=7 (section 5), https://arxiv.org/pdf/2303.17548.pdf
- Can we improve both the diversity of RLHF outputs and calibration without eroding the accuracy of the model (especially when sampling with high temperature)?
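
A rough sketch (not the exact protocol used in the experiments later in these notes) of how the gap between $CE(p,\hat{p})$ and $\mathrm{EntRate}(\hat{p})$ could be estimated with a Hugging Face causal LM. The model name, the toy texts/prompts, and the simplification of averaging the NLL over prompt + continuation are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices: any causal LM and any reference texts from the dataset of interest.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_nll(token_ids):
    """Per-token negative log-likelihood of a token sequence under the model."""
    with torch.no_grad():
        out = model(token_ids, labels=token_ids)
    return out.loss.item()  # already averaged over predicted tokens

def cross_entropy_rate(texts):
    """Estimate CE(p, p_hat): average per-token NLL of *reference* texts."""
    return sum(avg_nll(tok(t, return_tensors="pt").input_ids) for t in texts) / len(texts)

def entropy_rate(prompts, max_new_tokens=64):
    """Estimate EntRate(p_hat): average per-token NLL of the model's *own* samples.
    For simplicity this averages over prompt + continuation; scoring only the
    continuation would be closer to the definition above."""
    total = 0.0
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        gen = model.generate(ids, do_sample=True, max_new_tokens=max_new_tokens,
                             pad_token_id=tok.eos_token_id)
        total += avg_nll(gen)
    return total / len(prompts)

texts = ["The quick brown fox jumps over the lazy dog."]
prompts = ["The quick brown"]
print("CE estimate:", cross_entropy_rate(texts))
print("EntRate estimate:", entropy_rate(prompts))
```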
## RLHF basic setup

Suppose we have two items $y_0, y_1\in \mathcal{Y}$. Under the Bradley-Terry model, the probability of selecting $y_0$ is
$$ \mathbb{P}[y_0 > y_1] = \frac{e^{r(y_0)}}{e^{r(y_0)} + e^{r(y_1)}} \propto e^{r(y_0)}. $$
The reward model aims to learn $r(\cdot)$ such that the probabilities on the left-hand side are aligned with the dataset. Given a reward function, the RLHF objective is
\begin{align*}
J(\theta) &= \mathbb{E}_{y\sim \pi(\theta)}[r(y)] - \beta \, \mathrm{KL}(\pi(\theta), \pi_0)\\
&= \sum_{y\in \mathcal{Y}} \mathbb{P}[y|\pi(\theta)] r(y) - \beta\sum_{y\in \mathcal{Y}}\mathbb{P}[y|\pi(\theta)](\log\mathbb{P}[y|\pi(\theta)] - \log \mathbb{P}[y|\pi_0]).
\end{align*}
Now assume we have the perfect reward model. Can we recover the desired distribution by maximizing $J(\theta)$? The answer seems to be no. First, let $p_0 = \mathbb{P}[y_0|\pi(\theta)]$ and $p_1 = 1-p_0$. We can directly solve for the $p_0$ that maximizes $J(\theta)$: write $J(\theta)$ in terms of $p_0$ and set the derivative to zero. The solution is
$$ p_0 = \left(\exp\left(\tfrac{1}{\beta}(r(y_1) - r(y_0)-c)\right)+1\right)^{-1} $$

## Poor exploration in RLHF leads to miscalibration

The RLHF objective: given an input $x$, the initial policy generates multiple outputs $y_0, y_1$; human raters indicate that they prefer $y_i$ over $y_{1-i}$; and the reward model is trained with the loss:
$$loss(r_\theta)=-\mathbb{E}_{x,y_0,y_1,i}\left[\log(\sigma(r_\theta(x,y_i)-r_\theta(x,y_{1-i})))\right]$$
The reward for the learned policy is then:
$$R(x,y, \phi)=r_\theta(x,y)-\beta\log\left[\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right]$$
which is equivalent to the learning objective:
\begin{align}
&\arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}}\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot|x)} \left[R(x,y,\phi)\right] \\
&=\arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}}\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)} \left[r_\theta(x,y)-\beta\log\left[\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right]\right] \\
&= \arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}}\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)} \left[\frac{1}{\beta}r_\theta(x,y) + \log\left[\frac{\pi^{SFT}(y|x)}{\pi_\phi^{RL}(y|x)}\right]\right] \\
&=\arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}}\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)} \left[ \log\left[\frac{\pi^{SFT}(y|x)\exp(\frac{1}{\beta}r_\theta(x,y))}{\pi_\phi^{RL}(y|x)}\right]\right] \\
&=\arg\min_\phi \mathbb{E}_{\substack{x\sim \mathcal{D}}} \left[ \mathrm{KL}\left(\pi_\phi^{RL}(y|x) \,\middle\|\, \pi^{SFT}(y|x)\exp(\tfrac{1}{\beta}r_\theta(x,y))\right)\right]
\end{align}
Therefore, the minimizer of the above for each $x$ is:
$$\pi_\phi^*(y|x) \propto \pi^{SFT}(y|x)\exp\left(\tfrac{1}{\beta} r(x,y)\right)$$
But the above objective encourages mode-seeking behavior, for a couple of reasons (illustrated in the sketch below):
1. $r_\theta(x,y)$ can be maximized by tuning $\pi_\phi^{RL}$ to only generate $y$'s that maximize $r_\theta(x,y)$.
2. Since $\mathrm{KL}\left(\pi_\phi^{RL}(\cdot|x) \middle\| \pi^{SFT}(\cdot|x)\right)$ is an expectation over $y\sim \pi_\phi^{RL}(\cdot|x)$, this regularization term focuses disproportionately on high-reward outputs $y$ and has much less effect on the rest of the distribution $\pi^{SFT}(\cdot|x)$. Even though this dampens the mode-seeking caused by (1), it does not really smooth the rest of the distribution outside of the modes.
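
A tiny numeric illustration (entirely made-up numbers) of the closed-form optimum $\pi_\phi^*(y|x)\propto\pi^{SFT}(y|x)\exp(r(x,y)/\beta)$ and the mode-seeking behavior described above: as $\beta$ shrinks, the tilted distribution collapses onto the highest-reward response, and for no $\beta$ in this example does it match the Bradley-Terry preference distribution $\mathrm{softmax}(r)$.

```python
import numpy as np

# Toy, made-up example: 5 possible responses with SFT probabilities and rewards.
pi_sft = np.array([0.35, 0.30, 0.20, 0.10, 0.05])
r = np.array([1.0, 0.0, 2.0, 0.5, 3.0])

def tilted_optimum(pi_sft, r, beta):
    """Closed-form RLHF optimum: pi*(y) proportional to pi_SFT(y) * exp(r(y) / beta)."""
    w = pi_sft * np.exp(r / beta)
    return w / w.sum()

def entropy(p):
    return -(p * np.log(p)).sum()

pref = np.exp(r) / np.exp(r).sum()  # Bradley-Terry / softmax "preference" distribution
print("softmax(r):", np.round(pref, 3), "entropy:", round(entropy(pref), 3))
for beta in [10.0, 1.0, 0.1]:
    p = tilted_optimum(pi_sft, r, beta)
    print(f"beta={beta:>4}:", np.round(p, 3), "entropy:", round(entropy(p), 3))
# As beta grows the optimum reverts to pi_SFT; as beta -> 0 it collapses
# onto the argmax-reward response; in this example no beta yields softmax(r).
```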
What if we try to use a separate exploration policy $\pi^{Exp}$?

**Step 1** (Train reward model):
$$\min_\theta\mathbb{E}_{y\sim \pi^{SFT}(\cdot|x)}(r_\theta(x,y)-r^*(x,y))^2$$

**Step 2a**:
$$\max_{\pi^{Exp}}\mathbb{E}_{y \sim \pi^{Exp}(\cdot|x)}[r_\theta(y|x)]+\beta\mathbb{E}_{y\sim\pi^{SFT}(\cdot|x)}[\log\pi^{Exp}(y|x)]$$

**Step 2b**:
$$\max_{\pi_\phi^{RL}}\mathbb{E}_{y\sim\pi^{Exp}(\cdot|x)}\left[\log\left(\frac{\pi_\phi^{RL}(y|x)}{\pi^{Exp}(y|x)}\right)r_\theta(y|x)\right]$$

### Calibrating via Geometric Mean

Note that as $\beta\to \infty$, $\pi_\phi^\star(y|x)$ does not approach the human preference distribution $\propto\exp(r(x,y))$. However, to calibrate towards this distribution we can simply modify the reward as
$$R(x,y, \phi)=r_\theta(x,y)-(\beta + 1)\log\left[\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)^{1/(1+\beta)}}\right]$$

<!-- which is equivalent to the learning objective:
\begin{align}
&\arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}}\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot|x)} \left[R(x,y,\phi)\right] \\
&=\arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}}\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)} \left[r_\theta(x,y)-\beta\log\left[\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right]\right] \\
&= \arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}} \left[\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)}[r_\theta(x,y)] -\beta\,\mathrm{KL}\left(\pi_\phi^{RL}(\cdot|x) \middle\| \pi^{SFT}(\cdot|x)\right)\right]
\end{align}
Minimizer of above is:
$$\pi_\phi^*(y|x) \propto \pi^{SFT}(y|x)\exp(\beta r(x,y))$$
-->

## Calibration of StableVicuna-13B

StableVicuna-13B is an RLHF model tuned from an instruction-tuned LLaMA-13B model. RedPajama-1T is a reproduction of the LLaMA pre-training data, and Anthropic's HH-RLHF dataset is one of the RL-tuning datasets that StableVicuna-13B was tuned on. For the latter, we only compute the cross-entropy and entropy rates on the "chosen" component of the preference pair.

![](https://hackmd.io/_uploads/BkVisIto2.png)
![](https://hackmd.io/_uploads/HkhooIFs3.png)

Takeaways:
- StableVicuna-13B exhibits very minimal variation in entropy rate on the HH-RLHF dataset; for this particular type of data, it has learned to produce much less diverse text.
- StableVicuna-13B has a much larger range in cross-entropy on the HH-RLHF dataset. This likely makes sense because the "chosen" example may not always be high-reward; it is only higher-reward in comparison with the other example in the pair.

### Calibrating Reward Models

Note that we can view a reward model $r(x, y)$, where $x$ is the prompt and $y$ is the response, as a classifier on the space of all pairs of responses. Specifically, consider the space $\mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$ and, for $(x, y_1, y_2)$, let $C$ be a classifier such that $C(x, y_1, y_2) = \frac{\exp(r(x,y_1))}{\exp(r(x,y_1)) + \exp(r(x,y_2))}$ is the probability that the first response is preferred, under the Bradley-Terry model assumption. Calibration then means that on the dataset $\mathcal{D}$,
$$\mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}}[1_{y_1 > y_2} \mid C(x, y_1, y_2) = p] = p,$$
where $1_{y_1 > y_2} = 1$ iff the human prefers $y_1$. This is captured by the Expected Calibration Error of $C$, which computes
$$E_p\left[ \left|p -\mathbb{E}_{(x, y_1, y_2) \sim \mathcal{D}}[1_{y_1 > y_2} \mid C(x, y_1, y_2) = p]\right| \right].$$

#### Relationship to RL-tuned policy

Consider the standard RLHF objective. Suppose we have a reward model $r$, and $\pi_r$ is the optimal policy. From the DPO paper we know that the reward $r(x, y)$ can be expressed in terms of $\pi_r(y|x)$:
$$r(x, y) = \beta\log\frac{\pi_r(y|x)}{\pi_{SFT}(y|x)}+\beta\log Z(x). $$
If we define the calibration of $\pi_r$ to be calibration with respect to the preference distribution from $\mathcal{D}$, then we need to know $\mathbb{P}[y_1 > y_2|x]$ under $\pi_r$. For a generative model with no further assumptions, it is unclear what the probability is that it prefers $y_1$ to $y_2$. But if the generative model is an optimal policy under the RLHF objective with reward model $r$, we can recover the preference distribution by computing $r$ from $\pi_r$. In this case, the ECE of $\pi_r$ is the same as the ECE of $r$. We can in fact infer the reward of any policy under this assumption.
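
A sketch of that recovery under the stated assumption, with made-up sequence log-probabilities standing in for real model outputs: the implied reward difference $\beta\left[\log\frac{\pi_r(y_1|x)}{\pi_{SFT}(y_1|x)}-\log\frac{\pi_r(y_2|x)}{\pi_{SFT}(y_2|x)}\right]$ (the $\beta\log Z(x)$ terms cancel) is mapped through a sigmoid to a Bradley-Terry preference probability, which could then be fed to the binned-ECE sketch from earlier.

```python
import numpy as np

def implied_preference_prob(logp_policy_y1, logp_sft_y1,
                            logp_policy_y2, logp_sft_y2, beta=0.1):
    """P[y1 > y2 | x] implied by the DPO reward identity; log Z(x) cancels
    in the difference, so only sequence log-probs are needed."""
    margin = beta * ((logp_policy_y1 - logp_sft_y1)
                     - (logp_policy_y2 - logp_sft_y2))
    return 1.0 / (1.0 + np.exp(-margin))  # sigmoid of the reward difference

# Made-up sequence log-probabilities for three preference pairs (columns: y1, y2).
logp_policy = np.array([[-42.0, -55.0], [-30.0, -28.0], [-61.0, -70.0]])
logp_sft    = np.array([[-48.0, -53.0], [-33.0, -29.0], [-60.0, -66.0]])
human_prefers_y1 = np.array([1, 0, 1])

probs = implied_preference_prob(logp_policy[:, 0], logp_sft[:, 0],
                                logp_policy[:, 1], logp_sft[:, 1])
print("implied P[y1 > y2 | x]:", np.round(probs, 3))
# These probabilities and the 0/1 human labels can be passed to
# expected_calibration_error(...) from the earlier sketch to get an ECE for pi_r.
```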
One possible map from a generative model to a preference distribution is the following:
$$\mathbb{P}[y_1 > y_2 | x] = \frac{\pi(y_1 |x)}{\pi(y_1|x) + \pi(y_2|x)}. $$
Now, suppose we have the true reward function $r^*$ and we perform RLHF on the model; the preference distribution is then
$$\mathbb{P}[y_1 > y_2 | x] = \frac{1}{1 + \frac{\pi^{SFT}(y_2|x)}{\pi^{SFT}(y_1|x)}\exp\left(\frac{1}{\beta}(r(x, y_2) - r(x, y_1))\right)}. $$

## Questions

- Is the CE being approximated correctly when we use the one-hot vector?
- What prefix should we use when generating text for computing $\mathrm{EntRate}(\hat{p})$?
  - For the HH-RLHF dataset, maybe try computing it using only the last Assistant response instead
- How does $\mathrm{EntRate}(\hat{p})$ change as a function of the sequence length?
  - Try generating with larger `max_new_tokens`
- Check the negative values
- Check whether the reward function has been released for StableVicuna
- For top-$k$ outputs (as rated by reward models), cross-entropy increases as $k$ increases
- Inflection of the distribution $\pi_\phi$: how does that change as $\beta$ increases? Perhaps this is a more correct definition of calibration

## Potential definitions of sequence-level calibration

**Terminology**: Let our token vocabulary be $\mathcal{V}$ and $\mathcal{V}^*$ be the Kleene closure of $\mathcal{V}$ (*i.e.* the set of all finite concatenations of tokens in $\mathcal{V}$). We have a preference dataset $\mathcal{D}=\left\{\left(x_i,y_i^{(1)},y_i^{(2)}\right)\right\}_{i=1}^N$ consisting of prompts $x_i\in\mathcal{V}^*$ and outputs $y_i^{(1)},y_i^{(2)}\in\mathcal{V}^*$ for each prompt $x_i$. (WLOG, let $y_i^{(1)}\succ y_i^{(2)}$ by human raters.) Suppose we also have pairwise comparisons $r_i(j,k)$, where $r_i(j,k)$ is the proportion of human raters who prefer $y_i^{(j)}\succ y_i^{(k)}$ given prompt $x_i$. Note that $r_i(j,k)=1-r_i(k,j)$ for all $i,j,k$. Lastly, we have an LLM $\pi_\theta:\mathcal{V}\to[0,1]$ that defines a probability distribution over all tokens in $\mathcal{V}$. In some cases, we abuse notation to write
$$\pi_\theta(x)=\prod_{i=1}^{|x|} \pi_\theta(x_i|x_{1:(i-1)})$$
for the sequence $x=(x_1,\cdots,x_{|x|})$.

1. **Likelihood calibration**: $\pi_\theta$ is calibrated if, for any $(x_i,y_i^{(1)},y_i^{(2)})\sim\mathcal{D}$,
   $$\frac{\pi_\theta\left( y_i^{(1)} \mid x_i\right)}{\pi_\theta\left(y_i^{(2)}\mid x_i\right)}=r_i(1,2)$$
   Alternative:
   $$\frac{\pi_\theta\left( y_i^{(1)} \mid x_i\right)}{\pi_\theta\left(y_i^{(2)}\mid x_i\right) +\pi_\theta\left(y_i^{(1)}\mid x_i\right) }=r_i(1,2)$$
   - Impossible when cycles exist in the human preferences, though! You can show that RLHF should essentially induce this. **Also, this needs to use average (per-token) perplexity, since otherwise it would favor short sequences.** (A sketch of this definition follows the list.)
2. **Entropy rate calibration** (similar to the definition in [Braverman et al.](https://proceedings.mlr.press/v119/braverman20a/braverman20a.pdf), but modified for sequences instead of tokens):
   $$-\mathbb{E}_{(x_i,y_i^{(1)},y_i^{(2)})\sim\mathcal{D}}\log\pi_\theta\left(y_i^{(1)}|x_i\right)=-\mathbb{E}_{\substack{(x_i,y_i^{(1)},y_i^{(2)})\sim\mathcal{D},\\ \hat{y}\sim\pi_\theta(\cdot|x_i)}}\log\pi_\theta(\hat{y}|x_i)$$
   - Empirical issues with this definition: we assume that all the $y_i^{(1)}$'s belong to the gold distribution (i.e. $y_i^{(1)}\sim \pi^*(\cdot |x)$ for gold distribution $\pi^*$) solely because each is preferred over its $y_i^{(2)}$, but this generally isn't the case. It is possible that both $y_i^{(1)}$ and $y_i^{(2)}$ are in fact very low-quality outputs!
   - We could perhaps remedy this issue if we could sample from the gold distribution instead (similar to the rejection sampling in [Liu et al.](https://arxiv.org/abs/2309.06657)).
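
A small sketch of Definition 1 (the alternative form) using the length-normalized, per-token likelihoods suggested above; the token log-probability arrays are placeholders for whatever the LM actually assigns, and the resulting ratio would be compared against $r_i(1,2)$.

```python
import numpy as np

def likelihood_calibration_ratio(logprobs_y1, logprobs_y2):
    """Length-normalized version of Definition 1 (alternative form):
    compare exp(mean token log-prob) of y1 against y1 + y2."""
    s1 = np.exp(np.mean(logprobs_y1))  # geometric-mean per-token prob of y1 | x
    s2 = np.exp(np.mean(logprobs_y2))  # geometric-mean per-token prob of y2 | x
    return s1 / (s1 + s2)

# Made-up per-token log-probs for one preference pair (y1 is longer than y2).
logprobs_y1 = np.array([-2.1, -0.7, -1.3, -0.9, -1.8])
logprobs_y2 = np.array([-1.5, -2.4, -0.8])
ratio = likelihood_calibration_ratio(logprobs_y1, logprobs_y2)
print(ratio)  # would be compared against the human preference rate r_i(1, 2)
```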
## Lit Review

- [SLiC-HF: Sequence Likelihood Calibration with Human Feedback](https://arxiv.org/pdf/2305.10425.pdf), Zhao et al.
  - Tries to directly learn from human preferences without a separate reward model or PPO
  - Combines a rank-calibration loss and a cross-entropy regularization loss (a PyTorch sketch of this loss follows):
    $$\mathcal{L}(\theta) = \max(0, \delta - \log P_\theta(y^+|x)+\log P_\theta(y^-|x))-\lambda\log P_\theta(y_\mathrm{ref}|x)$$
    where $y^+$ and $y^-$ are positive and negative sequences, and $(x,y_\mathrm{ref})\sim D_{SFT}$, the supervised fine-tuning data
  - Only evaluated on summarization and against the 2020 RLHF-PPO algorithm from [Stiennon et al.](https://arxiv.org/abs/2009.01325):
    ![](https://hackmd.io/_uploads/BJ3jtz06n.png)
  - Uses much less memory than RLHF-PPO, since there's only one model as opposed to four(!) (in RLHF-PPO, there are separate policy, value, reward, and SFT models)
  - Also does not require any decoding during the training loop
  - Also related to [Calibrating Sequence Likelihood Improves Conditional Language Generation](https://arxiv.org/pdf/2210.00045.pdf)
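
  A minimal PyTorch sketch of the SLiC-HF loss above, assuming the sequence log-probabilities $\log P_\theta(\cdot\mid x)$ have already been computed; the margin $\delta$, weight $\lambda$, and toy inputs are arbitrary choices.

  ```python
  import torch

  def slic_hf_loss(logp_pos, logp_neg, logp_ref, delta=1.0, lam=0.1):
      """Rank-calibration term + cross-entropy regularizer on the SFT reference.
      All arguments are sequence log-probs log P_theta(y | x), shape (batch,)."""
      rank_loss = torch.clamp(delta - logp_pos + logp_neg, min=0.0)  # hinge on the margin
      reg_loss = -logp_ref                                           # -log P_theta(y_ref | x)
      return (rank_loss + lam * reg_loss).mean()

  # Toy usage with made-up sequence log-probabilities.
  logp_pos = torch.tensor([-12.3, -20.1])
  logp_neg = torch.tensor([-14.0, -19.5])
  logp_ref = torch.tensor([-11.0, -18.2])
  print(slic_hf_loss(logp_pos, logp_neg, logp_ref))
  ```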
- [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290)
  - Learning from human preferences through fine-tuning only
  - Assumes the Bradley-Terry (BT) model of human preferences, which then allows you to express $r(x,y)$ in terms of $\pi_r$, $\pi_\mathrm{ref}$, and $\beta$
  - BT model of preferences: if the strengths of teams $i$ and $j$ are $\beta_i,\beta_j$ respectively, then the log-odds corresponding to the probability $p_{ij}$ that team $i$ beats team $j$ is modeled as:
    $$\log\frac{p_{ij}}{1-p_{ij}}=\beta_i-\beta_j$$
  - Plugging this into the RLHF objective (for $\pi_r$) causes the partition function to cancel out, resulting in the following loss (a PyTorch sketch follows):
    $$\mathcal{L}_{DPO}(\pi_\theta;\pi_\mathrm{ref})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_\mathrm{ref}(y_w|x)}-\beta\log\frac{\pi_\theta(y_l|x)}{\pi_\mathrm{ref}(y_l|x)}\right)\right]$$
    ![](https://hackmd.io/_uploads/H1P-HBCTn.png)
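
  A matching PyTorch sketch of $\mathcal{L}_{DPO}$, again assuming precomputed sequence log-probabilities under the policy and the reference model; $\beta$ and the toy inputs are arbitrary.

  ```python
  import torch
  import torch.nn.functional as F

  def dpo_loss(logp_policy_w, logp_policy_l, logp_ref_w, logp_ref_l, beta=0.1):
      """-log sigmoid( beta * [(logpi(y_w) - logpi_ref(y_w)) - (logpi(y_l) - logpi_ref(y_l))] )."""
      margin = beta * ((logp_policy_w - logp_ref_w) - (logp_policy_l - logp_ref_l))
      return -F.logsigmoid(margin).mean()

  # Toy usage with made-up sequence log-probabilities.
  lp_w, lp_l = torch.tensor([-10.2, -15.0]), torch.tensor([-11.5, -14.1])
  ref_w, ref_l = torch.tensor([-10.8, -15.3]), torch.tensor([-11.0, -14.6])
  print(dpo_loss(lp_w, lp_l, ref_w, ref_l))
  ```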
- [RL with KL penalties is better viewed as Bayesian inference](https://arxiv.org/abs/2205.11275)
- [Aligning Language Models with Preferences through f-divergence Minimization](https://arxiv.org/abs/2302.08215)
- [Reinforced Self-Training (ReST) for Language Modeling](https://arxiv.org/abs/2308.08998)
- [Demystifying Prompts in Language Models via Perplexity Estimation](https://arxiv.org/abs/2212.04037)
- [Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback](https://arxiv.org/pdf/2307.15217.pdf)
- [Reward Collapse in Aligning Large Language Models](https://arxiv.org/pdf/2305.17608.pdf)
  - Argues that reward collapse happens because the empirical distribution of the rewards is independent of the prompt itself in the interpolating regime
  - Argues that the reward function should be dependent on the prompt (sharper for some prompts, and more gradual for others)
  - Proposes instead maximizing
    $$\sum_{\substack{(\text{prom}, \text{compl}_w, \text{compl}_l) \in\Pi}} U_\text{prom}(R_\theta(\text{prom},\text{compl}_w)-R_\theta(\text{prom},\text{compl}_l))$$
    where the utility function $U$ depends on the choice of prompt
- [A Close Look into the Calibration of Pre-trained Language Models](https://aclanthology.org/2023.acl-long.75.pdf)
- [The Calibration Generalization Gap](https://arxiv.org/abs/2210.01964)
- [A Unifying Theory of Distance from Calibration](https://arxiv.org/abs/2211.16886)
- [Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation](https://openreview.net/forum?id=VD-AYtP0dve)
- [Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback](https://arxiv.org/pdf/2305.14975.pdf)
- [Scaling Laws for Reward Model Overoptimization](https://arxiv.org/abs/2210.10760)
  - Empirically discovers scaling laws for both best-of-$n$-based and RL-based optimization against reward models; the former is quadratic w.r.t. $d$ (where $d=\sqrt{D_{KL}(\pi\|\pi_\text{init})}$) and the latter scales as $O(-d\log d)$
  - RL tends to be slower than best-of-$n$ sampling at both optimization and overoptimization
  - **KL penalty ineffectiveness**: using a KL penalty increases the proxy reward model score that can be achieved for a given KL divergence, but this does not correspond to a measurable improvement in the gold RM score vs. ${KL}_{RL}$ frontier
  - Regressional Goodhart occurs due to features with noise
- [Statistical Rejection Sampling Improves Preference Optimization](https://arxiv.org/abs/2309.06657)
  - RRHF, DPO, and SLiC aim to more effectively align LLMs with human preferences while avoiding the complexities of RL
  - RRHF trains a model to rank multiple sequences for the same prompt, and then applies a ranking loss plus a supervised fine-tuning loss: $\mathcal{L}(\theta)=\sum_{r_i<r_j} \max(0, \pi_\theta(y_i|x)-\pi_\theta(y_j|x))-\lambda\log\pi_\theta(y_\text{ref}|x)$
  - SLiC uses a contrastive ranking calibration loss plus a regularization loss: $\mathcal{L}(\theta)=\max(0,\delta-\log\pi_\theta(y_w|x)+\log\pi_\theta(y_l|x))-\lambda\log\pi_\theta(y_\text{ref}|x)$
  - DPO: $p^*(y_1 \succ y_2|x) = \left(1+\exp\left(\beta\log\frac{\pi^*(y_2|x)}{\pi_\text{sft}(y_2|x)}-\beta\log\frac{\pi^*(y_1|x)}{\pi_\text{sft}(y_1|x)}\right)\right)^{-1}$, where $y_1,y_2$ should be sampled from $\pi^*$, but in practice they are not
  - This paper attempts to fix DPO's sampling problem by using rejection sampling to simulate sampling from $\pi^*$
  - Can estimate $\pi^*$ from the fact that the reward advantage of $y_1$ over $y_2$ is:
    \begin{align}
    \delta_{r^*}(y_1,y_2,x,\pi^*, \pi_{SFT},\beta) &= r^*(x,y_1)-r^*(x,y_2) \\
    &= \beta\log\frac{\pi^*(y_1|x)}{\pi_{SFT}(y_1|x)}-\beta\log\frac{\pi^*(y_2|x)}{\pi_{SFT}(y_2|x)} \\
    P(y_1\succ y_2|x) &= g(\delta_{r^*}(y_1,y_2,x,\pi^*,\pi_{SFT},\beta))
    \end{align}
    for some monotonically non-decreasing $g:\mathbb{R}\to[0,1]$ that converts the reward difference into a winning probability (if $g$ is a sigmoid, we recover the BT model)
  - The target distribution under the learned reward is $\pi_{r_\psi}(y|x)=\frac{1}{Z(x)}\pi_\text{sft}(y|x)\exp\left(\frac{1}{\beta}r_\psi(x,y)\right)$, but this is difficult to sample from
  - Rejection sampling algorithm for sampling from $\pi_{r_\psi}$ (sketched below):
    - Generate $y\sim \pi_{SFT}(y|x)$ and $u\sim U[0,1]$
    - Let $M=\min\{m \mid m\,\pi_{SFT}(y|x)\geq \pi_{r_\psi}(y|x) \ \forall y\}$. If $u<\frac{\pi_{r_\psi}(y|x)}{M\pi_{SFT}(y|x)}$, accept $y$; otherwise reject $y$ and start again.
  - In reality, the dataset is $\mathcal{D}_{hf}=\{(x^{(i)},y_w^{(i)},y_l^{(i)})\mid y_w^{(i)},y_l^{(i)}\sim\pi_\text{unk}(y|x^{(i)})\}_{i=1}^{N_\text{unk}}$, where $\pi_\text{unk}$ is a mixture of unknown policies
  - Taken together, the algorithm **rso-sample-rank** is:
    - Train a reward-ranking model $\rho_\psi(x,y_1,y_2)$ on $\mathcal{D}_\text{hf}$
    - Use rejection sampling to sample from $\pi_{r_\psi}$, where $r_\psi(x,y)=\mathrm{logit}(\rho_\psi(x,y_1,y_2))$
    - Label response pairs using the reward-ranking model to construct the preference dataset $\mathcal{D}_p=\{(x^{(i)},y_w^{(i)},y_l^{(i)}) \mid y_w^{(i)},y_l^{(i)}\sim\pi_{r_\psi}(y|x^{(i)})\}_{i=1}^{N_{\rho_\psi}}$
  - **rso-sample-rank** performs better than DPO and SLiC on the Reddit TL;DR, CNN/DailyMail, and Anthropic HH datasets
  - The sigmoid-norm and hinge-norm loss functions performed similarly, but the hinge loss showed some reward hacking due to not considering $\pi_{SFT}$ and relying too much on the reward function
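
  A toy sketch of the accept/reject step above, on a small discrete candidate set where $\pi_{r_\psi}$ can be normalized exactly; real implementations have to approximate $M$ (e.g. from sampled reward values), so this only illustrates the mechanics.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  # Toy discrete setting: a handful of candidate responses for one prompt.
  pi_sft = np.array([0.4, 0.3, 0.2, 0.1])      # SFT sampling distribution
  rewards = np.array([0.2, 1.5, 0.7, 2.0])     # reward scores r_psi(x, y)
  beta = 0.5

  target = pi_sft * np.exp(rewards / beta)     # unnormalized pi_{r_psi}
  target = target / target.sum()
  M = np.max(target / pi_sft)                  # smallest M with M * pi_sft >= target

  def rejection_sample():
      """Accept y ~ pi_sft with probability target(y) / (M * pi_sft(y))."""
      while True:
          y = rng.choice(len(pi_sft), p=pi_sft)
          if rng.uniform() < target[y] / (M * pi_sft[y]):
              return y

  samples = [rejection_sample() for _ in range(20000)]
  print("empirical:", np.round(np.bincount(samples, minlength=4) / len(samples), 3))
  print("target   :", np.round(target, 3))
  ```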
- [Uncertainty estimation for language reward models](https://arxiv.org/abs/2203.07472)
  - Tried using an ensemble of reward models to predict which examples should be trained on (i.e., active learning)
  - The ensemble of reward models is well-calibrated (w.r.t. output confidence vs. accuracy of predicting which output will be preferred by humans), but the uncertainty (as computed by the variance between predictions) is not predictive of model error (computed as the KL divergence of ensemble predictions vs. oracle predictions)
    - But the "oracle" is simply a reward model trained on the full dataset (whereas the ensemble is trained on only a subset)
  - The active learning did not improve accuracy
- [The Trickle-down Impact of Reward (In-)consistency on RLHF](https://arxiv.org/abs/2309.16155)
  - Created ContrastInstructions, a benchmark consisting of pairs of lexically similar instructions with different responses, which evaluates whether a reward model can consistently distinguish the correct response for each member of the instruction pair, and also whether a reward model can assign higher reward to the correct instruction for each response
  - Fine-tuned Llama-7B reward models perform close to random chance on this benchmark
  - Proposes a combination of two approaches to improve reward model consistency:
    - ConvexDA: replace words in an example with synonyms, form the convex hull of all the augmented sequences and the original sequence, and select one corner of that convex hull to use in the training dataset instead
    - RewardFusion: replace the reward score with the weighted average of the rewards of training examples that are similar to the current example
  - These do improve consistency, but do not get anywhere close to human performance
  - When their improved RM is used in RLHF, it improves the usefulness of responses, but not relevance or factuality
  - The more consistent model is also better calibrated (higher correlation of RM score with human score)
- [Set Learning for Accurate and Calibrated Models](https://arxiv.org/abs/2307.02245)
- [Reward Model Ensembles Help Mitigate Overoptimization](https://arxiv.org/abs/2310.02743)
  - Using ensemble-based conservative optimization (with an ensemble of reward models) mitigates overoptimization of RMs used in RLHF
  - Ensemble-based conservative optimization methods tried:
    - Worst-case optimization: selects the lowest reward from the ensemble
    - Uncertainty-weighted optimization (UWO): mean reward minus the variance of rewards across ensemble members
  - Using a KL divergence penalty between the original and tuned policy also helps
  - UWO helps mitigate the impact of label noise
- [A Long Way to Go: Investigating Length Correlations in RLHF](https://arxiv.org/abs/2310.03716)
- [Stop Measuring Calibration when Humans Disagree](https://aclanthology.org/2022.emnlp-main.124/)
- [A General Theoretical Paradigm to Understand Learning from Human Preferences](https://arxiv.org/abs/2310.12036)

## Updates

### Week of 1/5/2024

- The NLHF paper (https://arxiv.org/pdf/2312.00886.pdf) suggests finding $\pi^*:=\arg\max_\pi\min_{\pi'}\mathcal{P}(\pi\succ\pi')$
  - Does this also minimize ECE?
  - Does not match the calibration game in Foster: https://www.jstor.org/stable/2337364?seq=1
- Angie: evaluate the different calibration measures on the various preference-learning methods
  - Also, maybe different calibration metrics will have different values on the same model/method
  - Especially on self-play methods
- Xinyi/Sadhika: check that NLHF does not actually guarantee a calibrated model?
- Does solving the calibration game result in a kind of self-play?

### Week of 12/15/2023

- The Nash equilibrium paper makes sense for trying to resolve noisy preferences! https://arxiv.org/pdf/2312.00886.pdf
  - Try to show how Nash learning is a form of calibration (Definition 2 in Section 4.2)
  - Calibration as a min-max strategy (https://arxiv.org/pdf/1910.11385.pdf)
  - Is the KL divergence the max of a Fenchel dual?
### Week of 12/8/2023

- Updates
  - DPO best hyperparameters (for training GPT-2 on HH):
    - LR 1e-5, grad accum 16, per-device train batch size 4 (effective batch size = 16 * 4 = 64), 2500 steps (1 epoch)
  - SFT best hyperparameters (for training GPT-2 on HH):
    - LR 1e-5, grad accum 2, per-device train batch size 32 (effective batch size = 64), 5 epochs

### Week of 12/1/2023

- Updates
  - Baseline experiments set up for SFT (https://api.wandb.ai/links/angie-chen55/rqtvid51) + DPO (https://api.wandb.ai/links/angie-chen55/5rv5loqi)
  - Also building the graph dataset; should we just average the edge weights between each pair of outputs?
  - Richard? Exploration of different calibration metrics?
- Next steps
  - Angie: finish up graph DPO experiments
    - Read about this preference model: https://openreview.net/pdf?id=6UtOXn1LwNE
    - Also read this: https://arxiv.org/pdf/2206.02231.pdf
  - Angie: set up a notebook for Richard (pre-trained LLM + RLHF model + preference dataset w/ preference votes)
    - Can be just a small LLM, no need for an RLHF model
  - Richard: use the notebook to investigate calibration metrics
  - Xinyi: look at f-calibration? + contrastive preference model
    - Continue looking into entropy-based sequence calibration metrics: https://www.overleaf.com/1992686842tttdzzkvcnfd#580c44
    - Multi-turn rewards vs. regret?

### Week of 11/10/2023

- Updates
  - Still working on graph data experiments
  - Starting to brainstorm possible definitions of sequence-level calibration: https://www.overleaf.com/1992686842tttdzzkvcnfd#580c44
  - Current outline of project:
    - Hypothesis 1: RLHF models are poorly calibrated (compared to their pre-trained non-RL counterparts)
      - Sub-hypothesis: miscalibration is in part caused by assuming that human preferences are linear and training RMs that score outputs via positional-scoring rules
      - Experiment (partially done): RMs trained on human preferences induce cycles and as a result predict probabilities that are too close to 0.5
      - Experiment (not yet done): should be obvious and easily confirmed mathematically, but miscalibrated RMs cause the RL-tuned model to be miscalibrated too. This can't easily be fixed via temperature scaling or other common calibration techniques
    - Hypothesis 2: both RLHF model accuracy and calibration can be improved by modeling preferences as a graph, rather than via positional scoring
      - Experiment: use the RM to bootstrap a graph of human preferences, train via DPO on that graph instead of using RM+PPO; measure sequence-level calibration and performance on various benchmarks

### Week of 10/31/2023

- Follow-ups
  - (*) Brainstorm possible definitions of sequence-level calibration
  - Try to mathematically connect the miscalibration of RMs to the miscalibration of the RL-tuned LM
  - Also work on graph preference training experiments
  - Read through ["A Unifying Theory of Distance from Calibration"](https://arxiv.org/pdf/2211.16886.pdf)
- Updates
  - Met with Cho, brainstormed using the cycles/graphs that we constructed to bootstrap a larger, non-linear preference dataset:
    - Let $\mathcal{D}_P=\{(x_i,y_{i1},y_{i2})\}_{i=1}^N$ (where $y_{i1}\succ y_{i2}$) be our original preference dataset and $\pi_R$ be a reward model trained on $\mathcal{D}_P$. We also augment $\mathcal{D}_P$ by using another LM $\pi_\theta$ to generate more outputs for each $x$ (*i.e.* we sample $y_{i3},\cdots,y_{ik}\sim \pi_\theta(\cdot | x_i)$), resulting in an augmented dataset $\mathcal{D}_P^\text{aug}=\{(x_i,y_{i1},\cdots,y_{ik})\}_{i=1}^N$.
    - For each unique instruction $x_i$, we generate $m$ paraphrases: $x_i^{(1)},\cdots,x_i^{(m)}$.
      Then each example $i$ has a set of $m+1$ distinct but semantically equivalent instructions, which we denote as $\mathcal{X}_i$.
    - For each example $i$, we construct a directed multigraph $\mathcal{G}_i=(V_i,E_i)$ where the nodes are the set of outputs (*i.e.* $V_i=\{y_{ij}\}_{j=1}^k$) and $E_i=\{(y_{ia}, y_{ib}, w_i(a,b)) \,|\, y_{ia},y_{ib}\in V_i, \pi_R(x,y_{ia}) < \pi_R(x,y_{ib}), x\in\mathcal{X}_i \}$, where the edge weights are assigned such that $w_i(a,b)=\pi_R(x,y_{ib})-\pi_R(x,y_{ia})$.
    - We then create a modified preference dataset $\mathcal{D}_P^\text{boot}$ by extracting binary preferences from each $\mathcal{G}_i$. For example, given this graph:
      ```graphviz
      digraph dg{
        "A" -> "B" ;
        "B" -> "C" ;
        "C" -> "D";
        "D" -> "E";
        "E" -> "B";
        "D" -> "F";
        "E" -> "C";
        label="A preference graph."
      }
      ```
      we can collapse the cycle into a single node to obtain
      ```graphviz
      digraph dg{
        "A" -> "B, C, D, E" ;
        "B, C, D, E" -> "F" ;
        label="Collapsed preference graph."
      }
      ```
      and the output preferences would be:
      \begin{align}
      A &\prec B \\
      A &\prec C \\
      A &\prec D \\
      A &\prec E \\
      A &\prec F \\
      B &\prec F \\
      C &\prec F \\
      D &\prec F \\
      E &\prec F
      \end{align}
      Then we directly train a DPO model on this dataset.
    - Question: what to do with the weights?
      - Can average edge weights and use the average weight in the training objective?
  - Other research questions about RLHF/RM calibration:
    - Can we directly show (theoretically/mathematically) how the calibration of the RM relates to the sequence-level calibration of the final RL-tuned model? (Requires us to define sequence-level calibration, lol. Could start with just the likelihood ratio of the preferred sequence to the non-preferred sequence?)
    - Does the non-linearity + cycles that are intrinsic to human preferences contribute to RM miscalibration?
    - How is sequence-level calibration important for a generative model?
      - Is this orthogonal to perplexity / other current metrics?
    - How are these entropy-based metrics affected for the fine-tuned RLHF models?
  - OpenAssistant Pythia 6.9B reward model calibration experiments:
    - In-distribution test set (Anthropic HH-RLHF)
      ![](https://hackmd.io/_uploads/BkD2ZFAG6.png)
      ![](https://hackmd.io/_uploads/HkkTWK0M6.png)
    - Out-of-distribution test set (AlpacaEval)
      ![](https://hackmd.io/_uploads/HyfyfYCMa.png)
      ![](https://hackmd.io/_uploads/ByYxGYCzp.png)
  - Looking for cycles
    - 100% of rows had at least one cycle of length >= 2
      ![](https://hackmd.io/_uploads/BJ33u9Rza.png)
      ![](https://hackmd.io/_uploads/HyI1YqCMT.png)

### Friday 10/25/2023

- Follow-ups
  - Could check for length bias by changing prompts and checking whether the RM still prefers the longer output
  - Could suggest a new cycle-breaking method
  - Also examine the miscalibrated examples for an overfitted RM?
    - Check if there are more cycles in the underfitted versus overfitted models?
  - Check whether a DPO model is better calibrated than an RLHF model?
  - Do further evaluations on the OpenAssistant RM
    - e.g. calibration on an OOD validation set?
  - Maybe come up with methods to improve the way RMs currently learn?
    - Based off our observations about how cycles are introduced
    - Calibration loss
  - (**) Also output the rephrasings + RM preferences on new outputs into a spreadsheet
  - (**) Also redo the above experiment with an OpenAssistant RM (and some overfitted models)
- Updates
  - Observations about examples that the RM is miscalibrated on:
    - When the human-preferred answer is much shorter
  - Took a sample of 100 examples from alpaca_eval/alpaca_farm_human_crossannotations
  - For each row, used GPT-3.5 to generate 5 re-wordings of the instruction and 5 new outputs for the original pair of (original instruction, original input)
  - Then used the custom-trained AlpacaFarm reward model to score each possible tuple of ($\text{instruction}_i$, input, $\text{output}_j$, $\text{output}_k$) for each row (for $i\in \{1,\cdots,6\}$, $j,k\in \{1,\cdots,7\}$)
  - Since the instructions are all paraphrases of each other, a non-cyclic RM should never have both $r(\text{instruction}_{i_1},o_j,o_k)>r(\text{instruction}_{i_2},o_j,o_k)$ and $r(\text{instruction}_{i_3},o_j,o_k)<r(\text{instruction}_{i_4},o_j,o_k)$. This would be akin to saying that both $o_j\succ o_k$ and $o_k\succ o_j$. We can say the same thing for longer sequences $o_1,o_2,\cdots,o_n$.
  - For each row, I set each $o_j$ as a node and created a directed edge $o_j\rightarrow o_k$ whenever $r(\text{instruction}_{i_1},o_j,o_k)<r(\text{instruction}_{i_2},o_j,o_k)$ for any $i_1,i_2\in\{1, \cdots ,6\}$. The weight of each edge was $Pr(\text{instruction}_{i_2},o_j,o_k) -Pr(\text{instruction}_{i_1},o_j,o_k)$.
  - 49 out of 100 rows had cycles of length 2 or longer
    ![](https://hackmd.io/_uploads/B1N1qutz6.png)
  - Average cycle edge weights:
    ![](https://hackmd.io/_uploads/SkF_hdKz6.png)
  - Caveat: the Python NetworkX library seems to not allow multiple edges between the same pair of nodes, so some cycles might have been overwritten (see the sketch below)
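
On the caveat above: NetworkX's `MultiDiGraph` does support parallel edges, so one option is to keep every comparison as its own keyed edge and only collapse to a simple `DiGraph` for cycle enumeration and condensation. A rough sketch with made-up comparisons and weights (the exact edge-direction convention in the experiment above may differ):

```python
import networkx as nx

# Made-up RM comparisons: an edge o_j -> o_k records a comparison in which
# o_k came out ahead under some instruction paraphrase; weight = score difference.
comparisons = [
    ("o1", "o2", 0.3),
    ("o2", "o3", 0.2),
    ("o3", "o1", 0.4),  # closes a cycle o1 -> o2 -> o3 -> o1
    ("o3", "o1", 0.1),  # parallel edge from another instruction paraphrase (kept!)
    ("o3", "o4", 0.5),
]

G = nx.MultiDiGraph()
for src, dst, w in comparisons:
    G.add_edge(src, dst, weight=w)   # parallel edges get distinct keys, nothing is overwritten

simple = nx.DiGraph(G)               # collapse parallel edges just for cycle enumeration
cycles = list(nx.simple_cycles(simple))
print("cycles:", cycles)             # e.g. a rotation of ['o1', 'o2', 'o3']

# "Collapse the cycle into a single node": condensation merges each strongly
# connected component, mirroring the collapsed preference graph from Week 10/31.
C = nx.condensation(simple)
for node, data in C.nodes(data=True):
    print(node, sorted(data["members"]))
print("collapsed edges:", list(C.edges()))
```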
### Friday 10/20/2023

- Follow-ups:
  - See if there's some qualitative pattern in the pairs that the RM is more accurate on vs. less accurate on
    - Can also look at embeddings?
  - Cycles: are there similar pairs of pairs?
    - Maybe cluster embeddings?
    - Try to find even one cycle
    - Maybe try replacing some words in the $x$ or $y$ with synonyms and see if cycles are induced by the RM?
- Updates
  - Tried some other RM models, and also tried training from scratch the same RM as AlpacaFarm
    - Most runs don't achieve very high eval accuracy, and easily overfit
    - https://api.wandb.ai/links/angie-chen55/gu7p2gxj
  - Custom-trained Llama-7B-based RM (trained on alpaca_farm/alpaca_human_preferences):
    ![](https://hackmd.io/_uploads/BJOmNNezp.png)
    ![](https://hackmd.io/_uploads/ryJNNVxGT.png)
  - Also tried evaluating the OpenAssistant Pythia 6.9B RM on the Anthropic HH-RLHF test set (it was trained on the corresponding training set):
    ![](https://hackmd.io/_uploads/Sy6-rNefp.png)
    ![](https://hackmd.io/_uploads/rJNzS4xGa.png)
  - Accuracies:

    | Dataset | AlpacaFarm RM | Custom-trained AlpacaFarm RM |
    | -------- | -------- | -------- |
    | alpaca_eval/alpaca_farm_human_crossannotations (includes ties, so random chance is 33%) | 0.33 | 0.43 |
    | alpaca_eval/alpaca_farm_human_annotations | 0.53 | |
    | (Training) alpaca_farm/alpaca_human_preference | 0.51 | |

    OpenAssistant Pythia-6.9B RM accuracies:

    | Dataset | Accuracy |
    | -------- | -------- |
    | Anthropic RLHF Test | 0.578 |

### Friday 10/13/2023

- Follow-ups
  - Try different dataset sizes? And multiple pairs of $y$'s per $x$
    - Is the model underfitting or overfitting, maybe?
      - Train versus test accuracy
  - Also see if we can try training a smaller reward model
  - Cycles/DAGs
    - Cluster by embeddings?
    - https://arxiv.org/abs/2306.12105
  - Look into other open-source reward models?
- RM calibration updates
  - The AlpacaFarm RM (trained on human preference data) seems to be better calibrated on its own training data:
    ![](https://hackmd.io/_uploads/HyxglfeMa.png)
    ![](https://hackmd.io/_uploads/HyZ-lzgzT.png)
    than on validation data:
    ![](https://hackmd.io/_uploads/rkwGefxGa.png)
    ![](https://hackmd.io/_uploads/B1GQgGxzT.png)
    ![](https://hackmd.io/_uploads/rytHxzgz6.png)
    ![](https://hackmd.io/_uploads/HkHIezxf6.png)
    but these datasets are formatted slightly differently (training has an additional "input" field, whereas this field is combined with the "instruction" field in the validation dataset)
  - Do standard calibration techniques help?
    - Temperature scaling
      ![](https://hackmd.io/_uploads/HJroXeUbT.png)
      ![](https://hackmd.io/_uploads/B1ajXlLW6.png)
      ![](https://hackmd.io/_uploads/ryei7bDW6.png)
    - Isotonic regression
      ![](https://hackmd.io/_uploads/rkcSQxL-6.png)
      ![](https://hackmd.io/_uploads/SkLqQx8ba.png)
      ![](https://hackmd.io/_uploads/r1EcX-vb6.png)

### Friday 10/06/2023

- Apply for the student researcher program?
- Other research questions to think about:
  - Can you train two RLHF models to have the same training error using two different RMs (one well-calibrated, and one not)?
  - What is the downstream impact of a miscalibrated RM?
  - Why does the RM become miscalibrated in the first place? Label noise?
- To do:
  - Temperature scaling, to see if we can fix the miscalibration
  - Think about why the RM might be miscalibrated in the first place. Cycles in preferences?
  - What if we trained a pairwise reward model instead that outputs probabilities?

### Friday 09/29/2023

- The AlpacaFarm authors responded, pointing to this dataset: https://huggingface.co/datasets/tatsu-lab/alpaca_eval/viewer/alpaca_farm_human_crossannotations/eval
  - Has four annotations per triple of $(x,y_1,y_2)$
- Reward model calibration experiments:
  ![](https://hackmd.io/_uploads/SJPG_9NeT.png)
  ECE = 0.09
  - Try confirming the above results with another RM
- Theoretical questions?
  - What causes reward model miscalibration in the first place?
    - Does there exist a reward model with small/zero training error but poor calibration?
  - How does reward model miscalibration affect the RL-tuned LM?
  - Also think about whether it actually makes sense to rank all $y$'s (for a given $x$) on a single linear scale
- Ask Richard about multi-objective learning
- Also read https://arxiv.org/abs/2307.01928

### Friday 09/23/2023

- Got the AlpacaFarm reward models set up (see https://huggingface.co/angie-chen55/af-rmh and https://huggingface.co/angie-chen55/af-sft10k)
- But having trouble with the datasets, because each outcome is only compared once to one other outcome
  - And no preference distributions are given
  - Can use the AlpacaFarm simulation to simulate noisy preferences, but it requires a GPT account
- To do:
  - Investigate whether reward models are *actually* calibrated

### Friday 09/08/2023

- **AI**: Look at belief propagation work
- AC: try the sparsification approach and look at the distribution of values
- AC: look at current reward model calibration
- Xinyi: also, think about what happens to the RL-tuned model if the reward model is miscalibrated?
### Friday 09/01/2023

- Discussed modeling user preferences instead as a DAG
  - Positional scoring: for each vote, a positional-scoring rule on $m$ alternatives assigns a score of $\alpha_j$ to the alternative ranked in $j$-th place by the vote, with $\alpha_1\geq\alpha_2\geq\cdots\geq\alpha_m\geq 0$ and $\alpha_1>\alpha_m$. The alternative selected is the one with the maximum total score, summing across all votes
    - Reward models induce positional-scoring rules!
    - No positional-scoring rule satisfies the Condorcet criterion on domains with three or more alternatives
    - Relies upon the assumption that preferences are linear, but this isn't always the case, for many reasons
  - The [AlpacaFarm paper](https://arxiv.org/pdf/2305.14387.pdf) observes that reward overoptimization increases when labels are noisier
    - The above is also hypothesized by [Scaling Laws for Reward Model Overoptimization](https://arxiv.org/pdf/2210.10760.pdf)
  - Model preferences as a DAG instead? Nodes only have edges between them if they can be compared, and we take just the leaves from each connected component and train on those
  - Construct the DAG from binary preferences: start out with a zeroed-out adjacency matrix, then adjust each value by +1 if the output represented by the row index is preferred in a given pairwise comparison, and -1 otherwise. Then sparsify the matrix by zeroing out low values
  - Also, how well calibrated are current reward models to actual user preferences?

### Friday 08/25/2023

Discussion:
- **AI: look at this first!** Could show, first of all, that the RLHF objective can't actually lead to directly modeling the preference distribution (i.e. 70% A and 30% B)
  - Consider the ratio of sequence likelihoods?
  - What about conflicting preferences in the non-binary case?
- How well does RLHF result in a model that models the preference distribution?
  - The way that we trade off modeling the reward distribution vs. modeling the pre-training distribution is not principled; we just tune $\beta$
  - Why can't we just minimize the KL divergence to the reward distribution?
  - How does this relate to calibration, though?
- Conformal prediction / confidence-interval calibration
  - Guarantees on risk levels?

### Friday 08/04/2023

- Re-implemented plots
- Talked to KC, Richard Pang
  - Collaboration? Set up weekly meetings?
- Feedback from KC:
  - The varied cross-entropy on the HH-RLHF dataset is actually *expected*, because not all of the "chosen" choices are necessarily high reward
  - But the entropy-rate numbers do seem to indicate that RLHF models are generating less diverse data
  - What is calibration in this case? What about calibrating $\beta$?
- Discussion:
  - This is a geometric mean, but we should be using $1/(1+\beta)$ as the coefficient of the KL penalty instead to have a proper geometric mean
  - Could have another parameter that controls how much you want to scale the exponential distribution (i.e. the reward/EBM?)
    - $(\beta + 1)\,\theta\,\log\left(\pi(x)/\pi_\textrm{Base}(x)^{1/(1+1/\beta)}\right)$
  - For the RLHF dataset, can try weighting the individual cross-entropies per example by quality
    - Quality is estimated based off of how many times that example is "chosen" vs. "rejected"
- Should read over the DPO paper and continue brainstorming!
