# [PreDeToR] NeurIPS 2024 #

## Reviewer ukA6 ##

We thank the reviewer for their feedback and are happy to hear that they found the work well-motivated and intuitive, and our evaluation extensive. We think the concerns raised can be addressed with minor revisions, and we respond in detail below.

Cons:

- Regarding real-world data, our main focus was to understand whether a transformer could automatically discover and exploit structure in a wide variety of bandit structures. For this reason, we focused on synthetic experiments covering a wide variety of bandit problem types that have been previously studied in the literature. We do agree that an experiment based on real-world data would add to the paper, and so we added the following experiment (a minimal data-generation sketch is also included below).

  *Experiment on the MovieLens dataset:* In this experiment, we first construct a rank-4 approximation of the dataset over 1000 users and 200 movies such that each user prefers one of movies $7, 13, 16$, or $20$. We use a set of $10$ movies that have been rated by all the users as our set of "actions". The users are the "tasks". The features of the movies form the action feature set $\mathcal{X}$ shared by all the tasks (users). We learn $\theta_{m, \star}$ from the actual data in the dataset for each task $m$. The goal is to identify the most preferred movie (which should be one of $7, 13, 16$, or $20$) in each task as quickly as possible. We then generate the rewards from the learned $\theta_{m, \star}$ for each task $m$ at round $t$ using a linear structure such that $r_{m,t}= \theta_{m, \star}^\top \mathbf{x}_{m,t} + \eta_{m,t}$. We first train on $M=1000$ users/tasks, and then test on $20$ held-out users/tasks. We show the cumulative regret with an increasing number of rounds in Table 1.

  | Algorithm | Regret (n=25) | Regret (n=50) | Regret (n=75) | Regret (n=100) |
  | ----------------- | ----------- | ------------ | ------------ | ------------ |
  | PreDeToR (Ours) | 5.43 | 9.9 | 13.93 | 17.75 |
  | PreDeToR-$\tau$ (Ours) | 5.4 | 9.1 | 12.9 | 15.7 |
  | DPT (Greedy) | 7.29 | 10.91 | 14.88 | 19.37 |
  | LinUCB | 5.08 | 7.39 | 9.42 | 11.98 |
  | Thomp | 9.56 | 15.30 | 22.5 | 29.92 |
  | AD | 9.52 | 15.88 | 21.6 | 29.12 |

  Table 1: Cumulative regret of the algorithms at different horizons in the linear MTRL setting on the MovieLens dataset, with $n = 100$, $M=1000$, and $A=10$.

  From Table 1 we observe that PreDeToR($-\tau$) has lower cumulative regret than DPT-greedy and AD. Note that PreDeToR($-\tau$) matches the regret of LinUCB, as this is a linear setting. Finally, PreDeToR-$\tau$ performs slightly better than PreDeToR as it conducts more exploration. The data is collected using Thompson sampling.

- We think there is a slight misunderstanding leading to the concern about a linearity assumption. Please note that we do not make any assumption about an underlying linear structure in the way that an algorithm like LinUCB does. Rather, our key finding is that PreDeToR can learn to discover such structure automatically when pre-trained to predict rewards on a family of tasks with common structure. Importantly, the common structure need not be linear, as demonstrated by our experiments in the non-linear setting. Our theoretical results are relevant to PreDeToR because PreDeToR is trained to minimize the MSE of reward prediction.
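To make the MovieLens data-generation protocol above concrete, here is a minimal sketch. It assumes a placeholder ratings matrix, an illustrative movie subset, user factors as a simple stand-in for the learned $\theta_{m,\star}$, and Gaussian noise; it is not our exact preprocessing pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder ratings matrix (users x movies); in the actual experiment this
# comes from MovieLens (1000 users, 200 movies).
R = rng.uniform(0.0, 5.0, size=(1000, 200))

# Rank-4 approximation via truncated SVD.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 4
R4 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]            # rank-4 ratings matrix

# Action features: rank-4 movie factors for 10 movies rated by every user
# (these indices are illustrative, not the ones used in the paper).
movie_idx = np.array([3, 7, 13, 16, 20, 41, 55, 88, 120, 177])
X = (np.diag(s[:k]) @ Vt[:k, movie_idx]).T            # shape (A=10, d=4)

# Per-task (per-user) parameters theta_{m,*}: here taken as the rank-4 user
# factors, one simple way to "learn" theta from the data.
Theta = U[:, :k]                                      # shape (M=1000, d=4)

def reward(m, a, noise_sd=0.1):
    """Noisy linear reward r_{m,t} = theta_{m,*}^T x_a + eta_{m,t}."""
    return Theta[m] @ X[a] + noise_sd * rng.standard_normal()

print(reward(m=0, a=2))   # one sampled reward for user 0 pulling arm 2
```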
- We would like to clarify one misunderstanding of the paper by the reviewer. Note that PreDeToR does not have access to the features of the actions. It only observes the action indices and rewards of the weak demonstrator across the source tasks and learns the reward of each action. Note that the actions are shared by all the tasks. Furthermore, PreDeToR minimizes the MSE loss during training, and at inference time it selects an action based on the estimated rewards it calculated from the data of the source tasks (transfer learning). We want to point out that we do not assume that the reward structure is linear for the theoretical analysis of our method in the transfer learning setting. **We do not use any linear relationship between the arm contexts and the hidden parameter $\theta_{\star}$.** In line 109 we clearly state that $r_{m, t}=f\left(\mathbf{x}\left(I_{m, t}\right), \boldsymbol{\theta}_{m, *}\right)+\eta_{m, t}$, where the function $f(\cdot)$ can be linear, non-linear, bilinear, latent, etc.

- We apologize for putting the related works in the appendix. We chose to do so since the most important related works were already discussed in the introduction, and this enabled us to include more empirical analysis in the main paper. We agree that an expanded related-works section is a good addition to the main paper. We will add it, together with the above MovieLens experiment, using the additional page available in the camera-ready version.

## Reviewer djXe ##

We thank the reviewer for their positive feedback on our work, particularly the extensive empirical evaluation, use of theory, and positive results. We answer the questions below:

1. We agree with the reviewer that PreDeToR cannot be readily extended to the RL setting, but we respectfully submit that bandit problems themselves are still an important class of ML problems with many real-world applications. Our aim was to develop a method that could improve over demonstrations without knowledge of the optimal action (as DPT requires), and be effective in a diverse variety of bandit problems with unknown structure. Finding a way to improve upon DPT with sub-optimal MDP demonstrations is a very interesting direction for future work.

2. The reviewer asked about SOTA in multi-task bandit learning. There are prior works that minimize regret under a linearity assumption. A key part of our contribution is that we do not require the linearity assumption but instead discover the underlying structure from the pre-training data. We illustrate this point with our non-linear setting experiments, where PreDeToR outperforms methods derived under the linear assumption. To the best of our knowledge, there are no baselines that have been derived without the linearity assumption.

3. We agree that there are previous works that address regret minimization in the linear MTRL setting. These are [Yang et al. (2020)](https://arxiv.org/pdf/2010.06531), [Cella et al. (2023)](https://arxiv.org/pdf/2202.10066), and [Hu et al. (2021)](https://proceedings.mlr.press/v139/hu21a/hu21a.pdf). However, these works use a strong linear structure assumption to devise an algorithm, whereas PreDeToR learns the reward structure without knowledge of the underlying explicit structure. We compare against the MLin-Greedy algorithm of [Yang et al. (2022)](https://arxiv.org/pdf/2202.10066) in the non-linear setting. Our results in Table 2 show that PreDeToR consistently outperforms it.
| Algorithm | Regret (n=50) | Regret (n=100) | Regret (n=150) | Regret (n=200) | Regret (n=250) |
| ----------------- | ----------- | ------------ | ------------ | ------------ | ------------- |
| PreDeToR (Ours) | 3.7 | 5.92 | 7.97 | 9.98 | 11.91 |
| PreDeToR-$\tau$ (Ours) | 3.5 | 5.1 | 6.9 | 9.7 | 10.94 |
| DPT (Greedy) | 6.86 | 11.49 | 15.69 | 19.81 | 23.94 |
| Thomp | 9.56 | 16.33 | 21.56 | 26.09 | 29.93 |
| LinUCB | 4.95 | 7.35 | 9.39 | 10.99 | 16.32 |
| MLin-Greedy | 3.95 | 6.5 | 8.4 | 10.1 | 14.2 |

Table 2: Cumulative regret of the algorithms at different horizons in the non-linear MTRL setting.

From Table 2 we observe that PreDeToR($-\tau$) has lower cumulative regret than DPT-greedy. Note that PreDeToR($-\tau$) has lower regret than LinUCB, which fails to perform well in this non-linear setting due to its algorithmic design. Finally, PreDeToR-$\tau$ performs slightly better than PreDeToR in both settings as it conducts more exploration. We also show that MLin-Greedy performs badly in this non-linear setting as it is designed for the linear feedback setting. This shows that previous linear MTRL algorithms do not generalize as well as PreDeToR to the non-linear and other multi-task structured bandit settings.

## Reviewer 5jdk ##

We thank the reviewer for their feedback. We answer below:

1. One concrete use case is making recommendations, where we want to make the best recommendation to the current user. This can be posed as a bandit problem where we make recommendations and the user provides evaluative feedback on how good they are. In real applications, we don't have just one user but might have had many past interactions with other users. While users have different evaluations of different items, it is reasonable to assume that there is some shared structure in their item evaluations. Our aim is to automatically discover and exploit this structure for more efficient bandit learning. In the following, we evaluate our approach against other methods for movie rating with a real-world dataset.

   *Experiment on the MovieLens dataset:* In this experiment, we first obtain a rank-4 approximation of the dataset over 1000 users and 200 movies such that all users prefer one of movies $7, 13, 16$, or $20$. The movies are the actions, and we choose 10 movies that have been rated by all the users. The features of the movies form the action set $\mathcal{X}$ shared by all the tasks (users). We learn $\theta_{m, \star}$ from the actual data in the dataset for each task $m$. The goal is to identify the most preferred movie (which should be one of $7, 13, 16$, or $20$) in each task as quickly as possible. We then generate the rewards from the learned $\theta_{m, \star}$ for each task $m$ at round $t$ using a linear structure such that $r_{m,t}= \theta_{m, \star}^\top \mathbf{x}_{m,t} + \eta_{m,t}$. We first train on $M=1000$ tasks, and then test on $20$ test environments. We show the cumulative regret with an increasing number of rounds in Table 1.
   | Algorithm | Regret (n=25) | Regret (n=50) | Regret (n=75) | Regret (n=100) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ |
   | PreDeToR (Ours) | 5.43 | 9.9 | 13.93 | 17.75 |
   | PreDeToR-$\tau$ (Ours) | 5.4 | 9.1 | 12.9 | 15.7 |
   | DPT (Greedy) | 7.29 | 10.91 | 14.88 | 19.37 |
   | LinUCB | 5.08 | 7.39 | 9.42 | 11.98 |
   | Thomp | 9.56 | 15.30 | 22.5 | 29.92 |
   | AD | 9.52 | 15.88 | 21.6 | 29.12 |

   Table 1: Cumulative regret of the algorithms at different horizons in the linear MTRL setting on the MovieLens dataset, with $n = 100$, $M=1000$, and $A=10$.

2. We thank the reviewer for bringing up the adversarial setting. We implemented the adversarial setting from Laskin et al. and also found that PreDeToR outperforms AD in this setting. Note that Algorithm Distillation (AD) cannot outperform the demonstrator, whereas if sufficient data is collected for the new optimal arm when the task switches, then PreDeToR can outperform AD.

   <!-- It's just the distribution of tasks where certain arms are optimal changes. I think PreDeToR would still work but it would have a higher prior on sub-optimal arms but would then switch to other arms with more pulls. -->

Questions:

1. This is an excellent suggestion, and we will add a discussion of practical scenarios in our revision. We will also add the movie recommendation experiment to include an experiment based on real-world data.

2. Algorithm Distillation is indeed more robust in the adversarial setting described by the reviewer. We implement the adversarial setting of Laskin et al. (2019) such that certain arms are optimal 95% of the time during training but only 5% of the time during evaluation. We take half of the arms and only allow them to be optimal 5% of the time during training, and then make them optimal 95% of the time during testing. We show the result in Table 2 below (a small data-generation sketch for this setup is given at the end of this response).

   <!-- If the weak demonstrator (Thompson sampling) algorithm collects sufficient samples of the optimal action between two changepoints then PreDeToR will outperform AD. Such a changepoint adversarial setting, where the rewards of all actions change at changepoints, has been previously studied in [Besson et al. (2022)](https://www.jmlr.org/papers/volume23/20-1384/20-1384.pdf), [Maillard et al. (2019)](https://proceedings.mlr.press/v98/maillard19a/maillard19a.pdf), [Mukherjee et al. (2020)](https://arxiv.org/pdf/1905.13159). If the rewards change too fast, and the changepoint assumption is violated, then it is not clear if PreDeToR will outperform AD. We leave this line of work to future works. -->

   <!-- Pred [3.13, 3.9, 4.64, 5.44] Pred-tau [3.78, 4.89, 5.8, 6.76] DPT [7.46, 10.19, 12.83, 15.47] AD [6.12, 7.59, 8.64, 9.59] Thompson [8.07, 10.74, 11.95, 12.68] LinUCB [2.42, 2.84, 3.12, 3.3] -->

   | Algorithm | Regret (n=13) | Regret (n=25) | Regret (n=38) | Regret (n=50) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ |
   | PreDeToR (Ours) | 3.13 | 3.9 | 4.64 | 5.44 |
   | PreDeToR-$\tau$ (Ours) | 3.78 | 4.89 | 5.8 | 6.76 |
   | DPT (Greedy) | 7.46 | 10.19 | 12.83 | 15.47 |
   | LinUCB | 2.42 | 2.84 | 3.12 | 3.3 |
   | Thomp | 8.07 | 10.74 | 11.95 | 12.68 |
   | AD | 6.12 | 7.59 | 8.64 | 9.59 |

   Table 2: Cumulative regret of the algorithms at different horizons in the linear MTRL setting under adversarial rewards.

   From Table 2 we observe that PreDeToR has lower cumulative regret than DPT-greedy. Note that PreDeToR has higher regret than LinUCB, as this is a linear feedback setting. Finally, note that PreDeToR outperforms AD because, even though different arms are optimal during training and testing, the features of the arms do not change. PreDeToR exploits the latent representation of these arms and the reward correlation to predict the mean rewards.
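To make the train/test flip described in Question 2 concrete, here is a minimal sketch of how such tasks could be sampled. The arm count, feature dimension, and unit-norm construction (which guarantees the designated arm is optimal) are illustrative assumptions, not necessarily the exact construction used in our experiments; the 5%/95% flip follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)
A, d = 10, 4                 # illustrative arm count and feature dimension
half = A // 2                # the half of the arms whose optimality probability is flipped

def sample_task(p_first_half):
    """Sample one linear-bandit task in which, with probability p_first_half,
    the optimal arm lies in the first half of the arms."""
    X = rng.standard_normal((A, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm arm features
    if rng.random() < p_first_half:
        opt = rng.integers(0, half)
    else:
        opt = rng.integers(half, A)
    theta = X[opt]            # with unit-norm arms, arm `opt` maximizes x^T theta
    return X, theta, opt

# Pre-training tasks: the flipped half of the arms is optimal only 5% of the time ...
train_tasks = [sample_task(p_first_half=0.05) for _ in range(1000)]
# ... while at test time the same arms are optimal 95% of the time.
test_tasks = [sample_task(p_first_half=0.95) for _ in range(20)]
```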
## Reviewer uY6b ##

We thank the reviewer for their positive feedback.

Weaknesses:

1. We show experiments with an increasing number of attention heads in Appendix A.9.

Questions:

1. We thank the reviewer for bringing this up. We compare against LinUCB and Thompson sampling for two reasons. First, we show that even when PreDeToR is trained on data collected by Thomp, which is a weak demonstrator and has no access to the feature information, our method is able to outperform the weak demonstrator. Second, we show LinUCB's performance to demonstrate that PreDeToR exploits the latent linear structure (for the linear bandit experiment) and matches its performance. In Figure 5 we try to understand the behavior of PreDeToR and its ability to exploit the reward correlation across tasks under a linear multivariate Gaussian model, thereby comparing against BayesPred. We also compare against DPT and AD, which leverage information across the tasks and are outperformed by PreDeToR.

2. Yes, the transformers for the linear and non-linear bandits are trained separately.

3. That's a great point. In the multi-task setting, we generally assume $M_{\text {pre }} \gg|\mathcal{A}|$ (cite works). When the number of tasks is small, multi-task learning algorithms fail to exploit the latent structure of the problem. This is also evident from our experiments in Appendix A.10. The experiments show that as the number of tasks $M$ increases, PreDeToR has lower cumulative regret.

---

# ICML 2024 #

## Reviewer Qjdp

We thank the reviewer for their positive feedback on our work. We provide answers to the main points below.

Weaknesses:

1. We will take into consideration your suggestion of not using a different color style for algorithm names.

2. Yes, we have tried an explorative version of PreDeToR as described by the reviewer. We show the result in Table 1, where PreDeToR (Explore) is the variant the reviewer asked about. Note that the greedy algorithm PreDeToR (Ours) outperforms PreDeToR (Explore), as it has a good estimate of the reward of the arms given a task and doesn't need to explore. However, there could be tasks where such exploration might yield better results, and we leave this to future work. We will discuss this in the camera-ready version. We also highlight the main difference between DPT and our approach. DPT requires the optimal action for each task $m$. Then, during training of the transformer $\mathbf{T}_{\mathbf{w}}$, DPT minimizes the cross-entropy loss $\mathcal{L}_t^{\mathrm{DPT}}=\operatorname{cross-entropy}\left(\widehat{a}_{m, t, *}, a_{m, *}\right)$, where $\widehat{a}_{m, t, *}$ is the optimal action predicted at round $t$ by $\mathbf{T}_{\mathbf{w}}\left(\cdot \mid \mathcal{H}_m^t\right)$ (see eq. (1)). In contrast, PreDeToR does not require the optimal action for each task and just uses the reward of the next action to minimize the loss (see eq. (3)); a schematic comparison of the two objectives is sketched after Table 1.

   | Algorithm | Regret (H=50) | Regret (H=100) | Regret (H=150) | Regret (H=200) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ |
   | PreDeToR (Ours) | 5.43 | 9.9 | 13.93 | 17.75 |
   | PreDeToR (Explore) | 7.25 | 11.83 | 15.63 | 19.13 |
   | DPT (Greedy) | 7.29 | 12.54 | 17.48 | 22.37 |
   | Thomp | 9.56 | 16.30 | 21.59 | 25.92 |
   | LinUCB | 5.08 | 7.59 | 9.54 | 11.9 |

   Table 1: Cumulative regret of the algorithms at different horizons in the linear setting, denoted by $H = \{50, 100, 150, 200\}$. For each horizon we first train on $M=100000$ tasks, and then test in $200$ test environments.
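As a schematic illustration of the difference between the two objectives discussed above: the stand-in linear heads, tensor shapes, and random data below are assumptions for exposition only, not our actual implementation.

```python
import torch
import torch.nn.functional as F

# Schematic setup (all shapes and values are illustrative).
B, A, D = 32, 10, 64            # batch of histories, number of arms, embedding size
history = torch.randn(B, D)     # stand-in for the transformer's encoding of H_m^t

# Stand-in "heads" on top of the shared transformer encoding.
dpt_head = torch.nn.Linear(D, A)        # DPT: logits over arms
predetor_head = torch.nn.Linear(D, A)   # PreDeToR: one predicted reward per arm

# DPT objective: cross-entropy against the optimal action a_{m,*} of each task
# (requires optimal-action labels at pre-training time).
optimal_action = torch.randint(0, A, (B,))
loss_dpt = F.cross_entropy(dpt_head(history), optimal_action)

# PreDeToR objective: MSE between the predicted reward of the action actually
# taken at round t and the observed reward -- no optimal-action labels needed.
next_action = torch.randint(0, A, (B,))
next_reward = torch.randn(B)
pred = predetor_head(history).gather(1, next_action[:, None]).squeeze(1)
loss_predetor = F.mse_loss(pred, next_reward)
```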
3. Thanks for the suggestion. We will clarify which baselines are strong learners and which are weak learners in the camera-ready version. We will state that LinUCB uses oracle knowledge of the structure of the problem and therefore performs optimally in the linear bandit setting.

4. We clarify the new-action setting here (a construction sketch is given after this list).
   - First, fix the total number of actions across all the environments (tasks); say this is $A=10$. Fix the number of invariant arms shared across the tasks; say this is $|\mathcal{A}^{inv}| = 8$. These arms are shared by all tasks.
   - Each task $m\in [M]$ then has two new arms drawn randomly from $\mathcal{N}(0, \mathbf{I}_d)$. Note that these arms are different for each task. This also implies that each of the test environments will have $2$ completely new arms unseen during training time ($\mathcal{D}_{\text {test }} \neq \mathcal{D}_{\text {pre }}$).
   - Then create a head on the transformer model that has 10 outputs, one reward for each action.
   - The learning proceeds as before: we use a weak learner to collect data without observing the features, and the goal is that PreDeToR learns to exploit the underlying latent structure and reward correlation between the different tasks even when $\mathcal{D}_{\text {test }} \neq \mathcal{D}_{\text {pre }}$.

   The main goal of this section is to demonstrate that the transformer is robust to unseen data at test time and subsequently learns to adapt to it. We see that as long as the number of invariant arms is substantially higher than the number of new arms, the transformer is still able to learn a near-optimal policy. We also do not permute the actions during training time.

5. Reservation: As discussed in the Conclusion/Future Work, we indeed plan to extend this line of research to MDPs. There are several open questions. As you correctly stated, we would then need a value-function estimate for each state. Also, exploration in MDPs needs to be factored in, and it is not clear whether being greedy with the value function is sufficient to obtain a near-optimal policy in MDPs. We leave this to future works.
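As referenced in point 4, here is a minimal sketch of the action-set construction. The feature dimension and task counts are illustrative assumptions; the arm counts ($A=10$ total, $8$ invariant) follow the description above.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4                                    # illustrative feature dimension
A_total, A_inv = 10, 8                   # total arms per task, shared (invariant) arms

# Invariant arms: drawn once and shared by every task.
invariant_arms = rng.standard_normal((A_inv, d))

def make_task_actions():
    """Each task gets the 8 shared arms plus 2 task-specific arms ~ N(0, I_d),
    so every test environment contains arms never seen during pre-training."""
    new_arms = rng.standard_normal((A_total - A_inv, d))
    return np.vstack([invariant_arms, new_arms])      # shape (A_total, d)

pretrain_action_sets = [make_task_actions() for _ in range(1000)]   # D_pre
test_action_sets = [make_task_actions() for _ in range(200)]        # D_test != D_pre
```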
## Reviewer puCK

We thank the reviewer for their helpful comments. We address the concerns below:

Weaknesses:

1. We focused on the short-horizon setting for two reasons. First (motivation), this setting shows up in a lot of real-world applications, such as recommender systems with LLMs ([Liu et al. (2023)](https://arxiv.org/pdf/2309.17382.pdf)) and latent multi-armed bandits (MABs) (see [Kwon et al. (2022)](https://proceedings.neurips.cc/paper_files/paper/2022/file/95a6fcdc0c8458baa9c6e14736a644f8-Paper-Conference.pdf), [Kwon et al. (2021)](https://proceedings.neurips.cc/paper_files/paper/2021/file/cd755a6c6b699f3262bcc2aa46ab507e-Paper.pdf)). In all of these settings, the user interacts with the task only for a short horizon, as the incentive/utility for a long interaction is very low. Second, as shown by [Kwon et al. (2022)](https://proceedings.neurips.cc/paper_files/paper/2022/file/95a6fcdc0c8458baa9c6e14736a644f8-Paper-Conference.pdf) in the latent MAB setting, short-horizon learning is a more challenging setting, as less information is available for each task. Therefore, for multi-task learning, any good algorithm has to leverage the information across tasks to perform well on any individual new task shown at test time. Similar settings are studied from a theoretical perspective in multi-task representation learning in [Yang et al. (2021)](https://arxiv.org/pdf/2010.06531.pdf) and [Cella et al. (2022)](https://arxiv.org/pdf/2205.15100.pdf). We show the scalability of PreDeToR to larger horizons in the linear setting in Appendix A.6. We further show the performance of PreDeToR in the non-linear setting in Question 1.

2. Predicting the reward (in contrast to providing the optimal arm for each task, as in the DPT paper [Lee et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/file/8644b61a9bc87bf7844750a015feb600-Paper-Conference.pdf)) is a natural choice. Note that all the tasks share the same action set $\mathcal{X}$. This naturally imparts a shared correlation between the tasks, even though the reward distribution for each task changes due to a different $\theta_{m,*}$. However, by observing the reward pattern across many such tasks, the transformer can learn the shared reward correlation and perform well on a new task (with the same arm set but a different $\theta_{m,*}$) shown at inference time. This happens in two steps: first, after observing a few in-context examples, the transformer is able to identify the task (or a closely related task); second, it uses the reward prediction model to predict the expected reward of the optimal arm of the identified task and selects it.

Questions:

1. We show the performance of PreDeToR in the non-linear setting in Table 2. In this setting we set $f\left(\mathbf{x}, \boldsymbol{\theta}_*\right)=1 /\left(1+0.5 \cdot \exp \left(2 \cdot \exp \left(-\mathbf{x}^{\top} \boldsymbol{\theta}_*\right)\right)\right)$, and this is discussed in detail in Appendix A.2 (a small sketch of this reward function is given at the end of this response). Note that the DPT paper [Lee et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/file/8644b61a9bc87bf7844750a015feb600-Paper-Conference.pdf) considers a horizon of $H=200$ for the linear bandit setting. We show an even larger horizon, up to $H=250$, to illustrate that the main goal is to learn the correlation across the tasks, which our approach is able to do. From the results, we observe that PreDeToR has lower cumulative regret than LinUCB, as this is a non-linear setting.

   | Algorithm | Regret (H=50) | Regret (H=100) | Regret (H=150) | Regret (H=200) | Regret (H=250) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ | ------------- |
   | PreDeToR (Ours) | 3.7 | 5.92 | 7.97 | 9.98 | 11.91 |
   | DPT (Greedy) | 6.86 | 11.49 | 15.69 | 19.81 | 23.94 |
   | Thomp | 9.56 | 16.33 | 21.56 | 26.09 | 29.93 |
   | LinUCB | 4.95 | 7.35 | 9.39 | 10.99 | 12.32 |

   Table 2: Cumulative regret of the algorithms at different horizons in the non-linear setting, denoted by $H = \{50, 100, 150, 200, 250\}$. For each horizon we first train on $M=150000$ tasks, and then test in $200$ test environments.

2. Our model complexity and training convergence are similar to those of the DPT algorithm. Note that DPT also has a linear head on top of the transformer that predicts the optimal action (instead of the reward). We will state this in our camera-ready version. We discuss the motivation for reward prediction in Weakness 2.

3. In our Figure 1, we provide a test-arm distribution and prediction result. In the linear setting experiment, $31.2\%$ of test environments have a difference in expected rewards between the optimal arm and the second-best arm of less than $0.05$, and $52\%$ of test environments have a difference of less than $1.0$. In all these cases, we see from Figure 1 that PreDeToR is able to predict the mean reward of the optimal arm accurately, which leads to a near-optimal policy.
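As referenced in Question 1, here is a small sketch of the non-linear reward map used to generate rewards in this setting; the dimensions and noise scale are illustrative assumptions.

```python
import numpy as np

def f_nonlinear(x, theta):
    """Non-linear mean reward from Appendix A.2:
    f(x, theta) = 1 / (1 + 0.5 * exp(2 * exp(-x^T theta)))."""
    return 1.0 / (1.0 + 0.5 * np.exp(2.0 * np.exp(-x @ theta)))

rng = np.random.default_rng(3)
d, A = 4, 10                         # illustrative dimensions
X = rng.standard_normal((A, d))      # shared action features
theta_m = rng.standard_normal(d)     # hidden parameter of one task m

# Noisy reward of arm a at round t: r_{m,t} = f(x_a, theta_{m,*}) + eta_{m,t}.
a = 2
r = f_nonlinear(X[a], theta_m) + 0.1 * rng.standard_normal()
print(r)
```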
## Reviewer Pc2o

Weaknesses:

1. [Ask this]
2. See Question 3.

Questions:

1. We show the performance of PreDeToR in the long-horizon linear setting in Appendix A.6. We show further results for the non-linear setting in Table 2, extending to horizon $H=250$. This is more than the horizon studied in the DPT paper. Following their work, we focus on how to leverage data across tasks to perform better at inference time on a new test task.

2. [Ask this] [QX: I am not quite sure why this reviewer finds it confusing. Some thoughts: We can potentially provide intuition on our algorithmic design; we can also consider moving Section 6 before Section 4?]

3. We present a ~~theoretical~~ understanding of how PreDeToR behaves in Section 6. In this work, we mainly focus on the empirical study of pretraining for decision transformers. In Section 6, we aim to understand the behavior of PreDeToR through a linear multivariate Gaussian model. We hypothesize that PreDeToR is exploiting the reward correlation (covariance) matrix from the training data and acting greedily on it. To test this hypothesis, we consider the greedy BayesPred algorithm (sketched below), which first estimates the correlation matrix from the pretraining data, then uses the LMMSE estimator in Lemma 1 to calculate the posterior mean vector, and then acts greedily on it by selecting the most rewarding action. The hypothesis that BayesPred is a lower bound to PreDeToR is supported by Figure 3. Unfortunately, so far we are not able to provide a regret analysis as in the DPT paper, since their proof technique crucially relies on observing the optimal action during training. We leave the theoretical regret analysis of PreDeToR to future works.
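For concreteness, here is a minimal sketch of one decision step of the greedy BayesPred baseline described in answer 3. The prior mean and covariance of the arm means are assumed to be already estimated from the pre-training data, and the noise variance is an illustrative placeholder; the exact estimator is the one given in Lemma 1 of the paper.

```python
import numpy as np

def bayespred_action(mu, Sigma, pulled, rewards, noise_var=0.01):
    """One greedy BayesPred step (sketch).

    mu, Sigma : prior mean vector and covariance of the A arm means, estimated
                from the pre-training data (passed in here as given inputs).
    pulled    : arm indices pulled so far in the current task.
    rewards   : the corresponding observed rewards.
    Returns the greedy action under the LMMSE posterior mean."""
    A = len(mu)
    t = len(pulled)
    if t == 0:
        return int(np.argmax(mu))
    H = np.zeros((t, A))
    H[np.arange(t), pulled] = 1.0                      # selection matrix of pulled arms
    S = H @ Sigma @ H.T + noise_var * np.eye(t)        # covariance of the observations
    gain = Sigma @ H.T @ np.linalg.solve(S, np.asarray(rewards) - H @ mu)
    posterior_mean = mu + gain                         # LMMSE estimate of all arm means
    return int(np.argmax(posterior_mean))              # act greedily on it
```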
## Reviewer senv

We thank the reviewer for their feedback. We address the concerns below:

Cons:

1. We restate the main difference of PreDeToR from AD and DPT. In AD, the learner aims to predict the next action of the demonstrator and is trained to minimize the cross-entropy loss $\mathcal{L}_t^{\mathrm{AD}}=\operatorname{cross-entropy}\left(\widehat{I}_{m, t}, I_{m, t}\right)$, where $\widehat{I}_{m, t}$ is the next action predicted by the model and $I_{m, t}$ is the true action taken by the demonstrator. Therefore, AD incurs a cumulative regret similar to that of the demonstrator. This observation is illustrated in Table 3. As observed by the reviewer, the performance of DPT (original, with optimal-action information) is better than PreDeToR, as it has access to the optimal actions in each task. We also show these results in Table 3.

   | Algorithm | Regret (H=50) | Regret (H=100) | Regret (H=150) | Regret (H=200) |
   | ----------------- | ----------- | ------------ | ------------ | ------------ |
   | PreDeToR (Ours) | 5.43 | 9.9 | 13.93 | 17.75 |
   | DPT (Original) | 5.29 | 7.91 | 9.88 | 12.37 |
   | LinUCB | 5.08 | 7.59 | 9.54 | 11.9 |
   | Thomp | 9.56 | 16.30 | 21.59 | 25.92 |
   | AD | 9.42 | 15.88 | 20.6 | 25.12 |

   Table 3: Cumulative regret of the algorithms at different horizons in the linear setting, denoted by $H = \{50, 100, 150, 200\}$. For each horizon we first train on $M=100000$ tasks, and then test in $200$ test environments.

2. We mention in our future work that we intend to extend this line of research to MDPs. There are several open questions. In the MDP setting we need a value-function estimate for each state. Also, exploration in MDPs needs to be factored in, and it is not clear whether being greedy with the value function is sufficient to obtain a near-optimal policy in MDPs. We leave this to future works.

3. Thank you for your comment; excellent suggestion. See Con 1. We report the performance of DPT (original) and AD in Table 3. To determine the optimal action for each task for DPT (greedy), we follow the steps in the DPT paper. We will also include the comparison with AD and DPT (original) in the other experiments as well.

Questions:

1. See Con 3.
2. See Con 2.
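For reference, the regret numbers in the tables throughout this rebuttal report cumulative regret at different horizons. A minimal sketch of how such a number could be computed for one test task under the standard definition, with a placeholder `select_action` policy interface and an assumed noise scale, is:

```python
import numpy as np

def cumulative_regret(select_action, mean_reward, horizon, rng):
    """Cumulative regret over one test task: the sum over rounds of the gap
    between the best arm's expected reward and the chosen arm's expected reward.

    select_action(history) -> arm index   (placeholder policy interface)
    mean_reward            : length-A array of expected rewards for this task."""
    best = np.max(mean_reward)
    history, regret = [], 0.0
    for _ in range(horizon):
        a = select_action(history)
        regret += best - mean_reward[a]
        r = mean_reward[a] + 0.1 * rng.standard_normal()   # noisy bandit feedback
        history.append((a, r))
    return regret

# A table entry then aggregates cumulative_regret over the set of test tasks.
```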
