Linear vs Cosine Schedule

2024-05-11

# Intro

- Linear schedules for beta noise the image too quickly
- We use a new schedule, a modified form of the cosine schedule developed at OpenAI (Nichol & Dhariwal, 2021), instead of the linear schedule
- We then compare the results

# Methods

## Current linear schedule

Beta is directly and linearly proportional to t

```python
MAX_TIMESTEPS = 100  # T; 100 for all graphs in this note

def beta(t):
    return t/MAX_TIMESTEPS
```

We then define alpha bar as

$$
\bar\alpha_{t} = \prod_{s=1}^{t} \alpha_{s}
$$

where

$$
\alpha_{t} = 1 - \beta_{t}
$$

In code, this is represented as

```python
def alpha_bar(t):
    # Product of alpha_s = 1 - beta(s) for s = 1..t (inclusive, to match the formula)
    product = 1
    for x in range(1, t + 1):
        alpha = 1 - beta(x)
        product *= alpha
    return product
```

The linear schedule is based on a [blog post](https://erdem.pl/2023/11/step-by-step-visual-introduction-to-diffusion-models) by Kemal Erdem (2023).

## Cosine schedule

Alpha bar is defined as

$$
\bar\alpha_{t} = \frac{f(t)}{f(0)}
$$

where

$$
f(t) = \cos^{2}\left(\frac{t/T+s}{1+s} \cdot \frac{\pi}{2}\right)
$$

and s = 0.008 (the value of s is taken from the OpenAI [paper](https://arxiv.org/pdf/2102.09672)).

We then use this alpha bar value directly in our noising process, without needing a beta value.

This schedule is based on Nichol & Dhariwal's 2021 [paper](https://arxiv.org/pdf/2102.09672) on improving diffusion models. However, the paper and its code both go on to define beta again, and it is unclear to me how beta plays a part in the process when alpha bar is already defined.

Interestingly, in the cosine schedule alpha bar is defined explicitly, whereas in the linear schedule beta is defined and alpha bar is calculated from beta.

In code, alpha bar is:

```python
from math import cos, pi

def alpha_bar(t):
    f_t = cos(((t/MAX_TIMESTEPS + 0.008)/(1 + 0.008)) * (pi/2)) ** 2
    f_0 = cos((0.008/(1 + 0.008)) * (pi/2)) ** 2
    return f_t/f_0
```

# Data

T (max timesteps) is 100 for all graphs.
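As a rough way to reproduce the comparison in the graphs that follow, the two alpha bar definitions from Methods can be evaluated side by side with T = 100. This is only a sketch: the names `linear_alpha_bar`, `cosine_alpha_bar`, and `noise_pixel` are hypothetical (chosen so the two schedules do not share one function name), and `noise_pixel` assumes the standard closed-form forward step $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, which may differ from the exact noising code used for the plots.

```python
import math
import random

T = 100    # max timesteps, as stated in this note
S = 0.008  # cosine offset s from the note


def linear_alpha_bar(t):
    """Product of (1 - beta_s) for s = 1..t, with beta_s = s / T."""
    product = 1.0
    for s in range(1, t + 1):
        product *= 1.0 - s / T
    return product


def cosine_alpha_bar(t):
    """f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(u):
        return math.cos(((u / T + S) / (1 + S)) * (math.pi / 2)) ** 2
    return f(t) / f(0)


def noise_pixel(x0, a_bar, rng):
    """Assumed forward step: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    eps = rng.gauss(0.0, 1.0)  # standard normal noise sample
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps


if __name__ == "__main__":
    rng = random.Random(0)
    # Same t/T fractions as the example images: 0.1, 0.3, 0.75
    for t in (10, 30, 75):
        lin, cos_ = linear_alpha_bar(t), cosine_alpha_bar(t)
        print(t, lin, cos_, noise_pixel(1.0, cos_, rng))
```

Under these definitions the cosine alpha bar stays above the linear one at every intermediate timestep, so more of the original signal survives for longer.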
## Alpha bar over timesteps

![image](https://hackmd.io/_uploads/SkD5GTpGC.png)

## Mean of noised images over timesteps

![image](https://hackmd.io/_uploads/By0R-66zC.png)

## Variance of noised images over timesteps

![image](https://hackmd.io/_uploads/ryWWm6TfR.png)

## Example noised images at different timesteps

### t/T = 0.1

Linear:

![image](https://hackmd.io/_uploads/H1QJEpafR.png)

Cosine:

![image](https://hackmd.io/_uploads/rJTim6TfR.png)

### t/T = 0.3

Linear:

![image](https://hackmd.io/_uploads/S1ZzN66MC.png)

Cosine:

![image](https://hackmd.io/_uploads/SksdN6TfA.png)

### t/T = 0.75

Linear:

![image](https://hackmd.io/_uploads/HkWB4TaGR.png)

Cosine:

![image](https://hackmd.io/_uploads/HynKEaafA.png)

## Gradient of images

![image](https://hackmd.io/_uploads/HJwG_66GR.png)

# Conclusion

- The cosine schedule is clearly better than the linear schedule in these comparisons
- Variance decreases almost linearly
- The gradient of images is probably the most striking evidence
- I am not sure whether I implemented the paper exactly right, but it works well enough in its current state
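On the open question of how beta relates to the cosine alpha bar: Nichol & Dhariwal recover per-step betas from the cumulative schedule as $\beta_t = 1 - \bar\alpha_t / \bar\alpha_{t-1}$, clipped to be no larger than 0.999. The sketch below applies that rule to the cosine alpha bar from this note and checks that the cumulative product of $1 - \beta_t$ reproduces alpha bar. The helper names `betas_from_alpha_bar` and `cumulative_alpha_bar` are hypothetical, and this is not the paper's reference code, only one possible way to sanity-check an implementation.

```python
import math

T = 100           # max timesteps, as in this note
S = 0.008         # cosine offset s
MAX_BETA = 0.999  # clipping value described in the paper


def cosine_alpha_bar(t):
    """f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(u):
        return math.cos(((u / T + S) / (1 + S)) * (math.pi / 2)) ** 2
    return f(t) / f(0)


def betas_from_alpha_bar(alpha_bar_fn, max_beta=MAX_BETA):
    """Return [beta_1, ..., beta_T], beta_t = 1 - a_bar(t) / a_bar(t-1), clipped."""
    betas = []
    for t in range(1, T + 1):
        beta_t = 1.0 - alpha_bar_fn(t) / alpha_bar_fn(t - 1)
        betas.append(min(beta_t, max_beta))
    return betas


def cumulative_alpha_bar(betas):
    """Running product of (1 - beta_t), i.e. alpha bar rebuilt from the betas."""
    out = []
    product = 1.0
    for b in betas:
        product *= 1.0 - b
        out.append(product)
    return out


if __name__ == "__main__":
    betas = betas_from_alpha_bar(cosine_alpha_bar)
    rebuilt = cumulative_alpha_bar(betas)
    # Compare the rebuilt alpha bar with the direct definition at a few timesteps
    for t in (10, 30, 75):
        print(t, betas[t - 1], rebuilt[t - 1], cosine_alpha_bar(t))
```

With T = 100, only the final beta hits the clip, so the rebuilt values agree with the direct definition up to floating-point error for every timestep before T.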