Forward pass of layer normalization:

$$
\begin{align}
\hat{x} &= \dfrac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}(x) + \epsilon}} \\
y &= \dfrac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}(x) + \epsilon}} * w + b \\
y &= \hat{x} * w + b
\end{align}
$$

The gradients with respect to the affine parameters $w$ and $b$, and with respect to $\hat{x}$, follow directly from the chain rule:

$$
\begin{split}
\nabla_w &= \dfrac{\partial L}{\partial w} \\
&= \dfrac{\partial L}{\partial y} \dfrac{\partial y}{\partial w} \\
&= \dfrac{\partial L}{\partial y} \dfrac{\partial}{\partial w}(\hat{x} * w + b) \\
&= \nabla_y \, \hat{x}
\end{split}
$$

$$
\begin{split}
\nabla_b &= \dfrac{\partial L}{\partial b} \\
&= \dfrac{\partial L}{\partial y} \dfrac{\partial y}{\partial b} \\
&= \dfrac{\partial L}{\partial y} \dfrac{\partial}{\partial b}(\hat{x} * w + b) \\
&= \nabla_y
\end{split}
$$

$$
\begin{split}
\nabla_{\hat{x}} &= \dfrac{\partial L}{\partial \hat{x}} \\
&= \dfrac{\partial L}{\partial y} \dfrac{\partial y}{\partial \hat{x}} \\
&= \dfrac{\partial L}{\partial y} \dfrac{\partial}{\partial \hat{x}}(\hat{x} * w + b) \\
&= \nabla_y \, w
\end{split}
$$

For $\nabla_x$, write $\mu = \mathrm{E}[x]$ and $\sigma = \sqrt{\mathrm{Var}(x) + \epsilon}$. With $\mathrm{N}$ elements in the normalized dimension, $\dfrac{\partial \mu}{\partial x} = \dfrac{1}{\mathrm{N}}$ and $\dfrac{\partial \sigma}{\partial x} = \dfrac{x - \mu}{\mathrm{N}\sigma}$, so the quotient rule gives:

$$
\begin{split}
\nabla_x &= \dfrac{\partial L}{\partial x} \\
&= \dfrac{\partial L}{\partial y} \dfrac{\partial y}{\partial \hat{x}} \dfrac{\partial \hat{x}}{\partial x} \\
&= \nabla_y \, w \, \dfrac{\partial}{\partial x}\Bigl(\dfrac{x - \mu}{\sigma}\Bigr) \\
&= \nabla_y \, w \, \dfrac{\Bigl(1 - \dfrac{\partial \mu}{\partial x}\Bigr)\sigma - (x - \mu)\dfrac{\partial \sigma}{\partial x}}{\sigma^2} \\
&= \nabla_y \, w \, \dfrac{\Bigl(1 - \dfrac{1}{\mathrm{N}}\Bigr)\sigma - (x - \mu)\dfrac{(x - \mu)}{\mathrm{N}\sigma}}{\sigma^2} \\
&= \nabla_y \, w \, \dfrac{1}{\sigma}\Bigl(1 - \dfrac{1}{\mathrm{N}} - \dfrac{1}{\mathrm{N}}\dfrac{(x - \mu)}{\sigma}\dfrac{(x - \mu)}{\sigma}\Bigr) \\
&= \dfrac{1}{\sigma}\Bigl(\nabla_y \, w - \dfrac{1}{\mathrm{N}}\nabla_y \, w - \dfrac{1}{\mathrm{N}}\Bigl(\dfrac{x - \mu}{\sigma}\Bigr)\nabla_y \, w \Bigl(\dfrac{x - \mu}{\sigma}\Bigr)\Bigr) \\
&= \dfrac{1}{\sigma}\Bigl(\nabla_y \, w - \dfrac{1}{\mathrm{N}}\nabla_y \, w - \dfrac{1}{\mathrm{N}}\hat{x}\,(\nabla_y \, w)\,\hat{x}\Bigr) \\
&= \dfrac{1}{\sigma}\Bigl(\nabla_y \, w - \dfrac{1}{\mathrm{N}}\hat{x}\,(\nabla_y \, w)\,\hat{x} - \dfrac{1}{\mathrm{N}}\nabla_y \, w\Bigr)
\end{split}
$$
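The derivation above is per-element; in the vectorized backward pass the $\frac{1}{\mathrm{N}}$ terms become means over the normalized dimension, which correspond to the $c_1$ and $c_2$ constants in the Triton tutorial's backward kernel [1]. As a sanity check, here is a minimal sketch (assuming PyTorch; the shapes, `eps` value, and helper names `layer_norm_fwd`/`layer_norm_bwd` are my own, not from the sources) that compares the derived formulas against autograd:

```python
import torch

def layer_norm_fwd(x, w, b, eps=1e-5):
    # Normalize each row of x over its last dimension, then apply the affine map.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    sigma = torch.sqrt(var + eps)
    x_hat = (x - mu) / sigma
    return x_hat * w + b, x_hat, sigma

def layer_norm_bwd(dy, x_hat, sigma, w):
    dw = (dy * x_hat).sum(dim=0)   # ∇_w = ∇_y x̂, accumulated over rows
    db = dy.sum(dim=0)             # ∇_b = ∇_y, accumulated over rows
    wdy = dy * w                   # ∇_x̂ = ∇_y w
    c1 = (x_hat * wdy).mean(dim=-1, keepdim=True)  # mean over N of x̂ (∇_y w)
    c2 = wdy.mean(dim=-1, keepdim=True)            # mean over N of ∇_y w
    dx = (wdy - x_hat * c1 - c2) / sigma           # (1/σ)(∇_y w − x̂ c1 − c2)
    return dx, dw, db

torch.manual_seed(0)
x = torch.randn(4, 8, dtype=torch.float64, requires_grad=True)
w = torch.randn(8, dtype=torch.float64, requires_grad=True)
b = torch.randn(8, dtype=torch.float64, requires_grad=True)

y, x_hat, sigma = layer_norm_fwd(x, w, b)
dy = torch.randn_like(y)
y.backward(dy)

dx, dw, db = layer_norm_bwd(dy, x_hat.detach(), sigma.detach(), w.detach())
print(torch.allclose(x.grad, dx),
      torch.allclose(w.grad, dw),
      torch.allclose(b.grad, db))  # expect: True True True
```

Running in float64 keeps the comparison well within `allclose` tolerance, so any discrepancy would point to an error in the derived formulas rather than rounding.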
References:

1. [Triton Tutorial - Layer Normalization][1]
2. [Layer Normalization, and how to compute its Jacobian for Backpropagation?][2]
3. [Layer Normalization, Deriving the Gradient for the Backward Pass][3]
4. [The Tensor Calculus You Need for Deep Learning][4]
5. [The Matrix Calculus You Need For Deep Learning][5]
6. [Einstein notation][6]
7. [Derivatives, Backpropagation, and Vectorization][7]

[1]: https://triton-lang.org/main/getting-started/tutorials/05-layer-norm.html#sphx-glr-getting-started-tutorials-05-layer-norm-py
[2]: https://neuralthreads.medium.com/layer-normalization-and-how-to-compute-its-jacobian-for-backpropagation-55a549d5936f
[3]: https://robotchinwag.com/posts/layer-normalization-deriving-the-gradient-for-the-backward-pass/
[4]: https://robotchinwag.com/posts/the-tensor-calculus-you-need-for-deep-learning/#example-layer-normalisation-using-index-notation
[5]: https://explained.ai/matrix-calculus/
[6]: https://en.wikipedia.org/wiki/Einstein_notation
[7]: https://cs231n.stanford.edu/handouts/derivatives.pdf