## Summary of experiments across papers

Logistic Regression on MNIST

### Must-have experiments
* 1-Layer Perceptron on MNIST
* ResNet (18 or 34), DenseNet121 on Cifar10, Tiny-ImageNet
* LSTM on Penn Treebank
* Candidates
  * WGAN-GP on ?
  * ? on IWSLT'14

### AdaBelief (2020)
* VGG11, ResNet34, DenseNet121 on Cifar10
* 1-, 2-, 3-layer LSTM on Penn Treebank
* WGAN and WGAN-GP with vanilla CNN generator on Cifar10
* SN-GAN with ResNet generator on Cifar10
* Transformer on IWSLT'14, PASCAL VOC object detection
* ImageNet (not a win)

### AdaBound (2019 ICLR)
* 1-Layer Perceptron on MNIST
* ResNet34 on Cifar10
* DenseNet121 on Cifar10
* 1-, 2-, 3-layer LSTM on Penn Treebank

### NosAdam (2020)
* 1-Layer Perceptron on MNIST
* Wide ResNet28 on Cifar10

### PAdam (2019)
* VGG16, ResNet18, Wide ResNet16 on Cifar10, Cifar100

### AdaShift (2018 ICLR)
* Multilayer Perceptron on MNIST
* ResNet18, DenseNet100 on Cifar10, Tiny-ImageNet
* WGAN-GP on ? (fixed generator)
* LSTM on neural machine translation

### AdamW (2017 ICLR)
* 2x(96,64)d ResNet26 on Cifar10, ImageNet32x32

### SWATS (2017)
* ResNet32, DenseNet, PyramidNet, SENet on Cifar10, Cifar100
* LSTM, QRNN on Penn Treebank, WT-2
* ResNet18 on Tiny-ImageNet

### RAdam (2019 ICLR)
* ? on One Billion Word
* ? on Cifar10, ImageNet
* ? on IWSLT'14 DE-EN/EN-DE, WMT'16 EN-DE

### Neural Optimizer Search (2017)
* Wide-ResNet on Cifar10
* Google Neural Machine Translation (GNMT) on WMT'14 EN-DE
* LSTM on Penn Treebank

### AMSGrad (2018 ICLR)
* FFN on MNIST
* CIFARNET on Cifar10

### Low reference value
* Google self-learning optimizer (2020)
* LookAhead (2019)
* Ranger (2019)

---

# 6/23 routine report

---

## Miscellaneous

[Signal messaging app](https://buzzorange.com/techorange/2021/01/28/qa-signal-ceo-moxie-marlinspike-on-the-future-of-privacy/)
[FLoC, Google's latest web-tracking technology](https://www.bnext.com.tw/article/61662/google-cookie-floc)
[Web beacon](https://zh.wikipedia.org/wiki/%E7%BD%91%E7%BB%9C%E4%BF%A1%E6%A0%87)

---

Activation functions whose input and output distributions are the same

---

[Roundup of newer optimizers (important)](https://zhuanlan.zhihu.com/p/208178763)
[Warm-up explained, part 1](https://www.zhihu.com/question/338066667)
[Warm-up explained, part 2](https://chih-sheng-huang821.medium.com/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92warm-up%E7%AD%96%E7%95%A5%E5%9C%A8%E5%B9%B9%E4%BB%80%E9%BA%BC-95d2b56a557f)
[Paper on the loss landscape](https://arxiv.org/pdf/1612.04010v1.pdf)
[Paper on the loss landscape, 2](https://arxiv.org/pdf/1612.04010.pdf)
[Confident learning](https://zhuanlan.zhihu.com/p/101379289): can find label errors inside a dataset
[RAdam explained](https://allen108108.github.io/blog/2019/10/08/RAdam%20optimizer%20%E6%96%BC%20Dogs%20vs.%20Cats%20%E8%B2%93%E7%8B%97%E8%BE%A8%E8%AD%98%E4%B8%8A%E4%B9%8B%E5%AF%A6%E4%BD%9C/)
[LookAhead explained](https://allen108108.github.io/blog/2019/10/08/Lookahead%20optimizer%20%E6%96%BC%20Dogs%20vs.%20Cats%20%E8%B2%93%E7%8B%97%E8%BE%A8%E8%AD%98%E4%B8%8A%E4%B9%8B%E5%AF%A6%E4%BD%9C/)

---

# Swarm final project

---

Social spiders
Fractals
Adaptive waves
Gradient-based

---

# 6/16 routine report

---

* [Neural Optimizer Search](https://arxiv.org/pdf/1709.07417.pdf)
* Still organizing the papers mentioned on Saturday

---

# 6/9 routine report

---

# Odds and ends

[Understanding saddle points](https://www.getit01.com/p20180103768109802/)
[Deep Learning without Poor Local Minima](https://arxiv.org/abs/1605.07110)

---

## Padam

Replaces the exponent 0.5 in the RMS term with p, where 0 < p < 0.5, and finds that results improve; it also argues that $\sqrt{v_t}$ in the Ada family is not really a second-derivative analogue. The downside is that the method has many hyperparameters, and tuning p interacts with the learning rate.

---

## SWATS

Keeps an EMA of the ratio between the gradient and the actual step taken, and when the difference between that EMA and the current step's ratio falls below epsilon, switches that coordinate to SGD, using the ratio as the learning rate.

[Commentary, plus doubts about how lowering the learning rate at epoch 150 affects the switching condition](https://zhuanlan.zhihu.com/p/32406552)

---

## AdaShift

Replaces $g_t^2$ in the $v_t$ formula with $g_{t-n}^2$ from several steps earlier, so that $v_t$ is no longer positively correlated with $g_t$ (but isn't that correlation the whole point?). It feels like something engineered in response to the AMSGrad paper. It again argues that $\sqrt{v_t}$ is not a second-derivative analogue, and apparently even a random choice of past gradient does not do much worse.

[Author's walkthrough video](https://www.bilibili.com/video/av64670460/)

---

## NosAdam

A framework that adjusts the $\beta_2^t$ term in Adam so that it avoids divergence; in practice it uses $\sum_{k=1}^{t+1}k^{-\gamma} \big/ \sum_{k=1}^{t}k^{-\gamma}$.
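To make this weighting concrete, here is a minimal sketch assuming hyperharmonic weights $b_k = k^{-\gamma}$; the function name `nosadam_v` and the recursive form are my own illustration, not the paper's reference code.

```python
import numpy as np

def nosadam_v(grads_sq, gamma=0.5):
    """Illustrative second-moment estimate with weights b_k = k**-gamma.

    Equivalent recursive view: beta2_t = B_{t-1} / B_t with B_t = sum_{k<=t} b_k,
    so v_t = beta2_t * v_{t-1} + (1 - beta2_t) * g_t**2, i.e. a time-varying beta2.
    """
    v = 0.0
    B = 0.0  # running sum of the weights b_k
    for t, g2 in enumerate(grads_sq, start=1):
        b = t ** (-gamma)
        beta2_t = B / (B + b)
        v = beta2_t * v + (1.0 - beta2_t) * g2
        B += b
    return v

# Example: weighted average of three squared gradients.
print(nosadam_v([1.0, 0.25, 0.04], gamma=0.5))
```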
[Author's own explanation](https://zhuanlan.zhihu.com/p/65625686)

Point: when publishing a custom optimizer, you also have to handle weight decay.

---

## Summary of properties of papers read so far

* Padam, AdaShift: the adaptive learning rate is not a second derivative; it can be tinkered with (Expectigrad, etc.)
* SWATS: gradually transitions to SGDM by some mechanism
* NosAdam: not much practical value; mainly about steering away from non-convergence (heavy on mathematical proofs)
* AdamW: weight decay is more useful than L2 regularization → L2 regularization (which prevents overfitting) may work against training speed, and it is a different concept from weight decay

---

# 6/2 routine report

---

## Paper hunting

### Adam variants and related work
[AdaBelief (2020)](https://arxiv.org/pdf/2010.07468.pdf)
[AdaBound (2019)](https://arxiv.org/abs/1902.09843)
[NosAdam (2020)](https://arxiv.org/abs/1805.07557)
[PAdam (2019)](https://arxiv.org/abs/1901.09517)
[AdaShift (2018)](https://arxiv.org/abs/1810.00143)
[Dissecting Adam (2017)](https://arxiv.org/abs/1705.07774)
[AdamW (2017)](https://arxiv.org/abs/1711.05101)
[SWATS (2017)](https://arxiv.org/abs/1712.07628)

----

## Paper hunting

### Strong performers
[RAdam (2019)](https://kknews.cc/zh-tw/code/2nxbbjg.html)
[LookAhead (2019)](https://blog.csdn.net/u011681952/article/details/99414931)
[Ranger (2019)](https://zhuanlan.zhihu.com/p/100877314)

----

## Paper hunting

### Self-learning optimizers
[Google self-learning optimizer (2020)](https://www.jiqizhixin.com/articles/2020-10-21)
[Neural Optimizer Search (2017)](https://mp.weixin.qq.com/s/E0ULyXGz3UEcD0cIg-XkNA)

---

## Optimizer roundups

[Cheat sheet 1](https://zhuanlan.zhihu.com/p/208178763)
[Cheat sheet 2](https://medium.com/%E8%BB%9F%E9%AB%94%E4%B9%8B%E5%BF%83/deep-learning-%E7%82%BA%E4%BB%80%E9%BA%BCadam%E5%B8%B8%E5%B8%B8%E6%89%93%E4%B8%8D%E9%81%8Esgd-%E7%99%A5%E7%B5%90%E9%BB%9E%E8%88%87%E6%94%B9%E5%96%84%E6%96%B9%E6%A1%88-fd514176f805)
[Cheat sheet 3](https://zhuanlan.zhihu.com/p/65625686)

---

## To look at later

Explainable AI

---

## AdaBelief

Replaces the $g^2$ term with $(g - m)^2$, so that when the current prediction is accurate (low loss curvature) the learning rate is high, and vice versa.

[Explanation, plus the challenge from EAdam](https://zhuanlan.zhihu.com/p/339225508)
[Explanation, plus doubts about the fine-tuning](https://www.jiqizhixin.com/articles/2020-10-17)
[Explanation, plus doubts about the experimental baselines](https://www.qbitai.com/2020/10/19174.html)

---

## AdaBound

Squeezes the adaptive learning rate between bounds that tighten over time, so training behaves like Adam early on and like SGD later on (see the sketch at the end of this slide).
The late-stage lr is 0.1, and the bounds converge at a rate governed by beta2.

[Follow-up paper](https://arxiv.org/abs/1908.04457)
[SWATS, an earlier paper that also tried switching to SGD late in training](https://arxiv.org/abs/1712.07628)
[AdaFactor, which some consider the prototype](https://arxiv.org/abs/1804.04235)
[AdaFactor explained](https://kexue.fm/archives/7302)
[Doubts about the experiments](https://www.zhihu.com/question/313863142)

Point: it seems a new optimizer has to be run on ImageNet, NLP, and GAN benchmarks, or it will not be taken seriously?
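To make the "squeeze between bounds" idea concrete, a minimal NumPy sketch of one update step. The bound schedules and the name `adabound_step` are my own illustration (gamma could be tied to 1 − beta2, as the note above suggests), not the paper's reference implementation.

```python
import numpy as np

def adabound_step(param, m_hat, v_hat, t, base_lr=1e-3, final_lr=0.1,
                  gamma=1e-3, eps=1e-8):
    """One AdaBound-style update: clip the per-coordinate Adam step size between
    bounds that both converge to final_lr, so the rule is Adam-like early in
    training and SGD-like late in training. Schedules are an assumed common
    choice; check the paper for the exact form."""
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))  # rises toward final_lr
    upper = final_lr * (1.0 + 1.0 / (gamma * t))        # falls toward final_lr
    step_size = np.clip(base_lr / (np.sqrt(v_hat) + eps), lower, upper)
    return param - step_size * m_hat
```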
---

## Summary of properties of papers read so far

* Expectigrad: outer momentum, arithmetic average (AA) instead of RMS
* Gradient Centralization: applies centering and similar operations to the gradient
* AdaBelief: takes the loss curvature near the current point into account
* AdaBound: gradually transitions to SGDM by some mechanism
* Other: clipping of gradients close to zero

---

## Candidate to-dos

Read the papers above
Look into code for collecting gradient statistics during training [(TensorBoard?)](https://www.tensorflow.org/tensorboard/)
Put together a common set of test data and write the testing code

---

# 4/21 routine report

---

* This week I implemented Expectigrad (the paper I presented last time) and ran two experiments (press the right arrow key)

---

### MNIST figure from the paper (press the down arrow key)
![](https://i.imgur.com/IleMv7X.png)

----

### run1
![](https://i.imgur.com/JtyYj0e.jpg)

----

### run2
![](https://i.imgur.com/KaStNPd.jpg)

----

### run3
![](https://i.imgur.com/bnzoc1k.jpg)

----

### run4
![](https://i.imgur.com/hlYwQYn.jpg)

----

### run5
![](https://i.imgur.com/ygGI1Bw.jpg)

----

### run6
![](https://i.imgur.com/LlHCSfD.jpg)

----

### run7
![](https://i.imgur.com/pqBc0rN.jpg)

----

### run8
![](https://i.imgur.com/pnfQpiZ.jpg)

----

### run9 (run10 broke) (press the right arrow key)
![](https://i.imgur.com/KqIjMbp.jpg)

---

### CIFAR figure from the paper, 10-run average (press the down arrow key)
![](https://i.imgur.com/KQVPWPI.png)

----

### My 10-run average
![](https://i.imgur.com/ID2UGql.jpg)

----

### Runs 1–10 below
![](https://i.imgur.com/lipxn1i.jpg)

----

![](https://i.imgur.com/68RGO8r.jpg)

----

![](https://i.imgur.com/AE91xRG.jpg)

----

![](https://i.imgur.com/qBL5R52.jpg)

----

![](https://i.imgur.com/p3CPVIq.jpg)

----

![](https://i.imgur.com/9TU23ne.jpg)

----

![](https://i.imgur.com/K7jfM6k.jpg)

----

![](https://i.imgur.com/v53duOb.jpg)

----

![](https://i.imgur.com/GpifTM6.jpg)

----

![](https://i.imgur.com/qBxzeps.jpg)

---

# 3/3 routine report

---

## Environment setup

### TensorFlow and GPU acceleration
- Driver Version: 460.32.03
- CUDA Version: 11.2.1
- cuDNN Version: 8.1.0.77
- TensorFlow 2.4.1

[Fixing the cuDNN GPU out-of-memory problem](https://davistseng.blogspot.com/2019/11/tensorflow-2.html)

----

## Environment setup

### VSCode on the laptop
- Python
- Remote SSH

[Logging in with an SSH key](https://blog.gtwang.org/linux/linux-ssh-public-key-authentication/)
[How Python virtual environments work and how to use them](https://zhuanlan.zhihu.com/p/71615515)
[How venv works](https://www.kawabangga.com/posts/3543)
[What `source` does](https://www.itread01.com/content/1548311242.html)

---

## Development

[What **kwargs means](https://skylinelimit.blogspot.com/2018/04/python-args-kwargs.html)
[Keras docs (Simplified Chinese)](https://keras.io/zh/)
[A bug in TF 2.3 and below](https://www.mdeditor.tw/pl/pNvD/zh-tw)
[Python underscores](https://zhuanlan.zhihu.com/p/36173202)

---

## Things skipped

- pdb
- git on VSCode
- test VSCode Remote SSH server on gentoo

---

## Miscellaneous

[Disabling Intel DPST automatic brightness adjustment on Windows 10](https://blog.brucehsu.org/posts/2017/04/14/disable-intel-display-power-saving-dpst-on-windows-10/)

---

# 3/10 routine report

---

## Miscellaneous

[The nature of Python operators and the strong/weak typing question](https://www.itread01.com/content/1599015784.html)
[What Python's asterisk really is and how to use it](https://www.itread01.com/hkcpxqy.html)

---

## Custom Optimizer

[Writing a custom optimizer in TF2](https://cloudxlab.com/blog/writing-custom-optimizer-in-tensorflow-and-keras/)

----

```python=
from tensorflow.python.keras.optimizer_v2 import optimizer_v2
from tensorflow.python.util.tf_export import keras_export
from tensorflow.python.ops import state_ops
import random


@keras_export("keras.optimizers.custom")
class custom_optimizer(optimizer_v2.OptimizerV2):
    _HAS_AGGREGATE_GRAD = False

    def __init__(self, learning_rate=0.01, rand_rate=0.1, name="test", **kwargs):
        super(custom_optimizer, self).__init__(name, **kwargs)
        self._set_hyper("learning_rate", kwargs.get("lr", learning_rate))
        self._set_hyper("decay", self._initial_decay)
        self._set_hyper("rand_rate", rand_rate)
        # Note: this flag is shared by all variables, so only the very first
        # update call takes the "first step" branch.
        self._is_first = True

    def _create_slots(self, var_list):
        # One slot per variable to remember the previous gradient.
        for var in var_list:
            self.add_slot(var, "positive")

    def _resource_apply_dense(self, grad, var, apply_state=None):
        var_dtype = var.dtype.base_dtype
        lr = self._decayed_lr(var_dtype)
        rand_rate = self._get_hyper("rand_rate", var_dtype)
        positive = self.get_slot(var, "positive")
        # random.uniform draws a Python float, so under tf.function the random
        # factor is baked in at trace time rather than re-sampled each step.
        if self._is_first:
            self._is_first = False
            new_var = var - grad * lr * \
                random.uniform(1.0 - rand_rate, 1.0 + rand_rate)
        else:
            new_var = var - (grad + positive) * lr * \
                random.uniform(1.0 - rand_rate, 1.0 + rand_rate)
        positive.assign(grad)
        return state_ops.assign(var, new_var,
                                use_locking=self._use_locking).op

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        raise NotImplementedError

    def get_config(self):
        config = super(custom_optimizer, self).get_config()
        config.update({
            "learning_rate": self._serialize_hyperparameter("learning_rate"),
            "decay": self._serialize_hyperparameter("decay"),
            "rand_rate": self._serialize_hyperparameter("rand_rate"),
        })
        return config
```
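For completeness, a minimal usage sketch, assuming the `custom_optimizer` class above is defined in the current scope; the toy MNIST model here is only a placeholder to exercise the optimizer.

```python
import tensorflow as tf

# Toy model and data, just to exercise the optimizer defined above.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=custom_optimizer(learning_rate=0.01, rand_rate=0.1),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=1, batch_size=128)
```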
---

## GLAdam

![](https://i.imgur.com/OmiRIx4.png)

$lr_t = \mathrm{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$
$m_t = \beta_1 * m_{t-1} + (1 - \beta_1) * g$
$v_t = \beta_2 * v_{t-1} + (1 - \beta_2) * g^2$
$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$
$\theta_t = \theta_{t-1} - lr_t * m_t / (\sqrt{\hat{v}_t} + \epsilon)$

---

# 3/17 routine report

---

## Optimizer call stack

```
minimize
    _compute_gradients (don't care)
    apply_gradients
        _create_all_weights
            _create_hypers (don't care)
            _create_slots (implement)
                calls add_slot
        _prepare
            _prepare_local (implement)
                updates apply_state[(var_device, var_dtype)]
        _distributed_apply
            apply_grad_to_update_var
                _resource_apply_dense (implement)
```

Reminder: use TF ops (e.g. `state_ops.assign`) when implementing value assignment.

---

# 4/13 routine report

---

## Expectigrad

![](https://i.imgur.com/IleMv7X.png)

----

![](https://i.imgur.com/uBucFi8.png)

---

## Miscellaneous

[Hello Matplotlib](https://medium.com/python4u/hello-matplotlib-8ffe04355ebf)
[Tutorial: visualizing TensorFlow computation with TensorBoard](https://blog.gtwang.org/programming/tensorboard-tensorflow-visualization-tutorial/)
https://github.com/linsamtw/cifar10_vgg16_kaggle

---

# 4/20 routine report

---

## kawai

* [Owl Search Algorithm](https://sci-hub.se/https://doi.org/10.3233/JIFS-169452): somewhat complicated; somehow feels untrustworthy
* [Squirrel Search Algorithm](https://www.sciencedirect.com/science/article/pii/S2210650217305229): complicated
* [Jellyfish Search](https://www.sciencedirect.com/science/article/pii/S0096300320304914): the jellyfish sometimes move as a whole swarm toward the global best and sometimes move within the swarm; the latter splits into communicating with another jellyfish and moving toward the better one, versus drifting randomly on their own. Lots of random values; need to check whether any of them amount to cheating.
* Transient Search Optimization
* Butterfly Optimization Algorithm
* Emperor Penguins Colony

## kakkoi

* [Future Search Algorithm](https://www.researchgate.net/publication/327654743_Future_search_algorithm_for_optimization): garbage
* [Chaos Game Optimization](https://sci-hub.se/10.1007/s10462-020-09867-w): forms a triangle from the global best, the mean of some random seeds, and the seed itself; generates new seeds from each seed in 4 ways with different alpha values, then keeps the top N for the next generation (see the sketch after these lists)
* [Social Engineering Optimizer](https://www.researchgate.net/publication/321155851_Social_Engineering_Optimization_SEO_A_New_Single-Solution_Meta-heuristic_Inspired_by_Social_Engineering): my gut feeling is that it is weak
* [Giza Pyramids Construction](https://sci-hub.se/https://link.springer.com/article/10.1007/s12065-020-00451-3): roundabout; not bothering with it for now
* Archimedes Optimization Algorithm
* Black Hole Mechanics Optimization
* Life Choice-Based Optimizer
* Multi-Verse optimizer

## ub

* [Coronavirus Optimization Algorithm](https://arxiv.org/abs/2003.13633): a framework; the key spreading step is not implemented; comes with an LSTM demo
* Shuffled Shepherd Optimization Algorithm
* Golden Ratio Optimization Method
* Black Widow Optimization Algorithm
* Sailfish Optimizer
* Dynamic Differential Annealed Optimization
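A rough sketch of the Chaos Game Optimization scheme as summarized in the bullet above: combine the global best, the mean of a random subset of seeds, and the seed itself with random alpha weights, generate several candidates per seed, and keep the top N. The name `cgo_generation` and the specific update expression are my own simplification for illustration, not the paper's exact equations.

```python
import numpy as np

def cgo_generation(seeds, fitness, n_keep, rng=None):
    """One illustrative generation (minimization): for every seed, build
    candidates from the triangle (global best, mean of a random subset of
    seeds, the seed itself), then keep the best n_keep."""
    rng = rng or np.random.default_rng()
    scores = np.array([fitness(s) for s in seeds])
    gb = seeds[np.argmin(scores)]                      # global best
    pool = list(seeds)
    for s in seeds:
        subset = rng.choice(len(seeds), size=max(2, len(seeds) // 2), replace=False)
        mg = seeds[subset].mean(axis=0)                # mean of random seeds
        for _ in range(4):                             # "4 ways", simplified to 4 random alpha draws
            a1, a2 = rng.random(2)
            pool.append(s + a1 * (gb - s) + a2 * (mg - s))
    pool = np.array(pool)
    order = np.argsort([fitness(x) for x in pool])
    return pool[order[:n_keep]]

# Example: minimize the sphere function with 20 seeds in 5 dimensions.
seeds = np.random.default_rng(0).random((20, 5))
for _ in range(10):
    seeds = cgo_generation(seeds, lambda x: float(np.sum(x ** 2)), n_keep=20)
```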
[Nature inspired optimization algorithms or simply variations of metaheuristics?](https://www.researchgate.net/publication/343846931_Nature_inspired_optimization_algorithms_or_simply_variations_of_metaheuristics)

The solution representation might not be a flat list
Serializing a graph into a list
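On that last note: one simple, hypothetical way to serialize a graph-shaped solution into a flat list, so that a list-based metaheuristic can mutate it, is to flatten the upper triangle of its adjacency matrix. The helper names below are my own illustration.

```python
import numpy as np

def graph_to_vector(adj):
    """Flatten the upper triangle of an adjacency matrix (undirected graph,
    no self-loops) into a 1-D vector a metaheuristic can operate on."""
    n = adj.shape[0]
    return adj[np.triu_indices(n, k=1)].astype(float)

def vector_to_graph(vec, n):
    """Inverse mapping: rebuild the symmetric adjacency matrix from the vector."""
    adj = np.zeros((n, n))
    adj[np.triu_indices(n, k=1)] = vec
    return adj + adj.T

# Round trip on a small random graph.
g = (np.random.default_rng(0).random((4, 4)) > 0.5).astype(float)
g = np.triu(g, k=1) + np.triu(g, k=1).T
assert np.array_equal(vector_to_graph(graph_to_vector(g), 4), g)
```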
{"metaMigratedAt":"2023-06-15T18:45:15.224Z","metaMigratedFrom":"YAML","title":"例行進度報告","breaks":true,"slideOptions":"{\"theme\":\"white\",\"transition\":\"slide\"}","contributors":"[{\"id\":\"16b97510-a1a9-4b89-8269-ea0fcfa23b07\",\"add\":16913,\"del\":3163}]"}