# CIM
1. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
2. Layer Normalization
3. Attention Is All You Need
4. Deep Residual Learning for Image Recognition
5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
6. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
7. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
8. Importance Estimation for Neural Network Pruning
9. DepGraph: Towards Any Structural Pruning
10. Training data-efficient image transformers & distillation through attention
11. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
12. SVD-Based Channel Pruning for Convolutional Neural Network in Acoustic Scene Classification Model
13. Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition
14. Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration
15. EDropout: Energy-Based Dropout and Pruning of Deep Neural Networks
16. Barlow Twins: Self-Supervised Learning via Redundancy Reduction
17. PruningBench: A Comprehensive Benchmark of Structural Pruning
18. EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis
19. Pruning Filters for Efficient ConvNets
20. Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks
21. Layer-adaptive Sparsity for the Magnitude-based Pruning
22. HRank: Filter Pruning using High-Rank Feature Map
23. Channel Pruning for Accelerating Very Deep Neural Networks
24. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression
25. NISP: Pruning Networks using Neuron Importance Score Propagation
26. Learning Efficient Convolutional Networks through Network Slimming
27. Learning Structured Sparsity in Deep Neural Networks
28. Neural Pruning via Growing Regularization
29. A Tutorial on Energy-Based Learning
30. Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space
31. Patch Slimming for Efficient Vision Transformers
32. X-Pruner: eXplainable Pruning for Vision Transformers
33. Surpassing Cosine Similarity for Multidimensional Comparisons: Dimension Insensitive Euclidean Metric (DIEM)
34. Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
35. SNIP: Single-shot Network Pruning based on Connection Sensitivity
36. Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning
37. BilevelPruning: Unified Dynamic and Static Channel Pruning for Convolutional Neural Networks
38. Runtime Neural Pruning
39. Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
40. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
41. MULTIFLOW: Shifting Towards Task-Agnostic Vision Language Pruning
42. Token Merging: Your ViT But Faster
43. Token Fusion: Bridging the Gap between Token Pruning and Token Merging
44. Neural Tangent Kernel: Convergence and Generalization in Neural Networks
45. Rewriting a Deep Generative Model
46. Resource-Efficient Transformer Pruning for Finetuning of Large Models
47. Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch
48. https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
49. Global Vision Transformer Pruning with Hessian-Aware Saliency (CVPR 2023)
50. Robustness via Curvature Regularization, and Vice Versa
51. Accelerating Sparse Deep Neural Networks
    1. NVIDIA 2:4 sparsity on Ampere requires the input and output dimensions to be multiples of 16 or 32.
    2. It notes that a pruned pretrained model fine-tuned on a smaller dataset cannot recover its accuracy.
52. Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery Tickets from Large Models
53. Learning both Weights and Connections for Efficient Neural Networks

Toy example: pruning an input column changes the output, but the output is preserved if the pruned column's contribution is folded into the remaining columns.

$$
\begin{bmatrix} 1 & 1 & 1 & 4\\ 1 & 1 & 2 & 5 \\ 1 & 1 & 3 & 6 \end{bmatrix} \cdot \begin{bmatrix} 1\\ 1\\ 1\\ 1 \end{bmatrix} = \begin{bmatrix} 7\\ 9\\ 11 \end{bmatrix}
$$

$$
\begin{bmatrix} 1 & 1 & 4\\ 1 & 2 & 5 \\ 1 & 3 & 6 \end{bmatrix} \cdot \begin{bmatrix} 1\\ 1\\ 1 \end{bmatrix} = \begin{bmatrix} 6\\ 8\\ 10 \end{bmatrix}
$$

$$
\begin{bmatrix} 2 & 1 & 4\\ 2 & 2 & 5 \\ 2 & 3 & 6 \end{bmatrix} \cdot \begin{bmatrix} 1\\ 1\\ 1 \end{bmatrix} = \begin{bmatrix} 7\\ 9\\ 11 \end{bmatrix}
$$

$$
\mathbf{v} = \begin{bmatrix} 1 & 2 \end{bmatrix}, \quad
\mathbf{A} = \begin{bmatrix} 3 & 4 \\ 5 & 6 \end{bmatrix}, \quad
\mathbf{v} \cdot \mathbf{A} = \begin{bmatrix} 1 & 2 \end{bmatrix} \cdot \begin{bmatrix} 3 & 4 \\ 5 & 6 \end{bmatrix} = \begin{bmatrix} 13 & 16 \end{bmatrix}
$$

$$
\begin{bmatrix} V_1 & V_2 & V_3 \end{bmatrix} \cdot \begin{bmatrix} 13\\ 5\\ 3 \end{bmatrix} = 13V_1 + 5V_2 + 3V_3
$$

$$
\begin{bmatrix} V_2 & V_3 \end{bmatrix} \cdot \begin{bmatrix} 5 \\ 3 \end{bmatrix} = 5V_2 + 3V_3
$$

$$
\begin{aligned} \begin{bmatrix} V_2 + 2V_1 & V_3 + V_1 \end{bmatrix} \cdot \begin{bmatrix} 5 \\ 3 \end{bmatrix} &= 5(V_2 + 2V_1) + 3(V_3 + V_1) \\ &= 13V_1 + 5V_2 + 3V_3 \end{aligned}
$$

More generally, after pruning input dimensions the remaining weights can be re-fit by least squares:

$$
\min_{\mathbf{\tilde{w}}} \left\| \mathbf{\tilde{w}}\mathbf{\tilde{x}} - \mathbf{w} \mathbf{x} \right\|^2, \quad \mathbf{\tilde{x}} = \mathbf{A} \mathbf{x}
$$

$$
\mathbf{A} = \begin{bmatrix} a_1 & 0 & \cdots & 0 \\ 0 & a_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_n \end{bmatrix}_{n \times m}, \quad a_i = \begin{cases} 0, & \text{if dimension } i \text{ is pruned} \\ 1, & \text{otherwise} \end{cases}
$$

where $\mathbf{A}$ is a diagonal matrix.

$$
\mathbf{\tilde{w}} = \mathbf{w} \mathbf{x} \mathbf{\tilde{x}}^\top \left( \mathbf{\tilde{x}} \mathbf{\tilde{x}}^\top \right)^{-1} = \mathbf{w} \mathbf{x} \mathbf{x}^\top \mathbf{A}^\top \left( \mathbf{A}\mathbf{x} \mathbf{x}^\top \mathbf{A}^\top \right)^{-1}
$$

The least-squares objective can be regularized with an $\ell_2$ penalty (ridge regression):

$$
\| y - Xw \|^2_2 \Rightarrow \| y - Xw \|^2_2 + \alpha \| w \|^2_2
$$
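A minimal NumPy sketch of the two identities above, not taken from any of the listed papers: the sizes, the random calibration data `x`, and the pruned indices are made up for illustration. The first part checks the exact column-merge compensation; the second fits the pruned weights with the closed form $\mathbf{\tilde{w}} = \mathbf{w}\mathbf{x}\mathbf{\tilde{x}}^\top(\mathbf{\tilde{x}}\mathbf{\tilde{x}}^\top)^{-1}$, using a pseudo-inverse because $\mathbf{A}$ zeroes out the pruned rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Column-merge identity from the toy example above -----------------------
# Pruning V1 and folding its contribution into V2 and V3 (coefficients chosen
# so that 2*5 + 1*3 = 13) leaves the output unchanged.
V = rng.standard_normal((4, 3))                       # columns V1, V2, V3
a = np.array([13.0, 5.0, 3.0])                        # coefficients on V1, V2, V3
V_merged = np.stack([V[:, 1] + 2 * V[:, 0],           # V2 + 2*V1
                     V[:, 2] + 1 * V[:, 0]], axis=1)  # V3 + 1*V1
assert np.allclose(V @ a, V_merged @ a[1:])           # 13*V1 + 5*V2 + 3*V3 preserved

# --- Least-squares reconstruction after pruning input dimensions ------------
n, m = 8, 64                              # hypothetical sizes: n inputs, m calibration samples
w = rng.standard_normal((1, n))           # one output row of weights
x = rng.standard_normal((n, m))           # calibration activations
keep = np.ones(n)
keep[[2, 5]] = 0.0                        # prune input dims 2 and 5 (arbitrary choice)
A = np.diag(keep)                         # diagonal selection matrix
x_t = A @ x                               # pruned activations, x_tilde = A x

# w_tilde = w x x_t^T (x_t x_t^T)^{-1}; pinv is used because the rows zeroed
# by A make x_t x_t^T singular.
w_t = w @ x @ x_t.T @ np.linalg.pinv(x_t @ x_t.T)

print("error, prune only:     ", np.linalg.norm(w @ x_t - w @ x))
print("error, reconstructed w:", np.linalg.norm(w_t @ x_t - w @ x))
```

The reconstructed error should come out smaller than simply dropping the dimensions; this is roughly the per-layer reconstruction step used by the channel-pruning papers in the list (e.g., items 23 and 24).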
# weight initialization
1. Weight Initialization Techniques for Deep Learning Algorithms in Remote Sensing: Recent Trends and Future Perspectives
2. Understanding the difficulty of training deep feedforward neural networks

# knowledge editing
1. Locating and Editing Factual Associations in GPT

# layer pruning
1. Optimal Brain Damage
2. Optimal Brain Surgeon and General Network Pruning
3. Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning
4. Discovering Sparsity Allocation for Layer-wise Pruning of Large Language Models (This paper differs from our work, but its experimental comparison tables may be suitable for our use.)
5. SlimGPT: Layer-wise Structured Pruning for Large Language Models (Very similar to OBC, but the implementation of the head-pruning part is somewhat hard to follow.)

# book
1. https://www.cs.princeton.edu/courses/archive/fall19/cos597B/lecnotes/bookdraft.pdf

# Time
- 郭峻因 (online): 5/23 08:00~12:00, 5/27 16:30~17:30
- 林澤: 5/26 9:00~11:00, 14:30~15:30; 5/27 16:30~17:30
- 洪士灝: 5/26 09:30~12:00, 13:00~14:00; 5/28 14:30~16:00

# LLM pruning
1. LLM-Pruner: On the Structural Pruning of Large Language Models
2. LoRA: Low-Rank Adaptation of Large Language Models