# Part I: The way AI transforms
---
### How has AI evolved in the past decade?
2010. Deep Belief Network
2011. Deep convnets for tiny image classification *(inductive-bias)*
2012. AlexNet: deep convnets for ImageNet *(inductive-bias)*
2014. VGG convnets *(inductive-bias)*
2015. GoogLeNet, MobileNet *(inductive-bias)*
---
### How has AI evolved in the past decade?
2016. Residual networks, Transformers *(inductive-bias)*
2017. AlphaGo 4:1 Lee Sedol *(self-play)*
2018. OpenAI Five vs. humans at Dota 2 *(self-play)*
2019. GPT-2, Turing-NLG/DeepSpeed *(self-learn)*
2020. GPT-3 *(self-learn)*
---
### More is More
* Model sizes: thousands $\rightarrow$ billions (trillions in 2021?)
* Accuracy improves as models grow bigger
* Accuracy improves as more data is thrown in
---
### On the Generalization of Neural Nets
* Intermediate layers are capable of learning latent representations
* Learned representations are reusable (see the sketch below)
* Representations are bounded by the data domain
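A minimal sketch of what "reusable representation" means in practice, assuming PyTorch with torchvision ≥ 0.13 and a hypothetical 10-class downstream task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet; its intermediate layers
# already encode reusable latent representations.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the learned representation ...
for p in backbone.parameters():
    p.requires_grad = False

# ... and reuse it by swapping in a new task-specific head
# (10 classes is an illustrative downstream task).
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

x = torch.randn(4, 3, 224, 224)   # dummy batch of images
logits = backbone(x)              # frozen features, only the new head is trainable
print(logits.shape)               # torch.Size([4, 10])
```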
---
### Pre-training as Generalization
* Before 2010, only layer-wise pre-training could make deep nets trainable
* In the early 2010s, pre-training was indispensable for training on small datasets
* In the era of data explosion, pre-training became just an option
---
### Pre-training Strikes Back
* Before 2010, only layer-wise pre-training could make deep nets trainable
* In the early 2010s, pre-training was indispensable for training on small datasets
* In the era of data explosion, pre-training became just an option
---
### NLP is the new Vision
* 2010-2015: DL in NLP was not as impressive as in CV
* Now it is NLP's turn to influence CV
* The NLP recipe: pretext training on massive unlabeled data (sketched below)
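A minimal sketch of such a pretext task, masked-token prediction on unlabeled text; the tiny Transformer, vocabulary size, and mask ratio are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Pretext task: mask random tokens in unlabeled text and train the model
# to reconstruct them. No human labels are needed.
vocab_size, d_model, MASK_ID = 1000, 64, 0

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
to_vocab = nn.Linear(d_model, vocab_size)

tokens = torch.randint(1, vocab_size, (8, 32))   # "unlabeled" token batch
mask = torch.rand(tokens.shape) < 0.15           # mask ~15% of positions
corrupted = tokens.masked_fill(mask, MASK_ID)

logits = to_vocab(encoder(embed(corrupted)))     # predict every position...
loss = nn.functional.cross_entropy(              # ...but score only the masked ones
    logits[mask], tokens[mask]
)
loss.backward()
```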
---
### More is Less
* The more unlabeled data and the bigger the model,
* the less labeled data is required
---
### Over-parametrization
* As model size grows, micro-structures do not matter much
* Initial conditions are not very important either
* Many equally good local minima found
---
### Optimization Landscape

---
### Less is More?
* Since 2015, embedded deep learning has spun off from big models
* A whole lot of work on model compression, fp16 inference, etc. (see the sketch below)
* Limited math ops, packed memory, low TFLOPS
* "Detached" from the "bigger is better" reality
---
### Less is More: A How-To
* Disentangle efficiency from accuracy
* Orthogonalize experimentation and production
* Model cycle: Research - "Productionize" - Production
* Productionize (v.): distilling big models into minis
---
### Knowledge Distillation
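A minimal sketch of the classic teacher-student distillation loss (softened teacher logits plus hard labels), assuming PyTorch; the temperature and mixing weight are illustrative:

```python
import torch
import torch.nn.functional as F

# Teacher-student distillation: the student matches the teacher's softened
# logits in addition to the ground-truth labels.
def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # rescale gradients by T^2
    # Hard targets: ordinary cross-entropy on the labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits and labels.
s = torch.randn(8, 10, requires_grad=True)   # student logits
t = torch.randn(8, 10)                       # teacher logits (no grad needed)
y = torch.randint(0, 10, (8,))
distillation_loss(s, t, y).backward()
```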

---
### Less is More!
* Freely pursue higher accuracy with big models + big data
* While maintaining a well-defined "productionize" procedure that turns experimental models into well-tested, hardware-compatible, efficient small models
---
# DIY: Our Inventory
---
### Data
* 1.47 million labeled events
* 27.8 million raw images
* Enough to facilitate self-supervised learning
---
### Compute
* Each V100 card can fit a model of ~1 billion parameters (back-of-envelope below)
* DeepSpeed (MSFT) gives a ~10x boost (compute, speed)
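A back-of-envelope sketch of that memory budget, assuming mixed-precision training with Adam (the accounting DeepSpeed/ZeRO optimizes); activations and buffers are ignored, so this is a lower bound:

```python
# Memory per parameter under fp16 mixed precision with Adam.
params = 1_000_000_000

fp16_weights = 2 * params            # 2 bytes per fp16 parameter
fp16_grads   = 2 * params            # 2 bytes per fp16 gradient
adam_states  = 12 * params           # fp32 master copy + momentum + variance

total_gb = (fp16_weights + fp16_grads + adam_states) / 2**30
print(f"~{total_gb:.0f} GB")         # ~15 GB -> fits a 32 GB V100
```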
---
### Algorithms
* Attention is All You Need
* Invented in 2016 for NLP, a lazy-but-smart way to learn from data
* It now influences CV in return; attention *will* replace convolution
* Attention can solve *best-image picking* and *change detection* (see the sketch below)
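A minimal sketch of scaled dot-product attention, the core operation behind these claims; the tensor shapes are illustrative:

```python
import math
import torch

# Scaled dot-product attention: each query forms a weighted sum of the
# values, with weights given by its similarity to the keys.
def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = scores.softmax(dim=-1)     # each query attends over all keys
    return weights @ v                   # weighted sum of values

q = torch.randn(2, 16, 64)   # (batch, query positions, dim)
k = torch.randn(2, 32, 64)   # (batch, key positions, dim)
v = torch.randn(2, 32, 64)
out = attention(q, k, v)
print(out.shape)             # torch.Size([2, 16, 64])
```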
---
### Algorithms
* Knowledge distillation via teacher-student learning
* Inductive bias of residual connections (sketched below)
* More efficient networks for mobile devices
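A minimal sketch of the residual-connection inductive bias, assuming PyTorch; the channel counts are illustrative:

```python
import torch
import torch.nn as nn

# Residual block: the layers learn a correction F(x) on top of the
# identity, so the output is x + F(x).
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # skip connection keeps gradients flowing

x = torch.randn(1, 32, 56, 56)
print(ResidualBlock(32)(x).shape)             # torch.Size([1, 32, 56, 56])
```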
---
### Algorithms

---
### Further Future?
Adding memory: a stateful AI machine
{"metaMigratedAt":"2023-06-15T12:58:00.206Z","metaMigratedFrom":"YAML","title":"Part I: The way AI transforms","breaks":true,"slideOptions":"{\"transition\":\"slide\"}","contributors":"[{\"id\":\"39292a0b-1532-4d33-b895-e48f35752611\",\"add\":8129,\"del\":4288}]"}