2020 Statistics Camp: Machine Learning
===
2020/9/3-9/4 :round_pushpin: Academia Sinica

## A short tour of neural networks and AI - 黃文良
[PDF](http://www3.stat.sinica.edu.tw/camp2020/files/01_%E9%BB%83%E6%96%87%E8%89%AF%E6%91%98%E8%A6%81.pdf)

* The history of AI: Dartmouth conference, 1956
* Doom period: NN papers could not get published at ML conferences (Yann LeCun)

1. Generation 1: designed classifiers
   Feature selection + dimensionality reduction + classifier
   Mathematical model: an NN is a feedforward graph (contains no loops). Each neuron computes a nonlinear activation function: the sigmoid (a monotonically increasing function).
   LeCun's argument: feature selection is the bottleneck (it cannot be automated).
2. Generation 2: DNN
   End-to-end: features are learned. Features + classifier = DNN (deep learning)
   Success factors: GPUs, large databases
   **NNs rarely run into the dimensionality-reduction and overfitting problems**
3. Generation 3: learning/simplifying the DNN architecture
   ResNet (high performance & high power) vs. MobileNet (lower performance & lower power)
   ResNet: modular design
   MobileNet: pruning and quantization (e.g., continuous to discrete) and simplified convolutions

Problems of DNNs:
1. Black box (hard to fix if the result is bad)
2. Hard to interpret
3. If you want to understand the mathematical analysis, classification problems are not the recommended place to start

* Convex optimization
* Half-plane cutting vs. GAN
* Tree-partition theorem

Future work
* DNNs with feedback links (time-varying functions)

Advice and resources
* Originality matters in academia
* Read the key (original) papers; quality over quantity
* Linear algebra, [Convex Optimization](https://web.stanford.edu/class/ee364a/lectures.html)
* [Nielsen's book on neural networks](http://neuralnetworksanddeeplearning.com/)

---

## Applications combining spatial information and health data - 詹大千
[PDF](http://www3.stat.sinica.edu.tw/camp2020/files/02_%E8%A9%B9%E5%A4%A7%E5%8D%83%E6%91%98%E8%A6%81.pdf)

* Data visualization is important for statistical research
* 3S: GIS, GPS, Remote Sensing
* Taiwan's National Health Insurance data does not cover behavioral records such as smoking
* Spatio-temporal data visualization in health care

---

## Causal Inference - 黃彥棕
[PDF](http://www3.stat.sinica.edu.tw/camp2020/files/03_%E9%BB%83%E5%BD%A5%E6%A3%95%E6%8A%95%E5%BD%B1%E7%89%87.pdf)

The sharp causal null hypothesis holds if $Y(s=1) = Y(s=0)$ for every subject.

* Fundamental problem of causal inference
  Individual causal effects cannot be determined! But the problem can be approached statistically.
  If $P(Y(s=0)) \neq P(Y(s=1))$, an **average causal effect** exists.
* No association (correlation) = independence
  **Association != causation**
* AI in causal inference
  Association can be used for prediction, but is not necessarily causation.
  So far, AI focuses more on prediction than on causation.

---

## So you want to work with data - 楊振翔
[PDF](http://www3.stat.sinica.edu.tw/camp2020/files/04_%E6%A5%8A%E6%8C%AF%E7%BF%94%E6%91%98%E8%A6%81.pdf)

1. Data collection and access
2. Data processing
   Garbage in, garbage out
3. Data analysis
   The dimension of the data may far exceed the sample size

* Data integration
  Vertical integration: different aspects of the same cohort
  Horizontal integration: the same aspects of different cohorts

---

## Data driven discovery and Storytelling using Visualization - 馬匡六
[PDF](http://www3.stat.sinica.edu.tw/camp2020/files/05_%E9%A6%AC%E5%8C%A1%E5%85%AD%E6%91%98%E8%A6%81.pdf)

* Data has become an asset
* Anscombe's Quartet (same summary statistics, but distinct datasets)
  ![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/1200px-Anscombe%27s_quartet_3.svg.png)
* Data visualization might be misleading
* Storytelling comics
  ![XKCD.com](http://imgs.xkcd.com/comics/movie_narrative_charts_large.png)
  [Ref](https://xkcd.com/)

1. E-commerce data visualization
   Visualizing parallel matrix multiplication
2. High-dimensional data
   Dimensionality reduction, e.g., PCA, LDA, t-SNE, UMAP, etc.
3. Contrastive learning
   * cPCA (Contrastive Principal Component Analysis)
     Limits of cPCA: 1. Not suitable for categorical data 2. Only for binary data
   * ccPCA (Contrasting Clusters in PCA)
4. Network data visualization
   * Sensitivity analysis

---

## A glimpse of online learning - 李彥寰
[PDF](http://www3.stat.sinica.edu.tw/camp2020/files/06_%E6%9D%8E%E5%BD%A5%E5%AF%B0%E6%8A%95%E5%BD%B1%E7%89%87.pdf)

### Individual sequence prediction
* Optimal answers to referendums
* Worst-case formulation
* Sequential model
  Prediction space := $\Gamma = \{0,1\}$ or $\Delta \subset \mathbb{R}^2$
  Outcome space := $\Omega = \{0,1\}$
  Loss function := $\lambda : \Gamma \times \Omega \rightarrow \mathbb{R}$
  At each round the algorithm predicts from previously revealed data; the new outcome is then revealed and the algorithm is updated.
* Regret
* Online-to-batch conversion
* Mixture forecaster with log loss $\lambda(\gamma, \omega) := -\log\gamma(\omega)$
* Universal coding, MDL (a minimal sketch of both mixtures follows this list)
  1. Uniform prior: Laplace mixture
  2. Dirichlet prior: Krichevsky-Trofimov mixture
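To make the two mixtures above concrete, here is a minimal sketch (not from the slides) of the Laplace and Krichevsky-Trofimov forecasters on a binary outcome sequence, evaluated with the log loss $\lambda(\gamma, \omega) = -\log\gamma(\omega)$; the outcome sequence is an arbitrary example.

```python
import numpy as np

def laplace_prob_one(past):
    """Laplace (add-one, uniform-prior) forecast of P(next outcome = 1)."""
    return (sum(past) + 1) / (len(past) + 2)

def kt_prob_one(past):
    """Krichevsky-Trofimov (add-half, Dirichlet(1/2,1/2)-prior) forecast of P(next outcome = 1)."""
    return (sum(past) + 0.5) / (len(past) + 1)

def cumulative_log_loss(forecaster, outcomes):
    """Sum of lambda(gamma, omega) = -log gamma(omega) over the whole sequence."""
    total = 0.0
    for t, omega in enumerate(outcomes):
        p1 = forecaster(outcomes[:t])            # predict before seeing outcome t
        total += -np.log(p1 if omega == 1 else 1.0 - p1)
    return total

outcomes = [1, 0, 1, 1, 1, 0, 1, 1]              # an arbitrary binary sequence
print("Laplace mixture log loss:", cumulative_log_loss(laplace_prob_one, outcomes))
print("KT mixture log loss:     ", cumulative_log_loss(kt_prob_one, outcomes))
```

The regret of such a mixture forecaster is its cumulative log loss minus that of the best fixed Bernoulli forecaster chosen in hindsight, which is the quantity the universal-coding/MDL analysis bounds.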
Two remarks
1. Probability theory is not the only approach to dealing with uncertainty.
2. A Bayes algorithm can have a non-Bayesian, non-probabilistic interpretation and guarantee.

[Recommended Book](https://www.amazon.com/Prediction-Learning-Games-Nicolo-Cesa-Bianchi/dp/0521841089#ace-g9859629705)
[COLT](http://learningtheory.org/colt2020/)

---

## Statistical approach to dimension reduction - 黃名鉞
[PDF](http://www3.stat.sinica.edu.tw/camp2020/files/07_%E9%BB%83%E5%90%8D%E9%89%9E%E6%91%98%E8%A6%81.pdf)

### Supervised learning
* Regression analysis
* Parametric models (dim = number of parameters) / semiparametric models (the distribution is unknown)
* Nonparametric methods
  Disadvantages:
  1. Hard to visualize
  2. Curse of dimensionality
  3. Overparametrization
* Estimation methods
  1. Pseudo least squares estimation
  2. Pseudo MLE, which retains the classical MLE's properties (optimality)
* Nuisance tangent space

Conclusion
1. Optimality depends on the criterion of interest
2. Implementation: nonlinear optimization problems
3. When a random sample is not available: survival analysis, causal inference, etc.

---

## Discussion
* Math should not be the only way to prove things
* Statistics should not be the only method for illustrating relationships in data
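Looping back to the dimension-reduction talk above: the pseudo least squares idea can be illustrated with a hypothetical single-index model $Y = g(\beta^\top X) + \varepsilon$ with an unknown link $g$. The sketch below (not from the slides; the data-generating process, bandwidth, and optimizer are all illustrative assumptions) profiles $g$ out with a kernel smoother and then minimizes the resulting residual sum of squares over the index direction $\beta$.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical single-index model Y = g(beta' X) + noise, with g unknown.
rng = np.random.default_rng(0)
n, p = 300, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0])
beta_true /= np.linalg.norm(beta_true)                # direction identified up to sign/scale
Y = np.sin(X @ beta_true) + 0.1 * rng.normal(size=n)

def kernel_smooth(index, y, h=0.3):
    """Nadaraya-Watson estimate of E[Y | index] at each observed index value."""
    d = (index[:, None] - index[None, :]) / h
    w = np.exp(-0.5 * d ** 2)
    return (w @ y) / w.sum(axis=1)

def pseudo_rss(b):
    """Profile out g with the smoother, then return the residual sum of squares."""
    b = b / np.linalg.norm(b)
    g_hat = kernel_smooth(X @ b, Y)
    return np.sum((Y - g_hat) ** 2)

fit = minimize(pseudo_rss, x0=np.ones(p), method="Nelder-Mead")
beta_hat = fit.x / np.linalg.norm(fit.x)
print("true direction:     ", np.round(beta_true, 3))
print("estimated direction:", np.round(beta_hat, 3))  # may differ by an overall sign
```

The inner smoothing step makes the criterion non-convex in $\beta$, which is exactly the kind of nonlinear optimization problem the talk's conclusion points to.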