Data Scientist Skills Preparation
===

###### tags: `Templates` `Meeting`

:::info
- **Machine Learning**
    - Supervised
    - Unsupervised
    - Reinforcement Learning
    - Ensemble Learning
    - Transfer Learning
    - CNN, RNN(LSTM)
- **Statistics and Probability**
    - Maximum Likelihood Estimation
    - Expectation-maximization Algorithm
    - Bayesian Methods
- **Coding Techniques**
    - Python
    - ML-related Packages
- **Database**
    - SQL
    - MongoDB
- **Distributed Platform**
    - Hadoop
    - Spark
    - Kafka
- **Project Experiences**
- **Advanced Topics**
:::

:computer: Machine Learning
---

### Supervised
- Random forest
    - Each tree is trained on a random subset of the data, and each split chooses among a random subset of the features
    - Feature selection (permutation test: randomize the values of one feature and see whether the error gets higher)
- Support Vector Machine
    - Reference: https://medium.com/@chih.sheng.huang821/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E6%94%AF%E6%92%90%E5%90%91%E9%87%8F%E6%A9%9F-support-vector-machine-svm-%E8%A9%B3%E7%B4%B0%E6%8E%A8%E5%B0%8E-c320098a3d2e
- Linear Regression
- Logistic Regression
- Gradient Boosting
    - Given a trained model f(x), learn a new model h(x) on the remaining error so that f(x) + h(x) ≈ y

### Unsupervised
- Clustering
    - k-means
    - hierarchical clustering
    - mixture models
    - DBSCAN
        - able to find non-linearly separable clusters
- Dimension Reduction
    - Autoencoder
    - Singular Value Decomposition
    - Principal Component Analysis
    - Linear Discriminant Analysis
    - t-SNE (non-linear):
        - Approximates the high-dimensional data with a Gaussian probability density, while the low-dimensional data is approximated with a t-distribution.
          The similarity between the two distributions is measured with the KL divergence, and gradient descent is then used to find the best solution.
        - Reference: https://medium.com/d-d-mag/%E6%B7%BA%E8%AB%87%E5%85%A9%E7%A8%AE%E9%99%8D%E7%B6%AD%E6%96%B9%E6%B3%95-pca-%E8%88%87-t-sne-d4254916925b

### Reinforcement Learning
- Value-based
    - Predict the expected cumulative reward for each action in each state
- Policy-based
    - Predict a distribution from which the action is sampled, without calculating the cumulative reward
- Hybrid
    - Actor-Critic (Actor: policy-based, Critic: value-based)
- **Model-based**
    - Simulate the environment thousands of times to collect training data

### Ensemble Learning
- Bagging (each model is trained on a subset of the data)
    ![](https://i.imgur.com/K1uEEEj.png)
- Boosting (each iteration focuses on the data misclassified so far)
    ![](https://i.imgur.com/drrnBUA.png)
    - **AdaBoost**
- Reference: https://medium.com/@chih.sheng.huang821/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-ensemble-learning%E4%B9%8Bbagging-boosting%E5%92%8Cadaboost-af031229ebc3

### Transfer Learning
- Use a pre-trained model as the starting point for learning new data

### CNN, RNN(LSTM)
- Convolutional Neural Network
    - Convolution Layer
        - Kernel (filter): kernel size
        - Channels: feature size in a specific region (e.g.
          an RGB image contains 3 channels)
        - Stride: how far the kernel moves at each step
        - Padding: how much padding is added around the input before convolution
    - Pooling Layer
        - Max-pooling
        - Mean-pooling
        - Stochastic-pooling
    - Case Study
        - U-Net: MRI Segmentation
- Recurrent Neural Network

### Evaluation Metrics
- Classification
    - Confusion Matrix
        ![](https://i.imgur.com/dnPAiSq.png)
    - ROC Curve (example: predict who is actually sick and who is actually healthy)
        ![](https://i.imgur.com/hkkmkGp.png)
    - Reference: https://www.ycc.idv.tw/confusion-matrix.html
- Regression
    - Explained Variance Score
    - R2 Score
    - MSE
    - ASE
- Clustering

:1234: Statistics and Probability
---

### Maximum Likelihood Estimation
![](https://i.imgur.com/VTjgGVY.png)
![](https://i.imgur.com/lbglTjt.png)
- Reference:
    - https://www.youtube.com/watch?v=XepXtl9YKwc
    - https://www.youtube.com/watch?v=pYxNSUDSFH4

### Expectation-maximization Algorithm
- Reference: https://www.coursera.org/lecture/bayesian-methods-in-machine-learning/expectation-maximization-algorithm-Fm3mY?fbclid=IwAR18mjNoWokwVL3vMdRAzkFraOSIuQ6VS2vEhquq1CZvuPQLoxmfdNutXuI

### Bayesian Methods

:bug: Coding Techniques
---

### Python

### ML-related Packages

:house_with_garden: Database
---

### SQL

### MongoDB

:oncoming_bus: Distributed Platform
---

### Hadoop
- Hadoop Distributed File System (HDFS): storage
- MapReduce (has to store each computation result back to HDFS, which means a lot of reads and writes)

### Spark
- Distributed Memory (faster than MapReduce)
- Resilient Distributed Dataset (RDD)
    - Three crucial components
        - Partitions
        - Dependencies on parent RDDs
        - Function to compute a partition given its parent RDD
    - Transformation
    - Action
- Fault-Tolerance Mechanism (Lineage & Checkpoint)
    - No data duplication; a lost partition can be recomputed from its parent RDD
    - Design a checkpoint to record the result
- Reference: http://yjhyjhyjh0.pixnet.net/blog/post/411468760-spark-rdd-%28resilient-distributed-datasets%29-%E8%A9%B3%E7%B4%B0%E5%9C%96%E6%96%87%E4%BB%8B

### Kafka

### Big Data Pipeline
- Components
    - The messaging system
    - Message distribution support to various nodes for further data processing
    - Data analysis system to derive decisions from data
    - Data storage system to store results and related information
    - Data representation and reporting tools and alerts system
- Parameters
    - Compatible with big data
    - Low latency
    - Scalability
    - Diversity: it can handle various use cases
    - Flexibility
    - Economical
- Roles of Kafka, Spark, Hadoop
    - Kafka: works as the input system (messaging system)
    - Spark: ingests and processes data in real time through its streaming APIs
    - Hadoop: provides an ecosystem for Spark and Kafka to run on top of. It provides persistent data storage through HDFS, and security features that cover Kafka and Spark

### Build Big Data Pipeline with Apache Hadoop, Apache Spark, and Apache Kafka
- Lambda Architecture
    - Three Purposes
        - Ingest
        - Process
        - Query real-time and batch data
    - Batch Layer: MapReduce
        - managing historical data
        - recomputing results such as machine learning models (most accurate, but with high latency)
    - Speed Layer (real-time): Spark; the Batch Layer helps in case of data errors
    - Serving Layer: NoSQL
    - Data Storage: Hadoop
    ![](https://i.imgur.com/WCkGBHT.png)
- Kappa Architecture
    - No Batch Layer (avoids maintaining two separate code bases)
    - Apache Hadoop provides the ecosystem for Apache Spark and Apache Kafka.
    ![](https://i.imgur.com/VRZh9gf.png)
- Reference
    - https://www.whizlabs.com/blog/real-time-big-data-pipeline/
    - https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb
    - https://www.ericsson.com/en/blog/2015/11/data-processing-architectures--lambda-and-kappa

:books: Project Experiences
---

:star2: Advanced Topics
---

### Latent Dirichlet Allocation
- Reference: https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/

### VAE

### GAN

### Conditional Generator

### Discriminator Type

### Kernel Function

### Time Series K-fold Validation
- Reference: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
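
- Sketch: ordinary k-fold shuffles samples, which lets a model train on the future and test on the past. For time series, the usual alternative is an expanding-window (walk-forward) split, where every test fold comes strictly after its training data. The helper below is a minimal pure-Python illustration of that idea. The name `expanding_window_splits` and its parameters are hypothetical and not taken from the reference above; scikit-learn's `TimeSeriesSplit` serves the same purpose.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs in chronological order.

    Hypothetical helper for illustration only. Samples are assumed to be
    sorted by time, so index i is observed before index i + 1.
    """
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    # Every test fold has the same size; leftover samples go to the first train set.
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(n_splits):
        # The training window grows with each split and always ends
        # right before the test window starts.
        train_end = n_samples - (n_splits - i) * test_size
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test


if __name__ == "__main__":
    for train, test in expanding_window_splits(n_samples=10, n_splits=3):
        print("train:", train, "test:", test)
```

With 10 samples and 3 splits, the test folds are `[4, 5]`, `[6, 7]`, and `[8, 9]`, and each training set contains every index before its test fold. A model would be refit on each training set and scored on the matching test fold, then the scores averaged.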