Data Scientist Skills Preparation
===
###### tags: `Templates` `Meeting`
:::info
- **Machine Learning**
- Supervised
- Unsupervised
- Reinforcement Learning
- Ensemble Learning
- Transfer Learning
- CNN, RNN(LSTM)
- **Statistics and Probability**
- Maximum Likelihood Estimation
- Expectation-maximization Algorithm
- Bayesian Methods
- **Coding Techniques**
- Python
    - ML-related Packages
- **Database**
- SQL
- MongoDB
- **Distributed Platform**
- Hadoop
- Spark
- Kafka
- **Project Experiences**
- **Advanced Topics**
:::
:computer: Machine Learning
-
### Supervised
- Random forest
    - Trains each tree on a bootstrap sample of the data and, at each split, chooses the best feature from a random subset of features
    - Feature selection (permutation test: randomly shuffle the values of one feature and check whether the error increases)
- Support Vector Machine
- Reference: https://medium.com/@chih.sheng.huang821/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E6%94%AF%E6%92%90%E5%90%91%E9%87%8F%E6%A9%9F-support-vector-machine-svm-%E8%A9%B3%E7%B4%B0%E6%8E%A8%E5%B0%8E-c320098a3d2e
- Linear Regression
- Logistic Regression
- Gradient Boosting
    - Given a trained model f(x), learn a new model h(x) so that f(x)+h(x) fits y; each new model is trained on the residual of the current ensemble (a minimal sketch follows this list)
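A minimal sketch of the residual-fitting idea above, assuming NumPy and scikit-learn are available; the toy data and the hyper-parameters `n_rounds` and `learning_rate` are made up for illustration:

```python
# Minimal gradient-boosting sketch for squared error: each new tree h(x)
# is fit to the residuals y - f(x) of the current ensemble f(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

n_rounds, learning_rate = 50, 0.1   # illustrative hyper-parameters
f = np.full_like(y, y.mean())       # start from a constant model
trees = []
for _ in range(n_rounds):
    residual = y - f                                  # what the ensemble still gets wrong
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    f += learning_rate * h.predict(X)                 # f(x) <- f(x) + lr * h(x)
    trees.append(h)

print("training MSE:", np.mean((y - f) ** 2))
```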
### Unsupervised
- Clustering
- k-means
    - hierarchical clustering
    - mixture models
- DBSCAN
        - able to find non-linearly separable clusters
- Dimension Reduction
- Autoencoder
- Singular value decomposition
    - Principal Component Analysis
- Linear Discriminant Analysis
- t-SNE (non-linear):
        - models pairwise similarities of the high-dimensional data with a Gaussian distribution and the low-dimensional embedding with a Student's t-distribution; the two similarity distributions are matched by minimizing the KL divergence with gradient descent (a small scikit-learn sketch follows this list)
- Reference: https://medium.com/d-d-mag/%E6%B7%BA%E8%AB%87%E5%85%A9%E7%A8%AE%E9%99%8D%E7%B6%AD%E6%96%B9%E6%B3%95-pca-%E8%88%87-t-sne-d4254916925b
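A small sketch contrasting linear and non-linear reduction, assuming scikit-learn is installed; the digits dataset and the `perplexity=30` setting are just illustrative choices:

```python
# Reduce 64-dimensional digit images to 2-D with linear PCA and with
# non-linear t-SNE, which minimizes the KL divergence between the
# high-dimensional (Gaussian) and low-dimensional (Student-t) similarities.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(X)

print("PCA embedding shape:  ", X_pca.shape)   # (1797, 2)
print("t-SNE embedding shape:", X_tsne.shape)
```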
### Reinforcement Learning
- Value-based
    - Predicts the expected cumulative reward for each action in each state (a tabular Q-learning sketch follows this list)
- Policy-based
    - Predicts a distribution over actions to sample from, without explicitly computing the cumulative reward
- Hybrid
    - Actor-Critic (Actor: policy-based, Critic: value-based)
- **Model-based**
    - Simulate the environment thousands of times to collect training data
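A minimal value-based sketch, assuming only NumPy: tabular Q-learning on a made-up 5-state chain environment (the environment, rewards, and hyper-parameters are invented for illustration):

```python
# Tabular Q-learning on a toy 5-state chain. Q[s, a] estimates the expected
# cumulative (discounted) reward of taking action a in state s and acting
# greedily afterwards.
import numpy as np

n_states, n_actions = 5, 2          # toy chain: action 0 = left, 1 = right
alpha, gamma, eps = 0.1, 0.9, 0.3   # illustrative learning/discount/exploration rates
Q = np.zeros((n_states, n_actions))
rng = np.random.RandomState(0)

def step(s, a):
    """Move left/right on the chain; reward 1 only for reaching the last state."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for episode in range(300):
    s, done = 0, False
    while not done:
        a = rng.randint(n_actions) if rng.rand() < eps else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # TD update toward reward + discounted value of the best next action
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() * (not done) - Q[s, a])
        s = s_next

print(np.round(Q, 2))   # action 1 (right) should dominate in states 0-3 (state 4 is terminal)
```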
### Ensemble Learning
- Bagging (train each base model on a bootstrap subset of the data)

- Boosting (at every iteration, focus on the data misclassified so far)

    - **AdaBoost** (a scikit-learn sketch follows the reference below)
- Reference: https://medium.com/@chih.sheng.huang821/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-ensemble-learning%E4%B9%8Bbagging-boosting%E5%92%8Cadaboost-af031229ebc3
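A short sketch of the two flavours with scikit-learn; the synthetic dataset and estimator settings are illustrative, not a recommendation:

```python
# Bagging (independent trees on bootstrap samples) vs. AdaBoost
# (sequential stumps that re-weight misclassified points).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                            n_estimators=100, random_state=0)
adaboost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=100, random_state=0)

print("bagging  accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("adaboost accuracy:", cross_val_score(adaboost, X, y, cv=5).mean())
```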
### Transfer Learning
- Reuse a model pre-trained on a large dataset and fine-tune it on new data
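A minimal fine-tuning sketch, assuming PyTorch and torchvision (>= 0.13 for the `weights=` API) are installed; the 5-class target task and the dummy batch are hypothetical:

```python
# Reuse an ImageNet-pretrained ResNet-18, freeze its feature extractor,
# and train only a new classification head for a hypothetical 5-class task.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in model.parameters():                 # freeze the pretrained backbone
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 5)    # new head for 5 target classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One dummy training step on random data, just to show the fine-tuning loop.
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, 5, (8,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()                                  # only the new head gets gradients
optimizer.step()
print("loss:", loss.item())
```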
### CNN, RNN(LSTM)
- Convolutional Neural Network (a minimal sketch follows this list)
- Convolution Layer
- Kernel(filter): kernel size
        - Channels: number of feature maps at each position (e.g. an RGB image has 3 input channels)
        - Stride: how far the kernel moves at each step
        - Padding: how many border values are added around the input before convolution
- Pooling Layer
- Max-pooling
- Mean-pooling
- Stochastic-pooling
- Case Study
- U-Net: MRI Segmentation
- Recurrent Neural Network
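A minimal CNN sketch in PyTorch tying the terms above together (channels, kernel size, stride, padding, max-pooling); the 32x32 input size and layer widths are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # 3 input channels (RGB) -> 16 feature maps, 3x3 kernel,
        # stride 1, padding 1 keeps the spatial size unchanged.
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2)       # halves height and width
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(32 * 8 * 8, n_classes)    # assumes 32x32 inputs

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))      # 32x32 -> 16x16
        x = self.pool(torch.relu(self.conv2(x)))      # 16x16 -> 8x8
        return self.fc(x.flatten(start_dim=1))

model = TinyCNN()
print(model(torch.randn(4, 3, 32, 32)).shape)         # torch.Size([4, 10])
```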
### Evaluation Metrics
- Classification
- Confusion Matrix

    - ROC Curve (plots the true-positive rate against the false-positive rate across thresholds, e.g. how well a test separates people who are actually sick from people who are actually healthy); a short scikit-learn sketch follows this list

    - Reference: https://www.ycc.idv.tw/confusion-matrix.html
- Regression
- Explained Variance Score
- R2 Score
- MSE
- ASE
- Clustering
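A short sketch of the classification and regression metrics above with scikit-learn, on made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, explained_variance_score,
                             mean_squared_error, r2_score, roc_auc_score)

# Classification: true labels, hard predictions, and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1])

print(confusion_matrix(y_true, y_pred))        # rows = true class, columns = predicted class
print("AUC:", roc_auc_score(y_true, y_prob))   # area under the ROC curve

# Regression: compare continuous predictions against targets.
t = np.array([3.0, -0.5, 2.0, 7.0])
p = np.array([2.5,  0.0, 2.0, 8.0])
print("explained variance:", explained_variance_score(t, p))
print("R2:", r2_score(t, p))
print("MSE:", mean_squared_error(t, p))
```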
:1234: Statistics and Probability
-
### Maximum Likelihood Estimation


- Reference:
    - https://www.youtube.com/watch?v=XepXtl9YKwc
    - https://www.youtube.com/watch?v=pYxNSUDSFH4
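A minimal MLE sketch, assuming NumPy and SciPy: estimate a Gaussian's mean and standard deviation by numerically maximizing the log-likelihood, then compare with the closed-form answer:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.RandomState(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)   # "unknown" true parameters

def negative_log_likelihood(params):
    mu, log_sigma = params                  # optimize log(sigma) to keep sigma > 0
    return -norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(negative_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print("numerical MLE:", mu_hat, sigma_hat)
print("closed form  :", data.mean(), data.std())   # MLE of sigma uses ddof=0
```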
### Expectation-maximization Algorithm
- Reference: https://www.coursera.org/lecture/bayesian-methods-in-machine-learning/expectation-maximization-algorithm-Fm3mY?fbclid=IwAR18mjNoWokwVL3vMdRAzkFraOSIuQ6VS2vEhquq1CZvuPQLoxmfdNutXuI
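A hand-rolled EM sketch for a two-component 1-D Gaussian mixture, assuming NumPy and SciPy; the data and the initial guesses are made up for illustration:

```python
# E-step: compute responsibilities; M-step: re-estimate means, stds, weights.
import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

mu = np.array([-1.0, 1.0])        # illustrative initial guesses
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(100):
    # E-step: responsibility of each component for each point.
    dens = pi * norm.pdf(data[:, None], mu, sigma)       # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted re-estimation of the parameters.
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(data)

print("means:", mu, "stds:", sigma, "weights:", pi)
```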
### Bayesian Methods
:bug: Coding Techniques
-
### Python
### ML-related Packages
:house_with_garden: Database
-
### SQL
### MongoDB
:oncoming_bus: Distributed Platform
-
### Hadoop
- Hadoop Distributed File System(HDFS): storage
- MapReduce (intermediate results must be written back to HDFS, so each job incurs many reads and writes; see the word-count sketch below)
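A pure-Python sketch of the map/shuffle/reduce idea behind a Hadoop word count (no cluster involved; on real Hadoop each phase reads from and writes back to HDFS):

```python
from collections import defaultdict

documents = ["spark and hadoop", "hadoop stores data", "spark processes data"]

# Map phase: emit one (word, 1) pair per word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each key's values.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'spark': 2, 'and': 1, 'hadoop': 2, ...}
```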
### Spark
- Distributed in-memory computation (faster than disk-based MapReduce)
- Resilient Distributed Dataset(RDD)
- Three crucial components
- Partitions
- Dependencies on parent RDDs
        - Function to compute a partition given its parent RDDs
- Transformation
- Action
    - Fault-tolerance mechanism (lineage & checkpoint)
        - No data replication; a lost partition is recomputed from its parent RDDs via the lineage
        - A checkpoint materializes the result so recovery does not have to replay the whole lineage (see the PySpark sketch after this list)
- Reference: http://yjhyjhyjh0.pixnet.net/blog/post/411468760-spark-rdd-%28resilient-distributed-datasets%29-%E8%A9%B3%E7%B4%B0%E5%9C%96%E6%96%87%E4%BB%8B
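A minimal PySpark sketch of lazy transformations, actions, and checkpointing, assuming a local `pyspark` installation; the checkpoint directory is an arbitrary example path:

```python
# Transformations on an RDD are lazy and only build the lineage;
# an action (collect/reduce/count) triggers the actual computation.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

rdd = sc.parallelize(range(10), numSlices=4)      # 4 partitions
squared = rdd.map(lambda x: x * x)                # transformation: nothing runs yet
even = squared.filter(lambda x: x % 2 == 0)       # another lazy transformation

print(even.collect())                             # action: [0, 4, 16, 36, 64]
print(even.reduce(lambda a, b: a + b))            # action: 120

# Lineage lets Spark recompute lost partitions; a checkpoint materializes
# the RDD so recovery does not have to replay the whole lineage.
sc.setCheckpointDir("/tmp/spark-checkpoints")     # example path
even.checkpoint()
even.count()                                      # forces evaluation, writing the checkpoint

sc.stop()
```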
### Kafka
### Big Data Pipeline
- Components
- The messaging system
        - Distributes messages to various nodes for further data processing
- Data analysis system to derive decisions from data.
- Data storage system to store results and related information.
- Data representation and reporting tools and alerts system.
- Parameters
- Compatible with big data
- Low latency
- Scalability
    - Diversity: able to handle a variety of use cases
- Flexibility
    - Economical (cost-effective)
- Roles of Kafka, Spark, Hadoop
    - Kafka: works as the input/messaging system
    - Spark: ingests and processes the data in (near) real time through its streaming APIs (a minimal Structured Streaming sketch follows this list)
    - Hadoop: provides the ecosystem that Spark and Kafka run on top of; it offers persistent storage through HDFS and security features that cover Kafka and Spark
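A minimal Structured Streaming sketch of these roles, assuming PySpark with the `spark-sql-kafka` connector on the classpath; the broker address, topic name, and windowing choice are hypothetical:

```python
# Spark Structured Streaming ingests messages from a Kafka topic and
# continuously aggregates them; results could be persisted downstream.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = (SparkSession.builder
         .appName("kafka-spark-pipeline-sketch")
         .getOrCreate())

# Kafka plays the messaging/input-system role.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")   # hypothetical broker
          .option("subscribe", "clickstream")                    # hypothetical topic
          .load())

# Spark processes the stream in (near) real time: count events per minute.
counts = (events
          .selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")      # a real pipeline would write to HDFS or a serving store
         .start())
query.awaitTermination()
```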
### Build Big Data Pipeline with Apache Hadoop, Apache Spark, and Apache Kafka
- Lambda Architecture
- Three Purposes
- Ingest
- Process
- Query real-time and batch data
    - Batch Layer: MapReduce
        - manages historical data
        - recomputes results such as machine learning models (most accurate, but with high latency)
    - Speed Layer (real-time): Spark; the Batch Layer later corrects the results in case of data errors
- Serving Layer: NoSQL
- Data Storage: Hadoop

- Kappa Architecture
    - No Batch Layer (avoids maintaining two separate code bases)
- Apache Hadoop provides the eco-system for Apache Spark and Apache Kafka.

- Reference
- https://www.whizlabs.com/blog/real-time-big-data-pipeline/
- https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb
- https://www.ericsson.com/en/blog/2015/11/data-processing-architectures--lambda-and-kappa
:books: Project Experiences
-
:star2: Advanced Topics
-
### Latent Dirichlet Allocation
- Reference: https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/
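A tiny topic-modeling sketch using scikit-learn's `LatentDirichletAllocation`; the corpus and the choice of `n_components=2` topics are made up for illustration:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock market fell as investors sold shares",
    "the team won the game with a late goal",
    "shares rallied after strong quarterly earnings",
    "the coach praised the players after the match",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                     # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}:", top)                          # top words per topic

print(lda.transform(X).round(2))                       # per-document topic mixture
```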
### VAE
### GAN
### Conditional Generator
### Discriminator Type
### Kernel Function
### Time Series K-fold Validation
- Reference: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
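A minimal sketch of the idea with scikit-learn's `TimeSeriesSplit`; the 12-point series and `n_splits=4` are arbitrary:

```python
# Each fold trains only on the past and tests on the future,
# so there is no look-ahead leakage (unlike shuffled k-fold).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 time-ordered observations
y = np.arange(12)

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```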