Machine Learning Interview Questions
===

## Links
https://towardsdatascience.com/a-beginners-guide-to-the-data-science-pipeline-a4904b2d8ad3
https://github.com/MohamedSondo/MLQuestions

## How IBM does data science consulting
https://www.youtube.com/watch?v=GU2AIgf-6SU&list=RDCMUCEBpSZhI1X8WaP-kY_2LLcg&index=17

## Path for becoming a data scientist in 2020
https://www.youtube.com/watch?v=UXi8Ml2UoYk

![](https://i.imgur.com/WMwpaXs.jpg)
![](https://i.imgur.com/gilmekK.jpg)

# Probability and Statistics
https://hackmd.io/aRcpusRsTmeH6ktieO9Mkg
https://hackmd.io/lZe4l3NjT9GjLw9SxSI4kA

## Data science workflow
![](https://i.imgur.com/7rb5SwR.png)
![](https://i.imgur.com/fSwS0hX.png)

## OSEMN pipeline
- O — Obtaining our data
- S — Scrubbing / cleaning our data
- E — Exploring / visualizing our data, which allows us to find patterns and trends
- M — Modeling our data, which gives us our predictive power
- N — iNterpreting our data

## Business question
Before we even begin the OSEMN pipeline, the most crucial step is understanding what problem we are trying to solve. This bears repeating: before doing anything with "data science", we must first consider the problem we are trying to solve. If you have a small problem to solve, at most you will get a small solution. If you have a big problem to solve, you have the possibility of a big solution. Ask yourself:

- *How can we translate data into dollars?*
- *What impact do I want to make with this data?*
- *What business value does our model bring to the table?*
- *What will save us lots of money?*
- *What can be done to make our business run more efficiently?*

-----------

# Statistics

## What does the p-value signify about the statistical data?
- The p-value is used to determine the significance of results after a hypothesis test. It helps the reader draw conclusions and is always between 0 and 1.
- p-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
- p-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
- p-value = 0.05 is the marginal value, indicating it is possible to go either way.

## What is the goal of A/B testing?
A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example could be identifying which version of a banner ad has the higher click-through rate.

![](https://i.imgur.com/V1kkFUg.png)
![](https://i.imgur.com/ZzA2Hvn.png)
![](https://i.imgur.com/tZvB7KC.png)
![](https://i.imgur.com/mHAKGlg.png)

--------------

# Machine Learning

![](https://i.imgur.com/90lqbd7.png)

## Can you explain the differences between supervised, unsupervised, and reinforcement learning?
- **In supervised learning**, we train a model to learn the relationship between input data and output data. We need labeled data to be able to do supervised learning.
![](https://i.imgur.com/aXJqZGX.png)
- **With unsupervised learning**, we only have unlabeled data. The model learns a representation of the data. Unsupervised learning is frequently used to initialize the parameters of a model when we have a lot of unlabeled data and only a small fraction of labeled data: we first train an unsupervised model and then use its weights to train a supervised model.
- **In reinforcement learning**, the model receives input data and a reward that depends on its output. The model learns a policy that maximizes the reward. Reinforcement learning has been applied successfully to strategic games such as Go and even classic Atari video games.

![](https://i.imgur.com/FKe5HBR.png)
![](https://i.imgur.com/ExbFDRh.jpg)
![](https://i.imgur.com/ot0ZOON.png)

## How to avoid overfitting?
![](https://i.imgur.com/1a6Dlor.png)

## What is regularization, why do we use it, and what are some common methods?
Regularization is a technique that discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. Examples:
- Ridge (L2 norm)
- Lasso (L1 norm)

The obvious disadvantage of ridge regression is model interpretability. It shrinks the coefficients of the least important predictors very close to zero, but it never makes them exactly zero; in other words, the final model includes all predictors. In the case of the lasso, however, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly zero when the tuning parameter λ is sufficiently large. The lasso therefore also performs variable selection and is said to yield sparse models.

## What's the difference between L1 and L2 regularization?
Regularization is a very important technique in machine learning for preventing overfitting. Mathematically, it adds a penalty term to the loss so that the coefficients cannot fit the training data too perfectly. The difference is that L2 (ridge) adds the sum of the squares of the weights, while L1 (lasso) adds the sum of the absolute values of the weights, to MSE or another loss function. As follows:

![](https://i.imgur.com/RDqze3c.png)
![](https://i.imgur.com/Y4ee4Fd.png)

## What is an imbalanced dataset? Can you list some ways to deal with it?
An imbalanced dataset is one that has very different proportions of target categories. For example, a dataset with medical images where we have to detect some illness will typically have many more negative samples than positive samples—say, 98% of images without the illness and 2% with it. There are different options to deal with imbalanced datasets:

**Oversampling or undersampling**.
Instead of sampling with a uniform distribution from the training dataset, we can use other distributions so the model sees a more balanced dataset.

**Data augmentation**. We can add data in the less frequent categories by modifying existing data in a controlled way. In the example dataset, we could flip the images with illnesses, or add noise to copies of the images in such a way that the illness remains visible.

**Using appropriate metrics**. In the example dataset, a model that always made negative predictions would achieve an accuracy of 98%. Metrics such as precision, recall, and F-score describe the performance of the model better when the dataset is imbalanced.

## Handling missing values
![](https://i.imgur.com/sGhofsr.png)

## What is stratified cross-validation and when should we use it?
- Cross-validation is a technique for dividing data between training and validation sets. In typical cross-validation this split is done randomly, but in stratified cross-validation the split preserves the ratio of the categories in both the training and validation datasets.
- For example, if we have a dataset with 10% of category A and 90% of category B and we use stratified cross-validation, we will have the same proportions in training and validation. With simple cross-validation, in the worst case we may find that there are no samples of category A in the validation set.
- Stratified cross-validation may be applied in the following scenarios:
  - On a dataset with multiple categories. The smaller the dataset and the more imbalanced the categories, the more important it is to use stratified cross-validation.
  - On a dataset with data from different distributions. For example, a dataset for autonomous driving may have images taken during the day and at night.
If we do not ensure that both types are present in training and validation, we will have generalization problems.

## Stages of building a machine learning model
![](https://i.imgur.com/WY5Gw04.png)

## Confusion matrix
![](https://i.imgur.com/Fy9vrAQ.png)

## How do you combat the curse of dimensionality?
- Manual feature selection
- Principal Component Analysis (PCA)
- Multidimensional scaling
- Locally linear embedding

## What is precision?
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances.

***Precision = true positives / (true positives + false positives)***

## What is recall?
Recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved out of the total number of relevant instances.

***Recall = true positives / (true positives + false negatives)***

## Define F1-score.
The F1-score is the harmonic mean of precision and recall. It takes both false positives and false negatives into account, and is used to measure a model's performance.

***F1-score = 2 × (precision × recall) / (precision + recall)***

## What's the trade-off between bias and variance?
If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias.
So we need to find the right balance, without overfitting or underfitting the data.

## Naive Bayes rule
- Conditional probability
  - ![](https://i.imgur.com/CkzfpXs.png)
- Bayes' rule
  - ![](https://i.imgur.com/cXc7aIp.png)
  - ![](https://i.imgur.com/kMP7fDe.png)

## Generative algorithm for Bayes
![](https://i.imgur.com/TJRIXEw.png)
- Example
  - ![](https://i.imgur.com/7s6kQ4G.png)

## Logistic regression
![](https://i.imgur.com/dezPaTS.png)

## Cost of logistic regression: derivation
![](https://i.imgur.com/knYutSm.png)

## Using a new convex function with log
![](https://i.imgur.com/Q6sfyff.png)
![](https://i.imgur.com/NS0IIQZ.png)

--------------

# Tree Classifiers
![](https://i.imgur.com/uyh3P9A.png)
![](https://i.imgur.com/c4TzoLV.png)
- Information gain
![](https://i.imgur.com/tyae9Gu.png)
- Example
  - ![](https://i.imgur.com/u1vyrJG.png)
  - ![](https://i.imgur.com/5lT8zbO.png)
  - ![](https://i.imgur.com/WWOEFnN.png)
  - ![](https://i.imgur.com/WQJx9Xy.png)
- Summary of the example
![](https://i.imgur.com/NJQpWnd.png)
- Pruning techniques in decision trees to prevent overfitting
  - ![](https://i.imgur.com/0zbIrer.png)

## Best approach for decision trees
![](https://i.imgur.com/IKRovd5.png)

# Ensemble Methods
- Boosting
  - ![](https://i.imgur.com/cNXp47R.png)
  - ![](https://i.imgur.com/pcukKrm.png)
- AdaBoost
  - ![](https://i.imgur.com/kq9comB.png)
- Bagging
  - ![](https://i.imgur.com/RtDcj2w.png)

![](https://i.imgur.com/rIADeMu.png)

### Random forests
A form of bagging applied to decision trees:
![](https://i.imgur.com/Xihsb4D.png)

## Random forest algorithm
![](https://i.imgur.com/2gYssAS.png)

## Why do ensembles typically have higher scores than individual models?
- An ensemble is the combination of multiple models to create a single prediction. The key idea for making better predictions is that the models should make different errors. That way, the errors of one model are compensated by the correct guesses of the other models, so the score of the ensemble is higher.
We need diverse models to create an ensemble. Diversity can be achieved by:
- Using different ML algorithms. For example, you can combine logistic regression, k-nearest neighbors, and decision trees.
- Using different subsets of the data for training. This is called bagging.
- Giving a different weight to each sample of the training set. If this is done iteratively, weighting the samples according to the errors of the ensemble, it is called boosting.

Many winning solutions to data science competitions are ensembles. However, in real-life machine learning projects, engineers need to find a balance between execution time and accuracy.

## KNN
- Given a data point, we compute its K nearest data points (neighbors) using a certain distance metric (e.g., the Euclidean metric). For classification, we take the majority label of the neighbors; for regression, we take the mean of their label values.
- Note that for KNN we technically don't need to train a model; we simply compute at inference time. This can be computationally expensive, since each test example needs to be compared with every training example to see how close they are. Approximation methods can achieve faster inference by partitioning the training data into regions. Note that when K equals 1 or another small number, the model is prone to overfitting (high variance), while when K equals the number of data points or another large number, the model is prone to underfitting (high bias).

-----------------

# Unsupervised learning

## How does K-means work?
- Given $k$, the K-means algorithm works as follows:
  1. Randomly choose $k$ data points (seeds) to be the initial centroids.
  2. Assign each data point to the closest centroid.
  3. Re-compute (update) the centroids using the current cluster memberships.
  4. If a convergence criterion is not met, go to step 2.
- We can also terminate the algorithm when it reaches an iteration budget, which yields an approximate result.
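The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; all names are my own:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means; X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # 1. Randomly choose k data points (seeds) as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):  # iteration budget instead of only a convergence test
        # 2. Assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-compute each centroid as the mean of its cluster members
        #    (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j]
                                  for j in range(k)])
        # 4. Stop when the centroids no longer move (convergence criterion).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs of points, clustered into k = 2 groups.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5, 0.1, (20, 2))])
centroids, labels = kmeans(X, k=2)
```

In practice a library implementation such as scikit-learn's `KMeans` would be used instead; it also supports the multiple-restart idea via its `n_init` parameter.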
From the pseudo-code of the algorithm, we can see that K-means results can be sensitive to the choice of initial centroids, and hence to the order in which data samples are explored. A sensible practice is to run the analysis several times with randomized object orders, then average the cluster centers of those runs and use the averaged centers as the initial ones for a final run of the analysis.

## K-means vs KNN: the difference
![](https://i.imgur.com/athLcuG.png)

--------------------------

# Deep Learning

## Neural networks
![](https://i.imgur.com/tylfuQQ.png)

## The perceptron algorithm
![](https://i.imgur.com/2nxBnTA.png)
![](https://i.imgur.com/u37HpP3.png)

## Backpropagation
![](https://i.imgur.com/MLpHulz.png)
![](https://i.imgur.com/a8NxVwd.png)
![](https://i.imgur.com/sp3EYbV.png)
![](https://i.imgur.com/LxkASNb.png)

## Backpropagation summary
![](https://i.imgur.com/gXgo9T5.png)

## List different activation neurons or functions.
- Linear neuron
- Binary threshold neuron
- Stochastic binary neuron
- Sigmoid neuron
- Tanh function
- Rectified Linear Unit (ReLU)

## Define learning rate.
The learning rate is a hyperparameter that controls how much we adjust the weights of our network with respect to the loss gradient.

## What is the difference between batch gradient descent and stochastic gradient descent?
- **Batch gradient descent** computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds: we move somewhat directly towards an optimum, either local or global. Additionally, given an annealed learning rate, batch gradient descent will eventually find the minimum located in its basin of attraction.
- **Stochastic gradient descent (SGD)** computes the gradient using a single sample. SGD works well (better than batch gradient descent, at least) for error manifolds that have lots of local maxima/minima.
In this case, the somewhat noisier gradient calculated from the reduced number of samples tends to jerk the model out of local minima into a region that is hopefully more optimal.

## Epoch vs batch vs iteration.
- Epoch: one forward pass and one backward pass over all the training examples
- Batch: the examples processed together in one pass (forward and backward)
- Iterations (per epoch): number of training examples / batch size

## What is the vanishing gradient?
As we add more and more hidden layers, backpropagation becomes less and less useful in passing information to the lower layers. In effect, as information is passed back, the gradients begin to vanish and become small relative to the weights of the network.

## What are dropouts?
Dropout is a regularization technique in which randomly selected neurons are ignored (set to zero) during training, which prevents units from co-adapting and reduces overfitting.

## Define LSTM.
Long Short-Term Memory networks are explicitly designed to address the long-term dependency problem, by maintaining a state of what to remember and what to forget.

## List the key components of LSTM.
- Gates (forget, memory, update & read)
- tanh(x) (values between -1 and 1)
- Sigmoid(x) (values between 0 and 1)

## List the variants of RNN.
- LSTM: Long Short-Term Memory
- GRU: Gated Recurrent Unit
- End-to-end network
- Memory network

## Vanishing gradient problem:
- Gradients are calculated using backpropagation.
- The gradients for early-layer weights (under SGD) become very small, hence "vanishing".
- If the gradient is small, the update is small: the early layers are not doing much for the network, training gets stuck, and we don't get close to the optimal values.
- **Exploding gradients** are the opposite problem: early in the network, larger values get multiplied together and the product explodes. Updating a weight then moves it too far, and the optimum is never reached because the update is too large.
- Dropout is one proposed mitigation: randomly dropping certain units during training.
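What dropout does can be sketched in a few lines of NumPy (the "inverted dropout" variant; a minimal illustration, with names of my own, not tied to any particular framework):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, seed=0):
    """Inverted dropout: randomly zero units during training, and rescale
    the surviving units so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return activations  # at inference time the layer is a no-op
    rng = np.random.default_rng(seed)
    # Keep each unit with probability 1 - p_drop.
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones((4, 8))           # a batch of hidden activations
out = dropout(h, p_drop=0.5)  # dropped units become 0.0, survivors 2.0
```

The rescaling by `1 / (1 - p_drop)` is what lets the same network be used unchanged at inference time.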
## Backpropagation:
- SGD updates a weight by taking the derivative of the loss with respect to that weight.
- Forward propagation: the weighted sum of the inputs is passed through an activation function; given the output, we calculate the loss (the difference between the actual value and the prediction).
- Gradient of the loss with respect to a weight: we take the derivative working backward, since the output of each layer depends on the output of the previous layer (the chain rule).

## What is an autoencoder? Name a few applications.
An autoencoder is used to learn a compressed representation of the given data. A few applications include:
- Data denoising
- Dimensionality reduction
- Image reconstruction
- Image colorization

## What are the components of a GAN?
- Generator
- Discriminator

## What's the difference between boosting and bagging?
- **Boosting and bagging** are similar in that both are ensembling techniques, where a number of weak learners (classifiers/regressors that are barely better than guessing) are combined (through averaging or majority vote) to create a strong learner that can make accurate predictions.
- **Bagging** means that you take bootstrap samples (with replacement) of your dataset and each sample trains a (potentially) weak learner.
- **Boosting**, on the other hand, uses all the data to train each learner, but instances that were misclassified by the previous learners are given more weight so that subsequent learners focus on them during training.

## Explain how a ROC curve works.
The **ROC curve** is a graphical representation of the contrast between the true positive rate and the false positive rate at various thresholds. It is often used as a **proxy for the trade-off between the sensitivity of the model (true positives) and the fall-out, i.e., the probability that it will trigger a false alarm (false positives).**

## What's the difference between Type I and Type II error?
- **Type I error** is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn't.
- **Type II error** means claiming nothing is happening when in fact something is. A memorable way to think about this: a Type I error is telling a man he is pregnant, while a Type II error is telling a pregnant woman she isn't carrying a baby.

## What's the difference between a generative and a discriminative model?
A generative model learns the distribution of each category of data, while a discriminative model simply learns the distinction (decision boundary) between the categories. Discriminative models generally outperform generative models on classification tasks.

------------

## NLP
- Tokenization: breaking a sentence into smaller chunks (tokens)
- Stemming: normalizing a word to its base or root form
- Lemmatization: like stemming, but taking the morphology of the word into consideration
- POS tagging: labeling what a word does grammatically in a sentence
- NER (named entity recognition): identifying entities such as persons, companies, organizations, and locations
- Chunking: breaking text into small pieces and grouping them into larger units (phrases)

## Word2vec
Shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2vec takes a large corpus as input and produces a vector space, typically of several hundred dimensions, in which each word of the corpus is assigned a vector. The key idea is context: words that often occur in the same contexts should have similar meanings.

Two flavors:
- Continuous bag of words (CBOW): the model predicts the current word given a window of surrounding context words.
- Skip-gram: the model predicts the surrounding context words given the current word.

-------------------

# Reinforcement learning
![](https://i.imgur.com/066qPkM.png)

## Process
![](https://i.imgur.com/NpNL6oI.png)
![](https://i.imgur.com/vW6cfmz.png)

## Terms and definitions
![](https://i.imgur.com/xO8497L.png)
![](https://i.imgur.com/wO79s0L.png)

## Discounting: favors closer rewards
![](https://i.imgur.com/Ziulz7b.png)
![](https://i.imgur.com/asAN8hd.png)

## Markov Decision Process
![](https://i.imgur.com/dPqgmQD.png)
![](https://i.imgur.com/7UFsNzV.png)
![](https://i.imgur.com/UA1IRLd.png)

## Finding the best policy
![](https://i.imgur.com/4ZjUkXn.png)

## Bellman equation
![](https://i.imgur.com/pMjoJ0e.png)

## Second source on MDPs
![](https://i.imgur.com/jLDqTmX.png)
![](https://i.imgur.com/tSeYNC3.png)

## Summary
![](https://i.imgur.com/2jzOxyY.png)

## Q-learning
![](https://i.imgur.com/HeFCA8r.png)
![](https://i.imgur.com/KbsQoGa.png)
![](https://i.imgur.com/tqgrp3H.png)

## Trading use case
https://github.com/llSourcell/Reinforcement_Learning_for_Stock_Prediction
https://www.youtube.com/watch?v=05NqKJ0v7EE
![](https://i.imgur.com/7SS88CT.jpg)
![](https://i.imgur.com/K5eVoIJ.png)
![](https://i.imgur.com/evULzI8.png)
![](https://i.imgur.com/uhJivlB.png)
![](https://i.imgur.com/5ikoOE7.png)
![](https://i.imgur.com/43YR1qb.png)

-------------

## Version 2
![](https://i.imgur.com/vUCPzhC.jpg)

## Bellman equation
![](https://i.imgur.com/weZkk93.jpg)

## Improvement
![](https://i.imgur.com/C3J7aQW.png)

## Coding component
https://www.youtube.com/watch?v=rRssY6FrTvU

## Real-time APIs
https://alpaca.markets/
https://www.alphavantage.co/
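To tie the Q-learning material above together, here is a minimal tabular Q-learning sketch. The toy 1-D grid world and all names are my own invention, not taken from the linked videos or repositories:

```python
import numpy as np

# Toy 1-D grid world: states 0..4, reward 1.0 on reaching state 4.
# Actions: 0 = left, 1 = right. An episode ends at state 4.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s2 == n_states - 1 else 0.0
    return s2, reward, s2 == n_states - 1

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # Bellman-style Q-learning update:
        # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

# After training, the greedy policy is "go right" in every non-terminal state.
policy = Q.argmax(axis=1)
```

Note how the update rule is a sampled version of the Bellman equation: the current estimate is nudged toward the observed reward plus the discounted value of the best next action.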