Microsoft Data Science Interview
==========

What is the use of regularization? What are the differences between L1 and L2 regularization, why don't people use L0.5 regularization for instance?

- Regularization adds a penalty on the model's weight vector to the loss function to discourage overfitting. The penalty drives the weights of less useful features toward zero, making their effect on the model insignificant.
- L1/lasso: forces some coefficients exactly to zero when the tuning parameter is sufficiently large, therefore eliminating useless features altogether.
- L2/ridge: keeps all features in the model, only shrinking the coefficients of less meaningful features rather than eliminating them.
- Lp penalties with p < 1, such as L0.5, are non-convex, which makes the optimization much harder to solve, so they are rarely used in practice.

Can you explain the fundamentals of Naive Bayes? How do you set the threshold?

- we use the assumption that each feature contributes independently to the probability that the target belongs to a certain class
- however, in reality these features can be strongly dependent on each other in determining the target, which is why the model is called "naive"
- posterior = likelihood x class prior / evidence (the prior probability of the features)
- P(A|B) = P(B|A) x P(A) / P(B)

Threshold setting:

- use k-fold cross-validation to evaluate the skill of the model at different probability thresholds and pick the threshold that optimizes the metric you care about

Pros:

- easy and fast to predict the class of a test dataset
- performs well in multi-class prediction
- when the independence assumption holds, NB performs better than most models and needs less training data
- performs well with categorical input variables compared to numerical ones; for numerical variables a normal distribution is assumed

Cons:

- the assumption of independent predictors rarely holds; in real life it is almost impossible to get a set of predictors that are completely independent
- naive Bayes is also known to be a poor probability estimator, so the outputs of predict_proba should not be taken too seriously
- if a categorical variable has a category in the test data that was not observed in the training data, the model assigns it zero probability and cannot make a prediction. This is known as the "zero frequency" problem; it is solved with smoothing, the simplest technique being Laplace smoothing.

Explain Support Vector Machine.

- a hyperplane is drawn that separates the classes with the largest possible margin, i.e. it sits as far as possible from the nearest points (the support vectors) of each class. If the classes can't be separated in the original 2-D space, a kernel transforms the data into a higher-dimensional space where a separating hyperplane can be found.

Bias / Variance tradeoff

- Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A high-bias model pays little attention to the training data and oversimplifies, i.e. it __underfits__
- Variance is the variability of the model prediction for a given data point across different training sets. A high-variance model pays too much attention to the training data and fails to generalize, i.e. it __overfits__

Measuring distance between data points

- Euclidean distance
- Manhattan distance
- cosine similarity

Define Variance

- the average of the squared differences from the mean
- variance describes the spread of your dataset
- when estimating the population variance from a sample, apply Bessel's correction: divide by N-1 instead of N
- the sample values are, on average, closer to the sample mean (x-bar) than to the population mean, so the sum of squared deviations in the numerator is deflated; using N-1 lowers the denominator and corrects that underestimate
- the square root of the variance is the standard deviation, which expresses the spread in the same units as the data (see the numpy sketch below)
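A minimal numpy sketch of Bessel's correction; the sample values are made up purely for illustration.

```
# Illustrating Bessel's correction with numpy (made-up sample values).
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

biased_var = np.var(sample)            # divides by N (population formula)
unbiased_var = np.var(sample, ddof=1)  # divides by N-1 (Bessel's correction)
std_dev = np.sqrt(unbiased_var)        # standard deviation = square root of variance

print(biased_var, unbiased_var, std_dev)  # the N-1 estimate is slightly larger
```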
Boosting

- the method of converting weak learners into a strong learner
- each new tree is fit on a modified version of the original dataset
- the model begins by training a decision tree in which each observation is assigned an equal weight
- after evaluating the results of the first tree, we increase the weights of the poorly classified observations and decrease the weights of the easily classified ones
- the second tree is grown from this weighted data; the classification error is computed from the 2-tree ensemble, and a 3rd tree is grown to predict the revised residuals

Gradient Boosting

- trains many models in a gradual, additive, and sequential manner
- instead of re-weighting observations, each new model is fit to the negative gradient of the loss function of the current ensemble (for squared error loss, simply the residuals), so every step moves the ensemble in the direction that reduces the loss the most

## Probability

Union

- A∪B = A + B - A∩B - the intersection is subtracted to avoid double counting

Additive Law

- P(A∪B) = P(A) + P(B) - P(A∩B)

Multiplication Rule

- P(A|B) = P(A∩B) / P(B); multiplying both sides by P(B) gives P(A|B) x P(B) = P(A∩B)

Conditional Probability

- the likelihood of an event occurring given that a different one has already happened
- the probability of an outcome given new information
- used to distinguish dependent from independent events
- __P(A|B)__ == the probability of __A__ given new information __B__
- if the probability of an event is affected by another event then the events are dependent
- __P(A|B) = P(A)__ suggests that the two events are independent because __B__ occurring had no effect on __A__
- __P(A∩B) = P(A) x P(B)__ means the two events are independent

Bayes Rule

- P(A|B) = P(B|A) x P(A) / P(B) (a small numeric sketch follows the Distributions section below)

### Distributions

Uniform

- each outcome is equally likely
- both the mean and the variance are uninterpretable and offer no real intuitive meaning

Bernoulli

- X ~ Bern(p) - variable X follows a Bernoulli distribution with a probability of success equal to p
- 1 trial
- 2 possible outcomes
- used when you either have past data or know the probability of one event occurring
- E(X) = p when success is coded as 1 (or 1-p if the other outcome is treated as the success)
- the variance of a Bernoulli variable always equals p*(1-p)

Binomial (many Bernoullis)

- a sequence of identical, independent Bernoulli trials
- B(n,p) - X ~ B(10, 0.6) - variable X has a binomial distribution with 10 trials and a likelihood of 0.6 of success on each individual trial

Poisson

- Po(λ) - Y ~ Po(4) - variable Y follows a Poisson distribution with lambda equal to 4
- describes how often an event occurs within a given interval
- takes only non-negative integer values because an event can't happen a negative number of times
- ex. P(Y=7): what is the likelihood that a kid will ask exactly 7 questions in a day?
    - average number of questions asked = λ = 4
    - interval = 1 work day
    - instance we are interested in = 7
- P(Y=y) = λ^y * e^(-λ) / y! (worked out in the sketch below)
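A quick, standard-library version of the Poisson example above; the helper name poisson_pmf is just for this sketch.

```
from math import exp, factorial

def poisson_pmf(y, lam):
    # P(Y = y) = lam^y * e^(-lam) / y!
    return lam**y * exp(-lam) / factorial(y)

# probability that the kid asks exactly 7 questions when the average is 4 per day
print(poisson_pmf(7, 4))  # roughly 0.06
```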
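Going back to Bayes' rule in the Probability section above, a small numeric sketch; the prevalence and test rates below are invented purely for illustration.

```
# Bayes' rule, P(A|B) = P(B|A) * P(A) / P(B), with made-up numbers:
# A = has the condition, B = test comes back positive.
p_a = 0.01            # prior P(A)
p_b_given_a = 0.95    # likelihood P(B|A)
p_b_given_not_a = 0.10

# law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # about 0.088 - a positive test still leaves P(A|B) fairly low
```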
Find the max subsequence of an integer list

```
def maxSubArraySum(a, size):
    # Kadane's algorithm: track the best sum ending at the current index
    # and the best sum seen so far (returns 0 if every element is negative).
    max_so_far = 0
    max_ending_here = 0
    for i in range(0, size):
        max_ending_here = max_ending_here + a[i]
        if max_so_far < max_ending_here:
            max_so_far = max_ending_here
        if max_ending_here < 0:
            max_ending_here = 0
    return max_so_far

a = [-2, -3, 4, -1, -2, 1, 5, -3]
maxSubArraySum(a, len(a))  # 7
```

Merge k arrays (k == 2) then sort

```
k1 = [5, 44, 3, 14]
k2 = [523, 623, 37, 81]

def merge_arr(k1, k2):
    # concatenate the two lists
    k3 = k1 + k2
    return k3

merge_arr(k1, k2)

def bubble_sort(arr):
    # selection-style in-place sort: compare each position with every
    # later position and swap whenever the pair is out of order
    for ele in range(0, len(arr) - 1):
        for el in range(ele, len(arr)):
            if arr[ele] > arr[el]:
                temp = arr[ele]
                arr[ele] = arr[el]
                arr[el] = temp
    return arr

bubble_sort(merge_arr(k1, k2))  # [3, 5, 14, 37, 44, 81, 523, 623]
```

Percentile

```
import matplotlib.pyplot as plt
import numpy as np
import shapely.geometry as SG
%matplotlib inline

time = [0, 2, 4, 6, 8, 10, 12]
people = [0, 350, 1100, 2400, 6500, 8850, 10000]

# piecewise-linear curve of cumulative people over time
line = SG.LineString(list(zip(time, people)))

# horizontal line at the level we want to invert: the time at which
# 3000 people have arrived
y0 = 3000
yline = SG.LineString([(min(time), y0), (max(time), y0)])

# the intersection of the two lines is the crossing point
point = line.intersection(yline)
print(point.x)  # roughly 6.29
```

Compute the inverse

```
import numpy as np

# inverse of a 2x2 matrix [[a, b], [c, d]] = (1 / (ad - bc)) * [[d, -b], [-c, a]]
temp_mat = np.array([[2, 10],
                     [-1, 2]])

# adjugate: swap the diagonal entries and negate the off-diagonal ones
mat = np.array([[2, -10],
                [1, 2]])

# divide the adjugate by the determinant
solve = (1 / (temp_mat[0, 0] * temp_mat[1, 1] - temp_mat[0, 1] * temp_mat[1, 0])) * mat
```
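As a sanity check on the hand-rolled 2x2 inverse above, numpy's built-in inverse should give the same matrix:

```
import numpy as np

temp_mat = np.array([[2, 10],
                     [-1, 2]])

# np.linalg.inv should match the adjugate-over-determinant result computed above
print(np.linalg.inv(temp_mat))
print(np.allclose(np.linalg.inv(temp_mat) @ temp_mat, np.eye(2)))  # True
```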