# SUPERVISED

![](https://i.imgur.com/cTt1lRI.png)

First plot the points. Join the polygons made by the points, calculate their areas individually, and add them up to get the area of the convex hull.

For k and d: first calculate the slope of the line connecting the two outer points (2,2) and (5,4), which is m = 2/3, so the normal slope is k = -(1/m) = -1.5. Then y = kx + d gives you d, where (x,y) is the midpoint (3.5, 3): d = 3 - (-1.5)(3.5) = 8.25.

![](https://i.imgur.com/zvvRjoo.png)

First, order the values in ascending order. Take a first threshold smaller than the smallest value; this means everything will be predicted as 1. Calculate TPR and FPR and plot the point on the graph. Count the total positives and negatives in the y column: the graph's x-axis should run over the total negatives and the y-axis over the total positives, in our case 4 on both. Then increase the threshold one value at a time and plot the points until everything is predicted as the negative class and the point (0, 0) is reached.

![](https://i.imgur.com/33ezGqS.png)

Expected Prediction Error = unavoidable error (original variance = b^2) + bias^2 + model variance. The bias is just the difference of expected values, i.e. mean1 - mean2. The model variance is the variance of the model = d^2.

Find the solution (w1\*, w2\*) to the following constrained optimization problem and enter w1\* in the field below:

f(w1, w2) = 1 + w1^2 + w2^2
h(w1, w2) = 2w1 + w2 − 7 ≤ 0

The unconstrained minimum of f is at (0, 0), which already satisfies h(0, 0) = −7 ≤ 0, so the constraint is inactive and (w1\*, w2\*) = (0, 0).

![](https://i.imgur.com/SRtkZoO.jpg)

![](https://i.imgur.com/GJx9WkJ.png)

Calculate the Gini index of the original set using the formula:
Gini index = 1 - (proportion of class + in original set)^2 - (proportion of class - in original set)^2

Calculate the Gini index of each partition created by the decision tree using the same formula:
Gini index of partition i = 1 - (proportion of class + in partition i)^2 - (proportion of class - in partition i)^2

Calculate the weighted average of the Gini indices of the partitions:
weighted average of Gini indices = (size of partition 1 / size of original set) \* Gini index of partition 1 + (size of partition 2 / size of original set) \* Gini index of partition 2 + ...

Calculate the Gini purity gain by subtracting the weighted average of the Gini indices of the partitions from the Gini index of the original set:
Gini purity gain = Gini index of original set - weighted average of Gini indices of partitions
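A minimal sketch of this Gini purity gain computation. The helper names and the example split counts are my own, not from the exam:

```python
# Minimal sketch of the Gini purity gain computation described above.
# Function names and the example split are illustrative, not from the exam.

def gini(pos, neg):
    """Gini index of a set with `pos` positive and `neg` negative samples."""
    total = pos + neg
    if total == 0:
        return 0.0
    p_pos, p_neg = pos / total, neg / total
    return 1.0 - p_pos**2 - p_neg**2

def gini_purity_gain(parent, partitions):
    """`parent` and each partition are (pos, neg) count pairs."""
    n = sum(p + q for p, q in partitions)
    weighted = sum((p + q) / n * gini(p, q) for p, q in partitions)
    return gini(*parent) - weighted

# Hypothetical split of a 6+/6- parent into (4+,1-) and (2+,5-)
print(gini_purity_gain((6, 6), [(4, 1), (2, 5)]))  # ~0.129
```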
![](https://i.imgur.com/v430hqC.png)

w with F1 as the evaluation fold = w trained with F2 and F3 as the training data. So, w = 1/4 (|0 − (−1)| + |7 − 1| + |−1 − 1| + |−5 − 1|) = 15/4 = 3.75. Here l is 4 because the training data is (xi, yi), i = 1 to l; l is not the number of folds but the total number of training samples.

Resulting risk of the 1st fold: x1 in F1 is 5, which is greater than 3.75, so g(x; w) = 1 and y1 = 1, so the loss is 0. Similarly, x2 in F1 is 2, which is smaller than 3.75, so g(x; w) = −1 and y2 = −1, so here the loss is also 0. Hence, 0 + 0 = 0 is the resulting risk.

Which statements about the ensemble methods, in particular random forests and gradient boosting, are true?

![](https://i.imgur.com/KFXQP1g.png) CONFUSED WITH A AND B!
![](https://i.imgur.com/xUPKOs4.png) CONFUSED WITH C D E F!

1. In general, ensemble methods aim at reducing both the bias and the variance.
2. Random forests -> reduce variance.
3. Gradient boosting -> reduce bias.
4. Random forests are easier to tune than gradient boosting because:
   a) Random forests use bagging (bootstrap aggregating) to train multiple decision trees, which can help to reduce overfitting and stabilize the model. **(BROS = Bagging Reduces Overfitting and Stabilizes the model)**
   b) Gradient boosted trees use boosting to train multiple decision trees, which can be more sensitive to the choice of hyperparameters and require more careful tuning. **(SHy = Sensitive to Hyperparameters)**
   c) Random forests have fewer hyperparameters to tune than gradient boosted trees, making them easier to tune.
5. Random forests in general have higher bias than single decision trees.
6. Forward stagewise additive modelling is a robust boosting method which can be applied to different loss functions.
7. In AdaBoost, the more accurate the weak classifiers are, the higher the influence they are given for the classification output.
8. In a random forest, each individual decision tree is built on a random bootstrap subsample of the original data set.
9. A new subsample of considered features from the training data set is drawn at each individual split during the training of a tree.
10. The main idea behind boosting is that samples that were misclassified by the previous weak models in an iteration get a larger weight in the next iteration. The idea is to focus on the harder samples and improve the performance of the ensemble by iteratively training new weak models that correct the mistakes of the previous ones (see the reweighting sketch after this list).
11. A common measure of variable importance in random forests is the mean decrease in Gini impurity.
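A schematic sketch of the boosting reweighting idea from statement 10, using the standard AdaBoost.M1 update; the toy labels and variable names are made up:

```python
import numpy as np

# Schematic AdaBoost-style reweighting step: misclassified samples get
# larger weights in the next iteration (statement 10), and a more accurate
# weak learner gets a larger influence alpha (statement 7).

def adaboost_reweight(w, y_true, y_pred):
    """Up-weight misclassified samples, down-weight correct ones, renormalize."""
    miss = (y_true != y_pred)                      # boolean mask of mistakes
    err = np.sum(w[miss]) / np.sum(w)              # weighted error of weak learner
    alpha = 0.5 * np.log((1 - err) / err)          # influence of this weak learner
    w = w * np.exp(alpha * np.where(miss, 1, -1))  # e^{+alpha} on mistakes
    return w / w.sum(), alpha

w = np.full(4, 0.25)                  # start with uniform weights
y_true = np.array([ 1, -1,  1,  1])
y_pred = np.array([ 1, -1, -1,  1])   # sample 3 misclassified
w_new, alpha = adaboost_reweight(w, y_true, y_pred)
print(w_new, alpha)                   # the misclassified sample now carries half the weight
```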
Which statements about Logistic regression are true?

1. Logistic regression is a convex problem.
   a) The loss function (negative log-likelihood, cross-entropy) of logistic regression is a convex function of the parameters, which means that the optimization problem has a unique global minimum and any local minimum is also a global minimum.
2. The logistic function reads: ![](https://i.imgur.com/KhRIAwR.png) which is the sigmoid function.
3. The likelihood function for a Bernoulli distribution reads: ![](https://i.imgur.com/ohOPhf3.png) Hint: In logistic regression, the likelihood function is the product of the likelihoods of the individual samples, whereas the loss function is the negative logarithm of the likelihood function.
4. The softmax function is a generalization of the sigmoid function and is suitable for multiclass classification. ![](https://i.imgur.com/oaZx8qn.png)
5. In logistic regression, minimization of the loss function can be achieved by gradient descent (an optimization algorithm; a minimal sketch follows after this list).
6. The optimal weights of a logistic regression model cannot be calculated directly; instead, optimization algorithms are typically used for this problem.
7. The momentum term in gradient descent is particularly helpful in regions where **a) the gradient is small** (such as plateaus or near-flat regions) or in regions where **b) the gradient oscillates** (such as in the vicinity of shallow local minima).
   a) The momentum term allows the optimization algorithm to maintain a certain level of velocity in the direction of the global minimum, even when the gradient is small.
   b) The momentum term can help the optimization algorithm converge faster by reducing the oscillations. This is because the momentum term adds a fraction of the previous update to the current update, which helps to smooth the optimization process and reduce the oscillations.
   c) In steep regions of the loss function, the gradient is large and the optimization algorithm can take large steps towards the minimum without needing a momentum term.
8. Assume that the y data are generated around a function g(x) with Gaussian noise. Minimizing the negative log-likelihood of observing the data is then equivalent to minimizing the mean squared error between y and g(x).
9. Logistic regression does not have a closed-form solution due to the nonlinearity of the logistic sigmoid function. In the case of linear regression models, however, the maximum likelihood solution leads to a closed-form solution.
10. Logistic regression is a linear classifier, hence it cannot solve the XOR problem. Non-linear models (e.g. a multi-layer perceptron) are needed to solve the XOR problem.
11. In some cases, an optimization algorithm can get stuck in a local minimum of the loss function, which can result in a suboptimal solution. However, logistic regression is a convex model, meaning that its loss function is convex, so the global minimum of the loss function is guaranteed to be reached regardless of the initialization of the parameters. This is why logistic regression models are not prone to getting stuck in local minima of the loss function.
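A minimal gradient-descent sketch for logistic regression, illustrating statements 1, 5, and 6; the toy data and learning rate are arbitrary choices, not from the course:

```python
import numpy as np

# Minimal gradient descent for logistic regression: minimizes the (convex)
# negative log-likelihood / cross-entropy loss. No regularization, for brevity.

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_logreg(X, y, lr=0.1, steps=1000):
    """y in {0, 1}; returns the weight vector found by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)              # predicted P(y = 1 | x)
        grad = X.T @ (p - y) / len(y)   # gradient of the cross-entropy loss
        w -= lr * grad                  # w_new = w_old - lr * grad(R(w_old))
    return w

X = np.array([[1, 0.5], [1, 1.5], [1, 3.0], [1, 4.0]])  # first column = bias
y = np.array([0, 0, 1, 1])
print(fit_logreg(X, y))
```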
Which statements about Artificial Neural Networks are true?

1. Without activation functions, neural networks are linear functions of their inputs — in other words, just a linear regression model. A non-linear transformation is applied to each neuron's pre-activation, and this non-linearity in the neural network is introduced by an activation function.
2. **Dropout and weight decay** are both regularization techniques used in training neural networks to **prevent overfitting**.
3. Multi-layer perceptrons are equivalent to fully connected feed-forward neural networks.
4. The vanishing gradient problem can be mitigated/avoided by the use of suitable activation functions.
5. Let us denote the learning rate with l, the weights with w, and the empirical risk with R. Then the gradient descent update rule reads: *w_new = w_old − l · grad(R(w_old))* ![](https://i.imgur.com/N7LsjDX.png)
6. The perceptron is a simple linear classifier.
7. The recursive formula for the delta errors is: $\delta^{(l-1)\top} = \delta^{(l)\top} J^{(l)}$, where $J^{(l)}$ is the single-layer Jacobian of layer l. ![](https://i.imgur.com/20zyJeP.png)
8. The Universal Approximation Theorem guarantees that neural networks have infinite/very large VC dimension.

Which statements about Convolutional Neural Networks are true?

1. Re-using a kernel weight matrix at multiple positions of an image is called weight sharing. In other words, it refers to the use of the same set of weights at multiple locations in the input image. This is achieved by using convolutional layers, which apply a set of filters to different regions of the input image. Each filter is represented by a set of weights, and these weights are shared across all locations in the input image where the filter is applied. This allows the CNN to learn features that are translation-invariant, meaning they can be detected regardless of their location in the input image. Additionally, weight sharing reduces the number of parameters in the network, making it more computationally efficient.
2. Transfer learning allows the use of pretrained kernels from other networks that were trained on different data sets.
3. With input dimension A, kernel size R, padding P, and stride S, the feature map dimension is given by the formula: (A − R + 2P) / S + 1. This formula applies to both the width and the height of the feature map. It is composed as follows:
   a) (A − R) computes the number of positions the kernel can be shifted across the input.
   b) 2P takes into account the padding added to the input image.
   c) / S computes the number of steps the kernel takes while moving over the input.
   d) + 1 because the kernel is also applied at the first position.
   So the feature map dimension describes the size of the output obtained by applying a convolution to the input with the given parameters (see the helper after this list).
4. With kernel weight matrix W_{i,j}, kernel size R, and input image X_{a,b}, the feature map preactivation is calculated by: ![](https://i.imgur.com/TNWzfrm.png) The kernel is slid over the entire input image with a step defined by the stride; at each position an element-wise multiplication is performed between the kernel and the receptive field (the region of the input the kernel currently covers), and the results are summed up to get the preactivation value at that position. Once the preactivation is computed, it is often passed through an activation function, such as ReLU, to produce the final feature map (a naive sketch also follows after this list).
5. There is no fixed number of kernels that each convolutional layer must use; it is a hyperparameter that can be set depending on the specific problem and the architecture of the network.
6. Skip connections are a technique used to improve the flow of gradients during training by creating a direct path for the gradients to flow from the output layer to an earlier layer in the network, allowing for deeper networks and more complex representations of the input data. Used in ResNets and U-Net.
7. Dataset augmentation can help to overcome the problem of limited quantity and/or diversity of the given data.
8. Pooling not only increases the size of the receptive field of the convolutional kernels (neurons) over the layers, but also reduces the computational complexity and the memory requirements, as it reduces the resolution of the feature maps while preserving the important features needed by the subsequent layers.
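A small helper implementing the feature-map formula from statement 3; the function name and example values are illustrative:

```python
# Feature-map size from the formula in statement 3: (A - R + 2P) / S + 1.
# Integer division mirrors the usual floor behavior when the kernel does not
# fit an exact number of times.

def feature_map_dim(A, R, P=0, S=1):
    return (A - R + 2 * P) // S + 1

# e.g. a 32x32 input, 5x5 kernel, padding 2, stride 1 keeps the size at 32
print(feature_map_dim(32, 5, P=2, S=1))  # -> 32
```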
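And a naive sketch of the preactivation from statement 4, assuming a square kernel, no padding, and stride 1 (real frameworks implement this far more efficiently); the 3×4 input matches the padding example later in these notes, and the all-ones kernel is made up:

```python
import numpy as np

# Naive feature-map preactivation: slide the kernel W over the input X,
# multiply element-wise with the receptive field at each position, and sum.

def conv2d_preactivation(X, W):
    R = W.shape[0]            # square R x R kernel
    H = X.shape[0] - R + 1    # output height: (A - R) / 1 + 1
    Wd = X.shape[1] - R + 1   # output width
    S = np.zeros((H, Wd))
    for a in range(H):
        for b in range(Wd):
            S[a, b] = np.sum(X[a:a+R, b:b+R] * W)  # receptive field * kernel
    return S

X = np.array([[2, 1, 0, 2],
              [1, 2, 2, 0],
              [1, 1, 1, 2]], dtype=float)
W = np.ones((2, 2))           # hypothetical 2x2 kernel
print(conv2d_preactivation(X, W))
```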
Which statements about Recurrent Neural Networks are true?

1. RNNs use weight sharing, as the network is slid over the input sequence and the hidden-layer weight matrix is reused at all timesteps. Weight sharing in RNNs refers to the use of the same set of weights for multiple time steps of the input sequence, and it allows the network to learn patterns and dependencies in the input sequence that span multiple time steps.
2. In an RNN, the units in a hidden layer are recurrently connected. This means that the output of a unit at one time step is used as input, along with the current input, to compute the output of the same unit at the next time step. This creates a feedback loop within the hidden layer, which allows information to be passed between time steps.
3. In an RNN, the units in a hidden layer can also connect to themselves. This type of connection is known as a self-recurrent connection, where the output of a unit at one time step is used as input to the same unit at the next time step. This creates a self-loop within the hidden layer, which allows the network to maintain information from previous time steps and use it to inform the current time step.
4. Therefore, the units in a hidden layer can connect to themselves and to the neurons in the previous and/or next time step.
5. RNNs can be trained via backpropagation through time (BPTT). In BPTT, the error is propagated back through the network for a fixed number of time steps. The process starts by unrolling the RNN for a given number of time steps, creating a feed-forward neural network. The error is then calculated at the output of the network and propagated back through the unrolled network using backpropagation. The gradients are then used to update the weights of the network, and the process is repeated for multiple time steps.
6. The operations complexity of backpropagation through time for an RNN with N hidden units and sequence length T is O(N²T).
7. LSTMs use an integrator to store information and gates for remembering, communicating, and (potentially) forgetting.
8. RNNs are typically useful for sequence classification and generation.
9. RNNs are Turing complete, meaning that they are capable of computing any computable function, given enough resources (time and memory). This is because RNNs have the ability to store and access information over time, which allows them to perform complex computations. The Turing completeness of RNNs comes from their capability to store information in their hidden state, which can be passed from one time step to the next. This allows the network to maintain a kind of memory of past inputs, which can be used to make decisions about future inputs. The hidden state can also be used to perform computations, such as counting, logic, and other operations.
10. In RNNs, too, (piecewise) linear activation functions are needed to mitigate the vanishing gradient problem. (LINEAR ACTIVATION? RELU! NONLINEAR, SATURATING? SIGMOID, TANH.)

True statements for the vanishing gradients problem of feed-forward neural networks:

1. The vanishing gradient problem occurs when the gradients of the weights in a deep neural network become very small, making it difficult for the network to learn. This can be caused by the use of activation functions whose outputs saturate at the extremes (e.g. sigmoid or tanh). Activation functions such as ReLU (Rectified Linear Unit) and Leaky ReLU can solve the vanishing gradient problem because they do not saturate at the extremes, allowing gradients to flow back into the network during backpropagation. This lets the network continue learning and updating its weights even in the deeper layers.
2. That is one of the reasons why ReLU and Leaky ReLU can solve the vanishing gradient problem. The derivative of the ReLU function is 1 for all input values greater than 0, and 0 for all input values less than or equal to 0. This means that the gradients in the positive part of the input space do not get scaled down, allowing them to flow back into the network with relatively high values. Leaky ReLU also has a non-zero gradient for negative inputs (a small constant value), which means that even when the input is negative, gradients can still be backpropagated through the network, so they don't vanish. In contrast, the derivative of the sigmoid function is small for input values far from 0, which means that gradients in those regions are scaled down significantly, making it difficult for the network to learn.
3. Assume that you have a three-layer feed-forward NN (i.e. 1 hidden layer), where the average absolute values of the gradients of the loss with respect to the activations in layers 1, 2, 3 are given by A1, A2, A3 respectively. In case the vanishing gradient problem is present, we typically have A3 > A2 > A1.
4. The sigmoid function is known to induce vanishing gradients, because its derivative is strictly smaller than 1 in absolute value. The derivative of tanh lies between 0 and 1, so it can also induce vanishing gradients. ReLU solves vanishing gradients because its derivative is either 0 or 1 (0 for all values less than or equal to 0, 1 for all values greater than 0); since the derivative is 1, gradients in the positive part of the input space do not get scaled down. A tiny numeric illustration follows below.
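A tiny numeric illustration of why saturating derivatives vanish over depth while ReLU's does not; the depth and the preactivation value are arbitrary choices:

```python
import numpy as np

# Backpropagating through many saturating activations multiplies many
# derivatives < 1, so the gradient shrinks layer by layer; ReLU's derivative
# of 1 (for positive preactivations) leaves the gradient unscaled.

def sigmoid_deriv(s):
    sig = 1.0 / (1.0 + np.exp(-s))
    return sig * (1.0 - sig)

s = 2.0                                     # some preactivation value
depth = 10                                  # number of layers backpropagated through
print(sigmoid_deriv(s) ** depth)            # ~1.6e-10: vanished
print(1.0 ** depth)                         # 1.0: ReLU keeps the gradient alive
```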
![](https://i.imgur.com/RCCYdcX.png)
![](https://i.imgur.com/IxMkHuh.png)

We know the preactivation in the 1st time step: $s^{(1)} = W^\top x^{(1)} + R^\top a^{(0)}$; then $a^{(1)} = f(s^{(1)})$ with $f = \mathrm{ReLU}$; finally, $\hat{y}^{(1)} = V^\top a^{(1)}$.

Similarly, the preactivation in the 2nd time step: $s^{(2)} = W^\top x^{(2)} + R^\top a^{(1)}$; then $a^{(2)} = f(s^{(2)})$; finally, $\hat{y}^{(2)} = V^\top a^{(2)}$.

![](https://i.imgur.com/9ExbCmv.png)

Stride means the number of steps by which the kernel matrix W slides over the input. Padding means adding 0 values to the left, right, bottom, and top. For example:

    2 1 0 2
    1 2 2 0
    1 1 1 2

Padding equal to 2 turns this into:

    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 2 1 0 2 0 0
    0 0 1 2 2 0 0 0
    0 0 1 1 1 2 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0

![](https://i.imgur.com/dZaT1Y1.png)
![](https://i.imgur.com/XFsmY3g.png)
![](https://i.imgur.com/xBZ56Bn.png)

Delta error at the output: $\delta_k = \frac{\partial L}{\partial s_k} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial s_k} = \frac{\partial L}{\partial \hat{y}} \cdot f'(s_k)$
Delta error at a hidden unit: $\delta_j = f'(s_j) \sum_{i=1}^{N} \delta_i w_{ij}$
Specific loss gradient: $\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial s_i} \cdot \frac{\partial s_i}{\partial w_{ij}} = \delta_i \, a_j$
Weight update: $w_{ij}^{\text{new}} = w_{ij}^{\text{old}} - \eta \, \delta_i \, a_j$

![](https://i.imgur.com/pAFAPs6.png)

$\mathrm{sign}\left(\sum_{i=1}^{N} \alpha_i g_i(x)\right)$: if the sign of the sum is negative, the prediction is −1, else +1.

![](https://i.imgur.com/NY5IlKi.png)

Derivative of ReLU: 1 if the input is > 0, otherwise 0.

![](https://i.imgur.com/jwlegOW.png)
![](https://i.imgur.com/eHzgnVw.png)

Misclassified points get larger weights, in this case points 4 and 1.

![](https://i.imgur.com/bQi2ZoR.png)

Gradient checking: $\dfrac{f(x, w + \epsilon) - f(x, w - \epsilon)}{2\epsilon}$ (see the sketch below).

![](https://i.imgur.com/mfA1T9N.png)

Multiply the probabilities, e.g. P(HTH) = 2/3 · 1/3 · 2/3, then rank based on the highest likelihood values.
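A minimal sketch of the central-difference gradient check above; the function f is a made-up example whose analytic gradient we can compare against, not a real network loss:

```python
# Central-difference gradient check: (f(w + eps) - f(w - eps)) / (2 * eps).
# In practice f would be the loss as a function of one network weight.

def numerical_grad(f, w, eps=1e-5):
    return (f(w + eps) - f(w - eps)) / (2 * eps)

f = lambda w: w**2 + 3*w      # analytic gradient: 2w + 3
w = 1.5
print(numerical_grad(f, w))   # ~6.0
print(2*w + 3)                # 6.0, matches
```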
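And a small sketch of the likelihood ranking from the last exercise; the candidate head probabilities are illustrative:

```python
# Likelihood of a coin-flip sequence: multiply the per-flip probabilities,
# e.g. P(HTH | p = 2/3) = 2/3 * 1/3 * 2/3, then rank the candidates.

def likelihood(seq, p_heads):
    L = 1.0
    for flip in seq:
        L *= p_heads if flip == "H" else (1.0 - p_heads)
    return L

candidates = [1/3, 1/2, 2/3]
ranked = sorted(candidates, key=lambda p: likelihood("HTH", p), reverse=True)
print(ranked)  # p = 2/3 gives the highest likelihood for HTH
```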