---
title: Binary classification
tags: teach:MF
---

# Binary classification

## Agenda

1. Lecture:
    - Focus on: logistic regression
    - A glance at: KNN, decision tree, random forest, SVM, neural network
2. Labs
    - [Logistic regression from scratch](https://colab.research.google.com/drive/1_LAvmGR9DGXtku-IZppu458ijGN3pE1m?usp=sharing)
    - [Logistic regression demo using the credit card dataset](https://drive.google.com/file/d/1C4tRbnbBhWcBpIsC4uaXkmbGBELegQPY/view?usp=sharing)
    - [Binary classification](https://drive.google.com/file/d/1C4tRbnbBhWcBpIsC4uaXkmbGBELegQPY/view?usp=sharing)
3. Hw 8

## References

Slides adapted from Hill, Griffiths, and Lim (2012), *Principles of Econometrics*, 4th edition.

## Logistic regression

### The generalized linear model

Suppose the explanatory variable is denoted by $x$, a $p\times 1$ vector, and let $\beta$ denote the parameter vector of size $p\times 1$. Generalized linear models allow us to deal with response variables that are not continuous:
$$y\sim F(\mu(x'\beta)),$$
where $F$ is the response (error) distribution and $\mu$ is the mean function (the inverse of the link function).

### Examples

1. Linear regression is originally written as
   $$y = x'\beta+\varepsilon,$$
   where $\varepsilon\sim N(0,\sigma^2)$. It can be rewritten as a generalized linear model:
   $$y \sim N(x'\beta, \sigma^2),$$
   where $F$ is the normal distribution and $\mu(t) = t$.
2. Logistic regression is for a binary response variable:
   $$y \sim Bernoulli\left(\frac{1}{1+e^{-x'\beta}}\right),$$
   where $F$ is the Bernoulli distribution and $\mu(t)=\frac{1}{1+e^{-t}}$.

### Bernoulli random variable

Let $Y$ be a Bernoulli random variable, denoted by $Y\sim Bernoulli(p)$. The probability function of $Y$ is
$$f(y)=p^y(1-p)^{1-y},\quad y=0,1.$$
In addition, we have
$$E(Y)=p,\qquad var(Y)=p(1-p).$$

### A simple case, $p=2$

For a binary dependent variable with an intercept and one explanatory variable, the general model is
$$y_i \sim Bernoulli\left(\mu(\beta_1+\beta_2 x_i)\right),\quad i=1,\ldots,N,$$
where

- Probit model: $\mu(\beta_1+\beta_2 x) =\Phi( \beta_1+\beta_2 x )$, with $\Phi$ the standard normal cdf. (Calculating $\Phi(\cdot)$ is computationally demanding!)
- Logit model: $\mu(\beta_1+\beta_2 x) =\Lambda( \beta_1+\beta_2 x ) = \frac{1}{1+e^{-(\beta_1+\beta_2 x)}}$.

The *logit model* is also called *logistic regression*.

### Example

We represent an individual's choice by the indicator variable
$$Y = \left\{\begin{array}{ll}1,&\mbox{individual drives to work},\\0,&\mbox{individual takes the bus to work.}\end{array}\right.$$
$Y$ is also called a choice variable. Define the explanatory variable as
$$x= (\mbox{commuting time by bus}-\mbox{commuting time by car}).$$

- If the probability that an individual drives to work is $p$, then $P[Y=1]=p$.
- The probability that a person takes public transportation is $P[Y=0]=1-p$.

### Inference for the logistic regression

The Bernoulli probability function is combined with the probability model to obtain
\begin{eqnarray*} f(y_i) & = & p_i^{y_i}(1-p_i)^{1-y_i}\\ &=& (\Lambda(\beta_1 +\beta_2 x_i))^{y_i}(1-\Lambda(\beta_1+\beta_2 x_i))^{1-y_i}, \end{eqnarray*}
where $p_i=\Lambda(\beta_1+\beta_2 x_i)$.

The likelihood function of $\beta_1$ and $\beta_2$ is
$$L(\beta_1,\beta_2) = \prod_{i=1}^N f(y_i) = \prod_{i=1}^N (\Lambda(\beta_1 +\beta_2 x_i))^{y_i}(1-\Lambda(\beta_1+\beta_2 x_i))^{1-y_i}.$$
The *maximum likelihood estimates (MLEs)*, $\hat{\beta}_1$ and $\hat{\beta}_2$, maximize the *likelihood function*:
$$\hat{\beta}_1, \hat{\beta}_2 = \arg\max _{\beta_1,\beta_2} L(\beta_1,\beta_2).$$
For numerical convenience, we instead work with the *log-likelihood function*:
\begin{eqnarray*}l(\beta_1,\beta_2)&=&\ln L(\beta_1,\beta_2)\\& =& \sum_{i=1}^N \left(y_i \ln(\Lambda(\beta_1 +\beta_2 x_i)) + (1-y_i)\ln (1-\Lambda(\beta_1+\beta_2 x_i))\right).\end{eqnarray*}
The MLEs maximize the *log-likelihood function*:
$$\hat{\beta}_1, \hat{\beta}_2 = \arg\max _{\beta_1,\beta_2} l(\beta_1,\beta_2).$$
There is no closed-form formula for the MLEs, so numerical procedures are required.

We estimate the probability $p$ given $x$ by
$$\hat{p}=\Lambda(\hat{\beta}_1+\hat{\beta}_2 x).$$
We set a threshold $l$ and predict $y$ to be 1 if $\hat{p}\geq l$:
$$\hat{y}=\begin{cases}1,&\hat{p}\geq l,\\0,&\hat{p}<l.\end{cases}$$
Usually, $l=0.5$ or the proportion of positive cases in the data.
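As a concrete illustration of the numerical procedure (in the spirit of the "Logistic regression from scratch" lab), here is a minimal Python sketch that simulates data and maximizes the log-likelihood with `scipy.optimize.minimize`. The simulated coefficients, sample size, and optimizer choice are arbitrary assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # expit(t) = 1 / (1 + exp(-t)) = Lambda(t)

# Simulated data, purely for illustration: true beta1 = -1, beta2 = 0.5.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.binomial(1, expit(-1.0 + 0.5 * x))

def neg_log_likelihood(beta, x, y):
    """-l(beta1, beta2) = -sum_i [y_i ln Lambda(.) + (1 - y_i) ln(1 - Lambda(.))]."""
    p = expit(beta[0] + beta[1] * x)
    p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# No closed-form solution, so maximize the log-likelihood numerically
# (equivalently, minimize its negative).
res = minimize(neg_log_likelihood, x0=np.zeros(2), args=(x, y), method="BFGS")
b1_hat, b2_hat = res.x
print("MLEs:", b1_hat.round(3), b2_hat.round(3))

# Predicted probabilities and 0/1 predictions with threshold l = 0.5.
p_hat = expit(b1_hat + b2_hat * x)
y_hat = (p_hat >= 0.5).astype(int)
print("in-sample accuracy:", (y_hat == y).mean().round(3))
```

Library routines such as `sklearn.linear_model.LogisticRegression` or the logit models in `statsmodels` perform essentially the same maximization internally.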
### Re-expressing the logistic regression in the realm of machine learning

See [a reference blog on towardsdatascience.com](https://towardsdatascience.com/an-introduction-to-logistic-regression-8136ad65da2e).

## K-nearest neighbors (KNN)

![](https://i.imgur.com/ITpaehK.png)
Source: https://towardsdatascience.com/knn-k-nearest-neighbors-1-a4707b24bd1d

Choose a distance measure (e.g., Euclidean distance) and a weighting scheme (uniform or distance-based weighting). The score for each category is then summed over the $k$ nearest neighbours, and the unknown category of a sample is predicted as the category with the highest score.

<!--
#### Definitions of distance measures.
Let $\rho_{x,y}$ denote the Pearson's correlation coefficient of instances $x$ and $y$.

|Distance measure |$D(x,y)$|
|---|---|
|Euclidean |$\sqrt{\sum_{i=1}^d(x_i-y_i)^2}$|
|Manhattan | $\sum_{i=1}^d \|x_i-y_i\|$|
|Correlation | $1-\rho_{x,y}$|
-->

## Decision tree :construction:

![](https://i.imgur.com/r0gymPv.png)
Source: https://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/

A decision tree is constructed top-down (from the root node) by choosing, at each step, the feature test (internal decision node) that best splits the set of items. Leaves represent class labels, and branches represent the conjunctions of feature tests that lead to those class labels.

## Random forest :construction:

![](https://i.imgur.com/DBPagtv.jpg)
Source: https://towardsdatascience.com/understanding-random-forest-58381e0602d2

## Support vector machine (SVM) :construction:

![](https://i.imgur.com/AdB7g4j.png)
Source: https://newonlinecourses.science.psu.edu/stat508/lesson/10/10.1

![](https://i.imgur.com/0mqfXL8.jpg)
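Since the KNN, decision tree, random forest, and SVM sections above are still under construction, here is a minimal scikit-learn sketch showing how these four classifiers might be fit and compared. The synthetic dataset and all hyper-parameter values (number of neighbours, tree depth, number of trees, kernel) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# A synthetic binary-classification dataset, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

for name, model in models.items():
    model.fit(X_train, y_train)            # train on the training split
    acc = model.score(X_test, y_test)      # accuracy on the held-out split
    print(f"{name}: test accuracy = {acc:.3f}")
```

In practice the hyper-parameters would be tuned by cross-validation rather than fixed as above.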
<!-- SVM finds a maximal margin hyperplane to separate data points of different categories. If the data cannot be separated by a linear hyperplane, the input features have to be mapped into a higher-dimensional feature space by a kernel function such as a linear, polynomial, or radial basis function.

Feature expansion. Suppose the original feature space includes two variables $X_1$ and $X_2$. Using a polynomial transformation, the space is expanded to $(X_1, X_2, X_1^2, X_2^2, X_1X_2)$. Then the separating hyperplane would be of the form
$$\theta_0 +\theta_1 X_1 + \theta_2 X_2 +\theta_3X_1^2 + \theta_4 X_2^2 +\theta_5 X_1X_2 = 0.$$
This leads to nonlinear decision boundaries in the original feature space.

The inner product of two $n$-dimensional vectors is defined as
$$\sum_{j=1}^n x_{1j}x_{2j},$$
where $X_1 = (x_{11},x_{12},\ldots,x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$. The kernel function is a generalization of the inner product under a nonlinear transformation and is denoted by $K(X_1,X_2)$. Some common kernels are the polynomial kernel, the sigmoid kernel, and the Gaussian radial basis function. There is no golden rule to determine which kernel will provide the most accurate result in a given situation. -->

## Neural network :construction:

The nodes in the input layer receive the input features $X=(X_1,\ldots,X_p)$ of each training sample and transmit the weighted outputs to the hidden layer. The $d$ nodes in the output layer represent the output features $Y=(Y_1,\ldots,Y_d)$. A neural network (NN) links the input-output paired variables with simple functions called activation functions. A simple standard structure for a NN includes an input layer, a hidden layer, and an output layer.

![](https://i.imgur.com/h6tcJkS.png)
Source: https://thedatascientist.com/what-deep-learning-is-and-isnt/

Suppose that a neural network has $L$ hidden layers. The input layer and the output layer are also called the 0th layer and the $(L+1)$th layer, respectively. The number of layers $L$ is called the depth of the architecture. Each layer is composed of nodes (also called neurons) representing a non-linear transformation of the information from the previous layer. For a hidden layer, various activation functions can be applied.

- Let $f^{(0)}, f^{(1)},\ldots,f^{(L)}$ be given univariate activation functions for these layers. For notational simplicity, suppose that $f$ is a given activation.
- Suppose $U =(U_1,\ldots,U_k)'$ is a $k$-dimensional input. We abbreviate $f(U)$ by
  $$f(U) = (f(U_1),\ldots,f(U_k))'.$$
- Let $N_l$ denote the number of nodes at the $l$-th layer, for $l=1,\ldots,L$. For notational consistency, let $N_0 = p$ and $N_{L+1}=d$.
- To build the $l$-th layer, let $W^{(l-1)}\in \Re^{N_l\times N_{l-1}}$ be the weight matrix and $b^{(l-1)}\in \Re^{N_l}$ be the thresholds (activation levels), for $l=1,\ldots,L+1$.
- Then the $N_l$ nodes at the $l$-th layer, $Z^{(l)}\in \Re^{N_l}$, are formed by
  $$Z^{(l)} = f^{(l-1)}(W^{(l-1)}Z^{(l-1)}+b^{(l-1)}),\quad l=1,\ldots,L+1,$$
  with $Z^{(0)}=X$.

Specifically, the deep learning neural network is constructed by the following iterations:
\begin{eqnarray*} Z^{(1)} &=& f^{(0)}\left(W^{(0)}X + b^{(0)}\right)\\ Z^{(2)} &=& f^{(1)}\left(W^{(1)}Z^{(1)} + b^{(1)}\right)\\ Z^{(3)} &=& f^{(2)}\left(W^{(2)}Z^{(2)} + b^{(2)}\right)\\ & \vdots &\\ Z^{(l)} &=& f^{(l-1)}\left(W^{(l-1)}Z^{(l-1)} + b^{(l-1)}\right)\\ & \vdots & \\ Z^{(L)} &=& f^{(L-1)}\left(W^{(L-1)}Z^{(L-1)} + b^{(L-1)}\right)\\ \hat{Y} &=& f^{(L)}\left(W^{(L)}Z^{(L)}+b^{(L)}\right). \end{eqnarray*}

- Finally, the deep learning neural network predicts $Y$ by $\hat{Y}$ using the input $X$ and the learning parameters $W = \{W^{(0)}, W^{(1)},\ldots, W^{(L)}\}$ and $b=\{b^{(0)},b^{(1)},\ldots, b^{(L)}\}$.
- That is, the deep learning neural network predicts $Y$ by
  \begin{eqnarray*} F^{W,b}(X) &:=& f^{(L)}\left(W^{(L)}Z^{(L)}+b^{(L)}\right). \end{eqnarray*}
- Once the architecture of the deep neural network and the activation functions are decided, we need to solve the training problem to find the learning parameters $W=\{W^{(0)},W^{(1)},\ldots,W^{(L)}\}$ and $b = \{b^{(0)},\ldots,b^{(L)}\}$:
  $$\hat{W},\hat{b} = \arg\min _{W,b} \frac{1}{n}\sum_{i=1}^n \mathcal{L}(Y^{(i)},F^{W,b}(X^{(i)})).$$
  Here, $\mathcal{L}$ is the loss function.
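The recursion above fits in a few lines of numpy. The following sketch assumes ReLU activations for the hidden layers and a sigmoid output (just one of many possible choices) and uses randomly initialized, untrained weights, so it only illustrates the forward pass $F^{W,b}(X)$, not the training problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(u):
    return np.maximum(u, 0.0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Layer widths: N_0 = p input features, two hidden layers, N_{L+1} = d = 1 output.
sizes = [4, 8, 8, 1]                      # [p, N_1, N_2, d]
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [rng.normal(size=(sizes[l + 1], 1)) for l in range(len(sizes) - 1)]

def forward(x, W, b):
    """Compute F^{W,b}(x) via Z^(l) = f^(l-1)(W^(l-1) Z^(l-1) + b^(l-1))."""
    z = x                                 # Z^(0) = X
    for l in range(len(W) - 1):           # hidden layers use ReLU here
        z = relu(W[l] @ z + b[l])
    return sigmoid(W[-1] @ z + b[-1])     # output layer uses a sigmoid here

x = rng.normal(size=(4, 1))               # one sample with p = 4 features
print(forward(x, W, b))                   # predicted probability, i.e. Y-hat
```

Training would replace the random weights with the minimizers of the loss, as described in the last bullet above.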
### List of activation functions

Reference: [7 useful activation functions](https://towardsdatascience.com/7-popular-activation-functions-you-should-know-in-deep-learning-and-how-to-use-them-with-keras-and-27b4d838dfe6)

1. Sigmoid (Logistic)
2. Hyperbolic Tangent (Tanh)
3. Rectified Linear Unit (ReLU)
4. Leaky ReLU
5. Parametric Leaky ReLU (PReLU)
6. Exponential Linear Unit (ELU)
7. Scaled Exponential Linear Unit (SELU)

### List of Loss functions

|Loss function | Form |
|----|---|
|Mean Squared Error (MSE) | $\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2$ |
|Binary Crossentropy (BCE) | $-\frac{1}{N}\sum_{i=1}^N\left(y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right)$ |
|Categorical Crossentropy (CC) | $-\frac{1}{N}\sum_{i=1}^N\sum_{k=1}^K I\{y_i=k\}\log\frac{\exp(w_k'x_i)}{\sum_{j=1}^K \exp(w_j'x_i)}$ |
|Sparse Categorical Crossentropy (SCC) | :construction: |

## Multiple class classification :construction:

References
1. https://towardsdatascience.com/categorical-cross-entropy-and-softmax-regression-780e8a2c5e8c

## Revisiting regression

### Logistic regression and neural networks

Logistic regression can be seen as a reduced case of a neural network with a single layer containing one neuron, where the activation function is the sigmoid and the loss function is the binary cross-entropy.

### Multiple linear regression

Multiple linear regression can be seen as a reduced case of a neural network with a single layer containing one neuron, where the activation function is the linear (identity) function and the loss function is the MSE.
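To make the two reductions concrete, here is a minimal Keras sketch (assuming TensorFlow/Keras is available; the optimizer and the feature dimension `p` are arbitrary choices for illustration). A single `Dense(1)` layer with a sigmoid activation trained with binary cross-entropy is logistic regression; the same layer with a linear activation trained with MSE is multiple linear regression.

```python
import tensorflow as tf

p = 10  # number of input features; an arbitrary illustrative value

# Logistic regression as a one-neuron network: sigmoid activation + BCE loss.
logistic = tf.keras.Sequential([
    tf.keras.Input(shape=(p,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
logistic.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

# Multiple linear regression as a one-neuron network: linear activation + MSE loss.
linear = tf.keras.Sequential([
    tf.keras.Input(shape=(p,)),
    tf.keras.layers.Dense(1, activation="linear"),
])
linear.compile(optimizer="sgd", loss="mse")

# Training would then be, e.g.,
#   logistic.fit(X_train, y_train, epochs=50, batch_size=32)
# where X_train has shape (n, p) and y_train holds 0/1 labels.
```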
## Illustrations: Credit card dataset

### The credit card dataset

The credit card dataset can be downloaded from https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset.

1. Predict whether the credit card holder will default on a payment.
2. This is a binary classification problem.
3. Similar settings can be extended to insurance underwriting, fraud detection, etc.

## Challenges in FinTech (data science)

Which ML model performs best is problem- and data-specific. Given a specific problem, exploring stylized features of the data may help to improve on the benchmark model.

1. Problem formulation
2. Data exploration
    - Data cleaning
    - Exploratory data analysis (EDA)
    - Feature engineering
3. Modeling
    - Parameter tuning via cross-validation: avoid overfitting by comparing combinations of hyper-parameters through test scores and cross-validation scores (accuracy percentages).
    - Out-of-sample analysis (test data, or a rolling-window backtest)

## Hw 8

### Question 1

Read carefully the following blog: https://towardsdatascience.com/an-introduction-to-logistic-regression-8136ad65da2e

Let $f(x)=\frac{1}{1+e^{-x}}$ denote the sigmoid function. Derive the following identities:

1. $f(x)=\frac{\exp(x)}{\exp(x)+\exp(0)}$
2. $\frac{d}{dx}f(x) = f(x)(1-f(x))$
3. $1-f(x)=f(-x)$
4. The gradient of the loss function is $\frac{1}{N}\sum_{i=1}^N(p_i-y_i)x_i$, where $p_i$ is the predicted probability for observation $i$.

### Question 2

In this homework assignment, you will download the "Titanic Dataset" from NewE3 and use it to solve the following problems.

The variables are explained as follows:

- Name: the name of the passenger
- Sex: sex (male or female)
- Age: the age of the passenger
- Siblings/Spouses Aboard: number of siblings/spouses aboard
- Parents/Children Aboard: number of parents/children aboard
- Fare: passenger fare (British pounds)
- Pclass: passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Survived: survived (0 = No; 1 = Yes)

Analyze the data with logistic regression.

1. Provide an EDA for the dataset.
2. Pre-process the continuous and discrete variables.
3. Build the logistic regression model.
4. Predict the data and show the confusion matrix. Please explain it in detail.

stop here!

---

:::info
Following Question 2 in Hw 8:

5. Use KNN, decision tree, random forest, SVM, and neural network to predict the data and show the confusion matrices. Please explain your results in detail.
6. Among all of these models, which one gives your best result? Explain why you regard it as the best.

*Remark*: We leave KNN, DT, SVM, and NN as later work.
:::
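As a starting point for items 3-5, here is a minimal scikit-learn sketch. The file name `titanic.csv`, the encoding of `Sex`, and the train/test split are assumptions made only for illustration; adjust them to the actual file downloaded from NewE3 and to your own pre-processing from items 1-2.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Assumed file name and column names; change them to match the file from NewE3.
df = pd.read_csv("titanic.csv")

# Minimal pre-processing: encode Sex as 0/1 and drop the free-text Name column.
df["Sex"] = (df["Sex"] == "male").astype(int)
features = ["Pclass", "Sex", "Age", "Siblings/Spouses Aboard",
            "Parents/Children Aboard", "Fare"]
X, y = df[features], df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)             # default threshold l = 0.5
print(confusion_matrix(y_test, y_pred))    # rows: true 0/1, columns: predicted 0/1
```

The same train/test split and `confusion_matrix` call can be reused with the other classifiers listed in item 5.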