---
tags: machine-learning
---

# Classification problem

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/1.png" height="100%" width="70%">
</div>

> These are my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).

# I) Introduction

In classification, your algorithm produces discrete outcomes: one/zero, yes/no, do/don't and so on. Spam versus non-spam emails is a traditional example of a classification task: you want to predict whether the email fed to your program is spam or not, where usually 0 means not spam and 1 means spam.

Formally, we want to predict a variable $y \in \{0,1\}$, where 0 is called the **negative class** and 1 is called the **positive class**. Such a task is known as **binary classification**.

Other classification problems might require more than a binary output, for example where $y \in \{0,1,2,3\}$. Such a task is known as **multiclass classification**.

# II) Binary classification

Instead of our output vector $y$ being a continuous range of values, it will only be 0 or 1 ($y \in \{0,1\}$), where 0 is usually taken as the **"negative class"** and 1 as the **"positive class"**.

We can solve a binary classification problem with 2 methods: **linear regression** and **logistic regression**.

## 1) Logistic regression

Logistic regression is an algorithm better suited to classification problems than linear regression. Its most important feature is its ability to produce a hypothesis function bounded as follows:

$$0 \leq h_\theta(x) \leq 1$$

The hypothesis function is slightly different from the one used in linear regression. For logistic regression,

$$h(x)=g(\theta^{T}x)$$

which is the traditional hypothesis function processed by a new function $g$, defined as:

$$g(z)=\frac{1}{1+e^{-z}}$$

It is called the sigmoid function (or logistic function) and looks like **figure 1** below:

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/2.png" width="70%">
<figcaption><b>Figure 1</b></figcaption>
</div>
<br>

In a sigmoid function, as the input $z$ goes to $-\infty$, the output $g(z)$ approaches 0; as $z$ goes to $+\infty$, the output $g(z)$ approaches 1. This works like an output limiter which keeps the hypothesis function bounded between 0 and 1.

Thus, the hypothesis function is:

$$\begin{aligned} \boxed{h(x) = \frac{1}{1+e^{-\theta^{T}x}}} \end{aligned}$$

The logistic regression hypothesis function outputs a number between 0 and 1. You can think of it as the estimated probability that $y=1$ for a new input example $x$.
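As a quick sanity check, here is a minimal NumPy sketch of the sigmoid and the resulting hypothesis function. The parameter vector `theta` and the feature vector `x` below are made up purely for illustration, not taken from the course:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h(x) = g(theta^T x); always returns a value in (0, 1)."""
    return sigmoid(theta @ x)

# Made-up values, chosen only to show the bounded output.
theta = np.array([0.5, -1.0, 2.0])   # hypothetical parameters
x = np.array([1.0, 0.8, 1.2])        # x_0 = 1 (intercept term), then x_1, x_2

print(sigmoid(0))            # 0.5, the midpoint of the sigmoid
print(hypothesis(theta, x))  # ~0.89, read as P(y = 1 | x; theta)
```

No matter how large or negative $\theta^{T}x$ gets, the output stays strictly between 0 and 1, which is exactly the "output limiter" behavior described above.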
Let's take a banana classification task. Our goal is to determine whether or not a banana is ripe. For simplicity, let's consider that we are given only the size ($x_{1}$) and the "yellowness" ($x_{2}$) of the banana. So my feature vector $x$ is defined as:

$$x = \begin{bmatrix}x_{1}\\x_{2}\end{bmatrix}$$

It's now time to define whether a new picture of a banana given in input to my algorithm is ripe or not. I grab its feature vector and plug it into the hypothesis function, which magically produces some outcome, let's say:

$$h(x) = 0.9$$

The hypothesis is telling me that for the new banana in input, the probability that $y=1$ is 0.9. In other words, the new banana in the picture has a $90\%$ chance of being ripe.

Let me write it more formally and generally:

$$h(x) = P(y = 1 | x; \theta)$$

In words, the hypothesis function tells you the probability that $y=1$ given $x$, parametrized by $\theta$. Since the outcome $y$ is restricted to two values, 0 or 1, I can compute the probability that $y=0$ as well. The two probabilities must sum to 1:

$$P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1$$

Thus,

$$P(y = 0 | x; \theta) = 1 - P(y = 1 | x; \theta)$$

Let's find the probability of $y$ being 0 (unripe banana):

$$P(y = 0 | x; \theta) = 1 - 0.9 = 0.1$$

In other words, the new banana has a $10\%$ chance of being unripe.

<ins>**Binary output**</ins>

So far I've talked about percentages: where is the binary output? It's very simple to obtain. We already know that the hypothesis function $h(x)=g(\theta^{T}x)$ returns values in the range $[0,1]$. Let's make the following assumption:

$$\left\{\begin{matrix}h(x)\geq 0.5 \Rightarrow y=1 \\ h(x)< 0.5 \Rightarrow y=0 \end{matrix}\right.$$

That's kind of intuitive. Now, when exactly is $h(x)$ above or below $0.5$? If you look at the generic sigmoid function above (**figure 1**), you'll notice that $g(z) \geq 0.5$ whenever $z \geq 0$. Since the hypothesis function for logistic regression is defined as $h(x)=g(\theta^{T}x)$, I can clearly state that $h(x)=g(\theta^{T}x) \geq 0.5$ whenever $\theta^{T}x \geq 0$ (because $z=\theta^{T}x$). The same reasoning goes for $g(z) < 0.5$.

<ins>**Decision boundary**</ins>

Suppose you have a training set as in **figure 2** below: two features, $x_1$ (size of a banana) and $x_2$ (yellowness of a banana), and a bunch of items scattered around the plane. Crosses represent unripe bananas whereas circles represent ripe ones. The task of the logistic regression algorithm is to separate those points by drawing a line between them. That line is technically called the **decision boundary** and it is used to make inferences about the dataset: everything that is below the decision boundary belongs to category A and the remaining part belongs to category B.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/3.png" width="70%">
<figcaption><b>Figure 2</b></figcaption>
</div>
<br>

The decision boundary is a line, hence it can be described by an equation. As in linear regression, the logistic regression algorithm will be able to find the best $\theta$ parameters in order to make the decision boundary actually separate the data points correctly. Suppose that we know the hypothesis function:

$$h(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)$$

and through some magical procedures we end up with the following parameters:

$$\Theta = \begin{bmatrix}\theta_0\\ \theta_1\\ \theta_2 \end{bmatrix} = \begin{bmatrix}-3\\ 1\\ 1 \end{bmatrix}$$

We know since the beginning that $h(x)=g(\Theta^{T}x)$ and for the current example $\Theta^{T}x = \theta_0+\theta_1x_1+\theta_2x_2$, which is, not surprisingly, the equation of a line (the decision boundary). We also know from the previous paragraph that we can predict $y=1$ whenever $\Theta^{T}x \geq 0$. Thus, we can state that:

$$y = 1 \ \ \text{if} \ \ \theta_0 + \theta_1x_1 + \theta_2x_2 \geq 0$$

And for our specific example:

$$\begin{aligned} & y = 1 \ \ \text{if} \ -3 + x_1 + x_2 \geq 0 \\ & y = 1 \ \ \text{if} \ \ \ x_1 + x_2 \geq 3\end{aligned}$$

If you draw the equation $x_1+x_2=3$ you will obtain the decision boundary line in **figure 2** above.

In words: given the current training set, pick a new real-world example with features $x_1$ and $x_2$, i.e. a banana with a certain size and a certain level of yellowness. It will be classified as 1 ($y=1$, i.e. ripe) whenever it satisfies the inequality $x_1 + x_2 \geq 3$, that is whenever it lies above ($\geq$) the decision boundary. It will be classified as 0 ($y=0$, i.e. unripe) otherwise.
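Here is a small sketch of that prediction rule in NumPy. The parameters $\Theta = [-3, 1, 1]$ come from the example above; the two banana feature values are made up, one on each side of the boundary $x_1 + x_2 = 3$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x1, x2):
    """Return (probability, label) for a banana with features x1, x2."""
    z = theta @ np.array([1.0, x1, x2])   # theta^T x, with x_0 = 1
    h = sigmoid(z)
    return h, 1 if h >= 0.5 else 0        # h >= 0.5 exactly when z >= 0

theta = np.array([-3.0, 1.0, 1.0])        # parameters from the example above

# Made-up feature values on either side of the boundary x1 + x2 = 3.
print(predict(theta, 2.0, 2.0))   # x1 + x2 = 4   >= 3 -> (~0.73, 1): ripe
print(predict(theta, 1.0, 1.5))   # x1 + x2 = 2.5 <  3 -> (~0.38, 0): unripe
```

Note that the label only depends on which side of the line the point falls; the probability tells you how far inside that side it is.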
Here the power of the sigmoid function emerges. It does not matter how far the new example lies from the decision boundary:

$$y=1 \ \ \text{if ABOVE else} \ \ y=0$$

In the real world you will find oddly-shaped decision boundaries, as a simple straight line doesn't always separate things well. Those shapes are usually described by **polynomial equations**. Logistic regression has no built-in ability to define the decision boundary's equation. Usually it's good practice to look at the data and figure out the appropriate shape to implement: that's the so-called **data exploration**.

## 2) Logistic regression cost function

You might remember the original cost function $J(\Theta)$ used in linear regression (if you don't, check out my [post]). I can tell you right now that it's not going to work here with logistic regression. Let me go back for a minute to the cost function we used in linear regression:

$$J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$$

which can be rewritten in a slightly different way:

$$J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(h(x^{(i)}) - y^{(i)})^2$$

Now let's make it more general by defining a new function:

$$\mathrm{Cost}(h(x^{(i)}),y^{(i)}) = \frac{1}{2}(h(x^{(i)}) - y^{(i)})^2$$

Thus,

$$J(\Theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h(x^{(i)}),y^{(i)})$$

However, we know that the linear regression cost function cannot be used in logistic regression problems: plugging the sigmoid into the squared-error cost makes $J(\Theta)$ non-convex, full of local minima that gradient descent can get stuck in. So what is this all about? Well, it turns out that for logistic regression we just have to find a different function **Cost**, while the summation part stays the same.

For logistic regression, the cost function is defined as:

$$\mathrm{Cost}(h(x),y) = \begin{cases} -\log(h(x)) & \text{if y = 1} \\ -\log(1-h(x)) & \text{if y = 0} \end{cases}$$

The $i$ indexes have been removed for clarity. When $y=1$, the output (i.e. the cost to pay) approaches 0 as $h(x)$ approaches 1. Conversely, the cost to pay grows to infinity as $h(x)$ approaches 0. You can clearly see it on the left side of **figure 3** below. This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is $y=1$ but the algorithm predicts $h(x)=0$, the outcome is completely wrong. The same intuition applies when $y=0$, depicted on the right side of **figure 3**: bigger penalties when the label is $y=0$ but the algorithm predicts $h(x)=1$.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/4.png">
<figcaption><b>Figure 3</b></figcaption>
</div>
<br>

What we have just seen is the verbose version of the function **Cost** for logistic regression. We can make it more compact into a one-line expression: this will help avoid boring if/else statements when converting the formula into an algorithm.

$$\mathrm{Cost}(h(x),y) = -y \log(h(x)) - (1 - y) \log(1-h(x))$$

<ins>**Proof:**</ins> Replace $y$ with 0 and then with 1 and you will end up with the two pieces of the original function.
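A small numerical check of my own (the probability values are arbitrary) showing that the one-line expression reproduces both branches of the piecewise definition, and that the penalty blows up for confident wrong predictions:

```python
import numpy as np

def cost(h, y):
    """Compact per-example cost: -y*log(h) - (1-y)*log(1-h)."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# The compact form matches the two branches of the piecewise definition:
print(cost(0.9, 1), -np.log(0.9))      # same value when y = 1
print(cost(0.9, 0), -np.log(1 - 0.9))  # same value when y = 0

# The penalty grows without bound for confident wrong predictions:
for h in (0.5, 0.1, 0.01, 1e-6):
    print(h, cost(h, 1))               # y = 1 but h(x) -> 0: cost -> infinity
```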
With the compact form in place, the logistic regression cost function can be rewritten as:

$$\begin{aligned} J(\Theta) & = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h(x^{(i)}),y^{(i)}) \\ & = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1-h(x^{(i)}))\right] \end{aligned}$$

We now have the hypothesis function and the cost function. It's now time to find the best values for the $\theta$ parameters in the cost function, or in other words, to minimize the cost function by running the gradient descent algorithm.

<ins>**Apply gradient descent:**</ins>

\begin{align*} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\Theta) \\ \text{\}} \end{align*}

<ins>**General derivative formula of the cost function:**</ins>

$$\boxed{\frac{\partial}{\partial \theta_j}J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)}) \, x_j^{(i)}}$$

Surprisingly, it looks identical to what we were doing for multivariate linear regression. What has changed, however, is the definition of the hypothesis $h(x)$:

- For linear regression we had $h(x)=\theta^{T}x$.
- For logistic regression we have $h(x) = \frac{1}{1 + e^{-\theta^{T} x}}$.

In **linear regression**, you are trying to find the **best fit line** through the plotted data. In **logistic regression**, you are trying to **separate** the plotted data into categories. We are asking ourselves: **what is the probability of the point belonging to the "positive class"?**
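Putting the cost, the gradient formula and the update rule together, here is a minimal batch gradient descent sketch. The tiny "banana" dataset, the learning rate `alpha` and the iteration count are all made up for illustration; they are not values from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=5000):
    """Batch gradient descent on J(theta) for logistic regression.

    X is (m, n) with a leading column of ones; y is (m,) of 0/1 labels.
    Update rule: theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)           # hypothesis for all m examples at once
        grad = (X.T @ (h - y)) / m       # vectorized gradient of J(theta)
        theta -= alpha * grad
    return theta

# Tiny made-up "banana" dataset: columns are [1, size, yellowness].
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.0, 1.5],
              [1.0, 2.0, 2.0],
              [1.0, 2.5, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
print(theta)                             # learned parameters
print(sigmoid(X @ theta) >= 0.5)         # [False, False, True, True]
```

The only difference from the linear regression loop is the `sigmoid` wrapped around `X @ theta`, mirroring the bullet points above.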
## 3) Proof: General derivative formula of the cost function

Let's compute $\frac{\partial}{\partial \theta_j} J(\Theta)$. First, let's calculate the derivative of the sigmoid function (it will be useful while finding the partial derivative of $J(\Theta)$):

$$\begin{aligned} \sigma'(x)=\left(\frac{1}{1+e^{-x}}\right)'&=\frac{-(1+e^{-x})'}{(1+e^{-x})^2} \\ &=\frac{-1'-(e^{-x})'}{(1+e^{-x})^2} \\ &=\frac{0-(-x)'(e^{-x})}{(1+e^{-x})^2} \\ &=\frac{-(-1)(e^{-x})}{(1+e^{-x})^2} \\ &=\frac{e^{-x}}{(1+e^{-x})^2} \\ &=\left(\frac{1}{1+e^{-x}}\right)\left(\frac{e^{-x}}{1+e^{-x}}\right) \\ &=\sigma(x)\left(\frac{1-1 + e^{-x}}{1+e^{-x}}\right) \\ &=\sigma(x)\left(\frac{1 + e^{-x}}{1+e^{-x}} -\frac{1}{1+e^{-x}}\right) \\ &=\sigma(x)(1 - \sigma(x))\end{aligned}$$

Now we are ready to work out the resulting partial derivative:

<br>

\begin{align*} \frac{\partial}{\partial \theta_j} J(\Theta) &= \frac{\partial}{\partial \theta_j} \frac{-1}{m}\sum_{i=1}^m \left [ y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [y^{(i)} \frac{\partial}{\partial \theta_j} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \frac{\partial}{\partial \theta_j} \log (1 - h_\theta(x^{(i)}))\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - h_\theta(x^{(i)}))}{1 - h_\theta(x^{(i)})}\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [\frac{y^{(i)} \frac{\partial}{\partial \theta_j} \overbrace{ \sigma(\theta^T x^{(i)}) }^{f \circ g} }{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} \overbrace{ (1 - \sigma(\theta^T x^{(i)})) }^{u \circ v} }{1 - h_\theta(x^{(i)})}\right ] \\ &=- \frac{1}{m}\sum_{i=1}^m \left [\frac{y^{(i)} \overbrace{ \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)} }^{(f \circ g)' = (f' \circ g) \cdot g'} } {h_\theta(x^{(i)})} + \frac{-(1-y^{(i)}) \overbrace{ \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)} }^{(u \circ v)' = (u' \circ v) \cdot v'} } {1 - h_\theta(x^{(i)})}\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [\frac{y^{(i)} h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{(1-y^{(i)}) h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})}\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} (1 - h_\theta(x^{(i)})) x^{(i)}_j - (1-y^{(i)}) h_\theta(x^{(i)}) x^{(i)}_j\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} (1 - h_\theta(x^{(i)})) - (1-y^{(i)}) h_\theta(x^{(i)}) \right ] x^{(i)}_j \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right ] x^{(i)}_j \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h_\theta(x^{(i)}) \right ] x^{(i)}_j \\ &= \boxed{\frac{1}{m}\sum_{i=1}^m \left [ h_\theta(x^{(i)}) - y^{(i)} \right ] x^{(i)}_j} \end{align*}
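If you don't want to redo the algebra, you can also convince yourself numerically with a gradient check: compare the boxed analytic gradient against a finite-difference approximation of $J(\Theta)$. This is a sketch of my own (not part of the course notes), using random made-up data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y):
    """Logistic regression cost J(theta)."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_grad(theta, X, y):
    """Boxed formula: (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

def numeric_grad(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (J(theta + e, X, y) - J(theta - e, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # made-up data
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=3)

# The two gradients agree, which is what the derivation above proves.
print(np.allclose(analytic_grad(theta, X, y), numeric_grad(theta, X, y), atol=1e-6))
```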
# III) Multi-class classification: One-vs-all

Let's say you want a learning algorithm that tags your emails as "friend", "spam" or "work". Since there are more than 2 output classes, we are no longer in a binary classification problem.

In a multi-class classification problem, we isolate one class at a time (the "positive class") from all the others (the "negative class"). It's like solving a binary classification problem for each class: we train one logistic regression classifier $h_\theta^{(i)}(x)$ per class $i$, and to classify a new input $x$ we pick the class $i$ whose classifier outputs the highest $h_\theta^{(i)}(x)$. Here is an example below:

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/5.png">
</div>

[post]: https://hackmd.io/@machine-learning/HJsGZfx88#Regression-Problem
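To close, here is a minimal one-vs-all sketch in NumPy, reusing the gradient descent loop from earlier. The three-cluster dataset, the learning rate and the iteration count are all made up for illustration; this is not the course's own implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, n_iters=3000):
    """Fit one binary logistic regression classifier with gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / X.shape[0]
    return theta

def one_vs_all(X, labels, n_classes):
    """Train one classifier per class, treating that class as y = 1."""
    return np.array([train_logistic(X, (labels == c).astype(float))
                     for c in range(n_classes)])

def predict(all_theta, X):
    """Pick, for each example, the class whose classifier is most confident."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Made-up 2-feature dataset with 3 classes (e.g. friend / spam / work).
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [4, 0], [0, 4]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
X = np.hstack([np.ones((X.shape[0], 1)), X])         # add intercept column
labels = np.repeat([0, 1, 2], 30)

all_theta = one_vs_all(X, labels, 3)
print((predict(all_theta, X) == labels).mean())      # training accuracy, ~1.0
```

Each classifier only knows "this class versus everything else"; the `argmax` at prediction time is what turns the three binary answers into a single multi-class decision.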