---
tags: machine-learning
---

# Classification problem

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/1.png" height="100%" width="70%">
</div>

> These are my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).

# I) Introduction

In classification, your algorithm produces discrete outcomes: one/zero, yes/no, do/don't and so on. Spam versus non-spam emails is a traditional example of a classification task: you want to predict whether the email fed to your program is spam or not, where usually 0 means not spam and 1 means spam.

Formally, we want to predict a variable $y \in \{0,1\}$, where 0 is called the **negative class** and 1 is called the **positive class**. Such a task is known as **binary classification**.

Other classification problems might require more than a binary output, for example where $y \in \{0,1,2,3\}$. Such a task is known as **multiclass classification**.

# II) Binary classification

Instead of our output vector $y$ being a continuous range of values, it will only be 0 or 1 ($y \in \{0,1\}$), where 0 is usually taken as the **"negative class"** and 1 as the **"positive class"**.

We can solve a binary classification problem with 2 methods: **linear regression** and **logistic regression**.

## 1) Logistic regression

Logistic regression is an algorithm better suited to classification problems than linear regression. Its most important feature is its ability to produce a hypothesis function bounded as follows:

$$0 \leq h_\theta(x) \leq 1$$

The hypothesis function is slightly different from the one used in linear regression. For logistic regression,

$$h(x)=g(\theta^{T}x)$$

which is the traditional hypothesis function processed by a new function $g$, defined as:

$$g(z)=\frac{1}{1+e^{-z}}$$

It is called the sigmoid function (or logistic function) and looks like **figure 1** below:

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/2.png" width="70%">
<figcaption><b>Figure 1</b></figcaption>
</div>
<br>

In a sigmoid function, as the input $z$ goes to $-\infty$, the output $g(z)$ approaches 0; as $z$ goes to $+\infty$, the output $g(z)$ approaches 1. This works like an output limiter which keeps the hypothesis function bounded between 0 and 1.

Thus, the hypothesis function is:

$$\begin{aligned} \boxed{h(x) = \frac{1}{1+e^{-\theta^{T}x}}} \end{aligned}$$

The logistic regression hypothesis function outputs a number between 0 and 1. You can think of it as the estimated probability that $y=1$ for a new input example $x$.
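As a quick sanity check, here is a minimal NumPy sketch of the sigmoid and the resulting hypothesis function. The parameter vector `theta` and the feature vector `x` below are made up purely for illustration, not taken from the course:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h(x) = g(theta^T x); always returns a value in (0, 1)."""
    return sigmoid(theta @ x)

# Made-up values, chosen only to show the bounded output.
theta = np.array([0.5, -1.0, 2.0])   # hypothetical parameters
x = np.array([1.0, 0.8, 1.2])        # x_0 = 1 (intercept term), then x_1, x_2

print(sigmoid(0))            # 0.5, the midpoint of the sigmoid
print(hypothesis(theta, x))  # ~0.89, read as P(y = 1 | x; theta)
```

No matter how large or negative $\theta^{T}x$ gets, the output stays strictly between 0 and 1, which is exactly the "output limiter" behavior described above.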
Let's take a banana classification task. Our goal is to determine whether or not a banana is ripe. For simplicity, let's consider that we are given only the size ($x_{1}$) and the "yellowness" ($x_{2}$) of the banana. So my feature vector $x$ is defined as:

$$x = \begin{bmatrix}x_{1}\\x_{2}\end{bmatrix}$$

It's now time to define whether a new picture of a banana given in input to my algorithm is ripe or not. I grab its feature vector and plug it into the hypothesis function, which magically produces some outcome, let's say:

$$h(x) = 0.9$$

The hypothesis is telling me that for the new banana in input, the probability that $y=1$ is 0.9. In other words, the new banana in the picture has a $90\%$ chance of being ripe.

Let me write it more formally and generally:

$$h(x) = P(y = 1 | x; \theta)$$

In words, the hypothesis function tells you the probability that $y=1$ given $x$, parametrized by $\theta$. Since the outcome $y$ is restricted to two values, 0 or 1, I can compute the probability that $y=0$ as well. The two probabilities must sum to 1:

$$P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1$$

Thus,

$$P(y = 0 | x; \theta) = 1 - P(y = 1 | x; \theta)$$

Let's find the probability of $y$ being 0 (unripe banana):

$$P(y = 0 | x; \theta) = 1 - 0.9 = 0.1$$

In other words, the new banana has a $10\%$ chance of being unripe.

<ins>**Binary output**</ins>

So far I've talked about percentages: where is the binary output? It's very simple to obtain. We already know that the hypothesis function $h(x)=g(\theta^{T}x)$ returns values in the range $[0,1]$. Let's make the following assumption:

$$\left\{\begin{matrix}h(x)\geq 0.5 \Rightarrow y=1 \\ h(x)< 0.5 \Rightarrow y=0 \end{matrix}\right.$$

That's kind of intuitive. Now, when exactly is $h(x)$ above or below $0.5$? If you look at the generic sigmoid function above (**figure 1**), you'll notice that $g(z) \geq 0.5$ whenever $z \geq 0$. Since the hypothesis function for logistic regression is defined as $h(x)=g(\theta^{T}x)$, I can clearly state that $h(x)=g(\theta^{T}x) \geq 0.5$ whenever $\theta^{T}x \geq 0$ (because $z=\theta^{T}x$). The same reasoning goes for $g(z) < 0.5$.

<ins>**Decision boundary**</ins>

Suppose you have a training set as in **figure 2** below: two features, $x_1$ (size of a banana) and $x_2$ (yellowness of a banana), and a bunch of items scattered around the plane. Crosses represent unripe bananas whereas circles represent ripe ones. The task of the logistic regression algorithm is to separate those points by drawing a line between them. That line is technically called the **decision boundary** and it is used to make inferences about the dataset: everything that is below the decision boundary belongs to category A and the remaining part belongs to category B.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/3.png" width="70%">
<figcaption><b>Figure 2</b></figcaption>
</div>
<br>

The decision boundary is a line, hence it can be described by an equation. As in linear regression, the logistic regression algorithm will be able to find the best $\theta$ parameters in order to make the decision boundary actually separate the data points correctly. Suppose that we know the hypothesis function:

$$h(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)$$

and through some magical procedures we end up with the following parameters:

$$\Theta = \begin{bmatrix}\theta_0\\ \theta_1\\ \theta_2 \end{bmatrix} = \begin{bmatrix}-3\\ 1\\ 1 \end{bmatrix}$$

We know since the beginning that $h(x)=g(\Theta^{T}x)$ and for the current example $\Theta^{T}x = \theta_0+\theta_1x_1+\theta_2x_2$, which is, not surprisingly, the equation of a line (the decision boundary). We also know from the previous paragraph that we can predict $y=1$ whenever $\Theta^{T}x \geq 0$. Thus, we can state that:

$$y = 1 \ \ \text{if} \ \ \theta_0 + \theta_1x_1 + \theta_2x_2 \geq 0$$

And for our specific example:

$$\begin{aligned} & y = 1 \ \ \text{if} \ -3 + x_1 + x_2 \geq 0 \\ & y = 1 \ \ \text{if} \ \ \ x_1 + x_2 \geq 3\end{aligned}$$

If you draw the equation $x_1+x_2=3$ you will obtain the decision boundary line in **figure 2** above.

In words: given the current training set, pick a new real-world example with features $x_1$ and $x_2$, i.e. a banana with a certain size and a certain level of yellowness. It will be classified as 1 ($y=1$, i.e. ripe) whenever it satisfies the inequality $x_1 + x_2 \geq 3$, that is whenever it lies above ($\geq$) the decision boundary. It will be classified as 0 ($y=0$, i.e. unripe) otherwise.
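Here is a small sketch of that prediction rule in NumPy. The parameters $\Theta = [-3, 1, 1]$ come from the example above; the two banana feature values are made up, one on each side of the boundary $x_1 + x_2 = 3$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x1, x2):
    """Return (probability, label) for a banana with features x1, x2."""
    z = theta @ np.array([1.0, x1, x2])   # theta^T x, with x_0 = 1
    h = sigmoid(z)
    return h, 1 if h >= 0.5 else 0        # h >= 0.5 exactly when z >= 0

theta = np.array([-3.0, 1.0, 1.0])        # parameters from the example above

# Made-up feature values on either side of the boundary x1 + x2 = 3.
print(predict(theta, 2.0, 2.0))   # x1 + x2 = 4   >= 3 -> (~0.73, 1): ripe
print(predict(theta, 1.0, 1.5))   # x1 + x2 = 2.5 <  3 -> (~0.38, 0): unripe
```

Note that the label only depends on which side of the line the point falls; the probability tells you how far inside that side it is.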
Here the power of the sigmoid function emerges. It does not matter how far the new example lies from the decision boundary:

$$y=1 \ \ \text{if ABOVE else} \ \ y=0$$

In the real world you will find oddly-shaped decision boundaries, as a simple straight line doesn't always separate things well. Those shapes are usually described by **polynomial equations**. Logistic regression has no built-in ability to define the decision boundary's equation. Usually it's good practice to look at the data and figure out the appropriate shape to implement: that's the so-called **data exploration**.

## 2) Logistic regression cost function

You might remember the original cost function $J(\Theta)$ used in linear regression (if you don't, check out my [post]). I can tell you right now that it's not going to work here with logistic regression. Let me go back for a minute to the cost function we used in linear regression:

$$J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$$

which can be rewritten in a slightly different way:

$$J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(h(x^{(i)}) - y^{(i)})^2$$

Now let's make it more general by defining a new function:

$$\mathrm{Cost}(h(x^{(i)}),y^{(i)}) = \frac{1}{2}(h(x^{(i)}) - y^{(i)})^2$$

Thus,

$$J(\Theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h(x^{(i)}),y^{(i)})$$

However, we know that the linear regression cost function cannot be used in logistic regression problems: plugging the sigmoid into the squared-error cost makes $J(\Theta)$ non-convex, full of local minima that gradient descent can get stuck in. So what is this all about? Well, it turns out that for logistic regression we just have to find a different function **Cost**, while the summation part stays the same.

For logistic regression, the cost function is defined as:

$$\mathrm{Cost}(h(x),y) = \begin{cases} -\log(h(x)) & \text{if y = 1} \\ -\log(1-h(x)) & \text{if y = 0} \end{cases}$$

The $i$ indexes have been removed for clarity. When $y=1$, the output (i.e. the cost to pay) approaches 0 as $h(x)$ approaches 1. Conversely, the cost to pay grows to infinity as $h(x)$ approaches 0. You can clearly see it on the left side of **figure 3** below. This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is $y=1$ but the algorithm predicts $h(x)=0$, the outcome is completely wrong. The same intuition applies when $y=0$, depicted on the right side of **figure 3**: bigger penalties when the label is $y=0$ but the algorithm predicts $h(x)=1$.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/4.png">
<figcaption><b>Figure 3</b></figcaption>
</div>
<br>

What we have just seen is the verbose version of the function **Cost** for logistic regression. We can make it more compact into a one-line expression: this will help avoid boring if/else statements when converting the formula into an algorithm.

$$\mathrm{Cost}(h(x),y) = -y \log(h(x)) - (1 - y) \log(1-h(x))$$

<ins>**Proof:**</ins> Replace $y$ with 0 and then with 1 and you will end up with the two pieces of the original function.
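A small numerical check of my own (the probability values are arbitrary) showing that the one-line expression reproduces both branches of the piecewise definition, and that the penalty blows up for confident wrong predictions:

```python
import numpy as np

def cost(h, y):
    """Compact per-example cost: -y*log(h) - (1-y)*log(1-h)."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# The compact form matches the two branches of the piecewise definition:
print(cost(0.9, 1), -np.log(0.9))      # same value when y = 1
print(cost(0.9, 0), -np.log(1 - 0.9))  # same value when y = 0

# The penalty grows without bound for confident wrong predictions:
for h in (0.5, 0.1, 0.01, 1e-6):
    print(h, cost(h, 1))               # y = 1 but h(x) -> 0: cost -> infinity
```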
With the compact form in place, the logistic regression cost function can be rewritten as:

$$\begin{aligned} J(\Theta) & = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h(x^{(i)}),y^{(i)}) \\ & = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1-h(x^{(i)}))\right] \end{aligned}$$

We now have the hypothesis function and the cost function. It's now time to find the best values for the $\theta$ parameters in the cost function, or in other words, to minimize the cost function by running the gradient descent algorithm.

<ins>**Apply gradient descent:**</ins>

\begin{align*} \text{repeat until convergence \{} \\ \theta_j & := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\Theta) \\ \text{\}} \end{align*}

<ins>**General derivative formula of the cost function:**</ins>

$$\boxed{\frac{\partial}{\partial \theta_j}J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)}) \, x_j^{(i)}}$$

Surprisingly, it looks identical to what we were doing for multivariate linear regression. What has changed, however, is the definition of the hypothesis $h(x)$:

- For linear regression we had $h(x)=\theta^{T}x$.
- For logistic regression we have $h(x) = \frac{1}{1 + e^{-\theta^{T} x}}$.

In **linear regression**, you are trying to find the **best fit line** through the plotted data. In **logistic regression**, you are trying to **separate** the plotted data into categories. We are asking ourselves: **what is the probability of the point belonging to the "positive class"?**
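Putting the cost, the gradient formula and the update rule together, here is a minimal batch gradient descent sketch. The tiny "banana" dataset, the learning rate `alpha` and the iteration count are all made up for illustration; they are not values from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=5000):
    """Batch gradient descent on J(theta) for logistic regression.

    X is (m, n) with a leading column of ones; y is (m,) of 0/1 labels.
    Update rule: theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)           # hypothesis for all m examples at once
        grad = (X.T @ (h - y)) / m       # vectorized gradient of J(theta)
        theta -= alpha * grad
    return theta

# Tiny made-up "banana" dataset: columns are [1, size, yellowness].
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.0, 1.5],
              [1.0, 2.0, 2.0],
              [1.0, 2.5, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
print(theta)                             # learned parameters
print(sigmoid(X @ theta) >= 0.5)         # [False, False, True, True]
```

The only difference from the linear regression loop is the `sigmoid` wrapped around `X @ theta`, mirroring the bullet points above.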
## 3) Proof: General derivative formula of the cost function

Let's compute $\frac{\partial}{\partial \theta_j} J(\Theta)$. First, let's calculate the derivative of the sigmoid function (it will be useful while finding the partial derivative of $J(\Theta)$):

$$\begin{aligned} \sigma'(x)=\left(\frac{1}{1+e^{-x}}\right)'&=\frac{-(1+e^{-x})'}{(1+e^{-x})^2} \\ &=\frac{-1'-(e^{-x})'}{(1+e^{-x})^2} \\ &=\frac{0-(-x)'(e^{-x})}{(1+e^{-x})^2} \\ &=\frac{-(-1)(e^{-x})}{(1+e^{-x})^2} \\ &=\frac{e^{-x}}{(1+e^{-x})^2} \\ &=\left(\frac{1}{1+e^{-x}}\right)\left(\frac{e^{-x}}{1+e^{-x}}\right) \\ &=\sigma(x)\left(\frac{1-1 + e^{-x}}{1+e^{-x}}\right) \\ &=\sigma(x)\left(\frac{1 + e^{-x}}{1+e^{-x}} -\frac{1}{1+e^{-x}}\right) \\ &=\sigma(x)(1 - \sigma(x))\end{aligned}$$

Now we are ready to work out the resulting partial derivative:

<br>

\begin{align*} \frac{\partial}{\partial \theta_j} J(\Theta) &= \frac{\partial}{\partial \theta_j} \frac{-1}{m}\sum_{i=1}^m \left [ y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [y^{(i)} \frac{\partial}{\partial \theta_j} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \frac{\partial}{\partial \theta_j} \log (1 - h_\theta(x^{(i)}))\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - h_\theta(x^{(i)}))}{1 - h_\theta(x^{(i)})}\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [\frac{y^{(i)} \frac{\partial}{\partial \theta_j} \overbrace{ \sigma(\theta^T x^{(i)}) }^{f \circ g} }{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} \overbrace{ (1 - \sigma(\theta^T x^{(i)})) }^{u \circ v} }{1 - h_\theta(x^{(i)})}\right ] \\ &=- \frac{1}{m}\sum_{i=1}^m \left [\frac{y^{(i)} \overbrace{ \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)} }^{(f \circ g)' = (f' \circ g) \cdot g'} } {h_\theta(x^{(i)})} + \frac{-(1-y^{(i)}) \overbrace{ \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)} }^{(u \circ v)' = (u' \circ v) \cdot v'} } {1 - h_\theta(x^{(i)})}\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [\frac{y^{(i)} h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{(1-y^{(i)}) h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})}\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} (1 - h_\theta(x^{(i)})) x^{(i)}_j - (1-y^{(i)}) h_\theta(x^{(i)}) x^{(i)}_j\right ] \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} (1 - h_\theta(x^{(i)})) - (1-y^{(i)}) h_\theta(x^{(i)}) \right ] x^{(i)}_j \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right ] x^{(i)}_j \\ &= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h_\theta(x^{(i)}) \right ] x^{(i)}_j \\ &= \boxed{\frac{1}{m}\sum_{i=1}^m \left [ h_\theta(x^{(i)}) - y^{(i)} \right ] x^{(i)}_j} \end{align*}
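If you don't want to redo the algebra, you can also convince yourself numerically with a gradient check: compare the boxed analytic gradient against a finite-difference approximation of $J(\Theta)$. This is a sketch of my own (not part of the course notes), using random made-up data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y):
    """Logistic regression cost J(theta)."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_grad(theta, X, y):
    """Boxed formula: (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

def numeric_grad(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (J(theta + e, X, y) - J(theta - e, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # made-up data
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=3)

# The two gradients agree, which is what the derivation above proves.
print(np.allclose(analytic_grad(theta, X, y), numeric_grad(theta, X, y), atol=1e-6))
```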
# III) Multi-class classification: One-vs-all

Let's say you want a learning algorithm that tags your emails as "friend", "spam" or "work". Since there are more than 2 output classes, we are no longer in a binary classification problem.

In a multi-class classification problem, we isolate one class at a time (the "positive class") from all the others (the "negative class"). It's like solving a binary classification problem for each class: we train one logistic regression classifier $h_\theta^{(i)}(x)$ per class $i$, and to classify a new input $x$ we pick the class $i$ whose classifier outputs the highest $h_\theta^{(i)}(x)$. Here is an example below:

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/classification-problem/5.png">
</div>

[post]: https://hackmd.io/@machine-learning/HJsGZfx88#Regression-Problem
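To close, here is a minimal one-vs-all sketch in NumPy, reusing the gradient descent loop from earlier. The three-cluster dataset, the learning rate and the iteration count are all made up for illustration; this is not the course's own implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, n_iters=3000):
    """Fit one binary logistic regression classifier with gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / X.shape[0]
    return theta

def one_vs_all(X, labels, n_classes):
    """Train one classifier per class, treating that class as y = 1."""
    return np.array([train_logistic(X, (labels == c).astype(float))
                     for c in range(n_classes)])

def predict(all_theta, X):
    """Pick, for each example, the class whose classifier is most confident."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Made-up 2-feature dataset with 3 classes (e.g. friend / spam / work).
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [4, 0], [0, 4]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
X = np.hstack([np.ones((X.shape[0], 1)), X])         # add intercept column
labels = np.repeat([0, 1, 2], 30)

all_theta = one_vs_all(X, labels, 3)
print((predict(all_theta, X) == labels).mean())      # training accuracy, ~1.0
```

Each classifier only knows "this class versus everything else"; the `argmax` at prediction time is what turns the three binary answers into a single multi-class decision.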