Regression Problem


These are my personal notes taken for the Machine Learning course by Stanford. Feel free to check the assignments.
Also, if you want to read my other notes, feel free to check them out on my blog.

I) Linear Regression (one variable)


where $(x_i, y_i)$ are values from the training set.

Example:

  • $x_1$ = 2415
  • $y_1$ = 400,000

We then plot these values on a graph. The idea is to have a straight line (blue) that best fits the plotted points (red). The blue line is created by the hypothesis function.


Hypothesis function (formula of a straight line):

$$h(x) = \theta_0 + \theta_1 x$$

If we take random values of $\theta_0$ and $\theta_1$, we will probably get a straight line that doesn't fit our data (the plotted points) well.

Idea:

Choose $\theta_0$ and $\theta_1$ such that $h(x^{(i)})$ is close to $y^{(i)}$. Thus, we will have a straight line that best fits our data. In order to choose good values for $\theta_0$ and $\theta_1$, we have to use a cost function.

Cost function (MSE: Mean Squared Error):

$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2$$

Where the variables used are:

  1. $m$ : number of examples in the training set.
  2. $x^{(i)}$ : input feature (value from the training set).
  3. $y^{(i)}$ : output value (value from the training set).
  4. $h(x^{(i)})$ : our prediction.
  5. $\Theta$ : parameters.
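
As a minimal sketch of this cost (assuming the training set is stored in two NumPy arrays `x` and `y`; the names and values below are purely illustrative):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error J(theta0, theta1), including the 1/(2m) factor."""
    m = len(x)                            # number of training examples
    predictions = theta0 + theta1 * x     # h(x_i) for every example at once
    return np.sum((predictions - y) ** 2) / (2 * m)

# Illustrative training set (e.g. house sizes and prices)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([400.0, 232.0, 315.0, 178.0])

print(cost(0.0, 0.2, x, y))   # a single number: the cost of this particular line
```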

Remark:

$J(\Theta)$ always evaluates to a single value. Your intuition may tell you that we are just computing the mean distance between two points (if we removed the $\frac{1}{2}$ and the square). The square makes it look like a Euclidean distance.


Euclidean distance:

Let $A(x_a, y_a)$ and $B(x_b, y_b)$ be two points in a Cartesian system. Then, the distance $AB$ in this system is:

$$AB = \sqrt{(x_b - x_a)^2 + (y_b - y_a)^2}$$

In fact, the cost function is just a Euclidean distance. In our case, we square it to make the computation easier when taking the derivative. You may wonder why there are no $x$ coordinates in it: this is because their difference is equal to 0. We compute the distance between two points that have the same $x$ coordinate, so when using the formula, the difference between the $x$'s is 0. See the picture below.

[Figure: for a given $x_i$, the vertical distance between the predicted point $(x_i, h(x_i))$ and the actual point $(x_i, y_i)$.]
  1. Cost is high $\Rightarrow$ the difference between $h(x_i)$ and $y_i$ is high $\Rightarrow$ $(x_i, h(x_i))$ and $(x_i, y_i)$ are far from each other $\Rightarrow$ the model is performing badly.
  2. Cost is low $\Rightarrow$ the difference between $h(x_i)$ and $y_i$ is low $\Rightarrow$ $(x_i, h(x_i))$ and $(x_i, y_i)$ are near each other $\Rightarrow$ the model is performing well.

Remark:

You can notice that we divide by $m$. This gives us the average error per data point. The benefit of the average error is that if we have two datasets $\{x_i, y_i\}$ and $\{x'_i, y'_i\}$ of different sizes, then we can compare the average errors but not the total errors. For example, if the second dataset is, let's say, ten times the size of the first, then we would expect the total error to be about ten times larger for the same model. On the other hand, the average error divides out the effect of the size of the dataset, and so we would expect models of similar performance to have similar average errors on different datasets.


Reformulation of the idea:

Find good values of $\theta_0$ and $\theta_1$ in order to minimize the cost function (that is to say, make it as close to 0 as possible).

To do so, we are going to apply the gradient descent algorithm:

For a certain number of iterations:

$$\Theta \leftarrow \Theta - \alpha \frac{d}{d\Theta} J(\Theta)$$

  1. $\leftarrow$ : the assignment operation.
  2. $\Theta$ : parameter.
  3. $\alpha$ : the learning rate.
  4. $J(\Theta)$ : cost function.

In our case, it will be:

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{d}{d\theta_0} J(\theta_0, \theta_1)$$
$$\theta_1 \leftarrow \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_0, \theta_1)$$

Cost function:

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h(x_i) - y_i\right)^2$$

Goal:

$$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$$

(a fancy way to say that we want to find the $\theta_0$ and $\theta_1$ which minimize the cost $J(\theta_0, \theta_1)$).

Apply gradient descent:

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{d}{d\theta_0} J(\theta_0, \theta_1)$$
$$\theta_1 \leftarrow \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_0, \theta_1)$$

Computation of derivatives:

$$\text{(a)}\quad \frac{d}{d\theta_0} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h(x_i) - y_i\right)$$
$$\text{(b)}\quad \frac{d}{d\theta_1} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h(x_i) - y_i\right) x_i$$


Case (a):

Let $u = h(x_i) - y_i$. Then

$$\frac{d}{d\theta_0} J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m} 2\left(h(x_i) - y_i\right)^{2-1} \frac{du}{d\theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x_i) - y_i\right)\cdot 1 = \frac{1}{m}\sum_{i=1}^{m}\left(h(x_i) - y_i\right)$$

with

$$\frac{du}{d\theta_0} = \frac{d\left(h(x_i) - y_i\right)}{d\theta_0} = \frac{d\left(\theta_0 + \theta_1 x_i - y_i\right)}{d\theta_0} = 1$$


Case (b):

$$\frac{d}{d\theta_1} J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m} 2\left(h(x_i) - y_i\right)^{2-1} \frac{du}{d\theta_1} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x_i) - y_i\right) x_i$$

with

$$\frac{du}{d\theta_1} = \frac{d\left(h(x_i) - y_i\right)}{d\theta_1} = \frac{d\left(\theta_0 + \theta_1 x_i - y_i\right)}{d\theta_1} = x_i$$
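
These two update rules can be sketched directly in code (a minimal sketch; the NumPy arrays, the learning rate `alpha`, and the iteration count are illustrative assumptions, not part of the course material):

```python
import numpy as np

def gradient_descent_1d(x, y, alpha=0.1, n_iters=1000):
    """Repeatedly apply the simultaneous updates using derivatives (a) and (b)."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0                # arbitrary starting values
    for _ in range(n_iters):
        h = theta0 + theta1 * x              # current predictions h(x_i)
        grad0 = np.sum(h - y) / m            # (a) d/dtheta0 of J
        grad1 = np.sum((h - y) * x) / m      # (b) d/dtheta1 of J
        theta0 -= alpha * grad0              # simultaneous update
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.0, 8.1])           # roughly y = 2x
print(gradient_descent_1d(x, y))             # theta1 ends up close to 2
```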


II) Linear Regression (multiple variables)

Linear regression with multiple variables is also known as "multivariate linear regression".

We now introduce notation for equations where we can have any number of input variables.

For example, let's take the following training set.

[Table: training set of housing examples (47 examples, 4 features).]

Notice that indices start at 1 and not 0.

  1. $n$ : number of features.
  2. $m$ : number of training examples.
  3. $F_i$ : the $i$-th feature column.
  4. $x^{(i)}$ : the $i$-th training example (the superscript $i$ is used as an index).
  5. $x_j^{(i)}$ : the value of the $j$-th feature in the $i$-th training example.

In our case:

  1. $n = 4$ (the features $x_1$, $x_2$, $x_3$, $x_4$)
  2. $m = 47$
  3. $F_1 = \begin{bmatrix} 2104 \\ 1416 \\ 1534 \\ 852 \\ \vdots \end{bmatrix}$
  4. $x^{(2)} = \begin{bmatrix} 1416 \\ 3 \\ 2 \\ 40 \end{bmatrix}$
  5. $x_1^{(2)} = 1416$ (the 1st element of the 2nd training example)

Multivariable hypothesis function:

$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$$

The reason we multiply the parameters $\theta$ with the features $x$ is to see which feature is the most influential. For example, when it comes to determining a house's sale value, maybe the size matters more than the number of bedrooms. Thus, the parameter in front of the "size" feature will be higher than the one in front of the "number of bedrooms" feature.

Let

$$\Theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \quad \text{and} \quad x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

Thus,

$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n = \Theta^T x$$

with $\Theta^T = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \dots & \theta_n \end{bmatrix}$ (the transposed vector, i.e. a row vector).

Cost function:

$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2$$
$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\Theta^T x^{(i)} - y^{(i)}\right)^2$$
$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\left(\sum_{j=0}^{n} \theta_j x_j^{(i)}\right) - y^{(i)}\right)^2$$
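
A minimal sketch of this vectorized cost (assuming a NumPy design matrix `X` whose first column is all ones, so that each row plays the role of the vector $x$ defined above; the data is illustrative):

```python
import numpy as np

def cost(theta, X, y):
    """J(Theta) = 1/(2m) * sum((Theta^T x^(i) - y^(i))^2), computed as X @ theta."""
    m = len(y)
    residuals = X @ theta - y         # h(x^(i)) - y^(i) for every example
    return residuals @ residuals / (2 * m)

# Illustrative design matrix: leading column of ones (x_0 = 1), then two features
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0,  852.0, 1.0]])
y = np.array([400.0, 232.0, 178.0])

print(cost(np.zeros(X.shape[1]), X, y))   # cost of the all-zero parameter vector
```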


1) Method using Gradient descent

Gradient descent:

repeat until convergence {

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{d}{d\theta_0} J(\theta_0, \theta_1, \dots, \theta_n)$$
$$\theta_1 \leftarrow \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_0, \theta_1, \dots, \theta_n)$$
$$\theta_2 \leftarrow \theta_2 - \alpha \frac{d}{d\theta_2} J(\theta_0, \theta_1, \dots, \theta_n)$$
$$\vdots$$
$$\theta_n \leftarrow \theta_n - \alpha \frac{d}{d\theta_n} J(\theta_0, \theta_1, \dots, \theta_n)$$

}

General derivative formula of the cost function:

$$\frac{d}{d\theta_j} J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

with $j \in [0, \dots, n]$ (using the convention $x_0^{(i)} = 1$).
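
In vectorized form, the whole loop can be sketched as follows (a sketch under the same assumptions as above: `X` carries a leading column of ones, and the learning rate and iteration count are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=5000):
    """theta_j <- theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i),
    applied to every theta_j simultaneously on each iteration."""
    m, n_params = X.shape
    theta = np.zeros(n_params)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m   # all partial derivatives at once
        theta -= alpha * gradient              # simultaneous update
    return theta

# Illustrative, already-normalized features (leading column of ones included)
X = np.array([[1.0,  0.5, -0.2],
              [1.0, -0.3,  0.1],
              [1.0,  0.1,  0.4],
              [1.0, -0.4, -0.3]])
y = np.array([3.0, 1.0, 2.5, 0.5])
print(gradient_descent(X, y))
```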


Feature Normalization:

We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

$$-1 \le F_i \le 1 \quad \text{or} \quad -0.5 \le F_i \le 0.5$$

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

One technique to help with this is mean normalization.

Mean normalization:

$$F_i \leftarrow \frac{F_i - \text{average}(F_i)}{\max(F_i) - \min(F_i)}$$

Example:

$F_i$ is a housing price feature with a range of 100 to 2000 and a mean value of 1000. Then,

$$F_i \leftarrow \frac{F_i - 1000}{1900}$$
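
A minimal sketch of mean normalization applied column by column (assuming the raw feature columns are stored in a NumPy matrix `features`, without the leading column of ones; the values are illustrative):

```python
import numpy as np

def mean_normalize(features):
    """F_i <- (F_i - average(F_i)) / (max(F_i) - min(F_i)), one column at a time."""
    avg = features.mean(axis=0)
    spread = features.max(axis=0) - features.min(axis=0)   # max - min per column
    return (features - avg) / spread

# Illustrative raw features: house size and number of bedrooms
features = np.array([[2104.0, 3.0],
                     [1416.0, 2.0],
                     [ 852.0, 1.0]])
print(mean_normalize(features))   # every column now lies roughly in [-0.5, 0.5]
```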


2) Method using Normal equation

The "Normal Equation" is a method of finding the optimum theta without iteration. There is no need to do feature scaling with the normal equation.


With the normal equation, computing the inverse has complexity $\mathcal{O}(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when $n$ exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.

Normal equation:

$$\Theta = (X^T X)^{-1} X^T Y$$

with

$$\Theta = \begin{pmatrix} \theta_0 \\ \vdots \\ \theta_n \end{pmatrix}, \quad X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)} \end{bmatrix} \quad \text{and} \quad Y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix}$$
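
A minimal NumPy sketch of this formula (illustrative data; `np.linalg.pinv` is used instead of a plain inverse so the sketch also behaves when $X^T X$ happens to be non-invertible):

```python
import numpy as np

def normal_equation(X, y):
    """Theta = (X^T X)^(-1) X^T Y, computed with a pseudo-inverse for safety."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Illustrative design matrix with the leading column of ones
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 1.0]])
y = np.array([400.0, 232.0, 315.0, 178.0])
print(normal_equation(X, y))   # optimal theta, no feature scaling or iteration needed
```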


3) Proof: Normal equation formula

We want to solve:

$$\frac{\partial J}{\partial \Theta} = 0$$

We want to minimize the cost function $J$. Calculus tells us that, to do so, we have to take its derivative and set it to 0. We can then isolate $\Theta$ to find its components.

Let

$$\begin{cases} J(\Theta) = \dfrac{1}{2m}\displaystyle\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2 \\[2ex] h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n \end{cases}$$

$$h(x^{(1)}) = \theta_0 + \theta_1 x_1^{(1)} + \theta_2 x_2^{(1)} + \theta_3 x_3^{(1)} + \dots + \theta_n x_n^{(1)}$$
$$h(x^{(2)}) = \theta_0 + \theta_1 x_1^{(2)} + \theta_2 x_2^{(2)} + \theta_3 x_3^{(2)} + \dots + \theta_n x_n^{(2)}$$
$$\vdots$$
$$h(x^{(m)}) = \theta_0 + \theta_1 x_1^{(m)} + \theta_2 x_2^{(m)} + \theta_3 x_3^{(m)} + \dots + \theta_n x_n^{(m)}$$

Thus,

$$J(\Theta) = \frac{1}{2m}(X\Theta - Y)^T (X\Theta - Y) \tag{1}$$

(We use the transpose to mimic the square in the previous, non-matrix form of $J(\Theta)$.)

$$(1) \iff J(\Theta) = \frac{1}{2m}\left((X\Theta)^T - Y^T\right)\left(X\Theta - Y\right) = \frac{1}{2m}\Big[(X\Theta)^T(X\Theta) \underbrace{- (X\Theta)^T Y - Y^T(X\Theta)}_{\text{identical}} + Y^T Y\Big] = \frac{1}{2m}\Big[\underbrace{\Theta^T X^T X \Theta}_{=\,Q(\Theta)} \underbrace{-\,2(X\Theta)^T Y}_{=\,P(\Theta)} + Y^T Y\Big]$$

Notice that if we differentiate $J(\Theta)$, we only have to differentiate $Q(\Theta)$ and $P(\Theta)$, since they are the only terms that depend on $\Theta$.


Let's differentiate $P(\Theta)$:

$$P(\Theta) = -2(X\Theta)^T Y$$


$$P(\Theta) = -2\left(\begin{bmatrix} 1 & x_1^{(1)} & \dots & x_n^{(1)} \\ 1 & x_1^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & \dots & x_n^{(m)} \end{bmatrix} \begin{pmatrix} \theta_0 \\ \vdots \\ \theta_n \end{pmatrix}\right)^T \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix} = -2 \begin{bmatrix} \theta_0 + \theta_1 x_1^{(1)} + \dots + \theta_n x_n^{(1)}, & \dots, & \theta_0 + \theta_1 x_1^{(m)} + \dots + \theta_n x_n^{(m)} \end{bmatrix} \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix} = -2\left(\theta_0 + \theta_1 x_1^{(1)} + \dots + \theta_n x_n^{(1)}\right) y^{(1)} - \dots - 2\left(\theta_0 + \theta_1 x_1^{(m)} + \dots + \theta_n x_n^{(m)}\right) y^{(m)}$$


And since

$$\frac{\partial P}{\partial \Theta} = \begin{pmatrix} \frac{\partial P}{\partial \theta_0} \\ \frac{\partial P}{\partial \theta_1} \\ \vdots \\ \frac{\partial P}{\partial \theta_n} \end{pmatrix}$$

$$\begin{aligned} \frac{\partial P}{\partial \theta_0} &= -2 y^{(1)} - 2 y^{(2)} - \dots - 2 y^{(m)} \\ \frac{\partial P}{\partial \theta_1} &= -2 x_1^{(1)} y^{(1)} - 2 x_1^{(2)} y^{(2)} - \dots - 2 x_1^{(m)} y^{(m)} \\ \frac{\partial P}{\partial \theta_2} &= -2 x_2^{(1)} y^{(1)} - 2 x_2^{(2)} y^{(2)} - \dots - 2 x_2^{(m)} y^{(m)} \\ &\;\;\vdots \\ \frac{\partial P}{\partial \theta_n} &= -2 x_n^{(1)} y^{(1)} - 2 x_n^{(2)} y^{(2)} - \dots - 2 x_n^{(m)} y^{(m)} \end{aligned}$$

Thus,

$$\frac{\partial P}{\partial \Theta} = -2 \begin{bmatrix} 1 & 1 & \dots & 1 \\ x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)} \\ x_2^{(1)} & x_2^{(2)} & \dots & x_2^{(m)} \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} & x_n^{(2)} & \dots & x_n^{(m)} \end{bmatrix} \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix} = -2 X^T Y$$


Let's differentiate $Q(\Theta)$:

$$Q(\Theta) = \Theta^T X^T X \Theta$$

Let's consider $\forall i \in [1, m],\; x_0^{(i)} = 1$.

$$Q(\Theta) = \begin{pmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{pmatrix} \underbrace{\begin{bmatrix} x_0^{(1)} & x_0^{(2)} & \dots & x_0^{(m)} \\ x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)} \\ x_2^{(1)} & x_2^{(2)} & \dots & x_2^{(m)} \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{(1)} & x_n^{(2)} & \dots & x_n^{(m)} \end{bmatrix} \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)} \end{bmatrix}}_{X^2 \,=\, X^T X} \begin{pmatrix} \theta_0 \\ \vdots \\ \theta_n \end{pmatrix}$$

Let's write $X^2_{r,c} = \sum_{i=1}^{m} x_r^{(i)} x_c^{(i)}$ for each component of the matrix $X^2$, with $r$ as the row and $c$ as the column. These components are numbers, not matrices!

Thus,

$$\begin{aligned} Q(\Theta) &= \begin{pmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{pmatrix} X^2 \begin{pmatrix} \theta_0 \\ \vdots \\ \theta_n \end{pmatrix} \\ &= \begin{pmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{pmatrix} \begin{bmatrix} X^2_{0,0} & X^2_{0,1} & \dots & X^2_{0,n} \\ X^2_{1,0} & X^2_{1,1} & \dots & X^2_{1,n} \\ X^2_{2,0} & X^2_{2,1} & \dots & X^2_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ X^2_{n,0} & X^2_{n,1} & \dots & X^2_{n,n} \end{bmatrix} \begin{pmatrix} \theta_0 \\ \vdots \\ \theta_n \end{pmatrix} \\ &= \begin{pmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{pmatrix} \begin{bmatrix} X^2_{0,0}\theta_0 + X^2_{0,1}\theta_1 + \dots + X^2_{0,n}\theta_n \\ X^2_{1,0}\theta_0 + X^2_{1,1}\theta_1 + \dots + X^2_{1,n}\theta_n \\ \vdots \\ X^2_{n,0}\theta_0 + X^2_{n,1}\theta_1 + \dots + X^2_{n,n}\theta_n \end{bmatrix} \\ &= \theta_0\left(X^2_{0,0}\theta_0 + X^2_{0,1}\theta_1 + \dots + X^2_{0,n}\theta_n\right) + \theta_1\left(X^2_{1,0}\theta_0 + X^2_{1,1}\theta_1 + \dots + X^2_{1,n}\theta_n\right) + \dots + \theta_n\left(X^2_{n,0}\theta_0 + X^2_{n,1}\theta_1 + \dots + X^2_{n,n}\theta_n\right) \end{aligned}$$


We can notice 2 things:

  1. $X^2_{0,0} = \sum_{i=1}^{m} 1 \cdot 1 = m$.
  2. $\forall r, c \in [0, n],\; r \neq c:\; X^2_{r,c} = X^2_{c,r}$ (because $X^T X$ is a symmetric matrix).

Since

$$\frac{\partial Q}{\partial \Theta} = \begin{pmatrix} \frac{\partial Q}{\partial \theta_0} \\ \frac{\partial Q}{\partial \theta_1} \\ \vdots \\ \frac{\partial Q}{\partial \theta_n} \end{pmatrix}$$

$$\begin{aligned} \frac{\partial Q}{\partial \theta_0} &= \left(2m\theta_0 + X^2_{0,1}\theta_1 + \dots + X^2_{0,n}\theta_n\right) + \left(X^2_{1,0}\theta_1\right) + \dots + \left(X^2_{n,0}\theta_n\right) \\ \frac{\partial Q}{\partial \theta_1} &= \left(X^2_{0,1}\theta_0\right) + \left(X^2_{1,0}\theta_0 + 2X^2_{1,1}\theta_1 + \dots + X^2_{1,n}\theta_n\right) + X^2_{2,1}\theta_2 + \dots + X^2_{n,1}\theta_n \\ &\;\;\vdots \\ \frac{\partial Q}{\partial \theta_n} &= \left(X^2_{0,n}\theta_0\right) + \left(X^2_{1,n}\theta_1\right) + \dots + \left(X^2_{n,0}\theta_0 + X^2_{n,1}\theta_1 + \dots + 2X^2_{n,n}\theta_n\right) \end{aligned}$$


Using the fact that $X^2$ is a symmetric matrix:

$$\begin{aligned} \frac{\partial Q}{\partial \theta_0} &= 2m\theta_0 + 2X^2_{0,1}\theta_1 + 2X^2_{0,2}\theta_2 + \dots + 2X^2_{0,n}\theta_n \\ \frac{\partial Q}{\partial \theta_1} &= 2X^2_{1,0}\theta_0 + 2X^2_{1,1}\theta_1 + 2X^2_{1,2}\theta_2 + \dots + 2X^2_{1,n}\theta_n \\ &\;\;\vdots \\ \frac{\partial Q}{\partial \theta_n} &= 2X^2_{n,0}\theta_0 + 2X^2_{n,1}\theta_1 + 2X^2_{n,2}\theta_2 + \dots + 2X^2_{n,n}\theta_n \end{aligned}$$

Thus,

$$\frac{\partial Q}{\partial \Theta} = 2 \underbrace{\begin{bmatrix} m & X^2_{0,1} & \dots & X^2_{0,n} \\ X^2_{1,0} & X^2_{1,1} & \dots & X^2_{1,n} \\ X^2_{2,0} & X^2_{2,1} & \dots & X^2_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ X^2_{n,0} & X^2_{n,1} & \dots & X^2_{n,n} \end{bmatrix}}_{X^T X} \begin{pmatrix} \theta_0 \\ \vdots \\ \theta_n \end{pmatrix} = 2 X^2 \Theta = 2 X^T X \Theta$$


Finally, we can solve $\frac{\partial J}{\partial \Theta} = 0$ (the constant factor $\frac{1}{2m}$ in front of $Q(\Theta) + P(\Theta) + Y^T Y$ does not change where the derivative vanishes):

$$\frac{\partial J}{\partial \Theta} = 0 \iff \frac{\partial Q}{\partial \Theta} + \frac{\partial P}{\partial \Theta} = 0 \iff 2 X^T X \Theta - 2 X^T Y = 0 \iff X^T X \Theta = X^T Y \iff \Theta = (X^T X)^{-1} X^T Y$$
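
As a quick numerical sanity check of this result (a sketch with illustrative data), the closed-form $\Theta$ can be compared against NumPy's own least-squares solver, which minimizes the same sum of squared residuals:

```python
import numpy as np

X = np.array([[1.0, 2.0, 0.5],
              [1.0, 1.0, 1.5],
              [1.0, 3.0, 2.0],
              [1.0, 4.0, 0.0]])
y = np.array([5.0, 4.0, 9.0, 7.0])

theta_normal = np.linalg.pinv(X.T @ X) @ X.T @ y      # Theta = (X^T X)^-1 X^T Y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # reference least-squares fit

print(np.allclose(theta_normal, theta_lstsq))         # True: both minimize J(Theta)
```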