---
tags: machine-learning
---

# Regression Problem

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/1.png" height="100%" width="70%">
</div>

> These are my personal notes taken for the course [Machine Learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).

## I) Linear Regression (one variable)

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/2.png" height="100%" width="60%">
</div>

where ($x_i$, $y_i$) are values from the training set.

<ins>**Example:**</ins>
- $x_1$ = 2415
- $y_1$ = 400,000

We then plot these values on a graph. The idea is to find a straight line (blue) that best fits the plotted points (red). The blue line is created by the hypothesis function.

<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/3.png">

<ins>**Hypothesis function:**</ins> (equation of a straight line)

$$\boxed{h(x) = \theta_0 + \theta_1 x}$$

If we take random values of $\theta_0$ and $\theta_1$, we will probably get a straight line that doesn't fit our data (the plotted points) well.

<ins>**Idea:**</ins> Choose $\theta_0$ and $\theta_1$ such that $h(x^{(i)})$ is close to $y^{(i)}$. We will then have a straight line that best fits our data. In order to choose good values for $\theta_0$ and $\theta_1$, we have to use a cost function.

<ins>**Cost function:**</ins> (MSE: Mean Squared Error)

$$\boxed{J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^{2}}$$

Where the variables used are:
1. $m$ : number of examples in the training set.
2. $x^{(i)}$ : input feature (value from the training set).
3. $y^{(i)}$ : output value (value from the training set).
4. $h(x^{(i)})$ : our prediction.
5. $\Theta$ : parameters.

---

<ins>**Remark:**</ins> $J(\Theta)$ always evaluates to a single value. Your intuition may tell you that we are just computing the mean distance between 2 points (if we removed the $2$ and the square). The square makes it look like a Euclidean distance.

---

<ins>**Euclidean distance:**</ins> Let A ($x_a$, $y_a$) and B ($x_b$, $y_b$) be 2 points in a Cartesian system. Then the distance AB in this system is:

$$\boxed{AB =\sqrt{(x_b - x_a)^{2} + (y_b - y_a)^{2}}}$$

In fact, the cost function is essentially a Euclidean distance. In our case, we keep it squared to make the computation easier when taking the derivative. You may wonder why the x-coordinates do not appear. This is because their difference is equal to 0: we compute the distance between 2 points that share the same x-coordinate, so when we use the formula, the difference between the x's is 0. See the picture below.

<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/4.png">

1. Cost is **high** $\Rightarrow$ difference between $h(x_i)$ and $y_i$ is **high** $\Rightarrow$ ($x_i$, $h(x_i)$) and ($x_i$, $y_i$) are **far** from each other $\Rightarrow$ the model is performing **badly**.
2. Cost is **low** $\Rightarrow$ difference between $h(x_i)$ and $y_i$ is **low** $\Rightarrow$ ($x_i$, $h(x_i)$) and ($x_i$, $y_i$) are **near** each other $\Rightarrow$ the model is performing **well**.
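To make the cost concrete, here is a minimal NumPy sketch of $J(\theta_0, \theta_1)$ for the one-variable hypothesis. The helper names (`hypothesis`, `cost`) and the toy data are mine, not from the course:

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h(x) = theta0 + theta1 * x, applied element-wise to a vector of inputs."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """MSE cost J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(y)
    errors = hypothesis(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)

# Toy training set.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.2, 7.9])

print(cost(0.0, 2.0, x, y))  # small cost: the line y = 2x almost fits (case 2 above)
print(cost(0.0, 0.0, x, y))  # large cost: the flat line fits badly (case 1 above)
```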
---

<ins>**Remark:**</ins> You can notice that we divide by $m$. This gives us the average error per data point. The benefit of the average error is that if we have two datasets $\{x_i, y_i\}$ and $\{x_i^{'},y_i^{'}\}$ of different sizes, then we can compare the average errors but not the total errors. For example, if the second dataset is, let's say, ten times the size of the first, then we would expect the total error to be about ten times larger for the same model. On the other hand, the average error divides out the effect of the size of the dataset, and so we would expect models of similar performance to have similar average errors on different datasets.

---

<ins>**Reformulation of the idea:**</ins> Find good values of $\theta_0$ and $\theta_1$ in order to **minimize** the cost function (that is to say, make it as close to 0 as possible).

To do so, we are going to apply the <ins>**gradient descent algorithm:**</ins>

<ins>For a certain number of iterations:</ins>

$$\boxed{\Theta \leftarrow \Theta - \alpha \frac{\mathrm{d} }{\mathrm{d} \Theta}J(\Theta)}$$

1. $\leftarrow$ : assignment operation.
2. $\Theta$ : parameters.
3. $\alpha$ : the learning rate.
4. $J(\Theta)$ : cost function.

<ins>In our case, it will be:</ins>

$$\boxed{\theta_0 \leftarrow \theta_0 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1)}$$

$$\boxed{\theta_1 \leftarrow \theta_1 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1)}$$

<ins>**Cost function:**</ins>

$$\boxed{J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(h(x_i) - y_i)^{2}}$$

<ins>**Goal:**</ins>

$$\boxed{\min_{\theta_0,\theta_1} J(\theta_0, \theta_1)}$$

(a fancy way to say that we want to find the $\theta_0$ and $\theta_1$ which minimize the cost $J(\theta_0, \theta_1)$).

<ins>**Apply gradient descent:**</ins>

$$\boxed{\theta_0 \leftarrow \theta_0 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1)}$$

$$\boxed{\theta_1 \leftarrow \theta_1 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1)}$$

<ins>**Computation of derivatives:**</ins>

$$\boxed{\frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h(x_i) - y_i)} \tag{a}$$

$$\boxed{\frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h(x_i) - y_i)*x_i} \tag{b}$$

---

<ins>**Case (a):**</ins>

$$\begin{aligned} \frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1) &= \frac{1}{2m}\sum_{i=1}^{m}2*(h(x_{i}) - y_{i})^{2-1} * \frac{\mathrm{d} \overbrace{ (h(x_i) - y_i) }^{u}} {\mathrm{d} \theta_0} \\ &= \frac{1}{m}\sum_{i=1}^{m}(h(x_{i}) - y_{i})*1 \\ &= \boxed{\frac{1}{m}\sum_{i=1}^{m}(h(x_{i}) - y_{i})}\end{aligned}$$

with

$$\frac{\mathrm{d}u}{\mathrm{d} \theta_0} = \frac{\mathrm{d}(h(x_i) - y_i)}{\mathrm{d} \theta_0} = \frac{\mathrm{d} (\theta_0 + \theta_1 x_i - y_i)}{\mathrm{d} \theta_0} = 1$$

---

<ins>**Case (b):**</ins>

$$\begin{aligned} \frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1) &= \frac{1}{2m}\sum_{i=1}^{m}2*(h(x_{i}) - y_{i})^{2-1} * \frac{\mathrm{d} \overbrace{ (h(x_i) - y_i) }^{u}} {\mathrm{d} \theta_1} \\ &= \boxed{\frac{1}{m}\sum_{i=1}^{m}(h(x_{i}) - y_{i})*x_i}\end{aligned}$$

with

$$\frac{\mathrm{d}u}{\mathrm{d} \theta_1} = \frac{\mathrm{d}(h(x_i) - y_i)}{\mathrm{d} \theta_1} = \frac{\mathrm{d} (\theta_0 + \theta_1 x_i - y_i)}{\mathrm{d} \theta_1} = x_i$$
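Putting update rules (a) and (b) together, here is a minimal sketch of the gradient descent loop for the one-variable case. The names (`alpha`, `n_iters`) and the toy data are mine, not from the course:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Repeatedly apply rules (a) and (b), updating theta0 and theta1 simultaneously."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        h = theta0 + theta1 * x              # current predictions h(x_i)
        d_theta0 = np.sum(h - y) / m         # rule (a)
        d_theta1 = np.sum((h - y) * x) / m   # rule (b)
        theta0 -= alpha * d_theta0           # simultaneous update
        theta1 -= alpha * d_theta1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.2, 7.9])
print(gradient_descent(x, y))  # converges toward the best-fit line (about theta0 = 0.1, theta1 = 1.98)
```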
---

## II) Linear Regression (multiple variables)

Linear regression with multiple variables is also known as "multivariate linear regression".

We now introduce notation for equations where we can have any number of input variables. For example, consider the following training set.

<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/5.png">

Notice that indices start at 1, not 0.

1. $n$ : number of features.
2. $m$ : number of training examples.
3. $F_{i}$ : column feature.
4. $x^{(i)}$ : the $i^{th}$ training example (the exponent $i$ is used as an index).
5. ${x_{j}}^{(i)}$ : value of the $j^{th}$ feature in the $i^{th}$ training example.

**In our case:**

1. n = 4 (the four feature columns $x_1$, $x_2$, $x_3$, $x_4$)
2. m = 47
3. $F_{1}$ = $\begin{bmatrix}2104\\1416 \\ 1534 \\ 852 \\ ... \end{bmatrix}$
4. $x^{(2)}$ = $\begin{bmatrix}1416&3&2&40 \end{bmatrix}$
5. ${x_{1}}^{(2)}$ = 1416 (the $1^{st}$ feature value of the $2^{nd}$ training example)

<ins>**Multivariate hypothesis function:**</ins>

$$\boxed{h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ... + \theta_nx_n}$$

The reason we multiply the parameters $\theta$ with the features $x$ is to capture how much each feature influences the prediction. For example, when it comes to determining the sale price of a house, maybe the size matters more than the number of bedrooms. In that case, the parameter in front of the "size" feature will be larger than the one in front of the "number of bedrooms" feature.

Let $\Theta$ = $\begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \\... \\ \theta_n \end{bmatrix}$ and x = $\begin{bmatrix} 1 \\x_1\\ x_2 \\... \\ x_n \end{bmatrix}$

**Thus,**

$$\begin{aligned} h(x) & = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ... + \theta_nx_n \\ & = \boxed{\Theta^{T}x}\end{aligned}$$

with $\Theta^{T}$ = $\begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & ... & \theta_n \end{bmatrix}$ (transposed matrix)

<ins>**Cost function:**</ins>

$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^{2}$$

$$\equiv$$

$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}(\Theta^{T}x^{(i)} - y^{(i)})^{2}$$

$$\equiv$$

$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\left(\sum_{j=0}^{n}\theta_jx_{j}^{(i)}\right) - y^{(i)}\right)^{2}$$
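In code, stacking the training examples as rows of a matrix `X` (with a leading column of ones playing the role of $x_0 = 1$ for $\theta_0$) turns the hypothesis into a single matrix-vector product. A minimal NumPy sketch, with toy numbers loosely based on the table above (they are placeholders, not the course data):

```python
import numpy as np

# Toy design matrix: first column of ones for theta_0, then two features (size, bedrooms).
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
theta = np.array([0.0, 0.2, 10.0])   # arbitrary parameters

def hypothesis(X, theta):
    """h(x^(i)) = Theta^T x^(i), computed for every training example at once."""
    return X @ theta

def cost(X, y, theta):
    """J(Theta) = 1/(2m) * sum((h(x^(i)) - y^(i))^2)."""
    m = len(y)
    errors = hypothesis(X, theta) - y
    return errors @ errors / (2 * m)

print(hypothesis(X, theta))
print(cost(X, y, theta))
```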
---

### <ins>1) Method using Gradient descent</ins>

<ins>**Gradient descent:**</ins>

repeat until convergence {

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1, ...,\theta_n)$$
$$\theta_1 \leftarrow \theta_1 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1, ...,\theta_n)$$
$$\theta_2 \leftarrow \theta_2 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_2}J(\theta_0, \theta_1, ...,\theta_n)$$
$$...$$
$$\theta_n \leftarrow \theta_n - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_n}J(\theta_0, \theta_1, ...,\theta_n)$$

}

<ins>$\rightarrow$ General derivative formula of the cost function:</ins>

$$\boxed{\frac{\mathrm{d} }{\mathrm{d} \theta_j}J(\theta_0, \theta_1, ...,\theta_n) = \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})*x_j^{(i)}}$$

with j $\in$ [0, ..., n] (a vectorized sketch of this update follows the normalization example below).

---

<ins>**Feature Normalization:**</ins>

We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven. The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

$$-1 \leq F_{i} \leq 1$$
$$or$$
$$-0.5 \leq F_{i} \leq 0.5$$

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few. One technique to help with this is **mean normalization**.

<ins>**Mean normalization:**</ins>

$$F_{i} \leftarrow \frac{F_{i} - average(F_{i})}{max(F_{i}) - min(F_{i})}$$

<ins>**Example:**</ins> $F_{i}$ is a housing-price feature with values ranging from 100 to 2000 and a mean of 1000. Then $F_{i} \leftarrow \frac{F_{i} - 1000}{1900}$.
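Here is a minimal sketch (NumPy, helper names of my own) of mean normalization followed by the vectorized gradient descent update $\Theta \leftarrow \Theta - \frac{\alpha}{m}X^{T}(X\Theta - Y)$, which is just the general derivative formula above applied to all $\theta_j$ at once:

```python
import numpy as np

def mean_normalize(F):
    """F <- (F - average(F)) / (max(F) - min(F)), applied column by column."""
    return (F - F.mean(axis=0)) / (F.max(axis=0) - F.min(axis=0))

def gradient_descent(X, y, alpha=0.5, n_iters=2000):
    """Vectorized update: theta <- theta - (alpha/m) * X^T (X theta - y)."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m   # all partial derivatives at once
        theta -= alpha * gradient
    return theta

# Raw features (size, bedrooms): normalize them, then add the column of ones.
F = np.array([[2104.0, 3.0], [1416.0, 2.0], [1534.0, 3.0], [852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
X = np.column_stack([np.ones(len(y)), mean_normalize(F)])
print(gradient_descent(X, y))
```

Without the normalization step, the raw "size" column (values in the thousands) would force a much smaller learning rate and many more iterations, which is exactly the oscillation problem described above.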
---

### <ins>2) Method using Normal equation</ins>

The "normal equation" is a method of finding the optimal $\Theta$ without iteration. There is no need to do feature scaling with the normal equation.

<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/6.png">

With the normal equation, computing the inverse has complexity $\mathcal{O}(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from the normal equation to an iterative process.

<ins>**Normal equation:**</ins>

$$\boxed{\Theta = (X^{T}X)^{-1}X^{T}Y}$$

<br/>

with $\Theta = \begin{pmatrix}\theta_0\\ .\\ .\\ .\\ \theta_n \end{pmatrix}$, $X = \begin{bmatrix} 1 & x_1^{(1)}& x_2^{(1)} & ... & x_n^{(1)} \\ 1 & x_1^{(2)}& x_2^{(2)} & ... & x_n^{(2)} \\ . & . & . & ... & . \\ . & . & . & ... & . \\ . & . & . & ... & . \\ 1 & x_1^{(m)} & x_2^{(m)} & ... & x_n^{(m)} \\ \end{bmatrix}$ and $Y = \begin{pmatrix}y^{(1)}\\ .\\ .\\ .\\ y^{(m)} \end{pmatrix}$
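As a sketch (same toy data as before, which is mine and not the course's), the normal equation is a one-liner in NumPy. Note that `np.linalg.solve` on the system $X^{T}X\Theta = X^{T}Y$ is generally preferred over forming the explicit inverse, since it is more numerically stable:

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy design matrix: column of ones, then the raw (unscaled) features.
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

theta = normal_equation(X, y)
print(theta)
print(X @ theta)   # predictions at the training points; no feature scaling was needed
```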
---

### 3) <ins>Proof: Normal equation formula</ins>

We want to solve:

$$\frac{\partial J}{\partial \Theta} = 0$$

We want to minimize the cost function J. Calculus tells us that to do so, we have to take its derivative and set it to 0. We can then isolate $\Theta$ to find its components.

Let,

$$\left\{ \begin{array}{ll} J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^{2}\\ h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ... + \theta_nx_n \end{array} \right.$$

$\rightarrow$

$$h(x^{(1)}) = \theta_0 + \theta_1x_1^{(1)} + \theta_2x_2^{(1)} + \theta_3x_3^{(1)} + ... + \theta_nx_n^{(1)}$$
$$h(x^{(2)}) = \theta_0 + \theta_1x_1^{(2)} + \theta_2x_2^{(2)} + \theta_3x_3^{(2)} + ... + \theta_nx_n^{(2)}$$
$$...$$
$$h(x^{(m)}) = \theta_0 + \theta_1x_1^{(m)} + \theta_2x_2^{(m)} + \theta_3x_3^{(m)} + ... + \theta_nx_n^{(m)}$$

<ins>**Thus,**</ins>

$$\boxed{J(\Theta) = \frac{1}{2m}(X\Theta - Y)^{T}(X\Theta - Y)}\tag{1}$$

(We use the transpose to mimic the square in the previous, non-matrix form of J($\Theta$).)

\begin{aligned} (1) \Leftrightarrow J(\Theta) & = \frac{1}{2m}((X\Theta)^{T} - Y^{T})(X\Theta - Y) \\ & = \frac{1}{2m}[(X\Theta)^{T}(X\Theta) - \underbrace{(X\Theta)^{T}Y - Y^{T}(X\Theta)}_{identical} + Y^{T}Y] \\ & = \frac{1}{2m}[\underbrace{\Theta^{T}X^{T}X\Theta}_{= Q(\Theta)} - \underbrace{2(X\Theta)^{T}Y}_{= P(\Theta)} + Y^{T}Y] \end{aligned}

(The two cross terms are identical because each is a scalar and one is the transpose of the other.)

Notice that if we differentiate $J(\Theta)$, we only have to differentiate $Q(\Theta)$ and $P(\Theta)$ since they are the only terms that depend on $\Theta$.

---

<ins>**Let's differentiate P($\Theta$):**</ins>

$$\boxed{P(\Theta)= 2(X\Theta)^{T}Y}$$

<br/>

\begin{align*} P(\Theta) & = 2 \left( \begin{bmatrix} 1 & x_1^{(1)}& x_2^{(1)} & ... & x_n^{(1)} \\ 1 & x_1^{(2)}& x_2^{(2)} & ... & x_n^{(2)} \\ . & . & . & ... & . \\ . & . & . & ... & . \\ . & . & . & ... & . \\ 1 & x_1^{(m)} & x_2^{(m)} & ... & x_n^{(m)} \\ \end{bmatrix} \begin{pmatrix}\theta_0 \\. \\. \\. \\\theta_n \end{pmatrix} \right)^{T} \begin{pmatrix}y^{(1)} \\. \\. \\. \\y^{(m)} \end{pmatrix} \\ \\ & = 2 \begin{bmatrix} \theta_0 + \theta_1x_1^{(1)} + ... + \theta_nx_n^{(1)} ,& \theta_0 + \theta_1x_1^{(2)} + ... + \theta_nx_n^{(2)} ,& ... & ,\theta_0 + \theta_1x_1^{(m)} + ... + \theta_nx_n^{(m)} \end{bmatrix} \begin{pmatrix}y^{(1)} \\. \\. \\. \\ y^{(m)} \end{pmatrix} \\ \\ & = 2 (\theta_0 + \theta_1x_1^{(1)} + ... + \theta_nx_n^{(1)})y^{(1)} + ............ + 2(\theta_0 + \theta_1x_1^{(m)} + ... + \theta_nx_n^{(m)})y^{(m)} \end{align*}

<br/>

<ins>**And since,**</ins>

$$\frac{\partial P }{\partial \Theta} = \begin{pmatrix}\frac{\partial P }{\partial \theta_0} \\\frac{\partial P }{\partial \theta_1} \\. \\. \\ \frac{\partial P }{\partial \theta_n} \end{pmatrix}$$

\begin{align*} \rightarrow & \\ & \frac{\partial P }{\partial \theta_0} = 2y^{(1)} + 2y^{(2)} + ... + 2y^{(m)} \\ & \frac{\partial P }{\partial \theta_1} = 2x_1^{(1)}y^{(1)} + 2x_1^{(2)}y^{(2)} + ... + 2x_1^{(m)}y^{(m)} \\ & \frac{\partial P }{\partial \theta_2} = 2x_2^{(1)}y^{(1)} + 2x_2^{(2)}y^{(2)} + ... + 2x_2^{(m)}y^{(m)} \\ & . \\ & . \\ & . \\ & \frac{\partial P }{\partial \theta_n} = 2x_n^{(1)}y^{(1)} + 2x_n^{(2)}y^{(2)} + ... + 2x_n^{(m)}y^{(m)} \\ \end{align*}

<ins>**Thus,**</ins>

$$ \frac{\partial P }{\partial \Theta} = 2 \begin{bmatrix} 1 & 1 & ... & 1 \\ x_1^{(1)}& x_1^{(2)} & ... & x_1^{(m)} \\ x_2^{(1)}& x_2^{(2)} & ... & x_2^{(m)} \\ . & . & ... & . \\ . & . & ... & . \\ . & . & . & .\\ x_n^{(1)} & x_n^{(2)} & ... & x_n^{(m)} \\ \end{bmatrix} \begin{pmatrix}y^{(1)} \\. \\. \\. \\ y^{(m)} \end{pmatrix} = 2X^{T}Y$$

---

<ins>**Let's differentiate Q($\Theta$):**</ins>

$$\boxed{Q(\Theta) = \Theta^{T}X^{T}X\Theta}$$

Let's consider $\forall i \in [1, m], x_0^{(i)} = 1$.

\begin{align*} Q(\Theta) & = \begin{pmatrix} \theta_0 & \theta_1 & ... & \theta_n \end{pmatrix} \underbrace{ \begin{bmatrix} x_0^{(1)} & x_0^{(2)} & ... & x_0^{(m)} \\ x_1^{(1)}& x_1^{(2)} & ... & x_1^{(m)} \\ x_2^{(1)}& x_2^{(2)} & ... & x_2^{(m)} \\ . & . & ... & . \\ . & . & ... & . \\ . & . & . & .\\ x_n^{(1)} & x_n^{(2)} & ... & x_n^{(m)} \\ \end{bmatrix} \begin{bmatrix} x_0^{(1)} & x_1^{(1)}& x_2^{(1)} & ... & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)}& x_2^{(2)} & ... & x_n^{(2)} \\ . & . & . & ... & . \\ . & . & . & ... & . \\ . & . & . & ... & . \\ x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & ... & x_n^{(m)} \\ \end{bmatrix} }_{X^{2}=X^{T}X} \begin{pmatrix} \theta_0 \\ . \\ . \\ . \\ \theta_n \end{pmatrix} \end{align*}

Let's write $X^{2}_{r,c} = \sum_{i=1}^{m}x_r^{(i)}x_c^{(i)}$ for each component of the matrix $X^{2} = X^{T}X$ (a shorthand, not a true square), with r as row and c as column. They are numbers, not matrices!

<ins>**Thus,**</ins>

\begin{align*} Q(\Theta) & = \begin{pmatrix} \theta_0 & \theta_1 & ... & \theta_n \end{pmatrix} X^{2} \begin{pmatrix} \theta_0 \\ . \\ . \\ . \\ \theta_n \end{pmatrix} \\ \\ & = \begin{pmatrix} \theta_0 & \theta_1 & ... & \theta_n \end{pmatrix} \begin{bmatrix} X^{2}_{0,0} & X^{2}_{0,1} & ... & X^{2}_{0,n} \\ X^{2}_{1,0} & X^{2}_{1,1} & ... & X^{2}_{1,n} \\ X^{2}_{2,0} & X^{2}_{2,1} & ... & X^{2}_{2,n} \\ . & . & ... & . \\ . & . & ... & . \\ . & . & . & .\\ X^{2}_{n,0} & X^{2}_{n,1} & ... & X^{2}_{n,n} \end{bmatrix} \begin{pmatrix} \theta_0 \\ . \\ . \\ . \\ \theta_n \end{pmatrix} \\ \\ & = \begin{pmatrix} \theta_0 & \theta_1 & ... & \theta_n \end{pmatrix} \begin{bmatrix} X^{2}_{0,0}\theta_0 + X^{2}_{0,1}\theta_1 + ... + X^{2}_{0,n}\theta_n \\ X^{2}_{1,0}\theta_0 + X^{2}_{1,1}\theta_1 + ... + X^{2}_{1,n}\theta_n \\ ... \\ ... \\ ... \\ X^{2}_{n,0}\theta_0 + X^{2}_{n,1}\theta_1 + ... + X^{2}_{n,n}\theta_n \end{bmatrix} \\\\ & = \theta_0(X^{2}_{0,0}\theta_0 + X^{2}_{0,1}\theta_1 + ... + X^{2}_{0,n}\theta_n) + \theta_1(X^{2}_{1,0}\theta_0 + X^{2}_{1,1}\theta_1 + ... + X^{2}_{1,n}\theta_n) \\ & \hspace{0.45cm} + ... + \theta_n(X^{2}_{n,0}\theta_0 + X^{2}_{n,1}\theta_1 + ... + X^{2}_{n,n}\theta_n) \tag{1} \end{align*}

---

We can notice 2 things:

1. $X^{2}_{0,0} = \sum_{i=1}^{m} 1 = m$.
2. $\forall r,c \in [0, n],\ X^{2}_{r, c} = X^{2}_{c, r}$ (because $X^{T}X$ is a symmetric matrix).

<ins>**Since,**</ins>

$$\frac{\partial Q }{\partial \Theta} = \begin{pmatrix}\frac{\partial Q }{\partial \theta_0} \\\frac{\partial Q }{\partial \theta_1} \\. \\. \\ \frac{\partial Q }{\partial \theta_n} \end{pmatrix}$$

\begin{align*} \rightarrow & \\ & \frac{\partial Q }{\partial \theta_0} = (2X^{2}_{0,0}\theta_0 + X^{2}_{0,1}\theta_1 + ... + X^{2}_{0,n}\theta_n) + (X^{2}_{1,0}\theta_1) + ... + (X^{2}_{n,0}\theta_n) \\ & \frac{\partial Q }{\partial \theta_1} = (X^{2}_{0,1}\theta_0) + (X^{2}_{1,0}\theta_0 + 2X^{2}_{1,1}\theta_1 + ... + X^{2}_{1,n}\theta_n) + X^{2}_{2,1}\theta_2 + ... + X^{2}_{n,1}\theta_n \\ & . \\ & . \\ & . \\ & \frac{\partial Q }{\partial \theta_n} = (X^{2}_{0,n}\theta_0) + (X^{2}_{1,n}\theta_1) + ... + (X^{2}_{n,0}\theta_0 + X^{2}_{n,1}\theta_1 + ... + 2X^{2}_{n,n}\theta_n) \end{align*}

<br/>

Using the fact that $X^{2}$ is a symmetric matrix:

\begin{align*} \rightarrow & \\ & \frac{\partial Q }{\partial \theta_0} = 2X^{2}_{0,0}\theta_0 + 2X^{2}_{0,1}\theta_1 + 2X^{2}_{0,2}\theta_2 +... + 2X^{2}_{0,n}\theta_n \\ & \frac{\partial Q }{\partial \theta_1} = 2X^{2}_{1,0}\theta_0 + 2X^{2}_{1,1}\theta_1 + 2X^{2}_{1,2}\theta_2 + ... + 2X^{2}_{1,n}\theta_n \\ & . \\ & . \\ & . \\ & \frac{\partial Q }{\partial \theta_n} = 2X^{2}_{n,0}\theta_0 + 2X^{2}_{n,1}\theta_1 + 2X^{2}_{n,2}\theta_2 + ... + 2X^{2}_{n,n}\theta_n \end{align*}

<ins>**Thus,**</ins>

$$\frac{\partial Q }{\partial \Theta} = 2 \underbrace{ \begin{bmatrix} X^{2}_{0,0} & X^{2}_{0,1} & ... & X^{2}_{0,n} \\ X^{2}_{1,0} & X^{2}_{1,1} & ... & X^{2}_{1,n} \\ X^{2}_{2,0} & X^{2}_{2,1} & ... & X^{2}_{2,n} \\ . & . & ... & . \\ . & . & ... & . \\ . & . & . & .\\ X^{2}_{n,0} & X^{2}_{n,1} & ... & X^{2}_{n,n} \end{bmatrix} }_{X^{T}X} \begin{pmatrix} \theta_0 \\ . \\ . \\ . \\ \theta_n \end{pmatrix} = 2X^{2}\Theta = \boxed{2X^{T}X\Theta}$$

---

Finally, we can solve $\frac{\partial J}{\partial \Theta} = 0$ (the constant factor $\frac{1}{2m}$ and the term $Y^{T}Y$ play no role here).

<br/>

\begin{aligned} \frac{\partial J}{\partial \Theta} = 0 \Leftrightarrow & \frac{\partial Q}{\partial \Theta} - \frac{\partial P}{\partial \Theta} = 0 \\ \Leftrightarrow & 2X^{T}X\Theta - 2X^{T}Y = 0 \\ \Leftrightarrow & X^{T}X\Theta - X^{T}Y = 0 \\ \Leftrightarrow & X^{T}X\Theta = X^{T}Y \\ \Leftrightarrow & \fbox{$\Theta = (X^{T}X)^{-1}X^{T}Y$}\end{aligned}

(The last step assumes $X^{T}X$ is invertible.)
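As a quick numerical sanity check of this result (NumPy, random toy data of my own), the gradient $\frac{\partial Q}{\partial \Theta} - \frac{\partial P}{\partial \Theta} = 2X^{T}X\Theta - 2X^{T}Y$ should vanish at the closed-form solution, and that solution should match a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # m examples, n features + bias column
y = rng.normal(size=m)

theta = np.linalg.solve(X.T @ X, X.T @ y)        # normal equation solution
gradient = 2 * X.T @ X @ theta - 2 * X.T @ y     # dQ/dTheta - dP/dTheta at that solution

print(np.allclose(gradient, 0.0))                                # True (up to round-off)
print(np.allclose(theta, np.linalg.lstsq(X, y, rcond=None)[0]))  # matches least squares
```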