---
tags: machine-learning
---

# Regression Problem

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/1.png" height="100%" width="70%">
</div>

> These are my personal notes taken for the course [Machine Learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).

## I) Linear Regression (one variable)

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/2.png" height="100%" width="60%">
</div>

where ($x_i$, $y_i$) are values from the training set.

<ins>**Example:**</ins>
- $x_1$ = 2415
- $y_1$ = 400,000

We then plot these values on a graph. The idea is to find a straight line (blue) that best fits the plotted points (red). The blue line is created by the hypothesis function.

<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/3.png">

<ins>**Hypothesis function:**</ins> (equation of a straight line)

$$\boxed{h(x) = \theta_0 + \theta_1 x}$$

If we take random values of $\theta_0$ and $\theta_1$, we will probably get a straight line that doesn't fit our data (the plotted points) well.

<ins>**Idea:**</ins> Choose $\theta_0$ and $\theta_1$ such that $h(x^{(i)})$ is close to $y^{(i)}$. We will then have a straight line that best fits our data. In order to choose good values for $\theta_0$ and $\theta_1$, we have to use a cost function.

<ins>**Cost function:**</ins> (MSE: Mean Squared Error)

$$\boxed{J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^{2}}$$

Where the variables used are:
1. $m$ : number of examples in the training set.
2. $x^{(i)}$ : input feature (value from the training set).
3. $y^{(i)}$ : output value (value from the training set).
4. $h(x^{(i)})$ : our prediction.
5. $\Theta$ : parameters.

---

<ins>**Remark:**</ins> $J(\Theta)$ always evaluates to a single value. Your intuition may tell you that we are just computing the mean distance between 2 points (if we removed the $2$ and the square). The square makes it look like a Euclidean distance.

---

<ins>**Euclidean distance:**</ins> Let A ($x_a$, $y_a$) and B ($x_b$, $y_b$) be 2 points in a Cartesian system. Then the distance AB in this system is:

$$\boxed{AB =\sqrt{(x_b - x_a)^{2} + (y_b - y_a)^{2}}}$$

In fact, the cost function is essentially a Euclidean distance. In our case, we keep it squared to make the computation easier when taking the derivative. You may wonder why the x-coordinates do not appear. This is because their difference is equal to 0: we compute the distance between 2 points that share the same x-coordinate, so when we use the formula, the difference between the x's is 0. See the picture below.

<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/4.png">

1. Cost is **high** $\Rightarrow$ difference between $h(x_i)$ and $y_i$ is **high** $\Rightarrow$ ($x_i$, $h(x_i)$) and ($x_i$, $y_i$) are **far** from each other $\Rightarrow$ the model is performing **badly**.
2. Cost is **low** $\Rightarrow$ difference between $h(x_i)$ and $y_i$ is **low** $\Rightarrow$ ($x_i$, $h(x_i)$) and ($x_i$, $y_i$) are **near** each other $\Rightarrow$ the model is performing **well**.
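To make the cost concrete, here is a minimal NumPy sketch of $J(\theta_0, \theta_1)$ for the one-variable hypothesis. The helper names (`hypothesis`, `cost`) and the toy data are mine, not from the course:

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h(x) = theta0 + theta1 * x, applied element-wise to a vector of inputs."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """MSE cost J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(y)
    errors = hypothesis(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)

# Toy training set.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.2, 7.9])

print(cost(0.0, 2.0, x, y))  # small cost: the line y = 2x almost fits (case 2 above)
print(cost(0.0, 0.0, x, y))  # large cost: the flat line fits badly (case 1 above)
```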
---

<ins>**Remark:**</ins> You can notice that we divide by $m$. This gives us the average error per data point. The benefit of the average error is that if we have two datasets $\{x_i, y_i\}$ and $\{x_i^{'},y_i^{'}\}$ of different sizes, then we can compare the average errors but not the total errors. For example, if the second dataset is, let's say, ten times the size of the first, then we would expect the total error to be about ten times larger for the same model. On the other hand, the average error divides out the effect of the size of the dataset, and so we would expect models of similar performance to have similar average errors on different datasets.

---

<ins>**Reformulation of the idea:**</ins> Find good values of $\theta_0$ and $\theta_1$ in order to **minimize** the cost function (that is to say, make it as close to 0 as possible).

To do so, we are going to apply the <ins>**gradient descent algorithm:**</ins>

<ins>For a certain number of iterations:</ins>

$$\boxed{\Theta \leftarrow \Theta - \alpha \frac{\mathrm{d} }{\mathrm{d} \Theta}J(\Theta)}$$

1. $\leftarrow$ : assignment operation.
2. $\Theta$ : parameters.
3. $\alpha$ : the learning rate.
4. $J(\Theta)$ : cost function.

<ins>In our case, it will be:</ins>

$$\boxed{\theta_0 \leftarrow \theta_0 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1)}$$

$$\boxed{\theta_1 \leftarrow \theta_1 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1)}$$

<ins>**Cost function:**</ins>

$$\boxed{J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(h(x_i) - y_i)^{2}}$$

<ins>**Goal:**</ins>

$$\boxed{\min_{\theta_0,\theta_1} J(\theta_0, \theta_1)}$$

(a fancy way to say that we want to find the $\theta_0$ and $\theta_1$ which minimize the cost $J(\theta_0, \theta_1)$).

<ins>**Apply gradient descent:**</ins>

$$\boxed{\theta_0 \leftarrow \theta_0 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1)}$$

$$\boxed{\theta_1 \leftarrow \theta_1 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1)}$$

<ins>**Computation of derivatives:**</ins>

$$\boxed{\frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h(x_i) - y_i)} \tag{a}$$

$$\boxed{\frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h(x_i) - y_i)*x_i} \tag{b}$$

---

<ins>**Case (a):**</ins>

$$\begin{aligned} \frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1) &= \frac{1}{2m}\sum_{i=1}^{m}2*(h(x_{i}) - y_{i})^{2-1} * \frac{\mathrm{d} \overbrace{ (h(x_i) - y_i) }^{u}} {\mathrm{d} \theta_0} \\ &= \frac{1}{m}\sum_{i=1}^{m}(h(x_{i}) - y_{i})*1 \\ &= \boxed{\frac{1}{m}\sum_{i=1}^{m}(h(x_{i}) - y_{i})}\end{aligned}$$

with

$$\frac{\mathrm{d}u}{\mathrm{d} \theta_0} = \frac{\mathrm{d}(h(x_i) - y_i)}{\mathrm{d} \theta_0} = \frac{\mathrm{d} (\theta_0 + \theta_1 x_i - y_i)}{\mathrm{d} \theta_0} = 1$$

---

<ins>**Case (b):**</ins>

$$\begin{aligned} \frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1) &= \frac{1}{2m}\sum_{i=1}^{m}2*(h(x_{i}) - y_{i})^{2-1} * \frac{\mathrm{d} \overbrace{ (h(x_i) - y_i) }^{u}} {\mathrm{d} \theta_1} \\ &= \boxed{\frac{1}{m}\sum_{i=1}^{m}(h(x_{i}) - y_{i})*x_i}\end{aligned}$$

with

$$\frac{\mathrm{d}u}{\mathrm{d} \theta_1} = \frac{\mathrm{d}(h(x_i) - y_i)}{\mathrm{d} \theta_1} = \frac{\mathrm{d} (\theta_0 + \theta_1 x_i - y_i)}{\mathrm{d} \theta_1} = x_i$$
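Putting update rules (a) and (b) together, here is a minimal sketch of the gradient descent loop for the one-variable case. The names (`alpha`, `n_iters`) and the toy data are mine, not from the course:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Repeatedly apply rules (a) and (b), updating theta0 and theta1 simultaneously."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        h = theta0 + theta1 * x              # current predictions h(x_i)
        d_theta0 = np.sum(h - y) / m         # rule (a)
        d_theta1 = np.sum((h - y) * x) / m   # rule (b)
        theta0 -= alpha * d_theta0           # simultaneous update
        theta1 -= alpha * d_theta1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.2, 7.9])
print(gradient_descent(x, y))  # converges toward the best-fit line (about theta0 = 0.1, theta1 = 1.98)
```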
---

## II) Linear Regression (multiple variables)

Linear regression with multiple variables is also known as "multivariate linear regression".

We now introduce notation for equations where we can have any number of input variables. For example, consider the following training set.

<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/5.png">

Notice that indices start at 1, not 0.

1. $n$ : number of features.
2. $m$ : number of training examples.
3. $F_{i}$ : column feature.
4. $x^{(i)}$ : the $i^{th}$ training example (the exponent $i$ is used as an index).
5. ${x_{j}}^{(i)}$ : value of the $j^{th}$ feature in the $i^{th}$ training example.

**In our case:**

1. n = 4 (the four feature columns $x_1$, $x_2$, $x_3$, $x_4$)
2. m = 47
3. $F_{1}$ = $\begin{bmatrix}2104\\1416 \\ 1534 \\ 852 \\ ... \end{bmatrix}$
4. $x^{(2)}$ = $\begin{bmatrix}1416&3&2&40 \end{bmatrix}$
5. ${x_{1}}^{(2)}$ = 1416 (the $1^{st}$ feature value of the $2^{nd}$ training example)

<ins>**Multivariate hypothesis function:**</ins>

$$\boxed{h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ... + \theta_nx_n}$$

The reason we multiply the parameters $\theta$ with the features $x$ is to capture how much each feature influences the prediction. For example, when it comes to determining the sale price of a house, maybe the size matters more than the number of bedrooms. In that case, the parameter in front of the "size" feature will be larger than the one in front of the "number of bedrooms" feature.

Let $\Theta$ = $\begin{bmatrix}\theta_0\\\theta_1 \\ \theta_2 \\... \\ \theta_n \end{bmatrix}$ and x = $\begin{bmatrix} 1 \\x_1\\ x_2 \\... \\ x_n \end{bmatrix}$

**Thus,**

$$\begin{aligned} h(x) & = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ... + \theta_nx_n \\ & = \boxed{\Theta^{T}x}\end{aligned}$$

with $\Theta^{T}$ = $\begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & ... & \theta_n \end{bmatrix}$ (transposed matrix)

<ins>**Cost function:**</ins>

$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^{2}$$

$$\equiv$$

$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}(\Theta^{T}x^{(i)} - y^{(i)})^{2}$$

$$\equiv$$

$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\left(\sum_{j=0}^{n}\theta_jx_{j}^{(i)}\right) - y^{(i)}\right)^{2}$$
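In code, stacking the training examples as rows of a matrix `X` (with a leading column of ones playing the role of $x_0 = 1$ for $\theta_0$) turns the hypothesis into a single matrix-vector product. A minimal NumPy sketch, with toy numbers loosely based on the table above (they are placeholders, not the course data):

```python
import numpy as np

# Toy design matrix: first column of ones for theta_0, then two features (size, bedrooms).
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
theta = np.array([0.0, 0.2, 10.0])   # arbitrary parameters

def hypothesis(X, theta):
    """h(x^(i)) = Theta^T x^(i), computed for every training example at once."""
    return X @ theta

def cost(X, y, theta):
    """J(Theta) = 1/(2m) * sum((h(x^(i)) - y^(i))^2)."""
    m = len(y)
    errors = hypothesis(X, theta) - y
    return errors @ errors / (2 * m)

print(hypothesis(X, theta))
print(cost(X, y, theta))
```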
---

### <ins>1) Method using Gradient descent</ins>

<ins>**Gradient descent:**</ins>

repeat until convergence {

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_0}J(\theta_0, \theta_1, ...,\theta_n)$$
$$\theta_1 \leftarrow \theta_1 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_1}J(\theta_0, \theta_1, ...,\theta_n)$$
$$\theta_2 \leftarrow \theta_2 - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_2}J(\theta_0, \theta_1, ...,\theta_n)$$
$$...$$
$$\theta_n \leftarrow \theta_n - \alpha \frac{\mathrm{d} }{\mathrm{d} \theta_n}J(\theta_0, \theta_1, ...,\theta_n)$$

}

<ins>$\rightarrow$ General derivative formula of the cost function:</ins>

$$\boxed{\frac{\mathrm{d} }{\mathrm{d} \theta_j}J(\theta_0, \theta_1, ...,\theta_n) = \frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})*x_j^{(i)}}$$

with j $\in$ [0, ..., n] (a vectorized sketch of this update follows the normalization example below).

---

<ins>**Feature Normalization:**</ins>

We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven. The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

$$-1 \leq F_{i} \leq 1$$
$$or$$
$$-0.5 \leq F_{i} \leq 0.5$$

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few. One technique to help with this is **mean normalization**.

<ins>**Mean normalization:**</ins>

$$F_{i} \leftarrow \frac{F_{i} - average(F_{i})}{max(F_{i}) - min(F_{i})}$$

<ins>**Example:**</ins> $F_{i}$ is a housing-price feature with values ranging from 100 to 2000 and a mean of 1000. Then $F_{i} \leftarrow \frac{F_{i} - 1000}{1900}$.
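Here is a minimal sketch (NumPy, helper names of my own) of mean normalization followed by the vectorized gradient descent update $\Theta \leftarrow \Theta - \frac{\alpha}{m}X^{T}(X\Theta - Y)$, which is just the general derivative formula above applied to all $\theta_j$ at once:

```python
import numpy as np

def mean_normalize(F):
    """F <- (F - average(F)) / (max(F) - min(F)), applied column by column."""
    return (F - F.mean(axis=0)) / (F.max(axis=0) - F.min(axis=0))

def gradient_descent(X, y, alpha=0.5, n_iters=2000):
    """Vectorized update: theta <- theta - (alpha/m) * X^T (X theta - y)."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m   # all partial derivatives at once
        theta -= alpha * gradient
    return theta

# Raw features (size, bedrooms): normalize them, then add the column of ones.
F = np.array([[2104.0, 3.0], [1416.0, 2.0], [1534.0, 3.0], [852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
X = np.column_stack([np.ones(len(y)), mean_normalize(F)])
print(gradient_descent(X, y))
```

Without the normalization step, the raw "size" column (values in the thousands) would force a much smaller learning rate and many more iterations, which is exactly the oscillation problem described above.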
---

### <ins>2) Method using Normal equation</ins>

The "normal equation" is a method of finding the optimal $\Theta$ without iteration. There is no need to do feature scaling with the normal equation.

<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/regression-problem/6.png">

With the normal equation, computing the inverse has complexity $\mathcal{O}(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from the normal equation to an iterative process.

<ins>**Normal equation:**</ins>

$$\boxed{\Theta = (X^{T}X)^{-1}X^{T}Y}$$

<br/>

with $\Theta = \begin{pmatrix}\theta_0\\ .\\ .\\ .\\ \theta_n \end{pmatrix}$, $X = \begin{bmatrix} 1 & x_1^{(1)}& x_2^{(1)} & ... & x_n^{(1)} \\ 1 & x_1^{(2)}& x_2^{(2)} & ... & x_n^{(2)} \\ . & . & . & ... & . \\ . & . & . & ... & . \\ . & . & . & ... & . \\ 1 & x_1^{(m)} & x_2^{(m)} & ... & x_n^{(m)} \\ \end{bmatrix}$ and $Y = \begin{pmatrix}y^{(1)}\\ .\\ .\\ .\\ y^{(m)} \end{pmatrix}$
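As a sketch (same toy data as before, which is mine and not the course's), the normal equation is a one-liner in NumPy. Note that `np.linalg.solve` on the system $X^{T}X\Theta = X^{T}Y$ is generally preferred over forming the explicit inverse, since it is more numerically stable:

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy design matrix: column of ones, then the raw (unscaled) features.
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

theta = normal_equation(X, y)
print(theta)
print(X @ theta)   # predictions at the training points; no feature scaling was needed
```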
---

### 3) <ins>Proof: Normal equation formula</ins>

We want to solve:

$$\frac{\partial J}{\partial \Theta} = 0$$

We want to minimize the cost function J. Calculus tells us that to do so, we have to take its derivative and set it to 0. We can then isolate $\Theta$ to find its components.

Let,

$$\left\{ \begin{array}{ll} J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^{2}\\ h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ... + \theta_nx_n \end{array} \right.$$

$\rightarrow$

$$h(x^{(1)}) = \theta_0 + \theta_1x_1^{(1)} + \theta_2x_2^{(1)} + \theta_3x_3^{(1)} + ... + \theta_nx_n^{(1)}$$
$$h(x^{(2)}) = \theta_0 + \theta_1x_1^{(2)} + \theta_2x_2^{(2)} + \theta_3x_3^{(2)} + ... + \theta_nx_n^{(2)}$$
$$...$$
$$h(x^{(m)}) = \theta_0 + \theta_1x_1^{(m)} + \theta_2x_2^{(m)} + \theta_3x_3^{(m)} + ... + \theta_nx_n^{(m)}$$

<ins>**Thus,**</ins>

$$\boxed{J(\Theta) = \frac{1}{2m}(X\Theta - Y)^{T}(X\Theta - Y)}\tag{1}$$

(We use the transpose to mimic the square in the previous, non-matrix form of J($\Theta$).)

\begin{aligned} (1) \Leftrightarrow J(\Theta) & = \frac{1}{2m}((X\Theta)^{T} - Y^{T})(X\Theta - Y) \\ & = \frac{1}{2m}[(X\Theta)^{T}(X\Theta) - \underbrace{(X\Theta)^{T}Y - Y^{T}(X\Theta)}_{identical} + Y^{T}Y] \\ & = \frac{1}{2m}[\underbrace{\Theta^{T}X^{T}X\Theta}_{= Q(\Theta)} - \underbrace{2(X\Theta)^{T}Y}_{= P(\Theta)} + Y^{T}Y] \end{aligned}

(The two cross terms are identical because each is a scalar and one is the transpose of the other.)

Notice that if we differentiate $J(\Theta)$, we only have to differentiate $Q(\Theta)$ and $P(\Theta)$ since they are the only terms that depend on $\Theta$.

---

<ins>**Let's differentiate P($\Theta$):**</ins>

$$\boxed{P(\Theta)= 2(X\Theta)^{T}Y}$$

<br/>

\begin{align*} P(\Theta) & = 2 \left( \begin{bmatrix} 1 & x_1^{(1)}& x_2^{(1)} & ... & x_n^{(1)} \\ 1 & x_1^{(2)}& x_2^{(2)} & ... & x_n^{(2)} \\ . & . & . & ... & . \\ . & . & . & ... & . \\ . & . & . & ... & . \\ 1 & x_1^{(m)} & x_2^{(m)} & ... & x_n^{(m)} \\ \end{bmatrix} \begin{pmatrix}\theta_0 \\. \\. \\. \\\theta_n \end{pmatrix} \right)^{T} \begin{pmatrix}y^{(1)} \\. \\. \\. \\y^{(m)} \end{pmatrix} \\ \\ & = 2 \begin{bmatrix} \theta_0 + \theta_1x_1^{(1)} + ... + \theta_nx_n^{(1)} ,& \theta_0 + \theta_1x_1^{(2)} + ... + \theta_nx_n^{(2)} ,& ... & ,\theta_0 + \theta_1x_1^{(m)} + ... + \theta_nx_n^{(m)} \end{bmatrix} \begin{pmatrix}y^{(1)} \\. \\. \\. \\ y^{(m)} \end{pmatrix} \\ \\ & = 2 (\theta_0 + \theta_1x_1^{(1)} + ... + \theta_nx_n^{(1)})y^{(1)} + ............ + 2(\theta_0 + \theta_1x_1^{(m)} + ... + \theta_nx_n^{(m)})y^{(m)} \end{align*}

<br/>

<ins>**And since,**</ins>

$$\frac{\partial P }{\partial \Theta} = \begin{pmatrix}\frac{\partial P }{\partial \theta_0} \\\frac{\partial P }{\partial \theta_1} \\. \\. \\ \frac{\partial P }{\partial \theta_n} \end{pmatrix}$$

\begin{align*} \rightarrow & \\ & \frac{\partial P }{\partial \theta_0} = 2y^{(1)} + 2y^{(2)} + ... + 2y^{(m)} \\ & \frac{\partial P }{\partial \theta_1} = 2x_1^{(1)}y^{(1)} + 2x_1^{(2)}y^{(2)} + ... + 2x_1^{(m)}y^{(m)} \\ & \frac{\partial P }{\partial \theta_2} = 2x_2^{(1)}y^{(1)} + 2x_2^{(2)}y^{(2)} + ... + 2x_2^{(m)}y^{(m)} \\ & . \\ & . \\ & . \\ & \frac{\partial P }{\partial \theta_n} = 2x_n^{(1)}y^{(1)} + 2x_n^{(2)}y^{(2)} + ... + 2x_n^{(m)}y^{(m)} \\ \end{align*}

<ins>**Thus,**</ins>

$$ \frac{\partial P }{\partial \Theta} = 2 \begin{bmatrix} 1 & 1 & ... & 1 \\ x_1^{(1)}& x_1^{(2)} & ... & x_1^{(m)} \\ x_2^{(1)}& x_2^{(2)} & ... & x_2^{(m)} \\ . & . & ... & . \\ . & . & ... & . \\ . & . & . & .\\ x_n^{(1)} & x_n^{(2)} & ... & x_n^{(m)} \\ \end{bmatrix} \begin{pmatrix}y^{(1)} \\. \\. \\. \\ y^{(m)} \end{pmatrix} = 2X^{T}Y$$

---

<ins>**Let's differentiate Q($\Theta$):**</ins>

$$\boxed{Q(\Theta) = \Theta^{T}X^{T}X\Theta}$$

Let's consider $\forall i \in [1, m], x_0^{(i)} = 1$.

\begin{align*} Q(\Theta) & = \begin{pmatrix} \theta_0 & \theta_1 & ... & \theta_n \end{pmatrix} \underbrace{ \begin{bmatrix} x_0^{(1)} & x_0^{(2)} & ... & x_0^{(m)} \\ x_1^{(1)}& x_1^{(2)} & ... & x_1^{(m)} \\ x_2^{(1)}& x_2^{(2)} & ... & x_2^{(m)} \\ . & . & ... & . \\ . & . & ... & . \\ . & . & . & .\\ x_n^{(1)} & x_n^{(2)} & ... & x_n^{(m)} \\ \end{bmatrix} \begin{bmatrix} x_0^{(1)} & x_1^{(1)}& x_2^{(1)} & ... & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)}& x_2^{(2)} & ... & x_n^{(2)} \\ . & . & . & ... & . \\ . & . & . & ... & . \\ . & . & . & ... & . \\ x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & ... & x_n^{(m)} \\ \end{bmatrix} }_{X^{2}=X^{T}X} \begin{pmatrix} \theta_0 \\ . \\ . \\ . \\ \theta_n \end{pmatrix} \end{align*}

Let's write $X^{2}_{r,c} = \sum_{i=1}^{m}x_r^{(i)}x_c^{(i)}$ for each component of the matrix $X^{2} = X^{T}X$ (a shorthand, not a true square), with r as row and c as column. They are numbers, not matrices!

<ins>**Thus,**</ins>

\begin{align*} Q(\Theta) & = \begin{pmatrix} \theta_0 & \theta_1 & ... & \theta_n \end{pmatrix} X^{2} \begin{pmatrix} \theta_0 \\ . \\ . \\ . \\ \theta_n \end{pmatrix} \\ \\ & = \begin{pmatrix} \theta_0 & \theta_1 & ... & \theta_n \end{pmatrix} \begin{bmatrix} X^{2}_{0,0} & X^{2}_{0,1} & ... & X^{2}_{0,n} \\ X^{2}_{1,0} & X^{2}_{1,1} & ... & X^{2}_{1,n} \\ X^{2}_{2,0} & X^{2}_{2,1} & ... & X^{2}_{2,n} \\ . & . & ... & . \\ . & . & ... & . \\ . & . & . & .\\ X^{2}_{n,0} & X^{2}_{n,1} & ... & X^{2}_{n,n} \end{bmatrix} \begin{pmatrix} \theta_0 \\ . \\ . \\ . \\ \theta_n \end{pmatrix} \\ \\ & = \begin{pmatrix} \theta_0 & \theta_1 & ... & \theta_n \end{pmatrix} \begin{bmatrix} X^{2}_{0,0}\theta_0 + X^{2}_{0,1}\theta_1 + ... + X^{2}_{0,n}\theta_n \\ X^{2}_{1,0}\theta_0 + X^{2}_{1,1}\theta_1 + ... + X^{2}_{1,n}\theta_n \\ ... \\ ... \\ ... \\ X^{2}_{n,0}\theta_0 + X^{2}_{n,1}\theta_1 + ... + X^{2}_{n,n}\theta_n \end{bmatrix} \\\\ & = \theta_0(X^{2}_{0,0}\theta_0 + X^{2}_{0,1}\theta_1 + ... + X^{2}_{0,n}\theta_n) + \theta_1(X^{2}_{1,0}\theta_0 + X^{2}_{1,1}\theta_1 + ... + X^{2}_{1,n}\theta_n) \\ & \hspace{0.45cm} + ... + \theta_n(X^{2}_{n,0}\theta_0 + X^{2}_{n,1}\theta_1 + ... + X^{2}_{n,n}\theta_n) \tag{1} \end{align*}

---

We can notice 2 things:

1. $X^{2}_{0,0} = \sum_{i=1}^{m} 1 = m$.
2. $\forall r,c \in [0, n],\ X^{2}_{r, c} = X^{2}_{c, r}$ (because $X^{T}X$ is a symmetric matrix).

<ins>**Since,**</ins>

$$\frac{\partial Q }{\partial \Theta} = \begin{pmatrix}\frac{\partial Q }{\partial \theta_0} \\\frac{\partial Q }{\partial \theta_1} \\. \\. \\ \frac{\partial Q }{\partial \theta_n} \end{pmatrix}$$

\begin{align*} \rightarrow & \\ & \frac{\partial Q }{\partial \theta_0} = (2X^{2}_{0,0}\theta_0 + X^{2}_{0,1}\theta_1 + ... + X^{2}_{0,n}\theta_n) + (X^{2}_{1,0}\theta_1) + ... + (X^{2}_{n,0}\theta_n) \\ & \frac{\partial Q }{\partial \theta_1} = (X^{2}_{0,1}\theta_0) + (X^{2}_{1,0}\theta_0 + 2X^{2}_{1,1}\theta_1 + ... + X^{2}_{1,n}\theta_n) + X^{2}_{2,1}\theta_2 + ... + X^{2}_{n,1}\theta_n \\ & . \\ & . \\ & . \\ & \frac{\partial Q }{\partial \theta_n} = (X^{2}_{0,n}\theta_0) + (X^{2}_{1,n}\theta_1) + ... + (X^{2}_{n,0}\theta_0 + X^{2}_{n,1}\theta_1 + ... + 2X^{2}_{n,n}\theta_n) \end{align*}

<br/>

Using the fact that $X^{2}$ is a symmetric matrix:

\begin{align*} \rightarrow & \\ & \frac{\partial Q }{\partial \theta_0} = 2X^{2}_{0,0}\theta_0 + 2X^{2}_{0,1}\theta_1 + 2X^{2}_{0,2}\theta_2 +... + 2X^{2}_{0,n}\theta_n \\ & \frac{\partial Q }{\partial \theta_1} = 2X^{2}_{1,0}\theta_0 + 2X^{2}_{1,1}\theta_1 + 2X^{2}_{1,2}\theta_2 + ... + 2X^{2}_{1,n}\theta_n \\ & . \\ & . \\ & . \\ & \frac{\partial Q }{\partial \theta_n} = 2X^{2}_{n,0}\theta_0 + 2X^{2}_{n,1}\theta_1 + 2X^{2}_{n,2}\theta_2 + ... + 2X^{2}_{n,n}\theta_n \end{align*}

<ins>**Thus,**</ins>

$$\frac{\partial Q }{\partial \Theta} = 2 \underbrace{ \begin{bmatrix} X^{2}_{0,0} & X^{2}_{0,1} & ... & X^{2}_{0,n} \\ X^{2}_{1,0} & X^{2}_{1,1} & ... & X^{2}_{1,n} \\ X^{2}_{2,0} & X^{2}_{2,1} & ... & X^{2}_{2,n} \\ . & . & ... & . \\ . & . & ... & . \\ . & . & . & .\\ X^{2}_{n,0} & X^{2}_{n,1} & ... & X^{2}_{n,n} \end{bmatrix} }_{X^{T}X} \begin{pmatrix} \theta_0 \\ . \\ . \\ . \\ \theta_n \end{pmatrix} = 2X^{2}\Theta = \boxed{2X^{T}X\Theta}$$

---

Finally, we can solve $\frac{\partial J}{\partial \Theta} = 0$ (the constant factor $\frac{1}{2m}$ and the term $Y^{T}Y$ play no role here).

<br/>

\begin{aligned} \frac{\partial J}{\partial \Theta} = 0 \Leftrightarrow & \frac{\partial Q}{\partial \Theta} - \frac{\partial P}{\partial \Theta} = 0 \\ \Leftrightarrow & 2X^{T}X\Theta - 2X^{T}Y = 0 \\ \Leftrightarrow & X^{T}X\Theta - X^{T}Y = 0 \\ \Leftrightarrow & X^{T}X\Theta = X^{T}Y \\ \Leftrightarrow & \fbox{$\Theta = (X^{T}X)^{-1}X^{T}Y$}\end{aligned}

(The last step assumes $X^{T}X$ is invertible.)
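As a quick numerical sanity check of this result (NumPy, random toy data of my own), the gradient $\frac{\partial Q}{\partial \Theta} - \frac{\partial P}{\partial \Theta} = 2X^{T}X\Theta - 2X^{T}Y$ should vanish at the closed-form solution, and that solution should match a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # m examples, n features + bias column
y = rng.normal(size=m)

theta = np.linalg.solve(X.T @ X, X.T @ y)        # normal equation solution
gradient = 2 * X.T @ X @ theta - 2 * X.T @ y     # dQ/dTheta - dP/dTheta at that solution

print(np.allclose(gradient, 0.0))                                # True (up to round-off)
print(np.allclose(theta, np.linalg.lstsq(X, y, rcond=None)[0]))  # matches least squares
```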