# Linear Regression From Scratch Using Gradient Descent
Linear regression | Gradient Descent | Normalizing Data | Loss Function
[GitHub Repository Link](https://github.com/Mecha-Coder/42-ft-linear-regression)

# 1. Introduction
This is the first Machine Learning project from 42 School.
Machine Learning is about training a model to recognize patterns in data and then using those learned patterns to make predictions.
# 2. Understanding the Assignment
Train a `linear regression model` using `gradient descent` on the provided dataset.
After training, use the model to predict a car’s price based on its mileage.
## 2.1 Dataset
The training data is stored in `data.csv`. It consists of 2 columns:
| Column | Unit | Range |
| --- | --- | --- |
| Mileage (feature) | km | 10k - 100k |
| Price (target) | $ | 1k - 10k |
Terms:
- `feature` - the input used by the model to make predictions
- `target` - the output the model is trying to predict
## 2.2 Linear Regression Model
A Linear Regression model is a line on a 2D plot that represents the relationship between a feature and the target variable.
<div style="text-align: center;">
<img
src="https://hackmd.io/_uploads/BJYjjNVZ-x.png"
alt="Figure1_Linear_Regression"
>
<p><strong>Figure 1:</strong> Linear Regression Model</p>
</div>
This line can be expressed using (eq1) or (eq2).
$$
\hat y = mx + c \qquad ---(eq1)
$$
$$
price = (\text{mileage}) \cdot \theta_1 + \theta_0 \qquad ---(eq2)
$$
where,
- $m$ = gradient (weight)
- $c$ = intercept (bias)
- $\hat{y}$ = price (target)
- $x$ = mileage (feature)
The training process involves fitting the model (the line) to best capture the relationship between `car price` and `mileage`.
This is done by finding the optimal values for the gradient (`m`) and the intercept (`c`) of the line.
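As a quick illustration, (eq2) translates to a one-line Python function (a minimal sketch; `estimate_price` and its parameter names are mine, chosen to mirror $\theta_0$ and $\theta_1$):
```python
# Sketch of (eq2): price = mileage * theta1 + theta0
def estimate_price(mileage: float, theta0: float, theta1: float) -> float:
    """Predict a car's price from its mileage using the line's parameters."""
    return theta1 * mileage + theta0
```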
## 2.3 Loss Function
Also known as the Cost Function, the Loss Function is what Gradient Descent uses to improve the model.
It computes the error by comparing predicted values against actual values. This measurement tells us whether the model is getting better or worse.
$$
\downarrow \text{ Loss Function } = 👍 \text{ Improving}
$$
$$
\uparrow \text{ Loss Function } = 👎 \text{ Worsening}
$$
There are several ways to calculate errors. For this assignment, we use **Mean Squared Error (MSE)**.
$$
L(c, m) = \frac{1}{2n} \sum (\hat{y} - y)^2 \qquad ---(eq3)
$$
where,
- $n$ = size of the data
- $\hat{y}$ = predicted value
- $y$ = actual value
:::info
💡 You may see MSE written as 1/n or 1/2n. Click [here](https://datascience.stackexchange.com/questions/52157/why-do-we-have-to-divide-by-2-in-the-ml-squared-error-cost-function) to learn why.
:::
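As a sketch, (eq3) is only a few lines of NumPy (the function name is mine; note the extra division by 2 from the $\frac{1}{2n}$ convention):
```python
import numpy as np

def loss(c: float, m: float, x: np.ndarray, y: np.ndarray) -> float:
    """Half-MSE from (eq3): L(c, m) = 1/(2n) * sum((y_hat - y)^2)."""
    y_hat = m * x + c                     # predictions from (eq1)
    return np.mean((y_hat - y) ** 2) / 2  # mean handles the 1/n factor
```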
Let’s try some random values of (`c`) and (`m`):
- Attempt 1: L(3,0) = 24.62
- Attempt 2: L(2.5, 0.5) = 18.72
<div style="display: flex; justify-content: center; gap: 40px; text-align: center;">
<div>
<img
src="https://hackmd.io/_uploads/B15keSE-Ze.png"
width="250"
>
<p><strong>Figure 2:</strong> Attempt 1, MSE = 24.62</p>
</div>
<div>
<img
src="https://hackmd.io/_uploads/ryNJlBVWWg.png"
width="250"
>
<p><strong>Figure 3:</strong> Attempt 2, MSE = 18.72</p>
</div>
</div>
Based on the MSE score, the second attempt produces a better regression model. If we try all combinations and plot the MSE, we get a U-shaped curve (Figure 4).
<div style="text-align: center;">
<img
src="https://hackmd.io/_uploads/B1FYMBV--e.png"
width="300"
>
<p><strong>Figure 4:</strong> Losses at regression line</p>
</div>
From the graph, we can deduce that the lowest loss is where the optimum (`c`) and (`m`) values lie. If we split the x-axis into components (`c`) and (`m`), the Loss Function graph looks like Figure 5.
<div style="text-align: center;">
<img
src="https://hackmd.io/_uploads/HyfwmBEbWe.png"
width="300"
>
<p><strong>Figure 5:</strong> Loss Function with respect to (m) and (c)</p>
</div>
We could brute-force different (`c`) and (`m`) values to find the minimum loss, but there’s a more efficient method: the **Gradient Descent algorithm**.
## 2.4 Gradient Descent
Let’s analyze Figure 6.
- <span style="color:#66c266; font-weight: bold;">Green line</span> – Loss Function line
- <span style="color:#ff7f7f; font-weight: bold;">Red line</span> – Gradient of the Loss Function at that point
As we descend the Loss Function, the gradient becomes smaller. Eventually, at the lowest cost, the gradient is zero.
<div style="text-align: center;">
<img
src="https://hackmd.io/_uploads/SJh04HV-bg.gif"
width="300"
>
<p><strong>Figure 6:</strong> Gradient Descent - descending to the lowest point</p>
</div>
The idea behind Gradient Descent is to use the gradient of the Loss Function as a compass, guiding the computation toward the optimal values of (`m`) and (`c`).
<div style="text-align: center;">
<img
src="https://hackmd.io/_uploads/BJcUIHVbWe.png"
width="500"
>
<p><strong>Figure 7:</strong> Start with a random (c) value and eventually reach the minimum point</p>
</div>
**Here is the Gradient Descent process:**
1. Select suitable initial values for (`c`) and (`m`).
2. Calculate the gradient of the Loss Function at that point.
3. Adjust the value based on the gradient:
| Gradient value | Action |
| --- | --- |
| Positive (+) | Decrease the value |
| Negative (-) | Increase the value |
| Large magnitude | Far from the optimal value |
| Small magnitude | Close to the optimal value |
4. Repeat steps 2–3 iteratively until the gradient is approximately zero.
Gradient Descent may seem like a guessing game, but the gradient provides clear guidance on both the direction and the step size needed to reach the minimum.
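To see this compass idea in action before deriving the real gradients, here is a toy sketch (entirely made up, not the assignment's loss) that minimizes $L(c) = (c - 5)^2$, whose gradient is $2(c - 5)$:
```python
c = 0.0                # step 1: pick an initial value
learning_rate = 0.1
for _ in range(1000):
    gradient = 2 * (c - 5)         # step 2: gradient of L(c) = (c - 5)^2
    if abs(gradient) < 1e-6:       # step 4: stop when the gradient is ~0
        break
    c -= learning_rate * gradient  # step 3: move against the gradient's sign
print(c)  # ~5.0, the minimum of the toy loss
```
Note how a negative gradient increases `c` and a positive gradient decreases it, exactly as in the table above.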
### 2.4.1 Gradient Derivation
We use equations (eq1) and (eq3) to derive the gradients.
:::success
**Gradient with respect to $m$**
:::
1. Start from the **Mean Squared Error** loss (eq3):
$$
L(c, m) = \frac{1}{2n} \sum (\hat{y} - y)^2
$$
2. Apply the chain rule for the composite function:
$$
\frac{dL}{dm} = \frac{1}{2n} \cdot \sum 2(\hat{y} - y)^1 \frac{d\hat{y}}{dm}
$$
3. Since $\hat{y} = mx + c$, we get:
$$
\frac{d\hat{y}}{dm} = \frac{d}{dm}(mx + c) = x
$$
4. Insert $\frac{d\hat{y}}{dm}$; the 2 cancels out the $\frac{1}{2}$, giving:
$$
\frac {dL}{dm} = \frac{1}{n} \sum (\hat{y} - y)(x) \qquad ---(eq4)
$$
:::success
**Gradient with respect to $c$**
:::
1. Start from the **Mean Squared Error** loss (eq3):
$$
L(c, m) = \frac{1}{2n} \sum (\hat{y} - y)^2
$$
2. Apply the chain rule for the composite function:
$$
\frac{dL}{dc} = \frac{1}{2n} \cdot \sum 2(\hat{y} - y)^1\frac{d\hat{y}}{dc}
$$
3. Since $\hat{y} = mx + c$, we get:
$$
\frac{d\hat{y}}{dc} = \frac{d}{dc}(mx + c) = 1
$$
4. Insert $\frac{d\hat{y}}{dc}$; the 2 cancels out the $\frac{1}{2}$, giving:
$$
\frac {dL}{dc} = \frac{1}{n} \sum (\hat{y} - y) \qquad ---(eq5)
$$
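In NumPy, (eq4) and (eq5) reduce to two vectorized means (a sketch; the helper name is mine):
```python
import numpy as np

def gradients(c: float, m: float, x: np.ndarray, y: np.ndarray):
    """Return (dL/dc, dL/dm) following (eq5) and (eq4)."""
    error = (m * x + c) - y        # y_hat - y
    dl_dc = np.mean(error)         # (eq5): 1/n * sum(y_hat - y)
    dl_dm = np.mean(error * x)     # (eq4): 1/n * sum((y_hat - y) * x)
    return dl_dc, dl_dm
```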
### 2.4.2 Updating Parameters
In each iteration of Gradient Descent, we update the parameters using the following equations:
$$
c_{new} = c_{current} - step_c \qquad ---(eq6)
$$
$$
m_{new} = m_{current} - step_m \qquad ---(eq7)
$$
The adjustments are based on the gradient of the Loss Function with respect to (`c`) and (`m`):
$$
step_c = \alpha \cdot \frac{dL}{dc} \qquad ---(eq8)
$$
$$
step_m = \alpha \cdot \frac{dL}{dm} \qquad ---(eq9)
$$
where,
- $step$ = parameter adjustment for each iteration
- $\alpha$ = learning rate
The **learning rate** is a tunable number that controls how big each step is. Choosing the right value is crucial: too large and it may overshoot the minimum; too small and learning slows down.
<div style="text-align: center;">
<img
src="https://hackmd.io/_uploads/ryo0iSEWWe.png"
width="600"
>
<p><strong>Figure 8:</strong> Importance of selecting the appropriate learning rate</p>
</div>
Putting it all together, the final update equations become:
$$
m_{new} = m_{current} - \frac{\alpha }{n} \cdot \sum (\hat{y} - y)(x) \qquad ---(eq10)
$$
$$
c_{new} = c_{current} - \frac{\alpha }{n} \cdot \sum (\hat{y} - y) \qquad ---(eq11)
$$
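As a minimal sketch, one training loop built from (eq10) and (eq11) could look like this (the default learning rate and fixed iteration count here are placeholders; my actual program stops once the step size converges, see Section 3):
```python
import numpy as np

def train(x: np.ndarray, y: np.ndarray,
          learning_rate: float = 0.1, iterations: int = 1000):
    """Fit m and c with Gradient Descent, updating per (eq10) and (eq11)."""
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        error = (m * x + c) - y                     # y_hat - y
        m -= learning_rate / n * np.sum(error * x)  # (eq10)
        c -= learning_rate / n * np.sum(error)      # (eq11)
    return m, c
```
Both parameters are updated from the same `error` array, so each iteration uses one consistent snapshot of the model.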
### 2.4.3 Scaling: Normalize & De-normalize
If we train on the raw dataset, we will run into problems. As shown in Section 2.1, the large mileage values dominate the much smaller car price values.
Gradient Descent becomes unstable, takes longer to reach the minimum, and can even overflow (which actually happened when I skipped this step).
To fix this, we **normalize** the data. Two common approaches are:
- Z-score Standardization
- Min-Max Scaling **(what I use)**
With Min-Max Scaling, all values are transformed into the range 0–1:
- 0 → minimum value
- 1 → maximum value
**Normalization:**
$$
x_{norm} = \frac{x - x_{min}}{x_{range}} \qquad ---(eq12)
$$
$$
y_{norm} = \frac{y - y_{min}}{y_{range}} \qquad ---(eq13)
$$
where,
$$
x_{range} = x_{max} - x_{min} \qquad ---(eq14)
$$
$$
y_{range} = y_{max} - y_{min} \qquad ---(eq15)
$$
When the data is normalized, the scale changes but the pattern in the data stays intact.
<div style="text-align: center;">
<img
src="https://hackmd.io/_uploads/HJ6anSVWbx.png"
width="700"
>
<p><strong>Figure 9:</strong> Before and after normalizing the data</p>
</div>
After training, we **de-normalize** the gradient (`m`) and intercept (`c`) to restore them to the original scale:
**De-normalize:**
$$
m_{denorm} = m \cdot \frac{y_{range}}{x_{range}} \qquad ---(eq16)
$$
$$
c_{denorm} = (c \cdot {y_{range}}) + y_{min} - (m_{denorm} \cdot x_{min}) \qquad ---(eq17)
$$
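Here is a sketch of both directions (the helper names are mine):
```python
def normalize(values, v_min, v_range):
    """Min-Max scaling, (eq12)/(eq13): map values into the range [0, 1]."""
    return (values - v_min) / v_range

def denormalize_params(m, c, x_min, x_range, y_min, y_range):
    """Map (m, c) learned on normalized data back to the original scale,
    per (eq16) and (eq17)."""
    m_denorm = m * (y_range / x_range)                 # (eq16)
    c_denorm = c * y_range + y_min - m_denorm * x_min  # (eq17)
    return m_denorm, c_denorm
```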
Now we have all the knowledge needed to train our model.
# 3. Completing the Assignment
:::success
**Training Program:**
:::
1. Load the dataset and required libraries (NumPy, Pandas, Matplotlib).
2. Normalize the data using Min-Max Scaling.
3. Train the model on the normalized data with Gradient Descent. Iterate until the step size converges (reaches the lowest point):
```python
# Training Hyperparameters
INIT_M = 0
INIT_C = 0
LEARN_RATE = 1
CONVERGE_LIMIT = 0.00001
ITERATION_LIMIT = 10000
```
4. De-normalize the final (`c`) and (`m`) values.
5. Save important results (e.g. `results.json`).
:::success
**Predict Program:**
:::
1. Load the saved results
2. Loop (see the sketch after this list):
a. Ask the user for a mileage value
b. Validate input
c. Compute the predicted price
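A minimal sketch of that loop (the JSON key names are assumptions; only `results.json` itself comes from the training program above):
```python
import json

def main():
    # Load the de-normalized parameters saved by the training program
    with open("results.json") as f:
        results = json.load(f)
    m, c = results["m"], results["c"]   # key names assumed for illustration

    while True:
        raw = input("Enter a mileage in km (or 'q' to quit): ")
        if raw.lower() == "q":
            break
        try:
            mileage = float(raw)        # validate the input
            if mileage < 0:
                raise ValueError
        except ValueError:
            print("Please enter a non-negative number.")
            continue
        print(f"Estimated price: ${m * mileage + c:.2f}")

if __name__ == "__main__":
    main()
```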
# 4. Bonus Part
**1. Plotting Training Results**
- During training, each iteration produces a new `m`, `c`, and MSE value.
- Store those *de-normalized* values in an array and save it inside `results.json`.
- Load that data into the bonus program and plot it with Matplotlib.
<div style="text-align: center;">
<img
src="https://hackmd.io/_uploads/HkkrELV--x.gif"
width="700"
>
<p><strong>Figure 10:</strong> How (m), (c), and (MSE) evolve across iterations</p>
</div>
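For reference, plotting the stored loss history takes only a few lines of Matplotlib (a sketch; the `mse` key name is assumed):
```python
import json
import matplotlib.pyplot as plt

with open("results.json") as f:
    mse_history = json.load(f)["mse"]   # key name assumed for illustration

plt.plot(mse_history)
plt.xlabel("Iteration")
plt.ylabel("MSE")
plt.title("Loss across training iterations")
plt.show()
```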
**2. Precision Evaluation**
- To measure how well the model fits the data, we use the coefficient of determination, $R^2$:
$$
R^2 = 1 - \frac {\sum(\hat{y} - y)^2}{\sum(y - \bar{y})^2} \qquad ---(eq18)
$$
where,
- $\hat y$ = predicted value
- $\bar y$ = mean value
- $y$ = actual value
| R² Range | Relationship |
| --- | --- |
| 0–0.2 | No relationship |
| 0.2–0.4 | Weak |
| 0.4–0.6 | Medium |
| 0.6–0.8 | Strong |
| 0.8–1.0 | Very strong |
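A minimal NumPy sketch of (eq18) (the function name is mine):
```python
import numpy as np

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Coefficient of determination per (eq18)."""
    ss_res = np.sum((y_hat - y) ** 2)     # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
```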