---
title: 'Stochastic Gradient Descent Regressor (SGDRegressor)'
disqus: hackmd
---

Stochastic Gradient Descent Regressor (SGDRegressor)
===



## Table of Contents

[TOC]

## Stochastic Gradient Descent Regressor (SGDRegressor)

### What is SGDRegressor actually doing, mathematically?

It does not change the model. Whether the regression is:

* Simple
* Multiple
* Polynomial

it still uses the same model:

$$
\hat{y} = wx + b
$$

and the same SSE:

$$
SSE = \sum_{i=1}^{n} (y_i - wx_i - b)^2
$$

The only difference is:

* **OLS**: differentiate the entire SSE → solve the resulting system of equations
* **SGD**: differentiate the error of individual data points → update the parameters step by step

### OLS (Ordinary Least Squares)

As derived in the previous section, [OLS (Ordinary Least Squares)](https://hackmd.io/axF0Q0AmSgmEdiGtRZJgHg#OLS%EF%BC%88Ordinary-Least-Square%EF%BC%8C%E6%9C%80%E5%B0%8F%E5%B9%B3%E6%96%B9%E6%B3%95%EF%BC%89).



### SGD (Stochastic Gradient Descent)

Define the **mean squared error (MSE)** as the objective function:

$$
J(w,b)=
\frac{1}{n}\sum_{i=1}^{n}(y_i - wx_i - b)^2
$$



Take the partial derivatives over the $n$ data points (the core of SGD):

#### Partial derivative with respect to $b$


$$
\begin{aligned}
\frac{\partial J}{\partial b}
&=
\frac{\partial}{\partial b}
\left(
\frac{1}{n}\sum_{i=1}^{n}(y_i-wx_i-b)^2
\right) \\
&=
\frac{1}{n}\sum_{i=1}^{n} 2(y_i-wx_i-b)(-1) \\
&=
-\frac{2}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)
\end{aligned}
$$



#### Partial derivative with respect to $w$


$$
\begin{aligned}
\frac{\partial J}{\partial w}
&=
\frac{1}{n}\sum_{i=1}^{n} 2(y_i-wx_i-b)(-x_i) \\
&=
-\frac{2}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)x_i
\end{aligned}
$$


#### The SGD update rule (this is the key difference)

SGD does **not** set the derivative to zero. Instead, it chooses a learning rate $\eta$ and iterates:

$$
\theta \leftarrow \theta - \eta \nabla J(\theta)
$$

where:

* $\nabla J(\theta)$: the **gradient (the direction of steepest ascent)**
* $\eta$: the learning rate (step size)
:::success
#### Learning rate

The learning rate means "how much I trust this direction, and therefore how big a step I am willing to take."
In SGD we do this at every step: standing at some parameters $(w, b)$, we know which direction makes the SSE decrease fastest (the gradient), but we do not know how far we can move before overshooting. That "how far" ratio is the learning rate.
:::
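The "how far before overshooting" trade-off can be seen in a tiny sketch (a hypothetical 1-D example, not part of the derivation above): minimizing $J(w) = (w-3)^2$ by gradient descent converges for a small $\eta$ and blows up once $\eta$ is too large.

```python
# Hypothetical 1-D example: J(w) = (w - 3)^2, minimized at w* = 3.
def gradient_descent(eta, steps=50, w0=0.0):
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # dJ/dw
        w = w - eta * grad      # the update rule: theta <- theta - eta * grad
    return w

print(gradient_descent(eta=0.1))  # small step: converges to ~3
print(gradient_descent(eta=1.1))  # step too large: overshoots further each iteration
```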


#### Updating $b$



$$
\begin{aligned}
b
&\leftarrow
b - \eta \frac{\partial J}{\partial b} \\
&=
b - \eta\left(
-\frac{2}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)
\right) \\
&=
b + \frac{2\eta}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)
\end{aligned}
$$


#### Updating $w$

$$
\begin{aligned}
w
&\leftarrow
w - \eta \frac{\partial J}{\partial w} \\
&=
w - \eta\left(
-\frac{2}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)x_i
\right) \\
&=
w + \frac{2\eta}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)x_i
\end{aligned}
$$

### Summary of the SGD update rules


$$
\begin{aligned}
w &\leftarrow w + \frac{2\eta}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)x_i \\
b &\leftarrow b + \frac{2\eta}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)
\end{aligned}
$$
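As a sanity check on the two gradients above, they can be compared against numerical (central finite-difference) derivatives of $J(w,b)$. The data below is made up purely for this check:

```python
import numpy as np

# Made-up data, only to verify the formulas
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=20)

def J(w, b):
    return np.mean((y - w * x - b) ** 2)

w, b = 0.5, -0.3
yhat = w * x + b
# analytic gradients from the derivation above
grad_w = -(2.0 / len(x)) * np.sum((y - yhat) * x)
grad_b = -(2.0 / len(x)) * np.sum(y - yhat)

# central finite differences
eps = 1e-6
num_w = (J(w + eps, b) - J(w - eps, b)) / (2 * eps)
num_b = (J(w, b + eps) - J(w, b - eps)) / (2 * eps)
print(abs(grad_w - num_w), abs(grad_b - num_b))  # both should be ~0
```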


### Why use SGD?

Limits of the normal equation:

$$
\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top y
$$

For OLS to work smoothly, it quietly assumes that:

* the number of features $p$ is **not large**
* $X^\top X$ is **invertible**
* the $p \times p$ matrix fits in memory
* you are willing to spend the time to invert the matrix

If any one of these breaks, OLS starts to fall apart.
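For small $p$ the normal equation works fine. A minimal sketch (synthetic data, values chosen for illustration) that solves $(X^\top X)\theta = X^\top y$ as a linear system rather than forming the inverse explicitly:

```python
import numpy as np

# Synthetic example: solve the normal equation as a linear system
rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 5))
theta_true = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])  # [b, w1, ..., w5]
Xd = np.c_[np.ones(len(Xs)), Xs]        # design matrix with an intercept column
ys = Xd @ theta_true                    # noise-free, so the check is exact

# (X^T X) theta = X^T y -- np.linalg.solve is cheaper and more stable than inv()
theta_ne = np.linalg.solve(Xd.T @ Xd, Xd.T @ ys)

# the SVD-based least-squares solver gives the same answer
theta_lstsq, *_ = np.linalg.lstsq(Xd, ys, rcond=None)
print(np.allclose(theta_ne, theta_lstsq))  # True
```

Even here the $O(p^3)$ solve is hiding in `np.linalg.solve`; it only stops being affordable when $p$ grows large.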


| Aspect | OLS (Normal Equation) | SGD |
| --------------- | -------------------- | ----------- |
| Solution method | closed-form, computed once | iterative numerical approximation |
| Needs a matrix inverse | ✅ | ❌ |
| Computational cost | $O(p^3)$ | $O(p)$ per step |
| Memory footprint | high | low |
| Large datasets | ❌ | ✅ |
| Online learning | ❌ | ✅ |
| Regularization support | possible, but costly | a natural fit |
| Nature of the solution | global minimum | the same minimum (in theory) |


NumPy code example
---


Imports and data generation
```python=
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

# enable LaTeX rendering in matplotlib
mpl.rcParams["text.usetex"] = True
mpl.rcParams["text.latex.preamble"] = r"\usepackage{amsmath}"

# experiment scale
seed = 42
n_samples = 800
n_features = 3

# data characteristics
noise_std = 2.0            # std of the Gaussian noise added to y
bias_true = 5.0            # true intercept b
feature_scale = 1.0        # numeric scale of X
w_scale = 1.0              # scale of the true w

# outliers
outlier_frac = 0.05        # outlier fraction (0.0 = none)
outlier_y_scale = 25.0     # outlier offset magnitude (multiplied by noise_std)

# sparsification (simulates one-hot / high-dimensional sparse data, which makes
# the performance comparison more pronounced); disabled for now
sparse_frac = 0.0          # 0.0 = dense; e.g. 0.9 zeroes out 90% of the entries
```

Fix the random seed (to guarantee reproducibility)

```python=
rng = np.random.default_rng(seed)
```

Generate the feature matrix X (n_samples × n_features)

```python=
X = rng.normal(loc=0.0, scale=feature_scale, size=(n_samples, n_features))
X.shape
```

Sparsify (zero out part of the feature values)

```python=
if sparse_frac > 0:
    mask = rng.random(size=X.shape) < sparse_frac
    X = X.copy()
    X[mask] = 0.0
```

Set the true parameters w_true and b_true

```python=
w_true = rng.normal(loc=0.0, scale=w_scale, size=(n_features,))
b_true = float(bias_true)

w_true, b_true
```
Generate y (linear relation + Gaussian noise)

```python=
noise = rng.normal(loc=0.0, scale=noise_std, size=(n_samples,))
y = X @ w_true + b_true + noise
y.shape
```
Add outliers in the y direction

```python=
outlier_mask = np.zeros(n_samples, dtype=bool)

if outlier_frac > 0:
    k = max(1, int(n_samples * outlier_frac))
    idx = rng.choice(n_samples, size=k, replace=False)
    outlier_mask[idx] = True

    y = y.copy()
    y[idx] += rng.normal(loc=0.0, scale=outlier_y_scale * noise_std, size=k)

outlier_mask.sum()
```
Split into train/val/test sets

```python=
val_ratio = 0.2
test_ratio = 0.2

idx = np.arange(n_samples)
rng = np.random.default_rng(seed)  # same seed, for reproducibility
rng.shuffle(idx)

n_test = int(n_samples * test_ratio)
n_val  = int(n_samples * val_ratio)
n_train = n_samples - n_val - n_test

train_idx = idx[:n_train]
val_idx   = idx[n_train:n_train+n_val]
test_idx  = idx[n_train+n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val     = X[val_idx], y[val_idx]
X_test, y_test   = X[test_idx], y[test_idx]

(X_train.shape, X_val.shape, X_test.shape)
```

Data summary

```python=
print("X:", X.shape, " y:", y.shape)
print("train/val/test:", X_train.shape, X_val.shape, X_test.shape)
print("outliers:", outlier_mask.sum())
print("w_true:", w_true)
print("b_true:", b_true)
```

### OLS (see the earlier chapter for the full derivation)

```python=
import numpy as np
import time

# the first column is all 1s (it represents the intercept b)
X_design = np.c_[np.ones(len(X_train)), X_train]  # shape: (n_train, 1+n_features)
X_design.shape



theta_ols, residuals, rank, svals = np.linalg.lstsq(X_design, y_train, rcond=None)



b_ols = float(theta_ols[0])
w_ols = theta_ols[1:]

print("OLS done.")
print("w_ols shape:", w_ols.shape, "b_ols:", b_ols)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

yhat_train_ols = X_train @ w_ols + b_ols
train_mse_ols = mse(y_train, yhat_train_ols)

print("OLS train MSE:", train_mse_ols)


yhat_val_ols = X_val @ w_ols + b_ols
val_mse_ols = mse(y_val, yhat_val_ols)
print("OLS val   MSE:", val_mse_ols)

```




### SGD 

Set the SGD hyperparameters

```python=

seed = 42
rng = np.random.default_rng(seed)

epochs = 30
eta = 0.05            # learning rate (tune this later)
batch_size = 128      # 1 = SGD; e.g. 32/128 = mini-batch (unused in the full-batch loop below)
shuffle = True

# initialize the parameters
w = np.zeros(X_train.shape[1], dtype=float)
b = 0.0

print("init w shape:", w.shape, "init b:", b)
```

Training loop

```python=
loss_history = []
w_history = []
b_history = []

n = len(X_train)

for epoch in range(1, epochs + 1):
    # full-batch: no steps_per_epoch / batch_idx needed
    # shuffling makes no difference for full-batch (the gradient uses all the data)
    yhat = X_train @ w + b
    err = y_train - yhat

    # full-batch gradients (note the division by n)
    grad_w = -(2.0 / n) * (X_train.T @ err)
    grad_b = -(2.0 / n) * np.sum(err)

    # one update per epoch
    w = w - eta * grad_w
    b = b - eta * grad_b

    # record (MSE over the entire training set)
    yhat_train = X_train @ w + b
    train_mse = mse(y_train, yhat_train)

    loss_history.append(train_mse)
    w_history.append(w.copy())
    b_history.append(b)

    if epoch == 1 or epoch % 5 == 0:
        print(f"epoch {epoch:02d} | train MSE={train_mse:.4f}")

```
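Note that the loop above is full-batch gradient descent (one update per epoch), so the `batch_size` hyperparameter is never used. A mini-batch variant might look like the sketch below; it is self-contained, with its own synthetic data (values chosen for illustration), so it runs standalone:

```python
import numpy as np

# self-contained synthetic data (shapes mirror the experiment above)
rng = np.random.default_rng(42)
w_true, b_true = np.array([1.0, -2.0, 0.5]), 5.0
Xtr = rng.normal(size=(800, 3))
ytr = Xtr @ w_true + b_true + rng.normal(scale=0.1, size=800)

w_mb, b_mb = np.zeros(3), 0.0
eta, epochs, batch_size = 0.05, 30, 128
n = len(Xtr)

for epoch in range(epochs):
    order = rng.permutation(n)                  # reshuffle every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = Xtr[idx], ytr[idx]
        err = yb - (Xb @ w_mb + b_mb)
        # same update rule as above, averaged over the batch instead of all n
        w_mb += eta * (2.0 / len(idx)) * (Xb.T @ err)
        b_mb += eta * (2.0 / len(idx)) * np.sum(err)

print(w_mb, b_mb)  # close to w_true and b_true
```

With `batch_size = 1` this becomes true stochastic gradient descent (one update per sample).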

Training results

```python=
yhat_train_sgd = X_train @ w + b
train_mse_sgd = mse(y_train, yhat_train_sgd)
print("SGD train MSE:", train_mse_sgd)

yhat_val_sgd = X_val @ w + b
val_mse_sgd = mse(y_val, yhat_val_sgd)
print("SGD val   MSE:", val_mse_sgd)

```
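It is also worth checking that, run long enough, this gradient descent lands on the same solution OLS finds. A self-contained sketch with synthetic data (values are illustrative):

```python
import numpy as np

# synthetic data for a standalone check
rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 3))
ys = Xs @ np.array([1.0, -2.0, 0.5]) + 5.0 + rng.normal(scale=0.5, size=200)

# OLS reference solution
Xd = np.c_[np.ones(len(Xs)), Xs]
theta_ols, *_ = np.linalg.lstsq(Xd, ys, rcond=None)

# many epochs of full-batch gradient descent
w_gd, b_gd = np.zeros(3), 0.0
for _ in range(2000):
    err = ys - (Xs @ w_gd + b_gd)
    w_gd += 0.05 * (2.0 / len(Xs)) * (Xs.T @ err)
    b_gd += 0.05 * (2.0 / len(Xs)) * np.sum(err)

print(np.allclose(np.r_[b_gd, w_gd], theta_ols, atol=1e-3))  # True
```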

Plot the loss curve

:::success
#### Note: run the code below in a Colab cell first, to download the LaTeX resources needed to render LaTeX in matplotlib figures
```python=
!apt update
!apt install -y cm-super dvipng texlive-latex-extra texlive-fonts-recommended
```
On a local Windows machine:
* install MiKTeX (a full install is recommended)
* install Ghostscript (required)
* after installing, add the directories containing latex and dvipng to the PATH environment variable
:::

```python=
plt.figure(figsize=(6,4))
plt.plot(loss_history, marker="o")
plt.xlabel("epoch")
plt.ylabel("train MSE")
plt.title("SGD training loss (per epoch)")
plt.grid(True)
plt.show()
```

![SDG_Loss](https://hackmd.io/_uploads/H1cu7hqEbx.png)

Generate and save a GIF (you can see SGD gradually moving toward OLS)

```python=
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter


# Prepare the data (only x[0] and x[1] are used in the 3D plot)

X2 = X_train[:, :2].astype(float)
y = y_train.astype(float)
train_outlier = outlier_mask[train_idx]  # fix: restrict the outlier mask to the training split

# OLS (first two dimensions only)
w2_ols = np.array(w_ols[:2], dtype=float)
b_ols_ = float(b_ols)

# SGD history (first two dimensions only)
w2_hist = [np.array(w[:2], dtype=float) for w in w_history]
b_hist = [float(b) for b in b_history]

loss_history = list(loss_history)
n_frames = len(loss_history)




# meshgrid (grid for the planes)


x0_surf, x1_surf = np.meshgrid(
    np.linspace(X2[:, 0].min(), X2[:, 0].max(), 30),
    np.linspace(X2[:, 1].min(), X2[:, 1].max(), 30)
)

# OLS plane (fixed)
y_surf_ols = b_ols_ + w2_ols[0] * x0_surf + w2_ols[1] * x1_surf


# layout: 2x3 (3D on the left / two plots in the middle / text column on the right)

fig = plt.figure(figsize=(10, 4.5), constrained_layout=True)
gs = fig.add_gridspec(
    2, 3,
    width_ratios=[2.3, 1.0, 0.6],
    height_ratios=[1, 1]
)

ax3d = fig.add_subplot(gs[:, 0], projection="3d")  # left: 3D
ax_loss = fig.add_subplot(gs[0, 1])                # top right: loss
ax_param = fig.add_subplot(gs[1, 1])               # bottom right: params
ax_text = fig.add_subplot(gs[:, 2])                # far right: text column
ax_text.axis("off")


# scatter points + OLS plane (fixed)

ax3d.scatter(
    X2[~train_outlier, 0], X2[~train_outlier, 1], y[~train_outlier],
    s=25, edgecolors="k", alpha=0.7, label="normal"
)
if train_outlier.sum() > 0:
    ax3d.scatter(
        X2[train_outlier, 0], X2[train_outlier, 1], y[train_outlier],
        s=60, edgecolors="k", alpha=0.9, color="red", label="outlier"
    )

# OLS: surface + wireframe (fixed)
surf_ols = ax3d.plot_surface(x0_surf, x1_surf, y_surf_ols, alpha=0.30)
wire_ols = ax3d.plot_wireframe(
    x0_surf, x1_surf, y_surf_ols,
    rstride=2, cstride=2,
    linewidth=0.3,
    alpha=0.4
)

# SGD: draw frame 0 first (updated inside update())
surf_sgd = None
wire_sgd = None

w0 = w2_hist[0]
b0 = b_hist[0]
y_surf_sgd0 = b0 + w0[0] * x0_surf + w0[1] * x1_surf
surf_sgd = ax3d.plot_surface(x0_surf, x1_surf, y_surf_sgd0, alpha=0.18)
wire_sgd = ax3d.plot_wireframe(
    x0_surf, x1_surf, y_surf_sgd0,
    rstride=2, cstride=2,
    linewidth=0.4,
    alpha=0.5
)


ax3d.set_xlabel("x[0]", labelpad=10)
ax3d.set_ylabel("x[1]", labelpad=10)
ax3d.set_zlabel("y", labelpad=6)
ax3d.set_title("3D: Data + OLS plane (fixed) + SGD plane (moving)", pad=18)
ax3d.view_init(elev=18, azim=120)

# fix zlim (so outliers don't flatten the planes)
y_vis = y[~train_outlier] if train_outlier.sum() > 0 else y
lo, hi = np.percentile(y_vis, [2, 98])
ax3d.set_zlim(lo, hi)

# put the legend outside the axes (scatter legend only)
ax3d.legend(loc="upper left", bbox_to_anchor=(0.0, 1.02))


# residual vertical lines

rng = np.random.default_rng(0)
sample_size = min(50, len(X2))
sample_idx = rng.choice(len(X2), size=sample_size, replace=False)

res_lines = []
for i in sample_idx:
    ln, = ax3d.plot(
        [X2[i, 0], X2[i, 0]],
        [X2[i, 1], X2[i, 1]],
        [y[i], y[i]],
        linewidth=0.8,
        alpha=0.7
    )
    res_lines.append(ln)


# top right: loss (grows step by step)

ax_loss.set_title("Training loss (MSE)")
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("MSE")
ax_loss.grid(True)

loss_line, = ax_loss.plot([], [], linewidth=2)
loss_dot, = ax_loss.plot([], [], marker="o")

ax_loss.set_xlim(0, n_frames - 1)
ax_loss.set_ylim(min(loss_history) * 0.95, max(loss_history) * 1.05)


# bottom right: parameter traces (w0, w1, b)

w0_hist = [w[0] for w in w2_hist]
w1_hist = [w[1] for w in w2_hist]

ax_param.set_title("Parameter trace")
ax_param.set_xlabel("epoch")
ax_param.set_ylabel("value")
ax_param.grid(True)
ax_param.set_xlim(0, n_frames - 1)

vals = w0_hist + w1_hist + b_hist
vmin, vmax = min(vals), max(vals)
pad = (vmax - vmin) * 0.1 if vmax > vmin else 1.0
ax_param.set_ylim(vmin - pad, vmax + pad)

w0_line, = ax_param.plot([], [], linewidth=2, label="w[0]")
w1_line, = ax_param.plot([], [], linewidth=2, label="w[1]")
b_line,  = ax_param.plot([], [], linewidth=2, label="b")
ax_param.legend(loc="upper right")


# text column on the right (doesn't cover the plots)

info_text = ax_text.text(0.02, 0.98, "", va="top", ha="left", fontsize=12)


# animation update: SGD plane + residual lines + loss + params + text

def update(frame):
    global surf_sgd, wire_sgd

    # --- update the SGD plane (remove the old one, then draw the new one) ---
    if surf_sgd is not None:
        surf_sgd.remove()
    if wire_sgd is not None:
        wire_sgd.remove()

    w = w2_hist[frame]
    b = b_hist[frame]
    y_surf = b + w[0] * x0_surf + w[1] * x1_surf

    surf_sgd = ax3d.plot_surface(x0_surf, x1_surf, y_surf, alpha=0.18)
    wire_sgd = ax3d.plot_wireframe(x0_surf, x1_surf, y_surf, rstride=2, cstride=2, linewidth=0.8)

    # --- update the residual lines (from the SGD plane to the true points) ---
    z_hat = b + X2[sample_idx, 0] * w[0] + X2[sample_idx, 1] * w[1]
    for k, i in enumerate(sample_idx):
        res_lines[k].set_data([X2[i, 0], X2[i, 0]], [X2[i, 1], X2[i, 1]])
        res_lines[k].set_3d_properties([z_hat[k], y[i]])

    # --- update the loss plot ---
    xs = np.arange(frame + 1)
    ys = np.array(loss_history[:frame + 1])
    loss_line.set_data(xs, ys)
    loss_dot.set_data([frame], [loss_history[frame]])

    # --- update the parameter traces ---
    w0_line.set_data(xs, np.array(w0_hist[:frame + 1]))
    w1_line.set_data(xs, np.array(w1_hist[:frame + 1]))
    b_line.set_data(xs, np.array(b_hist[:frame + 1]))

    # --- update the text column ---
    info_text.set_text(
        "SGD status\n"
        "---------\n"
        f"epoch = {frame}\n\n"
        f"w0 = {w0_hist[frame]:.4f}\n"
        f"w1 = {w1_hist[frame]:.4f}\n"
        f"b  = {b_hist[frame]:.4f}\n\n"
        f"MSE = {loss_history[frame]:.4f}\n\n"
        f"outliers = {int(train_outlier.sum())}"
    )

    return (loss_line, loss_dot, w0_line, w1_line, b_line, info_text, surf_sgd, wire_sgd, *res_lines)

anim = FuncAnimation(fig, update, frames=n_frames, interval=200, blit=False)
plt.show()


# save as GIF

gif_path = "sgd_vs_ols_3d_clear.gif"
anim.save(gif_path, writer=PillowWriter(fps=5))
print("Saved:", gif_path)

```


![sgd_vs_ols_3d_clear](https://hackmd.io/_uploads/HyiqpqMBbx.gif)




scikit-learn code (SGD vs OLS on large data with many features)
---

The next experiment compares sklearn's OLS and SGD on memory usage and fit time, to show why SGD is worth using.

Data scale settings
```python=
import numpy as np

rng = np.random.default_rng(42)

# data scale
n_samples = 200_000   # 200k rows
n_features = 100

# true parameters (fixed, for easy interpretation)
w_true = rng.normal(0, 2, size=n_features)
b_true = 3.0

# generate X
X = rng.normal(0, 1, size=(n_samples, n_features))

# generate y (linear + noise)
noise = rng.normal(0, 1.0, size=n_samples)
y = X @ w_true + b_true + noise

print("X shape:", X.shape)
print("y shape:", y.shape)

```

Data split

```python=
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print("Train size:", X_train.shape)
print("Val size:", X_val.shape)

```

Import the profiling modules and sklearn

```python=
import time
import tracemalloc
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
```

OLS training

```python=
ols = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression(fit_intercept=True))
])

tracemalloc.start()
t0 = time.perf_counter()

ols.fit(X_train, y_train)

t1 = time.perf_counter()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

ols_time = t1 - t0
ols_mem = peak / (1024**2)

yhat_train_ols = ols.predict(X_train)
yhat_val_ols = ols.predict(X_val)

mse_train_ols = mean_squared_error(y_train, yhat_train_ols)
mse_val_ols = mean_squared_error(y_val, yhat_val_ols)

print("\n=== OLS (n large) ===")
print("fit time (sec):", ols_time)
print("peak mem (MB):", ols_mem)
print("train MSE:", mse_train_ols)
print("val   MSE:", mse_val_ols)

```
 
SGD training


```python=
from sklearn.linear_model import SGDRegressor

sgd_fixed = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SGDRegressor(
        loss="squared_error",
        penalty=None,
        alpha=0.0,
        fit_intercept=True,
        max_iter=30,          # note: each iter = one pass over the whole dataset
        tol=1e-3,
        learning_rate="constant",
        eta0=0.01,
        random_state=42
    ))
])

tracemalloc.start()
t0 = time.perf_counter()

sgd_fixed.fit(X_train, y_train)

t1 = time.perf_counter()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

sgd_time = t1 - t0
sgd_mem = peak / (1024**2)

yhat_train_sgd = sgd_fixed.predict(X_train)
yhat_val_sgd = sgd_fixed.predict(X_val)

mse_train_sgd = mean_squared_error(y_train, yhat_train_sgd)
mse_val_sgd = mean_squared_error(y_val, yhat_val_sgd)

print("\n=== SGD (n large, fixed iter) ===")
print("fit time (sec):", sgd_time)
print("peak mem (MB):", sgd_mem)
print("n_iter_:", sgd_fixed.named_steps["model"].n_iter_)
print("train MSE:", mse_train_sgd)
print("val   MSE:", mse_val_sgd)

 
```

Comparison

| | OLS (n large) | SGD (n large, fixed iter) |
| -------- | -------- | -------- |
| fit time (sec) | 2.076 | 2.052 |
| peak mem (MB) | 368.7 | 137.4 |
| train MSE | 0.998 | 1.854 |
| val MSE | 1.001 | 1.874 |

Although the fit here is not necessarily better, SGD takes roughly the same time and far less peak memory than OLS; its higher MSE comes from running only a fixed, small number of passes with a constant learning rate.


