---
title: 'L2 Regularization (Ridge Regression)'
disqus: hackmd
---
L2 Regularization (Ridge Regression)
===
## Table of Contents
[TOC]
Ridge Regression
---
### Starting from L1 (Lasso)
We noted earlier that with plain OLS, a linear regression model runs into:
1. Collinearity (multicollinearity)
1. **Overfitting** in high-dimensional, small-sample settings (p close to or larger than n)
L1 (Lasso) attacks this by "using fewer weights", penalizing
$$
|w|_1=\sum_j |w_j|
$$
But L1 has its own drawbacks:
1. It is aggressive on collinear features (which does make it useful for feature selection): if $x_1 \approx x_2$, L1 tends to **keep only one** and zero the other out entirely
1. The model is "non-smooth": the L1 penalty $|w|$ has a kink at 0, so the objective is not differentiable there and the solution is sensitive to small perturbations of the data
This is where the second approach comes in: switch to L2 (Ridge).
### L2 Regularization (Ridge Regression)
L2 is designed a little differently from L1, but it addresses similar problems.
The difference: L2 does not ask "should this feature be kept at all?" but rather "am I leaning on this feature too heavily?"
#### The L2 penalty
Use squares instead:
$$
|w|_2^2=\sum_j w_j^2
$$
Why squares?
* Large weights → the penalty grows rapidly
* Small weights → barely penalized at all
* The whole penalty is **smooth, continuous, and differentiable**
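A quick numeric illustration of these three points (a minimal sketch; the weight vectors `w_small` and `w_large` are made-up examples):
```python=
import numpy as np
w_small = np.array([0.05, -0.02, 0.01])  # small weights
w_large = np.array([5.0, -3.0, 4.0])     # large weights
for name, w in [("small", w_small), ("large", w_large)]:
    l1 = np.sum(np.abs(w))  # L1 penalty |w|_1
    l2 = np.sum(w**2)       # L2 penalty |w|_2^2
    print(f"{name}: L1 = {l1:.4f}, L2 = {l2:.4f}")
# For small weights the squared penalty is almost negligible;
# for large weights it grows much faster than the L1 penalty.
```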
Like L1, L2 has two equivalent formulations, constrained and penalized; **the difference is that the L2 regularizer is smooth and differentiable**, so it can be solved directly with a closed form or standard gradient methods, without resorting to ISTA or proximal gradient as L1 requires.
**The Ridge loss is "a quadratic plus a quadratic", so the whole objective is still a smooth convex quadratic.**
That is:
$$
\underbrace{|y-Xw|_2^2}_{\text{quadratic}}
+
\underbrace{\lambda|w|_2^2}_{\text{quadratic}}
$$
As a result:
* everything is **differentiable**
* the gradient is **linear** in $w$
### The Ridge closed-form solution
The most common L2-regularized objective is:
$$
J(w) = |y-Xw|_2^2+\lambda |w|_2^2
$$
where:
* $|y-Xw|_2^2$: the data error (SSE)
* $|w|_2^2 = w^\top w$: the "energy" of the coefficients (keep the weights from growing too large)
* $\lambda \ge 0$: the regularization strength
At its core, Ridge simply takes the OLS gradient of the SSE with respect to $w$ and adds the gradient of the regularization term on top.
In math:
$$
\nabla_w \mathrm{SSE}(w)
\quad+\quad
\nabla_w (\lambda |w|_2^2)
$$
and then sets the whole expression to 0.
#### Differentiating the SSE term
Expand:
$$
SSE = (\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw})
$$
First expand the product algebraically (recall this is an inner product):
$$
(\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw})
=\mathbf{y}^\top \mathbf{y}- \mathbf{y}^\top \mathbf{Xw}- (\mathbf{Xw})^\top \mathbf{y}+ (\mathbf{Xw})^\top (\mathbf{Xw})
$$
Since both cross terms are scalars, $\mathbf{y}^\top \mathbf{Xw} = (\mathbf{Xw})^\top \mathbf{y}$, so this simplifies to
$$
=\mathbf{y}^\top \mathbf{y} - 2\mathbf{y}^\top \mathbf{Xw} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{Xw}
$$
Take the derivative with respect to $\mathbf{w}$:
$$
\nabla_w SSE = \nabla_w \left(
\mathbf{y}^\top \mathbf{y} - 2\mathbf{y}^\top \mathbf{Xw} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{Xw}
\right)
$$
Term by term:
* $\mathbf{y}^\top \mathbf{y}$ is a constant, so its derivative is 0
* the derivative of $-2\mathbf{y}^\top \mathbf{Xw}$ is $-2\mathbf{X}^\top \mathbf{y}$
* $\mathbf{w}^\top \mathbf{X}^\top \mathbf{Xw}$ is a quadratic form (with symmetric $\mathbf{X}^\top\mathbf{X}$), so its derivative is $2\mathbf{X}^\top \mathbf{Xw}$
So:
$$
\nabla_wSSE = -2\mathbf{X}^\top \mathbf{y} + 2\mathbf{X}^\top \mathbf{Xw}
$$
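This formula can be sanity-checked with finite differences (a minimal sketch; `X`, `y`, and `w` below are random placeholders, not the data used elsewhere in this note):
```python=
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w = rng.normal(size=4)
sse = lambda v: np.sum((y - X @ v)**2)
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ w
# central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([(sse(w + eps*e) - sse(w - eps*e)) / (2*eps) for e in np.eye(4)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # expect True
```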
#### Differentiating the regularization term
The regularizer is $\lambda w^\top w$.
By the same reasoning, the gradient of $w^\top w$ is $2w$, so:
$$
\nabla_w(\lambda w^\top w)=2\lambda w
$$
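Element-wise, this is just the derivative of a sum of squares:
$$
\frac{\partial}{\partial w_k}\, w^\top w
=\frac{\partial}{\partial w_k}\sum_j w_j^2
=2w_k
\quad\Rightarrow\quad
\nabla_w (w^\top w) = 2w
$$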
#### Putting it together
$$
\nabla_w J(w)=
\underbrace{(-2X^\top y + 2X^\top X w)}_{\text{from SSE}}
+
\underbrace{2\lambda w}_{\text{from L2}}
$$
Set the gradient to 0:
$$
-2X^\top y + 2X^\top Xw + 2\lambda w = 0
$$
Divide both sides by 2:
$$
-X^\top y + X^\top Xw + \lambda w = 0
$$
Rearrange:
$$
X^\top Xw + \lambda w = X^\top y
$$
Factor out $w$:
$$
(X^\top X + \lambda I)w = X^\top y
$$
If $(X^\top X + \lambda I)$ is invertible (and any $\lambda>0$ in fact guarantees that it is), then:
$$
w= (X^\top X + \lambda I)^{-1}X^\top y
$$
This is the Ridge closed-form solution.
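A short justification of that invertibility claim: $X^\top X$ is positive semidefinite, so its eigenvalues $\mu_j$ are all $\ge 0$, and
$$
\text{eigenvalues of } (X^\top X + \lambda I) = \mu_j + \lambda \ge \lambda > 0
$$
so every eigenvalue is strictly positive and the matrix is invertible, even when $X^\top X$ itself is singular (perfect collinearity, or $p > n$).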
### The Ridge SGD update rule
The full objective (this time with the squared error averaged over $n$):
$$
J(w)=
\frac{1}{n}| y - Xw|_2^2
+
\lambda |w|_2^2
$$
$$
J(w)=\frac{1}{n}\Big(y^\top y-2w^\top X^\top y + w^\top X^\top Xw\Big)+\lambda w^\top w
$$
Following the derivation above (now carrying the $\frac{1}{n}$ factor), the gradient is
$$
\nabla_w J(w)=
\frac{2}{n}X^\top(Xw - y)
+
2\lambda w
$$
The gradient-descent update is
$$
w \leftarrow
w-\eta
\left(
\frac{2}{n}X^\top(Xw - y)
+
2\lambda w
\right)
$$
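The update above is the full-batch gradient step (and the example code later in this note also uses full batches). For a genuinely *stochastic* (mini-batch) version, the same formula is applied to a random batch at each step; a minimal sketch, where `batch_size`, `lr`, and `lam` are illustrative choices:
```python=
import numpy as np
def ridge_sgd(X, y, lr=0.01, lam=0.1, epochs=50, batch_size=16, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]  # indices of this mini-batch
            Xb, yb = X[b], y[b]
            m = len(b)
            grad = (2/m) * Xb.T @ (Xb @ w - yb) + 2 * lam * w
            w -= lr * grad
    return w
```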
NumPy example (closed form)
---
Generate a regression dataset: $y = Xw_{\text{true}} + b + \text{noise}$. (This first example uses three independent features; the constrained and SGD examples below deliberately make two features **correlated** to simulate collinearity.)
```python=
import numpy as np
rng = np.random.default_rng(42)
n = 100
d = 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 1.5
noise = 0.3 * rng.normal(size=n)
y = X @ true_w + true_b + noise
```
Define the closed-form solver, fit it, and compare with the OLS result:
```python=
def ridge_closed_form(X, y, lam=1.0):
"""
Ridge closed-form solution (no regularization on intercept)
X: (n, d)
y: (n,)
lam: lambda (L2 strength)
return:
w: (d,)
b: scalar
"""
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float).reshape(-1, 1)
# center X and y
X_mean = X.mean(axis=0, keepdims=True)
y_mean = y.mean(axis=0, keepdims=True)
Xc = X - X_mean
yc = y - y_mean
n, d = Xc.shape
# (X^T X + λI) w = X^T y
A = Xc.T @ Xc + lam * np.eye(d)
b_vec = Xc.T @ yc
w = np.linalg.solve(A, b_vec) # (d, 1)
b = (y_mean - X_mean @ w).item() # intercept
return w.ravel(), b
w_hat, b_hat = ridge_closed_form(X, y, lam=1.0)
print("Estimated w:", w_hat)
print("Estimated b:", b_hat)
print("\nTrue w:", true_w)
print("True b:", true_b)
w_ols, b_ols = ridge_closed_form(X, y, lam=0.0)
print("OLS w:", w_ols)
print("OLS b:", b_ols)
```
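As a cross-check (assuming scikit-learn is available), `sklearn.linear_model.Ridge` with `alpha=1.0` minimizes the same penalized objective and leaves the intercept unpenalized, so its result should essentially match `ridge_closed_form(X, y, lam=1.0)`:
```python=
from sklearn.linear_model import Ridge
sk = Ridge(alpha=1.0, fit_intercept=True)
sk.fit(X, y)
print("sklearn coef:     ", sk.coef_)       # ~ w_hat
print("sklearn intercept:", sk.intercept_)  # ~ b_hat
```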
NumPy example (constrained)
---
**The algorithm here solves the constrained problem:**
$$
\min_W \ |y - XW|_2^2 \quad \text{s.t. } |W|_2 \le C
$$
:::success
**In the constrained problem:**
$$
|W|_2 \le C
\quad \text{and} \quad
|W|_2^2 \le \tilde C
$$
**make no essential difference to where the solution lands**; they merely redefine the numeric value of C (squaring it or taking a square root).
And this is not specific to L2; it is really **a whole family**:
$$
\min_W \ \text{Loss}(W)
\quad \text{s.t. } |W|_p \le C
$$
| p | Name | Geometry |
| --- | -------- | ---- |
| p=1 | Lasso | diamond |
| p=2 | Ridge | circle |
| p=∞ | Max-norm | square |
**This entire unified language is built on the notion of a "norm".**
The Ridge **penalty term** is nevertheless written with the square, **because that makes differentiation and the closed-form algebra convenient**:
$$
\min_W \ |y - XW|_2^2 + \lambda |W|_2^2
$$
:::
The procedure is:
1. Compute **OLS** first
2. If the OLS solution already lies inside the L2 ball → done
3. If not → **use the KKT conditions plus a one-dimensional search over λ**
4. Find the **λ** for which
$$
| (X^TX + \lambda I)^{-1} X^T y |_2 =|W|_2= C
$$
holds
5. The solution at that λ is the constrained Ridge solution
:::info
The KKT (Karush–Kuhn–Tucker) conditions are the necessary (and sometimes sufficient) conditions for deciding whether a point is optimal in a constrained optimization problem.
#### Stationarity (the familiar derivative condition)
$$
(X^TX + \lambda I)W = X^Ty
$$
which is exactly the **Ridge-like solution** seen above
#### Primal feasibility (the original constraint must hold)
$$
|W|_2 \le C
$$
#### Dual feasibility (λ cannot be arbitrary)
$$
\lambda \ge 0
$$
#### Complementary slackness
$$
\lambda(|W|_2 - C) = 0
$$
* constraint inactive → $\lambda = 0$ (the OLS solution already satisfies the constraint)
* constraint active → $|W|_2 = C$
**The two factors can never both be nonzero at the same time.**
:::
In other words, λ is **not given in advance**; it is implied by C (in the constrained formulation, C is the quantity we choose ourselves). We keep adjusting λ until the weights shrink to exactly the prescribed size C.
Since there is no simple closed form for this λ, the code uses bisection: the lower bound of λ is always 0 (the OLS starting point), the upper bound starts at 1 and is doubled (exponential expansion) whenever it proves too small, and the search continues until
$$
| (X^TX + \lambda I)^{-1} X^T y |_2 =|W|_2= C
$$
at which point the appropriate λ has been found.
:::success
#### Note: run the code below in a Colab cell to download the LaTeX resources needed for LaTeX rendering inside matplotlib figures
```python=
!apt update
!apt install -y cm-super dvipng texlive-latex-extra texlive-fonts-recommended
```
On a local Windows machine:
* Install MiKTeX (a full installation is recommended)
* Install Ghostscript (required)
* After installation, add the directories containing latex and dvipng to the PATH environment variable
:::
```python=
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
# Enable LaTeX rendering
mpl.rcParams["text.usetex"] = True
mpl.rcParams["text.latex.preamble"] = r"\usepackage{amsmath}"
# Utility: MSE
def mse(y_true, y_pred):
y_true = np.asarray(y_true).ravel()
y_pred = np.asarray(y_pred).ravel()
return np.mean((y_true - y_pred)**2)
# Generate synthetic regression data (2 features)
rng = np.random.default_rng(0)
n = 80
x1 = rng.normal(0, 1, size=n)
eps = rng.normal(0, 1, size=n)
rho = 0.95
x2 = rho * x1 + np.sqrt(1 - rho**2) * eps
X = np.column_stack([x1, x2]) # (n,2)
true_w = np.array([2.5, -1.8])
true_b = 1.2
noise = rng.normal(0, 0.6, size=n)
y = X @ true_w + true_b + noise
X_mean = X.mean(axis=0, keepdims=True)
y_mean = y.mean()
Xc = X - X_mean # (n,2)
yc = y - y_mean # (n,)
# OLS closed-form in centered space
XtX = Xc.T @ Xc
Xty = Xc.T @ yc
W_ols = np.linalg.solve(XtX, Xty)  # (2,)
#Constrained Ridge (L2 ball)
def w_of_lam(lam, XtX, Xty):
A = XtX + lam * np.eye(XtX.shape[0])
return np.linalg.solve(A, Xty)
def constrained_ridge_l2(XtX, Xty, C, tol=1e-10, max_iter=200):
w0 = w_of_lam(0.0, XtX, Xty)
norm0 = np.linalg.norm(w0)
if norm0 <= C + tol:
return w0, 0.0 # λ=0
# need λ>0
lam_lo = 0.0
lam_hi = 1.0
# increase lam_hi until ||w(lam_hi)|| <= C
for _ in range(200):
wh = w_of_lam(lam_hi, XtX, Xty)
if np.linalg.norm(wh) <= C:
break
lam_hi *= 2.0
# bisection
for _ in range(max_iter):
lam_mid = 0.5 * (lam_lo + lam_hi)
wm = w_of_lam(lam_mid, XtX, Xty)
nm = np.linalg.norm(wm)
if abs(nm - C) <= tol:
return wm, lam_mid
if nm > C:
lam_lo = lam_mid
else:
lam_hi = lam_mid
lam_final = 0.5 * (lam_lo + lam_hi)
return w_of_lam(lam_final, XtX, Xty), lam_final
C = 1.0
W_ridge, lam_star = constrained_ridge_l2(XtX, Xty, C=C)
mse_ols = mse(yc, Xc @ W_ols)
mse_ridge = mse(yc, Xc @ W_ridge)
#Build MSE grid over (w1, w2) for contour plot
pad = 1.8
w1_min = min(-C*pad, W_ols[0] - pad, W_ridge[0] - pad)
w1_max = max( C*pad, W_ols[0] + pad, W_ridge[0] + pad)
w2_min = min(-C*pad, W_ols[1] - pad, W_ridge[1] - pad)
w2_max = max( C*pad, W_ols[1] + pad, W_ridge[1] + pad)
w1 = np.linspace(w1_min, w1_max, 320)
w2 = np.linspace(w2_min, w2_max, 320)
W1, W2 = np.meshgrid(w1, w2)
MSE_grid = np.empty_like(W1)
for i in range(W1.shape[0]):
W_stack = np.stack([W1[i], W2[i]], axis=1) # (len(w1),2)
Y_hat = Xc @ W_stack.T # (n, len(w1))
# mse per column
MSE_grid[i] = np.mean((yc.reshape(-1,1) - Y_hat)**2, axis=0)
# Plot (cleaner layout, LaTeX-friendly)
mpl.rcParams.update({
"font.size": 14,
"axes.titlesize": 16,
"axes.labelsize": 14,
"legend.fontsize": 12,
})
fig, axs = plt.subplots(
1, 2, figsize=(16, 7),
gridspec_kw={"width_ratios": [1.25, 1.0]}
)
ax = axs[0]
# MSE contours
levels = np.percentile(MSE_grid, [10, 20, 35, 50, 70])
ax.contour(W1, W2, MSE_grid, levels=levels, colors="gray", linewidths=1.2)
# L2 feasible region circle
theta = np.linspace(0, 2*np.pi, 400)
circle_x = C * np.cos(theta)
circle_y = C * np.sin(theta)
ax.plot(circle_x, circle_y, color="tab:blue", linewidth=3.2,
label=r"Feasible set: $\|W\|_2 \le C$")
ax.fill(circle_x, circle_y, color="tab:blue", alpha=0.18)
# Axes
ax.axhline(0, lw=1.2, color="tab:blue", alpha=0.9)
ax.axvline(0, lw=1.2, color="tab:blue", alpha=0.9)
# Solutions
ax.scatter(W_ols[0], W_ols[1], s=110, color="tab:orange", zorder=6, label="OLS solution")
ax.scatter(W_ridge[0], W_ridge[1], s=160, marker="X", color="tab:green", zorder=7,
label="Ridge constrained solution")
# Annotation style: white box to avoid overlap
bbox_kw = dict(boxstyle="round,pad=0.25", fc="white", ec="none", alpha=0.9)
arrow_kw = dict(arrowstyle="->", lw=1.5)
ax.annotate(
rf"OLS" "\n" rf"$W=({W_ols[0]:.2f},{W_ols[1]:.2f})$",
xy=(W_ols[0], W_ols[1]),
xytext=(W_ols[0]+0.9, W_ols[1]-1.0),
arrowprops=arrow_kw,
bbox=bbox_kw
)
ax.annotate(
rf"Ridge" "\n" rf"$W=({W_ridge[0]:.2f},{W_ridge[1]:.2f})$",
xy=(W_ridge[0], W_ridge[1]),
xytext=(W_ridge[0]-1.9, W_ridge[1]+0.9),
arrowprops=arrow_kw,
bbox=bbox_kw
)
# Title / labels
ax.set_title(r"Constrained Ridge (strict geometry)", pad=12)
ax.set_xlabel(r"$w_1$")
ax.set_ylabel(r"$w_2$", rotation=0, labelpad=20)
x_min = min(-1.4*C, W_ols[0]-0.8)
x_max = max( 1.4*C, W_ols[0]+1.2)
y_min = min(-1.4*C, W_ols[1]-0.8)
y_max = max( 1.4*C, W_ols[1]+1.2)
ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max)
ax.set_aspect("equal", adjustable="box")
ax.grid(True, alpha=0.18)
# Move legend to cleaner spot
ax.legend(loc="upper right", frameon=True, framealpha=0.95)
# Right: math explanation (cleaner typography)
axs[1].axis("off")
text = (
r"\textbf{Constrained Ridge}" "\n\n"
r"$\min_{W}\ \mathrm{MSE}(W)$" "\n"
r"$\mathrm{s.t.}\ \|W\|_2 \le C$" "\n\n"
r"\textbf{Definitions}" "\n"
r"$W=(w_1,w_2)$" "\n"
r"$\|W\|_2=\sqrt{w_1^2+w_2^2}$" "\n"
r"$\mathrm{MSE}(W)=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2$" "\n"
r"$\hat y = XW + b$" "\n"
r"(centered data $\Rightarrow$ $b$ is not penalized)" "\n\n"
r"\textbf{Interpretation}" "\n"
r"Contours are equal-MSE level sets." "\n"
r"Optimal constrained point is the" "\n"
r"first tangent point on the L2 ball." "\n\n"
r"\textbf{Results}" "\n"
rf"$C={C:.2f}$" "\n"
rf"$\|W_{{OLS}}\|_2={np.linalg.norm(W_ols):.3f}$" "\n"
rf"$\lambda^{{*}}={lam_star:.3g}$" "\n"
rf"$W_{{OLS}}=({W_ols[0]:.2f},{W_ols[1]:.2f})$" "\n"
rf"$W_{{Ridge}}=({W_ridge[0]:.2f},{W_ridge[1]:.2f})$" "\n"
rf"$\mathrm{{MSE}}(W_{{OLS}})={mse_ols:.3f}$" "\n"
rf"$\mathrm{{MSE}}(W_{{Ridge}})={mse_ridge:.3f}$"
)
axs[1].text(
0.03, 0.97, text,
fontsize=18, va="top", ha="left",
linespacing=1.35
)
fig.subplots_adjust(left=0.05, right=0.98, top=0.90, bottom=0.08, wspace=0.18)
plt.show()
```

NumPy example (SGD)
---
```python=
import numpy as np
import matplotlib.pyplot as plt
def mse_vec(y_true, y_pred):
return np.mean((y_true - y_pred)**2)
rng = np.random.default_rng(42)
n = 80
x1 = rng.normal(0, 1, size=n)
eps = rng.normal(0, 1, size=n)
rho = 0.95
x2 = rho * x1 + np.sqrt(1 - rho**2) * eps
X = np.column_stack([x1, x2]) # (n,2)
true_w = np.array([2.5, -1.8])
true_b = 1.2
noise = rng.normal(0, 0.6, size=n)
y = X @ true_w + true_b + noise
X_mean = X.mean(axis=0, keepdims=True)
y_mean = y.mean()
Xc = X - X_mean # centered X
yc = y - y_mean # centered y
#OLS closed-form in centered space
XtX = Xc.T @ Xc
Xty = Xc.T @ yc
W_ols = np.linalg.solve(XtX, Xty)
#Ridge (L2) SGD in centered space: minimize
def ridge_sgd_centered_with_loss(
Xc, yc, lr=0.05, lam=0.25, epochs=120, seed=1, shuffle=False
):
rng = np.random.default_rng(seed)
n, d = Xc.shape
w = np.zeros(d)
w_hist = [w.copy()]
loss_hist = []
for ep in range(epochs):
        # Full-batch updates normally do not need shuffling; set shuffle=True to reshuffle every epoch anyway
if shuffle:
idx = rng.permutation(n)
X_use = Xc[idx]
y_use = yc[idx]
else:
X_use = Xc
y_use = yc
# prediction (full batch)
yhat = X_use @ w
e = y_use - yhat
# gradients (full batch)
        grad_mse = -(2/n) * (X_use.T @ e)  # equivalent to (2/n) X^T (Xw - y)
grad_reg = 2 * lam * w
grad = grad_mse + grad_reg
# update (one update per epoch)
w = w - lr * grad
        # record loss on the full data (evaluate on the original Xc, yc for consistency)
mse_part = np.mean((yc - Xc @ w)**2)
reg_part = lam * np.sum(w**2)
loss = mse_part + reg_part
w_hist.append(w.copy())
loss_hist.append(loss)
return w, np.array(w_hist), np.array(loss_hist)
lam = 0.25
W_ridge_sgd, W_hist, loss_hist = ridge_sgd_centered_with_loss(
Xc, yc, lr=0.05, lam=lam, epochs=120, seed=1, shuffle=False
)
plt.figure(figsize=(7, 4))
plt.plot(loss_hist, lw=2)
plt.xlabel("Epoch")
plt.ylabel("Objective value")
plt.title("Ridge (L2) SGD training loss\nMSE + λ||W||²")
plt.grid(alpha=0.3)
plt.show()
#Build MSE contour grid over (w1, w2)
pad = 1.8
w1_min = min(W_ols[0], W_ridge_sgd[0]) - 3.0
w1_max = max(W_ols[0], W_ridge_sgd[0]) + 3.0
w2_min = min(W_ols[1], W_ridge_sgd[1]) - 3.0
w2_max = max(W_ols[1], W_ridge_sgd[1]) + 3.0
w1 = np.linspace(w1_min, w1_max, 320)
w2 = np.linspace(w2_min, w2_max, 320)
W1, W2 = np.meshgrid(w1, w2)
MSE_grid = np.empty_like(W1)
for i in range(W1.shape[0]):
W_stack = np.stack([W1[i], W2[i]], axis=1) # (len(w1),2)
Yhat = Xc @ W_stack.T # (n, len(w1))
MSE_grid[i] = np.mean((yc.reshape(-1,1) - Yhat)**2, axis=0)
#Plot like your figure:OLS + Ridge(SGD)
plt.figure(figsize=(7, 7))
levels = np.percentile(MSE_grid, [8, 15, 25, 40, 60, 80])
plt.contour(W1, W2, MSE_grid, levels=levels, colors="gray", linewidths=1.2)
# axis cross lines (like the blue cross in your image)
plt.axhline(0, color="tab:blue", lw=1.5, alpha=0.9)
plt.axvline(0, color="tab:blue", lw=1.5, alpha=0.9)
# points
plt.scatter(W_ols[0], W_ols[1], s=80, color="tab:blue", label="OLS (min MSE only)")
plt.scatter(W_ridge_sgd[0], W_ridge_sgd[1], s=110, marker="x", color="tab:orange", label="Ridge (penalized, SGD)")
# (optional) training trajectory (comment out if you want cleaner)
plt.plot(W_hist[:,0], W_hist[:,1], color="tab:orange", alpha=0.35, lw=1.2)
bbox_kw = dict(boxstyle="round,pad=0.25", fc="white", ec="none", alpha=0.9)
arrow_kw = dict(arrowstyle="->", lw=1.5)
# annotations (simple, readable)
plt.annotate(
f"OLS\nW=({W_ols[0]:.2f},{W_ols[1]:.2f})",
xy=(W_ols[0], W_ols[1]),
xytext=(W_ols[0]+0.6, W_ols[1]-1.0),
arrowprops=arrow_kw,
bbox=bbox_kw
)
plt.annotate(
f"Ridge-SGD\nW=({W_ridge_sgd[0]:.2f},{W_ridge_sgd[1]:.2f})\n||W||₂={np.linalg.norm(W_ridge_sgd):.2f}",
xy=(W_ridge_sgd[0], W_ridge_sgd[1]),
xytext=(W_ridge_sgd[0]-3.0, W_ridge_sgd[1]+1.2),
arrowprops=arrow_kw,
bbox=bbox_kw
)
plt.title("Penalized Ridge training (SGD): shown on MSE contours")
plt.xlabel("w1")
plt.ylabel("w2")
plt.xlim(w1_min, w1_max)
plt.ylim(w2_min, w2_max)
plt.grid(True, alpha=0.2)
plt.legend(loc="upper right")
plt.gca().set_aspect("equal", adjustable="box")
plt.show()
```
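To verify the result (run after the block above, reusing its variables): because this objective divides the squared error by $n$, setting the gradient to zero gives $(X^\top X + n\lambda I)w = X^\top y$, and the gradient-descent weights should approach this closed form as training converges:
```python=
import numpy as np
# closed form for the 1/n-scaled objective: (X^T X + n*lam*I) w = X^T y
n_samples = Xc.shape[0]
W_closed = np.linalg.solve(XtX + n_samples * lam * np.eye(XtX.shape[0]), Xty)
print("closed form:", W_closed)
print("SGD result :", W_ridge_sgd)
print("difference :", np.linalg.norm(W_closed - W_ridge_sgd))
```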


scikit-learn example
---
```python=
import numpy as np
from sklearn.linear_model import Ridge
np.random.seed(42)
n = 80
X = np.random.randn(n, 2)
# introduce collinearity
X[:, 1] = 0.85 * X[:, 0] + 0.35 * X[:, 1]
W_true = np.array([2.0, -1.0])
y = X @ W_true + 0.8 * np.random.randn(n)
print("X shape:", X.shape, "y shape:", y.shape)
print("W_true:", W_true)
# alpha corresponds to the regularization strength
ridge = Ridge(
    alpha=0.25,          # L2 regularization strength (not a constraint radius)
    fit_intercept=False  # no intercept term (the data above is generated without one)
)
ridge.fit(X, y)
W_ridge = ridge.coef_
print("Ridge coefficients:", W_ridge)
print("||W||_2 =", np.linalg.norm(W_ridge))
```
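For reference (reusing the variables from the block above): with `fit_intercept=False` there is no centering involved, so the same coefficients can be recovered directly from the Ridge normal equations:
```python=
import numpy as np
# manual closed form: (X^T X + alpha*I) w = X^T y
W_manual = np.linalg.solve(X.T @ X + 0.25 * np.eye(2), X.T @ y)
print("manual closed form:", W_manual)  # should match ridge.coef_
```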