---
title: 'L2 Regularization (Ridge Regression)'
disqus: hackmd
---
L2 Regularization (Ridge Regression)
===
## Table of Contents
[TOC]
Ridge Regression
---
### Starting from L1 (Lasso)
We noted earlier that with plain OLS, a linear regression model runs into:
1. Collinearity (multicollinearity)
1. **Overfitting** in high-dimensional, small-sample settings (p close to or larger than n)
L1 (Lasso) attacks this by "using fewer weights", penalizing
$$
|w|_1=\sum_j |w_j|
$$
But L1 has its own drawbacks:
1. It is aggressive on collinear features (which does make it useful for feature selection): if $x_1 \approx x_2$, L1 tends to **keep only one** and zero the other out entirely
1. The model is "non-smooth": the L1 penalty $|w|$ has a kink at 0, so the objective is not differentiable there and the solution is sensitive to small perturbations of the data
This is where the second approach comes in: switch to L2 (Ridge).
### L2 Regularization (Ridge Regression)
L2 is designed a little differently from L1, but it addresses similar problems.
The difference: L2 does not ask "should this feature be kept at all?" but rather "am I leaning on this feature too heavily?"
#### The L2 penalty
Use squares instead:
$$
|w|_2^2=\sum_j w_j^2
$$
Why squares?
* Large weights → the penalty grows rapidly
* Small weights → barely penalized at all
* The whole penalty is **smooth, continuous, and differentiable**
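A quick numeric illustration of these three points (a minimal sketch; the weight vectors `w_small` and `w_large` are made-up examples):
```python=
import numpy as np
w_small = np.array([0.05, -0.02, 0.01])  # small weights
w_large = np.array([5.0, -3.0, 4.0])     # large weights
for name, w in [("small", w_small), ("large", w_large)]:
    l1 = np.sum(np.abs(w))  # L1 penalty |w|_1
    l2 = np.sum(w**2)       # L2 penalty |w|_2^2
    print(f"{name}: L1 = {l1:.4f}, L2 = {l2:.4f}")
# For small weights the squared penalty is almost negligible;
# for large weights it grows much faster than the L1 penalty.
```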
Like L1, L2 has two equivalent formulations, constrained and penalized; **the difference is that the L2 regularizer is smooth and differentiable**, so it can be solved directly with a closed form or standard gradient methods, without resorting to ISTA or proximal gradient as L1 requires.
**The Ridge loss is "a quadratic plus a quadratic", so the whole objective is still a smooth convex quadratic.**
That is:
$$
\underbrace{|y-Xw|_2^2}_{\text{quadratic}}
+
\underbrace{\lambda|w|_2^2}_{\text{quadratic}}
$$
As a result:
* everything is **differentiable**
* the gradient is **linear** in $w$
### The Ridge closed-form solution
The most common L2-regularized objective is:
$$
J(w) = |y-Xw|_2^2+\lambda |w|_2^2
$$
where:
* $|y-Xw|_2^2$: the data error (SSE)
* $|w|_2^2 = w^\top w$: the "energy" of the coefficients (keep the weights from growing too large)
* $\lambda \ge 0$: the regularization strength
At its core, Ridge simply takes the OLS gradient of the SSE with respect to $w$ and adds the gradient of the regularization term on top.
In math:
$$
\nabla_w \mathrm{SSE}(w)
\quad+\quad
\nabla_w (\lambda |w|_2^2)
$$
and then sets the whole expression to 0.
#### Differentiating the SSE term
Expand:
$$
SSE = (\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw})
$$
First expand the product algebraically (recall this is an inner product):
$$
(\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw})
=\mathbf{y}^\top \mathbf{y}- \mathbf{y}^\top \mathbf{Xw}- (\mathbf{Xw})^\top \mathbf{y}+ (\mathbf{Xw})^\top (\mathbf{Xw})
$$
Since both cross terms are scalars, $\mathbf{y}^\top \mathbf{Xw} = (\mathbf{Xw})^\top \mathbf{y}$, so this simplifies to
$$
=\mathbf{y}^\top \mathbf{y} - 2\mathbf{y}^\top \mathbf{Xw} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{Xw}
$$
Take the derivative with respect to $\mathbf{w}$:
$$
\nabla_w SSE = \nabla_w \left(
\mathbf{y}^\top \mathbf{y} - 2\mathbf{y}^\top \mathbf{Xw} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{Xw}
\right)
$$
Term by term:
* $\mathbf{y}^\top \mathbf{y}$ is a constant, so its derivative is 0
* the derivative of $-2\mathbf{y}^\top \mathbf{Xw}$ is $-2\mathbf{X}^\top \mathbf{y}$
* $\mathbf{w}^\top \mathbf{X}^\top \mathbf{Xw}$ is a quadratic form (with symmetric $\mathbf{X}^\top\mathbf{X}$), so its derivative is $2\mathbf{X}^\top \mathbf{Xw}$
So:
$$
\nabla_wSSE = -2\mathbf{X}^\top \mathbf{y} + 2\mathbf{X}^\top \mathbf{Xw}
$$
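This formula can be sanity-checked with finite differences (a minimal sketch; `X`, `y`, and `w` below are random placeholders, not the data used elsewhere in this note):
```python=
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w = rng.normal(size=4)
sse = lambda v: np.sum((y - X @ v)**2)
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ w
# central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([(sse(w + eps*e) - sse(w - eps*e)) / (2*eps) for e in np.eye(4)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # expect True
```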
#### Differentiating the regularization term
The regularizer is $\lambda w^\top w$.
By the same reasoning, the gradient of $w^\top w$ is $2w$, so:
$$
\nabla_w(\lambda w^\top w)=2\lambda w
$$
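Element-wise, this is just the derivative of a sum of squares:
$$
\frac{\partial}{\partial w_k}\, w^\top w
=\frac{\partial}{\partial w_k}\sum_j w_j^2
=2w_k
\quad\Rightarrow\quad
\nabla_w (w^\top w) = 2w
$$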
#### Putting it together
$$
\nabla_w J(w)=
\underbrace{(-2X^\top y + 2X^\top X w)}_{\text{from SSE}}
+
\underbrace{2\lambda w}_{\text{from L2}}
$$
Set the gradient to 0:
$$
-2X^\top y + 2X^\top Xw + 2\lambda w = 0
$$
Divide both sides by 2:
$$
-X^\top y + X^\top Xw + \lambda w = 0
$$
Rearrange:
$$
X^\top Xw + \lambda w = X^\top y
$$
Factor out $w$:
$$
(X^\top X + \lambda I)w = X^\top y
$$
If $(X^\top X + \lambda I)$ is invertible (and any $\lambda>0$ in fact guarantees that it is), then:
$$
w= (X^\top X + \lambda I)^{-1}X^\top y
$$
This is the Ridge closed-form solution.
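A short justification of that invertibility claim: $X^\top X$ is positive semidefinite, so its eigenvalues $\mu_j$ are all $\ge 0$, and
$$
\text{eigenvalues of } (X^\top X + \lambda I) = \mu_j + \lambda \ge \lambda > 0
$$
so every eigenvalue is strictly positive and the matrix is invertible, even when $X^\top X$ itself is singular (perfect collinearity, or $p > n$).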
### The Ridge SGD update rule
The full objective (this time with the squared error averaged over $n$):
$$
J(w)=
\frac{1}{n}| y - Xw|_2^2
+
\lambda |w|_2^2
$$
$$
J(w)=\frac{1}{n}\Big(y^\top y-2w^\top X^\top y + w^\top X^\top Xw\Big)+\lambda w^\top w
$$
Following the derivation above (now carrying the $\frac{1}{n}$ factor), the gradient is
$$
\nabla_w J(w)=
\frac{2}{n}X^\top(Xw - y)
+
2\lambda w
$$
The gradient-descent update is
$$
w \leftarrow
w-\eta
\left(
\frac{2}{n}X^\top(Xw - y)
+
2\lambda w
\right)
$$
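The update above is the full-batch gradient step (and the example code later in this note also uses full batches). For a genuinely *stochastic* (mini-batch) version, the same formula is applied to a random batch at each step; a minimal sketch, where `batch_size`, `lr`, and `lam` are illustrative choices:
```python=
import numpy as np
def ridge_sgd(X, y, lr=0.01, lam=0.1, epochs=50, batch_size=16, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]  # indices of this mini-batch
            Xb, yb = X[b], y[b]
            m = len(b)
            grad = (2/m) * Xb.T @ (Xb @ w - yb) + 2 * lam * w
            w -= lr * grad
    return w
```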
NumPy example (closed form)
---
Generate a regression dataset: $y = Xw_{\text{true}} + b + \text{noise}$. (This first example uses three independent features; the constrained and SGD examples below deliberately make two features **correlated** to simulate collinearity.)
```python=
import numpy as np
rng = np.random.default_rng(42)
n = 100
d = 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 1.5
noise = 0.3 * rng.normal(size=n)
y = X @ true_w + true_b + noise
```
Define the closed-form solver, fit it, and compare with the OLS result:
```python=
def ridge_closed_form(X, y, lam=1.0):
"""
Ridge closed-form solution (no regularization on intercept)
X: (n, d)
y: (n,)
lam: lambda (L2 strength)
return:
w: (d,)
b: scalar
"""
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float).reshape(-1, 1)
# center X and y
X_mean = X.mean(axis=0, keepdims=True)
y_mean = y.mean(axis=0, keepdims=True)
Xc = X - X_mean
yc = y - y_mean
n, d = Xc.shape
# (X^T X + λI) w = X^T y
A = Xc.T @ Xc + lam * np.eye(d)
b_vec = Xc.T @ yc
w = np.linalg.solve(A, b_vec) # (d, 1)
b = (y_mean - X_mean @ w).item() # intercept
return w.ravel(), b
w_hat, b_hat = ridge_closed_form(X, y, lam=1.0)
print("Estimated w:", w_hat)
print("Estimated b:", b_hat)
print("\nTrue w:", true_w)
print("True b:", true_b)
w_ols, b_ols = ridge_closed_form(X, y, lam=0.0)
print("OLS w:", w_ols)
print("OLS b:", b_ols)
```
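As a cross-check (assuming scikit-learn is available), `sklearn.linear_model.Ridge` with `alpha=1.0` minimizes the same penalized objective and leaves the intercept unpenalized, so its result should essentially match `ridge_closed_form(X, y, lam=1.0)`:
```python=
from sklearn.linear_model import Ridge
sk = Ridge(alpha=1.0, fit_intercept=True)
sk.fit(X, y)
print("sklearn coef:     ", sk.coef_)       # ~ w_hat
print("sklearn intercept:", sk.intercept_)  # ~ b_hat
```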
NumPy example (constrained)
---
**The algorithm here solves the constrained problem:**
$$
\min_W \ |y - XW|_2^2 \quad \text{s.t. } |W|_2 \le C
$$
:::success
**In the constrained problem:**
$$
|W|_2 \le C
\quad \text{and} \quad
|W|_2^2 \le \tilde C
$$
**make no essential difference to where the solution lands**; they merely redefine the numeric value of C (squaring it or taking a square root).
And this is not specific to L2; it is really **a whole family**:
$$
\min_W \ \text{Loss}(W)
\quad \text{s.t. } |W|_p \le C
$$
| p | Name | Geometry |
| --- | -------- | ---- |
| p=1 | Lasso | diamond |
| p=2 | Ridge | circle |
| p=∞ | Max-norm | square |
**This entire unified language is built on the notion of a "norm".**
The Ridge **penalty term** is nevertheless written with the square, **because that makes differentiation and the closed-form algebra convenient**:
$$
\min_W \ |y - XW|_2^2 + \lambda |W|_2^2
$$
:::
The procedure is:
1. Compute **OLS** first
2. If the OLS solution already lies inside the L2 ball → done
3. If not → **use the KKT conditions plus a one-dimensional search over λ**
4. Find the **λ** for which
$$
| (X^TX + \lambda I)^{-1} X^T y |_2 =|W|_2= C
$$
holds
5. The solution at that λ is the constrained Ridge solution
:::info
The KKT (Karush–Kuhn–Tucker) conditions are the necessary (and sometimes sufficient) conditions for deciding whether a point is optimal in a constrained optimization problem.
#### Stationarity (the familiar derivative condition)
$$
(X^TX + \lambda I)W = X^Ty
$$
which is exactly the **Ridge-like solution** seen above
#### Primal feasibility (the original constraint must hold)
$$
|W|_2 \le C
$$
#### Dual feasibility (λ cannot be arbitrary)
$$
\lambda \ge 0
$$
#### Complementary slackness
$$
\lambda(|W|_2 - C) = 0
$$
* constraint inactive → $\lambda = 0$ (the OLS solution already satisfies the constraint)
* constraint active → $|W|_2 = C$
**The two factors can never both be nonzero at the same time.**
:::
In other words, λ is **not given in advance**; it is implied by C (in the constrained formulation, C is the quantity we choose ourselves). We keep adjusting λ until the weights shrink to exactly the prescribed size C.
Since there is no simple closed form for this λ, the code uses bisection: the lower bound of λ is always 0 (the OLS starting point), the upper bound starts at 1 and is doubled (exponential expansion) whenever it proves too small, and the search continues until
$$
| (X^TX + \lambda I)^{-1} X^T y |_2 =|W|_2= C
$$
at which point the appropriate λ has been found.
:::success
#### Note: run the code below in a Colab cell to download the LaTeX resources needed for LaTeX rendering inside matplotlib figures
```python=
!apt update
!apt install -y cm-super dvipng texlive-latex-extra texlive-fonts-recommended
```
On a local Windows machine:
* Install MiKTeX (a full installation is recommended)
* Install Ghostscript (required)
* After installation, add the directories containing latex and dvipng to the PATH environment variable
:::
```python=
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
# Enable LaTeX rendering
mpl.rcParams["text.usetex"] = True
mpl.rcParams["text.latex.preamble"] = r"\usepackage{amsmath}"
# Utility: MSE
def mse(y_true, y_pred):
y_true = np.asarray(y_true).ravel()
y_pred = np.asarray(y_pred).ravel()
return np.mean((y_true - y_pred)**2)
# Generate synthetic regression data (2 features)
rng = np.random.default_rng(0)
n = 80
x1 = rng.normal(0, 1, size=n)
eps = rng.normal(0, 1, size=n)
rho = 0.95
x2 = rho * x1 + np.sqrt(1 - rho**2) * eps
X = np.column_stack([x1, x2]) # (n,2)
true_w = np.array([2.5, -1.8])
true_b = 1.2
noise = rng.normal(0, 0.6, size=n)
y = X @ true_w + true_b + noise
X_mean = X.mean(axis=0, keepdims=True)
y_mean = y.mean()
Xc = X - X_mean # (n,2)
yc = y - y_mean # (n,)
# OLS closed-form in centered space
XtX = Xc.T @ Xc
Xty = Xc.T @ yc
W_ols = np.linalg.solve(XtX, Xty)  # (2,)
#Constrained Ridge (L2 ball)
def w_of_lam(lam, XtX, Xty):
A = XtX + lam * np.eye(XtX.shape[0])
return np.linalg.solve(A, Xty)
def constrained_ridge_l2(XtX, Xty, C, tol=1e-10, max_iter=200):
w0 = w_of_lam(0.0, XtX, Xty)
norm0 = np.linalg.norm(w0)
if norm0 <= C + tol:
return w0, 0.0 # λ=0
# need λ>0
lam_lo = 0.0
lam_hi = 1.0
# increase lam_hi until ||w(lam_hi)|| <= C
for _ in range(200):
wh = w_of_lam(lam_hi, XtX, Xty)
if np.linalg.norm(wh) <= C:
break
lam_hi *= 2.0
# bisection
for _ in range(max_iter):
lam_mid = 0.5 * (lam_lo + lam_hi)
wm = w_of_lam(lam_mid, XtX, Xty)
nm = np.linalg.norm(wm)
if abs(nm - C) <= tol:
return wm, lam_mid
if nm > C:
lam_lo = lam_mid
else:
lam_hi = lam_mid
lam_final = 0.5 * (lam_lo + lam_hi)
return w_of_lam(lam_final, XtX, Xty), lam_final
C = 1.0
W_ridge, lam_star = constrained_ridge_l2(XtX, Xty, C=C)
mse_ols = mse(yc, Xc @ W_ols)
mse_ridge = mse(yc, Xc @ W_ridge)
#Build MSE grid over (w1, w2) for contour plot
pad = 1.8
w1_min = min(-C*pad, W_ols[0] - pad, W_ridge[0] - pad)
w1_max = max( C*pad, W_ols[0] + pad, W_ridge[0] + pad)
w2_min = min(-C*pad, W_ols[1] - pad, W_ridge[1] - pad)
w2_max = max( C*pad, W_ols[1] + pad, W_ridge[1] + pad)
w1 = np.linspace(w1_min, w1_max, 320)
w2 = np.linspace(w2_min, w2_max, 320)
W1, W2 = np.meshgrid(w1, w2)
MSE_grid = np.empty_like(W1)
for i in range(W1.shape[0]):
W_stack = np.stack([W1[i], W2[i]], axis=1) # (len(w1),2)
Y_hat = Xc @ W_stack.T # (n, len(w1))
# mse per column
MSE_grid[i] = np.mean((yc.reshape(-1,1) - Y_hat)**2, axis=0)
# Plot (cleaner layout, LaTeX-friendly)
mpl.rcParams.update({
"font.size": 14,
"axes.titlesize": 16,
"axes.labelsize": 14,
"legend.fontsize": 12,
})
fig, axs = plt.subplots(
1, 2, figsize=(16, 7),
gridspec_kw={"width_ratios": [1.25, 1.0]}
)
ax = axs[0]
# MSE contours
levels = np.percentile(MSE_grid, [10, 20, 35, 50, 70])
ax.contour(W1, W2, MSE_grid, levels=levels, colors="gray", linewidths=1.2)
# L2 feasible region circle
theta = np.linspace(0, 2*np.pi, 400)
circle_x = C * np.cos(theta)
circle_y = C * np.sin(theta)
ax.plot(circle_x, circle_y, color="tab:blue", linewidth=3.2,
label=r"Feasible set: $\|W\|_2 \le C$")
ax.fill(circle_x, circle_y, color="tab:blue", alpha=0.18)
# Axes
ax.axhline(0, lw=1.2, color="tab:blue", alpha=0.9)
ax.axvline(0, lw=1.2, color="tab:blue", alpha=0.9)
# Solutions
ax.scatter(W_ols[0], W_ols[1], s=110, color="tab:orange", zorder=6, label="OLS solution")
ax.scatter(W_ridge[0], W_ridge[1], s=160, marker="X", color="tab:green", zorder=7,
label="Ridge constrained solution")
# Annotation style: white box to avoid overlap
bbox_kw = dict(boxstyle="round,pad=0.25", fc="white", ec="none", alpha=0.9)
arrow_kw = dict(arrowstyle="->", lw=1.5)
ax.annotate(
rf"OLS" "\n" rf"$W=({W_ols[0]:.2f},{W_ols[1]:.2f})$",
xy=(W_ols[0], W_ols[1]),
xytext=(W_ols[0]+0.9, W_ols[1]-1.0),
arrowprops=arrow_kw,
bbox=bbox_kw
)
ax.annotate(
rf"Ridge" "\n" rf"$W=({W_ridge[0]:.2f},{W_ridge[1]:.2f})$",
xy=(W_ridge[0], W_ridge[1]),
xytext=(W_ridge[0]-1.9, W_ridge[1]+0.9),
arrowprops=arrow_kw,
bbox=bbox_kw
)
# Title / labels
ax.set_title(r"Constrained Ridge (strict geometry)", pad=12)
ax.set_xlabel(r"$w_1$")
ax.set_ylabel(r"$w_2$", rotation=0, labelpad=20)
x_min = min(-1.4*C, W_ols[0]-0.8)
x_max = max( 1.4*C, W_ols[0]+1.2)
y_min = min(-1.4*C, W_ols[1]-0.8)
y_max = max( 1.4*C, W_ols[1]+1.2)
ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max)
ax.set_aspect("equal", adjustable="box")
ax.grid(True, alpha=0.18)
# Move legend to cleaner spot
ax.legend(loc="upper right", frameon=True, framealpha=0.95)
# Right: math explanation (cleaner typography)
axs[1].axis("off")
text = (
r"\textbf{Constrained Ridge}" "\n\n"
r"$\min_{W}\ \mathrm{MSE}(W)$" "\n"
r"$\mathrm{s.t.}\ \|W\|_2 \le C$" "\n\n"
r"\textbf{Definitions}" "\n"
r"$W=(w_1,w_2)$" "\n"
r"$\|W\|_2=\sqrt{w_1^2+w_2^2}$" "\n"
r"$\mathrm{MSE}(W)=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2$" "\n"
r"$\hat y = XW + b$" "\n"
r"(centered data $\Rightarrow$ $b$ is not penalized)" "\n\n"
r"\textbf{Interpretation}" "\n"
r"Contours are equal-MSE level sets." "\n"
r"Optimal constrained point is the" "\n"
r"first tangent point on the L2 ball." "\n\n"
r"\textbf{Results}" "\n"
rf"$C={C:.2f}$" "\n"
rf"$\|W_{{OLS}}\|_2={np.linalg.norm(W_ols):.3f}$" "\n"
rf"$\lambda^{{*}}={lam_star:.3g}$" "\n"
rf"$W_{{OLS}}=({W_ols[0]:.2f},{W_ols[1]:.2f})$" "\n"
rf"$W_{{Ridge}}=({W_ridge[0]:.2f},{W_ridge[1]:.2f})$" "\n"
rf"$\mathrm{{MSE}}(W_{{OLS}})={mse_ols:.3f}$" "\n"
rf"$\mathrm{{MSE}}(W_{{Ridge}})={mse_ridge:.3f}$"
)
axs[1].text(
0.03, 0.97, text,
fontsize=18, va="top", ha="left",
linespacing=1.35
)
fig.subplots_adjust(left=0.05, right=0.98, top=0.90, bottom=0.08, wspace=0.18)
plt.show()
```

NumPy example (SGD)
---
```python=
import numpy as np
import matplotlib.pyplot as plt
def mse_vec(y_true, y_pred):
return np.mean((y_true - y_pred)**2)
rng = np.random.default_rng(42)
n = 80
x1 = rng.normal(0, 1, size=n)
eps = rng.normal(0, 1, size=n)
rho = 0.95
x2 = rho * x1 + np.sqrt(1 - rho**2) * eps
X = np.column_stack([x1, x2]) # (n,2)
true_w = np.array([2.5, -1.8])
true_b = 1.2
noise = rng.normal(0, 0.6, size=n)
y = X @ true_w + true_b + noise
X_mean = X.mean(axis=0, keepdims=True)
y_mean = y.mean()
Xc = X - X_mean # centered X
yc = y - y_mean # centered y
#OLS closed-form in centered space
XtX = Xc.T @ Xc
Xty = Xc.T @ yc
W_ols = np.linalg.solve(XtX, Xty)
#Ridge (L2) SGD in centered space: minimize
def ridge_sgd_centered_with_loss(
Xc, yc, lr=0.05, lam=0.25, epochs=120, seed=1, shuffle=False
):
rng = np.random.default_rng(seed)
n, d = Xc.shape
w = np.zeros(d)
w_hist = [w.copy()]
loss_hist = []
for ep in range(epochs):
        # Full-batch updates normally do not need shuffling; set shuffle=True to reshuffle every epoch anyway
if shuffle:
idx = rng.permutation(n)
X_use = Xc[idx]
y_use = yc[idx]
else:
X_use = Xc
y_use = yc
# prediction (full batch)
yhat = X_use @ w
e = y_use - yhat
# gradients (full batch)
        grad_mse = -(2/n) * (X_use.T @ e)  # equivalent to (2/n) X^T (Xw - y)
grad_reg = 2 * lam * w
grad = grad_mse + grad_reg
# update (one update per epoch)
w = w - lr * grad
        # record loss on the full data (evaluate on the original Xc, yc for consistency)
mse_part = np.mean((yc - Xc @ w)**2)
reg_part = lam * np.sum(w**2)
loss = mse_part + reg_part
w_hist.append(w.copy())
loss_hist.append(loss)
return w, np.array(w_hist), np.array(loss_hist)
lam = 0.25
W_ridge_sgd, W_hist, loss_hist = ridge_sgd_centered_with_loss(
Xc, yc, lr=0.05, lam=lam, epochs=120, seed=1, shuffle=False
)
plt.figure(figsize=(7, 4))
plt.plot(loss_hist, lw=2)
plt.xlabel("Epoch")
plt.ylabel("Objective value")
plt.title("Ridge (L2) SGD training loss\nMSE + λ||W||²")
plt.grid(alpha=0.3)
plt.show()
#Build MSE contour grid over (w1, w2)
pad = 1.8
w1_min = min(W_ols[0], W_ridge_sgd[0]) - 3.0
w1_max = max(W_ols[0], W_ridge_sgd[0]) + 3.0
w2_min = min(W_ols[1], W_ridge_sgd[1]) - 3.0
w2_max = max(W_ols[1], W_ridge_sgd[1]) + 3.0
w1 = np.linspace(w1_min, w1_max, 320)
w2 = np.linspace(w2_min, w2_max, 320)
W1, W2 = np.meshgrid(w1, w2)
MSE_grid = np.empty_like(W1)
for i in range(W1.shape[0]):
W_stack = np.stack([W1[i], W2[i]], axis=1) # (len(w1),2)
Yhat = Xc @ W_stack.T # (n, len(w1))
MSE_grid[i] = np.mean((yc.reshape(-1,1) - Yhat)**2, axis=0)
#Plot like your figure:OLS + Ridge(SGD)
plt.figure(figsize=(7, 7))
levels = np.percentile(MSE_grid, [8, 15, 25, 40, 60, 80])
plt.contour(W1, W2, MSE_grid, levels=levels, colors="gray", linewidths=1.2)
# axis cross lines (like the blue cross in your image)
plt.axhline(0, color="tab:blue", lw=1.5, alpha=0.9)
plt.axvline(0, color="tab:blue", lw=1.5, alpha=0.9)
# points
plt.scatter(W_ols[0], W_ols[1], s=80, color="tab:blue", label="OLS (min MSE only)")
plt.scatter(W_ridge_sgd[0], W_ridge_sgd[1], s=110, marker="x", color="tab:orange", label="Ridge (penalized, SGD)")
# (optional) training trajectory (comment out if you want cleaner)
plt.plot(W_hist[:,0], W_hist[:,1], color="tab:orange", alpha=0.35, lw=1.2)
bbox_kw = dict(boxstyle="round,pad=0.25", fc="white", ec="none", alpha=0.9)
arrow_kw = dict(arrowstyle="->", lw=1.5)
# annotations (simple, readable)
plt.annotate(
f"OLS\nW=({W_ols[0]:.2f},{W_ols[1]:.2f})",
xy=(W_ols[0], W_ols[1]),
xytext=(W_ols[0]+0.6, W_ols[1]-1.0),
arrowprops=arrow_kw,
bbox=bbox_kw
)
plt.annotate(
f"Ridge-SGD\nW=({W_ridge_sgd[0]:.2f},{W_ridge_sgd[1]:.2f})\n||W||₂={np.linalg.norm(W_ridge_sgd):.2f}",
xy=(W_ridge_sgd[0], W_ridge_sgd[1]),
xytext=(W_ridge_sgd[0]-3.0, W_ridge_sgd[1]+1.2),
arrowprops=arrow_kw,
bbox=bbox_kw
)
plt.title("Penalized Ridge training (SGD): shown on MSE contours")
plt.xlabel("w1")
plt.ylabel("w2")
plt.xlim(w1_min, w1_max)
plt.ylim(w2_min, w2_max)
plt.grid(True, alpha=0.2)
plt.legend(loc="upper right")
plt.gca().set_aspect("equal", adjustable="box")
plt.show()
```
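To verify the result (run after the block above, reusing its variables): because this objective divides the squared error by $n$, setting the gradient to zero gives $(X^\top X + n\lambda I)w = X^\top y$, and the gradient-descent weights should approach this closed form as training converges:
```python=
import numpy as np
# closed form for the 1/n-scaled objective: (X^T X + n*lam*I) w = X^T y
n_samples = Xc.shape[0]
W_closed = np.linalg.solve(XtX + n_samples * lam * np.eye(XtX.shape[0]), Xty)
print("closed form:", W_closed)
print("SGD result :", W_ridge_sgd)
print("difference :", np.linalg.norm(W_closed - W_ridge_sgd))
```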


scikit-learn example
---
```python=
import numpy as np
from sklearn.linear_model import Ridge
np.random.seed(42)
n = 80
X = np.random.randn(n, 2)
# introduce collinearity
X[:, 1] = 0.85 * X[:, 0] + 0.35 * X[:, 1]
W_true = np.array([2.0, -1.0])
y = X @ W_true + 0.8 * np.random.randn(n)
print("X shape:", X.shape, "y shape:", y.shape)
print("W_true:", W_true)
# alpha corresponds to the regularization strength
ridge = Ridge(
    alpha=0.25,          # L2 regularization strength (not a constraint radius)
    fit_intercept=False  # no intercept term (the data above is generated without one)
)
ridge.fit(X, y)
W_ridge = ridge.coef_
print("Ridge coefficients:", W_ridge)
print("||W||_2 =", np.linalg.norm(W_ridge))
```
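For reference (reusing the variables from the block above): with `fit_intercept=False` there is no centering involved, so the same coefficients can be recovered directly from the Ridge normal equations:
```python=
import numpy as np
# manual closed form: (X^T X + alpha*I) w = X^T y
W_manual = np.linalg.solve(X.T @ X + 0.25 * np.eye(2), X.T @ y)
print("manual closed form:", W_manual)  # should match ridge.coef_
```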