L1 正則化（Lasso Regression）

--- title: 'L1 正則化（Lasso Regression）' disqus: hackmd --- L1 正則化（Lasso Regression） === ## Table of Contents [TOC] Lasso Regression --- ### 從熟悉的 OLS 開始從我們已經熟悉的 OLS 開始，在線性迴歸中，我們的模型是： $$ \hat{y} = XW $$ 在前面章節已經定義過 **MSE（Mean Squared Error）** 作為 Loss： $$ \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$ $$ \mathrm{MSE} = \frac{1}{n}\mathrm{SSE} $$ OLS 做的事情其實很單純： > **找一組權重 (W)，讓 MSE 最小。** ### 「只看 MSE」會出問題？ OLS 在**變數很多、而且高度相關**時，係數的變異會非常大。 :::info 更詳細請看後面章節的「迴歸假設與缺點」 ::: 當資料出現以下情況時： * 特徵很多 * 特徵之間高度相關 * 有些特徵其實沒什麼用只最小化 MSE 會發生什麼？ * 模型會「硬用」所有特徵 * 某些權重會變得很大(當特徵彼此高度相關，怎麼分都差不多，**模型一邊把某個權重拉大一邊用另一個權重抵銷**，稱為權重補償（weight compensation）) * 模型容易記住雜訊（overfitting）所以我們開始意識到一件事： **MSE 小不代表模型好，權重如果太複雜，模型其實不穩定。** :::info OLS 的「不穩定」，不是指預測一定很差，而是指**模型對資料的微小變動過度敏感** 1. 不穩定的第一個來源：高度相關的特徵（共線性）假設有兩個特徵： $$ x_2 \approx x_1 $$ 那模型其實在學的是： $$ y \approx w_1 x_1 + w_2 x_2 $$ 但因為 $(x_1, x_2)$ 幾乎一樣，以下兩組解： * $(w_1 = 10,; w_2 = -8)$ * $(w_1 = 2,; w_2 = 0)$ 對 MSE 來說可能幾乎一樣好，但只要資料稍微變一點點： * 某次抽樣 $(x_1)$ 稍強 * 下次抽樣 $(x_2)$ 稍強 OLS 就會把責任改丟給另一個特徵，結果造成： * 權重數值變很大、正負號亂跳 * 解釋性完全崩壞、泛化能力差 2. 不穩定的第二個來源：特徵很多，但資料有限當特徵數接近樣本數，甚至超過時，每個權重都是一個「自由度」，OLS 幾乎可以「完美貼合訓練資料」，但貼合的其實是**雜訊 + 偶然性** 這時候會發生： * 訓練 MSE 很小 * 驗證 / 新資料誤差暴增(overfitting) ::: ### 限制「權重總量」為了降低這個問題，Lasso 使用 $$ |W|_1 = \sum_j |W_j| $$ 這代表我們在意的是**所有權重「絕對值的總和」。** 於是，Lasso 的 **constrained 形式** 就出現了： $$ \min_W \mathrm{MSE}(W) \quad \text{s.t.} \quad |W|_1 \le C $$ 這一公式可以用人話翻譯成：「在 MSE 表現不錯的模型裡，我只允許那些『用少量、簡單權重』的模型。」 :::success **一般情況下，Lasso / Ridge (L1/L2)的「懲罰權重不包含截距（intercept）」** 也就是說正則化 **只作用在斜率係數 ($w_1, w_2, \dots$)，不懲罰截距 ($b$)**，因為截距只是「整體平移」，不是模型複雜度 ::: ### 引入拉格朗日函數上面的 constrained 問題在概念上很清楚，但在實作上不好直接解，但數學給了我們一個等價的做法： > **把「不能超過限制」改成「超過就付出代價」。** 這時就引入 **拉格朗日乘數 $(\lambda \ge 0)$**。 :::info #### 拉格朗日函數的正式定義 >對一個有 ${\displaystyle n}$ 個變數與 ${\displaystyle k}$ 個限制條件的最佳化問題，拉格朗日乘數法會將其轉換成一個 ${\displaystyle n+k}$ 個變數的方程式組，稱作拉格朗日方程式（英語：Lagrange function），這個方程組的解將包括所有最佳化問題的解。拉格朗日乘數法將會引入一個或一組新的未知數，即拉格朗日乘數（英語：Lagrange multiplier），又稱拉格朗日乘子，或拉氏乘子，它們是在轉換後的方程式，即限制方程式中作為梯度的線性組合中各個向量的係數。 >[維基百科](https://zh.wikipedia.org/zh-tw/%E6%8B%89%E6%A0%BC%E6%9C%97%E6%97%A5%E4%B9%98%E6%95%B0) ::: 原始約束條件可以寫成： $$ g(W) = |W|_1 - C \le 0 $$ 拉格朗日函數則定義為： $$ \mathcal{Loss}(W,\lambda)= \mathrm{MSE}(W) + \lambda \big(|W|_1 - C\big) $$ 也可以直接 $$ \mathcal{Loss}(W,\lambda)= \mathrm{MSE}(W) + \lambda \big|W|_1 $$ :::success 在最佳化時，我們關心的是哪一個 $W$ 讓目標最小而 $(-\lambda C)$： * 不含 $W$ * 是一個常數所以它**不影響最小化的位置**，可以直接忽略。 ::: 意思是MSE 本來就要小，如果 $|W|_1$太大，就用$\lambda$來懲罰這件事 ### Lasso 的 Loss 函數這就是 Lasso 的 Loss 函數（承接MSE） $$ \mathcal{Loss}(W,\lambda)= \mathrm{MSE}(W) + \lambda \big|W|_1 $$ 這個 Loss 讓**Lasso 在最小化 MSE 的同時，額外懲罰「權重總量過大」的模型。** 而且： * $\lambda$ 越大 → 懲罰越重 → 更多 $W_j = 0$ * $\lambda \to 0$ → Lasso 退化成 OLS :::success #### 注意 Lasso 沒有讓問題變好，它只是讓模型「不要對問題那麼敏感」，它做的是「選擇」(特徵權重可變0)，不是「修復」。 ::: numpy程式碼範例(constrained) --- :::success #### 注意:在colab cell執行下面程式碼，下載Latex所需資源，才能在matplotlib的圖裡使用Latex ```python= !apt update !apt install -y cm-super dvipng texlive-latex-extra texlive-fonts-recommended ``` 如果是在window本機 * 安裝 MiKTeX（建議完整安裝） * 安裝 Ghostscript（必要） * 安裝完成後，將 latex, dvipng 所在路徑加入環境變數 PATH ::: ```python= import numpy as np import matplotlib.pyplot as plt import matplotlib as mpl # 啟用 LaTeX 渲染 mpl.rcParams["text.usetex"] = True mpl.rcParams["text.latex.preamble"] = r"\usepackage{amsmath}" np.random.seed(42) n = 80 X = np.random.randn(n, 2) # Add multicollinearity X[:, 1] = 0.85 * X[:, 0] + 0.35 * X[:, 1] W_true = np.array([2.0, -1.0]) y = X @ W_true + 0.8 * np.random.randn(n) ``` 造一個回歸資料：$(y = XW_{true} + noise)$，刻意讓兩個特徵 **有相關性**（模擬共線性），這樣會更能理解「為什麼需要正則化」 ```python= w1 = np.linspace(-5, 5, 401) w2 = np.linspace(-5, 5, 401) W1, W2 = np.meshgrid(w1, w2) pred = X[:, 0][:, None, None] * W1[None, :, :] \ + X[:, 1][:, None, None] * W2[None, :, :] resid = y[:, None, None] - pred SSE_grid = np.sum(resid**2, axis=0) MSE_grid = SSE_grid / n # Constrained Lasso # min MSE(W) s.t. ||W||1 <= C C = 2.5 feasible = (np.abs(W1) + np.abs(W2)) <= C MSE_feasible = np.where(feasible, MSE_grid, np.inf) idx = np.unravel_index(np.argmin(MSE_feasible), MSE_feasible.shape) W_lasso = np.array([W1[idx], W2[idx]]) # OLS solution W_ols = np.linalg.lstsq(X, y, rcond=None)[0] def mse(X, y, W): r = y - X @ W return float((r @ r) / len(y)) mse_ols = mse(X, y, W_ols) mse_lasso = mse(X, y, W_lasso) ``` 定義模型與 Loss，這一步最重要的觀念是，目前不是用梯度下降去訓練（那會牽涉到 L1 不可微），而是在 $(w_1,w_2)$ 的平面上掃過所有可能(constrained)，**找出在菱形約束內 MSE 最小的那個點**，並算出OLS答案做比較 ```python= fig, axs = plt.subplots(1, 2, figsize=(15, 7)) # ===== Left: geometry ===== ax = axs[0] # MSE contours (all equal status) levels = np.percentile(MSE_grid, [10, 20, 35, 50, 70]) ax.contour(W1, W2, MSE_grid, levels=levels, colors="gray", linewidths=1.2) # L1 feasible region diamond diamond = np.array([ [ C, 0], [ 0, C], [-C, 0], [ 0,-C], [ C, 0] ]) # IMPORTANT FIX: # - Do NOT use Unicode "≤" in labels if LaTeX is enabled. # - Use LaTeX command \le in math mode. ax.plot(diamond[:, 0], diamond[:, 1], color="tab:blue", linewidth=2.8, label=r"Feasible set: $|w_1|+|w_2|\le C$") ax.fill(diamond[:-1, 0], diamond[:-1, 1], color="tab:blue", alpha=0.18) # Solutions ax.scatter(W_ols[0], W_ols[1], s=90, color="tab:orange", zorder=5, label="OLS solution") ax.scatter(W_lasso[0], W_lasso[1], s=140, marker="X", color="tab:green", zorder=6, label="Lasso constrained solution") # Annotation style: white box to avoid overlap bbox_kw = dict(boxstyle="round,pad=0.25", fc="white", ec="none", alpha=0.9) arrow_kw = dict(arrowstyle="->", lw=1.5) ax.annotate(rf"OLS" "\n" rf"$W=({W_ols[0]:.2f},{W_ols[1]:.2f})$", xy=(W_ols[0], W_ols[1]), xytext=(W_ols[0]+0.6, W_ols[1]-1.0), arrowprops=arrow_kw, bbox=bbox_kw) ax.annotate(rf"Lasso" "\n" rf"$W=({W_lasso[0]:.2f},{W_lasso[1]:.2f})$", xy=(W_lasso[0], W_lasso[1]), xytext=(W_lasso[0]-3.0, W_lasso[1]+1.2), arrowprops=arrow_kw, bbox=bbox_kw) # Axes ax.set_title("Constrained Lasso (strict geometry)") ax.set_xlabel("w1") ax.set_ylabel("w2", rotation=0, labelpad=15) ax.axhline(0, lw=1) ax.axvline(0, lw=1) ax.set_aspect("equal", adjustable="box") ax.grid(True, alpha=0.25) ax.legend(loc="upper right") # ===== Right: mathematical explanation ===== axs[1].axis("off") text = ( r"$\mathbf{Constrained\ Lasso}$" "\n\n" r"$\min_{W}\ \mathrm{MSE}(W)$" "\n" r"$\mathrm{s.t.}\ \|W\|_1 \le C$" "\n\n" r"$\mathbf{Definitions}$" "\n" r"$W=(w_1,w_2)$" "\n" r"$\|W\|_1 = |w_1| + |w_2|$" "\n" r"$\mathrm{MSE}(W)=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2$" "\n" r"$\hat y = XW$" "\n\n" r"$\mathbf{Interpretation}$" "\n" r"Contours represent equal MSE levels." "\n" r"The solution is the feasible point" "\n" r"that minimizes MSE." "\n\n" r"$\mathbf{Results}$" "\n" rf"$C={C:.2f}$" "\n" rf"$W_{{OLS}}=({W_ols[0]:.2f},{W_ols[1]:.2f})$" "\n" rf"$W_{{Lasso}}=({W_lasso[0]:.2f},{W_lasso[1]:.2f})$" "\n" rf"$\mathrm{{MSE}}(W_{{OLS}})={mse_ols:.3f}$" "\n" rf"$\mathrm{{MSE}}(W_{{Lasso}})={mse_lasso:.3f}$" ) axs[1].text(0.04, 0.96, text, fontsize=16, va="top", linespacing=1.5) # tight_layout should now work plt.tight_layout() plt.show() ``` ![lasso](https://hackmd.io/_uploads/SyWES_crbx.png) 在圖上看到： 1. **MSE 等高線**（等誤差線，一圈圈灰色線）：代表「同樣 MSE 的一圈」 2. **菱形（藍色，L1 約束區域）**：$(|w_1|+|w_2|\le C)$ 3. **OLS 點**：不管係數大小，只管 MSE 最小 4. **Lasso constrained 點**：在菱形內 MSE 最小 numpy程式碼範例(Lasso vs OLS) --- 如果針對問題去做詳細比較，我們創造一個共線性更高的特徵(用數學保證相關係數大約是rho，但再加一點獨立雜訊)並採取多次採樣，分別用OLS、Lasso去多次訓練 ```python= def make_collinear_data(n=80, rho=0.98, seed=None): """X has 2 features, highly correlated when rho is close to 1.""" rng = np.random.default_rng(seed) x1 = rng.normal(0, 1, size=n) eps = rng.normal(0, 1, size=n) x2 = rho * x1 + np.sqrt(max(1 - rho**2, 1e-12)) * eps X = np.column_stack([x1, x2]) return X ``` <details> <summary>產生共線性資料</summary> 為什麼要用 `sqrt(1 - rho^2)` 這個奇怪係數，因為它讓 (x2) 的變異數維持在 1（跟 x1 同尺度），而且讓**理論相關係數剛好是 rho**。假設： * $x_1 \sim N(0,1)$ * $\varepsilon \sim N(0,1)$ * 而且 $x_1$ 跟 $\varepsilon$ 互相獨立定義： $$ x_2 = \rho x_1 + \sqrt{1-\rho^2}\varepsilon $$ #### 先看變異數為什麼會是 1 因為獨立的話，變異數會相加： $$ Var(x_2) = Var(\rho x_1) + Var(\sqrt{1-\rho^2}\varepsilon) $$ $$ = \rho^2 Var(x_1) + (1-\rho^2)Var(\varepsilon) $$ 因為 $Var(x_1)=1$、$Var(\varepsilon)=1$： $$ Var(x_2) = \rho^2 + (1-\rho^2) = 1 $$ 所以不管 rho 是多少，x2 都會保持「標準差約 1」的尺度。 #### 再看相關係數為什麼是 rho 相關係數要看協方差： $$ Cov(x_1, x_2) = Cov(x_1, \rho x_1 + \sqrt{1-\rho^2}\varepsilon) $$ 拆開： $$ = \rho Cov(x_1,x_1) + \sqrt{1-\rho^2}Cov(x_1,\varepsilon) $$ 因為 $Cov(x_1,x_1)=Var(x_1)=1$，而 $x_1$ 跟 $\varepsilon$ 獨立所以 $Cov(x_1,\varepsilon)=0$： $$ Cov(x_1,x_2)=\rho $$ 相關係數： $$ Corr(x_1,x_2)=\frac{Cov(x_1,x_2)}{\sqrt{Var(x_1)Var(x_2)}}=\frac{\rho}{\sqrt{1\cdot 1}}=\rho $$ 所以理論上它就是 rho。 </details> ```python= import numpy as np import matplotlib.pyplot as plt def make_collinear_data(n=80, rho=0.98, seed=None): """X has 2 features, highly correlated when rho is close to 1.""" rng = np.random.default_rng(seed) x1 = rng.normal(0, 1, size=n) eps = rng.normal(0, 1, size=n) x2 = rho * x1 + np.sqrt(max(1 - rho**2, 1e-12)) * eps X = np.column_stack([x1, x2]) return X def make_y(X, W_true, noise_y=1.0, seed=None): rng = np.random.default_rng(seed) y = X @ W_true + noise_y * rng.normal(0, 1, size=len(X)) return y def mse(X, y, W): r = y - X @ W return float((r @ r) / len(y)) def fit_ols(X, y): return np.linalg.lstsq(X, y, rcond=None)[0] def soft_threshold(z, gamma): # S(z, gamma) = sign(z) * max(|z| - gamma, 0) if z > gamma: return z - gamma if z < -gamma: return z + gamma return 0.0 def fit_lasso_cd(X, y, lam=0.15, iters=200, tol=1e-7): """ Lasso (penalized) using coordinate descent (pure NumPy). Objective: J(W) = (1/n) * ||y - XW||^2 + lam * ||W||_1 No intercept here (keep it simple for 2D geometry & stability demo). """ n, p = X.shape W = np.zeros(p, dtype=float) # Precompute for speed X_sq = np.sum(X**2, axis=0) for _ in range(iters): W_old = W.copy() # update each coordinate for j in range(p): # partial residual: y - sum_{k != j} x_k w_k r = y - X @ W + X[:, j] * W[j] rho_j = (X[:, j] @ r) / n # coordinate descent closed-form with soft-threshold # NOTE: because we use (1/n)||...||^2, the threshold uses lam/2 # Derivation yields: # w_j = S(rho_j, lam/2) / ( (1/n) * sum x_ij^2 ) denom = (X_sq[j] / n) W[j] = soft_threshold(rho_j, lam / 2.0) / (denom + 1e-12) if np.max(np.abs(W - W_old)) < tol: break return W # Step 1) 固定「同一套生成規則」 W_true = np.array([2.0, -1.0]) rho = 0.98 # 高共線 n = 60 B = 200 noise_y = 1.0 # Lasso lambda（你可以調大→更稀疏、更穩；調小→更接近 OLS） lam = 0.20 # Step 2) 同一批重抽樣資料，收集 OLS / Lasso 的 W Ws_ols = [] Ws_lasso = [] mse_ols_list = [] mse_lasso_list = [] for b in range(B): # "同樣資料" 的意思：同一個 b 用同一個 X、同一個 y，同時丟給 OLS 與 Lasso X = make_collinear_data(n=n, rho=rho, seed=1000 + b) y = make_y(X, W_true, noise_y=noise_y, seed=2000 + b) W_ols = fit_ols(X, y) W_lasso = fit_lasso_cd(X, y, lam=lam, iters=400) Ws_ols.append(W_ols) Ws_lasso.append(W_lasso) mse_ols_list.append(mse(X, y, W_ols)) mse_lasso_list.append(mse(X, y, W_lasso)) Ws_ols = np.array(Ws_ols) Ws_lasso = np.array(Ws_lasso) mse_ols_list = np.array(mse_ols_list) mse_lasso_list = np.array(mse_lasso_list) # stats def stats(Ws, mses): mu = Ws.mean(axis=0) sd = Ws.std(axis=0) return mu, sd, mses.mean(), mses.std() mu_o, sd_o, mse_mu_o, mse_sd_o = stats(Ws_ols, mse_ols_list) mu_l, sd_l, mse_mu_l, mse_sd_l = stats(Ws_lasso, mse_lasso_list) # Step 3) 畫圖：左邊係數散佈、右邊文字總結（同你原本風格） fig, axs = plt.subplots(1, 2, figsize=(15, 7)) # --- Left: scatter in (w1, w2) space --- ax = axs[0] ax.set_title("Coefficient instability comparison (same resamples)") ax.set_xlabel("w1") ax.set_ylabel("w2", rotation=0, labelpad=15) ax.axhline(0, lw=1) ax.axvline(0, lw=1) ax.grid(True, alpha=0.25) ax.scatter(Ws_ols[:, 0], Ws_ols[:, 1], s=25, alpha=0.35, label="OLS (many resamples)") ax.scatter(Ws_lasso[:, 0], Ws_lasso[:, 1], s=25, alpha=0.35, label="Lasso (many resamples)") ax.scatter(W_true[0], W_true[1], s=180, marker="X", label="True W") # annotate typical std ax.text(0.07, 0.02, "Each dot = one resample\nSame X,y used for OLS and Lasso", transform=ax.transAxes, fontsize=10, bbox=dict(boxstyle="round", alpha=0.15)) ax.legend(loc="upper right") # --- Right: explanation --- axs[1].axis("off") text = ( r"$\mathbf{Same\ data,\ compare\ stability}$" "\n\n" rf"Data: n={n}, B={B}, rho={rho:.2f}, noise={noise_y:.1f}" "\n" rf"True: $W^{{true}}=({W_true[0]:.1f},{W_true[1]:.1f})$" "\n\n" r"$\mathbf{Models}$" "\n" r"OLS: minimize MSE only" "\n" r"Lasso (penalized): minimize MSE(W) + $\lambda \|W\|_1$" "\n" rf"Here: $\lambda={lam:.2f}$" "\n\n" r"$\mathbf{Why\ OLS\ unstable\ under\ collinearity?}$" "\n" r"When x1 and x2 are very similar, many (w1,w2) pairs" "\n" r"give almost the same predictions, so small resample/noise changes" "\n" r"can shift OLS coefficients a lot." "\n\n" r"$\mathbf{Stability\ summary\ (mean / std)}$" "\n" rf"OLS mean(W)=({mu_o[0]:.2f},{mu_o[1]:.2f}), std(W)=({sd_o[0]:.2f},{sd_o[1]:.2f})" "\n" rf"Lasso mean(W)=({mu_l[0]:.2f},{mu_l[1]:.2f}), std(W)=({sd_l[0]:.2f},{sd_l[1]:.2f})" "\n\n" r"$\mathbf{Training\ MSE\ (mean / std)}$" "\n" rf"OLS MSE={mse_mu_o:.3f} ± {mse_sd_o:.3f}" "\n" rf"Lasso MSE={mse_mu_l:.3f} ± {mse_sd_l:.3f}" "\n\n" r"$\mathbf{Key\ point}$" "\n" r"Lasso shrinks $|W|$ and tends to make coefficients smaller / sparser," "\n" r"so W varies less across resamples (more stable)." ) axs[1].text(0.04, 0.96, text, fontsize=14, va="top", linespacing=1.35) plt.tight_layout() plt.show() ``` ![LASSO_vs_OLS](https://hackmd.io/_uploads/ryl_xcRN-l.png) 從圖中可以看到相比OLS，Lasso在多次採樣下比較集中、穩定。 numpy程式碼範例(ISTA,Proximal Gradient Method) --- :::info ISTA = 用「兩個小步驟」來解 > 「一半可微、一半不可微」的最小化問題** 正式名稱： > **ISTA = Iterative Shrinkage-Thresholding Algorithm** ::: 現在做的是 **Penalized Lasso**，，實務上幾乎都直接用 penalized form因為 λ 比 C 好調。 $$ \min_W; Loss(W)=\mathrm{MSE}(W)+\lambda |W|_1 $$ 其中 $$ \mathrm{MSE}(W)=\frac{1}{n}|y-XW|^2 $$ 平滑、可微 $$ |W|_1=\sum_j |w_j| $$ 在 0 不可微（尖點） :::success #### 「不可微」到底卡在哪？以一個係數 $w$ 來看： * $|w|$ 在 $w>0$ 梯度是 +1 * 在 $w<0$ 梯度是 -1 * 但在 $w=0$，左右斜率不一樣 → 沒有單一梯度所以 **普通 GD/SGD** 沒辦法「正確且穩定」地在 0 附近處理它（你會看到係數在 0 附近抖動、不容易精準變成 0）。 ::: **Lasso 的 Loss 有一半是 MSE（可微），一半是 L1（不可微），所以ISTA的做法是，先對 MSE 做一次普通梯度下降，在每一步，先沿著可微部分往下走一小步，再立刻用不可微部分『修正這一步』。** ISTA 每一步都做兩件事： #### (1) 先做一次「只管 MSE」的梯度下降 $$ W_{\text{tmp}} = W_k - \eta \nabla MSE(W_k) $$ 直覺：先沿著「讓誤差變小」的方向走一步。 #### (2) 再做一次「L1 專用的收縮」 $$ W_{k+1} = \mathrm{prox}_{\eta g}(W_{\text{tmp}}) $$ 這句 prox 是啥？它的定義是： $$ \mathrm{prox}_{\eta g}(z)=\arg\min_w \left(\frac{1}{2}(w-z)^2+\eta g(w)\right) $$ :::success argmin 是什麼？ $\min_x f(x)$意思是：「這個函數的最小值是多少？」」 $\arg\min_x f(x)$意思是：「**在哪個 x** 取得這個最小值？」例子 $$ f(x) = (x-3)^2 $$ * $\min f(x) = 0$（最小值是 0） * $\arg\min f(x) = 3$（在 x=3 時達到） **min = 值** **argmin = 位置（參數）** ::: 把 $g(w)=\lambda|w|$ 代入： $$ \mathrm{prox}_{\eta\lambda|\cdot|_1}(z)=\arg\min_w \left(\frac{1}{2}(w-z)^2+\eta\lambda|w|_1\right) $$ proximal step 解的是： $$ \arg\min_W \left( \underbrace{\frac{1}{2}(W - W_{\text{tmp}})^2}_{\text{留在 MSE 的好方向附近}} + \underbrace{\eta\lambda |W|_1}_{\text{L1 拉回來}} \right) $$ 產生資料 ```python= import numpy as np np.random.seed(42) n = 80 X = np.random.randn(n, 2) # 做一點共線性：讓第二個特徵和第一個特徵相關 X[:, 1] = 0.85 * X[:, 0] + 0.35 * X[:, 1] W_true = np.array([2.0, -1.0]) y = X @ W_true + 0.8 * np.random.randn(n) print("X shape:", X.shape, "y shape:", y.shape) print("W_true:", W_true) ``` 定義模型與 MSE ```python= def y_hat(X, W): return X @ W def mse_loss(X, y, W): r = y - y_hat(X, W) return (r @ r) / len(y) def grad_mse(X, y, W): # MSE(W) = (1/n) ||y - XW||^2 # grad = (2/n) X^T (XW - y) n = len(y) return (2/n) * (X.T @ (X @ W - y)) ``` 實際訓練：ISTA（Proximal GD）解 Penalized Lasso soft-threshold（L1 的關鍵） ```python= def soft_threshold(z, t): return np.sign(z) * np.maximum(np.abs(z) - t, 0.0) ``` <details> <summary>soft-threshold推導</summary> 在 soft-threshold 那個問題裡，我們解的是： $$ \arg\min_w \left( \frac{1}{2}(w - z)^2 + t|w| \right) $$ * **變數是：** ( w ) * **不是：** (|w|) * $z$ = 梯度下降後的暫時位置 * $t = \eta\lambda$ = L1 在這一步的收縮強度所以你不是在「解 $|w|$」，你是在找「哪個 **w（可以正、負、0）** 讓整個式子最小」，但是$|w|$ 在 **w = 0 不可微，所以在數學上你**不能一次微分處理，只能： > 「假設在右邊 → 解 → 檢查假設對不對」 > 「假設在左邊 → 解 → 檢查假設對不對」 #### Case by case 分析 #### 假設 (w > 0) 此時 ($|w| = w$) $$ f(w) = \frac{1}{2}(w - z)^2 + t w $$ 微分： $$ \frac{d}{dw} = (w - z) + t $$ 令為 0： $$ w = z - t $$ 但前提是： $(w > 0 \Rightarrow z - t > 0 \Rightarrow z > t)$ #### 假設 (w < 0) 此時 ($|w| = -w$) $$ f(w) = \frac{1}{2}(w - z)^2 - t w $$ 微分： $$ \frac{d}{dw} = (w - z) - t $$ 令為 0： $$ w = z + t $$ 但前提是： $(w < 0 \Rightarrow z + t < 0 \Rightarrow z < -t)$ #### 介於中間如果： $$ -t \le z \le t $$ 那麼： * 上面兩個解都「不符合假設」 * 此時最小值會落在 **不可微點 (w=0)** 因為 L1 在 0 有尖角，會「卡住」 #### 把三種情況合起來 $$ w^* = \begin{cases} z - t, & z > t \\ 0, & |z| \le t \\ z + t, & z < -t \end{cases} $$ 這個東西就是**soft-threshold** #### 寫成 compact 形式（你看到的那行） $$ w^* = \mathrm{sign}(z)\max(|z|-t, 0) $$ </details> 訓練迴圈（每一步都在最小化 $MSE+\eta\lambda|W|$） ```python= def train_lasso_ista(X, y, lam=0.2, lr=0.1, steps=200): W = np.zeros(X.shape[1]) # 從 0 開始（常見做法） history = [] for k in range(steps): # (1) 先往讓 MSE 變小的方向走 W_tmp = W - lr * grad_mse(X, y, W) # (2) 再做 L1 的 proximal：soft-threshold W = soft_threshold(W_tmp, lr * lam) # 記錄 loss J = mse_loss(X, y, W) + lam * np.sum(np.abs(W)) history.append(J) return W, np.array(history) ``` 實際訓練 ```python= lam = 0.25 lr = 0.1 steps = 300 W_lasso, hist = train_lasso_ista(X, y, lam=lam, lr=lr, steps=steps) W_ols = np.linalg.lstsq(X, y, rcond=None)[0] print("lambda =", lam) print("W_lasso (ISTA) =", W_lasso, " | L1 =", np.abs(W_lasso).sum()) print("W_ols =", W_ols) print("Final J(W) =", hist[-1]) ``` 繪圖：把每個部分畫出來＋直接寫在圖上解釋 ```python= import matplotlib.pyplot as plt # 在平面上算 MSE 等高線（用來視覺化地形） w1 = np.linspace(-5, 5, 401) w2 = np.linspace(-5, 5, 401) W1, W2 = np.meshgrid(w1, w2) pred = X[:, 0][:, None, None] * W1[None, :, :] + X[:, 1][:, None, None] * W2[None, :, :] resid = y[:, None, None] - pred SSE = np.sum(resid**2, axis=0) MSE = SSE / len(y) fig, ax = plt.subplots(figsize=(9, 7)) levels = np.percentile(MSE, [10, 20, 35, 50, 70]) ax.contour(W1, W2, MSE, levels=levels, colors="gray", linewidths=1.2) # 點：OLS vs Penalized Lasso(ISTA) ax.scatter([W_ols[0]], [W_ols[1]], s=70, marker="o", label="OLS (min MSE only)") ax.scatter([W_lasso[0]], [W_lasso[1]], s=120, marker="x", label="Lasso (penalized, ISTA)") # 在點旁邊寫解釋 bbox_kw = dict(boxstyle="round,pad=0.25", fc="white", ec="none", alpha=0.9) arrow_kw = dict(arrowstyle="->", lw=1.5) ax.annotate(f"OLS\nW=({W_ols[0]:.2f},{W_ols[1]:.2f})", xy=(W_ols[0], W_ols[1]), xytext=(W_ols[0]+0.6, W_ols[1]-1.0), arrowprops=arrow_kw, bbox=bbox_kw) ax.annotate(f"Lasso penalized\nW=({W_lasso[0]:.2f},{W_lasso[1]:.2f})\n|W|1={np.abs(W_lasso).sum():.2f}", xy=(W_lasso[0], W_lasso[1]), xytext=(W_lasso[0]-3.4, W_lasso[1]+1.2), arrowprops=arrow_kw, bbox=bbox_kw) ax.set_title("Penalized Lasso training (ISTA): shown on MSE contours") ax.set_xlabel("w1") ax.set_ylabel("w2") ax.axhline(0, linewidth=1) ax.axvline(0, linewidth=1) ax.set_aspect("equal", adjustable="box") ax.legend(loc="upper right") plt.tight_layout() plt.show() ``` ![lasso_ISTA](https://hackmd.io/_uploads/S1dS8uqrZe.png) 訓練曲線：看loss真的下降 ```python= plt.figure(figsize=(8, 5)) plt.plot(hist) plt.title("Training curve: Loss(W)=MSE(W)+lambda*|W|") plt.xlabel("step") plt.ylabel("Loss(W)") plt.tight_layout() plt.show() ``` ![lasso_loss](https://hackmd.io/_uploads/B1_zq41Bbx.png) scikit-learn程式碼範例 --- ```python= import numpy as np from sklearn.linear_model import Lasso np.random.seed(42) n = 80 X = np.random.randn(n, 2) # 做一點共線性：讓第二個特徵和第一個特徵相關 X[:, 1] = 0.85 * X[:, 0] + 0.35 * X[:, 1] W_true = np.array([2.0, -1.0]) y = X @ W_true + 0.8 * np.random.randn(n) print("X shape:", X.shape, "y shape:", y.shape) print("W_true:", W_true) # alpha 對應筆記裡的 lambda lasso = Lasso( alpha=0.25, # λ fit_intercept=False, max_iter=10_000 ) lasso.fit(X, y) W_lasso = lasso.coef_ print("Lasso coefficients:", W_lasso) print("||W|| =", np.abs(W_lasso).sum()) ```