【Genetic Algorithms】

# 【Genetic Algorithms】 - Part1: 基因演算法重要單字 - Part2: Naïve Selection - Part3: 遺傳演算法關鍵要點 - Part4: 解決該問題的歷程與方法 - Part5: Environment - Part6: Mutation - Part7: Performance o GA - Part8: Using the optimixation view to compose Neural & GA - Part9: The design of Genetic Algorithm - Part10: GA-K-means - Part11: 補充資料 ## 1. 基因演算法重要單字 - Selection：篩選 - Crossover：交配 - Mutation：突變 - Fitness function：適應函數 - Population：族群 - Chromosome：染色體 - Individual：個體 ## 2. Naive Selection ![image](https://hackmd.io/_uploads/H1OHBUAlgg.png) ## 3. 遺傳演算法關鍵要點 **定義問題 (Define your problem):** - 清楚界定你想要解決的問題。 - 思考如何將問題轉化為演算法可以處理的形式。 > 使用二進制編碼 (binary On/Off)，像 `010` 這樣的序列可以代表問題的某種狀態或解的一部分。 **定義個體 (Define the individual / chromosome)**: - 明確「個體」的結構，即「染色體 (chromosome)」。這代表問題的一個潛在解決方案。 - 染色體通常由一串「基因 (gene)」組成，每個基因對應解的一個特徵或參數。 > 承上，010 就是一個包含三個二進制基因的染色體範例。 **定義適應度函數 (Define your fitness function):** - 設計一個函數來評估每個個體 (即每個潛在解決方案) 的優劣程度或「適應能力」。 - 適應度函數的輸出值（分數）越高，通常代表該個體越接近理想解。 **演算法運行時的限制與考量 (Limitations when running your algorithm):** - 早熟收斂/卡在局部最優解：演算法在尋找最佳解的過程中，可能會過早收斂或卡在一個「尚可」但並非全局最好的解（local optimum）。 > 可能會一直卡在某個問題點 (may be always stuck for the problem)。數字 `0.4/0.5` 可能指的是適應度函數值達到某個水平後停滯不前，無法進一步提升。 - 需要注意演算法是否能有效探索整個解空間。 **如何提升效能 (How to enhance performance):** - 思考有哪些方法可以改進演算法的表現，以找到更好或更快的解決方案。 - 可能的方向包括： 1. 調整演算法參數（如族群大小、突變率、交配率）。 2. 優化選擇 (Selection)、交配 (Crossover) 和突變 (Mutation) 的策略。 3. 引入更高級的技巧，例如精英保留 (Elitism)、多樣性維持機制等。 **染色體長度的設計考量** - **過長**：增加運算成本，搜尋空間大，導致收斂變慢。 - **過短**：資訊不足，無法表達複雜結構，可能無法達到最小誤差。 🔸 常見評估指標：MSE、RMSE、Accuracy 等。 **多目標適應函數設計（Multi-objective Fitness）** 若考慮多重效益（如時間與金錢），可設計加權適應函數： \begin{equation} \text{Fitness} = \alpha \cdot \text{time} + \beta \cdot \text{money}, \quad \alpha + \beta = 1 \end{equation} **範圍懲罰（範圍限制）：** 當輸出超出預期目標區間（如 5~10）時，加入懲罰項，例如： - 若目標 < 5，設 error = 99999 - 若目標 > 10，設 error = 99999 ## 4. 解決該問題的歷程與方法在解決特定問題，尤其是需要從數據中找出規律或建立預測模型的場景時，「系統識別」(System Identification) 扮演了核心角色。其主要目標是根據已知的輸入 ($x$) 和輸出 ($y$) 數據，找到一個能夠準確描述該系統行為的數學方程式（模型）。一個理想的模型不僅能在已知數據上表現良好，更期望能在面對未知的新問題或數據時，依然能夠做出正確的決策或預測。 <center> | 發展階段 | 約 1940 年代至今 (傳統基礎) | 約 1980 年代至今 (遺傳演算法應用) | 約 1990 年代至今 (現代方法，含 2015 年後深度學習影響) | | :--: | :--: | :--: | :--: | | 步驟一：確立模型結構($y=ax^ 2+bx+c$) | 主要方法論建立 | 承襲傳統方法 | 強調可使用單一虛擬問題 (single virtual problem) 或模擬環境來輔助模型結構的選擇與驗證。 | | 步驟二：最佳化模型參數(找到最佳的 $a, b, c$) | 傳統參數估計方法 | 引入遺傳演算法 (GA) 等最佳化工具進行參數搜尋。 | 同樣可利用單一虛擬問題進行參數調校與模型效能評估。深度學習等技術也為複雜模型的參數學習帶來新途徑。 | </center> **評估與考量 (以遺傳演算法等方法為例)：** - **優點 (Advantages)** 1. 高效率 (efficient): 在尋找複雜問題的參數解時，通常能展現出較高的搜索效率。 2. 準確性 (correct): 能夠在給定的模型和數據下，找到相對正確或接近最佳的參數組合。 - **挑戰與缺點 (Disadvantages)** 1. 泛化能力問題：即使模型在訓練數據或已知情境下表現優異，當它遇到全新的、未曾見過的未知問題 (unknown problem) 或數據時，其預測或決策的正確性可能大幅下降 (may be wrong)。這是所有模型（包括透過遺傳演算法或深度學習建立的模型）都需要持續面對和克服的挑戰。 ## 5. Environment \begin{equation} y = ax + b \end{equation} <center> | **$x$** | 0 | 1 | 2 | 3 | | :--- | :-- | :-- | :-- | :-- | | **$y$** | 5 | 6 | 7 | 8 | </center> \begin{equation} 0 < a < 10, \quad 0 < b < 10 \end{equation} ### Step1： Generate a population Population： $(2, 3), (1, 4), (9, 3), (6, 7)$ Usually we set it equation to 200 ### Step2： Calculate its fitness 1. $(2, 3) \rightarrow MSE = 6/4$ <center> | $x$ | $y$ | $y^{'} = 2x + 3$ | $error$ | | :--: | :--: | :--: | :--: | | 0 | 5 | 3 | 2 | | 1 | 6 | 5 | 1 | | 2 | 7 | 7 | 0 | | 3 | 8 | 9 | -1 | </center> 2. $(9, 3) \rightarrow MSE = \uparrow$ <center> | $x$ | $y$ | $y^{''} = 9x + 3$ | $error$ | | :--: | :--: | :--: | :--: | | 0 | 5 | 3 | 2 | | 1 | 6 | 12 | -6 | | 2 | 7 | 21 | -14 | | 3 | 8 | 30 | -22 | </center> Suppose：1.5、0.5、200、150 ### Step3： Selection & Crossover ```mermaid pie "Label 1 (1.5)" : 30 "Label 2 (0.5)" : 20 "Label 3 (200)" : 200 "Label 4 (150)" : 150 ``` **Two ways to do the inverse** - Method1 \begin{equation} 1/1.5 、 1/0.5 、 1/200 、 1/150 \end{equation} :::info **Property**： 1. Emphasize the difference between small values $((\frac{1}{2} - \frac{1}{3}) v.s (\frac{1}{201} - \frac{1}{202}))$ 2. The big value will disappear very quickly $(\frac{1}{2} 、\frac{1}{2000})$ **Collary**：Converage very quickly but may locate in local minimal. ::: - Method2：Find the biggest（fitness value + 1） \begin{equation} (201 - 1.5) 、 (201 - 0.5) 、 (201 - 200) 、 (201 - 150) \end{equation} :::info **Property**：The different between values will be treated equaly **Collary**：Converage slowly but have higher possibility to reach global minimum. ::: **Crossover： First sheet 2 、 Second sheet 1** - Sheet 2 \begin{equation} (1, 4) \rightarrow (0001 \quad 100) \end{equation} - Sheet 1 \begin{equation} (2, 3) \rightarrow (0010 \quad 011) \\ \end{equation} \begin{align} 0 \le a \le 15 , 4bits \\ 0 \le b \le 7, 3bits \end{align} - Crossover Method (Single) (Random Choosed) <center> ![image](https://hackmd.io/_uploads/Sk-D3G1Zgl.png) </center> - Crossover Method (Double) <center> ![image](https://hackmd.io/_uploads/S1W82zy-ll.png) </center> ## 6. Mutation **Two kinds of definitions** (Mutation rate $\sigma$, **$\sigma$ self define = 0.25**)（Smaller is Better） 1. For each chromosome <center> ![image](https://hackmd.io/_uploads/rJbb1mJZge.png) </center> 2. For each bits（Diversity）（Jump at the local minimum） <center> ![image](https://hackmd.io/_uploads/ryA8e7JWel.png) </center> ## 7. Performance of GA <center> ![image](https://hackmd.io/_uploads/SkvS6Qy-xx.png) </center> **n generations the same best results $\rightarrow$ stop** <center> ![image](https://hackmd.io/_uploads/HJ4UkVy-xl.png) </center> :::success **重點筆記**：The same as the earlt stop in deep learning model. ::: ## 8. Using the optimixation view to compose Neural & GA - Neural Network：NN cannot fine the global minimum, usually find the local minimum <center> ![image](https://hackmd.io/_uploads/Bk9SE4JZeg.png) </center> - Genetic Algorithm： 1. GA is the brute force method for become as enough generation is used it can try all posiibble solution 2. GA can find the global minimum of the problem, once enough generation are used <center> ![image](https://hackmd.io/_uploads/SJcmvV1Wxe.png) </center> ## 9. The design of Genetic Algorithm ### Real World Problem <center> ![image](https://hackmd.io/_uploads/SJ21RVyble.png) </center> **Start**：for every station the # of bikes $30\% \le x \le 70\%$ <center> | 06:00 | 12:00 | 18:00 | 24:00 | | :--: | :--: | :--: | :--: | | A got 8 bikes | ... | ... | ... | | C put 5 bikes | ... | ... | ... | | D got 5 bikes | ... | ... | ... | | B put 8 bikes | ... | ... | ... | </center> **Constraint**：Each can take only 30 bikes. ![截圖 2025-05-13 凌晨1.34.44](https://hackmd.io/_uploads/SyMSb2k-el.png) ### Design of genetic algorithm view 1 $\rightarrow$ from the view of user **Structure of our solutions：** ![image](https://hackmd.io/_uploads/BJ_mXw1-xl.png) **Loss function** ![image](https://hackmd.io/_uploads/rJe8KDk-ex.png) **The difficulty you will meet here** ![image](https://hackmd.io/_uploads/BkXuauJZxe.png) ### Design of genetic algorithm view 2 $\rightarrow$ this the result of all solutions **Structure of our solutions** - The difficulty of this view become that we must to find the solution by ourself - You must to design a schedule alorithm additionally to find the route of dispatching bikes. ![image](https://hackmd.io/_uploads/Hy3dWt1-ll.png) **Crossover** ![image](https://hackmd.io/_uploads/r1Jx7Y1-ge.png) :::success **重點筆記**：When you design your gene. You don't need to reach all requirements of the proposed ::: ### Design of genetic algorithm view 3 $\rightarrow$ GA solution can solve the problem but hightly cost <center> ![image](https://hackmd.io/_uploads/HJ4FV2yZll.png) </center> **Disadvantage 1**：Here in this example # or routes = $\binom{4}{2} = 6$, in real world # or routes $\binom{100}{2}$ - Make the gene too long and slow down the process of finding solution. **Disadvantage 2**：The result is hard to combine \begin{equation} AB \quad \quad AC \quad \quad AD \quad \quad BD \quad \quad CD \end{equation} \begin{equation} A \rightarrow B \quad A \rightarrow C \quad A \rightarrow D \quad B \rightarrow D \quad C \rightarrow D \end{equation} **How to deal with the limitation of gene？ Maximum 30 bikes in one Car** ![image](https://hackmd.io/_uploads/S1EzA3kZel.png) ## 10. GA-K-means GA-K-means：兼顧大／小群的改良式分群 | 步驟 | 核心做法 | 為何能解決「大群被切開 / 小群被併吞」？ | |------|----------|-------------------------------------------| | **1. 染色體設計** | `centroid₁..k (x,y)` **＋** `scale₁..k` 例： `[x₁, y₁, s₁, …, x_k, y_k, s_k]` `scale ∈ (0, 1]` 隨機初始化 | 多一顆 **scale 基因**，分派點時使用 `d′(p,c_j) = ‖p−c_j‖ / s_j`。 群越大 → `s_j ↑` → 半徑「較鬆」； 群越小 → `s_j ↓` → 半徑「較緊」。 | | **2. Fitness** | $$\text{Fitness} = \sum_j\sum_{p\in C_j} d′(p,c_j)^2 + \lambda\cdot(\text{空群數})$$ | 以 **縮放後距離** 評估群內緊實度， 並用 λ 懲罰空群，同時顧及 *compactness* 與 *完整性*。 | | **3. 演化策略** | • Tournament 選擇 • SBX 或兩點交叉 • Gaussian 變異：   `c ← c + N(0, σ)`   `s ← s × 10^{N(0, 0.1)}` • **局部精煉**：每代 GA 後對染色體跑 1 步 K-means（固定 `scale`） | GA 搜索 *(centroid, scale)* 全局最佳， K-means 做局部微調； 兩者混和能同時鎖定大圈與小圈。 | > 一句話：在染色體裡為每個 centroid 再放一顆 scale 基因，以「距離 ÷ scale」做分派；GA 全局搜尋 (centroid, scale)，K-means 局部微調，如此即使群大小差很大，也能同時抓到大圈與小圈。 ## 補充：Problem in final exam ### Knapsack Problem（背包問題） **Item** <center> | Name | Volumn | Weight | Value | | :--: | :--: | :--: | :--: | | A | 100 | 200g | 10 | </center> Max limitation $2000 cm^3$, and $5000g$. And you need to answer 1. Gene Algorithm：用 *n-bit 0/1 向量*（1 = 放入背包）。本題若有 *m* 個物品就用 *m* 個 bit；交配採單點 crossover、突變採 bit-flip 2. Loss Function：計算 $V=\sum x_i\,value_i,\;W=\sum x_i\,weight_i,\;C=\sum x_i\,volume_i$ 若 $W\le5000\;\land\;C\le2000$ → fitness = +V（要**最大化**）；否則 fitness = −10 000（重罰直接淘汰）。 3. Meet the limitation：（i）靠上列重罰，違規基因在選擇階段自然被捨棄；或（ii）*Repair*：超載時依最差 $value/weight$ 依序把 bit 由 1→0，直到兩條限制都滿足。 ### Genetic Algorithm Implementation 1 #### 題目描述給定一個二次方程式： $y = ax^2 + bx + c$ 其中： - $a, b \in \{0, 1, \dots, 9\}$（正整數） - $c \in \{0, 1, \dots, 99\}$（正整數） - 資料為多組 $(x, y)$ 對 #### 資料範例 <center> | x | y | | :--: | :--: | | 1524 | 20909367 | | 3979 | 142507972 | | 6752 | 410332631 | | 1617 | 23538756 | | 8146 | 597248515 | | 5091 | 233284980 | | 1332 | 15973431 | | 1766 | 28075955 | </center> 目標：使用 Genetic Algorithm (遞傳演算法) 找出最適合資料的參數 $a, b, c$。 **一、基因組成設計** **Binary Encoding** <center> | 參數 | 範圍 | 位元 | 二進位稱示 | | -- | ---- | -- | ---------------------- | | a | 0–9 | 4 | `0000` \~ `1001` | | b | 0–9 | 4 | `0000` \~ `1001` | | c | 0–99 | 7 | `0000000` \~ `1100011` | </center> ★ 總長度：**15 bits** **二、Fitness Function** 預測值：$\hat{y}_i = ax_i^2 + bx_i + c$ 使用 MSE （均方誤差）作為捐款： $MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$ 適應度計算： $\text{Fitness} = \frac{1}{\text{MSE} + \varepsilon}$ > $\varepsilon$：免除 0 除的小值常數 **三、Constraint Handling** **超過參數範圍時，可以：** - 針對 a, b, c 進行 clipping - 或直接設定 MSE = 999999 給不良適應度 **四、演化** **Generation 0（初始族群）** | # | 染色體 (a,b,c) | 損失值 | | - | ----------- | ------------- | | 1 | (9, 7, 89) | 90 637 | | 2 | (0, 5, 37) | 1 471 720 276 | | 3 | (5, 0, 40) | 654 232 352 | | 4 | (1, 1, 80) | 1 308 312 973 | > 選擇前 2 名進行交配，其餘由交叉＋突變產生。 **Generation 1** | # | 染色體 (a,b,c) | 損失值 | | - | ----------- | ----------- | | 1 | (9, 7, 89) | 90 637 | | 2 | (5, 0, 40) | 654 232 352 | | 3 | (9, 4, 40) | **376** | | 4 | (5, 0, 89) | 654 231 960 | > 376 已相當接近 0，故再演化一次。 **Generation 2** | # | 染色體 (a,b,c) | 損失值 | | - | ----------- | ------------- | | 1 | (9, 4, 40) | 376 | | 2 | (9, 7, 89) | 90 637 | | 3 | (9, 4, 87) | **0 ← 最佳解** | | 4 | (2, 4, 89) | 1 144 694 493 | **結果與驗證** GA 在第二世代即找到 **(a=9, b=4, c=87)**，將其代入可得 $y = 9x^2 + 4x + 87$ 逐一驗證所有 $x$ ，皆能 **完美對應** 觀測值（總損失 = 0）。 ### Genetic Algorithm Implementation 2―YouBike 站間調度路徑與車量最佳化 #### 題目敘述：給予ABCD四個Youbike站點，以及他們的歷史借用記錄，接著假設Youbike公司每天會調度四次車輛（0600, 1200, 1800,2400），請設計一個基因演算法來幫Youbike公司找出ABCD四個站間的最佳調車路徑與數量。請設計出本題基因、Loss function、還有遇到演變限制時的處置方式。 ### 回答： **基因表示** - Route：A → B → C → D 的任意不重複排列。 - Load：長度 4 的整數向量 [xA, xB, xC, xD]，正值裝車、負值卸車，向量總和必須 = 0。 **適應度（要最大化）** - $Fitness=−(𝛼∑∣庫存差∣+𝛽路程+𝛾超載或負庫存罰則)$ **約束處理** - 容量 / 庫存違規：交叉或變異後立即 repair，把超量平均分散到仍有餘裕的站；若修不了就施以高額罰則。 - Load 總和≠0 或站點重複：最後一基因位自動調整成使總和歸零；若 Route 有重複／漏站則重新抽樣。 > 如需同時優化四個時段，將染色體長度乘以 4 或用分層 GA；多目標情境可改用 NSGA-II。 ### 背包問題 - 基因：4 bit 0/1 向量（1 = 放入背包）；例 1010 = 果汁＋牛排。 - Fitness：若總重 ≤ 15 kg → 直接回傳總價值；否則回傳總價值 − 1000 （重罰淘汰）。 - 交配：單點 crossover，機率 0.8；突變：逐 bit 翻轉，機率 0.1／位元。 - 流程：隨機族群 → Tournament 選擇 → 交配＋突變 → 保留最佳個體（Elitism）→ 重複直到收斂，即可得到最佳組合 1100 (果汁+蘋果) 或 1010 (果汁+牛排) 視資料設定而定。