錯誤更正碼介紹和實作考量

# Error-Correcting Codes (ECC) > 貢獻者: RinHizakura, [jouae](https://github.com/jouae), [jserv](https://wiki.csie.ncku.edu.tw/User/jserv) 錯誤更正碼 (Error-Correcting Codes, ECC) 在資料中附加冗餘資訊，使接收端得以偵測並修正傳輸或儲存過程中產生的錯誤。本文從 Shannon 的資訊理論出發，依序介紹 Hamming codes 的編解碼原理與漢明邊界的數學證明，再深入 Reed–Solomon codes 的有限體運算、編碼器與 Peterson–Gorenstein–Zierler 解碼器的推導。本文先以 Python 程式碼展示 $GF(2^8)$ 上的多項式運算，接著探討 Linux 核心 ECC 的實作細節，涵蓋 Galois field 查找表建立、生成多項式計算、syndrome 求值、Berlekamp–Massey 演算法及 Forney 演算法等程式碼分析。 ## 什麼是 ECC？ Error-Correcting Codes (ECC) 廣泛用於通訊傳輸，補正因為資料損壞、雜訊所導致的錯誤。過往當人們談及通訊系統的改進時，幾乎只著眼於如何讓訊號接收得更清楚，像是藉由加大天線或增強訊號的傳輸功率，這些方法都以物理學的角度來改進。 1948 年，[Claude Shannon](https://en.wikipedia.org/wiki/Claude_Shannon) 發表劃時代的論文《通訊的數學理論》([A Mathematical Theory of Communications](https://en.wikipedia.org/wiki/A_Mathematical_Theory_of_Communication))。他指出，通訊系統的改進不必依賴實體設備的提升——只要傳送資料的速度低於通道容量 (channel capacity)，就能將錯誤率控制到幾乎為零。此觀點奠定資訊理論 (information theory) 的基礎，將通訊乃至資料的儲存、壓縮、傳輸帶進全新的紀元。 ```graphviz digraph{ rank = LR data[shape = box, color = white] Encoder[shape = box] Channel[shape = box] Decoder[shape = box] gooddata[label="good data" shape = box color = white] subgraph{ rank = same data->Encoder Encoder->Channel [label = "redundancy" ] Encoder->Channel [label = "data" color=white] Channel->Decoder [label = "noisy redundancy" ] Channel->Decoder [label = "noisy data" color=white] Decoder->gooddata } } ``` 通常伺服器等級主機配置 BMC (一種 out-of-band 系統，裡面往往是小型的 Linux 系統)，後者便於監控硬體的錯誤，且有 EDAC 機制，觸發 ECC 修復記憶體錯誤，從而提升伺服器整體的穩定性。於是乎，主流伺服器品牌會改進其 BMC 裡頭的 ECC 機制。科普短片： * [數學怎麼讓溝通無礙？錯誤更正碼](https://youtu.be/BBxQriJjCEo) * [計算機怎麼知道自己錯了？](https://youtu.be/ey0Yk3I2ARY) 短片 [What are Reed-Solomon Codes? How computers recover lost data](https://youtu.be/1pQJkt7-R4Q) 提供 Reed-Solomon (RS) 在內演算法的視覺化解說。 ### Shannon 熵的由來 Shannon 在論文中提出一個根本問題：如何量化一則訊息所攜帶的「資訊量」？直覺上，愈不可能發生的事件，一旦發生就帶來愈多資訊；反之，幾乎確定會發生的事件幾乎不帶資訊。Shannon 證明，滿足以下三個公理的量度函數是唯一的： 1. $H$ 對所有 $p_i$ 連續 2. 當所有結果等機率時 ($p_i = 1/n$)，$H$ 隨 $n$ 單調遞增 3. 若一個選擇可分解為連續的子選擇，則 $H$ 等於各子選擇的加權和唯一滿足這三個公理的函數即為熵 (entropy)： $$ H(p_1,p_2,\dots,p_n) = -\sum_{i=1}^n p_i\log (p_i) $$ 此名稱借自熱力學，度量的是隨機變數的平均不確定性。以一枚公正硬幣為例，$H(1/2, 1/2) = 1$ bit，表示每次投擲恰好攜帶 1 bit 的資訊。若硬幣不公正，例如 $p(\text{正面}) = 0.9$，則 $H(0.9, 0.1) \approx 0.47$ bit，因為結果較可預測，每次投擲攜帶的資訊較少。 Shannon 進而定義離散通道容量 (channel capacity) 為 $$ C = \lim_{T\rightarrow \infty} \dfrac{\log N(T)}{T} $$ 其中 $N(T)$ 是在時間 $T$ 內可允許的訊號數量。通道容量的意義在於：只要資料傳輸速率低於 $C$，就存在編碼方式使錯誤率趨近於零——這正是 ECC 存在的理論基礎。 > 關於 Shannon 論文更完整的推導，參閱 [通訊領域中的數學理論](https://hackmd.io/@jouae/Shannon_entropy) ## 為什麼會有錯誤 > 參考文獻: > 1. Error Correction Coding Mathematical Methods and Algorithms by Todd K. Moon > 2. Error Control Coding: Fundamentals and Applications by Shu Lin 在 Todd 所著一書中描述：一序列的數個位元可以藉由一對應關係寫成波形 (waveform) 的形式透過通道 (channel) 傳遞。二元相移鍵控 (Binary phase-shift keying, BPSK) 即為其中一種調變形式，從訊號空間的角度來看，可視為振幅係數取 $\pm 1$ 的反極性 (antipodal) 訊號，其中波形的正負號用於表示位元為 `0` 或是 `1`。以 $b_i \in \lbrace 0, 1 \rbrace$ 構成的序列 $$ \lbrace \dots, b_{-2}, b_{-1}, b_0, b_1, b_2, \dots \rbrace $$ 表示傳遞的訊息。其中每個位元到達的時間需要 $T$ 單位時間，換言之，週期為 $T$ ，頻率則為 $\frac{1}{T}$。要以波形的正負號來表示位元為 `0` 或 `1` ，需要先做一個變數變換。以 $\tilde{b_i}$ 表示 $\lbrace 0, 1 \rbrace$ 經過映射至 $\lbrace -1, 1 \rbrace$ 的結果，具體的表示為： $$ \tilde{b_i}=(2b_i - 1) \text{ or } \tilde{b_i}=-(2b_i - 1) $$ 要確立該波形的數學表示式，在已知波形週期 $T$ 時，還需知道波形的振幅 (amplitude)，因此藉由找出一位元對應振幅的映射關係，來確定該波形的振幅。令振幅 $a_i$ 表示為 $$ a_i= \sqrt{E_b}(2b_i-1) = -\sqrt{E_b}(-1)^{b_i} = \sqrt{E_b}\tilde{b_i} $$ 得到該振幅後，接下來要藉由座標系描述訊號中哪些位置對應位元 `0` 或 `1`。先從熟知的一維座標系思考：一維座標系就是數線，構成要素為原點、方向及單位度；二維座標系則是在一維的基礎上，找出兩同原點且互相正交的單位度。那麼以訊號為座標軸時，單位度該如何表示？因此要先定義單位訊號，更具體的說法是單位能量 (unit-energy) 訊號。對一連續時間的訊號，可以藉由對時間積分的方式來描述在該段時間內的能量為多少，在《Signal and System》一書 1.1.2 節提及從伏特 $v(t)$ 與電流 $i(t)$ 流經一電阻 $R$ 的物理角度思考該電阻的瞬時功率為： $$ p(t) = v(t)i(t) = \dfrac{1}{R} v(t)^2 $$ 第二個等式藉由歐姆定律 (Ohm's law) $v=iR$ 得來。在自 $t_0$ 至 $t_1$ 時間內的功耗則可以表示為： $$ \int_{t_0}^{t_1} p(t) dt= \dfrac{1}{R}\int_{t_0}^{t_1} v(t)^2 dt $$ 藉由該物理概念，將一訊號 $s(t)$ 的能量計算方式定義成自身對自身的 $L_2$ 內積 $$ E = \left< s(t), s(t) \right> = \int^{\infty}_{-\infty} s(t)^2 dt. $$ 其中，一單位能量訊號 $\psi_1(t)$ 滿足： $$ \left< \psi_1(t), \psi_1(t) \right> = 1. $$ 要注意的是該訊號 $\psi_1(t)$ 在區間 $[0,T)$ 有[支撐集 (support)](https://en.wikipedia.org/wiki/Support_(mathematics))。簡單來說，在該區間 $t\in[0,T)$ 中 $\psi_1(t)$ 有非零值， $t\in\mathbb{R}- [0,T)$ 中皆為零。藉由該單位能量訊號 $\psi_1(t)$，可假想一個一維訊號座標系，以 $\psi_1(t)$ 作為軸上的單位訊號，該軸上的各點對應不同的振幅係數。這樣的空間稱為訊號空間 (signal space)。所以該位元 $b_i$ 在時間點 $iT$ 到達的波形，能以一訊號表示： $$ a_i\psi_1(t-iT)=\sqrt{E_b}\tilde{b_i}\psi_1(t-iT) $$ 其中， $b_i=0$ 對應到的訊號為 $-\sqrt{E_b}\psi_1(t)$，$b_i=1$ 對應到的訊號為 $\sqrt{E_b}\psi_1(t)$。以 $\psi_1(t)$ 為基底的一維訊號空間中，會有兩個座標點 $\sqrt{E_b}$ 和 $-\sqrt{E_b}$。將此訊號空間繪製成圖，稱為星座圖 (constellation diagram)。 ![1d_signal_space](https://hackmd.io/_uploads/ByNL5G7OA.svg =70%x)  顯而易見，傳送一位元 $b_i$ 需要的能量為： $$ \left< a_i\psi_1(t), a_i\psi_1(t) \right> = E_b $$ 以此方式，可以將有限個位元依連續訊號表示： $$ s(t) = \sum_i a_i\psi_1(t-iT) $$ 這個手法使離散的資訊得以連續訊號傳送。再來討論更一般化的情況：單一訊號波形僅能傳遞一個位元的資訊，即 `0` 和 `1`。假設想要傳遞 `00`、`01`、`10` 和 `11`，該怎麼以連續訊號表示？ ## Single–Error Correcting (SEC) Codes > 此節內容摘錄並重新整理自 [錯誤更正碼簡介](https://w3.math.sinica.edu.tw/math_media/d184/18404.pdf) 其中一類稱為 Single–Error Correcting (SEC) Codes，顧名思義，這個方法可以修正編碼中單一位元的錯誤。一個經典的例子為 [(7, 4) Hamming code](https://en.wikipedia.org/wiki/Hamming(7,4))，意指以 7 個位元編碼的資料中，有 4 個位元是真實的資料，其做法大致為：對於一個資料 `d1 d2 d3 d4`，加上額外的 parity `p1 p2 p3`，此 parity 定義為使下圖中的三個大圓中的 1 數量為偶數，則在僅有至多一碼錯誤的前提下: * 僅有其中一個大圓內 1 的個數不符合，這表示是 `p1` 或 `p2` 或 `p3` 產生錯誤，因此反轉 `p1` 或 `p2` 或 `p3` 即可 * 其中兩個大圓內的 1 的個數不符合，表示是 `d1` 或 `d2` 或 `d3` 產生錯誤，因此反轉 `d1` 或 `d2` 或 `d3` * 若是 `d4` 產生錯誤，三個大圓內的 1 個數皆不符合，反轉 `d4` ![image](https://hackmd.io/_uploads/S16K6Taaa.png) ### Hamming codes 的數學結構以數學的角度來探討，為什麼 (7, 4) Hamming code 可修正一個位元的錯誤呢？考慮集合 $S=\{0,1\}$ 以及其元素 $x_i\in S$，對 $i=1,2,\dots,7$。假設 $d_1,d_2,d_3,d_4$ 對應到 $x_4,x_3,x_2,x_1$，而 $p_3,p_2,p_1$ 對應到 $x_5,x_6,x_7$，則表示其需滿足 (注意這裡的加法指的是二進位單一位元的加法，例如 $1+1=0$)： $$ \left\{ \begin{array}{c} x_1 + x_2 + x_3 + x_5 = 0 \\ x_1 + x_2 + x_4 + x_6 = 0 \\ x_1 + x_3 + x_4 + x_7 = 0 \\ \end{array} \right. $$ 第一個方程對應上述韋恩圖 (Venn diagram) 紅色部分，第二個方程對應藍色部分，第三個方程對應綠色部分。將上述線性方程式寫成矩陣則是： $$H= \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1\\ \end{pmatrix} $$ 令 $C$ 為所有編碼資料的集合即是 H 的 null space，也就是 $\forall x \in C$　$Hx^T = 0$，又 $\text{rank}(H) = 3$，根據 [Rank–nullity theorem](https://en.wikipedia.org/wiki/Rank%E2%80%93nullity_theorem) 則 $\dim(C) = 7 - 3 = 4$ ，表示對於上述式子的解，事實上只要給定 x1~x7 中任意 4 個，其他 3 個就會因應產生，又這任意 4 個可以為任意的 0 或 1，因此共有 $2^4$ = 16 個可能，對應可以編碼的 4 個位元。接著，可從 Hamming distance 和 Hamming weight 來思考。 * Hamming distance: 指兩個 bitstring 0 和 1 相異的個數，例如 x = 0100, y = 1100，則 Hamming distance $d_H(x,y)$ = 1 * Hamming weight: 一個 bitstring 中非 0 的數量，例如 x = 0100，其 Hamming weight $w_{H}(x)$ = 1 * 從這兩個定義可推廣出：x 和 y 的 Hamming distance 事實上就是 x + y 的 Hamming weight 可定義一段 code word 的 minimum distance 為此 code word 集合中任兩個編碼的最小 Hamming distance，也就是 $$ d_{min}{C} = \min \left\{ d_{H}(x,y) \mid x,y \in C, x \neq y \right\} $$ 而 minimum weight，則定義為 $$ w_{\min}(C) = \min \left\{ w_{H}(x)\mid x \in C, x \neq 0 \right\} $$ 由於 null space $C$ 同時也是一個 [linear subspace](https://en.wikipedia.org/wiki/Linear_subspace)，根據加法封閉性，集合中的任兩個 x、y 相加仍會存在集合 C 中，又前面提到 x、y 相加的 Hamming weight 會等同 x、y 的 Hamming distance，因此可以歸納出 $d_{min}{C} = w_{min}{C}$。那麼回過頭看，$C$ 的 minimum weight 會是多少？已知 $C$ 是 $H$ 的 nullspace，由此推理： * 1 是不可能的，因為 H 的任一個 column 中都不存在 [0 0 0] * 2 是不可能的，因為 H 的任兩個 column 各異，任取兩個相加後不會得到 [0 0 0] * 如果取第 1、2、7 個 column 相加，可得 [0 0 0]，因此 $C$ 的 minimum weight / minimum distance 為 3 由於對於 d 個 bits 的錯誤，其 minimum distance 必須至少要是 2d + 1，因此可知這個演算法僅能修正 1 個位元的錯誤 (詳情請參考[錯誤更正碼簡介](https://w3.math.sinica.edu.tw/math_media/d184/18404.pdf)第 7 頁) ### Hamming codes 的矩陣運算 Hamming codes 的編碼與解碼操作可以被轉換成矩陣的運算，首先要先定義編碼的生成矩陣 $G$ 和檢驗的矩陣 $H$ $$ \begin{split} (d4, \space d3, \space d2, \space d1) \\ G = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ \end{pmatrix}\end{split} $$ 可以看到前四行形成一個單位矩陣，後三行則分別對應 p1、p2、p3 所在的大圓，根據該圓包含的 $d_{n}$ 為何，其對應的位元就設為 1。而 $H$ 就是前面定義的矩陣。對於一段資料，舉例來說 p = $[1 \space 1 \space 0\space 1]^T$，其編碼的計算如下。注意，這裡的矩陣運算是對單一位元的加乘法： $$ x = Gp = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ \end{pmatrix}\ \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix}\ = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 0 \\ 1 \\ 0\end{pmatrix}\ $$ 假設經過傳輸後得到的向量為 r，如果在傳輸過程中沒有錯誤的位元產生，則會得到 0 向量： $$ z = Hr = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1\\ \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 0 \\ 1 \\ 0\end{pmatrix}\ = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}\ $$ 如果 r 產生 1 位元的錯誤，舉例來說第 6 個位元有錯，則輸入第 6 個位元有錯的 code word 之後，得到的結果為： $$ z = Hr = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1\\ \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0\end{pmatrix}\ = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}\ $$ 因為僅第 6 個位元為 1 的向量，經 $H$ 運算會得到： $$ z = Hr = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1\\ \end{pmatrix}\begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0\end{pmatrix}\ = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}\ $$ 則表示輸入的 codeword 第 6 個位元出錯，因此需要對其取補數調整，就可得原始的 codeword。(若將 parity 位元擺在 2^n 的位置，事實上這個矩陣運算可得更精巧的結果，詳情請參考[(7, 4) hamming code](https://en.wikipedia.org/wiki/Hamming(7,4))) ### 漢明邊界 hamming bound > 參考自 Introduction to Abstract Algebra by W. Keith Nicholson - 2.11 An Application to Binary Linear Code  令 $B^n=\lbrace 0,1\rbrace \times \dots \times \lbrace 0,1\rbrace$ 為一 $n$ 摺[笛卡爾積](https://en.wikipedia.org/wiki/Cartesian_product) (n-fold cartesian product)，以及集合 $$ B_r(w) = \lbrace v \in B^n \vert d(w,v) < r\rbrace \\ $$ 為以 $w$ 為中心，漢明距離 $d(\cdot,\cdot)$ 為測度 (metric)，蒐集與 $w$ 距離嚴格小於 $r$ 的所有元素。$B_r(w)$ 也可以叫做以 $d(\cdot,\cdot)$ 為距離函數，圓心為 $w$ ，半徑為 $r$ 在 $B^n$ 中的 t-ball，在點集拓樸中該球又稱作一開球 (open ball)。與之相對的為閉球 (closed ball)，差別在於漢明距離可等於 $r$ $$ \bar{B_r}(w) = \lbrace v \in B^n \vert d(w,v) \leq r\rbrace $$ 可用點集拓樸的術語來描述錯誤碼的偵測與糾正。 > 以下節錄自《Introduction to Abstract Algebra by W. Keith Nicholson》第 147 頁假設 $c\in C$ 且 $C$ 最小距離為 $d$ ，換言之 $C$ 集合中任意兩不相同元素的距離不小於 $d$。更進一步的具象來講，對任意的 $c,c'\in C$ 且 $c\neq c'$，以半徑 $\lceil d/2 \rceil$ 的開球不會相交，即 $B_{\lceil d/2 \rceil}(c) \cap B_{\lceil d/2 \rceil}(c') = \emptyset$。同時，以半徑 $d$ 的開球不會包含其他 codeword，即 $B_d(c) \cap C = \lbrace c \rbrace$。假設一個 code word $c$ 被傳送後，接收者得到的 code word $w$ 帶有 $s$ 個錯誤，其中 $1\leq s \leq t$，藉由漢明距離可以得到 $s=d(c,w)$。因此，開球 $B_t(c)$ 在以漢明距離 $t$ 為半徑所圈起來的圓，可以收到最多 $t$ 個錯誤。在此以集合論的方式說明漢明碼的兩個功能「偵測錯誤」與「校正錯誤」：先假設 $C$ 該集合沒有任何 code word 落在以漢明距離 $t$ 為半徑，另一個 code word 的開球中，則 1. 偵測錯誤： $w\in B_t(c)$ 且 $w\neq c$ 時， $w$ 為一個錯誤碼。 2. 校正錯誤：偵測錯誤的條件下，如果再加上每個 code words 形成的半徑 $t$ 開球皆是兩兩不相交。則 $w$ 可以被校正成正確的 code word。依據這個概念，可藉由漢明距離和開球等集合論方式寫出漢明碼偵測錯誤與校正錯誤的能力。下圖將集合 $C$ 與其中元素視覺化，該圖顯示當各 codeword 的閉球互不相交的情況。對所有 $c_i,c_j\in C$，$\bar{B_t}(c_i)\cap \bar{B_t}(c_j) = \emptyset$，其中 $i\neq j$ 且 $t = \lfloor (d-1)/2 \rfloor$。 ![set without any error](https://hackmd.io/_uploads/BJ2TiUL5Wx.png =70%x) :warning: 值得注意的是，一般常用的圓形與半徑距離是在歐氏空間 $\mathbb{R}^2$ 底下以 $l_2$ 範數為距離定義而成的，漢明碼中使用的是漢明距離，實際的「圓形」與「半徑」與圖中不一定相符。此處為便於解釋以常用圖示表示。 > 不同距離函數下的圓形，可參考 [Unit circles with $l_p$-norm](https://en.wikipedia.org/wiki/Lp_space#/media/File:Vector-p-Norms_qtl1.svg) - [ ] 假如 $C$ 為 $B^n$ 一非空子集合，同時 $C$ 的最小距離 (minimum distance) 為 $d$，則： 1. 如果 $t+1 \leq d$ ， $C$ 可以偵測到 $t$ 個錯誤。 2. 如果 $2t+1 \leq d$ ， $C$ 可以校正 $t$ 個錯誤。 * 證明： 1. 對任意 $c\in C$，因為 $t+1 \leq d$，開球 $B_{t+1}(c)$ 中不包含其他 codeword，意即 $B_{t+1}(c) \cap C = \lbrace c\rbrace$。因此若接收到的 $w$ 滿足 $d(c,w) \leq t$，可知 $w$ 不是任何其他的 codeword，從而偵測到錯誤。 2. 藉由反證法。在 $2t+1\leq d$ 條件下，假設存在 $c\neq c' \in C$ 使得以 $t$ 為半徑的閉球有交集，即存在 $w\in \bar{B_t}(c)\cap \bar{B_t}(c')$，藉由三角不等式： $$ d(c,c') \leq d(c,w) + d(w,c') \leq t+t = 2t < 2t+1 \leq d $$ 則得 $d(c,c') < d$，與 $C$ 的最小距離為 $d$ 矛盾。因此所有以 $t$ 為半徑所構成的閉球皆兩兩不相交，$C$ 可以校正 $t$ 個錯誤。接下來要探討的是，閉球 $\bar{B_r}(w)$ 究竟有多少個元素，換言之該集合的[基數 (cardinal number)](https://en.wikipedia.org/wiki/Cardinal_number)為多少？藉由漢明距離的定義，$d(w,v)=i$ 成立時，若且唯若 $w,v$ 兩者有 $i$ 個不同的位元。對於一個 $n$ 位元長的元素，與 $w$ 恰好有 $i$ 個位元不同的元素共有 ${n \choose i}$ 個。藉此，閉球的基數可以表示成： $$ \vert \bar{B_r}(w) \vert = {n \choose 0} + {n \choose 1} + \dots + {n \choose r} $$ - [ ] 漢明邊界：假如 $C$ 為 $B^n$ 一非空子集合，可以校正 $t$ 個錯誤，則 $$ \vert C \vert \leq \dfrac{2^n}{{n \choose 0} + \dots + {n \choose t}} $$ * 證明：令 $N=\vert \bar{B_t}(w) \vert = {n \choose 0} + {n \choose 1} + \dots + {n \choose t}$。對不同 $c\in C$，每個閉球有 $N$ 個元素，又閉球數量總共有 $\vert C \vert$ 個 (因為有 $\vert C \vert$ 個圓心)。因為兩兩閉球皆不相交，則有 $N\vert C \vert$ 個獨特的元素，且不超過 $\vert B^n \vert = 2^n$，因此 $$ N\vert C \vert \leq 2^n $$ 移項後得到 $\vert C \vert \leq \dfrac{2^n}{{n \choose 0} + \dots + {n \choose t}}$ ### 從機率觀點看 Hamming code 在二元對稱通道 (binary symmetric channel, BSC) 中，每個位元獨立地以機率 $p$ 發生翻轉。對於一個可修正 $t$ 個錯誤的 $(n, k)$ 碼，成功解碼的機率為傳輸 $n$ 個位元時至多出現 $t$ 個錯誤的機率： $$ P_{\text{success}} = \sum_{i=0}^{t} \binom{n}{i} p^i (1-p)^{n-i} $$ 以 (7, 4) Hamming code 為例，$t = 1$，在 $p = 0.01$ (每位元 1% 錯誤率) 的通道中： $$ P_{\text{success}} = (1-0.01)^7 + 7 \times 0.01 \times (1-0.01)^6 \approx 0.9980 $$ 相較於不使用任何編碼直接傳送 4 位元的成功率 $(1-0.01)^4 \approx 0.9606$，Hamming code 以 $7/4 = 1.75$ 倍的頻寬代價，將失敗率從約 4% 降至約 0.2%。 ECC 校正能力可從兩個面向量化： * 碼率 (code rate) $R = k/n$ 衡量冗餘開銷。(7, 4) Hamming code 的碼率為 $4/7 \approx 0.571$，意即約 57% 的傳輸量用於實際資料。 * 編碼增益 (coding gain) 在相同錯誤率要求下，使用 ECC 後可降低所需的訊號雜訊比 (SNR)。這個降幅即為編碼增益。 Shannon 的通道編碼定理提供理論上限：對任何速率 $R < C$ (通道容量)，必定存在編碼方式可達到任意低的錯誤率。漢明邊界和 Gilbert–Varshamov 邊界則分別給出碼字數目的上界與下界，界定在特定參數下可達到的糾錯能力範圍。 > 更多細節可參考 Shu Lin and Daniel J. Costello, "Error Control Coding: Fundamentals and Applications," Pearson/Prentice Hall, 2nd edition, 2004. --- ## 基礎代數 ### 二元運算子 binary operation 在一個非空集合 $G$ 定義的二元運算 $*$ 為一函數 $f$ (或稱為映射)，其映射關係為該集合的[笛卡兒積](https://en.wikipedia.org/wiki/Cartesian_product)至該集合本身 $f: G\times G \rightarrow G$ ，另一種更常見的寫法為 $x*y$ ，其中 $x,y$ 屬於 $G$。同時該二元運算滿足： - [ ] 封閉性：對任意元素 $x,y\in G$， $x*y\in G$ * 假如任意 $x,y,z\in G$ 其二元運算 $*$ 滿足： $$ (x*y)*z=x*(y*z) $$ 則稱該二元運算子具有結合律。 * 假如任意 $x,y\in G$ 其二元運算 $*$ 滿足： $$ x*y=y*x $$ 則稱該二元運算具有交換律。 ### 群 group 當一非空集合 $G$ 中的二元運算 $*$ 滿足以下性質時稱之為群： 1. 該二元運算子具有結合律 2. $G$ 存在一單位 (identity) 元素 $e$ ，使得對任意 $x\in G$ 滿足 $x*e=e*x=x$ 3. 對任意 $x\in G$ 存在一反元素 $x'$ ，使得 $x*x'=e$ 例如: $G=\lbrace 0, 1 \rbrace$ 與一運算 $*$ ，有以下映射關係： $$ 1*1 = 0, \quad 0*0 = 0, \quad 1*0 = 0*1 = 1 $$ 該集合的運算子是否為二元運算？假如是，該集合是否為有一二元運算 $*$ 的群？顯而易見，該運算為二元運算，對任意元素的運算結果皆是集合中的元素。同時，$0$ 為單位元素；$1$ 的反元素為自己本身；$0$ 的反元素亦為自己本身。結合律可以藉由列舉的方式說明，在此不贅述。所以 $(G,*)$ 為一群。假如對任意 $x,y\in G$ 該二元運算 $*$ 滿足交換律 $x*y=y*x$ 則稱為阿貝爾群 (abelian group)。顯而易見，上述例子 $(G,*)$ 為一阿貝爾群。 ## Reed–Solomon error correction > 此節內容部分譯自 [Reed-Solomon error-correcting code decoder](https://www.nayuki.io/page/reed-solomon-error-correcting-code-decoder#preliminaries) Reed–Solomon error correction 也是 ECC 的一種，透過在原始資料中加上額外的冗餘資訊，只要傳輸後產生的錯誤數量在可糾正範圍內，即可還原正確資料。該 ECC 演算法被廣泛使用於諸多領域。 ### 先備知識 1. Reed-solomon codes 的操作是對於在 [finite field](https://en.wikipedia.org/wiki/Finite_field) 集合中的 symbol。對於使用者定義的 field $F$ (以電腦領域來說，通常是 $GF(2^8)$)，存在 primitive element / generator $\alpha$，這表示 field 中的所有非零元素可以被表示成不重複的 $\alpha^0, \alpha^1, \dots, \alpha^{\lvert F\rvert -2}$，再加上零元素構成整個 field :::info 對於 primitive element 的敘述可以參考 [Primitive element (finite field)](https://en.wikipedia.org/wiki/Primitive_element_(finite_field)) 和 [Primitive root modulo n](https://en.wikipedia.org/wiki/Primitive_root_modulo_n)。 ::: 2. 定義資料的長度為 $k$，$k$ 需滿足 $1\leq k < \lvert F\rvert$，意即每個要編碼的資料都是來自 field $F$ 的 $k$ 值數量的 block 3. 定義 $m$ 為延伸的糾正碼長度，$m$ 為滿足 $1 \leq m < \lvert F\rvert - k$ 的整數，這可以糾正上至 $\lfloor m/2 \rfloor$ 的錯誤數量 4. 因此，最終定義 $n = k + m$ 為最後的 block size \ codeword 長度，$n$ 需要滿足 $2 \leq n < \lvert F\rvert$，這表示若要糾正大量的錯誤，則需要愈大的 field $F$ ### Systematic encoder 1. Reed-Solomon error-correcting codes 具有不同的形式與處理方式，這裡探討較普遍的 [BCH code](https://en.wikipedia.org/wiki/BCH_code)。這意味著原始資料會視為是一個多項式的係數，而接續的糾正碼也會接續著成為新的多項式項。 2. 根據 $\alpha$ 和 $m$ 定義一個生成多項式 $$ g(x) = \prod_{i=0}^{m-1} (x-\alpha^i) $$ 則此多項式為 m degree，且最大冪次項 $x^m$ 的係數為 1 3. 假設原始的資料是由 $k$ 個值組成的 ($M_0$, $M_1$, ... ,$M_k-1$)，$M_i$ 是 field $F$ 中的一個數值，則將以這 $k$ 個值形成的多項式定義為 $$ M(x) = \sum\limits_{i= 0}^{k-1}{M_ix^i} $$ 4. 則 Reed-solomon codeword 的編碼被定義為 $$ s(x) = M(x)x^m - [(M(x)x^m) \space mod \space g(x)] $$ 這裡可看到，$M(x)x^m$ 會形成 $x^m$, $x^{m+1}$ 到 $x^{m+k-1}$，總共 $k$ 項，而這 $k$ 項的係數其實也是資料的內容。又因為 $[(M(x)x^m) \space mod \space g(x)]$ 的 degree 會是 $m - 1$ 或更低，不會與 $M(x)x^m$ 的計算互動，因此整個 $s(x)$ 的 degree 會為 $m + k - 1$ 也就是 $n - 1$ 5. 因此，最後的編碼即為 $$ s(x) = \sum\limits_{i= 0}^{n-1}{s_ix^i} $$ 多項式的係數，也就是 $(s_0,s_1,...,s_{n-1})$ ### Peterson-Gorenstein-Zierler decoder #### Calculating syndromes 1. 假設收到的編碼為 $(r_0,r_1,...r_{n-1})$，這些編碼形成的多項式為： $$ r(x) = \sum\limits_{i= 0}^{n-1}{r_ix^i} $$ 2. 對於接收者而言，$s_0, s_1, ... , s_{n-1}$ 是未知的，但可定義誤差 $e_0 = r_0 -s_0\text{, } e_1 = r_1 - s_1,...,e_{n-1} = r_{n-1}-s_{n-1}$，則誤差多項式為 $$ e(x) = r(x) - s(x ) = \sum\limits_{i= 0}^{n-1}{e_ix^i} $$ 3. 定義 $m$ 個 syndrome value $S_i$ 中，$0 \leq i < m$ ，則 $$ S_i = r(\alpha^i) = s(\alpha^i) + e(\alpha^i) = 0 + e(\alpha^i) = e(\alpha^i) $$ * 這是因為 $s(x)$ 可以被 $g(x)=(x-\alpha^0)(x-\alpha^1)...(x-\alpha^{m-1})$ 整除(參考 Systematic encoder 的第四點)，代表 syndrome 的值只會根據 error 而改變 * 所以，如果所有的 syndrome value 皆為 $0$ 的話，這表示得到的編碼是正確的，不需要做任何的修正。 4. 如果把所有的 syndrome 方程式都列出來，那就是 $$ \left\{ \begin{array}{r} S_0 =& e_0\alpha^{0 \times 0} + e_1\alpha^{1 \times 0} + ... + e_{n-1}\alpha^{(n-1) \times 0}\\ S_1 =& e_0\alpha^{0 \times 1} + e_1\alpha^{1 \times 1} + ... + e_{n-1}\alpha^{(n-1) \times 1}\\ ... \\ S_{m-1} =& e_0\alpha^{0 \times (m-1)} + e_1\alpha^{1 \times (m-1)} + ... + e_{n-1}\alpha^{(n-1) \times (m-1)} \end{array} \right. $$ 5. 並重寫成矩陣形式： $$ \begin{bmatrix} e_0\alpha^{0 \times 0} + e_1\alpha^{1 \times 0} + ... + e_{n-1}\alpha^{(n-1) \times 0} \\ e_0\alpha^{0 \times 1} + e_1\alpha^{1 \times 1} + ... + e_{n-1}\alpha^{(n-1) \times 1} \\ ...\\ e_0\alpha^{0 \times (m-1)} + e_1\alpha^{1 \times (m-1)} + ... + e_{n-1}\alpha^{(n-1) \times (m-1)} \end{bmatrix} = \begin{bmatrix} S_0 \\ S_1 \\ ... \\ S_{m-1} \end{bmatrix} $$ 再把誤差項單獨取出 $$ \begin{bmatrix} \alpha^{0 \times 0} & \alpha^{1 \times 0} & ... & \alpha^{(n-1) \times 0} \\ \alpha^{0 \times 1} & \alpha^{1 \times 1} & ... & \alpha^{(n-1) \times 1} \\ ... \\ \alpha^{0 \times (m-1)} & \alpha^{1 \times (m-1)} & ... & \alpha^{(n-1) \times (m-1)} \end{bmatrix} \begin{bmatrix} e_0 \\ e_1 \\ ... \\ e_{n-1} \end{bmatrix} = \begin{bmatrix} S_0 \\ S_1 \\ ... \\ S_{m-1} \end{bmatrix} $$ 由於 $\alpha$ 是已知的，且 $S_i$ 也為已知，理論上可對 $e_i$ 求解，不過可以看到對於 $n$ 個未知數，僅有 $m$ 個方程式，顯然該如何得到誤差項 (或者比較準確的說，預測) 就需要一些技巧了 #### Finding error locations 1. 需要先找到錯誤的位置。假設訊息中有 $\nu$ 個錯誤，如前提所述，$\nu$ 需滿足 $1 \leq \nu \leq \lfloor\frac{m}{2}\rfloor$。除非有時間或空間的限制，一般會期望 $\nu$ 可以大一些，以最大化能糾正的錯誤 2. 假設這 $\nu$ 錯誤的位置在 $I_0,I_1,...I_{\nu-1}$，$I_i$ 之間是沒有順序的且滿足 $0 \leq I_i < n$，這表示在這位置上的誤差 $e_{I_0},e_{I_1},...,e_{I_{\nu-1}}$ 為非零值，而其他的 $e_i$ 則必須等於 $0$ 3. 接著，再根據錯誤的地方 $I_i$ 定義，對於 $0 \leq i < \nu$ $$ \begin{aligned} X_i &= \alpha^{I_i} \\ Y_i &= e_{I_i} \end{aligned} $$ 4. 因為其他的 $e_i$ 都為 $0$，因此可以重新定義 syndrome，把前面的矩陣方程式改寫成 $$ \begin{bmatrix} X^{0}_0 & X^{0}_1 & ... & X^{0}_{\nu-1} \\ X^{1}_0 & X^{1}_1 & ... & X^{1}_{\nu-1}\\ ... \\ X^{m-1}_0 & X^{m-1}_1 & ... & X^{m-1}_{\nu-1} \end{bmatrix} \begin{bmatrix} Y_0 \\ Y_1 \\ ... \\ Y_{\nu-1}\end{bmatrix} = \begin{bmatrix} S_0 \\ S_1 \\ ... \\ S_{m-1}\end{bmatrix} $$ 5. 接著，定義一個 error locator polynomial $$ \Lambda(x) = \prod_{i=0}^{\nu-1} (1-X_ix) \\ = 1+\Lambda_{1} x+\Lambda_{2} x^2+...\Lambda_{\nu} x^\nu $$ 6. 因為 $(1-X_iX_i^{-1}) = 0$，因此可知 $$ 0 = \Lambda(X_{i}^{-1}) = 1+\Lambda_{1} X_i^{-1}+\Lambda_{2} X_i^{-2}+...\Lambda_{\nu} X_i^{-\nu} $$ 7. 現在，對於 $0 \leq i < \nu$ 且 j $\in \mathbb{Z}$，將兩側同乘上 $Y_iX_i^{j+\nu}$ $$ 0 = Y_iX_i^{j+\nu}\Lambda(X_{i}^{-1}) = Y_iX_i^{j+\nu} + \Lambda_{1}Y_iX_i^{j+\nu-1} + ... + \Lambda_{\nu}Y_iX_i^{j} $$ 8. 然後把範圍中的所有 $i$ 得到的結果相加 $$ \begin{aligned} 0 &= \sum\limits_{i= 0}^{\nu-1}{Y_iX_i^{j+\nu}\Lambda(X_{i}^{-1})} \\ &= \sum\limits_{i= 0}^{\nu-1}{(Y_iX_i^{j+\nu} + \Lambda_1 Y_iX_i^{j+\nu - 1} + \Lambda_2 Y_iX_i^{j+\nu - 2} + ... +\Lambda_\nu Y_iX_i^{j})}\\ &= \sum\limits_{i= 0}^{\nu-1}{Y_iX_i^{j+\nu}} + \Lambda_1 \sum\limits_{i= 0}^{\nu-1}{Y_iX_i^{j+\nu - 1}} + ... +\Lambda_\nu \sum\limits_{i= 0}^{\nu-1}{Y_iX_i^{j}} \\ &= S_{j+\nu} + \Lambda_1 S_{j+\nu - 1} + ... + \Lambda_\nu S_{j} \end{aligned} $$ 然後再將式子調整成 $$ -S_{j+\nu} = \Lambda_\nu S_{j} +\Lambda_{\nu-1} S_{j+1}+ ... +\Lambda_1 S_{j+\nu - 1} $$ 其中 $0 \leq j < \nu$ 9. 則如果將每個 $j$ 都列出等式： $$ \left\{ \begin{array}{r} \Lambda_\nu S_0 + \Lambda_{\nu-1} S_1 + ... + \Lambda_{1} S_{\nu-1} = -S_\nu \\ \Lambda_\nu S_1 + \Lambda_{\nu-1} S_2 + ... + \Lambda_{1} S_{\nu} = -S_{\nu+1}\\ ...\\ \Lambda_\nu S_{\nu-1} + \Lambda_{\nu-1} S_{\nu} + ... + \Lambda_{1} S_{2\nu-2} = -S_{2\nu-1} \end{array} \right. $$ 一樣可以將此轉換成矩陣運算的形式： $$ \begin{bmatrix} S_0 & S_1 & ... & S_{\nu-1} \\ S_1 & S_2 & ... & S_{\nu} \\ ...\\ S_{\nu-1} & S_\nu & ... & S_{2\nu-2} \\ \end{bmatrix} \begin{bmatrix} \Lambda_\nu \\ \Lambda_{\nu-1} \\ ... \\ \Lambda_{1}\end{bmatrix} = \begin{bmatrix} -S_\nu \\ -S_{\nu+1} \\ ... \\ -S_{2\nu-1}\end{bmatrix} $$ 10. 現在終於得到了可以求解的方程式，可將其寫成擴增矩陣： $$ \begin{bmatrix} S_0 & S_1 & ... & S_{\nu-1} & -S_\nu\\ S_1 & S_2 & ... & S_{\nu} & -S_{\nu+1}\\ ...\\ S_{\nu-1} & S_\nu & ... & S_{2\nu-2} & -S_{2\nu-1}\\ \end{bmatrix} $$ 並且透過 [Gauss-Jordan elimination](https://online.stat.psu.edu/statprogram/reviews/matrix-algebra/gauss-jordan-elimination) 求解。如果對此線性矩陣的解為 [inconsistent](https://en.wikipedia.org/wiki/Consistent_and_inconsistent_equations) (無解)，那表示 codeword 中有超過 $\nu$ 數量的錯誤，因此無法得到解。如果 $\nu$ 可以更大，則應該將其調大並且重新解碼，否則就是無法完成 error correction。如果是 consistent 但為 under-determined (無限多解)，這可能發生於實際錯誤數少於 $\nu$ 時，因為任何一個 $e_i = 0$ 的位置都可以被視為是一個 virtual error，此時就必須把 $\nu$ 降低並重新解碼。 11. 一旦得到係數 $\Lambda_1$、$\Lambda_2$、...$\Lambda_\nu$，即可把 x = $\alpha^{-i}$($0 \leq i < n$) 代入 $\Lambda(x)$ 中進行運算，看 $\Lambda(\alpha^{-i})$ 是否為 0，表示此位置有誤差，即 $e_i$ 不為 0，把這個 i 加入 $I_i$ 如果找到不為 $0$ 的 $e_i$ 多於 $\nu$ 個，表示傳輸後的資料錯誤太多，無法修正。如果少於 $\nu$ 個，表示誤差比預計中的少，只需將 $\nu$ 重新定義為較低的數字，並刪除編號較高的 $I_i$ 變量即可。對於大於等於 $n$ 的 $i$ 毋需被代入，這是因為錯誤不能出現在長度為 $n$ 的資料以外。 #### Calculating error values 1. 現在已得知錯誤的位置 $I_0,I_1,...I_{\nu-1}$，因此 $X_i$ = $\alpha^{I_i}$, $0 \leq i < \nu$ 2. 由於 $Y_i = e_{I_i}$，且 $S_i$ 已知，所以 $$ \begin{bmatrix} X^{0}_0 & X^{0}_1 & ... & X^{0}_{\nu-1} \\ X^{1}_0 & X^{1}_1 & ... & X^{1}_{\nu-1}\\ ... \\ X^{m-1}_0 & X^{m-1}_1 & ... & X^{m-1}_{\nu-1} \end{bmatrix} \begin{bmatrix} Y_0 \\ Y_1 \\ ... \\ Y_{\nu-1}\end{bmatrix} = \begin{bmatrix} S_0 \\ S_1 \\ ... \\ S_{m-1}\end{bmatrix} $$ 由於 $X_i$ 已知，此方程式同樣可透過 Gauss-Jordan elimination 求解。若 linear system 是 consistent 且 independent 的，則存在唯一解，可成功修正錯誤。若為 inconsistent 或 under-determined，則表示錯誤無法被修正。 3. 則對於修正後的訊息 $r'(x) = r'_0x^0+r'_1x^1+...r'_{n-1}x^{n-1}$，對於在錯誤位置 $I_i$ $0 \leq i < \nu$ 的部份，根據 $r'_{I_i} = r_{I_i} \!-\! Y_i = r_{I_i} \!-\! e_{I_i}$ 去修正，其他則維持 $r_i' = r_i$ 即可。 ### 註記解碼部分，Peterson-Gorenstein-Zierler decoder 的時間複雜度為 $\Theta(m^3+mk)$，但在速度的考量上，則應該透過 [Berlekamp-Massey algorithm](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Berlekamp%E2%80%93Massey%20decoder) 求出位置，加上 [Forney algorithm](https://en.wikipedia.org/wiki/Forney_algorithm) 還原錯誤，其中兩者的時間複雜度皆為 $Θ(m^2)$ ## 如何實作 Reed–Solomon codes？ > 此節內容譯自: [Reed–Solomon codes for coders](https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders) ### 基本想法想像資料在傳輸過程中有部分遺失，最終收到的資料是 `a p p * e`。不過可以猜到遺失的字母是什麼，因為翻開字典，最接近的字就是 `apple`。那麼如果收到的資料是 `b * d` 呢？這時候它應該要是 `bad` 還是 `bed`？為了解決這個問題，可在原始資料後面加上一些額外的字母再傳輸，例如： ``` bad -> badxyz bed -> bedqwe ``` 這時候，假如收到的資料有遺失，變成 `b * d x * z`，那麼對照字典，最接近的字應該是 `badxyz`，因此有足夠的證據猜測原始的資料應該是 `bad`。 ### BCH code Reed–Solomon codes 是 [BCH code](https://en.wikipedia.org/wiki/BCH_code) 的一個子類，在前面的章節也有提到，可簡單理解為透過多項式的係數來表達資料。 #### BCH error detection BCH 的錯誤檢查類似整數長除法，不過減法的運算會改用 XOR。舉例來說，對於 000111101011001 這個 formatting information 以及 10100110111 這個 generator，如果可以整除 (餘數為 0)，那就是表示沒有錯誤發生，此時得到的商就是原始的資料，否則的話就表示有錯誤需要被修正。 :::info 因為原文中是以 QR 為例來說明的，關於 [formatting information](https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders#Formatting_information) 請參考原文。可先簡單理解成 000111101011001 的前 5 個位元是真正的資料，後面的 10 個位元則是用來修正錯誤的編碼。 ::: ``` 00011 ---------------- 10100110111 ) 000111101011001 ^ 10100110111 -------------- 010100110111 ^ 10100110111 ------------- 00000000000 ``` 對應的 Python 程式碼實作如下 ```python def qr_check_format(fmt): g = 0x537 # = 0b10100110111 in python 2.6+ for i in range(4,-1,-1): if fmt & (1 << (i+10)): fmt ^= g << i return fmt ``` 也可以反過來用此編碼 5 位元的 formatting information ```python format = 0x3 encoded_format = (format<<10) ^ qr_check_format(format<<10) print(hex(encoded_format)) # 反過來可得 formatting information 000111101011001 ``` #### BCH error correction 如果得到的餘數不為 0，表示資料有誤，那就要「猜出」原本的資料為何。對於 BCH 的解碼有許多演算法，不過這裡可以用一個簡單的方式來理解：因為原始資料只有 32 種可能，因此可一一比較，並以 Hamming distance 最小的那個作為猜出的結果 ```python def hamming_weight(x): weight = 0 while x > 0: weight += x & 1 x >>= 1 return weight def qr_decode_format(fmt): best_fmt = -1 best_dist = 15 for test_fmt in range(0,32): test_code = (test_fmt<<10) ^ qr_check_format(test_fmt<<10) test_dist = hamming_weight(fmt ^ test_code) if test_dist < best_dist: best_dist = test_dist best_fmt = test_fmt elif test_dist == best_dist: best_fmt = -1 return best_fmt ``` 事實上 Reed–Solomon 就是依據類似的概念來解碼的，不過當資料量大時，窮舉搜尋顯然不可行，必須採用如 Berlekamp-Massey 等更有效率的演算法。 ### Finite field arithmetic 要了解 Reed–Solomon，就不得不對 [finite field](https://en.wikipedia.org/wiki/Finite_field) 有一定的了解。需要定義對於一個 8 位元數字的加法、減法、乘法、除法，且經過這些運算後仍為 8 bit。直覺上可能想用普通整數運算再 mod 256，但 $\mathbb{Z}/256\mathbb{Z}$ 並非 field (例如 2 沒有乘法反元素)。正確的做法是將 8 位元數字視為 $GF(2)$ 上的多項式，在一個 8 次不可約多項式下做模運算，這才能構成 $GF(2^8)$。要簡單的說明什麼是 Galois Fields 的話：一個 finite field 就是在有限集合中的一群數字，在這個 field 裡，需定義加法、減法、乘法、除法滿足：[封閉性](https://en.wikipedia.org/wiki/Closure_(mathematics))、[結合律](https://en.wikipedia.org/wiki/Associative_property)、[交換律](https://en.wikipedia.org/wiki/Commutative_property)、[分配律](https://en.wikipedia.org/wiki/Distributive_property)、[Identity](https://simple.wikipedia.org/wiki/Identity_Property) 和 inverse 六大 property。例如，實數域 ℝ 是一個 field，但整數域 ℤ 就不是，因為不滿足乘法的 inverse property (找不到整數域中的 x 可以滿足 x * 4 = 1)。為了處理此問題，可用質數 $p$ 來做 modulo，$\mathbb{Z}/p\mathbb{Z}$ 構成一個 Galois Field。要構造元素數為質數冪 $p^n$ ($n > 1$) 的 field，則需使用多項式環 $GF(p)[x]$ 對不可約多項式取模，而非單純的整數模運算。而 modulo 2 則是有趣的一類，例如這裡的例子中，因為 256 = $2^8$，這是質數 2 的 8 次方，這使得 GF(2^8) 的計算定義有一些特性。 :::warning 細節可以參考[原文](https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders#Introduction_to_mathematical_fields)以及 [Non-prime fields](https://en.wikipedia.org/wiki/Finite_field#Non-prime%20fields) ::: #### 加減法對於 GF(2^n) ，加法和減法可以同時被 XOR 運算所替代，這很符合邏輯，因為相加 modulo 2 就像 XOR，而相減 modulo 2 和相加 modulo 2 並沒有不同。 ``` 0101 + 0110 = 0101 - 0110 = 0101 XOR 0110 = 0011 ``` 或者從多項式的角度去思考 $(x^2 + 1) + (x^2 + x) = 2 x^2 + x + 1 = 0 x^2 + x + 1 = x + 1$ 注意這是因為在 GF(2^n) 中，才可以用 XOR 取代加減法，在其他的 GF 中則不一定如此。 ```python= def gf_add(x, y): return x ^ y def gf_sub(x, y): return x ^ y ``` #### 乘法乘法的運算是基於多項式的乘法，例如 10001001 乘以 00101010： $$ \begin{align} &(x^7 + x^3 + 1) * (x^5 + x^3 + x^1)\\ &= x^7 * (x^5 + x^3 + x^1) + x^3 * (x^5 + x^3 + x^1) + 1 * (x^5 + x^3 + x^1)\\ &= x^{12} + x^{10} + x^6 + x^5 + x^4 + x^3 + x^1 \end{align} $$ 這等同於 ``` 10001001 * 00101010 ---------- 10001001 ^ 10001001 ^ 10001001 ------------- 1010001111010 ``` 注意要把加法替換成 XOR。顯然，此時得到的位元已經超過原本的 8 bits，因此需要 modulo 一個神祕的數字 100011101，同等的操作也可以寫成： ``` 1010001111010 ^ 100011101 ---------------- 0010110101010 ^ 100011101 ---------------- 00111011110 ^ 100011101 --------------- 011000011 ``` 對應的程式碼為： ```python def gf_mult_noLUT(x, y, prim=0): '''Multiplication in Galois Fields without using a precomputed look-up table (and thus it's slower) by using the standard carry-less multiplication + modular reduction using an irreducible prime polynomial''' ### Define bitwise carry-less operations as inner functions ### def cl_mult(x,y): '''Bitwise carry-less multiplication on integers''' z = 0 i = 0 while (y>>i) > 0: if y & (1<<i): z ^= x<<i i += 1 return z def bit_length(n): '''Compute the position of the most significant bit (1) of an integer. Equivalent to int.bit_length()''' bits = 0 while n >> bits: bits += 1 return bits def cl_div(dividend, divisor=None): '''Bitwise carry-less long division on integers and returns the remainder''' # Compute the position of the most significant bit for each integer dl1 = bit_length(dividend) dl2 = bit_length(divisor) # If the dividend is smaller than the divisor, just exit if dl1 < dl2: return dividend # Else, align the most significant 1 of the divisor to the most significant 1 of the dividend (by shifting the divisor) for i in range(dl1-dl2,-1,-1): # Check that the dividend is divisible (useless for the first iteration but important for the next ones) if dividend & (1 << i+dl2-1): # If divisible, then shift the divisor to align the most significant bits and XOR (carry-less subtraction) dividend ^= divisor << i return dividend ### Main GF multiplication routine ### # Multiply the gf numbers result = cl_mult(x,y) # Then do a modular reduction (ie, remainder from the division) with an irreducible primitive polynomial so that it stays inside GF bounds if prim > 0: result = cl_div(result, prim) return result ``` 為什麼需要 mod 100011101 呢，其數學證明可能會有些複雜，不過簡而言之，這是為了確保 $GF(2^8)$ 多項式是 "irreducible" (可以想成是一個質數的多項式，所以不能被任兩個 degree 小的多項式相乘而得到)，這個數字被稱做 primitive polynomial \ irreducible polynomial \ prime polynomial， 100011101 (0x11d) 是其中常用的一種。至於該如何生成這個數字呢，請參考 [Universal Reed-Solomon Codec](https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders/Additional_information#Universal_Reed-Solomon_Codec)。 #### Multiplication with logarithms 雖然前面的作法可以達成 GF 中數字的乘法，然而並不是一個高效率的作法，更好的作法是透過查表，不過 8 位元數字會有 256 種結果，沒必要把 256 x 256 的乘法結果完全記錄下來。下面介紹一種更精簡的作法: 首先，定義 $\alpha$ 為一個 generator number = 2，則對於 $\alpha^i$ 的計算就只是對 $\alpha$ 做位元的位移，然後和 100011101 做 XOR (如果有 overflow) 運算而已。可得如下： $$ \begin{split} & \alpha^0 = 00000001, \alpha^1 = 00000010, \alpha^2 = 00000100, \alpha^3 = 00001000 \\ & \alpha^4 = 00010000, \alpha^5 = 00100000, \alpha^6 = 01000000, \alpha^7 = 10000000 \\ & \alpha^8 = 00011101, \alpha^9 = 00111010, \alpha^{10} = 01110100, \alpha^{11} = 11101000 \\ & \alpha^{12} = 11001101, \alpha^{13} = 10000111, \alpha^{14} = 00010011, \alpha^{15} = 00100110\\ \end{split} $$ 如果依此類推建立表格，從 $\alpha^0$ 直到 $\alpha^{255}$ 前的數值都不會有重複 (這是 GF 的特性)，也就是說，除了 0 之外的數都會是某個 $\alpha^i$。那麼，已知 10001001 乘以 00101010 就會是某兩個 $\alpha$ 相乘 $$ 10001001 * 00101010 = \alpha^{74} \times \alpha^{142} = \alpha^{216} = 11000011 $$ 因此只要建立起對於 $\alpha^i$ 查到的值，以及反過來對於一個值對應的 $\alpha^i$，就可以有效的進行乘法運算了。 ```python gf_exp = [0] * 512 # Create list of 512 elements. In Python 2.6+, consider using bytearray gf_log = [0] * 256 def init_tables(prim=0x11d): '''Precompute the logarithm and anti-log tables for faster computation later, using the provided primitive polynomial.''' # prim is the primitive (binary) polynomial. Since it's a polynomial in the binary sense, # it's only in fact a single galois field value between 0 and 255, and not a list of gf values. global gf_exp, gf_log gf_exp = [0] * 512 # anti-log (exponential) table gf_log = [0] * 256 # log table # For each possible value in the galois field 2^8, we will pre-compute the logarithm and anti-logarithm (exponential) of this value x = 1 for i in range(0, 255): gf_exp[i] = x # compute anti-log for this value and store it in a table gf_log[x] = i # compute log at the same time x = gf_mult_noLUT(x, 2, prim) # If you use only generator==2 or a power of 2, you can use the following which is faster than gf_mult_noLUT(): #x <<= 1 # multiply by 2 (change 1 by another number y to multiply by a power of 2^y) #if x & 0x100: # similar to x >= 256, but a lot faster (because 0x100 == 256) #x ^= prim # substract the primary polynomial to the current value (instead of 255, so that we get a unique set made of coprime numbers), this is the core of the tables generation # Optimization: double the size of the anti-log table so that we don't need to mod 255 to # stay inside the bounds (because we will mainly use this table for the multiplication of two GF numbers, no more). for i in range(255, 512): gf_exp[i] = gf_exp[i - 255] return [gf_log, gf_exp] ``` 稍微瀏覽程式碼，可看出 `gf_exp` 儲存 $\alpha^0$ 到 $\alpha^{254}$ 的數值 (共 255 個非零 field 元素)，而 `gf_log` 則是反向查找表。程式碼將 `gf_exp` 延伸至 512 個元素 (index 255–511 重複 index 0–254)，這樣查表時 `gf_log[x] + gf_log[y]` 超過 254 也能直接取得對應值，不需額外取模。 ```python def gf_mul(x,y): if x==0 or y==0: return 0 return gf_exp[gf_log[x] + gf_log[y]] ``` #### 除法透過對 $\alpha$ 建立的表格，則除法的運算也就可以一併加速了 ```python def gf_div(x,y): if y==0: raise ZeroDivisionError() if x==0: return 0 return gf_exp[(gf_log[x] + 255 - gf_log[y]) % 255] ``` 注意這裡 `+ 255` 可以保證 index > 0 #### Power and Inverse 同樣的，冪次和 inverse 也可以一併加速 ```python def gf_pow(x, power): return gf_exp[(gf_log[x] * power) % 255] def gf_inverse(x): return gf_exp[255 - gf_log[x]] # gf_inverse(x) == gf_div(1, x) ``` #### Polynomials 接下來還需要定義一些對於多項式係數的操作。因為 GF 的元素本身，做為多項式的係數也是個多項式，所以需要注意區分。以下面的多項式為例： $$ 00000001 x^4 + 00001111 x^3 + 00110110 x^2 + 01111000 x^1 + 01000000\\= 01 x^4 + 0f x^3 + 36 x^2 + 78 x^1 + 40 $$ 在程式中，可把這個係數在陣列中寫成 [0x01, 0x0f, 0x36, 0x78, 0x40]，根據實作的不同考量，也可以把順序反過來。則第一個 function 用來與一個 scalar 相乘 ```python def gf_poly_scale(p,x): r = [0] * len(p) for i in range(0, len(p)): r[i] = gf_mul(p[i], x) return r ``` 將兩個多項式相加 (記得用 XOR) ```python def gf_poly_add(p,q): r = [0] * max(len(p),len(q)) for i in range(0,len(p)): r[i+len(r)-len(p)] = p[i] for i in range(0,len(q)): r[i+len(r)-len(q)] ^= q[i] return r ``` 將兩個多項式相乘 ```python= def gf_poly_mul(p,q): '''Multiply two polynomials, inside Galois Field''' # Pre-allocate the result array r = [0] * (len(p)+len(q)-1) # Compute the polynomial multiplication (just like the outer product of two vectors, # we multiply each coefficients of p with all coefficients of q) for j in range(0, len(q)): for i in range(0, len(p)): r[i+j] ^= gf_mul(p[i], q[j]) # equivalent to: r[i + j] = gf_add(r[i+j], gf_mul(p[i], q[j])) # -- you can see it's your usual polynomial multiplication return r ``` 最後，還需要一個可對特定的 x 求出多項式值的 function。這裡使用了 [Horner's method](https://en.wikipedia.org/wiki/Horner%27s_method) 使得 $x^i$ 的計算可以迭代累積，不需要每次都重複計算 ```python def gf_poly_eval(poly, x): '''Evaluates a polynomial in GF(2^p) given the value for x. This is based on Horner's scheme for maximum efficiency.''' y = poly[0] for i in range(1, len(poly)): y = gf_mul(y, x) ^ poly[i] return y ``` 多項式除法較為複雜，留待下一節講解。 --- ## Linux 核心中的 Reed–Solomon codes 前面以 Python 說明 Reed–Solomon codes 的基本實作。接下來探討 Linux 核心中的對應實作 [lib/reed_solomon/](https://github.com/torvalds/linux/tree/master/lib/reed_solomon)，從 [Reed-Solomon Library Programming Interface](https://www.kernel.org/doc/html/latest/core-api/librs.html#reed-solomon-library-programming-interface) 的 API 使用深入到程式內部。下面以 [Reed-Solomon Library Programming Interface](https://www.kernel.org/doc/html/latest/core-api/librs.html#reed-solomon-library-programming-interface) 為出發點，從 API 的使用深入到程式內部。 ### `rs_control` 可在 [include/linux/rslib.h](https://github.com/torvalds/linux/blob/0fe5f9ca223573167c4c4156903d751d2c8e160e/include/linux/rslib.h) 找到相關的資料結構： ```c struct rs_control { struct rs_codec *codec; uint16_t buffers[]; }; ``` * `codec`: 整個 Reed-Solomon 的編碼相關參數 * `buffer`: 在 `decode_rs()` 內部會使用的 buffer ```c struct rs_codec { int mm; int nn; uint16_t *alpha_to; uint16_t *index_of; uint16_t *genpoly; int nroots; int fcr; int prim; int iprim; int gfpoly; int (*gffunc)(int); int users; struct list_head list; }; ``` rs_codec 則是整個 Reed-Solomon 中最關鍵的資料結構： * `mm`: 每個 symbol 總共的位元數量 * `nn`: 每個 block 中共包含幾個 symbol (就是 `(1<<mm)-1`) * `alpha_to`: 為了增加在 [Galois field](https://en.wikipedia.org/wiki/Finite_field) 中乘法運算的效率，會建立每個 $\alpha^i$ 的表 * `index_of`: 同樣考量效率，對於每個值，也要可以反回去找到對應的 $\alpha^i$ * `genpoly`: 生成多項式的係數陣列 * `nroots`: 生成多項式的 degree，也就是額外的 parity 數量 * `fcr`: first consecutive root (index form)。這可以使計算生成多項式時的 consecutive root 起始位置改變，通常設為 0 * 舉例來說，對於 degree 3 的生成多項式，如果指定 fcr = 3，計算出的生成多項式就是 $g(x) = \prod_{i=3}^{5} (x-\alpha^i)$ * 也因此有範圍限制 (例如 255 for GF(2^8)) * `prim`: 表示用 Galois field 中的 primitive element / generator $\alpha^{prim}$ 來當 generator，參照 [Primitive element (finite field)](https://en.wikipedia.org/wiki/Primitive_element_(finite_field)) :::info 在 Linux 的實作中，如果不考慮 `gffunc` 的使用，則 α 固定 = 2。輸入的 `prim` 是對此的 index，因此實際的 generator $\beta$ = $\alpha^{prim}$，於是 $\beta^2 = \alpha^{2 * prim}$... 以此類推。為了方便解說，後面我們也會把這裡設定的 primitive element 稱為 $\beta$ ::: * `iprim`: prim 的模反元素，用於解碼時的 Chien search * `gfpoly`: [Primitive polynomial](https://en.wikipedia.org/wiki/Primitive_polynomial_(field_theory)) 的多項式，這是為了把多項式制約在 Galois field 中。以 `0x409 = 0b0100 0000 1001` 為例，表示 primitive polynomial 為 $x^{10}+x^3+1$ ### `init_rs` 想要使用 Reed–Solomon codes，首先需要初始化相關的資料結構。 ```c static struct rs_control *rs_decoder; rs_decoder = init_rs (10, 0x409, 0, 1, 6); ``` 同樣在 `rslib.h` 中可找到其定義，先來看相關的參數設定： ```c static inline struct rs_control *init_rs(int symsize, int gfpoly, int fcr, int prim, int nroots) { return init_rs_gfp(symsize, gfpoly, fcr, prim, nroots, GFP_KERNEL); } struct rs_control *init_rs_gfp(int symsize, int gfpoly, int fcr, int prim, int nroots, gfp_t gfp) { return init_rs_internal(symsize, gfpoly, NULL, fcr, prim, nroots, gfp); } ``` `init_rs(int symsize, int gfpoly, int fcr, int prim, int nroots)` * `symsize`: 每個 symbol 的位元數量 > 根據 [Reed-Solomon Library Programming Interface](https://www.kernel.org/doc/html/latest/core-api/librs.html) > The databytes are expanded to the given symbol size on the fly. There is no support for decoding continuous bitstreams with a symbolsize != 8 at the moment. If it is necessary it should be not a big deal to implement such functionality. * 其他的參數基本上就對應資料結構 `rs_codec` 中的同名 `init_rs` 會呼叫 `init_rs_gfp`，`init_rs_gfp` 再呼叫 `init_rs_internal`，針對額外的參數說明： ```c init_rs_internal(int symsize, int gfpoly, int (*gffunc)(int), int fcr, int prim, int nroots, gfp_t gfp) ``` * `gffunc`: 在 [non-canonical](https://en.wikipedia.org/wiki/Canonical_form) 模式下生成不同的 $\alpha$，不過在 `init_rs` 下會被設為 NULL(對此的使用還沒概念，暫時先略過) * `gfp`: 使用 [kmalloc](https://www.kernel.org/doc/htmldocs/kernel-api/API-kmalloc.html) 時的 flag，會影響記憶體配置的區域和行為不同 ### `init_rs_internal` 初始化 `rs_control` 結構。 ```c if (symsize < 1) return NULL; if (fcr < 0 || fcr >= (1<<symsize)) return NULL; if (prim <= 0 || prim >= (1<<symsize)) return NULL; if (nroots < 0 || nroots >= (1<<symsize)) return NULL; ``` 進行幾個基本的檢查： * `symsize` 是每個 symbol 的位元數量，因此不能小於 1 * `fcr` 的範圍，參見前面的說明 * `prim` 的範圍，因為 primitive element 不能是 $\alpha^0 = 1$ 也不能是 GF 外的 $\alpha^i$ * 每個 block 只會有 `(1<<symsize)-1` 個 symbol，`nroots` 大於這個範圍是不合理的，小於 0 顯然也是 ```c /* * The decoder needs buffers in each control struct instance to * avoid variable size or large fixed size allocations on * stack. Size the buffers to arrays of [nroots + 1]. */ bsize = sizeof(uint16_t) * RS_DECODE_NUM_BUFFERS * (nroots + 1); rs = kzalloc(sizeof(*rs) + bsize, gfp); if (!rs) return NULL; ``` * decode 的時候，會需要一個固定的大小的 heap buffer 空間，可以看到 rs_control 中使用了 [Arrays of Length Zero](https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html) 技巧，因此可以在執行時期才決定分配的空間。`RS_DECODE_NUM_BUFFERS` = 8。 ```c mutex_lock(&rslistlock); /* Walk through the list and look for a matching entry */ list_for_each(tmp, &codec_list) { struct rs_codec *cd = list_entry(tmp, struct rs_codec, list); if (symsize != cd->mm) continue; if (gfpoly != cd->gfpoly) continue; if (gffunc != cd->gffunc) continue; if (fcr != cd->fcr) continue; if (prim != cd->prim) continue; if (nroots != cd->nroots) continue; /* We have a matching one already */ cd->users++; rs->codec = cd; goto out; } ``` * `list_for_each(tmp, &codec_list)` 逐一走訪整個鏈結串列，檢查是否已經存在相同的編碼結構，如果有，則直接 reference 一個相同的即可，不需要另外 init 一份 ```c /* Create a new one */ rs->codec = codec_init(symsize, gfpoly, gffunc, fcr, prim, nroots, gfp); if (!rs->codec) { kfree(rs); rs = NULL; } out: mutex_unlock(&rslistlock); return rs; ``` * 如果此結構不存在，則呼叫 `codec_init` 去建立一個新的結構 ### `codec_init` ```c int i, j, sr, root, iprim; struct rs_codec *rs; rs = kzalloc(sizeof(*rs), gfp); if (!rs) return NULL; INIT_LIST_HEAD(&rs->list); rs->mm = symsize; rs->nn = (1 << symsize) - 1; rs->fcr = fcr; rs->prim = prim; rs->nroots = nroots; rs->gfpoly = gfpoly; rs->gffunc = gffunc; /* Allocate the arrays */ rs->alpha_to = kmalloc_array(rs->nn + 1, sizeof(uint16_t), gfp); if (rs->alpha_to == NULL) goto err; rs->index_of = kmalloc_array(rs->nn + 1, sizeof(uint16_t), gfp); if (rs->index_of == NULL) goto err; rs->genpoly = kmalloc_array(rs->nroots + 1, sizeof(uint16_t), gfp); if(rs->genpoly == NULL) goto err; ``` * 初始化一個 `rs_codec` 所需的空間，並透過 `INIT_LIST_HEAD` 初始化這個鏈結串列節點 * 初始化相關成員與陣列，關於這些成員的作用請參考前面 ```c /* Generate Galois field lookup tables */ rs->index_of[0] = rs->nn; /* log(zero) = -inf */ rs->alpha_to[rs->nn] = 0; /* alpha**-inf = 0 */ if (gfpoly) { sr = 1; for (i = 0; i < rs->nn; i++) { rs->index_of[sr] = i; rs->alpha_to[i] = sr; sr <<= 1; if (sr & (1 << symsize)) sr ^= gfpoly; sr &= rs->nn; } } else { /* else case 在 init_rs_non_canonical 才會用到， * 此處先略過 */ ``` * 定義 `rs->alpha_to[rs->nn] = 0` * 接著需要初始化 $\alpha^0$、$\alpha^1$、...、$\alpha^{2^{symsize}-2}$ 對應的數值 (共 `nn` 個非零 field 元素)，以及反過來的查找，以加速後續編碼與解碼的運算。`alpha_to[nn]` 是零元素的哨兵值 * 因為 $\alpha$ = 2，因此每次對 $\alpha^i$ 的運算就是把 $\alpha^{i-1}$ 左移 1 bit * `if (sr & (1 << symsize))` 判斷是否有 overflow，如果有，則需要額外對 `gfpoly`，也就是 primitive polynomial 做 XOR 運算，最後則 `sr &= rs->nn` 以確保得到的數字仍在 GF 中 ```c /* If it's not primitive, exit */ if(sr != rs->alpha_to[0]) goto err; /* Find prim-th root of 1, used in decoding */ for(iprim = 1; (iprim % prim) != 0; iprim += rs->nn); /* prim-th root of 1, index form */ rs->iprim = iprim / prim; ``` * 如果 sr 不等於 $\alpha^0$，表示 $\alpha^i$ 建立起的集合並非 GF，有錯誤 * 這裡計算出 decode 用到的 `iprim`，即 `prim` 對 `nn` 的模反元素 ```c /* Form RS code generator polynomial from its roots */ rs->genpoly[0] = 1; for (i = 0, root = fcr * prim; i < nroots; i++, root += prim) { rs->genpoly[i + 1] = 1; /* Multiply rs->genpoly[] by @**(root + x) */ for (j = i; j > 0; j--) { if (rs->genpoly[j] != 0) { rs->genpoly[j] = rs->genpoly[j -1] ^ rs->alpha_to[rs_modnn(rs, rs->index_of[rs->genpoly[j]] + root)]; } else rs->genpoly[j] = rs->genpoly[j - 1]; } /* rs->genpoly[0] can never be zero */ rs->genpoly[0] = rs->alpha_to[rs_modnn(rs, rs->index_of[rs->genpoly[0]] + root)]; } ``` * genpoly 的 index 從 0 到 rs->nroots，對應多項式的常數項到 $x^{nroots}$ 的係數 * 建立生成多項式，已知生成多項式為 $g(x) = \prod_{i=fcr}^{fcr+nroots-1} (x-\beta^i)$，這裡顯然是在對其進行計算 * 對於 `for (j = i; j > 0; j--)` 迴圈中的運算，可以理解為例如當下的 genpoly 對應的多項式可能是 $a_j+ a_{j-1}x^1...+ a_2x^{j-2} + a_1x^{j-1}$，然後要乘上下一個 $(x-\beta^{root})$，因此新的多項式係數，假設是 $x$ 項好了，就會變成 $a_{j-1} \times \beta^{root} + a_j \times 1$ * 呈上，$a_{j-1}$ 其實就是某一個 $\alpha^i$，而 $\beta^{root}$ 也是某個 $\alpha^j$，則因為 $\alpha^{(i+j)}$ 可能會超出能 index 到的範圍，因此就要去使用 `rs_modnn(rs,rs->index_of[rs->genpoly[j]] + root)` 去調整 ```c for (i = 0; i <= nroots; i++) rs->genpoly[i] = rs->index_of[rs->genpoly[i]]; rs->users = 1; list_add(&rs->list, &codec_list); return rs; err: kfree(rs->genpoly); kfree(rs->index_of); kfree(rs->alpha_to); kfree(rs); return NULL; ``` * 前面計算出來的 `genpoly` 是係數的真實值，為了方便計算，透過 `index_of` 改成用 index 來代表 * `rs->users = 1` 計算在 reference 這個結構的數量 ### `rs_modnn` ```c static inline int rs_modnn(struct rs_codec *rs, int x) { while (x >= rs->nn) { x -= rs->nn; x = (x >> rs->mm) + (x & rs->nn); } return x; } ``` * 如果 x 大於 `rs->nn = (1<<mm)-1`，表示 overflow 所以不存在定義的 GF 中，此時需調整成 `x % rs->nn` * 這裡對取餘運算進行了優化，可以參考[Mersenne Numbers: Mod 3, Mod 7, Mod ($2^i$ – 1)](http://homepage.divms.uiowa.edu/~jones/bcd/mod.shtml#exmod3) (沒有找到比較明確的數學證明，待補) ### `encode_rs8` 初始化 RS 相關的資料結構後，接著就可以進行編碼了，下面以 `encode_rs8` 為範例 ```c /* Parity buffer. Size = number of roots */ uint16_t par[6]; /* Initialize the parity buffer */ memset(par, 0, sizeof(par)); /* Encode 512 byte in data8. Store parity in buffer par */ encode_rs8 (rs_decoder, data8, 512, par, 0); ``` ```c int encode_rs8(struct rs_control *rsc, uint8_t *data, int len, uint16_t *par, uint16_t invmsk) ``` * `rsc` 就是前面初始化過的結構 * `data` 是要被 encode 的資料，其長度為 `len` * `par` 則是呈裝 parity data 的地方，注意需要由使用者自行初始化 * `invmsk` 可在編碼時加上一個 mask (關於為什麼需要 mask，可參考 [Reed–Solomon codes for coders: Masking](https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders#Masking)) ```c pad = nn - nroots - len; if (pad < 0 || pad >= nn) return -ERANGE; ``` * 首先檢查 `nn - nroots - len`，也就是檢查資料數量加上 parity 數量是否超過總 symbol 數量，這是因為編碼數量必須小於 GF 的元素總量 $\lvert F\rvert$ ```c for (i = 0; i < len; i++) { fb = index_of[((((uint16_t) data[i])^invmsk) & msk) ^ par[0]]; /* feedback term is non-zero */ if (fb != nn) { for (j = 1; j < nroots; j++) { par[j] ^= alpha_to[rs_modnn(rs, fb + genpoly[nroots - j])]; } } /* Shift */ memmove(&par[0], &par[1], sizeof(uint16_t) * (nroots - 1)); if (fb != nn) { par[nroots - 1] = alpha_to[rs_modnn(rs, fb + genpoly[0])]; } else { par[nroots - 1] = 0; } } ``` 以編碼而言，事實上就是把資料對生成多項式做長除法，並把餘數記錄到 `par` 中。為了儘量說清楚，這裡借用 [Polynomial_division](https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders#Polynomial_division) 為例來解說。假設資料是 `12 34 56`，生成多項式為 `01 0f 36 78 40`，`par` 的大小為 4 個 uint8: ``` 12 da df --------------------- 01 0f 36 78 40 ) 12 34 56 00 00 00 00 ^ 12 ee 2b 23 f4 ---------------- da 7d 23 f4 00 ^ da a2 85 79 84 ----------------- df a6 8d 84 00 ^ df 91 6b fc d9 ---------------- 37 e6 78 d9 ``` * 先假設沒有設 invmsk * 那麼最外層迴圈的第 1 層，i = 0，一開始的 `par` 成員就是 `00 00 00 00` ，`fb` 先取出 `12` * 在第 5 行的 for 迴圈要計算出後面的 `ee 2b 23`，這會是透過 `par[j] ^= alpha_to[rs_modnn(rs, fb + genpoly[nroots - j])]` 來得到 * 要計算 `12` * `0f 36 78` 為何。在這裡 `12` 就是 $\alpha^{fb}$，而 generator 的 `0f 36 78` 會對應某個 $\alpha^{genpoly[nroots - j]}$ * 因此要計算的 `12` * `0f 36 78` 對應每個 $\alpha^{fb+{genpoly[nroots - j]}}$，也就是程式中的 `alpha_to[rs_modnn(rs, fb + genpoly[nroots - j])];` * 至此的 par 是 `12 ee 2b 23`，接著將它向左位移 (也就是這裡的 `memmove`)，變成 `ee 2b 23 23` (雖然這裡說位移，不過最後一個 byte 沒有變成 0，這是因為使用 memmove，不過稍後它就會被覆蓋掉) * 第 13 行的概念如前，因此 par 變成 `ee 2b 23 f4` * 來到最外層迴圈的第二層，i = 1，fb 會是 `34 ^ ee = da`，然後就依尋前面的步驟，一直做到處理完整個 data * 最後得到的餘數會在 parity 中，那就是編碼的 parity ### `decode_rs8` 最複雜的部份應該就是這裡了。對於 RS 的 decode，可以有許多不同的演算法，大致總結成 5 點： 1. 計算出 syndromes polynomial，透過 Berlekamp-Massey 等演算法檢查資料是否有錯誤產生 2. 根據 syndromes 計算出 erasure/error locator polynomial，用 Berlekamp-Massey 找到確切是哪些位置產生了錯誤 3. 根據前兩者算出 erasure/error evaluator polynomial，用於後續計算各錯誤位置的誤差大小 4. 由前述多項式計算各錯誤位置的 error magnitude (誤差值) 5. 將計算出的誤差值從接收到的資料中去除，即可還原原始資料 ```c int decode_rs8(struct rs_control *rsc, uint8_t *data, uint16_t *par, int len, uint16_t *s, int no_eras, int *eras_pos, uint16_t invmsk, uint16_t *corr) ``` * `rsc` 就是前面初始化好的 `rs_control` 結構 * 輸入長度為 `len` 的 `data` 跟 `par` * `s`: 解碼的過程中會需要 syndromes polynomial，找出發生錯誤的位置，因為 syndromes 的計算可能可以用硬體來計算，因此 API 提供此參數可以直接依據這個 syndromes 來修正錯誤 * 下面的範例先假設這個欄位為 NULL，也就是需要軟體負責計算 syndromes * `no_eras`、`eras_pos`: 有些 channel 中，錯誤會發生的情形是可以預知的，這種錯誤被稱做 erasure。要修正一個 erasure，只要添加一個 symbol 就可以做 [erasure correction](https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders#Erasure_correction) 了 * 因此可告知 decoder 相關的訊息來做 erasure 修正 * 如果不知道錯誤的位置，那就稱為 error，這種情況下，就需要添加 2 個 symbol 才可以做修正 * `invmsk` 就等於 encode 時所使用的 * `corr`：根據 `eras_pos` 可以要做 correction 相關的 bitmsk * 所以這個 API 可以不輸入 data 和 parity，僅輸入 `s`、`eras_pos` 等得到 `corr` 的 mask，再透過自定義的程式去處理問題 ```c enum { RS_DECODE_LAMBDA, RS_DECODE_SYN, RS_DECODE_B, RS_DECODE_T, RS_DECODE_OMEGA, RS_DECODE_ROOT, RS_DECODE_REG, RS_DECODE_LOC, RS_DECODE_NUM_BUFFERS }; ... /* * The decoder buffers are in the rs control struct. They are * arrays sized [nroots + 1] */ uint16_t *lambda = rsc->buffers + RS_DECODE_LAMBDA * (nroots + 1); uint16_t *syn = rsc->buffers + RS_DECODE_SYN * (nroots + 1); uint16_t *b = rsc->buffers + RS_DECODE_B * (nroots + 1); uint16_t *t = rsc->buffers + RS_DECODE_T * (nroots + 1); uint16_t *omega = rsc->buffers + RS_DECODE_OMEGA * (nroots + 1); uint16_t *root = rsc->buffers + RS_DECODE_ROOT * (nroots + 1); uint16_t *reg = rsc->buffers + RS_DECODE_REG * (nroots + 1); uint16_t *loc = rsc->buffers + RS_DECODE_LOC * (nroots + 1); /* Check length parameter for validity */ pad = nn - nroots - len; BUG_ON(pad < 0 || pad >= nn); ``` * 還記得前面預先拿到的 `rs->buffers` 嗎？當時 allocate 了 `8 * (nroots + 1)` uint16_t 大小的空間，現在將它切成 8 等份給不同的需求使用 * encode 時相同的理由，做對應的檢查 ```c /* Does the caller provide the syndrome ? */ if (s != NULL) { for (i = 0; i < nroots; i++) { /* The syndrome is in index form, * so nn represents zero */ if (s[i] != nn) goto decode; } /* syndrome is zero, no errors to correct */ return 0; } ``` * 如果 caller 有輸入 syndrome 參數的話，檢查 syndrome 是否為 0 * 因為 syndrome 是 index 的，而初始化時也定義了 `rs->alpha_to[rs->nn] = 0`，所以這邊只要檢查 syndrome 是否等於 `nn` 即可 ```c /* form the syndromes; i.e., evaluate data(x) at roots of * g(x) */ for (i = 0; i < nroots; i++) syn[i] = (((uint16_t) data[0]) ^ invmsk) & msk; for (j = 1; j < len; j++) { for (i = 0; i < nroots; i++) { if (syn[i] == 0) { syn[i] = (((uint16_t) data[j]) ^ invmsk) & msk; } else { syn[i] = ((((uint16_t) data[j]) ^ invmsk) & msk) ^ alpha_to[rs_modnn(rs, index_of[syn[i]] + (fcr + i) * prim)]; } } } for (j = 0; j < nroots; j++) { for (i = 0; i < nroots; i++) { if (syn[i] == 0) { syn[i] = ((uint16_t) par[j]) & msk; } else { syn[i] = (((uint16_t) par[j]) & msk) ^ alpha_to[rs_modnn(rs, index_of[syn[i]] + (fcr+i)*prim)]; } } } s = syn; /* Convert syndromes to index form, checking for nonzero condition */ syn_error = 0; for (i = 0; i < nroots; i++) { syn_error |= s[i]; s[i] = index_of[s[i]]; } if (!syn_error) { /* if syndrome is zero, data[] is a codeword and there are no * errors to correct. So return data[] unmodified */ return 0; } ``` * 如果沒有提供 `s` 參數，就要自行計算 syndromes polynomial * syndromes polynomial 的計算就是把 data(包含資料部份和 parity) 當成多項式，然後代入 $\beta^{fcr}$ 到 $\beta^{fcr+nroots-1}$ (共 nroots 個點) 看看是否為 0 * 這邊有使用 [Horner's method](https://en.wikipedia.org/wiki/Horner%27s_method) 避免直接計算 $x^n$ 做優化 * 把計算完的轉換為 index 形式方便後續的處理 * `syn_error` 計算是不是所有 syndrome 都是 0，如果不是就要去修正 ```c decode: memset(&lambda[1], 0, nroots * sizeof(lambda[0])); lambda[0] = 1; if (no_eras > 0) { /* Init lambda to be the erasure locator polynomial */ lambda[1] = alpha_to[rs_modnn(rs, prim * (nn - 1 - (eras_pos[0] + pad)))]; for (i = 1; i < no_eras; i++) { u = rs_modnn(rs, prim * (nn - 1 - (eras_pos[i] + pad))); for (j = i + 1; j > 0; j--) { tmp = index_of[lambda[j - 1]]; if (tmp != nn) { lambda[j] ^= alpha_to[rs_modnn(rs, u + tmp)]; } } } } for (i = 0; i < nroots + 1; i++) b[i] = index_of[lambda[i]]; ``` * 如果已知哪些位置出錯，如前所述要做 erasure correction * 首先初始 $\Lambda(x)$ = 1 * 如果有 erasure，則要計算 erasure locator polynomial $\Lambda(x) = \prod (1-\beta^ix)$ * 其中 i 的值是根據錯誤位置計算 * 要注意到例如資料是 `01 02 03 04`，對應的多項式是 $01x^3 + 02x^2 + 03x + 04$，index 0 的地方是多項式的最高次方，但是如果錯誤是發生在 `03` (index = 2) 的地方，要代入的是 $\beta^1$，`nn - 1 - (eras_pos[0] + pad)` 就是在做此計算 * 則迴圈的計算邏輯就和前面的 g(x) 計算大致相同 * `b` 對應 Berlekamp-Massey algorithm 的多項式 B(x)，用 lamda 的 $\alpha$ index 初始化 ```c /* * Begin Berlekamp-Massey algorithm to determine error+erasure * locator polynomial */ r = no_eras; el = no_eras; while (++r <= nroots) { /* r is the step number */ /* Compute discrepancy at the r-th step in poly-form */ discr_r = 0; for (i = 0; i < r; i++) { if ((lambda[i] != 0) && (s[r - i - 1] != nn)) { discr_r ^= alpha_to[rs_modnn(rs, index_of[lambda[i]] + s[r - i - 1])]; } } discr_r = index_of[discr_r]; /* Index form */ ... ``` * 需透過 [Berlekamp-Massey algorithm](https://en.wikipedia.org/wiki/Berlekamp%E2%80%93Massey_algorithm) 來找出未知位置的錯誤 (error locator polynomial) * 迴圈要進行 `nroots` 減去已知的錯誤次(換句話說，如果沒有 erasure 的話就是進行 nroots 次) * 首先要計算每一步的 discrepancy，這個數字等於 $\sum_{i=0}^{r-1}\Lambda_iS_{r-i-1}$ ```c ... if (discr_r == nn) { /* 2 lines below: B(x) <-- x*B(x) */ memmove (&b[1], b, nroots * sizeof (b[0])); b[0] = nn; } else { /* 7 lines below: T(x) <-- lambda(x)-discr_r*x*b(x) */ t[0] = lambda[0]; for (i = 0; i < nroots; i++) { if (b[i] != nn) { t[i + 1] = lambda[i + 1] ^ alpha_to[rs_modnn(rs, discr_r + b[i])]; } else t[i + 1] = lambda[i + 1]; } if (2 * el <= r + no_eras - 1) { el = r + no_eras - el; /* * 2 lines below: B(x) <-- inv(discr_r) * * lambda(x) */ for (i = 0; i <= nroots; i++) { b[i] = (lambda[i] == 0) ? nn : rs_modnn(rs, index_of[lambda[i]] - discr_r + nn); } } else { /* 2 lines below: B(x) <-- x*B(x) */ memmove(&b[1], b, nroots * sizeof(b[0])); b[0] = nn; } memcpy(lambda, t, (nroots + 1) * sizeof(t[0])); } } ``` * `discr_r == nn` 表示原求的 discrepancy `d` 為 0 * 此時 $B(x) \Longleftarrow x*B(x)$ * 然後前進到下個 iteration * 如果 discrepancy 不為 0: * $\Lambda'(x) \Longleftarrow \Lambda(x) - d \times x \times B(x)$ * 可以看到程式中的 `t` 就是 $\Lambda(x) - d \times x \times B(x)$ 的暫存空間 $\Lambda'(x)$ * 每個 iteration 中的 `r + no_eras` 表示目前是第幾個 iteration (從 1 開始數) * `el` 的用途是用來判斷某些 iteration 中需要額外的調整 `B(x)`，在條件滿足的 `el` 下： * $el \Longleftarrow$ 現在的 iteration 數量 $- \space el$ * 此時 $B(x) \Longleftarrow d^{-1}\Lambda(x)$ * 除此之外 $B(x) \Longleftarrow x*B(x)$ * 最後，$\Lambda(x) \Longleftarrow \Lambda'(x)$，然後進入下一個 iteration ```c /* Convert lambda to index form and compute deg(lambda(x)) */ deg_lambda = 0; for (i = 0; i < nroots + 1; i++) { lambda[i] = index_of[lambda[i]]; if (lambda[i] != nn) deg_lambda = i; } if (deg_lambda == 0) { /* * deg(lambda) is zero even though the syndrome is non-zero * => uncorrectable error detected */ return -EBADMSG; } ``` * 最終會得到一個 erasure /error locator polynomial $\Lambda(x)$ * 把 lambda 的值轉換成 index 形式，並且計算是幾階的多項式 `deg_lambda` * 如果結果為 0 階但是 syndrome 卻是非零值，那表示此錯誤不能被修正 ```c /* Find roots of error+erasure locator polynomial by Chien search */ memcpy(&reg[1], &lambda[1], nroots * sizeof(reg[0])); count = 0; /* Number of roots of lambda(x) */ for (i = 1, k = iprim - 1; i <= nn; i++, k = rs_modnn(rs, k + iprim)) { q = 1; /* lambda[0] is always 0 */ for (j = deg_lambda; j > 0; j--) { if (reg[j] != nn) { reg[j] = rs_modnn(rs, reg[j] + j); q ^= alpha_to[reg[j]]; } } if (q != 0) continue; /* Not a root */ if (k < pad) { /* Impossible error location. Uncorrectable error. */ return -EBADMSG; } /* store root (index-form) and error location number */ root[count] = i; loc[count] = k; /* If we've already found max possible roots, * abort the search to save time */ if (++count == deg_lambda) break; } ``` * 則對於 erasure /error locator polynomial $\Lambda(x)$，其根 ($\beta^{-i}$ 使得 $\Lambda(\beta^{-i}) = 0$) 就表示錯誤發生的地方是 len(msg) - 1 - i * 這裡使用了 [Chien search](https://en.wikipedia.org/wiki/Chien_search) 加速求根 * 若錯誤位置落在 padding 範圍內 (即 `k < pad`)，表示不可能的錯誤位置，回傳不可修正錯誤 ```c if (deg_lambda != count) { /* * deg(lambda) unequal to number of roots => uncorrectable * error detected */ return -EBADMSG; } /* * Compute err+eras evaluator poly omega(x) = s(x)*lambda(x) (modulo * x**nroots). in index form. Also find deg(omega). */ deg_omega = deg_lambda - 1; for (i = 0; i <= deg_omega; i++) { tmp = 0; for (j = i; j >= 0; j--) { if ((s[i - j] != nn) && (lambda[j] != nn)) tmp ^= alpha_to[rs_modnn(rs, s[i - j] + lambda[j])]; } omega[i] = index_of[tmp]; } ``` * 如果找到的根數量和 $\Lambda$ 的階數不一致，表示錯誤無法被修正 * 接下來要求 error / evaluator polynomial $\Omega(x) = S(x) * \Lambda(x) \space mod \space x^{nroots}$，且存成 index form ```c /* * Compute error values in poly-form. num1 = omega(inv(X(l))), num2 = * inv(X(l))**(fcr-1) and den = lambda_pr(inv(X(l))) all in poly-form * Note: we reuse the buffer for b to store the correction pattern */ num_corrected = 0; for (j = count - 1; j >= 0; j--) { num1 = 0; for (i = deg_omega; i >= 0; i--) { if (omega[i] != nn) num1 ^= alpha_to[rs_modnn(rs, omega[i] + i * root[j])]; } if (num1 == 0) { /* Nothing to correct at this position */ b[j] = 0; continue; } num2 = alpha_to[rs_modnn(rs, root[j] * (fcr - 1) + nn)]; den = 0; /* lambda[i+1] for i even is the formal derivative * lambda_pr of lambda[i] */ for (i = min(deg_lambda, nroots - 1) & ~1; i >= 0; i -= 2) { if (lambda[i + 1] != nn) { den ^= alpha_to[rs_modnn(rs, lambda[i + 1] + i * root[j])]; } } b[j] = alpha_to[rs_modnn(rs, index_of[num1] + index_of[num2] + nn - index_of[den])]; num_corrected++; } ``` * [Forney algorithm](https://en.wikipedia.org/wiki/Forney_algorithm) ```c /* * We compute the syndrome of the 'error' and check that it matches * the syndrome of the received word */ for (i = 0; i < nroots; i++) { tmp = 0; for (j = 0; j < count; j++) { if (b[j] == 0) continue; k = (fcr + i) * prim * (nn-loc[j]-1); tmp ^= alpha_to[rs_modnn(rs, index_of[b[j]] + k)]; } if (tmp != alpha_to[s[i]]) return -EBADMSG; } ``` * 經過前面的步驟後，這裡會得到誤差多項式 e(x) (重用 `b` 來保存係數) $$ e(x) = e_0 + e_1x + ... + e_{n-1}x^{(n-1)} $$ * 理論上，前面得到的 syndromes $S_i$ 應該要等於 $e(\beta^i)$，這裡再做額外的檢查以確保正確性 ```c /* * Store the error correction pattern, if a * correction buffer is available */ if (corr && eras_pos) { j = 0; for (i = 0; i < count; i++) { if (b[i]) { corr[j] = b[i]; eras_pos[j++] = loc[i] - pad; } } } else if (data && par) { /* Apply error to data and parity */ for (i = 0; i < count; i++) { if (loc[i] < (nn - nroots)) data[loc[i] - pad] ^= b[i]; else par[loc[i] - pad - len] ^= b[i]; } } return num_corrected; ``` * 如果 input 是 `corr` 和 `eras_pos` 的話，就把該誤差多項式填入 `corr` 並把有錯的位置填入 `eras_pos`，當成回傳 * 如果是 input 是 `data` 和 `par` 的話，那就是看誤差項的位置在 `data` 還是 `par`，並且將結果修正回去 ## 參考資料 * [錯誤更正碼簡介](https://w3.math.sinica.edu.tw/math_media/d184/18404.pdf) * [Hamming(7,4)](https://en.wikipedia.org/wiki/Hamming(7,4)) * [Reed-Solomon error-correcting code decoder](https://www.nayuki.io/page/reed-solomon-error-correcting-code-decoder#preliminaries) * [Reed–Solomon error correction(wikipedia)](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction) * [Reed–Solomon codes for coders](https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders) * [Why use ECC?](https://danluu.com/why-ecc/)