###### tags: `ADA 7.4` `Huffman Code` # ADA 7.4: Huffman Coding 字元編碼問題 Textbook Chapter 16.3 Chapter 4.8 in Algorithm Design by Kleinberg & Tardos ## Encoding & Decoding 在這裡Code是編碼規則的意思,不是程式碼。 ```mermaid flowchart LR A(&nbsp) A -->|input\nmessage| B B(Encoder) B -->|encoded\nmessage| C C(Decoder) C -->|output\nmessage| D D(&nbsp) style A opacity: 0 style B fill: orange style C fill: orange style D opacity: 0 ``` :::info Goal * Enable communication and storage * Detect or correct errors introduced during transmission * compress data: lossy or lossless ::: ```mermaid flowchart LR A(Snoopy) A --> B B(Encoder) B -->|536E6F6F7079\n| C C(Decoder) C --> D D(Snoopy) style A opacity: 0 style B fill: orange style C fill: orange style D opacity: 0 ``` ### Lossy Data Compression: Autoencoder 利用人工智慧網路來做壓縮 但無法保證可以完全一樣,會失真的意思 ## Lossless Data Compression :::info Goal: encode each symbol using a unique binary code (w/o ambiguity) * How to represent symbols? * How to ensure decode(encode(x))=x? * How to minimize the number of bits? ::: ```mermaid flowchart TB note[find a binary tree] A((&nbsp)) A -->|0| B B((&nbsp)) A -->|1| C C((&nbsp)) B -->|0| D D((A)) B -->|1| E E((G)) C -->|0| F F((T)) C -->|1| G G((C)) style A fill: grey style B fill: grey style C fill: grey style D fill: red style E fill: green style F fill: blue style G fill: yellow ``` ``` 10101101011010100101010010 T T C G G T T T G G G A T ``` ### Code ```mermaid flowchart TB subgraph VL [ ] direction TB note(Variable-length) A((&nbsp)) A -->|0| B B((&nbsp)) A -->|1| C C((D)) B -->|0| D D((A)) B -->|1| E E((&nbsp)) E -->|0| G G((B)) E -->|1| H H((C)) end subgraph FL [ ] direction TB fnote(Fixed-length) fA((&nbsp)) fA -->|0| fB fB((&nbsp)) fA -->|1| fC fC((&nbsp)) fB -->|0| fD fD((A)) fB -->|1| fE fE((B)) fC -->|0| fF fF((C)) end style fA fill: grey style fB fill: grey style fC fill: grey style fD fill: red style fE fill: green style fF fill: yellow style A fill: grey style B fill: grey style C fill: blue style D fill: red style E fill: grey style G fill: green style H fill: yellow ``` 如果A出現的次數比起他人多的話,那他的壓縮比就會特別的好。 |symbol|A|B|C|D|E|F| |------|-|-|-|-|-|-| |Frequency(K)|45|13|12|16|9|5| |Fixed-length|000|001|010|011|100|101| |Variable-length|0|101|100|111|1101|1100| Fixed-length total bits $(45 + 13+ 12 + 16 + 9 + 5)\cdot 3 = 300_{bits}$ Variable-length total bits $45\cdot1+(13+12+16)\cdot3+(9+5)\cdot4=224_{bits}$ ## Prefix Code Definition: a variable-length code where no codeword is a prefix of some other codeword |symbol|A|B|C|D|E|F| |------|-|-|-|-|-|-| |Frequency(K)|45|13|12|16|9|5| |Prefix code|0|101|100|111|1101|1100| |Not prefix code|0|101|10|111|1101|1100| :::danger Ambiguity: decode(1011100) can be 'BF' or 'CDAA' ``` 101_1100 10_111_0_0 B F C D A A ``` ::: ## Total Length of Codes The weighted depth of a leaf = weight of a leaf (freq) x depth of a leaf Total Length of codes = Total weighted depth of leaves :::info Cost of the tree $T$ $$ B(T) = \sum_{c \in C} \text{freq}(c) \cdot d_T(c) $$ ::: :::success Average bits per character $$ \frac{B(T)}{100} = \sum_{c \in C} \text{relative-freq}(c) \cdot d_T(c) $$ ::: ```mermaid flowchart TB l1((100)) l2((55)) l31((25)) l32((30)) l4((14)) l1 -->|0| A l1 -->|1| l2 l2 -->|0| l31 l2 -->|1| l32 l31 -->|0| C l31 -->|1| B l32 -->|0| l4 l32 -->|1| D l4 -->|0| E l4 -->|1| F A((A:45)) B((B:13)) C((C:12)) D((D:16)) E((E:9)) F((F:5)) style A fill: orangered style B fill: green style C fill: yellow style D fill: sandybrown style E fill: violet style F fill: mediumorchid ``` ==How to find the optinal prefix. code to minimize the cost?== ## Prefix Code Problem :::warning Input: $n$ positive integers $w_1,w_2,...,w_n$ indicating word frequency Output: $a$ binary tree of $n$ leaves, whose weights form $w_1,w_2,...,w_n$ s.t. the cost of the tree is minimized ::: $$ T^*=arg \min_T B(T) = arg \min_T \sum_{c \in C} \text{freq}(c) \cdot d_T(c) $$ ### Step 1: Cast Optimization Problem :::warning Input: $n$ positive integers $w_1,w_2,...,w_n$ indicating word frequency Output: $a$ binary tree of $n$ leaves with minimal cost ::: Subproblem: merge two characters into a new one whose weight is their sum * $PC(i)$: prefix code problem for i leaves ==$PC(n) \to PC(n-1)$== * Goal: $PC(n)$ Issues * it is not the subproblem of the original problem * The cost of two merged characters should be consodered #### Example ```mermaid flowchart TB subgraph right [ ] direction TB rl1((100)) rl2((55)) rl31((25)) rl32((30)) rl1 -->|0| rA rl1 -->|1| rl2 rl2 -->|0| rl31 rl2 -->|1| rl32 rl31 -->|0| rC rl31 -->|1| rB rl32 -->|0| rl4 rl32 -->|1| rD rA((A:45)) rB((B:13)) rC((C:12)) rD((D:16)) subgraph ride1 [ ] rl4((EF:14)) end end subgraph left [ ] direction TB l1((100)) l2((55)) l31((25)) l32((30)) l1 -->|0| A l1 -->|1| l2 l2 -->|0| l31 l2 -->|1| l32 l31 -->|0| C l31 -->|1| B l32 -->|0| l4 l32 -->|1| D A((A:45)) B((B:13)) C((C:12)) D((D:16)) subgraph ide1 [ ] l4((14)) E((E:9)) F((F:5)) l4 -->|0| E l4 -->|1| F end end style left fill: red style A fill: orangered style B fill: green style C fill: yellow style D fill: sandybrown style E fill: violet style F fill: mediumorchid style rA fill: orangered style rB fill: green style rC fill: yellow style rD fill: sandybrown ``` ### Step 2: Prove Optimal Substructure ```mermaid flowchart TB subgraph left [T] direction TB l1(( )) l2(( )) l31(( )) l32(( )) l1 --> A l1 --> l2 l2 --> l31 l2 --> l32 l31 --> C l31 --> B l32 --> l4 l32 --> D A(( )) B(( )) C(( )) D(( )) l4(( )) E((x)) F((y)) l4 --> E l4 --> F end subgraph right [T'] direction TB rl1(( )) rl2(( )) rl31(( )) rl32(( )) rl1 --> rA rl1 --> rl2 rl2 --> rl31 rl2 --> rl32 rl31 --> rC rl31 --> rB rl32 --> rl4 rl32 --> rD rA(( )) rB(( )) rC(( )) rD(( )) rl4((z)) end style A fill: orangered style B fill: green style C fill: yellow style D fill: sandybrown style E fill: violet style F fill: mediumorchid style rA fill: orangered style rB fill: green style rC fill: yellow style rD fill: sandybrown style rl4 fill: violet ``` $$ B(T) = B(T') -\color{red}{\text{freq}(z)}d_{T'}(z) +\text{freq}(x)\color{blue}{d_{T}(x)} +\text{freq}(y)\color{olive}{d_{T}(y)} \\ = B(T') -\color{red}{(\text{freq}(x)+\text{freq}(y))}d_{T'}(z) +\text{freq}(x)\color{blue}{(1+d_{T'}(z))} +\text{freq}(y)\color{olive}{(1+d_{T'}(z))} \\ = B(T') +\text{freq}(x)+\text{freq}(y) \\ $$ Optimal substructure: T' is OPT if and only if T is OPT ## Greedy Algorithm Design Greedy choice: merge repeatedly until one tree left * Select two trees $x,y$ with minimal frequency roots $\rm{freq}(x)$ and $\rm{freq}(y)$ * Merge into a single tree by adding root $z$ with the frequency $\rm{freq}(x)+\rm{freq}(x)$ ### Example ```mermaid flowchart TB subgraph s1 [Step1] direction TB D1((16)):::cD E1((5)):::cE F1((9)):::cF C1((12)):::cC B1((13)):::cB A1((45)):::cA end s1 --> s2 subgraph s2 [Step2] direction TB subgraph s22 [ ] D2((16)):::cD EF2((14)):::cG C2((12)):::cC B2((13)):::cB A2((45)):::cA end EF2 --> E2 EF2 --> F2 E2((5)):::cE F2((9)):::cF end s2 --> s3 subgraph s3 [Step3] direction TB subgraph s33 [ ] D3((16)):::cD EF3((14)):::cG CB3((25)):::cG A3((45)):::cA end EF3 --> E3 EF3 --> F3 E3((5)):::cE F3((9)):::cF CB3 --> C3 CB3 --> B3 C3((12)):::cC B3((13)):::cB end s3 --> s4 subgraph s4 [Step4] direction TB subgraph s44 [ ] DEF4((30)):::cG CB4((25)):::cG A4((45)):::cA end DEF4 --> D4 DEF4 --> EF4 D4((16)):::cD EF4((14)):::cG EF4 --> E4 EF4 --> F4 E4((5)):::cE F4((9)):::cF CB4 --> C4 CB4 --> B4 C4((12)):::cC B4((13)):::cB end s4 --> s5 subgraph s5 [Step5] direction TB subgraph s55 [ ] CBDEF5((55)):::cG A5((45)):::cA end CBDEF5 --> DEF5 CBDEF5 --> CB5 DEF5((30)):::cG DEF5 --> D5 DEF5 --> EF5 D5((16)):::cD EF5((14)):::cG CB5((25)):::cG EF5 --> E5 EF5 --> F5 E5((5)):::cE F5((9)):::cF CB5 --> C5 CB5 --> B5 C5((12)):::cC B5((13)):::cB end s5 --> s6 subgraph s6 [Step6] direction TB CBDEFA6((100)):::cG CBDEFA6 --> CBDEF6 CBDEFA6 --> A6 CBDEF6((55)):::cG CBDEF6 --> DEF6 CBDEF6 --> CB6 DEF6((30)):::cG DEF6 --> D6 DEF6 --> EF6 D6((16)):::cD EF6((14)):::cG CB6((25)):::cG A6((45)):::cA EF6 --> E6 EF6 --> F6 E6((5)):::cE F6((9)):::cF CB6 --> C6 CB6 --> B6 C6((12)):::cC B6((13)):::cB end classDef cA fill: orangered classDef cB fill: green classDef cC fill: yellow classDef cD fill: sandybrown classDef cE fill: violet classDef cF fill: mediumorchid classDef cG fill: lightgrey ``` ### Step 3: Prove Greedy-Choice Property Greedy choice: merge two nodes with min weights repeatedly Proof via contradiction * Assume that there is no OPT including * $x$ and $y$ are two symbols with lowest frequencies * $a$ and $b$ are siblings with largest depths * WLOG, assume $freq(a) \leq freq(b)$ and $freq(x) \leq freq(y)$<br>$\rightarrow$ $freq(x) \leq freq(a)$ and $freq(y) \leq freq(b)$ * Exchanging $a$ with $x$ and then $b$ with $y$ can make the tree equally or better ```mermaid flowchart TB subgraph s7 [ ] direction TB ABYX7(( )):::cG ABYX7 --> D7 ABYX7 --> ABY7 D7((a)):::cA AB7(( )):::cG AB7 --> A7 AB7 --> C7 A7((x)):::cD C7((b)):::cC ABY7(( )):::cG ABY7 --> AB7 ABY7 --> B7 B7((y)):::cB end subgraph s6 [ ] direction TB ABYX6(( )):::cG ABYX6 --> D6 ABYX6 --> ABY6 D6((x)):::cD AB6(( )):::cG AB6 --> A6 AB6 --> C6 A6((a)):::cA C6((b)):::cC ABY6(( )):::cG ABY6 --> AB6 ABY6 --> B6 B6((y)):::cB end classDef cA fill: orangered classDef cB fill: green classDef cC fill: yellow classDef cD fill: sandybrown classDef cE fill: violet classDef cF fill: mediumorchid classDef cG fill: lightgrey ``` $$ B(T)-B(T')= \sum_{s \in S} \rm{freq}(s)d_T(s) -\sum_{s \in S} \rm{freq}(s)d_{T'}(s) \\ = \rm{freq}(x)d_T(x) + \rm{freq}(a)d_T(a) - \rm{freq}(x)d_{T'}(x) - \rm{freq}(a)d_{T'}(a) \\ = \rm{freq}(x)d_T(x) + \rm{freq}(a)d_T(a) - \rm{freq}(x)d_{T}(a) - \rm{freq}(a)d_{T}(x) \\ = (\rm{freq}(a)-\rm{freq}(x))(d_T(a)-d_T(x)) \geq 0 \color{blue} {\because \rm{freq}(x) \leq \rm{freq}(a)} $$ Because T is OPT, T' must be another optimal solution. ## Correctness and Optimality Theorem: Huffman coding generates an optinal prefix code Proof * Use induction to prove: Huffman codes are optimal for $n$ symbols * $n=2$, trivial * For a set $S$ with $n+1$ symbols, 1. Based on the greedy choice property, two symbols with minimum frequencies are siblings in $T$ 2. Construct $T'$ by replacing these two symbols $x$ and $y$ with $z$ s.t. $S' = ( S \setminus \{x,y\})\;\bigcup \;\{z\}$ and $freq(z)=freq(x)+freq(y)$ 3. Assume $T'$ i sthe optimal tree for $n$ symbols by inductive hypothesis 4. Based on the optimal substructure property, we know that when $T'$ is optimal, $T$ is optimal too (case $n+1$ holds) ## Pseudo Code :::info Prefix Code Problem ::: ``` Huffman(S) n = |S| Q = Build-Priority-Queue(S) for i = 1 to n-1 allocate a new node z z.left = x = Extract-Min(Q) z.right = y = Extract-Min(Q) freq(z) = freq(x) + freq(y) Insert(Q, z) Delete(Q, x) Delete(Q, y) return Extract-Min(Q) ``` $T(n) = \Theta(n \log n)$ ## Drawbacks of Huffman Codes Huffman's algorithm is optimal for a symbol-by-symbol coding with a known input probability distribution Huffman's algorithm is sub-optimal when * allowing multiple-symbol encoding is allowed * unknown probability distribution * symbols are not independent