###### tags: `ADA 7.4` `Huffman Code`
# ADA 7.4: Huffman Coding 字元編碼問題
Textbook Chapter 16.3
Chapter 4.8 in Algorithm Design by Kleinberg & Tardos
## Encoding & Decoding
在這裡Code是編碼規則的意思,不是程式碼。
```mermaid
flowchart LR
A( )
A -->|input\nmessage| B
B(Encoder)
B -->|encoded\nmessage| C
C(Decoder)
C -->|output\nmessage| D
D( )
style A opacity: 0
style B fill: orange
style C fill: orange
style D opacity: 0
```
:::info
Goal
* Enable communication and storage
* Detect or correct errors introduced during transmission
* compress data: lossy or lossless
:::
```mermaid
flowchart LR
A(Snoopy)
A --> B
B(Encoder)
B -->|536E6F6F7079\n| C
C(Decoder)
C --> D
D(Snoopy)
style A opacity: 0
style B fill: orange
style C fill: orange
style D opacity: 0
```
### Lossy Data Compression: Autoencoder
利用人工智慧網路來做壓縮
但無法保證可以完全一樣,會失真的意思
## Lossless Data Compression
:::info
Goal: encode each symbol using a unique binary code (w/o ambiguity)
* How to represent symbols?
* How to ensure decode(encode(x))=x?
* How to minimize the number of bits?
:::
```mermaid
flowchart TB
note[find a binary tree]
A(( ))
A -->|0| B
B(( ))
A -->|1| C
C(( ))
B -->|0| D
D((A))
B -->|1| E
E((G))
C -->|0| F
F((T))
C -->|1| G
G((C))
style A fill: grey
style B fill: grey
style C fill: grey
style D fill: red
style E fill: green
style F fill: blue
style G fill: yellow
```
```
10101101011010100101010010
T T C G G T T T G G G A T
```
### Code
```mermaid
flowchart TB
subgraph VL [ ]
direction TB
note(Variable-length)
A(( ))
A -->|0| B
B(( ))
A -->|1| C
C((D))
B -->|0| D
D((A))
B -->|1| E
E(( ))
E -->|0| G
G((B))
E -->|1| H
H((C))
end
subgraph FL [ ]
direction TB
fnote(Fixed-length)
fA(( ))
fA -->|0| fB
fB(( ))
fA -->|1| fC
fC(( ))
fB -->|0| fD
fD((A))
fB -->|1| fE
fE((B))
fC -->|0| fF
fF((C))
end
style fA fill: grey
style fB fill: grey
style fC fill: grey
style fD fill: red
style fE fill: green
style fF fill: yellow
style A fill: grey
style B fill: grey
style C fill: blue
style D fill: red
style E fill: grey
style G fill: green
style H fill: yellow
```
如果A出現的次數比起他人多的話,那他的壓縮比就會特別的好。
|symbol|A|B|C|D|E|F|
|------|-|-|-|-|-|-|
|Frequency(K)|45|13|12|16|9|5|
|Fixed-length|000|001|010|011|100|101|
|Variable-length|0|101|100|111|1101|1100|
Fixed-length total bits
$(45 + 13+ 12 + 16 + 9 + 5)\cdot 3 = 300_{bits}$
Variable-length total bits
$45\cdot1+(13+12+16)\cdot3+(9+5)\cdot4=224_{bits}$
## Prefix Code
Definition: a variable-length code where no codeword is a prefix of some other codeword
|symbol|A|B|C|D|E|F|
|------|-|-|-|-|-|-|
|Frequency(K)|45|13|12|16|9|5|
|Prefix code|0|101|100|111|1101|1100|
|Not prefix code|0|101|10|111|1101|1100|
:::danger
Ambiguity: decode(1011100) can be 'BF' or 'CDAA'
```
101_1100 10_111_0_0
B F C D A A
```
:::
## Total Length of Codes
The weighted depth of a leaf = weight of a leaf (freq) x depth of a leaf
Total Length of codes = Total weighted depth of leaves
:::info
Cost of the tree $T$
$$
B(T) = \sum_{c \in C} \text{freq}(c) \cdot d_T(c)
$$
:::
:::success
Average bits per character
$$
\frac{B(T)}{100} = \sum_{c \in C} \text{relative-freq}(c) \cdot d_T(c)
$$
:::
```mermaid
flowchart TB
l1((100))
l2((55))
l31((25))
l32((30))
l4((14))
l1 -->|0| A
l1 -->|1| l2
l2 -->|0| l31
l2 -->|1| l32
l31 -->|0| C
l31 -->|1| B
l32 -->|0| l4
l32 -->|1| D
l4 -->|0| E
l4 -->|1| F
A((A:45))
B((B:13))
C((C:12))
D((D:16))
E((E:9))
F((F:5))
style A fill: orangered
style B fill: green
style C fill: yellow
style D fill: sandybrown
style E fill: violet
style F fill: mediumorchid
```
==How to find the optinal prefix. code to minimize the cost?==
## Prefix Code Problem
:::warning
Input: $n$ positive integers $w_1,w_2,...,w_n$ indicating word frequency
Output: $a$ binary tree of $n$ leaves, whose weights form $w_1,w_2,...,w_n$ s.t. the cost of the tree is minimized
:::
$$
T^*=arg \min_T B(T) = arg \min_T \sum_{c \in C} \text{freq}(c) \cdot d_T(c)
$$
### Step 1: Cast Optimization Problem
:::warning
Input: $n$ positive integers $w_1,w_2,...,w_n$ indicating word frequency
Output: $a$ binary tree of $n$ leaves with minimal cost
:::
Subproblem: merge two characters into a new one whose weight is their sum
* $PC(i)$: prefix code problem for i leaves ==$PC(n) \to PC(n-1)$==
* Goal: $PC(n)$
Issues
* it is not the subproblem of the original problem
* The cost of two merged characters should be consodered
#### Example
```mermaid
flowchart TB
subgraph right [ ]
direction TB
rl1((100))
rl2((55))
rl31((25))
rl32((30))
rl1 -->|0| rA
rl1 -->|1| rl2
rl2 -->|0| rl31
rl2 -->|1| rl32
rl31 -->|0| rC
rl31 -->|1| rB
rl32 -->|0| rl4
rl32 -->|1| rD
rA((A:45))
rB((B:13))
rC((C:12))
rD((D:16))
subgraph ride1 [ ]
rl4((EF:14))
end
end
subgraph left [ ]
direction TB
l1((100))
l2((55))
l31((25))
l32((30))
l1 -->|0| A
l1 -->|1| l2
l2 -->|0| l31
l2 -->|1| l32
l31 -->|0| C
l31 -->|1| B
l32 -->|0| l4
l32 -->|1| D
A((A:45))
B((B:13))
C((C:12))
D((D:16))
subgraph ide1 [ ]
l4((14))
E((E:9))
F((F:5))
l4 -->|0| E
l4 -->|1| F
end
end
style left fill: red
style A fill: orangered
style B fill: green
style C fill: yellow
style D fill: sandybrown
style E fill: violet
style F fill: mediumorchid
style rA fill: orangered
style rB fill: green
style rC fill: yellow
style rD fill: sandybrown
```
### Step 2: Prove Optimal Substructure
```mermaid
flowchart TB
subgraph left [T]
direction TB
l1(( ))
l2(( ))
l31(( ))
l32(( ))
l1 --> A
l1 --> l2
l2 --> l31
l2 --> l32
l31 --> C
l31 --> B
l32 --> l4
l32 --> D
A(( ))
B(( ))
C(( ))
D(( ))
l4(( ))
E((x))
F((y))
l4 --> E
l4 --> F
end
subgraph right [T']
direction TB
rl1(( ))
rl2(( ))
rl31(( ))
rl32(( ))
rl1 --> rA
rl1 --> rl2
rl2 --> rl31
rl2 --> rl32
rl31 --> rC
rl31 --> rB
rl32 --> rl4
rl32 --> rD
rA(( ))
rB(( ))
rC(( ))
rD(( ))
rl4((z))
end
style A fill: orangered
style B fill: green
style C fill: yellow
style D fill: sandybrown
style E fill: violet
style F fill: mediumorchid
style rA fill: orangered
style rB fill: green
style rC fill: yellow
style rD fill: sandybrown
style rl4 fill: violet
```
$$
B(T)
= B(T')
-\color{red}{\text{freq}(z)}d_{T'}(z)
+\text{freq}(x)\color{blue}{d_{T}(x)}
+\text{freq}(y)\color{olive}{d_{T}(y)} \\
= B(T')
-\color{red}{(\text{freq}(x)+\text{freq}(y))}d_{T'}(z)
+\text{freq}(x)\color{blue}{(1+d_{T'}(z))}
+\text{freq}(y)\color{olive}{(1+d_{T'}(z))} \\
= B(T')
+\text{freq}(x)+\text{freq}(y) \\
$$
Optimal substructure: T' is OPT if and only if T is OPT
## Greedy Algorithm Design
Greedy choice: merge repeatedly until one tree left
* Select two trees $x,y$ with minimal frequency roots $\rm{freq}(x)$ and $\rm{freq}(y)$
* Merge into a single tree by adding root $z$ with the frequency $\rm{freq}(x)+\rm{freq}(x)$
### Example
```mermaid
flowchart TB
subgraph s1 [Step1]
direction TB
D1((16)):::cD
E1((5)):::cE
F1((9)):::cF
C1((12)):::cC
B1((13)):::cB
A1((45)):::cA
end
s1 --> s2
subgraph s2 [Step2]
direction TB
subgraph s22 [ ]
D2((16)):::cD
EF2((14)):::cG
C2((12)):::cC
B2((13)):::cB
A2((45)):::cA
end
EF2 --> E2
EF2 --> F2
E2((5)):::cE
F2((9)):::cF
end
s2 --> s3
subgraph s3 [Step3]
direction TB
subgraph s33 [ ]
D3((16)):::cD
EF3((14)):::cG
CB3((25)):::cG
A3((45)):::cA
end
EF3 --> E3
EF3 --> F3
E3((5)):::cE
F3((9)):::cF
CB3 --> C3
CB3 --> B3
C3((12)):::cC
B3((13)):::cB
end
s3 --> s4
subgraph s4 [Step4]
direction TB
subgraph s44 [ ]
DEF4((30)):::cG
CB4((25)):::cG
A4((45)):::cA
end
DEF4 --> D4
DEF4 --> EF4
D4((16)):::cD
EF4((14)):::cG
EF4 --> E4
EF4 --> F4
E4((5)):::cE
F4((9)):::cF
CB4 --> C4
CB4 --> B4
C4((12)):::cC
B4((13)):::cB
end
s4 --> s5
subgraph s5 [Step5]
direction TB
subgraph s55 [ ]
CBDEF5((55)):::cG
A5((45)):::cA
end
CBDEF5 --> DEF5
CBDEF5 --> CB5
DEF5((30)):::cG
DEF5 --> D5
DEF5 --> EF5
D5((16)):::cD
EF5((14)):::cG
CB5((25)):::cG
EF5 --> E5
EF5 --> F5
E5((5)):::cE
F5((9)):::cF
CB5 --> C5
CB5 --> B5
C5((12)):::cC
B5((13)):::cB
end
s5 --> s6
subgraph s6 [Step6]
direction TB
CBDEFA6((100)):::cG
CBDEFA6 --> CBDEF6
CBDEFA6 --> A6
CBDEF6((55)):::cG
CBDEF6 --> DEF6
CBDEF6 --> CB6
DEF6((30)):::cG
DEF6 --> D6
DEF6 --> EF6
D6((16)):::cD
EF6((14)):::cG
CB6((25)):::cG
A6((45)):::cA
EF6 --> E6
EF6 --> F6
E6((5)):::cE
F6((9)):::cF
CB6 --> C6
CB6 --> B6
C6((12)):::cC
B6((13)):::cB
end
classDef cA fill: orangered
classDef cB fill: green
classDef cC fill: yellow
classDef cD fill: sandybrown
classDef cE fill: violet
classDef cF fill: mediumorchid
classDef cG fill: lightgrey
```
### Step 3: Prove Greedy-Choice Property
Greedy choice: merge two nodes with min weights repeatedly
Proof via contradiction
* Assume that there is no OPT including
* $x$ and $y$ are two symbols with lowest frequencies
* $a$ and $b$ are siblings with largest depths
* WLOG, assume $freq(a) \leq freq(b)$ and $freq(x) \leq freq(y)$<br>$\rightarrow$ $freq(x) \leq freq(a)$ and $freq(y) \leq freq(b)$
* Exchanging $a$ with $x$ and then $b$ with $y$ can make the tree equally or better
```mermaid
flowchart TB
subgraph s7 [ ]
direction TB
ABYX7(( )):::cG
ABYX7 --> D7
ABYX7 --> ABY7
D7((a)):::cA
AB7(( )):::cG
AB7 --> A7
AB7 --> C7
A7((x)):::cD
C7((b)):::cC
ABY7(( )):::cG
ABY7 --> AB7
ABY7 --> B7
B7((y)):::cB
end
subgraph s6 [ ]
direction TB
ABYX6(( )):::cG
ABYX6 --> D6
ABYX6 --> ABY6
D6((x)):::cD
AB6(( )):::cG
AB6 --> A6
AB6 --> C6
A6((a)):::cA
C6((b)):::cC
ABY6(( )):::cG
ABY6 --> AB6
ABY6 --> B6
B6((y)):::cB
end
classDef cA fill: orangered
classDef cB fill: green
classDef cC fill: yellow
classDef cD fill: sandybrown
classDef cE fill: violet
classDef cF fill: mediumorchid
classDef cG fill: lightgrey
```
$$
B(T)-B(T')=
\sum_{s \in S} \rm{freq}(s)d_T(s)
-\sum_{s \in S} \rm{freq}(s)d_{T'}(s) \\
= \rm{freq}(x)d_T(x) + \rm{freq}(a)d_T(a) - \rm{freq}(x)d_{T'}(x) - \rm{freq}(a)d_{T'}(a) \\
= \rm{freq}(x)d_T(x) + \rm{freq}(a)d_T(a) - \rm{freq}(x)d_{T}(a) - \rm{freq}(a)d_{T}(x) \\
= (\rm{freq}(a)-\rm{freq}(x))(d_T(a)-d_T(x))
\geq 0 \color{blue} {\because \rm{freq}(x) \leq \rm{freq}(a)}
$$
Because T is OPT, T' must be another optimal solution.
## Correctness and Optimality
Theorem: Huffman coding generates an optinal prefix code
Proof
* Use induction to prove: Huffman codes are optimal for $n$ symbols
* $n=2$, trivial
* For a set $S$ with $n+1$ symbols,
1. Based on the greedy choice property, two symbols with minimum frequencies are siblings in $T$
2. Construct $T'$ by replacing these two symbols $x$ and $y$ with $z$ s.t. $S' = ( S \setminus \{x,y\})\;\bigcup \;\{z\}$ and $freq(z)=freq(x)+freq(y)$
3. Assume $T'$ i sthe optimal tree for $n$ symbols by inductive hypothesis
4. Based on the optimal substructure property, we know that when $T'$ is optimal, $T$ is optimal too (case $n+1$ holds)
## Pseudo Code
:::info
Prefix Code Problem
:::
```
Huffman(S)
n = |S|
Q = Build-Priority-Queue(S)
for i = 1 to n-1
allocate a new node z
z.left = x = Extract-Min(Q)
z.right = y = Extract-Min(Q)
freq(z) = freq(x) + freq(y)
Insert(Q, z)
Delete(Q, x)
Delete(Q, y)
return Extract-Min(Q)
```
$T(n) = \Theta(n \log n)$
## Drawbacks of Huffman Codes
Huffman's algorithm is optimal for a symbol-by-symbol coding with a known input probability distribution
Huffman's algorithm is sub-optimal when
* allowing multiple-symbol encoding is allowed
* unknown probability distribution
* symbols are not independent