# 【資工考研】DS: Binary Tree (3) Extended Binary Tree & Huffman Coding
## Extended Binary Tree
對於 binary tree ,原本指向 `NULL` 的節點,讓它改成指向一種特別的節點
原本那棵 binary tree 的節點被稱為 internal nodes ,那些特殊的節點則為 external nodes
$$
\mathrm{num}_{\text{external nodes}} = \mathrm{num}_{\text{internal nodes}} + 1
$$
定義 $I(T)$ 為樹 $T$ 之 internal path length,即 $T$ 之所有 internal nodes 深度之總和
定義 $E(T)$ 為樹 $T$ 之 external path length,即 $T$ 之所有 external nodes 深度之總和
則
$$
E(T) = I(T) + 2n
$$
其中 $n$ 為 internal nodes 總數
這個可以輕鬆用數學歸納法證明出來,在此不贅述
### Weighed External Path Length ($\mathrm{WEPL}$)
給予 $n$ 個 external node weight values $p_i, 1\le i \le n$
定義從 root 到 external node $\mathrm{node}_i$ 之 path length 為 $d_i$
$\mathrm{WEPL} = \sum\limits_{i=1}^{n}d_i\cdot p_i$
由此可發現,樹高縮小,$\mathrm{WEPL}$ 不見得變小
那麼要怎麼求出最小的 $\mathrm{WEPL}$ 呢?
### Huffman Algorithm
使用 greedy 的策略
Huffman Coding 是種最佳方法,其每個輸入符號通常是具有二元機率的已知獨立且相同分布的隨機變數
Huffman code 是 variable-length code,即其編碼結果長度不見得都相同長
我們目標是 $\mathrm{WEPL}$ 最小,如果以字母出現頻率(或者說機率)作為權重,則出現頻率越高的字母,編碼長度應會越短
這樣的特色使得編碼之後的字串的平均長度、期望值降低,而甚至能達到無失真壓縮數據
> [!Tip]
> 那麼,對於具有均勻機率分布的一組符號,以及作為 $2$ 的冪之成員,Huffman Coding 的效益如何呢?
> 此時 Huffman Coding 的效果等同於簡單的二進位制編碼,例如 ASCII code
> 這邊反映出的事實是:無論壓縮方法是什麼,這種輸入都不可能進行壓縮,或著說比起壓縮,對數據無所作為才是最佳選擇
>
> 另外,Prefix code,特別是 Huffman code,往往在小字母表上的效率較差
> 當最可能符號的機率遠超過 $0.5$ 時,Huffman Coding 可能會遇到 worst case
Huffman encode 的演算法相當好懂:
每次我們挑最小的兩個 weight (可以用 Min-Heap) 把兩者相加的值作為新節點再塞回 Min-Heap
總共執行 `n-1` 次,每回合花 $O(\lg n)$ 故時間複雜度是 $O(n\lg n)$
以下是個簡易版的實現:
Min-Heap 使用我們[前面](https://hackmd.io/@RyoukiWei/binary_tree_1#Heap)製作的 binary heap (標準庫有提供 `std::prioirty queue` 但我們都自己寫了一個玩具版了,為何不用呢?)
```cpp!
struct HuffmanNode
{
char character;
int weight;
HuffmanNode *left;
HuffmanNode *right;
HuffmanNode(char ch, int w) : character(ch), weight(w), left(nullptr), right(nullptr) {}
HuffmanNode(int w, HuffmanNode *l, HuffmanNode *r) : character('\0'), weight(w), left(l), right(r) {}
~HuffmanNode()
{
delete left;
delete right;
}
};
struct NodeCompare
{
bool operator()(HuffmanNode *a, HuffmanNode *b) const { return a->weight < b->weight; }
};
class HuffmanCoding
{
private:
HuffmanNode *root;
public:
std::unordered_map<char, std::string> encodeChars;
HuffmanCoding() : root(nullptr) {}
~HuffmanCoding()
{
if (root)
{
delete root;
root = nullptr;
}
}
/**
* @brief Build Huffman Tree from weight(frequency) table.
*/
template <typename Weight>
void build(const std::unordered_map<char, Weight> &table)
{
if (table.empty())
throw std::invalid_argument("The table is empty.");
riklib::BinaryHeap<HuffmanNode *, NodeCompare> minHeap;
for (const auto &[ch, w] : table)
{
minHeap.push(new HuffmanNode(ch, w));
}
while (minHeap.size() > 1)
{
HuffmanNode *left = minHeap.top();
minHeap.pop();
HuffmanNode *right = minHeap.top();
minHeap.pop();
HuffmanNode *add = new HuffmanNode(left->weight + right->weight, left, right);
minHeap.push(add);
}
root = minHeap.top();
minHeap.pop();
buildEncodingTable(root, "");
}
/**
* @brief Encode a single character into Huffman encoded binary string.
*/
std::string encode(char ch)
{
if (!root)
throw std::runtime_error("Huffman tree is not built yet.");
if (encodeChars.find(ch) == encodeChars.end())
throw std::runtime_error("Character not found in encoding table.");
return encodeChars[ch];
}
/**
* @brief Encode a given string into Huffman encoded binary string.
*/
std::string encode(const std::string &text)
{
std::string res;
for (char ch : text)
{
res += encode(ch);
}
return res;
}
/**
*@brief Decode a binary string into the original char.
*/
char decodeChar(const std::string &binaryStr)
{
if (!root)
throw std::runtime_error("Huffman tree is not built yet.");
HuffmanNode *curr = root;
for (char bit : binaryStr)
{
if (bit == '0')
curr = curr->left;
else if (bit == '1')
curr = curr->right;
else
throw std::invalid_argument("Invalid character in binary string.");
if (!curr)
throw std::runtime_error("Invalid binary string.");
}
if (!curr->left && !curr->right)
return curr->character;
throw std::runtime_error("Binary string does not terminate at a valid character.");
}
/**
* @brief Decode a binary string into the original string.
*/
std::string decodeStr(const std::string &binaryStr)
{
if (!root)
throw std::runtime_error("Huffman tree is not built yet.");
HuffmanNode *curr = root;
std::string res;
for (char bit : binaryStr)
{
if (bit == '0')
curr = curr->left;
else if (bit == '1')
curr = curr->right;
else
throw std::invalid_argument("Invalid character in binary string.");
if (!curr)
throw std::runtime_error("Invalid binary string.");
if (!curr->left && !curr->right)
{
res += curr->character;
curr = root;
}
}
return res;
}
private:
void buildEncodingTable(HuffmanNode *node, const std::string &path)
{
if (!node)
return;
if (!node->left && !node->right)
encodeChars[node->character] = path;
buildEncodingTable(node->left, path + "0");
buildEncodingTable(node->right, path + "1");
}
};
```
我們 `input` 給定 `"BDCAE"`,各字母頻率如下:
<table>
<tr>
<th>Character</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
</tr>
<tr>
<td>Frequency</td>
<td>22</td>
<td>3</td>
<td>10</td>
<td>5</td>
<td>60</td>
</tr>
</table>
來看看結果
```bash!
❯ g++ -o huffman huffman.cpp
❯ ./huffman
A: 01
B: 0000
C: 001
D: 0001
E: 1
Encoded: 00000001001011
Decoded: BDCAE
```
## 【資工考研】DS: Binary Tree 全系列
- [【資工考研】DS: Binary Tree (1)](https://hackmd.io/@RyoukiWei/binary_tree_1)
- [【資工考研】DS: Binary Tree (2)](https://hackmd.io/@RyoukiWei/binary_tree_2)
- [【資工考研】DS: Binary Tree (3)](https://hackmd.io/@RyoukiWei/binary_tree_3)
- [【資工考研】DS: Binary Tree (4)](https://hackmd.io/@RyoukiWei/binary_tree_4)
- [【資工考研】DS: Binary Tree (5)](https://hackmd.io/@RyoukiWei/binary_tree_5)