# Compression - Huffman Coding & LZW Algorithm
## What is Compression?
The goal of compression is to reduce the size of data or a file so that it occupies less storage space and can be transmitted more efficiently over networks. This process involves encoding information using fewer bits than the original representation.
Here are the primary objectives of compression:
1. Space Saving: By reducing the size of files, compression helps save storage space on devices, servers, and storage media.
2. Transmission Efficiency: Smaller files can be transmitted faster over networks, which is particularly important for bandwidth-limited or speed-sensitive applications such as streaming, online backups, and data transfers.
3. Cost Reduction: Reduced storage and transmission requirements can lead to lower costs, as less physical storage is needed, and lower data transfer fees may be incurred.
4. Performance Improvement: In some scenarios, compressed data can be read from storage and transmitted faster, improving the overall performance of applications.
There are two main types of compression:
- Lossless Compression: This type of compression allows the original data to be perfectly reconstructed from the compressed data. It is essential for applications where data integrity is critical, such as text documents, executable files, and some image and audio formats.
- Lossy Compression: This type of compression reduces the size of the data by permanently eliminating certain information, especially redundant or less important details. This is commonly used in multimedia files like images, audio, and video, where a perfect reconstruction of the original data is not necessary.
## Huffman Algorithm
The Huffman algorithm is a widely used method for lossless data compression. It was developed by David A. Huffman in 1952 and is based on the concept of variable-length coding.
The Huffman algorithm compresses data by assigning shorter binary codes to more frequent symbols and longer codes to less frequent ones. It constructs a binary tree where each leaf node represents a symbol and its frequency. The tree is built by repeatedly merging the two nodes with the lowest frequencies until only one node remains, which becomes the root. The binary codes are then derived by traversing the tree from the root to each leaf, creating an efficient, variable-length encoding that reduces the overall size of the data.
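To make the merge process concrete, consider the frequencies a:5, b:2, c:1, d:1 (for example the 9-character string "aaaaabbcd"). Merging the two lowest-frequency nodes repeatedly gives (c,d) → 2, then (b, cd) → 4, then (a, bcd) → 9, so a receives a 1-bit code, b a 2-bit code, and c and d 3-bit codes. The short program below (illustrative only; the exact 0/1 labels depend on the implementation) checks the resulting size:
```c++
// Illustrative only: verifies the sizes implied by the merge order described above
// for the frequencies a:5, b:2, c:1, d:1 (e.g. the 9-character string "aaaaabbcd").
#include <iostream>

int main() {
    int freq[] = {5, 2, 1, 1};  // frequencies of a, b, c, d
    int len[]  = {1, 2, 3, 3};  // code lengths implied by the merges (c,d), (b,cd), (a,bcd)
    int totalBits = 0;
    for (int i = 0; i < 4; ++i) totalBits += freq[i] * len[i];
    std::cout << totalBits << " bits with Huffman codes vs "
              << 9 * 8 << " bits with plain 8-bit characters\n";  // 15 vs 72
    return 0;
}
```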
First, we pull in the headers we need and define the tree node, a comparator, and a small wrapper around a min-heap (a std::priority_queue ordered by frequency).
```c++
#include <fstream>
#include <iostream>
#include <queue>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

// A tree node: leaves hold a character and its frequency.
struct MinHeapNode {
    char data;
    unsigned freq;
    MinHeapNode *left, *right;
};
// Orders nodes so the lowest frequency sits on top of the priority queue.
struct Compare {
    bool operator()(MinHeapNode* l, MinHeapNode* r) {
        return l->freq > r->freq;
    }
};
// Thin wrapper around std::priority_queue, used as the min-heap.
struct MinHeap {
    unsigned size;
    priority_queue<MinHeapNode*, vector<MinHeapNode*>, Compare> minHeap;
};
```
Next, we add helper functions to create nodes, initialize the min-heap, and merge nodes into the Huffman tree.
```c++
// Allocates a node with the given character and frequency and no children.
MinHeapNode* newNode(char data, unsigned freq) {
    MinHeapNode* temp = new MinHeapNode();
    temp->left = temp->right = nullptr;
    temp->data = data;
    temp->freq = freq;
    return temp;
}

MinHeap* createMinHeap() {
    MinHeap* minHeap = new MinHeap();
    minHeap->size = 0;
    return minHeap;
}

// Builds the Huffman tree: repeatedly merge the two lowest-frequency nodes
// until a single node, the root, remains.
void buildMinHeap(MinHeap* minHeap) {
    while (minHeap->minHeap.size() > 1) {
        MinHeapNode* left = minHeap->minHeap.top();
        minHeap->minHeap.pop();
        MinHeapNode* right = minHeap->minHeap.top();
        minHeap->minHeap.pop();
        // '$' marks an internal node; its frequency is the sum of its children's.
        MinHeapNode* top = newNode('$', left->freq + right->freq);
        top->left = left;
        top->right = right;
        minHeap->minHeap.push(top);
    }
}
```
Next, we traverse the tree to generate the Huffman codes and write the code table to the output file.
```c++
// Walks the tree; the path from the root (0 = left, 1 = right) to a leaf is that
// character's code.
void generateCodes(MinHeapNode* root, string str, unordered_map<char, string>& huffmanCodes) {
    if (root) {
        if (!root->left && !root->right) {
            huffmanCodes[root->data] = str;
            cout << "Character: " << root->data << ", Code: " << str << endl; // debug output
        } else {
            generateCodes(root->left, str + "0", huffmanCodes);
            generateCodes(root->right, str + "1", huffmanCodes);
        }
    }
}

// Writes the code table, one "character code" pair per line, so the decoder can
// rebuild it. (This simple text format assumes the input contains no whitespace.)
void writeHuffmanCodes(const unordered_map<char, string>& huffmanCodes, ofstream& output) {
    for (const auto& pair : huffmanCodes) {
        output << pair.first << " " << pair.second << "\n";
    }
}
```
To recap: the structures MinHeapNode, Compare, and MinHeap manage the nodes and the frequency-ordered priority queue. newNode creates nodes, createMinHeap initializes the heap, and buildMinHeap merges nodes until the Huffman tree is formed. generateCodes then traverses the tree and records each character's variable-length binary code in an unordered map (printing each code along the way for debugging), and writeHuffmanCodes saves this code table to a file so the decoder can use it. Together these pieces let us replace each character with a code whose length depends on its frequency.
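As a quick illustration of how generateCodes walks the tree, here is a small hypothetical driver (illustrative only; it assumes the structs and functions defined above are in scope and is not part of the final program):
```c++
// Hypothetical driver, illustrative only: builds a two-leaf tree by hand
// and prints the codes that generateCodes derives from it.
void printExampleCodes() {
    MinHeapNode* root = newNode('$', 3);   // internal node, frequency 2 + 1
    root->left  = newNode('a', 2);         // more frequent symbol
    root->right = newNode('b', 1);         // less frequent symbol
    unordered_map<char, string> codes;
    generateCodes(root, "", codes);        // prints: Character: a, Code: 0
                                           //         Character: b, Code: 1
}
```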
Now we have to encode the file.
```c++
void encodeFile(const string& inputFileName, const string& outputFileName) {
    ifstream inputFile(inputFileName, ios::binary);
    ofstream outputFile(outputFileName);

    // First pass: count how often each character occurs.
    unordered_map<char, unsigned> freq;
    char c;
    while (inputFile.get(c)) {
        freq[c]++;
    }
    if (freq.empty()) return; // empty input: nothing to encode

    // Build the Huffman tree from the frequency table.
    MinHeap* minHeap = createMinHeap();
    for (const auto& pair : freq) {
        minHeap->minHeap.push(newNode(pair.first, pair.second));
    }
    buildMinHeap(minHeap);
    MinHeapNode* root = minHeap->minHeap.top();

    // Derive the code for every character. (A file with only one distinct
    // character would get an empty code; handling that is left out here.)
    unordered_map<char, string> huffmanCodes;
    generateCodes(root, "", huffmanCodes);

    // Debug: print the full code table.
    for (const auto& pair : huffmanCodes) {
        cout << "Character: " << pair.first << ", Code: " << pair.second << endl;
    }

    // Write the code table, a blank line as separator, then the encoded bits.
    writeHuffmanCodes(huffmanCodes, outputFile);
    outputFile << "\n";

    // Second pass: replace every character with its Huffman code.
    inputFile.clear();
    inputFile.seekg(0, ios::beg);
    while (inputFile.get(c)) {
        outputFile << huffmanCodes[c];
    }

    inputFile.close();
    outputFile.close();
}
```
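One caveat: encodeFile writes each bit as an ASCII '0' or '1' character, so the "compressed" file actually spends a full byte per bit and is usually larger than the input. A real compressor would pack the bit string into bytes. Below is a possible sketch of that step; the helper name packBits and its zero-padding scheme are assumptions for illustration, not part of the code above.
```c++
#include <string>
#include <vector>
using namespace std;

// Hypothetical helper: pack a string of '0'/'1' characters into real bytes.
// The final byte is zero-padded; a real file format would also need to store
// the number of padding bits so the decoder can ignore them.
vector<unsigned char> packBits(const string& bits) {
    vector<unsigned char> bytes;
    unsigned char current = 0;
    int count = 0;
    for (char bit : bits) {
        current = static_cast<unsigned char>((current << 1) | (bit == '1'));
        if (++count == 8) {   // a full byte has been accumulated
            bytes.push_back(current);
            current = 0;
            count = 0;
        }
    }
    if (count > 0) {          // pad the last partial byte with zeros
        bytes.push_back(static_cast<unsigned char>(current << (8 - count)));
    }
    return bytes;
}
```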
Next, we need a function to decode the compressed file back into the original text.
```c++
void decodeFile(const string& inputFileName, const string& outputFileName) {
    ifstream inputFile(inputFileName);
    ofstream outputFile(outputFileName, ios::binary);

    // Rebuild the code table: read "character code" lines until the blank
    // separator line written by encodeFile.
    unordered_map<string, char> reverseCodes;
    string line;
    while (getline(inputFile, line) && !line.empty()) {
        istringstream iss(line);
        char character;
        string code;
        if (iss >> character) {
            getline(iss >> std::ws, code); // skip the separating whitespace
            reverseCodes[code] = character;
        } else {
            cerr << "Error reading Huffman codes from input file." << endl;
            return; // exit on a malformed code table
        }
    }

    // Debug: print the reconstructed table.
    for (const auto& pair : reverseCodes) {
        cout << "Code: " << pair.first << ", Character: " << pair.second << endl;
    }

    // The remainder of the file is the encoded bit string.
    string encodedText;
    while (getline(inputFile, line)) {
        encodedText += line;
    }

    // Walk the bit string, emitting a character whenever the accumulated bits
    // match a code. Huffman codes are prefix-free, so the first match is correct.
    string decodedText;
    string currentCode;
    for (char bit : encodedText) {
        currentCode += bit;
        if (reverseCodes.find(currentCode) != reverseCodes.end()) {
            decodedText += reverseCodes[currentCode];
            currentCode.clear();
        }
    }

    outputFile << decodedText;
    inputFile.close();
    outputFile.close();
}
```
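If the bit string had been packed into bytes as sketched after encodeFile, the decoder would first expand those bytes back into a '0'/'1' string before walking the code table. A matching sketch, again hypothetical; it assumes the number of valid bits (total bits minus padding) was stored somewhere, for example in a small header:
```c++
#include <string>
#include <vector>
using namespace std;

// Hypothetical inverse of packBits: expand packed bytes back into a '0'/'1' string.
// validBits is the number of meaningful bits, i.e. total bits minus padding.
string unpackBits(const vector<unsigned char>& bytes, size_t validBits) {
    string bits;
    bits.reserve(validBits);
    for (size_t i = 0; i < bytes.size() && bits.size() < validBits; ++i) {
        for (int b = 7; b >= 0 && bits.size() < validBits; --b) {
            bits.push_back(((bytes[i] >> b) & 1) ? '1' : '0');
        }
    }
    return bits;
}
```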
Now we put everything together in the main function.
```c++
int main() {
const string inputFileName = "input.txt";
const string compressedFileName = "compressed.txt";
const string decompressedFileName = "decompressed.txt";
// Compress the file
encodeFile(inputFileName, compressedFileName);
// Decompress the file
decodeFile(compressedFileName, decompressedFileName);
return 0;
}
```
This Huffman implementation demonstrates lossless compression end to end: character frequencies drive the construction of the tree, variable-length codes replace the fixed 8-bit characters, and the stored code table lets the decoder reconstruct the original file exactly. The LZW algorithm pursues the same goal differently: instead of exploiting symbol frequencies, it builds a dictionary of repeated patterns and replaces them with short codes. Both techniques serve the objectives outlined at the start: saving space, speeding up transmission, and improving overall application performance.
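To make the dictionary idea concrete, here is a minimal sketch of LZW compression (an illustration of the technique only, not the implementation referred to by the title; the function name lzwCompress and the sample string are chosen here for demonstration):
```c++
// Minimal LZW compression sketch: the dictionary starts with all 256 single-byte
// strings and grows as repeated patterns are encountered.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

vector<int> lzwCompress(const string& input) {
    unordered_map<string, int> dict;
    for (int i = 0; i < 256; ++i) dict[string(1, static_cast<char>(i))] = i;
    int nextCode = 256;
    vector<int> output;
    string current;
    for (char c : input) {
        string candidate = current + c;
        if (dict.count(candidate)) {
            current = candidate;             // keep extending the match
        } else {
            output.push_back(dict[current]); // emit code for the longest known prefix
            dict[candidate] = nextCode++;    // remember the new pattern
            current = string(1, c);
        }
    }
    if (!current.empty()) output.push_back(dict[current]);
    return output;
}

int main() {
    for (int code : lzwCompress("TOBEORNOTTOBEORTOBEORNOT"))
        cout << code << ' ';
    cout << '\n';  // repeated substrings collapse into single dictionary codes
    return 0;
}
```
In practice the two approaches can also be combined, for example by entropy-coding the LZW output, to squeeze out additional redundancy.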