Introduction
You’re trying to take something that can be described in many, many sentences and pages of prose, but you can convert it into a couple lines of poetry and you still get the essence, so that’s compression. The best code is poetry.
– Satya Nadella
When David Huffman was a graduate student in a class at MIT, the professor gave the class an unsolved problem: How to construct an optimal static encoding of information. The young Huffman came back a few days later with his solution, and that solution changed the world. Data compression is now used in all aspects of communication. David Huffman joined the faculty of MIT in 1953, and in 1967 he joined the faculty of University of California, Santa Cruz as one of its earliest members and helped to found its Computer Science Department, where he served as chairman from 1970 to 1973. He retired in 1994, and passed away in 1999.
The key idea is called entropy, originally defined by Claude Shannon in 1948. Entropy is a measure of the amount of information in a, say, set of symbols. If we define I (x) = log2 Pr[x] to be the information content of a symbol, then the entropy of the set X = {x 1 , . . . , x n } is
It should be easy to see that the optimal static encoding will assign the least number of bits to the most common symbol, and the greatest number of bits to the least common symbol.
Your task will be to find a Huffman encoding for the contents of a file, and use it to compress that file. You must also be able to reconstruct the original file from its compressed encoding.
In order to do this, you will need to:
- Compute a histogram of the file, in other words, count the number of occurrences of each byte in the file.
- Construct the Huffman tree that represents this histogram, in order to do this you will use a priority queue.
- Construct the code for each symbol, in order to do this you will use a stack and perform a traversal of the Huffman tree.
- Emit an encoding of the Huffman tree to a file, in order to do this you will perform a post-order traversal of the Huffman tree.
- Emit an encoding of the original file to the compressed file, in order to do this you will use bit vectors with two additional operations: append and optionally concatenate.
- Read the tree from the compressed file, in order to do this you will use a stack.
- Decode the compressed bit stream into an identical copy of the original file. In order to do this, you will use the bits (0 means left, 1 means right) to guide your traversal of a reconstructed Huffman tree.
Stacks
I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
– Linus Torvalds
You will be using stacks in at least two instances in this assignment, consequently you will need several different types of nodes. You may, for example, want a stack of treeNode and a stack of uint32_t, or even better a stack of bits if you are trying to be extra clever (so, let me give you some code for that).
Below is the generic interface for a stack.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
|
#ifndef _STACK_H #define _STACK_H #include lt;stdint.hgt; #include lt;stdbool.hgt;
typedef uint32_t item;
typedef struct stack #123; uint32_t size; uint32_t top; item *entries; #125; stack;
stack *newStack(); void delStack(stack *);
item pop (stack *); void push(stack *, item);
bool empty(stack *); bool full (stack *);
#endif
|
The code for each type of stack will be identical, except for the type of items being operated on. Unlike some languages with templates or generics, you will need to have code for each type. You could try to be exceedingly clever and use (void *), but again, I caution you against that approach.
Priority Queues
Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
– Rob Pike
In order to construct the Huffman tree, you will want to use a data structure called a priority queue. A priority queue is like a regular queue in that you remove items from the tail, but differs in that when you remove an item it is always the smallest (or largest) item. This implies that the enqueue operation does not simply insert at the head. Of course, the dequeue operation could search for the smallest item each time, but that is a bad idea. Note that the definition does not say what the order in the rest of the queue must be, only that the dequeue operation will return the smallest item.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
|
#ifndef _QUEUE_H #define _QUEUE_H #include lt;stdint.hgt; #include lt;stdbool.hgt;
typedef treeNode item;
typedef struct queue #123; uint32_t size; uint32_t head, tail; item *Q; #125; queue;
queue *newQueue(uint32_t size); void delQueue(queue *q);
bool empty(queue *q); bool full (queue *q);
bool enqueue(queue *q, item i); bool dequeue(queue *q, item *i);
#endif
|
Huffman Trees
It’s easy to make mistakes that only come out much later, after you’ve already implemented a lot of code. You’ll realize “Oh I should have used a different type of data structure.” Start over from scratch.
– Guido van Rossum
A Huffman tree is a full binary tree where the leaves represent the symbols to be encoded, and the interior nodes provide a path from the root to the leaves. The path is labeled with 0 for left, and 1 for right. The most common symbol (the one with the highest count in the histogram) should have the shortest path; the least common symbol the longest path.
The process of creating a Huffman tree is surprisingly simple. First, compute your histogram. Next, using treeNode *newNode(uint8_t s, bool l, uint64_t c) enqueue a node for each entry in the histogram into the priority queue.
Third, repeat until the priority queue is empty: dequeue two nodes from the priority queue. Join them under a new node, and insert that node into the priority queue. If there is only one node remaining in the priority queue, then that is the root of the Huffman tree.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
|
#ifndef _HUFFMAN_H #define _HUFFMAN_H #include lt;stdint.hgt; #include lt;stdbool.hgt;
#ifndef NIL #define NIL (void *) 0 #endif
typedef struct DAH treeNode;
struct DAH #123; uint8_t symbol; uint64_t count; bool leaf; treeNode *left, *right; #125;;
treeNode *newNode(uint8_t s, bool l, uint64_t c);
void dumpTree(treeNode *t, int file);
treeNode *loadTree(uint8_t savedTree[], uint16_t treeBytes);
int32_t stepTree(treeNode *root, treeNode **t, uint32_t code);
void buildCode(treeNode *t, code s, code table[256]);
void *delTree(treeNode *t);
static inline void delNode(treeNode *h) #123; free(h); return; #125;
static inline int8_t compare(treeNode *l, treeNode *r) #123; return l-gt;count - r-gt;count; #125;
treeNode *join(treeNode *l, treeNode *r); #endif
|
The function treeNode *join(treeNode *l, treeNode *r) takes two nodes, which may be either leaves or interior nodes, and creates a new internal node with its count set to the sum of the counts of the two child nodes.
When you are creating the code for each symbol, you will want to use a stack as defined in code.h, it provides a lovely little stack of bits. You will see that as you traverse the Huffman tree, you want to push 0 when you go to the left child and push 1 when you go to the right. Consequently, when you return from either child, you will want to perform a pop. As soon as you reach any leaf node during the traversal, the current state of the stack will represent the code for the symbol at the leaf node.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
|
#ifndef _CODE_H #define _CODE_H
#include lt;stdint.hgt; #include lt;stdbool.hgt;
typedef struct code #123; uint8_t bits[32]; uint32_t l; #125; code;
static inline code newCode() #123; code t; for (int i = 0; i lt; 32; i += 1) #123; t.bits[i] = 0; #125; t.l = 0; return t; #125;
static inline bool pushCode(code *c, uint32_t k) #123; if (c-gt;l gt; 256) #123; return false; #125; else if (k == 0) #123; c-gt;bits[c-gt;l / 8] amp;= ~(0x1 lt;lt; (c-gt;l % 8)); c-gt;l += 1; #125; else #123; c-gt;bits[c-gt;l / 8] |= (0x1 lt;lt; (c-gt;l % 8)); c-gt;l += 1; #125; return true; #125;
static inline bool popCode(code *c, uint32_t *k) #123; if (c-gt;l == 0) #123; return false; #125; else #123; c-gt;l -= 1; *k = ((0x1 lt;lt; (c-gt;l % 8)) amp; c-gt;bits[c-gt;l / 8]) gt;gt; (c-gt;l % 8); return true; #125; #125; static inline bool emptyCode(code *c) #123; return c-gt;l == 0; #125; static inline bool fullCode (code *c) #123; return c-gt;l == 256; #125; #endif
|
You may find it convenient to create a function appendCode for a bitVector which can take in a code * and append that code to the bit vector.
Your Task
You should construct both an encoder and a decoder. The encoder takes any file and produces a compressed file. The decoder takes a compressed file and produces an exact duplicate of the original file, down to each bit.
Specifics
You will be pulling together most of the data structures that you learned this quarter to finish this assignment. You will need at least two different stacks, a priority queue, arrays, and of course, the Huffman tree.
Encoding
Encoding is the concept of taking a source file and compressing it to reduce its size. For this section, I will refer to the source file as sFile and the output (compressed) file as oFile. To encode an sFile, follow these steps:
-
Open the sFile, and read through it to construct your histogram. Your histogram could be a simple array of 256 uint32_t’s (because a byte can only hold 256 different values).
-
Increment the count of element 0 and element 255 by one in the histogram. This is so that at the very minimum, the histogram will have two elements present. Do this regardless of what you read in. While doing this may result in a non-optimal Huffman Tree later on, it is a quick and clean solution to handling the case when a file has no bytes or has bytes of the same value.
-
For each entry in the histogram where the count is greater than 0 (there should be at minimum two elements because of step 2), create a corresponding treeNode and insert this node into the priority queue.
-
Use the priority queue to construct the Huffman tree. You do this by acquiring the two smallest elements in the queue, adding their count together and creating an internal node (leaf == 0, symbol == $). You then join the two elements you initially acquired as the parents of this new node using treeNode *join. Then you insert this new node back into the priority queue. You can use any symbol you like to represent an internal node, but I urge you to be consistent.
-
Perform a post-order traversal of the Huffman tree (buildCode).
- (a) If the current node is a leaf, the current stack (code s) represents the path to the node, and so is the code for it. Save this stack into a table of variable length codes corresponding to each symbol (code table[256]).
- (b) If it is an interior node, push(0), and follow the left link.
- (c) After you return from the link, pop() the stack, and push(1), and follow the right link.
-
Write out a uint32_t magic number onto the oFile. This number is 0xdeadd00d. This magic number identifies a file as one which has been compressed using your encoder. It is crucial that you use this magic number and nothing else.
-
Write the length of the original file (in bytes) to the oFile as a uint64_t. This will help you debug when performing decoding and also allows you to acquire the size of the array you will need when constructing the original file from the compressed file.
-
Write out the size of your tree (in bytes) to the oFile as a uint16_t. This size will be (3 * leafCount) - 1 (but never less than zero).
-
Perform a post-order traversal of the Huffman tree to write the tree to the oFile. This should be a function called dumpTree and should write L followed by the byte of the symbol for each leaf, and I for interior nodes. You should not write a symbol for an interior node.
-
Beginning at the start of the sFile, for each symbol copy the bits of the code for that symbol to the oFile. It may prove easier to append these bits to a long bit vector, and then write the bytes of the bit vector.
-
Close both files.
-
Make sure that you follow the order for creating the oFile. If you do not, it will not work.
- (a) Magic number (uint32_t): 0xdeadd00d.
- (b) Size of sFile file (uint64_t).
- (c) Size of Huffman tree (uint16_t): (3 * number of leaf nodes) - 1.
- (d) The Huffman tree using a post-order traversal.
- (e) The encoding of the original file.