A Closer Look at Huffman Encoding Using C#

06 Apr 2016

On this page:

How it Works
Constructing the Binary Tree
Encoding the Data
Unpacking the Data
Final Thoughts
The Code

Huffman encoding is a compression technique developed by David Huffman and published in his 1952 paper 'A Method for the Construction of Minimum-Redundancy Codes'. The technique centres on encoding characters as a list of code words, which are then used to generate a coded version of the original data. Huffman encoding can be very efficient, and should generally yield compression as low as 20% of the original data size.

How it Works

The input data is scanned, and a list of characters and their corresponding weight (the number of instances the character appears in the data). The weights of each character are used to compute the length of the code words that represent that character: larger weights should yield smaller code words, a key aspect of compressing the data. Characters are then arranged into a binary tree, where the resulting layout is used to determine the encoding table. The last step is using the encoding table to create the new file.

Constructing the Binary Tree

Consider the following text:

abccccddddddefghijjjjj

Represented as a list of characters with corresponding weights:


Index	Char	Weight 
0	a	1 
1	b	1 
2	c	4 
3	d	6 
4	e	1 
5	f	1 
6	g	1 
7	h	1 
8	i	1 
9	j	5

This list is now turned into a binary tree. A binary tree is a tree where each node may have a maximum of two child nodes. The left child node is consider the 0 child node, whereas the right child node is considered the 1 child node. The first two nodes are removed from the list, and a new parent node is constructed. This parent node's weight is the combined weight of both children. In our example, after the first iteration our list would look like this:


Index 	Char	Weight 
00	null	null	//	used to be a
01	null	null	//	used to be b
02	e	1 
03	f	1 
04	g	1 
05	h	1 
06	i	1 
07	c	4 
08	j	5 
09	d	6 
10	a + b	2	//	new parent node

The next two lowest weight nodes are added to a parent:


Index 	Char	Weight 
00	null	null	//	used to be a
01	null	null	//	used to be b
02	null	null	//	used to be e
03	null	null	//	used to be f
04	g	1 
05	h	1 
06	i	1 
07	c	4 
08	j	5 
09	d	6 
10	a + b	2	//	new parent node
11	e + f	2	//	new parent node

And so on. At the end of this round of processing, the list looks like this:


Index 	Char	Weight 
00	null	null	//	used to be a
01	null	null	//	used to be b
02	null	null	//	used to be e
03	null	null	//	used to be f
04	null	null	//	used to be g
05	null	null	//	used to be h
06	null	null	//	used to be i
07	null	null	//	used to be c
08	null	null	//	used to be j
09	null	null	//	used to be d
10	a + b	2	//	new parent node
11	e + f	2	//	new parent node
12	g + h	2	//	new parent node
13	i + c	5	//	new parent node
14	j + d	11	//	new parent node

Null nodes are removed from the list:

Index 	Char	Weight
0	a + b	2
1	e + f	2
2	g + h	2
3	i + c	5
4	j + d	11

And the next round of processing begins, always taking the smallest weighted nodes and combining them under a new parent. At the end of the next round of processing, the list looks like this:


Index 	Char			Weight 
0	null			null	//	a + b
1	null			null	//	e + f
2	null			null	//	g + h
3	null			null	//	i + c
4	j + d			11 
5	a + b (2), e + f (2)	4	//	new parent node
6	g + h (2), i + c (5)	7	//	new parent node

Null nodes are removed from the list:

Index 	Char			Weight
0	a + b (2), e + f (2)	4
1	g + h (2), i + c (5)	7
2	j + d			11

At the end of the next round of processing:


Index 	Char					Weight 
0	null					4	//	a + b (2), e + f (2)
1	null					7	//	g + h (2), i + c (5)	
2	j + d					11	//	new parent node
3	(a + b, e + f) - (4), (g + h, i + c) - (7)	11	//	new parent node

After removing null nodes from the list, we are ready for the final round of processing:


Index 	Char						Weight  
0	j + d						11	//	new parent node
1	(a + b, e + f) - (4), (g + h, i + c) - (7)	11	//	new parent node

And finally, after all processing, there is only one node left:


Index 	Char							Weight  
0	(j + d) - 11, (a + b, e + f, g + h, i + c) - (11)	22	//	new parent node

This final node forms the root of the tree, and is used to generate a list of binary words that represent each character. Starting at the root of the tree, following a left child node inserts a 0 into the binary word, following a right child node inserts a 1 into the binary word. Using this method, the following list of binary words is created:


Char	Weight	BinaryWord	Int32Word  
j	5	00		0		//	This column is for sorting purposes only
d	6	01		1  
a	1	1000		8  
b	1	1001		9  
e	1	1010		10  
f	1	1011		11  
g	1	1100		12  
h	1	1101		13  
I	1	1110		14  
c	4	1111		15

Note that nodes without a character value are not present in this list, but are able to be inferred from the list sorting operations. For the moment, I'll leave you to construct the tree using the information above.

Encoding the Data

The next step is to encode the original data using the binary words created from the previous steps. Moving forward through the data one byte at a time, each character is used to perform a lookup for its corresponding binary word. This binary word is added into a binary word buffer. When the binary word buffer reaches sufficient length, 8 values are read off and removed to construct a new byte. Consider the following:


Index	Char	BinaryWord	ByteBuffer 
00	a	1000		//	Only 4 bits here, not enough for a new byte
01	b	1001		//	Combined with the previous entry, there are enough bits here for a new byte: 10001001 => 137, remove 8 entries from the buffer
02 	c	1111		//	Start a new byte using these 4 bits
03	c	1111		//	11111111 => 255
04	c	1111		//	Start a new byte
05	c	1111		//	11111111 => 255
06	d	01		//	Start a new byte
07	d	01		//	0101 - the buffer is only half full
08	d	01		//	010101
09	d	01		//	01010101 => 85 - note that this one byte actually represents 4 chars!
10	d	01		//	Start a new byte
11	d	01		//	0101
12	e	1010		//	01011010 => 90 - this one byte represents 3 chars
13	f	1011		//	Start a new byte
14	g	1100		//	10111100 => 188
15	h	1101		//	New byte
16	i	1110		//	11011110 => 222
17	j	00		//	New byte
18	j	00		//	0000
19	j	00		//	000000
20	j	00		//	00000000 => 0
21	j	00		//	Start a new byte

With the processing of the final character, the byte buffer only contains 2 values, which is not enough to construct a byte, however as the binary word is 00 it is simply converted to 0. Therefore, the final byte to be constructed is 0. Processing of these 22 characters has yielded a list containing only 9 bytes, about 41% of the original data size.

Unpacking the Data

Decompression (in this example at least) is relatively straight forward. The byte list is read, and each byte converted into a binary word, which is added into a buffer. If the buffer is less than 8 values, then it is 'fixed' (more on that later). Reading each bit from the binary word indicates which path to take through the binary tree (constructed above): starting at the root node, follow the left child for a 0 value, and follow the right for a 1 value. The entire decompression process for the above data:


Index	Byte			Characters	  
0	137 => 10001001		a, b  
1	255 => 11111111		c, c  
2	255 => 11111111		c, c  
3	85  => 01010101		d, d, d, d  
4	90  => 01011010		d, d, e  
5	188 => 10111100		f, g  
6	222 => 11011110		h, i  
7	0   => 00000000		j, j, j, j  
8	0   => 00		j		//	This is the last value, so rather than pad it out, we can just go directly to the char value

The output data now contains 22 characters, the data has been successfully unpacked.

Final Thoughts

Conveniently, this example ignores completely the size of the encoding table, as well as how storage and retrieval mechanisms, so theoretically 9 bytes is probably as small as we can get. If this compressed data were to be used outside this example, then the binary tree and code words would also need to be stored along with the byte list in order to unpack the data at some other time.

The Code

I've included my somewhat rambling encoding and decoding process, in C# below.


public class HuffmanCompressor 
{ 
    private HuffmanNode oRootHuffmanNode; 
    private List<HuffmanNode> oValueHuffmanNodes; 

    private List<HuffmanNode> BuildBinaryTree(string Value) 
    { 
        List<HuffmanNode> oHuffmanNodes = GetInitialNodeList(); 

        //  Update node weights
        Value.ToList().ForEach(m => oHuffmanNodes[m].Weight++); 

        //  Select only nodes that have a weight
        oHuffmanNodes = oHuffmanNodes 
            .Where(m => (m.Weight > 0)) 
            .OrderBy(m => (m.Weight)) 
            .ThenBy(m => (m.Value)) 
            .ToList(); 

        //  Assign parent nodes
        oHuffmanNodes = UpdateNodeParents(oHuffmanNodes); 

        oRootHuffmanNode = oHuffmanNodes[0]; 
        oHuffmanNodes.Clear(); 

        //  Sort nodes into a tree
        SortNodes(oRootHuffmanNode, oHuffmanNodes); 

        return oHuffmanNodes; 
    } 

    public void Compress(string FileName) 
    { 
        FileInfo oFileInfo = new FileInfo(FileName); 

        if (oFileInfo.Exists == true) 
        { 
            string sFileContents = ""; 

            using (StreamReader oStreamReader = new StreamReader(File.OpenRead(FileName))) 
            { 
                sFileContents = oStreamReader.ReadToEnd(); 
            } 

            List<HuffmanNode> oHuffmanNodes = BuildBinaryTree(sFileContents); 

            //  Filter to nodes we care about
            oValueHuffmanNodes = oHuffmanNodes 
                .Where(m => (m.Value.HasValue == true)) 
                .OrderBy(m => (m.BinaryWord))   //  Not really required, for presentation purposes
                .ToList(); 

            //  Construct char to binary word dictionary for quick value to binary word resolution
            Dictionary<char, string> oCharToBinaryWordDictionary = new Dictionary<char, string>(); 
            foreach (HuffmanNode oHuffmanNode in oValueHuffmanNodes) 
            { 
                oCharToBinaryWordDictionary.Add(oHuffmanNode.Value.Value, oHuffmanNode.BinaryWord); 
            } 

            StringBuilder oStringBuilder = new StringBuilder(); 
            List<byte> oByteList = new List<byte>(); 
            for (int i = 0; i < sFileContents.Length; i++) 
            { 
                string sWord = ""; 

                //  Append the binary word value using the char located at the current file position
                oStringBuilder.Append(oCharToBinaryWordDictionary[sFileContents[i]]); 

                //  Once we have at least 8 chars, we can construct a byte
                while (oStringBuilder.Length >= 8) 
                { 
                    sWord = oStringBuilder.ToString().Substring(0, 8); 

                    //  Remove the word to be constructed from the buffer
                    oStringBuilder.Remove(0, sWord.Length); 
                } 

                //  Convert the word and add it onto the list
                if (String.IsNullOrEmpty(sWord) == false) 
                { 
                    oByteList.Add(Convert.ToByte(sWord, 2)); 
                } 
            } 

            //  If there is anything in the buffer, make sure it is retrieved
            if (oStringBuilder.Length > 0) 
            { 
                string sWord = oStringBuilder.ToString(); 

                //  Convert the word and add it onto the list
                if (String.IsNullOrEmpty(sWord) == false) 
                { 
                    oByteList.Add(Convert.ToByte(sWord, 2)); 
                } 
            } 

            //  Write compressed file
            string sCompressedFileName = Path.Combine(oFileInfo.Directory.FullName, String.Format("{0}.compressed", oFileInfo.Name.Replace(oFileInfo.Extension, ""))); 
            if (File.Exists(sCompressedFileName) == true) 
            { 
                File.Delete(sCompressedFileName); 
            } 

            using (FileStream oFileStream = File.OpenWrite(sCompressedFileName)) 
            { 
                oFileStream.Write(oByteList.ToArray(), 0, oByteList.Count); 
            } 
        } 
    } 

    public void Decompress(string FileName) 
    { 
        FileInfo oFileInfo = new FileInfo(FileName); 

        if (oFileInfo.Exists == true) 
        { 
            string sCompressedFileName = String.Format("{0}.compressed", oFileInfo.FullName.Replace(oFileInfo.Extension, "")); 

            byte[] oBuffer = null; 
            using (FileStream oFileStream = File.OpenRead(sCompressedFileName)) 
            { 
                oBuffer = new byte[oFileStream.Length]; 
                oFileStream.Read(oBuffer, 0, oBuffer.Length); 
            } 

            //  Find the zero node
            HuffmanNode oZeroHuffmanNode = oRootHuffmanNode; 
            while (oZeroHuffmanNode.Left != null) 
            { 
                oZeroHuffmanNode = oZeroHuffmanNode.Left; 
            } 

            //  Unpack the file contents
            HuffmanNode oCurrentHuffmanNode = null; 
            StringBuilder oStringBuilder = new StringBuilder(); 

            for (int i = 0; i < oBuffer.Length; i++) 
            { 
                string sBinaryWord = ""; 
                byte oByte = oBuffer[i]; 

                if (oByte == 0) 
                { 
                    sBinaryWord = oZeroHuffmanNode.BinaryWord; 
                } 
                else 
                { 
                    sBinaryWord = Convert.ToString(oByte, 2); 
                } 

                if ((sBinaryWord.Length < 8) && (i < (oBuffer.Length - 1))) 
                { 
                    //  Pad binary word out to 8 places
                    StringBuilder oBinaryStringBuilder = new StringBuilder(sBinaryWord); 
                    while (oBinaryStringBuilder.Length < 8) 
                    { 
                        oBinaryStringBuilder.Insert(0, "0"); 
                    } 

                    sBinaryWord = oBinaryStringBuilder.ToString(); 
                } 

                //  Use the binary word to navigate the tree looking for the value
                for (int j = 0; j < sBinaryWord.Length; j++) 
                { 
                    char cValue = sBinaryWord[j]; 

                    if (oCurrentHuffmanNode == null) 
                    { 
                        oCurrentHuffmanNode = oRootHuffmanNode; 
                    } 

                    if (cValue == '0') 
                    { 
                        oCurrentHuffmanNode = oCurrentHuffmanNode.Left; 
                    } 
                    else 
                    { 
                        oCurrentHuffmanNode = oCurrentHuffmanNode.Right; 
                    } 

                    if ((oCurrentHuffmanNode.Left == null) && (oCurrentHuffmanNode.Right == null)) 
                    { 
                        //  No more child nodes to choose from, so this must be a value node
                        oStringBuilder.Append(oCurrentHuffmanNode.Value.Value); 
                        oCurrentHuffmanNode = null; 
                    } 
                } 
            } 

            //  Write out file
            string sUncompressedFileName = Path.Combine(oFileInfo.Directory.FullName, String.Format("{0}.uncompressed", oFileInfo.Name.Replace(oFileInfo.Extension, ""))); 

            if (File.Exists(sUncompressedFileName) == true) 
            { 
                File.Delete(sUncompressedFileName); 
            } 

            using (StreamWriter oStreamWriter = new StreamWriter(File.OpenWrite(sUncompressedFileName))) 
            { 
                oStreamWriter.Write(oStringBuilder.ToString()); 
            } 
        } 
    } 

    private static List<HuffmanNode> GetInitialNodeList() 
    { 
        List<HuffmanNode> oGetInitialNodeList = new List<HuffmanNode>(); 

        for (int i = Char.MinValue; i < Char.MaxValue; i++) 
        { 
            oGetInitialNodeList.Add(new HuffmanNode((char)(i))); 
        } 

        return oGetInitialNodeList; 
    } 

    private static void SortNodes(HuffmanNode Node, List<HuffmanNode> Nodes) 
    { 
        if (Nodes.Contains(Node) == false) 
        { 
            Nodes.Add(Node); 
        } 

        if (Node.Left != null) 
        { 
            SortNodes(Node.Left, Nodes); 
        } 

        if (Node.Right != null) 
        { 
            SortNodes(Node.Right, Nodes); 
        } 
    } 

    private static List<HuffmanNode> UpdateNodeParents(List<HuffmanNode> Nodes) 
    { 
        //  Assign parent nodes
        while (Nodes.Count > 1) 
        { 
            int iOperations = (Nodes.Count / 2); 
            for (int iOperation = 0, i = 0, j = 1; iOperation < iOperations; iOperation++, i += 2, j += 2) 
            { 
                if (j < Nodes.Count) 
                { 
                    HuffmanNode oParentHuffmanNode = new HuffmanNode(Nodes[i], Nodes[j]); 
                    Nodes.Add(oParentHuffmanNode); 

                    Nodes[i] = null; 
                    Nodes[j] = null; 
                } 
            } 

            //  Remove null nodes
            Nodes = Nodes 
                .Where(m => (m != null)) 
                .OrderBy(m => (m.Weight))   //  Choose the lowest weightings first
                .ToList(); 
        } 

        return Nodes; 
    }

Here's the code for the HuffmanNode type itself:


public class HuffmanNode 
{ 
    public HuffmanNode() 
    { 
    } 

    public HuffmanNode(char Value) 
    { 
        cValue = Value; 
    } 

    public HuffmanNode(HuffmanNode Left, HuffmanNode Right) 
    { 
        oLeft = Left; 
        oLeft.oParent = this; 
        oLeft.bIsLeftNode = true; 

        oRight = Right; 
        oRight.oParent = this; 
        oRight.bIsRightNode = true; 

        iWeight = (oLeft.Weight + oRight.Weight); 
    } 

    private string sBinaryWord; 
    private bool bIsLeftNode; 
    private bool bIsRightNode; 
    private HuffmanNode oLeft; 
    private HuffmanNode oParent; 
    private HuffmanNode oRight; 
    private char? cValue; 
    private int iWeight; 

    public string BinaryWord 
    { 
        get 
        { 
            string sReturnValue = ""; 

            if (String.IsNullOrEmpty(sBinaryWord) == true) 
            { 
                StringBuilder oStringBuilder = new StringBuilder(); 

                HuffmanNode oHuffmanNode = this; 

                while (oHuffmanNode != null) 
                { 
                    if (oHuffmanNode.bIsLeftNode == true) 
                    { 
                        oStringBuilder.Insert(0, "0"); 
                    } 

                    if (oHuffmanNode.bIsRightNode == true) 
                    { 
                        oStringBuilder.Insert(0, "1"); 
                    } 

                    oHuffmanNode = oHuffmanNode.oParent; 
                } 

                sReturnValue = oStringBuilder.ToString(); 
                sBinaryWord = sReturnValue; 
            } 
            else 
            { 
                sReturnValue = sBinaryWord; 
            } 

            return sReturnValue; 
        } 
    } 

    public HuffmanNode Left 
    { 
        get 
        { 
            return oLeft; 
        } 
    } 

    public HuffmanNode Parent 
    { 
        get 
        { 
            return oParent; 
        } 
    } 

    public HuffmanNode Right 
    { 
        get 
        { 
            return oRight; 
        } 
    } 

    public char? Value 
    { 
        get 
        { 
            return cValue; 
        } 
    } 

    public int Weight 
    { 
        get 
        { 
            return iWeight; 
        } 
        set 
        { 
            iWeight = value; 
        } 
    } 

    public override string ToString() 
    { 
        StringBuilder oStringBuilder = new StringBuilder(); 

        if (cValue.HasValue == true) 
        { 
            oStringBuilder.AppendFormat("'{0}' ({1}) - {2} ({3})", cValue.Value, iWeight, BinaryWord, BinaryWord.BinaryStringToInt32()); 
        } 
        else 
        { 
            if ((oLeft != null) && (oRight != null)) 
            { 
                if ((oLeft.Value.HasValue == true) && (oRight.Value.HasValue == true)) 
                { 
                    oStringBuilder.AppendFormat("{0} + {1} ({2})", oLeft.Value, oRight.Value, iWeight); 
                } 
                else 
                { 
                    oStringBuilder.AppendFormat("{0}, {1} - ({2})", oLeft, oRight, iWeight); 
                } 
            } 
            else 
            { 
                oStringBuilder.Append(iWeight); 
            } 
        } 

        return oStringBuilder.ToString(); 
    } 
}

And finally, an extension method that is also required for the code to compile:


public static int BinaryStringToInt32(this string Value) 
{ 
    int iBinaryStringToInt32 = 0; 

    for (int i = (Value.Length - 1), j = 0; i >= 0; i--, j++) 
    { 
        iBinaryStringToInt32 += ((Value[j] == '0' ? 0 : 1) * (int)(Math.Pow(2, i))); 
    } 

    return iBinaryStringToInt32; 
}

To get this code running, it will need to be stitched together, perhaps in a console application:


public static void Main(string[] Arguments) 
{ 
    HuffmanCompressor oHuffmanCompressor = new HuffmanCompressor(); 
    oHuffmanCompressor.Compress("Huffman.txt"); 
    Console.WriteLine(); 
    oHuffmanCompressor.Decompress("Huffman.txt"); 
    Console.WriteLine(); 
}

Note that for this example, decompression is only possible if the binary tree has been constructed, i,e., the data has already been compressed.