CSCI466 Assignment 2

This is a big programming assignment exploring different forms of lossless compression. In particular, we investigate the Burrows-Wheeler Transform (BWT) compression scheme and compare its efficiency with Huffman coding.

The Huffman-encoded text is to be transmitted over a channel that can be modeled as a binary symmetric channel with given transition probabilities. With this model, we will insert errors into the encoded text file and attempt to decode the text after reception at the receiver.

We will then design a suitable channel code so that, after receiving and decoding, the number of errors is reduced. So much for the introduction.

There are a few tasks involved, and I will give some answers here.

1. Source Model

1.1 What is the file size in bytes for "Dec of Ind.txt"?
Ans: The file is given by your lecturer. To find the file size, type the "ls -al" command on the command line; it lists the file properties, including the size in bytes. Alternatively, you can right-click the file and select Properties.



1.2 Write the program to compute the frequency of the ASCII characters used in the text.
Ans: Simple. Just write a program that reads the file byte by byte and counts each character. If you want the full program, ask me.
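If you want a starting point, here is a minimal sketch in C; it assumes "Dec of Ind.txt" sits in the current directory:

#include <stdio.h>

int main(void)
{
    /* Open the assignment's text file; adjust the name/path as needed. */
    FILE *fp = fopen("Dec of Ind.txt", "rb");
    if (!fp) { perror("fopen"); return 1; }

    long count[256] = {0};
    int c;
    while ((c = fgetc(fp)) != EOF)
        count[c]++;                         /* one counter per byte value */
    fclose(fp);

    for (int i = 0; i < 256; i++)
        if (count[i] > 0)
            printf("%3d '%c': %ld\n", i,
                   (i >= 32 && i < 127) ? i : '?', count[i]);
    return 0;
}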

1.3 Report the findings as probability distribution table.
Ans: Try to write your program to output the characters, their counts, and their probabilities (count divided by file size) as a table.


1.4 Compute the entropy of the source generating these characters.
Ans: In information theory, Claude Elwood Shannon said the entropy of the source can be calculated with

H = - Σ p(i) log2 p(i), where the sum runs over all source symbols i and the result is in bits/symbol.
Based on this formula, the answer for this file is 4.30196 bits/symbol.
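Here is a minimal sketch of the computation in C, using hypothetical counts in place of the real table from 1.2 (link with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Hypothetical character counts; substitute the real counts from 1.2. */
    long count[] = { 50, 25, 15, 10 };
    int k = 4;
    long total = 0;
    for (int i = 0; i < k; i++) total += count[i];

    double H = 0.0;
    for (int i = 0; i < k; i++) {
        double p = (double)count[i] / (double)total;   /* p(i) */
        H -= p * log2(p);                              /* -p(i) log2 p(i) */
    }
    printf("entropy = %f bits/symbol\n", H);
    return 0;
}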

1.5 Comment on the probability distribution of the character used in text file. Is the expectation on the most frequently used letter of the English alphabet confirmed?
Ans: Yes. It is well documented that the most frequent letter of the English alphabet is "E", and the counts from this text file confirm that expectation.

1.6 On average, how many bits/symbol would you expect to use in encoding this text file?
Ans: The entropy is the theoretical lower bound. In this case, our entropy is 4.30196, so we expect a good code to use about 4.30196 bits/symbol on average for this text file.

2. Entropy Coding

2.1 Encoding

Huffman coding is an entropy encoding algorithm used for lossless data compression. I will discuss the Huffman tree construction. It is based on the probability distribution of the source: the Huffman tree produces a unique, prefix-free codeword for every source symbol.

You have to write your program to do the Huffman tree construction. The algorithm should look like this.

BEGIN
Arrange the source symbols in decreasing order of probability
WHILE more than one node remains
Pick the two nodes with the smallest probabilities and add them, writing the sum to a new parent node
Mark the two merged nodes so they are not picked again in the next round
Label the two branches with a "1" and a "0"
END WHILE
Read each codeword by walking from the root down to its symbol
END

After the algorithm, your program will produce a table of codewords, one for each character.
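As a starting point, here is a minimal sketch of the construction in C. The probabilities are hypothetical (not the real file's), and a simple linear scan stands in for a priority queue:

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    double prob;
    int symbol;                 /* -1 marks an internal node */
    struct Node *left, *right;
} Node;

static Node *new_node(double prob, int symbol, Node *l, Node *r)
{
    Node *n = malloc(sizeof *n);
    n->prob = prob; n->symbol = symbol; n->left = l; n->right = r;
    return n;
}

/* Linear scan for the remaining node with the smallest probability. */
static int pick_min(Node **pool, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (pool[i] && (best < 0 || pool[i]->prob < pool[best]->prob))
            best = i;
    return best;
}

/* Walk the tree; the path of 0s and 1s to each leaf is its codeword. */
static void print_codes(const Node *t, char *buf, int depth)
{
    if (!t->left && !t->right) {
        buf[depth] = '\0';
        printf("%c : %s\n", t->symbol, buf);
        return;
    }
    buf[depth] = '0'; print_codes(t->left, buf, depth + 1);
    buf[depth] = '1'; print_codes(t->right, buf, depth + 1);
}

int main(void)
{
    int sym[] = { 'a', 'e', 'n', 't', ' ' };       /* hypothetical symbols */
    double p[] = { 0.15, 0.30, 0.10, 0.20, 0.25 };
    int n = 5;

    Node *pool[16] = { 0 };
    for (int i = 0; i < n; i++) pool[i] = new_node(p[i], sym[i], 0, 0);

    /* Repeatedly merge the two smallest-probability nodes. */
    for (int left = n; left > 1; left--) {
        int a = pick_min(pool, n); Node *na = pool[a]; pool[a] = 0;
        int b = pick_min(pool, n); Node *nb = pool[b]; pool[b] = 0;
        pool[a] = new_node(na->prob + nb->prob, -1, na, nb);
    }
    char buf[64];
    print_codes(pool[pick_min(pool, n)], buf, 0);
    return 0;
}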


Encode the text with these codewords, output the result into a file, and check the file size. It should be smaller than the original text file.

2.2 Decoding

This process is the reverse of encoding. In encoding, we convert each letter into a codeword; in decoding, we convert each codeword back into its letter. The text should not deviate or change after the decoding process. If it changes, your encoding tree is not correct.
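Decoding is easy if you kept the tree: read the encoded bits one at a time, walk left on "0" and right on "1", and every time you reach a leaf, output its letter and restart at the root. Here is a minimal sketch in C, with a small hand-built tree for illustration (codes: a = 0, b = 10, c = 11):

#include <stdio.h>

typedef struct Node { int symbol; struct Node *left, *right; } Node;

static int decode_bits(const Node *root, const char *bits, char *out)
{
    const Node *cur = root;
    int n = 0;
    for (const char *b = bits; *b; b++) {
        cur = (*b == '0') ? cur->left : cur->right;
        if (!cur->left && !cur->right) {   /* reached a leaf */
            out[n++] = (char)cur->symbol;
            cur = root;                    /* restart for the next symbol */
        }
    }
    out[n] = '\0';
    return n;
}

int main(void)
{
    Node a = { 'a', 0, 0 }, b = { 'b', 0, 0 }, c = { 'c', 0, 0 };
    Node bc = { 0, &b, &c }, root = { 0, &a, &bc };
    char out[16];
    decode_bits(&root, "0101100", out);
    printf("%s\n", out);                   /* prints "abca" */
    return 0;
}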

2.3 Comparison of original vs. entropy-coded file

To compare, type the command "cmp -l file1 file2". The cmp utility compares two files of any type and writes the results to standard output. By default, cmp is silent if the files are the same; if they differ, the byte and line number at which the first difference occurred is reported, and the -l option lists every differing byte.

3. Burrows-Wheeler Transform

BWT is not actually a compression scheme by itself. In fact, BWT requires a small amount of additional information to be stored with the transformed data so that the reverse process can take place. BWT is useful because it converts the data into a format that is generally more compressible by statistical encoders (e.g. Huffman coding). With the help of "Move to Front" coding, the data ends up in a format which is more compressible even by zero-order statistical encoders.

Your lecturer may give you some documents about BWT, but trust me, they are complicated. I will explain it here so that it is easier to understand.

Let's say the original string is "nokia". First, we write down all the cyclic rotations of the string, as in the table below (S1 is the original string).

S1: n o k i a
S2: o k i a n
S3: k i a n o
S4: i a n o k
S5: a n o k i

Next, we arrange this table in lexicographical order. Call the first column F and the last column L.

Row 1: a n o k i   (S5)
Row 2: i a n o k   (S4)
Row 3: k i a n o   (S3)
Row 4: n o k i a   (S1)
Row 5: o k i a n   (S2)

Okay, notice that F is now in alphabetical order. We take column "L", which reads "ikoan" from top to bottom. We can also find the primary index: simply search for S1 in the sorted table, and its row number is the primary index. S1 is in row 4, so the primary index = 4.
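If you want to check these tables by program, here is a minimal sketch of the forward transform in C for "nokia":

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 5

static int cmp(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

int main(void)
{
    const char *s = "nokia";
    char rot[N][N + 1];
    char *rows[N];

    /* Build all N cyclic rotations of the string. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            rot[i][j] = s[(i + j) % N];
        rot[i][N] = '\0';
        rows[i] = rot[i];
    }

    /* Sort the rotations lexicographically. */
    qsort(rows, N, sizeof rows[0], cmp);

    /* L is the last column; the primary index is the row of S1. */
    char L[N + 1];
    int primary = 0;
    for (int i = 0; i < N; i++) {
        L[i] = rows[i][N - 1];
        if (strcmp(rows[i], s) == 0)
            primary = i + 1;               /* 1-based, as in the text */
    }
    L[N] = '\0';
    printf("L = %s, primary index = %d\n", L, primary);   /* ikoan, 4 */
    return 0;
}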

Given L and the primary index, one can decode with program code like this. T is a transformation vector built from L; see the full sketch below.

void decode()
{
    int index = primary_index;            /* row of the original string */
    for (int i = N - 1; i >= 0; i--)      /* N = string length, 5 here */
    {
        out[i] = L[index];                /* recover the text back to front */
        index = T[index];                 /* jump to the previous row */
    }
}
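For completeness, here is a self-contained sketch of the inverse transform for the "nokia" example. It builds T from L with a counting sort, which is one common convention; note that the text's primary index 4 is 1-based, so it becomes row 3 in 0-based C arrays:

#include <stdio.h>

int main(void)
{
    const char L[] = "ikoan";          /* last column from the example */
    int n = 5, primary = 3;            /* primary index 4, 0-based */

    /* start[c] = first row of F (the sorted L) holding character c. */
    int count[256] = {0}, start[256] = {0}, T[5];
    for (int i = 0; i < n; i++) count[(unsigned char)L[i]]++;
    for (int c = 1; c < 256; c++) start[c] = start[c - 1] + count[c - 1];

    /* T[i] = row in F of row i's last character. */
    for (int i = 0; i < n; i++) T[i] = start[(unsigned char)L[i]]++;

    /* Walk T from the primary row, filling the output back to front. */
    char out[6]; out[n] = '\0';
    int index = primary;
    for (int i = n - 1; i >= 0; i--) {
        out[i] = L[index];
        index = T[index];
    }
    printf("%s\n", out);               /* prints "nokia" */
    return 0;
}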


Move to Front Coding

This is one of the recommendations that goes together with BWT coding. Move-to-front helps us compress faster and better. Let's see how it works.

We know that our L column is "ikoan". From it, the receiver can build the same starting symbol list as the sender.

Source1 = { a i k o n } and we have L = "ikoan". Let's encode (positions count from 0).

First, we see "i" in L column, then we search "i" in source symbols. We found it in 1st positiion. So R[I]={1}. We will flip the source by putting the symbol at the front. Therefore, the source become {i a k o n}.

Second, we continue down the L column and now we see "k". Check the position of "k" in the new source { i a k o n }: it is at position 2, so R = {1, 2}. We move the symbol to the front again, and the source becomes { k i a o n }.

In this way, we continue through all the characters in the L column, and eventually we get R = {1, 2, 3, 3, 4}.
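Here is a minimal sketch of the encoder in C, using the example's symbol list and 0-based positions:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char list[] = { 'a', 'i', 'k', 'o', 'n' };   /* the source list above */
    const char *L = "ikoan";                     /* the L column */
    int n = 5;

    for (int i = 0; i < n; i++) {
        int pos = 0;
        while (list[pos] != L[i]) pos++;   /* position of the symbol */
        printf("%d ", pos);                /* emit it as the MTF index */

        /* Move the symbol to the front of the list. */
        char c = list[pos];
        memmove(list + 1, list, pos);
        list[0] = c;
    }
    printf("\n");                          /* prints: 1 2 3 3 4 */
    return 0;
}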

We will use this {1, 2, 3, 3, 4} with Huffman coding: we construct the Huffman tree based on these indices. As you notice, the original word "nokia" is now replaced by {1, 2, 3, 3, 4}. On real text, where BWT groups repeated characters together, these indices are mostly small and repetitive, so the file size is reduced dramatically when we compress.

To program this topic, you will follow these steps:
Compression: 1. Apply BWT, 2. Apply Move-to-Front, 3. Apply Huffman coding.
Decompression: 1. Decode Huffman, 2. Decode Move-to-Front, 3. Apply the inverse BWT.

The result of the whole process will show up as differences in file size between the stages.



4. Channel Coding

Now we have data encoded with the Huffman coding algorithm. We are going to send this information across some kind of channel.

The question is how reliable our information is across the channel. This is a big problem, since our information depends on an unreliable medium: noise on the signal can easily flip our information bits without our knowledge.

To demonstrate this problem, we will write a program that introduces bit errors with probability 0.03% and 0.01%. It means that each time the program reads a bit from the file, I call my function to generate a random number; if the value is less than or equal to the value selected for p, it inverts the bit, otherwise the bit remains unchanged.
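Here is a minimal sketch of that error-injection step in C. The byte buffer is a stand-in for data read from your encoded file, and rand() is a simple stand-in for a proper random source:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    double p = 0.0003;                     /* 0.03% bit-error probability */
    unsigned char data[4] = { 0x55, 0xAA, 0xFF, 0x00 };   /* stand-in bytes */
    srand((unsigned)time(NULL));

    for (int i = 0; i < 4; i++)
        for (int bit = 0; bit < 8; bit++)
            /* Uniform draw in [0,1]; flip the bit when it falls below p. */
            if ((double)rand() / RAND_MAX <= p)
                data[i] ^= (unsigned char)(1u << bit);

    for (int i = 0; i < 4; i++)
        printf("%02X ", data[i]);
    printf("\n");
    return 0;
}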



We are now using the Hamming (7,4) code. I will give you an example.

Step 1: Decide on the positions of the parity bits. They go at the power-of-two positions (1, 2, and 4); the data bits fill positions 3, 5, 6, and 7.

Position: 1  2  3  4  5  6  7
Content:  P1 P2 D  P4 D  D  D

Step 2: Put the data bits into the chart. We use the example data bits = 0101.

Position: 1  2  3  4  5  6  7
Content:  ?  ?  0  ?  1  0  1

Step 3: Now we will calculate bit#1. (check 1 bit, skip 1 bit; start from bit#1)
result = bit#3 + bit#5 + bit#7 = 0 + 1 + 1 = 2 (even)
If the result is even, we assign "0"; if it is odd, we assign "1". Now the table becomes:

Position: 1  2  3  4  5  6  7
Content:  0  ?  0  ?  1  0  1

Step 4: To calculate bit#2, (check 2 bits, skip 2 bits; start from bit#2)
result = bit#3 + bit#6 + bit#7 = 0 + 0 + 1 = 1 (odd)
The result is odd, so assign "1". Now the table becomes:

Position: 1  2  3  4  5  6  7
Content:  0  1  0  ?  1  0  1

Step 5: To calculate bit#4, (check 4 bits, skip 4 bits; start from bit#4)
result = bit#5 + bit#6 + bit#7 = 1 + 0 + 1 = 2 (even)
The result is even, so assign "0". The table is now:

Position: 1  2  3  4  5  6  7
Content:  0  1  0  0  1  0  1

The data (4 bits) "0101" is now "0100101". Now I will show you how the receiver corrects an error bit.
Let's say the data is transferred and the receiver gets "0100111". So the receiver has this table:

Position: 1  2  3  4  5  6  7
Content:  0  1  0  0  1  1  1

Error pointer = 0;

First, check for parity at bit#1. Check 1 bit, skip 1 bit.
Result = bit#1 + bit#3 + bit#5 + bit#7 = 0 + 0 + 1 + 1 = 2 (even) => no error

Second, check for parity at bit#2, check 2 bits, skip 2 bits.
Result = bit#2 + bit#3 + bit#6 + bit#7 = 1 + 0 + 1 + 1 = 3 (odd) => error
Since the check at bit#2 fails, add 2 to the error pointer. (Error Pointer = 2)

Third, check for parity at bit#4, check 4 bits, skip 4 bits.
Result = bit#4 + bit#5 + bit#6 + bit#7 = 0 + 1 + 1 + 1 = 3 (odd) => error
Since the check at bit#4 fails, add 4 to the error pointer. (Error Pointer = 2 + 4 = 6)

Now we know the error is at bit#6. To correct it, simply toggle that bit. In this way, the error is automatically corrected.
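Here is a minimal sketch of both sides in C, matching the layout above (parity bits at positions 1, 2, and 4; data bits at 3, 5, 6, and 7; the array is indexed 1..7 for readability):

#include <stdio.h>

/* Fill in the three parity bits so each group has even parity. */
static void encode(int bits[8])
{
    bits[1] = bits[3] ^ bits[5] ^ bits[7];   /* covers 1, 3, 5, 7 */
    bits[2] = bits[3] ^ bits[6] ^ bits[7];   /* covers 2, 3, 6, 7 */
    bits[4] = bits[5] ^ bits[6] ^ bits[7];   /* covers 4, 5, 6, 7 */
}

/* Re-run the parity checks; the failed checks add up to the error position. */
static void correct(int bits[8])
{
    int pointer = 0;
    if (bits[1] ^ bits[3] ^ bits[5] ^ bits[7]) pointer += 1;
    if (bits[2] ^ bits[3] ^ bits[6] ^ bits[7]) pointer += 2;
    if (bits[4] ^ bits[5] ^ bits[6] ^ bits[7]) pointer += 4;
    if (pointer) bits[pointer] ^= 1;          /* toggle the bad bit */
}

int main(void)
{
    /* Data bits 0101 at positions 3, 5, 6, 7, as in the example. */
    int bits[8] = {0};
    bits[3] = 0; bits[5] = 1; bits[6] = 0; bits[7] = 1;
    encode(bits);                             /* gives 0 1 0 0 1 0 1 */

    bits[6] ^= 1;                             /* the channel flips bit#6 */
    correct(bits);                            /* error pointer = 6 */

    for (int i = 1; i <= 7; i++) printf("%d", bits[i]);
    printf("\n");                             /* prints 0100101 again */
    return 0;
}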

With Hamming coding, the reliability is better than transmitting the data bits alone: we can reduce the error rate. However, one must bear in mind that to gain reliability, you must sacrifice file size and transmission time.
