# Third Midterm Exam Solutions

#### 23 April 2007

Question text is in black, solutions are in blue.

### Directions:

• Answer the problems on the exam pages.
• There are six problems on pages 2-7, for 100 total points. Actual scale was A=92, C=57.
• If you need extra space use the back of a page.
• No books, notes, calculators, or collaboration.
• The first four questions are true/false, with five points for the correct boolean answer and up to five for a correct justification.
• Questions 5 and 6 have numerical answers -- remember that logarithms are base 2.

```
Q1: 10 points
Q2: 10 points
Q3: 10 points
Q4: 10 points
Q5: 30 points
Q6: 30 points
Total: 100 points
```

• Question 1 (10): True or false with justification: Let X be the input alphabet for a memoryless channel and let Y be the output alphabet. Then the mutual information I(X,Y) is the same for any discrete random source with alphabet X.

FALSE. Consider X = Y = {0,1} and let the channel always reproduce the input as the output. If X always sends 0, then H(X) = H(Y) = H(X,Y) = 0 because there is only one possible input, occurring with probability 1, and so -log Pr(X=a) is always 0. Thus I(X,Y) = 0 + 0 - 0 = 0. But if X sends 0 or 1 each with probability 1/2, then H(X) = H(Y) = H(X,Y) = 1 (since -log Pr(X=a), etc., are always 1), and thus I(X,Y) = 1 + 1 - 1 = 1. The mutual information is different for the two distributions on X.
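As a sanity check (not part of the exam), both cases can be computed directly from the definition I(X,Y) = H(X) + H(Y) - H(X,Y). The joint distributions below encode the noiseless channel, so the only outcomes are pairs (a,a).

```python
from math import log2

def mutual_information(joint):
    """I(X,Y) = H(X) + H(Y) - H(X,Y); joint maps (x, y) -> probability."""
    def H(probs):
        return sum(p * log2(1 / p) for p in probs if p > 0)
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return H(px.values()) + H(py.values()) - H(joint.values())

# The noiseless channel copies the input, so Y = X.
constant_source = {(0, 0): 1.0}               # X always sends 0
uniform_source = {(0, 0): 0.5, (1, 1): 0.5}   # X sends 0 or 1 equally often

print(mutual_information(constant_source))  # 0.0
print(mutual_information(uniform_source))   # 1.0
```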

• Question 2 (10): True or false with justification: If a linear code is able to correct up to t bit errors per block, then it must be able to detect up to 2t errors per block.

TRUE. Proof by contrapositive -- if the code were not able to detect up to 2t errors per block, there would have to be two code words within Hamming distance 2t of each other, so that one code word could be transformed into the other by at most 2t bit errors. But if this is the case, where x and y are code words at most 2t apart, there must be a single word z that is at most distance t from both x and y (take x, for example, and form z by changing t of the bits in which x and y differ, or all of them if fewer than t differ). If a recipient gets z and knows only that there were at most t errors, they still cannot tell whether x or y is the correct word sent.
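The construction of z can be made concrete. This sketch (illustrative only, using made-up codewords) flips t of the differing bits of x toward y and checks that the result lies within distance t of both words.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(c != d for c, d in zip(a, b))

def midpoint(x, y, t):
    """Flip up to t of the bits where x and y differ, moving x toward y."""
    z, flips = list(x), 0
    for i, (c, d) in enumerate(zip(x, y)):
        if flips == t:
            break
        if c != d:
            z[i] = d
            flips += 1
    return "".join(z)

x, y, t = "0000000", "1111000", 2       # two words at distance 4 = 2t
z = midpoint(x, y, t)
print(z, hamming(x, z), hamming(y, z))  # 1100000 2 2
```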

• Question 3 (10): True or false with justification: If X and Y are two discrete random variables, the mutual information I(X,Y) can never be greater than the joint entropy H(X,Y).

TRUE. H(X,Y) must be at least as great as H(X), because the joint outcome (X,Y) determines X and so cannot be less uncertain than X alone. (More formally, the expected value of log (1/Pr(x,y)) must be at least as great as the expected value of log (1/Pr(x)) because Pr(x,y) can never be greater than Pr(x).) Similarly, H(X,Y) must be greater than or equal to H(Y). Adding these two inequalities, we get that 2H(X,Y) ≥ H(X) + H(Y), and subtracting H(X,Y) from both sides gives us H(X,Y) ≥ H(X) + H(Y) - H(X,Y) = I(X,Y).

Alternatively, after noting that H(X,Y) ≥ H(X) we can recall that we proved H(X) ≥ I(X,Y) using Jensen's Inequality, and our desired inequality follows by transitivity.
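The inequality can also be spot-checked numerically. This sketch draws many random joint distributions on a 3-by-3 outcome grid and verifies I(X,Y) ≤ H(X,Y) for each (a check, not a proof).

```python
import random
from math import log2

def H(probs):
    """Entropy in bits of a probability distribution given as a sequence."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

random.seed(0)
for _ in range(1000):
    # Random joint distribution for X and Y each taking 3 values.
    w = [[random.random() for _ in range(3)] for _ in range(3)]
    total = sum(map(sum, w))
    pxy = [w[i][j] / total for i in range(3) for j in range(3)]
    px = [sum(w[i]) / total for i in range(3)]
    py = [sum(row[j] for row in w) / total for j in range(3)]
    info = H(px) + H(py) - H(pxy)
    assert info <= H(pxy) + 1e-9
print("I(X,Y) <= H(X,Y) held in every trial")
```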

• Question 4 (10): True or false with justification: Because at least four bits are required in the worst case to send one digit from {0,1,...,9}, at least 4n bits are required in the worst case to send n digits from that alphabet.

FALSE. By sending digits in blocks we can save bits over sending each digit separately. For example, we could use two-digit blocks and send each block with seven bits because 100 < 128 = 2^7. This allows us to send n digits in 7n/2 < 4n bits. Similarly, we could send blocks of three digits using ten bits each (since 1000 < 1024 = 2^10), taking 10n/3 bits to send n digits. By using larger and larger blocks, we could get arbitrarily close to (log 10)n = 3.32n bits, since a source where each digit is equally likely has an entropy of log 10 per digit.
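The block-size arithmetic above can be tabulated. This sketch computes the fixed-length bits per digit, ceil(log2(10^k))/k, for a few block sizes k; as k grows the cost approaches log2(10) ≈ 3.32 bits per digit.

```python
from math import ceil, log2

def bits_per_digit(k):
    """Fixed-length bits per decimal digit when coding k digits per block."""
    return ceil(log2(10 ** k)) / k

for k in (1, 2, 3, 100):
    print(k, bits_per_digit(k))
# 1 -> 4.0, 2 -> 3.5, 3 -> 3.33..., 100 -> 3.33; the limit is log2(10)
```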

• Question 5 (30): Suppose we have a set of DNA sequences, strings over the alphabet {A,C,G,T}, and that these come from a distribution where each letter is chosen independently and for each letter, the probability of A is 3/8, the probability of C is 1/8, the probability of G is 1/8, and the probability of T is 3/8.

• (a,5) If I have a sequence of n letters, how many bits do I need to specify it using a fixed-length binary code?

We just pick a two-bit sequence for each of the four letters, and send the n letters in 2n bits.

• (b,10) Using Huffman's algorithm, design a variable-length binary code that has the minimum possible average length for this distribution of letters. What is the expected number of bits you need to send an n-letter sequence using this code?

We have four letters of weights 3/8, 3/8, 1/8, and 1/8. We first combine C and G into a group of total weight 2/8. Then we combine this group with, say, T to get a group of total weight 5/8. Finally we combine this group with A to get a group of total weight 1. The eventual code might have A = 0, C = 100, G = 101, and T = 11 (other specific codes are possible). The average length of a code word, and hence the expected number of bits needed to send a letter, is 1*(3/8) + 2*(3/8) + 3*(2/8) = 15/8 = 1.875.
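A straightforward Huffman construction (a sketch, not required for the exam) confirms the 15/8 average. Ties in weight may be broken differently than in the solution above, giving a different but equally good code.

```python
import heapq
from fractions import Fraction

def huffman_code(freqs):
    """Build a Huffman code for freqs, a dict mapping symbol -> weight."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)  # unique tiebreaker so the dicts are never compared
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

freqs = {"A": Fraction(3, 8), "T": Fraction(3, 8),
         "C": Fraction(1, 8), "G": Fraction(1, 8)}
code = huffman_code(freqs)
average = sum(freqs[s] * len(code[s]) for s in freqs)
print(code, average)  # average length is 15/8 = 1.875
```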

• (c,10) Compute the entropy of this distribution. Get a numerical answer accurate to within 0.2 at worst. You may estimate log 3 as 1.6, which allows you to compute other base-two logs. (For example, log 12 = log 4 + log 3 = 3.6, and log 4/3 = log 4 - log 3 = 0.4.)

The entropy, by the definition, is (3/8)log(8/3) + (3/8)log(8/3) + (1/8)log(8) + (1/8)log(8) = (3/4)(3 - log 3) + 3/4 = (3/4)(4 - 1.585) = 1.811. (The given estimate of 1.6 for log(3) yields an answer of 1.8.)
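Numerically, the arithmetic checks out as follows:

```python
from math import log2

# Letter probabilities from the problem statement.
probs = {"A": 3 / 8, "T": 3 / 8, "C": 1 / 8, "G": 1 / 8}
entropy = sum(p * log2(1 / p) for p in probs.values())
print(entropy)  # about 1.811 bits per letter
```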

• (d,5) Suppose we sent an n-letter sequence by grouping the letters into k-letter blocks, designing a Huffman variable-length code for k-letter blocks, and using that code. As n and k increase, how many bits do we need to send an n-letter sequence?

As n and k increase, the average number of bits needed approaches the entropy from above. Thus the total number of bits needed approaches 1.811n from above.

• Question 6 (30): This problem concerns several languages (sets of strings) over the alphabet {0,1}, given by regular expressions. Let R be the regular expression 0* + 1*, S be the regular expression 0*1*, and T be the regular expression (00+11)*.

• (a,10) How many binary strings have length 4? How many of these are in the languages of each of the three regular expressions? List the strings (of length 4) in each of these languages.

There are 2^4 = 16 total strings of length 4. Two of them (0000 and 1111) are in L(R). Five (0000, 0001, 0011, 0111, and 1111) are in L(S). Four (0000, 0011, 1100, and 1111) are in L(T).
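These lists can be verified by brute force. The sketch below enumerates all 16 strings and filters them with Python's re module (re.fullmatch requires the entire string to match the pattern).

```python
import re
from itertools import product

R, S, T = r"0*|1*", r"0*1*", r"(00|11)*"

strings = ["".join(bits) for bits in product("01", repeat=4)]
in_R = [s for s in strings if re.fullmatch(R, s)]
in_S = [s for s in strings if re.fullmatch(S, s)]
in_T = [s for s in strings if re.fullmatch(T, s)]
print(len(strings), len(in_R), len(in_S), len(in_T))  # 16 2 5 4
```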

• (b,10) How many strings of length n are in each of these languages, as a function of the positive integer n? (In one case there are separate answers for odd n and for even n.) In each of the three cases, suppose you had a string of length n in the language and you needed to tell someone which string it was. How many bits would you need, assuming that the recipient knows the language, knows n, and has agreed on a coding method with you?

For any positive n there are exactly two length-n strings in L(R), 0^n and 1^n.

There are n+1 length-n strings in L(S), because the number of 0's can be any integer from 0 through n.

Finally there are 2^(n/2) length-n strings in L(T), because such a string consists of n/2 substrings, each of the form 00 or 11.

We thus need one bit to specify a string in L(R), log(n+1) bits (rounded up) to specify a string in L(S), and n/2 bits (rounded up) to specify a string in L(T).
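The closed-form counts (2, n+1, and 2^(n/2) for even n) can be checked by exhaustive enumeration for small n:

```python
import re
from itertools import product

def count(regex, n):
    """Number of length-n binary strings matching the regex exactly."""
    return sum(1 for bits in product("01", repeat=n)
               if re.fullmatch(regex, "".join(bits)))

for n in range(1, 9):
    assert count(r"0*|1*", n) == 2
    assert count(r"0*1*", n) == n + 1
    assert count(r"(00|11)*", n) == (2 ** (n // 2) if n % 2 == 0 else 0)
print("counts match for n = 1..8")
```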

• (c,10) What does it mean for a set of strings (all of the same length) to be a linear code? (This is also called being a subspace in the text.) For a fixed positive n, consider the sets of length-n strings in each of the three languages. Which of these sets of strings are linear codes, if any?

A linear code or subspace is a nonempty set of strings such that for any two strings x and y in the set, the string x+y (the bitwise XOR of x and y) is also in the set.

The n-length strings of L(R) form a code because adding two equal strings gives 0^n, and adding two unequal strings gives 1^n.

The n-length strings of L(S) do not form a code for n ≥ 2, because the sum of 01^(n-1) and 1^n is 10^(n-1), which is not in L(S). (For n=1 the L(S) strings are equal to the L(R) strings and thus form a code.)

The n-length strings of L(T) do not form a code for odd n because the set is empty. But for even n they do form a code. Let x and y be two strings in the set. Then each of the n/2 two-letter segments of x+y is the sum of two elements from {00,11} and is thus also in this set -- hence x+y is in L(T).
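The three closure claims can also be checked by brute force for small n. This sketch XORs every pair of length-n strings in each language and tests whether the result stays in the language.

```python
import re
from itertools import product

def xor(a, b):
    """Bitwise XOR of two equal-length bit strings."""
    return "".join("1" if c != d else "0" for c, d in zip(a, b))

def is_linear(regex, n):
    """True if the length-n strings of the language form a linear code."""
    words = [s for s in ("".join(bits) for bits in product("01", repeat=n))
             if re.fullmatch(regex, s)]
    return bool(words) and all(xor(a, b) in words
                               for a in words for b in words)

n = 4
print(is_linear(r"0*|1*", n), is_linear(r"0*1*", n), is_linear(r"(00|11)*", n))
# True False True
```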