Q1: 10 points Q2: 10 points Q3: 10 points Q4: 10 points Q5: 10 points Q6: 45 points Q7: 30 points Total: 125 points
Question text is in black, solutions in blue.
FALSE. Two choices of noodles, five choices of protein, 2^13 sets of vegetables, by the Product Rule we have 2 times 5 times 2^13 = 81920 total choices, many fewer than a million. (Without computing the exact number you can see that 10 times 2^13 is much smaller than a million, which is about 2^20.)
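The Product Rule count and the rough estimate can both be checked directly; a quick sketch using the numbers from the problem:

```python
# Product Rule: 2 noodles x 5 proteins x 2^13 subsets of vegetables
total = 2 * 5 * 2**13
print(total)           # 81920
print(total < 10**6)   # True; note 2^20 = 1048576, roughly a million
```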
TRUE. Let M be the event that the word "money" occurs and let S be the event that the email is spam. If M occurs the filter multiplies the odds by L(M, S) = Pr(M | S) / Pr(M | ¬S) = 0.06/0.02 = 3. If M does not occur it multiplies the odds by L(¬M, S) = Pr(¬M | S) / Pr(¬M | ¬S) = 0.94/0.98, slightly less than 1.
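The two likelihood ratios are easy to verify numerically; a minimal check with the figures from the problem:

```python
# Odds update in a Bayesian spam filter
pr_m_spam, pr_m_ham = 0.06, 0.02   # Pr(M | S) and Pr(M | not S)

L_present = pr_m_spam / pr_m_ham               # word occurs
L_absent = (1 - pr_m_spam) / (1 - pr_m_ham)    # word does not occur

print(L_present)            # 3.0
print(round(L_absent, 3))   # 0.959, slightly less than 1
```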
TRUE. Each pure strategy b for Player B has an expected reward for B, which we will call R_b, given by the sum over all A-options a of Pr(A does a)(- Payoff(a,b)). (The reward to B is -1 times the reward to A.) If B were to apply a mixed strategy, her reward would be a weighted average of the numbers R_b, with the weights given by her probability of taking each option. But no such weighted average can be larger than the largest of the R_b's, so B does at least as well by taking the pure strategy b that has the largest R_b.
FALSE. Pr(E) = Pr(E_1 ∪ E_2 ∪ E_3), where E_i is the event that the i'th letters match. Each Pr(E_i) is equal to 1/26, and the events are independent, so that two of them happen with probability 1/26^2 and all three happen with probability 1/26^3. By inclusion/exclusion, Pr(E) is thus 1/26 + 1/26 + 1/26 - 1/26^2 - 1/26^2 - 1/26^2 + 1/26^3, which is strictly less than 3/26.
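The inclusion/exclusion total can be cross-checked against the complement Pr(no position matches) = (25/26)^3, which must give the same answer; a quick sketch:

```python
# Probability that at least one of three independent letter positions matches
p = 1 / 26
incl_excl = 3*p - 3*p**2 + p**3        # inclusion/exclusion
via_complement = 1 - (25/26)**3        # 1 - Pr(no match)

print(abs(incl_excl - via_complement) < 1e-12)  # True, the two agree
print(incl_excl < 3/26)                         # True, strictly below 3/26
```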
FALSE. Let G be the event that the first word has three different letters. Pr(F | G) = 1 - (23/26)^3, because for F not to happen each of the three letters in the second word must fail to be one of the three letters in the first word. But Pr(F | ¬G) is smaller than that, either 1 - (24/26)^3 if the first word has two different letters or 1 - (25/26)^3 if it has only one letter occurring. By the Law of Total Probability, Pr(F) is a weighted average of Pr(F | G) and Pr(F | ¬G), and so is strictly larger than 1 - (23/26)^3. (All we need to know about Pr(G) to solve the problem is that it is less than one.)
She wants to know whether this sequence of behaviors can be well modeled by a Markov chain. Examining her data, she finds that when b(t) = A, b(t+1) = A 20% of the time and b(t+1) = Q 80% of the time. When b(t) = Q, b(t+1) = A 20% of the time, b(t+1) = Q 20% of the time, and b(t+1) = S 60% of the time. Finally, when b(t) = S, she finds that b(t+1) = Q 20% of the time and b(t+1) = S 80% of the time.
The diagram has three nodes marked A, Q, and S, and arrows from A to itself
labeled 0.2, A to Q labeled 0.8, Q to A labeled 0.2, Q to itself labeled 0.2,
Q to S labeled 0.6, S to Q labeled 0.2, and S to S labeled 0.8. The matrix
M looks like this:
0.2 0.8 0.0
0.2 0.2 0.6
0.0 0.2 0.8
If we let v = (a, q, s) be the steady-state probability row vector, then the
identity vM = v gives us three equations in the three unknowns a, q, and s:
0.2a + 0.2q = a, 0.8a + 0.2q + 0.2s = q, and 0.6q + 0.8s = s. The first
equation implies 0.2q = 0.8a and thus q = 4a. The second then gives
0.2q + 0.2q + 0.2s = q and thus 0.2s = 0.6q or s = 3q = 12a. For the three
probabilities to add up to one we need a = 1/17, q = 4/17, and s = 12/17.
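The steady-state vector found above can be verified exactly with rational arithmetic; a minimal sketch checking that vM = v and that the entries sum to one:

```python
from fractions import Fraction as F

# Transition matrix for the cat's behaviors, rows ordered A, Q, S
M = [[F(1, 5), F(4, 5), F(0)],
     [F(1, 5), F(1, 5), F(3, 5)],
     [F(0),    F(1, 5), F(4, 5)]]

# Claimed steady-state row vector (a, q, s)
v = [F(1, 17), F(4, 17), F(12, 17)]

vM = [sum(v[i] * M[i][j] for i in range(3)) for j in range(3)]
print(vM == v)      # True: v is fixed by the chain
print(sum(v) == 1)  # True: the entries are probabilities
```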
We compute the top row of the matrix M^2 and find that the chance of A is (0.2)(0.2) + (0.8)(0.2) = 0.20, the chance of Q is (0.2)(0.8) + (0.8)(0.2) = 0.32, and the chance of S is (0.2)(0) + (0.8)(0.6) = 0.48.
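The two-step distribution is just the top row of M multiplied back into M; a short check of the arithmetic:

```python
# Transition matrix, rows ordered A, Q, S
M = [[0.2, 0.8, 0.0],
     [0.2, 0.2, 0.6],
     [0.0, 0.2, 0.8]]

# Top row of M^2: distribution two steps after starting in state A
row = [sum(M[0][k] * M[k][j] for k in range(3)) for j in range(3)]
print([round(x, 2) for x in row])  # [0.2, 0.32, 0.48]
```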
If the Markov model is correct, we would expect the number of A's to be given by a binomial distribution with n = 100 and p = 0.20, so the expected number would be np = 20. The variance is np(1-p) = 100(0.2)(0.8) = 16, so the standard deviation is the square root of 16, namely 4. The observed count of 4 A's is four standard deviations below the expected number. There is about a 95% chance that a normal random variable will be within two standard deviations of its mean, so we REJECT the hypothesis at the 95% confidence level.
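The binomial mean, standard deviation, and z-score used in this test can be sketched as:

```python
# Binomial model for the number of A's in 100 observations
n, p = 100, 0.20
mean = n * p                 # expected number of A's
sd = (n * p * (1 - p))**0.5  # standard deviation
z = (4 - mean) / sd          # observed count of 4 A's

print(mean, sd, z)  # 20.0 4.0 -4.0, i.e. four standard deviations below
```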
In the Markov chain, 0.16 of the probability of an A at time t+2 comes from
the path A --> Q --> A and the other 0.04 comes from A --> A --> A. If the
latter is happening at the rate implied by independence, the former isn't
happening at all. Perhaps the cat is less likely to go from quiet to active
if he has just been active, whereas the Markov Hypothesis would say that his
chance of becoming active would depend only on the state.
We might try refining the model by dividing Q into three states, "Q preceded
by A", "Q preceded by Q", and "Q preceded by S", and reinterpreting the data
as a five-state Markov process. But if we need still more information to
know the probabilities of each behavior, then the Markov Hypothesis does not
hold.
For every pair of states i and j, we would need to know the probability Pr_{C,i,j} = Pr(B(t+1) = i | B(t) = j and catnip is given at time t) and Pr_{N,i,j} = Pr(B(t+1) = i | B(t) = j and no catnip at time t). Her prior data give her estimates of the Pr_{N,i,j}'s and she would need to repeat her observations with catnip to get estimates of the Pr_{C,i,j}'s.
We could define a reward function of 1 for state A and 0 for states Q and S.
There are eight possible policies because there are three independent choices --
do you give catnip in state A, do you give it in state Q, and do you give it
in state S? For each of these eight policies, we can find the steady-state
distribution for the resulting Markov chain as we did in part (c). Then for
each policy we can determine the expected reward per term in the steady state,
which is the probability that the cat is active in the steady state, and choose
the policy that makes this largest.
Many of you wanted to decide the choice of action for each state
independently, based on which choice would make A more likely on the next term.
Such a greedy strategy is not guaranteed to work in general.
Block 000 has probability (0.8)^3 = 0.512. Blocks 001, 010, and 100 each have probability (0.8)^2(0.2) = 0.128. Blocks 011, 101, and 110 each have probability (0.8)(0.2)^2 = 0.032. Finally, block 111 has probability (0.2)^3 = 0.008. These probabilities add to 0.512 + 3(0.128 + 0.032) + 0.008 = 1.
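The eight block probabilities can be generated and checked in a few lines; a sketch with the source's bit probabilities (0 with probability 0.8, 1 with probability 0.2):

```python
from itertools import product

p1 = 0.2  # probability of a 1; a 0 has probability 0.8
probs = {}
for bits in product('01', repeat=3):
    block = ''.join(bits)
    ones = block.count('1')
    probs[block] = p1**ones * (1 - p1)**(3 - ones)

print(round(probs['000'], 3))              # 0.512
print(round(probs['001'], 3))              # 0.128
print(round(probs['011'], 3))              # 0.032
print(abs(sum(probs.values()) - 1) < 1e-12)  # True: they sum to 1
```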
We begin with eight nodes with the eight given probabilities. There are several
equivalent ways to proceed because several of the probabilities are equal, but
here is one way to do it. (All give codes with the same answer for part (c).)
Merge 111 and 110 to get node A, with weight 0.040.
Merge 101 and 011 to get node B, with weight 0.064.
Merge A and B to get node C, with weight 0.104.
Merge 001 and C to get node D, with weight 0.232.
Merge 010 and 100 to get node E, with weight 0.256.
Merge D and E to get node F, with weight 0.488.
Merge 000 and F to get node G, with weight 1.
Our tree has root G, with children 000 and F. F has children D and E,
D has 001 and C, E has 010 and 100, C has A and B, A has 111 and 110, and
B has 101 and 011. The code word for 000 is 0, for 001 is 100, for 010 is
110, for 011 is 10111, for 100 is 111, for 101 is 10110, for 110 is 10101,
and for 111 is 10100.
The probability of a one-bit code word is 0.512, of a three-bit code word is 3(0.128) = 0.384, and of a five-bit code word is 3(0.032) + 0.008 = 0.104. So the expected length of the code word is 1(0.512) + 3(0.384) + 5(0.104) = 0.512 + 1.152 + 0.520 = 2.184.
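The whole construction can be reproduced with a small Huffman routine. Tie-breaking among equal weights may yield different codewords than the merge order shown above, but any Huffman code for these probabilities has the same expected length; a sketch:

```python
import heapq

probs = {'000': 0.512, '001': 0.128, '010': 0.128, '100': 0.128,
         '011': 0.032, '101': 0.032, '110': 0.032, '111': 0.008}

# Heap entries: (weight, tiebreak id, {symbol: codeword-so-far});
# the id keeps tuple comparison from ever reaching the dict.
heap = [(w, i, {s: ''}) for i, (s, w) in enumerate(probs.items())]
heapq.heapify(heap)
next_id = len(heap)
while len(heap) > 1:
    w1, _, c1 = heapq.heappop(heap)
    w2, _, c2 = heapq.heappop(heap)
    merged = {s: '0' + code for s, code in c1.items()}
    merged.update({s: '1' + code for s, code in c2.items()})
    heapq.heappush(heap, (w1 + w2, next_id, merged))
    next_id += 1

codes = heap[0][2]
expected = sum(probs[s] * len(codes[s]) for s in probs)
print(round(expected, 3))  # 2.184, matching the hand computation
```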
The coding scheme uses shorter strings to represent more common source strings, i.e., those with more zeros. A source string of length 3n will be encoded by a string of length ranging from n (if it is all 0's) to 5n (if it is all 1's). If all strings were equally likely, this coding scheme would use an average length of 1(0.125) + 3(0.375) + 5(0.500) = 3.750 to send a three-bit block. But because the source gives strings with more zeros with much higher probability, the code's better performance on these strings more than makes up for its poorer performance on the rarer strings.
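The uniform-source average quoted above follows from the codeword lengths alone (one 1-bit word, three 3-bit words, four 5-bit words); a one-line check:

```python
# Expected code length if all 8 blocks were equally likely (prob 1/8 each)
uniform_avg = (1*1 + 3*3 + 5*4) / 8
print(uniform_avg)  # 3.75, versus 2.184 for the actual source
```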
Last modified 3 January 2010