Q1: 10 points Q2: 10 points Q3: 10 points Q4: 10 points Q5: 10 points Q6: 45 points Q7: 30 points Total: 125 points
Question text is in black, solutions in blue.
FALSE. Two choices of noodles, five choices of protein, 2^13 sets of vegetables, by the Product Rule we have 2 times 5 times 2^13 = 81920 total choices, many fewer than a million. (Without computing the exact number you can see that 10 times 2^13 is much smaller than a million, which is about 2^20.)
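The Product Rule count and the rough estimate can both be checked directly; a quick sketch using the numbers from the problem:

```python
# Product Rule: 2 noodles x 5 proteins x 2^13 subsets of vegetables
total = 2 * 5 * 2**13
print(total)           # 81920
print(total < 10**6)   # True; note 2^20 = 1048576, roughly a million
```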
TRUE. Let M be the event that the word "money" occurs and let S be the event that the email is spam. If M occurs the filter multiplies the odds by L(M, S) = Pr(M | S) / Pr(M | ¬S) = 0.06/0.02 = 3. If M does not occur it multiplies the odds by L(¬M, S) = Pr(¬M | S) / Pr(¬M | ¬S) = 0.94/0.98, slightly less than 1.
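The two likelihood ratios are easy to verify numerically; a minimal check with the figures from the problem:

```python
# Odds update in a Bayesian spam filter
pr_m_spam, pr_m_ham = 0.06, 0.02   # Pr(M | S) and Pr(M | not S)

L_present = pr_m_spam / pr_m_ham               # word occurs
L_absent = (1 - pr_m_spam) / (1 - pr_m_ham)    # word does not occur

print(L_present)            # 3.0
print(round(L_absent, 3))   # 0.959, slightly less than 1
```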
TRUE. Each pure strategy b for Player B has an expected reward for B, which we will call R_b, given by the sum over all A-options a of Pr(A does a)(- Payoff(a,b)). (The reward to B is -1 times the reward to A.) If B were to apply a mixed strategy, her reward would be a weighted average of the numbers R_b, with the weights given by her probability of taking each option. But no such weighted average can be larger than the largest of the R_b's, so B does at least as well by taking the pure strategy b that has the largest R_b.
FALSE. Pr(E) = Pr(E_1 ∪ E_2 ∪ E_3), where E_i is the event that the i'th letters match. Each Pr(E_i) is equal to 1/26, and the events are independent, so that two of them happen with probability 1/26^2 and all three happen with probability 1/26^3. By inclusion/exclusion, Pr(E) is thus 1/26 + 1/26 + 1/26 - 1/26^2 - 1/26^2 - 1/26^2 + 1/26^3, which is strictly less than 3/26.
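The inclusion/exclusion total can be cross-checked against the complement Pr(no position matches) = (25/26)^3, which must give the same answer; a quick sketch:

```python
# Probability that at least one of three independent letter positions matches
p = 1 / 26
incl_excl = 3*p - 3*p**2 + p**3        # inclusion/exclusion
via_complement = 1 - (25/26)**3        # 1 - Pr(no match)

print(abs(incl_excl - via_complement) < 1e-12)  # True, the two agree
print(incl_excl < 3/26)                         # True, strictly below 3/26
```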
FALSE. Let G be the event that the first word has three different letters. Pr(F | G) = 1 - (23/26)^3, because for F not to happen each of the three letters in the second word must fail to be one of the three letters in the first word. But Pr(F | ¬G) is smaller than that, either 1 - (24/26)^3 if the first word has two different letters or 1 - (25/26)^3 if it has only one letter occurring. By the Law of Total Probability, Pr(F) is a weighted average of Pr(F | G) and Pr(F | ¬G), and so is strictly larger than 1 - (23/26)^3. (All we need to know about Pr(G) to solve the problem is that it is less than one.)
She wants to know whether this sequence of behaviors can be well modeled by a Markov chain. Examining her data, she finds that when b(t) = A, b(t+1) = A 20% of the time and b(t+1) = Q 80% of the time. When b(t) = Q, b(t+1) = A 20% of the time, b(t+1) = Q 20% of the time, and b(t+1) = S 60% of the time. Finally, when b(t) = S, she finds that b(t+1) = Q 20% of the time and b(t+1) = S 80% of the time.
The diagram has three nodes marked A, Q, and S, and arrows from A to itself
labeled 0.2, A to Q labeled 0.8, Q to A labeled 0.2, Q to itself labeled 0.2,
Q to S labeled 0.6, S to Q labeled 0.2, and S to S labeled 0.8. The matrix
M looks like this:
0.2 0.8 0.0
0.2 0.2 0.6
0.0 0.2 0.8
If we let v = (a, q, s) be the steady-state probability row vector, then the
identity vM = v gives us three equations in the three unknowns a, q, and s:
0.2a + 0.2q = a, 0.8a + 0.2q + 0.2s = q, and 0.6q + 0.8s = s. The first
equation implies 0.2q = 0.8a and thus q = 4a. The second then gives
0.2q + 0.2q + 0.2s = q and thus 0.2s = 0.6q or s = 3q = 12a. For the three
probabilities to add up to one we need a = 1/17, q = 4/17, and s = 12/17.
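The steady-state vector found above can be verified exactly with rational arithmetic; a minimal sketch checking that vM = v and that the entries sum to one:

```python
from fractions import Fraction as F

# Transition matrix for the cat's behaviors, rows ordered A, Q, S
M = [[F(1, 5), F(4, 5), F(0)],
     [F(1, 5), F(1, 5), F(3, 5)],
     [F(0),    F(1, 5), F(4, 5)]]

# Claimed steady-state row vector (a, q, s)
v = [F(1, 17), F(4, 17), F(12, 17)]

vM = [sum(v[i] * M[i][j] for i in range(3)) for j in range(3)]
print(vM == v)      # True: v is fixed by the chain
print(sum(v) == 1)  # True: the entries are probabilities
```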
We compute the top row of the matrix M^2 and find that the chance of A is (0.2)(0.2) + (0.8)(0.2) = 0.20, the chance of Q is (0.2)(0.8) + (0.8)(0.2) = 0.32, and the chance of S is (0.2)(0) + (0.8)(0.6) = 0.48.
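The two-step distribution is just the top row of M multiplied back into M; a short check of the arithmetic:

```python
# Transition matrix, rows ordered A, Q, S
M = [[0.2, 0.8, 0.0],
     [0.2, 0.2, 0.6],
     [0.0, 0.2, 0.8]]

# Top row of M^2: distribution two steps after starting in state A
row = [sum(M[0][k] * M[k][j] for k in range(3)) for j in range(3)]
print([round(x, 2) for x in row])  # [0.2, 0.32, 0.48]
```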
If the Markov model is correct, we would expect the number of A's to be given by a binomial distribution with n = 100 and p = 0.20, so the expected number would be np = 20. The variance is np(1-p) = 100(0.2)(0.8) = 16, so the standard deviation is the square root of 16, namely 4. The observed count of 4 A's is four standard deviations below the expected number. There is about a 95% chance that a normal random variable will be within two standard deviations of its mean, so we REJECT the hypothesis at the 95% confidence level.
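The binomial mean, standard deviation, and z-score used in this test can be sketched as:

```python
# Binomial model for the number of A's in 100 observations
n, p = 100, 0.20
mean = n * p                 # expected number of A's
sd = (n * p * (1 - p))**0.5  # standard deviation
z = (4 - mean) / sd          # observed count of 4 A's

print(mean, sd, z)  # 20.0 4.0 -4.0, i.e. four standard deviations below
```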
In the Markov chain, 0.16 of the probability of an A at time t+2 comes from
the path A --> Q --> A and the other 0.04 comes from A --> A --> A. If the
latter is happening at the rate implied by independence, the former isn't
happening at all. Perhaps the cat is less likely to go from quiet to active
if he has just been active, whereas the Markov Hypothesis would say that his
chance of becoming active would depend only on the state.
We might try refining the model by dividing Q into three states, "Q preceded
by A", "Q preceded by Q", and "Q preceded by S", and reinterpreting the data
as a five-state Markov process. But if we need still more information to
know the probabilities of each behavior, then the Markov Hypothesis does not
hold.
For every pair of states i and j, we would need to know the probability Pr_{C,i,j} = Pr(B(t+1) = i | B(t) = j and catnip is given at time t) and Pr_{N,i,j} = Pr(B(t+1) = i | B(t) = j and no catnip at time t). Her prior data give her estimates of the Pr_{N,i,j}'s and she would need to repeat her observations with catnip to get estimates of the Pr_{C,i,j}'s.
We could define a reward function of 1 for state A and 0 for states Q and S.
There are eight possible policies because there are three independent choices --
do you give catnip in state A, do you give it in state Q, and do you give it
in state S? For each of these eight policies, we can find the steady-state
distribution for the resulting Markov chain as we did in part (c). Then for
each policy we can determine the expected reward per term in the steady state,
which is the probability that the cat is active in the steady state, and choose
the policy that makes this largest.
Many of you wanted to decide the choice of action for each state
independently, based on which choice would make A more likely on the next term.
Such a greedy strategy is not guaranteed to work in general.
Block 000 has probability (0.8)^3 = 0.512. Blocks 001, 010, and 100 each have probability (0.8)^2(0.2) = 0.128. Blocks 011, 101, and 110 each have probability (0.8)(0.2)^2 = 0.032. Finally, block 111 has probability (0.2)^3 = 0.008. These probabilities add to 0.512 + 3(0.128 + 0.032) + 0.008 = 1.
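The eight block probabilities can be generated and checked in a few lines; a sketch with the source's bit probabilities (0 with probability 0.8, 1 with probability 0.2):

```python
from itertools import product

p1 = 0.2  # probability of a 1; a 0 has probability 0.8
probs = {}
for bits in product('01', repeat=3):
    block = ''.join(bits)
    ones = block.count('1')
    probs[block] = p1**ones * (1 - p1)**(3 - ones)

print(round(probs['000'], 3))              # 0.512
print(round(probs['001'], 3))              # 0.128
print(round(probs['011'], 3))              # 0.032
print(abs(sum(probs.values()) - 1) < 1e-12)  # True: they sum to 1
```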
We begin with eight nodes with the eight given probabilities. There are several
equivalent ways to proceed because several of the probabilities are equal, but
here is one way to do it. (All give codes with the same answer for part (c).)
Merge 111 and 110 to get node A, with weight 0.040.
Merge 101 and 011 to get node B, with weight 0.064.
Merge A and B to get node C, with weight 0.104.
Merge 001 and C to get node D, with weight 0.232.
Merge 010 and 100 to get node E, with weight 0.256.
Merge D and E to get node F, with weight 0.488.
Merge 000 and F to get node G, with weight 1.
Our tree has root G, with children 000 and F. F has children D and E,
D has 001 and C, E has 010 and 100, C has A and B, A has 111 and 110, and
B has 101 and 011. The code word for 000 is 0, for 001 is 100, for 010 is
110, for 011 is 10111, for 100 is 111, for 101 is 10110, for 110 is 10101,
and for 111 is 10100.
The probability of a one-bit code word is 0.512, of a three-bit code word is 3(0.128) = 0.384, and of a five-bit code word is 3(0.032) + 0.008 = 0.104. So the expected length of the code word is 1(0.512) + 3(0.384) + 5(0.104) = 0.512 + 1.152 + 0.520 = 2.184.
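The whole construction can be reproduced with a small Huffman routine. Tie-breaking among equal weights may yield different codewords than the merge order shown above, but any Huffman code for these probabilities has the same expected length; a sketch:

```python
import heapq

probs = {'000': 0.512, '001': 0.128, '010': 0.128, '100': 0.128,
         '011': 0.032, '101': 0.032, '110': 0.032, '111': 0.008}

# Heap entries: (weight, tiebreak id, {symbol: codeword-so-far});
# the id keeps tuple comparison from ever reaching the dict.
heap = [(w, i, {s: ''}) for i, (s, w) in enumerate(probs.items())]
heapq.heapify(heap)
next_id = len(heap)
while len(heap) > 1:
    w1, _, c1 = heapq.heappop(heap)
    w2, _, c2 = heapq.heappop(heap)
    merged = {s: '0' + code for s, code in c1.items()}
    merged.update({s: '1' + code for s, code in c2.items()})
    heapq.heappush(heap, (w1 + w2, next_id, merged))
    next_id += 1

codes = heap[0][2]
expected = sum(probs[s] * len(codes[s]) for s in probs)
print(round(expected, 3))  # 2.184, matching the hand computation
```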
The coding scheme uses shorter strings to represent more common source strings, i.e., those with more zeros. A source string of length 3n will be encoded by a string of length ranging from n (if it is all 0's) to 5n (if it is all 1's). If all strings were equally likely, this coding scheme would use an average length of 1(0.125) + 3(0.375) + 5(0.500) = 3.750 to send a three-bit block. But because the source gives strings with more zeros with much higher probability, the code's better performance on these strings more than makes up for its poorer performance on the rarer strings.
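The uniform-source average quoted above follows from the codeword lengths alone (one 1-bit word, three 3-bit words, four 5-bit words); a one-line check:

```python
# Expected code length if all 8 blocks were equally likely (prob 1/8 each)
uniform_avg = (1*1 + 3*3 + 5*4) / 8
print(uniform_avg)  # 3.75, versus 2.184 for the actual source
```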
Last modified 3 January 2010