% Kerry-Dean poll results, 7/26/03
% NH Poll results from 7/26/03 Boston Globe AP article:
% Kerry (K): 25
% Dean (D): 19
% Gephardt (G): 10
% Lieberman (L): 6
% Graham (R): 2
% Edwards (E): 2
% Wesley Clark (C): 2
% Braun (B): 1
% Kucinich (U): 1
% Sharpton (S): 1
% Biden (B): 1
% Undecided (N): 30
%
% The key elements of this m.file are excerpted in pollsig.m
% The following call to pollsig.m, reproduces the key significance tests using Monte Carlo simulation:
% [ME,D,Dprob,halfCI]=pollsig(600,[.25 .19])
% This m.file produces a lot of output. I like to save my output in a file, and the Matlab
% diary function is the only way that I know to do that readily:
% The diary command saves everything that appears on the command window after diary is turned on,
% but not the graphics. The text is saved in memory until the diary off command is entered
% All new output is added to the old unless the diary file is deleted between use, which is what we'll do first:
if exist('KerryDeanpoll.txt')==2; % checks whether the file KerryDeanpoll.txt exists on the Matlab search path
delete KerryDeanpoll.txt % deletes the KerryDeanpoll.txt from the search path.
end
diary KerryDeanpoll.txt % this turns on the diary and stores the results in KerryDeanpoll.txt
disp('Analysis of the NH poll from the 7/26/03 Boston Globe')
NHPoll=[25 19 10 6 2 2 2 1 1 1 1 30]';
% sum(NHPoll) % uncomment this to assure that the proportions sum to 100;
% Calculate Margin of Error, p. 330, Larsen & Marx:
% Need to find the 97.5% cutoff of the cumulative normal distribution:
z=erfinv(0.95)*sqrt(2)
% Now I can never remember this silly Matlab call to the inverse error function.
% erfinv, soI always call this statement with something I can remember: zfind.m, which
% is one of my m.files. Since I'm usually looking for the value of z distribution
% to calculate 95% confidence intervals (alpha=0.05, two-tailed p), I wrote zfind
% to caclulate the z value that produces a two-tailed p value of 0.05, and that is
% 1.96. Looking at the standard normal distribution, exactly 2.5% of the distribution
% is greater than zfind(0.05) and 2.5% is less than -zfind(0.05).
z2=zfind(0.05)
% The free statbox toolbox calls erfinv using normq:
% Note that they are calculating the z value using the cumulative normal distribution
% normq.m finds the z value such that 97.5% of the standard normal distribution
% is less than that z value. Gordon Smythe's normq.m is a one liner: q = erfinv(2*p-1) * sqrt(2);
z3=normq(0.975)
% The free stixbox toolbox calls erfinv using qnorm, again using the cumulative
% normal distribution.
z4=qnorm(0.975)
% all produce the identical value of roughly 1.96
V=[z z2 z3 z4];
% calculate the differences among all of the elements of V, using Gallagher's pydist.m
disp('Differences among the 4 ways of finding z values, as a 4x4 matrix');
disp(pydist(V'));
% Now calculate the margin of error, using the equation on page 330:
Numberpolled=600;
pK=0.25; % Proportion favoring Kerry
d=1.96/(2*sqrt(Numberpolled));
fprintf('The margin of error for a poll of size %d is %5.3f\n',Numberpolled,d);
% To put the two numbers in a single output statement, you can use frpintf, which is Matlab's way of printing
% statements, inserting numbers at the appropriate spot:
fprintf('The margin of error for a poll of size %d is %d\n',Numberpolled,d)
% The %d indicates, insert the variable here in decimal format -- integers will be represented as a single digit.
% The \n indicates that Matlab should insert a linefeed, or move the cursor to the next line.
% We can clean up the presentation of the margin of error by only using 3 significant figures:
fprintf('The margin of error for a poll of size %d is %5.3f\n',Numberpolled,d)
% The %5.3f indicates that Matlab should insert the 2nd variable from the list in 5 columns, with 3 digits to
% the right of the decimal point.
% Poll results are usually presented as %'s. In the fprintf statement, a single % indicates 'insert a variable here'
% So to really print a % sign, you must put in %%
fprintf('The margin of error for a poll of size %d is %3.1f%%.\n',Numberpolled,d*100)
% Now, let's do a Monte Carlo simulation of a poll of size 600 (specified in Numberpolled)
% first, generate 600 random numbers uniformly distributed on the interval 0, 1
poll=rand(Numberpolled,1); % This generates 600 random numbers between 0 and 1
% The simulation that we will run will at first assume that the true proportions of
% Kerry, Dean, ... undecided voters are indicated by:
NHPoll=[25 19 10 6 2 2 2 1 1 1 1 30]';
% The output vector we need will have the number of Kerry, Dean, Lieberman voters:
% Let's start out doing this as sim ply as possible;
K=0; % K will be a variable indicating the number of Kerry voters
D=0; % D will be the number of Howard Dean voters.
% In a random draw of 600 NH voters, if Kerry's true proportion is 25% how many Kerry voters might we expect
% to find in one poll;
poll=rand(Numberpolled,1); % This generates 600 random numbers between 0 and 1
i=find(poll<=0.25); % This will find the location of number of random numbers less than 0.25
K=length(i);
% since (poll<=0.25) returns a 0,1 vector the same size as poll, and find just finds the location of the non-zero
% elements, we can combined those two expressions into this simpler expression:
K=sum(poll<=0.25); % sums the 1's indicating a random number in the poll vector less than or equal to 0.25
% While the expected value is 150, on my 1st 2 attempts, I got 142 and 145, and on the third attempt 156.
% if we were to do this 1000 times, what would the distribution of Kerry votes look like
% We need to create a vector to hold those 1000 results. We'll call it Kerrytally
% One of the goals of the program is to determine the probability that Kerry & Howard Dean are equal in the
% NH poll. To determine that probability, we'll keep a tally of Dean voters too.
Trials=1e4; % We'll do 10000 Monte Carlo simulations of a poll of size Numberpolled (=600)
fprintf('\nMonte Carlo simulation based on %d trials\n',Trials);
Kerrytally=zeros(Trials,1); % A column vector with 1000 rows & just 1 column
% The reason for setting this array in advance is that it vastly speeds up the Matlab subroutine below. Matlab
% now has created a matrix of the proper size and the appropriate elements can be inserted in the for loop below.
Deantally=zeros(Trials,1); % A column vector with 1000 rows & just 1 column, to hold the results of how many voters
% in simulated polls are selecting Howard Dean.
KerrytallyHo=zeros(Trials,1);
DeantallyHo=zeros(Trials,1);
% This simulation can take a long time for Trials>>1000
for i=1:Trials
poll=rand(Numberpolled,1); % Creates a vector with uniformly distributed random numbers on the interval 0,1
Kerrytally(i)=sum(poll<=0.25); % Finds the number of random numbers less than 0.25. These will be the Kerry voters
Deantally(i)=sum( (poll>0.25) & (poll <= (0.25 + 0.19))); % Because Dean had 19% of the vote and the first 25% of the
% mapping of the random number line has been assigned as Kerry
% voters, this statement assures that roughly 19% of the voters
% will be assigned to Dean.
% Simulation under the null hypothesis of equal favorability in NH
% Below, we'll test the null hypothesis that Dean & Kerry are equal in popularity in NH. It is time consuming
% for Matlab to run through these if loops many times, say when the number of trials is over 100,000. I'll add
% the section for tallying the expected Kerry & Dean results, IF the null hypothesis of equal probability is
% true:
ExpP=(0.25+0.19)/2; % This is the expected Dean & Kerry proportion, under the null hypothesis that
% that they are equal in popularity.
KerrytallyHo(i)=sum(poll<=ExpP);
DeantallyHo(i)=sum( (poll>ExpP) & (poll <= 2*ExpP));
end
% To label graphs, I'll use Matlab's sprintf, which is one of the best ways of inserting numbers into text strings:
hist(Kerrytally);xlabel(sprintf('Number for Kerry in a poll of size %d',Numberpolled))
ylabel('Number');title(sprintf('Results of %d Monte Carlo simulations',Trials))
figure(gcf);pause;
% prnmenu('kd'); % % prnmenu is my m.file for setting printing options.
% plot the same data on a percentage basis:
hist(Kerrytally/Numberpolled);
xlabel(sprintf('Percentage of individuals supporting Kerry in a poll of size %d',Numberpolled))
ylabel('Number');title(sprintf('Results of %d Monte Carlo simulations',Trials))
figure(gcf);pause
% prnmenu('kd');
% Now the margin of error was reported as +/- 4%. The margin of error for polls is defined as half of the 95%
% confidence interval, so one might expect that 95% of the observations would fall between 0.21 and 0.29
% What is the actual percentage that falls between 0.21 and 0.29? Since the variance of binomial probabilities
% is a maximum when p=0.5, the margin of error is based on p=0.5
Width95ci=2*zfind(0.05)*sqrt(0.5*(1-0.5)/Numberpolled);
halfwidth=Width95ci/2;
p=0.25;
NowinCIS=sum(Kerrytally/Numberpolled>=p-halfwidth & Kerrytally/Numberpolled <=p+halfwidth)/Trials;
fprintf('The percentage of Monte Carlo simulations within p=%5.3f +/- %5.3f is %5.3f\n',p,halfwidth,NowinCIS);
% In my simulation, I found that 98.3% of the polls had Kerry voters between 0.21 and 0.29. The poll is too broad.
% The real margin of error for Kerry's vote should be set not with the formula on p. 330, but using the formula
% for the standard deviation of a proportion of 0.25. This formula is on the bottom of page 329
pK=0.25;n=Numberpolled;
Width95ci=2*zfind(0.05)*sqrt((n*pK/n)*(1-n*pK/n)/n);
halfwidth=Width95ci/2;
fprintf('Half the width of the 95%% Confidence interval with p=%5.3f is %5.3f\n',pK,halfwidth);
% So, the margin of error is 3.46% for a proportion of 0.25
% find the percentage of Monte Carlo simulations that fall within
NowinCIS2=sum(Kerrytally/Numberpolled>=(0.25-halfwidth) & Kerrytally/Numberpolled <=(0.25+halfwidth))/Trials;
fprintf('The percentage of Monte Carlo simulations within p=%5.3f +/- %5.3f is %5.3f\n',pK,halfwidth,NowinCIS2);
% With one million trials, 94.69% of the confidence intervals contain the true parameter estimate of 0.25
% The reason for the slight discrepancy is due to several factors. First, the poll results are a discrete distribution
% Using the appropriate variance formulae for a proportion of 0.25, Variance = (0.25*0.75)/600, one can calculate
% what the lower 95% confidence interval, expected value and upper 95% confidence interval should be:
NLowExpUp=Numberpolled*[(0.25-halfwidth) 0.25 (0.25+halfwidth)]; % Save the three outputs in the row vector LowExpUp
fprintf('\nNormal Approximation Upper and lower Confidence limits based on 95%% CI halfwidth=%5.3f:\n',halfwidth)
fprintf('\tL. 95%% CI \tExpected \tU. 95%% CI \n \t%5.3f \t%5.3f \t%5.3f\n',NLowExpUp(1),NLowExpUp(2),NLowExpUp(3))
% 129.2114 150.0000 170.7886
% One usual attempt to obtain more accurate confidence intervals is to apply the continuity correction to discrete
% data of this type, by subtracting -1/2 and adding 1/2 to the continuous distributions. Let's see whether that
% improves the 95% confidence intervals:
% We still have the Kerrytally, from the Monte Carlo simulations:
NowinCIS3=sum(Kerrytally>=NLowExpUp(1)-0.5 & Kerrytally <=NLowExpUp(3)+0.5)/Trials;
fprintf('\nPercentage Monte Carlo simulations with true paramter, %5.3f, within the margin of erro with continuity correction=%5.3f\n',p,NowinCIS3);
% By using the continuity correction, described on page 267-268 in Larsen & Marx,
% the 95.76% of the confidence intervals now contain the true parameter 0.25.
% In a sense this is better, in that the confidence intervals should be broad enough to include 95% of the data,
% but the confidence intervals are actually less accurate. In the Neyman-Pearson theory of hypothesis testing,
% the use of the continuity correction helps assure that the probability of Type I error is at least no greater
% than 5%. However the confidence intervals are conservative: they tend to be too broad. For this reason, many
% recent authors recommend against using the continuity correction. They tend to inflate the probability of
% Type II error. In comparing a point estimate with a parameter, say 0.25, one will have a probability of Type II
% error, failing to reject the null hypothesis, more often than one would prefer.
% There is a second more fundamental reason why the confidence intervals in the 1 million trial simulation are
% not quite what one would expect. The normal approximation for proportions, in this case, the binomial proportion:
% 0.25 for Kerry, 0.75 for everyone else (including undecided) is just an approximation. It is usually quite a good
% approximation, but an approximation none the less. There is a test to determine whether the sample sizes and
% observed proportions are sufficient for the use of the normal approximation. That approximation is provided on
% page 268. You can program this as a little function, contained below to return a 1 for approximation adequate
% or 0 for don't use the approximation.
% Note, I have included the m.file demoivre.m below as a function. This use of inline functions was allowed in
% Matlab after version 4, if the calling program is a function.
% I don't usually use this method, and I have provided demoivre.m as a stand alone m.file
% for more general use.
OUT=demoivre(Numberpolled,0.25);
if OUT % Since OUT is 1, which can be regarded as logically true, this is sufficient, if OUT==1 is the full form
fprintf('\nThe DeMoivre-Laplace limit to the binomial distribution (L&M p. 268) is adequate.\n');
else
disp('The DeMoivre-Laplace limit to the binomial distribution is inadequate')
end
if Numberpolled==600 % These graphs have axis limits appropriate only for the 600-subject poll
% The normal approximation, or DeMoivre-Laplace limit, is an approximation to the binomial pdf. We can
% calculate that binomial probability distribution using bernoull.m, which is a Gallagher m.file on Prometheus
% The expected frequency of those for Kerry using the binomial probability density function:
k=0:Numberpolled;binompdf=bernoull(k,Numberpolled,0.25); % Calls Gallagher's bernoull.m, on Prometheus
% Let's plot the binomial pdf and the expected poll results given this pdf
plot(k,binompdf*Trials);
title('Binomial probability distribution');xlabel('Number for Kerry');ylabel('Number');
figure(gcf);pause;
% prnmenu('binompdf');
% This plot isn't directly comparable to the others because the x axis limits range from 0 to 600
% this can be changed pretty readily, using the axis call
% help axis
% AXIS Control axis scaling and appearance.
% AXIS([XMIN XMAX YMIN YMAX]) sets scaling for the x- and y-axes
% on the current plot.
plot(k,binompdf.*k);axis([100 200 0 6]);
title('Binomial probability distribution');xlabel('Number for Kerry');ylabel('Number');
figure(gcf);pause
% prnmenu('kd');
% We can plot these results using side-by-side bar charts & an overlaid plot of the normal approximation
% Create an overlaid bar plot and line plot with observed and expected.
% Create a bar chart of the observed Kerry votes from the Monte Carlo simulation:
plotk=100:5:200; % specifies the center of bins.
N=hist(Kerrytally,plotk); % using the Matlab command hist, places the observed Kerry votes into bins specified
% by centers in the plot K vector.
% plot these results as a bar chart:
bar(plotk,N/Trials);axis([100 200 0 0.25]);
xlabel(sprintf('Kerry votes in a poll of size %d',Numberpolled));
ylabel('Frequency');title('Results of Monte Carlo simulation');
figure(gcf);pause
% prnmenu('kd');
ax1=gca;% save the handle of the first graph
% plot the binomial pdf as a line chart:
% Since the histogram has the elements binned from 98:102, 103:108, need to create the pdf in that form
k=98:202;
p=bernoull(k,Numberpolled,0.25);
reshapedp=reshape(p,5,21);
histbinomp=sum(reshapedp);
h1=line(plotk,histbinomp,'Color','r','Parent',ax1,'Linewidth',3);
set(h1,'marker','square','markersize',5,...
'markeredgecolor','r','markerfacecolor','r',...
'linestyle','-','color','r','linewidth',3)
title('Results of Monte Carlo simulation (bars) and expected frequencies from binomial pdf');
figure(gcf);pause
% Plot as a grouped histogram
bar(plotk,[N/Trials;histbinomp]','grouped'),axis([100 200 0 0.2]);
legend('Observed from Monte Carlo Simulation','Expected from binomial distribution',2)
xlabel(sprintf('Kerry votes in a poll of size %d',Numberpolled));
ylabel('Frequency')
title('Comparison of Monte Carlo simulation with expectations from binomial pdf');
figure(gcf);pause
% Superimpose the normal probability pdf:
% The normal probability equation is provided on p. 264. This is for mean 0, and unit standard
% deviation. The more general equation (Legendre & Legendre, 1998 p. 147) is:
% f(y_j)=1/(sqrt(2*pi)*sigma_j)*exp(-1/2*((y_j-mu_j)/sigma_j)^2)
% We just need to find the expected value for the mean and standard deviation to apply this equation
% The expected value and standard deviation are provided on page 268 top in Larsen & Marx.
pK=0.25; % proportion for Kerry;
n=Numberpolled;
muj=n*pK;
sigmaj=sqrt(n*pK*(1-pK)); % sigmaj is the standard deviation for the Kerry proportion, sd=sqrt(Variance);
% Need to find the normal probaility density function (not the standard normal pdf) at y=98:202;
y=98:202;
fyj=1/(sqrt(2*pi)*sigmaj)*exp(-1/2*((y-muj)/sigmaj).^2); % note that pi=3.14159..., a built-in Matlab constant
reshapedy=reshape(fyj,5,21); % reshape is a Matlab function, used here to arrange the probabilities in rows of 5.
histnormp=sum(reshapedy); % sum the 5 probabilities corresponding to the bins created by Matlab's hist
bar(plotk,[N/Trials;histbinomp;histnormp]','grouped'),axis([100 200 0 0.2]);
legend('Observed from Monte Carlo Simulation','Expected from binomial distribution',...
'Expected from Normal Approximation',2)
xlabel(sprintf('Kerry votes in a poll of size %d',Numberpolled));
ylabel('Frequency')
title('Comparison of Monte Carlo simulation, Binomial pdf, and normal pdf')
figure(gcf);pause
% prnmenu('kd');
% There is another approximation to the binomial distribution, the Poisson distribution. We can fit that
% distribution and plot it too. The Poisson approximation to the binomial is discussed in Larsen & Marx, p. 246-251
% The equation is provided in my poiss_pdf.m, which is from Theorem 4.2.2, Larsen & Marx (2001), p. 251
y=98:202;n=Numberpolled;p=0.25;
Poissp=poiss_pdf(n*p,y); % Calculate the Poisson pdf for lambda=150 (had to rewrite poiss_pdf to handle 150!
reshapedPp=reshape(Poissp,5,21);
histPoissp=sum(reshapedPp);
bar(plotk,[N/Trials;histbinomp;histnormp;histPoissp]','grouped'),axis([100 200 0 0.2]);
legend('Observed from Monte Carlo Simulation','Expected from binomial distribution',...
'Expected from Normal Approximation','Expected from Poisson distribution', 2)
xlabel(sprintf('Kerry votes in a poll of size %d',Numberpolled));
ylabel('Frequency')
title('Comparison of Monte Carlo simulation, Binomial pdf, normal pdf, and Poisson pdf')
figure(gcf);pause
% prnmenu('kd');
end
% 95% CI via Monte Carlo simulation
% One of the better ways of finding the 95% CI for an estimate is through Monte Carlo simulation
% Kerrytally, contains the number of respondents, out of 600 favoring Kerry, given p=0.25
% We can sort those and find the cutoffs for the lowest 2.5% and the upper 2.5% using Matlab's sort
sortedKtally=sort(Kerrytally);
lMC95CIi=floor(0.025*Trials); % find the index for the lower 95% cutoff value
uMC95CIi=ceil(0.975*Trials); % find the index for the uppere 95% cutoff value.
medi=round(0.5*Trials);
% Save the three outputs in the row vector LowExpUp
MCLowExpUp=[sortedKtally(lMC95CIi) sortedKtally(medi) sortedKtally(uMC95CIi)];
fprintf('\nMonte Carlo Lower 95%% confidence limit, median, and upper 95%% confidence limit based on %d trials:\n',Trials)
fprintf('\tL. 95%% CI \tMedian \tU. 95%% CI \n \t%5.3f \t%5.3f \t%5.3f\n',MCLowExpUp(1),MCLowExpUp(2),MCLowExpUp(3))
% What is the probability that Kerry & Dean have equal favorability?
% Monte Carlo simulation will answer this problem:
% I'll plot the number of Kerry votes and Dean votes from our simulations, completed above
% Now, in some of the trials, I've set Trials equal to one million. I want to see the pattern in the data,
% but I don't want to sit for minutes while Matlab creates a gigantic graph with 1 million data points. I'll
% use randperm to select at most 1000 polls to plot using an if statement and randperm.
% Recall that randperm(n), randomly shuffles the numbers from
% 1 to n, my shuffle.m does the same thing (written before Matlab's randperm became available in Matlab 5):
% RANDPERM Random permutation.
% RANDPERM(n) is a random permutation of the integers from 1 to n.
% For example, RANDPERM(6) might be [2 4 5 6 1 3].
if Trials>1000;
i=randperm(Trials);i=i(1:1000); % I'll plot only, 1000 randomly selected rows of Kerrytally & Deantally
else
i=1:Trials; % If Trials <=1000, I'll plot all Trials;
end
plot(Kerrytally(i),Deantally(i),'.');xlabel('Number for Kerry'),ylabel('Number for Dean');title('Monte Carlo simulation');
figure(gcf),pause;
% Now that result may look like just a shotgun pattern, but there is a negative correlation there. We can document
% that negative correlation using Matlab's corrcoef.m. corrcoef.m returns an n x n matrix of correlation coefficients
% to find the correlation between Dean & Kerry votes, in this case a 2 x 2 matrix. The main diagonal of a correlation
% matrix is always 1.0, and
fprintf('\nThe correlation matrix between Kerry & Dean votes in %d Monte Carlo simulations:\n',length(i))
corrs=corrcoef([Kerrytally(i) Deantally(i)]);
disp(corrs);
fprintf('The correlation between Kerry votes and Dean Votes is %6.3f\n',corrs(2,1));
% Advanced topic. It is relatively easy to plot a regression line through these data, and to determine the slope
% and Y intercept of the line, so let's do that. The only non-standard part of fitting the regression line with
% Kerry votes as the X variable, the sole explanatory variable, and Dean votes as the Y variable, or response
% variable (sometimes called the independent and dependent variables), is that we have to add a column of all
% 1's to the X variable -- in matrix terms, this column allows us to fit the Y intercept.
% The Y variable and explanatory variable have to be oriented as column vectors, which is how we set up the
% simulation;
X=[ones(size(Kerrytally)) Kerrytally]; % Concantenation, combining a vector of all 1's the size of Kerrytally
% in front of the obsereved Kerrytally vector;
Y=Deantally;
B=X\Y; % Matlab's regression fit is always this simple. B contains the y intercept and slope.
disp('The two regression coefficients for Dean''s votes vs. Kerry''s votes')
disp(B)
fprintf('The regression equation for Dean & Kerry votes produced by %d Monte Carlo simulations of polls of size %d:\n',...
Trials,Numberpolled)
fprintf('Dean Votes = %5.1f + %5.3f * Kerry Votes\n',B(1),B(2))
% We can plot this result using the same axes for the graph that we had just created, but just using 'hold on'
% This holds the previous graph in place and we can plot new results on it. We'll plot the expected values for
% the subset of the polls plotted above in line 309. The estimated Y values are obtained by multiplying the
% rows x 2 'X' matrix by the 2 x 1 vector of regression coefficients to obtain the estimated Y values.
% We only need to plot two data points to determine the line, not 1000, so let's find the smallest & largest Kerry totals:
minK=min(Kerrytally(i));
maxK=max(Kerrytally(i));
% We just have to estimate the expected Y values for the minimum and maximum points just calculated, with the
% two regression coefficients, Y intercept & slope, that are now saved in the B matrix.
x=[ 1 0
1 0]; % Creates a matrix for the two endpoints. We'll replace the 0's with real data
% As in all regressions, with a Y intercept, we need a column of all 1's in the first column;
x(1,2)=minK; % Replace the 0 in the upper right with the minimum Kerry total of those selected with
% index vector i
x(2,2)=maxK; % Replace the 0 in the lower right with the minimum
% We could have created this in one statement had we wanted to:
x=[ones(2,1) [min(Kerrytally(i));max(Kerrytally(i))] ];
Yest=x*B;
hold on; % This will hold all of the information from the previous graph and allow us to plot new stuff on top.
plot(x(:,2),Yest,'-r');figure(gcf);pause
% Using the sprintf command, we can put a more informative title on the graph, replacing the old title.
% sprintf is just like fprintf, only you can save the results as a text string. We don't need the final \n
% line space request that was in the frpintf command in line 317.
txt=sprintf('Dean Votes = %5.1f + %5.3f * Kerry Votes',B(1),B(2));
% hold on remains in effect until we turn it off, so:
title(txt);figure(gcf);pause
% prnmenu('kd');
hold off
% This negative correlation could be expected. If by chance alone, a poll had more Kerry votes, the expectation is that
% by chance alone, there would be fewer Dean voters. There is a negative association between them
% Monte Carlo simulation of the probability that Dean and Kerry are equal in popularity.
% The simulation above, based on the proportion of the poll for Kerry (25%) and the proportion of the poll for
% Dean (19%) can be used to test the hypothesis.
% To test the null hypothesis that Dean & Kerry are equal in popularity, we need to redo the Monte Carlo simulation,
% to test the null hypothesis that Dean & Kerry are equal in popularity. We'll see what percentage of the Monte
% Carlo simulations produce a result as extreme or more extreme than the results that were observed. The AP reported
% that Kerry had a 6% lead (25% to 19%). We will calculate the probability of seeing that result by chance alone.
% Let's cut and paste the key simulation from lines 97-104, but this time we'll make the Kerry & Dean expected poll
% results equal. Our best estimate of their percentages to test this null hypothesis is that Kerry & Dean are
% splitting the 25% + 19% of the vote equally.
% We are almost at the end of this rather long exercise. So what is the probability that Dean & Kerry are equal
% in popularity? Recall, when we ran the if loop, back about line 100, we saved the results of the poll if
% Kerry & Dean were in fact equal in favorability with (25% + 19%)/2 of the NH voters favoring each.
% Answer: We just need to find the percentage of the polls based on these equal proportions
% that produce a difference of 6% or greater. In this
% test we could test the one-sided or two-sided null hypothesis. There would be justification for testing this
% poll result using the two-sided, or two-tailed probability, which is that we will consider those polls where
% Dean's total is 6% or greater than the Kerry total as evidence against the null hypothesis of equality.
% Calculate the difference in polls and divide by the number of Trials:
DifferenceHo = (KerrytallyHo - DeantallyHo)/Numberpolled;
% Let's plot the results:
hist(DifferenceHo,20);xlabel('Kerry - Dean');ylabel('Number');
title('Monte Carlo simulation of the null hypothesis that Dean and Kerry are equal in popularity in NH')
figure(gcf);pause
% prnmenu('kd');
% Now we just need to find out what the 1 sided p value is and then the two sided p value. In this case, they
% will be nearly equal:
i=find(DifferenceHo>=0.06);onesidedp=length(i)/Trials;
fprintf(...
'\nUnder the null hypothesis, Kerry=Dean, the 1-sided p of observing a 6%% difference by chance is %d\n',...
onesidedp)
% Now, calculate the two-sided, or two-tailed p, where the trials with Dean having more than 6% counting against
% the null hypothesis
i=find(abs(DifferenceHo)>=0.06);twosidedp=length(i)/Trials;
fprintf(...
'Under the null hypothesis, Kerry=Dean, the 2-sided p of observing a 6%% difference by chance is %d\n',...
twosidedp)
% note that this could have been shortened to:
twosidedp=sum(abs(DifferenceHo)>=0.06)/Trials;
fprintf(...
'Under the null hypothesis, Kerry=Dean, the 2-sided p of observing a 6%% difference by chance is %d\n',...
twosidedp)
% Now, in a simulation that I ran, with Trials=100, I came up with a probabilities of 0, which is not true. By convention in
% reporting p values from Monte Carlo simulations, the p value is set at the observed value OR the minimum of 1/Trials;
onesidedp=max([onesidedp 1/Trials]);
twosidedp=max([twosidedp 1/Trials]);
if onesidedp<0.001 % change the format so that it is in full form only for low p values:
fprintf(...
'\nUnder the null hypothesis, Kerry=Dean, and %d trials, the 1-sided p of observing a 6%% difference by chance is %d\n',...
Trials,onesidedp);
fprintf(...
'Under the null hypothesis, Kerry=Dean, and %d trials, the 2-sided p of observing a 6%% difference by chance is %d\n',...
Trials,twosidedp);
else
fprintf(...
'\nUnder the null hypothesis, Kerry=Dean, and %d trials, the 1-sided p of observing a 6%% difference by chance is %5.3f\n',...
Trials,onesidedp);
fprintf(...
'Under the null hypothesis, Kerry=Dean, and %d trials, the 2-sided p of observing a 6%% difference by chance is %5.3f\n',...
Trials,twosidedp);
end
% Conclusion, depending on the number of trials, and chance you should find that the two-sided p value is about 0.028.
% Thus, this result would indicate that there is only a 3% chance that Kerry & Dean are equal in popularity. Using
% the old way of reporting results, based on a decision value of 0.05, the probability of rejecting a true null hypothesis
% is the Type 1 error. If the decision rule is set with the probability of Type 1 error, indicated by alpha, as being
% set at 0.05, then one could conclude that the null hypothesis could be rejected. How to report the 1-sided result?
% If one specifies in advance that you are testing the null hypothesis that Kerry and Dean are equal vs. the alternate
% hypothesis that Kerry is ahead of Dean, then one can reject the null hypothesis at a p value of 0.014.
% By convention, this would indicate moderate evidence against the null hypothesis. If the p value for the test was
% greater than 0.05, then the p value should be reported, but it should be regarded as weak evidence against the null
% hypothesis.
% The Monte Carlo simulation of the NH poll indicates that there is moderate evidence that Kerry is ahead of Dean
% in NH, but how far ahead. Our best estimate of how far ahead, the maximum likelihood estimator,
% is the observed result: 6%. We can use our Monte Carlo simulation to calculate the 95% confidence limits on
% this point estimate:
% Calculate the difference in polls and divide by the number of Trials
% We'd calculated the differences in the poll results under the null hypothesis, now let's calculate the
% differences under our maximum likelihood estimates that Kerry's proportion is 25% and Dean is 19%. These
% results are still saved in the vectors, with length=Trials, Kerrytally and Deantally.
KDDiff = (Kerrytally - Deantally)/Numberpolled;
% 95% CI via Monte Carlo simulation
sortedKDDiff=sort(KDDiff);
lMC95CIpi=floor(0.025*Trials); % find the index for the lower 95% cutoff value
uMC95CIpi=ceil(0.975*Trials); % find the index for the upper 95% cutoff value.
medpi=round(0.5*Trials); % find the median, should be close to or identical to the expected value.
% Save the three outputs in the row vector DLowExpUp
DLowExpUp=[sortedKDDiff(lMC95CIpi) sortedKDDiff(medpi) sortedKDDiff(uMC95CIpi)];
fprintf('\nMonte Carlo simulation of the difference between Kerry & Dean in the NH poll\n');
fprintf('Lower 95%% confidence limit, median, and upper 95%% confidence limit based on %d trials:\n',Trials)
fprintf('Lower 95%% CI \tMedian \tUpper 95%% CI \n \t%4.2f%% \t\t%4.2f%% \t%4.2f%% \n',...
DLowExpUp(1)*100,DLowExpUp(2)*100,DLowExpUp(3)*100)
% With 5000 trials, here is the result that I obtained:
%Lower 95% CI Median Upper 95% CI
% 0.67% 6.00% 11.33%
HalfCI=(DLowExpUp(3)-DLowExpUp(1))/2;
fprintf('\nReport this result as %3.1f%% +/- %4.1f%%\n',DLowExpUp(2)*100,HalfCI*100)
% It is reassuring that the formal test of the equal proportions hypothesis produced a p value less than 0.05.
% and that the 95% confidence interval for the difference of 6% does not include 0. This result could be
% reported as the difference in the Kerry vs. Dean vote is 6% with a margin of error of 5.2%
% That margin of error of 5.2% is interesting. How could we have obtained it without a Monte Carlo simulation?
% But what if you don't want to do a Monte Carlo simulation? Surely there are equations that will tell you what the
% p value is for the difference between two proportions. There are, but the standard formula is for two INDEPENDENT
% proportions, for example, if Kerry had 25% of those polled supporting him in 1 poll and Dean had 19% in another
% poll. There is a section of Larsen & Marx (p. 505-508), describing how to test the hypothesis that two proportions
% are equal. Even before doing the test, I know that it will produce p values that are TOO small. The Kerry & Dean
% proportions are from the same poll, and they are not independent. By chance, a high poll result for Kerry will
% result in a lower proportion for Dean. We've already seen that in the plot of the poll results, there is a negative
% correlation between the two results.
% Let's apply Theorem 9.4.1, from Larsen & Marx p. 506, the conventional test for the difference between two
% independent binomial proportions. This test will produce a p value that is TOO low:
pK=.25;pD=.19; % observed proportion of Kerry and Dean respondents in the poll
n=Numberpolled;m=Numberpolled; % The test is designed for proportions based on sample sizes that can differ. In this case n=m=600;
z=(pK-pD)/sqrt( pK*(1-pK)/n + pD*(1-pD)/m );
fprintf('\nThe value of the z statistic for the two-sample binomial test is %6.3f\n',z)
% This result is z=2.515
% Now we can look up the p value for z from the standard normal distribution. This is tabulated in Table A.1 in
% Larsen & Marx (2001, p. 716-717). On p. 717, there are values of the cumulative standard normal distribution
% for 2.51 and 2.52, we can interpolate between these to get the 2-tailed p value. These are 1-tailed or
% 1-sided p values -- you can always check such tables by looking up the value for 1.96. The two-tailed p is
% 0.05 and the 1-tailed p is 0.025
interpolatedp=(.9940+.9941)/2;
% What percentage of the area under the curve is greater than z=2.515: Use 1-interpolatedp=0.006 for a poll of 600
% Matlab's method for finding the significance of z statistics is the complementary error function:
% Here is an implementation that will find the 1-tailed p.
% I have this programmed as zprob.m
Prob1=erfc(abs(z)/sqrt(2))/2;
Prob1=zprob(z); % this is my little m.file for implementing the complementary error function, a 1 liner
fprintf('\nUsing Matlab''s erfc.m, Gallagher''s zprob.m, the 1-tailed p value is %6.4f\n',Prob1);
% We can see how the other toolboxes program this vital m.file, both these authors use the erf error function:
% Anders Holtsberg's statbox toolbox:
Prob2=1-pnorm(z);
fprintf('\nUsing Anders Holtsberg''s stixbox pnorm.m, the 1-tailed p value is %6.4f\n',Prob2);
% Gordon Smythe's statbox toolbox:
Prob3=1-normp(z);
fprintf('\nUsing Gordon Smythe''s statbox''s normp.m, the 1-tailed p value is %6.4f\n\n',Prob3);
% calculate the differences among the p values, using Gallagher's pydist.m
disp('Differences among the 3 ways of finding z values, as a 3x3 matrix');
V=[Prob1 Prob2 Prob3]';
disp(pydist(V));
% Conclusion: you can use any one of these m.files for calculating z values. They are identical
% All three of these methods produce p values that are only 39% of the value that should be observed.
% The Monte Carlo simulation showed that the 1-tailed p value is 0.015 but these tests, based on INDEPENDENT
% binomial proportions are producing a value that is only 0.0059
% Z statistics are usually defined as the estimate in the numerator and the standard error of the estimate
% in the denominator. Since the z values for this test are way too high, the standard error for the
% difference in proportions is too small. It must be larger
% This is the formula used for the two-sample test for differences in proportions, shown above
% in Larsen & Marx, page 506, Theorem 9.4.1
% z=(pK-pD)/sqrt( pK*(1-pK)/n + pD*(1-pD)/m );
% Larsen & Marx are applying the formula for the variance of two independent random variables
% Variance(X+Y)= Variance(X) + Variance (Y), IF X & Y are independent.
% The denominator of the above equation estimates the square root of the variance of X plus the variance of Y.
% If X & Y are not independent, then the Variance of their sum or difference must include the covariance:
% Variance(X+Y)=Variance(X) + Variance (Y) + 2* Covariance (X,Y);
% The mistake made in applying this formula is in the variance for the difference.
% Here is how they found the standard error for the difference:
n=Numberpolled;
pK=0.25; % Proportion for Kerry
VariancepK=pK*(1-pK); % Variance for a binomial proportion for Kerry, p. 224 in Larsen & Marx
VariancepD=pD*(1-pD); % Variance for binomial proportion for Dean, p. 224 in Larsen & Marx
VarianceDKdiff=VariancepD+VariancepK; % Variance for the difference in proportions
StandardErrorDKdiff=sqrt(VarianceDKdiff)/sqrt(n); % Standard error of the difference=standard deviation (difference)/sqrt(n)
z2=0.06/StandardErrorDKdiff;
% This result is identical to the calculated z value from Theorem 9.4.1
% I went through this derivation to show how the covariance term should be included.
% There are three facts that you need to know, and the first two are not in Larsen & Marx.
% Fact 1:
% The variance of the difference between two random variables X and Y is
% Variance(X-Y)=Variance(X)+Variance(Y)-2 Covariance(X,Y);
% Fact 2: The Covariance of two multinomial proportions X & Y is -X*Y;
% Fact 3: Standard error of the difference in binomial proportions, X & Y
% Standard Error of difference=standard deviation(X-Y)/sqrt(n), where n=poll size
% =sqrt(Variance(X-Y))/sqrt(n)
% =sqrt(Variance(X)+Variance(Y)-2*Covariance(X,Y))/sqrt(n);
% =sqrt(X*(1-X)+(Y*(1-Y))-(2*-X*Y))/sqrt(n);
% Let's combine these to get a new z statistic, taking into account that the variance of the difference will
% be Larger than if the results were independent:
pK=0.25;pD=0.19;n=Numberpolled;
z3=(pK-pD)/(sqrt(pK*(1-pK)+pD*(1-pD)-2*(-pK*pD))/sqrt(n));
p3=zprob(z3);
fprintf('\nIncorporating the covariance in the standard error, the 1-tailed p value is %5.3f\n\n',p3);
VarianceDKdiff=VariancepD+VariancepK; % Variance for the difference in proportions
StandardErrorDKdiff=sqrt(VarianceDKdiff)/sqrt(n); % Standard error of the difference=standard deviation (difference)/sqrt(n)
z2=0.06/StandardErrorDKdiff;
p2=zprob(z2);
% In 11.4.3, Larsen & Marx (2001, p. 611) show that
% correlation between X&Y =Covariance(X,Y)/(standard deviation(X)* standard deviation(Y))
fprintf('\nThe correlation matrix between Kerry & Dean votes in %d Monte Carlo simulations:\n',Trials)
corrs=corrcoef([Kerrytally Deantally]);
disp(corrs);
corrKD=corrs(2,1); % The lower left element of the 2 x 2 correlation matrix. Identical to corrs (1,2)
fprintf('The correlation between Kerry votes and Dean Votes in %d trials is %6.3f\n',Trials,corrKD);
% Using this, and our knowledge that the variance of a binomial proportion is p*(1-p), let's find the covariance:
covKD=corrKD*sqrt(pK*(1-pK))*sqrt(pD*(1-pD));
fprintf(...
'The covariance between Kerry and Dean Votes, estimated from the observed correlation=%6.3f is %6.3f\n',...
corrKD,covKD);
n=Numberpolled;
z4=(pK-pD)/(sqrt(pK*(1-pK)+pD*(1-pD)-2*covKD)/sqrt(n));
p4=zprob(z4);
fprintf('\nIncorporating the covariance in the standard error, estimated from the correlation, the 1-tailed p value is %5.3f\n',p4);
% One final fact and we're done:
% Confidence intervals are calculated as the estimated value +/- z_(alpha/2)*Standard Error
% If the standard error are known from theory, then the z statistic is used
% if the standard error is estimated from the data, then Student's t statistic is used
% in the case of proportions, the variance is known from theory: variance(p)=p*(1-p), so
% We estimated the appropriate standard error for the data above, about line 599.
% So, we just need to find the appropriate z statistic for the 2-tailed 95% confidence intervals.
% it is 1.96, but we can get it using zfind
z=zfind(0.05);
half95CI=z*StandardErrorDKdiff;
fprintf(...
'\nThe observed difference between Dean & Kerry of %4.1f%% has a 95%% confidence interval of +/-%4.1f%%\n',...
(pK-pD)*100,half95CI*100);
disp('This 95% confidence interval is too narrow, and is incompatible with the simulation results: Trust the simulation!')
% DONE AT LAST
diary off % This turns off the diary statement.