In this project, you will write a program that reads in a graph from a file, defines a Markov chain from it, and computes the steady state of the chain.
The graphs you'll be provided are fragments of the world wide web--crawls of 4 different websites. Each graph represents pages and their hyperlinks. You will simulate a random web surfer--someone half-asleep, zoned out, randomly clicking links on the internet. Fortunately, no need to physically impersonate them; rather, we'll define a Markov chain--actually, two different Markov chains--describing "random walks" on the web. Then you'll compute the steady-state probability of each walk--i.e., the proportion of time the web surfer would spend at each page if they kept at it for days.
Then, read this technical description of PageRank. This is exactly what we want to do. To summarize:
In this version, it's easier not to think about it as the web. Just think of it as an arbitrary graph, because now we will treat the links as undirected, or symmetric. That is to say, if we can transition from node A to node B, we can also transition from node B to node A. (Important: in the data file, the links are only given in one direction. So if the file says there's an edge A -> B, you need to also put in the edge B -> A.)
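One way to add the reverse edges is sketched below. It assumes the graph is held as a boolean adjacency matrix; the class and method names are illustrative and not part of the assignment.

```java
final class Symmetrize {
    /** Returns a copy of adj in which every edge a -> b also has the edge b -> a. */
    static boolean[][] makeUndirected(boolean[][] adj) {
        int n = adj.length;
        boolean[][] out = new boolean[n][n];
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                if (adj[a][b]) {
                    out[a][b] = true; // keep the edge as given in the file
                    out[b][a] = true; // add the reverse direction
                }
            }
        }
        return out;
    }
}
```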
The walk we'll do on this undirected graph:
(Reviewing from class and the textbook . . .)
A Markov chain can be represented by a stochastic transition matrix; let's call it C. Its (i, j) entry (ith row, jth column) represents the probability of transitioning from state i to state j.
Then, the current state of a walker can be represented as a row vector; call it v. E.g., if the vector is (1, 0, 0), they are in state 1 (of a 3-state chain). Or they can probabilistically be in any of multiple states--e.g., (1/3, 1/3, 1/3) gives equal probability to each. The next state of a walker is a new vector obtained by computing vC = v'. (For example, if the initial state is (1, 0, 0), then multiplying vC just picks off the top row of C. This should make sense: if you're coming from state 1, the top row represents where you might go. If you're coming from a mixture of states, then the calculation combines where you might go from each of them.) The new state after two transitions is then v'C = vC^2. After n transitions it's vC^n.
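As a minimal sketch of a single transition, the method below computes the row vector vC, assuming C is stored as a double[][] whose rows each sum to 1. The names ChainStep and step are made up for illustration.

```java
final class ChainStep {
    /** Returns the row vector v * c, i.e. the distribution after one transition. */
    static double[] step(double[] v, double[][] c) {
        int n = v.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            if (v[i] == 0.0) continue; // no probability mass in state i
            for (int j = 0; j < n; j++) {
                next[j] += v[i] * c[i][j]; // mass flowing from state i to state j
            }
        }
        return next;
    }
}
```

For instance, with rows (0.5, 0.5, 0), (0.25, 0.5, 0.25), (0, 0.5, 0.5) and v = (1, 0, 0), one call returns (0.5, 0.5, 0), which is the top row of C. Calling step again on that result gives the distribution after two transitions.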
After many many transitions, the walker may reach a steady state. This means that given the current distribution over states, after one more transition, the next distribution over states is the same. If that steady state vector were (1/4, 1/2, 1/4), that would mean that over the long term, after we've been walking around for a while, we'll be in state 2 about 1/2 the time and the others each 1/4 of the time.
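The steady-state condition can be checked directly: s is steady if sC equals s. The sketch below does that comparison with a small tolerance, since floating-point results are rarely exactly equal. The class name, method name, and tolerance parameter are assumptions for illustration.

```java
final class SteadyCheck {
    /** True if s * c equals s to within tol in every component. */
    static boolean isSteady(double[] s, double[][] c, double tol) {
        int n = s.length;
        for (int j = 0; j < n; j++) {
            double next = 0.0;
            for (int i = 0; i < n; i++) {
                next += s[i] * c[i][j]; // component j of s * c
            }
            if (Math.abs(next - s[j]) > tol) return false;
        }
        return true;
    }
}
```

For the 3-state chain with rows (0.5, 0.5, 0), (0.25, 0.5, 0.25), (0, 0.5, 0.5), the vector (1/4, 1/2, 1/4) passes this check, while (1, 0, 0) does not.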
The amazing thing is that . . . well, two amazing things:
We'll find the steady state two ways.
First, by Monte Carlo simulation--i.e., actually performing the random walk. Start out somewhere (anywhere), walk around according to the transition matrix for a long time, and keep track of the fraction of time spent in each state. (How to know when it's been long enough? When that vector of state frequencies converges. You could check it once every hundred thousand steps or so.)
Second, analytically. To find the steady state, we'll compute C^n as n increases, until it converges, i.e. C^(n+1) = C^n. In that matrix, every row will be equal. Then as said above, the steady state, s = vC^n, will be the same for every v. Take v = (1, 0, 0, . . .) for example. Then s = vC^n will be just the top row of C^n. (And the same holds regardless of v.) So, compute C^n until convergence, and return the top row. Note that in multiplying the matrix by itself, it will be much more efficient to compute CC = C^2, then C^2 C^2 = C^4, and so on, doing powers of two.
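Both approaches might be sketched as follows. This is an illustration under stated assumptions, not a reference solution: the simulation runs for a fixed number of steps instead of checking for convergence, the analytical version compares successive squares against a tolerance instead of testing exact equality, and all names (SteadyState, simulate, analytic, tol, maxSquarings) are made up here.

```java
import java.util.Random;

final class SteadyState {
    /** Monte Carlo: walks for `steps` transitions from state 0; returns visit frequencies. */
    static double[] simulate(double[][] c, long steps, Random rng) {
        int n = c.length;
        long[] visits = new long[n];
        int state = 0;
        for (long t = 0; t < steps; t++) {
            visits[state]++;
            // Sample the next state from row `state` of c.
            double r = rng.nextDouble();
            double cum = 0.0;
            int next = state;
            for (int j = 0; j < n; j++) {
                if (c[state][j] <= 0.0) continue;
                next = j; // last reachable state seen so far (guards against rounding)
                cum += c[state][j];
                if (r < cum) break;
            }
            state = next;
        }
        double[] freq = new double[n];
        for (int i = 0; i < n; i++) freq[i] = (double) visits[i] / steps;
        return freq;
    }

    /** Squares c repeatedly (C^2, C^4, ...) until entries stop changing; returns the top row. */
    static double[] analytic(double[][] c, double tol, int maxSquarings) {
        double[][] p = c;
        for (int k = 0; k < maxSquarings; k++) {
            double[][] sq = multiply(p, p);
            double diff = maxAbsDiff(sq, p);
            p = sq;
            if (diff < tol) break; // squaring no longer changes the matrix
        }
        return p[0].clone();
    }

    /** Plain O(n^3) product of two square matrices. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i][k];
                if (aik == 0.0) continue; // skip zero entries
                for (int j = 0; j < n; j++) out[i][j] += aik * b[k][j];
            }
        }
        return out;
    }

    /** Largest absolute entrywise difference between two matrices of equal shape. */
    static double maxAbsDiff(double[][] a, double[][] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                max = Math.max(max, Math.abs(a[i][j] - b[i][j]));
            }
        }
        return max;
    }
}
```

The maxSquarings cap is there so the loop still ends if the powers never settle. To follow the convergence hint for the simulation, one option is to call simulate-style code in chunks and compare the frequency vector between chunks.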
Read in the file.
Each line of the file describes the links out of one state. For example, the line "3: 57 206" tells us that state 3 has links to states 57 and 206.
Hint: you will need to read in all the lines first, to know how big the matrix needs to be. Then allocate the matrix. Then go through the lines one by one, filling in the matrix entries.
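The two-pass approach in the hint could look roughly like the sketch below. It makes several assumptions that the file format description does not confirm: state ids are non-negative integers, the matrix is sized to the largest id plus one, and tokens are separated by a colon and whitespace. The class GraphReader and its methods are illustrative names.

```java
import java.util.List;

final class GraphReader {
    /** Builds an adjacency matrix from lines such as "3: 57 206". */
    static boolean[][] parse(List<String> lines) {
        // First pass: find the largest state id so we know the matrix size.
        int maxId = -1;
        for (String line : lines) {
            for (String tok : tokens(line)) {
                maxId = Math.max(maxId, Integer.parseInt(tok));
            }
        }
        boolean[][] adj = new boolean[maxId + 1][maxId + 1];
        // Second pass: the first token is the source, the rest are its link targets.
        for (String line : lines) {
            String[] toks = tokens(line);
            if (toks.length == 0) continue; // blank line
            int from = Integer.parseInt(toks[0]);
            for (int k = 1; k < toks.length; k++) {
                adj[from][Integer.parseInt(toks[k])] = true;
            }
        }
        return adj;
    }

    /** Splits a line on colons and whitespace; a blank line yields no tokens. */
    private static String[] tokens(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("[:\\s]+");
    }
}
```

The list of lines can come from java.nio.file.Files.readAllLines, which reads a whole text file into a List<String>.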
Are the pages those with the highest degrees (for undirected walk)? Do the pages seem important? (Ok if this doesn't turn out to be so; but it's worth checking.)
Then pick one of the schools (any one of the four), and run all 4 combinations of random walks / steady states for it: directed/Monte Carlo simulation, directed/analytical, undirected/Monte Carlo, undirected/analytical. Verify that the simulated and analytical results match. Examine the top 10 as described above and say what you find.
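For the comparison and the top-10 listing, two small helpers may be handy. This is only a sketch: the names Results, maxDifference, and topK are hypothetical, and how small a difference counts as a "match" is left to you.

```java
import java.util.Arrays;

final class Results {
    /** Largest absolute difference between two probability vectors of equal length. */
    static double maxDifference(double[] a, double[] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    /** Indices of the k largest entries of probs, highest probability first. */
    static int[] topK(double[] probs, int k) {
        Integer[] idx = new Integer[probs.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort state indices by descending probability.
        Arrays.sort(idx, (x, y) -> Double.compare(probs[y], probs[x]));
        int[] top = new int[Math.min(k, idx.length)];
        for (int i = 0; i < top.length; i++) top[i] = idx[i];
        return top;
    }
}
```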
proj4 in the cs240 directory of the edlab machines. To get to it, use this link to the ftp server, or see the edlab main page for other ways to connect.
Submit your completed, commented Java files along with your writeup to the cs240 folder of your own home directory by midnight on Tuesday, May 12, 2009.