In this project, you will write a program that reads in a graph from a file, defines a Markov chain from it, and computes the steady state of the chain.
The graphs you'll be provided are fragments of the world wide web--crawls of 4 different websites. Each graph represents pages and their hyperlinks. You will simulate a random web surfer--someone half-asleep, zoned out, randomly clicking links on the internet. Fortunately, no need to physically impersonate them; rather, we'll define a Markov chain--actually, two different Markov chains--describing "random walks" on the web. Then you'll compute the steady-state probability of each walk--i.e., the proportion of time the web surfer would spend at each page if they kept at it for days.
Then, read this technical description of PageRank. This is exactly what we want to do. To summarize:
In this version, it's easier not to think about it as the web. Just think of it as an arbitrary graph, because now we will treat the links as undirected, or symmetric. That is to say, if we can transition from node A to node B, we can also transition from node B to node A. (Important: in the data file, the links are only given in one direction. So if the file says there's an edge A -> B, you need to also put in the edge B -> A.)
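One way to add the reverse edges is sketched below. It assumes the links are already held in a boolean adjacency matrix; the class and method names are made up for illustration and are not part of the provided files.

```java
// Hypothetical helper, not part of GraphReader.java or MarkovChain.java.
final class SymmetrizeSketch {
    /** Returns a copy of adj in which every edge a -> b also has the edge b -> a. */
    static boolean[][] symmetrize(boolean[][] adj) {
        int n = adj.length;
        boolean[][] out = new boolean[n][n];
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                if (adj[a][b]) {
                    out[a][b] = true;  // the edge as given in the file
                    out[b][a] = true;  // the reverse edge the file leaves out
                }
            }
        }
        return out;
    }
}
```

Working on a copy keeps the directed matrix intact, which is handy since both walks are needed.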
The walk we'll do on this undirected graph:
(Reviewing from class and the textbook . . .)
A Markov chain can be represented by a stochastic transition matrix; let's call it C. Its (i, j) entry (ith row, jth column) represents the probability of transitioning from state i to state j.
Then, the current state of a walker can be represented as a row vector; call it v. E.g., if the vector is (1, 0, 0), they are in state 1 (of a 3-state chain). Or they can probabilistically be in any of multiple states--e.g., (1/3, 1/3, 1/3) gives equal probability to each. The next state of a walker is a new vector obtained by computing vC = v'. (For example, if the initial state is (1, 0, 0), then multiplying vC just picks off the top row of C. This should make sense: if you're coming from state 1, the top row represents where you might go. If you're coming from a mixture of states, then the calculation combines where you might go from each of them.) The new state after two transitions is then v'C = vC^2. After n transitions it's vC^n.
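The row-vector-times-matrix step can be sketched as follows. This is only an illustration: the class name is hypothetical, and the 3-state matrix in the usage note is invented for the example rather than taken from any of the data files.

```java
// Hypothetical helper illustrating one transition of the chain.
final class StepSketch {
    /** One transition: returns the row vector v' = vC. */
    static double[] step(double[] v, double[][] c) {
        int n = v.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                // Probability of being in state i now, times probability of going from i to j.
                next[j] += v[i] * c[i][j];
            }
        }
        return next;
    }
}
```

For instance, with the made-up matrix whose rows are (0.5, 0.5, 0), (0, 0.5, 0.5), and (1, 0, 0), calling step on (1, 0, 0) returns the top row, (0.5, 0.5, 0). Calling step again on that result gives the distribution after two transitions.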
After many many transitions, the walker may reach a steady state. This means that given the current distribution over states, after one more transition, the next distribution over states is the same. If that steady state vector were (1/4, 1/2, 1/4), that would mean that over the long term, after we've been walking around for a while, we'll be in state 2 about 1/2 the time and the others each 1/4 of the time.
The amazing thing is that . . . well, two amazing things:
We'll find the steady state two ways. First, by Monte Carlo simulation--i.e., actually performing the random walk. Start out somewhere (anywhere), walk around according to the transition matrix for a long time, and keep track of the fraction of time spent in each state. (How to know when it's been long enough? When that vector of state frequencies converges. You could check it once every hundred thousand steps or so.)
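A rough sketch of the simulation loop is below. The names, the tolerance-based convergence test, and the cap on the number of steps are all assumptions made for illustration; the assignment does not prescribe them.

```java
import java.util.Random;

// Hypothetical sketch of the Monte Carlo walk; not the required structure of MarkovChain.java.
final class MonteCarloSketch {
    /** Picks the next state by sampling from row `state` of the transition matrix c. */
    static int nextState(double[][] c, int state, Random rng) {
        double r = rng.nextDouble();
        double cumulative = 0.0;
        for (int j = 0; j < c.length; j++) {
            cumulative += c[state][j];
            if (r < cumulative) {
                return j;
            }
        }
        // Rounding can leave the row sum slightly below 1; fall back to the last reachable state.
        for (int j = c.length - 1; j >= 0; j--) {
            if (c[state][j] > 0) {
                return j;
            }
        }
        return state;
    }

    /** Walks until the visit frequencies move by less than tol between two checks. */
    static double[] simulate(double[][] c, int checkEvery, double tol, long maxSteps, Random rng) {
        int n = c.length;
        long[] visits = new long[n];
        double[] prev = new double[n];
        int state = 0;  // arbitrary starting state
        long steps = 0;
        while (steps < maxSteps) {
            for (int k = 0; k < checkEvery; k++) {
                state = nextState(c, state, rng);
                visits[state]++;
                steps++;
            }
            // Compare the current frequencies with those from the previous check.
            double[] freq = new double[n];
            double maxDiff = 0.0;
            for (int i = 0; i < n; i++) {
                freq[i] = (double) visits[i] / steps;
                maxDiff = Math.max(maxDiff, Math.abs(freq[i] - prev[i]));
            }
            prev = freq;
            if (maxDiff < tol) {
                break;
            }
        }
        return prev;
    }
}
```

A call might look like `MonteCarloSketch.simulate(c, 100000, 1e-4, 10000000L, new Random())`. Passing the Random in makes it possible to fix a seed while debugging.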
Second, analytically. To find the steady state, we'll compute C^n as n increases, until it converges, i.e. C^(n+1) = C^n. In that matrix, every row will be equal. Then as said above, the steady state, s = vC^n, will be the same for every v. Take v = (1, 0, 0, . . .) for example. Then s = vC^n will be just the top row of C^n. (And the same holds regardless of v.) So, compute C^n until convergence, and return the top row. Note that in multiplying the matrix by itself, it will be much more efficient to compute CC = C^2, then C^2 C^2 = C^4, and so on, doing powers of two.
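The repeated-squaring idea might be sketched like this. Two choices here are assumptions rather than anything the assignment specifies: convergence is tested by comparing each squared matrix with the one before it (so successive powers of two rather than C^(n+1) against C^n), and "equal" means "within a small tolerance", since exact floating-point equality may never occur. All names are hypothetical.

```java
// Hypothetical sketch of the analytical method using repeated squaring.
final class PowerSketch {
    /** Plain matrix product a * b for square matrices of the same size. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i][k];
                if (aik == 0.0) {
                    continue;  // skip zero entries; web graphs are sparse at first
                }
                for (int j = 0; j < n; j++) {
                    out[i][j] += aik * b[k][j];
                }
            }
        }
        return out;
    }

    /** Largest absolute entry-wise difference between two matrices. */
    static double maxAbsDiff(double[][] a, double[][] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a.length; j++) {
                max = Math.max(max, Math.abs(a[i][j] - b[i][j]));
            }
        }
        return max;
    }

    /** Squares c until it stops changing (within tol), then returns the top row. */
    static double[] steadyState(double[][] c, double tol, int maxSquarings) {
        double[][] current = c;
        for (int k = 0; k < maxSquarings; k++) {
            double[][] squared = multiply(current, current);  // C^(2m) from C^m
            boolean converged = maxAbsDiff(squared, current) < tol;
            current = squared;
            if (converged) {
                break;
            }
        }
        return current[0].clone();
    }
}
```

The cap on the number of squarings is there so that a chain whose powers never settle down cannot loop forever.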
GraphReader.java is missing the following functionality to build Markov chains:
Read in the file.
Each line of the file describes the links out of one state. For example, the line "3: 57 206" tells us that state 3 has links to states 57 and 206.
Hint: you will need to read in all the lines first, to know how big the matrix needs to be. Then allocate the matrix. Then go through the lines one by one, filling in the matrix entries.
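The two-pass approach in the hint could look roughly like the sketch below. It assumes the lines have already been read into a list, that states are numbered from 1, and that the matrix size is the largest state number seen. It only records which links exist; turning those into transition probabilities depends on which walk is being defined. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parsing sketch; the real GraphReader.java may be organized differently.
final class ParseSketch {
    /** Parses lines like "3: 57 206" into a 0-based boolean adjacency matrix. */
    static boolean[][] parse(List<String> lines) {
        // First pass: parse every line and track the largest state number.
        int max = 0;
        List<int[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split(":", 2);
            int from = Integer.parseInt(parts[0].trim());
            String rest = parts.length > 1 ? parts[1].trim() : "";
            String[] targets = rest.isEmpty() ? new String[0] : rest.split("\\s+");
            int[] row = new int[targets.length + 1];
            row[0] = from;  // slot 0 holds the source state, the rest hold its targets
            max = Math.max(max, from);
            for (int k = 0; k < targets.length; k++) {
                row[k + 1] = Integer.parseInt(targets[k]);
                max = Math.max(max, row[k + 1]);
            }
            rows.add(row);
        }
        // Second pass: now that the size is known, allocate and fill the matrix.
        boolean[][] adj = new boolean[max][max];
        for (int[] row : rows) {
            for (int k = 1; k < row.length; k++) {
                adj[row[0] - 1][row[k] - 1] = true;  // assumes 1-based state numbers in the file
            }
        }
        return adj;
    }
}
```

For example, the line "3: 57 206" sets adj[2][56] and adj[2][205]. If the files turn out to number states from 0, the index shift would need to be dropped.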
Once GraphReader can construct Markov chains, you'll need to code up the following functionality in MarkovChain.java:
Find the steady state probabilities, both by performing the walk and by computing powers of the transition matrix. (You need to code both.)
Print out the steady state probabilities.
The test*.txt files contain little graphs for testing your code. (They may look strangely similar to diagrams from the book, although our transition probabilities will be different). The file test-outputs-desired.txt shows the correct results for the first couple cases. Convince yourself the code works on the test graphs, and in your write-up give the steady state probabilities from each of them.
The following should go into your write-up (no MS-Word files, please. Text, rtf, or pdf are all fine):
Pick one of the schools (any one of the four), and run all 4 combinations of random walks / steady states for it: directed/Monte Carlo simulation, directed/analytical, undirected/Monte Carlo, undirected/analytical.
Verify that the simulated and analytical results match.
Take the top 10 most frequently visited states, and look them up in the file page-attributes.txt describing what webpages they are. (The website data was collected in 1997, so the pages are unlikely to be live. The spreadsheet page-attributes.txt first lists the Cornell pages 1-861, then the Texas pages 1-825, etc.) (You may need to dump your output into Excel to sort the pages by steady state freq.)
Are these top 10 pages those with the highest degrees (for undirected walk)? Do the pages seem important? (Ok if this doesn't turn out to be so; but it's worth checking.)
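If you would rather do the sorting and the degree comparison in code than in a spreadsheet, something like the following sketch could work. It assumes 0-based array indices (so add 1 to get a page number, if the file numbers states from 1) and an undirected boolean adjacency matrix; the names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helpers for the write-up questions; not required by the assignment.
final class TopStatesSketch {
    /** Returns the indices of the k largest entries of score, highest first. */
    static int[] topK(double[] score, int k) {
        Integer[] order = new Integer[score.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort indices by descending score; ties keep their original index order.
        Arrays.sort(order, (a, b) -> Double.compare(score[b], score[a]));
        int[] top = new int[Math.min(k, order.length)];
        for (int i = 0; i < top.length; i++) {
            top[i] = order[i];
        }
        return top;
    }

    /** Degree of each node: the number of neighbors in its row of the adjacency matrix. */
    static double[] degrees(boolean[][] adj) {
        double[] deg = new double[adj.length];
        for (int i = 0; i < adj.length; i++) {
            for (int j = 0; j < adj.length; j++) {
                if (adj[i][j]) {
                    deg[i]++;
                }
            }
        }
        return deg;
    }
}
```

Calling `topK(steadyState, 10)` and `topK(degrees(adj), 10)` and comparing the two index lists is one way to answer the degree question for the undirected walk.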
The java files needed for this assignment are in
a directory called
See the edlab main page if you don't yet know how to access this directory.
Submit your completed java files along with your writeup to the cs240 folder of your own home directory by 11:59pm on Friday, December 11, 2009.
Last modified 25 November 2009.