CMPSCI 240 Project #4 Assignment

CMPSCI 240

Programming Project #4: Markov chains

Due Tuesday, May 12, 2009

Overview

In this project, you will write a program that reads in a graph from a file, defines a Markov chain from it, and computes the steady state of the chain.

The graphs you'll be provided are fragments of the world wide web--crawls of 4 different websites. Each graph represents pages and their hyperlinks. You will simulate a random web surfer--someone half-asleep, zoned out, randomly clicking links on the internet. Fortunately, no need to physically impersonate them; rather, we'll define a Markov chain--actually, two different Markov chains--describing "random walks" on the web. Then you'll compute the steady-state probability of each walk--i.e., the proportion of time the web surfer would spend at each page if they kept at it for days.

Background

We'll define two different ways of treating the web graph as a Markov chain.

PageRank: Directed walk

First, read these slides (excerpted from a tutorial here) for the intuition behind PageRank.

Then, read this technical description of PageRank. This is exactly what we want to do. To summarize:

Say we're at a page that has N outlinks. If there were no teleporting, we'd transition to any outlink with probability 1/N.
Since there is teleporting, there are M = total - N other (non-linked) pages where we might land (including the current page). The total probability of teleporting is alpha, so the probability of teleporting to any one of the M pages is alpha / M.
Since we teleport alpha of the time, we only follow outlinks (1 - alpha) of the time. Transition probability to any outlink is (1 - alpha) / N.
If there are no outlinks, we always teleport, so it's a transition probability of 1 / total to any page.
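The four rules above can be sketched in code as follows. This is only a sketch under some assumptions: the class and method names are made up, pages are numbered from 0 (the data files appear to number them from 1), repeated links count once, and a page that links to every page (M = 0, a case the rules above don't cover) simply follows its outlinks uniformly.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any provided code.
class DirectedWalkSketch {
    // outlinks.get(i) holds the 0-based indices of the pages that page i links to.
    static double[][] buildDirected(List<Set<Integer>> outlinks, double alpha) {
        int total = outlinks.size();
        double[][] c = new double[total][total];
        for (int i = 0; i < total; i++) {
            Set<Integer> links = outlinks.get(i);
            int n = links.size();   // N: number of outlinks
            int m = total - n;      // M: non-linked pages, including page i itself
            for (int j = 0; j < total; j++) {
                if (n == 0) {
                    c[i][j] = 1.0 / total;        // no outlinks: always teleport
                } else if (m == 0) {
                    c[i][j] = 1.0 / n;            // assumption: nowhere to teleport to
                } else if (links.contains(j)) {
                    c[i][j] = (1.0 - alpha) / n;  // follow an outlink
                } else {
                    c[i][j] = alpha / m;          // teleport to a non-linked page
                }
            }
        }
        return c;
    }
}
```

Each row should sum to 1: N entries of (1 - alpha) / N plus M entries of alpha / M. Checking that is a cheap sanity test for your own version.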

If all goes well, the PageRank steady state probabilities will in some way represent the importance of the pages.

Undirected walk

In this version, it's easier not to think about it as the web. Just think of it as an arbitrary graph, because now we will treat the links as undirected, or symmetric. That is to say, if we can transition from node A to node B, we can also transition from node B to node A. (Important: in the data file, the links are only given in one direction. So if the file says there's an edge A -> B, you need to also put in the edge B -> A.)

The walk we'll do on this undirected graph:

Start at a page and randomly pick one of the neighbors (with equal probabilities).
To avoid the possibility that the Markov chain could be periodic--e.g., forcing the walker to go back and forth between two pages forever--we'll add a small probability of staying in the same place for a step.
Let beta = the probability of staying put, i.e., the transition probability from the current state back to itself. Then (1 - beta) / N is the transition probability to each of your N neighbors. (In the graphs provided, every node has at least one neighbor; the graph is connected.)

If all goes well, the steady state probabilities ought to be proportional to the degrees of the nodes.
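Here is one way the undirected transition matrix might be built. It is a sketch under assumptions, not the required design: the names are made up, nodes are numbered from 0, each edge is given once as a pair {a, b}, duplicate edges count once, and a link from a node to itself is ignored (staying put is already handled by beta).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of any provided code.
class UndirectedWalkSketch {
    static double[][] buildUndirected(int total, int[][] edges, double beta) {
        List<Set<Integer>> neighbors = new ArrayList<>();
        for (int i = 0; i < total; i++) neighbors.add(new TreeSet<Integer>());
        for (int[] e : edges) {
            if (e[0] == e[1]) continue;      // assumption: ignore self-links
            neighbors.get(e[0]).add(e[1]);   // edge A -> B as given in the file
            neighbors.get(e[1]).add(e[0]);   // plus the reverse edge B -> A
        }
        double[][] c = new double[total][total];
        for (int i = 0; i < total; i++) {
            int n = neighbors.get(i).size();
            if (n == 0) {                    // should not occur in the provided graphs
                c[i][i] = 1.0;
                continue;
            }
            c[i][i] = beta;                  // stay put
            for (int j : neighbors.get(i)) c[i][j] = (1.0 - beta) / n;
        }
        return c;
    }
}
```

For a 3-node path 0 - 1 - 2, the degrees are 1, 2, 1, so the steady state should come out proportional to that: (1/4, 1/2, 1/4).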

Markov chain steady states

(Reviewing from class and the textbook . . .)

A Markov chain can be represented by a stochastic transition matrix; let's call it C. Its (i, j) entry (ith row, jth column) represents the probability of transitioning from state i to state j.

Then, the current state of a walker can be represented as a row vector; call it v. E.g., if the vector is (1, 0, 0), they are in state 1 (of a 3-state chain). Or they can probabilistically be in any of multiple states--e.g., (1/3, 1/3, 1/3) gives equal probability to each. The next state of a walker is a new vector obtained by computing vC = v'. (For example, if the initial state is (1, 0, 0), then multiplying vC just picks off the top row of C. This should make sense: if you're coming from state 1, the top row represents where you might go. If you're coming from a mixture of states, then the calculation combines where you might go from each of them.) The new state after two transitions is then v'C = vC². After n transitions it's vCⁿ.
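To make the v' = vC step concrete, here is a minimal sketch of one transition. The class name, method name, and any matrix you feed it are up to you; nothing here comes from the provided files.

```java
// Hypothetical helper, not part of any provided code.
class StepSketch {
    // Returns the row vector v' = vC: next[j] = sum over i of v[i] * c[i][j].
    static double[] step(double[] v, double[][] c) {
        int n = v.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            if (v[i] == 0.0) continue;  // nothing arrives from a state we can't be in
            for (int j = 0; j < n; j++) {
                next[j] += v[i] * c[i][j];
            }
        }
        return next;
    }
}
```

With v = (1, 0, 0) the result is just a copy of the top row of C, as described above. Calling step repeatedly gives vC², vC³, and so on.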

After many many transitions, the walker may reach a steady state. This means that given the current distribution over states, after one more transition, the next distribution over states is the same. If that steady state vector were (1/4, 1/2, 1/4), that would mean that over the long term, after we've been walking around for a while, we'll be in state 2 about 1/2 the time and the others each 1/4 of the time.

The amazing thing is that . . . well, two amazing things:

The steady state distribution, if it exists, is independent of the starting place. (Which makes sense when you think about it; you've been bouncing around a long time, long enough to forget where you started.)
Many Markov chains have such a unique steady state. Every aperiodic, irreducible chain does. (No need to test for those properties here; these chains will be aperiodic and irreducible.)

We'll find the steady state two ways.

First, by Monte Carlo simulation--i.e., actually performing the random walk. Start out somewhere (anywhere), walk around according to the transition matrix for a long time, and keep track of the fraction of time spent in each state. (How to know when it's been long enough? When that vector of state frequencies converges. You could check it once every hundred thousand steps or so.)
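A rough sketch of the simulation might look like this. The names, the choice to start in state 0, and the maxSteps safety cap are all assumptions made for illustration; the stopping rule is the one suggested above (compare the frequency vector against the one from the previous check).

```java
import java.util.Random;

// Hypothetical helper, not part of any provided code.
class MonteCarloSketch {
    // Picks the next state by walking the cumulative sums of one row of the matrix.
    static int nextState(double[] row, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        int lastPositive = 0;
        for (int j = 0; j < row.length; j++) {
            if (row[j] <= 0.0) continue;
            cumulative += row[j];
            lastPositive = j;
            if (u < cumulative) return j;
        }
        return lastPositive;  // only reached if rounding leaves the sum just below 1
    }

    // Walks until the visit frequencies change by at most eps between checks,
    // or until maxSteps steps have been taken.
    static double[] simulate(double[][] c, int checkEvery, double eps,
                             long maxSteps, Random rng) {
        int n = c.length;
        long[] visits = new long[n];
        double[] prev = new double[n];
        int state = 0;  // assumption: start in state 0; any start should do
        long steps = 0;
        while (steps < maxSteps) {
            for (int k = 0; k < checkEvery; k++) {
                state = nextState(c[state], rng);
                visits[state]++;
            }
            steps += checkEvery;
            double[] freq = new double[n];
            double maxChange = 0.0;
            for (int j = 0; j < n; j++) {
                freq[j] = (double) visits[j] / steps;
                maxChange = Math.max(maxChange, Math.abs(freq[j] - prev[j]));
            }
            prev = freq;
            if (maxChange <= eps) break;
        }
        return prev;
    }
}
```

Passing in a seeded Random (e.g., new Random(42)) makes runs repeatable, which helps when debugging. Keep in mind the result is an estimate, so expect it to match the analytical answer only approximately.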

Second, analytically. To find the steady state, we'll compute Cⁿ as n increases, until it converges, i.e. Cⁿ⁺¹ = Cⁿ. In that matrix, every row will be equal. Then as said above, the steady state, s = vCⁿ, will be the same for every v. Take v = (1, 0, 0, . . .) for example. Then s = vCⁿ will be just the top row of Cⁿ. (And the same holds regardless of v.) So, compute Cⁿ until convergence, and return the top row. Note that in multiplying the matrix by itself, it will be much more efficient to compute CC = C², then C²C² = C⁴, and so on, doing powers of two.
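The powers-of-two trick can be sketched like this. To keep the example short it squares a fixed number of times instead of testing for convergence, which is an assumption of this sketch and not what the assignment asks for; the names are made up as well.

```java
// Hypothetical helper, not part of any provided code.
class SquaringSketch {
    // Returns the matrix product a * a.
    static double[][] square(double[][] a) {
        int n = a.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i][k];
                if (aik == 0.0) continue;  // skip work for zero entries
                for (int j = 0; j < n; j++) {
                    out[i][j] += aik * a[k][j];
                }
            }
        }
        return out;
    }

    // Squares c the given number of times (yielding C^(2^squarings))
    // and returns a copy of the top row.
    static double[] topRowAfterSquarings(double[][] c, int squarings) {
        double[][] p = c;
        for (int s = 0; s < squarings; s++) {
            p = square(p);
        }
        return p[0].clone();
    }
}
```

Ten squarings already give C¹⁰²⁴, which would take 1023 multiplications the slow way. In your own version, replace the fixed count with a loop that stops once squaring no longer changes the matrix.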

What do you mean, "converge"?

Mathematically, there might always be a difference between Cⁿ and Cⁿ⁺¹. Computationally, things can go wrong with floating point arithmetic when dealing with very small numbers; when programming with them, it's a bad idea to test for strict equality. Let's say, for the sake of this assignment, that the matrix (or array) has converged when no entry changes more than some small amount epsilon, and that a reasonable epsilon here is .001 / N, where N is the number of states.
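In code, that test might look something like the sketch below. The class and method names are invented for illustration; the only things taken from the text are the rule itself and epsilon = .001 / N.

```java
// Hypothetical helper, not part of any provided code.
class ConvergenceSketch {
    // The epsilon suggested above: .001 divided by the number of states.
    static double epsilonFor(int numStates) {
        return 0.001 / numStates;
    }

    // True when no entry differs by more than eps between the two matrices.
    static boolean converged(double[][] prev, double[][] next, double eps) {
        for (int i = 0; i < prev.length; i++) {
            for (int j = 0; j < prev[i].length; j++) {
                if (Math.abs(next[i][j] - prev[i][j]) > eps) return false;
            }
        }
        return true;
    }
}
```

The same idea works for the Monte Carlo frequency vector: treat it as a one-row matrix, or write a one-dimensional variant.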

Assignment

Read in the file.

Each line of the file describes the links out of one state. For example, the line "3: 57 206" tells us that state 3 has links to states 57 and 206.
Convert it to a stochastic matrix, for either the directed or the undirected walk. (You need to code both.)
Hint: you will need to read in all the lines first, to know how big the matrix needs to be. Then allocate the matrix. Then go through the lines one by one, filling in the matrix entries.
Find the steady state probabilities, either by performing the walk or by computing powers of the transition matrix. (You need to code both.)
Print out the steady state probabilities.
Take the top 10 most frequently visited states, and look them up in the file page-attributes.txt describing what webpages they are. (The website data was collected in 1997, so the pages are unlikely to be live. The spreadsheet page-attributes.txt first lists the Cornell pages 1-861, then the Texas pages 1-825, etc.) (You may need to dump your output into Excel to sort the pages by steady-state frequency.)
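For the first step, parsing lines like "3: 57 206" might go roughly as follows. This sketch assumes state ids are positive integers, that a state with no links may appear as just "57:", and that blank lines or lines without a colon can be skipped; check the actual data files before relying on any of that. The names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of any provided code.
class GraphFileSketch {
    // Maps each source state to the list of states it links to.
    static Map<Integer, List<Integer>> parse(List<String> lines) {
        Map<Integer, List<Integer>> links = new TreeMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || colon < 0) continue;  // assumption: skip such lines
            int from = Integer.parseInt(line.substring(0, colon).trim());
            List<Integer> targets = links.computeIfAbsent(from, k -> new ArrayList<>());
            String rest = line.substring(colon + 1).trim();
            if (rest.isEmpty()) continue;               // state with no outlinks
            for (String token : rest.split("\\s+")) {
                targets.add(Integer.parseInt(token));
            }
        }
        return links;
    }

    // Largest state id seen anywhere; tells you how big the matrix must be.
    static int maxState(Map<Integer, List<Integer>> links) {
        int max = 0;
        for (Map.Entry<Integer, List<Integer>> e : links.entrySet()) {
            max = Math.max(max, e.getKey());
            for (int t : e.getValue()) max = Math.max(max, t);
        }
        return max;
    }
}
```

The lines themselves can be read with java.nio.file.Files.readAllLines. Note that maxState looks at link targets too, since a page might be linked to without having a line of its own.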

Are the pages those with the highest degrees (for undirected walk)? Do the pages seem important? (Ok if this doesn't turn out to be so; but it's worth checking.)
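If you'd rather not go through Excel, picking out the top states can also be done in a few lines of Java. This is just one possible sketch with made-up names; it returns 0-based array indices, so add 1 if your states are numbered from 1 as in the data files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any provided code.
class TopStatesSketch {
    // Returns the indices of the k largest entries, highest probability first.
    static List<Integer> topStates(double[] steady, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < steady.length; i++) indices.add(i);
        // Sort indices by descending probability; ties keep their index order.
        indices.sort((a, b) -> Double.compare(steady[b], steady[a]));
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }
}
```

For the undirected walk, running the same method on an array of node degrees gives the top-degree nodes, which makes the comparison asked for above easy.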

The test*.txt files contain little graphs for testing your code. (They may look strangely similar to diagrams from the book, although our transition probabilities will be different.) The file test-outputs-desired.txt shows the correct results for the first couple of cases. Convince yourself the code works on the test graphs, and attach the steady state probabilities from each of them.

Then pick one of the schools (any one of the four), and run all 4 combinations of random walks / steady states for it: directed/Monte Carlo simulation, directed/analytical, undirected/Monte Carlo, undirected/analytical. Verify that the simulated and analytical results match. Examine the top 10 as described above and say what you find.

Logistics

The files needed for this assignment are in a folder called proj4 in the cs240 directory of the edlab machines. To get to it, use this link to the ftp server, or see the edlab main page for other ways to connect.

Submit your completed, commented Java files along with your writeup to the cs240 folder of your own home directory by midnight on Tuesday, May 12, 2009.