In this project, you will write a program that reads in a graph from a file, defines a Markov chain from it, and computes the steady state of the chain.

The graphs you'll be provided are fragments of the world wide web--crawls of 4 different websites. Each graph represents pages and their hyperlinks. You will simulate a random web surfer--someone half-asleep, zoned out, randomly clicking links on the internet. Fortunately, no need to physically impersonate them; rather, we'll define a Markov chain--actually, two different Markov chains--describing "random walks" on the web. Then you'll compute the steady-state probability of each walk--i.e., the proportion of time the web surfer would spend at each page if they kept at it for days.

Then, read this technical description of PageRank. This is exactly what we want to do. To summarize:

- Say we're at a page that has N outlinks. If there were no teleporting, we'd transition to any outlink with probability `1/N`.
- Since there is teleporting, there are `total` - N = M other (non-linked) pages where we might land (including the current page). The total probability of teleporting is `alpha`. The probability of teleporting to any one of the M pages is then `alpha / M`.
- Since we teleport `alpha` of the time, we only follow outlinks `(1 - alpha)` of the time. The transition probability to any outlink is `(1 - alpha) / N`.
- If there are no outlinks, we always teleport, so the transition probability is `1 / total` to any page.
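
The transition rules above could be turned into a matrix roughly as follows. This is only a sketch: the class name `DirectedWalk`, the method name, and the `boolean[][] links` representation (where `links[i][j]` means page i links to page j, with 0-based indices) are assumptions made for illustration. The case M = 0 (a page that links to every page) is not covered by the rules above; the sketch assumes such a page simply follows its outlinks uniformly.

```java
final class DirectedWalk {
    /**
     * Builds the transition matrix for the directed walk with teleporting.
     * links[i][j] is true when page i links to page j (hypothetical layout).
     */
    static double[][] transitionMatrix(boolean[][] links, double alpha) {
        int total = links.length;
        double[][] c = new double[total][total];
        for (int i = 0; i < total; i++) {
            int n = 0; // number of outlinks of page i
            for (int j = 0; j < total; j++) {
                if (links[i][j]) n++;
            }
            int m = total - n; // number of non-linked pages
            for (int j = 0; j < total; j++) {
                if (n == 0) {
                    c[i][j] = 1.0 / total;       // no outlinks: always teleport
                } else if (m == 0) {
                    c[i][j] = 1.0 / n;           // assumption: nowhere to teleport
                } else if (links[i][j]) {
                    c[i][j] = (1 - alpha) / n;   // follow an outlink
                } else {
                    c[i][j] = alpha / m;         // teleport to a non-linked page
                }
            }
        }
        return c;
    }
}
```

Each row of the result should sum to 1, which is a quick sanity check worth running on the real data.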

In the undirected version, it's easier not to think of it as the web; just think of it as an arbitrary graph, because now we will treat the links as undirected, or symmetric. That is to say, if we can transition from node A to node B, we can also transition from node B to node A. (Important: in the data file, the links are only given in one direction. So if the file says there's an edge A -> B, you need to also put in the edge B -> A.)

The walk we'll do on this undirected graph:

- Start at a page and randomly pick one of the neighbors (with equal probabilities).
- To avoid the possibility that the Markov chain could be periodic--e.g., forcing the walker to go back and forth between two pages forever--we'll add a small probability of staying in the same place for a step.
- Let `beta` = probability of staying put. That's the probability of the transition being just to the current state. Then `(1 - beta) / N` is the transition probability for each of your N neighbors. (In the graphs provided, every node has at least one neighbor; the graph is *connected*.)
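
A minimal sketch of the undirected walk's matrix, under some assumptions: the class and method names are hypothetical, `links[i][j]` (0-based) holds the one-directional edges as read from the file, and self-links are ignored so that a node is never counted as its own neighbor. A node with no neighbors should not occur in the provided graphs; the sketch just makes it stay put.

```java
final class UndirectedWalk {
    /**
     * Builds the transition matrix for the undirected walk.
     * links[i][j] is true when the file lists an edge i -> j (hypothetical layout).
     */
    static double[][] transitionMatrix(boolean[][] links, double beta) {
        int total = links.length;
        // Symmetrize: an edge A -> B in the file also implies B -> A.
        boolean[][] adj = new boolean[total][total];
        for (int i = 0; i < total; i++) {
            for (int j = 0; j < total; j++) {
                if (links[i][j] && i != j) { // assumption: drop self-links
                    adj[i][j] = true;
                    adj[j][i] = true;
                }
            }
        }
        double[][] c = new double[total][total];
        for (int i = 0; i < total; i++) {
            int n = 0; // number of neighbors of node i
            for (int j = 0; j < total; j++) {
                if (adj[i][j]) n++;
            }
            if (n == 0) {
                c[i][i] = 1.0; // assumption: an isolated node never moves
                continue;
            }
            c[i][i] = beta; // stay put
            for (int j = 0; j < total; j++) {
                if (adj[i][j]) c[i][j] = (1 - beta) / n;
            }
        }
        return c;
    }
}
```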

(Reviewing from class and the textbook . . .)

A Markov chain can be represented by a stochastic transition matrix; let's call it C. Its (i, j) entry (ith row, jth column) represents the probability of transitioning from state i to state j.

Then, the current state of a walker can
be represented as a row vector; call it **v**. E.g.,
if the vector is (1, 0, 0),
they are in
state 1 (of a 3-state chain).
Or they can
probabilistically be in any of multiple states--e.g., (1/3, 1/3, 1/3) gives equal probability
to each. The next state of a walker is a new vector
obtained by computing **v**C = **v'**. (For example, if the initial
state is (1, 0, 0), then multiplying **v**C just picks off the top row of C.
This should make sense: if you're coming from state 1, the top row represents where
you might go. If you're coming from a mixture of states, then the calculation
combines where you might go from each of them.)
The new state after two
transitions is then **v'**C = **v**C^{2}. After `n`
transitions it's **v**C^{n}.
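
One transition, **v**C = **v'**, is just a row vector times a matrix. The sketch below shows it with plain arrays; the class name `MarkovStep` and method name `step` are made up for illustration.

```java
final class MarkovStep {
    /** Returns v' = vC, where v is a row vector of state probabilities. */
    static double[] step(double[] v, double[][] c) {
        int n = c.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            if (v[i] == 0.0) continue; // state i contributes nothing
            for (int j = 0; j < n; j++) {
                next[j] += v[i] * c[i][j]; // mass at i flowing to j
            }
        }
        return next;
    }
}
```

Calling `step` with **v** = (1, 0, 0) returns a copy of the top row of C, and calling it `n` times in a row gives **v**C^{n}.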

After many many transitions, the walker may reach a steady state. This means that given the current distribution over states, after one more transition, the next distribution over states is the same. If that steady state vector were (1/4, 1/2, 1/4), that would mean that over the long term, after we've been walking around for a while, we'll be in state 2 about 1/2 the time and the others each 1/4 of the time.

The amazing thing is that . . . well, two amazing things:

- The steady state distribution, if it exists, is independent of the starting place. (Which makes sense when you think about it; you've been bouncing around a long time, long enough to forget where you started.)
- Many Markov chains have such a unique steady state. Every aperiodic, irreducible chain does. (No need to test for those properties here; these chains will be aperiodic and irreducible.)

We'll find the steady state two ways.

First, by Monte Carlo simulation--i.e., actually performing the random walk. Start out somewhere (anywhere), walk around according to the transition matrix for a long time, and keep track of the fraction of time spent in each state. (How to know when it's been long enough? When that vector of state frequencies converges. You could check it once every hundred thousand steps or so.)
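
A rough sketch of the simulation, with a convergence check every `checkEvery` steps. All names and parameters here (`MonteCarloWalk`, `simulate`, `checkEvery`, `tol`, `maxSteps`, `seed`) are assumptions for illustration, and "converged" is taken to mean that no state's frequency moved by more than `tol` since the previous check.

```java
import java.util.Random;

final class MonteCarloWalk {
    /** Walks the chain and returns the fraction of steps spent in each state. */
    static double[] simulate(double[][] c, long checkEvery, double tol,
                             long maxSteps, long seed) {
        int n = c.length;
        long[] visits = new long[n];
        double[] previous = new double[n];
        Random rng = new Random(seed);
        int state = 0; // start anywhere; state 0 is an arbitrary choice
        long steps = 0;
        while (steps < maxSteps) {
            for (long t = 0; t < checkEvery; t++) {
                visits[state]++;
                state = nextState(c[state], rng.nextDouble());
            }
            steps += checkEvery;
            double[] current = new double[n];
            double change = 0.0;
            for (int i = 0; i < n; i++) {
                current[i] = (double) visits[i] / steps;
                change = Math.max(change, Math.abs(current[i] - previous[i]));
            }
            previous = current;
            if (change < tol) break; // frequencies have stopped moving
        }
        return previous;
    }

    /** Picks the next state from one row of C, given u uniform in [0, 1). */
    static int nextState(double[] row, double u) {
        double cumulative = 0.0;
        for (int j = 0; j < row.length; j++) {
            cumulative += row[j];
            if (u < cumulative) return j;
        }
        // Rounding guard: fall back to the last state with nonzero probability.
        for (int j = row.length - 1; j >= 0; j--) {
            if (row[j] > 0.0) return j;
        }
        return row.length - 1;
    }
}
```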

Second, analytically. To find the steady state, we'll compute C^{n} as `n` increases, until it converges, i.e., C^{n+1} = C^{n}. In that matrix, every row will be equal. Then, as said above, the steady state **s** = **v**C^{n} will be the same for every **v**. Take **v** = (1, 0, 0, . . .), for example. Then **s** = **v**C^{n} will be just the top row of C^{n}. (And the same holds regardless of **v**.) So, compute C^{n} until convergence, and return the top row. Note that in multiplying the matrix by itself, it will be much more efficient to compute CC = C^{2}, then C^{2}C^{2} = C^{4}, and so on, doing powers of two.
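
The repeated-squaring idea might look like the sketch below. Two assumptions are worth flagging: the names (`SteadyStateByPowers`, `steadyState`, `tol`, `maxSquarings`) are invented here, and since floating-point matrices rarely become exactly equal, the check is that C^{2n} and C^{n} differ by less than `tol` in every entry, rather than strict equality of C^{n+1} and C^{n}.

```java
final class SteadyStateByPowers {
    /** Squares C repeatedly until it stops changing, then returns the top row. */
    static double[] steadyState(double[][] c, double tol, int maxSquarings) {
        double[][] power = c;
        for (int k = 0; k < maxSquarings; k++) {
            double[][] squared = multiply(power, power); // C^n -> C^(2n)
            if (maxAbsDiff(squared, power) < tol) {
                return squared[0].clone(); // every row is (nearly) the same
            }
            power = squared;
        }
        throw new IllegalStateException("matrix powers did not converge");
    }

    /** Plain O(n^3) product of two square matrices. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i][k];
                if (aik == 0.0) continue;
                for (int j = 0; j < n; j++) {
                    out[i][j] += aik * b[k][j];
                }
            }
        }
        return out;
    }

    /** Largest entrywise absolute difference between two matrices. */
    static double maxAbsDiff(double[][] a, double[][] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                max = Math.max(max, Math.abs(a[i][j] - b[i][j]));
            }
        }
        return max;
    }
}
```

For a graph with several hundred pages, each squaring is a few hundred million multiply-adds, so only a handful of squarings should be needed before the rows agree.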

Read in the file.

Each line of the file describes the links out of one state. For example, the line "3: 57 206" tells us that state 3 has links to states 57 and 206.
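
Here is one possible way to read that format. It is a sketch under several assumptions not stated above: states are numbered starting from 1, each line is a state number followed by a colon and whitespace-separated targets, and the largest number seen anywhere determines the matrix size. The names `GraphReader`, `readFile`, and `parse` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

final class GraphReader {
    /** Reads a graph file and returns links[i][j] with 0-based indices. */
    static boolean[][] readFile(String path) throws IOException {
        return parse(Files.readAllLines(Paths.get(path)));
    }

    /** Parses lines such as "3: 57 206" (state 3 links to states 57 and 206). */
    static boolean[][] parse(List<String> lines) {
        // First pass: find the largest state number to size the matrix.
        int maxId = 0;
        for (String line : lines) {
            for (String token : tokens(line)) {
                maxId = Math.max(maxId, Integer.parseInt(token));
            }
        }
        // Second pass: fill in the links, shifting 1-based ids to 0-based.
        boolean[][] links = new boolean[maxId][maxId];
        for (String line : lines) {
            String[] t = tokens(line);
            if (t.length == 0) continue; // skip blank lines
            int from = Integer.parseInt(t[0]) - 1;
            for (int k = 1; k < t.length; k++) {
                links[from][Integer.parseInt(t[k]) - 1] = true;
            }
        }
        return links;
    }

    /** Splits a line into number tokens, treating the colon as whitespace. */
    static String[] tokens(String line) {
        String cleaned = line.replace(':', ' ').trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }
}
```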

- Convert it to a stochastic matrix, for either the directed or the undirected walk. (You need to code both.)
Hint: you will need to read in all the lines first, to know how big the matrix needs to be. Then allocate the matrix. Then go through the lines one by one, filling in the matrix entries.

- Find the steady state probabilities, either by performing the walk or by computing powers of the transition matrix. (You need to code both.)
- Print out the steady state probabilities.
- Take the top 10 most frequently visited states, and look them up in the file `page-attributes.txt` describing what webpages they are. (The website data was collected in 1997, so the pages are unlikely to be live. The spreadsheet `page-attributes.txt` first lists the Cornell pages 1-861, then the Texas pages 1-825, etc.) (You may need to dump your output into Excel to sort the pages by steady-state frequency.) Are the pages those with the highest degrees (for the undirected walk)? Do the pages seem important? (It's OK if this doesn't turn out to be so, but it's worth checking.)
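
If you would rather sort inside the program than in a spreadsheet, something like the following sketch could pick out the top states. The names `TopStates` and `topK` are hypothetical, and the returned indices are 0-based array positions, so add 1 if your state numbers start at 1.

```java
import java.util.Arrays;

final class TopStates {
    /** Returns the indices of the k largest probabilities, largest first. */
    static int[] topK(double[] probs, int k) {
        Integer[] order = new Integer[probs.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort indices by descending probability; ties keep the lower index first.
        Arrays.sort(order, (a, b) -> Double.compare(probs[b], probs[a]));
        int[] top = new int[Math.min(k, order.length)];
        for (int r = 0; r < top.length; r++) {
            top[r] = order[r];
        }
        return top;
    }
}
```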

Then pick one of the schools (any one of the four), and run all 4 combinations of random walks / steady states for it: directed/Monte Carlo simulation, directed/analytical, undirected/Monte Carlo, undirected/analytical. Verify that the simulated and analytical results match. Examine the top 10 as described above and say what you find.
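
To check that the simulated and analytical results match, one option is to compare the two vectors entry by entry against a tolerance. This is only a sketch; the names and the choice of tolerance are assumptions, and the simulated values will never agree exactly because of sampling noise.

```java
final class CompareSteadyStates {
    /** Largest absolute difference between corresponding entries. */
    static double maxAbsDiff(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors differ in length");
        }
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    /** True when every entry of the two vectors agrees to within tol. */
    static boolean agree(double[] simulated, double[] analytical, double tol) {
        return maxAbsDiff(simulated, analytical) < tol;
    }
}
```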

The data files are in `proj4` in the `cs240` directory of the edlab machines. To get to it, use this link to the ftp server, or see the edlab main page for other ways to connect.
Submit your completed, commented Java files along with your writeup to the `cs240` folder of your own home directory by midnight on Tuesday, May 12, 2009.