Week 9: More on graphs; search and representation

Recall

To search a graph, we need to be able to represent it.

We need to be able to add and query lists (or sets) of vertices and edges. Of course, edges are just links between two vertices, so we needn’t have a separate data type. What might this look like in the simple case, where we don’t worry about annotations? Something like:

public interface UndirectedGraph<V> {
  void addVertex(V v);
  boolean hasVertex(V v);
  Set<V> vertices();

  void addEdge(V u, V v);
  boolean hasEdge(V u, V v);  
  Set<V> neighborsOf(V v);
}

The idea behind searching a graph is that we want to systematically examine it, starting at one point, looking for a path to another point. We do so by keeping track of a list of places to be explored (the “frontier”).

We start by marking the start location as seen and adding it to the frontier.

We then repeat the following steps until the frontier is empty or our goal is found:

1. Remove the next vertex from the frontier; call it the current vertex.
2. If the current vertex is the goal, stop: there is a path.
3. Otherwise, mark each not-yet-seen neighbor of the current vertex as seen, and add it to the frontier.

If the frontier empties without our finding the goal, there is no path.

Coding the search algorithm

What might this look like in code?

static <V> boolean isPath(UndirectedGraph<V> graph, V start, V goal) {
  Queue<V> frontier = new LinkedList<>();
  frontier.add(start);

  Set<V> seen = new HashSet<>();
  seen.add(start);

  while (!frontier.isEmpty()) {
    V current = frontier.remove();
    if (current.equals(goal)) return true;
    for (V next : graph.neighborsOf(current)) {
      // note: could put check for goal here instead;
      // if so, we need to be careful of the case where start == goal
      // we'd have to check for that outside the loop
      if (!seen.contains(next)) {
        frontier.add(next);
        seen.add(next);
      }
    }
  }
  return false;
}

Note I used a Queue here; its first-in, first-out behavior enforces a breadth-first search. (on board) Queues are lists, but you can only add on one end, and only remove from the other; like waiting in line at Disney or some such. (You could also use a List if you wanted to, but how you add and remove vertices from the frontier controls how the search runs.)

Using a Queue, this search will visit all vertices adjacent to the start (that is, one hop away from the start) before it visits their neighbors (two hops away from the start), and so on, like ripples in a pond. This is called a “breadth-first” search.
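To make this concrete, here is a small self-contained sketch of the same search loop, using a plain adjacency-list map in place of the UndirectedGraph interface and a made-up four-vertex graph, that records the order in which vertices are visited:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class BFSOrderDemo {
  // Same loop as isPath, but recording the order vertices are visited
  static List<String> bfsOrder(Map<String, List<String>> graph, String start) {
    Queue<String> frontier = new LinkedList<>();
    Set<String> seen = new HashSet<>();
    List<String> visitOrder = new ArrayList<>();

    frontier.add(start);
    seen.add(start);
    while (!frontier.isEmpty()) {
      String current = frontier.remove();
      visitOrder.add(current);
      for (String next : graph.get(current)) {
        if (!seen.contains(next)) {
          frontier.add(next);
          seen.add(next);
        }
      }
    }
    return visitOrder;
  }

  public static void main(String[] args) {
    // A made-up graph: A-B, A-C, B-D
    Map<String, List<String>> graph = new HashMap<>();
    graph.put("A", List.of("B", "C"));
    graph.put("B", List.of("A", "D"));
    graph.put("C", List.of("A"));
    graph.put("D", List.of("B"));

    // Both one-hop vertices (B, C) are visited before the two-hop vertex D:
    System.out.println(bfsOrder(graph, "A")); // [A, B, C, D]
  }
}
```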

Depending upon the order in which vertices are returned from our frontier, the search will progress in different ways; most notably, we get a depth-first search when the frontier is a stack – last in, first out. You’ll see this in more detail in 187.

Finding the path

isPath doesn’t actually find the path, it just checks to see if there is one.

One way to find the path is to change seen slightly.

Instead of keeping track only of whether or not a vertex has been seen, we can keep track of where we “came from” to get to that vertex. In other words, we can track the “predecessor” of that vertex. (on board)

Here’s the updated code:

static <V> List<V> findPath(UndirectedGraph<V> graph, V start, V goal) {
  Queue<V> frontier = new LinkedList<>();
  frontier.add(start);


  Map<V, V> predecessor = new HashMap<>();
  predecessor.put(start, null);

  List<V> path = new ArrayList<>();

  while (!frontier.isEmpty()) {
    V current = frontier.remove();

    // have we found the goal?
    if (current.equals(goal)) {
      // if so, reconstruct the path
      path.add(current);
      V previous = predecessor.get(current);
      while (previous != null) {
        path.add(0, previous);
        previous = predecessor.get(previous);
      }
      // and exit the loop / search
      break;
    }

    // otherwise, continue searching
    for (V next : graph.neighborsOf(current)) {
      if (!predecessor.containsKey(next)) {
        frontier.add(next);
        predecessor.put(next, current);
      }
    }
  }
  return path;
}

As before, we could do the goal check inside the inner for loop to save a few frontier expansions; I broke it out here to make it more clear, but either way works.

OK, great! What does this look like generally? Again, we search each vertex one hop away before we get to any of the vertices two hops away, and so on. This behavior, the choice of which vertices to search, is entirely a function of how we store and return vertices from the frontier. When it’s a queue, we get this “breadth-first”, ripples-in-a-pond behavior. You can imagine the form of the search as a tree, where each level of the tree is the distance, in hops, from the start node. We search this tree level-by-level in a breadth-first search. (on board)

Depth first search (DFS)

The other way to search a graph is “depth-first” search, where we fully explore one branch before backtracking to the next. (on whiteboard)

What does the code look like? There are a couple of ways to structure it. One way, that you explored (or will explore) in lab, is recursive; the other is iterative. Both use a stack – the recursive method uses the call stack, while the iterative method uses an explicit stack to perform the search:

static <V> List<V> findPathDFS(UndirectedGraph<V> graph, V start, V goal) {
    Stack<V> frontier = new Stack<>();
    frontier.push(start);

    Map<V, V> predecessor = new HashMap<>();
    predecessor.put(start, null);

    List<V> path = new ArrayList<>();
    while (!frontier.isEmpty()) {
        V current = frontier.pop();
        if (current.equals(goal)) {
            path.add(current);
            V previous = predecessor.get(current);
            while (previous != null) {
                path.add(0, previous);
                previous = predecessor.get(previous);
            }
            break;
        }

        for (V next: graph.neighborsOf(current)) {
            if (!predecessor.containsKey(next)) {
                frontier.push(next);
                predecessor.put(next, current);
            }
        }
    }
    return path;
}

Notice how few code changes are needed to support the use of a stack, rather than a queue, to hold the frontier.
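For comparison, here is a sketch of the recursive version mentioned above, where the call stack plays the role of the explicit Stack. To keep it self-contained it uses an adjacency-list map in place of the UndirectedGraph interface; this is illustrative, not the lab's code:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RecursiveDFS {
  // If a path exists, returns true and fills `path` with the vertices
  // from `current` to `goal` (built back-to-front as the recursion unwinds)
  static <V> boolean dfs(Map<V, List<V>> graph, V current, V goal,
      Set<V> seen, List<V> path) {
    seen.add(current);
    if (current.equals(goal)) {
      path.add(0, current);
      return true;
    }
    for (V next : graph.get(current)) {
      if (!seen.contains(next) && dfs(graph, next, goal, seen, path)) {
        path.add(0, current); // prepend as the recursion unwinds
        return true;
      }
    }
    return false;
  }
}
```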

BFS vs DFS: What kinds of paths are found?

BFS always finds (one of) the paths to the goal with the fewest edges, since it always searches all paths of n edges before searching paths of n+1 edges. But it requires that you remember the entire search tree as you go.

DFS, interestingly, does not need to remember the entire tree; only the vertices along the current path, along with their neighbors. Once a vertex has been seen, it can be forgotten, if you’re willing to do some minor bookkeeping (less than is required in BFS, which requires tracking every previously seen vertex). But DFS might not find the shortest path. Remember that if this were a depth-first search, we’d search the tree as far as possible down one path before backtracking. (on board)

Again, to get this DFS behavior, all we need to do is switch the queue to a stack. (Correctly “forgetting” nodes to keep space requirements low is more complex code-wise to implement than just switching to a stack, though.)

Both breadth- and depth-first search are examples of “uninformed” search. That is, they explore the frontier systematically, but with no knowledge of where the goal is relative to the current position. There’s only so much you can do to optimize them (there’s a hybrid algorithm called “iterative deepening DFS” that sorta gets you the best of both worlds; you might analyze this in more depth in 311 or 383).

But if you know something about the problem domain, you can do better.

For example, suppose you have a graph where the vertices represent places (say, on campus), and the edges represent paths between those places. Each edge has a cost associated with it (say, the distance), and you’re trying to find a least-cost (aka shortest) path between a start and a goal.

If you just pick edges according to BFS, you’ll find the shortest path number-of-edges-wise but not cost-wise. How might we do better?

Well, one way would be to order the frontier, from least-cost to highest-cost, and examine vertices in order from least-to-highest cost. Of course, we are probably working with estimates, since if we really knew the true cost, it wouldn’t be a search: we’d just follow the least-cost path like a homing missile.

How do we estimate these costs? We say that each vertex’s cost is defined by a function f(x).

One definition for f(x) is a heuristic, say h(x). A heuristic is an “approximation” or guess of a true value. In our campus graph example, a heuristic might be the straight-line distance from the vertex in question to the goal; usually, paths are roughly straight lines, though of course buildings or the campus pond might make this heuristic not-quite-correct. We can compute this by looking at the map.

So, one approach is to do a “greedy” search, where we always choose the closest node. How can we do this? We could sort the frontier after each iteration of the loop, which would require time proportional to about (n log n), where n is the number of vertices, if we used an efficient sorting algorithm. And that’s fine. In effect, it would produce a “priority queue,” that is, a queue that returns things not in first-in, first-out order, but in “best-out” order.

It turns out Java implements this for us.

Priority queues

The PriorityQueue will act exactly as we want, allowing items to be added in arbitrary order, and returning them in lowest-cost order. Internally, priority queues are not backed by a List, but by a heap. We won’t implement heaps in this course (wait for it… but you will in 187!). Heaps are “not-quite-sorted”; they maintain another property (the “heap property”) which lets them remove and return the current smallest item in (log n) time, and add new items in (log n) time.
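A minimal demonstration of that behavior using Java's built-in PriorityQueue: items go in, in arbitrary order, and come out smallest-first.

```java
import java.util.PriorityQueue;
import java.util.Queue;

public class PriorityQueueDemo {
  public static void main(String[] args) {
    Queue<Integer> pq = new PriorityQueue<>();
    // add in arbitrary order...
    pq.add(5);
    pq.add(1);
    pq.add(3);
    // ...but remove() always returns the smallest remaining item
    System.out.println(pq.remove()); // 1
    System.out.println(pq.remove()); // 3
    System.out.println(pq.remove()); // 5
  }
}
```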

In any event, we need to define an ordering on items to create a useful priority heap. Just like we’ve seen several times before, when there’s additional context we need to compare two items (for example, in our campus navigation example, we’d need to know about the map, not just the location), we can define a Comparator to hold this additional state and use it in its compare method. This Comparator gets passed to the PriorityQueue constructor, and then we have a “greedy” search. This is basically a one-line change to the method, just like going from BFS to DFS.

static <V> List<V> findPath(UndirectedGraph<V> graph, V start, V goal,
    Comparator<V> comp) {
  Queue<V> frontier = new PriorityQueue<>(comp);
  frontier.add(start);

  Map<V, V> predecessor = new HashMap<>();
  predecessor.put(start, null);

  List<V> path = new ArrayList<>();

  while (!frontier.isEmpty()) {
    V current = frontier.remove();
    for (V next : graph.neighborsOf(current)) {
      if (!predecessor.containsKey(next)) {
        frontier.add(next);
        predecessor.put(next, current);
      }
    }
    if (current.equals(goal)) {
      path.add(current);
      V previous = predecessor.get(current);
      while (previous != null) {
        path.add(0, previous);
        previous = predecessor.get(previous);
      }
      break;
    }
  }
  return path;
}

Defining the comparator is problem-specific; we can assume it’s been passed in as above.
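For the campus example, such a Comparator might look like the following sketch. The coordinate map and the use of String place names are made up for illustration; the point is that the Comparator holds the extra state (the map and the goal) that compare needs:

```java
import java.util.Comparator;
import java.util.Map;

// "Smaller" means closer (straight-line) to the goal, so a PriorityQueue
// ordered by this Comparator removes the most promising vertex first.
public class StraightLineComparator implements Comparator<String> {
  private final Map<String, double[]> coords; // vertex -> (x, y) position
  private final String goal;

  public StraightLineComparator(Map<String, double[]> coords, String goal) {
    this.coords = coords;
    this.goal = goal;
  }

  // h(x): straight-line distance from v to the goal
  private double distanceToGoal(String v) {
    double[] a = coords.get(v);
    double[] b = coords.get(goal);
    return Math.hypot(a[0] - b[0], a[1] - b[1]);
  }

  @Override
  public int compare(String u, String v) {
    return Double.compare(distanceToGoal(u), distanceToGoal(v));
  }
}
```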

Greedy search can “get it wrong” and find a sub-optimal path, especially if the heuristic is inaccurate. As you’ll learn in later courses, an optimal informed search algorithm is called “A*”, and its f(x) = g(x) + h(x). g(x) is just the known lowest-cost to get to vertex x so far; h(x) is a heuristic that must obey certain conditions – an “admissible” heuristic. Again, you’ll see this in future courses.

Review of Graph Representations

Recall that we can implement the Graph ADT however we want: in this course, we’ll (briefly) sketch the Adjacency Matrix and Adjacency List implementations.

Consider a (very) simplified graph, where V = {0, 1, 2, … n-1}.

The adjacency matrix representation just creates an n x n 2D array of booleans, representing the edge from-to relationship. A given entry in the array is true iff there exists an edge between the two vertices corresponding to that entry’s indices.

The adjacency list representation is an array of lists. The array is n elements long; each element points to a list of the outgoing edge destinations corresponding to that element’s edges (or an empty list, if it has no outgoing edges).

(on board)

Implementing the abstraction

Next we’re going to talk about how you’d implement those interfaces.

Here’s a naive implementation:

public class AdjacencyMatrixUndirectedGraph<V> implements UndirectedGraph<V> {

	private List<V> vertices;
	private final boolean[][] edges;

	public AdjacencyMatrixUndirectedGraph(int maxVertices) {
		vertices = new ArrayList<>();
		edges = new boolean[maxVertices][maxVertices];
	}


	@Override
	public void addVertex(V v) {
		// what if the vertex is already in the graph?
		vertices.add(v);
	}

	@Override
	public boolean hasVertex(V v) {		
		return vertices.contains(v);
	}

	@Override
	public Set<V> vertices() {
		return new HashSet<>(vertices);
	}

	@Override
	public void addEdge(V u, V v) {
		// order of edges?
		// u,v in graph?
		edges[vertices.indexOf(u)][vertices.indexOf(v)] = true;
	}

	@Override
	public boolean hasEdge(V u, V v) {
		// order of edges?
		// u,v in graph?
		return edges[vertices.indexOf(u)][vertices.indexOf(v)];
	}

	@Override
	public Set<V> neighborsOf(V v) {
		// order of edges?
		// v in graph?
		Set<V> neighbors = new HashSet<>();
		int index = vertices.indexOf(v);
		for (int i = 0; i < vertices.size(); i++) {
			if (edges[index][i]) {
				neighbors.add(vertices.get(i));
			}
		}
		return neighbors;
	}
}

Note that upon reflection, there are some problems here (repeated vertices! order of vertices in edges! are vertices even in the graph?). Some of this we can fix in code (by having, say, a canonical ordering, or being sure to set both spots in the matrix); some of this implies we need to add to our API (methods that take arbitrary vertices as parameters should throw an exception).

Why adjacency matrices?

Remember, the main advantage of adjacency matrices is that they’re lightning fast in terms of checking if an edge is in the graph; it’s not just constant time, it’s constant time with a very low constant. Except our crappy implementation above requires a call to List.indexOf first; so it’s actually linear in the number of vertices. But a better-optimized version of an adjacency matrix representation of a graph would not do this (it would instead use just ints for vertices) and would be “supah-fast”, that is, constant-time.
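A sketch of that better-optimized variant, under the assumption that vertices are just the ints 0..n-1 (so no indexOf lookup is ever needed):

```java
// Vertices are the ints 0..n-1, so hasEdge is a single array
// lookup: constant time, with a very low constant.
public class IntAdjacencyMatrix {
  private final boolean[][] edges;

  public IntAdjacencyMatrix(int n) {
    edges = new boolean[n][n];
  }

  public void addEdge(int u, int v) {
    edges[u][v] = true; // set both entries so argument order doesn't matter
    edges[v][u] = true;
  }

  public boolean hasEdge(int u, int v) {
    return edges[u][v];
  }
}
```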

Adjacency lists

The main downside to adjacency matrices is that they consume a lot of space: the implementation above uses (maxVertices)^2 space, that is, space quadratic in the number of vertices. In the worst case, a graph actually needs this much space – an “almost-complete” graph is called a “dense” graph. But if most vertices are not connected to most other vertices, that is, if we have a “sparse” graph, a more efficient implementation is the adjacency list.

Let’s write one now using our by-now old friend the Map:

public class AdjacencyListUndirectedGraph<V> implements UndirectedGraph<V> {
	Map<V, List<V>> adjacencyList;

	public AdjacencyListUndirectedGraph() {
		adjacencyList = new HashMap<>();
	}

	@Override
	public void addVertex(V v) {
		// duplicate vertex?
		adjacencyList.put(v, new ArrayList<>());
	}

	@Override
	public boolean hasVertex(V v) {
		return adjacencyList.containsKey(v);
	}

	@Override
	public Set<V> vertices() {
		// modification?
		return adjacencyList.keySet();
	}

	@Override
	public void addEdge(V u, V v) {
		// order?
		// u, v in adjacencyList?
		adjacencyList.get(u).add(v);
	}

	@Override
	public boolean hasEdge(V u, V v) {
		return adjacencyList.get(u).contains(v);
	}

	@Override
	public Set<V> neighborsOf(V v) {
		return new HashSet<>(adjacencyList.get(v));
	}
}

Again, there are some problems here, including that we need to be careful of returning Sets that share structure with the graph. The caller might mutate the Set, and thus change the graph! If that’s not what we want (and it usually isn’t), then we should return copies of the structures that represent parts of the graph, not the original structures themselves.
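The danger is concrete: Map.keySet() returns a view that shares structure with the map, so mutating the returned Set mutates the graph. A small demonstration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class KeySetAliasingDemo {
  public static void main(String[] args) {
    Map<String, List<String>> adjacencyList = new HashMap<>();
    adjacencyList.put("A", new ArrayList<>());
    adjacencyList.put("B", new ArrayList<>());

    // keySet() returns a *view* that shares structure with the map...
    Set<String> view = adjacencyList.keySet();
    view.remove("A"); // ...so this removes vertex A from the graph!
    System.out.println(adjacencyList.containsKey("A")); // false

    // A defensive copy keeps the graph safe:
    Set<String> copy = new HashSet<>(adjacencyList.keySet());
    copy.remove("B");
    System.out.println(adjacencyList.containsKey("B")); // true
  }
}
```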

Why adjacency lists?

Is this “slower” than an adjacency matrix? Yes. In particular, any time we need to iterate over the list (contains), we are, worst case, linear in the number of vertices. But we only need exactly as much space as is required to store each edge/vertex. In the worst case this is quadratic in the number of vertices, so we’re no better off than an adjacency matrix. But in a sparse graph, we come out ahead space-wise. And, saying a graph is sparse is roughly equivalent to saying that each vertex has a small constant number of edges, so contains is usually OK in this case. (You’ll explore this more in 311).

“But Marc,” you might be thinking, “why not make it a Map<V, Set<V>> and get the best of both worlds?” You can! And you would (mostly!). But while hash lookups are constant time, they’re not quite as small a constant as array lookups. If you’re really, really worried about speed, and space is not an issue, you may end up using the adjacency matrix representation anyway. But enough about that – the details of graph representation in data structures go deep, and this isn’t a class about that. And frankly, just like with lists, sets, and maps, if you are using graphs in a real-world problem, you likely are going to use a pre-written graph library that’s optimized for your use case.