16: More Sorting; Graphs and Search

Welcome

Announcements

It’s April! The end of the semester is getting close. Can you feel it?

Insertion sort

Now let’s turn our attention to another sorting algorithm. This one is similar to how you might sort a handful of cards.

We break the hand up into two parts, sorted and unsorted. Then we add cards one-by-one from the unsorted part into the sorted part. (on board)

Let’s say we start with our old friend 5 3 7 1, an unsorted list on the right, and a sorted (empty) list on the left:

| 5 3 7 1

“insert” the first card on the left of the unsorted array into the sorted array:

5 | 3 7 1

(Note we didn’t actually do anything, just moved an index). Now take the next element, 3.

5 3 | 7 1

We have to move it into the correct position by successively swapping it left until it’s no longer smaller than its predecessor (or until there is no predecessor).

3 5 | 7 1

7 is easy:

3 5 7 | 1

and finally we need to successively move 1:

1 3 5 7 |

and we’re done. This is called insertion sorting, since we take elements one-by-one from the unsorted portion of the list and insert them into the sorted portion.

static void swap(int[] array, int i, int j) {
  int temp = array[i];
  array[i] = array[j];
  array[j] = temp;
}

static void insertIntoSorted(int[] array, int sortedBoundary) {
  for (int i = sortedBoundary; i > 0; i--) {
    if (array[i] < array[i - 1]) {
      swap(array, i, i - 1);
    }
    else break; // you could omit this, but then you'd lose some non-worst-case performance
  }
}

static void insertionSort(int[] array) {
  for (int i = 1; i < array.length; i++) {
    insertIntoSorted(array, i);
  }
}
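Putting the pieces together, here is the whole thing as a small self-contained class (the class name and the main method are just for the demo), run on our old friend 5 3 7 1:

```java
import java.util.Arrays;

public class InsertionSortDemo {
  static void swap(int[] array, int i, int j) {
    int temp = array[i];
    array[i] = array[j];
    array[j] = temp;
  }

  static void insertIntoSorted(int[] array, int sortedBoundary) {
    // walk the element at sortedBoundary left until it's in place
    for (int i = sortedBoundary; i > 0; i--) {
      if (array[i] < array[i - 1]) swap(array, i, i - 1);
      else break;
    }
  }

  static void insertionSort(int[] array) {
    // grow the sorted portion one element at a time
    for (int i = 1; i < array.length; i++) insertIntoSorted(array, i);
  }

  public static void main(String[] args) {
    int[] hand = {5, 3, 7, 1};
    insertionSort(hand);
    System.out.println(Arrays.toString(hand)); // [1, 3, 5, 7]
  }
}
```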

How many steps does insertIntoSorted take? Worst case is the last call to insertIntoSorted having to go all the way: n-1 comparisons (and that many swaps as well).

In-class exercise

What is the worst case for insertion sort? That is, what input order on n inputs causes insertion sort to make the greatest number of comparisons?

a. the input is already sorted
b. the input is in reverse sorted order
c. the order {n, 1, 2, …, n-1}
d. insertion sort’s behavior does not depend upon input order

Turns out to be O(n^2) again, same as selection sort.

But it turns out to be very fast in two particular cases:

  • insertion sort’s constant factor is generally lower than those of the O(n log n) algorithms you’ll learn about later in 187, like merge sort and heap sort. Usually for n between 8 and 20 or so, insertion sort’s O(n^2) will outperform merge sort’s O(n log n), or quicksort (another worst-case n^2 sort that’s got really good average-case performance).
  • best case is an already-sorted list: exactly n-1 comparisons and no swaps; partially-sorted lists, where each element is no more than k positions from its final position, sort in O(nk) time
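You can see the best and worst cases directly by counting comparisons. Here is a self-contained variant of the sort with a tally added (the class name and counter are ours, for the demo, not part of the lecture code):

```java
public class InsertionSortCount {
  static long comparisons = 0;

  static void insertionSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
      for (int j = i; j > 0; j--) {
        comparisons++;                     // count every comparison made
        if (a[j] < a[j - 1]) {
          int t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;  // swap
        }
        else break;                        // already in place: stop early
      }
    }
  }

  public static void main(String[] args) {
    int n = 8;
    int[] sorted = new int[n], reversed = new int[n];
    for (int i = 0; i < n; i++) { sorted[i] = i; reversed[i] = n - 1 - i; }

    comparisons = 0;
    insertionSort(sorted);
    System.out.println("already sorted:  " + comparisons); // n - 1 = 7

    comparisons = 0;
    insertionSort(reversed);
    System.out.println("reverse sorted: " + comparisons);  // n(n-1)/2 = 28
  }
}
```

On sorted input every insert stops after one comparison (n-1 total); on reverse-sorted input every element walks all the way left (1 + 2 + … + (n-1) = n(n-1)/2 total).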

Other sorting things

There are other sorting algorithms that can do better than n^2; the best algorithms run in “n log n” time (“mergesort” and “heapsort” are two you’ll see in 187). They have tradeoffs, though, either requiring more than constant space or a higher constant factor (coefficient) than a simple sort like insertion sort.

In practice, most library sort methods, like Arrays.sort and Collections.sort, use a hybrid of the approaches, choosing the best algorithm for the task at hand. Most common is timsort, named after Tim Peters (no joke!), who first implemented it in the Python standard library.

Graphs

Remember our discussion of trees, and how we talked about trees being a “kind” of graph? Graphs are this really useful thing, a kind of multipurpose data structure that let us represent so many things.

Notation

Recall G = (V, E): a graph G consists of a set V of vertices and a set E of edges.

Note about directed vs undirected graphs.

Note about annotations / weights.

Vocabulary

Linear lists and trees are two ways to make objects out of nodes and connections from one node to another. There are many other ways to do it.

A graph consists of a set of nodes called vertices (one of them is a vertex) and a set of edges that connect the nodes. In an undirected graph, each edge connects two distinct vertices. In a directed graph, each edge goes from one vertex to another vertex.

Two vertices are adjacent (or neighbors) if there is an undirected edge from one to the other.

A path (in either kind of graph) is a sequence of edges where each edge goes to the vertex that the next edge comes from. A simple path is one that never reuses a vertex. In a tree, there is exactly one simple path from any one vertex to any other vertex.

A complete graph is one with every possible edge among its vertices – in other words, every vertex is adjacent to every other vertex.

A connected component is a set of vertices in which every vertex has a path to every other vertex (though not every pair of vertices need be adjacent).

A single graph might have two or more connected components! (on board)

Examples

google map

maze

tic-tac-toe

8-puzzle

In-class thought experiment

Imagine you wanted to represent the first two years worth of COMPSCI courses (121, 186, 187, 220, 230, 240, 250) for majors (and their prerequisites) as a graph. What would it look like?

Graph abstraction and algorithms

Each of the previous examples can, if you squint at it correctly, be viewed as a graph. There is some “space” (finite or otherwise) of discrete points, and some points are connected to others.

This corresponds exactly to a graph. And what’s interesting here is that there are many algorithms that operate generally on graphs, regardless of the underlying problem. So we can write them (once) and solve many kinds of problems. Most common are things like:

  • search: start at a particular vertex; report true if there is a path to another given vertex in the graph, or false otherwise
  • path search (also shortest-path search): find the shortest path from one vertex to another vertex (this might be the lowest number of edges, or, if edges have a “weight”, might be based upon the sum of edge costs)
  • minimax search in an adversarial game: given a state, look for the “winning-est” move
  • all-pairs shortest path (which it turns out can be solved more efficiently than just doing each pairwise shortest-path search)

and many, many more.

Total vs partial knowledge of a graph

You can know (or compute) the entire graph “ahead of time” when it’s both small and knowable, for example, our earlier maze example. That is, you can create an ADT that allows you to set and access nodes and edges (and associated annotations) and instantiate it.

For some problems, the graph is too large to keep track of (e.g., the game tree of chess has on the order of 10^120 positions). But obviously we have computer programs that can play chess. How do you do it? You generate a “partial view” of the state space, where you can find the “successors” of a particular state (on board) and their successors, and so on, up until you’re “out of time” to think more or out of space to store more, and do the best you can with this partial view.

How might these ADTs look in practice?

ADT for graphs

We need to be able to add and query lists (or sets) of vertices and edges. Of course, edges are just links between two vertices, so we needn’t have a separate data type. What might this look like in the simple case, where we don’t worry about annotations? Something like:

public interface UndirectedGraph<V> {
  void addVertex(V v);
  boolean hasVertex(V v);
  Set<V> vertices();

  void addEdge(V u, V v);
  boolean hasEdge(V u, V v);  
  Set<V> neighborsOf(V v);
}
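One natural way to back this interface is with a map from each vertex to the set of its neighbors ("adjacency sets"). A minimal sketch, with the class name ours (the methods match the interface above):

```java
import java.util.*;

public class AdjacencySetGraph<V> {
  // each vertex maps to the set of vertices it shares an edge with
  private final Map<V, Set<V>> adjacent = new HashMap<>();

  public void addVertex(V v) {
    adjacent.putIfAbsent(v, new HashSet<>());
  }

  public boolean hasVertex(V v) {
    return adjacent.containsKey(v);
  }

  public Set<V> vertices() {
    return Collections.unmodifiableSet(adjacent.keySet());
  }

  public void addEdge(V u, V v) {
    addVertex(u);
    addVertex(v);
    // undirected: record the edge in both directions
    adjacent.get(u).add(v);
    adjacent.get(v).add(u);
  }

  public boolean hasEdge(V u, V v) {
    return adjacent.containsKey(u) && adjacent.get(u).contains(v);
  }

  public Set<V> neighborsOf(V v) {
    return Collections.unmodifiableSet(adjacent.getOrDefault(v, Set.of()));
  }
}
```

Storing the edge in both directions makes hasEdge and neighborsOf fast at the cost of a little extra space; that's a typical tradeoff for undirected graphs.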

What about if we are just concerned with a partial view of the graph? Maybe something like this:

public interface PartialUndirectedGraph<V> {
  List<V> neighborsOf(V v);
}
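For a concrete (and hypothetical) example of a partial view: imagine an unbounded grid where each cell (x, y) is a vertex and its neighbors are the four adjacent cells. Nothing is stored up front; neighbors are generated on demand from knowledge of the problem:

```java
import java.util.*;

public class GridGraph {
  // a vertex in this graph is just a coordinate pair
  public record Cell(int x, int y) {}

  // generate the neighbors on demand, rather than looking them up in storage
  public List<Cell> neighborsOf(Cell c) {
    return List.of(
        new Cell(c.x() + 1, c.y()),
        new Cell(c.x() - 1, c.y()),
        new Cell(c.x(), c.y() + 1),
        new Cell(c.x(), c.y() - 1));
  }
}
```

A maze version would additionally consult the maze's walls and omit blocked neighbors; a chess version would generate the legal moves from a position. Either way, the whole (possibly enormous) graph never has to exist in memory at once.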

The implementation of a partial view would have to know quite a bit about the underlying problem in order to generate the neighbors, but on the other hand, you don’t need to generate everything, just the bits of the graph you care about.

Searching a graph

How might we go about trying to solve a search problem? That is, suppose we had a graph that had been instantiated a particular way. We’re given a start vertex, and we want to see if there’s a path in that graph to the end vertex. As a human, we can just look at a small graph and decide, but larger graphs eventually can’t just be glanced at. What’s a methodical way to check?

Let’s work through an example: (on board, graph S,1,2,3,4,G where 1,2,3 are strongly connected and 4 is only connected to 3).

The idea behind searching a graph is that we want to systematically examine it, starting at one point, looking for a path to another point. We do so by keeping track of a list of places to be explored (the “frontier”). We repeat the following steps until the frontier is empty or our goal is found:

  • Pick and remove a location from the frontier.
  • Mark the location as explored (visited) so we don’t “expand” it again.
  • “Expand” the location by looking at its neighbors. Any neighbor we haven’t seen yet (not visited, not already on the frontier) is added to the frontier.
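The steps above can be sketched in code. Here's one minimal version, taking the graph as a plain map from each vertex to its neighbors (the class and method names are ours); it uses a FIFO queue as the frontier, though as we'll see, other frontier orderings work too:

```java
import java.util.*;

public class GraphSearch {
  // returns true if there is a path from start to goal in the given graph
  static <V> boolean canReach(Map<V, List<V>> graph, V start, V goal) {
    Deque<V> frontier = new ArrayDeque<>();  // places still to be explored
    Set<V> seen = new HashSet<>();           // visited or already on the frontier
    frontier.add(start);
    seen.add(start);

    while (!frontier.isEmpty()) {
      V current = frontier.remove();         // pick and remove a location
      if (current.equals(goal)) return true; // goal found
      // "expand": add any neighbor we haven't seen yet to the frontier
      for (V neighbor : graph.getOrDefault(current, List.of())) {
        if (seen.add(neighbor)) {            // add() is false if already seen
          frontier.add(neighbor);
        }
      }
    }
    return false;                            // frontier empty: no path exists
  }
}
```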

Don’t panic

We’ll go over this in more detail and do some examples next lecture!