16: More Sorting; Graphs and Search
Welcome
Announcements
Quiz Monday.
Insertion sort
Now let’s turn our attention to another sorting algorithm. This one is similar to how you might sort a handful of cards.
We break the hand up into two parts, sorted and unsorted. Then we add cards one-by-one from the unsorted part into the sorted part. (on board)
Let’s say we start with our old friend 5 3 7 1, an unsorted list on the right, and a sorted (empty) list on the left:
| 5 3 7 1
“insert” the first card on the left of the unsorted array into the sorted array:
5 | 3 7 1
(Note we didn’t actually do anything, just moved an index). Now take the next element, 3.
5 3 | 7 1
We have to move it into the correct position by successively swapping it left until it’s no longer smaller than its predecessor (or until there is no predecessor).
3 5 | 7 1
7 is easy:
3 5 7 | 1
and finally we need to successively move 1:
1 3 5 7 |
and we’re done. This is called insertion sorting, since we take elements one-by-one from the unsorted portion of the list and insert them into the sorted portion.
static void swap(int[] array, int i, int j) {
    int tmp = array[i];
    array[i] = array[j];
    array[j] = tmp;
}

static void insertIntoSorted(int[] array, int sortedBoundary) {
    for (int i = sortedBoundary; i > 0; i--) {
        if (array[i] < array[i - 1]) {
            swap(array, i, i - 1);
        } else {
            break; // you could omit this, but then you'd lose some non-worst-case performance
        }
    }
}

static void insertionSort(int[] array) {
    for (int i = 1; i < array.length; i++) {
        insertIntoSorted(array, i);
    }
}
How many steps does insertIntoSorted take? The worst case is the last insertIntoSorted having to go all the way: n-1 comparisons (and that many swaps as well).
Insertion sort worst case
What is the worst case for insertion sort? That is, what input order on n inputs causes insertion sort to make the greatest number of comparisons?

a. the input is already sorted
b. the input is in reverse sorted order
c. the order {n, 1, 2, …, n-1}
d. insertion sort’s behavior does not depend upon input order
Turns out to be O(n^2) again, same as selection sort.
But it turns out to be very fast in two particular cases:
- the constant factor for insertion sort is generally lower than that of the O(n log n) algorithms you’ll learn about later in 187, like merge sort and heap sort. Usually for n between 8 and 20 or so, insertion sort’s O(n^2) will outperform merge sort’s O(n log n) or quicksort (another sort with an n^2 worst case, but really good average-case performance).
- the best case is an already sorted list: exactly n-1 comparisons and no swaps; partially-sorted lists take O(nk) time, where each element is no more than k positions from where it should be
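We can see these comparison counts concretely with a quick experiment. Here's a sketch (the class name and the comparison counter are mine, not from the lecture) that instruments the insertion sort above and runs it on a sorted and a reverse-sorted array of n = 10 elements:

```java
// Instrumented insertion sort: count comparisons to confirm the
// best case (n-1) and worst case (n(n-1)/2) claimed above.
class InsertionSortCount {
    static long comparisons = 0;

    static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    static void insertIntoSorted(int[] array, int sortedBoundary) {
        for (int i = sortedBoundary; i > 0; i--) {
            comparisons++;
            if (array[i] < array[i - 1]) {
                swap(array, i, i - 1);
            } else {
                break;
            }
        }
    }

    static void insertionSort(int[] array) {
        for (int i = 1; i < array.length; i++) {
            insertIntoSorted(array, i);
        }
    }

    public static void main(String[] args) {
        int n = 10;
        int[] sorted = new int[n], reversed = new int[n];
        for (int i = 0; i < n; i++) {
            sorted[i] = i;
            reversed[i] = n - i;
        }

        comparisons = 0;
        insertionSort(sorted);
        System.out.println("already sorted:  " + comparisons); // n-1 = 9

        comparisons = 0;
        insertionSort(reversed);
        System.out.println("reverse sorted:  " + comparisons); // n(n-1)/2 = 45
    }
}
```

On the sorted input each insertIntoSorted does one comparison and immediately breaks; on the reverse-sorted input every insertion walks all the way to the front.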
Other sorting things
There are other sorting algorithms that can do better than n^2; the best algorithms run in “n log n” time (“mergesort” and “heapsort” are two you’ll see in 187). They have tradeoffs, though, either requiring more than constant space or a higher constant factor (coefficient) than a simple sort like insertion sort.
In practice, most library sort methods, like Arrays.sort and Collections.sort, use a hybrid of these approaches, picking the best algorithm for the task. Most common is timsort, named after a guy named Tim (no joke!) Peters, who first implemented it in the Python standard library.
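From the caller's side, these library methods are one-liners; a small demo (the data here is just made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class LibrarySortDemo {
    public static void main(String[] args) {
        // Arrays.sort sorts an array in place
        int[] numbers = {5, 3, 7, 1};
        Arrays.sort(numbers);
        System.out.println(Arrays.toString(numbers)); // [1, 3, 5, 7]

        // Collections.sort sorts a List in place, using natural order
        List<String> words = new ArrayList<>(List.of("pear", "apple", "fig"));
        Collections.sort(words);
        System.out.println(words); // [apple, fig, pear]
    }
}
```

The point is that you almost never write your own sort in practice; the library picks a good algorithm for you.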
Graphs
Remember our discussion of trees, and how we talked about trees being a “kind” of graph? Graphs are this really useful thing, a kind of multipurpose data structure that let us represent so many things.
Notation
Recall G = (V, E): a graph G is a set of vertices V together with a set of edges E.
Note about directed vs undirected graphs.
Note about annotations / weights.
Vocabulary
Linear lists and trees are two ways to make objects out of nodes and connections from one node to another. There are many other ways to do it.
A graph consists of a set of nodes called vertices (one of them is a vertex) and a set of edges that connect the nodes. In an undirected graph, each edge connects two distinct vertices. In a directed graph, each edge goes from one vertex to another vertex.
Two vertices are adjacent (or neighbors) if there is an undirected edge from one to the other.
A path (in either kind of graph) is a sequence of edges where each edge goes to the vertex that the next edge comes from. A simple path is one that never reuses a vertex. In a tree, there is exactly one simple path from any one vertex to any other vertex.
A complete graph is one with every possible edge among its vertices – in other words, every vertex is adjacent to every other vertex.
A connected component is a set of vertices in which every vertex has a path to every other vertex (though not every pair need be adjacent).
A single graph might have two or more connected components! (on board)
Examples
google map
maze
tic-tac-toe
8-puzzle
In-class thought experiment
Imagine you wanted to represent the first two years worth of COMPSCI courses (121, 190D, 187, 220, 230, 240, 250) for majors (and their prerequisites) as a graph. What would it look like?
Graph abstraction and algorithms
Each of the previous examples can, if you squint at it correctly, be viewed as a graph. There is some “space” (finite or otherwise) of discrete points, and some points are connected to others.
This corresponds exactly to a graph. And what’s interesting here is that there are many algorithms that operate generally on graphs, regardless of the underlying problem. So we can write them (once) and solve many kinds of problems. Most common are things like:
- search: start at a particular vertex; report true if there is a path to another given vertex in the graph, or false otherwise
- path search (also shortest-path search): find the shortest path from one vertex to another vertex (this might be the lowest number of edges, or, if edges have a “weight”, might be based upon the sum of edge costs)
- minimax search in an adversarial game: given a state, look for the “winning-est” move
- all-pairs shortest path (which it turns out can be solved more efficiently than just doing each pairwise shortest-path search)
and many, many more.
Total vs partial knowledge of a graph
You can know (or compute) the entire graph “ahead of time” when it’s both small and knowable, for example, our earlier maze example. That is, you can create an ADT that allows you to set and access nodes and edges (and associated annotations) and instantiate it.
For some problems, the graph is too large to keep track of (e.g., the game tree of chess has around 10^123 nodes). But obviously we have computer programs that can play chess. How do you do it? You generate a “partial view” of the state space, where you can find the “successors” of a particular state (on board) and their successors, and so on, up until you’re “out of time” to think more or out of space to store more, and do the best you can with this partial view.
How might these ADTs look in practice?
ADT for graphs
We need to be able to add and query lists (or sets) of vertices and edges. Of course, edges are just links between two vertices, so we needn’t have a separate data type. What might this look like in the simple case, where we don’t worry about annotations? Something like:
public interface UndirectedGraph<V> {
    void addVertex(V v);
    boolean hasVertex(V v);
    Set<V> vertices();

    void addEdge(V u, V v);
    boolean hasEdge(V u, V v);
    Set<V> neighborsOf(V v);
}
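One plausible implementation of these operations is an "adjacency list" (here, adjacency sets): map each vertex to the set of its neighbors. A sketch (class name mine; the interface itself is omitted so the sketch stands alone):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Adjacency-set implementation of the UndirectedGraph operations.
class AdjacencyListGraph<V> {
    private final Map<V, Set<V>> adjacency = new HashMap<>();

    void addVertex(V v) {
        adjacency.putIfAbsent(v, new HashSet<>());
    }

    boolean hasVertex(V v) {
        return adjacency.containsKey(v);
    }

    Set<V> vertices() {
        return adjacency.keySet();
    }

    void addEdge(V u, V v) {
        addVertex(u);
        addVertex(v);
        // undirected: record the edge in both directions
        adjacency.get(u).add(v);
        adjacency.get(v).add(u);
    }

    boolean hasEdge(V u, V v) {
        return hasVertex(u) && adjacency.get(u).contains(v);
    }

    Set<V> neighborsOf(V v) {
        return adjacency.getOrDefault(v, new HashSet<>());
    }
}
```

Storing each edge twice (once per endpoint) makes neighborsOf a single map lookup, which is exactly the operation search algorithms hammer on.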
What about if we are just concerned with a partial view of the graph? Maybe something like this:
public interface PartialUndirectedGraph<V> {
    List<V> neighborsOf(V v);
}
The implementation of a partial view would have to know quite a bit about the underlying problem in order to generate the neighbors, but on the other hand, you don’t need to generate everything, just the bits of the graph you care about.
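For instance, going back to the maze example, a partial view might compute neighbors on demand from the maze layout rather than ever building an explicit edge set. A sketch (the cell encoding "row,col" and the maze layout are my own choices for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Partial view of a maze as a graph: vertices are open cells, encoded
// as "row,col" strings; walls are '#'. neighborsOf generates adjacent
// open cells on demand -- no explicit edge list is ever built.
class MazeGraph {
    private final String[] rows;

    MazeGraph(String[] rows) {
        this.rows = rows;
    }

    private boolean open(int r, int c) {
        return r >= 0 && r < rows.length
                && c >= 0 && c < rows[r].length()
                && rows[r].charAt(c) != '#';
    }

    List<String> neighborsOf(String v) {
        String[] parts = v.split(",");
        int r = Integer.parseInt(parts[0]);
        int c = Integer.parseInt(parts[1]);
        List<String> result = new ArrayList<>();
        int[][] moves = {{-1, 0}, {1, 0}, {0, -1}, {0, 1}}; // up, down, left, right
        for (int[] m : moves) {
            if (open(r + m[0], c + m[1])) {
                result.add((r + m[0]) + "," + (c + m[1]));
            }
        }
        return result;
    }
}
```

Note how this class knows everything about the underlying problem (the maze layout, what counts as a wall) but nothing is precomputed; the graph exists only implicitly.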
Searching a graph
How might we go about trying to solve a search problem? That is, suppose we had a graph that had been instantiated a particular way. We’re given a start vertex, and we want to see if there’s a path in that graph to the end vertex. As a human, we can just look at a small graph and decide, but larger graphs eventually can’t just be glanced at. What’s a methodical way to check?
Let’s work through an example: (on board, graph S,1,2,3,4,G where 1,2,3 are strongly connected and 4 is only connected to 3).
The idea behind searching a graph is that we want to systematically examine it, starting at one point, looking for a path to another point. We do so by keeping track of a list of places to be explored (the “frontier”). We repeat the following steps until the frontier is empty or our goal is found:
- Pick and remove a location from the frontier.
- Mark the location as explored (visited) so we don’t “expand” it again.
- “Expand” the location by looking at its neighbors. Any neighbor we haven’t seen yet (not visited, not already on the frontier) is added to the frontier.
What might this look like in code?
static <V> boolean isPath(UndirectedGraph<V> graph, V start, V goal) {
    Queue<V> frontier = new LinkedList<>();
    frontier.add(start);
    Set<V> visited = new HashSet<>();
    visited.add(start);

    while (!frontier.isEmpty()) {
        V current = frontier.remove();
        if (current.equals(goal)) return true;
        for (V next : graph.neighborsOf(current)) {
            // note: could put check for goal here instead
            if (!visited.contains(next)) {
                frontier.add(next);
                visited.add(next);
            }
        }
    }
    return false;
}
Note I used a Queue here; its first-in, first-out behavior enforces a breadth-first search. (on board) Queues are lists, but you can only add on one end and only remove from the other, like waiting in line at Disney or some such. (You could totally use a List if you wanted to, but how you add and remove vertices from the frontier controls how the search runs.)
Using a Queue, this search will visit all vertices adjacent to the start (that is, one hop away from the start) before it visits their neighbors (two hops away from the start), and so on, like ripples in a pond. This is called a “breadth-first” search.
Depending upon the order in which vertices are returned from the frontier, the search will progress in different ways; most notably, you get a depth-first search when the frontier is a stack – last in, first out. You’ll see this in more detail in 187.
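To preview that idea: swapping the queue for a stack turns the same skeleton into a depth-first search. Here's a sketch (not the 187 version; for self-containedness the graph is just a Map from each vertex to its neighbor list, and I use an ArrayDeque as the stack):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DepthFirstSearch {
    static <V> boolean isPathDFS(Map<V, List<V>> graph, V start, V goal) {
        Deque<V> frontier = new ArrayDeque<>(); // used as a stack
        frontier.push(start);
        Set<V> visited = new HashSet<>();
        visited.add(start);

        while (!frontier.isEmpty()) {
            V current = frontier.pop(); // most recently added comes out first
            if (current.equals(goal)) return true;
            for (V next : graph.getOrDefault(current, List.of())) {
                if (!visited.contains(next)) {
                    frontier.push(next);
                    visited.add(next);
                }
            }
        }
        return false;
    }
}
```

The body is line-for-line the same as isPath; only the frontier's add/remove discipline changed, and with it the order in which vertices get explored.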
In-class exercise
Suppose we have the following graph:
  1--2
 /    \
S      G
 \    /
  3--4

Vertices are added to the frontier in numerical order as a node is explored, and explored in the order they were added. In what order are the vertices visited?
Finding the path
isPath doesn’t actually find the path, it just checks to see if there is one. One way to find the path is to change visited slightly.
Instead of keeping track only of whether or not a vertex has been visited, we can keep track of where we “came from” to get to that vertex. In other words, we can track the “predecessor” of that vertex. (on board)
Here’s the updated code:
static <V> List<V> findPath(UndirectedGraph<V> graph, V start, V goal) {
    Queue<V> frontier = new LinkedList<>();
    frontier.add(start);
    Map<V, V> predecessor = new HashMap<>();
    predecessor.put(start, null);
    List<V> path = new ArrayList<>();

    while (!frontier.isEmpty()) {
        V current = frontier.remove();
        for (V next : graph.neighborsOf(current)) {
            if (!predecessor.containsKey(next)) {
                frontier.add(next);
                predecessor.put(next, current);
            }
        }
        if (current.equals(goal)) {
            path.add(current);
            V previous = predecessor.get(current);
            while (previous != null) {
                path.add(0, previous);
                previous = predecessor.get(previous);
            }
            break;
        }
    }
    return path;
}
As before, we could do the goal check inside the inner for loop to save a few frontier expansions; I broke it out here to make it more clear, but either way works.
OK, great! What does this look like generally? Again, we search each vertex one hop away before we get to any of the vertices two hops away, and so on. This behavior, the choice of which vertices to search, is entirely a function of how we store and return vertices from the frontier. When it’s a queue, we get this “breadth-first”, ripples-in-a-pond behavior. You can imagine the form of the search a tree, where each level of the tree is the distance, in hops, from the start node. We search this tree level-by-level in a breadth first search. (on board)
The other way to search a graph is “depth-first” search, where we fully explore one branch before backtracking to the next.
Don’t panic
We’ll go over this in more detail and do some examples next lecture!