Week 8: Searching, sorting, and introduction to graphs

Analyzing search, linear and binary

Let’s look at one particular method on lists: indexOf. indexOf searches a list for an element and returns its index, or -1 if the element is not found. How long must a search take?

Well, knowing nothing else, we have to check every element of the list (or array, etc.). So? It’s linear, right? Something like:

public int indexOf(String s) {
    for (int i = 0; i < array.length; i++) {
        if (s.equals(array[i])) {
            return i;
        }
    }
    return -1;
}

Linear. But (and this is a big but and I cannot lie) if we know something more about the list, we can leverage that to not have to search the whole list.

For example, if the list is sorted. You know, like a telephone book, or a dictionary, or an index at the back of a book, or your phone’s address book, or basically anything that’s long and linear but where we want fast access to an arbitrary entry.

From Downey §12.8:

When you look for a word in a dictionary, you don’t just search page by page from front to back. Since the words are in alphabetical order, you probably use a binary search algorithm:

  1. Start on a page near the middle of the dictionary.
  2. Compare a word on the page to the word you are looking for. If you find it, stop.
  3. If the word on the page comes before the word you are looking for, flip to somewhere later in the dictionary and go to step 2.
  4. If the word on the page comes after the word you are looking for, flip to somewhere earlier in the dictionary and go to step 2.

If you find two adjacent words on the page and your word comes between them, you can conclude that your word is not in the dictionary.

We can leverage this to write a faster search algorithm, called “binary search”. It’s called this because each time through the loop, it eliminates half of the possible entries, unlike a regular linear search that eliminates only one. It looks like this:

// note: only works on sorted arrays!
public int binarySearch(String s) {
    int low = 0;
    int high = array.length - 1; // probably should be `size - 1`, do you see why?
    while (low <= high) {
        int mid = low + (high - low) / 2; // written this way to avoid int overflow when low + high exceeds Integer.MAX_VALUE
        int comp = array[mid].compareTo(s);
        if (comp == 0) {
            return mid;
        } else if (comp < 0) {
            low = mid + 1;
        } else { // comp > 0
            high = mid - 1;
        }
    }
    return -1;
}
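As an aside: the Java library already provides this via java.util.Arrays.binarySearch (its behavior is undefined if the array isn’t sorted). A quick usage sketch, with made-up data:

import java.util.Arrays;

static int libraryBinarySearchDemo() {
  String[] words = { "apple", "banana", "cherry" }; // must already be sorted!
  int index = Arrays.binarySearch(words, "banana"); // returns 1
  // when the key is absent, the library returns -(insertionPoint) - 1:
  int missing = Arrays.binarySearch(words, "fig"); // returns -4 here
  return index;
}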

How long does this take to run?

Each time through the loop, we cut the distance between low and high in half. After k iterations, the number of remaining cells to search is array.length / 2^k. To find the number of iterations it takes to complete (in the worst case), we set array.length / 2^k = 1 and solve for k: 2^k = array.length, so k = log_2(array.length). This is sub-linear. For example, for an array of 1,000 elements, it’s about 10; a million elements, about 20; a billion elements, about 30; and so on.
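To convince yourself of the halving argument, here’s a tiny sketch (my own illustration, not part of the notes) that counts how many times you can halve n before reaching 1:

static int halvings(int n) {
  int count = 0;
  while (n > 1) {
    n = n / 2; // each iteration discards half the remaining cells
    count++;
  }
  return count; // halvings(1_000_000) returns 19 -- about 20, as claimed
}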

The downside, of course, is that we have to keep the array sorted. How do we do that? (And more generally, we need things sorted all the time. How do we do that?)

Sorting

The binary search algorithm is a wonderful way to find an element in a list in less than n steps (log_2 n), even in the worst case, so long as the list is sorted.

Many operations can be performed more quickly on a sorted data set. Not to mention that people often prefer to view sorted rather than unsorted data (think about spreadsheets, indices, address books, etc.).

We’re next going to turn our attention to three (well, two reasonable ones and one not-so-reasonable one) sorting algorithms: methods for transforming unsorted lists or arrays into sorted ones. We’ll be using comparison-based sorts, where elements must be directly comparable (in Java: Comparable, or a primitive type with a natural ordering). There are other approaches that you’ll learn about in COMPSCI 311, like radix sort.

We’ll pay particular attention to running times of these algorithms, but also think a little about space requirements, access requirements (e.g., random access, like an array), and behavior in multiple cases (not just worst case, but perhaps best or average case). Again, more to come in 187.

A note

We’ll think about runtimes in terms of swaps and comparisons.

We care about swaps as they are the basic way to reorder elements in an indexed list, like an array or ArrayList. (Note that some algorithms can be made to work on other abstractions.) A swap in an array or array-like structure usually requires a small constant amount of space (equal to one element) to hold the swap variable, and a small constant amount of time:

public static void swap(int[] array, int i, int j) {
  int t = array[i];
  array[i] = array[j];
  array[j] = t;
}

As for comparisons: in the code I show, I’ll use arrays of ints, because it’s shorter. But in general, you could sort arrays of objects (that have an ordering) using element.compareTo(other) < 0 rather than say element < other, or by instantiating and using an appropriate Comparator.
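For instance, here’s a hypothetical generic minimum-finder (not part of the lecture code) showing what the compareTo version of a comparison looks like; the int-based code below uses < instead:

static <T extends Comparable<T>> T minimum(T[] array) {
  T best = array[0]; // assumes a non-empty array
  for (T element : array) {
    if (element.compareTo(best) < 0) { // the Comparable version of "element < best"
      best = element;
    }
  }
  return best;
}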

A first sorting algorithm: selection sort

It turns out that, like skinning a cat, there’s more than one way to sort. See: https://en.wikipedia.org/wiki/Sorting_algorithm and https://www.toptal.com/developers/sorting-algorithms for example.

One way is to find the first (say, smallest) thing, and put it in the first position. Then find the next-smallest thing, and put it in the second position. Then find the third, and so on.

We find the thing using a simple linear search.

If we “put it in the ith position” using a swap, we don’t need an entire list’s (O(n)) worth of extra space, just a single element.

(on board with list 5 3 7 1)

This is called selection sort, because we select each element that we want, one-by-one, and put them where we want them.

static int indexOfMinimum(int[] array, int startIndex) {
  int minIndex = startIndex;
  for (int i = startIndex + 1; i < array.length; i++) {
    if (array[i] < array[minIndex]) {
      minIndex = i;
    }
  }
  return minIndex;
}

static void selectionSort(int[] array) {
  for (int i = 0; i < array.length - 1; i++) {
    // you could just:
    swap(array, i, indexOfMinimum(array, i));
    // but for maximum efficiency, instead:
    // int index = indexOfMinimum(array, i);
    // if (index != i) {
    //   swap(array, i, index);
    // }
    // this second approach avoids wasting time swapping
    // an element with itself, but at the cost of an if statement;
    // irrelevant optimization for asymptotic analysis though
  }
}

Let’s also add some printing code to see this in action:

static void printArray(int[] array) {
  for (int i: array) {
    System.out.print(i + " ");
  }
  System.out.println();
}

public static void main(String[] args) {
  int[] array = new int[] {5, 3, 7, 1};
  printArray(array);
  selectionSort(array);
  printArray(array); // print again to see the sorted result
}

(We can also call printArray inside the sort method’s loop to watch it work step by step.)

Analyzing selection

How good or bad is selection sort, really? Let’s think about comparisons and swaps.

There are exactly n-1 comparisons the first time, n-2 the second time, and so on. This sums to (n-1) + (n-2) + ... + 1 = n(n-1)/2.

There are exactly n-1 swaps made (note that some could be no-ops: if i == indexOfMinimum(array, i) – in other words, if the element is already in the right place).

If comparisons and swaps are both about constant cost, then this algorithm is O(n^2) – the cost is dominated by the comparisons.

Even so, if swaps are much more expensive (a bigger constant) and n is not too large, then selection sort could be OK, since it bounds the number of swaps to be at most (n-1). But this case would be very unusual! Usually we just think about what’s asymptotically most expensive, in this case, O(n^2) comparisons.
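If you want to sanity-check the n(n-1)/2 claim empirically, here’s a small sketch (my addition, not lecture code) that mirrors the nested loops of selectionSort plus indexOfMinimum, counting one comparison per inner iteration:

static long countSelectionSortComparisons(int n) {
  long count = 0;
  for (int i = 0; i < n - 1; i++) {   // one pass per call to indexOfMinimum
    for (int j = i + 1; j < n; j++) { // one comparison per inner iteration
      count++;
    }
  }
  return count; // countSelectionSortComparisons(4) returns 6, which is 4 * 3 / 2
}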

Bubble sort

Here’s another sorting algorithm:

Start at the end of the list (element n-1). Then compare against the previous element (n-2). If the element at (n-1) is smaller, swap it with the element at (n-2). Then compare the element at (n-2) with the element at (n-3). And so on, all the way to the 0th element.

This will move the smallest element to the zeroth index of the list.

Now repeat, but stop at the 1st element. This will move the second-smallest element to the 1st index. Then repeat again, and so on, until the list is sorted.

Each time the algorithm repeats, the ith smallest element “bubbles up” to the front of the list; this is called a bubble sort.

static void bubbleUp(int[] array, int stopIndex) {
  for (int i = array.length - 1; i > stopIndex; i--) {
    if (array[i] < array[i - 1]) {
      swap(array, i, i - 1);
    }
  }
}

static void bubbleSort(int[] array) {
  for (int i = 0; i < array.length; i++) {
    bubbleUp(array, i);
  }
}

(on board with list 5 3 7 1)

Insertion sort

Now let’s turn our attention to another sorting algorithm. This one is similar to how you might sort a handful of cards. Or maybe if you’ve ever volunteered at a library, it’s kinda like how you might sort the books on a cart.

We break the hand up into two parts, sorted and unsorted. Then we add cards one-by-one from the unsorted part into the sorted part. (on board)

Let’s say we start with our old friend 5 3 7 1, an unsorted list on the right, and a sorted (empty) list on the left:

| 5 3 7 1

“insert” the first card on the left of the unsorted array into the sorted array:

5 | 3 7 1

(Note we didn’t actually do anything, just moved the index dividing the sorted part from the unsorted part). Now take the next element, 3.

5 3 | 7 1

We have to move it into the correct position by successively swapping it left until it’s no longer smaller than its predecessor (or until there is no predecessor).

3 5 | 7 1

7 is easy:

3 5 7 | 1

and finally we need to successively move 1:

3 5 7 1 |
3 5 1 7 |
3 1 5 7 |
1 3 5 7 |

and we’re done. This is called insertion sorting, since we take elements one-by-one from the unsorted portion of the list and insert them into the sorted portion. Notice that when we’re doing the “inserting” into the sorted position, it’s exactly the same algorithm we used in our implementation of ArrayList.add() earlier in the semester – we need to “move everything out of the way”. The main difference here is that we do both the moving and the finding-the-right-spot in the same loop, rather than being “given” the right spot (i) as a parameter to the add() method.

static void insertIntoSorted(int[] array, int sortedBoundary) {
  for (int i = sortedBoundary; i > 0; i--) {
    if (array[i] < array[i - 1]) {
      swap(array, i, i - 1);
    } else {
      break; // you could omit this, but then you'd lose some non-worst-case performance
    }
  }
}

static void insertionSort(int[] array) {
  for (int i = 1; i < array.length; i++) {
    insertIntoSorted(array, i);
  }
}

I’m going to ask you about the worst case in the problem set. But it turns out to be very fast in two particular cases: when the input is already sorted (or nearly sorted), and when the input is small.

Other sorting things

There are other sorting algorithms that can do better than n^2; the best algorithms run in “n log n” time (“mergesort” and “heapsort” are two you’ll see in 187). They have tradeoffs, though, either requiring more than constant space or a higher constant factor (coefficient) than a simple sort like insertion sort.

In practice, most library sort methods, like Arrays.sort and Collections.sort, use a hybrid of approaches, choosing the best algorithm for the task. Most common is timsort, named after a guy named Tim (no joke!) Peters, who first implemented it in the Python standard library.
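A quick usage sketch (with made-up data) of those library methods:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

static void librarySortDemo() {
  int[] numbers = { 5, 3, 7, 1 };
  Arrays.sort(numbers); // primitives use a dual-pivot quicksort in OpenJDK

  List<String> names = new ArrayList<>(List.of("carol", "alice", "bob"));
  Collections.sort(names); // objects use a timsort variant in OpenJDK
}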

There is also the legendary Bogosort, which will be familiar to you if you’ve ever played 52-card pickup.

Graphs

Remember our discussion of trees, and how we talked about trees being a “kind” of graph? Graphs are this really useful thing, a kind of multipurpose data structure that lets us represent so many things.

Notation

Recall G = (V, E): a graph is a set of vertices V together with a set of edges E.

Edges can be undirected or directed – an undirected edge connects the vertices at both of its ends symmetrically; a directed edge has a directionality (it goes from one vertex to another). Usually we speak of an entire graph being either directed or undirected.

Often, vertices and/or edges are annotated in some way. They might be named (typically vertices are named for convenience). Or they might have a value associated with them – often edges will, where the value is some notion of “cost” or “weight”.

Vocabulary

Linear lists and trees are two ways to make objects out of nodes and connections from one node to another. There are many other ways to do it.

A graph consists of a set of nodes called vertices (one of them is a vertex) and a set of edges that connect the nodes. In an undirected graph, each edge connects two distinct vertices. In a directed graph, each edge goes from one vertex to another vertex.

Two vertices are adjacent (or neighbors) if there is an undirected edge from one to the other.

A path (in either kind of graph) is a sequence of edges where each edge goes to the vertex that the next edge comes from. A simple path is one that never reuses a vertex. In a tree, there is exactly one simple path from any one vertex to any other vertex.

A complete graph is one with every possible edge among its vertices – in other words, every vertex is adjacent to every other vertex.

A connected component is a maximal set of vertices in which every vertex has a path to every other vertex (though not every pair of vertices need be adjacent).

A single graph might have two or more connected components! (on board)

Examples

  • a maze
  • Google Maps
  • the 8-puzzle
  • tic-tac-toe

In-class thought experiment

Imagine you wanted to represent the first two years’ worth of COMPSCI courses (121, 186, 187, 220, 230, 240, 250) for majors (and their prerequisites) as a graph. What would it look like?

Graph abstraction and algorithms

Each of the previous examples can, if you squint at it correctly, be viewed as a graph. There is some “space” (finite or otherwise) of discrete points, and some points are connected to others.

This corresponds exactly to a graph. And what’s interesting here is that there are many algorithms that operate generally on graphs, regardless of the underlying problem. So we can write them (once) and solve many kinds of problems. Most common are things like:

  • searching: is there a path from one vertex to another, and what is it?
  • shortest paths: of all the paths between two vertices, which is cheapest?
  • connectivity: which vertices have paths between them (connected components)?

and many, many more.

Total vs partial knowledge of a graph

You can know (or compute) the entire graph “ahead of time” when it’s both small and knowable – for example, the maze from earlier. That is, you can create an ADT that allows you to set and access nodes and edges (and associated annotations) and instantiate it.

For some problems, the graph is too large to keep track of (e.g., the game tree of chess contains around 10^123 positions). But obviously we have computer programs that can play chess. How do you do it? You generate a “partial view” of the state space, where you can find the “successors” of a particular state (on board) and their successors, and so on, up until you’re “out of time” to think more or out of space to store more, and do the best you can with this partial view.

How might these two kinds of ADTs – total vs partial – look in practice?

ADT for graphs

We need to be able to add and query lists (or sets) of vertices and edges. Of course, edges are just links between two vertices, so we needn’t have a separate data type. What might this look like in the simple case, where we don’t worry about annotations? Something like:

public interface UndirectedGraph<V> {
  void addVertex(V v);
  boolean hasVertex(V v);
  Set<V> vertices();

  void addEdge(V u, V v);
  boolean hasEdge(V u, V v);  
  Set<V> neighborsOf(V v);
}
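One plausible way to implement this interface (a sketch, not the only way) is an adjacency list: a map from each vertex to the set of its neighbors.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AdjacencyListGraph<V> implements UndirectedGraph<V> {
  private final Map<V, Set<V>> adjacency = new HashMap<>();

  public void addVertex(V v) {
    adjacency.putIfAbsent(v, new HashSet<>());
  }

  public boolean hasVertex(V v) {
    return adjacency.containsKey(v);
  }

  public Set<V> vertices() {
    return adjacency.keySet();
  }

  public void addEdge(V u, V v) {
    addVertex(u); // ensure both endpoints exist
    addVertex(v);
    adjacency.get(u).add(v); // undirected: record the edge in both directions
    adjacency.get(v).add(u);
  }

  public boolean hasEdge(V u, V v) {
    return adjacency.containsKey(u) && adjacency.get(u).contains(v);
  }

  public Set<V> neighborsOf(V v) {
    return adjacency.get(v);
  }
}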

What about if we are just concerned with a partial view of the graph? Maybe something like this:

public interface PartialUndirectedGraph<V> {
  List<V> neighborsOf(V v);
}

The implementation of a partial view would have to know quite a bit about the underlying problem in order to generate the neighbors, but on the other hand, you don’t need to generate everything, just the bits of the graph you care about.
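As a toy example (hypothetical, not from the lecture), here’s a partial view of an infinite graph – the integer number line, where each integer is adjacent to the integers on either side. We never materialize the whole graph; we just compute neighbors on demand:

import java.util.List;

public class NumberLineGraph implements PartialUndirectedGraph<Integer> {
  public List<Integer> neighborsOf(Integer v) {
    return List.of(v - 1, v + 1); // computed on demand; the full graph is infinite
  }
}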

Searching a graph

How might we go about trying to solve a search problem? That is, suppose we had a graph that had been instantiated a particular way. We’re given a start vertex, and we want to see if there’s a path in that graph to the end vertex. As a human, we can just look at a small graph and decide, but larger graphs eventually can’t just be glanced at. What’s a methodical way to check?

Let’s work through an example:

(on board, graph S,1,2,3,4,G where 1, 2, and 3 are fully connected to each other and 4 is only connected to 3).

The idea behind searching a graph is that we want to systematically examine it, starting at one point, looking for a path to another point. We do so by keeping track of a list of places to be explored (the “frontier”). We repeat the following steps until the frontier is empty or our goal is found:

  • Remove a vertex from the frontier.
  • If it’s the goal, we’re done.
  • Otherwise, add each of its not-yet-explored neighbors to the frontier.
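Here’s a sketch of what that loop might look like in code (my sketch, using the UndirectedGraph interface from above; whether the frontier behaves as a stack or a queue turns out to matter, as we’ll see):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

static <V> boolean isReachable(UndirectedGraph<V> graph, V start, V goal) {
  Set<V> visited = new HashSet<>();
  Deque<V> frontier = new ArrayDeque<>();
  frontier.add(start);
  while (!frontier.isEmpty()) {
    V current = frontier.remove(); // take the next place to explore
    if (current.equals(goal)) {
      return true; // found a path to the goal
    }
    if (visited.add(current)) { // add returns false if we've already been here
      for (V neighbor : graph.neighborsOf(current)) {
        if (!visited.contains(neighbor)) {
          frontier.add(neighbor); // remember to explore this later
        }
      }
    }
  }
  return false; // frontier is empty: no path exists
}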

More next week!