15: Searching and Sorting

Announcements

Reminder: Quiz Monday! Not even foolin’.

Even more thinking about efficiency

Recall, our rule of thumb is this: things take either:

  • a small, constant amount of time, which we’ll approximate as ‘about one unit’, or
  • an amount of time dependent upon some variable or variables.

Just a few more things to consider.

First, just because you see a loop, it doesn’t mean that a method runs in non-constant time. For example:

int sumFirstThree(int[] a) {
  int sum = 0;
  for (int i = 0; i < 3; i++) {
    sum += a[i];
  }
  return sum;
}

…runs in constant time. Nor does there have to be an array in the parameter list (or as an instance variable, etc.) to trigger non-constant time behavior.

Related to this, remember that some things can be computed in different ways.

For example, to compute the sum of the numbers from 1 to n, you can either:

  • add the numbers from 1 to n one at a time in a loop, or
  • compute n * (n + 1) / 2 directly.

Clicker questions

int sumTo(int n) {
  int sum = 0;
  for (int i = 0; i <= n; i++) {
    sum += i;
  }
  return sum;
}

vs

int sumTo(int n) {
  return (n * (n + 1)) / 2;
}

Different algorithms that accomplish the same goal can have different running time behaviors.
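To see this concretely, here is a sketch of both versions side by side (the method names are mine, not from the slides). They return the same value, but the first performs n additions while the second does a constant amount of arithmetic:

```java
public class SumTo {
    // linear time: one addition per value of i
    static int sumToLoop(int n) {
        int sum = 0;
        for (int i = 0; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    // constant time: Gauss's closed-form formula
    static int sumToFormula(int n) {
        return (n * (n + 1)) / 2;
    }

    public static void main(String[] args) {
        System.out.println(sumToLoop(100));    // 5050
        System.out.println(sumToFormula(100)); // 5050
    }
}
```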

Asymptotes

Finally, remember we’ve been talking about running time (mostly) on the basis of the degree of the polynomial (linear, quadratic, etc.). But remember that each variable has a coefficient, and that depending upon your value of n and the coefficients, linear-time algorithms are not strictly better than quadratic. For a silly example, consider which is faster:

  • an algorithm with runtime of 1,000,000 n
  • an algorithm with runtime (n^2)/ 1,000,000

For “large enough” values of n, the first algorithm is faster. How large? (Set them equal to one another and solve for n.) “Large enough” in this case is n = 10^12. Normally the coefficients aren’t quite this lopsided, but it is true in practice that a quadratic algorithm with a small coefficient is sometimes faster than a linear (or otherwise better-polynomial) algorithm with a large coefficient, for small but reasonable values of n.
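A quick sanity check of that crossover, treating the two (made-up) cost formulas above as plain functions; this is a sketch of the arithmetic, not a benchmark:

```java
public class Crossover {
    // the two hypothetical runtime models from the example above
    static double linearCost(double n)    { return 1_000_000.0 * n; }
    static double quadraticCost(double n) { return (n * n) / 1_000_000.0; }

    public static void main(String[] args) {
        // well below the crossover at n = 10^12, the quadratic algorithm wins
        System.out.println(quadraticCost(1e11) < linearCost(1e11)); // true
        // well above it, the linear algorithm wins
        System.out.println(linearCost(1e13) < quadraticCost(1e13)); // true
    }
}
```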

Analyzing search, linear and binary

Let’s look at one particular method on lists: indexOf. indexOf searches a list for an element and returns its index, or -1 if it’s not found. How long must a search take?

Well, knowing nothing else, we have to check every element of the list (or array, etc.). So? It’s linear, right? Something like:

private E[] array; // note this isn't quite true

int indexOf(E e) {
  for (int i = 0; i < array.length; i++) {
    if (e.equals(array[i])) return i;
  }
  return -1;
}

Linear. But (and this is a big but and I cannot lie) if we know something more about the list, we can leverage that to not have to search the whole list.

For example, if the list is sorted. You know, like a telephone book, or a dictionary, or your phone’s address book, or basically anything that’s long and linear but where we want fast access to an arbitrary entry.

From Downey §12.8:

When you look for a word in a dictionary, you don’t just search page by page from front to back. Since the words are in alphabetical order, you probably use a binary search algorithm:

  • Start on a page near the middle of the dictionary.
  • Compare a word on the page to the word you are looking for. If you find it, stop.
  • If the word on the page comes before the word you are looking for, flip to somewhere later in the dictionary and go to step 2.
  • If the word on the page comes after the word you are looking for, flip to somewhere earlier in the dictionary and go to step 2.

If you find two adjacent words on the page and your word comes between them, you can conclude that your word is not in the dictionary.

We can leverage this to write a faster search algorithm, called “binary search”. It’s called this because each time through the loop, it eliminates half of the possible entries, unlike a regular linear search that eliminates only one. It looks like this:

int indexOf(E e) {
  int low = 0;
  int high = array.length - 1;
  while (low <= high) {
    int mid = low + (high - low) / 2; // step 1 (written this way to avoid int overflow when low + high exceeds Integer.MAX_VALUE)
    int comp = array[mid].compareTo(e);

    if (comp == 0) { // step 2
      return mid;
    } else if (comp < 0) { // step 3
      low = mid + 1;
    } else { // comp > 0 // step 4
      high = mid - 1;
    }
  }
  return -1;
}

How long does this take to run?

Each time through the loop, we cut the distance between low and high in half. After k iterations, the number of remaining cells to search is array.length / 2^k. To find the number of iterations it takes to complete (in the worst case), we set array.length / 2^k = 1 and solve for k. The result is log_2 array.length. This is sub-linear: for an array of 1,000 elements, it’s about 10; a million elements, about 20; a billion elements, about 30; and so on.
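Those figures are easy to verify by just counting halvings (a small sketch; halvings is my name for it, not standard library code):

```java
public class Halvings {
    // count how many times n can be halved before only one cell remains:
    // roughly the worst-case number of binary search iterations, log_2 n
    static int halvings(long n) {
        int count = 0;
        while (n > 1) {
            n /= 2;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(halvings(1_000));         // 9  (about 10)
        System.out.println(halvings(1_000_000));     // 19 (about 20)
        System.out.println(halvings(1_000_000_000)); // 29 (about 30)
    }
}
```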

The downside, of course, is that we have to keep the array sorted. How do we do that?

Sorting

The binary search algorithm is a wonderful way to find an element in a list in less than n steps (log_2 n), even in the worst case, so long as the list is sorted.

Many operations can be performed more quickly on a sorted data set. Not to mention people often like to view sorted, rather than unsorted data (think about spreadsheets, indices, address books, etc.).

We’re next going to turn our attention to several sorting algorithms: methods for transforming unsorted lists or arrays into sorted ones. We’ll be using comparison-based sorts, where elements must be directly comparable (in Java: Comparable, or of a primitive type with a natural ordering). There are other approaches that you’ll learn about in COMPSCI 311, like radix sort.

We’ll pay particular attention to running times of these algorithms, but also think about space requirements, access requirements (e.g., random access, like an array), and behavior in multiple cases (not just worst case, but perhaps best or average case). Again, more to come in 187.

A note

We’ll think about runtimes in terms of swaps and comparisons.

We care about swaps as they are the basic way to reorder elements in an indexed list, like an array or ArrayList. (Note that some algorithms can be made to work on other abstractions.) A swap in an array or array-like structure usually requires a small, constant amount of space (equal to one element) to hold the swap variable, and a small, constant amount of time:

public static void swap(int[] array, int i, int j) {
  int t = array[i];
  array[i] = array[j];
  array[j] = t;
}

As for comparisons: in the code I show, I’ll use arrays of ints, because it’s shorter. But in general, you could sort arrays of objects (that have an ordering) using element.compareTo(other) < 0 rather than, say, element < other, or by instantiating and using an appropriate Comparator.
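For example, here’s the same “less than” test written both ways (String stands in for any Comparable element type; the Comparator shown is one that ships with the standard library):

```java
import java.util.Comparator;

public class Comparisons {
    public static void main(String[] args) {
        // for objects, a < b doesn't compile; use compareTo from Comparable:
        System.out.println("apple".compareTo("banana") < 0); // true

        // or use a Comparator; this one is built into the standard library:
        Comparator<String> ignoreCase = String.CASE_INSENSITIVE_ORDER;
        System.out.println(ignoreCase.compare("Apple", "banana") < 0); // true
    }
}
```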

A first sorting algorithm: selection sort

(Note to self: definitely don’t accidentally do bubble sort instead, that would be ridiculous! I mean, I certainly didn’t do that in Fall ‘16 or anything.)

It turns out that, like skinning a cat, there’s more than one way to sort. See: https://en.wikipedia.org/wiki/Sorting_algorithm and https://www.toptal.com/developers/sorting-algorithms for example.

One way is to find the first (say, smallest) thing, and put it in the first position. Then find the next-smallest thing, and put it in the second position. Then find the third, and so on.

We find the thing using a simple linear search.

If we “put it in the ith position” using a swap, we don’t need an entire list’s (O(n)) worth of extra space, just a single element.

(on board with list 5 3 7 1)

This is called selection sort, because we select each element that we want, one-by-one, and put them where we want them.

static int indexOfMinimum(int[] array, int startIndex) {
  int minIndex = startIndex;
  for (int i = startIndex + 1; i < array.length; i++) {
    if (array[i] < array[minIndex]) {
      minIndex = i;
    }
  }
  return minIndex;
}

static void selectionSort(int[] array) {
  for (int i = 0; i < array.length - 1; i++) {
    // you could just:
    // swap(array, i, indexOfMinimum(array, i));
    // but for maximum efficiency, instead:
    int index = indexOfMinimum(array, i);
    if (index != i) {
      swap(array, i, index);
    }
  }
}

Let’s also add some printing code to see this in action:

static void printArray(int[] array) {
  for (int i: array) {
    System.out.print(i + " ");
  }
  System.out.println();
}

public static void main(String[] args) {
  int[] array = new int[] {5, 3, 7, 1};
  printArray(array);
  selectionSort(array);
  printArray(array);
}

(We can also add printArray inside the sort method’s loop to watch it work step by step.)

In-class exercise

For an array containing n elements, what is the largest number of comparisons that one invocation of indexOfMinimum might perform?

What is the worst case for selection sort? That is, what input order on n inputs causes selection sort to make the greatest number of comparisons?

a. the input is already sorted
b. the input is in reverse sorted order
c. the order {n, 1, 2, …, n-1}
d. selection sort’s behavior does not depend upon input order

Not a clicker question but worth asking of the class: how many swaps does selection sort make?

Back to selection sort

How bad is selection sort, really? Let’s think about comparisons and swaps.

There are exactly n-1 comparisons the first time, n-2 the second time, and so on. This sums up to n(n-1)/2.

There are exactly n-1 swaps made (note that some could be no-ops, when i == indexOfMinimum(array, i), in other words, when the element is already in the right place).

If comparisons and swaps are both about constant cost, then this algorithm is O(n^2) – the cost is dominated by the comparisons.

Even so, if swaps are much more expensive (a bigger constant), selection sort can be OK, since it bounds the number of swaps to be at most (n-1). But you need to know what’s going to be more expensive in advance.
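To check those counts, here’s an instrumented copy of selection sort that tallies comparisons and swaps as it goes (the counters are my addition, not part of the algorithm):

```java
public class SelectionCounts {
    // same algorithm as selectionSort above, but counting as it goes
    static long comparisons = 0;
    static long swaps = 0;

    static void selectionSortCounting(int[] array) {
        for (int i = 0; i < array.length - 1; i++) {
            int minIndex = i;
            for (int j = i + 1; j < array.length; j++) {
                comparisons++;
                if (array[j] < array[minIndex]) {
                    minIndex = j;
                }
            }
            if (minIndex != i) {
                int t = array[i];
                array[i] = array[minIndex];
                array[minIndex] = t;
                swaps++;
            }
        }
    }

    public static void main(String[] args) {
        int[] array = {5, 4, 3, 2, 1}; // n = 5, reverse sorted
        selectionSortCounting(array);
        System.out.println(comparisons); // n(n-1)/2 = 10, regardless of input order
        System.out.println(swaps);       // 2 here; never more than n-1 = 4
    }
}
```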

Bubble sort

Here’s another sorting algorithm:

Start at the end of the list (element n-1). Then compare against the previous element (n-2). If the element at (n-1) is smaller, swap it with the element at (n-2). Then compare the element at (n-2) with the element at (n-3). And so on, all the way to the 0th element.

This will move the smallest element to the zeroth index of the list.

Now repeat, but stop at the 1st element. This will move the second-smallest element to the 1st index. Then repeat again, and so on, until the list is sorted.

Each time the algorithm repeats, the ith smallest element “bubbles up” to the front of the list; this is called a bubble sort.

static void bubbleUp(int[] array, int stopIndex) {
  for (int i = array.length - 1; i > stopIndex; i--) {
    if (array[i] < array[i - 1]) {
      swap(array, i, i - 1);
    }
  }
}

static void bubbleSort(int[] array) {
  for (int i = 0; i < array.length; i++) {
    bubbleUp(array, i);
  }
}

(on board with list 5 3 7 1)

In-class exercise

In the worst case, how many swaps will bubble sort make before completing?

What’s the worst case for bubble sort? (ask class :) For the number of comparisons, it doesn’t matter! Just like selection sort, the loops (and their exit conditions) don’t depend upon the contents of the array. The number of swaps, though, does depend on the input: a reverse-sorted array forces a swap at every comparison.
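A similar instrumented sketch for bubble sort makes the swap behavior visible: the comparisons are fixed, but the swaps range from zero (already sorted) up to n(n-1)/2 (reverse sorted). The counting method is my addition:

```java
public class BubbleSwapCount {
    // bubble sort as above, but returning the number of swaps performed
    static long bubbleSortCountingSwaps(int[] array) {
        long swaps = 0;
        for (int stop = 0; stop < array.length; stop++) {
            for (int i = array.length - 1; i > stop; i--) {
                if (array[i] < array[i - 1]) {
                    int t = array[i];
                    array[i] = array[i - 1];
                    array[i - 1] = t;
                    swaps++;
                }
            }
        }
        return swaps;
    }

    public static void main(String[] args) {
        // reverse-sorted input of n = 6: every comparison swaps, n(n-1)/2 = 15
        System.out.println(bubbleSortCountingSwaps(new int[] {6, 5, 4, 3, 2, 1})); // 15
        // already-sorted input: same comparisons, zero swaps
        System.out.println(bubbleSortCountingSwaps(new int[] {1, 2, 3, 4, 5, 6})); // 0
    }
}
```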