Lecture 15: Sorting

Announcements

Quiz Monday.

Sorry about the Assignment 07 switcheroo. Experimental course means sometimes things don’t go perfectly.

Assignment 08 will be posted when it’s ready. Probably you’ll get another day’s reprieve.

Even more thinking about efficiency

Recall our rule of thumb. An operation takes either:

  • a small, constant amount of time, which we’ll approximate as ‘about one unit’, or
  • an amount of time that depends upon some variable or variables

Just a few more things to consider.

First, just because you see a loop, it doesn’t mean that a method runs in non-constant time. For example:

int sumFirstThree(int[] a) {
  int sum = 0;
  for (int i = 0; i < 3; i++) {
    sum += a[i];
  }
  return sum;
}

…runs in constant time. Nor does there have to be an array in the parameter list (or as an instance variable, etc.) to trigger non-constant time behavior.

Related to this, remember that some things can be computed in different ways. For example, to compute the sum of numbers from 1 to n:

int sumTo(int n) {
  int sum = 0;
  for (int i = 1; i <= n; i++) {
    sum += i;
  }
  return sum;
}

…runs in time linear in n (the loop body executes n times). But we can do it in constant time, too:

int sumTo(int n) {
  return (n * (n + 1)) / 2;
}
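
As a quick sanity check, here is a minimal sketch that runs both approaches side by side. The class name SumToCheck and the method names sumToLoop and sumToFormula are made up for this example (the two versions need different names to live in one class). It assumes n is small enough that n * (n + 1) fits in an int.

```java
class SumToCheck {
  // Linear time: one addition per value from 1 to n.
  static int sumToLoop(int n) {
    int sum = 0;
    for (int i = 1; i <= n; i++) {
      sum += i;
    }
    return sum;
  }

  // Constant time: closed-form formula (overflows an int for large n).
  static int sumToFormula(int n) {
    return (n * (n + 1)) / 2;
  }

  public static void main(String[] args) {
    System.out.println(sumToLoop(100));    // prints 5050
    System.out.println(sumToFormula(100)); // prints 5050
  }
}
```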

Different algorithms can have different behaviors.

Finally, remember we’ve been talking about run-time (mostly) on the basis of the degree of the polynomial (linear, quadratic, etc.). But remember that each term has a coefficient, and that depending upon your value of n and the coefficients, linear-time algorithms are not always faster than quadratic ones. For a silly example, consider which is faster:

  • an algorithm with runtime of 1,000,000 n
  • an algorithm with runtime of (n^2) / 1,000,000

For “large enough” values of n, the first algorithm is faster. How large? (set them equal to one another and solve for n) “Large enough” in this case is 10^12. Normally the coefficients aren’t quite this lopsided, but it is true in practice that sometimes, a small-coefficient quadratic algorithm is faster than a larger-coefficient (but better polynomial) algorithm for small but reasonable values of n.
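
To make the crossover concrete, here is a small sketch that evaluates both cost formulas at a few values of n. The class and method names (CrossoverDemo, linearCost, quadraticCost, faster) are invented for illustration, and BigInteger is used only because n^2 does not fit in a long at these sizes.

```java
import java.math.BigInteger;

class CrossoverDemo {
  static final BigInteger MILLION = BigInteger.valueOf(1_000_000L);

  // Cost model from the first bullet: 1,000,000 n
  static BigInteger linearCost(long n) {
    return MILLION.multiply(BigInteger.valueOf(n));
  }

  // Cost model from the second bullet: (n^2) / 1,000,000
  // (integer division; exact for the values of n used below)
  static BigInteger quadraticCost(long n) {
    BigInteger big = BigInteger.valueOf(n);
    return big.multiply(big).divide(MILLION);
  }

  // Reports which cost model is smaller at this n.
  static String faster(long n) {
    int c = linearCost(n).compareTo(quadraticCost(n));
    if (c < 0) return "linear";
    if (c > 0) return "quadratic";
    return "tie";
  }

  public static void main(String[] args) {
    long[] sizes = {1_000L, 1_000_000L, 1_000_000_000_000L, 10_000_000_000_000L};
    for (long n : sizes) {
      System.out.println("n = " + n + ": " + faster(n));
    }
  }
}
```

Below 10^12 the quadratic formula gives the smaller cost, at exactly 10^12 the two tie, and above it the linear formula wins.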

Sorting

We finished last class by talking about the binary search algorithm, a wonderful way to find an element in a list in far fewer than n steps (about log_2 n), even in the worst case, so long as the list is sorted.
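
As a reminder, one possible iterative version is sketched below. This is not necessarily the exact code from last class; the class name BinarySearchSketch is made up, and returning -1 for "not found" is an assumption of this sketch.

```java
class BinarySearchSketch {
  // Returns an index of target in the sorted array, or -1 if it is absent.
  static int binarySearch(int[] sorted, int target) {
    int low = 0;
    int high = sorted.length - 1;
    while (low <= high) {
      int mid = low + (high - low) / 2; // avoids overflow of (low + high)
      if (sorted[mid] == target) {
        return mid;
      } else if (sorted[mid] < target) {
        low = mid + 1;  // target can only be in the right half
      } else {
        high = mid - 1; // target can only be in the left half
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    int[] array = {1, 3, 5, 7};
    System.out.println(binarySearch(array, 5)); // prints 2
    System.out.println(binarySearch(array, 4)); // prints -1
  }
}
```

Each trip through the loop halves the range still under consideration, which is where the log_2 n comes from.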

Many operations can be performed more quickly on a sorted data set. Not to mention people often like to view sorted, rather than unsorted data (think about spreadsheets, indices, address books, etc.).

We’re next going to turn our attention to several sorting algorithms, methods for transforming unsorted lists or arrays into sorted ones. We’ll be using comparison-based sorts, where elements must be directly comparable (in Java: Comparable). There are other approaches that you’ll learn about in COMPSCI 187 and 311.

We’ll pay particular attention to runtimes of these algorithms, but also look at space requirements, access requirements (e.g., random access, like an array), and behavior in multiple cases (not just worst case, but perhaps best or average case). Again, more to come in 187.

A note

We’ll think about runtimes in terms of swaps and comparisons.

We care about swaps as they are the basic way to reorder elements in an indexed list, like an array or ArrayList. (Note that some algorithms can be made to work on other abstractions.) A swap in an array or array-like structure usually requires a small constant amount of space (equal to one element) to hold the swap variable.

public static void swap(int[] array, int i, int j) {
  int t = array[i];
  array[i] = array[j];
  array[j] = t;
}

In the code I show, I’ll use arrays of ints, because it’s shorter. But in general, you could sort arrays of objects using element.compareTo(other) < 0 rather than say element < other.
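
For the curious, here is a rough sketch of what the object version might look like. The class name ComparableSketch and the helper less are made up for this example; only compareTo and the Comparable interface come from Java itself.

```java
class ComparableSketch {
  // Same idea as the int version of swap, but for arrays of any object type.
  static <T> void swap(T[] array, int i, int j) {
    T t = array[i];
    array[i] = array[j];
    array[j] = t;
  }

  // Stands in for "element < other" when the elements are Comparable objects.
  static <T extends Comparable<T>> boolean less(T element, T other) {
    return element.compareTo(other) < 0;
  }

  public static void main(String[] args) {
    String[] names = {"pear", "apple"};
    if (less(names[1], names[0])) {
      swap(names, 0, 1);
    }
    System.out.println(names[0] + " " + names[1]); // prints "apple pear"
  }
}
```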

A first sorting algorithm: selection sort

(Note to self: definitely don’t accidentally do bubble sort instead, that would be ridiculous.)

It turns out that, like skinning a cat, there’s more than one way to sort. See: https://en.wikipedia.org/wiki/Sorting_algorithm and https://www.toptal.com/developers/sorting-algorithms for example.

One way is to find the first (say, smallest) thing, and put it in the first position. Then find the next-smallest thing, and put it in the second position. Then find the third, and so on.

We find the thing using a simple linear search.

If we “put it in the ith position” using a swap, we don’t need an entire list’s (O(n)) worth of extra space, just a single element.

(on board with list 5 3 7 1)

This is called selection sort, because we select each element that we want, one-by-one, and put them where we want them.

static int indexOfMinimum(int[] array, int startIndex) {
  int minIndex = startIndex;
  for (int i = startIndex + 1; i < array.length; i++) {
    if (array[i] < array[minIndex]) {
      minIndex = i;
    }
  }
  return minIndex;
}

static void selectionSort(int[] array) {
  for (int i = 0; i < array.length - 1; i++) {
    // you could just:
    // swap(array, i, indexOfMinimum(array, i));
    // but for maximum efficiency, instead:
    int index = indexOfMinimum(array, i);
    if (index != i) {
      swap(array, i, index);
    }
  }
}

Let’s also add some printing code to see this in action:

static void printArray(int[] array) {
  for (int i: array) {
    System.out.print(i + " ");
  }
  System.out.println();
}

public static void main(String[] args) {
  int[] array = new int[] {5, 3, 7, 1};
  printArray(array);
  selectionSort(array);
  printArray(array);
}

(We can add printArray to the sort method’s loop to see it work.)
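
One way to do that is sketched below. It repeats the helper methods so it can run on its own; the class name SelectionSortTrace and the method selectionSortWithTrace are made up here, and it records a snapshot via Arrays.toString rather than calling printArray so that the snapshots can be inspected afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SelectionSortTrace {
  static void swap(int[] array, int i, int j) {
    int t = array[i];
    array[i] = array[j];
    array[j] = t;
  }

  static int indexOfMinimum(int[] array, int startIndex) {
    int minIndex = startIndex;
    for (int i = startIndex + 1; i < array.length; i++) {
      if (array[i] < array[minIndex]) {
        minIndex = i;
      }
    }
    return minIndex;
  }

  // Sorts the array in place and returns a snapshot of it after each pass.
  static List<String> selectionSortWithTrace(int[] array) {
    List<String> snapshots = new ArrayList<>();
    for (int i = 0; i < array.length - 1; i++) {
      int index = indexOfMinimum(array, i);
      if (index != i) {
        swap(array, i, index);
      }
      snapshots.add(Arrays.toString(array));
    }
    return snapshots;
  }

  public static void main(String[] args) {
    int[] array = {5, 3, 7, 1};
    for (String snapshot : selectionSortWithTrace(array)) {
      System.out.println(snapshot);
    }
    // prints:
    // [1, 3, 7, 5]
    // [1, 3, 7, 5]
    // [1, 3, 5, 7]
  }
}
```

Notice the second pass changes nothing: 3 is already the smallest remaining element, so no swap happens.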

In-class exercises

For an array containing n elements, what is the largest number of comparisons that one invocation of indexOfMinimum might perform?

(n - 1)

What is the worst case for selection sort? That is, what input order on n inputs causes selection sort to make the greatest number of comparisons?

a. the input is already sorted
b. the input is in reverse sorted order
c. the order {n, 1, 2, …, n-1}
d. selection sort’s behavior does not depend upon input order

(d): the input order does not matter!

Back to selection sort

How bad is selection sort, really? Let’s think about comparisons and swaps.

There are exactly n-1 comparisons the first time, n-2 the second time, and so on. This sums up to n(n-1)/2.

There are at most n-1 swaps: one per pass, except that a pass needs no swap at all when i == indexOfMinimum(array, i).

If comparisons and swaps are both about constant cost, then this algorithm is O(n^2) – the cost is dominated by the comparisons.

Even so, if swaps are much more expensive (a bigger constant), selection sort can be good, since it bounds the number of swaps to be at most (n-1).
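
If you want to check these counts yourself, here is a minimal instrumented sketch. The class name SelectionSortCounts and the method sortAndCount are invented for this example; it inlines indexOfMinimum and swap so the counters can sit next to the operations they count.

```java
class SelectionSortCounts {
  // Sorts the array in place and returns {comparisons, swaps}.
  static long[] sortAndCount(int[] array) {
    long comparisons = 0;
    long swaps = 0;
    for (int i = 0; i < array.length - 1; i++) {
      int minIndex = i;
      for (int j = i + 1; j < array.length; j++) {
        comparisons++; // one element-to-element comparison
        if (array[j] < array[minIndex]) {
          minIndex = j;
        }
      }
      if (minIndex != i) {
        int t = array[i];
        array[i] = array[minIndex];
        array[minIndex] = t;
        swaps++;
      }
    }
    return new long[] {comparisons, swaps};
  }

  public static void main(String[] args) {
    long[] counts = sortAndCount(new int[] {5, 3, 7, 1});
    // n = 4, so n(n-1)/2 = 6 comparisons; this input needs 2 swaps
    System.out.println(counts[0] + " comparisons, " + counts[1] + " swaps");
  }
}
```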

Bubble sort

Here’s another sorting algorithm:

Start at the end of the list (element n-1). Then compare against the previous element (n-2). If the element at (n-1) is smaller, swap it with the element at (n-2). Then compare the element at (n-2) with the element at (n-3). And so on, all the way to the 0th element.

This will move the smallest element to the zeroth index of the list.

Now repeat, but stop at the 1st element. This will move the second-smallest element to the 1st index. Then repeat again, and so on, until the list is sorted.

Each time the algorithm repeats, the ith smallest element “bubbles up” to the front of the list; this is called a bubble sort.

static void bubbleUp(int[] array, int stopIndex) {
  for (int i = array.length - 1; i > stopIndex; i--) {
    if (array[i] < array[i - 1]) {
      swap(array, i, i - 1);
    }
  }
}

static void bubbleSort(int[] array) {
  for (int i = 0; i < array.length; i++) {
    bubbleUp(array, i);
  }
}

(on board with list 5 3 7 1)

In-class exercise

In the worst case, how many swaps will bubble sort make before completing?

n(n-1) / 2

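What’s the worst case for bubble sort? (ask class :) For comparisons, it doesn’t matter! Just like selection sort, the loops (and their exit conditions) don’t depend upon the contents of the array. The number of swaps does depend on the input, though: a reverse-sorted array forces a swap after every comparison.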
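
Here is a small instrumented sketch for checking the bubble sort counts. The class name BubbleSortCounts and the method sortAndCount are made up; the loops are the same as bubbleSort and bubbleUp above, merged into one method so the counters are easy to see.

```java
class BubbleSortCounts {
  // Sorts the array in place and returns {comparisons, swaps}.
  static long[] sortAndCount(int[] array) {
    long comparisons = 0;
    long swaps = 0;
    for (int stopIndex = 0; stopIndex < array.length; stopIndex++) {
      for (int i = array.length - 1; i > stopIndex; i--) {
        comparisons++;
        if (array[i] < array[i - 1]) {
          int t = array[i];
          array[i] = array[i - 1];
          array[i - 1] = t;
          swaps++;
        }
      }
    }
    return new long[] {comparisons, swaps};
  }

  public static void main(String[] args) {
    long[] reversed = sortAndCount(new int[] {4, 3, 2, 1});
    long[] sorted = sortAndCount(new int[] {1, 2, 3, 4});
    // reverse-sorted input: 6 comparisons, 6 swaps
    System.out.println(reversed[0] + " comparisons, " + reversed[1] + " swaps");
    // already-sorted input: 6 comparisons, 0 swaps
    System.out.println(sorted[0] + " comparisons, " + sorted[1] + " swaps");
  }
}
```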

Insertion sort

Now let’s turn our attention to another sorting algorithm. This one is similar to how you might sort a handful of cards.

We break the hand up into two parts, sorted and unsorted. Then we add cards one-by-one from the unsorted part into the sorted part. (on board)

Let’s say we start with our old friend 5 3 7 1, an unsorted list on the right, and a sorted (empty) list on the left:

| 5 3 7 1

“insert” the first card on the left of the unsorted array into the sorted array:

5 | 3 7 1

(Note we didn’t actually do anything, just moved an index). Now take the next element, 3.

5 3 | 7 1

We have to move it into the correct position by successively swapping it left until it’s no longer smaller than its predecessor (or until there is no predecessor).

3 5 | 7 1

7 is easy:

3 5 7 | 1

and finally we need to successively move 1:

1 3 5 7 |

and we’re done. This is called insertion sorting, since we take elements one-by-one from the unsorted portion of the list and insert them into the sorted portion.

static void insertIntoSorted(int[] array, int sortedBoundary) {
  for (int i = sortedBoundary; i > 0; i--) {
    if (array[i] < array[i - 1]) {
      swap(array, i, i - 1);
    }
    else break; // you could omit this, but then you'd lose some average-case performance
  }
}

static void insertionSort(int[] array) {
  for (int i = 1; i < array.length; i++) {
    insertIntoSorted(array, i);
  }
}

How many steps to run insertIntoSorted? Worst case is the last insertIntoSorted having to go all the way: n-1 comparisons (and that many swaps as well).

Insertion sort worst case

What is the worst case for insertion sort? That is, what input order on n inputs causes insertion sort to make the greatest number of comparisons?

a. the input is already sorted
b. the input is in reverse sorted order
c. the order {n, 1, 2, …, n-1}
d. insertion sort’s behavior does not depend upon input order

The answer is (b), reverse sorted order: every insertion has to go all the way to the front, for n(n-1)/2 comparisons (and as many swaps). So it turns out to be O(n^2) again, same as selection sort.

But it turns out to be very fast in two particular cases:

  • small inputs: the constant factor for insertion sort is generally lower than for some of the other algorithms you’ll learn about later, like merge sort and heap sort. Usually, for n up to somewhere between 8 and 20 or so, insertion sort’s O(n^2) will outperform merge sort’s O(n log n) or quick sort.
  • nearly-sorted inputs: the best case is an already sorted list, with exactly n-1 comparisons and no swaps; partially-sorted lists take O(nk) time when each element is no more than k positions from where it should be
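
The best-case and worst-case counts can be checked with a small instrumented sketch. The class name InsertionSortCounts and the method sortAndCount are invented for this example; the loops mirror insertionSort and insertIntoSorted above.

```java
class InsertionSortCounts {
  // Sorts the array in place and returns {comparisons, swaps}.
  static long[] sortAndCount(int[] array) {
    long comparisons = 0;
    long swaps = 0;
    for (int boundary = 1; boundary < array.length; boundary++) {
      for (int i = boundary; i > 0; i--) {
        comparisons++;
        if (array[i] < array[i - 1]) {
          int t = array[i];
          array[i] = array[i - 1];
          array[i - 1] = t;
          swaps++;
        } else {
          break; // element has reached its place in the sorted part
        }
      }
    }
    return new long[] {comparisons, swaps};
  }

  public static void main(String[] args) {
    long[] sorted = sortAndCount(new int[] {1, 2, 3, 4});
    long[] reversed = sortAndCount(new int[] {4, 3, 2, 1});
    // already-sorted input: n-1 = 3 comparisons, 0 swaps
    System.out.println(sorted[0] + " comparisons, " + sorted[1] + " swaps");
    // reverse-sorted input: n(n-1)/2 = 6 comparisons, 6 swaps
    System.out.println(reversed[0] + " comparisons, " + reversed[1] + " swaps");
  }
}
```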

Other sorting things

There are other sorting algorithms that can do better than n^2; the best algorithms run in “n log n” time (“mergesort” and “heapsort” are two you’ll see in 187). They have tradeoffs, though, either requiring more than constant space or a higher constant factor (coefficient) than a simple sort like insertion sort.

In practice, most library sort methods, like Arrays.sort and Collections.sort, use a hybrid of approaches, choosing the best algorithm for the task. The most common is Timsort, named after Tim Peters (no joke!), who first implemented it for the Python standard library.
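
Using the library methods looks something like the sketch below. Arrays.sort and Collections.sort are the real library calls; the class name LibrarySortDemo and the two helper methods are made up for this example, and they copy the input first only so the original stays untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class LibrarySortDemo {
  // Returns a sorted copy of an int array, leaving the input unchanged.
  static int[] sortedArrayCopy(int[] input) {
    int[] copy = Arrays.copyOf(input, input.length);
    Arrays.sort(copy);
    return copy;
  }

  // Returns a sorted copy of a list of Strings, leaving the input unchanged.
  static List<String> sortedListCopy(List<String> input) {
    List<String> copy = new ArrayList<>(input);
    Collections.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(sortedArrayCopy(new int[] {5, 3, 7, 1})));
    // prints [1, 3, 5, 7]
    System.out.println(sortedListCopy(Arrays.asList("pear", "apple", "fig")));
    // prints [apple, fig, pear]
  }
}
```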