CMPSCI 187: Programming With Data Structures
============================================

Today's topics
--------------

-   administrivia
-   a personal note
-   selecting
-   hashing

Administrivia
=============

Reminders
---------

-   A12 is due Thursday at 8:30

-   Last discussion and course evaluation is Thursday (on a Monday
    schedule)

-   Final Exam! Monday at 6pm in Totman!

A personal note
===============

I just want to say: I'm so proud of you all. It's certainly the case
that we've worked you hard, but we do so because we really want you to
learn and not just memorize for the tests.

We made you drink from the firehose, and we have to make choices about
what to cover and what not to cover. Lots of important stuff got left
out or under-emphasized. I don't want you to believe CS is just about
programming ADTs, much as fine art isn't just about drawing fruit
bowls, chemistry isn't just about titrating at the lab bench, and
writing isn't just about freshman comp exercises.

This class is many things; it's the second course in the major, it's the
first "hard" CS course, it's the course other instructors love to hate
for not covering something they think should be covered freshman year,
it's the course where you learn to program non-trivially (getting your
first glimpse of data structures, and of software engineering, and of
different ways to approach problems); it's the course where you start to
see some of the theoretical underpinnings of CS (not just how to write
programs). My biggest regret is that we didn't do a little more
application of what we learned -- like the word counting problem, but
bigger and better. It's hard to believe, perhaps, but you're about one
or two weeks away from being able to write a web search engine (or at
least, knowing the abstractions you'd need to do so). That's a big
effing deal, as Joe Biden would say.

I know there are ways we'd improve if we ran this class again (better
autograder feedback, more JUnit tutorials, more debugging tutorials, to
start). But I thank you all for coming, and staying, and really putting
in the necessary effort (and I know it was a lot). It's been my pleasure
to run this class for you.

Now, back to the firehose.

Selecting
=========

Quick select
------------

The basic idea of quick select is to do a *partial* quick sort. Instead
of recursing down both sides of the pivot, you (most of the time)
recurse only on the relevant half.

The index i of the split value tells you that the split value is the
ith smallest element (counting from zero).

If i = k, you're done.

If k \< i, you again find the split value in the portion of the list
before i.

If k \> i, you find the (k - i - 1)th element from the portion of the
list after i.

If, magically, you always chose the median as your split value, you'd do
(n - 1) + (n/2 - 1) + (n/4 - 1) + ... comparisons, which sums to about
2n. In CMPSCI 311, you'll see that quick select is O(n) on average. But
just like quick sort, if you choose the worst case each time, you'll end
up with O(n\^2).

Hashing
=======

Review of collections and find()
--------------------------------

We've seen a number of ways to store a collection, and for each we have
looked at the time needed to *find* a given element. This is the
fundamental operation behind the contains, get, and remove operations.

An unsorted list requires us to make a linear search, taking O(n) time.
With a sorted list, we could use binary search, taking O(log n) time, as
long as we have random access to the elements.

In order to get O(log n) time in a collection where we can also easily
insert and delete, we looked at binary search trees. These allow
insertion, deletion, and finding in O(log n) time each if they are
balanced.

Can we do better? Is there a data structure in which we can insert,
delete, and find all in O(1) time?

Keys and slots
--------------

DJW give the example of a small company where the employee ID’s range
from 0 to 99 and the HR department knows everyone’s ID without having to
look it up.

They can keep the employee record objects in an array, and find it by
using the ID itself as an index.

The general problem of finding an object from a *key* is simplified if
the *key space* is small, as in this instance. If we can afford an array
with one index for each key, all our operations become easy.

We test whether a key is in use by checking that key’s slot for a null
entry. We add a key by replacing a null entry with a real one. We delete
a key by replacing that entry with a null one.
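Those three operations can be sketched directly; this is a minimal illustration of the dedicated-slot idea for the 0-to-99 employee example (the class name and use of plain `Object` records are my own simplifications):

``` {.java}
public class DirectAddressTable {
  // One slot per possible key: IDs 0 through 99.
  private final Object[] slots = new Object[100];

  public boolean contains(int id) { return slots[id] != null; }  // slot in use?
  public void add(int id, Object record) { slots[id] = record; } // fill the slot
  public Object get(int id) { return slots[id]; }                // O(1) lookup
  public void remove(int id) { slots[id] = null; }               // empty the slot
}
```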

The problem is, of course, that our key space is usually uncomfortably
large. UMass keeps student records by eight-digit ID’s, so using this
system would require an array with 100,000,000 entries.

Only a small fraction of these entries would ever be used, since fewer
than a million students have ever attended UMass. US Social Security
numbers have a key space of size 10\^9, a sizable fraction of which is
in use.

Introduction to hashing
-----------------------

*Hashing* is a technique to simulate the dedicated-slot method in the
general situation where only a small fraction of the possible keys are
used.

In the general hashing situation we have a large set of keys and a
smaller set of indices for our array. We choose a *hash function*, which
takes any key and produces an index.

The simplest hash function uses the integer remainder operator `%` -- if
we have m different indices, then for any key k the hash function
produces the index `k % m`, which is in the range from 0 through m - 1.
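One Java wrinkle worth knowing: for a negative k, Java's `%` returns a negative result, so real code often uses `Math.floorMod` to stay in the range 0 through m - 1. A tiny sketch (the class name is my own):

``` {.java}
public class ModHash {
  // Hash an integer key into one of m slots. Java's % can return a
  // negative result for a negative operand; Math.floorMod always
  // lands in 0..m-1.
  static int hash(int k, int m) {
    return Math.floorMod(k, m);
  }
}
```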

Suppose for a moment that on the subset of the keys that we use, our
hash function is what CMPSCI 250 will call a one-to-one function. No two
different keys in use are ever mapped to the same index. What a
wonderful world in which we live.

Further suppose we have a magical `.hashCode()` method on all objects
that returns a unique integer hash code.

In that case, we can use the array just as we used the dedicated-slot
array, inserting, deleting, and finding elements in O(1) time with an
array of some (smaller) fixed length instead of the size of the key
space.

``` {.java}
public class NaiveHashTable<K, V> {

  protected V[] table;

  // constructor as you'd expect

  public void add(K key, V value) {
    // Math.floorMod keeps the index non-negative even when
    // hashCode() is negative.
    int index = Math.floorMod(key.hashCode(), table.length);
    table[index] = value;
  }

  public V get(K key) {
    int index = Math.floorMod(key.hashCode(), table.length);
    return table[index];
  }
}
```

Clicker question: `hashCode`
----------------------------

But in the worst case, there is no way to avoid the possibility that two
different keys, both in use, will be mapped to the same index by our
hash function.

Such a failure of the hash function to be one-to-one is called a
collision.

Clicker question: collisions
----------------------------

Key spaces are normally very large, much larger than the amount of space
we’d like to devote to the collection.

The basic idea is simple -- we define a *hash function* that maps keys
to *hash values*. A hash value is an index into a hash table, the array
in which we will actually store the items.

Important note: **This is an extremely useful abstract data type.** O(1)
mapping of arbitrary (hashable) keys to values is useful *all the time*.

Our hope is that there will be few or no collisions, meaning few or no
pairs of different keys, in use at the same time, that have the same
hash value.

If the number of different possible keys is greater than the size of the
hash table, though, we can’t avoid collisions in the worst case. We’ll
see later how to deal with them.

We need the hash function to be easy to compute. We also require that it
have nothing in particular to do with the meaning of the keys. (Patterns
in the keys can lead to collisions, as you saw during discussion
yesterday.)

Clicker question: A hash function
---------------------------------

Hashing assumptions
-------------------

Although we can’t avoid the problem of collisions, we’ll make some other
assumptions that are fairly realistic but will simplify our discussion
considerably.

We’ll assume that our hash table never gets *full*. That is, no matter
how large the key space is, we will only use a number of keys less than
the size of the hash table. We can make this happen by budgeting enough
table size for the keys that we use.

We will assume that the user will only call the `get` or `remove`
methods for keys that are actually in the database. (We can make them
use `contains` first.)

We will write our `HashTable` class to store items that have a
meaningful `hashCode` -- all `Object`s have this function, but the
default implementation returns only the object's memory address. This
means only objects that are `==` have the same hash code by default, not
a great state of affairs, since you usually want to hash equivalent
objects (by `.equals`) to the same location in a hash table.

Here the `Employee` class is just an example of how we might write
`hashCode`.

``` {.java}
public class Employee {
   protected String name;
   protected int idNum;
   protected int yearsOfService;

   public Employee (String name, int id, int years) {
      this.name = name; idNum = id; yearsOfService = years;
   }

   public int hashCode( ) {
      return idNum;
   }
}
```

Since we're expecting a unique `idNum`, we can just return it. You saw
another, more complicated example in discussion yesterday.

Dealing with collisions
-----------------------

Here’s a simple way to resolve collisions. If the slot where the hash
function tells you to add is full, try the next, then the next, and so
on until you find an empty one. To get an element, try the hash
function’s place first, then the succeeding places until you find it. On
average, if the table is not very full, you shouldn’t have to look long.

To do this, we need to track both keys and values (to know when we
should skip ahead).

(diagram on board)

``` {.java}
public class LinearProbingHashTable<K, V> {

  protected K[] keys;
  protected V[] values;

  public void add(K key, V value) {
    // floorMod keeps the index non-negative for negative hash codes
    int index = Math.floorMod(key.hashCode(), keys.length);
    while (keys[index] != null) {
      index = (index + 1) % keys.length;  // wrap around at the end
    }
    keys[index] = key;
    values[index] = value;
  }

  public V get(K key) {
    int index = Math.floorMod(key.hashCode(), keys.length);
    while (!keys[index].equals(key)) {
      index = (index + 1) % keys.length;
    }
    return values[index];
  }
}
```

Here, we're assuming that calls to `get` (and thus to `contains` and
`remove`, by implication) are always for elements that are in the table.

If we never remove, though, we are better off -- we can give up our
search when we reach a null array entry. This would also allow us to
have a contains method that searches until it finds the element or finds
a null entry. The contains method would have worst-case O(n) running
time, but would usually run faster unless the table were very full.
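That null-stopping `contains` can be sketched as follows; this assumes a linear-probing table with no removals, and the standalone class and method names are my own:

``` {.java}
public class ProbingContains {
  // contains for a linear-probing table with no removals: stop as soon
  // as we hit a null slot, since the key could never have probed past
  // an empty slot when it was inserted. Worst case O(n), usually fast.
  static <K> boolean contains(K[] keys, K key) {
    int index = Math.floorMod(key.hashCode(), keys.length);
    for (int probes = 0; probes < keys.length; probes++) {
      if (keys[index] == null) return false;     // empty slot: not present
      if (keys[index].equals(key)) return true;  // found it
      index = (index + 1) % keys.length;         // try the next slot
    }
    return false;  // table completely full and key absent
  }
}
```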

Clicker question: Why no removals?
----------------------------------

The main problem with the above approach -- *linear probing* -- is that
you end up with *clusters* of full sections in the hash table.

There are ways around this. One is to use a different function to find
the "next" index -- for example, a function that depends not just on
the index, but also on the number of probes so far. (You'll see this
again in 311: *quadratic probing*.)

Another method is to change how collisions are handled entirely:
*buckets and chaining*. The intuition is that each cell in the array is
a bucket that holds *all* of the elements that hash to that index. How?
By storing a `LinkedList<K>` at each cell.

(on board)

We'll omit the code, but you should now know enough to know how to build
it yourself!
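If you do want a starting point, here's one possible sketch (all names are my own; a map needs key/value pairs in each bucket, so this stores a small `Entry` object rather than bare keys):

``` {.java}
import java.util.LinkedList;

public class ChainedHashTable<K, V> {

  // Each bucket holds the key/value pairs that hash to its index.
  private static class Entry<K, V> {
    final K key;
    V value;
    Entry(K key, V value) { this.key = key; this.value = value; }
  }

  private final LinkedList<Entry<K, V>>[] buckets;

  @SuppressWarnings("unchecked")
  public ChainedHashTable(int size) {
    buckets = new LinkedList[size];
    for (int i = 0; i < size; i++) buckets[i] = new LinkedList<>();
  }

  private int indexFor(K key) {
    return Math.floorMod(key.hashCode(), buckets.length);
  }

  public void add(K key, V value) {
    for (Entry<K, V> e : buckets[indexFor(key)]) {
      if (e.key.equals(key)) { e.value = value; return; }  // replace existing
    }
    buckets[indexFor(key)].add(new Entry<>(key, value));
  }

  public V get(K key) {
    for (Entry<K, V> e : buckets[indexFor(key)]) {
      if (e.key.equals(key)) return e.value;
    }
    return null;  // not present
  }
}
```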

Characterizing performance
--------------------------

The linked list in each bucket needn't be short, but it probably will
be -- as long as the table is mostly empty. We call the fraction of the
table that's in use the *load factor*. More precisely, it's the average
number of keys per bucket (\# keys / total \# buckets). The load
factor can exceed 1! But if it's small, most probes will be fast on
average.

Clicker question: load factor
-----------------------------

Hash tables in Java
-------------------

You don't actually need to build this yourself. Java has a `Hashtable`
that does it for you. `HashSet` and `HashMap` are similar and implement
the `Set` and `Map` interfaces (note `Hashtable` implements `Map`).
There are also tree-based implementations of these interfaces,
`TreeSet` and `TreeMap`.

**This is an extremely useful abstract data type.** Almost all
mainstream languages post-Java, especially interpreted ones, have
direct language support for hash tables, since they're so useful.
Sometimes they're called *associative arrays* or *dictionaries*, but
whenever you need to store a set of things where you have (non-integer,
but hashable) keys and values, reach for a hash table.
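For a taste, here's the word counting problem mentioned earlier, done with the built-in `HashMap` (the class and method names are my own):

``` {.java}
import java.util.HashMap;
import java.util.Map;

public class WordCount {
  // Count word frequencies with Java's built-in HashMap: expected
  // O(1) time per lookup and insert.
  static Map<String, Integer> count(String[] words) {
    Map<String, Integer> counts = new HashMap<>();
    for (String w : words) {
      // getOrDefault returns 0 the first time we see a word
      counts.put(w, counts.getOrDefault(w, 0) + 1);
    }
    return counts;
  }
}
```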
