11: HashSets and TreeSets

Welcome

Announcements

Classes start Monday after break. There will be a quiz on the Monday after spring break. “I didn’t come back until Tuesday” is not a valid excuse for missing a quiz.

Not everyone is a supergenius! Don’t feel bad if you actually have to work in this class! There’s a group of ~20 who maybe should have gone to 187 – don’t compare yourself against them. Maybe look up imposter syndrome and relax (I mean, if you’re passing.).

A reminder: the general policy in this course is that we don’t accept late work. If you require an extension on an assignment, you must ask at least a day in advance and have a reasonable justification. “I didn’t start it until today and it’s harder than I thought” is not a particularly reasonable justification.

A clarification / reminder from the syllabus: assignments (and homeworks) are weighted equally. In particular, each assignment “counts the same” even though some are few points and some are many. The ones with many just have more tests; they’re not worth more points.

On the course honesty policy: Working with a peer is OK. Looking at a peer’s code with your own code open nearby is not – your code must be your own, and this might result in something that looks like copying.

What we look for is two people passing in the same (or basically the same) code; we can’t know if that’s because they copy/pasted, or if they just worked together a little too closely, or what. Don’t give us things to worry about, and it won’t be a problem.

“But Marc, people work together in the real world!” They sure do. But in a basics class like this one, it’s important that you do the work yourself to help build mastery of introductory material. You’ll collaborate more in future CS classes.

Set review

Membership, union, intersection, difference – a quick rehash (see last lecture’s notes.)

In-class exercise

Some questions on math sets.

Set API

Java API for Set: close to, but not the same as, mathematical sets. In particular, Java’s Sets are mutable – you can change them in place – whereas mathematical sets are immutable: an operation such as union produces a new set rather than changing an existing one. You can emulate mathematical sets by defining methods that do not change their arguments, as you will in the current programming assignment.

Also! Java’s Set API has almost the same interface as List, but some methods are missing (you cannot index into a Set using get or the like); also, like math sets, a Set holds only distinct elements, so adding a duplicate has no effect. Why use a Set instead of a List? Because if you care mostly about set-like operations (add, remove, membership testing, and the union and intersection built on top of them), they are generally very, very fast on average. How? It’s an implementation detail we’re going to talk about today.
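Here is a small sketch of both points: a Set silently drops duplicates where a List keeps them, and you can get "math-style" behavior by copying before you change anything. The union helper and the class name are made up for illustration; they are not part of the Java API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SetBasicsDemo {
    // Hypothetical helper (not in the Java API): returns a new set holding
    // every element of a or b, leaving both arguments unchanged.
    static <T> Set<T> union(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a); // copy, so a is not modified
        result.addAll(b);                 // addAll changes only the copy
        return result;
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        Set<String> set = new HashSet<>();
        for (String s : new String[] {"a", "b", "a"}) {
            list.add(s);
            set.add(s); // the second "a" is ignored by the set
        }
        System.out.println(list.size()); // 3
        System.out.println(set.size());  // 2

        Set<String> other = new HashSet<>(List.of("b", "c"));
        Set<String> both = union(set, other);
        System.out.println(both.size());       // 3
        System.out.println(set.contains("c")); // false: set was not changed
    }
}
```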

On HashSets

One possible implementation of the set is the HashSet, which depends upon a correct hashCode method. Why? “This class implements the Set interface, backed by a hash table (actually a HashMap instance).” Now let’s look at the documentation for hashCode: “This method is supported for the benefit of hash tables such as those provided by HashMap.”

Wow, hash tables are so important that every object in Java must supply a hashCode method – it’s built into Object.

hashCode returns an integer, and must obey the contract in its documentation. Let’s look at each piece:

  • It provides something like (but not exactly like!) equality: If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

This implies that if you use a field in an equals method, you should also use it in the hashCode method.

  • It is consistent: Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified.
  • It is not an equality check, though: It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

So you could have a hashCode method that always returned the same integer, like 1, and it would technically obey the contract. But usually, the hashCode of objects is not 1, but instead a large integer. Going back to our old example of (not) aliasing:

String x = new String("foo");
String y = new String("foo");
System.out.println(x == y);
System.out.println(x.equals(y));
System.out.println(x.hashCode());
System.out.println(y.hashCode());

Do you expect them to be ==? No. Do you expect them to be equals? Yes. Do you expect them to have the same hash code? Yes, because of the first property above.
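To see why that first property matters, here is a sketch of what tends to go wrong when a class overrides equals but forgets hashCode. The Point class is hypothetical and exists only for this demo; it inherits Object's hashCode, which typically differs from one object to the next.

```java
import java.util.HashSet;
import java.util.Set;

class BrokenContractDemo {
    // Hypothetical class that overrides equals but NOT hashCode,
    // which breaks the "equal objects, equal hash codes" rule.
    static class Point {
        final int x;

        Point(int x) {
            this.x = x;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x;
        }
    }

    public static void main(String[] args) {
        Point p = new Point(1);
        Point q = new Point(1);
        Set<Point> set = new HashSet<>();
        set.add(p);
        System.out.println(p.equals(q));     // true
        System.out.println(set.contains(p)); // true
        // Very likely false: q hashes differently from p, so the set
        // looks in the wrong place and never gets as far as calling equals.
        System.out.println(set.contains(q));
    }
}
```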

Why does this weird integer result in fast (“constant time”) lookups?

Because you can use it as an index into an array.

In short, “hash tables” are arrays that store objects based upon their “hash code”. If you want to put an element into the array, you figure out the right place to put it by checking its hash code. And if you want to see if an element is in the array, you look up its hash code, then jump to the right spot in the array.

In a perfect world, the array would be big enough to hold everything, and the hash codes would always be unique per-object, and this would all just work. In practice, sometimes there are collisions – more than one object ends up in the same spot in the array. We resolve these collisions in different ways (one way: each element of the array might be a short linked list of elements with the hash code corresponding to that element’s index), and things usually work out with near-constant-time performance.
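The idea above can be sketched in a few lines. This is a deliberately simplified toy, not how java.util.HashSet is actually written (the real one also grows its array as it fills up); the TinyHashSet name and its methods are invented for illustration. Each array slot holds a short linked list, and the hash code picks the slot.

```java
import java.util.LinkedList;
import java.util.List;

class TinyHashSetDemo {
    // Hypothetical toy set of strings using one linked list per array slot.
    static class TinyHashSet {
        private final List<String>[] buckets;

        @SuppressWarnings("unchecked")
        TinyHashSet(int capacity) {
            buckets = new List[capacity];
            for (int i = 0; i < capacity; i++) {
                buckets[i] = new LinkedList<>();
            }
        }

        // Turn a hash code (which may be negative) into a valid array index.
        private int indexFor(String s) {
            return Math.floorMod(s.hashCode(), buckets.length);
        }

        // Returns true if s was added, false if it was already present.
        boolean add(String s) {
            List<String> bucket = buckets[indexFor(s)];
            if (bucket.contains(s)) {
                return false;
            }
            bucket.add(s);
            return true;
        }

        // Jump straight to the right slot, then search only that short list.
        boolean contains(String s) {
            return buckets[indexFor(s)].contains(s);
        }
    }

    public static void main(String[] args) {
        TinyHashSet set = new TinyHashSet(8);
        System.out.println(set.add("foo"));      // true
        System.out.println(set.add("foo"));      // false: already there
        // "Aa" and "BB" have the same hash code (a collision),
        // so they share a slot but are still stored separately.
        System.out.println(set.add("Aa"));       // true
        System.out.println(set.add("BB"));       // true
        System.out.println(set.contains("BB"));  // true
        System.out.println(set.contains("baz")); // false
    }
}
```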

Of course, if we have a hash table that’s too small, or a hash code method that doesn’t, in the words of the Java doc, “as much as is reasonably practical, … return distinct integers for distinct objects”, then everything ends up in just one or two lists and performance is bad.

So it’s critical to write a hashCode method that works. And recall the contract: two objects’ hashCode methods must produce the same value if the two objects are equal according to equals. How might we do this?

Writing a hashCode method

Just like when we wrote our equals method, we have to consider what instance variables comprise this object’s identity. What makes it unique from other objects? Let’s consider a version of our PostalAddress:

import java.time.LocalDateTime;

public class PostalAddress {
    public final String name;
    public final int number;
    // Creation timestamp; deliberately ignored by equals and hashCode.
    public final LocalDateTime created;

    public PostalAddress(String name, int number) {
        this.name = name;
        this.number = number;
        created = LocalDateTime.now();
    }
}

Note that this has a created variable, but let’s say we don’t care about it when checking equality. Therefore we don’t care about it when returning a hashCode, either. So how can we return a hashCode that’s valid? One option is to just return an int:

In-class exercise

public int hashCode() {
    return 1;
}

Valid hash code? Good hash code?

That’s a terrible idea. We want it to depend upon the things that our equals method might depend upon. What are those? name and number. How might we create a hashCode on that basis? number is already an int. Can we get an int out of a String? Lots of methods return int. You can imagine all sorts of convoluted methods that might involve summing up the integer value of the characters stored in the string, and some of these would probably work great. You know what else would work great? The String’s hashCode method. So you might write:

public int hashCode() {
    return number + name.hashCode();
}

as a decent (though maybe not optimal) method. It turns out you need to know a tiny bit of number theory to understand why this isn’t the best option; when you get to either 250 or 311 I guarantee you’ll learn first-hand. But for now, know that this is OK, but not great.
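You can get a feel for the weakness without any number theory. With plain addition, bumping the number down by one and the name's hash up by one lands you on the same total. The sketch below compares the sum against one common alternative (multiply by a small prime before adding each field); the method names are made up for this demo.

```java
class HashMixDemo {
    // The simple version from above: just add the pieces together.
    static int sumHash(String name, int number) {
        return number + name.hashCode();
    }

    // A common alternative: multiply the running result by a small prime
    // before folding in each field, so the fields' positions matter.
    static int mixedHash(String name, int number) {
        int result = 1;
        result = 31 * result + name.hashCode();
        result = 31 * result + number;
        return result;
    }

    public static void main(String[] args) {
        // "a".hashCode() is 97 and "b".hashCode() is 98.
        System.out.println(sumHash("a", 2));   // 99
        System.out.println(sumHash("b", 1));   // 99: a collision
        System.out.println(mixedHash("a", 2)); // 3970
        System.out.println(mixedHash("b", 1)); // 4000: no collision here
    }
}
```

Neither version can avoid collisions entirely (there are more possible addresses than ints), but the second spreads nearby values apart.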

Hey, though, my use of Eclipse to auto-write some of the constructor probably got you thinking, right? Like, if there’s a simple set of rules we can use on the basis of instance variables of interest to create a good hash code, couldn’t we ask the computer to do it for us? And in fact we can. Most modern IDEs will help you write the “boilerplate” methods, and they mostly do a good job. But sometimes they don’t, so you need to either (a) understand what they’re doing, or (b) accept that if they result in weirdo errors, you’ll have to figure it out. Let’s try it now (demo):

@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((name == null) ? 0 : name.hashCode());
    result = prime * result + number;
    return result;
}

@Override
public boolean equals(Object obj) {
    if (this == obj)
        return true;
    if (obj == null)
        return false;
    if (getClass() != obj.getClass())
        return false;
    PostalAddress other = (PostalAddress) obj;
    if (name == null) {
        if (other.name != null)
            return false;
    } else if (!name.equals(other.name))
        return false;
    if (number != other.number)
        return false;
    return true;
}

Remember when I said a “real” equals method was a little more involved than our discount method? There you go.
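If you'd rather write the pair by hand, the standard java.util.Objects class has helpers that handle the null checks for you. The sketch below is one compact way to do it, using a trimmed-down PostalAddress (no created field) nested inside a demo class whose name is made up; it then shows the payoff, which is that a HashSet treats two equal addresses as one element.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class AddressSetDemo {
    // Trimmed-down PostalAddress for illustration only.
    static class PostalAddress {
        final String name;
        final int number;

        PostalAddress(String name, int number) {
            this.name = name;
            this.number = number;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (obj == null || getClass() != obj.getClass()) return false;
            PostalAddress other = (PostalAddress) obj;
            // Objects.equals is null-safe, so name may be null.
            return number == other.number && Objects.equals(name, other.name);
        }

        @Override
        public int hashCode() {
            // Uses exactly the fields that equals uses.
            return Objects.hash(name, number);
        }
    }

    public static void main(String[] args) {
        Set<PostalAddress> set = new HashSet<>();
        set.add(new PostalAddress("Maple St", 1));
        set.add(new PostalAddress("Maple St", 1)); // equal to the first: ignored
        set.add(new PostalAddress("Maple St", 2));
        System.out.println(set.size()); // 2
        System.out.println(set.contains(new PostalAddress("Maple St", 2))); // true
    }
}
```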

On to the trees

So that’s HashSets. If you can reasonably define an equals and hashCode method, you can get pretty good performance (near constant-time) if you use a HashSet. But there’s another option that also gets pretty good performance, the TreeSet. To describe how it works, we need to define trees.

In computer science and mathematics, a tree is a kind of graph. What’s a graph? It’s a set (oh snap!) of vertices (sometimes also called nodes) and edges between those vertices. (On board)

A tree is a particular kind of graph. It has a single vertex called a root “at the top”, and it grows downward (weird, I know, like an upside-down tree). Each vertex in the tree can have “children”, that is, nodes “below” them. For the sake of simplicity, let’s say each vertex has zero, one, or two children, and the children are “left children” or “right children”. Trees that have at most two children per vertex are called “binary trees.”

Remember linked lists? It turns out you can model a tree in code, using something very similar:

class TreeNode<T> {
    T value;
    TreeNode<T> leftChild;   // null if there is no left child
    TreeNode<T> rightChild;  // null if there is no right child

    //...
}

But lucky for you, this is not 187, so we’re just going to draw diagrams to give you an intuition, rather than make you code this up yourself.

Why trees?

OK, so now we’ve just implicitly created a new homework assignment for 187, but who cares, right? It’s just a convoluted list, sort of, right?

Right, and wrong. Depending upon how you organize your tree, you can get very good or very bad performance. If you just stuff items into the tree willy-nilly, then yes, it’s really no better than a linked list, as you’d have to traverse the entire tree to, for example, look to see if an element is there. In some ways it’s worse, because you now also have to write the traversal code for a tree, which is more complicated than the same code for a list. But it turns out if you impose some constraints on the tree, you can do better.

Specifically, let’s say that we require that the left child of a node (and all of that child’s descendants) can only contain values that are less than the current node’s value. And the right child (and its descendants) can only contain values that are greater than the current node’s value. This is called a “binary search tree.”
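You won't be asked to write this in this course, but if you're curious, here is a minimal sketch of that rule in code, using plain ints to keep it short. All of the names (BstDemo, Node, insert, contains, build) are invented for the example. Notice that both insert and contains throw away half of the remaining tree at every step.

```java
class BstDemo {
    static class Node {
        final int value;
        Node left;
        Node right;

        Node(int value) {
            this.value = value;
        }
    }

    // Returns the root of the tree after inserting value.
    // Smaller values go left, larger go right, duplicates are ignored.
    static Node insert(Node root, int value) {
        if (root == null) {
            return new Node(value);
        }
        if (value < root.value) {
            root.left = insert(root.left, value);
        } else if (value > root.value) {
            root.right = insert(root.right, value);
        }
        return root;
    }

    // Walks down from the root, going left or right at each node.
    static boolean contains(Node root, int value) {
        Node current = root;
        while (current != null) {
            if (value == current.value) {
                return true;
            }
            current = (value < current.value) ? current.left : current.right;
        }
        return false;
    }

    // Convenience: insert the values one at a time, in the order given.
    static Node build(int... values) {
        Node root = null;
        for (int v : values) {
            root = insert(root, v);
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = build(4, 2, 6, 1, 3, 5, 7);
        System.out.println(contains(root, 5)); // true
        System.out.println(contains(root, 8)); // false
    }
}
```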

In-class exercise:

Two trees!

How does this “Binary Search Tree” property help? Consider a tree that holds the values 1 through 7. Let’s say I magically decide to insert them into the tree in this order: {4, 2, 6, 1, 3, 5, 7}:

   4
 2   6
1 3 5 7

This tree holds 7 values, and takes at most three comparisons (one per level) to check whether a given value is in the tree or not. If we build an even bigger tree you can see that the tree’s height (the number of levels, which is also the most comparisons a search can need) grows much more slowly than the tree’s size, which is the total number of nodes in the tree.

In-class exercise

Build a binary search tree containing the following values: {3, 2, 6, 10, 5, 1, 9}

Balanced? Not quite.

It’s not quite constant-time lookup (it’s “logarithmic” overhead) but it’s really fast nonetheless. The logarithm grows very slowly: https://en.wikipedia.org/wiki/Logarithm. So if we say a tree containing 10 elements has an overhead of 1, a tree containing 1000 elements (100 times as many!) has an overhead of only 3. And a tree containing 1,000,000 elements has an overhead of only 6. 10^9 elements? Overhead of 9. And so on.
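The numbers above count in powers of ten; for a binary tree the natural unit is powers of two, which changes the figures by a constant factor but not the shape of the growth. As a rough sketch, assuming a tree that is packed as tightly as possible, the number of levels needed for n values is the number of times you can halve n before reaching zero. The levels method is a made-up helper for this demo.

```java
class LevelsDemo {
    // Levels needed to hold n values in a binary tree that is packed
    // as tightly as possible: count how many times n can be halved.
    static int levels(int n) {
        int count = 0;
        while (n > 0) {
            n /= 2;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(levels(7));             // 3
        System.out.println(levels(1_000));         // 10
        System.out.println(levels(1_000_000));     // 20
        System.out.println(levels(1_000_000_000)); // 30
    }
}
```

So going from a thousand values to a billion only triples the worst-case number of comparisons.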

There are some details I’m skipping over, for example, how do you make sure your tree doesn’t end up looking like a linked list? But you’ll see them in 187 and 311.
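To make the "looks like a linked list" worry concrete, here is a sketch that inserts the same seven values in two different orders and measures the number of levels each time. It uses a plain, unbalanced binary search tree with invented names; it does not do the rebalancing that a real TreeSet does behind the scenes.

```java
class BstHeightDemo {
    static class Node {
        final int value;
        Node left;
        Node right;

        Node(int value) {
            this.value = value;
        }
    }

    // Plain binary-search-tree insert with no rebalancing.
    static Node insert(Node root, int value) {
        if (root == null) {
            return new Node(value);
        }
        if (value < root.value) {
            root.left = insert(root.left, value);
        } else if (value > root.value) {
            root.right = insert(root.right, value);
        }
        return root;
    }

    static Node build(int... values) {
        Node root = null;
        for (int v : values) {
            root = insert(root, v);
        }
        return root;
    }

    // Number of levels in the tree; an empty tree has zero.
    static int height(Node root) {
        if (root == null) {
            return 0;
        }
        return 1 + Math.max(height(root.left), height(root.right));
    }

    public static void main(String[] args) {
        // The "magic" order fills each level before starting the next.
        System.out.println(height(build(4, 2, 6, 1, 3, 5, 7))); // 3
        // Sorted order sends every value to the right: a list in disguise.
        System.out.println(height(build(1, 2, 3, 4, 5, 6, 7))); // 7
    }
}
```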

TreeSets

OK, so now we see that in order to build a tree, we need to be able to see if values are less than or equal to other values. Have we seen this before? Sure we have. Java’s Comparable and Comparator interfaces. So if we want to be able to place objects into a TreeSet, they’ll have to either have a natural ordering (that is, implement Comparable), or we can create the TreeSet with a specific Comparator in order to decide how the tree is built. So let’s add one.

Back to our PostalAddress:

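// Assumes PostalAddress is declared "implements Comparable<PostalAddress>".
@Override
public int compareTo(PostalAddress o) {
    int byName = name.compareTo(o.name);      // order by name first
    if (byName != 0) return byName;
    return Integer.compare(number, o.number); // then by number
}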

Now we can instantiate a TreeSet of our PostalAddress (though we’ll need to add back in our toString, first):

Set<PostalAddress> addresses = new TreeSet<PostalAddress>();
for (int i = 1; i <= 10; i++) {
    addresses.add(new PostalAddress("Maple St", i));
}
System.out.println(addresses);
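The other route mentioned above is to leave the class alone and hand the TreeSet a Comparator instead. Here is a sketch of that, ordering by number first and then by name. The PostalAddress here is a trimmed-down stand-in with a simple toString, and the demo class and constant names are made up.

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

class ComparatorTreeSetDemo {
    // Trimmed-down stand-in; note that it does NOT implement Comparable.
    static class PostalAddress {
        final String name;
        final int number;

        PostalAddress(String name, int number) {
            this.name = name;
            this.number = number;
        }

        @Override
        public String toString() {
            return number + " " + name;
        }
    }

    // Order by number first; break ties by name.
    static final Comparator<PostalAddress> BY_NUMBER_THEN_NAME =
            Comparator.comparingInt((PostalAddress a) -> a.number)
                    .thenComparing(a -> a.name);

    public static void main(String[] args) {
        Set<PostalAddress> addresses = new TreeSet<>(BY_NUMBER_THEN_NAME);
        addresses.add(new PostalAddress("Maple St", 3));
        addresses.add(new PostalAddress("Elm St", 1));
        addresses.add(new PostalAddress("Maple St", 2));
        // Compares as equal to the earlier "1 Elm St", so it is not added.
        addresses.add(new PostalAddress("Elm St", 1));
        System.out.println(addresses); // [1 Elm St, 2 Maple St, 3 Maple St]
    }
}
```

One thing to notice: a TreeSet decides whether two elements are "the same" using the comparison, not equals, which is why the duplicate is dropped even though this stand-in class never overrides equals.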