11: HashSets and TreeSets

Announcements

Not everyone is a supergenius! Don’t feel bad if you actually have to work in this class! There’s a group of ~20 who maybe should have gone to 187 – don’t compare yourself against them. Maybe look up imposter syndrome and relax (I mean, if you’re passing.).

Q3 delayed to next Monday. Sorry about communications problems.

A reminder: the general policy in this course is that we don’t accept late work. If you require an extension on an assignment, you must ask at least a day in advance and have a reasonable justification. “I didn’t start it until today and it’s harder than I thought” is not a particularly reasonable justification.

A clarification / reminder from the syllabus: assignments (and homeworks) are weighted equally. In particular, each assignment “counts the same” even though some are few points and some are many. The ones with many just have more tests; they’re not worth more points.

On the course honesty policy: Working with a peer is OK. Looking at a peer’s code with your own code open nearby is not – your code must be your own, and this might result in something that looks like copying.

What we look for is two people passing in the same (or basically the same) code; we can’t know if that’s because they copy/pasted, or if they just worked together a little too closely, or what. Don’t give us things to worry about, and it won’t be a problem.

“But Marc, people work together in the real world!” They sure do. But in a basics class like this one, it’s important that you do the work yourself to help build mastery of introductory material. You’ll collaborate more in future CS classes.

More on hash tables

Recall we were talking about hash tables at the end of last class. Hash tables are arrays that we index by hash code, so we can find any object we want in a single step:

(On board)

Though we run into trouble if there are collisions, that is, when more than one object has the same hash code and so hashes to the same index.

(on board)

In this case, we have a linked list to resolve the problem (there are other approaches that you might learn about in later courses, but this one is reasonable).

What happens if we have a hashCode method that always returns 1? It obeys the contract, right? What happens to performance? We’re no better than a linked list!

So it’s critical to write a hashCode method that works. And recall the contract: two objects’ hashCode methods must return the same value whenever the two objects are equals to each other. How might we do this?

Writing a hashCode method

Just like when we wrote our equals method, we have to consider what instance variables comprise this object’s identity. What makes it unique from other objects? Let’s consider a version of our PostalAddress:

import java.time.LocalDateTime;

public class PostalAddress {
    public final String name;
    public final int number;
    public final LocalDateTime created;
    public PostalAddress(String name, int number) {
        this.name = name;
        this.number = number;
        created = LocalDateTime.now();
    }
}
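To see why we can't just rely on the defaults, here's a quick sketch (using a condensed, hypothetical version of the class above, without the created field) of a HashSet failing to find an address that "looks" identical to one it contains. With the inherited identity-based equals and hashCode, two separately constructed objects are never considered the same:

```java
import java.util.HashSet;
import java.util.Set;

// Condensed demo class: PostalAddress *without* equals/hashCode overrides.
class PostalAddress {
    public final String name;
    public final int number;
    public PostalAddress(String name, int number) {
        this.name = name;
        this.number = number;
    }
}

public class DefaultHashDemo {
    public static void main(String[] args) {
        Set<PostalAddress> set = new HashSet<>();
        set.add(new PostalAddress("Maple St", 42));
        // Looks like the same address, but with the default (identity-based)
        // hashCode and equals it's a different object, so it's never found:
        System.out.println(set.contains(new PostalAddress("Maple St", 42))); // false
    }
}
```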

Note that this has a created variable, but let’s say we don’t care about it when checking equality. Therefore we don’t care about it when returning a hashCode, either. So how can we return a hashCode that’s valid? One option is to just return a constant:

In-class exercise

public int hashCode() {
    return 1;
}

Valid hash code? Good hash code?

That’s a terrible idea. We want the hash code to depend upon the things that our equals method might depend upon. What are those? name and number. How might we create a hashCode on that basis? number is already an int. Can we get an int out of a String? Lots of methods return int. You could imagine all sorts of convoluted approaches, like summing up the integer values of the characters stored in the string, and some of these would probably work great. You know what else would work great? The String’s own hashCode method. So you might write:

public int hashCode() {
    return number + name.hashCode();
}

as a decent (though maybe not optimal) method. It turns out you need to know a tiny bit of number theory to understand why this isn’t the best option; when you get to either 250 or 311 I guarantee you’ll learn first-hand. But for now, know that this is OK, but not great.
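One concrete reason it's "OK, but not great": plain addition makes collisions easy to manufacture, because bumping the number by one and picking a name whose hash is one smaller cancels out. A small sketch (the class and method names here are just for illustration):

```java
public class SimpleHashCollision {
    // The simple "number + name.hashCode()" recipe from above.
    static int simpleHash(String name, int number) {
        return number + name.hashCode();
    }

    public static void main(String[] args) {
        // "a".hashCode() is 97 and "b".hashCode() is 98, so these two
        // distinct addresses collide: 97 + 5 == 98 + 4 == 102.
        System.out.println(simpleHash("a", 5)); // 102
        System.out.println(simpleHash("b", 4)); // 102
    }
}
```

Collisions between unequal objects are legal (the contract allows them), but the fewer we produce, the better the hash table performs.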

Hey, though, my use of Eclipse to auto-write some of the constructor probably got you thinking, right? Like, if there’s a simple set of rules we can use on the basis of instance variables of interest to create a good hash code, couldn’t we ask the computer to do it for us? And in fact we can. Most modern IDEs will help you write these “boilerplate” methods, and they mostly do a good job. But sometimes they don’t, so you need to either (a) understand what they’re doing, or (b) accept that if they result in weirdo errors, you’ll have to figure it out. Let’s try it now (demo):

@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((name == null) ? 0 : name.hashCode());
    result = prime * result + number;
    return result;
}

@Override
public boolean equals(Object obj) {
    if (this == obj)
        return true;
    if (obj == null)
        return false;
    if (getClass() != obj.getClass())
        return false;
    PostalAddress other = (PostalAddress) obj;
    if (name == null) {
        if (other.name != null)
            return false;
    } else if (!name.equals(other.name))
        return false;
    if (number != other.number)
        return false;
    return true;
}

Remember when I said a “real” equals method was a little more involved than our discount method? There you go.
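With both methods overridden, the HashSet behaves the way we'd hope. Here's a sketch putting it together (again using a condensed, hypothetical two-field version of the class, with the generated methods folded in):

```java
import java.util.HashSet;
import java.util.Set;

// Condensed PostalAddress with the IDE-generated equals/hashCode.
class PostalAddress {
    public final String name;
    public final int number;
    PostalAddress(String name, int number) { this.name = name; this.number = number; }

    @Override public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((name == null) ? 0 : name.hashCode());
        result = prime * result + number;
        return result;
    }

    @Override public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null || getClass() != obj.getClass()) return false;
        PostalAddress other = (PostalAddress) obj;
        return number == other.number
            && (name == null ? other.name == null : name.equals(other.name));
    }
}

public class ProperHashDemo {
    public static void main(String[] args) {
        Set<PostalAddress> set = new HashSet<>();
        set.add(new PostalAddress("Maple St", 42));
        set.add(new PostalAddress("Maple St", 42)); // duplicate: the set ignores it
        System.out.println(set.size());                                      // 1
        System.out.println(set.contains(new PostalAddress("Maple St", 42))); // true
    }
}
```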

On to the trees

So that’s HashSets. If you can reasonably define an equals and hashCode method, you can get pretty good performance (near constant-time) if you use a HashSet. But there’s another option that also gets pretty good performance, the TreeSet. To describe how it works, we need to define trees.

In computer science and mathematics, a tree is a kind of graph. What’s a graph? It’s a set (oh snap!) of vertices and edges between those vertices. (On board)

A tree is a particular kind of graph. It has a single vertex called a root “at the top”, and it grows downward (weird, I know, like an upside-down tree). Each vertex in the tree can have “children”, that is, nodes “below” them. For the sake of simplicity, let’s say each vertex has zero, one, or two children, and the children are “left children” or “right children”. Trees that have at most two children per vertex are called “binary trees.”

Remember linked lists? It turns out you can model a tree in code, using something very similar:

class TreeNode<T> {
    T value;
    TreeNode<T> leftChild;
    TreeNode<T> rightChild;

    //...
}

But lucky for you, this is not 187, so we’re just going to draw diagrams to give you an intuition, rather than make you code this up yourself.

Why trees?

OK, so now we’ve just implicitly created a new homework assignment for 187, but who cares, right? It’s just a convoluted list, sort of, right?

Right, and wrong. Depending upon how you organize your tree, you can get very good or very bad performance. If you just stuff items into the tree willy-nilly, then yes, it’s really no better than a linked list, as you’d have to traverse the entire tree to, for example, look to see if an element is there. In some ways it’s worse, because you now also have to write the traversal code for a tree, which is more complicated than the same code for a list. But it turns out if you impose some constraints on the tree, you can do better.

Specifically, let’s say that we require that a left child (and all grandchildren) of a node can only contain a value that’s less than the current node’s value. And a right child (and grandchildren) can only contain a value that’s greater than the current node’s value. This is called a “binary search tree.” How does this help? Consider a tree that holds the values 1 through 7. Let’s say I magically decide to insert them into the tree in this order: {4, 2, 6, 1, 3, 5, 7}:

   4
 2   6
1 3 5 7

This tree holds 7 values, yet it takes at most three comparisons (one per level) to check whether a given value is in the tree or not. If we build an even bigger tree you can see that the tree’s height (which is also how many comparisons are needed to search it) grows much more slowly than the tree’s size, which is the total number of nodes in the tree.

It’s not quite constant-time lookup (it’s “logarithmic” overhead) but it’s really fast nonetheless. The logarithm grows very slowly: https://en.wikipedia.org/wiki/Logarithm. So if we say a tree containing 10 elements has an overhead of 1, a tree containing 1000 elements (100 times as many!) has an overhead of only 3. And a tree containing 1,000,000 elements has an overhead of only 6. 10^9 elements? Overhead of 9. And so on.
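The "overhead" figures above are just base-10 logarithms (the actual comparison count in a balanced binary tree is about log base 2 of the size, but changing the base only changes things by a constant factor). A quick sketch that reproduces the numbers from the text:

```java
public class LogGrowth {
    public static void main(String[] args) {
        // Multiplying the size by 100 only adds 2 to the (base-10) overhead.
        for (long n : new long[] {10L, 1000L, 1_000_000L, 1_000_000_000L}) {
            System.out.println(n + " elements -> overhead " + (long) Math.log10(n));
        }
        // prints overheads 1, 3, 6, and 9
    }
}
```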

There are some details I’m skipping over, for example, how do you make sure your tree doesn’t end up looking like a linked list? But you’ll see them in 187 and 311.

In-class exercise

Build a binary search tree containing the following values: {3, 2, 6, 10, 5, 1, 9}

Balanced? Not quite.

TreeSets

OK, so now we see that in order to build a tree, we need to be able to check whether values are less than, equal to, or greater than other values. Have we seen this before? Sure we have. Java’s Comparable and Comparator interfaces. So if we want to be able to place objects into a TreeSet, they’ll have to either have a natural ordering (that is, implement Comparable), or we can create the TreeSet with a specific Comparator in order to decide how the tree is built. So let’s add one.

Back to our PostalAddress:

@Override
public int compareTo(PostalAddress o) {
    int byName = name.compareTo(o.name);
    if (byName != 0) return byName;
    return Integer.compare(number, o.number);
}

Now we can instantiate a TreeSet of our PostalAddress (though we’ll need to add back in our toString first):

Set<PostalAddress> addresses = new TreeSet<PostalAddress>();
for (int i = 1; i <= 10; i++) {
    addresses.add(new PostalAddress("Maple St", i));
}
System.out.println(addresses);
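And if PostalAddress didn't implement Comparable, or we wanted a different ordering, we could hand the TreeSet a Comparator instead. A sketch, again using a condensed, hypothetical two-field version of the class:

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

// Condensed demo class: no Comparable implementation at all.
class PostalAddress {
    public final String name;
    public final int number;
    PostalAddress(String name, int number) { this.name = name; this.number = number; }
    @Override public String toString() { return number + " " + name; }
}

public class ComparatorDemo {
    public static void main(String[] args) {
        // The Comparator decides how the tree is built: number first, then name.
        Set<PostalAddress> addresses = new TreeSet<>(
            Comparator.comparingInt((PostalAddress a) -> a.number)
                      .thenComparing(a -> a.name));
        addresses.add(new PostalAddress("Maple St", 3));
        addresses.add(new PostalAddress("Maple St", 1));
        addresses.add(new PostalAddress("Elm St", 2));
        System.out.println(addresses); // [1 Maple St, 2 Elm St, 3 Maple St]
    }
}
```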