Lecture 11: HashSets and TreeSets

Welcome

Announcements

Reminder: Next Monday is a holiday; next Tuesday is a “UMass Monday”, so you’ll follow Monday’s schedule. For this class, it means lab but no lecture. It also means Monday office hours – in particular, I won’t have office hours on Tuesday next week, as I am also doing a Monday schedule.

DTA / COTA, very important for faculty and TAs:

https://www.umass.edu/ctl/distinguished-teaching-award

https://www.cics.umass.edu/awards-programs#COTA (and similar for other colleges, google “UMass COTA” and the college name to find it)

More on hash tables

Last class we talked about hash tables, which are the data structure that’s used to implement HashSets in Java. We started with the idea of an infinite array and a way to “uniquely” position objects into that array via hashCode(), and then refined that intuition to deal with the real-world constraints of non-infinite arrays and imperfectly unique hash codes.

Of course, if we have a hash table that’s too small, or a hash code method that doesn’t, in the words of the Java doc, “as much as is reasonably practical, … return distinct integers for distinct objects”, then everything ends up in just one or two lists and performance is bad.

So it’s critical to write a hashCode method that works. And recall the contract: two objects’ hashCode methods should produce the same value if the two objects are equal according to equals. How might we do this?

Writing a hashCode method

Just like when we wrote our equals method, we have to consider what instance variables comprise this object’s identity. What makes it unique from other objects? Let’s consider a version of our PostalAddress:

public class PostalAddress {
	public final String name;
	public final int number;
	public final LocalDateTime created;
	public PostalAddress(String name, int number) {
		this.name = name;
		this.number = number;
		created = LocalDateTime.now();
	}
}

Note that this has a created variable, but let’s say we don’t care about it when checking equality. Therefore we don’t care about it when returning a hashCode, either. So how can we return a hashCode that’s valid? One option is to just return the same int every time:

In-class exercise

public int hashCode() {
	return 1;
}

Valid hash code? Good hash code?

That’s a terrible idea. We want it to depend upon the things that our equals method might depend upon. What are those? name and number. How might we create a hashCode on that basis? number is already an int. Can we get an int out of a String? Lots of methods return int. You can imagine all sorts of convoluted methods that might involve summing up the integer value of the characters stored in the string, and some of these would probably work great. You know what else would work great? The String’s hashCode method. So you might write:

public int hashCode() {
	return number + name.hashCode();
}

as a decent (though maybe not optimal) method. It turns out you need to know a tiny bit of number theory to understand why this isn’t the best option; when you get to either 250 or 311 I guarantee you’ll learn first-hand. But for now, know that this is OK, but not great.
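If you want a taste of why plain addition is only OK, here is a small sketch. It is not from the demo, and the class and method names are made up for illustration. It compares the sum-based hash above with one common alternative: multiplying the running result by a small prime before adding each field. With addition, two different addresses collide whenever a change in the name's hash code is cancelled out by a change in the number.

```java
public class HashCollisionDemo {

	// The simple version from above: just add the two pieces together.
	public static int sumHash(String name, int number) {
		return number + name.hashCode();
	}

	// A common alternative (an assumption here, not the only reasonable choice):
	// multiply the running result by a small prime before adding the next field.
	public static int primeHash(String name, int number) {
		final int prime = 31;
		int result = 1;
		result = prime * result + name.hashCode();
		result = prime * result + number;
		return result;
	}

	public static void main(String[] args) {
		// "A".hashCode() is 65 and "B".hashCode() is 66, so 65 + 2 == 66 + 1.
		System.out.println(sumHash("A", 2) == sumHash("B", 1));     // true: a collision
		System.out.println(primeHash("A", 2) == primeHash("B", 1)); // false
	}
}
```

A collision like this doesn't break anything; it just means more objects end up sharing a list in the hash table.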

Hey, though, my use of Eclipse to auto-write some of the constructor probably got you thinking, right? Like, if there’s a simple set of rules we can use on the basis of instance variables of interest to create a good hash code, couldn’t we ask the computer to do it for us? And in fact we can. Most modern IDEs will help you write the “boilerplate” methods, and they mostly do a good job. But sometimes they don’t, so you need to either (a) understand what they’re doing, or (b) accept that if they result in weirdo errors, you’ll have to figure it out. Let’s try it now (demo):

@Override
public int hashCode() {
	final int prime = 31;
	int result = 1;
	result = prime * result + ((name == null) ? 0 : name.hashCode());
	result = prime * result + number;
	return result;
}

@Override
public boolean equals(Object obj) {
	if (this == obj)
		return true;
	if (obj == null)
		return false;
	if (getClass() != obj.getClass())
		return false;
	PostalAddress other = (PostalAddress) obj;
	if (name == null) {
		if (other.name != null)
			return false;
	} else if (!name.equals(other.name))
		return false;
	if (number != other.number)
		return false;
	return true;
}

Remember when I said a “real” equals method was a little more involved than our discount method? There you go.
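To see equals and hashCode working together, here is a minimal sketch of a HashSet treating two separately created but equal addresses as one element. The Address class is a cut-down stand-in for PostalAddress, and it uses the standard library helpers Objects.equals and Objects.hash as a shorter alternative to the generated code. That swap is my assumption, not what Eclipse produced.

```java
import java.time.LocalDateTime;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class HashSetDemo {

	// Cut-down stand-in for PostalAddress; equals and hashCode ignore 'created'.
	static class Address {
		final String name;
		final int number;
		final LocalDateTime created = LocalDateTime.now();

		Address(String name, int number) {
			this.name = name;
			this.number = number;
		}

		@Override
		public boolean equals(Object obj) {
			if (this == obj) return true;
			if (obj == null || getClass() != obj.getClass()) return false;
			Address other = (Address) obj;
			return number == other.number && Objects.equals(name, other.name);
		}

		@Override
		public int hashCode() {
			return Objects.hash(name, number);
		}
	}

	public static Set<Address> buildSet() {
		Set<Address> addresses = new HashSet<>();
		addresses.add(new Address("Maple St", 1));
		addresses.add(new Address("Maple St", 1)); // equal to the first, so not added again
		addresses.add(new Address("Maple St", 2));
		return addresses;
	}

	public static void main(String[] args) {
		System.out.println(buildSet().size()); // 2
	}
}
```

If you delete the hashCode method and rerun, the two equal addresses will most likely both stay in the set, because they no longer land in the same part of the table.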

On to the trees

So that’s HashSets. If you can reasonably define an equals and hashCode method, you can get pretty good performance (near constant-time) if you use a HashSet. But there’s another option that also gets pretty good performance, the TreeSet. To describe how it works, we need to define trees.

In computer science and mathematics, a tree is a kind of graph. What’s a graph? It’s a set (oh snap!) of vertices (sometimes also called nodes) and edges between those vertices. (On board)

A tree is a particular kind of graph. It has a single vertex called a root “at the top”, and it grows downward (weird, I know, like an upside-down tree). Each vertex in the tree can have “children”, that is, nodes “below” them. For the sake of simplicity, let’s say each vertex has zero, one, or two children, and the children are “left children” or “right children”.

Trees that have at most two children per vertex are called binary trees.

Remember linked lists? It turns out you can model a tree in code, using something very similar:

class TreeNode<T> {
	T value;
	TreeNode<T> leftChild;
	TreeNode<T> rightChild;

	//...
}

But lucky for you, this is not 187, so we’re just going to draw diagrams to give you an intuition, rather than make you code this up yourself.

Why trees?

OK, so now we’ve just implicitly created a new homework assignment for 187, but who cares, right? It’s just a convoluted list, sort of, right?

Right, and wrong. Depending upon how you organize your tree, you can get very good or very bad performance. If you just stuff items into the tree willy-nilly, then yes, it’s really no better than a linked list, as you’d have to traverse the entire tree to, for example, look to see if an element is there. In some ways it’s worse, because you now also have to write the traversal code for a tree, which is more complicated than the same code for a list. But it turns out if you impose some constraints on the tree, you can do better.

Specifically, let’s say that we require that a left child (and all of its descendants) of a node can only contain values that are less than the current node’s value. And a right child (and all of its descendants) can only contain values that are greater than the current node’s value. A tree with this property is called a binary search tree.

In-class exercise:

Two trees!

How does this “Binary Search Tree” property help? Consider a tree that holds the values 1 through 7. Let’s say I magically decide to insert them into the tree in this order: {4, 2, 6, 1, 3, 5, 7}:

   4
 2   6
1 3 5 7

This tree holds 7 values, and takes at most three comparisons (one per level) to check whether a given value is in the tree or not. If we build an even bigger tree you can see that the tree’s height, counted in levels (which is also the most comparisons a search can need), grows much more slowly than the tree’s size, which is the total number of nodes in the tree.
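You won't have to write this yourself in this course, but if you'd like to see the idea in code, here is a minimal sketch of an (unbalanced) binary search tree of ints. All the names are made up for illustration. It inserts values in a given order and counts how many nodes a search visits, for the insertion order above and for a sorted order that degenerates into a list.

```java
public class BstDemo {

	static class Node {
		final int value;
		Node left;
		Node right;

		Node(int value) {
			this.value = value;
		}
	}

	// Insert a value and return the root of the (possibly new) subtree.
	static Node insert(Node root, int value) {
		if (root == null) return new Node(value);
		if (value < root.value) root.left = insert(root.left, value);
		else if (value > root.value) root.right = insert(root.right, value);
		// equal values are ignored, since a set holds no duplicates
		return root;
	}

	// Build a tree by inserting the values in the order given.
	static Node build(int... values) {
		Node root = null;
		for (int v : values) root = insert(root, v);
		return root;
	}

	// Count how many nodes we visit while searching for a value.
	static int nodesVisited(Node root, int value) {
		int count = 0;
		Node current = root;
		while (current != null) {
			count++;
			if (value == current.value) break;
			current = (value < current.value) ? current.left : current.right;
		}
		return count;
	}

	public static void main(String[] args) {
		System.out.println(nodesVisited(build(4, 2, 6, 1, 3, 5, 7), 7)); // 3
		System.out.println(nodesVisited(build(1, 2, 3, 4, 5, 6, 7), 7)); // 7
	}
}
```

Same seven values, same search, but the insertion order decides whether the tree is bushy or a glorified linked list.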

In-class exercise

Build a binary search tree containing the following values: {3, 2, 6, 10, 5, 1, 9}

Balanced? Not quite.

But what if we can magically keep the tree (mostly) balanced? You’ll learn how in 187. It’s not quite constant-time lookup (it’s “logarithmic” overhead) but it’s really fast nonetheless. The logarithm grows very slowly: https://en.wikipedia.org/wiki/Logarithm. So if we say a tree containing 10 elements has an overhead of 1, a tree containing 1000 elements (100 times as many!) has an overhead of only 3. And a tree containing 1,000,000 elements has an overhead of only 6. 10^9 elements? Overhead of 9. And so on.
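The numbers above count in powers of ten. For a binary tree the relevant logarithm is base 2, which changes the numbers but not the story. Here is a quick sketch, with a hypothetical helper minLevels, that computes the fewest levels a binary tree needs to hold n values, that is, floor(log2(n)) + 1.

```java
public class TreeHeightDemo {

	// Fewest levels in a binary tree holding n values (n >= 1): floor(log2(n)) + 1.
	// Counting the bits in n gives exactly that, with no floating-point rounding.
	public static int minLevels(int n) {
		return 32 - Integer.numberOfLeadingZeros(n);
	}

	public static void main(String[] args) {
		int[] sizes = {7, 1000, 1000000, 1000000000};
		for (int n : sizes) {
			// prints 3, 10, 20, and 30 levels respectively
			System.out.println(n + " values -> " + minLevels(n) + " levels");
		}
	}
}
```

So a perfectly balanced tree with a billion values is only 30 levels deep.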

Again, there are some details I’m skipping over, for example, how do you make sure your tree doesn’t end up looking like a linked list? But you’ll see them in 187 and 311.

TreeSets

OK, so now we see that in order to build a tree, we need to be able to see if values are less than or equal to other values. Have we seen this before? Sure we have. Java’s Comparable and Comparator interfaces. So if we want to be able to place objects into a TreeSet, they’ll have to either have a natural ordering (that is, implement Comparable), or we can create the TreeSet with a specific Comparator in order to decide how the tree is built. So let’s add one.

Back to our PostalAddress:

public int compareTo(PostalAddress o) {
	if (name.compareTo(o.name) != 0) return name.compareTo(o.name);
	return Integer.compare(number, o.number);
}

Now we can instantiate a TreeSet of our PostalAddress (though we’ll need to declare that the class implements Comparable<PostalAddress> and add back in our toString, first):

Set<PostalAddress> addresses = new TreeSet<PostalAddress>();
for (int i = 1; i <= 10; i++) {
	// the constructor takes the name first, then the number
	addresses.add(new PostalAddress("Maple St", i));
}
System.out.println(addresses);
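Here is a self-contained sketch of both options, a natural ordering via Comparable and a custom ordering via a Comparator. Address is a cut-down stand-in for PostalAddress, and the helper names are made up for illustration.

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

public class TreeSetDemo {

	// Cut-down stand-in for PostalAddress; natural ordering is name, then number.
	static class Address implements Comparable<Address> {
		final String name;
		final int number;

		Address(String name, int number) {
			this.name = name;
			this.number = number;
		}

		@Override
		public int compareTo(Address o) {
			int byName = name.compareTo(o.name);
			return (byName != 0) ? byName : Integer.compare(number, o.number);
		}

		@Override
		public String toString() {
			return number + " " + name;
		}
	}

	// Add the same three addresses to whatever set we are given.
	private static String fill(Set<Address> addresses) {
		addresses.add(new Address("Maple St", 3));
		addresses.add(new Address("Elm St", 7));
		addresses.add(new Address("Maple St", 1));
		return addresses.toString();
	}

	public static String naturalOrder() {
		return fill(new TreeSet<>());
	}

	public static String byNumber() {
		// Supplying a Comparator replaces the natural ordering for this one set.
		Comparator<Address> order =
				Comparator.comparingInt((Address a) -> a.number).thenComparing(a -> a.name);
		return fill(new TreeSet<>(order));
	}

	public static void main(String[] args) {
		System.out.println(naturalOrder()); // [7 Elm St, 1 Maple St, 3 Maple St]
		System.out.println(byNumber());     // [1 Maple St, 3 Maple St, 7 Elm St]
	}
}
```

One thing to keep in mind: a TreeSet decides whether two elements are duplicates using the comparison (a result of 0), not equals, so it's a good idea to keep the two consistent.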

Which should I use?

Generally, you should reach for HashSet when you want to use the Set interface, in the same way you reach for ArrayList when you want a List. Usually, objects have a custom hashCode method defined already, and if they don’t, you can write one pretty easily with an IDE’s help.

TreeSets are most useful when the objects in the set have a natural ordering, and you care about using it. In particular, TreeSet also implements the NavigableSet and SortedSet interfaces, which means you can find the smallest or largest elements of the set, or the elements closest to a given value, in “logarithmic” time, which is about as good as it gets (only “constant time” is better). (See the respective JavaDocs for those interfaces.) If the problem you’re working on makes frequent use of computing these values, then a TreeSet might be a better choice than a HashSet. But 90% or more of the time, HashSet is what you’re gonna want.
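As a quick illustration of those extra operations, here is a small sketch using a TreeSet of Integers. The class name and sample values are just for the example. The methods first and last come from SortedSet; floor and ceiling come from NavigableSet.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class NavigableDemo {

	public static TreeSet<Integer> sample() {
		return new TreeSet<>(Arrays.asList(3, 2, 6, 10, 5, 1, 9));
	}

	public static void main(String[] args) {
		TreeSet<Integer> numbers = sample();
		System.out.println(numbers);            // [1, 2, 3, 5, 6, 9, 10]
		System.out.println(numbers.first());    // 1, the smallest element
		System.out.println(numbers.last());     // 10, the largest element
		System.out.println(numbers.floor(7));   // 6, the largest element <= 7
		System.out.println(numbers.ceiling(7)); // 9, the smallest element >= 7
		System.out.println(numbers.headSet(5)); // [1, 2, 3], everything strictly below 5
	}
}
```

A HashSet can't answer any of these questions without looking at every element.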

Now for something completely different

We’re going to walk through solving another toy problem that’s made relatively straightforward with the Set abstraction. Then we’ll talk about an extension to this problem that’s harder to solve, and (probably next class) introduce a new abstraction – the Map – to solve it.

The problem is called “Santa’s little helper” and is adapted from http://adventofcode.com/2015/day/3.

Santa’s little helper

Santa is delivering presents to an infinite two-dimensional grid of houses.

He begins by delivering a present to the house at his starting location, and then an elf at the North Pole calls him via radio and tells him where to move next. Moves are always exactly one house to the north (^), south (v), east (>), or west (<). After each move, he delivers another present to the house at his new location.

However, the elf back at the north pole has had a little too much eggnog, and so his directions are a little off, and Santa ends up visiting some houses more than once. How many houses receive at least one present?

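For example:

The directions > deliver presents to 2 houses: one at the starting location, and one to the east.

The directions ^>v< deliver presents to 4 houses in a square, including twice to the house at the starting (and ending) location.

The directions ^v^v^v^v^v deliver presents to only 2 houses, over and over.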

Toward a solution

We’re going to try to solve this in Java. Let’s fire up Eclipse and create a new project. Now let’s write a DeliverySimulator class with a single method that takes a string of directions as input and returns the set of locations visited.

import java.util.Set;

public class DeliverySimulator {


	public static Set<Location> locationsVisited(String directions) {
		return null;
	}
}

Note we can use Eclipse to “stub out” our code, and to create empty implementations wherever possible here.

In-class exercise

Suppose we want to add the following method:

public static int housesVisited(Set<Location> visited) { 
    // what goes here?
}

Let’s also add a method to compute the actual number of houses visited:

public static int housesVisited(Set<Location> visited) {
	return visited.size();
}

Hey, half done! Well, sorta.

What should a location look like? Let’s give it an x and y coordinate:

public class Location {
	public final int x;
	public final int y;

	public Location(int x, int y) {
		this.x = x;
		this.y = y;
	}
}

Since we know we’re going to be storing locations in a set, we should make sure we implement meaningful equals and hashCode methods. Eclipse to the rescue again (Source -> Generate hashCode() and equals()…):

@Override
public int hashCode() {
	final int prime = 31;
	int result = 1;
	result = prime * result + x;
	result = prime * result + y;
	return result;
}

@Override
public boolean equals(Object obj) {
	if (this == obj)
		return true;
	if (obj == null)
		return false;
	if (getClass() != obj.getClass())
		return false;
	Location other = (Location) obj;
	if (x != other.x)
		return false;
	if (y != other.y)
		return false;
	return true;
}

and maybe:

public String toString() {
	return "(" + x + ", " + y + ")";
}

Does this work? We could add a main method to do some testing:

public static void main(String[] args) {
	Location x = new Location(0,0);
	System.out.println(x);

	System.out.println(x.equals(new Location(0, 0)));
	System.out.println(x.equals(new Location(0, 1)));
}
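If you want to try it before then, here is one possible way to fill in the stub. It is a sketch, not necessarily the version we'll write together. It bundles a compact Location (using Objects.hash in place of the generated hashCode) into a class I've called DeliverySketch so the example stands alone, and it assumes north is +y, east is +x, and that any unrecognized character is skipped.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class DeliverySketch {

	// Compact stand-in for the Location class above.
	static class Location {
		final int x;
		final int y;

		Location(int x, int y) {
			this.x = x;
			this.y = y;
		}

		@Override
		public boolean equals(Object obj) {
			if (this == obj) return true;
			if (obj == null || getClass() != obj.getClass()) return false;
			Location other = (Location) obj;
			return x == other.x && y == other.y;
		}

		@Override
		public int hashCode() {
			return Objects.hash(x, y);
		}
	}

	public static Set<Location> locationsVisited(String directions) {
		Set<Location> visited = new HashSet<>();
		int x = 0;
		int y = 0;
		visited.add(new Location(x, y)); // the starting house gets a present too
		for (char c : directions.toCharArray()) {
			switch (c) {
				case '^': y++; break;
				case 'v': y--; break;
				case '>': x++; break;
				case '<': x--; break;
				default: continue; // assumption: skip anything that isn't a direction
			}
			visited.add(new Location(x, y)); // the set silently ignores repeat visits
		}
		return visited;
	}

	public static void main(String[] args) {
		System.out.println(locationsVisited(">").size());          // 2
		System.out.println(locationsVisited("^>v<").size());       // 4
		System.out.println(locationsVisited("^v^v^v^v^v").size()); // 2
	}
}
```

Notice that the Set does all the bookkeeping about repeat visits; the loop just adds every location it reaches.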

More next class!