Week 06: More on sets; binary trees; the Map ADT

Announcements

Don’t use Map for this week’s PA! Sets and Lists are all you need. We introduce Map this week but you’ll start using it later.

Sets in Java

How does Java represent a set? As an abstract data type, specified by the Set interface. First we’ll talk about the properties and assumptions we might expect from a Set, in the abstract. Then we’ll talk about two concrete implementations of the data type provided by the Java API and see how they work.

Let’s take a look at the interface: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Set.html

Not too different from List, though you’ll note some things (like remove at an index, or get) are not present, as those operations don’t make sense in the context of sets – they’re unordered, so there is no index!

Pay special attention to a few things:

“…sets contain no pair of elements e1 and e2 such that e1.equals(e2)” – the equals method is very important to sets. If you put in objects whose class doesn’t define its own equals method, the set will use Object’s equals method (which behaves like ==). Make sure that’s what you want if so!

Also note that “great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set.” In other words, if you have a setter that changes an instance variable in an object, and that instance variable is considered by the object’s equals method, Set will have undefined (read: bad) behavior.

So putting relatively immutable things into sets is OK. Like Integers or Strings. Putting arbitrary objects that can be changed is not so good. Putting things that can be changed, but that you won’t change is OK but dangerous – what if you accidentally do end up changing the object? The Set will almost certainly misbehave in a weird way.
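Both warnings can be seen in a small sketch. The classes PlainPoint and Counter below are made up for this illustration (they are not part of the course code): the first has no equals or hashCode of its own, and the second has a mutable field that its equals and hashCode depend on.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class SetPitfallsDemo {
    // Hypothetical class with no equals/hashCode of its own:
    // it inherits Object's versions, which behave like ==.
    static class PlainPoint {
        final int x, y;
        PlainPoint(int x, int y) { this.x = x; this.y = y; }
    }

    // Hypothetical mutable class whose equals/hashCode depend on a changeable field.
    static class Counter {
        int value;
        Counter(int value) { this.value = value; }

        @Override
        public boolean equals(Object o) {
            return o instanceof Counter && ((Counter) o).value == value;
        }

        @Override
        public int hashCode() {
            return Objects.hash(value);
        }
    }

    // Two identical-looking points are both kept, since neither equals the other.
    static int sizeWithoutEquals() {
        Set<PlainPoint> points = new HashSet<>();
        points.add(new PlainPoint(0, 0));
        points.add(new PlainPoint(0, 0));
        return points.size();
    }

    // Mutate an element after adding it, then ask the set whether it is present.
    static boolean foundAfterMutation() {
        Set<Counter> counters = new HashSet<>();
        Counter c = new Counter(1);
        counters.add(c);
        c.value = 2; // changes equals/hashCode while c is in the set
        return counters.contains(c);
    }

    public static void main(String[] args) {
        System.out.println(sizeWithoutEquals());  // 2: the set holds both points
        System.out.println(foundAfterMutation()); // unspecified by the Set contract
    }
}
```

The second result is officially unspecified. On the usual OpenJDK HashSet it comes out false, because the set looks for the object under its new hash code while it is still filed under the old one.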

Other than those two restrictions, you can use Sets almost like Lists. Let’s do some examples:

Set<Integer> s = new HashSet<Integer>();

s.add(1);
s.add(2);
System.out.println(s); // like lists, you can print them and their contents are printed

Set<Integer> t = new HashSet<Integer>();

t.add(2);
t.add(3);
t.add(4);

for (Integer i : t) {
	System.out.println(i); // like lists, you can iterate over them
}

s.addAll(t); // all elements in t are added to s; t is unchanged but s is not!
System.out.println(s);
System.out.println(t);

s.removeAll(t); // all elements in t are removed from s, as above
System.out.println(s);
System.out.println(t);

And you generally do want to use Sets when the set properties (of uniqueness and lack-of-intrinsic-order) apply to your data set, especially if your data set is going to be large.

Why? (you might ask.) Because sets have much, much better general performance for insertion, removal, and containment-testing than lists. How? (you might ask.) Well, now we have to talk a little about how the two most common implementations of Sets work: HashSets and TreeSets.

On HashSets

One possible implementation of the set is the HashSet, which depends upon a correct hashCode method. Why? “This class implements the Set interface, backed by a hash table (actually a HashMap instance).” Now let’s look at the documentation for hashCode: “This method is supported for the benefit of hash tables such as those provided by HashMap.”

Wow, hash tables are so important that every object in Java must supply a hashCode method – it’s built into Object.

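hashCode returns an integer, and must obey the contract in its documentation. Paraphrasing, the contract has three pieces:

Calling hashCode on the same object more than once must return the same integer, as long as nothing used in its equals comparisons has changed.

If two objects are equal according to equals, they must have the same hash code. This implies that hashCode should only depend on fields that equals also considers (and, for good performance, it should usually depend on all of them).

Two objects that are not equal are allowed to have the same hash code, though producing distinct hash codes for distinct objects makes hash tables faster.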

So you could have a hashCode method that always returned the same integer, like 1, and it would technically obey the contract. But usually, the hashCode of objects is not 1, but instead a large integer. Going back to our old example of (not) aliasing:

      String s = new String("x");
      String t = new String("x");

      System.out.println(s == t);
      System.out.println(s.equals(t));
      System.out.println(s.hashCode());
      System.out.println(t.hashCode());

Do you expect them to be ==? No. Do you expect them to be equals? Yes. Do you expect them to have the same hash code? Yes, because the hashCode contract requires equal objects to have equal hash codes.

Why does this weird integer result in fast (“constant time”) lookups?

Because you can use it as an index into an array.

In short, “hash tables” are arrays that store objects based upon their “hash code”. If you want to put an element into the array, you figure out the right place to put it by checking its hash code. And if you want to see if an element is in the array, you look up its hash code, then jump to the right spot in the array.

In a perfect world, the array would be big enough to hold everything, and the hash codes would always be unique per-object, and this would all just work. In practice, sometimes there are collisions – more than one object ends up in the same spot in the array. We resolve these collisions in different ways (one way: each element of the array might be a short linked list of elements with the hash code corresponding to that element’s index), and things usually work out with near-constant-time performance.
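Here is a minimal sketch of that idea, assuming the “short linked list per array slot” approach to collisions. TinyHashTable is a made-up teaching class, not how java.util.HashSet is actually written; it only handles Strings and never resizes.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class TinyHashTable {
    // Each slot ("bucket") holds a short list of the elements that landed there.
    private final List<LinkedList<String>> buckets = new ArrayList<>();

    TinyHashTable(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // Turn a hash code into a valid index; floorMod copes with negative hash codes.
    int indexFor(String s) {
        return Math.floorMod(s.hashCode(), buckets.size());
    }

    // Returns true if s was added, false if it was already present.
    boolean add(String s) {
        LinkedList<String> bucket = buckets.get(indexFor(s));
        if (bucket.contains(s)) {
            return false;
        }
        bucket.add(s);
        return true;
    }

    // Jump straight to one bucket, then search only that short list.
    boolean contains(String s) {
        return buckets.get(indexFor(s)).contains(s);
    }

    public static void main(String[] args) {
        TinyHashTable table = new TinyHashTable(8);
        table.add("Aa");
        table.add("BB"); // "Aa" and "BB" happen to share a hash code: a collision
        System.out.println(table.indexFor("Aa") == table.indexFor("BB")); // true
        System.out.println(table.contains("BB")); // true
        System.out.println(table.contains("cc")); // false
    }
}
```

Note that contains never looks at more than one bucket, which is where the near-constant-time behavior comes from, as long as the buckets stay short.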

More on hash tables

Of course, if we have a hash table that’s too small, or a hash code method that doesn’t, in the words of the Java doc, “as much as is reasonably practical, … return distinct integers for distinct objects”, then everything ends up in just one or two lists and performance is bad.

So it’s critical to write a hashCode method that works. And recall the contract: two objects’ hashCode methods must return the same value if the two objects are equals. How might we do this?

Writing a hashCode method

Just like when we wrote our equals method, we have to consider what instance variables comprise this object’s identity. What makes it unique from other objects? Let’s consider a version of our PostalAddress:

public class PostalAddress {
	public final String name;
	public final int number;
	public final LocalDateTime created;
	public PostalAddress(String name, int number) {
		this.name = name;
		this.number = number;
		created = LocalDateTime.now();
	}
}

Note that this has a created variable, but let’s say we don’t care about it when checking equality. Therefore we don’t care about it when returning a hashCode, either. So how can we return a hashCode that’s valid? One option is to just return an int:

public int hashCode() {
	return 1;
}

Valid hash code? Yes. Good hash code? No.

That’s a terrible idea. We want it to depend upon the things that our equals method might depend upon. What are those? name and number. How might we create a hashCode on that basis? number is already an int. Can we get an int out of a String? Lots of methods return int. You can imagine all sorts of convoluted methods that might involve summing up the integer value of the characters stored in the string, and some of these would probably work great. You know what else would work great? The String’s hashCode method. So you might write:

public int hashCode() {
	return number + name.hashCode();
}

as a decent (though maybe not optimal) method. It turns out you need to know a tiny bit of number theory to understand why this isn’t the best option; when you get to either 250 or 311 I guarantee you’ll learn first-hand. But for now, know that this is OK, but not great.

In the meantime, you can use one of the classes built into Java 11 to write hashCode methods – we talked about this last week:

public int hashCode() {
  return Objects.hash(number, name);
}

This does a reasonable job for you behind the scenes, by combining the hash codes of the fields you pass in. You’d still need to write a valid equals method. Just like before, you’d check to make sure the other Object is non-null, that it’s an instanceof PostalAddress, then cast it to a PostalAddress and check that each relevant field is either == (for primitives like number) or .equals() (for objects like name).
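To make the “OK, but not great” verdict on the simple sum a little more concrete, here is a small sketch comparing the two approaches on a pair of made-up addresses. The helper names sumHash and objectsHash are invented for this example; they just wrap the two hashCode bodies shown above.

```java
import java.util.Objects;

class HashCombineDemo {
    // The "number + name.hashCode()" approach.
    static int sumHash(String name, int number) {
        return number + name.hashCode();
    }

    // The Objects.hash approach, with the same argument order as above.
    static int objectsHash(String name, int number) {
        return Objects.hash(number, name);
    }

    public static void main(String[] args) {
        // "a".hashCode() is 97 and "b".hashCode() is 98, so the sums collide: 2 + 97 == 1 + 98.
        System.out.println(sumHash("a", 2) == sumHash("b", 1));         // true
        // Objects.hash mixes the fields together instead of just adding them.
        System.out.println(objectsHash("a", 2) == objectsHash("b", 1)); // false
    }
}
```

A collision like this is still a valid hash code (the two addresses are not equal, so nothing in the contract is broken); it just means more elements can pile up in the same spot in the table.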

On to the trees

So that’s what you need to know for HashSets. If you can reasonably define an equals and hashCode method, you can get pretty good performance (near constant-time) if you use a HashSet. But there’s another option that also gets pretty good performance, the TreeSet. To describe how it works, we need to define trees.

In computer science and mathematics, a tree is a kind of graph. What’s a graph? It’s a set (oh snap!) of vertices (sometimes also called nodes) and edges between those vertices. (On board)

A tree is a particular kind of graph. It has a single vertex called a root “at the top”, and it grows downward (weird, I know, like an upside-down tree). Each vertex in the tree can have “children”, that is, nodes “below” them. For the sake of simplicity, let’s say each vertex has zero, one, or two children, and the children are “left children” or “right children”.

Trees that have at most two children per vertex are called binary trees.

Remember linked lists? It turns out you can model a tree in code, using something very similar:

class TreeNode<E> {
	E value;
	TreeNode<E> leftChild;
	TreeNode<E> rightChild;

	//...
}

But lucky for you, this is not 187, so we’re just going to draw diagrams to give you an intuition, rather than make you code this up yourself.

Why trees?

OK, so now we’ve just implicitly created a new homework assignment for 187, but who cares, right? It’s just a convoluted list, sort of, right?

Right, and wrong. Depending upon how you organize your tree, you can get very good or very bad performance. If you just stuff items into the tree willy-nilly, then yes, it’s really no better than a linked list, as you’d have to traverse the entire tree to, for example, look to see if an element is there. In some ways it’s worse, because you now also have to write the traversal code for a tree, which is more complicated than the same code for a list. But it turns out if you impose some constraints on the tree, you can do better.

Specifically, let’s say that we require that a left child (and all grandchildren and additional descendants) of a node can only contain a value that’s less than the current node’s value. And a right child (and all descendants) can only contain a value that’s greater than the current node’s value. This is called a binary search tree.

Two trees!

How does this “Binary Search Tree” property help? Consider a tree that holds the values 1 through 7. Let’s say I magically decide to insert them into the tree in this order: {4, 2, 6, 1, 3, 5, 7}:

   4
 2   6
1 3 5 7

This tree holds 7 values, and takes at most three comparisons (one per level) to check whether a given value is in the tree or not. If we build an even bigger tree you can see that the number of levels in the tree (which is how many comparisons are needed to search it, and grows with the tree’s height) grows much more slowly than the tree’s size, which is the total number of nodes in the tree.

Now, let’s build a binary search tree containing the following values, inserted in this order: {3, 2, 6, 10, 5, 1, 9}

   3
 2     6
1    5   10
        9

Balanced? Not quite.
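If you’d like to check these shapes without drawing them, here is a minimal sketch of binary search tree insertion and lookup for ints. It is only an illustration under simple assumptions (no balancing, duplicates ignored), and the names BstDemo, Node, build, and levels are made up for this example.

```java
class BstDemo {
    static class Node {
        final int value;
        Node left;
        Node right;
        Node(int value) { this.value = value; }
    }

    // Insert value below node, keeping smaller values left and larger values right.
    // Returns the root of the (possibly new) subtree. Duplicates are ignored.
    static Node insert(Node node, int value) {
        if (node == null) {
            return new Node(value);
        }
        if (value < node.value) {
            node.left = insert(node.left, value);
        } else if (value > node.value) {
            node.right = insert(node.right, value);
        }
        return node;
    }

    // Walk down from the root, going left or right after each comparison.
    static boolean contains(Node node, int value) {
        while (node != null) {
            if (value == node.value) {
                return true;
            }
            node = (value < node.value) ? node.left : node.right;
        }
        return false;
    }

    // Number of levels in the tree: 0 for an empty tree, 1 for a single node.
    static int levels(Node node) {
        if (node == null) {
            return 0;
        }
        return 1 + Math.max(levels(node.left), levels(node.right));
    }

    // Build a tree by inserting the values in the order given.
    static Node build(int... values) {
        Node root = null;
        for (int v : values) {
            root = insert(root, v);
        }
        return root;
    }

    public static void main(String[] args) {
        System.out.println(levels(build(4, 2, 6, 1, 3, 5, 7)));  // 3
        System.out.println(levels(build(3, 2, 6, 10, 5, 1, 9))); // 4
        System.out.println(levels(build(1, 2, 3, 4, 5, 6, 7)));  // 7
    }
}
```

The last line shows the worst case: inserting already-sorted values gives a tree that is really just a linked list, with one level per element.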

But what if we can magically keep the tree (mostly) balanced? You’ll learn how in 187. It’s not quite constant-time lookup (it’s “logarithmic” overhead) but it’s really fast nonetheless. The logarithm grows very slowly: https://en.wikipedia.org/wiki/Logarithm. So if we say a tree containing 10 elements has an overhead of 1, a tree containing 1000 elements (100 times as many!) has an overhead of only 3. And a tree containing 1,000,000 elements has an overhead of only 6. 10^9 elements? Overhead of 9. And so on.

Again, there are some details I’m skipping over, for example, how do you make sure your tree doesn’t end up looking like a linked list? But you’ll see them in 187 and 311.

TreeSets

OK, so now we see that in order to build a tree, we need to be able to see if values are less than or equal to other values. Have we seen this before? Sure we have. Java’s Comparable and Comparator interfaces. So if we want to be able to place objects into a TreeSet, they’ll have to either have a natural ordering (that is, implement Comparable), or we can create the TreeSet with a specific Comparator in order to decide how the tree is built. So let’s add one.

Back to our PostalAddress:

public int compareTo(PostalAddress o) {
	if (name.compareTo(o.name) != 0) return name.compareTo(o.name);
	return Integer.compare(number, o.number);
}

Now we can instantiate a TreeSet of our PostalAddress (though we’ll need to declare that the class implements Comparable<PostalAddress>, and add back in our toString, first):

Set<PostalAddress> addresses = new TreeSet<PostalAddress>();
for (int i = 1; i <= 10; i++) {
	addresses.add(new PostalAddress("Maple St", i)); // the constructor takes (name, number)
}
System.out.println(addresses);

Here’s the whole “new” PostalAddress.java for your reference:

package hashes;

import java.time.LocalDateTime;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

public class PostalAddress implements Comparable<PostalAddress> {
    public final int number;
    public final String name;
    public final LocalDateTime created;

    public PostalAddress(String name, int number) {
        this.number = number;
        this.name = name;
        created = LocalDateTime.now();
    }

    public boolean equals(Object o) {
        if (o == null) {
            return false;
        }
        if (!(o instanceof PostalAddress)) {
            return false;
        }

        PostalAddress p = (PostalAddress)o;

        if (number == p.number && name.equals(p.name)) {
            return true;
        }
        return false;
    }

    public int hashCode() {
        return Objects.hash(number, name);
    }

    public String toString() {
        return number +  " " + name;
    }

    public static void main(String[] args) {
        Set<PostalAddress> ts = new TreeSet<>();
        for (int i = 1; i <= 10; i++) {
            ts.add(new PostalAddress("Maple St", i));
        }
        System.out.println(ts);
    }

    @Override
    public int compareTo(PostalAddress o) {
        if (name.compareTo(o.name) != 0) return name.compareTo(o.name);
        return Integer.compare(number, o.number);
    }
}

Which should I use?

Generally, you should reach for HashSet when you want to use the Set interface, in the same way you can reach for ArrayList. Usually, objects have a custom hashCode method defined already, and if they don’t, you can usually write one pretty easily using Objects.hash().

TreeSets are most useful when the objects in the set have a natural ordering, and you care about using it. In particular, TreeSet also implements the NavigableSet and SortedSet interfaces, which means you can find the smallest or largest elements of the set, or the elements closest to a given value, in “logarithmic” time, which is about as good as it gets (only “constant time” is better). (See the respective JavaDocs for those interfaces.) If the problem you’re working on makes frequent use of computing these values, then a TreeSet might be a better choice than a HashSet. But 90% or more of the time, HashSet is what you’re gonna want.
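As a small sketch of what those extra interfaces give you, here are a few NavigableSet and SortedSet methods on a TreeSet of Integers. The values are arbitrary, and NavigableDemo and sample are names made up for this example.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class NavigableDemo {
    // Build a small TreeSet; the insertion order does not matter.
    static NavigableSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(50, 10, 40, 20));
    }

    public static void main(String[] args) {
        NavigableSet<Integer> s = sample();
        System.out.println(s);             // [10, 20, 40, 50] (always in sorted order)
        System.out.println(s.first());     // 10, the smallest element
        System.out.println(s.last());      // 50, the largest element
        System.out.println(s.floor(35));   // 20, the largest element <= 35
        System.out.println(s.ceiling(35)); // 40, the smallest element >= 35
        System.out.println(s.floor(5));    // null, since nothing is <= 5
    }
}
```

A HashSet can’t answer questions like floor or ceiling without looking at every element, which is the main reason to pick a TreeSet when you need them often.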

A worked example

We’re going to walk through solving another toy problem that’s made relatively straightforward with the Set abstraction. Then we’ll talk about an extension to this problem that’s harder to solve, and introduce a new abstraction – the Map – to solve it.

The problem is called “Santa’s little helper” and is adapted from http://adventofcode.com/2015/day/3.

As an aside, if you’re looking for nerdy little problems to polish your programming skills on, there are many such sites online. I like Advent of Code because it’s language-agnostic, there’s a nice ramp-up as the days of the month go on, it’s kind of a group activity you can do with your friends, and there’s now a nice archive of these problems to work on.

Santa’s little helper

Santa is delivering presents to an infinite two-dimensional grid of houses.

He begins by delivering a present to the house at his starting location, and then an elf at the North Pole calls him via radio and tells him where to move next. Moves are always exactly one house to the north (^), south (v), east (>), or west (<). After each move, he delivers another present to the house at his new location.

However, the elf back at the north pole has had a little too much eggnog, and so his directions are a little off, and Santa ends up visiting some houses more than once. How many houses receive at least one present?

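For example:

> delivers presents to 2 houses: one at the starting location, and one to the east.

^>v< delivers presents to 4 houses in a square, including twice to the house at the starting location.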

Toward a solution

We’re going to try to solve this in Java. Let’s fire up Code and create a new package. Now let’s write a DeliverySimulator class with a single method that takes an input string of directions and returns the set of locations visited.

import java.util.Set;

public class DeliverySimulator {


	public static Set<Location> locationsVisited(String directions) {
		return null;
	}
}

Note we can use Code to “stub out” our program, and to create empty implementations wherever possible here.

Let’s also add a method to compute the actual number of houses visited:

public static int housesVisited(Set<Location> visited) {
	return visited.size();
}

Hey, half done! Well, sorta.

What should a location look like? Let’s give it an x and y coordinate:

public class Location {
	public final int x;
	public final int y;

	public Location(int x, int y) {
		this.x = x;
		this.y = y;
	}
}

Since we know we’re going to be storing locations in a set, we should make sure we implement meaningful equals and hashCode methods (the hashCode below needs import java.util.Objects; at the top of the file):

@Override
public int hashCode() {
  return Objects.hash(x, y);
}

@Override
public boolean equals(Object o) {
	if (o == null)
		return false;
	if (!(o instanceof Location))
		return false;
	Location l = (Location) o;
	if (x == l.x && y == l.y)
		return true;
	return false;
}

and maybe:

public String toString() {
	return "(" + x + ", " + y + ")";
}

Does this work? We could add a main method to do some testing:

public static void main(String[] args) {
	Location x = new Location(0, 0);
	System.out.println(x);
	Location y = new Location(0, 0);

	System.out.println(x == y);

	System.out.println(x.equals(y));
	System.out.println(x.equals(new Location(1, 0)));
}

Sometimes it’s nice to have automated tests.

A Unit Test

OK, now we’ll write a test.

import static org.junit.Assert.*;

import org.junit.Test;

public class LocationTest {
	@Test
	public void testLocationEquals() {
		Location x = new Location(0,0);
		Location y = x;
		Location z = new Location(0,0);
		assertTrue(x == y);
		assertFalse(x == z);

		assertEquals(x, y);
		assertEquals(x, z);
	}

	@Test
	public void testLocationNotEquals() {
		assertFalse(new Location(0,0).equals(new Location(0,1)));
	}
}

A reminder about assertEquals: it uses the objects’ built-in equals() method. If you have tests failing (say, in a project?), and the expected and actual look the same, maybe you forgot to define .equals on your objects?

Anyway, let’s go back to locationsVisited. According to the problem statement, the house at (0, 0) always gets a present:

public static Set<Location> locationsVisited(String directions) {
	Set<Location> visited = new HashSet<Location>();

	visited.add(new Location(0,0));

	return visited;
}

Does this work? Let’s add a test.

@Test
public void testEmptyDirections() {
	assertEquals(new HashSet<Location>(Arrays.asList(new Location(0,0))), DeliverySimulator.locationsVisited(""));
}

Then we need to think about where the sleigh goes, and deliver a present at each stop:

    public static Set<Location> locationsVisited(String directions) {
        Set<Location> visited = new HashSet<>();
        visited.add(new Location(0, 0));

        int x = 0;
        int y = 0;

        for (int i = 0; i < directions.length(); i++) {
            final char d = directions.charAt(i);
            if (d == '>') {
                x += 1;
            } else if (d == '<') {
                x -= 1;
            } else if (d == '^') {
                y += 1;
            } else if (d == 'v') {
                y -= 1;
            }
            visited.add(new Location(x, y));
        }

        return visited;
    }

And now we add some tests:

@Test
public void testSimple() {
	assertEquals(new HashSet<Location>(Arrays.asList(new Location(0, 0), new Location(1, 0))),
			DeliverySimulator.locationsVisited(">"));
}

@Test
public void testFour() {
	assertEquals(
			new HashSet<Location>(
					Arrays.asList(new Location(0, 0), new Location(0, 1), new Location(1, 1), new Location(1, 0))),
			DeliverySimulator.locationsVisited("^>v<"));
}

You could also add checks for housesVisited, either to their own test cases (which would be more true to the spirit of unit tests) or within the above.

A new problem, and a new data structure: Map ADT

What if, each time Santa visited a house, he delivered a present, and we wanted to know the total number of presents delivered? How might you go about tracking this?

You could, for example, keep a List of visited locations (instead of a Set). Then you could count the number of times each location appeared. But how would you tabulate it? You want to build some kind of table, a way to connect a set of things, and for each thing, some information about it.

(0,0) | (0,1) | (1,1) | (1,0)
------|-------|-------|------
  2   |   1   |   1   |   1

You might create two Lists, one of Location and one of Integer, where the ith element in each corresponded. But that’s clunky, hard to program, and has the poor performance of a list for lookups. You might edit the Location to contain a counter variable. That could work; but it might cause problems in other contexts, when the thing you’re storing doesn’t have a clear “has-a” relationship with whatever you’re adding it to. Objects, like functions, should generally do a small set of things well.

What if I told you there was a data structure that solved this problem? What if I told you you’ve already (sorta) seen it? What if I told you it could be yours for no money down, and no monthly installments?

Introducing… the Map.

Map ADT

A Map “maps” keys to values. In other words, it associates one kind of thing with another kind of thing.

The first “thing” is the key. Keys are lookup keys; you can think of them as kinda analogous to an index card, or a page number, or a URL; they are the thing we can do lookups on (though they can also be useful data in and of themselves).

The second “thing” is a value – each value is associated with a key.

So Maps model, in essence, a table (on board). For our Santa problem, we might use Locations as keys and Integers as values.

The keys in a map form a set – there can only be one of each. And each key is associated with exactly one value. Keys are unique (as they are a set), but the same value can be associated with more than one key (as shown in our table above).

And the keys should be immutable, or at least, should not change with respect to equals, for exactly the same reason as in Sets – in fact, if you recall, HashSet is implemented using a HashMap.

Like Sets, Maps come in two flavors you’ll likely use: HashMaps and TreeMaps, with the same constraints (about efficient hashCode and compareTo methods).

The type signature is slightly different though: Map<K, V>; the key and value types are both parameterized. This should make sense: there’s no particular reason a map would store keys and values of the same type, only that all keys are of the same type, and all values are of the same type.

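There are many useful Map methods, but we’ll start with the basics that are unique to Map: put(key, value) associates a value with a key, replacing any value that key already had; get(key) returns the value associated with a key, or null if there isn’t one; containsKey(key) tells you whether a key is in the map; and getOrDefault(key, defaultValue) works like get, but returns the default you supply instead of null when the key is missing.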
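As a quick sketch of those basic operations (put, get, containsKey, and getOrDefault) before we apply them to Santa, here is a map from Strings to Integers. The fruit counts are made-up example data, and MapBasicsDemo and sample are names invented for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class MapBasicsDemo {
    // Build a small map from fruit names to counts.
    static Map<String, Integer> sample() {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("apple", 3);
        counts.put("pear", 1);
        counts.put("apple", 5); // same key again: the old value 3 is replaced
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = sample();
        System.out.println(counts.size());                  // 2, since keys are unique
        System.out.println(counts.get("apple"));            // 5
        System.out.println(counts.get("kiwi"));             // null, no such key
        System.out.println(counts.containsKey("pear"));     // true
        System.out.println(counts.getOrDefault("kiwi", 0)); // 0
    }
}
```

Notice that both type parameters show up in the declaration: String for the keys and Integer for the values.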

Let’s use these to come up with an answer to our previous question (“how many presents per house”).

How many? Etc.

Let’s parallel the code we wrote previously, but use a Map instead of a Set. We have to import some things, but then we can copy/paste our code and modify it in place.

    public static Map<Location, Integer> locationCount(String directions) {
        Map<Location, Integer> locationCount = new HashMap<>();

        locationCount.put(new Location(0, 0), 1);

        int x = 0;
        int y = 0;

        for (int i = 0; i < directions.length(); i++) {
            final char d = directions.charAt(i);
            if (d == '>') {
                x += 1;
            } else if (d == '<') {
                x -= 1;
            } else if (d == '^') {
                y += 1;
            } else if (d == 'v') {
                y -= 1;
            }
            
            Location current = new Location(x, y);

            // update code goes here
        }

        return locationCount;
    }

We have to declare a Map in our method, and translate the Set.add into Map.put. We also need to decide what to do when it’s time to update our map. There are two cases:

            // case 1: not in the map
            if (!locationCount.containsKey(current)) {
                locationCount.put(current, 1);
            } 
            // case 2: already in the map
            else {
                int j = locationCount.get(current);
                j++;
                locationCount.put(current, j);
            }

If the location isn’t in the map, we put it in, along with a count of 1. If it is, we pull out the current value, increment it, and update it in the map. We can do this more concisely:

            int t = locationCount.getOrDefault(current, 0);
            t++;
            locationCount.put(current, t);

Here, we pull the count out of the map, but set it to 0 if it’s not there. Then we increment it and put it back in.

Let’s test it out in a main method:

public static void main(String[] args) {
	Map<Location, Integer> lc = locationCount("^>v<");
	System.out.println(lc);
}

What do we expect? Same as the table above. What do we get?

{(1, 0)=1, (0, 0)=2, (1, 1)=1, (0, 1)=1}

Bingo!

How might we compute the number of houses visited, given the code written so far (in the context of our main method)?

        // compute number of houses visited
        System.out.println(lc.size());

How might we find the set of houses visited, given the code written so far?

public Set<Location> locationsVisited() {
	// what goes here?
}

How might we find the number of presents delivered to a house? Let’s define a method to compute it, given an already-filled-in Map and a Location to report on:

public static int numberPresents(Map<Location, Integer> locationCount, Location loc) {
	return locationCount.getOrDefault(loc, 0);
}

And then test it in our main method:

        // number of presents
        System.out.println(numberPresents(lc, new Location(0, 0)));
        System.out.println(numberPresents(lc, new Location(9, 9)));

On Map generality

Maps need not just be used to count things; they can be used to associate any (relatively immutable) set of keys with any values you like. For example, you could store: