Lecture 10: Introducing Sets

Welcome

Announcements

Back to compareTo

Generally, you want to implement Comparable if you expect values of the data type to be compared and you want there to be a canonical (“natural”) way to compare them. You usually define custom Comparators for special-purpose orderings, like a custom sort. Let’s do both. First, let’s impose a natural order on PostalAddress so that addresses sort first by street name, then by number. Add implements Comparable<PostalAddress> to the class signature, and Eclipse can helpfully add the missing method:

@Override
public int compareTo(PostalAddress o) {
	// TODO Auto-generated method stub
	return 0;
}

Well that won’t do.

What to do

Implement the method. (JavaDoc for compareTo on projector.) See: http://docs.oracle.com/javase/8/docs/api/java/lang/Comparable.html

The memory aid is that if you want x < y, then you need to write compareTo such that x.compareTo(y) < 0.

Similarly, if you want x > y, then you need to write compareTo such that x.compareTo(y) > 0. And if x and y should be considered equal in the ordering, x.compareTo(y) should return 0.

In-class exercise

@Override
public int compareTo(PostalAddress o) {
	if (streetName.compareTo(o.streetName) < 0) return -1;
	if (streetName.compareTo(o.streetName) > 0) return 1;
	if (number < o.number) return -1;
	if (number > o.number) return 1;
	return 0;		
}

or slightly more concisely:

@Override
public int compareTo(PostalAddress o) {
	if (streetName.compareTo(o.streetName) < 0) return -1;
	if (streetName.compareTo(o.streetName) > 0) return 1;
	return Integer.compare(number, o.number);
}

Remember, you can look up Integer.compare in the Java API (or just Google it).
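Here is a small sketch of what Integer.compare gives you, and of why it is a safer habit than the subtraction shortcut you sometimes see. The class and method names (CompareDemo, compareBySubtraction) are made up for this illustration.

```java
class CompareDemo {
	// The tempting shortcut: negative when a < b... unless the subtraction overflows.
	static int compareBySubtraction(int a, int b) {
		return a - b;
	}

	public static void main(String[] args) {
		System.out.println(Integer.compare(3, 7) < 0);  // true: 3 comes before 7
		System.out.println(Integer.compare(7, 3) > 0);  // true: 7 comes after 3
		System.out.println(Integer.compare(5, 5) == 0); // true: equal

		// Integer.MIN_VALUE - 1 wraps around to a positive number,
		// so the shortcut claims MIN_VALUE is the larger of the two.
		System.out.println(compareBySubtraction(Integer.MIN_VALUE, 1) > 0); // true (wrong answer)
		System.out.println(Integer.compare(Integer.MIN_VALUE, 1) < 0);      // true (right answer)
	}
}
```

For house numbers the overflow would never bite, but Integer.compare costs nothing extra and is always right.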

Let’s create the list out of order, print it, sort it, then print it:

for (int i = 10; i >=1; i -= 2) {
	addresses.add(new PostalAddress(i, "Maple St"));
}		
for (int i = 1; i < 10; i += 2) {
	addresses.add(new PostalAddress(i, "Birch St"));
}
System.out.println(addresses);
addresses.sort(null);
System.out.println(addresses);

Hey, it works!
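If you want to run this yourself, here is a minimal self-contained sketch. The shape of PostalAddress (its two fields, constructor, and toString) is an assumption on my part, since the class isn't reproduced in these notes, and I've shortened the loops to keep the output small.

```java
import java.util.ArrayList;
import java.util.List;

class NaturalOrderDemo {
	// Assumed shape of PostalAddress: a house number plus a street name.
	static class PostalAddress implements Comparable<PostalAddress> {
		final int number;
		final String streetName;

		PostalAddress(int number, String streetName) {
			this.number = number;
			this.streetName = streetName;
		}

		@Override
		public int compareTo(PostalAddress o) {
			int byStreet = streetName.compareTo(o.streetName);
			if (byStreet != 0) return byStreet;       // street name decides first
			return Integer.compare(number, o.number); // then house number
		}

		@Override
		public String toString() {
			return number + " " + streetName;
		}
	}

	static List<PostalAddress> sortedAddresses() {
		List<PostalAddress> addresses = new ArrayList<>();
		for (int i = 4; i >= 1; i -= 2) {
			addresses.add(new PostalAddress(i, "Maple St"));
		}
		for (int i = 1; i < 4; i += 2) {
			addresses.add(new PostalAddress(i, "Birch St"));
		}
		addresses.sort(null); // null means "use the natural order", i.e. compareTo
		return addresses;
	}

	public static void main(String[] args) {
		System.out.println(sortedAddresses());
		// prints [1 Birch St, 3 Birch St, 2 Maple St, 4 Maple St]
	}
}
```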

Now let’s define a custom comparator for use in doing a “postal sort”. That is, we still want to sort such that street names are alphabetical, but we want the numbers sorted as all odd first (in ascending order), then all even (in descending order). This is how the truck might go up and down the street (on board).

What does that look like? Let’s declare a new Comparator:

public class PostalOrderComparator implements Comparator<PostalAddress> { ... }

Again, Eclipse helpfully fills it out with the method we need to implement, so let’s do it.

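It will be similar to but more complicated than the compareTo method we just wrote. A tip: x % 2 == 0 if and only if x is even. For non-negative x, x % 2 == 1 iff x is odd. (Careful: in Java, a negative odd x gives x % 2 == -1. Street numbers are positive, so we’re fine here.)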

public int compare(PostalAddress o1, PostalAddress o2) {
	if (o1.streetName.compareTo(o2.streetName) < 0) return -1;
	if (o1.streetName.compareTo(o2.streetName) > 0) return 1;
	if (o1.number % 2 == 1 && o2.number % 2 == 0) return -1;
	if (o1.number % 2 == 0 && o2.number % 2 == 1) return 1;
	if (o1.number % 2 == 1) return Integer.compare(o1.number, o2.number);
	if (o1.number % 2 == 0) return Integer.compare(o2.number, o1.number);
	return 0;
}

And let’s check it out:

for (int i = 6; i >=1; i -= 2) {
	addresses.add(new PostalAddress(i, "Maple St"));
}		
for (int i = 1; i < 6; i += 2) {
	addresses.add(new PostalAddress(i, "Birch St"));
}
for (int i = 6; i >=1; i -= 2) {
	addresses.add(new PostalAddress(i, "Birch St"));
}		
for (int i = 1; i < 6; i += 2) {
	addresses.add(new PostalAddress(i, "Maple St"));
}
System.out.println(addresses);
addresses.sort(null);
System.out.println(addresses);		
addresses.sort(new PostalOrderComparator());
System.out.println(addresses);

Things we might do to improve this? Add an isOdd and/or isEven method for readability, perhaps? Pull out o1.number and o2.number into local variables? Both are debatable. Here’s what we ended up with in class:

	public int compare(PostalAddress o1, PostalAddress o2) {
		if (o1.streetName.compareTo(o2.streetName) != 0)
			return o1.streetName.compareTo(o2.streetName); // sort by street name first
		// then break ties on street name
		if (o1.number % 2 == 1 && o2.number % 2 == 0) return -1; // if o1 is odd, it comes first
		if (o2.number % 2 == 1 && o1.number % 2 == 0) return 1;  // if o2 is odd, it comes first
		// then break ties again, on number
		if (o1.number % 2 == 1) return Integer.compare(o1.number, o2.number);
		return -Integer.compare(o1.number, o2.number);
	}
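For comparison, here is one way the isOdd-plus-local-variables idea might look. This is a sketch, not what we wrote in class: the isOdd helper and the minimal PostalAddress class are assumptions made so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PostalOrderDemo {
	// Assumed minimal PostalAddress, just enough for the comparator.
	static class PostalAddress {
		final int number;
		final String streetName;

		PostalAddress(int number, String streetName) {
			this.number = number;
			this.streetName = streetName;
		}

		@Override
		public String toString() {
			return number + " " + streetName;
		}
	}

	static class PostalOrderComparator implements Comparator<PostalAddress> {
		private static boolean isOdd(int n) {
			return n % 2 != 0; // also correct for negative n, where n % 2 is -1
		}

		@Override
		public int compare(PostalAddress o1, PostalAddress o2) {
			int byStreet = o1.streetName.compareTo(o2.streetName);
			if (byStreet != 0) return byStreet;     // street name first
			boolean odd1 = isOdd(o1.number);
			boolean odd2 = isOdd(o2.number);
			if (odd1 != odd2) return odd1 ? -1 : 1; // odd side of the street first
			return odd1
				? Integer.compare(o1.number, o2.number)  // odds ascending
				: Integer.compare(o2.number, o1.number); // evens descending
		}
	}

	static List<PostalAddress> postalOrder() {
		List<PostalAddress> addresses = new ArrayList<>();
		for (int i = 1; i <= 6; i++) {
			addresses.add(new PostalAddress(i, "Maple St"));
		}
		addresses.sort(new PostalOrderComparator());
		return addresses;
	}

	public static void main(String[] args) {
		System.out.println(postalOrder());
		// prints [1 Maple St, 3 Maple St, 5 Maple St, 6 Maple St, 4 Maple St, 2 Maple St]
	}
}
```

Whether this reads better than the version above is, again, debatable.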

Sets: an introduction

Today we’re going to move on in our list of top-n abstract data types from the List to the Set. In order to give you some grounding in what a set is and how it differs from a list, we’re going to turn to an arcane and little-known subject: Mathematics. To be clear, we’re going to do a very gentle introduction to set theory; if you have already taken a discrete math course this will be review for you, and if you stay in CS, you’ll see this again in a lot more detail in COMPSCI 250.

Simply put: a set is a collection of distinct objects.

The objects can be anything: people, numbers, shapes, colors, (or perhaps most topically, instances of Java objects). These objects are generally referred to as members or elements of a set.

In set theory, sets are named by a single uppercase letter: A or B, for example.

For our purposes, we’ll usually write sets as a list of the elements. The list will be comma separated, and will be enclosed in curly braces. For example: A = {1, 3, -6} describes a set called “A” that has three integers as elements.

Sets contain a collection of unique items. That is, sets cannot have duplicate items.

While the items in a set might have an implicit, natural order (like the integers), the set itself doesn’t define an order. So {1, 3, -6} = {-6, 1, 3}, that is, they’re the same set. Order doesn’t matter when comparing sets. (This is very different from our intuition with lists, where different orders do generally matter.)

There are a few bits of notation I want you to have seen, so now I’m going to write them down for you.

First, how do we say a set contains an element? There’s a symbol that looks like this: ∈ (kind of a funny “E”) which is used to denote set membership. For example, “3 ∈ A” is pronounced “3 is an element of A.” You can think of the funny E as standing for “Element of” to help remember it. Likewise, ∉ means “not an element of” or “does not contain”, as in “10 ∉ A”. How do we say a set is empty? We call it the empty set and write it as ∅ (or {}).

Next, sometimes we might want to talk about one set being “contained within another”. For example, if B = {1, 3, -6, 10}, we might say A is a “subset” of B, since every element of A is also an element of B. This is written with the set containment symbol, which looks kinda like a curvy less-than-or-equal-to (or greater-than-or-equal-to): “A ⊆ B”. It “opens toward” the bigger set.

Sets are sometimes represented abstractly as “Venn diagrams”. Here’s what the above two sets might look like as a Venn diagram:

[Venn diagram: A is a subset of B]
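Looking ahead to Java for a moment, membership and subset tests map fairly directly onto methods of the Set interface. A minimal sketch, using the same A and B as above (the class name MembershipDemo is just for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MembershipDemo {
	static final Set<Integer> A = new HashSet<>(Arrays.asList(1, 3, -6));
	static final Set<Integer> B = new HashSet<>(Arrays.asList(1, 3, -6, 10));

	public static void main(String[] args) {
		System.out.println(A.contains(3));    // true:  3 is an element of A
		System.out.println(A.contains(10));   // false: 10 is not an element of A
		System.out.println(B.containsAll(A)); // true:  A is a subset of B
		System.out.println(A.containsAll(B)); // false: B is not a subset of A
		System.out.println(new HashSet<Integer>().isEmpty()); // true: the empty set
	}
}
```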

Finally, there are a few operations on sets you should know about.

First is “union”, written as a little ∪. The union of two sets contains all their elements. So if we have A = {1, 3, -6} and C = {9, 3, 4}, then A ∪ C = {1, 3, -6, 9, 4}. As a mnemonic, think of the “United States” as a union of many things into one.

[Venn diagram: A union C]

Next is the “intersection”. The intersection of two sets contains only the elements they have in common and is written as an upside-down u: ∩. Continuing our example from above, A ∩ C = {3}.

[Venn diagram: A intersection C]

Think of the “intersection of two roads”: the intersection is just the part they share, not all of both roads.

Two more operations (I promise) then we’re done with math and notation. First is set difference, sometimes called “relative complement” or “set-theoretic difference.” It’s written with a backslash \ and refers to all the elements in one set that aren’t in another. For example, A \ C = {1, -6}. Note that in set difference, which set you write first matters (unlike union and intersection). Finally, there’s symmetric difference of two sets, which is all the things in the union that aren’t in the intersection. It can be written as Δ. A Δ C = {1, -6, 9, 4}.
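All four operations can be sketched in Java with addAll, retainAll, and removeAll. The helper names below (union, intersection, difference, symmetricDifference, setOf) are my own, not part of the Java API. Each helper works on a copy so the input sets are left alone.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SetOperationsDemo {
	static Set<Integer> union(Set<Integer> a, Set<Integer> b) {
		Set<Integer> result = new HashSet<>(a); // copy, so a itself is not modified
		result.addAll(b);
		return result;
	}

	static Set<Integer> intersection(Set<Integer> a, Set<Integer> b) {
		Set<Integer> result = new HashSet<>(a);
		result.retainAll(b); // keep only the elements that are also in b
		return result;
	}

	static Set<Integer> difference(Set<Integer> a, Set<Integer> b) {
		Set<Integer> result = new HashSet<>(a);
		result.removeAll(b); // drop everything that is in b
		return result;
	}

	static Set<Integer> symmetricDifference(Set<Integer> a, Set<Integer> b) {
		Set<Integer> result = union(a, b);
		result.removeAll(intersection(a, b)); // union minus intersection
		return result;
	}

	// Convenience for building small sets in the examples.
	static Set<Integer> setOf(Integer... elements) {
		return new HashSet<>(Arrays.asList(elements));
	}

	public static void main(String[] args) {
		Set<Integer> a = setOf(1, 3, -6);
		Set<Integer> c = setOf(9, 3, 4);
		System.out.println(union(a, c).equals(setOf(1, 3, -6, 9, 4)));            // true
		System.out.println(intersection(a, c).equals(setOf(3)));                  // true
		System.out.println(difference(a, c).equals(setOf(1, -6)));                // true
		System.out.println(difference(c, a).equals(setOf(9, 4)));                 // true: order matters
		System.out.println(symmetricDifference(a, c).equals(setOf(1, -6, 9, 4))); // true
	}
}
```

I compare against expected sets with equals rather than printing the sets, because a HashSet makes no promise about the order its elements print in.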

Sets in Java

How does Java represent a set? As an abstract data type, specified by the Set interface. First we’ll talk about the properties and assumptions we might expect from a Set, in the abstract. Then we’ll talk about two concrete implementations of the data type provided by the Java API and see how they work.

sets, like lists, are unbounded; that is, they don’t have a fixed size
duplicate elements are not allowed (only new elements are added; attempts to re-add existing elements are ignored)
sets are unordered (usually – though there is a subtype, SortedSet, that keeps its elements in order)
sets, like lists, can contain a null element (I hope you like NullPointerExceptions! though note some implementations might forbid null elements)
sets support an add (and an addAll) operation, which can modify the current set
sets support a remove operation of a specific value
sets support a size operation to determine how many elements are currently in the set
sets support a contains (and a containsAll) operation to check membership
and more, but we’ll get to them later when we look at the full API that Java supplies.
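A quick sketch of the properties in the list above, using a HashSet of street names. The class and method names (SetBasicsDemo, demo) are just for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class SetBasicsDemo {
	static Set<String> demo() {
		Set<String> streets = new HashSet<>();
		System.out.println(streets.add("Maple St"));      // true: it was added
		System.out.println(streets.add("Birch St"));      // true
		System.out.println(streets.add("Maple St"));      // false: already present, set unchanged
		System.out.println(streets.size());               // 2
		System.out.println(streets.contains("Birch St")); // true
		System.out.println(streets.remove("Birch St"));   // true: it was present and is now gone
		System.out.println(streets.contains("Birch St")); // false
		System.out.println(streets.add(null));            // true: HashSet permits a null element
		System.out.println(streets.size());               // 2
		return streets;
	}

	public static void main(String[] args) {
		demo();
	}
}
```

Note that add and remove return a boolean telling you whether the set actually changed, which is handy for detecting duplicates.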

Let’s take a look at the interface: http://docs.oracle.com/javase/8//docs/api/java/util/Set.html

Not too different from List, though you’ll note some things (like remove at an index, or get) are not present, as those operations don’t make sense in the context of sets – they’re unordered, so there is no index!

Pay special attention to a few things:

“…sets contain no pair of elements e1 and e2 such that e1.equals(e2)” – the equals method is very important to sets. If you stick in objects whose class doesn’t override equals, the set will use Object’s equals method, which only considers an object equal to itself. Make sure that’s what you want if so.
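To see what falling back on Object’s equals means in practice, here is a sketch with a hypothetical Point class that deliberately does not override equals (or hashCode):

```java
import java.util.HashSet;
import java.util.Set;

class EqualsDemo {
	// Hypothetical class that does NOT override equals or hashCode.
	static class Point {
		final int x;
		final int y;

		Point(int x, int y) {
			this.x = x;
			this.y = y;
		}
	}

	static int sizeAfterAddingTwoLookalikes() {
		Set<Point> points = new HashSet<>();
		points.add(new Point(1, 2));
		points.add(new Point(1, 2)); // same coordinates, but a different object
		return points.size();
	}

	static int sizeAfterAddingSameObjectTwice() {
		Set<Point> points = new HashSet<>();
		Point p = new Point(1, 2);
		points.add(p);
		points.add(p); // the very same object, so Object's equals says "equal"
		return points.size();
	}

	public static void main(String[] args) {
		System.out.println(sizeAfterAddingTwoLookalikes());   // 2: the set sees two distinct elements
		System.out.println(sizeAfterAddingSameObjectTwice()); // 1
	}
}
```

If you wanted the two lookalike points to count as one element, Point would need its own equals (and a matching hashCode).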

Also note that “great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set.” In other words, if you have a setter that changes an instance variable in an object, and that instance variable is considered by the object’s equals method, Set will have undefined (read: bad) behavior.

So putting immutable things into sets is OK, like Integers or Strings. Putting in arbitrary objects that can be changed is not so good. Putting in things that can be changed, but that you won’t change, is OK but dangerous – what if you accidentally do end up changing the object? The Set will almost certainly misbehave in a weird way.
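Here is a sketch of one such weird misbehavior. Counter is a hypothetical mutable class whose equals and hashCode both depend on its value field. The documentation only says the behavior is unspecified, so treat the output in the comments as what I’d expect from the standard OpenJDK HashSet, not as a guarantee.

```java
import java.util.HashSet;
import java.util.Set;

class MutableElementDemo {
	// Hypothetical mutable class: value is used by both equals and hashCode.
	static class Counter {
		int value;

		Counter(int value) {
			this.value = value;
		}

		@Override
		public boolean equals(Object o) {
			return o instanceof Counter && ((Counter) o).value == value;
		}

		@Override
		public int hashCode() {
			return Integer.hashCode(value);
		}
	}

	static boolean foundAfterMutation() {
		Set<Counter> set = new HashSet<>();
		Counter c = new Counter(1);
		set.add(c);             // stored under the hash code for value 1
		c.value = 2;            // now equals and hashCode report something different
		return set.contains(c); // the set looks under the hash code for value 2
	}

	public static void main(String[] args) {
		System.out.println(foundAfterMutation()); // false: the set "lost" an object it still holds
	}
}
```

The object is still in the set (its size is still 1), but the set can no longer find it.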

Other than those two restrictions, you can use Sets almost like Lists. Let’s do some examples:

Set<Integer> s = new HashSet<Integer>();

s.add(1);
s.add(2);
System.out.println(s); // like lists, you can print them and their contents are printed

Set<Integer> t = new HashSet<Integer>();

t.add(2);
t.add(3);
t.add(4);

for (Integer i : t) {
	System.out.println(i); // like lists, you can iterate over them
}

s.addAll(t); // all elements in t are added to s; t is unchanged but s is not!
System.out.println(s);
System.out.println(t);

s.removeAll(t); // all elements in t are removed from s, as above
System.out.println(s);
System.out.println(t);

And you generally do want to use Sets when the set properties (of uniqueness and lack-of-intrinsic-order) apply to your data set, especially if your data set is going to be large.

Why? (you might ask.) Because sets have much, much better general performance for insertion, removal, and containment-testing than lists. How? (you might ask.) Well, now we have to talk a little about how the two most common implementations of Sets work: HashSets and TreeSets.

On HashSets

One possible implementation of the set is the HashSet, which depends upon a correct hashCode method. Why? “This class implements the Set interface, backed by a hash table (actually a HashMap instance).” Now let’s look at the documentation for hashCode: “This method is supported for the benefit of hash tables such as those provided by HashMap.”

Wow, hash tables are so important that every object in Java must supply a hashCode method – it’s built into Object.

hashCode returns an integer, and must obey the contract in its documentation. Let’s look at each piece:

It provides something like (but not exactly like!) equality: If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

This implies that if you use a field in an equals method, you should also use it in the hashCode method.

It is consistent: Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified.
It is not an equality check, though: It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

So you could have a hashCode method that always returned the same integer, like 1, and it would technically obey the contract. But usually, the hashCode of objects is not 1, but instead a large integer. Going back to our old example of (not) aliasing:

String x = new String("foo");
String y = new String("foo");
System.out.println(x == y);
System.out.println(x.equals(y));
System.out.println(x.hashCode());
System.out.println(y.hashCode());

Do you expect them to be ==? No. Do you expect them to be equals? Yes. Do you expect them to have the same hash code? Yes, because of the first property above.
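The same reasoning applies to your own classes. Here is a sketch of an equals/hashCode pair for a PostalAddress-like class, where both methods use the same two fields. The class shape is an assumption (it isn’t reproduced in these notes), and Objects.hash is just one convenient way to combine fields.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class HashCodeDemo {
	// Assumed shape of PostalAddress: a house number plus a street name.
	static class PostalAddress {
		final int number;
		final String streetName;

		PostalAddress(int number, String streetName) {
			this.number = number;
			this.streetName = streetName;
		}

		@Override
		public boolean equals(Object o) {
			if (this == o) return true;
			if (!(o instanceof PostalAddress)) return false;
			PostalAddress other = (PostalAddress) o;
			return number == other.number && streetName.equals(other.streetName);
		}

		@Override
		public int hashCode() {
			return Objects.hash(number, streetName); // the same two fields equals uses
		}
	}

	static int distinctCount() {
		Set<PostalAddress> set = new HashSet<>();
		set.add(new PostalAddress(1, "Maple St"));
		set.add(new PostalAddress(1, "Maple St")); // equal to the first, so ignored
		return set.size();
	}

	public static void main(String[] args) {
		PostalAddress x = new PostalAddress(1, "Maple St");
		PostalAddress y = new PostalAddress(1, "Maple St");
		System.out.println(x == y);                       // false: two different objects
		System.out.println(x.equals(y));                  // true
		System.out.println(x.hashCode() == y.hashCode()); // true, as the contract requires
		System.out.println(distinctCount());              // 1
	}
}
```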

Why does this weird integer result in fast (“constant time”) lookups?

Because you can use it as an index into an array.

In short, “hash tables” are arrays that store objects based upon their “hash code”. If you want to put an element into the array, you figure out the right place to put it from its hash code (for example, by taking the hash code modulo the array’s length). And if you want to see if an element is in the array, you compute its hash code, then jump to the right spot in the array.

In a perfect world, the array would be big enough to hold everything, and the hash codes would always be unique per-object, and this would all just work. In practice, sometimes there are collisions – more than one object ends up in the same spot in the array. We resolve these collisions in different ways (one way: each element of the array might be a short linked list of elements with the hash code corresponding to that element’s index), and things usually work out with near-constant-time performance.
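To make the array-of-short-lists idea concrete, here is a toy sketch. TinyHashSet is my own illustration and is not how java.util.HashSet is actually written: it never resizes, doesn’t accept null, and picks the array index with a plain modulo.

```java
import java.util.LinkedList;

class TinyHashSet {
	private final LinkedList<Object>[] buckets; // each array slot holds a short list
	private int size = 0;

	@SuppressWarnings("unchecked")
	TinyHashSet(int capacity) {
		buckets = new LinkedList[capacity];
		for (int i = 0; i < capacity; i++) {
			buckets[i] = new LinkedList<>();
		}
	}

	// Turn a hash code into an array index; floorMod copes with negative hash codes.
	private int indexFor(Object o) {
		return Math.floorMod(o.hashCode(), buckets.length);
	}

	boolean add(Object o) {
		LinkedList<Object> bucket = buckets[indexFor(o)];
		if (bucket.contains(o)) return false; // already present: sets ignore duplicates
		bucket.add(o);
		size++;
		return true;
	}

	boolean contains(Object o) {
		// Jump straight to one bucket, then use equals within that short list only.
		return buckets[indexFor(o)].contains(o);
	}

	int size() {
		return size;
	}

	public static void main(String[] args) {
		TinyHashSet set = new TinyHashSet(4);
		System.out.println(set.add(1));      // true
		System.out.println(set.add(5));      // true: 1 and 5 collide (both land in bucket 1)
		System.out.println(set.add(1));      // false: duplicate
		System.out.println(set.contains(5)); // true
		System.out.println(set.contains(9)); // false: same bucket as 1 and 5, but not stored
		System.out.println(set.size());      // 2
	}
}
```

As long as the lists stay short, add and contains only ever look at a handful of elements, no matter how many elements the set holds overall.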