09: Introducing Sets

Announcements

Reminder: Quiz 2 during the next discussion meeting! Anything we’ve done through last class (use of List) is fair game.

Sets: an introduction

Today we’re going to move on in our list of top-n abstract data types from the List to the Set. In order to give you some grounding in what a set is and how it differs from a list, we’re going to turn to an arcane and little-known subject: Mathematics. To be clear, we’re going to do a very gentle introduction to set theory; if you have already taken a discrete math course this will be review for you, and if you stay in CS, you’ll see this again in a lot more detail in COMPSCI 250.

Simply put: a set is a collection of distinct objects.

The objects can be anything: people, numbers, shapes, colors (or, perhaps most topically, Java objects). These objects are generally referred to as members or elements of a set.

In set theory, sets are named by a single uppercase letter: A or B, for example.

For our purposes, we’ll usually write sets as a list of the elements. The list will be comma separated, and will be enclosed in curly braces. For example: A = {1, 3, -6} describes a set called “A” that has three integers as elements.

Sets contain a collection of unique items. That is, sets cannot have duplicate items.

While the items in a set might have an implicit, natural order (like the integers), the set itself doesn’t define an order. So {1, 3, -6} = {-6, 1, 3}, that is, they’re the same set. Order doesn’t matter when comparing sets. (This is very different from our intuition with lists, where different orders do generally matter.)
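Looking ahead to Java for a moment, here is a minimal sketch of those two properties (no duplicates, no order) next to the behavior of a list. It uses only the standard java.util collections; the class name SetVersusList and the helper setOf are illustrative names, not anything from the course code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SetVersusList {
    // Build a set from the given values; repeated values collapse into one element.
    static Set<Integer> setOf(Integer... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        Set<Integer> a = setOf(1, 3, -6);
        Set<Integer> b = setOf(-6, 1, 3, 3); // different order, and 3 is repeated

        System.out.println(a.equals(b)); // true: same elements, so the same set
        System.out.println(b.size());    // 3: the repeated 3 is stored only once

        List<Integer> x = Arrays.asList(1, 3, -6);
        List<Integer> y = Arrays.asList(-6, 1, 3);
        System.out.println(x.equals(y)); // false: lists are compared position by position
    }
}
```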

There are a few bits of notation I want you to have seen, so now I’m going to write them down for you.

First, how do we say a set contains an element? There’s a symbol that looks like this: ∈ (kind of a funny “E”) which is used to denote set membership. For example, “3 ∈ A” is pronounced “3 is an element of A.” You can think of the funny E as standing for “Element of” to help remember it. Likewise, ∉ means “is not an element of”, as in “10 ∉ A”. How do we say a set is empty? We call it the empty set and write it as ∅ (or {}).

Next, sometimes we might want to talk about one set being “contained within another”. For example, if B = {1, 3, -6, 10}, we might say A is a “subset” of B, since every element of A is also an element of B. This is written with the set containment symbol, which looks kinda like a curvy less-than-or-equal-to: “A ⊆ B”. Like less-than-or-equal-to, it “opens toward” the bigger set (and can be flipped around, as in “B ⊇ A”).
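As a preview of how this notation shows up in code, here is a small sketch using the sets A and B from above. The methods contains, containsAll and isEmpty are part of Java’s Set interface; the class MembershipDemo and the helper isSubset are made-up names for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MembershipDemo {
    // True when every element of a is also an element of b, that is, a is a subset of b.
    static <T> boolean isSubset(Set<T> a, Set<T> b) {
        return b.containsAll(a);
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 3, -6));
        Set<Integer> b = new HashSet<>(Arrays.asList(1, 3, -6, 10));
        Set<Integer> empty = new HashSet<>();

        System.out.println(a.contains(3));      // true:  3 is an element of A
        System.out.println(a.contains(10));     // false: 10 is not an element of A
        System.out.println(empty.isEmpty());    // true:  this is the empty set
        System.out.println(isSubset(a, b));     // true:  A is a subset of B
        System.out.println(isSubset(b, a));     // false: B contains 10, A does not
        System.out.println(isSubset(empty, a)); // true:  the empty set is a subset of every set
    }
}
```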

Sets are sometimes represented abstractly as “Venn diagrams”. Here’s what the above two sets might look like as a Venn diagram: (on board).

Finally, there are a few operations on sets you should know about.

First is “union”, written as a little ∪. The union of two sets contains all their elements. So if we have A = {1, 3, -6} and C = {9, 3, 4}, then A ∪ C = {1, 3, -6, 9, 4}. (Also show with Venn diagram.) As a mnemonic, think of the “United States” as a union of many things into one.

Next is the “intersection”. The intersection of two sets contains only the elements they have in common and is written as an upside-down u: ∩. Continuing our example from above, A ∩ C = {3}. (Also show Venn diagram). Think of the “intersection of two roads”: the intersection is just the part they share, not all of both roads.

Two more operations (I promise) then we’re done with math and notation. First is set difference, sometimes called “relative complement” or “set-theoretic difference.” It’s written with a backslash \ and refers to all the elements in one set that aren’t in another. For example, A \ C = {1, -6}. (Venn diagram). Note that in set difference, which set you write first matters (unlike union and intersection): C \ A = {9, 4}. Finally, there’s symmetric difference of two sets, which is all the things in the union that aren’t in the intersection. It can be written as Δ. A Δ C = {1, -6, 9, 4}.
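The four operations above can be sketched in Java with the bulk methods of the Set interface (addAll, retainAll, removeAll). This is one possible way to write them, not the only one: the class SetOps and its method names are invented for this example, and each method works on a copy so that the inputs are left alone.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SetOps {
    // Every method copies its first argument, so neither input set is modified.
    static <T> Set<T> union(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.addAll(b);    // keep everything from a, plus everything from b
        return result;
    }

    static <T> Set<T> intersection(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.retainAll(b); // keep only the elements that are also in b
        return result;
    }

    static <T> Set<T> difference(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.removeAll(b); // drop the elements that are in b
        return result;
    }

    static <T> Set<T> symmetricDifference(Set<T> a, Set<T> b) {
        // everything in the union that is not in the intersection
        return difference(union(a, b), intersection(a, b));
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 3, -6));
        Set<Integer> c = new HashSet<>(Arrays.asList(9, 3, 4));

        // The order in which elements are printed may vary; a set defines no order.
        System.out.println(union(a, c));               // the elements 1, 3, -6, 9, 4
        System.out.println(intersection(a, c));        // the element 3
        System.out.println(difference(a, c));          // the elements 1, -6
        System.out.println(difference(c, a));          // the elements 9, 4
        System.out.println(symmetricDifference(a, c)); // the elements 1, -6, 9, 4
    }
}
```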

In-class exercise

Sets in Java

How does Java represent a set? As an abstract data type, specified by the Set interface. First we’ll talk about the properties and assumptions we might expect from a Set, in the abstract. Then we’ll talk about two concrete implementations of the data type provided by the Java API and see how they work.

  • sets, like lists, are unbounded; that is, they don’t have a fixed size
  • duplicate elements are not allowed (only new elements are added; attempts to re-add existing elements are ignored)
  • sets, like lists, can contain a null element (I hope you like NullPointerExceptions! though note some implementations might forbid null elements)
  • sets support an add (and an addAll) operation, which can modify the current set
  • sets support a remove operation of a specific value
  • sets support a size operation to determine how many elements are currently in the set
  • sets support a contains (and a containsAll) operation to check membership
  • and more, but we’ll get to them later when we look at the full API that Java supplies.
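A few of these properties can be seen directly in a short sketch. One detail worth knowing is that add and remove return a boolean saying whether the set actually changed. The class BasicSetOperations and the helper countDistinct are illustrative names only, and the example assumes HashSet, which is one implementation that permits null.

```java
import java.util.HashSet;
import java.util.Set;

class BasicSetOperations {
    // Counts the distinct values by checking how many add calls actually changed the set.
    static int countDistinct(String... values) {
        Set<String> seen = new HashSet<>();
        int distinct = 0;
        for (String v : values) {
            if (seen.add(v)) { // add returns true only when v was not already present
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        Set<String> s = new HashSet<>();

        System.out.println(s.add("a"));       // true: "a" was new, so the set changed
        System.out.println(s.add("a"));       // false: duplicate, so the set is unchanged
        System.out.println(s.size());         // 1

        System.out.println(s.add(null));      // true: HashSet allows one null element
        System.out.println(s.contains(null)); // true
        System.out.println(s.size());         // 2

        System.out.println(s.remove("a"));    // true: "a" was present and is now gone
        System.out.println(s.remove("zzz"));  // false: there was nothing to remove
        System.out.println(s.size());         // 1 (only null is left)
    }
}
```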

Let’s take a look at the interface: http://docs.oracle.com/javase/8//docs/api/java/util/Set.html

Not too different from List, though you’ll note some things (like remove at an index, or get) are not present, as those operations don’t make sense in the context of sets.

Pay special attention to a few things:

“…sets contain no pair of elements e1 and e2 such that e1.equals(e2)” – the equals method is very important to sets, and if you stick in objects whose class doesn’t override equals, they’ll use Object’s equals method, which only considers an object equal to itself (the same comparison as ==). Make sure that’s what you want if so.
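To make the point about equals concrete, here is a sketch with a hypothetical class Name that does not override equals (or hashCode), compared with String, which does. All of the names in EqualsMatters are invented for this example.

```java
import java.util.HashSet;
import java.util.Set;

class EqualsMatters {
    // Hypothetical class that does NOT override equals or hashCode,
    // so it inherits the identity-based versions from Object.
    static class Name {
        final String value;

        Name(String value) {
            this.value = value;
        }
    }

    // Adds two separately constructed Names with the same text and reports the set's size.
    static int sizeAfterAddingTwoNames() {
        Set<Name> names = new HashSet<>();
        names.add(new Name("foo"));
        names.add(new Name("foo")); // a different object, so Object's equals says "not equal"
        return names.size();
    }

    // Does the same thing with Strings, whose equals compares the characters.
    static int sizeAfterAddingTwoStrings() {
        Set<String> strings = new HashSet<>();
        strings.add(new String("foo"));
        strings.add(new String("foo")); // equal to the first one, so it is ignored
        return strings.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterAddingTwoNames());   // 2: the set sees two distinct elements
        System.out.println(sizeAfterAddingTwoStrings()); // 1: the second "foo" is a duplicate
    }
}
```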

Also note that “great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set.” In other words, if you have a setter that changes an instance variable in an object, and that instance variable is considered by the object’s equals method, Set will have undefined (read: bad) behavior.

So putting immutable things into sets is OK. Like Integers or Strings. Putting arbitrary objects that can be changed is not so good. Putting things that can be changed, but that you won’t change is OK but dangerous – what if you accidentally do end up changing the object? The Set will almost certainly misbehave in a weird way.
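Here is a sketch of the kind of misbehavior to expect. It uses an ArrayList as the mutable element, since a list’s equals and hashCode depend on its contents. The behavior is officially unspecified, so treat the result as what a HashSet on a typical JDK does rather than a guarantee; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MutableElementDemo {
    // Adds a mutable list to a HashSet, changes the list afterwards,
    // and then asks the set whether it still contains that very same object.
    static boolean foundAfterMutation() {
        Set<List<Integer>> set = new HashSet<>();
        List<Integer> element = new ArrayList<>();
        element.add(1);
        element.add(2);
        set.add(element);

        element.add(3); // changes what equals and hashCode report for this element

        // The set filed the element under its old hash code, so a lookup
        // using the new hash code typically fails to find it.
        return set.contains(element);
    }

    public static void main(String[] args) {
        // Typically prints false, even though the object was never removed from the set.
        System.out.println(foundAfterMutation());
    }
}
```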

Other than those two restrictions, you can use Sets almost like Lists. Let’s do some examples:

import java.util.HashSet;
import java.util.Set;

Set<Integer> s = new HashSet<Integer>();

s.add(1);
s.add(2);
System.out.println(s); // like lists, you can print them and their contents are printed

Set<Integer> t = new HashSet<Integer>();

t.add(2);
t.add(3);
t.add(4);

for (Integer i : t) {
    System.out.println(i); // like lists, you can iterate over them
}

s.addAll(t); // all elements in t are added to s; t is unchanged but s is not!
System.out.println(s);
System.out.println(t);

s.removeAll(t); // all elements in t are removed from s, as above
System.out.println(s);
System.out.println(t);

And you generally do want to use Sets when the set properties (of uniqueness and lack-of-intrinsic-order) apply to your data set, especially if your data set is going to be large.

Why? (you might ask.) Because sets generally have much, much better performance than lists for containment-testing and for removing a particular value, and insertion stays fast even though the set has to check for duplicates. How? (you might ask.) Well, now we have to talk a little about how the two most common implementations of Sets work: HashSets and TreeSets.

On HashSets

One possible implementation of the set is the HashSet, which depends upon a correct hashCode method. Why? “This class implements the Set interface, backed by a hash table (actually a HashMap instance).” Now let’s look at the documentation for hashCode: “This method is supported for the benefit of hash tables such as those provided by HashMap.”

Wow, hash tables are so important that every object in Java must supply a hashCode method – it’s built into Object.

hashCode returns an integer, and must obey the contract in its documentation. Let’s look at each piece:

  • It provides something like (but not exactly like!) equality: If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

This implies that if you use a field in an equals method, you should also use it in the hashCode method.

  • It is consistent: Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified.
  • It is not an equality check, though: It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

So you could have a hashCode method that always returned the same integer, like 1, and it would technically obey the contract. But usually, the hashCode of objects is not 1, but instead a large integer. Going back to our old example of (not) aliasing:

String x = new String("foo");
String y = new String("foo");
System.out.println(x == y);
System.out.println(x.equals(y));
System.out.println(x.hashCode());
System.out.println(y.hashCode());

Do you expect them to be ==? No. Do you expect them to be equals? Yes. Do you expect them to have the same hash code? Yes, because of the first property above.
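The same three questions can be asked of a class you write yourself. Below is a sketch of a hypothetical Point class whose equals and hashCode are computed from the same two fields, which is one straightforward way to satisfy the contract. Objects.hash is a standard java.util helper; everything else here is an invented name.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class HashCodeContract {
    // Hypothetical value class: equals and hashCode both use x and y, and nothing else.
    static final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof Point)) return false;
            Point p = (Point) other;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(x, y); // exactly the fields that equals compares
        }
    }

    public static void main(String[] args) {
        Point a = new Point(1, 2);
        Point b = new Point(1, 2);

        System.out.println(a == b);                       // false: two separate objects
        System.out.println(a.equals(b));                  // true: same x and y
        System.out.println(a.hashCode() == b.hashCode()); // true: required because they are equal

        Set<Point> points = new HashSet<>();
        points.add(a);
        points.add(b);
        System.out.println(points.size());                // 1: b is treated as a duplicate of a
    }
}
```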

Why does this weird integer result in fast (“constant time”) lookups?

Because you can use it as an index into an array.

In short, “hash tables” are arrays that store objects based upon their “hash code”. If you want to put an element into the array, you figure out the right place to put it by turning its hash code into an index (for example, by taking the remainder when the hash code is divided by the length of the array). And if you want to see if an element is in the array, you compute its hash code, then jump to the right spot in the array.

In a perfect world, the array would be big enough to hold everything, and the hash codes would always be unique per-object, and this would all just work. In practice, sometimes there are collisions – more than one object ends up in the same spot in the array. We resolve these collisions in different ways (one way: each element of the array might be a short linked list of elements with the hash code corresponding to that element’s index), and things usually work out with near-constant-time performance.
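To make the idea concrete, here is a deliberately tiny hash table that resolves collisions with a linked list per array slot, as described above. It is a teaching sketch under simplifying assumptions (a fixed-size array that never grows, no support for null, no remove), not how java.util.HashSet is actually written. TinyHashSet and its members are made-up names.

```java
import java.util.LinkedList;

class TinyHashSet<T> {
    // Each array slot ("bucket") holds the elements whose hash codes map to that index.
    private final LinkedList<T>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    TinyHashSet(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // Turn a hash code (which may be negative) into a valid array index.
    private int indexFor(Object value) {
        return Math.floorMod(value.hashCode(), buckets.length);
    }

    // Jump straight to the one bucket the value could be in, then search only that bucket.
    boolean contains(T value) {
        return buckets[indexFor(value)].contains(value);
    }

    // Returns true if the value was added, false if an equal value was already present.
    boolean add(T value) {
        LinkedList<T> bucket = buckets[indexFor(value)];
        if (bucket.contains(value)) {
            return false;
        }
        bucket.add(value);
        size++;
        return true;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        TinyHashSet<String> set = new TinyHashSet<>(4);
        System.out.println(set.add("foo"));             // true
        System.out.println(set.add(new String("foo"))); // false: equal to the first "foo"
        System.out.println(set.add("bar"));             // true
        System.out.println(set.contains("bar"));        // true
        System.out.println(set.contains("baz"));        // false
        System.out.println(set.size());                 // 2
    }
}
```

With only four buckets, collisions are guaranteed once there are more than four elements. A real implementation grows the array as elements are added so that the lists stay short.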

Of course, if we have a hash table that’s too small, or a hash code method that doesn’t, in the words of the Java doc, “as much as is reasonably practical, … return distinct integers for distinct objects”, then everything ends up in just one or two lists and performance is bad.