10: Introducing Sets
Welcome
Announcements
First-years take note: Next Tuesday (October 9th) is a “UMass Monday,” and a Monday class schedule will be followed. Notably, you’ll have discussion (and a quiz in discussion!).
Quiz Tuesday; anything through today’s lecture is fair game.
Plagiarism workshop! 6pm Tuesday Oct 23rd in CS 150/151.
Note-taker! DS is once again looking for a note-taker for this class. See Piazza.
Sets: an introduction
Today we’re going to move on in our list of top-n abstract data types from the List to the Set. In order to give you some grounding in what a set is and how it differs from a list, we’re going to turn to an arcane and little-known subject: Mathematics. To be clear, we’re going to do a very gentle introduction to set theory; if you have already taken a discrete math course this will be review for you, and if you stay in CS, you’ll see this again in a lot more detail in COMPSCI 250.
Simply put: a set is a collection of distinct objects.
The objects can be anything: people, numbers, shapes, colors, (or perhaps most topically, instances of Java objects). These objects are generally referred to as members or elements of a set.
In set theory, sets are named by a single uppercase letter: A or B, for example.
For our purposes, we’ll usually write sets as a list of the elements. The list will be comma separated, and will be enclosed in curly braces. For example: A = {1, 3, -6} describes a set called “A” that has three integers as elements.
Sets contain a collection of unique items. That is, sets cannot have duplicate items.
While the items in a set might have an implicit, natural order (like the integers), the set itself doesn’t define an order. So {1, 3, -6} = {-6, 1, 3}, that is, they’re the same set. Order doesn’t matter when comparing sets. (This is very different from our intuition with lists, where different orders do generally matter.)
There are a few bits of notation I want you to have seen, so now I’m going to write them down for you.
First, how do we say a set contains an element? There’s a symbol that looks like this: ∈ (kind of a funny “E”) which is used to denote set membership. For example, “3 ∈ A” is pronounced “3 is an element of A.” You can think of the funny E as standing for “Element of” to help remember it. Likewise, ∉ means “not an element of” or “does not contain”, as in “10 ∉ A”. How do we say a set is empty? We call it the empty set and write it as ∅ (or {}).
Next, sometimes we might want to talk about one set being “contained within another”. For example, if B = {1, 3, -6, 10}, we might say A is a “subset” of B, because every element of A is also an element of B. This is written “A ⊆ B”, using the set containment symbol, which looks kinda like a curvy less-than-or-equal-to (or greater-than-or-equal-to) and “opens toward” the bigger set.
Sets are sometimes represented abstractly as “Venn diagrams”. Here’s what the above two sets might look like as a Venn diagram: (on board).
Finally, there are a few operations on sets you should know about.
First is “union”, written as a little ∪. The union of two sets contains all their elements. So if we have A = {1, 3, -6} and C = {9, 3, 4}, then A ∪ C = {1, 3, -6, 9, 4}; note that 3 appears only once, even though it is in both sets. (Also show with Venn diagram.) As a mnemonic, think of the “United States” as a union of many things into one.
Next is the “intersection”. The intersection of two sets contains only the elements they have in common and is written as an upside-down u: ∩. Continuing our example from above, A ∩ C = {3}. (Also show Venn diagram). Think of the “intersection of two roads”: the intersection is just the part they share, not all of both roads.
Two more operations (I promise) then we’re done with math and notation. First is set difference, sometimes called “relative complement” or “set-theoretic difference.” It’s written with a backslash \ and refers to all the elements in one set that aren’t in another. For example, A \ C = {1, -6}. (Venn diagram). Note that in set difference, which set you write first matters (unlike union and intersection). Finally, there’s symmetric difference of two sets, which is all the things in the union that aren’t in the intersection. It can be written as Δ. A Δ C = {1, -6, 9, 4}.
Sets in Java
How does Java represent a set? As an abstract data type, specified by the Set interface. First we’ll talk about the properties and assumptions we might expect from a Set, in the abstract. Then we’ll talk about two concrete implementations of the data type provided by the Java API and see how they work.
- sets, like lists, are unbounded; that is, they don’t have a fixed size
- duplicate elements are not allowed (only new elements are added; attempts to re-add existing elements are ignored)
- sets are unordered (usually – though there is a subtype, SortedSet, that keeps its elements in sorted order)
- sets, like lists, can contain a null element (I hope you like NullPointerExceptions! though note some implementations might forbid null elements)
- sets support an add (and an addAll) operation, which can modify the current set
- sets support a remove operation of a specific value
- sets support a size operation to determine how many elements are currently in the set
- sets support a contains (and a containsAll) operation to check membership
- and more, but we’ll get to them later when we look at the full API that Java supplies.
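To make the properties above concrete, here is a minimal sketch using HashSet (one of the implementations provided by the Java API). The class name SetBasics and the method demo are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class SetBasics {
    // Builds a small set while exercising add, size, contains, and remove.
    static Set<String> demo() {
        Set<String> colors = new HashSet<>();
        colors.add("red");
        colors.add("green");

        // Re-adding an existing element is ignored; add reports this by returning false.
        boolean addedAgain = colors.add("red");
        System.out.println(addedAgain);               // false
        System.out.println(colors.size());            // 2

        System.out.println(colors.contains("green")); // true
        colors.remove("green");                       // remove by value, not by index
        System.out.println(colors.contains("green")); // false
        return colors;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [red]
    }
}
```

Note that add returns a boolean, so you can tell whether the element was actually new without calling contains first.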
Let’s take a look at the interface: http://docs.oracle.com/javase/8//docs/api/java/util/Set.html
Not too different from List, though you’ll note some things (like remove at an index, or get) are not present, as those operations don’t make sense in the context of sets – they’re unordered, so there is no index!
Pay special attention to a few things:
“…sets contain no pair of elements e1 and e2 such that e1.equals(e2)” – the equals method is very important to sets, and if you stick in objects whose class doesn’t define its own equals method, they’ll use Object’s equals method, which only treats an object as equal to itself (the same check as ==). Make sure that’s what you want if so.
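Here is a small sketch of what relying on Object’s equals can look like in practice. The Point class is hypothetical; it deliberately does not override equals (or hashCode), so two points with the same coordinates are still treated as different elements.

```java
import java.util.HashSet;
import java.util.Set;

public class EqualsDemo {
    // Hypothetical class that does NOT override equals (or hashCode),
    // so it inherits Object's identity-based versions.
    static class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Adds two points with identical coordinates and reports the set's size.
    static int countDistinct() {
        Set<Point> points = new HashSet<>();
        points.add(new Point(1, 2));
        points.add(new Point(1, 2)); // same coordinates, but a different object
        return points.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct()); // 2
    }
}
```

If you intended the two points to count as one element, the class would need its own equals (and a matching hashCode).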
Also note that “great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set.” In other words, if you have a setter that changes an instance variable in an object, and that instance variable is considered by the object’s equals method, Set will have undefined (read: bad) behavior.
So putting relatively immutable things into sets is OK. Like Integers or Strings. Putting arbitrary objects that can be changed is not so good. Putting things that can be changed, but that you won’t change is OK but dangerous – what if you accidentally do end up changing the object? The Set will almost certainly misbehave in a weird way.
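As an illustration of that kind of misbehavior, here is a sketch with a hypothetical mutable Counter class whose equals and hashCode both depend on a field. The documentation only says the behavior is unspecified; what is shown here is what OpenJDK’s HashSet happens to do.

```java
import java.util.HashSet;
import java.util.Set;

public class MutableElementDemo {
    // Hypothetical mutable class: equals and hashCode both depend on value.
    static class Counter {
        int value;

        Counter(int value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Counter && ((Counter) o).value == value;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(value);
        }
    }

    // Adds a Counter, mutates it while it is in the set, then looks it up again.
    static boolean stillFound() {
        Set<Counter> set = new HashSet<>();
        Counter c = new Counter(1);
        set.add(c);
        c.value = 2; // changes the result of equals and hashCode while c is in the set
        return set.contains(c);
    }

    public static void main(String[] args) {
        // false with OpenJDK's HashSet: the set "loses" an object it still holds.
        System.out.println(stillFound());
    }
}
```

The object is still physically inside the set (its size is still 1), but the set can no longer find it.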
Other than those two restrictions, you can use Sets almost like Lists. Let’s do some examples:
Set<Integer> s = new HashSet<Integer>();
s.add(1);
s.add(2);
System.out.println(s); // like lists, you can print them and their contents are printed
Set<Integer> t = new HashSet<Integer>();
t.add(2);
t.add(3);
t.add(4);
for (Integer i : t) {
    System.out.println(i); // like lists, you can iterate over them
}
s.addAll(t); // all elements in t are added to s; t is unchanged but s is not!
System.out.println(s);
System.out.println(t);
s.removeAll(t); // all elements in t are removed from s, as above
System.out.println(s);
System.out.println(t);
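The operations from the math section map fairly directly onto Set methods: addAll behaves like union, retainAll like intersection, removeAll like set difference, and containsAll like a subset test. Since those methods modify the set they are called on, the sketch below copies the first set before operating on it. The class and helper names (SetOperations, union, intersection, and so on) are made up for illustration and are not part of the Java API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SetOperations {
    // A ∪ B: everything in either set.
    static <T> Set<T> union(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a); // copy, so a itself is left unchanged
        result.addAll(b);
        return result;
    }

    // A ∩ B: only the elements the two sets share.
    static <T> Set<T> intersection(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }

    // A \ B: elements of a that are not in b (argument order matters).
    static <T> Set<T> difference(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    // A Δ B: everything in the union that is not in the intersection.
    static <T> Set<T> symmetricDifference(Set<T> a, Set<T> b) {
        return difference(union(a, b), intersection(a, b));
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 3, -6));
        Set<Integer> b = new HashSet<>(Arrays.asList(1, 3, -6, 10));
        Set<Integer> c = new HashSet<>(Arrays.asList(9, 3, 4));

        System.out.println(union(a, c));               // 1, 3, -6, 9, 4 in some order
        System.out.println(intersection(a, c));        // [3]
        System.out.println(difference(a, c));          // 1 and -6 in some order
        System.out.println(symmetricDifference(a, c)); // 1, -6, 9, 4 in some order
        System.out.println(b.containsAll(a));          // true: A ⊆ B
    }
}
```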
And you generally do want to use Sets when the set properties (of uniqueness and lack-of-intrinsic-order) apply to your data set, especially if your data set is going to be large.
Why? (you might ask.) Because sets have much, much better general performance for insertion, removal, and containment-testing than lists. How? (you might ask.) Well, now we have to talk a little about how the two most common implementations of Sets work: HashSets and TreeSets.
On HashSets
One possible implementation of the set is the HashSet, which depends upon a correct hashCode method. Why? “This class implements the Set interface, backed by a hash table (actually a HashMap instance).” Now let’s look at the documentation for hashCode: “This method is supported for the benefit of hash tables such as those provided by HashMap.”
Wow, hash tables are so important that every object in Java must supply a hashCode method – it’s built into Object.
hashCode returns an integer, and must obey the contract in its documentation. Let’s look at each piece:
- It provides something like (but not exactly like!) equality: If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
This implies that hashCode must never depend on a field that equals ignores; in practice, compute the hash code from the same fields you compare in the equals method.
- It is consistent: Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified.
- It is not an equality check, though: It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
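One common way to satisfy the contract is to compute the hash code from exactly the fields that equals compares. The sketch below does this for a hypothetical immutable Point class, using java.util.Objects.hash; it is one reasonable approach, not the only one.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class HashCodeContractDemo {
    // Hypothetical immutable class whose equals and hashCode use the same fields.
    static final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point other = (Point) o;
            return x == other.x && y == other.y;
        }

        @Override
        public int hashCode() {
            // Equal points have equal x and y, so they get equal hash codes.
            return Objects.hash(x, y);
        }
    }

    // Adds two equal points and reports how many elements the set kept.
    static int countDistinct() {
        Set<Point> points = new HashSet<>();
        points.add(new Point(1, 2));
        points.add(new Point(1, 2)); // equal to the first, so it is ignored
        return points.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct()); // 1
    }
}
```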
Because unequal objects are allowed to share a hash code, you could have a hashCode method that always returned the same integer, like 1, and it would technically obey the contract. But usually, the hashCode of objects is not 1, but instead a large integer. Going back to our old example of (not) aliasing:
String x = new String("foo");
String y = new String("foo");
System.out.println(x == y);
System.out.println(x.equals(y));
System.out.println(x.hashCode());
System.out.println(y.hashCode());
Do you expect them to be ==? No. Do you expect them to be equals? Yes. Do you expect them to have the same hash code? Yes, because of the first property above.
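The reverse does not hold: two unequal objects may share a hash code. A small, well-known instance is the pair of strings "Aa" and "BB", which collide under the formula documented for String.hashCode. The class name StringHashDemo and the helper sameHash are made up for illustration.

```java
public class StringHashDemo {
    // Reports whether two strings have the same hash code.
    static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        String a = "Aa";
        String b = "BB";
        System.out.println(a.equals(b));    // false
        System.out.println(a.hashCode());   // 2112  (65 * 31 + 97)
        System.out.println(b.hashCode());   // 2112  (66 * 31 + 66)
        System.out.println(sameHash(a, b)); // true
    }
}
```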
Why does this weird integer result in fast (“constant time”) lookups?
Because you can use it as an index into an array.
In short, “hash tables” are arrays that store objects based upon their “hash code”. If you want to put an element into the array, you figure out the right place to put it by checking its hash code. And if you want to see if an element is in the array, you look up its hash code, then jump to the right spot in the array.
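Here is a deliberately simplified sketch of turning a hash code into an array index. It assumes a fixed table of 16 slots and ignores collisions entirely; the name indexFor is made up, and real hash tables such as HashMap differ in the details.

```java
public class BucketIndexDemo {
    // Maps an object's hash code to a slot in an array of the given length.
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int indexFor(Object o, int capacity) {
        return Math.floorMod(o.hashCode(), capacity);
    }

    public static void main(String[] args) {
        String[] table = new String[16];

        // Storing: compute the slot from the hash code and put the element there.
        String s = "foo";
        table[indexFor(s, table.length)] = s;

        // Looking up: compute the slot again and jump straight to it, with no scan.
        String probe = new String("foo");
        int slot = indexFor(probe, table.length);
        System.out.println(slot);                      // 6
        System.out.println(probe.equals(table[slot])); // true
    }
}
```

The lookup never walks the array, which is where the “constant time” comes from: the cost does not grow with the number of elements stored.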
In a perfect world, the array would be big enough to hold everything, and the hash codes would always be unique per-object, and this would all just work. In practice, sometimes there are collisions – more than one object ends up in the same spot in the array. We resolve these collisions in different ways (one way: each element of the array might be a short linked list of elements with the hash code corresponding to that element’s index), and things usually work out with near-constant-time performance.
Of course, if we have a hash table that’s too small, or a hash code method that doesn’t, in the words of the Java doc, “as much as is reasonably practical, … return distinct integers for distinct objects”, then everything ends up in just one or two lists and performance is bad.
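To see both the collision handling and the failure mode, here is a toy hash set that keeps a linked list per array slot. It is a sketch for illustration only, not how HashSet is actually written: it never resizes, does not accept null, and the names ToyHashSet, longestChain, and BadKey are all made up.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ToyHashSet<T> {
    // One linked list ("chain") per slot; colliding elements share a chain.
    private final List<LinkedList<T>> buckets = new ArrayList<>();

    public ToyHashSet(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    private LinkedList<T> bucketFor(Object o) {
        return buckets.get(Math.floorMod(o.hashCode(), buckets.size()));
    }

    // Adds item unless an equal element is already in its chain.
    public boolean add(T item) {
        LinkedList<T> bucket = bucketFor(item);
        if (bucket.contains(item)) {
            return false;
        }
        bucket.add(item);
        return true;
    }

    // Only the one chain selected by the hash code is searched.
    public boolean contains(Object o) {
        return bucketFor(o).contains(o);
    }

    // Length of the longest chain: a rough measure of worst-case lookup cost.
    public int longestChain() {
        int longest = 0;
        for (LinkedList<T> bucket : buckets) {
            longest = Math.max(longest, bucket.size());
        }
        return longest;
    }

    // Hypothetical key with a constant hashCode: legal, but every instance collides.
    public static final class BadKey {
        final int id;

        public BadKey(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 1;
        }
    }

    public static void main(String[] args) {
        ToyHashSet<Integer> good = new ToyHashSet<>(16);
        ToyHashSet<BadKey> bad = new ToyHashSet<>(16);
        for (int i = 0; i < 16; i++) {
            good.add(i);
            bad.add(new BadKey(i));
        }
        System.out.println(good.longestChain()); // 1: elements spread across slots
        System.out.println(bad.longestChain());  // 16: everything in one list
    }
}
```

With the constant hash code, every lookup degenerates into a linear scan of a single list, which is exactly the list performance we were trying to avoid.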