Week 7: Multimaps; introduction to algorithms

Announcements

You almost certainly should not use Maps for the current programming assignment. You just need Sets and Lists; in fact, Lists alone would work, except that they’re too slow for large inputs, so you need Sets for their faster operations.

Another worked example: Maps and anagrams

We’re going to do another worked example. In particular, we’re going to write an anagram finder, that is, a program that, given a word list, looks for words in that list that are anagrams of one another. An anagram of a word is a rearrangement of the letters in the word that results in a new (different) word, using each of the letters from the original word exactly once. I’ll go start-to-finish again today, building the project in Code.

Program sketch

What should our program do?

First, how might we recognize anagrams? For example, we could write a method to count the occurrences of each letter in a word, then store that in an array (or object), then check those arrays/objects for equals-style equality. And then we could perhaps write a hashCode method for those objects, etc.

You could do that, and it would work. But for purposes of illustration, I’m going to take advantage of a well-known trick: if you sort the letters of two words and compare the sorted results, they’ll be equal iff the words are anagrams of one another. We can treat the sorted letters as identifying the group of anagrams of which this word is a member.

Here’s an example. Consider the words “cat”, “act”, and “tac”, all of which are anagrams of one another. If you sort their letters, you get “act”, regardless of the word you started with. Here, “act” happens to be one of the words, but that’s not necessarily the case. For example, “apple” has a sorted-letter identifier of “aelpp”, which is not a word in the English language (as far as I know).

The idea then is for each word, find its identifier, and add it to the set of anagrams corresponding to the identifier. Note that the identifier may or may not be one of the words (“cat” vs “apple”).

So now we’ve got the core of the algorithm. How can we group words into clusters of anagrams? I’m going to suggest a Map<String, List<String>>. The keys are going to be the sorted-letter version of the words, and the values will be lists of all the words that are associated with this sorted-letter version of the word.

(on board)

This structure, where a key is associated with many values, is sometimes called a “Multimap”; Java doesn’t directly support multimaps, but associating a value with a collection type is an ad hoc version of this.
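For the words from our example, the map would end up containing something like this (sorted-letter key on the left, the words that share it on the right):

  "act"   -> cat, act, tac
  "aelpp" -> apple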

So what are we going to do? Something like the following: for each word, compute its alphabetized (sorted-letter) key; look that key up in the map; and add the word to the associated collection, creating the collection first if it isn’t there yet.

Then, we can write methods to query the multimap. Let’s get started.

Coding up AnagramFinder

First, the instance variable that represents the map:

public class AnagramFinder {
	private final Map<String, List<String>> anagrams;

	public AnagramFinder() {
		anagrams = new HashMap<String, List<String>>();
	}
}

Next, the add method, to add a word to the AnagramFinder. What should add do?

It should look up the word in the map, and add it to the associated list. What if there is no list? It should make a new one and insert it into the map.

To look up the word, we’ll need a method to return the letter-by-letter alphabetization of a String. There are several ways to do this. Here’s one:

private static String alphabetized(String word) {
  // note: uses java.util.Arrays
  char[] a = word.toCharArray();
  Arrays.sort(a);
  return new String(a);
}
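For example, alphabetized("cat") and alphabetized("tac") both return "act", and alphabetized("apple") returns "aelpp", matching the identifiers from earlier.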

Why is this method static? It does not depend upon the instance in any way, so there is no need to make it an instance method. If it does later change to be part of the instance (that is, if we attempt to call an instance method from it, or use an instance variable from it), the type checker will alert us. Further, static methods could be moved (or copied) easily to another class if appropriate – this is part of a process called “refactoring”.

OK, back to add. It has to handle two cases: when the alphabetized version of the word is in the multimap already, and when it’s not.

(on board)

There are a couple of different ways you could write this. For example, you could handle the two cases completely separately:

public void add(String word) {
  String key = alphabetized(word);
  if (!anagrams.containsKey(key)) {
    List<String> l = new ArrayList<String>();
    l.add(word);
    anagrams.put(key, l);
  }
  else {
    List<String> l = anagrams.get(key);
    l.add(word);			
  }
}

Or you could deal with the not-in-map problem first, and unify things otherwise:

public void add(String word) {
  String key = alphabetized(word);
  if (!anagrams.containsKey(key)) {
    anagrams.put(key, new ArrayList<>());
  }
  anagrams.get(key).add(word);	
}

I find them both fairly readable, but things being otherwise equal, I will generally choose the shorter solution.
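One more variant worth knowing about: Java’s Map interface (since Java 8) provides computeIfAbsent, which does the make-it-if-it’s-missing step for you. A sketch of add using it:

public void add(String word) {
  // inserts a new empty list for this key only if one isn’t there yet,
  // then returns the list so we can add the word to it
  anagrams.computeIfAbsent(alphabetized(word), k -> new ArrayList<>()).add(word);
}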

Let’s write some code in our main method to test this out.

public static void main(String[] args) {
  AnagramFinder af = new AnagramFinder();
  af.add("bird");
  af.add("drib");

  af.add("and");
  af.add("nad");
  af.add("dan");

  af.add("it");}

OK, but we forgot to write methods to get anything out of the AnagramFinder! Let’s do so now.

There’s no real reason for this multimap to store Lists as values, though; we don’t care about the order of the words, and we don’t want duplicates, so really it should store a Set<String>. Let’s do that now, and see how Code and Java’s type system show us what needs to be changed.

(demo refactoring from List to Set)
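For reference, here’s a sketch of roughly what the relevant pieces look like after the switch (the compiler will flag each place that still mentions List):

private final Map<String, Set<String>> anagrams;

public AnagramFinder() {
  anagrams = new HashMap<>();
}

public void add(String word) {
  String key = alphabetized(word);
  if (!anagrams.containsKey(key)) {
    anagrams.put(key, new HashSet<>());
  }
  anagrams.get(key).add(word);
}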

Finding the anagrams of a word

Let’s write a method that returns the anagrams of a given word, or an empty set if there are no such anagrams.

public Set<String> anagramsOf(String word) {
  return // what?
}

anagrams.getOrDefault(alphabetized(word), new HashSet<>()) – we can’t just use get(), because we don’t want to accidentally return null. In general, you don’t want to return null because callers won’t expect it and you’ll cause NullPointerExceptions.
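Putting that together, the whole method is just:

public Set<String> anagramsOf(String word) {
  // returns the stored set if present, otherwise a new empty set (never null)
  return anagrams.getOrDefault(alphabetized(word), new HashSet<>());
}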

Does it work?

System.out.println(af.anagramsOf("and"));
System.out.println(af.anagramsOf("it"));
System.out.println(af.anagramsOf("boo"));

Reading words from a file

Let’s add the ability to read from a file:

public void addFromFile(File f) throws FileNotFoundException {
    Scanner s = new Scanner(f);
    while (s.hasNext()) {
        add(s.next());
    }
    s.close();
}

(Note there are lots of ways to read from files, this is just one. It’s what’s done in AP CS last I checked, but you get the idea.)
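One small improvement you might make, sketched here: a try-with-resources block closes the Scanner for us even if something goes wrong partway through, so we can drop the explicit close() call.

public void addFromFile(File f) throws FileNotFoundException {
    // the Scanner is closed automatically when the try block exits
    try (Scanner s = new Scanner(f)) {
        while (s.hasNext()) {
            add(s.next());
        }
    }
}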

Finding the most-anagram-words

Now let’s write a method to find the word(s) with the most anagrams. There could be more than one such set, but let’s just return any set with the most anagrams (or an empty set if there are none yet). You’ll probably want to use the map’s values() method.

public Set<String> mostAnagrams() {
    int longest = -1;
    Set<String> mostAnagrams = new HashSet<>();
    for (Set<String> set : anagrams.values()) {
        if (set.size() > longest) {
            longest = set.size();
            mostAnagrams = set;
        }
    }
    return mostAnagrams;
}

Now let’s test it:

af.addFromFile(new File("/usr/share/dict/words")); // a common file on UNIX-like OSes
System.out.println(af.mostAnagrams());

Something to think about: What if we want to find the set of all sets of anagrams of maximum length? You should be able to do this on your own. A sketch: keep track of the largest size seen so far, along with a collection of the anagram sets having that size; when you see a strictly larger set, reset the collection to contain just it; when you see a set that ties the current maximum, add it. (Or make two passes: first find the maximum size, then collect every set of that size.)

I’ll leave the actual implementation to you. I strongly suggest you take the time to do it, just for the practice! (If you do it correctly on my wordlist, you get three sets of anagrams. They are: [[terse, tsere, ester, reest, reset, stree, steer, estre, stere], [creat, trace, caret, cater, carte, recta, react, crate, creta], [organ, groan, orang, angor, grano, goran, argon, nagor, rogan]]).

Once more, double time, in Python

Python is another programming language. It’s pretty popular in some circles, and for good reason: it makes writing easy programs ridiculously easy. There are various (somewhat hidden) costs involved, though. For example, the lack of a static type system makes it harder to build large programs correctly. A good rule of thumb is that if the program is going to be more than a couple hundred lines or so, you probably shouldn’t use Python. (Though you can do a lot in 200 lines of Python!)

Unfortunately, perhaps, this gives students the impression that Python and similar languages are “the best” since most of what they’re asked to program in school fits this description – but that’s not generally the case for the real world, where programs are large, stick around for years, and are maintained by many different people (which is basically the worst case for Python).

And Python’s bytecode is not terribly amenable to JIT compilation, so Python is notoriously “slow”. Slow is relative, of course, but if you have a large, computationally-intensive job, plain-old-python without native extensions is not always the right choice.

Anyway, I’m going to talk more about Python near the end of the semester, but for now, let me very quickly demo a translation of the program we wrote above into Python.

The code

def alphabetized(word):
    return ''.join(sorted(word))

def add(anagrams, word):
    key = alphabetized(word)
    # if key not in anagrams:
    #     anagrams[key] = [word]
    # else:
    #     anagrams[key].append(word)

    # or shorter
    anagrams.setdefault(key, []).append(word)

def anagrams_of(anagrams, word):
    return anagrams.get(alphabetized(word), [])

def add_from_file(anagrams, f):
    for line in f:
        add(anagrams, line.strip())

def most_anagrams(anagrams):
    n = -1
    most = []
    for l in anagrams.values():
        if len(l) > n:
            n = len(l)
            most = l
    return most

def main():
    a = {}
    with open('/usr/share/dict/words') as f:
        add_from_file(a, f)
    print(most_anagrams(a))

if __name__ == '__main__':
    main()

What we’ve done, and what to do

So far, we’ve reviewed 121 material (including basic control flow, conditionals, expressions, statements, arrays, objects and classes, scope, and references).

We’ve introduced several foundational ADTs (lists, sets, and maps) and covered their properties. We’ve also seen their implementations in the Java API.

We’ve seen their methods, and used them to iterate over, look up items within, and modify them.

And, guess what? For about 80% or more of the programs you’re likely to write out in the real world, this is what you need to know, at least in terms of standard data structures. Lists, sets, and maps will let you represent most problems generally, and if not, you can take 187 to learn how to define your own data structures.

There are actually two or three data types and associated implementations that 187 covers that we haven’t yet: stacks, queues, and priority queues / heaps. But they’re pretty straightforward (and we’ll get to them later this semester, at least the first two).

So what’s left? Are we done for the semester? Of course not!

First, more practice. We’re getting you ready for general programming out in the world (and for 187 in particular), so there will be more programming assignments, that may start to feel more difficult in various ways. Not all of them will involve new data structures concepts, but instead they’ll serve to give you more practice. We’ll also continue removing some of the training wheels you’ve had so far (full sets of test cases, for example) so that you can start to get ready for the 187 experience (or, you know, the real world). We’ll also do simplified versions of some previous 187 assignments to give you a running start in that class.

Second, more exposure to other topics in computer science and informatics. This course is a prerequisite for not just 187 but for various others as well. Some of our lectures and assignments will focus on things like: working with files (you’ll see CSVReader if you haven’t already in a programming assignment); simulated simple interactions over the network (with web servers or the like); text processing and data analysis (search engine stuff); and so on.

Finally, we’re going to touch upon a few more core computer science concepts in detail. In particular, we’re going to continue our study of algorithms – how we do certain tasks – and start to develop language to describe how efficient different approaches to the same problem might be. We’ll continue to focus on “toy problems” here, things like sorting lists of numbers and searching simple graphs, but the algorithms we develop and the approaches we take will be useful to you later when tackling bigger problems. (Some of this you’ll see in later assignments, I hope.)

Thinking about efficiency

So to think about how “efficient” an algorithm, or a piece of code, is, we need a way to quantify how long it takes to run. Our rule of thumb is as follows: things take either a small, constant amount of time, or an amount of time that depends upon (some aspect of) the input.

To simplify things, we say that almost all operators and keywords evaluate in a small, constant amount of time in Java: basic arithmetic, conditionals, assignment, array access, following the branch in control flow, and method invocation. So you might look at a method like:

int add(int x, int y) {
  int sum = x + y;
  return sum;
}

and say something like: well, when this method runs, first it adds x and y (1). Then it assigns to sum (1). Then it returns (1). So it takes “about” three units of time to execute.

int sub(int x, int y) {
  int diff = x - y;
  return diff;
}

About how long does this take to run?

Or you might look at:

void honkIfEven(int x) {
  if (x % 2 == 0) System.out.println("honk");
}

and say something like, well, first x % 2 is computed. Then it’s compared to zero. So the method takes at least two units. Then it might take a third to print “honk”.

Does it?

Well, that depends on the implementation of println(). To do a “real” analysis, we have to drill down into any method that’s called and check how it works, and look at methods it calls, and so on. For the purposes of this class, we’ll just state that certain methods are roughly constant time (like println), even though that’s not strictly true, in ways that will probably become clear to you as we go on.

OK, be that as it may, there’s something important to note here, which is that both of these methods take a small, fixed amount of time that doesn’t depend upon anything. Let’s look at something different:

int sum(int[] a) {
  int s = 0;
  for (int i: a) {
    s += i;
  }
  return s;
}

How long does this method take to execute? Well, about one to declare and assign 0 to s.

Then about one to update i each time through the loop, and another to update s each time through the loop.

Then one for returning s.

So what’s the answer? It depends upon the length of the array, right? It depends upon the input; in other words, it’s not a constant. Some parts of the runtime (the initial setup and the return) are constant, but some are not (the loop). Here, we might say the runtime is about 2 + 2 * (a.length). In other words, the runtime here is a function (in the mathematical sense) of the length of a. It’s proportional to the length of a.

Generally, any time you see a loop, you have the possibility of a non-constant runtime, that is, of a runtime that’s a function of (some aspect of) the input.

void print(List<Integer> l) {
    for (int i : l) {
        System.out.println(i);
    }
}

About how long does this take to run? Again it depends. One would probably say “proportional to the length of l”, though.

Early returns

What about if the loop can return early?

boolean containsOne(int[] a) {
  for (int i: a) {
    if (i == 1) return true;
  }
  return false;
}

Like before, the runtime here varies with the input. But in a new and excitingly different way! When do we exit the loop? Who knows?!?

Since we can’t know, we generally concern ourselves with the worst case, that is, what’s the longest this loop could run?

Answer: it’s a function of the length of a, again; about 2 * a.length in the worst case, given our previous analysis.

(The other kind of analysis we might do is an “average case” analysis, but we’ll mostly leave that for COMPSCI 311.)

Also, just because you see a loop, it doesn’t mean that a method runs in non-constant time. For example:

int sumFirstThree(int[] a) {
  int sum = 0;
  for (int i = 0; i < 3; i++) {
    sum += a[i];
  }
  return sum;
}

…runs in constant time. Nor does there have to be an array in the parameter list (or as an instance variable, etc.) to trigger non-constant time behavior.
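For example, here’s a method (made up for illustration) with no arrays or collections anywhere that still doesn’t run in constant time; its runtime is proportional to the value of n:

void printGreetings(int n) {
  // loops n times, so the runtime grows with n even though there’s no array in sight
  for (int i = 0; i < n; i++) {
    System.out.println("hello");
  }
}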

Nested loops

Now consider the case of for loops within for loops. Suppose we had an algorithm for duplicate detection that looked like this:

boolean containsDuplicate(int[] a) {
  for (int i = 0; i < a.length; i++) {
    for (int j = 0; j < a.length; j++) {
      if (i == j) continue;
      if (a[i] == a[j]) return true;
    }
  }
  return false;
}

How does this algorithm operate? (On board.)

How long does it take to run? For each iteration of the outer loop, we have to go through the entire inner loop. So the total work is a.length (outer iterations) times the cost of the inner loop, and the inner loop itself does work proportional to a.length; roughly a.length * a.length steps in all.

This method’s runtime is a function of its input, but it’s no longer a linear (first-degree polynomial) function; it’s “quadratic”, that is, its runtime is proportional to a.length squared.

That’s a lot worse, especially as a.length grows. Who cares, right? Computers are fast? 3 GHz = 3 billion operations a second, right?

Well, what if we’re working with a big array? Say, a million elements? 10 ns each is only 10 ms total to run. Something that runs in time proportional to the array length will be manageable. What about quadratic? 1,000,000 x 1,000,000 = 1,000,000,000,000. That’s a lot of zeroes! Even if each step only takes, say, 10 ns, we’re still talking about 10,000 seconds (nearly three hours) to complete!

So generally, when we write methods or call them, and we suspect that they’re going to be used with large inputs, we should be thinking about how much time they’ll take to run. Many efficient algorithms are linear in the size of their input, though some are a little worse, and some are much worse.

OK, Marc, but we don’t need to go through the entire array in the inner loop; we could just go through “what’s left” after index i, since we’ve already compared everything before it, right?

Quadratic or not?

boolean containsDuplicate(int[] a) {
  for (int i = 0; i < a.length; i++) {
    for (int j = i + 1; j < a.length; j++) {
      if (a[i] == a[j]) return true;
    }
  }
  return false;
}

Well, again, how many steps does the inner loop take? It’s not always a.length, but it’s a function of a.length: the first time through, it’s a.length - 1; the next time, a.length - 2; and so on, down to 3, 2, 1, 0. What’s that proportional to? Summing those up gives a.length * (a.length - 1) / 2, which is about (a.length squared) / 2, so it’s still quadratic, just with a smaller constant. Here’s an illustration (on board), or you can run the sums if you like.

Asymptotes

We’ve been talking about running time (mostly) on the basis of the degree of the polynomial (linear, quadratic, etc.). But remember that each term has a coefficient, and depending upon your value of n and the coefficients, linear-time algorithms are not strictly better than quadratic. For a silly example, consider which is faster: a linear algorithm that takes, say, 1,000,000,000,000 * n steps, or a quadratic one that takes n^2 steps?

For “large enough” values of n, the first algorithm is faster. How large? (Set them equal to one another and solve for n.) “Large enough” in this case is 10^12. Normally the coefficients aren’t quite this lopsided, but it is true in practice that sometimes a small-coefficient quadratic algorithm is faster than a larger-coefficient (but lower-degree) algorithm for small but reasonable values of n.

Implementations matter

How about the following?

boolean allEven(List<Integer> list) {
  for (int i = 0; i < list.size(); i++) {
    if (list.get(i) % 2 == 1) return false;
  }
  return true;
}

Answer? It depends. This is where understanding how the implementation (ArrayList? LinkedList? Something else?) underneath a given abstraction works matters.

If you’re using an ArrayList, this will be linear, just as it was for an array. But remember that to get to the ith element in a linked list, you have to traverse the list from one end. So if we’re using a linked list, each call to get here does work that depends upon the length of the list, and the total across all the calls ends up being quadratic!

To further muddle this mess, the enhanced for loop:

boolean allEven(List<Integer> list) {
  for (int i: list) {
    if (i % 2 == 1) return false;
  }
  return true;
}

actually uses the list’s iterator, which is smart enough to “remember” where it was from one element to the next, so it won’t be quadratic. But you wouldn’t know this unless you knew how lists and their iterators were implemented. Take 187! :)

But, usually the Java Docs will help you here. Take a look at ArrayList to see that:

The size, isEmpty, get, set, iterator, and listIterator operations run in constant time. The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking). The constant factor is low compared to that for the LinkedList implementation.

Compare with the LinkedList:

All of the operations perform as could be expected for a doubly-linked list. Operations that index into the list will traverse the list from the beginning or the end, whichever is closer to the specified index.

I guess you need to know what “as could be expected” means. Again, take 187 to be a better programmer!

Another example of implementation mattering

Remember that some things can be computed in different ways.

For example, to compute the sum of the numbers from 1 to n, you could loop and accumulate (a sketch of the obvious version):
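int sumTo(int n) {
  // runs in time proportional to n
  int sum = 0;
  for (int i = 1; i <= n; i++) {
    sum += i;
  }
  return sum;
}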

vs

int sumTo(int n) {
  return (n * (n + 1)) / 2;
}

Different algorithms that accomplish the same goal can have different running-time behaviors: the loop version takes time proportional to n, while the closed-form version runs in a constant amount of time, no matter how big n is.