12: Maps, spam, pams, anagrams

Announcements

Programming assignment 06 is due Friday. The next assignment will be due the week after spring break (I treat break as an extended weekend: you don’t get double work from me).

The last day to withdraw is tomorrow. You must return a signed “Course Change Request” to the registrar by Wednesday at 5pm. The Computer Science main office (CS room 100) staff are allowed to sign on my behalf, if you cannot easily find me.

Today’s agenda

Today we’re going to do another worked example. In particular, we’re going to write an anagram finder, that is, a program that, given a word list, looks for words in that list that are anagrams of one another. An anagram of a word is a rearrangement of the letters in the word that results in a new (different) word, using each of the letters from the original word exactly once. I’ll go start-to-finish again today, building the project in Eclipse.

Program sketch

What should our program do?

First, how might we recognize anagrams? For example, we could write a method to count the occurrences of each letter in a word, then store that in an array (or object), then check those arrays/objects for equals-style equality. And then we could perhaps write a hashCode method for those objects, etc.
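As a rough sketch of that letter-counting idea (not the approach we'll take today): the version below assumes words contain only lowercase a-z, and the names LetterCounts, countLetters, and sameLetters are made up for illustration.

```java
import java.util.Arrays;

class LetterCounts {
    // Counts how often each letter a-z occurs.
    // Assumption: the word contains only lowercase ASCII letters.
    static int[] countLetters(String word) {
        int[] counts = new int[26];
        for (char c : word.toCharArray()) {
            counts[c - 'a']++;
        }
        return counts;
    }

    // Two words use the same letters iff their count arrays are equal.
    static boolean sameLetters(String a, String b) {
        return Arrays.equals(countLetters(a), countLetters(b));
    }

    public static void main(String[] args) {
        System.out.println(sameLetters("bird", "drib")); // true
        System.out.println(sameLetters("bird", "bard")); // false
    }
}
```

Note that sameLetters("bird", "bird") is also true, so a real anagram check would additionally need to rule out identical words.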

You could do that, and it would work. But for purposes of illustration, I’m going to take advantage of a well-known trick, which is that if you sort the letters of two words and compare the sorted letters, they’ll be equal iff the words are anagrams of one another.

So now we’ve got the core of the algorithm. How can we group words into clusters of anagrams? I’m going to suggest a Map<String, List<String>>. The keys are going to be the sorted-letter versions of the words, and each value will be the list of all the words that share that sorted-letter version. This structure, where a key is associated with many values, is sometimes called a “Multimap”; Java doesn’t directly support multimaps, but associating each key with a collection of values is an ad hoc version of this.

So what are we going to do? Something like the following:

read a list of words (a list? an array? or process one-by-one? up to us; the first is probably simplest, the last might be more memory efficient; it depends a lot upon how big you expect the list to be)
create a multimap
for each word:
- compute its sorted version
- insert it into the multimap
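
The steps above can be sketched in a few lines. This is only a rough outline under some assumptions: the words are already in memory as a List<String>, and the names AnagramSketch and group are hypothetical. The class we build in the rest of these notes is organized differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AnagramSketch {
    static Map<String, List<String>> group(List<String> words) {
        // create a multimap: sorted letters -> words with those letters
        Map<String, List<String>> multimap = new HashMap<>();
        for (String word : words) {
            // compute the word's sorted version
            char[] letters = word.toCharArray();
            Arrays.sort(letters);
            String key = new String(letters);
            // insert it into the multimap, creating the list on first use
            List<String> bucket = multimap.get(key);
            if (bucket == null) {
                bucket = new ArrayList<>();
                multimap.put(key, bucket);
            }
            bucket.add(word);
        }
        return multimap;
    }

    public static void main(String[] args) {
        Map<String, List<String>> m = group(Arrays.asList("and", "dan", "it"));
        System.out.println(m.get("adn")); // [and, dan]
        System.out.println(m.get("it"));  // [it]
    }
}
```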

Then, we can write methods to query the multimap. Let’s get started.

Coding up `AnagramFinder`

First, the instance variable:

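import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnagramFinder {
    // maps each sorted-letter key to the words that share those letters
    private final Map<String, List<String>> anagrams;

    public AnagramFinder() {
        anagrams = new HashMap<String, List<String>>();
    }
}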

Next, the add method, to add a word to the AnagramFinder. What should it do? It should look up the word in the map, and add it to the associated list. What if there is no list? It should make a new one and insert it into the map.

To look up the word, we’ll need a method to return the letter-by-letter alphabetization of a String. There are several ways to do this. Here’s one:

private static String alphabetized(String word) {
  char[] a = word.toCharArray();
  Arrays.sort(a);
  return new String(a);
}

Why is this method static? It does not depend upon the instance in any way, so there is no need to make it an instance method. If it does later change to be part of the instance (that is, if we attempt to call an instance method from it, or use an instance variable from it), the type checker will alert us. Further, static methods could be moved (or copied) easily to another class if appropriate – this is part of a process called “refactoring”.

OK, back to add. It has to handle two cases: when the alphabetized version of the word is in the multimap already, and when it’s not.

There are a couple of different ways you could write this. For example, you could handle the two cases completely separately:

public void add(String word) {
  String key = alphabetized(word);
  if (!anagrams.containsKey(key)) {
    List<String> l = new ArrayList<String>();
    l.add(word);
    anagrams.put(key, l);
  }
  else {
    List<String> l = anagrams.get(key);
    l.add(word);            
  }
}

Or you could deal with the not-in-map problem first, and unify things otherwise:

public void add(String word) {
  String key = alphabetized(word);
  if (!anagrams.containsKey(key)) {
    anagrams.put(key, new ArrayList<String>());
  }
  List<String> l = anagrams.get(key);
  l.add(word);      
}

I find them both fairly readable, but things being otherwise equal, I will generally choose the shorter solution.
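
If you are on Java 8 or later, a third option is Map.computeIfAbsent, which creates and stores the list only when the key is missing and returns the list either way. Below is a minimal sketch of that variant; the class name ComputeIfAbsentDemo and the groupOf helper are hypothetical and exist only so the example can be run on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ComputeIfAbsentDemo {
    private final Map<String, List<String>> anagrams = new HashMap<>();

    private static String alphabetized(String word) {
        char[] a = word.toCharArray();
        Arrays.sort(a);
        return new String(a);
    }

    public void add(String word) {
        // Look up the list for this key, creating it first if needed,
        // then add the word to it.
        anagrams.computeIfAbsent(alphabetized(word), k -> new ArrayList<>()).add(word);
    }

    // Hypothetical helper, only for inspecting the result.
    List<String> groupOf(String word) {
        return anagrams.getOrDefault(alphabetized(word), Collections.emptyList());
    }

    public static void main(String[] args) {
        ComputeIfAbsentDemo demo = new ComputeIfAbsentDemo();
        demo.add("bird");
        demo.add("drib");
        System.out.println(demo.groupOf("bird")); // [bird, drib]
    }
}
```

Whether this reads better than the explicit containsKey check is a matter of taste; it does avoid the second lookup.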

Let’s write some code in our main method to test this out.

public static void main(String[] args) {
  AnagramFinder af = new AnagramFinder();
  af.add("bird");
  af.add("drib");

  af.add("and");
  af.add("nad");
  af.add("dan");

  af.add("it");
}

OK, but we forgot to write methods to get anything out of the AnagramFinder! Let’s do so now.

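First, though: there’s no real reason for this multimap to store Lists as values. We don’t care about the order of the words in a group, and we don’t want the same word stored twice, so really it should store a Set<String>. Let’s make that change now, and see how Eclipse and Java’s type system show us what needs to be changed.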

(demo)
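
For reference, here is one way the class might look after switching to sets. This is a sketch, not necessarily what the in-class demo produced: the class name SetBackedAnagramFinder and the groupSize helper are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SetBackedAnagramFinder {
    // Values are now sets, so adding the same word twice has no effect.
    private final Map<String, Set<String>> anagrams = new HashMap<String, Set<String>>();

    private static String alphabetized(String word) {
        char[] a = word.toCharArray();
        Arrays.sort(a);
        return new String(a);
    }

    public void add(String word) {
        String key = alphabetized(word);
        if (!anagrams.containsKey(key)) {
            anagrams.put(key, new HashSet<String>());
        }
        anagrams.get(key).add(word);
    }

    // Hypothetical helper: how many distinct words share this word's letters.
    int groupSize(String word) {
        Set<String> group = anagrams.get(alphabetized(word));
        return group == null ? 0 : group.size();
    }

    public static void main(String[] args) {
        SetBackedAnagramFinder af = new SetBackedAnagramFinder();
        af.add("and");
        af.add("dan");
        af.add("and"); // duplicate, ignored by the set
        System.out.println(af.groupSize("nad")); // 2
        System.out.println(af.groupSize("boo")); // 0
    }
}
```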

In-class exercise 1

Write a method that returns the anagrams of a given word, or an empty set if there are no such anagrams.

public Set<String> anagramsOf(String word) {
  return anagrams.getOrDefault(alphabetized(word), new HashSet<String>());
}

Does it work?

System.out.println(af.anagramsOf("and"));
System.out.println(af.anagramsOf("it"));
System.out.println(af.anagramsOf("boo"));

Let’s add the ability to read from a file:

public void addFromFile(Path path) throws IOException {
  // try-with-resources closes the reader, even if readLine throws
  try (BufferedReader br = Files.newBufferedReader(path)) {
    for (String word = br.readLine(); word != null; word = br.readLine()) {
      add(word);
    }
  }
}

(Note that there are lots of ways to read from files; this is just one.)
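
As one example of another way, Files.readAllLines reads the whole file into a List<String> in a single call. It is simple, but it holds every line in memory at once, so it suits small-to-moderate word lists. A minimal sketch follows; the class name ReadWordsDemo and the temporary file are only there to make the example runnable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class ReadWordsDemo {
    // Reads every line of the file (as UTF-8) into a list, one word per line.
    static List<String> readWords(Path path) throws IOException {
        return Files.readAllLines(path);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary word list so the demo is self-contained.
        Path tmp = Files.createTempFile("words", ".txt");
        Files.write(tmp, Arrays.asList("bird", "drib", "and"));
        System.out.println(readWords(tmp)); // [bird, drib, and]
        Files.delete(tmp);
    }
}
```

With this approach, an addFromFile-style method could simply loop over the returned list and call add on each word.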

Now let’s write a method to find the word(s) with the most anagrams. More than one group could be tied for the most, but let’s just return any one such set (or an empty set if no words have been added yet). You’ll probably want to use Map.values().

public Set<String> mostAnagrams() {
  int longest = -1;
  Set<String> set = new HashSet<String>();
  for (Set<String> grams : anagrams.values()) {
    if (grams.size() > longest) {
      longest = grams.size();
      set = grams;
    }
  }
  return set;
}

Now let’s test it: