23: Variety is the Spice of Life

DRAFT

Announcements

Quiz in discussion Monday.

SRTIs: Please do them. Really, they matter to me and to the department.

Recap

We’ve done a ton of stuff this semester. I’ve made you write all kinds of code. You’ve learned about many parts of the Java Collections API, which not coincidentally are the data structures you’ll build many useful programs out of in the future. We’ve thought formally (at least a little) about algorithm design and analysis, testing and correctness, and even how to implement some of these useful data structures.

For today, I thought we’d put it all together in another worked example. To spice things up a little, though, I thought we’d do it more than once. No, I have not lost my marbles. Well, maybe a few of them.

Broadening your Horizons

In 121 (or AP CS) you learned some things about computer science, and some things about programming. Almost certainly, then, the programming language you have the most experience with is Java. It may come as a surprise to some of you that Java is maybe not my mostest favoritestest of programming languages. In fact, the primary reason I write anything at all in Java these days is only because our introductory curriculum is structured to use it.

I’m often banging on about weird little aphorisms, like that “names have power.” And a problem with aphorisms is that they sometimes sound trite. But names do have power, because they shape how we think. If you name a variable i it doesn’t mean much, though from context and convention we’ll usually assume it’s the iteration counter (or maybe just a short-lived integer). If you name a variable messageCount that’s pretty clear. If you name a variable c that’s, uhh, less clear, but context might help. If you name a variable thisAssignmentSucks, well, I guess that might help your state of mind temporarily, but it isn’t going to help you understand the code you’re writing.

In the same way, your choice of language will influence how you go about solving a problem. The designers of the Java language (and its standard libraries) were solving a problem (really, a large set of them): How do you write code that’s human readable, that ultimately is executable by a machine? At its lowest levels, a machine doesn’t provide much in the way of abstraction: you can execute instructions linearly, you can read and write bytes and “words” (collections of bytes) from memory, you can operate on them with basic arithmetic and logical operations, and you can “branch” (or “jump”) to another instruction either conditionally or unconditionally. And that’s about it. No loops, no objects, no classes, no types.

Yet James Gosling and his colleagues ended up with Java: a language that syntactically resembled the imperative languages of the day (C, C++): curly braces for blocks, if and switch for flow control, infix operators, prefix functions, parentheses for arguments and precedence, semicolons to end statements. They made other choices too: it’s essentially an “imperative” language (“do this, then do this, then do this”); in other words, programs are mostly composed of statements (which in turn are composed of expressions, many of which have side effects).

They also made choices about bigger things: for example, in Java, everything is scoped to a class; there’s no such thing as a free-standing method (sometimes called a function, procedure, or subroutine in other languages). Everything’s an object. (Well, except for primitives.) There’s a strict inheritance hierarchy among classes (Well, there’s also interfaces). And so on.

But these aren’t the only possible ways to design a programming language, and the choices your language designer makes influence, in turn, how you might think about or write a program yourself. There’s a whole universe of other languages out there whose designers made different choices, and thus work differently.

For example, consider Python, a “scripting” language that many people find looks like executable pseudocode. Much like Java, it’s a mostly imperative language. It supports classes and inheritance, but doesn’t require them. It has syntax support for lists, maps, and sets so you can write them quickly and concisely. It uses whitespace rather than braces to denote blocks.

Or, consider Lisp. There is almost no syntax: it’s parentheses all the way down. The original motivation for this choice is a little weird. When a compiler or interpreter parses your program, it transforms it into something called an “abstract syntax tree”. You can represent a tree as a nested group of expressions – so you could, say, use a list of nested parenthetical (symbolic) expressions called s-exprs. Lisp was originally going to have more syntax than this – the designer started with the parenthetical expressions, with the intention of adding syntax later. But it turns out that if your program’s syntax trivially maps to its structure, then it’s trivial to write programs that manipulate programs. This is called metaprogramming and enables all sorts of wonderful things.

Or, consider the ML-family of languages. Suppose you want to formally verify, like, with math, that your program “works”, for some definition of works. One way to define this is to say there will be no runtime errors of various categories. Another is to say there will be no type errors. Or bounds errors. With a formal enough type semantics, you can encode these requirements into a type system, and prove that any program that typechecks correctly will not have these problems.

You get the idea – and there’s lots more where that comes from. But this isn’t a course in programming language design, so let’s get down to our example, first in Java and then in some other ways.

Markov models for text generation

Markov models are a way of modeling a random process. At their simplest, they’re a graph (oh yeah) where the vertices are states and the directed edges are transitions between the states, annotated with the probability that the transition is taken. (A/B example on board.)
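To make the graph idea concrete, here is a minimal Python sketch of a two-state chain. The state names A and B echo the board example, but the probabilities are made up for illustration (they are not the ones from lecture), and walk is a hypothetical helper name.

```python
import random

# Hypothetical transition probabilities (invented for illustration):
# each state maps to its outgoing edges and their probabilities.
chain = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.5, "B": 0.5},
}

def walk(chain, start, steps, rng):
    """Return the list of states visited, starting at `start`."""
    state = start
    visited = [state]
    for _ in range(steps):
        edges = chain[state]
        # rng.choices picks one item according to the given weights
        state = rng.choices(list(edges), weights=list(edges.values()))[0]
        visited.append(state)
    return visited

path = walk(chain, "A", 10, random.Random(0))
print(" ".join(path))
```

Each step looks only at the current state, never at how we got there; that is the defining property of a Markov model.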

OK, how might we use this for text generation? Let’s take a look at a short sentence fragment:

a man a plan a canal a

(I heard you did some palindrome stuff lately.)

Suppose this is a representative sample of text. How might we characterize it? We could treat each word as a state, the choice of successor of each word as an edge. (On board)

And we could assign probabilities on the basis of what we observe in this text: equiprobably (p = 1/3) from “a” to each other word, and back to “a” (p = 1) from each word.
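As a rough check on those numbers, here is a small Python sketch that counts successors in the sample and normalizes the counts into probabilities. The variable names are my own, and this is only one way to do the bookkeeping.

```python
from collections import Counter, defaultdict

text = "a man a plan a canal a"
words = text.split()

# Count how often each word is followed by each successor.
counts = defaultdict(Counter)
for word, successor in zip(words, words[1:]):
    counts[word][successor] += 1

# Normalize the counts for each word into probabilities.
probabilities = {
    word: {s: c / sum(successors.values()) for s, c in successors.items()}
    for word, successors in counts.items()
}

for word, edges in probabilities.items():
    print(word, edges)
```

The three edges out of “a” each get probability 1/3, and every other word returns to “a” with probability 1.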

If we “walked” around on this graph we’d generate text each time we took a step: (on board).

In a nutshell, that’s it. We could fancy it up somewhat (and we will in a bit), but that’s the idea.

One question remains though: how do we build the graph? The answer is we read in a text. If you sample a large enough text (it turns out a half-dozen words isn’t really large enough), you can start to output meaningful-seeming data.

Building a Markov text generator

So how do we generate text? What’s the generation algorithm?

assume a model
start with a word
add it to our text
repeat until we’ve generated n words of output:
- choose a successor at random from the model; note this choice is weighted
- add the successor to our output text
- update our current word to be the successor

So what does our model need to support? The answer is a fast lookup of a word, and its associated successors, in a format that supports an easy random choice. So how might we build one?

build a model by:
- creating a mapping from words to lists of words
- splitting an input into words
- for each word:
  - insert it into the map if it’s not already there
  - append its successor to the associated list

How do we build our model? By reading an input string (“training data”), splitting it on spaces, and then updating our model: for each word in the string, add its successor to the list of successors for that word in the model. We will end up adding words more than once to the list, but that’s OK (and in fact, exactly what we want), since we want the choice to be weighted. For example, if the word “zebra” shows up only once in our training data after the word “the”, it should be correspondingly rare in our generated output.

How do we choose a successor for a word? Choose a word from the list at random – since more common successors appear in the list more than once, they will be chosen correspondingly more often.
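A quick Python sketch of why duplicates give us weighting for free. The successor list here is hypothetical sample data, and the seed is fixed only so the run is repeatable.

```python
import random
from collections import Counter

# Hypothetical successor list for some word: "dog" was seen three times,
# "zebra" once.
successors = ["dog", "dog", "zebra", "dog"]

rng = random.Random(0)  # seeded so the run is repeatable
picks = Counter(rng.choice(successors) for _ in range(10_000))
print(picks)  # "dog" should come up roughly three times as often as "zebra"
```

A uniform choice over list positions turns into a weighted choice over words, with no extra arithmetic.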

This choice of data structure is somewhat wasteful of space, in that we needn’t actually store multiple copies of each word (or references), but it does make the random choice easy. I’ll leave it as an exercise to you how to build something that would be more efficient space-wise but still also efficient when making the random choice.

Notice that nothing we’ve written so far is actually specific to Java. That’s a good thing, in that you’ve been learning to think computationally, and that Java is just a way to express those thoughts concretely. Let’s do so next.

Building our generator in Java

To map words to lists of words, we’ll use an appropriate map. And this structure is essentially the only state (and thus, instance variable) the class will need:

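// imports for everything this class will eventually use
import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

public class MarkovTextGenerator {
    // maps each word to every successor observed after it (duplicates kept)
    private Map<String, List<String>> model;

    public MarkovTextGenerator() {
        model = new HashMap<>();
    }
}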

How do we add a word to the model? Trick question! We need to add the word and its successor.

public void addToModel(String word, String successor) {
    if (!model.containsKey(word)) {
        model.put(word, new ArrayList<>());
    }
    model.get(word).add(successor);
}

Does this thing work? Let’s add a toString method:

public String toString() {
    return model.toString();
}

and test it in a main method:

public static void main(String[] args) {
    String[] words = "a man a plan a canal a".split("\\s+");

    MarkovTextGenerator generator = new MarkovTextGenerator();
    for (int i = 0; i < words.length - 1; i++) {
        generator.addToModel(words[i], words[i + 1]);
    }

    System.out.println(generator);
}

Looks good:

{a=[man, plan, canal], man=[a], canal=[a], plan=[a]}

Now let’s write a method to add every word in a file:

public void addToModel(File file) throws FileNotFoundException {
    Scanner scanner = new Scanner(file);
    scanner.useDelimiter("\\s+");
    if (!scanner.hasNext()) {
        scanner.close();
        return;
    }
    String successor = scanner.next();
    while (scanner.hasNext()) {
        String word = successor;
        successor = scanner.next();
        addToModel(word, successor);
    }
    scanner.close();
}

Does it work?

generator = new MarkovTextGenerator();
generator.addToModel(new File("/Users/liberato/canal.txt"));
System.out.println(generator);

Yup. Now let’s do the text generation:

public String generateText(int n) {
    Random random = new Random(1); // fixed seed, so every run gives the same text

    // choose the first word at random
    List<String> words = new ArrayList<>(model.keySet());
    String word = words.get(random.nextInt(words.size()));

    // build the list of strings, one word per iteration, n words in total
    List<String> result = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
        result.add(word);
        List<String> successors = model.get(word);
        if (successors == null) {
            // dead end: this word only appeared at the very end of the input
            break;
        }
        word = successors.get(random.nextInt(successors.size()));
    }

    // return a single string
    return String.join(" ", result);
}

OK, let’s point it at something more real. How about some Sherlock Holmes via Project Gutenberg?

generator.addToModel(new File("/Users/liberato/sherlock.txt"));
System.out.println(generator.generateText(100));

That’s pretty neat. One last tweak. Here, each word has one word of context. What if we make each successor based upon the previous two words? Let’s cheat a little bit and concatenate those two words to make the “previous” word:

String word = scanner.next();
String successor = scanner.next();
while (scanner.hasNext()) {
    String previous = word;
    word = successor;
    successor = scanner.next();
    addToModel(previous + " " + word, successor);
}

A more general approach might be to keep a sliding window of the last k words, but we’ll omit that here. We’ll also modify the generator:

        // choose the first two-word key at random
        List<String> words = new ArrayList<>(model.keySet());
        String start = words.get(random.nextInt(words.size()));
        String[] two = start.split(" ");
        String previous = two[0];
        String word = two[1];

        // build the list of strings: one word up front, then n - 1 more
        List<String> result = new ArrayList<>(n);
        result.add(previous);
        for (int i = 1; i < n; i++) {
            result.add(word);
            List<String> successors = model.get(previous + " " + word);
            if (successors == null) {
                break; // dead end: this pair only appeared at the end of the input
            }
            previous = word;
            word = successors.get(random.nextInt(successors.size()));
        }

Building the generator in Python

So, you may have heard of a language called Python. It’s pretty popular in some circles, and for good reason: it makes writing easy programs ridiculously easy. There are various (somewhat hidden) costs involved though. For example, the lack of a static type system makes it harder to build large programs correctly. And Python’s bytecode is not terribly amenable to JIT compilation, and Python is notoriously “slow”. Slow is relative, of course, but if you have a large, computationally-intensive job, plain-old-python without native extensions is not always the right choice.

Syntactically, it’s very similar to Java, but with a few major changes. Semicolons are optional:

print("hello world")

does what you’d expect, for example. Next, types are dynamic (determined at runtime), not static, so you don’t include them in your programs!

i = 4
i += 1;

s = "hello"

print(s, i)

Another important difference is that it has a REPL (read-eval-print loop), which lets you program interactively. (Java 9 added one of these, called JShell.) Many newer programming languages have REPLs, and they’re surprisingly useful: they let you quickly test small snippets of code without going through the compile-run-debug loop. The stock python interpreter provides one, but there’s an enhanced one that basically everyone uses called ipython, which also has a web front-end called jupyter. (demo)

Python has classes, but we’re going to build the generator using just standalone functions (which is what Python calls ‘em). You define a function with def:

def add(x, y):
    return x + y

print(add(2, 2))

Notice a few things: no type declarations. No braces; the : says “what follows is the body of the function”, and the amount of indent is used to indicate what’s part of this function and what’s not. You can already see that Python is more concise than Java, without any real loss in readability. One tradeoff of not including type information at compile-time is that you won’t see type errors until runtime:

def div(x, y):
    return x / y

def do_it():
    print(div(12, 'three'))

“compiles” (that is, can be loaded by the interpreter) just fine. But when you actually invoke it:

do_it()

you get a TypeError – at runtime, not at load time.
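If you want to see that for yourself without crashing the session, here is a small sketch that catches the error. It is adapted slightly from the snippet above (do_it returns instead of printing), and the outcome variable is my own addition.

```python
def div(x, y):
    return x / y

def do_it():
    return div(12, 'three')

# Defining the functions above raised no error. The failure only shows up
# when the bad call actually runs:
try:
    do_it()
    outcome = "no error"
except TypeError as e:
    outcome = "TypeError: " + str(e)

print(outcome)
```

In Java, the equivalent mistake would have been rejected by the compiler before the program ever ran.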

Enough of that. Let’s rewrite our generator in Python. The equivalent of a Map in python is a dictionary. You can create one with either the built-in dict function, or a dictionary literal: {}

model = {}

You can look up the value corresponding to a key k in a dictionary d using d[k].

A list in Python has a similar list() function and [] syntax support; you can access the ith element of a list l with l[i].
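A few lines to try in the REPL, as a sketch of the syntax just described. The contents of d are only sample data.

```python
# A dictionary literal mapping words to lists of successors
# (the contents here are just sample data).
d = {"a": ["man", "plan"], "man": ["a"]}

print(d["a"])        # look up the value for key "a"
print(d["a"][1])     # index into the resulting list (indices start at 0)
print("canal" in d)  # membership test on the keys

d["canal"] = ["a"]   # assigning to a missing key inserts it
print(len(d))
```

Note that reading a missing key with d[k] raises a KeyError, so check with in first, or use d.get(k).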

Now let’s write the function to add to the model. This function will modify the model dict in-place:

def add_to_model(model, word, successor):
    if word not in model:
        model[word] = []
    model[word].append(successor)

Though it turns out we can make a shorter version of this using one of dict’s many helper methods:

def add_to_model(model, word, successor):
    model.setdefault(word, []).append(successor)

And let’s test it:

model = {}
words = 'a man a plan a canal a'
word_list = words.split()
for i in range(len(word_list) - 1):
    add_to_model(model, word_list[i], word_list[i + 1])

print(model)
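Another option, sketched here as an alternative rather than what we’ll use below, is collections.defaultdict from the standard library, which creates the empty list for you on first access:

```python
from collections import defaultdict

# A defaultdict calls list() to create the value for any missing key.
model = defaultdict(list)

words = "a man a plan a canal a".split()
for word, successor in zip(words, words[1:]):
    model[word].append(successor)

print(dict(model))
```

One caveat: merely reading model[word] for an unseen word inserts an empty list, so lookups during generation should use model.get(word) or an in test instead.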

And build the generator:

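import random

def generate(model, n):
    random.seed(0)  # fixed seed, so every run produces the same text
    result = []
    word = random.choice(list(model.keys()))
    for _ in range(n):
        result.append(word)
        if word not in model:
            break  # dead end: this word was only seen at the very end of the input
        word = random.choice(model[word])
    return ' '.join(result)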

and test it:

print(generate(model, 20))

Now let’s write a function to add words from a file:

def add_file_to_model(model, path):
    with open(path) as f:
        data = f.read()
        words = data.split()
        for i in range(len(words) - 1):
            add_to_model(model, words[i], words[i + 1])

and test it:

model = {}
add_file_to_model(model, '/Users/liberato/sherlock.txt')
print(generate(model, 100))

So that’s a first cut. I won’t do the update to the two-word model, but it’s pretty straightforward.
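For the curious, here is one way the k-word version might look, as a sketch rather than a reference solution. The names add_words_to_model and generate_k are made up for this example, contexts are stored as tuples of k words, and the sample sentence is extended a bit so that every two-word context has a successor.

```python
import random

def add_words_to_model(model, words, k=2):
    """Map each run of k consecutive words (as a tuple) to its successors."""
    for i in range(len(words) - k):
        context = tuple(words[i:i + k])
        model.setdefault(context, []).append(words[i + k])

def generate_k(model, n, rng):
    """Generate up to n words; stop early if a context has no successors."""
    context = rng.choice(sorted(model))  # sorted so a seed gives repeatable picks
    result = list(context)
    while len(result) < n:
        successors = model.get(context)
        if not successors:
            break
        word = rng.choice(successors)
        result.append(word)
        context = context[1:] + (word,)  # slide the window forward one word
    return ' '.join(result[:n])

model2 = {}
add_words_to_model(model2, 'a man a plan a canal a man a plan'.split(), k=2)
print(model2)
print(generate_k(model2, 10, random.Random(0)))
```

Using tuples as keys avoids the string-concatenation trick from the Java version, and changing k is a one-argument change.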

Building the generator in OCaml

OCaml is an old language (as old as Java) that comes from different roots; it has a quite different syntax and semantics. It’s a dialect of ML, which is a functional language. C (and Java, and Python) are imperative; in some sense they mirror a particular model of computation – the Turing machine, where we say “do this, then do this, then do this…”. Functional languages like OCaml more closely mirror the lambda calculus, where we think about repeated function evaluation rather than a sequence of instructions.

One knock-on effect of this is that you write things in terms of functions that evaluate to other functions. (Though you can string together more than one expression with the ; operator, so it “looks like” imperative programming.)

The various ML dialects make (more) clear why recursion is natural. OCaml has a REPL called utop that we can use; we need to append ;; to the end of statements because utop doesn’t know when expressions end, but actual OCaml code does not require this ;; operator.

12;;

Notice that OCaml knows the type (int) of this value (12). The utop interpreter “feels like” python but it’s actually compiling the code, and doing typechecking. Let’s define an add function:

let add x y = x + y;;

add 5 6;;

Some things:

We use let to bind a name to a value. Here, the name is add, and the value is a function of type int -> int -> int, which means, sorta, that this is a function that takes two ints and returns an int.

OCaml invokes functions on arguments as shown; no parentheses!

OCaml can do if/thens just like most programming languages:

let rec fact n =
  if n < 2
  then 1
  else n * fact (n - 1);;

Notice that we have to tell OCaml that this is a recursive function.

We can also write this in a more mathematical style:

let rec fact n =
  match n with
    0 -> 1
  | 1 -> 1

Oops! We forgot the recursive case. Here’s the first hint of how powerful OCaml’s typechecker is: it tells us we missed at least one case (n = 2). We could add it, then find we’ve forgotten n = 3, and so on. How do we say “otherwise”? The “match anything” case comes last, and we can re-use the already-bound value n:

let rec fact n =
  match n with
    0 -> 1
  | 1 -> 1
  | n -> n * (fact (n - 1))

Perhaps that gives you a hint of why recursion might be more useful in other languages. OCaml is in some ways more conceptually different from Java than Python, so I’m going to gloss over (more) details than I did when I did the Python version.

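So now let’s turn to our text generator again. OCaml, like most modern languages, comes with a Map in its standard library (though unlike Python it has no special language support). The snippets below use names such as find_exn and String.Map, which come from Jane Street’s Core library rather than the compiler’s own standard library: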

let add model word successor =
  let successors = match (Map.find model word) with
      Some l -> l
    | None -> []
  in  
  Map.add model word (successor :: successors)

Here, we look up the word in the map; if it’s there we bind successors to its list (Some l), otherwise we bind successors to an empty list. Then we add the successor to this list (successor :: successors) (:: is the prepend operator on a list), and return a new map, based upon the old one, with this new successor list added.

Why don’t we modify the list in place? OCaml, like most functional languages, defaults to immutable data structures. You “modify” them by calling a function that returns a changed copy of the data structure. They are implemented in a way that shares structure (that is, is memory efficient). Normally, this would be terrifying in a language like Java, where aliased references result in hard-to-track-down bugs. But if your structure is immutable, it doesn’t matter if it shares structure with others, since they can’t change it!
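To see what “shares structure” means, here is a rough imitation in Python, a language we’ve already met. It fakes an immutable linked list with nested tuples; cons and to_list are hypothetical helper names, and this is only a stand-in for what OCaml’s :: does natively.

```python
# Imitate an immutable linked list with nested tuples: (head, tail),
# with None as the empty list. This is a Python stand-in, not OCaml.
def cons(head, tail):
    return (head, tail)

def to_list(cell):
    out = []
    while cell is not None:
        head, cell = cell
        out.append(head)
    return out

old = cons("plan", cons("man", None))
new = cons("canal", old)  # "prepend": builds exactly one new cell

print(to_list(old))       # the old list is unchanged
print(to_list(new))
print(new[1] is old)      # the new list's tail *is* the old list
```

Prepending costs one new cell no matter how long the list is, and because tuples can’t be mutated, nobody holding old can be surprised by the sharing.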

OCaml does actually support imperative programming; it has the ability to declare mutable variables, and do iteration and other imperative things when you need to. It also has a full object system (hence the “O”).

Let’s write the generator now:

let generate model n =
  let rec aux word i =
    let successor = 
      let successors = Map.find_exn model word in
      List.nth_exn successors (Random.int (List.length successors))
    in
    if i <= 1 then [successor]  (* base case: this is the last of the n words *)
    else
      successor :: (aux successor (i - 1))
  in
  let words = Map.keys model in 
  aux (List.nth_exn words (Random.int (List.length words))) n

This uses a recursive aux(iliary) function, which “counts down” using i: in the base case it returns just the successor, and otherwise it prepends the current successor to the recursive call’s result.

Now let’s test it:

let split s =
  String.split_on_chars ~on:[ ' ' ; '\t' ; '\n' ; '\r' ] s
  |> List.filter ~f:(fun x -> x <> "")

let () =
  let phrase = "a man a plan a canal a" in
  let words = split phrase in
  let model = ref String.Map.empty in
  for i = 0 to (List.length words - 2) do
    model := add !model (List.nth_exn words i) (List.nth_exn words (i + 1))
  done;
  print_endline (String.concat ~sep:" " (generate !model 20))

OCaml doesn’t by default have an “all-whitespace” splitter like Python, so we wrote one here. This is also our first use of actual mutable variables; the variable model is a reference, which we can dereference with ! and update with :=.

What if we want to read from a file? Same as before, let’s write a function:

let add_from_file model path =
  let words = In_channel.read_all path |> split in
  let model = ref model in
  for i = 0 to (List.length words - 2) do
    model := add !model (List.nth_exn words i) (List.nth_exn words (i + 1))
  done;
  !model

Here we read the file in, then pass that directly as input to the split function using the |> function (an infix operator just like + or -).

and let’s try it on a short input:

model := add_from_file String.Map.empty "/Users/liberato/test.txt";
print_endline (String.concat ~sep:" " (generate !model 20));

works fine. What about for Sherlock?

model := add_from_file String.Map.empty "/Users/liberato/sherlock.txt";

print_endline (String.concat ~sep:" " (generate !model 20));

Huh, that’s funny. It’s taking a long time. Maybe we’ve gone…accidentally quadratic? Unlike Python lists, which have O(1) indexing, OCaml lists are linked lists, so List.nth has to traverse the list from the front on every call. Doing that once per word makes the whole loop quadratic in the number of words, and that’s no good. Let’s convert that list to an array so the lookups are faster:

let add_from_file model path =
  let words = In_channel.read_all path |> split |> Array.of_list in
  let model = ref model in
  for i = 0 to (Array.length words - 2) do
    model := add !model words.(i) words.(i + 1)
  done;
  !model

The syntax for array lookup of the i-th element is .(i). And run it!
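How bad was the list version? Here is a back-of-the-envelope cost model in Python. It is not a benchmark: it assumes that reaching index i in a linked list means following i links, and simply adds those up for the list-based loop we started with. Both function names are made up for this sketch.

```python
def nth_steps(i):
    """Links followed to reach index i in a singly linked list (assumed cost)."""
    return i

def total_steps(n):
    # Mirrors the list-based loop: look up element i and element i + 1
    # for every i from 0 to n - 2.
    return sum(nth_steps(i) + nth_steps(i + 1) for i in range(n - 1))

for n in (1_000, 10_000, 100_000):
    print(n, total_steps(n))
```

Under this model the total works out to (n - 1) squared, so ten times as many words means roughly a hundred times as much work. With an array, each lookup is constant time and the loop is linear.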

OCaml is more clunky than Python, but (once you get used to the syntax) no worse than Java. There’s also a new front-end for it that presents a very Javascripty-feel, called “Reason” that you might check out if you’re curious.

OCaml’s main benefits are its speed (it compiles to native code), the awesome power of its type system (if programs typecheck, they are often correct), and its module system, which is much more powerful than Java’s or Python’s (you can write functions that take not just functions as inputs, but modules!).

Other stuff

Other things you might want to peruse:

High-level languages: Python, Ruby, Julia

Java-ish languages: C#, Go, C++ (sorta)

Low-level languages: C, Rust, C++ (sorta)

Languages with Real Type Systems: OCaml, Haskell

Lispy things: Racket, Clojure, Common Lisp, Scheme