CMPSCI 383: Artificial Intelligence

Fall 2014 (archived)

Assignment 05 Sample Solution

…and an editorial

I’ve provided sample solutions, or fairly extensive templates, in Java for most of the assignments so far. I’ve also talked about the importance of learning multiple programming languages. Even if you don’t use them in your classes (or your day job, etc.), they’ll help you think about problems and abstractions in new ways that will improve your work in other languages.

In Ye Olden Days, AI at UMass was taught in Common Lisp. For various reasons, we don’t do that anymore, but you can still benefit from learning a language other than Java. I’ve touted Python or Ruby as good second languages for a Java programmer in class. I prefer Python, but there are fine reasons to choose Ruby also (or a Lisp dialect, or another language).

Ruby and Python have similar benefits:

  • each provides nice syntactic sugar around commonly used abstractions like lists and maps (aka associative arrays, dictionaries, hash tables), enabling brevity Java cannot match
  • both can treat functions as first-class citizens, that is, they can be stored in variables and passed around as values, enabling types of abstraction that are difficult to express succinctly in Java
  • like Java, both have well-developed standard libraries and package systems; Python’s standard library is very comprehensive, though Ruby’s package management system is arguably a little saner than Python’s
  • each provides a nice REPL, which supports rapid and interactive development; I’m partial to IPython
  • both are well supported in many IDEs; if you don’t want to dive into Emacs, the makers of my preferred Java IDE also make PyCharm
  • testing in each (both unit testing and spot-checking) is more straightforward than in Java
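To make the first two points concrete, here’s a small sketch (my own illustrative example, not from the assignment) that counts word frequencies with dictionary sugar and then sorts with a function passed as a value — the kind of thing that takes noticeably more ceremony in Java:

```python
# Dictionary sugar: build a word-frequency map in three lines.
counts = {}
for word in "the quick brown fox the lazy dog the".split():
    counts[word] = counts.get(word, 0) + 1

# First-class functions: pass a key function to sort by count, descending.
by_frequency = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(by_frequency[0])  # ('the', 3)
```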

Now onto the solution

Here’s a sample solution to Assignment 05, written in under 50 source lines of Python 3. Whitespace and documentation bump that up to about 100 lines.

(fjdquery.py)
# Marc Liberatore
# CMPSCI 383 / Fall 2014
# Sample solution to Assignment 05

import csv
import itertools
import sys

CAR_DATA_PATH = 'car.data'
VARIABLE_NAMES = ('buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'car')


def load_data(path):
    """
    :param path:
    :return: a list of dictionaries, one per instance, mapping variable names to values
    """
    with open(path) as f:
        data_reader = csv.DictReader(f, fieldnames=VARIABLE_NAMES)
        return list(data_reader)


def make_values_dict(data):
    """
    Returns the set of possible values associated with each variable in the data.
    :param data: a list of dictionaries, as from load_data
    :return: a dictionary mapping each variable to a set of its possible values
    """
    values = {name: set() for name in VARIABLE_NAMES}
    for instance in data:
        for (variable, vals) in values.items():
            vals.add(instance[variable])
    return values


def load_query(path):
    """
    :param path:
    :return: a pair: the list of query variables, and a dictionary mapping each
             evidence variable to its list of values
    """
    with open(path) as f:
        query_variables = f.readline().split()
        conditions = {}
        for line in f.readlines():
            ls = line.split()
            conditions[ls[0]] = ls[1:]
    return query_variables, conditions


def matches_conditions(instance, conditions={}):
    """
    :param instance: a dictionary mapping each variable to a value
    :param conditions: a dictionary mapping zero or more variables to one
           or more acceptable values
    :return: true iff the instance matches the conditions
    """
    for (variable, values) in conditions.items():
        if instance[variable] not in values:
            return False
    return True


def filter_data(data, conditions={}):
    """
    Returns the subset of the data that matches the given condition(s)
    :param data: a list of dictionaries, as from load_data
    :param conditions: a possibly-empty dictionary of conditions, as from load_query
    :return: a matching subset of the data
    """
    return [i for i in data if matches_conditions(i, conditions)]


def make_query_conditions(query_variables, all_values):
    """
    Returns a list of dictionaries in the format expected by filter_data; each
    dictionary corresponds to one of the possible settings of the query variables.
    :param query_variables: a list of query variables
    :param all_values: a dictionary mapping each variable to its possible
           values as from make_values_dict
    :return: a list of conditions, corresponding to each setting of the query variables
    """
    values_list = [all_values[variable] for variable in query_variables]
    values_product = itertools.product(*values_list)
    return [dict(zip(query_variables, [[v] for v in values]))
            for values in values_product]


def main():
    data = load_data(CAR_DATA_PATH)
    all_values = make_values_dict(data)
    query_variables, conditions = load_query(sys.argv[1])
    conditional_data = filter_data(data, conditions)
    conditional_count = len(conditional_data)
    for query_values in make_query_conditions(query_variables, all_values):
        for variable in query_variables:
            print(query_values[variable][0], end=' ')
        print(len(filter_data(conditional_data, query_values)) / conditional_count)

if __name__ == '__main__':
    main()

I wrote this solution with an emphasis on modularity and testability. I load the entire data file, then filter it based upon the conditions. I then enumerate the possible combinations of settings for the query variables, and count the number of occurrences of each.
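Because the helpers are pure functions of their arguments, they’re easy to spot-check at a REPL or in a unit test without touching the data file. A sketch (repeating two of the function definitions so it runs standalone, with a hand-built instance):

```python
# The two filtering helpers from the solution, spot-checked in isolation.
def matches_conditions(instance, conditions={}):
    for (variable, values) in conditions.items():
        if instance[variable] not in values:
            return False
    return True

def filter_data(data, conditions={}):
    return [i for i in data if matches_conditions(i, conditions)]

instance = {'buying': 'low', 'safety': 'high'}
assert matches_conditions(instance)                           # empty conditions match everything
assert matches_conditions(instance, {'safety': ['high']})
assert not matches_conditions(instance, {'safety': ['low', 'med']})
assert filter_data([instance], {'buying': ['low']}) == [instance]
print('all spot checks passed')
```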

As a result this program is perhaps not as efficient as it could be, though each piece is straightforward. I viewed this as a reasonable trade-off, given that the data set contains only a couple of thousand instances. Another reasonable approach would have been to do a single pass through the data, and to accumulate counts (or probabilities) as I went.
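That single-pass alternative might look like the following sketch (the function name `single_pass_counts` and the tiny inline data set are mine, not part of the assignment): one traversal that skips instances failing the evidence conditions and tallies a count per setting of the query variables, with `collections.Counter` doing the bookkeeping.

```python
import collections

def single_pass_counts(data, query_variables, conditions):
    """One traversal: count matching instances per query-variable setting."""
    counts = collections.Counter()
    total = 0
    for instance in data:
        # Skip instances that fail the evidence conditions.
        if any(instance[v] not in vals for (v, vals) in conditions.items()):
            continue
        total += 1
        counts[tuple(instance[v] for v in query_variables)] += 1
    # Normalize counts into conditional probabilities.
    return {setting: n / total for (setting, n) in counts.items()}

data = [{'safety': 'high', 'car': 'acc'},
        {'safety': 'high', 'car': 'unacc'},
        {'safety': 'low',  'car': 'unacc'}]
print(single_pass_counts(data, ['car'], {'safety': ['high']}))
# {('acc',): 0.5, ('unacc',): 0.5}
```

The trade-off runs the other way here: one pass over the data, but the counting, filtering, and normalization are tangled into a single function rather than separable, individually testable pieces.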

Compare this code with your own, regardless of the language you chose. Is it shorter or longer? Are the purposes of the individual methods clear and distinct? What approach did you take?