Various options for function approximators include (but are not limited to):
Pick a function approximator that you are comfortable with, and submit a brief writeup describing the function approximator, any design choices you had to make, and a suitable rationale for those choices.
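To give a feel for the kind of design choices involved, here is a minimal sketch (in Python) of one common option: linear approximation over a one-hot state-aggregation feature vector. The class name, grid size, and step size below are illustrative assumptions, not a required implementation.

    import numpy as np

    class LinearQ:
        # e.g., q = LinearQ(n_features=12 * 12, n_actions=4)
        def __init__(self, n_features, n_actions, alpha=0.1):
            self.w = np.zeros((n_actions, n_features))   # one weight vector per action
            self.alpha = alpha

        def features(self, state, n_cols=12):
            # one-hot encoding of a (row, col) grid cell; assumes a 12x12 grid
            x = np.zeros(self.w.shape[1])
            x[state[0] * n_cols + state[1]] = 1.0
            return x

        def value(self, state, action):
            return self.w[action].dot(self.features(state))

        def update(self, state, action, target):
            # semi-gradient step toward whatever TD target the caller supplies
            x = self.features(state)
            self.w[action] += self.alpha * (target - self.value(state, action)) * x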
The points will be given according to the following criteria:
The episodes start in one of the start states in the first column, with equal probability. There are three variants of the problem, A, B, and C; in each, the goal is the square marked with the corresponding letter. There is a reward of +10 on reaching the goal. There is a puddle in the middle of the gridworld, which the agent would like to avoid. Every transition into a puddle cell gives a negative reward that depends on the depth of the puddle at that point, as indicated in the figure.
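For reference, a rough environment skeleton might look like the sketch below. The 12x12 grid size is taken from the policy-printing suggestion later in this handout; the set of start states, the puddle penalties, and the stay-in-place behavior at the borders are placeholder assumptions to be replaced by what the figure actually shows.

    import random

    GOAL_REWARD = 10
    PUDDLE_PENALTY = {}                           # fill in {(row, col): -depth} from the figure
    START_STATES = [(r, 0) for r in range(12)]    # placeholder: the actual first-column start
                                                  # cells are given in the figure

    def reset():
        return random.choice(START_STATES)        # equal probability over start states

    def step(state, action, goal):
        # action is one of 'N', 'S', 'E', 'W'; moves that would leave the grid stay in place
        moves = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
        dr, dc = moves[action]
        nxt = (min(max(state[0] + dr, 0), 11), min(max(state[1] + dc, 0), 11))
        if nxt == goal:
            return nxt, GOAL_REWARD, True                    # +10 on reaching the goal
        return nxt, PUDDLE_PENALTY.get(nxt, 0), False        # depth-dependent penalty in the puddle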
What is a "run", and how long should a run be? A run is a sequence of episodes, with learning accumulating throughout the run. That is, the state-action values are initialized at the start of a run but not re-initialized between episodes of the run. If eligibility traces are used (which is not the case for this part), you do re-initialize the eligibility traces between episodes. Let the duration of a run be determined by how many episodes are required to show a "good amount" of saturation of the learning curve. What "good amount" means is up to you, but the idea is to get to a point where relatively little additional progress is being made. You can initialize the state-action values any way you want, but I suggest zero for all of them, just for the sake of uniformity.
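A sketch of this run/episode bookkeeping is below. The agent and environment callables (env_reset, env_step, choose_action) are assumed names for your own code, and the eligibility-trace reset only matters once traces are introduced in Part 2; the point here is just where things get (re)initialized.

    from collections import defaultdict

    def do_run(env_reset, env_step, choose_action, n_episodes, gamma, lam=0.0):
        Q = defaultdict(float)             # state-action values: set once per run (all zeros)
        rewards_per_episode = []
        for _ in range(n_episodes):
            traces = defaultdict(float)    # eligibility traces: reset at every episode (Part 2 only)
            state, total, done = env_reset(), 0.0, False
            while not done:
                action = choose_action(Q, state)
                next_state, reward, done = env_step(state, action)
                # ... your learning update (using gamma, lam, traces) goes here ...
                total += reward
                state = next_state
            rewards_per_episode.append(total)
        return rewards_per_episode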
What plots are needed? These curves should be something like the ones in Figure 2.1 on p. 29, but where "plays" are now "episodes". So something like the curves in Figure 6.14, for the average reward per episode. Note that we are asking for the average reward per episode, not the average return per episode, which would include the discounting. You might also want to show the latter, but RL people tend not to plot average return (though they should, since the real goal of the agent is to maximize expected return).
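In symbols (my notation, not the book's): if an episode yields rewards r_1, ..., r_T, the two quantities are

    \text{reward for the episode} = \sum_{t=1}^{T} r_t,
    \qquad
    \text{return for the episode} = \sum_{t=1}^{T} \gamma^{\,t-1} r_t ,

and it is the first of these, averaged over runs, that goes on the plots.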
How do you show the policies? I suggest printing a 12x12 2-D array of letters N, S, E, or W for each policy. But whatever you do, please make it 2-D so it is easy to see how it relates to the problem.
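For example, assuming your state-action values live in a dict keyed by ((row, col), action) with actions 'N', 'S', 'E', 'W' (an assumption about your representation, not a requirement), something like this prints the greedy policy as a 12x12 letter grid:

    def print_policy(Q, n_rows=12, n_cols=12, actions=('N', 'S', 'E', 'W')):
        for r in range(n_rows):
            row = [max(actions, key=lambda a: Q.get(((r, c), a), 0.0))
                   for c in range(n_cols)]
            print(' '.join(row))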
More specifically, for this part you should produce the following graphs. For each variant A, B, and C, use two sets of axes. The y-axis of one is "average steps to goal"; the y-axis for the other is "average reward per episode". For each, the x-axis is episodes. Each set of axes has 3 plots on it, one for each value of gamma. So this is 6 sets of axes, with 3 curves each. For policies, you only need to show a single representative optimal policy for each of the 9 combinations of problem variant (A, B, and C) and gamma.
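One possible way to lay these graphs out with matplotlib is sketched below. The results container and the passed-in gamma values are assumptions; substitute the three gammas you were assigned and whatever data structure your code produces.

    import matplotlib.pyplot as plt

    def plot_part1(results, gammas):
        # results[variant][gamma] = (avg_steps_per_episode, avg_reward_per_episode),
        # each a list indexed by episode and already averaged over runs
        for variant in ('A', 'B', 'C'):
            _, (ax_steps, ax_reward) = plt.subplots(1, 2, figsize=(10, 4))
            for gamma in gammas:
                steps, rewards = results[variant][gamma]
                ax_steps.plot(steps, label='gamma = %s' % gamma)
                ax_reward.plot(rewards, label='gamma = %s' % gamma)
            ax_steps.set(xlabel='episode', ylabel='average steps to goal', title='Variant ' + variant)
            ax_reward.set(xlabel='episode', ylabel='average reward per episode', title='Variant ' + variant)
            ax_steps.legend()
            ax_reward.legend()
        plt.show()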
What does "average performance" mean and what is a "learning trial"? Average performance here means "average reward per episode", as in Part 1. A learning trial is the same thing as an episode; sorry about the confusion. Show averages over 50 runs, but now each run need only consist of 25 episodes. I suggest graphs that have lambda on the x-axis and average reward after 25 episodes on the y-axis, with each graph labeled by the appropriate value of gamma.
More specifically, for this part produce the following graphs. For each variant (A, B, and C), use a single set of axes, with x-axis giving the 6 values of lambda and y-axis "average performance after 25 episodes". On each set of axes, plot the results for the three values of gamma. So on each set of axes will be plotted 18 points. You might want to connect the points corresponding to a given value of gamma with lines to make things clear. Now, for each variant, pick what looks like the best lambda for gamma = .9 and produce graphs analogous to those in Part 1, but only for this single (gamma, lambda) pair. So this results in 6 sets of axes and one curve on each. So altogether, combining Parts 1 and 2, you will have 15 sets of axes (!)
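A sketch of the Part 2 sweep is below. It assumes a run procedure do_run(n_episodes, gamma, lam) that returns the list of per-episode rewards for one run (e.g., the earlier run sketch with its environment arguments bound). "Average reward after 25 episodes" is interpreted here as the mean over the 25 episodes, averaged over the 50 runs; adjust if you read it differently.

    import numpy as np

    def lambda_sweep(do_run, lambdas, gammas, n_runs=50, n_episodes=25):
        results = {}
        for gamma in gammas:
            for lam in lambdas:
                per_run = [np.mean(do_run(n_episodes, gamma, lam)) for _ in range(n_runs)]
                results[(gamma, lam)] = float(np.mean(per_run))   # averaged over the 50 runs
        return results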
The points will be given according to the following criteria:
Note: You can program in any language you desire. But if you need any help later on, it will have to be in a language Andrew can understand. So check with him before you start if there is a question. Also, you have to turn in a lot of learning curves. Organize them in a sensible fashion, across both parts, so that the number of graphs you turn in is minimal, yet each graph remains comprehensible.
From state (RD, 8a), on action P, add a transition with probability 0.4 and reward -4 back to (RU, 8p), and change the probability of the existing transition from that state on action P to 0.6. Repeat what you did for Part A, but using this modified MDP.
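If your dynamics are stored as a dict mapping (state, action) to a list of (probability, next_state, reward) triples (a hypothetical representation; adapt it to your own), the change could be applied like this:

    def modify_mdp(P):
        s, a = ('RD', '8a'), 'P'
        (old_p, old_next, old_r), = P[(s, a)]     # assumes a single existing outcome for (s, P)
        P[(s, a)] = [(0.6, old_next, old_r),      # existing transition, probability now 0.6
                     (0.4, ('RU', '8p'), -4)]     # new transition back to (RU, 8p), reward -4
        return P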
Now experiment with changing the order in which you sweep in the policy evaluation phase. Do you see any influence on convergence speed? (This does not have to be an exhaustive study of sweep order. Just try a few variations.)
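To make the sweep-order experiment easy to run, here is a minimal iterative policy-evaluation routine with a configurable state ordering, using the same hypothetical (probability, next_state, reward) representation as above:

    def evaluate_policy(P, policy, states, gamma, order=None, theta=1e-6):
        V = {s: 0.0 for s in states}
        order = list(order) if order is not None else list(states)   # try reversed or shuffled orders
        while True:
            delta = 0.0
            for s in order:            # in-place sweep: updated values are used immediately
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, policy[s])])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V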
Now experiment on this problem with Modified Policy Iteration. Recall that in algorithms going by this name, the policy evaluation phase does not have to be performed to a high degree of accuracy. Devise your own set of experiments to investigate the effect of varying the accuracy required for policy evaluation, or just the number of sweeps.
(This does not have to be an exhaustive study of MPI on this problem. Just try a few variations.)
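A sketch of Modified Policy Iteration under the same assumed representation, with the evaluation phase truncated to a fixed number of sweeps; for your experiments, vary n_eval_sweeps or replace the fixed count with an accuracy threshold.

    def modified_policy_iteration(P, states, actions, gamma, n_eval_sweeps=3):
        V = {s: 0.0 for s in states}
        policy = {s: actions[0] for s in states}
        while True:
            for _ in range(n_eval_sweeps):        # truncated policy evaluation
                for s in states:
                    V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, policy[s])])
            stable = True
            for s in states:                      # greedy policy improvement
                best = max(actions,
                           key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
                if best != policy[s]:
                    policy[s], stable = best, False
            if stable:
                return policy, V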
What to Hand In (on paper):
You may program in any programming language you like, but if you want any help from our esteemed TA, Andrew, it should be one he can read (Python, OCaml, Java, C++, C, Matlab, not Perl). If you'd like to use a language not listed here, please check with Andrew first. Your code MUST be well-formatted and well-documented.
Problems assigned: ex. 2.3, 2.4, 2.5, 2.6, 2.8, 2.16, additional problem (see below).
Additional exercise for Chapter 2 (also due February 14, 2006)
[modified from Rich Sutton's ]
Equation 2.5 is a key update rule we will use throughout the course. This exercise will give you a better hands-on feel for how it works. For this exercise we'll be considering recency-weighted averages of a target signal from time t=0 through time t=15. For all the exercises below, the target signal is: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0]
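The update in Equation 2.5 has the form estimate <- estimate + alpha * (target - estimate). The short script below is optional (the plots are to be done by hand) and can be used to check your hand-drawn trajectories afterwards; exactly where the t=0 point sits on your plot should follow the exercise wording.

    target = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0]

    def trajectory(alpha, start=0.0):
        est, out = start, []
        for t, z in enumerate(target):
            a = alpha(t) if callable(alpha) else alpha   # a callable allows the 1/t schedule in (4)
            est = est + a * (z - est)                    # Estimate <- Estimate + alpha * (Target - Estimate)
            out.append(est)
        return out

    print(trajectory(0.5))                       # part (1)
    print(trajectory(1 / 8))                     # part (2)
    print(trajectory(1.0))                       # part (3)
    print(trajectory(lambda t: 1 / (t + 1)))     # part (4): step sizes 1, 1/2, 1/3, ...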
Get a piece of graph paper (or print this ) and prepare to plot by hand.
(1) Make a vertical axis one inch high that runs from 0 to 1 and a horizontal axis from 0 to 15. Suppose the estimate starts at 0 and the step-size (in the equation) is 0.5. (The estimate should be zero on the first time step.) Plot the trajectory of the estimate through t=15. How close is the estimate to 1.0 after the update at t=4? Suppose for a moment that the target signal continued to be 1.0. Without plotting, how close would it be at t = 10? At t=20?
(2) Start over with a new graph and the estimate again at zero. Plot again, this time with a step size of 1/8.
(3) Make a third graph with a step size of 1.0 and repeat. Which step size produces estimates of smaller error when the target is alternating (i.e., t=10 through t=15)?
(4) Make a fourth graph with a step size of 1/t (i.e., the first step size is 1, the second is 1/2, the third is 1/3, etc.). Based on these graphs, why is the 1/t step size appealing? Why is it not always the right choice?
(5) What happens if the step size is minus 0.5? 1.5? Plot the first few steps if you need to. What is the safe range for the step size?