There was an error in the bicycle code originally posted here. The code in the downloads was corrected on March 21, 2016 after the error was pointed out by Samuele Tosatto (thanks!). The change is a single line (line 22) in Bicycle.cpp. The erroneous line is: `I_bike = (13.0/3.0)*Mc*h*h + Mp*sqrt(h+dCM);`

The correct line is: `I_bike = (13.0/3.0)*Mc*h*h + Mp*(h+dCM)*(h + dCM);`

We provide three (identical) versions of the project. The first is a Visual Studio 2013 project that was tested on Windows 10, and comes as a .zip.

Download Visual Studio 2013 Project (.zip)

The second is a Code::Blocks project that was tested on Ubuntu 14.04.3 LTS, and comes as a .tar.gz.

Download Code::Blocks Project (.tar.gz)

The third is the raw C++ code without a project built around it, and comes as a .tar.gz. It can be compiled on Ubuntu using a recent version of gcc by running the following line from the terminal inside the ConsistentBellman folder:

```
g++ src/*.cpp -std=c++11 -O3 -DNDEBUG -fopenmp -I header -I lib -o ConsistentBellman
```

Once compiled, run the program without any command line arguments (it ignores them). You will see a simple menu system where you can make several selections. First you will be asked what domain you want to use, and you can choose from several canonical domains used in reinforcement learning research: mountain car, puddle world, acrobot, cart-pole, pendulum (swing-up and balance), and bicycle.

Next you will be asked what "order" you want to use. We use multilinear interpolation (the extension of bilinear interpolation to higher dimensions) over a uniform grid of points to estimate the q-function. The "order" that this refers to is the number of points used along each dimension. Notice that the total number of points (weights used by the function approximator) is `pow(order, dim)`, where `dim` is the dimension of the (continuous) state-space. If you select an order that is too low, then the approximation of the q-function will be poor and the policy may not perform well. If you select an order that is too high, then it may take a large number of iterations of value iteration for the policy to become good, and each iteration may take a long time.

Next you will be asked whether or not you would like to use "value averaging". In short, for anything but bicycle, do not use value averaging. For bicycle, use value averaging. In more detail: if the state-transitions of the domain are stochastic (not deterministic), then value iteration calls for averaging over all possible outcomes of taking an action in a state. If there is a large number of possible outcomes (e.g., in bicycle there are infinitely many), then we can only sample some possible outcomes. However, to get a good estimate of the true average, we may have to use a massive number of samples, which is computationally expensive. To get around this, you can use "value averaging", wherein only a single sample is used, but the q-values aren't changed to the new values given by the (consistent) Bellman operator - instead they are moved a small amount toward the new values (we take a weighted average of the old and new values).

You will then be asked which maxQ variant you want to use. Here "Ordinary Bellman" corresponds to performing value iteration using the Bellman operator, as has been done for decades. The option "Consistent Bellman" corresponds to performing value iteration using the newly proposed *consistent Bellman operator*. We refer to these as "maxQ variants" because the consistent Bellman operator can be viewed as replacing the term in the Bellman operator that takes the maximum over actions with a different term.

You will then be asked how many iterations of value iteration to run, as well as how many threads to use when running value iteration and when evaluating policies. Threading the code actually made it run slower in most cases (even if you select only one thread!). However, most cases are fast anyway. The threading can help a lot when running the bicycle domain, where we suggest using the number of available cores (or hyperthreads) as the number of threads.

After you select the number of threads, you will be asked how often the policy should be evaluated. If you select 1, then after every iteration of value iteration, the greedy policy with respect to the q-function will be evaluated and the resulting performance stored. However, this can be costly - evaluating the performance of a policy can take a long time on some domains. So, you may wish to only evaluate the policy after every few iterations (e.g., the paper evaluated policies for bicycle after every 10 iterations).
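The control flow implied by this option can be sketched as below. The function name is a hypothetical stand-in, not the download's actual API:

```cpp
#include <vector>

// Returns the iteration numbers at which the greedy policy would be
// evaluated, given the total iteration count and the evaluation period
// chosen in the menu. Sketch of the schedule, not the download's code.
std::vector<int> evaluationSchedule(int numIterations, int evalEvery) {
    std::vector<int> evaluatedAt;
    for (int iter = 1; iter <= numIterations; ++iter) {
        // ... one full sweep of (consistent) value iteration runs here ...
        if (iter % evalEvery == 0) evaluatedAt.push_back(iter);
    }
    return evaluatedAt;
}
```

For example, 2000 iterations with a period of 10 evaluates the policy 200 times, at iterations 10, 20, ..., 2000.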

Finally, you will be asked whether you want to print the results to a file. If you select "yes", then you will be asked for a file name to print the results to. The file will contain a sequence of iteration numbers and the corresponding performance of the policy at each iteration. Importantly, the performances are **not** necessarily the (discounted) returns - they are a meaningful statistic for the domain. For example, for bicycle the performance is 0 if the bicycle does not reach the goal and 1 if the bicycle does reach the goal. This is useful for domains where the reward is shaped, and so looking directly at returns isn't particularly informative. Another example is the pendulum domain: the performance is the amount of time during the 20-second episode that the pendulum is near-vertical. A brief description of what statistic is being reported as "performance" is printed to the console while the program runs, and is included at the top of the output file.

Once you have finished selecting options in the menu, value iteration will begin running. Every now and then (depending on what policy evaluation frequency you selected) the program will print a line: `Iteration: #1 [performance string] = #2`. This means that after running `#1` iterations of value iteration, the performance of the greedy policy with respect to the current q-function is `#2`. Since different domains use different notions of performance (not just returns), `[performance string]` is a brief description of what the performance value, `#2`, encodes.

Recall from the previous section that we present different performance metrics because, for some domains, the return (sum of rewards) is not very informative. Here we review the performance measures used for each domain.

- Mountain Car uses "Timesteps in episode". Since the rewards are always -1, the agent's goal is to get to the top of the hill (which terminates the episode) as soon as possible. So, a lower "Timesteps in episode" is better, and the undiscounted return is simply the negative of this value.
- Puddle World uses the undiscounted return.
- Acrobot uses the episode duration, in seconds. The maximum possible duration is 200 seconds. Since the goal is to terminate the episode as soon as possible, lower values are better. Notice that the time steps are 0.02 seconds, so 200 seconds corresponds to 1000 time steps.
- Cart-Pole uses the episode duration in seconds, with a maximum of 20. Since the goal is to balance the pole for as long as possible, larger values are better and a value of 20 is optimal - the pole did not fall over.
- Pendulum uses "time near-vertical". This is the amount of time, in seconds, that the pendulum is held near vertical during the 20-second episode. Since the pendulum is initially pointing down, the agent must first swing it up, and so it is not possible for the pendulum to be vertical for the full 20 seconds.
- Bicycle uses a binary value (C++ bool) that encodes whether or not the goal was reached. A zero denotes that the bicycle fell over before the goal was reached, and a one denotes that the bicycle reached the goal without falling over. Since policy evaluations are averaged over several trajectories for stochastic domains like bicycle, the value may be between 0 and 1, which encodes the number of times the goal was reached divided by the number of trials.

Below we provide a simple example, where we use the consistent Bellman operator to solve the cart pole domain. The menu options used are: `3 30 1 1 20 16 16 1 0 out_Consistent_CartPole.txt`. Later we provide additional text files for runs using other domains and settings, which all use this same format. At the top we quickly describe the menu options used, and below "Console output:" we provide the copy-pasted output showing our use of these menu parameters.

```
***********************************************************
CartPole using Consistent Bellman Equation
***********************************************************
Options used when running:
3: CartPole
30: Order --- this means we will have 30^4 = 810,000 weights to tune
1: No value averaging
1: Consistent Bellman
20: Number of iterations of value iteration
16: Number of threads used by value iteration (set to your number of cores)
16: Number of threads used to evaluate a policy (set to your number of cores)
1: Evaluate policy after every iteration
0: Print results to file
out_Consistent_CartPole.txt: The file to print the results to

Console output:
Select environment:
[0] Mountain Car
[1] Puddle World
[2] Acrobot
[3] Cart Pole
[4] Pendulum Swing-Up and Balance
[5] Bicycle
Selection: 3
Enter the order (state is 4 dimensional): 30
Select value averaging:
[0] Yes
[1] No
Selection: 1
Select maxQ Variant:
[0] Ordinary Bellman
[1] Consistent Bellman
Selection: 1
Enter number of iterations of value iteration: 20
Enter number of threads for value-iteration (we suggest # available cores): 16
Enter number of threads for policy evaluation (we suggest # available cores): 16
Evaluate policy every how-many iterations? 1
Select Print results to file:
[0] Yes
[1] No
Selection: 0
Enter filename: out_Conssitent_CartPole.txt
Initially: Episode duration = 0.22
Iteration: 1 Episode duration = 0.22
Iteration: 2 Episode duration = 0.24
Iteration: 3 Episode duration = 20
Iteration: 4 Episode duration = 20
Iteration: 5 Episode duration = 20
Iteration: 6 Episode duration = 20
Iteration: 7 Episode duration = 20
Iteration: 8 Episode duration = 20
Iteration: 9 Episode duration = 20
Iteration: 10 Episode duration = 20
Iteration: 11 Episode duration = 20
Iteration: 12 Episode duration = 20
Iteration: 13 Episode duration = 20
Iteration: 14 Episode duration = 20
Iteration: 15 Episode duration = 20
Iteration: 16 Episode duration = 20
Iteration: 17 Episode duration = 20
Iteration: 18 Episode duration = 20
Iteration: 19 Episode duration = 20
Iteration: 20 Episode duration = 20
Printing results to out_Conssitent_CartPole.txt
```

For cart-pole, the performance statistic is the length of the episode in seconds, with a maximum of 20 seconds. Initially the pole falls over within 0.22 seconds, but using the greedy policy produced by three iterations of value iteration causes the pole to not fall over during the 20 seconds.

**Below we provide downloads for similar text files that show how to get decent results using both the ordinary and consistent Bellman operators on all of the domains. The only setup that we were unable to get working is using the ordinary Bellman operator for the bicycle domain.**

To produce results on the bicycle domain similar to those shown in the paper, use the menu options `5 8 0 1 2000 16 16 10 0 out_Consistent_Bicycle.txt`, or use `10` for the second option. Plotting the resulting output with `8` for the second option gives the following (compare to the top right plot of Figure 1 in the appendix - differences are likely due to noise - both plots use only one trial):

For simplicity, I've included the version of the Eigen library that I used in the source code. If you already have Eigen, then you may remove this folder and point the includes to your Eigen folder. However, since I used Eigen and include it in the download, I must state that the Eigen folder is distributed under MPL 2.0.

The rest of the code I "release to the public domain". The answer to your question of "can I use this?" is almost certainly "yes". Formally, all code other than the Eigen folder is provided under the CC0 1.0 license, which is summarized here.

If you have any questions, please e-mail me at PThomasCS [at] gmail [dot] com.