Tuesdays and Thursdays, 2:30 - 3:45pm, CS 142

This course will provide an introduction to, and comprehensive overview of, reinforcement learning. In general, reinforcement learning algorithms repeatedly answer the question "What should be done next?", and they can learn to answer it via trial and error, even when no supervisor tells the algorithm what the correct answer would have been. Applications of reinforcement learning span medicine, marketing, robotics, game playing, environmental problems, and dialogue systems, among many others.

Broad topics covered in this course will include: Markov decision processes; reinforcement learning algorithms (model-based and model-free, batch and online, value-function-based, actor-critic, and policy gradient methods, etc.); hierarchical reinforcement learning; representations for reinforcement learning (including deep learning); and connections to animal learning. Special topics may include ensuring the safety of reinforcement learning algorithms, theoretical reinforcement learning, and multi-agent reinforcement learning. This course will emphasize hands-on experience: assignments will require implementing and applying many of the algorithms discussed in class.

The complete syllabus can be found here. Philip's office hours are Tuesdays and Thursdays from 4pm - 5pm in his office (CS 262). The TA is Nick Jacek, whose office hours are Wednesdays from 1pm - 2pm in CS 207. Nick's e-mail is njacek [at] cs [dot] umass [dot] edu.

Week | Title | Topics | Readings | Notes |
---|---|---|---|---|
1 | Introduction and Describing Environments | Course introduction, definition of reinforcement learning, agent-environment diagram, and Markov decision processes (MDPs). | Sutton and Barto, Chapter 1 and Sections 3.1-3.6 | 1, 2a, 2b, 2c, 3a, 3b, 4a, 4b, 4c |
2 | BBO for Policy Search | Creating a simple first RL agent using the cross-entropy method. | Szita and Lorincz, 2006; Stulp and Sigaud, 2012 | 5a, 5b |
3 | Value Functions | State, state-action, and optimal value functions, and the Bellman equation. | Sutton and Barto, Sections 3.7 and 3.8 | 6a, 6b, 7a, 7b, 8a |
4 | Planning | Policy iteration, value iteration, and the Bellman operator. | Sutton and Barto, Chapter 4 | 9a, 9b, 9c & 10a, 10b |
5 | Monte Carlo Methods | Monte Carlo policy evaluation and control. | Sutton and Barto, Chapter 5 | 11a, 12a, 12b |
6 | Temporal Difference (TD) Learning | TD, the TD error, function approximation, Q-learning, and Sarsa. | Sutton and Barto, Chapters 6 and 8 | 13a, 13b, 14a, 14b, 15a |
8 | Complex Returns | The lambda-return and eligibility traces. | Sutton and Barto, Chapter 7 | 16a, 17a |
9 | Function Approximation | | Sutton and Barto, Chapter 8 | |
10 | Actor-Critics and Policy Gradient | Basic actor-critics, the policy gradient theorem (with function approximation), and REINFORCE. | Sutton et al., 2000; Sutton and Barto, Second Edition Draft (Nov. 5), Chapter 13 | |
11 | Natural (Policy) Gradients | Natural gradient descent and natural actor-critics. | Amari, 1998; Amari and Douglas, 1998; Peters and Schaal, 2008; Thomas et al., 2016; Thomas et al., 2017 [Introduction Only] | Part 1, Part 2 |
12 | Psychology and Neuroscience | The reward prediction error hypothesis of dopamine neuron activity. | Sutton and Barto, 2nd edition chapters on psychology and neuroscience | Notes |
13 | Other Topics | LSTD, inverse RL, shaping rewards, hierarchical RL, deep RL, safe RL, off-policy evaluation, etc. | | Part 1, Part 2 |
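As a small preview of the hands-on flavor of the assignments, here is a minimal sketch of tabular value iteration (the planning topic from the schedule above) on a toy two-state MDP. The MDP, its transition probabilities, and its rewards are invented for illustration; this is not a course assignment or solution.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] = transition probability; R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],  # transitions from state 1
])
R = np.array([
    [0.0, 1.0],  # rewards in state 0 for actions 0 and 1
    [2.0, 0.0],  # rewards in state 1
])
gamma = 0.9  # discount factor

# Value iteration: repeatedly apply the Bellman optimality operator
#   V(s) <- max_a [ R(s,a) + gamma * sum_{s'} P(s,a,s') V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V        # Q[s, a]; P @ V sums over s'
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
print("V*:", V, "greedy policy:", pi)
```

Since the Bellman optimality operator is a gamma-contraction, the loop converges to the optimal value function regardless of the initial `V`; the greedy policy with respect to that fixed point is optimal.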