CMPSCI 687: Reinforcement Learning

Fall 2017, University of Massachusetts Amherst

Tuesdays and Thursdays, 2:30 - 3:45pm, CS 142

Course Description

This course will provide an introduction to, and comprehensive overview of, reinforcement learning. In general, reinforcement learning algorithms repeatedly answer the question "What should be done next?", and they can learn the answer via trial and error even when no supervisor tells the algorithm what the correct answer would have been. Applications of reinforcement learning span medicine, marketing, robotics, game playing, environmental applications, and dialogue systems, among many others.

Broad topics covered in this course will include: Markov decision processes; reinforcement learning algorithms (model-based / model-free, batch / online, value-function-based methods, actor-critics, policy gradient methods, etc.); hierarchical reinforcement learning; representations for reinforcement learning (including deep learning); and connections to animal learning. Special topics may include ensuring the safety of reinforcement learning algorithms, theoretical reinforcement learning, and multi-agent reinforcement learning. This course will emphasize hands-on experience, and assignments will require the implementation and application of many of the algorithms discussed in class.


The complete syllabus can be found here. Philip's office hours are Tuesdays and Thursdays from 4pm - 5pm in his office (CS 262). The TA is Nick Jacek, whose office hours are Wednesdays from 1pm - 2pm in CS 207. Nick's e-mail is njacek [at] cs [dot] umass [dot] edu.

Schedule (each entry lists the topic number and title, the topics covered, assigned readings, and lecture notes)

Topic 1: Introduction and Describing Environments
  Topics: Course introduction, definition of reinforcement learning, agent-environment diagram, and Markov decision processes (MDPs).
  Readings: Sutton and Barto, Chapter 1 and Sections 3.1-3.6
  Notes: 1, 2a, 2b, 2c, 3a, 3b, 4a, 4b, 4c

Topic 2: BBO for Policy Search
  Topics: Creating a simple first RL agent using the Cross-Entropy Method.
  Readings: Szita and Lorincz, 2006; Stulp and Sigaud, 2012
  Notes: 5a, 5b

Topic 3: Value Functions
  Topics: State, state-action, and optimal value functions, and the Bellman equation.
  Readings: Sutton and Barto, Sections 3.7 and 3.8
  Notes: 6a, 6b, 7a, 7b, 8a

Topic 4: Planning
  Topics: Policy iteration, value iteration, and the Bellman operator.
  Readings: Sutton and Barto, Chapter 4
  Notes: 9a, 9b, 9c & 10a, 10b

Topic 5: Monte Carlo Methods
  Topics: Monte Carlo policy evaluation and control.
  Readings: Sutton and Barto, Chapter 5
  Notes: 11a, 12a, 12b

Topic 6: Temporal Difference (TD) Learning
  Topics: TD, the TD error, function approximation, Q-learning, and Sarsa.
  Readings: Sutton and Barto, Chapters 6 and 8
  Notes: 13a, 13b, 14a, 14b, 15a

Topic 8: Complex Returns
  Topics: The lambda-return and eligibility traces.
  Readings: Sutton and Barto, Chapter 7
  Notes: 16a, 17a

Topic 9: Function Approximation
  Readings: Sutton and Barto, Chapter 8

Topic 10: Actor-Critics and Policy Gradient
  Topics: Basic actor-critics, the policy gradient theorem (with function approximation), and REINFORCE.
  Readings: Sutton et al., 2000; Sutton and Barto, Second Edition Draft (Nov. 5), Chapter 13

Topic 11: Natural (Policy) Gradients
  Topics: Natural gradient descent and natural actor-critics.
  Readings: Amari, 1998; Amari and Douglas, 1998; Peters and Schaal, 2008; Thomas et al., 2016; Thomas et al., 2017 [Introduction Only]
  Notes: Part 1, Part 2

Topic 12: Psychology and Neuroscience
  Topics: The reward prediction error hypothesis of dopamine neuron activity.
  Readings: Sutton and Barto, 2nd edition, chapters on psychology and neuroscience
  Notes: Notes

Topic 13: Other Topics
  Topics: LSTD, inverse RL, shaping rewards, hierarchical RL, deep RL, safe RL, off-policy evaluation, etc.
  Notes: Part 1, Part 2
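To give a flavor of the hands-on assignments, here is a minimal sketch of value iteration (one of the planning algorithms covered in Topic 4) on a small chain-world MDP. The 4-state chain, its rewards, and the discount factor are illustrative assumptions for this sketch, not part of the course materials.

```python
# A minimal value iteration sketch on a hypothetical 4-state chain MDP.
# The agent moves left or right; reaching the rightmost (terminal) state
# yields reward 1, and all other transitions yield reward 0.
import numpy as np

n_states = 4          # states 0..3; state 3 is terminal
gamma = 0.9           # discount factor (an illustrative choice)
actions = (-1, +1)    # move left or move right along the chain

def step(s, a):
    """Deterministic transition: clamp to the chain; reward 1 on entering state 3."""
    s2 = min(max(s + a, 0), n_states - 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

# Value iteration: repeatedly apply the Bellman optimality operator
# until the value function stops changing.
V = np.zeros(n_states)
for _ in range(100):
    V_new = np.zeros(n_states)
    for s in range(n_states - 1):  # the terminal state keeps value 0
        V_new[s] = max(r + gamma * V[s2]
                       for s2, r in (step(s, a) for a in actions))
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print(V)  # values grow toward the rewarding terminal state
```

On this chain the fixed point is easy to verify by hand: the state adjacent to the terminal state has value 1, and each step farther left multiplies the value by the discount factor.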