Tuesdays and Thursdays, 2:30 - 3:45pm, CS 142

This course will provide an introduction to, and comprehensive overview of, reinforcement learning. In general, reinforcement learning algorithms repeatedly answer the question "What should be done next?", and they can learn to answer it via trial and error, even when no supervisor tells the algorithm what the correct answer would have been. Applications of reinforcement learning span medicine, marketing, robotics, game playing, environmental problems, and dialogue systems, among many others.

Broad topics covered in this course will include: Markov decision processes; reinforcement learning algorithms (model-based and model-free, batch and online, value-function-based, actor-critic, and policy gradient methods, etc.); hierarchical reinforcement learning; representations for reinforcement learning (including deep learning); and connections to animal learning. Special topics may include ensuring the safety of reinforcement learning algorithms, theoretical reinforcement learning, and multi-agent reinforcement learning. This course will emphasize hands-on experience: assignments will require implementing and applying many of the algorithms discussed in class.

The complete syllabus can be found here. Philip's office hours are Tuesdays and Thursdays from 4pm - 5pm in his office (CS 262). The TA is Nick Jacek, whose office hours are Wednesdays from 1pm - 2pm in CS 207. Nick's e-mail is njacek [at] cs [dot] umass [dot] edu.

Week | Title | Topics | Readings | Notes |
---|---|---|---|---|
1 | Introduction and Describing Environments | Course introduction, definition of reinforcement learning, agent-environment diagram, and Markov decision processes (MDPs). | Sutton and Barto, Chapter 1 and Sections 3.1-3.6 | 1, 2a, 2b, 2c, 3a, 3b, 4a, 4b, 4c |
2 | BBO for Policy Search | Creating a simple first RL agent using the cross-entropy method. | Szita and Lorincz, 2006; Stulp and Sigaud, 2012 | 5a, 5b |
3 | Value Functions | State, state-action, and optimal value functions, and the Bellman equation. | Sutton and Barto, Sections 3.7 and 3.8 | 6a, 6b, 7a, 7b, 8a |
4 | Planning | Policy iteration, value iteration, and the Bellman operator. | Sutton and Barto, Chapter 4 | 9a, 9b, 9c & 10a, 10b |
5 | Monte Carlo Methods | Monte Carlo policy evaluation and control. | Sutton and Barto, Chapter 5 | 11a, 12a, 12b |
6 | Temporal Difference (TD) Learning | TD, the TD error, function approximation, Q-learning, and Sarsa. | Sutton and Barto, Chapters 6 and 8 | 13a, 13b, 14a, 14b, 15a |
8 | Complex Returns | The lambda-return and eligibility traces. | Sutton and Barto, Chapter 7 | 16a, 17a |
9 | Function Approximation | | Sutton and Barto, Chapter 8 | |
10 | Actor-Critics and Policy Gradient | Basic actor-critics, the policy gradient theorem (with function approximation), and REINFORCE. | Sutton et al., 2000; Sutton and Barto, Second Edition Draft (Nov. 5), Chapter 13 | |
11 | Natural (Policy) Gradients | Natural gradient descent and natural actor-critics. | Amari, 1998; Amari and Douglas, 1998; Peters and Schaal, 2008; Thomas et al., 2016; Thomas et al., 2017 [Introduction Only] | Part 1, Part 2 |
12 | Psychology and Neuroscience | The reward prediction error hypothesis of dopamine neuron activity. | Sutton and Barto, 2nd edition chapters on psychology and neuroscience | Notes |
13 | Other Topics | LSTD, inverse RL, shaping rewards, hierarchical RL, deep RL, safe RL, off-policy evaluation, etc. | | Part 1, Part 2 |
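As a small preview of the hands-on flavor of the assignments, here is a minimal sketch of tabular value iteration (the planning topic from the schedule above) on a toy two-state MDP. The MDP, its transition probabilities, and its rewards are invented for illustration; this is not a course assignment or solution.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] = transition probability; R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],  # transitions from state 1
])
R = np.array([
    [0.0, 1.0],  # rewards in state 0 for actions 0 and 1
    [2.0, 0.0],  # rewards in state 1
])
gamma = 0.9  # discount factor

# Value iteration: repeatedly apply the Bellman optimality operator
#   V(s) <- max_a [ R(s,a) + gamma * sum_{s'} P(s,a,s') V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V        # Q[s, a]; P @ V sums over s'
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
print("V*:", V, "greedy policy:", pi)
```

Since the Bellman optimality operator is a gamma-contraction, the loop converges to the optimal value function regardless of the initial `V`; the greedy policy with respect to that fixed point is optimal.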