In this presentation we describe the integration of first-order constrained optimization with the two most widely used algorithm families in reinforcement learning: TD-learning and policy gradient methods. We present the first sparse Q-learning algorithm, as well as the first off-policy convergent TD-learning algorithm, termed regularized off-policy TD-learning (RO-TD). We also extend the unconstrained natural actor-critic (NAC) to a constrained projected natural actor-critic. All of these algorithms are formulated under the unifying framework of online convex optimization and mirror descent.
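As a rough illustration of the mirror descent framework mentioned above (this is a generic sketch, not code from the talk): with the negative-entropy mirror map, mirror descent over the probability simplex reduces to multiplicative "exponentiated gradient" updates, where renormalization plays the role of the Bregman projection. The function and step size below are chosen purely for illustration.

```python
import numpy as np

def mirror_descent_simplex(grad, x0, step=0.1, iters=200):
    """Minimize a convex function over the probability simplex via
    mirror descent with the negative-entropy mirror map."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        x = x * np.exp(-step * g)  # mirror (exponentiated-gradient) step
        x = x / x.sum()            # renormalize: Bregman projection onto simplex
    return x

# Example: minimize the linear objective <c, x> over the simplex.
# The optimum concentrates all mass on the smallest entry of c.
c = np.array([0.9, 0.2, 0.5])
x = mirror_descent_simplex(lambda x: c, np.ones(3) / 3)
```

After the loop, `x` is close to the vertex `[0, 1, 0]`, since index 1 has the smallest cost. The same entropic-map idea underlies sparsity-inducing and projected variants in the RL setting.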
Bo Liu received the M.S. degree in computer engineering from Stevens Institute of Technology, USA. Currently, he is pursuing the Ph.D. degree in computer science at the University of Massachusetts, USA, advised by Sridhar Mahadevan. His research interests include machine learning, reinforcement learning, transfer learning (semi-supervised and supervised), deep learning, and stochastic optimization.
Philip Thomas is a Ph.D. candidate in the School of Computer Science at the University of Massachusetts Amherst, advised by Andrew G. Barto. He is a member of the Autonomous Learning Laboratory and received M.S. and B.S. degrees in computer science from Case Western Reserve University. His primary research interest is reinforcement learning.