Proximal Gradient Temporal Difference Learning
In this paper, we show for the first time how gradient TD (GTD) reinforcement learning methods can be formally derived as true stochastic gradient algorithms, not with respect to their original objective functions as previously attempted, but rather using derived primal-dual saddle-point objective functions. We then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms used stochastic approximation techniques to prove asymptotic convergence, and no finite-sample analysis had been attempted. We also propose two novel GTD algorithms, projected GTD2 and GTD2-MP, which use proximal “mirror maps” to yield improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis imply that the GTD family of algorithms is comparable to, and may indeed be preferable to, existing least-squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
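As a rough illustration of the primal-dual view mentioned in the abstract, the sketch below writes the standard GTD2 update as one stochastic gradient step on a saddle-point objective of the form min over theta, max over w of <b - A theta, w> - 0.5 w^T M w, where A, b and M are expectations over sampled transitions. This is a minimal sketch under stated assumptions, not the paper's implementation: it covers only the plain on-policy case with linear features, omits importance weights, and is neither the projected GTD2 nor the GTD2-MP variant. The function name `gtd2_step` and all parameter names are hypothetical.

```python
import numpy as np


def gtd2_step(theta, w, phi, reward, phi_next, alpha, beta, gamma):
    """One GTD2 update, read as a stochastic primal-dual gradient step.

    theta    : value-function weights (primal variable)
    w        : auxiliary weights (dual variable)
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    alpha    : primal step size (assumed name)
    beta     : dual step size (assumed name)
    gamma    : discount factor
    """
    # TD error under the current value estimate.
    delta = reward + gamma * (phi_next @ theta) - phi @ theta

    # Dual ascent: sample gradient w.r.t. w is phi * (delta - phi^T w).
    w_new = w + beta * (delta - phi @ w) * phi

    # Primal descent: sample gradient w.r.t. theta is -(phi - gamma * phi_next) * (phi^T w).
    # The old w is used here, matching the usual statement of GTD2.
    theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ w)

    return theta_new, w_new
```

Each call costs time linear in the number of features, which is the complexity advantage over least-squares TD methods that the abstract refers to.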
Bo Liu is a Ph.D. candidate in the School of Computer Science at the University of Massachusetts Amherst, working under Prof. Sridhar Mahadevan. His primary research areas are machine learning, deep learning, stochastic optimization, and their applications to big data. In the summers of 2011 and 2013, he enjoyed interning at eBay Search Science and Amazon Machine Learning, respectively.