Dirk Ormoneit
Saunak Sen
Peter Glynn
Kernel-Based Reinforcement Learning Reinforcement Learning is concerned with optimal control in Markov Decision Process (MDP) in cases where the transition model is unknown. It has been applied successfully to a variety of practical applications including prominent examples such as Tesauro's Neurogammon and Singh and Bertsekas' dynamic channel allocation algorithms [3, 1]. Another important field of application is in Robotics. A fundamental obstacle to an even more wide-spread application of reinforcement learning to industrial problems is the fact that reinforcement learning algorithms are frequently unstable in the sense that they fail to converge to a solution. This is particularly true for variants of temporal difference learning that use parametric function approximators - e.g. linear combinations of feature vectors or neural networks - to represent the value function of the underlying Markov Decision Process (MDP). In our work, we adopt a non-parametric perspective on reinforcement learning and we suggest a new algorithm that always converges to a unique solution. In detail, we assign value function estimates to the states in a sample trajectory and we update these estimates in an iterative procedure. Here each update is based on a kernel-based averaging operator which has the additional advantage that the wide body of statistical literature on local averaging methods can be brought to bear upon reinforcement learning. In particular, the close connection between kernel-based reinforcement learning and nonlinear regression using so-called kernel smoothers allows us to demonstrate that additional training data always improve the quality of the policy estimate and eventually lead to optimal performance - that is, our algorithm is consistent in a statistical sense.
Related Publications