
Kernel-Based Reinforcement Learning

Reinforcement Learning is concerned with optimal control in Markov Decision Processes (MDPs) in cases where the transition model is unknown. One possibility is to estimate the value function of the system using training data. We suggest carrying out this estimation nonparametrically using kernel smoothers. In contrast to the widely known ``temporal difference'' training of neural networks, algorithms based on this nonparametric approach can be proven to converge to a unique fixed point. Furthermore, asymptotic theory can be used to bound the approximation error and to shed light on the bias-variance tradeoff in reinforcement learning.

We consider an MDP defined by a filtered probability space $(\Omega,{\cal F},P)$, a cost function $c(x,a)$, and sequences of states and actions, $\{X_1,\dots,X_\infty \vert X_t(\omega)\in \mathbb{R}^d\}$ and $\{a_1,\dots,a_{\infty} \vert a_t\in A\}$, respectively. The objective is to find a policy $\mu$, a mapping from states to actions, that minimizes one of several cost criteria. As an example, we consider the infinite-horizon $\alpha$-discounted expected cost

 \begin{displaymath}V_\mu(x) \equiv \lim_{T\rightarrow \infty} E^\mu_x
\left[ \sum_{t=0}^{T-1} \alpha^{t} c(X_t,\mu(X_t)) \right].
\end{displaymath} (7)
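For concreteness, the discounted cost (7) can be approximated by plain Monte Carlo simulation whenever a simulator of the transition model is available. The sketch below is purely illustrative; the arguments mu, step, and cost are hypothetical stand-ins for the policy, the transition model, and the cost function $c(x,a)$.

import numpy as np

def discounted_cost(x0, mu, step, cost, alpha=0.95, T=1000, n_rollouts=100):
    # Monte Carlo estimate of V_mu(x0) in (7): average the discounted
    # cost over n_rollouts simulated trajectories of length T.
    totals = []
    for _ in range(n_rollouts):
        x, total = x0, 0.0
        for t in range(T):
            a = mu(x)                         # action chosen by the policy
            total += alpha ** t * cost(x, a)  # discount factor alpha^t as in (7)
            x = step(x, a)                    # simulate one transition
        totals.append(total)
    return np.mean(totals)

# toy usage: 1-d random walk, quadratic cost, "do nothing" policy
rng = np.random.default_rng(0)
estimate = discounted_cost(x0=0.0, mu=lambda x: 0.0,
                           step=lambda x, a: x + 0.1 * rng.standard_normal(),
                           cost=lambda x, a: x ** 2,
                           alpha=0.9, T=200, n_rollouts=50)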

It is well-known that optimal policies can be found by solving Bellman's equation:

 \begin{displaymath}V^*(x) = \min_a \left\{ c(x,a) + \alpha E^a_x[V^*(X_1)] \right\} .
\end{displaymath} (8)
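When the transition model is known and the state space is finite, (8) can be solved by standard value iteration, i.e. by repeatedly applying the right-hand side of (8) until convergence. The sketch below assumes hypothetical transition matrices P and a cost matrix C; it is the known-model baseline that the kernel-based scheme described next approximates from data.

import numpy as np

def value_iteration(P, C, alpha=0.95, tol=1e-8, max_iter=10000):
    # P[a, x, y] = Pr(X_1 = y | X_0 = x, action a);  C[a, x] = c(x, a)
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[a, x] = c(x, a) + alpha * E^a_x[V(X_1)], the bracketed term in (8)
        Q = C + alpha * np.einsum('axy,y->ax', P, V)
        V_new = Q.min(axis=0)                  # minimize over actions
        if np.max(np.abs(V_new - V)) < tol:    # sup-norm stopping criterion
            return V_new
        V = V_new
    return V

# toy usage: 2 actions, 3 states, random dynamics and costs
rng = np.random.default_rng(0)
P = rng.random((2, 3, 3))
P /= P.sum(axis=2, keepdims=True)              # normalize into transition probabilities
C = rng.random((2, 3))
V_star = value_iteration(P, C, alpha=0.9)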

In reinforcement learning, however, (8) cannot be solved directly because the conditional expectation operator $(\Gamma_a V)(x) \equiv E^a_x[V(X_1)]$ is unknown. Instead, we suggest using the approximate operator

 \begin{displaymath}(\hat\Gamma_{m,a} V)(x) \equiv \sum_{s=1}^{m} k(z_{s,a},x) V(y_{s,a})
\end{displaymath} (9)

based on a set of sample transitions $S^a \equiv \{(y_{s,a},z_{s,a})\vert s=1,\dots,m\}$, realizations of the random variables $X_{t+1}$ and $X_t$. Here $k(z_{s,a},x)$ is a non-negative weighting kernel satisfying $\sum_{s=1}^{m} k(z_{s,a},x) = 1$. It is straightforward to replace $\Gamma_a$ in (8) with $\hat\Gamma_{m,a}$ and to compute the solution $\hat V$ of the approximate fixed point equation, for example, using value iteration. In detail, because $V$ is evaluated only at the transition endpoints $y_{s,a}$ in (9), a single value iteration step can be written in matrix notation, which greatly simplifies the implementation of the new algorithm.
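A minimal sketch of this matrix-form value iteration is given below, under the assumption of a Gaussian kernel normalized over the sample and an equal number of transitions m per action; the data layout of S and all variable names are illustrative choices rather than the notation of the thesis. Because $\hat V$ only ever needs to be known at the transition endpoints $y_{s,a}$, each iteration reduces to one weight-matrix multiplication per action followed by an elementwise minimum over actions.

import numpy as np

def kernel_value_iteration(S, cost, alpha=0.95, bandwidth=0.5, tol=1e-8, max_iter=5000):
    # S[a] is an (m, 2, d) array of sample transitions for action a,
    # S[a][s] = (z_{s,a}, y_{s,a}); equal m per action is assumed here.
    actions = list(S)
    # V-hat only ever needs to be known at the endpoints y_{s,a}:
    Y = np.vstack([S[a][:, 1, :] for a in actions])             # (n, d), n = m * |A|
    n = Y.shape[0]

    def weights(a):
        Z = S[a][:, 0, :]                                       # start points z_{s,a}
        d2 = ((Y[:, None, :] - Z[None, :, :]) ** 2).sum(-1)     # squared distances, (n, m)
        W = np.exp(-d2 / (2.0 * bandwidth ** 2))                # Gaussian kernel (an assumption)
        return W / W.sum(axis=1, keepdims=True)                 # rows sum to one, as required of k

    K = {a: weights(a) for a in actions}                        # one (n, m) weight matrix per action
    C = {a: np.array([cost(y, a) for y in Y]) for a in actions}

    v = np.zeros(n)                                             # V-hat at the endpoints in Y
    for _ in range(max_iter):
        v_blocks = np.split(v, len(actions))                    # values at y_{s,a}, per action
        # one value iteration step in matrix notation:
        # v <- min_a [ c(., a) + alpha * K_a v_a ]
        Q = np.stack([C[a] + alpha * K[a] @ v_blocks[i]
                      for i, a in enumerate(actions)])
        v_new = Q.min(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return Y, v

# toy usage: two actions shifting a 1-d state left or right, quadratic cost
rng = np.random.default_rng(1)
S = {}
for a in (0, 1):
    z = rng.uniform(-2.0, 2.0, size=(50, 1))                    # start states z_{s,a}
    y = z + (0.5 if a == 1 else -0.5) + 0.1 * rng.standard_normal((50, 1))
    S[a] = np.stack([z, y], axis=1)                             # shape (m, 2, d)
Y, v_hat = kernel_value_iteration(S, cost=lambda x, a: float(x[0] ** 2), alpha=0.9)

Because each row of the weight matrices sums to one and $\alpha < 1$, every step of this iteration is a sup-norm contraction, which is why it settles to a unique fixed point, as discussed next.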

The approximate operator (9) allows for an efficient implementation and a detailed theoretical analysis: First, we show that approximate value iteration using (9) converges to a unique fixed point, $\hat V$. Second, we derive the asymptotic distribution of $\hat V$, which can be used to establish the convergence rate $\Vert\hat V - V^*\Vert _\infty = O(m^{-\frac{1}{2(d+2)}})$. In addition, the asymptotic distribution of $\hat V$ allows for an analysis of the bias-variance tradeoff in reinforcement learning. Specifically, we argue that value function estimates are typically biased downwards regardless of the learning algorithm. This bias is due to Jensen's inequality and the concavity of the minimum operator. Using the asymptotic distribution of (9), we derive analytic expressions to assess the magnitude of this bias in the case of kernel-based reinforcement learning. In principle, these expressions can be used for bias correction. A simple example problem using simulated financial time-series data demonstrates the practical virtues of the new algorithm.
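The source of the downward bias can be seen in a toy simulation: even if every action's cost-to-go is estimated without bias, taking the minimum over noisy estimates systematically undershoots the true minimum, since $E[\min_a \hat Q(x,a)] \le \min_a E[\hat Q(x,a)]$. The numbers below are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
q_true = np.array([1.0, 1.0, 1.0])   # true costs-to-go: every action costs 1.0
noise_sd = 0.3                       # estimation noise (illustrative)

# each trial: unbiased noisy estimates of the three costs, then take the minimum
q_hat = q_true + noise_sd * rng.standard_normal((100000, q_true.size))
v_hat = q_hat.min(axis=1)
print(v_hat.mean())                  # roughly 0.75 < 1.0, a downward bias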


Dirk Ormoneit
1999-10-17