Retrace(λ), a multistep off-policy evaluation algorithm (Munos et al. 2016).
Nguyen et al. (2021) combine the quantile representation with a loss based
on the MMD metrics described in Chapter 4. Martin et al. (2020) propose a
proximal update scheme for the quantile representation based on (regularized)
Wasserstein flows (Jordan et al. 1998; Cuturi 2013; Peyré and Cuturi 2019).
Example 6.1 is from Bellemare et al. (2016).
6.3. The categorical temporal-difference algorithm as a mixture update was
presented by Rowland et al. (2018). This is a variant of the C51 algorithm
introduced by Bellemare et al. (2017a), which combines a projection in
Cramér distance with a Kullback–Leibler divergence loss. Distributional versions of
gradient temporal-difference learning (Sutton et al. 2008a; Sutton et al. 2009)
based on the categorical representation have also been explored by Qu et
al. (2019).
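For reference, the mixture update in question can be sketched as follows (in notation along the lines of earlier chapters, for a sampled transition $(x, r, x')$):
\[
\eta_{t+1}(x) = (1 - \alpha_t)\,\eta_t(x) + \alpha_t\,\Pi_c (b_{r,\gamma})_{\#}\eta_t(x') ,
\]
where $\Pi_c$ denotes the categorical projection and $(b_{r,\gamma})_{\#}$ the pushforward through $z \mapsto r + \gamma z$. Because the update mixes two probability distributions supported on the same locations, the iterate remains a valid categorical distribution.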
6.4. The QTD algorithm was introduced by Dabney et al. (2018b). Quantile
regression itself is a long-established tool within statistics, introduced by
Koenker and Bassett (1978); Koenker (2005) is a classic reference on the
subject. The incremental rule for estimating quantiles of a fixed distribution
was in fact proposed by Robbins and Monro (1951), in the same paper that
launched the field of stochastic approximation.
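Concretely, the incremental rule in question can be stated as follows (a standard form, for the quantile at level $\tau \in (0, 1)$): given a new sample $X_{t+1}$ and step size $\alpha_t$, update the estimate $\theta_t$ via
\[
\theta_{t+1} = \theta_t + \alpha_t \big( \tau - \mathbb{1}\{ X_{t+1} \le \theta_t \} \big) .
\]
In expectation, the increment is $\tau - F(\theta_t)$, where $F$ is the cumulative distribution function of the samples, so the update has a fixed point precisely at a $\tau$-quantile.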
6.5. The discussion of sequences of learning rates that result in convergence
goes back to Robbins and Monro (1951), who introduced the field of stochastic
approximation. Szepesvári (1998), for example, considers this framework in
their study of the asymptotic convergence rate of Q-learning. A fine-grained
analysis in the case of temporal-difference learning algorithms, taking
finite-time concentration into account, was undertaken by Even-Dar and Mansour
(2003); see also Azar et al. (2011).
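The step-size conditions in question are the now-standard Robbins–Monro requirements
\[
\sum_{t=0}^{\infty} \alpha_t = \infty , \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty ,
\]
which ensure that the steps are large enough in aggregate to overcome any initialization, yet decay quickly enough for the noise to average out.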
6.6–6.10. Our proof of Theorem 6.9 (via Propositions 6.4, 6.7, and 6.8) closely
follows the argument given by Bertsekas and Tsitsiklis (1996) and Tsitsiklis
(1994). Specifically, we adapt this argument to deal with distributional
information, rather than a single scalar value. Proposition 6.4 is a special case
of the Robbins–Siegmund theorem (Robbins and Siegmund 1971), and a particularly
clear exposition of this and related material is given by Walton (2021). We note
that this result can also be established via earlier results in the stochastic
approximation literature (Dvoretzky 1956), as noted by Jaakkola et al. (1994).
Theorem 6.10 is classical, and results of this kind can be found in Bertsekas and
Tsitsiklis (1996). Theorem 6.12 was first proven by Rowland et al. (2018), albeit
with a monotonicity argument based on that of Tsitsiklis (1994); the argument
given here instead uses a contraction mapping, to match the analysis of the
temporal-difference algorithm. For further background on signed measures, see
Doob (1994).