functionals such as moments, quantiles, and CVaR have been of interest for
risk-sensitive control (more on this in the bibliographical remarks of Chapter 7).
Chandak et al. (2021) consider the problem of off-policy Monte Carlo policy
evaluation of arbitrary statistical functionals of the return distribution.
8.2, 8.8.
Sobel (1982) gives a Bellman equation for return-distribution moments
for state-indexed value functions with deterministic policies. More recent work
in this direction includes that of Lattimore and Hutter (2012), Azar et al. (2013),
and Azar et al. (2017), who make use of variance estimates in combination with
Bernstein’s inequality to improve the efficiency of exploration algorithms, as
well as the work of White and White (2016), who use estimated return variance
to set trace coefficients in multistep TD learning methods. Sato et al. (2001),
Tamar et al. (2012), Tamar et al. (2013), and Prashanth and Ghavamzadeh
(2013) further develop methods for learning the variance of the return. Tamar
et al. (2016) show that the operator $T^{\pi}_{(2)}$ is a contraction under a weighted
norm (see Exercise 8.4), develop an incremental algorithm with a proof of
convergence using the ODE method, and study both dynamic programming
and incremental algorithms under linear function approximation (the topic of
Chapter 9).
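As a point of reference, a minimal form of the moment Bellman equations underlying this line of work can be written in the notation of this chapter (with $M^{\pi}(x)$ used here to denote the second moment of the return from state $x$, and assuming the reward and the next-state return are conditionally independent given the transition):
\[
V^{\pi}(x) = \mathbb{E}_{\pi}\big[ R + \gamma V^{\pi}(X') \,\big|\, X = x \big], \qquad
M^{\pi}(x) = \mathbb{E}_{\pi}\big[ R^2 + 2\gamma R\, V^{\pi}(X') + \gamma^2 M^{\pi}(X') \,\big|\, X = x \big],
\]
from which the return variance is recovered as $\mathrm{Var}^{\pi}(x) = M^{\pi}(x) - V^{\pi}(x)^2$.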
8.3–8.5.
The notion of Bellman closedness is due to Rowland et al. (2019),
although our presentation here is a revised take on the idea. The noted connection
between Bellman closedness and diffusion-free representations and the
term “statistical functional dynamic programming” are new to this book.
8.6.
The expectile dynamic programming algorithm is new to this book but
is directly derived from expectile temporal-difference learning (Rowland et
al. 2019). Expectiles themselves were introduced by Newey and Powell (1987)
in the context of testing in econometric regression models, with the asymmetric
squared loss defining expectiles already appearing in Aigner et al. (1976).
Expectiles have since found further application as risk measures, particularly
within finance (Taylor 2008; Kuan et al. 2009; Bellini et al. 2014; Ziegel
2016; Bellini and Di Bernardino 2017). Our presentation here focuses on the
asymmetric squared loss, requiring a finite second-moment assumption, but an
equivalent definition allows expectiles to be defined for all distributions with a
finite first moment (Newey and Powell 1987).
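For concreteness, the characterization referred to above is that, for $\tau \in (0, 1)$, the $\tau$-expectile of a distribution $\nu$ with finite second moment (written here as $e_{\tau}(\nu)$) is the unique minimizer of the asymmetric squared loss:
\[
e_{\tau}(\nu) = \operatorname*{arg\,min}_{m \in \mathbb{R}} \; \mathbb{E}_{Z \sim \nu}\big[ \, |\tau - \mathbb{1}\{ Z \le m \}| \, (Z - m)^2 \, \big],
\]
with $\tau = \tfrac{1}{2}$ recovering the mean.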
8.7.
The study of characteristic functions in distributional reinforcement learning
is due to Farahmand (2019), who additionally provides error propagation
analysis for the characteristic value iteration algorithm, in which value iteration
is carried out with characteristic function representations of return distributions.
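The recursion underlying this approach can be sketched as follows: writing $\chi_x(u)$ here for the characteristic function of the return from state $x$ under $\pi$, and assuming the reward and the next-state return are conditionally independent given the transition, the characteristic functions satisfy
\[
\chi_x(u) = \mathbb{E}_{\pi}\big[ \, e^{iuR} \, \chi_{X'}(\gamma u) \,\big|\, X = x \big],
\]
which mirrors the distributional Bellman equation in the Fourier domain.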
Earlier, Mandl (1971) studied the characteristic function of the return in Markov
decision processes with deterministic immediate rewards and policies. Chow
et al. (2015) combine a state augmentation method (see Chapter 7) with an