cumulative probability of observing a particular level of reward in order to
mitigate partial observability in a simple video game. The blackjack and cliff-
walking examples are adapted from Sutton and Barto (2018) and, in the latter
case, inspired by a trip to Ireland taken by one of the authors. In both cases, we put
a special emphasis on the probability distribution of the random return. The
uniform distribution example is taken from Bellemare et al. (2017a); such
discounted sums of Bernoulli random variables also have a long history in
probability theory (Jessen and Wintner 1935; see Solomyak 1995, Diaconis and
Freedman 1999, Peres et al. 2000, and references therein).
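
The uniform-distribution example lends itself to a quick numerical check.
Below is a minimal sketch (our own illustration, not code from Bellemare et
al. 2017a): with discount factor 1/2 and independent Bernoulli(1/2) rewards,
the discounted return amounts to a uniformly random binary expansion and is
thus uniformly distributed on [0, 2].

    import numpy as np

    def discounted_bernoulli_returns(gamma=0.5, p=0.5, horizon=60,
                                     n=100_000, seed=0):
        """Monte Carlo samples of sum_t gamma^t B_t, with B_t ~ Bernoulli(p) iid.

        The horizon truncates the infinite sum; at gamma = 0.5 the tail
        (at most 2**-59) is negligible at double precision.
        """
        rng = np.random.default_rng(seed)
        rewards = rng.binomial(1, p, size=(n, horizon))  # B_t per trajectory
        discounts = gamma ** np.arange(horizon)          # gamma^t
        return rewards @ discounts                       # sum_t gamma^t B_t

    samples = discounted_bernoulli_returns()
    print(samples.min(), samples.max(), samples.mean())  # ~0.0, ~2.0, ~1.0

A histogram of samples is flat on [0, 2]; other discount factors yield the
Bernoulli convolutions studied in the references above.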
2.5–2.6.
The Bellman equation is standard material in most textbooks on the topic.
Particularly thorough treatments can be found in the works of Puterman
(2014) and Bertsekas (2012); the former also provides a good discussion of
the implications of the Markov property and time-homogeneity.
2.7–2.8.
Bellman equations relating quantities other than expected returns were
originally introduced in the context of risk-sensitive control, at varying levels
of generality. The formulation of the distributional Bellman equation in terms
of cumulative distribution functions was first given by Sobel (1982), under the
assumption of deterministic rewards and policies. Chung and Sobel (1987) later
gave a version for random, bounded rewards. See also the work of White (1988)
for a review of some of these approaches and Morimura et al. (2010a) for a
more recent presentation of the CDF Bellman equation.
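
For concreteness, here is a sketch of the CDF form, in notation of our own
choosing rather than that of the cited works (which differ in their
assumptions; Sobel (1982), for instance, takes rewards to be deterministic).
Writing F_s for the cumulative distribution function of the random return
from state s under a fixed policy, and taking γ > 0,

    % Sketch of the CDF Bellman equation (notation ours).
    % F_s(z) is the probability that the return from state s is at most z.
    \[
      F_s(z) = \mathbb{E}\!\left[\, F_{S'}\!\left(\tfrac{z - R}{\gamma}\right) \,\middle|\, S = s \,\right],
    \]

reflecting that the return from s is at most z precisely when the successor
return is at most (z − R)/γ, averaged over the random reward R and successor
state S′.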
Other versions of the distributional Bellman equation have been phrased in
terms of moments (Sobel 1982), characteristic functions (Mandl 1971;
Farahmand 2019), and the Laplace transform (Howard and Matheson 1972;
Jaquette 1973, 1976; Denardo and Rothblum 1979), again at varying levels of
generality, and in some cases using the undiscounted return. Morimura et al.
(2010b) also present a version of the equation in terms of probability
densities. The formulation of the distributional Bellman equation in terms
of pushforward distributions is due to Rowland et al. (2018); the
pushforward notation is broadly used in measure-theoretic probability, and
our use of it is influenced by optimal transport theory (Villani 2008).
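
As a sketch of that formulation (the precise definitions appear in the main
text; the notation here follows Rowland et al. (2018) up to minor
differences), write (b_{r,γ})# ν for the pushforward of a distribution ν
through the bootstrap map b_{r,γ}(z) = r + γz. The return-distribution
function η^π then satisfies

    % Pushforward form of the distributional Bellman equation (sketch).
    % (b_{r,\gamma})_\# \nu is the image of \nu under z -> r + \gamma z.
    \[
      \eta^{\pi}(s) = \mathbb{E}\!\left[\, \big(b_{R,\gamma}\big)_{\#}\, \eta^{\pi}(S') \,\middle|\, S = s \,\right],
    \]

so that each reward–successor pair (r, s′) contributes the distribution of
r + γG, with G distributed according to η^π(s′).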
Distributional formulations have also been used to design Bayesian methods
for reasoning about uncertainty regarding the value function (Ghavamzadeh et
al. 2015). Dearden et al. (1998) propose an algorithm that maintains a posterior
on the return distribution under the assumption that rewards are normally
distributed. Engel et al. (2003, 2007) use a random-variable equation to derive
an algorithm based on Gaussian processes.
Our own work in the field is rooted in the theory of stochastic fixed point
equations (Rösler 1991, 1992; Rachev and Rüschendorf 1995), also known as
recursive distributional equations (Aldous and Bandyopadhyay 2005), from