as well as time consistency. Jiang and Powell (2018) develop sample-based
optimization methods for dynamic risk measures based on quantiles.
Howard and Matheson (1972) consider the optimization of an exponential
utility function applied to the random return by means of policy iteration. The
same objective is given a distributional treatment by Chung and Sobel (1987).
Heger (1994) considers optimizing for worst-case returns. Haskell and Jain
(2015) study the use of occupancy measures over augmented state spaces as an
approach for finding optimal policies for risk-sensitive control; similarly, an
occupancy measure-based approach to CVaR optimization is studied by Carpin
et al. (2016). Mihatsch and Neuneier (2002) and Shen et al. (2013) extend
Q-learning to the optimization of recursive risk measures, where a base risk
measure is applied at each time step. Recursive risk measures are more easily
optimized than risk measures directly applied to the random return but are not
as easily interpreted. Martin et al. (2020) consider combining distributional
reinforcement learning with the notion of second-order stochastic dominance as
a means of action selection. Quantile criteria are considered by Filar et al. (1995)
in the case of average-reward MDPs and, more recently, by Gilbert et al. (2017)
and Li et al. (2022). Delage and Mannor (2010) solve a risk-constrained opti-
mization problem to handle uncertainty in a learned model’s parameters. See
Prashanth and Fu (2021) for a survey on risk-sensitive reinforcement learning.
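To make the recursive construction mentioned above concrete, the following sketch performs value iteration in which a base risk measure is applied to the next-state value at every step. The entropic risk rho_beta(X) = -(1/beta) log E[exp(-beta X)] is used purely for illustration, and the array shapes and function names are assumptions rather than notation from this chapter; the sketch is not the Q-learning algorithm of Mihatsch and Neuneier (2002).

import numpy as np

def entropic_risk(values, probs, beta):
    # Base risk measure rho_beta(X) = -(1/beta) * log E[exp(-beta * X)];
    # it recovers the expectation as beta -> 0 and becomes increasingly
    # risk-averse as beta grows.
    return -np.log(np.dot(probs, np.exp(-beta * values))) / beta

def recursive_risk_value_iteration(P, R, gamma, beta, num_iters=500):
    # P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A).
    # The base risk measure replaces the expectation over next states, so
    # risk is applied at each time step rather than to the random return.
    num_states, num_actions, _ = P.shape
    V = np.zeros(num_states)
    for _ in range(num_iters):
        Q = np.empty((num_states, num_actions))
        for s in range(num_states):
            for a in range(num_actions):
                Q[s, a] = R[s, a] + gamma * entropic_risk(V, P[s, a], beta)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

Because the risk measure is applied inside the backup, the update retains the usual dynamic-programming structure; the price, as noted above, is that the resulting objective is harder to interpret than a risk measure applied directly to the random return.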
7.7.
Sobel (1982) establishes that an operator constructed directly from the
variance-penalized objective does not have the monotone improvement prop-
erty, making its optimization more challenging. The examples demonstrating
the need for randomization and a history-dependent policy are adapted from
Mannor and Tsitsiklis (2011), who also prove the NP-hardness of the problem
of optimizing the variance-constrained objective. Tamar et al. (2012) propose
a policy gradient algorithm for optimizing a mean-variance objective, and Tamar
et al. (2015) do the same for the CVaR objective; see also Prashanth and
Ghavamzadeh (2013) and Chow and Ghavamzadeh (2014) for actor-critic algorithms for these
criteria. Chow et al. (2018) augment the state with the return-so-far in order to
extend gradient-based algorithms to a broader class of risk measures.
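As a rough illustration of this state-augmentation idea, the sketch below wraps an environment so that the agent also observes the discounted return accumulated so far. The reset/step interface and the class name are hypothetical, invented for illustration; they are not taken from Chow et al. (2018) or from any particular library.

class ReturnAugmentedEnv:
    # Wraps an environment assumed to expose reset() -> state and
    # step(action) -> (next_state, reward, done), and presents
    # (state, return_so_far) as the augmented state.

    def __init__(self, env, gamma):
        self.env = env
        self.gamma = gamma

    def reset(self):
        self.return_so_far = 0.0
        self.discount = 1.0
        return (self.env.reset(), self.return_so_far)

    def step(self, action):
        next_state, reward, done = self.env.step(action)
        self.return_so_far += self.discount * reward
        self.discount *= self.gamma
        return (next_state, self.return_so_far), reward, done

A policy conditioned on the augmented state can take the return already accumulated into account, which is what allows gradient-based methods to target risk measures that are not functions of the standard state alone.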
7.8.
The reformulation of the conditional value-at-risk (CVaR) of a random
variable in terms of the (convex) optimization of a function of a variable b ∈R
is due to Rockafellar and Uryasev (2000); see also Rockafellar and Uryasev
(2002) and Shapiro et al. (2009). Bäuerle and Ott (2011) provide an algorithm
for optimizing the CVaR of the random return in Markov decision processes.
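As a point of reference, the reformulation can be checked numerically: under the lower-tail convention, the CVaR at level tau of a random return Z equals the maximum over b in R of b - E[(b - Z)_+] / tau. The sketch below compares this variational form with a direct quantile-based estimate on simulated returns; the sample size and the coarse grid search over b are illustrative choices only.

import numpy as np

def cvar_by_quantile(samples, tau):
    # Lower-tail CVaR at level tau: the average of the worst
    # tau-fraction of outcomes, estimated from samples.
    threshold = np.quantile(samples, tau)
    return samples[samples <= threshold].mean()

def cvar_by_variational_form(samples, tau, num_grid_points=2001):
    # Rockafellar-Uryasev form: maximize b - E[(b - Z)_+] / tau over b,
    # approximated here by a grid search between the extreme samples.
    best = -np.inf
    for b in np.linspace(samples.min(), samples.max(), num_grid_points):
        best = max(best, b - np.mean(np.maximum(b - samples, 0.0)) / tau)
    return best

rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=2.0, size=50_000)
# The two estimates agree up to sampling and discretization error.
print(cvar_by_quantile(returns, 0.1), cvar_by_variational_form(returns, 0.1))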
The work of Bäuerle and Ott forms the basis for the algorithm presented in this section, although
the treatment in terms of return-distribution functions is new here. Another
closely related algorithm is due to Chow et al. (2015), who additionally provide
an approximation error bound on the computed CVaR. Brown et al. (2020) apply