In this case, $G^\pi(x)$, $R$, $X'$, and $G^\pi(X')$ are random variables, and the superscript $D$ indicates equality between their distributions. Correctly interpreting the distributional Bellman equation requires identifying the dependency between random variables, in particular between $R$ and $X'$. It also requires understanding how discounting affects the probability distribution of $G^\pi(x)$ and how to manipulate the collection of random variables $G^\pi$ implied by the definition.
Another change concerns how we quantify the behavior of learning algo-
rithms and how we measure the quality of an agent’s predictions. Because
value functions are real-valued vectors, the distance between a value function
estimate and the desired expected return is measured as the absolute difference
between those two quantities. On the other hand, when analyzing a distribu-
tional reinforcement learning algorithm, we must instead measure the distance
between probability distributions using a probability metric. As we will see,
some probability metrics are better suited to distributional reinforcement learn-
ing than others, but no single metric can be identified as the “natural” metric
for comparing return distributions.
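To make the contrast concrete, here is a minimal Python sketch (the return samples are hypothetical) comparing the absolute difference between expected returns with the 1-Wasserstein distance, one common choice of probability metric, computed here with scipy:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two hypothetical samples of returns observed from the same state.
# Their means coincide, but the distributions differ substantially.
returns_a = np.array([0.0, 0.0, 10.0, 10.0])   # bimodal returns
returns_b = np.array([5.0, 5.0, 5.0, 5.0])     # deterministic return

# Expected-value view: distance between value estimates.
value_gap = abs(returns_a.mean() - returns_b.mean())

# Distributional view: 1-Wasserstein distance between the two
# empirical return distributions.
w1_gap = wasserstein_distance(returns_a, returns_b)

print(f"|E[G_a] - E[G_b]| = {value_gap:.2f}")   # 0.00
print(f"W1(G_a, G_b)      = {w1_gap:.2f}")      # 5.00
```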
Implementing distributional reinforcement learning algorithms also poses
some concrete computational challenges. In general, the return distribution
is supported on a range of possible returns, and its shape can be quite com-
plex. To represent this distribution with a finite number of parameters, some
approximation is necessary; the practitioner is faced with a variety of choices
and trade-offs. One approach is to discretize the support of the distribution
uniformly and assign a variable probability to each interval, what we call the
categorical representation. Another is to represent the distribution using a finite
number of uniformly weighted particles whose locations are parameterized,
called the quantile representation. In practice and in theory, we find that the
choice of distribution representation impacts the quality of the return function
approximation and also the ease with which it can be computed.
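As a concrete illustration of these two choices, the following Python sketch (variable names and sizes are illustrative, not taken from the text) sets up both parameterizations: a fixed, uniformly spaced support with learned probabilities for the categorical representation, and fixed uniform weights with learned locations for the quantile representation:

```python
import numpy as np

# --- Categorical representation -----------------------------------------
# Fixed, uniformly spaced support over [v_min, v_max]; the learnable
# parameters are the probabilities attached to each support point.
v_min, v_max, num_atoms = -10.0, 10.0, 51
support = np.linspace(v_min, v_max, num_atoms)       # fixed locations
logits = np.zeros(num_atoms)                         # learned parameters
probs = np.exp(logits) / np.exp(logits).sum()        # softmax -> probabilities

categorical_mean = np.sum(probs * support)

# --- Quantile representation ---------------------------------------------
# Fixed, uniform probability mass (1/N per particle); the learnable
# parameters are the particle locations themselves.
num_quantiles = 32
locations = np.zeros(num_quantiles)                  # learned parameters

quantile_mean = locations.mean()                     # each particle has weight 1/N
```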
Learning return distributions from sampled experience is also more challeng-
ing than learning to predict expected returns. The issue is particularly acute
when learning proceeds by bootstrapping: that is, when the return function
estimate at one state is learned on the basis of the estimate at successor states.
When the return function estimates are defined by a deep neural network, as is
common in practice, one must also take care in choosing a loss function that is
compatible with a stochastic gradient descent scheme.
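As one example of a loss that works with stochastic gradient updates, the quantile representation is often trained by quantile regression against bootstrapped target samples of the form $r + \gamma z'$, where $z'$ are the particle locations estimated at the successor state. The sketch below (function and variable names are my own) computes that loss in NumPy:

```python
import numpy as np

def quantile_regression_loss(theta, targets, taus):
    """Quantile regression loss between predicted particle locations
    `theta` (shape [N]) and bootstrapped target samples `targets`
    (shape [M]), at quantile levels `taus` (shape [N])."""
    # Pairwise TD-like errors: target_j - theta_i.
    u = targets[None, :] - theta[:, None]                 # shape [N, M]
    # Asymmetric weighting: tau when the error is positive, (1 - tau) otherwise.
    weight = np.abs(taus[:, None] - (u < 0).astype(float))
    return np.mean(weight * np.abs(u))

# Illustrative bootstrapped targets: reward plus discounted successor particles.
gamma = 0.99
reward = 1.0
next_locations = np.array([0.0, 2.0, 4.0, 6.0])           # estimate at successor state
targets = reward + gamma * next_locations

num_quantiles = 4
taus = (np.arange(num_quantiles) + 0.5) / num_quantiles   # midpoint quantile levels
theta = np.zeros(num_quantiles)                           # current estimate

print(f"quantile regression loss: {quantile_regression_loss(theta, targets, taus):.3f}")
```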
For an agent that only knows about expected returns, it is natural (almost
necessary) to define optimal behavior in terms of maximizing this quantity.
The Q-learning algorithm, which performs credit assignment by maximizing
over state-action values, learns a policy with exactly this objective in mind.
Knowledge of the return function, however, allows us to define behaviors