these agents; these differences match what is reported in the literature and are
the default choices in the code used for these experiments (Quan and Ostrovski
2020). The question of how to account for hyperparameters in scientific studies
is an active area of research (see, e.g., Henderson et al. 2018; Ceron and Castro
2021; Madeira Araújo et al. 2021).
Historically, all four agents whose performance is reported here were trained
on what is called the deterministic-action version of the Arcade Learning Envi-
ronment, in which arbitrarily complicated joystick motions can be performed.
For example, nothing prevents the agent from alternating between the “left”
and “right” actions every four frames (fifteen times per emulated second). This
makes the comparison with human players somewhat unrealistic, as human play
involves a minimum reaction time and interaction with a mechanical device that
may not support such high-frequency decisions. In addition, some of the poli-
cies found by agents in the deterministic setting exploit quirks of the emulator
in ways that were clearly not intended by the designer.
To address this issue, more recent versions of the Arcade Learning Environ-
ment implement what is called sticky actions – a procedure that introduces a
variable delay in the environment’s response to the agent’s actions. Figure 10.4
(bottom panels) shows the results of the same experiment as above, but now
with sticky actions. The performance of the various algorithms considered here
generally remains similar, with some per-game differences (e.g., for the game
Space Invaders).
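To make the mechanism concrete, the sketch below shows one common realization of such a delay: with some probability, the previously executed action is repeated in place of the agent's new choice, so that changes of action take effect after a random lag. The wrapper interface and the repeat probability of 0.25 are illustrative assumptions rather than a description of the Arcade Learning Environment's internals.

```python
import random

class StickyActionEnv:
    """Minimal sketch of a sticky-actions wrapper (illustrative only)."""

    def __init__(self, env, repeat_prob=0.25):
        self.env = env
        self.repeat_prob = repeat_prob
        self.last_action = 0  # no-op until the first step

    def reset(self):
        self.last_action = 0
        return self.env.reset()

    def step(self, action):
        # With probability repeat_prob, ignore the newly chosen action
        # and repeat the previous one instead.
        if random.random() < self.repeat_prob:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```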
Although Atari 2600 games are fundamentally deterministic, randomness is
introduced in the learning process by a number of phenomena, including side
effects of distributional value iteration (Section 7.4), state aliasing (Section 9.1),
the use of a stochastic ε-greedy policy, and the sticky-actions delay added by the
Arcade Learning Environment. In many situations, this results in distributional
agents making surprisingly complex predictions (Figure 10.5). A common
theme is the appearance of bimodal or skewed distributions when the outcome
is uncertain – for example, when the agent’s behavior in the next few time steps
is critical to its eventual success or failure. Informally, we can imagine that
because the agent predicts such outcomes, it in some sense “knows” something
more about the state than, say, an agent that only predicts the expected return.
We will see some evidence to this effect in the next section.
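For concreteness, here is a minimal sketch of ε-greedy action selection on top of distributional predictions: with probability ε a uniformly random action is taken, and otherwise the agent acts greedily with respect to the expected return computed from each action's predicted distribution. The function name and array shapes are illustrative assumptions, not taken from any particular agent implementation.

```python
import numpy as np

def epsilon_greedy_action(return_distributions, support, epsilon, rng):
    # return_distributions: (num_actions, m) probabilities over the m
    # support points in `support` (illustrative shapes).
    if rng.random() < epsilon:
        # Explore: pick an action uniformly at random.
        return int(rng.integers(len(return_distributions)))
    # Exploit: act greedily with respect to expected returns.
    q_values = return_distributions @ support
    return int(np.argmax(q_values))
```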
Furthermore, incorporating distributional predictions in a deep reinforcement
learning agent provides an additional degree of freedom in defining the number
and type of predictions that an agent makes at any given point in time. C51, for
example, is parameterized by the number of particles m used to represent probability distributions as well as the range of its support (described by θm).
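As a rough sketch of this parameterization, the snippet below constructs a fixed support of m equally spaced particle locations and converts a vector of logits into probabilities over that support; the values m = 51 and the range [-10, 10] are the commonly reported C51 defaults and are used here purely for illustration.

```python
import numpy as np

m = 51                      # number of particles (commonly reported C51 default)
v_min, v_max = -10.0, 10.0  # assumed support range, for illustration

# Fixed support: m equally spaced particle locations.
support = np.linspace(v_min, v_max, m)

def expected_return(logits):
    # A softmax turns the m logits for one action into probabilities over
    # the fixed support; the expected return is their weighted sum.
    probabilities = np.exp(logits - logits.max())
    probabilities /= probabilities.sum()
    return float(probabilities @ support)
```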
Figure 10.6 illustrates the change in human-normalized interquartile mean (measured