4.5. The family of $\ell_p$ distances described in this chapter is covered at length
in the work of Rachev et al. (2013), which studies an impressive variety of
probability metrics. A version of the contraction analysis in the Cramér distance
was originally given by Rowland et al. (2018). In two or more dimensions, the
Cramér distance is generalized by the energy distance (Székely 2002; Székely
and Rizzo 2013; Rizzo and Székely 2016), itself a member of the maximum
mean discrepancy (MMD) family (Gretton et al. 2012); contraction analysis
in terms of MMD metrics was undertaken by Nguyen et al. (2021) (see Exer-
cise 4.19 for further details). Another special case of the $\ell_p$ metrics considered in
this chapter is the Kolmogorov–Smirnov distance ($\ell_\infty$), which features in results
in empirical process theory, such as the Glivenko–Cantelli theorem. Many of
these metrics are integral probability metrics (Müller 1997), which allows for a
dual formulation with appealing algorithmic consequences. Chung and Sobel
(1987) provide a nonexpansion result in total variation distance (without naming
it as such; the proof uses an integral probability metric formulation).
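As a concrete illustration of this family, the following is a minimal numerical sketch (not taken from the chapter): it assumes the $\ell_p$ metric between two distributions is the $L^p$ distance between their cumulative distribution functions, discretized here on a fixed grid, and the helper name ell_p_distance is ours. With that convention, $p = 2$ recovers the Cramér distance and $p = \infty$ the Kolmogorov–Smirnov distance.

```python
import numpy as np

def ell_p_distance(probs_1, probs_2, support, p=2.0):
    """L^p distance between the CDFs of two distributions on a common grid."""
    cdf_1 = np.cumsum(probs_1)
    cdf_2 = np.cumsum(probs_2)
    gaps = np.abs(cdf_1 - cdf_2)
    if np.isinf(p):
        return float(gaps.max())                    # Kolmogorov-Smirnov (ell_infinity)
    widths = np.diff(support, append=support[-1])   # width of each grid cell; last is 0
    return float(np.sum(widths * gaps ** p) ** (1.0 / p))

support = np.linspace(0.0, 1.0, 101)
nu_1 = np.full(101, 1.0 / 101)     # (approximately) uniform distribution on [0, 1]
nu_2 = np.zeros(101)
nu_2[50] = 1.0                     # point mass at 0.5

print(ell_p_distance(nu_1, nu_2, support, p=2.0))     # Cramer distance (ell_2)
print(ell_p_distance(nu_1, nu_2, support, p=np.inf))  # Kolmogorov-Smirnov distance
```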
4.6. The properties of regularity, convexity, and $c$-homogeneity were introduced
by Zolotarev (1976) in a slightly more general setting. Our earlier work pre-
sented these in a modern context (Bellemare et al. 2017b), albeit with only a
mention of their potential use in reinforcement learning. Although that work
proposed the term “sum-invariant” as mnemonically simpler, this is only techni-
cally correct when Equation 4.15 holds with equality; we have thus chosen to
keep the original name. Theorem 4.25 is new to this book.
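As a reminder of what these properties assert, their standard Zolotarev forms are given below (the chapter's own definitions, including Equation 4.15, should be consulted for the precise versions used in this book); here $d(X, Y)$ is shorthand for the distance between the distributions of $X$ and $Y$:
\[
d(X + Z,\, Y + Z) \le d(X, Y) \quad \text{(regularity; $Z$ independent of $X$ and $Y$)},
\]
\[
d(\lambda X,\, \lambda Y) = \lambda^{c}\, d(X, Y) \quad \text{($c$-homogeneity; $\lambda > 0$)},
\]
\[
d\Big(\textstyle\sum_i \alpha_i \nu_i,\; \sum_i \alpha_i \nu'_i\Big) \le \sum_i \alpha_i\, d(\nu_i, \nu'_i) \quad \text{(convexity; $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$)}.
\]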
The characterization of the Wasserstein distance as an optimal transport prob-
lem in Proposition 4.18 is the standard presentation of the Wasserstein distance
in more abstract settings, which allows it to be applied to probability distri-
butions over reasonably general metric spaces (Villani 2003, 2008; Ambrosio
et al. 2005; Santambrogio 2015). Optimal transport has also increasingly found
application within machine learning in recent years, particularly in generative
modeling (Arjovsky et al. 2017). Optimal transport and couplings also arise
in the study of bisimulation metrics for Markov decision processes (Ferns et
al. 2004; Ferns and Precup 2014; Amortila et al. 2019), as well as in analytical
tools for sample-based algorithms (Amortila et al. 2020). Peyré and Cuturi
(2019) provide an overview of algorithms, analysis, and applications associated
with optimal transport in machine learning and related disciplines.
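For concreteness, the optimal-transport characterization referenced here can be written (in standard notation, which may differ slightly from that of Proposition 4.18) as
\[
w_p(\nu, \nu') \;=\; \inf_{\gamma \in \Gamma(\nu, \nu')} \Big( \mathbb{E}_{(X, Y) \sim \gamma}\big[\, |X - Y|^p \,\big] \Big)^{1/p}
\;=\; \Big( \int_0^1 \big| F_\nu^{-1}(u) - F_{\nu'}^{-1}(u) \big|^p \, du \Big)^{1/p},
\]
where $\Gamma(\nu, \nu')$ is the set of couplings of $\nu$ and $\nu'$. The second equality, in terms of quantile functions, is particular to distributions on the real line and does not extend to the more general metric spaces treated in the works cited above.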
4.7–4.8.
Villani (2008) gives further discussion on the domain of Wasserstein
distances and on their relationship to weak convergence.
4.9.
The usefulness of a random-variable operator has been a source of intense
debate between the authors of this book. The form we present here is inspired
by the “stack of rewards” model from Lattimore and Szepesvári (2020).