52 Chapter 3
3.1 The Monte Carlo Method
Birds such as the pileated woodpecker follow a feeding routine that regularly
takes them back to the same foraging grounds. The success of this routine can be
measured in terms of the total amount of food obtained during a fixed period of
time, say a single day. As part of a field study, it may be desirable to predict the
success of a particular bird’s routine on the basis of a limited set of observations:
for example, to assess its survival chances at the beginning of winter based
on feeding observations from the summer months. In reinforcement learning
terms, we view this as the problem of learning to predict the expected return
(total food per day) of a given policy $\pi$ (the feeding routine). Here, variations in
weather, human activity, and other foraging animals are but a few of the factors
that affect the amount of food obtained on any particular day.
In our example, the problem of learning to predict is abstractly a problem
of statistical estimation. To this end, let us model the woodpecker’s feeding
routine as a Markov decision process.$^{17}$ We associate each day with a sample
trajectory or episode, corresponding to measurements made at regular intervals
about the bird’s location $x$, behavior $a$, and per-period food intake $r$. Suppose
that we have observed a set of $K$ sample trajectories,
\[
\Bigl\{ \bigl( x_{k,t}, a_{k,t}, r_{k,t} \bigr)_{t=0}^{T_k - 1} \Bigr\}_{k=1}^{K} , \tag{3.1}
\]
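Concretely, a dataset of this form can be represented as nested lists of (state, action, reward) triples. The following minimal Python sketch uses hypothetical state and action labels; it is an illustration of the data layout, not part of the formal development:

```python
# A sketch of the dataset in (3.1): K trajectories, each a list of
# T (state, action, reward) triples recorded at regular intervals.
# The state and action labels below are hypothetical placeholders.
trajectories = [
    [("oak_grove", "peck", 3.0), ("oak_grove", "peck", 1.5), ("nest", "rest", 0.0)],
    [("oak_grove", "fly", 0.0), ("pine_stand", "peck", 2.0), ("nest", "rest", 0.0)],
]

K = len(trajectories)      # number of observed days (episodes)
T = len(trajectories[0])   # measurements per day; here fixed, T_k = T
```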
where we use $k$ to index the trajectory and $t$ to index time, and where $T_k$ denotes
the number of measurements taken each day. In this example, it is most sensible
to assume a fixed number of measurements $T_k = T$, but in the general setting, $T_k$
may be random and possibly dependent on the trajectory, often corresponding to
the time when a terminal state is first reached. For now, let us also assume that
there is a unique starting state $x_0$, such that $x_{k,0} = x_0$ for all $k$. We are interested
in the problem of estimating the expected return
E
π
h
T −1
X
t=0
γ
t
R
t
i
= V
π
(x
0
) ,
corresponding to the expected per-day food intake of our bird.
18
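The computation this entails can be sketched in a few lines of Python: form the discounted return of each episode from its observed rewards, then average over episodes. The function names and reward values below are hypothetical, and we take $\gamma = 1$ as discussed in footnote 18:

```python
def sample_return(rewards, gamma=1.0):
    """Discounted return sum_t gamma^t * r_t of a single episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def monte_carlo_estimate(reward_sequences, gamma=1.0):
    """Average of the sample returns over the K observed episodes."""
    returns = [sample_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

# Hypothetical per-period food-intake measurements for K = 3 days:
data = [[3.0, 1.5, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
estimate = monte_carlo_estimate(data, gamma=1.0)  # (4.5 + 2.0 + 3.0) / 3
```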
Monte Carlo methods estimate the expected return by averaging the outcomes
of observed trajectories. Let us denote by $g_k$ the sample return for the $k$th
17.
Although this may be a reasonable approximation of reality, it is useful to remember that
concepts such as the Markov property are in this case modeling assumptions, rather than actual
facts.
18.
In this particular example, a discount factor of $\gamma = 1$ is reasonable given that $T$ is fixed and we
should have no particular preference for mornings over evenings.
Draft version.