Originally introduced by Richard E. Bellman in (Bellman 1957), stochastic dynamic programming is a technique for modelling and solving problems of decision making under uncertainty.
The aim is to compute a policy prescribing how to act optimally in the face of uncertainty.
A gambler has $2; she is allowed to play a game of chance 4 times and her goal is to maximize her probability of ending up with at least $6. If the gambler bets $b on a play of the game, then with probability 0.4 she wins the game, recoups the initial bet, and increases her capital position by $b; with probability 0.6, she loses the bet amount $b; all plays are pairwise independent.[1] Stochastic dynamic programming can be employed to model this problem and determine a betting strategy that, for instance, maximizes the gambler's probability of attaining a wealth of at least $6 by the end of the betting horizon.
Note that if there is no limit to the number of games that can be played, the problem becomes a variant of the well known St. Petersburg paradox.
Without loss of generality, in what follows we will consider a reward maximisation setting. In deterministic dynamic programming one usually deals with functional equations taking the following structure:

f_t(s_t) = max_{x_t ∈ X_t} { p_t(s_t, x_t) + f_{t+1}(s_{t+1}) },

where s_t is the state at the beginning of stage t, x_t the action selected at that stage, p_t(s_t, x_t) the immediate reward, and s_{t+1} = g_t(s_t, x_t) the state produced by the state transition function g_t; the boundary condition of the system is

f_n(s_n) = max_{x_n ∈ X_n} p_n(s_n, x_n).

The aim is to determine the set of optimal actions that maximise f_1(s_1). Given the current state s_t and the current action x_t, we know with certainty the reward secured during the current stage and – thanks to the state transition function g_t – the state towards which the system transitions at the beginning of the next stage.
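To make the recursion concrete, here is a minimal, self-contained sketch of a deterministic dynamic program on made-up data: a three-stage resource-consumption setting in which the reward table and all identifiers are purely illustrative and not taken from the sources cited here. The array f[t][s] plays the role of f_t(s_t) and the transition s − x plays the role of g_t(s_t, x_t).

```python
# Minimal deterministic dynamic program (illustrative data, not from the article).
N = 3                                   # number of stages
MAX_S = 5                               # units of resource available at stage 1
REWARD = [0, 4, 6, 7, 7, 7]             # p_t(s, x): reward for consuming x units (any stage)

# f[t][s]: best total reward obtainable from stage t onwards with s units left;
# the row t = N + 1 encodes the boundary condition (no reward after the last stage).
f = [[0.0] * (MAX_S + 1) for _ in range(N + 2)]
best = [[0] * (MAX_S + 1) for _ in range(N + 1)]

for t in range(N, 0, -1):               # stages N, N-1, ..., 1
    for s in range(MAX_S + 1):          # every feasible state at stage t
        # transition: consuming x units leaves s - x for the remaining stages
        f[t][s], best[t][s] = max((REWARD[x] + f[t + 1][s - x], x) for x in range(s + 1))

print("f_1(%d) =" % MAX_S, f[1][MAX_S])         # value of an optimal policy
print("optimal first-stage consumption:", best[1][MAX_S])
```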
In practice, however, even if we know the state of the system at the beginning of the current stage as well as the decision taken, the state of the system at the beginning of the next stage and the current period reward are often random variables that can be observed only at the end of the current stage.
Stochastic dynamic programming deals with problems in which the current period reward and/or the next period state are random, i.e. with multi-stage stochastic systems.
The decision maker's goal is to maximise expected (discounted) reward over a given planning horizon.
In their most general form, stochastic dynamic programs deal with functional equations taking the following structure:

f_t(s_t) = max_{x_t ∈ X_t(s_t)} { E[ reward during stage t | s_t, x_t ] + α Σ_{s_{t+1}} Pr(s_{t+1} | s_t, x_t) f_{t+1}(s_{t+1}) },

where f_t(s_t) is the maximum expected reward that can be attained during stages t, t+1, …, n given state s_t at the beginning of stage t; x_t belongs to the set X_t(s_t) of feasible actions at stage t given initial state s_t; α is the discount factor; and Pr(s_{t+1} | s_t, x_t) is the conditional probability that the state at the beginning of stage t+1 is s_{t+1}, given that the current state is s_t and the selected action is x_t. Markov decision processes represent a special class of stochastic dynamic programs in which the underlying stochastic process is a stationary process that features the Markov property.
The gambling game can be formulated as a stochastic dynamic program as follows: there are n = 4 games (i.e. stages) in the planning horizon; the state s in period t represents the initial wealth at the beginning of period t; the action given state s in period t is the bet amount b; the transition probabilities follow directly from the probability of winning (0.4) or losing (0.6) a play; and f_t(s) denotes the probability that, by the end of game 4, the gambler's wealth is at least $6, given that she has $s at the beginning of game t. The functional equation is therefore f_t(s) = max_{0 ≤ b ≤ s} { 0.4 f_{t+1}(s + b) + 0.6 f_{t+1}(s − b) }.
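As an illustration of these ingredients, the following sketch encodes the feasible actions and the one-step transition probabilities Pr(s' | s, b) of this particular instance; the function names are ours, and the snippet assumes the win/loss probabilities of 0.4 and 0.6 given above.

```python
# Elements of the gambling game as a stochastic dynamic program (illustrative sketch).
P_WIN, P_LOSE = 0.4, 0.6      # probabilities of winning / losing a single play
TARGET, GAMES = 6, 4          # goal wealth and number of stages

def actions(s):
    """Feasible bets when the current wealth is s dollars."""
    return range(0, s + 1)

def transition_probabilities(s, b):
    """Pr(s' | s, b): distribution of next-stage wealth after betting b with wealth s."""
    if b == 0:
        return {s: 1.0}
    return {s + b: P_WIN, s - b: P_LOSE}

# Example: with $2 in hand and a $1 bet, the next state is $3 w.p. 0.4 and $1 w.p. 0.6.
print(transition_probabilities(2, 1))
```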
Stochastic dynamic programs can be solved to optimality by means of backward recursion or forward recursion algorithms; memoization is typically employed to enhance performance. However, like its deterministic counterpart, stochastic dynamic programming suffers from the curse of dimensionality. For this reason, approximate solution methods are typically employed in practical applications.
Given a bounded state space, backward recursion (Bertsekas 2000) begins by tabulating f_n(s_n) for every possible state s_n belonging to the final stage n. Once these values are tabulated, together with the associated optimal state-dependent actions x_n(s_n), it is possible to move to stage n − 1 and tabulate f_{n−1}(s_{n−1}) for all possible states belonging to that stage. The process continues by considering in a backward fashion all remaining stages up to the first one. Once this tabulation process is complete, f_1(s_1) – the value of an optimal policy given initial state s_1 – as well as the associated optimal action x_1(s_1) can be easily retrieved from the table. Since the computation proceeds backwards, backward recursion may evaluate a large number of states that are not relevant to the computation of f_1(s_1).
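For the gambling game instance, this tabulation can be written out directly. The sketch below is an illustrative implementation under the assumptions stated earlier (four plays, win probability 0.4, target of $6); it exploits the fact that wealth can at most double at each play to bound the state space, fills the tables from the last stage backwards, and reads the optimal value and an optimal first bet from the stage-1 row.

```python
# Backward recursion for the gambling game (illustrative sketch; data as stated above).
P_WIN, P_LOSE = 0.4, 0.6
START, TARGET, GAMES = 2, 6, 4

# Wealth can at most double at each play, so the state space is bounded.
bound = START * 2 ** GAMES                     # largest wealth reachable after GAMES plays

# f[t][s]: probability of finishing with at least TARGET given wealth s at start of game t.
f = [[0.0] * (bound + 1) for _ in range(GAMES + 2)]
best_bet = [[0] * (bound + 1) for _ in range(GAMES + 1)]

for s in range(bound + 1):                     # terminal condition after the last play
    f[GAMES + 1][s] = 1.0 if s >= TARGET else 0.0

for t in range(GAMES, 0, -1):                  # stages 4, 3, 2, 1
    reachable = START * 2 ** (t - 1)           # wealth cannot exceed this at stage t
    for s in range(reachable + 1):
        candidates = [(P_WIN * f[t + 1][s + b] + P_LOSE * f[t + 1][s - b], b)
                      for b in range(s + 1)]
        f[t][s], best_bet[t][s] = max(candidates)

print("f_1(%d) = %.4f" % (START, f[1][START]))          # optimal probability of reaching $6
print("an optimal first bet with $%d:" % START, best_bet[1][START])
```

Under these assumptions the tabulation yields f_1(2) = 0.1984; both a $1 and a $2 first bet attain this value.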
Given the initial state s_1 of the system at the beginning of period 1, forward recursion (Bertsekas 2000) computes f_1(s_1) by progressively expanding the functional equation (forward pass). This involves recursive calls for all f_{t+1}(s_{t+1}) that are necessary for computing a given f_t(s_t); these calls are initially left suspended.
The value of an optimal policy and its structure are then retrieved via a backward pass in which these suspended recursive calls are resolved. A key difference from backward recursion is the fact that f_t is computed only for states that are relevant for the computation of f_1(s_1).
Memoization is employed to avoid recomputation of states that have been already considered.
We shall illustrate forward recursion in the context of the gambling game instance previously discussed. We begin the forward pass by considering f_1(2), the probability of attaining the target given the initial wealth of $2 at the beginning of game 1; expanding the functional equation generates suspended recursive calls to f_2(s) for every wealth level s reachable after the first bet. Expanding these, in turn, leads to additional suspended recursions involving f_3(s) and, finally, f_4(s). The values f_4(s) represent boundary conditions that are easily computed as follows: if s ≥ 6 the target has already been attained and it is optimal to bet nothing, so f_4(s) = 1; if 3 ≤ s ≤ 5 it is optimal to bet at least 6 − s, so f_4(s) = 0.4, the probability of winning the last play; and if s < 3 the target cannot be reached even by betting the entire capital, so f_4(s) = 0. At this point it is possible to proceed and recover the optimal policy and its value via a backward pass involving, at first, stage 3 and then, progressively, stages 2 and 1, in which the suspended recursive calls are resolved.
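The same forward pass can be carried out compactly with a memoized top-down recursion. The sketch below is illustrative and follows the assumptions above; for convenience it places the boundary condition after the last play (so that the stage-4 values discussed above are themselves obtained from the recursion), caches every f_t(s) encountered so that suspended calls are evaluated only once, and recovers an optimal bet from the cached values.

```python
# Forward recursion with memoization for the gambling game (illustrative sketch).
from functools import lru_cache

P_WIN, P_LOSE = 0.4, 0.6
START, TARGET, GAMES = 2, 6, 4

@lru_cache(maxsize=None)
def f(t, s):
    """Probability of finishing with at least TARGET dollars, given wealth s at the
    beginning of game t, when betting optimally from game t onwards."""
    if t > GAMES:                              # boundary condition: no plays left
        return 1.0 if s >= TARGET else 0.0
    # Expand the functional equation; recursive calls are suspended and memoized.
    return max(P_WIN * f(t + 1, s + b) + P_LOSE * f(t + 1, s - b) for b in range(s + 1))

def best_bet(t, s):
    """Recover an optimal bet for state s at stage t from the memoized values."""
    return max(range(s + 1),
               key=lambda b: P_WIN * f(t + 1, s + b) + P_LOSE * f(t + 1, s - b))

print("f_1(%d) = %.4f" % (START, f(1, START)))   # only states reachable from $2 are evaluated
print("an optimal first bet:", best_bet(1, START))
```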
An introduction to approximate dynamic programming is provided by (Powell 2009).