Differential dynamic programming

Differential dynamic programming (DDP) is an optimal control algorithm of the trajectory optimization class.

The algorithm was introduced in 1966 by Mayne[1] and subsequently analysed in Jacobson and Mayne's eponymous book.[2] The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays quadratic convergence.

It is closely related to Pantoja's step-wise Newton's method.[3][4]

The dynamics

$$x_{i+1} = f(x_i, u_i) \tag{1}$$

describe the evolution of the state $x$ given the control $u$ from time $i$ to time $i+1$. The total cost $J_0$ is the sum of running costs $\ell$ and final cost $\ell_f$, incurred when starting from state $x$ and applying the control sequence $\mathbf{U} \equiv \{u_0, u_1, \dots, u_{N-1}\}$ until the horizon is reached:

$$J_0(x, \mathbf{U}) = \sum_{i=0}^{N-1} \ell(x_i, u_i) + \ell_f(x_N),$$

where $x_0 \equiv x$ and the $x_i$ for $i > 0$ are given by Eq. 1.

The solution of the optimal control problem is the minimizing control sequence $\mathbf{U}^*(x) \equiv \operatorname{argmin}_{\mathbf{U}} J_0(x, \mathbf{U})$. Trajectory optimization means finding $\mathbf{U}^*(x)$ for a particular $x_0$, rather than for all possible initial states.

Let $\mathbf{U}_i \equiv \{u_i, u_{i+1}, \dots, u_{N-1}\}$ be the tail of the control sequence and define the cost-to-go $J_i$ as the partial sum of costs from $i$ to $N$:

$$J_i(x, \mathbf{U}_i) = \sum_{j=i}^{N-1} \ell(x_j, u_j) + \ell_f(x_N).$$

The optimal cost-to-go or value function at time $i$ is the cost-to-go given the minimizing control sequence:

$$V(x, i) \equiv \min_{\mathbf{U}_i} J_i(x, \mathbf{U}_i).$$

Setting $V(x, N) \equiv \ell_f(x_N)$, the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

$$V(x, i) = \min_u \big[\, \ell(x, u) + V(f(x, u), i+1) \,\big]. \tag{2}$$

This is the Bellman equation.
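As a concrete illustration of the backward recursion in Eq. 2, the following is a minimal sketch on a discretized scalar toy problem; the dynamics, costs, grids, and horizon below are illustrative assumptions made only for this example, not part of the DDP formulation.

```python
import numpy as np

# Toy illustration of the backward recursion in Eq. 2 on a discretized
# scalar problem; the dynamics, costs, grids and horizon are assumptions
# made only for this example.
f  = lambda x, u: x + 0.1 * u              # assumed discrete-time dynamics
l  = lambda x, u: 0.1 * (x**2 + u**2)      # assumed running cost
lf = lambda x: 10.0 * x**2                 # assumed final cost

xs = np.linspace(-2.0, 2.0, 201)           # state grid
us = np.linspace(-1.0, 1.0, 21)            # candidate controls
N  = 50                                    # horizon

V = lf(xs)                                 # V(x, N) = lf(x_N)
for i in reversed(range(N)):
    x_next = f(xs[:, None], us[None, :])                       # all (x, u) pairs
    V_next = np.interp(x_next.ravel(), xs, V).reshape(x_next.shape)
    V = (l(xs[:, None], us[None, :]) + V_next).min(axis=1)     # V(x, i)
```

DDP avoids this kind of exhaustive minimization by working only with local quadratic models along a nominal trajectory, as described next.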

DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass.

If

$$\ell(x, u) + V(f(x, u), i+1)$$

is the argument of the $\min[\cdot]$ operator in Eq. 2, let $Q$ be the variation of this quantity around the $i$-th $(x, u)$ pair:

$$Q(\delta x, \delta u) \equiv \ell(x + \delta x, u + \delta u) + V(f(x + \delta x, u + \delta u), i+1) - \ell(x, u) - V(f(x, u), i+1)$$

and expand to second order:

$$Q(\delta x, \delta u) \approx \frac{1}{2} \begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}^{\mathsf T} \begin{bmatrix} 0 & Q_x^{\mathsf T} & Q_u^{\mathsf T} \\ Q_x & Q_{xx} & Q_{xu} \\ Q_u & Q_{ux} & Q_{uu} \end{bmatrix} \begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}. \tag{3}$$

The $Q$ notation used here is a variant of the notation of Morimoto, where subscripts denote differentiation in denominator layout.[5]

Dropping the index $i$ for readability, with primes denoting the next time-step $V' \equiv V(\cdot, i+1)$, the expansion coefficients are

$$\begin{aligned}
Q_x &= \ell_x + f_x^{\mathsf T} V'_x \\
Q_u &= \ell_u + f_u^{\mathsf T} V'_x \\
Q_{xx} &= \ell_{xx} + f_x^{\mathsf T} V'_{xx} f_x + V'_x \cdot f_{xx} \\
Q_{uu} &= \ell_{uu} + f_u^{\mathsf T} V'_{xx} f_u + V'_x \cdot f_{uu} \\
Q_{ux} &= \ell_{ux} + f_u^{\mathsf T} V'_{xx} f_x + V'_x \cdot f_{ux}.
\end{aligned}$$

The last terms in the last three equations denote contraction of a vector with a tensor.
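These coefficients can be assembled mechanically once the derivatives of $\ell$ and $f$ and the next step's value model are available. The sketch below assumes denominator-layout derivatives passed in as NumPy arrays with illustrative names (l_x, f_xx, Vp_x, etc.); the `tensordot` terms are the vector-tensor contractions mentioned above, and dropping them gives the iLQR simplification.

```python
import numpy as np

def q_expansion(l_x, l_u, l_xx, l_uu, l_ux,
                f_x, f_u, f_xx, f_uu, f_ux,
                Vp_x, Vp_xx):
    """Assemble the Q-coefficients at one time-step. Assumed shapes:
    state dim n, control dim m; f_x is n x n, f_u is n x m,
    f_xx is n x n x n, f_uu is n x m x m, f_ux is n x m x n;
    Vp_x, Vp_xx are the next step's value gradient and Hessian."""
    Q_x  = l_x  + f_x.T @ Vp_x
    Q_u  = l_u  + f_u.T @ Vp_x
    # np.tensordot(Vp_x, f_**, axes=1) contracts V'_x with the dynamics Hessian.
    Q_xx = l_xx + f_x.T @ Vp_xx @ f_x + np.tensordot(Vp_x, f_xx, axes=1)
    Q_uu = l_uu + f_u.T @ Vp_xx @ f_u + np.tensordot(Vp_x, f_uu, axes=1)
    Q_ux = l_ux + f_u.T @ Vp_xx @ f_x + np.tensordot(Vp_x, f_ux, axes=1)
    return Q_x, Q_u, Q_xx, Q_uu, Q_ux
```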

Minimizing the quadratic approximation (3) with respect to $\delta u$ we have

$$\delta u^* = \operatorname*{argmin}_{\delta u} Q(\delta x, \delta u) = -Q_{uu}^{-1} (Q_u + Q_{ux}\,\delta x), \tag{4}$$

giving an open-loop term $\mathbf{k} = -Q_{uu}^{-1} Q_u$ and a feedback gain term $\mathbf{K} = -Q_{uu}^{-1} Q_{ux}$. Plugging the result back into (3), we now have a quadratic model of the value at time $i$:

$$\begin{aligned}
\Delta V(i) &= -\tfrac{1}{2}\, Q_u^{\mathsf T} Q_{uu}^{-1} Q_u \\
V_x(i) &= Q_x - Q_{xu} Q_{uu}^{-1} Q_u \\
V_{xx}(i) &= Q_{xx} - Q_{xu} Q_{uu}^{-1} Q_{ux}.
\end{aligned}$$

Recursively computing the local quadratic models of $V(i)$ and the control modifications $\{\mathbf{k}(i), \mathbf{K}(i)\}$, from $i = N-1$ down to $i = 1$, constitutes the backward pass. As above, the value is initialized with $V(x, N) \equiv \ell_f(x_N)$. Once the backward pass is completed, a forward pass computes a new trajectory:

$$\begin{aligned}
\hat{x}(1) &= x(1) \\
\hat{u}(i) &= u(i) + \mathbf{k}(i) + \mathbf{K}(i)\,(\hat{x}(i) - x(i)) \\
\hat{x}(i+1) &= f(\hat{x}(i), \hat{u}(i)).
\end{aligned}$$

The backward passes and forward passes are iterated until convergence.
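A minimal sketch of the backward pass described above, under the common simplification that the second-order dynamics terms are dropped (the iLQR variant); the `steps` structure and its field names are illustrative assumptions, not a fixed interface.

```python
import numpy as np

def backward_pass(steps, lf_x, lf_xx):
    """Sketch of the DDP backward pass in the iLQR simplification.

    `steps` is an assumed list, ordered i = 0 .. N-1, of dicts holding the
    derivatives l_x, l_u, l_xx, l_uu, l_ux, f_x, f_u of the running cost and
    dynamics along the nominal trajectory; lf_x, lf_xx are the final-cost
    gradient and Hessian."""
    V_x, V_xx = lf_x, lf_xx                  # V(x, N) = lf(x_N)
    ks, Ks = [], []
    for s in reversed(steps):
        Q_x  = s["l_x"]  + s["f_x"].T @ V_x
        Q_u  = s["l_u"]  + s["f_u"].T @ V_x
        Q_xx = s["l_xx"] + s["f_x"].T @ V_xx @ s["f_x"]
        Q_uu = s["l_uu"] + s["f_u"].T @ V_xx @ s["f_u"]
        Q_ux = s["l_ux"] + s["f_u"].T @ V_xx @ s["f_x"]
        k = -np.linalg.solve(Q_uu, Q_u)      # open-loop term
        K = -np.linalg.solve(Q_uu, Q_ux)     # feedback gain
        # Quadratic model of the value at time i (same as the update above).
        V_x  = Q_x  + Q_ux.T @ k
        V_xx = Q_xx + Q_ux.T @ K
        ks.append(k)
        Ks.append(K)
    ks.reverse()
    Ks.reverse()
    return ks, Ks
```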

Differential dynamic programming is a second-order algorithm like Newton's method.

It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence.[6][7]

Regularization in the DDP context means ensuring that the $Q_{uu}$ matrix in Eq. 4 is positive definite.

Line-search in DDP amounts to scaling the open-loop control modification $\mathbf{k}$ by some $0 < \alpha < 1$.
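A minimal forward-pass sketch showing where the line-search parameter $\alpha$ enters: it scales only the open-loop term $\mathbf{k}$, while the feedback term still acts on the deviation from the nominal trajectory. The function signature and array conventions are illustrative assumptions; regularization would correspondingly add a multiple of the identity to $Q_{uu}$ before it is inverted in the backward pass.

```python
import numpy as np

def forward_pass(f, xs, us, ks, Ks, alpha=1.0):
    """Roll out a new trajectory from the nominal one (xs, us) using the
    backward-pass gains; `alpha` in (0, 1] scales the open-loop term k."""
    xs_new = [xs[0]]                      # start from the same initial state
    us_new = []
    for i in range(len(us)):
        u = us[i] + alpha * ks[i] + Ks[i] @ (xs_new[i] - xs[i])
        xs_new.append(f(xs_new[i], u))
        us_new.append(u)
    return np.array(xs_new), np.array(us_new)
```

In practice $\alpha$ is reduced until the new trajectory actually decreases the total cost, after which the backward pass is repeated on the updated nominal trajectory.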

Sampled differential dynamic programming (SaDDP) is a Monte Carlo variant of differential dynamic programming.[8][9][10] It is based on treating the quadratic cost of differential dynamic programming as the energy of a Boltzmann distribution.

This way the quantities of DDP can be matched to the statistics of a multidimensional normal distribution.

The statistics can be recomputed from sampled trajectories without differentiation.
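The following schematic sketch illustrates the flavour of that idea: estimating cost-weighted (Boltzmann) mean and covariance statistics of sampled controls without any differentiation. It is a hypothetical illustration under assumed names and arguments, not the published SaDDP procedure.

```python
import numpy as np

def boltzmann_statistics(u_samples, costs, temperature=1.0):
    """Schematic illustration: treat trajectory costs as Boltzmann energies
    and estimate the mean and covariance of the induced distribution over
    controls from samples, without differentiating cost or dynamics.
    `u_samples` is an (S, m) array of sampled controls for one time-step,
    `costs` the corresponding (S,) array of trajectory costs."""
    w = np.exp(-(costs - costs.min()) / temperature)    # Boltzmann weights
    w /= w.sum()
    mean = w @ u_samples                                 # weighted mean control
    diff = u_samples - mean
    cov = (w[:, None] * diff).T @ diff                   # weighted covariance
    return mean, cov
```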

Sampled differential dynamic programming has been extended to Path Integral Policy Improvement with Differential Dynamic Programming.[11] This creates a link between differential dynamic programming and path integral control,[12] which is a framework of stochastic optimal control.

Interior Point Differential dynamic programming (IPDDP) is an interior-point method generalization of DDP that can address the optimal control problem with nonlinear state and input constraints.