Differential dynamic programming (DDP) is an optimal control algorithm of the trajectory optimization class.
The algorithm was introduced in 1966 by Mayne[1] and subsequently analysed in Jacobson and Mayne's eponymous book.[2] The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays quadratic convergence.
It is closely related to Pantoja's step-wise Newton's method.[3][4]

The dynamics

$$\mathbf{x}_{i+1} = \mathbf{f}(\mathbf{x}_i,\mathbf{u}_i) \tag{1}$$

describe the evolution of the state $\mathbf{x}$ given the control $\mathbf{u}$ from time $i$ to time $i+1$. The total cost $J_0$ is the sum of running costs $\ell$ and final cost $\ell_f$, incurred when starting from state $\mathbf{x}$ and applying the control sequence $\mathbf{U} \equiv \{\mathbf{u}_0,\mathbf{u}_1,\dots,\mathbf{u}_{N-1}\}$ until the horizon is reached:

$$J_0(\mathbf{x},\mathbf{U}) = \sum_{i=0}^{N-1} \ell(\mathbf{x}_i,\mathbf{u}_i) + \ell_f(\mathbf{x}_N),$$

where $\mathbf{x}_0 \equiv \mathbf{x}$ and the $\mathbf{x}_i$ for $i>0$ are given by Eq. 1. The solution of the optimal control problem is the minimizing control sequence $\mathbf{U}^*(\mathbf{x}) \equiv \operatorname*{argmin}_{\mathbf{U}} J_0(\mathbf{x},\mathbf{U})$. Trajectory optimization means finding $\mathbf{U}^*(\mathbf{x})$ for a particular $\mathbf{x}_0$, rather than for all possible initial states.

Let $\mathbf{U}_i \equiv \{\mathbf{u}_i,\mathbf{u}_{i+1},\dots,\mathbf{u}_{N-1}\}$ be the partial control sequence and define the cost-to-go $J_i$ as the partial sum of costs from $i$ to $N$:

$$J_i(\mathbf{x},\mathbf{U}_i) = \sum_{j=i}^{N-1} \ell(\mathbf{x}_j,\mathbf{u}_j) + \ell_f(\mathbf{x}_N).$$

The optimal cost-to-go or value function at time $i$ is the cost-to-go given the minimizing control sequence:

$$V(\mathbf{x},i) \equiv \min_{\mathbf{U}_i} J_i(\mathbf{x},\mathbf{U}_i).$$

Setting $V(\mathbf{x},N) \equiv \ell_f(\mathbf{x}_N)$, the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

$$V(\mathbf{x},i) = \min_{\mathbf{u}}\bigl[\ell(\mathbf{x},\mathbf{u}) + V(\mathbf{f}(\mathbf{x},\mathbf{u}),i+1)\bigr]. \tag{2}$$

This is the Bellman equation.
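For concreteness, the following is a minimal sketch of this backward recursion on a discretized toy problem; the scalar dynamics, cost, grid sizes, and horizon are assumptions chosen purely for illustration and are not part of the DDP derivation.

```python
import numpy as np

# Assumed toy problem: scalar state and control on discrete grids, horizon N.
xs = np.linspace(-2.0, 2.0, 81)           # state grid
us = np.linspace(-1.0, 1.0, 21)           # control grid
N = 20
f = lambda x, u: np.clip(x + 0.1 * u, xs[0], xs[-1])   # dynamics
ell = lambda x, u: 0.1 * (x**2 + u**2)                  # running cost
ellf = lambda x: x**2                                    # final cost

V = ellf(xs)                              # V(x, N) = final cost
policy = np.zeros((N, xs.size))           # minimizing control at each (time, state)
for i in reversed(range(N)):              # Bellman recursion, backwards in time
    V_new = np.empty_like(V)
    for j, x in enumerate(xs):
        # running cost of each candidate control plus interpolated next-step value
        q = ell(x, us) + np.interp(f(x, us), xs, V)
        V_new[j] = q.min()
        policy[i, j] = us[q.argmin()]
    V = V_new
print("value at x = 0, t = 0:", np.interp(0.0, xs, V))
```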
DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward-pass to compute and evaluate a new nominal trajectory.
We begin with the backward pass. If

$$\ell(\mathbf{x},\mathbf{u}) + V(\mathbf{f}(\mathbf{x},\mathbf{u}),i+1)$$

is the argument of the $\min[\cdot]$ operator in Eq. 2, let $Q$ be the variation of this quantity around the $i$-th $(\mathbf{x},\mathbf{u})$ pair:

$$Q(\delta\mathbf{x},\delta\mathbf{u}) \equiv \ell(\mathbf{x}+\delta\mathbf{x},\mathbf{u}+\delta\mathbf{u}) + V(\mathbf{f}(\mathbf{x}+\delta\mathbf{x},\mathbf{u}+\delta\mathbf{u}),i+1) - \ell(\mathbf{x},\mathbf{u}) - V(\mathbf{f}(\mathbf{x},\mathbf{u}),i+1)$$

and expand to second order

$$Q(\delta\mathbf{x},\delta\mathbf{u}) \approx \frac{1}{2}\begin{bmatrix}1\\ \delta\mathbf{x}\\ \delta\mathbf{u}\end{bmatrix}^{\mathsf{T}}\begin{bmatrix}0 & Q_{\mathbf{x}}^{\mathsf{T}} & Q_{\mathbf{u}}^{\mathsf{T}}\\ Q_{\mathbf{x}} & Q_{\mathbf{x}\mathbf{x}} & Q_{\mathbf{x}\mathbf{u}}\\ Q_{\mathbf{u}} & Q_{\mathbf{u}\mathbf{x}} & Q_{\mathbf{u}\mathbf{u}}\end{bmatrix}\begin{bmatrix}1\\ \delta\mathbf{x}\\ \delta\mathbf{u}\end{bmatrix}. \tag{3}$$

The $Q$ notation used here is a variant of the notation of Morimoto, where subscripts denote differentiation in denominator layout.

Dropping the index $i$ for readability, with primes denoting the next time-step, $V' \equiv V(i+1)$, the expansion coefficients are

$$\begin{aligned}
Q_{\mathbf{x}} &= \ell_{\mathbf{x}} + \mathbf{f}_{\mathbf{x}}^{\mathsf{T}} V'_{\mathbf{x}} \\
Q_{\mathbf{u}} &= \ell_{\mathbf{u}} + \mathbf{f}_{\mathbf{u}}^{\mathsf{T}} V'_{\mathbf{x}} \\
Q_{\mathbf{x}\mathbf{x}} &= \ell_{\mathbf{x}\mathbf{x}} + \mathbf{f}_{\mathbf{x}}^{\mathsf{T}} V'_{\mathbf{x}\mathbf{x}} \mathbf{f}_{\mathbf{x}} + V'_{\mathbf{x}}\cdot\mathbf{f}_{\mathbf{x}\mathbf{x}} \\
Q_{\mathbf{u}\mathbf{u}} &= \ell_{\mathbf{u}\mathbf{u}} + \mathbf{f}_{\mathbf{u}}^{\mathsf{T}} V'_{\mathbf{x}\mathbf{x}} \mathbf{f}_{\mathbf{u}} + V'_{\mathbf{x}}\cdot\mathbf{f}_{\mathbf{u}\mathbf{u}} \\
Q_{\mathbf{u}\mathbf{x}} &= \ell_{\mathbf{u}\mathbf{x}} + \mathbf{f}_{\mathbf{u}}^{\mathsf{T}} V'_{\mathbf{x}\mathbf{x}} \mathbf{f}_{\mathbf{x}} + V'_{\mathbf{x}}\cdot\mathbf{f}_{\mathbf{u}\mathbf{x}}.
\end{aligned}$$

The last terms in the last three equations denote contraction of a vector with a tensor.
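A minimal NumPy sketch of these coefficient formulas follows; the array names (lx, fx, fxx, and so on) are assumptions made here for clarity, and the dynamics Hessians are stored as tensors whose first axis indexes the output of f so that the contraction with $V'_{\mathbf{x}}$ can be written with np.tensordot.

```python
import numpy as np

def q_coefficients(lx, lu, lxx, luu, lux, fx, fu, fxx, fuu, fux, Vx, Vxx):
    """One step of the DDP backward pass: expansion coefficients of Q.

    Assumed shapes (n = state dim, m = control dim):
      lx (n,), lu (m,), lxx (n,n), luu (m,m), lux (m,n)
      fx (n,n), fu (n,m), fxx (n,n,n), fuu (n,m,m), fux (n,m,n)
      Vx (n,), Vxx (n,n)   -- derivatives of the next-step value V'
    """
    Qx  = lx + fx.T @ Vx
    Qu  = lu + fu.T @ Vx
    # tensordot contracts V'_x with the output axis of the dynamics Hessians
    Qxx = lxx + fx.T @ Vxx @ fx + np.tensordot(Vx, fxx, axes=1)
    Quu = luu + fu.T @ Vxx @ fu + np.tensordot(Vx, fuu, axes=1)
    Qux = lux + fu.T @ Vxx @ fx + np.tensordot(Vx, fux, axes=1)
    return Qx, Qu, Qxx, Quu, Qux
```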
Minimizing the quadratic approximation (3) with respect to $\delta\mathbf{u}$ we have

$$\delta\mathbf{u}^{*} = \operatorname*{argmin}_{\delta\mathbf{u}} Q(\delta\mathbf{x},\delta\mathbf{u}) = -Q_{\mathbf{u}\mathbf{u}}^{-1}\bigl(Q_{\mathbf{u}} + Q_{\mathbf{u}\mathbf{x}}\,\delta\mathbf{x}\bigr), \tag{4}$$

giving an open-loop term $\mathbf{k} = -Q_{\mathbf{u}\mathbf{u}}^{-1}Q_{\mathbf{u}}$ and a feedback gain term $\mathbf{K} = -Q_{\mathbf{u}\mathbf{u}}^{-1}Q_{\mathbf{u}\mathbf{x}}$.
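In practice the gains are usually obtained by solving linear systems with $Q_{\mathbf{uu}}$ rather than forming its inverse explicitly; a small sketch, continuing the assumed names above, is:

```python
import numpy as np

def gains(Qu, Quu, Qux):
    """Open-loop term k and feedback gain K from the Q expansion (Eq. 4)."""
    # Solving with Quu is preferred over computing an explicit inverse.
    k = -np.linalg.solve(Quu, Qu)    # shape (m,)
    K = -np.linalg.solve(Quu, Qux)   # shape (m, n)
    return k, K
```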
Plugging the result back into (3), we now have a quadratic model of the value at time $i$:

$$\begin{aligned}
\Delta V(i) &= -\tfrac{1}{2} Q_{\mathbf{u}}^{\mathsf{T}} Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}} \\
V_{\mathbf{x}}(i) &= Q_{\mathbf{x}} - Q_{\mathbf{x}\mathbf{u}} Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}} \\
V_{\mathbf{x}\mathbf{x}}(i) &= Q_{\mathbf{x}\mathbf{x}} - Q_{\mathbf{x}\mathbf{u}} Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}\mathbf{x}}.
\end{aligned}$$

Recursively computing the local quadratic models of $V(i)$ and the control modifications $\{\mathbf{k}(i),\mathbf{K}(i)\}$, from $i=N-1$ down to $i=1$, constitutes the backward pass. As above, the value is initialized with $V(\mathbf{x},N) \equiv \ell_f(\mathbf{x}_N)$. Once the backward pass is completed, a forward pass computes a new trajectory:

$$\begin{aligned}
\hat{\mathbf{x}}(1) &= \mathbf{x}(1) \\
\hat{\mathbf{u}}(i) &= \mathbf{u}(i) + \mathbf{k}(i) + \mathbf{K}(i)\bigl(\hat{\mathbf{x}}(i) - \mathbf{x}(i)\bigr) \\
\hat{\mathbf{x}}(i+1) &= \mathbf{f}\bigl(\hat{\mathbf{x}}(i),\hat{\mathbf{u}}(i)\bigr).
\end{aligned}$$

The backward passes and forward passes are iterated until convergence.
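Putting the two passes together, the following self-contained sketch iterates them on an assumed double-integrator problem with quadratic cost; because the dynamics are linear, the second-order dynamics terms vanish and the quadratic model is exact, so this illustrates the mechanics rather than a general implementation.

```python
import numpy as np

# Assumed illustrative problem: discrete-time double integrator, quadratic cost.
n, m, N = 2, 1, 50
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Qc, Rc, Qf = np.eye(n), 0.1 * np.eye(m), 10.0 * np.eye(n)

f = lambda x, u: A @ x + B @ u
ell = lambda x, u: 0.5 * (x @ Qc @ x + u @ Rc @ u)
ellf = lambda x: 0.5 * (x @ Qf @ x)

x0 = np.array([1.0, 0.0])
us = [np.zeros(m) for _ in range(N)]               # nominal control sequence
xs = [x0]
for u in us:                                        # nominal trajectory
    xs.append(f(xs[-1], u))

for it in range(5):
    # Backward pass, initialized with the final cost.
    Vx, Vxx = Qf @ xs[N], Qf.copy()
    ks, Ks = [None] * N, [None] * N
    for i in reversed(range(N)):
        x, u = xs[i], us[i]
        Qx = Qc @ x + A.T @ Vx
        Qu = Rc @ u + B.T @ Vx
        Qxx = Qc + A.T @ Vxx @ A
        Quu = Rc + B.T @ Vxx @ B
        Qux = B.T @ Vxx @ A
        ks[i] = -np.linalg.solve(Quu, Qu)           # open-loop term k
        Ks[i] = -np.linalg.solve(Quu, Qux)          # feedback gain K
        Vx = Qx - Qux.T @ np.linalg.solve(Quu, Qu)
        Vxx = Qxx - Qux.T @ np.linalg.solve(Quu, Qux)
    # Forward pass: roll out the updated controller around the old trajectory.
    xs_new, us_new = [xs[0]], []
    for i in range(N):
        u = us[i] + ks[i] + Ks[i] @ (xs_new[i] - xs[i])
        us_new.append(u)
        xs_new.append(f(xs_new[i], u))
    xs, us = xs_new, us_new

cost = sum(ell(x, u) for x, u in zip(xs[:-1], us)) + ellf(xs[-1])
print("total cost:", cost)
```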
Differential dynamic programming is a second-order algorithm like Newton's method.
It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence.[6][7] Regularization in the DDP context means ensuring that the $Q_{\mathbf{u}\mathbf{u}}$ matrix in Eq. 4 is positive definite. Line-search in DDP amounts to scaling the open-loop control modification $\mathbf{k}$ by some $0<\alpha<1$.
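One common way to implement both ideas, sketched below under the assumption of a Levenberg-Marquardt-style shift mu and a caller-supplied step parameter alpha, is to add a multiple of the identity to $Q_{\mathbf{uu}}$ before factorizing it and to scale only the open-loop term during the forward pass.

```python
import numpy as np

def regularized_gains(Qu, Quu, Qux, mu=1e-6):
    """Gains k, K with a shift that keeps the regularized Quu positive definite."""
    while True:
        Quu_reg = Quu + mu * np.eye(Quu.shape[0])
        try:
            np.linalg.cholesky(Quu_reg)   # succeeds only if Quu_reg is positive definite
            break
        except np.linalg.LinAlgError:
            mu *= 10.0                    # increase the shift and try again
    k = -np.linalg.solve(Quu_reg, Qu)
    K = -np.linalg.solve(Quu_reg, Qux)
    return k, K

def line_searched_control(u_nominal, k, K, dx, alpha):
    """Forward-pass control with the open-loop term scaled by 0 < alpha <= 1."""
    return u_nominal + alpha * k + K @ dx
```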
Sampled differential dynamic programming (SaDDP) is a Monte Carlo variant of differential dynamic programming.[8][9][10] It is based on treating the quadratic cost of differential dynamic programming as the energy of a Boltzmann distribution.
This way the quantities of DDP can be matched to the statistics of a multidimensional normal distribution.
The statistics can be recomputed from sampled trajectories without differentiation.
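As a loose illustration of this idea (a sketch of the general sampling principle rather than of the published SaDDP algorithm; the temperature lam, noise scale sigma, and the one-step cost function are assumptions made here), control perturbations can be weighted by a Boltzmann factor of their cost and an update read off the weighted statistics, with no derivatives of the cost required.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_control_update(cost_fn, u_nominal, sigma=0.1, lam=1.0, n_samples=256):
    """Estimate a control update from samples, without differentiating cost_fn.

    Perturbations are weighted by a Boltzmann factor exp(-cost / lam); the
    weighted mean plays the role of the open-loop modification in DDP.
    """
    du = sigma * rng.standard_normal((n_samples, u_nominal.shape[0]))
    costs = np.array([cost_fn(u_nominal + d) for d in du])
    w = np.exp(-(costs - costs.min()) / lam)   # subtract min for numerical stability
    w /= w.sum()
    return u_nominal + w @ du                  # weighted-mean perturbation

# Hypothetical one-step example: quadratic cost with minimum at u = [1, -1].
target = np.array([1.0, -1.0])
cost = lambda u: float((u - target) @ (u - target))
print(sampled_control_update(cost, np.zeros(2)))
```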
Sampled differential dynamic programming has been extended to Path Integral Policy Improvement with Differential Dynamic Programming.[11] This creates a link between differential dynamic programming and path integral control,[12] which is a framework of stochastic optimal control.
Interior-point differential dynamic programming (IPDDP) is an interior-point method generalization of DDP that can address optimal control problems with nonlinear state and input constraints.