Policy gradient methods are a class of reinforcement learning algorithms.
The goal of any policy gradient method is to iteratively maximize the expected return $J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} \gamma^{t} R_t\right]$ by gradient ascent on the policy parameters $\theta$.
Lemma — The expectation of the score function is zero, conditional on any present or past state.
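In symbols, for a discrete action space (the continuous case replaces the sum with an integral): $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \ln \pi_\theta(a \mid s)\right] = \sum_a \pi_\theta(a \mid s)\,\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0.$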
The score function $\nabla_\theta \ln \pi_\theta(a \mid s)$ can be interpreted as the direction in parameter space that increases the probability of taking action $a$ in state $s$.
The policy gradient is then a weighted average of all such directions, one for each possible action in each state, with the weights given by reward signals: if taking a certain action in a certain state is associated with high reward, that direction is strongly reinforced, and if it is associated with low reward, it is suppressed.
REINFORCE is an on-policy algorithm, meaning that the trajectories used for the update must be sampled from the current policy $\pi_\theta$.
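As an illustration, below is a minimal sketch of REINFORCE for a tabular softmax policy; the toy environment, horizon, discount factor, and learning rate are assumptions made for the example, not part of the algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy episodic MDP used only for this sketch (an assumption of the example):
# 3 states, 2 actions, random transitions, reward 1 for taking action 0 in state 2.
N_STATES, N_ACTIONS, HORIZON, GAMMA = 3, 2, 10, 0.99
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a] = next-state distribution

def reward(s, a):
    return 1.0 if (s == 2 and a == 0) else 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.zeros((N_STATES, N_ACTIONS))  # tabular softmax policy parameters
alpha = 0.1                              # learning rate (assumed hyperparameter)

for _ in range(200):
    # Sample one trajectory from the *current* policy (the on-policy requirement).
    s, states, actions, rewards = 0, [], [], []
    for _ in range(HORIZON):
        probs = softmax(theta[s])
        a = rng.choice(N_ACTIONS, p=probs)
        states.append(s)
        actions.append(a)
        rewards.append(reward(s, a))
        s = rng.choice(N_STATES, p=P[s, a])

    # Returns-to-go G_t = sum_{k >= t} gamma^(k-t) r_k.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + GAMMA * G
        returns.append(G)
    returns.reverse()

    # REINFORCE gradient estimate: sum_t grad log pi(a_t | s_t) * G_t.
    grad = np.zeros_like(theta)
    for s_t, a_t, G_t in zip(states, actions, returns):
        probs = softmax(theta[s_t])
        grad_log_pi = -probs
        grad_log_pi[a_t] += 1.0          # grad of log softmax: onehot(a_t) - probs
        grad[s_t] += grad_log_pi * G_t

    theta += alpha * grad                # gradient ascent on the expected return
```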
Many variants of REINFORCE have been introduced under the heading of variance reduction.
A common way to reduce variance is the REINFORCE with baseline algorithm, which rests on the following identity: for any baseline function $b(s)$ that depends only on the state, $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \ln \pi_\theta(a \mid s)\, b(s)\right] = 0$, a direct consequence of the lemma above. Subtracting such a baseline from the return therefore leaves the gradient estimator unbiased while it can substantially reduce its variance.
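Continuing the sketch above, the baseline is subtracted from each return-to-go before it weights the score function; the per-state running mean return used here is one simple, illustrative choice (a learned value function is also common).

```python
def baselined_advantages(states, returns, baseline, counts):
    """Replace the raw returns-to-go in the REINFORCE sketch above with baselined ones.

    baseline, counts : per-state arrays kept across iterations; here the baseline is a
    running mean of the returns observed from each state (an illustrative choice).
    """
    advantages = []
    for s_t, G_t in zip(states, returns):
        advantages.append(G_t - baseline[s_t])            # subtract b(s_t) before updating it
        counts[s_t] += 1
        baseline[s_t] += (G_t - baseline[s_t]) / counts[s_t]
    return advantages
```

In the gradient accumulation of the earlier sketch, each $G_t$ is then replaced by the corresponding advantage, which leaves the estimator unbiased by the identity above.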
[5] Unlike standard policy gradient methods, whose update direction depends on the choice of parameters $\theta$ (and thus on how the policy is parametrized), the natural policy gradient defines the update in terms of the policy distribution itself.
While the objective (linearized improvement) is geometrically meaningful, the Euclidean constraint $\|\Delta\theta\| \le \epsilon$ is not, because it depends on how the policy is parametrized. The natural policy gradient replaces it with a constraint on the KL divergence between the updated and the current policy, $\overline{D}_{\mathrm{KL}}\!\left(\pi_{\theta+\Delta\theta} \,\|\, \pi_\theta\right) \le \epsilon$.
This ensures updates are invariant to invertible affine parameter transformations.
For small $\Delta\theta$, the KL divergence is approximated by the Fisher information metric: $\overline{D}_{\mathrm{KL}}\!\left(\pi_{\theta+\Delta\theta} \,\|\, \pi_\theta\right) \approx \tfrac{1}{2}\,\Delta\theta^{\top} F(\theta)\,\Delta\theta$, where $F(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\!\left[\nabla_\theta \ln \pi_\theta(a \mid s)\,\nabla_\theta \ln \pi_\theta(a \mid s)^{\top}\right]$ is the Fisher information matrix.
Computing and inverting the Fisher information matrix $F(\theta)$ is computationally intensive, especially for high-dimensional parameters (e.g., neural networks).
TRPO builds on the natural policy gradient by incorporating a trust region constraint.
While the natural gradient provides a theoretically optimal direction, TRPO's line search and KL constraint mitigate errors from Taylor approximations, ensuring monotonic policy improvement.
This makes TRPO more robust in practice, particularly for high-dimensional policies.
TRPO maximizes a surrogate advantage, $L(\theta, \theta_t) = \mathbb{E}_{s,a \sim \pi_{\theta_t}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)}\, A^{\pi_{\theta_t}}(s, a)\right]$, subject to the trust-region constraint $\overline{D}_{\mathrm{KL}}\!\left(\pi_{\theta_t} \,\|\, \pi_\theta\right) \le \epsilon$. The gradient of the surrogate at $\theta = \theta_t$ equals the policy gradient derived from the advantage function: $\nabla_\theta L(\theta, \theta_t)\big|_{\theta = \theta_t} = \mathbb{E}_{s,a \sim \pi_{\theta_t}}\!\left[\nabla_\theta \ln \pi_\theta(a \mid s)\big|_{\theta = \theta_t}\, A^{\pi_{\theta_t}}(s, a)\right]$. Expanding the objective to first order and the KL constraint to second order reduces the problem to a quadratic optimization, yielding the natural policy gradient update $\theta_{t+1} = \theta_t + \sqrt{\frac{2\epsilon}{g^{\top} F(\theta_t)^{-1} g}}\; F(\theta_t)^{-1} g$, where $g = \nabla_\theta J(\theta_t)$ is the policy gradient and $F(\theta_t)$ is the Fisher information matrix evaluated at $\theta_t$.
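The sketch below, under the same tabular-softmax setup assumed earlier, estimates the Fisher matrix from sampled score vectors and computes this natural-gradient step numerically; explicitly forming and solving against $F$ in this way is exactly the step that becomes impractical for large neural-network policies.

```python
import numpy as np

def natural_gradient_step(score_vectors, advantage_estimates, epsilon=0.01, damping=1e-3):
    """One natural policy gradient step computed from sampled data.

    score_vectors       : (N, d) array of flattened grad log pi(a_t | s_t) vectors
    advantage_estimates : (N,) array of the corresponding return or advantage estimates
    epsilon             : trust-region size for the KL constraint (assumed hyperparameter)
    damping             : small ridge term so the empirical Fisher matrix is invertible
    """
    S = np.asarray(score_vectors, dtype=float)
    A = np.asarray(advantage_estimates, dtype=float)

    g = (S * A[:, None]).mean(axis=0)                  # vanilla policy gradient estimate
    F = (S[:, :, None] * S[:, None, :]).mean(axis=0)   # empirical Fisher: E[score score^T]
    F += damping * np.eye(F.shape[0])

    F_inv_g = np.linalg.solve(F, g)                    # the expensive step for large d
    step_size = np.sqrt(2.0 * epsilon / max(g @ F_inv_g, 1e-12))
    return step_size * F_inv_g                         # Delta theta to add to the parameters
```

TRPO follows this direction but additionally applies a backtracking line search, shrinking the step until the KL constraint is satisfied and the surrogate objective actually improves.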
A further improvement is proximal policy optimization (PPO), which avoids even computing the Fisher matrix $F(\theta)$ and its inverse.
Instead, PPO replaces the trust-region constraint with a clipped surrogate objective, $L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{s,a \sim \pi_{\theta_t}}\!\left[\min\!\big(r(\theta)\, A^{\pi_{\theta_t}}(s,a),\ \operatorname{clip}\!\big(r(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A^{\pi_{\theta_t}}(s,a)\big)\right]$, where $r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_t}(a \mid s)}$ is the probability ratio, and PPO maximizes this surrogate advantage by stochastic gradient ascent, as usual.
In words, gradient-ascending the new surrogate advantage function means that, at some state $s$, if the advantage of an action $a$ is positive, the update increases the probability $\pi_\theta(a \mid s)$, but the clipping removes any incentive to push the ratio $r(\theta)$ above $1+\epsilon$; if the advantage is negative, the probability is decreased, with no incentive to push the ratio below $1-\epsilon$.
For sample efficiency, PPO requires multiple update steps on the same batch of data.
Concretely, one first samples a batch of trajectories using the current policy $\pi_{\theta_t}$, then repeatedly applies gradient ascent (with an optimizer such as Adam) to update $\theta$ by maximizing the clipped surrogate objective, and finally sets $\theta_{t+1}$ to the result.
For each such bound hit, the corresponding gradient becomes zero, and thus PPO avoids pushing the probability ratio of that state-action pair any further away from the old policy $\pi_{\theta_t}$.
This is important, because the surrogate loss assumes that the state-action pairs are sampled from the old policy $\pi_{\theta_t}$; the surrogate is a good approximation of the true objective only as long as the updated policy $\pi_\theta$ stays close to $\pi_{\theta_t}$.
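The following sketch computes the clipped surrogate in plain numpy and flags which samples have been clipped (the per-sample organization and the function name are assumptions of the example, not a standard API).

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Per-sample clipped surrogate and a mask of samples whose gradient is zero.

    log_prob_new : log pi_theta(a | s) under the policy currently being updated
    log_prob_old : log pi_{theta_t}(a | s) under the policy that generated the data
    advantages   : advantage estimates for the same state-action pairs
    clip_eps     : clipping parameter epsilon (0.2 is a commonly used default)
    """
    ratio = np.exp(np.asarray(log_prob_new) - np.asarray(log_prob_old))
    advantages = np.asarray(advantages, dtype=float)

    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    objective = np.minimum(unclipped, clipped)

    # The gradient with respect to theta vanishes exactly when the clipped branch is
    # active: ratio above 1+eps with positive advantage, or below 1-eps with negative one.
    zero_grad = ((advantages > 0) & (ratio > 1.0 + clip_eps)) | \
                ((advantages < 0) & (ratio < 1.0 - clip_eps))
    return objective.mean(), zero_grad
```

In practice the same expression is written in a deep-learning framework and maximized with an optimizer such as Adam; once a sample's ratio is clipped it contributes a zero gradient, which is the mechanism described above.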
In reinforcement learning from human feedback, PPO is commonly combined with a KL divergence penalty that keeps the trained policy close to a reference policy $\pi_{\mathrm{ref}}$; this combination has also been used in training reasoning language models.[8]
The KL divergence penalty term can be estimated with lower variance using the equivalent form (see f-divergence for details):[9] $\hat{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) = \mathbb{E}_{s, a \sim \pi_\theta}\!\left[\frac{\pi_{\mathrm{ref}}(a \mid s)}{\pi_\theta(a \mid s)} - \ln \frac{\pi_{\mathrm{ref}}(a \mid s)}{\pi_\theta(a \mid s)} - 1\right]$, each sampled term of which is nonnegative.
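As a quick Monte Carlo check (the two categorical distributions and the sample size are arbitrary choices made for this example), both the naive estimator $-\ln r$ and the form above are unbiased for $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$, but the latter is nonnegative sample-by-sample and has noticeably lower variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary categorical distributions standing in for pi_theta and pi_ref.
pi_theta = np.array([0.5, 0.3, 0.2])
pi_ref   = np.array([0.3, 0.3, 0.4])

exact_kl = np.sum(pi_theta * np.log(pi_theta / pi_ref))

x = rng.choice(len(pi_theta), size=100_000, p=pi_theta)   # samples from pi_theta
log_r = np.log(pi_ref[x]) - np.log(pi_theta[x])           # log(pi_ref / pi_theta) per sample

naive   = -log_r                        # unbiased, but can be negative and has higher variance
low_var = np.exp(log_r) - log_r - 1.0   # the equivalent form above: unbiased, nonnegative

print(f"exact KL      : {exact_kl:.4f}")
print(f"naive mean/std: {naive.mean():.4f} / {naive.std():.4f}")
print(f"above mean/std: {low_var.mean():.4f} / {low_var.std():.4f}")
```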
Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator. Instead of a learned baseline, it samples a group of actions (responses) for each state (prompt) from the current policy and uses each action's reward, normalized by the group's mean and standard deviation, as its advantage estimate.
Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed relatively better than other actions tried at that state, and less likely to respond with one that performed relatively worse.
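A sketch of the group-relative advantage computation that replaces the value function (the group size, reward values, and function name are illustrative assumptions); these advantages are then used in the same clipped surrogate objective as in PPO.

```python
import numpy as np

def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages: each response sampled for the same prompt (state) is
    scored relative to the other responses in its group, with no value function."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: rewards of 4 responses sampled from the current policy for one prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approx. [ 1. -1. -1.  1.]
```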
GRPO was first proposed in the context of training reasoning language models by researchers at DeepSeek.