[6] For example, one may want to train a model to generate safe text that is both helpful and harmless (for instance, free of bias, toxicity, and other harmful content).
[7] Despite the clear benefits of incorporating human feedback in training models, prior efforts—including some that leverage reinforcement learning—have encountered significant challenges.
Most attempts were either narrow and difficult to generalize, breaking down on more complex tasks,[8][9][10][11] or they struggled to learn from reward functions that were sparse (providing little specific signal, and only over large amounts of text at a time) or noisy (rewarding similar outputs inconsistently).
The foundation for RLHF was introduced as an attempt to create a general algorithm for learning from a practical amount of human feedback.
[14] Nevertheless, a larger and more diverse pool of data can be crucial for tasks where it is important to avoid bias from an only partially representative group of annotators.
[20][22] In the offline data collection model, when the objective is policy training, a pessimistic maximum likelihood estimator (MLE) that incorporates a lower confidence bound as the reward estimate is most effective.
[22][23][15] In the online scenario, when human feedback is collected through pairwise comparisons under the Bradley–Terry–Luce model and the objective is to minimize the algorithm's regret (the difference in performance compared to an optimal agent), it has been shown that an optimistic MLE that incorporates an upper confidence bound as the reward estimate can be used to design sample-efficient algorithms (meaning that they require relatively little training data).
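As a rough illustration, the sketch below (with illustrative function names, and a scalar `uncertainty` standing in for a full confidence-bound computation) shows the Bradley–Terry–Luce preference probability and the pessimistic and optimistic reward estimates described above:

```python
import math

def btl_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry-Luce model: probability that a human prefers
    response A over response B, given their latent reward values."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def pessimistic_reward(estimate: float, uncertainty: float) -> float:
    """Lower-confidence-bound reward estimate (offline setting)."""
    return estimate - uncertainty

def optimistic_reward(estimate: float, uncertainty: float) -> float:
    """Upper-confidence-bound reward estimate (online, regret-minimizing setting)."""
    return estimate + uncertainty

# A response with estimated reward 0.8 and uncertainty 0.3, compared against one scoring 0.2:
print(btl_preference_prob(0.8, 0.2))  # ~0.646, i.e. preferred about 65% of the time
print(pessimistic_reward(0.8, 0.3))   # 0.5
print(optimistic_reward(0.8, 0.3))    # 1.1
```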
A key challenge in RLHF when learning from pairwise (or dueling) comparisons stems from the non-Markovian nature of its optimal policies: the best action may depend on the history of previous states and actions rather than only the current state.
[24][14] Ordinary reinforcement learning, in which agents learn from their actions based on a predefined "reward function", is difficult to apply to NLP tasks because the rewards tend to be hard to define or measure, especially when dealing with complex tasks that involve human values or preferences.
[15][25] Some notable examples of RLHF-trained language models are OpenAI's ChatGPT (and its predecessor InstructGPT),[17][26][27] DeepMind's Sparrow,[28][29][30] Google's Gemini,[31] and Anthropic's Claude.
[33][34] Other methods tried to incorporate the feedback through more direct training, maximizing the reward without the use of reinforcement learning, but their authors conceded that an RLHF-based approach would likely perform better because of the online sample generation RLHF uses during updates, as well as the aforementioned KL regularization against the prior model, which mitigates overfitting to the reward function.
[35] RLHF was initially applied to other areas, such as the development of video game bots and tasks in simulated robotics.
For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences.
In classical RL-based training of such bots, the reward function is simply tied to how well the agent is performing in the game, usually measured by metrics such as the in-game score.
This model is then typically trained in a supervised manner on a relatively small dataset of prompts to an assistant, each paired with an accompanying response written by human annotators.
This change shifts the model from its original task of predicting a distribution over its vocabulary to simply outputting a single number corresponding to the score of a given prompt and response.
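A minimal sketch of such a reward model, assuming a hypothetical `backbone` module that returns hidden states of shape (batch, sequence length, hidden size); all names here are illustrative rather than taken from any particular library:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """A language-model backbone whose vocabulary head is replaced by a
    linear layer that outputs a single scalar score per sequence."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        # Scalar head in place of the usual (hidden_size -> vocab_size) LM head.
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)        # (batch, seq_len, hidden_size)
        last_token = hidden[:, -1, :]            # summarizes the prompt + response
        return self.score_head(last_token).squeeze(-1)  # (batch,) scalar scores

# Toy usage with an embedding layer standing in for a transformer backbone:
backbone = nn.Embedding(num_embeddings=1000, embedding_dim=64)
model = RewardModel(backbone, hidden_size=64)
scores = model(torch.randint(0, 1000, (2, 16)))  # two sequences of 16 token ids
print(scores.shape)  # torch.Size([2])
```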
This loss function measures the difference between the reward model's predictions and the preferences expressed by the human annotators.
The goal is to make the model's predictions match the humans' preferences as closely as possible by minimizing the difference measured by this equation.
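A minimal sketch of a pairwise preference loss of this kind, assuming the reward model has already scored the preferred ("chosen") and dispreferred ("rejected") response of each comparison; the names are illustrative:

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor,
                    rejected_scores: torch.Tensor) -> torch.Tensor:
    """Penalize the reward model whenever the score of the human-preferred
    response does not clearly exceed the score of the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Two comparisons: the model agrees with the first, disagrees with the second.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.9])
print(preference_loss(chosen, rejected))  # the second pair contributes most of the loss
```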
[14] In simpler terms, the objective function calculates how well the policy's responses are expected to align with human feedback.
The goal is to balance improving alignment with human preferences against keeping the model's responses diverse and not too far removed from what it learned during its initial training.
This helps the model not only to provide answers that people find useful or agreeable but also to maintain a broad understanding and avoid overly narrow or repetitive responses.
[15] Overall, this objective function defines the method for adjusting the RL policy, blending the aims of aligning with human feedback and preserving the model's original language understanding.
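A minimal sketch of such an objective, assuming per-sequence rewards from the reward model and log-probabilities of the sampled responses under both the current policy and the frozen initial (reference) model; the coefficient and all names are illustrative:

```python
import torch

def rlhf_objective(rewards: torch.Tensor,
                   logprob_policy: torch.Tensor,
                   logprob_reference: torch.Tensor,
                   kl_coeff: float = 0.1) -> torch.Tensor:
    """Average reward minus a KL-style penalty that keeps the trained policy
    close to the original model; maximizing this is the RL training signal."""
    kl_estimate = logprob_policy - logprob_reference  # per-sequence log-ratio
    return (rewards - kl_coeff * kl_estimate).mean()

# The second response scores well but the policy assigns it far more probability
# than the reference model does, so it is heavily penalized.
rewards = torch.tensor([0.9, 0.7])
logp_policy = torch.tensor([-12.0, -5.0])
logp_ref = torch.tensor([-13.0, -20.0])
print(rlhf_objective(rewards, logp_policy, logp_ref))
```

In practice, this kind of objective is typically maximized with a policy-gradient algorithm such as proximal policy optimization (PPO).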
RLHF suffers from challenges with collecting human feedback, learning a reward model, and optimizing the policy.
For instance, the model may become biased, favoring certain groups over others, if the feedback lacks impartiality, is inconsistent, or is incorrect.
For example, feedback drawn predominantly from a specific demographic might lead the model to learn peculiarities or noise along with the intended alignment.
Excessive alignment to the specific feedback it received (that is, to the bias therein) can cause the model to perform sub-optimally in new contexts or when used by different groups.
Even with a representative sample, conflicting views and preferences may result in the reward model favoring the majority's opinion, potentially disadvantaging underrepresented groups.
Such behavior, if unchecked, is not just incentivized but can cause significant deployment issues due to the model's potential to mislead.
Studies have found that humans are not skilled at identifying mistakes in LLM outputs in complex tasks; therefore, models learning to generate confident-sounding yet incorrect text can lead to significant issues when deployed.
Essentially, this approach directly shapes the model's decisions based on positive or negative human feedback.