The exploration–exploitation dilemma, also known as the explore–exploit tradeoff, is a fundamental concept in decision-making that arises in many domains.
[1][2] It is often framed as a balancing act between two opposing strategies.
Exploitation involves choosing the best option based on current knowledge of the system (which may be incomplete or misleading), while exploration involves trying out new options that may lead to better outcomes in the future at the expense of an exploitation opportunity.
Finding the optimal balance between these two strategies is a crucial challenge in many decision-making problems whose goal is to maximize long-term benefits.
[3] In the context of machine learning, the exploration–exploitation tradeoff is fundamental in reinforcement learning (RL), a type of machine learning that involves training agents to make decisions based on feedback from the environment.
The multi-armed bandit (MAB) problem is a classic example of the tradeoff, and many methods have been developed for it, such as epsilon-greedy, Thompson sampling, and the upper confidence bound (UCB) algorithm.
In RL settings more complex than the MAB problem, the agent can treat the choice among actions as a MAB, where the payoff of each action is its expected future reward.
For example, an agent using the epsilon-greedy method will usually "pull the best lever" by picking the action with the highest predicted expected reward (exploit), but with probability epsilon it will pick a random action instead (explore).
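As a rough illustration, the following Python sketch implements epsilon-greedy action selection for a simple bandit using sample-average value estimates; the function name, the incremental-mean update, and the default epsilon value are illustrative choices, not taken from any particular reference above.

```python
import random

def epsilon_greedy_bandit(pull, n_arms, n_steps, epsilon=0.1):
    """Run epsilon-greedy on a bandit; `pull(arm)` returns a sampled reward."""
    counts = [0] * n_arms            # times each arm was pulled
    values = [0.0] * n_arms          # running average reward per arm
    total = 0.0
    for _ in range(n_steps):
        if random.random() < epsilon:                   # explore
            arm = random.randrange(n_arms)
        else:                                           # exploit the current best estimate
            arm = max(range(n_arms), key=lambda a: values[a])
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean update
        total += r
    return values, total

# Example: three Gaussian arms with true means 0.1, 0.5, 0.3.
values, total = epsilon_greedy_bandit(
    lambda arm: random.gauss((0.1, 0.5, 0.3)[arm], 1.0), n_arms=3, n_steps=10_000)
```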
Monte Carlo tree search, for example, uses a variant of the UCB method.
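For comparison, here is a minimal sketch of UCB1-style selection, the rule that UCT-style tree searches adapt for choosing which child node to expand; the exploration constant and the handling of untried arms are conventional choices assumed here.

```python
import math

def ucb1_select(values, counts, c=math.sqrt(2)):
    """Pick the arm (or tree action) maximizing mean value plus an exploration bonus.

    `values[i]` is the running mean reward of arm i; `counts[i]` is how often it was tried.
    """
    total = sum(counts)
    def score(i):
        if counts[i] == 0:
            return float("inf")       # try every arm at least once
        return values[i] + c * math.sqrt(math.log(total) / counts[i])
    return max(range(len(values)), key=score)
```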
In intrinsic exploration methods, instead of trying to get the agent to balance exploration and exploitation, exploration is simply treated as another form of exploitation, and the agent attempts to maximize the sum of the rewards from both: $r_t = r_t^{\text{int}} + r_t^{\text{ext}}$, where $r_t^{\text{int}}$ and $r_t^{\text{ext}}$ are the intrinsic and extrinsic rewards at time step $t$.
[8] The forward dynamics model is a function for predicting the next state from the current state and the current action: $\hat{s}_{t+1} = f(s_t, a_t)$.
The forward dynamics model is trained as the agent plays.
The model becomes better at predicting state transitions for state–action pairs that have been performed many times.
A forward dynamics model can define an exploration reward by $r_t^{\text{int}} = \lVert f(s_t, a_t) - s_{t+1} \rVert^2$; that is, the reward is the squared error of the prediction compared to reality.
This rewards the agent for performing state–action pairs that have not been tried many times.
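A minimal sketch, assuming a small feed-forward PyTorch model and discrete actions, of how the squared prediction error of a forward dynamics model can serve as an intrinsic reward; the class and function names, layer sizes, and one-hot action encoding are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardDynamics(nn.Module):
    """Predicts the next state from the current state and a one-hot encoded action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        a = F.one_hot(action, self.n_actions).float()
        return self.net(torch.cat([state, a], dim=-1))

def curiosity_reward(model, state, action, next_state):
    """Intrinsic reward = squared prediction error of the dynamics model.

    The same squared error is used as the model's training loss, so the
    reward shrinks for transitions the agent has seen many times.
    """
    with torch.no_grad():
        pred = model(state, action)
    return ((pred - next_state) ** 2).sum(dim=-1)   # large for unfamiliar transitions
```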
The dynamics model can also be run in a latent space: $\hat{z}_{t+1} = f(z_t, a_t)$, where $z_t = g(s_t)$ is the output of a featurizer $g$. The featurizer can be the identity function (so the model operates on raw states), randomly generated, the encoder half of a variational autoencoder, etc.
A good featurizer improves forward dynamics exploration.
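A sketch of the latent-space variant, here using a fixed, randomly initialized featurizer (one of the options listed above); the dimensions, module structure, and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, n_actions, feature_dim = 16, 4, 64

# Fixed random featurizer g: states -> latent features (never trained).
featurizer = nn.Sequential(nn.Linear(state_dim, feature_dim), nn.ReLU(),
                           nn.Linear(feature_dim, feature_dim))
for p in featurizer.parameters():
    p.requires_grad_(False)

# Forward dynamics model f in latent space: (z_t, a_t) -> predicted z_{t+1}.
latent_dynamics = nn.Sequential(nn.Linear(feature_dim + n_actions, 128), nn.ReLU(),
                                nn.Linear(128, feature_dim))

def latent_curiosity(state, action, next_state):
    """Intrinsic reward = squared prediction error measured in feature space."""
    a = F.one_hot(action, n_actions).float()
    z, z_next = featurizer(state), featurizer(next_state)
    pred = latent_dynamics(torch.cat([z, a], dim=-1))
    return ((pred - z_next) ** 2).sum(dim=-1).detach()
```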
[12] The Intrinsic Curiosity Module (ICM) method trains a forward dynamics model and a featurizer simultaneously. The featurizer is trained via an inverse dynamics model, a function for predicting the current action from the features of the current and the next state: $\hat{a}_t = h(g(s_t), g(s_{t+1}))$.
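A rough sketch of an ICM-style module under the description above: the inverse dynamics loss trains the featurizer, while the forward model's prediction error in feature space is returned as the intrinsic reward. The layer sizes, the detaching of features in the forward loss, and the absence of loss weighting are simplifications, not the published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Sketch of an Intrinsic Curiosity Module: featurizer + inverse + forward models."""
    def __init__(self, state_dim, n_actions, feat=64):
        super().__init__()
        self.n_actions = n_actions
        self.featurizer = nn.Sequential(nn.Linear(state_dim, feat), nn.ReLU())
        self.inverse = nn.Linear(2 * feat, n_actions)            # (z_t, z_{t+1}) -> action logits
        self.forward_model = nn.Linear(feat + n_actions, feat)   # (z_t, a_t) -> predicted z_{t+1}

    def losses_and_reward(self, state, action, next_state):
        z, z_next = self.featurizer(state), self.featurizer(next_state)

        # Inverse dynamics loss: trains the featurizer to keep action-relevant information.
        logits = self.inverse(torch.cat([z, z_next], dim=-1))
        inverse_loss = F.cross_entropy(logits, action)

        # Forward dynamics error in feature space: training loss for the forward model
        # and, detached, the intrinsic reward for the agent.
        a = F.one_hot(action, self.n_actions).float()
        pred = self.forward_model(torch.cat([z.detach(), a], dim=-1))
        forward_error = ((pred - z_next.detach()) ** 2).sum(dim=-1)
        return inverse_loss, forward_error.mean(), forward_error.detach()
```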
In the random network distillation (RND) method, a fixed, randomly initialized teacher network maps each state to an output, and a trainable student network learns to predict the teacher's output. As a state is visited more and more, the student network becomes better at predicting the teacher.
Meanwhile, the prediction error is also an exploration reward for the agent, and so the agent learns to perform actions that result in higher prediction error.
Thus, the student network attempts to minimize the prediction error while the agent attempts to maximize it, resulting in exploration.
The rewards are normalized by dividing by a running estimate of their variance.
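A rough sketch of the student–teacher reward described above, including a running-variance normalization; the network sizes, the optimizer, and the simple exponential-moving-average variance estimate are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DistillationReward:
    """Prediction error of a trained 'student' against a frozen, random 'teacher'."""
    def __init__(self, state_dim, feat=64, lr=1e-4):
        self.teacher = nn.Sequential(nn.Linear(state_dim, feat), nn.ReLU(), nn.Linear(feat, feat))
        for p in self.teacher.parameters():
            p.requires_grad_(False)                     # teacher stays fixed and random
        self.student = nn.Sequential(nn.Linear(state_dim, feat), nn.ReLU(), nn.Linear(feat, feat))
        self.opt = torch.optim.Adam(self.student.parameters(), lr=lr)
        self.run_var = 1.0                              # running estimate of reward variance

    def __call__(self, state):
        error = ((self.student(state) - self.teacher(state)) ** 2).mean(dim=-1)
        self.opt.zero_grad()
        error.mean().backward()                         # student learns to imitate the teacher
        self.opt.step()
        reward = error.detach()
        # Exponential moving average as a simple stand-in for a running variance estimate.
        self.run_var = 0.99 * self.run_var + 0.01 * reward.var(unbiased=False).item()
        return reward / (self.run_var + 1e-8) ** 0.5
```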
[7][14] Exploration by disagreement trains an ensemble of forward dynamics models, each on a random subset of all observed transition tuples $(s_t, a_t, s_{t+1})$.
The exploration reward is the variance of the models' predictions.
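A sketch of the disagreement reward: an ensemble of forward dynamics models scores a transition by the variance of their next-state predictions. The ensemble size, model architecture, and bootstrap training (each model fit to its own random subset of transitions) are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(state_dim, n_actions, hidden=128):
    return nn.Sequential(nn.Linear(state_dim + n_actions, hidden), nn.ReLU(),
                         nn.Linear(hidden, state_dim))

def disagreement_reward(ensemble, state, action, n_actions):
    """Exploration reward = variance across the ensemble's next-state predictions."""
    a = F.one_hot(action, n_actions).float()
    x = torch.cat([state, a], dim=-1)
    with torch.no_grad():
        preds = torch.stack([m(x) for m in ensemble])   # (n_models, batch, state_dim)
    return preds.var(dim=0).mean(dim=-1)                # high where the models disagree

# Each model would be trained on its own random subset of the observed
# (s_t, a_t, s_{t+1}) transitions, so the models disagree on rarely seen inputs.
ensemble = [make_model(state_dim=16, n_actions=4) for _ in range(5)]
```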
Another approach injects randomness into the agent's network itself: some network parameters are random variables drawn from a probability distribution.
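As a sketch of this idea, the layer below samples its weights from a learned Gaussian on every forward pass, so the parameters themselves are random variables; the particular parameterization (independent Gaussian noise per weight) is an assumption for illustration, not a specific published variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights are sampled from a learned Gaussian each forward pass."""
    def __init__(self, in_dim, out_dim, sigma0=0.5):
        super().__init__()
        self.mu_w = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        self.sigma_w = nn.Parameter(torch.full((out_dim, in_dim), sigma0 / in_dim ** 0.5))
        self.mu_b = nn.Parameter(torch.zeros(out_dim))
        self.sigma_b = nn.Parameter(torch.full((out_dim,), sigma0 / in_dim ** 0.5))

    def forward(self, x):
        # Sampled parameters: learned mean + learned scale * standard normal noise.
        w = self.mu_w + self.sigma_w * torch.randn_like(self.sigma_w)
        b = self.mu_b + self.sigma_b * torch.randn_like(self.sigma_b)
        return F.linear(x, w, b)
```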