Q-learning

[1] For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state.

[2] "Q" refers to the function that the algorithm computes: the expected reward—that is, the quality—of an action taken in a given state.

Executing an action in a specific state provides the agent with a reward (a numerical score).

Consider the example of boarding a train: one strategy is to enter the train doors as soon as they open, minimizing your initial wait time, at the cost of fighting your way through the crowd of departing passengers.

The total boarding time, or cost, is then the initial wait time plus the time spent fighting through the crowd. On the next day, by random chance (exploration), you decide to wait and let other people depart first. This results in a longer initial wait but less time spent fighting through the crowd, so the total boarding time, and hence the cost, is lower.
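
The discussion of the learning rate and discount factor below refers to the Q-learning update rule. In its standard form, with learning rate α (0 < α ≤ 1) and discount factor γ, it reads

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],

where r_t is the reward received for taking action a_t in state s_t, and s_{t+1} is the resulting state.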

Q-learning can also learn in non-episodic tasks, because the infinite series of discounted rewards converges.

If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.
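
As a worked illustration: if each step's reward is bounded by some r_max (a bound introduced here purely for illustration), the discounted return is bounded by the geometric series

    \sum_{k=0}^{\infty} \gamma^{k} r_{\max} = \frac{r_{\max}}{1 - \gamma},

which is finite for any discount factor γ < 1 but diverges as γ approaches 1.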

A learning rate of 0 makes the agent learn nothing (exclusively exploiting its prior knowledge), while a learning rate of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities).

When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero.
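
A minimal sketch of the tabular update in Python (all variable names here are illustrative) makes the two extremes concrete: alpha = 0 leaves the table unchanged, while alpha = 1 replaces the old estimate entirely with the new target.

    import numpy as np

    def q_update(Q, state, action, reward, next_state, alpha, gamma):
        # Temporal-difference target: immediate reward plus discounted best next value.
        target = reward + gamma * np.max(Q[next_state])
        # alpha = 0 keeps the old estimate; alpha = 1 overwrites it with the target.
        # In stochastic problems, alpha is typically decayed toward zero over time.
        Q[state, action] += alpha * (target - Q[state, action])
        return Q

    Q = np.zeros((5, 2))  # 5 states, 2 actions, initialized to zero
    Q = q_update(Q, state=0, action=1, reward=1.0, next_state=2, alpha=0.5, gamma=0.9)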

A discount factor of 0 makes the agent "myopic" by considering only the current reward r_t (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward.

[6] Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network.

[7] In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.
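
One simple way to realize such a schedule is a linear ramp (the ramp shape and the endpoint values below are assumptions for illustration, not prescribed by the text):

    def gamma_schedule(step, total_steps, gamma_start=0.90, gamma_final=0.99):
        # Linearly increase the discount factor from gamma_start to gamma_final.
        fraction = min(step / total_steps, 1.0)
        return gamma_start + fraction * (gamma_final - gamma_start)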

[8] Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs.

High initial values, also known as "optimistic initial conditions",[9] can encourage exploration: no matter which action is selected, the update rule will lower its value relative to the untried alternatives, thus increasing the probability that those alternatives are chosen.
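
A sketch of optimistic initialization (the table shape and the optimistic value are illustrative assumptions):

    import numpy as np

    n_states, n_actions = 10, 4
    optimistic_value = 10.0  # chosen above the largest return achievable in the task
    Q = np.full((n_states, n_actions), optimistic_value)
    # Each update pulls the value of a tried action below optimistic_value,
    # so a greedy agent is steered toward actions it has not tried yet.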

[10] Using the first reward to reset the initial conditions (RIC) seems to be consistent with human behaviour in repeated binary choice experiments.

[11] Q-learning can be combined with function approximation, which makes it possible to apply the algorithm to larger problems, even when the state space is continuous.

Function approximation may speed up learning in finite problems, because the algorithm can generalize earlier experiences to previously unseen states.
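
For example, with linear function approximation (the feature map φ and the weight vector w are notation introduced here for illustration), the action value is represented as Q(s, a) ≈ w^T φ(s, a), and the tabular update becomes the semi-gradient update

    w \leftarrow w + \alpha \left[ r_t + \gamma \max_{a} w^{\top}\phi(s_{t+1}, a) - w^{\top}\phi(s_t, a_t) \right] \phi(s_t, a_t),

so that every update adjusts the value estimates of all states sharing similar features, not just the one that was visited.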

To shrink the possible space of valid actions, multiple values can be assigned to a bucket.
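
A small sketch of this bucketing (the range and number of bins are illustrative assumptions):

    import numpy as np

    # Map a continuous quantity, e.g. a velocity in [-2, 2], onto a small number of buckets.
    bins = np.linspace(-2.0, 2.0, num=9)      # 9 edges define 10 buckets
    def to_bucket(value):
        return int(np.digitize(value, bins))  # bucket index usable as a table coordinate

    bucket_index = to_bucket(0.37)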

[16] Watkins was addressing “Learning from delayed rewards”, the title of his PhD thesis.

Eight years earlier in 1981 the same problem, under the name of “Delayed reinforcement learning”, was solved by Bozinovski's Crossbar Adaptive Array (CAA).

The architecture introduced the term “state evaluation” in reinforcement learning.

The crossbar learning algorithm, written in mathematical pseudocode in the paper, in each iteration performs the following computation: in state s perform action a; receive the consequence state s′; compute the state evaluation v(s′); and update the crossbar value w′(a, s) = w(a, s) + v(s′). The term "secondary reinforcement" is borrowed from animal learning theory to model state values via backpropagation: the state value v(s′) of the consequence state is backpropagated to the previously encountered states.

CAA computes state values vertically and actions horizontally (the "crossbar").

Reinforcement learning is unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q.

[3] Experience replay, which updates from a random sample of stored prior transitions rather than only the most recent one, removes correlations in the observation sequence and smooths changes in the data distribution.
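
A minimal sketch of such a replay memory (the class and method names are illustrative, not taken from a specific library):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=10000):
            self.memory = deque(maxlen=capacity)  # oldest transitions are discarded first

        def push(self, state, action, reward, next_state, done):
            self.memory.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Uniform random sampling breaks the temporal correlation between
            # consecutive observations used for the update.
            return random.sample(self.memory, batch_size)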

Double Q-learning[23] is an off-policy reinforcement learning algorithm, where a different policy is used for value evaluation than what is used to select the next action.

Two value functions, Q^A and Q^B, are trained in a mutually symmetric fashion using separate experiences; the double Q-learning update step is then as sketched below, where the estimated value of the discounted future is evaluated using a different policy than the one used to select the action, which solves the overestimation issue.
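
In the form commonly given, the update of Q^A uses Q^B to evaluate the action that Q^A itself selects (α and γ are the learning rate and discount factor as above):

    Q^{A}(s_t, a_t) \leftarrow Q^{A}(s_t, a_t) + \alpha \left[ r_t + \gamma \, Q^{B}\big(s_{t+1}, \arg\max_{a} Q^{A}(s_{t+1}, a)\big) - Q^{A}(s_t, a_t) \right],

and the update of Q^B is obtained by exchanging the roles of Q^A and Q^B; at each step one of the two functions is chosen (for example at random) to be updated.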

[26] Greedy GQ is a variant of Q-learning for use in combination with (linear) function approximation.

[27] The advantage of Greedy GQ is that convergence is guaranteed even when function approximation is used to estimate the action values.

Discretization of continuous values leads to inefficient learning, largely due to the curse of dimensionality.

Figure: Q-learning table of states by actions, initialized to zero, with each cell updated through training.