It was proposed by Rummery and Niranjan in a technical note[1] with the name "Modified Connectionist Q-Learning" (MCQ-L).
The name reflects the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) that drives the update.[2] Some authors use a slightly different convention and write the quintuple (S_t, A_t, R_t, S_{t+1}, A_{t+1}), depending on which time step the reward is formally assigned.
A SARSA agent interacts with the environment and updates the policy based on the actions actually taken, which is why it is known as an on-policy learning algorithm.
The Q value for a state–action pair is updated by an error term, scaled by the learning rate α. Q values represent the possible reward received in the next time step for taking action a in state s, plus the discounted future reward received from the next state–action observation.
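As a concrete illustration, the sketch below implements the standard tabular SARSA update, Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]. The environment interface (reset() returning a state index, step(action) returning next state, reward, and a done flag) and the ε-greedy behaviour policy are assumptions made for this example, not part of the original formulation.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Choose a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))        # implicit initial condition
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            # On-policy: the next action comes from the same policy being
            # learned, and it also appears in the update target.
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            target = reward if done else reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```

The update uses the action the agent will actually take next, rather than the greedy maximum over actions, which is the feature that distinguishes SARSA from Q-learning.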
Since SARSA is an iterative algorithm, it implicitly assumes an initial value Q(s, a) for every state–action pair before the first update occurs.
The first observed reward can be used to reset those initial values; this resetting-of-initial-conditions (RIC) approach seems to be consistent with human behavior in repeated binary choice experiments.
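A minimal sketch of the two initialization choices mentioned above, assuming a NumPy Q table as in the earlier example; the names q_init and first_reward are hypothetical and chosen for this illustration only.

```python
import numpy as np

def make_q_table(n_states, n_actions, q_init=0.0):
    """Explicitly set the initial condition the first SARSA update relies on."""
    return np.full((n_states, n_actions), float(q_init))

def reset_with_first_reward(Q, first_reward):
    """One reading of the RIC idea: overwrite the initial values with the
    first observed reward instead of an arbitrary constant."""
    Q[:] = first_reward
    return Q
```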