The Monte Carlo method, which uses random sampling for deterministic problems which are difficult or impossible to solve using other approaches, dates back to the 1940s.[6]
In his 1987 PhD thesis, Bruce Abramson combined minimax search with an expected-outcome model based on random game playouts to the end, instead of the usual static evaluation function.
Abramson said the expected-outcome model "is shown to be precise, accurate, easily estimable, efficiently calculable, and domain-independent."[7]
He experimented in-depth with tic-tac-toe and then with machine-generated evaluation functions for Othello and chess.
In 2008, MoGo achieved dan (master) level in 9×9 Go,[19] and the Fuego program began to win against strong amateur players in 9×9 Go.[20]
In January 2012, the Zen program won 3:1 in a Go match on a 19×19 board with an amateur 2 dan player.
In March 2016, AlphaGo was awarded an honorary 9-dan (master) level in 19×19 Go for defeating Lee Sedol in a five-game match with a final score of four games to one.[24]
AlphaGo represents a significant improvement over previous Go programs, as well as a milestone in machine learning, as it uses Monte Carlo tree search with artificial neural networks (a deep learning method) for policy (move selection) and value, giving it efficiency far surpassing previous programs.
The application of Monte Carlo tree search in games is based on many playouts, also called roll-outs.
Each round of Monte Carlo tree search consists of four steps: selection, expansion, simulation (playout), and backpropagation.[37]
(Figure: the steps involved in one decision, with each node showing the ratio of wins to total playouts from that point in the game tree for the player that the node represents.)
Rounds of search are repeated as long as the time allotted to a move remains.
Then the move with the most simulations made (i.e. the highest denominator) is chosen as the final answer.
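As an illustration of the loop just described, here is a minimal, self-contained Python sketch. The toy Nim-like game, the class names (NimState, Node) and the function mcts are all invented for this example, and the exploration constant √2 is only a common default; a real engine would differ in many details.

```python
import math
import random

class NimState:
    """Toy game: players alternately take 1 or 2 stones; taking the last stone wins."""
    def __init__(self, stones=10, player=1):
        self.stones, self.player = stones, player

    def legal_moves(self):
        return [m for m in (1, 2) if m <= self.stones]

    def play(self, move):
        return NimState(self.stones - move, 3 - self.player)

    def is_terminal(self):
        return self.stones == 0

    def winner(self):
        return 3 - self.player  # the player who took the last stone

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.untried = [], state.legal_moves()
        # wins are counted from the viewpoint of the player who moved into this node
        self.wins, self.visits = 0.0, 0

    def best_uct_child(self, c=math.sqrt(2)):
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                                  + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, iterations=5000, rng=random.Random(0)):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down through fully expanded nodes using UCT.
        while not node.untried and node.children:
            node = node.best_uct_child()
        # 2. Expansion: add one child for a not-yet-tried move.
        if node.untried:
            move = rng.choice(node.untried)
            node.untried.remove(move)
            child = Node(node.state.play(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation (playout): play random moves to the end of the game.
        state = node.state
        while not state.is_terminal():
            state = state.play(rng.choice(state.legal_moves()))
        winner = state.winner()
        # 4. Backpropagation: update win/visit counts on the path back to the root.
        while node is not None:
            node.visits += 1
            if winner != node.state.player:  # the player who moved into `node` won
                node.wins += 1
            node = node.parent
    # Final answer: the move with the most simulations (highest visit count).
    return max(root.children, key=lambda ch: ch.visits).move

if __name__ == "__main__":
    # With 10 stones, taking 1 (leaving a multiple of 3) is the winning move.
    print(mcts(NimState(stones=10)))
```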
In the simplest form, pure Monte Carlo game search, all feasible moves are determined for each position: k random games are played out to the very end, the scores are recorded, and the move with the best score is chosen.[39]
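This pure form can be sketched in a few lines, assuming the same hypothetical game interface (legal_moves, play, is_terminal, winner and a player attribute) as the toy example above; k and the 0/1 scoring are illustrative choices.

```python
import random

def pure_monte_carlo_move(position, k=100, rng=random.Random(0)):
    """Play k random games to the end for every feasible move and return the move
    with the best average score for the player to move at `position`."""
    me = position.player

    def playout(state):
        while not state.is_terminal():
            state = state.play(rng.choice(state.legal_moves()))
        return 1.0 if state.winner() == me else 0.0

    scores = {move: sum(playout(position.play(move)) for _ in range(k)) / k
              for move in position.legal_moves()}
    return max(scores, key=scores.get)
```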
DeepMind's AlphaZero replaces the simulation step with an evaluation based on a neural network.
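Schematically, this amounts to swapping the playout step for a network call; value_net here is a hypothetical function returning an estimated outcome of the position for the side to move, and everything else in the search loop sketched above is assumed unchanged.

```python
def evaluate_leaf(node, value_net):
    # Instead of step 3's random playout, ask a trained neural network for an
    # estimated value of node.state (e.g. in [0, 1] for the side to move), and
    # back that value up the tree in place of a playout result.
    return value_net(node.state)
```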
The first formula for balancing exploitation and exploration in games, called UCT (Upper Confidence Bound 1 applied to trees), was introduced by Levente Kocsis and Csaba Szepesvári.[17]
UCT is based on the UCB1 formula derived by Auer, Cesa-Bianchi, and Fischer[40] and the provably convergent AMS (Adaptive Multi-stage Sampling) algorithm first applied to multi-stage decision-making models (specifically, Markov Decision Processes) by Chang, Fu, Hu, and Marcus.[12]
Kocsis and Szepesvári recommend choosing in each node of the game tree the move for which the expression \( \frac{w_i}{n_i} + c\sqrt{\frac{\ln N_i}{n_i}} \) has the highest value, where \( w_i \) is the number of wins for the node considered after the i-th move, \( n_i \) is the number of simulations for that node, \( N_i \) is the total number of simulations run by the parent node, and \( c \) is the exploration parameter, theoretically equal to √2 but in practice usually chosen empirically.
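Read as code, the criterion looks like the small helper below; the names are illustrative, and children is assumed to be a list of (move, wins, visits) tuples for the node's child moves.

```python
import math

def uct_value(wins_i, visits_i, parent_visits, c=math.sqrt(2)):
    """w_i/n_i + c*sqrt(ln(N_i)/n_i); unvisited children get +inf so each move is tried once."""
    if visits_i == 0:
        return float("inf")
    return wins_i / visits_i + c * math.sqrt(math.log(parent_visits) / visits_i)

def select_move(children, parent_visits, c=math.sqrt(2)):
    # Choose the child move maximizing the UCT expression.
    return max(children, key=lambda ch: uct_value(ch[1], ch[2], parent_visits, c))[0]
```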
Most contemporary implementations of Monte Carlo tree search are based on some variant of UCT that traces its roots back to the AMS simulation optimization algorithm for estimating the value function in finite-horizon Markov Decision Processes (MDPs) introduced by Chang et al.[12] (2005) in Operations Research.
In particular, pure Monte Carlo tree search does not need an explicit evaluation function.
Simply implementing the game's mechanics (i.e. generating the allowed moves in a given position and detecting the game-end conditions) is sufficient to explore the search space.
Thus, it can achieve better results than classical algorithms in games with a high branching factor.
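To make this concrete, the entire game-specific surface that pure MCTS relies on can be captured by an interface along these lines (a hypothetical sketch; the method names are invented for this example).

```python
from typing import Hashable, Optional, Protocol, Sequence

class Game(Protocol):
    """Only the game's mechanics are required; no evaluation function."""
    def legal_moves(self) -> Sequence[Hashable]: ...  # allowed moves in this position
    def play(self, move: Hashable) -> "Game": ...     # successor position after a move
    def is_terminal(self) -> bool: ...                 # game-end condition
    def winner(self) -> Optional[int]: ...             # outcome of a finished game (None for a draw)
```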
A disadvantage is that in certain positions, there may be moves that look superficially strong, but that actually lead to a loss via a subtle line of play.
Such "trap states" require thorough analysis to be handled correctly, particularly when playing against an expert player; however, MCTS may not "see" such lines due to its policy of selective node expansion.[43][44]
It is believed that this may have been part of the reason for AlphaGo's loss in its fourth game against Lee Sedol.
In many Go-playing programs, for instance, certain stone patterns in a portion of the board influence the probability with which a playout moves into that area.[47]
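A playout policy of this kind can be sketched as weighted random sampling; pattern_weight is a hypothetical, game-specific function (for Go, say, larger for moves matching known-good local stone patterns).

```python
import random

def heavy_playout_move(state, pattern_weight, rng=random.Random()):
    # Pick a playout move with probability proportional to a domain heuristic,
    # instead of uniformly at random.
    moves = state.legal_moves()
    weights = [pattern_weight(state, m) for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]
```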
Domain-specific knowledge may also be employed when building the game tree, to help the exploitation of some variations (lines of play).
One such method assigns nonzero priors to the number of won and played simulations when creating each child node, leading to artificially raised or lowered average win rates that cause the node to be chosen more or less frequently, respectively, in the selection step.[48]
A related method, called progressive bias, consists in adding to the UCB1 formula a \( b_i / n_i \) term, where \( b_i \) is a heuristic score of the i-th move.
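A rough sketch of how both ideas can be wired into the selection score follows; prior_wins, prior_visits and heuristic are hypothetical per-node domain-knowledge inputs, not part of any standard API.

```python
import math

def biased_uct_value(wins, visits, parent_visits,
                     prior_wins=0.0, prior_visits=0.0,
                     heuristic=0.0, c=math.sqrt(2)):
    # Priors seed the win/visit counts, shifting the node's average win rate
    # before any real playouts have been recorded.
    w, n = wins + prior_wins, visits + prior_visits
    if n == 0:
        return float("inf")
    exploit = w / n
    explore = c * math.sqrt(math.log(parent_visits) / n) if parent_visits > 0 else 0.0
    # Progressive-bias style term: a heuristic score whose influence fades as
    # the node accumulates simulations.
    bias = heuristic / n
    return exploit + explore + bias
```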
The exploratory phase, during which basic Monte Carlo tree search behaves almost randomly because it has not yet collected enough playout statistics, may be reduced significantly in a certain class of games using RAVE (Rapid Action Value Estimation).[50]
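RAVE's bookkeeping is not described in this excerpt; as a rough sketch of the usual formulation, each node additionally keeps "all moves as first" (AMAF) statistics, and the value used during selection blends the two with a weight that decays as real visits accumulate. The schedule below, with an equivalence parameter k, is one common choice rather than the only one.

```python
import math

def rave_value(wins, visits, amaf_wins, amaf_visits, k=1000.0):
    # Blend the ordinary Monte Carlo estimate with the AMAF estimate; early on
    # the plentiful AMAF statistics dominate, later the true playout statistics do.
    mc = wins / visits if visits else 0.0
    amaf = amaf_wins / amaf_visits if amaf_visits else 0.0
    beta = math.sqrt(k / (3.0 * visits + k)) if visits else 1.0
    return beta * amaf + (1.0 - beta) * mc
```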
Monte Carlo tree search can be executed concurrently by many threads or processes.
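One simple scheme, often called root parallelization, runs several independent searches and merges their root statistics. The sketch below assumes run_search is a hypothetical module-level function that takes a random seed, performs an independent search from the same position, and returns a {move: visit_count} mapping.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def parallel_best_move(run_search, n_workers=4):
    # Each worker performs its own Monte Carlo tree search; the visit counts of
    # the root's moves are then summed and the most-visited move is returned.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_search, range(n_workers)))  # one seed per worker
    totals = Counter()
    for counts in results:
        totals.update(counts)
    return totals.most_common(1)[0][0]
```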