Figure 12. Average scores and numbers of visits of nodes A, B, C, A1, A2 of Figure 11 as the MCTS simulation progresses.

MCTS simulation), the scores of A1 and A2 would eventually dominate the score of their parent node A so that it would be lower than the score of node B or C at which point MCTS would correctly select position B or C and avoid

Figure 13. Total scores and numbers of visits of all the child nodes of node A of Figure 11 at the end of simulation.

sudden death. In other words, the practical limitation of MCTS in dealing with sudden death is caused by the computational resource. In a way, this is somewhat expected when we note in Section 2.3 that convergence is quite slow even for a simple game position (Tic-Tac-Toe) and a simple algorithm (RPS).

4.2.4. Level-2 Sudden Death versus Level-3 Sudden Win

The example scenario in Figure 14 with K = 11 shows how MCTS fails to detect level-2 sudden death while pursuing level-3 sudden win. Shown on the left side of the figure is the current game position, a level-2 sudden death situation. It is now the computerâ€™s turn to move. The correct move should be B to avoid level-2 sudden death. However, after examining all available positions on the board with T = 10,000, MCTS decides to take position A to pursue level-3 sudden win.

The moves A, B correspond to two child nodes of the root node, represented by two circles labeled as A, B in Figure 14. Figure 15 plots the MCTS statistics of the two nodes. We observe that node A achieves a higher average score than node B does and is visited much more frequently.

Figure 16 plots the MCTS statistics of all the child nodes of node A when the simulation stops at T = 10,000. Among all the child nodes, node A1 is of index 56. We note that MCTS indeed visits A1 more frequently than any other child node and the total score of A1 is the lowest, similar to what we have observed in Figure 13. However, the difference between the child nodes is not as significant. The reason is probably that K is larger and T is smaller in this example and as a result MCTS is still biased towards exploration and exploitation has not been in effect.

The difference of the statistics of nodes A and B in Figure 15, however, is quite minor. In fact, we note that towards the end of the simulation, the average score of B increases while that of A decreases and MCTS visits node B increasingly more often, indicating that if more computational resource were available to allow T to go further beyond 10,000, then node B would probably overtake node A as the winner and MCTS would find the correct move and overcome level-2 sudden death.

5. An Improved MCTS Algorithm to Address Sudden Death/Win

The issue of sudden death/win is a known problem for MCTS [11] [12] . In [13] , the authors propose that if there is a move that leads to an immediate win or

Figure 14. An example of level-2 sudden death in Gomoku.

Figure 15. Average scores and numbers of visits of nodes A, B, A1 of Figure 14 as the MCTS simulation progresses.

loss, then play this move; otherwise, play MCTS. Clearly this algorithm only addresses level-1 sudden win or level-2 sudden problems. As mentioned before, level-2 sudden death or level-1 win is not a major issue in MCTS at least for playing Gomoku with reasonable K, T. While it is possible that with sufficient

Figure 16. Total scores and numbers of visits of all the child nodes of node A of Figure 14 at the end of simulation.

computational resource MCTS could eventually deal with sudden death/win, next we propose an improved MCTS algorithm that addresses this problem without increasing drastically required computational resource.

5.1. Basic Idea

To identify a level-k sudden death/win, we have to use a minimax search of depth k starting from the root node [11] . One brute-force algorithm, as shown in Figure 17(a), is that when the computer enters a new root node, it first runs the minimax search; if there is a sudden death/win within k steps, then follows the minimax strategy to take sudden win or avoid sudden death; otherwise, proceed with MCTS.

However, the computational costs of the minimax search would be significant even for a modest k. Carrying out the minimax search for every new root node reduces the computational resource left for MCTS. The key idea of our proposal is to not burden MCTS with the minimax search until it becomes necessary.

Figure 17(b) shows the flow chart of the improved MCTS algorithm. When the computer enters a new root node, it carries out MCTS. At the end of the simulation step of every MCTS iteration, if the number of steps from the root node to the terminal position exceeds k, then continue to the next MCTS

Figure 17. Flow chart of the improved MCTS algorithm: (a) Brute-force algorithm; and (b) Proposed algorithm.

iteration; otherwise, run the minimax search from the root node and follow the algorithm of Figure 17(a). In our implementation, we run a single minimax search of depth k to cover both level-k sudden death and level- $\left(k\xe2\u02c6\u20191\right)$ sudden win for even k.

From Proposition 2 the brute-force algorithm is guaranteed to detect any level-k sudden death or win. However, the improved MCTS algorithm can only guarantee the detectability with probability.

Proposition 4. Any level-k sudden death or win is detectable, with probability, with the improved MCTS algorithm in Figure 17(b), where the probability depends on the number of MCTS iterations, the complexity of the game, and the value of k.

The reason of â€œguarantee with probabilityâ€ in Proposition 4 is that the improved MCTS algorithm needs to encounter a terminal node within level-k in at least one of the many iterations before a minimax search is triggered. In Gomoku, we find that the probability is close to 1 when the number of iterations is large, the board size is modest, and depth k is a small number, e.g., T = 30,000, K = 9, k = 4.

5.2. Simulation Results

We have implemented the improved MCTS algorithm in JAVA. The implementation includes a minimax search of depth 4, and we confirm that the program is able to always detect any level-2/4 sudden death and level-1/3 sudden win. To assess the computational resource used in the minimax search of depth 4 and the overall MCTS algorithm, we record the total elapsed time of every MCTS step, which includes T simulations and possibly a minimax search, as well as the elapsed time of every minimax search when it happens. Figure 18 shows that the cumulative distribution functions (CDF) of the two elapsed times for the case of T = 30,000, K = 9.

We observe that the elapsed time of a minimax search is pretty uniformly distributed from 0 to about 1.8 seconds and that the elapsed time of a MCTS step is either small (below 2 seconds) or quite large (above 6 seconds), where the small value corresponds to the scenarios where MCTS can find easy moves to win the game (level-1 or level-3 sudden win). On average, a single minimax search is about 1/6 of the entire MCTS 30,000 simulations. As the depth increases, the minimax search will be exponentially expensive. Therefore, the advantage of the proposed algorithm over the brute-force algorithm could be significant.

5.3. Heuristic Strategies to Enhance Performance

In the improved algorithm of Figure 17(b), the minimax search of depth k runs from the root node to detect the presence of sudden death or win. We next propose to employ a heuristic strategy to reduce the minimax search complexity and to enhance the MCTS effectiveness. The basic idea is that it is not necessary

Figure 18. CDF of the total elapsed time in every MCTS step and the elapsed time of a minimax search.

to search over all the child nodes of the root node. Instead, we only need to run the minimax search from a subset of promising child nodes.

Suppose that in one iteration of MCTS a terminal node of loss (or win) is encountered within depth k from the root node, indicating the potential presence of sudden death (or win). Mark the child node of the root node selected in that iteration as a potential sudden death (or win) move.

Consider the first heuristic strategy. Instead of triggering a minimax search immediately as in Figure 17(b), we continue to run MCTS for some number of iterations and then apply the following two rules.

1) For the subset of child nodes of the root node whose average scores are in the top ${x}_{1}$ percentile and that have been marked as a potential sudden death move, run a minimax search of depth $k\xe2\u02c6\u20191$ from each of them. If the minimax score of a child node in the subset is equal to âˆ’1, then mark it as a sudden death move and refrain from selecting it in subsequent MCTS iterations. Continue to run MCTS.

2) For the subset of child nodes of the root node whose average scores are in the top ${x}_{2}$ percentile and that have been marked as a potential sudden win move, run a minimax search of depth $k\xe2\u02c6\u20191$ from each of them. If the minimax score of any of them is equal to 1, then mark the child node as a sudden win move, which is the best move to select, and stop MCTS. Otherwise, continue to run MCTS.

Rule 1 is to prevent a sudden death move from prevailing in MCTS node selection. For example, move A in Figure 11 will be detected and discarded, and therefore will not prevail over move B or C. Rule 2 is to discover a sudden win move and to end MCTS sooner. Here ${x}_{1}\mathrm{,}{x}_{2}$ are heuristic numbers that are used to trade off the reduction in minimax search with the improvement of MCTS effectiveness.

Consider the second heuristic strategy. Instead of running MCTS for some number of iterations, applying the above two rules and then continuing MCTS as in the first strategy, we can run MCTS to the end and then apply the following two similar rules.

1) For the child node of the root node with the highest average score, if it has been marked as a potential sudden death move, run a minimax search of depth $k\xe2\u02c6\u20191$ . If its minimax score is equal to âˆ’1, then discard it, proceed to the child node with the second highest average score and repeat this rule. Otherwise, proceed to the next rule.

2) For the subset of child nodes of the root node whose average scores are in the top ${x}_{2}$ percentile and that have been marked as potential sudden win moves, run a minimax search of depth $k\xe2\u02c6\u20191$ from each of them. If the minimax score of any of them is equal to 1, then select the move. Otherwise, select the move according to the MCTS rule.

The performance of the above two heuristic strategy can be assessed with the following proposition.

Proposition 5. Either the first or second heuristic strategy has the same capability to detect a sudden death as the improved MCTS algorithm in Figure 17(b). Its capability to detect a sudden win improves as ${x}_{2}$ increases, and is in general strictly weaker than the improved MCTS algorithm, except in the special case where ${x}_{2}=100$ in which case it has the same sudden win detection capability.

In Gomoku, if ${x}_{2}$ is a small number, the complexity reduction over the improved algorithm of Figure 17(b) is in the order of $O\left({K}^{2}\right)$ , because the size of the subset of the child nodes from which a minimax search actually runs is roughly $1/{K}^{2}$ of the size of the total set. Even for a modest board size K, e.g., $K=9$ , the reduction is significant.

6. Conclusions

In this paper we analyze the MCTS algorithm by playing two games, Tic-Tac-Toe and Gomoku. Our study starts with a simple version of MCTS, called RPS. We find that the random playout search is fundamentally different from the optimal minimax search and, as a result, RPS may fail to discover the correct moves even in a very simple game of Tic-Tac-Toe. Both the probability analysis and simulation have confirmed our discovery.

We continue our studies with the full version of MCTS to play Gomoku. We find that while MCTS has shown great success in playing more sophisticated games like Go, it is not effective to address the problem of sudden death/win, which ironically does not often appear in Go, but is quite common on simple games like Tic-Tac-Toe and Gomoku. The main reason that MCTS fails to detect sudden death/win lies in the random playout search nature of MCTS. Therefore, although MCTS in theory converges to the optimal minimax search, with computational resource constraints in reality, MCTS has to rely on RPS as an important step in its simulation search step, therefore suffering from the same fundamental problem as RPS and not necessarily always being a winning strategy.

Our simulation studies use the board size and number of simulations to represent the game complexity and computational resource respectively. By examining the detailed statistics of the scores in MCTS, we investigate a variety of scenarios where MCTS fails to detect level-2 and level-4 sudden death. Finally, we have proposed an improved MCTS algorithm by incorporating minimax search to address the problem of sudden death/win. Our simulation has confirmed the effectiveness of the proposed algorithm. We provide an estimate of the additional computational costs of this new algorithm to detect sudden death/win and present two heuristic strategies to reduce the minimax search complexity and to enhance the MCTS effectiveness.

The importance of sudden death/win is not limited to the board games. The real-world AI applications such as autonomous driving sometimes face similar situations where certain actions lead to drastic consequences. Therefore, we hope to extend the research of the MCTS algorithm to these applications in the future study.

Appendix A

A. Java Implementation

The following appendix sections describe the Java implementation of the analysis and simulation codes used in this study.

A1. Tic-Tac-Toe and Gomoku Game GUI

I downloaded a Java program from the Internet, which provides a graphic user interface (GUI) to play Tic-Tac-Toe. I modified the program to play Gomoku. Figure A1 shows a screenshot of the game GUI generated by the program.

The game GUI consists of two panels. The left panel shows two options to play the game: person-to-person or person-to-computer. In this study I am only interested in the latter option. The two colors, black (person) and green (computer), show whose turn it is to move. The right panel is a K × K grid to represent the game board.

The program uses an array currentTable[] of length K^{2} to represent the current game position, where element currentTable[i*K+j] is the state of row i and column j of the game board. currentTable[i*K+j] can be in one of three states:

enum TILESTATUS {EMPTY, PERSON, COMPUTER}.

Initially, all elements are EMPTY. When it is the personâ€™s turn to move and the human player clicks on one of the empty grid element, the program changes the corresponding element of currentTable[] to PERSON. When it is the computerâ€™s turn to move, the program calls a function.

int choseTile()

which returns an integer index pointing to one of the elements of currentTable[] representing the computerâ€™s move. The program changes the corresponding

Figure A1. A screenshot of Gomoku game environment.

element of currentTable[] to COMPUTER. After every move, the program calls a function

boolean checkWinner(TILESTATUS[] currentTable, int select)

to check whether the game position currentTable has a winner where select is the latest move.

Function choseTile() is where any AI algorithm is implemented. In the Tic-Tac-Toe Java program downloaded from the Internet, the default implementation is to randomly return any index that corresponds to an EMPTY element. In this study, I have implemented the RPS and MCTS algorithms in choseTile(). I will describe our implementation of RPS and MCTS later.

A2. Implementation of Probability Analysis of RPS

I wrote a Java program to calculate the expected score of each move from the current game position of RPS. The Java program is used in the probability analysis in Section 2.2.

The calculation is based on the recursive Equation (3). The current game position is represented by the root node and all the legal next-step moves from the current game position by the computer form its immediately child nodes. The Java problem uses a recursive function

float scoreIteration(TURN turn, int select)

to calculate the expected score of the child node presented by currentTable. Denote K the number of EMPTY elements in the root node and index[k] the k-th EMPTY element. The main() function calls scoreIteration() for each of the K child nodes of the root node, k, with turn set to COMPUTER, select set to index[k], and currentTable[index[k]] set to COMPUTER. currentTable[index[k]] is reset to EMPTY after each function call.

The programming idea of scoreIteration() is that if the node is terminal, then return âˆ’1, 0, 1 depending on the game outcome; otherwise, expand to its child nodes and call scoreIteration() recursively. When a node expands to its next level child nodes, the number of EMPTY elements decrements by one. The pseudo of scoreIteration() is given as follows.

A3. MCTS and RPS Implementation

I downloaded a Java program from the Internet (https://code.google.com/archive/p/uct-for-games/source), which implements the MCTS algorithm. I modified the MCTS program to work with the GUI program, to output various statistics to study the simulation results in Sections 2.3 and 4.2, and to allow flexible alrogithm selection, which will be necessary to implement the improved MCTS algorithm of Section 5.

The MCTS Java program defines a class called MCTSNode, which includes a number of variables and methods to implement the algorithm described in Section 3.2. The most important variables are

int totalScore to store the total score of the node,

int timesVisited to store the total number of times the node is visited,

ArrayList

int indexFromParentNode to store the move that its parent node takes to reach the node,

TILESTATUS[] nodeGameState to store the game position of the node,

TILESTATUS nodeTurn to store whose turn it is to move.

The average score of the node is simply totalScore/timesVisited. The most important methods are described below.

Function choseTile() calls bestMCTSMove(), which returns the move chosen by MCTS.

bestMCTSMove() in turn calls runTrial() in each MCTS iteration, in which the game ends in one of three outcomes

enum TURN {PERSON, COMPUTER, DRAW}.

runTrial() calls simulateFrom() to simulate a random playout beyond the tree boundary.

The RPS algorithm is a simpler version of MCTS. One variable is added to the MCTSNode class:

int levelFromTopNode to store the number of levels from the root node to the node. For a child node of the root node, levelFromTopNode is set to 1.

The only change to implement RPS is that line 5 of function runTrial() is replaced by

if Node is a leaf node AND levelFromTopNode == 0 then

so that only the root node is expanded to add its child nodes, which are never expanded and are kept as the leaf nodes.

A4. Improved MCTS Implementation

I revised the MCTS Java program to implement the improved MCTS algorithm of Figure 17(b). In addition to levelFromTopNode, two variables are added to the MCTSNode class:

int simulationDepth to store the number of steps from the root node to where the game ends in function simulateFrom()

ArrayList

Besides, a static variable is defined:

int MINIMAXSEARCH to indicate whether the minimax search should be taken and whether has been taken.

The changes to functions bestMCTSMove(), runTrial() and simulateFrom() are summarized as follows.

In bestMCTSMove(), lines 3 to 4 are replaced by the following.

In runTrial(), lines 7 to 11 are replaced by the following. Here, constant NUMSIMULATIONDEPTH is the depth k in Figure 17(b)).

In simulateFrom(), the following is added immediately after line 8.

simulationDepth â† simulationDepth+1

Function minimax() does the minimax search and constructs the array lists nextMinimaxMoves of the root node and the child nodes up to level NUMSIMULATIONDEPTH. I wrote two versions of minimax(). The first version uses recursion.

The recursive version of minimax() can successfully play Tic-Tac-Toe but runs into heap memory errors when playing Gomoku even with modest board sizes. Although the code is simple and clean, the recursive function calls consume a significant amount of memory. To overcome this problem, I wrote the second version using iteration. The second version does not run into any errors but is not general enough for any NUMSIMULATIONDEPTH. NUMSIMULATIONDEPTH has to be set to 4.

Cite this paper

Li, W. (2018) Prediction Distortion in Monte Carlo Tree Search and an Improved Algorithm.*Journal of Intelligent Learning Systems and Applications*, **10**, 46-79. doi: 10.4236/jilsa.2018.102004.

Li, W. (2018) Prediction Distortion in Monte Carlo Tree Search and an Improved Algorithm.

References

[1] Schaeffer, J., MÃ¼ller, M. and Kishimoto, A. (2014) Go-Bot, Go. IEEE Spectrum, 51, 48-53.

https://doi.org/10.1109/MSPEC.2014.6840803

[2] https://en.wikipedia.org/wiki/Computer_Go

[3] Nandy, A. and Biswas, M. (2018) Googleâ€™s DeepMind and the Future of Reinforcement Learning. In: Reinforcement Learning, Apress, Berkeley, 155-163.

https://doi.org/10.1007/978-1-4842-3285-9_6

[4] Silver, D., et al. (2017) Mastering the Game of Go without Human Knowledge. Nature, 550, 354-359.

https://doi.org/10.1038/nature24270

[5] Coulom, R. (2006) Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In: van den Herik, H.J., Ciancarini, P. and Donkers, H.H.L.M., Eds., Computers and Games. CG 2006. Lecture Notes in Computer Science, vol 4630, Springer, Berlin, Heidelberg, 72-83.

[6] Kocsis, L. and SzepesvÃ¡ri, C. (2006) Bandit Based Monte-Carlo Planning. In: FÃ¼rnkranz, J., Scheffer, T. and Spiliopoulou, M., Eds., Machine Learning: ECML 2006. ECML 2006. Lecture Notes in Computer Science, vol 4212, Springer, Berlin, Heidelberg, 282-293.

https://doi.org/10.1007/11871842_29

[7] Chaslot, G.M.J-B., Saito, J.-T., Bouzy, B., Uiterwijk, J.W.H.M. and van den Herik, H.J. (2006) Monte-Carlo Strategies for Computer Go. Proceedings of the 18th BeNeLux Conference on Artificial Intelligence, Belgium, 2006, 83-90.

[8] Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samthrakis, S. and Colton, S. (2012) A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 4, 1-49.

https://doi.org/10.1109/TCIAIG.2012.2186810

[9] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002) Finite-Time Analysis of the Multi-Armed Bandit Problem. Machine Learning, 47, 235-256.

https://doi.org/10.1023/A:1013689704352

[10] Chang, H., Fu, M., Hu, J. and Marcus, S. (2005) An Adaptive Sampling Algorithm for Solving Markov Decision Processes. Operations Research, 53, 126-139.

https://doi.org/10.1287/opre.1040.0145

[11] Ramanujan, R., Sabharval, A. and Selman, B. (2010) On Adversarial Search Spaces and Sampling-Based Planning. Proceedings of the 20th International Conference on Automated Planning and Scheduling, Toronto, 12-16 May 2010, 242-245.

[12] Browne, C. (2011) The Dangers of Random Playouts. ICGA Journal, 34, 25-26.

https://doi.org/10.3233/ICG-2011-34105

[13] Teytaud, F. and Teytaud, O. (2010) On the Huge Benefit of Decisive Moves in Monte-Carlo Tree Search Algorithms. Proceedings of IEEE Conference on Computational Intelligence and Games, Copenhagen, 8-21 August 2010, 359-364.

https://doi.org/10.1109/ITW.2010.5593334

[1] Schaeffer, J., MÃ¼ller, M. and Kishimoto, A. (2014) Go-Bot, Go. IEEE Spectrum, 51, 48-53.

https://doi.org/10.1109/MSPEC.2014.6840803

[2] https://en.wikipedia.org/wiki/Computer_Go

[3] Nandy, A. and Biswas, M. (2018) Googleâ€™s DeepMind and the Future of Reinforcement Learning. In: Reinforcement Learning, Apress, Berkeley, 155-163.

https://doi.org/10.1007/978-1-4842-3285-9_6

[4] Silver, D., et al. (2017) Mastering the Game of Go without Human Knowledge. Nature, 550, 354-359.

https://doi.org/10.1038/nature24270

[5] Coulom, R. (2006) Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In: van den Herik, H.J., Ciancarini, P. and Donkers, H.H.L.M., Eds., Computers and Games. CG 2006. Lecture Notes in Computer Science, vol 4630, Springer, Berlin, Heidelberg, 72-83.

[6] Kocsis, L. and SzepesvÃ¡ri, C. (2006) Bandit Based Monte-Carlo Planning. In: FÃ¼rnkranz, J., Scheffer, T. and Spiliopoulou, M., Eds., Machine Learning: ECML 2006. ECML 2006. Lecture Notes in Computer Science, vol 4212, Springer, Berlin, Heidelberg, 282-293.

https://doi.org/10.1007/11871842_29

[7] Chaslot, G.M.J-B., Saito, J.-T., Bouzy, B., Uiterwijk, J.W.H.M. and van den Herik, H.J. (2006) Monte-Carlo Strategies for Computer Go. Proceedings of the 18th BeNeLux Conference on Artificial Intelligence, Belgium, 2006, 83-90.

[8] Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samthrakis, S. and Colton, S. (2012) A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 4, 1-49.

https://doi.org/10.1109/TCIAIG.2012.2186810

[9] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002) Finite-Time Analysis of the Multi-Armed Bandit Problem. Machine Learning, 47, 235-256.

https://doi.org/10.1023/A:1013689704352

[10] Chang, H., Fu, M., Hu, J. and Marcus, S. (2005) An Adaptive Sampling Algorithm for Solving Markov Decision Processes. Operations Research, 53, 126-139.

https://doi.org/10.1287/opre.1040.0145

[11] Ramanujan, R., Sabharval, A. and Selman, B. (2010) On Adversarial Search Spaces and Sampling-Based Planning. Proceedings of the 20th International Conference on Automated Planning and Scheduling, Toronto, 12-16 May 2010, 242-245.

[12] Browne, C. (2011) The Dangers of Random Playouts. ICGA Journal, 34, 25-26.

https://doi.org/10.3233/ICG-2011-34105

[13] Teytaud, F. and Teytaud, O. (2010) On the Huge Benefit of Decisive Moves in Monte-Carlo Tree Search Algorithms. Proceedings of IEEE Conference on Computational Intelligence and Games, Copenhagen, 8-21 August 2010, 359-364.

https://doi.org/10.1109/ITW.2010.5593334