1 Introduction
Artificial Intelligence (AI) has become a major research topic in recent years due to the many discoveries and advances in machine learning. One application of machine learning to artificial intelligence is in the field of games. Games form a challenging task for intelligent agents to learn and play at expert human level. The main challenge that makes games extremely difficult for agents to master is their large state space: it is often infeasible or computationally expensive to find an optimal solution for winning a game. Reinforcement learning is one area of machine learning with widespread applications in training agents to play games. A Markov Decision Process (MDP) is generally used as a framework for modelling decision making in games. An MDP has four main elements: a finite set of states, a finite set of actions, a reward function and a state transition function [Sutton and Barto, 1998]. In reinforcement learning, an agent observes an input pattern or state, produces an output action, and receives a reward, either negative or positive, depending on its performance. The agent is trained to maximise its reward by choosing the actions that yield the greatest return. In games, a positive reward is generally associated with winning the game and with the actions that lead the agent to the winning state. In this research survey, the applications of reinforcement learning are analysed to determine whether reinforcement learning can be used to create an intelligent agent that is able to learn and play games like a human. In particular, the agent needs to be able to play games at expert human level.
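To make the agent-environment interaction described above concrete, the following minimal Python sketch shows the reinforcement learning loop over a toy two-state MDP; the states, actions and rewards are purely illustrative assumptions rather than elements of any of the games surveyed below.

    import random

    # A minimal sketch of the agent-environment loop, using a hypothetical
    # two-state MDP; the state names, actions and rewards are assumptions.
    STATES = ["start", "goal"]
    ACTIONS = ["left", "right"]

    def transition(state, action):
        """Toy state-transition function: 'right' from 'start' wins the game."""
        if state == "start" and action == "right":
            return "goal", 1.0   # positive reward for reaching the winning state
        return "start", 0.0      # otherwise no reward

    def run_episode(policy, max_steps=10):
        state, total_reward = "start", 0.0
        for _ in range(max_steps):
            action = policy(state)
            state, reward = transition(state, action)
            total_reward += reward
            if state == "goal":   # terminal state: episode ends
                break
        return total_reward

    random_policy = lambda s: random.choice(ACTIONS)
    print(run_episode(random_policy))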
In this survey, three main problems are analysed to determine the viability and progress of a human-like agent in games.
1.1 Learning with no human designed features or prior knowledge
One general problem with the inputs used to train agents is that prior knowledge or hand-designed features are usually given to the agent. This creates a human bias or prejudice: the agent does not learn from experience or explore new areas, but instead learns from knowledge designed by humans, and training merely optimises these features. Instead, the agent should be able to learn without any prior knowledge or features and determine what to do in the game from the actions, rules, rewards and the visible environment. The agent would need to devise its own strategy to win games. In other words, the agent should learn in a similar manner to a human playing a game for the first time.
1.2 Expert human level game play
In many computer games, virtual agents are hard-coded to produce the appearance of artificial intelligence. These approaches may work for simple problems, but for complex games with a large state space it is difficult to hard-code an optimal solution. One of the major challenges for reinforcement learning is determining a data structure for creating an expert human-level agent, as the large state space of games makes it appear impossible to represent an optimal solution with a simple look-up table or search tree.
1.3 Generality of learning architecture to diverse games
The third problem relating to reinforcement learning in game play is that the learning approach needs to be a general solution that can be applied to many games rather than being bound to a single game. The generality of these approaches is important for the advancement of artificial intelligence and for applications to other areas such as robotics.
2 Tesauro’s Application of Temporal Difference Learning for Backgammon
One of the first applications of reinforcement learning to learning the game of backgammon was proposed by Tesauro [1995]. Tesauro [1990] had previously developed Neurogammon, which used supervised learning with backpropagation through a neural network. Supervised learning differs from reinforcement learning in that it updates the weights of a neural network using labelled training data: the desired output is known during training and the network updates its weights to try to match that output. Neurogammon was able to play backgammon at the level of an intermediate human player. The inputs provided to Neurogammon's neural network were expert hand-designed features. These features and the supervised learning procedure create human bias or prejudice and require pre-processing of the board state to extract the features. Tesauro proposed TD-Gammon to learn with reinforcement learning from only the raw board state as input, thereby eliminating these human biases. TD-Gammon used temporal difference learning to update the weights of a neural network from evaluations of the previous turn's board state and the current turn's board state. TD-Gammon was trained by playing against itself over 200,000 times and was able to reach the level of an intermediate human player, similar to Neurogammon. Although TD-Gammon reached a similar level to Neurogammon, it was unable to reach expert human level play from the raw board state information alone. Tesauro then supplied the hand-designed features of Neurogammon to TD-Gammon for learning. The addition of these features allowed TD-Gammon to compete at the level of a world-class human player. Tesauro's work showed that reinforcement learning was a superior alternative to supervised learning for learning backgammon, but it was unable to reach expert level play without giving prior knowledge or features to the network. TD-Gammon also had limited applicability to other problems because backgammon has a stochastic environment due to the dice rolls, which allows the learning algorithm to explore more states than in other games because of the variability in training positions. Applying the TD-Gammon approach to deterministic games would therefore require external noise sources during training. While the applicability of TD-Gammon is limited, it did revolutionise how top-level backgammon players viewed openings, by discovering opening plays superior to the then-standard ones.
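The core temporal difference idea used by TD-Gammon can be sketched as follows. TD-Gammon itself used TD(lambda) with a multilayer neural network over the raw board encoding; the tabular TD(0) version below, with assumed learning parameters and a hypothetical encode_board helper, is only meant to illustrate how the value of the previous position is nudged towards the value of the current position.

    # A minimal, illustrative TD(0) sketch; not TD-Gammon's actual network-based
    # TD(lambda) update. ALPHA and GAMMA are assumed values.
    ALPHA, GAMMA = 0.1, 1.0
    value = {}   # V(s), defaulting to 0.0 for unseen positions

    def encode_board(board):
        """Hypothetical encoding of a raw board state into a hashable key."""
        return tuple(board)

    def td_update(prev_board, curr_board, reward=0.0):
        """TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
        s, s_next = encode_board(prev_board), encode_board(curr_board)
        td_error = reward + GAMMA * value.get(s_next, 0.0) - value.get(s, 0.0)
        value[s] = value.get(s, 0.0) + ALPHA * td_error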
3 Emigh et al.’s Approach to Speed Up Learning
Many games suffer from the curse of dimensionality, as highlighted in the game of backgammon: there is an extensive range of game scenarios or states and it is computationally too expensive to exhaustively explore all the possibilities. Emigh et al. [2016] tried to address this problem by using previous experiences to aid learning in the early stages. They developed a modification of Q-learning that uses nearest neighbour states to treat similar game situations as similar states and actions. Q-learning is a reinforcement learning technique for finding an optimal action policy for a finite Markov Decision Process. Q-learning is model-free, so it is able to estimate the expected utility of an action without modelling the environment, which is useful for creating general learning architectures. The value functions of reinforcement learning are updated using a nearest neighbour algorithm over an input feature set. A metric learning method is also used to identify important game-solving features and assign greater weight to them. Emigh et al. used two types of feature sets as inputs to evaluate and train their agent: a global feature set, in which the agent knew every object in the environment, and a local feature set, in which the agent only knew the surrounding features. Tests were conducted on one classical arcade game, Frogger, which has a very large, discrete state space. A tabular Q-learning approach failed to learn, as states in the state space were rarely revisited, while a neural network framework for Q-learning was successful when combined with the nearest neighbour approach. Evaluation was conducted on the speed of learning a policy. The nearest neighbour algorithm for Q-learning was found to improve learning performance. Metric learning mitigated the effect of irrelevant features and increased the impact of relevant ones, particularly when the network was given the global feature set, which contained a significant number of features that did not help the agent win the game. Although the results were successful, it is difficult to determine the generality of this approach to other games, as only one arcade game was tested, but Emigh et al. predicted similar success in other applications.
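The nearest neighbour idea can be sketched roughly as below: the Q-value of a new state-action pair is approximated from the k closest previously visited states in feature space. This is a simplification for illustration, not Emigh et al.'s exact algorithm; the feature vectors, memory structure and value of k are assumptions.

    import numpy as np

    # Illustrative nearest-neighbour Q-value estimate; not the authors' exact method.
    K = 3
    memory = []   # list of (feature_vector, action, q_value) tuples

    def store(features, action, q_value):
        memory.append((np.asarray(features, dtype=float), action, q_value))

    def nn_q_value(features, action):
        """Average the stored Q-values of the K nearest states for this action."""
        features = np.asarray(features, dtype=float)
        candidates = [(np.linalg.norm(features - f), q)
                      for f, a, q in memory if a == action]
        if not candidates:
            return 0.0
        candidates.sort(key=lambda x: x[0])
        return float(np.mean([q for _, q in candidates[:K]]))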
4 A General Architecture for AI in Atari Games
Mnih et al. [2015] looked to solve the problem of the learning agent requiring prior knowledge or hand-designed features to play games at expert human level. Their proposed approach was to combine deep learning and reinforcement learning. Deep learning uses a neural network with many hidden layers between the input and output layers, which allows multiple levels of representation and abstraction of the data between each layer. Their goal was to create a single convolutional neural network agent able to learn to play many Atari 2600 games, each with a different environment and different reward requirements. The network was provided with no game-specific information or hand-designed features. The agent is given the same information available to a human, which includes the video input, the reward, terminal signals and the set of possible actions. Mnih et al. propose deep Q-learning, a variant of the Q-learning algorithm. Their Deep Q-Network (DQN) combines convolutional neural networks with Q-learning and utilises experience replay to store the agent's experience at each time-step. The experience replay memory is randomly sampled during training to mitigate data correlation and non-stationary distributions. The agent was trained for over 10 million frames with a replay memory of the one million most recent frames. DQN was evaluated by comparing it to other reinforcement learning approaches, such as the Sarsa algorithm, and to expert human players. DQN outperformed the previous reinforcement learning approaches on six out of the seven Atari games tested. The results were very promising for the development of artificial intelligence in game play, as the previous learning approaches used prior knowledge and human-defined features while DQN required neither. Although DQN outperformed previous learning approaches, it was only able to surpass an expert human player on three games. The games in which DQN did not outperform an expert human player were those requiring strategy that extends over long time horizons. TD-Gammon had a similar problem, in that it would occasionally play a poor end game. Mnih et al. experimented with minor modifications and updates to DQN and increased the training time to 50 million frames. They were able to achieve expert human level play on 49 Atari games, and on 43 of these games they exceeded the results of previous reinforcement learning approaches. The results from DQN are promising for the development of artificial intelligence, as they were able to create a general architecture for an agent that plays multiple different games at expert human level. Although the agent was able to achieve expert human level game play, there is still a problem with reinforcement learning in that learning is too slow compared to humans. The agents need significantly longer than humans to learn to play games, and further research is required to achieve similar human-level learning efficiency.
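A minimal sketch of the experience replay mechanism is given below. The capacity, batch size and discount factor are assumed values, and q_network stands in for the convolutional network that DQN actually trains; only the storage and random sampling logic that breaks up correlations between consecutive frames is shown.

    import random
    from collections import deque

    import numpy as np

    # Illustrative experience replay sketch; capacity, GAMMA and BATCH_SIZE are assumed.
    replay = deque(maxlen=1_000_000)   # keeps only the most recent transitions
    GAMMA, BATCH_SIZE = 0.99, 32

    def remember(state, action, reward, next_state, done):
        replay.append((state, action, reward, next_state, done))

    def sample_targets(q_network):
        """Randomly sample a batch to break correlations between consecutive frames."""
        batch = random.sample(replay, BATCH_SIZE)
        targets = []
        for state, action, reward, next_state, done in batch:
            target = reward
            if not done:
                target += GAMMA * float(np.max(q_network(next_state)))
            targets.append((state, action, target))
        return targets   # fed to a gradient step on (Q(state, action) - target)^2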
5 Shallow Reinforcement Learning for AI in Atari Games
Liang et al. [2016] investigated the DQN work of Mnih et al. They evaluated the impact of key representational biases encoded by DQN and criticised its evaluation methodology. Mnih et al. reported only one trial of DQN's performance on each game, and Liang et al. believe the reported performance of DQN may be inflated by evaluating only the best-performing results. Liang et al. also criticised the prior knowledge given to the agent during training of DQN. Upon further investigation they found that Mnih et al. gave the agent a limited set of simple actions, rather than the whole set of controls, during training on the Atari 2600 games, unlike a human player, who has access to all the controls. Liang et al. also noted that the DQN approach had access to knowledge of lives, and training episodes terminated when the agent lost a life, which could have influenced learning as no explicit negative reward was given for losing lives. The goal of Liang et al. was to design a simpler reinforcement learning approach than the deep reinforcement learning approach by adapting previous methods. Instead of focusing on eliminating human-designed features as inputs, they focused on shallow reinforcement learning to reduce the complexity of the network. As inputs for training they used a modified version of the Basic feature set described by Bellemare et al. [2013]. Four different approaches were explored, each using a different feature set as input for training. The first approach used Basic Pairwise Relative Offsets in Space (B-PROS), which captures the pairwise relative distances between objects within a single screen. The second approach used Basic Pairwise Relative Offsets in Time (B-PROT), which pairs a past frame with the Basic features of the current frame to give the agent an indication of the direction of travel and velocity of objects; these are called the non-Markov features. The third approach, called B-PROST, combined the previous approaches to give the Basic, B-PROS and B-PROT features. The final approach, called Blob-PROST, is similar to B-PROST but uses blobs to identify and separate objects in the game. The non-Markov features were found to be critically important to successful game play. Blob-PROST's average performance was higher than that of the other approaches in 59% of the games. Liang et al. found it difficult to draw comparisons between their approach and DQN due to the non-standard evaluation methodology used by Mnih et al. They modified their evaluation to allow comparisons and found that Blob-PROST's results appeared to be of comparable quality to DQN's learned representation across many games. Blob-PROST did not need the deep neural network used in the DQN approach, and Liang et al. suggested that the representational properties learned by DQN are more important than the specific features it learns. Due to the difficulty in drawing comparisons, Liang et al. also proposed an Arcade Learning Environment (ALE) benchmark for future tests. They proposed that average performance be measured over 24 trials and that the full action set be available to the agent during training.
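The B-PROS idea of pairwise relative offsets can be illustrated with the rough sketch below. The real features are binary indicators over a coarse grid of offsets between pairs of object (or blob) classes; this simplified version just collects the raw offsets, and the (object_class, x, y) tuples are an assumed representation of detected objects.

    # Illustrative simplification of pairwise relative offsets in space (B-PROS).
    def pros_offsets(objects):
        """Collect relative (dx, dy) offsets between every pair of detected objects."""
        offsets = set()
        for cls_a, xa, ya in objects:
            for cls_b, xb, yb in objects:
                if (cls_a, xa, ya) != (cls_b, xb, yb):
                    offsets.add((cls_a, cls_b, xb - xa, yb - ya))
        return offsets

    # Example: a frog at (5, 10) and a car at (7, 10) on the screen.
    print(pros_offsets([("frog", 5, 10), ("car", 7, 10)]))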
6 Amato and Shani’s High Level Reinforcement Learning Approach for Partially Observable Environment Games
In the papers analysed so far, the learning agents have a fully observable environment, which gives them all the information required to make an optimal decision. Many strategy games, however, present a partially observable environment to the player, which makes it difficult to identify the actions needed to win. Strategy games are a difficult task to master due to uncertainty about game conditions, a large state space and a vast action set, and there has not been much research on these types of games. Amato and Shani [2010] introduced a high-level reinforcement learning approach for strategy games. Their goal was to create a learning agent able to quickly adapt to fixed opponent strategies and to reduce the weaknesses of hard-coded strategies. The problem with hard-coded strategies is that human players can easily discover and exploit their weaknesses using counter-strategies. Amato and Shani used the game Civilization IV for their tests. Civilization IV is a two-player turn-based strategy game with multiple paths to victory. They experimented with three different learning methods. The first was Q-learning, used as a baseline comparison method. The second was a model-based Q-learning method called Dyna-Q, which learns a model from experience and uses it to generate additional simulated updates. The third was a factored-model version of Dyna-Q that learns the transition function for each feature independently. The input provided to the agent was based on the state features available to it. The reward was based on a set of in-game scores that are also visible on screen to a human player. The actions the agent chose from were a set of pre-designed candidate strategies; these were game-specific strategies originally derived from hard-coded strategies. Their approach was shown to improve on the deficiencies of hard-coded strategies but is limited to a high-level view of the game. Of the three approaches tested, Q-learning performed the worst. The factored-model learner won more games but became stuck in a local optimum during learning, and the Dyna-Q method overtook it after longer training times. Amato and Shani focused only on the strategy for winning the game, and the game itself automatically handled the low-level decisions and actions needed to carry out that strategy. The agent is also limited in that it learnt against a fixed opponent strategy; if the opponent changed strategies it would perform poorly.
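The Dyna-Q method can be sketched in its standard tabular form as follows; the hyperparameters and the discrete state and action representation are assumptions, and Amato and Shani's agent actually chose among high-level candidate strategies rather than operating on raw game states.

    import random
    from collections import defaultdict

    # Standard tabular Dyna-Q sketch: each real update is followed by simulated
    # updates drawn from a learned model. Hyperparameters are assumed values.
    ALPHA, GAMMA, PLANNING_STEPS = 0.1, 0.95, 5
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state)

    def q_update(s, a, r, s_next, actions):
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

    def dyna_q_step(s, a, r, s_next, actions):
        q_update(s, a, r, s_next, actions)   # learn from real experience
        model[(s, a)] = (r, s_next)          # update the learned model
        for _ in range(PLANNING_STEPS):      # planning from simulated experience
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            q_update(ps, pa, pr, ps_next, actions)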
7 Major Breakthrough for AI in the Game of Go
One of the most challenging classical games for computers to win against professional players is Go. The challenge of Go is that although an optimal solution exists, the game is so complex that finding it appears infeasible due to the large branching factor and search space. Silver et al. [2016] from Google DeepMind set out to create a program called AlphaGo to play Go at a professional level using artificial neural networks trained extensively from both self-play and the analysis of human games. The best previous computer programs for Go used Monte Carlo tree search (MCTS) [Enzenberger et al., 2010]. These MCTS approaches are trained to predict expert human moves, and their policies are enhanced to narrow down the list of candidate actions by assigning probabilities to actions. MCTS alone was not able to achieve expert human level play and was only capable of strong amateur play. Silver et al. apply deep convolutional neural networks, of a similar architecture to those used in the visual domain, to Go. The neural network is given a 19 × 19 image of the board state and is trained to construct a representation of the position, which reduces the search space. A value network is used to evaluate positions and a policy network is used to sample actions. The neural networks are trained in several stages with different learning processes, each stage optimising a different aspect of the game. The first stage trains a supervised learning policy network on expert human moves, allowing the network to predict expert moves more accurately than previous supervised learning methods. The next stage trains a reinforcement learning policy network by self-play to optimise the goal of winning games. The final stage trains a value network to predict the winners of games from self-play of the reinforcement learning policy network. Optimal moves are then determined from the value and policy networks using Monte Carlo tree search. AlphaGo requires a huge amount of computational power, with dozens of threads, CPUs and GPUs, and hundreds of thousands of self-play games and expert human moves. Evaluation found AlphaGo to outperform all other Go programs, winning 99.8% of its games, and to beat human Go champions. The combination of machine learning techniques used to create AlphaGo was a major breakthrough in artificial intelligence research, and it provides a new perspective on how to approach complex problems to reach human-level performance.
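The way the policy and value networks guide the tree search can be illustrated with the much-simplified selection rule below (a PUCT-style formula). AlphaGo's actual search also mixes in rollout evaluations and runs asynchronously across many threads; the Node fields and the exploration constant C here are assumptions for illustration.

    import math

    # Simplified prior-guided tree node and selection rule; not AlphaGo's full search.
    C = 1.5   # exploration constant (assumed)

    class Node:
        def __init__(self, prior):
            self.prior = prior        # probability from the policy network
            self.value_sum = 0.0      # accumulated value-network evaluations
            self.visits = 0
            self.children = {}        # move -> Node

    def select_child(node):
        """Pick the move maximising value estimate + prior-weighted exploration."""
        total_visits = sum(child.visits for child in node.children.values())
        def score(child):
            q = child.value_sum / child.visits if child.visits else 0.0
            u = C * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
            return q + u
        return max(node.children.items(), key=lambda kv: score(kv[1]))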