Prisoner's Dilemma
Artificial Intelligence Research Project
60-371
Ananth Adhikarla
103462848
Abstract
The Prisoner's Dilemma is a game invented by Merrill Flood and Melvin Dresher in the 1950s; this paper focuses on the Iterated Prisoner's Dilemma experiments pioneered by Robert Axelrod. The Prisoner's Dilemma is a classic model of the evolution of behaviour. The Iterated Prisoner's Dilemma (IPD) is widely studied in Artificial Intelligence, Machine Learning, and evolutionary computing because it models cooperation between two self-interested individuals. In this paper I apply different optimization methods and compare them to find the best possible strategies, examine how those strategies fare against well-known strategies such as TFT, and finally determine whether the obtained strategy is efficient using machine learning.
Introduction
Before we progress into the various aspects of the Prisoner's Dilemma, we need to cover the basics.
To start, consider this summary adapted from Wikipedia. Two criminals, Ann and Bob, are caught and jailed. Each is held in solitary confinement, which means they cannot communicate with each other. The police lack the evidence to convict the pair on the principal charge, but wish to sentence both to at least a year in prison. Each prisoner is given the opportunity either to defect, by testifying that the other person committed the crime, or to cooperate with the other by remaining silent. The offer is:
If Ann and Bob each defect against the other, each of them serves 10 years in prison.
If Ann defects against Bob but Bob remains silent, Ann will be set free and Bob will serve 20 years in prison (and vice versa).
If Ann and Bob both remain silent, both of them will only serve 1 year in prison (on the lesser charge).
This gives a basic understanding of what the Prisoner's Dilemma is about. Over the course of the paper we will look at different optimization methods using different memory depths and see how they stand up against other strategies such as TFT, TF2T, and STFT. With the help of machine learning I hope to find the most suitable strategy.
Common Strategies Used
Several strategies appear throughout the Prisoner's Dilemma literature. These are the strategies I will consider; a code sketch of all five follows the list.
Tit-For-Tat (TFT) – The action chosen is based on the opponent's last move. On the first turn, the previous move cannot be known, so always cooperate on the first move. Thereafter, always choose the opponent's last move as your next move.
Tit-For-Two-Tat (TF2T) – Same as Tit for Tat, but requires two consecutive defections for a defection to be returned. Cooperate on the first two moves. If the opponent defects twice in a row, choose defection as the next move.
Suspicious Tit-For-Tat (STFT) – Always defect on the first move. Thereafter, replicate the opponent's last move.
Free Rider (ALLD) – Always choose to defect no matter what the opponent's last turn was.
This is a dominant strategy against an opponent that has a tendency to cooperate.
Always Cooperate (ALLC) – Always choose to cooperate no matter what the opponent's last turn was.
This strategy can be terribly abused by the Free Rider Strategy.
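To make these descriptions concrete, the following is a minimal sketch of the five strategies in Python (the function names and the history representation are my own illustration, not part of the original study). Each strategy maps the opponent's move history to a next move:

# 'C' = cooperate, 'D' = defect. The argument is the list of the
# opponent's past moves, oldest first.

def tft(opp_history):
    """Tit-For-Tat: cooperate first, then copy the opponent's last move."""
    return 'C' if not opp_history else opp_history[-1]

def tf2t(opp_history):
    """Tit-For-Two-Tat: defect only after two consecutive opponent defections."""
    if len(opp_history) >= 2 and opp_history[-1] == opp_history[-2] == 'D':
        return 'D'
    return 'C'

def stft(opp_history):
    """Suspicious Tit-For-Tat: defect first, then copy the opponent's last move."""
    return 'D' if not opp_history else opp_history[-1]

def alld(opp_history):
    """Free Rider: always defect."""
    return 'D'

def allc(opp_history):
    """Always Cooperate: always cooperate."""
    return 'C'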
Iterated Prisoner's Dilemma
If two players play the game repeatedly and remember the previous actions of their opponent, the game is called the Iterated Prisoner's Dilemma, commonly referred to as IPD. The following table represents the general payoff structure.
                          BOB
                  Cooperate      Defect
ANN   Cooperate     R, R          S, T
      Defect        T, S          P, P
In the table, the pair (X, Y) in the row and column corresponding to Ann and Bob respectively indicates that Ann's payoff is X and Bob's payoff is Y. In defining a Prisoner's Dilemma game, certain conditions have to hold, and the ordering of the payoffs is important. The best a player can do is to defect against a cooperating opponent and collect the temptation payoff, T. The worst a player can do is to cooperate against a defecting opponent and collect the sucker's payoff, S. If the two players cooperate, then the reward for that mutual cooperation, R, should be better than the punishment for mutual defection, P. Therefore, the following must hold: T > R > P > S. (For the iterated game it is also standard to require 2R > T + S, so that alternating exploitation cannot beat steady mutual cooperation.)
The IPD payoff matrix given in the project slides is:
                          BOB
                  Cooperate      Defect
ANN   Cooperate     3, 3          0, 5
      Defect        5, 0          1, 1
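To illustrate how this matrix drives play, here is a hedged sketch of one 64-game round between two of the strategy functions sketched earlier (the PAYOFFS dictionary simply encodes the table above; play_round is my own helper name):

# PAYOFFS[(ann_move, bob_move)] = (ann_payoff, bob_payoff), from the table above.
PAYOFFS = {
    ('C', 'C'): (3, 3),   # mutual cooperation: R, R
    ('C', 'D'): (0, 5),   # Ann is the sucker: S, T
    ('D', 'C'): (5, 0),   # Ann exploits Bob: T, S
    ('D', 'D'): (1, 1),   # mutual defection: P, P
}

def play_round(strategy_a, strategy_b, n_games=64):
    """Play n_games between two strategy functions; return their total scores."""
    history_a, history_b = [], []          # each player's own past moves
    score_a = score_b = 0
    for _ in range(n_games):
        # Each strategy sees only its opponent's history, as in the sketches above.
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

For example, play_round(tft, alld) returns (63, 68): TFT is exploited once, then both sides defect for the remaining 63 games.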
In the iterated game, player strategies are rules that determine, possibly stochastically, a player's next move in any given game situation, which can include the history of the game to that point. Each player's aim is to maximize his total payoff over the series. If you know how many times you are to play, then one can argue that the game can be reduced to a one-shot Prisoner's Dilemma. The argument is based on the observation that you, as a rational player, will defect on the last iteration, because you are in effect playing a single iteration. The same logic applies to your opponent. Knowing that your opponent will therefore defect on the last iteration, you will defect on the second-to-last iteration as well, and your opponent will make the same deduction. This logic can be applied all the way back to the first iteration. Thus, both players are locked into a sequence of mutual defections.
One way to avoid this situation is to use a regime in which the players do not know when the game will end. If the players know only the probability that the game continues, then from their point of view it is equivalent to an infinite game in which the payoffs of each successive round are discounted by a factor. Depending on the value of that factor and various other parameters, different Nash equilibria are possible in which both players play the same strategy.
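As a worked example (the notation here is mine, not from the original text): let w be the probability that the game continues after each round. Perpetual mutual cooperation is then worth R + wR + w^2 R + ... = R / (1 - w), while always defecting against a Tit-For-Tat opponent is worth T + wP / (1 - w). With the payoffs above (R = 3, T = 5, P = 1), cooperation is the better choice whenever 3 / (1 - w) >= 5 + w / (1 - w), which simplifies to w >= 1/2: a sufficiently long expected future makes mutual cooperation stable.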
Some strategies have no advantage over the single game. A player who cooperates regardless of previous behavior (AllC) or who always defects (AllD) will score no better than its memory-less counterpart. Much research suggests, however, that the Tit For Tat (TFT) strategy is very successful. This strategy simply states that a player should repeat the opponent's move of the previous round. In earlier research, TFT has been shown to outperform most other strategies [2]. Another strategy shown to perform well against a wide range of opponents is the Pavlov strategy (win-stay, lose-shift), which repeats its previous move after a good payoff and switches moves after a poor one.
Genetic Algorithm and Simulation (GA)
Genetic algorithms lend themselves well to studying strategies in the Prisoner's Dilemma, because each player can be represented directly by its strategy. In the memory-three game used in this study, the last three moves of each player give 2^6 = 64 possible histories, so each player's strategy must specify a response to each of them. We therefore encode each player in the algorithm as a 64-bit string, one move per possible history.
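A hedged sketch of this encoding follows (the exact bit ordering used in the study is not specified, so the indexing below is one plausible choice, and the handling of the first three rounds, before a full history exists, is omitted):

import random

# A memory-three strategy is a table of 2**6 = 64 moves, one per possible
# joint history of the last three rounds (three of the player's own moves
# plus three of the opponent's). We store it as a 64-character string of
# 'C'/'D' genes.

def random_strategy():
    """A randomly initialized 64-gene chromosome."""
    return ''.join(random.choice('CD') for _ in range(64))

def history_index(own_last3, opp_last3):
    """Map the last three moves of each player to an index in 0..63."""
    bits = ''.join('0' if m == 'C' else '1' for m in own_last3 + opp_last3)
    return int(bits, 2)

def next_move(chromosome, own_last3, opp_last3):
    """Look up the strategy's response to a given memory-three history."""
    return chromosome[history_index(own_last3, opp_last3)]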
After calculating fitness, which is described in the next section, this study implements roulette wheel selection, also called stochastic sampling with replacement [4]. In this stochastic algorithm, the fitness of each individual is normalized. Based on their fitness, individuals are mapped to contiguous segments of a line, such that each individual's segment is equal in size to its normalized fitness. A random number is generated, and the individual whose segment spans that number is selected. The process is repeated until the required number of individuals is obtained.
The table below shows a sample population with raw and normalized fitness; a code sketch of the selection and variation operators follows the table.
Person 1 has a normalized fitness of approximately 0.20, which gives it a 1 in 5 chance of being selected. Person 10 has the lowest fitness, with a normalized fitness of 0.02. If a person had a fitness of zero, that person would have no chance of being selected to propagate into the new population. Random points are selected on this line to choose individuals to reproduce. Children's chromosomes (strategies) are produced by single-point crossover at a random point in the parents' chromosomes. The mutation rate was 0.001, which produced approximately one mutation in the population per generation, and the recombination rate was set at 0.8.
Person   Fitness   Normalized Fitness
1        27        0.20
2        22        0.17
3        18        0.14
4        15        0.11
5        17        0.10
6        12        0.10
7        9         0.07
8        8         0.06
9        4         0.03
10       3         0.02
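The selection and variation operators described above might look as follows (a minimal sketch; the function boundaries and the fallback handling are my own, and the rates are the ones quoted in the text):

import random

def roulette_select(population, fitnesses, n):
    """Stochastic sampling with replacement: each individual's chance of
    selection equals its share of the total fitness, as in the table above."""
    total = sum(fitnesses)
    chosen = []
    for _ in range(n):
        spin = random.random() * total       # random point on the fitness line
        cumulative = 0.0
        pick = population[-1]                # fallback for floating-point edge cases
        for individual, fit in zip(population, fitnesses):
            cumulative += fit
            if spin < cumulative:
                pick = individual
                break
        chosen.append(pick)
    return chosen

def crossover(parent_a, parent_b):
    """Single-point crossover at a random point in the parents' chromosomes."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(chromosome, rate=0.001):
    """Flip each gene (C <-> D) independently with the given probability."""
    genes = []
    for gene in chromosome:
        if random.random() < rate:
            genes.append('D' if gene == 'C' else 'C')
        else:
            genes.append(gene)
    return ''.join(genes)

With the recombination rate of 0.8, each selected pair would undergo crossover with probability 0.8 and otherwise be copied unchanged into the next generation.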
The simulations described below use the IPD payoff matrix given earlier.
Simulations in this study utilized a genetic algorithm to evolve strategies for the Prisoner's Dilemma. Each simulation began with an initial population of twenty players, each represented by its strategy. Several terms are used in this section. A game refers to one 'turn' in the Prisoner's Dilemma: both players move simultaneously, and each is awarded points based on the outcome. A round is a set of games between two players; rounds in this study are 64 games long. A cycle is completed when every player has played one round against every other player.

To determine fitness, each player was paired with every other for one round of 64 games. Players did not compete against themselves. Since there are sixty-four possible histories, this number of games ensures that each reachable history is visited at least once. After each game, the players' scores are tallied and their histories are updated. Players maintain a performance score, which is the sum of the points they receive in every game against every player. The maximum possible performance score is 6,080: if a player defected in every game and his opponents all cooperated in every game, he would receive 5 points × 64 games × 19 opponents. For a player who mutually cooperates in every game, the performance score would be 3,648 (3 points × 64 games × 19 opponents).

After a full cycle of play, players are ranked according to their performance score and selected to reproduce. Recombination occurs, children replace parents, and the cycle repeats. At the end of each generation, the total score for the population is tallied; this value is the sum of the scores of all members of the population.
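A hedged sketch of one such cycle, reusing the play_round helper from earlier (here each player is treated as a strategy function; wiring the 64-gene chromosomes into play_round via the history-index lookup is omitted for brevity):

def cycle_scores(population, n_games=64):
    """One full cycle: every player plays one 64-game round against every
    other player (never itself); returns each player's performance score."""
    scores = [0] * len(population)
    for i in range(len(population)):
        for j in range(i + 1, len(population)):
            score_i, score_j = play_round(population[i], population[j], n_games)
            scores[i] += score_i
            scores[j] += score_j
    return scores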
While the maximum score for an individual is 6,080, the maximum score for a population of 20 cannot be 20 times that: for one individual to score the maximum, all others must score very low. The highest cumulative score achievable in an individual game is 6, when both players receive 3 points for mutual cooperation. Mutual defection gives a total game score of 2 (1 point each), and mixed plays, with one cooperator and one defector, give a game total of 5 (5 for the defector plus 0 for the cooperator). Thus, the highest score a population can achieve is 72,960 (3 points × 64 games × 19 opponents gives 3,648 per player, × 20 players). In the end, the fitness of a population is measured as the percentage of this highest possible score that it achieves. A population with a total score of 36,480, for example, would have a population fitness of 50% (36,480 / 72,960).
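These bounds are easy to verify with a throwaway check (the numbers are the ones quoted above):

# 20 players, so 19 opponents each, 64 games per round.
GAMES, OPPONENTS, PLAYERS = 64, 19, 20
assert 5 * GAMES * OPPONENTS == 6080               # lone defector among cooperators
assert 3 * GAMES * OPPONENTS == 3648               # perpetual mutual cooperation
assert 3 * GAMES * OPPONENTS * PLAYERS == 72960    # best possible population total
print(36480 / 72960)                               # -> 0.5, i.e. 50% population fitness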
Results of Genetic Algorithm
To test whether a population had each of the two traits described in the hypothesis, players' behaviour in these experiments is compared to the behaviour of Tit-For-Tat. Consider a population that has evolved Tit-For-Tat-like behaviour. That population is likely using only a small percentage of its genes, because many of the possible histories are never reached by a Tit-For-Tat player (for example, a history in which one player keeps cooperating while the opponent keeps defecting, something Tit-For-Tat never does). This means that an evolved individual's genome might look very little like that of an unevolved Tit-For-Tat player even when its behaviour is effectively the same.
Five distinct populations were used to compare behaviour before and after evolution. Tit-For-Tat and Pavlov, as discussed previously, were the two control populations for this experiment; both have the inherent ability to exploit mutual cooperation and defend against defectors. The three other populations were respectively comprised of AllC players, of AllD players, and of independently randomly initialized players.
To measure the performance of populations, the average fitness over the last 10,000 generations of each simulation was studied. Starting with the five initial populations, each was evolved for 200,000 generations. This evolution was simulated several hundred times for each initial population. Significance was calculated with the standard two-tailed t-test for data sets with unequal variance (Welch's t-test). Each population was compared to the Tit-For-Tat control.
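Such a comparison can be reproduced with a standard Welch's t-test; a minimal sketch follows (the fitness arrays are placeholders, not the study's data):

from scipy import stats

# Placeholder per-simulation average fitnesses; the study's actual data
# are not reproduced here.
tft_fitness = [0.81, 0.79, 0.83, 0.80]
other_fitness = [0.74, 0.72, 0.75, 0.73]

# Two-tailed t-test for samples with unequal variance (Welch's t-test).
t_stat, p_value = stats.ttest_ind(tft_fitness, other_fitness, equal_var=False)
print(t_stat, p_value)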
Evolved populations of players develop both the ability to defend against defectors and the ability to take advantage of mutual cooperation.
After a period of evolution as described earlier, the average performances of the five populations were statistically equal. One possible explanation for such equality is 'random drift' of the populations. Random drift occurs when strategies are recombined and mutated without selection: each specific gene occurs simply by chance mutation or recombination, and the performance of such a population is generally low. By turning off the selection mechanism in the genetic algorithm, results for a random-drift population were generated. The evolved populations all performed well above the level of the random-drift population, indicating that they exhibit evolutionarily preferred traits rather than drifting at random.
The first experiment looks for the ability to defend against defectors. In this experiment, the five unevolved initial populations were mixed with a small set of AllD players. Fitness of those populations was calculated over the first 10,000 generations immediately following inoculation. TFT performed well, with scores around 80% of the maximum. Neither Random, AllC, nor AllD came near this level; by the standard t-test, all were significantly lower than TFT with p = .01.
The same experiment was performed with the five populations after evolution. After 200,000 generations, the populations were mixed with a small group of AllD players. Fitness was calculated over the next 10,000 generations. Looking at the average fitness of all five populations, it was found that there was no statistical difference in performance among them with p = .01. Moreover, comparing these results to the performance of unevolved Tit-For-Tat and Pavlov players, there was no statistical difference. Finally, there was no statistical difference between the performance of the inoculated populations and the uninoculated evolved populations, indicating that defectors had no effect on the performance of evolved populations.
Repeating the same experimental structure, the five unevolved populations were mixed with a small set of AllC players. Tit-For-Tat again performed at nearly 80% of the maximum fitness, as did the initial population of AllC players. AllC players always cooperate by their nature: in an initial population made up entirely of AllC players, mutual cooperation is the norm, and introducing more AllC players does not change that. The prevalence of mutual cooperation explains the excellent performance of the unevolved AllC population.
Unevolved populations mixed with AllC players

                                   Tit-For-Tat   Cooperate (AllC)   Defect (AllD)
Mean fitness                       0.7594        0.7999             0.8784
t-test p-value vs. unevolved TFT   1             0.8423             1.9440E-05
Unevolved populations mixed with AllD players

                                   Tit-For-Tat   Cooperate (AllC)   Defect (AllD)
Mean fitness                       0.8138        0.7348             0.8825
t-test p-value vs. unevolved TFT   0.3794        0.0077             7.330E-06
Conclusion
These results lead to several conclusions. Our first experiment shows that defectors affect all five of the evolved populations in the same way. They react identically, but does this necessarily indicate that they all have a defensive ability?
Since populations that were initially unable to defend against defectors exhibit that defensive behaviour after evolution, they must have evolved the ability over time. The second set of conclusions concerns mutual cooperation.
The results show that evolved populations are able to cooperate among themselves, since they perform the same as the control populations in the presence of cooperators. Further, one can conclude that populations exhibit this behaviour even without the experimental conditions, since there is no difference between performance in the natural, evolved environment and performance in the presence of pure cooperators.
With the results outlined above, it follows that in this experiment the evolved populations performed equivalently to Tit-For-Tat. Specifically, these experiments show that evolved populations are able to defend against defectors and to cooperate mutually with other evolved individuals.
Since these populations did not initially have such abilities, it follows that evolution introduced this behaviour over time. Some preliminary simulations have been run to study this phenomenon in probabilistic strategies; initial results show no difference between deterministic and probabilistic populations.
References
[2] Axelrod, Robert. "Laws of Life: How Standards of Behavior Evolve." The Sciences 27 (Mar./Apr. 1987): 44-51.
[4] Baker, J. E. "Reducing Bias and Inefficiency in the Selection Algorithm." Proceedings of the Second International Conference on Genetic Algorithms and their Application, Hillsdale, New Jersey, USA: Lawrence Erlbaum Associates, 1987: 14-21.