The Iterated Prisoner's Dilemma

Published Date: 02 Nov 2017

Shashank Shetty

Abstract

Cooperation has always been recognized as a fundamental ingredient in the creation of societies and the generation of wealth. As a concept, it has been studied for many years, yet in practice its emergence and persistence are less well understood. This paper reviews a game-theoretic approach to the study of cooperation based on the Prisoner's Dilemma. The Iterated Prisoner's Dilemma is a model for studying cooperation and conflict. The paper reports results obtained with various strategies for the Iterated Prisoner's Dilemma and compares their performance in specific tournaments. The behaviour of a player is also analysed in more detail by looking at the specific game sequences (and corresponding decisions) that, out of all possible sequences, are actually used in a tournament.

1 Introduction

The original Prisoner's Dilemma (PD) is a two-person game that can be formulated as follows. Mark and Jamie are arrested on suspicion of committing a bank robbery. Both suspects are kept in isolation cells to be interrogated, and the interrogator wants each of them to confess. The following conditions are imposed on both of them: if you confess and your accomplice denies the crime, you will go free in return for your testimony, but your accomplice will get five years in prison for bank robbery. If your accomplice confesses but you deny, it will be the other way round. If both of you confess, each of you will get only three years in prison for the robbery, but if both of you deny, each of you will get one year in prison for illegal possession of a weapon. The goal of each prisoner is to minimize his or her time in prison by either confessing or denying. From the point of view of the individual prisoner, it seems evident that he should confess, which results in at most three years in prison. On closer inspection, however, the dilemma is revealed: if both prisoners deny, each serves only one year, whereas mutual confession leads to the worse result of three years each.

In general, the PD is a model for scenarios in which the pursuit of individual interests leads to a worse result for each participant. This paper investigates the different strategies that can be used when playing the Iterated Prisoner's Dilemma in different environments and under different circumstances. It also investigates which strategies perform well when compared with other standard strategies.

Section 2 discusses the Prisoner's Dilemma and the Iterated Prisoner's Dilemma along with the literature and background. Section 3 presents the research question and the reasons for this research. Section 4 describes the experimental design, the methods and tools used, and the strategies used in the experiments. Section 5 presents the results of the experiments in the form of tables and graphs, with an analysis of each experiment. Section 6 discusses the restrictions and limitations of this paper and outlines potential future work. The final section presents conclusions drawn from this work.

2 Background and related work

The Prisoner's Dilemma is a simple game widely used to study the problem of human cooperation. It is a renowned game that has been studied in fields such as economics, political science, machine learning and evolutionary biology. It was originally framed by Merrill Flood and Melvin Dresher of the RAND Corporation in 1950 and was then formalized by the mathematician Albert W. Tucker, based on the ideas of Flood and Dresher.

The Prisoner's Dilemma (PD) is a non-zero-sum game that involves two players. Each of them can make one of two moves: Cooperate (C) or Defect (D). The players choose their moves simultaneously and independently of each other, and each receives a payoff that depends on the moves chosen by both. The payoff matrix is shown in Figure 1.

Figure 1: The classical choice of payoffs in the Prisoner's Dilemma (Player 1's payoffs are given first)

R: REWARD S: SUCKER T: TEMPTATION P: PENALTY

When both players cooperate, they receive equal, intermediate rewards (the reward, R). When only one player defects, he receives the highest possible payoff (the temptation, T), while the other player gets the sucker's payoff (the sucker, S). When both players defect, each receives the penalty (the penalty, P), which is again intermediate.

The PD is a non-zero-sum game, which means that the sum of the two players' payoffs is not constant, so one player's gain is not necessarily the other's loss; as a consequence, there is no single universal strategy that works best in every game. Several interesting properties follow immediately. In a one-shot game, both players will choose to defect (D, D), because defecting gives a player a higher payoff than cooperating no matter what the opponent chooses. However, both players could instead cooperate (C, C), which would give them equal payoffs (3 each) that are higher than what they receive under mutual defection; hence the dilemma.
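The dominance argument can be checked numerically. The sketch below assumes the conventional payoff values T = 5, R = 3, P = 1, S = 0; only R = 3 is confirmed by the text, and the other three are assumptions because the contents of Figure 1 are not reproduced here. The names PAYOFF and best_reply are illustrative.

```python
# Assumed classical payoffs: T=5, R=3, P=1, S=0 (only R=3 is stated in the text).
T, R, P, S = 5, 3, 1, 0

# Payoffs to (player 1, player 2) for each pair of moves.
PAYOFF = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}

def best_reply(opponent_move):
    """Return the move that maximizes player 1's payoff against a fixed opponent move."""
    return max("CD", key=lambda my_move: PAYOFF[(my_move, opponent_move)][0])

# Defection is the best reply to either move, so (D, D) is the one-shot outcome,
# even though (C, C) pays each player more than (D, D).
for opponent_move in "CD":
    print(opponent_move, "->", best_reply(opponent_move))
```

Under these assumed values, defection is the best reply in both cases, while mutual cooperation still pays each player 3 against 1 for mutual defection.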

In game theory, the outcome (D, D) is termed a Nash equilibrium: a steady state of the game in which no player has an incentive to deviate unilaterally from his strategy. Nash proved that every finite n-player game has at least one Nash equilibrium when randomization over moves is permitted. The Prisoner's Dilemma makes clear, however, that a Nash equilibrium is not necessarily the social optimum.

A more interesting situation arises when the players play the game iteratively, with payoffs accumulated over the iterations. If neither player knows the number of iterations beforehand, equilibria that are better than (D, D) become possible. The folk theorems characterize the outcomes that can be sustained in iterated games. For the Prisoner's Dilemma there are infinitely many equilibrium outcomes, including one in which both players always cooperate.

Suppose that there are many players and that each player plays the iterated game against every other player in a round-robin fashion, with the scores (payoffs) accumulated over all the games. The player with the maximum payoff at the end of the round-robin tournament is the winner. The problem is to find the optimal strategies that will ensure victory in such a tournament.

After Tucker's formulation of the Prisoner's Dilemma, the game was discussed extensively by game theorists, sociologists, ethicists, philosophers, biologists, political scientists, mathematicians and computer scientists. Robert Axelrod was among the first to study this problem in detail, and he popularized it in his book The Evolution of Cooperation. Axelrod organized two computer tournaments around 1980 and invited strategies from a number of experts and game theorists. Surprisingly, the winner of both tournaments was a very simple strategy, Tit for Tat. This strategy cooperates on the first move and then simply copies the opponent's last move. Axelrod then set out to find other simple strategies with the same or greater power, using a single-objective evolutionary algorithm to search for optimal strategies.

In this algorithm, each move in the game has four possible outcomes: both players cooperate (CC, or R for reward), the other player defects (CD, or S for sucker), the first player defects (DC, or T for temptation), or both players defect (DD, or P for penalty). To encode a strategy, each behaviour sequence over the previous three moves is written as a three-letter string. For example, RRR represents the sequence in which both players cooperated over the previous three moves, and SSP represents the sequence in which the first player was played for a sucker twice and then finally defected. This three-letter sequence is interpreted as a number in base 4, giving a value between 0 and 63, with the digit values assigned as follows: CC = R = 0, DC = T = 1, CD = S = 2 and DD = P = 3. In this way, RRR decodes to 0 and SSP decodes to 43. Using this scheme, a strategy can be defined as a string of 64 Cs (cooperate) and Ds (defect), where the ith C or D gives the move to play after the ith behavioural sequence. Figure 2 shows an example genetic algorithm (GA) string. For the example in the figure, the three-letter code for the previous moves is RTR. This decodes to 4, meaning that player 1 should play the (4+1)th, that is the 5th, move specified in the first 64 positions of the GA string. In this case the fifth position is C, so player 1 will cooperate.

Figure 2: Example of a genetic algorithm string

Since each move depends on the previous three moves, the first three moves of the game are undefined in the above scheme. To account for this, six bits (Cs and Ds, initially assigned at random) are appended to the 64-bit string to specify the strategy's premises, that is, its assumptions about pre-game behaviour. Each strategy is thus represented by a 70-bit string: the first 64 bits for the rules and the next 6 for the premises.
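The base-4 lookup described above can be sketched in a few lines. The digit values are those given in the text; the function names history_index and next_move, and the example chromosome, are hypothetical and chosen only for illustration. The sketch covers the rule lookup only and does not model how the six premise genes seed the first three moves.

```python
# Digit values from the text: R (CC) = 0, T (DC) = 1, S (CD) = 2, P (DD) = 3.
DIGIT = {"R": 0, "T": 1, "S": 2, "P": 3}

def history_index(history):
    """Interpret a three-letter outcome history as a base-4 number in 0..63."""
    if len(history) != 3:
        raise ValueError("history must contain exactly three outcomes")
    index = 0
    for outcome in history:
        index = index * 4 + DIGIT[outcome]
    return index

def next_move(chromosome, history):
    """Look up the move (C or D) that a 70-character chromosome prescribes.

    The first 64 characters are the rule table; the last 6 are the premise
    genes, which this lookup does not consult.
    """
    if len(chromosome) != 70:
        raise ValueError("chromosome must have 64 rule genes plus 6 premise genes")
    return chromosome[history_index(history)]

print(history_index("RRR"))  # 0
print(history_index("SSP"))  # 43
print(history_index("RTR"))  # 4

# Hypothetical chromosome: defect everywhere except at position 4 (the 5th gene).
example = "D" * 4 + "C" + "D" * 65
print(next_move(example, "RTR"))  # C
```

The printed indices match the worked values in the text (RRR gives 0, SSP gives 43, RTR gives 4), and the 5th gene of the example string determines the move after RTR.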

Axelrod used the above encoding scheme to search for optimal strategies with a single-objective genetic algorithm. He found that, from a random start, the strategies discovered by the genetic algorithm not only performed quite well but also beat the overall performance of the Tit for Tat strategy mentioned earlier.

Axelrod's model of the evolution of cooperation was based on the Iterated Prisoner's Dilemma. Empirical work following this approach has helped establish the prevalence of cooperation based on reciprocity. Theoretical work has led to a deeper understanding of the role of other factors in the evolution of cooperation: the number of players, the range of possible choices, variation in the payoff structure, and population structure.

3 Research Question

The iterated version of the Prisoner's Dilemma is the subject of this research. The game proceeds over a number of moves. This number is either decided beforehand, in which case the results are a diluted form of those of the one-shot Prisoner's Dilemma, or the game is repeated indefinitely until some random event brings it to a permanent end.

The Prisoner's Dilemma is characterised by the strict inequalities between the payoffs: T > R > P > S. In the iterated PD (IPD), unlike the one-shot PD, time matters: players realise, for instance, that their strategy choices may trigger reprisals as well as rewards. Because of this time dimension, the IPD gives the players scope to probe each other's behaviour. There is opportunity for cooperative, deceptive, threatening and exploitative behaviour and much more, but there is no guarantee that any one behaviour will occur. This opportunity for diverse behaviours maps to a similarly large set of strategies to choose from (Tit-for-Tat, Pavlov, All Defect and others), some of which are discussed later. Close analysis of the IPD reveals that, unlike the one-shot Prisoner's Dilemma, it has a large number of Nash equilibria.

The main question is the emergence and persistence of cooperation in the Iterated Prisoner's Dilemma. It can be approached by using the IPD as a model for capturing cooperative behaviour. How does cooperation arise in the Iterated Prisoner's Dilemma in the first place? Once it has arisen, how is it sustained?

In 1957, Luce and Raiffa predicted that sequences of [R, R] (i.e. mutual cooperation) would arise in the IPD because the players are aware of the effect of reprisals; it is the threat of reprisal that triggers cooperative behaviour. However, many DD lock-in effects were observed in the experiments of the late 1950s and early 1960s, and the results of more recent experiments are not very clear either. It has been argued that cooperation invariably emerges in the IPD through implicit communication between the players: they cannot communicate explicitly, but they can signal their good intentions by playing cooperative moves. For example, a player who persists in choosing C (cooperate) whether the opponent chooses C or D (defect) signals a willingness to cooperate. The objection is that excessive cooperative play can also be seen as a sign of weakness that prompts exploitation and punishment.

The main motive of this paper is to explore Robert Axelrod's algorithms and concepts for the IPD, in which computer tournaments are used to study the evolution of cooperation among different strategies. This is done using a software application called WinIPD, which implements Axelrod's algorithms. The main aim is to examine the behaviour of the different strategies listed in the next section. The results are obtained by running different types of tournaments and evaluating their outcomes. The simulations are run a number of times for different scenarios to find the optimal strategy for the Iterated Prisoner's Dilemma.

4 Experimental Design

WinIPD 1.2.1 was used as the technical base for applying and testing strategies in the Iterated Prisoner's Dilemma. This application was developed in C# using Microsoft's Visual Studio .NET and is based on Axelrod's algorithms for the Iterated Prisoner's Dilemma. Strategies are defined in the application as agent types, with each strategy being one agent type. It has a library of 16 agent types (strategies) from which the user can compose tournaments. It provides a customizable payoff matrix that allows the user to adjust the payoffs for each of the four possible combinations of actions in a game. The maximum agent count for a tournament is 100 (for example, 10 different agent types can participate with a count of 10 each). The user can thus host a tournament by selecting the number and types of agents in the tournament pool, the payoffs that the agents receive for their moves, and the total number of iterations in the tournament. The detailed results of the tournament are then presented.

In this scenario we experiment with different strategies playing against a fixed number of known strategies, described below. The best of these strategies is selected as the final result. The following strategies were used in the experiments:

Tit For Tat (TFT) - Starts with cooperation. If defected against, TFT responds with a defect; otherwise TFT cooperates.

Tit For Two Tats (TFTT) - Starts with cooperation. If defected against twice in a row, TFTT defects; otherwise it cooperates.

Spiteful (SP) - Starts with cooperation and stays with it until defected against once, then defects against that opponent for the rest of the tournament.

Joss (JO) - Starts with cooperation. If defected against, Joss responds with a defect; otherwise Joss cooperates 90 percent of the time.

Random (RAM) - Randomly selects cooperate or defect.

Pavlov (PAV) - Starts with cooperation, then cooperates if both players made the same move in the previous round and defects otherwise.

Soft Majority (SM) - Starts with cooperation. Plays the move the opponent has played in the majority of the previous rounds. A tie goes to cooperate.

Hard Majority (HM) - Starts with defection. Plays the move the opponent has played in the majority of the previous rounds. A tie goes to defect.

All C (AC) - Always cooperates.

All D (AD) - Always defects.
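The ten descriptions above translate fairly directly into code. The following is a minimal sketch, not WinIPD's implementation: each strategy is a hypothetical function that receives its own past moves and the opponent's past moves and returns "C" or "D", and play_match scores a single pairing. The payoff values T = 5, R = 3, P = 1, S = 0 are assumed; only R = 3 is stated in the text.

```python
import random

# Assumed classical payoffs (T=5, R=3, P=1, S=0); only R=3 is stated in the text.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

# Each strategy takes (own past moves, opponent's past moves) as lists of
# "C"/"D", oldest first, and returns its next move.

def tit_for_tat(mine, theirs):
    return theirs[-1] if theirs else "C"

def tit_for_two_tats(mine, theirs):
    return "D" if theirs[-2:] == ["D", "D"] else "C"

def spiteful(mine, theirs):
    return "D" if "D" in theirs else "C"

def joss(mine, theirs):
    if not theirs:
        return "C"
    if theirs[-1] == "D":
        return "D"
    # Otherwise cooperate 90 percent of the time.
    return "C" if random.random() < 0.9 else "D"

def random_player(mine, theirs):
    return random.choice("CD")

def pavlov(mine, theirs):
    if not mine:
        return "C"
    return "C" if mine[-1] == theirs[-1] else "D"

def soft_majority(mine, theirs):
    # Ties (including the empty history) go to cooperate.
    return "C" if theirs.count("C") >= theirs.count("D") else "D"

def hard_majority(mine, theirs):
    # Ties (including the empty history) go to defect.
    return "D" if theirs.count("D") >= theirs.count("C") else "C"

def all_c(mine, theirs):
    return "C"

def all_d(mine, theirs):
    return "D"

def play_match(strategy_a, strategy_b, rounds):
    """Play one iterated game and return the two cumulative scores."""
    moves_a, moves_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(moves_a, moves_b)
        move_b = strategy_b(moves_b, moves_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        moves_a.append(move_a)
        moves_b.append(move_b)
    return score_a, score_b

print(play_match(all_d, pavlov, 5))        # (17, 2)
print(play_match(tit_for_tat, pavlov, 5))  # (15, 15)
```

These figures are for one pair of agents under the assumed payoffs and are not meant to reproduce the WinIPD tables, which aggregate over several agents of each type.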

Two types of experiments are conducted to evaluate the optimal strategy for the Iterated Prisoner's Dilemma (IPD).

1) Single Confrontations (one strategy against another one)

Here each strategy is compared with one other strategy, chosen according to how similar or how distinct the two are; this is known as a single confrontation. At the end of the confrontation (for example, after 100 rounds), the points obtained by all agents (counts) of each strategy are added, and the strategy with the highest score at the end of the tournament is the winner. Here the score is simply the sum of the payoffs awarded for every move. The payoffs are awarded according to the default payoff matrix shown in Figure 1.

Several tournaments were run to compare the strategies; some compare two strategies at a time, whereas others compare three. The simulation was then run again for a number of encounters, varying the payoff matrix and the number of iterations.

The first tournament compares the All Defect strategy with the Pavlov strategy; the second does the same with a different number of counts and iterations. This cycle is repeated several times for different strategies and different environments, which shows which type of strategy suits a particular type of environment. The results of each tournament are presented in the form of tables and graphs.

2) Round Robin Tournaments

Here all 10 strategies listed above are used in a round-robin fashion: each strategy plays against all the other strategies (including itself). The points (payoffs) from each confrontation are added, and the winner is the strategy with the highest number of points at the end of each tournament. The simulations are run for a number of encounters, varying the number of iterations and counts. Some tournaments use the default payoff matrix shown in Figure 1, whereas others use the customized payoffs shown in Figure 3.

Figure 3: Customized payoff matrix

R: REWARD S: SUCKER T: TEMPTATION P: PENALTY

There are two sets of tournaments. The first set compares all the strategies using the default payoff matrix (Figure 1). Its first tournament uses 5 iterations and a count of 5 for each strategy; the second increases the number of iterations and the count by 5 (iterations = 10, count = 10), and the cycle repeats.

The second set of tournaments also compares all the strategies while varying the iterations and counts, but it uses the customized payoff matrix shown in Figure 3. The winner of every tournament is the strategy with the highest number of points (payoffs) at the end. The points are accumulated over the iterations.

The results of both sets of tournaments are displayed in the form of tables and graphs, as in the previous experiment.
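A round-robin tournament of the kind described in this section can be sketched as follows. This is an illustration under stated assumptions, not WinIPD's algorithm: it uses one agent per strategy, only four deterministic strategies, the assumed payoffs T = 5, R = 3, P = 1, S = 0, and it credits a strategy with one side's score when it plays itself. All names (play_match, round_robin, STRATEGIES) are hypothetical. A different payoff dictionary, such as a customized matrix, can be passed through the payoff parameter.

```python
from itertools import combinations_with_replacement

# Assumed classical payoffs (T=5, R=3, P=1, S=0); only R=3 is stated in the text.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(mine, theirs):
    return theirs[-1] if theirs else "C"

def spiteful(mine, theirs):
    return "D" if "D" in theirs else "C"

def all_c(mine, theirs):
    return "C"

def all_d(mine, theirs):
    return "D"

def play_match(strategy_a, strategy_b, rounds, payoff=PAYOFF):
    """Play one iterated game and return the two cumulative scores."""
    moves_a, moves_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(moves_a, moves_b)
        move_b = strategy_b(moves_b, moves_a)
        pay_a, pay_b = payoff[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        moves_a.append(move_a)
        moves_b.append(move_b)
    return score_a, score_b

def round_robin(strategies, rounds, payoff=PAYOFF):
    """Let every strategy meet every strategy (itself included) and total the payoffs."""
    totals = {name: 0 for name in strategies}
    for name_a, name_b in combinations_with_replacement(strategies, 2):
        score_a, score_b = play_match(strategies[name_a], strategies[name_b],
                                      rounds, payoff)
        totals[name_a] += score_a
        if name_a != name_b:  # assumption: self-play credits one side only
            totals[name_b] += score_b
    return totals

STRATEGIES = {"TFT": tit_for_tat, "AllD": all_d, "AllC": all_c, "Spiteful": spiteful}

for rounds in (1, 20):
    totals = round_robin(STRATEGIES, rounds)
    best = max(totals.values())
    winners = [name for name, score in totals.items() if score == best]
    print(rounds, totals, winners)
```

In this reduced pool, All Defect has the highest total after one round, while Tit For Tat and Spiteful tie for the lead after 20 rounds. The totals depend on the assumed payoffs and the pool of opponents and should not be read as the paper's reported results.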

5 Results

In this section the results of running the strategies in different environments are presented, along with the outcomes of specific tournaments and a deeper analysis of the strategies.

Experiment 1 - Single Confrontation

1.1 All Defect against Pavlov

The Pavlov strategy plays well against itself, but it copes poorly with an unconditional defector: after every round of mutual defection it returns to cooperation, which All Defect, a strategy that always defects, can exploit. All Defect therefore clearly beats Pavlov, as shown in the results below. The test was run for 5 iterations and 5 counts using the default payoff matrix (Table 1). The simulation was then run again for 20 iterations and 20 counts for each agent type (Table 2).

Figure 4: Table 1. All Defect vs Pavlov, 5 iterations with 5 counts for each strategy, using the default payoff matrix

Figure 5: Table 2. All Defect vs Pavlov, 20 iterations with 20 counts for each strategy, using the default payoff matrix

1.2 All Defect against Soft Majority and Tit-For-Tat

This tournament pits the All Defect strategy against Soft Majority and Tit-For-Tat, which now beat All Defect by a greater margin. The tests were run for 5 iterations and 5 counts using the default payoff matrix (Table 3). The simulation was run again for 20 iterations and 20 counts (Table 4).

Figure 6: Table 3. All Defect vs Soft Majority and Tit-For-Tat, 5 iterations with 5 counts for each strategy, using the default payoff matrix

Figure 7: Table 4. All Defect vs Soft Majority and Tit-For-Tat, 20 iterations with 20 counts for each strategy, using the default payoff matrix

1.3 Tit-For-Tat against Pavlov

This tournament pits the Tit-For-Tat strategy against Pavlov. Both strategies begin by cooperating and continue to cooperate throughout; neither of them ever defects. As the results show, the two strategies behave identically in this pairing and hence obtain similar scores. The tests were run for 5 iterations and 5 counts using the default payoff matrix (Table 5).

Figure 8: Table 5. Tit-For-Tat vs Pavlov, 5 iterations with 5 counts for each strategy, using the default payoff matrix

1.4 Hard Majority against Random

This tournament pits the Hard Majority strategy against the Random strategy (Tables 6 and 7).

Figure 9: Table 6. Hard Majority vs Random, 5 iterations with 5 counts for each strategy, using the default payoff matrix

Figure 10: Table 7. Hard Majority vs Random, 20 iterations with 20 counts for each strategy, using the default payoff matrix

Experiment 2 - Round Robin Tournaments

The first round-robin tournament was run for a single iteration with 2 counts for each strategy, using the default payoff matrix. All Defect is clearly the winner, with the highest total payoff of 158 and an average payoff of 4.158. It beats the Tit-For-Tat strategy by a margin of 74 points.

Figure 11: Table 8. Tournament with 10 strategies for a single iteration with 2 counts for each strategy, using the default payoff matrix

Figure 12: Chart of the total payoffs in Table 8

The simulations were run again for 5 iterations; in this case Tit-For-Tat was the clear winner, with a total payoff of 3140 at the end of the tournament. These results show that All Defect does well over a single iteration, but over a longer run of 5 iterations Tit-For-Tat is the dominant strategy.

Figure 13: Table 9. Tournament with 10 strategies for 5 iterations and 5 counts for each strategy, using the default payoff matrix

Figure 14: Chart of the total payoffs in Table 9

The simulations were run again with 5 standard strategies for 20 iterations and 20 counts for each strategy. The Spiteful strategy was the most dominant in this case, with a total payoff of 117294.

Figure 15: Table 10. Tournament with 5 standard strategies for 20 iterations and 20 counts for each strategy, using the default payoff matrix

Figure 16: Chart of the total payoffs in Table 10

The tables below show the results for different numbers of iterations using the customized payoff matrix shown in Figure 3. The outcome of these tournaments changed with the number of iterations. All Defect proved dominant for 5 iterations, whereas Joss was the dominant strategy when the simulations were run for 20 iterations. Spiteful, Pavlov and Soft Majority showed similar results when run for one iteration.

Figure 17: Table 11. Tournament with 10 strategies for 1 iteration and 1 count for each strategy, using the customized payoff matrix

Figure 18: Chart of the total payoffs in Table 11

Figure 19: Table 12. Tournament with 10 strategies for 5 iterations and 5 counts for each strategy, using the customized payoff matrix

Figure 20: Chart of the total payoffs in Table 12

Figure 21: Table 13. Tournament with 5 standard strategies for 20 iterations and 20 counts for each strategy, using the customized payoff matrix

Figure 22: Chart of the total payoffs in Table 13

6 Discussion

The results of the experiments show how the different strategies perform in different environments. The round-robin tournaments show which strategy performs well when compared with the others. All Defect was the clear winner when the tests were run for a single iteration, whereas Tit-For-Tat was the dominant strategy when the simulations were run again for 5 iterations against a field that included other standard strategies such as Tit-For-Two-Tats, Joss and Pavlov. When the simulations were run for 20 iterations, Spiteful was the dominant strategy. The tests were also run using the customized payoff matrix (Figure 3), where All Defect proved dominant for 5 iterations and Joss was the dominant strategy for 20 iterations. Spiteful, Pavlov and Soft Majority showed similar results when run for a single iteration. Overall, the tests show that there is no universal winning strategy for the Iterated Prisoner's Dilemma. The winning strategy depends on the environment, the number of counts in a tournament and the number of iterations.

The experiments in this paper are restricted to a single objective: maximizing a player's own score. Multi-objective algorithms can also be used, which maximize the player's own score while also minimizing the opponents' score. Here the opponents' score means the cumulative score of the opponents when playing against a particular strategy. It is possible to win the game not only by maximizing one's own score but also by minimizing the opponents' score. Since the Prisoner's Dilemma is a non-zero-sum game, there may be a trade-off between these two objectives, so a multi-objective evolutionary algorithm may give better insight into optimal strategies than a single-objective formulation. This is because, with multiple conflicting objectives, not one but a number of trade-off optimal solutions can be found. These non-dominated trade-off solutions can then be analysed for patterns or insights about optimal strategies for the IPD. If any such patterns are discovered, they would provide a blueprint for formulating optimal strategies for the game.
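To make the notion of non-dominated trade-off solutions concrete, the sketch below filters a list of (own score, opponents' score) pairs, keeping those for which no other pair is at least as good on both objectives and strictly better on one. The function name non_dominated and the score pairs are hypothetical and invented purely for illustration; they are not results from this paper.

```python
def non_dominated(points):
    """Return the (own_score, opponent_score) pairs not dominated by any other pair.

    Own score is to be maximized and opponent score minimized. A pair dominates
    another if it is at least as good on both objectives and strictly better on one.
    """
    def dominates(a, b):
        return (a[0] >= b[0] and a[1] <= b[1]
                and (a[0] > b[0] or a[1] < b[1]))

    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (own score, opponents' score) pairs, for illustration only.
candidates = [(199, 199), (168, 63), (180, 230), (150, 100)]
print(non_dominated(candidates))  # [(199, 199), (168, 63)]
```

Here the first pair scores highest for itself and the second concedes least to its opponents, so neither dominates the other; the remaining two are worse on both objectives than one of these.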

7 Conclusion

This paper has presented approaches for identifying well-performing strategies for the Iterated Prisoner's Dilemma. The strategies were compared in different environments by varying the number of iterations, the agent count and the payoff matrix. The paper also shows that a very small change in the payoff matrix of the Iterated Prisoner's Dilemma leads to an iterated game whose properties are very different from those of the classical IPD. Two levels of cooperation are possible in this game, which makes it much more difficult to analyse than the classical IPD. Nevertheless, very concrete situations of social life can be simulated with it. Some of our experiments suggested that only probabilistic strategies can achieve a high score when they play against themselves. The experiments also revealed interesting characteristics that helped identify good strategies such as Pavlov and All Defect. Finally, the experiments show which types of strategies are suitable for various environments.
