
Learning in an Iterated Prisoner’s Dilemma Game


The prisoner’s dilemma (PD) game is used in game theory to depict situations of competition or cooperation among players. The game is defined as a non-zero-sum game, meaning that one player’s gain does not necessarily equal the other player’s loss. The players do not have any knowledge of what the other player might play, which makes it a non-cooperative game.

A classical form of the prisoner’s dilemma (PD) game is described as follows.

Two suspects are arrested by the police. The police have insufficient evidence for a conviction, and, having separated both prisoners, visit each of them to offer the same deal. If one testifies (defects from the other) for the prosecution against the other and the other remains silent (cooperates with the other), the betrayer goes free and the silent accomplice receives the full 10-year sentence. If both remain silent, both prisoners are sentenced to only six months in jail for a minor charge. If each betrays the other, each receives a five-year sentence. Each prisoner must choose to betray the other or to remain silent. Each one is assured that the other would not know about the betrayal before the end of the investigation. How should the prisoners act? [150]

The game is essentially a two-player game where each player is trying to maximize its own payoff without any consideration of what happens to the other player.

TABLE 6.5: Prisoner sentences in PD game.

                          Prisoner B stays silent                        Prisoner B betrays
Prisoner A stays silent   Each serves 6 months                           Prisoner A: 10 years, Prisoner B: goes free
Prisoner A betrays        Prisoner A: goes free, Prisoner B: 10 years    Each serves 5 years

In a one-shot game, because the players have no knowledge of each other’s strategies, the game may not be very useful. In an iterated prisoner’s dilemma game, however, the game is played repeatedly among the players. When playing repeatedly, a player has a chance to punish the other for playing a strategy that was unfavorable to it previously. This is similar to reinforcement learning, where punishment lets players learn which strategies are beneficial to play. The game can be repeated indefinitely and eventually reaches an equilibrium, where players learn which strategy to play to avoid being punished in the future. In its classical form, the game presents a Nash equilibrium when both players defect.

Conducting the game on a trial-by-trial basis, or as a series of moves, the players must choose either to cooperate or defect on each trial. Table 6.6 shows the numerical payoffs of the strategies played in a mathematical representation, where T stands for temptation to defect, R for reward for mutual cooperation, P for punishment for mutual defection and S for sucker’s payoff. In this situation the following inequality will always hold:

T > R > P > S (6.10)

TABLE 6.6: Payoff matrix in PD game, where R=3, S=0, T=5, P=1.

              Cooperate      Defect
Cooperate     R,R (3,3)      S,T (0,5)
Defect        T,S (5,0)      P,P (1,1)

Playing the game repeatedly will eventually lead to an equilibrium, where all players learn whether to defect or stay silent to achieve the maximum payoff. This is maintained when the following condition holds [154, 48]:

2R > T+S (6.11)

The tendency to defect is the dominating move for the players. But when players jointly defect, the payoff returned is less than the payoff of mutual cooperation. Playing the game once, the players would clearly think of defecting; but playing it over many trials, the players learn that they have a higher probability of getting a high payoff if they choose to cooperate, eventually trusting the other player to cooperate.
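As an illustration, the payoff structure of Table 6.6 and the conditions in Equations 6.10 and 6.11 can be written down in a few lines of Python. This is a minimal sketch; the names PAYOFF and play_round are illustrative and not taken from any particular implementation.

# Payoffs from Table 6.6: (my payoff, opponent payoff) indexed by (my move, opponent move)
T, R, P, S = 5, 3, 1, 0  # temptation, reward, punishment, sucker's payoff

PAYOFF = {
    ('C', 'C'): (R, R),
    ('C', 'D'): (S, T),
    ('D', 'C'): (T, S),
    ('D', 'D'): (P, P),
}

def play_round(move_a, move_b):
    """Return the payoffs of a single prisoner's dilemma round."""
    return PAYOFF[(move_a, move_b)]

# The defining conditions of the game (Equations 6.10 and 6.11)
assert T > R > P > S
assert 2 * R > T + S

print(play_round('D', 'C'))  # (5, 0): the defector exploits the cooperator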

Researchers have used the iterated prisoner’s dilemma game to draw important conclusions about group selection and mutual altruism in real individuals. The gaining of trust among individuals when coming together in groups is often viewed as an evolutionary process which allows cooperative behaviors to evolve. Politics exhibits a PD scenario when a country has to decide between spending money on military expansion or reducing its weapons. Advertising in economics is viewed as another example of a PD scenario, where firms compete against each other for sales: each has to decide whether to advertise depending on whether the other firm has advertised, and their decisions, and the times at which they make them, affect their sales.

Miller [132] used automata to represent strategies in a prisoner’s dilemma game. A player can make only two moves: cooperate or defect. A strategy, however, is a complete plan of when to cooperate or defect depending on what the other player has played. This can be represented as a sequence of states that determines the next move for each player. For instance, some possible strategies are as follows:

Always cooperate. Always cooperate no matter what the other player plays (Figure 6.23(a)).

Always defect. Always defect no matter what the other player plays, cooperates or defects (Figure 6.23(b)).

Tit for tat. Cooperate on the first move. Then mimic whatever the other player plays (Figure 6.23(c)).
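These three strategies can be sketched as simple functions of the game history. The following is a hypothetical Python sketch (function names and data layout are illustrative, not taken from the text), which also reproduces the progression shown later in Table 6.7.

def always_cooperate(opponent_history):
    return 'C'

def always_defect(opponent_history):
    return 'D'

def tit_for_tat(opponent_history):
    # Cooperate on the first move, then mimic the opponent's last move.
    return 'C' if not opponent_history else opponent_history[-1]

def play_iterated(strategy_a, strategy_b, rounds=3):
    """Play two strategies against each other and collect the moves."""
    history_a, history_b = [], []
    for _ in range(rounds):
        move_a = strategy_a(history_b)  # each strategy sees only the opponent's past moves
        move_b = strategy_b(history_a)
        history_a.append(move_a)
        history_b.append(move_b)
    return list(zip(history_a, history_b))

# Prints [('D', 'C'), ('D', 'D'), ('D', 'D')], matching Table 6.7
print(play_iterated(always_defect, tit_for_tat))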

Figure 6.23 depicts examples of automata used to represent prisoner’s dilemma strategies. Table 6.7 shows how a game between an all-defect strategy and a tit-for-tat strategy progresses.

The players have no knowledge of what the other might play at time t = 0. After the players have made their moves, each knows what the other last played. When an all-defect strategy plays against a tit-for-tat strategy, the first player starts by defecting while the second player cooperates. As a result, the first player benefits with a better payoff and Player 2 suffers. After this time step, Player 2 starts to mimic Player 1’s last move; since Player 1 defected in the last time step, Player 2 now defects as well.

Player 1 keeps playing its all-defect strategy. Each of these moves returns payoffs to the players as shown.

Axelrod [14] organized a prisoner’s dilemma tournament where he invited game theorists to submit their own strategies for playing the game.

FIGURE 6.23: Example automata for prisoner’s dilemma strategies: (a) always cooperate, (b) always defect, (c) tit for tat.

TABLE 6.7: All-defect strategy (Player 1) playing against a tit-for-tat strategy (Player 2).

                    Player 1 (All D)    Player 2 (Tit-for-tat)
At time = t         D                   C
Payoff returned     (5)                 (0)
At time = t+1       D                   D
Payoff returned     (1)                 (1)
At time = t+2       D                   D
Payoff returned     (1)                 (1)

Each strategy was played against every other about 200 times and the payoffs were collected. The experiment declared the ‘Tit-for-Tat’ strategy [154] the most successful among the pool of submitted strategies. Jennings et al. [165] introduced an alternative strategy which tried to predict the other player’s moves, exploiting the fact that the game is played a number of times.

In another experiment, Axelrod [12] introduced evolving strategies that played against each other. The results showed that the most effective strategies propagated through the population, initially moving away from cooperation but then slowly moving towards it again. The average score of the population was also seen to increase as the population evolved to cooperate.


FIGURE 6.24: Finite state machine of eight states representing a prisoner’s dilemma strategy. cf. [81].

Fogel [60] implemented a population of coevolving finite state machines (FSMs), each with eight states, to represent the various strategies of the PD game. Each FSM represented a predictive algorithm for a strategy and was allowed to mutate and evolve in light of the expectation of what the other state machines would play. Figure 6.24 shows an example of one of Fogel’s finite state machines representing a strategy.

In contrast to Axelrod’s results of cooperation, Fogel showed that the level of cooperation was not complete in most cases. His results showed that trials with larger populations did show emergence of cooperative behavior, but in smaller numbers, and there was “a repeated pattern of initial complete mutual cooperation, but this quickly degenerated into cyclic behavior with moves covering the range from complete cooperation to complete defection” [81]. These experiments hinted at how evolutionary computation can be used for problem solving and to generate many kinds of behavior in simulations [61].

The prisoner’s dilemma game allows players to compete against each other to win payoffs. Locations can be used so that players close to each other continuously cooperate or defect against one another, to see which strategy wins the most. The players can assess their strategies based on their fitness in the prisoner’s dilemma game (Table 6.5).

The strategy played in the prisoner’s dilemma game was a 16-state strategy, with a structure similar to the design of the automaton discussed by Miller [132].

Table 6.8 represents a three-state automaton written as a series of strings. Figure 6.25 displays the corresponding strategy of this automaton. The starting state is State 0, in which the player will cooperate. If the other player cooperates, the player will move to State 1; otherwise it will move to State 2. Its next moves then depend on what is represented in the state it is currently in.

TABLE 6.8: Example of a three-state machine represented as an automaton.

State   C/D   Next State, if Other Player Cooperates   Next State, if Other Player Defects
0       C     1                                        2
1       D     0                                        2
2       C     2                                        2
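Encoded as a transition table, the automaton of Table 6.8 can be stepped through directly. This is a minimal Python sketch; the data layout and the name next_move are illustrative.

# Table 6.8 as (move, next state if opponent cooperates, next state if opponent defects)
STATES = {
    0: ('C', 1, 2),
    1: ('D', 0, 2),
    2: ('C', 2, 2),
}

def next_move(state, opponent_move):
    """Return this player's move in the current state and the state to move to."""
    move, if_coop, if_defect = STATES[state]
    return move, (if_coop if opponent_move == 'C' else if_defect)

state = 0                             # the starting state is State 0
move, state = next_move(state, 'C')   # play C; opponent cooperates -> State 1
move, state = next_move(state, 'D')   # play D; opponent defects    -> State 2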

FIGURE 6.25: Example of the automaton represented by Table 6.8.

The automaton used in the FLAME iterated prisoner’s dilemma game uses a 16-state strategy, in which each state index is represented using 4 bits. Each strategy contains 16 states; the payoff earned playing a strategy is its score. Players maintain a database of these strategies in their memory to aid their competition in the simulation. The structure of the strategy database is pictured in Figure 6.26.

Figure 6.27 shows the structure of one state in this strategy. Each state in the strategy is a string of 9 bits. The first bit represents which move to play when in this state; in Figure 6.27, the player will cooperate. After doing so, depending on what the other player plays, it will move to a new state: if the other player cooperates, the player moves to the state represented by the next 4 bits of the string; if the other player defects, it moves to the state represented by the last 4 bits.

FIGURE 6.26: Strategy database of ten strategies in player memory (each entry records payoff, sugars and score).

C 0101 1001

FIGURE 6.27: One state in a strategy.

Because the length of each strategy was 16 states, the game was played 16 times between the players. This ensured that the states in the strategy were reached during the plays, testing the complete strategy.
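Under the 9-bit encoding described above, one state of a strategy can be decoded as follows. This is a sketch only: the function name is illustrative, and the convention that a 0 in the move bit means cooperate is an assumption; the example string corresponds to the state shown in Figure 6.27.

def decode_state(bits):
    """Decode a 9-bit state string: move bit, then two 4-bit next-state indices."""
    assert len(bits) == 9
    move = 'C' if bits[0] == '0' else 'D'   # assumed convention: 0 = cooperate, 1 = defect
    next_if_cooperate = int(bits[1:5], 2)   # state to enter if the other player cooperates
    next_if_defect = int(bits[5:9], 2)      # state to enter if the other player defects
    return move, next_if_cooperate, next_if_defect

# e.g. the state of Figure 6.27: move 'C' followed by 0101 and 1001
print(decode_state('0' + '0101' + '1001'))  # ('C', 5, 9)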

Using this, ideal payoffs for the players were calculated. If all players cooperated from the start, the maximum payoff they would strive to achieve is given as

Cooperating equilibrium = Average payoff × 16 = (3 + 3)/2 × 16 = 48    (6.12)

Similarly, the equilibria for the other situations are given as

Defecting equilibrium = (1 + 1)/2 × 16 = 16    (6.13)

Mixed equilibrium = ((0 + 5)/2 + (1 + 3)/2)/2 × 16 = 2.25 × 16 = 36    (6.14)
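A worked check of these ideal values, assuming 16 rounds per game. The mixed value follows one possible reading of Equation 6.14 (averaging the two kinds of outcomes); that reading is an assumption, while the final figures of 48, 16 and 36 are the ones stated in the text.

R, S, T, P = 3, 0, 5, 1
rounds = 16

cooperating = (R + R) / 2 * rounds                        # (3 + 3)/2 * 16 = 48.0
defecting   = (P + P) / 2 * rounds                        # (1 + 1)/2 * 16 = 16.0
mixed       = ((S + T) / 2 + (P + R) / 2) / 2 * rounds    # 2.25 * 16 = 36.0 (assumed reading)

print(cooperating, defecting, mixed)                      # 48.0 16.0 36.0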

Figure 6.28 displays two parent strategies in a three-state automaton. The parents perform crossover at a point denoted by a state number and an offset within the state. As depicted in Figures 6.28(a) and 6.28(b), the crossover point is at state number = 1 and state length = 4.

Figure 6.29 depicts the two children created by crossing over the two parents. Figure 6.30 depicts a mutant child of Parent 1, mutated at the same position. The diagrams show how, through crossover and mutation techniques, new strategies can be generated simply by moving the bits in the strings. Table 6.9 summarizes the values used during the experiment.
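A minimal sketch of how such strategies could be recombined, assuming each strategy is held as a flat bit string and the crossover point is given as a state number plus an offset within that state. The 9-bit state width follows the description above; the function names and data layout are illustrative.

import random

BITS_PER_STATE = 9   # 1 move bit + two 4-bit next-state indices

def crossover(parent1, parent2, state_number, state_length):
    """Swap the tails of two strategy bit strings at the given point."""
    cut = state_number * BITS_PER_STATE + state_length
    child1 = parent1[:cut] + parent2[cut:]
    child2 = parent2[:cut] + parent1[cut:]
    return child1, child2

def mutate(strategy, state_number, state_length):
    """Flip the single bit at the given position."""
    cut = state_number * BITS_PER_STATE + state_length
    flipped = '1' if strategy[cut] == '0' else '0'
    return strategy[:cut] + flipped + strategy[cut + 1:]

# e.g. cross over two random three-state strategies at state number 1, state length 4
p1 = ''.join(random.choice('01') for _ in range(3 * BITS_PER_STATE))
p2 = ''.join(random.choice('01') for _ in range(3 * BITS_PER_STATE))
c1, c2 = crossover(p1, p2, state_number=1, state_length=4)
m1 = mutate(p1, state_number=1, state_length=4)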

Steps taken in the PD model:

• Step 1: Citizen agent chooses a strategy to play using the roulette wheel selection mechanism (a sketch of this selection is given after this list). It posts the first move, which is either to cooperate (C) or defect (D).


FIGURE 6.28: Two strategies acting as parents.


FIGURE 6.29: Two children resulting from crossover of parents, at crossover point state number 1 and state length 4.

• Step 2: Citizen agent performs crossover and mutation techniques on the strategy for the PD game.

• Step 3: Solver agent reads in the strategies of the two players and plays the game between them. It adds up the collected payoffs and tells the citizens the outcome: who won and who lost.
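The roulette wheel selection referred to in Step 1 picks a strategy with probability proportional to its score. A minimal Python sketch follows; the function name and data layout are illustrative and not taken from the FLAME model itself.

import random

def roulette_wheel_select(strategies, scores):
    """Pick a strategy with probability proportional to its score."""
    total = sum(scores)
    if total == 0:
        return random.choice(strategies)   # fall back to a uniform choice
    pick = random.uniform(0, total)
    running = 0.0
    for strategy, score in zip(strategies, scores):
        running += score
        if pick <= running:
            return strategy
    return strategies[-1]

# e.g. choose among three stored strategies by their accumulated payoffs
chosen = roulette_wheel_select(['s0', 's1', 's2'], [48, 16, 36])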

Figure 6.31 depicts the average score when the payoff of the IPD game is used as the score of the strategy. The graphs were plotted against their ideal values from Equations 6.12 to 6.14, showing which equilibrium was favorable for the players. In Figure 6.31, the players were seen to learn the equilibrium values very quickly in the simulation. The payoffs varied between 40 and 80, but stabilized above the ideal cooperating equilibrium.
