# Rational Agents Cooperate in the Prisoner's Dilemma

The Prisoner's Dilemma is a well-known problem in game theory, economics, and decision theory. The simple version: Alice and Bob have been arrested on suspicion of having committed a crime. If they both stay silent (cooperating with each other), they'll each do a year in prison. If one of them testifies against the other (defecting against them) while the other stays silent, the silent one goes to prison for 4 years and the narc goes free. If they both defect, they each serve two years in prison. In order to simplify the math, game theorists also make three rather unrealistic assumptions: the players are completely rational, they're completely self-interested, and they know all of these details about the game.

There's a general consensus in these academic fields (though many individuals disagree) that the rational choice is to defect against the other prisoner. I believe this consensus is mistaken. The simpler version of the argument:

Both agents are rational, meaning that they'll take the action that gets them the most of what they want. (In this case, the fewest years in jail.) Since the game is completely symmetric, this means they'll both end up making the same decision. Both players have full knowledge of the game and each other, meaning they know that the other player will make the rational choice, and that the other player knows the same thing about them, etc. Since they're guaranteed to make the same decision, the only possible outcomes of the game are Cooperate-Cooperate or Defect-Defect, and the players are aware of this fact. Both players prefer Cooperate-Cooperate to Defect-Defect, so that's the choice they'll make.

### Problems in traditional game theory

The standard argument for defecting is that no matter what the other player does, you're better off having defected. If they cooperate, it's better if you defect. If they defect, it's better if you defect. So by the sure-thing principle, since you know that if you knew the other player's decision in advance it would be best for you to defect no matter what their decision was, you should also defect without the knowledge of their decision.

The error here is to fail to take into account that both players' decisions are correlated. As the Wikipedia page explains:

> Richard Jeffrey and later Judea Pearl showed that [the sure-thing] principle is only valid when the probability of the event considered is unaffected by the action (buying the property).

The technical way of arriving at Defect-Defect is through iterated elimination of strictly dominated strategies. A strictly dominated strategy is one that pays less regardless of what the other player does. Since no matter what Bob does, Alice will get fewer years in jail if she defects than if she cooperates, there's (supposedly) no rational reason for her to ever pick cooperate. Since Bob will reason the same way, Bob will also not pick cooperate, and both players will defect.

But this line of logic is flawed. At the time that Alice is making her decision, she doesn't know what Bob's decision is; just that it will be the same as hers. So if she defects, she knows that her payoff will be two years in jail, whereas if she cooperates, it will only be one year in jail. Cooperating is better.
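To make the two lines of reasoning concrete, here is a minimal sketch using the jail terms from the introduction. The function names are hypothetical, chosen only for illustration. It checks that defecting strictly dominates cooperating when Bob's move is held fixed, and then shows what Alice prefers when her only live options are the two symmetric outcomes.

```python
# Years in prison (lower is better): YEARS[(alice_move, bob_move)] = (alice_years, bob_years).
C, D = "cooperate", "defect"
YEARS = {
    (C, C): (1, 1),
    (C, D): (4, 0),
    (D, C): (0, 4),
    (D, D): (2, 2),
}

def alice_years(alice, bob):
    """Alice's sentence for a given pair of moves."""
    return YEARS[(alice, bob)][0]

def strictly_dominates(a, b):
    """True if move `a` gives Alice fewer years than move `b` against every fixed move by Bob."""
    return all(alice_years(a, bob) < alice_years(b, bob) for bob in (C, D))

def best_symmetric_move():
    """Alice's best move under the assumption that Bob's move always equals hers."""
    return min((C, D), key=lambda move: alice_years(move, move))

print(strictly_dominates(D, C))  # True: the standard dominance argument
print(best_symmetric_move())     # cooperate: the correlated-decision argument
```

Both results are correct answers to different questions. The disagreement is over which question describes Alice's actual situation.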

Wait a second, didn't we specify that Alice knows all details of the game? Shouldn't she know in advance what Bob's decision will be, since she knows all of the information that Bob does, and she knows how Bob reasons? Well, yeah. That's one of the premises of the game. But what's *also* a premise of the game is that Bob knows Alice's decision before she makes it, for the same reasons. So what happens if both players pick the strategy of "I'll cooperate if I predict the other player will defect, and I'll defect if I predict the other player will cooperate"? There's no answer; this leads to an infinite regress.

Or to put it in a simpler way: If Alice knows what Bob's decision will be, and knows that Bob's decision will be the same as hers, that means that she knows what her own decision will be. But knowing what your decision is going to be before you've made it means that you don't have any decision to make at all, since a specific outcome is guaranteed.

Here's the problem: this definition of "rationality" that's traditionally used in game theory assumes logical omniscience from the players; that they have "infinite intelligence" or "infinite computing power" to figure out the answer to anything they want to know. Pitting two players with this ability against each other leads to omnipotence paradoxes.

So the actual answer is that this notion of "rationality" is not logically valid, and a game phrased in terms of it may well have no answer. What we actually need to describe games like these are theories of bounded rationality. This makes the problem more complicated, since the result could depend on exactly how much computing power each player has access to.

So ok, let's assume that our players are boundedly rational, but it's a very high bound. And the other assumptions are still in place; the players are trying to minimize their time in jail, they're completely self-interested, and they have common knowledge of these facts.

In this case, the original argument still holds; the players each know that "I cooperate and the other player defects" is not a possible outcome, so their choice is between Cooperate-Cooperate and Defect-Defect, and the former is clearly better.

Note that this "eliminate impossible options" step is implicitly done under any model of game theory. The traditional payoff matrix for the prisoner's dilemma doesn't have a column for "one player decides to activate their pocket teleportation device and escape the prison", because that's not a possibility, and both players know it's not a possibility, so there's no point in including it in the analysis. The option of one player defecting and the other cooperating is similarly ruled out by the problem statement, so is not worth considering.

A version of the prisoner's dilemma where people tend to find cooperating more intuitive is that of playing against your identical twin. (Identical not just genetically, but literally an atom-by-atom replica of you.) Part of this may be sympathy leaking into the game; humans are not actually entirely self-interested, and it may be harder to pretend to be when imagining your decisions harming someone who's literally you. But I think that's not all of it; when you imagine playing against yourself, it's much more intuitive that the two yous will end up making the same decision.

The key insight here is that full "youness" is not necessary. If your replica had a different hair color, would that change things? If they were a different height, or liked a different type of food, would that suddenly make it rational to defect? Of course not. All that matters is that *the way you make decisions about prisoner's dilemmas* is identical.

(Note that theories of bounded rationality often involve probability theory, and removing certainty from the equation doesn't change this. If each player only assigns 99% probability to the other person making the same decision, the expected value is still better if they cooperate.)
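The arithmetic behind the 99% claim can be sketched as follows, assuming the jail terms from the introduction and a single hypothetical parameter `p_same` for the probability that the other player makes the same choice you do.

```python
def expected_years(move, p_same):
    """Expected sentence for a player who believes the other player
    matches their move with probability p_same (lower is better)."""
    if move == "cooperate":
        # Matched: both cooperate (1 year). Unmatched: lone cooperator (4 years).
        return p_same * 1 + (1 - p_same) * 4
    # Matched: both defect (2 years). Unmatched: lone defector (0 years).
    return p_same * 2 + (1 - p_same) * 0

for p in (0.99, 0.9, 0.8, 0.5):
    print(p, expected_years("cooperate", p), expected_years("defect", p))
```

At 99% the comparison is roughly 1.03 years for cooperating against 1.98 for defecting. With these particular payoffs the two options break even at 80%, so cooperation only stops being the better choice once the correlation gets weaker than that.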

### But causality!

A standard objection raised at this point is that this treatment of Alice as able to "choose" what Bob picks by picking the same thing herself violates causality. Alice could be on the other side of the galaxy from Bob, or Alice could have made her choice years before Bob; there's definitely no causal influence between the two.

...But why does there need to be? The math doesn't lie; Alice's expected value is higher if she cooperates than if she defects. It may be *unintuitive* for some people that Alice's decision can be correlated with Bob's without there being a causal link between them, but that doesn't mean it's *wrong*. Unintuitive things are discovered to be true all the time!

Formally, what this objection advocates is causal decision theory. Causal decision theory underlies the approach of iterated elimination of strictly dominated strategies; when in game theory you say "assume the other player's decision is set, now what should I do", that's effectively the same as how causal decision theory says "assume that everything I don't have causal control over is set in stone; now what should I do?". It's a great theory, except for all the ways in which it's wrong.

The typical demonstration of this is Newcomb's problem. A superintelligent trickster offers you a choice to take one or both of two boxes. One is transparent and contains $1000, and the other is opaque, but the trickster tells you that it put $1,000,000 inside if and only if it predicted that you'd take only the $1,000,000 and not also the $1000. The trickster then wanders off, leaving you to decide what to do with the boxes in front of you. In the idealized case, you know with certainty that the trickster is being honest with you and can reliably predict your future behavior. In the more realistic case, you know those things with high probability, as the scientists of this world have investigated its predictive capabilities, and it has demonstrated accuracy in thousands of previous such games with other people.

Similarly to the prisoner's dilemma, as per the sure-thing principle, taking both boxes is better regardless of what's inside the opaque one. Since the decision of what to put inside the opaque box has already been made, you have no causal control over it, and causal decision theory says you should take both boxes, getting $1000. Someone following a better decision theory can instead take just the opaque box, getting $1,000,000.
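For the more realistic case where the trickster is only very likely to be right, here is a rough sketch of the expected payouts. It assumes a single hypothetical `accuracy` parameter, and it conditions on your own choice, which is exactly the step causal decision theory declines to take.

```python
def newcomb_expected_payout(choice, accuracy, big=1_000_000, small=1_000):
    """Expected dollars, given that the predictor is right with probability `accuracy`."""
    if choice == "one-box":
        # The opaque box is full only if the predictor correctly foresaw one-boxing.
        return accuracy * big
    # Two-boxing: the opaque box is full only if the predictor was wrong.
    return (1 - accuracy) * big + small

# Accuracy at which the two choices have equal expected payout.
break_even = (1_000_000 + 1_000) / (2 * 1_000_000)

for accuracy in (1.0, 0.99, 0.6, break_even):
    print(accuracy,
          newcomb_expected_payout("one-box", accuracy),
          newcomb_expected_payout("two-box", accuracy))
```

With these dollar amounts, the break-even accuracy is 50.05%, so the predictor only needs to be slightly better than a coin flip for one-boxing to come out ahead on this calculation.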

Newcomb's problem is perfectly realizable in theory, but we don't currently have the technology to predict human behavior with a useful degree of accuracy. This leads it to not feel very "real", and people can say they'd two-box without having to actually deal with the consequences of their decision.

So here's a more straightforward example. Alice offers you a choice between two vouchers: a red one or a blue one. You can redeem them for cash from Alice's friend after you pick one. Her friend Bob is offering $10 for any voucher handed in by a person who chose to take the blue voucher. He's also separately offering $9 for any red voucher.

If you take the red voucher and redeem it, you'll get $9. If you take the blue voucher and redeem it, you'll get $10. A rational person would clearly take the blue voucher. Causal decision theory concurs; you have causal control over how much Bob is offering, so you should take that into account.

Now Bob is replaced by Carol and Alice offers you the same choice. Carol does things slightly differently; she considers the position and velocity of all particles in the universe 24 hours ago, and offers $10 for any voucher handed in by someone who would have been caused to choose a blue voucher by yesterday's state of the universe. She also offers $9 for any red voucher, same as Bob.

A rational person notices that this is a completely equivalent problem and takes the blue voucher again. Causal decision theory notices that it can't affect yesterday's state of the universe and takes the red voucher instead, making $1 less.

Unlike Newcomb's problem, this is a game we could play right now. Sadly though, I don't feel like giving out free money just to demonstrate that you could have gotten more. So here's a different offer; a modified Newcomb's problem that we can play with current technology. You can pay me $10 to have a choice between two options, A and B. Before you make your choice, I will predict which option you're going to pick, and assign $20.50 to the other option. You get all the money assigned to the option you picked. We play 100 times in a row, 0.5 seconds per decision (via keypresses on a computer). I will use an Aaronson oracle (or similar) to predict your decisions before you make them.
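To give a feel for how such a predictor can work, here is a minimal sketch of an Aaronson-oracle-style predictor. It is not the code that would be used for the actual offer. It assumes a simple n-gram frequency approach: remember which choice has most often followed the last few choices, and predict that one. The class name, the `play` helper, and the window size of 5 are all assumptions made for illustration.

```python
from collections import defaultdict

class NGramPredictor:
    """Predicts the next choice ('A' or 'B') from how often each choice
    has followed the most recent `window` choices."""

    def __init__(self, window=5):
        self.window = window
        self.history = []
        self.counts = defaultdict(lambda: {"A": 0, "B": 0})

    def predict(self):
        context = tuple(self.history[-self.window:])
        seen = self.counts[context]
        # Ties (including never-seen contexts) default to 'A'.
        return "A" if seen["A"] >= seen["B"] else "B"

    def update(self, choice):
        context = tuple(self.history[-self.window:])
        self.counts[context][choice] += 1
        self.history.append(choice)

def play(choices, stake=10.0, prize=20.50, window=5):
    """Return (player_net_winnings, predictor_accuracy) for a sequence of choices."""
    oracle = NGramPredictor(window)
    net, correct = 0.0, 0
    for choice in choices:
        prediction = oracle.predict()  # made before the choice is revealed
        net -= stake
        if choice != prediction:       # the prize sits on the option NOT predicted
            net += prize
        else:
            correct += 1
        oracle.update(choice)
    return net, correct / len(choices)

net, accuracy = play("AB" * 50)
print(net, accuracy)
```

A strictly alternating player is an extreme case, and this sketch learns the pattern within a handful of rounds. Human attempts at randomness are less regular, but this is the kind of regularity such predictors are built to exploit.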

This is a completely serious offer. You can email me, I'll set up a webpage to track your choices and make my predictions (letting you inspect all the code first), and we can transfer winnings via PayPal.

If you believe that causal decision theory describes rational behavior, you should accept this offer. You can, of course, play around with the linked Aaronson oracle and note that it can predict your behavior with better than the ~51.2% accuracy needed for me to come out ahead in this game. This is completely irrelevant. CDT agents can have overwhelming evidence of their own predictability, and will still make decisions without updating on this fact. That's exactly what happens in Newcomb's problem: it's specified that the player knows with certainty, or at least with very high probability, that the trickster can predict their behavior. Yet the CDT agent chooses to take two boxes anyway, because it doesn't update on its own decision having been made when considering potential futures. This game is the same: you may believe that I can predict your behavior with 70% probability, but when considering option A, you don't update on the fact that you're going to choose option A. You just see that you don't know which box I've put the money in, and that by the principle of maximum entropy, without knowing what choice you're going to make, and therefore without knowing where I have a 70% chance of having not put the money, it has a 50% chance of being in either box, giving you an expected value of $0.25 if you pick box A.
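The numbers in this offer can be checked with a short calculation. This is a sketch that assumes only the $10 stake, the $20.50 prize, and 100 rounds as stated above; the function name is hypothetical.

```python
STAKE, PRIZE, ROUNDS = 10.00, 20.50, 100

def player_ev_per_round(accuracy):
    """Player's expected net winnings per round when the predictor
    is right with probability `accuracy`."""
    # The player collects the prize only when the prediction was wrong.
    return (1 - accuracy) * PRIZE - STAKE

# Predictor accuracy above which the player loses money on average.
break_even_accuracy = 1 - STAKE / PRIZE

print(break_even_accuracy)               # about 0.512
print(player_ev_per_round(0.5))          # the "50/50" estimate: about +0.25
print(player_ev_per_round(0.7) * ROUNDS) # at 70% accuracy: about -385 over 100 rounds
```

The gap between the second and third figures is the whole point: the 50/50 estimate ignores the evidence about the predictor's accuracy, and the 70% estimate uses it.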

If you're an advocate of CDT and think that maybe losing $50+ might not actually be rational, and choose to decline my offer, great! You've realized that CDT is a flawed model that cannot be trusted to make consistently rational decisions, and are choosing to discard it in favor of a better (probably informal) model that does not accept such offers.

### But free will!

This approach to decision theory, where we consider the agent's decision to be subject to deterministic causes, tends to lead to objections along the lines of "in that case it's impossible to ever make a decision at all".

First off, if decisions can't occur, I'd question why people devote so much time to the study of something that doesn't exist.

Yes, quantum mechanics is a thing. Maybe the universe is actually random and not deterministic. Does that restore free will? If your decisions are all determined by trillions of tiny coin flips that nothing has any control over, least of all you, does that somehow put you back in control of your destiny in a way that deterministic physics doesn't? Seems odd.

But quantum mechanics is a distraction. The whole point of decision theory is to formalize the process of making a decision into a function that takes in a world state and utility function, and outputs a best decision for the agent. Any such function would be deterministic.

This demonstrates why "free will" style objections to cooperation-allowing theories of rationality are nonsense. Iterated elimination of strictly dominated strategies **is a deterministic process**. If you claim that rational agents are guaranteed to defect in the prisoner's dilemma, and you know this in advance, how exactly do they have free will? An agent with true libertarian free will can't exist inside *any* thought experiment with any known outcome.

(Humans, by the way, who are generally assumed to have free will if anything does, have knowledge of causes of their own decisions all the time. We can talk about how childhood trauma changes our responses to things, take drugs to modify our behavior, and even engage in more direct brain modification. I have yet to see anyone argue that these facts cause us to lose free will or render us incapable of analyzing the potential outcomes of our decisions.)

What exactly it means to "make a decision" is still an open question, but I think the best way to think about it is the act of finding out what your future actions will be. The feeling of having a choice between two options only comes from the fact that we have imperfect information about the universe. Laplace's demon, with perfect information about all particle locations and speeds, would not have any decisions to make, as it would know exactly what physics would compel it to do.

This is not to say that questions about free will are not interesting or meaningful in other ways, just that they're not particularly relevant to decision theory. Any formal decision theory can be implemented on a computer just as much as it can be followed by a human; if you want to say that both have free will, or neither, or only one, go for it. But clearly such a property of "free will" has no effect on which decision maximizes the agent's utility, nor can it interfere with the process of making a decision.

### Models of the world are supposed to be useful

Many two-boxers in Newcomb's problem accept that it makes them less money, yet maintain that two-boxing is the rational decision because it's what their theory predicts and/or what their intuition tells them they should do.

You can of course choose to define words however you want. But if your definition of "rational" includes "making decisions that unnecessarily lose arbitrary amounts of money", it doesn't seem like a very useful concept. It also has next to no relation to the normal English meaning of the word "rational", and I'd encourage you to pick a different word to avoid confusion.

I think what's happened here is a sort of streetlight effect. People wanted to formally define and calculate rational (i.e. "effective") decisions, and they invented traditional rational choice theory, or Homo economicus, which follows causal decision theory. For most real-world problems they tested it on, this worked fine! So it became the standard theory of the field, and over time people started thinking about CDT as being *synonymous* with rational behavior, not just as a model of it.

But it's easy to construct scenarios where this model fails. In addition to the ones discussed here, there are many other well-known counterexamples: the dollar auction, the St. Petersburg paradox, the centipede game.

I think causal decision theorists tend to have a bit of a spherical cow approach to these matters. This is why I tried to provide real-life examples and offers to bet real money; to remind people that we're not just trying to create the most elegant mathematical equation, we're trying to make theories that actually apply to the real world.

When your decision theory requires you to reject basic physics like a deterministic universe... maybe it's time to start looking for a new theory.

### Potential misconceptions

A few things that are not true:

It is not true that a rational agent will cooperate against *any* other agent; only against another rational one. If a rational agent plays against an opponent using the strategy of "always defect", or "always cooperate", or any other strategy that is uncorrelated with its opponent's decision, the rational agent will see that it gets higher utility by defecting.

It is not true that this is *directly* applicable to the real world. None of the decision-making agents that we encounter in today's world (humans, institutions, and computer programs) are fully rational. Nor do they have full knowledge of each other's decision-making processes, which prevents them from ruling out unlikely results of the game.

It is not true that I am defining "rational behavior" as "cooperates with other agents with this label". I am defining "rational behavior" as "the behavior that results in the best outcome", and the fact that such an agent will cooperate with other such agents is a consequence of the fact that doing so gets them both more utility than any other course of action. On the contrary, it is the classical game theorists who often play semantic games, simply *defining* "rationality" as "seeking Nash equilibria" or "following Causal Decision Theory", even when doing so has obviously terrible consequences.

It is not true that there is one decision theory that is strictly better than the others in all possible situations. By the no free lunch theorem, there will always be some situation in which another decision theory does better. (For a trivial example, consider playing a prisoner's dilemma against a player with the strategy "defect against players using [decision theory X], cooperate against all others".) But luckily, our world has patterns, and some types of problems and agents are more common than others. When we talk about being "rational", it means "adopting a theory that performs well on the sorts of problems it's likely to encounter".

### I welcome being caused to change my mind

My goal with this article is to serve as a comprehensive explanation of why I don't subscribe to causal decision theory, and operate under the assumption that some other decision theory is the correct one. (I don't know which.)

None of what I'm saying here is new, by the way. Evidential decision theory, functional decision theory, and superrationality are some attempts to come up with better systems that do things like cooperate in the prisoner's dilemma. But they're often harder to formalize and harder to calculate, so they haven't really caught on. For some more specific technical analyses, I'd recommend checking out papers like *Functional Decision Theory: A New Theory of Instrumental Rationality* and *Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic*, or books such as *Paradoxes of Rationality and Cooperation: Prisoner's Dilemma and Newcomb's Problem* and *How We Cooperate: A Theory of Kantian Optimization*.

I'm sure many people will still disagree with my conclusions; if that's you, I'd love to talk about it! If I explained something poorly or failed to mention something important, I'd like to find out so that I can improve my explanation. And if I'm mistaken, that would be extremely useful information for me to learn.

This isn't just an academic exercise. It's a lot easier to justify altruistic behavior when you realize that doing so makes it more likely that other people who reason in similar ways to you will treat you better in return. People will be less likely to threaten you if you credibly demonstrate that you won't accede to their demands, even if doing so in-the-moment would benefit you. And when you know that increasing the amount of knowledge people have about each other makes them more likely to cooperate with each other, it becomes easier to solve collective action problems.

As Karl Marx put it: Most philosophers only interpret the world in various ways. The point, however, is to change it.