Getting Stuck: When Cooperation Collapses — Repeated Prisoner’s Dilemma with Noise

This was the one problem that looked like a polite conversation between two rational agents — and turned into a screaming match with foggy hearing.

I wanted to understand how cooperation could be sustained between selfish agents when their interactions were repeated but noisy. Sounds textbook, right? Famous results (tit-for-tat, folk theorem, grim trigger) promise neat strategies. But once I actually started to play with imperfect observation — the small, realistic twist where you sometimes misread the other player’s move — everything that was tidy started to fray.

This is the story of getting stuck: my naive assumptions, the routes that wasted time, the experiments that confused me, and finally the insight that pulled me out.

The Game (short version)

One-shot Prisoner’s Dilemma (payoffs in standard normalized form):

Two players simultaneously choose C (cooperate) or D (defect).
Payoffs (Player 1, Player 2):
- (C, C) → (R, R) (reward for mutual cooperation)
- (C, D) → (S, T) (sucker, temptation)
- (D, C) → (T, S)
- (D, D) → (P, P)

Typical numeric example (strict PD): $T=5,\ R=3,\ P=1,\ S=0$ . Temptation > Reward > Punishment > Sucker.

In a single shot, defecting is dominant; mutual defection is the unique Nash equilibrium and is Pareto-inferior to mutual cooperation.

In the repeated game (infinitely repeated or with unknown horizon), cooperation can be sustained by trigger strategies. But — and here’s the rub — what if actions are observed noisily? Suppose after each round each player sees a signal about the opponent’s action; with some probability the signal is flipped (a “mistake” or “noise”). Now a single accidental flip can trigger a cascade of retaliation under naive trigger strategies and collapse cooperation.

That’s the setting I took on.

My initial, confident plan

I’d read about tit-for-tat (TFT) — start cooperating, then mimic opponent’s last move — and grim trigger (cooperate until the opponent defects once, then defect forever). TFT is simple and forgiving; grim is brutal but effective in noiseless repeated PD. My plan:

Implement an iterated PD simulator for two strategies.
Add observation noise (flip the opponent’s move with probability $\epsilon$ ).
Test pairings: (TFT vs TFT), (grim vs grim), (TFT vs grim) etc., and measure average payoff over many rounds.

I expected:

TFT vs TFT: stable cooperation.
Grim vs grim: stable cooperation (no mistakes).
But with noise, TFT remains robust, while grim collapses (a single error triggers permanent defection).

Turns out the reality was more interesting (and more frustrating).

Implementation (the lab bench)

I wrote a simple simulator to iterate many rounds and compute long-run average payoffs. Here is a compact version of the core simulation I used in the experiments — include this as a runnable snippet in the blog so readers can reproduce experiments.

import random
from collections import deque

# payoff matrix (Player1, Player2)
T, R, P, S = 5, 3, 1, 0

def payoff(a1, a2):
    if a1=='C' and a2=='C': return R, R
    if a1=='C' and a2=='D': return S, T
    if a1=='D' and a2=='C': return T, S
    return P, P

def noisy_observe(action, eps):
    # with prob eps signal flips: C->D or D->C
    if random.random() < eps:
        return 'D' if action == 'C' else 'C'
    return action

class TFT:
    def __init__(self): self.last_op = 'C'  # start cooperating
    def move(self): return self.last_op
    def update(self, observed_op): self.last_op = observed_op

class Grim:
    def __init__(self): self.anger = False
    def move(self): return 'D' if self.anger else 'C'
    def update(self, observed_op): 
        if observed_op == 'D': self.anger = True

def play(strat1, strat2, rounds=1000, eps=0.0):
    s1, s2 = strat1, strat2
    total1 = total2 = 0
    for _ in range(rounds):
        a1 = s1.move()
        a2 = s2.move()
        # signals perceived (noisy)
        obs1 = noisy_observe(a2, eps)
        obs2 = noisy_observe(a1, eps)
        p1, p2 = payoff(a1, a2)
        total1 += p1; total2 += p2
        s1.update(obs1); s2.update(obs2)
    return total1/rounds, total2/rounds

I then ran experiments across epsilons and averaged many trials.

First experiments — sanity checks

No noise ( $\epsilon = 0$ ):

TFT vs TFT → average payoffs near $R=3$ . ✅
Grim vs Grim → average payoffs near $R=3$ . ✅
TFT vs Grim → both cooperate and get R, or sometimes Grim exploits? Mostly mutual cooperation.

Okay, excellent. The simulators were working.

Now introduce a small noise, say $\epsilon = 0.01$ (1% mistakes). I expected TFT to sustain cooperation because it quickly “forgives” accidental defections — it retaliates once but then returns to C if opponent returns to C. Grim I expected to collapse: one noisy signal leads to permanent defection.

What I observed: grim collapsed (as expected), but TFT also suffered — cooperation decayed more than I thought, and the long-run payoff was noticeably less than $R$ . That was surprising.

Getting stuck: Why did TFT fail so badly?

I assumed TFT’s single-step reciprocity would be enough. But the simulation showed that even modest noise produced long sequences of alternating retaliations: A mis-signal causes A to defect, B observes a defect (maybe due to noise too), retaliates, A observes B’s retaliation (or a flipped signal), and a chain of tit-for-tat responses can generate long pockets of mutual defection.

I tried to reason this out:

Suppose both use TFT and an accidental flip makes A appear to defect to B. B defects next round in response. But if the second round’s observations are also noisy, the two agents can get out of sync in who they think defected when. The process is a Markov chain on the “state” of perceived last moves; noise creates excursions away from the cooperative absorbing-like region.

I felt stuck: TFT is textbook “nice” and forgiving — why does it die? The answer lies in the distinction between action noise (you execute the wrong action) vs observation noise (you misobserve the opponent). My simulation used observation noise (signals flipped), which is more pernicious: you might cooperate yet be perceived as defecting. TFT responds to perceived moves. Once perception diverges, the chain of tit-for-tat retaliations can be long.

I needed to explore alternatives.

Second wave: strategy engineering

I experimented with several proposed alternatives from the literature and intuition:

Generous TFT (GTFT): cooperate unless opponent defected, but retaliate only with some probability $q < 1$ . That is, if I see D, respond with D only with probability $q$ ; otherwise cooperate (forgive). The hope: this probabilistic forgiveness breaks cycles caused by noisy signals.
Contrite TFT: adapts to the idea of “I mistakenly defected” — the player keeps internal memory of whether they defected accidentally and tries to correct back. (A bit subtle to model.)
Win-stay-lose-shift (Pavlov): if your last payoff was R or T (i.e., “win”), repeat previous action; if last payoff was S or P (i.e., “lose”), switch action. Pavlov tends to be robust to some noises and performs well in tournaments.
Firm-but-fair: a mild trigger strategy that punishes for $k$ rounds on observed defection instead of forever.

I implemented GTFT and Pavlov. A sketch:

import random
class GTFT:
    def __init__(self, q=0.9): self.q = q; self.last = 'C'
    def move(self): return self.last
    def update(self, observed): 
        if observed == 'D' and random.random() < self.q:
            self.last = 'D'
        else:
            self.last = 'C'

class Pavlov:
    def __init__(self): self.last = 'C'; self.last_payoff = None
    def move(self): return self.last
    def update(self, observed, observed_payoff=None):
        # simpler: derive payoff from (last, observed); implement similar to paper
        if observed_payoff is None:
            # use payoff table to compute last payoff
            pass

I simplified Pavlov by using actual joint actions (cheating slightly by assuming perfect mapping between actions and payoffs), but it was enough to explore qualitative behavior.

Second experiments — some relief

With $\epsilon = 0.01$ :

GTFT (with generous parameter $q \approx 0.7$ –0.9) vs GTFT → Much better cooperation than TFT. Payoffs close to R.
GTFT vs TFT → GTFT could restore cooperation; TFT alone still fragile.
Pavlov vs Pavlov → surprisingly robust to noise; Pavlov often outperformed TFT.

So some strategies survive noise — but then I hit another puzzle: how to choose $q$ (the generosity)? Too forgiving (small q) and you allow exploitation; too unforgiving (q near 1) and you behave like TFT and allow retaliatory spirals.

I was stuck again: balancing exploitation resistance and noise tolerance seemed like tuning an instrument by ear. I wanted a principled guide: can we derive the right forgiveness level from parameters (payoffs, discount factor, noise $\epsilon$ )?

The theoretical pause: folk theorem & conditions for cooperation

I stepped back and reminded myself of formal results:

In infinitely repeated PD with discount factor $\delta$ (players value future payoffs by $\delta$ per period), many payoff profiles that are feasible and individually rational can be supported as equilibria if $\delta$ is large enough (Folk Theorem). That explains why grim can sustain cooperation in noiseless settings with high $\delta$ .
With noise, sustaining cooperation requires strategies that are robust to mistakes and sequentially rational given the observation model. Trigger strategies that punish forever are not robust; strategies that retaliate with finite/adjustable punishments or forgive probabilistically can be equilibrium strategies under noise.

But these theorems are existence results; they don’t give constructive parameters. I was left with the familiar divide: theory says “it’s possible” but doesn’t give me an explicit $q(\epsilon,\delta,T,R,S,P)$ .

Back to experiments — making them rigorous

I wrote some code to sweep across parameters and recorded long-run payoffs and fraction of cooperative rounds. The core experimental design:

Fix payoff matrix $T,R,P,S$ .
Run iterated matches for many rounds (large finite approximation with discounting, or randomized termination to mimic geometric discount).
Sweep $\epsilon$ from 0 to, say, 0.1 (0–10% noise).
For GTFT, sweep $q$ from 0.1 to 1.0.
Measure: mean payoff per round, fraction of rounds with (C,C), and exploitation rate.

What emerged:

There’s a sweet spot of $q$ that depends on $\epsilon$ and payoffs. As $\epsilon$ increases, optimal $q$ declines (you want more forgiveness).
Pavlov-like rules work well without tuning in a broader range of $\epsilon$ .
TFT is brittle: a single mistaken defection often initiates long runs of mutual defection even at small $\epsilon$ .

I felt ecstatic for a moment — empirical patterns! — but then cautious: were these artifacts of finite-round simulations, or genuine asymptotic phenomena? I varied the horizon and the termination distribution and the qualitative findings persisted.

False confidence and a nasty corner case

Buoyed by the experiments, I wrote up provisional conclusions: GTFT with $q\approx 1- c\epsilon$ performs well; Pavlov is a strong strategy; naive TFT should be avoided in noisy environments.

Then I discovered a nasty corner case when exploring heterogeneous agents: when facing an adversary who plays an opportunistic pattern (defects rarely but at strategic times), GTFT can be exploited if $q$ is too small. In a population with many strategies, evolution might favor exploiters who take advantage of mercy.

This exposed a deeper tension: robustness to random noise vs robustness to strategic exploitation. Designing a strategy that balances both is nontrivial; forgiving adversaries produce vulnerability; fierce strategies collapse cooperation due to noise.

I was stuck again: the space of strategies and opponents is large. How to summarize?

The climb — lessons and insights

After many hours of simulation, code tweaks, and theoretical reading (notes and lecture recollection — nothing I’ll quote verbatim here), the climb out looked like this:

Distinguish noise types. Observation noise (you misread the opponent) is worse for TFT than implementation/action noise (you accidentally do the wrong thing). That distinction informed why GTFT helps.
Use probabilistic forgiveness to damp retaliatory cascades. GTFT (retaliate with prob $q$ ) breaks deterministic cycles with a small chance of immediate return to cooperation. The optimal $q$ trades off forgiveness vs. exploitation.
Finite punishment windows (punish for $k$ rounds) can also work: it contains the damage and allows recovery, at the cost of weaker deterrence.
Pavlov (Win-stay-lose-shift) is surprisingly robust: it tends to correct errors by switching to cooperation after a round where mutual cooperation was previously present.
Population context matters. In evolutionary dynamics, a generous strategy can be invaded by exploiters; evolution selects strategies that beat the current population average, not necessarily those that produce the highest group payoff.
Discounting matters. Higher $\delta$ (players care more about the future) makes harsh punishments more credible; generous strategies tolerate noise better when future matters more.
Simplicity vs optimality. TFT is simple and interpretable but brittle. More complex strategies can outperform but are harder to analyze or justify.

Takeaways for the blog reader

If you want a tidy TL;DR to close the post:

Repeated interactions allow cooperation, but noise changes the game.
Tit-for-tat is great in noiseless worlds; under observation noise, it often fails because of retaliatory spirals.
Strategies that incorporate forgiveness (probabilistic or limited-duration punishments) or stateful correction (like Pavlov) are more robust.
There’s no single “best” strategy: the optimal design depends on noise level, payoff parameters, discounting, and the population of opponents.
Simulation is a powerful way to build intuition: sweep noise and strategy parameters and inspect both average payoffs and distributions of outcomes, not just means.

Arya’s Algorithms

Search This Blog