Getting Stuck: The Deadlock That Ate My Program
In this installment of Getting Stuck, I set out to do something totally ordinary: write a tiny banking system where two accounts can transfer money to each other. Easy enough, right? Just some locks and arithmetic.
Instead, I ended up with a program that sometimes worked, sometimes froze solid, and once even locked itself in an eternal staring contest with my CPU.
This is the story of my brush with deadlock.
Step 1: The Setup
I wanted a simple class for accounts:
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

    def withdraw(self, amount):
        self.balance -= amount

    def deposit(self, amount):
        self.balance += amount
Then a function to transfer between accounts:
def transfer(src, dst, amount):
    src.lock.acquire()
    dst.lock.acquire()
    src.withdraw(amount)
    dst.deposit(amount)
    src.lock.release()
    dst.lock.release()
Seems straightforward. Two locks, a withdraw, a deposit, and we’re done.
Step 2: The Hang
I spun up a quick test:
a = Account(100)
b = Account(100)

threads = []
for _ in range(100):
    t1 = threading.Thread(target=transfer, args=(a, b, 1))
    t2 = threading.Thread(target=transfer, args=(b, a, 1))
    threads.extend([t1, t2])

for t in threads:
    t.start()
for t in threads:
    t.join()
Sometimes it worked. Other times, the program froze. No error, no crash — just silence.
I stared at it for minutes, convinced I’d missed a typo. But everything looked fine.
Step 3: False Leads
I suspected the balance arithmetic. Maybe withdraw was letting balances go negative? I added print statements. Balances were fine.
Then I thought maybe I was exhausting threads. I cut the loop down to 2 iterations. Still froze.
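I even wondered if Python’s GIL was messing with me. But I knew enough to realize: the GIL only stops threads from executing Python bytecode in parallel. It still lets them interleave, and it has nothing to do with threads blocking on locks I created myself.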
It was something else.
Step 4: Seeing the Pattern
After running the program dozens of times, I noticed a pattern: it only froze when one thread tried transfer(a, b, 1) while another tried transfer(b, a, 1) at the same time.
That was my “aha” moment.
Thread 1 grabbed a.lock.
Thread 2 grabbed b.lock.
Thread 1 waited for b.lock.
Thread 2 waited for a.lock.
Both threads were waiting for each other, forever.
I had built a textbook deadlock.
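The four steps above can be reproduced on demand instead of waiting for an unlucky run. What follows is a minimal sketch, not the code from my test: `try_transfer`, `demo`, and the `results` dictionary are names made up for this illustration, and the 0.5 second timeout is an arbitrary choice. A `threading.Barrier` forces both threads to hold their first lock before either reaches for the second, and `acquire(timeout=...)` lets each thread give up rather than hang.

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def try_transfer(src, dst, amount, barrier, results, name):
    """Take src.lock, wait for the other thread, then try dst.lock with a timeout."""
    with src.lock:
        barrier.wait()  # both threads now hold their first lock
        got_second = dst.lock.acquire(timeout=0.5)  # would block forever without a timeout
        if got_second:
            try:
                src.balance -= amount
                dst.balance += amount
            finally:
                dst.lock.release()
        results[name] = got_second
        # Keep holding the first lock until both attempts are over, so neither
        # thread can sneak in after the other one has given up.
        barrier.wait()

def demo():
    a, b = Account(100), Account(100)
    barrier = threading.Barrier(2, timeout=5)
    results = {}
    t1 = threading.Thread(target=try_transfer, args=(a, b, 1, barrier, results, "t1"))
    t2 = threading.Thread(target=try_transfer, args=(b, a, 1, barrier, results, "t2"))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results, a.balance, b.balance

if __name__ == "__main__":
    print(demo())
```

Neither thread gets its second lock, so no money moves. Take the timeout away and this is exactly the freeze I kept hitting.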
Step 5: The Fix
The classic fix is to impose an ordering on lock acquisition. Always grab locks in the same global order, no matter what.
def transfer(src, dst, amount):
    # Order the two accounts by id() so every thread locks them in the same order.
    first, second = sorted([src, dst], key=id)
    # "with" releases both locks in reverse order, even if an exception is raised.
    with first.lock, second.lock:
        src.withdraw(amount)
        dst.deposit(amount)
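Because every thread takes the two locks in the same order, a thread that is waiting for the second lock can only be waiting on a thread that already holds the first one and is free to proceed. Now no two threads can get into a circular wait. Deadlock avoided.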
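To convince myself the ordering rule holds up, a quick stress run helps. This is a sketch under a few assumptions: `stress`, `pairs`, and `join_timeout` are hypothetical names for this example, the `Account` class is trimmed to the bare minimum, and the threads are daemons joined with a timeout so that a regression would show up as a count of stuck threads rather than another frozen terminal.

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Same global order for every thread: lower id() first.
    first, second = sorted([src, dst], key=id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount

def stress(pairs=200, join_timeout=10.0):
    """Run opposing transfers and report (stuck thread count, balance a, balance b)."""
    a, b = Account(100), Account(100)
    threads = []
    for _ in range(pairs):
        threads.append(threading.Thread(target=transfer, args=(a, b, 1), daemon=True))
        threads.append(threading.Thread(target=transfer, args=(b, a, 1), daemon=True))
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=join_timeout)
    stuck = [t for t in threads if t.is_alive()]
    return len(stuck), a.balance, b.balance

if __name__ == "__main__":
    print(stress())
```

Equal numbers of transfers run in each direction, so both balances should end where they started. One caveat the sketch does not handle: calling `transfer(a, a, 1)` would try to take the same non-reentrant `threading.Lock` twice and block forever, so a real version would want a guard for `src is dst`.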
Step 6: The Bigger Lesson
The fix worked — but what struck me was how invisible the bug was. There was no error message, no stack trace. Just… nothing.
Deadlocks are scary because they’re not “fail fast.” They fail silently. They hide until the exact unlucky interleaving of events triggers them, and then they freeze everything.
And that makes them the ultimate “getting stuck.” Not only was my program stuck, but so was I, staring at it, unsure where to even begin looking.
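One way to get a place to start looking is to make the hang report itself. The sketch below is an assumption-laden illustration, not something I had at the time: `report_stuck` and `demo` are hypothetical helpers, and `sys._current_frames()` is a CPython function documented as being for specialized debugging use. It joins each thread with a timeout and, for any thread still alive, captures the stack showing where it is blocked.

```python
import sys
import threading
import traceback

def report_stuck(threads, timeout=1.0):
    """Join each thread with a timeout; return {thread name: stack text} for those still alive."""
    for t in threads:
        t.join(timeout=timeout)
    frames = sys._current_frames()  # maps thread ident -> that thread's current frame
    report = {}
    for t in threads:
        if t.is_alive():
            frame = frames.get(t.ident)
            stack = "".join(traceback.format_stack(frame)) if frame else "<no frame>"
            report[t.name] = stack
    return report

def demo():
    gate = threading.Lock()
    gate.acquire()  # the main thread holds the lock, so the worker has to block

    def worker():
        with gate:
            pass

    t = threading.Thread(target=worker, name="blocked-worker", daemon=True)
    t.start()
    report = report_stuck([t], timeout=0.2)
    gate.release()  # let the worker finish so the demo exits cleanly
    t.join()
    return report

if __name__ == "__main__":
    for name, stack in demo().items():
        print(name)
        print(stack)
```

The standard library's `faulthandler.dump_traceback_later` offers a similar safety net with less code: it dumps every thread's traceback if the program is still running after a given number of seconds.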
Reflection
This episode of Getting Stuck left me with a deeper respect for concurrency bugs.
- A single wrong order of locks can doom an entire system.
- Debugging deadlocks is like looking for ghosts.
- The solution often isn’t fancy: just a simple rule, like ordering lock acquisition.
But the real takeaway is humility. Because if this tiny toy program can deadlock, what about massive systems with hundreds of threads, locks, and resources?
Getting stuck on a bug like this teaches you that sometimes the hardest problems aren’t about code that doesn’t run. They’re about code that never finishes.