Getting Stuck: Hashing My Way Into Chaos
Welcome back to Getting Stuck, the series where I share my CS problem-solving misadventures. Today’s episode: hash functions. On paper they look simple. In practice, they almost destroyed me.
Step 1: The Problem
I wanted to build a small key-value store. Nothing fancy, just strings as keys and integers as values. Obviously, I needed a hash function.
Naively, I thought: Why not just sum the ASCII codes?
def bad_hash(s):
return sum(ord(c) for c in s)
print(bad_hash("abc")) # 294
print(bad_hash("cab")) # also 294 (!)
And immediately, disaster: "abc" and "cab" collided. My hash function didn’t care about order.
But instead of abandoning it, I got stubborn.
Step 2: First Red Herring — Random Hashing
I thought, “What if I assign each character a random number and sum them up?”
import random
char_map = {chr(i): random.randint(1, 1000) for i in range(128)}
def random_hash(s):
return sum(char_map[c] for c in s)
print(random_hash("abc"))
print(random_hash("cab"))
At first, it looked fine. But collisions still popped up regularly. Worse, every time I reran the program, the results changed. Reproducibility was dead.
Red herring number one.
Step 3: Second Red Herring — Overcomplicated Cryptography
I panicked and thought: Maybe I need SHA-256.
import hashlib
def crypto_hash(s):
return int(hashlib.sha256(s.encode()).hexdigest(), 16)
This worked perfectly, but at a cost: way too slow for a lightweight key-value store. I didn’t need cryptographic security, just good distribution. I had overengineered.
Red herring number two.
Step 4: The Polynomial Hash Attempt
Then I remembered something from algorithms class: polynomial rolling hash.
The formula:
Where:
-
= a small prime (e.g. 31 or 53)
-
= a large prime modulus
I implemented it:
def poly_hash(s, p=31, M=10**9+7):
h = 0
power = 1
for c in s:
h = (h + (ord(c) - ord('a') + 1) * power) % M
power = (power * p) % M
return h
And it worked beautifully for small tests.
Step 5: The Bug That Stuck Me
But then I ran it on large inputs:
words = ["hello" * 1000, "world" * 1000]
print(poly_hash(words[0]))
print(poly_hash(words[1]))
Both hashes were exactly the same.
At first, I thought I had a catastrophic bug. I triple-checked my code. Nothing wrong.
Turns out, the problem was integer overflow. Python handles big integers fine, but my modulus M was too small compared to the string length and power multiplications.
Step 6: Fixing It With Double Hashing
The classic trick: use two hash functions with different moduli.
def double_hash(s, p1=31, M1=10**9+7, p2=53, M2=10**9+9):
h1 = h2 = 0
p_pow1 = p_pow2 = 1
for c in s:
val = ord(c) - ord('a') + 1
h1 = (h1 + val * p_pow1) % M1
h2 = (h2 + val * p_pow2) % M2
p_pow1 = (p_pow1 * p1) % M1
p_pow2 = (p_pow2 * p2) % M2
return (h1, h2)
Now collisions became astronomically rare.
Step 7: Reflection
What I learned from this ordeal:
-
Bad hash functions fail silently. The "sum of ASCII codes" looked innocent but collapsed instantly.
-
Overengineering isn’t the answer. Cryptographic hashes were secure but unusable for performance.
-
Theory rescues practice. The polynomial hash trick from textbooks actually saved me here.
-
Paranoia helps. I wouldn’t have noticed the modulus bug if I hadn’t tested absurdly long strings.
Closing
This round of Getting Stuck taught me that hashing is more art than science. The right solution wasn’t about brute force or complexity — it was about respecting the mathematical structure behind hash functions.
And as always, I only saw the path clearly after stumbling through red herrings.
Comments
Post a Comment