Getting Stuck: Hashing My Way Into Chaos

Welcome back to Getting Stuck, the series where I share my CS problem-solving misadventures. Today’s episode: hash functions. On paper they look simple. In practice, they almost destroyed me.

Step 1: The Problem

I wanted to build a small key-value store. Nothing fancy, just strings as keys and integers as values. Obviously, I needed a hash function.

Naively, I thought: Why not just sum the ASCII codes?

def bad_hash(s):
    return sum(ord(c) for c in s)

print(bad_hash("abc"))  # 294
print(bad_hash("cab"))  # also 294 (!)

And immediately, disaster: "abc" and "cab" collided. My hash function didn’t care about order.

But instead of abandoning it, I got stubborn.

Step 2: First Red Herring — Random Hashing

I thought, “What if I assign each character a random number and sum them up?”

import random

char_map = {chr(i): random.randint(1, 1000) for i in range(128)}

def random_hash(s):
    return sum(char_map[c] for c in s)

print(random_hash("abc"))
print(random_hash("cab"))

At first, it looked fine. But collisions still popped up regularly. Worse, every time I reran the program, the results changed. Reproducibility was dead.

Red herring number one.

Step 3: Second Red Herring — Overcomplicated Cryptography

I panicked and thought: Maybe I need SHA-256.

import hashlib

def crypto_hash(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

This worked perfectly, but at a cost: way too slow for a lightweight key-value store. I didn’t need cryptographic security, just good distribution. I had overengineered.

Red herring number two.

Step 4: The Polynomial Hash Attempt

Then I remembered something from algorithms class: polynomial rolling hash.

The formula:

h(s) = (s_0 \cdot p^0 + s_1 \cdot p^1 + \cdots + s_{n-1} \cdot p^{n-1}) \bmod M

Where:

$p$ = a small prime (e.g. 31 or 53)
$M$ = a large prime modulus

I implemented it:

def poly_hash(s, p=31, M=10**9+7):
    h = 0
    power = 1
    for c in s:
        h = (h + (ord(c) - ord('a') + 1) * power) % M
        power = (power * p) % M
    return h

And it worked beautifully for small tests.

Step 5: The Bug That Stuck Me

But then I ran it on large inputs:

words = ["hello" * 1000, "world" * 1000]
print(poly_hash(words[0]))
print(poly_hash(words[1]))

Both hashes were exactly the same.

At first, I thought I had a catastrophic bug. I triple-checked my code. Nothing wrong.

Turns out, the problem was integer overflow. Python handles big integers fine, but my modulus M was too small compared to the string length and power multiplications.

Step 6: Fixing It With Double Hashing

The classic trick: use two hash functions with different moduli.

def double_hash(s, p1=31, M1=10**9+7, p2=53, M2=10**9+9):
    h1 = h2 = 0
    p_pow1 = p_pow2 = 1

    for c in s:
        val = ord(c) - ord('a') + 1
        h1 = (h1 + val * p_pow1) % M1
        h2 = (h2 + val * p_pow2) % M2
        p_pow1 = (p_pow1 * p1) % M1
        p_pow2 = (p_pow2 * p2) % M2

    return (h1, h2)

Now collisions became astronomically rare.

Step 7: Reflection

What I learned from this ordeal:

Bad hash functions fail silently. The "sum of ASCII codes" looked innocent but collapsed instantly.
Overengineering isn’t the answer. Cryptographic hashes were secure but unusable for performance.
Theory rescues practice. The polynomial hash trick from textbooks actually saved me here.
Paranoia helps. I wouldn’t have noticed the modulus bug if I hadn’t tested absurdly long strings.

Closing

This round of Getting Stuck taught me that hashing is more art than science. The right solution wasn’t about brute force or complexity — it was about respecting the mathematical structure behind hash functions.

And as always, I only saw the path clearly after stumbling through red herrings.

Arya’s Algorithms

Search This Blog