Skip to main content

Getting Stuck: Hashing My Way Into Chaos

 

Getting Stuck: Hashing My Way Into Chaos

Welcome back to Getting Stuck, the series where I share my CS problem-solving misadventures. Today’s episode: hash functions. On paper they look simple. In practice, they almost destroyed me.


Step 1: The Problem

I wanted to build a small key-value store. Nothing fancy, just strings as keys and integers as values. Obviously, I needed a hash function.

Naively, I thought: Why not just sum the ASCII codes?

def bad_hash(s):
    return sum(ord(c) for c in s)

print(bad_hash("abc"))  # 294
print(bad_hash("cab"))  # also 294 (!)

And immediately, disaster: "abc" and "cab" collided. My hash function didn’t care about order.

But instead of abandoning it, I got stubborn.


Step 2: First Red Herring — Random Hashing

I thought, “What if I assign each character a random number and sum them up?”

import random

char_map = {chr(i): random.randint(1, 1000) for i in range(128)}

def random_hash(s):
    return sum(char_map[c] for c in s)

print(random_hash("abc"))
print(random_hash("cab"))

At first, it looked fine. But collisions still popped up regularly. Worse, every time I reran the program, the results changed. Reproducibility was dead.

Red herring number one.


Step 3: Second Red Herring — Overcomplicated Cryptography

I panicked and thought: Maybe I need SHA-256.

import hashlib

def crypto_hash(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

This worked perfectly, but at a cost: way too slow for a lightweight key-value store. I didn’t need cryptographic security, just good distribution. I had overengineered.

Red herring number two.


Step 4: The Polynomial Hash Attempt

Then I remembered something from algorithms class: polynomial rolling hash.

The formula:

h(s)=(s0p0+s1p1++sn1pn1)modMh(s) = (s_0 \cdot p^0 + s_1 \cdot p^1 + \cdots + s_{n-1} \cdot p^{n-1}) \bmod M

Where:

  • pp = a small prime (e.g. 31 or 53)

  • MM = a large prime modulus

I implemented it:

def poly_hash(s, p=31, M=10**9+7):
    h = 0
    power = 1
    for c in s:
        h = (h + (ord(c) - ord('a') + 1) * power) % M
        power = (power * p) % M
    return h

And it worked beautifully for small tests.


Step 5: The Bug That Stuck Me

But then I ran it on large inputs:

words = ["hello" * 1000, "world" * 1000]
print(poly_hash(words[0]))
print(poly_hash(words[1]))

Both hashes were exactly the same.

At first, I thought I had a catastrophic bug. I triple-checked my code. Nothing wrong.

Turns out, the problem was integer overflow. Python handles big integers fine, but my modulus M was too small compared to the string length and power multiplications.


Step 6: Fixing It With Double Hashing

The classic trick: use two hash functions with different moduli.

def double_hash(s, p1=31, M1=10**9+7, p2=53, M2=10**9+9):
    h1 = h2 = 0
    p_pow1 = p_pow2 = 1

    for c in s:
        val = ord(c) - ord('a') + 1
        h1 = (h1 + val * p_pow1) % M1
        h2 = (h2 + val * p_pow2) % M2
        p_pow1 = (p_pow1 * p1) % M1
        p_pow2 = (p_pow2 * p2) % M2

    return (h1, h2)

Now collisions became astronomically rare.


Step 7: Reflection

What I learned from this ordeal:

  • Bad hash functions fail silently. The "sum of ASCII codes" looked innocent but collapsed instantly.

  • Overengineering isn’t the answer. Cryptographic hashes were secure but unusable for performance.

  • Theory rescues practice. The polynomial hash trick from textbooks actually saved me here.

  • Paranoia helps. I wouldn’t have noticed the modulus bug if I hadn’t tested absurdly long strings.


Closing

This round of Getting Stuck taught me that hashing is more art than science. The right solution wasn’t about brute force or complexity — it was about respecting the mathematical structure behind hash functions.

And as always, I only saw the path clearly after stumbling through red herrings.



Comments

Popular posts from this blog

Sharp Extremal Bounds for Angular Occupancy (Behind the Curtains)

you can find the full paper here First Attempts, and the Limits of Energy When I first began thinking about angular occupancy, I wasn’t looking for a brand-new problem. I was chasing a theme that has long fascinated me in discrete geometry: how do we measure the geometric richness of a finite set of points? This fascination is not unique to me. Erdős, way back in 1946, posed the distinct distances problem : given n points in the plane, how many distinct distances can they determine? That question alone gave birth to entire decades of work in combinatorial geometry. Later, Fishburn and Füredi asked about distinct angles , while Pach and Sharir studied repeated angles . Each time, the same underlying itch was being scratched: when you place points in the plane, what kinds of diversity can they generate? Why angles, why bins? It was natural to think of angles — after all, distances and directions had already been well studied. But if you go straight for “distinct angles,” you run in...

The Art of Goodbye

  The Art of Goodbye: How Relationships End and Why It Matters The friendship had been slowly dissolving for months. What used to be weekly coffee dates became monthly check-ins, then sporadic text messages, then silence. No fight precipitated the ending, no dramatic confrontation or betrayal. It simply... faded. One day you realized you hadn't spoken to someone who was once central to your life, and you weren't sure when the relationship had officially ended or even if it had. This ambiguous loss left you with a peculiar grief – mourning someone still alive, still accessible, but no longer present in your world. This experience is incredibly common yet rarely discussed. While we have cultural scripts for how relationships begin – meet-cutes, first dates, friendship origin stories – we have few models for how they end. We talk extensively about building connections but rarely about gracefully releasing them. This gap in our social understanding leaves many people unprepared ...

Getting Stuck: When Dijkstra Meets Negative Edges

  Getting Stuck: When Dijkstra Meets Negative Edges One of the joys (and pains) of computer science is when you take a tool you know, confidently apply it, and then watch everything break down in surprising ways. This happened to me recently when I decided to revisit shortest path algorithms — specifically, Dijkstra’s algorithm — only to get stuck in a swamp of negative edges. Setting the Scene The problem was classic: Given a directed, weighted graph and a source vertex, find the shortest path to all other vertices. This is the kind of problem that every CS student learns early, usually with Dijkstra’s algorithm. I already knew Dijkstra like an old friend: Keep a priority queue of distances. Always expand the node with the current smallest distance. Relax its edges. Simple, elegant, efficient ( O((V + E) log V) with a binary heap). So when I started coding a graph utility for my project — a tool to analyze transportation networks — I confidently plugged in D...