Skip to main content

Getting Stuck: When Hashing Isn’t Enough

 

Getting Stuck: When Hashing Isn’t Enough

I was working on a side project where I needed to detect whether two strings were “similar” in some loose sense. The context was simple enough: given two strings, I wanted to know if they were almost the same, differing only by a few characters. Think of it like detecting typos in usernames or checking plagiarism in short text snippets.

At first glance, this looked like a problem tailor-made for hashing. After all, hashing is the workhorse of so many fast algorithms — why not here?


Attempt 1: Hashing to the Rescue (Or Not)

I thought: If I hash each string, then identical strings will have identical hashes. Easy. But then the problem struck me: what about strings that are close but not identical? For example:

"algorithm"
"algorihtm"

A transposition. Just two characters swapped. The hashes would be completely different.

I wondered if I could get clever. Maybe I could generate multiple hashes per string by deleting each character once and hashing the result. That way, two strings differing by one deletion would share a hash.

I wrote a small Python snippet to test the idea:

import hashlib

def generate_deletion_hashes(s):
    hashes = []
    for i in range(len(s)):
        candidate = s[:i] + s[i+1:]
        h = hashlib.md5(candidate.encode()).hexdigest()
        hashes.append(h)
    return hashes

print(generate_deletion_hashes("algorithm"))

It worked — in a sense. Strings that differed by exactly one deletion could indeed be detected. But what about substitutions? Or transpositions? Or two small errors at once?

My approach quickly spiraled. I’d need deletion hashes, insertion hashes, substitution hashes, and perhaps some complicated scheme for swaps. It was ballooning into something monstrous.

I was stuck.


Attempt 2: Borrowing from Cryptography

Then I thought: Wait, cryptography has something like this, right? Fuzzy hashes. There’s this tool called ssdeep that generates “context triggered piecewise hashes” used in digital forensics to detect similar files.

So I tried to adapt the idea — generate rolling hashes for chunks of the string. Something like Rabin-Karp but fuzzier.

I hacked together a rolling hash:

def rolling_hash(s, base=257, mod=10**9+7):
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

print(rolling_hash("algorithm"))
print(rolling_hash("algorihtm"))

The two hashes looked close numerically… but that’s meaningless. Hashes don’t have a notion of “distance.” They’re deliberately chaotic. If I wanted to compare similarity, this wasn’t going to cut it.

Another dead end.


Attempt 3: Rediscovering Edit Distance

Then I remembered something from an algorithms class: Levenshtein distance. Of course! The canonical metric for string similarity.

I pulled up the recursive definition:

  • If both strings are empty, distance = 0.

  • Otherwise, take the minimum of:

    • deletion,

    • insertion,

    • substitution.

Dynamic programming could compute it in O(n*m) time.

I coded it up:

def edit_distance(a, b):
    n, m = len(a), len(b)
    dp = [[0] * (m+1) for _ in range(n+1)]
    
    for i in range(n+1):
        dp[i][0] = i
    for j in range(m+1):
        dp[0][j] = j
    
    for i in range(1, n+1):
        for j in range(1, m+1):
            if a[i-1] == b[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],    # deletion
                    dp[i][j-1],    # insertion
                    dp[i-1][j-1]   # substitution
                )
    
    return dp[n][m]

print(edit_distance("algorithm", "algorihtm"))

Finally! The output was 2, reflecting the swap. And unlike the hacky hash approach, this actually generalized to all edit types.

I had escaped the trap.


But Then — Scaling Woes

Of course, nothing in CS is ever that clean. Computing edit distance for two short strings is fine. But what if I needed to do this for a database of millions of strings? That’s O(n*m) per comparison. Brutal.

I went down another rabbit hole: approximate nearest neighbor search, BK-trees, metric indexing. BK-trees, in particular, are fascinating. You insert words into a tree keyed by their edit distance from a root. Queries then prune large chunks of the search space.

Pseudo-code looked like this:

class BKTree:
    def __init__(self, word):
        self.word = word
        self.children = {}
    
    def insert(self, other):
        d = edit_distance(self.word, other)
        if d in self.children:
            self.children[d].insert(other)
        else:
            self.children[d] = BKTree(other)

Suddenly, the problem wasn’t about hashing anymore — it was about metric space indexing. I had landed in the world of computational linguistics and spellcheckers, all starting from what I thought was a simple “hashing problem.”


Reflection

This was a classic “Getting Stuck” story:

  1. I rushed into hashing, thinking it would solve everything.

  2. I wasted time chasing after rolling hashes and fuzzy crypto hashes.

  3. Only after circling back to basics (edit distance) did I find a sound approach.

  4. But even then, scaling pushed me into entirely new algorithmic territory.

It wasn’t the answer I expected when I started, but it taught me a deeper lesson: sometimes the elegant solution isn’t about forcing one technique everywhere — it’s about stepping back, reframing the problem, and realizing you’re actually in a different domain entirely.



Comments

Popular posts from this blog

Sharp Extremal Bounds for Angular Occupancy (Behind the Curtains)

you can find the full paper here First Attempts, and the Limits of Energy When I first began thinking about angular occupancy, I wasn’t looking for a brand-new problem. I was chasing a theme that has long fascinated me in discrete geometry: how do we measure the geometric richness of a finite set of points? This fascination is not unique to me. Erdős, way back in 1946, posed the distinct distances problem : given n points in the plane, how many distinct distances can they determine? That question alone gave birth to entire decades of work in combinatorial geometry. Later, Fishburn and Füredi asked about distinct angles , while Pach and Sharir studied repeated angles . Each time, the same underlying itch was being scratched: when you place points in the plane, what kinds of diversity can they generate? Why angles, why bins? It was natural to think of angles — after all, distances and directions had already been well studied. But if you go straight for “distinct angles,” you run in...

The Art of Goodbye

  The Art of Goodbye: How Relationships End and Why It Matters The friendship had been slowly dissolving for months. What used to be weekly coffee dates became monthly check-ins, then sporadic text messages, then silence. No fight precipitated the ending, no dramatic confrontation or betrayal. It simply... faded. One day you realized you hadn't spoken to someone who was once central to your life, and you weren't sure when the relationship had officially ended or even if it had. This ambiguous loss left you with a peculiar grief – mourning someone still alive, still accessible, but no longer present in your world. This experience is incredibly common yet rarely discussed. While we have cultural scripts for how relationships begin – meet-cutes, first dates, friendship origin stories – we have few models for how they end. We talk extensively about building connections but rarely about gracefully releasing them. This gap in our social understanding leaves many people unprepared ...

Getting Stuck: When Dijkstra Meets Negative Edges

  Getting Stuck: When Dijkstra Meets Negative Edges One of the joys (and pains) of computer science is when you take a tool you know, confidently apply it, and then watch everything break down in surprising ways. This happened to me recently when I decided to revisit shortest path algorithms — specifically, Dijkstra’s algorithm — only to get stuck in a swamp of negative edges. Setting the Scene The problem was classic: Given a directed, weighted graph and a source vertex, find the shortest path to all other vertices. This is the kind of problem that every CS student learns early, usually with Dijkstra’s algorithm. I already knew Dijkstra like an old friend: Keep a priority queue of distances. Always expand the node with the current smallest distance. Relax its edges. Simple, elegant, efficient ( O((V + E) log V) with a binary heap). So when I started coding a graph utility for my project — a tool to analyze transportation networks — I confidently plugged in D...