Getting Stuck: When Hashing Isn’t Enough
I was working on a side project where I needed to detect whether two strings were “similar” in some loose sense. The context was simple enough: given two strings, I wanted to know if they were almost the same, differing only by a few characters. Think of it like detecting typos in usernames or checking plagiarism in short text snippets.
At first glance, this looked like a problem tailor-made for hashing. After all, hashing is the workhorse of so many fast algorithms — why not here?
Attempt 1: Hashing to the Rescue (Or Not)
I thought: If I hash each string, then identical strings will have identical hashes. Easy. But then the problem struck me: what about strings that are close but not identical? For example:
"algorithm"
"algorihtm"
A transposition. Just two characters swapped. The hashes would be completely different.
I wondered if I could get clever. Maybe I could generate multiple hashes per string by deleting each character once and hashing the result. That way, two strings differing by one deletion would share a hash.
I wrote a small Python snippet to test the idea:
import hashlib
def generate_deletion_hashes(s):
hashes = []
for i in range(len(s)):
candidate = s[:i] + s[i+1:]
h = hashlib.md5(candidate.encode()).hexdigest()
hashes.append(h)
return hashes
print(generate_deletion_hashes("algorithm"))
It worked — in a sense. Strings that differed by exactly one deletion could indeed be detected. But what about substitutions? Or transpositions? Or two small errors at once?
My approach quickly spiraled. I’d need deletion hashes, insertion hashes, substitution hashes, and perhaps some complicated scheme for swaps. It was ballooning into something monstrous.
I was stuck.
Attempt 2: Borrowing from Cryptography
Then I thought: Wait, cryptography has something like this, right? Fuzzy hashes. There’s this tool called ssdeep that generates “context triggered piecewise hashes” used in digital forensics to detect similar files.
So I tried to adapt the idea — generate rolling hashes for chunks of the string. Something like Rabin-Karp but fuzzier.
I hacked together a rolling hash:
def rolling_hash(s, base=257, mod=10**9+7):
h = 0
for ch in s:
h = (h * base + ord(ch)) % mod
return h
print(rolling_hash("algorithm"))
print(rolling_hash("algorihtm"))
The two hashes looked close numerically… but that’s meaningless. Hashes don’t have a notion of “distance.” They’re deliberately chaotic. If I wanted to compare similarity, this wasn’t going to cut it.
Another dead end.
Attempt 3: Rediscovering Edit Distance
Then I remembered something from an algorithms class: Levenshtein distance. Of course! The canonical metric for string similarity.
I pulled up the recursive definition:
-
If both strings are empty, distance = 0.
-
Otherwise, take the minimum of:
-
deletion,
-
insertion,
-
substitution.
-
Dynamic programming could compute it in O(n*m) time.
I coded it up:
def edit_distance(a, b):
n, m = len(a), len(b)
dp = [[0] * (m+1) for _ in range(n+1)]
for i in range(n+1):
dp[i][0] = i
for j in range(m+1):
dp[0][j] = j
for i in range(1, n+1):
for j in range(1, m+1):
if a[i-1] == b[j-1]:
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = 1 + min(
dp[i-1][j], # deletion
dp[i][j-1], # insertion
dp[i-1][j-1] # substitution
)
return dp[n][m]
print(edit_distance("algorithm", "algorihtm"))
Finally! The output was 2, reflecting the swap. And unlike the hacky hash approach, this actually generalized to all edit types.
I had escaped the trap.
But Then — Scaling Woes
Of course, nothing in CS is ever that clean. Computing edit distance for two short strings is fine. But what if I needed to do this for a database of millions of strings? That’s O(n*m) per comparison. Brutal.
I went down another rabbit hole: approximate nearest neighbor search, BK-trees, metric indexing. BK-trees, in particular, are fascinating. You insert words into a tree keyed by their edit distance from a root. Queries then prune large chunks of the search space.
Pseudo-code looked like this:
class BKTree:
def __init__(self, word):
self.word = word
self.children = {}
def insert(self, other):
d = edit_distance(self.word, other)
if d in self.children:
self.children[d].insert(other)
else:
self.children[d] = BKTree(other)
Suddenly, the problem wasn’t about hashing anymore — it was about metric space indexing. I had landed in the world of computational linguistics and spellcheckers, all starting from what I thought was a simple “hashing problem.”
Reflection
This was a classic “Getting Stuck” story:
-
I rushed into hashing, thinking it would solve everything.
-
I wasted time chasing after rolling hashes and fuzzy crypto hashes.
-
Only after circling back to basics (edit distance) did I find a sound approach.
-
But even then, scaling pushed me into entirely new algorithmic territory.
It wasn’t the answer I expected when I started, but it taught me a deeper lesson: sometimes the elegant solution isn’t about forcing one technique everywhere — it’s about stepping back, reframing the problem, and realizing you’re actually in a different domain entirely.
Comments
Post a Comment