Chasing Set Intersections: A CS Debugging Journey
Sometimes, the hardest part of research in computer science isn’t the complexity of the problem — it’s the way simple things explode when you scale them. I recently went through a debugging journey that started with a basic set intersection task, spiraled into dead ends, and ended with a neat trick involving bit vectors. Let me take you along step by step.
Step 1: The Naive Start
The question was simple:
Given subsets A and B (coming from a very large universe of integers), how can I quickly check if they intersect?
In Python, that’s just:
def has_intersection(A, B):
return len(A & B) > 0
A = {1, 2, 3, 10}
B = {5, 6, 7, 10}
print(has_intersection(A, B)) # True
And for small sets, this was instant. I thought I was done.
But my real dataset wasn’t small — I was dealing with millions of elements, and I needed to handle tens of thousands of queries per second.
When I tried:
import random
U = list(range(10**6)) # universe: 1 million integers
A = set(random.sample(U, 50000))
B = set(random.sample(U, 50000))
print(has_intersection(A, B))
It worked, but when repeated thousands of times, runtime shot up. The naive solution wouldn’t cut it.
Step 2: The Bloom Filter Mirage
I thought I was clever: why not use a Bloom filter? They’re space-efficient and fast for membership checks.
from bloom_filter2 import BloomFilter
bfA = BloomFilter(max_elements=50000, error_rate=0.01)
bfB = BloomFilter(max_elements=50000, error_rate=0.01)
for x in A: bfA.add(x)
for x in B: bfB.add(x)
print(any(x in bfB for x in A))
This looked beautiful at first — constant-time checks, tiny memory footprint.
Then I remembered: Bloom filters allow false positives. In my case, false positives meant reporting intersections that didn’t exist. Totally unacceptable.
Red herring number one. Dead end.
Step 3: Profiling the Wrong Suspect
Maybe the slowdown was in loading data, not the intersections?
So I switched to binary I/O and memory mapping:
import numpy as np
data = np.memmap("bigfile.dat", dtype="int32", mode="r")
This shaved off some milliseconds, but queries were still slow. Profiling revealed what I had tried to ignore: the real bottleneck wasn’t I/O. It was the set intersection itself.
Red herring number two. Another dead end.
Step 4: Reinventing the Wheel
I got stubborn and wrote my own “optimized” structure.
class FastSet:
def __init__(self, iterable):
self.data = {x: True for x in iterable}
def intersects(self, other):
for x in other.data:
if x in self.data:
return True
return False
A = FastSet(A)
B = FastSet(B)
print(A.intersects(B))
It looked promising until I benchmarked it. Turns out, I had essentially rebuilt Python’s set in a slower form. CPython’s C-optimized hash tables are incredibly hard to beat.
Red herring number three. More wasted hours.
Step 5: The Breakthrough
After taking a break, I reframed the problem:
Do I really need to store sets as hash tables?
What if I stored them as bit vectors?
Each element in the universe gets an index. A subset is represented by a bitmask. Intersection becomes a bitwise AND.
import numpy as np
N = 10**6 # size of universe
bitsA = np.zeros(N, dtype=bool)
bitsB = np.zeros(N, dtype=bool)
for x in A: bitsA[x] = 1
for x in B: bitsB[x] = 1
has_intersection = np.any(bitsA & bitsB)
print(has_intersection)
Boom. Intersection reduced to raw bit operations — something CPUs handle extremely well.
When I benchmarked:
import time
queries = [(set(random.sample(U, 50000)),
set(random.sample(U, 50000))) for _ in range(100)]
# baseline with Python sets
t0 = time.time()
for a, b in queries:
_ = len(a & b) > 0
print("Set time:", time.time() - t0)
# with bit vectors
t1 = time.time()
for a, b in queries:
bitsA = np.zeros(N, dtype=bool)
bitsB = np.zeros(N, dtype=bool)
for x in a: bitsA[x] = 1
for x in b: bitsB[x] = 1
_ = np.any(bitsA & bitsB)
print("Bit vector time:", time.time() - t1)
The bit vector method crushed the naive approach in repeated queries.
Step 6: Understanding the Tradeoffs
This wasn’t free:
-
Memory: each subset took
Nbits. With a universe of 10^6, that’s ~125KB per subset. Manageable here, but not for universes in the billions. -
Preprocessing cost: converting sets into bit vectors takes O(|A|) per subset. But once done, queries are lightning fast.
-
Specialization: bit vectors are ideal for intersections, but not always for other operations like frequent insertions/deletions.
Step 7: The Final Form
The clean version looked like this:
def to_bitvector(subset, N):
bv = np.zeros(N, dtype=bool)
for x in subset:
bv[x] = 1
return bv
def has_intersection_bv(A, B, N):
return np.any(A & B)
# Preprocess
bvA = to_bitvector(A, N)
bvB = to_bitvector(B, N)
print(has_intersection_bv(bvA, bvB, N))
And this finally gave me the balance of speed and exactness I was searching for.
Lessons Learned
-
Not every data structure fits every problem. Bloom filters were elegant, but the wrong tool.
-
Always profile before optimizing. I wasted time chasing I/O when the real bottleneck was elsewhere.
-
Representation changes everything. Shifting from hash-based sets to bit vectors turned milliseconds into microseconds.
-
Don’t fear red herrings. Each wrong path sharpened my understanding of the constraints.
Closing Thought
What started as a trivial “check if sets intersect” task turned into a full-blown debugging journey. The final solution wasn’t complicated — just a bitwise AND — but I only reached it after walking through dead ends.
And maybe that’s the essence of CS research: sometimes the most important discoveries are hiding just beyond the mistakes.
Comments
Post a Comment