Gettting Stuck: Chasing Set Intersections

Chasing Set Intersections: A CS Debugging Journey

Sometimes, the hardest part of research in computer science isn’t the complexity of the problem — it’s the way simple things explode when you scale them. I recently went through a debugging journey that started with a basic set intersection task, spiraled into dead ends, and ended with a neat trick involving bit vectors. Let me take you along step by step.

Step 1: The Naive Start

The question was simple:

Given subsets A and B (coming from a very large universe of integers), how can I quickly check if they intersect?

In Python, that’s just:

def has_intersection(A, B):
    return len(A & B) > 0

A = {1, 2, 3, 10}
B = {5, 6, 7, 10}

print(has_intersection(A, B))  # True

And for small sets, this was instant. I thought I was done.

But my real dataset wasn’t small — I was dealing with millions of elements, and I needed to handle tens of thousands of queries per second.

When I tried:

import random

U = list(range(10**6))  # universe: 1 million integers
A = set(random.sample(U, 50000))
B = set(random.sample(U, 50000))

print(has_intersection(A, B))

It worked, but when repeated thousands of times, runtime shot up. The naive solution wouldn’t cut it.

Step 2: The Bloom Filter Mirage

I thought I was clever: why not use a Bloom filter? They’re space-efficient and fast for membership checks.

from bloom_filter2 import BloomFilter

bfA = BloomFilter(max_elements=50000, error_rate=0.01)
bfB = BloomFilter(max_elements=50000, error_rate=0.01)

for x in A: bfA.add(x)
for x in B: bfB.add(x)

print(any(x in bfB for x in A))

This looked beautiful at first — constant-time checks, tiny memory footprint.

Then I remembered: Bloom filters allow false positives. In my case, false positives meant reporting intersections that didn’t exist. Totally unacceptable.

Red herring number one. Dead end.

Step 3: Profiling the Wrong Suspect

Maybe the slowdown was in loading data, not the intersections?

So I switched to binary I/O and memory mapping:

import numpy as np

data = np.memmap("bigfile.dat", dtype="int32", mode="r")

This shaved off some milliseconds, but queries were still slow. Profiling revealed what I had tried to ignore: the real bottleneck wasn’t I/O. It was the set intersection itself.

Red herring number two. Another dead end.

Step 4: Reinventing the Wheel

I got stubborn and wrote my own “optimized” structure.

class FastSet:
    def __init__(self, iterable):
        self.data = {x: True for x in iterable}
    
    def intersects(self, other):
        for x in other.data:
            if x in self.data:
                return True
        return False

A = FastSet(A)
B = FastSet(B)

print(A.intersects(B))

It looked promising until I benchmarked it. Turns out, I had essentially rebuilt Python’s set in a slower form. CPython’s C-optimized hash tables are incredibly hard to beat.

Red herring number three. More wasted hours.

Step 5: The Breakthrough

After taking a break, I reframed the problem:

Do I really need to store sets as hash tables?
What if I stored them as bit vectors?

Each element in the universe gets an index. A subset is represented by a bitmask. Intersection becomes a bitwise AND.

import numpy as np

N = 10**6  # size of universe
bitsA = np.zeros(N, dtype=bool)
bitsB = np.zeros(N, dtype=bool)

for x in A: bitsA[x] = 1
for x in B: bitsB[x] = 1

has_intersection = np.any(bitsA & bitsB)
print(has_intersection)

Boom. Intersection reduced to raw bit operations — something CPUs handle extremely well.

When I benchmarked:

import time

queries = [(set(random.sample(U, 50000)),
            set(random.sample(U, 50000))) for _ in range(100)]

# baseline with Python sets
t0 = time.time()
for a, b in queries:
    _ = len(a & b) > 0
print("Set time:", time.time() - t0)

# with bit vectors
t1 = time.time()
for a, b in queries:
    bitsA = np.zeros(N, dtype=bool)
    bitsB = np.zeros(N, dtype=bool)
    for x in a: bitsA[x] = 1
    for x in b: bitsB[x] = 1
    _ = np.any(bitsA & bitsB)
print("Bit vector time:", time.time() - t1)

The bit vector method crushed the naive approach in repeated queries.

Step 6: Understanding the Tradeoffs

This wasn’t free:

Memory: each subset took N bits. With a universe of 10^6, that’s ~125KB per subset. Manageable here, but not for universes in the billions.
Preprocessing cost: converting sets into bit vectors takes O(|A|) per subset. But once done, queries are lightning fast.
Specialization: bit vectors are ideal for intersections, but not always for other operations like frequent insertions/deletions.

Step 7: The Final Form

The clean version looked like this:

def to_bitvector(subset, N):
    bv = np.zeros(N, dtype=bool)
    for x in subset:
        bv[x] = 1
    return bv

def has_intersection_bv(A, B, N):
    return np.any(A & B)

# Preprocess
bvA = to_bitvector(A, N)
bvB = to_bitvector(B, N)

print(has_intersection_bv(bvA, bvB, N))

And this finally gave me the balance of speed and exactness I was searching for.

Lessons Learned

Not every data structure fits every problem. Bloom filters were elegant, but the wrong tool.
Always profile before optimizing. I wasted time chasing I/O when the real bottleneck was elsewhere.
Representation changes everything. Shifting from hash-based sets to bit vectors turned milliseconds into microseconds.
Don’t fear red herrings. Each wrong path sharpened my understanding of the constraints.

Closing Thought

What started as a trivial “check if sets intersect” task turned into a full-blown debugging journey. The final solution wasn’t complicated — just a bitwise AND — but I only reached it after walking through dead ends.

And maybe that’s the essence of CS research: sometimes the most important discoveries are hiding just beyond the mistakes.

Arya’s Algorithms

Search This Blog

Gettting Stuck: Chasing Set Intersections

Chasing Set Intersections: A CS Debugging Journey

Step 1: The Naive Start

Step 2: The Bloom Filter Mirage

Step 3: Profiling the Wrong Suspect

Step 4: Reinventing the Wheel

Step 5: The Breakthrough

Step 6: Understanding the Tradeoffs

Step 7: The Final Form

Lessons Learned

Closing Thought

Labels

Comments

Post a Comment

Popular posts from this blog

Sharp Extremal Bounds for Angular Occupancy (Behind the Curtains)

The Art of Goodbye

Getting Stuck: When Dijkstra Meets Negative Edges