Skip to main content

Gettting Stuck: Chasing Set Intersections


Chasing Set Intersections: A CS Debugging Journey

Sometimes, the hardest part of research in computer science isn’t the complexity of the problem — it’s the way simple things explode when you scale them. I recently went through a debugging journey that started with a basic set intersection task, spiraled into dead ends, and ended with a neat trick involving bit vectors. Let me take you along step by step.


Step 1: The Naive Start

The question was simple:

Given subsets A and B (coming from a very large universe of integers), how can I quickly check if they intersect?

In Python, that’s just:

def has_intersection(A, B):
    return len(A & B) > 0

A = {1, 2, 3, 10}
B = {5, 6, 7, 10}

print(has_intersection(A, B))  # True

And for small sets, this was instant. I thought I was done.

But my real dataset wasn’t small — I was dealing with millions of elements, and I needed to handle tens of thousands of queries per second.

When I tried:

import random

U = list(range(10**6))  # universe: 1 million integers
A = set(random.sample(U, 50000))
B = set(random.sample(U, 50000))

print(has_intersection(A, B))

It worked, but when repeated thousands of times, runtime shot up. The naive solution wouldn’t cut it.


Step 2: The Bloom Filter Mirage

I thought I was clever: why not use a Bloom filter? They’re space-efficient and fast for membership checks.

from bloom_filter2 import BloomFilter

bfA = BloomFilter(max_elements=50000, error_rate=0.01)
bfB = BloomFilter(max_elements=50000, error_rate=0.01)

for x in A: bfA.add(x)
for x in B: bfB.add(x)

print(any(x in bfB for x in A))

This looked beautiful at first — constant-time checks, tiny memory footprint.

Then I remembered: Bloom filters allow false positives. In my case, false positives meant reporting intersections that didn’t exist. Totally unacceptable.

Red herring number one. Dead end.


Step 3: Profiling the Wrong Suspect

Maybe the slowdown was in loading data, not the intersections?

So I switched to binary I/O and memory mapping:

import numpy as np

data = np.memmap("bigfile.dat", dtype="int32", mode="r")

This shaved off some milliseconds, but queries were still slow. Profiling revealed what I had tried to ignore: the real bottleneck wasn’t I/O. It was the set intersection itself.

Red herring number two. Another dead end.


Step 4: Reinventing the Wheel

I got stubborn and wrote my own “optimized” structure.

class FastSet:
    def __init__(self, iterable):
        self.data = {x: True for x in iterable}
    
    def intersects(self, other):
        for x in other.data:
            if x in self.data:
                return True
        return False

A = FastSet(A)
B = FastSet(B)

print(A.intersects(B))

It looked promising until I benchmarked it. Turns out, I had essentially rebuilt Python’s set in a slower form. CPython’s C-optimized hash tables are incredibly hard to beat.

Red herring number three. More wasted hours.


Step 5: The Breakthrough

After taking a break, I reframed the problem:

Do I really need to store sets as hash tables?
What if I stored them as bit vectors?

Each element in the universe gets an index. A subset is represented by a bitmask. Intersection becomes a bitwise AND.

import numpy as np

N = 10**6  # size of universe
bitsA = np.zeros(N, dtype=bool)
bitsB = np.zeros(N, dtype=bool)

for x in A: bitsA[x] = 1
for x in B: bitsB[x] = 1

has_intersection = np.any(bitsA & bitsB)
print(has_intersection)

Boom. Intersection reduced to raw bit operations — something CPUs handle extremely well.

When I benchmarked:

import time

queries = [(set(random.sample(U, 50000)),
            set(random.sample(U, 50000))) for _ in range(100)]

# baseline with Python sets
t0 = time.time()
for a, b in queries:
    _ = len(a & b) > 0
print("Set time:", time.time() - t0)

# with bit vectors
t1 = time.time()
for a, b in queries:
    bitsA = np.zeros(N, dtype=bool)
    bitsB = np.zeros(N, dtype=bool)
    for x in a: bitsA[x] = 1
    for x in b: bitsB[x] = 1
    _ = np.any(bitsA & bitsB)
print("Bit vector time:", time.time() - t1)

The bit vector method crushed the naive approach in repeated queries.


Step 6: Understanding the Tradeoffs

This wasn’t free:

  • Memory: each subset took N bits. With a universe of 10^6, that’s ~125KB per subset. Manageable here, but not for universes in the billions.

  • Preprocessing cost: converting sets into bit vectors takes O(|A|) per subset. But once done, queries are lightning fast.

  • Specialization: bit vectors are ideal for intersections, but not always for other operations like frequent insertions/deletions.


Step 7: The Final Form

The clean version looked like this:

def to_bitvector(subset, N):
    bv = np.zeros(N, dtype=bool)
    for x in subset:
        bv[x] = 1
    return bv

def has_intersection_bv(A, B, N):
    return np.any(A & B)

# Preprocess
bvA = to_bitvector(A, N)
bvB = to_bitvector(B, N)

print(has_intersection_bv(bvA, bvB, N))

And this finally gave me the balance of speed and exactness I was searching for.


Lessons Learned

  1. Not every data structure fits every problem. Bloom filters were elegant, but the wrong tool.

  2. Always profile before optimizing. I wasted time chasing I/O when the real bottleneck was elsewhere.

  3. Representation changes everything. Shifting from hash-based sets to bit vectors turned milliseconds into microseconds.

  4. Don’t fear red herrings. Each wrong path sharpened my understanding of the constraints.


Closing Thought

What started as a trivial “check if sets intersect” task turned into a full-blown debugging journey. The final solution wasn’t complicated — just a bitwise AND — but I only reached it after walking through dead ends.

And maybe that’s the essence of CS research: sometimes the most important discoveries are hiding just beyond the mistakes.



Comments

Popular posts from this blog

Sharp Extremal Bounds for Angular Occupancy (Behind the Curtains)

you can find the full paper here First Attempts, and the Limits of Energy When I first began thinking about angular occupancy, I wasn’t looking for a brand-new problem. I was chasing a theme that has long fascinated me in discrete geometry: how do we measure the geometric richness of a finite set of points? This fascination is not unique to me. Erdős, way back in 1946, posed the distinct distances problem : given n points in the plane, how many distinct distances can they determine? That question alone gave birth to entire decades of work in combinatorial geometry. Later, Fishburn and Füredi asked about distinct angles , while Pach and Sharir studied repeated angles . Each time, the same underlying itch was being scratched: when you place points in the plane, what kinds of diversity can they generate? Why angles, why bins? It was natural to think of angles — after all, distances and directions had already been well studied. But if you go straight for “distinct angles,” you run in...

The Art of Goodbye

  The Art of Goodbye: How Relationships End and Why It Matters The friendship had been slowly dissolving for months. What used to be weekly coffee dates became monthly check-ins, then sporadic text messages, then silence. No fight precipitated the ending, no dramatic confrontation or betrayal. It simply... faded. One day you realized you hadn't spoken to someone who was once central to your life, and you weren't sure when the relationship had officially ended or even if it had. This ambiguous loss left you with a peculiar grief – mourning someone still alive, still accessible, but no longer present in your world. This experience is incredibly common yet rarely discussed. While we have cultural scripts for how relationships begin – meet-cutes, first dates, friendship origin stories – we have few models for how they end. We talk extensively about building connections but rarely about gracefully releasing them. This gap in our social understanding leaves many people unprepared ...

Getting Stuck: When Dijkstra Meets Negative Edges

  Getting Stuck: When Dijkstra Meets Negative Edges One of the joys (and pains) of computer science is when you take a tool you know, confidently apply it, and then watch everything break down in surprising ways. This happened to me recently when I decided to revisit shortest path algorithms — specifically, Dijkstra’s algorithm — only to get stuck in a swamp of negative edges. Setting the Scene The problem was classic: Given a directed, weighted graph and a source vertex, find the shortest path to all other vertices. This is the kind of problem that every CS student learns early, usually with Dijkstra’s algorithm. I already knew Dijkstra like an old friend: Keep a priority queue of distances. Always expand the node with the current smallest distance. Relax its edges. Simple, elegant, efficient ( O((V + E) log V) with a binary heap). So when I started coding a graph utility for my project — a tool to analyze transportation networks — I confidently plugged in D...