Getting Stuck — When Sorting Became a Trap
This is another entry in my Getting Stuck series. The title says it: I get stuck, I chase red herrings, I learn, and I climb out. This time the trap was a classical one: sort a file that’s larger than RAM. On paper, “just sort” sounds trivial. In practice, it became an odyssey through I/O, memory, representation, parallelism, and algorithm engineering.
Problem statement (the kind that looks easy)
You have a file with
Nintegers, one per line.Nis huge (tens or hundreds of millions). You cannot load all of them into memory. Produce a sorted file (same format).
This is textbook external sorting territory, but I wanted to implement it myself for two reasons: (1) to understand performance pitfalls in practice, (2) because I enjoy learning the gritty details. Predictably, I got stuck several times.
First attempts — tiny code, big assumptions
I started with the naive in-memory approach (because why not check the obvious):
# naive — works until it doesn't
with open("bigfile.txt") as f:
nums = [int(x) for x in f]
nums.sort()
with open("out.txt", "w") as w:
for n in nums:
w.write(f"{n}\n")
Error: MemoryError. The dataset exceeded available RAM. Fine — obvious. Time to think external.
Red herring #1 — micro-optimizing Python's sort
I tried small micro-optimizations:
-
list.sort()vssorted()(negligible difference). -
Using
key=tricks (no gain). -
Switching from
intto another representation — but conversion cost and Python object overhead killed me.
Lesson: Python list of Python integers is memory-heavy (28+ bytes per int on typical 64-bit CPython). If you want external sorting, you must either use a compact binary representation or operate on chunks.
This was a red herring — tinkering with sort() won’t solve the RAM problem.
Red herring #2 — offload to a DB (SQLite) or load in-memory DB
I tried writing rows to SQLite and asking it to ORDER BY val. It felt appealing — databases handle big data.
Problems:
-
If you used
:memory:you were back to RAM limits. -
On-disk SQLite can sort on disk, but its performance depends on how you configure temp directories, transactions, and available temp space. It worked, but I lost control over tuning and wanted a pure algorithmic approach.
Outcome: valid but unsatisfying for this exercise. Another red herring for the story.
Red herring #3 — counting sort / radix sort attempts
Counting sort is wonderful if the value range R is small. I naïvely tried it for speed:
# terrible idea for large R
counts = [0] * R # R might be 1e9 — OOM
for n in nums:
counts[n] += 1
# then write counts out...
That blows memory immediately for realistic integer ranges (e.g., 0..1e9). Radix sort reduces memory but requires multiple passes and careful buffering. I prototyped one-pass LSD radix that wrote bucket files to disk—more complicated, and unless you have skew-friendly distribution or very fast sequential I/O, you still end up doing lots of IO passes.
Bottom line: counting/radix are useful in special cases, but for a general large-range integer file the classical external merge sort is the pragmatic, robust approach. I’d call attempts at counting/radix “tempting” red herrings — instructive, but not the right general solution.
The real escape (the planner moment): external merge sort
After pacing and scribbling, I remembered the canonical approach:
External merge sort:
-
Read the input file in chunks that fit comfortably in RAM.
-
Sort each chunk in memory and write the sorted chunk back to disk as a temporary file (a “run”).
-
Merge all sorted runs using a k-way merge into a final sorted output.
Reasons it works: each chunk is sorted in RAM (fast), and the merge uses streaming reads so memory requirement during merge is small (proportional to the number of runs and the per-run buffer).
Complexities:
-
In-memory sort of a chunk of size
M:O(M log M). -
If there are
k = N / Mruns, the k-way merge costsO(N log k)compares (if using heap-based k-way merge). -
I/O dominates time; the fewer passes and the larger sequential reads are, the better.
Implementation: from simple to production-ready
I'll present multiple implementations:
-
A simple, easy-to-read, text-line based external sort.
-
A binary, buffered, k-way merge version (faster, lower overhead).
-
Parallel chunk sorting (to use multiple cores while disk I/O continues).
-
Tuning notes and tests.
1) Simple, robust implementation (text lines)
This is great for clarity and correctness. It uses minimal exotic machinery and is easy to paste into a blog.
# external_sort_simple.py
import heapq
import os
import tempfile
def split_into_sorted_chunks(input_path, tmp_dir=None, chunk_size=10**6):
"""
Read `chunk_size` integers at a time (approx), sort them, write to a temp file.
Returns a list of temp file paths (text files with one integer per line).
"""
tmp_dir = tmp_dir or tempfile.mkdtemp(prefix="extsort_")
chunk_files = []
chunk = []
with open(input_path, 'r') as inf:
for line in inf:
if not line.strip():
continue
chunk.append(int(line))
if len(chunk) >= chunk_size:
chunk.sort()
tf = tempfile.NamedTemporaryFile(delete=False, dir=tmp_dir, mode='w', newline='\n')
for num in chunk:
tf.write(f"{num}\n")
tf.close()
chunk_files.append(tf.name)
chunk = []
# last partial chunk
if chunk:
chunk.sort()
tf = tempfile.NamedTemporaryFile(delete=False, dir=tmp_dir, mode='w', newline='\n')
for num in chunk:
tf.write(f"{num}\n")
tf.close()
chunk_files.append(tf.name)
return chunk_files
def merge_sorted_chunks(chunk_files, output_path):
"""
k-way merge text-based sorted chunk files using heapq.merge which performs
an efficient lazy merge of sorted iterables.
"""
files = [open(fn, 'r') for fn in chunk_files]
try:
iterators = [map(int, f) for f in files]
with open(output_path, 'w', newline='\n') as out:
for num in heapq.merge(*iterators):
out.write(f"{num}\n")
finally:
for f in files:
f.close()
def external_sort_simple(input_path, output_path, tmp_dir=None, chunk_size=10**6):
chunk_files = split_into_sorted_chunks(input_path, tmp_dir=tmp_dir, chunk_size=chunk_size)
try:
merge_sorted_chunks(chunk_files, output_path)
finally:
# cleanup temp files
for fn in chunk_files:
try:
os.remove(fn)
except OSError:
pass
Pros: Simple, easy to follow, robust.
Cons: Text I/O is slower (parsing strings to ints repeatedly), and each write/readline may be IO-heavy.
2) Faster approach: binary temporary files + buffered k-way merge
If the file is really big and performance matters, move to a compact binary on-disk representation for runs and use a buffered reader per run to decrease syscall overhead.
Key choices:
-
Use the
arraymodule (very compact and provides.tofile()/.fromfile()methods) ornumpyfor very fast block reads/writes. -
Pack integers as 32-bit (
'I') or 64-bit ('Q') depending on your range. -
During merge, keep a small in-memory buffer for each run to avoid a syscall per integer.
Below is a full binary-run implementation (pure standard library).
# external_sort_bin.py
import os
import tempfile
import heapq
from array import array
INTTYPE = 'I' # 'I' unsigned 32-bit; change to 'Q' for 64-bit, 'i' for signed 32-bit, etc.
INT_BYTES = array(INTTYPE).itemsize
def write_chunk_binary(nums, filename, inttype=INTTYPE):
a = array(inttype, nums)
with open(filename, "wb") as f:
a.tofile(f)
def read_chunk_binary_iter(filename, inttype=INTTYPE, buf_elems=1<<14):
"""
Generator that yields ints from a binary file written by array.tofile.
Reads `buf_elems` integers at a time for efficiency.
"""
with open(filename, "rb") as f:
a = array(inttype)
while True:
a = array(inttype)
try:
a.fromfile(f, buf_elems)
except EOFError:
# on some platforms, fromfile raises EOFError when fewer items left
pass
if len(a) == 0:
break
for v in a:
yield v
def split_into_sorted_chunks_bin(input_path, tmp_dir=None, chunk_size=10**6, inttype=INTTYPE):
"""
Read `chunk_size` integers into memory, sort, write a binary run file.
chunk_size is count of integers (not bytes).
"""
tmp_dir = tmp_dir or tempfile.mkdtemp(prefix="extsort_bin_")
chunk_files = []
chunk = []
with open(input_path, 'r') as inf:
for line in inf:
if not line.strip(): continue
chunk.append(int(line))
if len(chunk) >= chunk_size:
chunk.sort()
tf = tempfile.NamedTemporaryFile(delete=False, dir=tmp_dir, mode='wb')
write_chunk_binary(chunk, tf.name, inttype=inttype)
chunk_files.append(tf.name)
tf.close()
chunk = []
if chunk:
chunk.sort()
tf = tempfile.NamedTemporaryFile(delete=False, dir=tmp_dir, mode='wb')
write_chunk_binary(chunk, tf.name, inttype=inttype)
chunk_files.append(tf.name)
tf.close()
return chunk_files
class BufferedIntReader:
def __init__(self, filename, inttype=INTTYPE, buf_elems=1<<14):
self.f = open(filename, 'rb')
self.inttype = inttype
self.buf_elems = buf_elems
self.buf = array(inttype)
self.pos = 0
self._eof = False
self._refill()
def _refill(self):
self.buf = array(self.inttype)
try:
self.buf.fromfile(self.f, self.buf_elems)
except EOFError:
# maybe less than buf_elems left; that's OK
pass
self.pos = 0
if len(self.buf) == 0:
self._eof = True
def peek(self):
if self._eof:
return None
if self.pos >= len(self.buf):
self._refill()
if self._eof:
return None
return self.buf[self.pos]
def pop(self):
v = self.peek()
if v is None:
raise StopIteration
self.pos += 1
return v
def close(self):
try:
self.f.close()
except Exception:
pass
def merge_sorted_chunks_bin(chunk_files, output_path, inttype=INTTYPE):
"""
k-way merge binary chunk files using a heap + buffered readers.
Writes text output (one int per line) or could write binary output similarly.
"""
readers = [BufferedIntReader(fn, inttype=inttype) for fn in chunk_files]
heap = []
for i, r in enumerate(readers):
v = r.peek()
if v is not None:
heap.append((v, i))
heapq.heapify(heap)
with open(output_path, 'w') as out:
while heap:
val, idx = heapq.heappop(heap)
out.write(f"{val}\n")
r = readers[idx]
try:
r.pop() # we consumed the value
nxt = r.peek()
if nxt is not None:
heapq.heappush(heap, (nxt, idx))
except StopIteration:
pass
for r in readers:
r.close()
Why this is faster:
-
The temporary files are compact (no decimal text), saving disk bandwidth and parsing time.
-
The
BufferedIntReaderreads large blocks (buf_elems) into memory in one go, avoiding millions ofreadline()syscalls. -
The heap maintains only one value per run in memory (plus buffers), so memory overhead is small.
3) Parallel chunk sorting — using multiple CPU cores
If you have multiple cores, you can overlap chunk sorting (CPU-bound) with disk IO (reading input and writing runs). A straightforward pattern uses concurrent.futures.ProcessPoolExecutor to sort-and-write a chunk in another process, while the main process continues reading the next chunk.
Caveats:
-
Passing a large chunk list between processes requires pickling; that copies memory. To avoid double memory, stream chunk to file in parent and let worker sort the temporary file in-place, or use shared memory approaches.
-
Sorting in subprocesses is still limited by disk I/O when writing runs — tune degree of parallelism to avoid thrashing the disk.
Example sketch (text-based run writer):
from concurrent.futures import ProcessPoolExecutor, as_completed
def sort_and_write_text_chunk(chunk, tmp_dir):
chunk.sort()
tf = tempfile.NamedTemporaryFile(delete=False, dir=tmp_dir, mode='w', newline='\n')
for n in chunk:
tf.write(f"{n}\n")
tf.close()
return tf.name
def split_sort_parallel(input_path, tmp_dir=None, chunk_size=10**6, n_workers=4):
tmp_dir = tmp_dir or tempfile.mkdtemp(prefix="extsort_par_")
chunk_files = []
futures = []
with ProcessPoolExecutor(max_workers=n_workers) as ex:
chunk = []
with open(input_path) as inf:
for line in inf:
if not line.strip(): continue
chunk.append(int(line))
if len(chunk) >= chunk_size:
# submit a copy of chunk to worker
futures.append(ex.submit(sort_and_write_text_chunk, chunk[:], tmp_dir))
chunk = []
if chunk:
futures.append(ex.submit(sort_and_write_text_chunk, chunk, tmp_dir))
# collect results
for fut in as_completed(futures):
chunk_files.append(fut.result())
return chunk_files
This pipeline lets you use multiple cores during chunk sorting. It’s not magic — you must balance n_workers, chunk_size, and disk bandwidth.
Practical tuning, pitfalls, and lessons
Below are practical considerations I wrestled with and my conclusions.
Choosing chunk_size
-
chunk_sizeis the number of integers you sort in memory at once. Choose it so thatchunk_size * bytes_per_int< usable RAM * safety_factor. -
Example rough calculation: if you have 8 GB RAM and want to only use up to 4 GB for sorting, and you pack ints as 8 bytes (64-bit), then chunk_size ≈ 4e9 / 8 = 500 million ints — unrealistic for heavy Python object overhead. With binary array of 4 bytes per int, chunk_size ≈ 1 billion / 4 → still huge; realistic chunk sizes are more modest (e.g., 1–10 million) depending on system.
Always test with smaller sizes first.
I/O: sequential vs random
-
External sort is dominated by sequential I/O when implemented correctly. Make sure temporary files are written and read sequentially (they will be); SSDs help a lot.
-
Avoid tiny reads/writes — use buffers (e.g.,
array.fromfileorread(size)).
File format for runs
-
Text: human-readable, portable but slower.
-
Binary (
array/numpy): compact, faster. Use the right integer type. -
Compressed: you can compress runs (LZ4/Zstandard) to reduce disk bandwidth — but compression/decompression CPU cost may be tradeoff.
Merging strategy
-
heapq.mergeover text iterators is elegant and uses minimal memory. For largek, heap operations costO(log k)per emitted element. -
For very large
k, do multi-pass merging (merge groups of runs into intermediate runs, then merge results), or limitkper pass based on memory.
Replacement selection / longer runs
-
Replacement selection (tournament tree) can produce runs longer than memory. It’s more complex but can reduce number of runs and thus reduce merge cost. I did not implement it for the blog because it’s complex and the binary+buffered approach plus sensible chunk size was enough.
Verifying correctness
-
Always verify sorted output: for moderate samples, compare
external_sortoutput tosorted()of the entire dataset (if possible) or spot-check invariants (every subsequent number ≥ previous). -
Watch for integer overflow / signedness if using fixed-size binary types.
Failure modes encountered
-
I/O stall: too many concurrent writers caused disk stalls. Solution: limit parallel degree.
-
Unexpected newlines or non-integers: pre-validate input or robustly skip/handle parse errors.
-
File descriptor exhaustion: opening hundreds of run files simultaneously can hit OS limits — close files or merge in passes.
Example test harness (small-scale verification)
# test_external_sort.py
import random
import subprocess
def make_test_file(path, n=100000, maxv=10**9):
with open(path, 'w') as f:
for _ in range(n):
f.write(str(random.randint(0, maxv)) + '\n')
def verify_sorted(path):
prev = None
with open(path) as f:
for line in f:
v = int(line)
if prev is not None and v < prev:
return False
prev = v
return True
# create test file
make_test_file("small_in.txt", n=100000)
# run external sort
external_sort_simple("small_in.txt", "small_out.txt", chunk_size=20000)
assert verify_sorted("small_out.txt")
This helps you trust the implementation before scaling.
The human part — what I felt while getting stuck
I’ll be honest: the real narrative here isn’t just code. At several points I felt like I was grabbing at shiny objects — micro-optimizing sorted(), convincing myself SQLite would “just do it”, playing with fancy radix ideas. Those were the red herrings. They felt productive but mostly masked the real constraints.
The moment that felt like climbing out was when I stopped searching for a new data structure and recalled the algorithmic toolbox: chunk, sort, merge. Then the engineering part — binary format, buffer sizes, parallelism — was satisfying. It’s boring in an academic sense, but in implementation it’s where you make decisions that really matter.
Things I didn’t do (but you might want to try later)
-
Replacement selection: longer runs, fewer merges — more complex.
-
Polyphase merge: reduce disk passes for specific tape-like media strategies (more theoretical).
-
Using
sorton Unix:sort -S+-Tis an excellent pragmatic approach (systemsortis heavily optimized). For real-world jobs, call out tosortrather than reimplementing, but for learning it’s worth building your own. -
Compressed runs: for very large data, compressing runs (LZ4) often wins because it reduces I/O; decompress-on-the-fly during merge.
-
Out-of-core radix sort: two-pass radix (e.g., 16-bit buckets) building bucket files can be competitive if values are uniformly distributed.
Final reflections and succinct takeaways
-
Getting stuck is valuable: every red herring sharpened the constraints (memory, I/O, accuracy).
-
The right abstraction for big-data sorting is simple: represent data compactly and stream it — minimize random IO and per-element syscalls.
-
Engineering details matter: binary vs text, buffer size, parallelism, and filesystem performance determine whether your external sort finishes in minutes or hours.
-
If you need industrial-strength sorting, use battle-tested tools (system
sort, database bulk-sorting, or external libraries), but implement it yourself to learn the tradeoffs.
Comments
Post a Comment