Tackling this problem starts with some creativity in how to represent the tree in memory. The common pointer-based approach (each node stores pointers to its children) looks downward: it describes the tree from the root down to the leaves, making downward traversal (and recursion) easier. Instead, let’s look upward, and use the following way to describe a binary tree:
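Assign every node an index, store the index of each node's parent in a vector p, and record in a bit-vector c whether each node is the left (0) or right (1) child of that parent. (The exact conventions in this sketch are assumptions; only the properties used below matter.) For a small example tree:

```python
import numpy as np

# Example tree, nodes indexed 0..6 with node 0 as the root:
#
#           0
#         /   \
#        1     2
#       / \   / \
#      3   4 5   6
#
# p[i] = index of the parent of node i (the root simply points to itself).
p = np.array([0, 0, 0, 1, 1, 2, 2], dtype=np.int64)

# c[i] = 0 if node i is a left child, 1 if it is a right child
# (the root's bit is unused).
c = np.array([0, 0, 1, 0, 1, 0, 1], dtype=np.uint8)

# Mirroring ("inverting") the tree swaps left and right children everywhere,
# which is a plain bit-flip of c; the parent vector p is untouched.
c_inverted = c ^ 1
```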
The two vectors c and p together uniquely specify a binary tree. Traversing up the tree can be done easily through p, and since c is essentially a bit-vector, this description is also significantly more space-efficient than storing two pointers per node. Inverting this tree is done by simply inverting the binary vector c (and does not change anything in the parent vector p).
In Python, if we were to naively use a boolean vector for c, we would be wasting a lot of memory. Booleans are stored as full bytes but carry the informational value of a single bit. Unfortunately, the smallest addressable unit of memory is a byte (8 bits), and neither Python nor numpy offers a true bit-array (where each element occupies a single bit). This problem is easy to overcome through bit-packing: we store our data in some larger format, like a byte, but interpret it at the bit level. Python has a number of bitwise operators for this. For example, our vector c can be represented by a uint8 array in the following way:
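Here is a minimal sketch using numpy's packbits (the original snippet may differ in details):

```python
import numpy as np

# Naive boolean vector: one full byte per entry.
c_bool = np.array([0, 0, 1, 0, 1, 0, 1], dtype=np.bool_)

# Pack 8 boolean entries into each uint8 (the last byte is zero-padded).
c_packed = np.packbits(c_bool)          # dtype=uint8, 8x smaller

# Inverting the tree is now a bitwise NOT over the packed bytes.
c_packed_inv = np.invert(c_packed)

# Individual bits can still be recovered with unpackbits (or masks/shifts).
print(np.unpackbits(c_packed_inv)[: c_bool.size])
```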
The following code benchmarks how long it would take to bitwise invert a numpy array of 1.5 billion 64-bit unsigned integers, which represent 64 x 1.5 = 96 billion nodes:
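A sketch of such a benchmark, timing with time.perf_counter (note that input plus output take roughly 24 GB of RAM):

```python
import time
import numpy as np

N = 1_500_000_000                       # 1.5 billion uint64s, ~12 GB

# The actual bit pattern is irrelevant for the benchmark.
c = np.ones(N, dtype=np.uint64)

start = time.perf_counter()
c_inv = np.invert(c)                    # bitwise NOT over the whole array
elapsed = time.perf_counter() - start
print(f"numpy bitwise invert: {elapsed:.2f} s")
```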
When running this on a 12-core Intel(R) Xeon(R) CPU, it takes about 3.5 seconds. This is already quite nice thanks to numpy’s vectorization (in compiled C), but we can make this much faster, since this is an embarrassingly parallel task.
Let’s use PyCUDA as a Python wrapper around the CUDA C API to parallelize the bit-inversion over thousands of threads on an NVIDIA GPU. We can write the kernel in C++ as follows:
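A sketch of what the kernel and its compilation through PyCUDA could look like (the kernel and function names here are illustrative):

```python
import pycuda.autoinit                   # noqa: F401  (creates a CUDA context)
from pycuda.compiler import SourceModule

kernel_source = r"""
__global__ void invert(unsigned long long *data, const size_t n)
{
    // Global index of this thread along the x-dimension of the grid.
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;

    // One 64-bit word per thread: flip all of its bits in place.
    if (i < n) {
        data[i] = ~data[i];
    }
}
"""

module = SourceModule(kernel_source)
invert_kernel = module.get_function("invert")
```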
The kernel takes in our array, divides it along the x-dimension into “blocks” of “threads”, and defines the work per thread. CUDA organizes computation into a grid of blocks, each containing threads, which are the individual execution units. This is a software abstraction that directly maps to GPU hardware, with blocks assigned to Streaming Multiprocessors (SMs) and threads grouped into so-called “warps” of 32, which are the hardware execution units. Before continuing, it’s interesting to digress a bit about these warps.
It’s fairly easy to see that our problem is bandwidth-bound, not compute-bound. That is, inverting bits is an almost trivial computation, but transferring ~100 billion bits of data (~12.5 GB) back and forth between the GPU’s global memory and thread-local registers is much less trivial. A nice thing to note here is that the memory controller tries to minimise these data transactions by coalescing them for contiguous global memory accesses across the threads of a single warp, up to 128 bytes (see the CUDA docs here). This incentivises you to write the kernel in such a way that adjacent threads access adjacent locations in memory, and our trivial kernel is doing exactly that: a single warp processes 32 x 64 bits = 256 bytes of data, which it can fetch in only 256/128 = 2 requests!
Moving on, let’s run this on an NVIDIA A100-SXM4-40GB:
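A sketch of the launch and timing code, reusing invert_kernel from above (the block and grid sizes are assumptions):

```python
import numpy as np
import pycuda.autoinit                   # noqa: F401
import pycuda.driver as cuda

N = 1_500_000_000
c = np.ones(N, dtype=np.uint64)          # ~12 GB of host data

# Move the array into the GPU's global memory.
c_gpu = cuda.mem_alloc(c.nbytes)
cuda.memcpy_htod(c_gpu, c)

# One thread per 64-bit word, 1024 threads per block.
threads_per_block = 1024
blocks = (N + threads_per_block - 1) // threads_per_block

# Time the kernel with CUDA events, so we measure GPU time
# rather than Python call overhead.
start, end = cuda.Event(), cuda.Event()
start.record()
invert_kernel(c_gpu, np.uint64(N),
              block=(threads_per_block, 1, 1), grid=(blocks, 1))
end.record()
end.synchronize()
print(f"kernel time: {start.time_till(end):.1f} ms")

# Copy the inverted bits back to the host.
cuda.memcpy_dtoh(c, c_gpu)
```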
This takes only 17.4 milliseconds. The A100 has a memory bandwidth of around 1.6 TB/s, meaning the theoretical lower bound for just moving the data back and forth is around (12.5 GB x 2)/1.6 = 15.6 ms. Since we’re close enough to that now, I’m gonna call it a day. If you want to run this yourself, see here.