Future of GPU Deep Learning Performance: Memory Compression

Here’s a simple prediction: in the (near) future, GPUs will support lossless memory compression not just for framebuffers but for everything. This will deliver a fairly substantial performance boost for Deep Learning (not just for inference but for training as well!) and, to a lesser extent, for graphics.

Lossless Framebuffer Compression is one of the most important GPU HW optimisations for 3D graphics in recent years. It has also become an important part of fixed-function Deep Learning Inference accelerators like NVIDIA’s NVDLA.

And yet, it’s completely absent for Deep Learning on GPUs. This puts us in the bizarre situation on NVIDIA’s Xavier SoC where the dedicated Deep Learning Inference HW (NVDLA) supports compression but the GPU doesn’t, despite the GPU’s much higher overall performance level (and the GPU still ends up way faster in practice, as NVDLA isn’t fast enough to saturate the memory bandwidth for typical network architectures).

Interestingly, memory compression doesn’t just improve performance and save power by reducing external memory bandwidth, but also by reducing the amount of on-chip communication. This (surprisingly not so recent!) research paper from NVIDIA makes some interesting points on the subject: https://research.nvidia.com/sites/default/files/pubs/2015-05_Toggle-aware-compression-for/CAL-paper-final.pdf

NVIDIA’s cache hierarchy is already well suited to a 4:1 compression ratio by using 128B cache lines that can be partially filled by 32B memory bursts; presumably compressed framebuffer data means loading 32 to 96 bytes and decompressing them into a single 128B cache line. It’s not clear to me whether this happens at the level of the L1 or the L2, but given the ideas in the paper above, I expect that, at least in the future, it will be done at the level of the L1.
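To make that concrete, here’s a back-of-the-envelope sketch (my own assumptions, not a disclosed NVIDIA implementation) of how the 32B burst granularity quantises the achievable compression ratio per 128B line:

```python
# Assumed model: a compressed 128B line is fetched as 1 to 4 whole 32B bursts,
# so the effective ratios are quantised to 4:1, 2:1, 1.33:1 or uncompressed.
import math

CACHE_LINE = 128   # bytes per cache line
BURST      = 32    # bytes per memory burst

def bursts_needed(compressed_bytes: int) -> int:
    """Number of 32B bursts required to fetch one compressed 128B line."""
    return math.ceil(compressed_bytes / BURST)

for compressed in (20, 32, 50, 64, 100, 128):
    n = bursts_needed(compressed)
    ratio = CACHE_LINE / (n * BURST)
    print(f"{compressed:3d}B compressed -> {n} burst(s), effective ratio {ratio:.2f}:1")
```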

The problem with lossless compression of nearly any kind is that it increases the typical memory burst size, so it may actually waste more bandwidth than it saves for extremely sparse accesses. Even for a clever implementation, the worst case is 3x *higher* bandwidth: reading 96 bytes when you only need 32 bytes (or fewer). It may be possible to improve compression further with larger (256B+) block sizes, but that would make the worst case even worse.
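A quick illustration of that worst case, assuming the whole compressed block has to be fetched even when only one 32B burst worth of data is actually useful:

```python
# Illustrative worst-case arithmetic for sparse accesses (assumed model, not a
# specific hardware design): the full compressed block is transferred no matter
# how few of its bytes the access actually needed.
def bandwidth_amplification(block_bytes: int, useful_bytes: int) -> float:
    """Bytes actually transferred divided by bytes the access really needed."""
    return block_bytes / useful_bytes

# A 128B line compressed down to 96B, of which only 32B are needed:
print(bandwidth_amplification(96, 32))    # 3.0x, vs. 1.0x without compression
# Larger 256B blocks make the worst case worse, e.g. 224B fetched for 32B needed:
print(bandwidth_amplification(224, 32))   # 7.0x
```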

Framebuffer reads and writes are typically (not always!) very coherent. That’s obviously not always the case for general CUDA workloads. Thankfully it is typically true for Deep Learning! Still, we need a way to selectively enable/disable compression so it’s only used where the access patterns justify it.

The other complexity is that it’s potentially beneficial for the compression algorithm to know what kind of data it’s compressing. For example, when compressing FP32 data, it may be useful to shuffle the bits so that the mantissa and exponent are compressed separately; unfortunately, in the default FP32 representation, the exponent crosses a byte boundary, so 2 of the 4 bytes may be less efficiently compressed. Again, this depends on the kind of compression algorithm used, but for framebuffer compression the data format is typically known at memory allocation time, which greatly simplifies things.
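Here’s a minimal sketch of the idea in NumPy (just the split itself, not any specific hardware algorithm): separating the sign, exponent and mantissa streams lets the highly redundant exponents compress on their own instead of being smeared across two of the four bytes:

```python
import numpy as np

def split_fp32(x: np.ndarray):
    """Return (sign, exponent, mantissa) integer streams for an FP32 array."""
    bits = x.astype(np.float32).view(np.uint32)
    sign     = bits >> 31             # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, straddles a byte boundary in memory
    mantissa = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, mantissa

# Trained weights usually have a narrow dynamic range, so the exponent stream
# has very low entropy and compresses far better than the raw interleaved bytes.
weights = np.random.normal(0, 0.02, size=1024).astype(np.float32)
_, exponent, _ = split_fp32(weights)
print("distinct exponent values:", len(np.unique(exponent)), "out of", weights.size)
```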

GPUs typically use 64KB memory pages, and some part of virtual memory allocation is done at an even larger granularity. So it may be possible to dynamically track access patterns at that granularity, but honestly, it doesn’t feel like it would be worth the cost. How this should work is a bit of an open question, but it’s certainly not insurmountable; it just goes a fair bit beyond the scope of a (“short”) blog post. In the end it may not matter for CUDA and cuDNN, as NVIDIA controls the API and can shift the burden to developers, but for other vendors and for graphics, it’s an interesting problem.
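For what it’s worth, here’s a purely hypothetical software model of what per-page tracking could look like (the counters and threshold are my own invention, not any real design):

```python
# Hypothetical heuristic: count full-line vs. partial accesses per 64KB page
# and only treat a page as compressible when accesses are overwhelmingly coherent.
from collections import defaultdict

PAGE_SIZE  = 64 * 1024   # bytes, typical GPU page size
CACHE_LINE = 128         # bytes
THRESHOLD  = 0.9         # made-up tuning parameter

class PageCompressionHeuristic:
    def __init__(self):
        self.full = defaultdict(int)     # full-line (coherent) accesses per page
        self.partial = defaultdict(int)  # partial/sparse accesses per page

    def record_access(self, address: int, size: int) -> None:
        page = address // PAGE_SIZE
        if address % CACHE_LINE == 0 and size >= CACHE_LINE:
            self.full[page] += 1
        else:
            self.partial[page] += 1

    def should_compress(self, address: int) -> bool:
        page = address // PAGE_SIZE
        total = self.full[page] + self.partial[page]
        return total > 0 and self.full[page] / total >= THRESHOLD
```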

At the other extreme, instead of reducing external memory bandwidth with compression, you can “just” eliminate most of it completely by keeping the weights on-chip. The extreme example is Graphcore’s HW architecture (which is worthy of a much longer analysis of its own), but even NVIDIA supports this for WaveNet (manually optimised) and automatically in cuDNN for smaller networks. Unfortunately, that approach doesn’t really benefit from lossless compression: the data cannot stay compressed in the register file (which is much bigger than the L2 cache on GPUs!), and even if it could, lossless compression only improves bandwidth, not storage capacity, as there’s no guarantee of the compression ratio.
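Some illustrative arithmetic (V100-like figures and a made-up small WaveNet-style layer stack, purely as assumptions) showing why keeping the weights in the register file is plausible for small networks:

```python
# Assumed V100-like GPU: 80 SMs, each with 64K 32-bit registers, 6MB of L2.
NUM_SMS         = 80
RF_BYTES_PER_SM = 256 * 1024
L2_BYTES        = 6 * 1024 * 1024

register_file_bytes = NUM_SMS * RF_BYTES_PER_SM
print(f"register file: {register_file_bytes / 2**20:.0f} MiB, L2: {L2_BYTES / 2**20:.0f} MiB")

# Made-up small network: 30 layers of 512x512 FP16 weight matrices.
layers, width, bytes_per_weight = 30, 512, 2
weight_bytes = layers * width * width * bytes_per_weight
print(f"weights: {weight_bytes / 2**20:.1f} MiB -> fit on-chip: {weight_bytes < register_file_bytes}")
```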

This is especially important as HBM3 isn’t due out for a little while, and HBM in general remains expensive, so just adding more stacks isn’t very helpful. And as Moore’s Law slows down (but isn’t quite dead yet!), this kind of efficiency improvement is becoming more and more important.

Fundamentally, memory bandwidth is a critical part of Deep Learning performance and power cost, so it’s hard to see how many of the Deep Learning start-ups could significantly beat GPUs unless they also innovate on the memory front. Many don’t, but thankfully some of them do, and it will be exciting to see what kind of architectural innovation happens in the next few years.