Future of GPU Deep Learning Performance: Memory Compression

Here’s a simple prediction: in the (near) future, GPUs will support lossless memory compression not just for framebuffers but for everything. This will deliver a fairly substantial performance boost for Deep Learning (and not just for inference but for training as well!) and, to a lesser extent, for graphics.

Lossless Framebuffer Compression is one of the most important GPU HW optimisations for 3D graphics in recent years. It has also become an important part of fixed-function Deep Learning Inference accelerators like NVIDIA’s NVDLA.

And yet, it’s completely absent for Deep Learning on GPUs. This puts us in the bizarre situation on NVIDIA’s Xavier SoC where the dedicated Deep Learning Inference HW (NVDLA) supports compression but the GPU doesn’t, despite the GPU’s much higher overall performance level (and the GPU still ends up way faster in practice, as NVDLA isn’t fast enough to saturate memory bandwidth for typical network architectures).

Interestingly, memory compression doesn’t just improve performance and save power by reducing external memory bandwidth, but also by reducing the amount of on-chip communication. This (surprisingly not so recent!) research paper from NVIDIA makes some interesting points on the subject: https://research.nvidia.com/sites/default/files/pubs/2015-05_Toggle-aware-compression-for/CAL-paper-final.pdf

NVIDIA’s cache hierarchy is already well suited to a 4:1 compression ratio by using 128B cache lines that can be partially filled by 32B memory bursts; presumably compressed framebuffer data means loading 32 to 96 bytes and decompressing it into a single 128B cache line. It’s not clear to me whether this happens at the level of the L1 or the L2, but given the ideas in the paper above, I expect it will (at least in the future) be done at the level of the L1.
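
To make that concrete, here’s a toy model (my reading of how burst-granularity compression would work, not a documented NVIDIA mechanism) of how a compressed 128B line maps onto 32B bursts and what effective compression ratio you get:

```python
# Toy model: a 128B cache line is stored as however many 32B memory bursts
# its compressed size rounds up to (an assumption, not a spec).
LINE_BYTES = 128
BURST_BYTES = 32

def bursts_needed(compressed_bytes: int) -> int:
    """Number of 32B bursts required to fetch one compressed 128B line."""
    assert 0 < compressed_bytes <= LINE_BYTES
    return -(-compressed_bytes // BURST_BYTES)  # ceiling division

for size in (20, 48, 70, 100, 128):
    n = bursts_needed(size)
    ratio = LINE_BYTES / (n * BURST_BYTES)
    print(f"compresses to {size:3d}B -> {n} burst(s) = {n * BURST_BYTES:3d}B, "
          f"effective ratio {ratio:.2f}:1")
```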

The problem with lossless compression of nearly any kind is that it increases the effective memory access granularity, so it may actually waste more bandwidth than it saves for extremely sparse accesses. Even for a clever implementation, the worst case is 3x *higher* bandwidth: reading 96 bytes when you only need 32 bytes (or less). It may be possible to improve compression further with larger (256B+) block sizes, but that would make the worst case even worse.
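
A quick back-of-the-envelope check of that worst case, using the same hypothetical 32B-burst model as above:

```python
# Sparse access: we only need one 32B sector, but the line is stored compressed.
BURST_BYTES = 32

def compressed_fetch_bytes(compressed_line_bytes: int) -> int:
    """The whole compressed line (rounded up to 32B bursts) must be fetched
    and decompressed, even if only one 32B sector is actually needed."""
    bursts = -(-compressed_line_bytes // BURST_BYTES)
    return bursts * BURST_BYTES

needed = BURST_BYTES                  # 32B: the sector we actually want
fetched = compressed_fetch_bytes(96)  # a line that only compresses down to 96B
print(fetched / needed)               # 3.0 -> 3x the bandwidth the access needed
```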

Framebuffer reads and writes are typically (not always!) very coherent. That’s obviously not always the case for general CUDA workloads. Thankfully, it is typically true for Deep Learning! Still, we need a way to selectively enable compression so it’s only used where it actually helps.

The other complexity is that it’s potentially beneficial for the compression algorithm to know what kind of data it’s compressing. For example, when compressing FP32 data, it may be useful to shuffle the bits so that the mantissa and exponent are compressed separately; unfortunately, in the default FP32 representation, the exponent crosses a byte boundary, so 2 of the 4 bytes may be less efficiently compressed. Again, this depends on the kind of compression algorithm used, but for framebuffer compression the data format is typically known at memory allocation time, which greatly simplifies things.
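
As a hedged illustration of this idea (a generic field-shuffling sketch, not any shipping compressor), the snippet below splits FP32 values into a sign+exponent stream and a mantissa stream before handing them to a byte-oriented compressor; zlib and the synthetic data are just stand-ins to show why the default byte layout is awkward:

```python
import zlib
import numpy as np

def shuffle_fp32_fields(values: np.ndarray) -> bytes:
    """Separate sign+exponent (top 9 bits) from mantissa (low 23 bits), so each
    stream is more self-similar than the raw little-endian byte layout, where
    the exponent straddles the boundary between bytes 2 and 3."""
    bits = values.astype(np.float32).view(np.uint32)
    sign_exp = (bits >> 23).astype(np.uint16)         # 9 bits: sign + exponent
    mantissa = (bits & 0x007FFFFF).astype(np.uint32)  # 23 bits: mantissa
    return sign_exp.tobytes() + mantissa.tobytes()

# Smoothly varying data, loosely mimicking a coherent tile of activations.
data = np.linspace(0.5, 1.5, 4096, dtype=np.float32)

raw      = len(zlib.compress(data.tobytes()))
shuffled = len(zlib.compress(shuffle_fp32_fields(data)))
print(f"raw layout: {raw} bytes, field-shuffled: {shuffled} bytes")
```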

GPUs typically use 64KB memory pages, and some parts of virtual memory allocation are done at an even larger granularity. It may be possible to dynamically track access patterns at that granularity, but honestly, it doesn’t feel like it would be worth the cost. So it’s a bit of an open question how this would work, but it’s certainly not insurmountable; it also goes a fair bit beyond the scope of a (“short”) blog post. In the end it may not matter for CUDA and cuDNN, as NVIDIA controls the API and can shift the burden to developers, but for other vendors and for graphics it’s an interesting problem.

At the other extreme, instead of reducing external memory bandwidth with compression, you can “just” eliminate most of it completely by keeping the weights on-chip. The extreme example is Graphcore’s HW architecture (which is worthy of a much longer analysis of its own), but even NVIDIA supports this for WaveNet (manually optimised) and automatically in cuDNN for smaller networks. Unfortunately, that approach doesn’t really benefit from lossless compression: the data cannot stay compressed in the register file (which is much bigger than the L2 cache on GPUs!), and even if it could, lossless compression only improves bandwidth, not storage capacity, as there’s no guarantee of the compression ratio.

This is especially important as HBM3 isn’t due out for a little while, and HBM in general remains expensive, so just adding more stacks isn’t very attractive. And as Moore’s Law slows down (but isn’t quite dead yet!), this kind of efficiency improvement is becoming more and more important.

Fundamentally, memory bandwidth is a critical part of Deep Learning performance and power cost, so it’s hard to see how many of the Deep Learning start-ups could significantly beat GPUs unless they also innovate on the memory front. Many don’t, but thankfully some of them do, and it will be exciting to see what kind of architectural innovation happens in the next few years.

 

Vector vs SIMD: Dynamic Power Efficiency

For the purposes of this post, here are a few definitions (feel free to disagree; I don’t like this nomenclature myself, but it seemed easiest for the discussion ahead):

  • “SIMD” means the “vector length” of the instruction set is fixed and is the same as the width of the ALU, e.g. Intel SSE (not quite the same as SIMT on GPUs).
  • “Vector” means the “vector length” is configurable and typically greater than the width of the ALU, e.g. a 1024-wide vector executed on a 16-wide ALU over 64 cycles (see the toy sketch below).
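
A toy sketch of the second definition, using the numbers from the example above (this only models the execution schedule, not any real ISA):

```python
import numpy as np

ALU_WIDTH = 16  # physical lanes

def vector_add(a: np.ndarray, b: np.ndarray, vector_length: int):
    """One 'vector add' instruction of configurable length, issued once but
    executed over ceil(vector_length / ALU_WIDTH) passes of a 16-wide ALU."""
    out = np.empty(vector_length, dtype=a.dtype)
    cycles = 0
    for start in range(0, vector_length, ALU_WIDTH):
        end = min(start + ALU_WIDTH, vector_length)
        out[start:end] = a[start:end] + b[start:end]  # one ALU pass per cycle
        cycles += 1
    return out, cycles

a = np.arange(1024, dtype=np.float32)
b = np.ones(1024, dtype=np.float32)
_, cycles = vector_add(a, b, vector_length=1024)
print(cycles)  # 64 cycles for one instruction; a fixed-width SIMD ISA would
               # instead issue 64 separate 16-wide instructions
```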

There has been renewed interest in Vector Processing recently, thanks to David Patterson (et al.) re-introducing it to the mainstream with the upcoming RISC-V Vector Extension (RVV) and previously arguing against existing SIMD instruction sets (“SIMD Instructions Considered Harmful”).

Unfortunately, all the discussion on SIMD vs Vector seems focused on instruction set complexity and overhead rather than on area or power efficiency. The implication is that vector could provide better area/power efficiency by having lower overhead, but for very wide ALUs (e.g. GPUs or Intel AVX-512) that overhead is amortised very effectively, so the difference is going to be pretty small.

So what about power efficiency? For massively parallel workloads where we can make use of wider vectors (like graphics or deep learning), is Vector Processing more efficient than SIMD, or worse?

As with many things in microarchitecture design, I believe the biggest difference isn’t the high-level design choice itself, but rather which low-level optimisations are possible given that high-level choice (and, of those low-level optimisations, how many you actually discover and have the time to implement in the final product!)

In my mind there are 3 key factors:

  1. Better clock gating with vector processing, as the workload is more predictable (e.g. this paper, although most of the techniques also work on non-vector architectures; the major exception is “ScalarCG”, which only makes sense for vector processing).
  2. Lower switching activity with vector processing (and therefore lower dynamic power), as different vector elements for the same instruction will typically have much more similar values than 2 different instructions operating on the same vector elements. The more similar the values, the fewer gates switch, and the lower the power (see the toy sketch after this list). This excellent and intriguing paper provides some hard data for NVIDIA GPUs.
  3. Register file access optimisations that are possible with SIMD but not with a fully configurable vector length. Especially for smaller data types (e.g. FP32/FP16 rather than FP64), a very high percentage of total power is due to the register file rather than the ALU itself. See this NVIDIA paper from 2011 and this one from 2012 for details, although they ended up implementing something quite different and much simpler.
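
Here’s the toy sketch referenced in (2): a crude proxy for switching activity that counts how many bits flip between consecutive values presented to the same datapath. The two distributions below are made up; the only point is that similar values flip fewer bits:

```python
import numpy as np

def bit_toggles(stream: np.ndarray) -> int:
    """Rough proxy for dynamic power: total bits that flip between
    consecutive FP32 values fed to the same ALU lane/bus."""
    bits = stream.astype(np.float32).view(np.uint32)
    flips = np.bitwise_xor(bits[1:], bits[:-1])
    return int(sum(bin(x).count("1") for x in flips))

rng = np.random.default_rng(0)
# Elements of one vector instruction (e.g. neighbouring activations): similar values.
same_instruction = rng.normal(1.0, 0.01, 1024)
# Operands of unrelated back-to-back instructions: dissimilar values.
unrelated_instructions = rng.normal(0.0, 100.0, 1024)

print(bit_toggles(same_instruction), bit_toggles(unrelated_instructions))
```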

(2) happens automatically on vector processors, while (3) is a specific optimisation which hasn’t historically been done in SIMD designs but is becoming increasingly important in modern GPUs (e.g. NVIDIA’s operand cache). It is typically implemented with hardware-specific instruction set bits, but could be done in a more limited way with existing ISAs.
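
As a rough toy model of what an operand cache buys you (the tiny LRU structure below is my own simplification, not NVIDIA’s actual design), count register-file reads for a short dependent instruction sequence with and without a small cache of recently accessed operands:

```python
from collections import OrderedDict

def register_file_reads(instructions, operand_cache_entries=0):
    """Count register-file reads for a stream of (dst, src1, src2) instructions,
    assuming a small LRU operand cache sits in front of the register file."""
    cache = OrderedDict()  # register name -> cached entry (LRU order)
    rf_reads = 0
    for dst, src1, src2 in instructions:
        for reg in (src1, src2):
            if reg in cache:
                cache.move_to_end(reg)         # hit: no register-file access
            else:
                rf_reads += 1                  # miss: read from the register file
                cache[reg] = None
            while len(cache) > operand_cache_entries:
                cache.popitem(last=False)      # evict least recently used operand
        cache[dst] = None                      # results are forwarded/cached too
        cache.move_to_end(dst)
        while len(cache) > operand_cache_entries:
            cache.popitem(last=False)
    return rf_reads

# A dependent chain (e.g. a multiply-add sequence) reuses its operands heavily.
stream = [("r2", "r0", "r1"), ("r3", "r2", "r1"), ("r4", "r3", "r1"), ("r5", "r4", "r1")]
print(register_file_reads(stream, operand_cache_entries=0),  # no operand cache: 8 reads
      register_file_reads(stream, operand_cache_entries=3))  # 3-entry cache: 2 reads
```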

So for a naive design without many low-level optimisations, it seems likely that vector processing is more power efficient than SIMD, but whether it’s more efficient for a highly optimised design is hard to tell without being able to compare these trade-offs directly. Sadly, it’s nearly impossible to get that kind of data unless you’re working at a leading-edge semiconductor design centre, and if you are, then it’s nearly impossible to publish it.

Thankfully it doesn’t have to be all-or-nothing! While the fully configurable vector length of RVV prevents certain optimisations like (3), many of the benefits of vector processing are indirectly due to executing a single instruction over multiple cycles. So a “hybrid” architecture where the maximum vector length is roughly ~4x the native ALU width should be able to get most of the benefits of both worlds (or at least somewhere between ~2x and ~8x).

Interestingly, that’s what modern GPUs already do, at least for NVIDIA/AMD/Apple! The “vector length” is not configurable, but a 32-wide instruction will run over 2 cycles on 16-wide ALUs for NVIDIA/Apple, and a 64-wide instruction will run over 4 cycles on 16-wide ALUs for AMD. AMD already has “scalar registers”, and NVIDIA has only just introduced them in Turing (with a slightly different set of trade-offs), which should allow more optimisations in (1) as well, even if the benefit is smaller when only amortised over 2 cycles.

The main reason for all this is to reduce the area/power cost of the register file (which goes way beyond the scope of this post; hopefully something to write about in the future!), but it’s interesting that it also grants these GPUs many of the benefits of traditional vector architectures. And GPUs almost certainly aren’t the only architecture where one vector instruction may be executed over 2 or 4 cycles, although I’m not aware of any other architecture with a fixed vector length that is larger than the native ALU width (tips welcome, I’d be very interested to learn about them if they exist!)

In conclusion, I strongly suspect the reason so many GPU architectures have converged on this design is that it is the most power-efficient trade-off (at least for current GPU workloads; not necessarily in the general case), and that when combined with (3) and other low-level tricks, it is more efficient than either a SIMD processor or a vector processor with a fully configurable vector length could ever be (for simplicity, I’m disregarding the differences between SIMT and SIMD for this comparison).

How much more efficient? Probably not very much, to be honest. The leading edge of microarchitecture design isn’t about getting a ~2x efficiency advantage from any single thing; it’s about getting a bit of efficiency here, a bit more over there, etc… and then, when all is said and done, you might end up with a ~2x efficiency advantage over the baseline (if you did a great job, or more likely, if the baseline is unrealistically bad).

But if you don’t have the design budget of these huge companies and/or the ability to ignore binary compatibility, then Vector Processing architectures (like both RVV and the Hwacha accelerator, also from Berkeley) might not just be the simpler designs; they might also be the most power efficient (for highly parallel workloads).

Hopefully in the next few years, there will be competitive silicon implementations of the RISC-V Vector Extension which would allow us to evaluate its real-world efficiency against other CPU SIMD extensions!

P.S.: I really like both the RISC-V and the Vector Extension instruction sets. I think they are very elegant and probably as good as it gets for an ISA that needs to be binary compatible for many different cores (i.e. like CPUs and unlike GPUs where the driver provides an extra level of abstraction). However, I still think that binary compatible ISAs must necessarily leave a lot of efficiency on the table compared to HW-specific instruction sets.