Vector vs SIMD: Dynamic Power Efficiency

For the purposes of this post, here are a few definitions (feel free to disagree; I don’t like this nomenclature myself, but it seemed easiest for the discussion ahead):

  • “SIMD” means the “vector length” of the instruction set is fixed and is the same as the width of the ALU, e.g. Intel SSE (not quite the same as SIMT on GPUs).
  • “Vector” means the “vector length” is configurable but typically greater than the width of the ALU, e.g. a 1024-wide vector executed on a 16-wide ALU over 64 cycles (see the sketch after this list).
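
To make the distinction concrete, here is a toy Python sketch of the two execution models. It is purely illustrative; the 16-wide ALU and the names `simd_add`/`vector_add` are my own assumptions, not any particular hardware:

```python
ALU_WIDTH = 16  # number of lanes in the physical ALU

def simd_add(a, b):
    """'SIMD': the instruction operates on exactly ALU_WIDTH elements (one beat of work)."""
    assert len(a) == len(b) == ALU_WIDTH
    return [x + y for x, y in zip(a, b)]

def vector_add(a, b, vl):
    """'Vector': one instruction with a configurable vector length vl,
    executed over ceil(vl / ALU_WIDTH) cycles on the same 16-wide ALU."""
    result, cycles = [], 0
    for start in range(0, vl, ALU_WIDTH):  # one 'beat' of the ALU per iteration
        result += [x + y for x, y in zip(a[start:start + ALU_WIDTH],
                                         b[start:start + ALU_WIDTH])]
        cycles += 1
    return result, cycles

# e.g. a 1024-wide vector add is issued once but occupies the ALU for 64 cycles:
a = list(range(1024))
b = list(range(1024))
_, cycles = vector_add(a, b, vl=1024)
print(cycles)  # 64
```

The key point is that the vector version issues a single instruction whose work is spread over many ALU cycles, which is what most of the discussion below relies on.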

There has been renewed interest in Vector Processing recently, thanks to David Patterson (et al.) re-introducing it to the mainstream in the upcoming RISC-V Vector Extension (RVV) and previously arguing against existing SIMD instruction sets (“SIMD Instructions Considered Harmful”).

Unfortunately, all the discussion on SIMD vs Vector seems focused on instruction set complexity and overhead rather than area or power efficiency. The implication is that vector could provide better area/power efficiency by having lower overhead, but for very wide ALUs (e.g. GPUs or Intel AVX-512) that overhead is amortised very effectively, so the difference is going to be pretty small.

So what about power efficiency? For massively parallel workloads where we can make use of wider vectors (like graphics or deep learning), is Vector Processing more efficient than SIMD, or worse?

Like many things in microarchitecture design, I believe the biggest difference isn’t the high-level design choice itself, but rather which low-level optimisations are possible given that high-level choice (and, of those low-level optimisations, how many you actually discover and have the time to implement in the final product!).

In my mind there are 3 key factors:

  1. Better clock gating with vector processing, as the workload is more predictable (e.g. this paper, although most of the techniques also work on non-vector architectures; the major exception is “ScalarCG”, which only makes sense for vector processing).
  2. Lower switching activity with vector processing (-> lower dynamic power), as different vector elements processed by the same instruction will typically have much more similar values than the operands of two different instructions for the same vector elements. The more similar the values, the fewer gates switch, and the lower the power (see the sketch after this list). This excellent and intriguing paper provides some hard data for NVIDIA GPUs.
  3. Unlike SIMD, it’s impossible to optimise register file accesses with vector processing (at least with a fully configurable vector length). Especially for smaller data types (e.g. FP32/FP16 rather than FP64), a very high percentage of total power is due to the register file, rather than the ALU itself. See this NVIDIA paper from 2011 and this one from 2012 for details, although they ended up implementing something quite different and much simpler.
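
As a rough illustration of (2), here is a toy sketch that uses the popcount of the XOR between successive operand values as a crude proxy for switching activity; the values and the proxy are my own assumptions, not data from the papers cited above:

```python
import random

def toggled_bits(values):
    """Count how many bits flip on an operand bus between successive values
    (a crude proxy for dynamic switching power)."""
    return sum(bin(prev ^ cur).count("1") for prev, cur in zip(values, values[1:]))

# Same instruction, neighbouring vector elements: values tend to be correlated
# (e.g. consecutive addresses or neighbouring pixels).
same_instruction = [0x10000000 + 4 * i for i in range(16)]

# Operands of unrelated instructions interleaved: values are essentially uncorrelated.
random.seed(0)
mixed_instructions = [random.getrandbits(32) for _ in range(16)]

print(toggled_bits(same_instruction))    # only a handful of bit flips per transition
print(toggled_bits(mixed_instructions))  # ~16 flips per 32-bit transition on average
```

The correlated stream flips far fewer bits per transition than the uncorrelated one, which is the effect behind “fewer gates switch, lower power”.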

(2) happens automatically on vector processors while (3) is a specific optimisation which hasn’t historically been done in SIMD designs, but is becoming increasingly important in modern GPUs (e.g. NVIDIA’s operand cache). It is typically implemented with hardware-specific instruction set bits, but could be done in a more limited way with existing ISAs.
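
To show the kind of saving an operand cache is chasing, here is a minimal toy model (my own sketch, not NVIDIA’s actual design) that counts main register file reads when a small buffer holds recently read operands:

```python
class OperandCache:
    """Tiny LRU-style buffer of recently read operands (purely illustrative)."""

    def __init__(self, entries=4):
        self.entries = entries
        self.cache = []      # most recently read register names
        self.rf_reads = 0    # expensive main register file accesses

    def read(self, reg):
        if reg in self.cache:
            self.cache.remove(reg)   # hit: no register file access needed
        else:
            self.rf_reads += 1       # miss: pay for a register file read
            if len(self.cache) >= self.entries:
                self.cache.pop(0)    # evict the least recently used operand
        self.cache.append(reg)       # keep the operand close to the ALU

# A short FMA chain keeps reusing r0 (accumulator) and r1 (coefficient):
oc = OperandCache()
for regs in [("r0", "r1", "r2"), ("r0", "r1", "r3"), ("r0", "r1", "r4")]:
    for reg in regs:
        oc.read(reg)
print(oc.rf_reads)  # 5 register file reads instead of 9 without the cache
```

In real designs the decision of what to keep close to the ALU is typically made by the compiler through the hardware-specific instruction bits mentioned above, rather than by an LRU policy like this one.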

So for a naive design without many low-level optimisations, it seems likely that vector processing is more power efficient than SIMD, but whether it’s more efficient for a highly optimised design is hard to tell without being able to compare these trade-offs. Sadly, it’s nearly impossible to get that kind of data unless you’re working at a leading-edge semiconductor design centre, and if you are, then it would be nearly impossible to publish it.

Thankfully, it doesn’t have to be all-or-nothing! While the fully configurable vector length of RVV prevents certain optimisations like (3), many of the benefits of vector processing are indirectly due to executing a single instruction over multiple cycles. So a “hybrid” architecture where the maximum vector length is roughly ~4x the native ALU width (or anywhere from ~2x to ~8x) should be able to get most of the benefits of both worlds.

Interestingly, that’s what modern GPUs already do, at least for NVIDIA/AMD/Apple! The “vector length” is not configurable, but a 32-wide instruction will run over 2 cycles on 16-wide ALUs for NVIDIA/Apple, and a 64-wide instruction will run over 4 cycles for AMD. AMD already has “scalar registers”, and NVIDIA has only just introduced them in Turing (with a slightly different set of trade-offs), which should allow more optimisations in (1) as well, even if the benefit is smaller when only amortised over 2 cycles.
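
As a back-of-the-envelope illustration of why scalar registers help, here is a tiny sketch (my own simplified numbers, not AMD’s or NVIDIA’s actual implementation) of the register file bits touched when a lane-invariant value is stored once instead of once per lane:

```python
WARP_WIDTH = 32        # lanes sharing one instruction (64 for an AMD wavefront)
BITS_PER_ELEMENT = 32  # e.g. an FP32 value or a 32-bit base address

# Stored in a vector register, a lane-invariant value (e.g. a loop-invariant
# base pointer) occupies one slot per lane and every lane reads its own copy.
vector_rf_bits_per_access = WARP_WIDTH * BITS_PER_ELEMENT   # 1024 bits

# Stored in a scalar register, it is kept and read exactly once, then broadcast.
scalar_rf_bits_per_access = BITS_PER_ELEMENT                # 32 bits

print(vector_rf_bits_per_access // scalar_rf_bits_per_access)  # 32x fewer bits touched
```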

The main reason for all this is to reduce the area/power cost of the register file (which goes way beyond the scope of this post – hopefully something to write about in the future!), but it’s interesting that it also grants them many of the benefits of traditional vector architectures. And GPUs almost certainly aren’t the only architecture where one vector instruction may be executed over 2 or 4 cycles, although I’m not aware of any other architecture with a fixed vector length that is larger than the native ALU width (tips welcome, I’d be very interested to learn about them if they exist!).

In conclusion, I strongly suspect the reason so many GPU architectures have converged on this design is that it is the most power-efficient trade-off (at least for current GPU workloads; not necessarily in the general case), and that when combined with (3) and other low-level tricks, it is more efficient than either a SIMD processor or a vector processor with a fully configurable vector length could ever be (for simplicity, I’m disregarding the differences between SIMT and SIMD for this comparison).

How much more efficient? Probably not very much, to be honest. The leading edge of microarchitecture design isn’t about getting a ~2x efficiency advantage from any single technique; it’s about getting a bit of efficiency here, a bit more over there, etc… and then when all is said and done, you might end up with a ~2x efficiency advantage over the baseline (if you did a great job, or more likely, if the baseline is unrealistically bad).

But if you don’t have the design budget of these huge companies and/or the ability to ignore binary compatibility, then Vector Processing architectures (like both RVV and the Hwacha accelerator, also from Berkeley) might not just be the simpler designs; they might also be the most power efficient (for highly parallel workloads).

Hopefully in the next few years, there will be competitive silicon implementations of the RISC-V Vector Extension which would allow us to evaluate its real-world efficiency against other CPU SIMD extensions!

P.S.: I really like both the RISC-V and the Vector Extension instruction sets. I think they are very elegant and probably as good as it gets for an ISA that needs to be binary compatible for many different cores (i.e. like CPUs and unlike GPUs where the driver provides an extra level of abstraction). However, I still think that binary compatible ISAs must necessarily leave a lot of efficiency on the table compared to HW-specific instruction sets.