Writing code with SIMD for vectorization is painful. It deserves a blog series to record all sorts of pains I have encountered and (partially) overcome.

Admittedly, once the pain of coding and debugging is over, the program runs lightning fast. Nonetheless, I am here to complain rather than to praise. Let me state why writing SIMD code causes me emotional damage:

  • a single line of normal C++ code can easily inflate into a dozen lines of intrinsics.
  • when the code has data dependencies across loop iterations, SIMD hits me right in the forehead and gives me a massive headache (when debugging).
  • using SIMD requires low-level, C-style coding.
  • SIMD intrinsics are often not compatible across platforms, or even across CPU models of the same platform.
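
To make the first and last complaints concrete, here is a minimal sketch of an element-wise add. The function names add_scalar and add_simd are made up for this illustration, and the 128-bit SSE and NEON paths are simply the ones most compilers enable by default, not a recommendation.

```cpp
#include <cstddef>

#if defined(__SSE2__)
#include <immintrin.h>   // x86 (Intel/AMD) intrinsics
#elif defined(__ARM_NEON)
#include <arm_neon.h>    // ARM NEON intrinsics
#endif

// The "normal" version: the whole computation is one line in the loop body.
void add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// The SIMD version: the same one-liner, inflated and duplicated per platform.
void add_simd(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {               // 4 floats per 128-bit vec
        __m128 va = _mm_loadu_ps(a + i);       // unaligned load
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_storeu_ps(c + i, vc);              // unaligned store
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {               // same idea, different names
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vaddq_f32(va, vb);
        vst1q_f32(c + i, vc);
    }
#endif
    // Scalar tail for the leftover elements (and the whole loop when
    // neither instruction set is available at compile time).
    for (; i < n; ++i) c[i] = a[i] + b[i];
}
```

Both functions produce the same output; the second one just needs a load, an add, a store, a tail loop, and a separate copy of all that for each platform.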

SIMD intrinsics are available on ARM, Intel, AMD and Nvidia chips, but GPU/CUDA is a different genre of SIMD programming, so I will not discuss it here. AMD, for the x86 architecture, offers essentially the same intrinsic sets as Intel does. Thus, only the ARM and Intel intrinsics really concern us.

Notation

Before going any further, I would first like to clarify the term “vector” as used in this blog, which unfortunately names two distinct things.

  • vector in C++ is a variable-size container holding dynamically allocated data on the heap.
  • vector in SIMD is a fixed-size data type held in a register, with sizes such as 128, 256, or 512 bits.

In what follows, vector refers to the container, and vec denotes the data in a register.
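
A small sketch of the two side by side may help. The alias vec and the helpers load_vec, store_vec and copy_first_four are hypothetical names chosen for this illustration, and the non-SSE branch is only a stand-in struct so that the snippet compiles everywhere.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

#if defined(__SSE2__)
#include <immintrin.h>
using vec = __m128;                        // 128-bit register type: 4 floats
inline vec load_vec(const float* p) { return _mm_loadu_ps(p); }
inline void store_vec(float* p, vec v) { _mm_storeu_ps(p, v); }
#else
struct vec { float lane[4]; };             // hypothetical stand-in, not a real register
inline vec load_vec(const float* p) {
    vec v;
    std::memcpy(v.lane, p, sizeof v.lane);
    return v;
}
inline void store_vec(float* p, vec v) { std::memcpy(p, v.lane, sizeof v.lane); }
#endif

// A vec has a fixed size known at compile time ...
static_assert(sizeof(vec) == 16, "a 128-bit vec is always 16 bytes");

// ... while a vector has a run-time size and lives on the heap.
// This copies the first four elements of src into dst through one vec.
inline void copy_first_four(const std::vector<float>& src, std::vector<float>& dst) {
    if (src.size() < 4) return;            // not enough data to fill a vec
    if (dst.size() < 4) dst.resize(4);     // a vector can grow, a vec cannot
    vec v = load_vec(src.data());
    store_vec(dst.data(), v);
}
```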

Intel Intrinsics

Let’s talk about Intel first, the (aging) boss of CPUs.

Intel provides a number of intrinsic sets, such as SSE, AVX2, …, AVX512. I only use AVX512 because it is the newest and widest set.

new means that AVX512 has things that older sets do not have, for instance the scatter operation (its counterpart, gather, already appeared in AVX2). The two counterparts are very useful, as discussed further in another log.

wide means that an AVX512 vec is 512 bits wide. It is not obvious to see from the name at all.
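
As a rough sketch of what a 512-bit gather looks like, the function below reads out[i] = table[idx[i]] sixteen floats at a time. The name gather_floats is made up for this illustration, and I am assuming the code is compiled with AVX-512F enabled (for example with -mavx512f on GCC or Clang); without it, only the scalar loop is compiled.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// out[i] = table[idx[i]] for i in [0, n).
// The caller must make sure every idx[i] is a valid index into table.
void gather_floats(const float* table, const std::int32_t* idx,
                   float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX512F__)
    static_assert(sizeof(__m512) == 64, "512 bits = 64 bytes = 16 floats");
    for (; i + 16 <= n; i += 16) {
        // Load 16 32-bit indices, then fetch the 16 floats they point at.
        __m512i vidx = _mm512_loadu_si512(idx + i);
        __m512 v = _mm512_i32gather_ps(vidx, table, 4);  // scale 4 = sizeof(float)
        _mm512_storeu_ps(out + i, v);
    }
#endif
    // Scalar tail, and the only path when AVX-512F is not enabled.
    for (; i < n; ++i) out[i] = table[idx[i]];
}
```

A scatter is the mirror image: the indices select where each lane is written instead of where it is read from.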

ARM Intrinsics

The computing capacity of an ARM chip is weaker than that of an Intel chip – I derive this personal and irresponsible conclusion from the fact that the SIMD width of ARM (NEON) is merely 128 bits, and sometimes even 64 bits. Why so short? My guess is that ARM prefers 8-bit or 16-bit data types, sacrificing a little precision for efficiency, which makes the shorter vec more reasonable.

Another shortcoming of ARM intrinsics is the lack of masked operations. It happens all the time that the input data does not fit exactly into a SIMD vec, or that I need only a portion of the data. The masks in Intel intrinsics make it easy to extract or fill such a partially used vec. For ARM, sorry, we have to find alternative solutions, as described in this [log]
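
For reference, this is roughly what the Intel side looks like when the array length is not a multiple of 16. It is a minimal sketch with a made-up function name, scale_by_two, assuming AVX-512F is enabled at compile time; it is not the ARM workaround from the log, only the masked load/store that ARM lacks.

```cpp
#include <cstddef>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Doubles every element of x in place.
void scale_by_two(float* x, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX512F__)
    const __m512 two = _mm512_set1_ps(2.0f);
    for (; i + 16 <= n; i += 16) {             // full 16-float vecs
        __m512 v = _mm512_loadu_ps(x + i);
        _mm512_storeu_ps(x + i, _mm512_mul_ps(v, two));
    }
    if (i < n) {
        // Between 1 and 15 elements are left: set one mask bit per element.
        const __mmask16 k = static_cast<__mmask16>((1u << (n - i)) - 1u);
        // Masked-off lanes are neither read from nor written to memory.
        __m512 v = _mm512_maskz_loadu_ps(k, x + i);
        _mm512_mask_storeu_ps(x + i, k, _mm512_mul_ps(v, two));
        i = n;
    }
#endif
    // Only path when AVX-512F is not enabled at compile time.
    for (; i < n; ++i) x[i] *= 2.0f;
}
```

With the mask, the leftover elements go through the same vec code path as the full blocks, and nothing beyond the end of the array is touched.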