Writing SIMD code that works across different platforms can be challenging. The following log illustrates how a seemingly simple operation in C++ can quickly escalate into a significant problem.

Let’s look at the code below, where the elements of x are accessed through indices specified by idx.

normal code
std::vector<float> x = /* some data */;
std::vector<int> idx = /* indices into x */;
for(auto i: idx)
{
    auto data = x[i];
}

Gather with Intel

In AVX-512, gather is a dedicated intrinsic that transfers data from a source array into a target vector register according to an index vector. This intrinsic vectorizes the normal code above.

SIMD code
constexpr int simd_width = 16; // 16 x 32-bit floats per 512-bit register
for(size_t i = 0; i < idx.size(); i += simd_width)
{
    __m512i idx_vec = _mm512_loadu_epi32(&idx[i]);
    __m512  x_vec   = _mm512_i32gather_ps(idx_vec, &x[0], sizeof(float));
}

With Intel’s intrinsics, the code snippet gathers the data from the vector x based on the index register idx_vec and stores the result into the register x_vec.

Personally, after a few days of SIMD coding, I do appreciate such code and consider this SIMD solution simple and elegant: two instructions are used, which aligns nicely with what happens in the normal code:

  1. loading the data from the idx vector;
  2. loading the data from x vector according to the result of the 1st step.

A big BUT: the gather (and scatter) operations are not supported by most other instruction sets. ARM NEON, for example, simply does NOT offer these instructions 😮‍💨. To achieve the same data-loading task, more effort is needed.

Customized Gather with ARM

Using ARM NEON intrinsics, we have to implement our own gather. I found four solutions and benchmarked their performance.

Tmp array

/* tmp array */
constexpr size_t simd_width = 4;
alignas(16) std::array<float, simd_width> tmp;
for(size_t i = 0; i < x.size(); i += simd_width)
{
  tmp[0] = x[idx[i]];
  tmp[1] = x[idx[i + 1]];
  tmp[2] = x[idx[i + 2]];
  tmp[3] = x[idx[i + 3]];

  float32x4_t tmp_vec = vld1q_f32(tmp.data()); // loading to register
  vst1q_f32(&buf[i], tmp_vec);
}

A naive solution (suggested by ChatGPT-4 and many GitHub repos) is to read idx and x[idx] directly without the help of intrinsics, store the data in a temporary array, and then load that array into the target register. This solution mixes SIMD and non-SIMD code. The indexed accesses (e.g., [i]) to the arrays let the compiler/CPU do whatever they want, which loses register-level control.

Union

union alignas(64) f32x4_union
{
  float32x4_t          reg128;
  std::array<float, 4> f32x4;
};
f32x4_union res_vec;
for(size_t i = 0; i < size; i += 4)
{
  res_vec.f32x4[0] = x[idx[i]];
  res_vec.f32x4[1] = x[idx[i + 1]];
  res_vec.f32x4[2] = x[idx[i + 2]];
  res_vec.f32x4[3] = x[idx[i + 3]];
  vst1q_f32(&buf[i], res_vec.reg128);
}

By putting the array and the register into a union, we gain access to the elements of the register by indexing. Compared to the tmp array solution, the union solution avoids the explicit load into the register (i.e., vld1q_f32), which should improve efficiency. However, the indexed access is still under the control of the compiler/CPU.

get & set

uint32x4_t  idx_vec;
float32x4_t x_vec = vdupq_n_f32(0.0f); // initialize so the first lane insert does not read an undefined register
for(size_t i = 0; i < size; i += 4)
{
  idx_vec = vld1q_u32(&idx[i]);
  x_vec   = vsetq_lane_f32(x[vgetq_lane_u32(idx_vec, 0)], x_vec, 0);
  x_vec   = vsetq_lane_f32(x[vgetq_lane_u32(idx_vec, 1)], x_vec, 1);
  x_vec   = vsetq_lane_f32(x[vgetq_lane_u32(idx_vec, 2)], x_vec, 2);
  x_vec   = vsetq_lane_f32(x[vgetq_lane_u32(idx_vec, 3)], x_vec, 3);
  vst1q_f32(&buf[i], x_vec);
}

This solution combines the get and set intrinsics to mimic the advanced gather operation. The code is … ugly, but efficient. It ensures that idx_vec and x_vec are carefully reused, allowing the finest control over the registers.

get & tmp array

alignas(16) std::array<float, 4> values;
for(size_t i = 0; i < size; i += 4)
{
  uint32x4_t idx_vec = vld1q_u32(&idx[i]);

  values[0] = x[vgetq_lane_u32(idx_vec, 0)];
  values[1] = x[vgetq_lane_u32(idx_vec, 1)];
  values[2] = x[vgetq_lane_u32(idx_vec, 2)];
  values[3] = x[vgetq_lane_u32(idx_vec, 3)];

  float32x4_t x_vec = vld1q_f32(values.data());
  vst1q_f32(&buf[i], x_vec);
}

The last solution mixes get with the tmp array. It is intermediate between the 1st and 3rd solutions in terms of register use.

Benchmarking

With a data size of 1<<27, the performance of the four solutions is:

performance           union   tmp   get&set   get&tmp
time (ms)              703    639     583       648
assembly code lines     16     16      18        18

The union solution ties with tmp for the shortest assembly code but has the longest execution time. The short code is expected, as it eliminates one line (the explicit load) compared with the tmp method. But why is it so slow? The key reason is that writing to the union elements by indexing (res_vec.f32x4[i] = ...) is inefficient compared to the intrinsic vld1q_f32(ptr). Explicit control over the registers promises better performance! The get&set solution facilitates the finest control of registers and thus gives the shortest time.

A weird result comes from get&tmp. It is actually slightly slower than tmp and I do not understand why. I fed the assembly code of the last three solutions to ChatGPT-4, and this is its analysis:

tmp – L10 (first loop): This loop involves more register manipulation (bfi, fmov, ins) compared to the other loops. It may contribute to higher register pressure and could potentially limit instruction-level parallelism, affecting performance.

get&set – L11 (second loop): This loop uses ld1 and ld1q instructions to load the required values into SIMD registers directly from memory. This reduces the amount of register manipulation required compared to L10, which could lead to better performance.

get&tmp – L12 (third loop): This loop uses a similar approach to L10, using a mix of bfi, fmov, and ins instructions to manipulate registers. However, it also involves an additional ldr instruction for each iteration, which increases the amount of memory operations per iteration compared to L10. This could potentially explain why L12 is slower than L10.

In conclusion, the second loop (L11) has the least amount of register manipulation and memory operations per iteration, which may contribute to its better performance. The third loop (L12) has more memory operations per iteration compared to the first loop (L10), which could lead to a slower execution time.

To wrap up, get&set involves the least register manipulation and the fewest memory operations per iteration, thus yielding the best performance. This explanation seems to make sense.

The complete code is here.