A performance evaluation was conducted on an NVIDIA L40, comparing the total time of 100 access iterations over device vectors of type half, half2, and float. Each vector holds 1024*1024 values; for the half2 type, two values are packed into each vector entry, so the vector has 0.5M entries. Two kinds of random access are therefore tested for half2: random access per half2 entry and random access per individual half.

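| Access Type     | Data Type | Vector Size | Allocated Memory | Time (ms) |
|-----------------|-----------|-------------|------------------|-----------|
| Random          | half      | 1M          | 2MB              | 4.275200  |
| Random          | float     | 1M          | 4MB              | 4.088096  |
| Random by half  | half2     | 0.5M        | 2MB              | 4.011008  |
| Random by half2 | half2     | 0.5M        | 2MB              | 2.325184  |
| Sequential      | half      | 1M          | 2MB              | 0.755712  |
| Sequential      | half2     | 0.5M        | 2MB              | 0.707488  |
| Sequential      | float     | 1M          | 4MB              | 0.794912  |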
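
As a rough illustration of how such a measurement might be set up, the CUDA sketch below times 100 launches of a gather kernel with CUDA events. The kernel, the index array that selects sequential or random order, and the names `gather`, `make_indices`, and `time_gather` are assumptions made for illustration; they are not the code behind the numbers above.

```cuda
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Hypothetical gather kernel: thread i reads src[idx[i]] and stores it in dst[i].
// Whether the access is sequential or random depends only on the contents of idx.
template <typename T>
__global__ void gather(const T* __restrict__ src, const int* __restrict__ idx,
                       T* __restrict__ dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[idx[i]];
}

// Host-side index list: identity order for sequential access,
// a shuffled permutation (fixed seed) for random access.
std::vector<int> make_indices(int n, bool shuffle) {
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    if (shuffle) {
        std::mt19937 rng(42);
        std::shuffle(idx.begin(), idx.end(), rng);
    }
    return idx;
}

// Time `iters` launches of gather<T> over n entries; returns total milliseconds.
// Error checking is omitted to keep the sketch short.
template <typename T>
float time_gather(int n, bool shuffle, int iters = 100) {
    std::vector<int> h_idx = make_indices(n, shuffle);
    T* d_src = nullptr;
    T* d_dst = nullptr;
    int* d_idx = nullptr;
    cudaMalloc(&d_src, n * sizeof(T));
    cudaMalloc(&d_dst, n * sizeof(T));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMemset(d_src, 0, n * sizeof(T));
    cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid = (n + block - 1) / block;
    gather<T><<<grid, block>>>(d_src, d_idx, d_dst, n);  // warm-up launch, not timed

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it)
        gather<T><<<grid, block>>>(d_src, d_idx, d_dst, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_src);
    cudaFree(d_dst);
    cudaFree(d_idx);
    return ms;
}
```

Under these assumptions, calling `time_gather<__half>(1024 * 1024, true)`, `time_gather<float>(1024 * 1024, true)`, or `time_gather<__half2>(512 * 1024, true)` from `main` would correspond to the random rows, and passing `false` to the sequential ones. Reading the index array adds memory traffic of its own, so absolute times from this sketch should not be expected to match the table.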

The performance results can be explained by the tradeoff between memory alignment and memory footprint.

Random Access:

Speed: half2-by-half2 > half2-by-half > float > half

  • random with half is slowest: Despite its smaller size, half suffers from alignment issues. GPUs generally favor 32-bit aligned memory accesses, so accessing individual 16-bit values may require additional operations.
  • random with float is slow: float requires twice as much memory as half, but it is naturally 32-bit aligned, so it is slightly faster than half.
  • random with half2 by half is slow too: Each access loads an aligned 32-bit word, but the overhead of unpacking a single half from it negates the benefit of the alignment.
  • random with half2 by half2 is fastest: It packs two half values into one 32-bit word, so every access is aligned and retrieves two values at once.
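
The difference between the two half2 access patterns can be sketched as follows. This is a minimal illustration that assumes logical element `k` lives in word `k / 2`, in the low lane when `k` is even and the high lane when `k` is odd; the helper and kernel names are hypothetical and not taken from the benchmark.

```cuda
#include <cuda_fp16.h>

// Index of the 32-bit half2 word that contains logical half element k.
__host__ __device__ inline int word_index(int k) { return k >> 1; }

// True when logical half element k sits in the high 16 bits of its word.
__host__ __device__ inline bool is_high_lane(int k) { return (k & 1) != 0; }

// "Random by half": one aligned 32-bit load, then an unpack step to select a lane.
__device__ inline __half load_half_from_half2(const __half2* v, int k) {
    __half2 word = v[word_index(k)];
    return is_high_lane(k) ? __high2half(word) : __low2half(word);
}

// Each thread fetches one half value chosen by idx (indices count half elements).
__global__ void gather_by_half(const __half2* src, const int* idx,
                               __half* dst, int n_half) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_half) dst[i] = load_half_from_half2(src, idx[i]);
}

// "Random by half2": each thread fetches a whole word, with no unpacking
// (indices count half2 entries, so there are half as many of them).
__global__ void gather_by_half2(const __half2* src, const int* idx,
                                __half2* dst, int n_half2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_half2) dst[i] = src[idx[i]];
}
```

In this sketch `gather_by_half` issues one load per half value plus a lane selection, while `gather_by_half2` covers the same data with half as many loads and no selection.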

Sequential Access:

Speed: half2 > half > float

  • half2 is the fastest: Packed, 32-bit aligned memory access combined with a small memory footprint leads to the best performance.
  • half is in the middle: It has the same small memory footprint as half2, but it is slower because of the 16-bit alignment issue.
  • float is the slowest: float requires twice as much memory as the half and half2 types, so more data has to be transferred.
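
To set footprint against time, the sequential rows of the table can be converted into an effective read bandwidth. The sketch below assumes that each reported time covers all 100 iterations, that every iteration reads the whole vector exactly once, and that 1MB means 2^20 bytes; the function names are hypothetical.

```cuda
#include <cstdio>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes), given the bytes read per
// iteration, the number of iterations, and the total elapsed time in ms.
inline double effective_gbps(double bytes_per_iter, int iters, double total_ms) {
    return bytes_per_iter * iters / (total_ms * 1e-3) / 1e9;
}

// Print the effective bandwidth for the three sequential rows of the table.
inline void print_sequential_bandwidths() {
    const double MiB = 1024.0 * 1024.0;  // assumption: "MB" in the table is 2^20 bytes
    const int iters = 100;               // assumption: times are totals over 100 iterations
    printf("half  : %.1f GB/s\n", effective_gbps(2 * MiB, iters, 0.755712));
    printf("half2 : %.1f GB/s\n", effective_gbps(2 * MiB, iters, 0.707488));
    printf("float : %.1f GB/s\n", effective_gbps(4 * MiB, iters, 0.794912));
}
```

Because the float vector is twice as large but only slightly slower, its figure comes out higher than the other two under these assumptions, so the numbers are best read as a relative comparison.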