Performance Comparisons: Half, Half2 and Float

A performance evaluation was conducted on an NVIDIA L40, comparing the access times over 100 iterations for device vectors of type half, half2, and float. Each vector holds 1024*1024 values; for the half2 type, two values are packed into a single vector entry, so the vector has half as many entries. Two random-access patterns are therefore tested for half2: random access per half2 entry and random access per individual half.

Access Type	Data Type	Vector Size	Allocated Memory	Time (ms)
Random	half	1M	2MB	4.275200
Random	float	1M	4MB	4.088096
Random by half	half2	0.5M	2MB	4.011008
Random by half2	half2	0.5M	2MB	2.325184
Sequential	half	1M	2MB	0.755712
Sequential	half2	0.5M	2MB	0.707488
Sequential	float	1M	4MB	0.794912
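
The source does not show how these times were measured. As a rough sketch only, one common way to time 100 gather launches is with CUDA events. The names gather and time_gather, the use of raw device pointers instead of a device-vector type, the warm-up launch, and the block size of 256 are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread reads in[idx[i]] and stores it in out[i].
// With idx[i] == i the access is sequential; with shuffled idx it is random.
template <typename T>
__global__ void gather(const T* in, const int* idx, T* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Returns the total elapsed time in milliseconds for `iters` launches.
template <typename T>
float time_gather(const T* d_in, const int* d_idx, T* d_out, int n, int iters = 100) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Warm-up launch (an assumption), so one-time setup cost is not timed.
    gather<T><<<blocks, threads>>>(d_in, d_idx, d_out, n);

    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it) {
        gather<T><<<blocks, threads>>>(d_in, d_idx, d_out, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The same template can be instantiated with float, __half, or __half2 (the latter two need cuda_fp16.h), passing 1M entries for half and float and 0.5M entries for half2.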

The performance results can be explained by the tradeoff between memory alignment and memory footprint.

Random Access:

Speed: half2-by-half2 > half2-by-half > float > half

Random with half is the slowest: Despite its smaller size, half suffers from alignment issues. GPUs often prefer 32-bit aligned memory accesses, so accessing individual 16-bit values may require additional operations.
Random with float is slow: float requires twice as much memory as half, but each value is naturally 32-bit aligned, so it is slightly faster than half.
Random with half2 by half is slow too: Each access loads an aligned 32-bit word, but the overhead of unpacking a single half from it negates most of the benefit of the alignment.
Random with half2 by half2 is the fastest: Two half values are packed into one 32-bit word, so each aligned access returns two values and no unpacking of a single half is needed.
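
The three half-based random-access patterns above can be sketched as kernels. This is an illustrative guess at the access patterns, not the benchmark's actual code: the kernel names, the index arrays, the float outputs, and the helper count_mismatches are all hypothetical. The intrinsics come from cuda_fp16.h.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

// Random by half: each thread issues one 16-bit load at an arbitrary index.
__global__ void gather_half(const __half* in, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __half2float(in[idx[i]]);
}

// Random by half2: one 32-bit load returns both packed values.
__global__ void gather_half2(const __half2* in, const int* idx, float2* out, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) out[i] = __half22float2(in[idx[i]]);
}

// Random by half on packed storage: load the 32-bit word, then select one lane.
__global__ void gather_half2_by_half(const __half2* in, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int k = idx[i];
        __half2 w = in[k >> 1];                               // entry holding half k
        out[i] = (k & 1) ? __high2float(w) : __low2float(w);  // unpack step
    }
}

// Hypothetical host check: runs all three kernels on the same data and
// returns the number of mismatching results. n must be even.
int count_mismatches(int n) {
    int n2 = n / 2;
    std::vector<__half> h_in(n);
    std::vector<int> h_idx(n), h_idx2(n2);
    for (int i = 0; i < n; ++i) {
        h_in[i] = __float2half(float(i % 1024));  // exactly representable in half
        h_idx[i] = int((7LL * i + 3) % n);        // scattered index per half
    }
    for (int i = 0; i < n2; ++i) h_idx2[i] = h_idx[i] / 2;  // index per half2 entry

    __half* d_in; int *d_idx, *d_idx2; float *d_a, *d_b; float2* d_c;
    cudaMalloc(&d_in, n * sizeof(__half));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMalloc(&d_idx2, n2 * sizeof(int));
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n2 * sizeof(float2));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx2, h_idx2.data(), n2 * sizeof(int), cudaMemcpyHostToDevice);

    // The same bytes viewed as packed pairs: even element low, odd element high.
    const __half2* d_in2 = reinterpret_cast<const __half2*>(d_in);
    int threads = 256, blocks = (n + threads - 1) / threads;
    gather_half<<<blocks, threads>>>(d_in, d_idx, d_a, n);
    gather_half2_by_half<<<blocks, threads>>>(d_in2, d_idx, d_b, n);
    gather_half2<<<blocks, threads>>>(d_in2, d_idx2, d_c, n2);

    std::vector<float> a(n), b(n);
    std::vector<float2> c(n2);
    cudaMemcpy(a.data(), d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), d_b, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(c.data(), d_c, n2 * sizeof(float2), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float want = float(h_idx[i] % 1024);
        bad += (a[i] != want) + (b[i] != want);
    }
    for (int i = 0; i < n2; ++i) {
        bad += (c[i].x != float((2 * h_idx2[i]) % 1024));
        bad += (c[i].y != float((2 * h_idx2[i] + 1) % 1024));
    }
    cudaFree(d_in); cudaFree(d_idx); cudaFree(d_idx2);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return bad;
}
```

Note that gather_half2_by_half performs the same number of scattered loads as gather_half, plus a lane selection, while gather_half2 needs only half as many loads to touch the same data.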

Sequential Access:

Speed: half2 > half > float

half2 is the fastest: Packed 32-bit accesses combined with a small memory footprint give the best performance.
half is slower: It has the same small memory footprint as half2, but its 16-bit accesses suffer from alignment issues.
float is the slowest: float requires twice as much memory as the half and half2 types and thus needs more time to transfer.
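
The footprint side of this tradeoff is simple arithmetic and can be checked at compile time. The sketch below is only an illustration; kValues and footprint_bytes are hypothetical names, and the table's MB is assumed to mean 1024*1024 bytes.

```cuda
#include <cuda_fp16.h>
#include <cstddef>

// Element sizes that the footprint argument relies on.
static_assert(sizeof(__half) == 2, "half is 16 bits");
static_assert(sizeof(__half2) == 4, "half2 packs two halves into 32 bits");
static_assert(sizeof(float) == 4, "float is 32 bits");

// Number of logical values in each vector (1024*1024, as in the text).
constexpr std::size_t kValues = 1024 * 1024;

// Bytes needed for `entries` vector entries of type T.
template <typename T>
constexpr std::size_t footprint_bytes(std::size_t entries) {
    return entries * sizeof(T);
}

constexpr std::size_t kMiB = 1024 * 1024;
static_assert(footprint_bytes<__half>(kValues) == 2 * kMiB, "1M half = 2MB");
static_assert(footprint_bytes<__half2>(kValues / 2) == 2 * kMiB, "0.5M half2 = 2MB");
static_assert(footprint_bytes<float>(kValues) == 4 * kMiB, "1M float = 4MB");
```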
