A performance evaluation was conducted on an Nvidia L40, comparing the access times over 100 iterations of device vectors with `half`, `half2`, and `float` element types. Each vector was initialized with 1024*1024 elements, but for the `half2` type, two elements were packed into a single vector entry, so the vector has half as many entries. Hence, two random-access granularities are tested for the `half2` type: random access per `half2` entry and random access per `half` element.
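The difference between the two `half2` random-access modes can be sketched as a pair of gather kernels. This is a minimal illustration, not the benchmark's actual code: the kernel names, the `idx` index buffer, the `out` buffer, and the convention that even elements sit in the low lane of a `half2` entry are all assumptions.

```cuda
#include <cuda_fp16.h>

// Hypothetical helpers: map a 16-bit element index to the half2 entry that
// holds it and to its lane (assumed: 0 = low half, 1 = high half).
__host__ __device__ inline int entry_of(int elem) { return elem >> 1; }
__host__ __device__ inline int lane_of(int elem)  { return elem & 1; }

// Random access per half2: each thread loads one 32-bit entry at a random
// position and uses both packed values.
__global__ void gather_by_half2(const __half2* data, const int* idx,
                                float* out, int n_pairs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) {
        __half2 v = data[idx[i]];  // idx[i] assumed to lie in [0, n_pairs)
        out[i] = __low2float(v) + __high2float(v);
    }
}

// Random access per half: each thread targets one 16-bit element, loads the
// half2 entry containing it, and unpacks only the requested lane.
__global__ void gather_by_half(const __half2* data, const int* idx,
                               float* out, int n_halves) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_halves) {
        int e = idx[i];            // idx[i] assumed to lie in [0, n_halves)
        __half2 v = data[entry_of(e)];
        out[i] = lane_of(e) ? __high2float(v) : __low2float(v);
    }
}
```

In the per-`half2` mode every random load delivers two usable values, whereas the per-`half` mode issues one random 32-bit load per 16-bit value and discards the other lane.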
Access Type | Data Type | Vector Size | Allocated Memory | Time (ms) |
---|---|---|---|---|
Random | half | 1M | 2MB | 4.275200 |
Random | float | 1M | 4MB | 4.088096 |
Random by half | half2 | 0.5M | 2MB | 4.011008 |
Random by half2 | half2 | 0.5M | 2MB | 2.325184 |
Sequential | half | 1M | 2MB | 0.755712 |
Sequential | half2 | 0.5M | 2MB | 0.707488 |
Sequential | float | 1M | 4MB | 0.794912 |
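
The text does not say how the times were measured. One common way to time repeated kernel launches is with CUDA events, sketched below under that assumption. The helper `time_kernel_ms`, the placeholder `noop_kernel`, and the untimed warm-up launch are hypothetical, and whether the reported figures are totals or per-iteration averages is not stated in the source.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel, used only to show the call shape.
__global__ void noop_kernel() {}

// Hypothetical helper: returns the total elapsed time in milliseconds for
// `iters` back-to-back launches of `kernel` on the default stream.
template <typename Kernel, typename... Args>
float time_kernel_ms(Kernel kernel, dim3 grid, dim3 block, int iters,
                     Args... args) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<grid, block>>>(args...);  // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        kernel<<<grid, block>>>(args...);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait until all launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

A call such as `time_kernel_ms(noop_kernel, dim3(1), dim3(1), 100)` would time 100 launches; a real measurement would pass one of the access kernels and its device buffers instead.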
The performance results can be explained by the tradeoff between memory alignment and memory footprint.
Random Access:
Speed: half2-by-half2 > half2-by-half > float > half
- random with `half` is slowest: Despite its smaller size, `half` suffers from alignment issues. GPUs often prefer 32-bit aligned memory accesses, so accessing 16-bit values may require additional operations.
- random with `float` is slow: `float` requires more memory than `half`, but it has natural alignment, so it is slightly faster than `half`.
- random with `half2` by `half` is slow too: The overhead of packing and unpacking values negates the benefits of memory alignment.
- random with `half2` by `half2` is fastest: It packs two `half` values into a 32-bit word, allowing for more efficient memory access.
Sequential Access:
Speed: half2 > half > float
- `half2` is the fastest: Packed memory access and smaller memory footprint lead to the best performance.
- `half` is slower: Small memory footprint, but it is slower than `half2` due to alignment issues.
- `float` is the slowest: `float` requires more memory than the `half` and `half2` types, and thus needs more time to transfer.