Performance Comparisons: Half, Half2 and Float

A performance evaluation was conducted on an NVIDIA L40, comparing access times over 100 iterations for device vectors of type half, half2, and float. Each vector was initialized with 1024*1024 elements; for the half2 type, two elements were packed into a single vector entry. Hence, two kinds of random access are tested for half2: random access per half2 and random access per half.

Access Type | Data Type | Vector Size | Allocated Memory | Time (ms)
Random | half | 1M | 2MB | 4....
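The difference between the two half2 access patterns comes down to index arithmetic, which a small host-side sketch can make concrete. It assumes the half2 vector stores the same 1024*1024 logical values in half as many entries (the excerpt does not say this explicitly), and that lane 0 and lane 1 correspond to the `.x` and `.y` members of a CUDA `__half2`. The names `entries_for`, `Half2Slot`, and `locate_half` are hypothetical and not taken from the post.

```cpp
#include <cassert>
#include <cstddef>

// Logical element count from the post: 1024*1024 values per vector.
constexpr std::size_t kNumElements = 1024 * 1024;

// Element sizes in bytes. These mirror sizeof(__half), sizeof(__half2) and
// sizeof(float) in CUDA, written out here so the sketch builds without nvcc.
constexpr std::size_t kHalfBytes  = 2;
constexpr std::size_t kHalf2Bytes = 4;
constexpr std::size_t kFloatBytes = 4;

// Number of vector entries needed for n logical values when `pack` values
// share one entry (pack = 1 for half and float, pack = 2 for half2).
constexpr std::size_t entries_for(std::size_t n, std::size_t pack) {
    return (n + pack - 1) / pack;
}

// Position of one logical half inside a half2 vector (hypothetical helper type).
struct Half2Slot {
    std::size_t entry;  // index of the half2 entry
    int lane;           // 0 or 1: which half inside that entry
};

// "Random access per half": a random logical index i is mapped to the half2
// entry that contains it plus a lane. "Random access per half2" would instead
// draw the entry index directly and read both lanes.
inline Half2Slot locate_half(std::size_t i) {
    assert(i < kNumElements);
    return {i / 2, static_cast<int>(i % 2)};
}

// Under the packing assumption above, half and half2 occupy the same 2 MB,
// while float needs twice that.
static_assert(entries_for(kNumElements, 1) * kHalfBytes ==
              entries_for(kNumElements, 2) * kHalf2Bytes,
              "half and packed half2 vectors have equal footprint");
static_assert(entries_for(kNumElements, 1) * kFloatBytes == 4u * 1024 * 1024,
              "float vector is 4 MB");
```

For instance, `locate_half(5)` yields entry 2, lane 1, so a per-half random access still loads a full 4-byte half2 entry and then selects one of its two lanes.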

September 12, 2024 · 290 words · Me

Tensor Core Register Layout

Layout

 0  1  2  3  4  5  6  7  0  0  0  0  0  0  0  0
 8  9 10 11 12 13 14 15  0  0  0  0  0  0  0  0
16 17 18 19 20 21 22 23  0  0  0  0  0  0  0  0
24 25 26 27 28 29 30 31  0  0  0  0  0  0  0  0
32 33 34 35 36 37 38 39  0  0  0  0  0  0  0  0
40 41 42 43 44 45 46 47  0  0  0  0  0  0  0  0
48 49 50 51 52 53 54 55  0  0  0  0  0  0  0  0
56 57 58 59 60 61 62 63  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  8 16 24 32 40 48 56
 0  0  0  0  0  0  0  0  1  9 17 25 33 41 49 57
 0  0  0  0  0  0  0  0  2 10 18 26 34 42 50 58
 0  0  0  0  0  0  0  0  3 11 19 27 35 43 51 59
 0  0  0  0  0  0  0  0  4 12 20 28 36 44 52 60
 0  0  0  0  0  0  0  0  5 13 21 29 37 45 53 61
 0  0  0  0  0  0  0  0  6 14 22 30 38 46 54 62
 0  0  0  0  0  0  0  0  7 15 23 31 39 47 55 63

Code on V100

int half_elements = a_frag....
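Read as a 16x16 grid (an assumption, since the excerpt does not mark row breaks), the layout above has a regular structure: the top-left 8x8 block counts 0 to 63 in row-major order, the bottom-right 8x8 block holds the same values in column-major order, and the rest is zero. The sketch below only regenerates that pattern on the host. The function name `make_layout` is hypothetical, and the excerpt does not state how these numbers map to tensor core registers.

```cpp
#include <array>

constexpr int kDim = 16;   // assumed grid size (16 rows of 16 values)
constexpr int kBlock = 8;  // size of each non-zero block

using Grid = std::array<std::array<int, kDim>, kDim>;

// Rebuild the grid shown in the listing (hypothetical helper).
inline Grid make_layout() {
    Grid g{};  // value-initialized, so every cell starts at 0
    for (int r = 0; r < kBlock; ++r) {
        for (int c = 0; c < kBlock; ++c) {
            // Top-left block: indices increase along each row.
            g[r][c] = r * kBlock + c;
            // Bottom-right block: indices increase down each column,
            // i.e. the transpose of the top-left block.
            g[kBlock + r][kBlock + c] = c * kBlock + r;
        }
    }
    return g;
}
```

Comparing `make_layout()[1][0]` (8) with `make_layout()[8][9]` (also 8) shows the transposition between the two blocks.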

January 21, 2024 · 382 words · Me