Bitwise Op

🦥 An old note. Bitwise vs Arithmetic: running on a vector of size 2^31, bitwise operations are significantly faster than their arithmetic counterparts: seg = 64; volume = (vec_size - 1) / seg + 1; unsigned bs = log2(seg); unsigned bv = log2(volume); unsigned bbv = volume - 1; Arithmetic: out[i] = i % volume * seg + i / volume; Bitwise: out[i] = ((i & bbv) << bs) + (i >> bv)...
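
A minimal sketch (mine, not the full post) of the two index mappings from the excerpt; it assumes seg and volume are powers of two, which is what makes the mask/shift form equivalent to % and /.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: the same index remapping computed two ways.
// Assumes seg and volume are powers of two.
void remap(std::vector<std::size_t>& out, std::size_t vec_size) {
    const std::size_t seg = 64;
    const std::size_t volume = (vec_size - 1) / seg + 1;
    const unsigned bs = static_cast<unsigned>(std::log2(seg));   // log2(64) = 6
    const unsigned bv = static_cast<unsigned>(std::log2(volume));
    const std::size_t bbv = volume - 1;                          // mask for i % volume

    for (std::size_t i = 0; i < vec_size; ++i) {
        // Arithmetic: out[i] = i % volume * seg + i / volume;
        // Bitwise equivalent:
        out[i] = ((i & bbv) << bs) + (i >> bv);
    }
}
```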

May 7, 2023 · 80 words · Me

Omp Parallel Region

The results look suspicious to me… But I wrote down this note many days ago 🦥. Maybe I need to evaluate it again. Multiple Parallel Regions Constructing a parallel region is expensive in OpenMP. Let's use two examples for illustration: three loops operating on a vector of size 2^31, e.g., for(size_t i = 0; i < vec.size(); i++) with vec[i] += 1, vec[i] *= 0.9, vec[i] /= 7. Case 1: a large parallel region enclosing the three loops via omp parallel { omp for }...
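
A minimal sketch (function names are mine) of the two cases the note compares: one enclosing parallel region containing three omp for loops, versus three separate omp parallel for loops that each construct their own region.

```cpp
#include <cstddef>
#include <vector>

// Case 1: a single parallel region; the thread team is created once
// and reused by the three work-sharing loops.
void case1_single_region(std::vector<double>& vec) {
    #pragma omp parallel
    {
        #pragma omp for
        for (std::size_t i = 0; i < vec.size(); i++) vec[i] += 1;
        #pragma omp for
        for (std::size_t i = 0; i < vec.size(); i++) vec[i] *= 0.9;
        #pragma omp for
        for (std::size_t i = 0; i < vec.size(); i++) vec[i] /= 7;
    }
}

// Case 2: three parallel regions; a region is constructed for every loop.
void case2_three_regions(std::vector<double>& vec) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < vec.size(); i++) vec[i] += 1;
    #pragma omp parallel for
    for (std::size_t i = 0; i < vec.size(); i++) vec[i] *= 0.9;
    #pragma omp parallel for
    for (std::size_t i = 0; i < vec.size(); i++) vec[i] /= 7;
}
```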

May 2, 2023 · 238 words · Me

Omp Collapse

One of my old-day notes 🦥. Collapse of Nested Loops The collapse clause converts a perfectly nested loop into a single loop and then parallelizes it. The condition for a perfectly nested loop is that the inner loop is directly enclosed by the outer loop, with no other code in between: for(int i = 0 ... ) { for(int j = 0 ...) { task[i][j]; } } Such a condition is hard to meet....
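
A minimal sketch of the collapse clause on a perfectly nested loop (the 2D task here is a placeholder of my own):

```cpp
#include <vector>

void scale(std::vector<std::vector<double>>& task) {
    const int n = static_cast<int>(task.size());
    const int m = n > 0 ? static_cast<int>(task[0].size()) : 0;

    // collapse(2) fuses the i/j loops into one iteration space of n*m
    // and divides it among the threads. This is only legal because no
    // code sits between the two for statements (perfectly nested).
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            task[i][j] *= 2.0;
        }
    }
}
```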

May 2, 2023 · 158 words · Yac

Vector vs Array

Another post recycled from my earlier notes. I really don't have the motivation to improve it further 🦥. Vector vs Array Initialization The Vector is the preferred choice for data storage in modern C++. It is internally implemented on top of an Array. However, the performance gap between the two is indeed noticeable. A Vector can be initialized via std::vector<T> vec(size), while an Array is initialized by T* arr = new T[size]...
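
A minimal sketch of the two initializations from the excerpt. My assumption (the post is truncated here) is that the gap comes from std::vector value-initializing its elements while new T[size] leaves trivial types uninitialized.

```cpp
#include <cstddef>
#include <vector>

int main() {
    const std::size_t size = 1u << 20;

    // std::vector value-initializes every element (zeros for float),
    // so construction writes the whole allocation.
    std::vector<float> vec(size);

    // new[] leaves trivially constructible elements uninitialized,
    // so no element is touched until first use.
    float* arr = new float[size];

    vec[0] = arr[0] = 1.0f;  // keep both buffers "used"
    delete[] arr;
    return 0;
}
```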

May 1, 2023 · 460 words · Yac

Gather with SIMD

Writing SIMD code that works across different platforms can be a challenging task. The following log illustrates how a seemingly simple operation in C++ can quickly escalate into a significant problem. Let's look at the code below, where the elements of x are accessed through indices specified by idx: normal code std::vector<float> x = /*some data*/ std::vector<int> idx = /* index */ for(auto i: idx) { auto data = x[i]; } Gather with Intel In AVX512, Gather is a dedicated intrinsic that transfers data from a data array into a target vector according to an index vector....
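
A minimal sketch (not the post's full code) of the gather step using the AVX-512 intrinsic _mm512_i32gather_ps; it assumes an AVX-512F target, out sized like idx, and idx.size() being a multiple of 16.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <vector>

void gather_avx512(const std::vector<float>& x,
                   const std::vector<int>& idx,
                   std::vector<float>& out) {
    for (std::size_t i = 0; i < idx.size(); i += 16) {
        // Load 16 int32 indices, gather x[idx[i..i+15]], store the result.
        __m512i vindex = _mm512_loadu_si512(idx.data() + i);
        __m512 data = _mm512_i32gather_ps(vindex, x.data(), 4);  // scale = sizeof(float)
        _mm512_storeu_ps(out.data() + i, data);
    }
}
```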

April 27, 2023 · 1014 words · Yac

Parallel Algorithms from Libraries

The content of this post is extracted from my previous random notes. I am too lazy to update and organize it 🦥. C++17 new feature – parallel algorithms Parallel algorithms and execution policies were introduced in C++17. Unfortunately, according to CppReference, only GCC and Intel support these features; Clang still leaves them unimplemented. A blog post about it. The parallel algorithms brought by C++17 require Intel's oneTBB for multithreading....
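
A minimal sketch of the C++17 execution policies; on GCC's libstdc++ it needs oneTBB at link time (e.g. -ltbb), as the note says.

```cpp
#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v(1 << 20);
    std::iota(v.begin(), v.end(), 0);

    // std::execution::par asks the library to parallelize the algorithm;
    // with GCC's libstdc++ this dispatches to oneTBB under the hood.
    std::sort(std::execution::par, v.begin(), v.end(), std::greater<int>());
    return 0;
}
```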

April 25, 2023 · 382 words · Yac