Bitwise Op

🦥 An old note. Bitwise vs Arithmetic: running on a vector of size 2^31, bitwise operations are significantly faster than their arithmetic counterparts: seg = 64; volume = (vec_size - 1) / seg + 1; unsigned bs = log2(seg); unsigned bv = log2(volume); unsigned bbv = volume - 1; Arithmetic: out[i] = i % volume * seg + i / volume; Bitwise: out[i] = ((i & bbv) << bs) + (i >> bv)...
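
A minimal sketch (mine, not the full post) of the two index mappings from the excerpt; it assumes seg and volume are powers of two, which is what makes the mask/shift form equivalent to % and /.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: the same index remapping computed two ways.
// Assumes seg and volume are powers of two.
void remap(std::vector<std::size_t>& out, std::size_t vec_size) {
    const std::size_t seg = 64;
    const std::size_t volume = (vec_size - 1) / seg + 1;
    const unsigned bs = static_cast<unsigned>(std::log2(seg));   // log2(64) = 6
    const unsigned bv = static_cast<unsigned>(std::log2(volume));
    const std::size_t bbv = volume - 1;                          // mask for i % volume

    for (std::size_t i = 0; i < vec_size; ++i) {
        // Arithmetic: out[i] = i % volume * seg + i / volume;
        // Bitwise equivalent:
        out[i] = ((i & bbv) << bs) + (i >> bv);
    }
}
```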

May 7, 2023 · 80 words · Me

Omp Parallel Region

The results look suspicious to me… But I wrote down this note many days ago 🦥. Maybe I need to evaluate it again. Multiple Parallel Regions Constructing a parallel region is expensive in OpenMP. Let's use two examples for illustration: three loops operating on a vector of size 2^31, e.g., for(size_t i = 0; i < vec.size(); i++) with vec[i] += 1, vec[i] *= 0.9, vec[i] /= 7. Case 1: a large parallel region enclosing the three loops via omp parallel { omp for }...
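
A minimal sketch (function names are mine) of the two cases the note compares: one enclosing parallel region containing three omp for loops, versus three separate omp parallel for loops that each construct their own region.

```cpp
#include <cstddef>
#include <vector>

// Case 1: a single parallel region; the thread team is created once
// and reused by the three work-sharing loops.
void case1_single_region(std::vector<double>& vec) {
    #pragma omp parallel
    {
        #pragma omp for
        for (std::size_t i = 0; i < vec.size(); i++) vec[i] += 1;
        #pragma omp for
        for (std::size_t i = 0; i < vec.size(); i++) vec[i] *= 0.9;
        #pragma omp for
        for (std::size_t i = 0; i < vec.size(); i++) vec[i] /= 7;
    }
}

// Case 2: three parallel regions; a region is constructed for every loop.
void case2_three_regions(std::vector<double>& vec) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < vec.size(); i++) vec[i] += 1;
    #pragma omp parallel for
    for (std::size_t i = 0; i < vec.size(); i++) vec[i] *= 0.9;
    #pragma omp parallel for
    for (std::size_t i = 0; i < vec.size(); i++) vec[i] /= 7;
}
```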

May 2, 2023 · 238 words · Me

Omp Collapse

One of my old-day notes 🦥. Collapse of Nested Loops The collapse clause converts a perfectly nested loop into a single loop and then parallelizes it. The condition for a perfectly nested loop is that the inner loop is directly enclosed by the outer loop, with no other code in between: for(int i = 0 ... ) { for(int j = 0 ...) { task[i][j]; } } Such a condition is hard to meet....
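
A minimal sketch of the collapse clause on a perfectly nested loop (the 2D task here is a placeholder of my own):

```cpp
#include <vector>

void scale(std::vector<std::vector<double>>& task) {
    const int n = static_cast<int>(task.size());
    const int m = n > 0 ? static_cast<int>(task[0].size()) : 0;

    // collapse(2) fuses the i/j loops into one iteration space of n*m
    // and divides it among the threads. This is only legal because no
    // code sits between the two for statements (perfectly nested).
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            task[i][j] *= 2.0;
        }
    }
}
```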

May 2, 2023 · 158 words · Yac

Vector vs Array

Another post recycled from my earlier notes. I really don't have the motivation to improve it further 🦥. Vector vs Array Initialization The Vector is the preferred choice for data storage in modern C++. It is internally implemented on top of an Array. However, the performance gap between the two is indeed noticeable. A Vector can be initialized via std::vector<T> vec(size), while an Array is initialized by T* arr = new T[size]...
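
A minimal sketch of the two initializations from the excerpt. My assumption (the post is truncated here) is that the gap comes from std::vector value-initializing its elements while new T[size] leaves trivial types uninitialized.

```cpp
#include <cstddef>
#include <vector>

int main() {
    const std::size_t size = 1u << 20;

    // std::vector value-initializes every element (zeros for float),
    // so construction writes the whole allocation.
    std::vector<float> vec(size);

    // new[] leaves trivially constructible elements uninitialized,
    // so no element is touched until first use.
    float* arr = new float[size];

    vec[0] = arr[0] = 1.0f;  // keep both buffers "used"
    delete[] arr;
    return 0;
}
```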

May 1, 2023 · 460 words · Yac

Gather with SIMD

Writing SIMD code that works across different platforms can be a challenging task. The following log illustrates how a seemingly simple operation in C++ can quickly escalate into a significant problem. Let's look at the code below, where the elements of x are accessed through indices specified by idx: normal code std::vector<float> x = /*some data*/ std::vector<int> idx = /* index */ for(auto i: idx) { auto data = x[i]; } Gather with Intel In AVX512, Gather is a dedicated intrinsic that transfers data from a data array into a target vector according to an index vector....
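
A minimal sketch (not the post's full code) of the gather step using the AVX-512 intrinsic _mm512_i32gather_ps; it assumes an AVX-512F target, out sized like idx, and idx.size() being a multiple of 16.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <vector>

void gather_avx512(const std::vector<float>& x,
                   const std::vector<int>& idx,
                   std::vector<float>& out) {
    for (std::size_t i = 0; i < idx.size(); i += 16) {
        // Load 16 int32 indices, gather x[idx[i..i+15]], store the result.
        __m512i vindex = _mm512_loadu_si512(idx.data() + i);
        __m512 data = _mm512_i32gather_ps(vindex, x.data(), 4);  // scale = sizeof(float)
        _mm512_storeu_ps(out.data() + i, data);
    }
}
```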

April 27, 2023 · 1014 words · Yac

Parallel Algorithms from Libraries

The content of this post is extracted from my previous random notes. I am too lazy to update and organize it 🦥. C++17 new feature – parallel algorithms Parallel algorithms and execution policies were introduced in C++17. Unfortunately, according to CppReference, only GCC and Intel support these features; Clang still leaves them unimplemented. A blog post about it. The parallel algorithms brought by C++17 require Intel's oneTBB for multithreading....
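
A minimal sketch of the C++17 execution policies; on GCC's libstdc++ it needs oneTBB at link time (e.g. -ltbb), as the note says.

```cpp
#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v(1 << 20);
    std::iota(v.begin(), v.end(), 0);

    // std::execution::par asks the library to parallelize the algorithm;
    // with GCC's libstdc++ this dispatches to oneTBB under the hood.
    std::sort(std::execution::par, v.begin(), v.end(), std::greater<int>());
    return 0;
}
```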

April 25, 2023 · 382 words · Yac