The results look suspicious to me… But I wrote down this note many days ago 🦥. Maybe I need to evaluate it again.

Multiple Parallel Regions

Constructing a parallel region is expensive in OpenMP. Let's use two cases for illustration:

Three loops operating on a vector of size 2^31, e.g.,

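for (size_t i = 0; i < vec.size(); i++) vec[i] += 1;    // 1st loop
for (size_t i = 0; i < vec.size(); i++) vec[i] *= 0.9;  // 2nd loop
for (size_t i = 0; i < vec.size(); i++) vec[i] /= 7;    // 3rd loop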

Case 1: one large parallel region containing all three loops, via omp parallel { omp for }.
Case 2: three separate parallel regions, one per loop, via omp parallel for.

The time difference:

| # parallel regions | time (ms) |
|---|---|
| one | 2.59 |
| three | 0.57 |

The result contradicts our intuition: we expect one big parallel region (case 1) to run faster than three regions (case 2). The contradiction results from the expensive overhead of building the big parallel region. Breaking down the performance and measuring the three loops separately, we obtain:

| loop | one region (ms) | three regions (ms) |
|---|---|---|
| init | 2.298 | / |
| 1st | 0.017 | 0.057 |
| 2nd | 0.011 | 0.032 |
| 3rd | 0.020 | 0.030 |

The initialization of the parallel region is extremely expensive (2.298 ms) and takes even more time than the computation itself in our case. Within the parallel region of case 1, each loop takes less time than its counterpart in case 2. But once the initialization phase is included, case 1 is slower overall than the three individual regions combined.
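
One way to look at the start-up cost in isolation is to time a region that does almost no work. The sketch below is an assumption-laden illustration, not part of the original measurement: `RegionCost` and `time_empty_region` are hypothetical names, and the atomic counter is there only so the region has an observable effect.

```cpp
#include <chrono>

// Hypothetical result type: elapsed time and number of threads that ran.
struct RegionCost {
  double ms;
  int threads;
};

// Times a parallel region whose only work is counting its threads.
RegionCost time_empty_region() {
  using Clock = std::chrono::steady_clock;
  int threads = 0;
  const Clock::time_point t0 = Clock::now();
  #pragma omp parallel
  {
    #pragma omp atomic
    threads++;
  }
  const double ms =
      std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
  return {ms, threads};
}
```

Calling `time_empty_region()` twice in a row and comparing the two `ms` values shows whether the cost is paid on every region or mostly on the first one. Common runtimes such as GCC's libgomp and LLVM's libomp typically create their worker threads on first use and reuse them afterwards, so the first call may be noticeably slower; whether that accounts for the 2.298 ms above would need to be re-measured.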