Parallel Programming in OpenMP

Correctly parallelizing a loop does not guarantee that its performance will improve. In some circumstances, parallelizing the wrong loop can even slow the program down. Even when the choice of loop is reasonable, some performance tuning may be necessary to make the loop run acceptably fast.
Two key factors that affect performance are parallel overhead and loop scheduling. OpenMP provides several features for controlling these factors, by means of clauses on the parallel do directive. In this section we briefly introduce the two concepts, then describe the OpenMP features for tuning them. Several other factors can also have a large impact on performance, such as synchronization overhead and memory cache utilization. These topics receive a full discussion in Chapters 5 and 6, respectively.
From the discussion of OpenMP's execution model in Chapter 2, it should be clear that running a loop in parallel adds runtime costs: the master thread has to start the slaves, iterations have to be divided among the threads, and threads must synchronize at the end of the loop. We call these additional costs parallel overhead.
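To make these costs concrete, the following is a minimal sketch in C, where the counterpart of the parallel do directive is `#pragma omp parallel for`. The comments mark where each of the three costs arises. The array names, the function `scale`, and the cutoff `MIN_PARALLEL_N` are hypothetical and chosen only for illustration; a suitable cutoff would have to be found by measurement on the target machine.

```c
#include <assert.h>

#define N 100000
/* Hypothetical cutoff: below this trip count the loop is run serially. */
#define MIN_PARALLEL_N 10000

static double a[N], b[N];

/* Store factor * b[i] into a[i] for the first n elements.
   Each iteration does very little work. */
void scale(int n, double factor)
{
    int i;

    assert(n >= 0 && n <= N);

    /* Overhead 1: on entry, the master thread starts (or wakes) the
       other threads in the team.
       Overhead 2: the iterations are divided among the threads; here
       schedule(static) requests a simple up-front division.
       The if clause avoids both costs when n is small. */
    #pragma omp parallel for if(n >= MIN_PARALLEL_N) schedule(static)
    for (i = 0; i < n; i++) {
        a[i] = factor * b[i];
    }
    /* Overhead 3: an implicit barrier here makes every thread wait
       until all iterations are finished. */
}
```

When compiled with OpenMP enabled (for example, `-fopenmp` with GCC), a call such as `scale(N, 2.0)` runs the loop in parallel, while `scale(10, 2.0)` stays serial because of the if clause. Without OpenMP support the pragma is ignored and the loop simply runs serially, producing the same results.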
Each iteration of a loop involves a certain amount of work, in the form of integer and floating-point operations, loads and stores of memory locations, and control flow instructions such as subroutine calls and branches. In many loops, the amount of work per iteration may be small, perhaps...