A performance graph: dynamic vs static schedules
This benchmark was run with no chunk size specified in the schedule clause. For the static schedule, the chunk size is computed automatically so that iterations are distributed as evenly as possible across the threads. For the dynamic schedule, the default chunk size is one: each thread is given a single iteration of the for loop to process, and on completion it requests the next one. On NUMA systems, iterations whose data resides in memory close to the executing node complete faster than iterations processed on distant nodes, because of memory access latencies. The static schedule assigns every thread its calculated chunk up front. Up to 16 threads there is a significant performance gap between the dynamic and static schedules, but beyond that the difference vanishes; dynamic scheduling therefore improves performance mainly at lower thread counts.

OpenMP's behaviour depends on the runtime and the hardware, so tune the scheduling parameters to the underlying system to get the best performance. Idle threads do no useful work; balance the load so that all threads finish at roughly the same time. Avoid forking and joining threads at every parallel construct; reuse an already created thread team instead. Experiment with the code and measure the performance to find the best configuration. Parallel code is hard to debug, because bugs such as data races can silently produce incorrect results.