OpenMP is a shared-memory threaded programming model that can be used to incrementally upgrade existing code for parallel execution. Upgraded code can still be compiled serially, which is a great feature of OpenMP: it gives you the opportunity to check whether parallel execution produces the same results as the serial version. To understand the parallelisation better, we will initially take a look at runtime functions. They are usually not needed in simple OpenMP programs. Some basic compiler (pragma) directives and the scope of variables will be introduced to understand the logic of threaded access to memory.
Work-sharing directives and synchronisation of threads will be discussed within a few examples. How to collect results from threads will be shown with the common reduction clauses. At the end of this week we will present an interesting task-based parallelism approach. Don't forget to experiment with the exercises, because those are your main learning opportunity. Let's dive in.
The structure of this week was inspired by HLRS OpenMP courses (courtesy: Rolf Rabenseifner (HLRS)).
The purpose of runtime functions is the management and modification of the parallel threads that we want to use in our code. They come with the OpenMP library.
For C++ and C, you can add the omp.h header file at the beginning of your source file; this header declares all the standard runtime functions that you need and want to use. The functions that we will be using in our tutorial today can be accessed from the link in the transcript or in the resources.
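That is, near the top of the source file:

```c
#include <omp.h>   /* declarations of the OpenMP runtime functions */
```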
For example, if you want to parallelise your program with, let's say, 12 threads, you specify the number of threads in the program using the function omp_set_num_threads(12).
With this, parallel regions in the program will be executed with 12 threads.
The function omp_get_num_threads() returns the current number of threads in the team. So, as in our previous example, if you set the number of threads to 12, then calling this function inside the parallel region will return 12, the number of threads being used in the program.
The function omp_get_thread_num(), called from within a specific thread, returns an integer that is unique for every thread used to parallelise your task.
To check whether the code is currently executing inside a parallel region, use omp_in_parallel(). This function returns true if it is called inside a parallel region. If it is not, i.e., if it is called in a serial region, it returns false. And again, if you want to use these functions, you need to include the appropriate header file at the beginning of your C code. Of course, there are multiple other runtime functions available in OpenMP.
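To make these functions concrete, here is a minimal sketch using them (variable names are illustrative; this is not the code of the example that follows):

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(12);                  /* request 12 threads for parallel regions */

    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();  /* unique ID of this thread (0, 1, 2, ...) */
        int nthreads = omp_get_num_threads(); /* size of the current team                */

        if (omp_in_parallel())                /* true (non-zero) inside a parallel region */
            printf("Thread %d of %d\n", tid, nthreads);
    }
    return 0;
}
```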
Let's observe the following example.
Take a moment and try to understand what is happening in the code above.
What is the expected output? What are the values of rank and nr_threads?
Is the output always the same? What order are the threads printing in?
What would happen if we change the number of threads to 12?
Now go to the exercise, try it out and check if your answers were correct.
The next thing that we have to take a look at are environment variables. In contrast to runtime functions, environment variables are not used in the code but are specified in the environment where you compile and run your code. The purpose of environment variables is to control the execution of the parallel program at runtime. As these are not specified in the code, you could set them, for example, in a Linux terminal before you compile and run your program. Let's go through the three most common environment variables.
To specify the number of threads to use, set the environment variable OMP_NUM_THREADS.
For example, if you are using the bash shell you can export this variable with a fixed number of threads, e.g. export OMP_NUM_THREADS=12, and the program will then run with this specified number of threads. The same goes for other shells. For example, in TCSH the environment variable is set with the keyword setenv,
and you can specify the number of threads to be used in a similar way, e.g. setenv OMP_NUM_THREADS 12.
To specify on which CPUs the threads should be placed, use OMP_PLACES.
To show the OpenMP version and environment, use OMP_DISPLAY_ENV.
This displays the OpenMP version that you are using. Of course, there are multiple other environment variables that you can use. For the GCC compiler you can follow the link and check the environment variables you want to use yourself, along with explanations and examples of how to use them.
The parallel construct is the basic, fundamental construct of OpenMP. Every thread executes the same statements inside the parallel region simultaneously, as you can see in the image.
Image courtesy: Rolf Rabenseifner (HLRS)
First we have a master thread that executes the serial portion of the code. Then we come to a #pragma omp parallel statement. The master encounters this construct and creates multiple threads, what we call slave threads, that run in parallel. Subsequently the master and slave threads divide the work between each other. At the end of the parallel region there is an implicit barrier: threads that reach it wait until all threads have finished their execution. Once all the threads have finished, the master thread alone resumes execution of the code. At this point the slave threads are gone, because they have completed their task.
In C, this implicit barrier is located at the closing curly brace of the parallel region.
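Schematically (a sketch, not the course code):

```c
#pragma omp parallel
{
    /* statements here are executed by every thread simultaneously */
}   /* closing brace of the structured block: implicit barrier, threads join here */
```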
Let's observe the following code.
Take a moment and try to understand what is happening in the code above. Note the usage of the construct and runtime functions defined earlier in the article.
So far we have just specified a parallel region, while the rest of the code was executed serially. Now we will move on to look at the directives of OpenMP in more detail. The format for using a directive is as follows
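As an illustrative sketch of this format, with the directive name parallel and two example clauses:

```c
#pragma omp parallel private(A) shared(n)   /* "parallel" is the directive name;    */
{                                           /* private(A) and shared(n) are clauses */
    /* structured block */
}
```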
We have already seen and used #pragma omp parallel, a directive to execute a region in parallel. In this format we also have clauses in order to specify different parameters. For example, a private variable is a variable that is private to each thread, whereas a shared variable is one that is shared among all threads, so any thread can access and modify it.
We will explore the clauses more in the following subsection. For now we will look at conditional compilation. Similar to other preprocessor conditionals in C, we can check whether the code is being compiled with OpenMP in the following way
When we write #ifdef _OPENMP, the preprocessor checks whether the code is being compiled with OpenMP support. If it is, i.e., if the compiler defines the _OPENMP macro (for example when compiling with -fopenmp), the block of code that follows is compiled and executed. Otherwise, if the code was compiled serially, the block of code following the #else directive is used. And of course we close the conditional with #endif.
The following example illustrates the use of conditional compilation. When compiling with OpenMP, the _OPENMP macro becomes defined.
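A sketch of what such an example.c might look like (the exact course code may differ), consistent with the output shown below:

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    /* compiled with OpenMP: run with 4 threads to reproduce the output below */
    #pragma omp parallel
    printf("I am thread %d of %d threads\n",
           omp_get_thread_num(), omp_get_num_threads());
#else
    /* compiled serially: _OPENMP is not defined */
    printf("This program is not compiled with OpenMP\n");
#endif
    return 0;
}
```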
$ gcc example.c
This program is not compiled with OpenMP
$ gcc -fopenmp example.c
I am thread 3 of 4 threads
I am thread 2 of 4 threads
I am thread 1 of 4 threads
I am thread 0 of 4 threads
The directive format we have just learnt starts with #pragma omp, an important keyword in OpenMP that we put at the beginning of the line where we want the parallel region to start, followed by the directive name and the clauses. In this subsection we will learn about clauses.
There are basically two kinds of data scope clauses, i.e., private and shared. A private variable is a variable that is private to each thread.
Image courtesy: Rolf Rabenseifner (HLRS)
So, we execute, for example, the following: we define an integer A in the C code, and then we write the OpenMP directive, i.e., #pragma omp parallel with the clause private(A). What happens is that every thread gets its own copy of the variable A, assigned inside each thread individually. If each thread assigns its own thread ID to A, the values range from 0 up to the number of threads minus one: in the first thread it will be 0, in the second thread the value will be 1, because that is the ID of the thread, in the third the value will be 2, and so on. These variables are private, meaning that a separate instance exists inside each thread. This implies that the copy of A holding 0 in the first thread cannot be accessed from the second thread, whose copy holds 1. So this variable is private to each individual thread in our program.
The opposite of this is the shared variable. If we specify that a variable is shared, the variable will be shared between the threads: if we declare it outside of the parallel region, right before the #pragma omp parallel directive, this single variable will be accessed by every thread. To exemplify, if we have a for loop and we add a number to a variable in every iteration, we can specify it to be a shared variable. In this case, whenever any thread updates the shared variable, it adds numbers to the same instance. This is an adequate way to use the for loop that we will see soon in the following subsections.
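As a minimal sketch of the behaviour described above (variable names are illustrative):

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int A = -1;   /* declared before the parallel region                     */
    int n = 10;   /* shared by default: every thread reads the same instance */

    #pragma omp parallel private(A) shared(n)
    {
        /* each thread has its own (initially undefined) copy of A */
        A = omp_get_thread_num();   /* 0 in the first thread, 1 in the second, ... */
        printf("My private A = %d, shared n = %d\n", A, n);
    }

    printf("After the region A = %d\n", A);   /* the original A is untouched: still -1 */
    return 0;
}
```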
To sum up the distinction between private and shared: a private variable is available only to one thread and cannot be accessed by any other thread, whereas a shared variable can not only be accessed by every thread in that part of the program, but can also be updated by several threads simultaneously.
Let's have a look at the code below.
Run the code above and observe the output. How do the values of the private and shared variables change when accessed by different threads? Does the value of the shared variable increase when being modified by multiple threads? Why?
There might also be a race condition here. The write and immediate read of the shared variable inside the parallel region is a write-read race condition: two or more threads access the same shared variable and modify it, and these accesses are unsynchronised. We will see a clearer example of a race condition in the Parallel region exercise.
Let's observe the following example
Take a moment and try to understand what is happening in the code above.
How do the values of the private and shared variables change when accessed by different threads?
Does the value of the shared variable increase when being modified by multiple threads? Why?
Does the value of the private variable increase when being modified by multiple threads?
Now go to the exercise, try it out and check if your answers were correct.
In this exercise you will get to practice using basic runtime functions, directive format, parallel constructs and clauses which we have learned so far.
The code for this exercise is under the following instructions in a Jupyter notebook. You will start from this provided Hello world template. What is the expected output?
Go to the exercise and set the desired number of threads to 4 using one of the runtime functions.
Set the variable i to the ID of the thread using one of the runtime functions.
Add a parallel region to make the code run in parallel.
Add OpenMP conditional compilation (#ifdef _OPENMP) around the inclusion of the OpenMP header file and the use of the runtime functions.
Before you run the program, what do you think will happen?
Now, run the program and observe the output. You can change the number of threads to 12 or other and observe the output.
Now make the variable i shared (declare it before the parallel region). What will happen? Observe the difference in the output. Why is the output different? Check if you get a race condition.
Race condition:
Two or more threads access the same shared variable, at least one thread modifies the variable, and the accesses are not synchronized.
The outcome of the program depends on the timing of the threads in the team.
This is caused by unintended sharing of data.
Don't worry if you always get a correct output: the compiler may use a private register on each thread instead of writing directly into memory.
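A minimal sketch of such a race condition (hypothetical, not the exercise code): several threads update a shared counter without synchronisation, so the final value may vary from run to run.

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int counter = 0;              /* shared by default */

    #pragma omp parallel num_threads(4)
    {
        for (int i = 0; i < 100000; i++)
            counter++;            /* unsynchronised read-modify-write: a race condition */
    }

    /* Expected 400000 (4 threads x 100000), but the printed value may be
       lower and may change from run to run.                              */
    printf("counter = %d\n", counter);
    return 0;
}
```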
We just covered the basics of OpenMP, runtime functions, constructs and directive format. This quiz tests your knowledge of OpenMP basics.
( ) ( … )
( ) [ … ]
(x) { … }
( ) < … >
( ) At the beginning of an OpenMP program, use the library function omp_get_num_threads(4) to set the number of threads to 4.
( ) At the beginning of an OpenMP program, use the library function num_threads(4) to set the number of threads to 4.
(x) In bash, export OMP_NUM_THREADS=4.
( ) At the beginning of an OpenMP program, use the library function omp_num_threads(4) to set the number of threads to 4.
(x) True
( ) False
(x) -fopenmp
( ) -o hello
( ) ./openmp
( ) None of the answers
(x) Single thread
( ) Two threads
( ) All threads
(x) True
( ) False
( ) 20
( ) 40
(x) 25
( ) 35
In the following steps we learn how to really organize our work in parallel. Please, share your ideas on how we can achieve that.
Do you know of possible ways of organizing work in parallel? How can the operations be distributed between threads? Is there a way to control the order of threads?
The work-sharing constructs divide the execution of the enclosed code region among the members of the thread team. These constructs do not launch new threads, and they must be enclosed dynamically within a parallel region. Some examples of the work-sharing constructs are:
Sections construct
We will first see a code example for using the sections construct, which we specify with the sections directive.
When we use the sections construct, multiple blocks of code are executed in parallel. When we specify a section and put a task into it, this specific task is executed by one thread. Another section then executes its task in a different thread. This way we can add sections inside our #pragma omp parallel code, specifying one section per block of work, each executed by an individual thread.
Image courtesy: Rolf Rabenseifner (HLRS)
In the example code above we can see that inside one section we have specified the variables a and b. When this code is executed, one thread executes that section with these variables, while the variables c and d are specified in a different section and are therefore handled by a different thread.
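A minimal sketch consistent with this description (the actual course code may differ):

```c
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {   /* this block is executed by one thread */
                int a = 1, b = 2;
                printf("section 1: a=%d b=%d\n", a, b);
            }
            #pragma omp section
            {   /* this block may be executed by a different thread */
                int c = 3, d = 4;
                printf("section 2: c=%d d=%d\n", c, d);
            }
        }   /* implicit barrier at the end of the sections construct */
    }
    return 0;
}
```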
For construct
In computer science, a for-loop is a control flow statement which specifies iteration, allowing code to be executed repeatedly. Such tasks, similar in action and executed multiple times, can be parallelised as well. In OpenMP, we use the for construct with #pragma omp. Simply put, a for construct can be seen as a parallelised for loop. We can specify the for construct as
Here we again start with #pragma omp, followed by the for keyword, and we can use different clauses again, i.e., private, shared and so on. The corresponding for-loop must have a canonical shape.
The loop iteration variable is private by default, so it belongs to only one thread, and it must not be modified inside the loop body. If it were shared and modified by every thread, our for-loop would get corrupted.
We have a few other clauses besides private. For example:
schedule: specifies how the iterations of the loop are divided among the threads.
collapse(n): the iterations of n nested loops are collapsed into one larger iteration space.
We can see an example of the for construct used in the code.
We start with #pragma omp parallel with a private variable named f. Then we use the #pragma omp for construct, followed by a for loop that goes from 0 to 10 (10 iterations). The private variable f is fixed in every thread, and the array a is updated in parallel. This works because each iteration accesses a different index of the array, so every thread writes only to its own elements, allowing us to update the array in parallel.
Image courtesy: Rolf Rabenseifner (HLRS)
Here we can see that if we are working with two threads and 10 iterations, the iterations are split between the two threads: 0 to 4 and 5 to 9. Each element of the array a is updated independently, and since the iterations do not depend on each other, each one modifies just one element, so we can update the whole array a quite easily in parallel.
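A minimal sketch consistent with the walkthrough above (the array length and the value of f are assumptions; the course code may differ):

```c
#include <stdio.h>

int main(void)
{
    double a[10];
    double f;

    #pragma omp parallel private(f)
    {
        f = 2.0;                        /* each thread has its own copy of f            */

        #pragma omp for
        for (int i = 0; i < 10; i++)    /* iterations are divided among the threads,    */
            a[i] = i * f;               /* e.g. 0-4 and 5-9 with two threads            */
    }

    for (int i = 0; i < 10; i++)
        printf("a[%d] = %f\n", i, a[i]);
    return 0;
}
```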
Example
Go to the provided examples and try to understand what is happening in the code. Run the examples and see if your understanding matches the actual output.
Sometimes in parallel programming, when dealing with multiple threads running in parallel, we want to pause the execution of threads or allow only one thread at a time to run a piece of code. This is achieved with so-called barriers and synchronisation constructs. Synchronisation can be achieved in two ways, i.e., through an implicit barrier or through explicit synchronisation.
We have already seen the use of an implicit barrier in the previous two examples. There is an implicit barrier at the end of the parallel construct, as well as at the end of the other worksharing constructs. In C/C++ the extent of the construct is delimited by curly brackets: as we saw in the previous examples, the opening { marks the entry into the parallel region, and the last } is the implicit barrier that marks the end of the parallel construct and the return to serial execution of the code. Implicit synchronisation can be removed with a nowait clause, but we will not discuss it in this section.
For explicit synchronisation we can use the critical directive. The code enclosed in a critical section is executed by all threads, but only by one thread at a time. The critical directive in C/C++ is written as
Let's go over this code quickly.
We see that we have specified the variables cnt and f, and in the parallel region we used the for construct to do the iteration. Inside the if statement we specified #pragma omp critical for the next line, which is cnt++. We can observe what is happening in the execution of the threads in the image below.
Image courtesy: Rolf Rabenseifner (HLRS)
Before we enter the #pragma omp parallel region, we are in serial execution, so that part is executed serially. Then we enter our parallel region. Everything is executed in parallel until the first thread encounters the cnt++ statement. At this point the cnt++ statement is executed by the first thread that reaches it. During this time, a second thread cannot access it, because cnt is already being modified by the first thread. After the first thread finishes the critical operation, the next thread gets access to the cnt variable and modifies it. After all the threads have executed the cnt++ statement, the rest of the code continues to execute in parallel. It continues until we reach the implicit barrier at the end, following which we return to serial execution.
It is thanks to the critical directive that only one thread at a time executes the update of the cnt variable. Therefore, when we use the critical directive in a parallel program, only one thread at a time will be able to execute the part of the code specified in the critical section.
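A minimal sketch of the pattern described above (the loop bounds and the condition inside the if are assumptions; the course code may differ):

```c
#include <stdio.h>

int main(void)
{
    int cnt = 0;
    double f = 7.0;

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 20; i++)
        {
            if (i % 2 == 0)           /* some condition; the exact test may differ */
            {
                #pragma omp critical
                cnt++;                /* only one thread at a time updates cnt */
            }
        }
    }   /* implicit barrier, then back to serial execution */

    printf("cnt = %d, f = %f\n", cnt, f);
    return 0;
}
```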
Go to the provided examples and try to understand what is happening in the code. Run the examples and see if your understanding matches the actual output. Have fun and experiment.
OpenMP specifies a number of scoping rules on how directives may associate (bind) with and nest within each other. Incorrect programs may result if the OpenMP binding and nesting rules are ignored. The following terms are used to explain the extent of OpenMP directives.
Static (Lexical) Extent:
The code textually enclosed between the beginning and the end of a structured block following a directive.
The static extent of a directive does not span multiple routines or code files.
Dynamic Extent:
The dynamic extent of a directive further includes the routines called from within the construct.
It includes both its static (lexical) extent and the extents of its orphaned directives.
Orphaned Directive:
An OpenMP directive that appears independently of another enclosing directive is said to be an orphaned directive. Orphaned directives lie inside the dynamic extent but not within the static extent.
They can span routines and possibly code files.
Let's explain with this example program. We have 2 subroutine calls and both are parallelized.
Program Test:
These are the two subroutines sub1 and sub2.
In this example:
The static extent of our parallel region is exactly the code textually enclosed in it, including the subroutine calls. The FOR directive occurs within the enclosing PARALLEL region.
The dynamic extent of our parallel region is the static extent plus the two subroutines that are called inside the parallel region. The CRITICAL and SECTIONS directives occur within the dynamic extent of the FOR and PARALLEL directives.
In the dynamic extent, but not in the static extent, we have the orphaned CRITICAL and SECTIONS directives.
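The course example appears to use Fortran-style subroutines; an analogous sketch in C of the structure described above might look like this:

```c
#include <stdio.h>

void sub1(int n);
void sub2(void);

int main(void)                 /* "program test" */
{
    #pragma omp parallel       /* static extent of the parallel region:        */
    {                          /* the block below, including both calls        */
        #pragma omp for
        for (int i = 0; i < 10; i++)
            sub1(i);

        sub2();
    }                          /* the dynamic extent additionally covers the
                                  code executed inside sub1() and sub2()       */
    return 0;
}

void sub1(int n)
{
    #pragma omp critical       /* orphaned directive: in the dynamic extent,   */
    printf("sub1: %d\n", n);   /* but not in the static extent                 */
}

void sub2(void)
{
    #pragma omp sections       /* orphaned directive */
    {
        #pragma omp section
        printf("sub2: section 1\n");
        #pragma omp section
        printf("sub2: section 2\n");
    }
}
```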
In this exercise you will get to practice using the worksharing construct for and the critical directive.
Pi is a mathematical constant, defined as the ratio of a circle's circumference to its diameter. It also appears in many other areas of mathematics. There are also many integrals yielding Pi; one of them is shown below.
This integral can be approximated numerically using Riemann sum:
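For reference, the integral and Riemann sum commonly used for this exercise are given below; this is an assumption, since the course renders its formulas as images and the exact form may differ slightly.

\[ \pi = \int_0^1 \frac{4}{1+x^2}\, dx \approx h \sum_{i=1}^{n} \frac{4}{1 + x_i^2}, \qquad x_i = \left(i - \tfrac{1}{2}\right) h \]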
Here, n is the number of intervals and h = 1/n.
The code above calculates the solution of the integral in serial. This template is the starting point for this exercise. The heavy part of the computation is performed in the for loop, so this is the part that needs parallelisation.
Go to the exercise and add a parallel region and a for directive to the part that computes Pi. Is the calculation of Pi correct? Test it more than once, change the number of threads to 2 or 12, and try to find the race condition.
Add a private(x) clause. Is the result still incorrect?
Add a critical directive around the sum statement and compile. Is the value of Pi correct now? What is the CPU time? How can you optimise your code?
Move the critical directive outside the for loop to decrease the computational time.
Compare the CPU time for the template program and CPU time for our solution. Have we significantly optimized our code?
This quiz covers various aspects of worksharing directives that have been discussed so far this week.
#pragma omp for is
( ) Loop work is to be divided into user defined sections
( ) Work to be done in a loop when done, don’t wait
(x) Work to be done in a loop
#pragma omp sections?
(x) Loop work is to be divided into user defined sections
( ) Work to be done in a loop when done, don’t wait
( ) Work to be done in a loop
( ) read input, compute results, write output
( ) read input, read input, compute results, write output, write output
(x) read input, compute results, compute results, write output
( ) Error in program
#pragma omp for nowait?
( ) Loop work is to be divided into user defined sections
(x) Work to be done in a loop when done, don’t wait
( ) Work to be done in a loop
Which directive must enclose #pragma omp sections?
( ) #pragma omp section
(x) #pragma omp parallel
( ) None
( ) #pragma omp master
( ) #pragma omp parallel
(x) #pragma omp barrier
( ) #pragma omp critical
( ) #pragma omp sections
(x) 10, 10
( ) 10, 40
( ) 40, 10
( ) 40, 40
We have already learned about the private clause where we can specify that each thread should have its own instance of a variable. We have also learned about the shared clause where we can specify that one or more variables should be shared among all threads. This is normally not needed because the default scope is shared.
There are several exceptions:
stack (local) variables in called subroutines are automatically private
automatic variables within a block are private
the loop control variables of parallel FOR loops are private
Private clause
The private clause always creates a local instance of the variable. For each thread a new variable is created with an uninitialized value. This means these private variables have nothing to do with the original variable except they have the same name and type.
firstprivate(var)
specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value of the shared variable existing before the parallel construct.
lastprivate(var)
specifies that the variable's value after the parallel construct is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or the last section (#pragma omp sections).
Nested private(var) clauses with the same variable name allocate new private storage again.
Let's explain by observing the following code.
Take a moment and try to guess the values of variables after the parallel region. Note the usage of the data scope clauses.
var_shared
is a shared variable and it is normally updated by the parallel region.
var_private
is specified as private so every thread has its own instance and after the parallel region the value remains the same as before.
var_firstprivate is specified as firstprivate, so every thread's copy is initialised with the value from the shared scope, but after the parallel region the original value remains the same.
var_lastprivate is updated in the last iteration of the for loop, and that value is available for use after the parallel region.
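A minimal sketch of these data scope clauses (the variable names follow the description above; the course code may differ):

```c
#include <stdio.h>

int main(void)
{
    int var_shared = 0, var_private = 1, var_firstprivate = 2, var_lastprivate = 3;

    #pragma omp parallel shared(var_shared) private(var_private) firstprivate(var_firstprivate)
    {
        #pragma omp for lastprivate(var_lastprivate)
        for (int i = 0; i < 8; i++)
        {
            #pragma omp critical
            var_shared += 1;        /* the single shared instance is updated safely         */

            var_private = i;        /* thread-local, uninitialised copy; not written back   */
            var_firstprivate += i;  /* thread-local copy initialised to 2; not written back */
            var_lastprivate = i;    /* value of the last iteration (i == 7) is written back */
        }
    }

    /* prints: 8 1 2 7 */
    printf("%d %d %d %d\n", var_shared, var_private, var_firstprivate, var_lastprivate);
    return 0;
}
```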
The reduction clause is a data scope clause that can be used to perform some form of recurrence calculation in parallel. It defines the region in which a reduction is computed and specifies an operator and one or more reduction variables (a list). The syntax of the reduction clause is as follows:
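In C/C++ the clause has the general form reduction(operator : list); for example, reduction(+ : sum) adds the threads' private copies of sum into the original variable at the end of the region.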
Variables in list must not be private in the enclosing context. A private variable cannot be specified in a reduction clause. A variable cannot be specified in both a shared and a reduction clause.
For each list item, a private copy is created in each thread and initialised with the neutral value of the operator. The table below lists each operator and its initializer value.
Operator | Initializer |
---|---|
+ | var = 0 |
- | var = 0 |
* | var = 1 |
& | var = ~0 |
\| | var = 0 |
^ | var = 0 |
&& | var = 1 |
\|\| | var = 0 |
max | var = most negative number |
min | var = most positive number |
After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the operator.
Let's observe the following example:
The reduction variable is sum and the reduction operation is +. The reduction does the work automatically: it creates a private copy of sum in each thread for the loop, and at the end it adds the private partial sums to the global variable.
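A minimal sketch of such a sum reduction (the data and loop bounds are assumptions):

```c
#include <stdio.h>

int main(void)
{
    double a[100], sum = 0.0;

    for (int i = 0; i < 100; i++)
        a[i] = 1.0;                       /* fill with sample data */

    #pragma omp parallel
    {
        #pragma omp for reduction(+ : sum)
        for (int i = 0; i < 100; i++)
            sum += a[i];                  /* each thread accumulates a private partial sum */
    }                                     /* partial sums are combined into the shared sum */

    printf("sum = %f\n", sum);            /* prints 100.000000 */
    return 0;
}
```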
In this exercise you will get to practice a sum and subtract reduction within a combined parallel loop construct.
In the exercise we generate a number of people, and these people subtract from our number of apples. What you need to do is parallelise the code.
Then answer this:
Combined constructs are shortcuts for specifying one construct immediately nested inside another construct. Specifying a combined construct is semantically identical to specifying the first construct that encloses an instance of the second construct and no other statements. Most of the rules, clauses and restrictions that apply to both directives are in effect. The parallel
construct can be combined with one of the worksharing constructs, for example for
and sections
.
parallel for
When we are using a parallel region that contains only a single for directive, we can substitute the two separate directives with this combined directive:
This directive accepts all the clauses of the parallel directive and the for directive, except that the nowait clause is not allowed.
This combined directive must be placed directly in front of the for loop. An example of the combined construct is shown below:
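A minimal sketch of the combined construct (illustrative arrays and loop bounds):

```c
#include <stdio.h>

int main(void)
{
    double a[1000], b[1000];

    for (int i = 0; i < 1000; i++)
        b[i] = i;

    /* parallel region and worksharing loop in a single combined directive */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * b[i];

    printf("a[999] = %f\n", a[999]);
    return 0;
}
```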
In this exercise you will get to practice using combined constructs. You will get to use the reduction
clause and combined construct parallel for
.
This is a continuation of the previous exercise when we computed Pi using worksharing constructs and critical directive. You will start from the provided solution of that exercise and use the newly learned constructs.
Go to the exercise and remove the critical directive and the additional partial sum variable. Then add the reduction
clause and compile. Is the value of Pi correct?
Now change the parallel region, so that you use the combined construct parallel for
and compile.
In this exercise you will get to practice using directives and clauses that we have learned so far, such as parallel
, for
, single
, critical
, private
and shared
. It is your job to recognize where each of those are required.
The heat equation is a partial differential equation that describes how the temperature varies in space over time. It can be written as
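In two dimensions, with thermal diffusivity \(\alpha\), a common form is shown below; the exact notation in the course material may differ.

\[ \frac{\partial \varphi}{\partial t} = \alpha \left( \frac{\partial^2 \varphi}{\partial x^2} + \frac{\partial^2 \varphi}{\partial y^2} \right) \]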
This program solves the heat equation using an explicit scheme (forward in time, centred in space) on a unit square domain.
The initial condition is very simple. Everywhere inside the square the temperature equals f=0
and on the edges the temperature is f=x
. This means the temperature goes from 0
to 1
in the direction of x
.
The source code is in places hard-coded for the purpose of faster loop iteration. Your goal is to:
parallelize the program
use different parallelization methods with respect to their effect on execution times
The code above calculates the temperature for a grid of points, the main part of the code being the time-step iteration. dphi is the temperature increment and phi is the temperature. We add dphi to the phi array and save the result in the new phin array. Then, in the next for loop, we exchange the roles of the old and the new array (restoring the data).
1. Go to the exercise and parallelize the code.
2. Parallelize all of the for loops and use critical section for global maximum. Think about what variables should be in the private clause.
Then run the example. Run it with 1, 2, 3, 4 threads and look at the execution time.
You may see that with more threads it is slower than expected. Do you have any idea about why the parallel version is slower by looking at the code?
The sequence of the nested loops is wrong. In C/C++, the last array index runs the fastest, so the k loop should be the innermost loop. This is not fixed by the OpenMP compiler, so you will need to do it yourself:
3. Interchange the sequence of the nested loops.
Run the code again with 1, 2, 3, 4 threads and look at the execution time.
Now the parallel version should be a little bit faster. The reason for only a slight improvement might be that the problem is too small and the parallelization overhead is too large.
This quiz tests your knowledge on OpenMP data environment and combined constructs.
( ) private
( ) local
(x) shared
( ) firstprivate
(x) 24
( ) 0
( ) 10
( ) 4
What does the nowait clause do?
( ) Skips to the next OpenMP construct
( ) Prioritizes the following OpenMP construct
( ) Removes the synchronization barrier from the previous construct
(x) Removes the synchronization barrier from the current construct
[ ] a: shared
[x] a: private
[x] b: shared
[ ] b: private
[ ] c: shared
[ ] c: private
[x] c: reduction
[ ] d: shared
[x] d: private
( ) a=0, b=23, c=-3
( ) a=44, b=23, c=84
( ) a=0, b=23, c=42
(x) a=0, b=1, c=42
( ) private
( ) firstprivate
(x) lastprivate
( ) default
(x) True
( ) False
( ) Data dependency in #pragma omp for
(x) Data conflict in #pragma omp critical
( ) Data race in #pragma omp parallel
( ) Deadlock in #pragma omp parallel
Tasking allows the parallelization of applications where work units are generated dynamically, as in recursive structures or while loops.
In OpenMP an explicit task is defined using the task directive.
The task directive defines the code associated with the task and its data environment. When a thread encounters a task directive, a new task is generated. The task may be executed immediately or at a later time. If the task execution is delayed, the task is placed in a conceptual pool of sleeping tasks that is associated with the current parallel region. The threads in the current team take tasks out of the pool and execute them until the pool is empty. The thread that executes a task might be different from the one that originally encountered it.
The code associated with the task construct is executed only once. A task is said to be tied if it is executed by the same thread from beginning to end. A task is untied if the code can be executed by more than one thread, so that different threads may execute different parts of the code. By default, tasks are tied.
We also want to mention that there are several task scheduling points where a task can be switched from a running state into a sleeping state and back again:
In the generating task: after it generates an explicit task, the generating task can be put into a sleeping state.
In the generated task: after the last instruction of the task region.
If task is untied: everywhere inside the task.
In implicit and explicit barriers.
In taskwait.
Completion of a task can be guaranteed using task synchronization constructs such as taskwait
directive. The taskwait construct specifies a wait on the completion of child tasks of the current task. The taskwait construct is a stand-alone directive.
There are additional clauses that are available with the task directive:
untied
If the task is tied, it is guaranteed that the same thread will execute all the parts of the task. So, the untied clause allows code to be executed by more than one thread.
default (shared | none | private | firstprivate)
Default defines the default data scope of a variable in each task. Only one default clause can be specified on an OpenMP task directive.
shared (list)
Shared declares the scope of the comma-separated data variables in list to be shared across all threads.
private (list)
Private declares the scope of the data variables in list to be private in each thread.
firstprivate (list)
Firstprivate declares the scope of the data variables to be private in each thread. Each new private object is initialized with the value of the original variable.
if (scalar expression)
Only if the scalar expression evaluates to true is the task started; otherwise the code is executed sequentially. This is useful for good load balancing while limiting the parallelization overhead, by creating only a limited number of tasks in total.
In the following example, the tasking concept is used to compute Fibonacci numbers recursively.
The parallel directive is used to define the parallel region, which will be executed by four threads. Inside the parallel construct, the single directive is used to indicate that only one of the threads will execute the print statement that calls fib(n).
In the code, two tasks are generated using the task directive. One of the tasks computes fib(n-1) and the other computes fib(n-2). The return values of both tasks are then added together to obtain the value returned by fib(n). Every time the functions fib(n-1) and fib(n-2) are called, two tasks are generated recursively, until the argument passed to fib() is less than 2.
Furthermore, the taskwait directive ensures that the two generated tasks are completed first, before moving on to the next stage of the recursive computation.
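A minimal sketch of this pattern, close to the standard OpenMP Fibonacci example (n and the number of threads are illustrative):

```c
#include <stdio.h>
#include <omp.h>

int fib(int n)
{
    int x, y;
    if (n < 2)
        return n;

    #pragma omp task shared(x)      /* first child task computes fib(n-1)   */
    x = fib(n - 1);

    #pragma omp task shared(y)      /* second child task computes fib(n-2)  */
    y = fib(n - 2);

    #pragma omp taskwait            /* wait until both child tasks are done */
    return x + y;
}

int main(void)
{
    int n = 10;

    omp_set_num_threads(4);

    #pragma omp parallel
    {
        #pragma omp single          /* only one thread starts the recursion */
        printf("fib(%d) = %d\n", n, fib(n));
    }
    return 0;
}
```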
Go to the example to see it being done step by step and try it out for yourself.
The following exercise shows how to traverse a tree-like structure using explicit tasks.
In the previous step we looked at the Fibonacci example, now we traverse a linked list computing a sequence of Fibonacci numbers at each node.
Parallelize the provided program using a parallel region, tasks and other directives. Then compare your solution's complexity to the approach without tasks.
Go to the exercise and parallelize the part where we do process work for all the nodes.
The printing of the number of threads should be only done by the master thread. Think about what else must be done by one thread only.
Add a task directive.
Did the parallelization give faster results?
This quiz tests your knowledge on OpenMP tasking with which we will finish this week’s material.
( ) True
(x) False
Which tasks are guaranteed to be completed at the taskwait construct?
( ) All tasks of the same thread team.
( ) All descendant tasks.
(x) The direct child tasks.
What is the data scope of x in the task region and what is printed at the end?
( ) shared, x=3
( ) firstprivate, x=3
(x) firstprivate, x=42
What is the data scope of y in the task region and what is printed at the end?
( ) shared, y=42
(x) shared, y=168
( ) firstprivate, y=168
Which tasks are guaranteed to be completed at the barrier construct?
( ) All existing tasks are guaranteed to be completed at barrier exit.
(x) All tasks of the current thread team are guaranteed to be completed at barrier exit.
( ) Only the direct child tasks are guaranteed to be completed at barrier exit.
With this test we will check your knowledge about using OpenMP for parallel programming.
Test available on FutureLearn in the MOOC Introduction to Parallel Programming.
In Week 2 we presented the concepts and the programming and execution model of OpenMP in detail. With hands-on examples we have tried to show how to use this paradigm as efficiently as possible.
Please, discuss the OpenMP parallel programming paradigm and try to summarize its potential in general or maybe specifically for your applications.
We are also very much interested to know whether you found the Week 2 content useful.