Originally posted by: Michael_Wong
So much has been happening in OpenMP since SC 12 that I hope to capture it all in this post while flying to the ADC C++ meeting where I will talk about C++14, ISOCPP.org, and Transactional Memory.
First, the research arm of OpenMP is IWOMP, the annual research conference. You probably know by now that IWOMP 2013 will be held in Canberra, Australia, outside its usual June time frame. This means there is still time (until May 10) to submit new proposals. So if you have some research in OpenMP that should be exposed, please submit a paper.
When we last spoke, you heard that OpenMP has introduced a Technical Report process to improve its agility at issuing interim specifications, and more importantly to obtain user feedback. We used that process to introduce TR1 for accelerator support. We also released Release Candidate 1 which had 31 feature/defect fixes.
Since then, we had the Houston F2F meeting in January 2013, where we gathered to complete the work of
-
Incorporate feedback for accelerators and strengthen NVIDIA support, where synchronization between teams is not implicit
-
Complete work on cancellation
-
Improve taskgroup support
-
Improve Fortran 2008 support
-
Fully specify affinity
-
Improve SIMD
-
Generalize tooling and debugger support
The outcome of that meeting was Release Candidate 2, which is currently going through public comment and could be released in June or July as OpenMP 4.0. OpenMP, the de facto standard for shared memory systems, will extend its reach beyond pure HPC to include embedded systems, real-time systems, and accelerators.
OpenMP wants to become suitable for a wide range of applications, from simulation and medicine to biotech, automation, robotics, and financial analysis. I will describe some of the syntax that is coming in OpenMP 4.0. This will not be a complete description, as the features also include additional APIs that I may not cover. In the meantime, if this whets your appetite, I urge you to go to the OpenMP forum and give feedback on OpenMP 4.0 RC2 immediately, as the window is closing rapidly. Of course, any of this can still change until ratification, but it is fairly close to final form.
The key features that OpenMP 4.0 will have are:
-
Addition of support for accelerators. Finally, you can program a wide range of accelerators in a high-level language without dropping down to some low-level or proprietary incantation. This is a mechanism to describe regions of code where data and/or computation should be moved to another computing device. There are several prototype implementations of the accelerator proposal, as well as significant participation by all the major vendors, both hardware and software. Accelerator support involves many constructs and will require a separate posting. But at a glance, here are some of them:
Target Data, Target
These constructs create a device data environment for the extent of the region. The Target construct also executes the region on that device.
#pragma omp target data [clause[ [, ]clause] ,...] structured-block
#pragma omp target [clause[ [, ]clause] ,...] structured-block
clause:
device(integer-expression)
map([map-type:]list)
if(scalar-expression)
Target Update
Makes the corresponding list items in the device data environment consistent with their original list items.
#pragma omp target update motion-clause [, clause[ [, ]clause] ,...]
motion-clause:
to(list)
from(list)
clause:
device(integer-expression)
if(scalar-expression)
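As a rough illustration of how Target Data, Target, and Target Update fit together, here is a minimal C sketch. The function name `scale_twice` and its parameters are made up for this example, and the syntax follows the constructs as listed above, so details may differ from what is finally ratified. If no device is available, or the compiler ignores the pragmas, the same code simply runs on the host.

```c
/* Hypothetical helper: scales x[0..n-1] by a on the device, lets the host
   adjust x[0] in between, then scales everything by a once more. */
void scale_twice(float *x, int n, float a)
{
    /* Keep x[0:n] in the device data environment for the whole region. */
    #pragma omp target data map(tofrom: x[0:n])
    {
        /* x is already present on the device, so this map reuses it. */
        #pragma omp target map(tofrom: x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] *= a;

        /* Make the host copy consistent with the device copy. */
        #pragma omp target update from(x[0:n])

        x[0] += 1.0f;                      /* host-side change */

        /* Push only the modified element back to the device. */
        #pragma omp target update to(x[0:1])

        #pragma omp target map(tofrom: x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] *= a;
    }   /* tofrom: the final values are copied back to the host here */
}
```

The point of the enclosing Target Data region is that the array travels to the device once and back once, while Target Update moves only what changed in between.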
Declare Target
Specifies that variables and functions are mapped to a device.
#pragma omp declare target
declarations-definition-seq
#pragma omp end declare target
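A small sketch of Declare Target, assuming a C compiler that accepts the syntax above: the helper `square` is placed between the two directives so that a device version exists and can be called from inside a Target region. Both function names are hypothetical.

```c
#pragma omp declare target
/* Hypothetical helper that must be callable from device code. */
float square(float v)
{
    return v * v;
}
#pragma omp end declare target

/* Sums the squares of x[0..n-1] inside a target region. */
float sum_squares_on_device(const float *x, int n)
{
    float sum = 0.0f;
    #pragma omp target map(to: x[0:n]) map(tofrom: sum)
    for (int i = 0; i < n; ++i)
        sum += square(x[i]);
    return sum;
}
```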
Teams
Creates a league of thread teams where the master thread of each team executes the region.
#pragma omp teams [clause[ [, ]clause] ,...] structured-block
clause:
num_teams(integer-expression)
num_threads(integer-expression)
default(shared | none)
private(list)
firstprivate(list)
shared(list)
reduction(operator: list)
Distribute
Specifies that the iterations of one or more loops will be executed by the thread teams in the context of their implicit tasks. The iterations are distributed across the master threads of all teams that execute the teams region to which the distribute region binds.
#pragma omp distribute [clause[ [, ]clause] ,...] for-loops
clause:
private(list)
firstprivate(list)
collapse(n)
dist_schedule(kind[, chunk-size])
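To show how Teams and Distribute nest inside Target, here is a hedged sketch of a SAXPY loop. The function name `saxpy_teams` and the choice of four teams are arbitrary assumptions for illustration. The per-team thread count clause is left out on purpose, since that part of the syntax could still change before ratification.

```c
/* Hypothetical example: y = a*x + y, with the loop iterations divided
   among the master threads of a league of teams. */
void saxpy_teams(int n, float a, const float *x, float *y)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp teams num_teams(4)
    #pragma omp distribute
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Each iteration is executed by exactly one team, so the result matches the serial loop no matter how many teams the implementation actually creates.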
-
Atomic now supports sequential consistency. Atomics in OpenMP 3.1 and earlier have always been relaxed in nature. We never said so precisely, but it seems clear from the wording, and it now needs to be stated more precisely given the clarity that other languages are bringing with the C++11 and C11 memory models. OpenMP atomics can now be made sequentially consistent with an extra clause:
Atomic
Ensures that a specific storage location is updated atomically, rather than exposing it to the possibility of multiple, simultaneous writing threads.
#pragma omp atomic [read | write | update | capture] [seq_cst] expression-stmt
#pragma omp atomic capture [seq_cst] structured-block
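As an illustration of the capture form with the new clause, the following sketch reserves output slots with a sequentially consistent atomic. The function `collect_even` and its parameters are hypothetical. The order of the collected values is not deterministic when the loop runs in parallel.

```c
/* Hypothetical example: copies the even values of v[0..n-1] into out
   (in unspecified order) and returns how many were copied. */
int collect_even(const int *v, int n, int *out)
{
    int next = 0;                       /* next free slot in out */
    #pragma omp parallel for shared(next)
    for (int i = 0; i < n; ++i) {
        if (v[i] % 2 == 0) {
            int slot;
            /* Read the old value and increment it as one atomic step. */
            #pragma omp atomic capture seq_cst
            slot = next++;
            out[slot] = v[i];           /* each slot has a single writer */
        }
    }
    return next;
}
```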
-
Addition of error handling. Error handling capabilities will be defined to improve the resiliency and stability of OpenMP applications in the presence of system-level, runtime-level, and user-defined errors. This will be a multi-step process rolled out over several releases. It enables OpenMP to move into industrial applications, which must know how to handle error cases. For now, our first step is to roll out features that cleanly abort parallel OpenMP execution, based on conditional cancellation and user-defined cancellation points.
Cancel
Requests cancellation of the innermost enclosing region of the type specified. The cancel directive may not be used in place of the statement following an if, while, do, switch, or label.
#pragma omp cancel clause[ [, ]clause]
clause:
parallel
sections
for
taskgroup
if(scalar-expression)
Cancellation Point
Introduces a user-defined cancellation point at which tasks check if cancellation of the innermost enclosing region of the type specified has been requested. The cancellation point directive may not be used in place of the statement following an if, while, do, switch, or label.
#pragma omp cancellation point clause
clause:
parallel
sections
for
taskgroup
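A typical use of these two directives is an early exit from a parallel search. The sketch below is one possible shape for it; the function name `contains` is an assumption. Note that implementations of the ratified specification only act on cancel when cancellation has been enabled (OMP_CANCELLATION=true). Otherwise the loop simply runs to completion, which still gives the right answer here.

```c
/* Hypothetical example: returns 1 if key occurs in v[0..n-1], else 0. */
int contains(const int *v, int n, int key)
{
    int found = 0;
    #pragma omp parallel shared(found)
    {
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            if (v[i] == key) {
                #pragma omp atomic write
                found = 1;
                /* Ask the other threads to stop working on this loop. */
                #pragma omp cancel for
            }
            /* Threads check here whether cancellation was requested. */
            #pragma omp cancellation point for
        }
    }
    return found;
}
```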
-
Addition of thread affinity. Users will be given a way to define where to execute OpenMP threads. Platform-specific data and algorithm-specific properties are separated, offering a deterministic behavior and simplicity in use. The advantages for the user are better locality, less false sharing and more memory bandwidth. This will be achieved through two environment variables:
OMP_PROC_BIND bind [true | false | master, close, spread]
Sets the value of the global bind-var ICV. The value of this environment variable is either true, false, or a comma-separated list of master, close, or spread.
OMP_PLACES places
Sets the place-partition-var ICV that defines the OpenMP places that are available to the execution environment.
-
Addition of support for tasking extensions. The new tasking extensions being considered are task groups and a way to define dependent tasks. In the future, there could be reduction support for tasks and task-only threads. Task-only threads are threads that do not take part in worksharing constructs, but just wait for tasks to execute. For now, the following new depend clause has been added to the task construct:
Task
Defines an explicit task. The data environment of the task is created according to data-sharing attribute clauses on task construct and any defaults that apply.
#pragma omp task [clause[ [, ]clause] ...] structured-block
clause:
if(scalar-expression)
final(scalar-expression)
untied
default(shared | none)
mergeable
private(list)
firstprivate(list)
shared(list)
depend(dependence-type: list)
The list items that appear in the depend clause may include array sections.
dependence-type:
• in: The generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause.
• out and inout: The generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause.
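To make the dependence types concrete, here is a minimal sketch of a small task graph: one producer, two consumers that may run concurrently, and a final task that needs both results. The function `pipeline` and its variables are invented for the example.

```c
/* Hypothetical example: a = seed + 1, b = 2a, c = 3a, result = b + c. */
int pipeline(int seed)
{
    int a = 0, b = 0, c = 0, result = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a) shared(a)
        a = seed + 1;

        /* Both of these wait for the task that writes a. */
        #pragma omp task depend(in: a) depend(out: b) shared(a, b)
        b = a * 2;

        #pragma omp task depend(in: a) depend(out: c) shared(a, c)
        c = a * 3;

        /* Runs only after b and c have both been produced. */
        #pragma omp task depend(in: b, c) shared(b, c, result)
        result = b + c;
    }   /* implicit barrier: all tasks have completed here */
    return result;
}
```

The runtime works out the ordering from the list items alone, so no taskwait is needed between the stages.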
The taskwait construct, available since OpenMP 3.0, specifies a wait on the completion of child tasks of the current task. Now you can also group tasks so that the wait covers the descendants of those child tasks as well.
Taskgroup
Specifies a wait on the completion of child tasks of the current task and their descendant tasks.
#pragma omp taskgroup structured-block
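The difference from taskwait shows up when child tasks spawn further tasks without waiting for them. In the sketch below (all names are hypothetical), the recursive helper never waits, yet the code after the taskgroup can safely read every element because the taskgroup covers all descendants.

```c
/* Hypothetical helper: writes i*i into out[i] for lo <= i < hi, splitting
   the range into nested tasks and never waiting for them. */
static void square_range(long *out, int lo, int hi)
{
    if (hi - lo < 2) {
        if (lo < hi)
            out[lo] = (long)lo * lo;
        return;
    }
    int mid = lo + (hi - lo) / 2;
    #pragma omp task
    square_range(out, lo, mid);
    #pragma omp task
    square_range(out, mid, hi);
}

/* Fills out[0..n-1] with squares and returns their sum. */
long fill_and_sum(long *out, int n)
{
    long total = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            #pragma omp task
            square_range(out, 0, n);
        }   /* waits for the child task and all of its descendants */

        for (int i = 0; i < n; ++i)
            total += out[i];
    }
    return total;
}
```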
-
User-Defined Reduction. Have you ever wanted to use your own reduction operation on your own type, instead of just the built-in operators and built-in types as in 3.1? In 4.0 you can, and this is the syntax:
Declare Reduction
Declares a reduction-identifier that can be used in a reduction clause. Custom reductions can be defined using the declare reduction directive; the reduction-identifier and the type identify the declare reduction directive.
#pragma omp declare reduction(reduction-identifier : typename-list : combiner) [initializer-clause]
reduction-identifier: A base language identifier or one of the following operators: +, -, *, &, |, ^, && and ||. In C++, this may also be an operator-function-id
typename-list: A list of type names
combiner: An expression
initializer-clause: initializer ( omp_priv = initializer | function-name (argument-list ))
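As one possible use of this syntax, the sketch below reduces over a small struct that tracks a minimum and a maximum. The type `range`, the helpers `range_init` and `range_merge`, and the reduction-identifier `span` are all invented for the example. The function-call form of the initializer clause is used to set each private copy to the identity value.

```c
#include <limits.h>

typedef struct { int lo, hi; } range;

/* Identity value: an empty range that any element will widen. */
void range_init(range *r)
{
    r->lo = INT_MAX;
    r->hi = INT_MIN;
}

/* Combiner: widen *out so that it also covers *in. */
void range_merge(range *out, const range *in)
{
    if (in->lo < out->lo) out->lo = in->lo;
    if (in->hi > out->hi) out->hi = in->hi;
}

#pragma omp declare reduction(span : range : range_merge(&omp_out, &omp_in)) \
        initializer(range_init(&omp_priv))

/* Returns the smallest and largest values in v[0..n-1]. */
range find_range(const int *v, int n)
{
    range r;
    range_init(&r);
    #pragma omp parallel for reduction(span : r)
    for (int i = 0; i < n; ++i) {
        if (v[i] < r.lo) r.lo = v[i];
        if (v[i] > r.hi) r.hi = v[i];
    }
    return r;
}
```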
-
Query the Environment. Have you ever wondered what the environment variable settings are? Well, now you can find out:
OMP_DISPLAY_ENV
Instructs the runtime to display the OpenMP version number and the initial values of the ICVs, once, during initialization of the runtime.
SIMD
Applied to a loop to indicate that the loop can be transformed into a SIMD loop (that is, multiple iterations of the loop can be executed concurrently using SIMD instructions).
#pragma omp simd [clause[ [, ]clause] ...] for-loops
clause:
safelen(length)
linear(list[:linear-step])
aligned(list[:alignment])
private(list)
lastprivate(list)
reduction(operator: list)
collapse(n)
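For instance, a dot product is a natural candidate for the SIMD construct with a reduction clause. This is only a sketch, and the function name `dot` is an assumption. Whether the loop is actually vectorized is up to the compiler.

```c
/* Hypothetical example: dot product of x[0..n-1] and y[0..n-1]. */
float dot(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    /* Iterations may be executed concurrently in SIMD lanes; each lane
       keeps a partial sum that is combined at the end. */
    #pragma omp simd reduction(+: sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Because the partial sums are combined in a different order than in a serial loop, floating-point results can differ slightly for general inputs.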
Declare SIMD
Applied to a function or a subroutine to enable the creation of one or more versions that can process multiple arguments using SIMD instructions from a single invocation from a SIMD loop.
#pragma omp declare simd [clause[ [, ]clause] ...] [#pragma omp declare simd [clause[ [, ]clause] ...]] [...] function definition or declaration
clause:
simdlen(length)
linear(argument-list[:linear-step])
aligned(argument-list[:alignment])
uniform(argument-list)
reduction(operator: list)
inbranch
notinbranch
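A short sketch of Declare SIMD, with made-up names: the scalar function `axpy_one` is marked so the compiler may generate a vector version, with `a` declared uniform because it is the same for every lane, and notinbranch because the call is never conditional.

```c
/* Hypothetical scalar kernel; the directive asks for SIMD variants too. */
#pragma omp declare simd uniform(a) notinbranch
float axpy_one(float a, float x, float y)
{
    return a * x + y;
}

/* Calls the kernel from a SIMD loop: y = a*x + y. */
void saxpy_simd(int n, float a, const float *x, float *y)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = axpy_one(a, x[i], y[i]);
}
```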
Loop SIMD
Specifies a loop that can be executed concurrently using SIMD instructions and that those iterations will also be executed in parallel by threads in the team.
#pragma omp for simd [clause[ [, ]clause] ...] for-loops
clause:
Any accepted by the simd or for directives with identical meanings and restrictions.
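Combining the two levels looks roughly like this; `square_all` is a hypothetical name. The loop's iterations are divided among the threads of the team, and each thread may execute its share with SIMD instructions.

```c
/* Hypothetical example: out[i] = in[i] * in[i] for 0 <= i < n. */
void square_all(const float *in, float *out, int n)
{
    #pragma omp parallel
    {
        #pragma omp for simd
        for (int i = 0; i < n; ++i)
            out[i] = in[i] * in[i];
    }
}
```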
-
Addition of support for Fortran 2003. The Fortran 2003 standard adds many modern computer language features. Having these features in the specification allows users to parallelize Fortran 2003-compliant programs with OpenMP directives. This includes interoperability of Fortran and C, which is one of the most popular features in Fortran 2003.
In addition to the OpenMP 4.0 content, there is also an active Tool group meeting to define an official interface for analyzers and debuggers. This will likely result in a new TR that enables future additions to the specification while allowing many vendors to follow with uniform support for the language. This TR has gathered momentum among implementors and universities.
We already know that for the new release of OpenMP 4.0, we plan to decouple the examples from the specification to make their maintenance simpler. The examples will likely be issued as a separately maintained document.
There is more. OpenMP is also going through fundamental changes internally. We completed an internal survey of our members on the future of OpenMP, and we are redefining our mission statement. It used to be:
"Standardize and unify shared memory, thread-level parallelism for HPC"
Much of this probably needs to change in view of our penetration into the commercial market and the new memory architectures of accelerators.
Want even more? Once OpenMP 4.0 is out, we plan to begin work on 5.0 right away, starting at the June 2013 Niagara Falls meeting.
Where does OpenMP fit within the scheme of the world? Well, I and many others like to view it as a much more rapidly moving specification than an ISO Standard like C or C++. Yet it can popularize or commercialize experimental or company-specific features for parallelism by making such features widely available. We are already seeing this transfer as more OpenMP features move into C and C++, as witnessed by this C++ Standard trip report. Currently, there are proposals to add OpenMP-like semantics to future C and C++ standards, most likely merging with Cilk, the other popular high-level parallel language.
The parallel programming language world is clearly undergoing major tectonic shifts. But much of this effort is the work of hundreds of very talented experts who have been meeting weekly on the phone, meeting face-to-face and exchanging thousands of emails. Without them, this could not happen and we thank them for their dedication.