Several HPF coding examples of practical scientific algorithms and applications are examined in some detail, both to understand how to use HPF effectively and to assess the performance of the HPF implementation in these particular situations. Message passing codes performing the same tasks as the HPF codes are also developed, providing timings that serve as the achievable performance baseline. The key idea of this work is that, on these simple but non-trivial examples, we can understand fairly well a number of issues such as different data distributions, different parallel constructs, and different programming styles (static vs. dynamic allocation).
The suite of coding examples covers a reasonable cross-section of application areas. It includes a 2D static heat distribution (stencil solution of the Laplace equation), a long-range N-body problem (pair-wise interactions between all particles), LU factorization of a dense matrix, and several standard vector/matrix routines such as dot product and matrix-vector multiplication. Multi-dimensional array redistribution is also investigated.
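To make the heat example concrete, the following is a minimal HPF sketch of a Jacobi stencil sweep for the 2D Laplace equation; the array names, size n, iteration count, and the (BLOCK,BLOCK) distribution are illustrative assumptions, not the paper's actual code:

```fortran
program heat_sketch
  ! Sketch: Jacobi sweeps for the 2D Laplace (heat) problem in HPF.
  ! Size, iteration count, and distribution are assumptions for illustration.
  integer, parameter :: n = 1000
  integer :: iter
  real :: u(0:n+1,0:n+1), unew(0:n+1,0:n+1)
!HPF$ DISTRIBUTE u(BLOCK,BLOCK)
!HPF$ ALIGN unew(i,j) WITH u(i,j)

  u = 0.0
  u(0,:) = 1.0          ! fixed boundary temperature on one edge

  do iter = 1, 100
     ! Whole-array stencil update; the HPF compiler inserts the
     ! boundary (halo) communication between processors.
     unew(1:n,1:n) = 0.25 * ( u(0:n-1,1:n) + u(2:n+1,1:n) &
                            + u(1:n,0:n-1) + u(1:n,2:n+1) )
     u(1:n,1:n) = unew(1:n,1:n)
  end do
end program heat_sketch
```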
Compared to the effort of programming message passing codes, HPF is much easier to understand and to use. All coding examples are trivially parallelized by inserting the appropriate HPF directives. Redistributing a multi-dimensional array is a simple array assignment, so a parallel 3D FFT and/or an ADI solution for PDEs can be written easily. We found the FORALL construct far less expressive and useful than DO INDEPENDENT, because most practical do loops involve summations, temporary scalars, etc. The performance of the two parallel constructs can differ by as much as 30%, depending on the data distribution. Interfacing to sequential library routines, such as the BLAS, is also difficult, which precludes highly efficient codes for some problems.
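A hedged sketch of the constructs just discussed, with hypothetical names, size, and distributions; it shows why FORALL is restricted to elemental assignments, how DO INDEPENDENT admits a richer loop body, and how redistribution reduces to an array assignment:

```fortran
program constructs_sketch
  ! Sketch of FORALL vs. DO INDEPENDENT and of array redistribution.
  ! Names, size, and distributions are illustrative assumptions.
  integer, parameter :: n = 512
  integer :: i, j
  real :: a(n,n), b(n,n), x(n), y(n)
!HPF$ DISTRIBUTE a(BLOCK,*)
!HPF$ DISTRIBUTE b(*,BLOCK)

  x = 1.0

  ! FORALL allows only elemental assignments -- no summations or
  ! temporary scalars in the body.
  forall (i = 1:n, j = 1:n) a(i,j) = real(i + j)

  ! DO INDEPENDENT admits a full loop body, e.g. a per-row reduction.
!HPF$ INDEPENDENT
  do i = 1, n
     y(i) = sum(a(i,:) * x)
  end do

  ! Redistributing from (BLOCK,*) to (*,BLOCK) is a single assignment.
  b = a
end program constructs_sketch
```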
The performance of the HPF codes is reasonable, i.e., close to that of hand-written message passing codes, for many of the examples, including LU factorization, the vector/matrix library routines, and multi-dimensional data redistribution. However, great care must be taken in coding HPF to achieve reasonable performance. (1) Whenever possible, Fortran 90 intrinsic functions should be used. For example, a matrix-vector multiply using explicit indexing and do loops is several times slower than using the intrinsic function MATMUL(). (2) Whenever possible, whole-array expressions should be used. For example, in redistributing a 3D array, an expression like B=A runs orders of magnitude faster than one with explicit indexing. (3) Allocatable arrays sometimes slow down the calculation by orders of magnitude. All of these dictate a rigid static-allocation programming style, instead of a flexible dynamic one.
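The first two points can be sketched as follows; the size and distribution are assumptions for illustration, and the two versions compute the same matrix-vector product:

```fortran
program matvec_sketch
  ! Sketch contrasting the two coding styles compared above.
  ! Size and distribution are illustrative assumptions.
  integer, parameter :: n = 1024
  integer :: i, j
  real :: a(n,n), x(n), y(n)
!HPF$ DISTRIBUTE a(BLOCK,*)

  a = 1.0
  x = 2.0

  ! Preferred: the Fortran 90 intrinsic, which the HPF compiler
  ! can recognize and parallelize efficiently.
  y = matmul(a, x)

  ! Several times slower in practice: explicit indexing with do loops.
  do i = 1, n
     y(i) = 0.0
     do j = 1, n
        y(i) = y(i) + a(i,j) * x(j)
     end do
  end do
end program matvec_sketch
```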
On the less regular data/communication patterns, the stencil calculation in the heat problem and the long-range N-body problem, the HPF codes perform considerably less efficiently than the hand-written message passing codes, running about a factor of 2.5 slower. Careful examination of scaling to large numbers of processors shows that the efficiency of the HPF codes slowly decreases. For example, on a 2000x2000 heat problem, the HPF slowdown (ratio of HPF code timing to MPI code timing) increases from 2.4 on 4 processors to 3.4 on 32 processors.
This work was supported by the Director, Office of Computational and Technology Research, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC03-76SF00098.