Some of our earlier work on parallel I/O:
Performance Characteristics of Parallel I/O on
Cray T3E and Effective User Strategies
(link),
Chris Ding.
Data Organization and I/O in a Parallel Ocean Circulation Model
(pdf),
Chris Ding and Yun He, Proceedings of SC'99.
Century- or millenium-long global climate simulations using the NCAR's Community Climate System Model (CCSM) generate tremendous amount of data. Efficient I/O is a crucial factor for such large-scale simulations on massively parallel machines, but CCSM currently uses sequential I/O through a single processor. Here we present a library which facilitates efficient parallel I/O.
In a distributed memory parallel environment, many applications rely on a serial I/O strategy, where the global array is gathered on a single processor and then written out to a file. I/O performance with this approach is largely limited by single PE I/O bandwidth. Even when parallel I/O is used, satisfactory parallel scaling is not always observed. Parallel I/O rates can depend sensitively on parallel decompositions. The best I/O rates are obtained when the field is decomposed along the last dimension (referred to here as "Z"). But usually this is not the case for arrays in many application codes.
In climate models a field in CPU resident memory is often in one index order but stored in a disk file in another order. For example, history data for NCAR's CAM's (Community Atmospheric Model) dynamic variables are in (longitude, height, latitude) order but must be written out to a file in (longitude, latitude, height) order. Changing index orders complicates a parallel I/O implementation and slows down I/O.
ZioLib facilitates an efficient parallel I/O for arrays in such
situations. In case of write, ZioLib remaps a distributed field into
a Z-decomposition on a subset of processors and from there writes to a
disk file in parallel (see the figure below). In this Z-decomposition,
the data layout of the remapped array on the staging processors' memory
is the same as on disk, thus only block data transfer occurs during
parallel I/O, achieving maximum efficiency. In case of read the reverse
steps are followed to have the required distributed array.
ZioLib has been tested on the following platforms:
To use the parallel netCDF library with ZioLib on an IBM SP (if the library being developed at NERSC is available on the machine), define ZIO_NETCDF_PAR_AIX (i.e., '#define ZIO_NETCDF_PAR_AIX') in zio.h. Otherwise, the serial netCDF library will be used.
User's manual is available in ps, pdf and html formats. See also our publication in pdf format:
Woo-Sun Yang and Chris Ding, 2003, ZioLib: a Parallel I/O Library, LBNL Tech Report, LBNL-53521
A LaTeX document for the manual can be retrieved from the source files using ProTeX (that is, 'protex -s zio.F90 zio_data.F90 zio_remap.F90 zio_binary.F90 zio_netcdf77.F90 > zio.tex').
Click here for pseudocodes as a quick guide.
The following plot shows total and remapping times for writing a
3D field with Fortran direct-access I/O mode, measured on the IBM SP at NERSC using
various numbers of total and I/O processors. The array size is 128 MB
(256×256×256), and the original and output index orders are (X,Z,Y) and (X,Y,Z), respectively.
The parallel decomposition is along latitude (Y).
Compared to the usual method where the root processor gathers all
the distributed arrays, changes index order and writes to a disk file
(this method corresponds to the number of I/O processors being one in
the plot), the parallel method is very effective. With 32 processors
write rates can be about 180 MB/s, big improvement from about 12 MB/s with
the conventional single-processor I/O method. Remapping time is 10-35 %
of total time. Below is a plot for parallel speed-up over the sequential
I/O method.
The experiments are repeated with the 128×64×26 grid (CAM's T42L26 resolution). Speeds up in writes by a factor of 3-4 and in reads by a factor of 6-7 is observed with respect to the single-PE I/O.
We have tested ZioLib for CAM2.0 history I/O using 8, 16 and 32 processors with the serial netCDF library (that is, one staging processor). Both the Eulerian (T42L26 resolution, Y-decomposition) and the Finite Volume (B26 resolution: 144×91×26; XY- and YZ-decompositions) dynamic cores were tried. Load balancing is turned off for physics chunking. Even with the serial netCDF library we observe a speed-up of history I/O with ZioLib by a factor of 1.5-2.5 with respect to the existing method.
Last modified on February 14, 2003.