ZioLib: a Parallel I/O Library

by Woosun Yang and Chris Ding

Some of our earlier work on parallel I/O:
Performance Characteristics of Parallel I/O on Cray T3E and Effective User Strategies (link), Chris Ding.
Data Organization and I/O in a Parallel Ocean Circulation Model (pdf), Chris Ding and Yun He, Proceedings of SC'99.

Introduction

Century- or millennium-long global climate simulations using NCAR's Community Climate System Model (CCSM) generate a tremendous amount of data. Efficient I/O is crucial for such large-scale simulations on massively parallel machines, but CCSM currently uses sequential I/O through a single processor. Here we present a library that facilitates efficient parallel I/O.

In a distributed-memory parallel environment, many applications rely on a serial I/O strategy, in which the global array is gathered on a single processor and then written out to a file. I/O performance with this approach is largely limited by the I/O bandwidth of a single processing element (PE). Even when parallel I/O is used, satisfactory parallel scaling is not always observed, because parallel I/O rates can depend sensitively on the parallel decomposition. The best I/O rates are obtained when the field is decomposed along the last dimension (referred to here as "Z"), but arrays in many application codes are not decomposed this way.
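The sensitivity to decomposition can be illustrated by counting how many separate contiguous pieces of the file one processor's slab occupies. The sketch below is only an illustration, not part of ZioLib: the function name `contiguous_runs` is hypothetical, and it assumes the file stores the array in Fortran order (first index varying fastest) and that the slab covers only part of the decomposed dimension.

```python
from math import prod

def contiguous_runs(shape, dim, lo, hi):
    """Return (number of runs, elements per run) for a slab in the file.

    `shape` lists the file dimensions fastest-varying first (Fortran order).
    The slab covers lo <= index < hi along dimension `dim` and the full
    extent of every other dimension. Assumes hi - lo < shape[dim].
    """
    # Dimensions faster than `dim` are fully covered, so they merge with
    # the slab's own extent into a single contiguous run.
    run_elems = prod(shape[:dim]) * (hi - lo)
    # Every index combination of the slower dimensions starts a new run.
    n_runs = prod(shape[dim + 1:])
    return n_runs, run_elems

shape = (256, 256, 256)                  # (X, Y, Z) as stored on disk
print(contiguous_runs(shape, 2, 0, 32))  # Z-slab: (1, 2097152)
print(contiguous_runs(shape, 1, 0, 32))  # Y-slab: (256, 8192)
print(contiguous_runs(shape, 0, 0, 32))  # X-slab: (65536, 32)
```

A slab along Z maps to one large block of the file, whereas a slab along Y or X breaks into many smaller pieces.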

In climate models, a field is often held in memory in one index order but stored in a disk file in another. For example, history data for the dynamic variables of NCAR's Community Atmosphere Model (CAM) are in (longitude, height, latitude) order in memory but must be written to a file in (longitude, latitude, height) order. Changing the index order complicates a parallel I/O implementation and slows down I/O.
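As a minimal sketch of what the index-order change involves, the following pure-Python function copies a flat array from (X,Z,Y) order to (X,Y,Z) order, with the first index varying fastest as in Fortran. The function name `xzy_to_xyz` is hypothetical and is not taken from ZioLib.

```python
def xzy_to_xyz(mem, nx, ny, nz):
    """Reorder a flat (X,Z,Y) array into a flat (X,Y,Z) array.

    Both layouts are first-index-fastest, so element (x, y, z) sits at
    x + nx*(z + nz*y) in the source and x + nx*(y + ny*z) in the result.
    """
    out = [None] * (nx * ny * nz)
    for y in range(ny):
        for z in range(nz):
            src = nx * (z + nz * y)   # start of one X-row in memory order
            dst = nx * (y + ny * z)   # start of the same X-row in file order
            out[dst:dst + nx] = mem[src:src + nx]
    return out
```

Only rows along X stay contiguous under this permutation, so the copy is a strided gather of nx-element pieces rather than a single block move.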

ZioLib

ZioLib facilitates efficient parallel I/O for arrays in such situations. For a write, ZioLib remaps a distributed field into a Z-decomposition on a subset of processors (the staging processors) and writes from there to a disk file in parallel (see the figure below). In this Z-decomposition, the data layout of the remapped array in the staging processors' memory is the same as on disk, so only block data transfers occur during parallel I/O, which achieves maximum efficiency. For a read, the steps are reversed to produce the required distributed array.


Writing the global field of distributed array a(X,Z,Y) to a disk file in (X,Y,Z) index order using ZioLib.
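The write path described above can be mimicked in a few lines of serial Python. This is a toy simulation under stated assumptions, not ZioLib's interface: the names `split`, `value` and `write_via_staging` are hypothetical, processors are represented by list entries, the interprocessor communication is replaced by plain copies, and the "disk file" is a flat list in (X,Y,Z) order.

```python
def split(n, parts):
    """Split range(n) into `parts` nearly equal half-open intervals."""
    bounds = [n * p // parts for p in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

def value(x, y, z):
    """Hypothetical field value, unique per grid point, used for checking."""
    return (x, y, z)

def write_via_staging(nx, ny, nz, n_compute, n_staging):
    # Each compute processor holds a Y-slab in (X,Z,Y) order, X fastest.
    y_slabs = split(ny, n_compute)
    local = [[value(x, y, z)
              for y in range(y0, y1) for z in range(nz) for x in range(nx)]
             for (y0, y1) in y_slabs]

    # Each staging processor holds a Z-slab in (X,Y,Z) order, as on disk.
    z_slabs = split(nz, n_staging)
    staged = [[None] * (nx * ny * (z1 - z0)) for (z0, z1) in z_slabs]

    # Remap step: copy X-rows from every Y-slab into the owning Z-slab.
    for (y0, y1), buf in zip(y_slabs, local):
        for s, (z0, z1) in enumerate(z_slabs):
            for z in range(z0, z1):
                for y in range(y0, y1):
                    src = nx * (z + nz * (y - y0))
                    dst = nx * (y + ny * (z - z0))
                    staged[s][dst:dst + nx] = buf[src:src + nx]

    # Write step: each staging processor writes one contiguous block.
    disk = [None] * (nx * ny * nz)
    for (z0, z1), block in zip(z_slabs, staged):
        offset = nx * ny * z0
        disk[offset:offset + len(block)] = block
    return disk

disk = write_via_staging(nx=4, ny=6, nz=5, n_compute=3, n_staging=2)
```

All the index shuffling happens in the remap step, so the write step reduces to one block copy per staging processor at an offset that depends only on the start of its Z-range.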

Advantages

Features

The parallel netCDF library for which ZioLib provides wrappers is the one being developed at LBNL/NERSC. At present, it works on an IBM SP.

Source Codes and User's Manual

Click here to download the source code (a gzipped tar file). Please see the copyright information here.

ZioLib has been tested on the following platforms:

To use the parallel netCDF library with ZioLib on an IBM SP (provided the library being developed at NERSC is available on the machine), define ZIO_NETCDF_PAR_AIX in zio.h (i.e., '#define ZIO_NETCDF_PAR_AIX'). Otherwise, the serial netCDF library is used.

The user's manual is available in ps, pdf and html formats. See also our publication, available in pdf format:

Woo-Sun Yang and Chris Ding, 2003, ZioLib: a Parallel I/O Library, LBNL Tech Report, LBNL-53521

A LaTeX document for the manual can be generated from the source files using ProTeX (that is, 'protex -s zio.F90 zio_data.F90 zio_remap.F90 zio_binary.F90 zio_netcdf77.F90 > zio.tex').

Click here for pseudocode as a quick guide.

Performance

The following plot shows total and remapping times for writing a 3D field with Fortran direct-access I/O mode, measured on the IBM SP at NERSC using various numbers of total and I/O processors. The array size is 128 MB (256×256×256), and the original and output index orders are (X,Z,Y) and (X,Y,Z), respectively. The parallel decomposition is along latitude (Y).

Compared to the usual method, in which the root processor gathers all the distributed arrays, changes the index order and writes to a disk file (this corresponds to one I/O processor in the plot), the parallel method is very effective. With 32 processors, write rates reach about 180 MB/s, a roughly 15-fold improvement over the approximately 12 MB/s of the conventional single-processor I/O method. Remapping takes 10-35% of the total time. Below is a plot of the parallel speed-up over the sequential I/O method.

The experiments were repeated with a 128×64×26 grid (CAM's T42L26 resolution). Relative to single-PE I/O, speed-ups by a factor of 3-4 in writes and 6-7 in reads are observed.
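The sizes and ratios quoted in this section can be checked with a few lines of arithmetic. The sketch below assumes 8-byte values, which is our assumption rather than something stated above (it is the element size that makes a 256×256×256 field come to 128 MB), and it treats MB as 2^20 bytes. The helper name `field_size_mb` is hypothetical.

```python
MB = 1024 * 1024  # assumption: MB is taken to mean 2**20 bytes

def field_size_mb(nx, ny, nz, bytes_per_value=8):
    """Size of an nx x ny x nz field in MB, assuming 8-byte values."""
    return nx * ny * nz * bytes_per_value / MB

large = field_size_mb(256, 256, 256)   # 128.0
small = field_size_mb(128, 64, 26)     # 1.625 (T42L26 grid)

# Ratio of the quoted write rates: parallel vs. single-processor I/O.
speedup = 180 / 12                     # 15.0

print(large, small, speedup)
```

Under the same assumption, the T42L26 field is about 1.6 MB, roughly 80 times smaller than the 256×256×256 test array.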

CAM History I/O

We have tested ZioLib for CAM2.0 history I/O using 8, 16 and 32 processors with the serial netCDF library (that is, with one staging processor). Both the Eulerian (T42L26 resolution, Y-decomposition) and the Finite Volume (B26 resolution: 144×91×26; XY- and YZ-decompositions) dynamical cores were tried. Load balancing was turned off for physics chunking. Even with the serial netCDF library, ZioLib speeds up history I/O by a factor of 1.5-2.5 relative to the existing method.

Comments, Suggestions, Bug Reports, ...

We welcome your comments, suggestions, bug reports, etc. Please send them to wyang@lbl.gov or chqding@lbl.gov.

Acknowledgements

This work is supported by the DOE SciDAC climate project and the NERSC Program.

 Last modified on February 14, 2003.