Adaptively Applying Data-driven Execution Mode to Remove I/O Bottleneck for Data-intensive Computing
The increasingly popular multi-core/many-core technique is effective for accelerating programs' executions only when sufficient parallelism is maintained. For data-intensive programs, the increased parallelism in the execution can severely compromise their I/O efficiency. When a sequential program is parallelized, not only computations but also the I/O operations associated with them can be distributed among multiple processes. Because the execution order of the processes is usually determined by the scheduler at runtime, the relative progress of each process is nondeterministic and the order in which the processes issue their I/O requests is accordingly nondeterministic.
As the layers of the I/O stack responsible for serving I/O requests, including caching, prefetching, and I/O scheduling, all rely on predictability on I/O request pattern, such I/O non-determinism can have three effects compromising I/O efficiency. (1) The opportunities for reusing data in the buffer cache by different processes can be missed, leading to weakened temporal locality, because existing process scheduling does not consider the buffer cache. (2) As individual processes independently generate prefetch requests, it would be hard to aggregate these requests into a long sequential I/O stream, which is preferred by the storage system. (3) The I/O scheduler is difficult to exploit spatial locality among requests from different processes to form large requests for higher I/O efficiency. The nondeterministic issuance and uncoordinated service of requests can result in serious I/O bottleneck. For the hard disk, small random access can lead to an I/O performance degradation of one order of magnitude. For the solid-state disk, random writes can substantially increase garbage collection cost and reduce its throughput.
In the project we built facilities to harmonize or streamline the service of I/O requests from different processes of a parallel program. There is a major distinction between the facilities and conventional techniques for improving I/O performance. Conventional techniques, such as improved I/O schedulers, caching, and prefetching policies, only optimize the service of I/O requests. In contrast, the proposed facilities manage issuance of I/O requests through I/O-aware process scheduling and service of these requests with improved locality in a coordinated fashion for I/O-intensive multithreaded and MPI programs. When the I/O bottleneck is detected for a parallel program, the execution of the program and service of its I/O requests will deviate from its regular statement-driven mode to the data-driven mode. In the data-driven mode, all processes of a program collectively disclose their future I/O data needs through pre-execution for reads and through write-back for writes, and wait for the data to be efficiently prefetched into the cache or be flushed into the storage to proceed. So that I/O efficiency and data availability can take priority in the process scheduling. The facilities includes iHarmonizer, a runtime for improving I/O performance of multi-threaded programs, and DualPar, a runtime and daemons for MPI programs. The below figure illustrates how the system work and how its performance advantage is obtained.
Figure: Illustration of the concept of a dual-mode execution. In the figure there are processors (computing engines) and disk arrays (storage engines) to run three parallel programs. Program 1 is not I/O intensive (service of I/O requests is denoted as red rectangle). It stays in the statement-driven execution mode, where the timing of I/O requests is determined by process scheduling. Program 2 is I/O intensive and is in statement-driven execution mode. Its requests are issued in the statement execution order and are served inefficiently. Program 3 is I/O intensive and in the data-driven mode, where I/O efficiency takes priority in the scheduling of computation and I/O. I/O requests are served in an order friendly to I/O efficiency and process scheduling relies on the data availability.
Publications Related to this Project
v Xiaoning Ding, Jianchen Shan, Song Jiang. ". In IEEE Transactions on Parallel and Distributed Systems, 27 (8): August 2016.
v Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Sili Shao, and Song Jiang, "", in Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys'16), Hong Kong, China, August, 2016.
v Guoyao Xu, Cheng-Zhong Xu, and Song Jiang, "", in Proceedings of the 13th IEEE International Conference on Autonomic Computing (ICAC'16), Wuerzburg, Germany, July, 2016.
v Xingbo Wu, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, and Song Jiang, "", in Proceedings of the European Conference on Computer Systems (EuroSys'16), London, UK, April, 2016.
v Xingbo Wu, Wenguang Wang, and Song Jiang, "TotalCOW: Unleash the Power of Copy-On-Write for Thin-provisioned Containers", in Proceedings of the 6th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys2015), Tokyo, Japan, July, 2015.
v Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang, "LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data", in Proceedings of 2015 USENIX Annual Technical Conference (USENIX'15), Santa Clara, CA, July, 2015.
v Xiameng Hu, Xiaolin Wang, Yechen Li, Lan Zhou, Yingwei Luo, Chen Ding, Song Jiang, and Zhenlin Wang, "LAMA: Optimized Locality-aware Memory Allocation for Key-value Cache", in Proceedings of 2015 USENIX Annual Technical Conference (USENIX'15), Santa Clara, CA, July, 2015.
v Jianqiang Ou, Marc Patton, Michael Devon Moore, Yuehai Xu, Song Jiang. A Penalty Aware Memory Allocation Scheme for Key-value Cache. In Proceedings of the 44th International Conference on Parallel Processing (ICPP-2015), Beijing, China, To appear September 2015.
v Yuehai Xu, Eitan Frachtenberg, and Song Jiang, "Building a High-performance Key-value Cache as an Energy-efficient Appliance", in Proceedings of the 32st International Symposium on Computer Performance, Modeling, Measurement and Evaluation 2014 (IFIP Performance 2014), Turin, Italy, October, 2014.
v Xuechen Zhang, Jianqiang Ou, Kei Davis, and Song Jiang, "Orthrus: A Framework for Implementing Efficient Collective I/O in Multicore Clusters", in Proceedings of the International Supercomputing Conference (ISC '14), Leipzig, Germany, June 2014.
v Xuechen Zhang, Ke Liu, Kei Davis, and Song Jiang, "iBridge: Improving Unaligned Parallel File Access with Solid-State Drives", in Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS'13), Boston, MA, May, 2013.
REU Program in the project