The Efficient Checkpoint based on Erasure Coding with Incremental Method

Hideyuki Jitsumoto; Shunsuke Nakamura; Toshio Endo; Satoshi Matsuoka

Publication Information

Title

Japanese:	増分データとErasure Coding を利用した高速なチェックポイント手法
English:	The Efficient Checkpoint based on Erasure Coding with Incremental Method

Author

Japanese:	實本英之, 中村俊介, 遠藤敏夫, 松岡聡.
English:	Hideyuki Jitsumoto, Shunsuke Nakamura, Toshio Endo, Satoshi Matsuoka.

Language

Japanese

Journal/Book name

Japanese:	情報処理学会研究報告
English:	IPSJ SIG Technical Reports

Volume, Number, Page

Vol. 2009-HPC-122, No. 9, pp. 1-6

Published date

Oct. 2009

Publisher

Japanese:	情報処理学会
English:	Information Processing Society of Japan

Conference name

Japanese:	HPC研究会
English:	SIG HPC

Conference site

Japanese:	東京
English:	Tokyo

Abstract

Checkpoint/restart is a well-known fault-tolerance mechanism for large-scale HPC systems. However, its overhead tends to grow, because the memory size of recent systems is increasing rapidly while the I/O bandwidth of file systems improves only modestly. The purpose of this work is to achieve low-overhead checkpointing that tolerates multiple faults by using erasure coding. To eliminate this bottleneck, we parallelize the encoding and store process images in node-local storage instead of a shared file system. Furthermore, to reduce the size of process images, we adopt incremental checkpointing, which stores only the parts of the process image that have been modified since the previous checkpoint. In parallel experiments using matrix multiplication and the NPB LU benchmark, we observed a 28 to 84% performance improvement from introducing incremental checkpointing.
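
The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it (`BLOCK_SIZE`, `incremental_checkpoint`, `apply_delta`, `xor_parity`) is hypothetical. It assumes a fixed-size process image split into fixed-size blocks, and it detects modified blocks by hash comparison, which stands in for whatever change-detection mechanism the paper uses. Single XOR parity is used as the simplest possible erasure code. It tolerates only one lost chunk, whereas the paper targets multiple faults, which would need a stronger code.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size in bytes (assumption)


def split_blocks(image: bytes) -> list[bytes]:
    """Split a process image into fixed-size blocks."""
    return [image[i:i + BLOCK_SIZE] for i in range(0, len(image), BLOCK_SIZE)]


def incremental_checkpoint(image: bytes, prev_hashes: list[bytes]):
    """Return (delta, hashes).

    delta maps block index -> block contents for every block whose hash
    differs from the previous checkpoint (or that has no previous hash).
    """
    blocks = split_blocks(image)
    hashes = [hashlib.sha256(b).digest() for b in blocks]
    delta = {
        i: block
        for i, (block, h) in enumerate(zip(blocks, hashes))
        if i >= len(prev_hashes) or prev_hashes[i] != h
    }
    return delta, hashes


def apply_delta(base: bytes, delta: dict[int, bytes]) -> bytes:
    """Rebuild an image from a base image plus one incremental delta.

    Assumes the image size did not change between checkpoints.
    """
    blocks = split_blocks(base)
    for index, block in delta.items():
        blocks[index] = block
    return b"".join(blocks)


def xor_parity(chunks: list[bytes]) -> bytes:
    """XOR equal-length chunks together (single-parity erasure code)."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)


if __name__ == "__main__":
    # First checkpoint: every block is new, so the delta is the full image.
    base = bytes(3 * BLOCK_SIZE)
    full, hashes = incremental_checkpoint(base, [])

    # Modify one byte; only the block containing it is stored next time.
    modified = bytearray(base)
    modified[BLOCK_SIZE + 5] = 0xFF
    delta, hashes = incremental_checkpoint(bytes(modified), hashes)
    print("blocks stored:", sorted(delta))

    # Parity over per-node chunks lets one lost chunk be rebuilt.
    chunks = [b"aaaa", b"bbbb", b"cccc"]
    parity = xor_parity(chunks)
    rebuilt = xor_parity([chunks[0], chunks[2], parity])
    print("recovered chunk matches:", rebuilt == chunks[1])
```

In this sketch the second checkpoint stores one block instead of three, which is the size reduction the abstract attributes to incremental checkpointing. The parity step shows why node-local storage can still survive a node loss when redundancy is spread across nodes.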
