
Publication Information


Title
Japanese:増分データとErasure Coding を利用した高速なチェックポイント手法 
English:The Efficient Checkpoint based on Erasure Coding with Incremental Method 
Author
Japanese: 實本英之, 中村俊介, 遠藤敏夫, 松岡聡.  
English: Hideyuki Jitsumoto, Shunsuke Nakamura, Toshio Endo, Satoshi Matsuoka.  
Language Japanese 
Journal/Book name
Japanese:情報処理学会研究報告 
English:IPSJ SIG Technical Reports 
Volume, Number, Page Vol. 2009-HPC-122, No. 9, pp. 1-6
Published date Oct. 2009 
Publisher
Japanese:情報処理学会 
English:Information Processing Society of Japan 
Conference name
Japanese:HPC研究会 
English:SIG HPC 
Conference site
Japanese:東京 
English:Tokyo 
Abstract Checkpointing/restarting is a well-known fault tolerance mechanism for large-scale HPC systems. However, its overhead tends to grow, because the memory size of recent systems is increasing rapidly while file system I/O bandwidth is improving only modestly. The purpose of this work is to achieve low-overhead checkpointing that tolerates multiple faults by using erasure coding. To eliminate this bottleneck, we parallelize encoding and store process images in node-local storage instead of on shared file systems. Furthermore, to reduce the size of process images, we adopt incremental checkpointing, which stores only the parts of the process image that have been modified since the previous checkpoint. In parallel experiments using a matrix multiplication computation and the NPB LU benchmark, we observed a 28 to 84% performance improvement from introducing incremental checkpointing.
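
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not say how modified regions are detected or which erasure code is used. The sketch assumes hash comparison of fixed-size blocks to find changes, and uses a single XOR parity chunk, the simplest erasure code, which tolerates only one lost chunk (tolerating multiple faults would require a stronger code). All names, including `BLOCK_SIZE`, `incremental_checkpoint`, `xor_parity`, and `recover_missing`, are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block granularity, chosen for illustration


def block_hashes(image, block_size=BLOCK_SIZE):
    """Hash every fixed-size block of a process image (bytes)."""
    return [
        hashlib.sha256(image[i:i + block_size]).digest()
        for i in range(0, len(image), block_size)
    ]


def incremental_checkpoint(prev_hashes, image, block_size=BLOCK_SIZE):
    """Return (delta, new_hashes).

    delta maps block index -> block bytes, for blocks that are new or
    whose hash differs from the previous checkpoint (assumed detection
    method, not taken from the paper).
    """
    new_hashes = block_hashes(image, block_size)
    delta = {
        idx: image[idx * block_size:(idx + 1) * block_size]
        for idx, digest in enumerate(new_hashes)
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest
    }
    return delta, new_hashes


def xor_parity(chunks):
    """XOR equal-length chunks into one parity chunk."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving_chunks, parity):
    """Rebuild the single missing chunk from the survivors and the parity."""
    return xor_parity(list(surviving_chunks) + [parity])


if __name__ == "__main__":
    image_v1 = bytes(4 * BLOCK_SIZE)        # four zero-filled blocks
    image_v2 = bytearray(image_v1)
    image_v2[BLOCK_SIZE + 10] = 0xFF        # modify one byte in block 1

    full, hashes = incremental_checkpoint([], image_v1)
    delta, hashes = incremental_checkpoint(hashes, bytes(image_v2))
    print(len(full), sorted(delta))         # block counts: full vs. changed

    chunks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # one chunk per node
    parity = xor_parity(chunks)
    print(recover_missing([chunks[0], chunks[2]], parity) == chunks[1])
```

In this sketch the first checkpoint stores every block, while the second stores only the one block that changed, which is the source of the size reduction the abstract describes. The parity chunk stands in for encoded data kept on another node's local storage.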
