Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Hamid Reza Zohouri; Artur Podobas; Satoshi Matsuoka

doi:https://doi.org/10.1145/3174243.3174248

Publication Information

Title

Japanese:
English:	Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Authors

Japanese:	ハミドレザゾフーリ, Artur Podobas, 松岡聡.
English:	Hamid Reza Zohouri, Artur Podobas, Satoshi Matsuoka.

Language

English

Journal/Book Title

Japanese:
English:	Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Volume, Issue, Pages

pp. 153-162

Publication Date

February 27, 2018

Publisher

Japanese:
English:	ACM, New York, NY, USA

Conference Name

Japanese:
English:	26th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA)

Venue

Japanese:
English:	Monterey, California

Official Link

https://dl.acm.org/citation.cfm?id=3174248

DOI

https://doi.org/10.1145/3174243.3174248

Abstract

Recent developments in High-Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arising from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance for 2D and 3D stencils, respectively, which rivals the performance of a highly optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.
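
To make the idea of combining spatial and temporal blocking concrete, the following is a minimal software sketch, not the authors' OpenCL design. It assumes a first-order 4-neighbour averaging stencil with fixed edge cells, and it uses overlapped tiling (each tile is loaded with a halo and the halo is recomputed redundantly), which is one common way to combine the two forms of blocking. The function names `step`, `reference`, and `blocked`, and the parameters `tile` and `fused`, are hypothetical and chosen only for illustration.

```python
def step(grid):
    """One sweep of a 4-neighbour average; cells on the edge of `grid` stay fixed."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def reference(grid, steps):
    """Unblocked baseline: sweep the whole grid once per time step."""
    for _ in range(steps):
        grid = step(grid)
    return grid


def blocked(grid, steps, tile, fused):
    """Spatial blocking (tile x tile output blocks) combined with temporal
    blocking (up to `fused` time steps computed per tile load)."""
    h, w = len(grid), len(grid[0])
    done = 0
    while done < steps:
        t = min(fused, steps - done)  # time steps fused in this pass
        out = [row[:] for row in grid]
        for r0 in range(0, h, tile):
            for c0 in range(0, w, tile):
                r1, c1 = min(r0 + tile, h), min(c0 + tile, w)
                # Load the tile plus a halo of width t (stencil radius 1 times
                # t fused steps), clipped to the grid.
                lr0, lc0 = max(r0 - t, 0), max(c0 - t, 0)
                lr1, lc1 = min(r1 + t, h), min(c1 + t, w)
                local = [row[lc0:lc1] for row in grid[lr0:lr1]]
                # Run t steps on the local copy only. Cells near an artificial
                # edge become stale, but the error moves inward one cell per
                # step, so it never reaches the tile itself.
                for _ in range(t):
                    local = step(local)
                # Write back only the tile, discarding the halo.
                for i in range(r0, r1):
                    out[i][c0:c1] = local[i - lr0][c0 - lc0:c1 - lc0]
        grid = out
        done += t
    return grid
```

In this sketch, a larger `fused` reduces how often the grid is re-read from main memory but widens the halo, and therefore the amount of redundant computation per tile. This is the kind of trade-off that a performance model for parameter tuning would need to weigh.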
