Publication Information


Title
Accelerating Deep Learning Inference with a Parallel FPGA System
Authors
Takumi Suzuki, Ryohei Kobayashi, Norihisa Fujita, Taisuke Boku
Language: English
Journal / Proceedings
HEART '25: Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
Pages: 49–56
Publication date: May 26, 2025
Publisher
Association for Computing Machinery
Conference
HEART 2025: 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
Venue
Kumamoto
Official link: https://doi.org/10.1145/3728179.3728186
DOI: 10.1145/3728179.3728186
Abstract: Deep learning has experienced rapid growth in applications such as image recognition and natural language processing, resulting in increasingly complex models that require more processing power and energy. While GPUs are widely used for training due to their highly parallel computing power and wide memory bandwidth, FPGAs offer a compelling alternative for inference tasks where stable, low-latency performance is essential. FPGAs allow for fine-grained hardware tuning and dedicated pipeline implementations, which can be leveraged to build multi-FPGA systems that seamlessly fuse computation and communication for Convolutional Neural Network (CNN) acceleration. However, existing multi-FPGA approaches typically require advanced hardware knowledge and are often implemented as dedicated systems, creating significant barriers for general-purpose application developers accustomed to high-level programming environments such as MPI with the host CPU. In this study, we propose a multi-FPGA-based deep learning inference accelerator that operates at the OpenCL abstraction level, enabling software engineers without extensive hardware expertise to partition and deploy CNN models, such as ResNet-50, across multiple FPGAs. Our approach combines both model and data parallelism to achieve high throughput while maintaining controlled latency. Experimental results show that our design increases throughput by a factor of 12 with only a 1.9-fold increase in latency compared to a baseline. This work paves the way for more accessible FPGA-based acceleration solutions for deep learning inference in real-world applications.
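The execution model described in the abstract (CNN layers partitioned across FPGAs, coordinated from the host with MPI) can be pictured with a minimal host-side sketch. The C++/MPI code below is illustrative only and is not taken from the paper: each rank stands in for one FPGA holding a slice of the layers, run_layer_partition is a hypothetical placeholder for the OpenCL kernel launch, all names and sizes are assumptions, and the data-parallel replication of the whole pipeline is omitted.

    // Hypothetical host-side sketch of pipelined model parallelism over MPI ranks.
    // Each rank would own one FPGA and one partition of the CNN layers; the
    // FPGA/OpenCL kernel launch is replaced here by a dummy in-memory function.
    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    // Stand-in for enqueuing this rank's layer partition on its FPGA.
    static void run_layer_partition(std::vector<float>& activations, int rank) {
        for (float& v : activations) v += static_cast<float>(rank);  // dummy compute
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int kActivationSize = 1024;  // illustrative activation-tensor size
        const int kNumBatches = 4;         // several batches flow through the pipeline
        std::vector<float> activations(kActivationSize, 1.0f);

        for (int b = 0; b < kNumBatches; ++b) {
            if (rank > 0) {  // receive activations from the previous pipeline stage
                MPI_Recv(activations.data(), kActivationSize, MPI_FLOAT,
                         rank - 1, b, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            run_layer_partition(activations, rank);  // this rank's layer partition
            if (rank < size - 1) {  // forward activations to the next stage
                MPI_Send(activations.data(), kActivationSize, MPI_FLOAT,
                         rank + 1, b, MPI_COMM_WORLD);
            } else {
                std::printf("batch %d finished on rank %d\n", b, rank);
            }
        }
        MPI_Finalize();
        return 0;
    }

Run with, for example, mpirun -np 4: activations move from rank to rank stage by stage, mirroring the kind of model-parallel pipeline the abstract describes, while additional batches enter the pipeline behind them.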
