Publication Information


Title
Japanese:Accelerating Deep Learning Inference with a Parallel FPGA System 
English:Accelerating Deep Learning Inference with a Parallel FPGA System 
Author
Japanese: Takumi Suzuki, 小林諒平, Norihisa Fujita, Taisuke Boku.  
English: Takumi Suzuki, Ryohei Kobayashi, Norihisa Fujita, Taisuke Boku.  
Language English 
Journal/Book name
Japanese: 
English:HEART '25: Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies 
Volume, Number, Page         Pages 49-56
Published date May 26, 2025 
Publisher
Japanese: 
English:Association for Computing Machinery 
Conference name
Japanese: 
English:HEART 2025: 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies 
Conference site
Japanese: 
English:Kumamoto 
Official URL https://doi.org/10.1145/3728179.3728186
 
DOI https://doi.org/10.1145/3728179.3728186
Abstract Deep learning has experienced rapid growth in applications such as image recognition and natural language processing, resulting in increasingly complex models that require more processing power and energy. While GPUs are widely used for training due to their highly parallel computing power and wide memory bandwidth, FPGAs offer a compelling alternative for inference tasks where stable, low-latency performance is essential. FPGAs allow for fine-grained hardware tuning and dedicated pipeline implementations, which can be leveraged to build multi-FPGA systems that seamlessly fuse computation and communication for Convolutional Neural Network (CNN) acceleration. However, existing multi-FPGA approaches typically require advanced hardware knowledge and are often implemented as dedicated systems, creating significant barriers for general-purpose application developers accustomed to high-level programming environments such as MPI with the host CPU. In this study, we propose a multi-FPGA-based deep learning inference accelerator that operates at the OpenCL abstraction level, enabling software engineers without extensive hardware expertise to partition and deploy CNN models, such as ResNet-50, across multiple FPGAs. Our approach combines both model and data parallelism to achieve high throughput while maintaining controlled latency. Experimental results show that our design increases throughput by a factor of 12 with only a 1.9-fold increase in latency compared to a baseline. This work paves the way for more accessible FPGA-based acceleration solutions for deep learning inference in real-world applications.
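
To make the OpenCL-level programming model described in the abstract concrete, the following is a minimal host-side sketch of one way a CNN inference pipeline could be split across two FPGA devices (model parallelism). It is not the authors' implementation: the kernel names (resnet_front, resnet_back), bitstream file names, tensor sizes at the cut point, and the host-mediated hand-off of the intermediate activation are illustrative assumptions, and the data-parallel replication and fused inter-FPGA communication described in the paper are omitted.

// Illustrative sketch only (not the paper's implementation): OpenCL host code
// that splits a CNN inference pipeline across two FPGA devices.
// Kernel names, bitstream file names, and tensor sizes are hypothetical
// placeholders; error checking is omitted for brevity.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Read a precompiled FPGA bitstream from disk.
static std::vector<unsigned char> load_binary(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    std::fseek(f, 0, SEEK_END);
    long n = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<unsigned char> buf(static_cast<size_t>(n));
    std::fread(buf.data(), 1, buf.size(), f);
    std::fclose(f);
    return buf;
}

// Build a program for one device from its bitstream and extract one kernel.
static cl_kernel make_kernel(cl_context ctx, cl_device_id dev,
                             const char* binary_path, const char* kernel_name) {
    std::vector<unsigned char> bin = load_binary(binary_path);
    const unsigned char* ptr = bin.data();
    size_t len = bin.size();
    cl_int err;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len, &ptr, nullptr, &err);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
    return clCreateKernel(prog, kernel_name, &err);
}

int main() {
    // Hypothetical tensor sizes for a ResNet-50-like split (bytes).
    const size_t in_bytes  = 3 * 224 * 224 * sizeof(float);  // input image
    const size_t mid_bytes = 512 * 28 * 28 * sizeof(float);  // activation at the cut point
    const size_t out_bytes = 1000 * sizeof(float);           // class scores

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    cl_device_id dev[2];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 2, dev, nullptr);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 2, dev, nullptr, nullptr, &err);
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev[0], 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev[1], 0, &err);

    // Each FPGA holds one partition of the model (hypothetical kernel names).
    cl_kernel front = make_kernel(ctx, dev[0], "resnet_front.aocx", "resnet_front");
    cl_kernel back  = make_kernel(ctx, dev[1], "resnet_back.aocx",  "resnet_back");

    // Buffers live in the shared context; the intermediate buffer d_mid is
    // assumed to be migrated between devices by the runtime (in practice it
    // may need to be staged through the host).
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  in_bytes,  nullptr, &err);
    cl_mem d_mid = clCreateBuffer(ctx, CL_MEM_READ_WRITE, mid_bytes, nullptr, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, out_bytes, nullptr, &err);

    std::vector<float> image(in_bytes / sizeof(float), 0.0f);   // placeholder input
    std::vector<float> scores(out_bytes / sizeof(float));

    // Stage 1 on FPGA 0: early layers produce the intermediate activation.
    clEnqueueWriteBuffer(q0, d_in, CL_TRUE, 0, in_bytes, image.data(), 0, nullptr, nullptr);
    clSetKernelArg(front, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(front, 1, sizeof(cl_mem), &d_mid);
    clEnqueueTask(q0, front, 0, nullptr, nullptr);   // single work-item FPGA kernel
    clFinish(q0);

    // Stage 2 on FPGA 1: remaining layers plus the classifier.
    clSetKernelArg(back, 0, sizeof(cl_mem), &d_mid);
    clSetKernelArg(back, 1, sizeof(cl_mem), &d_out);
    clEnqueueTask(q1, back, 0, nullptr, nullptr);
    clFinish(q1);

    clEnqueueReadBuffer(q1, d_out, CL_TRUE, 0, out_bytes, scores.data(), 0, nullptr, nullptr);
    std::printf("first class score: %f\n", scores[0]);
    return 0;
}

In a data-parallel extension of this sketch, the same two-stage pipeline would be replicated across additional FPGA pairs and independent input batches dispatched to each replica from the host.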
