Publication Information


Title
Japanese:Accelerating Deep Learning Inference with a Parallel FPGA System 
English:Accelerating Deep Learning Inference with a Parallel FPGA System 
Author
Japanese: Takumi Suzuki, 小林諒平, Norihisa Fujita, Taisuke Boku.  
English: Takumi Suzuki, Ryohei Kobayashi, Norihisa Fujita, Taisuke Boku.  
Language English 
Journal/Book name
Japanese: 
English:HEART '25: Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies 
Volume, Number, Page         Pages 49-56
Published date May 26, 2025 
Publisher
Japanese: 
English:Association for Computing Machinery 
Conference name
Japanese: 
English:HEART 2025: 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies 
Conference site
Japanese: 
English:Kumamoto 
Official URL https://doi.org/10.1145/3728179.3728186
 
DOI https://doi.org/10.1145/3728179.3728186
Abstract Deep learning has experienced rapid growth in applications such as image recognition and natural language processing, resulting in increasingly complex models that require more processing power and energy. While GPUs are widely used for training due to their highly parallel computing power and wide memory bandwidth, FPGAs offer a compelling alternative for inference tasks where stable, low-latency performance is essential. FPGAs allow for fine-grained hardware tuning and dedicated pipeline implementations, which can be leveraged to build multi-FPGA systems that seamlessly fuse computation and communication for Convolutional Neural Network (CNN) acceleration. However, existing multi-FPGA approaches typically require advanced hardware knowledge and are often implemented as dedicated systems, creating significant barriers for general-purpose application developers accustomed to high-level programming environments such as MPI with the host CPU. In this study, we propose a multi-FPGA-based deep learning inference accelerator that operates at the OpenCL abstraction level, enabling software engineers without extensive hardware expertise to partition and deploy CNN models, such as ResNet-50, across multiple FPGAs. Our approach combines both model and data parallelism to achieve high throughput while maintaining controlled latency. Experimental results show that our design increases throughput by a factor of 12 with only a 1.9-fold increase in latency compared to a baseline. This work paves the way for more accessible FPGA-based acceleration solutions for deep learning inference in real-world applications.
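
To make the OpenCL-level programming model described in the abstract concrete, the following is a minimal host-side sketch of one way a CNN inference pipeline could be split across two FPGA devices (model parallelism). It is not the authors' implementation: the kernel names (resnet_front, resnet_back), bitstream file names, tensor sizes at the cut point, and the host-mediated hand-off of the intermediate activation are illustrative assumptions, and the data-parallel replication and fused inter-FPGA communication described in the paper are omitted.

// Illustrative sketch only (not the paper's implementation): OpenCL host code
// that splits a CNN inference pipeline across two FPGA devices.
// Kernel names, bitstream file names, and tensor sizes are hypothetical
// placeholders; error checking is omitted for brevity.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Read a precompiled FPGA bitstream from disk.
static std::vector<unsigned char> load_binary(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    std::fseek(f, 0, SEEK_END);
    long n = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<unsigned char> buf(static_cast<size_t>(n));
    std::fread(buf.data(), 1, buf.size(), f);
    std::fclose(f);
    return buf;
}

// Build a program for one device from its bitstream and extract one kernel.
static cl_kernel make_kernel(cl_context ctx, cl_device_id dev,
                             const char* binary_path, const char* kernel_name) {
    std::vector<unsigned char> bin = load_binary(binary_path);
    const unsigned char* ptr = bin.data();
    size_t len = bin.size();
    cl_int err;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len, &ptr, nullptr, &err);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
    return clCreateKernel(prog, kernel_name, &err);
}

int main() {
    // Hypothetical tensor sizes for a ResNet-50-like split (bytes).
    const size_t in_bytes  = 3 * 224 * 224 * sizeof(float);  // input image
    const size_t mid_bytes = 512 * 28 * 28 * sizeof(float);  // activation at the cut point
    const size_t out_bytes = 1000 * sizeof(float);           // class scores

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    cl_device_id dev[2];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 2, dev, nullptr);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 2, dev, nullptr, nullptr, &err);
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev[0], 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev[1], 0, &err);

    // Each FPGA holds one partition of the model (hypothetical kernel names).
    cl_kernel front = make_kernel(ctx, dev[0], "resnet_front.aocx", "resnet_front");
    cl_kernel back  = make_kernel(ctx, dev[1], "resnet_back.aocx",  "resnet_back");

    // Buffers live in the shared context; the intermediate buffer d_mid is
    // assumed to be migrated between devices by the runtime (in practice it
    // may need to be staged through the host).
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  in_bytes,  nullptr, &err);
    cl_mem d_mid = clCreateBuffer(ctx, CL_MEM_READ_WRITE, mid_bytes, nullptr, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, out_bytes, nullptr, &err);

    std::vector<float> image(in_bytes / sizeof(float), 0.0f);   // placeholder input
    std::vector<float> scores(out_bytes / sizeof(float));

    // Stage 1 on FPGA 0: early layers produce the intermediate activation.
    clEnqueueWriteBuffer(q0, d_in, CL_TRUE, 0, in_bytes, image.data(), 0, nullptr, nullptr);
    clSetKernelArg(front, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(front, 1, sizeof(cl_mem), &d_mid);
    clEnqueueTask(q0, front, 0, nullptr, nullptr);   // single work-item FPGA kernel
    clFinish(q0);

    // Stage 2 on FPGA 1: remaining layers plus the classifier.
    clSetKernelArg(back, 0, sizeof(cl_mem), &d_mid);
    clSetKernelArg(back, 1, sizeof(cl_mem), &d_out);
    clEnqueueTask(q1, back, 0, nullptr, nullptr);
    clFinish(q1);

    clEnqueueReadBuffer(q1, d_out, CL_TRUE, 0, out_bytes, scores.data(), 0, nullptr, nullptr);
    std::printf("first class score: %f\n", scores[0]);
    return 0;
}

In a data-parallel extension of this sketch, the same two-stage pipeline would be replicated across additional FPGA pairs and independent input batches dispatched to each replica from the host.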
