Publication Information


Title
Accelerating Deep Learning Inference with a Parallel FPGA System
Authors
Takumi Suzuki, Ryohei Kobayashi, Norihisa Fujita, Taisuke Boku
Language: English
Journal / Proceedings
HEART '25: Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
Pages: 49–56
Publication date: May 26, 2025
Publisher
Association for Computing Machinery
Conference
HEART 2025: 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
Venue
Kumamoto
Official link: https://doi.org/10.1145/3728179.3728186
DOI: 10.1145/3728179.3728186
Abstract: Deep learning has experienced rapid growth in applications such as image recognition and natural language processing, resulting in increasingly complex models that require more processing power and energy. While GPUs are widely used for training due to their highly parallel computing power and wide memory bandwidth, FPGAs offer a compelling alternative for inference tasks where stable, low-latency performance is essential. FPGAs allow for fine-grained hardware tuning and dedicated pipeline implementations, which can be leveraged to build multi-FPGA systems that seamlessly fuse computation and communication for Convolutional Neural Network (CNN) acceleration. However, existing multi-FPGA approaches typically require advanced hardware knowledge and are often implemented as dedicated systems, creating significant barriers for general-purpose application developers accustomed to high-level programming environments such as MPI with the host CPU. In this study, we propose a multi-FPGA-based deep learning inference accelerator that operates at the OpenCL abstraction level, enabling software engineers without extensive hardware expertise to partition and deploy CNN models, such as ResNet-50, across multiple FPGAs. Our approach combines both model and data parallelism to achieve high throughput while maintaining controlled latency. Experimental results show that our design increases throughput by a factor of 12 with only a 1.9-fold increase in latency compared to a baseline. This work paves the way for more accessible FPGA-based acceleration solutions for deep learning inference in real-world applications.
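The execution model described in the abstract (CNN layers partitioned across FPGAs, coordinated from the host with MPI) can be pictured with a minimal host-side sketch. The C++/MPI code below is illustrative only and is not taken from the paper: each rank stands in for one FPGA holding a slice of the layers, run_layer_partition is a hypothetical placeholder for the OpenCL kernel launch, all names and sizes are assumptions, and the data-parallel replication of the whole pipeline is omitted.

    // Hypothetical host-side sketch of pipelined model parallelism over MPI ranks.
    // Each rank would own one FPGA and one partition of the CNN layers; the
    // FPGA/OpenCL kernel launch is replaced here by a dummy in-memory function.
    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    // Stand-in for enqueuing this rank's layer partition on its FPGA.
    static void run_layer_partition(std::vector<float>& activations, int rank) {
        for (float& v : activations) v += static_cast<float>(rank);  // dummy compute
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int kActivationSize = 1024;  // illustrative activation-tensor size
        const int kNumBatches = 4;         // several batches flow through the pipeline
        std::vector<float> activations(kActivationSize, 1.0f);

        for (int b = 0; b < kNumBatches; ++b) {
            if (rank > 0) {  // receive activations from the previous pipeline stage
                MPI_Recv(activations.data(), kActivationSize, MPI_FLOAT,
                         rank - 1, b, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            run_layer_partition(activations, rank);  // this rank's layer partition
            if (rank < size - 1) {  // forward activations to the next stage
                MPI_Send(activations.data(), kActivationSize, MPI_FLOAT,
                         rank + 1, b, MPI_COMM_WORLD);
            } else {
                std::printf("batch %d finished on rank %d\n", b, rank);
            }
        }
        MPI_Finalize();
        return 0;
    }

Run with, for example, mpirun -np 4: activations move from rank to rank stage by stage, mirroring the kind of model-parallel pipeline the abstract describes, while additional batches enter the pipeline behind them.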
