Deep learning has experienced rapid growth in applications such as image recognition and natural language processing, resulting in increasingly complex models that demand more processing power and energy. While GPUs are widely used for training due to their massively parallel compute capability and high memory bandwidth, FPGAs offer a compelling alternative for inference tasks where stable, low-latency performance is essential. FPGAs allow fine-grained hardware tuning and dedicated pipeline implementations, which can be leveraged to build multi-FPGA systems that seamlessly fuse computation and communication for Convolutional Neural Network (CNN) acceleration. However, existing multi-FPGA approaches typically require advanced hardware knowledge and are often implemented as dedicated systems, creating significant barriers for general-purpose application developers accustomed to high-level programming environments such as MPI on the host CPU. In this study, we propose a multi-FPGA deep learning inference accelerator that operates at the OpenCL abstraction level, enabling software engineers without extensive hardware expertise to partition and deploy CNN models, such as ResNet-50, across multiple FPGAs. Our approach combines model and data parallelism to achieve high throughput while keeping latency under control. Experimental results show that our design increases throughput by a factor of 12 with only a 1.9-fold increase in latency compared to a baseline. This work paves the way for more accessible FPGA-based acceleration solutions for deep learning inference in real-world applications.
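To make the "OpenCL abstraction level" concrete, the sketch below shows the kind of host-side setup a software engineer would write with the standard OpenCL C API: enumerate the FPGA boards on a platform and create one command queue per device, onto which the partitions of a CNN (model parallelism) or independent input batches (data parallelism) could then be enqueued. This is a minimal illustrative sketch under our own assumptions, not the accelerator described in the abstract; the device type, queue configuration, and the elided kernel/partitioning steps are placeholders.

```cpp
// Hypothetical host-side sketch: one OpenCL command queue per FPGA device.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    // Pick the first available platform (e.g., an FPGA vendor's OpenCL runtime).
    cl_platform_id platform;
    if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS) {
        std::fprintf(stderr, "No OpenCL platform found\n");
        return 1;
    }

    // FPGA boards typically enumerate as accelerator devices.
    cl_uint num_devices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 0, nullptr, &num_devices);
    if (num_devices == 0) {
        std::fprintf(stderr, "No accelerator devices found\n");
        return 1;
    }
    std::vector<cl_device_id> devices(num_devices);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, num_devices,
                   devices.data(), nullptr);

    // One shared context and one in-order queue per FPGA: each queue can host a
    // different partition of the CNN (model parallelism) or a different batch
    // (data parallelism), with host-managed buffers carrying activations between
    // stages.
    cl_int err;
    cl_context ctx = clCreateContext(nullptr, num_devices, devices.data(),
                                     nullptr, nullptr, &err);
    std::vector<cl_command_queue> queues;
    for (cl_device_id dev : devices) {
        queues.push_back(clCreateCommandQueue(ctx, dev, 0, &err));
    }
    std::printf("Created %u per-FPGA command queues\n", num_devices);

    // ... build per-device programs from precompiled FPGA binaries, create one
    // kernel per CNN partition, and enqueue each on its device's queue ...

    for (cl_command_queue q : queues) clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}
```

The design point this illustrates is that all multi-FPGA orchestration stays in ordinary host code at the OpenCL level, so no RTL or board-level interconnect knowledge is needed to repartition a model such as ResNet-50 across devices.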