My research goal is to design algorithms and specialized hardware modules that address the requirements of efficient convolutional neural network (CNN) implementations on field-programmable gate array (FPGA) platforms. I aim to build a novel end-to-end automated framework that domain experts can rely on to optimize CNN implementations on FPGAs in terms of throughput, latency, energy efficiency, and power consumption.
Toward this end, my research provides a mechanism for partitioning the available FPGA resources to design multiple high-throughput hardware accelerators for convolutional layer operations. Furthermore, for CNN models to be deployable on resource-constrained, battery-powered edge devices, I am interested in (1) quantizing CNN tensors and (2) building computationally efficient models while maintaining application-level requirements.
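The resource-partitioning mechanism above can be sketched as a small design-space search that splits a multiplier (DSP) budget between two convolutional-layer processors, each with its own channel-unrolling factors. This is a toy model under an idealized cycle count; the layer shapes, DSP budget, and two-way split are illustrative assumptions, not a specific benchmark or my actual metaheuristic.

```python
from math import ceil

# Illustrative layer shapes -- (out_channels M, in_channels N, spatial MACs S
# per channel pair). These numbers are assumptions for the sketch.
LAYERS = [(96, 3, 50176), (192, 96, 12544), (384, 192, 3136)]
DSP_BUDGET = 128  # assumed number of parallel multipliers on the target FPGA

def clp_cycles(layers, tm, tn):
    # A CLP unrolled (tm x tn) across output/input channels: channel tiles
    # that do not divide evenly leave some multipliers idle.
    return sum(ceil(m / tm) * ceil(n / tn) * s for m, n, s in layers)

def best_single_clp(layers, budget):
    # Exhaustive search for the unroll factors minimizing total cycles
    # under the constraint tm * tn <= budget.
    return min(clp_cycles(layers, tm, tn)
               for tm in range(1, budget + 1)
               for tn in range(1, budget // tm + 1))

def best_two_clp(layers, budget):
    # Split the layer sequence between two CLPs that process different
    # images concurrently; the slower CLP bounds the pipeline throughput.
    return min(max(best_single_clp(layers[:cut], d),
                   best_single_clp(layers[cut:], budget - d))
               for cut in range(1, len(layers))
               for d in range(1, budget))
```

The exhaustive search stands in for the metaheuristic: a real design space (more CLPs, memory bandwidth, non-contiguous layer assignment) is far too large to enumerate, which is what motivates a metaheuristic in the first place.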
To date, my work has provided a comprehensive review of existing techniques and architectures for implementing deep neural networks (DNNs) on FPGAs, along with recommendations for future directions that will simplify the use of FPGA-based accelerators and enhance their performance. I have also developed an optimized convolutional layer processor (CLP) and an accompanying metaheuristic-based algorithm for designing an FPGA-based multi-CLP accelerator, achieving 1.31x–2.37x higher throughput than state-of-the-art single-/multi-CLP approaches. I further showed that different CNN layers have different properties with respect to quantization. Therefore, I have worked on
heterogeneously reducing the bit precision of the activations and weights of pre-trained models to low-precision integers, lowering the overall memory requirements of several widely used benchmark architectures by 6.44x–10.36x without a noticeable accuracy drop. Additionally, I explored models that could
automatically learn the quantization level of every weight and activation tensor at training time, directly fulfilling the memory constraints of a particular target device while learning the model parameters to classify images in 51.69 ms.
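As a gradient-free stand-in for this learned, memory-constrained approach, the idea of assigning a heterogeneous bit-width per tensor under a hard memory budget can be sketched greedily: keep dropping one bit from whichever tensor suffers the smallest quantization error, until the budget is met. The quantization scheme (symmetric uniform), tensor shapes, and budget below are assumptions for illustration, not my trained models.

```python
import numpy as np

def quant_error(t, bits):
    # Relative L2 error after a symmetric uniform quantize/dequantize pass.
    qmax = 2.0 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax
    deq = np.round(np.clip(t / scale, -qmax, qmax)) * scale
    return np.linalg.norm(t - deq) / np.linalg.norm(t)

def fit_bit_widths(tensors, budget_bits, start=8, floor=2):
    # Start every tensor at `start` bits; while total model size exceeds
    # the budget, remove one bit from the tensor whose error grows the
    # least -- yielding a heterogeneous, per-tensor precision assignment.
    bits = [start] * len(tensors)
    def mem():
        return sum(t.size * b for t, b in zip(tensors, bits))
    while mem() > budget_bits and any(b > floor for b in bits):
        i = min((j for j in range(len(tensors)) if bits[j] > floor),
                key=lambda j: quant_error(tensors[j], bits[j] - 1))
        bits[i] -= 1
    return bits

# Toy "weight tensors" for two layers of very different sizes.
rng = np.random.default_rng(0)
tensors = [rng.normal(0, 0.1, (64, 27)), rng.normal(0, 0.05, (512, 576))]
total = sum(t.size for t in tensors)
bits = fit_bit_widths(tensors, budget_bits=5 * total)  # avg 5 bits/weight
```

Unlike this greedy post-hoc search, the learned approach folds the memory constraint into the training objective, so the bit-widths and the model parameters are optimized jointly.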