• Dissertation Title:

On the Optimal Deployment of Deep Learning Neural Networks on Field Programmable Gate Arrays

  • Dissertation Abstract:

Deep learning neural networks (DNNs) have demonstrated their effectiveness in a wide range of computer vision tasks, with state-of-the-art results obtained through complex and deep structures that require intensive computation and memory. In the past, graphics processing units (GPUs) enabled these breakthroughs because of their greater computational speed. More recently, field programmable gate arrays (FPGAs) have seen a surge of interest in accelerating DNN inference, owing to their ability to realize custom designs with different levels of parallelism. Furthermore, FPGAs provide better performance per watt than other computing technologies such as GPUs, which is a critical requirement for DNN applications on battery-powered unmanned aerial vehicles and Internet of Things (IoT) devices. However, without careful implementation, today’s complex DNN models may not fit the target FPGA due to limited logic resources. Additionally, DNN-based systems must deliver low latency and high throughput so that the right decision can be made in time.

In this dissertation, we review recent techniques for accelerating DNNs on FPGAs and provide recommendations for future directions that will simplify the use of FPGA-based accelerators and enhance their performance. We then present three works that address the requirements for efficient implementation of convolutional neural networks (CNNs), optimizing CNN implementations on FPGA platforms in terms of throughput, latency, energy efficiency, and power consumption. In particular, we propose and investigate (i) a CNN accelerator and an accompanying automated design methodology that employs metaheuristics for optimal partitioning of available FPGA resources to design high-throughput multiple convolutional layer processors (CLPs), (ii) a framework, referred to as FxP-QNet, that efficiently quantizes the weights and activations of pre-trained CNN-based models to low-precision fixed-point numbers, and (iii) a novel end-to-end memory-driven methodology that enables the deployment of CNNs on resource-constrained edge devices while maintaining model accuracy.

The first work focuses on the development of a Multi-CLP acceleration framework with parameterized Verilog HDL modules. The presented optimization tool adopts the simulated annealing (SA) and tabu search (TS) metaheuristics to find the number of CLPs required, their respective hardware configurations, and the assignment of convolutional layers to CLPs that together achieve the best system performance on a given target FPGA device. We demonstrate that the implemented SA-/TS-based Multi-CLP achieves 1.31x to 2.37x higher throughput than state-of-the-art Single-/Multi-CLP approaches in accelerating the AlexNet, SqueezeNet 1.1, VGG-16, and GoogLeNet architectures on Xilinx VC707 and VC709 FPGA boards.
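
To make the search concrete, below is a minimal Python sketch of a simulated-annealing pass over a Multi-CLP design space: layers are assigned to CLPs and the DSP budget is split among them so as to minimize the cycle count of the bottleneck CLP. The layer workloads, DSP budget, cost model, and cooling schedule are illustrative assumptions; the dissertation’s actual formulation also accounts for on-chip memory, bandwidth, and tiling, and supports tabu search as well.

```python
import math
import random

random.seed(0)

# Illustrative per-layer workloads (MACs); the dissertation's cost model also
# captures on-chip memory, bandwidth, and tiling effects.
LAYER_MACS = [105e6, 223e6, 149e6, 112e6, 74e6]
DSP_BUDGET = 2800      # DSP slices available on the target FPGA (assumed)
NUM_CLPS = 2           # number of convolutional layer processors

def cost(assign, dsps):
    """Cycle count of the slowest CLP; since CLPs run concurrently,
    the bottleneck CLP bounds overall throughput."""
    cycles = [0.0] * NUM_CLPS
    for layer, clp in enumerate(assign):
        cycles[clp] += LAYER_MACS[layer] / dsps[clp]
    return max(cycles)

def neighbor(assign, dsps):
    """Random move: reassign one layer, or shift DSPs between two CLPs."""
    assign, dsps = assign[:], dsps[:]
    if random.random() < 0.5:
        assign[random.randrange(len(assign))] = random.randrange(NUM_CLPS)
    else:
        src, dst = random.sample(range(NUM_CLPS), 2)
        delta = random.randint(1, 64)
        if dsps[src] > delta:
            dsps[src] -= delta
            dsps[dst] += delta
    return assign, dsps

# Simulated annealing with a geometric cooling schedule.
assign = [random.randrange(NUM_CLPS) for _ in LAYER_MACS]
dsps = [DSP_BUDGET // NUM_CLPS] * NUM_CLPS
best_assign, best_dsps, best = assign, dsps, cost(assign, dsps)
temp = best
while temp > 1.0:
    cand_assign, cand_dsps = neighbor(assign, dsps)
    cand, curr = cost(cand_assign, cand_dsps), cost(assign, dsps)
    if cand < curr or random.random() < math.exp((curr - cand) / temp):
        assign, dsps = cand_assign, cand_dsps
        if cand < best:
            best_assign, best_dsps, best = cand_assign, cand_dsps, cand
    temp *= 0.999

print(f"layer->CLP: {best_assign}  DSPs: {best_dsps}  bottleneck cycles: {best:.3e}")
```

A tabu search variant would replace the probabilistic acceptance test with a memory of recently visited moves; the same cost and neighborhood functions could be reused.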

The second work demonstrates the effectiveness of FxP-QNet in achieving the accuracy-compression trade-off with significant improvements in hardware cost, power efficiency, and operating frequency. This is mainly due to FxP-QNet’s replacement of expensive floating-point operations with faster, more hardware-efficient integer operations. In particular, the FxP-QNet-quantized AlexNet, VGG-16, and ResNet-18 reduce the overall memory requirements of their full-precision counterparts by 7.16x, 10.36x, and 6.44x, respectively, with accuracy drops of less than 0.95%, 0.95%, and 1.99%. Additionally, the FxP-QNet-quantized ResNet-18 implemented on a Xilinx Artix-7 FPGA uses 76%, 67%, and 98% fewer look-up tables, flip-flops, and digital signal processing (DSP) blocks, respectively, and consumes 455 mW less than a design based on the conventional 8-bit quantization scheme, even though its operating frequency is 1.35x higher.
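
The core operation behind such a scheme is mapping floating-point values onto a fixed-point grid defined by a wordlength and a fractional length. The snippet below is a minimal NumPy sketch of that mapping; the `to_fixed_point` helper and the signed 8-bit Q1.7 example are illustrative assumptions, whereas FxP-QNet itself selects per-layer wordlengths for a pre-trained network under an accuracy constraint.

```python
import numpy as np

def to_fixed_point(x, word_len, frac_len, signed=True):
    """Quantize a float array to fixed-point with `word_len` total bits,
    `frac_len` of them fractional: round to the nearest representable
    step, then clip to the representable range."""
    step = 2.0 ** -frac_len
    if signed:
        qmin, qmax = -2 ** (word_len - 1), 2 ** (word_len - 1) - 1
    else:
        qmin, qmax = 0, 2 ** word_len - 1
    q = np.clip(np.round(x / step), qmin, qmax)
    return q * step  # value the integer hardware actually represents

# Example: weights roughly in [-1, 1] quantized to signed 8-bit Q1.7 format,
# a 4x memory saving over 32-bit floating point.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.25, size=1000)
w_q = to_fixed_point(w, word_len=8, frac_len=7)
print("mean squared quantization error:", np.mean((w - w_q) ** 2))
```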

In the third work, we derive optimized gradient formulas that automate the learning of wordlengths and quantization methods through stochastic gradient descent, enabling the deployment of highly accurate, low-latency MobileNet models tailored to the target hardware architecture on a Xilinx Zynq-7020 FPGA edge device. Specifically, the presented framework designs a customized integer-only MobileNet-V2 with a 1.53 MB model and 0.97 MB of activation memory while achieving a top-1 validation accuracy of 72.3% on the ImageNet dataset, improving on the top-1 accuracy of a previously published implementation by 4.3%.
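
Rounding has zero gradient almost everywhere, so learning quantization parameters with stochastic gradient descent requires a trick such as the straight-through estimator (STE), which passes gradients through the rounding as if it were the identity. The PyTorch sketch below shows the STE idea on a uniform quantizer with a learnable step size; the `LearnedQuantizer` class, its hyperparameters, and the toy fitting loop are illustrative assumptions, not the dissertation’s exact gradient formulas, which additionally learn wordlengths and quantization methods.

```python
import torch

class LearnedQuantizer(torch.nn.Module):
    """Uniform signed quantizer whose step size is learned by SGD.
    Forward uses round(); backward treats round() as identity (STE)."""
    def __init__(self, num_bits=4, init_step=0.05):
        super().__init__()
        self.qmax = 2 ** (num_bits - 1) - 1           # e.g. 7 for 4 bits
        self.step = torch.nn.Parameter(torch.tensor(init_step))

    def forward(self, x):
        q = torch.clamp(x / self.step, -self.qmax - 1, self.qmax)
        q = q + (q.round() - q).detach()              # straight-through estimator
        return q * self.step                          # dequantized value

# Toy usage: learn the step size that minimizes reconstruction error.
torch.manual_seed(0)
x = torch.randn(4096) * 0.3
quant = LearnedQuantizer(num_bits=4)
opt = torch.optim.SGD(quant.parameters(), lr=0.01)
for _ in range(300):
    opt.zero_grad()
    loss = torch.mean((quant(x) - x) ** 2)
    loss.backward()
    opt.step()
print(f"learned step size: {quant.step.item():.4f}, MSE: {loss.item():.6f}")
```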

• Dissertation Evaluation: Excellent (Total score: 92.3%)

• Graduate Courses:

  • Parallel Process Architectures
  • CAD of Digital Systems
  • Algorithms and Complexity
  • Data Security and Encryption
  • Spec Top: Comp Networking Tech
  • Computer Systems Performance
  • Modeling and Simulation
  • Heterogeneous Computing
  • Digital Forensics

