CERCS Technical Reports

GIT-CERCS-09-06

A Characterization and Analysis of GPGPU Kernels

General-purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications, pushed to the forefront by the introduction of C-based programming environments such as NVIDIA's CUDA [1], OpenCL [2], and Intel's Ct [3]. While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application restructuring, and GPGPU micro-architecture design.

This paper proposes a set of metrics for GPGPU workloads and uses these metrics to analyze the behavior of GPGPU programs. We report on an analysis of over 50 kernels and applications, including the full NVIDIA CUDA SDK [4], covering control flow, data flow, parallelism, and memory behavior. The analysis was performed using a full-function emulator we developed that implements the NVIDIA virtual machine known as PTX (Parallel Thread eXecution), a machine model and low-level virtual ISA. The emulator can execute compiled kernels from the CUDA compiler, currently supports the full PTX 1.3 specification [5], and has been validated against the full CUDA SDK. The results quantify the importance of optimizations such as those for branch re-convergence, the prevalence of sharing between threads, and the opportunities for additional parallelism.