GIT-CERCS-12-01
Gregory Diamos, Haicheng Wu, Ashwin Lele, Jin Wang, Sudhakar Yalamanchili,
Efficient Relational Algebra Algorithms and Data Structures for GPU
Relational databases remain an important application domain for
organizing and analyzing the massive volume of data generated as
sensor technology, retail and inventory transactions, social media,
computer vision, and new fields continue to evolve. At the same time,
processor architectures are beginning to shift towards hierarchical
and parallel architectures employing throughput-optimized memory
systems, lightweight multi-threading, and Single-Instruction
Multiple-Data (SIMD) core organizations. Examples include general
purpose graphics processing units (GPUs) such as NVIDIA's Fermi,
Intels Sandy Bridge, and AMD's Fusion processors.
This paper explores the mapping of primitive relational algebra
operations onto GPUs. In particular, we focus on algorithms and data
structure design identifying a fundamental conflict between the
structure of algorithms with good computational complexity and that of
algorithms with memory access patterns and instruction schedules that
achieve peak machine utilization. To reconcile this conflict, our
design space exploration converges on a hybrid multi-stage algorithm
that devotes a small amount of the total runtime to prune input data
sets using an irregular algorithm with good computational complexity.
The partial results are then fed into a regular algorithm that
achieves near peak machine utilization. The design process leading to
the most efficient algorithm for each stage is described, detailing
alternative implementations, their performance characteristics, and an
explanation of why they were ultimately abandoned.
The least efficient algorithm (JOIN) achieves 57%-72% of peak
machine performance depending on the density of the input. The most
efficient algorithms (PRODUCT, PROJECT, and SELECT) achieve
86%-92% of peak machine performance across all input data sets. To
the best of our knowledge, these represent the best known published
results to date for any implementations. This work lays the foundation
for the development of a relational database system that achieves good
scalability on a Multi-level Bulk-Synchronous-Parallel (Multi-BSP)
processor architecture exemplified by modern GPUs.