CERCS Technical Reports

GIT-CERCS-07-09

POD: A Parallel-On-Die Architecture

As power constraints, complexity, and design verification cost make it difficult to improve single-stream performance, the parallel computing paradigm is taking its place among mainstream high-volume architectures. Most current commercial designs focus on MIMD-style CMPs built with rather complex single cores. While such designs provide a degree of generality, they may not be the most efficient way to build processors for applications with inherently scalable parallelism. These designs have been proven to work well for certain classes of applications, such as transaction processing, but they have driven the development of new languages and complex architectural features.

Instead of building MIMD-CMPs for all workloads, we propose an alternative parallel on-die many-core architecture called POD, based on a large SIMD PE array. POD helps to address the key challenges of on-chip communication bandwidth, area limitations, and energy consumed by routers by factoring out features necessary for MIMD machines and focusing on architectures that match many scalable workloads. In this paper, we evaluate and quantify the advantages of the POD architecture, whose ISA is based on a commercially relevant CISC architecture, and show that it can be as efficient as more specialized array processors based on one-off ISAs. Our single-chip POD is capable of best-in-class scalar performance and up to 1.5 TFLOPS of single-precision floating-point arithmetic. Our experimental results show that, in some application domains, our architecture can achieve nearly linear speedup on a large number of SIMD PEs, and that this speedup is much larger than the maximum speedup that MIMD-CMPs of the same die size can achieve. Furthermore, we show that, owing to synchronized computation and communication, POD can efficiently reduce energy consumption through the novel communication method used in its interconnection network.