Abstract
The traditional balance between processor clock rate and instruction-level parallelism (ILP) has been severely shaken by enablers and constraints that were not prevalent, or did not exist, during the last two decades. With multithreaded software now broadly deployed on servers (and increasingly on desktops), a traditional processor with high frequency and conventional out-of-order (OoO) scheduling is no longer the optimal way to extract performance. Moreover, as cache footprints grow with problem size, scaling, and multiprogramming, the memory wall problem only worsens. Nor should Amdahl's law be forgotten: in this context of on-chip, closely coupled parallel execution, single-thread performance remains important, since even a small fraction of non-scalable code affects overall performance. Further constraints, such as power density and the overall power limits of air- and liquid-cooled systems, make placing multiple traditional cores on a single die an unattractive design point. High-Performance Throughput Computing, achieved through designed-from-scratch processors composed of multiple multithreaded cores, offers an unprecedented opportunity to create a new-generation pipeline that delivers both high throughput and high single-thread performance. This is the first disclosure of what we believe is the first truly new pipeline in a decade. We describe a checkpoint-based architecture that offers a new execution model: hardware threads are spawned that speculatively execute and retire instructions out of order. Power efficiency is emphasized by maximizing the utilization of pipeline stages through temporal threading and of functional units through spatial threading and speculation. This pipeline is embedded multiple times in our future high-end 65nm and 45nm processors, which form the cornerstone of a broad line of systems ranging from small servers to supercomputers.
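As a brief illustrative sketch of the Amdahl's law point above (the symbols $f$ for the non-parallelizable fraction and $N$ for the number of cores are our notation, not drawn from the abstract), the achievable speedup is bounded by

\[
S(N) = \frac{1}{f + \dfrac{1-f}{N}}, \qquad \lim_{N\to\infty} S(N) = \frac{1}{f}.
\]

For example, with only $f = 0.05$ (5% serial code) and $N = 32$ cores, $S(32) = 1/(0.05 + 0.95/32) \approx 12.5$, and no number of cores can push the speedup past $20\times$; this is why single-thread performance on the non-scalable fraction remains critical in a closely coupled many-core design.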