MSc(Engg) Thesis, Department of Computer Science and Automation,
Indian Institute of Science, Bangalore, India, December 2006.
A Large instruction window is a key requirement to exploit greater Instruction Level Parallelism in out-of-order superscalar processors. Along with the instruction window size, the size of various other structures including the issue queue, store queue and register file need to increase as well. However, the cycle time and energy consumption of conventional large monolithic Content Addressable Memories (CAMs), the underlying structure of most conventional issue queue and store queue designs, worsen rapidly with an increase in size. This results in a three way trade-off involving ILP, clock frequency and energy consumption. In this thesis, we propose efficient designs for the issue queue and the store queue that improve the circuit latency and energy consumption while minimizing the loss in IPC.
We propose the Scalable Low power Issue Queue (SLIQ) design which segments the issue queue structure to reduce the latency. This is complemented with a fast wakeup index to a consumer in the issue queue for every instruction. As this consumer instruction can be woken up directly, without any delay, this mitigates the IPC loss faced by the pipelined issue queue. Also, as the scheme incorporates a pipelined broadcast, the indices are not required for correctness and can simply be gang invalidated on branch mispredictions. The IPC loss of an 8 segment SLIQ is within 2.3% for the entire SPEC CPU2000 benchmark suite while achieving a 39.3% reduction in issue latency. Further, in the SLIQ design unnecessary broadcasts to the higher segments are avoided most of the time as in a large majority of the cases, an instruction has a single consumer. This consumer is woken up either by direct indexing or by broadcast in the first segment of the SLIQ. This enables the 8 segment SLIQ to significantly reduce the energy consumption and the energy-delay product by 48.3% and 67.4% respectively on an average. SLIQ also allows the architects to segment the issue queue carefully so that the latency of the issue logic is just within the per pipeline stage latency goals of the design.
We also propose the Scalable Low power Store Queue (SLSQ) to address similar problems associated with the store queue data forwarding logic. We extend the state of-the-art Store Vector based Disambiguator to also predict the index of the store that will forward to a given load. SLSQ marginally adds to the hardware budget, but predicts the store queue index of the store which will forward with an accuracy of 99.5% on an average. SLSQ, thus, eliminates unnecessary address broadcasts and compares and reduces energy consumption of the store-to-load forwarding logic by 78.4% and 91.6% for the SPEC Int and FP suites respectively. Another variant of SLSQ, eliminates the need for a CAM in the forwarding logic and achieves a 49.9% reduction in store to load data forwarding latency while incurring a minimal IPC loss less than 0.1% on average for the entire SPEC CPU2000 benchmark suite.