-------------------------------------------------------------------------- Lecture Notes scripted by students of the course. All disclaimers and copyrights apply ! :-) (c) RG/SERC/IISc -------------------------------------------------------------------------- HIGH PERFORMANCE COMPUTING ---------------------------- NOTES OF CLASS 11 (Sept. 15 and 20, 2000) ----------------------------------------- PIPELINING ---------- For simplicity of discussion let we assume each instruction goes through all the five steps, namely IF,ID,EX,MEM,WB . So, in a non-pipelined execution, the execution of one instruction is completed before the next instructions is started. We can represent this diagrammatically as: +-------------------------+ | IF | ID | EX | MEM | WB | <-------------- 1 st instn +-------------------------+-------------------------+ | IF | ID | EX | MEM | WB | <----------- 2nd instn +-------------------------+-------------------------+ | IF | ID | EX | MEM | WB | +-------------------------+ The above diagram shows that instruction 2 waits till the completion of instruction 1. So if an instn takes 5 cycles, then n instructions should take 5n cycles. It is also shows that the hardware corresponding to the other stages of instruction execution wait for the remaining cycle when it is not active. Instead, if the execution of the subsequent instruction is started in the next cycle, and the hardware for the IF phase of the next instruction may be utilized to fetch the 2nd instruction in cycle 2. This is known as pipelining. Pipelining is illustrated below: +----+----+----+-----+----+ | IF | ID | EX | MEM | WB | <-------------- 1 st instruction +----+----+----+-----+----+----+ | IF | ID | EX | MEM| WB | <----------- 2nd instruction +----+----+-----+----+----+----+ | IF | ID | EX | MEM| WB | <------ 3 rd instruction +----+-----+----+----+----+ It is clear from the above diagram that immediately after the IF of instruction 1 (when ID of 1st instruction is going on ), IF of 2nd instruction is also started. so once the pipeline is full, at any cycle, if the ith instruction is in the WB stages, the (i+1), (i+2), (i+3) and (i+4)th instructions are respectively in MEM, EX, ID and IF stages. Further, in each cycle a new instruction starts and one completes . So Throughput = 1 instruction/cycle. therefore we can also say that CPI= 1.0 In this context it is important to note that the execution time of an instruction does not decrease due to pipelining. The execution time of an instruction is still 5 cycles; but the throughput increases. execution time in non-pipelined system Now, Speedup = ---------------------------------------- execution time in pipelined system So for a k stage pipeline , for n instructions, time taken in a non-pipelined system is kn , whereas in pipelined system it is k + (n-1) , and n >> k , kn So speedup = --------------- which becomes approximately k , k + (n-1) For non-pipelined system throughput is 1/k instruction/cycle ; However, the above discussion assumes an ideal situation, assuming a new instruction can be started in each cycle. In practice, this is not the case, as "hazards" incur stalls (wasted cycles) in pipelining. HAZARDS: ------- In pipelining we meet several hazards ; they are : 1. Structural hazards 2. Data hazards 3. Control hazards these are described below: STRUCTURAL HAZARDS ------------------ A resource (register, PC, memory port) is being used by more than 1 instruction at same time step, it is known as structural hazard. +----+----+----+-----+----+ | IF | ID | EX | MEM | WB | <-------------- 1 st instruction +----+----+----+-----+----+----+ | IF | ID | EX | MEM| WB | <----------- 2nd instruction +----+----+-----+----+----+----+ | IF | ID | EX | MEM| WB | <------ 3 rd instruction +----+-----+----+----+----+ In the above diagram it is clear that just after the IF phase of instruction 1, IF of instruction 2 begins. So the IR of instruction 1 must be copied to some other register, in order not to be overwritten by instruction 2. For this reason we use buffers between every phases. Thus there buffers (consisting of several registers) between each pair of consecutive pipeline stages (i.e, IF/ID, ID/EX, EX/Mem, Mem/WB ). IF/ID EX/Mem ^ ^ | | +----+----+----+-----+----+ | IF | ID | EX | MEM | WB | +----+----+----+-----+----+ | | v v ID/EX Mem/WB Any register or PC which may be needed later, are saved on this buffer and copied to next buffer continuously. So the IR of IF/ID buffer is copied to IR of ID/EX buffer when IF of instn2 takes place. If we do not have these buffers we will not be able to continue instructions. Another example of structural hazard is having a single memory port for both instruction and data memory. In this case if the ith instruction is a load/store instruction, and when the (i+3)rd instruction is in IF stage, both instructions require the use of the (single) memory port. This is a structural hazard, and only one of the instruction can proceed. In this case the i-th instruction proceeds, while the (i+3)rd instrn. is stalled as shown below! needs memory port for .load/store | | V +----+----+----+----+-----+ | IF | ID | EX | MEM| WB | instrn. i (load/store instrn.) +----+----+----+----+-----+----+ | IF | ID | EX | MEM | WB | instrn. (i+1) +----+----+----+-----+----+----+ | IF | ID | EX | MEM| WB | instrn. (i+2) +----+----+-----+----+----+----+----+ | IF |stall| ID | EX | MEM| WB | instrn. (i+3) +----+-----+----+----+----+----+ | V needs memory port for IF DATA HAZARD : -------------- Before we discuss data hazards, let us first explain data dependencies. There are 3 kinds of data dependencies : 1.RAW (read after write)dependency 2.WAR (write after read) dependency 3.WAW ( write after write) dependency RAW dependency -------------- Two instructions have a RAW dependency, if the result produced by first instruction is used (as source operand) in the second instruction. In this case, the 2nd instruction cannot start before completion of 1st instruction. For example, add R1,R2,R3 sub R6,R7,R1 are the two instructions. Now the new value of R1 (produced by the add instruction) will be available only after the WB stage of of 1st instruction. However, the subsequent sub instruction will attempt to read the value of R1 during its ID stage, at which time, the add instruction would only be in EX stage! This causes a RAW hazard. write in R1 | | V +----+----+----+----+-----+ | IF | ID | EX | MEM| WB | instrn. i (add R1, R2, R3) +----+----+----+----+-----+----+ | IF | ID | | | | instrn. (i+1) : (sub R7, R6, R1) +----+----+----+-----+----+ | V read R1 Hence the sub instrn. must be stalled by 2 cycles, as shown below. +----+----+----+----+-----+ | IF | ID | EX | MEM| WB | instrn. i (add R1, R2, R3) +----+----+----+----+-----+----+----+----+ | IF |stal|stal| ID | EX |MEM | WB | (i+1) : (sub R7, R6, R1) +----+----+----+-----+----+----+----+----+ |stal|stal| IF | ID | EX |MEM | WB | (i+2) +----+----+-----+----+----+----+----+ Instrn. i+2 is also stalled corresponding as shown above. [It may be noted that the Write in R1 and the Read from R1 can happen at the same cycle as indicated above. This is possible if the write takes during the first half of the cycle (rising edge of the clock) and the read takes place during the second half of the cycle (falling edge). Otherwise, 3 stall cycles may be needed in the above example!] WAR dependency --------------- Let us take the example add R1,R2,R3 sub R6,R7,R1 mult R1,R3,R5 There is a WAR dependency between the sub and mult instruction, as R1 must be (re)written by the mult instruction only after it is read by the sub instruction. WAR dependency is also known as anti dependency. WAW dependency -------------- Let us take the above example again. add R1,R2,R3 sub R6,R7,R1 mult R1,R3,R5 There is a WAW dependency between the add and mult instruction, as R1 must be (re)written by the mult instruction only after it is written by the add instruction. WAW dependency is also known as output dependency. Hazards may be caused due to WAR and WAW dependencies. Although WAR and WAW hazards cannot happen in a simple 5-stage pipeline that we have discussed, it is possible in more realistic pipelines where complex instructions such as FP divide take a longer time (e.g., 32 cycles) and a simple FP add takes only 4 cycles. In these cases, an earlier FP divide operate may complete at a later time time than a later FP add operation, causing WAW hazard! In such a case the pipeline hardware must ensure that the later instruction (FP Add) must be stalled until the earlier FP Divide completes. Because of pipeline stalls (due to data hazards), we may have cycles where there is no instruction completing. Solving the data hazard this way, i.e. by stalling the execution of the subsequent instruction(s), is known as pipeline interlocking; Note that pipeline interlocking is necessary to ensure that proper values of registers are fetched by dependent instructions. If dependent instructions are close together then we get more stalls; if they are far apart, (in the 5-stage pipeline, farther by 2 instructions) we may not need any stall at all. PIPELINE FORWARDING: ------------------- Pipeline forwarding or bypassing is another technique to avoid stall due to (RAW) data hazards. let take the example add R1,R2,R3 sub R5,R3,R1 The second instruction needs the value produced by the add instruction. The result produced by the add instruction is available right after the EX stage of the add instrn. Thus if the result value is "forwarded" directly from the output of the ALU to its input, then the second instruction (sub in this case) can proceed for the EX stage without any stalling. That is, instead of reading R1 in ID of 2nd instruction, it is forwarded to the EX stage of the 2nd instruction as shown below. +----+----+----+-----+----+ | IF | ID | EX | MEM | WB | +----+----+----+-----+----+ \ +----+----+----+-----+----+ | IF | ID | EX | MEM | WB | +----+----+----+-----+----+ However, bypassing is helpful only if the pipeline is not deep, i.e, the result (after the EX stage or MEM stage in case of Load instrn.) is available "soon". Even when the result is produced in Mem stage, we see that forwarding cannot completely eliminate the stall in the following example. ld R1,O(R2) add R3,R4,R1 in this case R1 will be available only after MEM phase of of previous instruction. So 1 stall cycle can't be avoided. +----+----+----+-----+----+ | IF | ID | EX | MEM | WB | +----+----+----+-----+----+ \ +----+----+-----+----+----+ | IF | ID |stall| Ex | WB | +----+----+-----+----+----+ For example consider the RAW hazard due to: fadd F0,F2,F4 fadd F6,F0,F8 Floating point addition takes more time as compared to integer one. So the execution phase is longer, say 4 cycles. We can think of EX phase consists of four sub phase EX1 ,EX2,EX3,EX4 ; So for the result of fadd instruction, the 2nd instn will have to wait upto EX4 of 1st instruction. +----+----+-----+-----+-----+-----+-----+----+ | IF | ID | EX1 | EX2 | EX3 | EX4 | MEM | WB | +----+----+-----+-----+-----+-----+-----+----+ \ \ +----+----+-----+-----+-----+-----+-----+----+------+-----+----+ | IF | ID |stall|stall|stall| EX1 | EX2 | EX3| EX4 | MEM | WB | +----+----+-----+-----+-----+-----+-----+----+------+-----+----+ Again, in these cases pipeline forwarding cannot completely eliminate all the stalls.