#### Midterm Exam Information

- Exam Date/Time: Thurs. March 10, in class
- Exam format:
  - Closed book / closed notes
  - One 8.5" x 11" (one side) sheet of notes permitted
  - Approx. 5 problems
  - Questions will be problem-solving in nature
- Exam will be based upon material covered in lecture
  - Lecture notes should be your primary study guide
- Relevant portions of textbook:
  - Appendices A & B
  - Chapters 1 and 2
  - Can ignore sections 1.5, 1.6, 1.7 and the subsection on Value Prediction in Chapter 2 (page 130)

# Simple Performance Comparison

- Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n
- Machine A is x% faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
- E.g. time(A) = 10s, time(B) = 15s
  - $15/10 = 1.5 \Rightarrow$ A is 1.5 times faster than B
  - $15/10 = 1.5 \Rightarrow$ A is 50% faster than B

# **Processor Performance Equation**

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

# **PPE Example**

- Machine A: clock 1ns, CPI 2.0, for program P
- Machine B: clock 2ns, CPI 1.2, for program P
- Which is faster and how much?

Time/Program = instr/program x cycles/instr x sec/cycle

With N = number of instructions in program P:

Time(A) = $N \times 2.0 \times 1 = 2N$ ns

Time(B) = $N \times 1.2 \times 2 = 2.4N$ ns

Compare: Time(B)/Time(A) = 2.4N/2N = 1.2

So, Machine A is 20% faster than Machine B for this program.

#### Amdahl's Law

(Originally formulated for vector processing)

- f = fraction of program that is vectorizable
- (1-f) = fraction that is serial
- N = speedup for vectorizable portion
- Overall speedup:

$$S = \frac{1}{(1 - f) + f/N}$$

#### Generalization of Amdahl's Law

(To apply to any processor performance enhancement)

- f = fraction of program that can take advantage of the enhancement
- (1-f) = fraction that cannot take advantage
- N = speedup for enhanced portion
- Overall speedup:

$$S = \frac{1}{(1 - f) + f/N}$$

#### Amdahl's Law Example

An enhancement to a processor architecture is proposed that would decrease the CPI for floating point multiply instructions from 20 cycles to 1 cycle (a speedup of 20). The CPI of all other instructions will be unchanged.
What will be the overall processor speedup resulting from this modification?

#### Amdahl's Law Example (continued)

Suppose that, in the original design, floating point multiplies accounted for 6% of the total execution time of a "typical program".

Then by Amdahl's law the speedup due to the enhanced floating point multiply will be

$$S = \frac{1}{(1 - 0.06) + 0.06/20} = 1.06$$

#### Amdahl's Law Example (continued)

Now suppose that, for a different program, floating point multiplies account for 60% of the total execution time in the original design.

Then by Amdahl's law the speedup due to the enhanced floating point multiply (for this particular program) will be

$$S = \frac{1}{(1 - 0.6) + 0.6/20} = 2.33$$

#### But, Pipelining is not quite that easy!

- Limits to pipelining: Hazards prevent the next instruction from executing during its designated clock cycle
  - <u>Structural hazards</u>: HW cannot support this combination of instructions (single person to fold and put clothes away)
  - <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline (missing sock)
  - <u>Control hazards</u>: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

[Figure: Data Hazard on R1. Pipeline diagram over time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for the sequence `add r1,r2,r3`; `sub r4,r1,r3`; `and r6,r1,r7`; `or r8,r1,r9`; `xor r10,r1,r11`.]

#### Three Generic Data Hazards

- Read After Write (RAW): $\text{Instr}_J$ tries to read operand before $\text{Instr}_I$ writes it

```
I: add r1,r2,r3
J: sub r4,r1,r3
```

- Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

#### Three Generic Data Hazards

- Write After Read (WAR): $\text{Instr}_J$ writes operand before $\text{Instr}_I$ reads it

```
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
```

- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
- Can't happen in MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Register reads are always in stage 2, and
  - Register writes are always in stage 5

#### Three Generic Data Hazards

- Write After Write (WAW): $\text{Instr}_J$ writes operand <u>before</u> $\text{Instr}_I$ writes it.

```
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
```

- Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
- Can't happen in MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Register writes are always in stage 5
- Will see WAR and WAW in more complicated pipes

#### **Resolution of Pipeline Hazards**

- Pipeline hazards
  - Potential violations of program dependences
  - Must ensure program dependences are not violated
- Hazard resolution
  - Static: compiler/programmer guarantees correctness
  - Dynamic: hardware performs checks at runtime
- Pipeline interlock
  - Hardware mechanism for dynamic hazard resolution
  - Must detect and enforce dependences at runtime

# **Data Hazard Mitigation**

- A better response: forwarding
  - Also called bypassing
- Comparators ensure register is read after it is written
- Instead of stalling until write occurs
  - Use mux to select forwarded value rather than register value
  - Control mux with hazard detection logic

Data Hazard on R1 (time in clock cycles; the dependent `sub` waits two stall cycles, shown as bubbles on the slide):

| Instr. order | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------------|----|----|-------|-------|----|----|-----|-----|----|
| add r1,r2,r3 | IF | ID | EX | MEM | WB | | | | |
| sub r4,r1,r3 | | IF | stall | stall | ID | EX | MEM | WB | |
| and r6,r1,r7 | | | | | IF | ID | EX | MEM | WB |

#### **Control Dependences**

- Conditional branches
  - Branch must execute to determine which instruction to fetch next
  - Instructions following a conditional branch are control dependent on the branch instruction
- Unconditional branches (including subroutine calls)
  - Branch can't take place until branch target address is calculated
- Exceptions
  - Interrupts
  - Hardware exceptions
  - Trap instructions

#### **Branch Stall Impact**

- If CPI = 1, 30% branch, stall 3 cycles => new CPI = 1 + 0.3 x 3 = 1.9!
- Two-part solution:
  - Determine branch outcome (taken/not-taken) sooner, AND
  - Compute branch target address earlier
- MIPS branch tests if register $= 0$ or $\neq 0$
- MIPS solution:
  - Move zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3

# **Evaluating Branch Alternatives**

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

Assume 4% unconditional branch, 6% conditional branch untaken, 10% conditional branch taken

| Scheduling scheme | Branch penalty | CPI | Speedup vs. unpipelined | Speedup vs. stall |
|-------------------|----------------|------|-------------------------|-------------------|
| Stall pipeline | 1 | 1.2 | 4.17 | 1.0 |
| Predict not taken | 1\* | 1.14 | 4.39 | 1.05 |
| Delayed branch | 0.5 | 1.10 | 4.55 | 1.09 |

\* Only for wrong prediction

Assumes branch outcome determination and BTA generation in decode stage, and 50% of delay slots filled with useful instructions for delayed branching.

# Limitations of Our Simple 5-stage Pipeline

- Assumes single cycle EX stage for all instructions
- This is not
feasible for
  - Complex integer operations
    - Multiply
    - Divide
    - Shift (possibly)
  - Floating point operations

Shen, Lipasti

#### Problems with Diversified Pipeline

- Many more RAW hazard opportunities due to longer FP instruction execution times
- New structural hazards:
  - Divide instructions at distance < 25 (due to non-pipelined divide unit)
  - Multiple register writes per cycle due to variable instruction execution times
- Out-of-order instruction completion. Why is this a problem?
  - WAW hazards are possible (WAR not possible. Why?)

#### Diversified Pipeline--WAW Hazard

[Figure: pipeline diagram. `MUL.D F0,F2,F4` passes through IF, ID, M1 to M7, MEM, WB; a later load `LOAD F0,10(R3)` (instruction I+3) passes through IF, ID, EX, MEM, WB and so writes F0 before the MUL.D does.]

# Can the Compiler Help?

```
Loop: L.D    F0,0(R1)    ;F0=vector element
      ADD.D  F4,F0,F2    ;add scalar from F2
      S.D    0(R1),F4    ;store result
      DADDUI R1,R1,-8    ;decrement pointer 8B (DW)
      BNEZ   R1,Loop     ;branch R1!=zero
```

Assume the following pipeline latencies (ignore delayed branch in these examples):

| Instruction producing result | Instruction using result | Stalls between (cycles) |
|------------------------------|--------------------------|-------------------------|
| FP ADD | Another FP ALU op | 3 |
| FP ADD | Store double | 2 |
| Load double | FP ALU op | 1 |
| Load double | Store double | 0 |
| Integer op | Integer op | 0 |

#### Reorganized Code to Reduce Stalls

Swap DADDUI and S.D by changing address of S.D:

```
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    8(R1),F4    ;altered offset when moving DADDUI
      BNEZ   R1,Loop
```

7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D), 4 for loop overhead.

Can we (the compiler) do better?
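The cycle counts for the two versions of the loop above can be checked with a small in-order issue model. This is only a sketch under stated assumptions: the `Instr` record, the `issue_cycles` function and the instruction "kinds" are hypothetical names introduced here, one instruction issues per cycle, and the one-stall entry for an integer op feeding a branch is an assumed value that the latency table does not list (it does not change the 7-cycle result for the reorganized loop).

```python
from typing import NamedTuple, Optional, Tuple

# Stalls required between a producer and a dependent consumer.
# All entries come from the latency table except the last one.
STALLS = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # ASSUMPTION: not in the table
}

class Instr(NamedTuple):
    text: str
    kind: str                # "load", "fp_alu", "store", "int" or "branch"
    dest: Optional[str]      # register written, if any
    srcs: Tuple[str, ...]    # registers read

def issue_cycles(program):
    """Return the cycle in which each instruction issues (in-order, 1 per cycle)."""
    last_writer = {}         # register -> (issue cycle, kind) of latest producer
    cycles = []
    prev = 0
    for ins in program:
        earliest = prev + 1  # cannot issue before the previous instruction
        for reg in ins.srcs:
            if reg in last_writer:
                p_cycle, p_kind = last_writer[reg]
                stalls = STALLS.get((p_kind, ins.kind), 0)
                earliest = max(earliest, p_cycle + 1 + stalls)
        cycles.append(earliest)
        if ins.dest is not None:
            last_writer[ins.dest] = (earliest, ins.kind)
        prev = earliest
    return cycles

original = [
    Instr("L.D F0,0(R1)", "load", "F0", ("R1",)),
    Instr("ADD.D F4,F0,F2", "fp_alu", "F4", ("F0", "F2")),
    Instr("S.D 0(R1),F4", "store", None, ("R1", "F4")),
    Instr("DADDUI R1,R1,-8", "int", "R1", ("R1",)),
    Instr("BNEZ R1,Loop", "branch", None, ("R1",)),
]
reorganized = [
    original[0],
    original[3],
    original[1],
    Instr("S.D 8(R1),F4", "store", None, ("R1", "F4")),
    original[4],
]

print(issue_cycles(original))     # [1, 3, 6, 7, 9]
print(issue_cycles(reorganized))  # [1, 2, 3, 6, 7]
```

The last issue cycle is the length of one iteration: 7 for the reorganized loop, matching the count above, and 9 for the original order under the assumed branch stall.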
#### Loop Unrolling to Improve Performance

```
Loop: L.D    F0,0(R1)
                               ;1 cycle stall
      ADD.D  F4,F0,F2
                               ;2 cycles stall
      S.D    0(R1),F4          ;drop DADDUI & BNEZ
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    -8(R1),F8         ;drop DADDUI & BNEZ
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    -16(R1),F12       ;drop DADDUI & BNEZ
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    -24(R1),F16
      DADDUI R1,R1,#-32        ;alter to 4*8
      BNEZ   R1,Loop
```

Each of the four L.D / ADD.D / S.D groups incurs the same 1-cycle and 2-cycle stalls.

27 clock cycles, or 6.75 per iteration (assumes the iteration count is a multiple of 4)

#### Loop Unrolling with Code Rearrangement

```
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    0(R1),F4
      S.D    -8(R1),F8
      S.D    -16(R1),F12
      DSUBUI R1,R1,#32
      S.D    8(R1),F16         ;8-32 = -24
      BNEZ   R1,Loop
```

14 clock cycles, or 3.5 per iteration

#### Hardware-based Performance Optimization--**Dynamic Scheduling**

- Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
  - Handles cases when dependences are unknown at compile time
  - Allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve
  - Allows code to be compiled independently of the details of a particular pipeline
  - Simplifies the compiler
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (more about this later)

#### **Dynamic Scheduling**

- Key idea: Allow instruction(s) following a stall to proceed
  - `DIVD F0,F2,F4`
  - `ADDD F10,F0,F8`
  - `SUBD F12,F8,F1`
- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
- Will distinguish when an instruction begins execution and when it completes execution; between these times, the instruction is in execution
- Note: Dynamic execution creates WAR and WAW hazards and makes exceptions harder

(CS252 S06 Lec7 ILP)

#### Dynamic Scheduling--Starting Point

- Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until no data hazards, then read operands

#### Tomasulo's Algorithm

- Control & buffers <u>distributed</u> with Functional Units (FU)
  - FU buffers called "reservation stations"; have pending operands
- Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
  - Renaming avoids WAR, WAW hazards
  - More reservation stations than registers, so can do optimizations compilers can't
- Result forwarding via a <u>Common Data Bus</u> that broadcasts results to all FUs
  - Avoids RAW hazards by executing an instruction only when its operands are available
- Loads and stores treated as FUs with RSs as well
- Integer instructions can go past branches (predict taken), allowing FP ops beyond basic block in FP queue

[Figure: IBM 360/91 FPU with the Common Data Bus (CDB)]

# Performance Enhancement--Better Branch Prediction

- Accurate branch prediction becomes more important with dynamic scheduling
  - Dynamic scheduling may stall if it can't look past branch points
  - Cost of misprediction may be high

#### Dynamic (Run-time) Branch Prediction

- Why does prediction work?
  - Underlying algorithm has regularities
  - Data that is being operated on has regularities
  - Instruction sequence has redundancies that are artifacts of the way that humans/compilers think about problems
- Is dynamic branch prediction better than static branch prediction?
  - Seems to be (most modern processors use it)
  - There are a small number of important branches in programs which have dynamic behavior

# A two-bit branch predictor

- Change prediction only after mispredicting twice
- Adds *hysteresis* to decision making process
- Many other two-bit prediction schemes are possible

# Another two-bit branch predictor

- Two-bit saturating counter (Smith predictor)

[Figure: state diagram of the two-bit saturating counter, with Predict Taken and Predict Not Taken states]

#### **Correlated Branch Prediction**

- Idea: track the outcome of the *m* most recently executed branches (globally), and use that pattern to select the proper *n*-bit branch history table
- In general, an (m,n) predictor uses the last m (global) branch outcomes to select between $2^m$ history tables, each with n-bit counters
  - Thus, the old 2-bit BHT is a (0,2) predictor
- Global branch history: m-bit shift register keeping T/NT status of last m branches
- Each entry in the table has $2^m$ n-bit predictors (local branch history), one per global history pattern

# Branch Prediction--What about the Branch Target Address (BTA)?

- Branch prediction is of no value unless we know the BTA
- Branch target calculation is costly and stalls the instruction fetch.
- A Branch Target Buffer (BTB) can store previously computed BTAs
  - The BTA of a taken branch is stored in the BTB
  - For subsequent executions of this branch, the BTA can be "looked up" in the BTB
  - If the branch was predicted taken, instruction fetch continues at the predicted PC

# **Limitations of Scalar Pipelines**

- Scalar upper bound on throughput
  - IPC <= 1 or CPI >= 1
- Inefficient unified pipeline
  - Long latency for each instruction
- Rigid pipeline stall policy
  - One stalled instruction stalls all newer instructions
  - Tomasulo's algorithm alleviated this problem

#### Challenge for Superscalar Pipes

- How to keep the pipeline operating at or near full capacity?
  - Wide instruction fetch gobbles up instructions at a high rate
  - Branches are encountered frequently
  - Cost of stalls is much higher than for scalar pipelines
- Branches pose the biggest challenge to exploiting Instruction Level Parallelism (ILP)

#### Superscalar Pipelines--Exploiting ILP

- To maintain a steady stream of instructions to feed functional units it is necessary to maintain instruction fetch and execution beyond branch points
  - This leads to "speculative execution" of instructions
  - Accurate branch prediction is essential
  - Must ensure that wrong guesses don't lead to incorrect behavior

#### Speculation for greater ILP

- Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correct
  - Speculation ⇒ fetch, issue, and execute instructions as if branch predictions were always correct
  - Dynamic scheduling ⇒ only fetches and issues instructions
- Essentially a data flow execution model: Operations execute as soon as their operands are available

#### Speculation for greater ILP (continued)

- 3 components of HW-based speculation:
  1. Dynamic branch prediction to choose which instructions to execute
  2. Speculation to allow execution of instructions before control dependences are resolved, plus the ability to undo the effects of an incorrectly speculated sequence
  3. Dynamic scheduling to deal with scheduling of different combinations of basic blocks

#### Adding Speculation to Tomasulo's Algorithm

- Must separate execution from instruction completion or "commit"
- This additional step is called instruction commit
- When an instruction is no longer speculative, allow it to update the register file or memory
- Requires an additional set of buffers to hold results of instructions that have finished execution but have not committed
- This reorder buffer (ROB) is also used to pass results among instructions that may be speculated

#### Reorder Buffer operation

- Holds instructions in FIFO order, exactly as dispatched
- When instructions complete, results placed into ROB
  - Supplies operands to other instructions between execution complete & commit ⇒ more registers like RS
  - Tag results with ROB buffer number instead of reservation station
- Instructions commit ⇒ values at head of ROB placed in registers
- As a result, easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: Reorder Buffer, FP Regs, commit path, reservation stations, FP adders]

#### Speculation: Register Renaming vs. ROB

- Alternative to ROB is a larger physical set of registers combined with register renaming
  - Extended registers replace function of both ROB and reservation stations
- Instruction issue maps names of architectural registers to physical register numbers in extended register set
  - On issue, allocates a new unused register for the destination (which avoids WAW and WAR hazards)
  - Speculation recovery is easy because a physical register holding an instruction destination does not become the architectural register until the instruction commits
- Most out-of-order processors today use extended registers with renaming

# **Necessity of Instruction Dispatch**

#### A Dynamic Superscalar Processor

#### **Avoiding Memory Hazards**

- WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at the head of the ROB, and hence, no earlier loads or stores can still be pending
- RAW dependences through memory are preserved by two restrictions:
  1. not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and
  2. maintaining the program order for the computation of an effective address of a load with respect to all earlier stores.
- These restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data

#### **Memory Data Dependences**

- "Memory Aliasing" = Two memory references involving the same memory location (collision of two memory addresses).
- "Memory Disambiguation" = Determining whether two memory references will alias or not (whether there is a dependence or not).
- Memory Dependency Detection:
  - Must compute effective addresses of both memory references
  - Effective addresses can depend on run-time data and other instructions
  - Comparison of addresses requires much wider comparators

Example code:

1. `STORE V`
2. `ADD`
3. `LOAD W`
4. `LOAD X`
5. `LOAD V`
6. `ADD`
7. `STORE W`

#### Conservative Approach: Maintain Total Order of Loads and Stores

- Keep all loads and stores totally in order with respect to each other.
- However, loads and stores can execute out of order with respect to other types of instructions.
- Consequently, stores are held for all previous instructions, and loads are held for stores.
  - I.e. stores are performed at the commit point
  - Sufficient to prevent wrong-branch-path stores, since all prior branches are now resolved

#### Load Bypassing

- Loads can be allowed to bypass stores (if no aliasing).
- Two separate reservation stations and address generation units are employed for loads and stores.
- Store addresses still need to be computed before loads can be issued, to allow checking for load dependences. If a dependence cannot be checked, e.g. a store address cannot be determined, then all subsequent loads are held until the address is valid (conservative).
- Stores are kept in the ROB until all previous instructions complete, and kept in the store buffer until gaining access to a cache port.

# **Load Forwarding**

- If a subsequent load has a dependence on a store still in the store buffer, it need not wait till the store is issued to the data cache.
- The load can be directly satisfied from the store buffer if the address is valid and the data is available in the store buffer.
- This avoids the latency of accessing the data cache.
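The load bypassing and load forwarding rules above can be sketched as a single decision made for each load against the older stores in the store buffer. This is a minimal illustration, not a description of any particular processor: `StoreEntry`, `resolve_load`, and the dictionary standing in for the data cache are hypothetical names, and the policy is the conservative one from the slides (any older store with an unknown address holds the load).

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

class StoreEntry(NamedTuple):
    addr: Optional[int]  # None until the effective address has been computed
    data: Optional[int]  # None until the store data is available

def resolve_load(load_addr: int,
                 store_buffer: List[StoreEntry],
                 cache: Dict[int, int]) -> Tuple[str, Optional[int]]:
    """Decide how a load proceeds; store_buffer holds older stores, oldest first.

    Returns ("hold", None), ("forward", value) or ("bypass", value).
    """
    # Conservative rule: a dependence cannot be checked while any older
    # store address is unknown, so the load is held.
    if any(s.addr is None for s in store_buffer):
        return ("hold", None)
    # Search youngest-first so the most recent matching store wins.
    for s in reversed(store_buffer):
        if s.addr == load_addr:
            if s.data is None:
                return ("hold", None)    # aliasing store, data not ready yet
            return ("forward", s.data)   # load forwarding from the store buffer
    # No aliasing: the load bypasses the pending stores and reads the cache.
    return ("bypass", cache.get(load_addr, 0))

cache = {0x100: 7, 0x200: 9}
buf = [StoreEntry(0x100, 1), StoreEntry(0x300, 5), StoreEntry(0x100, 2)]

print(resolve_load(0x100, buf, cache))                           # ('forward', 2)
print(resolve_load(0x200, buf, cache))                           # ('bypass', 9)
print(resolve_load(0x200, buf + [StoreEntry(None, 4)], cache))   # ('hold', None)
```

Searching the buffer from the youngest store backwards is a design choice of this sketch: when two pending stores write the same address, the load must see the later one.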