Diversified Pipelines—The Path Toward Superscalar Processors

HPCA, Spring 2011

Limitations of Our Simple 5-stage Pipeline

• Assumes single cycle EX stage for all instructions
• This is not feasible for
  – Complex integer operations
    • Multiply
    • Divide
    • Shift (possibly)
  – Floating Point Operations

A Naïve Extension of the 5 Stage Pipeline

Multicycle ALU Operations

• Latency: # of intervening cycles between the instruction that produced a result and a subsequent instruction that uses it.
• Initiation Interval: # of cycles between two instructions that utilize the same functional unit.

<table>
<thead>
<tr>
<th>Functional Unit</th>
<th>Latency</th>
<th>Initiation Interval</th>
</tr>
</thead>
<tbody>
<tr>
<td>Integer ALU</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>FP Add</td>
<td>3</td>
<td>1</td>
</tr>
<tr>
<td>Int Multiply</td>
<td>6</td>
<td>1</td>
</tr>
<tr>
<td>FP Multiply</td>
<td>6</td>
<td>1</td>
</tr>
<tr>
<td>FP Divide</td>
<td>24</td>
<td>25</td>
</tr>
</tbody>
</table>
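
The two definitions above can be turned into a small stall calculator. The following is a minimal sketch, assuming in-order issue, full forwarding, and no other stall sources; the function names `raw_stalls` and `structural_stalls` are hypothetical, and `distance` counts how many instruction slots separate the two instructions (1 means back to back).

```python
# (latency, initiation interval) per functional unit, copied from the table above.
UNITS = {
    "Integer ALU": (0, 1),
    "FP Add": (3, 1),
    "Int Multiply": (6, 1),
    "FP Multiply": (6, 1),
    "FP Divide": (24, 25),
}

def raw_stalls(unit, distance):
    """Stall cycles for a consumer issued `distance` slots after a producer on `unit`."""
    latency, _ = UNITS[unit]
    # Each independent instruction placed in between hides one cycle of latency.
    return max(0, latency - (distance - 1))

def structural_stalls(unit, distance):
    """Stall cycles for a second instruction that needs the same `unit`."""
    _, interval = UNITS[unit]
    return max(0, interval - distance)

print(raw_stalls("FP Add", 1))               # dependent instruction right after an FP add
print(structural_stalls("FP Divide", 10))    # second divide 10 slots after the first
```

Under these assumptions, only the non-pipelined divider (initiation interval 25) ever causes a structural stall between two instructions using the same unit.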

Diversified Pipeline

Problems with Diversified Pipeline

- Many more RAW hazard opportunities due to longer FP instruction execution times
- New Structural Hazards:
  - Divide instructions at distance < 25 (Due to non-pipelined Divide Unit).
  - Multiple Register Writes/Cycle due to variable instruction execution times
- Out-of-order instruction completion—Why is this a problem?
- WAW Hazards are possible (WAR not possible. Why?)

### Structural Hazard--FP Register Write Port

<table>
<thead>
<tr>
<th></th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>11</th>
</tr>
</thead>
<tbody>
<tr><td>MUL.D</td><td>IF</td><td>ID</td><td>M1</td><td>M2</td><td>M3</td><td>M4</td><td>M5</td><td>M6</td><td>M7</td><td>MEM</td><td>WB</td></tr>
<tr><td>i+1</td><td></td><td>IF</td><td>ID</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td></tr>
<tr><td>i+2</td><td></td><td></td><td>IF</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td></tr>
<tr><td>ADD.D</td><td></td><td></td><td></td><td>IF</td><td>ID</td><td>A1</td><td>A2</td><td>A3</td><td>A4</td><td>MEM</td><td>WB</td></tr>
<tr><td>i+4</td><td></td><td></td><td></td><td></td><td>IF</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td></tr>
<tr><td>i+5</td><td></td><td></td><td></td><td></td><td></td><td>IF</td><td>--</td><td>--</td><td>--</td><td>--</td><td>--</td></tr>
<tr><td>LOAD.D</td><td></td><td></td><td></td><td></td><td></td><td></td><td>IF</td><td>ID</td><td>EX</td><td>MEM</td><td>WB</td></tr>
</tbody>
</table>

**Note:** Three FP Register Writes in Same Cycle
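
The write-port conflict can be checked mechanically. The following is a minimal sketch, assuming each instruction passes through IF, ID, its EX stages, MEM, and WB without stalling, and that the MUL.D, ADD.D, and LOAD.D are fetched in cycles 1, 4, and 7; the helper names are hypothetical.

```python
from collections import defaultdict

# EX-stage depth per instruction, read off the stage names in the diagram above.
EX_CYCLES = {"MUL.D": 7, "ADD.D": 4, "LOAD.D": 1}

def wb_cycle(op, fetch_cycle):
    """Cycle in which `op` reaches WB: IF, ID, EX stages, MEM, then WB."""
    return fetch_cycle + 1 + EX_CYCLES[op] + 1 + 1

def write_port_conflicts(program):
    """Map each WB cycle to the instructions writing in it, keeping only conflicts."""
    writers = defaultdict(list)
    for op, fetch_cycle in program:
        writers[wb_cycle(op, fetch_cycle)].append(op)
    return {cycle: ops for cycle, ops in writers.items() if len(ops) > 1}

program = [("MUL.D", 1), ("ADD.D", 4), ("LOAD.D", 7)]
print(write_port_conflicts(program))
```

With a single FP register write port, two of the three instructions would have to be stalled, for example by checking for the conflict in ID before an instruction is allowed to issue.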

### Diversified Pipeline--WAW Hazard

<table>
<thead>
<tr>
<th></th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>11</th>
</tr>
</thead>
<tbody>
<tr>
<td>MUL.D</td>
<td>IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1+1</td>
<td>IF ID -- -- -- -- -- -- -- --</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1+2</td>
<td>IF -- -- -- -- -- -- -- --</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADD.D</td>
<td>IF ID A1 A2 A3 A4 MEM WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1+4</td>
<td>IF -- -- -- -- -- -- -- --</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1+5</td>
<td>IF -- -- -- -- -- -- -- --</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>LOAD.D</td>
<td>IF ID EX MEM WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Diversified Pipeline—Out of Order Completion

<table>
<thead>
<tr>
<th></th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>11</th>
</tr>
</thead>
<tbody>
<tr>
<td>MUL.D</td>
<td>IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1+1</td>
<td>IF ID -- -- -- -- -- -- -- --</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1+2</td>
<td>IF -- -- -- -- -- -- -- --</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADD.D</td>
<td>IF ID A1 A2 A3 A4 MEM WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1+4</td>
<td>IF -- -- -- -- -- -- -- --</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>1+5</td>
<td>IF -- -- -- -- -- -- -- --</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>LOAD.D</td>
<td>IF ID EX MEM WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Note

Both the ADD and LOAD complete before the DIV.

Suppose a hardware exception occurs during the DIV, after stage 8. What is the PC address of the exception?

Also note that the ADD and LOAD have overwritten the source operands for the DIV, so there is no way to restore the state before the DIV.

Diversified Pipeline—Can the Compiler Help?

- Consider this code to add a scalar to a vector:

  for (i=1000; i>0; i=i-1)
      x[i] = x[i] + s;

- First translate into MIPS code:

  Loop: L.D    F0,0(R1)   ;F0=vector element
        ADD.D  F4,F0,F2   ;add scalar from F2
        S.D    0(R1),F4   ;store result
        DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
        BNEZ   R1,Loop    ;branch R1!=zero

Assume the following pipeline latencies:

- Ignore delayed branch in these examples

<table>
<thead>
<tr>
<th>Instruction Producing Result</th>
<th>Instruction Using Result</th>
<th>Stalls Between in Cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>FP ADD</td>
<td>Another FP ALU op</td>
<td>3</td>
</tr>
<tr>
<td>FP ADD</td>
<td>Store double</td>
<td>2</td>
</tr>
<tr>
<td>Load double</td>
<td>FP ALU op</td>
<td>1</td>
</tr>
<tr>
<td>Load double</td>
<td>Store double</td>
<td>0</td>
</tr>
<tr>
<td>Integer op</td>
<td>Integer op</td>
<td>0</td>
</tr>
</tbody>
</table>

Stalls (NOPs) Needed to Account for Pipeline Latencies

1 Loop: L.D F0,0(R1) ;F0=vector element
2 stall
3 ADD.D F4,F0,F2 ;add scalar in F2
4 stall
5 stall
6 S.D 0(R1),F4 ;store result
7 DADDUI R1,R1,-8 ;decrement pointer 8B (DW)
8 stall ;assumes can't forward to branch
9 BNEZ R1,Loop ;branch R1!=zero

<table>
<thead>
<tr>
<th>Instruction producing result</th>
<th>Instruction using result</th>
<th>Latency in clock cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>FP ALU op</td>
<td>Another FP ALU op</td>
<td>3</td>
</tr>
<tr>
<td>FP ALU op</td>
<td>Store double</td>
<td>2</td>
</tr>
<tr>
<td>Load double</td>
<td>FP ALU op</td>
<td>1</td>
</tr>
</tbody>
</table>

- 9 clock cycles per loop iteration
- Can the compiler reorganize the code to minimize stalls?

Reorganized Code to Reduce Stalls

Swap DADDUI and S.D by changing address of S.D:

1 Loop: L.D F0,0(R1)
2 DADDUI R1,R1,-8
3 ADD.D F4,F0,F2
4 stall
5 stall
6 S.D 8(R1),F4 ;altered offset when move DADDUI
7 BNEZ R1,Loop

7 clock cycles per iteration, but only 3 of them do useful work (L.D, ADD.D, S.D); the other 4 are loop overhead and stalls. Can we (the compiler) do better?
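
The cycle counts for the original and the reorganized loop can be reproduced with a tiny in-order issue model. This is a minimal sketch, assuming single issue, one loop iteration in isolation, and the latencies given earlier; the `int -> branch` stall encodes the "can't forward to branch" assumption, any producer/consumer pair not listed is assumed to need 0 stalls, and all names are hypothetical.

```python
# Stall cycles required between a producer and a dependent consumer.
STALLS = {
    ("fp", "fp"): 3, ("fp", "store"): 2,
    ("load", "fp"): 1, ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,   # assumption from the listing: no forwarding to the branch
}

def issue_cycles(program):
    """In-order, single-issue schedule. Each instruction is (kind, dest, sources)."""
    producer = {}   # register -> (kind, issue cycle) of its latest writer
    cycle = 0
    cycles = []
    for kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in producer:
                p_kind, p_cycle = producer[reg]
                # Unlisted pairs default to 0 stalls (an assumption of this sketch).
                earliest = max(earliest, p_cycle + 1 + STALLS.get((p_kind, kind), 0))
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            producer[dest] = (kind, cycle)
    return cycles

original = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("fp", "F4", ["F0", "F2"]),      # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),   # S.D    0(R1),F4
    ("int", "R1", ["R1"]),           # DADDUI R1,R1,-8
    ("branch", None, ["R1"]),        # BNEZ   R1,Loop
]
reordered = [original[0], original[3], original[1], original[2], original[4]]

print(issue_cycles(original))
print(issue_cycles(reordered))
```

The last entry of each list is the cycle in which BNEZ issues, i.e. the length of one iteration under this model.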

Loop Unrolling to Improve Performance

1 Loop: L.D F0,0(R1)
3 ADD.D F4,F0,F2
6 S.D 0(R1),F4 ;drop DADDUI & BNEZ
7 L.D F6,-8(R1)
9 ADD.D F8,F6,F2
12 S.D -8(R1),F8 ;drop DADDUI & BNEZ
13 L.D F10,-16(R1)
15 ADD.D F12,F10,F2
18 S.D -16(R1),F12 ;drop DADDUI & BNEZ
19 L.D F14,-24(R1)
21 ADD.D F16,F14,F2
24 S.D -24(R1),F16
25 DADDUI R1,R1,-32 ;alter to 4*8
27 BNEZ R1,Loop

Each L.D is followed by a 1-cycle stall, each ADD.D by a 2-cycle stall, and the DADDUI by a 1-cycle stall.

27 clock cycles, or 6.75 per iteration
(Assumes the number of loop iterations is a multiple of 4)

Loop Unrolling with Code Rearrangement

1 Loop: L.D F0,0(R1)
2 L.D F6,-8(R1)
3 L.D F10,-16(R1)
4 L.D F14,-24(R1)
5 ADD.D F4,F0,F2
6 ADD.D F8,F6,F2
7 ADD.D F12,F10,F2
8 ADD.D F16,F14,F2
9 S.D 0(R1),F4
10 S.D -8(R1),F8
11 S.D -16(R1),F12
12 DADDUI R1,R1,-32
13 S.D 8(R1),F16 ; 8-32 = -24
14 BNEZ R1,Loop

14 clock cycles, or 3.5 per iteration

Limits to Loop Unrolling

1. Decrease in amount of overhead amortized with each extra unrolling
   - Amdahl's Law
2. Growth in code size
   - For larger loops, the concern is that unrolling increases the instruction cache miss rate
3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
   - If it is not possible to allocate all live values to registers, unrolling may lose some or all of its advantage

Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.
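
The first limit (diminishing returns) can be illustrated with a little arithmetic. This is a minimal sketch, assuming an idealized stall-free schedule in which each element costs 3 instructions (L.D, ADD.D, S.D) and each pass through the unrolled loop pays 2 overhead instructions (DADDUI, BNEZ); the function name is hypothetical. For an unroll factor of 4 it matches the 14-cycle scheduled loop shown earlier.

```python
def cycles_per_element(unroll, work_per_element=3, overhead=2):
    """Idealized cycles per element: useful instructions plus amortized loop overhead."""
    return (unroll * work_per_element + overhead) / unroll

for k in (4, 8, 16, 32):
    print(k, cycles_per_element(k))
```

Each doubling of the unroll factor halves the remaining overhead per element, so the gain shrinks toward the 3-cycle floor while code size and register demand keep growing.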

Hardware-based Performance Optimization--Dynamic Scheduling

- **Dynamic scheduling** - hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
  - Handles cases when dependences unknown at compile time
  - Allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve
  - Allows code to be compiled independently of details of a particular pipeline
  - Simplifies the compiler
- **Hardware speculation**, a technique with significant performance advantages, builds on dynamic scheduling (more about this later)

Dynamic Scheduling Example

Consider:

i: R4 ← R0 + R8
j: R2 ← R0 * R4
k: R4 ← R4 + R8
l: R8 ← R4 * R2

RAW Hazards  WAW Hazards
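
The dependences among instructions i, j, k, and l can be enumerated mechanically. This is a minimal sketch, assuming each instruction has one destination and that RAW and WAW are reported against the most recent writer of a register; `find_dependences` and the tuple layout are hypothetical.

```python
def find_dependences(program):
    """Return (raw, war, waw) lists of (earlier, later, register) triples."""
    last_writer = {}   # register -> label of its most recent writer
    readers = {}       # register -> labels that read it since that write
    raw, war, waw = [], [], []
    for label, dest, sources in program:
        for reg in sources:
            if reg in last_writer:
                raw.append((last_writer[reg], label, reg))
        for reader in readers.get(dest, []):
            war.append((reader, label, dest))
        if dest in last_writer:
            waw.append((last_writer[dest], label, dest))
        # Record the write first, then this instruction's own reads.
        last_writer[dest] = label
        readers[dest] = []
        for reg in sources:
            readers.setdefault(reg, []).append(label)
    return raw, war, waw

program = [
    ("i", "R4", ["R0", "R8"]),   # i: R4 <- R0 + R8
    ("j", "R2", ["R0", "R4"]),   # j: R2 <- R0 * R4
    ("k", "R4", ["R4", "R8"]),   # k: R4 <- R4 + R8
    ("l", "R8", ["R4", "R2"]),   # l: R8 <- R4 * R2
]
raw, war, waw = find_dependences(program)
print("RAW:", raw)
print("WAR:", war)
print("WAW:", waw)
```

Only the RAW dependences constrain the dataflow; the WAR and WAW entries are name dependences that arise from reusing R4 and R8.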

Dynamic Scheduling—The dataflow limit

The objective of dynamic scheduling is to come as close as possible to the dataflow limit.

Dynamic Scheduling Example

[Figure: the example annotated with its RAW and WAW hazards, a reuse cycle for R4, and another reuse cycle for Rx]

Dynamic Scheduling

- Key idea: Allow instruction(s) following a stall to proceed
  - DIVD F0,F2,F4
  - ADDD F10, F0,F8
  - SUBD F12,F8,F14
- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
- Will distinguish when an instruction begins execution and when it completes execution; between these times, the instruction is in execution
- Note: Dynamic execution creates WAR and WAW hazards and makes exceptions harder

Dynamic Scheduling—Starting Point

- Split the ID pipe stage of simple 5-stage pipeline into 2 stages:
  - Issue—Decode instructions, check for structural hazards
  - Read operands—Wait until no data hazards, then read operands

Dynamic Scheduling: Tomasulo’s Algorithm

- For IBM 360/91 (late 1960s, before caches!)
  - => Long memory latency
- Goal: High Performance without special compilers
- Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to try to figure out how to get more effective registers — renaming in hardware!
- Why study a 1966 computer?
  - Tomasulo’s algorithm is the basis for the dynamic scheduling approach used in most modern processors

Tomasulo’s Algorithm

- Control & buffers distributed with Functional Units (FU)
  - FU buffers called “reservation stations”; have pending operands
- Registers in instructions replaced by values or pointers to reservation stations(RS); called register renaming;
  - Renaming avoids WAR, WAW hazards
  - More reservation stations than registers, so can do optimizations compilers can’t
- Result forwarding via a Common Data Bus that broadcasts results to all FUs
  - Avoids RAW hazards by executing an instruction only when its operands are available
- Load and Stores treated as FUs with RSs as well
- Integer instructions can go past branches (predict taken), allowing FP ops beyond basic block in FP queue

**Tomasulo’s Algorithm [Tomasulo, 1967]**

**IBM 360/91 FPU**

- Multiple functional units (FU's)
  - Floating point add
  - Floating point multiply/divide
- Three register files (pseudo reg-reg machine in floating-point unit)
  - (4) floating-point registers (FLR)
  - (6) floating-point buffers (FLB)
  - (3) store data buffers (SDB)
- Out of order instruction execution:
  - After decode the instruction unit passes all floating point instructions (in order) to the floating point operation stack (FLOS).
  - In the floating point unit, instructions are then further decoded and issued from the FLOS to the two FU's
- Variable operation latencies:
  - Floating point add: 2 cycles
  - Floating point multiply: 3 cycles
  - Floating point divide: 12 cycles
- Goal: achieve concurrent execution of multiple floating-point instructions, in addition to achieving one instruction per cycle in instruction pipeline

---

**Dependence Mechanisms**

Two Address IBM 360 Instruction Format:

\[ R_1 \leftarrow R_1 \text{ op } R_2 \]

Major dependence mechanisms:

- Structural (FU) dependence -> virtual FU's
  - Reservation stations
- True dependence -> pseudo operands + result forwarding
  - Register tags
  - Reservation stations
  - Common data bus (CDB)
- Anti-dependence -> operand copying
  - Reservation stations
- Output dependence -> register renaming + result forwarding
  - Register tags
  - Reservation stations
  - Common data bus (CDB)

Reservation Stations

- Used to collect operands or pseudo operands (tags).
- Associate more than one set of buffering registers (control, source, sink) with each FU => virtual FU's.
- Add unit: three reservation stations
- Multiply/divide unit: two reservation stations

Common Data Bus (CDB)

- CDB is fed by all units that can alter a register (or supply register values) and it feeds all units which can have a register as an operand.
- Sources of CDB:
  - Floating point buffers (FLB)
  - Two FU's (add unit and the multiply/divide unit)
- Destinations of CDB:
  - Reservation stations
  - Floating-point registers (FLR)
  - Store data buffers (SDB)

Register Tags

- Every source of a register value must be uniquely identified by its own tag value.
  - (6) FLB's
  - (5) reservation stations (3 with add unit, 2 with multiply/divide unit)
  - => a 4-bit tag is needed to identify the 11 potential sources

- Every destination of a register value must carry a tag field.
  - (5) "sink" entries of the reservation stations
  - (5) "source" entries of the reservation stations
  - (4) FLR's
  - (3) SDB's
  - => a total of 17 tag fields are needed (i.e. 17 places that need tags)
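
The tag width follows directly from the number of distinct sources. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def tag_bits(num_sources):
    """Smallest tag width (in bits) that gives every source a distinct tag."""
    return max(1, (num_sources - 1).bit_length())

sources = 6 + 5   # 6 FLBs + 5 reservation stations
print(tag_bits(sources))
```

Three bits would distinguish only 8 sources, so 11 sources need 4 bits, which leaves a few tag values unused.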

Operation of Dependence Mechanisms

1. Structural (FU) dependence => virtual FU's
   - FLOS can hold and decode up to 8 instructions.
   - Instructions are dispatched to the 5 reservation stations (virtual FU's) even though there are only two physical FU's.
   - Hence, structural dependence does not stall dispatching.

2. True dependence => pseudo operands + result forwarding
   - If an operand is available in FLR, it is copied to a res. station entry.
   - If an operand is not available (i.e. there is pending write), then a tag is copied to the reservation station entry instead. This tag identifies the source of the pending write. This instruction then waits in its reservation station for the true dependence to be resolved.
   - When the operand is finally produced by the source (ID of source = tag value), this source unit asserts its ID, i.e. its tag value, on the CDB, followed by broadcasting of the operand on the CDB.
   - All the reservation station entries and the FLR entries and SDB entries carrying this tag value in their tag fields will detect a match of tag values and latch in the broadcast operand from the CDB.
   - Hence, true dependence does not block subsequent independent instructions and does not stall a physical FU. Forwarding also minimizes delay due to true dependence.

Operation of Dependence Mechanisms

3. **Anti-dependence** => operand copying
   - If an operand is available in FLR, it is copied to a reservation station entry.
   - By copying this operand to the reservation station, all anti-dependences due to future writes to this same register are resolved.
   - Hence, the reading of an operand is not delayed, possibly due to other dependences, and subsequent writes are also not delayed.

4. **Output dependence** => register renaming + result forwarding
   - If a register is waiting for a pending write, its tag field will contain the ID, or tag value, of the source for that pending write.
   - When that source eventually produces the result, that result will be written into the register via the CDB.
   - It is possible that prior to the completion of the pending write, another instruction can come along and also has that same register as its destination register.
   - If this occurs, the operands (or pseudo operands) needed by this instruction are still copied to an available reservation station. In addition, the tag field of the destination register of this instruction is updated with the ID of this new reservation station, i.e. the old tag value is overwritten. This will ensure that the said register will get the latest value, i.e. the late completing earlier write cannot overwrite a later write.
   - Hence, the output dependence is resolved without stalling a physical functional unit and without requiring additional buffers to ensure sequential write back to the register file.

Summary of Tomasulo’s Algorithm

- Supports out of order execution of instructions.
- Resolves dependences dynamically using hardware.
- Attempts to delay the resolution of dependencies as late as possible.
- Structural dependence does not stall issuing; virtual FU’s in the form of reservation stations are used.
- Output dependence does not stall issuing; copying of old tag to reservation station and updating of tag field of the register with pending write with the new tag.
- True dependence with a pending write operand does not stall the reading of operands; pseudo operand (tag) is copied to reservation station.
- Anti-dependence does not stall write back; earlier copying of operand awaiting read to the reservation station.
- Can support sequence of multiple output dependences.
- Forwarding from FU’s to reservation stations bypasses the register file.

Tomasulo Revisited—H&P notation

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or −)
Vj, Vk: Value of Source operands
   - Store buffers have V field, result to be stored
Qj, Qk: Reservation stations producing source registers (value to be written)
   - Note: Qj,Qk=0 => ready
   - Store buffers only have Qj for RS producing result
Busy: Indicates reservation station or FU is busy

Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

Three Stages of Tomasulo Algorithm

1. **Issue**—get instruction from FP Op Queue
   - If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. **Execute**—operate on operands (EX)
   - When both operands ready then execute; if not ready, watch Common Data Bus for result
3. **Write result**—finish execution (WB)
   - Write on Common Data Bus to all awaiting units; mark reservation station available

- **Normal data bus**: data + destination ("go to" bus)
- **Common data bus**: data + source ("come from" bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if matches expected Functional Unit (produces result)
  - Does the broadcast
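
The issue and write-result steps can be sketched as tag bookkeeping. This is a minimal sketch, assuming two-source register instructions and modelling only renaming and the CDB broadcast (no timing, no functional units, no loads or stores); the class `TomasuloSketch`, its method names, and the register values are hypothetical.

```python
class TomasuloSketch:
    """Tag-based renaming only: no timing and no functional-unit model."""

    def __init__(self, registers):
        self.value = dict(registers)                 # register -> last written value
        self.status = {r: None for r in registers}   # register -> tag of pending writer
        self.stations = {}                           # tag -> reservation station entry

    def issue(self, tag, op, dest, src_j, src_k):
        """Issue: copy each operand value if ready, otherwise the producer's tag."""
        entry = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
        for v, q, src in (("Vj", "Qj", src_j), ("Vk", "Qk", src_k)):
            if self.status[src] is None:
                entry[v] = self.value[src]
            else:
                entry[q] = self.status[src]
        self.stations[tag] = entry
        self.status[dest] = tag   # rename: dest now waits on this station

    def ready(self, tag):
        """Execute may start once neither operand is waiting on a tag."""
        entry = self.stations[tag]
        return entry["Qj"] is None and entry["Qk"] is None

    def write_result(self, tag, result):
        """Write result: broadcast (tag, result) on the CDB and free the station."""
        for entry in self.stations.values():
            for v, q in (("Vj", "Qj"), ("Vk", "Qk")):
                if entry[q] == tag:
                    entry[v], entry[q] = result, None
        for reg, pending in self.status.items():
            if pending == tag:
                self.value[reg], self.status[reg] = result, None
        del self.stations[tag]

m = TomasuloSketch({"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0, "F10": 0.0})
m.issue("Mult1", "MULTD", "F0", "F2", "F4")
m.issue("Mult2", "DIVD", "F10", "F0", "F6")   # F0 is pending, so Qj = "Mult1"
m.issue("Add1", "ADDD", "F6", "F8", "F2")     # WAR on F6: DIVD already copied 6.0
m.write_result("Add1", 10.0)                  # F6 updated; DIVD's copy is unaffected
m.write_result("Mult1", 8.0)                  # broadcast wakes up the DIVD
print(m.stations["Mult2"], m.ready("Mult2"))
```

Because a register is written only when its status tag matches the broadcast tag, a late result from an earlier instruction cannot overwrite a register that a later instruction has since claimed.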

**Example speed**:
- 3 clock cycles for floating-point +
- 10 cycles for *
- 40 cycles for /

### Tomasulo Example Cycle 1

[Tables: instruction status, reservation stations, and register result status at clock cycle 1]

### Tomasulo Example Cycle 2

[Tables: instruction status, reservation stations, and register result status at clock cycle 2]

Note: Can have multiple loads outstanding

Tomasulo Example Cycle 3

[Tables: instruction status, reservation stations, and register result status at clock cycle 3]

Note: register names are removed ("renamed") in Reservation Stations;
Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4

[Tables: instruction status, reservation stations, and register result status at clock cycle 4]

Load2 completing; what is waiting for Load2?

Tomasulo Example Cycle 5

[Tables: instruction status, reservation stations, and register result status at clock cycle 5]

Timer starts down for Add1, Mult1

Tomasulo Example Cycle 6

[Tables: instruction status, reservation stations, and register result status at clock cycle 6]

Issue ADDD here despite name dependency on F6?

Tomasulo Example Cycle 7

**Instruction status:**

<table>
<thead>
<tr>
<th>Instruction</th>
<th>j</th>
<th>k</th>
<th>Issue</th>
<th>Comp</th>
<th>Result</th>
<th>Busy Address</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD</td>
<td>F6</td>
<td>34+</td>
<td>R5</td>
<td>1</td>
<td>3</td>
<td>Load1</td>
</tr>
<tr>
<td>LD</td>
<td>F2</td>
<td>45+</td>
<td>R3</td>
<td>2</td>
<td>4</td>
<td>Load1</td>
</tr>
<tr>
<td>MULTD</td>
<td>F0</td>
<td>F2</td>
<td>F4</td>
<td>3</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>SUBD</td>
<td>F8</td>
<td>F6</td>
<td>F3</td>
<td>4</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>DIVD</td>
<td>F10</td>
<td>F0</td>
<td>F5</td>
<td>5</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>ADDD</td>
<td>F6</td>
<td>F8</td>
<td>F2</td>
<td>6</td>
<td></td>
<td>No</td>
</tr>
</tbody>
</table>

**Reservation Stations:**

- S1
- S2
- RS
- RS

**Register result status:**

- F0: Mult1, ADDD
- F2: ADDD
- F4: ADDD
- F6: Mult1
- F8: Mult1
- F10: ADDD

- Add1 (SUBD) completing; what is waiting for it?

---

Tomasulo Example Cycle 8

**Instruction status:**

<table>
<thead>
<tr>
<th>Instruction</th>
<th>j</th>
<th>k</th>
<th>Issue</th>
<th>Comp</th>
<th>Result</th>
<th>Busy Address</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD</td>
<td>F6</td>
<td>34+</td>
<td>R2</td>
<td>1</td>
<td>3</td>
<td>Load1</td>
</tr>
<tr>
<td>LD</td>
<td>F2</td>
<td>45+</td>
<td>R3</td>
<td>2</td>
<td>4</td>
<td>Load2</td>
</tr>
<tr>
<td>MULTD</td>
<td>F0</td>
<td>F2</td>
<td>F4</td>
<td>3</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>SUBD</td>
<td>F8</td>
<td>F6</td>
<td>F2</td>
<td>4</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>DIVD</td>
<td>F10</td>
<td>F0</td>
<td>F5</td>
<td>5</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>ADDD</td>
<td>F6</td>
<td>F8</td>
<td>F2</td>
<td>6</td>
<td></td>
<td>No</td>
</tr>
</tbody>
</table>

**Reservation Stations:**

- S1
- S2
- RS
- RS

**Register result status:**

- F0: Mult1, ADDD
- F2: ADDD
- F4: ADDD
- F6: Mult1
- F8: Mult1
- F10: ADDD

---

Tomasulo Example Cycle 9

**Instruction status:**

<table>
<thead>
<tr>
<th>Instruction</th>
<th>j</th>
<th>k</th>
<th>Issue</th>
<th>Comp</th>
<th>Result</th>
<th>Busy Address</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD</td>
<td>F6</td>
<td>34+</td>
<td>R2</td>
<td>1</td>
<td>3</td>
<td>Load1</td>
</tr>
<tr>
<td>LD</td>
<td>F2</td>
<td>45+</td>
<td>R3</td>
<td>2</td>
<td>4</td>
<td>Load2</td>
</tr>
<tr>
<td>MULTD</td>
<td>F0</td>
<td>F2</td>
<td>F4</td>
<td>3</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>SUBD</td>
<td>F8</td>
<td>F6</td>
<td>F2</td>
<td>4</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>DIVD</td>
<td>F10</td>
<td>F0</td>
<td>F5</td>
<td>5</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>ADDD</td>
<td>F6</td>
<td>F8</td>
<td>F2</td>
<td>6</td>
<td></td>
<td>No</td>
</tr>
</tbody>
</table>

**Reservation Stations:**

- S1
- S2
- RS
- RS

**Register result status:**

- F0: Mult1, ADDD
- F2: ADDD
- F4: ADDD
- F6: Mult1
- F8: Mult1
- F10: ADDD

---

Tomasulo Example Cycle 10

**Instruction status:**

<table>
<thead>
<tr>
<th>Instruction</th>
<th>j</th>
<th>k</th>
<th>Issue</th>
<th>Comp</th>
<th>Result</th>
<th>Busy Address</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD</td>
<td>F6</td>
<td>34+</td>
<td>R2</td>
<td>1</td>
<td>3</td>
<td>Load1</td>
</tr>
<tr>
<td>LD</td>
<td>F2</td>
<td>45+</td>
<td>R3</td>
<td>2</td>
<td>4</td>
<td>Load2</td>
</tr>
<tr>
<td>MULTD</td>
<td>F0</td>
<td>F2</td>
<td>F4</td>
<td>3</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>SUBD</td>
<td>F8</td>
<td>F6</td>
<td>F2</td>
<td>4</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>DIVD</td>
<td>F10</td>
<td>F0</td>
<td>F5</td>
<td>5</td>
<td></td>
<td>No</td>
</tr>
<tr>
<td>ADDD</td>
<td>F6</td>
<td>F8</td>
<td>F2</td>
<td>6</td>
<td></td>
<td>No</td>
</tr>
</tbody>
</table>

**Reservation Stations:**

- S1
- S2
- RS
- RS

**Register result status:**

- F0: Mult1, ADDD
- F2: ADDD
- F4: ADDD
- F6: Mult1
- F8: Mult1
- F10: ADDD

- Add2 (ADDD) completing; what is waiting for it?

Tomasulo Example Cycle 11

Instruction status: Exec Write

<table>
<thead>
<tr>
<th>Instruction</th>
<th>j</th>
<th>k</th>
<th>Issue</th>
<th>Comp</th>
<th>Result</th>
<th>Busy Address</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD F6 34+ R5</td>
<td>1</td>
<td>3</td>
<td>4</td>
<td>Load1</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>LD F2 45+ R3</td>
<td>2</td>
<td>4</td>
<td>5</td>
<td>Load1</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>MULTD F0 F2 F4</td>
<td>3</td>
<td></td>
<td></td>
<td>Load1</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>SUBD F0 F6 F2</td>
<td>4</td>
<td>7</td>
<td>8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DIVD F10 F0 F6</td>
<td>5</td>
<td>10</td>
<td>11</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADDD F6 F8 F2</td>
<td>6</td>
<td>10</td>
<td>11</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Fk Qi Qk

Add1 No
Add2 No
Add3 No
4 Mult3 Yes MULTD M(A2) R(F4)
Mult3 Yes DIVD MA(A3) Mult3

Register result status:

Clock Name FU

Write result of ADDD here?
All quick instructions complete in this cycle!

Tomasulo Example Cycle 12

Instruction status: Exec Write

<table>
<thead>
<tr>
<th>Instruction</th>
<th>j</th>
<th>k</th>
<th>Issue</th>
<th>Comp</th>
<th>Result</th>
<th>Busy Address</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD F6 34+ R2</td>
<td>1</td>
<td>3</td>
<td>4</td>
<td>Load1</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>LD F2 45+ R3</td>
<td>2</td>
<td>4</td>
<td>5</td>
<td>Load1</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>MULTD F0 F2 F4</td>
<td>3</td>
<td></td>
<td></td>
<td>Load1</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>SUBD F0 F6 F2</td>
<td>4</td>
<td>7</td>
<td>8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DIVD F10 F0 F6</td>
<td>5</td>
<td>10</td>
<td>11</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADDD F6 F8 F2</td>
<td>6</td>
<td>10</td>
<td>11</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Fk Qi Qk

Add1 No
Add2 No
Add3 No
3 Mult3 Yes MULTD M(A2) R(F4)
Mult3 Yes DIVD MA(A3) Mult3

Register result status:

Clock Name FU

Tomasulo Example Cycle 13

Instruction status: Exec Write

<table>
<thead>
<tr>
<th>Instruction</th>
<th>j</th>
<th>k</th>
<th>Issue</th>
<th>Comp</th>
<th>Result</th>
<th>Busy Address</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD F6 34+ R2</td>
<td>1</td>
<td>3</td>
<td>4</td>
<td>Load1</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>LD F2 45+ R3</td>
<td>2</td>
<td>4</td>
<td>5</td>
<td>Load2</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>MULTD F0 F2 F4</td>
<td>3</td>
<td></td>
<td></td>
<td>Load3</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>SUBD F8 F6 F2</td>
<td>4</td>
<td>7</td>
<td>8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DIVD F10 F0 F6</td>
<td>5</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADDD F6 F8 F2</td>
<td>6</td>
<td>10</td>
<td>11</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Vk Qj Qk

Add1 No
Add2 No
Add3 No
2 Mult1 Yes MULTD Vj=M(A2) Vk=R(F4)
Mult2 Yes DIVD Vk=M(A1) Qj=Mult1

Register result status:

Clock Name FU

Tomasulo Example Cycle 14

Instruction status:

<table>
<thead>
<tr>
<th>Instruction (j, k)</th>
<th>Issue</th>
<th>Exec Comp</th>
<th>Write Result</th>
<th>Load Buffer</th>
<th>Busy</th>
<th>Address</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD F6 34+ R2</td>
<td>1</td>
<td>3</td>
<td>4</td>
<td>Load1</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>LD F2 45+ R3</td>
<td>2</td>
<td>4</td>
<td>5</td>
<td>Load2</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>MULTD F0 F2 F4</td>
<td>3</td>
<td></td>
<td></td>
<td>Load3</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>SUBD F8 F6 F2</td>
<td>4</td>
<td>7</td>
<td>8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DIVD F10 F0 F6</td>
<td>5</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADDD F6 F8 F2</td>
<td>6</td>
<td>10</td>
<td>11</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Vk Qj Qk

Add1 No
Add2 No
Add3 No
1 Mult1 Yes MULTD Vj=M(A2) Vk=R(F4)
Mult2 Yes DIVD Vk=M(A1) Qj=Mult1

Register result status:

Clock Name FU
Tomasulo Example

Cycle 15

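Instruction status:

<table>
<thead>
<tr>
<th>Instruction (j, k)</th>
<th>Issue</th>
<th>Exec Comp</th>
<th>Write Result</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD F6 34+ R2</td>
<td>1</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>LD F2 45+ R3</td>
<td>2</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>MULTD F0 F2 F4</td>
<td>3</td>
<td>15</td>
<td></td>
</tr>
<tr>
<td>SUBD F8 F6 F2</td>
<td>4</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<td>DIVD F10 F0 F6</td>
<td>5</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADDD F6 F8 F2</td>
<td>6</td>
<td>10</td>
<td>11</td>
</tr>
</tbody>
</table>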

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Vk Qj Qk

Add1 No
Add2 No
Add3 No
0 Mult1 Yes MULTD Vj=M(A2) Vk=R(F4)
Mult2 Yes DIVD Vk=M(A1) Qj=Mult1

Register result status:

Clock

F0 F2 F4 F6 F8 F10 F12 F30

15 FU

Mult1 (MULTD) completing; what is waiting for it?

Tomasulo Example—Skip ahead to Cycle 55

Cycle 56

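Instruction status:

<table>
<thead>
<tr>
<th>Instruction (j, k)</th>
<th>Issue</th>
<th>Exec Comp</th>
<th>Write Result</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD F6 34+ R2</td>
<td>1</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>LD F2 45+ R3</td>
<td>2</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>MULTD F0 F2 F4</td>
<td>3</td>
<td>15</td>
<td></td>
</tr>
<tr>
<td>SUBD F8 F6 F2</td>
<td>4</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<td>DIVD F10 F0 F6</td>
<td>5</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADDD F6 F8 F2</td>
<td>6</td>
<td>10</td>
<td>11</td>
</tr>
</tbody>
</table>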

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Vk Qj Qk
3 Add2 No
1 Add1 No
0 Mult3
0 Mult2 Yes MULT M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1

Register result status:

Clock

F0 F2 F4 F6 F8 F10 F12 F30

56 FU

Just waiting for Mult2 (DIVD) to complete

Tomasulo Example Cycle 16

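Instruction status:

<table>
<thead>
<tr>
<th>Instruction (j, k)</th>
<th>Issue</th>
<th>Exec Comp</th>
<th>Write Result</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD F6 34+ R2</td>
<td>1</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>LD F2 45+ R3</td>
<td>2</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>MULTD F0 F2 F4</td>
<td>3</td>
<td>15</td>
<td></td>
</tr>
<tr>
<td>SUBD F8 F6 F2</td>
<td>4</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<td>DIVD F10 F0 F6</td>
<td>5</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADDD F6 F8 F2</td>
<td>6</td>
<td>10</td>
<td>11</td>
</tr>
</tbody>
</table>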

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Vk Qj Qk
3 Add2 No
1 Add1 No
0 Mult3
0 Mult2 Yes DIVD M(A1) Mult1

Register result status:

Clock

F0 F2 F4 F6 F8 F10 F12 F30

16 FU

Just waiting for Mult2 (DIVD) to complete
Why can Tomasulo overlap iterations of loops?

- Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
- Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers, which completely avoids WAR stalls
- Another perspective: Tomasulo's algorithm builds the data flow dependency graph on the fly
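The renaming idea above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `rename`, the `RSn` tags, and the two-iteration loop body are hypothetical, and no result is assumed to be written back during the trace, so every source with an earlier writer points at that writer's tag.

```python
def rename(instructions):
    """Issue instructions in program order and rename their source operands.

    Each instruction is (op, dest, sources). A source is replaced by the tag of
    the most recent earlier instruction that writes it; otherwise the
    architectural register name is kept (value read from the register file).
    """
    latest_writer = {}  # architectural register -> tag of its newest pending writer
    renamed = []
    for number, (op, dest, sources) in enumerate(instructions, start=1):
        tag = f"RS{number}"  # hypothetical reservation-station tag
        renamed.append((tag, op, tuple(latest_writer.get(s, s) for s in sources)))
        if dest is not None:
            latest_writer[dest] = tag  # later readers of dest now wait on this tag
    return renamed


# Two iterations of a hypothetical loop body that reuses F0 and F4 each time.
loop_body = [("LD", "F0", ("R1",)), ("MULTD", "F4", ("F0", "F2"))]
trace = rename(loop_body * 2)
for entry in trace:
    print(entry)
```

Both iterations write F0 and F4, yet the second MULTD waits on `RS3` rather than `RS1`, so the two iterations do not interfere through the register names.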

Tomasulo’s scheme offers two major advantages

1. Distribution of the hazard detection logic
   - Achieved through distributed reservation stations and the Common Data Bus (CDB)
   - If multiple instructions are waiting on a single result, and each already has its other operand, they can all be released simultaneously by one broadcast on the CDB
   - If a centralized register file were used instead, the units would have to read their results from the registers when register buses are available
2. Elimination of stalls for WAW and WAR hazards
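The simultaneous release by a CDB broadcast can be illustrated with a small sketch. The `Station` class, the `broadcast` function, and the operand values are assumptions made for this example; only the Vj/Vk/Qj/Qk field roles come from the reservation-station tables earlier in the example.

```python
class Station:
    """One reservation station: operand values (vj, vk) or producer tags (qj, qk)."""

    def __init__(self, name, op, vj=None, vk=None, qj=None, qk=None):
        self.name, self.op = name, op
        self.vj, self.vk = vj, vk
        self.qj, self.qk = qj, qk

    def ready(self):
        # Ready to execute once no operand is still waiting on a producer tag.
        return self.qj is None and self.qk is None


def broadcast(stations, tag, value):
    """Put (tag, value) on the CDB; return names of stations that become ready."""
    released = []
    for s in stations:
        was_ready = s.ready()
        if s.qj == tag:
            s.vj, s.qj = value, None
        if s.qk == tag:
            s.vk, s.qk = value, None
        if not was_ready and s.ready():
            released.append(s.name)
    return released


stations = [
    Station("Mult2", "DIVD", vk=2.0, qj="Mult1"),     # has its other operand
    Station("Add1", "ADDD", vj=1.0, qk="Mult1"),      # has its other operand
    Station("Add2", "SUBD", qj="Mult1", qk="Load2"),  # still needs Load2
]
released = broadcast(stations, "Mult1", 6.0)
print(released)
```

In this sketch one broadcast from `Mult1` releases both `Mult2` and `Add1` in the same step, while `Add2` captures the value but keeps waiting on `Load2`.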

Tomasulo Drawbacks

- Complexity
- Many associative stores (on the CDB) that must operate at high speed
- Performance limited by Common Data Bus
  - Each CDB must go to multiple functional units ⇒ high capacitance, high wiring density
  - The number of functional units that can complete per cycle is limited to one!
  - Multiple CDBs ⇒ more FU logic for parallel associative stores
- Non-precise interrupts!
  - We will address this later
Dynamic Scheduling—Conclusions

- Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
- Loop unrolling by compiler to increase ILP
- Branch prediction to increase ILP
- Dynamic HW exploiting ILP
  - Works when dependences can’t be known at compile time
  - Can hide L1 cache misses
  - Code for one machine runs well on another

Performance Enhancement—Better Branch Prediction

- Accurate Branch Prediction becomes more important with dynamic scheduling
  - Dynamic scheduling may stall if it can’t look past branch points
  - Cost of misprediction may be high

Static Branch Prediction

- Earlier, we discussed scheduling code around a delayed branch
- To reorder code around branches, the compiler needs to predict each branch statically at compile time
- Simplest scheme is to predict a branch as taken
  - Average misprediction rate = untaken branch frequency = 34% for SPEC

- A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run

Static Branch Prediction Rates

- Integer branches: average 22% misprediction, reduced to 15%
- Floating Point branches: average 25% misprediction, reduced to 10%

- Overall, misprediction rates reduced significantly.
**Dynamic (Run-time) Branch Prediction**

- **Why does prediction work?**
  - Underlying algorithm has regularities
  - Data that is being operated on has regularities
  - Instruction sequence has redundancies that are artifacts of the way that humans/compilers think about problems

- **Is dynamic branch prediction better than static branch prediction?**
  - Seems to be (most modern processors use it)
  - There are a small number of important branches in programs which have dynamic behavior

---

**Multi-bit Branch History**

In general, there is little performance improvement beyond n = 2

---

**Dynamic Branch Prediction**

- Simplest Dynamic Predictor:
  - Branch History Table (BHT): the lower bits of the PC address index a table of 1-bit values
  - Each entry records whether or not the branch was taken last time
  - No address check

- Problem: in a loop, a 1-bit BHT causes two mispredictions per execution of the loop (the average loop has only 9 iterations before exit):
  - At the end of the loop, when it exits instead of looping as before
  - On the first iteration the next time the loop is entered, when it predicts exit instead of looping
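The two mispredictions per loop execution can be checked with a short simulation. This is a minimal sketch of a single 1-bit entry; the function name, the starting state (assumed taken), and the assumption that the loop branch is taken 9 times and then falls through once are choices made for this illustration.

```python
def one_bit_mispredictions(outcomes, last_taken=True):
    """Count mispredictions of one 1-bit BHT entry that predicts the previous outcome."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken  # the 1-bit entry simply remembers the latest outcome
    return misses


# Assumed loop behaviour: the loop branch is taken 9 times, then falls through once.
one_execution = [True] * 9 + [False]
print(one_bit_mispredictions(one_execution))      # first execution of the loop
print(one_bit_mispredictions(one_execution * 3))  # three executions back to back
```

Because the entry is assumed to start out as taken, the first execution misses only at the exit; every later execution misses twice, once on re-entry and once at the exit.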

---

**A two-bit branch predictor**

- Change the prediction only after mispredicting twice

- Adds *hysteresis* to the decision-making process

- Many other two-bit prediction schemes are possible
Another two-bit branch predictor

- Two-bit saturating counter (Smith Predictor)

![Diagram of Two-bit Branch Predictor](image)
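A two-bit saturating counter can be sketched as follows. The encoding (0 to 3, predict taken when the counter is 2 or 3), the starting value, and the loop pattern of 9 taken outcomes followed by one not-taken outcome are assumptions for this illustration.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """Count mispredictions of one 2-bit saturating counter (0..3, taken if >= 2)."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move toward 3 on taken and toward 0 on not taken, saturating at the ends.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Assumed loop behaviour: the loop branch is taken 9 times, then falls through once.
one_execution = [True] * 9 + [False]
print(two_bit_mispredictions(one_execution * 3))
```

With this pattern the counter only drops from 3 to 2 at each loop exit, so the prediction on re-entry is still taken and each execution of the loop costs a single misprediction.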

2-bit BHT Table Predictor Accuracy

- Causes of Misprediction:
  - Wrong guess for that branch
  - Got the branch history of the wrong branch when indexing the table (aliasing, due to limited table size)

- 4096-entry table:

![Graph of BHT Table Predictor Accuracy](image)

Correlated Branch Prediction

- Idea: track the outcome of the $m$ most recently executed branches (globally), and use that pattern to select the proper $n$-bit branch history table

- In general, $(m,n)$ predictor means use last $m$ (global) branch outcomes to select between $2^m$ history tables, each with $n$-bit counters
  - Thus, old 2-bit BHT is a $(0,2)$ predictor

- Global Branch History: $m$-bit shift register keeping T/NT status of last $m$ branches.

- Each entry in the table has $2^m$ $n$-bit predictors (local branch history).
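An $(m,n)$ predictor along these lines might look like the sketch below. The class and function names, the zero-initialised counters, the table size of 16, and indexing with `pc % entries` are all assumptions for this illustration, not details taken from the slides.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select one of 2**m n-bit counters."""

    def __init__(self, m, n, entries):
        self.n = n
        self.entries = entries
        self.history_mask = (1 << m) - 1
        self.history = 0  # m-bit global shift register of recent outcomes
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        counter = self.table[pc % self.entries][self.history]
        return counter >= (1 << (self.n - 1))  # taken if in the upper half

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        top = (1 << self.n) - 1
        old = row[self.history]
        row[self.history] = min(old + 1, top) if taken else max(old - 1, 0)
        # Shift the newest outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 20  # a branch that strictly alternates
plain = count_mispredictions(CorrelatingPredictor(0, 2, 16), 0x40, alternating)
correlated = count_mispredictions(CorrelatingPredictor(2, 2, 16), 0x40, alternating)
print(plain, correlated)
```

On this alternating pattern the $(0,2)$ configuration keeps missing every taken instance, while the $(2,2)$ configuration mispredicts only during warm-up, because the global history tells it which phase of the pattern comes next.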

Example: A (2,2) Branch Predictor

- (2,2) predictor
  - The behavior of the two most recent branches selects among four predictions for the next branch; only the selected prediction is updated

![Diagram of (2,2) Branch Predictor](image)
Branch Predictor Accuracy

![Chart of frequency of mispredictions for three predictors: 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT](image)

Tournament Branch Predictor

- Multilevel branch predictor
- Uses an n-bit saturating counter to choose between predictors
- The usual choice is between a global predictor and a local predictor
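The selection step can be sketched with a saturating counter as the chooser. Everything named here is hypothetical: the 2-bit width, the convention that the upper half of the counter range selects the global predictor, and the rule of training the chooser only when the two components disagree are assumptions for this illustration.

```python
class SaturatingCounter:
    def __init__(self, bits=2, value=0):
        self.top = (1 << bits) - 1
        self.value = value

    def upper_half(self):
        return self.value > self.top // 2

    def train(self, up):
        self.value = min(self.value + 1, self.top) if up else max(self.value - 1, 0)


def tournament_predict(chooser, global_pred, local_pred):
    """Use the global prediction when the chooser is in its upper half, else the local one."""
    return global_pred if chooser.upper_half() else local_pred


def tournament_train(chooser, global_pred, local_pred, taken):
    """Nudge the chooser toward whichever component was right, if they disagreed."""
    if global_pred != local_pred:
        chooser.train(global_pred == taken)


chooser = SaturatingCounter(bits=2, value=0)  # starts out trusting the local predictor
correct = []
for taken in [True, False, True, True]:
    # Contrived inputs: the global component is always right, the local one always wrong.
    global_pred, local_pred = taken, not taken
    final = tournament_predict(chooser, global_pred, local_pred)
    correct.append(final == taken)
    tournament_train(chooser, global_pred, local_pred, taken)
print(correct)
```

In this contrived run the chooser needs two disagreements before it switches to the component that has been right, which is the same hysteresis a two-bit branch counter provides.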

Branch Predictor Performance as a Function of Size (total # of Bits)
(SPEC89 Benchmarks)

Pentium 4 Misprediction Rate
(per 1000 instructions, not per branch)

- 6% misprediction rate per branch, SPECint
  (19% of INT instructions are branches)
- 2% misprediction rate per branch, SPECfp
  (5% of FP instructions are branches)
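The per-branch rates and branch frequencies above can be combined into mispredictions per 1000 instructions. The helper below is a hypothetical one written for this worked example; it assumes the two percentages simply multiply.

```python
def mispredictions_per_1000_instructions(branch_fraction, mispredict_rate_per_branch):
    """Expected mispredictions per 1000 instructions from the two per-unit rates."""
    return 1000 * branch_fraction * mispredict_rate_per_branch


spec_int = mispredictions_per_1000_instructions(0.19, 0.06)  # 19% branches, 6% missed
spec_fp = mispredictions_per_1000_instructions(0.05, 0.02)   # 5% branches, 2% missed
print(round(spec_int, 1), round(spec_fp, 1))
```

Under that assumption the figures work out to roughly 11.4 mispredictions per 1000 instructions for SPECint and roughly 1 for SPECfp.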
Branch Prediction—What about the Branch Target Address (BTA)?

- Branch Prediction is of no value unless we know the BTA
- Branch target calculation is costly and stalls the instruction fetch
- A Branch Target Buffer (BTB) can store previously computed BTAs
- The BTA of a taken branch is stored in the BTB
- For subsequent executions of this branch, the BTA can be “looked up” in the BTB
- If the branch was predicted taken, instruction fetch continues at the predicted PC
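The lookup described above can be sketched with a dictionary standing in for the buffer. The class and method names, the 4-byte instruction size, and the policy of dropping an entry when its branch resolves as not taken are assumptions for this illustration; a real BTB is a fixed-size tagged hardware table.

```python
class BranchTargetBuffer:
    """Maps the PC of a previously taken branch to its branch target address."""

    def __init__(self):
        self.targets = {}

    def next_fetch_pc(self, pc, instr_bytes=4):
        # A hit means "predicted taken": fetch from the stored target.
        # A miss means fall through to the next sequential instruction.
        return self.targets.get(pc, pc + instr_bytes)

    def resolve(self, pc, taken, target):
        # Called once the branch outcome and target are actually known.
        if taken:
            self.targets[pc] = target
        else:
            self.targets.pop(pc, None)  # assumed policy: forget not-taken branches


btb = BranchTargetBuffer()
first = btb.next_fetch_pc(0x40)   # no entry yet, so fetch falls through
btb.resolve(0x40, taken=True, target=0x10)
second = btb.next_fetch_pc(0x40)  # entry present, so fetch is redirected
print(hex(first), hex(second))
```

The first lookup of the branch at `0x40` misses and fetch continues sequentially; once the taken branch has been recorded, the next lookup supplies the target without recomputing it.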

Branch Target Buffer (BTB)

Often, the BTB is used in conjunction with Dynamic Prediction
- BTB provides fast prediction and BTA in fetch stage
- Dynamic Predictor provides more accurate prediction in decode stage

BTB Flowchart

Branch Prediction Summary

- Dynamic Prediction is essential in modern high-performance processors
- Branch History Table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
- Tournament predictors take this insight to the next level by using multiple predictors
  - usually one based on global information and one based on local information, combined with a selector
  - Tournament predictors using ~30K bits are in processors like the Power5 and Pentium 4
- Branch Target Buffer: includes the branch address and prediction