55:132/22C:160

High Performance Computer Architecture

Spring, 2011

First Verilog Project

 

Due Dates: Tues, March 1, Thurs. March 3, Tues. March 8, by 11:59 p.m. (e-mail submission)

IMPORTANT: Be sure to follow the submission instructions and file/module naming conventions exactly. Your work will not be graded if files or modules are misnamed.

For this assignment, you will experiment with different branch prediction strategies using a Verilog model of a five-stage pipelined processor similar to the one discussed extensively in lecture and the text book. This processor model implements the DLX-subset ISA that you were exposed to in an earlier homework assignment. The project has three parts, each with a different due date.

Part 1: (You be the compiler)

A Verilog model for a version of the processor that utilizes Delayed Branching can be found here. Both the individual files and a TAR-ball including all of the files are provided. A block diagram of the processor is shown here. Note that branch outcome testing and branch target address calculation have been moved to the ID stage, reducing the branch shadow to one cycle. All forwarding paths and stall interlocks have been implemented. The top level module that implements the pipeline is called sdlxDelayedBranch and is found in the file sdlxDelayedBranch.v. The pipeline is mostly implemented as a Verilog structural model, but the pipeline-forwarding operations have been implemented as a behavioral model to facilitate easier modification. This will be discussed further in lecture.

A six-by-six matrix multiply program, similar to the one you worked with in an earlier homework assignment is provided. This program is in the file matmultDelayedBranch.s. Note that NOPs have been inserted in the branch shadows. The Verilog model is configured to execute this program with data memory contents as specified in the file matrix.dat.

Compile and run the Verilog model. Verify that the displayed results are correct and record the displayed information regarding the number of stall cycles (due to data hazards) and total number clock cycles required to execute the program.

Now, acting as the compiler, attempt to optimize the program to minimize the execution time. You may employ any or all of the following options:

·         Filling branch shadows with (potentially) useful work rather than NOPs

·         Loop unrolling (innermost loop only)

·         Rearranging instructions to eliminate stalls

Name your optimized program matmultOptimized.s. Execute your optimized program on the processor model. (Note that you will need to edit the filename in the $readmemb statement at line 273 of sdlxDelayedBranch.v to load your program into the instruction memory. Verify that the simulation produces the correct result. Note the displayed number of stalls and total number of required clock cycles.

Part 1 Deliverables: to be submitted by email sent to hpca@engineering.uiowa.edu by 11:59 p.m. on Tuesday, March 1:

·         Your optimized matrix multiply program, named matmultOptimized.s

·         A brief result describing your approach to optimizing the program and the observed performance improvement.

Part 2: (Let the Hardware do the work)

An alternative top-level module called sdlxStaticPredict can be found here. This module implements static, not-taken branch prediction, in lieu of delayed branching. A version of the matrix multiply program, called matmult.s that eliminates the NOPs after branches (and readjusts branch offsets accordingly) is also provided. Run the simulation using this module and matmult.s (sldxStaticPredict is configured to load this program.) All other modules are interchangeable between the two versions of the processor. Verify that the results are correct and note the displayed number of stalls, branch mispredictions, and total number of cycles.

Now, replace the static branch prediction with a dynamic branch prediction strategy based upon a one-bit branch History Table (BHT) and associated Branch Target Buffer (BTB) in the fetch stage. The BHT/BTB should have 32 entries and should be indexed by the low-order 5 bits of the current program counter (pcin). Each entry should store a 32 bit computed branch target in addition to the one-bit history. Each entry should also have a valid bit, indicating whether or not the entry holds valid information. At the end of each (non-stall) cycle your branching logic should do the following:

If the instruction fetched during this cycle (instrmemout) is a branch instruction then

If the BHT/BTB entry indexed by the current PC (pcin) is valid

If the BHT predicts taken

update the program counter for the next cycle with the BTB entry

else proceed with regular PC update

else use static not-taken prediction

 

When the actual branch outcome is known, the fetched instruction at the predicted BTA will need to be cancelled if the prediction proves to be incorrect. Also, the BHT/BTB entry will need to be updated according to the branch outcome.

 

Your modified module should be named sdlxOneBitHistory and the file in which it is located should be called sdlxOneBitHistory.v. Do not modify or rename anything in any of the other modules.

 

Run the simulation using your new top-level module and the program matmult.s. Verify that it produces the correct result and note the displayed number of stalls, mispredictions, and total number of cycles.

 

Part 2 Deliverables: to be submitted by email sent to hpca@engineering.uiowa.edu by 11:59 p.m. on Thursday, March 3:

·         A file, named sdlxOneBitHistory.v, containing your module sdlxOneBitHistory

·         A brief report noting the performance improvement observed as a result of using the one-bit dynamic prediction (as compared to delayed branching (without compiler optimization) and static not-taken prediction).

 

Part 3: (Two-bits are better than one):

 

Now, replace your one-bit static prediction approach from part 2, with a two-bit, history-based prediction using the prediction scheme described in Figure 2.4 on page 83 of the text. If a BTB/BHT entry is not valid, static not-taken prediction should be used. The initial (valid) prediction state for each entry should be determined by the first associated branch outcome (11 if taken, 00 if not taken).

 

Run the simulation using your new top-level module and the program matmult.s. Verify that it produces the correct result and note the displayed number of stalls, mispredictions, and total number of cycles.

 

 

 

Your new top-level module should be named sdlxTwoBitHistory and should be in a file named sdlxTwoBitHistory.

 

Part 3 Deliverables: to be submitted by email sent to hpca@engineering.uiowa.edu by 11:59 p.m. on Tuesday, March 8:

·         The file, named sdlxTwoBitHistory.v, containing your module sdlxTwoBitHistory

·         A brief report noting the performance improvement observed as a result of using the two-bit dynamic prediction (as compared to delayed branching (without compiler optimization), static not-taken prediction, and dynamic prediction using one-bit history).