Changelog:

19 October 2020: correct answer sheet to include all questions and mark the corresponding items and actually only ask for “lapack-segsv” and “queens-8” numbers, like item 2 states (instead of using old version)
20 October 2020: add note about module load python on department machines
21 October 2020: specify that for each of the items below, comparisons should be made to the original simulated procesor
22 October 2020: correct time_trace.py we supply to enable simulating load/use hazards by default (missing a default=True in add_argument call)
23 October 2020: correct “taken before” to “taken the previous time it was executed” on item 6
26 October 2020: clarify that differences in performance should be measured based on cycles
26 October 2020: clarify phrasing on item 6 regarding what happens on a misprediction (adjust “does not” to “does not predict correctly”; “fetching instructions” to “fetching predicted instructions”)

Your task

Download the supplied program traces (link ; extract with tar --xz -xvf traces.tar.xz) and trace analysis tool (link).

The supplied trace analysis tool takes a trace, which is a record of all the instructions a program executes and counts how many cycles a pipelined processor similar to the one you produced for pipehw2 would take to execute it, except it does not simulate stalling for ret.

Before 22 October 2020 around 11:15AM, the time_trace.py we supply did not enable simulating load/use hazards by default (but would do so if you supplied the correct command line argument. For the below questions, we’ll accept answers based on either the version of the tool which simulates those hazards or the one that does not.

The supplied traces in a format described below. Rather than being a list of instructions, they are a list of attributes of instructions, like their source and destination registers, whether they are branches, etc.
For the below questions, record your answers in the answer sheet.

To save you time computing performance numbers, we’ll only ask for statistics from the lapack-segsv and queens-8 traces, though we supply some additional ones. (Notably, the asumr trace is much smaller than the others.)

When you are done, also upload all modifications you made to the trace analysis tool to the submission site
Using the supplied trace analysis tool identify how many times slower the programs is predicted to be when increasing the branch misprediction penalty from 2 to 3 cycles. You can do this with the command-line options in the tool. (Throughout this assignment, measure the performance by counting the number of cycles (since we don’t give you enough information to make any other type of performance measurement).)

(The branch misprediction penalty is the number of cycles wasted when a branch is mispredicted. in the five-stage pipeline required for pipehw2, it was two cycles, since two instructions would be fetched and then discarded.)

Record your answers in the answer sheet.
Modify the trace analysis tool to determine what the performance would be if conditional branches were not predicted at all and the processor instead stalled for 2 cycles (compared to the original version).

Record the differences in performance in the answer sheet.
Suppose we had a processor that had a longer pipeline by splitting the execute stage into multiple stages. In this case, an instruction sequence like
```
addq %rax, %rbx
addq %rbx, %rcx
```
would require stalling for one cycle in order to forward %rbx.

Modify the trace tool to simulate this effect by assuming that a cycle of stalling (in addition to any cycles for load/use hazards) is required to read a value written by the previous instruction. (In a real processor, probably not every operation would require both execute stages to retrieve a value, but to keep this simpler, please do not try to determine how “complicated” an instruction that writes a register is.)

Record the differences in performance (between this and the original version of the processor) from this change in the answer sheet.
Suppose we had a processor that made branch predictions for conditional branches based on historical information: rather than predicting branches to always be taken, it:
- if the instruction (as identified by the orig_pc field in the trace) has been executed before, it predicts the branch to be taken if and only if it was taken the previous time it was executed;
- if the instruction was not executed before, it predicts the branch to be taken
Suppose that when the processor predicts a branch in this way, it can fetch the predicted instruction in the next cycle (like the processor we implemented for pipehw2). When the processor does not predict correctly, it results in two cycles being wasted fetching predicted instructions that will not be allowed to complete.

Modify the trace tool to simulate this effect and record the differences in performance from this change (starting from the original version of the processor) in the answer sheet.
In addition to your answers, remember to submit your modified trace analysis tool.

Program trace format

The program trace files we supply are CSV (comma-separated value) files with one row for each instruction executed and the following fields:

is_conditional_branch, is_constant_jump, is_computed_jump: Y if the program is that type of jump; N otherwise. In the trace, a Y86-like call function_name instruction is a constant jump and a Y86-like ret instruction is a computed jump.
srcA, srcB, dst: source and destination register numbers. If an instruction has no sources (e.g. load of a constant into a register), srcA and srcB will be left blank. If an instruction has no destination register (e.g.register-to-memory move), then dst will be left blank. If an instruction only has one source register, then one of srcA or srcB will be left blank.
is_memory_read, is_memory_write: Y if the instruction performs a data memory read/write; N otherwise.
mem_addr: address accessed by the data memory (blank if none).
branch_taken: Y if the instruction was a taken conditional branch; N if the instruction was a not-taken conditional branch; blank otherwise
orig_pc: the program counter of the instruction that produced this entry in the trace. In some case, wone instruction in the original program was transformed into multiple instructions in the trace, so there may be duplicate entries.

There may also be fields other starting with orig_ which give information (that you are not expected to interpret) about the instructions from which the trace was generated. Depending on the trace, these are either Y86 instructions or RISC V instructions. Note that this means that there may be some combinations of the fields above that would not be possible on Y86. (For example, RISC V’s equivalent of the call instruction writes the return address to a register rather than memory.)

For traces derived from Y86 instructions, instructions that would write multiple registers (like popq %rax) have been transformed into multiple instructions rather than adding a second dst field.

Trace analysis tool

We suply time_trace.py. If you run python3 time_trace.py --help, you can see the options it takes.

On department machines, you should run module load python first to ensure that Python 3 is available.

The functions inside time_trace.py have comments that should help explain what they do and the limitations of the simulation. The bulk of the work is done in the count_time_in function and this is where you should expect to make any changes.

Contents

Your task

Program trace format

Trace analysis tool