Contents
Changelog:
- 7 October 2020: (while still tentative) expand description of avoiding
stall_F
.
Your task
- Combine your solution from the previous HW and the previous lab into a new file called
pipehw2.hcl
to create a five-stage pipelined processor with forwarding and branch prediction as described in the textbook that implements:nop
halt
irmovq
rrmovq
OPq
cmovXX
rmmovq
mrmovq
We will provide an example lab solution on Collab sometime after the lab is due..
- Add the remaining instructions:
jXX
pushq
popq
call
ret
-
Test your combined simulator with
make test-pipehw2
- Submit your solution to kytos
Hints/Approach
General Approach
You may approach this however you wish, but I suggest the following flow:
- Combine your
pipehw1.hcl
andpipelab2.hcl
and test the combination. All of the tests that either source file passed previously ought to still pass the combination. - Add
jXX
with speculative execution and branch misprediction recovery. Predict that all branches are taken. Test. - Add
pushq
and test. - Add
call
and test. - Add
popq
and test. - Add
ret
with handling for the return hazard, and test.
Implementing jXX
1 | 2 | 3 | 4 | 5 | 6 | 7 | |||
---|---|---|---|---|---|---|---|---|---|
jXX |
F | D | E | (next PC available) | M | W | |||
wrong1 |
F | D | (bubble needed) | ||||||
wrong2 |
F | (bubble needed) | |||||||
right1 |
F | D | E | M | W |
-
Replace
pc
in thefF
(orxF
orpP
or whatever else you called it) register bank wtihpredPC
, which will store a predicted PC value instead of the actual PC value.To speculatively use this prediction, we can set
pc
to the predicted PC (pc = F_predPC
). -
Your processor should predict that all
jXX
s are taken (the new PC isvalC
). -
We will detect that predictions are wrong near the end of the
jXX
’s execute stage (when we check the condition codes). We will fetch the correct instruction during the fetch stage in the next cycle (whenjXX
is in the memory stage). - When we react to a misprediction, we need to:
- Squash the mispredicted instructions (which are about to enter the decode and execute stages).
You can do this by setting the
bubble_X
signals in the cycle before the corrected instruction is fetched in order to reset theX_*
pipeline registers to their default values in the next cycle. - Fetch the corrected instruction next cycle (e.g. with a MUX in front of the
pc
signal).
- Squash the mispredicted instructions (which are about to enter the decode and execute stages).
You can do this by setting the
-
You can fetch the corrected instruction with a MUX front of the
pc
signal:pc = [ mispredicted : oldValP ; ... 1: F_predPC ; ];
You may need to pass the
conditionsMet
signal or something equivalent through a pipeline register to be able to tell when a misprediction happened at the appropriate time. -
You will need access to the
valP
from thejXX
instruction. To do so, you will probably need to pass it through pipeline registers. -
Make sure you correctly handle interactions between
jXX
andhalt
. Consider code like:jne foo halt foo: rmmovq %rax, (%rax) rmmovq %rax, (%rax) rmmovq %rax, (%rax)
When the
halt
is executed,F_predPC
may contain the address of anrmmovq
instead ofhalt
: the halt instruction would be fetched because of themispredicted
case on thepc
MUX we suggseted above, rather than being feteched becauseF_predPC
pointed to it.This means that setting
stall_F
may not be enough to fetch ahalt
next cycle: if we setstall_F
, then during the next cycle,F_predPC
will be the same (not halt), but, if using a MUX forpc
like we suggest above, the value ofmispredicted
andoldValP
in that MUX will be different.Some solutions to this problem may involve using an technique other than setting
stall_F
to prevent the PC from changing, like adding a MUX or cases to a MUX forf_predPC
and/or other MUXes involved in computing the PC.For example, one might use code like
f_predPC = [ keep_same_instruction_we_fetched : pc ; ... ];
to keep the same instruction in the fetch stage when
keep_same_instruction_we_fetched
is set to true rather than settingstall_F
. - If instead of squashing the mispredicted instructions when they are about to enter the decode and execute stages (like suggested above), you squash them when they are about to enter the execute and memory stages, you will have to worry about preventing the conditions codes from being changed by one of the mispredicted instructions.
pushq
and popq
-
rmmovq
ormrmovq
except you useREG_RSP
notrB
and ±8 notvalC
. Also has a writeback component forREG_RSP
. -
popq
updates two registers, so it will need bothreg_dstE
andreg_dstM
. -
popq
reads from the old%rsp
whilepushq
writes to the new%rsp
.
call
and ret
|1|2|3|4| |5|6|7|8|
------|-|-|-|-|-----------|-|-|-|-|-
`ret` |F|D|E|M|(next PC available)|W| | | |
`???` | |F|F|F|(next PC needed) |F|D|E|M|W
-
call 0x1234
ispush valP; jXX 0x1234
. Combining the logic of push and unconditional jump should be sufficient. -
ret
is jump-to-the-read-value-when-popping. It always encounters the “ret
-hazard”:You’ll have to stall the fetch stage as long as a
ret
is in decode, execute, or memory and forward the value fromW_valM
to thepc
.
Testing your code
-
You can run the command
make test-pipehw2
to run your processor on almost all the files iny86/
, comparing its output to references supplied intestdata/pipe-reference
. The list of tested files is intestdata/pipehw2.txt
. For the filespop-forward2.yo
,pop-forward3.yo
,pop-forward4.yo
,load-store.yo
, you should have the same values, but you may take fewer cycles. -
For each input file in
y86/
, there is a trace from our reference implementation intestdata/pipe-traces
. -
Your code should have the same semantics as
tools/yis
: set the same registers and memory. You can use this to see if your processor does the correct thing on any input files, including files you come up with yourself. -
We will check the number of cycles your processor takes. As a general rule, your pipelined processor will need
- 1 cycle per instruction executed
- 4 extra cycles because we have a five-stage pipeline; even
halt
takes 5 cycles now. - +1 more cycle for each load-use hazard (i.e., read from memory in one cycle, use with ALU next cycle)
- +2 more cycles for each conditional jump the code should not take (the misprediction penalty)
- +3 more cycles for each
ret
executed
Specific Test Cases
jXX
y86/j-cc.yo
irmovq $1, %rsi
irmovq $2, %rdi
irmovq $4, %rbp
irmovq $-32, %rax
irmovq $64, %rdx
subq %rdx,%rax
je target
nop
halt
target:
addq %rsi,%rdx
nop
nop
nop
halt
takes 15 cycles and leaves
| RAX: ffffffffffffffa0 RCX: 0 RDX: 40 |
| RBX: 0 RSP: 0 RBP: 4 |
| RSI: 1 RDI: 2 R8: 0 |
A full trace is available in testdata/pipe-traces/j-cc.txt
y86/jxx.yo
irmovq $3, %rax
irmovq $-1, %rbx
a:
jmp b
c:
jge a
halt
b:
addq %rbx, %rax
jmp c
takes 25 cycles and leaves
| RAX: ffffffffffffffff RCX: 0 RDX: 0 |
| RBX: ffffffffffffffff RSP: 0 RBP: 0 |
A full trace is available in testdata/pipe-traces/jxx.txt
(distributed with hclrs.tar
)
pushq
y86/push.yo
irmovq $3, %rax
irmovq $256, %rsp
pushq %rax
takes 8 cycles and leaves
| RAX: 3 RCX: 0 RDX: 0 |
| RBX: 0 RSP: f8 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x000000f_: 03 00 00 00 00 00 00 00 |
A full trace is available as testdata/pipe-traces/push.txt
popq
y86/pop.yo
irmovq $4, %rsp
popq %rax
takes 7 cycles and leaves
| RAX: fb0000000000000 RCX: 0 RDX: 0 |
| RBX: 0 RSP: c RBP: 0 |
A full trace is available as testdata/pipe-traces/pop.txt
call
y86/call.yo
irmovq $256, %rsp
call a
addq %rsp, %rsp
a:
halt
takes 7 cycles and leaves
| RBX: 0 RSP: f8 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x000000f_: 13 00 00 00 00 00 00 00 |
A full trace is available as testdata/pipe-traces/call.txt
ret
y86/ret.yo
irmovq $256, %rsp
irmovq a, %rbx
rmmovq %rbx, (%rsp)
ret
halt
a:
irmovq $258, %rax
halt
takes 13 cycles and leaves
| RAX: 102 RCX: 0 RDX: 0 |
| RBX: 20 RSP: 108 RBP: 0 |
| used memory: _0 _1 _2 _3 _4 _5 _6 _7 _8 _9 _a _b _c _d _e _f |
| 0x0000010_: 20 00 00 00 00 00 00 00 |
A full trace is available as testdata/pipe-traces/ret.txt
(It is okay to diagree with this trace about what instruction is fetched and ignored while waiting for the ret, but you should take the same number of cycles and produce the same final results.)