Contents
Changelog:
- 14 Sep 2020 approx 10PM: be explicit in Part 3b that we don’t expect insturctions that are never executed to be counted, since we don’t tell you about them in our benchmark data
- 17 Sep 2020: correct error in answer sheet where we asked for “times faster” instead of “times more instructions for Part 2b to actually match what we ask here. We’ll also accept answers that based on how times fewer instructions are executed for those who used the original interpretation.
- 18 Sep 2020: for part 3a and 3b, provide a clearer description of what the difference between the two ways we want instructions counted; pointing out that the way we want them counted for 3a is the same as the “issued” column in the benchmark data and the way for 3b is the same as the “in program” column.
- 18 Sep 2020: for part 2b, clarify that you do not need to consider instructions like
movq 0x100000, %rax
as using complex addressing that needs to be translated - 18 Sep 2020: in part 3a, clarify the assumptions are about signed numbers, since this is what makes it possible to use the benchmark data most conveniently
- 21 Sep 2020: in part 3a, clarify that 16-bit constants should not need additional instructions with note that we’ll accept answers that don’t assume this
- 22 Sep 2020: in part 4a, correct typo of “necessary” for “unnecessary”
- 22 Sep 2020: on answer sheet, correct questions corresponding to Part 2b not making any sense (due to copy-and-paste issues)
- 23 Sep 2020: in part 4, add note re
mulq
needing to beimulq
Record your answers on the answer sheet.
Part 1: Linking
Using the supplied object file/executable format description (same as in the lab):
-
translate this following object files into an executable file
Copy-and-paste your executable into the answer sheet.
Part 2: Complex Addressing Modes
x86-64 allows for the complex addressing modes where a memory location is specified using two registers
as in (%rax, %rbx)
, including the most complicated form of something like 1234(%rax, %rbx, 4)
.
Although this makes instruction encoding more complicated, perhaps a x86-64 processor
can obtain a benefit by including hardware specialized for performing the two-register+scale
address computation.
Subpart a
Using only the 0x1234(%rax)
-style of addressing memory (like supported in Y86-64), write
out assembly code equivalent to
movq 0x1234(%rax, %rbx, 8), %rcx
using not more than 5 instructions. Your assembly code can use x86-64 instructions that don’t exist
in Y86-64, like imul
or shl
.
Subpart b
For this problem, use the instruction information from benchmarks originally supplied in the lab.
If the compiler naively translated these instructions to avoid the complex addressing
(using only the offset(base register)
addressing supported by Y86-64), then about how many times more
instructions would the processor execute? Record your answers (for each benchmark program)
on the answer sheet), along with an explanation of your method. (Since you do not have precise
information about the programs, your estimate does not have to be exact. Since we are asking for
naive translations, you do not have to come up with the best possible assembly translation.
You do not need to handle converting instructions like movq 0x100000, %rax
(no index register or scale) to the offset(base register)
style addressing, as we don’t consider
that a complex addressing mode (but we will not deduct credit if you attempted this).)
Part 3
Fixed-length encodings simplify instruction sets, but usually at the cost of making programs larger. In a fixed-length encoding, instructions that take fewer operands must take up the same amount of space as an instruction that uses many operands.
Based on the Y86-instruction set one might imagine that this means that an equivalent fixed-length instruction set would have 10-byte instructions. While this is certainly a possibility, typically fixed-length instruction sets compromise by limiting the size of constants in instructions. For example, the RISC V 64-bit ISA supports an encoding where all instructions are 32-bits, even though registers are 64-bits. So rather than being able to do something like:
irmovq $0x12345678, %r10
to load the 32-bit value 0x12345678
, one must use two instructions. If Y86-64 was designed
more like RISC V, this might look like:
lui $0x1234, %r10 /* load upper immediate (constant) --- move into top -bits of destination */
iaddq $0x5678, %r10 /* add immediate (constant) --- with 16-bit immediate */
where the encoding of lui
and iaddq
wouldn’t have room for much more than a 16-bit immediate after fitting
in the opcode and register number into the 32-bit instructions.
To put a 64-bit value like 0x1234567890ABCDEF
in a register in this design,
one would need a sequence of several instructions like:
/* "load upper immediate" */
lui $0x1234, %r10
/* "immediate add" */
iaddq $0x5678, %r10
/* "logical shift left by immediate (constant)" */
isll $16, %r10
iaddq $0x90AB, %r10
isll $16, %r10
iaddq $0xCDEF, %r10
Or, alternately, a compiler might prefer to generate a rmmovq
instruction rather than attempt to
use the equivalent of irmovq
.
Subpart a
For this problem, use the instruction information from benchmarks originally supplied in the lab.
Since our benchmark programs were run on X86-64, instructions supporting 32- and 64-bit constants were readily available to the compiler. Let’s consider how the compiler would perform if instructions using larger constants required a sequence to load the larger constant value. To get a rough estimate of the effect of a fixed-length instruction set, lets suppose the compiler would need:
- 2 instructions to construct a 32-bit signed value
- 4 instructions to construct a 48-bit signed value
- 6 instructions to construct a 64-bit signed value
[Note (added 18 Sep): originally we did not specify that these are signed values (just said “32-bit value”, etc.), but we think you would have trouble using the benchmark data we provide with assuming this.]
[Added 21 Sep:] Assume that using 16-bit signed values requires no additional instructions (we originally neglected to specify this explicitly, so we’ll also accept answers that don’t use this assumption).
Most likely, the compiler would also need to use some extra registers. To keep things simpler, let’s assume that effect is negligible. Based on these assumptions, about how many times more instructions would be executed in the benchmark programs? (When counting instructions executed for this part, if an instruction is executed 10 times, count it 10 times. This is what the “issued” column in our benchmark data does.)
Subpart b
Under the same assumptions as subpart a, about how much larger would the benchmark programs be in terms of number of instructions stored in the program’s executable or libraries that were executed at least once? (When counting instructions stored and executed at least once for this part, count each instruction once. This is what the “in program” column in our benchmark data does.)
(Since variable-length instruction sets typically have shorter instructions, this is probably an underestimate of increase in file size.)
Part 4
three- versus two-operand instructions
Making a typical instruction have three operands, so that assembly to compute C = A + B is like:
add A, B, C
rather than like:
mov B, C
add A, C
should allow many mov instructions to be avoided. x86-64, of course, requires the second form. However, compilers can usually find ways of allocating registers to avoid copying the value. For example:
- if B is not used after
C = A + B
, the compiler can select to storeC
in the register previously being used forB
and avoid amov
instruction; or - if the original code is doing an operation
A = A + B
instead ofC = A + B
, the compiler can store the result inA
by doingadd B, A
(taking advantage of theadd
being communicative).
One might worry that these cases are rare and so having only two operands will have a very large cost, but looking at what compilers do in practice often seems to show that it is not.
Subpart a
Consider the following x86-64-style assembly code but with three register instead of two register instructions (“source, source, destination” instead of “source, source and destination”):
mulq %rdx, %r9, %rax
mulq %rbx, %r10, %rcx
addq %rax, %r12, %r11
addq %r11, %rcx, %r12
mulq %rdx, %r13, %r11
mulq %r10, %rcx, %r13
(This snippet is based on an excerpt of a matrix multiply routine compiled for an architecture with three-operand arithmetic instructions.)
Naively translated into two-register assembly would result in code like:
movq %r9, %rax
mulq %rdx, %rax
movq %r10, %rcx
mulq %rbx, %rcx
movq %r12, %r11
addq %rax, %r11
movq %rcx, %r12
addq %r11, %r12
movq %r13, %r11
mulq %rdx, %r11
movq %rcx, %r13
mulq %r10, %r13
([Added 23 Sep:] assume mulq
is a two-argument multiply instruction that works like add; due to me getting confused, imul
would be the actual x86-64 instruction.)
But some of these movq
instructions are unnecessary.
Rewrite this two-register assembly as assembly that omits at least two mov
instructions but results in the same
values in registers.