F2020 quizzes (key)

Suppose memory contains the following 8-bit bytes at the following addresses (each written in hexadecimal):

address	value
...	...
0x0FD	0x11
0x0FE	0x22
0x0FF	0x33
0x100	0x44
0x101	0x55
0x102	0x66
0x103	0x77
0x104	0x88
0x105	0x99
0x106	0x00
0x107	0xA0
0x108	0xAB
0x109	0xBA
0x10A	0xC0
0x10B	0xD0
...	...

Question 1 (2 pt; mean 1.4) (see above)

In little-endian, reading a four-byte (32-bit) value from address 0x105 yields what value? Write your answer as a hexadecimal number. If not enough information is given write "unknown".

Answer:

Key: /(0[xX])?0*[aA][bB][aA]00099|2879389849/

0xaba00099 constructed from 0x99, 0x00, 0xA0, 0xAB at addresses 0x105 through 0x108 inclusive. Reluctantly accepted if you decided that it should be written with the least significant hexadecimal nibble left-most (99000ABA), but it was our intention that "little-endian" just applied to how the value was read from memory.

Question 2 (2 pt; mean 1.77) (see above)

With the above memory layout, after running movb 0x104, %al, what will the value of the 8-bit register %al be? Write your answer as a hexadecimal number. If not enough information is given write "unknown".

Answer:

Key: /(0[xX])?0*88|136/

0x88

Question 3 (2 pt; mean 0.6) (see above)

Consider the following assembly snippet:

movq $0x4, %rax
movq $0x1, %rbx
movl 0x100(%rax, %rbx, 2), %eax

When run with the above memory layout, what will the resulting value of the 32-bit register %eax be? Write your answer as a hexadecimal number. If not enough information is given write "unknown".

Answer:

Key: /(0[xX])?0*[bB][aA][aA][bB][Aa]000|3131809792/

0xBAABA000; reads from 0x100 + 4 + 1 * 2 = 0x106; constructed from 0x00, 0xA0, 0xAB, 0xBA

Question 4 (3 pt; mean 2.87)

Consider the following AT&T syntax assembly:

addq %rbx, %rbx
addq %rax, %rbx
movq (%rbx), %rax

Which of the following assembly snippets will result in the same final value of %rax? (Ignore changes to the value of %rbx.) Select all that apply.

1%
movq (%rbx, %rax, 2), %rax
93%
⊤
movq (%rax, %rbx, 2), %rax
2%
addq %rbx, %rbx
addq %rax, %rbx
92%
⊤
addq %rbx, %rbx
movq (%rbx, %rax), %rax
93%
⊤
addq %rbx, %rbx
movq (%rax, %rbx), %rax
1%
addq %rbx, %rbx
addq %rax, (%rbx)

Question 5 (2 pt; mean 1.59)

Suppose the assembly label array is defined to be a constant 5-byte array as follows:

array:
   .byte 3
   .byte 4
   .byte 5
   .byte 6
   .byte 7

and the linker chooses to locate this array at address 0x10000. (Each .byte directive specifies the value of one byte of memory.)

Then, if we run the assembly snippet:

    movq $1, %rax
    addq $array, %rax
    movb 1(%rax), %bl

What would the value of 8-bit register %bl be afterwards? You may assume memory around array is not modified before this snippet is run.

Write your answer as a hexadecimal number. If not enough information is given write "unknown".

Answer:

Key: /(0[xX])?0*5/

after movq, %rax = 1; after addq, %rax = 0x10001; movb computes the address 0x10001 + 1 = 0x10002, which contains 5

quiz for week 2

Question 1 (2 pt; mean 1.73)

Consider the following C function:

long foo(long a, long b) {
    return a + b;
}

Which of the following are correct translations (but perhaps unnecessarily complex) of this function to AT&T syntax assembly (using the Linux x86-64 calling convention)? Select all that apply.

92%
⊤
foo: movq %rsi, %rax
addq %rdi, %rax
ret
36%
foo: movq (%rsi, %rdi), %rax
leaq (%rax), %rax
ret
1%
foo: leaq 1(%rsi, %rdi, 1), %rax
ret
90%
⊤
foo: leaq 0(%rsi, %rdi), %rax
ret

Consider the following C function:

long example(long a, long b) {
    while (a > b) { a = a - b; }
    return a;
}

Question 2 (2 pt; mean 1.63) (see above)

This function can be converted to x86-64 assembly (following Linux's x86-64 calling convention) like the following:

example:
    cmpq %rsi, %rdi
    ____ L_done
    subq %rsi, %rdi
    jmp example
L_done:
    movq %rdi, %rax
    ret

What instruction goes in the blank (in the second instruction of the function before L_done)?

Question 3 (2 pt; mean 1.58) (see above)

Assuming the assembly translation from the previous question is used, after this function returns, what will the value of the zero flag (ZF) be?

2%

always 0
2%

always 1
8%

1 if the return value is 0; otherwise 0
1%

1 if b is 0, otherwise 0
78%
⊤
1 if the return value is equal to the argument b; otherwise 0

the last instruction run that sets ZF is cmpq, which is comparing what will become the return value to the original argument b; if the subtraction done by the compare results in zero, that means the two values are equal
1%

1 if the loop executed at least one time (the original value of a was greater than b); otherwise 0
0%

1 if the loop executed zero times; otherwise 0
4%

it depends on the arguments and/or return value, but not in a way described above
it depends on what the value of the zero flag (ZF) was before the function was called

Suppose we assemble the following into an object file:

.data
.global array
array:
    .byte 1
    .byte 2
    .byte 3
    .byte 4

.text
.global bar 
bar:
    cmpq $0, %rdi
    je end_bar
    movq $array, %rdi
    call print_array
end_bar:
    movq $array, %rax
    ret

Question 4 (2 pt; mean 1.72) (see above)

The corresponding object file's relocations table will reference which of the following (either by name or by identifying the location of the corresponding label)? Select all that apply.

80%
⊤
array

required for movq $array, %rdi instruction
12%
bar

appears in symbol table, but not used by any instruction
90%
⊤
print_array

required for call print_array instruction

Question 5 (2 pt; mean 1.09) (see above)

When using the resulting object file to produce an executable (using the static (non-dynamic) linking scheme we discussed in lecture), the linker will _______. Select all that apply.

46%
⊤
write the four bytes that are stored after the label array to the executable file
71%
⊤
find a symbol table entry for print_array in some other object file
84%
write the memory address (in some format) chosen for the call print_array instruction somewhere in the executable

not referenced by any other instruction, no label permitting it to be located. Note that this question asks about the memory address of the call instruction, not about the memory address of print_array
85%
⊤
write the memory address for array (in some format) in the resulting executable somewhere

Consider the following C function:

int *quux(int *p) {
    int *r;
    r = p + 2;
    *r += 4;
    return r;
}

Question 6 (2 pt; mean 1.56) (see above)

Suppose the function quux is run on a Linux x86-64 system where:

ints are 4 bytes,
p points to the first element of an array of 400 ints located at address 0x10000
each of the 100 ints in the array initially has the value 7

What will the value of the pointer r be just before quux returns? Write your answer as a hexadecimal number. (Be sure to give the value of the pointer and not the value it points to.)

Answer:

Key: /(?:0[xX])?0*10008/

p + 2 advances by two ints -- 4 bytes each -- so r is 0x10000 + 2 * 4 = 0x10008; initial value of the array is irrelevant, but I did make a mistake in being inconsistent about the size of the array.

quiz for week 3

Consider the following C function:

long example(long a) {
    long last_a = a;
    while (a != (a >> 40)) {
        last_a = a;
        a = a >> 40; 
    }   
    return last_a;
}

Assume >> on integers is implemented using an arithmetic shift (copies the sign bit for leftmost bits of result) and longs are 64 bits, represented using two's complement.

Question 1 (2 pt; mean 1.72) (see above)

The value of example(1) is 1. Besides 1, what is another value K such that example(K) is 1? (There are several possible answers.) Write your answer as a base-10 number.

Answer:

Key: /1099511627776|2199023255551|1[1-9][0-9]{11}|1099[6-9][0-9]{8}|10995[2-9][0-9]{7}|109951[2-9][0-9]{6}|1099511[7-9][0-9]{5}|10995116[3-9][0-9]{4}|109951162[8-9][0-9]{3}|1099511627[8-9][0-9]{2}|10995116277[8-9][0-9]{1}|109951162777[7-9]|2[0-0][0-9]{11}|21[0-8][0-9]{10}|219[0-8][0-9]{9}|21990[0-1][0-9]{7}|219902[0-2][0-9]{6}|2199023[0-1][0-9]{5}|21990232[0-4][0-9]{4}|219902325[0-4][0-9]{3}|2199023255[0-4][0-9]{2}|21990232555[0-4][0-9]{1}|219902325555[0-0]/

between 1099511627776 (2^40) and 2199023255551 (2^41-1). At some point, last_a was 1, then a became 0 and 0 == 0 >> 40, terminating the while loop. Before this happened, a was some number such that a >> 40 was 1. Since >> 40 is equivalent to dividing by 2 to the 40th and rounding down, this is any number where dividing by 2 to the 40th would round down to 1.

Question 2 (2 pt; mean 1.22) (see above)

How many distinct return values can the above example function have? Write your answer as a base-10 number.

Answer:

Key: /2199023255552/

-2^40 through 2^40-1, inclusive; three-fourths credit given for half that (forgot negative) or off-by-one; X == X >> 40 only if X is 0 (all zeroes) or -1 (all ones). Y >> 40 is 0 or -1 only if bits 40-63 of Y are all 0s or all 1s. Alternatively, Y >> 40 is only 0 or -1 if dividing Y by 2 to the 40th power would round down to 0 or -1.

Question 3 (2.5 pt; mean 2.27)

If x and y are 32-bit signed ints with values between -1000000 and 1000000 on a system that uses two's complement, which of the following C expressions are always true? Select all that apply.

91%
⊤
(x >= 0) || (((x >> 31) & 1) == 1)
94%
⊤
(x & 0xFF) <= (x & 0xFFF)
88%
⊤
((x + (y & 0xFFFF)) & 0xFFFF) == (((x & 0xFFFF) + y) & 0xFFFF)

was originally miskeyed; should be always true
10%
(((x | y) & 0xFF) >> 8) == (((x >> 8) & 0xFF) | ((y >> 8) & 0xFF))
9%
(((x & 0xFFF) >> 8) ^ (y & 0xFFF)) == ((((x >> 8) ^ y) & 0xFF) | (y & 0xF00))

was originally miskeyed and this explanation was originally wrong;
counterexample: x = 0xF000, y = 0x0
(x & 0xFFF) >> 8 is 0, so the left-hand side is 0; but ((x >> 8) ^ y) is 0xF0, so the right hand side is 0xF0.

Question 4 (2 pt; mean 1.69)

Which of the following C expressions will, given an unsigned integer x, return the least significant 4 bits of the integer with its bits reversed? For example, if x in binary was 11001000001101, the result would be (in binary) 1011 (the reverse of 1101). Select all that apply.

45%
(((x & 0x11) << 3) | ((x & 0x12) << 1) | ((x & 0x14) >> 1) | ((x & 0x18) >> 3)) & 0xF
3%
((x << 1) & 1) | ((x << 2) & 2) | ((x >> 1) & 8) | ((x >> 2) & 4)
96%
⊤
((x & 1) << 3) | ((x & 2) << 1) | ((x & 4) >> 1) | ((x & 8) >> 3)
90%
⊤
((x << 3) & 15) | ((x << 1) & 5) | ((x >> 1) & 2) | ((x >> 3) & 1)

quiz for week 4

Question 1 (2 pt; mean 1.66)

Which of the following are likely attributes of processors following the RISC (reduced instruction set computer) design philosophy when compared to processors following the CISC (complex instruction set computer) design philosophy? Select all that apply.

10%
permitting efficient implementation with a register file with fewer registers

dropped, because we neglected to discuss what a register file is; RISC designs typically want to increase the number of registers versus a typical CISC to make up for not being able to access memory during most instructions
7%
providing more instructions that perform both computation and a memory access
90%
⊤
implementing fewer instructions overall

processor implements fewer instructions, though programs will need more instructions
36%
providing variants of instructions encoded by placing a special "prefix" byte value before original instruction's normal machine code

implies variable length instructions

Question 3 (2 pt; mean 1.42)

Consider the following Y86-64 machine code, written as a sequence of bytes in hexadecimal:

 50 76 74 00 00 00 00 00 00 80 60 12 61 84 00 00 00 00 00 00 00

If we translate this to assembly (assuming the first instruction starts at the first byte) then the first two instructions would be:

Question 5 (2 pt; mean 1.76)

Consider the following HCLRS code snippet where ... represents some omitted code:

register xY {
    foo : 64 = ...;
    bar : 64 = ...;
}
...
x_foo = Y_foo + Y_bar;
x_bar = Y_bar - Y_foo;

During cycle 10, Y_foo has the value 500 and Y_bar has the value 300. What is the value of Y_foo during cycle 12? (Write your answer as a base-10 number, like 123.)

Assume that cycles are separated by a rising edge of the clock signal.

Answer:

Key: 600

cycle 11: Y_foo = (500+300) = 800; Y_bar = (300-500) = -200; cycle 12: Y_foo = (800-200) = 600; was originally miskeyed (because I swapped values when subtracting)

Question 6 (2 pt; mean 1.67)

Using the kind of registers we described in lecture and in section 4.2.5 of our textbook, suppose a register's output is 42 and its value input is also 42 and the clock signal is high. Then the following happens in this order:

the clock signal falls (becoming low)
the register's value input changes to 44
the register's value input changes to 45
the clock signal rises (becoming high again)
the register's value input changes to 46
the clock signal falls (becoming low)
the register's value input changes to 47

What will the value of the register's output be after this occurs? If not enough information is given to answer write unknown and explain in the comment field.

Answer:

Key: 45

quiz for week 5

Question 3 (2 pt; mean 1.69)

Consider the following HCLRS code snippet:

reg_srcA = 8;
reg_dstE = 0;
reg_inputE = reg_outputA;

If this were part of an HCLRS processor which does not have any other code using the register file inputs and outputs, then

84%
⊤
the value of %r8 would be copied to %rax during every cycle
1%

the value of %rax would be copied to %r8 during every cycle
2%

the value 0 would be written to %rax during every cycle
1%

the value 0 would be written to %r8 during every cycle
3%

the value 8 would be written to %rax during every cycle
2%

the value 8 would be written to %r8 during every cycle
2%

the value 0 would be written to both %r8 and %rax during every cycle
2%

the value of registers would not change
2%

none of the above

Rather than having call and ret instructions that push and pop values from the stack, many instruction sets instead store the return address in a register (which, if necessary, programs can save on the stack).

For example, RISC-V provides a jal REGISTER, TARGET_LABEL ("jump and link") instruction to replace the functionality of call and a jr REGISTER ("jump to register") to replace the functionality of return. Like call, jal REGISTER, TARGET_LABEL stores the return address and then jumps to TARGET_LABEL, but it stores it in REGISTER, rather than on the stack. (If necessary, the function can use another instruction to push the return address onto the stack.) jr REGISTER takes a value from a register and sets the PC to the value.

Question 4 (2 pt; mean 1.24) (see above)

Suppose we added the jal instruction described above to the single-cycle Y86-64 processor design we described in lecture (and which is described in our textbook). (By "single-cycle processor", we mean a processor that executes one instruction per cycle.) To avoid adding inputs to MUXes or additional MUXes (or similar circuitry) to control the 4-bit register number inputs to the register file (reg_srcA, reg_srcB, reg_dstE, and reg_dstM in HCLRS), which of the below encodings would be best?

(In each of the encodings, values of each byte are provided with most significant bits written first (left-most).)

17%

[4 bit icode][4 bit destination register] (1st byte) then [64-bit target address]

register number we need to write in a different location in instruction memory output than any other instruction that writes registers
62%
⊤
[4 bit icode][4 bits unused] (1st byte) then [4 bit unused][4 bit destination register] (2nd byte) then [64-bit target address]

can reuse MUX option from mov or add instruction
15%

[4 bit icode][4 bits unused] (1st byte) then [64-bit target address] (2nd through 9th byte) then [4 bit destination register][4 bits unused] (10th byte)

register number we need to write in a different location in instruction memory output than any other instruction that writes registers
2%

[4 bit icode][4 bit condition code info] (1st byte) then [64-bit target address] (2nd through 9th byte)

can't implement this instruction since no destination register

Question 5 (2 pt; mean 1.33) (see above)

Suppose we added the jal instruction described above to the single-cycle processor design we described in lecture (and which is described in our textbook). While this instruction is executing, the input to the PC register would most likely be equal to

67%
⊤
part of the output of the instruction memory

constant from the instruction specifies the address of the next instruction to run
7%

the result of a calculation performed using one of the outputs of the register file
11%

one of the outputs of the register file
6%

the output of the data memory
6%

none of the above

Question 6 (2 pt; mean 1.59)

In the single-cycle Y86-64 processor design we discussed in lecture, which of the following operations may overlap in time with the addition of registers' values for an add instruction? Select all that apply.

86%
⊤
computing the next instruction's address (but not necessarily storing it in a register yet)

independent of add, circuit can act at same time
36%
writing a value to the data memory

not executing an instruction that should do this at all
32%
the clock signal rising

won't be done until addition completes
56%
reading a register's value from the register file

not triggered by any particular part of clock signal but has to be done before the addition can actually happen

quiz for week 6

Question 1 (2 pt; mean 1.56)

In lab to implement condition codes, we suggested declaring condition code registers using

register cC {
    SF:1 = 0;
    ZF:1 = 1;
}

and then using

stall_C = (icode != OPQ);

to keep the condition code registers from changing when the instruction was not an OPq instruction. We noted that, in HCLRS, "Register banks like cC have a special input stall_C which, if 1, causes the registers to ignore inputs and keep their current value." (Each register bank has its own stall signal, this one is stall_C, since the condition code registers were declared using register cC.)

If register banks did not provide this stall signal, we could have implemented the functionality using a case expression (MUX) when setting c_SF and c_ZF. What would the corresponding code for setting c_ZF look like?

3%

c_ZF = [ icode == OPQ : valE == 0; 1 : 0; ];
2%

c_ZF = [ icode == OPQ : valE == 0; 1 : 1; ];
9%

c_ZF = [ icode == OPQ : valE == 0; 1 : c_ZF; ];
78%
⊤
c_ZF = [ icode == OPQ : valE == 0; 1 : C_ZF; ];
2%

c_ZF = [ icode == OPQ : valE == 0; 1 : !C_ZF; ];
2%

c_ZF = [ icode == OPQ : valE == 0; 1 : !c_ZF; ];
2%

c_ZF = [ icode == OPQ : valE == 0; 1 : valE != 0; ];
1%

none of the above

Consider the following diagram of the single-cycle processor data path from lecture:

Note that in this version of the processor design, the second input to the ALU (which our textbook calls aluB) is 0 or the second output of the register file (reg_outputB in HCL).

Suppose we wanted to implement a new instruction ixorq on this processor, which would xor a register's value with a constant and store the result in a register. For example

ixorq $0x1234, %rax

would take the value of %rax, xor it with 0x1234 and store the result in %rax. In machine code, the instruction would have the same layout (placement of fields like icode and rA and valC) as irmovq.

Question 2 (2 pt; mean 1.61) (see above)

When this ixorq instruction is executing the MUX that controls the dstM register file input should ____.

7%

select the top input (rA)
71%

select the second input (0xF, also known as REG_NONE)
16%
⊤
select any input; it won't affect the instruction's operation
4%

select a new rB input that the MUX needs to be modified to support in order to implement the ixorq instruction

the place where the rA field is will always be REG_NONE already, so the selection shouldn't matter

Question 3 (2 pt; mean 1.75) (see above)

When this ixorq instruction is executing the MUX that controls the aluB ALU input should ____.

87%
⊤
select the top input (reg_outputB)
5%

select the second input (0)
6%

select any input; it won't affect the instruction's operation

quiz for week 7

For the following two questions, consider executing the following assembly snippet:

addq %rax, %rbx
subq %rbx, %rdx
xorq %rbx, %rcx
rrmovq %rdx, %rcx
addq %rbx, %rcx

Question 2 (2 pt; mean 1.49) (see above)

Suppose the assembly snippet is executing on the five-stage pipelined processor we described in lecture, but it uses only stalling (and no forwarding) to resolve data hazards. If the first addq instruction is fetched in cycle 0, then during what cycle will the final addq instruction run its writeback stage?

Answer:

Key: 16

without stalling: first addq does writeback in cycle 4, so last addq does writeback in 4 + 4 = 8. Add in stalling: 3 cycles before subq to wait for first addq; 2 cycles before rrmovq to wait for subq; 3 cycles before final addq to wait for rrmovq

Question 3 (3 pt; mean 2.67) (see above)

Suppose the assembly snippet is executing on the five-stage pipelined processor we described in lecture that:

uses forwarding to resolve data hazards to the extent possible without dramatically increasing the cycle time

Which of the following forwarding operations must occur to avoid the most stalling possible? Select all that apply.

95%
⊤
%rbx will be forwarded from the first addq to the subq
77%
⊤
%rbx will be forwarded from the first addq to the xorq
9%
%rbx will be forwarded from the subq to the xorq

not written by subq
93%
⊤
%rdx will be forwarded from the subq to the rrmovq
10%
%rcx will be forwarded from the xorq to the rrmovq

value is not used by rrmovq, so value is not necessary (though sending it would be harmless)
11%
%rcx will be forwarded from the xorq to the addq

overwritten by rrmovq

For the following question, consider executing the following assembly snippet:

addq %rax, %rbx 
mrmovq 8(%rbx), %rcx
xorq %rcx, %rdx
rmmovq %rdx, 16(%rbx)

Question 4 (2 pt; mean 1.6) (see above)

Suppose the assembly snippet above were executed on a six-stage pipelined processor with the following stages:

Fetch
Decode
Execute
Memory 1
Memory 2
Writeback

This processor acts like the processor we discussed in lecture and implements all forwarding possible (that wouldn't dramatically increase cycle times).

For the purpose of forwarding, when the stages are not split, we generally assume:

a value needed for a computation or storage access can only be used by a stage if it's computed or retrieved in the previous cycle

Similarly, for the split memory stages, assume:

for instructions that read from the data memory, the address to read must be computed in the cycle before the Memory 1 stage runs, and
the result of any memory read is only available to be used (e.g. after being forwarded) by other instructions in the cycle after the Memory 2 stage runs

Given this processor, if the addq performs its fetch stage during cycle 0, then during what cycle number will the rmmovq instruction finish its writeback stage?

Answer:

Key: 10

also gave credit for 11, since a relatively common interpretation of "result is only available to be used ... after the Memory 2 stage runs" was that you couldn't forward it during Memory 2. (My intention was that "e.g. after being forwarded" would suggest that forwarding the value only didn't count as a "use" for this purpose, but I see that was less clear than I thought...)

Question 5 (2.5 pt; mean 2.18)

Consider the following assembly snippet (where ... represents irrelevant instructions):

    addq %rax, %rcx
    subq %rcx, %rdx
    je foo
    xorq %rcx, %rdx
    ...
    ...
    ...
foo:
    irmovq $10, %rax /* A */
    irmovq $20, %rbx /* B */
    ...

where the je is not taken.

Suppose that we are executing the assembly snippet on a five-stage pipelined processor based on the design in lecture that:

uses forwarding to resolve data hazards to the extent possible without substantially increasing the cycle time, and
speculates that all conditional jumps will be taken, like we described in lecture, so the instructions labeled A and B will be fetched in the two cycles after the je is fetched and then squashed (discarded)
when a conditional jump turns out not to be taken, contrary to the processor's guess, fetches the correct instruction during the memory stage of the conditional jump instruction (the cycle after determining what address to fetch in the conditional jump's execute stage)

Which of the following is true about what happens when the above assembly executes? Select all that apply.

88%
⊤
when the addq's memory stage runs, the subq's execute stage is running
83%
⊤
when the xorq's fetch stage runs, the subq's writeback stage has not yet completed

xorq in fetch --> je in memory --> subq in writeback
17%
the value of %rdx will be forwarded from subq to xorq
93%
⊤
the value of %rcx will be forwarded from addq to subq
11%
the value of %rcx will be forwarded from addq to xorq

quiz for week 8

Suppose we are implementing a five-stage processor with a similar design to the one discussed in lecture, but sometimes we need to stall for one cycle because the output of the data memory needs an extra cycle to be retrieved.

Suppose the instruction triggering the stall is in the memory stage during cycle number 0, and needs to stay in the memory stage until cycle number 1 to complete the memory read. (For the purposes of this question, we say an instruction is in a stage when its values are being output from the corresponding pipeline registers.)

Complete the following statements about how the pipeline registers should behave.

Question 1 (0.5 pt; mean 0.35) (see above)

During cycle number 1, the pipeline registers between fetch and decode

69%
⊤
should output the same values they were outputting during cycle number 0
6%

should output the values for a nop
25%

should output values corresponding to the instruction that was fetched during cycle 0

Question 2 (0.5 pt; mean 0.31) (see above)

During cycle number 1, the pipeline registers between decode and execute

63%
⊤
should output the same values they were outputting during cycle number 0
14%

should output the values for a nop
23%

should output values corresponding to the instruction that was in the decode stage in cycle 0

Question 3 (0.5 pt; mean 0.31) (see above)

During cycle number 1, the pipeline registers between execute and memory

63%
⊤
should output the same values they were outputting during cycle number 0
26%

should output the values for a nop
11%

should output values corresponding to the instruction that was in the execute stage in cycle 0

Question 4 (0.5 pt; mean 0.27) (see above)

During cycle number 1, the pipeline registers between memory and writeback

9%

should output the same values they were outputting during cycle number 0
53%
⊤
should output the values for a nop
38%

should output values corresponding to the instruction that was in the memory stage in cycle 0

For the following two questions, consider a 4-block direct-mapped cache with 4 byte cache blocks. For each of the following two questions, assume the cache's contents are as follows:

index (in base 2)	valid bit	tag (in base 2)	data (hexadecimal, list of bytes, lowest address left-most)
00	1	001001	23 56 78 9A
01	1	001001	AA BB CC DD
10	1	000011	01 02 03 04
11	0	000000	00 00 00 00

For the following two questions, write down what the result of reading one byte from the specified addresses will be (assuming the cache has the contents listed above when the access occurs):

if the result will be a cache hit, write the value that will be read in hexadecimal (with or without a leading 0x)
if the result will be a cache miss, write the word miss.

(Note that addresses may have leading zeroes which are not written.)

Question 5 (1 pt; mean 0.94) (see above)

0x91

Answer:

Key: 56

Question 6 (1 pt; mean 0.94) (see above)

0x3B

Answer:

Key: /0?4/

For the following two questions, consider a 4-block 2-way set-associative cache with 4 byte cache blocks whose contents are as follows:

index (in base 2)	valid bit (way 0)	tag (in base 2) (way 0)	data (hexadecimal, list of bytes, lowest address left-most) (way 0)	valid bit (way 1)	tag (way 1)	data (way 1)
0	1	0010010	23 56 78 9A	1	0010011	AA BB CC DD
1	1	0000110	01 02 03 04	1	0000011	71 82 93 F3

For the following two questions, write down what the result of reading one byte from each of the specified addresses will be (assuming the cache has the contents listed above when the access occurs):

if the result will be a cache hit, write the value that will be read in hexadecimal (with or without a leading 0x)
if the result will be a cache miss, write the word miss.

(Note that addresses may have leading zeroes which are not written.)

Question 7 (1 pt; mean 0.93) (see above)

0x91

Answer:

Key: 56

Question 8 (1 pt; mean 0.96) (see above)

0x3B

Answer:

Key: miss

quiz for week 9

Question 2 (2 pt; mean 1.7)

Consider an 8KB 2-way set-associative cache with an LRU replacement policy and 64-byte blocks. (1KB = 1024 bytes.)

Suppose the cache is initially empty (all valid bits set to 0), then the program accesses 1 byte from each of the following addresses in the following order:

0x10005
0x40000
0x12400
0x10001
0x33320

Immediately after the accesses described above, give an example of an address which, if read using this cache, would cause something to be evicted from the cache:

Answer:

Key: /(?:0x)?0*(?!10|40)[1-9a-fA-F][0-9a-fA-F]*0[0-9A-Fa-f]{2}/

(key does not match some correct answers (and may incidentally match some incorrect ones); we'll go through and manually grade these over the next week)

Consider a two-set direct-mapped cache with 8B blocks and a write-allocate and writeback policy.

Suppose the cache is initially empty (all blocks invalid) and we perform the following accesses of single bytes:

read from 0x104
write to 0x101
write to 0x102
write to 0x108
read from 0x109
read from 0x10f
read from 0x110
read from 0x118
write to 0x119

Question 3 (1 pt; mean 0.87) (see above)

Will the read from 0x109 be a hit?

87%
⊤
yes
11%

no

Question 4 (1 pt; mean 0.91) (see above)

Will the read from 0x10f be a hit?

91%
⊤
yes
8%

no

Question 5 (2 pt; mean 1.69) (see above)

Which of these reads from the access pattern above will trigger a write to memory (or the next level of cache)? Select all that apply

17%
read from 0x109
9%
read from 0x10f
82%
⊤
read from 0x110
79%
⊤
read from 0x118

Question 6 (2 pt; mean 1.4)

Suppose we have a system with:

a 2-set direct-mapped cache with 8 byte cache blocks
4-byte ints (so each cache block can store 2 ints)

and run the following code:

int array[9];
...
int count1 = 0, count2 = 0, count3 = 0;
count1 += array[0];
count1 += array[3];
count1 += array[6];
count2 += array[1];
count2 += array[4];
count2 += array[7];
count3 += array[2];
count3 += array[5];
count3 += array[8];

Assuming that:

array[0] is assigned an address at the beginning of a cache block
the assembly the compiler generates for the above code does not reorder or omit data cache accesses
only accesses to array use the data cache, and
the cache is initially empty

how many data cache misses should we expect?

(Note that unlike the examples in lecture, the 9 accesses here are not evenly distributed across cache sets.)

Answer:

Key: 6

quiz for week 10

Consider the following 2 versions of C code:

/* Version 1 */
for (int i = 0; i < N; ++i) {
    for (int j = i; j < N; ++j) {
        C[j] += A[i*N+j] * B[j];
    }
}

/* Version 2 */
for (int j = 0; j < N; ++j) {
    for (int i = j; i < N; ++i) {
        C[j] += A[i*N+j] * B[j];
    }
}

Question 1 (1 pt; mean 0.95) (see above)

Which version has better temporal locality in accesses to C?

2%

version 1
95%
⊤
version 2
2%

they are about the same

Question 2 (1 pt; mean 0.95) (see above)

Which version has better spatial locality in accesses to A?

95%
⊤
version 1
2%

version 2
2%

they are about the same

Question 3 (2 pt; mean 1.66) (see above)

If N is 100000 and cache blocks can hold 8 elements of the array B, then we would expect approximately _____ cache misses for the accesses to B when running version 1 of the code above. (Choose the closest answer.)

Assume the cache is not large enough to hold 50000 elements of B (or A or C).

was originally miskeyed

Question 4 (2 pt; mean 1.75)

Consider the following assembly snippet:

addq %rax, %rbx
mrmovq 8(%rbx), %rax
subq %rbx, %rcx

In lecture, we discussed how an out-of-order processor may perform register renaming where it converts instructions from using architectural registers (the ones that appear in assembly) to physical registers (used internally in the processor). When doing this, the processor ensures that each version of an architectural register's value uses a different physical register, which aids in resolving hazards.

After an out-of-order processor performs register renaming on the above instructions as discussed in lecture, which of the following statements about the physical registers used by the renamed versions of the above instructions will be true?

88%
⊤
for %rbx's value, the renamed addq instruction will write the same physical register that mrmovq reads
17%
for %rax's value, the renamed mrmovq instruction will write the same physical register that addq reads
91%
⊤
for %rbx's value, the renamed subq instruction will read the same physical register that mrmovq reads

quiz for week 11

Question 2 (2.5 pt; mean 2.24)

In lecture, we discussed the use of multiple accumulators to improve performance in addition to simple loop unrolling. Consider the following pair of unrolled loops with and without use of multiple accumulators:

/* loop, without multiple accumulators transformation */
for (int i = 0; i < N; i += 4) {
    product = product * (a[i] * b[i]);
    product = product * (a[i+1] * b[i+1]);
    product = product * (a[i+2] * b[i+2]);
    product = product * (a[i+3] * b[i+3]);
}

/* loop, with multiple accumulators transformation */
for (int i = 0; i < N; i += 4) {
    product1 = product1 * (a[i] * b[i]);
    product2 = product2 * (a[i+1] * b[i+1]);
    product1 = product1 * (a[i+2] * b[i+2]);
    product2 = product2 * (a[i+3] * b[i+3]);
}
product = product1 * product2;

Whether this transformation would be helpful depends on the execution units the processor has.

Suppose the performance of the execution units that perform the multiplications (represented by the * operations in the C code above) is what determines the performance of the loop overall. To make it easier to reason about performance, assume the execution units that perform these multiplications are never involved in any of the address or index calculations needed by the loop above (so one only needs to consider how the multiplication instructions are dispatched and executed).

Given which configuration(s) of execution units to perform the multiplications could the loop's performance benefit from the multiple accumulators optimization shown above? Select all that apply.

5%
one multiplier which is not pipelined (does not accept new values to multiply until the current multiply is complete) and takes ten cycles to perform a multiplication
6%
one multiplier which takes one cycle to produce results
87%
⊤
ten multipliers, each of which takes one cycle to produce results
81%
⊤
one multiplier which is pipelined (accepts a new pair of values to multiply each cycle) and takes ten cycles to produce results
92%
⊤
ten multipliers, each of which is pipelined (accepts a new pair of values to multiply each cycle) and takes ten cycles to produce results

Question 3 (2 pt; mean 1.62)

Which of the following optimizations are likely to significantly decrease the number of times an instruction that accesses the data cache runs? Select all that apply.

84%
⊤
moving a strlen() call from a loop condition to outside the loop

inside the loop condition there are more calls to strlen(), and each call to strlen() needs to read the whole string
82%
⊤
using vector instructions to implement a loop that computes an array of values (replacing normal non-vector instructions)

a mov instruction for a 256-bit register moves more data at a time, so fewer memory-accessing instructions run
24%
unrolling a for loop that iterates an index variable i from 0 to 10000000 by increments of 1

loop index management probably doesn't use the data cache, and the loop body runs the same instructions as before
29%
changing loop orders to improve locality

(accepted either answer) generally doesn't, but might allow the compiler to keep something in a register that it couldn't otherwise (if it's now reused in the innermost loop) if it can show that aliasing isn't a concern

Question 4 (2 pt; mean 1.54)

Consider the following two C functions:

void all_pairs_products1(int N, int *A, int *result) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            result[i * N + j] = A[i] * A[j];
        }
    }
}

void all_pairs_products2(int N, int *A, int *result) {
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < N; ++i) {
            result[i * N + j] = A[i] * A[j];
        }
    }
}

(The two functions differ in their loop orders.)

A compiler cannot generate identical code for these two functions because in cases where A and result are pointers that refer to the same data (a problem we called "aliasing"), they can write different answers to the result array.

Which of the following are examples of calls to all_pairs_products1 which could result in different values in the array array than if the same call were made to all_pairs_products2 instead? Select all that apply.

81%
⊤
all_pairs_products1(1024, &array[0], &array[0])
21%
all_pairs_products1(1024, &array[0], &array[1024])

originally keyed incorrectly; only elements 0-1023 of A are used, so no overlap possible
64%
⊤
all_pairs_products1(1024, &array[1024], &array[0])

originally keyed incorrectly; both versions will end up reading values that were written into result by earlier iterations, but the two loop orders differ in which of the values read have already been overwritten
85%
⊤
all_pairs_products1(1024, &array[1024], &array[1024])
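
The four cases can be tried out at a smaller size. The following is a minimal sketch, assuming N = 4 stands in for 1024 and that the array starts out holding small distinct values; SMALL_N, LEN, and orders_differ are illustrative names. It runs each loop order on its own copy of the array and reports whether the final contents differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { SMALL_N = 4 };                           /* stand-in for the quiz's 1024 */
enum { LEN = SMALL_N + SMALL_N * SMALL_N };     /* room for result at offset SMALL_N */

static void all_pairs_products1(int N, int *A, int *result) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            result[i * N + j] = A[i] * A[j];
}

static void all_pairs_products2(int N, int *A, int *result) {
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            result[i * N + j] = A[i] * A[j];
}

/* Returns true if the two loop orders leave different array contents when
   called with A = &array[a_off] and result = &array[r_off]. */
bool orders_differ(int a_off, int r_off) {
    int x[LEN], y[LEN];
    assert(a_off >= 0 && a_off + SMALL_N <= LEN);
    assert(r_off >= 0 && r_off + SMALL_N * SMALL_N <= LEN);
    for (int k = 0; k < LEN; ++k)
        x[k] = y[k] = k + 2;                    /* arbitrary small starting values */
    all_pairs_products1(SMALL_N, &x[a_off], &x[r_off]);
    all_pairs_products2(SMALL_N, &y[a_off], &y[r_off]);
    return memcmp(x, y, sizeof x) != 0;
}
```

With these starting values, orders_differ(0, 0), orders_differ(SMALL_N, 0), and orders_differ(SMALL_N, SMALL_N) return true, while orders_differ(0, SMALL_N) returns false because A occupies only the first N elements and result starts right after them.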

quiz for week 12

Question 1 (2 pt; mean 1.6)

In lecture we discussed vector instructions (also known as SIMD instructions) where a single instruction can perform an operation on every pair of values in two vectors, which are typically fixed-size arrays stored in registers.

Using vector instructions similar to those used in the lab, which of the following code snippets is simplest to transform into a version that makes effective use of vector instructions?

/* loop A */
for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
        A[j*N + i] += A[i*N + j];
    }
}

/* loop B */
for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
        A[i*N + j] += B[j] * C[(i+1)*N + j];
    }
}

/* loop C */
for (int i = 2; i < N; ++i) {
    A[i] *= (A[i-1] + A[i-2]);
}

(Assume A, B, and C are independent arrays.)

9%

loop A

need to have a good way to load/store non-contiguous values from A (separated by N)
79%
⊤
loop B
10%

loop C

difficult to compute A[i] before having computed A[i-1], but we'd want to compute both of them at the same time with vector instructions

For the following questions, consider a system with 20-bit virtual addresses where virtual addresses are divided into a 16-bit page offset and a 4-bit virtual page number. (For example, virtual address 0x12345 has virtual page number 0x1 and page offset 0x2345.) Suppose the contents of the page table of this system are:

virtual page number	valid	physical page number
0x0	0	0x00
0x1	1	0x15
0x2	1	0x14
0x3	1	0x20
0x4	1	0x05
0x5	1	0x06
0x6	1	0x09
0x7	0	0x00
0x8	0	0x00
0x9	1	0x13
0xA	1	0x14
0xB	1	0x30
0xC	1	0x31
0xD	1	0x32
0xE	1	0x33
0xF	1	0x34

Question 5 (1 pt; mean 0.82) (see above)

Based on the page table above, when accessing the virtual address 0x30001, what physical address will be accessed? Write your answer as a hexadecimal number. If a fault (an exception) would occur, write "fault".

Answer:

Key: /(?:0[xX])?0*200001/

Question 6 (1 pt; mean 0.81) (see above)

Based on the page table above, when accessing the virtual address 0x5467F, what physical address will be accessed? Write your answer as a hexadecimal number. If a fault (an exception) would occur, write "fault".

Answer:

Key: /(?:0[xX])?0*6467F/
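
Both translations follow the same steps: split off the 4-bit virtual page number, look it up, and put the physical page number in front of the unchanged 16-bit offset. The following is a minimal sketch of that lookup using the page table above; struct pte and translate are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { VADDR_BITS = 20, OFFSET_BITS = 16 };

struct pte { bool valid; uint32_t ppn; };

/* Page table contents copied from the table above, indexed by VPN. */
static const struct pte page_table[16] = {
    {false, 0x00}, {true, 0x15}, {true, 0x14}, {true, 0x20},
    {true, 0x05},  {true, 0x06}, {true, 0x09}, {false, 0x00},
    {false, 0x00}, {true, 0x13}, {true, 0x14}, {true, 0x30},
    {true, 0x31},  {true, 0x32}, {true, 0x33}, {true, 0x34},
};

/* Translates a 20-bit virtual address. Returns false (a fault) if the
   page table entry is invalid; otherwise stores the result in *paddr. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    assert(vaddr < (1u << VADDR_BITS));
    uint32_t vpn = vaddr >> OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
    if (!page_table[vpn].valid)
        return false;
    *paddr = (page_table[vpn].ppn << OFFSET_BITS) | offset;
    return true;
}
```

For 0x30001 this yields 0x200001 and for 0x5467F it yields 0x6467F; an address in virtual page 0x7, such as 0x70000, faults.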

quiz for week 13

Suppose a system has:

20-bit virtual addresses
1024 byte pages
24-bit physical addresses
4 byte page table entries
a page table base pointer set to physical (byte) address 0x1000
a single-level page table structure

Question 1 (2 pt; mean 1.03) (see above)

Based on the above information, what is the address at which the page table entry for the virtual address 0x1000 is stored? (Note that page table entries are larger than one byte.) Write your answer as a hexadecimal number.

Answer:

Key: /(?:0[Xx])?1010/

Question 2 (2 pt; mean 1.26) (see above)

How large are page tables on this system (in bytes)?

Answer:

Key: 4096
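
Both answers come from the same split of the address: 1024-byte pages leave 10 offset bits, so the virtual page number is the remaining 10 bits. The following is a minimal sketch of the arithmetic; the function and constant names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

enum {
    VADDR_BITS = 20,
    PAGE_OFFSET_BITS = 10,   /* 1024-byte pages */
    PTE_BYTES = 4
};

static const uint32_t PAGE_TABLE_BASE = 0x1000;

/* Physical address of the page table entry for a virtual address. */
uint32_t pte_address(uint32_t vaddr) {
    assert(vaddr < (1u << VADDR_BITS));
    uint32_t vpn = vaddr >> PAGE_OFFSET_BITS;
    return PAGE_TABLE_BASE + vpn * PTE_BYTES;
}

/* Size of one single-level page table: one entry per virtual page. */
uint32_t page_table_bytes(void) {
    uint32_t entries = 1u << (VADDR_BITS - PAGE_OFFSET_BITS);
    return entries * PTE_BYTES;
}
```

Virtual address 0x1000 has virtual page number 4, so its entry is at 0x1000 + 4 * 4 = 0x1010, and the table holds 1024 entries of 4 bytes each, or 4096 bytes.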

Suppose a system has:

42-bit virtual addresses (maximum value 0x3FF FFFF FFFF), with a 30-bit virtual page number and a 12-bit page offset
30-bit physical addresses (maximum value 0x3FFF FFFF), with an 18-bit physical page number and a 12-bit page offset
three-level page tables, where 10 bits of the virtual page number are used for a lookup at each level
page table entries are four bytes

Question 3 (2 pt; mean 1.14) (see above)

When looking up the virtual address 0x012 3456 789A, if the page table base pointer contains the physical byte address 0x44 000, then what is the physical address of the first-level page table entry for 0x012 3456 789A? (Note that page table entries are larger than one byte.) Write your answer as a hexadecimal number.

Answer:

Key: /(?:0[xX])?0*44048/

due to an editing error, was originally keyed assuming this was talking about the second level of lookup, which disagreed with the question
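
The first-level index is the top 10 bits of the 30-bit virtual page number, which are bits 32 through 41 of the virtual address. The following is a minimal sketch of extracting the index for any level and computing the entry's address; level_index and pte_address are illustrative names, with level 1 taken to mean the first lookup.

```c
#include <assert.h>
#include <stdint.h>

enum { OFFSET_BITS = 12, INDEX_BITS = 10, LEVELS = 3, PTE_BYTES = 4 };

/* Index used at the given level (1 = first-level table, 3 = last). */
uint64_t level_index(uint64_t vaddr, int level) {
    assert(level >= 1 && level <= LEVELS);
    int shift = OFFSET_BITS + INDEX_BITS * (LEVELS - level);
    return (vaddr >> shift) & ((1u << INDEX_BITS) - 1);
}

/* Physical address of the entry, given the base address of that level's table. */
uint64_t pte_address(uint64_t table_base, uint64_t vaddr, int level) {
    return table_base + level_index(vaddr, level) * PTE_BYTES;
}
```

For 0x012 3456 789A the first-level index is 0x12, so the entry is at 0x44000 + 0x12 * 4 = 0x44048.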

Question 4 (2 pt; mean 1.13) (see above)

Suppose that when looking up page table entries for the virtual address 0x012 3456 789A:

the page table entry for the first-level was valid and contained physical page number 0x99,
the page table entry for the second-level was valid and contained physical page number 0xA4, and
the page table entry for the third-level was valid and contained physical page number 0xC3

Based on this information, when a program attempts to read data from virtual address 0x012 3456 789A, at what physical address will that data be found? Write your answer as a hexadecimal number. If not enough information is provided, write "unknown" and explain what information is missing in the comments.

Answer:

Key: /(?:0[xX])?0*[Cc]389[aA]/
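
Only the third-level entry's physical page number appears in the final address; the first- and second-level entries just locate the next table. The following is a minimal sketch of the last step, with data_physical_address as an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

enum { OFFSET_BITS = 12, PPN_BITS = 18 };

/* Combines the physical page number from the last-level page table entry
   with the page offset taken from the virtual address. */
uint32_t data_physical_address(uint32_t last_level_ppn, uint64_t vaddr) {
    assert(last_level_ppn < (1u << PPN_BITS));
    uint32_t offset = (uint32_t)(vaddr & ((1u << OFFSET_BITS) - 1));
    return (last_level_ppn << OFFSET_BITS) | offset;
}
```

With physical page number 0xC3 and page offset 0x89A, the result is 0xC389A.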

Question 5 (2 pt; mean 1.34) (see above)

If this system had a 64-entry, 8-way TLB, that TLB would use 3 index bits. How many tag bits would it use?

Answer:

Key: 27
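
A TLB caches translations keyed by virtual page number, so the index and tag bits together make up the 30-bit virtual page number and the page offset is not involved. The following is a minimal sketch of the arithmetic, assuming the entry and way counts are powers of two; the function names are illustrative.

```c
#include <assert.h>

/* log2 of a power of two. */
static unsigned log2_exact(unsigned x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        ++bits;
    }
    return bits;
}

/* Number of index bits: log2 of the number of sets (entries / ways). */
unsigned tlb_index_bits(unsigned entries, unsigned ways) {
    assert(ways != 0 && entries % ways == 0);
    return log2_exact(entries / ways);
}

/* Tag bits are whatever is left of the virtual page number. */
unsigned tlb_tag_bits(unsigned vpn_bits, unsigned entries, unsigned ways) {
    return vpn_bits - tlb_index_bits(entries, ways);
}
```

Here 64 entries in 8 ways gives 8 sets and 3 index bits, leaving 30 - 3 = 27 tag bits.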