F2021 CS 3330 quizzes key

Question 1 (4 pt; mean 3.06)

In lecture, we discussed the idea of a processor that was connected to memory and I/O devices using a shared memory bus. Which of the following statements are true about how the processor using this memory bus would work? Select all that apply.

50%
when supplying an address to the memory, the processor also must indicate to the memory whether it is retrieving a data for operation (such as a value form the stack) or an instruction's machine code

nothing the memory can do with this information
16%
the processor will usually retrieve the machine code for an instruction from memory at the same time as it retrieves any data in memory that instruction operations on

can't know what to retrieve until it interprets the machine code
78%
⊤
(credit given for any answer)
when a key is pressed, the processor would receive information about the keypress using the memory bus

keyboard connected to processor via memory bus, so only way to get this info; [edited by CR 2021-09-06:] the textbook uses different terminology than we used in lecture (calling the connection between the processor and the I/O bridge the system bus and the connection between the I/O bridge and the memory the memory bus), which caused some students confusion. I think this question is entirely unambiguous about what terminology it was using, so the correct answer was clear. However, in light of the confusion, I've decided to drop this quesiton.
71%
⊤
to execute an instruction like addq %rax, %rax, the processor only needs to contact the memory bus to retrieve the machine code for the instruction and not to retrieve the value for %rax.

%rax is not stored in memory

For the following questions, consider the following AT&T syntax x86-64 assembly instruction:

movq (%rax, %rbx, 4), %rcx

And assume that, just before this instruction executes, registers have values as follows:

`%rax`	`0x1000`
`%rbx`	`0x8000`
`%rcx`	`0x79`
`%rsp`	`0xF00000`

Question 3 (2 pt; mean 1.78) (see above)

When the above instruction is executed, it will use a memory address. What memory address will it use? Write your answer as a hexadecimal number.

Answer:

Key: 0x21000

0x1000 + 0x8000 * 4

Question 4 (3 pt; mean 2.73) (see above)

When the above instruction executes successfully, it will ____.

Question 5 (4 pt; mean 3.66)

Consider the following assembly snippet:

addq %rbx, %rbx
addq %rbx, %rbx
addq %rax, %rbx
movq (%rbx), %rbx

Which of the following assembly snippets would result the same values in %rax and %rbx as the above assembly snippet? Select all that apply.

87%
⊤
leaq (, %rbx, 4), %rbx
movq (%rbx, %rax), %rbx
85%
⊤
movq (%rax, %rbx, 4), %rbx
1%
leaq (%rax, %rbx), %rbx
leaq (%rax, %rbx), %rbx
leaq (%rax, %rbx), %rbx

never reads from memory
95%
⊤
leaq (%rax, %rbx, 4), %rbx
movq (%rbx), %rbx

addq %rbx, %rbx --> rbx = initial rbx * 2; addq %rbx, %rbx --> rbx = initial rbx * 4; addq %rax, %rbx --> rbx = initial rbx * 4 + rax; movq (%rbx), %rbx --> rbx = memory[initial rbx * 4 + rax]

quiz for week 2

For each of the following assembly snippets, identify which are possible values of the ZF and SF condition codes.

Question 1 (2 pt; mean 1.8) (see above)

    cmpq $0, %rax
    jl skip
    imulq $-1, %rax
skip:
    testq %rax, %rax

21%
ZF = 0, SF = 0
94%
⊤
ZF = 1, SF = 0
95%
⊤
ZF = 0, SF = 1

I should have been more explicit that I only cared about the value when the snippet finishes, but I think/hope that's the only sensible interpretation. The condition code's final value is determined by the testq insturction. (How the cmpq or imulq do or do not set condition codes does not matter.) If rax is initially positive, itu becomes negative when the testq instruction runs; if rax is initially negative, it remains negative (the jl skips the imulq); if rax is initially zero, it remains zero.

Question 2 (2 pt; mean 1.86) (see above)

    movq $0, %rax
    cmpq %rax, %rax
    leaq 0x12345678(,%rax,4), %rax

11%
ZF = 0, SF = 0
91%
⊤
ZF = 1, SF = 0
1%
ZF = 0, SF = 1

the last leaq won't change the flags

Question 3 (5 pt; mean 4.83)

Consider the following C snippet:

while (rax >= 1 && rax < 10) {
    rax = rax - r8;
}

and the following assembly snippets:

/* snippet A */
loop:
    cmpq $0, %rax
    jle done
    cmpq $10, %rax
    jge done
    subq %r8, %rax
    jmp loop
done:

/* snippet B */
    jmp start
loop:
    subq %r8, %rax
start:
    cmpq $1, %rax
    jl done
    cmpq $10, %rax
    jl start
done:

/* snippet C */
loop:
    cmpq $1, %rax
    cmpq $10, %rax
    jl done
    jge done
    subq %r8, %rax
    jmp loop
done:

/* snippet D */
loop:
    cmpq $1, %rax
    jl done
    cmpq $11, %rax
    jg done
    subq %r8, %rax
    jl loop
done:

Assuming the variable rax is stored in the register %rax and the variable r8 is stored in the register %r8, which snippets are correct assembly translations of the loop? Select all that apply.

97%
⊤
snippet A
6%
snippet B

infinite loop
2%
snippet C

second cmp overwrites first
2%
snippet D

11 should be 9

For the following questions, consider static linking using object files as discussed in lecture.

Suppose one converts the following two assembly files into an executable (where ... represents some code that is not shown, and .asciz is an assembly directive to generate a nul-terminated ASCII string, and .global LABEL is a directive that causes the assembler to ensure that a specified label LABEL can be referenced from other files):

Assembly file main.s:

.text
.global main
main:
    mov $0, %r11
    mov $10, %r10
loop:
    mov $percent_d, %rdi
    mov %r11, %rsi
    call printf
    incq %r11
    decq %r10
    jge loop
    mov $0, %rax
    ret

.data
.global percent_d
percent_d:
    .asciz "%d"

Assembly file printf.s:

.text
.global printf
printf:
    movb (%rdi), %al
    ...
    ...
    ret

.data
...

Question 4 (4 pt; mean 3.33) (see above)

The object file generated from main.s will include metadata (outside of any machine code) specifying ____. Select all that apply.

13%
the address at which the string "%d" is located in memory when the executable runs

not known until the executable is generated, at least
77%
⊤
the location within the object file generated from main.s at which the string "%d" appears

in the symbol table entry for percent_d
90%
⊤
the location within the object file generated from main.s at which the machine code corresponding to percent_d in mov $percent_d, %rdi appears

in a relocation table entry
21%
the location at which the dec %r10 instruction appears in the machine code

not needed for anything the linker needs to do

Question 5 (4 pt; mean 2.59) (see above)

In the object files generated from the assembly files above, a symbol table entry for printf would most likely ____. Select all that apply.

61%
include the location (either in memory or in an object file) of the machine code for the call printf instruction

relocation table would have this information
49%
⊤
include the location (either in memory or in an object file) of the machine code for the movb (%rdi), %al instruction

where the printf label points
9%
include the location of a ret instruction

not specified in assembly, not needed to resolve call instruction
19%
specify that printf's first argument is a string

not specified in assembly, not needed to resolve call instruction

quiz for week 3

Question 1 (4 pt; mean 3.36)

If x and y are 64-bit signed integers between -100000000000 and 100000000000 (on a C implementation that uses two's complement and implements signed right shift using arithmetic shift and where other constants are 64-bits), then which of the following C expressions are always true? Select all that apply.

25%
((x + y) >> 22) == ((x >> 22) + (y >> 22))

it may at first seem that this translates to (x + y) / (2 to the 12th power) ?= x / (2 to the 20th power) + y / (2 to the 10th power), which is mathematically true, but that doesn't hold if one takes into account rounding. For example, if x and y are ((1 << 22) - 1), then (x >> 22) and (y >> 22) are 0, so the right-hand side is 0, but the left-hand side is 1.
88%
⊤
((~x >> 12) & 0xFFFFFFFF) == ((~(x >> 12)) & 0xFFFFFFFF)
20%
((x ^ 0xFFFF) & 0xCFF0) == (((x & 0x4FF0) ^ 0xCFF0) | (~x & 0x8000))

was originally miskeyed; meant to write ^ 0x4FF0 on the right hand side
7%
(((x | 0xF000) & 0xFFF0) ^ (y & 0x8008)) == (0xF000 | (x & 0x0FF0) | ((y ^ 0x8008) & 0x8008))

if x = 0 and y = 8, then the left-hand side is 0xF000 ^ 8 = 0xF008, and the right-hand-side is (0xF000 | 0 | 0x8000) == 0xF000

Question 2 (4 pt; mean 2.59)

Consider the following C function:

long foo(long x) {
    return (x >> 40) + x;
}

Assume this runs on a system where:

signed longs are implemented with 64-bit two's complement integers
right shift on signed longs is implmented using arithmetic shift

If overflow does not occur, what is an argument x that will result in foo returning the most negative number possible? (Overflow includes adding two negative numbers and getting a positive number or adding two positives to get a negative. It does not including right shift producing a rounded result for division by a power of two) Write your answer as a hexadecimal number.

Answer:

Key: -0x7FFFFFFFFF800000 (preferred) or 0x8000000000800000

Originally, I made an off-by-one-error in my calculations (by neglecting to consider that x needs to be close enough to the minimum long to cause shifting by 40 to give -1<<23, not -1<<22), this has now been corrected and the key adjusted accordingly. Sorry for the confusion.

result is floor(x / pow(2,40)) + x, which seems to minimized by -pow(2,63) (the lowest possible return value, since the return value must be a 64-bit two's complement integer), but if x is -pow(2, 63), this will wrap around (since (any negative value) + (-pow(2, 63)) is out of range for a 64-bit two's complement integer. Let's suppose x is -pow(2, 63) + K where K < pow(2, 40) (so it vanishes from the floor(x / pow(2, 40) term). Then x >> 40 will be -(1 << (63-40)) == -(1 << 23). Then, if we let K be (1<<23) to cancel this out the overall result -(1 << 23) - pow(2,63) + (1<<23) = -pow(2, 63), which is the minimum possible value.

Note that since this x achieves a return value which is the smallest possible 64-bit two's complement integer, so no lower result is possible, even if overflow is allowed to occur. The prohibition on overflow was meant to exclude values of x which were very close to pow(2,63)-1; it would have been better phrased "the most negative possible provided that overflow does not occur". If you assumed that we could somehow represent out-of-range values for the result of foo, then the true result of -pow(2,63) >> 40 + -pow(2,63) would be more negative.

Question 3 (6 pt; mean 5.61)

Which of the following are generally properties of an instruction set architecture (ISA) rather than properties of a microarchitecture? Select all that apply.

99%
⊤
the number of operands an instruction can take

affects machine code, assembly languages
60%
⊤
(credit given for any answer)
the number of bits in a general-purpose register

affects what instructions like add do; "general-purpose" register meant to specifically specify registers available for use from assembly, but this should've been made clearer
88%
⊤
whether a call instruction stores the return address on the stack or in a register
87%
⊤
how many bytes are necessary for a call instruction's machine code
94%
⊤
how many bytes a mov instruction copies from memory into a register
7%
whether using the multiplication instruction to multiply by 3 is faster than adding twice

doesn't change whether either version of the code is functional

quiz for week 4

Question 5 (4 pt; mean 3.59)

Consider the following HCLRS code snippet (where ... represents some omitted code):

register iO {
    a : 64 = ...;
    b : 64 = ...;
}

i_b = O_a + i_a;
i_a = O_a + O_b + 1;

If during cycle 1, O_a is 3 and O_b is 4, then the value of O_a during cycle 3 will be ____. Write your answer as a base-10 number. If not enough information is provided write unknown.

Answer:

Key: 20

order of assignments doesn't matter. cycle 1: i_a = 3 + 4 + 1 = 8, i_b = 3 + 8 = 11; cycle 2: i_a = 8 + 11 + 1 = 20 (becomes O_a during cycle 3), i_b = 8 + 20 = 28 (not relevant);

quiz for week 5

Consider the single-cycle processor design we discussed in lecture (built using the components we discussed in lecture that executes every instruction in one cycle) executing the instruction:

mrmovq 10(%rax), %rbx

Question 1 (2 pt; mean 1.46) (see above)

While the address 10(%rax) is being computed for the above instruction, the program counter register will output ____.

4%

the address of the first byte of the instruction that was run just before the above instruction
73%
⊤
the address of the first byte of the above instruction's machine code
2%

the address of the constant 10 in the above instruction's machine code
18%

the address of the byte immediately following the above instruction's machine code
1%

none of the above (explain in comments)

Question 2 (2 pt; mean 1.32) (see above)

A value from the data memory is read for this instruction ____.

3%

around the same time as the instruction is read from the instruction memory
19%

around the same time as values are read from the register file
64%
⊤
after the instruction is read from the instruction memory but well before the next rising edge of the clock cycle
12%

around the time of the rising edge of the clock cycle that occurs after the instruction is read from the instruction memory
1%

none of the above (explain briefly in comments)

Question 3 (2 pt; mean 1.64) (see above)

A value is written to the register file for this instruction ___.

2%

around the same time as the instruction is read from the instruction memory
6%

around the same time as values are read from the data memory
8%

after the instruction is read from the instruction memory but well before the next rising edge of the clock cycle
81%
⊤
around the time of the rising edge of the clock cycle that occurs after the instruction is read from the instruction memory
2%

none of the above (explain briefly in comments)

Question 4 (4 pt; mean 3.09)

Consider the single-cycle processor design we discussed in lecture (built using the components we discussed in lecture).

Suppose one wanted to add support for a new rrradd REG1, REG2, REG3 instruction which would take three register operands and compute the sum of the first two registers and store the result in the third register. For example if %rax contained 3 and %rbx contained 4, then running rrradd %rax, %rbx, %rcx would write 7 to %rcx.

Which of the following is true about adding this instruction to a Y86 and the single-cycle processor design? Select all that apply.

31%
it would require adding one or more additional read ports (a new reg_srcX input and reg_outputX output) to the register file
16%
it would require adding one or more additional write ports (new reg_dstX and reg_inputX inputs) to the register file
21%
it would require adding a new ALU operation
76%
⊤
it would require placing register numbers in a place in the machine code where no other instruction has a register number

Consider the single-cycle processor design we discussed in lecture (built using the components we discussed in lecture).

Suppose one wanted to add support for a new push2q REG1, REG2 instruction which would be equivalent to pushq REG1 and pushq REG2 but be completed by one instruction (and in one cycle).

Question 5 (3 pt; mean 2.05) (see above)

The layout of the machine code for the above instruction would most likely be most similar to the encoding for:

18%
⊤
pushq REG

not the intended answer; also accepted this on the argument we could change only the "F" placeholder for rB in the encoding and share the icode sanely
26%

mrmovq REG, OFFSET(REG)

meant to write rmmovq to agree with operand order, but both rmmovq and mrmovq are not correct (because the question is about machine code, not implementation)
4%

popq REG
50%
⊤
rrmovq REG, REG

same fields available
0%

irmovq $CONSTANT, REG

Question 6 (4 pt; mean 2.64) (see above)

As part of adding this instruction, one would expect the data memory to be modified to ____. Select all that apply.

76%
⊤
accept 128 bits as its value input (HCLRS mem_input) instead of 64
33%
accept 128 bits as its address input (HCLRS mem_addr) instead of 64
22%
provide 128 bits as its value output (HCLRS mem_output) instead of 64
42%
⊤
add a new control signal (in addition to write-enable and read-enable signals mem_readbit and mem_writebit`) that would change the behavior of the data memory

to control whether the data memory writes 128 bits or 64 bits

quiz for week 6

Consider a seven-stage pipelined processor with a 100 ps cycle time. Suppose this processor is executing instructions A, B, C, D, E, ... in order. (Assume no pipeline stalls, etc. are involved.)

Question 1 (3 pt; mean 2.81) (see above)

Instruction D will finish its first stage _____ ps after instruction A finishes its first stage.

Answer:

Key: 300

Question 2 (3 pt; mean 2.75) (see above)

Instruction B will finish its seventh stage ____ ps after instruction A finishes its first stage.

Answer:

Key: 700

Consider building a pipeline processor similar to our textbook's design Suppose the times required for components are as follows:

pipeline registers (including PC) -- 10 ps (the "register delay")
instruction memory read -- 175 ps
PC increment computation -- 20 ps
register file read -- 125 ps
register file write -- 150 ps
ALU -- 120 ps
data memory read or write -- 175 ps

Question 4 (4 pt; mean 3.47) (see above)

Suppose instead of constructing a 5-stage processor, like our textbook proposes and we discussed in lecture, one were to construct a 4-stage processor by merging two of the stages we would use in the textbook's 5-stage design. Based on the timings above, what two stages would make the most sense (in terms of expected performance) to merge?

5%

fetch and decode
86%
⊤
decode and execute
3%

execute and memory
6%

memory and writeback

quiz for week 7

For the following questions, consider a four-stage pipelined processor built using the design we discussed in lecture but with the following stages:

fetch
decode
execute
memory and writeback

(That is, the memory and writeback stages are combined.)

Suppose this processor uses forwarding to resolve data hazards. If forwarding alone is insufficient, the processor combines forwarding with a minimum amount of stalling.

Question 1 (3 pt; mean 2.46) (see above)

Which of the following assembly snippets would exercise a data hazard if executed on the processor described above? Select all that apply.

10%
addq %rax, %rbx
addq %rcx, %rdx
subq %rcx, %rax
mrmovq (%rcx), %rcx
87%
⊤
mrmovq (%rcx), %rax
subq %rcx, %r8
addq %rdx, %rax
66%
⊤
popq %rax
nop
addq %rsp, %rcx

Question 2 (2 pt; mean 1.75) (see above)

On the processor described above, the addq instruction in the following assembly snippet would finish ___ cycles after the subq instruction finishes?

subq %rcx, %rax
addq %rdx, %rax

Answer:

Key: 1

Question 3 (2 pt; mean 1.83) (see above)

On the processor described above, the subq instruction in the following assembly snippet would finish ___ cycles after the mrmovq instruction finishes?

mrmovq (%rcx), %rax
subq %rcx, %rax

Answer:

Key: 2

Question 4 (5 pt; mean 4.29) (see above)

Consider the following assembly snippet:

addq %r8, %r9
subq %r9, %r10
irmovq $42, %r9
rmmovq %r9, 0(%r10)
rmmovq %r8, 8(%r10)

When this is executed on the processor described above, one would expect which of the following forwarding operations to occur? (It's possible that not all of the neceessary forwarding operations are included among the options below.) Select all that apply.

98%
⊤
the value of %r9 from addq to subq
20%
the value of %r9 from addq to irmovq

does not read %r9
83%
⊤
the value of %r10 from subq to the first rmmovq
6%
the value of %r10 from subq to the second rmmovq
26%
the value of %r10 from the first rmmovq to the second rmmovq

For the following questions, consider the following assembly snippet:

        pushq %rax 
        addq %rcx, %rdx   
        jle foo          
        xorq %rdx, %rdx                  
        andq %r9, %r10       
        subq %r11, %r12
        nop
foo:    rrmovq %rdx, %r8        
        irmovq $10, %r13        
        rmmovq %r14, 0(%r15)

For the questions below, consider a six-stage processor, similar in design to the one discussed in lecture, with the following stages:

fetch
decode
execute part 1
execute part 2
memory
writeback

In this processor, the result of arithmetic (such as performed by the addq or xorq instructions) or of evaluating the condition codes to determine if a conditional jump is taken is not available until near the end of the execute part 2 stage. In particular, a conditional jump that was not predicted correctly will not be able to fetch the instruction that follows the jump instruction until the jump instruction is in the memory stage.

Assume the processor uses forwarding and, where necessary, stalling to resolve data hazards. Control hazards are resolved as described in each question below.

Question 5 (4 pt; mean 2.44) (see above)

Suppose the six-stage processor described above predicted all jumps as taken and the jle instruction above was actually not taken. Then the xorq instruction would retrieve the value of %rdx in its decode stage

61%
⊤
from the register file directly
14%

by having it forwarded from addq's execute 2 stage
11%

by having it forwarded from addq's memory stage
13%

by having it forwarded from addq's writeback stage

Question 6 (4 pt; mean 2.48) (see above)

Suppose the six-stage processor predicted all jumps as taken and the jle instruction above was actually not taken. Then the xorq instruction would complete its decode stage while a ____ instruction completed its writeback stage.

quiz for week 8

Suppose a five-stage processor was executing during cycle N:

instruction E in the fetch stage,
instruction D in the decode stage,
instruction C in the execute stage,
instruction B in the memory stage
instruction A in the writeback stage

and that the processor wants to stall instruction C and squash instructions B and A, so during the following cycle N+1:

instruction E in the fetch stage,
instruction D in the decode stage,
instruction C in the execute stage,
instruction B is running a nop (a pipeline "bubble") *[edit after quiz: should have written "the memory stage is running a nop"]
instruction A is running a nop (a pipeline "bubble") *[edit after quiz: should have written "the writeback stage is running a nop"; also to squash instruction A, we need to something in addition to adjusting the pipeline registers]

Question 1 (7 pt; mean 6.47) (see above)

If this processor were built using the bubble_X and stall_X signals we discussed in lecture, then as part of achieving this effect, the processor should set ___. Select all that apply.

Suppose we have a four-stage pipelined processor with the following stages:

fetch
decode
execute and memory
writeback

Suppose, unlike the processor we've discussed in lecture, this processor uses branch prediction which predicts are conditional jumps as not taken. When a conditional jump is mispredicted, the processor detects the misprediction during the corresponding jump instruction's execute stage, and retrieves the correct instruction in the following cycle.

Suppose the processor is executing the following assembly snippet:

        xorq %rax, %rax
        jle foo
        andq %rcx, %rdx
        pushq %rdx
foo:    addq %r8, %r9
        subq %r9, %r10
        xorq %r10, %r11

When the above assembly is run, the pushq instruction is fetched as a result of the misprediction.

Question 2 (3 pt; mean 2.56) (see above)

If this processor were built using the bubble_X and stall_X signals described in lecture, the pushq instruction will be squashed _____.

4%

by setting bubble_F when pushq is in the fetch stage
85%
⊤
by setting bubble_D when pushq is in the fetch stage
6%

by setting bubble_D when pushq is in the decode stage
3%

by setting bubble_E when pushq is in the decode stage
1%

by setting bubble_E when pushq is in the execute and memory stage
0%

by setting bubble_W when pushq is in the execute and memory stage

Question 3 (3 pt; mean 2.54) (see above)

If this processor were built using the bubble_X and stall_X signals described in lecture, the andq instruction will be squashed _____.

1%

by setting bubble_F when andq is in the fetch stage
3%

by setting bubble_D when andq is in the fetch stage
5%

by setting bubble_D when andq is in the decode stage
85%
⊤
by setting bubble_E when andq is in the decode stage
4%

by setting bubble_E when andq is in the execute and memory stage
2%

by setting bubble_W when andq is in the execute and memory stage

For each of the pairs below, select which one is expected to exhibit more temporal locality.

Question 4 (2 pt; mean 1.85) (see above)

Pair one.

6%

accesses made when iterating through a large array stored on the stack
92%
⊤
accesses to the machine code of a small, frequently used function
2%

the temporal locality is about the same

I should have been specific that option A was meant to refer to the array accesses and not the combination of array accesses (not good temporal locality) and machine code accesses (good temporal locality), but I think B is the better answer regardless.

Question 5 (2 pt; mean 1.41) (see above)

Pair two.

70%
⊤
accesses made while repeatedly looking up a particular key in a hashtable
6%

accesses while repeatedly linear searching an array for a particular key
24%

the temporal locality is about the same

quiz for week 9

Suppose a direct-mapped cache with a write-allocate, write-back policy has the following contents:

set index	valid	dirty	tag	value (as list of bytes in hexadecimal, lowest address (offset) first)
0	1	0	0x000	00 22 33 44 55 FF AA CC
1	0	-	-	-
2	1	0	0x002	11 22 33 44 55 66 77 88
3	1	1	0x002	99 AA BB CC 9A BC DE 00
4	1	0	0x003	23 34 56 00 33 44 55 88
5	1	0	0x003	00 00 00 11 22 33 44 55
6	1	0	0x002	00 33 44 55 99 11 33 55
7	0	-	-	-
8	0	-	-	-
9	1	1	0x002	56 78 45 67 34 23 12 01
10	1	1	0x002	FF FF FF 99 FF FF FF 88
11	1	1	0x002	FF FF FF 77 FF FF FF 66
12	0	-	-	-
13	1	0	0x000	AA BB CC DD 00 00 FF 00
14	1	0	0x000	00 00 00 01 00 00 00 02
15	1	0	0x001	03 00 00 00 04 00 00 00

Question 1 (4 pt; mean 3.59) (see above)

When this cache is used to access the address 0x123, what tag would be used for that access? Write your answer as a hexadecimal number.

Answer:

Key: 0x2

Question 2 (4 pt; mean 3.91) (see above)

When reading a 2-byte, little-endian value from an address with tag 0x3, set index 3, and (block) offset 0x3, the result will be? Write your answer as a hexadecimal number. If this access would be a cache miss, write miss.

Answer:

Key: miss

tag mismatch

Question 3 (4 pt; mean 3.55) (see above)

Using the above cache, to read an address with which of the following attributes would replace a dirty value (and therefore require writing out the dirty value to memory or the next level of cache)? Select all that apply.

88%
⊤
tag 0x1, (set) index 3, (block) offset 0x2
20%
tag 0x2, (set) index 3, (block) offset 0x3
10%
tag 0x1, (set) index 13, (block) offset 0x1
3%
tag 0x0, (set) index 0, (block) offset 0x0

Suppose a 2-way set-associativate cache with an LRU (least recently used) replacement policy and a write-no-allocate, write-through policy has the following contents:

	way 0			way 1
set index	valid	tag	value (as list of bytes in hexadecimal, lowest address (offset) first)	valid	tag	value	LRU way
0	1	0x01	01 23	1	0x02	34 56	0
1	1	0x01	01 23	1	0x02	34 56	1

Question 4 (4 pt; mean 3.61) (see above)

Suppose one reads a byte from address 0x13 using this cache, which is a cache miss. As a result of the miss, a block of the cache will be replaced. Which one?

4%

set index 0, way 0
2%

set index 0, way 1
4%

set index 1, way 0
90%
⊤
set index 1, way 1

100[1][1] --> set index 1, tag 100 (binary) = 4

Question 5 (3 pt; mean 1.94) (see above)

Suppose a program writes a 2-byte value to address 0x00 on a system using this cache as its data cache. Which of the following statements is true about how the write will be performed?

65%
⊤
the write will occur without the cache's contents changing, and will modify main memory or the next level of cache
22%

the write will replace the contents of one of the existing cache blocks as well as being immediately sent to the main memory or the next level of cache, and the prior contents of the replaced cache block will be discarded
12%

the write will replace the contents of one of the existing cache blocks, and the prior contents of the replaced cache block will be sent to main memory or the next level of cache
1%

none of the above

quiz for week 10

Suppose a direct-mapped cache with a write-allocate, write-back policy has the following contents:

set index	valid	dirty	tag	value (as list of bytes in hexadecimal, lowest address (offset) first)
0	1	0	0x000	00 22 33 44 55 FF AA CC
1	0	-	-	-
2	1	0	0x002	11 22 33 44 55 66 77 88
3	1	1	0x002	99 AA BB CC 9A BC DE 00
4	1	0	0x003	23 34 56 00 33 44 55 88
5	1	0	0x003	00 00 00 11 22 33 44 55
6	1	0	0x002	00 33 44 55 99 11 33 55
7	0	-	-	-
8	0	-	-	-
9	1	1	0x002	56 78 45 67 34 23 12 01
10	1	1	0x002	FF FF FF 99 FF FF FF 88
11	1	1	0x002	FF FF FF 77 FF FF FF 66
12	0	-	-	-
13	1	0	0x000	AA BB CC DD 00 00 FF 00
14	1	0	0x000	00 00 00 01 00 00 00 02
15	1	0	0x001	03 00 00 00 04 00 00 00

Question 1 (4 pt; mean 3.47) (see above)

Suppose the one-byte value 77 (written as hexadecimal) is written using this cache using an address which has a (set) index of 4, a tag of 0x4, and a (block) offset of 0x0. Which of the following is true about the effects of this accesses? Select all that apply.

18%
it will result in a write to memory of the data byte 77

not write-through
12%
it will result in a write to memory of data bytes other than 77

replaced block is not dirty
60%
⊤
(credit given for any answer)
after the access completes, the first byte stored in the value field of set index 4 will be 77 instead of 99.

meant to write 23 here
24%
after the access completes, the second byte stored in the value field of set index 4 will remain 34

Consider a system with two-level cache hierarchy where:

the first-level cache has a hit time of 2 cycles
the shared second-level cache has a hit time of 10 cycles
main memory has an access time of 200 cycles

Suppose a workload is measured to have

a 80% first-level cache hit rate
a 90% second-level cache hit rate

Question 2 (2 pt; mean 1.79) (see above)

What percentage of first-level cache accesses require a main memory access? Write your answer to the nearest whole number of percent.

Answer:

Key: 2%

Question 3 (3 pt; mean 2.5) (see above)

Assume that when there is a miss in a cache, the cache uses its hit time to determine that it is a miss, and then starts the access to the next level of the memory hierarchy. For example, this would mean that an access that misses in the first-level cache and hits in the second level would take 12 cycles to complete.

Given this, what is the average memory access time of the program, in cycles, rounded to the nearest tenth of a cycle?

Answer:

Key: 8.0

The following question ask about what data cache misses occur for a particular cache and a particular piece of C code. For those questions assume:

that only accesses to array use the data cache specified
that the data cache is initially empty (all invalid)
the address array[0] is a multiple of 2 to the 20th power (so array[0] is at the beginning of a cache block and has set index 0 in almost any cache)
accesses in the omitted code labelled ... are irrelevant to the question
the compiler compiles the loops so that the accesses to array are not reordered and does not omit any accesses to array

Question 4 (4 pt; mean 1.54) (see above)

Consider the following C code:

unsigned char array[280];
...
for (int i = 0; i < 10; ++i) {
    for (int j = 0; j < 280; ++j) {
        array[j] += 1;
    }
}

With a direct-mapped 256 byte cache with 64 byte blocks and a write-allocate, write-back policy, how many cache misses will occur? (Note that unlike the examples in lecture, these accesses are unevenly distributed across the sets of the cache.)

Answer:

Key: 23

10 misses each for 0-63 and 256-320; 3 misses for 64-128, 128-192, 192-256

Question 5 (4 pt; mean 1.95) (see above)

Consider the following C code:

unsigned char array[2500];
...
for (int i = 0; i < 2500; i += 1000) {
    array[i] += 1;
}
for (int i = 0; i < 2500; i += 990) {
    array[i] += 1;
}

How many cache misses would occur with a 1KB (1024 byte) fully-associative cache with 64 byte cache blocks and an LRU replacement policy and a write-allocate, write-back policy?

Answer:

Key: 4

first loop: 0 (miss loading 0-63); 1000 (miss loading 960-1023); 2000 (miss loading 1984-2047); second loop: 0 (hit); 990 (hit); 1980 (miss loading 1920-1983);

For the following questions, consider the following C code:

int A[1024 * 1024], B[1024 * 1024], C[1024 * 1024];

void foo1() {
    for (int i = 0; i < 1024; ++i) {
        for (int j = 0; j < 1024; ++j) {
            A[i + j] = B[j * 4] * C[i * 1024 + j];
        }
    }
}

void foo2() {
    for (int i_outer = 0; i_outer < 1024; i_outer += 16) {
        for (int j = 0; j < 1024; ++j) {
            for (int i = i_outer; i < i_outer + 16; ++i) {
                A[i + j] = B[j * 4] * C[i * 1024 + j];
            }
        }
    }
}

Question 6 (1 pt; mean 0.48) (see above)

The temporal locality in A is better for ____.

5%

foo1
47%
⊤
foo2
47%

it's about the same for both

Question 7 (1 pt; mean 0.88) (see above)

The temporal locality in B is better for ____.

6%

foo1
88%
⊤
foo2
6%

it's about the same for both

Question 8 (1 pt; mean 0.91) (see above)

The spatial locality in C is better for ____.

91%
⊤
foo1
5%

foo2
3%

it's about the same for both

Question 9 (1 pt; mean 0.83) (see above)

The temporal locality in C is better for ____.

7%

foo1
10%

foo2
83%
⊤
it's about the same for both

quiz for week 11

For the following questions, consider the following assembly code:

        movq $100, %rax             /* 1 */
begin_loop:     
        movq (%r8, %rax, 8), %r10   /* 2 */
        movq (%r9, %rax, 8), %r11   /* 3 */
        subq %r11, %r10             /* 4 */
        jg not_negative             /* 5 */
        neg %r10                    /* 6 */
not_negatve:
        addq %r10, %r13             /* 7 */
        subq $1, %rax               /* 8 */
        jg begin_loop               /* 9 */

(The neg (negate) instruction is equivalent to multiplying a value by negative 1.)

Question 1 (4 pt; mean 2.32) (see above)

If one wanted to optimize the above loop by unrolling, the unrolled version of the loop would need to include an extra copy of (perhaps with changes to offsets, etc.) instructions labelled ____.

meant to write not_negative: on label

Question 2 (4 pt; mean 2.65) (see above)

If one wanted to optimize the above loop to use multiple accumulators, the best candidate of an accumulator that should be split into two accumulators is ___.

Consider the following assembly snippet:

addq %r8, %r9
subq %r8, %r10
imulq %r9, %r10
andq %r9, %r8
xorq %r11, %r12

Question 3 (6 pt; mean 5.86) (see above)

Consider an out-of-order processor that uses register renaming to help handle hazards, In this scheme, the renaming step might convert the above instructions into three-operand versions with physical registers (filling in the numbered blanks shown below):

addq ___ (1), ___ (2) -> ___ (3)
subq ___ (4), ___ (5) -> ___ (6)
imulq ___ (7), ___ (8) -> ___ (9)
andq ___ (10), ___ (11) -> ___ (12)
xorq ___ (13), ___ (14) -> ___ (15)

In this renamed version, which of the following pairs of blanks would refer to the same physical register? Select all that apply.

Question 4 (4 pt; mean 3.51) (see above)

Suppose an out-of-order processor has two arithmetic execution units, each capable of performing any arithmetic operation, and they can perform:

an addition, subtraction, or bitwise and operation in one cycle
a multiplication operation in five cycles

What is the minimum number of cycles the arithmetic for the above assembly take on this processor? Do not include time needed to rename instructions, perform forwarding, register reading and writeback, etc.

Answer:

Key: 6

forgot to mention the actual time for xor, but if it's the same as and (which certainly seems most likely): addq/subq in parallel (1 cycle); imulq (5 cyles) in parallel andq + xorq (2 cycles)

quiz for week 12

For the following question, consider the following C code:

void foo(int N, int *A, int *B, int *C) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            C[i] += A[j*N+i] + B[i+j];
        }
    }
}

Question 1 (3 pt; mean 1.91) (see above)

Since A might point to part of the same array as B or C, the compiler cannot easily perform which of the following optimizations (without adding some check for whether A overlaps with B or C)? Select all that apply.

49%
unrolling the loop with index j
48%
⊤
swapping the i and j loops to improve cache locality
91%
⊤
performing cache blocking for the accesses to A and B

Question 2 (3 pt; mean 2.5) (see above)

Which of the following statements are true about the effects of unrolling the inner loop in the code above? Select all that apply.

10%
it is likely to make foo run more quickly when the argument N is 1 or 2
94%
⊤
it is likely to increase the amount of space foo will consume in the instruction cache
34%
it is likely to increase the number of instruction cache accesses

fewer instructions executed -> fewer instruction cache accesses

Question 3 (4 pt; mean 4)

Consider the following C functions:

int foo1(int *x, int *y, int *z) {
    for (int i = 0; i < 16; ++i) {
        z[i] = x[i] + y[i];
        z[i+16] = x[i] - y[i];
    }
}

int foo2(int *x, int *y, int *z) {
    for (int i = 0; i < 16; ++i) {
        z[i] = x[i] + y[i];
    }
    for (int i = 0; i < 16; ++i) {
        z[i+16] = x[i] - y[i];
    }
}

A compiler cannot generate the same code for these functions because of aliasing.

Which of the following would be sufficient conditions to gaurentee that foo1 and foo2 would have equivalent results? Select all that apply. (Assume that pointer comparisons compare the addresses of pointers and produce consistent usables results even if the two pointers point to parts of different objects.)

3%

x + 16 < y || x > y + 16
8%

(x + 16 < y || x > y + 16) && (x + 16 < z || x > z + 32)
6%

x + 16 < z || x > z + 32
83%
⊤
(credit given for any answer)
(x + 16 < z || x > z + 32) && (y + 16 < z || y > z + 32)

dropped because wasn't marked as select-all-that-apply in quiz system

quiz for week 13

Question 1 (4 pt; mean 3.38)

When two processes are sharing a single-core processor, usually ____. Select all that apply.

87%
⊤
they have different values for their general purpose registers (like x86-64's %rax through %r15)
22%
whichever process is currently active on the processor core will handle any input

input could be for a different program
28%
if one process has a valid pointer containing the address 0x123456, then the other cannot also have a valid pointer containing the address 0x123456
88%
⊤
(credit given for any answer)
if processor is writing back the result of one of one process's instructions to a register, then it is not also writing back the result of the one of other process's instructions

this is generally true even in OOO processors, but I don't think we covered this enough detail for you to know

Question 2 (4 pt; mean 3.98)

When a program attempts to read from a memory location that the program is not permitted to access, the processor will usually ____.

1%

switch to another program that's running on the system

[gave credit for this answer] might happen as a result of B; done by operating system code, not something hard-wired in the processor
27%
⊤
lookup the location of an operating system function in a table and jump to that location
71%

print the message "Segmentation fault" to the terminal

[gave credit for this answer] might happen as a result of B; done by operating system code, not something hard-wired in the processor; also very often operating systems do not do this (not Unix-like; program not running in a terminal)
0%

return an all-zeroes value to the program instead of the value it was trying to read

I had intended this question to primarily be about what the hardware does versus what the software does. A and C are things that would be done because the operating system has code to do that (and different operating system make different choices about what to do); B is something the hardware would almost always do.

Consider a system with 12-bit virtual addresses, 11-bit physical addresses, and 256-byte pages. On this system, page offsets are therefore 8 bits, virtual page numbers are 4 bits and physical page numbers are 3 bits.

Suppose a process's page table is as follows:

virtual page #	valid?	physical page #
0x0	0
0x1	1	0x2
0x2	1	0x4
0x3	0
0x4	0
0x5	0
0x6	0
0x7	1	0x3
0x8	1	0x5
0x9	1	0x6
0xa	0
0xb	0
0xc	0
0xd	0
0xe	0
0xf	1	0x0

Question 5 (2 pt; mean 1.81) (see above)

If the process attempts to read from address 0x742, what physical address will it read from? Write your answer as a hexadecimal number. If an exception will happen instead, write fault.

Answer:

Key: 0x342

Question 6 (2 pt; mean 1.82) (see above)

If the process attempts to write to address 0x187, what physical address will it write to?

Write your answer as a hexadecimal number. If an exception will happen instead, write fault.

Answer:

Key: 0x287

quiz for week 14

Consider a system with 12-bit virtual addresses, 11-bit physical addresses, and 256-byte pages.

Suppose a process's page table is as follows:

virtual page #	valid?	write allowed?	user mode allowed?	physical page #
0x0	0
0x1	1	1	1	0x2
0x2	1	1	1	0x4
0x3	0
0x4	0
0x5	0
0x6	0
0x7	1	1	1	0x3
0x8	1	0	1	0x5
0x9	1	0	1	0x6
0xa	0
0xb	0
0xc	0
0xd	0
0xe	0
0xf	1	1	0	0x0

Question 1 (4 pt; mean 3.18) (see above)

Suppose the system using the page table shown above requires that exception handlers be located at a virtual address while processes are running. Based on the page table above, give an example of a virtual address at which the OS might have placed an exception handler. Write your answer as a hexadecimal number.

Answer:

Key: anything in VPN 0xf (most sensible) or VPN 0x8 or 0x9

Consider a system with:

32-bit virtual addresses
32-bit physical addresses
4096 byte pages
page tables stored in memory as an array
4-byte page table entries, which, when interpreted as a 4-byte integer consist of:
- a physical page number (stored in the 20 most significant bits)
- 11 unused bits, and
- a valid bit (stored in the least significant bit)
a current value for the page table base register of 0x88000, which is a physical byte address

Question 3 (4 pt; mean 2.8) (see above)

What physical address will the processor access to look up a page table entry for virtual address 0x12345? (Remember to account for each page table entry being 4 bytes.) Write your answer as a hexadecimal number. If not enough information is given, write unknown.

Answer:

Key: 0x88048

Question 4 (4 pt; mean 2.95) (see above)

Suppose the page table entry retrieved for virtual address 0x12345 has the value 0x88800001 (when interpreted as a 4-byte integer). What physical address will contain the value stored at virtual address 0x12345? If an exception would occur instead, write fault.

Answer:

Key: 0x88800345

quiz for week 15

In lecture, we mentioned a scheme where the operating system can allocate memory on demand. The following questions are about how that scheme might be implemented.

Question 1 (3 pt; mean 1.78) (see above)

In order to configure a memory region to be allocated on demand, the operating system will set page table entries ____.

55%
⊤
corresponding to the virtual addresses of the memory region as invalid
12%

corresponding to the virtual addresses of the memory region as valid, but read-only
22%

corresponding to the physical addresses that will be allocated for the memory region as invalid
6%

corresponding to the physical addresses that will be allocated for the memory region as valid, but read-only
3%

none of the above

Question 2 (3 pt; mean 2.73) (see above)

When an access is made that requires memory allocation, the allocation will most likely occur

5%

when the program makes a system call after the access fails
91%
⊤
when an exception handler is run as a result of accessing an memory address with no translation in the page table
2%

during the processor's page table lookup
0%

when the processor squashes instructions that were mispredicted as a result of the access

Consider a system that uses a two-level page table structure with 8192-byte pages and 1024 page table entries per table (at each level), where each page table entry is 8 bytes.

Suppose the processor is running with a page table base pointer containing physical (byte) address 0x44440000. When looking up the physical address for virtual address 0x12345678, the processor determines that the relevant first-level page table entry (that is, the one in the table pointed to by the page table base pointer) was marked valid and contained the physical page number 0x1111, and the relevant second-level page table entry was marked valid and contained the physical page number 0x4321.

Question 3 (4 pt; mean 2.62) (see above)

During the first-level of the lookup described above, from what physical address did the processor read the first-level page table entry? Write your answer as a hexadecimal number.

Answer:

Key: 0x44440120

0x44440000 + (0x12345678 >> 13 >> 10) * 8 (13 to remove the page offset, 10 to remove the second-level's part of the virtual page number, times 8 for the page table entry size)

Question 4 (4 pt; mean 2.26) (see above)

During the lookup described above, what was the final physical address the processor determined? (That is, what physical address would the program read from when it read from virtual address 0x12345678.) Write your answer as a hexadecimal number.

Answer:

Key: 0x8643678

(0x4321 << 13) | (0x12345678 & 0x1FFF) --- (0x4321 << 13) to shift the physical page number to its location in the address, then or in the lower 13 bits to copy the page offset from the original virtual address

Suppose a system has:

a two-level page table structure with 16384-byte pages (below the "first-level page table" is the page table in this structure which is located via the page table base pointer)
4096 page table entries per level
38 bit virtual addresses
4-byte page table entries
36 bit physical addresses
a 32-entry 2-way set associative TLB

If on this system, the processor looks up virtual address 0x52345, and:

reads a first-level page table entry at index 0x0 in the first-level page table located at physical address 0x800000; then
reads a second-level page table entry at index 0x14 in the second-level page table located at physical address 0x50000
determines (using the second-level page table entry) that the physical address corresponding this virtual address is 0x5a345

Question 5 (3 pt; mean 1.93) (see above)

After the processor performs the lookup described above, it will store information in the TLB at what set index?

Answer:

Key: 0x4

Question 6 (3 pt; mean 1.88) (see above)

After the processor performs the lookup described above, it will store what tag in the corresponding TLB entry?

Answer:

Key: 0x1