Answer each of the following questions. This exam is open-book and open-notes, but you may use only resources that were created before this exam was released (at noon eastern time on 10 December 2020). You may not collaborate with other students.
Please show your work for questions in the comment field were applicable so we are able to give you partial credit.
If you think a question is ambiguous or unclear, please make your best guess about what was meant and explain what you did in the comments field for the question. We are unlikely to be able to answer your inquiries during the exam time.
For the following questions, consider this assembly snippet:
movq $0, %rax /* line 1 */
movl $0x12345678, %r9d /* line 2 */
start_loop: /* line 3 */
movl 0(%rbx,%rax,8), %r8d /* line 4 */
xorl %r9d, %r8d /* line 5 */
movl %r8d, 0(%rbx,%rax,8) /* line 6 */
addq $1, %rax /* line 7 */
cmpq $100, %rax /* line 8 */
jne start_loop /* line 9 */
Assuming:
%rbx
represents an int*
(pointer to int) named array
%rax
represents a long
named i
write C code or pseudocode using a while
or for
loop that is equivalent to the assembly loop:
Which of the following transformations of the above loop would result in an unrolled loop which writes the same values to memory but executes fewer instructions? Select all that apply.
Which of the following changes would not change what values the loop writes to memory? Select all that apply.
When line 9 executes for the second time, the values of SF
and ZF
will be:
Suppose we were generating an object file for the following Y86-64 snippet with a relocation table and symbol table to permit linking the object file along with other object files to generate an executable:
example_function3:
irmovq $100, %rsi
call example_function1
rrmovq %rax, %rsi
call example_function2
ret
Relocation table entries in the resulting object file would likely contain a symbol (or label) name and a location in the machine code. Which of the following would be likely relocation table entries in the resulting object file? Select all that apply.
For the following questions, consider a six-stage pipelined Y86-64 processor implementation that uses the following stages:
Everything but the execute stages work as in the five-stage pipelined processor we implemented (including branch prediction that predicts all branches as taken and forwarding). For the execute stages:
Consider the following assembly snippet executing on the procesor described above
addq %rcx, %rdx /* line 1 */
xorq %rcx, %rdx /* line 2 */
mrmovq 8(%rdx), %rax /* line 3 */
rmmovq %rbx, 16(%rax) /* line 4 */
subq %rax, %rbx /* line 5 */
If the addq
instruction on line 1 runs its fetch stage during cycle number 0, then during what cycle number
will the subq
instruction on line 5 run its writeback stage?
Consider the following assembly snippet executing on the procesor described above
xorq %r8, %r9 /* line 1 */
je after /* line 2 */
mrmovq 8(%r10), %r11 /* line 3 */
after:
rmmovq %r12, 16(%r13) /* line 4 */
subq %rax, %rbx /* line 5 */
Suppose %r8
is initially 15 and %r9
is initially 20.
If the the xorq
instruction on line 1 runs its fetch stage during cycle number 0, then during what cycle number
will the subq
instruction on line 5 run its writeback stage?
Consider the following assembly snippet executing on the processor described above:
addq %r8, %r9 /* line 1 */
addq %r9, %r8 /* line 2 */
nop /* line 3 */
nop /* line 4 */
nop /* line 5 */
mrmovq 8(%r8), %r10 /* line 6 */
subq %r9, %r10 /* line 7 */
Which of the following forwarding must occur to execute the above correctly (with minimum stalling)?
For the following questions, consider the process of adding a pcaddq REG1
instruction
that would jump to valP
plus the value of REG1. For example, given the assembly snippet:
pcaddq %rax
nop
if the nop
instruction was stored at address 0x1000
and %rax
had the value 0x24
, then
the next instruction executed would be 0x1024
.
Which of the following encoding would be possible and most appropriate for avoiding adding additional inputs to MUXes or additional MUXes to the single-cycle Y86-64 procesor design we discussed in lecture?
(In each of the encodings , values of each byte are provided with the most significant bits written first (left-most).)
When this instruction is executing, the data memory will
When this instruction is executing, one of the register file's destination register number inputs (reg_dstE
and reg_dstM
) will be REG_NONE
(the constant 0xF
) and the other will be:
Suppose a five-stage pipelined processor has a cycle time of 800 ps. When running a particular benchmark program, 10% of the instructions require exactly one cycle of stalling to resolve hazards and 5% of the instructions require exactly two cycles of stalling to resolve hazards, and no instructions require any more cycles of stalling. By changing to a six-stage design, one can reduce the cycle time but 15% (instead of 10%) of the instructions in the benchmark program would require exactly one cycle of stalling (and 5% would still require exactly two cycles of stalling). For this change to make the benchmark program run at least as fast as on the five-stage processor, about what is the minimum cycle time it should have? Write your answer as a base-10 number of picoseconds, rounded to the nearest picosecond. You may assume the benchmark program has a very large number of instructions (so, for example, the amount of time spent waiting for the first instructions of the benchmark to finish is negligible).
Consider a system with:
(For this questions KB = 1024 bytes, and B = 1 byte.)
The following questions consider the operation of reading a 4-byte little endian value from the virtual address 0x12345678
on this
system. (Note that some leading zeroes of the virtual address might not be written.)
For each of the questions, assume the corresponding physical (byte) address is 0xABCDD678
.
What is the size of virtual addresses on this system in bits?
If accessing the value requires a TLB lookup, what set index of the TLB will be accessed?
If accessing the value requires a page table lookup and the third-level page table
(that is, the last one used in the multi-level lookup)
used in that lookup starts at physical (byte) address 0x400000
, what would the physical address of the
corresponding (third-level) page table entry be? (Write your answer as a hexadecimal number.)
If accessing the value requires a page table lookup and the third-level page table
(that is, the last one used in the multi-level lookup)
used in that lookup starts at physical (byte) address 0x400000
, then what physical page number was
in the second-level page table entry accessed as part of this lookup?
(Write your answer as a hexadecimal number.)
If accessing the value results in a hit in the L1 data cache, and the data in the corresponding block in the L1 data cache is (written as hexadecimal, lowest address first):
01 23 45 67 02 34 56 78
03 34 56 78 04 56 78 9A
04 56 78 9A 05 67 89 AB
06 78 9A BC 07 89 AB CD
then what 4-byte value will be read? Assume values are read in little endian and write your answer as a hexadecimal number. (Like a hexadecimal constant in C or Python, your hexadecimal number should have the one's place written right-most.)
Write a C function
int couldHaveBeenEvicted(unsigned long p)
that takes in a physical address p
,
and returns true if and only if the value at p
could be evicted from the L2 cache as a result
of an accessing the value at physical address 0xABCDD678
.
(Since we're asking whether whether it could have been
evicted, you can assume the value at p
is part of the least recently used block in the set, or
any other assumptions that would make eviction more likely.)
Your C function must not use standard library functions
(but you can definite temporary variables, use all the C arithmetic and bitwise operators, etc.)
Suppose a system:
If a program's page tables are setup so that the only valid pages are the ones with virtual page number 0x00001, 0x00002, 0x00003, 0xFFDC0, 0xFF0C1, 0xFF0C2, 0xFF0C3, and 0xFFFFF (all hexadecimal page numbers), then what is the minimum of number of second-level page tables used to do this? (Count only second-level page tables; do not count the first-level page table.)
For the following questions, consider a 64KB, 4-way set associative cache with 128-byte blocks, a random replacement policy, and write-back, write-allocate policy on a system with 64-bit physical addresses. (1KB = 1024 bytes)
Suppose the following 1-byte accesses occur in this cache in the given order, and the cache is empty before these accesses occur:
(Note that not all leading zeros of addresses are written out.)
Which of the following statements are true about the results of this access pattern?
In addition to 64KB of storage for data, the cache needs storage for metadata. How much metadata storage, in bits, will it need? Write your answer as a base-10 number. Remember to show your work in the comments field.
Consider the following C function:
void foo(int *result, int *vector1, int *vector2, int N) {
for (int i = 0; i < N; ++i) {
int temp = vector1[i];
for (int j = 0; j < N; ++j) {
result[j * N + i] = temp * vector2[j];
}
}
}
For the following questions, assume:
result
, vector1
, and vector2
shown, and does not reorder
or omit or combine any reads or writes;result
, vector1
, and vector2
are pointers to non-overlapping arrays, and each of the pointers is a multiple of 16384 (that is, two to the 14th power)int
s are 4 bytes;Assuming a 16KB, 4-way data cache with 4B blocks,
would a write-allocate or write-no-allocate policy be better for the function foo
above?
Assuming a 32KB, 8-way data cache with 8B blocks and a write-back, write-allocate policy,
write code or pseudocode for a modified version of the function foo
that has improved cache performance
(fewer cache misses) but computes the same values.
For the following questions, consider the following x86-64 assembly snippet:
movq $3, %rax
imulq %r8, %rax
imulq %r9, %rax
imulq %r10, %rax
addq %rax, %rcx
movq $3, %rax
imulq %r11, %rax
addq %rax, %rcx
imulq REG1, REG2
multiplies the value of REG1 and REG2 together and stores the result in REG2.
Suppose the above assembly snippet is run on an out-of-order processor with the following execution units:
Compute the minimum number of cycles the processor described above would take to complete
the arithmetic in the assembly snippet above.
Do not include time needed for fetching instructions, executing the movq
instructions,
forwarding values, or reading/writing registers. Show your work in the comments field.
Rewrite the above assembly snippet, using only movq
, two-argument imulq
, and addq
instructions,
to reduce the time required for arithmetic on the processor described above. Your rewritten assembly snippet
must produce the same result in %rcx
. (We do not care if other registers values end up being the same.)
We discussed how virtual memory can be used to allow physical memory to be used as a cache for slower storage like hard drives and solid state disks. When this happens and a program tries to access a value in virtual memory that is not loaded into physical memory, which of the following is likely to happen as a result of this access? Select all that apply.
The abstraction of process gives programs the illusion of a dedicated machine, including a dedicated processor and main memory, even though the underlying may be shared between processes. As part of implementing processes by sharing a single processor and its memory over time, the operating system will change which process is active on the processor. Which of the following is true about this procedure?
Suppose a system has:
and when a miss occurs in the L1 or L2 cache, the cache must wait until the hit time completes before it accesses the next level of cache or main memory (so the total access time on a miss is the sum of the hit time and the next-level access time).
On a benchmark program the L1 cache experiences a 90% hit rate and the L2 cache experiences a 50% hit rate. What was the average memory access time experienced by the benchmark program in cycles? (Round your answer to the nearest tenth of a cycle. You may assume all the memory accesses in the benchmark program go through the L1 cache.)
For the following questions, consider the following circuit:
Assume:
If the registers X
, Y
, and Z
initially store 1
, then what value will X
store after 2 rising edges of
the clock signal?
Suppose the 64-bit registers have a 50 ps register delay and the adders take 500 ps to produce a stable result. What is the minimum cycle time for the circuit? Write your answer as a base-10 number of picoseconds.