Question 1: When statically linking an executable, which of the following operations are done by the linker?
Select all that apply
Question 2: Suppose the register %rax contains 0x100 and the register %rbx contains 0x10. What value is stored in %rcx as a result of movq 2(%rax,%rbx,4), %rcx?
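For reference, a minimal C sketch of the addressing-mode arithmetic this question exercises, assuming the usual x86-64 rule that D(base,index,scale) refers to the address D + base + index*scale (the variable names here are illustrative only):
#include <stdio.h>
int main(void) {
    unsigned long rax = 0x100, rbx = 0x10;
    unsigned long addr = 2 + rax + rbx * 4;   /* effective address referenced by 2(%rax,%rbx,4) */
    printf("effective address = 0x%lx\n", addr);
    /* the movq then copies the 8 bytes stored at this address into %rcx */
    return 0;
}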
Question 3: Consider the following C code:
long foo(long x, long y) {
    while (x >= 0) {
        y -= 2;
        x += y;
    }
    return x;
}
Which of the following are valid translations of it to assembly?
(You may ignore what the function does in cases of integer overflow or underflow.)
Select all that apply
Question 4: What is the value of foo after the following C code runs?
int foo[5] = {1, 3, 5, 7, 9};
int *p = &foo[1];
p += 1;
*p -= 1;
p += 2;
*p -= 1;
p -= 1;
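As a reminder of the pointer-arithmetic rule this question exercises, here is a separate illustrative sketch (not the question's array): adding 1 to an int * advances it by one element, i.e. by sizeof(int) bytes.
void pointer_example(void) {
    int a[3] = {10, 20, 30};
    int *q = a;   /* points at a[0] */
    q += 2;       /* advances by two ints; q now points at a[2] */
    *q -= 1;      /* a[2] becomes 29; a[0] and a[1] are unchanged */
}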
Question 5: What are the possible values of the condition codes ZF and SF after the following is executed?
leaq (%rax, %rbx), %rcx
subq %rax, %rcx
subq %rbx, %rcx
Select all that apply
Question 1: In order to tell the mnemonic (e.g. addq, rrmovq, jle) for a Y86-64 instruction from its machine code, one needs to know:
Select all that apply
Question 2: Which of the following are possible versions of the addq instruction on Y86-64? (The answer is different than for x86-64.)
Select all that apply
Question 3: The textbook claims that each of the following attributes is typical of a RISC instruction set.
Which of them is true about Y86-64?
Select all that apply
Question 1: Which of the following are parts of ISA (as opposed to microarchitecture)?
Select all that apply
Question 2: RISC processors are typically described as having:
Select all that apply
Question 3: What is the operation performed by mrmovq in Y86-64?
Question 4: Given the declaration int array[100], which of the following C expressions are always true?
Select all that apply
Question 1: If x is an int, then the result of x & 0 in C is
Question 2: As described in section 4.2.5 of our textbook, if the (non-clock) input to a register changes from 11 to 99 and remains at 99 indefinitely, then the output of the register will change to 99
Question 3: In the register file described in section 4.2.5 of our textbook, if one of the read ports of the register file changes, then the corresponding output signal will change (to the value of the register being read)
Question 4: In the data memory described in section 4.2.5 of our textbook, when the "write" signal is set to 1, then the value stored in memory will change based on the "data input" and "address" inputs
Question 5: In the processor design described in the textbook, which of the following instructions do not use the result of the ALU?
Select all that apply
Question 1: If x and y are signed 4-byte integers whose absolute value is less than 9 million, which of the following C expressions are always true?
Select all that apply
Question 2: Given a signed integer x, which C expression results in that integer with the 11th through 16th (inclusive) least significant bits (bits 10-15 if counting the least significant bit as bit 0) set to 1?
Select all that apply
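A sketch of the general masking idiom this question is probing, shown for a different bit range so it is not one of the answer choices: OR-ing with a mask forces the masked bits to 1 and leaves the rest unchanged.
/* sets bits 2 and 3 (counting the least significant bit as bit 0) */
int set_bits_2_and_3(int x) {
    return x | (0x3 << 2);
}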
Question 3: The addq instruction in our processor adds two values from registers and writes the result back to the destination register. Which of the following components does it use in the processor?
Select all that apply
Question 4: The nop instruction updates the PC with some value. What is the correct value?
Question 1: Given the case expressions described in section 4.2.3 of our book, for which of the following values of x and y is the result of the following expression 100?
[
    x > 90 : x + y,
    y < 50 : x - y,
    1 : x + x,
]
Select all that apply
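For reference, a C sketch of the case-expression semantics assumed here (per section 4.2.3, the guards are tested in order and the first true guard selects the result):
long case_expression(long x, long y) {
    if (x > 90) return x + y;
    if (y < 50) return x - y;
    return x + x;   /* the final guard, 1, always matches */
}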
Question 2: Suppose you are writing an HCLRS program and discover that values in memory are changing while running an instruction that is not supposed to update memory at all. Which of the following are likely causes of this?
Select all that apply
Question 3: The SEQ processor design uses conceptual stages. Which of the following stages are used by the addq instruction?
Select all that apply
Question 4: The popq instruction pops a value from the top of the stack and moves it to a register. Which of the following statements is true?
Question 1: Consider the following HCLRS code:
register xY {
    a : 32 = 0;
    b : 32 = 0;
}
x_a = Y_a + Y_b;
x_b = Y_b + 1;
Assume we call the cycle in which Y_a and Y_b take on their initial value, 0, cycle number 1. Given this cycle numbering, what is the value of Y_a during cycle number 3?
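A small C simulation sketch of the register-transfer semantics this question relies on: the register outputs Y_a and Y_b take on the values of their inputs x_a and x_b at the start of each new cycle.
#include <stdio.h>
int main(void) {
    unsigned y_a = 0, y_b = 0;          /* register outputs during cycle 1 */
    for (int cycle = 1; cycle <= 3; ++cycle) {
        printf("cycle %d: Y_a=%u Y_b=%u\n", cycle, y_a, y_b);
        unsigned x_a = y_a + y_b;       /* register inputs computed during this cycle */
        unsigned x_b = y_b + 1;
        y_a = x_a;                      /* become the outputs at the start of the next cycle */
        y_b = x_b;
    }
    return 0;
}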
Question 2: Consider the following HCLRS code:
reg_inputE = mem_output;
Which of the following is this snippet most likely to help accomplish?
Question 3: Our textbook divides its single-cycle processor design into six stages. What is true about these stages?
Question 4: Which stage is responsible for generating the data memory address?
Question 1: Consider the following Y86-64 code:
rrmovq %rax, %rbx
subq %rbx, %rcx
irmovq $10, %rcx
jle foo
xorq %rcx, %rbx
Which of the following statements about data dependencies in this code are true?
Select all that apply
Question 2: Adding more pipeline stages to a processor usually increases
Select all that apply
Question 3: Consider a pipelined processor that executes an mrmovq instruction by computing the address in one cycle, then reading the data memory from this address in the next cycle. To pass the address between these two cycles, the processor will
Question 4: When deciding on how to divide instruction execution into two pipeline stages, which of the following is the best strategy?
Question 1: Consider choosing between several options for designing a pipelined processor based on a single cycle design with a cycle time of 10 nanoseconds. Which of these options is likely to have the best performance?
Question 2: Suppose one wants to find the processor design which will get the best performance on a benchmark program that takes many hours to run. From our possible processor designs, we should choose the one with
Question 3: To execute the mrmovq D(rB), rA instruction in the pipeline design we described in lecture, which of the following values must appear in the pipeline registers between the execute and memory stages?
Select all that apply
Question 4: Consider a five-stage pipelined processor with the following pipeline stages:
Assume this processor reads registers during the middle of the decode stage of an instruction and writes registers at the end of the writeback stage. If the processor resolves hazards with stalling only, how many cycles of stalling will it require to execute the following assembly correctly?
addq %rcx, %rdx
subq %rcx, %rax
xorq %rax, %rdx
Question 1: With branch prediction, the instruction fetched immediately after a conditional jump instruction
Question 2: Forwarding
Select all that apply
Question 3: Forwarding alone can't resolve the load/use hazard our textbook describes because
Question 4: A program has good temporal locality if
Select all that apply
Question 5: A program has good spatial locality if
Select all that apply
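A generic illustration of the spatial-locality concept (not tied to any particular answer choice): consecutive iterations here touch adjacent array elements, so most accesses fall in a cache block that was just brought in.
long sum_sequential(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i];   /* adjacent elements share cache blocks */
    return s;
}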
Consider the following assembly snippet:
mrmovq (%rax), %rcx
addq %rcx, %rax
irmovq $0x1234, %rsp
popq %rcx
ret
Question 1: (see above) To execute this on a five-stage pipelined processor like the one described in lecture with minimum stalling, what values will be forwarded?
Select all that apply
Question 2: (see above) Consider a three-stage pipelined processor with the following stages:
How many cycles of stalling will this three-stage processor require to execute the above snippet?
Consider the five-stage pipelined processor described in lecture with forwarding and branch prediction that assumes all branches are taken. Suppose this processor executes the following assembly snippet in which the branch is not taken:
popq %rax
andq %rax, %rax
jle foo
mrmovq %rcx, %rcx
(Note: this was the question as written, but I meant to write mrmovq (%rcx), %rcx for the last instruction.)
Question 3: (see above) How many cycles of stalling will be required to execute these instructions on this processor?
Question 4: (see above) The mrmovq will perform its execute stage while
Question 5: To perform a pipeline stall -- that is, keeping an instruction from advancing to the next stage in the pipeline while letting the instructions in later stages proceed as usual -- using the stall and bubble signals provided by HCLRS, one needs to
Question 1: In order to determine if a request at address X hits/misses the cache, it is necessary to access:
Select all that apply
Question 2: The set index bits are part of which component?
Question 3: Increasing the size of a cache without changing its block size or the number of blocks per set will usually decrease
Select all that apply
Question 4: In section 6.3.1, our book divides cache misses into compulsory misses, conflict misses, and capacity misses. Switching from a direct-mapped cache to a set-associative cache (with a similar number of blocks and similar block size) would primarily address which kind of misses?
Question 1: Which bits in the address are used to determine, after accessing a set, which way (if any) has data for the address, when the cache has the following properties: 32KB, 8-way set associative, 64B block size, 32-bit addresses?
Question 2: Which of the following cache design changes increase the amount of metadata (non-data storage) required for a cache?
Question 3: Recall the formula for average memory access time (AMAT):
average memory access time, AMAT = hit time + miss penalty × miss rate
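As a worked example with made-up numbers (not drawn from any answer choice): with a hit time of 2 cycles, a miss rate of 5%, and a miss penalty of 100 cycles, AMAT = 2 + 100 × 0.05 = 7 cycles.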
Which of the following statements is true?
Question 4: Suppose your program sums up every 4th element of an array. Each element of the array takes 4 bytes of space. Your cache block size is 16B. Initially your cache is empty.
int sum = 0;
for (int i = 0; i < N; i += 4)
    sum += v[i];
Which of the following statements are true?
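A point of reference for the arithmetic involved, using only the numbers given above: with 4-byte elements and 16B blocks, each cache block holds 16 / 4 = 4 elements, so a stride of 4 elements advances exactly one block per loop iteration.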
Question 1 (0 points): Consider the following two functions:
void versionA(int *A, int *B, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < 8; ++j)
            A[i] = A[i] * B[j];
}
void versionB(int *A, int *B, int n) {
    for (int j = 0; j < 8; ++j)
        for (int i = 0; i < n; ++i)
            A[i] = A[i] * B[j];
}
Assuming n is very large and the compiler does not reorder or omit memory accesses to the arrays A or B (or keep values from A or B in registers), which is likely to experience fewer cache misses?
Question 2 (0 points): Consider two functions foo and bar:
void foo(int *px, int *py, int *pz) {
    int t = *px;
    *px = *py;
    *py = *pz;
    *pz = t;
}
void bar(int *px, int *py, int *pz) {
    int t = *py;
    *py = *pz;
    *pz = *px;
    *px = t;
}
Compilers are not allowed to generate the same code for foo and bar. Why is this?
Select all that apply
Question 3 (0 points): Consider the following two versions of a loop in assembly:
Version A:
movq $1023, %rax
movq $1, %rbx
loop:
imulq (%rcx,%rax,8), %rbx
sub $1, %rax
jne loop
Version B:
movq $1022, %rax
movq $1, %rbx
loop:
imulq 8(%rcx,%rax,8), %rbx
imulq (%rcx,%rax,8), %rbx
sub $2, %rax
jne loop
Version A performs at least one hundred more ______ than version B.
Select all that apply
Question 1: Consider the following code:
for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; ++i) {
        // A[i*N+k] is loaded once in this loop
        for (int j = 0; j < N; ++j) {
            // B[i*N+j], A[k*N+j] are loaded each iteration (if N is big)
            B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }
}
Assuming N is very large and the compiler does not reorder or omit memory accesses to the arrays, which of the following statements are true? (In the options below, "loaded" means loaded into a cache with a typical organization.)
Select all that apply
Question 2: Consider the following two versions of code.
Version 0:
for (int k = 0; k < N; ++k) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            B[i*N+j] += A[i*N+k] * A[k*N+j];
        }
    }
}
Version 1:
for (int kk = 0; kk < N; kk += 2) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = kk; k < kk + 2; ++k) {
                B[i*N+j] += A[i*N+k] * A[k*N+j];
            }
        }
    }
}
Assuming N is very large and the compiler does not reorder or omit memory accesses to the arrays, which of the following statements are true?
Select all that apply
Question 3: Aliasing often makes it difficult for compilers to perform which of the following optimizations?
Select all that apply
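A small illustrative example of the underlying issue (the function and names here are hypothetical, not taken from the question): if a and b might alias, the compiler cannot safely keep *b in a register across the stores into a.
void add_to_all(int *a, const int *b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] += *b;   /* *b must be re-read each iteration in case b points into a */
}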
Question 4: Which of the following optimizations will result in a program that executes fewer instructions to perform the same amount of work? (For the purposes of this question, count each time an instruction is run as a separate instruction executed; for example, a loop of ten instructions run five times executes fifty instructions.)
Select all that apply
Question 1: In order for multiple accumulators to significantly improve the performance of a loop
Select all that apply
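For reference, a minimal sketch of the multiple-accumulator transformation itself (a hypothetical summation loop; n is assumed to be even):
long sum2(const long *a, int n) {
    long s0 = 0, s1 = 0;        /* two independent accumulators */
    for (int i = 0; i < n; i += 2) {
        s0 += a[i];             /* the two additions can proceed in parallel */
        s1 += a[i + 1];
    }
    return s0 + s1;
}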
Question 2: Extending the multiple accumulator optimization to a very large number of accumulators generally results in slower performance primarily because
Question 3: On an out-of-order processor that can perform many additions per cycle in parallel, which of the following orders of operations is likely to be most efficient to compute the sum of a, b, c, d, e, and f:
Question 4: When an exception (in the sense of section 8.1) occurs while executing an application's code, code that is part of __________ is called by logic located inside ____________.
Question 1: SIMD arithmetic instructions improve performance primarily by
Select all that apply
Question 2: Suppose a loop unrolled 8 times executes at the same speed as a loop unrolled 4 times on a processor. Which of the following changes to the processor is likely to make the loop unrolled 8 times execute faster than the loop unrolled 4 times?
Select all that apply
Suppose an out-of-order processor has a single pipelined multiplier that takes 3 cycles to perform each multiply and an independent single pipelined adder that takes 2 cycles to perform each add. Assume that the result of a multiply or add can be used as input to a multiply or add immediately after it is computed.
Question 3: (see above) What is the minimum number of cycles
addq %rax, %rbx
mulq %rax, %rcx
mulq %rdx, %rdx
addq %rcx, %rbx
takes to execute on this processor?
Question 4: (see above) What is the minimum number of cycles
addq %rcx, %rax
addq %rdx, %rbx
mulq %rax, %rbx
takes to execute on this processor?
Question 1: A process running in user mode is prevented from directly
Select all that apply
Question 2: The number of programs a system can run concurrently is determined primarily by
Question 3: A page table entry contains
Select all that apply
Question 4: As described in our book, when virtual memory is used, _______ is/are a cache for ______.
Question 5: As described in our book, when virtual memory is used to implement a cache, the replacement policy of that cache is determined primarily by __________.
Question 1: Suppose the operating system contains a function that creates a new process, with its own new address space, and context switches to it. This function does not work outside of kernel mode. If a program tries to run this function from user mode, what will most likely occur? (For this question, consider our textbook's terminology for exceptions.)
Question 2: When the operating system performs a context switch between two processes, which of the following values are likely to change?
Select all that apply
Question 3: Suppose we are running program A, which then makes a system call to read from disk. While waiting for the disk to return the read data, the operating system runs program B for 5 milliseconds, then runs program C until the disk completes the read. Finally, it returns to running program A.
How many exceptions must have occurred during this entire process?
Question 4: A page table maps a virtual page number (vpn) to a physical page number (ppn). A virtual address can be split as <vpn, po>, where po is the page offset. Given this information, how would you form the physical address?
Question 1: Which of the following statements are true about the translation lookaside buffer (TLB)?
Select all that apply
Question 2: What do we use to index the TLB?
Question 3: What is the content of the first-level page table in a multi-level page hierarchy?
Question 4: Which of the following statements are true?
Question 1: Which of the following statements is true?
Question 2: Let's assume there is a two-level page table in your system. The virtual address is split as <vpn part1, vpn part2, page offset>. Which of the following statements is true?
The Core i7 divides virtual page numbers into four parts because its MMU has a four-level page table hierarchy. Consider translating a single virtual address into a physical address.
Question 3: (see above) How many page table accesses will occur during that translation if there is a TLB hit?
Question 4: (see above) What is the maximum number of page table accesses that can occur during that translation if there is a TLB miss?
Question 5: You find that one particular TLB configuration uses no TLB index bits during a lookup. Which of the following statements is true?