Answer the following questions about the lecture material from this week.
In lecture, we discussed the idea of a processor being connected to memory and I/O
devices via common bus. Suppose we are executing the instruction
movq (%rbx), %rax
in this model. The processor will send messages over
the bus in order to ___. Select all that apply.
Consider the following C code:
int x;
int *p;
x = 1077;
p = &x;
Suppose this is executed on a little-endian system where int
s are 4 bytes
and:
p
(after execution) is 0x50000
, andp
is stored in memory at address 0x60000
Identify the value of the byte in memory at each of the following addresses.
If not enough information is given, instead write unknown
and explain
briefly in the comment.
0x50000
1077 = 0x435 is composed of bytes 0x00 (most sig), 0x00, 0x04, 0x35 (least sig); The least significant byte goes in the base address which is 0x50000, because that's what a pointer to x contains.
0x50002
0x50002 contains the third least significant byte of 1077, which is 0
0x50004
no information about what, if anything, is stored at this address since it's just after the int 'x' in memory. (We know it's not p
, since it's stored at address 0x60000.)
Consider the following C snippet:
c = str[x+2];
where c
is declared as a char
, str
is declared as a char *
(pointer to char; and based
on the context of the snippet a pointer to the beginning of an array of char),
and x
is declared as a long
.
Suppose str
is stored in the register %rax
, x
in %rdi, c
in the register %bl
,
and %r8
through %r15
can be used for temporary values.
Which (if any) of the following are correct translations to assembly? Select all that apply.
Answer the following questions about the lecture material from this week.
Consider the following C snippet:
unsigned long x;
unsigned long *p;
...
*p = x * 2;
If:
x
is stored in %rax
,p
is stored in %r8
, and%r9
is a register available to store temporary valuesthen which of the following x86-64 assembly snippets are equivalent to the C statement above? Select all that apply.
Consider the following assembly snippet:
start:
addq %r8, %r9
cmpq %r9, %r10
jle start
end:
Suppose that while this assembly snippet executes,
the computations performed by the addq
and cmpq
instructions
do not result in integer overflow (or underflow) if we interpret their
operands and results as signed integers.
Then, after the snippet finishes executing,
what will the values of the condition codes SF
and ZF
be?
Assuming that the variables r8
, r9
, and r10
represent the values
in the registers %r8
, %r9
, and %r10
respectively, which is
equivalent C code to the above assembly snippet?
Suppose we have C code in two different files.
One file, upperstr.c
contains the following:
void upperstr(char *str) {
for (int i = 0; str[i] != '\0'; i += 1) {
if (str[i] >= 'a' && str[i] <= 'z') {
str[i] += ('A'-'a');
}
}
}
And another file, printbig.c
contains the following:
#include <stdio.h> /* for puts() */
void printbig(char *str) {
upperstr(str);
puts(str);
}
An executable is made from code in these files and others. As part of that process
an object file upperstr.o
is produced from upperstr.c
and
an object file printbig.o
is produced from printbig.c
.
Recall from lecture that object files generally contain:
Ignoring information only present for debugging, we would expect upperstr.o
to contain ____. Select all that apply.
Ignoring information only present for debugging, we would expect printbig.o
to contain ____. Select all that apply.
Answer the following questions about the lecture material from this week.
Consider the expression
(((x & 0xFF0F) >> 2) & 0xFF0) == (((x & 0xFF0) >> 2) & 0xFF0F)
where x
is a 32-bit unsigned integer.
The set of values x
for which the above expression is true is the same
as the set of values for which the expression (x & Y) == 0
is true for some constant Y
.
What is the value of Y
? Write your answer as a hexadecimal number.
If x is the bit pattern abcd efgh ijkl mnop
(letting letters represent variable values for each bit), then the right hand side is:(x & 0xFF0F) = abcd efgh 0000 mnop
(x & 0xFF0F) >> 2 = 00ab cdef gh00 00mn
((x & 0xFF0F) >> 2) & 0xFF0 = 0000 cdef gh00 0000
and the left-hand side:(x & 0xFF0) = 0000 efgh ijkl 0000
((x & 0xFF0) >> 2) = 0000 00ef ghij kl00
((x & 0xFF0) >> 2) & 0xFF0F = 0000 00ef 0000 kl00
So for the two results to be the same the bit pattern 0000 cdef gh00 0000
needs to be the same as 0000 00ef 0000 kl00
. This is only true if bits cdghkl are all 0s, so we need a bit mask of 0011 0011 0011 0000.
Consider the following incomplete C code (where ___s represent omitted constants):
const unsigned long A = ___;
const unsigned long B = ___;
unsigned long swapAndFlip(unsigned long x) {
unsigned long low = x & 0xFF;
unsigned long high = x >> 56;
return (((low ^ A) << 56) | high | (x & B);
}
Suppose we want to make this function take a 64-bit integer, and return a new 64-bit integer, such that:
What should the value of the constant A be? Write your answer as a hexadecimal integer.
What should the value of the constant B be? Write your answer as a hexadecimal integer.
For each of the following pairs of instruction set design choices, identify which is more consistent with the reduced instruction set computer (RISC) design philosophy
Answer the following questions about the lecture material for this week.
For the following questions, consider this Y86-64 machine code. (The machine code is written as a series of bytes in hexadecimal, written in order from lowest to highest memory address.)
30 f8 08 00 00 00 00 00 00 08 60 9a 50 ba 00 00 00 00 00 00 00 00
What is the mnemonic for the first instruction in the above machine code snippet?
The first instruction in the above assembly snippet has a 64-bit integer constant in it. What is it? Write your answer as a hexadecimal number.
What is the mnemonic for the second instruction in the above machine code snippet?
Consider the following HCLRS snippet:
register xY {
foo : 64 = 0x0;
bar : 64 = 0x5;
}
x_foo = Y_foo + 1;
x_bar = x_foo + Y_bar;
pc = x_foo;
Since the above snippet sets pc
, it will read from the instruction
memory, setting i10bytes
based on the contents of memory (even though the snippet does not
use that value).
Before the first rising edge
of the clock signal, what bytes of memory will be used to
construct the value of i10bytes
?
Identify the bytes by their memory addresses. Select all that apply
Initially, the value of the bar
register is 0x5
.
After two rising edges of the clock signal, it will be ___.
Write your answer as a hexadecimal number.
before first rising edge: x_foo = 0 + 1, x_bar = 1 + 5; after 1st: x_foo = 1 (previous x_foo) + 1; x_bar = 2 + 6 (previous x_bar) = 8; after 2nd: Y_foo = previous x_foo
Answer the following questions about the lecture material for this week.
Suppose we start with the mov-to-register CPU we described in lecture which implements
the Y86 instructions irmovq
, mrmovq
, and rrmovq
.
This processor design had MUXes (represented in HCLRS by case expressions)
that chose between several inputs based primarily on what the current
instruction was.
When adding a new instruction, for some of these MUXes, we can use
an existing input and modify when it is selected to include
the new instruction. For others, we may need to add a new input to
this MUX for the new instruction (or some other logic
that achieves this effect). In a few cases, we may even need
to add a new a MUX (or other logic that achieves the same effect).
Suppose we added the addq
instruction to the mov-to-register CPU.
For which of the following parts of the circuit would we need
to add a new MUX or a new MUX input (or equivalent logic)
rather than modifying the conditions for selecting existing
MUX inputs?
Select all that apply.
In Y86, the popq
instruction is encoded using 2 bytes. Suppose we made
a variant of Y86 that instead encoded popq
in one byte, where the high-order four bits of
the first byte would be 0xB (like popq
is in normal Y86) and the low-order
four bits of the first byte would be the register index corresponding
to the operand to popq
.
Supporting this new encoding would likely ____. Select all that apply.
In the single-cycle processor design discussed in lecture, executing the code
pushq %rax
popq %rbx
involves the following operations:
pushq
instruction%rax
%rsp
popq
instruction%rbx
%rsp
Which of the above operations could occur simulatenously or in any order in the single-cycle processor design? Select all that apply
Suppose we added the instruction
immovq CONSTANT, DISPLACEMENT(rA)
to the single-cycle Y86-64 processor design we discussed in lecture. This instruction would store the 64-bit value CONSTANT in memory at an address computed by adding the value of the register rA to the 64-bit constant DISPLACEMENT.
While this instruction is executing, the value input to the
data memory (called mem_input
in HCLRS) will be equal to ______.
Answer the following questions about the lecture material for this week.
Suppose we are building a two-stage pipelined Y86-64 processor from the following components:
(and that other components use negligible time to make the following questions simpler.)
Most simply, each our two stages could perform the work of one or more stages from the five-stage pipelined design we discussed in lecture.
If the first stage of the processor performs the work for Fetch and the second stage performs the work of Decode+Execute+Memory+Writeback, then the cycle time would be around _____ ps plus pipeline and/or program counter register delays.
slowest path: reg read + ALU + mem read + reg write
(3/4ths credit for reg read + ALU + mem write + reg write, but we never use the output of memory writes to do a register write)
To minimize the cycle time, these two stages should perform the work of _______ (for the first stage) and _______ (for the second stage).
Suppose a single-cycle processor design has a cycle time of 3000 ps. If this design is modified into an 8-stage pipelined processor (using the same storage components and adding pipeline registers) and this results in a performance improvement, then the new cycle time would be greater than _____ ps and less than _____ ps. Choose the tightest bounds you can infer from the information in the question. Write your answer as two numbers of picoseconds seperated by a comma. (For example, if you think the answer is "greater than 100 ps and less than 200 ps" write "100,200".)
with 8 stages, ideally each stage takes around 3000 ps/8 = 375 ps; in the worst case, we'd have one stage that took almost 3000 ps plus we'd have some additional delay from added pipeline registers/etc. — so the result could be worse than the original single-cycle processor. We are told in the question that there's a performance improvement, so we can at least say that the new performance is better than 3000 ps.
Suppose we construct a pipelined processor by modifying the single-cycle processor design we saw in lecture and that our pipeline processor has two stages:
With these stages, the pipeline registers should contain (at least when certain instructions are being run) ____. Select all that apply.
Suppose we modified the four-stage pipelined addq processor we discussed in lecture to have three stages:
With these three stages, suppose we run the following assembly and resolve data hazards with stalling:
addq %r8, %r9
addq %r10, %r9
addq %r8, %r10
addq %r8, %r9
Suppose the first instruction is fetched during cycle 1 (and so performs its decode during cycle 2 and its writeback during cycle 3). During what cycle would the last add instruction run its writeback stage?
Answer the following questions about the lecture material for this week.
For the following questions, consider the following assembly code:
addq %r8, %rcx
mrmovq 8(%rcx), %rax
subq %rcx, %rax
Suppose the above assembly code was executed on a six-stage pipelined processor with the following pipeline stages:
The result of reading the data memory is not available until near the end of the Memory 2 stage. The processor uses a combination of stalling and forwarding to resolve data hazards. It supports all forwarding paths that would not result in a large increase in cycle time.
If the addq
instruction completes its fetch stage during cycle 1,
then the subq
instruction will complete its writeback stage during cycle ___.
1 2 3 4 5 6
addq %r8, %rcx F D E M1 M2 W 7
mrmovq 8(%rcx), %rax F D E M1 M2 W 8 9 10
subq %rcx, %rax F D* D* D E M1 M2 W
Suppose the above assembly code was executed on seven-stage pipelined processor, where the pipeline stages are:
In this processor, the result of an ALU operation (addition or subtraction) is not available until near the end of the Execute 3 stage, but the inputs to the operation must be available near the beginning of the Execute 1 stage. This processor uses a combination of stalling and forwarding to resolve hazards. It supports all forwarding paths that would not result in a large increase in cycle time.
If the addq
instruction completes its fetch stage
during cycle 1, then the subq
instruction will complete its writeback stage
during cycle ____.
1 2 3 4 5 6 7
addq %r8, %rcx F D E1 E2 E3 M W 8 9 10
mrmovq 8(%rcx), %rax F D* D* D E1 E2 E3 M W 11 12 13 14
subq %rcx, %rax F* F* F D* D* D* D E1 E2 E3 M W
Consider the following assembly code:
addq %r8, %r9
mrmovq 16(%r8), %r9
subq %r9, %r10
xorq %r9, %r11
rrmovq %r9, %r12
When the above instruction is executed on a five-stage pipelined processor which uses forwarding to resolve hazards in a way that minimizes stalling like we discussed in lecture, which of the following forwarding operations must occur? Select all that apply
Answer the following questions about the lecture material for this week.
Consider a 7-stage pipelined processor with the following stages:
Assume these stages do the same thing as in the 5-stage processor we discussed in lecture, except that the decode and execute stages are split into two stages. For the split decode stage, the inputs to the original decode stage (like register indices) need to be available near the beginning of the decode 1 stage and outputs of the original decode stage (like register values) are only available near the end of the decode 2 stage. Similar is true for the execute 1 and execute 2 stages.
Assume the processor implements whatever forwarding is possible without dramatically increasing cycle time.
Suppose the 7-stage processor is used to execute the following:
addq %rax, %rcx
jle foo
rmmovq 8(%rax), %rax
subq %rcx, %rdx
halt
foo:
xorq %r8, %r9
xorq %r10, %r11
xorq %r12, %r13
xorq %r8, %r9
xorq %r10, %r11
xorq %r12, %r13
The processor implements jle
using branch prediction that predicts the branch
as always taken. The actual outcome of the branch prediction is computed
during the conditional jump's execute 2 stage (which will use the condition codes
updated by the previous instruction's execute 2 stage)
and can be used to determine what instruction to fetch in the next cycle (but not earlier).
When the assembly above is executed,
the jle
is not taken, so some xorq instructions are fetched and then
squashed. If the addq
instruction is fetched during cycle 1, then
the rmmovq
instruction will be fetched during cycle ___.
Consider the five-stage pipelined processor design discussed in lecture but with a different branch prediction strategy: instead of predicting conditional jumps as always taken, predict them as taken if they target an address that is less than the jump instruction's and as not taken if they target an address than is greater than the jump instruction's. (This strategy is called "forward not-taken, backwards taken").
In this processor, incorrectly predicted branches require 2 extra cycles, instructions that use a value computed in memory by the previous instruction in the ALU require 1 extra cycle, and returns require 3 extra cycles. (And no other cases require extra cycles due to hazards.)
Suppose the instructions run in a program are:
ret
instructionsAbout how many average cycles per instruction would the program experience on the processor described above?
1 + 10% * (20% + 10%) * 2 + 10% * 1 = 1.16
Consider the following C++ snippet:
for (int i = 0; i < 10000; ++i) {
A[i] += B[i];
A[i] *= C[i];
}
std::cout << "last A is " << A[9999] << std::endl;
The accesses to elements of A
will have _____ temporal locality than the accesses to i
.
Assuming about the same number of bytes of machine code are required for each,
the accesses to the machine code for the +=
and *=
statements in the loop will have _____
spatial locality than the machine code for the std::cout
statement.
Assuming about the same number of bytes of machine code are required for each,
the accesses to the machine code for the +=
and *=
statements in the loop will have _____
temporal locality than the machine code for the std::cout
statement.
Suppose the contents of a two-entry direct-mapped cache is as follows:
index (binary) | valid | tag (binary) | data bytes (hexadecimal) | |
00 | 1 | 001 | 1F 3F 5F 7F | |
01 | 1 | 110 | 2E 3E 4E 5E | |
10 | 1 | 001 | 33 44 55 8F | |
11 | 1 | 010 | 9E AE BF 01 |
(In the table above, data bytes are listed with the lowest address's value first (left-most).)
The byte of memory at address 0x11 (binary: 10001) is stored in this cache and has the value 0x3F.
Bytes from which of the following memory addresses are also stored in the cache shown above? (Note that addresses may be written without all leading 0s.) Select all that apply.
Answer the following questions about the lecture material for this week.
Consider a two-way set associative cache with the following contents:
set index | valid | tag | data bytes | valid | tag | data bytes | |
0 | 1 | 0xA20 | 1F 3F 2A 93 27 00 45 8A | 1 | 0xC20 | 33 44 55 92 91 90 0A 0C | |
1 | 0 | — | — | 1 | 0xC20 | 0F 1C 11 12 13 14 15 16 | |
2 | 1 | 0xC20 | 20 21 2F 2A 40 41 55 98 | 1 | 0x332 | 95 33 12 59 43 3A 3C 3D | |
3 | 0 | — | — | 0 | — | — |
In the representation above, the data bytes are listed in hexadecimal, starting with the byte with the lowest memory address.
The data byte 1C in the second way of the set with index 1 was retrieved from memory at address _____. Write your answer as a hexadecimal number.
concatenate: tag (0xC20 = 1100 0010 0000), index (01), offset (001) = 1 1000 0100 0000 1001 = 0x18409
([update 26 Oct:] original key had the wrong intermediate values, now fixed)
Give an example of an address that, if accessed, would result in a value being replaced in the cache above. Write your answer as a hexadecimal number.
has to be in set 0 or set 2 and not have tag 0xA20 or 0xC20 (for set 0); or tag 0xC20 or 0x332 (for set 2)
Suppose a program accesses a cache in the following pattern:
Assume the above accesses use a cache that is not also used for other accesses and that the cache is initially empty.
If the cache used for the accesses above is a 4-byte, direct-mapped cache with a write-through, write-no-allocate policy and 2 byte blocks, how many accesses (reads or writes) to the next level of cache (or main memory if there is not next level) will be performed when the above cache accesses happen?
2 reads of A + 40 writes of A + 2 reads of B (if arrays at addreses that are multipes of 2)
additional read of B if address of B is not multiple of 2 (read block contianing byte before array + first byte; then read block of second and third; then read block of fourth byte + byte after array)
if address of A is not multiple of 2, additoinal read for first pass (to read block contining last byte of array + byte after array), plus 2 additional reads for each of 9 following passes (replace block containing last byte with block containing byte before array + first byte; then replace that block with block contiaing last byte again) or 1 + 18 = 19 additional reads
Consider a 4MB (4194304 byte) cache, 4-way cache with 64-byte blocks, a first-in, first-out replacement policy, and a write-back, write-allocate policy.
Suppose the cache is initially empty (all valid bits and dirty bits set to 0), then the following accesses happen:
Assuming 64-bit memory addresses, how many bits does the cache have (in total, across the whole cache) available to store tags?
4194304 / (64 * 4) = 2^14 sets; (64 - 14 index bits - 6 offset bits) * (4194304/64 blocks)
After the accesses above, how many dirty bits will be set to 1?
After the accesses above, how many valid bits will be set to 1?
Answer the following questions about the lecture material for this week.
Consider a system with the following caches and performance results on a benchmark:
and main memory has a 200 cycle access time.
Assume that accesses to the next level of cache are not started until the access time on the current level completes.
What is the average memory access time (in cycles) on this system?
3 + 1% * (10 + 20% * (50 + 5% * 200)) = 3.22
Many caches store tags, valid bits, and other metadata separately from the data. One result of this design is that misses can be detected more quickly, after completing an access to likely faster "tag store".
Suppose a cache using this design:
Without the optimization to detect hits or misses early, this cache would have an average memory access time of 5 + 10% * 100 = 15 cycles. What would the average memory access time with this optimization be?
5 + 10% * (100 - 3 [cycles saved by starting next level faster])
Suppose we have a 2-way, 2-set data cache with 8 byte blocks and an LRU replacement policy and run the following C code:
int array[12];
... /* code omitted */
int sum = 0;
for (int j = 0; j < 2; j += 1) {
for (int i = 0; i < 4; i += 1) {
if (sum > array[i * 3 + j]) {
sum += array[i * 3 + j];
}
}
}
Assume that:
array
in the loops use the data cache;How many data cache misses will occur when the loops above run?
[Note that unlike some of the examples in lecture, the number of misses for each of the two sets of the cache may be different.]
access pattern is array[0], array[3], array[6], array[9], array[1], array[4], array[7], array[10]
Each cache block has two array elements. Let's say 0+1, 4+5, 8+9 map to set 0; and 2+3, 6+7, 10+11 map to set 1.
access pattern + cache contents identifying array elements by their indices:
0 miss: set 0 {0+1,--}; set 1 {--,--}
3 miss: set 0 {0+1,--}; set 1 {2+3,--}
6 miss: set 1 {0+1,--}; set 1 {2+3,6+7}
9 miss: set 0 {0+1,8+9}; set 1 {2+3,6+7}
1 hit
4 miss: set 0 {0+1,4+5} (8+9 was least recently used in this set since we just accessed array[1]); set 1 {2+3,6+7}
7 hit
10 miss
Consider the following C snippets:
/* version A */
for (int i = 0; i < N; i += 1) {
for (int j = 0; j < i; j += 1) {
A[i * N + j] = D[j * N + i] + B[i] * C[j];
}
}
/* version B */
for (int j = 0; j < N; j += 1) {
for (int i = 0; i < j; i += 1) {
A[i * N + j] += D[j * N + i] + B[i] * C[j];
}
}
Which version has better spatial locality for accesses to A?
If the cache can store four array elements in each cache block (for each of the arrays A, B, C, or D) and the cache is too small to store N elements of any of the arrays, approximately how many cache misses would we expect to occur in version A for accesses to the array B only? (N^2 below represents N squared.)
If the cache can store four array elements in each cache block (for each of the arrays A, B, C, or D) and the cache is too small to store N elements of any of the arrays, approximately how many cache misses would we expect to occur in version A for accesses to the array D only? (N^2 below represents N squared.)
Answer the following questions about the lecture material for this week.
Consider the following C code:
for (int i = 0; i < N; i += 1) {
for (int j = 0; j < i; j += 1) {
A[i] += B[i * N + j] * C[j * N + i];
}
}
Consider the following attempts at optimizing the above C code:
/* version A */
for (int ii = 0; ii < N; ii += 2) {
for (int i = 0; i < ii + 2 && i < N; i += 1) {
for (int jj = 0; jj < N; jj += 2) {
for (int j = 0; j < jj + 2 && j < i; j += 1) {
A[i] += B[i * N + j] * C[j * N + i];
}
}
}
}
/* version B */
for (int ii = 0; ii < N; ii += 2) {
for (int jj = 0; jj < ii + 2; jj += 2) {
for (int i = 0; i < ii + 2 && i < N; i += 1) {
for (int j = 0; j < jj + 2 && j < i; j += 1) {
A[i] += B[i * N + j] * C[j * N + i];
}
}
}
}
/* version C */
for (int i = 0; i < N; i += 1) {
int j = 0;
for (; j + 1 < i; j += 2) {
A[i] += B[i * N + j] * C[j * N + i];
A[i] += B[i * N + j+1] * C[(j+1) * N + i];
}
if (j < i)
A[i] += B[i * N + j] * C[j * N + i];
}
/* version D */
for (int j = 0; j < N - 1; j += 1) {
int i;
for (i = 0; i + 1 <= j ; i += 2) {
A[i] += B[i * N + j] * C[j * N + i];
A[i] += B[(i+1) * N + j] * C[j * N + i+1];
}
if (i <= j)
A[i] += B[i * N + j] * C[j * N + i];
}
Assuming N
is very large (relative to the cache size), which of the above attempts at such optimizations
are most likely to be successful at reducing the number of data cache misses?
Accepted all answers due to error in this question; meant to write version A and version B with i = ii and j = jj instead of i = 0 and j = 0 (and A[i+1] instead of A[i] in version D); if so than B would be the correct answer.
With this correction, versions A and C would hvae the same access pattern as the original, while version D would hurt the locality because the accesses to A would have worse temporal locality. version B would improve the temporal locality in B without hurting the temporal/spatial locality in the other arrays much.
Consider the folowing assembly code:
xorq %rbx, %r8
andq %rbx, %r9
subq %r8, %rbx
addq %rbx, %r9
after register renaming, the physical register that andq
reads to obtain
the value of %rbx
will be the same as ____. Select all that apply.
Consider the following assembly code:
xorq %rax, %rax
outer_loop:
movq %rax, %r8 /* A */
movq %rax, %rcx /* B */
imul $8192, %r8 /* C */
addq %rdi, %r8 /* D */
inner_loop:
movq (%rdx,%rcx,8), %r9 /* E */
movq (%rsi,%rax,8), %r10 /* F */
addq %r10, %r9 /* G */
movq %r9, (%r8,%rcx,8) /* H */
incq %rcx /* I */
cmpq $1024, %rcx /* J */
jne inner_loop /* K */
incq %rax /* L */
cmpq $1024, %rax /* M */
jne outer_loop
In this code there are two nested loops. The outer one which uses %rax
as its
index variable, and the inner one uses %rcx
as its index varaiable.
In an out-of-order processor with branch prediction
and several execution units (such as ALUs and data cache read/write ports)
of each type, multiple instances
of addq
instruction labelled G (from different iterations
of the inner loop) can be performed _____.
(Assume this out of order processor can perform loads and stores out-of-order.)
In order to unroll the inner loop in the code above,
one would most likely duplicate some of the instructions above
(possibly with changes to where they expect their operands, etc.)
and/or convert some of them to a form that performs the same work
that multiple copies of the instruction would perform (such
as converting an incq %rax
into an addq $2, %rax
).
In order to unroll the inner loop, which instructions would be so duplicated or transformed? (Instructions are identified by the letters in comments above.)
After loop unrolling is performed, the addq
instruction labeled G
would ______ from a multiple accumulators transformation.
Answer the following questions about the lecture material for this week.
Consider the following assembly function 'satSum':
satSum:
movq %rdi, %rax
addq %rsi, %rax
cmpq %rsi, %rax
jb retMax
cmpq %rdi, %rax
jb retMax
ret
retMax:
movq $-1, %rax
ret
Inlining this function would most likely allow an optimizing compiler to avoid including an instruction very similar to ___. Select all that apply.
For the following questions, consider the following C snippet:
for (int i = 0; i < N; i += 1) {
for (int j = 0; j < N; j += 1) {
A[i] += (A[i] - B[j]) * C[j];
}
}
where A
, B
, and C
are declared as int*
s and N
is declared as an int
.
On the snippet above, which of the following optimizations cannot be performed without adding extra checks due to aliasing? Select all that apply.
On the snippet above, which of the following optimizations
would be likely to result in substantially worse performance if N
was very small
(but still varied between executions of the code above)?
Select all that apply.
On the snippet above, which of the following optimizations would be likely to significantly increase the amount of machine code used to store the above loops? Select all that apply.
Answer the following questions about the lecture material for the two lectures before Thanksgiving break.
The vector instructions we discussed in lectures supported loading a vector of values from (or storing a vector of values to) a single location in memory. In addition to these types of vector load and store instructions, some vector instruction sets include support for an instruction that takes a vector of memory locations and fills a vector by loading an element from each of those memory locations (or stores a vector by storing an element to each of those locations). These more featureful vector load and store instructions are typically substantially slower than those that load from (or store to) a single location.
When vectorizing which of the following snippets would these special vector load and store instructions be useful? Select all that apply.
Consider a single-core system with an operating system that uses time multiplexing (with context switching) to share the processor between multiple processes.
Initially, process A is running the following assembly snippet:
movq $100, %rax
movq $200, %rcx
addq %rcx, %rax
After the second movq
and before the addq
instruction can complete, the operating system switches
to process B. After process B starts running, ____. Select all that apply.
Suppose a process attempts to execute the following,
whose machine code is located in memory at address
0x200000
:
movq $0x100000, %r8
movq (%r8), %r9
addq %r9, %r10
Suppose the memory address 0x100000
is not accessible, so
accessing it causes a crash in the second movq
instruction.
[editing note: originally the above sentence had some minor
typos: it erroneously duplicated
"causes a crash" and had the wrong number of 0
s in 0x100000
]
An exception will occur as part of the execution of the above. When it does part or parts of the exception table will be used. These part(s) will contain ____. Select all that apply.
Suppose an operating system with a single core is running two processes.
Initially process A is active, and it attempts to reads from a file. The data from the file is not yet available because it has not been read from the disk. While it is being read from the disk, process B runs, performing a part of a long computation. Then process A finishes reading from the file.
During this procedure, there is a context switch from process A to process B. This context switch most likely occurs ____.
Suppose process B stops running and process A is restarted almost immediately after the data is retrieved from the file. For this to happen, most likely ___.
Answer the following questions about the lecture material for this week.
Consider a system with 2048-byte (2 to 11th power bytes) pages, and the following page table:
virtual page # | valid | physical page # |
---|---|---|
0x0 | 0 | — |
0x1 | 1 | 0x34 |
0x2 | 1 | 0x35 |
0x3 | 1 | 0x36 |
0x4 | 1 | 0x12 |
0x5 | 1 | 0x10 |
0x6 | 1 | 0x06 |
0x7 | 1 | 0x04 |
(Since this system has 2048 byte pages, page offsets are 11 bits.)
If a program running on this system accesses address 0x1432
, then what physical address
will be accessed? Write your answer as a hexadecimal number.
If an exception will occur instead, write "fault".
Suppose a system has the following virtual memory configuration:
When a program accesses a virtual address, the processor reads a page table entry from physical address 0x402000 and then reads a value from physical address 0x201234. What virtual address did the program access? Write your answer as a hexadecimal number. If not enough information is provided, write "unknown" and explain in the comment.
Suppose a program accesses virtual address 0x1007. The processor reads a page table entry which, when interpreted as 4-byte integer, is 0x00099003 as part of trying to resolve this access. Which of the following is true about this virtual memory access? Select all that apply.
Suppose a system has the following virtual memory configuration:
While executing a move instruction that copies from address 0x123456789A into a register, the processor looks up a first-level page table entry at address 0x400240. That page table entry has its valid bit and user-mode-accessible bit set, and a physical page number of 0x456. From what physical address will the processor try to read the second-level page table entry for this address from?
Remember to account for page table entries being larger than 1 byte.
Write your answer as a hexadecimal number. If not enough information is provided, write "unknown" and explain in the comments. If an exception will occur instead, write "unknown" and explain in the comments