F2022 CS 3330 quizzes

Question 1 (6 points)

In lecture, we discussed the idea of a processor being connected to memory and I/O devices via a common bus. Suppose we are executing the instruction movq (%rbx), %rax in this model. The processor will send messages over the bus in order to ___. Select all that apply.

request a copy of the machine code for this instruction.
request a copy of the value of %rbx
request a copy of the value at whatever address is contained in %rbx
request a copy of the value at whatever address is contained in %rax
modify the value at whatever address is contained in %rbx
modify the value in %rax

Consider the following C code:

int x;
int *p;
x = 1077;
p = &x;

Suppose this is executed on a little-endian system where ints are 4 bytes and:

the value of p (after execution) is 0x50000, and
the variable p is stored in memory at address 0x60000

Identify the value of the byte in memory at each of the following addresses. If not enough information is given, instead write unknown and explain briefly in the comment.
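As a hedged illustration of the byte ordering involved, the sketch below computes which byte a little-endian machine stores at each address of a 4-byte integer. The helper name le_byte_at and the sample value 0x12345678 are assumptions chosen for illustration; they are not the values from the question.

```c
#include <assert.h>
#include <stdint.h>

/* Byte that a little-endian machine stores at address (base + offset)
 * for a 4-byte integer whose lowest-addressed byte lives at base.
 * The least significant byte sits at the lowest address. */
uint8_t le_byte_at(uint32_t value, unsigned offset) {
    assert(offset < 4); /* a 4-byte int only occupies offsets 0..3 */
    return (uint8_t)((value >> (8 * offset)) & 0xFF);
}
```

For the hypothetical value 0x12345678, offset 0 holds 0x78 and offset 3 holds 0x12. Addresses outside the four bytes of the integer are not described by this helper at all.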

Question 2 (2 points) (see above)

0x50000

Answer:

Question 3 (2 points) (see above)

0x50002

Answer:

Question 4 (2 points) (see above)

0x50004

Answer:

Question 5 (5 points)

Consider the following C snippet:

c = str[x+2];

where c is declared as a char, str is declared as a char * (pointer to char; and based on the context of the snippet a pointer to the beginning of an array of char), and x is declared as a long.

Suppose str is stored in the register %rax, x in %rdi, c in the register %bl, and %r8 through %r15 can be used for temporary values. Which (if any) of the following are correct translations to assembly? Select all that apply.

movq $2, %r8
addq %rdi, %r8
movb (%rax, %r8, 1), %bl

movq %rdi, %r8
addq $2, %r8
movb (%r8, %rax, 1), %bl

addq $2, (%rax, %rdi)
mov (%rax, %rdi, 1), %bl

movb 1(%rax, %rdi, 2), %bl

movq $0, %r8
addq 2(%rax, %rdi, 1), %r8
movb (%r8), %bl

quiz for week 2

Question 1 (4 points)

Consider the following C snippet:

unsigned long x;
unsigned long *p;
...
*p = x * 2;

If:

x is stored in %rax,
p is stored in %r8, and
%r9 is a register available to store temporary values

then which of the following x86-64 assembly snippets are equivalent to the C statement above? Select all that apply.

movq %rax, (%r8)
leaq (%r8,2), %r8

leaq (%rax, 2), %r8
movq (%r8), %r8

leaq (%r8, 2), %r9
addq %rax, (%r9)

leaq (%rax,%rax), %r9
movq %r9, (%r8)

Consider the following assembly snippet:

start:
    addq %r8, %r9
    cmpq %r9, %r10
    jle start
end:

Question 2 (4 points) (see above)

Suppose that while this assembly snippet executes, the computations performed by the addq and cmpq instructions do not result in integer overflow (or underflow) if we interpret their operands and results as signed integers. Then, after the snippet finishes executing, what will the values of the condition codes SF and ZF be?

SF = 0 and ZF = 0
SF = 1 and ZF = 0
SF = 0 and ZF = 1
it depends on the initial values of the registers %r8, %r9, and %r10
there is not enough information provided (explain in comments)

Question 3 (4 points) (see above)

Assuming that the variables r8, r9, and r10 represent the values in the registers %r8, %r9, and %r10 respectively, which is equivalent C code to the above assembly snippet?

do { r9 += r8; } while (r10 < r9);
do { r9 += r8; } while (r10 <= r9);
do { r9 += r8; } while (r10 >= r9);
do { r9 += r8; } while (r10 > r9);
while (r10 < r9) { r9 += r8; }
while (r10 <= r9) { r9 += r8; }
while (r10 >= r9) { r9 += r8; }
while (r10 > r9) { r9 += r8; }
none of the above

Suppose we have C code in two different files. One file, upperstr.c contains the following:

void upperstr(char *str) {
    for (int i = 0; str[i] != '\0'; i += 1) {
        if (str[i] >= 'a' && str[i] <= 'z') {
            str[i] += ('A'-'a');
        }
    }
}

And another file, printbig.c contains the following:

#include <stdio.h> /* for puts() */

void printbig(char *str) {
    upperstr(str);
    puts(str);
}

An executable is made from code in these files and others. As part of that process an object file upperstr.o is produced from upperstr.c and an object file printbig.o is produced from printbig.c.

Recall from lecture that object files generally contain:

machine code
data (such as string constants and the initial values of global variables)
a table of relocations that identify values within the machine code or data that must be filled in by the linker
a symbol table that identifies the names of labels (that were declared in the original assembly file) and the locations in the machine code or data they represent. (Often these labels correspond to functions or global variables.)

Question 4 (4 points) (see above)

Ignoring information only present for debugging, we would expect upperstr.o to contain ____. Select all that apply.

an entry in its symbol table for upperstr
a relocation specifying a location that needs to contain a pointer to upperstr
an entry in its symbol table for printbig
a relocation specifying a location that needs to contain a pointer to printbig

Question 5 (4 points) (see above)

Ignoring information only present for debugging, we would expect printbig.o to contain ____. Select all that apply.

an entry in its symbol table for upperstr
a relocation specifying a location that needs to contain a pointer to upperstr
an entry in its symbol table for printbig
a relocation specifying a location that needs to contain a pointer to printbig

quiz for week 3

Question 1 (4 points)

Consider the expression

(((x & 0xFF0F) >> 2) & 0xFF0) == (((x & 0xFF0) >> 2) & 0xFF0F)

where x is a 32-bit unsigned integer. The set of values x for which the above expression is true is the same as the set of values for which the expression (x & Y) == 0 is true for some constant Y. What is the value of Y? Write your answer as a hexadecimal number.

Answer:

Consider the following incomplete C code (where ___s represent omitted constants):

const unsigned long A = ___;
const unsigned long B = ___;
unsigned long swapAndFlip(unsigned long x) {
    unsigned long low = x & 0xFF;
    unsigned long high = x >> 56;
    return ((low ^ A) << 56) | high | (x & B);
}

Suppose we want to make this function take a 64-bit integer, and return a new 64-bit integer, such that:

the most significant byte of the new integer is equal to the least significant byte of the original integer with the least and most significant bit of that byte flipped
the least significant byte of the new integer is equal to the most significant byte of the original integer
the rest of the integer is the same as the original integer.

Question 2 (4 points) (see above)

What should the value of the constant A be? Write your answer as a hexadecimal integer.

Answer:

Question 3 (4 points) (see above)

What should the value of the constant B be? Write your answer as a hexadecimal integer.

Answer:

For each of the following pairs of instruction set design choices, identify which is more consistent with the reduced instruction set computer (RISC) design philosophy.

Question 4 (2 points) (see above)

adding an instruction that finds the length of a string given a starting memory location
adding an instruction that finds the index of the first zero byte in an 8-byte value stored in a register
both are about as consistent with the RISC design philosophy

Question 5 (2 points) (see above)

adding a special "increment register by 1" instruction which needs fewer bytes of machine code than an equivalent "add constant to register" instruction
providing an "add constant to register" instruction whose machine code is always the same length, no matter how large the constant is
both are about as consistent with the RISC design philosophy

quiz for week 4

For the following questions, consider this Y86-64 machine code. (The machine code is written as a series of bytes in hexadecimal, written in order from lowest to highest memory address.)

30 f8 08 00 00 00 00 00 00 08 60 9a 50 ba 00 00 00 00 00 00 00 00
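As a reminder of the encoding convention (a sketch, not part of the quiz): in Y86-64 the first byte of an instruction packs two 4-bit fields, icode in the high-order bits and ifun in the low-order bits, and a register-specifier byte packs rA and rB the same way. The helper names below are hypothetical, and the example byte 0x73 is deliberately not taken from the snippet above.

```c
#include <stdint.h>

/* Upper 4 bits of a byte: icode in an opcode byte, rA in a register byte. */
uint8_t high_nibble(uint8_t byte) {
    return (uint8_t)(byte >> 4);
}

/* Lower 4 bits of a byte: ifun in an opcode byte, rB in a register byte. */
uint8_t low_nibble(uint8_t byte) {
    return (uint8_t)(byte & 0x0F);
}
```

For the hypothetical byte 0x73, high_nibble gives 7 and low_nibble gives 3.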

Question 1 (2 points) (see above)

What is the mnemonic for the first instruction in the above machine code snippet?

Question 2 (4 points) (see above)

The first instruction in the above machine code snippet has a 64-bit integer constant in it. What is it? Write your answer as a hexadecimal number.

Answer:

Question 3 (4 points) (see above)

What is the mnemonic for the second instruction in the above machine code snippet?

Consider the following HCLRS snippet:

register xY {
    foo : 64 = 0x0;
    bar : 64 = 0x5;
}
x_foo = Y_foo + 1;
x_bar = x_foo + Y_bar;
pc = x_foo;

Question 4 (4 points) (see above)

Since the above snippet sets pc, it will read from the instruction memory, setting i10bytes based on the contents of memory (even though the snippet does not use that value). Before the first rising edge of the clock signal, what bytes of memory will be used to construct the value of i10bytes? Identify the bytes by their memory addresses. Select all that apply.

Question 5 (4 points) (see above)

Initially, the value of the bar register is 0x5. After two rising edges of the clock signal, it will be ___. Write your answer as a hexadecimal number.

Answer:

quiz for week 5

Question 1 (6 points)

Suppose we start with the mov-to-register CPU we described in lecture which implements the Y86 instructions irmovq, mrmovq, and rrmovq. This processor design had MUXes (represented in HCLRS by case expressions) that chose between several inputs based primarily on what the current instruction was. When adding a new instruction, for some of these MUXes, we can use an existing input and modify when it is selected to include the new instruction. For others, we may need to add a new input to the MUX for the new instruction (or some other logic that achieves this effect). In a few cases, we may even need to add a new MUX (or other logic that achieves the same effect).

Suppose we added the addq instruction to the mov-to-register CPU. For which of the following parts of the circuit would we need to add a new MUX or a new MUX input (or equivalent logic) rather than modifying the conditions for selecting existing MUX inputs? Select all that apply.

controlling the input to the PC (program counter) register
controlling one or more of the ALU's inputs
controlling the memory write enable input (mem_writebit in HCLRS)
controlling one or more of the register file's "destination register index" inputs (reg_dstE and reg_dstM in HCLRS)
controlling the memory read enable input (mem_readbit in HCLRS)
controlling one or more of the register file's "register value to write" inputs (reg_inputE and reg_inputM in HCLRS)

Question 2 (5 points)

In Y86, the popq instruction is encoded using 2 bytes. Suppose we made a variant of Y86 that instead encoded popq in one byte, where the high-order four bits of the first byte would be 0xB (like popq is in normal Y86) and the low-order four bits of the first byte would be the register index corresponding to the operand to popq.

Supporting this new encoding would likely ____. Select all that apply.

require a register file with more read ports (ability to read more values at once)
require changes to the logic controlling the register source index inputs to the register file (e.g. reg_srcA and reg_srcB in HCLRS)
require changes to the logic controlling the register destination index inputs to the register file (e.g. reg_dstE and reg_dstM in HCLRS)
require changes to the logic controlling the register value inputs to the register file (e.g. reg_inputE and reg_inputM in HCLRS)
require changes to the logic controlling the ALU inputs

In the single-cycle processor design discussed in lecture, executing the code

pushq %rax
popq %rbx

involves the following operations:

1. reading the machine code for the pushq instruction
2. reading the value of %rax
3. writing a (non-machine-code) value to memory
4. writing a value to %rsp
5. reading the machine code for the popq instruction
6. reading a (non-machine-code) value from memory
7. writing a value to %rbx
8. writing a value to %rsp

Question 3 (7 points) (see above)

Which of the above operations could occur simultaneously or in any order in the single-cycle processor design? Select all that apply.

Question 4 (4 points)

Suppose we added the instruction

immovq CONSTANT, DISPLACEMENT(rA)

to the single-cycle Y86-64 processor design we discussed in lecture. This instruction would store the 64-bit value CONSTANT in memory at an address computed by adding the value of the register rA to the 64-bit constant DISPLACEMENT.

While this instruction is executing, the value input to the data memory (called mem_input in HCLRS) will be equal to ______.

one of the outputs of the register file
part of the output of the instruction memory
the input to the instruction memory
the output of the ALU (arithmetic logic unit)

quiz for week 6

Suppose we are building a two-stage pipelined Y86-64 processor from the following components:

a register file that requires 100 ps for reads and 100 ps for writes
a program counter increment circuit that requires 20 ps to compute the incremented program counter (from the icode and previous program counter)
an ALU that requires 200 ps
a data memory that requires 120 ps for reads and 150 ps for writes
an instruction memory that requires 120 ps for reads
pipeline registers and a program counter register that have some small register delay

(Assume that other components take negligible time, to keep the following questions simpler.)

Most simply, each of our two stages could perform the work of one or more stages from the five-stage pipelined design we discussed in lecture.
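The general relationship between stage delays and cycle time can be sketched as follows: ignoring register delay, the clock can tick no faster than the slowest stage allows. The function name and the delays in the usage note are hypothetical and are not derived from the component list above.

```c
#include <assert.h>

/* Cycle time (excluding pipeline/PC register delay) is set by the
 * slowest stage, since every stage must finish within one clock cycle. */
int cycle_time_ps(const int *stage_delay_ps, int num_stages) {
    assert(num_stages > 0);
    int worst = stage_delay_ps[0];
    for (int i = 1; i < num_stages; i += 1) {
        if (stage_delay_ps[i] > worst) {
            worst = stage_delay_ps[i];
        }
    }
    return worst;
}
```

With hypothetical stage delays of 300 ps and 450 ps, this returns 450. Which component delays add up within one stage depends on how the work is divided, which is what the questions below ask about.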

Question 1 (4 points) (see above)

If the first stage of the processor performs the work for Fetch and the second stage performs the work of Decode+Execute+Memory+Writeback, then the cycle time would be around _____ ps plus pipeline and/or program counter register delays.

Answer:

Question 2 (4 points) (see above)

To minimize the cycle time, these two stages should perform the work of _______ (for the first stage) and _______ (for the second stage).

Fetch / Decode+Execute+Memory+Writeback
Fetch+Decode / Execute+Memory+Writeback
Fetch+Decode+Execute / Memory+Writeback
Fetch+Decode+Execute+Memory / Writeback

Question 4 (4 points)

Suppose we construct a pipelined processor by modifying the single-cycle processor design we saw in lecture and that our pipelined processor has two stages:

Fetch and Decode; and
Execute and Memory and Writeback

With these stages, the pipeline registers should contain (at least when certain instructions are being run) ____. Select all that apply.

the opcode (icode) field from the instruction or equivalent information
one or more register indices from the instruction
the result of an ALU operation
the values of registers read from the register file

quiz for week 7

For the following questions, consider the following assembly code:

addq %r8, %rcx                 
mrmovq 8(%rcx), %rax           
subq %rcx, %rax

Question 1 (4 points) (see above)

Suppose the above assembly code was executed on a six-stage pipelined processor with the following pipeline stages:

Fetch
Decode
Execute
Memory 1
Memory 2
Writeback

The result of reading the data memory is not available until near the end of the Memory 2 stage. The processor uses a combination of stalling and forwarding to resolve data hazards. It supports all forwarding paths that would not result in a large increase in cycle time.

If the addq instruction completes its fetch stage during cycle 1, then the subq instruction will complete its writeback stage during cycle ___.

Answer:

Question 2 (4 points) (see above)

Suppose the above assembly code was executed on a seven-stage pipelined processor, where the pipeline stages are:

Fetch
Decode
Execute 1
Execute 2
Execute 3
Memory
Writeback

In this processor, the result of an ALU operation (addition or subtraction) is not available until near the end of the Execute 3 stage, but the inputs to the operation must be available near the beginning of the Execute 1 stage. This processor uses a combination of stalling and forwarding to resolve hazards. It supports all forwarding paths that would not result in a large increase in cycle time.

If the addq instruction completes its fetch stage during cycle 1, then the subq instruction will complete its writeback stage during cycle ____.

Answer:

Question 3 (6 points)

Consider the following assembly code:

addq %r8, %r9
mrmovq 16(%r8), %r9
subq %r9, %r10
xorq %r9, %r11
rrmovq %r9, %r12

When the above instructions are executed on a five-stage pipelined processor which uses forwarding to resolve hazards in a way that minimizes stalling, as we discussed in lecture, which of the following forwarding operations must occur? Select all that apply.

of the value of %r8 from addq to mrmovq
of the value of %r9 from addq to mrmovq
of the value of %r9 from addq to subq
of the value of %r9 from mrmovq to subq
of the value of %r9 from mrmovq to xorq
of the value of %r9 from mrmovq to rrmovq

quiz for week 8

Consider a 7-stage pipelined processor with the following stages:

Fetch
Decode 1
Decode 2
Execute 1
Execute 2
Memory
Writeback

Assume these stages do the same thing as in the 5-stage processor we discussed in lecture, except that the decode and execute stages are each split into two stages. For the split decode stage, the inputs to the original decode stage (like register indices) need to be available near the beginning of the decode 1 stage and outputs of the original decode stage (like register values) are only available near the end of the decode 2 stage. The same is true for the execute 1 and execute 2 stages.

Assume the processor implements whatever forwarding is possible without dramatically increasing cycle time.

Question 1 (4 points) (see above)

Suppose the 7-stage processor is used to execute the following:

     addq %rax, %rcx       
     jle foo             
     rmmovq %rax, 8(%rax)
     subq %rcx, %rdx
     halt
foo:
     xorq %r8, %r9
     xorq %r10, %r11
     xorq %r12, %r13
     xorq %r8, %r9
     xorq %r10, %r11
     xorq %r12, %r13

The processor implements jle using branch prediction that predicts the branch as always taken. The actual outcome of the branch is computed during the conditional jump's execute 2 stage (using the condition codes updated by the previous instruction's execute 2 stage) and can be used to determine which instruction to fetch in the next cycle (but not earlier).

When the assembly above is executed, the jle is not taken, so some xorq instructions are fetched and then squashed. If the addq instruction is fetched during cycle 1, then the rmmovq instruction will be fetched during cycle ___.

Answer:

Question 2 (4 points)

Consider the five-stage pipelined processor design discussed in lecture but with a different branch prediction strategy: instead of predicting conditional jumps as always taken, predict them as taken if they target an address that is less than the jump instruction's and as not taken if they target an address that is greater than the jump instruction's. (This strategy is called "forward not-taken, backwards taken".)

In this processor, incorrectly predicted branches require 2 extra cycles, instructions that use in the ALU a value loaded from memory by the previous instruction require 1 extra cycle, and returns require 3 extra cycles. (No other cases require extra cycles due to hazards.)
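A rough sketch of how such penalties combine into an average: assuming the pipeline otherwise completes one instruction per cycle (an assumption of this sketch), each kind of hazard adds its extra cycles weighted by the fraction of instructions that trigger it. The struct, the function name, and the fractions in the usage note are hypothetical and are not the figures from this question.

```c
/* One kind of hazard: how often it occurs and what it costs. */
typedef struct {
    double fraction;     /* fraction of all instructions that trigger it */
    double extra_cycles; /* extra cycles each occurrence costs */
} hazard;

/* Average cycles per instruction, assuming a base of 1 cycle per
 * instruction when no hazard occurs (assumption of this sketch). */
double average_cpi(const hazard *hazards, int count) {
    double cpi = 1.0;
    for (int i = 0; i < count; i += 1) {
        cpi += hazards[i].fraction * hazards[i].extra_cycles;
    }
    return cpi;
}
```

For example, if a hypothetical 5% of instructions cost 2 extra cycles and 20% cost 1 extra cycle, the result is 1 + 0.05 * 2 + 0.20 * 1 = 1.3. Note that a fraction given relative to a subset (such as "of conditional jumps") has to be multiplied through to a fraction of all instructions first.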

Suppose the instructions run in a program are:

10% conditional jumps, of which
- 40% are taken and jump to an address that is less than the jump instruction's
- 20% are taken and jump to an address that is greater than the jump instruction's
- 10% are not taken and jump to an address that is less than the jump instruction's
- 30% are not taken and jump to an address that is greater than the jump instruction's
10% instructions that use (for an ALU operation) a value loaded from memory by the previous instruction; and
no ret instructions

About how many cycles per instruction, on average, would the program experience on the processor described above?

Answer:

Consider the following C++ snippet:

for (int i = 0; i < 10000; ++i) {
    A[i] += B[i];
    A[i] *= C[i];
}
std::cout << "last A is " << A[9999] << std::endl;

Question 3 (2 points) (see above)

The accesses to elements of A will have _____ temporal locality than the accesses to i.

less
about the same (or there's not enough information)
more

Question 4 (2 points) (see above)

Assuming about the same number of bytes of machine code are required for each, the accesses to the machine code for the += and *= statements in the loop will have _____ spatial locality than the machine code for the std::cout statement.

less
about the same (or there's not enough information)
more

Question 5 (2 points) (see above)

Assuming about the same number of bytes of machine code are required for each, the accesses to the machine code for the += and *= statements in the loop will have _____ temporal locality than the machine code for the std::cout statement.

less
about the same (or there's not enough information)
more

Question 6 (5 points)

Suppose the contents of a four-entry direct-mapped cache are as follows:

index (binary)	valid	tag (binary)	data bytes (hexadecimal)
00	1	001	`1F 3F 5F 7F`
01	1	110	`2E 3E 4E 5E`
10	1	001	`33 44 55 8F`
11	1	010	`9E AE BF 01`

(In the table above, data bytes are listed with the lowest address's value first (left-most).)

The byte of memory at address 0x11 (binary: 10001) is stored in this cache and has the value 0x3F.
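The statement about address 0x11 can be checked by splitting the address into tag, index, and offset fields. The sketch below assumes 2 offset bits (4-byte blocks) and 2 index bits, as read off the table above; the type and function names are hypothetical.

```c
#include <stdint.h>

/* Field widths read off the table: 4 data bytes per block, 2-bit index. */
enum { OFFSET_BITS = 2, INDEX_BITS = 2 };

typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} cache_addr;

/* Split an address into the fields a direct-mapped cache uses. */
cache_addr split_address(uint32_t addr) {
    cache_addr fields;
    fields.offset = addr & ((1u << OFFSET_BITS) - 1);
    fields.index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    fields.tag = addr >> (OFFSET_BITS + INDEX_BITS);
    return fields;
}
```

For 0x11 (binary 1 00 01) this yields tag 1, index 0, offset 1, which matches the row with index 00 and tag 001, whose byte at offset 1 is 3F.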

Bytes from which of the following memory addresses are also stored in the cache shown above? (Note that addresses may be written without all leading 0s.) Select all that apply.

quiz for week 9

Consider a two-way set associative cache with the following contents:

set index	valid	tag	data bytes	valid	tag	data bytes
0	1	0xA20	`1F 3F 2A 93 27 00 45 8A`	1	0xC20	`33 44 55 92 91 90 0A 0C`
1	0	—	—	1	0xC20	`0F 1C 11 12 13 14 15 16`
2	1	0xC20	`20 21 2F 2A 40 41 55 98`	1	0x332	`95 33 12 59 43 3A 3C 3D`
3	0	—	—	0	—	—

In the representation above, the data bytes are listed in hexadecimal, starting with the byte with the lowest memory address.

Question 1 (4 points) (see above)

The data byte 1C in the second way of the set with index 1 was retrieved from memory at address _____. Write your answer as a hexadecimal number.

Answer:

Question 2 (4 points) (see above)

Give an example of an address that, if accessed, would result in a value being replaced in the cache above. Write your answer as a hexadecimal number.

Answer:

Question 3 (4 points)

Suppose a program accesses a cache in the following pattern:

it reads a 4-byte array A, one byte at a time, from beginning to end; then
it overwrites the same array A, one byte at a time, from end to beginning; then
it repeats the above steps 9 more times; then
it reads a different 4-byte array B, one byte at a time, from beginning to end

Assume the above accesses use a cache that is not also used for other accesses and that the cache is initially empty.

If the cache used for the accesses above is a 4-byte, direct-mapped cache with a write-through, write-no-allocate policy and 2-byte blocks, how many accesses (reads or writes) to the next level of cache (or main memory if there is no next level) will be performed when the above cache accesses happen?

Answer:

Consider a 4MB (4194304 byte), 4-way set-associative cache with 64-byte blocks, a first-in, first-out replacement policy, and a write-back, write-allocate policy.
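As a sketch of how an address is divided for a set-associative cache, the helper below derives the offset, index, and tag widths from the cache geometry. The names are hypothetical, the sizes are assumed to be powers of two, and the usage note uses a different, made-up cache rather than the one described above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned offset_bits;
    unsigned index_bits;
    unsigned tag_bits;
} cache_bits;

/* log2 of a power of two (assumption: x is a nonzero power of two). */
static unsigned log2_exact(uint64_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n += 1;
    }
    return n;
}

/* Widths of the address fields for a cache of the given geometry. */
cache_bits address_bits(uint64_t cache_bytes, uint64_t ways,
                        uint64_t block_bytes, unsigned addr_bits) {
    cache_bits bits;
    uint64_t sets = cache_bytes / (ways * block_bytes);
    bits.offset_bits = log2_exact(block_bytes);
    bits.index_bits = log2_exact(sets);
    bits.tag_bits = addr_bits - bits.index_bits - bits.offset_bits;
    return bits;
}
```

For a hypothetical 32768-byte, 8-way cache with 64-byte blocks and 48-bit addresses, there are 64 sets, so the fields are 6 offset bits, 6 index bits, and 36 tag bits.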

Suppose the cache is initially empty (all valid bits and dirty bits set to 0), then the following accesses happen:

write to 0x4444000
write to 0x4444008
read from 0x4444000
read from 0x4444800
read from 0x4443320
write to 0x4445000
read from 0x8445000

Question 4 (4 points) (see above)

Assuming 64-bit memory addresses, how many bits does the cache have (in total, across the whole cache) available to store tags?

Answer:

Question 5 (3 points) (see above)

After the accesses above, how many dirty bits will be set to 1?

Answer:

Question 6 (3 points) (see above)

After the accesses above, how many valid bits will be set to 1?

Answer:

quiz for week 10

Question 3 (4 points)

Suppose we have a 2-way, 2-set data cache with 8-byte blocks and an LRU replacement policy and run the following C code:

int array[12];
... /* code omitted */
int sum = 0;
for (int j = 0; j < 2; j += 1) {
    for (int i = 0; i < 4; i += 1) {
        if (sum > array[i * 3 + j]) {
            sum += array[i * 3 + j];
        }
    }
}

Assume that:

the compiler does not reorder operations in the loop;
only accesses to array in the loops use the data cache;
the data cache is initially empty when the loop starts;
the address of array[0] is a multiple of 1024

How many data cache misses will occur when the loops above run?

[Note that unlike some of the examples in lecture, the number of misses for each of the two sets of the cache may be different.]

Answer:

Consider the following C snippets:

/* version A */
for (int i = 0; i < N; i += 1) {
    for (int j = 0; j < i; j += 1) {
        A[i * N + j] = D[j * N + i] + B[i] * C[j];
    }
}

/* version B */
for (int j = 0; j < N; j += 1) {
    for (int i = 0; i < j; i += 1) {
        A[i * N + j] += D[j * N + i] + B[i] * C[j];
    }
}

Question 4 (4 points) (see above)

Which version has better spatial locality for accesses to A?

version A
version B
they are about the same

Question 5 (4 points) (see above)

If the cache can store four array elements in each cache block (for each of the arrays A, B, C, or D) and the cache is too small to store N elements of any of the arrays, approximately how many cache misses would we expect to occur in version A for accesses to the array B only? (N^2 below represents N squared.)

Question 6 (4 points) (see above)

If the cache can store four array elements in each cache block (for each of the arrays A, B, C, or D) and the cache is too small to store N elements of any of the arrays, approximately how many cache misses would we expect to occur in version A for accesses to the array D only? (N^2 below represents N squared.)

quiz for week 11

Question 1 (4 points)

Consider the following C code:

for (int i = 0; i < N; i += 1) {
    for (int j = 0; j < i; j += 1) {
        A[i] += B[i * N + j] * C[j * N + i];
    }
}

Consider the following attempts at optimizing the above C code:

/* version A */
for (int ii = 0; ii < N; ii += 2) {
    for (int i = 0; i < ii + 2 && i < N; i += 1) {
        for (int jj = 0; jj < N; jj += 2) {
            for (int j = 0; j < jj + 2 && j < i; j += 1) {
                A[i] += B[i * N + j] * C[j * N + i];
            }
        }
    }
}

/* version B */
for (int ii = 0; ii < N; ii += 2) {
    for (int jj = 0; jj < ii + 2; jj += 2) {
        for (int i = 0; i < ii + 2 && i < N; i += 1) {
            for (int j = 0; j < jj + 2 && j < i; j += 1) {
                A[i] += B[i * N + j] * C[j * N + i];
            }
        }
    }
}

/* version C */
for (int i = 0; i < N; i += 1) {
    int j = 0;
    for (; j + 1 < i; j += 2) {
        A[i] += B[i * N + j] * C[j * N + i];
        A[i] += B[i * N + j+1] * C[(j+1) * N + i];
    }
    if (j < i)
        A[i] += B[i * N + j] * C[j * N + i];
}

/* version D */
for (int j = 0; j < N - 1; j += 1) {
    int i;
    for (i = 0; i + 1 <= j ; i += 2) {
        A[i] += B[i * N + j] * C[j * N + i];
        A[i] += B[(i+1) * N + j] * C[j * N + i+1];
    }
    if (i <= j)
        A[i] += B[i * N + j] * C[j * N + i];
}

Assuming N is very large (relative to the cache size), which of the above attempts at such optimizations are most likely to be successful at reducing the number of data cache misses?

version A
version B
version C
version D

Question 2 (4 points)

Consider the following assembly code:

xorq %rbx, %r8
andq %rbx, %r9
subq %r8, %rbx
addq %rbx, %r9

After register renaming, the physical register that andq reads to obtain the value of %rbx will be the same as ____. Select all that apply.

the physical register that xorq reads %rbx from
the physical register that subq reads %rbx from
the physical register that subq writes to
the physical register that addq reads %rbx from

Consider the following assembly code:

    xorq    %rax, %rax
outer_loop:
    movq    %rax, %r8           /* A */
    movq    %rax, %rcx          /* B */
    imul    $8192, %r8          /* C */
    addq    %rdi, %r8           /* D */
inner_loop:
    movq    (%rdx,%rcx,8), %r9  /* E */
    movq    (%rsi,%rax,8), %r10 /* F */
    addq    %r10, %r9           /* G */
    movq    %r9, (%r8,%rcx,8)   /* H */
    incq    %rcx                /* I */
    cmpq    $1024, %rcx         /* J */
    jne     inner_loop          /* K */
    incq    %rax                /* L */
    cmpq    $1024, %rax         /* M */
    jne     outer_loop

In this code there are two nested loops: the outer one uses %rax as its index variable, and the inner one uses %rcx as its index variable.

Question 3 (4 points) (see above)

In an out-of-order processor with branch prediction and several execution units (such as ALUs and data cache read/write ports) of each type, multiple instances of the addq instruction labeled G (from different iterations of the inner loop) can be performed _____.

(Assume this out of order processor can perform loads and stores out-of-order.)

only one at a time
in parallel
there is not enough information to tell (explain in comments)

Question 4 (4 points) (see above)

In order to unroll the inner loop in the code above, one would most likely duplicate some of the instructions above (possibly with changes to where they expect their operands, etc.) and/or convert some of them to a form that performs the same work that multiple copies of the instruction would perform (such as converting an incq %rax into an addq $2, %rax).

In order to unroll the inner loop, which instructions would be so duplicated or transformed? (Instructions are identified by the letters in comments above.)

Question 5 (3 points) (see above)

After loop unrolling is performed, the addq instruction labeled G would ______ from a multiple accumulators transformation.

benefit
not benefit

quiz for week 12

Question 1 (4 points)

Consider the following assembly function 'satSum':

satSum:
    movq %rdi, %rax
    addq %rsi, %rax
    cmpq %rsi, %rax
    jb retMax
    cmpq %rdi, %rax
    jb retMax
    ret
retMax:
    movq $-1, %rax
    ret
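For readers who prefer C, the assembly above corresponds roughly to the sketch below. It assumes the two arguments arrive in %rdi and %rsi as unsigned 64-bit values (the usual System V convention) and that jb is an unsigned "below" comparison; the function name sat_sum_sketch is hypothetical.

```c
#include <limits.h>

/* Saturating unsigned addition: if a + b wraps around, return the
 * largest unsigned long (the all-ones value that movq $-1 produces). */
unsigned long sat_sum_sketch(unsigned long a, unsigned long b) {
    unsigned long sum = a + b;  /* movq %rdi, %rax ; addq %rsi, %rax */
    if (sum < b) {              /* cmpq %rsi, %rax ; jb retMax */
        return ULONG_MAX;
    }
    if (sum < a) {              /* cmpq %rdi, %rax ; jb retMax */
        return ULONG_MAX;
    }
    return sum;
}
```

For small inputs such as 1 and 2 this returns 3; when the sum would wrap (for example ULONG_MAX plus 1) it returns ULONG_MAX.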

Inlining this function would most likely allow an optimizing compiler to avoid including an instruction very similar to ___. Select all that apply.

the first movq
the first jb
the first ret
the second movq

For the following questions, consider the following C snippet:

for (int i = 0; i < N; i += 1) {
    for (int j = 0; j < N; j += 1) {
        A[i] += (A[i] - B[j]) * C[j];
    }
}

where A, B, and C are declared as int*s and N is declared as an int.

Question 2 (4 points) (see above)

On the snippet above, which of the following optimizations cannot be performed without adding extra checks due to aliasing? Select all that apply.

unrolling the innermost loop
loading and storing A[i] at most once per iteration of the outermost loop (rather than N times)
swapping the i and j loops to increase temporal locality in B and C
cache blocking to improve cache locality

Question 3 (4 points) (see above)

On the snippet above, which of the following optimizations would be likely to result in substantially worse performance if N was very small (but still varied between executions of the code above)? Select all that apply.

unrolling the innermost loop
loading and storing A[i] at most once per iteration of the outermost loop (rather than N times)
swapping the i and j loops to increase temporal locality in B and C
cache blocking to improve cache locality

Question 4 (4 points) (see above)

On the snippet above, which of the following optimizations would be likely to significantly increase the amount of machine code used to store the above loops? Select all that apply.

unrolling the innermost loop
loading and storing A[i] at most once per iteration of the outermost loop (rather than N times)
swapping the i and j loops to increase temporal locality in B and C
cache blocking to improve cache locality

quiz for week 14

Question 1 (4 points)

The vector instructions we discussed in lectures supported loading a vector of values from (or storing a vector of values to) a single location in memory. In addition to these types of vector load and store instructions, some vector instruction sets include support for an instruction that takes a vector of memory locations and fills a vector by loading an element from each of those memory locations (or stores a vector by storing an element to each of those locations). These more featureful vector load and store instructions are typically substantially slower than those that load from (or store to) a single location.
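
The behavior of these more featureful loads and stores (often called gather and scatter) can be modeled in scalar C as in the sketch below. The function names and the 8-element vector width are assumptions chosen for illustration and do not correspond to any particular instruction set.

```c
#include <stddef.h>

#define LANES 8  /* hypothetical vector width, in elements */

/* Scalar model of a "gather" load: lane k is loaded from base[index[k]]. */
void gather_model(int dst[LANES], const int *base, const int index[LANES]) {
    for (size_t k = 0; k < LANES; k += 1) {
        dst[k] = base[index[k]];
    }
}

/* Scalar model of a "scatter" store: lane k is stored to base[index[k]]. */
void scatter_model(int *base, const int index[LANES], const int src[LANES]) {
    for (size_t k = 0; k < LANES; k += 1) {
        base[index[k]] = src[k];
    }
}
```

By contrast, the single-location loads and stores from lecture behave as if index[k] were always k.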

For which of the following snippets would these special vector load and store instructions be useful when vectorizing? Select all that apply.

for (int i = 0; i < 8192; i += 1) { A[i] += B[i-1] + B[i] + B[i+1]; }
for (int i = 0; i < 8192; i += 1) { for (int j = 0; j < 8192; j += 1) { A[i*8192+j] = B[i*8192 + j] * C[j]; } }
for (int i = 0; i < 8192; i += 1) { B[i] = A[T[i]]; }
for (int i = 0; i < 8192; i *= 3) { A[i] *= B[i]; }

Question 2 (6 points)

Consider a single-core system with an operating system that uses time multiplexing (with context switching) to share the processor between multiple processes.

Initially, process A is running the following assembly snippet:

movq $100, %rax
movq $200, %rcx
addq %rcx, %rax

After the second movq and before the addq instruction can complete, the operating system switches to process B. After process B starts running, ____. Select all that apply.

the value 200 is stored in one or more of the processor's registers
the value 200 is stored somewhere in memory managed by the operating system
the value 200 is stored on the stack of process A
the value 300 is stored in one or more of the processor's registers
the value 300 is stored somewhere in memory managed by the operating system
the value 300 is stored on the stack of process A

Suppose a process attempts to execute the following, whose machine code is located in memory at address 0x200000:

movq $0x100000, %r8
movq (%r8), %r9
addq %r9, %r10

Suppose the memory address 0x100000 is not accessible, so accessing it causes a crash in the second movq instruction.

[editing note: originally the above sentence had some minor typos: it erroneously duplicated "causes a crash" and had the wrong number of 0s in 0x100000]

Question 3 (4 points) (see above)

An exception will occur as part of the execution of the above. When it does, part or parts of the exception table will be used. These part(s) will contain ____. Select all that apply.

the address 0x100000
the address of the instruction that triggered the exception
the address of an instruction containing operating system code
the value of %rax (the register generally used for return values)

Suppose an operating system with a single core is running two processes.

Initially process A is active, and it attempts to read from a file. The data from the file is not yet available because it has not been read from the disk. While it is being read from the disk, process B runs, performing a part of a long computation. Then process A finishes reading from the file.

Question 4 (4 points) (see above)

During this procedure, there is a context switch from process A to process B. This context switch most likely occurs ____.

while the exception handler for an interrupt triggered by the disk device is running
while the exception handler for a page fault or protection fault is running
while the exception handler for a system call is running
while process A is executing a function in the standard library in user mode

Question 5 (4 points) (see above)

Suppose process B stops running and process A is restarted almost immediately after the data is retrieved from the file. For this to happen, most likely ___.

process B was running code that checked whether process A could start running again
the disk device triggered an exception that started the operating system running when the data was ready
the processor still had process A's previous register values stored in a backup set of registers and switched to them when it received a signal from the disk device that the disk read was complete

quiz for week 15

Consider a system with 2048-byte (2 to the 11th power bytes) pages, and the following page table:

virtual page #	valid	physical page #
0x0	0	—
0x1	1	0x34
0x2	1	0x35
0x3	1	0x36
0x4	1	0x12
0x5	1	0x10
0x6	1	0x06
0x7	1	0x04

(Since this system has 2048 byte pages, page offsets are 11 bits.)
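
The lookup this table implies can be sketched in C as follows. The FAULT marker, the array names, and the choice to treat virtual pages beyond the eight listed rows as invalid are assumptions made for illustration.

```c
#include <stdint.h>

#define OFFSET_BITS 11u         /* 2048-byte pages */
#define NUM_PAGES   8u          /* rows listed in the table above */
#define FAULT       UINT64_MAX  /* hypothetical marker for "exception" */

/* Valid bit and physical page number for virtual pages 0x0 through 0x7. */
static const int      valid[NUM_PAGES] = {0, 1, 1, 1, 1, 1, 1, 1};
static const uint64_t ppn[NUM_PAGES]   = {0, 0x34, 0x35, 0x36, 0x12, 0x10, 0x06, 0x04};

uint64_t translate(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> OFFSET_BITS;               /* upper bits */
    uint64_t offset = vaddr & ((1u << OFFSET_BITS) - 1);  /* low 11 bits */
    if (vpn >= NUM_PAGES || !valid[vpn]) {
        return FAULT;  /* assumption: unlisted pages are treated as invalid */
    }
    return (ppn[vpn] << OFFSET_BITS) | offset;
}
```

As a usage example under these assumptions, translate(0x2ABC) splits the address into virtual page 0x5 and offset 0x2BC, and returns 0x82BC because virtual page 0x5 maps to physical page 0x10.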

Question 1 (4 points) (see above)

If a program running on this system accesses address 0x1432, then what physical address will be accessed? Write your answer as a hexadecimal number. If an exception will occur instead, write "fault".

Answer:

Suppose a system has the following virtual memory configuration:

one-level page tables, stored in memory
4096-byte pages (and therefore 12-bit page offsets),
4-byte page table entries, which have:
- a valid bit as their least significant bit,
- a user-mode-accessible bit as their second least significant bit,
- a writable bit as their third least significant bit,
- an executable bit as their fourth least significant bit, and
- a 20-bit physical page number as their most significant 20 bits
a page table base pointer set to physical byte address 0x400000 (which has physical page number 0x400)
- (This means that the page table entry for virtual page 0 is at address 0x400000 and the page table entry for virtual page 1 is at address 0x400004 and so on.)
28-bit virtual addresses
32-bit physical addresses
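
The page table entry layout and the entry addresses described above can be sketched in C as follows. The struct, the function names, and the sample entry value used in the note after the code are hypothetical and are not taken from any question.

```c
#include <stdint.h>

#define PT_BASE  0x400000u  /* page table base pointer from the configuration */
#define PTE_SIZE 4u         /* bytes per page table entry */

/* Decoded view of one 4-byte page table entry (illustrative struct). */
struct pte_fields {
    unsigned valid, user, writable, executable;
    uint32_t ppn;
};

/* Physical address of the page table entry for a virtual page number. */
uint32_t pte_address(uint32_t vpn) {
    return PT_BASE + vpn * PTE_SIZE;
}

struct pte_fields decode_pte(uint32_t pte) {
    struct pte_fields f;
    f.valid      = (pte >> 0) & 1;  /* least significant bit */
    f.user       = (pte >> 1) & 1;  /* second least significant bit */
    f.writable   = (pte >> 2) & 1;  /* third least significant bit */
    f.executable = (pte >> 3) & 1;  /* fourth least significant bit */
    f.ppn        = pte >> 12;       /* most significant 20 bits */
    return f;
}
```

For instance, a made-up entry value of 0x0001200D decodes to valid, not user-mode-accessible, writable, and executable, with physical page number 0x12.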

Question 2 (4 points) (see above)

When a program accesses a virtual address, the processor reads a page table entry from physical address 0x402000 and then reads a value from physical address 0x201234. What virtual address did the program access? Write your answer as a hexadecimal number. If not enough information is provided, write "unknown" and explain in the comment.

Answer:

Question 3 (3 points) (see above)

Suppose a program accesses virtual address 0x1007. As part of trying to resolve this access, the processor reads a page table entry which, when interpreted as a 4-byte integer, is 0x00099003. Which of the following is true about this virtual memory access? Select all that apply.

if the program is trying to write to this address, it will result in an exception
if the program is trying to jump to code at this address, it will result in an exception
if the program is running in user-mode and trying to read from this address, it will result in an exception

Suppose a system has the following virtual memory configuration:

three-level page tables, stored in memory
4096-byte pages (so page offsets are 12 bits)
8-byte page table entries
page tables with 512 (2 to the 9th power) entries at each level (so each page table at each level is 4096 bytes, and the parts of the virtual page number used for lookups at each level are 9 bits)
a page table base pointer set to physical byte address 0x400000 (which has physical page number 0x400)
39-bit virtual addresses
36-bit physical addresses
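
How a virtual address is split under this configuration, and where an entry lives within a page table, can be sketched in C as follows. The function names and the sample address in the note after the code are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

#define OFFSET_BITS 12u  /* 4096-byte pages */
#define INDEX_BITS  9u   /* 512 entries per page table */
#define PTE_SIZE    8u   /* bytes per page table entry */

/* Index into the page table at a given level
   (1 = first level, 3 = last level). */
uint64_t level_index(uint64_t vaddr, unsigned level) {
    unsigned shift = OFFSET_BITS + INDEX_BITS * (3u - level);
    return (vaddr >> shift) & ((1u << INDEX_BITS) - 1);
}

/* Physical address of an entry, given the physical page number of the
   page table that holds it and the index within that table. */
uint64_t entry_address(uint64_t table_ppn, uint64_t index) {
    return (table_ppn << OFFSET_BITS) + index * PTE_SIZE;
}
```

For a made-up virtual address 0xC0A07123, the three indices are 3, 5, and 7, and the first-level entry would be read from entry_address(0x400, 3), which is 0x400018.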

Question 4 (4 points) (see above)

While executing a move instruction that copies from address 0x123456789A into a register, the processor looks up a first-level page table entry at address 0x400240. That page table entry has its valid bit and user-mode-accessible bit set, and a physical page number of 0x456. From what physical address will the processor try to read the second-level page table entry for this address?

Remember to account for page table entries being larger than 1 byte.

Write your answer as a hexadecimal number. If not enough information is provided, write "unknown" and explain in the comments. If an exception will occur instead, write "unknown" and explain in the comments.

Answer: