quiz for week 1

Answer the following questions about this week's lecture material.

Question 1 (4 pt; mean 3.06)

In lecture, we discussed the idea of a processor that was connected to memory and I/O devices using a shared memory bus. Which of the following statements are true about how the processor using this memory bus would work? Select all that apply.

  1. nothing the memory can do with this information

  2. can't know what to retrieve until it interprets the machine code

  3. keyboard connected to processor via memory bus, so only way to get this info; [edited by CR 2021-09-06:] the textbook uses different terminology than we used in lecture (calling the connection between the processor and the I/O bridge the system bus and the connection between the I/O bridge and the memory the memory bus), which caused some students confusion. I think this question is entirely unambiguous about what terminology it was using, so the correct answer was clear. However, in light of the confusion, I've decided to drop this quesiton.

  4. %rax is not stored in memory

Question 2 (4 pt; mean 3.57)

On a little-endian system, if the value 0x1234ABC was stored starting at address 0x101 as a 4-byte integer, then afterwards the byte stored at address 0x103 would have what value? (Assume, as we described in lecture, memory is organized similar to an array of bytes where each byte has its own address.) Write your answer as a hexadecimal number. If not enough information is supplied, write unknown and explain in the comment.

Answer:
Key: 0x23

stored as 0xBC (in 0x101), 0x4A (in 0x102), 0x23 (in 0x103) and 0x01 (in 0x104)

For the following questions, consider the following AT&T syntax x86-64 assembly instruction:

movq (%rax, %rbx, 4), %rcx

And assume that, just before this instruction executes, registers have values as follows:

%rax0x1000
%rbx0x8000
%rcx0x79
%rsp0xF00000
Question 3 (2 pt; mean 1.78) (see above)

When the above instruction is executed, it will use a memory address. What memory address will it use? Write your answer as a hexadecimal number.

Answer:
Key: 0x21000

0x1000 + 0x8000 * 4

Question 4 (3 pt; mean 2.73) (see above)

When the above instruction executes successfully, it will ____.

Question 5 (4 pt; mean 3.66)

Consider the following assembly snippet:

addq %rbx, %rbx
addq %rbx, %rbx
addq %rax, %rbx
movq (%rbx), %rbx

Which of the following assembly snippets would result the same values in %rax and %rbx as the above assembly snippet? Select all that apply.

  1. never reads from memory

addq %rbx, %rbx --> rbx = initial rbx * 2; addq %rbx, %rbx --> rbx = initial rbx * 4; addq %rax, %rbx --> rbx = initial rbx * 4 + rax; movq (%rbx), %rbx --> rbx = memory[initial rbx * 4 + rax]

quiz for week 2

Answer the following questions about the lecture material for week 2.

For each of the following assembly snippets, identify which are possible values of the ZF and SF condition codes.

Question 1 (2 pt; mean 1.8) (see above)
    cmpq $0, %rax
    jl skip
    imulq $-1, %rax
skip:
    testq %rax, %rax

I should have been more explicit that I only cared about the value when the snippet finishes, but I think/hope that's the only sensible interpretation. The condition code's final value is determined by the testq insturction. (How the cmpq or imulq do or do not set condition codes does not matter.) If rax is initially positive, itu becomes negative when the testq instruction runs; if rax is initially negative, it remains negative (the jl skips the imulq); if rax is initially zero, it remains zero.

Question 2 (2 pt; mean 1.86) (see above)
    movq $0, %rax
    cmpq %rax, %rax
    leaq 0x12345678(,%rax,4), %rax

the last leaq won't change the flags

Question 3 (5 pt; mean 4.83)

Consider the following C snippet:

while (rax >= 1 && rax < 10) {
    rax = rax - r8;
}

and the following assembly snippets:

/* snippet A */
loop:
    cmpq $0, %rax
    jle done
    cmpq $10, %rax
    jge done
    subq %r8, %rax
    jmp loop
done:

/* snippet B */
    jmp start
loop:
    subq %r8, %rax
start:
    cmpq $1, %rax
    jl done
    cmpq $10, %rax
    jl start
done:

/* snippet C */
loop:
    cmpq $1, %rax
    cmpq $10, %rax
    jl done
    jge done
    subq %r8, %rax
    jmp loop
done:

/* snippet D */
loop:
    cmpq $1, %rax
    jl done
    cmpq $11, %rax
    jg done
    subq %r8, %rax
    jl loop
done:

Assuming the variable rax is stored in the register %rax and the variable r8 is stored in the register %r8, which snippets are correct assembly translations of the loop? Select all that apply.

  1. infinite loop

  2. second cmp overwrites first

  3. 11 should be 9

For the following questions, consider static linking using object files as discussed in lecture.

Suppose one converts the following two assembly files into an executable (where ... represents some code that is not shown, and .asciz is an assembly directive to generate a nul-terminated ASCII string, and .global LABEL is a directive that causes the assembler to ensure that a specified label LABEL can be referenced from other files):

Assembly file main.s:

.text
.global main
main:
    mov $0, %r11
    mov $10, %r10
loop:
    mov $percent_d, %rdi
    mov %r11, %rsi
    call printf
    incq %r11
    decq %r10
    jge loop
    mov $0, %rax
    ret

.data
.global percent_d
percent_d:
    .asciz "%d"

Assembly file printf.s:

.text
.global printf
printf:
    movb (%rdi), %al
    ...
    ...
    ret

.data
...
Question 4 (4 pt; mean 3.33) (see above)

The object file generated from main.s will include metadata (outside of any machine code) specifying ____. Select all that apply.

  1. not known until the executable is generated, at least

  2. in the symbol table entry for percent_d

  3. in a relocation table entry

  4. not needed for anything the linker needs to do

Question 5 (4 pt; mean 2.59) (see above)

In the object files generated from the assembly files above, a symbol table entry for printf would most likely ____. Select all that apply.

  1. relocation table would have this information

  2. where the printf label points

  3. not specified in assembly, not needed to resolve call instruction

  4. not specified in assembly, not needed to resolve call instruction

quiz for week 3

Question 1 (4 pt; mean 3.36)

If x and y are 64-bit signed integers between -100000000000 and 100000000000 (on a C implementation that uses two's complement and implements signed right shift using arithmetic shift and where other constants are 64-bits), then which of the following C expressions are always true? Select all that apply.

  1. it may at first seem that this translates to (x + y) / (2 to the 12th power) ?= x / (2 to the 20th power) + y / (2 to the 10th power), which is mathematically true, but that doesn't hold if one takes into account rounding. For example, if x and y are ((1 << 22) - 1), then (x >> 22) and (y >> 22) are 0, so the right-hand side is 0, but the left-hand side is 1.

  2. was originally miskeyed; meant to write ^ 0x4FF0 on the right hand side

  3. if x = 0 and y = 8, then the left-hand side is 0xF000 ^ 8 = 0xF008, and the right-hand-side is (0xF000 | 0 | 0x8000) == 0xF000

Question 2 (4 pt; mean 2.59)

Consider the following C function:

long foo(long x) {
    return (x >> 40) + x;
}

Assume this runs on a system where:

If overflow does not occur, what is an argument x that will result in foo returning the most negative number possible? (Overflow includes adding two negative numbers and getting a positive number or adding two positives to get a negative. It does not including right shift producing a rounded result for division by a power of two) Write your answer as a hexadecimal number.

Answer:
Key: -0x7FFFFFFFFF800000 (preferred) or 0x8000000000800000

Originally, I made an off-by-one-error in my calculations (by neglecting to consider that x needs to be close enough to the minimum long to cause shifting by 40 to give -1<<23, not -1<<22), this has now been corrected and the key adjusted accordingly. Sorry for the confusion.

result is floor(x / pow(2,40)) + x, which seems to minimized by -pow(2,63) (the lowest possible return value, since the return value must be a 64-bit two's complement integer), but if x is -pow(2, 63), this will wrap around (since (any negative value) + (-pow(2, 63)) is out of range for a 64-bit two's complement integer. Let's suppose x is -pow(2, 63) + K where K < pow(2, 40) (so it vanishes from the floor(x / pow(2, 40) term). Then x >> 40 will be -(1 << (63-40)) == -(1 << 23). Then, if we let K be (1<<23) to cancel this out the overall result -(1 << 23) - pow(2,63) + (1<<23) = -pow(2, 63), which is the minimum possible value.

Note that since this x achieves a return value which is the smallest possible 64-bit two's complement integer, so no lower result is possible, even if overflow is allowed to occur. The prohibition on overflow was meant to exclude values of x which were very close to pow(2,63)-1; it would have been better phrased "the most negative possible provided that overflow does not occur". If you assumed that we could somehow represent out-of-range values for the result of foo, then the true result of -pow(2,63) >> 40 + -pow(2,63) would be more negative.

Question 3 (6 pt; mean 5.61)

Which of the following are generally properties of an instruction set architecture (ISA) rather than properties of a microarchitecture? Select all that apply.

  1. affects machine code, assembly languages

  2. affects what instructions like add do; "general-purpose" register meant to specifically specify registers available for use from assembly, but this should've been made clearer

  3. doesn't change whether either version of the code is functional

Question 4 (3 pt; mean 2.44)

Which of the following decisions are more typical of a reduced instruction set architecture (RISC) design philosophy than a complex instruction set architecture (CISC) design philosoophy? Select all that apply.

quiz for week 4

Answer the following questions about the lecture material for this week.

Question 1 (4 pt; mean 3.04)

Which of the following valid x86-64 instructions are also valid Y86-64 instructions (as they would be written in AT&T syntax)? Select all that apply.

  1. no push (address)

  2. no %r15

  3. no jmp-to-register

Question 2 (4 pt; mean 3.65)

Consider the following Y86-64 machine code (represented as a sequence of hexadecimal bytes, with the byte at the lowest address written left-most):

40 ab 12 60 12 60 12 60 12 60 61 ab 50 ab 78 56 00 00 00 00 00 00

When decoded, the first two instructions are, in order:

Question 3 (4 pt; mean 3.21)

Consider the following Y86-64 machine code for mrmovq SOMETHING(%r9), %r8 where SOMETHING is a constant:

50 89 00 ef 00 00 00 d0 bc 0a

What is the value of SOMETHING? Write your answer as a hexadecimal number.

Answer:
Key: 0xABCD0000000EF00
Question 4 (4 pt; mean 3.56)

Consider the registers we discussed in lecture. Suppose the register is currently outputting the value 17 and the input to the register is 20 and the clock signal is high. Then, the input changes to 21; then, the clock signals falls; then, the input changes to 22; then, the clock signal rises; then, the input changes to 23. Then, what will the value of the register's output be?

Answer:
Key: 22
Question 5 (4 pt; mean 3.59)

Consider the following HCLRS code snippet (where ... represents some omitted code):

register iO {
    a : 64 = ...;
    b : 64 = ...;
}

i_b = O_a + i_a;
i_a = O_a + O_b + 1;

If during cycle 1, O_a is 3 and O_b is 4, then the value of O_a during cycle 3 will be ____. Write your answer as a base-10 number. If not enough information is provided write unknown.

Answer:
Key: 20

order of assignments doesn't matter. cycle 1: i_a = 3 + 4 + 1 = 8, i_b = 3 + 8 = 11; cycle 2: i_a = 8 + 11 + 1 = 20 (becomes O_a during cycle 3), i_b = 8 + 20 = 28 (not relevant);

quiz for week 5

Consider the single-cycle processor design we discussed in lecture (built using the components we discussed in lecture that executes every instruction in one cycle) executing the instruction:

mrmovq 10(%rax), %rbx
Question 1 (2 pt; mean 1.46) (see above)

While the address 10(%rax) is being computed for the above instruction, the program counter register will output ____.

Question 2 (2 pt; mean 1.32) (see above)

A value from the data memory is read for this instruction ____.

Question 3 (2 pt; mean 1.64) (see above)

A value is written to the register file for this instruction ___.

Question 4 (4 pt; mean 3.09)

Consider the single-cycle processor design we discussed in lecture (built using the components we discussed in lecture).

Suppose one wanted to add support for a new rrradd REG1, REG2, REG3 instruction which would take three register operands and compute the sum of the first two registers and store the result in the third register. For example if %rax contained 3 and %rbx contained 4, then running rrradd %rax, %rbx, %rcx would write 7 to %rcx.

Which of the following is true about adding this instruction to a Y86 and the single-cycle processor design? Select all that apply.

Consider the single-cycle processor design we discussed in lecture (built using the components we discussed in lecture).

Suppose one wanted to add support for a new push2q REG1, REG2 instruction which would be equivalent to pushq REG1 and pushq REG2 but be completed by one instruction (and in one cycle).

Question 5 (3 pt; mean 2.05) (see above)

The layout of the machine code for the above instruction would most likely be most similar to the encoding for:

  1. not the intended answer; also accepted this on the argument we could change only the "F" placeholder for rB in the encoding and share the icode sanely

  2. meant to write rmmovq to agree with operand order, but both rmmovq and mrmovq are not correct (because the question is about machine code, not implementation)

  3. same fields available

Question 6 (4 pt; mean 2.64) (see above)

As part of adding this instruction, one would expect the data memory to be modified to ____. Select all that apply.

  1. to control whether the data memory writes 128 bits or 64 bits

quiz for week 6

Answer the following questions about the lecture material for this week.

Consider a seven-stage pipelined processor with a 100 ps cycle time. Suppose this processor is executing instructions A, B, C, D, E, ... in order. (Assume no pipeline stalls, etc. are involved.)

Question 1 (3 pt; mean 2.81) (see above)

Instruction D will finish its first stage _____ ps after instruction A finishes its first stage.

Answer:
Key: 300
Question 2 (3 pt; mean 2.75) (see above)

Instruction B will finish its seventh stage ____ ps after instruction A finishes its first stage.

Answer:
Key: 700
Question 3 (4 pt; mean 3.89)

Suppose we are constructing a three-stage pipelined processor, where the computations or data storage operations performed by each stage take:

and we are using pipeline registers with a 20 ps register delay.

What would the minimum cycle time for this processor be in ps?

Answer:
Key: 320

Consider building a pipeline processor similar to our textbook's design Suppose the times required for components are as follows:

Question 4 (4 pt; mean 3.47) (see above)

Suppose instead of constructing a 5-stage processor, like our textbook proposes and we discussed in lecture, one were to construct a 4-stage processor by merging two of the stages we would use in the textbook's 5-stage design. Based on the timings above, what two stages would make the most sense (in terms of expected performance) to merge?

Question 5 (4 pt; mean 3.41)

Suppose a 5-stage pipelined processor achives a throughput of 1000 instructions/microsecond. If one converted this into a single-cycle processor, one would expect a throughput ____.

quiz for week 7

For the following questions, consider a four-stage pipelined processor built using the design we discussed in lecture but with the following stages:

(That is, the memory and writeback stages are combined.)

Suppose this processor uses forwarding to resolve data hazards. If forwarding alone is insufficient, the processor combines forwarding with a minimum amount of stalling.

Question 1 (3 pt; mean 2.46) (see above)

Which of the following assembly snippets would exercise a data hazard if executed on the processor described above? Select all that apply.

Question 2 (2 pt; mean 1.75) (see above)

On the processor described above, the addq instruction in the following assembly snippet would finish ___ cycles after the subq instruction finishes?

subq %rcx, %rax
addq %rdx, %rax
Answer:
Key: 1
Question 3 (2 pt; mean 1.83) (see above)

On the processor described above, the subq instruction in the following assembly snippet would finish ___ cycles after the mrmovq instruction finishes?

mrmovq (%rcx), %rax
subq %rcx, %rax
Answer:
Key: 2
Question 4 (5 pt; mean 4.29) (see above)

Consider the following assembly snippet:

addq %r8, %r9
subq %r9, %r10
irmovq $42, %r9
rmmovq %r9, 0(%r10)
rmmovq %r8, 8(%r10)

When this is executed on the processor described above, one would expect which of the following forwarding operations to occur? (It's possible that not all of the neceessary forwarding operations are included among the options below.) Select all that apply.

  1. does not read %r9

For the following questions, consider the following assembly snippet:

        pushq %rax 
        addq %rcx, %rdx   
        jle foo          
        xorq %rdx, %rdx                  
        andq %r9, %r10       
        subq %r11, %r12
        nop
foo:    rrmovq %rdx, %r8        
        irmovq $10, %r13        
        rmmovq %r14, 0(%r15)    

For the questions below, consider a six-stage processor, similar in design to the one discussed in lecture, with the following stages:

In this processor, the result of arithmetic (such as performed by the addq or xorq instructions) or of evaluating the condition codes to determine if a conditional jump is taken is not available until near the end of the execute part 2 stage. In particular, a conditional jump that was not predicted correctly will not be able to fetch the instruction that follows the jump instruction until the jump instruction is in the memory stage.

Assume the processor uses forwarding and, where necessary, stalling to resolve data hazards. Control hazards are resolved as described in each question below.

Question 5 (4 pt; mean 2.44) (see above)

Suppose the six-stage processor described above predicted all jumps as taken and the jle instruction above was actually not taken. Then the xorq instruction would retrieve the value of %rdx in its decode stage

Question 6 (4 pt; mean 2.48) (see above)

Suppose the six-stage processor predicted all jumps as taken and the jle instruction above was actually not taken. Then the xorq instruction would complete its decode stage while a ____ instruction completed its writeback stage.

quiz for week 8

Answer the following questions about the lecture material for this week.

Suppose a five-stage processor was executing during cycle N:

and that the processor wants to stall instruction C and squash instructions B and A, so during the following cycle N+1:

Question 1 (7 pt; mean 6.47) (see above)

If this processor were built using the bubble_X and stall_X signals we discussed in lecture, then as part of achieving this effect, the processor should set ___. Select all that apply.

Suppose we have a four-stage pipelined processor with the following stages:

Suppose, unlike the processor we've discussed in lecture, this processor uses branch prediction which predicts are conditional jumps as not taken. When a conditional jump is mispredicted, the processor detects the misprediction during the corresponding jump instruction's execute stage, and retrieves the correct instruction in the following cycle.

Suppose the processor is executing the following assembly snippet:

        xorq %rax, %rax
        jle foo
        andq %rcx, %rdx
        pushq %rdx
foo:    addq %r8, %r9
        subq %r9, %r10
        xorq %r10, %r11

When the above assembly is run, the pushq instruction is fetched as a result of the misprediction.

Question 2 (3 pt; mean 2.56) (see above)

If this processor were built using the bubble_X and stall_X signals described in lecture, the pushq instruction will be squashed _____.

Question 3 (3 pt; mean 2.54) (see above)

If this processor were built using the bubble_X and stall_X signals described in lecture, the andq instruction will be squashed _____.

For each of the pairs below, select which one is expected to exhibit more temporal locality.

Question 4 (2 pt; mean 1.85) (see above)

Pair one.

I should have been specific that option A was meant to refer to the array accesses and not the combination of array accesses (not good temporal locality) and machine code accesses (good temporal locality), but I think B is the better answer regardless.

Question 5 (2 pt; mean 1.41) (see above)

Pair two.

Question 6 (2 pt; mean 1.07)

Identify which of the following is expected to exhibit more spatial locality.

quiz for week 9

Answer the following questions about the lecture material for this week.

Suppose a direct-mapped cache with a write-allocate, write-back policy has the following contents:

set indexvaliddirtytagvalue (as list of bytes in hexadecimal, lowest address (offset) first)
0100x00000 22 33 44 55 FF AA CC
10---
2100x00211 22 33 44 55 66 77 88
3110x00299 AA BB CC 9A BC DE 00
4100x00323 34 56 00 33 44 55 88
5100x00300 00 00 11 22 33 44 55
6100x00200 33 44 55 99 11 33 55
70---
80---
9110x00256 78 45 67 34 23 12 01
10110x002FF FF FF 99 FF FF FF 88
11110x002FF FF FF 77 FF FF FF 66
120---
13100x000AA BB CC DD 00 00 FF 00
14100x00000 00 00 01 00 00 00 02
15100x00103 00 00 00 04 00 00 00
Question 1 (4 pt; mean 3.59) (see above)

When this cache is used to access the address 0x123, what tag would be used for that access? Write your answer as a hexadecimal number.

Answer:
Key: 0x2
Question 2 (4 pt; mean 3.91) (see above)

When reading a 2-byte, little-endian value from an address with tag 0x3, set index 3, and (block) offset 0x3, the result will be? Write your answer as a hexadecimal number. If this access would be a cache miss, write miss.

Answer:
Key: miss

tag mismatch

Question 3 (4 pt; mean 3.55) (see above)

Using the above cache, to read an address with which of the following attributes would replace a dirty value (and therefore require writing out the dirty value to memory or the next level of cache)? Select all that apply.

Suppose a 2-way set-associativate cache with an LRU (least recently used) replacement policy and a write-no-allocate, write-through policy has the following contents:

way 0way 1
set index validtagvalue (as list of bytes in hexadecimal, lowest address (offset) first) validtagvalue LRU way
0 10x0101 23 10x0234 56 0
1 10x0101 23 10x0234 56 1
Question 4 (4 pt; mean 3.61) (see above)

Suppose one reads a byte from address 0x13 using this cache, which is a cache miss. As a result of the miss, a block of the cache will be replaced. Which one?

100[1][1] --> set index 1, tag 100 (binary) = 4

Question 5 (3 pt; mean 1.94) (see above)

Suppose a program writes a 2-byte value to address 0x00 on a system using this cache as its data cache. Which of the following statements is true about how the write will be performed?

quiz for week 10

Suppose a direct-mapped cache with a write-allocate, write-back policy has the following contents:

set indexvaliddirtytagvalue (as list of bytes in hexadecimal, lowest address (offset) first)
0100x00000 22 33 44 55 FF AA CC
10---
2100x00211 22 33 44 55 66 77 88
3110x00299 AA BB CC 9A BC DE 00
4100x00323 34 56 00 33 44 55 88
5100x00300 00 00 11 22 33 44 55
6100x00200 33 44 55 99 11 33 55
70---
80---
9110x00256 78 45 67 34 23 12 01
10110x002FF FF FF 99 FF FF FF 88
11110x002FF FF FF 77 FF FF FF 66
120---
13100x000AA BB CC DD 00 00 FF 00
14100x00000 00 00 01 00 00 00 02
15100x00103 00 00 00 04 00 00 00
Question 1 (4 pt; mean 3.47) (see above)

Suppose the one-byte value 77 (written as hexadecimal) is written using this cache using an address which has a (set) index of 4, a tag of 0x4, and a (block) offset of 0x0. Which of the following is true about the effects of this accesses? Select all that apply.

  1. not write-through

  2. replaced block is not dirty

  3. meant to write 23 here

Consider a system with two-level cache hierarchy where:

Suppose a workload is measured to have

Question 2 (2 pt; mean 1.79) (see above)

What percentage of first-level cache accesses require a main memory access? Write your answer to the nearest whole number of percent.

Answer:
Key: 2%
Question 3 (3 pt; mean 2.5) (see above)

Assume that when there is a miss in a cache, the cache uses its hit time to determine that it is a miss, and then starts the access to the next level of the memory hierarchy. For example, this would mean that an access that misses in the first-level cache and hits in the second level would take 12 cycles to complete.

Given this, what is the average memory access time of the program, in cycles, rounded to the nearest tenth of a cycle?

Answer:
Key: 8.0

The following question ask about what data cache misses occur for a particular cache and a particular piece of C code. For those questions assume:

Question 4 (4 pt; mean 1.54) (see above)

Consider the following C code:

unsigned char array[280];
...
for (int i = 0; i < 10; ++i) {
    for (int j = 0; j < 280; ++j) {
        array[j] += 1;
    }
}

With a direct-mapped 256 byte cache with 64 byte blocks and a write-allocate, write-back policy, how many cache misses will occur? (Note that unlike the examples in lecture, these accesses are unevenly distributed across the sets of the cache.)

Answer:
Key: 23

10 misses each for 0-63 and 256-320; 3 misses for 64-128, 128-192, 192-256

Question 5 (4 pt; mean 1.95) (see above)

Consider the following C code:

unsigned char array[2500];
...
for (int i = 0; i < 2500; i += 1000) {
    array[i] += 1;
}
for (int i = 0; i < 2500; i += 990) {
    array[i] += 1;
}

How many cache misses would occur with a 1KB (1024 byte) fully-associative cache with 64 byte cache blocks and an LRU replacement policy and a write-allocate, write-back policy?

Answer:
Key: 4

first loop: 0 (miss loading 0-63); 1000 (miss loading 960-1023); 2000 (miss loading 1984-2047); second loop: 0 (hit); 990 (hit); 1980 (miss loading 1920-1983);

For the following questions, consider the following C code:

int A[1024 * 1024], B[1024 * 1024], C[1024 * 1024];

void foo1() {
    for (int i = 0; i < 1024; ++i) {
        for (int j = 0; j < 1024; ++j) {
            A[i + j] = B[j * 4] * C[i * 1024 + j];
        }
    }
}

void foo2() {
    for (int i_outer = 0; i_outer < 1024; i_outer += 16) {
        for (int j = 0; j < 1024; ++j) {
            for (int i = i_outer; i < i_outer + 16; ++i) {
                A[i + j] = B[j * 4] * C[i * 1024 + j];
            }
        }
    }
}
Question 6 (1 pt; mean 0.48) (see above)

The temporal locality in A is better for ____.

Question 7 (1 pt; mean 0.88) (see above)

The temporal locality in B is better for ____.

Question 8 (1 pt; mean 0.91) (see above)

The spatial locality in C is better for ____.

Question 9 (1 pt; mean 0.83) (see above)

The temporal locality in C is better for ____.

quiz for week 11

For the following questions, consider the following assembly code:

        movq $100, %rax             /* 1 */
begin_loop:     
        movq (%r8, %rax, 8), %r10   /* 2 */
        movq (%r9, %rax, 8), %r11   /* 3 */
        subq %r11, %r10             /* 4 */
        jg not_negative             /* 5 */
        neg %r10                    /* 6 */
not_negatve:
        addq %r10, %r13             /* 7 */
        subq $1, %rax               /* 8 */
        jg begin_loop               /* 9 */

(The neg (negate) instruction is equivalent to multiplying a value by negative 1.)

Question 1 (4 pt; mean 2.32) (see above)

If one wanted to optimize the above loop by unrolling, the unrolled version of the loop would need to include an extra copy of (perhaps with changes to offsets, etc.) instructions labelled ____.

meant to write not_negative: on label

Question 2 (4 pt; mean 2.65) (see above)

If one wanted to optimize the above loop to use multiple accumulators, the best candidate of an accumulator that should be split into two accumulators is ___.

Consider the following assembly snippet:

addq %r8, %r9
subq %r8, %r10
imulq %r9, %r10
andq %r9, %r8
xorq %r11, %r12
Question 3 (6 pt; mean 5.86) (see above)

Consider an out-of-order processor that uses register renaming to help handle hazards, In this scheme, the renaming step might convert the above instructions into three-operand versions with physical registers (filling in the numbered blanks shown below):

addq ___ (1), ___ (2) -> ___ (3)
subq ___ (4), ___ (5) -> ___ (6)
imulq ___ (7), ___ (8) -> ___ (9)
andq ___ (10), ___ (11) -> ___ (12)
xorq ___ (13), ___ (14) -> ___ (15)

In this renamed version, which of the following pairs of blanks would refer to the same physical register? Select all that apply.

  1. was originally miskeyed

  2. was originally miskeyed

Question 4 (4 pt; mean 3.51) (see above)

Suppose an out-of-order processor has two arithmetic execution units, each capable of performing any arithmetic operation, and they can perform:

  • an addition, subtraction, or bitwise and operation in one cycle
  • a multiplication operation in five cycles

What is the minimum number of cycles the arithmetic for the above assembly take on this processor? Do not include time needed to rename instructions, perform forwarding, register reading and writeback, etc.

Answer:
Key: 6

forgot to mention the actual time for xor, but if it's the same as and (which certainly seems most likely): addq/subq in parallel (1 cycle); imulq (5 cyles) in parallel andq + xorq (2 cycles)

quiz for week 12

Answer the following quesitons about the lecture material for this week.

For the following question, consider the following C code:

void foo(int N, int *A, int *B, int *C) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            C[i] += A[j*N+i] + B[i+j];
        }
    }
}
Question 1 (3 pt; mean 1.91) (see above)

Since A might point to part of the same array as B or C, the compiler cannot easily perform which of the following optimizations (without adding some check for whether A overlaps with B or C)? Select all that apply.

Question 2 (3 pt; mean 2.5) (see above)

Which of the following statements are true about the effects of unrolling the inner loop in the code above? Select all that apply.

  1. fewer instructions executed -> fewer instruction cache accesses

Question 3 (4 pt; mean 4)

Consider the following C functions:

int foo1(int *x, int *y, int *z) {
    for (int i = 0; i < 16; ++i) {
        z[i] = x[i] + y[i];
        z[i+16] = x[i] - y[i];
    }
}

int foo2(int *x, int *y, int *z) {
    for (int i = 0; i < 16; ++i) {
        z[i] = x[i] + y[i];
    }
    for (int i = 0; i < 16; ++i) {
        z[i+16] = x[i] - y[i];
    }
}

A compiler cannot generate the same code for these functions because of aliasing.

Which of the following would be sufficient conditions to gaurentee that foo1 and foo2 would have equivalent results? Select all that apply. (Assume that pointer comparisons compare the addresses of pointers and produce consistent usables results even if the two pointers point to parts of different objects.)

dropped because wasn't marked as select-all-that-apply in quiz system

Question 4 (3 pt; mean 2.63)

Consider the following C snippets:

/* snippet A */
for (int i = 0; i < N; ++i)
    A[i] = B[i+1] + B[i] + B[i-1];

/* snippet B */
for (int i = 0; i < N; ++i)
    A[i] = B[A[i]];

Using the vector instructions and intrinsics we discussed in lecture, which of these snippets is most simple to optimize using vector instructions?

Question 5 (3 pt; mean 1.01)

Consider the following C snippets:

/* snippet A */
for (int i = 0; i < N; ++i)
    A[i] = A[i] + A[i-1];

/* snippet B */
for (int i = 0; i < N; ++i)
    A[i] = A[i] + A[i+1];

Using the vector instructions and intrinsics we discussed in lecture, which of these snippets is most simple to optimize using vector instructions?

  1. was originally miskeyed. In the first A[i] depends on prior iteratoins of the loop; when we add vector instructions, we'll compute A[i] for i = 0 and i=1 at the same time, so i=1 can't use the results from th eprior iteratoin without a bunch of extra work.

Question 6 (3 pt; mean 2.7)

After converting a function to use vector instructions that worked on vectors of 8 values, one would expect which of the following to be true about the new function compared to the original? Select all that apply.

  1. because more loaded/stored at a time

quiz for week 13

Answer the following questions about the lecture material for this week.

Question 1 (4 pt; mean 3.38)

When two processes are sharing a single-core processor, usually ____. Select all that apply.

  1. input could be for a different program

  2. this is generally true even in OOO processors, but I don't think we covered this enough detail for you to know

Question 2 (4 pt; mean 3.98)

When a program attempts to read from a memory location that the program is not permitted to access, the processor will usually ____.

  1. [gave credit for this answer] might happen as a result of B; done by operating system code, not something hard-wired in the processor

  2. [gave credit for this answer] might happen as a result of B; done by operating system code, not something hard-wired in the processor; also very often operating systems do not do this (not Unix-like; program not running in a terminal)

I had intended this question to primarily be about what the hardware does versus what the software does. A and C are things that would be done because the operating system has code to do that (and different operating system make different choices about what to do); B is something the hardware would almost always do.

Question 3 (4 pt; mean 3.75)

When an instruction triggers an exception, the processor should ___. Select all that apply.

  1. as our textbook mentions, some processors will save the address of the following instruction instead in some cases (b/c this makes system calls easier to implement)

  2. doesn't make sense for, e.g., divide-by-zero

Question 4 (4 pt; mean 2.4)

Typically, when a program reads input from the keyboard ____. Select all that apply.

  1. no, the operating system's exception handler will run first --- has to because we can't run a normal program in kernel mode

Consider a system with 12-bit virtual addresses, 11-bit physical addresses, and 256-byte pages. On this system, page offsets are therefore 8 bits, virtual page numbers are 4 bits and physical page numbers are 3 bits.

Suppose a process's page table is as follows:

virtual page #valid?physical page #
0x00
0x110x2
0x210x4
0x30
0x40
0x50
0x60
0x710x3
0x810x5
0x910x6
0xa0
0xb0
0xc0
0xd0
0xe0
0xf10x0
Question 5 (2 pt; mean 1.81) (see above)

If the process attempts to read from address 0x742, what physical address will it read from? Write your answer as a hexadecimal number. If an exception will happen instead, write fault.

Answer:
Key: 0x342
Question 6 (2 pt; mean 1.82) (see above)

If the process attempts to write to address 0x187, what physical address will it write to?

Write your answer as a hexadecimal number. If an exception will happen instead, write fault.

Answer:
Key: 0x287

quiz for week 14

Answer the following questions about the lecture material for this week.

Consider a system with 12-bit virtual addresses, 11-bit physical addresses, and 256-byte pages.

Suppose a process's page table is as follows:

virtual page #valid?write allowed?user mode allowed?physical page #
0x00
0x11110x2
0x21110x4
0x30
0x40
0x50
0x60
0x71110x3
0x81010x5
0x91010x6
0xa0
0xb0
0xc0
0xd0
0xe0
0xf1100x0
Question 1 (4 pt; mean 3.18) (see above)

Suppose the system using the page table shown above requires that exception handlers be located at a virtual address while processes are running. Based on the page table above, give an example of a virtual address at which the OS might have placed an exception handler. Write your answer as a hexadecimal number.

Answer:
Key: anything in VPN 0xf (most sensible) or VPN 0x8 or 0x9
Question 2 (4 pt; mean 3.27)

Suppose a system has 30-bit virtual addresses, 24-bit physical addresses, and 1024-byte pages. If this system uses a page table stored in memory as an array with 4 byte page table entries, how large is this page table in bytes? (Write your answer as a base-10 number.)

Answer:
Key: 4194304

Consider a system with:

Question 3 (4 pt; mean 2.8) (see above)

What physical address will the processor access to look up a page table entry for virtual address 0x12345? (Remember to account for each page table entry being 4 bytes.) Write your answer as a hexadecimal number. If not enough information is given, write unknown.

Answer:
Key: 0x88048
Question 4 (4 pt; mean 2.95) (see above)

Suppose the page table entry retrieved for virtual address 0x12345 has the value 0x88800001 (when interpreted as a 4-byte integer). What physical address will contain the value stored at virtual address 0x12345? If an exception would occur instead, write fault.

Answer:
Key: 0x88800345

quiz for week 15

Answer the following questions about the lecture material for week 15.

In lecture, we mentioned a scheme where the operating system can allocate memory on demand. The following questions are about how that scheme might be implemented.

Question 1 (3 pt; mean 1.78) (see above)

In order to configure a memory region to be allocated on demand, the operating system will set page table entries ____.

Question 2 (3 pt; mean 2.73) (see above)

When an access is made that requires memory allocation, the allocation will most likely occur

Consider a system that uses a two-level page table structure with 8192-byte pages and 1024 page table entries per table (at each level), where each page table entry is 8 bytes.

Suppose the processor is running with a page table base pointer containing physical (byte) address 0x44440000. When looking up the physical address for virtual address 0x12345678, the processor determines that the relevant first-level page table entry (that is, the one in the table pointed to by the page table base pointer) was marked valid and contained the physical page number 0x1111, and the relevant second-level page table entry was marked valid and contained the physical page number 0x4321.

Question 3 (4 pt; mean 2.62) (see above)

During the first-level of the lookup described above, from what physical address did the processor read the first-level page table entry? Write your answer as a hexadecimal number.

Answer:
Key: 0x44440120

0x44440000 + (0x12345678 >> 13 >> 10) * 8 (13 to remove the page offset, 10 to remove the second-level's part of the virtual page number, times 8 for the page table entry size)

Question 4 (4 pt; mean 2.26) (see above)

During the lookup described above, what was the final physical address the processor determined? (That is, what physical address would the program read from when it read from virtual address 0x12345678.) Write your answer as a hexadecimal number.

Answer:
Key: 0x8643678

(0x4321 << 13) | (0x12345678 & 0x1FFF) --- (0x4321 << 13) to shift the physical page number to its location in the address, then or in the lower 13 bits to copy the page offset from the original virtual address

Suppose a system has:

If on this system, the processor looks up virtual address 0x52345, and:

Question 5 (3 pt; mean 1.93) (see above)

After the processor performs the lookup described above, it will store information in the TLB at what set index?

Answer:
Key: 0x4
Question 6 (3 pt; mean 1.88) (see above)

After the processor performs the lookup described above, it will store what tag in the corresponding TLB entry?

Answer:
Key: 0x1