F2021 CS 3330 quizzes

Question 1 (4 points)

In lecture, we discussed the idea of a processor that was connected to memory and I/O devices using a shared memory bus. Which of the following statements are true about how the processor using this memory bus would work? Select all that apply.

when supplying an address to the memory, the processor also must indicate to the memory whether it is retrieving data for an operation (such as a value from the stack) or an instruction's machine code
the processor will usually retrieve the machine code for an instruction from memory at the same time as it retrieves any data in memory that instruction operates on
when a key is pressed, the processor would receive information about the keypress using the memory bus
to execute an instruction like addq %rax, %rax, the processor only needs to contact the memory bus to retrieve the machine code for the instruction and not to retrieve the value for %rax.

For the following questions, consider the following AT&T syntax x86-64 assembly instruction:

movq (%rax, %rbx, 4), %rcx

And assume that, just before this instruction executes, registers have values as follows:

`%rax`	`0x1000`
`%rbx`	`0x8000`
`%rcx`	`0x79`
`%rsp`	`0xF00000`
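
As a reminder of how an AT&T-syntax operand of the form D(base, index, scale) becomes a memory address, here is a minimal C sketch. The function name effective_address and the sample values mentioned afterwards are illustrative assumptions and are not taken from the quiz.

```c
#include <stdint.h>

/* Address computed for an AT&T-syntax memory operand D(base, index, scale):
   D + base + index * scale. In x86-64 the scale is 1, 2, 4, or 8.
   Arithmetic wraps around modulo 2^64, as it does in the processor. */
uint64_t effective_address(int64_t displacement, uint64_t base,
                           uint64_t index, uint64_t scale) {
    return (uint64_t)displacement + base + index * scale;
}
```

For example, with a hypothetical base of 0x2000, index of 0x10, scale of 8, and no displacement, effective_address(0, 0x2000, 0x10, 8) evaluates to 0x2080.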

Question 3 (2 points) (see above)

When the above instruction is executed, it will use a memory address. What memory address will it use? Write your answer as a hexadecimal number.

Answer:

Question 4 (3 points) (see above)

When the above instruction executes successfully, it will ____.

Question 5 (4 points)

Consider the following assembly snippet:

addq %rbx, %rbx
addq %rbx, %rbx
addq %rax, %rbx
movq (%rbx), %rbx

Which of the following assembly snippets would result in the same values in %rax and %rbx as the above assembly snippet? Select all that apply.

leaq (, %rbx, 4), %rbx
movq (%rbx, %rax), %rbx
movq (%rax, %rbx, 4), %rbx
leaq (%rax, %rbx), %rbx
leaq (%rax, %rbx), %rbx
leaq (%rax, %rbx), %rbx
leaq (%rax, %rbx, 4), %rbx
movq (%rbx), %rbx

quiz for week 2

For each of the following assembly snippets, identify which are possible values of the ZF and SF condition codes.
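
As a refresher on what ZF and SF record, the following C sketch models how the two flags are derived from a 64-bit result, and which result cmpq and testq use. The struct and function names are hypothetical, and the sketch deliberately models only these two instructions.

```c
#include <stdbool.h>
#include <stdint.h>

struct flags { bool zf; bool sf; };

/* ZF is 1 when the result is zero; SF is a copy of the result's
   most significant (sign) bit. */
struct flags flags_from_result(uint64_t result) {
    struct flags f;
    f.zf = (result == 0);
    f.sf = (result >> 63) != 0;
    return f;
}

/* cmpq a, b sets the flags from b - a (the result itself is discarded). */
struct flags flags_after_cmpq(uint64_t a, uint64_t b) {
    return flags_from_result(b - a);
}

/* testq a, b sets the flags from a & b (the result itself is discarded). */
struct flags flags_after_testq(uint64_t a, uint64_t b) {
    return flags_from_result(a & b);
}
```

For instance, flags_after_cmpq(7, 5) models cmpq $7 against a register holding 5: the result 5 - 7 is negative, so SF is 1 and ZF is 0.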

Question 1 (2 points) (see above)

    cmpq $0, %rax
    jl skip
    imulq $-1, %rax
skip:
    testq %rax, %rax

ZF = 0, SF = 0
ZF = 1, SF = 0
ZF = 0, SF = 1

Question 2 (2 points) (see above)

    movq $0, %rax
    cmpq %rax, %rax
    leaq 0x12345678(,%rax,4), %rax

ZF = 0, SF = 0
ZF = 1, SF = 0
ZF = 0, SF = 1

Question 3 (5 points)

Consider the following C snippet:

while (rax >= 1 && rax < 10) {
    rax = rax - r8;
}

and the following assembly snippets:

/* snippet A */
loop:
    cmpq $0, %rax
    jle done
    cmpq $10, %rax
    jge done
    subq %r8, %rax
    jmp loop
done:

/* snippet B */
    jmp start
loop:
    subq %r8, %rax
start:
    cmpq $1, %rax
    jl done
    cmpq $10, %rax
    jl start
done:

/* snippet C */
loop:
    cmpq $1, %rax
    cmpq $10, %rax
    jl done
    jge done
    subq %r8, %rax
    jmp loop
done:

/* snippet D */
loop:
    cmpq $1, %rax
    jl done
    cmpq $11, %rax
    jg done
    subq %r8, %rax
    jl loop
done:

Assuming the variable rax is stored in the register %rax and the variable r8 is stored in the register %r8, which snippets are correct assembly translations of the loop? Select all that apply.

snippet A
snippet B
snippet C
snippet D

For the following questions, consider static linking using object files as discussed in lecture.

Suppose one converts the following two assembly files into an executable (where ... represents some code that is not shown, and .asciz is an assembly directive to generate a nul-terminated ASCII string, and .global LABEL is a directive that causes the assembler to ensure that a specified label LABEL can be referenced from other files):

Assembly file main.s:

.text
.global main
main:
    mov $0, %r11
    mov $10, %r10
loop:
    mov $percent_d, %rdi
    mov %r11, %rsi
    call printf
    incq %r11
    decq %r10
    jge loop
    mov $0, %rax
    ret

.data
.global percent_d
percent_d:
    .asciz "%d"

Assembly file printf.s:

.text
.global printf
printf:
    movb (%rdi), %al
    ...
    ...
    ret

.data
...

Question 4 (4 points) (see above)

The object file generated from main.s will include metadata (outside of any machine code) specifying ____. Select all that apply.

the address at which the string "%d" is located in memory when the executable runs
the location within the object file generated from main.s at which the string "%d" appears
the location within the object file generated from main.s at which the machine code corresponding to percent_d in mov $percent_d, %rdi appears
the location at which the dec %r10 instruction appears in the machine code

Question 5 (4 points) (see above)

In the object files generated from the assembly files above, a symbol table entry for printf would most likely ____. Select all that apply.

include the location (either in memory or in an object file) of the machine code for the call printf instruction
include the location (either in memory or in an object file) of the machine code for the movb (%rdi), %al instruction
include the location of a ret instruction
specify that printf's first argument is a string

quiz for week 3

Question 1 (4 points)

If x and y are 64-bit signed integers between -100000000000 and 100000000000 (on a C implementation that uses two's complement and implements signed right shift using arithmetic shift and where other constants are 64 bits), then which of the following C expressions are always true? Select all that apply.

((x + y) >> 22) == ((x >> 22) + (y >> 22))
((~x >> 12) & 0xFFFFFFFF) == ((~(x >> 12)) & 0xFFFFFFFF)
((x ^ 0xFFFF) & 0xCFF0) == (((x & 0x4FF0) ^ 0xCFF0) | (~x & 0x8000))
(((x | 0xF000) & 0xFFF0) ^ (y & 0x8008)) == (0xF000 | (x & 0x0FF0) | ((y ^ 0x8008) & 0x8008))

Question 3 (6 points)

Which of the following are generally properties of an instruction set architecture (ISA) rather than properties of a microarchitecture? Select all that apply.

the number of operands an instruction can take
the number of bits in a general-purpose register
whether a call instruction stores the return address on the stack or in a register
how many bytes are necessary for a call instruction's machine code
how many bytes a mov instruction copies from memory into a register
whether using the multiplication instruction to multiply by 3 is faster than adding twice

quiz for week 4

Question 5 (4 points)

Consider the following HCLRS code snippet (where ... represents some omitted code):

register iO {
    a : 64 = ...;
    b : 64 = ...;
}

i_b = O_a + i_a;
i_a = O_a + O_b + 1;

If during cycle 1, O_a is 3 and O_b is 4, then the value of O_a during cycle 3 will be ____. Write your answer as a base-10 number. If not enough information is provided, write unknown.

Answer:
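
One way to reason about a register bank like this is to step it by hand: on each rising clock edge, the values on the inputs (i_a, i_b) become the outputs (O_a, O_b) for the next cycle. The C sketch below models one such step for the two wire assignments shown; the struct and function names are hypothetical, and the starting values used in the note afterwards are made up rather than the ones in the question.

```c
#include <stdint.h>

/* Values on the register bank's outputs (O_a, O_b) during one cycle. */
struct bank { uint64_t a; uint64_t b; };

/* Models one rising clock edge: the inputs computed from the current
   outputs become the outputs seen during the next cycle. HCLRS wire
   assignments are order-independent, but in C i_a must be computed
   first because i_b depends on it. */
struct bank next_cycle(struct bank out) {
    struct bank in;
    in.a = out.a + out.b + 1;   /* i_a = O_a + O_b + 1; */
    in.b = out.a + in.a;        /* i_b = O_a + i_a;     */
    return in;
}
```

Starting from hypothetical outputs a = 1 and b = 2, one step gives a = 4 and b = 5, and a second step gives a = 10 and b = 14.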

quiz for week 5

Consider the single-cycle processor design we discussed in lecture (built using the components we discussed in lecture that executes every instruction in one cycle) executing the instruction:

mrmovq 10(%rax), %rbx

Question 1 (2 points) (see above)

While the address 10(%rax) is being computed for the above instruction, the program counter register will output ____.

the address of the first byte of the instruction that was run just before the above instruction
the address of the first byte of the above instruction's machine code
the address of the constant 10 in the above instruction's machine code
the address of the byte immediately following the above instruction's machine code
none of the above (explain in comments)

Question 2 (2 points) (see above)

A value from the data memory is read for this instruction ____.

around the same time as the instruction is read from the instruction memory
around the same time as values are read from the register file
after the instruction is read from the instruction memory but well before the next rising edge of the clock cycle
around the time of the rising edge of the clock cycle that occurs after the instruction is read from the instruction memory
none of the above (explain briefly in comments)

Question 3 (2 points) (see above)

A value is written to the register file for this instruction ___.

around the same time as the instruction is read from the instruction memory
around the same time as values are read from the data memory
after the instruction is read from the instruction memory but well before the next rising edge of the clock cycle
around the time of the rising edge of the clock cycle that occurs after the instruction is read from the instruction memory
none of the above (explain briefly in comments)

Question 4 (4 points)

Consider the single-cycle processor design we discussed in lecture (built using the components we discussed in lecture).

Suppose one wanted to add support for a new rrradd REG1, REG2, REG3 instruction which would take three register operands and compute the sum of the first two registers and store the result in the third register. For example if %rax contained 3 and %rbx contained 4, then running rrradd %rax, %rbx, %rcx would write 7 to %rcx.

Which of the following is true about adding this instruction to Y86 and the single-cycle processor design? Select all that apply.

it would require adding one or more additional read ports (a new reg_srcX input and reg_outputX output) to the register file
it would require adding one or more additional write ports (new reg_dstX and reg_inputX inputs) to the register file
it would require adding a new ALU operation
it would require placing register numbers in a place in the machine code where no other instruction has a register number

Consider the single-cycle processor design we discussed in lecture (built using the components we discussed in lecture).

Suppose one wanted to add support for a new push2q REG1, REG2 instruction which would be equivalent to pushq REG1 and pushq REG2 but be completed by one instruction (and in one cycle).

Question 5 (3 points) (see above)

The layout of the machine code for the above instruction would most likely be most similar to the encoding for:

Question 6 (4 points) (see above)

As part of adding this instruction, one would expect the data memory to be modified to ____. Select all that apply.

accept 128 bits as its value input (HCLRS mem_input) instead of 64
accept 128 bits as its address input (HCLRS mem_addr) instead of 64
provide 128 bits as its value output (HCLRS mem_output) instead of 64
add a new control signal (in addition to write-enable and read-enable signals mem_readbit and mem_writebit) that would change the behavior of the data memory

quiz for week 6

Consider a seven-stage pipelined processor with a 100 ps cycle time. Suppose this processor is executing instructions A, B, C, D, E, ... in order. (Assume no pipeline stalls, etc. are involved.)

Question 1 (3 points) (see above)

Instruction D will finish its first stage _____ ps after instruction A finishes its first stage.

Answer:

Question 2 (3 points) (see above)

Instruction B will finish its seventh stage ____ ps after instruction A finishes its first stage.

Answer:
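
Timing questions like the two above can be checked with a small model of an ideal pipeline in which one instruction enters per cycle and nothing stalls. The C sketch below is one such model; the function names are hypothetical, and the numbers in the note afterwards describe a made-up five-stage, 200 ps pipeline rather than the one in the questions.

```c
/* Time (in ps) at which the instruction at 0-based position `instr`
   finishes its 1-based stage `stage`, measured from the moment the first
   instruction starts its first stage. Assumes one instruction enters the
   pipeline per cycle and there are no stalls. */
long stage_finish_ps(long instr, long stage, long cycle_ps) {
    return (instr + stage) * cycle_ps;
}

/* Time from instruction a finishing stage_a to instruction b finishing stage_b. */
long gap_ps(long instr_a, long stage_a, long instr_b, long stage_b,
            long cycle_ps) {
    return stage_finish_ps(instr_b, stage_b, cycle_ps)
         - stage_finish_ps(instr_a, stage_a, cycle_ps);
}
```

In the made-up pipeline, the third instruction (position 2) finishes its fifth stage gap_ps(0, 1, 2, 5, 200) = 1200 ps after the first instruction finishes its first stage.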

Consider building a pipelined processor similar to our textbook's design. Suppose the times required for components are as follows:

pipeline registers (including PC) -- 10 ps (the "register delay")
instruction memory read -- 175 ps
PC increment computation -- 20 ps
register file read -- 125 ps
register file write -- 150 ps
ALU -- 120 ps
data memory read or write -- 175 ps
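
When comparing pipeline designs with delays like these, the usual rule is that the cycle time must cover the slowest stage plus the register delay. The C sketch below encodes just that rule; the function name is hypothetical, and deciding which components belong to which stage is left to the reader, so the stage delays in the note afterwards are made up.

```c
#include <stddef.h>

/* Minimum cycle time for a pipeline: the slowest stage's logic delay
   plus the pipeline register delay. */
long min_cycle_time_ps(const long *stage_ps, size_t num_stages,
                       long register_ps) {
    long slowest = 0;
    for (size_t i = 0; i < num_stages; ++i) {
        if (stage_ps[i] > slowest) {
            slowest = stage_ps[i];
        }
    }
    return slowest + register_ps;
}
```

With made-up stage delays of 100, 250 and 180 ps and a 20 ps register delay, the result is 270 ps; merging the first two stages into a single 350 ps stage raises it to 370 ps.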

Question 4 (4 points) (see above)

Suppose instead of constructing a 5-stage processor, like our textbook proposes and we discussed in lecture, one were to construct a 4-stage processor by merging two of the stages we would use in the textbook's 5-stage design. Based on the timings above, what two stages would make the most sense (in terms of expected performance) to merge?

fetch and decode
decode and execute
execute and memory
memory and writeback

quiz for week 7

For the following questions, consider a four-stage pipelined processor built using the design we discussed in lecture but with the following stages:

fetch
decode
execute
memory and writeback

(That is, the memory and writeback stages are combined.)

Suppose this processor uses forwarding to resolve data hazards. If forwarding alone is insufficient, the processor combines forwarding with a minimum amount of stalling.

Question 1 (3 points) (see above)

Which of the following assembly snippets would exercise a data hazard if executed on the processor described above? Select all that apply.

addq %rax, %rbx
addq %rcx, %rdx
subq %rcx, %rax
mrmovq (%rcx), %rcx
mrmovq (%rcx), %rax
subq %rcx, %r8
addq %rdx, %rax
popq %rax
nop
addq %rsp, %rcx

Question 2 (2 points) (see above)

On the processor described above, the addq instruction in the following assembly snippet would finish ___ cycles after the subq instruction finishes?

subq %rcx, %rax
addq %rdx, %rax

Answer:

Question 3 (2 points) (see above)

On the processor described above, the subq instruction in the following assembly snippet would finish ___ cycles after the mrmovq instruction finishes?

mrmovq (%rcx), %rax
subq %rcx, %rax

Answer:

Question 4 (5 points) (see above)

Consider the following assembly snippet:

addq %r8, %r9
subq %r9, %r10
irmovq $42, %r9
rmmovq %r9, 0(%r10)
rmmovq %r8, 8(%r10)

When this is executed on the processor described above, one would expect which of the following forwarding operations to occur? (It's possible that not all of the necessary forwarding operations are included among the options below.) Select all that apply.

the value of %r9 from addq to subq
the value of %r9 from addq to irmovq
the value of %r10 from subq to the first rmmovq
the value of %r10 from subq to the second rmmovq
the value of %r10 from the first rmmovq to the second rmmovq

For the following questions, consider the following assembly snippet:

        pushq %rax 
        addq %rcx, %rdx   
        jle foo          
        xorq %rdx, %rdx                  
        andq %r9, %r10       
        subq %r11, %r12
        nop
foo:    rrmovq %rdx, %r8        
        irmovq $10, %r13        
        rmmovq %r14, 0(%r15)

For the questions below, consider a six-stage processor, similar in design to the one discussed in lecture, with the following stages:

fetch
decode
execute part 1
execute part 2
memory
writeback

In this processor, the result of arithmetic (such as performed by the addq or xorq instructions) or of evaluating the condition codes to determine if a conditional jump is taken is not available until near the end of the execute part 2 stage. In particular, a conditional jump that was not predicted correctly will not be able to fetch the instruction that follows the jump instruction until the jump instruction is in the memory stage.

Assume the processor uses forwarding and, where necessary, stalling to resolve data hazards. Control hazards are resolved as described in each question below.

Question 5 (4 points) (see above)

Suppose the six-stage processor described above predicted all jumps as taken and the jle instruction above was actually not taken. Then the xorq instruction would retrieve the value of %rdx in its decode stage ____.

from the register file directly
by having it forwarded from addq's execute 2 stage
by having it forwarded from addq's memory stage
by having it forwarded from addq's writeback stage

Question 6 (4 points) (see above)

Suppose the six-stage processor predicted all jumps as taken and the jle instruction above was actually not taken. Then the xorq instruction would complete its decode stage while a ____ instruction completed its writeback stage.

quiz for week 8

Suppose a five-stage processor was executing during cycle N:

instruction E in the fetch stage,
instruction D in the decode stage,
instruction C in the execute stage,
instruction B in the memory stage
instruction A in the writeback stage

and that the processor wants to stall instruction C and squash instructions B and A, so during the following cycle N+1:

instruction E in the fetch stage,
instruction D in the decode stage,
instruction C in the execute stage,
instruction B is running a nop (a pipeline "bubble") *[edit after quiz: should have written "the memory stage is running a nop"]
instruction A is running a nop (a pipeline "bubble") *[edit after quiz: should have written "the writeback stage is running a nop"; also to squash instruction A, we need to do something in addition to adjusting the pipeline registers]

Question 1 (7 points) (see above)

If this processor were built using the bubble_X and stall_X signals we discussed in lecture, then as part of achieving this effect, the processor should set ___. Select all that apply.

Suppose we have a four-stage pipelined processor with the following stages:

fetch
decode
execute and memory
writeback

Suppose, unlike the processor we've discussed in lecture, this processor uses branch prediction which predicts all conditional jumps as not taken. When a conditional jump is mispredicted, the processor detects the misprediction during the corresponding jump instruction's execute stage, and retrieves the correct instruction in the following cycle.

Suppose the processor is executing the following assembly snippet:

        xorq %rax, %rax
        jle foo
        andq %rcx, %rdx
        pushq %rdx
foo:    addq %r8, %r9
        subq %r9, %r10
        xorq %r10, %r11

When the above assembly is run, the pushq instruction is fetched as a result of the misprediction.

Question 2 (3 points) (see above)

If this processor were built using the bubble_X and stall_X signals described in lecture, the pushq instruction will be squashed _____.

by setting bubble_F when pushq is in the fetch stage
by setting bubble_D when pushq is in the fetch stage
by setting bubble_D when pushq is in the decode stage
by setting bubble_E when pushq is in the decode stage
by setting bubble_E when pushq is in the execute and memory stage
by setting bubble_W when pushq is in the execute and memory stage

Question 3 (3 points) (see above)

If this processor were built using the bubble_X and stall_X signals described in lecture, the andq instruction will be squashed _____.

by setting bubble_F when andq is in the fetch stage
by setting bubble_D when andq is in the fetch stage
by setting bubble_D when andq is in the decode stage
by setting bubble_E when andq is in the decode stage
by setting bubble_E when andq is in the execute and memory stage
by setting bubble_W when andq is in the execute and memory stage

For each of the pairs below, select which one is expected to exhibit more temporal locality.

Question 4 (2 points) (see above)

Pair one.

accesses made when iterating through a large array stored on the stack
accesses to the machine code of a small, frequently used function
the temporal locality is about the same

Question 5 (2 points) (see above)

Pair two.

accesses made while repeatedly looking up a particular key in a hashtable
accesses while repeatedly linear searching an array for a particular key
the temporal locality is about the same

quiz for week 9

Suppose a direct-mapped cache with a write-allocate, write-back policy has the following contents:

set index	valid	dirty	tag	value (as list of bytes in hexadecimal, lowest address (offset) first)
0	1	0	0x000	00 22 33 44 55 FF AA CC
1	0	-	-	-
2	1	0	0x002	11 22 33 44 55 66 77 88
3	1	1	0x002	99 AA BB CC 9A BC DE 00
4	1	0	0x003	23 34 56 00 33 44 55 88
5	1	0	0x003	00 00 00 11 22 33 44 55
6	1	0	0x002	00 33 44 55 99 11 33 55
7	0	-	-	-
8	0	-	-	-
9	1	1	0x002	56 78 45 67 34 23 12 01
10	1	1	0x002	FF FF FF 99 FF FF FF 88
11	1	1	0x002	FF FF FF 77 FF FF FF 66
12	0	-	-	-
13	1	0	0x000	AA BB CC DD 00 00 FF 00
14	1	0	0x000	00 00 00 01 00 00 00 02
15	1	0	0x001	03 00 00 00 04 00 00 00
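
The table above has 16 sets and 8 bytes per block, so an address splits into a 3-bit block offset, a 4-bit set index, and the remaining upper bits as the tag. The C sketch below performs that split for any power-of-two geometry; the struct and function names are hypothetical, and the address in the note afterwards is an arbitrary example.

```c
#include <stdint.h>

struct cache_addr { uint64_t tag; uint64_t index; uint64_t offset; };

/* Splits an address for a cache with 2^offset_bits bytes per block and
   2^index_bits sets. Assumes offset_bits + index_bits is less than 64. */
struct cache_addr split_address(uint64_t addr, unsigned offset_bits,
                                unsigned index_bits) {
    struct cache_addr parts;
    parts.offset = addr & ((1ULL << offset_bits) - 1);
    parts.index = (addr >> offset_bits) & ((1ULL << index_bits) - 1);
    parts.tag = addr >> (offset_bits + index_bits);
    return parts;
}
```

For the arbitrary address 0x2A7, split_address(0x2A7, 3, 4) yields tag 0x5, set index 4, and offset 0x7.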

Question 1 (4 points) (see above)

When this cache is used to access the address 0x123, what tag would be used for that access? Write your answer as a hexadecimal number.

Answer:

Question 2 (4 points) (see above)

When reading a 2-byte, little-endian value from an address with tag 0x3, set index 3, and (block) offset 0x3, the result will be? Write your answer as a hexadecimal number. If this access would be a cache miss, write miss.

Answer:
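
For multi-byte reads like this one, the byte at the lowest address is the least significant byte of a little-endian value. The C sketch below assembles a 2-byte value from a block's bytes; the function name is hypothetical, and the block contents in the note afterwards are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads a 2-byte little-endian value that starts at `offset` within a
   cache block's bytes: the byte at the lower address supplies the low
   8 bits. The caller must ensure offset + 1 is within the block. */
uint16_t read_le16(const uint8_t *block, size_t offset) {
    return (uint16_t)(block[offset] | (block[offset + 1] << 8));
}
```

With a made-up block holding the bytes 10 32 54 76 (hexadecimal, lowest address first), reading at offset 1 gives 0x5432.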

Question 3 (4 points) (see above)

Using the above cache, reading an address with which of the following attributes would replace a dirty value (and therefore require writing out the dirty value to memory or the next level of cache)? Select all that apply.

tag 0x1, (set) index 3, (block) offset 0x2
tag 0x2, (set) index 3, (block) offset 0x3
tag 0x1, (set) index 13, (block) offset 0x1
tag 0x0, (set) index 0, (block) offset 0x0

Suppose a 2-way set-associative cache with an LRU (least recently used) replacement policy and a write-no-allocate, write-through policy has the following contents:

	way 0			way 1
set index	valid	tag	value (as list of bytes in hexadecimal, lowest address (offset) first)	valid	tag	value	LRU way
0	1	0x01	01 23	1	0x02	34 56	0
1	1	0x01	01 23	1	0x02	34 56	1

Question 4 (4 points) (see above)

Suppose one reads a byte from address 0x13 using this cache, which is a cache miss. As a result of the miss, a block of the cache will be replaced. Which one?

set index 0, way 0
set index 0, way 1
set index 1, way 0
set index 1, way 1

Question 5 (3 points) (see above)

Suppose a program writes a 2-byte value to address 0x00 on a system using this cache as its data cache. Which of the following statements is true about how the write will be performed?

the write will occur without the cache's contents changing, and will modify main memory or the next level of cache
the write will replace the contents of one of the existing cache blocks as well as being immediately sent to the main memory or the next level of cache, and the prior contents of the replaced cache block will be discarded
the write will replace the contents of one of the existing cache blocks, and the prior contents of the replaced cache block will be sent to main memory or the next level of cache
none of the above

quiz for week 10

Suppose a direct-mapped cache with a write-allocate, write-back policy has the following contents:

set index	valid	dirty	tag	value (as list of bytes in hexadecimal, lowest address (offset) first)
0	1	0	0x000	00 22 33 44 55 FF AA CC
1	0	-	-	-
2	1	0	0x002	11 22 33 44 55 66 77 88
3	1	1	0x002	99 AA BB CC 9A BC DE 00
4	1	0	0x003	23 34 56 00 33 44 55 88
5	1	0	0x003	00 00 00 11 22 33 44 55
6	1	0	0x002	00 33 44 55 99 11 33 55
7	0	-	-	-
8	0	-	-	-
9	1	1	0x002	56 78 45 67 34 23 12 01
10	1	1	0x002	FF FF FF 99 FF FF FF 88
11	1	1	0x002	FF FF FF 77 FF FF FF 66
12	0	-	-	-
13	1	0	0x000	AA BB CC DD 00 00 FF 00
14	1	0	0x000	00 00 00 01 00 00 00 02
15	1	0	0x001	03 00 00 00 04 00 00 00

Question 1 (4 points) (see above)

Suppose the one-byte value 77 (written as hexadecimal) is written using this cache using an address which has a (set) index of 4, a tag of 0x4, and a (block) offset of 0x0. Which of the following is true about the effects of this access? Select all that apply.

it will result in a write to memory of the data byte 77
it will result in a write to memory of data bytes other than 77
after the access completes, the first byte stored in the value field of set index 4 will be 77 instead of 23.
after the access completes, the second byte stored in the value field of set index 4 will remain 34

Consider a system with a two-level cache hierarchy where:

the first-level cache has a hit time of 2 cycles
the shared second-level cache has a hit time of 10 cycles
main memory has an access time of 200 cycles

Suppose a workload is measured to have

an 80% first-level cache hit rate
a 90% second-level cache hit rate

Question 2 (2 points) (see above)

What percentage of first-level cache accesses require a main memory access? Write your answer to the nearest whole number of percent.

Answer:

Question 3 (3 points) (see above)

Assume that when there is a miss in a cache, the cache uses its hit time to determine that it is a miss, and then starts the access to the next level of the memory hierarchy. For example, this would mean that an access that misses in the first-level cache and hits in the second level would take 12 cycles to complete.

Given this, what is the average memory access time of the program, in cycles, rounded to the nearest tenth of a cycle?

Answer:
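
Under the assumption stated in this question (each cache spends its full hit time before a miss is passed to the next level), the average memory access time can be written as a nested sum. The C sketch below encodes that formula; the function and parameter names are hypothetical, it treats the second-level hit rate as the fraction of accesses reaching the second level that hit there, and the numbers in the note afterwards are made up.

```c
/* Average memory access time (in cycles) for a two-level hierarchy in
   which every access pays the L1 time, every L1 miss additionally pays
   the L2 time, and every L2 miss additionally pays the memory time.
   l2_hit_rate is assumed to be a local rate (hits / accesses reaching L2). */
double amat_cycles(double l1_time, double l1_hit_rate,
                   double l2_time, double l2_hit_rate,
                   double memory_time) {
    double l1_miss_rate = 1.0 - l1_hit_rate;
    double l2_miss_rate = 1.0 - l2_hit_rate;
    return l1_time + l1_miss_rate * (l2_time + l2_miss_rate * memory_time);
}
```

With made-up values of a 1-cycle L1 at a 90% hit rate, an 8-cycle L2 at a 50% hit rate, and 100-cycle memory, the result is 1 + 0.1 * (8 + 0.5 * 100) = 6.8 cycles.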

The following questions ask about what data cache misses occur for a particular cache and a particular piece of C code. For those questions assume:

that only accesses to array use the data cache specified
that the data cache is initially empty (all invalid)
the address of array[0] is a multiple of 2 to the 20th power (so array[0] is at the beginning of a cache block and has set index 0 in almost any cache)
accesses in the omitted code labelled ... are irrelevant to the question
the compiler compiles the loops so that the accesses to array are not reordered and does not omit any accesses to array
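
Under assumptions like those listed above, miss counts for simple access patterns can be checked with a small simulation. The C sketch below counts misses for repeated front-to-back scans of a block-aligned byte array through an initially empty direct-mapped cache. The function name and the MAX_SETS limit are hypothetical, only the direct-mapped case is modelled, and the geometry in the note afterwards is made up.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SETS 64   /* arbitrary limit chosen for this sketch */

/* Counts misses when a byte array that starts at a block-aligned address
   with set index 0 is scanned front to back `passes` times through an
   initially empty direct-mapped cache. A read followed by a write to the
   same byte (as in array[j] += 1) misses at most once, so one lookup per
   element is enough. Returns -1 if the geometry is unsupported. */
long count_scan_misses(size_t cache_bytes, size_t block_bytes,
                       size_t array_bytes, int passes) {
    if (block_bytes == 0 || cache_bytes % block_bytes != 0) return -1;
    size_t num_sets = cache_bytes / block_bytes;
    if (num_sets == 0 || num_sets > MAX_SETS) return -1;

    bool valid[MAX_SETS] = { false };
    size_t tags[MAX_SETS] = { 0 };
    long misses = 0;

    for (int p = 0; p < passes; ++p) {
        for (size_t addr = 0; addr < array_bytes; ++addr) {
            size_t block = addr / block_bytes;
            size_t set = block % num_sets;
            size_t tag = block / num_sets;
            if (!valid[set] || tags[set] != tag) {
                ++misses;              /* miss: load the block into its set */
                valid[set] = true;
                tags[set] = tag;
            }
        }
    }
    return misses;
}
```

For a made-up 128-byte cache with 32-byte blocks and a 160-byte array scanned twice, the first pass misses on all five blocks and the second pass misses only on the two blocks that compete for set 0, for 7 misses in total.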

Question 4 (4 points) (see above)

Consider the following C code:

unsigned char array[280];
...
for (int i = 0; i < 10; ++i) {
    for (int j = 0; j < 280; ++j) {
        array[j] += 1;
    }
}

With a direct-mapped 256 byte cache with 64 byte blocks and a write-allocate, write-back policy, how many cache misses will occur? (Note that unlike the examples in lecture, these accesses are unevenly distributed across the sets of the cache.)

Answer:

Question 5 (4 points) (see above)

Consider the following C code:

unsigned char array[2500];
...
for (int i = 0; i < 2500; i += 1000) {
    array[i] += 1;
}
for (int i = 0; i < 2500; i += 990) {
    array[i] += 1;
}

How many cache misses would occur with a 1KB (1024 byte) fully-associative cache with 64 byte cache blocks and an LRU replacement policy and a write-allocate, write-back policy?

Answer:

For the following questions, consider the following C code:

int A[1024 * 1024], B[1024 * 1024], C[1024 * 1024];

void foo1() {
    for (int i = 0; i < 1024; ++i) {
        for (int j = 0; j < 1024; ++j) {
            A[i + j] = B[j * 4] * C[i * 1024 + j];
        }
    }
}

void foo2() {
    for (int i_outer = 0; i_outer < 1024; i_outer += 16) {
        for (int j = 0; j < 1024; ++j) {
            for (int i = i_outer; i < i_outer + 16; ++i) {
                A[i + j] = B[j * 4] * C[i * 1024 + j];
            }
        }
    }
}

Question 6 (see above)

The temporal locality in A is better for ____.

foo1
foo2
it's about the same for both

Question 7 (see above)

The temporal locality in B is better for ____.

foo1
foo2
it's about the same for both

Question 8 (see above)

The spatial locality in C is better for ____.

foo1
foo2
it's about the same for both

Question 9 (see above)

The temporal locality in C is better for ____.

foo1
foo2
it's about the same for both

quiz for week 11

For the following questions, consider the following assembly code:

        movq $100, %rax             /* 1 */
begin_loop:     
        movq (%r8, %rax, 8), %r10   /* 2 */
        movq (%r9, %rax, 8), %r11   /* 3 */
        subq %r11, %r10             /* 4 */
        jg not_negative             /* 5 */
        neg %r10                    /* 6 */
not_negative:
        addq %r10, %r13             /* 7 */
        subq $1, %rax               /* 8 */
        jg begin_loop               /* 9 */

(The neg (negate) instruction is equivalent to multiplying a value by negative 1.)

Question 1 (4 points) (see above)

If one wanted to optimize the above loop by unrolling, the unrolled version of the loop would need to include an extra copy (perhaps with changes to offsets, etc.) of the instructions labelled ____.

Question 2 (4 points) (see above)

If one wanted to optimize the above loop to use multiple accumulators, the best candidate for an accumulator that should be split into two accumulators is ___.
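
As a general illustration of the multiple-accumulators technique (on a simpler loop than the assembly above), the C sketch below sums an array using two independent running totals, so that consecutive additions do not all depend on the result of the previous one. The function name is hypothetical, and the loop is also unrolled by two as a side effect.

```c
#include <stddef.h>

/* Sums an array using two accumulators. Each accumulator forms its own
   dependency chain, so the two additions in one iteration can overlap
   on a processor that can execute more than one addition at a time. */
long sum_two_accumulators(const long *values, size_t count) {
    long even_sum = 0;
    long odd_sum = 0;
    size_t i = 0;
    for (; i + 1 < count; i += 2) {
        even_sum += values[i];
        odd_sum += values[i + 1];
    }
    if (i < count) {
        even_sum += values[i];   /* leftover element when count is odd */
    }
    return even_sum + odd_sum;
}
```

The two partial sums are combined only once, after the loop, which is what keeps the chains independent inside it.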

Consider the following assembly snippet:

addq %r8, %r9
subq %r8, %r10
imulq %r9, %r10
andq %r9, %r8
xorq %r11, %r12

Question 3 (6 points) (see above)

Consider an out-of-order processor that uses register renaming to help handle hazards. In this scheme, the renaming step might convert the above instructions into three-operand versions with physical registers (filling in the numbered blanks shown below):

addq ___ (1), ___ (2) -> ___ (3)
subq ___ (4), ___ (5) -> ___ (6)
imulq ___ (7), ___ (8) -> ___ (9)
andq ___ (10), ___ (11) -> ___ (12)
xorq ___ (13), ___ (14) -> ___ (15)

In this renamed version, which of the following pairs of blanks would refer to the same physical register? Select all that apply.

Question 4 (4 points) (see above)

Suppose an out-of-order processor has two arithmetic execution units, each capable of performing any arithmetic operation, and they can perform:

an addition, subtraction, or bitwise and operation in one cycle
a multiplication operation in five cycles

What is the minimum number of cycles the arithmetic for the above assembly can take on this processor? Do not include time needed to rename instructions, perform forwarding, register reading and writeback, etc.

Answer:

quiz for week 12

For the following question, consider the following C code:

void foo(int N, int *A, int *B, int *C) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            C[i] += A[j*N+i] + B[i+j];
        }
    }
}

Question 1 (3 points) (see above)

Since A might point to part of the same array as B or C, the compiler cannot easily perform which of the following optimizations (without adding some check for whether A overlaps with B or C)? Select all that apply.

unrolling the loop with index j
swapping the i and j loops to improve cache locality
performing cache blocking for the accesses to A and B

Question 2 (3 points) (see above)

Which of the following statements are true about the effects of unrolling the inner loop in the code above? Select all that apply.

it is likely to make foo run more quickly when the argument N is 1 or 2
it is likely to increase the amount of space foo will consume in the instruction cache
it is likely to increase the number of instruction cache accesses

Question 3 (4 points)

Consider the following C functions:

int foo1(int *x, int *y, int *z) {
    for (int i = 0; i < 16; ++i) {
        z[i] = x[i] + y[i];
        z[i+16] = x[i] - y[i];
    }
}

int foo2(int *x, int *y, int *z) {
    for (int i = 0; i < 16; ++i) {
        z[i] = x[i] + y[i];
    }
    for (int i = 0; i < 16; ++i) {
        z[i+16] = x[i] - y[i];
    }
}

A compiler cannot generate the same code for these functions because of aliasing.

Which of the following would be sufficient conditions to guarantee that foo1 and foo2 would have equivalent results? Select all that apply. (Assume that pointer comparisons compare the addresses of pointers and produce consistent usable results even if the two pointers point to parts of different objects.)

x + 16 < y || x > y + 16
(x + 16 < y || x > y + 16) && (x + 16 < z || x > z + 32)
x + 16 < z || x > z + 32
(x + 16 < z || x > z + 32) && (y + 16 < z || y > z + 32)

quiz for week 13

Consider a system with 12-bit virtual addresses, 11-bit physical addresses, and 256-byte pages. On this system, page offsets are therefore 8 bits, virtual page numbers are 4 bits and physical page numbers are 3 bits.

Suppose a process's page table is as follows:

virtual page #	valid?	physical page #
0x0	0
0x1	1	0x2
0x2	1	0x4
0x3	0
0x4	0
0x5	0
0x6	0
0x7	1	0x3
0x8	1	0x5
0x9	1	0x6
0xa	0
0xb	0
0xc	0
0xd	0
0xe	0
0xf	1	0x0
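
The lookup that a single-level page table like this one implies can be sketched in C as follows, using the 8-bit page offset and 4-bit virtual page number described above. The struct, the function name, and the use of -1 to signal a fault are assumptions made for this sketch, and the table contents in the note afterwards are made up rather than copied from the table above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 8    /* 256-byte pages */
#define NUM_VIRTUAL_PAGES 16  /* 4-bit virtual page numbers */

struct pte { bool valid; uint8_t physical_page; };

/* Translates a 12-bit virtual address: the upper 4 bits select a page
   table entry and the lower 8 bits are copied into the physical address.
   Returns the physical address, or -1 if the entry is invalid (a fault). */
int translate(const struct pte table[NUM_VIRTUAL_PAGES], uint16_t vaddr) {
    unsigned vpn = (vaddr >> PAGE_OFFSET_BITS) & (NUM_VIRTUAL_PAGES - 1);
    unsigned offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);
    if (!table[vpn].valid) {
        return -1;
    }
    return (int)((table[vpn].physical_page << PAGE_OFFSET_BITS) | offset);
}
```

In a made-up table where only virtual page 0x3 is valid and maps to physical page 0x6, virtual address 0x3AB translates to 0x6AB, while 0x4AB faults.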

Question 5 (2 points) (see above)

If the process attempts to read from address 0x742, what physical address will it read from? Write your answer as a hexadecimal number. If an exception will happen instead, write fault.

Answer:

Question 6 (2 points) (see above)

If the process attempts to write to address 0x187, what physical address will it write to?

Write your answer as a hexadecimal number. If an exception will happen instead, write fault.

Answer:

quiz for week 14

Consider a system with 12-bit virtual addresses, 11-bit physical addresses, and 256-byte pages.

Suppose a process's page table is as follows:

virtual page #	valid?	write allowed?	user mode allowed?	physical page #
0x0	0
0x1	1	1	1	0x2
0x2	1	1	1	0x4
0x3	0
0x4	0
0x5	0
0x6	0
0x7	1	1	1	0x3
0x8	1	0	1	0x5
0x9	1	0	1	0x6
0xa	0
0xb	0
0xc	0
0xd	0
0xe	0
0xf	1	1	0	0x0
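
One way to read the extra columns in this table is as checks applied on each access. The sketch below assumes the conventional meaning of the bits (an invalid entry always faults, "write allowed" gates stores, and "user mode allowed" gates accesses made outside kernel mode). The struct perm_entry type and access_allowed function are hypothetical names for this example.

```c
/* Hypothetical representation of one row's permission columns. */
struct perm_entry {
    int valid;
    int write_allowed;
    int user_allowed;
    unsigned ppn;
};

/* Returns 1 if the access would proceed, 0 if it would cause an exception. */
int access_allowed(struct perm_entry e, int is_write, int in_user_mode) {
    if (!e.valid)
        return 0;                       /* no translation at all */
    if (is_write && !e.write_allowed)
        return 0;                       /* store to a read-only page */
    if (in_user_mode && !e.user_allowed)
        return 0;                       /* page reserved for kernel mode */
    return 1;
}
```

Under these assumptions, a row marked valid, read-only, and user-accessible permits user-mode reads but rejects user-mode writes.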

Question 1 (4 points) (see above)

Suppose the system using the page table shown above requires that exception handlers be located at a virtual address while processes are running. Based on the page table above, give an example of a virtual address at which the OS might have placed an exception handler. Write your answer as a hexadecimal number.

Answer:

Consider a system with:

32-bit virtual addresses
32-bit physical addresses
4096-byte pages
page tables stored in memory as an array
4-byte page table entries, which, when interpreted as a 4-byte integer, consist of:
- a physical page number (stored in the 20 most significant bits)
- 11 unused bits, and
- a valid bit (stored in the least significant bit)
a current value for the page table base register of 0x88000, which is a physical byte address
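
The layout above can be sketched in C as follows. This is a minimal illustration: the function names are hypothetical, and it assumes the page table array is indexed directly by virtual page number, starting at the page table base register.

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4096-byte pages */
#define PTE_SIZE   4    /* bytes per page table entry */

/* Physical address of the entry for vaddr in an array-style page table. */
uint32_t pte_address(uint32_t base, uint32_t vaddr) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    return base + vpn * PTE_SIZE;
}

/* Valid bit is the least significant bit of the entry. */
int pte_valid(uint32_t pte) {
    return (int)(pte & 1u);
}

/* Physical page number is the top 20 bits; the page offset is kept as-is. */
uint32_t physical_address(uint32_t pte, uint32_t vaddr) {
    uint32_t ppn = pte >> PAGE_SHIFT;
    return (ppn << PAGE_SHIFT) | (vaddr & 0xFFFu);
}
```

As a made-up example (values not taken from the questions), with a base of 0x10000 the entry for virtual address 0x5678 sits at 0x10014, and an entry value of 0xABCDE001 is valid and maps that address to 0xABCDE678.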

Question 3 (4 points) (see above)

What physical address will the processor access to look up a page table entry for virtual address 0x12345? (Remember to account for each page table entry being 4 bytes.) Write your answer as a hexadecimal number. If not enough information is given, write unknown.

Answer:

Question 4 (4 points) (see above)

Suppose the page table entry retrieved for virtual address 0x12345 has the value 0x88800001 (when interpreted as a 4-byte integer). What physical address will contain the value stored at virtual address 0x12345? If an exception would occur instead, write fault.

Answer:

quiz for week 15

In lecture, we mentioned a scheme where the operating system can allocate memory on demand. The following questions are about how that scheme might be implemented.

Question 1 (3 points) (see above)

In order to configure a memory region to be allocated on demand, the operating system will set page table entries ____.

corresponding to the virtual addresses of the memory region as invalid
corresponding to the virtual addresses of the memory region as valid, but read-only
corresponding to the physical addresses that will be allocated for the memory region as invalid
corresponding to the physical addresses that will be allocated for the memory region as valid, but read-only
none of the above

Question 2 (3 points) (see above)

When an access is made that requires memory allocation, the allocation will most likely occur ____.

when the program makes a system call after the access fails
when an exception handler is run as a result of accessing a memory address with no translation in the page table
during the processor's page table lookup
when the processor squashes instructions that were mispredicted as a result of the access

Consider a system that uses a two-level page table structure with 8192-byte pages and 1024 page table entries per table (at each level), where each page table entry is 8 bytes.

Suppose the processor is running with a page table base pointer containing physical (byte) address 0x44440000. When looking up the physical address for virtual address 0x12345678, the processor determines that the relevant first-level page table entry (that is, the one in the table pointed to by the page table base pointer) was marked valid and contained the physical page number 0x1111, and the relevant second-level page table entry was marked valid and contained the physical page number 0x4321.
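
The arithmetic for this kind of two-level lookup can be sketched as below. The sketch assumes that 8192-byte pages give a 13-bit page offset, that 1024 entries per table give a 10-bit index at each level, and that each second-level table starts at the beginning of the physical page named by the first-level entry. All function names are hypothetical.

```c
#include <stdint.h>

#define OFFSET_BITS 13  /* 8192-byte pages */
#define INDEX_BITS  10  /* 1024 entries per table */
#define PTE_SIZE    8   /* bytes per page table entry */

uint64_t level1_index(uint64_t vaddr) {
    return (vaddr >> (OFFSET_BITS + INDEX_BITS)) & 0x3FF;
}

uint64_t level2_index(uint64_t vaddr) {
    return (vaddr >> OFFSET_BITS) & 0x3FF;
}

/* Byte address of the first byte of physical page ppn. */
uint64_t page_base(uint64_t ppn) {
    return ppn << OFFSET_BITS;
}

/* Byte address of entry number index in the table starting at table_base. */
uint64_t entry_address(uint64_t table_base, uint64_t index) {
    return table_base + index * PTE_SIZE;
}

/* Final physical address: last-level page number plus the page offset. */
uint64_t final_address(uint64_t ppn, uint64_t vaddr) {
    return page_base(ppn) | (vaddr & 0x1FFF);
}
```

For a made-up virtual address 0xABCDEF (not the one in the questions), the first-level index is 1 and the second-level index is 0x15E. With a hypothetical base pointer of 0x20000, the first-level entry is read from 0x20008. If that entry names physical page 0x7, the second-level entry is read from 0xEAF0, and if that entry names physical page 0x9, the final address is 0x12DEF.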

Question 3 (4 points) (see above)

During the first level of the lookup described above, from what physical address did the processor read the first-level page table entry? Write your answer as a hexadecimal number.

Answer:

Question 4 (4 points) (see above)

During the lookup described above, what was the final physical address the processor determined? (That is, what physical address would the program read from when it read from virtual address 0x12345678.) Write your answer as a hexadecimal number.

Answer:

Suppose a system has:

a two-level page table structure with 16384-byte pages (below, the "first-level page table" is the page table in this structure that is located via the page table base pointer)
4096 page table entries per level
38-bit virtual addresses
4-byte page table entries
36-bit physical addresses
a 32-entry 2-way set associative TLB

Suppose that on this system the processor looks up virtual address 0x52345, and:

reads a first-level page table entry at index 0x0 in the first-level page table located at physical address 0x800000; then
reads a second-level page table entry at index 0x14 in the second-level page table located at physical address 0x50000; then
determines (using the second-level page table entry) that the physical address corresponding to this virtual address is 0x5a345
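
For a TLB with the geometry listed above, the set index and tag are typically derived from the virtual page number. The sketch below assumes the conventional arrangement in which the low-order bits of the virtual page number select the set and the remaining upper bits form the tag. The function names are hypothetical.

```c
#include <stdint.h>

#define OFFSET_BITS 14                       /* 16384-byte pages */
#define TLB_ENTRIES 32
#define TLB_WAYS    2
#define TLB_SETS    (TLB_ENTRIES / TLB_WAYS) /* 16 sets, so 4 index bits */

uint64_t tlb_vpn(uint64_t vaddr) {
    return vaddr >> OFFSET_BITS;             /* drop the page offset */
}

uint64_t tlb_set_index(uint64_t vaddr) {
    return tlb_vpn(vaddr) % TLB_SETS;        /* low bits of the VPN */
}

uint64_t tlb_tag(uint64_t vaddr) {
    return tlb_vpn(vaddr) / TLB_SETS;        /* remaining high bits */
}
```

For a made-up virtual address 0x123456 (not the one in the questions), the virtual page number is 0x48, so under these assumptions the set index is 0x8 and the tag is 0x4.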

Question 5 (3 points) (see above)

After the processor performs the lookup described above, it will store information in the TLB at what set index?

Answer:

Question 6 (3 points) (see above)

After the processor performs the lookup described above, it will store what tag in the corresponding TLB entry?

Answer: