F2020 quizzes

Suppose memory contains the following 8-bit bytes at the following addresses (each written in hexadecimal):

address	value
...	...
0x0FD	0x11
0x0FE	0x22
0x0FF	0x33
0x100	0x44
0x101	0x55
0x102	0x66
0x103	0x77
0x104	0x88
0x105	0x99
0x106	0x00
0x107	0xA0
0x108	0xAB
0x109	0xBA
0x10A	0xC0
0x10B	0xD0
...	...

Question 1 (2 points) (see above)

In little-endian, reading a four-byte (32-bit) value from address 0x105 yields what value? Write your answer as a hexadecimal number. If not enough information is given write "unknown".

Answer:

Question 2 (2 points) (see above)

With the above memory layout, after running movb 0x104, %al, what will the value of the 8-bit register %al be? Write your answer as a hexadecimal number. If not enough information is given write "unknown".

Answer:

Question 3 (2 points) (see above)

Consider the following assembly snippet:

movq $0x4, %rax
movq $0x1, %rbx
movl 0x100(%rax, %rbx, 2), %eax

When run with the above memory layout, what will the resulting value of the 32-bit register %eax be? Write your answer as a hexadecimal number. If not enough information is given write "unknown".

Answer:

Question 4 (3 points)

Consider the following AT&T syntax assembly:

addq %rbx, %rbx
addq %rax, %rbx
movq (%rbx), %rax

Which of the following assembly snippets will result in the same final value of %rax? (Ignore changes to the value of %rbx.) Select all that apply.

movq (%rbx, %rax, 2), %rax
movq (%rax, %rbx, 2), %rax
addq %rbx, %rbx
addq %rax, %rbx
addq %rbx, %rbx
movq (%rbx, %rax), %rax
addq %rbx, %rbx
movq (%rax, %rbx), %rax
addq %rbx, %rbx
addq %rax, (%rbx)

Question 5 (2 points)

Suppose the assembly label array is defined to be a constant 5-byte array as follows:

array:
   .byte 3
   .byte 4
   .byte 5
   .byte 6
   .byte 7

and the linker chooses to locate this array at address 0x10000. (Each .byte directive specifies the value of one byte of memory.)

Then, if we run the assembly snippet:

    movq $1, %rax
    addq $array, %rax
    movb 1(%rax), %bl

What would the value of 8-bit register %bl be afterwards? You may assume memory around array is not modified before this snippet is run.

Write your answer as a hexadecimal number. If not enough information is given write "unknown".

Answer:

quiz for week 2

Question 1 (2 points)

Consider the following C function:

long foo(long a, long b) {
    return a + b;
}

Which of the following are correct (but perhaps unnecessarily complex) translations of this function to AT&T syntax assembly (using the Linux x86-64 calling convention)? Select all that apply.

foo: movq %rsi, %rax
addq %rdi, %rax
ret
foo: movq (%rsi, %rdi), %rax
leaq (%rax), %rax
ret
foo: leaq 1(%rsi, %rdi, 1), %rax
ret
foo: leaq 0(%rsi, %rdi), %rax
ret

Consider the following C function:

long example(long a, long b) {
    while (a > b) { a = a - b; }
    return a;
}

Question 2 (2 points) (see above)

This function can be converted to x86-64 assembly (following Linux's x86-64 calling convention) like the following:

example:
    cmpq %rsi, %rdi
    ____ L_done
    subq %rsi, %rdi
    jmp example
L_done:
    movq %rdi, %rax
    ret

What instruction goes in the blank (the second instruction of the function, before L_done)?

Question 3 (2 points) (see above)

Assuming the assembly translation from the previous question is used, after this function returns, what will the value of the zero flag (ZF) be?

always 0
always 1
1 if the return value is 0; otherwise 0
1 if b is 0, otherwise 0
1 if the return value is equal to the argument b; otherwise 0
1 if the loop executed at least one time (the original value of a was greater than b); otherwise 0
1 if the loop executed zero times; otherwise 0
it depends on the arguments and/or return value, but not in a way described above
it depends on what the value of the zero flag (ZF) was before the function was called

Suppose we assemble the following into an object file:

.data
.global array
array:
    .byte 1
    .byte 2
    .byte 3
    .byte 4

.text
.global bar 
bar:
    cmpq $0, %rdi
    je end_bar
    movq $array, %rdi
    call print_array
end_bar:
    movq $array, %rax
    ret

Question 4 (2 points) (see above)

The corresponding object file's relocations table will reference which of the following (either by name or by identifying the location of the corresponding label)? Select all that apply.

array
bar
print_array

Question 5 (2 points) (see above)

When using the resulting object file to produce an executable (using the static (non-dynamic) linking scheme we discussed in lecture), the linker will _______. Select all that apply.

write the four bytes that are stored after the label array to the executable file
find a symbol table entry for print_array in some other object file
write the memory address (in some format) chosen for the call print_array instruction somewhere in the executable
write the memory address for array (in some format) in the resulting executable somewhere

Consider the following C function:

int *quux(int *p) {
    int *r;
    r = p + 2;
    *r += 4;
    return r;
}

Question 6 (2 points) (see above)

Suppose the function quux is run on a Linux x86-64 system where:

ints are 4 bytes,
p points to the first element of an array of 100 ints (400 bytes) located at address 0x10000
each of the 100 ints in the array initially has the value 7

What will the value of the pointer r be just before quux returns? Write your answer as a hexadecimal number. (Be sure to give the value of the pointer and not the value it points to.)

Answer:

quiz for week 3

Consider the following C function:

long example(long a) {
    long last_a = a;
    while (a != (a >> 40)) {
        last_a = a;
        a = a >> 40; 
    }   
    return last_a;
}

Assume >> on integers is implemented using an arithmetic shift (copies the sign bit for leftmost bits of result) and longs are 64 bits, represented using two's complement.

Question 1 (2 points) (see above)

The value of example(1) is 1. Besides 1, what is another value K such that example(K) is 1? (There are several possible answers.) Write your answer as a base-10 number.

Answer:

Question 2 (2 points) (see above)

How many distinct return values can the above example function have? Write your answer as a base-10 number.

Answer:

Question 3 (2.5 points)

If x and y are 32-bit signed ints with values between -1000000 and 1000000 on a system that uses two's complement, which of the following C expressions are always true? Select all that apply.

(x >= 0) || (((x >> 31) & 1) == 1)
(x & 0xFF) <= (x & 0xFFF)
((x + (y & 0xFFFF)) & 0xFFFF) == (((x & 0xFFFF) + y) & 0xFFFF)
(((x | y) & 0xFF) >> 8) == (((x >> 8) & 0xFF) | ((y >> 8) & 0xFF))
(((x & 0xFFF) >> 8) ^ (y & 0xFFF)) == ((((x >> 8) ^ y) & 0xFF) | (y & 0xF00))

Question 4 (2 points)

Which of the following C expressions will, given an unsigned integer x, return the least significant 4 bits of x with their order reversed? For example, if x in binary were 11001000001101, the result would be (in binary) 1011 (the reverse of 1101). Select all that apply.

(((x & 0x11) << 3) | ((x & 0x12) << 1) | ((x & 0x14) >> 1) | ((x & 0x18) >> 3)) & 0xF
((x << 1) & 1) | ((x << 2) & 2) | ((x >> 1) & 8) | ((x >> 2) & 4)
((x & 1) << 3) | ((x & 2) << 1) | ((x & 4) >> 1) | ((x & 8) >> 3)
((x << 3) & 15) | ((x << 1) & 5) | ((x >> 1) & 2) | ((x >> 3) & 1)

quiz for week 4

Question 3 (2 points)

Consider the following Y86-64 machine code, written as a sequence of bytes in hexadecimal:

 50 76 74 00 00 00 00 00 00 80 60 12 61 84 00 00 00 00 00 00 00

If we translate this to assembly (assuming the first instruction starts at the first byte) then the first two instructions would be:

Question 5 (2 points)

Consider the following HCLRS code snippet where ... represents some omitted code:

register xY {
    foo : 64 = ...;
    bar : 64 = ...;
}
...
x_foo = Y_foo + Y_bar;
x_bar = Y_bar - Y_foo;

During cycle 10, Y_foo has the value 500 and Y_bar has the value 300. What is the value of Y_foo during cycle 12? (Write your answer as a base-10 number, like 123.)

Assume that cycles are separated by a rising edge of the clock signal.

Answer:

Question 6 (2 points)

Using the kind of registers we described in lecture and in section 4.2.5 of our textbook, suppose a register's output is 42, its value input is also 42, and the clock signal is high. Then the following happens in this order:

the clock signal falls (becoming low)
the register's value input changes to 44
the register's value input changes to 45
the clock signal rises (becoming high again)
the register's value input changes to 46
the clock signal falls (becoming low)
the register's value input changes to 47

What will the value of the register's output be after this occurs? If not enough information is given to answer, write "unknown" and explain in the comment field.

Answer:

quiz for week 5

Question 3 (2 points)

Consider the following HCLRS code snippet:

reg_srcA = 8;
reg_dstE = 0;
reg_inputE = reg_outputA;

If this were part of an HCLRS processor which does not have any other code using the register file inputs and outputs, then

the value of %r8 would be copied to %rax during every cycle
the value of %rax would be copied to %r8 during every cycle
the value 0 would be written to %rax during every cycle
the value 0 would be written to %r8 during every cycle
the value 8 would be written to %rax during every cycle
the value 8 would be written to %r8 during every cycle
the value 0 would be written to both %r8 and %rax during every cycle
the value of registers would not change
none of the above

Rather than having call and ret instructions that push and pop values from the stack, many instruction sets instead store the return address in a register (which programs can save on the stack if necessary).

For example, RISC-V provides a jal REGISTER, TARGET_LABEL ("jump and link") instruction to replace the functionality of call and a jr REGISTER ("jump to register") instruction to replace the functionality of ret. Like call, jal REGISTER, TARGET_LABEL stores the return address and then jumps to TARGET_LABEL, but it stores the return address in REGISTER rather than on the stack. (If necessary, the function can use another instruction to push the return address onto the stack.) jr REGISTER sets the PC to the value stored in REGISTER.

Question 4 (2 points) (see above)

Suppose we added the jal instruction described above to the single-cycle Y86-64 processor design we described in lecture (and which is described in our textbook). (By "single-cycle processor", we mean a processor that executes each instruction in a single cycle.) To avoid adding inputs to MUXes or adding MUXes (or similar circuitry) that control the 4-bit register number inputs to the register file (reg_srcA, reg_srcB, reg_dstE, and reg_dstM in HCLRS), which of the encodings below would be best?

(In each of the encodings, values of each byte are provided with most significant bits written first (left-most).)

[4 bit icode][4 bit destination register] (1st byte) then [64-bit target address]
[4 bit icode][4 bits unused] (1st byte) then [4 bit unused][4 bit destination register] (2nd byte) then [64-bit target address]
[4 bit icode][4 bits unused] (1st byte) then [64-bit target address] (2nd through 9th byte) then [4 bit destination register][4 bits unused] (10th byte)
[4 bit icode][4 bit condition code info] (1st byte) then [64-bit target address] (2nd through 9th byte)

Question 5 (2 points) (see above)

Suppose we added the jal instruction described above to the single-cycle processor design we described in lecture (and which is described in our textbook). While this instruction is executing, the input to the PC register would most likely be equal to

part of the output of the instruction memory
the result of a calculation performed using one of the outputs of the register file
one of the outputs of the register file
the output of the data memory
none of the above

quiz for week 6

Question 1 (2 points)

In the lab on implementing condition codes, we suggested declaring the condition code registers using

register cC {
    SF:1 = 0;
    ZF:1 = 1;
}

and then using

stall_C = (icode != OPQ);

to keep the condition code registers from changing when the instruction was not an OPq instruction. We noted that, in HCLRS, "Register banks like cC have a special input stall_C which, if 1, causes the registers to ignore inputs and keep their current value." (Each register bank has its own stall signal, this one is stall_C, since the condition code registers were declared using register cC.)

If register banks did not provide this stall signal, we could have implemented the functionality using a case expression (MUX) when setting c_SF and c_ZF. What would the corresponding code for setting c_ZF look like?

c_ZF = [ icode == OPQ : valE == 0; 1 : 0; ];
c_ZF = [ icode == OPQ : valE == 0; 1 : 1; ];
c_ZF = [ icode == OPQ : valE == 0; 1 : c_ZF; ];
c_ZF = [ icode == OPQ : valE == 0; 1 : C_ZF; ];
c_ZF = [ icode == OPQ : valE == 0; 1 : !C_ZF; ];
c_ZF = [ icode == OPQ : valE == 0; 1 : !c_ZF; ];
c_ZF = [ icode == OPQ : valE == 0; 1 : valE != 0; ];
none of the above

Consider the following diagram of the single-cycle processor data path from lecture:

Note that in this version of the processor design, the second input to the ALU (which our textbook calls aluB) is 0 or the second output of the register file (reg_outputB in HCL).

Suppose we wanted to implement a new instruction ixorq on this processor, which would xor a register's value with a constant and store the result in that register. For example,

ixorq $0x1234, %rax

would take the value of %rax, xor it with 0x1234 and store the result in %rax. In machine code, the instruction would have the same layout (placement of fields like icode and rA and valC) as irmovq.

Question 2 (2 points) (see above)

When this ixorq instruction is executing, the MUX that controls the dstM register file input should ____.

select the top input (rA)
select the second input (0xF, also known as REG_NONE)
select any input; it won't affect the instruction's operation
select a new rB input that the MUX needs to be modified to support in order to implement the ixorq instruction

Question 3 (2 points) (see above)

When this ixorq instruction is executing the MUX that controls the aluB ALU input should ____.

select the top input (reg_outputB)
select the second input (0)
select any input; it won't affect the instruction's operation

quiz for week 7

Question 1 (2 points)

Suppose we have a six-stage pipelined processor with 500 ps cycle time.

Running a particular benchmark program that executes 1 billion (10 to the 9th power) instructions in a simulator, we determine that 5% of its instructions require exactly one cycle of stalling, 10% require exactly two cycles of stalling, and no instruction requires more than two cycles of stalling.

(For the purposes of this question, a cycle of stalling means one bubble (hardware-generated nop) inserted in the pipeline rather than advancing an instruction. Assume we attribute each stall to exactly one instruction.)

Assume these stalls are the only reasons why the processor would not complete one program instruction every cycle.

To the nearest millisecond, how long (in milliseconds) will the benchmark program take to run? Write your answer as a base-10 number of milliseconds.

Answer:

For the following two questions, consider executing the following assembly snippet:

addq %rax, %rbx
subq %rbx, %rdx
xorq %rbx, %rcx
rrmovq %rdx, %rcx
addq %rbx, %rcx

Question 2 (2 points) (see above)

Suppose the assembly snippet is executing on the five-stage pipelined processor we described in lecture, except that it resolves data hazards only by stalling (no forwarding). If the first addq instruction is fetched in cycle 0, then during what cycle will the final addq instruction run its writeback stage?

Answer:

Question 3 (3 points) (see above)

Suppose the assembly snippet is executing on the five-stage pipelined processor we described in lecture that:

uses forwarding to resolve data hazards to the extent possible without dramatically increasing the cycle time

Which of the following forwarding operations must occur to avoid the most stalling possible? Select all that apply.

%rbx will be forwarded from the first addq to the subq
%rbx will be forwarded from the first addq to the xorq
%rbx will be forwarded from the subq to the xorq
%rdx will be forwarded from the subq to the rrmovq
%rcx will be forwarded from the xorq to the rrmovq
%rcx will be forwarded from the xorq to the addq

For the following question, consider executing the following assembly snippet:

addq %rax, %rbx 
mrmovq 8(%rbx), %rcx
xorq %rcx, %rdx
rmmovq %rdx, 16(%rbx)

Question 4 (2 points) (see above)

Suppose the assembly snippet above were executed on a six-stage pipelined processor with the following stages:

Fetch
Decode
Execute
Memory 1
Memory 2
Writeback

This processor acts like the processor we discussed in lecture and implements all forwarding possible (that wouldn't dramatically increase cycle times).

For the purpose of forwarding, when the stages are not split, we generally assume:

a value needed for a computation or storage access can only be used by a stage if it's computed or retrieved in the previous cycle

Similarly, for the split memory stages, assume:

for instructions that read from the data memory, the address to read must be computed in the cycle before the Memory 1 stage runs, and
the result of any memory read is only available to be used (e.g. after being forwarded) by other instructions in the cycle after the Memory 2 stage runs

Given this processor, if the addq performs its fetch stage during cycle 0, then during what cycle number will the rmmovq instruction finish its writeback stage?

Answer:

Question 5 (2.5 points)

Consider the following assembly snippet (where ... represents irrelevant instructions):

    addq %rax, %rcx
    subq %rcx, %rdx
    je foo
    xorq %rcx, %rdx
    ...
    ...
    ...
foo:
    irmovq $10, %rax /* A */
    irmovq $20, %rbx /* B */
    ...

where the je is not taken.

Suppose that we are executing the assembly snippet on a five-stage pipelined processor based on the design in lecture that:

uses forwarding to resolve data hazards to the extent possible without substantially increasing the cycle time, and
speculates that all conditional jumps will be taken, like we described in lecture, so the instructions labeled A and B will be fetched in the two cycles after the je is fetched and then squashed (discarded)
when a conditional jump turns out not to be taken (so the processor's guess was wrong), fetches the correct instruction during the memory stage of the conditional jump instruction (the cycle after determining what address to fetch in the conditional jump's execute stage)

Which of the following is true about what happens when the above assembly executes? Select all that apply.

when the addq's memory stage runs, the subq's execute stage is running
when the xorq's fetch stage runs, the subq's writeback stage has not yet completed
the value of %rdx will be forwarded from subq to xorq
the value of %rcx will be forwarded from addq to subq
the value of %rcx will be forwarded from addq to xorq

quiz for week 8

Suppose we are implementing a five-stage processor with a similar design to the one discussed in lecture, but sometimes we need to stall for one cycle because the output of the data memory needs an extra cycle to be retrieved.

Suppose the instruction triggering the stall is in the memory stage during cycle number 0, and needs to stay in the memory stage until cycle number 1 to complete the memory read. (For the purposes of this question, we say an instruction is in a stage when its values are being output from the corresponding pipeline registers.)

Complete the following statements about how the pipeline registers should behave.

Question 1 (0.5 points) (see above)

During cycle number 1, the pipeline registers between fetch and decode

should output the same values they were outputting during cycle number 0
should output the values for a nop
should output values corresponding to the instruction that was fetched during cycle 0

Question 2 (0.5 points) (see above)

During cycle number 1, the pipeline registers between decode and execute

should output the same values they were outputting during cycle number 0
should output the values for a nop
should output values corresponding to the instruction that was in the decode stage in cycle 0

Question 3 (0.5 points) (see above)

During cycle number 1, the pipeline registers between execute and memory

should output the same values they were outputting during cycle number 0
should output the values for a nop
should output values corresponding to the instruction that was in the execute stage in cycle 0

Question 4 (0.5 points) (see above)

During cycle number 1, the pipeline registers between memory and writeback

should output the same values they were outputting during cycle number 0
should output the values for a nop
should output values corresponding to the instruction that was in the memory stage in cycle 0

For the following two questions, consider a 4-block direct-mapped cache with 4-byte cache blocks. For each question, assume the cache's contents are as follows:

index (in base 2)	valid bit	tag (in base 2)	data (hexadecimal, list of bytes, lowest address left-most)
00	1	001001	23 56 78 9A
01	1	001001	AA BB CC DD
10	1	000011	01 02 03 04
11	0	000000	00 00 00 00

For the following two questions, write down what the result of reading one byte from the specified addresses will be (assuming the cache has the contents listed above when the access occurs):

if the result will be a cache hit, write the value that will be read in hexadecimal (with or without a leading 0x)
if the result will be a cache miss, write the word miss.

(Note that addresses may have leading zeroes which are not written.)

Question 5 (see above)

0x91

Answer:

Question 6 (see above)

0x3B

Answer:

For the following two questions, consider a 4-block, 2-way set associative cache with 4-byte cache blocks whose contents are as follows:

index (in base 2)	valid bit (way 0)	tag (in base 2) (way 0)	data (hexadecimal, list of bytes, lowest address left-most) (way 0)	valid bit (way 1)	tag (way 1)	data (way 1)
0	1	0010010	23 56 78 9A	1	0010011	AA BB CC DD
1	1	0000110	01 02 03 04	1	0000011	71 82 93 F3

For the following two questions, write down what the result of reading one byte from each of the specified addresses will be (assuming the cache has the contents listed above when the access occurs):

if the result will be a cache hit, write the value that will be read in hexadecimal (with or without a leading 0x)
if the result will be a cache miss, write the word miss.

(Note that addresses may have leading zeroes which are not written.)

Question 7 (see above)

0x91

Answer:

Question 8 (see above)

0x3B

Answer:

quiz for week 9

Question 2 (2 points)

Consider an 8KB 2-way set associative cache with an LRU replacement policy and 64-byte blocks. (1KB = 1024 bytes.)

Suppose the cache is initially empty (all valid bits set to 0) and the program then accesses 1 byte from each of the following addresses, in this order:

0x10005
0x40000
0x12400
0x10001
0x33320

Immediately after the accesses described above, give an example of an address which, if read using this cache, would cause something to be evicted from the cache:

Answer:

Consider a two-set direct-mapped cache with 8B blocks and a write-allocate and writeback policy.

Suppose the cache is initially empty (all blocks invalid) and we perform the following accesses of single bytes:

read from 0x104
write to 0x101
write to 0x102
write to 0x108
read from 0x109
read from 0x10f
read from 0x110
read from 0x118
write to 0x119

Question 3 (see above)

Will the read from 0x109 be a hit?

Question 4 (see above)

Will the read from 0x10f be a hit?

Question 5 (2 points) (see above)

Which of these reads from the access pattern above will trigger a write to memory (or the next level of cache)? Select all that apply.

read from 0x109
read from 0x10f
read from 0x110
read from 0x118

Question 6 (2 points)

Suppose we have a system with:

a 2-set direct-mapped cache with 8 byte cache blocks
4-byte ints (so each cache block can store 2 ints)

and run the following code:

int array[9];
...
int count1 = 0, count2 = 0, count3 = 0;
count1 += array[0];
count1 += array[3];
count1 += array[6];
count2 += array[1];
count2 += array[4];
count2 += array[7];
count3 += array[2];
count3 += array[5];
count3 += array[8];

Assuming that:

array[0] is assigned an address at the beginning of a cache block
the assembly the compiler generates for the above code does not reorder or omit data cache accesses
only accesses to array use the data cache, and
the cache is initially empty

how many data cache misses should we expect?

(Note that unlike the examples in lecture, the 9 accesses here are not evenly distributed across cache sets.)

Answer:

quiz for week 10

Consider the following 2 versions of C code:

/* Version 1 */
for (int i = 0; i < N; ++i) {
    for (int j = i; j < N; ++j) {
        C[j] += A[i*N+j] * B[j];
    }
}

/* Version 2 */
for (int j = 0; j < N; ++j) {
    for (int i = j; i < N; ++i) {
        C[j] += A[i*N+j] * B[j];
    }
}

Question 1 (see above)

Which version has better temporal locality in accesses to C?

version 1
version 2
they are about the same

Question 2 (see above)

Which version has better spatial locality in accesses to A?

version 1
version 2
they are about the same

Question 3 (2 points) (see above)

If N is 100000 and cache blocks can hold 8 elements of the array B, then we would expect approximately _____ cache misses for the accesses to B when running version 1 of the code above. (Choose the closest answer.)

Assume the cache is not large enough to hold 50000 elements of B (or A or C).

Question 4 (2 points)

Consider the following assembly snippet:

addq %rax, %rbx
mrmovq 8(%rbx), %rax
subq %rbx, %rcx

In lecture, we discussed how an out-of-order processor may perform register renaming, where it converts instructions from using architectural registers (the ones that appear in assembly) to physical registers (used internally in the processor). When doing this, the processor ensures that each version of an architectural register's value uses a different physical register, which aids in resolving hazards.

After an out-of-order processor performs register renaming on the above instructions as discussed in lecture, which of the following statements about the physical registers used by the renamed versions of the above instructions will be true?

for %rbx's value, the renamed addq instruction will write the same physical register that mrmovq reads
for %rax's value, the renamed mrmovq instruction will write the same physical register that addq reads
for %rbx's value, the renamed subq instruction will read the same physical register that mrmovq reads

quiz for week 11

Question 1 (2 points)

Suppose an out-of-order processor has two execution units (that matter for this question):

a 1-cycle adder (that can perform one add per cycle)
a 3-cycle pipelined multiplier (that can start one multiply per cycle, but finishes each multiply two cycles later)

Suppose on this processor the results of an addition or multiply can be used to start another addition or multiply in the cycle after the first addition or multiply completes.

Consider the following C snippet:

a = b + c;
d = e * f;
h = a + d;
i = a * d;

With the processor described above, what's the fastest time (in cycles) in which the above C snippet can execute? (Do not account for time spent on instruction fetching, register renaming, etc.; count only the time to finish the actual computation.)

Answer:

Question 2 (2.5 points)

In lecture, we discussed the use of multiple accumulators to improve performance in addition to simple loop unrolling. Consider the following pair of unrolled loops with and without use of multiple accumulators:

/* loop, without multiple accumulators transformation */
for (int i = 0; i < N; i += 4) {
    product = product * (a[i] * b[i]);
    product = product * (a[i+1] * b[i+1]);
    product = product * (a[i+2] * b[i+2]);
    product = product * (a[i+3] * b[i+3]);
}

/* loop, with multiple accumulators transformation */
for (int i = 0; i < N; i += 4) {
    product1 = product1 * (a[i] * b[i]);
    product2 = product2 * (a[i+1] * b[i+1]);
    product1 = product1 * (a[i+2] * b[i+2]);
    product2 = product2 * (a[i+3] * b[i+3]);
}
product = product1 * product2;

Whether this transformation would be helpful depends on the execution units the processor has.

Suppose the performance of the execution units that perform the multiplications (represented by a * operations in the C code above) is what determines the performance of the loop overall. To make it easier to reason about performance, assume the execution units that perform these multiplications are never involved in any of the address or index calculations needed by the loop above (so one only needs to consider how the multiplication instructions are dispatched and executed).

With which configuration(s) of execution units for the multiplications could the loop's performance benefit from the multiple accumulators optimization shown above? Select all that apply.

one multiplier which is not pipelined (does not accept new values to multiply until the current multiply is complete) and takes ten cycles to perform a multiplication
one multiplier which takes one cycle to produce results
ten multipliers, each of which takes one cycle to produce results
one multiplier which is pipelined (accepts a new pair of values to multiply each cycle) and takes ten cycles to produce results
ten multipliers, each of which is pipelined (accepts a new pair of values to multiply each cycle) and takes ten cycles to produce results

Question 4 (2 points)

Consider the following two C functions:

void all_pairs_products1(int N, int *A, int *result) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            result[i * N + j] = A[i] * A[j];
        }
    }
}

void all_pairs_products2(int N, int *A, int *result) {
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < N; ++i) {
            result[i * N + j] = A[i] * A[j];
        }
    }
}

(The two functions differ in their loop orders.)

A compiler cannot generate identical code for these two functions because, in cases where A and result are pointers that refer to the same data (a problem we called "aliasing"), the two functions can write different values to the result array.

Which of the following calls to all_pairs_products1 could leave different values in array than if the same call were made to all_pairs_products2 instead? Select all that apply.

all_pairs_products1(1024, &array[0], &array[0])
all_pairs_products1(1024, &array[0], &array[1024])
all_pairs_products1(1024, &array[1024], &array[0])
all_pairs_products1(1024, &array[1024], &array[1024])

quiz for week 12

Question 1 (2 points)

In lecture we discussed vector instructions (also known as SIMD instructions), where a single instruction can perform an operation on every pair of values in two vectors, which are typically fixed-size arrays stored in registers.

Using vector instructions similar to those used in the lab, which of the following code snippets is simplest to transform into a version that makes effective use of vector instructions?

/* loop A */
for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
        A[j*N + i] += A[i*N + j];
    }
}

/* loop B */
for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
        A[i*N + j] += B[j] * C[(i+1)*N + j];
    }
}

/* loop C */
for (int i = 2; i < N; ++i) {
    A[i] *= (A[i-1] + A[i-2]);
}

(Assume A, B, and C are independent arrays.)

loop A
loop B
loop C

For the following questions, consider a system with 20-bit virtual addresses where virtual addresses are divided into a 16-bit page offset and a 4-bit virtual page number. (For example, virtual address 0x12345 has virtual page number 0x1 and page offset 0x2345.) Suppose the contents of the page table of this system are:

virtual page number	valid	physical page number
0x0	0	0x00
0x1	1	0x15
0x2	1	0x14
0x3	1	0x20
0x4	1	0x05
0x5	1	0x06
0x6	1	0x09
0x7	0	0x00
0x8	0	0x00
0x9	1	0x13
0xA	1	0x14
0xB	1	0x30
0xC	1	0x31
0xD	1	0x32
0xE	1	0x33
0xF	1	0x34

Question 5 (see above)

Based on the page table above, when accessing the virtual address 0x30001, what physical address will be accessed? Write your answer as a hexadecimal number. If a fault (an exception) would occur, write "fault".

Answer:

Question 6 (see above)

Based on the page table above, when accessing the virtual address 0x5467F, what physical address will be accessed? Write your answer as a hexadecimal number. If a fault (an exception) would occur, write "fault".

Answer:

quiz for week 13

Suppose a system has:

20-bit virtual addresses
1024 byte pages
24-bit physical addresses
4-byte page table entries
a page table base pointer set to physical (byte) address 0x1000
a single-level page table structure

Question 1 (2 points) (see above)

Based on the above information, what is the address at which the page table entry for the virtual address 0x1000 is stored? (Note that page table entries are larger than one byte.) Write your answer as a hexadecimal number.

Answer:

Question 2 (2 points) (see above)

How large are page tables on this system (in bytes)?

Answer:

Suppose a system has:

42-bit virtual addresses (maximum value 0x3FF FFFF FFFF), with a 30-bit virtual page number and a 12-bit page offset
30-bit physical addresses (maximum value 0x3FFF FFFF), with an 18-bit physical page number and a 12-bit page offset
three-level page tables, where 10 bits of the virtual page number are used for a lookup at each level
page table entries are four bytes

Question 3 (2 points) (see above)

When looking up the virtual address 0x012 3456 789A, if the page table base pointer contains the physical byte address 0x44 000, then what is the physical address of the first-level page table entry for 0x012 3456 789A? (Note that page table entries are larger than one byte.) Write your answer as a hexadecimal number.

Answer:

Question 4 (2 points) (see above)

Suppose that when looking up page table entries for the virtual address 0x012 3456 789A:

the page table entry for the first-level was valid and contained physical page number 0x99,
the page table entry for the second-level was valid and contained physical page number 0xA4, and
the page table entry for the third-level was valid and contained physical page number 0xC3

Based on this information, when a program attempts to read data from virtual address 0x012 3456 789A, at what physical address will that data be found? Write your answer as a hexadecimal number. If not enough information is provided, write "unknown" and explain what information is missing in the comments.

Answer:

Question 5 (2 points) (see above)

If this system had a 64-entry, 8-way TLB, that TLB would use 3 index bits. How many tag bits would it use?

Answer: