Quiz5 of Computer Architecture (2020 Fall)

Solutions

Question `A`

We would like to design a cache with an AMAT (average memory access time) of 1.5 cycles. Accessing our cache takes 1 cycle, and on a miss, it takes an additional 10 cycles to retrieve the data from main memory and update the cache. What does our hit ratio need to be in order to achieve the target AMAT?

Hit ratio = __ 0.95 __(A01)
AMAT = HitTime + MissPenalty * MissRatio
1.5 cycles = 1 cycle + (10 cycles) * (1 – HitRatio)
0.05 = 1 - HitRatio
HitRatio = 0.95

The timing for a particular cache is as follows: checking the cache takes 1 cycle. If there is a hit the data is returned to the CPU at the end of the first cycle. If there is a miss, it takes 10 additional cycles to retrieve the word from main memory, store it in the cache, and return it to the CPU. If we want an average memory access time of 1.4 cycles, what is the minimum possible value for the cache’s hit ratio?

Minimum possible value of hit ratio: __ 0.96 __(A02)
AMAT = HitTime + MissRatio * MissPenalty
1.4 = 1 + (1 - HitRatio) * 10
HitRatio = 0.96

Question `B`

Consider that P1 is a five-stage pipelined RISC-V processor (IF, DEC, EXE, MEM, WB), where:

All branches are predicted not-taken.
Branch decisions are made in the EXE stage.
The pipeline has full bypassing.

Assume that the loop shown to the right has been running for a while and will be re-executed after this iteration.

Given the program below. How many cycles does it take to execute one iteration of the loop on this processor?

loop: add  x3, x2, x1
      slli x2, x2, 1
      xor  x4, x3, x4
      bnez x2, loop
      and  x11, x2, x1
      or   x12, x22, x7
      sra  x13, x23, x8
      ...

Cycles per iteration on this processor P1: __ 6 __(B01)

Data Hazards will happen between add x3, x2, x1 and xor x4, x3, x4, slli x2, x2, 1 and bnez x2, loop.
There is no load-use hazard, so we do not stall the pipeline.

Forwarding

Flushing

Now consider that P2 is a variant of the previous five-stage pipelined RISC-V processor (IF, DEC, EXE, MEM, WB), where:
- All branches are predicted not-taken.
- Branch decisions are made in the EXE stage.
- The pipeline has no bypass paths.
Assume that our example loop (same as the previous program) has been running for a while and will be re-executed after this iteration. How many cycles does it take to execute one iteration of the loop on this processor?

Cycles per iteration on processor P2: __ 8 __ (B02)

Cycle 1 2 3 4 5 6 7 8 9

IF add slli xor benz benz banz and or add

ID add slli xor xor xor benz and flush

EX add slli NOP NOP xor benz flush

MEM add slli NOP NOP xor benz

WB add slli NOP NOP xor

In the 5-th cycle, because no bypass path, so the x3 value of add need first write back to Register, so insert two NOP to wait it to actually write back, also in cycle.

Cycle	1	2	3	4	5	6	7	8	9
IF	add	slli	xor	benz	benz	banz	and	or	add
ID		add	slli	xor	xor	xor	benz	and	flush
EX			add	slli	NOP	NOP	xor	benz	flush
MEM				add	slli	NOP	NOP	xor	benz
WB					add	slli	NOP	NOP	xor

Question `C`

You are designing a four stage RISC-V processor (IF, DEC, EX, WB) that:

Resolves branches in the EX stage.
Uses a scoreboard to track and stall on data hazards.
Initiates accesses to data memory in the EX stage, but data memory responses are not available until the WB stage.

This processor has a tight area constraint, and you only have enough area for one bypass path for read-after-write hazards. You are trying to decide whether to put the bypass path from EX to DEC or from WB to DEC. As part of this evaluation, you construct two processors:

Processor P1: Bypassing from WB to DEC.
Processor P2: Bypassing from EX to DEC.

These processors ensure that if the instruction in the DEC stage does not have all of its operands ready to read from the register file or read from a bypass path, the instruction in the decode stage is stalled.

You are using the following loop of a program to evaluate the performance of the processor:

loop: lw   t0, 0(a0)
      addi t1, t0, 1
      xor  t3, t1, a0
      slli t2, t1, 1
      blt  t0, a0, loop

For the following questions, assume this loop has been running for a long time, and that the processor assumes that the branch will be taken.

How many cycles does this loop take to execute in each of the processors?

P1 cycles per iteration: __ ? __
P2 cycles per iteration: __ ? __

Processor P1 cycles per iteration: __ 7 __(C01)
Processor P2 cycles per iteration: __ 8 __(C02)

P1 cycles per iteration:
Data Hazards is between lw t0, 0(a0) and addi t1, t0, 1, and also between addi t1, t0, 1 and xor t3, t1, a0.
So we can stall a cycle to wait the lw instruction and pass the data to DEC stage.

P2 cycles per iteration:
Only the forwarding from EX to DEC, we need to wait the data memory responses until WB stage.

Question `D`

Consider the execution of the following code sequence on a 5-stage pipelined RISC-V processor, which is fully bypassed. The code below iterates over the array shown below to find the index of the element whose value is 0x104. Thus, on the second iteration of the loop, it branches to label done. The instruction unimp signals the end of the program. This processor:

Always speculates that the next instruction should come from address PC+4.
Computes branch decisions and jump targets in the EXE stage.
Annuls incorrectly fetched instructions resulting from branches and jump instructions.
Receives the results of load instructions (from the data memory) in the WB stage.

The program:

. = 0x0
/* x10 holds the value we are looking for
 * x11 holds the address of A[0]
 * x13 holds the address of the current data access
 * Assume x10 = 0x104; x11 = 0x500; and x13 = 0x500 before executing the loop
 */
loop:
lw x14, 0(x13)
beq x14, x10, done
addi x13, x13, 4
j loop
done:
sub x13, x13, x11
srli x13, x13, 2
unimp
. = 0x500
/* Array data */
.word 0x0000100 /* A[0] */
.word 0x0000104 /* A[1] */
.word 0x0000108 /* A[2] */
.word 0x000010C /* A[3] */

Are any instructions stalled in the decode (DEC) stage during execution of this loop? If so, list the opcode(s) here. If not, write N/A.

Instructions stalled in DEC stage: __ beq __(D01)

This program's behavior is similar to the following c code, find the index of element 0x104.
int i = 0, A[4] = {0x100, 0x104, 0x108, 0x10C};
while(A[++i] != 0x104);
return i;
There has the data hazard between lw x14, 0(x13) and beq x14, x10, done, and also between addi x13, x13, 4 and sub x13, x13, x11, sub x13, x13, x11.

The load-use hazard between lw x14, 0(x13) and beq x14, x10, done, so we need to stall 2 cycles to wait the data write-back.

Because we always speculates that the next instruction should come from address PC+4, so the instruction in IF stage for the cycle 7 is sub.

Therefore, only beq instruction stalled in the DEC stage.

How many NOPs, if any, are inserted into the pipeline to handle the stalls required for execution of one iteration of the loop?

Number of NOPs added to handle stalls: __ 2 __(D02)

We need to stall 2 cycle for the load-use hazard in the one iteration of the loop, we can see the 2 NOP instruction called at the cycle 5.

How many cycles does one iteration of the loop (first iteration, i.e. lw -> beq -> addi -> j -> sub -> srli) take?

Number of cycles for one iteration of the loop: __ 8 __(D03)

The number of cycles between two lw instruction is 8.

How many NOPs, if any, are inserted into the pipeline to handle the final iteration of the loop through the end of the program?

Number of NOPs added to handle stalls: __ 2 __(D04)

Cycle 9 10 11 12 13 14 15 16 17

IF lw beq addi addi addi j sub srli unimp

DEC NOP lw beq beq beq addi NOP sub srli

EXE NOP NOP lw NOP NOP beq NOP NOP sub

MEM j NOP NOP lw NOP NOP beq NOP NOP

WB addi j NOP NOP lw NOP NOP beq NOP

The NOPs instruction at the cycle 9 and the cycle 15 are for branch hazard to flush the mistake instruction.

The same as the problem 2, we need to insert 2 NOP instructions to stall for the load-use hazards at the cycle 13.

Cycle	9	10	11	12	13	14	15	16	17
IF	lw	beq	addi	addi	addi	j	sub	srli	unimp
DEC	NOP	lw	beq	beq	beq	addi	NOP	sub	srli
EXE	NOP	NOP	lw	NOP	NOP	beq	NOP	NOP	sub
MEM	j	NOP	NOP	lw	NOP	NOP	beq	NOP	NOP
WB	addi	j	NOP	NOP	lw	NOP	NOP	beq	NOP

Which additional instructions, if any, would need to be stalled if the processor did not have a bypass path from the EXE stage but still had one from the MEM and WB stages?

Instructions stalled due to removal of EXE bypass path: __ srli __(D05)

Cycle 9 10 11 12 13 14 15 16 17 18

IF lw beq addi addi addi j sub srli unimp unimp

DEC NOP lw beq beq beq addi NOP sub srli srli

EXE NOP NOP lw NOP NOP beq NOP NOP sub NOP

MEM j NOP NOP lw NOP NOP beq NOP NOP sub

WB addi j NOP NOP lw NOP NOP beq NOP NOP

Cycle	9	10	11	12	13	14	15	16	17	18
IF	lw	beq	addi	addi	addi	j	sub	srli	unimp	unimp
DEC	NOP	lw	beq	beq	beq	addi	NOP	sub	srli	srli
EXE	NOP	NOP	lw	NOP	NOP	beq	NOP	NOP	sub	NOP
MEM	j	NOP	NOP	lw	NOP	NOP	beq	NOP	NOP	sub
WB	addi	j	NOP	NOP	lw	NOP	NOP	beq	NOP	NOP

How many additional NOPs, if any, are required to handle the removal of the EXE bypass path?

Number of NOPs required to handle removal of EXE bypass path: __ 1 __(D06)

We need to stall 1 cycle to wait the dependant data calculated and through forwarding from MEM to DEC.

Question `E`

Consider a direct-mapped cache with 64 total data words with 1 word/cache line.

The program shown below repeatedly executes an inner loop that sums the 16 elements of an array that is
stored starting in location 0x310.

 . = 0 /* tell assembler to start at address 0 */
outer_loop:
    addi x4, x0, 16 // initialize loop index J
    mv x1, x0 // x1 holds sum, initially 0
loop: // add up elements in array
    subi x4, x4, 1 // decrement index
    slli x2, x4, 2 // convert to byte offset
    lw x3, 0x310(x2) // load value from A[J]
    add x1, x1, x3 // add to sum
    bne x4, x0, loop // loop until all words are summed
    j outer_loop // perform test again!

The program is executed for many iterations, then a measurement of the cache statistics is made during one iteration through all the code, i.e., starting with the execution of the instruction labeled outer_loop: until just before the next time that instruction is executed.

In total, how many instruction fetches occur during one complete iteration of the outer loop? How many data reads?

Number of instruction fetches: __ 83 __(E01)
Number of data reads: __ 16 __(E02)

Instruction fetch = 2 + 5 * 16 + 1 = 83
2 instructions before loop label are executed once. The 5 instructions in the loop are executed 16 times, and j outer_loop is executed once.

The inner loop runs for 16 iterations. lw is called once per iteration.
The memory address has 32 bits total which will be divided up into bits for the tag, index, block offset and byte offset.Since there are 64 total words with 1 word/cache line, there must be 64 cache lines. The index bits must fully index into these 64 lines. This requires 6 bits. Each cache line holds one block which is specified to be one word (32 bits or 4 bytes). Since there is only one word in each block, we need no bits (i.e. 0 bits) for this offset. Since each byte has its own address, we need four possible offsets into each block. Thus, the two bits are sufficient for the byte offset. Tag: 24 bits, Index: 6 bits, Block offset: 0 bits, Byte offset: 2 bits. For each memory access, we look at bits [7:2] of the address to determine which index in the cache that particular memory address maps to. Remember that both instructions and the array are stored in memory. Thus, each instruction fetch is a memory access and is also cached along with all "normal" memory accesses in the form of lw instructions. Also note that this program has been executing for multiple cycles already. Thus, many slots in the cache are already populated. We need to figure out which ones will result in cache misses. As we walk through the program for the first time, we will not count initializing the cache contents as cache misses. How many instruction fetch misses occur during one complete iteration of the outer loop? How many data read misses? Hint: remember that the array starts at address 0x310.

Number of instruction fetch misses: __ 4 __(E03)
Number of data read misses: __ 4 __(E04)

We fetch and cache the lw instruction next at index 0b0100. The lw instruction accesses address 0x310 (the first element of the array). Bits [7:2] of 0x310 are 0b100. But the cache line at index 0b0100 already holds the lw instruction which must be evicted. This is the first data read miss. The next two add and bne instructions are cached in indices 0b0101 and 0b0110 respectively.

On the next iteration of the inner loop, the subi and slli instructions are already cached. The lw instruction is not. This is the first instruction fetch miss. We fetch the instruction and cache it into index 0b0100, evicting the previously cached memory contents at address 0x310 from the last iteration. Next, we look at the next element of the array at address 0x314. This gets cached at index 0b101, evicting the cached add instruction (the second data read miss). Almost immediately, we need to fetch the add instruction again (the second instruction fetch miss). This evicts the memory contents of 0x314 which we just cached. At this point, observe the caching pattern. All cached instructions in indices 0b0100 through 0b10011 (the cache indices containing the elements of the array) will be evicted once, replaced with array elements, and ultimately restored in a subsequent iteration. This leads to 4 instruction fetch misses and 4 data read misses.

Index	Content
0000	addi
0001	mv
0010	subi
0011	slli
0100	lw Mem[0x310] lw
0101	add Mem[0x314] add
0110	bne Mem[0x318] bne
0111	j outer_loop Mem[0x31C] j outer_loop ｜
1000	Mem[0x320]
1001	Mem[0x324]
1010	Mem[0x328]
…	…

As a final note, notice that on the next iteration of the outer loop, most of the cache indices will already contain the correct values which is the case if this code has been running for a while. This is why we did not treat all the cache initializations as cache misses –- because previous runs had already set the cache.

What is the hit ratio measured after one complete iteration of the outer loop?

Hit ratio: __ 91/99 __(E05)

The total number of memory accesses is the sum of the instruction fetches and data reads (
$83 + 16 = 99$ ). There were 4 instruction fetch misses and four data read misses, so the total number of cache hits is
$83 - 4 + 16 - 4 = 91$ . The hit ratio is therefore
$\frac{91}{99}$ .

Question `F`

Assume, the program shown below is being run on a RISC-V processor with a cache with the following parameters:

2-way set-associative
block size of 2, i.e., 2 data words are stored in each cache line
total number of data words in the cache is 32
LRU replacement strategy

. = 0x240 // start of program
test:
    addi x4, x0, 16 // initialize loop index J to size of array
    mv x1, x0 // x1: sum
loop: // add up elements in array
    subi x4, x4, 1 // decrement index
    slli x2, x4, 2 // convert to byte offset
    lw x3, 0x420(x2)// load value from A[J]
    add x1, x1, x3 // add to sum
    bnez x4, loop // loop N times
    j test // perform test again!
// allocate space to hold array
. = 0x420
A: .word A[0]
 .word A[1]

The cache will divide the 32-bit address supplied by the processor into four fields: 2 bits of byte offset, B bits of block offset, L bits of cache line index, and T bits of tag field. Based on the cache parameters given above, what are the appropriate values for B, L, and T?

value for B: __ 1 __(F01)
value for L: __ 3 __(F02)
value for T: __ 26 __(F03)

B: 2 words per line;
$\log_{2} (2) = 1$
L: 8 lines per ways
T: 32 - 6 = 26
If the SLLI instruction is resident in a cache line, what will be its cache line index? the value of the tag field for the cache?

Cache line index for SLLI when resident in cache: __ 1 __(F04)
Tag field for SLLI when resident in cache: __ 0x9 __(F05)

Instruction address = base + instruction_index * 4 = 0x240 + 3*4 = 0x24C
0x24C = 0b00000000000000000000001001_001_100
Given that the code begins at address 0x240 and the array begins at address 0x420, and that there are 16 elements in the array as shown in the code above, list all the values j (
$0 \leq j \leq 16$ ) where the location holding the value A[j] will map to the same cache line index as the SLLI instruction in the program.

List all j where A[j] have the same cache line index as SLLI: __ 10, 11 ___(F06)

The array addresses that overlap are:
0x448 = 0b00000000000000000000010001_001_000
and
0x44C = 0b00000000000000000000010001_001_100
These correspond to the 10_th and 11_th elements of the array.

Question `G`

Consider a 2-way set-associative cache where each way has 4 cache lines with a block size of 2 words. Each cache line includes a valid bit (V) and a dirty bit (D), which is used to implement a write-back strategy. The replacement policy is least-recently-used (LRU). The cache is used for both instruction fetch and data (LD,ST) accesses. Please use this cache when answering questions (A) through (D).

Using this cache, a particular benchmark program experiences an average memory access time (AMAT) of 1.3 cycles. The access time on a cache hit is 1 cycle; the miss penalty (i.e., additional access time) is 10 cycles. What is the hit ratio when running the benchmark program?

Hit ratio for benchmark program: __ 0.97 __(G01)

AMAT = hit_time + (1 - hit_ratio) * miss_penalty
1.3 = 1 + (1 - HR) * 10
This cache is used to run the following benchmark program. The code starts at memory address 0; the array referenced by the code has its first element at memory address 0x200. First determine the number of memory accesses (both instruction and data) made during each iteration through the loop. Then estimate the steady-state average hit ratio for the program, i.e., the average hit ratio after many iterations through the loop.
```
. = 0
    mv x3, x0 // byte index into array
    mv x1, x0 // initialize checksum accumulator
loop:
    lw x2, 0x200(x3) // load next element of array
    slli x1, x1, 1 // shift checksum
    addi x1, x1, 1 // increment checksum
    add x1, x1, x2 // include data value in checksum
    addi x3, x3, 4 // byte index of next array element
    slti x2, x3, 1000 // process 250 entries
    bnez x2, loop
    unimp // halt
. = 0x200
array:
    ... /* array content */ 
```
Number of memory accesses made during each iteration of the loop: __ 8 __(G02)
Estimated steady-state average hit ratio: __ 15/16 __(G03)
In steady state:
- Each loop fetches 7 instructions and 1 data (lw).
- Loop starts at instruction address 0x8 = 0b1000
- If you calculate the index bits for every two instructions (since each cache line contains two words) in the loop, you will see that none of them overlap. In addition, they can all be inserted into Way 0 of the cache. Thus, there are no instruction cache misses!
The other way of our cache can be used to load data from our array starting at address 0x200. Every time we encounter a new element of the array not in the cache, our cache load request will actually fetch two words (i.e. to contiguous elements in our array); thus, in the next iteration of our loop, we will not need to fetch again, as the next element is already in our cache.

Thus, the data fetch instruction only (lw) misses every other iteration of the while loop. Since there are 8 total fetches per while loop iteration, we see that one out of every 2 * 8 = 16 instructions is a miss.
Steady-state hit ratio = (8 - 0.5) / 8 =
$\frac{15}{16}$

Question `H`

Consider the diagram below for a 2-way set associative cache to be used with our RISC-V processor. Each cache line holds a single 32-bit word of data along with its associated tag and valid bit (0 when the cache line is invalid, 1 when the cache line is valid).

The RISC-V produces 32-bit byte addresses, A[31:0]. To ensure the best cache performance, which address bits should be used for the cache index? For the tag field?

address bits used for cache index: A[ __ 4:2 __] (H01)
address bits used for tag field: A[ __ 31:5 __] (H02)

2 bits for byte offset (both 0)
1 word per block implies block offset = 0 bits (no need to select)
8 lines/way implies index = 3 bits
Remaining 27 bits are tag bits.
Suppose the processor does a read of location 0x5678. Identify which cache location(s) would be checked to see if that location is in the cache. For each location specify the cache section (P or Q) and row number (0 through 7). E.g., 3P for row 3, section P. If there is a cache hit on this access what would be the contents of the tag data for the cache line that holds the data for this location?

cache location(s) checked on access to 0x5678: __ 6P and 6Q __(H03)
cache tag data on hit for location 0x5678 (hex): __ 0x2B3 __(H04)
0x5678 in binary:
```
0b 0101 0110 011 1 1000
                [* **]
```
Line: 3’b110 = 6
Tag: 0b0010_1011_0011 = 0x2B3
Remember, all ways are checked in parallel when searching a set-associative cache.
Estimate the approximate cache hit ratio for the following program. Assume the cache is empty before execution begins (all the valid bits are 0) and that an LRU replacement strategy is used. Remember the cache is used for both instruction and data (LD) accesses.
```
        . = 0
        addi x4, x0, 0x100
        mv x1, x0
        lui x2, 1 // x2 = 0x1000
loop:   lw x3, 0(x4)
        addi x4, x4, 4
        add x1, x1, x3
        addi x2, x2, -1
        bnez x2, loop
        sw x1, 0x100(x0)
        unimp // halt
        . = 0x100
source:
        . = . + 0x4000 // Set source to 0x100, reserve 0x1000 words
```
approximate hit ratio: __ 5/6 __(H05)

The first instruction in the loop (lw) is at 0xC = 0b1000. Extracting the index bits from this address, we see that the instructions within the loop body (lw - bnez) occupy one of the ways in the cache.

Thus, one way is always free for data. Each iteration causes a data fetch that is a cache miss (since there is only a single word per cache line
$\to$ no spatial locality). Thus, each iteration there are 5 instruction fetches (hits) and 1 data fetch (always a miss).

Quiz5 of Computer Architecture (2020 Fall)

Question A

Question B

Question C

Question D

Question E

Question F

Question G

Question H

Read more

資訊科技詞彙翻譯

單一指令處理器 (OISC)

從 CPU cache coherence 談 Linux spinlock 可擴展能力議題

淺談 Microkernel 設計和真實世界中的應用