---
tags: CA2020
---
# Term Project
[kevin30292]
Annotate and explain Quiz5
## Question A
### 1
>Consider a cache in certain RISC-V processor with an AMAT (average memory access time) of 1.5 cycles. Accessing our cache takes 1 cycle, and on a miss, it takes an additional 10 cycles to retrieve the data from main memory and update the cache. What does our hit ratio need to be in order to achieve the target AMAT?
→
Hit ratio = __ A01 __
Every data access first checks the cache, so every access pays the 1-cycle cache access time; only misses pay the additional 10 cycles.
If the hit ratio is `x`, we get: 1.5 (AMAT) = 1 + (1 - x) * 10
so the hit ratio (`x`) needs to be ==0.95==.
### 2
>The timing for a particular cache is as follows: checking the cache takes 1 cycle. If there is a hit the data is returned to the CPU at the end of the first cycle. If there is a miss, it takes 10 additional cycles to retrieve the word from main memory, store it in the cache, and return it to the CPU. If we want an average memory access time of 1.4 cycles, what is the minimum possible value for the cache’s hit ratio?
→
Minimum possible value of hit ratio: __ A02 __
If the hit ratio is `x`, we get: 1.4 (AMAT) = 1 + (1 - x) * 10
so the minimum hit ratio (`x`) is ==0.96==.
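Both parts apply the same formula, AMAT = hit time + miss ratio × miss penalty. A minimal sketch in Python (the helper name `required_hit_ratio` is ours, not from the quiz):

```python
def required_hit_ratio(amat, hit_time=1.0, miss_penalty=10.0):
    """Solve amat = hit_time + (1 - x) * miss_penalty for the hit ratio x."""
    return 1 - (amat - hit_time) / miss_penalty

print(required_hit_ratio(1.5))  # Question A.1
print(required_hit_ratio(1.4))  # Question A.2
```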
## Question B
Consider that P1 is a five-stage pipelined RISC-V processor (IF, DEC, EXE, MEM, WB), where:
All branches are predicted not-taken.
Branch decisions are made in the EXE stage.
The pipeline has full bypassing.
Assume that the loop shown to the right has been running for a while and will be re-executed after this iteration.
### 1
```
loop: add x3, x2, x1
slli x2, x2, 1
xor x4, x3, x4
bnez x2, loop
and x11, x2, x1
or x12, x22, x7
sra x13, x23, x8
...
```
>Given the program. How many cycles does it take to execute one iteration of the loop on this processor?
→
Cycles per iteration on this processor?
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| ----- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| IF | add | slli | xor | bnez | and | or | add |
| ID | | add | slli | xor | bnez | and | flush|
| EX | | | add | slli | xor | bnez | flush|
| MEM | | | | add | slli | xor | bnez |
| WB | | | | | add | slli | xor |
In cycle 4, the `x3` result of `add` is forwarded to `xor`; with full bypassing, every operand arrives in time, so no NOPs are needed inside the loop.
However, `bnez` resolves in EXE, so the two incorrectly fetched instructions (`and`, `or`) must be flushed when the branch is taken.
Finally, one iteration of the loop takes ==6== cycles.
### 2
>Now consider that P2 is a variant of the previous five-stage pipelined RISC-V processor (IF, DEC, EXE, MEM, WB), where:
All branches are predicted not-taken.
Branch decisions are made in the EXE stage.
The pipeline has no bypass paths.
Assume that our example loop (same as the previous program) has been running for a while and will be re-executed after this iteration. How many cycles does it take to execute one iteration of the loop on this processor?
→
Cycles per iteration on processor P2: __ B02 __
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| ----- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| IF | add | slli | xor | bnez | bnez | bnez | and | or | add |
| ID | | add | slli | xor | xor | xor | bnez | and | flush|
| EX | | | add | slli | NOP | NOP | xor | bnez | flush|
| MEM | | | | add | slli | NOP | NOP | xor | bnez |
| WB | | | | | add | slli | NOP | NOP | xor |
In cycle 5, because there are no bypass paths, `xor` must wait for the `x3` result of `add` to actually be written back to the register file, so two NOPs are inserted. By the time `bnez` reaches DEC in cycle 7, `slli` has already written `x2` back, so it needs no extra stall. The branch again resolves in EXE (cycle 8) and flushes the two incorrectly fetched instructions. Finally, one iteration takes ==8== cycles.
## Question C
>You are designing a four stage RISC-V processor (IF, DEC, EX, WB) that:
Resolves branches in the EX stage.
Uses a scoreboard to track and stall on data hazards.
Initiates accesses to data memory in the EX stage, but data memory responses are not available until the WB stage.
This processor has a tight area constraint, and you only have enough area for one bypass path for read-after-write hazards. You are trying to decide whether to put the bypass path from EX to DEC or from WB to DEC. As part of this evaluation, you construct two processors:
Processor P1: Bypassing from WB to DEC.
Processor P2: Bypassing from EX to DEC.
These processors ensure that if the instruction in the DEC stage does not have all of its operands ready to read from the register file or read from a bypass path, the instruction in the decode stage is stalled.
You are using the following loop of a program to evaluate the performance of the processor:
```
loop: lw t0, 0(a0)
addi t1, t0, 1
xor t3, t1, a0
slli t2, t1, 1
blt t0, a0, loop
```
>Assume this loop has been running for a long time, and that the processor assumes that the branch will be taken.
How many cycles does this loop take to execute in each of the processors?
For processor P1:
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| ----- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| IF | lw | addi | xor | xor | slli | slli | blt | lw |
| ID | | lw | addi | addi | xor | xor | slli | blt |
| EX | | | lw | NOP | addi | NOP | xor | slli |
| WB | | | | lw | NOP | addi | NOP | xor |
With only a WB-to-DEC bypass, an instruction that depends on the result of the immediately preceding instruction must stall one cycle in DEC until that result reaches WB and can be bypassed. Since the branch is predicted taken and the prediction is correct, nothing is flushed; one iteration takes 7 cycles (the next `lw` is fetched in cycle 8).
For processor P2:
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| ----- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| IF | lw | addi | xor | xor | xor | slli | blt | blt | lw |
| ID | | lw | addi | addi | addi | xor | slli | slli | blt |
| EX | | | lw | NOP | NOP | addi | xor | NOP | slli |
| WB | | | | lw | NOP | NOP | addi | xor | NOP |
With an EX-to-DEC bypass, no NOP is needed when an instruction depends on the immediately preceding ALU instruction. However, there is a constraint: a value loaded from memory (`lw`) is not available until WB, so a dependent instruction must stall two cycles.
Furthermore, if the producing instruction is two or more instructions earlier, the consumer must still wait for the value to be written back to the register file (there is no WB-to-DEC bypass); this is why `slli` stalls waiting for `addi` to write back `t1`. One iteration on P2 takes 8 cycles (the next `lw` is fetched in cycle 9).
## Question D
>Consider the execution of the following code sequence on a 5-stage pipelined RISC-V processor, which is fully bypassed. The code below iterates over the array shown below to find the index of the element whose value is 0x104. Thus, on the second iteration of the loop, it branches to label done. The instruction unimp signals the end of the program. This processor:
Always speculates that the next instruction should come from address PC+4.
Computes branch decisions and jump targets in the EXE stage.
Annuls incorrectly fetched instructions resulting from branches and jump instructions.
Receives the results of load instructions (from the data memory) in the WB stage.
The program:
```
. = 0x0
/* x10 holds the value we are looking for
* x11 holds the address of A[0]
* x13 holds the address of the current data access
* Assume x10 = 0x104; x11 = 0x500; and x13 = 0x500 before executing the loop
*/
loop:
lw x14, 0(x13)
beq x14, x10, done
addi x13, x13, 4
j loop
done:
sub x13, x13, x11
srli x13, x13, 2
unimp
. = 0x500
/* Array data */
.word 0x0000100 /* A[0] */
.word 0x0000104 /* A[1] */
.word 0x0000108 /* A[2] */
.word 0x000010C /* A[3] */
```
>Are any instructions stalled in the decode (DEC) stage during execution of this loop? If so, list the opcode(s) here. If not, write N/A.
→
Instructions stalled in DEC stage: __ D01 __
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| ----- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| IF | lw | beq | addi | addi | addi | j | sub | srli | lw | beq | addi | addi | addi | j | sub | srli | unimp| |
| ID | | lw | beq | beq | beq | addi | j | sub | flush| lw | beq | beq | beq | addi | flush| sub | srli | unimp|
| EX | | | lw | NOP | NOP | beq | addi | j | flush| flush| lw | NOP | NOP | beq | flush| flush| sub | srli |
| MEM | | | | lw | NOP | NOP | beq | addi | j | flush| flush| lw | NOP | NOP | beq | flush| flush| sub |
| WB | | | | | lw | NOP | NOP | beq | addi | j | flush| flush| lw | NOP | NOP | beq | flush| flush|
beq. It must stall in DEC until `lw` reaches the WB stage, where the loaded value arrives from data memory and can be bypassed to `beq`. As the chart shows, `beq` waits two cycles.
>How many NOPs, if any, are inserted into the pipeline to handle the stalls required for execution of one iteration of the loop?
→
Number of NOPs added to handle stalls: __ D02 __
D02 = 2. Two NOP bubbles enter EXE in cycles 4 and 5 while `beq` waits in DEC for the `lw` result, which is not available until WB.
>How many cycles does one iteration of the loop (first iteration, i.e. lw -> beq -> addi -> j -> sub -> srli) take?
→
Number of cycles for one iteration of the loop: __ D03 __
D03 = 8, as the chart shows: the first iteration's `lw` is fetched in cycle 1 and the next iteration's `lw` in cycle 9.
>How many NOPs, if any, are inserted into the pipeline to handle the final iteration of the loop through the end of the program?
→
Number of NOPs added to handle stalls: __ D04 __
D04 = 2. As the chart shows around cycles 12-13, the final pass through the loop also inserts two NOPs while `beq` waits for the `lw` result.
>Which additional instructions, if any, would need to be stalled if the processor did not have a bypass path from the EXE stage but still had one from the MEM and WB stages?
→
Instructions stalled due to removal of EXE bypass path: __ D05 __
D05 = srli. `srli` depends on `x13`, produced by the immediately preceding `sub`; that value would normally be bypassed from EXE, so removing the EXE bypass forces `srli` to stall.
>How many additional NOPs, if any, are required to handle the removal of the EXE bypass path?
→
Number of NOPs required to handle removal of EXE bypass path: __ D06 __
D06 = 1. Without the EXE bypass, `srli` stalls one cycle in DEC and then receives `x13` bypassed from the MEM stage instead, so one additional NOP is inserted.
## Question E
>Consider a direct-mapped cache with 64 total data words with 1 word/cache line.
The program shown below repeatedly executes an inner loop that sums the 16 elements of an array that is stored starting in location 0x310.
```
. = 0 /* tell assembler to start at address 0 */
outer_loop:
addi x4, x0, 16 // initialize loop index J
mv x1, x0 // x1 holds sum, initially 0
loop: // add up elements in array
subi x4, x4, 1 // decrement index
slli x2, x4, 2 // convert to byte offset
lw x3, 0x310(x2) // load value from A[J]
add x1, x1, x3 // add to sum
bne x4, x0, loop // loop until all words are summed
j outer_loop // perform test again!
```
>The program is executed for many iterations, then a measurement of the cache statistics is made during one iteration through all the code, i.e., starting with the execution of the instruction labeled outer_loop: until just before the next time that instruction is executed.
In total, how many instruction fetches occur during one complete iteration of the outer loop? How many data reads?
→
Number of instruction fetches: __ E01 __
→
Number of data reads: __ E02 __
E01 = ?
E02 = ?
The two instructions before the inner loop (`addi`, `mv`) execute once, the five-instruction inner loop executes 16 times, and the final `j` executes once:
2 + 16*5 + 1 = 83 instruction fetches.
The inner loop executes 16 times, so `lw` performs 16 data reads.
>Also note that this program has been executing for multiple cycles already.
Thus, many slots in the cache are already populated. We need to figure out which ones will result in cache misses. As we walk through the program for the first time, we will not count initializing the cache contents as cache misses. How many instruction fetch misses occur during one complete iteration of the outer loop? How many data read misses? Hint: remember that the array starts at address 0x310.
→
Number of instruction fetch misses: __ E03 __
→
Number of data read misses: __ E04 __
E03 = ?
E04 = ?
| instruction | index | offset |
| ----------- | -------- | ------ |
| addi | 000000 | 00 |
| mv | 000001 | 00 |
| subi | 000010 | 00 |
| slli | 000011 | 00 |
| lw | 000100 | 00 |
| add | 000101 | 00 |
| bne | 000110 | 00 |
| j | 000111 | 00 |
0x310 = 0b00000011 000100 00
| array | index | offset |
| -------- | -------- | ------ |
| MEM[310] | 000100 | 00 |
| MEM[314] | 000101 | 00 |
| MEM[318] | 000110 | 00 |
| MEM[31C] | 000111 | 00 |
| MEM[320] | 001000 | 00 |
| MEM[324] | 001001 | 00 |
| ... | ... | ... |
`addi` and `mv` (indices 000000 and 000001) do not share a cache index with any array element, so these two instructions never miss. (Note that the loop actually walks the array from A[15] down to A[0]; only A[0]-A[3], at indices 000100-000111, conflict with instructions. The conflicts are illustrated element by element below; the order does not change the totals.)
For A[0]:
When `lw` accesses A[0] at 0x310, the cache line at index 000100 holds the `lw` instruction itself, causing a data read miss.
Cache contents after this access:
| index | content | I miss | D miss |
| ------ | -------- | ------ | ------ |
| 000000 | addi | 0 | 0 |
| 000001 | mv | 0 | 0 |
| 000100 | Mem[310] | 0 | 1 |
| 000101 | add | 0 | 0 |
| 000110 | bne | 0 | 0 |
| 000111 | j | 0 | 0 |
For A[1]:
Fetching the `lw` instruction now misses, because the array data A[0] occupies cache line index 000100. This causes an instruction fetch miss.
| index | content | I miss | D miss |
| ------ | -------- | ------ | ------ |
| 000000 | addi | 0 | 0 |
| 000001 | mv | 0 | 0 |
| 000100 | lw | 1 | 1 |
| 000101 | add | 0 | 0 |
| 000110 | bne | 0 | 0 |
| 000111 | j | 0 | 0 |
Next, when `lw` accesses A[1] at 0x314, the cache line at index 000101 holds the `add` instruction, causing another data read miss.
| index | content | I miss | D miss |
| ------ | -------- | ------ | ------ |
| 000000 | addi | 0 | 0 |
| 000001 | mv | 0 | 0 |
| 000100 | lw | 1 | 1 |
| 000101 | Mem[314] | 0 | 1 |
| 000110 | bne | 0 | 0 |
| 000111 | j | 0 | 0 |
Then, when `add` is fetched, another instruction fetch miss occurs, since A[1] now occupies its line.
| index | content | I miss | D miss |
| ------ | -------- | ------ | ------ |
| 000000 | addi | 0 | 0 |
| 000001 | mv | 0 | 0 |
| 000100 | lw | 1 | 1 |
| 000101 | add | 1 | 1 |
| 000110 | bne | 0 | 0 |
| 000111 | j | 0 | 0 |
For A[2]:
When `lw` accesses A[2] at 0x318, the cache line at index 000110 holds the `bne` instruction, causing another data read miss.
Cache contents after this access:
| index | content | I miss | D miss |
| ------ | -------- | ------ | ------ |
| 000000 | addi | 0 | 0 |
| 000001 | mv | 0 | 0 |
| 000100 | lw | 1 | 1 |
| 000101 | add | 1 | 1 |
| 000110 | Mem[318] | 0 | 1 |
| 000111 | j | 0 | 0 |
Then, when `bne` is fetched, another instruction fetch miss occurs, since A[2] now occupies its line.
| index | content | I miss | D miss |
| ------ | -------- | ------ | ------ |
| 000000 | addi | 0 | 0 |
| 000001 | mv | 0 | 0 |
| 000100 | lw | 1 | 1 |
| 000101 | add | 1 | 1 |
| 000110 | bne | 1 | 1 |
| 000111 | j | 0 | 0 |
Observing the pattern: the cache lines at indices 000100 through 000111, which are shared between instructions and array data, each cause one instruction fetch miss and one data read miss per pass.
In total, ==4== data read misses and ==4== instruction fetch misses occur.
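The miss counts can be double-checked with a small simulation of the direct-mapped cache (64 lines, 1 word per line). This is a sketch under the stated assumptions; it replays the access stream of one steady-state pass of the outer loop, after one uncounted warm-up pass:

```python
LINES = 64                               # direct-mapped, 1 word (4 B) per line

def line_index(addr):
    return (addr >> 2) % LINES           # word address mod number of lines

def one_pass():
    """Yield (kind, address) pairs for one pass of the outer loop."""
    yield ('I', 0x00)                    # addi  (outer_loop)
    yield ('I', 0x04)                    # mv
    for j in range(15, -1, -1):          # inner loop walks A[15] .. A[0]
        yield ('I', 0x08)                # subi
        yield ('I', 0x0C)                # slli
        yield ('I', 0x10)                # lw
        yield ('D', 0x310 + 4 * j)       # data read of A[j]
        yield ('I', 0x14)                # add
        yield ('I', 0x18)                # bne
    yield ('I', 0x1C)                    # j

cache = {}                               # line index -> cached address (tag)
for kind, addr in one_pass():            # warm-up pass (misses not counted)
    cache[line_index(addr)] = addr

misses = {'I': 0, 'D': 0}
for kind, addr in one_pass():            # measured pass
    if cache.get(line_index(addr)) != addr:
        misses[kind] += 1
        cache[line_index(addr)] = addr

print(misses)
```

The generator also confirms E01: one pass yields 2 + 16*6 + 1 = 99 accesses (83 fetches + 16 data reads).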
>What is the hit ratio measured after one complete iteration of the outer loop?
→
Hit ratio: __ E05 __
E05 = ?
The total number of memory accesses is 83 (instruction) + 16 (data) = 99, and the total number of cache misses is 4 + 4 = 8.
Therefore, the hit ratio is ==91/99== (about 0.92).
## Question F
>Assume, the program shown below is being run on a RISC-V processor with a cache with the following parameters:
2-way set-associative
block size of 2, i.e., 2 data words are stored in each cache line
total number of data words in the cache is 32
LRU replacement strategy
```
. = 0x240 // start of program
test:
addi x4, x0, 16 // initialize loop index J to size of array
mv x1, x0 // x1: sum
loop: // add up elements in array
subi x4, x4, 1 // decrement index
slli x2, x4, 2 // convert to byte offset
lw x3, 0x420(x2)// load value from A[J]
add x1, x1, x3 // add to sum
bnez x4, loop // loop N times
j test // perform test again!
// allocate space to hold array
. = 0x420
A: .word A[0]
.word A[1]
```
>The cache will divide the 32-bit address supplied by the processor into four fields: 2 bits of byte offset, B bits of block offset, L bits of cache line index, and T bits of tag field. Based on the cache parameters given above, what are the appropriate values for B, L, and T?
>→
value for B: __ F01 __
→
value for L: __ F02 __
→
value for T: __ F03 __
There are 2 words per block, so 1 bit of block offset is needed: B = 1.
The cache holds 32 data words in blocks of 2, i.e., 16 blocks. Since it is 2-way set-associative (two blocks per set), there are 8 sets.
Therefore, L = 3 index bits are needed.
Finally, T = 32 - 3 - 1 - 2 = 26.
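The field widths can be derived mechanically from the cache parameters. A quick sketch (variable names are ours):

```python
from math import log2

words_total = 32      # total data words in the cache
words_per_block = 2   # block size
ways = 2              # 2-way set-associative

B = int(log2(words_per_block))                  # block offset bits
sets = words_total // (words_per_block * ways)  # 16 blocks / 2 ways = 8 sets
L = int(log2(sets))                             # index bits
T = 32 - L - B - 2                              # tag bits (2 = byte offset)

print(B, L, T)
```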
>If the SLLI instruction is resident in a cache line, what will be its cache line index? the value of the tag field for the cache?
→
Cache line index for SLLI when resident in cache: __ F04 __
→
Tag field for SLLI when resident in cache: __ F05 __
The code section starts at 0x240, so the `slli` instruction (the fourth one) is located at 0x24C.
0x24C = 0b...1001 001 1 00 (tag | index | block offset | byte offset), so the index = ==1==.
The tag field is 0b1001 = ==9==.
>Given that the code begins at address 0x240 and the array begins at address 0x420, and that there are 16 elements in the array as shown in the code above, list all the values j ( 0 ≤ j ≤ 16 ) where the location holding the value A[j] will map to the same cache line index as the SLLI instruction in the program.
→
List all j where A[j] have the same cache line index as SLLI: __ F06 ___
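The blank can be worked out from the address mapping: A[j] sits at 0x420 + 4j, and its set index is bits [5:3] of the address. A short sketch (the helper name `fields` is ours) that also re-checks F04/F05:

```python
def fields(addr, byte_bits=2, block_bits=1, index_bits=3):
    """Split an address into (tag, set index) for this cache geometry."""
    index = (addr >> (byte_bits + block_bits)) & ((1 << index_bits) - 1)
    tag = addr >> (byte_bits + block_bits + index_bits)
    return tag, index

slli_tag, slli_index = fields(0x24C)
print(slli_index, slli_tag)              # F04 = 1, F05 = 9

# All j in 0..16 where A[j] maps to the same index as slli
matches = [j for j in range(17)
           if fields(0x420 + 4 * j)[1] == slli_index]
print(matches)                           # F06
```

So A[j] shares the `slli` cache line index for j = ==10== and ==11==.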
## Question G
>Consider a 2-way set-associative cache where each way has 4 cache lines with a block size of 2 words. Each cache line includes a valid bit (V) and a dirty bit (D), which is used to implement a write-back strategy. The replacement policy is least-recently-used (LRU). The cache is used for both instruction fetch and data (LD,ST) accesses.
>Using this cache, a particular benchmark program experiences an average memory access time (AMAT) of 1.3 cycles. The access time on a cache hit is 1 cycle; the miss penalty (i.e., additional access time) is 10 cycles. What is the hit ratio when running the benchmark program?
This uses the same formula as before: AMAT = hit time + miss penalty × miss ratio, where miss ratio = 1 - hit ratio.
Therefore, 1.3 = 1 + 10 × (1 - x), giving x = ==0.97==.
>This cache is used to run the following benchmark program. The code starts at memory address 0; the array referenced by the code has its first element at memory address 0x200. First determine the number of memory accesses (both instruction and data) made during each iteration through the loop. Then estimate the steady-state average hit ratio for the program, i.e., the average hit ratio after many iterations through the loop.
```
. = 0
mv x3, x0 // byte index into array
mv x1, x0 // initialize checksum accumulator
loop:
lw x2, 0x200(x3) // load next element of array
slli x1, x1, 1 // shift checksum
addi x1, x1, 1 // increment checksum
add x1, x1, x2 // include data value in checksum
addi x3, x3, 4 // byte index of next array element
slti x2, x3, 1000 // process 250 entries
bnez x2, loop
unimp // halt
. = 0x200
array:
... /* array content */
```
>→
Number of memory accesses made during each iteration of the loop: __ G02 __
The loop contains 7 instructions, so each iteration performs 7 instruction fetches.
There is 1 `lw` instruction, adding one data memory access.
The total is 7 + 1 = ==8== memory accesses per iteration.
>→
Estimated steady-state average hit ratio: __ G03 __
2-way set-associative cache where each way has 4 cache lines with a block size of 2 words.
B = 1
L = 2
The loop starts at 0x08 = 0b00000 01 000 and ends at 0x20 = 0b00001 00 000.
The array starts at 0x200 = 0b10000 00 000.
Each cache line holds 8 bytes, so the instructions pair up two per block:
| No. of instructions | address | index |
| ----- | ---------------------- | ----- |
| 1. | 0x008 = 0b00000 01 000 | 1 |
| 2. | 0x00c = 0b00000 01 100 | 1 |
| 3. | 0x010 = 0b00000 10 000 | 2 |
| 4. | 0x014 = 0b00000 10 100 | 2 |
| 5. | 0x018 = 0b00000 11 000 | 3 |
| 6. | 0x01c = 0b00000 11 100 | 3 |
| 7. | 0x020 = 0b00001 00 000 | 0 |
| data1 | 0x200 = 0b10000 00 000 | 0 |
| data2 | 0x204 = 0b10000 00 100 | 0 |
| data3 | 0x208 = 0b10000 01 000 | 1 |
| ... | ... | ... |
We can observe that no two instruction blocks map to the same set, so the instructions fit in one way of each set.
Because the cache is 2-way set-associative with LRU replacement, the data accesses do not evict the (recently used) instructions: instructions effectively occupy one way and data the other.
Furthermore, the array is read sequentially and each block holds 2 words, so the data misses once for every two words loaded.
With one `lw` per iteration, that is one miss per two iterations, i.e., one miss per 16 memory accesses.
Estimated steady-state hit ratio: (16 - 1)/16 = ==15/16==.
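The 15/16 estimate can be confirmed with a small 2-way LRU simulation. This is a sketch under the stated assumptions (4 sets, 2 ways, 8-byte blocks; counting starts after a warm-up period):

```python
from collections import OrderedDict

SETS, WAYS, BLOCK = 4, 2, 8              # 4 sets, 2 ways, 8-byte blocks
cache = [OrderedDict() for _ in range(SETS)]   # per-set LRU, oldest first

def access(addr):
    """Return True on a hit; update LRU state either way."""
    block = addr // BLOCK
    lru = cache[block % SETS]
    hit = block in lru
    if hit:
        lru.move_to_end(block)           # mark most-recently-used
    else:
        if len(lru) == WAYS:
            lru.popitem(last=False)      # evict least-recently-used
        lru[block] = True
    return hit

# One loop iteration: 7 instruction fetches (0x08..0x20) plus one data read.
FETCHES = [0x08, 0x0C, 0x10, 0x14, 0x18, 0x1C, 0x20]

hits = total = 0
for i in range(250):                     # 250 trips through the loop
    accesses = FETCHES[:3] + [0x200 + 4 * i] + FETCHES[3:]  # data after lw
    for addr in accesses:
        h = access(addr)
        if i >= 100:                     # count only in steady state
            hits += h
            total += 1

print(hits / total)
```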
## Question H
>Consider the diagram below for a 2-way set associative cache to be used with our RISC-V processor. Each cache line holds a single 32-bit word of data along with its associated tag and valid bit (0 when the cache line is invalid, 1 when the cache line is valid).

>The RISC-V produces 32-bit byte addresses, A[31:0]. To ensure the best cache performance, which address bits should be used for the cache index? For the tag field?
→
address bits used for cache index: A[ __ H01 __]
→
address bits used for tag field: A[ __ H02 __]
There are two bits of byte offset.
Each way has 8 cache lines, so 3 index bits are needed: ==A[4:2]==.
The remaining bits form the tag: ==A[31:5]==.
>Suppose the processor does a read of location 0x5678. Identify which cache location(s) would be checked to see if that location is in the cache. For each location specify the cache section (P or Q) and row number (0 through 7). E.g., 3P for row 3, section P. If there is a cache hit on this access what would be the contents of the tag data for the cache line that holds the data for this location?
→
cache location(s) checked on access to 0x5678: __ H03 __
→
cache tag data on hit for location 0x5678 (hex): __ H04 __
0x5678 = 0b1010110011 110 00 (tag | index | byte offset)
The index is 0b110 = 6, so the cache checks row 6 in both sections: ==6P and 6Q==.
On a hit, the tag bits must be 0b1010110011 = ==0x2B3==.
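The address split can be reproduced directly (a sketch; the helper name `split` is ours):

```python
def split(addr):
    """Return (tag, row index) for a 2-way cache with 8 one-word lines per way."""
    return addr >> 5, (addr >> 2) & 0b111

tag, row = split(0x5678)
print(hex(tag), row)                     # tag 0x2b3, row 6 (check 6P and 6Q)
```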
>Estimate the approximate cache hit ratio for the following program. Assume the cache is empty before execution begins (all the valid bits are 0) and that an LRU replacement strategy is used. Remember the cache is used for both instruction and data (LD) accesses.
```
. = 0
addi x4, x0, 0x100
mv x1, x0
lui x2, 1 // x2 = 0x1000
loop: lw x3, 0(x4)
addi x4, x4, 4
add x1, x1, x3
addi x2, x2, -1
bnez x2, loop
sw x1, 0x100(x0)
unimp // halt
. = 0x100
source:
. = . + 0x4000 // Set source to 0x100, reserve 0x1000 words
```
In the loop:
The first loop instruction (`lw`) is at 0x00C.
The last loop instruction (`bnez`) is at 0x01C.
| instructions | address | index |
| ----- | ---------------------- | ----- |
| lw | 0x00c = 0b0000 011 00 | 3 |
| addi | 0x010 = 0b0000 100 00 | 4 |
| add | 0x014 = 0b0000 101 00 | 5 |
| addi | 0x018 = 0b0000 110 00 | 6 |
| bnez | 0x01c = 0b0000 111 00 | 7 |
| data1 | 0x100 = 0b1000 000 00 | 0 |
| data2 | 0x104 = 0b1000 001 00 | 1 |
| data3 | 0x108 = 0b1000 010 00 | 2 |
| ... | ... | ... |
We can see that the instructions occupy at most one way of each set, leaving the other way free for data.
In steady state the instructions never miss, but each `lw` reads a fresh address and each cache line holds only one word, so every data access misses.
For each iteration: 5 instruction fetches hit and 1 data access misses.
The answer is ==5/6==.
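The 5/6 estimate can likewise be confirmed by simulation. A sketch under the stated assumptions (2 ways, 8 sets, 1 word per line, LRU; counting starts after a short warm-up):

```python
from collections import OrderedDict

SETS, WAYS = 8, 2                        # 8 lines per way, 2 ways, 1 word/line
cache = [OrderedDict() for _ in range(SETS)]   # per-set LRU, oldest first

def access(addr):
    """Return True on a hit; update per-set LRU state."""
    word = addr >> 2                     # one word per line
    lru = cache[word % SETS]
    hit = word in lru
    if hit:
        lru.move_to_end(word)
    else:
        if len(lru) == WAYS:
            lru.popitem(last=False)      # evict least-recently-used
        lru[word] = True
    return hit

FETCHES = [0x0C, 0x10, 0x14, 0x18, 0x1C]  # lw, addi, add, addi, bnez

hits = total = 0
for i in range(0x1000):                  # x2 = 0x1000 loop iterations
    accesses = [0x0C, 0x100 + 4 * i] + FETCHES[1:]  # data read follows lw
    for addr in accesses:
        h = access(addr)
        if i >= 8:                       # skip warm-up iterations
            hits += h
            total += 1

print(hits / total)
```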