# Quiz7 of Computer Architecture (2021 Fall) Solutions
###### tags: `RISC-V`

## Problem A

Assume that NCKU has been developing its own online grading system, "LearnSomething", which it hopes will replace how people grade homework and exams both in NCKU and beyond. "LearnSomething" prides itself on parallelized grading by starting multiple threads running the gradeExams function described below in pseudocode.

```
Shared Memory:
// exams is an array containing the exams
exams = [exam0, exam1, exam2, ... , exam99];
next_exam = 0;

gradeExams:
    // Get ungraded exam
    exam = exams[next_exam]
    next_exam = next_exam + 1
    // Grade exam
    grade(exam)
    goto gradeExams
```

Suppose two threads, A and B, are running the gradeExams code above without any synchronization. For each of the following failure scenarios, answer whether it is possible or not (Yes for possible; No for impossible):

- A and B start grading the same exam: __ A01 __
- A gets exam k+1 before B gets exam k: __ A02 __
- An exam is skipped and is never graded by either A or B: __ A03 __

- A01 : Yes
- A02 : No
- A03 : Yes

### 📝 Annotate
:::info
Asynchronous: events such as the arrival of signals, or actions instigated by a program, take place concurrently with program execution, without the program blocking to wait for results.

![](https://i.imgur.com/tRXtmWI.png)

Suppose both threads start with `next_exam = 0` and interleave in the order shown above. Step 1 and Step 2 both read `exams[0]`, so A and B grade the same exam, which is exactly the A01 scenario. Continuing, `next_exam` becomes 1 after Step 3 and 2 after Step 4, so Step 5 executes `grade(exam[2])` and exam 1 is never graded. Therefore both the A01 and A03 situations can happen.

As for A02, A can only read exam k+1 after `next_exam` has been incremented to k+1, which happens only after some thread has already fetched exam k. So it is impossible.
:::
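This race is easy to reproduce in real code. Below is a minimal C/pthreads sketch of my own (not part of the quiz; names like `grade_exams` and `claimed` are made up for illustration) in which two threads run the unsynchronized fetch-and-increment. The data race is deliberate and formally undefined behavior; depending on the interleaving, a given run may show exams graded twice or never.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_EXAMS 100

static int next_exam = 0;            /* shared, deliberately unsynchronized */
static int claimed[2 * NUM_EXAMS];   /* how many times each thread "graded" each exam */

static void *grade_exams(void *arg) {
    int id = *(int *)arg;
    while (next_exam < NUM_EXAMS) {
        int exam = next_exam;        /* read may race with the other thread */
        next_exam = next_exam + 1;   /* non-atomic increment: lost updates possible */
        if (exam < NUM_EXAMS)
            claimed[id * NUM_EXAMS + exam]++;  /* stand-in for grade(exam) */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    int ida = 0, idb = 1;
    pthread_create(&a, NULL, grade_exams, &ida);
    pthread_create(&b, NULL, grade_exams, &idb);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    for (int e = 0; e < NUM_EXAMS; e++) {
        int times = claimed[e] + claimed[NUM_EXAMS + e];
        if (times != 1)              /* duplicate grading (A01) or a skipped exam (A03) */
            printf("exam %d graded %d times\n", e, times);
    }
    return 0;
}
```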
Despite LearnSomething's advanced parallel grading technology, it cannot interface with the gradebook module of this course. Instead, a teaching assistant (TA) manually enters the grades into the gradebook as "LearnSomething" completes its grading. When the gradeExams function finishes grading an exam, it emails a score report for the student to the TA. He will read a score report email, enter the grade into the gradebook, and repeat, as represented by the enterGrades function below. He does not like having more than 10 unread score report emails at a time and also does not like checking his email when he does not have any new messages.

Assume that there is more than 1 and fewer than 10 threads running the gradeExams function. Add semaphores below to enforce these constraints:

- No two LearnSomething threads should ever grade the same exam
- The teaching assistant should never have more than 10 unread LearnSomething score report emails
- readScoreReportEmail() should not be called until there is an unread score report email from "LearnSomething"
- After all 100 exams are claimed by "LearnSomething" threads, LearnSomething should stop attempting to grade exams
- As long as there are still unclaimed exams, avoid deadlock
- Use as few semaphores as possible, and do not add any additional precedence constraints

```
Shared Memory:
// exams is an array containing the exams
exams = [exam0, exam1, exam2, ... , exam99];
next_exam = 0;

// Specify the semaphores and initial values here
semaphore remainingExams = A04;
semaphore nextExamLock = A05;
semaphore emailSpaces = A06;
semaphore unreadEmails = A07;
```

| gradeExams: | enterGrades: |
|-------------------------|------------------------|
| __ A08 __ | __ A13 __ |
| __ A09 __ | readScoreReportEmail() |
| // Get ungraded exam | __ A14 __ |
| exam = exams[next_exam] | |
| | enterGrade() |
| next_exam += 1 | |
| __ A10 __ | |
| | goto enterGrades |
| // Grade exam | |
| grade(exam) | |
| __ A11 __ | |
| emailScoreReport(exam) | |
| __ A12 __ | |
| goto gradeExams | |

A04 = ?, A05 = ?, A06 = ?, A07 = ?
A08 = ?, A09 = ?, A10 = ?, A11 = ?, A12 = ?, A13 = ?, A14 = ?

- A04: 100
- A05: 1
- A06: 10
- A07: 0
- A08: wait(remainingExams)
- A09: wait(nextExamLock)
- A10: signal(nextExamLock)
- A11: wait(emailSpaces)
- A12: signal(unreadEmails)
- A13: wait(unreadEmails)
- A14: signal(emailSpaces)

### 📝 Annotate
:::info
- `remainingExams` counts how many exams are still unclaimed, so it is initialized to 100.
- `nextExamLock` is a mutex protecting `next_exam`; it starts at 1 (unlocked).
- The TA should never have more than 10 unread emails, so `emailSpaces`, the number of free slots in his inbox, starts at 10.
- If there is no email waiting to be read, enterGrades must not proceed, so `unreadEmails` starts at 0.

First, a grading thread must claim an exam that still exists, so A08: wait(remainingExams). Because no two LearnSomething threads should ever grade the same exam, `next_exam` must be read and incremented under the lock: A09: wait(nextExamLock), and after `next_exam += 1`, A10: signal(nextExamLock). After grade(exam), the thread may only send a report if the inbox has a free slot, so A11: wait(emailSpaces). After the email is sent, the TA has one more message to read, so A12: signal(unreadEmails). The TA must not check his mail when there is nothing new, so A13: wait(unreadEmails); after he reads a report, one inbox slot is freed, so A14: signal(emailSpaces).
:::
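For reference, the same structure can be written with POSIX semaphores. The sketch below is my own illustration of the A04 to A14 answers, not official code: `grade`, `email_score_report`, and friends are empty stubs, and `sem_trywait` is used in place of a plain wait on `remainingExams` only so the demo terminates cleanly after all 100 exams are claimed.

```c
#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_EXAMS 100

static sem_t remaining_exams;  /* A04: 100 unclaimed exams        */
static sem_t next_exam_lock;   /* A05: 1, mutex for next_exam     */
static sem_t email_spaces;     /* A06: 10 free inbox slots        */
static sem_t unread_emails;    /* A07: 0 unread reports           */
static int next_exam = 0;

static void grade(int exam) { (void)exam; }               /* stub */
static void email_score_report(int exam) { (void)exam; }  /* stub */
static void read_score_report_email(void) {}              /* stub */
static void enter_grade(void) {}                          /* stub */

static void *grade_exams(void *arg) {
    (void)arg;
    for (;;) {
        if (sem_trywait(&remaining_exams) != 0)  /* A08; trywait so the demo ends */
            return NULL;
        sem_wait(&next_exam_lock);               /* A09 */
        int exam = next_exam;
        next_exam += 1;
        sem_post(&next_exam_lock);               /* A10 */
        grade(exam);
        sem_wait(&email_spaces);                 /* A11 */
        email_score_report(exam);
        sem_post(&unread_emails);                /* A12 */
    }
}

static void *enter_grades(void *arg) {
    (void)arg;
    for (int done = 0; done < NUM_EXAMS; done++) {
        sem_wait(&unread_emails);                /* A13 */
        read_score_report_email();
        sem_post(&email_spaces);                 /* A14 */
        enter_grade();
    }
    return NULL;
}

int main(void) {
    sem_init(&remaining_exams, 0, NUM_EXAMS);
    sem_init(&next_exam_lock, 0, 1);
    sem_init(&email_spaces, 0, 10);
    sem_init(&unread_emails, 0, 0);
    pthread_t g1, g2, ta;
    pthread_create(&g1, NULL, grade_exams, NULL);
    pthread_create(&g2, NULL, grade_exams, NULL);
    pthread_create(&ta, NULL, enter_grades, NULL);
    pthread_join(g1, NULL); pthread_join(g2, NULL); pthread_join(ta, NULL);
    puts("all exams graded and entered");
    return 0;
}
```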
## Problem B

The following semaphore-synchronized processes are run on a shared processor. They can coordinate their execution via shared semaphores, using standard signal and wait calls. Assume that execution may switch between any of the three processes at any point in time.

| Process1 | Process2 | Process3 |
|--|--|--|
| Loop1: | Loop2: | Loop3: |
| Print("I") | Print("P") | Print("L") |
| Print("E") | | Print("N") |
| goto Loop1 | goto Loop2 | goto Loop3 |

1. Assuming that no semaphores are being used, for each of the following sequences of characters, specify whether or not this system could produce that output. Answer "Yes" or "No."

"PIEPIE" : __ B01 __
"LIPNIP" : __ B02 __

- B01 : Yes

### 📝 Annotate
:::info
The schedule Process2 -> Process1 -> Process2 -> Process1 prints "PIEPIE".
:::

- B02 : No

### 📝 Annotate
:::info
When the three processes run concurrently, the scheduler decides the interleaving, so we cannot know in advance which character is printed next; but within one process the prints always stay in program order. To produce "LIPNIP" we could start with Process3 -> Process1 -> Process2 for "LIP", but once Process1 has printed "I" it must print "E" before it can ever print "I" again. "LIPNIP" contains a second "I" with no "E" in between, so it is impossible.
:::

2. In the code listing below, add calls to signal and wait to ensure the three processes print "PIPELINE" exactly once and then stop printing. You should specify a semaphore in each call to signal or wait. Exactly 4 semaphores are needed.

```
s1 = B03
s2 = B04
s3 = B05
p2p3stop = B06
```

| Process1 | Process2 | Process3 |
|-----|----|--------|
| Loop1: | Loop2: | Loop3: |
| __ B07 __ | __ B08 __ | __ B09 __ |
| Print("I") | | Print("L") |
| __ B10 __ | | __ B11 __ |
| __ B12 __ | Print("P") | |
| | __ B13 __ | __ B14 __ |
| Print("E") | __ B15 __ | Print("N") |
| __ B16 __ | | __ B17 __ |
| | | __ B18 __ |
| goto Loop1 | goto Loop2 | goto Loop3 |

- B03 : 0, B04 : 1, B05 : 0, B06 : 1
- B07 : wait(s1), B08 : wait(s2), B09 : wait(s3)
- B10 : signal(s2), B11 : signal(s1)
- B12 : wait(s1), B13 : signal(s1), B14 : wait(s2)
- B15 : wait(p2p3stop), B16 : signal(s3), B17 : signal(s1)
- B18 : wait(p2p3stop)

### 📝 Annotate
:::info
Each process must block at the top of its loop until it is its turn, so B07 : wait(s1), B08 : wait(s2), B09 : wait(s3).

"PIPE" step: The first character is "P", so s2 must start at 1 (B04 : 1). Process2 prints "P" and hands control to Process1 with B13 : signal(s1). Its B15 : wait(p2p3stop) succeeds this first time because p2p3stop starts at 1; on the second trip around the loop Process2 blocks at wait(s2). Process1, unblocked by s1, prints "I"; to get "P" printed again it does B10 : signal(s2) and then pauses itself at B12 : wait(s1). Process2 prints the second "P", signals s1, and this time blocks at wait(p2p3stop), since p2p3stop is now 0. Process1 resumes past B12 and prints "E", so half of "PIPELINE" is done, and it hands off to Process3 with B16 : signal(s3).

"LINE" step: Process3 passes wait(s3), prints "L", then B11 : signal(s1) and B14 : wait(s2) hand control to Process1 and pause Process3. Process1 prints "I" and signals s2 (B10), which resumes Process3 rather than Process2, because Process2 is blocked at wait(p2p3stop), not at wait(s2). Process3 prints "N", does B17 : signal(s1), and blocks at B18 : wait(p2p3stop). Process1 passes B12 and prints the final "E", signals s3 (B16), and blocks at the top of Loop1. Every process is now blocked, so "PIPELINE" is printed exactly once.
:::
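The quiz uses abstract signal/wait, but the filled-in answers translate directly to POSIX semaphores. Here is a runnable sketch of that translation (my own; the `sleep(1)` at the end is just a crude way to let the output drain before main exits, since all three threads are permanently blocked after "PIPELINE"):

```c
#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static sem_t s1, s2, s3, p2p3stop;  /* B03..B06: 0, 1, 0, 1 */

static void *p1(void *arg) {
    (void)arg;
    for (;;) {
        sem_wait(&s1);          /* B07 */
        putchar('I');
        sem_post(&s2);          /* B10 */
        sem_wait(&s1);          /* B12 */
        putchar('E');
        sem_post(&s3);          /* B16 */
    }
}

static void *p2(void *arg) {
    (void)arg;
    for (;;) {
        sem_wait(&s2);          /* B08 */
        putchar('P');
        sem_post(&s1);          /* B13 */
        sem_wait(&p2p3stop);    /* B15 */
    }
}

static void *p3(void *arg) {
    (void)arg;
    for (;;) {
        sem_wait(&s3);          /* B09 */
        putchar('L');
        sem_post(&s1);          /* B11 */
        sem_wait(&s2);          /* B14 */
        putchar('N');
        sem_post(&s1);          /* B17 */
        sem_wait(&p2p3stop);    /* B18 */
    }
}

int main(void) {
    sem_init(&s1, 0, 0);
    sem_init(&s2, 0, 1);
    sem_init(&s3, 0, 0);
    sem_init(&p2p3stop, 0, 1);
    pthread_t t1, t2, t3;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t3, NULL, p3, NULL);
    sleep(1);                   /* all threads are blocked once PIPELINE is out */
    putchar('\n');
    return 0;                   /* exiting main tears down the blocked threads */
}
```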
## Problem C

Consider the execution of the following C code, which sums across the columns of a 2D array stored in row-major order. Row-major order means that elements in the same row of the array are adjacent in memory, i.e., A[i][j] is next to A[i][j+1]. The array elements are 32-bit integers.

```
int A[N][M]; // N = 32, M = 256
int sum = 0;

for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {
        sum += A[i][j];
    }
}
```

1. Consider an 8-way set-associative cache with 16-byte cache lines and LRU replacement. What is the minimum number of sets that this cache needs such that this code will only produce compulsory misses? __ C01 __

C01 = ?
- 256

### 📝 Annotate
:::info
The key phrase is "row-major order": elements in the same row of the array are adjacent in memory. One set holds 8 ways * 16 bytes = 128 bytes, and one column of the array holds 32 * 4 = 128 bytes of data, so a whole column fits in the equivalent of one set's capacity. But the code walks down columns, and consecutive elements of a column are a full row apart, so those 128 bytes are spread across sets 64 apart (see the derivation below): (128 / 128) * 256 = 256 sets.
:::
- If only compulsory misses are allowed, then the first N accesses (loading the first column across many cache lines) must not cause any evictions, since the next three columns hit in those same cache lines. With 8 ways and LRU, the 32 lines of a column can occupy as few as 32 / 8 = 4 distinct sets. Each row of the matrix corresponds to 256 * 4 / 16 = 64 cache lines, so two consecutive accesses in the same column stride over 64 sets. Thus the 4 distinct sets are 64 apart, and in total we need 4 * 64 = 256 sets to avoid conflicts.

2. Suppose the cache is virtually indexed and physically tagged. Does the number of sets you derived in the previous question introduce a virtual aliasing issue, assuming a 4 KiB page size? __ C02 __ (Answer Yes/No with a brief explanation.)

C02 = ?
- No (256 sets * 16 bytes per line <= 4096 B page size, so the index bits lie entirely within the page offset and no aliasing occurs.) (Similar explanations receive credit, but a bare "No" receives none.)

3. We would like to reduce the frequency of compulsory misses in this code by adding software prefetching instructions. We measure the L1 miss penalty to be 40 cycles. When the prefetch instruction is replaced with a NOP, the first 32 iterations of the inner loop each take 50 cycles to complete. What should the OFFSET of the prefetch instruction be to maximize timeliness? __ C03 __

### 📝 Annotate
:::info
Cache prefetching is a technique used by processors to boost execution performance by fetching instructions or data from their original storage in slower memory into a faster local memory before they are actually needed.

Software prefetching: insert the prefetches explicitly, through compiler intrinsics or directly in the program.
:::

```
int A[N][M]; // N = 32, M = 256
int sum = 0;

for (int j = 0; j < M; j++) {
    for (int i = 0; i < N; i++) {
        prefetch(&A[i][j] + OFFSET); // prefetches from (A + M * i + j + OFFSET)
        sum += A[i][j];
    }
}
```

- 1024
- If the prefetcher behaved perfectly, then the inner loop would take 50 - 40 = 10 cycles per iteration. Thus, if prefetched data is expected to return in 40 cycles, we want to fetch 40 / 10 = 4 iterations ahead. Each inner-loop iteration advances by one row, i.e., M = 256 elements, so OFFSET = 4 * 256 = 1024.
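On GCC and Clang, the pseudo `prefetch()` call corresponds to the real `__builtin_prefetch` intrinsic. A sketch of how the loop might look in actual C follows; the cycle numbers behind OFFSET come from this problem's assumptions, not from any particular machine.

```c
#define N 32
#define M 256
#define OFFSET (4 * M)  /* 4 iterations ahead = 40-cycle latency / 10 cycles per iteration */

int A[N][M];

int column_sum_prefetched(void) {
    int sum = 0;
    for (int j = 0; j < M; j++) {
        for (int i = 0; i < N; i++) {
            /* GCC/Clang builtin: 2nd arg 0 = read, 3rd arg = temporal-locality hint.
               Prefetching past the end of A is harmless: prefetches do not fault. */
            __builtin_prefetch(&A[i][j] + OFFSET, 0, 1);
            sum += A[i][j];
        }
    }
    return sum;
}
```

Note that `&A[i][j] + OFFSET` is pointer arithmetic on `int *`, so it advances by 1024 elements (4 KiB), i.e., exactly 4 rows ahead, matching the comment `(A + M * i + j + OFFSET)` in the quiz code.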
## Problem D

Consider the effectiveness evaluation of multithreading using a simple database benchmark. The benchmark searches for an entry in a linked list built from the following structure, which contains a key, a pointer to the next node in the linked list, and a pointer to the data entry.

```
struct node {
    int key;
    struct node *next;
    struct data *ptr;
};
```

The following RISC-V code shows the core of the benchmark, which traverses the linked list and finds an entry with a particular key.

```
loop:
    LW   x3, 0(x1)   # load a key
    LW   x4, 4(x1)   # load the next pointer
    SEQ  x3, x3, x2  # set x3 if x3 == x2
    BNEZ x3, end     # found the entry
    ADD  x1, x0, x4
    BNEZ x4, loop    # check the next node
end:
```

We run this benchmark on a single-issue in-order processor. The processor can fetch and issue (dispatch) one instruction per cycle. If an instruction cannot be issued due to a data dependency, the processor stalls. Integer instructions take one cycle to execute and the result can be used in the next cycle. For example, if SEQ is executed in cycle 1, BNEZ can be executed in cycle 2. We also assume that the processor has a perfect branch predictor with no penalty for both taken and not-taken branches.

1. Assume that our system does not have a cache. Each memory operation directly accesses main memory and takes 100 CPU cycles. The load/store unit is fully pipelined and non-blocking. After the processor issues a memory operation, it can continue executing instructions until it reaches an instruction that is dependent on an outstanding memory operation. How many cycles does it take to execute one iteration of the loop in steady state? __ D01 __ cycles

D01 = ?
- 104

![](https://i.imgur.com/t76faUY.jpg)

### 📝 Annotate
:::info
SEQ: `if (rs1 == rs2) rd = 1; else rd = 0`
Single-issue: one instruction issued per clock cycle.

Every memory instruction takes 100 cycles to complete. SEQ has a data dependency on x3, so it must wait until `LW x3, 0(x1)` completes before it can issue. All other (integer) instructions take one cycle, and their results can be used in the next cycle.
:::

2. Now we add zero-overhead multithreading to our pipeline. A processor executes multiple threads, each of which performs an independent search. Hardware mechanisms schedule a thread to execute each cycle. In our first implementation, the processor switches to a different thread every cycle using fixed round-robin scheduling. Each of the N threads executes one instruction every N cycles. What is the minimum number of threads that we need to fully utilize the processor, i.e., execute one instruction per cycle? __ D02 __ thread(s)

D02 = ?
- 50
- If we have N threads and the first load executes at cycle 1, SEQ, which depends on the load and is two instructions later in the same thread, executes at cycle 2N + 1. To fully utilize the processor, we need to hide the 100-cycle memory latency: 2N + 1 >= 101, so the minimum number of threads needed is 50.

Then, we change the processor to only switch to a different thread when an instruction cannot execute due to a data dependency. What is the minimum number of threads to fully utilize the processor now? Note that the processor issues instructions in order within each thread. __ D03 __ thread(s)

D03 = ?
- 18
- In steady state, each thread can execute 6 instructions (SEQ, BNEZ, ADD, BNEZ, LW, LW) before it must switch. Therefore, to hide the 98 cycles (100 minus the 2 instructions issued between the first LW and SEQ) between the second LW and SEQ, the processor needs ceil(98/6) + 1 = 18 threads.
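The two thread-count formulas are easy to mix up, so here is a tiny C sketch of my own that packages both derivations with this problem's numbers baked in:

```c
#include <stdio.h>

int main(void) {
    const int mem_latency = 100;  /* cycles until a load's result is usable */
    const int dist = 2;           /* instructions between LW x3 and the dependent SEQ */

    /* Fixed round-robin: SEQ issues at cycle 2N + 1, so we need 2N + 1 >= 101. */
    int n_rr = (mem_latency + dist - 1) / dist;   /* ceil(100 / 2) = 50 */
    printf("round robin:     %d threads\n", n_rr);

    /* Switch-on-stall: each thread runs 6 instructions per turn, and 2 of the
       100 latency cycles are already covered, so 98 cycles must be hidden. */
    int run = 6, hide = mem_latency - dist;       /* 98 */
    int n_sos = (hide + run - 1) / run + 1;       /* ceil(98 / 6) + 1 = 18 */
    printf("switch on stall: %d threads\n", n_sos);
    return 0;
}
```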
## Problem E

Assume that we are going to implement SIMD instructions in our pipelined RISC-V datapath. In order to do so, we will take 2 steps:

1. Combine the IF and ID stages into an IFD stage.
2. Create 2 paths for the datapath to take after IFD: one path for normal RISC-V instructions and one for SIMD RISC-V containing specialized hardware.

The only problem is that the SIMD EX stage takes 3 cycles to complete instead of 1, and no other SIMD instruction is allowed to enter the SIMD EX stage while another SIMD instruction is there.

![](https://i.imgur.com/Q0Tq2aK.png)

1. This "delay" from the SIMD EX stage necessitates the use of stalls to ensure proper functionality. Which of the following implementations correctly generates the stall signal?

![](https://i.imgur.com/v5rDOkq.png)

-- You may ignore any kinds of stalls caused by hazards; we are only concerned with this special case in our new pipeline. However, we still want to maintain a good instruction throughput. To do this, we should allow normal instructions to continue through the CPU, as they are not blocked from doing so by the SIMD path.
-- The signal SIMD_Inst indicates that the current instruction fetched in IFD is a SIMD instruction, while the signal SIMD_count refers to the number of the cycle the SIMD instruction is completing in the EX stage, i.e., when it is in the first cycle of the EX stage, SIMD_count = 1. If there is no instruction in the SIMD EX stage, this value is 0. The comparators are unsigned.

Select all that apply. __ E01 __ (Answer in A, B, C, D, or None of the other options)

E01 = ?
- C
- We need to stall when we fetch a SIMD instruction but one is already in the EX stage and will remain there for the next cycle (i.e., SIMD_count >= 1 but <= number of cycles for EX - 1 = 2). Otherwise, if a normal instruction is fetched, it can just be pushed down the normal path, and no stall is needed.

2. Because we wish to actually stall and not flush, how should the PC and PC mux be updated to allow for this?

![](https://i.imgur.com/qjopNNX.png)

-- Assume stall is a signal that is 1 when we should stall, and therefore not fetch a new instruction, or 0 otherwise.
-- Select all that apply. __ E02 __ (Answer in A, B, C, D, or None of the other options)

E02 = ?
- B, D
- When stall = 1, we want the new PC to be the old PC; otherwise the behavior should be unchanged. Option A is incorrect since the selector bits of 1 or 2 will never be chosen: 0 will be sign-extended to 00 (PC+4), while 1 will be sign-extended to 11 (undefined). Option B is correct. Option C is close but incorrect, as ALUout is at index 2 of the mux, which is unreachable; the normal PCSel = 1 that would choose ALUout becomes 11 after sign extension. Option D is correct.

3. How many stalls caused by the SIMD EX stage are needed for the following piece of code? __ E03 __

```
addi t0, x0, 1
simd_add x1, x2, x3
simd_sub x2, x1, x3
addi t1, t0, 2
simd_mul x2, x2, x2
sub t0, t0, t1
mul t0, t0, t0
simd_div x2, x2, x1
```

E03 = ?
- 3
- simd_add (line 2) forces 2 stalls on simd_sub, which must wait for it to leave the SIMD EX stage.
- simd_mul (line 5) takes 1 stall left over from simd_sub.

## Problem F

Assume we have a single-level, 1 KiB direct-mapped L1 cache with 16-byte blocks. We have 4 GiB of memory. An integer is 4 bytes. The array is block-aligned.

```
#define LEN 2048
int ARRAY[LEN];

int main() {
    for (int i = 0; i < LEN - 256; i += 256) {
        ARRAY[i] = ARRAY[i] + ARRAY[i+1] + ARRAY[i+256];
        ARRAY[i] += 10;
    }
}
```

1. Calculate the number of tag, index, and offset bits in the L1 cache. T: __ F01 __ I: __ F02 __ O: __ F03 __

F01 = ?, F02 = ?, F03 = ?
- F01: 22, F02: 6, F03: 4
- Offset: log2(block size) = log2(16) = 4
- Index: log2(cache size / block size) = log2(1 KiB / 16) = log2(64) = 6
- Tag: first find the total address bits: log2(4 GiB) = log2(4 * 2^30) = log2(2^32) = 32. Then 32 - Index - Offset = 32 - 6 - 4 = 22.

2. What is the hit rate for the code above? Assume C processes expressions left-to-right. __ F04 __

F04 = ?
- 50%
- Note that ARRAY[i+256] is 256 ints = 1024 bytes away from ARRAY[i], exactly the cache size, so the two map to the same cache line and keep evicting each other.

```
Every iteration:
ARRAY[i]     read  MISS
ARRAY[i+1]   read  HIT
ARRAY[i+256] read  CONFLICT -> MISS
ARRAY[i]     write CONFLICT -> MISS
ARRAY[i]     read  HIT
ARRAY[i]     write HIT

3 MISSES, 3 HITS -> 50% hit rate.
```
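The conflict pattern can be checked mechanically. Below is a small direct-mapped cache simulator I wrote for exactly this access trace; the base address of 0 stands in for any block-aligned address, and reads and writes are classified identically since write-back versus write-through does not change the hit count here.

```c
#include <stdio.h>
#include <stdint.h>

#define LEN        2048
#define BLOCK_SIZE 16u   /* 4 offset bits            */
#define NUM_LINES  64u   /* 1 KiB / 16 B, 6 index bits */

static uint32_t tags[NUM_LINES];
static int valid[NUM_LINES];
static int hits, misses;

static void access_addr(uint32_t addr) {
    uint32_t index = (addr / BLOCK_SIZE) % NUM_LINES;
    uint32_t tag   = addr / (BLOCK_SIZE * NUM_LINES);
    if (valid[index] && tags[index] == tag) {
        hits++;
    } else {             /* allocate on read and write alike (write-allocate) */
        misses++;
        valid[index] = 1;
        tags[index] = tag;
    }
}

int main(void) {
    uint32_t base = 0;   /* ARRAY is block-aligned, per the problem */
    for (int i = 0; i < LEN - 256; i += 256) {
        access_addr(base + 4u * i);          /* ARRAY[i]     read  */
        access_addr(base + 4u * (i + 1));    /* ARRAY[i+1]   read  */
        access_addr(base + 4u * (i + 256));  /* ARRAY[i+256] read  */
        access_addr(base + 4u * i);          /* ARRAY[i]     write */
        access_addr(base + 4u * i);          /* ARRAY[i]     read  */
        access_addr(base + 4u * i);          /* ARRAY[i]     write */
    }
    printf("hits=%d misses=%d hit rate=%.0f%%\n",
           hits, misses, 100.0 * hits / (hits + misses));
    return 0;
}
```

Running it prints `hits=21 misses=21 hit rate=50%`, confirming the 3-miss/3-hit pattern per iteration.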
3. We decide to add an L2 cache to our system. We shrink our L1 cache, so it now takes 3 cycles to access L1. Since we have a larger L2 cache, it takes 50 cycles to access L2. The L1 cache has a hit rate of 25% while the L2 cache has a hit rate of 90%. It takes 500 cycles to access physical memory. What is the average memory access time (AMAT) in cycles? __ F05 __

F05 = ?
- 78

### 📝 Annotate
:::info
Using local miss rates:

AMAT = L1 hit time + L1 miss rate * L1 miss penalty
L1 miss penalty = L2 hit time + L2 miss rate * L2 miss penalty
L2 miss penalty = L3 hit time + L3 miss rate * L3 miss penalty, and the last level's miss penalty is the main-memory access time. Here there is no L3, so the L2 miss penalty is simply 500 cycles.

AMAT = 3 + 3/4 * (50 + (1 - 0.9) * 500) = 78
:::
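As a sanity check, the same recursion is easy to compute in C (a throwaway helper of mine, with this problem's numbers):

```c
#include <stdio.h>

/* AMAT via local miss rates: each level's miss penalty is the AMAT of the next level. */
static double amat(const double hit_time[], const double hit_rate[],
                   int levels, double mem_time) {
    double t = mem_time;               /* beyond the last cache level: main memory */
    for (int i = levels - 1; i >= 0; i--)
        t = hit_time[i] + (1.0 - hit_rate[i]) * t;
    return t;
}

int main(void) {
    double hit_time[] = {3, 50};       /* L1, L2 access times in cycles */
    double hit_rate[] = {0.25, 0.90};  /* local hit rates */
    printf("AMAT = %.0f cycles\n", amat(hit_time, hit_rate, 2, 500));  /* 78 */
    return 0;
}
```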
## Problem G

Assume that you have been tasked to assess the performance of two designs:

- Processor1
- Processor2

You have decided to benchmark both processors on the following assembly sequence.

```
L1: slli x11, x10, 2
    lw   x12, 0x80(x11)
    add  x13, x13, x12
    addi x10, x10, 1
    blt  x10, x14, L1
    sub  x14, x14, x15
    xori x15, x14, 0x1
```

![](https://i.imgur.com/llqC1Ge.png)
![](https://i.imgur.com/ALtNU58.png)
![](https://i.imgur.com/cHbW5QG.png)

### 📝 Annotate
:::info
This is the baseline 5-stage pipelined processor. Since `lw` does not produce its value until it reaches the MEM stage, in cycle 4 the `add x13, x13, x12` instruction has to wait for that value, so a nop is inserted.

The differences between Processor1 and Processor2 are the number of pipeline stages, the stage in which each result becomes available, and therefore where bypassing can cut cycles from one iteration.
:::

Processor1 is a 4-stage pipelined processor with full bypassing. It attempts to improve performance by decreasing the number of cycles it takes to execute each instruction. The EXE and MEM stages are combined. Load requests are sent in the EXE stage and received in the WB stage one cycle later. The IF stage speculatively fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a branch misprediction. Branches are resolved in the EXE stage.

Processor2 is a 6-stage pipelined processor with full bypassing. It attempts to improve performance by raising the clock speed. The EXE stage is split into two stages to break up the critical path through the ALU. The IF stage always fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a taken branch. ALU and Branch ALU results are available only in EXE2.

How many cycles does it take Processor1 to execute one iteration of the loop? __ G01 __ cycle(s)

G01 = ?
- 8

![](https://i.imgur.com/2pxR5F4.jpg)

### 📝 Annotate
:::info
With the EXE and MEM stages merged into one EXM stage, the fully bypassed pipeline looks like this. Look at cycle 4: because EXM combines execute and memory, the load result cannot be forwarded until the write-back stage. The `add` needs that value before it enters EXM, so a nop must be inserted.

As for cycle 7, `blt` does not decide whether to jump until its EXM stage, and here it is mispredicted, so the two incorrectly fetched instructions are annulled in cycle 8 and fetch redirects to the correct address. From cycle 0 to cycle 7, one iteration takes eight cycles.
:::

How many cycles does it take Processor2 to execute one iteration of the loop? __ G02 __ cycle(s)

G02 = ?
- 13

my answer: 12
![](https://i.imgur.com/LEPvffR.png)

origin answer: 13
![](https://i.imgur.com/JYvMCjs.png)

### 📝 Annotate
:::info
In this case the execute stage is split into two stages, so every data dependency forces extra nops: in cycles 4, 6, and 9 the consumers each wait one or two cycles for their operands.

I found what looks like a small mistake in the original answer. In its cycle 7, the load value is bypassed only when `lw` reaches the write-back stage. Maybe I'm wrong, but I think the value could already be bypassed from the MEM stage, following the read-after-write principle, which would make one iteration 12 cycles instead of 13.
:::

Given that Processor1's clock period is 6 ns, how long does it take Processor1 to execute one loop iteration? __ G03 __ ns per loop iteration.

G03 = ?
- 48
- 8 cycles * 6 ns/cycle = 48 ns per loop iteration

## Problem H

Consider a computer that has 2 processors, each with its own cache. Both have the same design: a 128-byte cache, 2-way set associative, 4 ints per block, write-back, and write-allocate with LRU replacement. Each cache takes in 20-bit addresses. Assume that ints are 4 bytes, and we are using the MOESI cache-coherence protocol.

1. Are the 20-bit addresses virtual addresses? __ H01 __ (answer Yes or No)
- No

2. How many offset bits? __ H02 __
- 4
:::info
Since there are four ints per block and ints are 4 bytes, one block is 16 bytes; 2^4 = 16, so offset = 4 bits.
:::

3. How many index bits? __ H03 __
- 2
:::info
128 / 16 = 8 blocks in the cache. Since there are 2 ways, there are 8 / 2 = 4 sets; 2^2 = 4, so index = 2 bits.
:::

4. How many tag bits? __ H04 __
- 14
:::info
20 - 2 - 4 = 14 bits
:::

We decide to parallelize a for loop across these 2 processors, but we have each thread do a strided memory access, where processor 0 handles even indices while processor 1 handles odd indices. However, the memory accesses are perfectly interleaved, i.e., the order of array accesses is still A[0], A[1], A[2], A[3], ...

```
#define ARR_LEN 32

// A is located at address 0xA0000
int A[ARR_LEN];

// Processor 0's loop
for (int i = 0; i < ARR_LEN; i += 2) {
    A[i] += i;
}

// Processor 1's loop
for (int j = 1; j < ARR_LEN; j += 2) {
    A[j] += j;
}
```

5. What is the overall hit rate? __ H05 __
- 50%
- Each `A[k] += k` is a read followed by a write. Within each 16-byte block of four ints, every read misses (the first compulsorily, the rest because the block keeps ping-ponging between the two caches), while each write hits immediately after its read: 4 misses out of 8 accesses. This pattern repeats for all 8 blocks, giving a 50% hit rate.

6. What fraction of misses are coherency misses? Leave your answer as a fully simplified fraction. __ H06 __
- 3/4
- Of the 4 misses in each block's access pattern, 1 is compulsory while the other 3 are coherency misses, so 3/4 of the overall misses.

7. In total, how many times did we need to go to main memory to write back? __ H07 __
- 0
- As the array fits perfectly into the caches, we never need to evict a block and write it back, so 0.

8. We want to avoid all the coherency misses, so we look to see if we can rewrite our code to optimize for cache performance. Which of the following methods will lead to a higher hit rate than that from the interleaved accesses? __ H08 __ (select all that apply)
-- a. Letting processor 0 start and finish, then processor 1 starts and finishes
-- b. Letting processor 1 start and finish, then processor 0 starts and finishes
-- c. None of the other options
- a, b
- Both of these approaches would be better, since there would be no coherency misses during the first processor's execution: it loads each block once and then hits on that block's remaining write-read-write (miss-hit-hit-hit per block, a 75% hit rate). The second processor then also misses once per block (a coherency miss instead of a compulsory one) but gets the same MHHH pattern, so the overall hit rate rises to 75%.
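The interleaved version is a classic false-sharing pattern: the two processors fight over every 16-byte block even though they never touch the same element. The C/pthreads sketch below is my own illustration, with the array enlarged (`ARR_LEN`, `REPS` are arbitrary) so the effect is measurable; compile with `gcc -O2 -pthread`. The MOESI traffic itself is not observable from C, but on many machines the "parallel" interleaved run is no faster, and often slower, than running the two loops back to back as in option (a).

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ARR_LEN (1 << 20)  /* much larger than the quiz's 32 so timing is visible */
#define REPS    200

static int A[ARR_LEN];

static void *even_loop(void *arg) {            /* "processor 0": even indices */
    (void)arg;
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < ARR_LEN; i += 2) A[i] += i;
    return NULL;
}

static void *odd_loop(void *arg) {             /* "processor 1": odd indices */
    (void)arg;
    for (int r = 0; r < REPS; r++)
        for (int j = 1; j < ARR_LEN; j += 2) A[j] += j;
    return NULL;
}

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    pthread_t t0, t1;
    double t;

    /* Interleaved: both threads run at once, sharing every cache block. */
    t = now();
    pthread_create(&t0, NULL, even_loop, NULL);
    pthread_create(&t1, NULL, odd_loop, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("concurrent (coherency misses): %.3f s\n", now() - t);

    /* Option (a): processor 0 finishes before processor 1 starts. */
    t = now();
    even_loop(NULL);
    odd_loop(NULL);
    printf("sequential (option a):         %.3f s\n", now() - t);

    printf("A[1] = %d\n", A[1]);  /* keep the work observable */
    return 0;
}
```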
## Problem I

Assume that your processor is designed as the traditional five-stage pipeline. You find that your unit tests with single instructions work for all instructions, and you write some test patterns with multiple instructions. After running the test suite, the following cases fail. You should assume registers are initialized to 0, the error condition is calculated in the fetch stage, and no forwarding is currently implemented.

Case 1: Assume the address of an array with all different values is stored in s0.

```
addi t0, x0, 1
slli t1, t0, 2
add  t1, s0, t1
lw   t2, 0(t1)
```

Each time you run this test, there is the same incorrect output for t2. All the commands work individually on the single-stage pipeline.

1. What caused the failure? __ I01 __ (select ONE)
a. Control Hazard
b. Structural Hazard
c. Data Hazard
d. None of the above

I01 = ?
- c

### 📝 Annotate
:::info
Since there is no forwarding, lines 3 and 4 have a data dependency on t1 (and lines 2 and 3 likewise), which causes a data hazard.
:::

2. How could you fix it? __ I02 __ (select all that apply)
a. Insert a nop 3 times if you detect this specific error condition
b. Forward execute to write back if you detect this specific error condition
c. Forward execute to memory if you detect this specific error condition
d. Forward execute to execute if you detect this specific error condition
e. Flush the pipeline if you detect this specific error condition

I02 = ?
- a, d

### 📝 Annotate
:::info
The problem is a data hazard: the value we need has not yet been written back to the register file. Without forwarding, a result computed in the execute stage only updates the register file in the write-back stage, which is too late for the next instruction. One fix is to stall: inserting 3 nops gives the producer time to write back before the consumer reads, at the cost of wasted cycles. The other fix, shown in the picture below, is to forward the value from execute to execute. Flushing the pipeline does not work, because the flushed instructions would simply never execute; throwing instructions away cannot produce the correct result.
:::

![](https://i.imgur.com/57mNzRe.png)

Case 2: After fixing that hazard, the following case fails:

```
     addi s0, x0, 4
     slli t1, s0, 2
     bge  s0, x0, greater
     xori t1, t1, -1
     addi t1, t1, 1
greater:
     mul  t0, t1, s0
```

When this test case is run, t0 contains 0xFFFFFFC0, which is not what it should have been.

![](https://i.imgur.com/U6kB09M.png)
![](https://i.imgur.com/XCIYQnW.png)
![](https://i.imgur.com/UavBsoW.png)

### 📝 Annotate
:::info
As the pictures show, a branch does not decide whether to jump until it finishes the execute stage. So when a misprediction happens, the pipeline has to flush the wrongly fetched instructions, stall two cycles, and jump to the correct address. In this case, because of the misprediction, `xori` and `addi` need to be flushed and the PC has to jump to the `greater` label to fetch the `mul` instruction.
:::

3. What caused the failure? __ I03 __ (select ONE)
a. Control Hazard
b. Structural Hazard
c. Data Hazard
d. None of the above
- a

4. How could you fix it? __ I04 __ (select all that apply)
a. Insert a nop 3 times if you detect this specific error condition
b. Forward execute to write back if you detect this specific error condition
c. Forward execute to memory if you detect this specific error condition
d. Forward execute to execute if you detect this specific error condition
e. Flush the pipeline if you detect this specific error condition
- a, e

### 📝 Annotate
:::info
The failure happens because the instructions after the branch are not flushed. In a 5-stage pipeline, a branch does not determine whether to jump until it reaches the execute stage, and by then two younger instructions are already in the pipeline, so a wrong prediction causes a control hazard. One fix is to insert nops when a branch is detected in the IF stage; another is simply to flush the pipeline on a misprediction. Forwarding is useless for a mispredicted branch.
:::

Some other notes about this. The incorrect value we got was -64. How did we get this? Looking at the code, s0 will hold 4 and t1 will hold 16. Due to the control hazard, we execute and write back the following two instructions before we execute the mul: we flip the bits (xori; note that -1 means all bits are 1 due to two's-complement encoding) and add one (addi). This is just two's-complement negation! It turns t1 into -16, which multiplied by 4 gives -64. In a correct implementation, we would not have executed the xori and addi and thus would have gotten 64.
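The xori/addi pair really is two's-complement negation, which is easy to confirm with a toy C check mirroring the register values from the failing test:

```c
#include <stdio.h>

int main(void) {
    int s0 = 4;
    int t1 = s0 << 2;   /* slli t1, s0, 2  -> 16 */
    t1 = t1 ^ -1;       /* xori t1, t1, -1 : flip every bit */
    t1 = t1 + 1;        /* addi t1, t1, 1  : completes two's-complement negation */
    int t0 = t1 * s0;   /* mul t0, t1, s0 */
    printf("t0 = %d (0x%08X)\n", t0, (unsigned)t0);  /* -64, i.e., 0xFFFFFFC0 */
    return 0;
}
```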
## Problem J

Which of the following were discussed in Prof. David Patterson's talk "A New Golden Age for Computer Architecture: History, Challenges, and Opportunities"? __ J01 __ (select all that apply)

1. The power demands of machine learning are growing 10x per year; Moore's Law is only 10x in 5 years.
2. The Tensor Processing Unit has a similar performance per watt to a CPU on production applications.
3. The marketplace has decided: open ISAs (e.g., RISC-V) are better than proprietary ones (e.g., ARM).
4. Domain Specific Architectures achieve incredible performance but just for one application, like ASICs.

- 1