# EECS151LA ASIC Final Report

1. **Show the final pipeline diagram, and explain the functionality of different submodules in your design, how control signals are generated, memory structure, etc.**

The final pipeline diagram and our overall top-down diagram of how our CPU interfaces with memory and cache are attached to this document.

CPU: The first stage is responsible for instruction decoding, immediate generation, and reading from the register file. The second stage is responsible for branch comparison and arithmetic operations. The third stage is responsible for immediate generation for memory output, writing back to the register file, and selecting the next PC.

The control unit generates all of the control signals required in the CPU. It takes in the instructions from all three stages and performs combinational logic to determine how to steer each MUX, generate the mask for MemWEn, and so on. Because the control unit also detects data hazards, it was often part of our critical path.

The memory unit sits on the boundary of stage 3 and stage 1 (IMEM) and the boundary of stage 2 and stage 3 (DMEM). It consists of the actual memory and 10 SRAMs, 5 each serving as the cache for the data and instruction memories. Of the 5 SRAMs, 1 stores the tags and the remaining 4 store the 512-bit cacheline at a particular index.

![](https://i.imgur.com/NPQol9i.jpg)

**Figure 1: Cache FSM Diagram**

In this project, we designed a direct-mapped write-through cache backed by SRAMs to improve performance over a naive IMEM/DMEM. Our objective was a 512-byte cache (8 lines of 512-bit cachelines) for both our data memory and instruction memory. The main constraint was the type of SRAMs we were able to use.
The largest SRAM we could use had a word length of 128 bits, so we needed 4 SRAMs to assemble one cacheline, and it was up to us to program them to function like a cache. The FSM we designed for our cache is shown in Figure 1.

For context, a write-through cache immediately reflects every cache write back to memory. This differs from the other well-known type, the write-back cache, which writes only to the cache and defers the memory write until the cacheline is evicted. Write-back caches are more efficient because they do not write to memory on every store, only when necessary to keep memory consistent, but the 'dirty bit' bookkeeping they require made them trickier to implement. Choosing a write-through cache was therefore a design decision that let us debug much more cleanly than a write-back cache would have.

Lastly, a direct-mapped cache maps each block of memory to exactly one line in the cache. This mapping matters because it determines how many hits or misses a program incurs. Direct mapping is the simplest cache variation, with the lowest associativity, since each memory block has exactly one possible location. Increasing associativity decreases the 'height' of the cache but lets us store more blocks in one bucket, so to speak. This helps when, for example, a loop jumps back and forth with a stride comparable to the cache size: higher associativity reduces the number of cache misses.

As for our actual implementation, the cache sits in a READY state whenever it is idle and not being used by the CPU, waiting on the ready-valid interface (a.k.a. handshake protocol).
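The direct-mapped, write-through behavior described here can be sketched as a small Python model (illustrative only -- the names and structure below are not our Verilog; the geometry assumes the 8-line, 512-bit-line cache described above):

```python
# Minimal model of a direct-mapped, write-through cache lookup.
# Geometry: 8 lines x 512-bit (64-byte) cachelines => 6 offset bits, 3 index bits.

NUM_LINES = 8
LINE_BYTES = 64
OFFSET_BITS = 6   # log2(64)
INDEX_BITS = 3    # log2(8)

def split_addr(addr):
    """Decompose a byte address into (tag, index, offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory                     # backing store: dict addr -> byte
        self.valid = [False] * NUM_LINES         # the per-line 'valid bit' registers
        self.tags = [0] * NUM_LINES
        self.lines = [bytearray(LINE_BYTES) for _ in range(NUM_LINES)]

    def _refill(self, tag, index, base):
        # On a miss, fetch the whole line from memory
        # (four SRAM-block reads in the hardware).
        self.lines[index] = bytearray(self.memory.get(base + i, 0)
                                      for i in range(LINE_BYTES))
        self.tags[index], self.valid[index] = tag, True

    def read(self, addr):
        tag, index, offset = split_addr(addr)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self._refill(tag, index, addr - offset)
        return self.lines[index][offset], hit

    def write(self, addr, byte):
        tag, index, offset = split_addr(addr)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self._refill(tag, index, addr - offset)
        self.lines[index][offset] = byte
        self.memory[addr] = byte                 # write-through: memory updated too
        return hit
```

The tag comparison (`self.tags[index] == tag` plus the valid bit) is the software analogue of our TAG_COMP state, and the write-through line in `write` is what distinguishes this design from a write-back cache with a dirty bit.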
Whenever a data transaction is requested, we move to the TAG_COMP state to check whether the requested data is in the cache (a hit) or must be fetched from memory (DMEM/IMEM) on a miss. On a read hit, we simply return the data found in the cache and go back to READY (2 cycles total). On a write hit, we edit the cacheline with the desired data and then move to the MEM_WRITE state to write to the memory itself (a minimum of 3 cycles, depending on the ready-valid interface with the memory). On any cache miss, we must fetch a new cacheline from memory, which takes 4 cycles to read one block from each of the SRAMs and assemble the 512 bits of the new cacheline. Lastly, to track which lines contain valid data, we added 8 registers that serve as per-line 'valid bit' trackers.

Finally, the CSR sits on the boundary of stage 2 and 3, like the DMEM. It is a simple register with some checks that only enables writes when the instruction matches.

2. **What is the post-synthesis critical path length? What sections of the processor does the critical path pass through? Why is this the critical path?**

The post-synthesis critical path is 1857 ps, with a setup time of 43 ps, uncertainty of 100 ps, and data path delay of 1763 ps. The critical path starts from the instruction register between stages 2 and 3, then goes to the forwarding MUX at stage 2, then the branch comparator, then FlushSel, then the ImmGen, then the ALU for jal instructions, then the PC select MUX, and finally the memory. This was likely the critical path because it spans many stages (3, then 2, then 1, then 3) and passes through many MUXes, ALUs, and comparators. In short, it became the critical path because of 1. our dependence on other stages' instructions to determine control signals, and 2. forwarding paths, which extend the critical path by giving signals more ways to cross stages. This is exactly why there is a tradeoff between adding forwarding paths versus removing them entirely and injecting NOPs to avoid hazards. For more on this tradeoff, see Question 9.

3. **Show a screenshot of the final floorplan. Also include a screenshot of the clock tree debugger results. Discuss your floorplanning strategy and the quality of your clock tree results.**

The final floorplans for our design are provided below. We settled on two equivalent designs, differing only in the amount of area allotted when area is not a concern. Choosing between them is a decision engineers must make depending on what they want to prioritize.

**Design 1: Unconstrained Area Design**

![](https://i.imgur.com/Hj7GZv4.png) ![](https://i.imgur.com/sWxYKQH.png) ![](https://i.imgur.com/9EYy41u.png)

**Figure 2: Various views of our floorplan for Design 1**

Our floorplanning strategy was to give the design as much area as possible to obtain the best power and operating frequency we could, given that there was no hard cap on area. This was because, in the initial stages of floorplanning, we had trouble placing the SRAMs so that they would not overlap while still leaving space for clock tree synthesis and the overall place-and-route optimizations. Because more area gives the optimization algorithms more degrees of freedom to place cells and route wires, this design can be thought of as the highest-performance version of our RISC-V CPU.

![](https://i.imgur.com/yi0YVrC.png)

**Figure 3: Clock tree for Design 1**

This approach, and our assumption of unconstrained area, resulted in a clock tree with relatively even leaves, with the exception of the last cluster.
The last cluster all points to the SRAMs, which makes sense: all other clocked signals originate from our CPU (which is within one block), unlike the SRAMs.

![](https://i.imgur.com/WXtIRKz.png)

**Figure 4: Slack histogram for Design 1**

Looking at the slack histogram, we can see a relatively large standard deviation with a long left tail, which indicates that our clock frequency probably had more room to improve, especially given the extra-large area. However, we were short on time and found the process of adjusting the clock both laborious and time-consuming. As a future direction, we think there is potential for a script that performs a parallelized binary search over target clock periods and automatically picks the lowest achievable one.

**Design 2: Constrained Area Design**

We also experimented with decreasing the area in a second design, minimizing the area required to fit all of our SRAMs and the CPU:

![](https://i.imgur.com/GFM8Kyw.png)

**Figure 5: Floorplan for Design 2**

![](https://i.imgur.com/0UJq583.png)

**Figure 6: Slack histogram for Design 2**

We observed that the slack distribution's mean shifts significantly to the left, indicating that the decreased area made timing harder to meet. Since we do not know the exact grading rubric (i.e. whether it weights area or other factors more heavily), we decided to report Design 1, though if area proves more valuable, Design 2 should be chosen instead. These are decisions chip designers must weigh together when converging on a design.

4. **What is the post-place-and-route critical path length? What sections of the processor does the critical path pass through? Why is this the critical path? If it is different from the post-synthesis critical path, why?**

The post-place-and-route critical path length is 1786.221 ps, with a clock gating setup of 75.360 ps, uncertainty of 100 ps, and data path delay of 1777.899 ps. This is the same critical path as the post-synthesis one. For the sections of the processor it passes through and why it is the critical path, see Question 2.

5. **What is the area utilization of the final design? Also include the total core area you used in PnR and the density.**

The final design takes 184061.10816 um^2. The relevant utilization report output:

```
Core utilization = 99.677921
TU for group AO = 0.000000
Effective Utilizations
  Average module density = 1.000
  Density for the design = 1.000
    = stdcell_area 3633930 sites (847723 um^2) / alloc_area 3633928 sites (847723 um^2)
  Pin Density = 0.01626
    = total # of pins 69643 / total area 4281825
```

6. **What is the Innovus-estimated power consumption of the final design?**

```
Total Power
-----------------------------------------------------------------------------------------
Total Internal Power:   3.86770176   25.8682%
Total Switching Power:  9.31992019   62.3341%
Total Leakage Power:    1.76394122   11.7977%
Total Power:           14.95156314
-----------------------------------------------------------------------------------------
```

7. **What is the number of cycles that your design takes to run the benchmarks? What changes/optimizations have you done to try and optimize for these tests?**

```
cachetest: 5698430 cycles
final:     11621 cycles
fib:       10376 cycles
sum:       35248092 cycles
replace:   35250602 cycles
```

We decreased the number of states in our cache's state machine. Initially, we had separate states for whenever we needed to write the SRAMs. We merged those states into MEM_3 and TAG_COMP, saving at least one cycle on all writes and cache misses, and settled on the FSM seen in Figure 1.

8. **What is the post-place-and-route runtime (in seconds) of each benchmark?
Use the number of cycles from RTL simulation, and minimum clock period to meet timing for place-and-route (design doesn't have to pass post-PAR simulations with this clock period).**

```
Frequency: 1/1786.221 ps = 559.84 MHz
cachetest: 5698430/(5.5984e8)  = 0.010179 s
final:     11621/(5.5984e8)    = 2.0758e-5 s
fib:       10376/(5.5984e8)    = 1.8534e-5 s
sum:       35248092/(5.5984e8) = 0.062961 s
replace:   35250602/(5.5984e8) = 0.062965 s
```

9. **Explain any other optimizations you made for your design.**

One optimization was decreasing the number of MUXes in stage 2 used for forwarding. Previously, we had separate MUXes to handle ALU->ALU hazards and ALU->MEM / MEM->ALU hazards for data forwarding. We realized we could combine these forwarding control signals into one MUX, which helped decrease the critical path delay of our current design.

In addition, we reduced the number of cycles JAL instructions take by calculating the JAL PC in stage 1, providing a separate adder for JAL calculations rather than using the universal ALU in stage 2 of our processor. This cut the number of NOP injections needed for a JAL instruction by one. It also required extending the PCSel MUX by one input to include the computed JAL PC.

There were a couple of other optimizations we had in mind. One was an immediate bypass. In our current design, the immediate generator takes its input from the FlushSel MUX, which yields either a NOP instruction (when we hit a control hazard) or the normal instruction fetched at the program counter. Even if we generate the 'wrong' immediate for a NOP, the destination register is x0, so nothing happens.
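The claim that a NOP's immediate is harmless can be sanity-checked with a short Python sketch (this assumes the standard RV32I encodings; the decode helpers are illustrative, not our RTL):

```python
# A RISC-V NOP is addi x0, x0, 0, encoded as 0x00000013. Its rd field is x0,
# so whatever immediate the ImmGen produces for it, the writeback is discarded.

NOP = 0x00000013  # addi x0, x0, 0

def rd(inst):
    """Destination register field, bits [11:7] of an RV32 instruction."""
    return (inst >> 7) & 0x1F

def i_imm(inst):
    """Sign-extended I-type immediate, bits [31:20]."""
    imm = inst >> 20
    return imm - (1 << 12) if imm & 0x800 else imm

def writeback(regfile, inst, value):
    """x0 is hardwired to zero, so writes to it are dropped."""
    if rd(inst) != 0:
        regfile[rd(inst)] = value
    return regfile

regs = [0] * 32
writeback(regs, NOP, i_imm(NOP))   # the NOP's immediate is 0 anyway
assert regs == [0] * 32            # architectural state is unchanged
```

Since any value written back by a NOP lands in x0 and vanishes, the ImmGen's input does not need to wait for the FlushSel decision.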
Because of this, we could have the entire pre_s1_inst bypass the FlushSel MUX and start generating its immediate earlier.

Another optimization we considered, trading CPI for clock frequency, was to remove all forwarding paths (thus decreasing our critical path delay by a significant margin) and extend the FlushSel logic to NOP out every type of hazard. This would let the CPU be clocked at a much higher frequency, so a program with no hazards would run very fast. But this optimization is suboptimal because most programs are not like that, and, as it turns out, the Iron Law of computing performance shows no significant change in overall performance (the time it takes to run the program) from making this change.

Last but not least, another optimization is what we have done throughout this whole project: pipelining! We can pipeline the critical path to break it into chunks. The catch is that, depending on where the additional pipeline stage is added, we would have to handle new hazard scenarios by adding forwarding paths, NOP injections, etc.

Lastly, carrying all 32 instruction bits through every stage is a lot of work for a module. Especially in stage 3, where all that remains is writeback, there was a lot of room to shave unnecessary bits off the instructions to save hardware and potentially improve our critical path. Many of these optimizations trade CPI against clock frequency, so a fixed metric for performance is crucially important for designers deciding whether one design is better than another.

10. **Is there anything you would like to tell the staff before we grade your project?**

Although we did not implement too many additional optimizations for our CPU, our group also did the FPGA final project, where we implemented these optimizations along with discussions of the clock frequency we were able to achieve. For further information, refer to the FPGA final project report for team 11.