###### tags: `class note`
# VLSI FINAL
## REQUIREMENTS
- The processor core shall run at **75 MHz** or faster in the post-synthesis netlist.
- The instruction set shall have at least **45 instructions**, including branch and I/O instructions.
- IM + DM cannot be larger than **320 KB**.
- The silicon area of CPU + ICache + DCache + IM + DM shall be confined within **110 mm²** in total.
- The EPU shall be synthesized and constrained within **3 mm²**.
- The kernel of the chip shall be less than **120 mm²**.
- The read/write access time of the off-chip, non-synthesized memory (usually DRAM) is **60 ns**.
- The main memory has only one read/write port, with a bit width of **32**.
- The processor must have an **interrupt mechanism** and an **interrupt service routine** for handling requests from other devices.
## Advanced
- Add, synthesize, and verify **at least 10 more instructions** beyond the basic requirements, including instructions that facilitate 64-bit addition/subtraction and store/load.
- Add, synthesize, and verify another level of cache, such as **L2 or L3**.
- Add, synthesize, and verify a direct memory access (DMA) block.
- Add, synthesize, and verify a stack or other mechanism to facilitate **function calls or recursive functions**.
- Add, synthesize, and verify dynamic **branch prediction**.
- Add, synthesize, and verify a **floating-point co-processor**.
- Make the full system **bootable by an operating system**, such as Linux, Android, or an RTOS.
- Validate the full system running on an **FPGA board** after verifying it in simulation.
- MMU
## Works
1. Complete CPU with 45/55 ISA and ISR (and branch prediction) - 2 people
    - ==Eric==, ==Willy==
2. L2 Cache - 2 people
    - ==Su==, ==Willy==
3. Verification - 2 people
    - ==Jacky==, ==Su==, ==Kai==
4. Burst Mode AXI Bus - 2 people
    - ==Kai==, ==Jacky==
5. Out Of Order CPU - 2 people
    - ==Eric==, ==Anita==, ==Yao==
6.
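   FreeRTOS - 2 people
    - ==Yao==, ==Anita==
## DISCUSSION
- Floating-point
- Out-of-order CPU and bus
- Branch prediction
- Bootable OS
- FPGA
- L2/L3 Cache
- Application (IM + DM < 320 KB)
- MMU
## Presentation outline
Topic: dual-core RISC-V processor
1. Team introduction
    * Photos
2. Division of work
    * CPU
        * Building on the progress from Homework 1, 2, and 3, the overall architecture already includes the AXI bus, L1 cache, and DMA
        * So the hardware work for our final CPU is to fill in the hardware for the functions that are still required
        * The hardware scope is flexible and can be stretched to support more features
            * For example: more ISA and ISR support, branch prediction, MMU, out-of-order execution, floating point, DMA
    * L2/L3 cache
        * Solve the cache coherency problem
    * Verification
        * We are less familiar with this part; while the hardware is still unfinished, the verification group can first study the verification programs and procedure
4. Motivation, application background, technical overview
    * Goal: implement a LeNet program | boot an RTOS
6. Specifications & reasons for major features
    * To run the LeNet program, the CPU must support at least the following specifications
        * input/output data bandwidth: 32 bit
        * NN model: LeNet
        * fps: 1
        * operating speed: 75 MHz
        * key lengths of encryption:
        * throughput:
        * security features: invalid opcode, invalid register operation
7. Key result
    * The LeNet program runs to completion and the final prediction matches the expected answer
9. Extras
## Work breakdown
* Evaluation of the 10 ISA extensions (期開)
    * Conclusion: MUL, CSR, FENCE (11 instructions in total); add more later if something else turns out to be needed.
    * Study of the remaining RV32I instructions:
        - [x] multiprocessor -> FENCE
            * RISC-V Weak Memory Ordering (RVWMO) model
            * Guarantees the execution order of memory operations
        - [ ] prefetch buffer -> FENCE.I
            * Handles the instruction coherence problem
        - [ ] system call -> ECALL
            * Can probably be skipped for now if we are not booting an OS
        - [ ] Debugging mode -> EBREAK
            * Not needed either, since we are not building a debugger
        - [x] CSR Instructions
            - There are only these 6 CSR instructions. Implementation should not be hard: just follow the defined atomic behavior. The more troublesome part is deciding which CSRs will actually be used.
            - [x] CSRRW
            - [x] CSRRS
            - [x] CSRRC
            - [x] CSRRWI
            - [x] CSRRSI
            - [x] CSRRCI
        * Some of the more important CSRs:
            * ==mtvec, mcause, mtval, mepc, mstatus, mie, mip==
            * ![](https://i.imgur.com/jg4SFBD.png)
    * RV32M extension evaluation:
        - [x] Multiplication 32I
            * To do convolution we support the basic 32I (32-bit integer) MUL operations, so the 32F (32-bit float) LeNet input must be quantized first
            - [x] MUL
            - [x] MULH
            - [x] MULHSU
            - [x] MULHU
        - [ ] Division 32I
            * Given the area limit and the target application, we do not support the division instructions for now
            - [ ] DIV
            - [ ] DIVU
            - [ ] REM
            - [ ] REMU
* Interrupt Service Routine (Cheng You)
    * Interrupt types
        * Internal interrupt, or exception
            * Caused inside CPU
            * EX: stack overflow, illegal instruction,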
              divide by zero, ...
            * Interrupt vector number is fixed and known by CPU
        * External interrupt
            * Triggered by peripheral devices
            * Interrupt vector number is provided by the hardware or PIC (programmable interrupt controller)
            ![](https://i.imgur.com/jksefFY.png)
        * Software interrupt
            * Triggered when a user program needs an OS service
            * EX: system call, trap
    * x86 Interrupt Process
        ![](https://i.imgur.com/QS792ZI.png)
        ![](https://i.imgur.com/lXlkpPb.png)
    * x86 Interrupt Vector Table
        ![](https://i.imgur.com/kzez1e6.png)
    * Reference
        * [Interrupt Concepts](http://www.csie.ntnu.edu.tw/~swanky/os/chap2.htm)
        * [Interrupt Vector](https://www.sciencedirect.com/topics/engineering/interrupt-vector)
* Branch Prediction (Cheng You)
    * Static Branch Prediction
        * Follows a pre-defined mechanism designed in hardware
    * Dynamic Branch Prediction
        * Uses taken/not-taken information gathered at run-time to predict
    * Saturating counter
        * 2-bit, four-state machine
        * Predictor table is indexed with the instruction address
        * Very large bimodal predictors saturate at 93.5% correct on the SPEC'89 benchmarks
        ![](https://i.imgur.com/sG1aa6s.png)
    * Two-level Predictor/Correlation-Based Branch Predictor
        * Two-level Adaptive Predictor (1991)
            * n-bit branch history register with 2^n^ history patterns
            * Uses a "Pattern History Table" to store saturating counters
            * Quickly learns to predict an arbitrary repetitive pattern
            ![](https://i.imgur.com/BRBTPWM.png)
        * Local Branch Prediction
            * A separate history buffer for each conditional branch instruction
            * The pattern history table may be separate as well, or shared between all jump instructions
            * EX: Intel Pentium MMX, Pentium II, and Pentium III
            * Very large local predictors saturate at 97.1% correct on the SPEC'89 benchmarks
        * Global Branch Prediction
            * Shared history buffer
            * EX: AMD, Intel Pentium M/Core/Core 2
            * A very large gshare predictor can reach 96.6% accuracy on the SPEC'89 benchmarks
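
The saturating-counter and global-history notes above can be tried out in software before committing to RTL. The following is a minimal Python sketch, not the team's hardware design: the class names, the 10-bit table size, the "weakly not taken" reset state, and the two synthetic traces are all assumptions made for illustration. It models a bimodal table of 2-bit saturating counters indexed by instruction address, and a gshare variant that XORs the address with a shared global history register.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch address."""

    def __init__(self, index_bits=10):  # table size is an assumption
        self.mask = (1 << index_bits) - 1
        # Counter states: 0, 1 predict not taken; 2, 3 predict taken.
        # Assumed reset state: 1 (weakly not taken).
        self.table = [1] * (1 << index_bits)

    def _index(self, pc):
        # Drop the two alignment bits of a 32-bit instruction address.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0


class GsharePredictor(BimodalPredictor):
    """Same counters, indexed by address XOR a shared global history."""

    def __init__(self, index_bits=10):
        super().__init__(index_bits)
        self.history = 0  # most recent outcome in the lowest bit

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def update(self, pc, taken):
        # Train the counter selected by the history seen at prediction time,
        # then shift the new outcome into the history register.
        super().update(pc, taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, trace):
    """Fraction of correct predictions over a list of (pc, taken) pairs."""
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)


# Synthetic traces (hypothetical addresses, not from a real benchmark):
# a loop branch taken 9 times out of 10, and a strictly alternating branch.
loop_trace = [(0x100, i % 10 != 9) for i in range(1000)]
alt_trace = [(0x200, i % 2 == 0) for i in range(1000)]

if __name__ == "__main__":
    print("bimodal, loop       :", accuracy(BimodalPredictor(), loop_trace))
    print("bimodal, alternating:", accuracy(BimodalPredictor(), alt_trace))
    print("gshare,  alternating:", accuracy(GsharePredictor(), alt_trace))
```

On these toy traces the bimodal predictor gets roughly 90% on the loop branch, because it mispredicts each loop exit, but it never locks onto the alternating branch from this reset state. The gshare variant learns the alternating pattern once its history register has filled. The numbers say nothing about SPEC'89 accuracy; they only illustrate why history-based indexing helps on repetitive patterns.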