###### tags: `class note`
# VLSI FINAL
## REQUIREMENTS
- The processor core shall run at **75 MHz** or faster in the post-synthesis netlist
- The instruction set shall have at least **45 instructions**, including branch and I/O instructions
- The combined size of IM + DM cannot exceed **320 KB**
- The total silicon area of CPU + ICache + DCache + IM + DM shall be within **110 mm²**
- The EPU shall be synthesized and constrained within **3 mm²**
- The kernel of the chip shall be smaller than **120 mm²**
- The read/write access time of the off-chip, non-synthesized memory (usually DRAM) is **60 ns**
- The main memory has only one read/write port, with a bit width of **32**
- The processor must have an **interrupt mechanism** and an **interrupt service routine** to handle requests from other devices
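
A quick sanity check of what the clock and memory numbers imply. This sketch assumes (not stated in the requirements) that a 60 ns DRAM access must be rounded up to whole cycles of the core clock and that each access moves exactly one 32-bit word; the function names are hypothetical.

```python
import math

CLOCK_HZ = 75e6          # minimum target clock from the requirements
DRAM_ACCESS_NS = 60.0    # off-chip read/write access time
PORT_WIDTH_BITS = 32     # single read/write port width


def clock_period_ns(clock_hz: float) -> float:
    """Clock period in nanoseconds."""
    return 1e9 / clock_hz


def dram_wait_cycles(clock_hz: float, access_ns: float) -> int:
    """Whole clock cycles needed to cover one DRAM access (rounded up)."""
    return math.ceil(access_ns / clock_period_ns(clock_hz))


def peak_dram_bandwidth_mb_s(access_ns: float, width_bits: int) -> float:
    """Upper bound if back-to-back single-word accesses each take access_ns."""
    bytes_per_access = width_bits / 8
    return bytes_per_access / (access_ns * 1e-9) / 1e6


if __name__ == "__main__":
    print(f"clock period: {clock_period_ns(CLOCK_HZ):.2f} ns")
    print(f"DRAM access:  {dram_wait_cycles(CLOCK_HZ, DRAM_ACCESS_NS)} cycles")
    print(f"peak DRAM bandwidth: "
          f"{peak_dram_bandwidth_mb_s(DRAM_ACCESS_NS, PORT_WIDTH_BITS):.1f} MB/s")
```

At 75 MHz the period is about 13.33 ns, so one DRAM access costs 5 cycles and a single 32-bit port tops out around 66.7 MB/s. A faster clock only raises the cycle count (6 cycles at 100 MHz), which is why the caches have to hide this latency.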
## Advanced
- Add, synthesize, and verify **at least 10 more instructions** beyond the basic requirements, including instructions that facilitate 64-bit addition/subtraction and load/store.
- Add, synthesize, and verify another level of cache, such as **L2 or L3**.
- Add, synthesize, and verify a direct memory access (DMA) block.
- Add, synthesize, and verify a stack or other mechanism to facilitate **function calls and recursive functions**.
- Add, synthesize, and verify dynamic **branch prediction**.
- Add, synthesize, and verify **floating-point co-processor**
- Make the full system **bootable by an operating system**, such as Linux, Android, or RTOS
- Validate the full system running on an **FPGA board** after verifying using simulations.
- MMU
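
For the 64-bit addition/subtraction item, it helps to know the baseline: plain RV32I already does a 64-bit add/sub in four instructions, using SLTU to recover the carry or borrow. Below is a Python model of those sequences with 32-bit wrap-around (helper names are hypothetical); any new 64-bit instructions would presumably aim to shorten them.

```python
MASK32 = 0xFFFF_FFFF


def add32(a: int, b: int) -> int:
    """RV32 ADD: wraps modulo 2**32."""
    return (a + b) & MASK32


def sub32(a: int, b: int) -> int:
    """RV32 SUB: wraps modulo 2**32."""
    return (a - b) & MASK32


def sltu(a: int, b: int) -> int:
    """RV32 SLTU: 1 if a < b as unsigned 32-bit values, else 0."""
    return 1 if (a & MASK32) < (b & MASK32) else 0


def add64(a_hi: int, a_lo: int, b_hi: int, b_lo: int):
    lo = add32(a_lo, b_lo)      # add  lo, a_lo, b_lo
    carry = sltu(lo, a_lo)      # sltu carry, lo, a_lo  (result wrapped => carry out)
    hi = add32(a_hi, b_hi)      # add  hi, a_hi, b_hi
    hi = add32(hi, carry)       # add  hi, hi, carry
    return hi, lo


def sub64(a_hi: int, a_lo: int, b_hi: int, b_lo: int):
    borrow = sltu(a_lo, b_lo)   # sltu borrow, a_lo, b_lo
    lo = sub32(a_lo, b_lo)      # sub  lo, a_lo, b_lo
    hi = sub32(a_hi, b_hi)      # sub  hi, a_hi, b_hi
    hi = sub32(hi, borrow)      # sub  hi, hi, borrow
    return hi, lo


def split64(x: int):
    """Split a 64-bit value into (hi, lo) 32-bit register halves."""
    return (x >> 32) & MASK32, x & MASK32


def join64(hi: int, lo: int) -> int:
    return (hi << 32) | lo


if __name__ == "__main__":
    a, b = 0x0000_0001_FFFF_FFFF, 0x0000_0000_0000_0001
    print(hex(join64(*add64(*split64(a), *split64(b)))))  # 0x200000000
    print(hex(join64(*sub64(*split64(b), *split64(a)))))  # 0xfffffffe00000002
```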
## Works
1. Complete CPU with a 45/55-instruction ISA and ISR (plus branch prediction)
- 2 people
- ==Eric==, ==Willy==
2. L2 Cache
- 2 people
- ==Su==, ==Willy==
3. Verification
- 2 people
- ==Jacky==, ==Su==, ==Kai==
4. Burst Mode AXI Bus
- 2 people
- ==Kai==, ==Jacky==
5. Out-of-Order CPU
- 2 people
- ==Eric==, ==Anita==, ==Yao==
6. FreeRTOS
- 2 people
- ==Yao==, ==Anita==
## DISCUSSION
- Floating-point
- Out-of-order CPU and bus
- Branch prediction
- Bootable OS
- FPGA
- L2/L3 Cache
- Application (IM + DM < 320 KB)
- MMU
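## PPT Outline
Topic: dual-core RISC-V processor
1. Team introduction
* Photos
2. Division of work
* Building on the CPU from Homework 1, 2, and 3,
* the overall architecture already includes an AXI bus, L1 cache, and DMA
* So the hardware part of our final CPU is to fill in the hardware for the remaining required features
* The scope of the hardware work is flexible and can extend to more features
* e.g., supporting more ISA instructions and ISR, and implementing branch prediction, MMU, out-of-order execution, floating point, and DMA
* L2/L3 cache
* Solve the cache coherency problem
* Verification
* We are less familiar with this part, so while the hardware is still unfinished, the verification team can first learn the verification programs and procedures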
4. Motivation, application background, technical overview
* Goal: implement a LeNet program | boot an RTOS
6. Specifications & rationale for major features
* To run the LeNet program, the CPU must support at least the following specifications
* Input/output data bandwidth: 32 bits
* NN model: LeNet
* FPS: 1
* Operating speed: 75 MHz
* Key lengths of encryption:
* Throughput:
* Security features: invalid opcode, invalid register operation
7. Key result
* Successfully ran the LeNet program to completion, and the final prediction matched the expected answer
9. Extras
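
The 75 MHz and 1 FPS figures in item 6 translate directly into a cycle budget for one LeNet inference. A back-of-the-envelope sketch; the MAC count below is a hypothetical placeholder, not a measured LeNet number, and the model assumes MACs dominate the runtime.

```python
CLOCK_HZ = 75_000_000   # operating speed target from the spec list
TARGET_FPS = 1          # one LeNet inference per second


def cycles_per_frame(clock_hz: float, fps: float) -> float:
    """Total clock cycles available for one inference."""
    return clock_hz / fps


def cycles_per_mac(clock_hz: float, fps: float, macs_per_frame: int) -> float:
    """Average cycles each multiply-accumulate may take if MACs dominate runtime."""
    return cycles_per_frame(clock_hz, fps) / macs_per_frame


if __name__ == "__main__":
    HYPOTHETICAL_MACS = 500_000   # placeholder, substitute the real layer sizes
    print(cycles_per_frame(CLOCK_HZ, TARGET_FPS))
    print(cycles_per_mac(CLOCK_HZ, TARGET_FPS, HYPOTHETICAL_MACS))
```

With 75 million cycles per frame and the placeholder 500,000 MACs, each MAC could take 150 cycles on average; plugging in the actual LeNet layer dimensions gives the real slack.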
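## Task Breakdown
* Evaluation of 10 additional ISA instructions (期開)
* Conclusion: MUL, CSR, FENCE (11 in total); add others later as needed
* Survey of the remaining RV32I instructions: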
- [x] multiprocessor -> FENCE
* RISC-V Weak Memory Ordering (RVWMO) model
* Used to guarantee the execution order of memory operations
- [ ] prefetch buffer -> FENCE.I
* Handles the instruction coherence problem
- [ ] system call -> ECALL
* We are not planning to run an OS, so this can probably be skipped for now
- [ ] Debugging mode -> EBREAK
* Not needed either, since we are not building a debugger
- [x] CSR Instructions
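- There are only these 6 CSR instructions. Implementing them should not be hard, since we only need to follow their defined atomic behavior; the trickier part is deciding which CSRs we will actually use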
- [x] CSRRW
- [x] CSRRS
- [x] CSRRC
- [x] CSRRWI
- [x] CSRRSI
- [x] CSRRCI
* Some of the more important CSRs:
* ==mtvec, mcause, mtval, mepc, mstatus, mie, mip==
* ![](https://i.imgur.com/jg4SFBD.png)
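
The "atomic behavior" of the CSR instructions boils down to reading the old value and writing a new one, plus two special cases from the RISC-V spec: CSRRW(I) with rd = x0 must not read the CSR, and CSRRS(I)/CSRRC(I) with rs1 = x0 (or uimm = 0) must not write it. A minimal Python reference model, assuming a single hart (so atomicity is trivial) and no privilege or access checks; `Hart` and `exec_csr` are hypothetical names.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFF_FFFF
MSTATUS, MTVEC = 0x300, 0x305   # standard machine-mode CSR addresses


@dataclass
class Hart:
    regs: list = field(default_factory=lambda: [0] * 32)   # x0..x31
    csrs: dict = field(default_factory=dict)               # CSR address -> value

    def read_reg(self, i: int) -> int:
        return 0 if i == 0 else self.regs[i]               # x0 is hard-wired to 0

    def write_reg(self, i: int, value: int) -> None:
        if i != 0:
            self.regs[i] = value & MASK32


def exec_csr(hart: Hart, op: str, rd: int, csr: int, src: int) -> None:
    """op is one of CSRRW/CSRRS/CSRRC/CSRRWI/CSRRSI/CSRRCI.
    src is the rs1 index for register forms, or the 5-bit uimm for the *I forms."""
    imm_form = op.endswith("I")
    base = op[:-1] if imm_form else op                     # CSRRW, CSRRS or CSRRC
    operand = (src & 0x1F) if imm_form else hart.read_reg(src)
    do_read = not (base == "CSRRW" and rd == 0)            # csrw-style: no CSR read
    do_write = base == "CSRRW" or src != 0                 # rs1 = x0 / uimm = 0: no write
    old = hart.csrs.get(csr, 0) if do_read else 0
    if do_write:
        if base == "CSRRW":
            new = operand
        elif base == "CSRRS":
            new = old | operand                            # set bits
        else:
            new = old & ~operand                           # clear bits
        hart.csrs[csr] = new & MASK32
    if do_read:
        hart.write_reg(rd, old)                            # rd gets the *old* value


if __name__ == "__main__":
    hart = Hart()
    hart.regs[6] = 0x8000_0100
    exec_csr(hart, "CSRRW", 0, MTVEC, 6)     # csrw  mtvec, x6
    exec_csr(hart, "CSRRSI", 0, MSTATUS, 8)  # csrsi mstatus, 8  (set MIE, bit 3)
    exec_csr(hart, "CSRRS", 10, MSTATUS, 0)  # csrr  x10, mstatus (read only)
    print(hex(hart.regs[10]))                # 0x8
```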
* RV32M extension evaluation:
- [x] Multiplication 32I
* For convolution we only support basic 32-bit integer (32I) MUL operations, so the 32-bit floating-point (32F) LeNet inputs need to be quantized
- [x] MUL
- [x] MULH
- [x] MULHSU
- [x] MULHU
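
A Python reference model of the four multiply instructions above, plus a minimal sketch of why integer MUL is enough once inputs are quantized. The quantization scheme here (symmetric, per-tensor, int8) is an assumed choice, since the note does not fix one; all function names are hypothetical.

```python
MASK32 = 0xFFFF_FFFF


def to_signed(x: int) -> int:
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x8000_0000 else x


def mul(a: int, b: int) -> int:      # low 32 bits (same for signed/unsigned)
    return (to_signed(a) * to_signed(b)) & MASK32


def mulh(a: int, b: int) -> int:     # high 32 bits, signed x signed
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK32


def mulhsu(a: int, b: int) -> int:   # high 32 bits, signed rs1 x unsigned rs2
    return ((to_signed(a) * (b & MASK32)) >> 32) & MASK32


def mulhu(a: int, b: int) -> int:    # high 32 bits, unsigned x unsigned
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32


def quantize(values, bits=8):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = (1 << (bits - 1)) - 1                     # 127 for int8
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_dot(xs, ws) -> int:
    """Dot product using only MUL and a wrapping 32-bit accumulator."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = (acc + mul(x, w)) & MASK32
    return to_signed(acc)


if __name__ == "__main__":
    x = [0.5, -1.0, 0.25]
    w = [1.0, 0.5, -2.0]
    qx, sx = quantize(x)
    qw, sw = quantize(w)
    approx = int_dot(qx, qw) * sx * sw               # rescale once at the end
    exact = sum(a * b for a, b in zip(x, w))
    print(f"float {exact:.4f} vs quantized {approx:.4f}")
```

The design point: every per-element product is an integer MUL feeding a 32-bit accumulator, and the floating-point scale factors are applied only once per output, outside the inner loop.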
- [ ] Division 32I
* Given the area constraint and the target application, we will not support division instructions for now
- [ ] DIV
- [ ] DIVU
- [ ] REM
- [ ] REMU
* Interrupt Service Routine (Cheng You)
* Interrupt types
* Internal interrupt, or exception
* Caused inside CPU
* EX: stack overflow, illegal instruction, division by zero, ...
* The interrupt vector number is fixed and known to the CPU
* External interrupt
* Triggered by peripheral devices
* The interrupt vector number is provided by the hardware or the PIC (programmable interrupt controller)
![](https://i.imgur.com/jksefFY.png)
* Software interrupt
* Triggered when a user program needs an OS service
* EX: system call, trap
* x86 Interrupt Process
![](https://i.imgur.com/QS792ZI.png)
![](https://i.imgur.com/lXlkpPb.png)
* x86 Interrupt Vector Table
![](https://i.imgur.com/kzez1e6.png)
* Reference
* [Interrupt Concepts](http://www.csie.ntnu.edu.tw/~swanky/os/chap2.htm)
* [Interrupt Vector](https://www.sciencedirect.com/topics/engineering/interrupt-vector)
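
The material above uses x86 as the example. On RISC-V, the machine-mode CSRs listed earlier (mtvec, mcause, mtval, mepc, mstatus) carry the same ideas. Below is a simplified Python sketch of trap entry and MRET following the RISC-V privileged specification, assuming machine mode only, no delegation, and ignoring MPP; the class and method names are hypothetical. In vectored mode an interrupt jumps to `BASE + 4 * cause`, which plays the role of the interrupt vector table, while exceptions always go to `BASE`.

```python
from dataclasses import dataclass

MSTATUS_MIE = 1 << 3      # global machine interrupt enable
MSTATUS_MPIE = 1 << 7     # previous MIE, saved on trap entry
INTERRUPT_BIT = 1 << 31   # mcause MSB on RV32: 1 = interrupt, 0 = exception

CAUSE_ILLEGAL_INSTRUCTION = 2      # exception code
CAUSE_M_TIMER_INTERRUPT = 7        # interrupt code
CAUSE_M_EXTERNAL_INTERRUPT = 11    # interrupt code


@dataclass
class MachineState:
    pc: int = 0
    mstatus: int = 0
    mtvec: int = 0
    mepc: int = 0
    mcause: int = 0
    mtval: int = 0

    def take_trap(self, cause: int, is_interrupt: bool, tval: int = 0) -> None:
        self.mepc = self.pc                                  # where to resume
        self.mcause = (INTERRUPT_BIT if is_interrupt else 0) | cause
        self.mtval = tval                                    # e.g. faulting instruction bits
        # MPIE <- MIE, then MIE <- 0 so the handler starts with interrupts off
        mie = self.mstatus & MSTATUS_MIE
        self.mstatus &= ~(MSTATUS_MIE | MSTATUS_MPIE)
        if mie:
            self.mstatus |= MSTATUS_MPIE
        base, mode = self.mtvec & ~0x3, self.mtvec & 0x3
        if mode == 1 and is_interrupt:                       # vectored mode
            self.pc = base + 4 * cause
        else:                                                # direct mode, or any exception
            self.pc = base

    def mret(self) -> None:
        # MIE <- MPIE, MPIE <- 1, then return to mepc
        if self.mstatus & MSTATUS_MPIE:
            self.mstatus |= MSTATUS_MIE
        else:
            self.mstatus &= ~MSTATUS_MIE
        self.mstatus |= MSTATUS_MPIE
        self.pc = self.mepc


if __name__ == "__main__":
    cpu = MachineState(pc=0x100, mstatus=MSTATUS_MIE, mtvec=0x2000 | 1)
    cpu.take_trap(CAUSE_M_EXTERNAL_INTERRUPT, is_interrupt=True)
    print(hex(cpu.pc), hex(cpu.mcause))   # handler at 0x2000 + 4 * 11
    cpu.mret()
    print(hex(cpu.pc))                    # back at the interrupted instruction
```

Deciding whether an interrupt is taken at all (mstatus.MIE together with the mie/mip bits) is left out of this sketch.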
* Branch Prediction (Cheng You)
* Static Branch Prediction
* Follow a pre-defined mechanism designed in hardware
* Dynamic Branch Prediction
* Use taken/not-taken information gathered at run time to predict
* Saturating counter
* 2-bit, four-state machine
* The predictor table is indexed by the instruction address
* Very large bimodal predictors saturate at 93.5% correct on SPEC'89 benchmarks
![](https://i.imgur.com/sG1aa6s.png)
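
A minimal behavioral sketch (Python, not RTL) of the bimodal predictor: a table of 2-bit saturating counters indexed by the branch PC. It assumes a 1024-entry table, 4-byte-aligned instructions (no compressed extension), and an initial "weakly not taken" state; the class name is hypothetical.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch PC.
    States 0,1 predict not taken; 2,3 predict taken."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)     # start in "weakly not taken"

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask                # drop the 2 always-zero PC bits

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


if __name__ == "__main__":
    bp = BimodalPredictor()
    misses = 0
    for _ in range(3):                              # a loop branch run three times
        for taken in [True] * 9 + [False]:          # 9 iterations, then loop exit
            misses += bp.predict(0x400) != taken
            bp.update(0x400, taken)
    print("mispredictions:", misses)
```

The 2-bit hysteresis is the point: after warm-up the loop-exit branch costs one misprediction per loop rather than two, because a single not-taken outcome only moves the counter from "strongly" to "weakly" taken.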
* Two-level Predictor/Correlation-Based Branch Predictor
* Two-level Adaptive Predictor (1991)
* An n-bit branch history register, giving 2^n^ possible history patterns
* Uses a "Pattern History Table" to store a saturating counter for each history pattern
* Quickly learns to predict an arbitrary repetitive pattern
![](https://i.imgur.com/BRBTPWM.png)
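
To make "quickly learns an arbitrary repetitive pattern" concrete, here is a behavioral sketch for a single branch: an n-bit history register selects one 2-bit counter out of 2^n^ in the pattern history table. The 2-bit history length, initial states, and class name are assumptions for illustration.

```python
class TwoLevelPredictor:
    """n-bit branch history register indexing a pattern history table (PHT)
    of 2-bit saturating counters, one counter per history pattern."""

    def __init__(self, history_bits: int = 2):
        self.history_bits = history_bits
        self.history = 0                            # most recent outcome in bit 0
        self.pht = [1] * (1 << history_bits)        # 2**n counters, weakly not taken

    def predict(self) -> bool:
        return self.pht[self.history] >= 2

    def update(self, taken: bool) -> None:
        c = self.pht[self.history]
        self.pht[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


if __name__ == "__main__":
    p = TwoLevelPredictor(history_bits=2)
    misses = []
    for taken in [True, True, False] * 20:          # repeating T, T, N pattern
        misses.append(p.predict() != taken)
        p.update(taken)
    print("total misses:", sum(misses), "after warm-up:", sum(misses[6:]))
```

With 2 history bits, each of the steady-state histories (TT, TN, NT) is always followed by the same outcome, so after a few mispredictions during warm-up the pattern is predicted perfectly. On the same pattern a single 2-bit counter keeps missing every third outcome.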
* Local Branch Prediction
* A separate history buffer for each conditional jump instruction
* The pattern history table may be separate as well, or shared between all conditional jumps
* EX: Intel Pentium MMX, Pentium II, and Pentium III
* Very large local predictors saturate at 97.1% correct on SPEC'89 benchmarks
* Global Branch Prediction
* A single history buffer shared by all conditional jumps
* EX: AMD, Intel Pentium M/Core/Core 2
* Very large gshare predictors can reach 96.6% accuracy on SPEC'89 benchmarks
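
A minimal behavioral sketch of gshare: the global history register is XORed with the branch address to pick one 2-bit counter in a shared table. It assumes a 10-bit history, 4-byte-aligned branch addresses, and history updated only when a branch resolves (no speculative update); names are hypothetical.

```python
class GsharePredictor:
    """Global history XOR branch PC indexes one shared table of 2-bit counters."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                                # global history, newest outcome in bit 0
        self.pht = [1] * (1 << index_bits)          # weakly not taken

    def _index(self, pc: int) -> int:
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc: int) -> bool:
        return self.pht[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


if __name__ == "__main__":
    bp = GsharePredictor()
    misses = []
    for i in range(50):
        a_taken = i % 2 == 0                        # branch A alternates T, N, T, N, ...
        for pc in (0x100, 0x104):                   # branch B repeats A's outcome
            misses.append(bp.predict(pc) != a_taken)
            bp.update(pc, a_taken)
    print("misses in the last 40 branches:", sum(misses[-40:]))
```

Branch B only looks random in isolation: a per-PC 2-bit counter initialized to weakly not taken would mispredict its alternating outcome every time. With shared global history, B's outcome is determined by A's just-recorded outcome, so after warm-up both branches are predicted correctly.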