# Rework Raster I as a compact and efficient GPU in Chisel
> 吳承晉
> [Github](https://github.com/jin11109/raster-i-chisel)
> [Video](https://youtu.be/IcLdYcLvVrQ)
## Goals
* Architecturally, [Raster I](https://github.com/raster-gpu/raster-i) includes a multi-cycle vertex transformation unit, eight parallel interpolation pipelines, and a deferred shading pipeline implementing the Phong shading model using Q11.13 fixed-point arithmetic. The proposed task is to replace the Xilinx Vitis HLS graphics pipeline with functionally equivalent Chisel modules and use Verilator as the unified simulation and verification backend.
* This is technically realistic because the HLS kernel is a fixed-function, fixed-point streaming pipeline (vertex processing, interpolation, shading) that maps well to hand-written Chisel: its internal interfaces are already AXI-style streams, and its behavior is largely feed-forward.
* This requires:
* Re-expressing the HLS C++ kernel (rasterization, G-buffer writes, shading) as explicit Chisel pipelines.
* Preserving AXI/AXI-Stream protocols so the existing system-level integration and memory controller remain usable.
* Re-establishing timing closure and acceptable resource usage
* The main risks are engineering effort (rewriting a non-trivial HLS kernel by hand) and timing/resource regression; these can be managed by incremental replacement (stage-by-stage Chisel re-implementation) and continuous Verilator + post-synthesis checks.
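As background for the fixed-point arithmetic mentioned above, here is a minimal sketch of Q-format math in plain C++ (assuming 13 fractional bits, per the Q11.13 name; the actual Raster I types come from Xilinx's `ap_fixed` library, whose width, rounding, and saturation behavior may differ):

```cpp
#include <cassert>
#include <cstdint>

// Minimal Q-format sketch: 13 fractional bits stored in a 32-bit signed
// integer. This is an illustration, not the ap_fixed implementation.
constexpr int FRAC = 13;

int32_t to_fixed(double v) { return static_cast<int32_t>(v * (1 << FRAC)); }

int32_t fx_add(int32_t a, int32_t b) { return a + b; }

// Multiplying two Q-format values doubles the fractional bits, so the
// 64-bit product must be shifted right by FRAC to renormalize.
int32_t fx_mul(int32_t a, int32_t b) {
    return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> FRAC);
}
```

This is why fixed-point pipelines are cheap in hardware: every operation is integer arithmetic plus constant shifts.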
## Overview
- Introduce and analyze the original project Raster I
- [Structure of Raster I](#Structure-of-Raster-I)
- Build environment and run a minimum viable Raster I
- [1. Build Testbench (C Simulation)](#1-Build-Testbench-C-Simulation)
- [2. Simple C/Chisel Cosimulation](#2-Simple-CChisel-Cosimulation)
- [3. Chisel Simulation: Merging the Raster I System](#3-Chisel-Simulation-Merging-the-Raster-I-System)
- Replace the HLS renderer with Chisel modules stage-by-stage:
- [Renderer: Core Logic](#Renderer-Core-Logic)
- [4. Renderer: Geometry Stage](#4-Renderer-Geometry-Stage)
- [4.1 Renderer: Geometry Stage Improvements and Simulation (continued)](#41-Renderer-Geometry-Stage-Improve-and-Simulaiton-continue)
- Reproduce Raster I
- [Build Raster I](#Build-Raster-I)
- Usage and Debugging for the Latest Version of the Chisel Raster I
- Usage: See [quick start](https://github.com/jin11109/raster-i-chisel#quick-start)
- Verification: See [Verification](https://github.com/jin11109/raster-i-chisel#verification)
- Debugging Methodology:
To test our Chisel Raster I, run:
```shell
# in verification
$ make cosim
```
If an error message occurs, use the trace file `trace.vcd` to locate the key signals. For example:
- If an error occurs in a submodule of the geometry stage, we need to find when that submodule starts processing; the `start` and `done` signals are the key signals for locating this window.
- If an error occurs in the framebuffer, we need to find when the renderer starts interacting with the framebuffer; this can be located via the key signals of the state machine in the renderer.
### Progress Table
Replace the HLS renderer with Chisel modules stage-by-stage:
|Stage|State|
|---|---|
|Geometry|partially implemented|
|Iterate over every tile|not started|
|Render Triangles|not started|
|Deferred Shading|not started|
|Flush Tiles to FB|not started|
## Structure of Raster I
> I used ChatGPT to help me understand the structure of Raster I
The project is split across two implementation domains:
- A system-level (`system`) infrastructure described in **Chisel (Scala)** — this handles aspects such as clock domains, the AXI interconnect, memory arbitration, the framebuffer swapper, the VGA/DRAM interface, etc.
- The graphics pipeline (`renderer`) itself (vertex transform, rasterization / interpolation, shading), implemented in **Xilinx Vitis HLS (C/C++)**.
>[!Important] Correlation between `system`/`renderer` and our goals
> In the earlier phase of this project, our focus should remain solely on the `renderer` subsystem. The `renderer` is architecturally self-contained and can be verified independently using its dedicated testbench. The system layer primarily defines the high-level integration logic and is responsible for producing the FPGA bitstream, but it does not affect functional validation of the `renderer` itself. Therefore, system-level integration can be deferred to the final stage, while the core development and verification effort concentrates on rewriting and validating the `renderer` pipeline in Chisel.
### System-level Infrastructure `system/`
3 separate clock domains: **system**, **graphics**, and **display** (their frequencies are currently 100 MHz, 100 MHz, and 65 MHz)
```
system/
├── src/main/scala/ # Chisel hardware modules
│ ├── core/ # The top-level SoC module in Chisel
│ ├── display/ # Reading the framebuffer from DRAM, applying
│ │ effects like dithering and presenting it
│ │ onto the screen synchronously.
│ ├── renderer/ # Traditional rendering tasks
│ └── utils/ # Utilities used across modules:
│ Fixed-point helpers, Synchronizers between
│ clock domains, AXI helper constructs
├── generated/ # SystemVerilog generated by Chisel
├── vivado/ # Vivado TCL scripts & board integration
└── build.tcl/           # Used to generate the final bitstream
```
#### Core (system)
- `Trinity.scala`:
>`system/src/main/scala/core/Trinity.scala`
This code defines the Top-Level Module of the hardware design, which acts like a motherboard. It instantiates all the major subsystems (Graphics Card, Monitor Controller, Memory Controller, Clock Generator) and wires them together. It also handles the tricky business of managing different "Time Zones" (Clock Domains) within the chip.
- Clock Wizard (`clkWiz`)
Generates specific frequencies needed for the design:
- `clkDisplay`
- `clkGraphics`
- Memory Subsystem
Connect the physical top-level DDR3 pins to the VRAM controller
- Clock Domain Crossing
Decides which buffer is being read and which is being written (run on the system clock).
- Display Subsystem (Read Domain)
- Graphics Subsystem (Write Domain)
- **Data Flow**
1. Renderer (fast clock) draws a frame into Buffer A.
2. Display (slow clock) reads Buffer B and sends it to the Monitor.
3. Renderer finishes and asserts `graphicsDone`.
4. Display finishes scanning the screen and asserts `displayVsync`.
5. Swapper sees both are done, flips the switch.
6. Renderer starts drawing into Buffer B; Display starts showing Buffer A.
- `Fb.scala`:
>`system/src/main/scala/core/Fb.scala`
Defines the memory infrastructure for a graphics system, likely running on an FPGA.
- DDR3 Interface (`class Ddr3Ext`):
The physical connections to external RAM.
- VRAM Controller (`Vram`, `vram`, `class Vram`):
A wrapper around a BlackBox IP (likely a Memory Controller Interface) that provides AXI ports for reading and writing.
- Framebuffer:
- Framebuffer Geometry (`FbRGB`, `Fb`)
Calculates how pixels are mapped into the memory's 128-bit data words.
- Double Buffering (`FbSwapper`)
This module implements Double Buffering to ensure smooth animation.
- Framebuffer Reader (`FbReader`)
This module acts as a DMA (Direct Memory Access) Read engine. It fetches pixels from RAM to send to the display controller (VGA/HDMI).
- Framebuffer Writer (`FbWriter`)
This module acts as a DMA Write engine. It takes pixels generated by the graphics logic and writes them to RAM.
- **Framebuffer Data Flow**:
1. Graphics Logic sends 4 pixels to `FbWriter`.
2. `FbWriter` packs them into 128 bits and writes to DDR3 via AXI Write.
3. `FbSwapper` waits for the frame to finish, then flips the buffer IDs.
4. `FbReader` reads 128 bits from the new buffer via AXI Read.
5. `FbReader` unpacks the 128 bits back into 4 pixels and sends them to the Display.
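The buffer-flip condition described in the data flows above (`FbSwapper` waits until both the renderer and the display are done, then flips the buffer IDs) can be modeled behaviorally in C++ (a sketch of the described logic, not the actual `FbSwapper` source):

```cpp
#include <cassert>
#include <utility>

// Behavioral sketch of the double-buffer swapper: latch graphicsDone and
// displayVsync, flip the buffer IDs only once both have been observed,
// then clear the latches for the next frame.
struct FbSwapperModel {
    int writeBuf = 0, readBuf = 1;   // renderer writes A (0), display reads B (1)
    bool gfxDone = false, vsync = false;

    void step(bool graphicsDone, bool displayVsync) {
        gfxDone |= graphicsDone;
        vsync   |= displayVsync;
        if (gfxDone && vsync) {      // both finished: flip the switch
            std::swap(writeBuf, readBuf);
            gfxDone = vsync = false;
        }
    }
};
```

Latching both signals matters because the renderer and the display finish at unrelated times in different clock domains.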
#### Display
#### Graphic (renderer)
This code defines the Graphics Engine (GPU) of the system. It wraps a specialized hardware accelerator and manages the timing and control signals required to make it draw frames.
- Hardware Interface (`class render`)
This class extends `BlackBox`, which means it is an interface to an external Verilog or VHDL file.
> [!Note] BlackBox
> In Chisel, a BlackBox is a mechanism used to instantiate hardware modules defined in external HDL (Hardware Description Language) files—typically Verilog, SystemVerilog, or VHDL—within your Chisel design.
> ```verilog
> module AluMod(
> input clk,
> input [15:0] a,
> input [15:0] b,
> output [15:0] out
> );
> // Verilog implementation...
> endmodule
> ```
> Chisel wrapper would look like this:
> ```scala
> class AluMod extends BlackBox {
> val io = IO(new Bundle {
> val clk = Input(Clock()) // Explicitly defined clock
> val a = Input(UInt(16.W))
> val b = Input(UInt(16.W))
> val out = Output(UInt(16.W))
> })
> }
> ```
- Controller Logic (`class Renderer`)
This Chisel module wraps the BlackBox and acts as a Driver. It translates the simple system events (like "swap buffers") into the precise handshake protocol required by the HLS block.
#### Utils
**AXI**
**Channels**: Unlike traditional bus systems that mix address and data, AXI separates read and write traffic into five independent highways:
- Write Address (AW)
- Write Data (W)
- Write Response (B)
- Read Address (AR)
- Read Data (R)
**Handshakes**: All channels use the `VALID`(sender active) and `READY`(receiver ready) signals.
- Data is considered successfully transferred only on a rising clock edge where both `VALID = 1` and `READY = 1`.
**Burst transmission**: This is key to AXI's performance. Sending a single address allows for the continuous transmission of multiple data entries.
- `AWLEN`: Determines how many transfers (beats) the burst contains.
- `AWSIZE`: Determines the width (Bytes) of each Transfer.
- `AWBURST`: Determines how the address is calculated (Fixed, Increment, Wrap).
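The burst parameters can be illustrated with the INCR address calculation (a sketch of the AXI rule: `AWLEN` encodes beats minus one, `AWSIZE` encodes log2 of the bytes per beat, and each beat's address advances by the beat size; FIXED and WRAP bursts follow different rules):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Compute the per-beat addresses of an AXI INCR write burst.
std::vector<uint64_t> incr_burst_addrs(uint64_t awaddr, uint8_t awlen,
                                       uint8_t awsize) {
    uint64_t bytes_per_beat = 1ull << awsize;   // AWSIZE = log2(bytes)
    std::vector<uint64_t> addrs;
    for (uint32_t beat = 0; beat <= awlen; ++beat)  // AWLEN = beats - 1
        addrs.push_back(awaddr + beat * bytes_per_beat);
    return addrs;
}
```

For the 128-bit framebuffer bus, `AWSIZE = 4` (16 bytes per beat), so a 4-beat burst starting at `0x1000` touches `0x1000`, `0x1010`, `0x1020`, `0x1030`.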
### Graphics Pipeline `renderer/`
This section is designed specifically for [Xilinx](https://en.wikipedia.org/wiki/Xilinx) HLS (High-Level Synthesis). The HLS compiler takes this C++ code and synthesizes it into the Verilog hardware logic (the "BlackBox") that is instantiated in the Chisel code as the renderer.
```
renderer/
├── src/ # HLS kernel (core GPU pipeline)
│ ├── tb.cpp # HLS software test benches
│ ├── top.cpp
│ ...
├── include/ # C++ headers for geometry & util
└── hls_config.cfg # Synthesis parameters
```
### Testbench
> `src/tb.cpp`
It uses the [SDL2](https://www.libsdl.org/) (Simple DirectMedia Layer) library to visualize the memory output of the hardware function in real-time. Essentially, it simulates a graphics card: allocating "Video RAM," asking the hardware function to draw into it, and then displaying that memory on the PC screen.
The steps in this testbench:
1. Initialize SDL and open a window to serve as the display end of the framebuffer.
2. Configure a simulated VRAM (128-bit width, 256MB space).
3. The call is made once per frame:
```c++
trinity_renderer(fb_id, vram, (SDL_GetTicks() / 14) % 360)
```
- `fb_id`: Intended to handle double-buffering
- `(SDL_GetTicks() / 14) % 360`: This generates an animating angle input. It takes the system time, scales it, and wraps it around 0–359 degrees.
4. Take the pixel data written to VRAM by the renderer and update it to the SDL texture.
5. Display the final screen in the viewport and calculate the FPS.
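The angle expression in step 3 can be restated as a standalone helper (plain C++; `SDL_GetTicks` returns milliseconds since SDL initialization):

```cpp
#include <cassert>
#include <cstdint>

// Restatement of the testbench's animation expression: one degree per
// 14 ms of wall time, wrapped into the range 0..359.
uint32_t anim_angle(uint32_t ticks_ms) { return (ticks_ms / 14) % 360; }
```

So a full rotation takes 14 × 360 ≈ 5 seconds of wall time, independent of the frame rate.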
> [!Important] Necessity of Building Raster I
> Since a testbench already exists that verifies the renderer in software, [generating a bitstream for the FPGA board](#Generate-Bitstream-for-FPGA-Broad) is not the first order of business.
> [!Important] First job
> In order to reuse this testbench, we should first find out how this simulation is executed, and then build it with Verilator instead of Vitis.
> > See [C simulation](#C-Simulation) in the Build Raster I section, and [1. Build Testbench](#1-Build-Testbench)
### Testbench core: `trinity_renderer`
> `src/top.cpp`
```c++
#ifdef __SYNTHESIS__
void trinity_renderer(fb_id_t fb_id, hls::burst_maxi<ap_uint<128>> vram,
ap_uint<9> angle)
#else
void trinity_renderer(fb_id_t fb_id, ap_uint<128> *vram, ap_uint<9> angle)
#endif
```
This function supports two different testing modes:
- HLS mode:
`vram` is an `hls::burst_maxi<ap_uint<128>>`, an AXI burst memory interface used to directly access DDR or the framebuffer. This path is selected by the `__SYNTHESIS__` block.
- Software mode:
`vram` is a plain `ap_uint<128>*` pointer to a memory array that simulates the framebuffer. This path uses the `#else` block.
## 1. Build Testbench (C Simulation)
> [Main commit](https://github.com/jin11109/raster-i-renderer/commit/a90119fed2dda6226f1e76d022596a13a16a6ec6): Enable standalone C sim without Vitis IDE
> [Latest commit](https://github.com/jin11109/raster-i-renderer/commit/3d6e43b42a345eaf503c1aa616402cd396df4cc3): The latest executable commit for this project stage
In Vitis IDE, C Simulation runs the C model entirely in software and serves as the golden reference for Raster I. Similar to C/RTL co-simulation, my goal is to build a standalone testbench outside of Vitis IDE and use it as the golden test for my Chisel implementation.
After successfully fixing and running the [C simulation](#C-Simulation), I wanted to decouple the testbench from Vitis IDE. We only require this software model to validate the Chisel renderer, and Vitis IDE is too heavy for this purpose (its installation size exceeds 80 GB). Our long-term objective is to run tests with Verilator, except where we need to generate a bitstream for final-stage hardware validation.
To separate the renderer from Vitis IDE, I applied the following modifications:
- Fixed-Point Types
The design uses fixed-point types such as `ap_fixed<24, 13>`, which are defined in Xilinx’s proprietary libraries and automatically included when using Vitis.
To eliminate this dependency, I migrated to Xilinx’s open-source [HLS_arbitrary_Precision_Types](https://github.com/Xilinx/HLS_arbitrary_Precision_Types) library and added a Makefile target `make download_types` to automatically download and unpack the headers into the expected path.
- Framebuffer
The framebuffer is not required for C Simulation. To avoid compilation errors caused by the use of `hls::burst_maxi`, I wrapped the related code in `fb.cpp` with the `__SYNTHESIS__` macro.
- sqrt Replacement
In `deferred_shading` inside `top.cpp`, I replaced `hls::sqrt` with the standard `std::sqrt` during software simulation:
```c++=237
#ifdef __SYNTHESIS__
fixed len = hls::sqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
#else
fixed sq_norm = dir.x * dir.x + dir.y * dir.y + dir.z * dir.z;
fixed len = (fixed)std::sqrt((double)sq_norm);
#endif
```
However, one instance of `hls::sqrt` cannot be replaced perfectly. `hls::sqrt` may implement a hardware-oriented approximation (e.g., LUT-based), so using `std::sqrt` will not bit-match the hardware behavior. The simulation will run, but the numerical results may differ slightly.
:::warning
If the testbench compares the C model and Chisel model by exact equality, you will either need to:
- reimplement sqrt with the same behavior as hls::sqrt, or
- introduce a small tolerance in the comparison.
:::
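As an illustration of the second option, a tolerance-based comparison might look like this (hypothetical helper; the 2-step tolerance below is an assumption, not a measured bound on the `hls::sqrt` error):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical compare for the sqrt mismatch: treat two raw fixed-point
// values as equal when they differ by at most `tol_ulps` quantization
// steps. The default of 2 is an assumption and would need tuning against
// the actual hls::sqrt approximation error.
bool fixed_close(int32_t a, int32_t b, int32_t tol_ulps = 2) {
    return std::abs(a - b) <= tol_ulps;
}
```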
After applying these modifications, the renderer runs without requiring Vitis IDE.
However, several warnings remain and must be addressed.
- Bitsize mismatch
```
WARNING: Bitsize mismatch for ap_[u]int|=ap_[u]int.
```
Using GDB, I set a breakpoint at `_IO_new_file_write` (triggered when printing warnings) and traced the call stack. The root cause was in `render_triangle()` inside `top.cpp`:
```c++=167
if (bary00.x >= 0 && bary00.y >= 0 && bary00.z >= 0 &&
z00 <= zbuf[y * FB_TILE_WIDTH + x][0]) {
zbuf[y * FB_TILE_WIDTH + x][0] = z00;
// original
// we |= 1;
// fix by
we |= (ap_uint<FB_SAMPLES_PER_PIXEL>)1;
}
```
The fix ensures the constant literal matches the expected bit width.
- Assign NaN to fixed point value
```
WARNING: assign NaN to fixed point value
```
Using the same GDB approach, I identified the warning in `deferred_shading()`.
It occurs when shading a pixel that was never written by `render_triangle`, meaning the normal vector remains `(0, 0, 0)`, which later causes invalid normalization.
I fixed this by explicitly skipping background pixels:
```c++=224
if (n.x == 0 && n.y == 0 && n.z == 0) {
continue;
}
```
This is safe because an untouched pixel indicates that no triangle covers the corresponding location, and its initial values remain valid.
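The failure mode is easy to reproduce in plain floating point: normalizing the untouched `(0, 0, 0)` normal divides by a zero length, which is exactly what the skip above prevents:

```cpp
#include <cassert>
#include <cmath>

// Demonstrates why untouched pixels must be skipped: the default (0, 0, 0)
// normal has zero length, so normalization divides 0 by 0 and yields NaN.
struct Vec3 { double x, y, z; };

Vec3 normalize(Vec3 v) {
    double len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / len, v.y / len, v.z / len};   // len == 0 -> NaN components
}
```

In the fixed-point pipeline the same division produces the "assign NaN to fixed point value" warning instead of a silent NaN.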
:::warning
Although the simulation now runs successfully, its output is currently grayscale. I will proceed with the Chisel implementation first and treat color as a future patch.
:::
## 2. Simple C/Chisel Cosimulation
> [Main commit](https://github.com/jin11109/raster-i-renderer/commit/eec6e955e077e79d49178977c8731c6acc2ee679): Refactor project and add Chisel renderer with cosim support
Now we have the C simulation results, which can serve as our golden reference.
However, there is currently no way to verify the correctness of our Chisel code. In this section, I will introduce a co-simulation routine that compares the results of the C simulation with those produced by the Chisel implementation.
### Build Simple Renderer in Chisel
To test the co-simulation workflow, I first implemented a simple renderer in Chisel.
This renderer only writes a fixed 128-bit value (`0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF`) to the output `mem_wdata`.
Later, in [3. Renderer: Load Rom Data](#3-Renderer:-Load-Rom-Data), I will show how the renderer reads external data such as model data or math tables.
The simple renderer exposes a template I/O interface:
```scala
val io = IO(new Bundle {
val start = Input(Bool())
val angle = Input(UInt(9.W))
val fb_id = Input(UInt(1.W))
val done = Output(Bool())
// For test
val mem_we = Output(Bool())
val mem_addr = Output(UInt(32.W))
val mem_wdata = Output(UInt(128.W))
})
```
Here, `mem_we`, `mem_addr`, and `mem_wdata` exist only for testing memory writes.
### Compile and Simulate with Verilator
The simple renderer can be compiled to Verilog using `sbt` and simulated using Verilator:
```mermaid
flowchart TD
subgraph chisel[chisel]
Renderer.scala(Renderer.scala)
end
subgraph verilog[verilog]
Renderer.sv(Renderer.sv)
end
subgraph verilator[c++]
VRenderer__ALL.a(VRenderer__ALL.a)
end
chisel --> |sbt| verilog --> |verilator| verilator
```
After building the library, we define a C++ abstraction to wrap the memory writes of the Chisel renderer and handle interaction with the Verilated circuit.
The top-level interface is defined as:
```c++
class Renderer {
public:
virtual ~Renderer() = default;
virtual void draw(fb_id_t fb_id, ap_uint<128>* vram, int angle) = 0;
};
```
The `HardwareRenderer` class inherits from `Renderer` and implements its own draw method to interact with the Verilated Chisel renderer.
```c++
void draw(fb_id_t fb_id, ap_uint<128>* vram, int angle) override {
top->io_start = 1;
top->io_angle = angle;
top->io_fb_id = fb_id;
bool done = false;
int timeout = 1000000;
while (!done && timeout > 0) {
tick();
top->io_start = 0;
// Memory Write Emulation
if (top->io_mem_we) {
uint32_t addr = top->io_mem_addr;
// Reconstruct 128-bit data from Verilator's WData array
ap_uint<128> data = 0;
data |= (ap_uint<128>)top->io_mem_wdata[0];
data |= ((ap_uint<128>)top->io_mem_wdata[1]) << 32;
data |= ((ap_uint<128>)top->io_mem_wdata[2]) << 64;
data |= ((ap_uint<128>)top->io_mem_wdata[3]) << 96;
vram[addr] = data;
}
if (top->io_done) done = true;
timeout--;
}
}
```
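The word-by-word reconstruction above is needed because Verilator exposes ports wider than 64 bits as arrays of 32-bit words (`WData`). Without `ap_uint`, the same idea looks like this (modeling the 128-bit value as two 64-bit halves):

```cpp
#include <cassert>
#include <cstdint>

// Verilator represents >64-bit ports as arrays of 32-bit words (WData),
// with word 0 holding the least significant bits. Reassemble four such
// words into a 128-bit value, modeled here as two 64-bit halves.
struct U128 { uint64_t lo, hi; };   // lo = bits 63..0, hi = bits 127..64

U128 from_wdata(const uint32_t w[4]) {
    U128 v;
    v.lo = static_cast<uint64_t>(w[0]) | (static_cast<uint64_t>(w[1]) << 32);
    v.hi = static_cast<uint64_t>(w[2]) | (static_cast<uint64_t>(w[3]) << 32);
    return v;
}
```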
### Co-simulation and Verification
Once we have a simple hardware renderer and the logic to trigger it, we can wrap the golden reference (inherited from Renderer) and compare the framebuffer contents.
Specifically, `SoftwareRenderer` and `HardwareRenderer` are constructed and executed via the `draw` method in separate threads. The comparison is performed every 16 bytes across the framebuffer, and a total of 5 frames are verified.
If any mismatch occurs, the verification routine prints the first five 16-byte mismatches and calculates the total error count:
```
# in verification
$ make cosim
[COSIM] Starting Co-simulation...
[COSIM]--- Processing Frame 0 (Angle: 0) ---
[Mismatch] offset: 0x0 | HW: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF | SW: 0
[Mismatch] offset: 0x10 | HW: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF | SW: 0
[Mismatch] offset: 0x20 | HW: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF | SW: 0
[Mismatch] offset: 0x30 | HW: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF | SW: 0
[Mismatch] offset: 0x40 | HW: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF | SW: 0
[Mismatch] ...
[COSIM] Verification FAILED! Total errors: 22990
```
This section allows us to automatically validate that the hardware implementation behaves consistently with the software golden reference, providing a robust co-simulation framework for renderer verification.
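A minimal version of that comparison routine might look like this (a sketch; the real harness also reconstructs 128-bit values and runs both renderers before comparing):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Sketch of the cosim check: compare two framebuffers 16 bytes (one
// 128-bit word) at a time, report the first few mismatching offsets,
// and return the total error count.
int compare_fb(const uint8_t* hw, const uint8_t* sw, size_t bytes,
               int max_report = 5) {
    int errors = 0;
    for (size_t off = 0; off < bytes; off += 16) {
        if (std::memcmp(hw + off, sw + off, 16) != 0) {
            if (errors < max_report)
                std::printf("[Mismatch] offset: 0x%zx\n", off);
            ++errors;
        }
    }
    return errors;
}
```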
## 3. Chisel Simulation: Merging the Raster I System
> [Main commit](https://github.com/jin11109/raster-i-chisel/commit/1bf6dfdd772311b543332410e9061ea4f0a4b65d): Refactor Renderer and implement Vram model
> [Enable VCD tracing](https://github.com/jin11109/raster-i-chisel/commit/9275c3185c83866ede0d1fbbca9f366889a1b76f): Enable VCD tracing in Verilator simulation
> [Latest commit](https://github.com/jin11109/raster-i-chisel/commit/047a733cc83a2b849e7dce87c796c67935974ede): The latest executable commit for this project stage
In section [2. Simple C/Chisel Cosimulation](#2-Simple-C/Chisel-Cosimulation) we utilized a rudimentary Chisel renderer that simply output a fixed value to a wire. While this helped verify the co-simulation workflow, it is impractical for actual hardware design. A realistic renderer must adhere to hardware protocols, specifically using the AXI protocol to write processed pixel data into VRAM.
Therefore, in this section, we reformulate the verification strategy by merging the original Raster I system architecture into our simulation environment.
### The Challenge: Handling BlackBoxes
The original Raster I project relies on Xilinx-specific IPs (BlackBoxes) to define interfaces for undefined circuits, such as the MIG (Memory Interface Generator) for DDR3 and the Clock Wizard. However, Verilator cannot simulate these vendor-specific, encrypted components. To solve this, we employ two strategies:
- Simulation Stubs: For infrastructure components like `clk_wiz` and `proc_sys_rst`, we created Verilog stubs (`sim_stubs.v`) that bypass complex logic and simply pass clock and reset signals through.
- Functional Models: For the memory, we replaced the DDR3 controller with a functional Chisel model.
### BlackBox: renderer
In the original Raster I project, the system only defined the renderer's interface, with the actual implementation generated by Vitis HLS. In this phase, we aim to integrate a custom implementation directly into the system.
Implementing the entire renderer immediately is impractical. Therefore, we have developed a simplified "stub" renderer. This version generates a solid purple pixel pattern and writes it to the framebuffer via its own AXI interface to verify the data path.
The renderer operates as a Finite State Machine (FSM) with two states:
```mermaid
flowchart TD
%% 設定節點樣式
classDef resetNode fill:#ffffff,rx:50,ry:50;
Reset(reset):::resetNode
sRunning(**Running state**
Send request to
FbWriter every cycle
<span style=color:#9F35FF>fbWriter.io.req.valid=1
io.done=0</span>):::stateNode
    sIdle(**Idle state**
    renderer done
    <span style=color:#9F35FF>fbWriter.io.req.valid=0
    io.done=1</span>):::stateNode
    Reset --> sRunning
    sRunning --> |"FbWriter has <br>finished a frame:<br><span style='color:#9F35FF'>fbWriter.io.done=1<br></span>"| sIdle
    sIdle --> |"Frame ID Changed"| sRunning
```
The framebuffer writer `FbWriter` is a submodule within the renderer dedicated to write-only operations. It bridges the renderer and `io.vram`, abstracting away the complexity of the AXI protocol. Consequently, the renderer logic only needs to interface with `FbWriter`, without managing the low-level AXI signals directly:
- State Transition (Edge Detection): When the renderer is running, it waits for the `fbWriter` to complete its operation.
```scala
when (fbWriter.io.done) {
state := sIdle
}
```
However, this is incorrect because the `done` signal remains high (it is level-sensitive), so the second frame sees the stale high signal at its very beginning. We therefore detect the rising edge instead:
```scala
when (fbWriter.io.done && !RegNext(fbWriter.io.done)) {
state := sIdle
}
```
- Driving Requests: When in the running state, the renderer drives the request to `FbWriter`:
```scala
// Drive FbWriter request
fbWriter.io.req.valid := (state === sRunning)
fbWriter.io.req.bits.pix := purplePixels
```
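The `RegNext`-based edge detection used in the state transition above has a direct software analogue: keep last cycle's value in a register and fire only when the signal goes from low to high. A C++ sketch:

```cpp
#include <cassert>

// Software model of `done && !RegNext(done)`: `prev` plays the role of the
// RegNext register, so the pulse is true only on the rising edge of `done`.
struct EdgeDetect {
    bool prev = false;
    bool rising(bool done) {
        bool pulse = done && !prev;
        prev = done;        // models the register update at the clock edge
        return pulse;
    }
};
```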
### BlackBox: Vram
In the original Raster I project, VRAM was treated solely as an external interface, provided by the Vitis IDE as an IP block (similar to the role of the Clock Wizard). The actual IP contains the complete logic of a real memory controller, including the latency of processing requests and sending responses back.
To simulate the entire SoC system, we require a behavioral model (a minimum viable circuit) of the VRAM. The primary task of this model is to handle write requests triggered by `FbWriter`. Note that we only need to support the write channel for now, as the display output circuit has been disabled to simplify the initial SoC verification (see the "Simulation and Verification" section for details).
**Interface of vram**:
```scala
val io = IO(new Bundle {
val axiGraphics = new WrAxiExtUpper(Vram.addrWidth, Vram.dataWidth)
val aclkGraphics = Input(Clock())
val arstnGraphics = Input(Reset())
/* Temporarily disable display logic */
// val axiDisplay = Flipped(new RdAxi(Vram.addrWidth, Vram.dataWidth))
// val aclkDisplay = Input(Clock())
// val arstnDisplay = Input(Reset())
val ddr3 = new Ddr3Ext
})
```
- As mentioned above, we disable the display I/O interface here.
- `axiGraphics` connects to the `FbWriter` in the renderer.
- `ddr3` is a stub tied to fixed default values.
**State machine**:
```mermaid
flowchart TD
classDef resetNode fill:#ffffff,rx:50,ry:50;
Reset(reset):::resetNode
Wait_AW["<b>Wait AW state</b><br/>AW (ready=1, valid=0)<br/>W (ready=0, valid=X)<br/>B (ready=X, valid=0)"]:::stateNode
Wait_W["<b>Wait W state</b><br/>AW (ready=0, valid=X)<br/>W (ready=1, valid=0)<br/>B (ready=X, valid=0)"]:::stateNode
Mem_Write["<b>Mem write state</b><br/>AW (ready=0, valid=X)<br/>W (ready=1, valid=1)<br/>B (ready=X, valid=0)"]:::stateNode
Respond["<b>Respond state</b><br/>AW (ready=0, valid=X)<br/>W (ready=0, valid=X)<br/>B (ready=0, valid=1)"]:::stateNode
WLAST{"WLAST"}:::decisionNode
Reset --> Wait_AW
Wait_AW --> |"AW request<br/><span style='color:#9F35FF'>AW valid=1</span>"| Wait_W
Wait_W --> |"W request<br/><span style='color:#9F35FF'>W valid=1</span>"| Mem_Write
Mem_Write --> WLAST
WLAST --> |False| Mem_Write
WLAST --> |True| Respond
Respond --> |"Ready to receive respond<br/><span style='color:#9F35FF'>B ready=1</span>"| Wait_AW
```
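The write path of this state machine can be modeled behaviorally in C++ (a sketch that merges the Wait W and Mem write states for brevity; not the actual Chisel source):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Behavioral sketch of the VRAM write path: accept one AW address, accept
// W beats (incrementing the address) until WLAST, then issue a single B
// response once the master raises BREADY.
struct VramModel {
    enum State { WaitAW, WaitW, Respond } state = WaitAW;
    std::map<uint64_t, uint64_t> mem;   // sparse "DDR3" (64-bit beats here)
    uint64_t addr = 0;

    // One clock cycle; returns true when a B response is issued.
    bool step(bool awvalid, uint64_t awaddr,
              bool wvalid, uint64_t wdata, bool wlast, bool bready) {
        switch (state) {
        case WaitAW:
            if (awvalid) { addr = awaddr; state = WaitW; }  // AW handshake
            return false;
        case WaitW:
            if (wvalid) {                                   // W beat accepted
                mem[addr++] = wdata;
                if (wlast) state = Respond;                 // burst finished
            }
            return false;
        case Respond:
            if (bready) state = WaitAW;                     // B handshake
            return bready;
        }
        return false;
    }
};
```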
**Timing Diagram**:
```mermaid
---
displayMode: compact
---
gantt
title Example of AXI Write Transaction Sequence (Renderer to VRAM)
axisFormat %S
dateFormat ss
section AW Channel
Ready=1 :active, aw0, 00, 06
Ready=0 :active, aw1, 06, 15
Ready=1 :active, after n0, 18
Valid=0 (Wait request):crit, aw3, 00, 05
1 :done, aw5, 05, 06
Valid=X :done, aw4, 06, 15
0(Next request):crit, after n1, 18
AW Handshake (Valid=1 && Ready=1):milestone, h1, 05, 0
.:crit, n0, after aw1, 0.1s
.:crit, n1, after aw1, 0.1s
.:crit, n2, 15, 0.1s
section W Channel
Ready=1 :active, w0, 00, 10
Ready=0 :active, w1, 10, 15
Ready=0 :active, after n0, 18
Valid=0 (Wait request):crit, w_valid, 00, 05
Valid=1 :done, w, 05, 10
Valid=X :done, w, 10, 15
0(Next request):crit, after n1, 18
W Handshake :milestone, h2, 06, 0
WLAST=1 :milestone, h3, 09, 0
.:crit, n0, after w1, 0.1s
.:crit, n1, after w1, 0.1s
.:crit, n2, 15, 0.1s
section B Channel
Ready=X :done, b0, 00, 10
Ready=0 (wait receive) :crit, b1, 10, 14
1 :done, b2, 14, 15
Ready=X :done, after n0, 18
Valid=0 :active, b3, 00, 10
Valid=1 :active, b4, 10, 15
Valid=0 :active, after n1, 18
B Handshake :milestone, h4, 14, 0
.:crit, n0, after b2, 0.1s
.:crit, n1, after b4, 0.1s
.:crit, n2, after w1, 0.1s
```
#### Why `Mem` is Used in VRAM
Before discussing the implementation choice, it is essential to understand how Chisel handles memory primitives.
> [!Note] Chisel Memory Types
> According to the [Chisel Documentation](https://www.chisel-lang.org/docs/explanations/memories#:~:text=ROM%E2%80%8B%20Users%20can%20define%20read%2Donly%20memories%20by,counter%20as%20an%20address%20generator%20as%20follows:), there are three primary ways to define memory:
> - ROM (`VecInit`): Used for read-only data that is hardcoded into the hardware.
> - `SyncReadMem` (Synchronous Read):
> - Behavior: Requires a clock edge to initiate a read operation; the data is available at the next cycle.
> - Hardware Mapping: Usually synthesized into SRAM blocks (e.g., Block RAM in FPGAs), which are area-efficient but require strict timing management.
> - `Mem` (Combinational/Asynchronous Read):
> - Behavior: Read data is available in the same cycle as the address is provided (combinational path).
> - Hardware Mapping: Typically synthesized into distributed RAM or Register Banks.
In the early stages of our renderer project, our primary goal was to achieve a Minimum Viable Product (MVP). We chose `Mem` for the internal storage for the following reasons:
- Complexity Reduction: `SyncReadMem` introduces a one-cycle latency for every read operation. This would require us to design complex pipeline stages and stall logic to manage data hazards.
- Rapid Prototyping: By using `Mem`, we can treat memory access like a simple combinational wire, allowing us to verify the core rendering logic without being bogged down by synchronous timing constraints.
- Our framebuffer and VRAM interfaces already use the AXI protocol to interact with each other, so the memory type inside the VRAM does not affect the interaction logic.
> [!Important] TODO: Consider to Replace `Mem` with `SyncReadMem` in Vram module
> Using `Mem` is a short-term workaround. In the long run, `Mem` forces the synthesis tool to use registers instead of dedicated SRAM blocks. For a VRAM module, this will consume an excessive amount of FPGA logic resources (LUTs and Flip-Flops) and may lead to timing violations or synthesis failure as the memory size increases. Replacing this with `SyncReadMem` is a high-priority task for hardware optimization.
### BlackBox: clkWiz
### Simulation and Verification
To verify the whole SoC, I built a "back door" into the VRAM that lets us directly access its data without going through the AXI protocol.
#### Read Data in Framebuffer
`core/Trinity.scala`
```scala
val io = IO(new Bundle {
val ddr3 = new Ddr3Ext
// val vga = new VgaExt
/* Debug interface */
val debug_idx = Input(UInt(32.W))
val debug_data = Output(UInt(128.W))
val debug_graphicsFbId = Output(UInt(Fb.idWidth.W))
val debug_graphicsDone = Output(Bool())
// Temporarily disable the display component; the testbench plays its role during simulation
val debug_displayVsync = Input(Bool())
})
```
Usage
`$ make sim`
#### Error
```
[INFO] Rendering Frame 0...
[INFO] Frame 0 finished at cycle 197941
[INFO] Verifying data via IO interface...
[ERROR] Mismatch at (4, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (5, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (6, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (7, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (8, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] ...
[INFO] Frame 0 error count 1020
[INFO] Frame 0 verification done.
[INFO] Rendering Frame 1...
[INFO] Frame 1 finished at cycle 396092
[INFO] Verifying data via IO interface...
[ERROR] Mismatch at (4, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (5, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (6, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (7, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (8, 767) Got: 0x0 Exp: 0xff00ff
[ERROR] ...
[INFO] Frame 1 error count 1020
[INFO] Frame 1 verification done.
[INFO] Rendering Frame 2...
[INFO] Frame 2 finished at cycle 594243
[INFO] Verifying data via IO interface...
[INFO] Frame 2 error count 0
[INFO] Frame 2 verification done.
[INFO] Rendering Frame 3...
[INFO] Frame 3 finished at cycle 792394
[INFO] Verifying data via IO interface...
[INFO] Frame 3 error count 0
[INFO] Frame 3 verification done.
```
**Original `FbWriter`**
It only starts when there is a request from the renderer and it is not currently processing data (`!addrBegan`). When it starts:
- `addrValid` is set to `true`, meaning there is a data address ready to send to Vram.
- `addrBegan` is set to `true`, meaning it is processing data now. This variable acts as a "lock": a new address is only accepted after the current data has been sent.
- `done` is set to `false`, meaning the frame is not finished yet.

When Vram is ready to receive the address, the handshake logic fires:
- `addr` is increased by `Fb.width.U`, producing the next write address, and is checked against the end of the framebuffer.

But the error occurs here, because we test and verify the framebuffer data based on `done`:
```c++
/* Wait until the renderer has drawn a frame into the framebuffer */
while (!top->io_debug_graphicsDone && cycles < timeout) {
    top->clock = !top->clock;  // Toggle clock
    top->eval();               // Update logic
    top->clock = !top->clock;
    top->eval();
    cycles++;
    total_cycles++;
}
```
This check stops too early: the data channel has not yet finished sending all the data. An error may also occur when the display logic tries to read the unwritten data.
So we fix `FbWriter` instead of our verification logic: we add a new accumulator, `dataAddr`, that tracks the address of the data actually sent, and move the ending logic to the data channel.
address channel:
```diff
when (addrValid && io.vram.addr.ready) {
    addrValid := false.B
    addr := addr + Fb.width.U
    when (addr === Fb.lastLine.U) {
        addr := 0.U
-       done := true.B
    }
}
```
data channel:
```diff
when (io.req.valid && io.vram.data.ready) {
    idx := idx + 1.U
    when (last) {
        idx := 0.U
        addrBegan := false.B
+       dataAddr := dataAddr + Fb.width.U
+       when (dataAddr === Fb.lastLine.U) {
+           dataAddr := 0.U
+           done := true.B
+       }
    }
}
```
But an error still occurs:
```
[INFO] Rendering Frame 0...
[INFO] Frame 0 finished at cycle 198197
[INFO] Verifying data via IO interface...
[INFO] Frame 0 error count 0
[INFO] Frame 0 verification done.
[INFO] Rendering Frame 1...
[INFO] Frame 1 finished at cycle 396347
[INFO] Verifying data via IO interface...
[ERROR] Mismatch at (0, 0) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (1, 0) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (2, 0) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (3, 0) Got: 0x0 Exp: 0xff00ff
[ERROR] Mismatch at (4, 0) Got: 0x0 Exp: 0xff00ff
[ERROR] ...
[INFO] Frame 1 error count 1024
[INFO] Frame 1 verification done.
[INFO] Rendering Frame 2...
[INFO] Frame 2 finished at cycle 594497
[INFO] Verifying data via IO interface...
[INFO] Frame 2 error count 0
[INFO] Frame 2 verification done.
[INFO] Rendering Frame 3...
[INFO] Frame 3 finished at cycle 792647
[INFO] Verifying data via IO interface...
[INFO] Frame 3 error count 0
[INFO] Frame 3 verification done.
```
It seems `addr` is incorrect after switching to the next frame. This confused me, so I recorded a waveform to see what happens at the framebuffer switch.

This waveform shows the timing at the framebuffer change. The top four signals come from the top level and show the clocks and our debug IO interface:
- `io_debug_displayVsync`: our input that controls when to compute the next frame. It works because the renderer only starts processing the next frame when `fbid` changes, which happens once both `displayVsync` and `graphicsDone` are asserted.
- `io_debug_graphicsDone`: when true, the frame has been completely written.

The error occurs around this transition. Below them, `io_done` and `state` belong to the renderer:
- `state` equal to `0` means the idle state; otherwise it is `running`.

The remaining signals belong to `fbWriter`:
- When `dataAddr` reaches the end of the framebuffer, it resets to `0x0` and `done` is asserted.
- `addr` resets when it reaches the end of the framebuffer, so you can see it returns to `0x0` earlier than `dataAddr`.

At the moment marked by the red line in the image:
`!addrBegan` together with `io_req_valid` triggers the address channel of `FbWriter` to start computing the next address. `addrBegan` being `0` is correct, since it means `FbWriter` has finished writing the frame data to Vram. But `io_req_valid` being `true` is not correct: at this moment the renderer should not be raising a request. It shows that the renderer is still in the running state even though `FbWriter` has finished the frame. This happens because the renderer needs one cycle to change its state from running to idle, and that latency triggers one extra request, which advances `addr` to the next address.
You can see that `addr` is set to `0x400` after `io_debug_graphicsDone` goes high. This causes exactly 1024 errors, matching the error message:
```
[INFO] Frame 1 error count 1024
```
We fix it by keeping `addrBegan` asserted at the end of a frame:
```diff
when (last) {
    idx := 0.U
    dataAddr := dataAddr + Fb.width.U
    // Distinguish between "End of Frame" and "End of Row"
    when (dataAddr === Fb.lastLine.U) {
        // Frame completed
        dataAddr := 0.U
        done := true.B
+       /*
+        * IMPORTANT: Keep 'addrBegan' true here.
+        * This "locks" the address logic, preventing the generation of the
+        * address for the next frame until the Renderer stops its request.
+        */
+       addrBegan := true.B
+   } .otherwise {
+       // Just a row completed; allow the next row address to be sent
+       addrBegan := false.B
+   }
}
```
```diff
+   /*
+    * Force a state reset when the Renderer enters Idle (req goes low).
+    * This ensures that when Frame 1 starts, all states are clean.
+    */
+   when (!io.req.valid) {
+       addrBegan := false.B
+       done := false.B
+       addrValid := false.B
+       addr := 0.U
+       dataAddr := 0.U
+       idx := 0.U
+   }
```
## Renderer: Core Logic
Now we can start porting the core Vitis HLS renderer logic into Chisel. It can be divided into the following stages:
```c++
// 1. Geometry Stage
preproc(angle);
// Iterate over every tile on the screen
for (int y = 0; y < FB_HEIGHT; y += FB_TILE_HEIGHT) {
    for (int x = 0; x < FB_WIDTH; x += FB_TILE_WIDTH) {
        // Reset local on-chip buffers (Z-buffer and G-buffer)
        clear_tile(...);
        // 2. Find triangles overlapping this tile and rasterize them
        render_triangles(...);
        // 3. Calculate lighting and final color
        deferred_shading(...);
        // 4. Flush the result to external DDR3 memory
        fb_flush_tiles(...);
    }
}
```
```mermaid
flowchart TD
classDef resetNode fill:#ffffff,rx:50,ry:50;
Reset(reset):::resetNode
subgraph Renderer
Geometry(Geometry):::stateNode
subgraph For each tile
render_triangles(Render Triangles):::stateNode
deferred_shading(Deferred Shading):::stateNode
fb_flush_tiles(FB Flush Tiles):::stateNode
end
end
Reset --> Geometry
Geometry --> render_triangles
render_triangles --> deferred_shading
deferred_shading --> fb_flush_tiles
fb_flush_tiles --> render_triangles
fb_flush_tiles --> |If all tiles are done| done
```
## 4. Renderer: Geometry Stage
> [Main commit](https://github.com/jin11109/raster-i-chisel/commit/6da4c66281dbae97c64ba878d9721a3161db9b7c): Implement Geometry stage and overhaul cosim
> [Latest commit](https://github.com/jin11109/raster-i-chisel/commit/8dbc8e20a51f626c2e16aa57d0727a90fd85e754): The latest executable commit for this project stage
### Types
#### Fixed point
> `soc/main/scala/utils/FixedPiot.scala`
Because our renderer uses only one fixed-point type, `fixed24_13` (24 bits total, 13 integer bits, and an 11-bit fractional part), we support only this type instead of implementing arbitrary width and precision.
>[!Note] Fixed point in Vitis
> The [Vitis High-Level Synthesis User Guide (UG1399)](https://docs.amd.com/r/en-US/ug1399-vitis-hls/Advantages-of-AP-Data-Types) introduces the fixed-point type `ap_fixed<W,I,Q,O,N>`, which is used throughout the original renderer project.
> - `W`: word length in bits
> - `I`: the number of bits used to represent the integer value, including the sign bit
> - `Q`: quantization mode, **default** `AP_TRN`
> When the fractional part of a result exceeds the available bit width, this determines how the discarded bits are handled.
> "Truncation to minus infinity" simply cuts off the excess fractional bits (discarding toward negative infinity).
> - `O`: overflow mode, **default** `AP_WRAP`
> When a result exceeds the maximum value the integer part can represent, this parameter determines how it is handled.
> "Wrap around" means that after overflow, the value immediately rolls over from the other end of the range.
> - `N`: the number of saturation bits in overflow wrap modes.
>> We used Gemini to understand the meaning of these parameters.
An important thing to take into account is that the fixed-point precision handling in Vitis may not match Chisel's default behavior, so we need to implement it manually.
> [!Note] Chisel Precision and Overflow
> [Chisel: Width Inference](https://www.chisel-lang.org/docs/explanations/width-inference) describes how operations determine result widths.
> Chisel preserves the full precision of results and does not wrap on overflow by default.

We implement our fixed-point type with the same quantization behavior as `AP_TRN`, and leave the overflow logic for later.
> [!Important] TODO: Add the Overflow Logic to Fixed Point Operation
##### Testing
To test the fixed-point operations, we use ChiselTest.
> This test was built by Gemini
```
sbt test
```
#### `NumVec`
> `soc/main/scala/utils/Vec.scala`
Because we build our vector on top of Chisel's `Vec`, supporting heterogeneous element types within a single vector would be too complex.
There are several strategies for implementing a data-carrying vector:
- Fixed length and fixed element type
- Arbitrary length and fixed element type
- Arbitrary length and arbitrary element type ==<- We selected this==

However, we then face a challenge: how do we make our vec support our `Fixed24_13` type? A regular vector could use a context bound like `[T: Numeric]`, which Scala already supports and which requires the type to provide numeric operations such as `+`, `-`, `*`, and `/`. But our `Fixed24_13` does not fully support the operations `Numeric` requires, so we take another route: we define our own math trait and provide an implicit object implementing it for `Fixed24_13`.
> `soc/main/scala/utils/MathTraits.scala`
#### `Triangle` and `Aabb`
> `soc/main/scala/utils/Triangle.scala`
> `soc/main/scala/utils/Aabb.scala`
### Load Rom Data
> `soc/src/main/scala/renderer/memory/DataLoader.scala`
#### From HLS Global Variables to Hardware Memory
Initially, analyzing the Vitis HLS reference code caused some confusion for me. The original author used global variables for model data and math tables instead of passing them as function parameters.
In software, global variables are just memory addresses. However, in Hardware Design (HLS), global arrays are typically synthesized into local on-chip memory blocks (like ROM or BRAM). The HLS tool automatically handles the wiring to these memory blocks. When moving to Chisel, we must explicitly define these storage structures and their initialization mechanisms.
> [!Note] Data Storage in general Hardware Design, FPGA architecture, and Chisel
> Local Storage (Logic & Registers)
> - **Logic Circuits (`Wire`):** Pure combinational paths. Data flows through gates (AND, OR, Adders) without waiting for a clock.
> - **Registers (`Reg` / `RegInit`):** The smallest and fastest storage. Built using **Flip-Flops (FF)**.
>
> On-Chip Memory (BRAM & Distributed RAM)
> - **ROM** (`VecInit`): Constants or lookup tables (LUTs) that don't change during execution.
> - **BRAM (Block RAM) (`SyncReadMem`):** Dedicated "hard" blocks of SRAM inside the FPGA, limited to a few MB.
> - **Distributed RAM (`Mem`):** Memory built by repurposing **LUTs** (logic gates). Supports Asynchronous (immediate) reads. Consumes valuable logic resources; inefficient for large sizes.
>
> Off-Chip Storage: DRAM (DDR3/4/5)
> - Mass storage (Gigabytes). It is physically located outside the FPGA/ASIC chip.
> - Accessed via the AXI or Wishbone protocol
>
> > Reference from Gemini
After studying different memory storage options, we move on to discussing how to read data from files (such as `hex` text files) and selecting the appropriate memory type for storing this read-only data. Below, we list the potential methods and explain the reasoning behind our final decision.
#### Approaches to Reading Data from Files
- Data Loading via Chisel Test
Chisel Test makes it easy to inject raw data into our module. However, once the geometry stage processes this raw data and stores it in an intermediate buffer, other stages—such as `render_triangles`—also need to read the data from ROM. Consequently, we would still need logic to store the ROM data originating from the Chisel Test, or we would need the test to inject data serially whenever required. Moreover, setting up cosimulation between the Chisel Test and the golden reference would require significant effort.
- Data Loading via `VecInit`
`VecInit` is a method used for initializing ROM data. However, unless we write logic to read files within Scala, we would need to write an extremely long `VecInit` list, similar to how the original RasterI author placed ROM data into a global array in a C++ header file. That said, cosine and sine tables are suitable for this implementation, as these math tables can be generated directly by Scala. A key advantage of this method over Chisel Test is that it is easily cosimulated with the golden reference.
- Data Loading via `loadMemoryFromFile` ==<- We selected this!==
We found that `loadMemoryFromFile` satisfies our requirements. By using the import `chisel3.util.experimental.loadMemoryFromFileInline`, we can automatically load a hex file into our memory as shown below:
```scala
val normMem = SyncReadMem(4096, UInt(72.W))
loadMemoryFromFileInline(normMem, resPath + "normals.hex")
```
>[!Tip] `loadMemoryFromFile`
> Note on Chisel versions: `loadMemoryFromFile` cannot be used directly due to compatibility issues with Chisel 5, so we replaced it with `loadMemoryFromFileInline`.
#### Memory Type Selection for Read-Only Data
We have already discussed the different memory types available in Chisel. Unlike VRAM usage, where the AXI protocol is already defined, we need to consider latency when designing the I/O interface and pipeline. Furthermore, it is crucial to use appropriate memory hardware, such as Block RAM, rather than combinational logic. Therefore, we selected `SyncReadMem` instead of `Mem`.
#### Dump Rom Data
> `tools/gen_hex.cpp`
##### Data Type Strategy: Fixed-Point Conversion
A critical design decision for ROM data generation is determining the storage format. Our renderer architecture relies on fixed-point arithmetic for efficiency. However, the original source data (e.g., from rasterI headers) is typically in floating-point format.
We evaluated two approaches for handling this conversion:
- Hardware Runtime Conversion: The hardware `dataloader` reads floating-point data and converts it to fixed-point on the fly.
- Software Pre-processing: The generation tool (`gen_hex.cpp`) converts the data before writing it to the ROM. ==<- We selected this!==
**Reason**: In a standard PC architecture, the CPU driver or DMA engine often handles format conversions before data reaches the GPU VRAM. Since our project currently treats model data as static Read-Only Data (ROM) and lacks a complex driver layer, we decided to perform the float-to-fixed conversion in software. This approach simplifies the hardware Geometry Stage, allowing it to consume fixed-point data directly without the overhead of conversion logic.
>[!Important] TODO: Add support for `float` to `fixed` conversion
##### Data Packing Strategy
Another challenge is mapping the 3D model data (arrays of x, y, z vectors) into the 1D address space of hardware memory.
- Approach A (Flattening): Storing components sequentially (e.g., `addr 0: x`, `addr 1: y`, `addr 2: z`). This requires complex index calculation (e.g., `index * 3 + offset`) inside the hardware.
- Approach B (Packing): Concatenating components into a single wide word (e.g., `z | y | x`). ==<- We take this!==
**Reason**: We chose Approach B (Packing). By packing `(x, y, z)` into a single data entry, the hardware can access a complete vertex using the original index, avoiding extra calculation cycles. The hardware simply "unpacks" (slices) the bits when reading.
To ensure consistency, we use an HLS (High-Level Synthesis) library within the C++ tool for the float-to-fixed transformation. This guarantees that the binary data in the ROM is bit-exact to the values used in our C++ Golden Reference simulation.
**Example code: Pack in `gen_hex.cpp`:**
```c++
for (int i = 0; i < NR_MESH_VERTICES; ++i) {
    PackedVertex packed = 0;
    packed.range(23, 0) = FixedType(MESH_VERTICES[i].x).range(23, 0);
    packed.range(47, 24) = FixedType(MESH_VERTICES[i].y).range(23, 0);
    packed.range(71, 48) = FixedType(MESH_VERTICES[i].z).range(23, 0);
    // 72 bits = 18 hex digits
    out << format_hex(packed.to_string(16), 18) << "\n";
}
```
**Example code: Unpack in `dataloader`:**
```scala
val vtxRaw = vtxMem.read(io.geo.vtxAddr)
// Unpack: z(71-48) | y(47-24) | x(23-0)
io.geo.vtxData(0) := Fixed24_13.fromRaw(vtxRaw(23, 0).asSInt)
io.geo.vtxData(1) := Fixed24_13.fromRaw(vtxRaw(47, 24).asSInt)
io.geo.vtxData(2) := Fixed24_13.fromRaw(vtxRaw(71, 48).asSInt)
```
#### Test Unit
We use ChiselTest for the `DataLoader` unit test:
```bash
$ sbt "testOnly DataLoaderTest"
[DataLoader] Loading path: src/main/resources/
[info] DataLoaderTest:
[info] DataLoader
[info] - should read correct values from Hex files
[info] Run completed in 1 second, 552 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```
### Intermediate Buffer
### Geometry Stage
> `soc/src/main/scala/renderer/geometry.scala`

The main task of the geometry stage is to transform 3D model data into 2D screen-space data.
This stage separates cleanly into three parts: `preproc_vertices`, `preproc_normals`, and `preproc_triangles`.
```c++
static void preproc(ap_uint<9> angle) {
    fixed sine = SINE_TABLE[angle];
    fixed cosine = COSINE_TABLE[angle];
    Vec3f axis(0.0f, 1.0f, 0.0f);
preproc_vertices:
    for (int i = 0; i < NR_MESH_VERTICES; i++) {
        Vec3f pos = rotate_vec(MESH_VERTICES[i], axis, sine, cosine);
        pos.z += 2;
        screen_vertices[i].pos =
            Vec2i((1 + pos.x / pos.z * 3 / 4) * FB_WIDTH / 2,
                  (1 - pos.y / pos.z) * FB_HEIGHT / 2);
        screen_vertices[i].z = pos.z;
        transformed_positions[i] = pos;
    }
preproc_normals:
    for (int i = 0; i < NR_MESH_NORMALS; i++) {
        Vec3f normal = rotate_vec(MESH_NORMALS[i], axis, sine, cosine);
        transformed_normals[i] = normal;
    }
preproc_triangles:
    for (int i = 0; i < NR_MESH_TRIANGLES; i++) {
        Vec3i idx = MESH_INDICES[i].vertices;
        Triangle2i triangle(screen_vertices[idx.x].pos,
                            screen_vertices[idx.y].pos,
                            screen_vertices[idx.z].pos);
        bounding_boxes[i] = triangle.aabb();
    }
}
```
For our Chisel implementation, we start by directly transforming these stages into separate submodules, `PreprocVertices`, `PreprocNormals`, and `PreprocTriangles`, switching sequentially from one to the next when each submodule finishes. Although only `PreprocTriangles` depends on the result of `PreprocVertices`, so the other two submodules could run in parallel, we still use the sequential approach and leave this optimization as future work.
>[!Important] TODO: Consider the Optimization of Geometry Stage
To reduce the complexity of our implementation, we first support processing a single frame, and leave multi-frame support to a later section.
>[!Important] TODO: Add support for processing multiple frames

We introduce the state machine of the geometry stage below:
```mermaid
flowchart TD
%% Node style definitions
classDef resetNode fill:#ffffff,rx:50,ry:50;
Reset(reset):::resetNode
sVtx(**sVtx**
<span style=color:#9F35FF>procVtx.io.start=1</span>):::stateNode
sNorm(**sNorm**
<span style=color:#9F35FF>procNorm.io.start=1</span>):::stateNode
sIdx(**sIdx**
<span style=color:#9F35FF>procIdx.io.start=1</span>):::stateNode
sDone(**sDone**
<span style=color:#9F35FF>io.done=1</span>):::stateNode
Reset --> sVtx
sVtx --> |sVtx submodule done
<span style=color:#9F35FF>procVtx.io.done=1
procVtx.io.start=0</span>| sNorm
sNorm --> |sNorm submodule done
<span style=color:#9F35FF>procNorm.io.done=1
procNorm.io.start=0</span>|sIdx
sIdx --> |sIdx submodule done
<span style=color:#9F35FF>procIdx.io.done=1
procIdx.io.start=0</span>| sDone
```
#### Submodule: `preprocVtx`, `preprocNorm`
Recall that we implement the intermediate memory buffers and ROM with `SyncReadMem`, which has a one-cycle latency for reads and writes.
The simplest way to handle this is to wait a cycle for every read and write request:
```
Pull a reading request for Rom data
│ wait a cycle
▼
Process data and
Pull a writing request for intermediate data
│ wait a cycle
▼
Next iteration
```
Although this approach guarantees correctness, it spends twice the necessary cycles per iteration on waiting and idling. We can fix this by sending the read request for the *next* vertex to the ROM while we are processing the current vertex and issuing its write request. In the table below, numbers are data indices and n is the number of vertices (or normals):

| cycle | 0 | 1 | 2 | 3 | … | n-1 | n | n+1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Read req to ROM | 0 | 1 | 2 | 3 | … | n-1 | | |
| ROM data valid | | 0 | 1 | 2 | … | n-2 | n-1 | |
| Process ROM data | | 0 | 1 | 2 | … | n-2 | n-1 | |
| Write req to buffer | | 0 | 1 | 2 | … | n-2 | n-1 | |
| Result written | | | 0 | 1 | … | n-3 | n-2 | n-1 |
So when processing data, we only pay one cycle waiting for the first read request and one waiting for the last write request to finish. We track the read index as `rIdx` and the write index as `wIdx`, and implement the pipeline as follows:
- Reading request and writing request
```scala
val rIdx = RegInit(0.U(32.W))
val wIdx = RegNext(rIdx, 0.U)
when (io.start) {
    rIdx := rIdx + 1.U
}
```
- Condition for processing data:
```scala
when (RegNext((rIdx < RomData.MeshNormalsNum.U) && io.start, false.B)) {
    // Process ROM data
    // Send write request with wIdx
}
```
- End logic:
```scala
io.done := RegNext((rIdx >= RomData.MeshNormalsNum.U), false.B)
```
The image below shows the waveform of the `procNorm` pipeline:

#### Submodule: `preprocidx`
`preprocidx` is more complex than `preprocVtx` and `preprocNorm` because it needs the results of `preprocVtx`, so it must account for the read latency of the intermediate buffer. Moreover, the final result, a `BoundingBox`, is calculated from three vertices in that buffer, and those vertex indices depend on the ROM data `idxData`.
If we used the same `rIdx`/`wIdx`-only approach as in `preprocVtx` and `preprocNorm`, the code would lose readability and extensibility. I tried it and ran into a puzzling issue, discussed in the Verification section below. It also needs a register to hold each pending vertex, which costs another cycle of latency.
To address this, we rewrote it as a state machine, temporarily sacrificing the compact pipeline of the original approach. This makes the waiting logic explicit.
>[!Important] TODO: Compact `preprocidx` pipeline
```mermaid
flowchart TD
%% Node style definitions
classDef resetNode fill:#ffffff,rx:50,ry:50;
Reset(reset):::resetNode
sWaitRomData(**sWaitRomData**):::stateNode
sWriteTriangle(**sWriteTriangle**
Send read request for
buffer data
<span style=color:#9F35FF>io.VtxReadPort.req.valid=1
</span>):::stateNode
sWaitResp(**sWaitResp**<br>Wait respond valid<br>record buffer data<br>to register
<span style=color:#9F35FF>nextIdx = reqIdx + 1</span>):::stateNode
sExportBB(**sExportBB**
Send write request for
BB buffer
<span style=color:#9F35FF>io.bbWritePort.valid=1
rIdx = rIdx + 1
reqIdx = 0</span>):::stateNode
sDone(**sDone**
<span style=color:#9F35FF>io.done=1</span>):::stateNode
Reset --> sWaitRomData
sWaitRomData --> |if io.start=1<br>and wait a cycle<br>for ROM data valid| sWriteTriangle
sWriteTriangle --> |wait a cycle<br>for buffer data valid| sWaitResp
sWaitResp --> |if nextIdx == 3| sExportBB
sWaitResp --> |if nextIdx < 3| sWriteTriangle
sExportBB --> |if rIdx < Total| sWaitRomData
sExportBB --> |if rIdx == Total| sDone
```
### Verification
The first idea for verifying the geometry stage was to compare the framebuffer contents between the golden reference and the Chisel implementation, since we already have the golden reference and the pixel-writing logic in the renderer. But this needs a lot of extra work: how do we drive the pipeline with geometry-stage data alone and ensure it is processed and written into the framebuffer? We would have to bypass the remaining stages, and it would be hard to tell which stage an error came from.
The second idea was to print the written data whenever we send a write request to the intermediate buffers. This is easy to implement, and I did use it to track down earlier issues, but it means comparing the C++ simulation output against the Chisel output by eye. Although this could be automated by dumping and diffing log files, it is not extensible for future work and not unified with how we test the framebuffer. Moreover, this approach cannot detect errors that occur while writing into the intermediate buffer itself, which would mislead us once we start on the remaining stages.
```scala
when (RegNext(rIdx < RomData.nrMeshVertices.U, false.B)) {
    io.vtxWritePort.valid := true.B
    io.vtxWritePort.data := vtxWritePort
    io.posWritePort.valid := true.B
    io.posWritePort.data.pos := pos
    io.posWritePort.data.pos(2) := posZ
    // printf(p"[screen_vertices] ${vtxWritePort.pos(0)} ${vtxWritePort.pos(1)} ${vtxWritePort.z.bits.asUInt}\n")
    // printf(p"[transformed_positions] ${pos(0).bits.asUInt} ${pos(1).bits.asUInt} ${posZ.bits.asUInt}\n")
}
```
Think of these intermediate buffers as **checkpoints** for the geometry stage: if the data in them is correct, the geometry stage works. So we came up with the idea of building a backdoor for these intermediate buffers and reading them from Verilator, just like we test the data in the framebuffer. But there are many buffers; if each one got a debug IO interface wired up to the top level, it would be a lot of work and the renderer would be cluttered with debug signals.
Instead, we directly read the arrays that implement each `SyncReadMem` in the C++ code Verilator generates from the Chisel design. This raises a few issues:
- Verilator inlining: addressed with the `-fno-inline` option
- Multiple levels of `SyncReadMem` hierarchy: addressed with `auto`
- Reading raw data in `SyncReadMem`: observe how it is written into the buffer
- Comparing values of different types: convert all types such as `fixed` and `int16_t` into `uint32_t` before comparing
- The memory buffers being optimized away: use `dontTouch`
:::warning
This cosimulation currently simulates only a single frame with angle 0.
:::
#### Verification Stack
```mermaid
flowchart TD
subgraph Spec [**Spec**]
CModel(C Golden Model<span style=color:#9F35FF>
common/src/top.cpp</span>)
end
subgraph Design [**Chisel RasterI**]
Chisel(Chisel Source<span style=color:#9F35FF>
soc/</span>)
Verilog(Verilog/SystemVerilog<span style=color:#9F35FF>
Trinity.sv
sim_stubs.v</span>)
Verilator(Verilator C++ Simulation<span style=color:#9F35FF>
soc/verilator/obj_dir/*</span>)
end
subgraph Simulation [<span style=color:#9F35FF>cosim/src/main.cpp</span>]
loop(For Each Frame)
subgraph checker [**checker**]
CompareGeo{Compare
Intermediate buffers}
CompareFB{Compare
FrameBuffer}
end
result(Verification Result)
end
loop --> |call and run| CModel
loop --> |call and simulate| Verilator
Chisel --> |sbt| Verilog
Verilog --> |verilate| Verilator
CModel --> |Expected value| checker
Verilator --> |Actual value| checker
checker --> |Pass/Fail| result
```
> We started from an initial diagram generated by Gemini and modified it ourselves.
**Change to Next Frame Logic**
- For Chisel RasterI
```mermaid
flowchart TD
init(Init)
subgraph For each frame
process(Process a Frame)
change(Send a pulse signal<span style=color:#9F35FF>
displayVsync=1</span>)
end
init --> |Reset| process
process --> |Finish a frame and
checking logic| change
change --> |Swap framebuffer and
restart renderer| process
```
- For C Golden Model
```mermaid
flowchart TD
subgraph For each frame
call_(Call **trinity_renderer** with
parameter fb_id and angle)
change(Calculate next
fb_id and angle)
end
call_ --> |Finish a frame and
checking logic| change
change --> call_
```
#### Usage
```
# in verification
make cosim
```
Expected output
```
[INFO] Rendering Frame 0...
[INFO] Start C++ simualtion ...
[INFO] Start Varilator simulation ...
[INFO] Frame 0 finished at cycle 229683
[INFO] Start verification ...
[INFO] Verifying geometry stage ...
- memVertex compare to screen_vertices ...
- memPos compare to transformed_positions ...
- memNorm compare to transformed_normals ...
- memBB compare to bounding_boxes ...
[INFO] Verifying frambuffer ...
[INFO] Frame 0 error count 0
[INFO] Frame 0 verification done.
[INFO] To view waveform: gtkwave ./cosim/trace.vcd
```
#### Issues We Met
```
[INFO] Rendering Frame 0...
[INFO] Start C++ simualtion ...
[INFO] Start Varilator simulation ...
[INFO] Frame 0 finished at cycle 210195
[INFO] Start verification ...
--- memVertex compare to screen_vertices ---
wrong [0]
Get pos.x=4294967284 pos.y=4294966990 z=3477
Expected pos.x=511 pos.y=441 z=3477
wrong [1]
Get pos.x=33 pos.y=4294966805 z=3497
Expected pos.x=514 pos.y=476 z=3497
wrong [2]
Get pos.x=4294966762 pos.y=281 z=3842
Expected pos.x=478 pos.y=331 z=3842
wrong [3]
Get pos.x=1389 pos.y=72 z=3519
Expected pos.x=598 pos.y=370 z=3519
wrong [4]
Get pos.x=4294967278 pos.y=326 z=3273
Expected pos.x=511 pos.y=322 z=3273
--- memPos (transformed_positions) ---
--- memNorm (transformed_normals) ---
Wrong [0]
Get X=16775684 Y=16775910 Z=16776833
Expected x=1951 y=195 z=588
Wrong [1]
Get X=1848 Y=189 Z=16776354
Expected x=235 y=16775900 z=16775664
Wrong [2]
Get X=16776394 Y=1818 Z=460
Expected x=16777064 y=16776087 z=16775513
Wrong [3]
Get X=1573 Y=16776982 Z=1290
Expected x=16775630 y=16776410 z=16776200
Wrong [4]
Get X=1089 Y=1678 Z=16776781
Expected x=661 y=441 z=1887
--- memBB (bounding_boxes) ---
Wrong [0]
Get 4294967269 4294966905 0 0
Expected 510 441 514 476
Wrong [1]
Get 4294967269 4294966805 33 4294966990
Expected 434 279 453 297
Wrong [2]
Get 4294966045 462 4294966360 558
Expected 477 161 493 166
Wrong [3]
Get 4294966744 1161 4294966993 1188
Expected 410 161 430 187
Wrong [4]
Get 4294965664 1048 4294965988 1185
Expected 425 244 440 263
[INFO] Frame 0 error count 0
[INFO] Frame 0 verification done.
```
**We first address the errors in `preprocVtx`.**
1. First fix: add a shift that converts the fixed-point result into an integer `SInt`:
```scala
outVtx.pos(0) := ((Fixed24_13(1.U) + (pos(0) / posZ * Fixed24_13.fromRaw(0x1800.S))) * Fixed24_13((1024 / 2).U)).bits >> 11
outVtx.pos(1) := ((Fixed24_13(1.U) - (pos(1) / posZ)) * Fixed24_13((768 / 2).U)).bits >> 11
```
(Only `screen_vertices` is shown here.)
```
--- memVertex compare to screen_vertices ---
wrong [0]
Get pos.x=509 pos.y=441 z=3477
Expected pos.x=511 pos.y=441 z=3477
wrong [1]
Get pos.x=520 pos.y=476 z=3497
Expected pos.x=514 pos.y=476 z=3497
wrong [2]
Get pos.x=378 pos.y=331 z=3842
Expected pos.x=478 pos.y=331 z=3842
wrong [3]
Get pos.x=859 pos.y=370 z=3519
Expected pos.x=598 pos.y=370 z=3519
wrong [4]
Get pos.x=507 pos.y=322 z=3273
Expected pos.x=511 pos.y=322 z=3273
```
2. Second fix: replace `Fixed24_13.fromRaw(0x1800.S)` with `Fixed24_13(3.U) / Fixed24_13(4.U)`:
```scala
outVtx.pos(0) := ((Fixed24_13(1.U) + (pos(0) / posZ * Fixed24_13(3.U) / Fixed24_13(4.U))) * Fixed24_13((1024 / 2).U)).bits >> 11
```
>[!Important] TODO: Precision issue
>The compare logic currently requires the data from Chisel and the golden reference to match exactly. Consider allowing a small error tolerance.
**Error in `preprocNorm`**
The index starts to increment before `preprocVtx` has finished (highlighted in blue):

There are still errors:
```
--- memNorm (transformed_normals) ---
Wrong [0]
Get X=16775684 Y=16775910 Z=16776833
Expected x=1951 y=195 z=588
Wrong [1]
Get X=1848 Y=189 Z=16776354
Expected x=235 y=16775900 z=16775664
Wrong [2]
Get X=16776394 Y=1818 Z=460
Expected x=16777064 y=16776087 z=16775513
Wrong [3]
Get X=1573 Y=16776982 Z=1290
Expected x=16775630 y=16776410 z=16776200
Wrong [4]
Get X=1089 Y=1678 Z=16776781
Expected x=661 y=441 z=1887
```
However, the waveform is correct:


The error is caused by an undersized RAM:
```diff
- val memNorm = Module(new GenericRam(new TransFormedNorm, 2048))
+ val memNorm = Module(new GenericRam(new TransFormedNorm, 4096))
```
>[!Important] TODO: Use parameter instead of hardcoding
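Why a too-small depth corrupts the results: a power-of-two-depth RAM keeps only `log2(depth)` address bits, so an out-of-range index silently wraps and overwrites an earlier entry. A tiny model of this aliasing (my own sketch with a hypothetical index, not project code):

```python
def truncated_addr(index: int, depth: int) -> int:
    """A power-of-two-depth RAM keeps only log2(depth) address bits,
    so an out-of-range index silently wraps (aliases)."""
    return index & (depth - 1)

# With the old depth-2048 memNorm, a hypothetical normal #3000 would
# silently overwrite entry #952:
assert truncated_addr(3000, 2048) == 952
# With depth 4096 the same index maps to itself:
assert truncated_addr(3000, 4096) == 3000
```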
```
--- memBB (bounding_boxes) ---
Wrong [0]
Get 0 0 510 457
Expected 510 441 514 476
Wrong [1]
Get 510 441 514 476
Expected 434 279 453 297
Wrong [2]
Get 434 279 453 297
Expected 477 161 493 166
Wrong [3]
Get 477 161 493 166
Expected 410 161 430 187
Wrong [4]
Get 410 161 430 187
Expected 425 244 440 263
```

```diff
class PreprocTriangles extends Module {
  val io = IO(new Bundle {
    val done = Output(Bool())
    val start = Input(Bool())
    val idxAddr = Output(UInt(32.W))
    val idxData = Input(NumVec(3, SInt(16.W)))
    val outBB = new MemWriter(new BoundingBox, 4096)
    val inVtx = new MemReader(new ScreenVertex, 2048)
  })

  val rIdx = RegInit(0.U(32.W))
- val wIdx = RegNext(rIdx)
+ val wIdx = RegNext(RegNext(RegNext(rIdx)))
  val triangle2 = Reg(new Triangle2(SInt(32.W)))

  io.idxAddr := rIdx
  io.outBB.valid := false.B; io.outBB.index := wIdx; io.outBB.data := DontCare
  io.inVtx.req.valid := false.B; io.inVtx.req.bits := 0.U

  val reqStep = RegInit(0.U(2.W))
  io.outBB.data.bb := triangle2.aabb()

  when (rIdx < RomData.nrMeshTriangles.U && io.start) {
    val idx = io.idxData
    io.inVtx.req.valid := true.B
    io.inVtx.req.bits := idx(reqStep).asUInt
    when (io.inVtx.resp.valid) {
      triangle2(reqStep) := io.inVtx.resp.bits.pos
      reqStep := reqStep + 1.U
      when (reqStep === 2.U) {
        reqStep := 0.U
        rIdx := rIdx + 1.U
        io.outBB.valid := true.B
        // printf(p"[bounding_boxes] io.outBB.data.bb: ${io.outBB.data.bb}\n")
      }
    }
  }

  io.done := RegNext((rIdx >= RomData.nrMeshTriangles.U), false.B)
}
```
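The shifted-by-one bounding-box output above is the signature of a write index trailing the read index by the wrong number of cycles. A toy model of chained `RegNext` delay registers (plain Python, my own sketch; the real latency is set by the memory request/response handshake, so three stages is specific to this design):

```python
from collections import deque

def delayed(stream, cycles):
    """Model `cycles` chained RegNext registers: each output is the
    input from `cycles` clock edges ago (reset value 0)."""
    regs = deque([0] * cycles)
    out = []
    for v in stream:
        out.append(regs.popleft())  # value written `cycles` edges ago
        regs.append(v)              # shift the new value in
    return out

r_idx = [0, 1, 2, 3, 4, 5]
# A single RegNext (the buggy version) lags by only one cycle:
assert delayed(r_idx, 1) == [0, 0, 1, 2, 3, 4]
# Three chained RegNexts line the write index up with a 3-cycle
# read latency:
assert delayed(r_idx, 3) == [0, 0, 0, 0, 1, 2]
```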
```
--- memVertex compare to screen_vertices ---
--- memPos (transformed_positions) ---
--- memNorm (transformed_normals) ---
--- memBB (bounding_boxes) ---
Wrong bounding_boxes[2997]
Get 0 0 0 0
Expected 377 171 391 178
```
Addressed by refining the `preprocidx` logic:
```
[INFO] Rendering Frame 0...
[INFO] Start C++ simualtion ...
[INFO] Start Varilator simulation ...
[INFO] Frame 0 finished at cycle 229683
[INFO] Start verification ...
--- memVertex compare to screen_vertices ---
--- memPos (transformed_positions) ---
--- memNorm (transformed_normals) ---
--- memBB (bounding_boxes) ---
[INFO] Frame 0 error count 0
[INFO] Frame 0 verification done.
```
## 4.1 Renderer: Geometry Stage Improvement and Simulation (continued)
We left some work undone in [4. Renderer: Geometry Stage](#4-Renderer:-Geometry-Stage):
- Cosimulation currently supports only a single frame at angle 0; the angle never changes.
- Chisel simulation (`make sim` in `soc/`) still uses the implementation from [3. Chisel Simualtion: Merging the Raster I System](#3-Chisel-Simualtion:-Merging-the-Raster-I-System), which has been moved to `verification/` as cosimulation logic.
This should simulate in the same way as our golden-reference demo, which displays the result on the screen.
- Consider optimizations for the geometry stage and discuss how to evaluate performance.
- Compact the `preprocidx` pipeline.
- Precision issue in cosimulation.
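For the first item, supporting non-zero angles mostly means driving the vertex-transform stage with a per-frame rotation, as the spinning-model demo does. An illustrative plain-Python sketch (the axis choice and function name are my assumptions, not taken from the codebase):

```python
import math

def rotate_y(v, angle):
    """Rotate a point around the Y axis -- the kind of per-frame
    transform a spinning-model demo applies before projection."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

# Angle 0 (the only case cosimulation covers so far) is the identity:
assert rotate_y((1.0, 2.0, 3.0), 0.0) == (1.0, 2.0, 3.0)

# A quarter turn sends +X to roughly -Z:
qx = rotate_y((1.0, 0.0, 0.0), math.pi / 2)
assert math.isclose(qx[2], -1.0, abs_tol=1e-9)
```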
## 5. Renderer: render_triangles
### Verification
---
## Build Raster I
> Environment:
> OS: Ubuntu 24.04
Overview of different building ways:
- Generate Bitstream for FPGA Board
1. Use Chisel to generate SystemVerilog (`system/generated`)
2. Use Vitis HLS to synthesize and package renderer pipeline as IP
3. Use Vivado (via `build.tcl`) to integrate system + IP, target FPGA board, and generate bitstream
- C simulation

To obtain a golden reference for testing, I followed the [Build Instructions](https://github.com/raster-gpu/raster-i?tab=readme-ov-file#build-instructions) and built Raster I.
1. After [downloading sbt](https://www.scala-sbt.org/download/) and running `sbt run` in `/system`, an error occurs:
```
Note that this version of Chisel (5.0.0) was published against firtool version 1.40.0.
```
2. Download and extract `firtool` version 1.40.0, and set up the environment.
```bash
$ wget https://github.com/llvm/circt/releases/download/firtool-1.40.0/firrtl-bin-ubuntu-20.04.tar.gz
# After extracting firrtl-bin-ubuntu-20.04.tar.gz
$ export PATH=<your-path>/firrtl-bin-ubuntu-20.04/firtool-1.40.0/bin:$PATH
```
:::warning
The release package of `firtool` 1.40.0 targets Ubuntu 20.04. I am not sure whether running it on my newer OS version will cause errors.
:::
3. Produce SystemVerilog file under `system/generated` by:
```console
$ sbt run
[info] welcome to sbt 1.11.7 (Eclipse Adoptium Java 11.0.29)
[info] loading project definition from /home/jin/course/ca2025/gpu/raster-i/system/project
[info] loading settings for project system from build.sbt...
[info] set current project to system (in build file:/home/jin/course/ca2025/gpu/raster-i/system/)
[info] running Trinity
[success] Total time: 17 s, completed Dec 3, 2025, 5:18:17 PM
```
### Install Vitis IDE and Vivado
**Vitis IDE** is a complete SW/HW collaborative development environment (Application + HLS + Platform).
**Vivado** is the actual FPGA hardware build tool (RTL → bitstream).
> [!Note] Typical HLS kernel flow (options in Vitis IDE)
> - C simulation:
> It is ==not a typical== pure C simulation:
> Although it is ultimately compiled into a C executable (csim.exe) and run, no RTL is generated and no Verilog is executed.
> But it still requires:
> - The included HLS libraries are entirely HLS-specific.
> - Build flags and link flags must follow the HLS flow.
> - The program partitioning (kernel vs. TB) is managed by HLS.
> - The goal is to verify HLS behavior, not general C/C++ program portability.
> - C synthesis:
> Converts a C/C++/OpenCL kernel to RTL (Verilog/VHDL).
> - HLS Package:
> Transforms the RTL into an IP core, or an RTL kernel into a package usable by Vitis.
> - RTL/C cosimulation:
> Compares C vs. RTL behavior.
1. Visit [Xilinx](https://www.xilinx.com/support/download.html) and register an AMD account to download the installer package for Vitis and Vivado.
2. After downloading the package:
```bash
# Add permission of execution to it
$ chmod +x ./FPGAs_AdaptiveSoCs_Unified_SDI_2025.2_1114_2157_Lin64.bin
# Run setup wizard
$ ./FPGAs_AdaptiveSoCs_Unified_SDI_2025.2_1114_2157_Lin64.bin
```
3. Follow the setup steps and choose the options below:

:::warning
There may be a mistake in this step, because bitstream generation fails later with an error.
:::
4. Set up the environment for Vitis and Vivado
```bash
$ sudo <install-path>/2025.2/Vitis/scripts/installLibs.sh
# Set up the Vitis environment
$ . <install-path>/2025.2/Vitis/settings64.sh
# Set up the Vivado environment
$ . <install-path>/2025.2/Vivado/settings64.sh
```
Another attempt:
> OS: Ubuntu 22.04
> Vitis 2023.1 and Vitis 2025.2

Both Vitis versions got stuck at:

### C Simulation
1. Use the GUI and click the "Run" button

or use the command:
```bash
$ vitis-run --mode hls --csim --config /home/jin/course/ca2025/gpu/raster-i/renderer/hls_config.cfg --work_dir renderer
```
:::danger
```
INFO: [SIM 211-2] *************** CSIM start ***************
INFO: [HLS 200-2191] C-Simulation will use clang-16 as the compiler
INFO: [HLS 200-2036] Building optimized C Simulation binaries
make: 'csim.exe' is up to date.
@E Simulation failed with unknown error: child killed: floating-point exception
ERROR: [SIM 211-100] CSim failed with errors.
INFO: [SIM 211-3] *************** CSIM finish ***************
ERROR: CSIM Failed
INFO: [HLS 200-112] Total CPU user time: 1.47 seconds. Total CPU system time: 0.55 seconds. Total elapsed time: 3.01 seconds; peak allocated memory: 254.238 MB.
Failed to run c-simulation
```
:::
```
gdb ./csim.exe
```
```
thread 1 "csim.exe" received signal SIGFPE, Arithmetic exception.
0x000055555569f57a in ap_private<50, true, true>::udiv(ap_private<50, true, true> const&) const ()
```
```
(gdb) bt
#0 0x000055555569f57a in ap_private<50, true, true>::udiv(ap_private<50, true, true> const&) const ()
#1 0x000055555569f511 in ap_private<50, true, true>::sdiv(ap_private<50, true, true> const&) const ()
#2 0x000055555569f1cd in ap_fixed_base<24, 13, true, (ap_q_mode)5, (ap_o_mode)3, 0>::RType<50, 28, true>::div ap_fixed_base<24, 13, true, (ap_q_mode)5, (ap_o_mode)3, 0>::operator/<50, 28, true, (ap_q_mode)5, (ap_o_mode)3, 0>(ap_fixed_base<50, 28, true, (ap_q_mode)5, (ap_o_mode)3, 0> const&) const ()
#3 0x0000555555691391 in ap_fixed_base<24, 13, true, (ap_q_mode)5, (ap_o_mode)3, 0>& ap_fixed_base<24, 13, true, (ap_q_mode)5, (ap_o_mode)3, 0>::operator/=<50, 28, true, (ap_q_mode)5, (ap_o_mode)3, 0>(ap_fixed_base<50, 28, true, (ap_q_mode)5, (ap_o_mode)3, 0> const&) ()
#4 0x0000555555674068 in deferred_shading(unsigned int*, PixAttrib (*) [4]) ()
#5 0x0000555555671fca in trinity_renderer(ap_uint<1>, ap_uint<128>*, ap_uint<9>) ()
#6 0x00005555555c8c55 in main ()
```
fix `renderer/src/top.cpp`:
> [commit: Fix floating-point exception error](https://github.com/jin11109/raster-i-renderer/commit/ab4bbe80640bfe8bdfdf4d49afe139b699351656)
``` diff=218
- intensity /=
- hls::sqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
+ fixed len = hls::sqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
+ if (len > (fixed)0.001) {
+ intensity = intensity / len;
+ } else {
+ intensity = 0;
+ }
```
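The same guard can be sketched in plain Python (the `0.001` threshold mirrors the diff above; the function name is mine): dividing by a zero-length light-direction vector is exactly what triggered the `ap_fixed` division trap.

```python
import math

EPS = 0.001  # same threshold as the HLS fix above

def attenuate(intensity: float, dir_xyz) -> float:
    """Divide intensity by |dir|, but clamp to 0 when the vector is
    (nearly) zero instead of dividing by zero -- mirrors the guard
    added to the shading code."""
    length = math.sqrt(sum(c * c for c in dir_xyz))
    return intensity / length if length > EPS else 0.0

# A zero direction vector no longer traps; it just yields 0:
assert attenuate(1.0, (0.0, 0.0, 0.0)) == 0.0
# A normal case: |(0, 3, 4)| = 5, so 2 / 5 = 0.4
assert attenuate(2.0, (0.0, 3.0, 4.0)) == 0.4
```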
Because the simulation is very slow, it is okay if a pop-up window says `csim.exe` is not responding.
You can see the model successfully spinning on the display.



#### What Does C Simulation Actually Do?
Before starting to rewrite the testbench, I first studied how the testbench is compiled and executed.
Pressing the "Run" button in C simulation actually runs:
```bash
$ vitis-run --mode hls --csim \
--config /home/jin/course/ca2025/gpu/raster-i/renderer/hls_config.cfg \
--work_dir renderer
```
- `vitis-run`
The executor of the Vitis Unified Flow; it drives all workflows such as HLS, RTL, and simulation.
- `--mode hls`
HLS (High-Level Synthesis) mode.
- `--csim`
Run C simulation.
- `--config`
This configuration defines which files are the testbench and which are the core HLS sources.
Because the command above is only a trigger, I studied `csim.mk` further to see how the testbench and related files are compiled.
> /home/jin/course/ca2025/gpu/raster-i/renderer/renderer/hls/csim/build/csim.mk
1. Compiler:
Vitis uses its own bundled toolchain (`--gcc-toolchain=/home/jin/tools/vitis/2025.2/Vitis/tps/lnx64/gcc-8.3.0`) to ensure the same behavior across different environments.
2. Macro Definitions:
This is the most important part: different macros are used to compile the same files for different targets.
``` makefile
IFLAG += -D__HLS_COSIM__
IFLAG += -D__HLS_CSIM__
IFLAG += -D__VITIS_HLS__
IFLAG += -D__SIM_FPO__
IFLAG += -D__SIM_FFT__
IFLAG += -D__SIM_FIR__
IFLAG += -D__SIM_DDS__
IFLAG += -D__DSP48E1__
```
3. Compilation Split:
For testbench:
```
$(ObjDir)/tb.o: ../../../../src/tb.cpp $(ObjDir)/.dir csim.mk
$(Echo) " Compiling ../../../../src/tb.cpp in $(BuildMode) mode" $(AVE_DIR_DLOG)
$(Verb) $(CXX) -std=gnu++17 ${CCFLAG} -c -MMD -I/home/jin/course/ca2025/gpu/raster-i/renderer/include -Wno-unknown-pragmas -Wno-unknown-pragmas $(IFLAG) $(DFLAG) -DNDEBUG $< -o $@ ; \
-include $(ObjDir)/tb.d
```
For HLS core:
```
$(ObjDir)/texture.o: ../../../../src/texture.cpp $(ObjDir)/.dir csim.mk
$(Echo) " Compiling ../../../../src/texture.cpp in $(BuildMode) mode" $(AVE_DIR_DLOG)
$(Verb) $(CXX) -std=gnu++17 ${CCFLAG} -c -MMD -I/home/jin/course/ca2025/gpu/raster-i/renderer/include -fhls-csim -fhlstoplevel=trinity_renderer $(IFLAG) $(DFLAG) -DNDEBUG $< -o $@ ; \
-include $(ObjDir)/texture.d
```
- `-fhls-csim`: This tells the compiler that this is an HLS emulation and to enable `ap_fixed` bit-accurate emulation mode.
- `-fhlstoplevel=trinity_renderer`: This matches the `syn.top=trinity_renderer` setting in `hls_config.cfg`; the compiler needs to know which function is the top-level kernel.
### C/RTL Cosimulation
Although C simulation works now, the "C/RTL cosimulation" option still fails with an error.
::: danger
```
INFO: [COSIM 212-302] Starting C TB testing ...
ERROR: System received a signal named SIGSEGV and the program has to stop immediately!
This signal was generated when a program tries to read or write outside the memory that is allocated for it, or to write memory that can only be read.
Possible cause of this problem may be:
1) Missing depth information for one or more pointers to arrays in the interface;
2) Insufficient depth setting for array argument(s);
3) Excessive depth setting for array argument(s), that exceeds the maximum virtual memory size for the process;
4) Null pointer etc.
Current execution stopped during CodeState = DUMP_INPUTS.
You can search CodeState variable name in apatb*.cpp file under ./sim/wrapc dir to locate the position.
ERROR: [COSIM 212-360] Aborting co-simulation: C TB simulation failed.
ERROR: [COSIM 212-320] C TB testing failed, stop generating test vectors. Please check C TB or re-run cosim.
ERROR: [COSIM 212-5] *** C/RTL co-simulation file generation failed. ***
ERROR: [COSIM 212-4] *** C/RTL co-simulation finished: FAIL ***
ERROR:
INFO: [HLS 200-112] Total CPU user time: 50.62 seconds. Total CPU system time: 3.43 seconds. Total elapsed time: 54.1 seconds; peak allocated memory: 345.750 MB.
Failed to run co-simulation
```
:::
### Generate Bitstream for FPGA Board
1. Open the Vitis IDE
```
$ vitis
```
2. To produce HDL files from HLS source code, click "Synthesis" followed by "Package":

Or use terminal instead of GUI:
```console
# C Synthesis
$ v++ -c --mode hls --config /home/jin/course/ca2025/gpu/raster-i/renderer/hls_config.cfg --work_dir renderer
# Package
$ vitis-run --mode hls --package --config /home/jin/course/ca2025/gpu/raster-i/renderer/hls_config.cfg --work_dir renderer
```
3. Launch Vivado from `system/vivado` and execute `build.tcl` (Tools -> Run Tcl Script)
```console
# In system/vivado
vivado
```


:::danger
I hit an error here.
:::
---
## AI tool usage
I draft all the content myself first, then use Gemini to polish the English for better fluency.