# Week 1
## Lab1
Simulation success
![Simulation success screenshot](https://hackmd.io/_uploads/BJnxSfrFkg.png)

Report timing
![Report timing](https://hackmd.io/_uploads/Bk3xSMSKJl.png)

Report utilization - Synthesis success
![Report utilization](https://hackmd.io/_uploads/S13lSzrKJg.png)

# Week 2
## Basic Coding style
![Screenshot 2025-02-13 210507](https://hackmd.io/_uploads/rkcp0wiYkx.png)
```
if (condition) begin
  foo = bar;
end else begin
  foo = bum;
end

assign a = ((addr & mask) == My_addr) ? b[1] : ~b[0];

logic [7:0][3:0] data[128][2];
typedef logic [31:0] word_t;

assign foo = condition_a ? (condition_a_x ? x : y) : b;
```
![Screenshot 2025-02-13 213337](https://hackmd.io/_uploads/SJe-kB_iYke.png)
```
module top_module (
  output wire [7:0] q,     // driven by the instance, so it must be a net
  input       [7:0] d,
  input             clk,
  input             rst_n
);
  myreg #(
    .SIZE(8),
    .Tckq(2)
  ) r1 (
    .q(q),
    .d(d),
    .clk(clk),
    .rst_n(rst_n)
  );
endmodule
```
## Logic design basics
![Screenshot 2025-02-16 173512](https://hackmd.io/_uploads/ByqD-N1qJl.png)

In Fig#1 and Fig#2, latches are inferred because the combinational logic does not assign every output under every condition. The fixed code is as follows.
```
always @(*) begin: EX22_PROC
  case (c)
    1'b0: begin
      q = 1'b1;
      z = 1'b0;
    end
    default: begin
      q = 1'b0;
      z = 1'b1;
    end
  endcase
end
```
```
// Blocking assignments are used because this is combinational logic
always @(*) begin
  case (d)
    2'b00: begin
      z = 1'b1;
      s = 1'b0;
    end
    2'b01: begin
      z = 1'b0;
      s = 1'b0;
    end
    2'b10: begin
      z = 1'b1;
      s = 1'b1;
    end
    default: begin
      z = 1'b0;
      s = 1'b0;
    end
  endcase
end
```
![Screenshot 2025-02-16 223517](https://hackmd.io/_uploads/Bk03vO15Jl.png)

As shown in the figure above, inserting a latch (L2) between the two registers (F1 and F2) enables time borrowing. In the first clock cycle, the data passes through F1, path#1, and L2, and then starts propagating through path#2. In the second clock cycle, the data finishes path#2 and is captured by the second DFF (F2).
Through time borrowing, the delay of an individual path may exceed a single clock cycle, as long as the overall path still meets timing across multiple clock cycles. Note that F1 and F2 are rising-edge triggered, so L2 must be transparent while the clock is low.

**Timing requirements:**
- Setup at L2: `Tcq_register + Tp_path#1 + Tsetup_latch < 15`
- Hold at L2: `Tcd_register + Tcd_path#1 > 5 + Thold_latch`
- Setup at F2: `Tcq_register + Tp_path#1 + Tdp_latch + Tp_path#2 + Tsetup_register < 20`
- Hold at F2: `Tcd_register + Tcd_path#1 + Tcd_latch + Tcd_path#2 > 10 + Thold_register`

# Week 3
## AXI
Design the interface logic between AXI and SRAM using minimum hardware resources, and draw the timing waveform.
```
module axi_sram_interface (
    input  wire        clk,
    input  wire        rst_n,
    // AXI
    input  wire [31:0] awaddr,
    input  wire        awvalid,
    output reg         awready,
    input  wire [31:0] wdata,
    input  wire        wvalid,
    output reg         wready,
    output wire [1:0]  bresp,
    output reg         bvalid,
    input  wire        bready,
    input  wire [31:0] araddr,
    input  wire        arvalid,
    output reg         arready,
    output reg  [31:0] rdata,
    output reg         rvalid,
    input  wire        rready,
    // SRAM (an asynchronous SRAM with a shared data bus is assumed)
    output wire [31:0] sram_addr,
    inout  wire [31:0] sram_data,
    output wire        sram_we_n,
    output wire        sram_ce_n
);

    typedef enum logic [1:0] {
        IDLE,
        WRITE,
        READ,
        RESPONSE
    } state_t;

    state_t current_state, next_state;
    reg [31:0] addr_reg;
    reg        we_reg;    // 1: write transaction, 0: read transaction

    // FSM
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            current_state <= IDLE;
        end else begin
            current_state <= next_state;
        end
    end

    // state transfer
    always @(*) begin
        next_state = current_state;
        case (current_state)
            IDLE: begin
                if (awvalid && awready) begin
                    next_state = WRITE;
                end else if (arvalid && arready) begin
                    next_state = READ;
                end
            end
            WRITE: begin
                if (wvalid && wready) begin
                    next_state = RESPONSE;
                end
            end
            READ: begin
                next_state = RESPONSE;   // SRAM data is captured during this cycle
            end
            RESPONSE: begin
                if ((bvalid && bready) || (rvalid && rready)) begin
                    next_state = IDLE;
                end
            end
        endcase
    end

    // AXI control
    always @(*) begin
        awready = 0;
        wready  = 0;
        arready = 0;
        bvalid  = 0;
        rvalid  = 0;
        case (current_state)
            IDLE: begin
                awready = 1;
                arready = !awvalid;      // writes take priority, so only one address is accepted
            end
            WRITE: begin
                wready = 1;
            end
            RESPONSE: begin
                bvalid = we_reg;         // write response
                rvalid = !we_reg;        // read data
            end
            default: ;                   // READ: SRAM access cycle, no handshake
        endcase
    end

    assign bresp = 2'b00;                // always OKAY

    // SRAM interface
    assign sram_ce_n = !(current_state == WRITE || current_state == READ);
    assign sram_we_n = !(current_state == WRITE && wvalid);
    assign sram_addr = addr_reg;
    assign sram_data = sram_we_n ? 32'bz : wdata;   // wdata drives the bus directly, no data register

    // data path
    always @(posedge clk) begin
        case (current_state)
            IDLE: begin
                if (awvalid && awready) begin
                    addr_reg <= awaddr;
                    we_reg   <= 1'b1;
                end else if (arvalid && arready) begin
                    addr_reg <= araddr;
                    we_reg   <= 1'b0;
                end
            end
            READ: begin
                rdata <= sram_data;
            end
            default: ;
        endcase
    end
endmodule
```
```
Write operation waveform (one column per clock cycle, derived from the FSM above):

Cycle      |   1    |     2      |    3     |   4
State      |  IDLE  |   WRITE    | RESPONSE |  IDLE
AWADDR     | 0x100  |     -      |    -     |   -
AWVALID    |   1    |     0      |    0     |   0
AWREADY    |   1    |     0      |    0     |   1
WDATA      |   -    | 0x12345678 |    -     |   -
WVALID     |   0    |     1      |    0     |   0
WREADY     |   0    |     1      |    0     |   0
BRESP      | 2'b00  |   2'b00    |  2'b00   | 2'b00
BVALID     |   0    |     0      |    1     |   0
BREADY     |   0    |     0      |    1     |   0
SRAM_CE_N  |   1    |     0      |    1     |   1
SRAM_WE_N  |   1    |     0      |    1     |   1
SRAM_ADDR  |   -    |   0x100    |  0x100   | 0x100
SRAM_DATA  |   Z    | 0x12345678 |    Z     |   Z
```
```
Read operation waveform (one column per clock cycle, derived from the FSM above):

Cycle      |   1    |     2      |     3      |     4
State      |  IDLE  |    READ    |  RESPONSE  |   IDLE
ARADDR     | 0x200  |     -      |     -      |     -
ARVALID    |   1    |     0      |     0      |     0
ARREADY    |   1    |     0      |     0      |     1
RDATA      |   -    |     -      | 0xAABBCCDD | 0xAABBCCDD
RVALID     |   0    |     0      |     1      |     0
RREADY     |   0    |     0      |     1      |     0
SRAM_CE_N  |   1    |     0      |     1      |     1
SRAM_WE_N  |   1    |     1      |     1      |     1
SRAM_ADDR  |   -    |   0x200    |   0x200    |   0x200
SRAM_DATA  |   Z    | 0xAABBCCDD |     Z      |     Z
```
Q: Interleave order: if the AXI burst type is interleave order and the starting address is 011, what is the address sequence for data access?

A: For an INCR burst, the address increments by the size of each transfer, so the address sequence would be: 011, 100, 101, 110, 111, 1000, 1001, 1010 ... In interleaved order, by contrast, the address of beat i is the starting address XOR i, so an 8-beat burst starting at 011 accesses 011, 010, 001, 000, 111, 110, 101, 100.

![image](https://hackmd.io/_uploads/rJg-MWMo1x.png)

Leveraging the DRAM controller to handle access-order conflicts is the better choice because it maintains system-wide efficiency without requiring DMA modifications. This approach centralizes memory optimization, stays transparent to software and drivers, and adapts to diverse access patterns through hardware-level remapping and scheduling.

## IO cache
TPH implementation options: for IO read/write operations, given that IO supports TPH, choose the implementation for the different TPH that provides the most effective IO operations.
![B55C65EFCE0F59C0AF01AA010DA45E32](https://hackmd.io/_uploads/r14AQNzjkl.jpg)

# Week 4
## Testbench
![image](https://hackmd.io/_uploads/Byif6OV2Je.png)

Q: Do the two always blocks behave the same?

A: No, they behave differently, because in Verilog blocking assignments are executed sequentially.
The first one calculates y first, whereas the second one calculates tmp first and then calculates the value of y.

![image](https://hackmd.io/_uploads/SJwzyYNhkl.png)

Code A

## Delay
![image](https://hackmd.io/_uploads/B1_bbYN21e.png)
```
Case1:
a  : 0 -> A -> 2 -> F
b  : 0 -> 3
ci : 0 -> 1

Trigger events (ns) : 15    17    19    21    24
tmp                 : A     A+3   2+3   F+3   F+3+1   (updates immediately)
co/sum update (ns)  : 18    20    22    24    27
co/sum              : A     A+3   2+3   F+3   F+3+1   (delayed by 3 ns)
```
```
Case2:
a  : 0 -> A -> 2 -> F
b  : 0 -> 3
ci : 0 -> 1

Trigger events (ns) : 15    17    19    21    24
tmp                 : A     A+3   2+3   F+3   F+3+1   (updates immediately)
co/sum update (ns)  : 18    20    22    24    27
co/sum              : A     A+3   2+3   F+3   F+3+1   (locked values)
```
```
Case3:
a  : 0 -> A -> 2 -> F
b  : 0 -> 3
ci : 0 -> 1

tmp update (ns)     : 18    20    22    24    27
tmp                 : A     A+3   2+3   F+3   F+3+1   (delayed calculation)
co/sum update (ns)  : 18    20    22    24    27
co/sum              : A     A+3   2+3   F+3   F+3+1   (same as tmp)
```
## block/nonblock
![image](https://hackmd.io/_uploads/B1L14FE3kl.png)

- async_reset → reset_nonblocking → clock_1_nonblocking
- sync_reset → reset_blocking → clock_0 or clock1

![image](https://hackmd.io/_uploads/SyecYK4hye.png)
```
module tb;
    reg clk;
    reg [7:0] data;
    wire [7:0] r_data;

    dut u_dut (
        .clk(clk),
        .data(data),
        .r_data(r_data)
    );

    // 10 ns clock
    initial begin
        clk = 0;
        forever #5 clk = ~clk;
    end

    // Drive data on the falling edge so it is stable at the next rising edge
    task tx;
        input [7:0] value;
        begin
            @(negedge clk);
            data = value;
        end
    endtask

    initial begin
        data = 0;
        tx(8'h05);
        tx(8'h06);
        tx(8'h07);
        #50 $finish;
    end

    initial begin
        $dumpfile("wave.vcd");
        $dumpvars(0, tb);
    end
endmodule

module dut(
    input            clk,
    input      [7:0] data,
    output reg [7:0] r_data
);
    always @(posedge clk) begin
        r_data <= data;
    end
endmodule
```
# Week 5
## rtl-synthesis consistency
![image](https://hackmd.io/_uploads/r1YtqYV21g.png)

No, the two circuits are different. When a changes from 0 to 1, the final value of x is 1 and the value of y is 0 in circuit A. In circuit B, however, the final value of x is 1 and that of y is 1.

![image](https://hackmd.io/_uploads/rJWq2K4nkx.png)

Code 4a, without `full_case`, generates more latches than Code 4b, with `full_case`.

![image](https://hackmd.io/_uploads/S16EpY4hye.png)

Code 5a generates a priority encoder (cascaded logic gates) whose behavior is consistent with the RTL, but with high resource consumption. Code 5b generates a parallel multiplexer (independent comparators) with fewer resources, but it may cause functional anomalies when conditions overlap.

## Construct design
![image](https://hackmd.io/_uploads/SyzjkqV3Jx.png)
```
Timeline (clk cycles): 1 2 3 4 5 6 7 8
----------------------------------------------------------
clk    : _|‾|_|‾|_|‾|_|‾|_|‾|_|‾|_|‾|_
tvalid : __|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|_________
tready : ________|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|__
tdata  : X──A0──A1──A2──A3──A4──A5──X
muxsel : __|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|_________
ffen   : __|‾|_|‾|_|‾|_|‾|_|‾|_|‾|______
addr   : X──0──1──2──3──4──5──X──X
rdata  : X──X──D0──D1──D2──D3──D4──D5
```
```
module axi_sram_controller (
    input  wire        clk,     // System clock
    input  wire        rst_n,   // Active-low reset
    // AXI-Stream Interface
    output reg         tvalid,  // Data valid
    input  wire        tready,  // Receiver ready
    output reg  [31:0] tdata,   // Transmit data
    // SRAM Control Interface
    output reg         muxsel,  // 1: Address phase, 0: Data phase
    output reg         ffen,    // FIFO write enable
    output reg  [31:0] addr,    // SRAM address
    input  wire [31:0] rdata    // SRAM read data (1-cycle latency)
);

    // FSM states
    typedef enum {
        IDLE,        // Waiting for transaction
        ADDR_PHASE,  // SRAM address phase
        DATA_PHASE   // SRAM data phase
    } state_t;

    // A single state register: the state and the outputs are updated together
    state_t current_state;

    // Address counter
    reg [31:0] addr_counter;

    // Main FSM
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            current_state <= IDLE;
            addr_counter  <= 0;
            addr          <= 0;
            tvalid        <= 0;
            tdata         <= 0;
            muxsel        <= 0;
            ffen          <= 0;
        end else begin
            case (current_state)
                IDLE: begin
                    if (tready) begin
                        current_state <= ADDR_PHASE;
                        muxsel        <= 1;              // Enter address phase
                        addr          <= addr_counter;
                    end
                end
                ADDR_PHASE: begin
                    current_state <= DATA_PHASE;
                    muxsel        <= 0;                  // Switch to data phase
                    ffen          <= 1;                  // Enable FIFO write
                end
                DATA_PHASE: begin
                    if (tready) begin
                        tvalid        <= 1;              // Drive valid data
                        tdata         <= rdata;          // Forward SRAM data (1-cycle latency)
                        addr_counter  <= addr_counter + 1;
                        current_state <= ADDR_PHASE;
                        muxsel        <= 1;              // Next address phase
                        addr          <= addr_counter + 1;
                    end else begin
                        current_state <= IDLE;
                        tvalid        <= 0;
                    end
                    ffen <= 0;                           // Disable FIFO
                end
                default: current_state <= IDLE;
            endcase
        end
    end
endmodule
```
```
module sram_model (
    input  wire        clk,    // System clock
    input  wire [31:0] addr,   // Read address
    output reg  [31:0] rdata   // Read data
);
    // 1024 x 32-bit memory array (4 KB)
    reg [31:0] mem [0:1023];

    // Read operation with 1-cycle latency
    always @(posedge clk) begin
        rdata <= mem[addr[9:0]];   // Pipelined read, index limited to the array size
    end
endmodule
```
```
module tb;
    reg         clk;      // System clock
    reg         rst_n;    // Active-low reset
    reg         tready;   // AXI-Stream ready signal
    wire        tvalid;   // AXI-Stream valid signal
    wire [31:0] tdata;    // AXI-Stream data
    wire        muxsel;   // SRAM mux control
    wire        ffen;     // FIFO enable
    wire [31:0] addr;     // SRAM address
    wire [31:0] rdata;    // SRAM read data

    // Instantiate DUT
    axi_sram_controller u_controller (
        .clk(clk),
        .rst_n(rst_n),
        .tvalid(tvalid),
        .tready(tready),
        .tdata(tdata),
        .muxsel(muxsel),
        .ffen(ffen),
        .addr(addr),
        .rdata(rdata)
    );

    sram_model u_sram (
        .clk(clk),
        .addr(addr),
        .rdata(rdata)
    );

    // Clock generation (10ns period)
    initial begin
        clk = 0;
        forever #5 clk = ~clk;
    end

    // Test sequence
    initial begin
        // Initialization
        rst_n  = 0;
        tready = 0;
        #20 rst_n = 1;            // Release reset
        tready = 1;               // Enable data transfer

        // Preload SRAM data
        u_sram.mem[0] = 32'hA0;   // Address 0 data
        u_sram.mem[1] = 32'hA1;   // Address 1 data
        u_sram.mem[2] = 32'hA2;   // Address 2 data

        // Run simulation
        #100 $finish;
    end

    // Waveform dumping
    initial begin
        $dumpfile("wave.vcd");
        $dumpvars(0, tb);
    end
endmodule
```
## FSM
![image](https://hackmd.io/_uploads/ryndRYV21x.png)

It is a Mealy machine.

## SRAM
![image](https://hackmd.io/_uploads/SJ9CZcVnkg.png)

Synthesizing the code in the figure produces a register-based memory (a bank of flip-flops), not an SRAM macro. To use an SRAM macro, we must generate it with the foundry's memory compiler (e.g., TSMC, Samsung, or Synopsys Memory Compiler) and then instantiate the macro directly in the RTL.
```
module my_design (
    input  wire        clk,
    input  wire        we,
    input  wire [5:0]  wa,    // 6-bit write address (64 entries)
    input  wire [31:0] di,
    input  wire [5:0]  ra,    // 6-bit read address
    output wire [31:0] dout   // `do` is a reserved word in SystemVerilog, so the port is named dout
);
    // -----------------------------------------------------------
    // Instantiate SRAM macro (technology-dependent)
    // Single-port macro: one address port is shared by writes and reads
    // -----------------------------------------------------------
    sram_sp_64x32 u_sram (
        .CLK (clk),             // Clock
        .CE  (1'b1),            // Always enabled
        .WE  (we),              // Write Enable
        .A   (we ? wa : ra),    // Write address when writing, read address otherwise
        .D   (di),              // Data Input
        .Q   (dout)             // Data Output (synchronous read)
    );
endmodule
```
```
# Synopsys DC script (library file names are placeholders)
set target_library [list std_cell_lib.db]
set link_library   [list * std_cell_lib.db sram_sp_64x32.db]
read_verilog my_design.v
```
We can also implement an SRAM behavioral model in pure Verilog. Note that such a model cannot generate a real SRAM physical structure (such as a 6T memory cell).

## reset
![image](https://hackmd.io/_uploads/B1-zM9N2Jx.png)

- Synchronous Reset → Verilog-A → Waveform-A → Schematic-B
- Asynchronous Reset → Verilog-B → Waveform-B → Schematic-A

# Week 6
## Caravel_hk_gpio_spi_mmio
Q: Explain the Caravel RISC-V firmware code update and boot-up sequence. It involves:
1. Using passthru mode to update the SPI flash
2. CPU code fetch from SPI flash

A: 1.
Passthru-mode update:
   - The external host takes over the SPI flash.
   - It erases and programs the new firmware.
   - It releases control so the chip can boot normally.

2. CPU boot from SPI flash:
   - The CPU starts at the reset vector.
   - It fetches instructions over SPI (XIP mode).
   - It executes the firmware, optionally copying code to SRAM.

Q: GPIO (MPRJ) pins can be used by the management core or the user project. Explain how to program MPRJ:
1. What is the MMIO address to configure each MPRJ pin?
2. Which bit is used to set whether an MPRJ pin is used by the management core or the user?
3. Illustrate by code (reference the firmware code).

A:

1. The MPRJ pins are controlled by registers in the Management SoC's address space. The key registers are:

| **Register Name** | **MMIO Address (Hex)** | **Description** |
|------------------------|-----------------------|----------------|
| `GPIO_MODE_USER` | `0x26000000` | Configures whether MPRJ pins are controlled by the **Management Core (0)** or **User Project (1)**. |
| `GPIO_OUT` | `0x26000004` | Sets output values for MPRJ pins (if controlled by Management Core). |
| `GPIO_IN` | `0x26000008` | Reads input values from MPRJ pins. |
| `GPIO_OEB` (Output Enable Bar) | `0x2600000C` | Controls tri-state (input/output) for MPRJ pins (0 = output, 1 = input). |

Note: the lab firmware quoted later in these notes configures pins through per-pin registers (`reg_mprj_io_31`, `reg_mprj_io_30`, ...) followed by `reg_mprj_xfer`, and drives values through `reg_mprj_datal`. Treat the names and addresses in this table as illustrative and confirm them against the Caravel firmware header before use.

2. The **`GPIO_MODE_USER`** register determines whether each MPRJ pin is controlled by the **Management Core** or the **User Project**:
   - **Bit `n` = 0** → MPRJ pin `n` is controlled by the **Management Core**.
   - **Bit `n` = 1** → MPRJ pin `n` is controlled by the **User Project**.
   - Each bit corresponds to one MPRJ pin (e.g., `MPRJ[0]` = Bit 0, `MPRJ[1]` = Bit 1, etc.).

3.
Below is an example firmware code snippet that configures MPRJ[7:0] for the Management Core and sets some pins as outputs:
```
#include <stdint.h>

// MMIO addresses as listed in the table above (confirm against the Caravel firmware header)
#define GPIO_MODE_USER (*(volatile uint32_t*)0x26000000)
#define GPIO_OUT       (*(volatile uint32_t*)0x26000004)
#define GPIO_OEB       (*(volatile uint32_t*)0x2600000C)

void configure_mprj(void) {
    // Set MPRJ[7:0] to be controlled by Management Core (clear bits 7-0)
    GPIO_MODE_USER &= ~0xFF;               // Bits 7-0 = 0 (Management Core control)

    // Set MPRJ[3:0] as outputs (OEB = 0), MPRJ[7:4] as inputs (OEB = 1)
    GPIO_OEB = (GPIO_OEB & ~0x0F) | 0xF0;  // OEB[3:0] = 0 (output), OEB[7:4] = 1 (input)

    // Write values to MPRJ[3:0] (e.g., set MPRJ[0] = 1, others = 0)
    GPIO_OUT = (GPIO_OUT & ~0x0F) | 0x01;  // MPRJ[0] = HIGH
}

int main(void) {
    configure_mprj();
    while (1);   // Keep running
}
```
**Explanation:**
1. **`GPIO_MODE_USER &= ~0xFF`**
   - Clears bits 0-7 → MPRJ[7:0] are now controlled by the **Management Core**.
2. **`GPIO_OEB = (GPIO_OEB & ~0x0F) | 0xF0`**
   - Sets MPRJ[3:0] as **outputs** (`OEB=0`) and MPRJ[7:4] as **inputs** (`OEB=1`).
3. **`GPIO_OUT = (GPIO_OUT & ~0x0F) | 0x01`**
   - Sets **MPRJ[0] = HIGH (1)** and MPRJ[3:1] LOW (0).
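The three statements in the explanation share one read-modify-write pattern: clear the bits selected by a mask, then OR in the new value. A minimal host-side sketch of that pattern follows. `update_bits` is a hypothetical helper (it is not part of the Caravel firmware headers), and it works on a plain value rather than an MMIO register so it can be checked on a PC.

```c
#include <stdint.h>

/* Hypothetical helper: return `reg` with the bits selected by `mask`
 * replaced by the corresponding bits of `value`; all other bits are kept. */
uint32_t update_bits(uint32_t reg, uint32_t mask, uint32_t value)
{
    return (reg & ~mask) | (value & mask);
}
```

With this helper, `update_bits(oeb, 0xFF, 0xF0)` gives the same result as the `GPIO_OEB` line: bits 3:0 become 0 (output) and bits 7:4 become 1 (input), while the upper bits are left untouched.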
## caravel_intr_sram_wb_usrprj_firmware
Q: Explain the procedure/code to move code from spiflash to dff

A: Procedure to copy code from SPI flash to SRAM

A) Linker Script Setup

Define memory regions (SPI flash for storage, SRAM for execution):
```
MEMORY {
    FLASH (rx)  : ORIGIN = 0x10000000, LENGTH = 256K  /* SPI Flash (XIP) */
    SRAM  (rwx) : ORIGIN = 0x00000000, LENGTH = 16K   /* Fast SRAM */
}

SECTIONS {
    .text : {
        *(.text.boot)   /* Early boot code (runs from flash) */
        *(.text*)       /* Main code (executes in place from flash) */
    } > FLASH

    .sram_text : {
        _sram_text_start = .;
        *(.sram_text*)  /* Code marked for SRAM */
        _sram_text_end = .;
    } > SRAM AT > FLASH  /* Load from FLASH, run from SRAM */

    _sram_text_load = LOADADDR(.sram_text);  /* Load address in flash, used by the copy loop */
}
```
B) Copy Function in C
```
#include <stdint.h>

/* Symbols defined by the linker script */
extern uint32_t _sram_text_start, _sram_text_end;
extern uint32_t _sram_text_load;   /* Load address in flash */

void copy_to_sram(void) {
    const uint32_t *src  = &_sram_text_load;
    uint32_t       *dest = &_sram_text_start;
    uint32_t size = &_sram_text_end - &_sram_text_start;   /* length in 32-bit words */

    while (size--) {
        *dest++ = *src++;
    }
}
```
C) Bootloader Code (Runs from Flash)
```
#include <stdint.h>

extern uint32_t _sram_text_start;   /* Defined by the linker script */
void copy_to_sram(void);

void _start(void) {
    // 1. Copy critical code to SRAM
    copy_to_sram();

    // 2. Jump to SRAM-resident code
    void (*sram_main)(void) = (void (*)(void))&_sram_text_start;
    sram_main();
}
```
Q: List the user project interface signals (ref: the user_project wrapper.v). What memory address is used to access the user project?

A:

1. **Power Supply Pins** (Conditional on `USE_POWER_PINS`):
```verilog
vdda1, vdda2   // 3.3V analog supplies
vssa1, vssa2   // Analog grounds
vccd1, vccd2   // 1.8V digital supplies
vssd1, vssd2   // Digital grounds
```
2. **Wishbone Management Interface**:
```verilog
wb_clk_i          // System clock
wb_rst_i          // System reset
wbs_stb_i         // Strobe signal
wbs_cyc_i         // Cycle valid
wbs_we_i          // Write enable
wbs_sel_i[3:0]    // Byte select
wbs_dat_i[31:0]   // Data input
wbs_adr_i[31:0]   // Address input
wbs_ack_o         // Acknowledge
wbs_dat_o[31:0]   // Data output
```
3.
**Logic Analyzer Interface**:
```verilog
la_data_in[127:0]    // Logic analyzer input
la_data_out[127:0]   // Logic analyzer output
la_oenb[127:0]       // Output enable (active low)
```
4. **GPIO Interface**:
```verilog
io_in[`MPRJ_IO_PADS-1:0]    // Digital input pads
io_out[`MPRJ_IO_PADS-1:0]   // Digital output pads
io_oeb[`MPRJ_IO_PADS-1:0]   // Output enable (active low)
```
5. **Analog Interface**:
```verilog
analog_io[`MPRJ_IO_PADS-10:0]   // Direct analog I/O
```
6. **Additional Signals**:
```verilog
user_clock2     // Secondary clock input
user_irq[2:0]   // Interrupt outputs
```

**Memory Addressing:**

The **exact memory address** is **not defined in this wrapper code**. The addressing is determined by:

1. **Caravel Platform Configuration**:
   - The typical base address for user projects in Caravel is **0x3000_0000**.
   - The address range is usually **0x3000_0000 - 0x3FFF_FFFF** (256MB space).
2. **User Project Implementation**:
   - The actual address decoding happens in the user project (e.g., `user_proj_example_counter/gcd.v`).
   - The wrapper simply passes through the 32-bit address bus (`wbs_adr_i[31:0]`).
   - Users implement address decoding within their project to map registers/memory.

The wrapper serves as a pass-through interface. The actual memory mapping is handled by:
1. Caravel's address space allocation
2. The user project's internal address decoding logic
3. The system-on-chip (SoC) configuration in the full chip design

Q: Explain the counter_WB example, Verilog testbench, and firmware C code

A:

1) Module `user_proj_example` serves as the main user project example. It demonstrates how to connect a simple counter module to various interfaces such as the Wishbone bus, logic analyzer, and I/O pads.

**Parameters and Ports**
- **Parameter `BITS`**: Defines the bit-width of the counter.
- **Wishbone Slave Ports**: These ports (`wb_clk_i`, `wb_rst_i`, `wbs_stb_i`, etc.) allow the module to communicate with a Wishbone bus master.
The Wishbone bus is a simple and widely used bus protocol for connecting peripherals in embedded systems.
- **Logic Analyzer Signals**: These signals (`la_data_in`, `la_data_out`, `la_oenb`) connect to a logic analyzer for debugging and monitoring purposes.
- **I/O Pads**: These ports (`io_in`, `io_out`, `io_oeb`) allow the module to interact with external I/O devices.
- **IRQ**: Interrupt request lines (unused in this example).

**Key Functionalities**
- **Wishbone Interface**: The module can read and write data over the Wishbone bus. The `valid` signal indicates when a valid transaction is occurring, and `wstrb` indicates which bytes of the data are valid.
- **Logic Analyzer Interface**: The `la_data_out` signal outputs the counter value to the logic analyzer. The `la_write` signal allows the logic analyzer to control the counter's behavior.
- **Counter Output**: The counter value is output to the I/O pads (`io_out`).

**Module `counter`**

This is a simple counter module that increments its value on each clock cycle unless it is controlled by the Wishbone bus or logic analyzer signals.

**Parameters and Ports**
- **Parameter `BITS`**: Defines the bit-width of the counter.
- **Inputs**: Clock (`clk`), reset (`reset`), valid (`valid`), write strobe (`wstrb`), write data (`wdata`), logic analyzer write signal (`la_write`), and logic analyzer input (`la_input`).
- **Outputs**: Ready signal (`ready`), read data (`rdata`), and counter value (`count`).

**Key Functionalities**
- **Counter Logic**: The counter increments on each clock cycle unless it is reset or controlled by the Wishbone bus or logic analyzer.
- **Wishbone Interface**: The counter can be read from and written to via the Wishbone bus. The `valid` signal indicates when a valid transaction is occurring, and `wstrb` indicates which bytes of the data are valid.
- **Logic Analyzer Interface**: The counter can be controlled by the logic analyzer using the `la_write` and `la_input` signals.
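The counter behavior described above can be sketched as a cycle-by-cycle software model. This is only an approximation under stated assumptions: the priority order (reset, then Wishbone access, then logic-analyzer write, then increment), the one-cycle `ready` pulse, and the struct and function names are all assumptions made for illustration and are not taken from the RTL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Registered state of the counter (updated once per clock edge). */
typedef struct {
    uint32_t count;  /* current counter value */
    uint32_t rdata;  /* value returned on a Wishbone read */
    bool     ready;  /* one-cycle acknowledge */
} counter_state_t;

/* Inputs sampled at the clock edge. */
typedef struct {
    bool     reset;
    bool     valid;     /* Wishbone transaction in progress */
    uint8_t  wstrb;     /* byte-write strobes, bit i selects byte i */
    uint32_t wdata;
    uint32_t la_write;  /* per-bit logic-analyzer write enable */
    uint32_t la_input;
} counter_inputs_t;

/* Advance the model by one clock cycle.
 * Assumed priority: reset > Wishbone access > LA write > increment. */
void counter_step(counter_state_t *s, const counter_inputs_t *in)
{
    if (in->reset) {
        s->count = 0;
        s->ready = false;
        return;
    }

    uint32_t next = s->count;
    if (in->la_write == 0)
        next = s->count + 1;              /* free-running increment */

    if (in->valid && !s->ready) {
        s->rdata = s->count;              /* read returns the pre-edge value */
        for (int i = 0; i < 4; i++) {     /* byte-wise write under wstrb */
            if (in->wstrb & (1u << i)) {
                uint32_t mask = 0xFFu << (8 * i);
                next = (next & ~mask) | (in->wdata & mask);
            }
        }
        s->ready = true;
    } else {
        if (in->la_write != 0)
            next = in->la_write & in->la_input;
        s->ready = false;
    }
    s->count = next;
}
```

Calling `counter_step` once per simulated clock edge shows the three behaviors from the list: the count advances by one when idle, a Wishbone write with `wstrb = 0xF` replaces the whole value, and a nonzero `la_write` loads the masked `la_input`.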
2) This Verilog code is a testbench (`counter_wb_tb`) designed to simulate and verify the functionality of a hardware module, here a counter module interfacing with a Wishbone bus.

**Key Components and Functionalities**

**Clock Generation**
```verilog
always #12.5 clock <= (clock === 1'b0);

initial begin
    clock = 0;
end
```
- A clock signal (`clock`) is generated with a period of 25 ns (12.5 ns high and 12.5 ns low).
- This clock drives the simulation of the hardware module.

**Power-Up Sequence**
```verilog
initial begin
    power1 <= 1'b0;
    power2 <= 1'b0;
    #200;
    power1 <= 1'b1;
    #200;
    power2 <= 1'b1;
end
```
- The power-up sequence initializes the power supply signals (`power1` and `power2`) to 0, waits for 200 ns, sets `power1` to 1, waits another 200 ns, and then sets `power2` to 1.
- This simulates the power-up behavior of the hardware module.

**Reset and Chip Select Signals**
```verilog
initial begin
    RSTB <= 1'b0;
    CSB  <= 1'b1;   // Force CSB high
    #2000;
    RSTB <= 1'b1;   // Release reset
    #100000;
    CSB = 1'b0;     // CSB can be released
end
```
- The reset signal (`RSTB`) is initially set to 0, then released after 2000 ns.
- The chip select signal (`CSB`) is initially forced high to disable the module, then released after a further 100000 ns.

**Monitoring Logic**
```verilog
initial begin
    wait(checkbits == 16'hAB60);
    $display("Monitor: MPRJ-Logic WB Started");
    wait(checkbits == 16'hAB61);
    `ifdef GL
        $display("Monitor: Mega-Project WB (GL) Passed");
    `else
        $display("Monitor: Mega-Project WB (RTL) Passed");
    `endif
    $finish;
end
```
- The testbench monitors the `checkbits` signal for specific values (`16'hAB60` and `16'hAB61`) to determine the status of the module.
- When `checkbits` reaches `16'hAB60`, it indicates that the Wishbone bus logic has started.
- When `checkbits` reaches `16'hAB61`, it indicates that the test has passed.
- The testbench then terminates the simulation using `$finish`.
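The two `checkbits` values line up with what the firmware writes to `reg_mprj_datal` in the test sequence covered later in these notes (`0xAB600000` and `0xAB610000`). A small sketch of that relationship follows. It assumes the testbench taps the upper half of the low GPIO data word, i.e. `checkbits = mprj_io[31:16]`; that connection is not shown in the excerpts above, so treat it as an assumption, and the function name is hypothetical.

```c
#include <stdint.h>

/* Assumption: checkbits mirrors bits [31:16] of the low GPIO data word. */
uint16_t checkbits_from_datal(uint32_t datal)
{
    return (uint16_t)(datal >> 16);
}
```

Under that assumption, writing `0xAB600000` produces the "started" marker `0xAB60` and writing `0xAB610000` produces the "passed" marker `0xAB61`.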
**Timeout Logic**
```verilog
initial begin
    $dumpfile("counter_wb.vcd");
    $dumpvars(0, counter_wb_tb);

    repeat (70) begin
        repeat (1000) @(posedge clock);
        // $display("+1000 cycles");
    end
    $display("%c[1;31m",27);
    `ifdef GL
        $display ("Monitor: Timeout, Test Mega-Project WB Port (GL) Failed");
    `else
        $display ("Monitor: Timeout, Test Mega-Project WB Port (RTL) Failed");
    `endif
    $display("%c[0m",27);
    $finish;
end
```
- The testbench includes a timeout mechanism to ensure the simulation does not run indefinitely.
- If the simulation runs for 70,000 clock cycles without passing, it prints an error message and terminates the simulation.

**SPI Flash Simulation**
```verilog
spiflash #(
    .FILENAME("counter_wb.hex")
) spiflash (
    .csb(flash_csb),
    .clk(flash_clk),
    .io0(flash_io0),
    .io1(flash_io1),
    .io2(),   // not used
    .io3()    // not used
);
```
- The testbench includes an SPI flash model (`spiflash`) to simulate the behavior of the flash memory.
- The flash memory is initialized from a file (`counter_wb.hex`), which contains the compiled firmware that the CPU fetches and executes.

**Caravel Wrapper**
```verilog
caravel uut (
    .clock    (clock),
    .gpio     (gpio),
    .mprj_io  (mprj_io),
    .flash_csb(flash_csb),
    .flash_clk(flash_clk),
    .flash_io0(flash_io0),
    .flash_io1(flash_io1),
    .resetb   (RSTB)
);
```
- The testbench instantiates the `caravel` module (`uut`), the SoC top level that contains the management core and the user project under test.
- At this level the testbench connects only the clock, `gpio`, the `mprj_io` pads, the SPI flash pins, and the reset. The Wishbone bus is internal to the chip.

**Purpose and Usage**
- This testbench is designed to simulate and verify the behavior of a hardware module, particularly its interaction with a Wishbone bus and GPIO pins.
- It includes initialization, power-up sequencing, and monitoring logic to check that the module operates correctly.
- A testbench of this kind is used in the development and verification of a hardware project, such as a custom chip or FPGA design.
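The timeout budget follows from the nested `repeat` counts and the 12.5 ns half-period of the clock. A quick sketch of the arithmetic is below; `timeout_ps` is a hypothetical helper, and picoseconds are used only so that 12.5 ns stays an exact integer.

```c
#include <stdint.h>

/* Timeout budget in picoseconds: outer * inner clock cycles, where one
 * cycle is two half-periods. */
uint64_t timeout_ps(uint32_t outer, uint32_t inner, uint64_t half_period_ps)
{
    return (uint64_t)outer * inner * 2u * half_period_ps;
}
```

For this testbench, `timeout_ps(70, 1000, 12500)` evaluates to 1,750,000,000 ps, so the firmware has roughly 1.75 ms of simulated time to reach the pass marker before the timeout message is printed.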
3) This code is a C program that runs on the Caravel management core. It configures the GPIO pins and exercises the user project through the Wishbone bus.

**Register Definitions**
```c
#define reg_mprj_slave (*(volatile uint32_t*)0x30000000)
```
- This macro defines a memory-mapped register (`reg_mprj_slave`) at address `0x30000000`.
- The `volatile` keyword tells the compiler that the value may change at any time, so accesses to it must not be optimized away.

**Main Function**

The `main` function contains the core logic of the program, which performs the following tasks:

**GPIO Configuration**
```c
reg_mprj_io_31 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_30 = GPIO_MODE_MGMT_STD_OUTPUT;
// ... (similar lines for other GPIO pins)
```
- These lines configure the GPIO pins as standard outputs under management control.
- The `reg_mprj_io_x` registers control the mode of each GPIO pin.
- `GPIO_MODE_MGMT_STD_OUTPUT` is a constant that specifies the mode for standard output.

**Apply Configuration**
```c
reg_mprj_xfer = 1;
while (reg_mprj_xfer == 1);
```
- This code initiates a transfer/configuration operation by setting `reg_mprj_xfer` to 1.
- The `while` loop waits for the transfer to complete by polling the `reg_mprj_xfer` register until it is no longer set.

**Logic Analyzer Configuration**
```c
reg_la2_oenb = reg_la2_iena = 0x00000000; // [95:64]
```
- This line configures the logic analyzer (LA) output enable (`oenb`) and input enable (`iena`) registers for bits [95:64].
- Setting them to `0x00000000` likely disables specific logic analyzer channels or configurations.

**Test Sequence**
```c
reg_mprj_datal = 0xAB600000;
reg_mprj_slave = 0x00002710;
reg_mprj_datal = 0xAB610000;
if (reg_mprj_slave == 0x2B3D) {
    reg_mprj_datal = 0xAB610000;
}
```
- This sequence of operations performs a test on the Wishbone bus:
  1. `reg_mprj_datal = 0xAB600000;` sets a flag indicating the start of the test.
  2.
`reg_mprj_slave = 0x00002710;` writes a value to the Wishbone slave register.
  3. `reg_mprj_datal = 0xAB610000;` sets the next flag. Because the testbench treats `0xAB61` as the pass marker, this write is already enough to end the simulation with a pass.
  4. The `if` statement checks whether the value read from `reg_mprj_slave` matches `0x2B3D`. If it does, it writes the same pass flag again.

**Purpose and Usage**
- This program is designed to test the functionality of a Wishbone bus interface and GPIO pins on a hardware platform.
- It configures the GPIO pins, performs a test sequence, and checks the results to verify correct operation.
- The specific values used (e.g., `0xAB600000`, `0x2B3D`) are likely part of a predefined test protocol.

## caravel soc - lab4-0
Q: Observe Caravel SoC simulation, show waveforms with related signals
![image](https://hackmd.io/_uploads/SyL7yiGklg.png)

A: To explain the timing and interactions in a Caravel SoC simulation, we look at how SPI flash access, CPU Wishbone cycles, Logic Analyzer (LA) interactions, and user project interactions appear in a waveform viewer. Each observation is explained below.

**1. SPI Flash Access & Code Execution (Observe CPU Trace)**

**Waveform Observations:**
- **SPI Flash Signals**: Look for signals related to the SPI flash interface, such as `csb`, `clk`, `io0`, `io1`, etc.
  - `csb` (Chip Select Bar): Active-low signal indicating when the SPI flash is selected.
  - `clk` (Clock): Clock signal driving the SPI communication.
  - `io0`, `io1`: Data lines used for SPI communication.
- **CPU Trace**: Look for signals related to the CPU's instruction fetch and execution.
  - `pc` (Program Counter): Shows the address of the current instruction being executed.
  - `instr` (Instruction Register): Shows the current instruction being decoded/executed.

**Explanation:**
- When the CPU starts executing code, it fetches instructions from the SPI flash.
- The `csb` signal goes low, indicating the SPI flash is selected.
- The `clk` signal toggles, driving the SPI communication.
- Data is transferred over the `io0` and `io1` lines.
- The CPU trace will show the program counter (`pc`) incrementing as instructions are fetched and executed.
- The instruction register (`instr`) will show the current instruction being executed.

**Waveform Example:**
```
Time(ms)  csb  clk  io0   io1   pc          instr
0.000     1    0    -     -     0x00000000  -
0.001     0    0    -     -     0x00000000  -
0.002     0    1    0x01  0x02  -           -
0.003     0    0    0x01  0x02  -           -
0.004     0    1    0x03  0x04  -           -
0.005     1    0    -     -     0x00000004  0x00100000
```
- At `0.001 ms`, `csb` goes low, indicating SPI flash access.
- Data is transferred over `io0` and `io1` during `clk` toggles.
- At `0.005 ms`, the CPU starts executing the fetched instruction (`0x00100000`).

**2. CPU Wishbone Cycles Interaction with User Project Area**

**Waveform Observations:**
- **Wishbone Signals**: Look for signals related to the Wishbone bus.
  - `wb_clk_i`: Clock signal for the Wishbone bus.
  - `wb_rst_i`: Reset signal for the Wishbone bus.
  - `wbs_stb_i`: Strobe signal indicating a valid transaction.
  - `wbs_ack_o`: Acknowledge signal indicating the transaction is complete.
  - `wbs_dat_i`, `wbs_dat_o`: Data lines for writing to and reading from the user project.
- **User Project Signals**: Look for signals related to the user project area.
  - `mprj_io`: GPIO pins used for user project interaction.

**Explanation:**
- The CPU communicates with the user project area via the Wishbone bus.
- The `wbs_stb_i` signal goes high when the CPU initiates a transaction.
- The `wbs_dat_i` and `wbs_dat_o` lines carry data between the CPU and the user project.
- The `wbs_ack_o` signal goes high when the transaction is complete.
- The user project area responds to these transactions, possibly modifying the `mprj_io` signals.
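
The strobe/acknowledge handshake described above can be sketched as a small behavioural model in C. This is only an illustration under stated assumptions, not RTL and not Caravel's implementation: the `WbSlave` struct, the function names, and the fixed `latency` are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slave model: one data register, acknowledges after a
 * fixed number of clock edges (the latency is an assumption). */
typedef struct {
    int latency;   /* edges with stb high before ack is raised */
    int counter;   /* edges counted so far in the current transaction */
    uint32_t reg;  /* the single register behind the bus */
} WbSlave;

/* Advance the slave by one clock edge.
 * Returns true when ack is asserted on this edge. */
bool wb_slave_step(WbSlave *s, bool stb, bool we,
                   uint32_t dat_i, uint32_t *dat_o) {
    if (!stb) {                 /* no transaction: stay idle */
        s->counter = 0;
        return false;
    }
    if (s->counter < s->latency) {
        s->counter++;           /* still busy, ack stays low */
        return false;
    }
    if (we) {
        s->reg = dat_i;         /* write cycle */
    } else {
        *dat_o = s->reg;        /* read cycle */
    }
    s->counter = 0;
    return true;                /* ack */
}

/* Master side: hold stb high until ack is seen, then drop it.
 * Returns the number of clock edges stb was held high. */
int wb_transfer(WbSlave *s, bool we, uint32_t dat_i, uint32_t *dat_o) {
    int edges = 0;
    bool ack = false;
    while (!ack) {
        ack = wb_slave_step(s, true, we, dat_i, dat_o);
        edges++;
    }
    wb_slave_step(s, false, false, 0, dat_o);  /* stb low again */
    return edges;
}
```

With `latency = 2`, a transfer keeps the strobe high for three edges and the acknowledge arrives on the third, which is the same shape as a strobe that rises two samples before `wbs_ack_o`.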

**Waveform Example:**
```
Time(ms)  wb_clk_i  wb_rst_i  wbs_stb_i  wbs_ack_o  wbs_dat_i  wbs_dat_o  mprj_io
0.000     1         0         0          0          -          -          -
0.001     1         0         1          0          0x1234     -          -
0.002     1         0         1          0          0x1234     -          -
0.003     1         0         1          1          0x1234     0x5678     0x0001
0.004     1         0         0          0          -          -          0x0001
```
- At `0.001 ms`, the CPU initiates a transaction (`wbs_stb_i` goes high).
- Data is transferred over `wbs_dat_i` and `wbs_dat_o`.
- At `0.003 ms`, the transaction is complete (`wbs_ack_o` goes high).
- The user project area modifies the `mprj_io` signal based on the transaction.

**3. CPU Interface with User Project with Logic Analyzer (LA)**

**Waveform Observations:**
- **Logic Analyzer Signals**: Look for signals related to the Logic Analyzer (LA).
  - `la_data_in`, `la_data_out`: Data lines into and out of the user project.
  - `la_oenb`: Output enable signal (active low).
- **User Project Signals**: Look for signals related to the user project area.
  - `mprj_io`: GPIO pins used for user project interaction.

**Explanation:**
- The Logic Analyzer lets the CPU monitor and control signals in the user project area.
- The `la_data_in` signal carries data from the CPU (management SoC) into the user project.
- The `la_data_out` signal carries data from the user project back to the CPU.
- The `la_oenb` signal controls the direction per bit: when a bit is low, the CPU drives the corresponding `la_data_in` bit into the user project.
- The user project area responds to these signals, possibly modifying the `mprj_io` signals.

**Waveform Example:**
```
Time(ms)  la_data_in  la_data_out  la_oenb  mprj_io
0.000     -           -            1        -
0.001     0x1234      -            1        -
0.002     0x1234      0x5678       0        0x0001
0.003     0x1234      0x5678       1        0x0001
0.004     -           -            1        0x0001
```
- At `0.001 ms`, `0x1234` appears on `la_data_in` (CPU to user project).
- At `0.002 ms`, `la_oenb` goes low and the user project returns `0x5678` on `la_data_out`.
- The user project area modifies the `mprj_io` signal based on the Logic Analyzer's actions.

**4. User Project/RISC-V Uses `mprj` Pin and Interacts with Testbench**

**Waveform Observations:**
- **User Project Signals**: Look for signals related to the user project area.
  - `mprj_io`: GPIO pins used for user project interaction.
- **Testbench Signals**: Look for signals related to the testbench.
  - `checkbits`: Signals used to monitor the state of the user project.

**Explanation:**
- The user project area uses the `mprj_io` pins to interact with the testbench.
- The testbench monitors the `checkbits` signal to determine the state of the user project.
- The user project modifies the `mprj_io` signals based on its internal logic and interactions with the testbench.

**Waveform Example:**
```
Time(ms)  mprj_io  checkbits
0.000     -        -
0.001     0x0001   0xAB60
0.002     0x0001   0xAB61
0.003     0x0002   0xAB61
0.004     0x0002   0xAB61
```
- At `0.001 ms`, the user project sets `mprj_io` to `0x0001` and the testbench detects `checkbits` as `0xAB60`.
- At `0.002 ms`, the testbench detects `checkbits` as `0xAB61`.
- The user project continues to modify `mprj_io` based on its internal logic.

# Week 7
## superscalar
Q: The example shows RAW dependency. Explain the software and hardware techniques to solve the problem.
![image](https://hackmd.io/_uploads/HyQvHoGygg.png)

A:
**1. Identifying Data Dependencies**

**a) True Dependencies (Read-After-Write - RAW)**
- **(1) → (2):**
  - (1) writes `X1` (load result), (2) reads `X1` → **RAW hazard**.
  - **Critical path:** (2) cannot execute until the data loaded by (1) is available, i.e. after (1)'s **MEM** stage with forwarding, or after its **WB** stage without.
- **(3) → (4):**
  - (3) writes `X3` (load result), (4) reads `X3` → **RAW hazard**.
  - (3) also modifies `X2` (post-increment) and (4) writes `X2`. Since (4) does **not** read `X2`, this is a WAW (output) dependency rather than an additional RAW hazard.

**b) No True Dependency Between (2) and (3)**
- **(2) and (3):**
  - (3) reads nothing that (2) writes, so there is no RAW dependency.
  - (2) reads `X3` and (3) writes `X3`, which is a WAR (anti) dependency. It is harmless in an in-order pipeline, and register renaming removes it in a superscalar core, so the two can execute **in parallel** (if there are no structural hazards).

**2. Pipeline Execution (Assuming 5-Stage Pipeline)**
Key stages: **IF** (Fetch), **ID** (Decode), **EX** (Execute), **MEM** (Memory), **WB** (Writeback).
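
As a quick check on cycle counts for such a pipeline, the usual rule is that a k-stage pipeline needs k cycles for the first instruction and one more for each following instruction, plus one per stall bubble. A minimal sketch in C, assuming a single-issue in-order pipeline (the function name `pipeline_cycles` is made up for this note):

```c
/* Total cycles for `instrs` instructions on a `stages`-deep, single-issue,
 * in-order pipeline that inserts `stall_cycles` bubbles in total.
 * Assumption: every instruction passes through every stage. */
int pipeline_cycles(int stages, int instrs, int stall_cycles) {
    if (instrs <= 0) {
        return 0;  /* nothing to execute */
    }
    /* first instruction fills the pipe, the rest retire one per cycle */
    return stages + (instrs - 1) + stall_cycles;
}
```

For the four instructions in this example, `pipeline_cycles(5, 4, 0)` gives the ideal 8 cycles, and every stall cycle adds one to that figure.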
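
**Without Forwarding (Stalls Required)**
This assumes an in-order, single-issue pipeline whose register file is written in the first half of a cycle and read in the second half, so a consumer can read a register in **ID** during the producer's **WB** cycle.
```
Cycle | (1) LDR X1,[X2] | (2) ADD X1,X1,X3 | (3) LDR X3,[X2],#4 | (4) SUB X2,X3,#1
------|-----------------|------------------|--------------------|-----------------
1     | IF              | -                | -                  | -
2     | ID              | IF               | -                  | -
3     | EX              | ID (stall)       | IF                 | -
4     | MEM             | ID (stall)       | IF (held)          | -
5     | WB              | ID (reads X1)    | IF (held)          | -
6     | -               | EX               | ID                 | IF
7     | -               | MEM              | EX                 | ID (stall)
8     | -               | WB               | MEM                | ID (stall)
9     | -               | -                | WB                 | ID (reads X3)
10    | -               | -                | -                  | EX
11    | -               | -                | -                  | MEM
12    | -               | -                | -                  | WB
```
- **Stalls:** 2 cycles for (2), 2 cycles for (4).
- **Total cycles:** 12 (vs. ideal 8 without hazards).

**With Forwarding (Bypassing)**
Forwarding paths allow `EX/MEM` or `MEM/WB` results to be used without waiting for writeback. Load data only exists after the **MEM** stage, so an instruction that uses a loaded value immediately after the load still loses one cycle:
- (2) gets `X1` from (1)'s **MEM/WB** stage.
- (4) gets `X3` from (3)'s **MEM/WB** stage.
```
Cycle | (1) LDR X1,[X2] | (2) ADD X1,X1,X3  | (3) LDR X3,[X2],#4 | (4) SUB X2,X3,#1
------|-----------------|-------------------|--------------------|------------------
1     | IF              | -                 | -                  | -
2     | ID              | IF                | -                  | -
3     | EX              | ID                | IF                 | -
4     | MEM             | ID (stall)        | IF (held)          | -
5     | WB              | EX (X1 forwarded) | ID                 | IF
6     | -               | MEM               | EX                 | ID
7     | -               | WB                | MEM                | ID (stall)
8     | -               | -                 | WB                 | EX (X3 forwarded)
9     | -               | -                 | -                  | MEM
10    | -               | -                 | -                  | WB
```
- **Stalls:** 1 cycle for (2), 1 cycle for (4). Forwarding shortens the RAW stalls, but a load-use pair still costs one cycle.
- **Total cycles:** 10 (vs. 12 without forwarding and the ideal 8).
- **Software technique:** the compiler can schedule an independent instruction between a load and its first use. Here (3) cannot simply move ahead of (2), because (2) still needs the old `X3`; the registers would have to be renamed first.

**3. Key Signals in Waveform**
To observe this in a waveform:

**RAW Hazard (Without Forwarding)**
- **Signal:** `pipeline_stall` (asserted while (2) or (4) waits for data).
- **Register File:** `X1`/`X3` update in **WB** stage.

**Forwarding in Action**
- **Signal:** `forward_MEM_WB` (load data from (1)'s MEM/WB register → (2)'s EX).
- **Effect:** `X1` reaches (2)'s ALU without waiting for the register file write.

**Example Waveform Snippet (Textual)**
```
Cycle 4: (1) MEM: X1 = [X2] (load data becomes available)
         (2) held in ID (load-use stall)
Cycle 5: (2) EX: ADD uses forwarded X1
Cycle 7: (3) MEM: X3 = [X2] (X2 is incremented by 4 afterwards)
Cycle 8: (4) EX: SUB uses forwarded X3
```

**4.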
Memory Access Implications**
- **(1) and (3) both read from `[X2]`:**
  - If (1) and (3) issue in the same cycle on a superscalar core, the data memory needs **2 read ports** (or one of them stalls if it is single-ported).
- **Post-increment in (3):** `X2` is updated **after** the load (no impact on (1)).

**5. Summary of Dependencies**

| Instruction Pair | Dependency Type | Hazard?              | Resolution                          |
|------------------|-----------------|----------------------|-------------------------------------|
| (1) → (2)        | RAW (X1)        | Yes                  | Forwarding/Stall                    |
| (3) → (4)        | RAW (X3)        | Yes                  | Forwarding/Stall                    |
| (2) ↔ (3)        | WAR (X3)        | No (in-order issue)  | Register renaming, then run in parallel |

---

Q: Identify the load-to-use hazards, and reorder the instructions to eliminate them.
![image](https://hackmd.io/_uploads/H1WxwiMyex.png)

A:
**1. Identifying Load-to-Use Hazards**
A load-to-use hazard occurs when a value loaded from memory (via `lw`) is used by the **immediately following** instruction, before the load has completed. In the provided code:
```asm
1. lw r1, b          ; Load b → r1
2. lw r2, e          ; Load e → r2
3. add r3, r1, r2    ; Load-use hazard: r2 is loaded by the previous instruction
4. sw r3, a
5. lw r4, f          ; Load f → r4
6. add r5, r1, r4    ; Load-use hazard: r4 is loaded by the previous instruction
7. sw r5, c
```
**Hazards Identified (assuming a pipeline with forwarding):**
- **Between `lw r2, e` (2) → `add r3, r1, r2` (3):** `r2` is used immediately after loading, which costs one stall cycle.
- **Between `lw r4, f` (5) → `add r5, r1, r4` (6):** `r4` is used immediately after loading, which costs one stall cycle.
- **Between `lw r1, b` (1) → `add r3, r1, r2` (3):** this is a RAW dependency, but `lw r2` sits between the two instructions, so `r1` arrives in time. It only stalls on a pipeline without forwarding.

---

**2. Reordering Instructions to Eliminate Hazards**
To resolve these hazards, reorder the instructions so that an **independent operation** sits between each `lw` and its dependent `add`.

**Original Code (Hazard-Prone):**
```asm
1. lw r1, b
2. lw r2, e
3. add r3, r1, r2   ← Hazard: uses r2 too soon
4. sw r3, a
5. lw r4, f
6. add r5, r1, r4   ← Hazard: uses r4 too soon
7. sw r5, c
```
**Optimized Code (No Hazards):**
```asm
1. lw r1, b          ; Load b → r1
2. lw r2, e          ; Load e → r2
3. lw r4, f          ; Load f → r4 (moved up)
4. add r3, r1, r2    ; Now safe: lw r4 separates it from lw r2
5. sw r3, a
6. add r5, r1, r4    ; Now safe: two instructions after lw r4
7. sw r5, c
```
**Key Changes:**
- Moved `lw r4, f` (5) to position 3.
- This creates:
  - A **one-instruction gap** between `lw r2` (2) and `add r3` (4), and a two-instruction gap after `lw r1` (1).
  - A **two-instruction gap** between `lw r4` (3) and `add r5` (6).

---

**Pipeline Behavior (With Forwarding)**
Assuming a 5-stage pipeline with forwarding, the numbers below are the cycle in which each instruction occupies each stage:
```
Instruction      | IF | ID | EX | MEM | WB | Notes
-----------------|----|----|----|-----|----|--------------------------------------
lw  r1, b        | 1  | 2  | 3  | 4   | 5  |
lw  r2, e        | 2  | 3  | 4  | 5   | 6  |
lw  r4, f        | 3  | 4  | 5  | 6   | 7  |
add r3, r1, r2   | 4  | 5  | 6  | 7   | 8  | r2 forwarded from MEM/WB
sw  r3, a        | 5  | 6  | 7  | 8   | 9  | r3 forwarded from the add
add r5, r1, r4   | 6  | 7  | 8  | 9   | 10 | r4 already written back in cycle 7
sw  r5, c        | 7  | 8  | 9  | 10  | 11 | r5 forwarded from the add
```
**Why It Works:**
- **Forwarding** delivers a loaded value to `EX` one cycle after the load's `MEM` stage, without waiting for writeback.
- The reordering places at least one independent instruction between each load and its first use, so all 7 instructions finish in 11 cycles with no stalls (the original order needs 13).

---

Q: Identify the RAW, WAW, and WAR dependencies. Use register renaming to eliminate the WAW and WAR dependencies.
![image](https://hackmd.io/_uploads/HJmVFiMyxe.png)

A:
To resolve the **WAW (Write-After-Write)** and **WAR (Write-After-Read)** hazards in the given code using **register renaming**, each destination register is given a unique temporary name so that the name conflicts disappear. Here is the step-by-step solution:

**Original Code (Hazards Highlighted)**
```asm
1. add r1, r2, r3    ; Writes r1 (WAW with 3); reads r3 (WAR with 2)
2. sub r3, r2, r1    ; Reads r1 (RAW from 1; WAR with 3); writes r3
3. mul r1, r2, r3    ; Reads r3 (RAW from 2); writes r1 (WAW with 1)
4. div r2, r1, r3    ; Reads r1 and r3 (RAW from 3 and 2); writes r2 (WAR with 1, 2, 3)
```
**Step 1: Identify Hazards**
1. **RAW Hazards (true dependencies):**
   - `sub` (2) reads `r1` written by `add` (1).
   - `mul` (3) reads `r3` written by `sub` (2).
   - `div` (4) reads `r1` written by `mul` (3) and `r3` written by `sub` (2).
2. **WAW Hazards:**
   - `add r1` (1) and `mul r1` (3) both write to `r1`.
3. **WAR Hazards:**
   - `sub r3` (2) writes `r3` after `add` (1) reads it.
   - `mul r1` (3) writes `r1` after `sub` (2) reads it.
   - `div r2` (4) writes `r2` after `add` (1), `sub` (2), and `mul` (3) read it.
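
The classification in Step 1 comes down to comparing register names between an earlier and a later instruction. A minimal sketch in C, assuming three-operand instructions of the form `dst = src1 op src2`; the `Instr` struct, the `DEP_*` flags, and `classify` are hypothetical names introduced only for illustration:

```c
/* One three-operand instruction: dst = src1 op src2 (register numbers). */
typedef struct {
    int dst;
    int src1;
    int src2;
} Instr;

/* Bit flags so that one pair of instructions can carry several dependencies. */
enum { DEP_RAW = 1, DEP_WAW = 2, DEP_WAR = 4 };

/* Compare register names of an earlier and a later instruction.
 * Note: this is a pairwise name check only. It does not look at
 * instructions in between, which may overwrite the register first. */
int classify(Instr earlier, Instr later) {
    int deps = 0;
    if (later.src1 == earlier.dst || later.src2 == earlier.dst) {
        deps |= DEP_RAW;   /* later reads what earlier writes */
    }
    if (later.dst == earlier.dst) {
        deps |= DEP_WAW;   /* both write the same register */
    }
    if (later.dst == earlier.src1 || later.dst == earlier.src2) {
        deps |= DEP_WAR;   /* later overwrites what earlier reads */
    }
    return deps;
}
```

Encoding `add r1, r2, r3` as `{1, 2, 3}` and `mul r1, r2, r3` as `{1, 2, 3}`, `classify` reports only `DEP_WAW` for that pair, which is the name conflict on `r1` between instructions 1 and 3. Renaming removes the `DEP_WAW` and `DEP_WAR` bits but can never remove `DEP_RAW`, because that bit reflects real data flow.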

**Step 2: Apply Register Renaming**
Give every destination register a fresh temporary name (e.g., `r1 → r1'`, `r3 → r3'`), and make later reads use the newest name:
```asm
1. add r1', r2, r3      ; Rename r1 → r1'
2. sub r3', r2, r1'     ; Rename r3 → r3'; reads r1'
3. mul r1'', r2, r3'    ; Rename r1 → r1''; reads r3'
4. div r2', r1'', r3'   ; Rename r2 → r2'; reads r1'' and r3'
```
**Key Outcomes**
1. **WAW Eliminated:**
   - `r1'` and `r1''` are distinct, so no two writes target the same register.
2. **WAR Eliminated:**
   - Each write goes to a new name (`r3'`, `r1''`, `r2'`), so it can no longer overwrite a value that an earlier instruction still has to read (`r3`, `r1'`, `r2`).
3. **RAW Remains:**
   - The true dependencies (1) → (2) → (3) → (4) are real data flow and are not removed by renaming.

**Final Code (No WAW/WAR Hazards)**
```asm
1. add r1', r2, r3      ; r1' = r2 + r3
2. sub r3', r2, r1'     ; r3' = r2 - r1'
3. mul r1'', r2, r3'    ; r1'' = r2 * r3'
4. div r2', r1'', r3'   ; r2' = r1'' / r3'
```

# Week 8
## DMA
![image](https://hackmd.io/_uploads/Hyb4hiMkxe.png)

The correct order for adding a new descriptor to a Scatter-Gather DMA descriptor list while the DMA engine is running is:

**Correct Sequence: C → A → B**

1. **C) Prepare the new descriptor**:
   - First, fully initialize the new descriptor in memory (e.g., set source/destination addresses, control fields).
   - Until this step is complete, the new descriptor is invalid and must not be linked.
2. **A) Fill the "MM2S_NXTDESC" field in the current descriptor**:
   - Update the current descriptor's `MM2S_NXTDESC` pointer to point to the **new descriptor**.
   - This links the new descriptor into the chain, so the DMA engine can move on to it after processing the current descriptor.
3. **B) Set the Tail Descriptor Pointer (10h-14h)**:
   - Finally, update the Tail Pointer register to point to the new descriptor.
   - This signals to the DMA engine that a new descriptor has been added to the list.

## Interrupt
![image](https://hackmd.io/_uploads/S1Q4piMJlg.png)

Timeline B allows preemption. The high-priority interrupt interrupts the current task, demonstrating:
1. **Immediate Response**: The interrupt handler runs before the task resumes.
2. **Nested Interrupts**: Higher-priority interrupts can preempt lower-priority ones (if supported).
3. **Context Switching**: The system saves/restores the task's state.

## Timer
![image](https://hackmd.io/_uploads/HJdyRizylx.png)

To generate an interrupt with a frequency of **16 Hz** using a **15 MHz clock**, follow these steps:

**1. Key Formula**
The timer reload value is calculated as:
Start Value = (Clock Frequency / Desired Interrupt Frequency) - 1

**2. Plug in Values**
- Clock Frequency = 15 MHz = 15,000,000 Hz
- Desired Interrupt Frequency = 16 Hz
- Start Value = 15,000,000 / 16 - 1 = 937,500 - 1 = 937,499

**3. Explanation**
- The timer counts down from the start value to 0, so **937,500 clock cycles** (i.e., 937,499 + 1) elapse per interrupt.
- Time per cycle = 1 / 15 MHz ≈ 66.67 ns
- Total time for one interrupt: 937,500 / 15,000,000 = 0.0625 seconds = 1/16 s

**Answer**
**Start Value = 937,499** (in hexadecimal: `0xE4E1B`).

# Week 9
## lab verilog fir

| Rule# | Issue Statement | Pass/Fail | Code Snippet |
| -------- | -------- | -------- | ------- |
| 1 | Design should not be customized for testbench | Fail | Ignore register address 'd20' |
| 2 | Do not use specific hardcoded constant in design | Pass | |
| 3 | Mul/Add in separate pipeline cycle | Pass | |
| 4 | Do not use DSP | Pass | |
| 5 | Coding should be concise | Fail | Three FSM code |
| 6 | Avoid Faulty Logic - Not qualify by control signal | Pass | |
| 7 | ap_start, ap_idle, ap_done should be separately controlled | Pass | |
| 8 | Avoid input to output path | Pass | |
| 9 | AXI bus signals should not be in the FSM | Pass | |
| 10 | Should not Lock-step on Xin (ss_tready), Yout (sm_tvalid) | Pass | |
| 11 | Avoid Unnecessary registers/latches | Pass | |
| 12 | What is your design II, i.e. Y output rate? | 266.67 Mbps | |

# Week 10
## Cache
![image](https://hackmd.io/_uploads/SJdsxhzyge.png)

**1. Total Cache Size Calculation**
- **Cache Line Size**: 64 bytes
- **Total Cache Lines**: 4K (4,096)
- **Total Cache Size**: 4096 lines × 64 B/line = 256 KB

**2.
Tag Address Range Calculation**
For a **4-way set-associative cache**:
- **Number of Sets**: 4096 / 4 = 1024 sets
- **Set Index Bits**: log2(1024) = 10 bits
- **Block Offset Bits**: log2(64 B) = 6 bits
- **Tag Bits**: 32 - 10 - 6 = 16 bits (assuming a 32-bit address)

**Summary**

| **Parameter**        | **Value**          |
|----------------------|--------------------|
| Total Cache Size     | 256 KB             |
| Tag Bits (32-bit)    | 16 bits (MSBs)     |
| Set Index Bits       | 10 bits            |
| Block Offset Bits    | 6 bits             |

With this split, the tag occupies address bits [31:16], the set index bits [15:6], and the block offset bits [5:0].

---

![image](https://hackmd.io/_uploads/H1fpX2zJxl.png)

**Solution Table**

| **Operation** | **Bus Operation** | **Data Supplier** | **Cache States After (P1$, P2$, P3$)** | **Memory Content (X)** |
|-------------------------|-----------------------------|-------------------|----------------------------------------|------------------------|
| **Initial State** | - | - | I, I, I | 0 |
| **P1 load X** | Read | Memory | E, I, I | 0 |
| **P2 load X** | Read | P1$ | S, S, I | 0 |
| **P1 store X (1)** | Read-Exclusive-Invalidate | P1$ | M, I, I | 0 |
| **P3 load X** | Read | P1$ | S, I, S | 1 (updated by P1$) |
| **P3 store X (2)** | Read-Exclusive-Invalidate | P3$ | I, I, M | 1 |
| **P2 load X again** | Read | P3$ | I, S, S | 2 (updated by P3$) |
| **P1 load X again** | Read | Memory | S, S, S | 2 |

---

**Explanation**
1. **P1 load X**:
   - **Bus Operation**: Read (no cache has X).
   - **Data Supplier**: Memory → P1$ transitions to **Exclusive (E)**.
2. **P2 load X**:
   - **Bus Operation**: Read (P1$ has X in E).
   - **Data Supplier**: P1$ → Both P1$ and P2$ transition to **Shared (S)**.
3. **P1 store X (1)**:
   - **Bus Operation**: Read-Exclusive-Invalidate (P1$ needs exclusive access).
   - **Data Supplier**: P1$ (already holds the data in S).
   - P2$ is invalidated (I), P1$ transitions to **Modified (M)**.
4. **P3 load X**:
   - **Bus Operation**: Read (P1$ has X in M).
   - **Data Supplier**: P1$ → P1$ writes back to memory (X=1).
   - P1$ and P3$ transition to **Shared (S)**.
5. **P3 store X (2)**:
   - **Bus Operation**: Read-Exclusive-Invalidate (P3$ needs exclusive access).
   - **Data Supplier**: P3$ (already holds the data in S).
   - P1$ is invalidated (I), P3$ transitions to **Modified (M)**.
6. **P2 load X again**:
   - **Bus Operation**: Read (P3$ has X in M).
   - **Data Supplier**: P3$ → P3$ writes back to memory (X=2).
   - P2$ and P3$ transition to **Shared (S)**.
7. **P1 load X again**:
   - **Bus Operation**: Read (memory now has X=2).
   - **Data Supplier**: Memory → All caches (P1$, P2$, P3$) in **Shared (S)**.

## Memory
![image](https://hackmd.io/_uploads/B1kfH3fyel.png)

| **Design Technique** | **Category** | **Explanation** |
|------------------------------------------|--------------|---------------------------------------------------------------------------------|
| 1. Reduce memory access latency | C | Reduces the **miss penalty** by speeding up memory accesses. |
| 2. Higher cache associativity | B | Reduces conflict misses, **increasing hit rate**. |
| 3. Processor multithreading, OoO execution | C | Hides miss latency via parallelism, **reducing effective miss penalty**. |
| 4. Bigger cache | B | Reduces capacity misses, **increasing hit rate**. |
| 5. Virtually indexed physically tagged (VIPT) | A | Reduces **hit time** by parallelizing TLB and cache access. |
| 6. Prefetch | B | Anticipates data needs, **increasing hit rate** by reducing misses. |
| 7. Non-blocking caches | C | Allows overlapping misses, **reducing effective miss penalty**. |
| 8. Critical word first | C | Prioritizes needed data, **reducing effective miss penalty**. |
| 9. Read bypass write | C | Lets a read miss go ahead of buffered writes, **reducing miss penalty**. |

---

![image](https://hackmd.io/_uploads/SkkTS2Mkex.png)

Code A has higher performance. This is because:
1. **Cache Efficiency**:
   - Row-major storage places consecutive elements of a row contiguously in memory.
   - Accessing elements sequentially (e.g., `arr[i][j] → arr[i][j+1]`) leverages **spatial locality**, reducing cache misses.
2. **Default Layout**:
   - C/C++ arrays are stored in row-major order, and hardware prefetchers work best on this sequential access pattern.
   - Column-wise access (e.g., `arr[i][j] → arr[i+1][j]`) results in **strided memory access**, causing frequent cache misses and degraded performance.

---

![image](https://hackmd.io/_uploads/ryj6rhGkge.png)

**Answer Table**

| **Parameter** | **Value** |
|---------------------------|------------------------------------|
| Memory Channels | 4 (using 2 bits: bits [7:6]) |
| DIMMs per Channel | 2 (using 1 bit) |
| Banks per DIMM | 8 (using 3 bits) |
| CPU Address Bit for Channel Decoding | **Bits 6 and 7** (for 4-channel interleaving) |

**Optimal CPU Address Bit for Channel Decoding**
For a **256B dataset accessed sequentially**, maximize bandwidth by interleaving data across channels. Use **bits 6 and 7** (assuming 4 channels) for channel decoding.

**Reasoning**:
- Sequential addresses increment by 1 (byte-addressable).
- Interleave at **64B granularity** (cache line size): 64 = 2^6, so bits [7:6] cycle through the 4 channels (00, 01, 10, 11).
- The 256B dataset consists of four 64B blocks, so consecutive blocks map to four different channels, maximizing parallelism.

# Week 11
## Lab 4-2
![image](https://hackmd.io/_uploads/ryo-b2fkxg.png)

**Waveform B** indicates that the **CPU (SW) is faster than HW** on the input path (SS bus), as SW must wait for HW to become ready. This reflects a scenario where SW can generate data faster than HW can process it.