# Analyze PicoRV32

> contributed by [Zheng-Xian Li](https://github.com/garyparrot) and [Xiang-Jun Sun](https://github.com/qoo332001)

## Introduction

[PicoRV32](https://github.com/YosysHQ/picorv32) is a CPU core that implements the [RISC-V RV32IMC instruction set](https://riscv.org/). It can be configured as an RV32E, RV32I, RV32IC, RV32IM, or RV32IMC core, and optionally contains a built-in interrupt controller.

We will use [Verilator](https://verilator.org/guide/latest/index.html) to simulate [PicoRV32](https://github.com/YosysHQ/picorv32), write a testbench to verify it, and use GTKWave to observe its waveforms. After verification, we will use [Yosys](https://github.com/YosysHQ/yosys) to synthesize PicoRV32 into a gate-level netlist.

## Set up Environment

### Get PicoRV32

```shell=
git clone https://github.com/YosysHQ/picorv32.git
```

### Install and build Verilator

```shell=
git clone https://github.com/verilator/verilator
unset VERILATOR_ROOT   # the "make install" flow expects VERILATOR_ROOT to be unset
cd verilator
git pull               # make sure the checkout is up to date
autoconf               # create the ./configure script
./configure
make -j `nproc`
sudo make install
```

### Install and build Yosys

```shell=
sudo apt-get install build-essential clang bison flex \
    libreadline-dev gawk tcl-dev libffi-dev git \
    graphviz xdot pkg-config python3 libboost-system-dev \
    libboost-python-dev libboost-filesystem-dev zlib1g-dev
git clone https://github.com/YosysHQ/yosys.git
cd yosys
make config-clang      # or: make config-gcc (pick one compiler, not both)
make
make test
sudo make install
```

## PicoRV32 Source Code Tour

Goal of this section:
* Provide a bird's-eye view of the CPU implementation.
* Explain the data path from the source code.
* Explain how an RV32I instruction is processed, using the source code.
* Show **what trade-offs were made to keep this CPU design simple**.

To follow this section you need to know Verilog.

> If you happened to graduate from a university that doesn't teach Verilog (just like mine), go pay [this website](https://www.asic-world.com/verilog/veritut.html) a visit. You won't regret it.
> [name=Zheng-Xian Li]

### Code Structure `picorv32.v`

Run `git clone https://github.com/YosysHQ/picorv32.git` to obtain the PicoRV32 source code; the core lives in `picorv32.v`. Looking at its content, we can see that `picorv32.v` defines many `module`s, each serving a [different purpose](https://github.com/YosysHQ/picorv32#picorv32v): variants of the CPU with different bus interfaces (AXI4-Lite, Wishbone), an adapter, and PCPI co-processors for multiplication and division. The `module` we care about is `picorv32`, and we will drill down into it.

```verilog
// The PicoRV32 CPU implementation
module picorv32
    // ...
endmodule

module picorv32_regs
    // ...
endmodule

// A PCPI core that implements the MUL[H[SU|U]] instructions
module picorv32_pcpi_mul
    // ...
endmodule

// A version of picorv32_pcpi_fast_mul using a single cycle multiplier
module picorv32_pcpi_fast_mul
    // ...
endmodule

// A PCPI core that implements the DIV[U]/REM[U] instructions
module picorv32_pcpi_div
    // ...
endmodule

// The version of the CPU with AXI4-Lite interface
module picorv32_axi
    // ...
endmodule

// Adapter from PicoRV32 Memory Interface to AXI4-Lite
module picorv32_axi_adapter
    // ...
endmodule

// The version of the CPU with Wishbone Master interface
module picorv32_wb
    // ...
endmodule
```

### Analyze the content inside the `picorv32` module

Peeking into the implementation of `picorv32`, we can see that this module is organized into the following sections.

![](https://i.imgur.com/PiGsjrJ.png)

* Module Parameter.
    * PicoRV32 offers very extensive configuration. [Here](https://github.com/YosysHQ/picorv32#verilog-module-parameters) is the list of all module parameters, such as counters, a dual-port register file, a two-cycle ALU/compare, and more. This configurability makes PicoRV32 suitable for many situations and easy to integrate with other components.
* Module Input/Output.
    * Defines all the I/O ports connected to the external environment.
* Memory Interface
    * Verilog code related to memory access.
* Instruction Decoder
    * Decodes the fetched instruction.
* Main State Machine
    * Processes the decoded instruction.
    * Lets the ALU calculate the result.
    * Schedules memory accesses.
    * Manages IRQs, traps, and so on.

#### The Memory Interface

This part defines the memory read/write logic.

![](https://i.imgur.com/asrODFQ.png)

```verilog=
reg [1:0] mem_state;
reg [1:0] mem_wordsize;
reg [31:0] mem_rdata_word;
reg [31:0] mem_rdata_q;
// reg and wire definitions...

always @(posedge clk) begin
    // [1] Module reset action.
end

always @* begin
    // [2] Transform the read/write data format
    //     (e.g. a 1-byte read into a 32-bit register).
end

always @(posedge clk) begin
    // [3] Compressed-instruction related code.
end

always @(posedge clk) begin
    // [4] Assertions.
end

always @(posedge clk) begin
    // [5] The actual memory fetching/storing logic.
    //     A small state machine lives here, so a
    //     memory access may take several cycles.
end
```

This section contains five `always` blocks. Each one maintains a different aspect of the state: one for assertions, one for compressed-instruction handling, and so on. In this report we ignore the compressed-instruction part and focus on the interaction between instruction execution and data flow.

The second `always` block contains the following:

```verilog=
always @* begin
    // Combinational block: re-evaluated whenever `mem_wordsize` changes
    // (sb/sh/sw/lb/lh/lw set it) or any other signal it reads changes.
    // It aligns the store data/strobes and the loaded data (`mem_rdata_word`).

    // (* ... *) is a Verilog attribute. An attribute specifies special
    // properties of a Verilog object or statement for use by specific
    // tools, such as synthesis. Attributes were added in Verilog-2001.
    (* full_case *)
    // mem_wordsize is a 2-bit register; the cases are 0 (word),
    // 1 (half-word) and 2 (byte)
    case (mem_wordsize)
        0: begin
            mem_la_wdata = reg_op2;
            mem_la_wstrb = 4'b1111;
            mem_rdata_word = mem_rdata;
        end
        1: begin
            mem_la_wdata = {2{reg_op2[15:0]}};
            mem_la_wstrb = reg_op1[1] ? 4'b1100 : 4'b0011;
            case (reg_op1[1])
                1'b0: mem_rdata_word = {16'b0, mem_rdata[15: 0]};
                1'b1: mem_rdata_word = {16'b0, mem_rdata[31:16]};
            endcase
        end
        2: begin
            mem_la_wdata = {4{reg_op2[7:0]}};
            mem_la_wstrb = 4'b0001 << reg_op1[1:0];
            case (reg_op1[1:0])
                // read a single byte out of `mem_rdata`
                2'b00: mem_rdata_word = {24'b0, mem_rdata[ 7: 0]};
                2'b01: mem_rdata_word = {24'b0, mem_rdata[15: 8]};
                2'b10: mem_rdata_word = {24'b0, mem_rdata[23:16]};
                2'b11: mem_rdata_word = {24'b0, mem_rdata[31:24]};
            endcase
        end
    endcase
end
```

The `@*` sensitivity list means the block is re-evaluated whenever any signal it reads changes; in other words, it describes combinational logic. The body is one big `case` statement: depending on whether `mem_wordsize` is `0`, `1` or `2`, the corresponding code path is taken. Essentially, it aligns data of different sizes.

* `mem_la_wdata` and `mem_la_wstrb` are related to the memory write instructions (`sw`, `sh`, `sb`). `reg_op2` holds the value of the second source register (`rs2`) of a store instruction (Fig 1). We can confirm this from (Code 1), copied from the CPU state machine. Both values are used later in the fifth `always` block, where `mem_la_wdata` (the data to be written) and `mem_la_wstrb` (the byte-enable strobes) are copied to the `picorv32` module outputs `mem_wdata` and `mem_wstrb`.
    > `la` stands for *look-ahead*: according to the PicoRV32 README, the look-ahead interface provides information about the next memory transfer one clock cycle earlier than the normal interface.
* `mem_rdata_word` becomes the complete (zero-extended) word of the data being read, so this is where half-word and byte loads are handled.

![](https://i.imgur.com/CHWQjq6.png)
<center>(Fig 1) The store instruction</center>

```verilog=
cpu_state_ld_rs1: begin  // "load rs1" state: read the source register(s)
    reg_op1 <= 'bx;
    reg_op2 <= 'bx;

    (* parallel_case *)
    case (1'b1)
        (CATCH_ILLINSN || WITH_PCPI) && instr_trap: begin
            // taken when the core does not implement the instruction itself
            // (trap on an illegal instruction, or hand it off over PCPI)
            if (WITH_PCPI) begin
                `debug($display("LD_RS1: %2d 0x%08x", decoded_rs1, cpuregs_rs1);)
                reg_op1 <= cpuregs_rs1;
                dbg_rs1val <= cpuregs_rs1;
                dbg_rs1val_valid <= 1;
                if (ENABLE_REGS_DUALPORT) begin
                    pcpi_valid <= 1;
                    `debug($display("LD_RS2: %2d 0x%08x", decoded_rs2, cpuregs_rs2);)
                    reg_sh <= cpuregs_rs2;
                    reg_op2 <= cpuregs_rs2;
                    dbg_rs2val <= cpuregs_rs2;
                    // ...
```
<center>(Code 1) Where reg_op1 and reg_op2 are updated.</center>

(Code 1) shows how `reg_op1` and `reg_op2` are loaded from the register file (`cpuregs_rs1`, `cpuregs_rs2`) when `ENABLE_REGS_DUALPORT` is enabled. If it is not enabled, loading both operands takes two CPU states.

```verilog=
always @(posedge clk) begin
    if (!resetn || trap) begin
        // clear state on reset or trap
        if (!resetn)
            mem_state <= 0;
        if (!resetn || mem_ready)
            mem_valid <= 0;
        mem_la_secondword <= 0;
        prefetched_high_word <= 0;
    end else begin
        // `mem_la_read` and `mem_la_write` request a memory operation;
        // they are driven by sb/sh/sw or lb/lh/lw (and, for reads, by instruction fetches)
        if (mem_la_read || mem_la_write) begin
            // copy `mem_la_addr` to `mem_addr`, an output port of this CPU
            mem_addr <= mem_la_addr;
            mem_wstrb <= mem_la_wstrb & {4{mem_la_write}};
        end
        if (mem_la_write) begin
            mem_wdata <= mem_la_wdata;
        end

        // the memory state machine
        case (mem_state)
            0: begin
                if (mem_do_prefetch || mem_do_rinst || mem_do_rdata) begin
                    mem_valid <= !mem_la_use_prefetched_high_word;
                    mem_instr <= mem_do_prefetch || mem_do_rinst;
                    mem_wstrb <= 0;
                    mem_state <= 1;
                end
                if (mem_do_wdata) begin
                    mem_valid <= 1;
                    mem_instr <= 0;
                    mem_state <= 2;
                end
            end
            1: begin
                `assert(mem_wstrb == 0);
                `assert(mem_do_prefetch || mem_do_rinst || mem_do_rdata);
                `assert(mem_valid == !mem_la_use_prefetched_high_word);
                `assert(mem_instr == (mem_do_prefetch || mem_do_rinst));
                // the memory read has completed
                if (mem_xfer) begin
                    if (COMPRESSED_ISA && mem_la_read) begin
                        mem_valid <= 1;
                        mem_la_secondword <= 1;
                        if (!mem_la_use_prefetched_high_word)
                            mem_16bit_buffer <= mem_rdata[31:16];
                    end else begin
                        mem_valid <= 0;
                        mem_la_secondword <= 0;
                        if (COMPRESSED_ISA && !mem_do_rdata) begin
                            if (~&mem_rdata[1:0] || mem_la_secondword) begin
                                mem_16bit_buffer <= mem_rdata[31:16];
                                prefetched_high_word <= 1;
                            end else begin
                                prefetched_high_word <= 0;
                            end
                        end
                        mem_state <= mem_do_rinst || mem_do_rdata ? 0 : 3;
                    end
                end
            end
            2: begin
                `assert(mem_wstrb != 0);
                `assert(mem_do_wdata);
                // the memory write has completed
                if (mem_xfer) begin
                    mem_valid <= 0;
                    mem_state <= 0;
                end
            end
            3: begin
                `assert(mem_wstrb == 0);
                `assert(mem_do_prefetch);
                if (mem_do_rinst) begin
                    mem_state <= 0;
                end
            end
        endcase
    end

    if (clear_prefetched_high_word)
        prefetched_high_word <= 0;
end
```
<center>(Code 2) The Memory Interface State Machine</center>

(Code 2) depicts a small memory-operation state machine inside `PicoRV32`, with four states.

* The state machine is evaluated on every rising clock edge.
* Before the state machine itself runs, any scheduled memory read/write is driven onto the CPU's memory output ports first:

```verilog
if (mem_la_read || mem_la_write) begin
    // copy `mem_la_addr` to `mem_addr`, an output port of this CPU
    mem_addr <= mem_la_addr;
    mem_wstrb <= mem_la_wstrb & {4{mem_la_write}};
end
if (mem_la_write) begin
    mem_wdata <= mem_la_wdata;
end
```

The external memory (or a memory controller) picks up these signals and performs the requested operation.

* After that comes the actual state machine logic:

```verilog=
case (mem_state)
    0: begin
        // [0] Upon a requested memory operation,
        //     move to the corresponding state.
        //     read  (mem_do_prefetch): goto 1
        //     read  (mem_do_rinst):    goto 1
        //     read  (mem_do_rdata):    goto 1
        //     write (mem_do_wdata):    goto 2
        //     (mem_do_rinst means "memory do read instruction")
    end
    1: begin
        // [1] When the memory read result is ready (mem_xfer),
        //     move to the corresponding state.
        //     done (mem_do_rinst):    goto 0
        //     done (mem_do_rdata):    goto 0
        //     done (mem_do_prefetch): goto 3
    end
    2: begin
        // [2] When the memory write is done (mem_xfer),
        //     always goto 0
    end
    3: begin
        // [3] After a prefetch (mem_do_prefetch), we stay in this
        //     state until `mem_do_rinst` is requested.
        //     mem_do_rinst: goto 0
    end
endcase
```

The comments above are quite straightforward. There is not much in this part beyond state management: it serves as an interface that synchronizes the state between the main CPU state machine (discussed later) and the CPU's input/output ports.

#### The Instruction Decoder

In this section we go through the instruction decoder of PicoRV32. This part is more tedious but easier to understand: the decoder is full of instruction-parsing logic. There are a few `always` blocks here, but only one of them really matters.

```verilog=
always @(posedge clk) begin
    if (mem_do_rinst && mem_done) begin
        // identify which instruction we are about to execute
    end
    if (decoder_trigger && !decoder_pseudo_trigger) begin
        // identify the exact instruction being executed
    end
end
```

A combinational `always @*` block watches the one-hot `instr_*` flags and, whenever any of them changes, sets `new_ascii_instr` to the mnemonic of the decoded instruction (a debugging aid that shows the instruction name in the waveform).
```verilog=
reg instr_lui, instr_auipc, instr_jal, instr_jalr;
reg instr_beq, instr_bne, instr_blt, instr_bge, instr_bltu, instr_bgeu;
reg instr_lb, instr_lh, instr_lw, instr_lbu, instr_lhu, instr_sb, instr_sh, instr_sw;
reg instr_addi, instr_slti, instr_sltiu, instr_xori, instr_ori, instr_andi, instr_slli, instr_srli, instr_srai;
reg instr_add, instr_sub, instr_sll, instr_slt, instr_sltu, instr_xor, instr_srl, instr_sra, instr_or, instr_and;
reg instr_rdcycle, instr_rdcycleh, instr_rdinstr, instr_rdinstrh, instr_ecall_ebreak;
reg instr_getq, instr_setq, instr_retirq, instr_maskirq, instr_waitirq, instr_timer;

always @* begin
    new_ascii_instr = "";

    if (instr_lui) new_ascii_instr = "lui";
    if (instr_auipc) new_ascii_instr = "auipc";
    if (instr_jal) new_ascii_instr = "jal";
    if (instr_jalr) new_ascii_instr = "jalr";

    if (instr_beq) new_ascii_instr = "beq";
    if (instr_bne) new_ascii_instr = "bne";
    if (instr_blt) new_ascii_instr = "blt";
    if (instr_bge) new_ascii_instr = "bge";
    if (instr_bltu) new_ascii_instr = "bltu";
    if (instr_bgeu) new_ascii_instr = "bgeu";

    if (instr_lb) new_ascii_instr = "lb";
    if (instr_lh) new_ascii_instr = "lh";
    if (instr_lw) new_ascii_instr = "lw";
    if (instr_lbu) new_ascii_instr = "lbu";
    if (instr_lhu) new_ascii_instr = "lhu";
    if (instr_sb) new_ascii_instr = "sb";
    if (instr_sh) new_ascii_instr = "sh";
    if (instr_sw) new_ascii_instr = "sw";

    if (instr_addi) new_ascii_instr = "addi";
    if (instr_slti) new_ascii_instr = "slti";
    if (instr_sltiu) new_ascii_instr = "sltiu";
    if (instr_xori) new_ascii_instr = "xori";
    if (instr_ori) new_ascii_instr = "ori";
    if (instr_andi) new_ascii_instr = "andi";
    if (instr_slli) new_ascii_instr = "slli";
    if (instr_srli) new_ascii_instr = "srli";
    if (instr_srai) new_ascii_instr = "srai";

    if (instr_add) new_ascii_instr = "add";
    if (instr_sub) new_ascii_instr = "sub";
    if (instr_sll) new_ascii_instr = "sll";
    if (instr_slt) new_ascii_instr = "slt";
    if (instr_sltu) new_ascii_instr = "sltu";
    if (instr_xor) new_ascii_instr = "xor";
    if (instr_srl) new_ascii_instr = "srl";
    if (instr_sra) new_ascii_instr = "sra";
    if (instr_or) new_ascii_instr = "or";
    if (instr_and) new_ascii_instr = "and";

    if (instr_rdcycle) new_ascii_instr = "rdcycle";
    if (instr_rdcycleh) new_ascii_instr = "rdcycleh";
    if (instr_rdinstr) new_ascii_instr = "rdinstr";
    if (instr_rdinstrh) new_ascii_instr = "rdinstrh";

    if (instr_getq) new_ascii_instr = "getq";
    if (instr_setq) new_ascii_instr = "setq";
    if (instr_retirq) new_ascii_instr = "retirq";
    if (instr_maskirq) new_ascii_instr = "maskirq";
    if (instr_waitirq) new_ascii_instr = "waitirq";
    if (instr_timer) new_ascii_instr = "timer";
end
```

There is also a D flip-flop whose input is `decoder_trigger` and whose output is `decoder_trigger_q`, so `decoder_trigger_q` holds the previous value of `decoder_trigger`. `decoder_trigger` is set to 1 when the memory is reading an instruction (`mem_do_rinst`) and the read has completed (`mem_done`).

```verilog=
always @(posedge clk) begin
    decoder_trigger <= mem_do_rinst && mem_done;
    decoder_trigger_q <= decoder_trigger;
    decoder_pseudo_trigger <= 0;
    decoder_pseudo_trigger_q <= decoder_pseudo_trigger;
    do_waitirq <= 0;
end
```

The first decode step checks whether the opcode is a branch, an IRQ instruction (interrupt request), a memory operation, and so on, and latches the result into flag registers. The same block also contains the decoding of compressed (16-bit) instructions, which is omitted here.
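Before looking at the Verilog for this step, here is a rough Python analogue of the two-step decode: the opcode (`insn[6:0]`) selects an instruction class, and for branches and loads `funct3` (`insn[14:12]`) then picks the exact mnemonic. This is an illustrative sketch, not PicoRV32 code; the names `decode`, `OPCODE_CLASS`, `BRANCH_FUNCT3` and `LOAD_FUNCT3` are hypothetical, only the opcode values used by the decoder below are covered, and stores and ALU operations are left at the class level.

```python
# Hypothetical software analogue of PicoRV32's two-step decode (not from the repo).
OPCODE_CLASS = {
    0b0110111: "lui",
    0b0010111: "auipc",
    0b1101111: "jal",
    0b1100111: "jalr",
    0b1100011: "branch",   # is_beq_bne_blt_bge_bltu_bgeu
    0b0000011: "load",     # is_lb_lh_lw_lbu_lhu
    0b0100011: "store",    # is_sb_sh_sw
    0b0010011: "alu_imm",  # is_alu_reg_imm
    0b0110011: "alu_reg",  # is_alu_reg_reg
}

BRANCH_FUNCT3 = {0b000: "beq", 0b001: "bne", 0b100: "blt",
                 0b101: "bge", 0b110: "bltu", 0b111: "bgeu"}
LOAD_FUNCT3 = {0b000: "lb", 0b001: "lh", 0b010: "lw", 0b100: "lbu", 0b101: "lhu"}


def decode(insn: int) -> str:
    """Return a mnemonic (branches/loads) or an instruction class name."""
    opcode = insn & 0x7F          # insn[6:0]
    funct3 = (insn >> 12) & 0x7   # insn[14:12]
    cls = OPCODE_CLASS.get(opcode)
    if cls == "branch":
        return BRANCH_FUNCT3.get(funct3, "illegal")
    if cls == "load":
        return LOAD_FUNCT3.get(funct3, "illegal")
    if cls == "jalr" and funct3 != 0:
        return "illegal"          # jalr requires funct3 == 000
    return cls or "unknown"
```

For example, `decode(0x00000663)` (`beq x0, x0, 12`) returns `"beq"` and `decode(0x00054583)` (`lbu a1, 0(a0)`) returns `"lbu"`. In the hardware, the class flags are latched one cycle before the `funct3` check, which the sketch collapses into a single call.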
```verilog=
always @(posedge clk) begin
    if (mem_do_rinst && mem_done) begin
        instr_lui <= mem_rdata_latched[6:0] == 7'b0110111;
        instr_auipc <= mem_rdata_latched[6:0] == 7'b0010111;
        instr_jal <= mem_rdata_latched[6:0] == 7'b1101111;
        instr_jalr <= mem_rdata_latched[6:0] == 7'b1100111 && mem_rdata_latched[14:12] == 3'b000;
        instr_retirq <= mem_rdata_latched[6:0] == 7'b0001011 && mem_rdata_latched[31:25] == 7'b0000010 && ENABLE_IRQ;
        instr_waitirq <= mem_rdata_latched[6:0] == 7'b0001011 && mem_rdata_latched[31:25] == 7'b0000100 && ENABLE_IRQ;

        is_beq_bne_blt_bge_bltu_bgeu <= mem_rdata_latched[6:0] == 7'b1100011;
        is_lb_lh_lw_lbu_lhu <= mem_rdata_latched[6:0] == 7'b0000011;
        is_sb_sh_sw <= mem_rdata_latched[6:0] == 7'b0100011;
        is_alu_reg_imm <= mem_rdata_latched[6:0] == 7'b0010011;
        is_alu_reg_reg <= mem_rdata_latched[6:0] == 7'b0110011;

        { decoded_imm_j[31:20], decoded_imm_j[10:1], decoded_imm_j[11], decoded_imm_j[19:12], decoded_imm_j[0] } <= $signed({mem_rdata_latched[31:12], 1'b0});

        // ...
        // (decoding of compressed 16-bit instructions omitted)
```

When `decoder_trigger` is set, the second step checks `funct3` (`mem_rdata_q[14:12]`) and, where needed, `funct7` (`mem_rdata_q[31:25]`), and sets the per-instruction flags.

```verilog=
always @(posedge clk) begin
    if (decoder_trigger && !decoder_pseudo_trigger) begin
        pcpi_insn <= WITH_PCPI ? mem_rdata_q : 'bx;

        instr_beq <= is_beq_bne_blt_bge_bltu_bgeu && mem_rdata_q[14:12] == 3'b000;
        instr_bne <= is_beq_bne_blt_bge_bltu_bgeu && mem_rdata_q[14:12] == 3'b001;
        instr_blt <= is_beq_bne_blt_bge_bltu_bgeu && mem_rdata_q[14:12] == 3'b100;
        instr_bge <= is_beq_bne_blt_bge_bltu_bgeu && mem_rdata_q[14:12] == 3'b101;
        instr_bltu <= is_beq_bne_blt_bge_bltu_bgeu && mem_rdata_q[14:12] == 3'b110;
        instr_bgeu <= is_beq_bne_blt_bge_bltu_bgeu && mem_rdata_q[14:12] == 3'b111;

        instr_lb <= is_lb_lh_lw_lbu_lhu && mem_rdata_q[14:12] == 3'b000;
        instr_lh <= is_lb_lh_lw_lbu_lhu && mem_rdata_q[14:12] == 3'b001;
        instr_lw <= is_lb_lh_lw_lbu_lhu && mem_rdata_q[14:12] == 3'b010;
        instr_lbu <= is_lb_lh_lw_lbu_lhu && mem_rdata_q[14:12] == 3'b100;
        instr_lhu <= is_lb_lh_lw_lbu_lhu && mem_rdata_q[14:12] == 3'b101;

        // ...
    end
end
```

#### The Main State Machine

The main state machine is the brain of `PicoRV32`. The two components described in the sections above are scheduled and invoked by this big state machine. This subsection shows the state machine and what each state is about.

![](https://i.imgur.com/hOvlCQb.png)

This figure shows the state transitions inside PicoRV32. Only the important transitions are drawn.

* **fetch**
  This is the initial state of the CPU. Its goal is to fetch the next instruction from memory. It may take multiple cycles, since we have to wait for the memory to respond. Once the instruction is read and decoded, the state transitions to `ld_rs1`.
* **ld_rs1**
  This state reads the register values. There are actually two states related to reading registers; the other one is **ld_rs2**. The reason for two states is that the register file may not be **dual-ported** (`ENABLE_REGS_DUALPORT`), in which case it cannot read two register values in the same cycle. Once the register values are obtained, we go to one of `ldmem`, `stmem`, `exec` or `shift`, depending on the decoded instruction type.
* **ld_rs2**
  Reads the second register value when the register file is single-ported (see above).
* **ldmem**
  Memory load instructions.
* **stmem**
  Memory store instructions.
* **exec**
  Mainly ALU operations: the calculated result is collected and stored for write-back to the destination register.
* **shift**
  Shifts get their own state, separate from **exec**, because by default PicoRV32 shifts iteratively, a few bits per cycle, instead of using a barrel shifter. With the `TWO_STAGE_SHIFT` module parameter (enabled by default), it shifts 4 bits per cycle while at least 4 bits remain and then 1 bit per cycle; this speeds up shifts at the cost of some extra hardware. The `BARREL_SHIFTER` parameter replaces the iterative shift with a barrel shifter.

The main state machine code is rather lengthy, so I am not going to copy-paste it here. For a further explanation of what the CPU does internally when executing instructions, head to the [observation of waveform](https://hackmd.io/fAWaTqqwSPmXwfsgEPtz5g?view#Observe-the-signal-waves).

### The Trade-Off of PicoRV32

TL;DR: simple design, higher extensibility, slower performance.

* No pipeline, so the code is simpler.
* Because the design is simple, it is much easier to extend. PicoRV32 comes with many configurable parameters to suit many situations.
* No pipeline, so performance is lower. But [this CPU is meant to be used as an auxiliary processor in FPGA designs and ASICs](https://github.com/YosysHQ/picorv32#features-and-typical-applications), so raw performance is not its main design goal.

## Observe the waveform of PicoRV32 CPU

In this section we simulate the `PicoRV32` CPU with Verilator, collect the simulation (waveform) results, and check whether the CPU's behavior matches our reading of the source code.

* Setting up Verilator.
    * Address the issue between PicoRV32 & Verilator.
    * Build & Run.
    * Build & Run with Custom Memory.
* Observe the signal waves.
    * Instruction Fetch & Memory Read.
    * Instruction Execution.
    * Memory Write.

### Setting up Verilator

In this section we will:
* Show how to use Verilator to generate the waveform files.
* Set up custom memory (instruction & data) for CPU execution.
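For the custom-memory setup, one common approach (an assumption here, not something shown in this report) is a Verilog testbench that loads its memory with `$readmemh`, one 32-bit word per line. Under that assumption, a small Python helper can turn a raw little-endian RV32 binary (for example the output of `objcopy -O binary`) into such a file. The function name `words_to_readmemh` is hypothetical.

```python
import struct


def words_to_readmemh(blob: bytes) -> str:
    """Pack a little-endian RV32 binary into one 32-bit hex word per line.

    Assumes the testbench memory is an array of 32-bit words loaded with
    $readmemh, so byte 0 of the binary ends up in bits [7:0] of word 0.
    """
    # Pad to a multiple of 4 bytes so the last word is complete.
    if len(blob) % 4:
        blob += b"\x00" * (4 - len(blob) % 4)
    words = struct.unpack(f"<{len(blob) // 4}I", blob)
    return "\n".join(f"{w:08x}" for w in words) + "\n"
```

For instance, the bytes `93 00 50 00` (`addi x1, x0, 5` in little-endian order) become the line `00500093`. Writing the returned string to a file gives something a `$readmemh`-based memory can load; adapt the word size and byte order if your testbench memory is organized differently.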
The Verilator version I am using is `Verilator 4.216 2021-12-05 rev UNKNOWN.REV`. Check your version: older versions might not work with the following instructions.

#### Address the issue between PicoRV32 & Verilator

Before building, we need to make some modifications to the `picorv32.v` file.

```verilog=
reg [63:0] new_ascii_instr;
`FORMAL_KEEP reg [63:0] dbg_ascii_instr;
`FORMAL_KEEP reg [31:0] dbg_insn_imm;
`FORMAL_KEEP reg [4:0] dbg_insn_rs1;
`FORMAL_KEEP reg [4:0] dbg_insn_rs2;
`FORMAL_KEEP reg [4:0] dbg_insn_rd;
`FORMAL_KEEP reg [31:0] dbg_rs1val;
`FORMAL_KEEP reg [31:0] dbg_rs2val;
`FORMAL_KEEP reg dbg_rs1val_valid;
`FORMAL_KEEP reg dbg_rs2val_valid;

always @* begin
    new_ascii_instr = "";

    if (instr_lui) new_ascii_instr = "lui";
    if (instr_auipc) new_ascii_instr = "auipc";
    if (instr_jal) new_ascii_instr = "jal";
    if (instr_jalr) new_ascii_instr = "jalr";

    if (instr_beq) new_ascii_instr = "beq";
    if (instr_bne) new_ascii_instr = "bne";
    if (instr_blt) new_ascii_instr = "blt";
    if (instr_bge) new_ascii_instr = "bge";
    if (instr_bltu) new_ascii_instr = "bltu";
    if (instr_bgeu) new_ascii_instr = "bgeu";

    if (instr_lb) new_ascii_instr = "lb";
    if (instr_lh) new_ascii_instr = "lh";
    if (instr_lw) new_ascii_instr = "lw";
    if (instr_lbu) new_ascii_instr = "lbu";
    if (instr_lhu) new_ascii_instr = "lhu";
    if (instr_sb) new_ascii_instr = "sb";
    if (instr_sh) new_ascii_instr = "sh";
    if (instr_sw) new_ascii_instr = "sw";

    if (instr_addi) new_ascii_instr = "addi";
    if (instr_slti) new_ascii_instr = "slti";
    if (instr_sltiu) new_ascii_instr = "sltiu";
    if (instr_xori) new_ascii_instr = "xori";
    if (instr_ori) new_ascii_instr = "ori";
    if (instr_andi) new_ascii_instr = "andi";
    if (instr_slli) new_ascii_instr = "slli";
    if (instr_srli) new_ascii_instr = "srli";
    if (instr_srai) new_ascii_instr = "srai";

    if (instr_add) new_ascii_instr = "add";
    if (instr_sub) new_ascii_instr = "sub";
    if (instr_sll) new_ascii_instr = "sll";
    if (instr_slt) new_ascii_instr = "slt";
    if (instr_sltu) new_ascii_instr = "sltu";
    if (instr_xor) new_ascii_instr = "xor";
    if (instr_srl) new_ascii_instr = "srl";
    if (instr_sra) new_ascii_instr = "sra";
    if (instr_or) new_ascii_instr = "or";
    if (instr_and) new_ascii_instr = "and";

    if (instr_rdcycle) new_ascii_instr = "rdcycle";
    if (instr_rdcycleh) new_ascii_instr = "rdcycleh";
    if (instr_rdinstr) new_ascii_instr = "rdinstr";
    if (instr_rdinstrh) new_ascii_instr = "rdinstrh";

    if (instr_getq) new_ascii_instr = "getq";
    if (instr_setq) new_ascii_instr = "setq";
    if (instr_retirq) new_ascii_instr = "retirq";
    if (instr_maskirq) new_ascii_instr = "maskirq";
    if (instr_waitirq) new_ascii_instr = "waitirq";
    if (instr_timer) new_ascii_instr = "timer";
end
```
<center>(code ?) The above code causes a compile error when the C++ code generated by Verilator is compiled</center>
<br>

The above code provides some debugging signals. Notice that it assigns ASCII strings to a 64-bit register. This is legal Verilog, but it apparently trips up Verilator's C++ code generation: both errors below point at the `new_ascii_instr = ""` assignment, which is emitted as a `std::string` assigned to a `QData` (64-bit integer) variable.
```=
Vpicorv32___024root__DepSet_h733510b4__0.cpp: In function ‘void Vpicorv32___024root___sequent__TOP__1(Vpicorv32___024root*)’:
Vpicorv32___024root__DepSet_h733510b4__0.cpp:1255:51: error: cannot convert ‘std::string’ {aka ‘std::__cxx11::basic_string<char>’} to ‘QData’ {aka ‘long unsigned int’} in assignment
 1255 |     vlSelf->picorv32__DOT__new_ascii_instr = std::string("");
      |                                                   ^~~~~~~~~~
      |                                                   |
      |                                                   std::string {aka std::__cxx11::basic_string<char>}
In file included from Vpicorv32__ALL.cpp:8:
Vpicorv32___024root__DepSet_h733510b4__0__Slow.cpp: In function ‘void Vpicorv32___024root___settle__TOP__2(Vpicorv32___024root*)’:
Vpicorv32___024root__DepSet_h733510b4__0__Slow.cpp:235:51: error: cannot convert ‘std::string’ {aka ‘std::__cxx11::basic_string<char>’} to ‘QData’ {aka ‘long unsigned int’} in assignment
  235 |     vlSelf->picorv32__DOT__new_ascii_instr = std::string("");
      |                                                   ^~~~~~~~~~
      |                                                   |
      |                                                   std::string {aka std::__cxx11::basic_string<char>}
```

To avoid this issue we have to modify PicoRV32 itself. There may be other ways to solve it, but I am neither a Verilog nor a Verilator expert; anyway, the following works.

```bash=
# Print a mnemonic as a Verilog concatenation of 8-bit ASCII codes,
# e.g. `trans lui` prints {8'd108,8'd117,8'd105}
function trans {
    python3 -c "print('{'+','.join(['8\'d'+str(ord(x)) for x in \"$1\"])+'}')"
}

for insn in lui auipc jal jalr beq bne blt bge bltu bgeu \
            lb lh lw lbu lhu sb sh sw \
            addi slti sltiu xori ori andi slli srli srai \
            add sub sll slt sltu xor srl sra or and \
            rdcycle rdcycleh rdinstr rdinstrh \
            getq setq retirq maskirq waitirq timer; do
    trans "$insn"
done
```
<center>(code ?) Script to generate Verilator-friendly Verilog code (type hint)</center>
<br>

The script transforms each ASCII string into a concatenation of 8-bit values. With this, we can work around the issue.
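The same transformation can be written as a small standalone Python function, which also makes it easy to check that the workaround does not change what the debug signal holds. In Verilog, a string literal is stored as 8-bit ASCII codes with the first character in the most significant byte and is right-aligned (zero-padded) in a wider register, so `{8'd108,8'd117,8'd105}` and `"lui"` should produce the same 64-bit value. The names `to_verilog_concat` and `packed_value` are hypothetical helpers for this sketch.

```python
def to_verilog_concat(mnemonic: str) -> str:
    """Render a mnemonic as a Verilog concatenation of 8-bit decimal literals."""
    return "{" + ",".join(f"8'd{ord(ch)}" for ch in mnemonic) + "}"


def packed_value(mnemonic: str) -> int:
    """Integer value of that concatenation: first character in the top byte."""
    value = 0
    for ch in mnemonic:
        value = (value << 8) | ord(ch)
    return value


if __name__ == "__main__":
    for name in ["lui", "auipc", "rdcycleh"]:
        print(f"if (instr_{name}) new_ascii_instr = {to_verilog_concat(name)};")
```

All mnemonics here are at most 8 characters long (`rdcycleh`, `rdinstrh`), so every packed value fits in the 64-bit `new_ascii_instr` register without truncation.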
```verilog=
reg [63:0] new_ascii_instr;
`FORMAL_KEEP reg [63:0] dbg_ascii_instr;
`FORMAL_KEEP reg [31:0] dbg_insn_imm;
`FORMAL_KEEP reg [4:0] dbg_insn_rs1;
`FORMAL_KEEP reg [4:0] dbg_insn_rs2;
`FORMAL_KEEP reg [4:0] dbg_insn_rd;
`FORMAL_KEEP reg [31:0] dbg_rs1val;
`FORMAL_KEEP reg [31:0] dbg_rs2val;
`FORMAL_KEEP reg dbg_rs1val_valid;
`FORMAL_KEEP reg dbg_rs2val_valid;

always @* begin
    new_ascii_instr = {8'd0};

    if (instr_lui) new_ascii_instr = {8'd108,8'd117,8'd105};
    if (instr_auipc) new_ascii_instr = {8'd97,8'd117,8'd105,8'd112,8'd99};
    if (instr_jal) new_ascii_instr = {8'd106,8'd97,8'd108};
    if (instr_jalr) new_ascii_instr = {8'd106,8'd97,8'd108,8'd114};

    if (instr_beq) new_ascii_instr = {8'd98,8'd101,8'd113};
    if (instr_bne) new_ascii_instr = {8'd98,8'd110,8'd101};
    if (instr_blt) new_ascii_instr = {8'd98,8'd108,8'd116};
    if (instr_bge) new_ascii_instr = {8'd98,8'd103,8'd101};
    if (instr_bltu) new_ascii_instr = {8'd98,8'd108,8'd116,8'd117};
    if (instr_bgeu) new_ascii_instr = {8'd98,8'd103,8'd101,8'd117};

    if (instr_lb) new_ascii_instr = {8'd108,8'd98};
    if (instr_lh) new_ascii_instr = {8'd108,8'd104};
    if (instr_lw) new_ascii_instr = {8'd108,8'd119};
    if (instr_lbu) new_ascii_instr = {8'd108,8'd98,8'd117};
    if (instr_lhu) new_ascii_instr = {8'd108,8'd104,8'd117};
    if (instr_sb) new_ascii_instr = {8'd115,8'd98};
    if (instr_sh) new_ascii_instr = {8'd115,8'd104};
    if (instr_sw) new_ascii_instr = {8'd115,8'd119};

    if (instr_addi) new_ascii_instr = {8'd97,8'd100,8'd100,8'd105};
    if (instr_slti) new_ascii_instr = {8'd115,8'd108,8'd116,8'd105};
    if (instr_sltiu) new_ascii_instr = {8'd115,8'd108,8'd116,8'd105,8'd117};
    if (instr_xori) new_ascii_instr = {8'd120,8'd111,8'd114,8'd105};
    if (instr_ori) new_ascii_instr = {8'd111,8'd114,8'd105};
    if (instr_andi) new_ascii_instr = {8'd97,8'd110,8'd100,8'd105};
    if (instr_slli) new_ascii_instr = {8'd115,8'd108,8'd108,8'd105};
    if (instr_srli) new_ascii_instr = {8'd115,8'd114,8'd108,8'd105};
    if (instr_srai) new_ascii_instr = {8'd115,8'd114,8'd97,8'd105};

    if (instr_add) new_ascii_instr = {8'd97,8'd100,8'd100};
    if (instr_sub) new_ascii_instr = {8'd115,8'd117,8'd98};
    if (instr_sll) new_ascii_instr = {8'd115,8'd108,8'd108};
    if (instr_slt) new_ascii_instr = {8'd115,8'd108,8'd116};
    if (instr_sltu) new_ascii_instr = {8'd115,8'd108,8'd116,8'd117};
    if (instr_xor) new_ascii_instr = {8'd120,8'd111,8'd114};
    if (instr_srl) new_ascii_instr = {8'd115,8'd114,8'd108};
    if (instr_sra) new_ascii_instr = {8'd115,8'd114,8'd97};
    if (instr_or) new_ascii_instr = {8'd111,8'd114};
    if (instr_and) new_ascii_instr = {8'd97,8'd110,8'd100};

    if (instr_rdcycle) new_ascii_instr = {8'd114,8'd100,8'd99,8'd121,8'd99,8'd108,8'd101};
    if (instr_rdcycleh) new_ascii_instr = {8'd114,8'd100,8'd99,8'd121,8'd99,8'd108,8'd101,8'd104};
    if (instr_rdinstr) new_ascii_instr = {8'd114,8'd100,8'd105,8'd110,8'd115,8'd116,8'd114};
    if (instr_rdinstrh) new_ascii_instr = {8'd114,8'd100,8'd105,8'd110,8'd115,8'd116,8'd114,8'd104};

    if (instr_getq) new_ascii_instr = {8'd103,8'd101,8'd116,8'd113};
    if (instr_setq) new_ascii_instr = {8'd115,8'd101,8'd116,8'd113};
    if (instr_retirq) new_ascii_instr = {8'd114,8'd101,8'd116,8'd105,8'd114,8'd113};
    if (instr_maskirq) new_ascii_instr = {8'd109,8'd97,8'd115,8'd107,8'd105,8'd114,8'd113};
    if (instr_waitirq) new_ascii_instr = {8'd119,8'd97,8'd105,8'd116,8'd105,8'd114,8'd113};
    if (instr_timer) new_ascii_instr = {8'd116,8'd105,8'd109,8'd101,8'd114};
end
```

#### Build & Run

Proceed to [here](#Run-the-firmware-offer-in-PicoRV32-by-simulator) to see how to use Verilator to simulate this. After following the steps in that link, we will have the waveform.

![](https://i.imgur.com/A02MAjV.png)

### Observe the signal waves

#### Instruction Execution (shift)

The following image shows the execution of an `srl` instruction.

![](https://i.imgur.com/8aI2ovK.png)

Observing the waveform, we can see that after the instruction is fetched, the `srl` instruction starts executing. `reg_op1` contains the value to shift.
`reg_op2` contains the number of bits to shift (a copy is also loaded into `reg_sh`, which the shift state counts down). It is interesting to see that a simple shift takes many cycles. The source code shows what's going on:

```verilog=
cpu_state_shift: begin  // Shift (then go to FETCH)
    latched_store <= 1;
    if (reg_sh == 0) begin
        reg_out <= reg_op1;
        mem_do_rinst <= mem_do_prefetch;
        cpu_state <= cpu_state_fetch;
    end else if (TWO_STAGE_SHIFT && reg_sh >= 4) begin
        // two-stage shift: shift by 4 bits per cycle
        (* parallel_case, full_case *)
        case (1'b1)
            instr_slli || instr_sll: reg_op1 <= reg_op1 << 4;
            instr_srli || instr_srl: reg_op1 <= reg_op1 >> 4;
            instr_srai || instr_sra: reg_op1 <= $signed(reg_op1) >>> 4;
        endcase
        reg_sh <= reg_sh - 4;
    end else begin
        // normal shift: shift by 1 bit per cycle
        (* parallel_case, full_case *)
        case (1'b1)
            instr_slli || instr_sll: reg_op1 <= reg_op1 << 1;
            instr_srli || instr_srl: reg_op1 <= reg_op1 >> 1;
            instr_srai || instr_sra: reg_op1 <= $signed(reg_op1) >>> 1;
        endcase
        reg_sh <= reg_sh - 1;
    end
end
```

From the source code we can see that the `shift` state performs at most a `4`-bit shift per cycle. Once fewer than `4` bits remain, each remaining bit costs one cycle. Here `reg_sh` is `0x14` (20), so the shift takes `1+5` cycles: five 4-bit steps plus one final cycle when `reg_sh` reaches `0`.

Now let's talk about fetching the next instruction. We can see that `mem_do_prefetch` is set at the beginning of execution. This prefetch is all about the next instruction: when this flag is set, the memory interface starts fetching the next instruction from memory. Once the instruction is fetched, the memory interface stays in state `3` (the special state reached from state `1`, reading) until `mem_do_rinst` is actually set, which means execution of the current instruction is done and we need the next one. (In the image or the source code, you can see that `mem_do_rinst` is set once the shift is done.)
This is basically how prefetch is accomplished on this CPU.

#### Instruction Execution (add)

![](https://i.imgur.com/0E9gniu.png)

The image above shows the execution of an `add` instruction. In the `fetch` state, `decoder_trigger` is set, so the `ID` stage runs on the next rising clock edge. After the `ID` state the CPU enters the `exec` state, where it adds `reg_op1` and `reg_op2`, puts the result on `alu_out`, and sets `mem_do_rinst` so that the next `fetch` stage can fetch the next instruction.

#### Instruction Execution (addi)

![](https://i.imgur.com/PHllYz6.png)

The image above shows the execution of an `addi` instruction. In the `fetch` stage, the CPU starts decoding the instruction fetched from `reg_pc` and extracts the opcode. In the `ID` stage, `dbg_insn_imm` receives the immediate value. In the `exec` stage, the immediate value is placed in `reg_op2`, the value of register `decoded_rs1` is placed in `reg_op1`, and their sum appears on `alu_out`. Then `mem_do_rinst` is set, which means memory will be read on the next rising clock edge.

#### Instruction Execution (jal)

![](https://i.imgur.com/wsWKZKV.png)

The image above shows the execution of a `jal` instruction. This instruction is handled entirely in the `fetch` state. According to the specification, `jal` moves the program counter to another memory location: the new PC is the current PC plus a sign-extended 20-bit immediate (in multiples of 2 bytes). From the diagram we can infer that the decoder produced `-718` (decimal) as the immediate value. Since the current `PC` is `0xC8E`, adding `-718` (i.e. subtracting 718) gives `0x9C0`, which is the value of `reg_next_pc`. Once this value is calculated, the target `PC` is driven onto `mem_la_addr` and the `mem_do_rinst` signal is set, which triggers the external memory module to fetch the instruction at that address. Once this is done, `decoder_trigger` is set to start decoding the new instruction on the next cycle.
We have the brand-new instruction here, and execution continues in the `fetch` state as usual.

#### Instruction Execution (beq)

![](https://i.imgur.com/BpU4r13.png)

The image above shows the execution of `beq`. In the `ld_rs1` stage, the CPU reads the two source register values from the register file. The values are stored in `reg_op1` and `reg_op2`, and both are `0`. In the `exec` stage, `alu_out` indicates that the comparison result is `true` (1), so the branch is taken: `reg_next_pc` is set to the current PC (`0xC084`) plus the immediate value (`12`), giving `0xC090`.

Looking at the image, we find a weird situation: **the CPU reads two instructions during the same branch instruction** (yellow circle). This looks unnatural and makes people cringe. Taking a closer look at `mem_la_addr`, we can see that the two reads are from `0xC088` and `0xC090`.

* The former, `0xC088` (`0xC084 + 4`), is the instruction we would execute if the branch were not taken; it was presumably requested by the prefetch of the next sequential instruction.
* The latter, `0xC090`, is the instruction we execute because the branch is taken.

#### Instruction Execution (lbu)

![](https://i.imgur.com/gZiHvON.png)

The image above shows the execution of an `lbu` instruction, which loads one unsigned byte into a register.

* `mem_wordsize` is set to `2`, which selects a byte access, so only one byte of the retrieved memory word is kept.
* `reg_op1` contains the address to read.
* `decoded_rd` is `0x0B`.
* The immediate value is `0`.
* There is one unrelated memory fetch, invoked by instruction prefetch.
* Once the memory value is retrieved, it is assigned to the destination register.

#### Instruction Execution (sw)

![](https://i.imgur.com/QgFTdxs.png)

The image above shows the execution of an `sw` instruction. Once we reach the `stmem` state, the CPU writes the value in `reg_op2` to the address `reg_op1` plus the immediate value.
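The address arithmetic itself is simple. As a sketch based on the RISC-V base ISA encoding (the helper names and the example encoding below are our own, not taken from PicoRV32 or from the waveform), the 12-bit S-type immediate of a store is split across instruction bits [31:25] and [11:7], sign-extended, and added to the value of `rs1`:

```python
MASK32 = 0xFFFF_FFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign_bit) - sign_bit


def s_type_imm(instr: int) -> int:
    """S-type immediate: imm[11:5] = instr[31:25], imm[4:0] = instr[11:7]."""
    imm = (((instr >> 25) & 0x7F) << 5) | ((instr >> 7) & 0x1F)
    return sign_extend(imm, 12)


def store_address(rs1_value: int, instr: int) -> int:
    """Effective address of a store: rs1 + sign-extended immediate, wrapped to 32 bits."""
    return (rs1_value + s_type_imm(instr)) & MASK32


# Hypothetical encoding of `sw a5, 0(a0)` (0x00F52023), with a0 = 0x10000000,
# the same value reg_op1 holds in the waveform.
print(hex(store_address(0x1000_0000, 0x00F5_2023)))  # prints 0x10000000
```

PicoRV32 decodes its immediates in its own decoder logic; the sketch only shows the arithmetic behind "`reg_op1` plus the immediate value".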
In this example:

* `reg_op1` is `0x10000000`
* the immediate value is `0`
* `reg_op2` is `0x30`

Looking at `mem_la_addr`, we can see the CPU is trying to write data to `0x10000000`. The CPU sets `set_mem_do_wdata` to `1`, which in turn sets `mem_do_wdata`. Once the write operation is done (`mem_done`), the CPU leaves the `stmem` state and goes back to the `fetch` state.

## Run the firmware offer in `PicoRV32` by simulator

The PicoRV32 project ships firmware to test the implementation. The related definitions are in the `Makefile`:

```makefile
TEST_OBJS = $(addsuffix .o,$(basename $(wildcard tests/*.S)))
FIRMWARE_OBJS = firmware/start.o firmware/irq.o firmware/print.o firmware/hello.o firmware/sieve.o firmware/multest.o firmware/stats.o

# Prepare the firmware image. Since we simulate a bare-metal-like
# environment, the firmware binary is broken into pieces (32-bit words
# written to a hex file) that the testbench places into its simulated memory.
firmware/firmware.hex: firmware/firmware.bin firmware/makehex.py
	$(PYTHON) firmware/makehex.py $< 32768 > $@

# Entry point to run the Verilator simulation.
# testbench_verilator is the binary generated by Verilator from the given
# inputs (picorv32.v and the Verilator C++ wrapper file).
test_verilator: testbench_verilator firmware/firmware.hex
	./testbench_verilator

# Call Verilator to generate the simulation code. Once generation is done,
# compile the binary that performs the simulation.
testbench_verilator: testbench.v picorv32.v testbench.cc
	$(VERILATOR) --cc --exe -Wno-lint -trace --top-module picorv32_wrapper testbench.v picorv32.v testbench.cc \
			$(subst C,-DCOMPRESSED_ISA,$(COMPRESSED_ISA)) --Mdir testbench_verilator_dir
	$(MAKE) -C testbench_verilator_dir -f Vpicorv32_wrapper.mk
	cp testbench_verilator_dir/Vpicorv32_wrapper testbench_verilator
```

> Before running Verilator, remember to fix the [issue](https://hackmd.io/fAWaTqqwSPmXwfsgEPtz5g?view#Address-the-issue-between-PicoRV32-amp-Verilator).
> Otherwise, you might run into a compile error when running the Makefile.

Run `make test_verilator` in the repository root and you will get the following output:

```
Built with Verilator 4.216 2021-12-05.
Recommended: Verilator 4.0 or later.
hello world
lui..OK
auipc..OK
j..OK
jal..OK
jalr..OK
beq..OK
bne..OK
blt..OK
bge..OK
bltu..OK
bgeu..OK
lb..OK
lh..OK
lw..OK
lbu..OK
lhu..OK
sb..OK
sh..OK
sw..OK
addi..OK
slti..OK
xori..OK
ori..OK
andi..OK
slli..OK
srli..OK
srai..OK
add..OK
sub..OK
sll..OK
slt..OK
xor..OK
srl..OK
sra..OK
or..OK
and..OK
mulh..OK
mulhsu..OK
mulhu..OK
mul..OK
div..OK
divu..OK
rem..OK
remu..OK
simple..OK
1st prime is 2.
2nd prime is 3.
3rd prime is 5.
4th prime is 7.
5th prime is 11.
6th prime is 13.
7th prime is 17.
8th prime is 19.
9th prime is 23.
10th prime is 29.
11th prime is 31.
12th prime is 37.
13th prime is 41.
14th prime is 43.
15th prime is 47.
16th prime is 53.
17th prime is 59.
18th prime is 61.
19th prime is 67.
20th prime is 71.
21st prime is 73.
22nd prime is 79.
23rd prime is 83.
24th prime is 89.
25th prime is 97.
26th prime is 101.
27th prime is 103.
28th prime is 107.
29th prime is 109.
30th prime is 113.
31st prime is 127.
checksum: 1772A48F OK
input [FFFFFFFF] 80000000 [FFFFFFFF] FFFFFFFF
hard mul 80000000 00000000 80000000 7FFFFFFF
soft mul 80000000 00000000 80000000 7FFFFFFF OK
hard div 80000000 00000000 00000000 80000000
soft div 80000000 00000000 00000000 80000000 OK
input [00000000] 00000000 [00000000] 00000000
hard mul 00000000 00000000 00000000 00000000
soft mul 00000000 00000000 00000000 00000000 OK
hard div FFFFFFFF FFFFFFFF 00000000 00000000
soft div FFFFFFFF FFFFFFFF 00000000 00000000 OK
input [FFFFFFFF] 8B578493 [00000000] 00000000
hard mul 00000000 00000000 00000000 00000000
soft mul 00000000 00000000 00000000 00000000 OK
hard div FFFFFFFF FFFFFFFF 8B578493 8B578493
soft div FFFFFFFF FFFFFFFF 8B578493 8B578493 OK
input [00000000] 6F038AFB [00000000] 00000000
hard mul 00000000 00000000 00000000 00000000
soft mul 00000000 00000000 00000000 00000000 OK
hard div FFFFFFFF FFFFFFFF 6F038AFB 6F038AFB
soft div FFFFFFFF FFFFFFFF 6F038AFB 6F038AFB OK
input [00000000] 1BFC9C22 [FFFFFFFF] 876B9BDE
hard mul 67CDFB7C F2D15DD3 0ECDF9F5 0ECDF9F5
soft mul 67CDFB7C F2D15DD3 0ECDF9F5 0ECDF9F5 OK
hard div 00000000 00000000 1BFC9C22 1BFC9C22
soft div 00000000 00000000 1BFC9C22 1BFC9C22 OK
input [00000000] 76141B16 [00000000] 5BA2940D
hard mul 949A181E 2A4422A3 2A4422A3 2A4422A3
soft mul 949A181E 2A4422A3 2A4422A3 2A4422A3 OK
hard div 00000001 00000001 1A718709 1A718709
soft div 00000001 00000001 1A718709 1A718709 OK
input [00000000] 2D45231C [FFFFFFFF] ADFA166F
hard mul C756A124 F17ECF19 1EC3F235 1EC3F235
soft mul C756A124 F17ECF19 1EC3F235 1EC3F235 OK
hard div 00000000 00000000 2D45231C 2D45231C
soft div 00000000 00000000 2D45231C 2D45231C OK
input [00000000] 09C7BF74 [00000000] 3B014C60
hard mul 73323B80 024115D2 024115D2 024115D2
soft mul 73323B80 024115D2 024115D2 024115D2 OK
hard div 00000000 00000000 09C7BF74 09C7BF74
soft div 00000000 00000000 09C7BF74 09C7BF74 OK
input [00000000] 4325E1E6 [00000000] 1C32932A
hard mul 0BDA21BC 076568B5 076568B5 076568B5
soft mul 0BDA21BC 076568B5 076568B5 076568B5 OK
hard div 00000002 00000002 0AC0BB92 0AC0BB92
soft div 00000002 00000002 0AC0BB92 0AC0BB92 OK
input [FFFFFFFF] 84A97421 [FFFFFFFF] EF8D27D7
hard mul 002E8EB7 07ECBD6D 8C96318E 7C235965
soft mul 002E8EB7 07ECBD6D 8C96318E 7C235965 OK
hard div 00000007 00000000 F7CD5D40 84A97421
soft div 00000007 00000000 F7CD5D40 84A97421 OK
input [00000000] 258BAFEC [00000000] 5EB6FD37
hard mul D7A707B4 0DE4210A 0DE4210A 0DE4210A
soft mul D7A707B4 0DE4210A 0DE4210A 0DE4210A OK
hard div 00000000 00000000 258BAFEC 258BAFEC
soft div 00000000 00000000 258BAFEC 258BAFEC OK
input [FFFFFFFF] A31BEA5F [00000000] 145CCB97
hard mul FE749309 F89C8277 F89C8277 0CF94E0E
soft mul FE749309 F89C8277 F89C8277 0CF94E0E OK
hard div FFFFFFFC 00000008 F48F18BB 00358DA7
soft div FFFFFFFC 00000008 F48F18BB 00358DA7 OK
input [00000000] 28E3CD00 [00000000] 793F5181
hard mul 21A74D00 135DC8F9 135DC8F9 135DC8F9
soft mul 21A74D00 135DC8F9 135DC8F9 135DC8F9 OK
hard div 00000000 00000000 28E3CD00 28E3CD00
soft div 00000000 00000000 28E3CD00 28E3CD00 OK
input [FFFFFFFF] F2E838C6 [00000000] 4BE0C5FE
hard mul 6558B274 FC1E89B3 FC1E89B3 47FF4FB1
soft mul 6558B274 FC1E89B3 FC1E89B3 47FF4FB1 OK
hard div 00000000 00000003 F2E838C6 0F45E6CC
soft div 00000000 00000003 F2E838C6 0F45E6CC OK
input [00000000] 38BAA671 [FFFFFFFF] E2E2B92B
hard mul 1B639DFB F98C5E4E 324704BF 324704BF
soft mul 1B639DFB F98C5E4E 324704BF 324704BF OK
hard div FFFFFFFF 00000000 1B9D5F9C 38BAA671
soft div FFFFFFFF 00000000 1B9D5F9C 38BAA671 OK
Cycle counter ......... 428722
Instruction counter .... 90750
CPI: 4.72
DONE
------------------------------------------------------------
EBREAK instruction at 0x00000680
pc 00000683 x8 00000000 x16 E2E2B92B x24 00000000
x1 00000652 x9 00000000 x17 38BAA671 x25 00000000
x2 00020000 x10 20000000 x18 00000000 x26 00000000
x3 DEADBEEF x11 075BCD15 x19 00003884 x27 00000000
x4 DEADBEEF x12 0000004F x20 00000000 x28 00000000
x5 00000E8A x13 0000004E x21 00000000 x29 00010000
x6 00000000 x14 00000045 x22 00000000 x30 324704BE
x7 00000000 x15 0000000A x23 00000000 x31 00000001
------------------------------------------------------------
Number of fast external IRQs counted: 54
Number of slow external IRQs counted: 6
Number of timer IRQs counted: 22
TRAP after 470567 clock cycles
ALL TESTS PASSED.
- testbench.v:266: Verilog $finish
```

Brrrrrruh, what is all this output? Where am I? What the hell is going on?

1. We compile the firmware code and the test code with a RISC-V compiler.
   * The firmware code is located at [/firmware](https://github.com/YosysHQ/picorv32/tree/master/firmware) and the test code at [/tests/*.S](https://github.com/YosysHQ/picorv32/tree/master/tests).
   * The test code is written in assembly. Each test executes some instructions and asserts that the outcome matches the expected result.
   * The firmware. <del>I don't know and I don't want to know.</del>
   * After this stage we have the RISC-V binary.
2. Some Python script (`firmware/makehex.py`, see the Makefile above) does crazy bat sheep to break apart the binary content.
3. Call Verilator and give it all these ingredients.
   * The `picorv32.v` Verilog file. Verilator generates C++ code that models the logic of the hardware.
   * The C++ wrapper file. It orchestrates the simulation: things **like whether to log signals or not**, corrupting your filesystem, or something worse.
   * If you wonder which wrapper file PicoRV32 is using, [here you go](https://github.com/YosysHQ/picorv32/blob/master/testbench.cc). And guess what: they use the `vcd` format for the waveform.
     Great news: now I have to port this script to `fst` if I want to finish this term project. Awesome! To switch from `vcd` to `fst`, visit the [documentation](https://verilator.org/guide/latest/faq.html#how-do-i-generate-fst-waveforms-traces-in-c-or-systemc) for further details.
4. Compile everything and enjoy the binary executable!

To force the wrapper file to always generate an `fst` waveform, I made the following changes.

```diff=
diff --git a/Makefile b/Makefile
index d7027e3..f118b03 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,6 @@
 
 RISCV_GNU_TOOLCHAIN_GIT_REVISION = 411d134
-RISCV_GNU_TOOLCHAIN_INSTALL_PREFIX = /opt/riscv32
+RISCV_GNU_TOOLCHAIN_INSTALL_PREFIX = /opt/xpack-riscv/xpack-riscv-none-embed-gcc-10.1.0-1.2
 
 # Give the user some easy overrides for local configuration quirks.
 # If you change one of these and it breaks, then you get to keep both pieces.
@@ -15,7 +15,7 @@ TEST_OBJS = $(addsuffix .o,$(basename $(wildcard tests/*.S)))
 FIRMWARE_OBJS = firmware/start.o firmware/irq.o firmware/print.o firmware/hello.o firmware/sieve.o firmware/multest.o firmware/stats.o
 GCC_WARNS = -Werror -Wall -Wextra -Wshadow -Wundef -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings
 GCC_WARNS += -Wredundant-decls -Wstrict-prototypes -Wmissing-prototypes -pedantic # -Wconversion
-TOOLCHAIN_PREFIX = $(RISCV_GNU_TOOLCHAIN_INSTALL_PREFIX)i/bin/riscv32-unknown-elf-
+TOOLCHAIN_PREFIX = $(RISCV_GNU_TOOLCHAIN_INSTALL_PREFIX)/bin/riscv-none-embed-
 COMPRESSED_ISA = C
 
 # Add things like "export http_proxy=... https_proxy=..." here
@@ -25,22 +25,22 @@ test: testbench.vvp firmware/firmware.hex
 	$(VVP) -N $<
 
 test_vcd: testbench.vvp firmware/firmware.hex
-	$(VVP) -N $< +vcd +trace +noerror
+	$(VVP) -N $< +vcd +fst +trace +noerror
 
 test_rvf: testbench_rvf.vvp firmware/firmware.hex
-	$(VVP) -N $< +vcd +trace +noerror
+	$(VVP) -N $< +vcd +fst +trace +noerror
 
 test_wb: testbench_wb.vvp firmware/firmware.hex
 	$(VVP) -N $<
 
 test_wb_vcd: testbench_wb.vvp firmware/firmware.hex
-	$(VVP) -N $< +vcd +trace +noerror
+	$(VVP) -N $< +vcd +fst +trace +noerror
 
 test_ez: testbench_ez.vvp
 	$(VVP) -N $<
 
 test_ez_vcd: testbench_ez.vvp
-	$(VVP) -N $< +vcd
+	$(VVP) -N $< +vcd +fst
 
 test_sp: testbench_sp.vvp firmware/firmware.hex
 	$(VVP) -N $<
@@ -78,8 +78,8 @@ testbench_synth.vvp: testbench.v synth.v
 	$(IVERILOG) -o $@ -DSYNTH_TEST $^
 	chmod -x $@
 
-testbench_verilator: testbench.v picorv32.v testbench.cc
-	$(VERILATOR) --cc --exe -Wno-lint -trace --top-module picorv32_wrapper testbench.v picorv32.v testbench.cc \
+testbench_verilator: testbench.v picorv32.v testbench.cc
+	$(VERILATOR) --cc --exe -Wno-lint -trace-fst --top-module picorv32_wrapper testbench.v /home/garyparrot/Programming/CA/verilator/picorv32.v testbench.cc \
 			$(subst C,-DCOMPRESSED_ISA,$(COMPRESSED_ISA)) --Mdir testbench_verilator_dir
 	$(MAKE) -C testbench_verilator_dir -f Vpicorv32_wrapper.mk
 	cp testbench_verilator_dir/Vpicorv32_wrapper testbench_verilator
```

```diff=
diff --git a/testbench.cc b/testbench.cc
index 61c4366..9a1e78b 100644
--- a/testbench.cc
+++ b/testbench.cc
@@ -1,5 +1,5 @@
 #include "Vpicorv32_wrapper.h"
-#include "verilated_vcd_c.h"
+#include "verilated_fst_c.h"
 
 int main(int argc, char **argv, char **env)
 {
@@ -9,22 +9,15 @@ int main(int argc, char **argv, char **env)
 	Verilated::commandArgs(argc, argv);
 	Vpicorv32_wrapper* top = new Vpicorv32_wrapper;
 
-	// Tracing (vcd)
-	VerilatedVcdC* tfp = NULL;
-	const char* flag_vcd = Verilated::commandArgsPlusMatch("vcd");
-	if (flag_vcd && 0==strcmp(flag_vcd, "+vcd")) {
-		Verilated::traceEverOn(true);
-		tfp = new VerilatedVcdC;
-		top->trace (tfp, 99);
-		tfp->open("testbench.vcd");
-	}
+	VerilatedFstC* tfp = NULL;
+	Verilated::traceEverOn(true);
+	tfp = new VerilatedFstC;
+	top->trace (tfp, 99);
+	tfp->open("testbench.fst");
 
 	// Tracing (data bus, see showtrace.py)
 	FILE *trace_fd = NULL;
-	const char* flag_trace = Verilated::commandArgsPlusMatch("trace");
-	if (flag_trace && 0==strcmp(flag_trace, "+trace")) {
-		trace_fd = fopen("testbench.trace", "w");
-	}
+	trace_fd = fopen("testbench.trace", "w");
 
 	top->clk = 0;
 	int t = 0;
```

![](https://i.imgur.com/KT9oMjh.png)

Now execute it and you have the waveform! Finally.

> This section was supposed to execute the firmware, but it looks like I am actually executing the testbench: I just borrowed their code to do the execution part for me. Yo, seriously, I am too busy. This place treats me like a slave. I don't even have the time to think.

> One of my classmates told me GTKWave supports `vcd` files. Great, thanks. What am I doing.

## Synthesis

In this section we use [Yosys](https://github.com/YosysHQ/yosys) to synthesize `picorv32.v` into a netlist. The steps are as follows:

1. Enter Yosys's interactive command shell.

```shell=
sean@sean:~/Documents/yosys$ ./yosys

 /----------------------------------------------------------------------------\
 |                                                                            |
 |  yosys -- Yosys Open SYnthesis Suite                                       |
 |                                                                            |
 |  Copyright (C) 2012 - 2020 Claire Xenia Wolf <claire@yosyshq.com>          |
 |                                                                            |
 |  Permission to use, copy, modify, and/or distribute this software for any  |
 |  purpose with or without fee is hereby granted, provided that the above    |
 |  copyright notice and this permission notice appear in all copies.         |
 |                                                                            |
 |  THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES  |
 |  WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF          |
 |  MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR   |
 |  ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES    |
 |  WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN     |
 |  ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF   |
 |  OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.            |
 |                                                                            |
 \----------------------------------------------------------------------------/

 Yosys 0.12+54 (git sha1 59a715034, clang 10.0.0-4ubuntu1 -fPIC -Os)

yosys>
```

2. Read the RISC-V core source (`picorv32.v`).

```shell=
yosys> read_verilog ../picorv32/picorv32.v

1. Executing Verilog-2005 frontend: ../picorv32/picorv32.v
Parsing Verilog input from `../picorv32/picorv32.v' to AST representation.
Generating RTLIL representation for module `\picorv32'.
Generating RTLIL representation for module `\picorv32_regs'.
Generating RTLIL representation for module `\picorv32_pcpi_mul'.
Generating RTLIL representation for module `\picorv32_pcpi_fast_mul'.
Generating RTLIL representation for module `\picorv32_pcpi_div'.
Generating RTLIL representation for module `\picorv32_axi'.
Generating RTLIL representation for module `\picorv32_axi_adapter'.
Generating RTLIL representation for module `\picorv32_wb'.
Successfully finished Verilog frontend.

yosys>
```

3. Change some parameters and re-derive the specified module (`picorv32_axi`):
   * COMPRESSED_ISA = 0 -> 1
   * ENABLE_MUL = 0 -> 1
   * ENABLE_DIV = 0 -> 1
   * ENABLE_IRQ = 0 -> 1
   * ENABLE_TRACE = 0 -> 1

   As the image below shows, the default value of all these parameters is 0.

![](https://i.imgur.com/z4LXfSc.png)

```shell=
yosys> chparam -set COMPRESSED_ISA 1 -set ENABLE_MUL 1 -set ENABLE_DIV 1 -set ENABLE_IRQ 1 -set ENABLE_TRACE 1 picorv32_axi
Parameter \COMPRESSED_ISA = 1
Parameter \ENABLE_MUL = 1
Parameter \ENABLE_DIV = 1
Parameter \ENABLE_IRQ = 1
Parameter \ENABLE_TRACE = 1

2. Executing AST frontend in derive mode using pre-parsed AST for module `\picorv32_axi'.
Parameter \COMPRESSED_ISA = 1
Parameter \ENABLE_MUL = 1
Parameter \ENABLE_DIV = 1
Parameter \ENABLE_IRQ = 1
Parameter \ENABLE_TRACE = 1
Generating RTLIL representation for module `$paramod$3b7338fad9e6014598aea2fe0db9d404d58a5994\picorv32_axi'.

yosys>
```

4. Build the design hierarchy with `picorv32_axi` as the top module. Modules outside this tree (unused modules) are removed.

```shell=
yosys> hierarchy -top picorv32_axi

3. Executing HIERARCHY pass (managing design hierarchy).

3.1. Analyzing design hierarchy..
Top module: \picorv32_axi
Used module: \picorv32_axi_adapter
Used module: \picorv32
Parameter \ENABLE_COUNTERS = 1'1
Parameter \ENABLE_COUNTERS64 = 1'1
Parameter \ENABLE_REGS_16_31 = 1'1
Parameter \ENABLE_REGS_DUALPORT = 1'1
Parameter \TWO_STAGE_SHIFT = 1'1
Parameter \BARREL_SHIFTER = 1'0
Parameter \TWO_CYCLE_COMPARE = 1'0
Parameter \TWO_CYCLE_ALU = 1'0
Parameter \COMPRESSED_ISA = 1'1
Parameter \CATCH_MISALIGN = 1'1
Parameter \CATCH_ILLINSN = 1'1
Parameter \ENABLE_PCPI = 1'0
Parameter \ENABLE_MUL = 1'1
Parameter \ENABLE_FAST_MUL = 1'0
Parameter \ENABLE_DIV = 1'1
Parameter \ENABLE_IRQ = 1'1
Parameter \ENABLE_IRQ_QREGS = 1'1
Parameter \ENABLE_IRQ_TIMER = 1'1
Parameter \ENABLE_TRACE = 1'1
Parameter \REGS_INIT_ZERO = 1'0
Parameter \MASKED_IRQ = 0
Parameter \LATCHED_IRQ = 32'11111111111111111111111111111111
Parameter \PROGADDR_RESET = 0
Parameter \PROGADDR_IRQ = 16
Parameter \STACKADDR = 32'11111111111111111111111111111111

3.2. Executing AST frontend in derive mode using pre-parsed AST for module `\picorv32'.
Parameter \ENABLE_COUNTERS = 1'1
Parameter \ENABLE_COUNTERS64 = 1'1
Parameter \ENABLE_REGS_16_31 = 1'1
Parameter \ENABLE_REGS_DUALPORT = 1'1
Parameter \TWO_STAGE_SHIFT = 1'1
Parameter \BARREL_SHIFTER = 1'0
Parameter \TWO_CYCLE_COMPARE = 1'0
Parameter \TWO_CYCLE_ALU = 1'0
Parameter \COMPRESSED_ISA = 1'1
Parameter \CATCH_MISALIGN = 1'1
Parameter \CATCH_ILLINSN = 1'1
Parameter \ENABLE_PCPI = 1'0
Parameter \ENABLE_MUL = 1'1
Parameter \ENABLE_FAST_MUL = 1'0
Parameter \ENABLE_DIV = 1'1
Parameter \ENABLE_IRQ = 1'1
Parameter \ENABLE_IRQ_QREGS = 1'1
Parameter \ENABLE_IRQ_TIMER = 1'1
Parameter \ENABLE_TRACE = 1'1
Parameter \REGS_INIT_ZERO = 1'0
Parameter \MASKED_IRQ = 0
Parameter \LATCHED_IRQ = 32'11111111111111111111111111111111
Parameter \PROGADDR_RESET = 0
Parameter \PROGADDR_IRQ = 16
Parameter \STACKADDR = 32'11111111111111111111111111111111
Generating RTLIL representation for module `$paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32'.

3.3. Analyzing design hierarchy..
Top module: \picorv32_axi
Used module: \picorv32_axi_adapter
Used module: $paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32
Used module: \picorv32_pcpi_div
Used module: \picorv32_pcpi_mul

3.4. Analyzing design hierarchy..
Top module: \picorv32_axi
Used module: \picorv32_axi_adapter
Used module: $paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32
Used module: \picorv32_pcpi_div
Used module: \picorv32_pcpi_mul
Removing unused module `\picorv32_wb'.
Removing unused module `\picorv32_pcpi_fast_mul'.
Removing unused module `\picorv32_regs'.
Removing unused module `\picorv32'.
Removed 4 unused modules.

yosys>
```

5. Convert processes into netlists and optimize: `proc` replaces the processes in the design with multiplexers, flip-flops, and latches, and `opt` then cleans up and optimizes the result.

```shell=
yosys> proc; opt

4. Executing PROC pass (convert processes to netlists).
.
.
.
5.29. Executing OPT_EXPR pass (perform const folding).
Optimizing module $paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32.
Optimizing module picorv32_axi.
Optimizing module picorv32_axi_adapter.
Optimizing module picorv32_pcpi_div.
Optimizing module picorv32_pcpi_mul.

5.30. Finished OPT passes. (There is nothing left to do.)

yosys>
```

6. Map to gate-level cells and optimize again: `techmap` implements a very simple technology mapper that replaces cells in the design with implementations given in the form of a Verilog or ilang source file.

```shell=
yosys> techmap; opt

6. Executing TECHMAP pass (map to technology primitives).
.
.
.
7.28. Executing OPT_CLEAN pass (remove unused cells and wires).
Finding unused cells or wires in module $paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32..
Finding unused cells or wires in module \picorv32_axi..
Finding unused cells or wires in module \picorv32_axi_adapter..
Finding unused cells or wires in module \picorv32_pcpi_div..
Finding unused cells or wires in module \picorv32_pcpi_mul..

7.29. Executing OPT_EXPR pass (perform const folding).
Optimizing module $paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32.
Optimizing module picorv32_axi.
Optimizing module picorv32_axi_adapter.
Optimizing module picorv32_pcpi_div.
Optimizing module picorv32_pcpi_mul.

7.30. Finished OPT passes. (There is nothing left to do.)

yosys>
```

7. We tried `show -format ps -viewer gv` to view the simple RTL netlist, but the netlist synthesized from PicoRV32 is too large to display.

8. Run `synth`, Yosys's default synthesis script, on the design.

```shell=
yosys> synth

8. Executing SYNTH pass.

8.1. Executing HIERARCHY pass (managing design hierarchy).

8.1.1. Analyzing design hierarchy..
Top module: \picorv32_axi
Used module: \picorv32_axi_adapter
Used module: $paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32
Used module: \picorv32_pcpi_mul
Used module: \picorv32_pcpi_div

8.1.2. Analyzing design hierarchy..
Top module: \picorv32_axi
Used module: \picorv32_axi_adapter
Used module: $paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32
Used module: \picorv32_pcpi_mul
Used module: \picorv32_pcpi_div
Removed 0 unused modules.
.
.
.
=== design hierarchy ===

   picorv32_axi 1
     $paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32 1
     picorv32_pcpi_div 1
     picorv32_pcpi_mul 1
     picorv32_axi_adapter 1

   Number of wires: 11865
   Number of wire bits: 16882
   Number of public wires: 319
   Number of public wire bits: 4417
   Number of memories: 0
   Number of memory bits: 0
   Number of processes: 0
   Number of cells: 14686
     $_ANDNOT_ 2981
     $_AND_ 727
     $_DFFE_PN_ 7
     $_DFFE_PP_ 1668
     $_DFF_P_ 177
     $_MUX_ 3697
     $_NAND_ 638
     $_NOR_ 423
     $_NOT_ 324
     $_ORNOT_ 415
     $_OR_ 2131
     $_SDFFCE_PN0P_ 41
     $_SDFFCE_PP0P_ 176
     $_SDFFCE_PP1P_ 5
     $_SDFFE_PN0N_ 1
     $_SDFFE_PN0P_ 195
     $_SDFFE_PN1N_ 4
     $_SDFFE_PN1P_ 32
     $_SDFFE_PP0P_ 3
     $_SDFFE_PP1P_ 3
     $_SDFF_PN0_ 110
     $_SDFF_PN1_ 1
     $_SDFF_PP0_ 5
     $_XNOR_ 208
     $_XOR_ 714

8.26. Executing CHECK pass (checking for obvious problems).
Checking module $paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32...
Checking module picorv32_axi...
Checking module picorv32_axi_adapter...
Checking module picorv32_pcpi_div...
Checking module picorv32_pcpi_mul...
Found and reported 0 problems.

yosys>
```

9. Write the design netlist to a new Verilog file, `synth.v`.

```shell=
yosys> write_verilog synth.v

9. Executing Verilog backend.
Dumping module `$paramod$f97d53d313a1d6e116f82bfcfe9e30d20d038d5e\picorv32'.
Dumping module `\picorv32_axi'.
Dumping module `\picorv32_axi_adapter'.
Dumping module `\picorv32_pcpi_div'.
Dumping module `\picorv32_pcpi_mul'.

yosys>
```
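The cell statistics printed by `synth` can be summarized a bit further. As a rough sketch (our own post-processing, not a Yosys feature), the Python snippet below parses the cell list copied from the log above and separates the flip-flop cells (every Yosys internal cell type whose name contains `DFF`) from the combinational gates. The `STAT` string and the helper names are ours.

```python
import re

# Cell counts copied from the `synth` statistics above.
STAT = """
$_ANDNOT_ 2981 $_AND_ 727 $_DFFE_PN_ 7 $_DFFE_PP_ 1668 $_DFF_P_ 177
$_MUX_ 3697 $_NAND_ 638 $_NOR_ 423 $_NOT_ 324 $_ORNOT_ 415 $_OR_ 2131
$_SDFFCE_PN0P_ 41 $_SDFFCE_PP0P_ 176 $_SDFFCE_PP1P_ 5 $_SDFFE_PN0N_ 1
$_SDFFE_PN0P_ 195 $_SDFFE_PN1N_ 4 $_SDFFE_PN1P_ 32 $_SDFFE_PP0P_ 3
$_SDFFE_PP1P_ 3 $_SDFF_PN0_ 110 $_SDFF_PN1_ 1 $_SDFF_PP0_ 5
$_XNOR_ 208 $_XOR_ 714
"""


def parse_cells(text: str) -> dict:
    """Return {cell_type: count} for every `$_NAME_ <count>` pair in text."""
    return {name: int(count) for name, count in re.findall(r"(\$_\w+_)\s+(\d+)", text)}


def summarize(cells: dict) -> dict:
    """Split the cell counts into flip-flops (types containing DFF) and the rest."""
    flops = sum(n for name, n in cells.items() if "DFF" in name)
    total = sum(cells.values())
    return {"total": total, "flip_flops": flops, "combinational": total - flops}


print(summarize(parse_cells(STAT)))
# prints {'total': 14686, 'flip_flops': 2428, 'combinational': 12258}
```

The total agrees with the `Number of cells: 14686` line in the log. Keep in mind that this count covers the whole `picorv32_axi` hierarchy, including the PCPI multiplier and divider enabled in step 3, not just the core.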